
When you pass an HTML document or string to the BeautifulSoup constructor, BeautifulSoup converts the page into a tree of Python objects. This section covers the four main kinds of objects:
Tag
NavigableString
BeautifulSoup
Comment
An HTML tag defines a piece of content in a page. A Tag object in BeautifulSoup corresponds to an HTML or XML tag in the original page or document.
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<b class="boldest">Howcodex</b>', 'lxml')
>>> tag = soup.html
>>> type(tag)
<class 'bs4.element.Tag'>
Tags have many attributes and methods. The two most important features of a tag are its name and its attributes.
Every tag has a name, which you can access through the .name suffix. tag.name returns the name of the tag.
>>> tag.name
'html'
If you change a tag's name, the change is reflected in any HTML markup that BeautifulSoup generates.
>>> tag.name = "Strong"
>>> tag
<Strong><body><b class="boldest">Howcodex</b></body></Strong>
>>> tag.name
'Strong'
A tag can have any number of attributes. The tag <b class="boldest"> has an attribute "class" whose value is "boldest". You can read an attribute by treating the tag like a dictionary (for example, tag['class']), or get the whole attribute dictionary through .attrs.
>>> tutorialsP = BeautifulSoup("<div class='tutorialsP'></div>",'lxml')
>>> tag2 = tutorialsP.div
>>> tag2['class']
['tutorialsP']
You can add, remove, and modify a tag's attributes.
>>> tag2['class'] = 'Online-Learning'
>>> tag2['style'] = '2007'
>>> tag2
<div class="Online-Learning" style="2007"></div>
>>> del tag2['style']
>>> tag2
<div class="Online-Learning"></div>
>>> b_tag = soup.b
>>> del b_tag['class']
>>> b_tag
<b>Howcodex</b>
>>> tag2['class']
'Online-Learning'
>>> tag2['style']
Traceback (most recent call last):
  ...
KeyError: 'style'
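The name and attribute operations above can be collected into one small script. This is a minimal sketch, not the only way to do it: it assumes the built-in "html.parser" (so nothing beyond bs4 has to be installed), and the id and lang attributes are made up for illustration.

```python
from bs4 import BeautifulSoup

# Assumption: "html.parser" is used here because it ships with Python.
# The id and lang attributes are illustrative, not from the text above.
soup = BeautifulSoup('<b class="boldest" id="title">Howcodex</b>', "html.parser")
tag = soup.b

original_name = tag.name      # the tag's current name
tag.name = "strong"           # renaming changes the generated markup

tag["lang"] = "en"            # add a new attribute
tag["id"] = "headline"        # modify an existing attribute
del tag["class"]              # remove an attribute

# .get() returns None for a missing attribute instead of raising KeyError.
missing = tag.get("class")

attrs_snapshot = dict(tag.attrs)
markup = str(tag)
print(markup)
```

Using tag.get('class') rather than tag['class'] is a reasonable choice whenever the attribute may be absent, since it avoids the KeyError shown above.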
Some HTML attributes can have multiple values. The most common is class, since a tag can carry more than one CSS class. Others include rel, rev, headers, accesskey and accept-charset. BeautifulSoup presents the value of a multi-valued attribute as a list.
>>> from bs4 import BeautifulSoup
>>>
>>> css_soup = BeautifulSoup('<p class="body"></p>')
>>> css_soup.p['class']
['body']
>>>
>>> css_soup = BeautifulSoup('<p class="body bold"></p>')
>>> css_soup.p['class']
['body', 'bold']
However, if an attribute looks like it has more than one value but is not defined as multi-valued by any version of the HTML standard, BeautifulSoup leaves the attribute alone:
>>> id_soup = BeautifulSoup('<p id="body bold"></p>')
>>> id_soup.p['id']
'body bold'
>>> type(id_soup.p['id'])
<class 'str'>
When you turn a tag back into a string, multiple attribute values are consolidated into a single value.
>>> rel_soup = BeautifulSoup("<p> howcodex Main <a rel='Index'> Page</a></p>")
>>> rel_soup.a['rel']
['Index']
>>> rel_soup.a['rel'] = ['Index', 'Online Library, Its all Free']
>>> print(rel_soup.p)
<p> howcodex Main <a rel="Index Online Library, Its all Free"> Page</a></p>
get_attribute_list always returns a list, whether or not the attribute is multi-valued.
>>> id_soup.p.get_attribute_list('id')
['body bold']
However, if you parse the document with the 'xml' parser, there are no multi-valued attributes:
>>> xml_soup = BeautifulSoup('<p class="body bold"></p>', 'xml')
>>> xml_soup.p['class']
'body bold'
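The behaviours described above can be compared side by side. The sketch below assumes the built-in "html.parser" and a made-up id value; the multi_valued_attributes=None constructor argument is a Beautiful Soup option (available in 4.8.0 and later, as far as I know) that switches the list splitting off without changing parsers.

```python
from bs4 import BeautifulSoup

# Illustrative markup: class is multi-valued in HTML, id is not.
markup = '<p class="body bold" id="main intro"></p>'

# Default HTML parsing: class is split into a list, id stays one string.
p = BeautifulSoup(markup, "html.parser").p
class_value = p["class"]
id_value = p["id"]

# get_attribute_list() wraps single values so the result is always a list.
id_as_list = p.get_attribute_list("id")
class_as_list = p.get_attribute_list("class")

# Assumption: bs4 >= 4.8.0, where multi_valued_attributes=None disables splitting.
plain = BeautifulSoup(markup, "html.parser", multi_valued_attributes=None).p
class_unsplit = plain["class"]

print(class_value, id_value, id_as_list, class_unsplit)
```

If your code has to handle arbitrary attributes, calling get_attribute_list everywhere is one way to avoid branching on whether a value is a string or a list.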
A NavigableString object represents the text inside a tag. To access it, use .string on the tag.
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup("<h2 id='message'>Hello, Howcodex!</h2>")
>>>
>>> soup.string
'Hello, Howcodex!'
>>> type(soup.string)
<class 'bs4.element.NavigableString'>
You can't edit a string in place, but you can replace it with another string using replace_with().
>>> soup = BeautifulSoup("<h2 id='message'>Hello, Howcodex!</h2>")
>>> soup.string.replace_with("Online Learning!")
'Hello, Howcodex!'
>>> soup.string
'Online Learning!'
>>> soup
<html><body><h2 id="message">Online Learning!</h2></body></html>
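As a rough illustration of how a NavigableString relates to an ordinary Python string, the sketch below (assuming the built-in "html.parser") checks its type, makes a plain str copy, and then replaces it in the tree.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Assumption: "html.parser" is used so no external parser is required.
soup = BeautifulSoup("<h2 id='message'>Hello, Howcodex!</h2>", "html.parser")
text_node = soup.h2.string

# NavigableString subclasses str, so normal string operations work on it.
is_navigable = isinstance(text_node, NavigableString)
is_str = isinstance(text_node, str)

# str() gives a plain copy that carries no reference to the parse tree.
plain_text = str(text_node)

# replace_with() swaps the node in the tree and returns the node it removed.
old = text_node.replace_with("Online Learning!")
new_markup = str(soup.h2)

print(plain_text, "->", new_markup)
```

Converting to a plain str is worth doing when you keep the text around after you are done with the document, because a NavigableString keeps a reference to the whole tree.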
The BeautifulSoup object is created when you parse a web resource, and it represents the complete document. Most of the time, you can treat it as a Tag object.
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup("<h2 id='message'>Hello, Howcodex!</h2>")
>>> type(soup)
<class 'bs4.BeautifulSoup'>
>>> soup.name
'[document]'
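To make "treated as a Tag object" concrete, here is a small sketch, again assuming the built-in "html.parser" and an extra illustrative <p> element, that calls ordinary Tag methods directly on the BeautifulSoup object.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Assumption: "html.parser"; the <p> element is added only for illustration.
soup = BeautifulSoup(
    "<h2 id='message'>Hello, Howcodex!</h2><p>More</p>", "html.parser"
)

# BeautifulSoup is a subclass of Tag, so searching and text methods work on it.
is_tag_subclass = isinstance(soup, Tag)
headings = soup.find_all("h2")
all_text = soup.get_text()

# It does not correspond to a real tag: its name is the special '[document]'
# and it carries no attributes.
doc_name = soup.name
has_attrs = bool(soup.attrs)

print(doc_name, [h.get_text() for h in headings], all_text)
```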
The Comment object represents a comment in the document. It is just a special type of NavigableString.
>>> soup = BeautifulSoup('<p><!-- Everything inside it is COMMENTS --></p>')
>>> comment = soup.p.string
>>> type(comment)
<class 'bs4.element.Comment'>
>>> print(soup.p.prettify())
<p>
 <!-- Everything inside it is COMMENTS -->
</p>
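Because a Comment is a NavigableString subclass, it can be found with the usual search methods. The following sketch assumes the built-in "html.parser", made-up markup, and a bs4 version recent enough to accept the string= argument; it collects every comment in a document and then strips them out.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# Illustrative markup with one visible string and one comment.
soup = BeautifulSoup("<p>Visible text<!-- hidden note --></p>", "html.parser")

# The string= filter is called with each text node; keep only Comment nodes.
comments = soup.find_all(string=lambda node: isinstance(node, Comment))
comment_texts = [str(c).strip() for c in comments]

is_string_subclass = issubclass(Comment, NavigableString)

# extract() removes each comment from the tree.
for c in comments:
    c.extract()
cleaned = str(soup.p)

print(comment_texts, cleaned)
```

Checking isinstance(node, Comment) is also the way to tell comments apart when .string returns one, since a Comment otherwise behaves like any other string.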
NavigableString objects represent the text within tags, rather than the tags themselves.