How XPath Plays a Vital Role in Web Scraping
XPath is a query language for finding information in structured documents such as XML or HTML. You can think of it as (sort of) SQL for XML or HTML files: it is used to navigate through the elements and attributes of a document.
To understand XPath, we must be clear about elements and nodes, which are the building blocks of XML and HTML. Let's talk about them. Here is an example element in an HTML document:
    <a class="hyperlink" href="http://www.google.com">google</a>
Copy the above text to a file, name it sample.html, and open it in a browser. It will render as a text link displaying the word "google", and clicking it will take you to www.google.com. Each element has three main parts: the type, the attributes, and the text. They are listed below:
    a              Type
    class, href    Attributes
    google         Text
Let's grab some XPath developer tools. I am using Firebug for Firefox, but you can also use Chrome's developer tools. We will now form some XPath expressions to extract data from the above element, and verify each one in the Firebug console.
To extract the text "google":
    //a[@href]/text()
    //a[@class="hyperlink"]/text()
To extract the hyperlink, i.e. "http://www.google.com":
    //a/@href
    //a[@class="hyperlink"]/@href
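The same extraction can also be tried outside the browser. The following is a minimal sketch using Python's standard-library xml.etree.ElementTree, which supports only a subset of XPath: it can locate the element with a predicate, but it has no text() or /@href steps, so the text and the attribute are read from the matched element instead. The <html><body> wrapper and the variable names are assumptions added for illustration. A full XPath 1.0 engine such as lxml should accept the expressions above as written.

```python
import xml.etree.ElementTree as ET

# The sample element, wrapped in a minimal document (the wrapper is an assumption).
html = (
    '<html><body>'
    '<a class="hyperlink" href="http://www.google.com">google</a>'
    '</body></html>'
)
root = ET.fromstring(html)

# Rough equivalent of //a[@class="hyperlink"]. ElementTree paths are
# relative to the element they are called on, hence the leading ".".
link = root.find(".//a[@class='hyperlink']")

link_text = link.text         # what /text() would select
link_href = link.get("href")  # what /@href would select

print(link_text)
print(link_href)
```

Because this parser expects well-formed XML, the sketch only works on markup that is properly closed and quoted. Real-world HTML usually needs a more forgiving parser.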
That's all for a single element, but in reality you will need to deal with more complex documents.
Let's proceed to the idea of nodes and the family relationships between HTML elements. Look at this example code:
<div title="Section1">
    <table id="Search">
        <tr class="Yahoo">Yahoo Search</tr>
        <tr class="Google">Google Search</tr>
    </table>
</div>
Notice the </div> at the bottom? That means the table and tr elements are contained within the div. These other elements are considered descendants of the div. The table is a child, and each tr is a grandchild (and so on and so forth). The two tr elements are considered siblings of each other. This is vital, as XPath uses these relationships to find your element.
So suppose you want to find the Google item. Any of the following expressions will work:
    //tr[@class="Google"]
    //div/table/tr[2]
    //div[@title="Section1"]//tr[text()="Google Search"]
So let's analyze the expressions. We start at the top element (also known as a node). The // means to search all descendants, while / means to look only at the current element's children. So //div means look through all descendants for a div element. The brackets [] specify something about that element: we can look for an attribute with the @ symbol, or match on text with the text() function. We can chain as many of these together as we like.
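To check that the three expressions really land on the same element, here is a small sketch, again using Python's standard-library xml.etree.ElementTree. Some details are assumptions for illustration: the snippet is parsed as XML rather than as browser HTML, it is wrapped in a <body> element so that the div is a descendant of the root, and the text predicate is written as [.='Google Search'] because ElementTree's XPath subset has no text() function.

```python
import xml.etree.ElementTree as ET

# The example markup, wrapped in <body> (the wrapper is an assumption).
snippet = """
<body>
<div title="Section1">
    <table id="Search">
        <tr class="Yahoo">Yahoo Search</tr>
        <tr class="Google">Google Search</tr>
    </table>
</div>
</body>
""".strip()
root = ET.fromstring(snippet)

# //tr[@class="Google"]: any descendant tr with a matching attribute
by_class = root.find(".//tr[@class='Google']")

# //div/table/tr[2]: the second tr child of a table that is a child of a div
by_position = root.find(".//div/table/tr[2]")

# //div[@title="Section1"]//tr[text()="Google Search"]
# ElementTree spells the text test as [.='...'] (needs Python 3.7 or later).
by_text = root.find(".//div[@title='Section1']//tr[.='Google Search']")

for match in (by_class, by_position, by_text):
    print(match.tag, match.get("class"), match.text)
```

All three lookups return the same tr element, which shows that the choice between them is mostly about which property of the page is least likely to change.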
Here is a quick reference:
    //        Search all descendant elements
    /         Search all child elements
    []        The predicate (specifies something about the element you are looking for)
    @         Specifies an element attribute (for example, @title)
    .         Specifies the current node (useful when you want to look for an element's children in the predicate)
    ..        Specifies the parent node
    text()    Gets the text of the element
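The parent step (..) is the least obvious entry in this list, so here is a short sketch of it, assuming the div/table example from earlier wrapped in a <body> element and parsed with Python's standard-library xml.etree.ElementTree. The variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# Same example markup as before; the <body> wrapper is an assumption.
root = ET.fromstring(
    '<body><div title="Section1"><table id="Search">'
    '<tr class="Yahoo">Yahoo Search</tr>'
    '<tr class="Google">Google Search</tr>'
    '</table></div></body>'
)

# ".." steps from the matched tr up to its parent, the table.
parent = root.find(".//tr[@class='Google']/..")

# "." is the current node and "/" looks only at its direct children,
# so "./tr" lists both sibling rows of that table.
siblings = [tr.get("class") for tr in parent.findall("./tr")]

print(parent.tag, parent.get("id"))
print(siblings)
```

Going up to a parent and back down is a common way to reach a sibling of an element that is easy to identify.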
In the context of web scraping, XPath is a nice tool to have in your belt, as it allows you to specify locations in a document more flexibly than CSS selectors.
The next installment of this article will be published soon.