How Xpath Plays Vital Role In Web Scraping

Sandra Moraes

Posted on Oct 18, 2019

The skills I demonstrate here can be learned in the Data Science with Machine Learning bootcamp at NYC Data Science Academy.

XPath is a language for finding information in structured documents like XML or HTML. You can say that XPath is (sort of) SQL for XML or HTML files. XPath is used to navigate through elements and attributes in an XML or HTML document.

To understand XPath, we must be clear about elements and nodes, which are the building blocks of XML and HTML. Let’s talk about them. Here is an example element in an HTML document:

<a class="hyperlink" href="http://www.google.com">google</a>

Copy the above text to a file, name it sample.html, and open it in a browser. It renders as a text link displaying the word “google”, and clicking it takes you to www.google.com. Each element has three main parts: the type, the attributes, and the text. They are listed below:

Type         a
Attributes   class, href
Text         google
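
As a rough illustration, the same three parts can be read programmatically. The sketch below uses Python's standard-library xml.etree.ElementTree, which is my choice for the illustration and not something the example above depends on. ElementTree is an XML parser, so it assumes the attribute values are quoted with straight quotes.

```python
import xml.etree.ElementTree as ET

# The sample element, written with quoted attribute values so that it
# parses as well-formed XML (ElementTree is not an HTML parser).
SAMPLE = '<a class="hyperlink" href="http://www.google.com">google</a>'

link = ET.fromstring(SAMPLE)

element_type = link.tag          # the element type
attributes = dict(link.attrib)   # the attributes, copied into a plain dict
text = link.text                 # the text inside the element

print("Type:      ", element_type)
print("Attributes:", attributes)
print("Text:      ", text)
```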

Let’s grab some XPath developer tools. I am using Firebug for Firefox, but you can use Chrome’s developer tools instead. We will now form some XPath expressions to extract data from the above element, and verify each one in the Firebug console.

For extracting the text “google”:

//a[@href]/text()
//a[@class="hyperlink"]/text()

For extracting the hyperlink, i.e. “http://www.google.com”:

//a/@href
//a[@class="hyperlink"]/@href
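
If you would rather try these lookups outside the browser, here is a minimal sketch using Python's standard-library xml.etree.ElementTree. Two assumptions to flag: the html and body wrapper tags are hypothetical, added only so there is a document to search, and ElementTree implements just a subset of XPath. It cannot end a path with text() or @href, so the sketch selects the a element with the predicate and then reads the text or attribute from the element itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper page around the sample link; <html> and <body>
# are added here only so that there is a document to search.
PAGE = """<html>
  <body>
    <a class="hyperlink" href="http://www.google.com">google</a>
  </body>
</html>"""

root = ET.fromstring(PAGE)

# Roughly //a[@href]/text(): find <a> elements that have an href, read the text.
texts_by_href = [a.text for a in root.findall(".//a[@href]")]

# Roughly //a[@class="hyperlink"]/text()
texts_by_class = [a.text for a in root.findall(".//a[@class='hyperlink']")]

# Roughly //a/@href: find every <a>, then read its href attribute.
hrefs = [a.get("href") for a in root.findall(".//a")]

# Roughly //a[@class="hyperlink"]/@href
hrefs_by_class = [a.get("href") for a in root.findall(".//a[@class='hyperlink']")]

print(texts_by_href, texts_by_class, hrefs, hrefs_by_class)
```

A full XPath engine (the browser console, for instance) accepts the original expressions unchanged; the split into "select, then read" is only a workaround for the standard library's limited support.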

That’s all with a single element but in reality, you need to deal with more complex forms.

Let’s proceed to the idea of nodes and the familial relationships between HTML elements. Look at this example code:

<div title="Section1">
  <table id="Search">
    <tr class="Yahoo">Yahoo Search</tr>
    <tr class="Google">Google Search</tr>
  </table>
</div>

Notice the </div> at the bottom? That means the table and tr elements are contained within the div. These other elements are considered descendants of the div. The table is a child, and each tr is a grandchild (and so on and so forth). The two tr elements are siblings of each other. This is vital, as XPath uses these relationships to find your element.
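
To make the family tree concrete, here is a small sketch that walks the same snippet with Python's standard-library xml.etree.ElementTree (again my choice for illustration). It assumes the snippet is written with straight quotes so that it parses as XML, and the variable names are made up for the example.

```python
import xml.etree.ElementTree as ET

SNIPPET = """<div title="Section1">
  <table id="Search">
    <tr class="Yahoo">Yahoo Search</tr>
    <tr class="Google">Google Search</tr>
  </table>
</div>"""

div = ET.fromstring(SNIPPET)

# Direct children of the div: only the table.
children = [child.tag for child in div]

# All descendants of the div: the table and both tr elements.
descendants = [el.tag for el in div.iter() if el is not div]

# The two tr elements share the same parent (the table), so they are siblings.
table = div.find("table")
siblings = [tr.get("class") for tr in table]

print("children:   ", children)
print("descendants:", descendants)
print("siblings:   ", siblings)
```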

So suppose you want to find the Google item. Any of the following expressions will work:

//tr[@class="Google"]
//div/table/tr[2]
//div[@title="Section1"]//tr[text()="Google Search"]

So let’s analyze the expressions. We start at the top of the document. The // means to search all descendants, while / means to look only at the current element’s children. So //div means look through all descendants for a div element. The brackets [] hold a predicate that specifies something about that element: we can test an attribute with the @ symbol, or test the text with the text() function. We can chain as many of these together as we like.
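
The three expressions can be checked against each other with a short script. This is a sketch under a few assumptions: it uses Python's standard-library xml.etree.ElementTree, it wraps the snippet in a hypothetical body element so that the div is something to search for, and it adapts the syntax to ElementTree's XPath subset (paths start with a dot, and the text test is written [.='Google Search'] because text() is not supported there; that form needs Python 3.7 or later).

```python
import xml.etree.ElementTree as ET

# The example snippet inside a hypothetical <body> wrapper.
PAGE = """<body>
  <div title="Section1">
    <table id="Search">
      <tr class="Yahoo">Yahoo Search</tr>
      <tr class="Google">Google Search</tr>
    </table>
  </div>
</body>"""

root = ET.fromstring(PAGE)

# ElementTree-flavoured versions of the three expressions.
EXPRESSIONS = [
    ".//tr[@class='Google']",                            # match by attribute
    ".//div/table/tr[2]",                                # match by position
    ".//div[@title='Section1']//tr[.='Google Search']",  # match by text
]

results = {}
for expr in EXPRESSIONS:
    matches = root.findall(expr)
    results[expr] = [tr.text for tr in matches]
    print(expr, "->", results[expr])
```

Each expression should land on the same single tr element, which is the point: there is usually more than one path to a node, and you can pick whichever is least likely to break when the page changes.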

Here is a quick reference:

   //       Search all descendant elements
   /        Search all child elements
   []       The predicate (specifies something about the element you are looking for)
   @        Specifies an element attribute (for example, @title)
   .        Specifies the current node (useful when you want to look for an element’s children in the predicate)
   ..       Specifies the parent node
   text()   Gets the text of the element
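
A few entries from this reference, tried on the earlier div/table example. This is a sketch only, using Python's standard-library xml.etree.ElementTree and its XPath subset; the body wrapper and the variable names are assumptions made for the illustration.

```python
import xml.etree.ElementTree as ET

# The example snippet inside a hypothetical <body> wrapper.
PAGE = """<body>
  <div title="Section1">
    <table id="Search">
      <tr class="Yahoo">Yahoo Search</tr>
      <tr class="Google">Google Search</tr>
    </table>
  </div>
</body>"""

root = ET.fromstring(PAGE)

# "/" looks only at children: tr is not a direct child of body, so no match.
child_rows = root.findall("./tr")

# "//" looks at all descendants: both tr elements are found.
all_rows = root.findall(".//tr")

# ".." steps up to the parent of the matched node (here, the table).
parents = root.findall(".//tr[@class='Google']/..")
parent_ids = [p.get("id") for p in parents]

# A predicate that looks at a child: tables containing a tr whose text is
# "Google Search". In full XPath this could be //table[./tr="Google Search"].
tables = root.findall(".//table[tr='Google Search']")
table_ids = [t.get("id") for t in tables]

print(len(child_rows), len(all_rows), parent_ids, table_ids)
```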

In the context of web scraping, XPath is a nice tool to have in your belt, as it allows you to write specifications of document locations more flexibly than CSS selectors.


The next part of this article will be published soon.
