How Xpath Plays Vital Role In Web Scraping

Avatar
Posted on Oct 18, 2019

 

 

XPath is a language for finding information in structured documents like XML or HTML. You can say that XPath is (sort of) SQL for XML or HTML files. XPath is used to navigate through elements and attributes in an XML or HTML document.

To understand XPath we must be clear about elements and nodes which are the building blocks of XML and HTML. Let’s talk about them. Here is an example element in an HTML document:

   <a class=”hyperlink” href=http://www.google.com>google</a>

Copy the above text to a file, name it as sample.html and open it in a browser. This will end up as a text link displaying the words “google” and it will take you to www.google.com. For each element there are three main parts: The typethe attributes, andthe text. They are listed below:

 a                                 Type
class,  href                Attributes
google                       Text 


Let’s grab some XPath developer tools. I am on Firebug for Firefox or you can use Chrome’s developer tools. We will now form some XPath expressions to extract data from the above element. We will also verify the XPath by using Firebug Console.


For extracting the text “google”:

   //a[@href]/text()    

   //a[@class=”hyperlink”]/text()
  

For extracting the hyperlink i.e. ”www.google.com” :

   //a/@href
//a[@class=”hyperlink”]/@href

That’s all with a single element but in reality, you need to deal with more complex forms.

Let’s proceed to the idea of nodes, and its familial relationship of HTML elements. Look at this example code:

 

 

<div title=”Section1″>

   <table id=”Search”>

       <tr class=”Yahoo”>Yahoo Search</tr>

       <tr class=”Google”>Google Search</tr>

   </table>

</div>

 

Notice the </div> at the bottom? That means the table and tr elements are contained within the div. These other elements are considered descendants of the div. The table is a child, and the tr is a grandchild (and so on and so forth). The two tr elements are considered siblings each other. This is vital, as XPath uses these relationships to find your element.


So suppose you want to find the Google item. Any of the following expressions will work:

   //tr[@class=’Google’]
   //div/table/tr[2]
  //div[@title=”Section1″]//tr[text()=”Google Search”]

So let’s analyze the expressions. We start at the top element (also known as a node). The // means to search all descendants, / means to just look at the current element’s children. So //div means look through all descendants for a div element. The brackets [] specify something about that element. So we can look for an attribute with the @ symbol, or look for text with the text() function. We can chain as many of these together as we can.

Here is a quick reference:

   //             Search all descendant elements
   /              Search all child elements
   []             The predicate (specifies something about the element you are looking for)
   @           Specifies an element attribute. (For example, @title)
    
   .               Specifies the current node (useful when you want to look for an element’s children in the predicate)
   ..              Specifies the parent node
  text()       Gets the text of the element.
    
In the context of web scraping, XPath is a nice tool to have in your belt, as it allows you to write specifications of document locations more flexibly than CSS selectors.

 

Read the original article here:

How Xpath Plays Vital Role In Web Scraping

The next session of this article wil be published soon.

About Author

Related Articles

Leave a Comment

No comments found.

View Posts by Categories


Our Recent Popular Posts


View Posts by Tags

#python #trainwithnycdsa 2019 airbnb Alex Baransky alumni Alumni Interview Alumni Reviews Alumni Spotlight alumni story Alumnus API Application artist aws beautiful soup Best Bootcamp Best Data Science 2019 Best Data Science Bootcamp Best Data Science Bootcamp 2020 Best Ranked Big Data Book Launch Book-Signing bootcamp Bootcamp Alumni Bootcamp Prep Bundles California Cancer Research capstone Career Career Day citibike clustering Coding Course Demo Course Report D3.js data Data Analyst data science Data Science Academy Data Science Bootcamp Data science jobs Data Science Reviews Data Scientist Data Scientist Jobs data visualization Deep Learning Demo Day Discount dplyr employer networking feature engineering Finance Financial Data Science Flask gbm Get Hired ggplot2 googleVis Hadoop higgs boson Hiring hiring partner events Hiring Partners Industry Experts Instructor Blog Instructor Interview Job Job Placement Jobs Jon Krohn JP Morgan Chase Kaggle Kickstarter lasso regression Lead Data Scienctist Lead Data Scientist leaflet linear regression Logistic Regression machine learning Maps matplotlib Medical Research Meet the team meetup Networking neural network Neural networks New Courses nlp NYC NYC Data Science nyc data science academy NYC Open Data NYCDSA NYCDSA Alumni Online Online Bootcamp Online Training Open Data painter pandas Part-time Portfolio Development prediction Prework Programming PwC python python machine learning python scrapy python web scraping python webscraping Python Workshop R R language R Programming R Shiny r studio R Visualization R Workshop R-bloggers random forest Ranking recommendation recommendation system regression Remote remote data science bootcamp Scrapy scrapy visualization seaborn Selenium sentiment analysis Shiny Shiny Dashboard Spark Special Special Summer Sports statistics streaming Student Interview Student Showcase SVM Switchup Tableau team TensorFlow Testimonial tf-idf Top Data Science Bootcamp twitter visualization web scraping Weekend Course What to expect word cloud word2vec XGBoost yelp