Jun 9, 2012

Parsing HTML with lxml and Google App Engine: Python

This is a continuation of my Python 2.7 and Google App Engine series. If you are just starting out I suggest you read Getting Started and First App first. If you want to parse XML files instead, please see my post 'Parsing XML with Google App Engine: Python'.

We are going to assume you will be using Eclipse and a fresh project. In this example we are going to parse the HTML page of Triple J Unearthed's Top 100 chart.

Adding lxml to Google App Engine

The first thing we need to do is add the lxml library to our app.yaml configuration file. In your Eclipse project add a new file called app.yaml and add the following:

application: almightynassar
version: 1
runtime: python27
api_version: 1
threadsafe: true

handlers:
- url: /.*
  script: triplej.app

libraries:
- name: lxml
  version: latest

Most of these fields were covered in Getting Started, but we now have a new field: libraries. This is where we declare any third-party libraries, like lxml, that are bundled with the App Engine runtime but not enabled by default.
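As a quick sanity check (this snippet is my own, not part of the tutorial code), you can confirm that the declared library actually loads and see which version the runtime supplied:

```python
# Confirm the lxml library declared in app.yaml is importable,
# and report which version was loaded.
from lxml import etree

print(etree.LXML_VERSION)    # version tuple, e.g. (2, 3, 0, 0)
print(etree.__version__)     # lxml release string
```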

Using lxml

Create a new file called triplej.py and add the following code:

# The webapp2 framework
import webapp2

# lxml parser for XML and HTML
from lxml import etree

# The URL Fetch library
from google.appengine.api import urlfetch

# Fetches an HTML document and parses it
class MainPage(webapp2.RequestHandler):
    # Respond to an HTTP GET request
    def get(self):
        # Grab the HTML
        url = urlfetch.fetch('http://www.triplejunearthed.com/Charts/')

        # Parse the HTML
        tree = etree.HTML(url.content)

        # Convert the DOM back into a string
        result = etree.tostring(tree, pretty_print=True, method="html")

        # Output the result onto the screen
        self.response.out.write(result)

# Create our application instance that maps the root to our
# MainPage handler
app = webapp2.WSGIApplication([('/.*', MainPage)], debug=True)

If you run this code you will notice that all it does is download the HTML page, parse it, and then output the page exactly as it was downloaded (minus all the images and CSS styling). Nothing impressive, but it proves the concept works. Now on to something a little more beefy...
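The round-trip at the heart of that handler is easy to try outside App Engine. This standalone sketch (my own example, no urlfetch or network required) parses a fragment and serializes it back, showing how the HTML parser repairs broken markup along the way:

```python
from lxml import etree

# A deliberately sloppy fragment: the HTML parser will repair it,
# wrapping it in <html> and <body> and closing the open <p> tag.
snippet = "<p>Hello <b>world</b>"

tree = etree.HTML(snippet)
result = etree.tostring(tree, pretty_print=True, method="html")
print(result)
```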

Parsing, Extracting and Cleaning the HTML

In this example we will perform multiple functions that will only extract the chart from the Triple J Unearthed website. Replace the triplej.py code with the following:

# The webapp2 framework
import webapp2

# lxml parser for XML and HTML
from lxml import html

# HTML cleaner
from lxml.html.clean import Cleaner

# The URL Fetch library
from google.appengine.api import urlfetch

# Fetches an HTML document and parses it
class MainPage(webapp2.RequestHandler):
    # Respond to an HTTP GET request
    def get(self):
        # Grabs the HTML
        url = 'http://www.triplejunearthed.com/Charts/'
        website = urlfetch.fetch(url)
       
        # Saves our content as a string
        page = str(website.content)

        # Parses the HTML
        tree = html.fromstring(page)

        # The ID string of the table element we want
        # NOTE: This is bound to change!!! Double-check the HTML source first!!!
        elementID = "ctl00_ctl00_ctl00_ctl00_MainBody_ContentPlaceHolder1_ContentPlaceHolder1_ContentPlaceHolder1_GridView1"
       
        # Grab the chart element
        #
        # style: removes styling
        # links: removes links
        # add_nofollow: adds rel="nofollow" to anchor tags
        # page_structure: removes <html>, <head>, and <title> tags
        # safe_attrs_only: only allows safe element attributes
        # javascript: removes embedded javascript
        # scripts: remove script tags
        # kill_tags: remove the element and content
        # remove_tags: remove only the element, but not the content
        #
        # There are more available. See the API reference for lxml
        cleaner = Cleaner(style=True, links=True, add_nofollow=True,
                          page_structure=True, safe_attrs_only=True,
                          javascript=True, scripts=True,
                          kill_tags=set(['img', 'th']),
                          remove_tags=['div'])

       
        # Grab only our chart (but scrub it clean first!)
        chart = cleaner.clean_html(tree.get_element_by_id(elementID))
       
        # Change all relative links into absolute links based on the url
        chart.make_links_absolute(url)
       
        # Converts the DOM element into a string
        result = html.tostring(chart)
       
        # Output the results onto the screen
        self.response.out.write(result)        
       
# Create our application instance that maps the root to our
# MainPage handler
app = webapp2.WSGIApplication([('/.*', MainPage)], debug=True)

Running this code should result in a sanitized version of the Triple J Top 100 chart!
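The two element-level calls doing the heavy lifting, get_element_by_id() and make_links_absolute(), can also be tried on a toy document (again my own snippet; the id and link are made up stand-ins for the real chart markup):

```python
from lxml import html

# A toy page standing in for the chart markup (hypothetical id and href)
page = ('<html><body>'
        '<div id="chart"><a href="/artists/foo">Foo</a></div>'
        '</body></html>')

tree = html.fromstring(page)

# Pull out just the element we care about
chart = tree.get_element_by_id("chart")

# Rewrite relative hrefs against the page the chart was fetched from
chart.make_links_absolute("http://www.triplejunearthed.com/Charts/")

print(html.tostring(chart))
```

Note how the relative href "/artists/foo" is resolved against the base URL's host, not appended to the /Charts/ path.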
