This is the first part of a series detailing my experiences writing a basic html screen scraper using a combination of bash, python and structured grep to retrieve, massage and present data. The results can be seen at onesecondshy.com. Code will eventually be published there as well.

First off, motivation. The motivation for this little project was to mine the data from gastips.com and present it more directly (and with fewer advertisements). GasTips.com is a grassroots site allowing individuals to submit gas prices on a community-by-community basis. Data is presented in well-formed tables accompanied by a plethora of Google advertisements.

In an ideal world, we would have APIs that provide data in easily consumable formats. However, as long as there is value attached to data, there’ll be a desire to keep it private. The following is not meant as a step-by-step guide to scraping data and graphing it in python. Instead, my goal is simply to explain one approach to solving what is a common problem.

Step 1: Data Retrieval

Nothing too fancy here. A simple call to wget to download raw html.

Part 2 of this series will move beyond this rather simplistic approach and cover data retrieval and aggregation using a web spider.
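For readers who would rather stay in python for the retrieval step, here is a minimal sketch of the same idea. It is a stand-in for the wget call, not the command I actually ran, and the function name download_html and its parameters are made up for illustration.

```python
import urllib.request
from pathlib import Path


def download_html(url: str, dest: Path, timeout: float = 30.0) -> Path:
    """Fetch `url` and write the raw response bytes to `dest`.

    Hypothetical helper: roughly what `wget -O dest url` does,
    minus retries, redirect limits and progress output.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        dest.write_bytes(response.read())
    return dest
```

Calling something like download_html(page_url, Path("input.html")) would leave the raw html on disk for the next step, where page_url is whatever page you want to scrape.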

Step 2: Massaging

Once you’ve got raw html, the next logical step is to reduce it to a format amenable to manipulation… namely csv. Sgrep (or Structured Grep), a tool for searching and indexing html (amongst other things), easily gets the job done.

 sgrep -g html 'stag("TABLE") containing (attribute("SUMMARY") containing "XXX") .. etag("TABLE")' input.html
<html><body><table summary="XXX"><tr><td>ABC</td></tr></table></body></html>

Running the above sgrep command on the html snippet will yield the following result:

<table summary="XXX"><tr><td>ABC</td></tr></table>

You could easily take it further and retrieve only the contents of a particular table column or row. When parsing the gastips.com data, I used sgrep to get the data into a row-delimited form before running a simple python script that converted it to csv. The row-delimited form was as follows (where *’s denote a column header):

*Header1*
*Header2*
Row1Column1
Row1Column2
Row2Column1
...
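If sgrep isn't available, the same row-delimited form can be approximated with python's standard html.parser module. To be clear, this is not what I ran; it is a sketch that swaps sgrep out for python. It assumes header cells are marked up as th and data cells as td, and the names TableCells and table_to_lines are invented for the example.

```python
from html.parser import HTMLParser


class TableCells(HTMLParser):
    """Collect the cells of every table whose summary attribute matches."""

    def __init__(self, summary: str):
        super().__init__()
        self.summary = summary
        self.depth = 0     # > 0 while inside a matching table
        self.cell = None   # tag of the currently open cell ("th" or "td")
        self.text = []     # text fragments of the open cell
        self.lines = []    # one entry per cell, headers wrapped in *...*

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            # Start counting at a matching table; keep counting nested ones.
            if self.depth or dict(attrs).get("summary") == self.summary:
                self.depth += 1
        elif self.depth and tag in ("th", "td"):
            self.cell, self.text = tag, []

    def handle_data(self, data):
        if self.cell:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == "table" and self.depth:
            self.depth -= 1
        elif tag == self.cell:
            value = "".join(self.text).strip()
            # Assumption: <th> cells are the column headers.
            self.lines.append(f"*{value}*" if tag == "th" else value)
            self.cell = None


def table_to_lines(html: str, summary: str) -> list[str]:
    """Return the matching table's cells, one per list entry."""
    parser = TableCells(summary)
    parser.feed(html)
    parser.close()
    return parser.lines
```

Printing "\n".join(table_to_lines(html, "XXX")) gives one cell per line, in the same shape as the listing above.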

The final csv output was:

Header1,Header2,
Row1Column1,Row1Column2,
Row2Column1,Row2Column2,
...
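The conversion itself is short. The following is a reconstruction under assumptions rather than my original script: it treats the leading *...* lines as headers, uses their count as the row width, and writes a trailing comma after every field to match the output above. The function name rows_to_csv is made up, and fields containing commas are not quoted.

```python
def rows_to_csv(lines: list[str]) -> str:
    """Convert the row-delimited form (one cell per line) into csv text."""
    cells = [line.strip() for line in lines if line.strip()]

    # Leading lines wrapped in asterisks are the column headers.
    headers = []
    for cell in cells:
        if len(cell) >= 2 and cell.startswith("*") and cell.endswith("*"):
            headers.append(cell[1:-1])
        else:
            break
    if not headers:
        raise ValueError("no *Header* lines found")

    # The number of headers tells us how many cells make up one row.
    width = len(headers)
    values = cells[width:]
    rows = [headers] + [values[i:i + width] for i in range(0, len(values), width)]

    # A comma after every field (including the last) mirrors the output above.
    return "\n".join("".join(f"{field}," for field in row) for row in rows) + "\n"
```

If you don't need the trailing commas, python's csv module is the safer choice since it also handles quoting.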

Step 3: Presentation

Sticking with the python theme established in the previous step, I investigated a few python graphing and charting libraries. I settled on Matplotlib, vastly overkill for my needs but a fun challenge nonetheless.

60 lines of hacky python later and I had something that could parse a csv file and plot the resulting data.

As always, the devil is in the details. Plotting a simple x,y graph is trivial; it’s slightly more difficult to create something with some semblance of polish.
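To give a feel for the trivial end of that spectrum, here is a minimal sketch of the parse-and-plot step. It is not my 60 lines. It assumes a two-column csv where the first column is a label (a day, say) and the second is a number, it tolerates the trailing commas shown earlier, and it writes a png file. The names read_series and plot_series are hypothetical.

```python
import csv


def read_series(csv_path):
    """Return (headers, labels, values) from a two-column csv file."""
    with open(csv_path, newline="") as handle:
        rows = []
        for row in csv.reader(handle):
            # A trailing comma yields an empty last field; drop it.
            if row and row[-1] == "":
                row = row[:-1]
            if row:
                rows.append(row)
    headers, body = rows[0], rows[1:]
    labels = [row[0] for row in body]
    values = [float(row[1]) for row in body]
    return headers, labels, values


def plot_series(csv_path, png_path):
    """Plot the second column against the first and save it as an image."""
    # Imported here so the csv parsing above works without Matplotlib installed.
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    headers, labels, values = read_series(csv_path)
    positions = list(range(len(labels)))

    fig, ax = plt.subplots()
    ax.plot(positions, values, marker="o")
    ax.set_xticks(positions)
    ax.set_xticklabels(labels, rotation=45, ha="right")
    ax.set_xlabel(headers[0])
    ax.set_ylabel(headers[1])
    fig.tight_layout()
    fig.savefig(png_path)
    plt.close(fig)
```

The polish (sensible tick spacing, titles, colours, handling gaps in the data) is where the remaining lines go.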

Step 4: Tying It All Together

The last step in all this is to create a suitable front-end. I chose a simple for loop in bash.

for x in urls
1. download raw html
2. massage it (html -> csv)
3. plot it (csv -> html)

Making a conscious distinction between crawling (downloading), indexing (massaging) and presenting (plotting) creates more opportunities for parallel operations. Once you have the raw data, you can run multiple indexing and plotting operations at the same time. This is a more scalable alternative to the common monolithic approach that bundles data retrieval, transformation and presentation into a single all-in-one package.
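The loop and the stage separation described above could be sketched in python along these lines. My actual front-end is a bash for loop, so treat this as an illustration only. The function run_pipeline and its download, massage and plot parameters are hypothetical; each is any callable that takes the previous stage's output.

```python
from concurrent.futures import ThreadPoolExecutor


def run_pipeline(urls, download, massage, plot, workers=4):
    """Download each url, then massage and plot the pages in parallel."""
    # Stage 1: crawling. Kept serial here to go easy on the remote site.
    pages = [download(url) for url in urls]

    def index_and_present(page):
        # Stages 2 and 3 only touch local data, so they can overlap freely.
        return plot(massage(page))

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as the input pages.
        return list(pool.map(index_and_present, pages))
```

Because the stages only meet at their inputs and outputs, any one of them can be swapped out (a spider in place of the simple download, for instance) without touching the others.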

 

That’s it for Part 1. As mentioned earlier, you can see the results at onesecondshy.com. The plots will actually mean something should you live on Vancouver Island. Future plans include writing a web spider and aggregating data over a longer period of time (than the week provided by gastips.com), amongst other things.

