Beautiful Souped Up – A Beautiful Soup GUI Utility to make Screen Scraping Even Easier
Posted in Uncategorized on October 10th, 2009 by Bryce Thomas
Quick Download:
BeautifulSoupedUp.py
Recently I’ve been doing some screen scraping in Python using Beautiful Soup, the great HTML/XML parser library written by Leonard Richardson. For those not familiar with Beautiful Soup, you give it an HTML/XML document and get back a data structure that lets you easily query for the elements of the document you’re after. Taken from the Beautiful Soup documentation, here’s a basic example of some of the things you can do with it:
from BeautifulSoup import BeautifulSoup
doc = ['<html><head><title>Page title</title></head>',
'<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
'<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
'</html>']
soup = BeautifulSoup(''.join(doc))
print soup.prettify()
# <html>
#  <head>
#   <title>
#    Page title
#   </title>
#  </head>
#  <body>
#   <p id="firstpara" align="center">
#    This is paragraph
#    <b>
#     one
#    </b>
#    .
#   </p>
#   <p id="secondpara" align="blah">
#    This is paragraph
#    <b>
#     two
#    </b>
#    .
#   </p>
#  </body>
# </html>
soup.contents[0].name
# u'html'
soup.contents[0].contents[0].name
# u'head'
head = soup.contents[0].contents[0]
head.parent.name
# u'html'
head.next
# <title>Page title</title>
head.nextSibling.name
# u'body'
head.nextSibling.contents[0]
# <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
head.nextSibling.contents[0].nextSibling
# <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
So Beautiful Soup works pretty well as is, and it’s possible to test out different queries fairly quickly using the Python interactive interpreter. The problem with testing queries in the interpreter, especially with larger documents, is that you don’t get to see the results within the overall context of the document. What I thought would be useful is a way of testing Beautiful Soup style queries that shows precisely where the results fit within the entire document. Inspired by the find feature found in most text editors, I’ve developed a GUI that highlights Beautiful Soup matches within a document. The difference is that instead of typing in a string of characters to search for, you type in a Beautiful Soup style query, and the parts of the text that the query matches are then highlighted. Take a look:
[Screenshot: Beautiful Souped Up running on Windows Vista]
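To give a feel for the idea, here’s a minimal sketch of one way query matches could be mapped back onto the document text for highlighting. It is an illustration only, not the code in BeautifulSoupedUp.py. The function names (find_spans, to_tk_index, highlight) are hypothetical, it is written for Python 3 rather than the Python 2.5 used in this post, and it assumes each match is available as a string that appears verbatim in the document (which won’t hold if the parser re-serializes the markup differently from the source).

```python
def find_spans(document, fragments):
    """Return sorted (start, end) character offsets of every occurrence
    of each fragment within the document text."""
    spans = []
    for fragment in fragments:
        if not fragment:
            continue  # an empty string would match everywhere
        start = document.find(fragment)
        while start != -1:
            end = start + len(fragment)
            spans.append((start, end))
            start = document.find(fragment, end)
    return sorted(spans)


def to_tk_index(document, offset):
    """Convert a character offset into a Tk text index of the form
    'line.column' (lines count from 1, columns from 0)."""
    line = document.count("\n", 0, offset) + 1
    column = offset - (document.rfind("\n", 0, offset) + 1)
    return "%d.%d" % (line, column)


def highlight(text_widget, document, fragments, tag="match"):
    """Tag every matched span in a Tkinter Text widget that already
    contains the document text. The widget is supplied by the caller."""
    text_widget.tag_configure(tag, background="yellow")
    for start, end in find_spans(document, fragments):
        text_widget.tag_add(tag,
                            to_tk_index(document, start),
                            to_tk_index(document, end))
```

With a Tkinter Text widget holding the document, calling highlight(widget, document, [str(m) for m in matches]) would give every occurrence of a match a yellow background. Searching for the serialized match text is the simplest approach, though it will also highlight identical markup elsewhere in the document.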
So a typical use case for such a tool goes a little something like this:
- You start out with an HTML document that you want to scrape. You’re using Beautiful Soup and you want to establish what kind of query you’re going to have to give it to extract the pieces of information you’re after.
- You fire up the Beautiful Souped Up GUI and copy the HTML document into the textbox in the middle of the window.
- You enter your soup query at the top of the window and run it. Any parts of the HTML document that your query matches get highlighted, allowing you to quickly determine whether you’ve written a query that’s extracting the pieces of information you’re actually after.
- Once you’ve found a query that does highlight the pieces of information you’re after, you take the same query and use it in your own Python program wherever it’s needed.
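As an illustration of that last step, here’s a minimal sketch of dropping a query into a small program. It makes a few assumptions: it uses the newer bs4 package and Python 3 rather than the Beautiful Soup 3.0.7a and Python 2.5 that this post was written against, the function name centered_paragraphs is made up for the example, and the closing p tags have been added so the result doesn’t depend on how the parser repairs the markup. In Beautiful Soup 3 the equivalent query would be written soup.findAll('p', align='center').

```python
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4 is installed

HTML = (
    '<html><head><title>Page title</title></head>'
    '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.</p>'
    '<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>'
    '</body></html>'
)


def centered_paragraphs(html):
    """Return the text of every <p> element whose align attribute is 'center'."""
    soup = BeautifulSoup(html, "html.parser")
    # The same query you would have tried out interactively first.
    return [p.get_text() for p in soup.find_all("p", align="center")]


if __name__ == "__main__":
    for text in centered_paragraphs(HTML):
        print(text)
```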
The Beautiful Souped Up GUI is just a single file that you can run from the command prompt/terminal (e.g. python BeautifulSoupedUp.py). There’s no fancy installer or anything at this point; I’ll wait and see how much demand there is for an installer first. I’ve tested it on both Windows Vista and Ubuntu Linux 9.04. Note that if you’re running it on Ubuntu, you’ll probably need to install python-tk first (e.g. sudo apt-get install python-tk). On Windows, I don’t think you’ll need anything installed beyond Python and Beautiful Soup.
I’ve tested Beautiful Souped Up under Python 2.5.4 and Beautiful Soup version 3.0.7a. If you’re running Python version 2.5.2 or earlier and get an error message about Tkinter, you’ll probably need to upgrade to Python 2.5.4. I haven’t had the opportunity to test Beautiful Souped Up with any other version of Python or Beautiful Soup, so if you have a different version of either, please leave a comment and let me know how it goes.
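If you’re not sure whether your Python installation has Tkinter at all, a quick check along these lines can tell you before you launch the GUI. This is only a sketch: the function name tkinter_available is made up for the example, and it relies on importlib, which is not available in Python 2.5. It tries both the Python 3 module name (tkinter) and the Python 2 name (Tkinter).

```python
import importlib


def tkinter_available():
    """Return True if the Tk bindings can be imported under either the
    Python 3 name ('tkinter') or the Python 2 name ('Tkinter')."""
    for name in ("tkinter", "Tkinter"):
        try:
            importlib.import_module(name)
            return True
        except ImportError:
            continue
    return False


if __name__ == "__main__":
    if tkinter_available():
        print("Tkinter is available.")
    else:
        print("Tkinter is missing; on Ubuntu, try installing python-tk.")
```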
Beautiful Souped Up is shown running on Vista above. Here’s what it looks like running on Ubuntu:
[Screenshot: Beautiful Souped Up running on Ubuntu]
For anyone interested in contributing to Beautiful Souped Up, I’ve set up a public repository over on GitHub at http://github.com/brycethomas/BeautifulSoupedUp. Leave a comment and let me know what you think.

