Beautiful Souped Up – A Beautiful Soup GUI Utility to make Screen Scraping Even Easier
Quick Download:
BeautifulSoupedUp.py
Recently I’ve been doing some screen scraping in Python using Beautiful Soup, the great HTML/XML parser library written by Leonard Richardson. For those not familiar with Beautiful Soup, you give it an HTML/XML document and get back a nice data structure that lets you easily query for the elements of the document you’re after. Taken from the Beautiful Soup documentation, here’s a basic example of some of the things you can do with it:
from BeautifulSoup import BeautifulSoup
doc = ['<html><head><title>Page title</title></head>',
'<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
'<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
'</html>']
soup = BeautifulSoup(''.join(doc))
print soup.prettify()
# <html>
#  <head>
#   <title>
#    Page title
#   </title>
#  </head>
#  <body>
#   <p id="firstpara" align="center">
#    This is paragraph
#    <b>
#     one
#    </b>
#    .
#   </p>
#   <p id="secondpara" align="blah">
#    This is paragraph
#    <b>
#     two
#    </b>
#    .
#   </p>
#  </body>
# </html>
soup.contents[0].name
# u'html'
soup.contents[0].contents[0].name
# u'head'
head = soup.contents[0].contents[0]
head.parent.name
# u'html'
head.next
# <title>Page title</title>
head.nextSibling.name
# u'body'
head.nextSibling.contents[0]
# <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
head.nextSibling.contents[0].nextSibling
# <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
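The listing above is written for Python 2 and Beautiful Soup 3 (note the print statement and the BeautifulSoup module name). For readers on Python 3, a rough equivalent might look like the sketch below. It assumes the separately installed bs4 package (Beautiful Soup 4) and the standard library's html.parser backend, neither of which appears in the original listing. The sample page is also changed slightly so that each paragraph is explicitly closed, which keeps the result independent of how a particular parser repairs unclosed tags.

```python
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4 is installed

# Same sample page as above, but with every <p> explicitly closed.
doc = (
    '<html><head><title>Page title</title></head>'
    '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.</p>'
    '<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>'
    '</body></html>'
)

soup = BeautifulSoup(doc, "html.parser")

# Navigation: Beautiful Soup 4 spells nextSibling as next_sibling.
head = soup.html.head
body = head.next_sibling
first_para = body.contents[0]
second_para = first_para.next_sibling

# A typical query: every <p> whose align attribute is "center".
centered = soup.find_all("p", align="center")

print(head.parent.name)
print(body.name)
print(second_para["id"])
print([p.get_text() for p in centered])
```

The find_all call is the kind of query you would experiment with before copying it into your own program.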
So Beautiful Soup works pretty well as is, and it’s possible to test out different queries fairly quickly in the Python interactive interpreter. The problem with doing so, especially with larger documents, is that you don’t get to see the results within the overall context of the document. What I thought would be useful is a means of testing Beautiful Soup style queries that shows precisely where the results fit within the entire document. Inspired by the find feature in most text editors, I’ve developed a GUI that highlights Beautiful Soup matches within a document. The difference is that instead of typing in a string of characters to search for, you type in a Beautiful Soup style query, and the parts of the text that the query matches are highlighted. Take a look:
So a typical use case for such a tool goes a little something like this:
- You start out with an HTML document that you want to scrape. You’re using Beautiful Soup and you want to establish what kind of query you’re going to have to give it to extract the pieces of information you’re after.
- You fire up the Beautiful Souped Up GUI and copy the HTML document into the textbox in the middle of the window.
- You enter your soup query at the top of the window and run it. Any parts of the HTML document that your query matches get highlighted, allowing you to quickly determine whether you’ve written a query that’s extracting the pieces of information you’re actually after.
- Once you’ve found a query that does highlight the pieces of information you’re after, you take the same query and use it in your own Python program wherever it’s needed.
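The highlighting step in the workflow above can be approximated in a few lines. This is a minimal sketch of the idea, not the code the GUI actually uses: it assumes each match is available as a string that appears verbatim in the pasted document, which will not hold whenever the parser normalises the markup. The function names find_highlight_spans and mark are made up for this illustration.

```python
def find_highlight_spans(document, fragments):
    """Return merged (start, end) offsets where any fragment occurs in document."""
    spans = []
    for fragment in fragments:
        if not fragment:
            continue  # an empty string would match everywhere
        start = document.find(fragment)
        while start != -1:
            end = start + len(fragment)
            spans.append((start, end))
            start = document.find(fragment, end)
    spans.sort()
    merged = []
    for start, end in spans:
        if merged and start <= merged[-1][1]:
            # Overlapping or touching spans collapse into one highlight.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def mark(document, spans, opener="[[", closer="]]"):
    """Render the spans as text markers, standing in for GUI highlighting."""
    pieces = []
    cursor = 0
    for start, end in spans:
        pieces.append(document[cursor:start])
        pieces.append(opener + document[start:end] + closer)
        cursor = end
    pieces.append(document[cursor:])
    return "".join(pieces)


document = (
    '<html><body><p align="center">one</p>'
    '<p align="blah">two</p></body></html>'
)
# Hypothetical query results, as the string form of each matched element.
fragments = ['<p align="center">one</p>']

spans = find_highlight_spans(document, fragments)
print(mark(document, spans))
```

In a Tkinter Text widget, character offsets like these could be turned into indices of the form "1.0+12c" and passed to tag_add. Whether Beautiful Souped Up does it this way is something only its source can confirm.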
The Beautiful Souped Up GUI is just a single file that you can run from the command prompt/terminal (e.g. python BeautifulSoupedUp.py). There’s no fancy installer or anything at this point; I’ll wait and see how much demand there is for one first. I’ve tested it on both Windows Vista and Ubuntu Linux 9.04. Note that if you’re running it on Ubuntu, you’ll probably need to install python-tk first (e.g. sudo apt-get install python-tk). On Windows, I don’t think you’ll need anything installed beyond Python itself and Beautiful Soup.
I’ve tested Beautiful Souped Up under Python 2.5.4 and Beautiful Soup version 3.0.7a. If you’re running Python version 2.5.2 or earlier and get an error message about Tkinter, you’ll probably need to upgrade to Python 2.5.4. I haven’t had the opportunity to test Beautiful Souped Up with any other version of Python or Beautiful Soup, so if you have a different version of either, please leave a comment and let me know how it goes.
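If you’re unsure whether Tkinter is the problem, a quick check like the one below can help before launching the GUI. It is only a sketch: the function name tkinter_available is made up for this example, and it tests nothing more than whether the Tk bindings can be imported. It tries both the Python 2 module name (Tkinter) and the Python 3 one (tkinter), and says nothing about whether the rest of the script will run on your version of Python.

```python
def tkinter_available():
    """Return True if the Tk bindings can be imported under either module name."""
    for module_name in ("Tkinter", "tkinter"):  # Python 2 name, then Python 3 name
        try:
            __import__(module_name)
            return True
        except ImportError:
            continue
    return False


if __name__ == "__main__":
    if tkinter_available():
        print("Tkinter imported successfully.")
    else:
        print("Tkinter is missing; on Ubuntu try: sudo apt-get install python-tk")
```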
Beautiful Souped Up is shown running on Vista above. Here’s what it looks like running on Ubuntu below:
For anyone interested in contributing to Beautiful Souped Up, I’ve set up a public repository over on GitHub at http://github.com/brycethomas/BeautifulSoupedUp. Leave a comment and let me know what you think.


Hi Bryce -
I am not a programmer but this code looks like it would make what I want pretty easy. I am looking for an easy tool to scrape websites and convert them to something my Kindle can read.
In particular I want to be able to scrape VirtualTourist.com for a particular city or country.
For example I am going to Hamburg in two weeks. The page for info is:
http://www.virtualtourist.com/travel/Europe/Germany/Freie_und_Hansestadt_Hamburg/Hamburg-56480/TravelGuide-Hamburg.html
What would this look like after scraping in Beautiful Soup?
Hi Scott,
In terms of taking a website (e.g. one of the travel pages), scraping its content and converting it into a useful format (e.g. something suitable for the Kindle), Beautiful Souped Up really isn’t the tool for the job. The Beautiful Souped Up utility I’ve written here really is aimed at programmers who are already familiar with the Python programming language and the Beautiful Soup software library.
Beautiful Souped Up is just a tool that lets programmers meddle around a little with the page to figure out what code they’d need to write to extract the pieces of the page they’re after. It doesn’t do the other work required in a real screen scraper such as retrieving the page in the first place and converting the extracted pieces of text into an appropriate format. For this reason, Beautiful Souped Up really isn’t a complete screen scraping solution. It just helps a little in the meddling stage. The rest of the work requires a lot more Python code.
Theoretically, using Beautiful Soup you could take whatever pieces of the webpage’s HTML you’re after and convert them to whatever format you want, whether that be some format that lets you read them on the Kindle or some other format altogether. Actually programming it to do such a thing, however, would be a fairly involved process, especially when the output has to comply with a specific format such as something suitable for the Kindle. So unfortunately I can’t be of much help given your circumstances.
Nice… What an awesome little utility. Have you considered hosting it as an Open Source project on sourceforge (or equivalent)?
IMHO, this app has potential reaching far beyond the personal blogosphere.
I can’t wait to dig into the source to see how you did the highlighting part. Nice work.
Hi Evan,
Sure – it’s already hosted open source at http://github.com/brycethomas/BeautifulSoupedUp. I haven’t had the time available to do much more with it myself, though I’d love to see someone port this over to a browser extension that integrated in nicely with the “view HTML source” feature.
Bryce,
Cool, I’ve added it to my watch-list.