After reading the chapter on screen scraping, I realized—what maybe should have been obvious beforehand—that the topic is really not a good match for this course. It has a lot more to do with information representation (i.e., how to parse and extract information from HTML) and so might be better matched to INFO I308. I think the main lessons here are that screen scraping (trying to extract meaningful information computationally from HTML) is a very painful and brittle process, and that we are much better off using a well-defined API if one is available.
Required: Rhodes and Goerzen, Chapter 10
You can fetch a page's HTML with the urllib standard library modules.
Sometimes the page you want isn't directly addressable: you may need to submit a form (sending a GET or POST request, with form data, to a form-processing URL, a CGI script or the like) to get to the page you want. You may even need to go through a series of forms. Tools such as Selenium and Windmill can help to capture the commands needed to automate such a process.
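As a minimal sketch of the fetching step: a plain GET is just urlopen on the URL, and passing encoded form data as the request body makes urllib issue a POST instead. The URL and field names here are hypothetical, for illustration only.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Hypothetical form-processing URL and field names.
url = "http://example.com/weather"

# A plain GET would just be:
#   html = urlopen(url).read().decode("utf-8")

# Submitting a form: encode the fields as application/x-www-form-urlencoded
# data and attach it as the request body, which turns the request into a POST.
form_data = urlencode({"city": "Bloomington", "units": "F"}).encode("ascii")
request = Request(url, data=form_data)

print(request.get_method())   # POST, because the request carries data
```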
Once you get the HTML page, what do you do with it? How do you find the content that you're interested in?
There are two possible approaches:
Parsing is the process of inferring the grammatical structure of expressions in some language. When you hear someone say, "The fat cow leaped over the apple tree", you probably parse this, consciously or unconsciously, as subject = "The fat cow", predicate = "leaped over the apple tree." You can further parse the subject as having a noun "cow" modified by an adjective phrase "The fat", and the predicate as having a verb "leaped" and a prepositional phrase "over the apple tree", and so on. If you go all the way down to the individual words, you can summarize your parsing with a beautiful parse tree:
(Was this the apple tree the fat cow leaped over?)
Similarly, when a web browser receives an HTML page, before displaying the page it parses it. We'll just look at a little fragment of HTML:
<body>
  <p>Every morning:</p>
  <ol>
    <li>Eat breakfast.</li>
    <li>Drink <em>fruit</em> juice.</li>
    <li>Brush teeth.</li>
  </ol>
</body>
—which could be parsed into this tree:
There are multiple algorithms and libraries for parsing, in and out of Python.
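One such library ships with Python itself: html.parser. Here is a minimal sketch that parses the fragment above and prints its nesting as an indented outline (the TreePrinter class and its output format are mine, not from the course materials):

```python
from html.parser import HTMLParser

# Print each tag (and each piece of text) indented by its depth in the tree,
# by counting start and end tags as the parser walks the document.
class TreePrinter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1

    def handle_data(self, data):
        if data.strip():                  # skip whitespace between tags
            self.lines.append("  " * self.depth + repr(data.strip()))

fragment = """<body> <p>Every morning:</p> <ol>
<li>Eat breakfast.</li> <li>Drink <em>fruit</em> juice.</li>
<li>Brush teeth.</li> </ol> </body>"""

printer = TreePrinter()
printer.feed(fragment)
print("\n".join(printer.lines))
```

The indentation in the output mirrors the parse tree: body at the top, p and ol inside it, the li elements inside ol, and em inside the second li.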
But what if the HTML document isn't well-structured HTML? Well, you could try to fix it up (with HTML Tidy or the like), use a parsing library that's "forgiving" (as web browsers are), or even decide to treat it as just text, not HTML (see the next section).
Once you've got the tree, you can dig into it with various tools, including XPath, CSS selectors, and just general tree operations, to find what you want.
Details of parsing and of extracting information from the document tree are beyond the scope of this course. They are somewhat more relevant to INFO I308 Information Representation (and are covered in that course, though using XML rather than HTML, but the techniques are very similar).
Alternatively, we could decide to ignore the HTML structure (if any) of the document, and just treat it as a big string to be searched in.
For example, suppose we find the weather page contains today's high temperature in a context such as this:

<tr><td>High:</td><td>79</td>

We want the high temperature, which on this particular day is 79, and it's always found between "<tr><td>High:</td><td>" and "</td>", which we'll call the prefix and the postfix.
Without even interpreting this as HTML, we can use some of Python's string methods to find it:
1. Use the string's find method to locate the prefix, namely "<tr><td>High:</td><td>".
2. Use the find method again to locate the postfix, namely "</td>", starting the search just past the end of the prefix.
3. The temperature is the substring between the end of the prefix and the start of the postfix.
See the function table_lookup1 in search.py.
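The actual search.py isn't reproduced here, but a table_lookup1-style function following the three steps above might look like this sketch:

```python
# Return the text between prefix and postfix in html, or None if either
# marker is missing. A sketch of the str.find approach described above;
# the real table_lookup1 in search.py may differ in detail.
def table_lookup1(html, prefix, postfix):
    start = html.find(prefix)
    if start == -1:
        return None                      # prefix not found
    start += len(prefix)                 # skip past the prefix itself
    end = html.find(postfix, start)      # search only after the prefix
    if end == -1:
        return None                      # postfix not found
    return html[start:end]

page = "...<tr><td>High:</td><td>79</td></tr>..."
print(table_lookup1(page, "<tr><td>High:</td><td>", "</td>"))  # → 79
```

Note the second argument to find: starting the postfix search after the prefix avoids matching the "</td>" that appears inside the prefix itself.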
We can use an even more powerful tool called regular expressions. A regular expression (or "regex") is a kind of string pattern.
In our pattern, the prefix "<tr><td>High:</td><td>" and the postfix "</td>" can appear literally; between them we need a pattern that matches the temperature itself. Regular expressions give us the pieces:
    .    matches any character
    *    means "repeated 0 or more times"
    .*   means "any zero or more characters"
The part we actually want is just what matched the .*. So we can group this by enclosing it in parentheses. When we find a match for the whole regular expression, then, we can ask the matcher to deliver the first group, i.e., what matched the part in parentheses.
See the function table_lookup2 in search.py.
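Again, search.py itself isn't shown here, but a table_lookup2-style function built on the regular-expression idea above might look like this sketch:

```python
import re

# Build a pattern of prefix + (.*) + postfix and return what the
# parenthesized group matched, or None on no match. re.escape keeps
# characters like '<' and '/' in the markers from being treated as
# regex syntax. A sketch; the real table_lookup2 may differ in detail.
def table_lookup2(html, prefix, postfix):
    pattern = re.escape(prefix) + "(.*)" + re.escape(postfix)
    match = re.search(pattern, html)
    return match.group(1) if match else None

page = "...<tr><td>High:</td><td>79</td></tr>..."
print(table_lookup2(page, "<tr><td>High:</td><td>", "</td>"))  # → 79
```

One caution: .* is greedy, so if the postfix occurred more than once after the prefix, this would match through to the last occurrence; the non-greedy form (.*?) stops at the first.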
The structure of most web pages is designed to appeal to human viewers, not to screen-scraping programs. It can be difficult for a screen scraper to find the information we want in the HTML structure (if there really is any HTML structure), and that structure is subject to change without notice, forcing us to redesign our screen scrapers.
Screen-scraping is a nasty, ugly business which ought to be avoided if possible. If any kind of API is available, use it instead.
Reference: the urllib pages in the Python Standard Library documentation.