Chapter 9, "HTTP"
The HTTP protocol is used by web browsers to request a web page from a server, and by the web server to deliver the web page to the browser.
Key points to know:
HTTP is connection-oriented. An HTTP client (web browser) establishes a TCP connection with a web server. In the original version this connection lasted for only one request and response (e.g., request a web page and get it); nowadays, it usually persists longer.
A persistent connection avoids the overhead of setting up a new TCP connection for each additional request.
Visiting even a single web page often results in many requests for graphics, style sheets, and other resources.
HTTP is a text-based, request-response protocol. Both requests and responses have the same four-part structure (with some parts optional):
The request or response line
The headers
All HTTP/1.1 requests must include a Host header (RFC 2616, sec. 14.23):
Host: www.example.com:80
This allows a single web server to serve pages for multiple host names and ports. The port and colon, :80, can be omitted for the default port, 80.
There are many other, optional headers; some will be described below.
A blank line
The body (normally, the body is the document requested)
See the Examples section.
Types of requests (request methods): GET, HEAD, POST, PUT
HTTP (by itself) is stateless. Each request-response interchange is entirely independent of any that went before; the protocol keeps no state from previous interchanges. This has turned out to be somewhat awkward; as we shall see, there are ways of working around this.
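To make the four-part structure described above concrete, here is a minimal sketch that splits a raw message into its start line, headers, and body. The function name split_http_message is made up for illustration, and the parsing is deliberately simplified (for instance, a repeated header overwrites the earlier one); it is not how a production server parses messages.

```python
def split_http_message(raw: bytes):
    """Split a raw HTTP message into (start_line, headers, body)."""
    # On the wire each line ends with CRLF, so the blank line that
    # separates the headers from the body appears as CRLF CRLF.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    start_line, header_lines = lines[0], lines[1:]
    headers = {}
    for line in header_lines:
        # Split only on the first colon: values such as a host:port contain colons.
        name, _, value = line.partition(":")
        # Header names are case-insensitive, so store them in lower case.
        headers[name.strip().lower()] = value.strip()
    return start_line, headers, body


raw_request = (
    b"POST /cgi-bin/form1 HTTP/1.1\r\n"
    b"Host: www.example.com:80\r\n"
    b"Content-Type: application/x-www-form-urlencoded\r\n"
    b"Content-Length: 32\r\n"
    b"\r\n"
    b"first_name=Howard&last_name=Duck"
)
start_line, headers, body = split_http_message(raw_request)
print(start_line)       # POST /cgi-bin/form1 HTTP/1.1
print(headers["host"])  # www.example.com:80
print(body)             # b'first_name=Howard&last_name=Duck'
```

The same function works on a response, because requests and responses share the structure; only the start line differs.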
Requests are sent by the client, e.g., to request a particular web page.
A minimal GET request, consisting of a request line, a Host header, and a blank line:
GET /examples/index.html HTTP/1.1
Host: www.example.com
〈blank line〉
The request line asks to GET a particular URI on the server, namely /examples/index.html, and specifies the protocol version HTTP/1.1. The blank line indicates the end of the request.
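As a sketch of what this minimal request looks like as bytes on the wire, the following builds it by hand. The helper name build_get_request is hypothetical; the one detail the printed example cannot show is that every line, including the blank one, ends with a carriage return and line feed (CRLF).

```python
def build_get_request(host: str, path: str) -> bytes:
    """Return the bytes of a minimal HTTP/1.1 GET request."""
    lines = [
        f"GET {path} HTTP/1.1",  # the request line
        f"Host: {host}",         # the one header HTTP/1.1 requires
        "",                      # the blank line that ends the request
        "",                      # makes the join finish with a final CRLF
    ]
    return "\r\n".join(lines).encode("ascii")


request = build_get_request("www.example.com", "/examples/index.html")
print(request)
# b'GET /examples/index.html HTTP/1.1\r\nHost: www.example.com\r\n\r\n'
```

These bytes could be written to a TCP socket connected to port 80 of the host; in practice a library such as http.client does this for you.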
A GET request with additional headers. Each header is a keyword: value pair on its own line and supplies additional information that modifies the request.
GET /examples/index.html HTTP/1.1
Host: www.example.com:8080
Accept: */*
Connection: Keep-Alive
User-Agent: Firefox/3.01
〈blank line〉
In this example, the client specifies that it is a Firefox 3.01 browser, that it will accept any MIME type (*/*), and that it wants the server to keep the connection alive, i.e., not close it after the response.
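The effect of keeping a connection alive can be observed with a small experiment. The sketch below starts a throwaway server on the local machine and sends it two requests through a single http.client.HTTPConnection. The names HelloHandler and fetch_twice_on_one_connection are invented for this illustration, and the local port number is recorded only as evidence that both requests used the same TCP connection.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    # Speaking HTTP/1.1 lets the standard-library server keep connections open.
    protocol_version = "HTTP/1.1"

    def do_GET(self):
        body = f"You asked for {self.path}\n".encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        # On a persistent connection the client needs Content-Length
        # to know where this response ends and the next one begins.
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def fetch_twice_on_one_connection():
    """Send two GET requests over one connection; return what came back."""
    server = ThreadingHTTPServer(("127.0.0.1", 0), HelloHandler)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    conn = http.client.HTTPConnection("127.0.0.1", server.server_address[1], timeout=5)
    results = []
    try:
        for path in ("/index.html", "/style.css"):
            conn.request("GET", path, headers={"Connection": "keep-alive"})
            resp = conn.getresponse()
            text = resp.read().decode()
            # The client's local port identifies the underlying TCP connection.
            local_port = conn.sock.getsockname()[1]
            results.append((resp.status, text, local_port))
    finally:
        conn.close()
        server.shutdown()
        server.server_close()
    return results


for status, text, local_port in fetch_twice_on_one_connection():
    print(status, local_port, text.strip())
```

Both lines of output should show the same local port, which indicates that the second request reused the connection instead of opening a new one.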
MIME (Multipurpose Internet Mail Extensions) originally was developed to allow email attachments with arbitrary data; it is now also used in HTTP to specify the type of data being sent. Each MIME type specification has the form Type/Subtype. Examples of MIME types include text/plain, text/html, application/octet-stream, application/pdf, image/png, audio/midi, and video/mpeg.
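Python's standard mimetypes module guesses a MIME type from a file name's extension, which is roughly what a web server does when choosing a Content-Type. The helper name describe and the file names below are made up for illustration; the exact mapping can vary slightly between systems because the module may consult system configuration.

```python
import mimetypes


def describe(filename):
    """Return (type, subtype) guessed from the file extension, or None."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    if mime_type is None:
        return None
    # A MIME type has the form Type/Subtype.
    main_type, _, subtype = mime_type.partition("/")
    return main_type, subtype


for name in ("index.html", "notes.txt", "manual.pdf", "logo.png"):
    print(name, describe(name))
```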
A POST request, submitting a form.
POST /cgi-bin/form1 HTTP/1.1
Host: www.example.com:80
Accept: */*
Connection: Keep-Alive
User-Agent: Firefox/3.01
Content-type: application/x-www-form-urlencoded
Content-length: 32
〈blank line〉
first_name=Howard&last_name=Duck
The additional header lines specify the MIME type of the form data and its length in bytes (32 here). The form data itself follows the blank line.
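The form body and its two headers need not be written by hand. As a minimal sketch, urllib.parse.urlencode produces the body from a dictionary, and the length header is simply the byte count of the result; the variable names here are illustrative.

```python
from urllib.parse import urlencode

form = {"first_name": "Howard", "last_name": "Duck"}

# urlencode joins the key=value pairs with & and escapes special characters.
body = urlencode(form).encode("ascii")

headers = {
    "Content-Type": "application/x-www-form-urlencoded",
    "Content-Length": str(len(body)),  # counted in bytes, not characters
}

print(body)                       # b'first_name=Howard&last_name=Duck'
print(headers["Content-Length"])  # 32
```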
Responses are sent by the server after a request from the client.
A minimal response, consisting of a status line with an error status, and a blank line:
HTTP/1.1 404 NOT FOUND
〈blank line〉
The response means that the requested URI was not found.
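A status line has three space-separated fields: the protocol version, a numeric status code, and a human-readable reason phrase. The sketch below takes one apart; parse_status_line is a hypothetical helper, while http.HTTPStatus is the standard library's table of status codes. Clients should act on the number, since servers are free to word the phrase differently.

```python
from http import HTTPStatus


def parse_status_line(line):
    """Split a status line into (version, code, reason)."""
    # Split at most twice: the reason phrase may itself contain spaces.
    version, code, reason = line.split(" ", 2)
    return version, int(code), reason


version, code, reason = parse_status_line("HTTP/1.1 404 NOT FOUND")
print(version, code, reason)     # HTTP/1.1 404 NOT FOUND
print(HTTPStatus(code).phrase)   # Not Found
print(code // 100)               # 4, the class of client errors
```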
A response with a small web page:
HTTP/1.1 200 OK
Date: Mon, 27 Oct 2008 13:41:01 GMT
Server: Apache/2.2.9 (Linux)
Last-Modified: Sun, 26 Oct 2008 20:00:00 GMT
Content-Type: text/html
Content-Length: 116
〈blank line〉
<html>
<head>
<title>Hello WWW</title>
</head>
<body>
<p>Hello, World Wide Web!</p>
</body>
</html>
The status line indicates status code 200, no problem. The header lines give some information about the server, the date of retrieval, and the document retrieved. Then following the blank line comes the document itself.
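Date-valued headers such as Date and Last-Modified use a fixed format whose time zone is written as GMT. As a sketch, the standard email.utils module can both parse and produce this format; the two timestamps below are the ones from the example response, written with the GMT zone name.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime

# Parse header values into timezone-aware datetime objects.
date = parsedate_to_datetime("Mon, 27 Oct 2008 13:41:01 GMT")
last_modified = parsedate_to_datetime("Sun, 26 Oct 2008 20:00:00 GMT")

# How old was the document when it was retrieved?
age = date - last_modified
print(age)  # 17:41:01

# Going the other way: format a UTC datetime as a header value.
stamp = datetime(2008, 10, 27, 13, 41, 1, tzinfo=timezone.utc)
print(format_datetime(stamp, usegmt=True))  # Mon, 27 Oct 2008 13:41:01 GMT
```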
URLs have several parts, most of which are optional:
http://www.example.com:3080/foo/bar/baz.php?a=b+c&d=e
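Here http is the scheme, www.example.com is the host, 3080 is the port, /foo/bar/baz.php is the path, and a=b+c&d=e is the query string.
In the query string, & signs separate key-value pairs and + signs represent spaces in the values, so this query decodes to {'a': 'b c', 'd': 'e'}.
Special characters must be encoded (e.g., %20 encodes a space); the module urllib.parse has functions quote and quote_plus for doing this: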
>>> import urllib.parse
>>> url = 'http://blue.example.com/~james/a few thoughts.html'
>>> urllib.parse.quote(url)
'http%3A//blue.example.com/~james/a%20few%20thoughts.html'
>>> query = "weight=20 pounds&height=6 feet 7 inches"
>>> urllib.parse.quote_plus(query)
'weight%3D20+pounds%26height%3D6+feet+7+inches'
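Going in the other direction, urllib.parse can also take a URL apart. The following sketch splits the example URL from the start of this section into its parts and decodes its query string; only standard-library functions are used.

```python
from urllib.parse import parse_qsl, urlsplit

url = "http://www.example.com:3080/foo/bar/baz.php?a=b+c&d=e"
parts = urlsplit(url)

print(parts.scheme)    # http
print(parts.hostname)  # www.example.com
print(parts.port)      # 3080
print(parts.path)      # /foo/bar/baz.php
print(parts.query)     # a=b+c&d=e

# parse_qsl splits on & and =, and turns + and %XX escapes back into characters.
params = dict(parse_qsl(parts.query))
print(params)          # {'a': 'b c', 'd': 'e'}
```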
URLs can be absolute or relative. Absolute URLs, like absolute file paths, begin at the beginning and so tell you absolutely where to go. The example above is an absolute URL. Relative URLs leave out some of the initial parts, which are understood as relative to where you are now. So if you are "at" the URL above and encounter the relative URL
../golf/home.html
it means
http://www.example.com:3080/foo/bar/../golf/home.html
which is equivalent to
http://www.example.com:3080/foo/golf/home.html
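The resolution just described is what urllib.parse.urljoin computes. In this sketch the first call reproduces the example above; the other two relative URLs (other.php and /top.html) are made up to show two more common cases.

```python
from urllib.parse import urljoin

base = "http://www.example.com:3080/foo/bar/baz.php?a=b+c&d=e"

# ".." climbs out of /foo/bar/ before descending into golf/.
print(urljoin(base, "../golf/home.html"))
# http://www.example.com:3080/foo/golf/home.html

# A bare file name replaces the last path segment (and drops the query).
print(urljoin(base, "other.php"))
# http://www.example.com:3080/foo/bar/other.php

# A leading slash keeps the scheme, host, and port but replaces the whole path.
print(urljoin(base, "/top.html"))
# http://www.example.com:3080/top.html
```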
The chapter emphasizes the Python 2 modules urllib2 (for accessing URLs), urlparse (for taking URLs apart and putting them together), and httplib (for persistent connections).
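In Python 3, these have been split and renamed: urllib2 became urllib.request and urllib.error, urlparse became urllib.parse, and httplib became http.client.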
They are mostly useful for client-side HTTP work: for example, writing "spiders" that crawl the web, or extracting information from web pages or various APIs delivered over HTTP.
This course emphasizes server-side programs, so we will not pay too much attention to details here. (Should we? Client-side is also part of distributed computing.)
It is easy to do simple things with urllib.request and the like:
>>> import urllib.request, urllib.response
>>> resp = urllib.request.urlopen("http://www.iue.edu/")
>>> resp.geturl()
'http://www.iue.edu/'
>>> resp.info()
<http.client.HTTPMessage object at 0xb6d2356c>
>>> for item in resp.info().items(): print(item)
...
('Content-Type', 'text/html')
('Server', 'Microsoft-IIS/6.0')
('X-Powered-By', 'Infinite Monkeys, which eat for breakfast: Bananas, Dates, Nuts')
('X-Powered-By', 'ASP.NET')
('Date', 'Thu, 27 Oct 2011 23:04:08 GMT')
('Connection', 'close')
>>> resp = urllib.request.urlopen("http://www.iue.edu/")
>>> page = resp.read().decode()
>>> print(page)
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
...
</body>
</html>
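A step up from a bare urlopen call, still without subclassing, is to describe the request with a urllib.request.Request object. The sketch below prepares a form submission like the POST example earlier in these notes but does not send it, since the URL is only a placeholder; the User-Agent string is likewise invented for illustration.

```python
import urllib.request
from urllib.parse import urlencode

form_data = urlencode({"first_name": "Howard", "last_name": "Duck"}).encode("ascii")

req = urllib.request.Request(
    "http://www.example.com/cgi-bin/form1",  # placeholder URL, not a real service
    data=form_data,                          # supplying data makes this a POST
    headers={"User-Agent": "course-notes-example/0.1"},
)

print(req.get_method())              # POST
print(req.host)                      # www.example.com
# Request stores header names with only the first letter capitalized.
print(req.get_header("User-agent"))  # course-notes-example/0.1

# To actually send it, one would call urllib.request.urlopen(req).
```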
Doing more complex things, as on page 142, may require defining subclasses to override methods of the module classes. The next section reviews classes and object-oriented programming in Python.