Module 15: Scaling Up

Version 1.11

Reading

Required: Rhodes and Goerzen, Chapter 8

Notes

This chapter deals with three topics: caches, specifically the use of memcached; message queues; and map-reduce. All relate to the common goal of ramping up our distributed applications to a very large scale; for example, a web site with millions of users.

Memcached

ATW told me (November, 2011) that his company's main web server is hosted by Amazon, it has 64 GB RAM and 16 quad-core quad-core processors (so, 64 cores). Their web services uses a lot of PHP. He reports that when they started using memcached, it reduced their CPU usage by 50%.

Facebook is a more famous user of memcached (and claims to be the largest user). Several other prominent users are listed on the memcached home page.

What is it?

Memcached is a distributed key-value store. That is, in Python terms, it's like a big dictionary, that's distributed in multiple nodes on the network. You can add key-value pairs, look them up, update them, and delete them.

What kinds of key-value pairs? SQL queries and their results, dynamically generated web pages—and anything else that would otherwise slow down your service. You can cache the results of function calls.

How does it work?

Basic set-up:

  1. You have lots of servers; run the memcached daemon (that's what the "d) is for on every one that has spare memory. These are the memcached servers.
  2. Every client gets a list of the memcached server IP and port numbers.
  3. When clients need data, they ask the memcached servers to look it up.

Details:

  1. The servers are oblivious to one another. They do not communicate.
  2. The client always knows exactly which server to contact for a given piece of information. This is done by hashing the key of the request. The hash value determines which server is responsible for that particular key. The collection of memcached servers is, in effect, a great distributed hash table. This is also called sharding. (If you don't know what hashing is, it's a way of computing a small summary of a large piece of data. Look at md5sum or sha1sum for example. If you want to know more about hash tables, take CSCI C243 Data Structures in the fall. I don't have time to explain the details in this course.)
  3. Operations you can perform on the key/value pairs:
    1. Insert into the store.
    2. Lookup the key, retrieving the value. (This, of course, may fail if the key is not in the store.)
    3. Update the value associated with a key.
    4. Delete the key-value pair when it becomes invalid. You can either drop it from the store or replace it with a new, valid value.
  4. What happens when a server runs out of space? Memcached uses a least-recently used (LRU) algorithm with expiration to delete something and make room. It will keep track of how recently things have been accessed, and drop expired items first, and then the least recently used item, on the theory that what was least recently used is least likely to be needed again in the near future.

Risks:

  1. Keys, obviously, need to be unique.
  2. There is no persistence with memcached, since it uses RAM for storage. It does not have the durability of a DBMS.
  3. Be careful to remove or replace invalid data. You can either set an expiration date for an item or actively remove or update it when it becomes invalid (e.g., when you update a part of a database, whatever was cached from that part might become invalid).

See also Why should you not use Memcached?

Message Queues

A message queue provides asynchronous communication, in the way that email is asynchronous. A sender sends a message; the message goes into a queue, where it remains until the receiver retrieves it.

Message queuing systems are sometimes called message-oriented middleware (MOM) because, of course, they sit between the sender and the receiver.

As Rhodes points out, sending messages via a message queue can be point to point (single sender, single receiver), like a TCP message, but it needn't be. In the publish/subscribe model, a message could be delivered to multiple receivers based on their interests. Receivers can subscribe to particular topics (so each message must specify a topic) and receive all messages on that topic—like subscribing to an email list. Or, receivers can specify filters that select messages based on their content.

Map-reduce

From the world of functional programming:

These operations tend to distribute well to computers on a network. If xs contains 1600 numbers and you have 16 computing nodes, each node could compute square x for 100 of the xs. Once that is done, each node could also add up its 100 squared xs, and then we have only 16 numbers to be added to get the final result.

"Guest Presentation"

Scaling Facebook with OpenSource tools (You Tube)

References


  1. Revisions
    • Version 1.1, 2011 Dec 1. Added content.
    • Version 1.0, 2011 Nov 26. Hardly more than an outline.
  2. See, there are at least two packages for Haskell: memcached and starling.