Faces in The Cloud: High-Throughput Data Processing with Message Queues, Part 1

As the title of this post alludes, this tutorial will guide you though the process of setting up a grid computing cluster, leveraging a high-performance message queue for task arbitration. This particular example will perform face detection on a number of images using Python, OpenCV, ImageMagick, and leveraging beanstalkd. My hope is that by the end of this tutorial you will understand a little bit about message queues, image processing, and grid computing.

The task

Imagine you work for a government three-letter agency. You're tasked with the job of taking traversing through a massive set of images, picking out the images that are of people's faces, and highlighting the face for future identification.

The process

We will divide this process into three parts, each which can be further parallelized:

  1. Load each image as a job into and divvy it out to the computation grid.
  2. Process each image, detect any faces in the image, and hightlight them.
  3. Store the highlighted image in a database.

The set up

This tutorial will have us building a grid of computers (physical, virtual, cloud, or othewise) to perform various steps of the process. For the purposes of this tutorial, our servers' names will be:

Note: These do not necessarily need to be on separate systems

The arbiter

A message queue is an asynchronous form of interprocess communication. Our system will leverage beanstalkd to allow us to pass image processing jobs to an arbitrary number of grid workers.

We will start by installing and running beanstalkd on our arbitration server:

The queuers

In this tutorial, the queuers are simple beasts. Along with Python, they need only the beanstalkc library installed. They must:

  1. Connect to the beanstalk server "arbiter"
  2. Tell arbiter to create a message queue for new images
  3. Read through a collection of images
  4. Serialize each image and pass it to the arbiter's new image message queue

In an full implementation, the queuers would perform some meaningful actions to gather these images; they may crawl social networking site, capture data from surveilance cameras, or scour sex-offender registries. Ideally, it would upload images to a network data store, and pass an identifier on the job queue.

The filters

The filters will be the bulk of the workforce. We will be able to scale out as many of these beasts as the job requires with little effort. Their task is the most complicated:

  1. Connect to the beanstalk server "arbiter"
  2. Tell arbiter to create a message queue for images of people
  3. Request jobs from the arbiter's new image queue
  4. Detect faces using OpenCV
  5. Discard images without faces
  6. Highlight faces with ImageMagick
  7. Serialize the processed image, and pass it to the arbiter's people message queue

In a more robust form, the filter could perform additional transforms on the images. Attaching names to faces, identifying copyrighted photographs, and building social-network graphs are all examples of additional functions the filters could perform.

This technique, and snippets of code, were taken from Creating with Code.

The database

In this example, the database is just a machine holding processed images in a directory. We could easily envision a number of these machines reading images and storing them in an indexed Oracle database, or running them against mugshots of known criminals. In essence, the database will:

  1. Connect to the beanstalk server "arbiter"
  2. Request jobs from the arbiter's people queue
  3. Store the images.

A real system would necessarily have a RDBM backend, which has been excluded from this tutorial for brevity.

Running the system

In order to launch our image processing grid, we first need to start our database:

We then start our fitlers:


And finally, we launch our image queuers:


Checking our database, we should see that images are already piling up:

Attached Image Attached Image Attached Image

Taking it further

What we have built is an extremely basic, but flexible grid computing architecture for detecting faces. There are a number of shortcuts that were taken here to decrease complexity and improve code brevity. In the code above, raw image data is passed on the message queue, our database is just a directory of files, and our job queue has no resiliancy or failover. If this were to be put in use, we would want the entire system to be driven by a database, with images be stored in a network-acessible share and image identifiers passed on the job queue.

Ideas that could be added to this:

comments powered by Disqus