Faces in The Cloud: High-Throughput Data Processing with Message Queues, Part 1
As the title of this post alludes, this tutorial will guide you though the process of setting up a grid computing cluster, leveraging a high-performance message queue for task arbitration. This particular example will perform face detection on a number of images using Python, OpenCV, ImageMagick, and leveraging beanstalkd. My hope is that by the end of this tutorial you will understand a little bit about message queues, image processing, and grid computing.
The task
Imagine you work for a government three-letter agency. You're tasked with the job of taking traversing through a massive set of images, picking out the images that are of people's faces, and highlighting the face for future identification.
The process
We will divide this process into three parts, each which can be further parallelized:
- Load each image as a job into and divvy it out to the computation grid.
- Process each image, detect any faces in the image, and hightlight them.
- Store the highlighted image in a database.
The set up
This tutorial will have us building a grid of computers (physical, virtual, cloud, or othewise) to perform various steps of the process. For the purposes of this tutorial, our servers' names will be:
- arbiter: the machine that runs the message queue
- queuer-0..n: these machines will hold un-analysed images to be processed and give them to the arbiter as jobs to be completed.
- filter-0..n: these machines will request jobs from the arbiter, filter out images without faces, highlight faces in the remaining images, and pass them back to the arbiter as completed jobs.
- database: the machine that holds the final images, with faces highlighted.
Note: These do not necessarily need to be on separate systems
The arbiter
A message queue is an asynchronous form of interprocess communication. Our system will leverage beanstalkd to allow us to pass image processing jobs to an arbitrary number of grid workers.
We will start by installing and running beanstalkd on our arbitration server:
The queuers
In this tutorial, the queuers are simple beasts. Along with Python, they need only the beanstalkc library installed. They must:
- Connect to the beanstalk server "arbiter"
- Tell arbiter to create a message queue for new images
- Read through a collection of images
- Serialize each image and pass it to the arbiter's new image message queue
In an full implementation, the queuers would perform some meaningful actions to gather these images; they may crawl social networking site, capture data from surveilance cameras, or scour sex-offender registries. Ideally, it would upload images to a network data store, and pass an identifier on the job queue.
The filters
The filters will be the bulk of the workforce. We will be able to scale out as many of these beasts as the job requires with little effort. Their task is the most complicated:
- Connect to the beanstalk server "arbiter"
- Tell arbiter to create a message queue for images of people
- Request jobs from the arbiter's new image queue
- Detect faces using OpenCV
- Discard images without faces
- Highlight faces with ImageMagick
- Serialize the processed image, and pass it to the arbiter's people message queue
In a more robust form, the filter could perform additional transforms on the images. Attaching names to faces, identifying copyrighted photographs, and building social-network graphs are all examples of additional functions the filters could perform.
This technique, and snippets of code, were taken from Creating with Code.
The database
In this example, the database is just a machine holding processed images in a directory. We could easily envision a number of these machines reading images and storing them in an indexed Oracle database, or running them against mugshots of known criminals. In essence, the database will:
- Connect to the beanstalk server "arbiter"
- Request jobs from the arbiter's people queue
- Store the images.
A real system would necessarily have a RDBM backend, which has been excluded from this tutorial for brevity.
Running the system
In order to launch our image processing grid, we first need to start our database:
We then start our fitlers:
...
And finally, we launch our image queuers:
...
Checking our database, we should see that images are already piling up:
Taking it further
What we have built is an extremely basic, but flexible grid computing architecture for detecting faces. There are a number of shortcuts that were taken here to decrease complexity and improve code brevity. In the code above, raw image data is passed on the message queue, our database is just a directory of files, and our job queue has no resiliancy or failover. If this were to be put in use, we would want the entire system to be driven by a database, with images be stored in a network-acessible share and image identifiers passed on the job queue.
Ideas that could be added to this:
- Harvest images from Facebook.
- Perform facial recognition, not just detection.
- Process video instead of images.
- Utilize a more robust message queue such as RabbitMQ so you can work with larger jobs.
- Store data in a real database.