The other night I sat down and spent some time playing around with Hadoop.

What follows here is based on my brief understanding of the project and one nights worth of experience :)

Hadoop is an Apache Lucene project that provides an open-source implementation of MapReduce. MapReduce is a programming model emphasizing parallel processing that has been developed and popularized by Google. From everything I’ve read and seen of Google’s MapReduce implementation, Hadoop looks to be very similar. Interestingly enough, Yahoo! is a significant supporter of this project and has widely published and presented on their experiences.

Having a general conceptual understanding of MapReduce but little experience with it, I set out to solve a relatively straight forward problem.

As part of our interviewing process, potential candidates are given a programming problem to solve. One that we’ve used previously involved the parsing of apache-style log files and gathering specific statistics. It wasn’t a stretch to see how this could be implemented using MapReduce and Hadoop.

Step 1: Installing Hadoop

My daily driver is a MacBook Pro so the installation went quite smoothly. Nothing significantly more difficult than unpacking an archive, updating a few environment variables, and running a few example jobs.

I setup a two machine cluster - my laptop and another VMware virtual machine. Nothing particularly sophisticated.

Step 2: Breaking down the problem

The first step of any implementation involves breaking the problem down into a series of map and reduction steps. Hadoop provides interfaces and abstract base classes that help you get started. Arguably the process of breaking any problem down into its fundamental building blocks is a significant challenge, but at the same time it’s part of the elegance enabled by a programming model such as MapReduce.

It’s important to note that the framework is responsible for all I/O operations and actually builds this functionality on top of it’s own implementation of a distributed file system. In order to make data available for processing, you must first copy it to the distributed file system. Likewise, if you want to do additional work on output data (and not do it using Hadoop), you must copy it from the distributed file system. This not only enables larger volumes of data to be made available to the cluster but it also allows the framework to partition and store records closer to the machines that will be processing them. Google has an interesting paper that describes their approach to distributed file systems (See Google File System)

In the simplest terms, a mapping step is responsible for processing input data and generating key -> value pairs. Hadoop takes care of aggregating all values with their corresponding keys and passing this result (key -> collection of values) to a reduction step. A reduction step is responsible for reducing the collection of values into a suitable output value. See the Hadoop quick start guide for a simple implementation of both a map and reduce step.

Step 3: The mapping step

Take the following problem definition: Given a collection of apache log files determine the urls with the most hits and return them in descending order.

The mapping step is fairly straight forward. It’s responsible for parsing individual log records and outputting a url and count for each.

In this example, the count would always be 1 and you can think of this mapping step as a marker. It’s responsible for flagging the url from each record it parses. It’s the reduction step’s responsibility to count up all of the individual flags.

Step 4: The reduction step

Example Input: [http://www.google.ca, [1,1,1,1,1,1,1]], [http://www.yahoo.com, [1,1,1], [http://www.microsoft.com, [1,1,1,1,1]]

As mentioned previously, the responsibility of the reduction step in this example is to reduce these inputs by counting up the number of flags. The output would look something like:

[http://www.google.ca, 7], [http://www.yahoo.com, 3]

Step 5: Putting it all together

That’s it. The result of running a job with these map and reduce steps would be similar to:

http://www.google.ca 7

http://www.yahoo.com 3

http://www.microsoft.com 5

Unfortunately the outputs are sorted by value and we haven’t yet quite satisfied the problem definition. However, the beauty is that we can now pass this output through another mapper that will invert the key/value and then sort by key. The output of that job will be a list of urls sorted by their hit count. Success!

All in all it’s an incredibly powerful idea and framework. Absolutely overkill for small tasks like these but if you’ve got a many terabytes (or petabytes) of data and a couple thousand processing nodes, it’s more than adequate.


  1. secr

    Hi,
    thank you for sharing this interesting post. It might sound stupid to you, but it could be even more useful with some code-snippets. And unbroken links.

  2. ajordens

    Thanks.

    I’m actually not sure what happened to the links. I apologize for their brokenness, not sure whether that happened as a result of migrating from 1and1 to WebFaction or when I switched to using Ecto as a publishing platform.

    You’re right, code snippets would probably have been useful :) Essentially what I discussed in my post was a simple extension of what the Hadoop guys have documented in their quick start guide.

    Cheers.

  3. Facebook Dev

    Hadoop MapReduce is powerful, but what promises to be substantially more disruptive is the combination of structured and unstructured data analysis. Aster Data, for example, touted the world’s first-ever fully integrated MapReduce + SQL MPP relational database:

    http://www.asterdata.com/product/mapreduce.html

    Pretty cool stuff…i hear it’s in production too!

Leave a Comment




  • Pet Peeve: Don’t email my password to me in plain text You know the drill. Signup for some random service on the internet Receive a confirmation email with your account information or Forget a password for some random service ...

  • Eclipise Memory Analyzer (MAT) I must say the Eclipse Memory Analyzer looks pretty slick. There is some pretty good material over on the developers blog. Lastly, there was a talk on it ...

  • Open-source Web-based Code Review Tool: Rietveld Guido van Rossum, of Python fame, has recently released a Django-based application that enables web-based code reviews... Rietveld. It supports any language and currently can hook into Subversion repositories. You ...

  • An implementation of the JVM in Javascript? Caught this over on JavaPosse Google Groups. Essentially, some bright fellows over in Japan have developed a bytecode->javascript compiler. There's a demo floating around that took a Tetris ...

  • Facebook Chat? So it looks like the Facebook Chat service has finally started rolling out to my network (Facebook Chat has been mentioned previously). Not quite sure how ...