By Phil Whelan on December 31, 2010
Here I demonstrate, with repeatable steps, how to fire up a Hadoop cluster on Amazon EC2, load data onto HDFS (the Hadoop Distributed File System), write map-reduce scripts in Ruby, and use them to run a map-reduce job on your Hadoop cluster. You will not need to ssh into the cluster, as all tasks are run from your local machine. Below I use my MacBook Pro as my local machine, but the steps should be reproducible on other platforms running bash and Java.
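As a rough illustration of what map-reduce scripts in Ruby can look like under Hadoop Streaming, here is a minimal word-count sketch. It is not the code from the post: the word-count task, the file name `wordcount.rb`, and the method names `map_line` and `reduce_lines` are all assumptions made for this example. The only thing it relies on is the Hadoop Streaming contract, where scripts read lines from standard input, write tab-separated key/value lines to standard output, and the reducer receives its input sorted by key.

```ruby
# wordcount.rb (hypothetical file name): a word-count mapper and reducer
# written for Hadoop Streaming. Illustrative sketch, not the post's code.

# Mapper: emit one "word<TAB>1" line for every word found in an input line.
def map_line(line)
  line.downcase.scan(/[a-z0-9']+/).map { |word| "#{word}\t1" }
end

# Reducer: sum the counts for each key. Hadoop Streaming hands the reducer
# its input sorted by key, so lines with equal keys are always adjacent.
def reduce_lines(sorted_lines)
  totals = []
  current_key = nil
  current_sum = 0
  sorted_lines.each do |line|
    key, value = line.chomp.split("\t", 2)
    if key != current_key
      # Key changed: flush the finished group before starting a new one.
      totals << "#{current_key}\t#{current_sum}" unless current_key.nil?
      current_key = key
      current_sum = 0
    end
    current_sum += value.to_i
  end
  totals << "#{current_key}\t#{current_sum}" unless current_key.nil?
  totals
end

# Pick the role from the first argument: "map" or "reduce".
case ARGV.first
when "map"
  STDIN.each_line { |line| map_line(line).each { |out| puts out } }
when "reduce"
  reduce_lines(STDIN.each_line).each { |out| puts out }
end
```

A script like this can be checked on the local machine before it goes anywhere near a cluster, for example with `cat input.txt | ruby wordcount.rb map | sort | ruby wordcount.rb reduce`, where `sort` stands in for Hadoop's shuffle phase.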
Posted in Data processing, Hadoop, Ruby | Tagged amazon ec2, bash, cloudera, data processing, hadoop, hadoop cluster, hadoop streaming, hdfs, jclouds, map-reduce, ruby, whirr
By Phil Whelan on December 8, 2010
There are data sources out there, but which one you choose depends on which technology you want to get experience with. The goal is experience with the technologies you are using, rather than with the data itself, and certain datasets pair better with certain technologies. Simulating the data is another approach; you just need a clever way of generating and randomizing your fake data. Thirdly, you can use a hybrid approach: take real data and replay it on a loop, randomizing it as it goes through. Simulating the Twitter fire-hose should not be too hard, should it?
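The hybrid approach can be sketched in a few lines of Ruby. This is only one possible reading of "replay it on a loop, randomizing it as it goes through": the method names `replay_stream` and `randomize`, and the record fields `id`, `user`, and `timestamp`, are hypothetical and chosen purely for illustration.

```ruby
# Replay a small set of real records forever, perturbing each copy so the
# stream does not repeat exactly. Field names are hypothetical.

# Return a copy of the record with its identifying fields re-randomized.
def randomize(record, rng)
  record.merge(
    "id"        => rng.rand(1_000_000_000),
    "user"      => "user#{rng.rand(10_000)}",
    "timestamp" => Time.now.to_f
  )
end

# Build an endless, lazily evaluated stream from a finite seed set.
def replay_stream(records, rng: Random.new)
  raise ArgumentError, "need at least one seed record" if records.empty?

  Enumerator.new do |out|
    loop do
      # Reshuffle on every pass so the replay order also varies.
      records.shuffle(random: rng).each do |record|
        out << randomize(record, rng)
      end
    end
  end
end
```

Because the stream is an `Enumerator`, a consumer pulls only as much as it needs, for example `replay_stream(seed_records).first(1000)`, and passing a seeded `Random` makes a test run repeatable.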
Posted in Data processing | Tagged amazon ec2, bbc, cassandra, data processing, dbpedia, ebs, freebase, hadoop, hdfs, large datasets, livedoor, lucene, mongodb, nosql, rackspace, solr, wikipedia, working with data
By Phil Whelan on October 1, 2010
This is a follow-up to my recent blog post on Working With Large Data Sets. That post attracted some interest, so I thought it would be a good idea to go through the methodologies I had used for processing this data. I entitled this “Data Mining Without Hadoop”, because I have experience using Hadoop, and although [...]
Posted in Data processing, Hadoop | Tagged data mining, data processing, hadoop, hdfs, high scalability, large data, sort, streaming
By Phil Whelan on September 24, 2010
For the past three and a half years I have been working for a start-up in downtown Vancouver. We have been developing a high-performance SMTP proxy that can scale to handle tens of thousands of connections per second on each machine, even with pretty standard hardware. For the past 18 months I’ve moved from [...]
Posted in Data processing | Tagged data, data processing, fire hose, firehose, hadoop, hdfs, large data sets, mailchannels, real-time search, spamhaus, statistics