By Phil Whelan on January 21, 2011
If you have read my previous post, “Map-Reduce With Ruby Using Hadoop”, you will know that firing up a Hadoop cluster is really simple when you use Whirr. Without even ssh’ing into the machines in the cloud, you can start up your cluster and interact with it. In this post I’ll show you that it [...]
Posted in Cassandra, Web Development | Tagged amazon ec2, cassandra, cluster, homebrew, mac os x, nosql, whirr
By Phil Whelan on January 11, 2011
In this post I will outline what I believe to be the most important Apache Software Foundation projects for building scalable web sites and managing large volumes of data.
Posted in Data processing, Web Development | Tagged activemq, apache projects, apache software foundation, asf, cassandra, hadoop, hbase, high scalability, lucene, mahout, rabbitmq, solr, zeromq, zookeeper
By Phil Whelan on December 8, 2010
There are real data sources out there, but which one you choose depends on which technology you want to gain experience with. What matters is the experience with the technology, not what the data is about, although certain datasets pair better with certain technologies. Simulating the data is a second approach; you just need a clever way of generating and randomizing your fake data. Thirdly, you can use a hybrid approach: take real data and replay it on a loop, randomizing it as it goes through. Simulating the Twitter fire-hose should not be too hard, should it?
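As a rough illustration of the hybrid approach, here is a minimal Python sketch that replays a small seed dataset on an endless loop and perturbs each record as it passes through. The record fields (`id`, `user`, `followers`, `text`), the `replay_randomized` helper, and the 10% jitter are all hypothetical choices made for the example, not anything taken from a real feed.

```python
import itertools
import random


def replay_randomized(records, rng, jitter=0.1):
    """Yield records from `records` forever, perturbing each one slightly."""
    for record in itertools.cycle(records):
        noisy = dict(record)  # copy so the seed data is never mutated
        # Scale the numeric field by a random factor in [1 - jitter, 1 + jitter].
        factor = rng.uniform(1 - jitter, 1 + jitter)
        noisy["followers"] = max(0, round(record["followers"] * factor))
        # Assign a fresh synthetic id so replayed records are distinguishable.
        noisy["id"] = rng.getrandbits(64)
        yield noisy


# Hypothetical seed records standing in for a small sample of real data.
SEED = [
    {"id": 1, "user": "alice", "followers": 120, "text": "hello world"},
    {"id": 2, "user": "bob", "followers": 45, "text": "testing hadoop"},
]

# A seeded generator keeps the simulated stream reproducible between runs.
stream = replay_randomized(SEED, random.Random(42))
sample = list(itertools.islice(stream, 5))
```

Because the generator never ends, the consumer decides how much to pull (here, five records with `itertools.islice`). In practice you would feed the stream into whatever system you are trying to exercise.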
Posted in Data processing | Tagged amazon ec2, bbc, cassandra, data processing, dbpedia, ebs, freebase, hadoop, hdfs, large datasets, livedoor, lucene, mongodb, nosql, rackspace, solr, wikipedia, working with data