29 responses to “Map-Reduce With Ruby Using Hadoop”

  1. The Apache Projects – The Justice League Of Scalability | Phil Whelan's Blog

    [...] and manages the running of distributed Map-Reduce jobs. In a previous post I gave an example using Ruby with Hadoop to perform Map-Reduce [...]

  2. andy

    kudos on the good article

    I was able to follow along and everything worked (after correcting for my typos).

    One thing that wasn’t clear: you can be in any directory when working with the Ruby and bash files. I had originally done this in the hadoop folder.

    thanks

  3. andy

    Hi Phil,

    one more question,

    is there a way to use s3 buckets for the input and output?

  4. Navin

    Phil – thanks very much for taking the time to write this up! I am new to AWS and EC2 and am running into a problem at the point where I first try and launch clusters – wondering if you can please help? I have my Access Key ID and Secret Access Key in my hadoop.properties, and the output I get is as follows:

    
    [~/src/cloudera/whirr-0.1.0+23]$ bin/whirr launch-cluster --config hadoop.properties                                                                                                     rvm:ruby-1.8.7-p299
    Launching myhadoopcluster cluster
    Exception in thread "main" com.google.inject.CreationException: Guice creation errors:
    
    1) No implementation for java.lang.String annotated with @com.google.inject.name.Named(value=jclouds.credential) was bound.
      while locating java.lang.String annotated with @com.google.inject.name.Named(value=jclouds.credential)
        for parameter 2 at org.jclouds.aws.filters.FormSigner.<init>(FormSigner.java:91)
      at org.jclouds.aws.config.AWSFormSigningRestClientModule.provideRequestSigner(AWSFormSigningRestClientModule.java:66)
    
    1 error
    	at com.google.inject.internal.Errors.throwCreationExceptionIfErrorsExist(Errors.java:410)
    	at com.google.inject.internal.InternalInjectorCreator.initializeStatically(InternalInjectorCreator.java:166)
    	at com.google.inject.internal.InternalInjectorCreator.build(InternalInjectorCreator.java:118)
    	at com.google.inject.InjectorBuilder.build(InjectorBuilder.java:100)
    	at com.google.inject.Guice.createInjector(Guice.java:95)
    	at com.google.inject.Guice.createInjector(Guice.java:72)
    	at org.jclouds.rest.RestContextBuilder.buildInjector(RestContextBuilder.java:141)
    	at org.jclouds.compute.ComputeServiceContextBuilder.buildInjector(ComputeServiceContextBuilder.java:53)
    	at org.jclouds.aws.ec2.EC2ContextBuilder.buildInjector(EC2ContextBuilder.java:101)
    	at org.jclouds.compute.ComputeServiceContextBuilder.buildComputeServiceContext(ComputeServiceContextBuilder.java:66)
    	at org.jclouds.compute.ComputeServiceContextFactory.buildContextUnwrappingExceptions(ComputeServiceContextFactory.java:72)
    	at org.jclouds.compute.ComputeServiceContextFactory.createContext(ComputeServiceContextFactory.java:114)
    	at org.apache.whirr.service.ComputeServiceContextBuilder.build(ComputeServiceContextBuilder.java:41)
    	at org.apache.whirr.service.hadoop.HadoopService.launchCluster(HadoopService.java:84)
    	at org.apache.whirr.service.hadoop.HadoopService.launchCluster(HadoopService.java:61)
    	at org.apache.whirr.cli.command.LaunchClusterCommand.run(LaunchClusterCommand.java:61)
    	at org.apache.whirr.cli.Main.run(Main.java:65)
    	at org.apache.whirr.cli.Main.main(Main.java:91)
    

    PS: The rvm:ruby-1.8.7-p299 is part of my prompt and not in the output.

    Thanks in advance for any pointers on getting this resolved.

    Regards,
    Navin

  5. Adrian Cole

    Hi, Navin.

    There’s probably an issue with the property syntax inside your hadoop.properties. Have a look at http://incubator.apache.org/projects/whirr.html and https://cwiki.apache.org/WHIRR/configuration-guide.html

    If you still have issues, send your query to the user list and you’ll get on track quickly!

    whirr-user@incubator.apache.org

    Cheers,
    -Adrian

  6. Navin

    Um, yes, sorry! A problem with “whirr.credentials” getting truncated to “whirr.cred” while copying keys across. How embarrassing!

    Apologies – and thanks for taking the time to reply …

    Navin

  7. cjbottaro

    Great post, thank you very much for that.

    Next question though… how do you use Cassandra for input/output (while still using Ruby)?

    I know you can run Hadoop jobs against Cassandra in Java with the InputFormat they provide, but how to do so using streaming?

  8. Sid

    What a great post! Really clearly written and keeps to its promise of being able to demonstrate through repeatable steps. Particularly liked the cues for tea breaks for the lengthy download/install steps. Nice work!

  9. Allen

    When I wanted to fire up a customized EC2 instance, I added the following parameters to hadoop.properties:

    whirr.image-id= us-west-1/ami-**********
    jclouds.ec2.ami-owners=******************
    whirr.hardware-id= m1.large
    whirr.location-id=us-west-1

    But it doesn’t work. Any thoughts? Thanks.

  10. Adrian Cole

    I think you would be best taking this to the whirr user list, where you can let us know what didn’t work:
    https://cwiki.apache.org/confluence/display/WHIRR/MailingLists

    There are recipes in the latest version of whirr here, as well:
    http://svn.apache.org/repos/asf/incubator/whirr/trunk/recipes/hadoop-ec2.properties

    Unrelated, but if you don’t mind trying 0.5.0, voting it up can get the new rev, including the recipes, released!
    http://people.apache.org/~tomwhite/whirr-0.5.0-incubating-candidate-1/

    Cheers,
    A

    1. Allen

      Cool, I will try both: posting the question and the new Whirr. I will update here later on if I get a solution. Thx :)

    2. Allen

      I tried whirr 0.5.0, and it still doesn’t work.

      It worked well when my hadoop.properties was as below:

      whirr.service-name=hadoop
      whirr.cluster-name=myhadoopcluster
      whirr.instance-templates=1 jt+nn,1 dn+tt
      whirr.provider=ec2
      whirr.identity=
      whirr.credential=
      whirr.private-key-file=${sys:user.home}/.ssh/id_rsa
      whirr.public-key-file=${sys:user.home}/.ssh/id_rsa.pub
      whirr.hardware-id= m1.large
      whirr.location-id=us-west-1
      whirr.hadoop-install-runurl=cloudera/cdh/install
      whirr.hadoop-configure-runurl=cloudera/cdh/post-configure

      But once I added two more lines to the hadoop.properties file, it went wrong:
      whirr.image-id= us-west-1/ami-***** (my ami)
      jclouds.ec2.ami-owners=(my owner id)

      I have posted a question on the Whirr forum to see if I can get a solution. I will update here if I get anything. Thx

  11. Jack Veenstra

    There’s a bug in your reduce script. You output the total only when you get a new key. So the last key’s total will never be included in the output.

  12. threecuptea

    I started using EC2 yesterday and I got this working today thanks to your article. However, it wasn’t without twists and turns.
    1. I got a “http://whirr.s3.amazonaws.com/0.3.0-cdh3u1/util/configure-hostnames not found” error when I ran ‘whirr launch-cluster’. They haven’t put configure-hostnames for cdh3u1 in S3 yet. I worked around it by adding --run-url-base http://whirr.s3.amazonaws.com/0.3.0-cdh3u0/util. I got this advice from http://www.cloudera.com/blog/2011/07/cdh3u1-released/

    2. I followed the instructions to set up the Hadoop client on the host initiating whirr and got the following error when it tried to connect to the name node.
    11/08/07 01:03:48 INFO ipc.Client: Retrying connect to server: ec2-107-20-60-75.compute-1.amazonaws.com/10.116.78.198:8020. Already tried 9 time(s).
    Bad connection to FS. command aborted. exception: Call to ec2-107-20-60-75.compute-1.amazonaws.com/10.116.78.198:8020 failed on local exception: java.net.SocketException: Connection refused

    It has to do with security. I checked the security group jclouds#myhadoopcluster3#us-east-1. It allows inbound on 80, 50070, 50030 only from the host initiating whirr launch-cluster and allows inbound on 8020, 8021 only from the name node host. I added rules to allow inbound on 8020, 8021 from the host initiating whirr and applied the rule change. That didn’t help. In my case, the host initiating whirr launch-cluster is an EC2 instance too.

    3. I can ssh to the cluster hosts from the host initiating whirr without any key. iptables is empty and SELinux is disabled. The network rules seem to be set up outside the Linux box. No luck.

    4. I ended up transferring the files to the name node and running the map-reduce job there. The Whirr script creates /user/hive/warehouse but no /user/ec2-user, so you need to create that directory and an input sub-directory. You might also add -jobconf mapred.reduce.tasks=1 since the default is 10 in this case.

    Thanks.

  13. Harit

    I guess the reducer is missing some code, because when it completes it still has to write the last result to the output.
    I ran the same logic in Java and then in Ruby using Hadoop and realized that the last entry was missing from the result data, so I added the following line at the very end of reducer.rb:

    puts prev_key + separator + key_total.to_s

    and it worked.
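
    For readers following along, here is a minimal sketch of how the reducer loop and that final flush could fit together. The names prev_key, separator and key_total come from the line above; the method name reduce_counts and the "key<TAB>count" input format are assumptions for illustration, not the post's actual reducer.rb.

```ruby
# Sketch of a streaming reducer that sums counts per key.
# Assumption: each input line looks like "key<TAB>count" and the lines
# arrive grouped by key, as Hadoop Streaming delivers them to a reducer.
# reduce_counts is a hypothetical name, not taken from the original post.
def reduce_counts(lines, separator = "\t")
  output = []
  prev_key = nil
  key_total = 0

  lines.each do |line|
    key, value = line.chomp.split(separator, 2)
    next if key.nil? || value.nil? # skip malformed lines

    # A new key means the previous key is complete, so emit its total.
    if prev_key && key != prev_key
      output << prev_key + separator + key_total.to_s
      key_total = 0
    end

    prev_key = key
    key_total += value.to_i
  end

  # The final flush: without this the last key's total is never emitted.
  output << prev_key + separator + key_total.to_s if prev_key
  output
end

# In a reducer script this could be driven with:
#   reduce_counts(STDIN).each { |out| puts out }
```

    The guard on prev_key keeps the flush from printing anything when the reducer receives no input at all.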

  14. Daniel

    Hi Phil,

    I am using Hadoop MapReduce to predict the secondary structure of a given long sequence. The idea is that I have chunks (segments) of a sequence written into a single input file, where each line is one segment. I have used one of the programs for secondary structure prediction as my mapper code (Hadoop Streaming).
    The mapper ran successfully and produces the predicted structures in dot-bracket notation. I want to use a simple reducer that glues all the outputs from the mapper together in order.
    For example, if my input was like

    ….
    And my mapper output is a predicted structure but not in order

    What I am looking for is reducer code that sorts, glues and outputs in a form similar to the following:
    ……

    Any help…Thanks
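
    One possible shape for such a reducer, as a rough sketch only: it assumes the mapper is changed to emit each segment's position as the key, so every line reads "segment_index<TAB>structure". That line format and the name glue_segments are hypothetical, since the actual input and output were not shown above.

```ruby
# Hypothetical helper: glue mapper outputs back together in input order.
# Assumption: each line is "segment_index<TAB>structure", where
# segment_index is the integer position of the segment in the input file.
def glue_segments(lines, separator = "\t")
  segments = lines.map do |line|
    index, structure = line.chomp.split(separator, 2)
    [Integer(index), structure.to_s]
  end

  # Sort numerically here: streaming compares keys as text by default,
  # so "10" would otherwise arrive before "2".
  segments.sort_by { |index, _| index }
          .map { |_, structure| structure }
          .join
end

# In a reducer script this could be driven with:
#   puts glue_segments(STDIN)
```

    This only produces one ordered result if a single reducer sees all the segments, for example by running the job with -jobconf mapred.reduce.tasks=1.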

  15. Dr. SHyam Sarkar

    Hello,

    We have the following properties set:

    whirr.service-name=hadoop
    whirr.cluster-name=myhadoopcluster
    whirr.instance-templates=1 jt+nn,1 dn+tt
    whirr.provider=ec2
    whirr.credential=mYar/KSbx+UL+nqGr9hSgGHIOqXC9tjNcuO9UwF/
    whirr.identity=AKIAJCDTYGREJYIECQZA
    whirr.private-key-file=${sys:user.home}/.ssh/id_rsa
    whirr.public-key-file=${sys:user.home}/.ssh/id_rsa.pub
    whirr.hadoop-install-runurl=cloudera/cdh/install
    whirr.hadoop-configure-runurl=cloudera/cdh/post-configure
    whirr.image-id=ami-3bc9997e
    whirr.hardware-id=i-bffa23f8
    whirr.location-id=us.west-1c

    But we are getting the following error:

    [ec2-user@ip-10-170-103-243 ~]$ ./whirr-0.3.0-cdh3u1/bin/whirr launch-cluster --config hadoop.properties --run-url-base http://whirr.s3.amazonaws.com/0.3.0-cdh3u0/util
    Bootstrapping cluster
    Configuring template
    Exception in thread "main" java.util.NoSuchElementException
    at com.google.common.collect.AbstractIterator.next(AbstractIterator.java:147)
    at com.google.common.collect.Iterators.find(Iterators.java:679)
    at com.google.common.collect.Iterables.find(Iterables.java:555)
    at org.jclouds.compute.domain.internal.TemplateBuilderImpl.locationId(TemplateBuilderImpl.java:492)
    at org.apache.whirr.service.jclouds.TemplateBuilderStrategy.configureTemplateBuilder(TemplateBuilderStrategy.java:41)
    at org.apache.whirr.service.hadoop.HadoopTemplateBuilderStrategy.configureTemplateBuilder(HadoopTemplateBuilderStrategy.java:31)
    at org.apache.whirr.cluster.actions.BootstrapClusterAction.buildTemplate(BootstrapClusterAction.java:144)
    at org.apache.whirr.cluster.actions.BootstrapClusterAction.doAction(BootstrapClusterAction.java:94)
    at org.apache.whirr.cluster.actions.ScriptBasedClusterAction.execute(ScriptBasedClusterAction.java:74)
    at org.apache.whirr.service.Service.launchCluster(Service.java:71)
    at org.apache.whirr.cli.command.LaunchClusterCommand.run(LaunchClusterCommand.java:61)
    at org.apache.whirr.cli.Main.run(Main.java:65)
    at org.apache.whirr.cli.Main.main(Main.java:91)

    Can we get any help? What should we do?

    Thanks,
    S.Sarkar

  16. Joao Salcedo

    Nice tutorial, everything worked just how it should!!!

    Just a small question: what if I want to connect to the instance? Where can I find the key to connect to it?

    Cheers,

    Joao

  17. Andrii Vozniuk

    Phil, thanks for the detailed tutorial!
    I had my custom MapReduce application up and running on an EC2 cluster in just a few hours.
    I reproduced the steps with whirr-0.7.1 and hadoop-0.20.2-cdh3u4.

  18. bodla dharani kumar

    hi to all,
    Good Morning,
    I had a set of 22 documents in the form of text files of size 20 MB, loaded into HDFS. When running a Hadoop streaming map/reduce job from the command line, it took 4 mins 31 secs to process the 22 text files. How can I speed up the map/reduce process so that these text files complete in 5-10 seconds?
    What changes do I need to make in Ambari Hadoop?
    The setup has cores = 2
    Allocated 2 GB for YARN, and 400 GB for HDFS
    default virtual memory for a job map-task = 341 MB
    default virtual memory for a job reduce-task = 683 MB
    Map-side sort buffer memory = 136 MB
    Also, when running a job, HBase reports an error with the Region server going down, and the Hive metastore status service check times out.

    Note:[hdfs@s ~]$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming-2.2.0.2.0.6.0-76.jar -D mapred.map.tasks=2 -input hdfs:/apps/*.txt -output /home/ambari-qa/8.txt -mapper /home/coartha/mapper.py -file /home/coartha/mapper.py -reducer /home/coartha/reducer.py -file /home/coartha/reducer.py

    Thanks & regards,
    Bodla Dharani Kumar,