Hadoop for a N00b



  • I've been playing around with Hadoop recently, and just to get started, I decided I wanted to run four nodes in virtual machines, using Vagrant to start things up and Puppet to provision the machines.

    ok. You can stop laughing now.

    Please. Stop.

    The long and the short of the issues I ran into is that Hadoop is not
    made to be used behind a firewall, barely made to be run on machines
    with more than one network interface, and certainly not made to run
    on virtual machines. I have it up and running now, and I have run a
    wordcount mapreduce example on it, so I have solved all the issues
    with some creative hacking in Vagrant and Puppet, but here are my
    major gripes with Hadoop at the moment:

    • In order to add a file or start a job, you have to log in to one of
      the nodes and run a CLI command. There is no other way, as far as I
      can tell. The best practice seems to be to dedicate one node as an
      "edge node" and trust the people who log in to that node (see the
      first sketch after this list).

    • This also means that in order to upload a very large (set of)
      file(s) to your Hadoop cluster, you first need to upload it to your
      edge node (and make sure you have room for it there), and then push
      it onwards from there.

    • Hadoop seems ignorant of the fact that a data node it can reach
      internally on a 192.168.x.x network is not reachable from the
      outside. For example, in order to get a file out of Hadoop, I go to
      the name node and ask for the file. I then get a redirect to the
      data node(s) where the file can be found. This redirect points to
      the data node's funny name from /etc/hosts, not to the canonical
      name of the host on which I am running the VM (see the WebHDFS
      sketch after this list). As before, the best practice is to use the
      edge node to get files out of Hadoop, and then take them from the
      edge node instead.

    • So far I've counted between 10 and 15 ports that must be open for a
      working Hadoop cluster. Not all are necessary on all types of
      nodes, and this is very much a minimal set; I can easily find 20-30
      more ports that might also need to be open. As long as I run the
      cluster on a private network this is no problem, but those first
      10-15 ports really need to be open to the outside world if I want
      to run nodes on, e.g., another machine or network. Again, the way
      around this is the wonderful edge node with SSH/CLI access.

    • A partially related problem is that each data node expects to open
      ports 50010 (dfs.datanode.address) and 50020
      (dfs.datanode.ipc.address). I can thus have one data node in my
      virtual machine on one host, and this works fine and is visible to
      the outside world with port forwarding and some creative hacking of
      the right configuration file. The moment I want to add another data
      node I am SOL, because I need to forward these ports to different
      host ports (either hard-coded in my VM configuration, or
      auto-assigned by Vagrant, which creates another nightmare of
      finding out my host IP and port from inside the guest). But then I
      also need to edit the Hadoop config files and create a unique
      configuration for each data node (see the Vagrantfile sketch after
      this list).
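
    To make the first two gripes concrete, the edge-node workflow boils
    down to something like this, wrapped in a Vagrant shell provisioner
    since that is how I drive everything else (a minimal sketch; the
    HADOOP_HOME location, the input files, and the examples jar version
    are all assumptions):

        # Sketch: push input into HDFS from the edge guest and start the
        # stock wordcount job from the same place, because there is no
        # remote way to do either.
        config.vm.define "edge" do |edge|
            edge.vm.provision "shell", privileged: false, inline: <<-SHELL
                export HADOOP_HOME=/vagrant/hadoop
                # Upload the input to HDFS from the edge node...
                $HADOOP_HOME/bin/hdfs dfs -mkdir -p /user/vagrant/input
                $HADOOP_HOME/bin/hdfs dfs -put /vagrant/books/*.txt /user/vagrant/input
                # ...and kick off the wordcount example from here as well.
                $HADOOP_HOME/bin/hadoop jar \
                    $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
                    wordcount /user/vagrant/input /user/vagrant/output
            SHELL
        end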
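
    The redirect dance in the third gripe looks like this from a client's
    point of view (a sketch against the WebHDFS REST API; the host name,
    port, and file path are assumptions, and WebHDFS has to be enabled):

        # Sketch: a WebHDFS read is a two-step affair. The name node
        # answers with a 307 redirect whose Location header contains the
        # data node's own idea of its hostname.
        require "net/http"

        uri = URI("http://namenode:50070/webhdfs/v1/user/vagrant/output/part-r-00000?op=OPEN")
        res = Net::HTTP.get_response(uri)
        if res.is_a?(Net::HTTPRedirection)
            # Prints the data node's funny /etc/hosts name, which is
            # useless from outside the 192.168.x.x network.
            puts res["Location"]
        end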
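
    And this is the corner the last gripe paints me into (a Vagrantfile
    sketch; the node names and the +100 port offset are assumptions):

        # Sketch: every data node wants host-side 50010/50020, so each
        # additional node needs its own offset here *and* a matching edit
        # in its own copy of the Hadoop config files.
        (1..2).each do |i|
            config.vm.define "datanode#{i}" do |node|
                # dfs.datanode.address
                node.vm.network "forwarded_port", guest: 50010, host: 50010 + (i - 1) * 100
                # dfs.datanode.ipc.address
                node.vm.network "forwarded_port", guest: 50020, host: 50020 + (i - 1) * 100
            end
        end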

    I think the bottom line is that setting up or running Hadoop in more
    than a trivial configuration is non-trivial. As you can see, I've got
    solutions to most of the issues above. It's just that they appear to
    me to be really ugly hacks, and they all boil down to the fact that
    Hadoop is not meant for "normal" users. @blakeyrat must have rubbed
    off on me, because I think this sucks elephant balls!

    What have I missed? I assume there is some shortcut early on that I
    failed to take, so I get to go the scenic route instead, but where
    did I go wrong (INB4: (hadoop|vagrant|puppet)+)?


  • Discourse touched me in a no-no place

    @Mikael_Svahnberg said in Hadoop for a N00b:

    The long and the short of the issues I ran into is that Hadoop is not made to be used behind a firewall, barely made to be run on machines with more than one network interface, and certainly not made to run on virtual machines.

    That's exactly it. It's aimed at workloads where running virtualised is a poor idea anyway, and where you've got dedicated physical nodes and networking in place. Anything less and you're likely to find that the overheads of Hadoop exceed any benefit, and the overheads are quite significant. Even medium-sized workloads can probably be done faster on a beefy workstation or server than on a Hadoop cluster, unless you've got some sort of monster disk system available that talks native HDFS (I think EMC were selling this sort of thing, at least a few years ago). If you're able to offload the storage like that, Hadoop's performance can become a lot better, but those disk systems are expensive.

    What's with the bonus line breaks?


  • Winner of the 2016 Presidential Election

    @Mikael_Svahnberg said in Hadoop for a N00b:

    Hadoop is not made to be used

    I fully agree 🚎


  • Discourse touched me in a no-no place

    @asdf Oh, I wouldn't go that far. It's probably a good fit for applications that also use Mongo properly…



  • @dkf said in Hadoop for a N00b:

    @Mikael_Svahnberg said in Hadoop for a N00b:

    The long and the short of the issues I ran into is that Hadoop is not made to be used behind a firewall, barely made to be run on machines with more than one network interface, and certainly not made to run on virtual machines.

    That's exactly it. It's aimed at workloads where running virtualised is a poor idea anyway, and where you've got dedicated physical nodes and networking in place. Anything less and you're likely to find that the overheads of Hadoop exceed any benefit, and the overheads are quite significant. Even medium-sized workloads can probably be done faster on a beefy workstation or server than on a Hadoop cluster, unless you've got some sort of monster disk system available that talks native HDFS (I think EMC were selling this sort of thing, at least a few years ago). If you're able to offload the storage like that, Hadoop's performance can become a lot better, but those disk systems are expensive.

    I wasn't intending to put much load on it while it was running in a VM. My idea is that I want a single-command way to start things up (using Vagrant and Puppet), get everything working properly and tested locally, and then just add the --provider=digital_ocean flag to vagrant up. I know this is stubborn of me, but I really don't want to manually fire up machines and provision them, and I want a transparent way to decide where to run the cluster. Is that too much to ask? Hmm? Hmm?
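
    For reference, the whole provider switch is roughly this much
    Vagrantfile (a sketch assuming the vagrant-digitalocean plugin; the
    token, image, region, and size values are placeholders):

        # Sketch: one guest, two providers. `vagrant up` uses VirtualBox;
        # `vagrant up --provider=digital_ocean` sends the same guest to
        # the cloud instead. All values below are placeholders.
        config.vm.define "namenode" do |node|
            node.vm.provider "virtualbox" do |vb|
                vb.memory = 2048
            end
            node.vm.provider "digital_ocean" do |provider, override|
                override.vm.box = "digital_ocean"
                override.ssh.private_key_path = "~/.ssh/id_rsa"
                provider.token = ENV["DO_TOKEN"]   # API token from the environment
                provider.image = "ubuntu-16-04-x64"
                provider.region = "ams2"
                provider.size = "2gb"
            end
        end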

    What's with the bonus line breaks?

    I typed it up in an e-mail to my colleague first, and then decided that I might as well put it out here to ask for advice as well.


  • area_pol

    @Mikael_Svahnberg said in Hadoop for a N00b:

    I want a single-command way to start things up, get everything working properly and tested locally

    If you are testing on a local machine, maybe it would be easier to create a set of Hadoop containers with docker-compose? Then you could specify the containers and the network between them in one file and start them all together.


  • I survived the hour long Uno hand

    @Adynathos You can also use Vagrant to spin up a cluster of Docker containers inside a host VM if you don't want to install Docker on your workstation. I've done it before, but the write-up I did is published under my legal name, so I won't link to it here.


  • Considered Harmful

    @Mikael_Svahnberg said in Hadoop for a N00b:

    more than a trivial configuration is non-trivial.

    You don't say.


    Filed under: It's not trivial when it's not trivial.


  • area_pol

    @Yamikuronue said in Hadoop for a N00b:

    Vagrant to spin up a cluster of Docker containers inside a host VM if you don't want to install Docker on your workstation

    I suggested Docker instead of Vagrant because I found installing and running Docker on my workstation easy, and it does not need any VMs, so it should be faster to run. I've never tried Vagrant, though, so I can't compare directly.


  • I survived the hour long Uno hand

    @Adynathos Vagrant makes it trivial to spin up a VM, so I tend to use it for anything I'm not sure I want living on my hard drive. If you use the Docker provider, it spins up a host VM, installs Docker on it, and then deploys your specified containers to it using Docker.


  • area_pol

    @Yamikuronue Docker containers are supposed to be isolated from the rest of the OS too, so theoretically no VM is needed.
    Not that I suggest you trust that isolation when you have important things on that machine :P


  • I survived the hour long Uno hand

    @Adynathos Right, but Docker itself needs to be installed on the machine 😉
    Plus I run Windows normally.


  • area_pol

    @Yamikuronue said in Hadoop for a N00b:

    Plus I run Windows normally.

    Ah, then you need a VM for Docker anyway, right?


  • I survived the hour long Uno hand

    @Adynathos I think so? I think there's some really experimental stuff that might make Docker run on Windows, but it's easier to just spin up a VM.

    It ends up being as simple to configure as:

    	config.vm.define "hub" do |hub|
    		# Configure the Docker provider for Vagrant
    		hub.vm.provider "docker" do |docker|
    
    			# Define the location of the Vagrantfile for the host VM
    			# Comment out this line to use default host VM
    			docker.vagrant_vagrantfile = "dockerHost/Vagrantfile"
    
    			# Specify the Docker image to use
    			docker.image = "selenium/hub"
    
    			# Specify port mappings
    			# If omitted, no ports are mapped!
    			docker.ports = ['80:80', '443:443', '4444:4444']
    
    			# Specify a friendly name for the Docker container
    			docker.name = 'selenium-hub'
    		end
    	end
    

    and then to open a second container linked to the first one:

    	config.vm.define "chrome" do |chrome|
    		# Configure the Docker provider for Vagrant
    		chrome.vm.provider "docker" do |docker|
    
    			# Define the location of the Vagrantfile for the host VM
    			# Comment out this line to use default host VM that is
    			# based on boot2docker
    			docker.vagrant_vagrantfile = "dockerHost/Vagrantfile"
    
    			# Specify the Docker image to use
    			docker.image = "selenium/node-chrome:2.53.0"
    
    			# Specify a friendly name for the Docker container
    			docker.name = 'selenium-chrome'
    
    			docker.link('selenium-hub:hub')
    		end
    	end
    

  • area_pol

    @Yamikuronue Good to know, looks like Docker but for VMs :)



  • @Mikael_Svahnberg hoopdoop?



  • @Adynathos said in Hadoop for a N00b:

    @Yamikuronue Good to know, looks like Docker but for VMs :)

    The reason I went with Vagrant is that it takes one or two lines of configuration, and then I can choose with a single CLI flag, when I start my guests, whether they should run locally or on a cloud provider.

    Also, Vagrant mounts the directory where the Vagrantfile is located under /vagrant/, which meant that I could install Hadoop there, with all the configurations and everything, and it was immediately accessible to all guests. Incidentally, that's also how I pass /etc/hosts around.


  • I survived the hour long Uno hand

    @Mikael_Svahnberg said in Hadoop for a N00b:

    Also, Vagrant mounts the directory where the Vagrantfile is located under /vagrant/, which meant that I could install Hadoop there, with all the configurations and everything, and it was immediately accessible to all guests

    :doing_it_wrong:

    You're meant to script the installation of Hadoop and stick that into your Vagrantfile so you never have to do it again, even when you move to the cloud.



  • @Yamikuronue said in Hadoop for a N00b:

    @Mikael_Svahnberg said in Hadoop for a N00b:

    Also, Vagrant mounts the directory where the Vagrantfile is located under /vagrant/, which meant that I could install Hadoop there, with all the configurations and everything, and it was immediately accessible to all guests

    :doing_it_wrong:

    You're meant to script the installation of Hadoop and stick that into your Vagrantfile so you never have to do it again, even when you move to the cloud.

    👀 :seye: erm. Of course :caughtwithmypantsdown: ? It seems like it's just "download the right tarball and unzip it somewhere", so that would be easy, and it makes sense for migrating to the cloud. The tricky bit is automating "hand-edit six XML config files and put things in there that really should be there by default", but I guess I can just overwrite those with my own copies.
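
    Something like this in the provisioning step, I suppose (a sketch;
    the mirror URL, the Hadoop version, and the paths are assumptions):

        # Sketch: fetch and unpack the tarball once, then clobber the
        # stock XML configs with pre-edited copies kept next to the
        # Vagrantfile (and therefore visible to every guest under
        # /vagrant/).
        config.vm.provision "shell", inline: <<-SHELL
            cd /opt
            wget -nc https://archive.apache.org/dist/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz
            tar -xzf hadoop-2.7.3.tar.gz
            cp /vagrant/conf/*.xml /opt/hadoop-2.7.3/etc/hadoop/
        SHELL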

    Come to think of it, I'd better revisit my /etc/hosts solution as well. I probably need to do a double pass, rather than my clunky "alias /etc/hosts to /vagrant/hosts", to make sure all entries are in there -- even the ones added after this machine is provisioned.
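
    The second pass could be as dumb as appending whatever is missing (a
    sketch; it assumes the shared hosts file lives at /vagrant/hosts, as
    above):

        # Sketch: merge the shared entries into the real /etc/hosts
        # instead of aliasing the whole file, skipping lines that already
        # exist so that re-provisioning stays idempotent.
        config.vm.provision "shell", inline: <<-SHELL
            while read -r line; do
                grep -qxF "$line" /etc/hosts || echo "$line" >> /etc/hosts
            done < /vagrant/hosts
        SHELL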


  • Impossible Mission - B

    @Yamikuronue said in Hadoop for a N00b:

    @Adynathos I think so? I think there's some really experimental stuff that might make Docker run on Windows, but it's easier to just spin up a VM.

    It ends up being as simple to configure as:

    	config.vm.define "hub" do |hub|
    		# Configure the Docker provider for Vagrant
    		hub.vm.provider "docker" do |docker|
    
    			# Define the location of the Vagrantfile for the host VM
    			# Comment out this line to use default host VM
    			docker.vagrant_vagrantfile = "dockerHost/Vagrantfile"
    
    			# Specify the Docker image to use
    			docker.image = "selenium/hub"
    
    			# Specify port mappings
    			# If omitted, no ports are mapped!
    			docker.ports = ['80:80', '443:443', '4444:4444']
    
    			# Specify a friendly name for the Docker container
    			docker.name = 'selenium-hub'
    		end
    	end
    

    and then to open a second container linked to the first one:

    	config.vm.define "chrome" do |chrome|
    		# Configure the Docker provider for Vagrant
    		chrome.vm.provider "docker" do |docker|
    
    			# Define the location of the Vagrantfile for the host VM
    			# Comment out this line to use default host VM that is
    			# based on boot2docker
    			docker.vagrant_vagrantfile = "dockerHost/Vagrantfile"
    
    			# Specify the Docker image to use
    			docker.image = "selenium/node-chrome:2.53.0"
    
    			# Specify a friendly name for the Docker container
    			docker.name = 'selenium-chrome'
    
    			docker.link('selenium-hub:hub')
    		end
    	end
    

    I would hardly call it "simple" when

    1. you have to use Ruby :wtf:
    2. it requires nested closures! :wtf: :doing_it_wrong: :wtf: :doing_it_wrong:
