A convenient way to start working on Big Data concepts is to use the Hortonworks Data Platform (HDP) Sandbox; I am not aware of a similar sandbox from other Big Data vendors. The Sandbox packages all the major components of the Big Data ecosystem in a single virtual machine, which makes it easy to install. It is available in three formats - VirtualBox, VMware and Docker.
I started with the VirtualBox image, since VirtualBox is (in my opinion) a compact and simple virtualization option.
But I faced issues because the Sandbox is very large (over 10 GB) and needs at least 8 GB of RAM; the installation simply froze and I could not proceed.
I then decided to try it out in the AWS cloud, since I could spin up more powerful host machines. This worked. :)
In this post, I will explain how to get started.
What do we need to get started?
- AWS account
- Basic understanding of AWS concepts (I have glossed over the AWS instance setup to keep this post compact)
- An EC2 instance with at least 128 GB of disk storage and 16 GB of RAM (I used the m4.2xlarge instance type). The larger, the better.
How much time do we need?
The Hortonworks portal claims that you need just 15 minutes (!), but let's assume one hour.
Steps
Step 1 - Launch an EC2 instance with the Amazon Linux AMI. Ensure that the security group is configured to allow inbound access on ports 22 (SSH), 8080 (Ambari) and 8888 (the sandbox welcome page).
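If you prefer the AWS CLI, the setup looks roughly like this (a sketch only - the AMI ID, key name and security group ID below are placeholders you must replace with your own values):

aws ec2 authorize-security-group-ingress --group-id sg-YOURSGID --protocol tcp --port 22 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress --group-id sg-YOURSGID --protocol tcp --port 8080 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress --group-id sg-YOURSGID --protocol tcp --port 8888 --cidr 0.0.0.0/0
aws ec2 run-instances --image-id ami-YOURAMI --instance-type m4.2xlarge --key-name YOUR-KEY --security-group-ids sg-YOURSGID --block-device-mappings '[{"DeviceName":"/dev/xvda","Ebs":{"VolumeSize":128}}]'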
Step 2 - Install Docker on the instance. This post from AWS will help - http://docs.aws.amazon.com/AmazonECS/latest/developerguide/docker-basics.html
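For Amazon Linux, the commands in that guide boil down to the following (double-check against the guide itself in case it has changed):

sudo yum update -y
sudo yum install -y docker
sudo service docker start
sudo usermod -a -G docker ec2-user    # then log out and back in so the group change takes effect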
Step 3 - Modify the Docker config file "/etc/sysconfig/docker-storage" to add the line:
DOCKER_STORAGE_OPTIONS="--storage-opt dm.basesize=20G"
This step is needed so that Docker can import the Sandbox: by default, Docker's devicemapper storage driver caps an image at 10 GB, and the Hortonworks Sandbox is about 13 GB. Restart Docker afterwards (on Amazon Linux, "sudo service docker restart").
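To verify that the new base size took effect after the restart (assuming Docker is using the devicemapper storage driver, as it does by default on Amazon Linux), check the Docker info output; you should see a value of about 21 GB instead of the default ~10.7 GB:

docker info | grep -i "base device size"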
Step 4 - Download, unzip, load and start the Docker sandbox image from Hortonworks. This link has the details - http://hortonworks.com/hadoop-tutorial/hortonworks-sandbox-guide/#section_4
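The overall flow looks like this, using the archive name from my download (the exact download URL and the full docker run command, with all its port mappings, are in the guide, so I have not reproduced them here):

wget <download URL from the Hortonworks guide>
gunzip HDP_2.5_docker.tar.gz    # if the archive is gzipped
docker load < HDP_2.5_docker.tar
# then start the container with the docker run command / start script from the guide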
While the Hortonworks guide is well written, it omits the change needed to increase the base device size (Step 3 above).
Please try out this installation and post your comments. Thanks for reading.
Common Issues
If Step 3 above was not run, you will see the following error when loading the Docker image:
# docker load < HDP_2.5_docker.tar
b1b065555b8a: Loading layer 202.2 MB/202.2 MB
3901568415a3: Loading layer 10.37 GB/13.85 GB
ApplyLayer exit status 1 stdout: stderr: write /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.111.x86_64/jre/lib/amd64/server/libjvm.so: no space left on device
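If you hit this, note that just fixing the config file may not be enough: the devicemapper base size is set when Docker first initializes its storage. On a fresh instance with no Docker data worth keeping, one way to recover (this wipes all existing Docker images and containers) is:

sudo service docker stop
sudo rm -rf /var/lib/docker
sudo service docker start
docker load < HDP_2.5_docker.tar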