Software exercises and notes: November 2016

Tuesday, November 15, 2016

Course Review - Apache Spark with Scala

I recently completed a course on Apache Spark with Scala. It was a wonderful learning experience and gave me the skills necessary to complete a prototype in Big Data.

Where is this course?

This is a course on Udemy. https://www.udemy.com/apache-spark-with-scala-hands-on-with-big-data/

What does it cover?

This course covers a wide range of topics related to Apache Spark. While the Spark framework supports several languages, this course is based on the Scala language. There is an introductory section that explains Scala basics, followed by sections on the Spark libraries. It gives only a brief introduction to the Spark architecture.

The hands-on nature of the lessons is great. No lesson exceeds 15 minutes, which makes learning easier even if you have only a short time each day.

What is the cost?

The cost varies. It is not free.

How long is this course?

The course duration is 7.5 hours. I took about 30 hours (learning 1 hour every day) to complete the course.

What are the prerequisites?

It would be ideal for the learner to have at least 4 years of programming experience and database knowledge before starting this course. Basic familiarity with UNIX commands or another CLI would help. Basic knowledge of Java/Scala programming would be helpful, though there are a few introductory lessons on Scala.

How will it benefit the learner?

You will have enough knowledge to become an intermediate level Spark Scala developer in Big Data projects.

Friday, November 4, 2016

Course review - MongoDB - Udemy

I completed a course about MongoDB ("MongoDB Essentials - Understand the Basics of MongoDB"). This course covered the basics of MongoDB in a very simple and easy manner.

Where is this course?

This is a Udemy course. https://www.udemy.com/mongodb-essentials/learn/v4/overview

What does it cover?

Introduction to MongoDB.

What is the cost?

It is free.

How long is this course?

About 35 minutes. It would take about 2 hours to complete, if the learner runs every step in the tutorial.

What are the prerequisites?

You will need a computer with admin privileges and an internet connection.

How will it benefit the learner?

After completing this course, the learner will become equipped to work on entry level tasks in MongoDB.

Thursday, November 3, 2016

Course Review - GitHub Introduction - Udemy

I completed a course about Git and GitHub. This course was brilliant in its simplicity and effectiveness in teaching Git. The course focuses more on practical usage than architecture, which will help beginners to learn Git easily.

Where is this course?

This is a Udemy course. https://www.udemy.com/short-and-sweet-get-started-with-git-and-github-right-now/learn/v4/overview

What does it cover?

Basics of Git and GitHub.

What is the cost?

It is free.

How long is this course?

About 35 minutes. It would take about 2 hours to complete, if the learner practices every step from the tutorial.

What are the prerequisites?

You will need a computer with admin privileges and an internet connection.

How will it benefit the learner?

The learner can become a beginner level Git user and can contribute to software projects.

Getting started with Hortonworks Data Platform (HDP) sandbox in Amazon Web Services (AWS)

A convenient way to start working on Big Data concepts is to use the Hortonworks (HDP) Sandbox. I am not aware of a similar sandbox from other Big Data vendors. This sandbox packages the various components of the Big Data ecosystem in a single VM, so it is easy to install. The Sandbox is available in 3 compact formats - VirtualBox, VMware and Docker.

I started working with VirtualBox, since that is (in my opinion) a compact and simple container.
But I faced issues because the Sandbox is very large (>10 GB) and it needs at least 8 GB RAM. The installation just froze and I could not proceed.

I then decided to try it out in the AWS cloud, since I could spin up more powerful host machines. This worked. :)

In this post, I will explain how to get started.

What do we need to get started?

AWS account
Basic understanding of AWS concepts (I have glossed over the AWS instance setup, to keep this post compact)
An EC2 instance with at least 128 GB disk storage and 16 GB RAM. (I used an instance type of m4.2xlarge). The larger, the better.

How much time do we need?

The Hortonworks portal claims that you need just 15 minutes (!), but let's assume 1 hour.

Steps

Step 1 - Launch the EC2 instance with the Amazon Linux AMI. Ensure that the security group allows inbound access to ports 8080, 8888 and 22.
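
The security group rules from Step 1 can also be added from the AWS CLI instead of the console. The following is only a sketch under stated assumptions: it presumes the AWS CLI is installed and configured, the function name open_sandbox_ports is hypothetical, and the group ID and CIDR range are placeholder values that must be replaced with your own.

```shell
# Sketch: open TCP ports on an existing security group.
# SG_ID and CIDR are hypothetical placeholders, not values from this post.
SG_ID="${SG_ID:-sg-0123456789abcdef0}"
CIDR="${CIDR:-203.0.113.0/24}"   # ideally restricted to your own IP range

open_sandbox_ports() {
  local port
  for port in "$@"; do
    # With DRY_RUN=1 each command is printed instead of executed.
    ${DRY_RUN:+echo} aws ec2 authorize-security-group-ingress \
      --group-id "$SG_ID" --protocol tcp --port "$port" --cidr "$CIDR"
  done
}
```

Call the function with the ports listed in Step 1 as arguments. Running it once with DRY_RUN=1 is a cheap way to review the commands before they change anything.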

Step 2 - Install Docker in AWS. This post from AWS will help - http://docs.aws.amazon.com/AmazonECS/latest/developerguide/docker-basics.html

Step 3 - Modify the Docker config file "/etc/sysconfig/docker-storage" to add the line:

DOCKER_STORAGE_OPTIONS="--storage-opt dm.basesize=20G"

This step is needed so that Docker can import the Sandbox: by default, a Docker container can have a maximum size of 10 GB, while the Hortonworks Sandbox is about 13 GB. Restart Docker (using sudo) after making this change.
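
The edit in Step 3 can be scripted. The helper below is a minimal sketch and not part of the Hortonworks guide: the function name set_docker_basesize is hypothetical, and the optional second argument exists only so the function can be tried on a scratch file first.

```shell
# Sketch: add the base-size option to a Docker storage config file.
# The default path is the file named in Step 3.
set_docker_basesize() {
  local size="$1"
  local conf="${2:-/etc/sysconfig/docker-storage}"
  local line="DOCKER_STORAGE_OPTIONS=\"--storage-opt dm.basesize=${size}\""
  # Skip the write if the exact line is already present (idempotent).
  if grep -qxF -- "$line" "$conf" 2>/dev/null; then
    return 0
  fi
  printf '%s\n' "$line" >> "$conf"
}
```

On the instance, the function would be called as root with 20G as the size, followed by a Docker restart. On the Amazon Linux AMI of that period the restart command is assumed to be "sudo service docker restart"; treat that as an assumption and check it against your own system.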

Step 4 - Download, unzip, load and start the Docker sandbox image from Hortonworks. This link has the details - http://hortonworks.com/hadoop-tutorial/hortonworks-sandbox-guide/#section_4

While the Hortonworks guide is well written, it omits the change needed to increase the Docker base device size (Step 3 above).

Please try out this installation and post your comments. Thanks for reading.

Common Issues

If Step 3 above was not run, then you will see the following error when loading the Docker image.

# docker load < HDP_2.5_docker.tar
b1b065555b8a: Loading layer 202.2 MB/202.2 MB
3901568415a3: Loading layer 10.37 GB/13.85 GB
ApplyLayer exit status 1 stdout: stderr: write /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.111.x86_64/jre/lib/amd64/server/libjvm.so: no space left on device
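
Before loading the image, it can help to confirm that the larger base size is in effect. The sketch below assumes the devicemapper storage driver, whose "docker info" output includes a "Base Device Size" line; the function name base_device_size is hypothetical, and it reads from standard input so it can be tried without a running daemon.

```shell
# Sketch: extract the devicemapper base device size from `docker info` output.
# Prints nothing if the line is absent (for example, with another storage driver).
base_device_size() {
  sed -n 's/^[[:space:]]*Base Device Size:[[:space:]]*//p'
}
```

Typical usage would be "sudo docker info | base_device_size". A value of roughly 10 GB suggests the default is still active and the load is likely to fail as shown above.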

References

https://community.hortonworks.com/questions/64835/installing-hdp-sandbox-on-docker-for-aws-gives-no.html#answer-64852

Docker commands - Reference

Useful Docker commands

This blog post is a reference for Docker commands.

This list can be obtained by typing "docker" or "man docker" in the CLI after Docker has been installed successfully.

Basic commands

info - Display system-wide information
version - Show the Docker version information
start - Start one or more stopped containers
stop - Stop a running container

pull - Pull an image or a repository from a registry
push - Push an image or a repository to a registry

cp - Copy files/folders between a container and the local filesystem
create - Create a new container
diff - Inspect changes on a container's filesystem
restart - Restart a container

Other commands

attach - Attach to a running container
build - Build an image from a Dockerfile
commit - Create a new image from a container's changes
events - Get real time events from the server
exec - Run a command in a running container
export - Export a container's filesystem as a tar archive
history - Show the history of an image
images - List images
import - Import the contents from a tarball to create a filesystem image
inspect - Return low-level information on a container or image
kill - Kill a running container
load - Load an image from a tar archive or STDIN
login - Log in to a Docker registry
logout - Log out from a Docker registry
logs - Fetch the logs of a container
network - Manage Docker networks
pause - Pause all processes within a container
port - List port mappings or a specific mapping for the CONTAINER
ps - List containers
rename - Rename a container
rm - Remove one or more containers
rmi - Remove one or more images
run - Run a command in a new container
save - Save one or more images to a tar archive
search - Search the Docker Hub for images
stats - Display a live stream of container(s) resource usage statistics
tag - Tag an image into a repository
top - Display the running processes of a container
unpause - Unpause all processes within a container
update - Update configuration of one or more containers
volume - Manage Docker volumes
wait - Block until a container stops, then print its exit code
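
To show how a few of these commands fit together, here is a minimal sketch of a typical container lifecycle. The function name container_lifecycle, the image "hello-world" and the container name "demo" are illustrative choices, not something from the Docker help text; the DOCKER variable is an assumption added so the sequence can be printed rather than executed.

```shell
# Sketch: a typical container lifecycle using commands from the lists above.
# Override DOCKER (e.g. DOCKER="echo docker") to print the commands instead.
DOCKER="${DOCKER:-docker}"

container_lifecycle() {
  local image="$1" name="$2"
  $DOCKER pull "$image"                  # fetch the image from the registry
  $DOCKER run --name "$name" "$image"    # create and start a container
  $DOCKER ps -a                          # list containers, including stopped ones
  $DOCKER logs "$name"                   # show the container's output
  $DOCKER rm "$name"                     # remove the stopped container
}
```

For example, "container_lifecycle hello-world demo" walks through the whole sequence for a short-lived container.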

Tuesday, November 1, 2016

Debugging HTTPS webservices in Java

I recently worked on a RESTful webservice client that accessed the webservice endpoint over HTTPS. This is the preferred communication protocol to ensure that the data transferred between the client and server are encrypted and secure.

More details here - https://en.wikipedia.org/wiki/HTTPS

Troubleshooting

One common issue I have seen is that debugging and troubleshooting webservice invocations can be difficult. This is because the SOAP or RESTful frameworks rely on the Java Runtime (JVM) to make the HTTPS connection. This encapsulation hides the transport layer and network layer exceptions from the webservice frameworks.

Troubleshooting becomes easier if the JVM is configured to log all details of a network call (including the HTTPS certificate lookup).

Enabling debugging is a simple step. Ensure that this Java runtime argument (-Djavax.net.debug=all) is added to the Java command.

For standalone Java programs, just add this argument to the Java command.

$ java -Djavax.net.debug=all -jar <JAR file name>

For Apache Tomcat, edit the Tomcat/bin/catalina.sh file and append the -Djavax.net.debug=all option to the JAVA_OPTS variable.
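
As a rough sketch, the appended option could look like the lines below. They keep any options that are already set in JAVA_OPTS. Note that the comments in catalina.sh itself suggest placing such settings in a separate bin/setenv.sh script rather than editing catalina.sh directly; which file to use is left to the reader.

```shell
# Sketch: append the debug option to JAVA_OPTS without dropping
# options that are already set.
JAVA_OPTS="${JAVA_OPTS:+$JAVA_OPTS }-Djavax.net.debug=all"
export JAVA_OPTS
```

Tomcat needs to be restarted for the change to take effect.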

For Java-based application servers like IBM WebSphere and Oracle WebLogic, this option can be added in the Admin console (please refer to the vendor documentation).

Caution: Enabling debug mode generates a lot of log entries, so it should be used only for troubleshooting.

References:

This link from Oracle Java docs has more details. http://docs.oracle.com/javase/7/docs/technotes/guides/security/jsse/ReadDebug.html
This blog post has additional notes related to this post - http://karunsubramanian.com/websphere/how-to-enable-ssl-debugging-in-java/
Full debugging is not the only option. The programmer can debug specific aspects of the JVM. This article shows the various other debugging options available - http://www.ibm.com/support/knowledgecenter/SSYKE2_7.1.0/com.ibm.java.security.component.71.doc/security-component/jsse2Docs/debug.html
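
As an illustration of narrower debugging, the Oracle JSSE guide describes options such as "ssl" and "handshake" that can be combined with a colon or comma. The helper below is a small sketch for building such an option; the function name net_debug_opt is hypothetical.

```shell
# Sketch: build a javax.net.debug option from a list of categories,
# joined with ":" as in the Oracle JSSE guide examples.
net_debug_opt() {
  local IFS=':'
  printf '%s\n' "-Djavax.net.debug=$*"
}
```

A possible usage is: java "$(net_debug_opt ssl handshake)" -jar <JAR file name>. This limits the log to handshake-related entries, which is usually far less verbose than "all".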