Tuesday, November 15, 2016

Course Review - Apache Spark with Scala

I recently completed a course on Apache Spark with Scala. It was a wonderful learning experience and gave me the skills necessary to complete a prototype in Big Data.


Where is this course?


What does it cover?

   This course covers a wide range of topics related to Apache Spark. While the Spark framework supports several languages, this course is based on the Scala language. There is an introductory section that explains Scala basics, followed by the Spark libraries. It gives only a brief introduction to Spark architecture.

   The hands-on nature of the lessons is great. No lesson exceeds 15 minutes, which makes learning easier even if you have only a short time each day.


What is the cost?

   The cost varies. It is not free.


How long is this course?

   The course duration is 7.5 hours. I took about 30 hours (learning 1 hour every day) to complete the course.


What are the prerequisites?

   It would be ideal for the learner to have at least 4 years of programming experience and database knowledge before starting this course. Basic familiarity with UNIX commands or another CLI would help. Basic knowledge of Java/Scala programming would be helpful, though there are a few introductory lessons for Scala.

How will it benefit the learner?

  You will have enough knowledge to work as an intermediate-level Spark Scala developer in Big Data projects.

Friday, November 4, 2016

Course review - MongoDB - Udemy

I completed a course about MongoDB ("MongoDB Essentials - Understand the Basics of MongoDB").  This course covered the basics of MongoDB in a very simple and easy manner.

Where is this course?


What does it cover?

Introduction to MongoDB.

What is the cost?

It is free.

How long is this course?

About 35 minutes. It would take about 2 hours to complete, if the learner runs every step in the tutorial.


What are the prerequisites?

You will need a computer with admin privileges and an internet connection.


How will it benefit the learner?

After completing this course, the learner will be equipped to work on entry-level tasks in MongoDB.

Thursday, November 3, 2016

Course Review - Github Introduction - Udemy

I completed a course about Git and Github.  This course was brilliant in its simplicity and effectiveness in teaching about Git. The course focuses more on practical usage than architecture, which will help beginners to learn Git easily.


Where is this course?


What does it cover?

Basics of Git and Github.


What is the cost?

It is free.


How long is this course?

About 35 minutes. It would take about 2 hours to complete, if the learner practices every step from the tutorial.

What are the prerequisites?

You will need a computer with admin privileges and an internet connection.


How will it benefit the learner?

The learner can become a beginner-level Git user and can contribute to software projects.

Getting started with Hortonworks Data Platform (HDP) sandbox in Amazon Web Services (AWS)

A convenient way to start working on Big Data concepts is to use the Hortonworks (HDP) Sandbox. I am not aware of a similar sandbox from other Big Data vendors. This sandbox packages the various components of the Big Data ecosystem in a single VM, so it is easy to install. The Sandbox is available in 3 formats - VirtualBox, VMware and Docker.

I started working with VirtualBox, since that is (in my opinion) a compact and simple container.
But I faced issues because the Sandbox is very large (>10 GB) and it needs at least 8GB RAM. The installation just froze and I could not proceed.

I then decided to try it out in the AWS cloud, since I could spin up more powerful host machines. This worked. :)


In this post, I will explain how to get started.

What do we need to get started?

  1. AWS account
  2. Basic understanding of AWS concepts (I have glossed over the AWS instance setup, to keep this post compact)
  3. An EC2 instance with at least 128 GB disk storage and 16 GB RAM. (I used an instance type of m4.2xlarge). The larger, the better.

How much time do we need?


       The Hortonworks portal claims that you need just 15 minutes (!), but let's assume 1 hour.

Steps


     Step 1 - Launch the EC2 instance with the Amazon Linux AMI. Ensure that the security group allows access to ports 8080, 8888 and 22.

     Step 2 - Install Docker in AWS. This post from AWS will help -  http://docs.aws.amazon.com/AmazonECS/latest/developerguide/docker-basics.html

     Step 3 - Modify the Docker config file "/etc/sysconfig/docker-storage" to add the line:            
    DOCKER_STORAGE_OPTIONS="--storage-opt dm.basesize=20G"
          This step is needed so that Docker can import the Sandbox: by default, a Docker container can have a maximum size of 10 GB, and the Hortonworks Sandbox is about 13 GB.  Restart Docker ("sudo ").
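Step 3 can also be scripted. The following is a minimal sketch, not a tested installer: the function name `set_docker_basesize` is hypothetical, and the config path is passed as a parameter so the helper can be tried on a scratch file before touching "/etc/sysconfig/docker-storage".

```shell
#!/bin/sh
# Hypothetical helper: make sure the Docker storage config sets the
# device-mapper base size exactly once.
# Usage: set_docker_basesize FILE SIZE    (for example: ... 20G)
set_docker_basesize() {
    conf_file="$1"
    size="$2"
    line="DOCKER_STORAGE_OPTIONS=\"--storage-opt dm.basesize=${size}\""

    # Create the file if it does not exist yet.
    [ -f "$conf_file" ] || : > "$conf_file"

    # Drop any existing DOCKER_STORAGE_OPTIONS line, then append the new one.
    tmp_file="$(mktemp)"
    grep -v '^DOCKER_STORAGE_OPTIONS=' "$conf_file" > "$tmp_file" || true
    printf '%s\n' "$line" >> "$tmp_file"

    # Write back in place so the original file's permissions are kept.
    cat "$tmp_file" > "$conf_file"
    rm -f "$tmp_file"
}
```

Run as root, the call would be `set_docker_basesize /etc/sysconfig/docker-storage 20G`, followed by a Docker restart. On Amazon Linux the restart is probably `sudo service docker restart`, but that command is my assumption and should be checked against the AWS guide from Step 2.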

     Step 4 - Download, unzip, load and start the Docker sandbox image from Hortonworks. This link has the details - http://hortonworks.com/hadoop-tutorial/hortonworks-sandbox-guide/#section_4


     While the Hortonworks guide is well written, it omits the change needed to increase the Docker base device size (Step 3 above).

      Please try out this installation and post your comments. Thanks for reading.

Common Issues


     If Step 3 above was not run, you will see the following error when loading the Docker image:

# docker load < HDP_2.5_docker.tar
b1b065555b8a: Loading layer 202.2 MB/202.2 MB
3901568415a3: Loading layer 10.37 GB/13.85 GB
ApplyLayer exit status 1 stdout:  stderr: write /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.111.x86_64/jre/lib/amd64/server/libjvm.so: no space left on device
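
This error can be caught before the long load starts. Below is a rough sketch of such a check: `basesize_ok` is a hypothetical name, and it assumes the size is written in whole gigabytes with a "G" suffix, as in Step 3. It only reads the config file and does not talk to Docker.

```shell
#!/bin/sh
# Hypothetical check: does the configured dm.basesize cover the image?
# Usage: basesize_ok FILE REQUIRED_GB    (returns 0 when large enough)
basesize_ok() {
    conf_file="$1"
    required_gb="$2"

    # Pull the number out of "dm.basesize=<N>G"; empty if the option is absent.
    configured_gb="$(sed -n 's/.*dm\.basesize=\([0-9][0-9]*\)G.*/\1/p' "$conf_file" 2>/dev/null | tail -n 1)"

    # With no explicit setting, fall back to the 10 GB default from Step 3.
    [ -n "$configured_gb" ] || configured_gb=10

    [ "$configured_gb" -ge "$required_gb" ]
}
```

For the log above, where the largest layer is 13.85 GB, a call such as `basesize_ok /etc/sysconfig/docker-storage 14 || echo "increase dm.basesize first"` would flag the problem.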

References

Docker commands - Reference

Useful Docker commands

This blog post is a reference for Docker commands.

This list can be obtained by typing "docker" or "man docker" in the CLI after a successful installation of Docker.

Basic commands


    info    - Display system-wide information
    version - Show the Docker version information
    start   - Start one or more stopped containers
    stop    - Stop a running container
    pull    - Pull an image or a repository from a registry
    push    - Push an image or a repository to a registry
    cp      - Copy files/folders between a container and the local filesystem
    create  - Create a new container
    diff    - Inspect changes on a container's filesystem
    restart - Restart a container


Other commands


    attach  - Attach to a running container
    build   - Build an image from a Dockerfile
    commit  - Create a new image from a container's changes
    events  - Get real time events from the server
    exec    - Run a command in a running container
    export  - Export a container's filesystem as a tar archive
    history - Show the history of an image
    images  - List images
    import  - Import the contents from a tarball to create a filesystem image
    inspect - Return low-level information on a container or image
    kill    - Kill a running container
    load    - Load an image from a tar archive or STDIN
    login   - Log in to a Docker registry
    logout  - Log out from a Docker registry
    logs    - Fetch the logs of a container
    network - Manage Docker networks
    pause   - Pause all processes within a container
    port    - List port mappings or a specific mapping for the CONTAINER
    ps      - List containers
    rename  - Rename a container
    rm      - Remove one or more containers
    rmi     - Remove one or more images
    run     - Run a command in a new container
    save    - Save one or more images to a tar archive
    search  - Search the Docker Hub for images
    stats   - Display a live stream of container(s) resource usage statistics
    tag     - Tag an image into a repository
    top     - Display the running processes of a container
    unpause - Unpause all processes within a container
    update  - Update configuration of one or more containers
    volume  - Manage Docker volumes
    wait    - Block until a container stops, then print its exit code
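
To show how a few of these commands fit together, here is a small sketch of a typical container lifecycle. The function name `container_lifecycle` and the `DOCKER` override variable are my own additions for illustration; the image and container names are placeholders supplied by the caller.

```shell
#!/bin/sh
# DOCKER can be overridden (for example DOCKER=echo) for a dry run
# that only prints the commands.
DOCKER="${DOCKER:-docker}"

# Hypothetical walk-through: pull an image, run a container from it,
# look at it, then stop and remove it.
# Usage: container_lifecycle IMAGE NAME
container_lifecycle() {
    image="$1"
    name="$2"

    $DOCKER pull "$image"                    # fetch the image from a registry
    $DOCKER run -d --name "$name" "$image"   # start a new container in the background
    $DOCKER ps                               # list running containers
    $DOCKER logs "$name"                     # fetch the container's logs
    $DOCKER stop "$name"                     # stop the running container
    $DOCKER rm "$name"                       # remove the stopped container
}
```

Setting `DOCKER=echo` before calling `container_lifecycle <image> <name>` prints the six commands without running them, which is a safe way to read through the sequence first.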


Tuesday, November 1, 2016

Debugging HTTPS webservices in Java

I recently worked on a RESTful webservice client that accessed the webservice endpoint over HTTPS. This is the preferred communication protocol to ensure that the data transferred between the client and server are encrypted and secure.


Troubleshooting

     One common issue I have seen is that debugging and troubleshooting the webservice invocations can be difficult. This is because the SOAP or RESTful frameworks rely on the Java Runtime (JVM) to make the HTTPS connection. This encapsulation hides the transport layer and network layer exceptions from the webservice frameworks.

    Troubleshooting becomes easier if the JVM is configured to log all details of a network call (including HTTPS certificate lookup).

    Enabling debugging is a simple step. Ensure that this Java runtime argument (-Djavax.net.debug=all) is added to the Java command.

    For standalone Java programs, just add this argument to the Java command, before the -jar option.
    $ java -Djavax.net.debug=all -jar <JAR file name>

    For Apache Tomcat, edit the Tomcat/bin/catalina.sh file and append the -Djavax.net.debug=all option to the JAVA_OPTS variable.
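
The Tomcat change can be written so that it is safe to apply more than once. This is only a sketch: `append_java_opt` is a hypothetical helper, not part of Tomcat, and it simply adds a flag to an options string if the flag is not already there.

```shell
#!/bin/sh
# Hypothetical helper: add a JVM flag to an options string only once.
# Usage: JAVA_OPTS="$(append_java_opt "$JAVA_OPTS" "-Djavax.net.debug=all")"
append_java_opt() {
    opts="$1"
    flag="$2"
    case " $opts " in
        *" $flag "*) printf '%s\n' "$opts" ;;         # flag already present
        "  ")        printf '%s\n' "$flag" ;;         # options were empty
        *)           printf '%s\n' "$opts $flag" ;;   # append to existing options
    esac
}
```

The usage line from the comment could go into catalina.sh as described above. Tomcat also reads a bin/setenv.sh file if one exists, which keeps the change out of the stock script; whether that suits a given installation is for the reader to confirm.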

    For Java-based application servers like IBM WebSphere and Oracle WebLogic, this option can be added in the Admin console (please refer to the vendor documentation).

    Caution: Enabling debug mode generates a lot of log entries, so it should be used only for troubleshooting.
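
One way to cut the log volume is to ask for specific debug categories instead of "all", as covered in the references below. The sketch here builds such a flag; `jsse_debug_flag` is a hypothetical helper, and the categories are joined with ":" in the form the Oracle JSSE documentation uses (for example ssl:handshake).

```shell
#!/bin/sh
# Hypothetical helper: build a -Djavax.net.debug flag from a list of
# categories joined with ":" (for example: ssl handshake).
jsse_debug_flag() {
    [ "$#" -gt 0 ] || set -- all    # no categories given: full debugging
    value="$1"
    shift
    for category in "$@"; do
        value="${value}:${category}"
    done
    printf '%s\n' "-Djavax.net.debug=${value}"
}
```

For example, `java "$(jsse_debug_flag ssl handshake)" -jar <JAR file name>` would log only the TLS handshake. The categories that are accepted depend on the JVM vendor and version, so check them against the references for the runtime in use.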

    References: 

  1. This link from the Oracle Java docs has more details - http://docs.oracle.com/javase/7/docs/technotes/guides/security/jsse/ReadDebug.html
  2. This blog post has additional notes related to this post - http://karunsubramanian.com/websphere/how-to-enable-ssl-debugging-in-java/
  3. Full debugging is not the only option. The programmer can debug specific aspects of the JVM. This article shows the various other debugging options available - http://www.ibm.com/support/knowledgecenter/SSYKE2_7.1.0/com.ibm.java.security.component.71.doc/security-component/jsse2Docs/debug.html