Thursday, December 8, 2016

What is Hadoop?

What is Hadoop? What are the basic components that make up a Hadoop (Big Data) system?

In the simplest terms, these are the components that make up a Hadoop ecosystem.

Hadoop Distributed File System (HDFS)


A Java-based, distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster. It stores and processes vast quantities of data in a storage layer that scales linearly, and it is designed to run on low-cost commodity hardware.

Hadoop YARN


A resource-management platform responsible for managing computing resources in clusters and scheduling users' applications on them. It provides resource management and a pluggable architecture that lets a wide variety of data access methods operate on data stored in Hadoop with predictable performance and service levels.

Hadoop MapReduce


A programming model for large-scale data processing; a short illustration follows this list.

Hadoop Common 


Libraries and utilities needed by other Hadoop modules.
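
To make the MapReduce model concrete, here is a minimal word count written in plain Scala (an illustration of the model only, not Hadoop's Java API): a map phase emits (word, 1) pairs, and a reduce phase sums the counts per word.

    // Word count as map and reduce steps over an in-memory collection.
    // A real Hadoop job distributes these phases across the cluster.
    val documents = Seq("big data", "big hadoop data")

    val mapped = documents.flatMap(_.split(" ")).map(word => (word, 1)) // map phase
    val reduced = mapped                                                // shuffle + reduce
      .groupBy { case (word, _) => word }
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) }
    // reduced == Map("big" -> 2, "data" -> 2, "hadoop" -> 1)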



Reference:


https://github.com/hortonworks/tutorials/blob/hdp-2.5/tutorials/hortonworks/hello-hdp-an-introduction-to-hadoop/hello-hdp-section-2.md#13-apache-hadoop


Course Review - Apache Spark Streaming with Scala

I recently completed a course on Apache Spark Streaming with Scala. It was a great learning experience and gave me the skills necessary to complete a complex Big Data prototype. The prototype involved reading a stream of information and processing it with the Apache Spark framework.


Where is this course?


What does it cover?

   This course covers the Apache Spark Streaming module. While the Spark framework is the foundation for Big Data processing, its Streaming module is used to process streaming data. The module supports both Python and Scala; this course is based on the Scala language. There is an introductory section that explains Scala basics, and the rest of the course explains how the Spark Streaming module works. A bonus feature is the lesson explaining Apache Kafka and Spark Streaming integration. A minimal example of what the module looks like in practice follows below.

   The hands-on nature of the course is great. Each lesson does not exceed 15 minutes, which makes learning easier even if you only have a short time every day.
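
   To give a flavor of the material, here is a minimal Spark Streaming sketch (my own illustration, not code from the course); the host, port and batch interval are arbitrary choices.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object SocketWordCount {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setMaster("local[2]").setAppName("SocketWordCount")
        val ssc = new StreamingContext(conf, Seconds(5)) // 5-second micro-batches

        // Read lines from a local socket (start one with: nc -lk 9999)
        // and count the words within each micro-batch.
        val lines = ssc.socketTextStream("localhost", 9999)
        val counts = lines.flatMap(_.split("\\s+"))
                          .map(word => (word, 1))
                          .reduceByKey(_ + _)
        counts.print()

        ssc.start()
        ssc.awaitTermination()
      }
    }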


What is the cost?

   The cost varies. It is not free.


How long is this course?

   The course duration is 6 hours. It took me about 30 hours (learning 1 hour every day) to complete the course.


What are the prerequisites?

   The learner must be familiar with Apache Spark and Scala. It would be ideal for the learner to have at least 5 years of programming experience and some database knowledge before starting this course. Basic familiarity with UNIX commands or another CLI is needed. Basic knowledge of Java/Scala programming would be helpful, though there are a few introductory lessons on Scala.

How will it benefit the learner?

  You will have enough knowledge to become an advanced Spark Scala developer in Big Data projects.

Tuesday, November 15, 2016

Course Review - Apache Spark with Scala

I recently completed a course on Apache Spark with Scala. It was a wonderful learning experience and gave me the skills necessary to complete a prototype in Big Data.


Where is this course?


What does it cover?

   This course covers a broad range of topics related to Apache Spark. While the Spark framework supports several languages, this course is based on the Scala language. There is an introductory section that explains Scala basics, followed by the Spark libraries. It gives only a brief introduction to Spark architecture. A minimal example in the spirit of the course exercises follows below.

   The hands-on nature of the course is great. Each lesson does not exceed 15 minutes, which makes learning easier even if you only have a short time every day.
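
   As a taste of the batch API the course centers on, here is a minimal Spark sketch (my own illustration, not code from the course); the input path is a placeholder.

    import org.apache.spark.{SparkConf, SparkContext}

    object LineCount {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setMaster("local[*]").setAppName("LineCount")
        val sc = new SparkContext(conf)

        // Load a text file as an RDD and count its non-empty lines.
        val lines = sc.textFile("data/input.txt") // placeholder path
        val nonEmpty = lines.filter(_.trim.nonEmpty).count()
        println(s"Non-empty lines: $nonEmpty")

        sc.stop()
      }
    }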


What is the cost?

   The cost varies. It is not free.


How long is this course?

   The course duration is 7.5 hours. It took me about 30 hours (learning 1 hour every day) to complete the course.


What are the prerequisites?

   It would be ideal for the learner to have at least 4 years of programming experience and some database knowledge before starting this course. Basic familiarity with UNIX commands or another CLI would help. Basic knowledge of Java/Scala programming would be helpful, though there are a few introductory lessons on Scala.

How will it benefit the learner?

  You will have enough knowledge to become an intermediate level Spark Scala developer in Big Data projects.

Friday, November 4, 2016

Course review - MongoDB - Udemy

I completed a course about MongoDB ("MongoDB Essentials - Understand the Basics of MongoDB").  This course covered the basics of MongoDB in a very simple and easy manner.

Where is this course?


What does it cover?

Introduction to MongoDB.

What is the cost?

It is free.

How long is this course?

About 35 minutes of video. It would take about 2 hours to complete if the learner runs every step in the tutorial.


What are the prerequisites?

You will need a computer with admin privileges and an internet connection.


How will it benefit the learner?

After completing this course, the learner will become equipped to work on entry level tasks in MongoDB.

Thursday, November 3, 2016

Course Review - Github Introduction - Udemy

I completed a course about Git and Github.  This course was brilliant in its simplicity and effectiveness in teaching about Git. The course focuses more on practical usage than architecture, which will help beginners to learn Git easily.


Where is this course?


What does it cover?

Basics of Git and Github.


What is the cost?

It is free.


How long is this course?

About 35 minutes of video. It would take about 2 hours to complete if the learner practices every step from the tutorial.

What are the prerequisites?

You will need a computer with admin privileges and an internet connection.


How will it benefit the learner?

The learner can become a beginner level Git user and can contribute to software projects.

Getting started with Hortonworks Data Platform (HDP) sandbox in Amazon Web Services (AWS)

A convenient way to start working on Big Data concepts is to use the Hortonworks Data Platform (HDP) Sandbox. I am not aware of a similar sandbox from other Big Data vendors. This sandbox packages all the various components of the Big Data ecosystem in a single VM, so it is easy to install. The Sandbox is available in three formats: VirtualBox, VMware and Docker.

I started working with VirtualBox, since that is (in my opinion) a compact and simple option. But I faced issues because the Sandbox is very large (>10 GB) and it needs at least 8 GB of RAM; the installation simply froze and I could not proceed.

I then decided to try it out in the AWS cloud, since I could spin up more powerful host machines there. This worked. :)


In this post, I will explain how to get started.

What do we need to get started?

  1. AWS account
  2. Basic understanding of AWS concepts (I have glossed over the AWS instance setup to keep this post compact)
  3. An EC2 instance with at least 128 GB of disk storage and 16 GB of RAM (I used an instance type of m4.2xlarge). The larger, the better.

How much time do we need?


       The Hortonworks portal claims that you need just 15 minutes (!), but let's assume 1 hour.

Steps


     Step 1 - Launch the EC2 instance with the Amazon Linux AMI. Ensure that the security group is configured to allow access to ports 8080, 8888 and 22.

     Step 2 - Install Docker on the EC2 instance. This guide from AWS will help -  http://docs.aws.amazon.com/AmazonECS/latest/developerguide/docker-basics.html

     Step 3 - Modify the Docker config file "/etc/sysconfig/docker-storage" to add the line:
    DOCKER_STORAGE_OPTIONS="--storage-opt dm.basesize=20G"
          This step is needed to ensure that Docker is able to import the Sandbox: by default, Docker containers can have a maximum size of 10 GB, while the Hortonworks Sandbox is about 13 GB. Restart Docker afterwards (on Amazon Linux: "sudo service docker restart").

     Step 4 - Download, unzip, load and start the Docker sandbox image from Hortonworks. This link has the details - http://hortonworks.com/hadoop-tutorial/hortonworks-sandbox-guide/#section_4
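
     As a rough sketch of Step 4 (the archive name matches the HDP 2.5 guide linked above; see that guide for the exact start script):

     gunzip HDP_2.5_docker.tar.gz      # decompress the downloaded archive
     docker load < HDP_2.5_docker.tar  # import the ~13 GB sandbox image
     docker images                     # confirm the image is now listed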


     While the Hortonworks guide is well written, it omits the change needed to increase the Docker base device size (Step 3 above).

      Please try out this installation and post your comments. Thanks for reading.

Common Issues


     If Step 3 above was not run, you will see the following error when loading the Docker image.

# docker load < HDP_2.5_docker.tar
b1b065555b8a: Loading layer 202.2 MB/202.2 MB
3901568415a3: Loading layer 10.37 GB/13.85 GB
ApplyLayer exit status 1 stdout:  stderr: write /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.111.x86_64/jre/lib/amd64/server/libjvm.so: no space left on device

References

Docker commands - Reference

Useful Docker commands

This blog post is a reference for Docker commands.

These command summaries can be obtained by typing "docker" or "man docker" in the CLI after a successful installation of Docker.

Basic commands


    info  - Display system-wide information
    version -  Show the Docker version information
    start -  Start one or more stopped containers
    stop -  Stop a running container
    pull -  Pull an image or a repository from a registry
    push - Push an image or a repository to a registry
    cp -  Copy files/folders between a container and the local filesystem
    create -   Create a new container
    diff - Inspect changes on a container's filesystem
    restart -  Restart a container


Other commands


    attach  -  Attach to a running container
    build  -    Build an image from a Dockerfile
    commit -  Create a new image from a container's changes
    events -    Get real time events from the server
    exec  -    Run a command in a running container
    export -   Export a container's filesystem as a tar archive
    history -  Show the history of an image
    images  -  List images
    import  -  Import the contents from a tarball to create a filesystem image
    inspect -   Return low-level information on a container or image
    kill   -   Kill a running container
    load -   Load an image from a tar archive or STDIN
    login -  Log in to a Docker registry
    logout - Log out from a Docker registry
    logs - Fetch the logs of a container
    network -  Manage Docker networks
    pause -   Pause all processes within a container
    port -  List port mappings or a specific mapping for the CONTAINER
    ps -  List containers
    rename -   Rename a container
    rm -      Remove one or more containers
    rmi -     Remove one or more images
    run -     Run a command in a new container
    save -    Save one or more images to a tar archive
    search - Search the Docker Hub for images
    stats -   Display a live stream of container(s) resource usage statistics
    tag -     Tag an image into a repository
    top -     Display the running processes of a container
    unpause - Unpause all processes within a container
    update -  Update configuration of one or more containers
    volume -  Manage Docker volumes
    wait -    Block until a container stops, then print its exit code


Tuesday, November 1, 2016

Debugging HTTPS webservices in Java

I recently worked on a RESTful webservice client that accessed the webservice endpoint over HTTPS. This is the preferred communication protocol, ensuring that the data transferred between the client and server is encrypted and secure.


Troubleshooting

     One common issue I have seen is that debugging and troubleshooting webservice invocations can be difficult. This is because the SOAP or RESTful frameworks rely on the Java Runtime (JVM) to make the HTTPS connection, and this encapsulation hides transport-layer and network-layer exceptions from the webservice frameworks.

    Troubleshooting becomes much easier if the JVM is configured to log all details of a network call (including the HTTPS certificate exchange).

    Enabling debugging is a simple step. Ensure that this Java runtime argument (-Djavax.net.debug=all) is added to the Java command.

    For standalone Java programs, just add this argument to the Java command:
    $ java -Djavax.net.debug=all -jar <JAR file name>

    For Apache Tomcat, edit the Tomcat/bin/catalina.sh file and append the -Djavax.net.debug=all option to the JAVA_OPTS variable.
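
    For example, a line like the following near the top of catalina.sh should work (a sketch; exact placement can vary by Tomcat version):
    JAVA_OPTS="$JAVA_OPTS -Djavax.net.debug=all"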

    For Java-based application servers like IBM WebSphere and Oracle WebLogic, this option can be added in the Admin console (please refer to the product documentation).

    Caution: enabling debug mode generates a lot of log entries, so it should be used only for troubleshooting.

    References: 

  1. This link from Oracle Java docs has more details.     http://docs.oracle.com/javase/7/docs/technotes/guides/security/jsse/ReadDebug.html
  2. This blog post has additional notes related to this post - http://karunsubramanian.com/websphere/how-to-enable-ssl-debugging-in-java/
  3. Full debugging is not the only option. The programmer can debug specific aspects of the JVM. This article shows the various other debugging options available - http://www.ibm.com/support/knowledgecenter/SSYKE2_7.1.0/com.ibm.java.security.component.71.doc/security-component/jsse2Docs/debug.html
  



Tuesday, October 25, 2016

AWS Certified Solutions Architect – Associate exam - FAQ

I recently passed the AWS (Amazon Web Services) Certified Solutions Architect – Associate exam.



In this blog post I list out some tips and observations about this certification.

Note: The most authoritative source of information on this certification is the AWS website - https://aws.amazon.com/certification/certified-solutions-architect-associate/
This blog post gives my subjective take.

Who is this certification for? 


This certification is for architects who will design entire systems using the AWS infrastructure. It covers a broad list of topics without much depth; it is for generalists, not specialists.

What should I know before I start preparing? 


You should have at least 1 year of experience designing or maintaining architectures, and at least 7 years of basic IT experience.

How much time will it take to prepare? 


I needed about 3 months of preparation, spending about 1 hour a day. I have been an IT solutions architect for a few years and I have some experience using basic AWS features.

What resources should I use? 

[Note - this is subjective] My main guide was Linux Academy (https://linuxacademy.com/). I compared several options and found that their course style matched my preference. I also referred to this blog for notes and revision - http://codingbee.net/tutorials/aws/aws-csa-associate/aws-about-this-course/

How will I benefit? 

This certification will give you the confidence of knowing roughly 80% of the features of AWS. You will be able to demonstrate that you can set up and maintain a simple cloud infrastructure, or be a confident member of a team that runs a large one.

What can I do after I pass? 

Check out the AWS Certified Solutions Architect - Professional exam. It validates the advanced skills needed to be an AWS architect.

Monday, October 17, 2016

Troubleshooting SAML security

Troubleshooting a SAML (https://en.wikipedia.org/wiki/Security_Assertion_Markup_Language) based SSO application can be tricky because of the complexity of the underlying framework, over which the development team has little control and transparency. There is also the added complexity of not knowing whether a problem occurred at the Service Provider or the Identity Provider; these are usually separate teams, and communication between them can be a challenge.

Recently, while working on a Spring Security SAML project, there was a need to troubleshoot the application to find the root cause of a failure.

Here are a few simple tips to make troubleshooting simpler for Spring applications that use SAML for security.

Enable SAML logging


Spring Security SAML uses SLF4J for logging; enable it. More details here - http://docs.spring.io/spring-security-saml/docs/1.0.x-SNAPSHOT/reference/htmlsingle/#logging
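
For example, with Logback as the SLF4J backend, logger entries along these lines raise the SAML log levels (a sketch; adjust to your own logging configuration):

    <!-- logback.xml: verbose output from the SAML machinery -->
    <logger name="org.springframework.security.saml" level="DEBUG"/>
    <logger name="org.opensaml" level="DEBUG"/>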


Install a SAML plugin in your browser. 


Install a SAML plugin in the browser to inspect the SAML requests and responses. While there are several good plugins for the Chrome and Firefox browsers, I found (subjectively) the SAML Tracer (https://addons.mozilla.org/en-US/firefox/addon/saml-tracer/) Firefox extension to be very simple and intuitive.

Firefox Extension - SAML Tracer


The SAML Chrome Panel (https://chrome.google.com/webstore/detail/saml-chrome-panel/paijfdbeoenhembfhkhllainmocckace) for Google Chrome browser is more advanced and can be used if you need more visibility.

Ensure that the error.jsp is configured correctly and prints the stacktrace


This step is often underrated, but it is the easiest and best way to start debugging.
Every Authentication Manager in a Spring Security app should have a failure handler. Ensure that a SimpleUrlAuthenticationFailureHandler (http://docs.spring.io/spring-security/site/docs/3.2.9.RELEASE/apidocs/org/springframework/security/web/authentication/SimpleUrlAuthenticationFailureHandler.html) is configured that forwards to a JSP page.
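
A sketch of the XML configuration (the bean id and target page are illustrative; the Spring Security SAML sample ships a similar handler):

    <bean id="failureRedirectHandler"
          class="org.springframework.security.web.authentication.SimpleUrlAuthenticationFailureHandler">
        <!-- Forward (rather than redirect) so the exception is preserved for the JSP -->
        <property name="useForward" value="true"/>
        <property name="defaultFailureUrl" value="/error.jsp"/>
    </bean>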

This JSP page should be capable of catching and printing Exception stack traces.

Refer to this code for a good example of how this JSP should look - https://github.com/spring-projects/spring-security-saml/blob/master/sample/src/main/webapp/error.jsp

Authentication log


Maintaining an authentication log that records key steps can be very helpful in troubleshooting SAML events. More details here - http://docs.spring.io/spring-security-saml/docs/1.0.x-SNAPSHOT/reference/htmlsingle/#configuration-authentication-log

Climate datasets for data analysis

For learners and practitioners of data processing projects (database administration, database programming, etc.), starting with a reliable and interesting dataset can be very useful.

The dataset must be syntactically correct, large enough to produce interesting results and, if possible, real data, to maintain the interest of the practitioner. For my Apache Spark programming exercises, I have chosen to process climate data for some of the cities I have lived in; this data lets me run several interesting analyses.

The data source I recommend is the National Centers for Environmental Information (NCEI) website: http://www.ncdc.noaa.gov/cdo-web/datasets

This dataset is correctly formatted and has interesting metrics. The following notes will show how to get daily climate information for a city over a large date range.

Step 1 - Open the URL http://www.ncdc.noaa.gov/cdo-web/datasets in your browser. You will see the list of available datasets.




Step 2 - Click and expand the (+) sign at "Daily Summaries"


Step 3  - Click "Search tool"



Step 4 - Select "Daily Summaries", provide the date range, select "cities" in the "Search For" option and enter "Chennai" in the search term (you can enter your chosen city name here). Click Search.


Step 5 - In the results page, click the "ADD TO CART" button next to the city or entity that you searched for.



Step 6 - Hover the mouse over the "Cart" icon in the top right. Click "VIEW ALL ITEMS".



Step 7 - In the next page, select the CSV option, verify the date range and click CONTINUE.


Step 8 - Next, select the data columns that you need (precipitation, temperature, etc.). Click CONTINUE.


Step 9 - This is the final step. Verify the requested info and provide your email. Click the SUBMIT ORDER button. You will receive an email with instructions to download the dataset.
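
Once the CSV arrives, it drops straight into Spark. Here is a minimal sketch (my own illustration; the file name is a placeholder, and the TMAX column depends on the data columns you selected in Step 8):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.avg

    object ClimateAverages {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("ClimateAverages")
          .master("local[*]")
          .getOrCreate()

        // Load the NCEI daily-summaries CSV ordered in Step 9.
        val daily = spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("chennai_daily_summaries.csv") // placeholder file name

        // Average daily maximum temperature over the requested date range.
        daily.select(avg("TMAX")).show()

        spark.stop()
      }
    }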







Saturday, October 1, 2016

About this blog

Hello. Welcome to my blog. I am a software developer working mostly on Java-based projects. I maintain this blog to record observations and exercises from software projects that I have worked on.