Thursday, December 8, 2016

What is Hadoop?

What is Hadoop? What are the basic components that make up a Hadoop (Big Data) system?

In the simplest terms, these are the components that make up a Hadoop ecosystem.

Hadoop Distributed File System (HDFS)


A Java-based, distributed file system that stores data on low-cost commodity hardware, providing very high aggregate bandwidth across the cluster. It lets you store and process vast quantities of data in a storage layer that scales linearly as machines are added.
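
To make this concrete, here is a minimal sketch (mine, not from the reference) that writes and then lists a file through Hadoop's FileSystem API in Scala. The path /tmp/hello.txt is hypothetical, and the snippet assumes the Hadoop client libraries and cluster configuration files are on the classpath.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    object HdfsExample {
      def main(args: Array[String]): Unit = {
        // Picks up core-site.xml / hdfs-site.xml from the classpath.
        val conf = new Configuration()
        val fs   = FileSystem.get(conf)

        // Write a small file (hypothetical path), then list its directory.
        val path = new Path("/tmp/hello.txt")
        val out  = fs.create(path, true)   // overwrite if it already exists
        out.writeBytes("hello, hdfs\n")
        out.close()

        fs.listStatus(new Path("/tmp")).foreach(status => println(status.getPath))
        fs.close()
      }
    }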

Hadoop YARN


A resource-management platform responsible for allocating computing resources in the cluster and scheduling users' applications. It provides resource management plus a pluggable architecture that enables a wide variety of data access methods to operate on data stored in Hadoop with predictable performance and service levels.
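
As a small illustration (assuming the YARN client libraries and a yarn-site.xml pointing at the ResourceManager are on the classpath), the sketch below uses the YarnClient API from Scala to list the applications the cluster is tracking.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.yarn.client.api.YarnClient
    import scala.collection.JavaConverters._

    object ListYarnApps {
      def main(args: Array[String]): Unit = {
        val client = YarnClient.createYarnClient()
        client.init(new Configuration())   // reads yarn-site.xml from the classpath
        client.start()

        // Print the id, name, and state of each application known to YARN.
        for (app <- client.getApplications.asScala) {
          println(s"${app.getApplicationId}  ${app.getName}  ${app.getYarnApplicationState}")
        }
        client.stop()
      }
    }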

Hadoop MapReduce


A programming model for large-scale data processing: a map phase transforms input records into key/value pairs, and a reduce phase aggregates the values emitted for each key.
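
To illustrate the model, here is a minimal word-count sketch in plain Scala. It mimics the map, group/shuffle, and reduce steps without using the actual Hadoop APIs, and the object and method names are mine.

    object WordCountSketch {
      // "Map" phase: emit a (word, 1) pair for every word in a line.
      def mapPhase(line: String): Seq[(String, Int)] =
        line.split("\\s+").filter(_.nonEmpty).map(word => (word, 1)).toSeq

      // "Reduce" phase: sum the counts emitted for one word.
      def reducePhase(word: String, counts: Seq[Int]): (String, Int) =
        (word, counts.sum)

      def main(args: Array[String]): Unit = {
        val input   = Seq("hadoop stores data", "hadoop processes data")
        val mapped  = input.flatMap(mapPhase)                    // map
        val grouped = mapped.groupBy { case (word, _) => word }  // shuffle/group by key
        val reduced = grouped.map { case (word, pairs) =>
          reducePhase(word, pairs.map(_._2))                     // reduce
        }
        reduced.foreach(println)  // e.g. (hadoop,2), (data,2), ...
      }
    }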

Hadoop Common 


Libraries and utilities needed by other Hadoop modules.



Reference:


https://github.com/hortonworks/tutorials/blob/hdp-2.5/tutorials/hortonworks/hello-hdp-an-introduction-to-hadoop/hello-hdp-section-2.md#13-apache-hadoop


Course Review - Apache Spark Streaming with Scala

I recently completed a course on Apache Spark Streaming with Scala. It was a great learning experience and gave me the skills necessary to complete a complex Big Data prototype. The prototype involved reading a stream of information and processing it with the Apache Spark framework.


Where is this course?


What does it cover?

   This course covers the Apache Spark Streaming module. While the Spark framework is the foundation for Big Data processing, its Streaming module is used for processing streaming data. The Streaming module supports both Python and Scala; this course is based on Scala. There is an introductory section that explains Scala basics, and the rest of the course explains how the Spark Streaming module works. A bonus feature is a lesson explaining Apache Kafka and Spark Streaming integration.
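
   To give a taste of the module, here is a minimal word count over a socket stream in Scala. It follows the standard Spark Streaming API rather than any specific course lesson, and the host and port values are placeholders.

      import org.apache.spark.SparkConf
      import org.apache.spark.streaming.{Seconds, StreamingContext}

      object StreamingWordCount {
        def main(args: Array[String]): Unit = {
          // Two local threads: one to receive data, one to process it.
          val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingWordCount")
          val ssc  = new StreamingContext(conf, Seconds(5))  // 5-second micro-batches

          // Read lines from a socket; feed it with, e.g., `nc -lk 9999`.
          val lines  = ssc.socketTextStream("localhost", 9999)
          val counts = lines.flatMap(_.split(" "))
            .map(word => (word, 1))
            .reduceByKey(_ + _)
          counts.print()  // print each batch's counts to the console

          ssc.start()
          ssc.awaitTermination()
        }
      }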

   The hands-on nature of the course is great. Each lesson does not exceed 15 minutes, which makes learning easier even if you have only a short time every day.


What is the cost?

   The cost varies. It is not free.


How long is this course?

   The course duration is 6 hours. It took me about 30 hours (learning 1 hour every day) to complete it.


What are the prerequisites?

   The learner must be familiar with Apache Spark. It would be ideal to have at least 5 years of programming experience and some database knowledge before starting this course. Basic familiarity with UNIX commands or another CLI is needed. Basic knowledge of Java/Scala programming would be helpful, though there are a few introductory lessons on Scala.

How will it benefit the learner?

  You will have enough knowledge to work as an advanced Spark/Scala developer on Big Data projects.