Wednesday, July 24, 2019

"multi-target not supported" error in PyTorch NLLoss

I recently came across this exception when training a model with PyTorch and calculating the loss using the NLLLoss function.
RuntimeError: multi-target not supported at /pytorch/aten/src/THNN/generic/ClassNLLCriterion.c:20

This exception occurred because I provided the wrong dimensions for the labels passed to the criterion (NLLLoss). From the documentation I found that if the input to the NLLLoss function is N x C (where N is the batch size and C is the number of classes), then the target (second argument) should be of size N (one class index for each batch element).
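Here is a minimal sketch of the failing and the working shapes (the tensor sizes and values are arbitrary):

import torch
import torch.nn as nn

loss_fn = nn.NLLLoss()
# Input: N x C log-probabilities (N=4 batch elements, C=3 classes)
log_probs = torch.log_softmax(torch.randn(4, 3), dim=1)

# Wrong: a target of shape (N, 1) triggers "multi-target not supported"
# bad_target = torch.tensor([[0], [2], [1], [0]])
# loss_fn(log_probs, bad_target)  # RuntimeError

# Right: a 1-D target of size N, one class index per batch element
target = torch.tensor([0, 2, 1, 0])
loss = loss_fn(log_probs, target)

If your labels arrive with shape (N, 1), target.squeeze(1) produces the expected 1-D tensor.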

Wednesday, August 15, 2018

Spring Kafka Producer Class diagram

Spring Kafka offers a Spring bean structure for producing Kafka messages. For someone familiar with the plain Kafka API, Spring Kafka can seem a bit different.

Here is a bean configuration along the lines of the one in the Spring Kafka documentation, showing the classes involved in the Spring Kafka producer.
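This is a sketch, not the documentation verbatim: the broker address, the Integer/String key and value types, and the serializer classes are placeholder choices.

import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.IntegerSerializer;
import org.apache.kafka.common.serialization.StringSerializer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.core.DefaultKafkaProducerFactory;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.kafka.core.ProducerFactory;

@Configuration
public class KafkaProducerConfig {

    @Bean
    public Map<String, Object> producerConfigs() {
        Map<String, Object> props = new HashMap<>();
        // Placeholder broker address and serializer classes
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, IntegerSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        return props;
    }

    @Bean
    public ProducerFactory<Integer, String> producerFactory() {
        // The stock ProducerFactory implementation
        return new DefaultKafkaProducerFactory<>(producerConfigs());
    }

    @Bean
    public KafkaTemplate<Integer, String> kafkaTemplate() {
        // KafkaTemplate wraps the factory and exposes the send() methods
        return new KafkaTemplate<>(producerFactory());
    }
}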

We can see that there are two important classes involved: DefaultKafkaProducerFactory and KafkaTemplate. Here is a UML class diagram showing how they are related.


Spring KafkaTemplate UML diagram



The KafkaTemplate bean needs an implementation of the ProducerFactory interface, which holds the information needed to produce messages. There is only one implementation available: the DefaultKafkaProducerFactory class.

DefaultKafkaProducerFactory takes a map of properties such as the Kafka server URLs, serializer classes, and so on (the producerConfigs map in the snippet above).
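With those beans in place, sending a message takes one call. The topic name, key, and payload below are placeholders:

@Autowired
private KafkaTemplate<Integer, String> template;

public void sendGreeting() {
    // send(topic, key, value) returns a future that completes when the broker acknowledges
    template.send("my-topic", 1, "hello");
}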


Reference - https://docs.spring.io/spring-kafka/reference/htmlsingle/#_overview

Thursday, May 10, 2018

Some common errors seen after upgrading Elasticsearch, Logstash & Kibana stack from version 5 to version 6

I recently worked on upgrading an existing Elasticsearch, Logstash & Kibana (ELK) stack from version 5.2 to 6.2.4. There are several breaking changes in this upgrade.

I encountered the following errors in the index mappings. Each error is shown below along with how to fix it; I hope this helps someone searching for these messages.

These errors appear in logstash-plain.log or logstash.log.

[2018-05-09T03:45:12,204][WARN ][logstash.outputs.elasticsearch] Could not index event to Elasticsearch. {:status=>400, :action=>["index", {:_id=>nil, :_index=>"logstash-2018.05.09", :_type=>"doc", :_routing=>nil}, #<LogStash::Event:0x673f5e81>], :response=>{"index"=>{"_index"=>"logstash-2018.05.09", "_type"=>"doc", "_id"=>"t92BRGMBQMgRhS0WQ5xE", "status"=>400, "error"=>{"type"=>"mapper_parsing_exception", "reason"=>"failed to find type parsed [string] for [uriParams]"}}}}

Solution - Change the field type of the uriParams field from string to text or keyword.
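For example, a version 5 mapping fragment for this field (the field name is taken from the log above) like

"uriParams": {
   "type": "string"
}

becomes, in version 6,

"uriParams": {
   "type": "text"
}

Use "keyword" instead of "text" if the field is used for exact matching, sorting, or aggregations rather than full-text search.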

[2018-05-09T13:17:45,930][WARN ][logstash.outputs.elasticsearch] Could not index event to Elasticsearch. {:status=>400, :action=>["index", {:_id=>nil, :_index=>"logstash-2018.05.09", :_type=>"doc", :_routing=>nil}, #<LogStash::Event:0x238de49c>], :response=>{"index"=>{"_index"=>"logstash-2018.05.09", "_type"=>"doc", "_id"=>"n06NRmMBQMgRhS0WdPQs", "status"=>400, "error"=>{"type"=>"mapper_parsing_exception", "reason"=>"failed to parse", "caused_by"=>{"type"=>"illegal_argument_exception", "reason"=>"Could not convert [fielddata] to boolean", "caused_by"=>{"type"=>"illegal_argument_exception", "reason"=>"Failed to parse value [{format=false}] as only [true] or [false] are allowed."}}}}}}
Solution - Change the value of the "fielddata" attribute in the index mapping. If you have been running ELK since version 2, your "fielddata" value may have been set as:

"fielddata": {
   "format": "disabled" 


}                       


This is not supported in version 6. Change the value as follows:

"fielddata": true





Tuesday, January 23, 2018

Apache Kafka InvalidReplicationFactorException

When creating a topic, Kafka may throw this exception.

Scenario

If we try to create a topic with a replication factor larger than the number of brokers in the cluster, we will see an error.
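For example, asking for three replicas on a single-broker cluster fails. The topic name and ZooKeeper address below are placeholders (--zookeeper was the topic-creation syntax in this era of Kafka):

bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 3 --partitions 1 --topic my-topic

The same command with --replication-factor 1 succeeds.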


Exception

Apache Kafka throws an InvalidReplicationFactorException; its message states that the requested replication factor is larger than the number of available brokers.




Resolution

Ensure that the replication factor is less than or equal to the number of brokers in the cluster.

Kafka Exception - kafka.common.InconsistentBrokerIdException

I came across this exception when I was working with Apache Kafka.

Scenario

Set up multiple Kafka brokers in a cluster, copying server.properties to create a new properties file for the new broker. I was following the instructions here: https://kafka.apache.org/documentation/#quickstart_multibroker

Result

The first broker started without error. When I started the second broker, I saw this exception - kafka.common.InconsistentBrokerIdException.


Resolution

I found that the cause was that, when I copied the properties file, I had missed changing the "log.dirs" property. If two brokers share the same "log.dirs" value, we get this error. I restarted the second instance with a unique value for the "log.dirs" property, and the issue was resolved.
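For reference, these are the per-broker overrides from the quickstart; every copied properties file needs its own values for all three (the port and path shown are the quickstart defaults):

broker.id=1
listeners=PLAINTEXT://:9093
log.dirs=/tmp/kafka-logs-1

If log.dirs points at a directory already used by another broker, the meta.properties file stored there carries that broker's id, which is what triggers the InconsistentBrokerIdException.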


Thursday, December 8, 2016

What is Hadoop?

What is Hadoop? What are the basic components that make up a Hadoop (Big Data) system?

In the simplest terms, these are the components that a Hadoop ecosystem contains.

Hadoop Distributed File System (HDFS)


A Java-based, distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster. It stores and processes vast quantities of data in a storage layer that scales linearly, and it is designed to run on low-cost commodity hardware.

Hadoop YARN


A resource-management platform responsible for managing computing resources in clusters and using them to schedule users' applications. It provides resource management and a pluggable architecture that enables a wide variety of data access methods to operate on data stored in Hadoop with predictable performance and service levels.

Hadoop MapReduce


A programming model for large-scale data processing.
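As a conceptual sketch (plain Python rather than the Hadoop Java API), word count in the MapReduce model looks like this:

from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Map: emit a (word, 1) pair for every word in the line
    for word in line.split():
        yield (word, 1)

def reducer(word, counts):
    # Reduce: sum the counts gathered for one word
    return (word, sum(counts))

lines = ["the quick brown fox", "the lazy dog"]
pairs = [kv for line in lines for kv in mapper(line)]
pairs.sort(key=itemgetter(0))  # the shuffle/sort phase groups pairs by key
result = [reducer(word, (c for _, c in group))
          for word, group in groupby(pairs, key=itemgetter(0))]
print(result)  # [('brown', 1), ('dog', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 2)]

Hadoop runs the map and reduce steps in parallel across the cluster, with HDFS providing the input splits and YARN scheduling the work.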

Hadoop Common 


Libraries and utilities needed by other Hadoop modules.



Reference:


https://github.com/hortonworks/tutorials/blob/hdp-2.5/tutorials/hortonworks/hello-hdp-an-introduction-to-hadoop/hello-hdp-section-2.md#13-apache-hadoop


Course Review - Apache Spark Streaming with Scala

I recently completed a course on Apache Spark Streaming with Scala. It was a great learning experience and gave me the skills necessary to complete a complex Big Data prototype, which involved reading a stream of data and processing it with the Apache Spark framework.


Where is this course?


What does it cover?

   This course covers the Apache Spark Streaming module. While the Spark framework is the foundation for Big Data processing, its Streaming module is used for processing streaming data. The Streaming module supports both Python and Scala; this course is based on Scala. There is an introductory section that explains Scala basics, and the rest of the course explains how the Spark Streaming module works. A bonus feature is the lesson on integrating Apache Kafka with Spark Streaming.

   The hands-on nature of the course is great. No lesson exceeds 15 minutes, which makes learning easier even if you have only a short time each day.


What is the cost?

   The cost varies. It is not free.


How long is this course?

   The course duration is 6 hours. It took me about 30 hours (learning 1 hour every day) to complete the course.


What are the prerequisites?

   The learner must be familiar with Apache Spark and Scala. It would be ideal to have at least 5 years of programming experience and database knowledge before starting this course. Basic familiarity with UNIX commands or another CLI is needed. Basic knowledge of Java/Scala programming would be helpful, though there are a few introductory lessons on Scala.

How will it benefit the learner?

  You will gain enough knowledge to work as an advanced Spark/Scala developer on Big Data projects.