Introduction to Elasticsearch

Elasticsearch is the most popular, open-source, cross-platform, distributed and scalable search-engine based on Lucene. It is written in Java and released under the terms of the Apache License.

Elasticsearch is developed alongside Logstash and KibanaLogstash is a data-collection and log-parsing engine while Kibana is an analytics and visualization platform. The three products combined are referred to as the Elastic Stack (formerly known as the ELK stack). They are developed to provide integrated solutions and are designed to be used together.

The data stored in Elasticsearch is in the form schema-less JSON documents; similar to NO-SQL databases. You can communicate with the Elasticsearch server through an HTTP REST API; the response generated will be in the form of JSON object. Elasticsearch is designed to take chunks of big-data from different sources, analyze it and search through it. It is optimized to work well with huge data set; the searches happen very quickly i.e., almost near real-time!

To understand Elasticsearch in detail; we need to understand its core concepts and terminologies.

We will go through each one of them in brief:

Near real-time

In Elasticsearch the data is distributed and stored in different clusters. So when a change is made on an index it may not be readily available; a latency of 1-2 seconds is expected! Contrary to this, when a change is made, it is propagated instantly in a relational database as they are deployed on a single machine. We can live with this slight delay as it is due to it’s distributed architecture; it is required to make it scalable and robust.

At the end of this post; you will get the clear picture of what happens internally and why this latency is expected!

Cluster

A cluster is a group of Elasticsearch servers. These servers are called nodes. Depending upon the use-case and scalability preferences; a cluster can have any number of nodes. Each node is identified by a unique name; all the data is distributed amongst these nodes which in turn is grouped into different clusters. A cluster allows you to index and search the stored data.

Node

A node is a server; it is a single unit of the cluster. If it is a single node cluster; all data will be stored in that single node; else the data will be distributed amongst n nodes which are the part of that cluster. Nodes participate in a cluster’s search and indexing capabilities. Depending upon the type of query fired; they will collaborate and will return the matching response.

Index

An index is a collection or grouping of documents; the index has a property called type. In relational database terms; an index is something like database and type is something like a table. This comparison may not be always true because it very much depends on how you design your cluster, but in most cases, it will hold true.

Any number of indexes can be defined; just like node and cluster, an index is identified by a unique name; this name should be in lower-case.

Type

Type is a category or a class of similar documents; as explained in the above paragraph; it comes close to a table in relational database terms. It consists of a unique name and mapping. Whenever you query an index; Elasticsearch reads _type from a metadata file and applies a filter on this field; Lucene internally has no idea of what the type is! An index can have any number of types, and types can have their own mapping!

Mapping

Mapping is somewhat similar to the schema of a relational database. It is not mandatory to define mapping explicitly; if a mapping is not provided Elasticsearch will add it dynamically based on its data when the document is added. Mapping generally describes the field and its datatype in a document. It also includes information on how to index and store fields by Lucene.

Document

A document is the smallest and most basic unit of information that can be indexed. It consists of key-value pairs; the values can of datatype string, number, date, object etc. An index can have any number of documents stored within it; in object-oriented terms, a document is something like an object. It is in the form of JSON. In relational database terms; a document can be thought of as a single row of a table.

Shards

An index can be divided into multiple independent sub-indexes; these sub-indexes are fully functional on their own and are called shards. They are useful when an index needs to have more data than the hardware capability of a node(server) supports; for example, 800 GB data on 500 GB disk!

Sharding allows you to horizontally scale by volume and space; it enhances the performance of a cluster by running parallel operations and distributing loads across different shards. By default; Elasticsearch adds 5 primary shards for an index. This can be manually configured to suit your requirements.

Replica

A replica is a copy of an individual shard. Elasticsearch creates a replica for each shard; the replica and the original shard never reside on the same node.

Shards and Replicas - Image
This image is downloaded from google; copyright infringement is not intended.

Replica comes into picture when nodes in a cluster fail, shards in a node fail or a spike in read-throughput is encountered; replica promises the high availability of the data in such situations. When a write query is fired; the original shard is updated first and then the replicas are updated with some overlying latency. But read queries can run in parallel across replicas; this will improve the performance of read operations overall.

By default; a single copy of each primary shard is created, but a shard can have more than one replicas in some special cases.

Inverted-Index

It is a unique type of data structure that Lucene uses to make huge dataset readily searchable. Inverted-Index is a set of words, phrases or tokens associated with different documents to allow full-text-search. In simple terms, an inverted index is something like an appendix page at the end of the book; it will have mappings of words to documents.

Inverted Index Image
This image is downloaded from google; copyright infringement is not intended.
Segment

Each shard consists of multiple segments; these segments are nothing but inverted-indexes; which will search in parallel, get results and combine them in the final output for that particular shard.

ES Architecture Image
Visual representation of Internal ES Architecture

As and when the documents are indexed; Elasticsearch writes it to new segments, refreshes the search data and updates transaction logs. This happens very frequently to make data in new segment visible to all queries. Elasticsearch is not meant for updates and delete, so if data needs to be deleted or updated it actually just marks the old document as deleted, and indexes a new document. The merge process also expunges these old deleted documents. Elasticsearch constantly merges similar segments into common big segments in the background; querying too many small segments is not very optimum. After the bigger segment is written; the smaller segments are dropped and log files are again updated to reflect new changes.

It may seem complicated.. But

You don’t have to deal with the internal working of Lucene and ElasticSearch as it is abstracted; you just have to configure clusters with the right number of nodes and create indexes with appropriate mappings! Everything else is done internally. Several organizations like IBM, AWS, Searchly, Elastic Cloud etc., offer Elasticsearch as a managed service; so you don’t have to worry about managing servers, doing deployments, taking backups etc. It will take care of these things for you to save your time and effort to operate these servers.

This post was meant to cover basics of ElasticSearch and a brief idea of how it works internally. I hope that I have done justice to it. In my next post; I aim to cover ‘How to query on Elasticsearch index using Kibana?’.

Stay tuned.

Share this blog

A layman’s guide to Attribution Models

The understanding of attribution models requires a basic knowledge of some digital marketing jargons and concepts. Let us quickly go through some of them.

Ad

A short description of a product or an idea or an event aimed to create an awareness for a particular brand, strategically.

Campaign

A collection of ads through which an agency wants to convey an idea or a message to the users. It generally has a theme around it; aimed at creating an appeal for a set of products to target users.

View or Impression Event

Whenever a user sees an ad; a view or an impression event is triggered.

Click Event

Whenever a user clicks on an ad; click event is triggered.

Conversion Event

Whenever a user makes some sort of transaction; i.e., buying a product, downloading an app, filling a form etc., conversion event is triggered.

User Journey

It is a series and timeline of views and clicks which led to a particular conversion. It can also be termed as the path to conversion.

To visualize everything, consider this example:

I want to buy a mobile phone; so I randomly googled ‘best mobile phone under 30k’; google showed an ad for a device called Moto Z play. I clicked on it and went through its specs but I decided against buying it due to some reason.

Now that google knows my history; it will keep showing me relevant ads even on some third party sites. I encountered one of those ads but didn’t click on it. After a week or so; an ad popped up on my facebook newsfeed that there is a 20% discount on Moto Z play on amazon, at this moment I decided to buy. So I clicked on this facebook ad and was about to buy this product from amazon; but before that, I checked for discount coupons on Coupon-Dunia, and finally, I bought the product for 25k.

User journey for this particular conversion will be something like this: Click1 (Google, Campaign1, Ad1), View1 (Display, Campaign2, Ad2), View2 (Facebook, Campaign3, Ad3), Click3 (Facebook, Campaign3, Ad4), View3 (Display, Campaign4, Ad5) -> Conversion (Amazon, Revenue:25000)

Now that we have a basic idea about ads, campaigns, impressions/views, clicks, conversions and user journey; we can dive deeper into attribution.

What is attribution?

In simple terms, the process of identifying which ad/ campaign led to conversion is called attribution. Attribution is necessary to understand how your ads/campaigns are performing on different search, social, and display networks; using this insight you can plan the budget allocation and targeting rules for different campaigns. It gives a clear picture in terms of ROI and lead generation.

Attribution can be performed using different models; these models can be categorized into two types: 1) Single-Touch Models and 2) Multi-Touch Models

Single-Touch Models

The philosophy of single-touch models is very simple; attribute the conversion to only one event; either the first event or the last event. These are generally ‘click only‘ models; views are not given credit for the conversions.

First-Touch (First-Click)

The entire credit for the conversion will go to the first event in this model. In our case, if we consider the above user journey; entire conversion weight and revenue will be distributed to Click1 (Google, Campaign1, Ad1).

Last-Touch (Last-Click)

The entire credit for the conversion will go to the last event in this model. In our case, if we consider the above user journey; entire conversion weight and revenue will be distributed to Click3 (Facebook, Campaign3, Ad4). If you look carefully at our user journey; the last event was View3 (Display, Campaign4, Ad5) but as this is a ‘click only‘ model we will give credit to the last click.

So this is all about single-touch models; they are very easy to understand and fairly easy to implement. It makes sense when your use case is simple and you are using singular campaigns on a fixed network. It is not reliable when multiple engines and multiple campaigns are involved; the weight and revenue distribution, in this case, will not give you correct insight on your ad/campaign performance vs spending!

Multi-Touch Models

Multi-Touch models credit each and every event (both clicks and views) that leads to conversion. Weight distribution varies from model to model, but it validates the contribution of each event for conversion!

There are four commonly used multi-touch models: 1) Linear 2) U-Shape 3) Time-decay 4) Custom. We will go through them one by one in detail.

Linear

In Linear model, weights are evenly distributed amongst all the events. It gives an equal pie of revenue to each ad/campaign which contributed to conversion.

Linear model weight distribution with graph

In our case, there are total 5 events; two clicks and three views, each of these events will get an equal weight and the corresponding ad/campaign will get fifth part of revenue!

U-Shape

In U-shape model, weights are assigned in the ratio of 40-20-40; it means that the first and last event will get 40% weight and the remaining events will get an equal distribution of 20% weight.

U-Shape model weight distribution with graph

In our case, Click1 (Google, Campaign1, Ad1) and View3 (Display, Campaign4, Ad5) will get 40% each. View1 (Display, Campaign2, Ad2), View2 (Facebook, Campaign3, Ad3) and Click3 (Facebook, Campaign3, Ad4) will get (20/3)% each!

This model is quite unique as compared to above models; it gives more credit to edge events, edge events are the ones which actually starts or stops a user journey. So from a campaign strategy perspective, these are the events which actually called for an action in the real sense.

Time-decay

In Time-decay model, the weight of the event is inversely proportional to the time difference between the event date and conversion date. In other words, the event closer the conversion will get higher weights and the event further to the conversion will get lesser weights.

Time-decay model weight distribution with graph

In our case, Click1 (Google, Campaign1, Ad1), View1 (Display, Campaign2, Ad2), View2 (Facebook, Campaign3, Ad3), Click3 (Facebook, Campaign3, Ad4), View3 (Display, Campaign4, Ad5) will get credit in increasing order; i.e., Click1<View1<View2<Click3<View3.

From a theoretical viewpoint, Time-decay seems to be the most sensible weight distribution mechanism! But in reality, the model selection depends on the type of business, type of product, and use-case of the client. For some businesses, U-Shape will make more sense while for other Time-Decay will be more preferable.

Custom

There are times when a certain business solution or strategy would require meticulous analytics information. U-Shape and Time-Decay are used widely for attribution but they may not be enough! At this point, if we want to dive one level deeper we can design a custom model, which may assign weights according to the type of events, campaigns and ads, it may also consider the timeline of the events.

Sometimes there are events which do not lead to conversion; these are called negative paths. We can use this data to design a custom algorithm which assigns and updates weights at entity (engine, campaign, ad) levels depending upon the event type (click or view).

There are some engines/ channels which are not that important in a user journey; it doesn’t matter whether this engine is a part of the path or not! In our case, last event View3 (Display, Campaign4, Ad5) was triggered because as a user I tried to find discount coupons on Coupon-Dunia; this event is relevant but doesn’t add value to the path because conversion would still happen even if this event is not present in the path! It still gets higher weights in Time-Decay and U-Shape. These discrepancies can be removed in the custom algorithmic model after we have enough data to find and connect these invisible dots!

Time-decay vs U-Shape vs Linear

To wrap things up, it is really difficult to say that a particular model is 100% perfect and efficient; you need to try and figure out which model works better for you. Ideally to derive optimum insights; a combination of two or three models should be tried and this data should be then used to design a custom algorithmic model to cater your needs!

Stay tuned for more insights!

Share this blog

How to use PIVOT in SQL SERVER for more than one columns?

Hi, folks! As a developer, I have to query SQL Server on regular basis but we hardly use PIVOT in our day to day lives. Even if we use PIVOT; we generally use it for a single column.

Today, I had to generate a report for one of our clients; according to their business requirement, I had to show them aggregated data which was dynamically calculated by a group by query.

Sample data after the above step looked something like this,

ModelDateEngineRevenueConversions
First Click2/22/2017bing2388103.292649
First Click2/22/2017facebook289642387
First Click2/22/2017google8199735.056566
Last Click2/22/2017bing2534385.592903
Last Click2/22/2017facebook304360380
Last Click2/22/2017google7884433.756148
Weighted2/22/2017bing2522729.0292930.342273
Weighted2/22/2017facebook300418.4666394.102303
Weighted2/22/2017google8282122.8416526.552723

 

I wanted this to be in the format below,

 

DateSearchEngineFirstClickConversionLastClickConversionWeightedConversionFirstClickRevenueLastClickRevenueWeightedRevenue
2/22/2017bing264929032930.3422732388103.292534385.592522729.029
2/22/2017facebook387380394.102303289642304360300418.4666
2/22/2017google656661486526.5527238199735.057884433.758282122.841

If you look carefully, all these are dynamically calculated columns. So, I needed to PIVOT on ‘Conversion’ and ‘Revenue’ columns simultaneously. I tried to surf around the internet hoping to find a quick workaround. I went through a couple of accepted solutions on stack overflow, but they were pretty complex to understand and implement.

I thought of a simple JUGAAD to implement this; I stored the results of pivot1 and pivot2 into tempTable1 and tempTable2 respectively then I used an inner join to get the aggregated transposed result which I required!

SELECT * INTO #tempTable1
FROM
 (SELECT CASE
 WHEN modelid = 1 THEN 'FirstClickRevenue'
 WHEN modelid = 2 THEN 'LastClickRevenue'
 WHEN modelid = 3 THEN 'WeightedRevenue'
 END AS ModelType,
 convert(date,conversiondate) AS Date,
 campaign,
 SUM(revenue) AS Revenue
 FROM SOMETABLENAME
 WHERE conversiondate >= @FromDate
 AND conversiondate <= @ToDate
 GROUP BY modelid,
 convert(date,conversiondate),
 campaign) AS s PIVOT (sum(revenue)
 FOR [ModelType] IN (FirstClickRevenue,LastClickRevenue,WeightedRevenue))AS pvt1

SELECT * INTO #tempTable2
FROM
 (SELECT CASE
 WHEN modelid = 1 THEN 'FirstClickConversion'
 WHEN modelid = 2 THEN 'LastClickConversion'
 WHEN modelid = 3 THEN 'WeightedConversion'
 END AS ModelType,
 convert(date,conversiondate) AS Date,
 campaign,
 SUM(weight) AS Conversions
 FROM SOMETABLENAME
 WHERE conversiondate >= @FromDate
 AND conversiondate <= @ToDate
 GROUP BY modelid,
 convert(date,conversiondate),
 campaign) AS s PIVOT (sum(conversions)
 FOR [ModelType] IN (FirstClickConversion,LastClickConversion,WeightedConversion))AS pvt2
ORDER BY date,campaign

SELECT p1.*,
 p2.FirstClickRevenue,
 p2.LastClickRevenue,
 p2.WeightedRevenue
FROM #tempTable2 p1
INNER JOIN #tempTable1 p2 ON p1.date=p2.date
AND p1.campaign=p2.campaign
ORDER BY p1.date,
 p1.campaign

It’s a simple solution; a workaround for using PIVOT in special conditions like these! Nothing fancy but it efficiently serves the purpose; I hope it helps someone in need! 🙂

Here is the link to Microsoft documentation on using PIVOT.

Cheers! Happy coding and happy querying!

Share this blog

What is stopping you from being a KICKASS Developer?

This blog is about how to kick asses and how not to get our asses kicked!  I have tried to compile a list of things which may stop you directly or indirectly from being a kickass developer and hinder your desire to reach your full potential.

Many blogs will try to motivate you and explain to you what to do but I am here to warn you by sharing what not to do! (most of the things from my own learnings and personal experience :P)

  • Not planning your progress

Sometimes we are limited by our thoughts, ideologies, fears and inherent beliefs! Every developer has this phase when he is neither a senior and nor a junior anymore. This is the time when he has to plan things for himself. Once a wise guy suggested me that, you should not try to achieve each and everything simultaneously; pick your battles, fight them, struggle and shine! These battles should be carefully chosen; they should prepare you for BIGGER battles which you will face in near future some day!

some meme found over the internet!
some meme found over the internet!

You should accept that Rome was not built in a day!  If you try to do too many things in a single shot; you will eventually fail and feel demotivated. Set a target for yourself; plan a progressive scheme for yourself; divide the bigger task into smaller doable subtasks and then conquer! Slogging hard randomly will work for you on some occasions but if you really want to achieve something big in long run then plan your work; play your career like a Test match rather than a T20.

  • Just working on office stuff

We all work on tight schedules; there is always a possibility that we might have to work on office stuff or commitments even after office hours. Try to define this complex term called office hours.

It is always necessary to have a good work ethic but you should be selfish sometimes in terms of what you do after these office hours. This is your personal time; your sacred shell which can only be utilized for your personal goals.

This time can be utilized for developing your hobby, hitting a gym, playing recreational sport, biking or taking an online course to add or enhance a skill! This is very important because all round development is very important in long run. For example; if you are hitting a gym you should work on chest, biceps, legs, shoulders etc equally!

Each of your personal/ professional skill will connect the dots in your personality; they will complement each other and after some time eventually, you will be more groomed, confident and focused.

  • Not failing enough / not breaking things

What is more damaging than failure; it’s your fear to fail! When we get something good or when we become better as a person; we somehow become defensive of what we have achieved! This is a trap! Don’t fall into this trap! Failure is the greatest teacher; don’t fear to fail rather learn from it. You should know WHAT NOT TO DO in long run? 😛

breaakthings
some motivating shit from zuck!

Don’t care if people judge you or laugh at you! Do what you like and learn from your mistakes.

Shake, break and make something fruitful out of the situation!

  • Not having a perspective on anything / Not knowing things

Having a perspective on anything indirectly means that you know about something, you have tried learning that thing or at least heard about that thing from someone else.

Whenever a new task is assigned in terms of software development; managers and CTO’s look for the guys who have tried something on their own or someone who has at least some sort of the idea behind the new feature. So it’s important to know things and having an independent opinion. It is a quality of a thinking developer.

knowthings
I like him! So I am putting him on my blog for no apparent reason!
  • Comparing yourself with others (in a negative way)

It’s okay to feel intimidated sometimes; there will be always someone who knows more than you and do things better than you. Your aim is to bridge the gap and compete with yourself.

Try to outdo yourself every day; try to be better than what you were yesterday. Eventually, you will harness that confidence and skill set which will make you outstanding amongst others. Don’t get demotivated if you performed bad or if someone is better than you. Just keep working on yourself, and keep believing that you will shine one day.

  • Not challenging yourself enough

Sometimes we get too comfortable with our situation, surroundings and our own self! There is a saying that,

The one who sweats more in peace; bleeds less in war.

Life, job, and circumstances will always challenge you from time to time; you need to be ready for this war. Prepare yourself for this battle by practicing stuff when you have time; sharpen your swords, add some new weapons in your artillery, work on your stamina and be war-ready.

In the current context, try learning new programming languages, work on random algorithmic problems, publish apps, read and write blogs; all these things will add on to your knowledge, skill set, and confidence.
Try to achieve something complex; it’s okay if you fail. Keep improving; keep working on yourself!

challenge
some more motivation!
  • By allowing your personal life to affect your professional goals

On an average, most humans are irrational and emotional fools (unlike robots and aliens)! Our lives, emotions, mental well-being, physical strength, ability to make decisions, etc. is synchronized with each other.
If even one of these things goes wrong, it will directly or indirectly affect our performance and confidence. It’s hard to compartmentalize everything and keep everything disassociated.

giphy
Get your fucking shit together bro!

There will be times when things will be topsy-turvy; at this time you need to take a step back, calm down, realign your goals by prioritizing most important battles. Once you do this, back yourself, start working and hope for the best!

Buck up.. bitches!!!
Life is this… I like this! – Harvey Specter

Don’t deem yourself unworthy of anything; go for what you deserve. Don’t let the mediocrity creep into your lives! Get out in the real world and kick some asses (literally :P)

Share this blog

Why Capital Punishment is necessary?

*Image is used just for representation purpose from google search. Copyright infringement is not intended if any.

We live in a socialized, economy driven democracy. We call ourselves civilized; we take pride in our rich cultural heritage and yet we see lots of crime committed every hour in each and every part of this country. Why these criminals commit these crimes without any fear and guilt? What boosts their confidence when they commit any crime? Why don’t they think before they commit any crime?  The answer is simple – lack of strict rules and lack of ability to implement them. I would like to discuss the idea of capital punishment here without politicizing it or connecting it with a particular incident/ religion/ person/ crime.

I strongly believe in philanthropy, my parents have taught me that love and forgiveness are the best virtues of a good human-being. Mistakes are part of a person’s life, every person commit mistakes. If a person’s mistake is not severe and if it doesn’t affect or hinder other person’s life, health and property, then the person must be given a chance to ameliorate himself but if a person commits an unpardonable offence like terrorism, rape, murder etc. then he/she must be punished severely.

Many people believe in ‘An eye for an eye and a tooth for a tooth’; but I am not one of them. I believe in justice more than revenge. Nowadays the rate, intensity and severity of crimes has increased to an alarming level in our society. Every day in newspaper, we read about cases of kidnapping, cold blooded murders, acts of terrorism, cases of sexual harassment etc. So, it is the ask of the time to strictly implement capital punishment to suppress the increasing level of criminal activities occurring in the society.

Hundreds of criminals are caught, tried and put into jails; most of them serve their time and are freed. It is important to observe them after they are freed. Some of these are habitual offenders; they can commit crimes even after they are punished/ penalized. In the cases of rape, terrorism, murder; capital punishment may not be 100% ethical but can be justified.

We are a bunch of religious morons; so I will explain this in a religious context. The Garudpuranas of Hindus and the Bible of Christians also suggest that punishments are must for a sin, and their strictness should be decided depending on the nature of sin committed. I would like to mention the punishments for the seven deadly sins as depicted in the bible: pride – a person is broken into pieces on the wheel, envy – a person is freezed to death in cold water, gluttony – a person is forced to eat rats and snakes, lust – a person is smothered in fire, wrath– a person will be dismembered alive, greed – a person will be boiled alive in oil and sloth – a person is thrown into snake pits.

Well, we should not implement everything from these mythological books; but the point I would like to prove here is, the capital punishment is simply a society’s way of concluding that a particular individual deserves to die because of the severity of crimes committed by that person. The punishment should be justified by the nature of crime committed. It is about setting a right example for others, if a person is pardoned for his crime or guilt, then it might encourage some egocentrics to commit same crimes for satisfying their selfish motives.

Ideally we would want to fix this criminal ideology right from the root of its inception through ethical and moral upbringing, proper education, social awareness and discipline. But we have already missed this part and now we are at that stage where we will have to unfortunately instigate fear of punishment for stopping these heinous crimes.

I would like to conclude saying that, capital punishment may not be ethically acceptable but it is axiomatically desirable!  Jay Hind!

Share this blog