Exploring graph databases for capturing the User Journey

Before diving into the actual topic, it is important to have some understanding of graph databases.

What is a Graph Database?

A graph consists of two things: 1) vertices and 2) edges. Vertices represent entities and edges represent the relationships between those entities. A graph database is a NoSQL database which stores data in the form of a graph; it is a highly efficient way to store data, as you don’t require complex joins to fetch data at runtime. In a graph database, you can directly traverse through different vertices (objects) in any direction using the edges (relationships) between them. This process is called traversal.

It has a clear edge over traditional databases in terms of database design and modelling, data ingestion and retrieval of data involving many-to-many relationships. Vertices and edges can have independent properties, stored as key-value pairs. Some graph databases need a schema specification with datatypes and labels; most allow you to manipulate or ingest data without a fixed schema.

A graph database allows you to traverse millions of nodes and access specific information by using properties to query only the part of the data that satisfies the query’s conditions; the part of the data that doesn’t match the query pattern is never accessed. This makes it a very fast and straightforward way to access aggregated data based on relationships.
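To make this concrete, here is a minimal sketch of such a property-filtered traversal using the Gremlin query language (supported by several of the graph databases mentioned later in this post, such as AWS Neptune and Azure Cosmos DB) through the gremlinpython client. The endpoint, labels and property names (user, performed, eventType) are illustrative assumptions, not any particular product’s schema.

```python
# Hedged sketch: traverse from a user vertex to the events it performed,
# keeping only the vertices whose properties match the query.
# Endpoint, labels and property names are illustrative assumptions.
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

def sale_events_for_user(user_id, endpoint="ws://localhost:8182/gremlin"):
    g = traversal().withRemote(DriverRemoteConnection(endpoint, "g"))
    # Start at the matching 'user' vertex, walk its outgoing 'performed'
    # edges, and return only the 'sale' events; everything else stays untouched.
    return (g.V()
             .hasLabel("user").has("userId", user_id)
             .out("performed")
             .has("eventType", "sale")
             .valueMap()
             .toList())
```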

Graph databases are very popular in domains like fraud detection, asset management and social networks. They can also be used to capture the online events of a user on a website or a mobile app. The goal is to build a user's journey by tracking all events/activities like ad-click, ad-impression, page-view, add-to-cart, sale, etc., and to efficiently access important information from the event properties when required.

There are many graph database products available in the market; some of them are fully managed and some are self-hosted. The popular options are: 1) Azure Cosmos DB 2) AWS Neptune 3) Neo4j 4) IBM Graph and 5) DataStax Graph. Of these, Azure Cosmos DB and AWS Neptune are newer entrants and are fully managed cloud-based solutions.

Let’s see if a graph database can solve our problem and help us capture the user journey!

What is a User Journey?

It is the series and timeline of views, clicks and other custom events on a website or mobile application that led to a particular conversion, sale or lead submission. It can also be termed the path to conversion.

To visualize this, consider this example:

I wanted to buy a mobile phone, so I googled ‘best mobile phone under 30k INR’; Google showed an ad for a device called the Moto Z Play. I clicked on it and went through its specs, but decided against buying it for some reason.

Now that Google knew my history, it kept showing me relevant ads even on third-party sites. I encountered one of those ads but didn’t click on it. After a week or so, an ad popped up in my Facebook newsfeed saying there was a 20% discount on the Moto Z Play on Amazon; at this moment I decided to buy. So I clicked on this Facebook ad and was about to buy the product from Amazon, but before that I checked for discount coupons on CouponDunia, and finally I bought the product for 25k from Amazon.

The user journey for this particular conversion will be something like this: Click1 (Google, Campaign1, Ad1), View1 (Display, Campaign2, Ad2), View2 (Facebook, Campaign3, Ad3), Click2 (Facebook, Campaign3, Ad4), View3 (Display, Campaign4, Ad5) -> Conversion (Amazon, Revenue: 25000, CartItem: MotoZPlay, No. of Items: 1)

Modeling the incoming stream of events

We have two main entities: 1) Users and 2) Events. Both of these will be represented by nodes, and there will be an edge connecting them called the ‘performed’ edge. We will have one more edge representing the relationship ‘previous’; so if event1, event2 and event3 are the events performed by a user at times t1, t2 and t3 respectively, where t1 < t2 < t3, then the graph of that user would look as depicted in the diagram shown below.

User-Event Graph Model

There are two possibilities for linking events: 1) using a ‘next’ edge from event1 to event2, or 2) using a ‘previous’ edge from event2 to event1. There is one problem with the first approach: when event1 is inserted at time t1, we don’t yet have event2 in the system! Event2 will only be inserted at time t2, where t2 > t1. It makes more sense to add a ‘previous’ edge from event2 to event1 whenever event2 occurs, at time t2. So, the logic for ingesting a new event node will also check whether there are any events prior to the incoming event for the same user. If there are, it will take the newest of them and add an edge from the incoming event to that newest event. So: 1) when event1 arrives, no ‘previous’ edge is created, as it is the first event in the system; 2) for event2, it will find event1 and add an edge to it; and 3) for event3, it will find event1 and event2, but it will choose event2, as it is the newest event currently in the system, and will add a ‘previous’ edge to it. This approach gives us flexibility and an easy way to traverse from any user node through the events associated with it.
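As a minimal, database-agnostic sketch of this ingestion rule (an in-memory model with made-up names; a real implementation would issue the equivalent lookup and edge-insertion queries against the graph database):

```python
from dataclasses import dataclass

@dataclass
class Event:
    event_id: str
    user_id: str
    timestamp: int                     # epoch millis
    previous: "Event | None" = None    # the 'previous' edge, if any

# Newest event seen so far per user; stands in for the
# "find the newest prior event for this user" lookup in the graph database.
latest_event_by_user: dict[str, Event] = {}

def ingest_event(event: Event) -> None:
    prior = latest_event_by_user.get(event.user_id)
    if prior is not None:
        # Add a 'previous' edge from the incoming event to the newest existing one.
        event.previous = prior
    # The incoming event becomes the newest event for this user.
    latest_event_by_user[event.user_id] = event
```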

Now that we have decided on our model, let’s plot the Moto Z Play user-journey example accordingly. There are 6 events of different types; their tabular representation and graph model would look something like this.

MotoZ Play – User Journey – Tabular Representation
User Journey Graph Model

Event analytics and tracking of a user journey require a new way of storing and querying data. I have worked extensively on the tracking and attribution side of things; initially, we had opted for SQL Server as the backend to store the data, but a relational database requires multiple joins and heavy table scans to fetch the complicated breakdowns of data needed for event filters. Also, the event database can be huge (in GBs/TBs). With our learnings from traditional databases, and after going through a rigorous trial-and-error phase, we came up with the above graph model.

We have been testing AWS Neptune and Azure Cosmos DB for the last 3 months, and we are astonished by the possibilities and performance of graph databases. It’s too early for me to say that a graph database is a silver bullet for our requirements, but it has been a great learning experience as a whole. I will publish a new blog post soon on how to ingest and query graph databases for the complicated outputs needed by dashboard charts and segmentation filters.

Stay tuned.


Introduction to Elasticsearch

Elasticsearch is a popular, open-source, cross-platform, distributed and scalable search engine based on Lucene. It is written in Java and released under the terms of the Apache License.

Elasticsearch is developed alongside Logstash and Kibana. Logstash is a data-collection and log-parsing engine, while Kibana is an analytics and visualization platform. The three products combined are referred to as the Elastic Stack (formerly known as the ELK Stack). They are developed to provide integrated solutions and are designed to be used together.

The data stored in Elasticsearch is in the form of schema-less JSON documents, similar to NoSQL databases. You communicate with the Elasticsearch server through an HTTP REST API; the response comes back as a JSON object. Elasticsearch is designed to take chunks of big data from different sources, analyze them and search through them. It is optimized to work well with huge datasets; searches happen very quickly, i.e. in near real-time!
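As a minimal sketch, assuming an Elasticsearch server running locally on port 9200 and a made-up ‘products’ index, indexing a document and searching it over the REST API looks roughly like this:

```python
import requests

ES = "http://localhost:9200"   # assumed local Elasticsearch endpoint

# Index a schema-less JSON document; Elasticsearch infers a mapping on the fly.
# (Older versions use an explicit type name in the URL instead of _doc.)
doc = {"title": "Moto Z Play", "price": 25000, "added_on": "2018-06-01"}
print(requests.put(f"{ES}/products/_doc/1", json=doc).json())

# Full-text search over the same index; the response is again a JSON object.
query = {"query": {"match": {"title": "moto"}}}
print(requests.post(f"{ES}/products/_search", json=query).json()["hits"])
```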

To understand Elasticsearch in detail, we need to understand its core concepts and terminologies.

We will go through each one of them in brief:

Near real-time

In Elasticsearch, the data is distributed and stored across different nodes. So when a change is made to an index, it may not be visible immediately; a latency of 1-2 seconds is expected! In contrast, when a change is made in a relational database deployed on a single machine, it is visible instantly. We can live with this slight delay, as it comes from the distributed architecture that makes Elasticsearch scalable and robust.

By the end of this post, you will have a clear picture of what happens internally and why this latency is expected!

Cluster

A cluster is a group of Elasticsearch servers; these servers are called nodes. Depending upon the use-case and scalability preferences, a cluster can have any number of nodes. Each node is identified by a unique name, and all the data is distributed amongst the nodes of the cluster. A cluster allows you to index and search the stored data.

Node

A node is a server; it is a single unit of the cluster. In a single-node cluster, all data is stored on that one node; otherwise, the data is distributed amongst the n nodes that are part of the cluster. Nodes participate in a cluster’s search and indexing capabilities; depending upon the type of query fired, they collaborate and return the matching response.

Index

An index is a collection or grouping of documents; an index has a property called type. In relational database terms, an index is something like a database and a type is something like a table. This comparison may not always hold, because it depends very much on how you design your cluster, but in most cases it is a fair analogy.

Any number of indexes can be defined; just like nodes and clusters, an index is identified by a unique name, and this name must be lower-case.

Type

A type is a category or class of similar documents; as explained in the paragraph above, it comes close to a table in relational database terms. It consists of a unique name and a mapping. Whenever you query an index, Elasticsearch reads the _type metadata field and applies a filter on it; Lucene internally has no idea what a type is! An index can have any number of types, and each type can have its own mapping!

Mapping

Mapping is somewhat similar to the schema of a relational database. It is not mandatory to define a mapping explicitly; if a mapping is not provided, Elasticsearch will add one dynamically, based on the data, when a document is added. A mapping generally describes the fields in a document and their datatypes. It also includes information on how the fields should be indexed and stored by Lucene.
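Here is a hedged sketch of defining an explicit mapping while creating an index (the index and field names are made up, and depending on the Elasticsearch version the mapping may need to be nested under a type name):

```python
import requests

ES = "http://localhost:9200"   # assumed local Elasticsearch endpoint

# Create an index with an explicit mapping instead of relying on dynamic mapping.
mapping = {
    "mappings": {
        "properties": {
            "title":    {"type": "text"},     # analysed for full-text search
            "price":    {"type": "integer"},
            "added_on": {"type": "date"},
            "sku":      {"type": "keyword"}   # stored as-is, for exact matches
        }
    }
}
print(requests.put(f"{ES}/products", json=mapping).json())
```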

Document

A document is the smallest and most basic unit of information that can be indexed. It consists of key-value pairs; the values can be of datatype string, number, date, object, etc. An index can have any number of documents stored within it; in object-oriented terms, a document is something like an object. It is expressed as JSON. In relational database terms, a document can be thought of as a single row of a table.

Shards

An index can be divided into multiple independent sub-indexes; these sub-indexes are fully functional on their own and are called shards. They are useful when an index needs to hold more data than the hardware of a single node (server) can support; for example, 800 GB of data on a 500 GB disk!

Sharding allows you to scale horizontally by volume and space; it enhances the performance of a cluster by running operations in parallel and distributing the load across different shards. By default, Elasticsearch creates 5 primary shards for an index; this can be configured manually to suit your requirements.

Replica

A replica is a copy of an individual shard. Elasticsearch creates a replica for each shard; the replica and the original shard never reside on the same node.

Shards and Replicas - Image
This image is downloaded from Google; no copyright infringement is intended.

Replicas come into the picture when a node in the cluster fails, a shard on a node fails, or a spike in read throughput is encountered; replicas guarantee high availability of the data in such situations. When a write query is fired, the primary shard is updated first and then the replicas are updated with some additional latency. Read queries, however, can run in parallel across replicas, which improves the overall performance of read operations.

By default, a single copy of each primary shard is created, but a shard can have more than one replica if needed.
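Putting the shard and replica settings together, a minimal sketch for creating an index with a custom layout (the index name and values are illustrative, again assuming a local cluster):

```python
import requests

ES = "http://localhost:9200"   # assumed local Elasticsearch endpoint

settings = {
    "settings": {
        "number_of_shards": 5,      # primary shards; fixed when the index is created
        "number_of_replicas": 1     # one copy of each primary shard, kept on a different node
    }
}
print(requests.put(f"{ES}/events", json=settings).json())
```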

Inverted-Index

It is the data structure that Lucene uses to make huge datasets readily searchable. An inverted index maps words, phrases or tokens to the documents they appear in, which is what enables full-text search. In simple terms, an inverted index is something like the index pages at the end of a book: a mapping of words to the documents that contain them.
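The idea is easy to see in a few lines of Python; this toy version simply maps each token to the set of document ids containing it (real Lucene indexes are far more sophisticated, with analysers, term positions and compression):

```python
from collections import defaultdict

docs = {
    1: "the quick brown fox",
    2: "the lazy brown dog",
    3: "quick thinking wins",
}

# Build the inverted index: token -> set of document ids that contain it.
inverted_index = defaultdict(set)
for doc_id, text in docs.items():
    for token in text.lower().split():
        inverted_index[token].add(doc_id)

# A full-text lookup is now a dictionary read instead of a scan over every document.
print(inverted_index["brown"])   # {1, 2}
print(inverted_index["quick"])   # {1, 3}
```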

Inverted Index Image
This image is downloaded from Google; no copyright infringement is intended.

Segment

Each shard consists of multiple segments; these segments are nothing but inverted indexes. A search on a shard runs across its segments in parallel, and their results are combined into the final output for that particular shard.

ES Architecture Image
Visual representation of Internal ES Architecture

As documents are indexed, Elasticsearch writes them to new segments, refreshes the search data and updates the transaction logs. This happens very frequently so that the data in new segments becomes visible to all queries. Segments are never modified in place, so if data needs to be deleted or updated, Elasticsearch simply marks the old document as deleted and indexes a new document; the merge process later expunges these old deleted documents. Elasticsearch constantly merges smaller segments into bigger segments in the background, because querying too many small segments is not optimal. After the bigger segment is written, the smaller segments are dropped and the log files are updated again to reflect the new changes.
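For example, if newly indexed documents must be searchable immediately instead of waiting for the next automatic refresh, a refresh can be triggered explicitly (a sketch against an assumed local cluster and made-up index name; frequent manual refreshes hurt indexing throughput):

```python
import requests

ES = "http://localhost:9200"   # assumed local Elasticsearch endpoint

# Force the 'events' index to open a new searchable segment right away.
print(requests.post(f"{ES}/events/_refresh").json())
```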

It may seem complicated... But

You don’t have to deal with the internal workings of Lucene and Elasticsearch, as they are abstracted away; you just have to configure clusters with the right number of nodes and create indexes with appropriate mappings! Everything else is handled internally. Several providers like IBM, AWS, Searchly and Elastic Cloud offer Elasticsearch as a managed service, so you don’t have to worry about managing servers, doing deployments, taking backups, etc.; the service takes care of these things and saves you the time and effort of operating the servers yourself.

This post was meant to cover the basics of Elasticsearch and give a brief idea of how it works internally. I hope I have done justice to it. In my next post, I aim to cover ‘How to query an Elasticsearch index using Kibana?’.

Stay tuned.


Who are Fullstack Developers?

Software/website development can be categorized into:

Front end: HTML, JavaScript, CSS
Back end: Java, PHP, ASP.NET/C#.NET, Ruby
Database: Microsoft SQL Server, MySQL, Oracle

So, by definition, a developer who works on the front end is a front-end developer and a developer who works on the back end is a back-end developer.

Front-end developers are responsible for a website’s user interface and user-experience architecture. They work closely with designers to build and improve the UI/UX of a website. A good front-end developer is able to accurately identify specific user-experience issues and provide recommendations and coding solutions to improve the design.

Back-end developers generally handle the server and the data. Their job is to build an application and to design and implement its interaction with the server and the database. They manipulate data and also work with public and private APIs. A good back-end developer should have sound knowledge of Linux/Windows as development and deployment systems; he/she should also be familiar with version control systems such as Git/SVN.

These were the specialized developer positions. But as requirements continued to become more complicated and ambitious, some kickass people started to build frameworks and helper libraries. jQuery is the most common example; it made JavaScript development in the browser significantly easier. Other examples are AngularJS, Knockout, Backbone and Ember.js. There were similar shifts in back-end technologies, with frameworks such as Zend, Symfony, CakePHP and CodeIgniter for PHP, and Ruby on Rails for Ruby. Thus, today’s browsers have become more capable, and the frameworks are becoming ever more powerful.

This ignited the emergence of fullstack developers, which blurred the lines between front-end and back-end developers. Start-ups played an important part in popularizing this role. These developers are jacks-of-all-trades and masters-of-some. They provide the full package and can work cross-functionally across a company’s entire technology stack. This is a win-win situation for both the individual and the company. For an individual developer, it adds solid skills to his/her resume, a solid learning experience is in store, and they get to work on some kickass and challenging stuff; however, the job of a fullstack developer can sometimes be very complex and demanding.

Happy coding and developing!
