Before diving into the actual topic, it is important to have some understanding of graph databases.
What is a Graph Database?
A graph consists of two things: 1) Vertices and 2) Edges. Vertices represent the entities and edges represent the relationships between those entities. A graph database is a NoSQL database which stores data in the form of a graph; it is a highly efficient way to store connected data, as you don’t need complex joins to fetch data at runtime. In a graph database, you can move directly from one vertex (object) to another, in any direction, using the edges (relationships) between them. This process is called traversal.
A graph database has a clear edge over traditional databases in terms of database design and modelling, data ingestion, and retrieval of data involving many-to-many relationships. Vertices and edges can have their own properties, which are stored as key-value pairs. Some graph databases need a schema specification with datatypes and labels, but most allow you to ingest or manipulate data without a fixed schema.
A graph database allows you to traverse millions of nodes and access specific information by using properties in the query, so that only the part of the data which satisfies the query condition is read; the data which doesn’t match the query pattern is never accessed. So, it is a very fast and straightforward way to access aggregated data based on relationships.
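To make the ideas of vertices, edges, properties and traversal concrete, here is a minimal in-memory sketch in Python. It is not how any particular graph database is implemented; the `PropertyGraph` class, its method names and the sample data are all made up for illustration.

```python
from collections import defaultdict


class PropertyGraph:
    """Tiny in-memory property graph, for illustration only."""

    def __init__(self):
        self.vertices = {}                   # vertex id -> property dict
        self.out_edges = defaultdict(list)   # vertex id -> [(label, target id, edge props)]

    def add_vertex(self, vid, **props):
        self.vertices[vid] = props

    def add_edge(self, src, label, dst, **props):
        self.out_edges[src].append((label, dst, props))

    def out(self, vid, label, **where):
        """Follow outgoing `label` edges from `vid`, keeping targets whose properties match `where`."""
        matches = []
        for edge_label, dst, _ in self.out_edges[vid]:
            if edge_label != label:
                continue
            props = self.vertices[dst]
            if all(props.get(key) == value for key, value in where.items()):
                matches.append(dst)
        return matches


g = PropertyGraph()
g.add_vertex("u1", kind="user")
g.add_vertex("u2", kind="user")
g.add_vertex("e1", kind="event", type="click", channel="Google")
g.add_vertex("e2", kind="event", type="view", channel="Facebook")
g.add_vertex("e3", kind="event", type="click", channel="Facebook")
g.add_edge("u1", "performed", "e1")
g.add_edge("u1", "performed", "e2")
g.add_edge("u2", "performed", "e3")

# Only the edges leaving u1 are visited; u2 and its events are never touched.
clicks_by_u1 = g.out("u1", "performed", type="click")
print(clicks_by_u1)  # ['e1']
```

The lookup starts at one vertex and follows only its own edges, which is why the rest of the graph stays out of the picture.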
Graph databases are very popular in domains like fraud detection, asset management and social networks. They can also be used to capture the online events of a user on a website or a mobile app. The goal is to build the journey of a user by tracking all events/activities such as ad-click, ad-impression, page-view, add-to-cart and sale, and to efficiently access important information from the event properties when required.
There are many graph database products available in the market; some of these are managed and some of them are raw. The popular options are: 1) Azure Cosmos 2) AWS Neptune 3) Neo4j 4) IBM Graph and 5) DataStax Graph. Of these, Azure Cosmos and AWS Neptune are new entrants and are fully managed cloud-based solutions.
Let’s see whether a graph database can solve our problem and help us capture the user journey!
What is a User Journey?
It is a timeline of views, clicks and other custom events on a website or a mobile application which led to a particular conversion, sale or lead submission. It can also be termed the path to conversion.
To visualize this, consider this example:
I want to buy a mobile phone, so I googled ‘best mobile phone under 30k INR’, and Google showed an ad for a device called the Moto Z Play. I clicked on it and went through its specs, but I decided against buying it for some reason.
Now that Google knows my history, it will keep showing me relevant ads, even on third-party sites. I encountered one of those ads but didn’t click on it. After a week or so, an ad popped up on my Facebook newsfeed announcing a 20% discount on the Moto Z Play on Amazon, and at that moment I decided to buy. So I clicked on this Facebook ad and was about to buy the product from Amazon, but before that I checked for discount coupons on Coupon-Dunia, and finally I bought the product for 25k from Amazon.
User journey for this particular conversion will be something like this: Click1 (Google, Campaign1, Ad1), View1 (Display, Campaign2, Ad2), View2 (Facebook, Campaign3, Ad3), Click3 (Facebook, Campaign3, Ad4), View3 (Display, Campaign4, Ad5) -> Conversion (Amazon, Revenue:25000, CartItem: MotoZPlay, No of Items: 1)
Modeling the incoming stream of events
We have two main entities: 1) Users and 2) Events. Both will be represented by nodes, and an edge called ‘performed’ will connect a user to each of its events. We will have one more edge representing the relationship ‘previous’; so if event1, event2 and event3 are the events performed by a user at times t1, t2 and t3 respectively, where t1 < t2 < t3, then the graph of that user would look as depicted in the diagram shown below.
There are two ways to link events: 1) using a ‘next’ edge from event1 to event2, or 2) using a ‘previous’ edge from event2 to event1. But there is a problem with the first approach: when event1 is inserted at time t1, event2 doesn’t exist in the system yet! Event2 will be inserted at time t2, where t2 > t1. It makes more sense to insert a ‘previous’ edge from event2 to event1 at time t2, when event2 occurs. So, the logic for ingesting a new event node also checks whether there are any earlier events for the same user. If there are, it takes the newest of them and creates an edge from the incoming event to it. So, 1) when event1 comes, no ‘previous’ edge is created as it is the first event in the system; 2) for event2, it finds event1 and creates an edge to it; and 3) for event3, it finds event1 and event2, but chooses event2 as it is the newest event in the current system, and creates a ‘previous’ edge to it. This approach gives us flexibility and an easy way to traverse from any user node through the events associated with it.
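The ingestion logic described above can be sketched in a few lines of Python. This is an in-memory illustration, not code for Neptune or Cosmos; the `EventGraph` class and its field names are hypothetical, and it assumes events reach the system in the order they occurred, so the incoming event is always newer than the stored ones.

```python
from dataclasses import dataclass, field


@dataclass
class EventGraph:
    # user id -> ids of the events linked to that user by a 'performed' edge
    performed: dict = field(default_factory=dict)
    # event id -> id of the event its 'previous' edge points to
    previous: dict = field(default_factory=dict)
    # event id -> (timestamp, properties)
    events: dict = field(default_factory=dict)

    def ingest(self, user_id, event_id, timestamp, **props):
        """Insert an event node and link it to the user's newest existing event."""
        existing = self.performed.setdefault(user_id, [])
        if existing:
            # Pick the newest event already stored for this user.
            newest = max(existing, key=lambda eid: self.events[eid][0])
            self.previous[event_id] = newest
        self.events[event_id] = (timestamp, props)
        existing.append(event_id)  # the 'performed' edge from user to event


graph = EventGraph()
graph.ingest("user1", "event1", timestamp=1)
graph.ingest("user1", "event2", timestamp=2)
graph.ingest("user1", "event3", timestamp=3)

print(graph.previous)  # {'event2': 'event1', 'event3': 'event2'}
```

Note that event1 gets no ‘previous’ edge, and every later event needs only one lookup (the newest event of the same user) at insert time.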
Now that we have decided on our model, let’s plot the Moto Z Play user-journey example according to it. There are 6 events of different types, and their tabular representation and graph model would look something like this.
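As a rough sketch, the same journey can also be written down in Python and walked backwards from the conversion along the ‘previous’ edges. The event labels and properties come from the example above; the dictionary keys (`channel`, `campaign`, `ad` and so on) and the `path_to` helper are names chosen for this illustration, and list order stands in for real timestamps.

```python
# Events of the Moto Z Play example, in the order they occurred.
journey_events = [
    ("Click1", {"channel": "Google", "campaign": "Campaign1", "ad": "Ad1"}),
    ("View1", {"channel": "Display", "campaign": "Campaign2", "ad": "Ad2"}),
    ("View2", {"channel": "Facebook", "campaign": "Campaign3", "ad": "Ad3"}),
    ("Click3", {"channel": "Facebook", "campaign": "Campaign3", "ad": "Ad4"}),
    ("View3", {"channel": "Display", "campaign": "Campaign4", "ad": "Ad5"}),
    ("Conversion", {"site": "Amazon", "revenue": 25000,
                    "cart_item": "MotoZPlay", "items": 1}),
]

properties = {}   # event id -> key-value properties of the event node
performed = []    # 'performed' edges from the single user to each event
previous = {}     # 'previous' edges: event id -> the event just before it

for event_id, props in journey_events:
    if performed:
        # Link the incoming event to the newest event already stored.
        previous[event_id] = performed[-1]
    properties[event_id] = props
    performed.append(event_id)


def path_to(event_id):
    """Follow 'previous' edges back from event_id; return the path oldest-first."""
    path = [event_id]
    while path[-1] in previous:
        path.append(previous[path[-1]])
    return list(reversed(path))


conversion_path = path_to("Conversion")
print(" -> ".join(conversion_path))
# Click1 -> View1 -> View2 -> Click3 -> View3 -> Conversion
```

Starting from the conversion node, the whole path to conversion is recovered by following five ‘previous’ edges, without scanning any unrelated events.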
Event analytics and tracking of a user journey require a new way of storing and querying data. I have worked extensively on the tracking and attribution side of things; initially, we opted for SQL Server as the backend to store the data, but a relational database requires multiple joins and heavy table scans to fetch the complicated breakdowns of data required for event filters. Also, the event database can be huge (GBs to TBs). With our learnings from traditional databases, and after going through a rigorous trial-and-error phase, we came up with the above graph model.
We have been testing AWS Neptune/Azure Cosmos for the last 3 months and we are astonished by the possibilities and performance of graph databases. It’s too early for me to say that a graph database is a silver bullet for our requirements, but it has been a great learning experience as a whole. I will publish a new blog post soon on how to ingest and query graph databases to produce the complicated outputs needed for dashboard charts and segmentation filters.