Graph your data with Gremlin

In my view, CosmosDB is one of the most interesting and useful services in Microsoft Azure, though it can be quite daunting to use. It offers several database APIs: Cassandra, MongoDB, Core (SQL), Table, and Gremlin. While they all naturally fall under the umbrella of “NoSQL”, once understood these databases can bring immense value to your use case. For today, I wanted to talk about Gremlin, as I have been experimenting with it heavily of late.

Have you ever wondered just how Facebook, Twitter and others maintain their data despite its volume and varied nature? Social networks in general represent a unique use case: the consistency model of an RDBMS cannot easily be applied at that volume, and the nature of the data does not easily lend itself to the eventually consistent nature of document stores like MongoDB. Instead, these services tend to rely on graph databases to hold their data, and in Cosmos, Gremlin (from Apache TinkerPop) is the graph API.

When might you use a Graph database?

First, no single solution is good for everything. If you are dealing with a highly transactional system (like a bank), an RDBMS like Azure SQL or PostgreSQL is likely going to fit your needs best. On the other end, if you are ingesting a high volume of data, or data which has varying forms, the eventually consistent models of NoSQL (Mongo, Raven) are likely going to be ideal. Where I feel graph databases come in is when you have data that is HIGHLY interrelated, or where relationships are always growing or need to change. A great example of this is an organization hierarchy.

In classic computer science, we would often experiment with building these highly relational systems in an RDBMS, either to help students and professionals better understand normalization or because it was all we had lying around. Let me be clear: nothing stops any of these systems from handling any use case. However, always try to pick the best tool for the job.

Let’s consider a group of people based on my family. Here is a visualization (in graph form) of the data:

Nothing would stop us from representing this in an RDBMS. In fact, it's a well-worn problem solved with nullable “related to” columns, but is it the best tool? If we think about Facebook, when someone “likes” a story we also have to consider how we ensure the integrity of a count. Aggregating on every read would be impractical at that scale, so we need our entities to “magically” keep track of these values.

Enter a graph database. Each bubble above is referred to as a “vertex” and each arrow an “edge”. The relationships are unidirectional, though they can certainly circle back to make them feel bidirectional. Each vertex knows how many edges (or relationships) it has and can easily spit that value back. If someone chooses to “unlike” a Facebook story, for example, that edge simply disappears from the vertex.
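To make the vertex and edge idea concrete, here is a minimal in-memory sketch. It is not the CosmosDB or Gremlinq API; the names TinyGraph, GraphEdge and the "Likes" label are made up for illustration, and a real graph database stores and indexes edges very differently.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical in-memory model for illustration only; not a real driver API.
public sealed record GraphEdge(string Label, string From, string To);

public sealed class TinyGraph
{
    // Each vertex keeps its own list of incoming edges.
    private readonly Dictionary<string, List<GraphEdge>> _incoming = new();

    public void AddVertex(string id) => _incoming.TryAdd(id, new List<GraphEdge>());

    // Edges are directed: they point from one vertex to another.
    public void AddEdge(string label, string from, string to)
    {
        if (!_incoming.ContainsKey(from) || !_incoming.ContainsKey(to))
            throw new ArgumentException("Both vertices must exist before they can be connected.");
        _incoming[to].Add(new GraphEdge(label, from, to));
    }

    // "Unliking" is just removing the edge; returns false if it was not there.
    public bool RemoveEdge(string label, string from, string to) =>
        _incoming.TryGetValue(to, out var edges) && edges.Remove(new GraphEdge(label, from, to));

    // The count is read from the vertex's own edges, not aggregated over a table.
    public int CountIncoming(string vertexId, string label) =>
        _incoming.TryGetValue(vertexId, out var edges) ? edges.Count(e => e.Label == label) : 0;
}
```

With two people each adding a "Likes" edge to the same story vertex, CountIncoming reports two likes; removing one of those edges brings it back to one without touching anything else.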

I think an even better example is a company hierarchy. Consider how often a company like Microsoft, for example, might shift around positions, create new titles and positions and move who reports to whom. While it could be represented in an RDBMS, it would be far easier in a graph database.
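As a rough sketch of why a reorg is cheap when reporting lines are edges, consider the in-memory model below. OrgChart, SetManager and ChainOfCommand are hypothetical names for illustration, not part of any graph library, and the sketch assumes each person reports to at most one manager.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical in-memory org chart; each entry is one "reports to" edge.
public sealed class OrgChart
{
    private readonly Dictionary<string, string> _reportsTo = new();

    // Adding or moving someone only touches that person's own edge.
    public void SetManager(string employee, string manager)
    {
        if (employee == manager)
            throw new ArgumentException("An employee cannot report to themselves.");
        _reportsTo[employee] = manager;
    }

    // Follow "reports to" edges upward until someone has no manager.
    public List<string> ChainOfCommand(string employee)
    {
        var chain = new List<string>();
        var seen = new HashSet<string> { employee }; // guards against accidental cycles
        while (_reportsTo.TryGetValue(employee, out var manager) && seen.Add(manager))
        {
            chain.Add(manager);
            employee = manager;
        }
        return chain;
    }
}
```

Moving a developer from one lead to another is a single SetManager call; everyone above and below keeps their existing edges, and the chain of command simply follows the new path.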

How do I get started?

First, I would recommend creating a CosmosDB instance using Gremlin (if you like AWS, they have Neptune). Instructions are here: https://docs.microsoft.com/en-us/azure/cosmos-db/graph/create-graph-dotnet

It's good to understand, but honestly the sample app is not very good and the default driver library leaves much to be desired. After some searching I came across Gremlinq by ExRam, and I LOVE this library. It makes things so much easier, and there is even a complete sample project for Gremlin (and Neptune) provided. Working with CosmosDB, I created the following objects:

var clairePerson = await querySource.AddV<Person>(new Person() { Id = Guid.NewGuid(), FirstName = "Claire" }).FirstAsync();
var ethanPerson = await querySource.AddV<Person>(new Person() { Id = Guid.NewGuid(), FirstName = "Ethan" }).FirstAsync();
var jasonPerson = await querySource.AddV<Person>(new Person() { Id = Guid.NewGuid(), FirstName = "Jason" }).FirstAsync();
var brendaPerson = await querySource.AddV<Person>(new Person() { Id = Guid.NewGuid(), FirstName = "Brenda" }).FirstAsync();
var stevenPerson = await querySource.AddV<Person>(new Person() { Id = Guid.NewGuid(), FirstName = "Steven" }).FirstAsync();
var seanPerson = await querySource.AddV<Person>(new Person() { Id = Guid.NewGuid(), FirstName = "Sean" }).FirstAsync();
var katiePerson = await querySource.AddV<Person>(new Person() { Id = Guid.NewGuid(), FirstName = "Katie" }).FirstAsync();
var madelinePerson = await querySource.AddV<Person>(new Person() { Id = Guid.NewGuid(), FirstName = "Madeline" }).FirstAsync();
var myungPerson = await querySource.AddV<Person>(new Person() { Id = Guid.NewGuid(), FirstName = "Myung" }).FirstAsync();
var chanPerson = await querySource.AddV<Person>(new Person() { Id = Guid.NewGuid(), FirstName = "Chan" }).FirstAsync();

Once you have these (and it would be easy to create these dynamically) you can set about relating them. As I said above, I tried to keep my relationships flowing in a single direction. I could allow a loopback if needed, but I wanted to avoid creating bidirectional relationships. I am not sure yet whether this is a good practice.

await querySource
    .V(ethanPerson.Id)
    .AddE<ChildOf>(new ChildOf() { Id = Guid.NewGuid() })
    .To(_ => _.V(clairePerson.Id))
    .FirstAsync();

await querySource
    .V(ethanPerson.Id)
    .AddE<ChildOf>(new ChildOf() { Id = Guid.NewGuid() })
    .To(_ => _.V(jasonPerson.Id))
    .FirstAsync();

await querySource
    .V(clairePerson.Id)
    .AddE<MarriedTo>(new MarriedTo() { Id = Guid.NewGuid() })
    .To(_ => _.V(jasonPerson.Id))
    .FirstAsync();

await querySource
    .V(clairePerson.Id)
    .AddE<ChildOf>(new ChildOf() { Id = Guid.NewGuid() })
    .To(_ => _.V(myungPerson.Id))
    .FirstAsync();

await querySource
    .V(clairePerson.Id)
    .AddE<ChildOf>(new ChildOf() { Id = Guid.NewGuid() })
    .To(_ => _.V(chanPerson.Id))
    .FirstAsync();

await querySource
    .V(jasonPerson.Id)
    .AddE<ChildOf>(new ChildOf() { Id = Guid.NewGuid() })
    .To(_ => _.V(brendaPerson.Id))
    .FirstAsync();

await querySource
    .V(jasonPerson.Id)
    .AddE<ChildOf>(new ChildOf() { Id = Guid.NewGuid() })
    .To(_ => _.V(stevenPerson.Id))
    .FirstAsync();

await querySource
    .V(seanPerson.Id)
    .AddE<ChildOf>(new ChildOf() { Id = Guid.NewGuid() })
    .To(_ => _.V(brendaPerson.Id))
    .FirstAsync();

await querySource
    .V(seanPerson.Id)
    .AddE<ChildOf>(new ChildOf() { Id = Guid.NewGuid() })
    .To(_ => _.V(stevenPerson.Id))
    .FirstAsync();

await querySource
    .V(jasonPerson.Id)
    .AddE<SiblingOf>(new SiblingOf() { Id = Guid.NewGuid() })
    .To(_ => _.V(seanPerson.Id))
    .FirstAsync();

await querySource
    .V(seanPerson.Id)
    .AddE<MarriedTo>(new MarriedTo() { Id = Guid.NewGuid() })
    .To(_ => _.V(katiePerson.Id))
    .FirstAsync();

await querySource
    .V(madelinePerson.Id)
    .AddE<ChildOf>(new ChildOf() { Id = Guid.NewGuid() })
    .To(_ => _.V(seanPerson.Id))
    .FirstAsync();

await querySource
    .V(madelinePerson.Id)
    .AddE<ChildOf>(new ChildOf() { Id = Guid.NewGuid() })
    .To(_ => _.V(katiePerson.Id))
    .FirstAsync();

What I came to find while doing this is that it feels like a good idea to create a base type (I used Vertex and Edge) and then create derivations for specific object and relationship types. For example:

public abstract class Vertex
{
    public abstract string Label { get; }
}

public class Person : Vertex
{
    public override string Label => "Person";
    public Guid Id { get; set; }
    public string FirstName { get; set; }
    public string partitionKey => FirstName.Substring(0, 1);
}

public abstract class Edge
{
    public Guid Id { get; set; }
    public abstract string Label { get; }
}

public class SiblingOf : Edge
{
    public override string Label => "Sibling Of";
}

public class ChildOf : Edge
{
    public override string Label => "Child Of";
}

public class MarriedTo : Edge
{
    public override string Label => "Married To";
}

I am still trying to understand the best way to lay this out because, when I query the whole graph in Cosmos, all I get are Guids everywhere, where I would like some better identifying data to be shown.

Now the strength of the graph is that you can traverse the data more easily than in something like an RDBMS, where you would be writing cartesian queries or doing a lot of subquerying. To aid in this, and I was surprised it was not baked into the query language itself (it might be and I just have not found it yet), I wrote a recursive traversal algorithm:

public static async Task<List<Person>> GetAncestors(Person vertex, IGremlinQuerySource querySource)
{
    // Follow outgoing ChildOf edges to find this person's parents.
    var ancestors = await querySource.V<Person>(vertex.Id)
        .Out<ChildOf>()
        .Cast<Person>()
        .ToArrayAsync();

    // No parents recorded: this is the top of the branch.
    if (ancestors.Length == 0)
    {
        return ancestors.ToList();
    }

    // Collect the parents, then recurse to collect each parent's ancestors.
    var ancestorsReturn = new List<Person>(ancestors);
    foreach (var ancestor in ancestors)
    {
        ancestorsReturn.AddRange(await GetAncestors(ancestor, querySource));
    }

    return ancestorsReturn;
}

What this does is, given a specific object that we are looking for (all we need is the Id), look for edges of type ChildOf attached to the Person, indicating the person is a child of another node, and then repeat the lookup for each parent it finds. The graph nature allows me to count the edges attached to my node and even traverse them like a typical tree or other abstract data type in computer science.
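For anyone who wants to play with the traversal logic without a CosmosDB instance, here is a small in-memory sketch of the same walk. It is a variation rather than the code above: it uses a queue instead of recursion and adds a visited set so a shared ancestor is only reported once. The AncestorWalk class and the dictionary standing in for the graph are hypothetical. It is also worth knowing that TinkerPop's Gremlin language does include a repeat() step (usually paired with emit()) that can express this kind of walk in a single query; how that maps onto Gremlinq is not covered here.

```csharp
using System;
using System.Collections.Generic;

public static class AncestorWalk
{
    // childOf maps a person to the people they have a "Child Of" edge to (their parents).
    // Hypothetical in-memory stand-in for the graph; no database calls are made.
    public static List<string> GetAncestors(
        string person,
        IReadOnlyDictionary<string, string[]> childOf)
    {
        var ancestors = new List<string>();
        var seen = new HashSet<string> { person };
        var toVisit = new Queue<string>();
        toVisit.Enqueue(person);

        while (toVisit.Count > 0)
        {
            var current = toVisit.Dequeue();
            if (!childOf.TryGetValue(current, out var parents))
                continue; // no outgoing "Child Of" edges: top of this branch

            foreach (var parent in parents)
            {
                // Skip anyone already reached through another path.
                if (seen.Add(parent))
                {
                    ancestors.Add(parent);
                    toVisit.Enqueue(parent);
                }
            }
        }

        return ancestors;
    }
}
```

Fed the ChildOf relationships from the family above, asking for Ethan's ancestors returns his parents first and then the four grandparents, one generation at a time.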

Closing thoughts… for now

This is a pretty shallow dip into graph databases, but I am very intrigued. Last year, while working with CSharpfritz on KlipTok, we struggled with this due to the desire to have relationships between the data. Ultimately we got around it but, looking back, I realize our data was definitely graphable, as the site continues to add relationships and wants to do traversals and aggregate relationships. In this way, one definite conclusion is that a graph allows relationships to scale much more easily than in an RDBMS, and certainly better than in NoSQL systems like Cassandra and Mongo.

I look forward to diving into this deeper and hope the above gives others reasonable insight into the approach and a good starting point.
