Building a Real Time Data Pipeline in Azure – Part 3

Part 1 – here
Part 2 – here

In the first two parts of this series we created an Event Hub that we could blast with high volumes of data, which is a common use case for the sort of real-time application we are building the backbone for. Next, we showed how to set up a Stream Analytics job that queries that data over a bounding window and outputs the results to a storage medium.
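As a quick refresher on the producer side, here is a minimal sketch in Python, assuming the azure-eventhub package; the connection string, hub name, and payload shape are placeholders, not anything carried over from the earlier parts:

```python
import json
import time
from azure.eventhub import EventHubProducerClient, EventData

# Hypothetical connection details -- substitute your own.
CONNECTION_STR = "Endpoint=sb://<namespace>.servicebus.windows.net/;..."
EVENT_HUB_NAME = "telemetry"

producer = EventHubProducerClient.from_connection_string(
    CONNECTION_STR, eventhub_name=EVENT_HUB_NAME
)

with producer:
    # Batch events so a high-volume "blast" stays within size limits.
    batch = producer.create_batch()
    for i in range(100):
        payload = {"deviceId": f"device-{i % 5}", "reading": i, "ts": time.time()}
        batch.add(EventData(json.dumps(payload)))
    producer.send_batch(batch)
```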

At the conclusion of Part 2 we had these results streaming into Blob storage, which works but is not overly practical (at least not until Azure offers something like Athena). Truthfully, to use this data effectively we need it in a storage medium that supports querying.

Rules of the Land

I am a huge fan of Azure Cosmos DB and the various database APIs it offers, including DocumentDB. At the time of this writing, DocumentDB is the ONLY Cosmos DB API that a Stream Analytics job supports as an output destination. For now this rule must be followed, as you cannot use anything else; if you do not follow it, be prepared for a cryptic error.

Setting the output

Returning to your output screen (you can have multiple outputs if you want), click Add and select Cosmos DB. As with the previous section, you will want as much of this automated as possible. I would even recommend having your database prepared ahead of time, though the wizard can create it for you as well; a sketch of preparing it yourself follows below.
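If you do prepare the database ahead of time, a minimal sketch using the azure-cosmos Python package might look like the following; the account URL, key, database name, container name, and partition key are all assumptions standing in for your own values:

```python
from azure.cosmos import CosmosClient, PartitionKey

# Hypothetical account details -- substitute your own.
COSMOS_URL = "https://<account>.documents.azure.com:443/"
COSMOS_KEY = "<primary-key>"

client = CosmosClient(COSMOS_URL, credential=COSMOS_KEY)

# Create the database and container if they do not already exist,
# mirroring what the output wizard would otherwise do for you.
database = client.create_database_if_not_exists(id="pipeline")
container = database.create_container_if_not_exists(
    id="windowed-results",
    partition_key=PartitionKey(path="/deviceId"),
)
```

Whatever partition key you choose here, make sure it matches a field your Analytics query actually emits and the partition key you specify on the output.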

Testing the process

Once you have all of this in place you can turn on the hose (make sure the Analytics job is started) and wait for data to appear. Debugging this is simply a matter of checking the flow at each point and seeing where the data stops if it's not making it all the way through.
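One quick way to check the first link in that chain is to read directly from the Event Hub and confirm events are arriving. A minimal sketch, again assuming azure-eventhub and placeholder connection details:

```python
from azure.eventhub import EventHubConsumerClient

# Hypothetical connection details -- substitute your own.
CONNECTION_STR = "Endpoint=sb://<namespace>.servicebus.windows.net/;..."
EVENT_HUB_NAME = "telemetry"

def on_event(partition_context, event):
    # Print each event so you can confirm data is reaching the hub.
    print(partition_context.partition_id, event.body_as_str())

consumer = EventHubConsumerClient.from_connection_string(
    CONNECTION_STR,
    consumer_group="$Default",
    eventhub_name=EVENT_HUB_NAME,
)
with consumer:
    # starting_position="-1" replays from the beginning of the stream;
    # stop with Ctrl+C once you have seen enough.
    consumer.receive(on_event=on_event, starting_position="-1")
```

If events show up here but nothing lands in Cosmos DB, the problem is in the Analytics job or its output configuration.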

Your next step is to write an application that queries this data to provide the insights you are after. Once the data is in the output, you can write your normal queries against it.
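A minimal sketch of such a query, assuming the azure-cosmos Python package; the account details, database and container names, and the avgReading field are placeholders standing in for whatever your Analytics query emits:

```python
from azure.cosmos import CosmosClient

# Hypothetical account details -- substitute your own.
COSMOS_URL = "https://<account>.documents.azure.com:443/"
COSMOS_KEY = "<primary-key>"

client = CosmosClient(COSMOS_URL, credential=COSMOS_KEY)
container = client.get_database_client("pipeline").get_container_client(
    "windowed-results"
)

# An ordinary SQL query against the documents the Analytics job wrote;
# avgReading is an assumed field name from the windowed query.
results = container.query_items(
    query="SELECT c.deviceId, c.avgReading FROM c WHERE c.avgReading > @min",
    parameters=[{"name": "@min", "value": 10}],
    enable_cross_partition_query=True,
)
for item in results:
    print(item)
```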
