Building a Real Time Data Pipeline in Azure – Part 2

Part 1 here

Continuing on with this series, we now turn our attention to ingesting and processing the data collected through the Event Hub. For real time apps, this is where we want to perform our bounded queries to generate results within a window. Remember, traditionally our queries are naturally bounded because we are querying a fixed data set. With this sort of application the data is constantly arriving, so we have to be cognizant of limiting our queries to a given window.

Creating our Stream Analytics Job

Click Add within your Resource Group, search for Stream Analytics, and select Stream Analytics job.

For this demo, you can set Streaming Units to 1. Streaming Units are a measure of the job's scalability; you should adjust this based on your incoming load. For more information see here.

When we look at this sort of service we concern ourselves with three aspects: Input, Query, and Output. For the job to work we will need to fulfill each of these pieces.

Configure the Input

For Input, we already have this: our Event Hub. We can click on Inputs and select Add Stream Input. From here you can select your Event Hub, and you will be prompted to fill in values in the right side panel. As a tip, use the Event Hub selection rather than specifying the values manually; I have found that things won't work properly otherwise.

This provides an input into our analytics job. In the next section we will configure the query.

Configure the Query

As I mentioned before, the thing to remember with this sort of process is you are dealing with unbounded data. Because of this we need to make sure the queries look at the data in a bounded context.

From the Overview for your Stream Analytics Job, click Edit Query, which will drop you into the Query editor. Some important things to take note of here:

  • When you created your Input you gave it a name; this serves as the table you will use as the source for your data
  • The query result is automatically directed to your output. There are ways to send multiple outputs, but I won't be covering that here

So here is a sample query that I used to select the sum of shares purchased or sold on a per minute basis:
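The original query listing is gone, but a reconstruction based on the notes below might look like this. It is a sketch: the shares field and the BlobOutput output name are assumptions, and with a single output the INTO clause can often be omitted since results route to the output automatically.

```sql
-- A sketch; StockTransactionsRaw is the input defined for this job,
-- BlobOutput and the shares/symbol fields are assumed names.
SELECT
    System.Timestamp AS [time],
    symbol,
    SUM(shares) AS shareChange
INTO
    [BlobOutput]
FROM
    StockTransactionsRaw
GROUP BY
    symbol,
    TumblingWindow(minute, 1)
```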


There are a couple things to point out here:

  • We use System.Timestamp to produce a timestamp for the event. This will be used on the frontend if you want to graph this data over time. It also serves as a primary key (when combined with symbol) in this case
  • StockTransactionsRaw is the input that we defined for this Stream Analytics job; you can name it whatever you like
  • TumblingWindow is a feature supported within the Analytics job to allow for creating a bounded query. In this case, the window covers one minute of data and advances in fixed one minute steps. Here is more information on TumblingWindow – click

One big tip I can give here is using Sample Data to test. Once you have your Input stream going for a bit you can return to the Inputs list and click Sample Data for that specific input. This will generate a file you can download with the data being received through that input.

Once you have this file, you can return to the Edit Query screen and click the Test button, which will let you upload the file you downloaded and display the query results. I found this a great way to test your query logic.

Once you have things working to your liking you need to move on to the Output.

Configure the Output

For now, we are going to use Blob Storage to handle our data; we will cover hooking this up to Cosmos in Part 3. I want to break this up a bit so there is not too much being covered in one entry and, well, I like 3 as a number 🙂

Click on Outputs and select Add and choose Blob Storage. Here is what my configuration looks like:


This will drop the raw data into files within your blob storage that you can read back later. The real value, though, will be the ability to query it from something like Cosmos, which we cover in Part 3.

So, now you have an Event Hub that is able to ingest large amounts of data; it pumps the data into the Analytics Job, which runs a query (or queries) against it and sends the output, for now, to our Blob Storage.

Looking to catch you in Part 3.


Building a Real Time Data Pipeline in Azure – Part 1

Previously I walked through the process of using the Amazon Kinesis Service to build a real time analytics pipeline in which a bunch of data was ingested and then processed with the results pushed out to a data store which we could then query at our leisure. You can find the start of this series here.

This same ability is available in Microsoft Azure using a combination of Event Hubs, Analytic Streams, and CosmosDB. Let’s walk through creating this pipeline.

Event Hubs

Event Hubs in Azure are similar to Kinesis Firehose in that the aim is to intake a lot of data at scale. This is where it differs from something like Service Bus or Event Grid both of which are more geared towards enabling event driven programming.

To create an Event Hub you need to first create a namespace; it is within these namespaces that you create the hubs you will send data to. To create the namespace, search for and select Event Hubs by Microsoft.

One of the important aspects here is choosing your throughput units, which determine how much load your Hub can accommodate. For a simple demo you can leave it at 1; for more advanced scenarios you would increase it.

Enabling Auto Inflate allows you to specify the minimum number of throughput units while letting Azure automatically increase the number as needed.

Once you have your Namespace, click the local Add button to add your Hub. As a convention, I always suffix these with hub.

When creating the Hub you need to be aware of the Partition count. Partitions are an important part of this sort of big data processing as they enable data sharding, which lets the system better balance the load.

The number here MUST be between 2 and 32. You should try to think of a logical grouping of your data so that you can balance it. For example, in my sample I only have 10 user Ids, so I create 10 partitions and each one handles an individual user. In a real life scenario I might have far more users, so this would not work.

Regardless, selecting the appropriate partition key is essential for maximizing the processing power of Event Hubs (and other stream services).

The Azure documentation on Event Hubs partitions explains this in more depth.

Connecting to your Hub

So now that you have a hub up, we need to write events to it. This can be accomplished in a variety of ways, but I have not yet found a way to do it via API Management. I sense that this sort of data processing is of a different nature than pure event driven programming.

For starters you will need the Microsoft.Azure.EventHubs NuGet package which will make writing to the endpoint much easier.

Within your application you need to create a client reference via EventHubClient; this is created using your Endpoint and Shared Access Key.

You can get these values in one of two ways: globally or locally.

  • At the Namespace level, you can click on Shared access policies to work with policies that apply to ALL Hubs within that namespace – by default Microsoft expects this is what you will do and so creates the RootManageSharedAccessKey here
  • At the Hub level, you can click on Shared access policies to work with policies that apply ONLY to that Hub – this is what I recommend as I would rather not have my Namespace that open

Regardless of how you do this, a policy can support up to three permissions: Manage, Send, and Listen. In the context of eventing these should be pretty self explanatory.

Returning to the task at hand, we need to select the appropriate policy and copy the Connection String – primary key value. As a note, if you copy this value at the Namespace level you will NOT see the EntityPath included in the connection string, if you copy at the Hub level you will.

EntityPath can be applied via EventHubsConnectionStringBuilder. The value is the name of the Hub you will connect to within the namespace.
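The code listing is missing here, but a sketch using the Microsoft.Azure.EventHubs package looks roughly like this. The connection string, policy name, and hub name are placeholders:

```csharp
using Microsoft.Azure.EventHubs;

// Connection string copied from a Namespace-level policy (no EntityPath),
// so we supply the hub name ourselves via the builder.
var builder = new EventHubsConnectionStringBuilder(
    "Endpoint=sb://my-namespace.servicebus.windows.net/;" +
    "SharedAccessKeyName=SendPolicy;SharedAccessKey=<your-key>")
{
    EntityPath = "stock-hub"   // the name of the Hub within the namespace
};

var client = EventHubClient.CreateFromConnectionString(builder.ToString());
```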


Again, if you copy the Connection String from the Hub level you can OMIT the EntityPath specification that I have in the code above. I think that is the better approach anyway.

Sending Events

Once we have our client up and configured we can start to send events (or messages) to our Hub. We do this by calling SendAsync on the EventHubClient instance. The message can be passed in a variety of ways, but the standard is as a UTF8 byte array. This is easy to do with the Encoding class in .NET.
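As a sketch, assuming the client created earlier and a made-up transaction payload (the shape of the object here is purely illustrative):

```csharp
using System.Text;
using Microsoft.Azure.EventHubs;
using Newtonsoft.Json;

// A made-up payload type for illustration.
var transaction = new { UserId = 3, Symbol = "MSFT", Shares = 100 };
string json = JsonConvert.SerializeObject(transaction);

// EventData wraps the UTF8-encoded bytes of the message body.
await client.SendAsync(new EventData(Encoding.UTF8.GetBytes(json)));
```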


As with any sort of test involving streaming you need to make sure you generate a considerable amount of data so you can see the ingestion rates in the Hub metrics.

Ok so, we now have events streaming from our apps up to the Hub, now we need to do something with our data. In the next section, we will set up a Stream Analytics Job that runs against our incoming data, produces results, and drops the data into a Blob storage container.

Part 2 – here

Getting Started with Azure Event Grid

One of the most interesting scenarios the Cloud gives rise to is better support for building event driven backends. Such backends are necessary as the complexity and scale of applications grow and developers want to decouple service references. Additionally, cloud providers are raising events from their own services and allow developers to hook into these events, this is most noticeable on AWS.

While this trend is not new on Azure (Event Hubs have existed for a while), support for it has lagged compared to AWS. But by introducing EventGrid, Microsoft seems to be positioning themselves to support the same sort of event driven programming allowed within Amazon, with the added benefit of additional flexibility provided, mainly, by EventGrid.

What is Event Grid?

In the simplest sense, EventGrid is a router. It allows events to be posted to a topic (similar to Pub/Sub) and can support multiple subscriptions, each with differing configurations, including which eventType to support and which matching subjects will be sent to subscribers.

In effect, EventGrid allows you to define how you want the various services you are using to communicate. In that way, it can potentially provide a greater level of flexibility, I feel, than AWS, whose eventing generally lacks a filtering mechanism.

Creating an Event Topic

So, the first thing to understand is that the concept of a topic will appear twice. The first is the name of the EventGrid itself, which is itself a topic.

Choose to Add a Resource and search for event grid. Select Event Grid Topic from Microsoft from the returned results.

By default, the name of the topic will match the given name of the Event Grid, so pick wisely if this is for something where the name will need to have meaning.

Once this completes we will need to create a subscription.

Before you create a subscription

When you create a subscription you will be asked to give an endpoint type. This is where events matched to the topic will be sent. If you use a standard Azure service (i.e. Event Hubs) you can select it and go; however, if you use an Azure Function you will need to do something special to ensure the subscription can be created.

If you fail to follow this, you will receive (at the time of this writing) an undefined error in the Portal. The only way to get more information is to use the Azure CLI to create the subscription.

The error is in reference to a validation event that is sent to the prescribed endpoint to make sure it is valid. There is a good guide on doing this here. Essentially, we have to look for a certain event type and return the provided validation code as proof that we own the event endpoint.

I do not believe you have to do this with other Azure service specific endpoints, but you certainly have to do it if your endpoint is going to be of type WebHook.

The destination for this Event Grid subscription is the Azure Function Url.

Regardless of your endpoint type…

You will need to setup the various filtering mechanisms that indicate which events, when received, are passed through your subscription to the endpoint.

Most of these are fairly self explanatory except for the Event Schema field which refers to the structure of the data coming through the topic. It is necessary for the subscription to understand the arrangement so it can be analyzed.

The default is the pre-defined EventGridEvent which looks like this:

        {
            "id": "string",
            "eventType": "string",
            "subject": "string",
            "eventTime": "string",
            "data": "object",
            "dataVersion": "string"
        }

The key here is the data object which will contain your payload.

I am not sure of the other schemas as I have not explored them, mainly because the .NET EventGrid libraries have good native support for the EventGridEvent schema.

Let’s Send an Event

There are quite a few ways to send events to EventGrids. For any attempt you are going to need your Topic Endpoint, which you can find on the Overview section for your Event Grid. This is the Url you will submit new events to for propagation within Azure; it is also the value you will need to provide to any Azure service that you want to submit events to the Event Grid.

Disclaimer: At the time of this writing, there was some weirdness in the portal where you needed to use the CLI to route events from Azure services (like Storage) to an EventGrid.

Receiving Events from an Azure Resource

Within this tutorial you are shown how to use the Azure CLI (via the command shell or through the portal) to create the appropriate subscriptions. Do note that, once created, you will not see this listed in the portal.

Send a Custom Event
This is probably the case you came here to read about. Principally, sending custom events involves posting to the Event Grid Topic Url with a schema for the body that matches a subscription.

Doing this is easiest if you use the Microsoft.Azure.EventGrid NuGet library, which exposes a publish method on EventGridClient; here is an example:
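The original example is missing, so here is a sketch. The topic key, hostname, and event fields are placeholders; note that in the Microsoft.Azure.EventGrid package the method is the plural PublishEventsAsync, which takes a list of events:

```csharp
using System;
using System.Collections.Generic;
using Microsoft.Azure.EventGrid;
using Microsoft.Azure.EventGrid.Models;

// The topic key comes from the Access Keys page and the hostname from the
// Overview page of your Event Grid Topic; the values here are placeholders.
var client = new EventGridClient(new TopicCredentials("<topic-key>"));

var events = new List<EventGridEvent>
{
    new EventGridEvent
    {
        Id = Guid.NewGuid().ToString(),
        EventType = "myApp.itemCreated",
        Subject = "myapp/items/1",
        EventTime = DateTime.UtcNow,
        Data = new { name = "example" },   // your payload lives in data
        DataVersion = "1.0"
    }
};

await client.PublishEventsAsync("my-topic.westus2-1.eventgrid.azure.net", events);
```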


You can also send to this endpoint manually using the HttpClient library, but I recommend using the NuGet package. To get a better idea of what is in there, here is the source page:

Keep in mind that this will send an event using the EventGridEvent schema, which is what I recommend.

How do I test that things are working…

Once you have things in place, I recommend selecting the Azure Function and clicking the Run button. This will enter a mode where you can see a streaming log at the bottom. This is the stream for the function and is not limited to just what is passed via testing.

Once you have this up, send an event to your Event Grid Topic and wait for a log message to appear which will tell you things are connected correctly.

What are my next steps?

This gets into a bigger overall point but, in general, I do NOT recommend using the Event Grid Topic Url directly in production because it couples you to a Url that cannot be versioned and breaks the identity of your API.

Instead, as should be the case whenever you use Azure Functions, you should set things up behind an Azure API Management service. This lets you customize what the Url looks like, version it, and maintain better control over its usage.

Additionally, in an event driven system you should view events as a way to maximize throughput so users should be posting to this endpoint and the event should fan out.

To put our Event Grid Topic endpoint behind an Azure API Management Route, do this:

  1. Create an API Management service instance (this will take time) and wait for it to activate
  2. Create a Blank API
    • Tip: When you create the API you will be asked to select a Product. Without getting too much into what this is, select Unlimited to make things easy on yourself
  3. Access your API definition via the APIs link navigation bar
  4. Add a POST route –  you can provide whatever values you want to the required fields
  5. Copy the aeg-sas-key from the Access Keys section under your Event Grid Topic and make this a required header for your POST route
  6. Once the route is created, select the Edit icon for the Backend
  7. Select Http(s) endpoint and paste the Event Grid Topic Url, make sure to NOT include the /events at the end of this URL
  8. Hit Save
  9. Click on your API Operation (POST route) in the selection menu to the left
  10. Select Inbound Processing
  11. Change the Backend line such that it reads with the POST method and the resource is /events
  12. Hit Save and you can click Test to check this. Building the payload is a bit tedious. Use the source here to understand how the JSON should look for the EventGridEvent that I assume you are sending.

Ok so let’s recap.

The first bit of this is to create an Azure API Management service instance and then create a POST path. Azure API Management is a huge topic that I have covered part of in the past. It is an essential tool when building a MicroService architecture in Azure as it allows for unification of different service types (serverless, containerized, traditional, etc) into what appears to be a single unified API.

All requests to Azure Event Grid require the aeg-sas-key so, we make sure that Azure API Management does not allow any requests to the proxy to come through unless that header is at least present. Event Grid will determine its validity.

Once the general route is created we need to tell it where to forward the request it receives. This is simply the Event Grid Topic endpoint that we can get from the Event Grid Topic Overview page in the Azure Portal. However, we do NOT want to take the entire Url as we will need to use Url Rewriting instead of Url Forwarding. So, I recommend taking everything but the trailing /events which we will use in the next section. Be sure to also check Override to allow for text entry into the Url field.

Ok, now let's complete the circle. You need to click Inbound Processing and change the default processing from URL Forwarding to URL Rewrite. There is a bit of a quirk here where you cannot leave the text field for your backend blank. This is where you will want to drop the /events that we omitted from our backend URL.
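Assembled, the inbound policy for the operation ends up looking roughly like this. This is a sketch: the status code and error message are assumptions, and Event Grid itself still validates the key's value.

```xml
<inbound>
    <base />
    <!-- Reject requests that do not at least carry the Event Grid SAS key
         header; Event Grid determines whether the key is actually valid. -->
    <check-header name="aeg-sas-key" failed-check-httpcode="401"
                  failed-check-error-message="aeg-sas-key header is required"
                  ignore-case="true" />
    <!-- Rewrite the proxy path to the /events resource on the backend. -->
    <rewrite-uri template="/events" />
</inbound>
```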

You will want to use the Test feature to test this. I provided the source code for the Azure Event Grid EventGridEvent class so you can see how the JSON needs to be in test.


Event Driven Programming is a very important and vital concept in enabling high throughput endpoints. The design is inherently serverless and enables developers to tap into more of Azure’s potential and easily declare n number of things that happen when an endpoint is called.

API Management is a vital tool in creating a unified and consistent interface for external developers to use. It comes with many tools, including header enforcement, event body validation and transformation, and it can be integrated with an authentication mechanism. If you are going to create a Microservice type API or if you intend to break up a monolith, API Management is a great way to get free versioning and abstraction.

The biggest thing to remember is that Event Grid is geared towards enabling event driven scenarios; it is not designed for processing a high volume of events, which is the realm of Event Hubs. If you were going to do real time processing, you would feed your data pipeline into Event Hubs and route those events to Event Grid (potentially).

More information here

MVP for another year

It is with great humility that I announce Microsoft's decision to renew my MVP status for an additional year. While it is my second renewal, it comes as the first true renewal that I have had since being selected in 2016.

What I mean by that is, after I was selected in 2016 the program's renewal cycle changed and as part of the change I was grandfathered into the MVP program for 2017 into 2018. This meant it would be my accomplishments in 2017 that would dictate whether my MVP status would continue.

I spoke and blogged quite a bit in early 2017 but, shut things down around August to focus on my wedding and honeymoon. What’s more, throughout 2017 I was tasked with a large $4mil web project using NodeJS, AWS, and ReactJS for West Monroe. I was worried as this certainly drew my focus away from what got me my MVP. In addition, I decided to also refocus on the web and away from Xamarin (this as part of an overall decision to focus more on the Cloud side of things).

2018 has also not been easy as the AWS project finishes up and I celebrate the birth of my first child, my son Ethan. But I am committed to finding the balance and have already spoken at two conferences (CodeMash and Chicago Code Camp), have been selected to speak at TechBash, and have abstracts out to VSLive.

In the end, my willingness to share my ideas here, and the awesome people who take the time to read what I wrote and even share some of my articles on forums and StackOverflow, helped get that MVP renewal, and so I send out a Thank You to all.

Scratching the Surface of Real Time Apps – Part 3

Part 1
Part 2

This is it, our final step. This is basically saving our query result from the Analytics application to a DynamoDB where it can easily be queried. For this we will again use Lambda as a highly scalable and easy to deploy mechanism to facilitate this transfer.

Creating your Lambda function

In Part 1, I talked about installing the AWS Lambda templates for Visual Studio Code. If you list your installed templates using:

dotnet new -l

You will get a listing of all installed templates. A casual observation shows two provided Lambda templates that target Kinesis events. However, neither of these will work. I actually thought I had things working with these templates; however, the structure of the event sent from Analytics is very different: the Records property is the only one that maps, and it's all empty objects.

I would recommend starting with the serverless.EmptyServerless template. What you actually need is to use the KinesisAnalyticsOutputDeliveryEvent object which is available in the Amazon.Lambda.KinesisAnalyticsEvents NuGet package (link). Once you have this installed, ensure the first parameter is of type KinesisAnalyticsOutputDeliveryEvent and Amazon will handle the rest. For reference, here is my source code which writes to my DynamoDB table:
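The original code listing is gone, but a sketch under the same approach might look like this. The table name StockChanges, the field names, and the exact record property names are assumptions; check the Amazon.Lambda.KinesisAnalyticsEvents types in your installed package version:

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Threading.Tasks;
using Amazon.DynamoDBv2;
using Amazon.DynamoDBv2.Model;
using Amazon.Lambda.Core;
using Amazon.Lambda.KinesisAnalyticsEvents;
using Newtonsoft.Json.Linq;

public class Function
{
    private readonly IAmazonDynamoDB _dynamoDb = new AmazonDynamoDBClient();

    public async Task FunctionHandler(
        KinesisAnalyticsOutputDeliveryEvent input, ILambdaContext context)
    {
        foreach (var record in input.Records)
        {
            // Each record's payload is the base64-encoded JSON row from the
            // Analytics output stream (time, symbol, shareChange).
            var json = Encoding.UTF8.GetString(
                Convert.FromBase64String(record.Base64EncodedData));
            var row = JObject.Parse(json);

            await _dynamoDb.PutItemAsync("StockChanges",
                new Dictionary<string, AttributeValue>
                {
                    // time + symbol form the composite key described below.
                    ["time"] = new AttributeValue(row["time"].ToString()),
                    ["symbol"] = new AttributeValue(row["symbol"].ToString()),
                    ["shareChange"] = new AttributeValue { N = row["shareChange"].ToString() }
                });
        }
    }
}
```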

Here we are using the AWSSDK.DynamoDBv2 NuGet package to connect to Dynamo (the Lambda role must have access to Dynamo and our specific ARN). We use AttributeValue here to write, essentially, a JSON object to the table.

In my case, I have defined the partition and sorting keys since the data we can expect is a unique combination of a minute based timestamp and symbol; this is a composite key.

As with the Kinesis Firehose write Lambda use the following command to deploy (or update) the Lambda function:

dotnet lambda deploy-function <LambdaName>

Again, the Lambda name does NOT have to match the local name; it is the name that will be used when connecting with other AWS services, like our Analytics application.

Returning to our Analytics Application

We need to make sure our new Lambda function is set up as the destination for our Analytics Application. Recall that we created a stream when we wrote our SQL query. Using the Destination feature, we can ensure these bounded results wind up in the table. Part 2 explains this a bit better.

Running End to End

So, if you followed all three segments you should now be ready to see if things are working. I would recommend having the DynamoDB table's Items view open so you can see new rows as they appear.

Start your ingestion application and refresh Dynamo. It may take some time for rows to appear. If you are impatient, like me, you can use the SQL Editor to see rows as they come across. I also like to have Console output messages in the Lambda functions that I can see in CloudWatch.

With any luck you should start seeing rows in your application. If you don't, it could be a few things:

  • You really want to hit the Firehose with a lot of data, I recommend a MINIMUM of 200 records. If you send less the processing can be a bit delayed. Remember, what you are building is a system that is DESIGNED to handle lots of data. If there is not a lot of data, Kinesis may not be the best choice
  • Check that your Kinesis query is returning results and that you have defined an OUTPUT_STREAM. I feel like the AWS docs do not drive this point home well enough and seem to imply the presence of a “default” stream; this is not the case, at least not as far as I found.
  • Use CloudWatch obsessively. If things are not working, use Console.WriteLine to figure out where the break is. I found this a valuable tool
  • Ensure you are using the right types for your arguments. I spent hours debugging my write to Dynamo only to discover I was using the wrong type for the incoming Kinesis event

Final Thoughts

Taking a step back and looking at this end to end product, the reality is we did not write that much code. Between my Lambdas I wrote remarkably little code in total. Not bad given what this application can do and its level of extensibility (Kinesis streams can send to multiple destinations).

This is what I love about Cloud platforms. In the past, designing and building a system like this would take as long as the application that might be using the data. But now, with Cloud, it's mostly point and click and the platform handles the rest. Streaming data applications like this will only become more common as more systems move to the Cloud and more businesses want to gather data in new and unique ways.

Kinesis in AWS, EventGrid in Azure, and Pub/Sub in Google Cloud represent tooling that allows us to take these very valuable, highly complicated applications and build them in hours instead of weeks. In my view, AWS has the best support for applications like these, though I expect the others to make progress and catch up.

I hope you enjoyed this long example, and I hope it gives you ideas for your next application and shows how easy it is to get a viable production real time data application running quickly.

Scratching the Surface of Real Time Apps – Part 2

See Part 1 here

If you followed Part 1, we have a Lambda that sits behind an API Gateway that writes the contents of the payload it receives to an Amazon Kinesis Firehose Delivery Stream. This stream is currently configured to dump its contents, at set conditions, to S3.

S3 does not really lend itself to real time processing; though we could use something like Athena to query it, our aim is to see our data change in real time. For that we want to hook our Firehose stream up to a Data Analytics stream whose principal job is to process “bounded” data.


The most important thing with this step is having an idea of what you want to look for before you start. Generally, these are going to be time based metrics, but they don't have to be. Once you have established these ideas we can move on to the creation phase.

Create the analytics application

Configure the source of your Analytics Application

Go to your Amazon AWS Console and select Services -> Kinesis. This will bring up your Kinesis Dashboard, you should see the Firehose stream you created previously. Click Create analytics application.

Similar to the Firehose process we give our new stream an application name and click the Create action button.

An analytics application is comprised of three parts: Source, Processor, and Destination. You need to configure these in order. Amazon makes it pretty easy, except for the part on debugging which we will talk about in a bit.

For the Source you will want to connect it to the Firehose delivery stream you created previously, if you are following this tutorial. You can also use a Kinesis Data Stream. This part is pretty straightforward.

The important thing to take note of is the name of your incoming In-Application stream; this will serve as the “table”.

In order to save the configuration, the application will want to discover the schema from the incoming data. This is done using the Discover Schema button at the bottom; you need to have data flowing through for it to work.

As a side note, when I wrote my application which sends data, I wrote logic to limit the size of the result set being sent. This helps speed up development and lets you explore other data scenarios you may want to explore (pre-processing). I find this approach is better than truly opening the firehose before you are ready, pun intended.

Configure the Processing of your streaming data

When you think about writing a SQL query against a database like MySQL you probably don't often think about that data set changing underneath you, but that is what you need to consider for many operations within Analytics, because you are dealing with unbounded data.

The term “unbounded data” specifically applies to doing things like averages, mins, maxes, sums, etc: values that are calculated from the complete set of data. This can't happen with Analytics because you are dealing with a continuous stream of data, so you could never compute the average because it would keep changing. To mitigate this, query results need to operate on bounded data, especially for aggregates.

This bounding can be done a few ways but the most common is the use of a Window. Amazon has some good documentation on this here, as well as on the different types. Our example will use a Tumbling Window because we will consider the set of data relevant for a given minute, specifically the number of stock shares purchased or sold within that minute.

For stage 2 (Real Time Analytics) of the Analytics Application you will want to click the Go to SQL Results button, which will take you into the SQL Query editor. This is where you define the SQL to extract the metric you want. Remember, you MUST provide criteria to make the query operate on a bounded data set, otherwise the editor will throw an error.

For my application this is the SQL I used:
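The query listing is no longer shown here, but reconstructed from the description below it looks roughly like this. It is a sketch: SOURCE_SQL_STREAM_001 is the default in-application input name, and the shares/symbol fields are assumptions.

```sql
-- Step 1: the in-application stream that will hold query results.
CREATE OR REPLACE STREAM "OUTPUT_STREAM" (
    "time"        TIMESTAMP,
    "symbol"      VARCHAR(8),
    "shareChange" INTEGER
);

-- Step 2: a pump that inserts each query result into the stream.
CREATE OR REPLACE PUMP "OUTPUT_PUMP" AS
    INSERT INTO "OUTPUT_STREAM"
    SELECT STREAM
        STEP("SOURCE_SQL_STREAM_001".ROWTIME BY INTERVAL '60' SECOND) AS "time",
        "symbol",
        SUM("shares") AS "shareChange"
    FROM "SOURCE_SQL_STREAM_001"
    -- STEP in the GROUP BY is what produces the per-minute tumbling window.
    GROUP BY
        "symbol",
        STEP("SOURCE_SQL_STREAM_001".ROWTIME BY INTERVAL '60' SECOND);
```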

There is a lot going on here so let’s unpack it.

The first step you need to take is to create a stream that will contain the results of your query. You can KIND OF think of this as a “temp table” from SQL land, but it's more sophisticated.

In our example, we define a STREAM called OUTPUT_STREAM, which we will be able to select in Step 3, when we determine what happens after the query. This stream will feature 3 columns (time, symbol, and shareChange).

Our second step focuses on getting data from the query into the stream, which is done using a Pump. This will run the INSERT statement with each query result and insert the data into the stream.

The final bit is the query that actually does the lifting. It's pretty standard with the exception of the STEP operator in the GROUP BY clause – this is what creates the tumbling window we mentioned before.

Once you have set your query you can click Save and Run, which will save the query and execute it. You will be much more pleased with the outcome if you are sending data to your Firehose so the Analytics application has data it can show.

Connect to a Destination

The final configuration task for the Analytics Application is to specify where the query results are sent. For our example we will use a Lambda. However, doing this with C# can be somewhat confusing so, we will cover it in the next part.

You can specify this configuration by accessing the Destination section from the Application homepage (below):

Alternatively, you can select the Destination tab from within the SQL Editor. I find this way to be better since you can directly select your output stream as opposed to free typing it.

For both of these options, Amazon has made it very easy and as simple as point and click.


Congrats. You now have a web endpoint that will automatically scale and write data to a pipe that stores things in S3. In addition, we created an Analytics Application which queries the streamed data to derive metrics based on a window (per minute in this case). We then take these results and pass them to a Lambda, which we will cover in Part 3; it's going to write the results to a DynamoDB table that we can easily query from a web app to see our results in real time.

Part 3

Scratching the Surface of Real Time Apps – Part 1

Recently, I began experimenting with Amazon Kinesis as a way to delve into the world of real time data applications. For the uninitiated, this is referred to as “streaming” and it basically deals with the idea of continuous data sets. By using streaming, we can make determinations in real time, which allows a system to generate more value for those wanting insights into their data.

Amazon has completely embraced the notion of streaming within many of their services, supported by Lambda. This notion of being event driven is not a new pattern, but by listening for events from within the Cloud, these systems become much easier to develop and enable more scenarios than ever before.

Understanding Our Layout

Perhaps the most important thing when designing this, and really any Cloud based application, is a coherent plan of what you're after and what services you intend to use. Here is a diagram of what we are going to build (well, parts of it).



Our main focus will be on the Amazon components. For the Console App, I wrote a simple .NET Core app which randomly generates stock prices and shares bought. Really you can build anything that will just generate a lot of data for practice.
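If you want a quick harness of your own, here is a minimal sketch of what such a console app might look like. The endpoint URL is a placeholder and the JSON field names (symbol, price, sharesBought) are illustrative — use whatever schema your pipeline expects:

```csharp
using System;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

class Program
{
    static readonly string[] Symbols = { "MSFT", "AMZN", "GOOG", "AAPL" };

    static async Task Main()
    {
        var random = new Random();
        using (var client = new HttpClient())
        {
            while (true)
            {
                // Randomly generate a fake stock transaction
                var payload = $"{{ \"symbol\": \"{Symbols[random.Next(Symbols.Length)]}\", " +
                              $"\"price\": {Math.Round(random.NextDouble() * 500, 2)}, " +
                              $"\"sharesBought\": {random.Next(1, 100)} }}";

                // POST it to the API Gateway endpoint (placeholder URL)
                await client.PostAsync(
                    "https://<your-api-id>.execute-api.<region>.amazonaws.com/Production",
                    new StringContent(payload, Encoding.UTF8, "application/json"));

                await Task.Delay(100); // throttle so we don't hammer the endpoint
            }
        }
    }
}
```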

This data gets written to an API Gateway which allows us to create a more coherent mapping to our Lambda rather than exposing it directly. API Gateways enable a host of useful scenarios, such as unifying disparate services behind a single front end or starting the breakup of a monolith into microservices.

The Lambda takes the request and writes it to a Kinesis Firehose which is, as it sounds, a wide pipe that is designed to take A LOT of data. We do this to maximize throughput. It's like a queue, but much more than that, and it's all managed by Amazon, so it will scale automatically.

The Firehose then writes the data it receives to another Kinesis stream, but it's an Analytics stream, designed to be queried to see trends and other aspects of the incoming data. This outputs to a stream which is then sent to another Lambda. The thing to remember here is that we are working against a continuous data set, so bounding our queries is important.

The final Lambda takes the query result from analytics and saves it to Dynamo. This gives us our time-based data that we can query against from a web application.


Creating our Kinesis Ingestion Stream

Start off from the AWS Console: under Analytics, select Kinesis. You will want to click Create Firehose Delivery Stream.

For the first screen, you will want to give the stream a name and ensure the Source is set to Direct PUT. Click Next.

On the second screen, you will be asked if you want to do any kind of preprocessing on the items. This can be very important depending on the data set you are ingesting. In our case, we are not going to do any preprocessing; click Next.

On the third screen you will be asked about Destination. This is where the records you write to Firehose go after they pass through the buffer and into the ether. For our purposes, select S3 and create a new bucket (or use an existing one). I encourage you to explore the others as they can serve other use cases. Click Next.

On the fourth screen, you will be asked to configure various settings, including the all-important S3 Buffer Conditions. These control when Firehose flushes the data it receives to S3. Be sure to set your IAM Role; you can auto-create one if you need to. Click Next.

On the final screen click Create Delivery Stream. Once the process is complete we can return and write some code for the Lambda which will write to this stream.

Creating the Ingestion Lambda

There are many ways to get data into Kinesis, each of which is geared towards a specific scenario and dependent on the desired ingestion rate. For our example, we are going to use an HTTP-triggered Lambda written in C# (.NET Core 2.0).

Setup: I am quite fond of Visual Studio Code (download) and it has become my editor of choice. To make this process easier, you will want to run the following command to install the Amazon AWS Lambda templates (note: this assumes you have installed the .NET Core SDK on your machine)

dotnet new -i Amazon.Lambda.Templates::*

More information here

Once this is installed we can create our project using the EmptyServerless template, here is the command:

dotnet new lambda.EmptyServerless --name <YourProjectName>

This will create a new directory named after your project with src/ and test/ directories. Our efforts will be in the src/<ProjectName> directory; make sure you cd into it, as you must be in the project root to run the installed dotnet lambda commands.

Our basic process with this Api call is to receive a payload, via POST, and write those contents to our Firehose stream. Here is the code that I used:
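A minimal sketch of such a handler follows, assuming a Firehose delivery stream named input_transactions; the class and namespace names here are illustrative, and you will want to tweak it to your own payload:

```csharp
using System.IO;
using System.Text;
using System.Threading.Tasks;
using Amazon.KinesisFirehose;
using Amazon.KinesisFirehose.Model;
using Amazon.Lambda.Core;
using Newtonsoft.Json;

[assembly: LambdaSerializer(typeof(Amazon.Lambda.Serialization.Json.JsonSerializer))]

namespace PostTransaction
{
    public class Function
    {
        // The first parameter is typed as object so it matches any incoming
        // JSON schema; Amazon deserializes the request body for us
        public async Task<string> PostHandler(object input, ILambdaContext context)
        {
            // We need the raw contents back, so re-serialize the payload
            var raw = JsonConvert.SerializeObject(input);

            using (var client = new AmazonKinesisFirehoseClient())
            {
                var request = new PutRecordRequest
                {
                    DeliveryStreamName = "input_transactions",
                    Record = new Record
                    {
                        Data = new MemoryStream(Encoding.UTF8.GetBytes(raw + "\n"))
                    }
                };

                // Write the payload to our Firehose delivery stream
                await client.PutRecordAsync(request);
            }

            return "success";
        }
    }
}
```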

A few things to take note of here:

  • The first parameter is of type object and it needs to match the incoming JSON schema. Amazon will handle the deserialization from string to object for us, though we end up needing the raw contents anyway.
  • We need to install the AWSSDK.KinesisFirehose NuGet package, which gives us access to AmazonKinesisFirehoseClient
  • We call PutRecordAsync to write to our given Delivery Stream (in the case of this code that stream is named input_transactions)
  • I have changed the name of the function handler from FunctionHandler (which comes with the template) to PostHandler. This is fine, but make sure to update the aws-lambda-tools-defaults.json file to ensure function-handler is correct

Once you have your code written we need to deploy the function. I prefer to do this from the command line since it avoids the auto-creation of the API Gateway path. I have done both, and I like the clarity offered by doing it in stages. You need to run this command to deploy:

dotnet lambda deploy-function <LambdaFunctionName>

LambdaFunctionName is the name of the function as it appears in the AWS Console. It does NOT have to match what you call it in development, which is very useful to know.

As a prerequisite to this, you need to define a role for the Lambda to operate within. As with any IAM Role you define, it should only have access to the bare minimum of services that callers will need.
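For reference, a minimal policy for this role might look something like the sketch below; the delivery stream name is hypothetical, and you should scope the resource ARN to your actual stream:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "firehose:PutRecord",
      "Resource": "arn:aws:firehose:*:*:deliverystream/input_transactions"
    },
    {
      "Effect": "Allow",
      "Action": ["logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents"],
      "Resource": "*"
    }
  ]
}
```

The second statement is what lets the Lambda write its Console.WriteLine output to CloudWatch Logs.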

When you run the deploy, you will receive a prompt to pick an applicable role for the newly created function, future deployments of the same Lambda name will not ask this.

Our next step is to expose this on the web; for that we need to create an API Gateway route. As a side note, there is a concept of lambda-proxy which I could not get to work. If you can, great, but I made do with the more standard approach.

Add Web Access

I think API Gateways are one of the coolest and most useful services offered by Cloud providers (API Management on Azure and Cloud Endpoints on Google). They enable us to tackle the task of API generation in a variety of ways, offer standard features prebuilt, and offer an abstraction that lets us make changes.

I will be honest, I can never find API Gateway in the Services list; I always have to search for it. Click on it once you find it. On the next page press Create API. Here are the settings you will need to get started:

  • Name: Anything you want, this doesn't get exposed to the public. Be descriptive
  • Endpoint type: Regional (at this point I have not explored the differences between the types)

Click Create and you will be dropped on the API configuration main page.

The default endpoint given is / or the root of the API. Often we will add resources (i.e. User, Product, etc.) and then have our operations against those resources. For simplicity here we will operate against the root; this is not typical, but API organization and development is not the focus of this post.

With / selected click Actions and select Create Method. Select POST and click the checkmark. API Gateway will create a path to allow POST against / with no authentication requirements, this is what we want.

Here is a sample of the configuration we will specify for our API Gateway to Lambda pipe, note that my function is called PostTransaction, yours will likely differ.


Click Save and you will get a nice diagram showing the stages of the API Gateway route. We can test that things are working by clicking Test and sending a payload to our endpoint.
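As an example, a test payload for this pipeline might look something like the following; the field names are illustrative, so use whatever schema your producer actually sends:

```json
{
  "symbol": "MSFT",
  "price": 104.50,
  "sharesBought": 25
}
```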

I won't lie, this part can be maddening. I tried to keep this simple to avoid problems. There is good response logging from the Test output; you just have to play with it.

So we are not quite ready to call our application via Postman or whatever just yet; we need to create a stage. Stages allow us to deploy the API to different environments (Development, QA, Staging, Production). We can actually combine this with Deploy API.

IMPORTANT: Your API changes are NEVER live when you make them; you must deploy, each time. This is important because it can be annoying figuring out why your API doesn't work only to find you didn't deploy your change, as I did.

When you do Actions -> Deploy Api you will be prompted to select a Stage. In the case of our API, we have no stages, so we can create one. I chose Production; you can choose whatever.

Once the deploy finishes we need to get the actual Url that we will use. We can go to Stages in the left hand navigation and select our stage; the Url will be displayed along the top (remember to append your resource name if you didn't use /). This is the Url we can use to call our Lambda from Postman.

The ideal case here is that you create an app that sends this endpoint a lot of data so we can write a large amount to Firehose for processing.


In this part, we did not write that much code, and the code we did write could be tested rather easily within the testing harnesses for both API Gateway and Lambda. One thing that is useful is to use Console.WriteLine, as CloudWatch will capture anything written to STDOUT in its Logs.

Next Steps

With this we have an endpoint applications can send data to and it will get written to Firehose. By using Firehose we can support ingesting A LOT of data, and with Lambda we get automatic scaling. We abstract the Lambda path away using API Gateway, which also lets us include other API endpoints that could point at other services.

In our next part we will create a Kinesis Analytics Data Stream which allows us to analyze the data as it comes into the application – Part 2