So I made a Stock Data App

I decided to build an Event Driven Stock Price Application using Event Grid, SignalR, and ReactJS. Just a little something to play with as I prepare to join Microsoft Consulting Services. I thought I would recount my experience here. First, here is what the flow looks like:

Figure 1 – Diagram of Stock App

While the diagram may look overwhelming, it really is quite simple:

  • Producer console app starts with some seed data of stock prices I gathered
  • It adjusts these values using some random numbers
  • The change in price is sent to an Event Grid topic with an EventType declared
  • The Event Grid subscriptions look for events with a matching EventType
  • Those that match will fire their respective Azure function
  • The Azure Function will then carry out its given task

I really prefer Event Grid for my event driven applications. It's fast, cost effective, and has a better interaction experience than Service Bus topics, in my opinion. The subscription filters can get down to analyzing the raw JSON coming through, and it supports the up-and-coming CloudEvents (cloudevents.io) standard. It can also tie into the tenant providers and respond to native Azure events, such as blob creation/deletion. All in all, it is one of my favorite Azure services.

So regarding the application, I chose to approach this in a purely event driven fashion. All price changes are seen as events. The CalculateChangePercent function receives all events and, using the symbol as the partition key, looks up the most recent price stored in the database.

Based on this and the incoming data it determines the change percent and creates a new event. Here is the code for that:

[FunctionName("CalculateChangePercent")]
public void CalculateChangePercent(
[EventGridTrigger] EventGridEvent incomingEvent,
[Table("stockpricehistory", Connection = "AzureWebJobsStorage")] CloudTable stockPriceHistoryTable,
[EventGrid(TopicEndpointUri = "TopicUrlSetting", TopicKeySetting = "TopicKeySetting")] ICollector<EventGridEvent> changeEventCollector,
ILogger logger)
{
var stockData = ((JObject)incomingEvent.Data).ToObject<StockDataPriceChangeEvent>();
var selectQuery = new TableQuery<StockDataTableEntity>().Where(
TableQuery.GenerateFilterCondition(nameof(StockDataTableEntity.PartitionKey), QueryComparisons.Equal, stockData.Symbol)
);
var symbolResults = stockPriceHistoryTable.ExecuteQuery(selectQuery).ToList();
var latestEntry = symbolResults.OrderByDescending(x => x.Timestamp)
.FirstOrDefault();
if (latestEntry != null)
{
var oldPrice = (decimal) latestEntry.Price;
var newPrice = stockData.Price;
var change = Math.Round((oldPrice newPrice) / oldPrice, 2) * 1;
stockData.Change = change;
}
changeEventCollector.Add(new EventGridEvent()
{
Id = Guid.NewGuid().ToString(),
Subject = $"{stockData.Symbol}-price-change",
Data = stockData,
EventType = "EventDrivePoc.Event.StockPriceChange",
DataVersion = "1.0"
});
}

This is basically “event redirection”, that is, taking one event and creating one or more events from it. It's a very common approach for handling sophisticated event driven workflows. In this case, once the change percent is calculated the information is ready for transmission and persistence.

This sort of “multi-casting” is at the heart of what makes event driven so powerful and, at the same time, so risky. Here two subscribers will receive the exact same event and take very different actions:

  • Flow 1 – this flow takes the incoming event and saves it to a persistence store. Usually this needs to be something highly available; consistency is usually not something we care about.
  • Flow 2 – this flow takes the incoming event and sends it to the Azure SignalR service so we can have a real time feed of the stock data. This approach in turn allows connecting clients to also be event driven since we will “push” data to them (a sketch of that function follows this list).
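
To make Flow 2 concrete, here is a rough sketch of what that subscriber could look like using the SignalR output binding; the hub name and function name are my own placeholders rather than the actual code from the repo:

[FunctionName("BroadcastStockPriceChange")]
public static async Task BroadcastStockPriceChange(
    [EventGridTrigger] EventGridEvent incomingEvent,
    [SignalR(HubName = "stockdata")] IAsyncCollector<SignalRMessage> signalRMessages,
    ILogger logger)
{
    // forward the price-change payload to every connected SignalR client
    await signalRMessages.AddAsync(new SignalRMessage
    {
        Target = "UpdateStockPrice", // the event name the React client listens for
        Arguments = new object[] { incomingEvent.Data }
    });
}

The UpdateStockPrice target is the same event name the React client subscribes to later in this post.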

Let’s focus on Flow 1 as it is the most typical flow. Generally, you will always want a record of the events the system received either for analysis or potential playback (in the event of state loss or debugging). This is what is being accomplished here with the persistence store.

The reason you will often see this as a Data Warehouse or some sort of NoSQL database is that consistency is not a huge worry; NoSQL databases emphasize the AP portion of the CAP theorem (link) and are well suited to handling high write volumes – typical in event heavy systems, especially as you get closer to patterns such as Event Sourcing (link). There needs to be a record of the events the system processed.

This is not to say you should rely on a NoSQL database over an RDBMS (Relational Database Management System); each has its place and there are many other patterns which can be used. I like NoSQL for things like ledgers because it doesn't enforce a standard schema, so all events can be stored together, which allows for easier re-sequencing.

That said, there are also patterns which periodically read from NoSQL stores and write data into an RDBMS – this is often done when a high ingestion volume is expected but the data itself can be trusted to be consistent. This can feed data into a system where we need consistency checks for other operations.

Build the Front End

Next on my list was to build a frontend reader to see the data as it came across. I chose to use ReactJS for a few reasons:

  • Most examples seem to use jQuery and I am not particularly fond of jQuery these days
  • ReactJS is, to me, the best front end JavaScript framework and I hadn't worked with it in some time
  • I wanted to ensure I still understood how to implement the Redux pattern, and ReactJS has better support for it than Angular; not sure about Vue.js

If you have never used the Redux pattern, I highly recommend it for front end applications. It emphasizes a mono-directional flow of data built on deterministic operations. Here is a visual:

https://xximjasonxx.files.wordpress.com/2021/05/2821e-1bzq8fpvjwhrbxoed3n9yhw.png

I first used this pattern several years ago when leading a team at West Monroe; we built a task completion engine for restaurants and got pretty deep into the pattern. I was quite impressed.

Put simply, the goal of Redux is that all actions are handled the same and state is recreated each time a change is made, as opposed to updating state. By taking this approach, operations are deterministic, meaning the same result will occur no matter how many times the same action is executed. This meshes very nicely with the event driven model from the backend which SignalR carries to the frontend.

Central to this is the Store which facilitates subscribing and dispatching events. I won't go much deeper into Redux here; there are much better sources out there such as https://redux.js.org/. Simply put, when SignalR sends out a message it raises an event to listeners – in my case it's the UpdateStockPrice event. I can use a reference to the store to dispatch the event, which allows my reducers to see it and change their state.

Once a reducer changes state, a state updated event is raised and any component which is connected will update, if needed (ReactJS uses a virtual DOM to ensure components only re-render if they actually changed). Here is the code which is used (simplified):

// located at the bottom of index.js, the application bootstrap
let connection = new HubConnectionBuilder()
    .withAutomaticReconnect()
    .withUrl("https://func-stockdatareceivers.azurewebsites.net/api/stockdata")
    .build();

connection.on('UpdateStockPrice', data => {
    store.dispatch({
        type: UpdateStockPriceAction,
        data
    });
});

connection.start();

// reducers look for actions and make changes. The format of the action (type, data) is standard
// if the reducer is unaware of the action, we return whatever the current state held is
const stockDataReducer = (state = initialState, action) => {
    switch (action.type) {
        case UpdateStockPriceAction:
            const newArray = state.stockData.filter(s => s.Symbol !== action.data.Symbol);
            newArray.push(action.data);
            newArray.sort((e1, e2) => {
                if (e1.Symbol > e2.Symbol)
                    return 1;
                if (e1.Symbol < e2.Symbol)
                    return -1;
                return 0;
            });

            return { stockData: newArray };
        default:
            return state;
    }
};

// the component is connected to the store and will rerender when a state change is made
class StockDataWindow extends Component {
    render() {
        return (
            <div>
                {this.props.stockData.map(d => (
                    <StockDataLine stockData={d} key={d.Symbol} />
                ))}
            </div>
        );
    }
}

const mapStateToProps = state => {
    return {
        stockData: state.stockData
    };
};

export default connect(mapStateToProps, null)(StockDataWindow);

This code makes use of the redux and react-redux helper libraries. ReactJS, as I said before, supports Redux extremely well, far better than Angular last I checked. It makes the pattern very easy to implement.

So what happens is:

  • SignalR sends a host of price change events to our client
  • Our client dispatches events for each one through our store
  • The events (actions) are received by our reducer which changes its state
  • This state change causes ReactJS to fire render for all components, updating the virtual DOM
  • The virtual DOM is compared against the actual DOM and components update only where the virtual DOM differs

This whole process is very quick and is, at its heart, deterministic. In the code above, you notice the array is recreated each time rather than pushing the new price or trying to find the existing index and updating it. This may seem strange but it very efficiently PREVENTS side effects – which often manifest as some of the nastier bugs.

As with our backend, the same action could be received by multiple reducers – there is no 1:1 rule.

Closing

I wrote this application more to experiment with Event Driven programming on the backend and frontend. I do believe this sort of pattern can work well for most applications; in terms of Redux I think any application of even moderate complexity can benefit.

Code is here: https://github.com/jfarrell-examples/StockApp

Happy Coding

Common Misconception #3 – DevOps is a tool

DevOps is a topic very near and dear to me. It's something that I helped organizations with a lot as an App Modernization Consultant in the Cognizant Microsoft Business Group. However, I find that DevOps is routinely misunderstood or misrepresented to organizations.

What is DevOps?

In the simplest sense, DevOps is a culture focused on collaboration that aims to maximize team affinity and organizational productivity. While part of adopting it is the adoption of tools that allow teams to effectively scale, at its core it is a cultural shift to remove team silos and emphasize free and clear communication. One could argue, as Gene Kim does in his book The Phoenix Project, that the full realization of DevOps is the abolishment of IT departments; instead IT is seen as a resource embedded in each department.

From a more complex perspective, the tenets of DevOps mirror the tenets of Agile and focus on small iterations allowing organizations (and the teams within) to adjust more easily to changing circumstances. These tenets (like Agile's) are rooted in Lean Management, which was born out of the Toyota Production System (TPS) (link), which revolutionized manufacturing and allowed Toyota to keep pace with GM and Ford despite the latter two being much larger.

The Three Ways

DevOps culture carries forth from TPS the three ways which govern how work flows through the system, how it is evaluated for quality, and how observations upon that work inform future decisions and planning. For those familiar with Agile, this should, again, sound familiar – DevOps and Agile share many similarities in terms of doctrine. A great book for understanding The Three Ways (also authored by Gene Kim) is The DevOps Handbook.

The First Way

The First Way focuses on maximizing the left to right flow of work. For engineers this would be the flow of a change from conception to production. The critical idea of this Way is the notion of small batches. We want teams to consistently and quickly send work through flows and on to production (or some higher environment) as quickly as possible. Perhaps contrary to established thought, the First Way stresses that the faster a team moves, the higher their quality.

Consider: if a team works on a large batch of changes (100), does testing, and then ultimately deploys, the testing and validation is spread out across those 100 changes. Not only are teams at the mercy of a quality process which must be impossibly strict; if a problem does occur, the team must sort out WHICH of the 100 changes caused the problem. Further, the sheer size of the deployment would likely make rollback very difficult, if not impossible. Thus, the team may also be contending with downtime or limited options to ensure the problem does not introduce bad data.

Now consider, if that same team deployed 2 changes. The QA team can focus on a very narrow set of testing and if something goes wrong, diagnosing is much easier given the smaller size. Further, the changes could likely be backed out (or turned off) to prevent the introduction of bad data into the system.

There is a non-linear relationship between the size of the change and the potential risk of integrating the change – when you go from a ten-line code change to a one-hundred-line code change, the risk of something going wrong is more than 10x higher, and so forth

Randy Shoup, DevOps Manager, Google

Smaller batch sizes can help your teams move faster and get their work in front of stakeholders more efficiently and quickly. Doing so induces better communication between the team and their users, which ultimately helps each side get what they want out of the process. There is nothing worse than going off in a corner for 4 months, building something, and having it fall short of the needs of the business.

The Second Way

Moving fast is great but is only part of the equation. Like so much in DevOps (and Agile), the core learnings are defined so that they supplement each other. The Second Way emphasizes the need for fast feedback cycles or, more directly, is aimed at ensuring that the speed is supported by automated and frequent quality checks.

The Second Way is often tied to a concept in DevOps called shift-left, shown by the graph visual below:

Shift Left in action

It is not uncommon for organizations embracing a siloed approach to Quality Assurance to start QA near the end of a cycle, generally to ensure they can validate the complete picture. While this makes sense, its value is misplaced. I would ask anyone who has built or tested software: how often does this process end up being a bottleneck (reasons be damned) in delivery? If you are like most clients I have worked with, the answer is always.

The truth is, such a model does not work if we want teams to move with speed and quality. Shift Left therefore emphasizes that it is the people at the LEFT who need to do the testing (in the case of engineering that would be the developers). The goal is to discover a problem as quickly as possible so that it can be corrected, building on the well-established understanding that the earlier a problem is found, the cheaper it is to fix.

To put it bluntly, teams cannot make changes to systems, teams, or anything else if there is not a sense of validation to know what they did worked. For engineering, we can only know something is working if we can test that it is working, hence the common rule for high-performing teams that no problem should ever occur twice.

I cannot overstate how important these feedback cycles are, especially in terms of automation. Especially in engineering, giving developers confidence that IF they make a mistake (and they will) it will get caught before it gets to production is HUGE. Without this confidence, the value provided by The First Way will be limited.

And equally critical to creating the cycles is UNDERSTANDING the means of testing and what use case is best tested by what. Here is an image of the Testing Pyramid which I commonly use with clients when explaining feedback cycles for Engineering.

For those wondering where manual testing goes – it is at the very top and has the fewest number. Manual tests should be transitioned to an automated tool.

A final point I want to share here: DevOps considers QA a strategic resource, NOT a tactical one. That is, high functioning teams do NOT expect QA persons to do the testing; these individuals are expected to ORGANIZE the testing. From this standpoint, they would plan out what tests are needed and ensure the testing is happening. In some cases, they may be called on to educate developers on what tests fit certain use cases. Too often, I have seen teams view QA as the person who must do the testing – this is false and only encourages bottlenecking. Shift-left is very clear that DEVELOPERS need to do the majority of testing since they are closer to a given change than QA.

The Third Way

No methodology is without fault and it would be folly to believe there is a prescriptive approach to anything that fits any team. Thus, The Third Way stresses that we should use metrics to learn about and modify our process. This includes how we work as well as how our systems work. The aim is to create a generative culture that is constantly accepting of new ideas and seeks to improve itself. Teams embracing this Way apply the scientific method to any change in process and work to build high trust. Any failure is seen not as a time to assign blame, but rather as an opportunity to learn and evolve.

“The only sustainable competitive advantage is an organization’s ability to learn faster than the competition”

Peter Senge – Founder of the Society for Organizational Learning

For any organization the most valuable asset is the knowledge of their employees for only through this knowledge can improvements be made that enable their products to continue to produce value for customers. Put another way:

Agility is not free. Its cost is the continual investment to ensure teams can maintain velocity. I have seen software engineering department leads ask, over and over, why the team is not hitting its pre-determined velocity. Putting aside the fallacy of telling a team what speed they should work at, velocity is not free. If I own a sports car but perform no maintenance on it, soon it will drive the same as a typical consumer sedan. Why?

“.. in the absence of improvements, processes do NOT stay the same. Due to chaos and entropy, processes actually degrade over time”

Mike Rother – Toyota Kata

No organization, least of all engineering, can hope to achieve its goals if it does not continually invest in the process of reaching those goals. Teams which do not perform maintenance on themselves are destined to fail and, depending on the gravity of the failure, the organization could lose more than just money.

In Scrum, teams will use the Sprint Retrospective to call attention to things which should be stopped, started, and continued as a way to ensure they are continually enhancing their process. However, too often, I have seen these same teams shy away from ensuring, in each sprint, there is time taken to remove technical debt or add some automation, usually because they must hit a target velocity or deliver a certain feature. This completely gets away from the spirit of Agile and DevOps.

It's about culture

Hopefully, despite my occasional references to engineering, you can understand that The Three Ways are about culture and about embracing many lessons learned from manufacturing about how to effectively move work through flows. DevOps is an extremely deep topic that, regrettably, often gets boiled down to a somewhat simplistic question of “Do you have automated builds?”. And yes, automation is key to embracing DevOps, but it is less important than establishing the cultural norms to support it. Simply having an automated build means little if all work must pass through central figures or certain work is handed off to silos where the timeline is no longer the team's.

Further Reading

The topic of DevOps is well covered, especially if you are a fan of Gene Kim. I recommend these books to help understand DevOps culture better – I list them in order of quality:

  • The DevOps Handbook (Gene Kim) – Amazon
  • The Phoenix Project (Gene Kim et al) – Amazon
  • Effective DevOps (Davis and Daniels) – Amazon
  • The Unicorn Project (Gene Kim) – Amazon
  • Accelerate (Forsgren et al) – Amazon

Thank you for reading

Common Misconception #2 – Serverless is good for APIs

The next entry in this series is something that hits very close to home for me: Serverless for APIs. Let me first start off by saying, I am not stating this as an unequivocal rule. As with anything in technology there are cases where this makes sense. And, in fact, much of my consternation could be alleviated by using Containers. Nevertheless, I still believe the following to be true:

For any non-simple API, a serverless approach is going to be more harmful and limiting, and in some cases more costly, than using a traditional server.

Background

When AWS announced Lambda back in 2014 it marked the first time a major cloud platform had added support for what would become known as FaaS (Function as a Service). The catchphrase was serverless, which did not make a lot of sense to people since there was obviously still a server involved – but marketing people gotta market.

The concept was simple enough: using Lambda I could deploy just the code I wanted to run and pay per invocation (and the cost was insanely cheap). One of the things Lambda enabled was the ability to listen for internal events from things like S3 or DynamoDB and create small bits of code which responded to those events. This enabled a whole new class of event driven applications as Lambda could serve as the glue between services – EventBridge came along later (a copy of Azure's Event Grid service) and further elevated this paradigm.

One of the Events is a web request

One of the most common types of applications people write are APIs and so Lambda made sure to include support for handling web calls – effectively by listening for a request event coming from outside the cloud. Using serverless and S3 static web content, a company could effectively run a super sophisticated website for a fraction of the cost of traditional serving models.

This ultimately led developers to use Lambda and Azure Functions as a replacement for Elastic Beanstalk or Azure App Service. And this is where the misconception lies. While Lambda is useful to glue services together and provide for simple webhooks it is often ill-suited for complex APIs.

The rest of this will be in the context of Azure Functions but, conceptually, the same problems exist with Google Cloud Functions and AWS Lambda.

You are responding to an Event

In traditional web server applications, a request is received by the ISAPI (Internet Server Application Program Interface) where it is analyzed to determine its final destination. And this destination can be affected by code, filters, and other mechanisms.

Serverless, however, is purely event driven, which means once an event enters the system it cannot be cancelled or redirected; it must invoke its handler. Consider the following problem that was encountered with Azure Function filters while developing KlipTok.

On KlipTok there was a need to ensure that each request contained a valid header. In traditional ASP .NET Core, we would write a simple piece of middleware to intercept the request and, if necessary, short-circuit it should it be deemed invalid. While technically possible in Azure Functions, it requires fairly in-depth knowledge and customization to achieve.
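
For contrast, here is roughly what that middleware looks like in ASP .NET Core; the header and class names here are illustrative, not the actual KlipTok code:

public class RequireHeaderMiddleware
{
    private readonly RequestDelegate _next;

    public RequireHeaderMiddleware(RequestDelegate next) => _next = next;

    public async Task InvokeAsync(HttpContext context)
    {
        // short-circuit the pipeline when the expected header is missing
        if (!context.Request.Headers.ContainsKey("X-KlipTok-Header"))
        {
            context.Response.StatusCode = StatusCodes.Status401Unauthorized;
            return;
        }

        await _next(context);
    }
}

// registered in Startup.Configure with: app.UseMiddleware<RequireHeaderMiddleware>();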

In the end, we leveraged IFunctionsInvocationFilter (a preview feature), which allowed code to run ahead of the function's execution (no short-circuit allowed) and mark the request. Each function then had to check for this mark. It did allow us to reduce the code but it certainly was not as clean as a traditional API framework.

The above is one of many examples of elements which are planned and accounted for in full-fledged API frameworks (being able to plug into the ISAPI being another) but are otherwise lacking in serverless frameworks, Azure Functions in this case. While there does exist the ability to supplement some of these features with containerization or third party libraries, I still believe such a play detracts from the intended purpose of serverless: to be the glue in complex distributed systems.

It is not to say you never should

The old saying “never say never” certainly holds true in Software Engineering as much as anywhere. I am not trying to say you should NEVER do this; there are cases where Serverless makes sense. This is usually because the API is simple, or the Serverless piece is leveraged by a proxy API or represents specific routes within the API. But I have, too often, seen teams leverage serverless as if it were a replacement for Azure App Service or Elastic Beanstalk – it is not.

As with most things, teams need to be aware and make informed decisions with an eye on the likely road of evolution a software product will take. While tempting, Azure Functions have a laundry list of drawbacks you need to be aware of, including:

  • Pricing which will vary with load taken by the server (if using Consumption style plans)
  • Long initial request times as the Cloud provider must stand up the infrastructure to support the serverless code – oftentimes our methods will go to sleep
  • Difficulties with organization and code reuse. This is certainly much easier in Azure than AWS but, still something teams need to consider as the size of the API grows
  • Diminished support for common and expected API features. Ex: JWT authentication and authorization processing, dependency injection in filters, lack of ability to short circuit.

There are quite a few more but, you get the idea. In general, the aim for a serverless method is to be simple and short.

There are simply better options

In the end, the main point here is: while you can write APIs in Serverless, oftentimes you simply shouldn't – there are better options available. A great example is the wealth of features web programmers are used to and expect when building APIs that are simply not available or not easy to implement with serverless programming. Further, as project sizes grow, properly maintaining and managing the codebase becomes more difficult with serverless than with traditional programming.

In the end, serverless main purpose should be to glue your services together, enabling you to easily build a chain like this:

The items in blue represent the Azure Functions this sequence would require (at a minimum). The code here is fairly straightforward thanks to the use of bindings. These elements hold the flow together and support automated retry and fairly robust failure handling right out of the box.

Bindings are the key to using Serverless correctly

I BELIEVE EventBridge in AWS enables something like this but, as is typical, Microsoft has a much more thought-out experience for developers in Azure than AWS has – especially here.

Triggers and bindings in Azure Functions | Microsoft Docs

Bindings in Azure allow Functions to connect to services like Service Bus, Event Grid, Storage, SignalR, SendGrid, and a whole lot more. Developers can even author their own bindings. By using them, the need to write the boilerplate connect-and-listen code is removed so the functions can contain code which is directed at their intended purpose. One of these bindings is an input trigger called HttpTrigger, and if you have ever written an Azure Function you are familiar with it. Given what we have discussed, its existence should make more sense to you.

A function is always triggered by an event. And the one that everyone loves to listen for is the HttpTrigger: an event sent to your function app which matches certain criteria defined in the HttpTrigger attribute.
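
Viewed that way, even a web endpoint is just another event handler gluing services together. A minimal sketch of the idea (the route, queue name, and function name are illustrative):

[FunctionName("ReceiveOrder")]
public static async Task<IActionResult> ReceiveOrder(
    [HttpTrigger(AuthorizationLevel.Function, "post", Route = "orders")] HttpRequest request,
    [Queue("incoming-orders", Connection = "AzureWebJobsStorage")] IAsyncCollector<string> orderQueue,
    ILogger logger)
{
    // the HTTP request is just another event; hand the payload off to the next service in the chain
    var body = await new StreamReader(request.Body).ReadToEndAsync();
    await orderQueue.AddAsync(body);

    return new AcceptedResult();
}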

So returning to the main point: everything in serverless is the result of an event, so we want to view the methods we create as event handlers, not full-fledged endpoints. While serverless CAN support an API, it lacks many of the core features which are built into API frameworks and therefore should be avoided for all but simple APIs.

Common Misconception #1 – Entity Framework Needs a Data Layer

This is the first post in what I hope to be a long running series on common misconceptions I come across in my day to day as a developer and architect in the .NET space though, some of the entries will be language agnostic. The goal is to clear up some of the more common problems I find teams get themselves into when building applications.

A little about me: I am a former Microsoft MVP and have been working as a consultant in the .NET space for close to 15 years at this point. One of the common tasks I find myself doing is helping teams develop maintainable and robust systems. This can be from the standpoint of embracing more modern architecture such as Event Driven systems or using containers, or it can be a modernization of the process to support more efficient workflows that enable teams to deliver more consistent and reliable outcomes while balancing the effort with sustainability.

The first misconception is one which I run across A LOT. And that is the way in which I find teams leveraging Entity Framework.

Entity Framework is a Repository

One of the most common data access patterns right now is the Repository pattern – Link. The main benefit is that it enables developers to embrace the Unit of Work technique, which results in simpler, more straightforward code. However, too often I see teams build their repository and simply create data access methods on the classes – effectively creating a variant of the Active Record or Provider pattern with the name Repository.

This is incorrect and diminishes much of the value the Repository pattern is designed to bring: mainly, that operations can work with data in memory as if they were talking to the database and save their changes at the end. Something like this:

The Repository pattern works VERY well with web applications and frameworks like ASP .NET because we can SCOPE the database connection (called a Context in Entity Framework) to the request, allowing our application to maximize the connection pool.

In the above flow, we only talk to the database TWO times despite the many operations; everything is done in memory and the underlying framework will handle the details for us. Too often I see code like this:

public async Task<bool> DoWork(IList<SomeItem> items)
{
    // each call hits the database separately - one round trip per item
    foreach (var item in items.Where(x => x.Id % 2 == 0))
    {
        await _someRepo.DeleteItem(item.Id);
    }

    return true;
}

This looks fairly benign but it is actually quite bad as it machine-guns the database with each Id. In a small, low traffic application this won't be a problem but, in a larger site with high volume, this is likely to cause bottlenecks, record locking, and other problems. How could this be written better?

// variant 1
public async Task<bool> DoWork(IList<SomeItem> items)
{
    // assume _context is our EF Context
    foreach (var item in items.Where(x => x.Id % 2 == 0))
    {
        var it = await _context.Items.FirstOrDefaultAsync(x => x.Id == item.Id);
        _context.Remove(it);
    }

    // one call persists all of the deletes
    await _context.SaveChangesAsync();
    return true;
}

// variant 2
public async Task<bool> DoWork(IList<SomeItem> items)
{
    // assume _context is our EF Context
    var targetIds = items.Where(x => x.Id % 2 == 0).Select(x => x.Id).ToList();
    var targetItems = await _context.Items
        .Where(x => targetIds.Contains(x.Id))
        .ToListAsync();

    foreach (var item in targetItems)
    {
        _context.Remove(item);
    }

    await _context.SaveChangesAsync();
    return true;
}

In general, reads are less of a problem for locking and throughput than write operations (create, update, delete), so reading the database as in Variant 1 is not going to be a huge problem right away. Variant 2 leans on EF's SQL generation to create a query which gets our items in one shot.

But the key thing to notice in this example is the direct use of the context. Indeed, what I have been finding is I don't create a data layer at all and instead allow Entity Framework to be the data layer itself. This opens up a tremendous number of possibilities as we can then take a building block approach to our service layer.

Services facilitate the Operation

The term “service” is horrendously overused in software engineering as it applies to so many things. In my case, I am using it to describe the classes which do the thing. Taking a typical example application here is how I prefer to organize things:

  • Controller – the controller is the traffic cop determining if the provided data meets acceptable criteria such that we can accept that request. There is absolutely no business logic here HOWEVER, for simple reads we may choose to inject our Context to perform those reads
  • Service – the guts of the application, this contains a variety of services varying in size and types. I try to stick with the Single Responsibility Principle in defining these classes. At a minimum we have a set of facilitators which facilitate a business process (we will cover this next) and other smaller services which are reusable blocks.
  • Data Layer – this is the EF context. Any custom mapping or definitions are written here

The key feature of a facilitator is the call to SaveChanges as this will mark the end of the Unit of Work. By taking this approach we get a transaction for free since the code can validate the data as it places it into the context, instead of waiting for a SQL Exception to indicate a problem.
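
As a rough sketch of what a facilitator might look like under this approach (the service and entity names here are made up for illustration):

public class ProcessOrderService
{
    private readonly AppDbContext _context;
    private readonly InventoryService _inventoryService;
    private readonly PricingService _pricingService;

    public ProcessOrderService(AppDbContext context, InventoryService inventoryService, PricingService pricingService)
    {
        _context = context;
        _inventoryService = inventoryService;
        _pricingService = pricingService;
    }

    public async Task<Order> ProcessAsync(OrderRequest request)
    {
        // the sub-services share the same scoped context and only stage changes in memory
        var order = await _inventoryService.ReserveItemsAsync(request);
        await _pricingService.ApplyPricingAsync(order);

        // the facilitator owns the single SaveChanges call - the end of the Unit of Work
        await _context.SaveChangesAsync();
        return order;
    }
}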

By taking this approach, code is broken into reusable modules which can be reinjected and reused, plus it is VERY testable. This is an example flow I wrote up for a client:

Here the Process Payment Service is the facilitator and calls on the sub-services (shaded in blue). Each of these gets a context injection but, since the context is scoped each gets the same one. This means everyone gets to work with what is essentially their own copy of the database during their execution run.

The other benefit of this approach is avoiding what I refer to as service wastelands. These are generic service files in our code (PersonService, TransactionService, PaymentService, etc.) which become dumping grounds for methods – I have seen some of these files with upwards of 100 methods. Teams need to avoid doing this because the file becomes so long that ensuring uniqueness and efficiency among the methods becomes an untenable task.

Instead, teams should focus on creating purpose driven services which either facilitate a process or contain core business logic that may be reused in the code base. Combined with using Entity Framework as the data layer, code becomes cleaner and more straightforward.

What are the Exceptions?

So, am I saying you should have NO data layer ever? No. As with anything, this is not black and white and there are cases for a data layer of sorts. For example, some queries to the database are too complex to put into a LINQ statement and developers will need to resort to SQL. For these cases, you will want to have a wrapper around the call for both reuse and maintenance.

But, do not take that to mean you need a method to ensure you do not rewrite FirstOrDefault in two or more spots. Of course, if you have a particularly complex LINQ query you might choose to hide it. However, keep in mind the MAIN REASON to hide code is to avoid requiring another person to have intimate knowledge of a process to carry out the operation. It is NOT, despite popular opinion, to avoid duplication (that is an entirely separate issue I will discuss later).

Indeed, the reason you should be hiding something is because it is complex in nature and error prone in its implementation such that problems could arise later. A simple Id look up does not fall into this category.

Conclusion

The main point I made here is that Entity Framework IS an implementation of the Repository pattern, and so placing a repository pattern around it is superfluous. ASP .NET Core contains methods to ensure the context is scoped appropriately and disposed of at the end of a request. Leverage this, use the context directly in your services, and lean on the Unit of Work pattern while treating the Context as your in-memory database. Let Entity Framework take responsibility for updating the database when you are complete.
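
That scoping is what AddDbContext gives you out of the box; a minimal sketch, with the context and connection string names being illustrative:

// Startup.ConfigureServices
public void ConfigureServices(IServiceCollection services)
{
    // AddDbContext registers AppDbContext with a Scoped lifetime by default,
    // so each request gets its own context which is disposed when the request ends
    services.AddDbContext<AppDbContext>(options =>
        options.UseSqlServer(Configuration.GetConnectionString("AppDatabase")));

    services.AddControllers();
}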

Manual JWT Validation in .NET Core

Recently, I have been working with Jeff Fritz over at https://www.twitch.tv/csharpfritz as part of his effort to build a TikTok like site for Twitch, uniquely called KlipTok (https://www.kliptok.com). Mainly my efforts have been on shoring up the backend code in the BackOffice using Azure Functions.

This was one of my first major exposures to the Twitch API. It's fine overall but it oddly does not use JWT tokens to communicate state back and forth; rather, an issued string is required for authenticated requests. I wanted to try a different approach to handling token auth and refresh, so I devised the following POC: https://github.com/jfarrell-examples/TwitchTokenPoc.

One of the aspects of the Twitch API is that tokens can expire, and calls should be ready to refresh an access token which enters this state. The trouble is, these are two tokens and I didn't want clients to be required to send both tokens, nor did I want the client to have to resubmit a request. I decided I would create my own token and store within it, as claims, the access token and refresh token.

Taking this approach would allow the POC to, in effect, make it seem like Twitch is issuing JWT tokens while still allowing the backend to perform the refresh. I decided, for additional security, I would encrypt the token claims in my JWT using Azure Key Vault Keys.

Part 1: Creating the Token

This approach hinges on what I refer to as token interception. As part of any OAuth/OIDC flow, there is a callback after the third party site (Twitch in this case) has completed the login. Tokens are sent to this callback for the sole purpose of allowing the caller to store them.

In order to achieve this, I created a method which a client would call at the very start. This contacts Twitch and reissues the active tokens, if they exist, or requests the user to log in again:

public IActionResult Get()
{
    var redirectUri = WebUtility.UrlEncode("https://localhost:5001/home/callback");
    var urlString = @$"https://id.twitch.tv/oauth2/authorize?client_id={_configuration["TwitchClientId"]}"
        + $"&redirect_uri={redirectUri}"
        + "&response_type=code"
        + "&scope=openid";

    return Redirect(urlString);
}

The key here is the redirectUri which redirects the provided response code back to the application. Here we can create the token and send it to the client. You can find this method in the provided GitHub repository, HomeController.

You can find MANY examples of creating a JWT Token on the internet, I will use this one for reference: https://www.c-sharpcorner.com/article/asp-net-web-api-2-creating-and-validating-jwt-json-web-token/

Here is my code which creates the token string with the access token and refresh token as claims:

public async Task<string> CreateJwtTokenString(string accessToken, string refreshToken)
{
    var jwtSigningKey = await _keyVaultService.GetJwtSigningKey();
    var securityKey = new SymmetricSecurityKey(Encoding.UTF8.GetBytes(jwtSigningKey));
    var signingCredentials = new SigningCredentials(securityKey, SecurityAlgorithms.HmacSha256Signature);

    // the Twitch tokens are encrypted and carried as claims inside our own JWT
    var secToken = new JwtSecurityToken(
        issuer: _configuration["Issuer"],
        audience: _configuration["Audience"],
        claims: new List<Claim>
        {
            new Claim("accessToken", await _cryptoService.Encrypt(accessToken)),
            new Claim("refreshToken", await _cryptoService.Encrypt(refreshToken))
        },
        notBefore: null,
        expires: DateTime.Now.AddDays(1),
        signingCredentials: signingCredentials);

    return new JwtSecurityTokenHandler().WriteToken(secToken);
}

The actual signing key is stored as a secret in Azure Key Vault with access controlled using ClientSecretCredential; those values are stored in environment variables and are not located in source code. You can find more information on this approach here: https://jfarrell.net/2020/07/14/controlling-azure-key-vault-access/. The one critical point I will make is that ClientSecretCredential is only appropriate for local development – when deploying into Azure, be sure the code is using a Managed Identity driven approach.
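
A minimal sketch of that switch, assuming the Azure.Identity library (the environment variable names here are illustrative):

public TokenCredential GetKeyVaultCredentials()
{
    // local development: an App Registration's client secret pulled from environment variables
    if (Environment.GetEnvironmentVariable("ASPNETCORE_ENVIRONMENT") == "Development")
    {
        return new ClientSecretCredential(
            Environment.GetEnvironmentVariable("TenantId"),
            Environment.GetEnvironmentVariable("ClientId"),
            Environment.GetEnvironmentVariable("ClientSecret"));
    }

    // deployed to Azure: rely on the service's Managed Identity instead
    return new ManagedIdentityCredential();
}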

I defined a simple method which grabs the encryption key from Azure Key Vault and encrypts (or decrypts) the data.

// getting the key
private KeyClient KeyClient => new KeyClient(
    vaultUri: new Uri(_configuration["KeyVaultUri"]),
    credential: _getCredentialService.GetKeyVaultCredentials());

public async Task<KeyVaultKey> GetEncryptionKey()
{
    var keyResponse = await KeyClient.GetKeyAsync("encryption-key");
    return keyResponse.Value;
}

// usage
public async Task<string> Encrypt(string rawValue)
{
    var encryptionKey = await _keyVaultService.GetEncryptionKey();
    var cryptoClient = new CryptographyClient(encryptionKey.Id, _getCredentialService.GetKeyVaultCredentials());

    var byteData = Encoding.Unicode.GetBytes(rawValue);
    var encryptResult = await cryptoClient.EncryptAsync(EncryptionAlgorithm.RsaOaep, byteData);

    return Convert.ToBase64String(encryptResult.Ciphertext);
}

The beauty of using Azure Key Vault is NO ONE but Azure is aware of the key. Using this, even if our JWT token is somehow leaked, the data within is not easy to decipher.

Once generated, this token can be passed back to the client either as data or in some header, allowing the client to store it. We can then use the built-in validation to require the token with each call.

Part 2: Validating the Token

Traditionally, tokens are signed by an authority and the underlying system will contact that authority to validate the token. However, in our case, we have no such authority so, we will want to MANUALLY validate the token, mainly its signature.

It turns out this is rather tricky to perform in ASP .NET Core due to the way the validation middleware is implemented. The best way I found to get it to work cleanly is to adjust the way you register certain dependencies in ConfigureServices, as such:

var keyVaultService = new KeyVaultService(new GetCredentialService(Configuration), Configuration);
var tokenSecurityValidator = new JwtSecurityTokenValidator(Configuration, keyVaultService);

services.AddTransient<CryptoService>()
    .AddTransient<JwtTokenService>()
    .AddSingleton<TwitchAuthService>()
    .AddSingleton(p => keyVaultService)
    .AddTransient(p => tokenSecurityValidator)
    .AddSingleton<GetCredentialService>()
    .AddTransient<TwitchApiService>()
    .AddTransient<GetTokensFromHttpRequestService>()
    .AddTransient<ProcessApiResultFilter>();

// add auth middleware
services.AddAuthentication(JwtBearerDefaults.AuthenticationScheme)
    .AddJwtBearer(options =>
    {
        options.RequireHttpsMetadata = false;
        options.SecurityTokenValidators.Add(tokenSecurityValidator);
    });

You can see the keyVaultService and tokenSecurityValidator are defined as concrete dependencies and we use the provider override syntax for AddSingleton to pass the instance directly. This is done so we can pass the direct instance of tokenSecurityValidator to the options for validating our Bearer token.

This class calls on its dependencies and validates the signature of the token and ensures it matches with our expectations: https://github.com/jfarrell-examples/TwitchTokenPoc/blob/master/JwtSecurityTokenValidator.cs
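
At its core, that validator boils down to a call to JwtSecurityTokenHandler with a set of TokenValidationParameters. Here is a rough sketch of the idea; the configuration keys and key-fetching call are assumptions, and the real class lives in the linked repo:

public ClaimsPrincipal ValidateToken(string securityToken, TokenValidationParameters validationParameters, out SecurityToken validatedToken)
{
    // rebuild the same symmetric key used in CreateJwtTokenString
    // (ISecurityTokenValidator is synchronous, hence the blocking call)
    var signingKey = _keyVaultService.GetJwtSigningKey().Result;

    var parameters = new TokenValidationParameters
    {
        ValidateIssuer = true,
        ValidIssuer = _configuration["Issuer"],
        ValidateAudience = true,
        ValidAudience = _configuration["Audience"],
        ValidateIssuerSigningKey = true,
        IssuerSigningKey = new SymmetricSecurityKey(Encoding.UTF8.GetBytes(signingKey)),
        ValidateLifetime = true
    };

    // throws if the signature, issuer, audience, or lifetime checks fail
    return new JwtSecurityTokenHandler().ValidateToken(securityToken, parameters, out validatedToken);
}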

The result of adding this (and the appropriate Use methods in the Configure method) is we can fully leverage [Authorize] on our actions and controllers. Users who pass no token or a token that we cannot validate will receive a 401 Unauthorized.

Part 3: Performing the Refresh

The first step with any call is the ability to GET the token for the request so it can be used. There are MANY ways to do this. As I wanted to keep this simple, I elected to use IHttpContextAccessor. This is a special dependency you can have ASP .NET Core inject that lets you access the HttpContext anywhere in the call chain. I wrapped this in a service:
https://github.com/jfarrell-examples/TwitchTokenPoc/blob/master/Services/GetTokensFromHttpRequestService.cs

This class very simply yanks the token from the incoming request and returns the specific claim that represents the token. It also calls the decryption method so the fetched token is ready for immediate use.

This is by no means a perfect approach; in fact, were I to see this in production code I would comment that it's a violation of the separation of concerns since a web concern is being accessed in the service layer. More ideally, you would want to use middleware or similar to hydrate a scoped dependency which can be injected into your layers.

The TwitchApiService (https://github.com/jfarrell-examples/TwitchTokenPoc/blob/master/Services/TwitchApiService.cs) houses the logic to request user data from Twitch, which is the call I chose to showcase the refresh functionality.

This code is crucial for the functionality:

client.DefaultRequestHeaders.Add("Client-Id", _configuration["TwitchClientId"]);
client.DefaultRequestHeaders.Authorization =
    new AuthenticationHeaderValue("Bearer", await _getTokensFromHttpRequestService.GetAccessToken());

var result = new ApiResult<TwitchUser>();
var response = await client.GetAsync($"helix/users?login={loginName}");

if (response.StatusCode == HttpStatusCode.Unauthorized)
{
    // refresh tokens
    var (accessToken, refreshToken) = await _authService.RefreshTokens(await _getTokensFromHttpRequestService.GetRefreshToken());
    result.TokensChanged = true;
    result.NewAccessToken = accessToken;
    result.NewRefreshToken = refreshToken;

    // re-execute the request with the new access token
    client.DefaultRequestHeaders.Authorization =
        new AuthenticationHeaderValue("Bearer", accessToken);
    response = await client.GetAsync($"helix/users?login={loginName}");
}

if (response.IsSuccessStatusCode == false)
    throw new Exception($"GetUser request failed with status code {response.StatusCode} and reason: '{response.ReasonPhrase}'");

var responseContent = await response.Content.ReadAsStringAsync();

I wrote this in a very heavy fashion: it simply makes the call, checks if it failed with a 401 Unauthorized and, if so, refreshes the token using the TwitchAuthService and then makes the same call again.

The result is a return to the caller with the appropriate data (or an error if the request still failed).

Part 4: Notify of new Token

Something you may have noticed in the previous code is the use of a generic ApiResult<T>. This is necessary because JWT tokens are designed to be immutable. They cannot be changed once created; it's this aspect which makes them secure. However, in this case, we are creating a token with data that will change (on a refresh), which necessitates regenerating the token.

The purpose of this ApiResult<T> class is to hold NOT JUST the result but also to tell us if the token needs to change. If it does change, that new version must be passed to the client so it can be saved. This may seem like a drawback to the approach but, in actuality, this is a typical part of any application interacting with an OAuth flow where token refresh is being used.

However, what we DO NOT want to do is require logic in every action to check the result, rebuild the token, and pass it to the caller. Instead, we want to intercept the return result and, in a central spot, strip away the extra data and ensure our new token, if appropriate, is in the response headers.

To that end I created the following ActionFilter:

public class ProcessApiResultFilter : IActionFilter
{
    private readonly JwtTokenService _jwtTokenService;

    public ProcessApiResultFilter(JwtTokenService jwtTokenService)
    {
        _jwtTokenService = jwtTokenService;
    }

    public void OnActionExecuting(ActionExecutingContext context)
    {
        // no action
    }

    public void OnActionExecuted(ActionExecutedContext context)
    {
        if ((context.Result as OkObjectResult)?.Value is ApiResult result)
        {
            if (result.TokensChanged)
            {
                // regenerate the JWT and hand it back to the client via a response header
                var newTokenString = _jwtTokenService.CreateJwtTokenString(
                    result.NewAccessToken, result.NewRefreshToken).Result;
                context.HttpContext.Response.Headers.Add("X-NewToken", newTokenString);
            }

            // unwrap the ApiResult so the caller only sees the inner result
            context.Result = new ObjectResult(result.Result);
        }
    }
}

Our ApiResult<T> inherits from ApiResult, which gives it the non-generic read-only Result property used in the code sample above. The ApiResult<T> includes a setter whose accepted type is T. This allows the application to interact with it in a type-safe way.
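
The shape of those classes is roughly the following sketch; the exact implementation in the repo may differ:

public class ApiResult
{
    // token bookkeeping used by ProcessApiResultFilter
    public bool TokensChanged { get; set; }
    public string NewAccessToken { get; set; }
    public string NewRefreshToken { get; set; }

    // the non-generic, read-only view of the payload
    public object Result { get; protected set; }
}

public class ApiResult<T> : ApiResult
{
    // the typed setter keeps callers type-safe while the filter reads the base property
    public new T Result
    {
        get => (T)base.Result;
        set => base.Result = value;
    }
}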

In the filter above you can see the Result being sent to the user is altered so it's the inner result. Meanwhile, if the token changes, we regenerate that token using our JwtTokenService and it's stored in the X-NewToken header in the response. Clients can now check for this header when receiving the response and update their stores as needed.

One final thing, I am using Dependency Injection in the filter. To achieve this you must wrap its usage in the ServiceFilterAttribute. Example here: https://github.com/jfarrell-examples/TwitchTokenPoc/blob/master/Startup.cs#L27

And that is it. Let’s walk through the example again.

Understanding what happens

A given client will make its initial page the response to /Login, which will return the Twitch login screen OR, if a token is already present, the callback will be called instantly. This callback will generate a token and send it down to the caller (right now it's printed to the screen); generally this would be a page in your client app that will store the token and show the initial page.

When the client makes a request, they MUST pass the custom JWT token given to them; the application will be checking for it as an Authorization Bearer token – failure to pass it will result in a 401 Unauthorized being sent back.

The application, after validating the token, will proceed with its usual call to the Twitch API. Part of this will use whatever access token was passed. If Twitch responds with a 401 Unauthorized, the code will extract the refresh token from the JWT token and refresh the access token. Upon successfully doing this, the call to Twitch will be executed again.

The result is sent back to the caller in a wrapper, ApiResult<T> which, along with carrying the call result, also contains information on whether the token changed. The caller will simply return this result as it would any normal Action call.

We use a special ActionFilter to intercept the response and rewrite it so the caller receives the expected result in the response body. If the token did change, the new token is written into the response behind the X-NewToken header.

Throughout the process, we never reveal the tokens and all of the values involved in signing, encryption, and decryption are stored in Azure Key Vault outside of our application. For local dev, we are using an App Registration to govern access to the Key Vault, if we were deployed in Azure we would want to associate our Azure service to a managed identity.

Conclusion

Hopefully, this example has been instructive and helpful. I know I learned quite a bit going through this process. So, if it helps you, drop me a comment and let me know. If something does not make sense feel free to also drop me a comment. Cheers.

Getting Started with KEDA and Queues

One of the limitations inside Kubernetes has been the metrics supported for scaling a deployment within the cluster. The HorizontalPodAutoscaler, or HPA for short, could only monitor CPU utilization to determine if more Pods needed to be added to support a given workload. As you can imagine, in a queue based or event driven system, CPU usage won't accurately tell you whether or not more pods are needed.

Note: The Kubernetes team, realizing this, has added support for custom metrics into the platform: https://github.com/kubernetes/community/blob/master/contributors/design-proposals/instrumentation/custom-metrics-api.md

Noticing this, Microsoft engineers began work on a project to address it, called KEDA (Kubernetes Event-Driven Autoscaling), composed of custom resources which are capable of triggering scaling events based on criteria external to the cluster: queue tail length, message availability, etc. Now at 2.1, the team has added support for MANY popular external products which dictate scaling needs in unique ways.

Here is the complete list: https://keda.sh/docs/2.1/scalers/

For this post, I wanted to walk through how to set up a configuration whereby I could use KEDA to create jobs in Kubernetes based on the tail length of an Azure Storage Queue. As is expected with a newer project, KEDA's documentation still needs work and certain things are not entirely clear. So I view this as an opportunity to supplement the team's work. That being said, this is still very much an alpha product and, as such, I expect future iterations to not work with the steps I lay out here. But as of right now, Feb 2021, they work.

Full source code: https://github.com/jfarrell-examples/keda-queue-test

First step: Create a cluster and an Azure Storage Queue

Head out to the portal and create an AKS cluster (or a Kubernetes cluster in general, it doesn't matter who the provider is) and an Azure Storage account (this one you will need in Azure). Once the storage account is created, create a Queue (shown below) and save the connection string off somewhere you can copy from later.

As indicated, you could use GKE (Google Kubernetes Engine) or something else if you wanted. KEDA also supports other storage and event sources outside of Azure, but I am using Azure Queue Storage for this demo, hence I will assume the Queue Storage is in Azure.

Now, let’s install KEDA

As with anything involving custom resources in Kubernetes, KEDA must be installed for those resources to exist. KEDA has a variety of ways it can be installed, laid out here: https://keda.sh/docs/2.1/deploy/

A quick note on this: BE CAREFUL of the version!! I am using v2.1 for this and that is important since the specification for ScaledJob changes between 2.0 and 2.1. If you read through the third approach to deployment, where you run kubectl apply against a remote file, be sure to change the version of the file to v2.1.0. I noted that with Helm, at least, I did NOT get v2.1 from the given charts repo.

If you run the third approach, creation of the keda namespace will happen for you; this is where the internals of KEDA will be installed and run from. Your code does NOT need to go in here and I won't be doing that, just to put you at ease.

Once the installation completes I recommend running the following command to make sure everything is up and running:

kubectl get all -n keda

Note that I used the shorthand -n because I have had it happen where --namespace doesn't copy correctly and you end up with command syntax errors. If you see something like this, KEDA is up and running:

Let’s setup the KEDA Scaler

For starters, we need a Secret to hold the connection string for our Queue Storage from earlier. Below is a simple Secret definition that KEDA can use to monitor the queue tail length. REMEMBER: when you provide the value to the secret it MUST be base64 encoded. I won't show my value as I do not wish to dox myself.
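
A minimal definition along these lines should do; the secret name and key below are ones I am choosing for illustration and need to match what the later resources reference:

apiVersion: v1
kind: Secret
metadata:
  name: queue-connection-secret
  namespace: default
type: Opaque
data:
  # base64 encoded Azure Storage connection string
  connectionString: <base64-encoded-connection-string>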

Linux users can use the built-in base64 command to generate the value for the secret file. Everyone else can quickly Google a Base64 encoder and convert their string.

echo -n "your connection string" | base64

Use kubectl apply -f to create the secret. Since the namespace is provided in the file, it will be placed in that namespace for you.

Next, we are going to get into KEDA specific components TriggerAuthentication and ScaledJob. These two resources will be critical to supporting our intended functionality.

First, there is the specification for TriggerAuthentication: https://keda.sh/docs/2.1/concepts/authentication/#re-use-credentials-and-delegate-auth-with-triggerauthentication

As you can see, there are a number of ways to provide authentication; we will be using secretTargetRef. The purpose is to give our trigger a way to authenticate to our Queue Storage so that it can determine the various property values it needs to find out if a scaling action needs to be taken (up or down).

Building on what we did with the creation of our Secret, we add a definition like the following and apply it via kubectl apply -f:
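
A sketch of that TriggerAuthentication, reusing the illustrative names from the Secret above:

apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: azure-queue-auth
  namespace: default
spec:
  secretTargetRef:
    # hand the scaler the connection string stored in our Secret
    - parameter: connection
      name: queue-connection-secret
      key: connectionString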

Comparing the Secret with this file you can see where things start to match up. We are simply telling the trigger it can find the connection string at a specific key in a certain secret. Many of the examples on the KEDA website use podIdentity which, as I have come to understand it, refers back to MSI (Managed Service Identity). That is a better approach, albeit more complicated, than what I am showing here; we should always avoid storing sensitive information like connection strings in our cluster given the less than stellar security around Secrets in general – base64 is not in any way secure.

The final piece is the creation of the ScaledJob. KEDA mostly focuses on scaling Deployments, which makes a lot of sense, but it can also scale up Kubernetes Jobs as needed to fulfill deferred processing. Effectively, KEDA creates a pseudo deployment around the job and scales the number of job instances up as needed based on the scaling strategy specified.
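To give a sense of the shape of this resource, here is a sketch along the lines of what the repo contains (the resource names and the connection string environment variable name are placeholders of mine; the real definition is in the linked source):

apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: print-message-scaled-job
  namespace: default
spec:
  pollingInterval: 5          # check the queue length every 5 seconds
  jobTargetRef:
    completions: 10           # the 10-completion rule referenced below
    parallelism: 1            # no parallelism within the job
    backoffLimit: 1           # covered later in this post
    template:
      spec:
        restartPolicy: Never
        containers:
          - name: print-message
            image: xximjasonxx/printmessage:v3
            env:
              # placeholder variable name; the container reads the queue connection string from the Secret
              - name: QUEUE_CONNECTION_STRING
                valueFrom:
                  secretKeyRef:
                    name: queue-connection-secret
                    key: connectionString
  triggers:
    - type: azure-queue
      metadata:
        queueName: test-queue
        queueLength: "5"      # assumed target queue length per job
      authenticationRef:
        name: azure-queue-auth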

This looks like quite a bit but, when you break it down, it has a very straightforward purpose and a structure that is consistent with other Kubernetes objects. Let’s break it down into four parts:

The first part is identification: what we are naming the ScaledJob and where it is going to be stored within the cluster. Notice the apiVersion value keda.sh/v1alpha1; this is a clear indication of the spec being in ALPHA, meaning I fully expect it to change.

The second part is the details for the actual ScaledJob, that is, things specific to this instance of the resource. Here we tell the resource to check the length of our queue every 5 seconds and to trigger based on an azure-queue, using the authentication stored in the TriggerAuthentication we defined previously.

The third and fourth parts both relate to the same thing: the configuration of the Kubernetes Job instances that will perform the work – I broke this apart based on my own personal style when constructing YAML files for Kubernetes. To keep things simple we are not going to have the job leverage parallelism, so we leave it at 1, which is also the default.

The last section lays out the template for the Pods that will carry out the work. Notice the custom image xximjasonxx/printmessage, which grabs a message from the queue and prints its contents. We are also reusing the Secret here to provide the container with the connection string of the Queue so it can take items off.

All of this is available for reference in the GitHub repo I linked above.

Let’s test it

In the provided source code, I included a command line program, SendMessage, that sends messages to our queue in the form of random numbers. To run it, open a command line window in the directory holding the .csproj file and run the following command:

dotnet run "<connection string>" 150

The above command will send 150 messages to the queue – I should note that the queue name in the container is HARD CODED as test-queue. Feel free to download the code and make the appropriate change for your own queue name if need be; you will need to do it for both the Print and Send message programs.

After running the above command you can check on the resulting jobs and pods with kubectl to see the outcome of your experiment. It should look something like this:

This shows that it is working and, in fact, we can run kubectl logs on one of the pods and see the output of the message sent to the queue. Or so it appears; let’s take a closer look.

Execute the following command to COUNT how many pods were actually created:

kubectl get po | wc -l

Remember to subtract one, as the wc program will also count the header line. If you get something similar to what I got, it will be around 300. But that does not make any sense; we only sent 150 items to our queue. The answer lies in the way printmessage:v3 is written: it contains logic to print that no data was found as the queue becomes empty. While valid, with the 10-completion rule being enforced this spins up unnecessary pods. Let’s change the image used for the job to a special image, printmessage:v3-error, which throws an uncaught exception when the queue is empty. The only change needed in the ScaledJob definition is the image it references:
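Here is a sketch of the relevant jobTargetRef portion (everything else stays as before):

  jobTargetRef:
    completions: 10
    parallelism: 1
    backoffLimit: 1
    template:
      spec:
        restartPolicy: Never
        containers:
          - name: print-message
            # the only change: this image throws an uncaught exception when the queue is empty
            image: xximjasonxx/printmessage:v3-error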

Before running things again I recommend executing these two commands; they assume the ONLY things in the current namespace are jobs and pods related to KEDA. If you are sharing the namespace with other resources you will have to modify these commands.

kubectl delete po --all

kubectl delete job --all

Make sure to run kubectl apply -f to get the updated ScaledJob definition into your cluster, then run the SendMessage program again. This is what I got:

Notice how, even though we specified the job needs to complete 10 times, none of them did. Your results will likely vary depending on when items were pulled from the queue, but as the queue gets shorter more jobs will start to fail as the Pods attempt to grab data that does not exist.

The other thing to notice is that the Pods, if they fail, will self-terminate. So, if I run my wc -l check again on the Pods I get a number that makes more sense:

kubectl get po | wc -l

The result should be 151 which, subtracting the header row, gives us the 150 items we sent to the queue.

Why is this happening?

The key value for controlling this behavior is the backoffLimit specified as part of the job spec. It tells a Job how many times it should retry failing pods under its control before the Job itself is marked failed. I have set it to 1, so a failing job gives up almost immediately instead of retrying over and over.

The reason this is so important: controlling resources that scale to match processing workloads is crucial to maintaining healthy resource consumption. We do not want our pods to go crazy, overwhelm the system, and starve other processes.

Storage Class with Microsoft Azure

One of the things I have been focusing on lately is Kubernetes. It has always been an interest of mine, but since I recently decided to pursue the Certified Kubernetes Application Developer (CKAD) certification, diving into topics I was not totally familiar with has been a great deal of fun.

One topic of particular interest is storage. In Kubernetes, and really in containerized applications generally, state storage is an important topic since the entire design of these systems is aimed at being transient in nature. With this in mind, it is paramount that storage happen in a centralized and highly available way.

A common approach is to simply leverage the raw cloud APIs for things like Azure Storage, S3, etc., as the providers will do a better job of ensuring the data is stored securely and in a way that makes data loss unlikely. However, Kubernetes also enables mounting these cloud systems directly into Pods through Persistent Volumes and Storage Classes. In this post, I want to show how to use a Storage Class with Azure, so I won’t be going into detail about the ins and outs of Storage Classes or their use cases over Persistent Volumes; frankly, I don’t understand that super well myself, yet.

Creating the Storage Class

The advantage of a Storage Class (SC) over something like a Persistent Volume (PV) is that the former can automatically create the latter. That is, a Storage Class can receive Claims for volumes and will, under the hood, create PVs. This is why SCs have become very popular with developers: less maintenance.

Here is a sample Storage Class I created for this demo:

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: file-storage
provisioner: kubernetes.io/azure-file
parameters:
  skuName: Standard_LRS

This step is actually optional – I only did it for practice. AKS will automatically create 4 default storage classes (they do nothing by themselves without a Persistent Volume Claim (PVC)). You can see them by running the following command:

kubectl get storageclass

Use kubectl create -f to create the storage class based on the above, or use one of the built-in ones. Remember, by itself, the storage class won’t do anything; we need to create a volume claim for the magic to actually start.

Create the Persistent Volume Claim

A Persistent Volume Claim (PVC) is used to “claim” a storage mechanism. The PVC can be, depending on its access mode, attached to multiple nodes where its pods reside. Here is a sample PVC that I made to go with the SC above:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: fileupload-pvc
spec:
  storageClassName: file-storage
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 5Gi

The way PVCs work (simplistically) is they seek out a Persistent Volume (PV) that can support the claim request (see the access mode and resource request). If nothing is found, the claim is not fulfilled. However, when used with a Storage Class, fulfillment is based on what the Storage Class provisioner field specifies.

One of the barriers I ran into, for example, was that my original provisioner (azure-disk) does NOT support multi-node access (that is, it does not support the ReadWriteMany mode used above). This means the storage medium is ONLY ever attached to a single node, which limits where pods using the PVC can be scheduled.

To alleviate this, I opted to use, as you can see, the azure-file provisioner, which allows multi node mounting. A good resource for reading more about this is here: Concepts – Storage in Azure Kubernetes Services (AKS) – Azure Kubernetes Service | Microsoft Docs

Run a kubectl create -f to create this PVC in your cluster, then run kubectl get pvc – if all is working, your new PVC should have a status of Bound.

Let’s dig a bit deeper into this – run a kubectl describe pvc <pvc name>. If you look at the details there is a value with the name Volume. This is actually the name of the PV that the Storage Class carved out based on the PVC request.

Run kubectl describe pv <pv name>. This gives you some juicy details and you can find the share in Azure now under a common Storage Account that Kubernetes has created for you (look under Source).

This is important to understand: the claim creates the actual storage and Pods just use the claim. Speaking of Pods, let’s now deploy an application that uses this volume to store data.

Using a Volume with a Deployment

Right now, AKS has created a storage account for us based on the request from the given PVC that we created. To use this, we have to tell each Pod about this volume.

I have created the following application as the Docker image xximjasonxx/fileupload:2.1. It’s a basic C# Web API with a single endpoint that supports a file upload. Here is the deployment associated with it:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: fileupload-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: fileupload
  template:
    metadata:
      name: fileupload-app
      labels:
        app: fileupload
    spec:
      containers:
        - name: fileupload
          image: xximjasonxx/fileupload:2.1
          ports:
            - containerPort: 80
          env:
            - name: SAVE_PATH
              value: "/app/output"
          volumeMounts:
            - mountPath: /app/output
              name: save-path
      volumes:
        - name: save-path
          persistentVolumeClaim:
            claimName: fileupload-pvc

The key pieces of this are the env and volumeMounts specifications. The web app saves to a hard coded path unless it is overridden by the environment variable SAVE_PATH. In this spec, we specify a custom path within the container via that environment variable and then mount the directory externally using the volume created by our PVC.

Run a kubectl create -f on this deployment spec and you will have the web app running in your cluster. To enable external access, create a LoadBalancer Service (or an Ingress); here is an example:

apiVersion: v1
kind: Service
metadata:
  name: fileupload-service-lb
spec:
  selector:
    app: fileupload
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80
  type: LoadBalancer

Run kubectl create -f on this spec file and then run kubectl get svc until you see an External IP for the service, indicating it can be addressed from outside the cluster.

I used Postman to test the endpoint, POSTing an image to the upload API:
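A hypothetical curl equivalent, assuming for illustration that the API exposes an upload endpoint at /api/upload accepting a multipart form field named file, would be:

curl -F "file=@test.jpg" http://<external ip>/api/upload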

If all goes well, the response should be a Guid, which is the name under which the image was stored in our volume.

To see it, simply navigate to the Storage Account from before and select the newly created share under the Files service. If you see the file, congrats, you just used a PVC through a Storage Class to create a place to store data.

What about Blob Storage?

Unfortunately, as near as I can tell so far, there is no support for saving these items to object storage, only file storage. To use the former, at least with Azure, you would still need to use the REST APIs.

This also means you won’t get notifications when new files are created in the file share, as you would with blob storage. Still, it’s useful and a good way to ensure that the data provided is securely and properly preserved.

Using Scoped Dependencies

I was recently asked by a client how I would go about injecting user information into a service that could be accessed anywhere in the call chain. They did not want to have to capture the value at the web layer and pass it down what could be a rather lengthy call stack.

The solution to this is to leverage scoped dependencies in ASP .NET Core, which will hold an object for the duration of the request (by default). In doing this, we can gather information related to the request and expose it. I also wanted to add an additional twist: I wanted two interfaces for the same object, one that enables writing and the other that enables reading (the code for these appears below).

The reason for doing this is to be deterministic. What I don’t want to support is the ability for common code to accidentally “change” values, for whatever reason. When the dependency is injected, I want the value to be read only. But, to get the value in there I need to be able to write it, so I segregate the operations into different interfaces.

This may be overkill for your solution but, I want the code to be as obvious as possible in its intent and capabilities – this helps instruct users of the code in how it should be used.

Configuring Injection

Our ContextService, as described above, contains only a single property: Username. For this exercise, we will pull the value for this out of the incoming query string (overly simplistic, I grant you, but it works well enough to show how I am using this).

I am going to define two interfaces which this class implements: IContextReaderService and IContextWriterService, code below:

public class ContextService : IContextReaderService, IContextWriterService
{
    public string Username { get; set; }
}

public interface IContextReaderService
{
    string Username { get; }
}

public interface IContextWriterService
{
    string Username { set; }
}

The tricky part now is that we want the instance of ContextService created with, and scoped to, the incoming request to be shared between IContextReaderService and IContextWriterService. That is, I want the same instance to come back when I inject a dependency marked with either of these interfaces.

In Startup.cs I need to do the following to achieve this:

services.AddScoped<ContextServiceFactory>();
services.AddScoped<IContextReaderService>(p => p.GetService<ContextServiceFactory>().GetCurrentContext());
services.AddScoped<IContextWriterService>(p => p.GetService<ContextServiceFactory>().GetCurrentContext());

The secret here is the request scoped ContextServiceFactory, used in the factory delegate given to AddScoped, which lets us tell .NET Core how to resolve the dependency. This factory is defined very simply:

public class ContextServiceFactory
{
    private ContextService _currentContext;

    public ContextService GetCurrentContext()
    {
        if (_currentContext == null)
            _currentContext = new ContextService();

        return _currentContext;
    }
}

Remember, by default, something added as a scoped dependency is shared throughout the lifetime of the request. So here, we maintain state within the factory to know whether it has already created an instance of ContextService; if it has, we return that one. This factory object is destroyed when the request completes and recreated when a new request is processed.

Hydrating the Context

Now that we have our context split off, we need to hydrate its values, so we need to inject our IContextWriterService dependency into a piece of code that gets hit on each request. You might be tempted to use a global filter, which would work, but the better approach here is custom middleware. Here is what I used:

// middleware.cs file
public class HydrateContextMiddleware
{
    private readonly RequestDelegate _next;

    public HydrateContextMiddleware(RequestDelegate next)
    {
        _next = next;
    }

    public async Task Invoke(HttpContext context, IContextWriterService contextService)
    {
        // pull the username from the "name" query string parameter for this example
        contextService.Username = context.Request.Query["name"];

        await _next(context);
    }
}

// startup.cs Configure method
app.UseMiddleware<HydrateContextMiddleware>();

Because middleware is constructed only once, at application startup, you can only use constructor injection for singleton scoped dependencies; if you attempt to inject a Scoped or Transient dependency through the middleware constructor, it will fail at runtime.

Fear not, we can use method injection here: the dependency is added as a parameter to the Invoke method, which is what ASP .NET Core looks for and executes with each request. Here you can see we have defined a parameter of type IContextWriterService.

Within Invoke, perform the steps you wish to take (here we extract the username from the name parameter in the query string, for this example). Once you complete your steps, be sure to call the next bit of middleware in the sequence (or return a completed Task to stop the chain).

Using the Reader dependency

Now that we have configured the dependency and hydrated it using middleware, we can now reference the IContextReaderService to read the value out. This works in the standard way you would expect:

[Route("api/user")]
[ApiController]
public class UserController : ControllerBase
{
    private readonly IContextReaderService _contextService;

    public UserController(IContextReaderService contextService)
    {
        _contextService = contextService;
    }

    [HttpGet]
    public IActionResult GetUser()
    {
        return Ok(_contextService.Username);
    }
}

We can inject this dependency wherever we need it (or, more specifically, wherever we can access the IContextReaderService).

Mutability vs Immutability

The main goal I am trying to illustrate here is leveraging immutability to prevent side effects in code. Because of the interface segregation, a user is unable to change the given value of the context. This is desirable since it lends itself to better code.

In general, we want to achieve immutability with objects in our code; this is a core lesson from functional programming. By doing this, operations become deterministic and less prone to sporadic and unexplainable failures. While the example presented above is simplistic in nature, in more complex systems, having assurances that users can only read or write depending on which interface is used allows for better segregation and can yield cleaner and more discernible code.

Hope you enjoyed. More Testing posts to come, I promise.

Test Series: Part 2 Unit Testing

Part 1 is here – where I intro Testing Strategies.

Unit testing is the single most important test suite within ANY application. It is the first line of defense guarding against defects and is paramount to instilling confidence in developers that their changes do not break existing logic. That being the case, unit tests are (or should be) the most numerous type of test authored for a system. High performing teams run them often as a verification step and keep their runs as fast as possible to save time. By doing so and building confidence, they are able to achieve ever higher levels of efficiency and quality.

What do we Unit Test?

This is perhaps the single most important and common question you will get from teams or discuss within your own team. Making the right decision here is critical to the long term success of the project and to preventing quality and performance issues from negatively impacting your teams.

As a fundamental rule, we do not unit test external dependencies: database calls, network calls, or any logic that involves an external system. Our unit test runs need to be idempotent so that we can run them as much as we like without having to worry about disk space, data pollution, or other external factors.

Second, the focus must be on a unit of code. In this regard, our tests do not test multi-step processes; they test a single path through a unit of code. The need for a unit test to be complex is often an indicator of a code smell: either the logic is overly complicated and needs refactoring, or the test itself is wrong and should be broken down or covered by a different form of testing, such as integration tests.

Finally, we should test known conditions for external dependencies through the use of mocking. By using a mocking library we can ensure that code remains resilient and that our known error cases are handled. Further, using a mocking library often forces us to use design by contract which can improve the readability of our code.

Making the wrong choice – a story from the past

I worked with a team in a past life that made the wrong choice when it came to their testing. As part of an effort to improve quality, the client (astutely) asked the team to ensure testing was being done against database and networking calls. Leaders on the team, due to poor knowledge of testing or poor decision making, opted to work these tests into the unit test suite. Over the course of the project, this caused the test run time to increase to more than 40 minutes.

One of the critical elements of high functioning teams is the notion of fast feedback. We want to ensure developers are given immediate feedback when something breaks. Unit tests are a core part of achieving this and their speed is paramount to the team's effectiveness. What happens when you allow test times to balloon as mentioned? Disaster.

When the turnaround time is that long, developers will seek ways to avoid incurring the cost (there is still pressure to get work done). Generally this involves not writing tests (so the run does not get even longer), running them minimally (get the work done and test at the end), or turning them off. None of these options improves efficiency and, in fact, they make an already bad problem that much worse.

In this case, the team adopted a branching model that called for entire features to be developed in a “feature” branch before merging. With any development environment we always want to minimize “drift”, that is, differences between master and any branches. The less drift, the fewer merge conflicts and the quicker problems are discovered.

By not understanding this principle, the team unknowingly compounded their problem. In some cases these “features” would be in flight for 10+ days, creating enormous amounts of drift. And, as the team was looking to avoid running the tests too often, the changes were not being checked regularly by the tests. As you can imagine, issues were consistently found near the end of sprints, as code was merged, and due to the size of the incoming changes debugging became a massive task.

This created more problems for the beleaguered teams as they were forced to spend time after hours routinely debugging and trying to finish features before the end of the sprint. Burnout was rampant and the team members became jaded with one another and the company itself – they endured this for 10+ months. While the project ultimately did complete, the client relationship was ruined and several good developers left the company.

To be clear, the bad choices around testing alone were not the single cause of this failure; there were numerous other problems. However, I have found that even a difficult client can be assuaged if code quality is maintained and the team delivers. I can recall a team that I led where we had unit testing and continuous delivery processes in place such that, even though we had delays and bugs, these processes enabled us to respond quickly – the client remained delighted and kept working with us.

The lesson here is, no matter what, we MUST ensure the development team has the tools needed to support automation processes. These processes form the core of the ability to deliver and lend themselves to building healthy and sustainable client relationships.

How do I write a Unit Test?

So, now that you have an understanding of what can be unit tested, let’s talk about how to write the tests. First, I wish to introduce you to the AAA pattern: Arrange, Act, Assert. This pattern is crucial as you write your tests to check yourself against the warning signs of bad unit tests.

  • Arrange: In this step we “arrange” the unit, that is, we do all of the things needed to prepare for executing our unit. Be wary at this level if the steps to arrange feel too cumbersome; it likely indicates that your design needs refactoring
  • Act: In this step we “invoke” the unit. This executes the code we are specifically testing. Be wary at this level if more than two executions are necessary; that means you are NOT testing a unit and your design needs to be re-evaluated. Remember, we do not test multi-part flows with unit tests.
  • Assert: In this step we check the outcome of our unit. It is important to assert on only the minimum amount of information needed to verify the unit. I have seen teams assert on 20+ properties of an object; this is excessive. Think carefully about what indicates a failure. My rule of thumb is never more than three asserts; if you need more, create another test.

Here is an example of a simple math problem under unit test:

[Fact]
public void assert_adding_two_numbers_gives_their_sum()
{
    // arrange
    var numberOne = 10;
    var numberTwo = 20;

    // act
    var result = numberOne + numberTwo;

    // assert
    Assert.Equal(30, result);
}

As you can see, in this example we define our two variables (numberOne and numberTwo) in the arrange section, we then invoke our add operation in the act and finally we assert that the value meets with our expectations.

The [Fact] attribute is part of the xUnit testing library. xUnit is a popular open source testing framework commonly used with .NET Core, though there are other libraries available. The use of a library for unit testing makes great sense and will greatly aid your productivity. Below are a few of the common ones in the .NET ecosystem:

  • nUnit (https://nunit.org/) – the grand-daddy of them all. Based on JUnit from Java and one of the first unit testing frameworks devised for .NET
  • MSTest – Microsoft’s testing framework. It offers much the same functionality as nUnit and ships with Visual Studio
  • xUnit – as mentioned above, similar to nUnit in functionality and aimed at supporting testing in an OS-agnostic programming world. This is my default

The next common problem is organization. When you start talking about an application that has thousands, if not tens of thousands (or more), of tests, it becomes very apparent that a clear and consistent strategy must be adopted. Over the course of my career I have seen many different approaches, but the one I favor is the given/assert naming convention, mainly because it plays very well with most test reporters. Here is an example.

Imagine we have defined the following Web API Controller:

[ApiController]
[Route("calculate")]
public class CalculationController : Controller
{
    [HttpPost("add")]
    public IActionResult Add([FromBody]TwoNumberViewModel viewModel)
    {
        return Ok(viewModel.FirstNumber + viewModel.SecondNumber);
    }
}

In this case we might define our test fixture (that is the class that contains our test) as such:

public class given_an_instance_of_calculation_controller
{
}

Notice the name of the class here; while it violates traditional C# naming conventions, when you run the test runner it will precede your method name. Therefore, if we expand this to include a test like so:

public class given_an_instance_of_calculation_controller
{
    [Fact]
    public void assert_that_given_two_numbers_the_result_returned_is_the_correct_sum()
    {
        // arrange
        var controller = new CalculationController();
        var viewModel = new TwoNumberViewModel
        {
            FirstNumber = 10,
            SecondNumber = 20
        };

        // act
        var result = controller.Add(viewModel) as OkObjectResult;

        // assert
        Assert.NotNull(result);
        Assert.Equal("30", result.Value.ToString());
    }
}

The above example is a product of oversimplification and is ONLY for demonstration purposes. When unit testing controllers, the emphasis should be on the result types returned, NOT on values. Testing the outcome of operations should be done with unit tests against services; the above represents code that violates the separation of concerns principle.

With this in place, if we run a test runner and view the results in the reporter we will see the following:

given_an_instance_of_calculation_controller.assert_that_given_two_numbers_the_result_returned_is_the_correct_sum

As you can see, the advantage of this strategy is that it lines up nicely and produces a readable English sentence detailing what the test is doing. There are other strategies but, as I said, this is my go-to in most cases due to the readability and scalable nature of this naming method.

Further, it bakes in a necessary check to ensure unit tests are not checking too much. As a rule, the assert portion should never contain the word “and”, as that implies more than one thing is being checked, which violates the unit principle.

How do I test external dependencies?

The short answer is, you don’t; you generally write integration tests (the next part in this series) to cover those interactions. However, given the speed and criticality of the logic checked by unit tests, we want to maximize their reach as best we can.

A classic example of this case is Entity Framework. If you have worked with Entity Framework you will be familiar with the DbContext base class, which denotes the context that handles querying our underlying database. As you might expect, our unit tests should NEVER invoke this context directly, not even the InMemory version, but we do need to ensure our logic built on top of the context works properly. How can we achieve this?

The short answer is: we can define an interface which exposes the necessary methods and properties of our context and have our classes take a dependency on this interface rather than on the concrete context class itself. In doing so, we can use mocking libraries to mock the context, allowing testing against these lower level classes.
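As a minimal sketch of the idea (the IOrderContext, Order, and OrderService names here are hypothetical, not from a real project), the business logic depends only on the interface, and the test mocks that interface with Moq so no database is ever touched:

public interface IOrderContext
{
    IQueryable<Order> Orders { get; }
}

public class Order
{
    public decimal Total { get; set; }
}

// the real context implements the interface; production code still uses EF underneath
public class OrderContext : DbContext, IOrderContext
{
    public DbSet<Order> Orders { get; set; }

    IQueryable<Order> IOrderContext.Orders => Orders;
}

// business logic depends on the interface, not on DbContext
public class OrderService
{
    private readonly IOrderContext _context;

    public OrderService(IOrderContext context)
    {
        _context = context;
    }

    public int CountOrdersOver(decimal amount)
    {
        return _context.Orders.Count(o => o.Total > amount);
    }
}

// the unit test mocks the interface, so the query runs against an in-memory list
public class given_an_instance_of_order_service
{
    [Fact]
    public void assert_that_only_orders_over_the_threshold_are_counted()
    {
        // arrange
        var contextMock = new Mock<IOrderContext>();
        contextMock.Setup(x => x.Orders).Returns(new List<Order>
        {
            new Order { Total = 50m },
            new Order { Total = 150m }
        }.AsQueryable());
        var service = new OrderService(contextMock.Object);

        // act
        var result = service.CountOrdersOver(100m);

        // assert
        Assert.Equal(1, result);
    }
}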

The long answer is, honestly, an entire blog post (Learning Tree has a good write up that uses NSubstitute here) that I will try to add on later.

But this strategy of using interfaces also allows us to take dependencies on static components. In older versions of ASP .NET it was common for applications to use the HttpContext.Current property to reference the incoming request context. But, because this property was static, it could not be unit tested directly (it would always be null unless running in the web context).

Using the interface approach, we commonly saw things like this:

public class ContextAccessor : IContextAccessor
{
    public IDictionary<string, string> QueryString
    {
        // assume .AsDictionary() is an extension method that converts the QueryString collection to a Dictionary
        get { return HttpContext.Current.Request.QueryString.AsDictionary(); }
    }
}

public interface IContextAccessor
{
    IDictionary<string, string> QueryString { get; }
}

[ApiController]
[Route("api/test")]
public class TestController : ControllerBase
{
    private readonly IContextAccessor _contextAccessor;

    public TestController(IContextAccessor contextAccessor)
    {
        _contextAccessor = contextAccessor;
    }

    public IActionResult Get()
    {
        return Ok(_contextAccessor.QueryString["name"]);
    }
}

Using this approach, the controller, which will have unit tests, depends on the injected IContextAccessor interface instead of HttpContext. This fact is crucial as it allows us to write code like this:

public class given_an_instance_of_test_controller
{
    [Fact]
    public void assert_that_the_name_query_string_parameter_is_returned_in_the_result()
    {
        // arrange
        var contextMock = new Mock<IContextAccessor>();
        contextMock.Setup(x => x.QueryString).Returns(new Dictionary<string, string> { { "name", "TestUser" } });
        var controller = new TestController(contextMock.Object);

        // act
        var result = controller.Get() as OkObjectResult;

        // assert
        Assert.Equal("TestUser", result.Value.ToString());
    }
}

This is the result. The code validates that our logic is correct but it does NOT validate that HttpContext gets built properly at runtime; that is not our responsibility, it is the responsibility of the framework author (Microsoft in this case).

This brings up a very clear and important point when writing tests: some tests are NOT yours to write. It is not on your team to validate that, for example, Entity Framework works properly, or that a request through HttpClient works – these components are already (hopefully) being tested by their authors. Attempting to go down this road will not lead you anywhere where the tests drive value.

A final point

The final point on testing I would like to make, and this is especially true with .NET, is that tests should ALWAYS be synchronous and deterministic. Parallel code needs to be broken down into its discrete pieces and those pieces need to be tested. Trying to unit test parallel code risks introducing “flakiness” into tests: tests that pass sometimes and fail other times.

.NET developers commonly use the async/await syntax in their code. It’s very useful and helpful; however, when running unit tests it needs to be forced down a synchronous path.

We do not test external dependencies, so the use of async/await should not be needed for ANY test; our dependencies should be mocked and thus return instantaneously.

Doing this is quite easy: we can call the GetAwaiter and GetResult methods, which force the resolution of the returned Task. Here is an example:

public interface IDataService
{
    Task<List<DataModel>> GetData();
}

[ApiController]
[Route("api/Test")]
public class TestController : ControllerBase
{
    private readonly IDataService _dataService;

    public TestController(IDataService dataService)
    {
        _dataService = dataService;
    }

    public async Task<IActionResult> Get()
    {
        return Ok(await _dataService.GetData());
    }
}

public class given_an_instance_of_test_controller
{
    [Fact]
    public void assert_that_data_is_returned_from_get_call()
    {
        // arrange
        var dataServiceMock = new Mock<IDataService>();
        dataServiceMock.Setup(x => x.GetData()).ReturnsAsync(new List<DataModel> { new DataModel() });
        var controller = new TestController(dataServiceMock.Object);

        // act
        var result = controller.Get().GetAwaiter().GetResult() as OkObjectResult;

        // assert
        var resultValue = result.Value as List<DataModel>;
        Assert.NotNull(resultValue);
        Assert.True(resultValue.Any());
    }
}

By calling GetAwaiter() and GetResult() we force the call to complete synchronously. This is important since, in some cases, the asserts may otherwise run BEFORE the actual call completes, resulting in increased test flakiness.

The most important thing is not just to test but also to be fast

Hopefully this post has shown you some of the ways you can cover things like database logic, async calls, and other complex scenarios with unit tests. This is important: due to their speed, it makes sense to use unit tests to validate wherever possible.

One of the uses that I did not show here is “call spying”, where the mocking framework can “track” how many times a method is called, which can serve as another way to assert.
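Building on the previous example, here is a quick sketch of what that looks like with Moq's Verify (reusing the same IDataService and TestController types from above):

// arrange
var dataServiceMock = new Mock<IDataService>();
dataServiceMock.Setup(x => x.GetData()).ReturnsAsync(new List<DataModel>());
var controller = new TestController(dataServiceMock.Object);

// act
controller.Get().GetAwaiter().GetResult();

// assert - the "spy": verify the dependency was invoked exactly once
dataServiceMock.Verify(x => x.GetData(), Times.Once);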

But the most important thing I hope to impress is the need to not only ensure unit tests are built with the application, but also that you continually watch to ensure they remain fast enough to be effective for your developers to perform validation on a consistent, ongoing basis.

The next topic which I intend to cover will focus on Integration Tests, primarily via API testing through Postman.

Test Series: Part 1 – Understanding Testing Strategies

One of the challenges with incorporating DevOps culture in teams is understanding that greater speed yields better quality. This is often foreign to teams, because conventional logic dictates that “slow and steady wins the race”. Yet, in every State of DevOps report (https://puppet.com/resources/report/state-of-devops-report/) since the report began, Puppet (https://puppet.com/) has consistently found that teams which move faster see higher quality than those that move slower – and the margin is not close, with the gap continuing to widen. Why is this? I shall explain.

The First Way: Enable Fast and Increasing Flow

DevOps principles (and Agile) were born out of Lean Management, which is based on the Toyota Production System (https://en.wikipedia.org/wiki/Toyota_Production_System). Through this experience we identify The Three Ways, and the first of these specifically aims for teams to operate on increasingly smaller workloads. With this focus, we can enable more rapid QA and faster rollback, as it is far easier to diagnose a problem in one thing than in 10 things. Randy Shoup of Google observed:

“There is a non-linear relationship between the size of the change and the potential risk of integrating that change—when you go from a ten-line code change to a one-hundred-line code change, the risk of something going wrong is more than 10x higher, and so forth”

What this means is, the more changes we make, the more difficult it is to diagnose and identify problems. And this relationship is non-linear, meaning the difficulty goes up exponentially as the size of our changes increases.

In more practical terms, it argues against concepts such as “release windows” and aims for a more continuous deployment model whereby smaller changes are constantly deployed and evaluated. The value here is, by operating on these smaller pieces we can more easily diagnose a problem and rollbacks become less of an “event”. Put more overtly, the aim is to make deployments “normal” instead of large events.

This notion is very hard for many organizations to accept and it often runs counter to how many IT departments operate. Many of these departments have long had problems with software quality and have devised release and operations plans to, they believe, minimize the risk of these quality issues. However, from the State of DevOps reports, this thinking is not backed up by evidence and tends to create larger problems. Successful high functioning teams are deploying constantly and moving fast. Speed is the key.

The secret to this speed with quality is the confidence created through a safety net. A thorough safety net can even create enough confidence to let the newest person on the team deploy to Production on Day 1 (this is the case at Etsy).

Creating the Safety Net

In the simplest terms, the safety net is the amalgamation of ALL of your tests and scans running automatically with each commit. The trust and faith in these tests to catch problems before they reach production allows developers to move faster with confidence. Because the net is automated, it does not rely on a sole person (or group) and can scale with the team.

Ensuring the testing suite is effective is a product of having a solid understanding of the breakdown of testing types and adopting the “Shift Left” mindset. For an illustration of this breakdown, we can reference the tried and true “Testing Pyramid”:

As illustrated, unit tests comprise the vast majority of tests in the system. The speed of these tests is something that should be closely monitored as they are run the most often. Tips for ensuring speed:

  • Do NOT invoke external dependencies (database, network calls, disk, etc)
  • Focus on a UNIT, use Mocking libraries to fulfill dependencies
  • Adhere to the AAA model (Arrange, Act, Assert) and carefully examine tests for high complexity

Unit tests must be run frequently to be effective. In general, a minimum of three runs should occur with any change: a local run, a run as part of PR validation, and a run when the code is merged to master. Speed is crucial to reduce, as much as possible, the amount of time developers have to wait for these tests.
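As an illustration of wiring up the second and third of those runs, here is a minimal sketch using GitHub Actions (the specific CI tool is my assumption; the same idea applies to Azure Pipelines or any other system):

name: unit-tests
on:
  pull_request:             # run as part of PR validation
  push:
    branches: [ master ]    # run again when code is merged to master
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Run the unit test suite
        run: dotnet test --configuration Release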

At the next level we start considering “integration tests”. These are tests which require a running instance of the application and thus need to follow a deploy action. Their inclusion of external dependencies makes them take longer to run, hence we decrease the frequency. There are two principal strategies I commonly see for executing these tests:

  1. Use of an Ephemeral “Integration” environment – in this strategy, we use Infrastructure as code to create a wholly new environment to run our Integration tests in – this has several advantages and disadvantages
    • Benefit – avoids “data pollution”. Data pollution occurs when data created as part of these tests can interfere with future test runs. A new environment guarantees a fresh starting point each time
    • Benefit – tests your IaC scripts more frequently. Part of the benefit in modern development is the ability to fully represent environments using technologies like Terraform, ARM, and others. These scripts, like the code itself, need exercising to ensure they continue to meet our needs.
    • Negative – creating ephemeral environments can elongate the cycle time for our release process. This may give us clues when one “thing” is more complex than it should be
  2. Execute against an existing environment. Most commonly, I recommend this to be the Development environment as it allows the testing to serve as a “gate” to enable further testing (QA and beyond)
    • Benefit – ensures that integration testing completes before QA examines the application
    • Negative – requires logic to avoid problems with data pollution.

What about Load Testing?

Load testing is a form of integration testing with some nuance. We want to run these tests frequently, but they must run in a context where the results are valid. Running them in, let us say, a QA environment is often not helpful since a QA server likely does not have the same specs as Production; problems with load in QA may not be an issue in higher environments.

If you opt for the “ephemeral approach” you can conduct load testing as part of these integration tests – provided your ephemeral environment is configured to have horsepower similar to production.

If the second strategy is used, I often see Load Testing done for staging, which I disagree with – it is too far to the right. Instead, this should be done in QA ahead of (or as part of) the manual testing effort.

As you can see above in the pyramid, ideally these integration tests comprise about 20% of your tests. Typically though, this section is where the percentage will vary the most depending on the type of application you are building.

Finally we do our Manual Testing with UI testing

UI tests and/or acceptance tests comprise the smallest percentage (10%), mainly because these tests are so high level that they become brittle, and excessive amounts will generate an increased need for maintenance. Further, testing here tends to be more subjective and strategic in nature, thus exploratory testing tends to yield more results and informs the introduction of more tactical tests at other levels.

QA is a strategic resource, not a tactical one

A core problem often seen within teams and organizations prior to adopting DevOps is how QA is seen and used. Very often, QA is a member of the team, or some other department, that code is “thrown over the wall to” as a last step in the process. This often leads to bottlenecks and can even create an adversarial relationship between Engineering and QA.

The truth is, the way QA is treated is neither fair nor sensible. I always ask teams “how often has QA been given four or five features to test at 4:45pm the day before the Sprint Demo?”. Each time, the answer is that this is not an exception; it is consistent. And of course QA finds issues, and the whole team ends up staying late or “living with bugs”. This creates a bottleneck at QA, a rather unfair one at that, and it underlines the misunderstanding organizations have about QA.

QA is not responsible for testing, per se; they are responsible for guiding testing and ensuring it is happening. Testing, ultimately, falls to developers as they are the closest to the code and have the best understanding of it. This is why automated testing (unit testing in particular) is so vital to the “safety net” concept. Getting developers to understand that testing and writing tests is their responsibility is vital to adopting DevOps culture.

This is not to say QA does NO testing; they do. But it is more strategic in nature, aimed at exploratory testing and/or ensuring the completeness of the testing approach. They also lead in the identification of issues and their subsequent triaging. Key to high functioning teams: whenever an issue is found, the team should remediate it but also create a test which can prevent it from appearing in the future. As the old saying goes, “any problem is allowed to happen once”.

Moving away from this reliance on the QA department/individual can feel rash to teams that have become overly dependent on this idiom. But rest assured, the best way forward is to focus on automation to create and maintain a suitable safety net for teams.

Safety Nets Take Time

Even introducing 1,000 unit tests tomorrow is not going to immediately give your developers the confidence to move faster. Showing that you can deploy 6x a day is not going to immediately see teams deploying 6x a day. Confidence is earned and, as the saying goes, “you only notice something when it breaks”. DevOps is not a magic bullet or even a tool; it is a cultural shift, one that, when properly done, touches every corner of the organization and every level, from the most junior developer to the CEO.

The culture implores participants to constantly challenge themselves and their application to ensure that the safety measures in place work correctly and completely. High functioning teams want to break their systems; notably, Netflix will often break things in Production intentionally to ensure failsafes are working properly.

More canonically, if a test never breaks, how do we know it works at all? This is the reasoning behind the Red-Green-Refactor development methodology (https://www.codecademy.com/articles/tdd-red-green-refactor). I see a lot of teams simply write tests with the assumption that they work, without ever creating a failing condition to prove that they can break.

But the effort is worth it to move faster and see higher quality. In addition, adopting this aspect of DevOps culture means teams can have higher confidence in their releases (even if they are not deploying all the time). This makes for decreased burnout and better morale and productivity. Plus, you get a full regression suite for free.

I plan to continue this series by diving more deeply into many of the concepts I covered here with unit testing likely being the next candidate.