Reviewing “Team Topologies”

Recently I finished “Team Topologies: Organizing Business and Technology Teams for Fast Flow” by Matthew Skelton and Manuel Pais – Amazon: https://a.co/d/1U8Gz56

I really enjoyed this book because it took a different tack on DevOps, one that is very often overlooked by organizations: team structure and team communication. A lot of organizations that I have worked with misunderstand DevOps as simply being automation or the use of a product like GitHub Actions, CircleCI, Azure DevOps, etc. But the truth is, DevOps is about so much more than this, and the book goes deep into that idea, exploring team topologies and emphasizing the need to organize communication.

In particular the book calls out four core team types:

  • Stream aligned – in the simplest sense these are feature teams but, really, they are so much more. If you read The Phoenix Project by Gene Kim you start to understand that IT and engineering are not really their own “thing” but rather a feature of a specific department, entity, or collab. Thus, what stream-aligned actually means is a set of individuals working together to handle changes for that part of the organization
  • Enabling – this one I was aware of, though I had never given it a formal name. This team is designed to assist stream-aligned teams in enabling something. A lot of orgs make the mistake of creating a DevOps team, which is a known anti-pattern. DevOps (or automation, as it usually is) is something you enable teams with, through things like self-service and self-management. The goal of the enabling team is to improve the autonomy of the stream-aligned team.
  • Platform – platform teams can be stream-aligned teams but their purpose is less about directly handling the changes for a part of the org than it is supporting other stream-aligned teams. Where enabling teams may introduce new functionality, platform teams support various constructs to enable more streamlined operation. Examples range from a wiki documenting certain techniques to a custom solution enabling the deployment of infrastructure to the cloud.
  • Complicated Sub-system – the authors view this as a specialized team that is aligned to a single, highly complex part of a system or the organization (it can even be a team managing regulatory compliance). The authors use the example of a trading platform, where individuals on the team manage a highly complex system performing market trades, where speed and accuracy must be perfect.

The essence of this grouping is to align teams to purpose and enable fast flow, what Gene Kim (in The DevOps Handbook) calls The First Way. Speed is crucial for an organization using DevOps, as speed to deploy also means speed to recover. And to enable that speed, teams need focus (and to reduce change sizes). Too often organizations get into sticky situations and respond with still more process. While the general thought is that it makes things better, really it is security theater (link) – in fact, I have observed this often leads to what I term TPS (Traumatic Process Syndrome), where processes become so bad that teams do everything they can to avoid the trauma of going through with them.

Team Topologies goes even deeper than these four team types, getting into office layouts and referencing the Spotify idea of squads. But, in the end, as the authors indicate, this is all really a snapshot in time, and it is important to constantly evaluate your topology and make the appropriate shifts as priorities or realities change – nothing should remain static.

To further make this point, the book introduces the three core communication types:

  • Collaboration – this is a short-lived effort so two teams can perform discovery of new features, capabilities, and techniques in an effort to get better. The authors stress this MUST be short-lived, since collaborating inherently brings about inefficiencies, blurs boundaries of responsibility, and increases cognitive load for both teams.
  • X-as-a-Service – this is the natural evolution from collaboration, where one team provides functionality “as a service” to one or more teams. This is not necessarily a platform model but, instead, enforces the idea of separation of responsibilities. Contrasting with collaboration, cognitive load is minimal here as each team knows its responsibilities.
  • Facilitating – this is where one team is guiding another. Similar, in my view, to collaboration, it is likewise short-lived and designed to enable new capabilities. This is therefore the typical type of communication a stream-aligned team and an enabling team will experience.

One core essence of this is to avoid anti-patterns like Architectural Review Boards, or any other ivory-tower planning committee. Trying to do this sort of planning up front is, at best, asking for a continuous stream of proposals as architectures morph, and at worst a blocking process that delays projects and diminishes trust and autonomy.

It made me recall an interaction I had with a client many years ago. I had asked “how do you ensure quality in your software?” to which they replied “we require a senior developer to approve all PRs”. I then asked, “about how many meetings per day is that person involved in?” They conferred for a moment and came back and said “8”. I then asked, “how much attention would you say he is actually giving to the code?” It began to dawn on them. It came to light much later that that senior developer had not been actively in the code in months and was just approving what he was asked to approve. It was the junior developers approving and validating their work with each other – further showing that “developers will do what it takes to get things done, even in the face of a bad process”.

And this brings me to the final point I want to discuss from this book: cognitive load. Having been in the industry for 20 years now, I have come to understand that we must constantly monitor how much cognitive load an action takes; people have limits. For example, even if it is straightforward, opening a class file with 1,000 lines will immediately overload most people. Taking a more complex approach, or trying to be fancy when it is not needed, also adds cognitive load. And this makes it progressively harder for the team to operate efficiently.

In fact, Team Topologies talks about monitoring cognitive load as a way to determine when a system might need to be broken apart. And yes, that means giving time for the reduction of technical debt, even in the face of delaying features. If LinkedIn can do it (https://www.bloomberg.com/news/articles/2013-04-10/inside-operation-inversion-the-code-freeze-that-saved-linkedin#xj4y7vzkg) your organization can do it, and in doing so shift the culture to “team-first” and improve its overall health.

I highly recommend this book for all levels and roles; technologists will benefit as much as managers. Organizing teams is the key to actually getting value from DevOps. Anyone can write pipelines and automate things, but if such a shift is done without actually addressing organizational inefficiencies in operations and culture, you may do more harm than good.

Team Topologies on Amazon – https://a.co/d/1U8Gz56


Create a Private Function App

Deploying to the cloud makes a lot of sense, as the large number of services in Azure (and other providers) can help accelerate teams and decrease time to market. However, while many services are, with their defaults, a great option for hosting applications on the public Internet, it can be a bit of a mystery how to handle scenarios where applications should be private. Here I want to walk through the steps of privatizing a Function App and opening it to the Internet via an Application Gateway.

Before we start, a word on Private Endpoint

This post will heavily feature Private Endpoint as the means to make private connections. Private Endpoints and the associated service, Private Link, enable the very highest levels of control over the flow of network traffic by restricting it to ONLY the attached Virtual Network.

This, however, comes at a cost, as it will typically require the use of Premium plans for services to support the feature. What is important to understand is that service-to-service (even cross-region) communication in Azure is ALL handled on the Microsoft backbone; it never touches the public internet. Therefore, your traffic is, by default, traveling in a controlled and secure environment.

I say this because I have a lot of clients whose security teams set Private Endpoint as the default. For the vast majority of use cases this is overkill, as the default Microsoft networking is going to be adequate for the majority of data cases. The exceptions are the obvious ones: HIPAA, CJIS, IRS, and financial (most specifically PCI), and perhaps others. But, in my view, using it for general data transfer is overkill and leads to bloated cost.

Now, on with the show.

Create a Virtual Network

Assuming you already have a Resource Group (or set of Resource Groups), you will first want to deploy a Virtual Network with an address space; for this example I am taking the default of 10.0.0.0/16. Include the following subnets:

  • functionApp – CIDR: 10.0.0.0/24 – will host the private endpoint that is the Function App on the Virtual Network
  • privateEndpoints – CIDR: 10.0.1.0/24 – will host our private endpoints for related services, a Storage Account in this case
  • appGw – CIDR: 10.0.2.0/24 – will host the Application Gateway which enables access to the Function App for external users
  • functionAppOutbound – CIDR: 10.0.3.0/24 – This will be the integration point where the function app will send outbound requests

The Region selected here is critical, as many network resources either cannot cross a regional boundary OR can only cross into their paired region. I am using East US 2 for my example.

Create the Storage Account

Function Apps rely on a storage account to support the runtime, so we will want to create one to support our function app. One thing to keep in mind: Private Endpoints are NOT supported on v1 Storage Accounts, only v2. If you attempt to create the Storage Account through the Portal via the Function App process, it will create a v1 account which will NOT support Private Endpoint.

When you create this Storage Account, be sure the network settings are wide open; we will adjust them after the Function App is successfully set up.

Create the Function App

Now with the Function App we want to keep a few things in mind.

  • Use the same region that the Virtual Network is deployed into
  • You MUST use either a Premium or App Service plan type; the Consumption plan does not support privatization
  • For hosting, select the storage account you created in the previous section.
  • For the time being do NOT disable public access – we will disable it later

For added benefit I recommend picking Windows for the operating system, as it will enable in-portal editing. This will let you quickly set up the Ping endpoint I am going to describe later. Note this post does NOT go into deploying – without public access, additional configuration may be required to support automated deployments.

Allow the process to complete to create the Function App.

Enable VNet Integration

VNet integration is only available on Premium SKUs and above for both Function Apps and App Services. It enables a service to sit effectively on the boundary of the VNet and communicate with private IPs in the attached VNet as well as peered VNets.

For this step, access the Networking blade in your Function App and look for the VNet integration link on the right side of the screen.

Next, click “Add VNet” and select the Virtual Network and subnet (functionAppOutbound) which will receive the outbound traffic from the Function App.

Once complete, leave ROUTE ALL enabled. Note that for many production scenarios leaving this on can create issues, as I explain next. But for this simple example, having it enabled will be fine.

What is ROUTE ALL?

I like to view an App Service, or Function App, as having two sides: inbound and outbound. VNet integration allows the traffic coming out of the service to enter a Virtual Network. Two different modes are supported: ROUTE ALL and default. With ROUTE ALL enabled, ALL traffic enters the VNet, including traffic bound for an external host (https://www.google.com for example). Thus, to support this, YOU must add the various controls to support egress. With ROUTE ALL disabled, routing will simply follow the rules of RFC1918 (link), sending 10.x and a few other ranges into the Virtual Network while the rest follows Azure routing rules.

Microsoft documentation explains it more clearly: Integrate your app with a Virtual Network

Setup Private Connection for Storage Account

Function Apps utilize two sub-services within Storage Account for operation: blob and file. We need to create private endpoints for these two sub-services so that, using the VNet Integration we just enabled, the connection to the runtime is handled via private connection.

Access the Storage Account and select the Networking blade. Immediately select Disabled for Public network access. This will force the use of Private Endpoint as the sole means to access the storage account. Hit Save before continuing to the next step.

Select the Private endpoint connections tab from the top. Here is a screen shot of the two Private Endpoints I created to support my case:

We will create a Private Endpoint for the file share and blob services, as these are being used by the Function App to support the runtime. By doing this through the portal, other networking elements, such as the setup of the Private DNS Zone, can be handled for us. Note, in an effort to stay on point, I won't be discussing how Private Endpoint/Link routing actually works.

Click the + Private endpoint button and follow the steps for both the file and blob subresource types. Pay special attention to the values the defaults select; if you have other networking in the subscription, the wizard can pick those components and cause communication issues.

Each private endpoint should link into the privateEndpoints subnet that was created with the Virtual Network.

Remember, it is imperative that the Private Endpoint be deployed in the same region and same subscription as the Virtual Network to which it is being attached.

More information on Private Endpoint and the reason for the Private DNS Zone here

Update the Function App

Your Function App needs to be updated to ensure it understands that it must get its content over a VNet. Specifically this involves updating Configuration values.

Details on what values should be updated: Configure your function app settings

The one to key on is the WEBSITE_CONTENTOVERVNET setting; ensure it is set to 1. Note the documentation deploys a Service Bus; we are not doing so here, so you can skip the related fields.

Be sure to check that each value matches expectations. I skipped over this the first time and ran into problems because of it.

Click Save to apply the changes before moving on.

Go into General Settings and disable HTTPS Only. We are doing this to avoid dealing with certificates in the soon-to-be-created Application Gateway. In a production setting you would not want this turned off.

Click Save again to apply the changes.

Next, create a new HttpTrigger Function called HttpPing. Use the source code below:

#r "Newtonsoft.Json"

using System.Net;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Extensions.Primitives;
using Newtonsoft.Json;

public static IActionResult Run(HttpRequest req, ILogger log)
{
    return new OkObjectResult("ping");
}

Again, I am assuming you used Windows for your plan OS; otherwise, you will need to figure out how to get custom code onto this function app so you can validate functionality beyond seeing the loading page.

Once you complete this, break out Postman or whatever and hit the endpoint to make sure it’s working.
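
If you prefer to script the check rather than use Postman, a quick sketch with HttpClient works as well – the URL below is a placeholder for your function's actual invoke URL (plus a function key if required):

using System;
using System.Net.Http;
using System.Threading.Tasks;

public static class PingCheck
{
    public static async Task Main()
    {
        using var client = new HttpClient();

        // Placeholder host – substitute your Function App's HttpPing invoke URL
        var response = await client.GetAsync("https://<your-function-app>.azurewebsites.net/api/HttpPing");

        Console.WriteLine($"{(int)response.StatusCode}: {await response.Content.ReadAsStringAsync()}");
    }
}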

Coincidentally, this will also validate that the Storage Account connection is working. Check for common errors like DLL not found, Runtime unreachable, or the darn thing just not loading.

Create Private Endpoint for Function App

With the networking features in place to secure the outbound communications from the Function App, we need to lock down the incoming traffic. To do this we need to disable public access and use a Private Endpoint to get a routable private IP for the Function App.

Return to the Networking blade and this time, select Private Endpoints from the screen (shown below):

Using the Express option, create a private endpoint attached to the functionApp subnet in our Virtual Network – choose Yes for Integrate with private DNS zone (this will create the Private DNS zone and allow routing to work). Once complete, attempt to hit your Function App again; it should still work.

Now we need to disable public access to the function app. Do this by returning to the Networking blade of the Function App; this time we will select Access restriction.

Uncheck the Allow public access checkbox at the top of the page and click Save.

If you attempt to query the Function App now, you will be met with an error page indicating a 403 Forbidden.

Remember, most PaaS services, unless an App Service Environment is used, can never be fully private. Users who attempt to access this function app now will receive a 403 – as the only route left to the service is through our Virtual Network. Let's add an Application Gateway and finish the job.

Create an Application Gateway

Application Gateways are popular routing controls that operate at Layer 7, the HTTP layer. This means they can route based on path, protocol, hostname, verb – really any feature of the HTTP payload. In this case, we are going to assign the Application Gateway a Public IP, call that Public IP, and see our Function App respond.

Start by selecting Application Gateway from the list of available services:

On the first page set the following values:

  • Region should be the SAME as the Virtual Network
  • Disable auto-scaling (not recommended for Production scenarios)
  • Virtual Network should be the Virtual Network created previously
  • Select the appGw subnet (App Gateway MUST have a dedicated subnet)

On the second page:

  • Create a Public IP Address to make the Application Gateway addressable on the public internet; that is, it will allow external clients to call the Function App.

On the third page:

  • Add a backend pool
  • Select App Service and pick your Function App from the list
  • Click Add

On the fourth page:

  • Add a Routing Rule
  • For Priority make it 100 (can really be whatever number you like)
  • Take the default for all fields, but make sure the Listener Type is Basic Site and the Frontend IP Protocol is HTTP (remember we disabled HTTPS Only on the Function App)
  • Select Backend Targets tab
  • For Backend Target select the pool you defined previously
  • Click Add new for Backend Settings field
  • Backend protocol should be HTTP, with port 80
  • Indicate you wish to Override with new host name. Then choose to Pick host name from backend target – since we will let the Function App decide the hostname
  • Click Add a couple times

Finish up and create the Application Gateway.

Let’s test it out

When we deployed the Application Gateway we attached it to a public IP. Get the address of that Public IP and replace the hostname in your query – REMEMBER we must use HTTP!!
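
The same HttpClient approach from earlier works here; note the http scheme and the gateway's public IP (a placeholder below) standing in for the hostname:

using var client = new HttpClient();

// Placeholder address – substitute the Application Gateway's public IP
var response = await client.GetAsync("http://<app-gateway-public-ip>/api/HttpPing");
Console.WriteLine((int)response.StatusCode);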

If everything is setup properly you should get back a response. Congratulations, you have created a private Azure Function App routable only through your Virtual Network.

Options for SSL

To be clear, I would not advocate the use of HTTP for any scenario, even in development. I abstained from that path to make this walkthrough easier. Apart from creating an HTTPS listener in the Application Gateway, Azure API Management operating in External mode with the Developer or Premium SKU (only they support VNet integration) would be the easiest way of supporting TLS throughout this flow.

Perhaps that is another blog post for the future – APIM just takes an hour to deploy, so it is a wait 🙂

Closing Remarks

Private Endpoint is designed as a way to secure the flow of network data between services in Azure; specifically, it is for high-security scenarios where data needs to meet certain regulatory requirements for isolation. Using Private Endpoint for this case, as I have shown, is a good way to approach security without taking on the expense and overhead of an App Service Environment, which creates an isolated block within the data center for your networking.

That said, using them for all data in your environment is not recommended. Data, by default, goes over the Azure backbone and stays securely on Microsoft networks so long as the communication is between Azure resources. This is adequate for most data scenarios and can free your organization from the cost and overhead of maintaining Private Endpoints and Premium SKUs for apps that have no need for such capability.

Understanding Terraform Data Sources in Modules

So, over the last week I have been battling an issue in Terraform that truly drove me nuts, and I think understanding it can help someone else who is struggling with the same issue.

What is a data source?

In Terraform there are two principal elements when building scripts: resources and data sources. A resource is something that will be created and controlled by the script. A data source is something which Terraform expects to already exist. Keep this in mind, as it will be important in a moment.

What is a module?

Modern Infrastructure as Code approaches focus on modules as a way to encapsulate logic and standards which can be reused. It is this approach which underpins the problem I found: a module is built that needs to look up existing resources to hydrate fields on the resources it encapsulates.

What is the problem?

The problem arises from a sequence of events like this:

terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "=3.30.0"
    }
  }
}

provider "azurerm" {
  features {}
}

resource "azurerm_resource_group" "this" {
  name     = "rg-terraform"
  location = "East US 2"
}

module "storage-account" {
  source              = "./storage-account"
  resource_group_name = azurerm_resource_group.this.name
}

module "image-container" {
  source                              = "./storage-account-container"
  storage_account_resource_group_name = azurerm_resource_group.this.name
  storage_account_name                = module.storage-account.storage_account_name
  container_name                      = "images"
}

Where the problem rears its head is in the source for the image-container module:

terraform {
}

variable "storage_account_resource_group_name" {
  type = string
}

variable "storage_account_name" {
  type = string
}

variable "container_name" {
  type = string
}

data "azurerm_storage_account" "sa" {
  name                = var.storage_account_name
  resource_group_name = var.storage_account_resource_group_name
}

resource "azurerm_storage_container" "container" {
  name                 = var.container_name
  storage_account_name = data.azurerm_storage_account.sa.name
}

The key here is the data source reference to azurerm_storage_account. Here, DESPITE the reference to the storage-account module used in the root, Terraform will attempt to resolve the data source within image-container before ANYTHING else, which results in this error:

As you can see, Terraform will NOT wait for the storage-account module to complete before trying to resolve the data source within the container module.

What is the solution?

Frankly, I am not sure I would classify this as a bug so much as “by design”, but it is still annoying. The way to get around it is to not have ANY data sources in your modules that reference components created as part of your root scripts. So, for example, our container module would look like this:

terraform {
}

variable "storage_account_resource_group_name" {
  type = string
}

variable "storage_account_name" {
  type = string
}

variable "container_name" {
  type = string
}

resource "azurerm_storage_container" "container" {
  name                 = var.container_name
  storage_account_name = var.storage_account_name
}

Notice how I have removed the data source and simply used the variable for the storage account name directly? This is all you have to do. This is, I admit, a simple script, but it is something you will want to be thinking about. Where I ran into this problem was defining custom domains and mTLS certificates for API Management.

So that is it, that is what I found. Maybe it is not new and it was something obvious, though I venture otherwise given the lack of any mention of this in the Terraform documentation. It might be something HashiCorp considers “works as designed” but I still found it annoying. So if this helps you, let me know in the comments.

Cheers.

Introducing a Redis binding for Azure Functions

One of the things I have been looking at in recent weeks is Event Streaming, or basically democratizing data within a system so that it can be freely accessed by any number of microservices. It is a wonderful pattern, and Azure Functions are an ideal platform for implementation. However, over the course of this process I came to realize that while Azure Functions has a bevy of bindings available to it, one that is very clearly missing is Redis. So I set about building one.

I am pleased to say I have released version 1.1.0 on nuget.org – https://www.nuget.org/packages/Farrellsoft.Azure.Functions.Extensions.Redis/

Under the hood there are two supporting libraries:

  • Newtonsoft Json.NET
  • StackExchange.Redis

Reading Values

public static async Task<IActionResult> Run(
    [Redis(key: "value", Connection = "RedisConnectionString")] string value)
{
}

Here, the binding is reading a string value from Redis (connected using the RedisConnectionString app setting value) with the key value. The binding is, right now, limited to reading data from either a Redis string or a Redis list. However, it can support reading C# classes – which are stored as JSON strings and deserialized using Newtonsoft's Json.NET. For example:

public static async Task<IActionResult> Run(
    [Redis(key: "object", Connection = "RedisConnectionString")] Person person)
{
}

When we get into reading lists, the internal type can either be string or a normal C# class. For example:

public static async Task<IActionResult> Run(
    [Redis(key: "objects", Connection = "RedisConnectionString")] List<Person> people)
{
}

As when reading a single object out of the cache, the underlying value is stored as a JSON string and will be deserialized using Newtonsoft Json.NET.
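
For context, here is a rough conceptual sketch of what a list read boils down to – this is NOT the binding's actual implementation, just an illustration using the two supporting libraries listed above, with a hypothetical Person type and placeholder key and connection string:

using System.Collections.Generic;
using System.Linq;
using Newtonsoft.Json;
using StackExchange.Redis;

// Hypothetical POCO matching the Person used in the examples
public class Person
{
    public string Name { get; set; }
}

public static class ConceptualRead
{
    public static List<Person> ReadPeople(string connectionString, string key)
    {
        // Connect and pull the raw entries stored under the key
        var muxer = ConnectionMultiplexer.Connect(connectionString);
        var db = muxer.GetDatabase();
        RedisValue[] raw = db.ListRange(key);

        // Each entry is a JSON string, so deserialize it into the C# type
        return raw.Select(v => JsonConvert.DeserializeObject<Person>(v.ToString())).ToList();
    }
}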

Writing Values

This is currently the newest use case I have added to the binding. Like reading, it only supports strings and basic C# classes, saved using either ICollector<T> or IAsyncCollector<T>; currently the out parameter is NOT supported, though I plan to add it in the future.

public async Task<IActionResult> Run(
    [Redis(key: "value", valueType: RedisValueType.Single, Connection = "RedisConnectionString")] ICollector<string> values)
{
    values.Add("testvalue");
}

When doing an output (write) case, you must specify the underlying type the value is stored as – right now either Single or Collection. In the above example, the use of Single will invoke JSON serialization logic and store the value given using StringSet. If the given key already has a value, the new value sent through the collector will overwrite the existing value.
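
Conceptually (again, not the binding's actual code), a Single write amounts to serializing the value and calling StringSet, which replaces whatever is already at the key:

using Newtonsoft.Json;
using StackExchange.Redis;

public static class ConceptualWrite
{
    public static void WriteSingle<T>(IDatabase db, string key, T value)
    {
        // Serialize to JSON and overwrite the existing value at the key
        string json = JsonConvert.SerializeObject(value);
        db.StringSet(key, json);
    }
}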

When using Collection the underlying code will use List functions against the Redis class, with two potential execution paths. For example, the following code will append JSON strings for objects given:

public async Task<IActionResult> Run(
    [Redis(key: "values", valueType: RedisValueType.Collection, Connection = "RedisConnectionString")] ICollector<Person> values)
{
    values.Add(new Person { Name = "Name 1" });
    values.Add(new Person { Name = "Name 2" });
}

It's important to understand that the above code will keep adding values to the end of the Redis list. If you want to update values in a list, you need to have your C# object implement the IRedisListItem interface, which will force an Id property. Note this approach is NOT available for strings.

public class Person : IRedisListItem
{
    public string Name { get; set; }
    public string Id { get; set; }
}

The binding will key off this Id value when it receives a value for the specific Redis key. The one drawback in the current implementation is that the entire list has to be pulled, so if you are adding a lot of values for a C# class you will notice performance degradation. Here is an example of this approach being used:

public async Task<IActionResult> Run(
    [Redis(key: "values", valueType: RedisValueType.Collection, Connection = "RedisConnectionString")] ICollector<Person> values)
{
    values.Add(new Person { Id = "1", Name = "Name 1" });
    values.Add(new Person { Id = "1", Name = "Name 2" });
}

In the example above, because the Id value is found in the list, the final value for Name will be Name 2.
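
To confirm the behavior, you could read the key back using the list-read binding shown earlier; a quick sketch:

public static async Task<IActionResult> Run(
    [Redis(key: "values", Connection = "RedisConnectionString")] List<Person> people)
{
    // After the writes above, this list should contain a single Person with Name == "Name 2"
    return new OkObjectResult(people);
}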

The source code for the binding is located at https://github.com/farrellsoft/azure-functions-redis-binding. I am open to feedback and look forward to people having a positive experience using it.

Creating Custom OPA Policies with Azure Policy

Governance is a very important component of a strong security and operational posture. Using Azure Policy, organizations can define rules around how the various resources in Azure may be used. These cover a wide range of granularity, from the general “resources may only be deployed to these regions” to “VMs must be of certain SKUs and certain resource tiers are prohibited” to, finally, “network security groups must contain a rule allowing access over port 22”. Policies are essential to giving operators and security professionals peace of mind that rules and standards are being enforced.

As organizations have moved to adopt Cloud Native technologies like Kubernetes, the governance question continues to come up as a point of concern. Kubernetes resources are, after all, outside the true scope of the Cloud Provider, Azure included. Thus, in many cases, teams rely on each other to avoid any pitfalls, in lieu of proper governance.

Open Policy Agent (https://openpolicyagent.org) is a Graduated CNCF (https://cncf.io) project aimed at providing an agnostic platform for enforcing policy rules across a variety of platforms, one of which is Kubernetes (a full intro to this topic is not covered here). Open Policy Agent hooks into the Kubernetes Admission Controller and works to prevent the creation of resources that violate defined rules. To aid in applying this at scale, Microsoft created the Azure Policy for Kubernetes (https://docs.microsoft.com/en-us/azure/governance/policy/concepts/policy-for-kubernetes) feature, which allows OPA policies (written in Rego) to be provisioned to an assigned AKS or Arc-connected cluster.

The feature is enabled via an add-on to the target AKS cluster. The Azure Policy authors have already written a large number of these policies that you can use for free with an Azure subscription. Support for custom OPA policies is, at this time, still in preview, but it is stable enough that walking through it should prove beneficial.

Enable the Add-on

Support for OPA-type policies in AKS is done through an add-on, which must be enabled to support propagation of the policies. The add-on can be enabled using the following command:

az aks enable-addons --addons azure-policy --resource-group $RG_NAME --name $CLUSTER_NAME

As with ANY add-on, you should run a list first to see if it is already installed. Be prepared for the enablement process to take a non-trivial amount of time.

Create your Rego Policy

Whether you use OPA in the traditional sense or with Azure Policy, you start by creating a ConstraintTemplate. This template is what enables the creation of the custom kind (the name of a specific resource type in Kubernetes) that will enable low-level assignment of your policy. Below is a simple ConstraintTemplate which restricts a certain resource kind within a target namespace:

apiVersion: templates.gatekeeper.sh/v1beta1
kind: ConstraintTemplate
metadata:
  name: k8srestricttype
spec:
  crd:
    spec:
      names:
        kind: K8sRestrictType
      validation:
        openAPIV3Schema:
          properties:
            namespace:
              type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8srestricttype

        violation[{ "msg": msg }] {
          object := input.review.object
          object.metadata.namespace == input.parameters.namespace
          msg := sprintf("%v is not allowed for creation in namespace %v", [input.review.kind.kind, input.parameters.namespace])
        }

This code is not something I would use in a production setting, but it gets the point across without being overly complicated. The policy registers a violation if creation of a certain type of resource is attempted within the namespace indicated by the input parameter namespace.

VERY IMPORTANT!! Note the use of the openAPIV3Schema block to indicate the supported parameters and their types. Including this is vital; otherwise the add-on will not generate the relevant constraint with the parameter value provided.

Once you have created the ConstraintTemplate you should store it in a place which is accessible; for ease of use, I recommend Azure Blob Storage with a public container.

Create the Azure Policy

The Azure Policy team has a wonderful extension for VSCode that can aid in generating a strong starting point for a new Azure Policy: https://docs.microsoft.com/en-us/azure/governance/policy/how-to/extension-for-vscode. The below is the Policy JSON I used for the above policy:

{
"properties": {
"policyType": "Custom",
"mode": "Microsoft.Kubernetes.Data",
"policyRule": {
"if": {
"field": "type",
"in": [
"Microsoft.ContainerService/managedClusters"
]
},
"then": {
"effect": "[parameters('effect')]",
"details": {
"templateInfo": {
"sourceType": "PublicURL",
"url": "https://stopapolicyjx01.blob.core.windows.net/templates/template.yaml"
},
"apiGroups": [
""
],
"kinds": [
"[parameters('kind')]"
],
"namespaces": "[parameters('namespaces')]",
"excludedNamespaces": "[parameters('excludedNamespaces')]",
"labelSelector": "[parameters('labelSelector')]",
"values": {
"namespace": "[parameters('namespace')]"
}
}
}
},
"parameters": {
"kind": {
"type": "String",
"metadata": {
"displayName": "The Kind of Resource the policy is restricting",
"description": "This is a Restriction"
},
"allowedValues": [
"Pod",
"Deployment",
"Service"
]
},
"namespace": {
"type": "String",
"metadata": {
"displayName": "The namespace the restriction will be applied to",
"description": ""
}
},
"effect": {
"type": "String",
"metadata": {
"displayName": "Effect",
"description": "'audit' allows a non-compliant resource to be created or updated, but flags it as non-compliant. 'deny' blocks the non-compliant resource creation or update. 'disabled' turns off the policy."
},
"allowedValues": [
"audit",
"deny",
"disabled"
],
"defaultValue": "audit"
},
"excludedNamespaces": {
"type": "Array",
"metadata": {
"displayName": "Namespace exclusions",
"description": "List of Kubernetes namespaces to exclude from policy evaluation."
},
"defaultValue": [
"kube-system",
"gatekeeper-system",
"azure-arc"
]
},
"namespaces": {
"type": "Array",
"metadata": {
"displayName": "Namespace inclusions",
"description": "List of Kubernetes namespaces to only include in policy evaluation. An empty list means the policy is applied to all resources in all namespaces."
},
"defaultValue": []
},
"labelSelector": {
"type": "Object",
"metadata": {
"displayName": "Kubernetes label selector",
"description": "Label query to select Kubernetes resources for policy evaluation. An empty label selector matches all Kubernetes resources."
},
"defaultValue": {},
"schema": {
"description": "A label selector is a label query over a set of resources. The result of matchLabels and matchExpressions are ANDed. An empty label selector matches all resources.",
"type": "object",
"properties": {
"matchLabels": {
"description": "matchLabels is a map of {key,value} pairs.",
"type": "object",
"additionalProperties": {
"type": "string"
},
"minProperties": 1
},
"matchExpressions": {
"description": "matchExpressions is a list of values, a key, and an operator.",
"type": "array",
"items": {
"type": "object",
"properties": {
"key": {
"description": "key is the label key that the selector applies to.",
"type": "string"
},
"operator": {
"description": "operator represents a key's relationship to a set of values.",
"type": "string",
"enum": [
"In",
"NotIn",
"Exists",
"DoesNotExist"
]
},
"values": {
"description": "values is an array of string values. If the operator is In or NotIn, the values array must be non-empty. If the operator is Exists or DoesNotExist, the values array must be empty.",
"type": "array",
"items": {
"type": "string"
}
}
},
"required": [
"key",
"operator"
],
"additionalProperties": false
},
"minItems": 1
}
},
"additionalProperties": false
}
}
}
}
}

Now, that is a rather long definition, so let's hit on the key points, focusing on the then block:

  • OPA supports many standard parameters, and these are listed in the policy JSON under parameters. Important to understand is that OPA ALWAYS expects these, so we do not need to do anything extra. You may also omit them, and the system does not seem to care.
  • Custom parameters should be placed under the values key. Parameters specified here MUST have a corresponding openAPIV3Spec definition.
  • kind and apiGroups relate directly to concepts in Kubernetes and help declare for what resources and actions against those resources the rule applies.
  • Note the templateInfo key value under then. This indicates where Azure Policy can find the OPA constraint template – that which we created earlier. There are a few different sourceType values available. For this example, I am referencing a URL in the storage account where the template was uploaded
  • Note the mode value (Microsoft.Kubernetes.Data). This value must be used so that the Azure Policy runners know this policy contains an OPA constraint template

In the case of our example, we are piggybacking off whatever kind the policy applies to and then indicating a specific namespace to make this restriction. We could achieve the same with the namespaces standard parameter but, in this case, I am using a singular namespace property to demonstrate passing properties.

Once this is created, create the Policy definition in Azure Policy using either the Portal or the Azure CLI, instructions here: https://docs.microsoft.com/en-us/azure/governance/policy/tutorials/create-custom-policy-definition#create-a-resource-in-the-portal

I also recommend giving the new definition a very clear name indicating it is custom. I tend to recommend this regardless of whether it is for Kubernetes or uses OPA; it just helps to more easily find them in the portal or when searching with the CLI.

Make the Assignment

As with any policy in Azure Policy, you must assign it to a specific scope for it to take effect. Policy assignments in Azure are hierarchical in nature; thus, making the assignment at a higher level affects the lower levels. In MCS, we typically recommend assigning policies to Management Groups for easier maintenance – but you may assign all the way down to Resource Groups.

More information on making assignments: https://learn.microsoft.com/en-us/azure/governance/policy/assign-policy-portal

Once the assignment is made you will have to wait for the AKS addon to pick it up and create the new type and constraint definition on your behalf – my experience has had it take around 15 minutes. At the time of this writing, there is no way to speed it up – even using a trigger-scan command from the AZ CLI does not work.

One critical detail while doing the assignment: ensure that the value provided to the effect parameter is in lower case and either audit (which will map to dryrun in OPA) or deny. In some of the built-in templates, Audit and Deny are offered as options. In my experience, the add-on gets confused by this.

Validate the assignment

After a period of time run a kubectl get <your CRD type name> and it will eventually return a generated resource of that type representing the policy assignment. Once you see the result there, you can attempt to create an invalid resource in your cluster to validate enforcement.

Something to keep in mind is that the CRD constraints will NOT report on deny actions; only dryrun enforcements get reported on (at this time warn is not supported, nor do I see much sense in the team supporting it) – this makes sense since, with deny in place, invalid resources cannot enter the system.

I also recommend starting with dryrun if your governance process is new; this will give teams time to make changes per the policies. Starting with deny can cause work disruptions and lessen the chances of success.

Debugging the Assignment

One thing I found helpful is to use -o yaml to see what the generated CRD and Constraint look like in the cluster. I used this when I was working with the Engineering team to determine why parameters were not being mapped.

I still recommend building your Rego using the Rego Playground (https://play.openpolicyagent.org/)

Closing

I am deeply intrigued by the use of OPA in Kubernetes to enforce governance, as I believe strongly in the promise of governance and its criticality in the DevSecOps movement. I also like what support in Azure Policy means for larger organizations looking to adopt OPA at a wider scale. Combine it with what Azure Arc brings to the table and suddenly any organization in any cloud has the ability to create a single pane of glass to monitor the governance status of their clusters regardless of platform or provider.

Please let me know if you have any questions in the comments.

Granular Control with ACR Scope Maps

One of the common questions I field with our customers involves the granularity of permissions within Azure. Though we provide RBAC as a means to assign roles that enable certain functionalities, often this is not granular enough – two prominent examples are Azure Key Vault and Azure Container Registry. In both of these, we typically recommend a “splitting” approach whereby multiple instances of these resources are deployed to “segment” the dispersion of values/assets.

The problem here is twofold: 1) it does not solve the problem, and 2) it creates pollution, especially within large organizations, and diminishes ROI (Return on Investment) as organizations must now devote additional resources to managing that which should be managed by the cloud provider. In the case of Container Registry, it also adds unnecessary cost, as Container Registry is not billed on consumption the way Key Vault is: https://azure.microsoft.com/en-gb/pricing/details/container-registry/#pricing

To mitigate this, the Container Registry team is offering, in preview, Scope Maps and Tokens. Let’s see how they work.

Explanation

First, the Container Registry MUST be created using the Premium SKU; if not, the Scope Map and Tokens feature will not be available.

Once created, you will want to create a Scope Map.

When you open this you will see 3 default scope maps created – unfortunately, at the time of this writing, there is no way to duplicate these using the portal, since user-created scope maps must target specific repositories with no support for wildcards.

Our goal is to create an identity that supports pushing and reading images from a specific repository in the ACR, in my case I created an image called servers/nginx with a tag value of 1.3.5 – this is purely fictitious.

The specific repository (or repositories) can be selected from the dropdowns after clicking the + Add button. Next to this selection, choose the permissions this scope map has – for a writer-type mapping you will want content/read and content/write.

Once the scope map is defined, we need to create a token which will serve as the username/password for the user. We do this in the Tokens section:

This screen is self-explanatory: provide the Name you want to use to identify the token (the username) and the associated Scope Map (the assigned role). Ensure the token is Enabled (the default) and press the Create button.

The final step is to generate a set of passwords for the token; these passwords may or may not have an expiry value. Click the name of the Token in the list; this will open the token details screen. You can see mine below for the pusher token:

I have yet to generate a password for either value. Click the refresh arrow to generate the password, you can choose whether or not you want an expiry. Once the generation is complete the web portal will display both the password and the appropriate docker login command – this is what az acr login calls under the hood.

Copy the docker login command for later when we run our tests.

Remember, once you leave the password generation screen you will NOT be able to view the password again, only regenerate it.

Testing our Token

In my example, I had created a repository in the registry called servers/nginx, and I then walked through the steps of creating a dedicated Token with a Scope Map to push to this repository. To see this in action, we need another image to push to our nginx repository – we can accomplish this by performing a docker tag – I used this command:

docker tag crsandboxym01.azurecr.io/servers/nginx:1.3.5 crsandboxym01.azurecr.io/servers/nginx:2.4.6

This command effectively gives me a new image called crsandboxym01.azurecr.io/servers/nginx:2.4.6 that we can push. Use the docker login command provided by the portal and then execute docker push on the retagged image – the image should push up successfully.

To test the deny case, run the following command:

docker tag crsandboxym01.azurecr.io/servers/nginx:2.4.6 crsandboxym01.azurecr.io/servers/myserver:1.0.0

This will add a new image called crsandboxym01.azurecr.io/servers/myserver:1.0.0 that is equivalent to our nginx image in every way except name. If you execute a docker push here you should get a deny message.

So that’s it – that is how to use tokens.

Additional Considerations

While this is certainly needed functionality, especially for some of my larger clients, I understand the disappointment that repository selection for a scope map cannot use wildcards; such a feature is nearly a requirement for enterprises consisting of many teams and numerous images and repositories. Hopefully this will be a feature that gets added in the not too distant future.

One other thing to keep in mind is that Tokens do NOT trump RBAC rules. If I use az acr login and I have AcrPush, I can still push to whichever repository I want. Thus this needs to be safeguarded and kept from development teams if the token approach is desired.

Graph your data with Gremlin

CosmosDB is one of the most interesting, and useful, services in Microsoft Azure in my view, though it can be quite daunting to use. It features many different types of databases, from Cassandra, to Mongo/Core, to Table, to Gremlin. While naturally falling under the umbrella of “NoSQL”, when understood, these databases can bring immense value to your use case. For today, I wanted to talk about Gremlin, as I have been experimenting with it heavily of late.

Have you ever wondered just how Facebook, Twitter, and others maintain their data despite the volume and varied nature of the data they collect? Social networks in general represent a very unique use case, as the consistency model of an RDBMS cannot easily be instituted at that volume, and the nature of the data does not easily lend itself to the eventual consistency of document stores like MongoDB. Instead, these services tend to rely on graph databases to hold their data, and in Cosmos, Gremlin (Apache TinkerPop) is the Graph API.

When might you use a Graph database?

First, no single solution is good for everything. If you are dealing with a highly transactional system (like a bank), an RDBMS like Azure SQL or PostgreSQL is likely going to fit your needs the best. On the other end, if you are ingesting a high volume of data, or data which has varying forms, the eventual models of NoSQL (Mongo, Raven, Cockroach) are likely going to be ideal. Where I feel graph databases come in is when you have data that is HIGHLY relational, or where relationships are always growing or need to change. A great example of this is an organization hierarchy.

In classic computer science, we would often experiment with building these highly relational systems in an RDBMS to help students and professionals better understand normalization, or because it was all we had lying around. Let me be clear: nothing stops any of these systems from handling any use case; however, always try to pick the best tool for the job.

Let’s consider a group of people based on my family. Here is a visualization (in graph form) of the data:

Nothing would stop us from representing this in an RDBMS – in fact, it is a well-worn problem with nullable “related to” columns – but is it the best tool? If we think about Facebook and when someone “likes” a story, we also have to consider how we ensure the integrity of a count; aggregating would be impossible at that scale, so we need our entities to “magically” keep track of these values.

Enter a Graph database. Each bubble above is referred to as a “vertex” and each arrow an “edge”. The relationships are mono-directional, though they can certainly circle back to make it feel bidirectional. But each vertex knows how many edges (or relationships) it has and can easily spit that value back. If someone chooses to “unlike” a Facebook story, for example, that edge simply disappears from the Vertex.

I think an even better example is a company hierarchy. Consider how often a company like Microsoft, for example, might shift around positions, create new titles and positions and move who reports to whom. While it could be represented in a RDBMS database, it would be far easier in a graph database.

How do I get started?

First, I would recommend creating a CosmosDB instance using Gremlin (if you like AWS, they have Neptune). Instructions are here: https://docs.microsoft.com/en-us/azure/cosmos-db/graph/create-graph-dotnet

It's good to understand but, honestly, the sample app is not very good and the default driver library leaves much to be desired. After some searching I came across Gremlinq by ExRam and I LOVE this library. It makes things so much easier and there is even a complete sample project for Gremlin (and Neptune) provided. Working with CosmosDB I created the following objects:

var clairePerson = await querySource.AddV<Person>(new Person() { Id = Guid.NewGuid(), FirstName = "Claire" }).FirstAsync();
var ethanPerson = await querySource.AddV<Person>(new Person() { Id = Guid.NewGuid(), FirstName = "Ethan" }).FirstAsync();
var jasonPerson = await querySource.AddV<Person>(new Person() { Id = Guid.NewGuid(), FirstName = "Jason" }).FirstAsync();
var brendaPerson = await querySource.AddV<Person>(new Person() { Id = Guid.NewGuid(), FirstName = "Brenda" }).FirstAsync();
var stevenPerson = await querySource.AddV<Person>(new Person() { Id = Guid.NewGuid(), FirstName = "Steven" }).FirstAsync();
var seanPerson = await querySource.AddV<Person>(new Person() { Id = Guid.NewGuid(), FirstName = "Sean" }).FirstAsync();
var katiePerson = await querySource.AddV<Person>(new Person() { Id = Guid.NewGuid(), FirstName = "Katie" }).FirstAsync();
var madelinePerson = await querySource.AddV<Person>(new Person() { Id = Guid.NewGuid(), FirstName = "Madeline" }).FirstAsync();
var myungPerson = await querySource.AddV<Person>(new Person() { Id = Guid.NewGuid(), FirstName = "Myung" }).FirstAsync();
var chanPerson = await querySource.AddV<Person>(new Person() { Id = Guid.NewGuid(), FirstName = "Chan" }).FirstAsync();

Once you have these (and it would be easy to create these dynamically) you can set about relating them. As I said above, I tried to keep my relationships flowing in a single direction. I could allow a loopback if needed but I wanted to avoid creating bi-directional relationships. Not sure if this is a good practice or not yet.

await querySource
.V(ethanPerson.Id)
.AddE<ChildOf>(new ChildOf() { Id = Guid.NewGuid() })
.To(_ => _.V(clairePerson.Id))
.FirstAsync();
await querySource
.V(ethanPerson.Id)
.AddE<ChildOf>(new ChildOf() { Id = Guid.NewGuid() })
.To(_ => _.V(jasonPerson.Id))
.FirstAsync();
await querySource
.V(clairePerson.Id)
.AddE<MarriedTo>(new MarriedTo() { Id = Guid.NewGuid() })
.To(_ => _.V(jasonPerson.Id))
.FirstAsync();
await querySource
.V(clairePerson.Id)
.AddE<ChildOf>(new ChildOf() { Id = Guid.NewGuid() })
.To(_ => _.V(myungPerson.Id))
.FirstAsync();
await querySource
.V(clairePerson.Id)
.AddE<ChildOf>(new ChildOf() { Id = Guid.NewGuid() })
.To(_ => _.V(chanPerson.Id))
.FirstAsync();
await querySource
.V(jasonPerson.Id)
.AddE<ChildOf>(new ChildOf() { Id = Guid.NewGuid() })
.To(_ => _.V(brendaPerson.Id))
.FirstAsync();
await querySource
.V(jasonPerson.Id)
.AddE<ChildOf>(new ChildOf() { Id = Guid.NewGuid() })
.To(_ => _.V(stevenPerson.Id))
.FirstAsync();
await querySource
.V(seanPerson.Id)
.AddE<ChildOf>(new ChildOf() { Id = Guid.NewGuid() })
.To(_ => _.V(brendaPerson.Id))
.FirstAsync();
await querySource
.V(seanPerson.Id)
.AddE<ChildOf>(new ChildOf() { Id = Guid.NewGuid() })
.To(_ => _.V(stevenPerson.Id))
.FirstAsync();
await querySource
.V(jasonPerson.Id)
.AddE<SiblingOf>(new SiblingOf() { Id = Guid.NewGuid() })
.To(_ => _.V(seanPerson.Id))
.FirstAsync();
await querySource
.V(seanPerson.Id)
.AddE<MarriedTo>(new MarriedTo() { Id = Guid.NewGuid() })
.To(_ => _.V(katiePerson.Id))
.FirstAsync();
await querySource
.V(madelinePerson.Id)
.AddE<ChildOf>(new ChildOf() { Id = Guid.NewGuid() })
.To(_ => _.V(seanPerson.Id))
.FirstAsync();
await querySource
.V(madelinePerson.Id)
.AddE<ChildOf>(new ChildOf() { Id = Guid.NewGuid() })
.To(_ => _.V(katiePerson.Id))
.FirstAsync();

What I came to find while doing this is, it feels like a good idea to create a base type (I used Vertex and Edge) and then create derivations for specific object and relationship types. For example:

public abstract class Vertex
{
    public abstract string Label { get; }
}

public class Person : Vertex
{
    public override string Label => "Person";
    public Guid Id { get; set; }
    public string FirstName { get; set; }
    public string partitionKey => FirstName.Substring(0, 1);
}

public abstract class Edge
{
    public Guid Id { get; set; }
    public abstract string Label { get; }
}

public class SiblingOf : Edge
{
    public override string Label => "Sibling Of";
}

public class ChildOf : Edge
{
    public override string Label => "Child Of";
}

public class MarriedTo : Edge
{
    public override string Label => "Married To";
}

I am still trying to understand the best way to lay this out because, when I query the whole graph in Cosmos, all I get are Guids everywhere, when I would like some better identifying data shown.

Now the strength of the graph is that you can more easily traverse the data versus something like an RDBMS, where you would be writing cartesian queries or doing a lot of subquerying. To aid in this – and I was surprised it was not baked into the query language itself (it might be and I just have not found it yet) – I wrote a recursive traversal algorithm:

public static async Task<List<Person>> GetAncestors(Person vertex, IGremlinQuerySource querySource)
{
    // Find every Person this vertex has a ChildOf edge to (its direct parents)
    var ancestors = await querySource.V<Person>(vertex.Id)
        .Out<ChildOf>()
        .Cast<Person>()
        .ToArrayAsync();

    if (ancestors.Count() == 0)
    {
        return ancestors.ToList();
    }

    // Recurse into each parent to collect the rest of the ancestry
    var ancestorsReturn = new List<Person>(ancestors);
    foreach (var ancestor in ancestors)
    {
        ancestorsReturn.AddRange(await GetAncestors(ancestor, querySource));
    }

    return ancestorsReturn;
}

What this does is, given a specific object that we are looking for (all we need is the Id), it looks for edges of type ChildOf attached to the Person, indicating the person is a child of another node. The graph nature allows me to count the edges attached to my node and even traverse them like a typical B-tree or other advanced data type in computer science.
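
To put the helper to work, here is a hedged usage sketch, assuming the querySource and the Person vertices created earlier in the post:

// Hypothetical usage: collect everyone Madeline descends from (parents, grandparents, ...)
var ancestors = await GetAncestors(madelinePerson, querySource);

foreach (var ancestor in ancestors)
{
    Console.WriteLine(ancestor.FirstName);
}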

Closing thoughts… for now

This is a pretty shallow dip into Graph databases but I am very intrigued. Last year, while working with CSharpfritz on KlipTok, we struggled with this due to the desire to have relationships between the data. Ultimately, we got around it but, looking back, I realize our data was definitely graphable, as the site continues to add relationships and wants to do traversals and aggregate relationships. In this way, one definite conclusion is that Graph allows relationships to scale much more easily than in an RDBMS and certainly better than in NoSQL systems like Cassandra and Mongo.

I look forward to diving into this deeper and hope the above gives others reasonable insight into the approach and a good starting point.

Secure Configuration in Azure with Managed Identity

I decided to build a proof of concept for some colleagues here at Microsoft to demonstrate how, with minimal code, we could establish secure connections with services in Azure using User Managed Identities. Building the POC, I realized the information and lessons could be useful to others as well so, here it is.

Why User Managed Identities?

One of the initial questions people might ask is, why would I go with a User Managed Identity instead of a System Assigned Identity? The reason is, creating a single user managed identity for the application made more sense to me. System assigned could certainly work but, it does add a bit of complexity since, in that situation, the App Service has to be created first.

Further, User Managed Identities go with my general motivation to keep application components together.

The Process

Before we get into showing code, let’s walk through the steps. I wrote this POC with an eye on Infrastructure as Code (IaC) and so used Azure Bicep to deploy the resources. Here is the overall flow of this process:

  1. Deploy Application Identity
  2. Deploy SQL Server database (Server, Database, Administrator, Firewall)
  3. Deploy Key Vault w/ Secrets
  4. Deploy Azure App Configuration w/ Config Values
  5. Deploy Azure App Service w/ Plan

I heavily leveraged Bicep modules for each of these parts where I could, in some cases to overcome current limitations related to Bicep being so new. For example, I deployed the Identity using a module because identity creation takes time and Bicep has a tendency not to respect this wait period; the module forces it to. Here is the code:

// deploy.bicep
module identity 'modules/identity.bicep' = {
  name: 'identityDeploy'
  params: {
    name: 'id-pocapplication-${suffix}'
    location: location
  }
}

// identity.bicep
resource idApplication 'Microsoft.ManagedIdentity/userAssignedIdentities@2018-11-30' = {
  name: name
  location: location
}

As you can see, the module does not really do anything, resource-wise. But by placing it in a module we force dependsOn to actually work. I am not sure if this is a bug but, it underpins a lot of the decisions we make about what becomes a module in Bicep.

Let’s Deploy a Database

As we start to deploy Azure resources the first one we want to focus on is Azure SQL. Deploying it using a Serverless SKU is pretty straightforward (reference database.bicep). Where it gets tricky is when we want to enable MSI access into Azure SQL.

The first thing to realize here is that typical SQL users (even the admin) cannot create users linked to AD; only an AD user can. So, the first thing we need to do is link an AD account as the admin. This should NOT be the User Managed Identity we created for the application. Rather, it should be an actual AD user, perhaps the so-called “break glass” user – you need only provide the ObjectId of the user from Active Directory (you can find it on the User Detail screen in AAD).

We set the admin using the administrators resource:

resource administrator 'Microsoft.Sql/servers/administrators@2021-05-01-preview' = {
  name: 'ActiveDirectory'
  parent: dbServer
  properties: {
    administratorType: 'ActiveDirectory'
    login: 'appuser'
    sid: administratorPrincipalId
    tenantId: tenant().tenantId
  }
}

Note that the appuser name here is not relevant; it should be something that describes the user, but it can be whatever you want.

Once the database is created, log into the Query editor (or connect using SSMS) using that Active Directory user. Next, you need to run the following query to create the user represented by the MSI we created earlier:

DROP USER IF EXISTS [id-pocapplication-pojm2j]
GO
CREATE USER [id-pocapplication-pojm2j] FROM EXTERNAL PROVIDER;
GO
ALTER ROLE db_datareader ADD MEMBER [id-pocapplication-pojm2j];
ALTER ROLE db_datawriter ADD MEMBER [id-pocapplication-pojm2j];
GRANT EXECUTE TO [id-pocapplication-pojm2j]
GO

Make sure to replace the names here and ALSO note how permissions are defined for the MSI user. And remember, this script can ONLY be run by an AD user in SQL; you cannot log in as a regular SQL user, even the admin, and run this script.

Once this is run, we will be able to use MSI to access the database.

Let’s add a Key Vault

Key Vault is ubiquitous in modern applications as it is an ideal place to store sensitive values. I will often store my Storage Account Access Key values and Database Administrator password here. In addition, for this POC I want to store a “secure” value to show how it can be retrieved using the App Configuration service (which we will be deploying next).

Now, a funny thing about Bicep at its current stage of development: you cannot return, as an output, a key dictionary based off an array – we would often refer to this as a projection. Because of this, I do not recommend using a module for Key Vault. So, you will not see a keyvault.bicep file; instead you will see the key vault operations in the main deploy.bicep file.

Ideally, in a situation such as this we would pass an array of secrets into a module and then, as output, we could return a lookup object where the name of the secret provides us the URI of the secret. This is not currently possible, from an array, in Bicep.

The important secret I want to point out is sensitiveValue; we will retrieve this value using the App Configuration service next.

But for that access to occur we need to allow a lookup using our Managed Identity, which we do with Access Policies – access can be granted either with the built-in access policies or with Azure AD RBAC; I am choosing the built-in policies. The key thing to remember is that the policies are all or nothing. If you grant Get on secrets, the identity can get ANY secret in the vault. This is rarely desirable and thus we use the practice of key vault splitting to mitigate this in complex and shared application scenarios.

App Configuration Service

Despite having been around for almost 2 years at this point, the App Configuration Service is not a commonly used service, which is a shame. It has the ability to manage configuration, link to Key Vaults, and provide a web-centric way to manage Feature Flags. For me, it is a common and necessary service because it not only takes the config out of the App Service but also enables higher security.

If configuration values are left within the App Service, as Configuration, then we have to be extra diligent that we lock down roles for our users so the values cannot be seen. If we move the values to a different service, we can more easily enforce this sort of deny by default requirement.

App Configuration Service is separated into a template for the main service and then the subsequent values. The most interesting of these is the Key Vault reference, which is not anything super special but does use a different content type. Here is the value referring back to the sensitiveValue we created with the Key Vault:

// in deploy.bicep
module appConfig 'modules/app-config.bicep' = {
  name: 'appConfigDeploy'
  params: {
    name: 'appconfig-pocapplication-${suffix}'
    location: location
    configValues: [
      {
        name: 'searchAddress'
        value: 'https://www.bing.com'
        contentType: 'text/plain'
      }
      {
        name: 'sensitive-value'
        contentType: 'application/vnd.microsoft.appconfig.keyvaultref+json;charset=utf-8'
        value: '{ "uri": "${secret.properties.secretUri}" }'
      }
      {
        name: 'connection-string'
        contentType: 'text/plain'
        value: 'Server=tcp:${database.outputs.serverFqdn};Authentication=Active Directory Managed Identity; User Id=${identity.outputs.principalId}; Database=${database.outputs.databaseName};'
      }
    ]
    applicationIdentityPrincipalId: identity.outputs.principalId
  }
  dependsOn: [
    identity
  ]
}

// in the module
resource keyValues 'Microsoft.AppConfiguration/configurationStores/keyValues@2021-03-01-preview' = [for config in configValues: {
  parent: appConfig
  name: config.name
  properties: {
    contentType: config.contentType
    value: config.value
  }
}]

Here we use a module and pass an array of values for creation, using the configurationStores/keyValues resource. Among these is a value which specifies a JSON-based content type. This JSON is an understood format which allows the provider's built-in SecretClient support to look up secrets.

One important point that may not be immediately obvious is that the lookup of the secret is still done by the invoking client. This means the “host” of the application (App Service in our case) will use its identity to contact Key Vault once it receives the reference from the App Configuration service.

Deploy the App Service

The final bit of infrastructure we need to deploy is our App Service + App Service Plan combination, and part of this involves assigning our application identity to the App Service via this definition:

resource app 'Microsoft.Web/sites@2021-02-01' = {
  name: 'app-${baseName}'
  location: location
  identity: {
    type: 'UserAssigned'
    userAssignedIdentities: {
      '${applicationIdentityResourceId}': {}
    }
  }
  properties: {
    serverFarmId: plan.id
    httpsOnly: false
    siteConfig: {
      linuxFxVersion: 'DOTNETCORE|3.1'
      alwaysOn: true
    }
  }
}

Note the identity block above and how we make the assignment. We are able to assign as many user assigned identities to our App Service as we like. This gives a TON of flexibility in terms of how we limit blast radius in more secure applications. Combining this feature with the Azure.Identity library in our applications enables high security scenarios with minimal code, as we will see with our application.

The final bit here is that I am using .NET Core 3.1, which is the last LTS release before .NET 6. The reason for its usage is the circumstances of the POC: it is designed to showcase this approach with a .NET Core 3.1 application – the same techniques apply to .NET 5 and beyond.

Connect application to App Configuration Service

What I will be summarizing here is this tutorial: https://docs.microsoft.com/en-us/azure/azure-app-configuration/quickstart-aspnet-core-app?tabs=core5x

The App Configuration Service is connected at application start and injects its values into the IConfiguration dependency you should be using today – worth noting, it overwrites local values and assumes the values coming from the Configuration Service are valid. Here is what my connection looks like:

public static IHostBuilder CreateHostBuilder(string[] args) =>
    Host.CreateDefaultBuilder(args)
        .ConfigureWebHostDefaults(webBuilder =>
        {
            webBuilder.ConfigureAppConfiguration(config =>
            {
                config.AddAzureAppConfiguration(appConfig =>
                {
                    var credential = new ManagedIdentityCredential("93e93c24-c457-4dea-bee3-218b29ae32f1");
                    appConfig.Connect(new Uri(AppConfigUrl), credential)
                        .ConfigureKeyVault(kvConfig =>
                        {
                            kvConfig.SetCredential(credential);
                        });
                });
            }).UseStartup<Startup>();
        });

The key here is the use of the ManagedIdentityCredential type, which comes from the Azure.Identity NuGet package. Most services in Azure have integrations with this library, allowing the process of getting tokens from the Azure Instance Metadata Service (IMDS) to be transparent and handled under the covers.

The other aspect to note here is what we pass to this class, which is the ClientId of the Service Principal (in this case created through the User Managed Identity) within Azure Active Directory. If you recall, when we assigned the identity to the App Service we used the Resource Id of the User Managed Identity. This is what allows an App Service to have many identities associated with it; you can use any one you wish so long as it is assigned to the App Service (or whatever service you are using).

Notice also that we call the follow-on method ConfigureKeyVault and pass it the same credential. Truthfully, we could specify a different identity here if we so chose; in this case I am not. But it is the credential the underlying SecretClient (or KeyClient) will use when communicating with Key Vault when a Key Vault reference is discovered (this is the purpose of the content type setting).
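
From the application's point of view all of this is transparent. As a quick sketch (the controller below is hypothetical and not part of the POC, but the key names match the Bicep above), reading the values is just a normal IConfiguration lookup and the Key Vault reference has already been resolved to the secret value:

using Microsoft.AspNetCore.Mvc;
using Microsoft.Extensions.Configuration;

// hypothetical controller, for illustration only
public class SettingsController : Controller
{
    private readonly IConfiguration _configuration;

    public SettingsController(IConfiguration configuration)
    {
        _configuration = configuration;
    }

    public IActionResult Index()
    {
        // plain value stored directly in App Configuration
        var searchAddress = _configuration["searchAddress"];

        // Key Vault reference - already resolved to the secret value by the provider,
        // using the credential we passed to ConfigureKeyVault
        var sensitiveValue = _configuration["sensitive-value"];

        return Ok(new { searchAddress, hasSensitiveValue = !string.IsNullOrEmpty(sensitiveValue) });
    }
}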

Azure App Configuration also supports the use of labels, which can be used to differentiate environments or run conditions. I don't often recommend using labels for environments, as it can lead to complexity and mess; plus, you rarely want all of your values in the same service, since if access is breached all values can potentially be read. Your best bet is to use separate App Config Service instances for each environment. Within those you can support special cases with labels.
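
If you do use labels for those special cases, the configuration provider lets you select them explicitly when connecting. A minimal sketch (the "blue-slot" label is just an example of mine; KeyFilter and LabelFilter come from the Microsoft.Extensions.Configuration.AzureAppConfiguration namespace):

config.AddAzureAppConfiguration(appConfig =>
{
    appConfig.Connect(new Uri(AppConfigUrl), credential)
        // load unlabeled values first, then overlay the special-case label on top
        .Select(KeyFilter.Any, LabelFilter.Null)
        .Select(KeyFilter.Any, "blue-slot")
        .ConfigureKeyVault(kvConfig => kvConfig.SetCredential(credential));
});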

Let’s Access the Database

Our final task here was to access the database with MSI. This is not a supported case with Azure.Identity but, thankfully, it is so simple you don't need it, as shown here: https://blog.johnnyreilly.com/2021/03/10/managed-identity-azure-sql-entity-framework/#connection-string-alone

Yes, you can literally do this with JUST the connection string; I used the fourth example to make it work, and I only needed to provide the Object/Principal Id of my identity. Obviously, you want to be careful about sharing this since, without networking rules, anyone with these details could technically attempt to access your database server.
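
As a sketch of what that looks like in code (assuming Microsoft.Data.SqlClient 2.1 or later; the server, database, and identity id values are placeholders for your own), the connection string really is all there is to it:

using System;
using System.Threading.Tasks;
using Microsoft.Data.SqlClient;

public static class MsiSqlCheck
{
    public static async Task QueryAsync()
    {
        // placeholder values - substitute your server FQDN, database name,
        // and the id of the user managed identity assigned to the App Service
        var connectionString =
            "Server=tcp:sql-pocapplication.database.windows.net,1433;" +
            "Authentication=Active Directory Managed Identity;" +
            "User Id=<identity-id>;" +
            "Database=sqldb-pocapplication;";

        using var connection = new SqlConnection(connectionString);
        await connection.OpenAsync();   // the driver acquires the AAD token for the identity

        using var command = new SqlCommand("SELECT COUNT(*) FROM sys.tables", connection);
        Console.WriteLine(await command.ExecuteScalarAsync());
    }
}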

Lessons Learned

Bicep is a wonderful language and, as someone who distanced himself from ARM templates due to their difficulty and favored Terraform, I am very excited about what the future holds. That said, and I have mentioned this before, even to colleagues at Microsoft, Bicep is great until you get into more complex and nuanced cases – I am working on some shareable modules and have had difficulty, especially with arrays and outputs.

I wanted to prove that you could do the entire process using Bicep and Infrastructure as Code and I was able to accomplish that. Part of what made this easier was the use of a User Managed Identity which allowed tighter control over the identity.

Source Code: https://github.com/xximjasonxx/SecureConfigPoc

Understanding Infrastructure as Code (and DevOps)

The rise of IaC as a model for teams aligns very naturally with the high adoption of the cloud for deploying applications. The ability to provision infrastructure on-demand is essential but, moreover, it allows teams to define the services and infrastructure their applications use within the same VCS (Version Control System) where their application code resides. Effectively, this allows the team to see the application not just as the source code but as inclusive of the underlying supporting services which allow the application code to work.

Difference from Configuration as Code

Infrastructure as Code is a rather broad term. The modern definition is, as I stated above, more aligned with provisioning infrastructure from scripts on-demand. Configuration as Code is more aligned with on-premises deployments where infrastructure cannot be created on-demand. In these environments the fleet is static and so it is the configuration of the relevant members of the fleet which is important.

It is with Configuration as Code where you commonly see tools like Octopus Deploy (link), Ansible (link), Chef (link), Puppet (link), and others. This is not to say these tools CANNOT spin up infrastructure on-demand, they most certainly can, but it is not their main use case.

In both approaches, and with IaC in general, the central idea is that your application is not merely its code, but also the infrastructure (or configuration of certain infrastructure) which need to be persisted.

Why is it important?

Regardless of the flavor, IaC is vitally important to the modern application development team. Why? Well let me ask you a question: Where is the best place to test an application?

The only viable answer here is: Production. But wait, testing in production is risky and can cause all sorts of problems!! You are clearly crazy, Jason!!

Yes, but let me amend the answer: the best place to test is an environment that is Production-like. This answer is the reason we have things like Docker, Kubernetes, and IaC. We need the environments we develop in and run tests in to be as close to production as possible.

Now, I don’t mean that your developers should have an environment that has the performance specs or disaster recovery features of Production but, from a configuration standpoint, it should be identical. And above all, the team MUST HAVE the same configurations in development as they do in Production. That means, if developers are planning to deploy their application to .NET 5.0.301 on Windows Server 2019, ideally their development environments should be using .NET 5.0.301 on Windows Server 2019 – or at least, when the application is deployed for testing, that environment should be running Windows Server 2019.

Mitigating Drift

The principal goal of placing infrastructure or configuration into VCS as code is to ensure consistency. This aids in guaranteeing that environments are configured in the way that is expected. There is nothing worse than having to find a flag or setting that someone (who is no longer around) applied three years ago when setting up a second server, while trying to figure out why “it doesn't work on that machine”.

With proper IaC control, we ensure that EVERY configuration and service is under source control so we can quickly get an accurate understanding of the services involved in supporting an application and the configuration of those services. And the more consistent we are, the lower the chance that a difference in environments allows a bug to manifest in production which cannot be reproduced in any other environment.

Production is still Production

All this being said, it is important to understand that Production is still Production. That means, there will be additional safeguards in place to ensure proper function in case of disaster and the specs are generally higher. The aim of IaC is NOT to say you should run a Premium App Service Plan at the cost of thousands of dollars per month in Development. The aim is to ensure you are aiming for the same target.

That said, one of the other benefits of IaC is the ability to spin up ephemeral environments to perform testing with production-style specs – this can include variants of chaos testing (link). This is usually done ahead of a production release. IaC is vital here as it allows the creation of said environment easily and guarantees an EXACT replica of production. Another alternative is blue/green deployments, which conform to the sort of shift-right testing (link) that IaC enables.

Understanding Operating Models

As you begin the IaC journey it is important to have an understanding of the sort of operating models which go along with it. This helps you understand how changes to various parts of your infrastructure should be handled; this is often the toughest concept for those just getting started to grasp.

Shared vs Bespoke Infrastructure

In many cases, there might be infrastructure which is shared for all applications and then infrastructure which is bespoke for each application. This understanding and differentiation is core to selecting the right operating model. It also underpins how IaC scripts are broken apart and how frequently each is run. As an example, when adopting a Hub and Spoke deployment model, the code which builds the hub and the spoke connection points is run FAR less frequently than the code which builds the services applications in the spoke rely upon.

Once you understand this differentiation and separation you can choose the operating model. Typically there are considered to be three operating models:

  • ManualOps – this is where the IaC scripts in question are run manually, often by an operations team member. The scripts are provided either by the application team or by a central operations team. This approach is commonly used when organizations and teams are just getting started with IaC and may not have the time or knowledge to work infrastructure updates into automated pipelines
  • GitOps – coined by Weaveworks (link), this model centers on kicking off infrastructure updates via operations in Git, usually a merge action. While not necessarily driven by a Continuous Integration (CI) process, that is the most common approach. The key to operating with this model is to ensure ALL changes to infrastructure are performed via an update to source control, thereby guaranteeing what is in source represents what is deployed.
  • NoOps – NoOps is a derivation of GitOps which emphasizes a lack of operations involvement per se. Instead of running scripts based on a Git operation or manually, they are ALWAYS run with each check-in. Through this, application teams take over the ownership of their operations responsibilities. This is the quintessential model for teams operating in a DevOps-centric culture.

Which operating model you select is impacted, again, by the nature of the infrastructure being supported, but also by your team's maturity. DevOps and IaC are a journey, not a destination. Not all teams progress (or need to progress) to the same destination.

Centralized Control for Decentralized Teams

In DevOps, and DevSecOps, the first question is how to involve the necessary disciplines in the application development process such that no specific concern is omitted or delayed – security often gets the short end of the stick. I cannot tell you how many projects I have seen save their security audit for near the end of the project. Rarely does that audit not yield issues and, depending on the timeline and criticality, some organizations ignore the results and recommendations of these audits at their own peril.

I can recall a project for a healthcare client that I was party to years ago. The project did not go well and encountered many problems throughout its development. As a result, the security audit was pushed to the end of the project. When it happened, the auditor noted that the application did not encrypt sensitive data and was not compliant with many HIPAA regulations. The team took the feedback and concluded it would take 2-3 months to address the problems.

Given where the project was and the relationship with the client, we were told to deliver the application as is. The result was disastrous: the client ended up suing our company. The outcome was not disclosed, but it just goes to show that security must be checked early and often.

DevOps, and DevSecOps, approach this in a couple of key ways:

  1. The use of the Liaison model (popularized by Google) in which the key areas of Infrastructure, Quality, Security, and Operations delegate a representative who is part time on projects to ensure teams have access to the resources and knowledge needed to carry out tasks.
  2. Creation of infrastructure resources is done through shared libraries which are “blessed” by teams to ensure that certain common features are created.

IaC can help teams shore up #2. Imagine if, each time a team wanted to create a VM, they had to use a specific module that limits what OS images they can use, ensures certain ports are closed, and installs standard monitoring and security settings on the machine. This brings about consistency while still allowing teams to self-service as needed. For operations, the module could require a tag on the created instances so operations can track them centrally. The possibilities are endless.

This is what is meant by “centralized control for decentralized teams”. Teams can even work with Infrastructure, Operations, and Security to make changes to these libraries in controlled ways. This lets organizations maintain central control while still allowing the decentralization necessary for teams to operate efficiently.

Using Modules with Terraform

Most IaC tools (if not all) support this modularization concept to some degree, and Terraform (link) is no exception. The use of modules can ensure that the services teams deploy conform to certain specifications. Further, since Terraform modules are simply directories containing code files, they can easily be zipped and deployed to an “artifact” server (Azure Artifacts or GitHub Packages, to name a couple) where other teams can download the latest version or a specific version.

Let’s take a look at what a script that uses modules can look like. This is an example application that leverages Private Endpoint to ensure traffic from the Azure App Service to the Azure Blob Storage Container never leaves the VNet. Further, it uses an MSI (Managed Service Identity) with RBAC (Role Based Access Control) to grant specific rights on the target container to the Identity representing the App Service. This is a typical approach to building secure applications in Azure.

# create the resource group
resource "azurerm_resource_group" "this" {
  name = "rg-secureapp2"
  location = "eastus2"
}

# create random string generator
resource "random_string" "this" {
  length = 4
  special = false
  upper = false
  number = true
}

locals {
  resource_base_name = "secureapp${random_string.this.result}"
  allowed_ips = var.my_ip == null ? [] : [ var.my_ip ]
}

# create the private vnet
module "vnet" {
  source = "./modules/virtualnetwork"
  depends_on = [
    azurerm_resource_group.this
  ]

  network_name = "secureapp2"
  resource_group_name = azurerm_resource_group.this.name
  resource_group_location = azurerm_resource_group.this.location
  address_space = [ "10.1.0.0/16" ]
  subnets = {
    storage_subnet = {
      name = "storage"
      address_prefix = "10.1.1.0/24"
      allow_private_endpoint_policy = true
      service_endpoints = [ "Microsoft.Storage" ]
    }
    apps_subnet = {
      name = "apps"
      address_prefix = "10.1.2.0/24"
      delegations = {
        appservice = {
          name = "appservice-delegation"
          service_delegations = {
            webfarm = {
              name = "Microsoft.Web/serverFarms"
              actions = [
                "Microsoft.Network/virtualNetworks/subnets/action"
              ]
            }
          }
        }
      }
    }
  }
}

# create storage account
module "storage" {
  source = "./modules/storage"
  depends_on = [
    module.vnet
  ]

  resource_group_name = azurerm_resource_group.this.name
  resource_group_location = azurerm_resource_group.this.location
  storage_account_name = local.resource_base_name
  container_name = "pictures"
  vnet_id = module.vnet.vnet_id
  allowed_ips = local.allowed_ips
  private_endpoints = {
    pe = {
      name = "pe-${local.resource_base_name}"
      subnet_id = module.vnet.subnets["storage"]
      subresource_names = [ "blob" ]
    }
  }
}

# create app service
module "appservice" {
  source = "./modules/appservice"
  depends_on = [
    module.storage
  ]

  resource_group_name = azurerm_resource_group.this.name
  resource_group_location = azurerm_resource_group.this.location
  appservice_name = local.resource_base_name
  storage_account_endpoint = module.storage.container_endpoint
  private_connections = {
    pc = {
      subnet_id = module.vnet.subnets["apps"]
    }
  }
}

# assign the identity to the storage account roles
resource "azurerm_role_assignment" "this" {
  scope = module.storage.storage_account_container_id
  role_definition_name = "Storage Blob Data Contributor"
  principal_id = module.appservice.appservice_identity_id
  depends_on = [
    module.appservice,
    module.storage
  ]
}

For this particular script, the modules are all defined locally so I am not downloading them from a central store but, doing so would be trivial. Terraform modules also give the ability to hide certain bits of logic from the callers. For example, there are a variety of rules which must be followed when setting up a Private Endpoint for an Azure Storage account (creation of a DNS zone, usage of the correct private IP, specific names which must be used), all of which can be encapsulated within the module.

There are even validation rules which can be written for module input parameters which, again, allow Infrastructure or Security to enforce their core concerns on the teams using the modules. This is the power of IaC in large organizations. It is not an easy level to achieve but, achieving it helps teams gain efficiencies which were previously difficult, if not impossible, to achieve.

Full source code is available here: https://github.com/jfarrell-examples/SecureAppTF

Lessons to Learn

DevOps is always an interesting conversation with clients. Many managers and organizations are looking for a path to get from Point A to Point B. Sadly, DevOps does not work that way. In many ways, as I often have to explain, it's a journey, not a destination. The way DevOps is embraced will change from team to team, organization to organization, person to person.

One of the key mistakes I see clients happen upon is “too much, too soon”. Many elements of DevOps take a good amount of time and pivoting to get used to. Infrastructure as Code is one such element that can take an especially long time (this is not to imply that DevOps and IaC must go together; IaC is a component of DevOps, yes, but it can equally stand on its own).

It is important, with any transformation, to start small. I have seen organizations foist upon their teams a sort of mandate to “codify everything” or “automate everything”. While well intentioned, this advice comes from a misunderstanding of DevOps as a “thing” rather than a culture.

Obviously, advice on DevOps transformations is out of scope for this post and is unique to each client situation. But, it is important to be committed to a long term plan. Embracing IaC (and DevOps) is not something that happens overnight and there are conversations that need to take place and, as with any new thing, you will need to address the political ramifications of changing the way work is done – be prepared for resistance.

Securing Configuration with Key Vault

In my previous post (here), I talked about the need to consider security when you build your application and focused mainly on securing network traffic. In keeping with a focus on DevOps, we took an Infrastructure as Code (IaC) approach which used Terraform to represent infrastructure in a script. But, as someone pointed out to me privately, I only covered a part of security, and not even the bit which generally leads to more security flaws.

The same codebase as in the aforementioned post is used: jfarrell-examples/SecureApp (github.com)

Configuration Leaking

While securing network access and communication direction is vital, the more likely avenue for an attack tends to be an attacker finding your sensitive values in source code or in an Azure configuration section. In the example for Private Endpoints, I stored the connection string for the Storage Account and the Access Key for the Event Grid in the Azure App Service Configuration section.

This is not entirely bad, as Configuration can be secured with RBAC to keep the values visible only to certain persons. However, this is still not advised as you would not be following a “defense in depth” mentality. Defense in Depth calls for us to never rely on ONE mechanism for security but rather force a would-be attacker to conquer multiple layers. For web applications (and most Azure apps) the de facto standard for securing values is Azure Key Vault (Key Vault | Microsoft Azure).

By using Azure Key Vault with Managed Identity, you can omit your sensitive values from the configuration section and use a defined identity for the calling service, granting only the access the service needs to request its values. Doing so lessens the chance that you will leak configuration values to users with nefarious intentions.

Understanding Managed Identity

Most services in Azure can make use of managed identities in one of two flavors:

  • System Assigned Identity – this is an identity that will be managed by Azure. It can only be assigned to a SINGLE resource
  • User Managed Identity – this is an identity created and managed by a user. It can be assigned to as many resources as needed. It is ideal for situations involving things like scale-sets where new resources are created and need to use the same identity.

For this example we will use a System Assigned Identity since the Azure App Service does not span multiple resources within a single region; Azure performs some magic behind the scenes to maintain the same identity for the server farm machines which support the App Service as it scales.

The identity of the service effectively represents a user or, more accurately, a service principal. This service principal has an object_id that we can use in a Key Vault access policy. These policies, separate from RBAC settings, dictate what a specific identity can do with that key vault.

Policies are NOT specific to certain secrets, keys, or certificates. If you grant the Get secret permission to an identity, it allows that identity to read ANY secret in the vault. This is not always advisable. To improve your security posture, create multiple key vaults to segment access to secret, key, and certificate values.

We can use Terraform to create the Key Vault and an App Service with an identity, and make the association. This is key because doing so allows us to create these secret values through IaC scripts versus relying on engineers to do it manually.

Create the App Service with a System Identity

Here is code for the App Service, note the identity block:

# create the app service
resource "azurerm_app_service" "this" {
  name = "app-${var.name}ym05"
  resource_group_name = var.rg_name
  location = var.rg_location
  app_service_plan_id = azurerm_app_service_plan.this.id

  site_config {
    dotnet_framework_version = "v5.0"
  }

  app_settings = {
    "WEBSITE_DNS_SERVER" = "168.63.129.16"
    "WEBSITE_VNET_ROUTE_ALL" = "1"
    "WEBSITE_RUN_FROM_PACKAGE" = "1"
    "EventGridEndpoint" = var.eventgrid_endpoint
    "KeyVaultEndpoint" = var.keyvault_endpoint
  }

  identity {
    type = "SystemAssigned"
  }
}

# outputs
output "system_id" {
  value = azurerm_app_service.this.identity.0.principal_id
}

The change from the previous version of this module is that the StorageAccountConnectionString and EventGridAccessKey are no longer present. We only provide the endpoints for our Key Vault and Event Grid; the sensitive values are held in Key Vault and accessed using the App Service's Managed Identity.

Setup the Key Vault

First, I want to show you the creation block for Key Vault, here:

terraform {
  required_providers {
    azurerm = {
      source = "hashicorp/azurerm"
      version = "=2.62.1"
    }
  }
}

variable "name" {
  type = string
}

variable "rg_name" {
  type = string
}

variable "rg_location" {
  type = string
}

variable "tenant_id" {
  type = string
}

variable "secrets" {
  type = map
}

# get current user
data "azurerm_client_config" "current" {}

# create the resource
resource "azurerm_key_vault" "this" {
  name = "kv-${var.name}"
  resource_group_name = var.rg_name
  location = var.rg_location
  tenant_id = var.tenant_id
  sku_name = "standard"

  # define an access policy for terraform connection
  access_policy {
    tenant_id = var.tenant_id
    object_id = data.azurerm_client_config.current.object_id
    secret_permissions = [ "Get", "Set", "List" ]
  }
}

# add the secrets
resource "azurerm_key_vault_secret" "this" {
  for_each = var.secrets
  name = each.key
  value = each.value
  key_vault_id = azurerm_key_vault.this.id
}

# outputs
output "key_vault_endpoint" {
  value = azurerm_key_vault.this.vault_uri
}

output "key_vault_id" {
  value = azurerm_key_vault.this.id
}

The important thing to point out here is the definition of access_policy in this module. This is not the access being given to our App Service; it is instead a policy to allow Terraform to update Key Vault (the actual permissions are provided as parameters).

The outputs here are the Key Vault URI (for use as a configuration setting on the App Service) and the Key Vault resource id (used when creating further access policies).

Creation of this Key Vault MUST precede the creation of the App Service BUT, we cannot create the App Service access policy until the App Service is created, because we need the identity's object_id (see above).

Here is the access policy that gets created to allow the Managed Identity representing the App Service to Get secrets from the Key Vault:

terraform {
  required_providers {
    azurerm = {
      source = "hashicorp/azurerm"
      version = "=2.62.1"
    }
  }
}

variable "key_vault_id" {
  type = string
}

variable "tenant_id" {
  type = string
}

variable "object_id" {
  type = string
}

variable "secret_permissions" {
  type = list(string)
  default = []
}

variable "key_permissions" {
  type = list(string)
  default = []
}

variable "certificate_permissions" {
  type = list(string)
  default = []
}

# create resource
resource "azurerm_key_vault_access_policy" "this" {
  key_vault_id = var.key_vault_id
  tenant_id = var.tenant_id
  object_id = var.object_id
  key_permissions = var.key_permissions
  secret_permissions = var.secret_permissions
  certificate_permissions = var.certificate_permissions
}

This policy does need to include the List permission on secrets so that the configuration provider can bring all secrets into the configuration context for our App Service. Here is the .NET Core code to bring the Key Vault secrets into the IConfiguration instance for the web app:

public class Program
{
    public static void Main(string[] args)
    {
        CreateHostBuilder(args).Build().Run();
    }

    public static IHostBuilder CreateHostBuilder(string[] args) =>
        Host.CreateDefaultBuilder(args)
            .ConfigureAppConfiguration((ctx, config) =>
            {
                var builtConfig = config.Build();
                var keyVaultEndpoint = builtConfig["KeyVaultEndpoint"];
                var secretClient = new SecretClient(
                    new Uri(keyVaultEndpoint),
                    new Azure.Identity.ManagedIdentityCredential());
                config.AddAzureKeyVault(secretClient, new KeyVaultSecretManager());
            })
            .ConfigureWebHostDefaults(builder => builder.UseStartup<Startup>());
}

The result here is that the web app will use the Managed Identity of the Azure App Service to communicate with our Key Vault at the given endpoint and bring our sensitive values into the web app. This gives us a solid amount of security and diminishes the chances that configuration values leak into places where they can be exposed.

Make it more secure

One issue with the above approach is that it requires some fudging, because local development will NOT have a managed identity. Instead, developers will need to use something else, such as an InteractiveBrowserCredential or ClientSecretCredential (available in the Azure.Identity NuGet package). These are fine but, aside from requiring a person to authenticate with Azure when they run the app or finding a way to ensure sensitive client authorization values do NOT leak into source, it is a bit onerous.
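
One way to smooth this over, as a sketch (adjusting the ConfigureAppConfiguration block shown earlier), is to pick the credential based on environment – DefaultAzureCredential will try environment variables, Visual Studio, the Azure CLI, and so on when running locally:

.ConfigureAppConfiguration((ctx, config) =>
{
    var builtConfig = config.Build();
    var keyVaultEndpoint = builtConfig["KeyVaultEndpoint"];

    // locally, fall back to developer credentials; in Azure, keep the managed identity
    Azure.Core.TokenCredential credential = ctx.HostingEnvironment.IsDevelopment()
        ? new Azure.Identity.DefaultAzureCredential()
        : new Azure.Identity.ManagedIdentityCredential();

    var secretClient = new SecretClient(new Uri(keyVaultEndpoint), credential);
    config.AddAzureKeyVault(secretClient, new KeyVaultSecretManager());
})

In fact, DefaultAzureCredential also tries managed identity as part of its chain, so in many cases you could use it everywhere; the explicit split above just makes the intent clearer.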

The way to make our approach more secure is to introduce Azure App Configuration which can integrate with Key Vault in much the same way App Service does. The added benefit is Azure App Configuration can replace your local configuration in Azure and offers numerous features to aid in the management of these values across environments.

Unfortunately, at the time of this writing, Terraform does NOT support managing the keys within Azure App Configuration. Still, while it is not perfectly secure, just using Key Vault is usually an improvement over the existing sensitive data management techniques I typically see organizations and teams using.