Infrastructure as Excavations



  • As much as I consider “infrastructure-as-code” a good idea, in the end it does not fully solve the problem for me. So I want to discuss how to cover the rest.

    Infrastructure as code means there are some definitions, treated as code in the sense that they are checked into a version control system and reviewed with pull requests or some such, that describe what infrastructure should be set up. And then some tool creates the infrastructure from a CD pipeline.

    Which is great for describing the components of the infrastructure. If a solution requires starting those three services, creating a database, a message queue and storage, and putting a reverse proxy in front of it all that maps URLs this way, I can define it using a Helm chart or kustomization plus Azure Resource Manager templates or a Terraform module, and it does make the deployment reproducible.

    But it still does not really cover the operations side. Because whatever is in git does not fully describe the environment. Only the platform's own database does. So

    • I can have all the definitions, but it does not really imply that's what is actually running. What if it wasn't applied, or failed to apply, or something was changed via other means? It is always the platform's database that is the source of truth, not the version control; and I don't like having a poorly synchronized copy.
    • There is a bootstrap issue. Some resources always have to be created by hand first, some permissions assigned and such. The best I can think of is having the initial setup documented, and unfortunately it rarely is.
    • If I want to ensure the version controlled manifests actually match the environment, I have to only deploy with the tool, but then I can't use any other tools that create the resources. But the platform native tools are useful, especially when still testing it.
    • Also everybody else must go through the same process, which the old farts in an already running project won't want to do.

    So what I would actually want is something that would keep track of changes in the actual environment, preferably requiring some description or task or something.

    Except most platforms (I work mostly with kubernetes and azure) don't even provide descriptions on the resources. So soon, or at the latest after the operator changes, there will be resources in the environment that nobody knows what they are for or whether they are even needed, but nobody dares to touch them, because some resources are production (and some are testing and some are experiments).

    Made worse by the fact that there are no explicit links between most resources. A service here or there just gets a bunch of connection strings and those point to the various databases and event queues and hunting all of them down will be quite an undertaking. And they'll have to be rotated at some point, which is the moment I am dreading the most.

    Any ideas that would help me sort out what there is and make sure it is well documented for any future updates?


  • ♿ (Parody)

    I can tell you some of what we do. But note that I'm not really involved in this so it's just from hearing stuff.

    • Extensive documentation: Confluence, operations manual, security, etc. This is supposed to cover everything we have and the way we do everything.
    • Deployments get a hash that we store so we can later verify what's actually deployed.
    • Every deployment, change, etc, has a jira ticket where the customer authorizes the action. Then we update the documentation.
    • Automated system scans that note when a file changed (forget the name of the software our customer uses).

    And the customer has auditors continually crawling up our asses making sure we actually do all that stuff. Unfortunately, I suspect this is actually needed to really attain what you're looking for because otherwise people will get sloppy and skip steps, especially the documentation parts.

    We've done a lot of work to automate everything we can so that we don't skip steps, but ultimately most of the documentation relies on humans to write it and other humans to look at it and enforce it.


  • And then the murders began.

    @Bulb said in Infrastructure as Excavations:

    • I can have all the definitions, but it does not really imply that's what is actually running. What if it wasn't applied, or failed to apply, or something was changed via other means? It is always the platform's database that is the source of truth, not the version control; and I don't like having a poorly synchronized copy.

    Maybe you need an IaC tool like Terraform, then. Rerun terraform plan, and if there have been no changes in the infrastructure, it should return an empty plan. If someone's been tinkering, it'll return a plan to put things to rightness.
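
    Roughly, if you want it scriptable (exact flags depend a bit on your Terraform version):

      # Compare the real infrastructure against the configuration.
      # -detailed-exitcode makes drift machine-readable:
      #   0 = no changes, 1 = error, 2 = something differs.
      terraform plan -detailed-exitcode

      # Newer versions can also report pure drift without proposing changes:
      terraform plan -refresh-only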

    • There is a bootstrap issue. Some resources always have to be created by hand first, some permissions assigned and such.

    Not in my experience... Though I don't work with Kubernetes, just other Azure stuff.

    • If I want to ensure the version controlled manifests actually match the environment, I have to only deploy with the tool, but then I can't use any other tools that create the resources. But the platform native tools are useful, especially when still testing it.

    I want reproducibility, so I work with ARM templates from the start as much as possible. If for some reason I absolutely can't, I'll create by hand, make my ARM template, and then throw away the resource to validate that it does indeed work to set it from scratch.

    Except most platforms (I work mostly with kubernetes and azure) don't even provide descriptions on the resources.

    Azure has tags for this.

    Made worse by the fact that there are no explicit links between most resources. A service here or there just gets a bunch of connection strings and those point to the various databases and event queues and hunting all of them down will be quite an undertaking. And they'll have to be rotated at some point, which is the moment I am dreading the most.

    I can't speak for event queues, but for databases in Azure you should be using managed identity, so there's no need to rotate anything.
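
    The wiring is roughly this (names made up; Azure SQL additionally wants a contained user created for the identity, most other services just take an RBAC role assignment):

      # Give the app a system-assigned managed identity.
      az webapp identity assign --name my-app --resource-group my-rg

      # Grant that identity access to the dependency; nothing to rotate afterwards.
      az role assignment create \
        --assignee <principal-id-from-previous-step> \
        --role "Storage Blob Data Contributor" \
        --scope "/subscriptions/<sub>/resourceGroups/my-rg/providers/Microsoft.Storage/storageAccounts/mydata"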



  • @Unperverted-Vixen said in Infrastructure as Excavations:

    @Bulb said in Infrastructure as Excavations:

    • I can have all the definitions, but it does not really imply that's what is actually running. What if it wasn't applied, or failed to apply, or something was changed via other means? It is always the platform's database that is the source of truth, not the version control; and I don't like having a poorly synchronized copy.

    Maybe you need an IaC tool like Terraform, then. Rerun terraform plan, and if there have been no changes in the infrastructure, it should return an empty plan. If someone's been tinkering, it'll return a plan to put things to rightness.

    Terraform is what I am currently testing. The problem with it is that it only cares about the resources it created (or imported). So if someone changes a property, it will appear. But if someone adds permissions (each role assignment is a separate object), adds new secrets (connection strings) into a keyvault (also separate objects) etc., terraform will not notice.

    But those two cases I pretty much do care about, because tracking where a connection string is important as it will need to be rotated. And permissions (for managed identities mainly) indicate dependencies.
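
    The best workaround I have so far is to co-opt such strays into the state once I stumble on them, roughly like this (IDs made up, and each one needs a matching resource block in the config first):

      # Pull a hand-made role assignment under terraform's management,
      # so future plans notice when it changes or disappears.
      terraform import azurerm_role_assignment.backend_reader \
        "/subscriptions/<sub>/providers/Microsoft.Authorization/roleAssignments/<assignment-guid>"

      # Same idea for a secret somebody dropped straight into the key vault.
      terraform import azurerm_key_vault_secret.db_connstring \
        "https://my-vault.vault.azure.net/secrets/db-connstring/<version>"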

    • There is a bootstrap issue. Some resources always have to be created by hand first, some permissions assigned and such.

    Not in my experience... Though I don't work with Kubernetes, just other Azure stuff.

    You have to set up terraform, git and some pipelines or maybe Atlantis before you can build the rest with terraform. So some plain old documentation is still required.
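
    In Azure the irreducible by-hand part is roughly this, done once per environment (names made up):

      # Done once, by hand or from a throwaway script: somewhere to keep the state.
      az group create --name tfstate-rg --location westeurope
      az storage account create --name tfstatestore123 --resource-group tfstate-rg --sku Standard_LRS
      az storage container create --name tfstate --account-name tfstatestore123

      # Only after that can a pipeline run `terraform init` against that backend
      # and start managing everything else.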

    • If I want to ensure the version controlled manifests actually match the environment, I have to only deploy with the tool, but then I can't use any other tools that create the resources. But the platform native tools are useful, especially when still testing it.

    I want reproducibility, so I work with ARM templates from the start as much as possible. If for some reason I absolutely can't, I'll create by hand, make my ARM template, and then throw away the resource to validate that it does indeed work to set it from scratch.

    The ARM templates may actually be slightly better because az group deployment create (orwhatitscalled) can be told that it is being handed the complete content of that resource group and it should nuke everything else. The problem is that back when I tried to set up application gateway as an ingress controller for kubernetes, it failed to set up the object in one pass and I had to add some options only after the object was created, so I gave up on that option. Might be they fixed it since.
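
    For reference, the incantation I mean is roughly this (resource group and template names made up):

      # Complete mode: anything in the resource group that the template does not
      # describe gets deleted, so the template really is the whole truth.
      az deployment group create --resource-group my-rg \
        --template-file main.json --mode Complete

      # what-if previews the changes (including the deletions) before committing.
      az deployment group what-if --resource-group my-rg \
        --template-file main.json --mode Complete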

    Except most platforms (I work mostly with kubernetes and azure) don't even provide descriptions on the resources.

    Azure has tags for this.

    I doubt a tag can contain a page worth of documentation. (Azure tag values are limited to a few hundred characters anyway.)

    Made worse by the fact that there are no explicit links between most resources. A service here or there just gets a bunch of connection strings and those point to the various databases and event queues and hunting all of them down will be quite an undertaking. And they'll have to be rotated at some point, which is the moment I am dreading the most.

    I can't speak for event queues, but for databases in Azure you should be using managed identity, so there's no need to rotate anything.

    I am trying to use managed identities, but for histerical raisins there are two subscriptions with separate domains and managed identities can't be used across tenants, so I can't avoid some of the connection strings, at least until everything is migrated to one domain, which unfortunately will probably never happen.

    Basically the point is there is a running solution that has approximately no documentation, is quite big, and has some test resources sprinkled all over it that nobody knows whether they are needed or not. And that thing needs to be cleaned up and have a deployment process introduced over it, but has to keep running through the process.


  • And then the murders began.

    @Bulb said in Infrastructure as Excavations:

    Terraform is what I am currently testing. The problem with it is that it only cares about the resources it created (or imported). So if someone changes a property, it will appear. But if someone adds permissions (each role assignment is a separate object), adds new secrets (connection strings) into a keyvault (also separate objects) etc., terraform will not notice.

    Yeah, that's all reasonable. Our secrets aren't "infrastructure", so that didn't really occur to me. And new permissions aren't a "problem" I'd considered.

    You have to set up terraform, git and some pipelines or maybe Atlantis before you can build the rest with terraform. So some plain old documentation is still required.

    I don't really consider that to be a bootstrap paradox... I'm not going to take an IaC lump for an application and suddenly build a new pipeline from nothing to deploy it. The pipeline will have been built hand-in-hand with developing the IaC code. (Ideally it'd be a YAML ADO pipeline or a GitHub Actions workflow and just live in the same repo as the rest of the IaC code, but even if it doesn't, the principle's still the same.)

    The problem is that back when I tried to set up application gateway as an ingress controller for kubernetes, it failed to set up the object in one pass and I had to add some options only after the object was created, so I gave up on that option. Might be they fixed it since.

    If the option was "enabling HTTPS", not that I'm aware of. :(

    I doubt a tag can contain a page worth of documentation.

    That's a bit more than a description, then!


  • Considered Harmful

    @Bulb on AWS with Terraform you can see what fucko did off the res by getting a drift report, diffing config



  • @Unperverted-Vixen said in Infrastructure as Excavations:

    @Bulb said in Infrastructure as Excavations:

    Terraform is what I am currently testing. The problem with it is that it only cares about the resources it created (or imported). So if someone changes a property, it will appear. But if someone adds permissions (each role assignment is a separate object), adds new secrets (connection strings) into a keyvault (also separate objects) etc., terraform will not notice.

    Yeah, that's all reasonable. Our secrets aren't "infrastructure", so that didn't really occur to me. And new permissions aren't a "problem" I'd considered.

    It applies to new anything, really.

    Basically there are two layers:

    • Describing the infrastructure a specific application or component needs to work. Terraform does a good job of that.
    • Keeping the entirety of the Azure (or AWS or whatever) subscription organized, and keeping track of who did what there, when and, most importantly, why. Terraform basically does not handle that. Nothing really seems to (the activity log, sketched below, is the closest thing I've found).
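
    The activity log at least answers the who/what/when part, never the why; something like (resource group made up):

      # Everything anyone (or any pipeline) did to the resource group lately.
      az monitor activity-log list --resource-group my-rg --offset 14d \
        --query "[].{when:eventTimestamp, who:caller, what:operationName.value}" \
        --output table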

    You have to set up terraform, git and some pipelines or maybe Atlantis before you can build the rest with terraform. So some plain old documentation is still required.

    I don't really consider that to be a bootstrap paradox... I'm not going to take an IaC lump for an application and suddenly build a new pipeline from nothing to deploy it. The pipeline will have been built hand-in-hand with developing the IaC code. (Ideally it'd be a YAML ADO pipeline or a GitHub Actions workflow and just live in the same repo as the rest of the IaC code, but even if it doesn't, the principle's still the same.)

    The bit I am missing is really on the operations side. The application can be developed with the appropriate IaC templates and a pipeline to deploy them, and that takes care of the deployment process well.

    But on the operations side there is old cruft lying around in the subscription and people can still create things via portal (and I want to keep doing it for testing things too) and what I am looking for is something to keep some track of those things.

    Terraform doesn't do it, but the thing is that terraform can't even really do it, because the ultimate source of truth is Azure, not terraform's metadata, and because terraform needs some, however small, bit of bootstrap that requires the manual way to still exist.

    The problem is that back when I tried to set up application gateway as an ingress controller for kubernetes, it failed to set up the object in one pass and I had to add some options only after the object was created, so I gave up on that option. Might be they fixed it since.

    If the option was "enabling HTTPS", not that I'm aware of. :(

    The option was enabling the kubernetes integration in my case. All the various ‘add-on’ options probably have or had the same issue.

    I doubt a tag can contain a page worth of documentation.

    That's a bit more than a description, then!

    I suppose a tag could contain a reference to a ticket in ADO or Jira.
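
    Something like this at creation time would at least leave a breadcrumb (values made up):

      # Merge a couple of tags onto an existing resource without touching the rest.
      az tag update --resource-id <resource-id> --operation Merge \
        --tags created-for=PROJ-1234 owner=bulb purpose="test ingress"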



  • @boomzilla said in Infrastructure as Excavations:

    I can tell you some of what we do. But note that I'm not really involved in this so it's just from hearing stuff.

    • Extensive documentation: Confluence, operations manual, security, etc. This is supposed to cover everything we have and the way we do everything.

    That's what this place is sorely missing, unfortunately.

    The only way to fix things is probably writing at least some.

    • Deployments get a hash that we store so we can later verify what's actually deployed.

    The problem with this application is that it's built from Azure pieces. There is a database that was designed manually in some Microsoft-provided enterprise thingamajig directly in Azure, there are the IoT Hubs and Event Hubs that had their queues set up by hand, there are the function apps that are being moved into source control but were written directly in the portal as .csx, etc.

    • Every deployment, change, etc, has a jira ticket where the customer authorizes the action. Then we update the documentation.

    That … should be done, but isn't.

    • Automated system scans that note when a file changed (forget the name of the software our customer uses).

    Here it's all Azure resources. But, well, there is the Azure Resource Manager JSON describing them, so there could be something to keep track of it.
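
    One cheap option might be to snapshot that JSON into git periodically and let the diff be the change log (resource group made up):

      # Dump the current ARM view of a resource group; commit it and diff over time.
      az group export --name my-rg > my-rg.json
      git diff my-rg.json   # shows what changed since the last snapshot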

    I guess my problem is that I am supposed to instill some order into it, but the people who have built it won't want to start documenting the thing anyway, so I can't actually do that much.

    We've done a lot of work to automate everything we can so that we don't skip steps, but ultimately most of the documentation relies on humans to write it and other humans to look at it and enforce it.

    … yeah. And these are cowboys who just cobbled it together and didn't really document it and don't seem to have much time so I can't pick their brains efficiently either.



  • @Bulb said in Infrastructure as Excavations:

    • I can have all the definitions, but it does not really imply that's what is actually running. What if it wasn't applied, or failed to apply, or something was changed via other means? It is always the platform's database that is the source of truth, not the version control; and I don't like having a poorly synchronized copy.

    I am going to disagree here. Both are "source of truth", but those are two different truths: what IS there versus what SHOULD be there. Both are equally important and any difference is not just "poor synchronization" - it's a problem that needs to be solved.

    Except most platforms (I work mostly with kubernetes and azure) don't even provide descriptions on the resources. So soon, or at the latest after the operator changes, there will be resources in the environment that nobody knows what they are for or whether they are even needed, but nobody dares to touch them, because some resources are production (and some are testing and some are experiments).

    Considering that this is just a modern variant of the good old "mystery server in the lower rack" and, even better, the older "mystery computer in the janitor's closet with a DO NOT TURN OFF sticker", I would say that this problem is persistent and pretty much unsolvable.

    Which is a reason why all the popular tools explicitly allow this. Any tool preventing it is going to be immediately categorized as "unusable" and "preventing us from just getting things done".



  • @Kamil-Podlesak said in Infrastructure as Excavations:

    Considering that this is just a modern variant of the good old "mystery server in the lower rack" and, even better, the older "mystery computer in the janitor's closet with a DO NOT TURN OFF sticker", I would say that this problem is persistent and pretty much unsolvable.

    Looks like it indeed.

    Well, it's really a people problem, so it does not have a (completely) technical solution. Though I'd hope the tools would at least try to be helpful by providing a place to put a description, so there would be one obvious place for documentation instead of everybody wondering whether it is described in the Confluence or the Azure DevOps wiki or the OneNote notebook we keep forgetting to share with the new hires. But no, that would be too sensible.

    Which is a reason why all the popular tools explicitly allow this. Any tool preventing it is going to be immediately categorized as "unusable" and "preventing us from just getting things done".

    Well, I see it as ‘allowing us to set up a footgun that will go off at some unpredictable time in the next two years’. But yeah, the big ball of mud is very popular and likely going to remain so just as much for infrastructure as it is for the code.


  • Considered Harmful

    @Bulb said in Infrastructure as Excavations:

    But yeah, the big ball of mud is very popular and likely going to remain so just as much for infrastructure as it is for the code.

    Again, platforms including AWS will let you pull drift reports for managed resources, which are the diff of what was planned for and what is actually configured. Also iirc some tools will let you let them keep you from making footguns, if you set the appropriately non-cowardly configuration.
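
    E.g. the CloudFormation-native flavour of it looks roughly like this (stack name made up):

      # Kick off drift detection (asynchronous), then list what differs.
      aws cloudformation detect-stack-drift --stack-name my-stack
      aws cloudformation describe-stack-resource-drifts --stack-name my-stack \
        --stack-resource-drift-status-filters MODIFIED DELETED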



  • @Gribnit That is basically just running plan. It will find the differences in the resources it knows about, but not the additional resources that were created by hand and should be documented and/or co-opted into the managed configuration.
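
    The closest thing I can think of is diffing the full inventory against the state, roughly like this (crude, resource group made up, and the ID casing usually needs normalizing):

      # Everything that actually exists in the resource group...
      az resource list --resource-group my-rg --query "[].id" --output tsv \
        | tr '[:upper:]' '[:lower:]' | sort > actual.txt

      # ...versus everything terraform thinks it manages.
      terraform state pull \
        | jq -r '.resources[].instances[].attributes.id // empty' \
        | tr '[:upper:]' '[:lower:]' | sort > managed.txt

      # Anything only in actual.txt is unmanaged: document it, import it, or kill it.
      comm -23 actual.txt managed.txt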


  • Discourse touched me in a no-no place

    @Bulb said in Infrastructure as Excavations:

    @Kamil-Podlesak said in Infrastructure as Excavations:

    Considering that this is just a modern variant of the good old "mystery server in the lower rack" and, even better, the older "mystery computer in the janitor's closet with a DO NOT TURN OFF sticker", I would say that this problem is persistent and pretty much unsolvable.

    Looks like it indeed.

    Well, it's really a people problem, so it does not have a (completely) technical solution. Though I'd hope the tools would at least try to be helpful by providing a place to put a description, so there would be one obvious place for documentation instead of everybody wondering whether it is described in the Confluence or the Azure DevOps wiki or the OneNote notebook we keep forgetting to share with the new hires. But no, that would be too sensible.

    Software used to be just as full of that messy style. Still is if you rely on the IDE to do everything for you. We use source control and CI so that we have a source of truth for what the software is; anything built on a developer's machine is never definitive.



  • @dkf Yeah, except the infrastructure actually deployed is kinda analogous to that software built on developer machine.



  • @Bulb said in Infrastructure as Excavations:

    @dkf Yeah, except the infrastructure actually deployed is kinda analogous to that software built on developer machine.

    But it works on my machine in the production!


  • Discourse touched me in a no-no place

    @Bulb said in Infrastructure as Excavations:

    @dkf Yeah, except the infrastructure actually deployed is kinda analogous to that software built on developer machine.

    That was my point. While that remains true, infrastructure remains stuck at the manual build/deploy stage. Being able to build the system from scratch using only scripted instructions (with all bits coming from repositories, though probably private ones) is far superior. When you have that, you can do Continuous Deployment, and can say what should be up because you brought it all up in a controlled fashion; if it isn't on the list, it isn't needed and should be slaughtered.

    Wish I had it myself, but I have a bunch of reasons why not (notably there is a complex hardware situation; I'm talking about the part that virtualizes that stuff for the rest of our stack). I ought to work on improving that...



  • @Kamil-Podlesak said in Infrastructure as Excavations:

    @Bulb said in Infrastructure as Excavations:

    @dkf Yeah, except the infrastructure actually deployed is kinda analogous to that software built on developer machine.

    But it works on my machine in the production!

    … which is good until you need to modify something at which point the production environment promptly collapses into a black hole and you will spend days trying to rebuild it.

    @dkf said in Infrastructure as Excavations:

    @Bulb said in Infrastructure as Excavations:

    @dkf Yeah, except the infrastructure actually deployed is kinda analogous to that software built on developer machine.

    That was my point. While that remains true, infrastructure remains stuck at the manual build/deploy stage. Being able to build the system from scratch using only scripted instructions (with all bits coming from repositories, though probably private ones) is far superior. When you have that, you can do Continuous Deployment, and can say what should be up because you brought it all up in a controlled fashion; if it isn't on the list, it isn't needed and should be slaughtered.

    … which assumes a lot of things this cowboy operation unfortunately does not have, like a separate staging environment. The pinnacle of staging is currently staging slots in functionapps.

