From Zero to CI/CD Hero: Our 10-Year Insights

9 September 2024

CI/CD is a familiar topic for us; we’ve explored it in videos and presentations. So, when I set out to write this post, I asked myself: What new perspectives can we offer that we haven’t covered before?

It turns out, we have quite a bit more to share. In this post, I’m excited to dive into some insights we've gained from a decade of working with CI/CD systems. Let’s begin with a brief recap:

CI/CD

CI/CD can have two meanings: Continuous Integration/Continuous Delivery and Continuous Integration/Continuous Deployment.

Continuous Integration is the practice of frequently combining the work of all your developers in one place. Developers merge their changes into a central repository, where builds and tests run automatically to make sure those changes work together. By continuously checking that new code plays nicely with existing code, you catch and fix issues early, which makes the final product better and the process more efficient.

Continuous Delivery is a practice where teams produce software in short cycles, ensuring that it can be reliably released at any time, so you don’t have to wait long to get new features to your users. In practice, this means that after developers have finished building and testing their features, the CD system automatically prepares those changes so they are ready to be deployed to users at any time. It does everything except push the big red button.

Continuous Deployment is where software features are delivered frequently and through automated deployments. So once a developer finishes building and testing a new feature or fix, it is automatically deployed to users without any manual intervention. In other words, it pushes the big red button without waiting for a human to approve.

Regardless of which definition you use, CI/CD aims to improve how we get software into the hands of users through automation. With that out of the way, let’s get into the insights:

Configuration

Keep your workflows relatively small

When you start a project, the pipeline will be small, but it will grow as you add features. It’s fine to keep everything in the pipeline’s config file at first, but as it grows, you will want to move things out of it.

We advise moving logically separate steps into their own scripts and calling those scripts from the config file. A good approach is to treat your pipeline configuration as part of the code base, so all the rules you apply to your code also apply to the config. If your config file starts to grow too big, refactor that sucker and move some of it out into separate scripts.

This becomes crucial once you start adding conditional logic to your pipeline. If you leave everything in the config file, it will soon become unreadable and hard to maintain. Very large config files are also harder to migrate if you decide to switch CI/CD providers, which is another reason to avoid them.
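As a rough illustration of moving logic out of the config, here is a minimal Python sketch of a hypothetical `ci.py` that holds the pipeline steps. The step names and the commands in `STEPS` are placeholders we made up for the example, not something your CI provider expects.

```python
import subprocess
import sys
from typing import Dict, List

# Hypothetical step registry. The commands are placeholders for whatever
# linters and test runners your project actually uses.
STEPS: Dict[str, List[List[str]]] = {
    "lint": [["ruff", "check", "."]],
    "test": [[sys.executable, "-m", "pytest", "-q"]],
}


def run_step(name: str, steps: Dict[str, List[List[str]]] = STEPS) -> int:
    """Run every command of one step in order and return an exit code.

    Stops at the first failing command and propagates its exit code, so the
    CI job fails as well. Unknown step names return 2.
    """
    commands = steps.get(name)
    if commands is None:
        print(f"unknown step: {name!r}", file=sys.stderr)
        return 2
    for cmd in commands:
        print("+", " ".join(cmd), flush=True)  # echo the command into the CI log
        code = subprocess.run(cmd).returncode
        if code != 0:
            return code
    return 0
```

The config file then only needs one line per step, e.g. `python ci.py lint`, with the script ending in `sys.exit(run_step(sys.argv[1]))` under an `if __name__ == "__main__":` guard. Because the logic lives in plain Python, it can be reviewed, tested, and carried over to a new provider like any other code.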

This isn’t the wild west, use containers

Make sure the environments are the same across development, testing, and production. Container software like Docker is a godsend here because it lets you run the same environment at every stage.

You do not want a situation where a check passes in development but fails in the pipeline or, worse, passes in the pipeline but fails locally.

Continuously deploy carefully

The Continuous Deployment part of CI/CD can be a double-edged sword if you are not careful. Continuous Deployment means that deployment is automated: for example, anything committed to a release branch automatically goes live. This is fine in theory, but in practice it can lead to problems, especially when combined with ignored pipeline warnings and the human tendency to make mistakes under pressure.

An exception to this rule is continuous deployment to non-production systems, like a test server that users can play on. There it’s great and we recommend it, since even if something goes wrong, the damage is minimal.
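One way to encode this rule is a small deployment gate. The sketch below assumes hypothetical environment names and a `Release` record we invented for illustration: non-production environments deploy automatically once checks pass, while production additionally needs a clean run and a human approval.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Release:
    environment: str                   # hypothetical names, e.g. "staging" or "production"
    checks_passed: bool                # did every pipeline check succeed?
    open_warnings: int                 # unresolved warnings reported by the pipeline
    approved_by: Optional[str] = None  # who pressed the "big red button", if anyone


def may_deploy(release: Release) -> bool:
    """Decide whether the pipeline may deploy this release on its own."""
    if not release.checks_passed:
        return False  # a red pipeline never deploys, anywhere
    if release.environment != "production":
        return True   # test/staging servers: continuous deployment is fine
    # Production: require a clean run *and* a human decision.
    return release.open_warnings == 0 and release.approved_by is not None
```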

Performance

Waiting sucks

When you are developing software, long delays can take you out of the zone and sap your creative juices. For example, if you have to wait 30 minutes for a build on the test server, you will lose focus, especially if you switch to another task while waiting.

So the tip here is to spend time thinking about how to optimize your pipeline. Optimizing isn’t just about your checks running as efficiently as possible, but also about whether they run at all.

Not everything that can be run should be run. For example, some checks only need to run when merging into the main branch, and if a section of the application is properly isolated, its checks only need to run when something in that section changes.
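A minimal sketch of selective checks, assuming you can get the list of changed files (for example from `git diff --name-only origin/main...HEAD`) and the branch name from your CI environment. The suite names, path patterns, and the main-only `e2e` suite are hypothetical.

```python
from fnmatch import fnmatch
from typing import Dict, Iterable, List

# Hypothetical mapping from check suites to the paths they cover.
# Note: fnmatch's "*" also matches "/", so "frontend/*" covers nested files.
SUITES: Dict[str, List[str]] = {
    "frontend": ["frontend/*", "package.json"],
    "backend": ["backend/*", "requirements*.txt"],
}
# Hypothetical slow suites that only run when merging into main.
MAIN_ONLY: List[str] = ["e2e"]


def suites_to_run(changed_files: Iterable[str], branch: str) -> List[str]:
    """Return the sorted list of check suites this change needs."""
    if branch == "main":
        return sorted(set(SUITES) | set(MAIN_ONLY))  # merging to main: run everything
    changed = list(changed_files)
    return sorted(
        name
        for name, patterns in SUITES.items()
        if any(fnmatch(path, pattern) for path in changed for pattern in patterns)
    )
```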

There is a caveat, though: speed isn’t everything. Optimizing to save a few milliseconds is a waste of time; look instead for bottlenecks whose removal improves performance by orders of magnitude. At some point, diminishing returns kick in, and from then on your time is better spent on something other than optimization.
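Rather than guessing where the bottleneck is, measure it. This sketch wraps pipeline steps in a timer and ranks them by their share of the total runtime; the step names and durations in the usage example are invented for illustration.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator, List, Tuple

timings: Dict[str, float] = {}


@contextmanager
def timed(step: str) -> Iterator[None]:
    """Record how long the wrapped pipeline step took, in seconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[step] = time.perf_counter() - start


def report(durations: Dict[str, float]) -> List[Tuple[str, float, float]]:
    """Return (step, seconds, percent of total) tuples, slowest step first."""
    total = sum(durations.values()) or 1.0  # avoid dividing by zero
    ranked = sorted(durations.items(), key=lambda item: item[1], reverse=True)
    return [(step, secs, 100 * secs / total) for step, secs in ranked]
```

With durations like `{"install": 240.0, "test": 45.0, "lint": 15.0}`, `report` puts `install` first at 80% of the runtime. That is where an order-of-magnitude win would have to come from, while shaving a second off `lint` would hardly matter.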

Costs

We mentioned this in the video, but it bears repeating: mind your costs. In the video, we said the cost of your CI/CD solution can be measured in hours, dollars, or both. That is still true, so pay attention to how much you are actually spending in time and money.

CI/CD pipelines can get very expensive, very fast, so the idea is to make sure they do not eat you alive.

The tip here is to keep an eye on what you are paying, in dollars or in time. Costs are much easier to manage early on than later, especially if several developers work on the project.

Another thing to keep in mind is the payment model, i.e. pay per use vs. limited parallelism. Both models have advantages and disadvantages, but if I were to sum them up, it is the difference between unlimited costs and unlimited waiting time. Analyze your usage, then choose what works best for you.
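To make the trade-off concrete, here is a back-of-the-envelope model under simplifying assumptions of our own: all jobs take equally long, they are queued at the same moment, and prices are in arbitrary whole units. It is not how any particular provider bills.

```python
import math
from typing import Tuple


def pay_per_use(jobs: int, minutes_per_job: int, price_per_minute: int) -> Tuple[int, int]:
    """Every job starts at once; cost grows with usage. Returns (cost, wall-clock minutes)."""
    return jobs * minutes_per_job * price_per_minute, minutes_per_job


def fixed_parallelism(jobs: int, minutes_per_job: int, slots: int, flat_fee: int) -> Tuple[int, int]:
    """At most `slots` jobs run at once; cost is flat, waiting grows. Returns (cost, wall-clock minutes)."""
    return flat_fee, math.ceil(jobs / slots) * minutes_per_job
```

For 12 ten-minute jobs at 1 unit per minute, pay per use costs 120 and everything is done after 10 minutes; with 4 slots for a flat fee of 100, you wait 30 minutes. Double the jobs and pay per use costs 240 while the wait stays at 10 minutes, whereas the flat plan still costs 100 but the wait grows to 60 minutes.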

Tests, warnings, and errors

Make sure your pipeline fails

I know this might sound weird, but pipelines are much like unit tests: they have to fail successfully and loudly. A pipeline that succeeds no matter the situation is pointless. The checks have to check something that can go wrong. If the code is not linted, the pipeline must fail. If a test case fails, the pipeline must fail. If an artifact is missing after a build, the pipeline must fail!

Make sure your CI/CD fails when you need it to. It’s a good idea to test that it actually fails in the beginning by forcing it to face various fail scenarios.
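Here is a minimal sketch of the "missing artifact" check from above, together with a self-test that deliberately triggers the failure once. The artifact names (`app.js`, `app.css`) are hypothetical; in a real pipeline the script would exit with the returned code so the job goes red.

```python
import sys
import tempfile
from pathlib import Path
from typing import Iterable, List


def missing_artifacts(build_dir: str, expected: Iterable[str]) -> List[str]:
    """List the expected build outputs that are not present as files."""
    base = Path(build_dir)
    return [name for name in expected if not (base / name).is_file()]


def check_artifacts(build_dir: str, expected: Iterable[str]) -> int:
    """Exit-code style result: 0 if every artifact exists, 1 (with loud errors) otherwise."""
    missing = missing_artifacts(build_dir, expected)
    for name in missing:
        print(f"ERROR: missing build artifact: {name}", file=sys.stderr)
    return 1 if missing else 0


def self_test() -> None:
    """Force the failure scenario once, to prove the check can actually fail."""
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "app.js").write_text("console.log('built');\n")
        assert check_artifacts(tmp, ["app.js"]) == 0
        assert check_artifacts(tmp, ["app.js", "app.css"]) == 1
```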

Also be careful if your pipeline fails randomly. This is a sign of flakiness and should be treated as a critical failure and handled immediately. Flakiness is a huge problem because it breeds complacency, and complacency leads to bigger problems. If you take only one thing away from this post, I hope it’s this: don’t ignore flakiness.
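A simple way to hunt for flakiness is to run the same check several times against unchanged code and see whether the results disagree. The sketch below treats a crash as a failure; note that a consistent result only tells you no flakiness showed up in those runs, it does not prove the check is stable.

```python
from typing import Callable


def _passes(check: Callable[[], bool]) -> bool:
    """Run a check once; an exception counts as a failure."""
    try:
        return bool(check())
    except Exception:
        return False


def is_flaky(check: Callable[[], bool], runs: int = 10) -> bool:
    """Mixed pass/fail results across identical runs mean the check is flaky."""
    if runs < 2:
        raise ValueError("need at least two runs to compare results")
    outcomes = {_passes(check) for _ in range(runs)}
    return len(outcomes) > 1
```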

Deal with warnings

Warnings are a sign that something is not quite right. Sure, you can get away with ignoring CI/CD warnings, but do not get into the habit. Deal with warnings as soon as you get them: if you can address them, address them.

If, however, you cannot address a warning and it’s not project-threatening, mute it. I’m not suggesting a blanket “mute all” here but a more surgical culling. This might seem controversial, but there is a good reason for it.

CI/CD logs are used to debug problems. Useless warnings pollute your logs and distract you from that goal: at best they slow you down, and at worst they act as red herrings that send you down the wrong rabbit holes for hours.

So fix the warnings if you can; if you can’t and they are non-critical, mute them.
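In Python projects, for instance, the standard `warnings` module supports exactly this kind of surgical muting. The sketch below silences one warning by message and category and leaves everything else visible; the `legacy_api` message is a hypothetical example. The `module` argument of `warnings.filterwarnings` can narrow a filter further, and pytest offers the same idea through its `filterwarnings` setting.

```python
import warnings


def mute_known_warnings() -> None:
    """Silence specific, reviewed warnings and nothing else."""
    # Hypothetical example: a dependency emits this DeprecationWarning and the
    # fix has to come from upstream. `message` is a regex matched against the
    # start of the warning text; matching the category too means any new
    # warning still shows up in the CI log.
    warnings.filterwarnings(
        "ignore",
        message=r"legacy_api is deprecated",
        category=DeprecationWarning,
    )
```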

Avoid useless checks

A useless check is one that doesn’t add any value. This isn’t to say the check itself isn’t worth having; it’s about how you handle it. For example, you can have a check that tests the bundle size. This is a good way to make sure your bundle doesn’t grow too much, but if every time developers run into a bundle-size failure they just raise the limit instead of working to decrease the bundle size, the check is pointless.

If you are not going to respect a check, don’t have it in the first place. Another example is a coverage check where, instead of adding tests to keep coverage up, developers lower the requirement.
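If you do keep such checks, one option is to store the limits in the repository, so that loosening a budget becomes a visible, reviewed change rather than a quiet tweak. The limits below are hypothetical numbers, not recommendations.

```python
from typing import List

# Hypothetical budgets. Keeping them in the repository means that loosening
# one shows up as a reviewed diff instead of a quiet tweak.
BUNDLE_LIMIT_BYTES = 250_000
MIN_COVERAGE_PERCENT = 85.0


def budget_violations(bundle_bytes: int, coverage_percent: float) -> List[str]:
    """Return human-readable problems; an empty list means the check passes."""
    problems = []
    if bundle_bytes > BUNDLE_LIMIT_BYTES:
        problems.append(
            f"bundle is {bundle_bytes} bytes, limit is {BUNDLE_LIMIT_BYTES}: shrink the bundle"
        )
    if coverage_percent < MIN_COVERAGE_PERCENT:
        problems.append(
            f"coverage is {coverage_percent:.1f}%, minimum is {MIN_COVERAGE_PERCENT:.1f}%: add tests"
        )
    return problems
```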

In other words…

Take the CI/CD seriously

Don’t ignore it, and don’t overrule it. If something breaks, don’t overrule the pipeline and merge anyway. That is a quick way to end up with a broken production system.

Debugging and Troubleshooting

Catch the cache

When building in a pipeline, your provider usually caches some requests for efficiency. They don’t actually run all those requests every time you start a build; if they did, their internet bill would be unreasonable. Instead, they cache some of these calls and only fetch new data when necessary.

If their interface is good, you can specify for yourself which requests are cached and under what conditions the cache is updated. For example, you might only run yarn install or npm install when your package.json changes.

This sounds awesome until you run into an issue where the problem is an old cache that didn’t get updated.

This is not a hit piece on caching; quite the opposite. We love caching and recommend using it liberally, but always be aware that you are caching. When something breaks for no apparent reason, make the cache one of the first places you look.
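Most providers let you build cache keys from file contents (GitHub Actions, for example, has a `hashFiles()` expression). The sketch below shows the underlying idea in Python: hash the dependency manifests so the key, and therefore the cache, only changes when they do. The file names and the `deps-v1` prefix are illustrative assumptions.

```python
import hashlib
from typing import Dict


def cache_key(prefix: str, files: Dict[str, bytes]) -> str:
    """Build a dependency-cache key from the contents of the given files."""
    digest = hashlib.sha256()
    for name in sorted(files):  # stable order -> stable key
        digest.update(name.encode("utf-8"))
        digest.update(b"\0")
        digest.update(files[name])
        digest.update(b"\0")
    return f"{prefix}-{digest.hexdigest()[:16]}"
```

A handy side effect: when you suspect a stale cache, bumping the prefix (say, from `deps-v1` to `deps-v2`) forces a fresh one without touching anything else. In a real pipeline the bytes would come from disk, e.g. `Path("package.json").read_bytes()`.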

Embrace the Automatic

Sometimes a properly set-up pipeline passes but the production build fails, or everything passes but by some black magic the live system is broken anyway. The cause could be, for example, a manual step you forgot, like creating migrations in Django or updating the API that the frontend uses to generate its interface.

The cause is irrelevant. What's important is that if this happens once, odds are it will happen again. Your first impulse might be to fix the issue and forget about it; after all, it will only take 5 minutes to fix, right?

That’s a bad impulse, crush it ruthlessly.

The tip here is that when this occurs, you need to either automate that step or create a check for it so that it never happens again. It might seem like overkill, but if you ran into the issue, others will run into it too. Adding it to the pipeline also acts as documentation for the next developer, or for yourself months down the road.
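As an example of turning a forgotten manual step into a check, here is a sketch that runs a list of guard commands and reports which ones failed. Django's `makemigrations --check --dry-run` exits with a non-zero status when model changes have no migration; the API schema script is a hypothetical placeholder for whatever your project needs.

```python
import subprocess
import sys
from typing import Dict, List

# Checks for manual steps that were forgotten at least once. The Django
# command uses real flags; the schema script is a hypothetical placeholder.
GUARDS: Dict[str, List[str]] = {
    "model changes without migrations": [
        sys.executable, "manage.py", "makemigrations", "--check", "--dry-run",
    ],
    "stale frontend API schema": [sys.executable, "scripts/check_api_schema.py"],
}


def failed_guards(guards: Dict[str, List[str]] = GUARDS) -> List[str]:
    """Run every guard and return the descriptions of those that failed."""
    return [
        description
        for description, cmd in guards.items()
        if subprocess.run(cmd).returncode != 0
    ]
```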

Conclusion

The insights we’ve shared in this post are only a fraction of the knowledge we've gained. But if we tried to write everything at once, we’d be writing a book and not a blog post. As technology evolves, so too will our approaches and strategies, so we’ll definitely be writing a part two some time in the future.

Ready to take your CI/CD to the next level? Give us a call to discuss how we can help you on your own CI/CD journey. After all, the path is easier to walk when you are guided by someone who has already been there.

djangsters GmbH

Vogelsanger Straße 187
50825 Köln
