A friend of mine told me about a problem he’s facing. He’s working with a client who has a globally distributed team, much of which is offshore, working in a monolithic Java codebase. I hear the groans already! It gets worse… As you might imagine, there are quality issues with the software. They release only once or twice per week, and when issues are found in production, they scramble to fix them and quickly ship again (fix forward). The leaders of the company are unhappy with the development team’s inability to produce quality software.
The scenario he described is all too common!
What strategies can be used to deal with this?
First, let’s break the problem down into its component issues:
- Multiple teams sharing one codebase
- Testing is a challenge and quality issues slip through to production
- There is no automated way to find production issues, so they rely on QA staff and customers to report them
- Deployments are a manual process, so they are not easy to do frequently
So how can we address these issues?
Migrate parts of the monolith to a microservices architecture
If we can split the codebase into smaller, independently deployable services, each service can be assigned a team to own it. With ownership comes pride in one’s work, and code quality naturally improves. If there’s an issue with a service, guess who is on the hook? The team who wrote it! Each service endpoint is a contract: you give me these inputs and I will give you these outputs. Now teams within the organization can build their services independently.
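To make the contract idea concrete, here is a minimal sketch of what one endpoint’s contract might look like. The service name, types, and logic are all hypothetical; the point is that the request and response shapes are the agreement other teams build against.

```java
// Hypothetical contract for an "orders" service owned by one team.
// As long as the request/response shapes stay stable, other teams can
// build against this endpoint without coordinating releases.
public class OrderContract {

    // Input the service promises to accept.
    public static class CreateOrderRequest {
        public final String customerId;
        public final int quantity;

        public CreateOrderRequest(String customerId, int quantity) {
            this.customerId = customerId;
            this.quantity = quantity;
        }
    }

    // Output the service promises to return.
    public static class CreateOrderResponse {
        public final String orderId;
        public final String status;

        public CreateOrderResponse(String orderId, String status) {
            this.orderId = orderId;
            this.status = status;
        }
    }

    // The endpoint itself: given these inputs, you get these outputs.
    public static CreateOrderResponse createOrder(CreateOrderRequest request) {
        if (request.quantity <= 0) {
            return new CreateOrderResponse(null, "REJECTED");
        }
        return new CreateOrderResponse("ord-" + request.customerId, "CREATED");
    }
}
```

In a real system the contract would be expressed as an HTTP API or shared schema rather than a Java class, but the discipline is the same: change the inputs or outputs and you have broken the contract.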
Invest in the right kinds of testing
Where this client was concerned, there was only surface-level testing occurring: they had few automation tests to simulate clicks and confirm the behavior of the web application itself. Code coverage from their minimal unit tests was dismal. Bugs slipped through the cracks. And what’s worse? When they found a bug, no tests were written to confirm the fix.
My advice: every time there is a bug, write a unit test that reproduces it, then fix it. That test confirms the fix is in place. This is Test-Driven Development (TDD) in practice. Start by writing the test the way the code should work (it will fail because the bug is still there). Then fix the code. Run the test again to demonstrate that this code path now produces the desired result. Check it in!
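Here is a sketch of that workflow with an invented bug: say a report came in that an oversized discount could drive an order total negative. The class and numbers are hypothetical, and in a real Java project the test would live in a JUnit class; plain asserts keep the example self-contained.

```java
// Hypothetical bug report: orders could end up with a negative total
// after a large discount. Write the failing test first, then fix the
// code so the test passes.
public class PriceCalculator {

    // The fix: clamp the discounted total at zero instead of letting
    // a plain subtraction go negative.
    public static double applyDiscount(double total, double discount) {
        return Math.max(0.0, total - discount);
    }

    // Regression tests capturing the bug report (run with `java -ea`).
    public static void main(String[] args) {
        // The exact case from the bug report: discount larger than the total.
        assert applyDiscount(10.0, 25.0) == 0.0 : "total must never go negative";
        // Normal behavior still works.
        assert applyDiscount(100.0, 25.0) == 75.0;
        System.out.println("regression tests passed");
    }
}
```

The first assert is the test written before the fix: against the original subtract-only code it fails, and after the clamp it passes, which is exactly the red-green loop described above.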
Additionally, for a web application of this size, more automation tests should be considered. There are easy ways to get started. Often people work directly with Selenium, but there are more reliable approaches: I would recommend Cucumber, or if you’re in a .NET shop, try Canopy.
Ship code faster
When there are more people in a codebase, there is more reason to ship code faster! Why is that? Well, suppose we have 20 changes from various contributors all ready to be delivered together. And suppose we find a bug in the software while it is being QA tested (or worse: once it is in production). Now we have to track down which change was the offending one. I don’t know about you, but my life is much easier when it’s not spent searching for a needle in a haystack.
What’s the solution? Ship code faster! Better to push code every day (just not on Friday!), or even multiple times a day! This couples nicely with a microservices architecture because each team is responsible for getting its codebase into production. That’s the DevOps way!
Have an issue? It becomes much easier to diagnose and determine a course of action when you can tell exactly when it started and which codebase it was in! Only one problem… shipping code is painful! What can we do…?
Build a CI/CD Pipeline
To reduce the pain of deployments, we need to make packaging up code easy! To steal a Wikipedia definition,
Continuous integration is the practice of merging all developer working copies to a shared mainline several times a day.
So now, for each codebase, we are constantly merging back into the develop branch as changes happen. This makes the develop branch the source of truth for the current state of the team’s work.
And now for Continuous Delivery (from Wikipedia),
Continuous delivery is a software engineering approach in which teams produce software in short cycles, ensuring that the software can be reliably released at any time. It aims at building, testing, and releasing software faster and more frequently. The approach helps reduce the cost, time, and risk of delivering changes by allowing for more incremental updates to applications in production.
So we need to make it easy to automatically build our software whenever a commit lands on develop. The build process should include running all unit tests to ensure things are still healthy.
When a release branch is created, we can have the system run through the same build/test process, package the code for the various environments, and store the packages in an artifact repository. The process then deploys to a testing environment and runs our automation test suite against it to validate that level.
Manual QA is still likely a requirement for most organizations, but now, driven by nothing more than a developer commit, we have releasable packages that just need sign-off from our QA staff before they can make it to production. And you guessed it: production deployments should be automated too!
So what can we do to find our production errors before our customers do? Well, we could have a wired QA staff frantically clicking through the application to manually verify every behavior after each production deployment. But there are better options!
One option I’ll put forward is centralized logging. Each codebase should report its errors and audit events (possibly to different log stores). There are a variety of ways to do this, but one that is gaining a lot of traction (and rightfully so) is the ELK stack. ELK stands for Elasticsearch, Logstash, and Kibana. The idea is that each codebase stores its logs locally on the server on a temporary basis. Logstash is a service installed on each server that monitors those log files and ships the log entries to a central log store. That store runs Elasticsearch, where the entries from every codebase are combined and made searchable. Finally, developers and QA staff use Kibana as a frontend to search through the errors that are occurring.
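To give a feel for the shipping step, a minimal Logstash pipeline might look something like the fragment below. The file path and host are placeholders you would replace for your environment, and this assumes the application writes one JSON object per log line.

```
input {
  file {
    path => "/var/log/myapp/*.log"   # local, temporary log files written by the service
  }
}
filter {
  json {
    source => "message"              # parse each line as a JSON log entry
  }
}
output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]  # the central Elasticsearch store
  }
}
```

Each server runs a pipeline like this, so the local files stay short-lived while everything ends up searchable in one place.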
Best practices here include:
- using a correlation ID to identify a call as it works its way through a variety of microservices
- logging which codebase an error or audit event happened in
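Both practices above can be sketched in a few lines. This is a simplified, hypothetical helper; real Java projects often lean on SLF4J’s MDC for the same effect, and the log format here is invented for illustration.

```java
import java.util.UUID;

// Sketch of the two best practices: every log line carries the codebase
// name and a correlation ID, so one request can be traced across services.
public class CorrelatedLogger {

    // Correlation ID for the current request, one per thread.
    private static final ThreadLocal<String> CORRELATION_ID = new ThreadLocal<>();

    // Called at the edge of the system: reuse the ID from the incoming
    // request (e.g. an HTTP header) if present, otherwise start a new one.
    public static String startRequest(String incomingId) {
        String id = (incomingId != null) ? incomingId : UUID.randomUUID().toString();
        CORRELATION_ID.set(id);
        return id;
    }

    // Every entry records which codebase it came from plus the ID.
    public static String format(String codebase, String level, String message) {
        return String.format("[%s] [%s] [corr=%s] %s",
                codebase, level, CORRELATION_ID.get(), message);
    }
}
```

When a downstream service receives a call, it passes the incoming ID to startRequest instead of generating a fresh one; searching Kibana for that single ID then surfaces the whole call chain across every codebase.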
So what do we do with all of these code deployments happening so frequently our heads are spinning? We can wrap our changes in something called a feature toggle.
A feature toggle is branching logic. If the toggle is on, go to the new path. If the toggle is off, go to the old path.
One use of feature toggles is safe releasing. We can deploy code with the toggle off, and there is no change to system behavior. Once we are ready (and maybe the product team has some involvement here), we toggle the feature on. Now code goes down the new path. Don’t forget to monitor the error logs! If all is well, we still need to clean up the feature toggle in a subsequent deployment and remove the old path. This is very important: feature toggles should be short-lived so they don’t clutter your codebase.
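A feature toggle really is just that branch. Here is a minimal sketch; the toggle name, checkout methods, and in-memory map are all hypothetical, where a real system would read toggles from configuration or a toggle service so they can be flipped without a deployment.

```java
import java.util.Map;

// Minimal feature-toggle sketch: branching logic keyed by a named flag.
public class FeatureToggles {

    // Stand-in for a config file or toggle service; deployed "off".
    private static final Map<String, Boolean> TOGGLES =
            Map.of("new-checkout-flow", false);

    public static boolean isEnabled(String name) {
        return TOGGLES.getOrDefault(name, false);
    }

    // The branch point: with the toggle off, behavior is unchanged;
    // flip it on and traffic goes down the new path.
    public static String checkout(String cartId) {
        if (isEnabled("new-checkout-flow")) {
            return newCheckout(cartId);   // new path
        }
        return legacyCheckout(cartId);    // old path, deleted once the toggle is retired
    }

    private static String newCheckout(String cartId) {
        return "new:" + cartId;
    }

    private static String legacyCheckout(String cartId) {
        return "legacy:" + cartId;
    }
}
```

Deploying this with the flag off is behaviorally identical to the old release, which is exactly what makes the deployment itself low-risk.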
So that was a lot! With some work, we can safely move fast and deploy even several times per day. We can build confidence in the quality of the software we produce, and we can separate the code into microservices so that larger teams can work together safely.
I hope that helps your organization!