CI/CD Pipelines That Don't Suck
GitHub Actions patterns, caching strategies, and how we got our pipeline under 5 minutes.
The CI Pipeline Nobody Wants to Own
At every company I've worked at, the CI/CD pipeline is simultaneously the most critical piece of infrastructure and the least loved. It was set up by someone who left two years ago, it's held together by YAML that nobody fully understands, and when it breaks, everyone looks around the room hoping someone else will fix it.
Over three companies, I've rebuilt CI pipelines from scratch twice and incrementally improved one. Here's what I've learned about building pipelines that are fast, reliable, and maintainable.
Speed Is a Feature
A CI pipeline that takes 20 minutes is a CI pipeline that engineers avoid. They'll batch changes into larger PRs to reduce the number of CI runs, which defeats the purpose of continuous integration. Your pipeline should complete in under 10 minutes, ideally under 5.
At CARS24, our pipeline was 18 minutes. We got it to 4.5 minutes with three changes: parallelizing lint, type-check, and test jobs instead of running them sequentially; caching node_modules between runs using actions/cache with a hash of the lockfile; and running only the tests affected by changed files using Jest's --changedSince flag.
The caching alone saved roughly 3 minutes per run by avoiding a clean npm install every time. The parallel jobs saved about another 5 minutes by using GitHub Actions' concurrent job execution. And the targeted testing saved around 4 minutes by skipping tests for unmodified modules. These are rounded figures, so they don't add up exactly to the full drop from 18 to 4.5 minutes.
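A minimal sketch of those three changes in a single workflow. It assumes an npm project with `lint` and `typecheck` scripts in package.json, Jest as the test runner, and Node 20. Those names and versions are illustrative assumptions, not details of the pipeline described above.

```yaml
name: ci
on:
  pull_request:

jobs:
  # Each matrix entry becomes its own job, so the three checks run in parallel.
  validate:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        task: [lint, typecheck, test]   # assumed script names
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0   # full history, so Jest can diff against the base branch

      - uses: actions/setup-node@v4
        with:
          node-version: 20   # assumed version

      # Cache key changes only when the lockfile changes.
      - name: Restore node_modules
        id: deps
        uses: actions/cache@v4
        with:
          path: node_modules
          key: ${{ runner.os }}-node20-modules-${{ hashFiles('package-lock.json') }}

      # Skip the install entirely on an exact cache hit.
      - name: Install dependencies
        if: steps.deps.outputs.cache-hit != 'true'
        run: npm ci

      - name: Lint or type-check
        if: matrix.task != 'test'
        run: npm run ${{ matrix.task }}

      # Run only the tests related to files changed since the PR's base branch.
      - name: Test what changed
        if: matrix.task == 'test'
        run: npx jest --changedSince=origin/${{ github.base_ref }}
```

Putting the OS and Node version in the cache key is a deliberate choice here: a node_modules directory with native modules built for one runtime is not safe to restore into another.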
The Pipeline Structure That Works
I use a three-stage pipeline: validate, build, and deploy. The validate stage runs in parallel jobs — linting, type checking, unit tests, and accessibility checks. If any job fails, the pipeline stops. No point building if the code doesn't pass basic checks.
The build stage creates the production bundle and runs integration tests against it. This catches issues that only manifest in the production build: missing environment variables, broken dynamic imports, CSS ordering differences. At Mamaearth, this stage caught an average of two bugs per month that would otherwise have reached users.
The deploy stage is conditional. For PRs, it deploys to a preview environment. For merges to main, it deploys to staging. For tagged releases, it deploys to production. Each environment has its own configuration, and the pipeline handles the routing automatically.
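The three stages can be sketched as one workflow in which `needs` enforces the ordering and a small shell step picks the deploy target. The npm script names (`a11y`, `test:integration`), the `v*` tag pattern, and `./scripts/deploy.sh` are hypothetical placeholders.

```yaml
name: pipeline
on:
  pull_request:
  push:
    branches: [main]
    tags: ['v*']        # assumed release tag pattern

jobs:
  validate:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        check: [lint, typecheck, test, a11y]   # hypothetical npm scripts, run in parallel
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm run ${{ matrix.check }}

  build:
    needs: validate     # starts only if every validate job succeeded
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm run build
      - run: npm run test:integration   # hypothetical script, runs against the production bundle

  deploy:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Route by trigger: PR -> preview, tag -> production, push to main -> staging.
      - name: Pick target environment
        id: target
        run: |
          if [ "$GITHUB_EVENT_NAME" = "pull_request" ]; then
            echo "env=preview" >> "$GITHUB_OUTPUT"
          elif [ "$GITHUB_REF_TYPE" = "tag" ]; then
            echo "env=production" >> "$GITHUB_OUTPUT"
          else
            echo "env=staging" >> "$GITHUB_OUTPUT"
          fi
      - run: ./scripts/deploy.sh "${{ steps.target.outputs.env }}"   # hypothetical deploy script
```

Because matrix jobs default to fail-fast, one failing check cancels its siblings, and `needs: validate` keeps the build from starting.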
GitHub Actions Patterns I Use Everywhere
Reusable workflows are the single best feature of GitHub Actions for maintainability. We define our lint, test, and build steps as reusable workflows that every repository calls. When we update the test configuration, we update it once and it propagates to all repos.
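As a rough illustration, a reusable workflow is an ordinary workflow with a `workflow_call` trigger, and each repository calls it with `uses`. The `my-org/ci-workflows` repository, the file paths, and the `node-version` input are made-up names for this sketch.

```yaml
# Shared repo (hypothetical): my-org/ci-workflows, file .github/workflows/test.yml
name: shared-test
on:
  workflow_call:
    inputs:
      node-version:
        type: string
        default: '20'

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4        # checks out the calling repository
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ inputs.node-version }}
      - run: npm ci
      - run: npm test
---
# Each consuming repo: .github/workflows/ci.yml
name: ci
on:
  pull_request:

jobs:
  test:
    uses: my-org/ci-workflows/.github/workflows/test.yml@v1   # hypothetical repo and ref
    with:
      node-version: '20'
```

Changes propagate automatically only if callers reference a moving ref such as a branch or a major-version tag. Pinning to a commit SHA trades that convenience for reproducibility.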
Path-based triggers save CI minutes by only running jobs when relevant files change. Our documentation workflow only runs when .md files change. Our frontend workflow only runs when files in app/ or components/ change. This reduces unnecessary CI runs by about 30%.
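A sketch of what those two triggers might look like, using the `paths` filter. The workflow names and the `docs:check` script are assumptions.

```yaml
# Frontend workflow: runs only when files under app/ or components/ change.
name: frontend
on:
  pull_request:
    paths:
      - 'app/**'
      - 'components/**'

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm test
---
# Documentation workflow: runs only when Markdown files change.
name: docs
on:
  pull_request:
    paths:
      - '**.md'

jobs:
  check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm run docs:check   # hypothetical script
```

One caveat worth checking in your own setup: if a path-filtered workflow is also a required status check, a PR that skips it can be left waiting on a check that never reports.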
Concurrency groups prevent wasted resources on superseded pushes. If I push three commits in quick succession, only the latest one needs a full CI run. We set concurrency groups per PR so that a new push cancels the in-progress run for the same PR.
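A per-PR concurrency group is a few lines at the top level of the workflow. This is a sketch, and the group naming scheme is just one reasonable choice.

```yaml
name: ci
on:
  pull_request:
  push:
    branches: [main]

# One group per workflow and PR. Falls back to the ref for non-PR events.
concurrency:
  group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
  cancel-in-progress: true   # a new push cancels the in-progress run in the same group

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm test
```

With this exact group, pushes to main also cancel each other. If every commit on main should get a full run, make `cancel-in-progress` conditional on the event type.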
Monitoring Your Pipeline
Your CI pipeline needs monitoring just like your production services. We track four metrics: average pipeline duration, failure rate, flaky test rate, and queue wait time. If any of these degrades, we treat it as a bug.
Flaky tests are the most insidious problem. A test that fails 5% of the time doesn't seem like a big deal, but the failures compound: in a 200-test suite, just two such tests are enough to make the pipeline fail randomly about 10% of the time (1 − 0.95² ≈ 0.10). We quarantine flaky tests into a separate non-blocking job and fix them within the sprint. If they're not fixed within two sprints, we delete them.
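One way to set up such a quarantine, sketched under the assumption that flaky tests are renamed to `*.flaky.test.*`. That naming convention is hypothetical, and so is the use of Jest for both jobs.

```yaml
name: tests
on:
  pull_request:

jobs:
  # Blocking job: everything except quarantined tests.
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npx jest --testPathIgnorePatterns='\.flaky\.test\.'

  # Non-blocking job: quarantined tests still run and report, but cannot fail the workflow.
  quarantine:
    runs-on: ubuntu-latest
    continue-on-error: true
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      # The positional argument is a regex matched against test file paths.
      - run: npx jest '\.flaky\.test\.' --passWithNoTests
```

Renaming a file moves it in or out of quarantine, which makes the change visible in code review and easy to track against the two-sprint deadline.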
At Flipkart, we built a simple dashboard that shows CI metrics over time. When pipeline duration started creeping up from 4.5 minutes to 7 minutes over a quarter, the dashboard made it visible and we addressed it before it became a problem. Without the dashboard, nobody would have noticed until it hit 15 minutes.