Fast Feedback: Observability-Driven Engineering

A painful early-career lesson about feedback speed came from a batch processing system at a bank that ran overnight. If something failed, the team would not find out until the next morning when operations rang the on-call engineer. By then, downstream systems had already consumed partial data, reconciliation was broken, and the remediation effort would consume the entire day. The feedback loop was measured in hours. The cost was measured in hundreds of thousands of dollars per incident.

Across 25 years of engineering leadership in banking and financial services, the speed of feedback loops has proven to be the single strongest predictor of team performance. This is not just intuition. The DORA research programme, published in Accelerate, demonstrated empirically that elite teams have feedback cycles orders of magnitude faster than their low-performing counterparts. During the DevSecOps transformation at a Tier-1 bank, the first capability invested in was not security tooling or deployment automation — it was observability. You cannot improve what you cannot see, and you cannot fix what you do not know is broken.

Why Fast Feedback Matters

Fast feedback loops let teams identify and address issues while the context is still fresh, which compounds into faster learning and improvement. In financial services, where regulatory obligations demand auditability and system reliability, the ability to detect and remediate issues in minutes rather than hours is not merely an efficiency gain — it is a risk control.

The research is unambiguous. Forsgren, Humble, and Kim found that high-performing teams have a Mean Time to Recovery (MTTR) measured in minutes, not days. Google's Site Reliability Engineering discipline formalised this with error budgets, Service Level Objectives (SLOs), and structured on-call practices. The thread connecting all of this work is the same: shorten the time between a change being introduced and the signal that tells you whether it worked.
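The arithmetic behind an error budget makes the idea concrete: the budget is simply the fraction of failure the SLO permits, converted into time or request volume. A minimal sketch (the function name is illustrative):

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Downtime allowed by an availability SLO over a rolling window."""
    return (1.0 - slo) * window_days * 24 * 60

# A 99.9% SLO over 30 days leaves roughly 43 minutes of budget.
budget = round(error_budget_minutes(0.999), 1)  # 43.2
```

Every incident spends from this budget; fast feedback matters because detection time is budget you cannot get back.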

Key Components

  • Continuous Integration: Regularly integrating code changes and running automated tests to provide early feedback on every commit.
  • Shift-Left Testing: Moving testing activities earlier in the software delivery lifecycle so that defects are caught when they are cheapest to fix.
  • Monitoring and Observability: Collecting and correlating logs, metrics, and traces to gain deep insight into system behaviour in production.

Continuous Integration

Continuous Integration (CI) is the practice of regularly integrating code changes into a shared repository and running automated tests to provide early feedback. Integrating and testing on every change surfaces problems quickly and reduces the risk of painful late-stage merges. Key practices include:

  • Automated Builds: Automating the build process to ensure that code changes are integrated smoothly. In regulated environments, build provenance and reproducibility are not optional — they are auditable controls.
  • Automated Testing: Running tests automatically to verify the correctness of code changes. This includes unit tests, contract tests, and static analysis.
  • Frequent Commits: Committing code changes frequently (at least daily to trunk) to detect integration issues early. Long-lived feature branches are where feedback goes to die.

Example: CI in a Banking Environment

At the bank, we implemented a CI pipeline that ran on every pull request. Within eight minutes of a developer pushing code, they received feedback on compilation, unit tests, static analysis (SonarQube), dependency vulnerability scanning (Snyk), and secrets detection. Before this pipeline existed, developers would wait until a nightly build to discover failures. The reduction in cycle time was dramatic: defect escape rate to integration environments dropped by over 60% in the first quarter.

The key insight was that speed matters as much as coverage. A test suite that takes 45 minutes to run will be bypassed by developers under deadline pressure. We invested heavily in parallelisation, test pyramid optimisation, and caching to keep the feedback loop under ten minutes.
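Most of that speed-up came from running independent checks concurrently rather than sequentially, so wall-clock time approaches the slowest stage rather than the sum of all stages. A sketch of the idea using the standard library (stage names and the stub runner are hypothetical stand-ins for real pipeline steps):

```python
import concurrent.futures
import time

def run_stage(name: str) -> tuple[str, bool]:
    """Placeholder for invoking a real pipeline stage (build, tests, scans)."""
    time.sleep(0.1)  # stand-in for actual work
    return name, True

# These checks do not depend on each other, so they can run in parallel.
stages = ["unit-tests", "static-analysis", "dependency-scan", "secrets-scan"]

with concurrent.futures.ThreadPoolExecutor() as pool:
    results = dict(pool.map(run_stage, stages))

pipeline_passed = all(results.values())
```

Test pyramid optimisation and build caching did the rest: fewer slow end-to-end tests, and no stage repeating work a previous run had already done.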

Shift-Left Testing

Shift-left testing is the practice of moving testing earlier in the development lifecycle. Rather than treating testing as a gate at the end of a sprint, testing becomes a continuous activity that begins at the moment a requirement is written.

  • Static Analysis: Running linters and static analysis tools (SonarQube, ESLint, Checkmarx) as part of the IDE experience and the CI pipeline, catching code quality and security issues before code review.
  • Contract Testing: Using tools like Pact to verify API contracts between services independently, without requiring a full integration environment.
  • BDD and Specification by Example: Writing executable specifications (Cucumber, SpecFlow) that serve as both requirements documentation and automated acceptance tests.
  • Threat Modelling: Conducting lightweight threat models during design, not after deployment. In the DevSecOps transformation, we embedded threat modelling into the definition of ready for any feature involving data flows or authentication changes.

Example: Shift-Left Security in Financial Services

One of the most impactful shift-left initiatives involved embedding SAST (Static Application Security Testing) directly into the developer's IDE. Previously, security scans ran in a separate pipeline stage, and findings arrived days after the code was written. Developers had already moved on to other work and had lost the mental context. By moving the scan to the IDE (using Checkmarx or Semgrep plugins), developers saw security findings in real time, alongside their compiler warnings. The fix rate for critical findings improved from under 40% to over 85% within six months.

Monitoring and Observability

Monitoring tells you when something is broken. Observability tells you why. As Charity Majors, Liz Fong-Jones, and George Miranda argue in Observability Engineering, traditional monitoring based on predefined dashboards and alerts is insufficient for modern distributed systems. You need the ability to ask arbitrary questions of your production telemetry without having anticipated the question in advance.

The three pillars of observability are:

  • Logs: Structured, contextual event records. In banking, every log entry must be correlated to a transaction ID for audit purposes.
  • Metrics: Numeric time-series data (latency percentiles, error rates, throughput). These power your SLOs and error budgets.
  • Traces: Distributed traces that follow a request across service boundaries. When a payment transaction crosses six microservices, traces are the only way to understand where latency is being introduced.
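Structured logging is what makes the first pillar queryable: if every entry is a self-describing record carrying the transaction ID, the aggregation layer can index and correlate it. A minimal sketch using the standard library (field names and the logger name are illustrative):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit each log record as one JSON object so aggregators can index fields."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            # Correlate every entry to a transaction for audit queries.
            "transaction_id": getattr(record, "transaction_id", None),
        })

logger = logging.getLogger("payments")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("payment authorised", extra={"transaction_id": "TXN-2024-001"})
```

In practice an OpenTelemetry trace ID would be attached the same way, tying the log pillar to the trace pillar.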

Key Practices

  • Log Aggregation: Collecting logs from various sources and aggregating them for analysis using tools like the ELK Stack (Elasticsearch, Logstash, Kibana) or Splunk.
  • Metrics Collection: Collecting metrics on system performance, such as response times, error rates, and saturation, using Prometheus, Datadog, or similar platforms.
  • Distributed Tracing: Instrumenting services with OpenTelemetry to produce traces that can be analysed in Jaeger, Zipkin, or Honeycomb.
  • Alerting on SLOs: Setting up alerts based on Service Level Objectives rather than arbitrary static thresholds. An alert that fires when you are burning through your error budget at an unsustainable rate is far more actionable than an alert that fires when CPU exceeds 80%.

Example: Observability in a Payment System

We operated a real-time payment processing system that handled millions of transactions daily. Traditional monitoring would alert us when error rates exceeded a static threshold. The problem was that error rates fluctuated naturally with traffic patterns — a 1% error rate at 3am with low volume was catastrophic, while a 1% rate during peak hours might be within normal variance.

We moved to SLO-based alerting using burn-rate windows. The system calculated how quickly we were consuming our monthly error budget and alerted only when the burn rate indicated we would breach the SLO before the end of the window. This reduced alert noise by over 70% while simultaneously catching genuine incidents faster.
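The mechanics of burn-rate alerting are straightforward. A burn rate of 1.0 means the budget is being consumed exactly as fast as the SLO window allows; paging only when both a long and a short window exceed a high threshold filters out brief spikes that self-resolve. A simplified sketch (function names are illustrative; 14.4 is the commonly cited fast-burn threshold for a one-hour window against a 30-day budget):

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than sustainable the error budget is burning.
    At 1.0, the budget runs out exactly at the end of the SLO window."""
    budget_fraction = 1.0 - slo  # e.g. 0.1% of requests may fail at 99.9%
    return error_rate / budget_fraction

def should_page(long_window_rate: float, short_window_rate: float,
                slo: float = 0.999, threshold: float = 14.4) -> bool:
    """Multiwindow check: both windows must show a high burn rate."""
    return (burn_rate(long_window_rate, slo) >= threshold
            and burn_rate(short_window_rate, slo) >= threshold)

# A sustained 2% error rate against a 99.9% SLO burns budget ~20x too fast.
assert should_page(long_window_rate=0.02, short_window_rate=0.02)
# A spike that has already subsided in the short window does not page.
assert not should_page(long_window_rate=0.02, short_window_rate=0.0001)
```

The same calculation adapts to any traffic level, which is precisely why it outperforms static thresholds on fluctuating volumes.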

Example: Log Aggregation in Regulated Environments

In banking, log aggregation is not just an engineering practice — it is a regulatory requirement. The challenge is that compliance teams need immutable, tamper-evident log storage with retention periods measured in years, while engineering teams need fast, queryable access to recent logs. We implemented a tiered architecture: hot storage in Elasticsearch for the last 30 days (fast queries for incident response), warm storage in object storage for 6 months (compliance queries), and cold archival for 7 years (regulatory retention).
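The routing logic for such a tiered architecture reduces to classifying each log entry by age. A sketch of the policy described above (tier boundaries mirror the 30-day / 6-month / 7-year figures; the function name is illustrative):

```python
from datetime import date, timedelta

HOT_DAYS = 30            # Elasticsearch: fast queries for incident response
WARM_DAYS = 183          # object storage: roughly 6 months, compliance queries
COLD_DAYS = 7 * 365      # immutable archive: regulatory retention

def storage_tier(log_date: date, today: date) -> str:
    """Decide which storage tier a log entry belongs to based on its age."""
    age = (today - log_date).days
    if age <= HOT_DAYS:
        return "hot"
    if age <= WARM_DAYS:
        return "warm"
    if age <= COLD_DAYS:
        return "cold"
    return "expired"     # eligible for deletion once retention lapses

today = date(2024, 6, 1)
assert storage_tier(today - timedelta(days=10), today) == "hot"
assert storage_tier(today - timedelta(days=90), today) == "warm"
```

A lifecycle policy in the storage platform then migrates entries between tiers automatically; tamper-evidence for the cold tier comes from write-once storage and checksums, not from application code.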

Benefits of Fast Feedback

Fast feedback loops provide several benefits, including:

  • Early Detection of Defects: Identifying defects early in the development process, reducing the cost and effort required to fix them. Industry estimates, including the oft-cited IBM Systems Sciences Institute figures, put the cost of fixing a defect in production at several times that of one caught during implementation.
  • Reduced Mean Time to Recovery (MTTR): Quickly identifying and addressing production issues leads to faster recovery. Elite performers in the DORA research achieve MTTR under one hour.
  • Improved System Stability: Continuous monitoring and observability ensure that the system remains stable by detecting anomalies before they cascade into customer-facing incidents.
  • Audit Confidence: In regulated environments, the ability to demonstrate rapid detection and remediation is itself a control. Regulators want to see that you can detect and contain issues, not just that you prevent them.

Tools and Technologies for Fast Feedback

Several tools and technologies can help in implementing fast feedback loops, including:

  • Jenkins: An open-source automation server for continuous integration and continuous delivery (CI/CD).
  • GitHub Actions: A CI/CD tool that allows you to automate workflows directly from your GitHub repository.
  • ELK Stack: A set of tools for log aggregation and analysis, including Elasticsearch, Logstash, and Kibana.
  • Prometheus: An open-source monitoring and alerting toolkit, purpose-built for reliability and dimensional data.
  • Grafana: An open-source platform for monitoring and observability, used to visualise metrics collected by Prometheus and other sources.
  • OpenTelemetry: A vendor-neutral observability framework for generating, collecting, and exporting telemetry data (traces, metrics, logs).
  • Honeycomb: A modern observability platform built on high-cardinality, high-dimensionality data, designed for debugging distributed systems.
  • Jaeger / Zipkin: Open-source distributed tracing systems for monitoring and troubleshooting microservice architectures.

References

  1. Forsgren, N., Humble, J., & Kim, G. (2018). Accelerate: The Science of Lean Software and DevOps. IT Revolution Press. The definitive research on what drives software delivery performance, including the statistical link between feedback speed and organisational outcomes.

  2. Majors, C., Fong-Jones, L., & Miranda, G. (2022). Observability Engineering: Achieving Production Excellence. O'Reilly Media. Establishes the distinction between monitoring and observability, and provides practical guidance on instrumenting modern distributed systems.

  3. Beyer, B., Jones, C., Petoff, J., & Murphy, N.R. (2016). Site Reliability Engineering: How Google Runs Production Systems. O'Reilly Media. Introduces SLOs, error budgets, and structured approaches to production reliability at scale. Available at sre.google/sre-book.

  4. Humble, J. & Farley, D. (2010). Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation. Addison-Wesley. The foundational text on deployment pipelines and the mechanics of fast, reliable software delivery.

  5. Kim, G., Humble, J., Debois, P., & Willis, J. (2016). The DevOps Handbook. IT Revolution Press. Practical implementation guidance for the Three Ways of DevOps, including amplifying feedback loops.