DORA Metrics
You Cannot Transform What You Cannot Measure
This lesson emerges repeatedly in large-scale transformation work. Early initiatives often feel successful -- teams are happier, releases seem smoother, and nobody is complaining. But when leadership asks for demonstrable impact at a board-level technology committee, anecdotes are insufficient. That realization is why measurement must underpin every effective transformation. By the time the DevSecOps transformation began at a Tier-1 bank, measurement was not an afterthought -- it was the foundation.
DORA metrics provided a common language. Before their adoption, every team had its own definition of "fast" and "stable." One team measured deployment frequency by counting production releases; another counted deployments to staging. One team measured lead time from the moment a developer started coding; another measured it from the moment a ticket was created. These inconsistencies made it impossible to benchmark, compare, or identify systemic bottlenecks. The organization standardized on the DORA definitions from Forsgren, Humble, and Kim's "Accelerate" research, instrumented every pipeline to emit metrics automatically, and built dashboards that gave every team -- and every executive -- a consistent, real-time view of delivery performance. Within six months, the data was driving decisions. Teams that were struggling could see where they were struggling. Teams that were excelling could articulate exactly what practices were driving their performance. The transformation stopped being a mandate from leadership and started being a data-informed competition among teams to improve.
Overview of DORA Metrics
DORA (DevOps Research and Assessment) metrics are a set of key performance indicators used to measure the performance and effectiveness of software development and delivery practices. Developed through years of rigorous academic research led by Dr. Nicole Forsgren, Jez Humble, and Gene Kim, and validated against survey data from tens of thousands of practitioners worldwide, these metrics represent the most empirically grounded framework for measuring software delivery performance. The four key DORA metrics are:
- Deployment Frequency: How often new code is deployed to production.
- Lead Time for Changes: The time it takes for a code change to go from commit to production.
- Change Failure Rate: The percentage of changes that result in a failure in production.
- Mean Time to Restore (MTTR): The average time it takes to restore service after a failure.
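The four definitions above can be sketched as straightforward computations over deployment records. A minimal illustration follows; the record schema and sample data are assumptions for demonstration, not the instrumentation described later in this chapter.

```python
from datetime import datetime, timedelta

# Hypothetical deployment records: (commit_time, deploy_time, failed, restored_time).
# Schema and values are illustrative assumptions.
deployments = [
    (datetime(2024, 1, 1, 9), datetime(2024, 1, 1, 12), False, None),
    (datetime(2024, 1, 2, 9), datetime(2024, 1, 2, 15), True,
     datetime(2024, 1, 2, 16)),
    (datetime(2024, 1, 3, 9), datetime(2024, 1, 3, 11), False, None),
    (datetime(2024, 1, 4, 9), datetime(2024, 1, 4, 13), False, None),
]

window_days = 7  # observation window for the frequency calculation

# Deployment Frequency: deployments per day over the window.
deployment_frequency = len(deployments) / window_days

# Lead Time for Changes: mean commit-to-production time.
lead_times = [d[1] - d[0] for d in deployments]
mean_lead_time = sum(lead_times, timedelta()) / len(lead_times)

# Change Failure Rate: share of deployments that caused a failure.
failures = [d for d in deployments if d[2]]
change_failure_rate = len(failures) / len(deployments)

# MTTR: mean time from failed deployment to restoration.
restore_times = [d[3] - d[1] for d in failures]
mttr = sum(restore_times, timedelta()) / len(restore_times)

print(deployment_frequency, mean_lead_time, change_failure_rate, mttr)
```

In a real pipeline these records would come from emitted pipeline events rather than hard-coded tuples, but the arithmetic is the same.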
The "Accelerate" research demonstrated a critical insight that defied conventional wisdom: these four metrics are not in tension with each other. High-performing organizations deploy more frequently AND have lower failure rates AND restore faster. Speed and stability are not trade-offs -- they are complementary outcomes of the same underlying practices. This finding is the empirical bedrock upon which modern DevSecOps transformation is built.
The DORA Performance Profiles
The annual State of DevOps Reports, published by Google's DORA team, categorize organizations into performance clusters. As of the 2023 and 2024 reports, these clusters are:
| Metric | Elite | High | Medium | Low |
|---|---|---|---|---|
| Deployment Frequency | On-demand (multiple times per day) | Between once per day and once per week | Between once per week and once per month | Between once per month and once every six months |
| Lead Time for Changes | Less than one day | Between one day and one week | Between one week and one month | Between one month and six months |
| Change Failure Rate | 0-5% | 5-10% | 10-15% | More than 15% |
| MTTR | Less than one hour | Less than one day | Between one day and one week | More than one week |
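The deployment-frequency bands in the table can be turned into a simple classifier. The boundary handling below is an assumption -- the published bands are descriptive ranges, not precise cut-offs:

```python
def classify_deployment_frequency(deploys_per_day: float) -> str:
    """Bucket a team's deployment frequency into the DORA tiers above.
    Exact boundary treatment is an assumption made for illustration."""
    if deploys_per_day > 1:          # on-demand, multiple times per day
        return "Elite"
    if deploys_per_day >= 1 / 7:     # between once per day and once per week
        return "High"
    if deploys_per_day >= 1 / 30:    # between once per week and once per month
        return "Medium"
    return "Low"                     # once per month or less

print(classify_deployment_frequency(3.0))   # multiple deploys per day
```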
When measurement began at the bank, most teams fell in the Low to Medium range. Within eighteen months of the golden path rollout, over sixty percent of teams had moved to High, and three teams -- all on the golden path with fully automated security scanning -- reached Elite. The correlation between golden path adoption and DORA metric improvement was not accidental; it was the designed outcome.
The DORA framework continues to evolve. The 2025 edition -- titled the "State of AI-Assisted Software Development Report" -- marks a significant expansion of the research program. Drawing on thousands of survey responses and over 100 hours of practitioner interviews, the 2025 report introduces a new metric, Rework Rate (the percentage of changes that require rework after initial deployment), alongside a quasi-metric for Reliability. Perhaps most significantly, the report replaces the previous four-tier performance model (Elite, High, Medium, Low) with seven team archetypes identified through cluster analysis, recognizing that performance is multidimensional and that teams exhibit different strengths and weaknesses across the metric set. This evolution from simple tiers to nuanced archetypes reflects a maturity in the research -- and demands a corresponding maturity in how organizations interpret and act on their DORA data.
Importance of DORA Metrics in AI and DevSecOps Practices
DORA metrics are crucial for measuring the performance and effectiveness of AI and DevSecOps practices. They provide valuable insights into the efficiency, reliability, and stability of software development and delivery processes. By tracking and optimizing these metrics, organizations can improve their ability to deliver value to customers quickly and reliably.
In financial services specifically, DORA metrics serve a dual purpose: they measure delivery performance and they provide evidence of operational risk management. Regulators increasingly expect banks to demonstrate that they can deploy changes reliably and recover from failures quickly. DORA metrics, when instrumented properly, provide exactly this evidence.
Deployment Frequency
Deployment Frequency measures how often new code is deployed to production. High deployment frequency indicates a mature and efficient software delivery process. In AI and DevSecOps practices, frequent deployments enable teams to quickly iterate on improvements and deliver new features to users more rapidly.
The research consistently shows that higher deployment frequency correlates with lower risk, not higher risk. This is counterintuitive to many executives in regulated industries, where the instinct is to deploy less frequently to reduce risk. The data tells the opposite story: organizations that deploy infrequently accumulate large batches of changes that are harder to test, harder to debug when they fail, and harder to roll back. Frequent, small deployments are safer deployments.
At the bank, this evidence was presented to the board risk committee, earning support for increasing deployment frequency. The research was backed up with internal data: teams that deployed weekly had a change failure rate of three percent; teams that deployed monthly had a change failure rate of fourteen percent. The data was unambiguous.
Lead Time for Changes
Lead Time for Changes measures the time it takes for a code change to go from commit to production. Short lead times indicate an efficient development process. In AI and DevSecOps practices, reducing lead time for changes allows teams to respond faster to new data, changing requirements, and emerging threats.
Lead time decomposition is where the real diagnostic value lies. At the bank, the pipeline was instrumented to measure five distinct segments of lead time:
- Coding time: From first commit to pull request creation.
- Review time: From pull request creation to approval.
- Security scan time: Time spent in automated security scanning stages.
- Build and test time: Time in CI build and test execution.
- Deployment time: From merge to production deployment completion.
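The five-segment decomposition can be sketched directly from pipeline timestamps. The event names and sample values below are illustrative assumptions, chosen to mirror the bank's two-day review bottleneck:

```python
from datetime import datetime, timedelta

# Hypothetical timestamps for a single change; the event names mirror the
# five segments above, but the schema itself is an assumption.
change = {
    "first_commit":   datetime(2024, 3, 1, 9, 0),
    "pr_created":     datetime(2024, 3, 1, 15, 0),
    "pr_approved":    datetime(2024, 3, 3, 15, 0),   # two days in review
    "scans_finished": datetime(2024, 3, 3, 15, 20),
    "tests_finished": datetime(2024, 3, 3, 16, 0),
    "deployed":       datetime(2024, 3, 3, 16, 30),
}

segments = {
    "coding":     change["pr_created"] - change["first_commit"],
    "review":     change["pr_approved"] - change["pr_created"],
    "security":   change["scans_finished"] - change["pr_approved"],
    "build_test": change["tests_finished"] - change["scans_finished"],
    "deploy":     change["deployed"] - change["tests_finished"],
}

total = change["deployed"] - change["first_commit"]
for name, span in segments.items():
    print(f"{name:10s} {span}  ({span / total:.1%})")
```

Even on this toy example, the review segment dominates total lead time while security scanning accounts for well under one percent -- the same shape of result the bank's real decomposition produced.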
This decomposition revealed that security scanning, which many teams blamed for slow pipelines, accounted for less than eight percent of total lead time. The actual bottleneck was review time -- pull requests were sitting in review queues for an average of two days. This data-driven insight redirected improvement efforts from optimizing scan performance (which would have yielded marginal gains) to improving review practices (which yielded substantial gains).
Change Failure Rate
Change Failure Rate measures the percentage of changes that result in a failure in production. Low change failure rates indicate a stable and reliable software delivery process. In AI and DevSecOps practices, monitoring and reducing change failure rates helps ensure that updates do not negatively impact production systems.
At the bank, the standard DORA definition of change failure was extended to include security failures: deployments that introduced a vulnerability detected post-deployment, configurations that violated compliance policies, or changes that triggered a security incident. This expanded definition aligned DORA metrics with the security governance framework and gave the organization a single metric that captured both operational and security reliability.
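The extended definition can be sketched as a predicate over failure categories. The category names and change records below are illustrative assumptions:

```python
# A change counts as failed if it caused an operational incident OR a
# post-deployment security or compliance finding. Categories are assumed.
FAILURE_CATEGORIES = {"operational", "vulnerability", "compliance", "security_incident"}

changes = [
    {"id": "c1", "failures": []},
    {"id": "c2", "failures": ["operational"]},
    {"id": "c3", "failures": ["vulnerability"]},   # counted under the extended definition
    {"id": "c4", "failures": []},
    {"id": "c5", "failures": []},
]

def change_failure_rate(changes, categories=FAILURE_CATEGORIES):
    """Fraction of changes with at least one failure in the given categories."""
    failed = sum(
        1 for c in changes
        if any(f in categories for f in c["failures"])
    )
    return failed / len(changes)

print(f"{change_failure_rate(changes):.0%}")
```

Passing a narrower category set (for example, only operational failures) recovers the standard DORA definition, which makes it easy to report both views from the same data.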
The 2024 State of DevOps Report introduced additional nuance around change failure rate, noting that the relationship between deployment frequency and change failure rate is moderated by the quality of automated testing and the maturity of deployment practices. This matched the bank's internal findings: teams with comprehensive automated test suites (both functional and security) had consistently lower change failure rates regardless of deployment frequency.
Mean Time to Restore (MTTR)
Mean Time to Restore (MTTR) measures the average time it takes to restore service after a failure. Low MTTR indicates a resilient and responsive software delivery process. In AI and DevSecOps practices, minimizing MTTR ensures that any issues in production are resolved quickly, reducing downtime and maintaining service reliability.
MTTR is the metric that most directly reflects an organization's operational resilience. In banking, where service availability is both a customer expectation and a regulatory requirement, MTTR is scrutinized at the highest levels. At the bank, MTTR was tracked separately for four categories:
- Functional failures: Application bugs that impacted user experience.
- Infrastructure failures: Platform or infrastructure issues.
- Security incidents: Vulnerabilities or breaches detected in production.
- Compliance violations: Changes that violated regulatory controls.
Each category had different MTTR targets and different escalation paths. Security incidents had the most aggressive MTTR targets -- under thirty minutes for critical severity -- and triggered automatic incident response playbooks that included isolation, forensic evidence preservation, and regulatory notification assessment.
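Per-category MTTR with target checks might look like the following sketch; the thirty-minute security target comes from the text, while the other targets and the incident records are illustrative assumptions:

```python
from datetime import timedelta

# Per-category MTTR targets. The 30-minute security target is from the text;
# the rest are assumed values for illustration.
MTTR_TARGETS = {
    "functional":     timedelta(hours=4),
    "infrastructure": timedelta(hours=2),
    "security":       timedelta(minutes=30),
    "compliance":     timedelta(hours=1),
}

incidents = [
    {"category": "functional", "duration": timedelta(hours=3)},
    {"category": "security",   "duration": timedelta(minutes=18)},
    {"category": "security",   "duration": timedelta(minutes=42)},
]

def mttr_by_category(incidents):
    """Mean restoration time per incident category."""
    totals, counts = {}, {}
    for i in incidents:
        cat = i["category"]
        totals[cat] = totals.get(cat, timedelta()) + i["duration"]
        counts[cat] = counts.get(cat, 0) + 1
    return {cat: totals[cat] / counts[cat] for cat in totals}

for cat, mttr in mttr_by_category(incidents).items():
    status = "OK" if mttr <= MTTR_TARGETS[cat] else "BREACH"
    print(f"{cat}: {mttr} (target {MTTR_TARGETS[cat]}) -> {status}")
```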
The "Accelerate" research demonstrates that rapid recovery matters more than attempting to prevent every failure. Failures are inevitable in complex systems. The question is not whether you will fail, but how quickly you can detect, respond, and recover. This philosophy informed the entire approach to resilience engineering at the bank.
Implementing DORA Metrics at Scale
Implementing DORA metrics across a large organization requires more than dashboards. It requires standardized instrumentation, consistent definitions, and a governance model that uses the data without weaponizing it.
Instrumentation
At the bank, DORA metrics were instrumented at the pipeline level, not the team level. Every golden path pipeline emitted standardized events to a central metrics platform:
- Deployment events: Timestamp, service identifier, environment, deployer, commit hash.
- Change events: Commit timestamp, merge timestamp, deployment timestamp (enabling lead time calculation).
- Failure events: Incident timestamp, severity, category, resolution timestamp (enabling MTTR and CFR calculation).
This instrumentation was automatic for teams on the golden path and required minimal configuration for teams with custom pipelines. The data flowed into dashboards accessible to everyone -- from individual contributors to the CTO.
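A deployment event of the kind described above might be emitted like this; the field names follow the list, but the exact schema and transport are assumptions:

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class DeploymentEvent:
    """Standardized deployment event; fields mirror the list above.
    The concrete schema is an assumption for illustration."""
    timestamp: str
    service: str
    environment: str
    deployer: str
    commit_hash: str

def emit(event: DeploymentEvent) -> str:
    """Serialize the event for shipping to a central metrics platform.
    A real pipeline would POST this; here we just return the JSON."""
    return json.dumps(asdict(event), sort_keys=True)

payload = emit(DeploymentEvent(
    timestamp=datetime.now(timezone.utc).isoformat(),
    service="payments-api",        # hypothetical service identifier
    environment="production",
    deployer="ci-pipeline",
    commit_hash="a1b2c3d",
))
print(payload)
```

Because every pipeline emits the same schema, lead time and deployment frequency can be computed centrally by joining change events to deployment events on the commit hash.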
Using Metrics Without Weaponizing Them
The "Accelerate" research is explicit about this: DORA metrics must be used to improve, not to punish. The moment teams believe that metrics will be used against them, they will game the metrics, suppress incident reporting, and optimize for appearances rather than outcomes. At the bank, this principle was enforced through several mechanisms:
- Metrics were reported at the team level, never at the individual level.
- Teams set their own improvement targets in consultation with their engineering managers.
- Cross-team comparisons were presented as anonymized quartile distributions, not league tables.
- Blameless post-incident reviews were mandatory, and findings were shared across the organization as learning opportunities.
This approach built trust in the metrics program and ensured that the data drove genuine improvement rather than compliance theater.
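The anonymized quartile reporting can be sketched as follows: only the distribution boundaries leave the function, never the team names. The sample lead times are hypothetical:

```python
import statistics

# Hypothetical per-team lead times in days. Publishing only quartile
# boundaries, never team names, avoids league tables.
team_lead_times = {
    "team-a": 2.0, "team-b": 25.0, "team-c": 4.5, "team-d": 12.0,
    "team-e": 1.5, "team-f": 8.0, "team-g": 30.0, "team-h": 6.0,
}

def anonymized_quartiles(metric_by_team):
    """Return quartile boundaries of a metric, dropping team identities."""
    values = sorted(metric_by_team.values())
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {"q1": q1, "median": median, "q3": q3}

print(anonymized_quartiles(team_lead_times))
```

Each team can then locate itself within the distribution without any team being singled out in a published ranking.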
Examples of Applying DORA Metrics in AI and DevSecOps Projects
- Deployment Frequency: By increasing the frequency of model deployments, teams can iterate on model improvements and deliver new features to users more rapidly. For example, a team might aim to deploy new model versions weekly instead of monthly. At the bank, the ML model serving teams increased deployment frequency from monthly to weekly after adopting the golden path, with automated model validation and shadow deployment stages that caught regressions before they reached production traffic.
- Lead Time for Changes: Reducing the lead time for changes allows teams to respond faster to new data and changing requirements. For instance, automating the data pipeline and model training process can significantly reduce the time it takes to deploy updated models. Mean lead time across the organization dropped from twenty-three days to four days within the first year, with the primary improvement coming from automated security scanning (eliminating the six-week manual review) and streamlined review processes.
- Change Failure Rate: Monitoring and reducing the change failure rate helps ensure that model updates do not negatively impact production systems. Implementing robust testing and validation processes can catch issues before they reach production. The organization-wide change failure rate dropped from eleven percent to four percent, with security-related failures dropping from three percent to under one percent.
- Mean Time to Restore (MTTR): Minimizing MTTR ensures that any issues in production are resolved quickly, reducing downtime and maintaining service reliability. For example, automated rollback mechanisms can help restore service quickly after a failure. Automated canary analysis and instant rollback reduced the P1 MTTR from four hours to eighteen minutes.
References
- Forsgren, N., Humble, J., and Kim, G. Accelerate: The Science of Lean Software and DevOps: Building and Scaling High Performing Technology Organizations. IT Revolution Press, 2018.
- DORA State of DevOps Report 2023. Google Cloud, 2023. Available at: https://dora.dev/research/2023/dora-report/
- DORA State of DevOps Report 2024. Google Cloud, 2024. Available at: https://dora.dev/research/2024/dora-report/
- Kim, G., Humble, J., Debois, P., and Willis, J. The DevOps Handbook: How to Create World-Class Agility, Reliability, and Security in Technology Organizations. IT Revolution Press, 2016.
- Kim, G., Behr, K., and Spafford, G. The Phoenix Project: A Novel about IT, DevOps, and Helping Your Business Win. IT Revolution Press, 2013.
- Humble, J. and Farley, D. Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation. Addison-Wesley, 2010.
- DORA State of AI-Assisted Software Development Report 2025. Google Cloud, 2025. Available at: https://dora.dev/research/2025/dora-report/