
Architecture Review & Technical Debt Scoring Guide

A practical scoring template with weighted criteria, benchmarks from DORA, SonarQube, and CAST — and the specific thresholds CTOs need to make debt decisions defensible.

TL;DR — Executive Summary

Most organizations have no structured way to score their architecture, which means technical debt decisions default to gut feel and politics. This guide synthesizes the leading scoring frameworks (ATAM, ISO 25010, DORA, SonarQube, CAST, CodeScene) into a single weighted scorecard CTOs can use immediately. Technical debt consumes 23–33% of developer time and costs the US economy $2.41 trillion annually; 91% of CTOs name it their top challenge. The 20/20/20/15/15/10 weighting across six dimensions is the most defensible starting template available.

What You'll Learn

  • A weighted architecture scoring template you can adapt and use this week
  • Specific green/yellow/red thresholds for code quality, test coverage, and debt density
  • How DORA, SonarQube, and CAST benchmarks map to a unified scorecard
  • The red flags that reliably predict architectural distress — and the green flags that signal health
  • The business case in hard numbers: revenue impact, retention risk, and remediation ROI

Why Most Architecture Reviews Produce Nothing Actionable

An architecture review scoring guide is a structured methodology for evaluating the health, risk, and remediation priority of a software system across weighted dimensions — producing a numeric score that supports investment decisions, vendor assessments, and technical due diligence.

Most architecture reviews end with a slide deck that says “we have some debt” and a recommendations list that goes nowhere. The problem isn't effort — it's the absence of a scoring framework that converts qualitative findings into defensible, prioritized numbers. Without scores, every architectural concern feels equally urgent (or equally ignorable), and remediation never gets funded.

  • 23–33% of developer time consumed by technical debt
  • $2.41T annual cost of poor software quality in the US
  • 91% of CTOs name tech debt their top challenge
  • $3.61 average technical debt per line of code

The Three Frameworks That Anchor Every Credible Scoring Methodology

ATAM

Surfacing Tradeoffs, Not Assigning Scores

Carnegie Mellon's SEI method uses a Utility Tree to prioritize quality attributes on two axes — importance and difficulty. ATAM produces a catalog of risks, sensitivity points, and tradeoff points, not numerical scores. Its value: surfacing hidden tensions before they become production incidents.

ISO 25010

The Most Comprehensive Quality Taxonomy

Eight top-level characteristics and 31+ sub-characteristics, including Functional Suitability, Performance, Reliability, Security, and Maintainability. Practitioners operationalize the taxonomy through Likert-based weights with hierarchical aggregation. A 2025 finding: industry practitioners prioritize maintainability and readability over the attributes academics most often study.

DORA

The Only Widely Benchmarked Quantitative Framework

Google's Accelerate program surveyed 39,000+ professionals in 2024. Four core metrics plus Rework Rate. A critical finding worth flagging: AI tooling correlated with worsened delivery performance — a 25% increase in AI adoption corresponded to a 1.5% throughput decrease and 7.2% stability decrease.

DORA Performance Tiers

| Metric | Elite | High | Medium | Low |
| --- | --- | --- | --- | --- |
| Deployment frequency | Multiple times/day | Daily to weekly | Weekly to monthly | Monthly+ |
| Lead time for changes | < 1 day | 1 day – 1 week | 1 week – 1 month | > 1 month |
| Change failure rate | ~5% | ~10–20% | ~10–15% | ~64% |
| Recovery time | < 1 hour | < 1 day | < 1 day | > 1 week |

The Scoring Scales: What Green, Yellow, and Red Actually Mean

SonarQube anchors maintainability to the Technical Debt Ratio (TDR): Total Remediation Cost ÷ (Cost per LOC × Total LOC). CodeScene combines static analysis with Git history, identifying hotspots — the 2–3% of code responsible for 25–70% of all defects. CAST provides portfolio-level analysis across 1,400+ applications and 550 million lines of code.

| Tool / Method | Scale | Green | Yellow | Red |
| --- | --- | --- | --- | --- |
| SonarQube TDR | A–E (%) | A: ≤5% | B–C: 6–20% | D–E: >20% |
| CodeScene Health | 1–10 | 8–10 | 5–7 | 1–4 |
| CAST Health | 0–100 | >75 | 53–75 | <53 |
| Industry TDR | % | ≤5% | 5–15% | >15% |
| Debt per LOC | $/LOC | <$2 | $2–5 | >$5 |
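As a quick sanity check, the TDR arithmetic and SonarQube's rating bands can be sketched in a few lines of Python. The 30-minutes-per-LOC development cost mirrors SonarQube's default assumption, and the remediation figure in the example is illustrative, not a real measurement.

```python
def technical_debt_ratio(remediation_minutes: float,
                         loc: int,
                         minutes_per_loc: float = 30.0) -> float:
    """TDR = total remediation cost / estimated cost to rewrite
    the codebase (cost per LOC x total LOC).

    minutes_per_loc=30 mirrors SonarQube's default development
    cost; treat it as a tunable assumption."""
    development_cost = minutes_per_loc * loc
    return remediation_minutes / development_cost

def sonar_rating(tdr: float) -> str:
    # SonarQube's default maintainability rating bands (A-E).
    if tdr <= 0.05:
        return "A"
    if tdr <= 0.10:
        return "B"
    if tdr <= 0.20:
        return "C"
    if tdr <= 0.50:
        return "D"
    return "E"

# Example: 120k LOC with ~400 person-days (192,000 min) of
# estimated remediation work.
tdr = technical_debt_ratio(192_000, 120_000)
print(f"TDR = {tdr:.1%}, rating {sonar_rating(tdr)}")
```

Anything at or below the 5% TDR line rates A (green); the yellow band in the table above spans the B and C ratings.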

The Architecture Scorecard: Six Dimensions, Explicit Weights

The ARDURA Consulting Architecture Review Scorecard provides the most detailed publicly available weighting: Scalability 20%, Security 20%, Maintainability 20%, Performance 15%, Deployment 15%, Documentation 10%. Each dimension scores 1–5. The weighted total maps to four action tiers:

  • 4.0–5.0 (Strong): incremental improvements only
  • 3.0–3.9 (Adequate): targeted investment needed before next growth phase
  • 2.0–2.9 (Significant gaps): remediate before building new capability
  • 1.0–1.9 (Critical blocker): major intervention required

Scalability (20%)
Green: Horizontal scaling with externalized state, auto-scaling capable of 3–5× peak traffic.
Red: The answer to "what happens if traffic doubles?" is "we add a bigger server."

Security (20%)
Green: Defense-in-depth strategy. Threat modeling completed. Secrets management in place.
Red: Security described as "handled by the perimeter."

Maintainability (20%)
Green: Test coverage ≥80% on critical paths, cyclomatic complexity ≤10, code duplication <5%.
Red: Only one person who can explain a critical component.

Performance (15%)
Green: API P95 latency <300ms, availability ≥99.95%.
Red: Inability to answer "what is our P99 latency?" — observability doesn't exist.

Deployment (15%)
Green: CI/CD pipeline <15 min, zero-downtime deployments, infrastructure as code.
Red: Deployments requiring manual runbook execution.

Documentation (10%)
Green: ADRs maintained and current. C4 diagrams updated within six months. New devs productive in one day.
Red: Architecture documentation that exists only in someone's head.
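Taken together, the weights and action tiers above reduce to one small calculation. A minimal Python sketch — the dimension scores in the example are illustrative:

```python
# Weights from the six-dimension scorecard described above.
WEIGHTS = {
    "scalability": 0.20,
    "security": 0.20,
    "maintainability": 0.20,
    "performance": 0.15,
    "deployment": 0.15,
    "documentation": 0.10,
}

TIERS = [  # (floor, label, action)
    (4.0, "Strong", "Incremental improvements only"),
    (3.0, "Adequate", "Targeted investment before next growth phase"),
    (2.0, "Significant gaps", "Remediate before building new capability"),
    (1.0, "Critical blocker", "Major intervention required"),
]

def weighted_score(scores: dict) -> float:
    # Every dimension must be scored on the 1-5 scale.
    assert scores.keys() == WEIGHTS.keys(), "score every dimension"
    return sum(WEIGHTS[d] * s for d, s in scores.items())

def action_tier(total: float) -> str:
    for floor, label, action in TIERS:
        if total >= floor:
            return f"{label}: {action}"
    raise ValueError("weighted total must be on the 1-5 scale")

# Illustrative assessment: strong security, weak docs and deployment.
scores = {"scalability": 2, "security": 4, "maintainability": 3,
          "performance": 3, "deployment": 2, "documentation": 1}
total = weighted_score(scores)
print(f"{total:.2f} -> {action_tier(total)}")  # 2.65 -> Significant gaps
```

Note how a single strong dimension cannot rescue the total: the example scores a 4 on security yet still lands in the "remediate before building" tier.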

Code Quality Benchmarks: The Specific Numbers

Test Coverage
≥80% Green — SonarQube default gate
60–79% Yellow — acceptable for non-critical paths
<60% Red — insufficient coverage

Safety-critical systems (DO-178B, IEC 61508) mandate 100% MC/DC. Going from 80% to 100% is disproportionately expensive — pursue only for critical paths.

Cyclomatic Complexity
1–10 Green — low risk per method
11–20 Yellow — moderate risk
>20 Red — high risk; >50 = untestable

SonarQube's cognitive complexity metric uses a default threshold of 15 per function, accounting for nesting depth and human readability.

Code Duplication
<3% Green — SIG/TÜV 5-star rating
3–10% Yellow — 4-star to acceptable
>20% Red — SIG/TÜV 1-star

Industry average is approximately 18.5% — most codebases have substantial room for improvement here.

API Performance
<300ms P95 Green — target for most APIs
300–1000ms Yellow — user frustration begins
>1s Red — measurable user abandonment

Standard SLO: P95 under 500ms over a 30-day rolling window. Enterprise availability: 99.95% uptime (≤4.38 hours/year downtime).
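One way to operationalize these benchmarks is a small traffic-light classifier fed by your static-analysis and APM exports. A sketch, with the bands simplified from the thresholds above (the unpublished gap between the yellow and red duplication bands is treated as red here):

```python
# Illustrative traffic-light bands for the benchmarks above.
# Each entry: metric -> (green test, yellow test); anything that
# passes neither test is red.
BANDS = {
    "coverage_pct":    (lambda v: v >= 80,  lambda v: v >= 60),
    "cyclomatic":      (lambda v: v <= 10,  lambda v: v <= 20),
    "duplication_pct": (lambda v: v < 3,    lambda v: v <= 10),
    "p95_latency_ms":  (lambda v: v < 300,  lambda v: v <= 1000),
}

def rate(metric: str, value: float) -> str:
    green, yellow = BANDS[metric]
    if green(value):
        return "green"
    if yellow(value):
        return "yellow"
    return "red"

# Hypothetical readings for one service.
for metric, value in [("coverage_pct", 72), ("cyclomatic", 24),
                      ("duplication_pct", 18.5), ("p95_latency_ms", 250)]:
    print(metric, value, rate(metric, value))
```

Note that the 18.5% industry-average duplication figure rates red under these bands, which is exactly the point the benchmarks make.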

Red Flags & Green Flags

Red Flags That Predict Architectural Distress

  • Scalability ceiling confirmed: the answer to "what happens if traffic doubles?" is "we add a bigger server."
  • Perimeter-only security: security described as "handled by the firewall" with no defense-in-depth strategy.
  • Bus factor of 1: only one person can explain a critical component.
  • No observability: inability to answer "what is our P99 latency?" because monitoring doesn't exist.
  • Manual deployments: deployments requiring manual runbook execution.
  • Tribal knowledge architecture: architecture documentation exists only in someone's head.
  • Debt bankruptcy threshold: tech debt exceeding 50% of technology estate value (McKinsey's "bankruptcy" threshold).
  • Point-to-point integration sprawl: proliferation of point-to-point integrations with no consistent data model.

Green Flags That Indicate Architecture Health

  • Horizontal scaling with externalized state and auto-scaling capable of 3–5× peak traffic
  • Comprehensive test coverage: >70% line, >80% on critical paths
  • CI/CD pipeline completing in <15 minutes from commit to deployable artifact
  • Zero-downtime deployments with canary or blue-green capability
  • Maintained Architecture Decision Records — DORA 2024: teams with ADRs are 2× as likely to meet delivery targets
  • Infrastructure as code across all environments
  • Distributed tracing with correlation IDs
  • API P95 latency <200ms
  • Feature flags for safe rollout
  • New developers productive within one day
  • C4 model diagrams updated within the last six months

Gartner Context

Architecture debt is the most common form of technical debt, cited by 67% of organizations, followed by infrastructure debt (58%) and code debt (53%). Architectural-level debt has the highest impact on product quality and delivery lead time — it's the hardest to score but the most consequential to get right.

The Business Case in Hard Numbers

  • 33% of developer time lost to maintenance (Stripe, 1,000+ devs)
  • 40% of IT balance sheets are tech debt (McKinsey, 50 CIOs)
  • 62% of developers cite tech debt as their #1 frustration (Stack Overflow 2024)
  • 20% higher revenue growth for 80th-percentile Tech Debt Score firms

Retention risk is real. A Stepsize survey found 51% of engineers have left or considered leaving specifically due to technical debt, with 21% citing it as the primary reason for changing jobs. Stack Overflow's 2024 survey (65,000+ developers) found technical debt is the #1 developer frustration — roughly double the second-place complaint.

Revenue upside is quantifiable. Gartner data shows organizations actively managing technical debt achieve 50% faster service delivery times. One major cloud provider case study demonstrated a reduction in the debt "tax" from 75% of engineer time down to 25% — freeing engineers to spend 50% more time on value-generating work.

Sphere Analysis

According to Sphere's analysis of enterprise architecture engagements, the most common failure pattern is not the absence of a scoring methodology — it's the absence of a remediation governance process. Teams score their debt, identify the hotspots, and then fail to protect engineering capacity to address them. The score becomes a document, not a driver.

Key Takeaways

1. The 20/20/20/15/15/10 weighting is the most defensible starting template. Adjust weights for your context — a fintech platform should weight security higher; an internal tool may weight documentation lower.

2. $3.61 per LOC is the average technical debt density. If you don't know your number, you're flying blind on a cost that compounds at 14% annually.

3. Fix the 2–3% first. CodeScene's data shows that hotspots account for 25–70% of all defects. Targeted remediation beats broad cleanup every time.

4. AI tools have not reduced toil. Gartner predicts a 2,500% increase in software defects by 2028 from prompt-to-app approaches. Score your AI-assisted code the same way you score everything else.

5. ADRs are a leading indicator of delivery performance. DORA's 2024 data found teams with maintained Architecture Decision Records are more than 2× as likely to meet delivery targets. If your team isn't writing them, start there.


Get Your Architecture Scored

Sphere's architecture team will evaluate your codebase against the six-dimension framework — and deliver a scored report with prioritized remediation steps you can take to your board.

Frequently Asked Questions

How do you score architecture quality in a structured way?
Score architecture quality by evaluating six weighted dimensions — Scalability (20%), Security (20%), Maintainability (20%), Performance (15%), Deployment (15%), and Documentation (10%) — on a 1–5 scale per dimension. A weighted total of 4.0–5.0 indicates strong architecture; below 2.0 signals a critical blocker requiring major intervention before new development. Anchor each dimension to specific thresholds: test coverage, cyclomatic complexity, TDR, and DORA metrics are the most commonly cited.
What does a technical debt scoring template look like?
A practical technical debt scoring template combines a Technical Debt Ratio (TDR) calculation with a weighted architecture scorecard. Start with SonarQube's A–E TDR scale (A = ≤5%, E = >50%) for code-level debt, CAST's 0–100 Software Health score for portfolio-level view, and CodeScene's 1–10 Code Health score for behavioral analysis. Overlay these on the six-dimension architecture scorecard to get both a code-level debt score and a systems-level architecture health score.
What's the right architecture review scoring methodology for an enterprise?
For enterprise use, combine DORA metrics (delivery performance benchmarks), ISO 25010 (quality characteristic taxonomy), and practitioner-weighted scorecards (the 20/20/20/15/15/10 split). ATAM adds value when evaluating a new system design or a major architectural change and you need to surface tradeoffs before they're built in. For ongoing governance, embed Architectural Fitness Functions in your CI/CD pipeline — automated checks that catch architectural drift before it becomes debt.
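As a concrete example of an architectural fitness function, a layering check can run in CI using nothing but the standard library. The services/ui layer names below are hypothetical placeholders; the pattern — "lower layers must never import upper layers" — is what matters:

```python
import ast
from pathlib import Path

# Hypothetical layering rule: code under ui/ may depend on services/,
# but services/ must never import from ui/. Adapt the directory names
# to your own package layout.
FORBIDDEN = {"services": {"ui"}}

def imported_top_levels(source: str) -> set:
    """Top-level package names imported by a module's source code."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            names.add(node.module.split(".")[0])
    return names

def layering_violations(root: Path) -> list:
    """Scan each restricted layer for imports of its banned layers."""
    violations = []
    for layer, banned in FORBIDDEN.items():
        for py in (root / layer).rglob("*.py"):
            hits = imported_top_levels(py.read_text()) & banned
            violations += [f"{py}: imports {h}" for h in sorted(hits)]
    return violations

# In CI, fail the build on architectural drift, e.g.:
# assert not layering_violations(Path("src")), layering_violations(Path("src"))
```

Wired into the pipeline, a check like this turns an architecture decision into an enforced invariant rather than a diagram that drifts out of date.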
What criteria should an architecture assessment include?
At minimum: scalability (can the system handle 3–5× peak traffic?), security (defense-in-depth, not just perimeter), maintainability (test coverage, coupling, duplication), performance (P95/P99 latency SLOs), deployment (pipeline speed, zero-downtime capability), and documentation (ADRs, C4 diagrams, onboarding time). Gartner's data shows architecture debt is cited by 67% of organizations as their most common form of technical debt — most assessments underweight structural concerns and overweight code-level metrics.
How do I use an architecture health check scoring guide to prioritize remediation?
Use the weighted scorecard to identify the lowest-scoring dimension, then cross-reference with CodeScene hotspot analysis to find the specific code areas responsible for disproportionate defect rates. The 2–3% of code that receives the most commits typically drives 25–70% of all defects — fix those first. McKinsey's "bankruptcy" threshold (tech debt exceeding 50% of technology estate value) is the hard stop: if you're there, new feature development should pause until debt is reduced below 20%.
Is a formal architecture review scoring methodology worth it for mid-market companies?
Yes — especially if you're preparing for a funding round, M&A transaction, or major platform migration. McKinsey's 220-company study found firms in the 80th percentile for Tech Debt Score achieve 20% higher revenue growth than bottom-quintile firms. Technical Due Diligence buyers now routinely quantify debt density and DORA tier as part of deal valuation. A credible scoring methodology turns a liability (known debt) into a negotiating asset (quantified, managed debt with a remediation roadmap).
Sphere Research Team
Architecture Practice — Sphere

The Sphere Research Team is the editorial and research arm of Sphere's CTO Accelerator. Our analysis draws on 20+ years of enterprise delivery across AI, cloud, data, and modernization — spanning 230+ projects in financial services, healthcare, insurance, manufacturing, and private equity. Every framework, benchmark, and cost range published here is grounded in real project data and reviewed by Sphere's senior engineering leadership.