AI Engineering

Overview

AI Engineering focuses on the practical implementation of AI models and systems. It involves the development, deployment, and maintenance of AI solutions that can solve real-world problems — not in a research lab, but in production environments where reliability, governance, and auditability matter.

Sponsoring the Feature Spec Generator from the Seattle Tech Hub meant confronting a question that every engineering leader in a regulated environment eventually faces: how do you adopt AI-powered tooling in an organization where every change to the SDLC has compliance implications? The answer is not to avoid AI adoption. It is to treat AI systems with the same engineering rigor applied to any production system — version control, automated testing, monitoring, rollback capability, and clear ownership. The practices in this document reflect that philosophy.

AI engineering is a multidisciplinary field that combines principles from computer science, data science, and domain-specific knowledge to create AI systems that are robust, scalable, and efficient. It requires a deep understanding of machine learning algorithms, data processing techniques, and software engineering practices. As Sculley et al. demonstrated in their seminal 2015 paper, the actual ML code in a production system is often a small fraction of the total codebase — the surrounding infrastructure for data collection, feature extraction, serving, and monitoring represents the bulk of the engineering work and the bulk of the technical debt.

Data Engineering

Data Engineering involves collecting, processing, and managing data for AI applications. In banking and financial services, data engineering carries additional weight because data quality directly impacts regulatory reporting, risk calculations, and customer outcomes. Key practices include:

  • Data Collection: Gathering data from various sources, including transaction systems, market data feeds, customer interaction logs, and third-party data providers.
  • Data Cleaning: Removing noise and inconsistencies from data. In financial services, this includes reconciliation against authoritative sources and flagging data lineage issues that could affect downstream model outputs.
  • Data Transformation: Converting data into a format suitable for analysis while maintaining audit trails that satisfy regulatory requirements.

Data Collection

Data Collection is the process of gathering data from various sources to be used in AI applications. It involves:

  • Web Scraping: Extracting data from websites — used cautiously in regulated environments due to licensing and data provenance requirements.
  • APIs: Using application programming interfaces to collect data from external services. In banking, this includes market data APIs (Bloomberg, Refinitiv), credit bureau APIs, and interbank messaging systems (SWIFT).
  • Databases: Retrieving data from relational and non-relational databases. Enterprise AI systems in banking typically pull from data warehouses, data lakes, and operational data stores, often requiring cross-domain access approvals.
  • Event Streams: Consuming real-time event data from message brokers (Kafka, Kinesis) — increasingly important for fraud detection models that need to score transactions in milliseconds.

Data Cleaning

Data Cleaning involves removing noise and inconsistencies from data to ensure its quality and reliability. Techniques include:

  • Handling Missing Values: Imputing or removing missing data points. In financial datasets, the approach to missing values must be documented because it can affect regulatory reporting. A model that imputes missing income data differently from the bank's standard methodology creates compliance risk.
  • Removing Duplicates: Identifying and removing duplicate records. Transaction deduplication in banking is non-trivial because the same economic event can appear in multiple source systems with different identifiers.
  • Outlier Detection: Identifying and handling outliers in the data. In fraud detection, outliers are often the signal rather than the noise — the data cleaning pipeline must be careful not to discard exactly the data points the model needs to learn from.
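
The deduplication point above can be sketched in a few lines. This is a minimal illustration, not a production reconciliation routine: the function name and the exact-match key on (account, amount, timestamp) are simplifying assumptions, since real cross-system matching typically needs fuzzy amount and time-window logic.

```python
def deduplicate_transactions(transactions):
    """Collapse records that describe the same economic event.

    Records are treated as duplicates when they share account, amount,
    and timestamp, even if their source-system IDs differ. The first
    record seen wins, so input order should reflect source priority.
    """
    seen = {}
    for tx in transactions:
        key = (tx["account"], tx["amount"], tx["timestamp"])
        seen.setdefault(key, tx)
    return list(seen.values())

raw = [
    {"id": "CORE-1", "account": "A1", "amount": 100.0, "timestamp": "2024-01-05T10:00"},
    # Same economic event surfacing from a second source system:
    {"id": "CARD-9", "account": "A1", "amount": 100.0, "timestamp": "2024-01-05T10:00"},
    {"id": "CORE-2", "account": "A1", "amount": 55.0, "timestamp": "2024-01-05T11:30"},
]
deduped = deduplicate_transactions(raw)
```

Keeping the first record by source priority, rather than an arbitrary one, preserves a deterministic, auditable rule for which system of record wins.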

Data Transformation

Data Transformation is the process of converting data into a format suitable for analysis. It includes:

  • Normalization: Scaling features to a standard range. Critical for models that combine features with different scales, such as transaction amounts (dollars) and transaction frequency (counts).
  • Aggregation: Summarizing data at different levels of granularity. Customer-level aggregations for risk models, portfolio-level aggregations for stress testing, and market-level aggregations for trading strategies.
  • Feature Extraction: Creating new features from raw data. In banking, derived features like "velocity of transactions in the last 24 hours" or "ratio of international to domestic transactions" are often more predictive than raw transaction data.
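
As a small illustration of the normalization bullet, the sketch below min-max scales a feature to [0, 1]; the function name and toy values are assumptions for this example, and a production pipeline would fit the scaling bounds on training data only to avoid leakage.

```python
def min_max_normalize(values):
    """Scale a list of numeric values to the [0, 1] range.

    A constant feature has no spread, so it maps to all zeros rather
    than dividing by zero.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Dollar amounts and daily counts end up on the same scale, so a model
# combining them is not dominated by the feature with larger magnitude.
scaled_amounts = min_max_normalize([10.0, 50.0, 100.0])
scaled_counts = min_max_normalize([1, 3, 5])
```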

Example: Data Transformation in Banking

Consider a fraud detection system at a retail bank. Raw transaction data includes timestamp, amount, merchant category, and geographic coordinates. The transformation pipeline creates derived features: transaction velocity (transactions per hour for this customer), geographic velocity (distance between consecutive transactions divided by time elapsed), category deviation (how unusual is this merchant category for this customer), and amount deviation (z-score of this transaction amount relative to the customer's historical distribution). These engineered features dramatically improve model performance compared to using raw fields alone.
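
Two of the derived features described above — geographic velocity and amount deviation — can be sketched directly. The function names, record layout, and toy coordinates are assumptions for illustration; a real pipeline would compute these incrementally in a feature store rather than per-call.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two coordinates."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def geographic_velocity_kmh(prev_tx, curr_tx):
    """Distance between consecutive transactions divided by elapsed hours."""
    dist = haversine_km(prev_tx["lat"], prev_tx["lon"], curr_tx["lat"], curr_tx["lon"])
    hours = (curr_tx["ts"] - prev_tx["ts"]) / 3600.0
    return dist / hours if hours > 0 else float("inf")

def amount_zscore(amount, history):
    """Z-score of this amount against the customer's historical amounts."""
    mean = sum(history) / len(history)
    std = math.sqrt(sum((x - mean) ** 2 for x in history) / len(history))
    return (amount - mean) / std if std > 0 else 0.0

# A card-present transaction in New York one hour after one in London
# implies a physically impossible travel speed — a strong fraud signal.
prev_tx = {"lat": 51.5, "lon": -0.12, "ts": 0}
curr_tx = {"lat": 40.7, "lon": -74.0, "ts": 3600}
velocity = geographic_velocity_kmh(prev_tx, curr_tx)
z = amount_zscore(500.0, [20, 30, 25, 35, 40])
```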

Model Development

Model Development involves building and training machine learning models. Key practices include:

  • Feature Engineering: Creating features from raw data to improve model performance. This is often where domain expertise matters most — an experienced banking technologist knows which features are predictive, which are redundant, and which create regulatory risk.
  • Hyperparameter Tuning: Optimizing model parameters to achieve the best performance.
  • Model Evaluation: Assessing model performance using metrics like accuracy, precision, and recall. In regulated environments, evaluation must also include fairness metrics, explainability assessments, and performance across protected demographic groups.

Feature Engineering

Feature Engineering is the process of creating new features from raw data to improve the performance of machine learning models. It involves selecting, transforming, and creating variables that can enhance the predictive power of the model. Examples include:

  • Normalization: Scaling features to a standard range.
  • Encoding Categorical Variables: Converting categorical variables into numerical representations. One-hot encoding for low-cardinality features (merchant category), target encoding for high-cardinality features (merchant ID).
  • Creating Interaction Features: Combining features to capture interactions between variables. In credit risk modeling, the interaction between income and debt load is more predictive than either feature alone.
  • Temporal Features: Engineering time-based features such as day-of-week effects, seasonal patterns, and rolling aggregations. Financial data is inherently temporal, and ignoring time dynamics leaves predictive power on the table.
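
A minimal sketch of the one-hot encoding bullet follows; the function name and merchant categories are assumptions for illustration. Fixing the category order up front matters in production so that training and serving produce identically shaped vectors.

```python
def one_hot_encode(values, categories=None):
    """One-hot encode a low-cardinality categorical column.

    Returns the category order and one indicator row per input value.
    Passing an explicit `categories` list pins the column order so that
    serving-time vectors match the training-time layout.
    """
    if categories is None:
        categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        rows.append(row)
    return categories, rows

categories, rows = one_hot_encode(["grocery", "fuel", "grocery"])
```

For high-cardinality features like merchant ID, this approach explodes the dimensionality, which is why the text recommends target encoding there instead.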

Hyperparameter Tuning

Hyperparameter Tuning involves optimizing the parameters of a machine learning model to achieve the best performance. It includes techniques such as:

  • Grid Search: Exhaustively searching through a specified subset of hyperparameters. Feasible for small parameter spaces but computationally expensive for complex models.
  • Random Search: Randomly sampling hyperparameters from a specified distribution. Bergstra and Bengio (2012) showed that random search is more efficient than grid search for most practical problems because not all hyperparameters are equally important.
  • Bayesian Optimization: Using probabilistic models to find the optimal hyperparameters. Increasingly used for expensive-to-train models where each training run costs significant compute resources.
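
The random search idea above reduces to a short loop. This is an illustrative sketch, not a library implementation: the function names, the toy objective (standing in for cross-validated model performance), and the parameter ranges are all assumptions.

```python
import random

def random_search(objective, space, n_iter=50, seed=42):
    """Randomly sample hyperparameter configurations, keep the best.

    `space` maps each hyperparameter name to a sampler function that
    draws one value given a random generator. Seeding makes the search
    reproducible, which matters for model governance.
    """
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = {name: sampler(rng) for name, sampler in space.items()}
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Toy objective with its optimum at lr=0.1, depth=6 (score 0 is best).
def objective(params):
    return -(params["lr"] - 0.1) ** 2 - (params["depth"] - 6) ** 2 / 100

space = {
    "lr": lambda rng: rng.uniform(0.001, 0.5),
    "depth": lambda rng: rng.randint(2, 12),
}
best, score = random_search(objective, space, n_iter=200)
```

In practice the objective is expensive (a full training run plus validation), which is exactly why Bayesian optimization becomes attractive as training costs grow.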

Model Evaluation

Model Evaluation is the process of assessing the performance of a machine learning model using various metrics. Common evaluation metrics include:

  • Accuracy: The proportion of correctly predicted instances out of the total instances.
  • Precision: The proportion of true positive predictions out of the total positive predictions. In fraud detection, high precision means fewer false alarms that waste investigator time.
  • Recall: The proportion of true positive predictions out of the total actual positives. In fraud detection, high recall means fewer fraudulent transactions slip through undetected.
  • F1 Score: The harmonic mean of precision and recall. Useful when you need to balance both concerns.
  • AUC-ROC: Area under the receiver operating characteristic curve. Provides a threshold-independent measure of model discrimination.

Example: Model Evaluation in Fraud Detection

In a fraud detection model at a retail bank, the base rate of fraud is typically 0.1-0.3% of transactions. A model that predicts "not fraud" for every transaction achieves 99.7% accuracy but is useless. The evaluation must focus on precision-recall tradeoffs: at what threshold does the model catch 95% of fraud (recall) while keeping false positives manageable for the investigations team? This tradeoff is a business decision, not a purely technical one, and it requires close collaboration between data scientists, fraud operations, and risk management.
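
The threshold-selection tradeoff described above can be made concrete. The sketch below (function names and the six-transaction toy dataset are assumptions for illustration) walks candidate thresholds from high to low and returns the highest one meeting a recall target, which keeps false positives lowest among qualifying thresholds.

```python
def precision_recall_at(threshold, scores, labels):
    """Precision and recall when flagging scores >= threshold as fraud."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def threshold_for_recall(scores, labels, target_recall):
    """Highest threshold that still achieves the target recall.

    As the threshold drops, recall only increases, so the first
    qualifying threshold flags the fewest transactions for review.
    """
    for t in sorted(set(scores), reverse=True):
        p, r = precision_recall_at(t, scores, labels)
        if r >= target_recall:
            return t, p, r
    return None

scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]   # model fraud scores
labels = [1, 1, 0, 1, 0, 0]               # 1 = confirmed fraud
result = threshold_for_recall(scores, labels, target_recall=0.95)
```

The precision returned at that threshold is what the investigations team must absorb as false-alarm workload — the business decision the text describes.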

Model Deployment

Model Deployment involves deploying AI models into production environments. In regulated banking environments, deployment is not just a technical process — it requires model risk management (MRM) review, change advisory board approval, and documented rollback procedures. Key practices include:

  • Containerization: Using containers to package and deploy models.
  • Model Monitoring: Monitoring model performance in production to detect issues.
  • Continuous Integration/Continuous Deployment (CI/CD): Automating the deployment of models using CI/CD pipelines.

Containerization

Containerization is the process of packaging an AI model and its dependencies into a container, such as Docker, to ensure consistency across different environments. Benefits include:

  • Portability: Containers can run on any platform that supports containerization. This is critical in banking where models may need to run in on-premises data centers, private cloud, and edge environments.
  • Scalability: Containers can be easily scaled to handle increased workloads. Real-time scoring systems for fraud detection need to scale with transaction volume, which peaks during holidays and promotional events.
  • Isolation: Containers provide isolation between different applications and their dependencies. In a multi-model serving environment, isolation prevents a misbehaving model from affecting others.
  • Reproducibility: Containers ensure that the exact same environment used during model validation is used in production, eliminating "works on my machine" issues that are particularly dangerous in regulated model deployments.

Model Monitoring

Model Monitoring involves tracking the performance of AI models in production to detect issues and ensure their reliability. Key aspects include:

  • Performance Metrics: Monitoring metrics such as accuracy, precision, and recall on live traffic. Model performance in production often differs from validation performance due to data distribution shifts.
  • Data Drift Detection: Identifying changes in the input data distribution that may affect model performance. In banking, data drift can be caused by macroeconomic changes (recession, interest rate shifts), seasonal effects, or changes in customer behavior.
  • Concept Drift Detection: Identifying changes in the relationship between input features and the target variable. A fraud model trained on pre-pandemic data may not perform well when transaction patterns shift dramatically.
  • Alerting: Setting up alerts to notify stakeholders of any issues with the model. Alerts should be actionable — connected to runbooks that describe the investigation and remediation process.
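
One widely used statistic for the data-drift bullet is the population stability index (PSI). The sketch below is a simplified implementation under stated assumptions: equal-width bins derived from the baseline range, and a small epsilon to guard empty bins; production systems often use quantile bins instead.

```python
import math

def population_stability_index(expected, actual, n_bins=10):
    """PSI between a baseline (training) sample and live traffic.

    A common rule of thumb reads PSI < 0.1 as stable, 0.1-0.25 as
    moderate drift, and > 0.25 as significant drift worth investigating.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0

    def bin_fractions(sample):
        counts = [0] * n_bins
        for x in sample:
            i = min(int((x - lo) / width), n_bins - 1)
            counts[max(i, 0)] += 1
        # Epsilon keeps log() defined for bins empty in one sample.
        return [max(c / len(sample), 1e-6) for c in counts]

    e, a = bin_fractions(expected), bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

baseline = [i / 100 for i in range(100)]          # training distribution
live_same = list(baseline)                        # no drift
live_shifted = [0.5 + i / 200 for i in range(100)]  # mass moved upward
psi_stable = population_stability_index(baseline, live_same)
psi_drift = population_stability_index(baseline, live_shifted)
```

Wiring this statistic to the alerting bullet means a PSI threshold breach opens an incident with a runbook link, not just a dashboard change.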

Example: Model Monitoring in Production

A deployed model predicting customer churn at a retail bank is monitored for performance metrics like precision and recall. After a competitor launches an aggressive rate promotion, the model's recall drops from 85% to 60% — it is missing customers who are leaving for rate-driven reasons because the training data did not include this competitive dynamic. Data drift detection catches the shift in the "competitor rate differential" feature distribution, triggering a retraining workflow. This is the MLOps feedback loop in action.

Continuous Integration/Continuous Deployment (CI/CD)

CI/CD for ML systems extends traditional CI/CD with ML-specific concerns. Google's MLOps whitepaper describes three levels of ML automation maturity:

  • Level 0 (Manual): Manual model training, manual deployment. Common in early ML adoption.
  • Level 1 (ML Pipeline Automation): Automated training pipelines, automated validation, but manual deployment approval.
  • Level 2 (CI/CD for ML): Full automation of training, validation, and deployment with automated rollback. This is the target state for production ML systems.

Key practices include:

  • Automated Testing: Running tests automatically to ensure the quality of the model. This includes unit tests for feature engineering code, integration tests for the training pipeline, and model validation tests that check performance against baseline thresholds.
  • Version Control: Using version control systems to manage changes to the model, its code, its configuration, and its training data. DVC (Data Version Control) and MLflow are common tools for tracking model lineage.
  • Deployment Automation: Automating the deployment process to reduce manual intervention and minimize errors. In regulated environments, automation includes evidence collection for audit trails — automatically recording who approved the deployment, what validation results were observed, and what the rollback procedure is.
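
The model-validation test mentioned in the first bullet can be sketched as a CI quality gate. The function name, metric names, and thresholds below are illustrative assumptions; real gates would be configured per model materiality tier.

```python
def validate_candidate(candidate_metrics, baseline_metrics,
                       max_regression=0.01, min_auc=0.75):
    """CI quality gate run before a model deployment is approved.

    Flags any tracked metric that regresses by more than
    `max_regression` versus the production baseline, plus a hard floor
    on AUC. Returns a list of failure reasons (empty means pass),
    which a CI step can turn into a failed build and an audit record.
    """
    failures = []
    for metric, baseline in baseline_metrics.items():
        value = candidate_metrics.get(metric, 0.0)
        if value < baseline - max_regression:
            failures.append(f"{metric} regressed: {value:.3f} vs baseline {baseline:.3f}")
    if candidate_metrics.get("auc", 0.0) < min_auc:
        failures.append(f"auc below hard floor of {min_auc}")
    return failures

# Candidate slightly improves AUC but loses recall beyond tolerance.
gate_failures = validate_candidate(
    {"auc": 0.82, "recall": 0.90},
    {"auc": 0.81, "recall": 0.92},
)
```

Emitting reasons rather than a bare boolean supports the evidence-collection requirement described above: the failure list itself becomes part of the audit trail.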

AI-Driven Software Development Practices

AI-driven software development practices leverage AI technologies to enhance the software development lifecycle. The Feature Spec Generator sponsored from the Seattle Tech Hub is one example of this approach — using LLMs to transform minimal requirements into executable BDD specifications. Key practices include:

  • Automated Code Generation: Using AI to generate code snippets or entire functions. Tools like GitHub Copilot and Cursor have moved from novelty to daily use for many engineering teams.
  • Intelligent Code Review: Leveraging AI to review code for potential issues and suggest improvements. AI-powered code review can catch patterns that human reviewers miss, particularly in large codebases with inconsistent coding standards.
  • Predictive Analytics: Using AI to predict project timelines, resource allocation, and potential risks. In large banking technology organizations, predictive models for delivery performance can identify at-risk programs weeks before they miss milestones.
  • Automated Testing: Implementing AI-driven testing tools to identify bugs and optimize test coverage. This includes AI-generated test cases, mutation testing guided by ML, and intelligent test selection that runs only the tests most likely to catch regressions for a given code change.

The empirical evidence on AI coding agents is maturing rapidly. Agarwal, He, and Vasilescu (2026) conducted a longitudinal causal analysis of LLM-based coding agents in open-source projects, finding that agents produce "large, front-loaded velocity gains" — but with a critical caveat. Projects already using IDE-based AI assistants show minimal additional throughput increases, suggesting diminishing returns to layered AI assistance. More concerning, static-analysis warnings and cognitive complexity rose by approximately 18% and 39% respectively across all scenarios, indicating sustained technical debt accumulation from AI-generated code. This research validates the approach taken with the Feature Spec Generator: AI-generated artifacts must pass through the same quality gates — automated testing, security scanning, code review — as human-written code. The velocity gains are real, but they must be accompanied by quality safeguards.

Automated Code Generation

Automated Code Generation involves using AI to generate code snippets or entire functions based on natural language descriptions or code patterns. Examples include:

  • Code Completion: Providing suggestions for completing code statements. Modern tools provide multi-line completions that are contextually aware of the surrounding codebase.
  • Code Synthesis: Generating code based on high-level specifications. The Feature Spec Generator is a domain-specific example — generating Gherkin BDD specifications from natural-language feature descriptions.
  • Template Generation: Creating code templates for common tasks. In enterprise environments, this includes generating boilerplate for API endpoints, database access layers, and integration test scaffolding that conforms to organizational standards.

Intelligent Code Review

Intelligent Code Review leverages AI to review code for potential issues and suggest improvements. Benefits include:

  • Bug Detection: Identifying bugs and vulnerabilities in the code. AI reviewers can identify null pointer dereferences, resource leaks, and concurrency issues that are easy to miss in manual review.
  • Code Quality Improvement: Suggesting best practices and improvements to enhance code quality. This includes identifying code smells, suggesting refactoring opportunities, and enforcing architectural patterns.
  • Automated Feedback: Providing real-time feedback to developers during the coding process. The key to adoption is low false-positive rates — if the tool cries wolf too often, developers will ignore it.

Predictive Analytics

Predictive Analytics involves using AI to predict project timelines, resource allocation, and potential risks. Key techniques include:

  • Time Series Analysis: Analyzing historical data to forecast future trends. Delivery velocity, defect escape rates, and infrastructure costs all exhibit temporal patterns that can be modeled.
  • Resource Optimization: Predicting resource needs and optimizing allocation. In banking technology organizations with thousands of engineers, even small improvements in resource allocation translate to significant cost savings.
  • Risk Assessment: Identifying potential risks and their impact on the project. ML models trained on historical delivery data can identify the characteristics of projects that are likely to miss deadlines or exceed budgets.

Automated Testing

Automated Testing involves implementing AI-driven testing tools to identify bugs and optimize test coverage. Key practices include:

  • Test Case Generation: Using AI to generate test cases based on code analysis. This is the core capability of the Feature Spec Generator — transforming requirements into comprehensive test scenarios that include edge cases human testers might miss.
  • Test Optimization: Prioritizing and optimizing test cases to maximize coverage. Intelligent test selection uses code change analysis and historical failure data to run the most valuable tests first, reducing feedback time.
  • Bug Detection: Identifying and reporting bugs automatically. AI-powered fuzzing and property-based testing can discover bugs that conventional test suites miss.
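
The property-based testing idea in the last bullet can be sketched without any framework. Everything here is illustrative: the toy function under test, the generator, and the loop are assumptions standing in for a library like Hypothesis, which does this with shrinking and smarter generation.

```python
import random

def normalize_merchant_name(name):
    """Toy function under test: canonicalize merchant name strings."""
    return " ".join(name.upper().split())

def check_property(fn, generator, prop, n_cases=200, seed=0):
    """Minimal property-based test loop.

    Generates random inputs and asserts an invariant holds for every
    one, rather than checking a handful of hand-picked examples.
    """
    rng = random.Random(seed)
    for _ in range(n_cases):
        value = generator(rng)
        assert prop(fn, value), f"property violated for {value!r}"
    return n_cases

# Random strings of letters and spaces, length 0-12.
gen = lambda rng: "".join(rng.choice(" abcZ") for _ in range(rng.randint(0, 12)))

# Idempotence property: normalizing twice equals normalizing once.
cases_run = check_property(
    normalize_merchant_name, gen,
    lambda fn, v: fn(fn(v)) == fn(v),
)
```

Properties like idempotence and order-invariance are often easier to state than exact expected outputs, which is what lets generated inputs explore edge cases humans miss.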

Examples of AI-Driven Development Tools and Technologies

  • GitHub Copilot: An AI-powered code completion tool that helps developers write code faster. Widely adopted across enterprise engineering teams, with measurable productivity improvements reported by Ziegler et al. (2024).
  • Cursor: An AI-native IDE that combines code generation, editing, and codebase understanding. Represents the shift from AI as an assistant to AI as a collaborator.
  • DeepCode / Snyk Code: An AI-based code review tool that identifies bugs and security vulnerabilities and suggests improvements. Integrates into CI/CD pipelines for automated quality gates.
  • TabNine: An AI-driven code completion tool that supports multiple programming languages and can be trained on proprietary codebases.
  • Snyk: A security tool that uses AI to identify and fix vulnerabilities in code and dependencies. Critical for DevSecOps pipelines in regulated environments.
  • SonarQube: A code quality platform that combines static analysis with AI-assisted rules to surface issues and provide actionable insights.

The frontier of AI-driven development is moving toward multi-agent systems. Fu, Pasuksmit, and Tantithamthavorn (2024) surveyed 99 papers identifying 12 distinct security tasks in DevSecOps that can be addressed by AI, from vulnerability detection to patch generation to compliance verification. Their analysis of 65 benchmarks reveals that while individual AI tools excel at specific tasks, the combination of specialized agents — each optimized for a different phase of the SDLC — produces superior outcomes to monolithic AI assistants. This multi-agent architecture mirrors the design of OpenClaw, where specialized sub-agents handle coding, security analysis, testing, and deployment tasks under the coordination of a central orchestrator.

Importance of DORA Metrics in AI Engineering

DORA (DevOps Research and Assessment) metrics are crucial for measuring the performance and effectiveness of AI engineering practices. The research by Forsgren, Humble, and Kim demonstrates that these metrics are predictive of organizational performance — teams that excel on DORA metrics deliver more value with fewer failures. The four key DORA metrics are:

  • Deployment Frequency: How often new code is deployed to production. For ML systems, this includes both application code deployments and model deployments — which may follow different cadences.
  • Lead Time for Changes: The time it takes for a code change to go from commit to production. For ML systems, this includes the time from identifying a model performance issue to deploying a retrained model.
  • Change Failure Rate: The percentage of changes that result in a failure in production. For ML systems, this includes model deployments that cause performance degradation, serving errors, or fairness violations.
  • Mean Time to Restore (MTTR): The average time it takes to restore service after a failure. For ML systems, this depends on having automated rollback mechanisms that can quickly revert to a previous model version.
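
The four metrics above are straightforward to compute from a deployment log. The record layout, field names, and toy figures in this sketch are assumptions for illustration; real pipelines would derive these from CI/CD and incident-management systems.

```python
import statistics

def dora_metrics(deployments, period_days):
    """Compute the four DORA metrics from a simple deployment log.

    Each record: lead_time_hours (commit to production), failed (bool),
    and restore_hours (time to restore service, failures only).
    """
    failures = [d for d in deployments if d["failed"]]
    return {
        "deployments_per_week": len(deployments) / (period_days / 7),
        "median_lead_time_hours": statistics.median(
            d["lead_time_hours"] for d in deployments),
        "change_failure_rate": len(failures) / len(deployments),
        "mttr_hours": statistics.mean(
            d["restore_hours"] for d in failures) if failures else 0.0,
    }

log = [
    {"lead_time_hours": 20, "failed": False, "restore_hours": None},
    {"lead_time_hours": 30, "failed": True,  "restore_hours": 2.0},
    {"lead_time_hours": 48, "failed": False, "restore_hours": None},
    {"lead_time_hours": 16, "failed": False, "restore_hours": None},
]
metrics = dora_metrics(log, period_days=28)
```

For ML systems, the same computation can be run separately over application deployments and model deployments, since — as noted above — the two often follow different cadences.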

Examples of Applying DORA Metrics in AI Engineering Projects

  1. Deployment Frequency: By increasing the frequency of model deployments, teams can quickly iterate on model improvements and deliver new features to users more rapidly. At the Tier-1 bank, the target was to move from quarterly model releases to monthly releases for low-risk models, with a path to continuous deployment for models with automated validation gates.

  2. Lead Time for Changes: Reducing the lead time for changes allows teams to respond faster to new data and changing requirements. Automating the data pipeline, model training, and validation process reduced model update lead time from weeks to days. The bottleneck shifted from engineering to model risk management review — which prompted a parallel effort to streamline the MRM process for low-materiality model changes.

  3. Change Failure Rate: Monitoring and reducing the change failure rate helps ensure that model updates do not negatively impact production systems. The team implemented shadow deployment (running the new model alongside the existing model and comparing outputs) as a standard practice, catching performance regressions before they affected customers.

  4. Mean Time to Restore (MTTR): Minimizing MTTR ensures that any issues in production are resolved quickly, reducing downtime and maintaining service reliability. Automated rollback mechanisms that can revert to the previous model version within minutes are essential. The team also implemented circuit breakers that fall back to rule-based systems when the ML model is unavailable, ensuring continuity of service.

References