MLflow and Datadog Best Practices for EU AI Act Logging and Monitoring

Article 12 of the EU AI Act (Regulation (EU) 2024/1689) requires high-risk AI systems to have automatic logging capabilities sufficient to ensure traceability throughout the system's operational lifetime. Article 72 requires post-market monitoring. For ML engineering teams, these obligations map directly to two tools already in most production stacks: MLflow for experiment and model tracking, and Datadog for operational monitoring and logging.

The question is not whether to use these tools for compliance -- most teams already use them. The question is how to configure them so that the data they generate constitutes usable EU AI Act evidence.

MLflow: Structuring Experiments and Models for Annex IV

MLflow tracks experiments, runs, models, and datasets. Configured correctly, an MLflow registry becomes a living record of the information required in Annex IV Sections 2 (system elements and development), 3 (monitoring and control), and 6 (changes made through the lifecycle).

Experiment Naming Convention

Name MLflow experiments with enough structure that a compliance reviewer can understand what changed, why, and what the regulatory context is:

# Avoid:
experiment_47
test_run_new_data

# Prefer:
credit-scoring-v2/baseline-2026-04
credit-scoring-v2/dataset-refresh-2026-04/art9-bias-recheck
hiring-model-v3/robustness-test-adversarial/art15
hiring-model-v3/post-market-retrain-q1-2026/annex-iv-update
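
The convention can also be enforced at experiment creation time rather than by review. The following is a minimal sketch, assuming a naming pattern inferred from the examples above (a versioned system name, a purpose, and an optional regulatory suffix). The `build_experiment_name` helper and both regular expressions are hypothetical and are not part of MLflow:

```python
import re

# Hypothetical pattern inferred from the examples above:
# <system>-v<major>/<purpose>[/<regulatory-suffix>], lowercase and hyphen-separated.
_SYSTEM = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*-v\d+$")
_SEGMENT = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*$")


def build_experiment_name(system: str, purpose: str, regulatory_context: str = "") -> str:
    """Join the parts into a structured experiment name, rejecting unstructured ones."""
    if not _SYSTEM.match(system):
        raise ValueError(f"system must look like 'credit-scoring-v2', got {system!r}")
    segments = [purpose] + ([regulatory_context] if regulatory_context else [])
    for segment in segments:
        if not _SEGMENT.match(segment):
            raise ValueError(f"segment must be lowercase and hyphen-separated, got {segment!r}")
    return "/".join([system] + segments)


name = build_experiment_name("hiring-model-v3", "robustness-test-adversarial", "art15")
print(name)
```

The returned string can then be passed to `mlflow.set_experiment(name)`, so that a name like `experiment_47` is rejected before any run is recorded under it.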

Required MLflow Run Tags

Add standard tags to every production-relevant MLflow run. These tags make it possible to filter runs by compliance context and generate Annex IV Section 6 change records automatically:

import mlflow

mlflow.set_tags({
    "compliance.trigger": "scheduled_retrain",   # or: post_market_finding / data_refresh / art9_review
    "compliance.article": "art9,art15",          # which obligations this run addresses
    "compliance.annex_iv_updated": "true",       # whether Annex IV was updated for this run
    "data.version": "dataset-v4.1",              # traceable dataset reference
    "data.governance_approved": "true",          # Article 10 data governance evidence
    "model.risk_assessment": "risk-2026-04-15",  # links to Article 9 risk assessment
    "reviewer": "[email protected]",              # human sign-off (Article 14 evidence)
})
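
Tags only help if every run carries them. One way to check this before a run is tagged is sketched below. The tag keys are the ones listed above, while `REQUIRED_TAGS` and `missing_compliance_tags` are hypothetical names introduced here for illustration:

```python
# Tag keys taken from the set_tags example above.
REQUIRED_TAGS = (
    "compliance.trigger",
    "compliance.article",
    "compliance.annex_iv_updated",
    "data.version",
    "data.governance_approved",
    "model.risk_assessment",
    "reviewer",
)


def missing_compliance_tags(tags: dict) -> list:
    """Return the required tag keys that are absent or blank, in a stable order."""
    return [key for key in REQUIRED_TAGS if not str(tags.get(key, "")).strip()]


tags = {
    "compliance.trigger": "scheduled_retrain",
    "compliance.article": "art9,art15",
    "data.version": "dataset-v4.1",
}
gaps = missing_compliance_tags(tags)
if gaps:
    print("Run is missing compliance tags:", ", ".join(gaps))
```

A training script could refuse to call `mlflow.set_tags(tags)` (or fail the pipeline) while `gaps` is non-empty.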

Model Registry Stage Lifecycle as Conformity Evidence

Use MLflow's model registry staging lifecycle to enforce a compliance gate before production deployment:

None -> Staging -> Production -> Archived

# Required before transitioning Staging -> Production:
# 1. Accuracy metrics meet Article 15 thresholds (logged as run metrics)
# 2. Bias/fairness tests pass (logged as run metrics)
# 3. Risk assessment updated in Confluence (tag: model.risk_assessment)
# 4. Annex IV Section 6 updated (tag: compliance.annex_iv_updated = true)
# 5. Human reviewer sign-off recorded (tag: reviewer)

The model registry transition timestamp, the approving user, and the run metrics form a verifiable record that Article 9 (risk), Article 14 (human oversight), and Article 15 (accuracy) obligations were satisfied before the model went to production.
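
The five pre-production checks can be expressed as a single gate function that runs before any registry transition. This is a sketch under stated assumptions: the metric names `accuracy` and `demographic_parity_gap`, both thresholds, and the `check_promotion_gate` helper are hypothetical, and only the tag keys come from the examples above:

```python
from dataclasses import dataclass, field


@dataclass
class GateResult:
    passed: bool
    failures: list = field(default_factory=list)


def check_promotion_gate(metrics: dict, tags: dict,
                         min_accuracy: float = 0.90,
                         max_parity_gap: float = 0.05) -> GateResult:
    """Evaluate the five Staging -> Production checks against a run's metrics and tags."""
    failures = []
    # Missing metrics are treated as failing values, not as passes.
    if metrics.get("accuracy", 0.0) < min_accuracy:
        failures.append("accuracy below threshold (Article 15)")
    if metrics.get("demographic_parity_gap", 1.0) > max_parity_gap:
        failures.append("bias/fairness test failed")
    if not tags.get("model.risk_assessment"):
        failures.append("no risk assessment reference (Article 9)")
    if tags.get("compliance.annex_iv_updated") != "true":
        failures.append("Annex IV change record not updated")
    if not tags.get("reviewer"):
        failures.append("no human reviewer sign-off (Article 14)")
    return GateResult(passed=not failures, failures=failures)


result = check_promotion_gate(
    metrics={"accuracy": 0.93, "demographic_parity_gap": 0.08},
    tags={"model.risk_assessment": "risk-2026-04-15", "compliance.annex_iv_updated": "true"},
)
print(result.passed, result.failures)
```

In practice the two dictionaries would come from `mlflow.get_run(run_id).data.metrics` and `.data.tags`. Recent MLflow releases deprecate registry stages in favour of model version aliases and tags, and the same gate can run before a production alias is assigned.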

Logging Dataset Lineage for Article 10

Article 10 requires data governance for training, validation, and test datasets. Log dataset metadata with every training run:

import mlflow

# Dataset lineage parameters, logged inside the active training run
mlflow.log_params({
    "dataset.train.source": "s3://data-lake/training/v4.1/",
    "dataset.train.version": "4.1",
    "dataset.train.size": 1250000,
    "dataset.train.date_collected_from": "2024-01-01",
    "dataset.train.date_collected_to": "2025-12-31",
    "dataset.train.geographic_scope": "EU",
    "dataset.train.demographic_groups_represented": "documented_in_confluence_data_card",
    "dataset.validation.source": "s3://data-lake/validation/v4.1/",
    "dataset.test.source": "s3://data-lake/test/v4.1/"
})
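
A path and a version string identify where the data lives, but not whether its content has changed. One option is to log a content fingerprint alongside the parameters above. The sketch below assumes a local copy of the dataset directory; `dataset_fingerprint` and the parameter name `dataset.train.sha256` are hypothetical:

```python
import hashlib
from pathlib import Path


def dataset_fingerprint(root: str) -> str:
    """Return one SHA-256 digest covering the relative path and content of every file."""
    base = Path(root)
    files = (p for p in base.rglob("*") if p.is_file())
    combined = hashlib.sha256()
    # Sort by relative POSIX path so the result does not depend on directory listing order.
    for path in sorted(files, key=lambda p: p.relative_to(base).as_posix()):
        file_hash = hashlib.sha256()
        with path.open("rb") as handle:
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                file_hash.update(chunk)
        # One manifest entry per file: "<relative path>\0<content digest>\n"
        entry = f"{path.relative_to(base).as_posix()}\0{file_hash.hexdigest()}\n"
        combined.update(entry.encode("utf-8"))
    return "sha256:" + combined.hexdigest()


# Usage (hypothetical local path and parameter name):
# mlflow.log_param("dataset.train.sha256", dataset_fingerprint("/data/training/v4.1"))
```

Two runs that report the same `dataset.train.version` but different fingerprints would then be visible as a lineage discrepancy.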

Datadog: Configuring Monitoring for Article 12 and Article 72

Datadog is where operational Article 12 logging and Article 72 post-market monitoring live. The goal is not just to have logs -- it is to have logs structured in a way that can be exported as compliance evidence on demand.

Article 12 Log Schema

Article 12 requires automatic logging of events sufficient to ensure traceability. At minimum, each log event for a high-risk AI inference should contain:

{
  "timestamp": "2026-04-25T09:14:22.341Z",   # UTC, millisecond precision
  "system_id": "hiring-screener-v3",          # unique identifier for the AI system
  "model_version": "3.2.1",                   # must match MLflow registered model version
  "request_id": "req_7f8a91bc",              # unique per inference request
  "input_reference": "sha256:a3f8...",        # hash of input, not raw input (privacy)
  "output_class": "SHORTLIST",               # the decision made
  "confidence_score": 0.87,
  "human_override": false,                   # was this overridden by a human reviewer?
  "oversight_flag": false,                   # was this flagged for human review?
  "session_id": "sess_4a2b19",              # groups requests per user session
  "compliance.art12": true                   # explicit tag for compliance filtering
}

Key design decisions:

Hash inputs, do not log raw personal data -- GDPR's data minimisation principle (Article 5(1)(c)) applies to inference logs. A hash lets you verify later that a specific input was processed without storing the value itself
Log human override and oversight flag -- these are Article 14 evidence fields showing human oversight is operational
Tag with compliance.art12 -- this makes it trivial to generate a compliance-filtered export for auditors
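
The schema and design decisions above could be implemented in the inference service roughly as follows. This is a sketch, not a definitive implementation: `build_inference_log` is a hypothetical helper, the request ID format only imitates the example, and plain SHA-256 is used because the schema shows a `sha256:` prefix. For inputs that are easy to guess, a keyed hash (HMAC) would be harder to reverse by enumeration.

```python
import hashlib
import json
import uuid
from datetime import datetime, timezone
from typing import Optional


def build_inference_log(system_id: str, model_version: str, raw_input: bytes,
                        output_class: str, confidence_score: float, session_id: str,
                        human_override: bool = False, oversight_flag: bool = False,
                        now: Optional[datetime] = None) -> str:
    """Return one JSON log line for a single inference, without the raw input."""
    moment = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    event = {
        # UTC with millisecond precision, e.g. 2026-04-25T09:14:22.341Z
        "timestamp": moment.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
        "system_id": system_id,
        "model_version": model_version,
        "request_id": "req_" + uuid.uuid4().hex[:8],
        # Only the digest is stored; the algorithm name is kept as a prefix.
        "input_reference": "sha256:" + hashlib.sha256(raw_input).hexdigest(),
        "output_class": output_class,
        "confidence_score": confidence_score,
        "human_override": human_override,
        "oversight_flag": oversight_flag,
        "session_id": session_id,
        "compliance.art12": True,
    }
    return json.dumps(event)


print(build_inference_log("hiring-screener-v3", "3.2.1", b"example input bytes",
                          "SHORTLIST", 0.87, "sess_4a2b19"))
```

How the line reaches Datadog (Agent tailing a file, container stdout collection, or direct API submission) depends on the deployment and is left out here.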

Datadog Monitor Naming for Post-Market Monitoring

Article 72 post-market monitoring requires tracking of performance metrics after deployment. Name Datadog monitors to make their compliance purpose explicit:

# Avoid:
"Model accuracy alert"
"High error rate"

# Prefer:
"[Art72] Hiring model accuracy degradation >5% (7-day rolling)"
"[Art72] Credit model demographic parity drift alert"
"[Art12] Inference logging coverage <100% -- missing log events detected"
"[Art14] Human review queue backlog >24h -- oversight SLA breach"
"[Art9] Data distribution shift detected -- risk reassessment required"

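The threshold in a monitor name such as "accuracy degradation >5% (7-day rolling)" needs a precise definition to be auditable. The sketch below shows one possible reading, assuming the 5% is a relative drop of the 7-day mean against a fixed baseline (not percentage points). `accuracy_degraded` is a hypothetical helper, and in Datadog the equivalent logic would live in a metric monitor rather than in application code:

```python
def accuracy_degraded(daily_accuracy: list, baseline: float,
                      window: int = 7, max_relative_drop: float = 0.05) -> bool:
    """True when the mean of the last `window` daily values is more than
    `max_relative_drop` below the baseline, measured relative to the baseline."""
    if baseline <= 0:
        raise ValueError("baseline accuracy must be positive")
    if len(daily_accuracy) < window:
        return False  # not enough data for a full rolling window yet
    recent = daily_accuracy[-window:]
    rolling_mean = sum(recent) / window
    return (baseline - rolling_mean) / baseline > max_relative_drop


# Baseline 0.90 from validation; the last seven days average 0.84 (a drop of about 6.7%).
print(accuracy_degraded([0.91, 0.90] + [0.84] * 7, baseline=0.90))
```

Writing the definition down next to the monitor makes it clear which interpretation an alert (or the absence of one) actually reflects.
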
Log Retention Configuration

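Article 18 requires technical documentation to be retained for 10 years after the system is placed on the market or put into service. Keeping compliance-tagged logs for the same period is a conservative choice (Article 19 itself sets a minimum of six months for automatically generated logs). In Datadog: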

Create a dedicated Log Archive for compliance-tagged events (compliance.art12:true)
Route this archive to long-term storage (S3, Azure Blob) with a 10-year retention policy
Separate compliance logs from operational logs to avoid incurring full Datadog ingestion costs on archived compliance records
Document the archive configuration in your Annex IV Section 9 (post-market monitoring plan)
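
For an S3-backed archive, the retention policy in the second step can be expressed as a bucket lifecycle rule. The following sketch only builds the rule document. The rule ID, the prefix, the 90-day transition, and the `compliance_lifecycle_rule` helper are assumptions chosen for illustration:

```python
def compliance_lifecycle_rule(prefix: str, glacier_after_days: int = 90,
                              retention_years: int = 10) -> dict:
    """Build an S3 lifecycle configuration for archived compliance logs."""
    # 366 days per year slightly overestimates, so objects are never expired early.
    expire_after_days = retention_years * 366
    return {
        "Rules": [{
            "ID": "eu-ai-act-compliance-logs",  # hypothetical rule name
            "Filter": {"Prefix": prefix},
            "Status": "Enabled",
            "Transitions": [{"Days": glacier_after_days, "StorageClass": "GLACIER"}],
            "Expiration": {"Days": expire_after_days},
        }]
    }


config = compliance_lifecycle_rule("compliance/art12/")
print(config["Rules"][0]["Expiration"])
```

The dictionary has the shape expected by `boto3.client("s3").put_bucket_lifecycle_configuration(Bucket=..., LifecycleConfiguration=config)`. S3 counts expiration from each object's creation date, so every log outlives the 10-year mark measured from market placement. Whether to expire logs automatically at all is a policy decision, and the `Expiration` entry can simply be omitted.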

Generating Compliance Exports

When a market surveillance authority requests evidence, you need to be able to produce a compliance log export quickly. Set up a saved Datadog query for Article 12 log exports:

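# Saved Datadog Log Query: "Article 12 Compliance Export"
# Search query (custom log attributes are conventionally prefixed with @; tags are not):
@compliance.art12:true @system_id:hiring-screener-v3
# Time range (set in the time picker, not in the query string): 2026-01-01 to 2026-04-25
# Columns: timestamp, system_id, model_version, request_id,
#          output_class, human_override, oversight_flag
# Sort: timestamp ascending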
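
The same export can be scripted so that it does not depend on someone clicking through the UI. The sketch below assumes the Logs Search endpoint of the Datadog API v2 (`POST /api/v2/logs/events/search`) and assumes the schema fields are parsed as log attributes, hence the `@` prefix. `build_export_request` and `fetch_page` are hypothetical helper names, and the attribute paths may need adjusting to how your pipeline parses the events:

```python
import json
import urllib.request


def build_export_request(system_id: str, start_iso: str, end_iso: str,
                         limit: int = 1000) -> dict:
    """Build the JSON body for one page of an Article 12 log export."""
    return {
        "filter": {
            "query": f"@compliance.art12:true @system_id:{system_id}",
            "from": start_iso,
            "to": end_iso,
        },
        "sort": "timestamp",       # ascending; "-timestamp" would be descending
        "page": {"limit": limit},
    }


def fetch_page(payload: dict, api_key: str, app_key: str,
               site: str = "datadoghq.com") -> dict:
    """Send one search request. Not called here, since it needs real credentials."""
    request = urllib.request.Request(
        f"https://api.{site}/api/v2/logs/events/search",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "DD-API-KEY": api_key,
            "DD-APPLICATION-KEY": app_key,
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


payload = build_export_request("hiring-screener-v3",
                               "2026-01-01T00:00:00Z", "2026-04-25T23:59:59Z")
print(json.dumps(payload, indent=2))
```

Results are paginated, so a full export would loop until the response no longer returns a cursor. Logs that have already aged out of indexed retention would have to be read from the archive (or rehydrated) instead.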

The Full Evidence Chain

When MLflow and Datadog are configured this way, every inference event produces a traceable chain:

The model version in the Datadog log matches the registered model version in MLflow
The MLflow run record shows the training data, risk assessment, and human reviewer who approved the model for production
The Datadog monitor alerts feed back into the Confluence Article 9 risk management system
The Confluence Annex IV Section 9 (post-market monitoring plan) links to the Datadog monitoring configuration
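
The first two links of this chain amount to a join between a log event and a registry record. A minimal sketch, assuming a hypothetical in-memory `registry` keyed by system ID and model version (in practice it could be populated from `MlflowClient().get_model_version(...)` and the tags of the corresponding run):

```python
def trace_inference(log_event: dict, registry: dict) -> dict:
    """Resolve a logged inference to the training-run evidence behind its model version."""
    key = (log_event["system_id"], log_event["model_version"])
    record = registry.get(key)
    if record is None:
        raise LookupError(f"no registry record for {key}; the evidence chain is broken")
    tags = record["tags"]
    return {
        "request_id": log_event["request_id"],
        "model_version": log_event["model_version"],
        "run_id": record["run_id"],
        "data_version": tags.get("data.version"),
        "risk_assessment": tags.get("model.risk_assessment"),
        "reviewer": tags.get("reviewer"),
    }


# Hypothetical registry snapshot and log event, using values from the examples above.
registry = {
    ("hiring-screener-v3", "3.2.1"): {
        "run_id": "run-abc123",  # placeholder run ID
        "tags": {"data.version": "dataset-v4.1", "model.risk_assessment": "risk-2026-04-15"},
    }
}
event = {"system_id": "hiring-screener-v3", "model_version": "3.2.1", "request_id": "req_7f8a91bc"}
print(trace_inference(event, registry))
```

Running a check like this periodically over a sample of logged events would surface inferences served by a model version that has no registry record, which is exactly the gap an inspector would look for.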

This chain is what an Article 43 conformity assessment reviewer or a market surveillance authority inspector expects to see. It is not a compliance artefact you produce for audits -- it is the normal operational record of a well-run AI system, configured to be auditable.

Frequently Asked Questions

How long must Article 12 inference logs be retained?

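Article 18 requires technical documentation to be retained for 10 years after the system is placed on the market or put into service. Article 19 separately requires providers to keep automatically generated logs for a period appropriate to the system's intended purpose, and for at least six months unless other Union or national law provides otherwise. Retaining Article 12 logs for the full 10 years, as part of the evidence trail that demonstrates compliance with the logging obligation, is therefore a conservative choice, and the retention of any personal data in those logs should be checked against data protection law. For operational cost reasons, compliance-tagged logs can be archived to lower-cost long-term storage (S3 Glacier, Azure Archive) rather than retained in your primary Datadog account.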

Is hashing inference inputs sufficient for Article 12, or must raw inputs be logged?

Hashing is both sufficient and preferable. Article 12 requires logging sufficient to ensure traceability -- it does not require storing raw personal data inputs, which would conflict with GDPR data minimisation requirements. A cryptographic hash of the input proves a specific input was processed at a specific time without retaining the personal data. The input reference field in the log should document the hashing algorithm used.
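
Verification works in the opposite direction from logging: given a candidate input (for example, one supplied by a data subject or an auditor), recompute the digest and compare it with the logged reference. The sketch below assumes the `algorithm:hexdigest` format shown in the log schema; `matches_logged_input` and the list of supported algorithms are hypothetical:

```python
import hashlib
import hmac

SUPPORTED_ALGORITHMS = {"sha256", "sha384", "sha512"}


def matches_logged_input(candidate: bytes, input_reference: str) -> bool:
    """Check whether `candidate` hashes to the digest stored in a log event."""
    algorithm, separator, expected_hex = input_reference.partition(":")
    if not separator or algorithm not in SUPPORTED_ALGORITHMS:
        raise ValueError(f"unsupported input reference: {input_reference!r}")
    actual_hex = hashlib.new(algorithm, candidate).hexdigest()
    # Constant-time comparison avoids leaking how many leading characters match.
    return hmac.compare_digest(actual_hex, expected_hex.lower())


reference = "sha256:" + hashlib.sha256(b"example input bytes").hexdigest()
print(matches_logged_input(b"example input bytes", reference))
```

This only works if the bytes are serialised the same way at logging time and at verification time, so the canonical input encoding is worth documenting together with the hashing algorithm.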