Running the Pipeline

The pipeline accelerates what traditionally takes teams of lawyers and analysts weeks of manual contract review. DD timelines keep compressing — what used to be a six-week process becomes three weeks, with no reduction in scope. The pipeline analyzes every document across 9 specialist domains (Legal, Finance, Commercial, ProductTech, Cybersecurity, HR, Tax, Regulatory, ESG), cross-validates findings, and produces quality-gated structured analysis with sourced citations.

This tool does not replace professional advisors. Use the output alongside your advisory workstreams to accelerate search, correlation, and tracking across the data room.

Basic Execution

dd-agents run deal-config.json

This runs the full 38-step pipeline: config validation, document extraction, entity resolution, specialist agent analysis, quality audits, and report generation.

Where does output go? Results are written to _dd/forensic-dd/ inside your data room directory. The final report is at {data_room_path}/_dd/forensic-dd/runs/latest/report/dd_report.html. See Output Directory Structure for the full layout.
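
If you want to open or post-process the report from a script, the path can be assembled from the data room location. The following is a minimal sketch; the helper name `report_path` is hypothetical and not part of dd-agents, and it assumes the `runs/latest` location described above.

```python
from pathlib import Path


def report_path(data_room_path: str, run: str = "latest") -> Path:
    """Return the expected HTML report location for a given run directory name."""
    return (
        Path(data_room_path)
        / "_dd"
        / "forensic-dd"
        / "runs"
        / run
        / "report"
        / "dd_report.html"
    )
```

Passing a timestamped run name such as `20260307_143000` instead of `latest` points at the report of that specific run.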

Execution Modes

Full Mode (default)

Processes all documents from scratch. Use for first runs or when the data room has changed significantly.

dd-agents run deal-config.json --mode full

Incremental Mode

Reuses cached extraction and prior findings. Only re-analyzes new or modified files. Faster for iterative runs on the same data room.

dd-agents run deal-config.json --mode incremental

Command Options

Option	Description
`--mode full|incremental`	Override execution mode from config
`--resume-from N`	Resume from step N (0-38; 0 means start fresh)
`--dry-run`	Validate config and print step plan without executing
`--quick-scan`	Run steps 1-13 plus Red Flag Scanner only (fast triage)
`--model-profile PROFILE`	Override model tier: `economy`, `standard`, `premium`
`--model-override AGENT=MODEL`	Per-agent model, e.g. `--model-override legal=claude-opus-4-8`
`--no-knowledge`	Skip knowledge compilation after pipeline run
`--no-narrative`	Skip LLM narrative generation (deterministic report only)
`--verbose / -v`	Enable debug logging

Examples

Preview the step plan without running:

dd-agents run deal-config.json --dry-run

Resume after a failure at step 17:

dd-agents run deal-config.json --resume-from 17

Quick red-flag triage with economy models:

dd-agents run deal-config.json --quick-scan --model-profile economy

Use Opus for the legal agent, standard for everything else:

dd-agents run deal-config.json --model-override legal=claude-opus-4-8
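
When the pipeline is launched from automation, it can help to build the argument list programmatically. The sketch below uses only the flags listed in the options table; the function `build_command` is hypothetical, and repeating `--model-override` once per agent is an assumption rather than documented behavior.

```python
from typing import Dict, List, Optional

VALID_MODES = {"full", "incremental"}
VALID_PROFILES = {"economy", "standard", "premium"}


def build_command(
    config: str,
    mode: Optional[str] = None,
    resume_from: Optional[int] = None,
    dry_run: bool = False,
    quick_scan: bool = False,
    model_profile: Optional[str] = None,
    model_overrides: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Assemble a dd-agents argv list from the documented options."""
    if mode is not None and mode not in VALID_MODES:
        raise ValueError(f"unknown mode: {mode}")
    if model_profile is not None and model_profile not in VALID_PROFILES:
        raise ValueError(f"unknown model profile: {model_profile}")

    cmd = ["dd-agents", "run", config]
    if mode is not None:
        cmd += ["--mode", mode]
    if resume_from is not None:
        cmd += ["--resume-from", str(resume_from)]
    if dry_run:
        cmd.append("--dry-run")
    if quick_scan:
        cmd.append("--quick-scan")
    if model_profile is not None:
        cmd += ["--model-profile", model_profile]
    # Assumption: one --model-override flag per AGENT=MODEL pair.
    for agent, model in (model_overrides or {}).items():
        cmd += ["--model-override", f"{agent}={model}"]
    return cmd
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting issues with config paths that contain spaces.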

The 38-Step Pipeline

The pipeline is organized into 8 phases:

Phase 1: Setup (Steps 1-3) Validate config, initialize persistence layer, check cross-skill dependencies. Step 1 is effectively blocking — if config validation fails, the pipeline halts.

Phase 2: Discovery and Extraction (Steps 4-5) Discover files in the data room. Extract text from PDFs and Office documents using pymupdf with fallback to markitdown, OCR, or Claude vision. Step 5 is a blocking gate — extraction must succeed for at least a minimum threshold of files.
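
The fallback idea in Phase 2 can be illustrated with a generic chain of extractors. This is a sketch of the pattern only, not the pipeline's actual code: the extractors are passed in as plain callables (standing in for pymupdf, markitdown, OCR, and vision), and `min_success_rate` is a hypothetical parameter because the real threshold is not stated here.

```python
from typing import Callable, Dict, Iterable, Optional

# An extractor takes a file path and returns text, or None if it cannot help.
Extractor = Callable[[str], Optional[str]]


def extract_with_fallback(path: str, extractors: Iterable[Extractor]) -> Optional[str]:
    """Try each extractor in order and return the first non-empty text."""
    for extract in extractors:
        try:
            text = extract(path)
        except Exception:
            continue  # this extractor failed; fall through to the next one
        if text and text.strip():
            return text
    return None  # every extractor failed or returned empty text


def extraction_gate(results: Dict[str, Optional[str]], min_success_rate: float) -> bool:
    """Return True when the share of successfully extracted files meets the threshold."""
    if not results:
        return False
    succeeded = sum(1 for text in results.values() if text is not None)
    return succeeded / len(results) >= min_success_rate
```

Ordering the extractors from cheapest to most expensive keeps the costly options (OCR, vision) for the files that actually need them.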

Phase 3: Inventory and Resolution (Steps 6-12) Build subject inventory with document precedence ranking (which version of a file to trust when duplicates exist), match company names across documents (handling aliases, abbreviations, and legal suffixes automatically), build reference registry, count subject mentions, and verify inventory integrity. Steps 11-12 are conditional (database reconciliation and incremental classification).
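
To make the legal-suffix handling concrete, here is a deliberately simple normalization sketch. It is an illustration under stated assumptions, not the pipeline's resolver: the suffix list and the function `normalize_company_name` are hypothetical, and alias or abbreviation handling is not covered.

```python
import re

# Hypothetical, non-exhaustive list of legal suffixes to strip.
LEGAL_SUFFIXES = {
    "inc", "incorporated", "llc", "ltd", "limited",
    "corp", "corporation", "gmbh", "plc", "co",
}


def normalize_company_name(name: str) -> str:
    """Lowercase, drop punctuation, and strip trailing legal suffixes."""
    tokens = re.sub(r"[^\w\s]", " ", name.lower()).split()
    # Keep at least one token so a name like "Limited" is not erased entirely.
    while len(tokens) > 1 and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)
```

With this normalization, "Acme Holdings, Inc." and "ACME Holdings LLC" reduce to the same key, which is the kind of match the resolution step needs before any fuzzy comparison is applied.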

Phase 4: Agent Execution (Steps 13-17) Create the specialist team (9 agents by default — Legal, Finance, Commercial, ProductTech, Cybersecurity, HR, Tax, Regulatory, ESG), prepare analysis instructions with document ranking context, route references, and run agents in parallel. Agents can be disabled per-deal via deal-config.json. Step 17 is a blocking gate — coverage must meet minimum thresholds across all active domains.

Phase 5: Cross-Domain Analysis (Steps 18-20) Neurosymbolic cross-domain analysis. Step 18 uses deterministic trigger rules to detect when findings in one domain imply risks in another (e.g., a Legal CoC clause that threatens Finance revenue). Step 19 spawns targeted pass-2 agents to verify cross-domain risks. Step 20 merges pass-2 findings. These steps are budget-bounded and priority-ordered.
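
A deterministic trigger rule of the kind described in Phase 5 might look like the sketch below. All names here (`Finding`, `TriggerRule`, `plan_pass2`, the category strings, and the second rule) are hypothetical; only the Legal-to-Finance change-of-control example comes from the description above. The sketch shows the two stated properties: matches are priority-ordered and the number of pass-2 checks is capped by a budget.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Finding:
    domain: str
    category: str


@dataclass(frozen=True)
class TriggerRule:
    source_domain: str   # domain where the finding was raised
    category: str        # finding category that fires the rule
    target_domain: str   # domain that should verify the implied risk
    priority: int        # lower number = checked first


RULES = [
    TriggerRule("Legal", "change_of_control", "Finance", priority=1),
    TriggerRule("Cybersecurity", "data_breach", "Regulatory", priority=2),  # illustrative
]


def plan_pass2(
    findings: List[Finding], rules: List[TriggerRule], budget: int
) -> List[Tuple[str, Finding]]:
    """Return (target_domain, finding) pairs, priority-ordered and capped at budget."""
    matches = [
        (rule, finding)
        for finding in findings
        for rule in rules
        if finding.domain == rule.source_domain and finding.category == rule.category
    ]
    matches.sort(key=lambda match: match[0].priority)
    return [(rule.target_domain, finding) for rule, finding in matches[:budget]]
```

Because the rules are plain data, the same findings always produce the same pass-2 plan, which is what makes this stage deterministic.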

Phase 6: Merge & Judge Review (Steps 21-25) Step 21 merges incremental results (if applicable). Steps 22-25 optionally spawn the Judge agent for adversarial review of specialist findings (runs only when judge.enabled is true in the deal config).

Phase 7: Validation & Reporting (Steps 26-34) Step 26 runs pre-merge validation (cross-agent anomaly detection, citation verification). Steps 27-28 merge and deduplicate findings across agents and identify coverage gaps. Step 29 builds the numerical manifest. Step 30 runs the numerical audit (blocking gate). Step 31 runs the full QA audit (blocking gate). Step 32 builds the incremental diff. Step 33 generates both Excel and HTML reports, and also runs the Executive Synthesis agent (Go/No-Go calibration), the Acquirer Intelligence agent (when buyer_strategy is configured), and the Red Flag Scanner (when --quick-scan is used). Step 34 is the post-generation validation blocking gate.

Phase 8: Finalization (Steps 35-38) Write run metadata, update run history, save entity resolution cache, shut down.
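
When reading logs or error output, it can be handy to map a step number back to its phase. The lookup below simply encodes the step ranges listed above; the helper `phase_for_step` is a hypothetical convenience, not part of the CLI.

```python
from typing import Tuple

# (first_step, last_step, phase_name), taken from the phase list above.
PHASES = [
    (1, 3, "Setup"),
    (4, 5, "Discovery and Extraction"),
    (6, 12, "Inventory and Resolution"),
    (13, 17, "Agent Execution"),
    (18, 20, "Cross-Domain Analysis"),
    (21, 25, "Merge & Judge Review"),
    (26, 34, "Validation & Reporting"),
    (35, 38, "Finalization"),
]


def phase_for_step(step: int) -> Tuple[int, str]:
    """Return (phase_number, phase_name) for a pipeline step."""
    for number, (first, last, name) in enumerate(PHASES, start=1):
        if first <= step <= last:
            return number, name
    raise ValueError(f"step {step} is outside the 38-step pipeline")
```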

Blocking Gates

Five steps are blocking gates that halt the pipeline on failure:

Step	Gate	What It Checks
5	Bulk Extraction	Minimum extraction success rate
17	Coverage Gate	Agent coverage across all domains
30	Numerical Audit	Financial figure consistency across 6 validation layers
31	Full QA Audit	31 definition-of-done checks
34	Post-Generation Validation	Report completeness and integrity

When a gate fails, the pipeline stops with exit code 2 and prints the reason for failure.

How to recover from each gate failure:

Gate	Common Cause	How to Fix
Bulk Extraction (step 5)	Corrupted or password-protected files	Remove or replace problem files, then `--resume-from 4`
Coverage Gate (step 17)	Too few documents per subject	Add missing documents to the data room, then `--resume-from 6`
Numerical Audit (step 30)	Contradictory financial figures	Review `audit.json` in the run directory, then `--resume-from 30`
Full QA Audit (step 31)	Quality checks failed (missing citations)	Review `dod_results.json` for specifics, then `--resume-from 31`
Post-Generation (step 34)	Incomplete report output	Check disk space, then `--resume-from 33`
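
The recovery table above can be encoded as a small lookup for wrapper scripts. This is a sketch: `GATE_RECOVERY` and `resume_command` are hypothetical names, and how a wrapper learns the failed step number (for example, by parsing the error output) is left out because that format is not described here.

```python
from typing import Dict, List, Tuple

GATE_FAILURE_EXIT_CODE = 2

# gate step -> (gate name, step to resume from), from the recovery table.
GATE_RECOVERY: Dict[int, Tuple[str, int]] = {
    5: ("Bulk Extraction", 4),
    17: ("Coverage Gate", 6),
    30: ("Numerical Audit", 30),
    31: ("Full QA Audit", 31),
    34: ("Post-Generation Validation", 33),
}


def resume_command(config: str, failed_step: int) -> List[str]:
    """Build the resume command for a failed step.

    Gate steps resume from the step given in the recovery table;
    any other step resumes from itself.
    """
    _, resume_from = GATE_RECOVERY.get(failed_step, ("", failed_step))
    return ["dd-agents", "run", config, "--resume-from", str(resume_from)]
```

Note that the resume point is earlier than the gate for steps 5, 17, and 34, because the fix changes inputs that earlier steps must re-read.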

Agents

The pipeline uses 13 specialized analyzers — 9 domain specialists that process contracts in parallel, plus 4 synthesis/validation components:

Agent	Type	Phase	Description
Legal	Specialist	4	Contract clause analysis (CoC, TfC, IP, privacy, liability) with 18 canonical clause types
Finance	Specialist	4	Revenue recognition, SaaS metrics, financial risk
Commercial	Specialist	4	Customer concentration, pricing, renewal risk
ProductTech	Specialist	4	Technology dependencies, integration complexity
Cybersecurity	Specialist	4	Security governance, incident history, vulnerability management, compliance certifications
HR	Specialist	4	Workforce composition, compensation, key talent retention, labor compliance
Tax	Specialist	4	Income tax compliance, transfer pricing, NOL/tax attributes, deal structure tax
Regulatory	Specialist	4	License transferability, antitrust, data privacy regulation, AML/sanctions
ESG	Specialist	4	Environmental contamination, climate/carbon risk, ESG governance, supply chain sustainability
Judge	Validation	6	Adversarial review of specialist findings (optional)
Executive Synthesis	Synthesis	7	Go/No-Go calibration, severity recalibration
Acquirer Intelligence	Synthesis	7	Buyer thesis alignment, synergy validation (when `buyer_strategy` configured)
Red Flag Scanner	Triage	7	Quick stoplight triage (when `--quick-scan` used)

The system is neurosymbolic: deterministic risk scoring, cross-domain trigger rules, and domain ontology graphs provide structured reasoning scaffolding that guides and constrains LLM analysis. All 9 specialists share a base execution engine (BaseAgentRunner) but are differentiated by substantive domain-specific prompts. The specialist set is extensible — external agents can be added via pip entry-points, and agents can be disabled per-deal via deal-config.json.

Output Directory Structure

All output goes under _dd/forensic-dd/ relative to the data room:

_dd/forensic-dd/
├── index/text/                     # PERMANENT: extracted document text
├── inventory/                      # FRESH: rebuilt each run (subject registry, file counts)
├── entity_resolution_cache.json    # PERMANENT: entity matching cache
└── runs/
    └── 20260307_143000/            # VERSIONED: timestamped per run
        ├── findings/
        │   ├── legal/              # Per-agent raw findings
        │   ├── finance/
        │   ├── commercial/
        │   ├── product_tech/
        │   └── merged/            # Deduplicated merged findings
        ├── report/
        │   ├── dd_report.html     # Interactive HTML report
        │   └── dd_report.xlsx     # 14-sheet Excel report
        ├── pre_merge_validation.json  # Cross-agent validation report
        ├── audit.json             # QA validation results
        ├── metadata.json          # Run metadata and costs
        └── dod_results.json       # Definition-of-done check results
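
Given this layout, a script can locate the most recent timestamped run by name. The sketch below assumes run directories follow the `YYYYMMDD_HHMMSS` pattern shown in the tree, so that lexical order equals chronological order; `latest_run_dir` is a hypothetical helper.

```python
import re
from pathlib import Path
from typing import Optional

# Matches run directory names such as 20260307_143000.
RUN_DIR_PATTERN = re.compile(r"^\d{8}_\d{6}$")


def latest_run_dir(data_room: Path) -> Optional[Path]:
    """Return the newest timestamped run directory, or None if there are no runs."""
    runs = data_room / "_dd" / "forensic-dd" / "runs"
    if not runs.is_dir():
        return None
    candidates = sorted(
        (p for p in runs.iterdir() if p.is_dir() and RUN_DIR_PATTERN.match(p.name)),
        key=lambda p: p.name,
    )
    return candidates[-1] if candidates else None
```

Entries that do not match the timestamp pattern, such as a `latest` alias, are ignored, so the function always returns a concrete run directory.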

Persistence Tiers

PERMANENT: Never wiped between runs. Extraction cache, entity resolution cache, subject registry. Reused across full and incremental runs.
VERSIONED: Archived per run in timestamped directories. Findings, reports, audit results. Each run gets its own copy.
FRESH: Rebuilt each run. Working state, intermediate computations.

Handling Failures

If the pipeline fails mid-run, note the step number from the error output and resume:

dd-agents run deal-config.json --resume-from 17

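If a blocking gate fails (exit code 2), fix the underlying issue (e.g., add missing documents to the data room) and resume from the step listed for that gate in the recovery table under Blocking Gates. The resume point is not always the gate step itself: a Coverage Gate failure at step 17, for example, resumes from step 6.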

When resuming from steps 3-5, the FRESH persistence tier is automatically wiped to prevent stale inventory data from a prior interrupted run.

Advanced: Environment Variable Overrides

For advanced tuning (e.g., non-English documents, OCR-heavy data rooms), several algorithm thresholds can be overridden via environment variables. All use the DD_ prefix.

Variable	Tunes
`DD_QUOTE_MATCH_THRESHOLD`	Fuzzy match score for citation verification — lower tolerates more OCR noise
`DD_MIN_QUOTE_CHARS` / `DD_MAX_QUOTE_CHARS`	Bounds on extracted quote length
`DD_SYNTHESIS_BUDGET_CHARS`	Character budget for synthesis-phase quote aggregation
`DD_FUZZY_THRESHOLD_LONG` / `DD_FUZZY_THRESHOLD_MEDIUM`	Entity-resolution fuzzy thresholds by name length
`DD_SHORT_NAME_MAX_LEN`	Names at or below this length are matched exactly (never fuzzy)
`DD_TFIDF_THRESHOLD`	Cosine-similarity threshold for TF-IDF entity matching

Current defaults — and the full list of DD_ overrides — live in the code that reads them: src/dd_agents/utils/constants.py and src/dd_agents/search/analyzer.py. Run grep DD_ src/dd_agents/utils/constants.py to see every variable and its default.

Example:

# Loosen citation matching for OCR-heavy data rooms
DD_QUOTE_MATCH_THRESHOLD=65 dd-agents run deal-config.json

# Tighten entity resolution for data rooms with similar company names
DD_FUZZY_THRESHOLD_LONG=92 DD_FUZZY_THRESHOLD_MEDIUM=98 dd-agents run deal-config.json
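
To see why a lower `DD_QUOTE_MATCH_THRESHOLD` tolerates more OCR noise, consider the sketch below. It uses Python's standard `difflib.SequenceMatcher` as a stand-in scorer; the pipeline's actual matching algorithm and default values are not shown here, and the 0-100 scale is inferred from the example value of 65 above. The helper names and the default passed to `env_threshold` are hypothetical.

```python
import os
from difflib import SequenceMatcher


def env_threshold(name: str, default: float) -> float:
    """Read a numeric threshold from the environment, falling back to a default."""
    raw = os.environ.get(name)
    return float(raw) if raw is not None else default


def quote_matches(quote: str, source_text: str, threshold: float) -> bool:
    """Return True when the similarity score (0-100) meets the threshold."""
    score = SequenceMatcher(None, quote.lower(), source_text.lower()).ratio() * 100
    return score >= threshold
```

A quote such as "termination for convenience" compared against the OCR output "terminat1on for c0nvenience" scores below 100 because of the two misread characters, so it passes a loose threshold like 65 but fails a near-exact one like 99.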

Next Steps

Reading the Report -- Navigate the generated reports
Deal Configuration -- Adjust config settings
CLI Reference -- Full command reference