Running the Pipeline

The pipeline accelerates contract review that traditionally takes teams of lawyers and analysts weeks of manual work. Due diligence (DD) timelines keep compressing: a process that used to take six weeks is now expected in three, with no reduction in scope. The pipeline analyzes every document across 9 specialist domains (Legal, Finance, Commercial, ProductTech, Cybersecurity, HR, Tax, Regulatory, ESG), cross-validates findings, and produces quality-gated structured analysis with sourced citations.

This tool does not replace professional advisors. Use the output alongside your advisory workstreams to accelerate search, correlation, and tracking across the data room.

Basic Execution

dd-agents run deal-config.json

This runs the full 38-step pipeline: config validation, document extraction, entity resolution, specialist agent analysis, quality audits, and report generation.

Where does output go? Results are written to _dd/forensic-dd/ inside your data room directory. The final report is at {data_room_path}/_dd/forensic-dd/runs/latest/report/dd_report.html. See Output Directory Structure for the full layout.
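
As a small illustration of the layout described above, the sketch below resolves the report location from a data room path. The helper name report_path is hypothetical and not part of dd-agents; only the directory layout is taken from this page.

```python
from pathlib import Path


def report_path(data_room_path: str) -> Path:
    """Return the expected location of the latest HTML report.

    Hypothetical helper: only the path layout comes from the documentation.
    """
    return (
        Path(data_room_path)
        / "_dd" / "forensic-dd" / "runs" / "latest" / "report" / "dd_report.html"
    )


if __name__ == "__main__":
    path = report_path("./data_room")
    status = "exists" if path.exists() else "not generated yet"
    print(f"{path} ({status})")
```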

Execution Modes

Full Mode (default)

Processes all documents from scratch. Use for first runs or when the data room has changed significantly.

dd-agents run deal-config.json --mode full

Incremental Mode

Reuses cached extraction and prior findings. Only re-analyzes new or modified files. Faster for iterative runs on the same data room.

dd-agents run deal-config.json --mode incremental

Command Options

| Option | Description |
| --- | --- |
| --mode full\|incremental | Override execution mode from config |
| --resume-from N | Resume from step N (0-35; 0 means start fresh) |
| --dry-run | Validate config and print step plan without executing |
| --quick-scan | Run steps 1-13 plus Red Flag Scanner only (fast triage) |
| --model-profile PROFILE | Override model tier: economy, standard, premium |
| --model-override AGENT=MODEL | Per-agent model, e.g. --model-override legal=claude-opus-4-8 |
| --no-knowledge | Skip knowledge compilation after pipeline run |
| --no-narrative | Skip LLM narrative generation (deterministic report only) |
| --verbose / -v | Enable debug logging |

Examples

Preview the step plan without running:

dd-agents run deal-config.json --dry-run

Resume after a failure at step 17:

dd-agents run deal-config.json --resume-from 17

Quick red-flag triage with economy models:

dd-agents run deal-config.json --quick-scan --model-profile economy

Use Opus for the legal agent, standard for everything else:

dd-agents run deal-config.json --model-override legal=claude-opus-4-8
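
When the pipeline is launched from a script rather than a terminal, the same invocations can be assembled programmatically. The sketch below is a hypothetical wrapper, not something shipped with dd-agents: the function build_command and its keyword names are assumptions, and only the flags and their allowed values come from the options table above.

```python
import shlex

MODES = {"full", "incremental"}
MODEL_PROFILES = {"economy", "standard", "premium"}


def build_command(config, *, mode=None, resume_from=None, dry_run=False,
                  quick_scan=False, model_profile=None):
    """Build the argument list for a `dd-agents run` invocation.

    Hypothetical helper: it only assembles flags documented on this page.
    """
    if mode is not None and mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    if model_profile is not None and model_profile not in MODEL_PROFILES:
        raise ValueError(f"unknown model profile: {model_profile!r}")

    cmd = ["dd-agents", "run", config]
    if mode is not None:
        cmd += ["--mode", mode]
    if resume_from is not None:
        cmd += ["--resume-from", str(resume_from)]
    if dry_run:
        cmd.append("--dry-run")
    if quick_scan:
        cmd.append("--quick-scan")
    if model_profile is not None:
        cmd += ["--model-profile", model_profile]
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it; pass the list to
    # subprocess.run(...) to execute it for real.
    print(shlex.join(build_command("deal-config.json", quick_scan=True,
                                   model_profile="economy")))
```

Returning a list rather than a single string keeps the arguments safe to hand to subprocess.run without shell quoting.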

The 38-Step Pipeline

The pipeline is organized into 8 phases:

Phase 1: Setup (Steps 1-3) Validate config, initialize persistence layer, check cross-skill dependencies. Step 1 is effectively blocking — if config validation fails, the pipeline halts.

Phase 2: Discovery and Extraction (Steps 4-5) Discover files in the data room. Extract text from PDFs and Office documents using pymupdf with fallback to markitdown, OCR, or Claude vision. Step 5 is a blocking gate — extraction must succeed for at least a minimum threshold of files.
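
The fallback chain and the extraction gate described above can be pictured with a minimal sketch. It assumes each extractor is a plain callable that returns text or fails. The function names, the callable interface, and the success-rate parameter are all hypothetical, and the real pipeline's internals may look quite different.

```python
from typing import Callable, Optional

Extractor = Callable[[str], Optional[str]]


def extract_with_fallback(path, extractors):
    """Try each (name, extractor) pair in order.

    Returns (text, extractor_name) for the first extractor that yields
    non-empty text, or (None, None) if every extractor fails.
    """
    for name, extractor in extractors:
        try:
            text = extractor(path)
        except Exception:
            continue  # this extractor failed; fall through to the next one
        if text and text.strip():
            return text, name
    return None, None


def extraction_gate(results, min_success_rate):
    """Return True when enough files produced text to continue.

    `results` maps file path -> extracted text (or None on failure).
    `min_success_rate` is an assumed parameter; the real threshold is
    not stated in this document.
    """
    if not results:
        return False
    succeeded = sum(1 for text in results.values() if text)
    return succeeded / len(results) >= min_success_rate
```

In this sketch the extractor list would be ordered as in the text: pymupdf first, then markitdown, OCR, and Claude vision as progressively more expensive fallbacks.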

Phase 3: Inventory and Resolution (Steps 6-12) Build subject inventory with document precedence ranking (which version of a file to trust when duplicates exist), match company names across documents (handling aliases, abbreviations, and legal suffixes automatically), build reference registry, count subject mentions, and verify inventory integrity. Steps 11-12 are conditional (database reconciliation and incremental classification).
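
To make the company-name matching step concrete, here is a minimal sketch of suffix stripping and alias lookup. It is not the pipeline's actual resolver: the suffix list, the alias table format, and both function names are assumptions chosen for illustration.

```python
import re

# Illustrative, not exhaustive. The suffix list used by the real pipeline
# is not documented here, so treat this set as an assumption.
LEGAL_SUFFIXES = {
    "inc", "incorporated", "llc", "ltd", "limited",
    "corp", "corporation", "gmbh", "plc", "co",
}


def normalize_company_name(name):
    """Lowercase, replace punctuation with spaces, drop trailing legal suffixes."""
    tokens = re.sub(r"[^\w\s]", " ", name.lower()).split()
    # Keep at least one token so a name like "Co." does not become empty.
    while len(tokens) > 1 and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def canonical_name(name, aliases=None):
    """Map a raw name to its canonical form via an optional alias table.

    `aliases` maps a normalized alias to a normalized canonical name.
    """
    normalized = normalize_company_name(name)
    return (aliases or {}).get(normalized, normalized)
```

With this approach, "ACME, Inc." and "Acme Corp." both reduce to the same key before any fuzzy comparison is attempted.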

Phase 4: Agent Execution (Steps 13-17) Create the specialist team (9 agents by default — Legal, Finance, Commercial, ProductTech, Cybersecurity, HR, Tax, Regulatory, ESG), prepare analysis instructions with document ranking context, route references, and run agents in parallel. Agents can be disabled per-deal via deal-config.json. Step 17 is a blocking gate — coverage must meet minimum thresholds across all active domains.

Phase 5: Cross-Domain Analysis (Steps 18-20) This phase performs neurosymbolic cross-domain analysis. Step 18 uses deterministic trigger rules to detect when findings in one domain imply risks in another (e.g., a change-of-control (CoC) clause flagged by Legal that threatens revenue tracked by Finance). Step 19 spawns targeted pass-2 agents to verify cross-domain risks. Step 20 merges pass-2 findings. These steps are budget-bounded and priority-ordered.
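
A deterministic trigger rule of the kind described above might be sketched as follows. The data classes, the rule fields, the category strings, and the example rules are all hypothetical. The sketch only illustrates the idea of matching findings against fixed rules and then selecting pass-2 work by priority within a budget.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    domain: str
    category: str


@dataclass(frozen=True)
class TriggerRule:
    source_domain: str
    category: str
    target_domain: str
    priority: int  # lower number = verified first
    question: str


# Hypothetical rules for illustration only.
RULES = [
    TriggerRule("legal", "change_of_control", "finance", 1,
                "Which revenue is exposed if customers terminate on change of control?"),
    TriggerRule("cybersecurity", "data_breach", "regulatory", 2,
                "Does the incident create notification or licensing exposure?"),
]


def plan_pass2(findings, rules, budget):
    """Return up to `budget` (rule, finding) pairs, highest priority first.

    Matching is purely rule-based: a rule fires when a finding's domain
    and category equal the rule's source domain and category.
    """
    matches = [
        (rule, finding)
        for finding in findings
        for rule in rules
        if finding.domain == rule.source_domain and finding.category == rule.category
    ]
    matches.sort(key=lambda pair: pair[0].priority)
    return matches[:budget]
```

Because the matching step contains no model calls, the same findings always produce the same pass-2 plan; only the verification itself would be delegated to an agent.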

Phase 6: Merge & Judge Review (Steps 21-25) Step 21 merges incremental results (if applicable). Steps 22-25 optionally spawn the Judge agent for adversarial review of specialist findings (runs only when judge.enabled is true in the deal config).

Phase 7: Validation & Reporting (Steps 26-34) Step 26 runs pre-merge validation (cross-agent anomaly detection, citation verification). Steps 27-28 merge and deduplicate findings across agents and identify coverage gaps. Step 29 builds the numerical manifest. Step 30 runs the numerical audit (blocking gate). Step 31 runs the full QA audit (blocking gate). Step 32 builds the incremental diff. Step 33 generates both Excel and HTML reports, and also runs the Executive Synthesis agent (Go/No-Go calibration), the Acquirer Intelligence agent (when buyer_strategy is configured), and the Red Flag Scanner (when --quick-scan is used). Step 34 is the post-generation validation blocking gate.

Phase 8: Finalization (Steps 35-38) Write run metadata, update run history, save entity resolution cache, shut down.

Blocking Gates

Five steps are blocking gates that halt the pipeline on failure:

| Step | Gate | What It Checks |
| --- | --- | --- |
| 5 | Bulk Extraction | Minimum extraction success rate |
| 17 | Coverage Gate | Agent coverage across all domains |
| 30 | Numerical Audit | Financial figure consistency across 6 validation layers |
| 31 | Full QA Audit | 31 definition-of-done checks |
| 34 | Post-Generation Validation | Report completeness and integrity |

When a gate fails, the pipeline stops with exit code 2 and prints the reason for failure.

How to recover from each gate failure:

| Gate | Common Cause | How to Fix |
| --- | --- | --- |
| Bulk Extraction (step 5) | Corrupted or password-protected files | Remove or replace problem files, then --resume-from 4 |
| Coverage Gate (step 17) | Too few documents per subject | Add missing documents to the data room, then --resume-from 6 |
| Numerical Audit (step 30) | Contradictory financial figures | Review audit.json in the run directory, then --resume-from 30 |
| Full QA Audit (step 31) | Quality checks failed (missing citations) | Review dod_results.json for specifics, then --resume-from 31 |
| Post-Generation (step 34) | Incomplete report output | Check disk space, then --resume-from 33 |
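
The recovery table can also be encoded as a lookup for use in automation. The sketch below is a hypothetical helper, not part of dd-agents. The step numbers and the exit code come from this page, while the function and constant names are assumptions.

```python
GATE_EXIT_CODE = 2  # exit code reported when a blocking gate fails

# Failed gate step -> step to resume from, per the recovery table.
RESUME_STEP_FOR_GATE = {5: 4, 17: 6, 30: 30, 31: 31, 34: 33}


def resume_command(config, failed_step):
    """Build the resume invocation for a failed step.

    Gate failures use the table's resume step; any other step resumes
    from the step that failed.
    """
    resume_from = RESUME_STEP_FOR_GATE.get(failed_step, failed_step)
    return ["dd-agents", "run", config, "--resume-from", str(resume_from)]


if __name__ == "__main__":
    print(" ".join(resume_command("deal-config.json", 17)))
```

Note that the helper only builds the command. The underlying cause, such as a password-protected file, still has to be fixed before resuming.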

Agents

The pipeline uses 13 specialized analyzers — 9 domain specialists that process contracts in parallel, plus 4 synthesis/validation components:

| Agent | Type | Phase | Description |
| --- | --- | --- | --- |
| Legal | Specialist | 4 | Contract clause analysis (CoC, TfC, IP, privacy, liability) with 18 canonical clause types |
| Finance | Specialist | 4 | Revenue recognition, SaaS metrics, financial risk |
| Commercial | Specialist | 4 | Customer concentration, pricing, renewal risk |
| ProductTech | Specialist | 4 | Technology dependencies, integration complexity |
| Cybersecurity | Specialist | 4 | Security governance, incident history, vulnerability management, compliance certifications |
| HR | Specialist | 4 | Workforce composition, compensation, key talent retention, labor compliance |
| Tax | Specialist | 4 | Income tax compliance, transfer pricing, NOL/tax attributes, deal structure tax |
| Regulatory | Specialist | 4 | License transferability, antitrust, data privacy regulation, AML/sanctions |
| ESG | Specialist | 4 | Environmental contamination, climate/carbon risk, ESG governance, supply chain sustainability |
| Judge | Validation | 6 | Adversarial review of specialist findings (optional) |
| Executive Synthesis | Synthesis | 7 | Go/No-Go calibration, severity recalibration |
| Acquirer Intelligence | Synthesis | 7 | Buyer thesis alignment, synergy validation (when buyer_strategy configured) |
| Red Flag Scanner | Triage | 7 | Quick stoplight triage (when --quick-scan used) |

The system is neurosymbolic: deterministic risk scoring, cross-domain trigger rules, and domain ontology graphs provide structured reasoning scaffolding that guides and constrains LLM analysis. All 9 specialists share a base execution engine (BaseAgentRunner) but are differentiated by substantive domain-specific prompts. The specialist set is extensible — external agents can be added via pip entry-points, and agents can be disabled per-deal via deal-config.json.

Output Directory Structure

All output goes under _dd/forensic-dd/ relative to the data room:

_dd/forensic-dd/
├── index/text/                     # PERMANENT: extracted document text
├── inventory/                      # FRESH: rebuilt each run (subject registry, file counts)
├── entity_resolution_cache.json    # PERMANENT: entity matching cache
└── runs/
    └── 20260307_143000/            # VERSIONED: timestamped per run
        ├── findings/
        │   ├── legal/              # Per-agent raw findings
        │   ├── finance/
        │   ├── commercial/
        │   ├── product_tech/
        │   └── merged/            # Deduplicated merged findings
        ├── report/
        │   ├── dd_report.html     # Interactive HTML report
        │   └── dd_report.xlsx     # 14-sheet Excel report
        ├── pre_merge_validation.json  # Cross-agent validation report
        ├── audit.json             # QA validation results
        ├── metadata.json          # Run metadata and costs
        └── dod_results.json       # Definition-of-done check results

Persistence Tiers

  • PERMANENT: Never wiped between runs. Extracted document text (index/text/) and the entity resolution cache. Reused across full and incremental runs.
  • VERSIONED: Archived per run in timestamped directories. Findings, reports, audit results. Each run gets its own copy.
  • FRESH: Rebuilt each run. Inventory (subject registry, file counts), working state, intermediate computations.
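
The three tiers can be illustrated with a short sketch of what preparing a new run might involve. This is an assumption-laden illustration rather than the pipeline's actual code: the function start_run is hypothetical, and the timestamp format is inferred from the example directory name 20260307_143000.

```python
import shutil
from datetime import datetime
from pathlib import Path


def start_run(dd_root, now=None):
    """Prepare the persistence tiers under `dd_root` and return the run directory.

    `dd_root` is the _dd/forensic-dd/ directory inside the data room.
    """
    dd_root = Path(dd_root)

    # PERMANENT: create if missing, never delete existing content.
    (dd_root / "index" / "text").mkdir(parents=True, exist_ok=True)

    # FRESH: wipe and recreate so no stale inventory survives.
    inventory = dd_root / "inventory"
    shutil.rmtree(inventory, ignore_errors=True)
    inventory.mkdir(parents=True)

    # VERSIONED: one new timestamped directory per run. mkdir raises
    # FileExistsError if a run with the same timestamp already exists.
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    run_dir = dd_root / "runs" / stamp
    run_dir.mkdir(parents=True)
    return run_dir
```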

Handling Failures

If the pipeline fails mid-run, note the step number from the error output and resume:

dd-agents run deal-config.json --resume-from 17

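If a blocking gate fails (exit code 2), fix the underlying issue (e.g., add missing documents to the data room) and resume from the step listed for that gate in the recovery table under Blocking Gates. For some gates this is earlier than the gate itself.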

When resuming from steps 3-5, the FRESH persistence tier is automatically wiped to prevent stale inventory data from a prior interrupted run.

Advanced: Environment Variable Overrides

For advanced tuning (e.g., non-English documents, OCR-heavy data rooms), several algorithm thresholds can be overridden via environment variables. All use the DD_ prefix.

| Variable | Tunes |
| --- | --- |
| DD_QUOTE_MATCH_THRESHOLD | Fuzzy match score for citation verification; lower values tolerate more OCR noise |
| DD_MIN_QUOTE_CHARS / DD_MAX_QUOTE_CHARS | Bounds on extracted quote length |
| DD_SYNTHESIS_BUDGET_CHARS | Character budget for synthesis-phase quote aggregation |
| DD_FUZZY_THRESHOLD_LONG / DD_FUZZY_THRESHOLD_MEDIUM | Entity-resolution fuzzy thresholds by name length |
| DD_SHORT_NAME_MAX_LEN | Names at or below this length are matched exactly (never fuzzy) |
| DD_TFIDF_THRESHOLD | Cosine-similarity threshold for TF-IDF entity matching |

Current defaults, and the full list of DD_ overrides, live in the code that reads them: src/dd_agents/utils/constants.py and src/dd_agents/search/analyzer.py. Run grep DD_ src/dd_agents/utils/constants.py src/dd_agents/search/analyzer.py to list the variables and their defaults.

Example:

# Loosen citation matching for OCR-heavy data rooms
DD_QUOTE_MATCH_THRESHOLD=65 dd-agents run deal-config.json

# Tighten entity resolution for data rooms with similar company names
DD_FUZZY_THRESHOLD_LONG=92 DD_FUZZY_THRESHOLD_MEDIUM=98 dd-agents run deal-config.json
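
To show how length-dependent thresholds like these could interact, the sketch below reads three of the variables and applies them to a name comparison. Everything except the variable names is an assumption: the default values are placeholders (the real ones are in constants.py), the medium/long cutoff is invented, and difflib stands in for whatever fuzzy matcher the pipeline actually uses.

```python
import os
from difflib import SequenceMatcher

# Placeholder defaults for illustration only; the real defaults live in
# src/dd_agents/utils/constants.py and may differ.
PLACEHOLDER_DEFAULTS = {
    "DD_SHORT_NAME_MAX_LEN": 5,
    "DD_FUZZY_THRESHOLD_MEDIUM": 95,
    "DD_FUZZY_THRESHOLD_LONG": 88,
}
MEDIUM_NAME_MAX_LEN = 10  # hypothetical cutoff between "medium" and "long"


def env_int(name, env=None):
    """Read an integer override from the environment, else the placeholder."""
    env = os.environ if env is None else env
    raw = env.get(name)
    return int(raw) if raw is not None else PLACEHOLDER_DEFAULTS[name]


def names_match(a, b, env=None):
    """Compare two names using exact or fuzzy matching based on length."""
    a, b = a.lower(), b.lower()
    shorter = min(len(a), len(b))
    if shorter <= env_int("DD_SHORT_NAME_MAX_LEN", env):
        return a == b  # short names are matched exactly, never fuzzily
    key = ("DD_FUZZY_THRESHOLD_MEDIUM" if shorter <= MEDIUM_NAME_MAX_LEN
           else "DD_FUZZY_THRESHOLD_LONG")
    score = SequenceMatcher(None, a, b).ratio() * 100  # scale to 0-100
    return score >= env_int(key, env)
```

Raising a threshold makes near-identical names less likely to merge, which is the effect the second command above relies on for data rooms with similar company names.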

Next Steps