Running the Pipeline
The pipeline accelerates what traditionally takes teams of lawyers and analysts weeks of manual contract review. Due diligence (DD) timelines keep compressing: what used to be a six-week process becomes three weeks, with no reduction in scope. The pipeline analyzes every document across 9 specialist domains (Legal, Finance, Commercial, ProductTech, Cybersecurity, HR, Tax, Regulatory, ESG), cross-validates findings, and produces quality-gated structured analysis with sourced citations.
This tool does not replace professional advisors. Use the output alongside your advisory workstreams to accelerate search, correlation, and tracking across the data room.
Basic Execution
Running dd-agents run deal-config.json executes the full 38-step pipeline: config validation, document extraction, entity resolution, specialist agent analysis, quality audits, and report generation.
Where does output go? Results are written to _dd/forensic-dd/ inside your data room directory. The final report is at {data_room_path}/_dd/forensic-dd/runs/latest/report/dd_report.html. See Output Directory Structure for the full layout.
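To locate the report from a script, the documented path can be assembled with pathlib. This is a minimal sketch: report_path is a hypothetical helper (not part of the tool), and the data room location is illustrative.

```python
from pathlib import Path


def report_path(data_room: str) -> Path:
    """Return the documented location of the latest HTML report."""
    return (
        Path(data_room)
        / "_dd" / "forensic-dd" / "runs" / "latest" / "report" / "dd_report.html"
    )


# Hypothetical data room location, for illustration only.
path = report_path("/deals/acme-data-room")
print(path.as_posix())
```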
Execution Modes
Full Mode (default)
Processes all documents from scratch. Use for first runs or when the data room has changed significantly.
Incremental Mode
Reuses cached extraction and prior findings. Only re-analyzes new or modified files. Faster for iterative runs on the same data room.
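The idea behind "only re-analyzes new or modified files" can be sketched as a content-hash comparison. The pipeline's actual cache format is not documented here; this example assumes a hypothetical cache that maps each relative path to a SHA-256 digest.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Hash file contents; content hashes catch edits that timestamps can miss."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def changed_files(data_room: Path, cache: dict[str, str]) -> list[str]:
    """Return relative paths that are new or whose contents differ from the cache."""
    changed = []
    for path in sorted(data_room.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(data_room).as_posix()
        # A missing cache entry (new file) or a different digest (modified file)
        # both mean the file needs re-analysis.
        if cache.get(rel) != file_digest(path):
            changed.append(rel)
    return changed
```

Files whose digest matches the cache are skipped, which is where the time savings of an incremental run would come from.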
Command Options
| Option | Description |
|---|---|
| --mode full\|incremental | Override execution mode from config |
| --resume-from N | Resume from step N (0-35; 0 means start fresh) |
| --dry-run | Validate config and print step plan without executing |
| --quick-scan | Run steps 1-13 plus Red Flag Scanner only (fast triage) |
| --model-profile PROFILE | Override model tier: economy, standard, premium |
| --model-override AGENT=MODEL | Per-agent model, e.g. --model-override legal=claude-opus-4-8 |
| --no-knowledge | Skip knowledge compilation after pipeline run |
| --no-narrative | Skip LLM narrative generation (deterministic report only) |
| --verbose / -v | Enable debug logging |
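When runs are launched from a script, the options above can be composed programmatically. The sketch below is illustrative only: build_command is a hypothetical helper, and it assumes options are accepted after the config path.

```python
import shlex


def build_command(config, *, mode=None, resume_from=None, dry_run=False,
                  quick_scan=False, model_profile=None, model_overrides=None):
    """Compose a dd-agents argument list from the options documented above."""
    argv = ["dd-agents", "run", config]
    if mode is not None:
        if mode not in ("full", "incremental"):
            raise ValueError(f"unknown mode: {mode}")
        argv += ["--mode", mode]
    if resume_from is not None:
        argv += ["--resume-from", str(resume_from)]
    if dry_run:
        argv.append("--dry-run")
    if quick_scan:
        argv.append("--quick-scan")
    if model_profile is not None:
        argv += ["--model-profile", model_profile]
    # One --model-override flag per agent, in AGENT=MODEL form.
    for agent, model in (model_overrides or {}).items():
        argv += ["--model-override", f"{agent}={model}"]
    return argv


print(shlex.join(build_command("deal-config.json", quick_scan=True,
                               model_profile="economy")))
```

Returning a list rather than a string keeps the result safe to pass directly to subprocess.run without shell quoting concerns.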
Examples
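Preview the step plan without running:
dd-agents run deal-config.json --dry-run
Resume after a failure at step 17:
dd-agents run deal-config.json --resume-from 17
Quick red-flag triage with economy models:
dd-agents run deal-config.json --quick-scan --model-profile economy
Use Opus for the legal agent, standard for everything else:
dd-agents run deal-config.json --model-profile standard --model-override legal=claude-opus-4-8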
The 38-Step Pipeline
The pipeline is organized into 8 phases:
Phase 1: Setup (Steps 1-3)
Validate config, initialize persistence layer, check cross-skill dependencies. Step 1 is effectively blocking: if config validation fails, the pipeline halts.
Phase 2: Discovery and Extraction (Steps 4-5)
Discover files in the data room. Extract text from PDFs and Office documents using pymupdf with fallback to markitdown, OCR, or Claude vision. Step 5 is a blocking gate: extraction must succeed for at least a minimum threshold of files.
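The fallback chain and the extraction gate can be pictured as follows. This is a sketch under stated assumptions: the extractor callables stand in for pymupdf, markitdown, OCR, and vision, and min_success_rate is a hypothetical parameter, since the real threshold is not given here.

```python
from typing import Callable, Optional

# An extractor takes a file path and returns text, or None if it cannot help.
Extractor = Callable[[str], Optional[str]]


def extract_with_fallback(path: str, extractors: list[Extractor]) -> Optional[str]:
    """Try each extractor in order and return the first non-empty text."""
    for extract in extractors:
        try:
            text = extract(path)
        except Exception:
            continue  # treat any extractor error as "try the next one"
        if text and text.strip():
            return text
    return None


def extraction_gate(results: dict[str, Optional[str]], min_success_rate: float) -> bool:
    """Pass only if the share of successfully extracted files meets the threshold."""
    if not results:
        return False
    succeeded = sum(1 for text in results.values() if text is not None)
    return succeeded / len(results) >= min_success_rate
```

Ordering the extractors from cheapest to most expensive means the costlier fallbacks only run on files the earlier ones could not handle.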
Phase 3: Inventory and Resolution (Steps 6-12)
Build the subject inventory with document precedence ranking (which version of a file to trust when duplicates exist), match company names across documents (handling aliases, abbreviations, and legal suffixes automatically), build the reference registry, count subject mentions, and verify inventory integrity. Steps 11-12 are conditional (database reconciliation and incremental classification).
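Company-name matching of this kind is often built from suffix stripping plus length-aware fuzzy comparison. The sketch below uses Python's difflib as a stand-in; it is not the pipeline's algorithm, and every constant in it (suffix list, length boundaries, thresholds) is a hypothetical value chosen for illustration.

```python
import re
from difflib import SequenceMatcher

LEGAL_SUFFIXES = {"inc", "llc", "ltd", "corp", "corporation", "gmbh", "plc", "co"}
SHORT_NAME_MAX_LEN = 5     # hypothetical: at or below this, exact match only
MEDIUM_NAME_MAX_LEN = 12   # hypothetical boundary between medium and long names
THRESHOLD_MEDIUM = 90.0    # hypothetical score (0-100) for medium names
THRESHOLD_LONG = 85.0      # hypothetical score (0-100) for long names


def normalize(name: str) -> str:
    """Lowercase, drop punctuation, and strip trailing legal suffixes."""
    tokens = re.sub(r"[^\w\s]", " ", name.lower()).split()
    while tokens and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def same_entity(a: str, b: str) -> bool:
    """Decide whether two company names refer to the same entity."""
    na, nb = normalize(a), normalize(b)
    if na and na == nb:
        return True
    shortest = min(len(na), len(nb))
    if shortest <= SHORT_NAME_MAX_LEN:
        return False  # short names are never matched fuzzily
    score = 100.0 * SequenceMatcher(None, na, nb).ratio()
    threshold = THRESHOLD_MEDIUM if shortest <= MEDIUM_NAME_MAX_LEN else THRESHOLD_LONG
    return score >= threshold
```

Requiring exact matches for short names avoids merging distinct entities that differ by a single character, where a fuzzy score would be misleadingly high.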
Phase 4: Agent Execution (Steps 13-17)
Create the specialist team (9 agents by default — Legal, Finance, Commercial, ProductTech,
Cybersecurity, HR, Tax, Regulatory, ESG), prepare analysis instructions with document
ranking context, route references, and run agents in parallel. Agents can be disabled
per-deal via deal-config.json. Step 17 is a blocking gate — coverage must meet
minimum thresholds across all active domains.
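Running the specialist team in parallel and then checking coverage might look like the sketch below. The agent identifiers, the run_agent stand-in, the disabled set, and the min_coverage value are all assumptions for illustration; the real agents call an LLM and the real settings live in deal-config.json.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical identifiers for the nine default specialists.
DEFAULT_AGENTS = ["legal", "finance", "commercial", "product_tech", "cybersecurity",
                  "hr", "tax", "regulatory", "esg"]


def run_agent(name: str, documents: list[str]) -> dict:
    """Stand-in for a specialist agent: report how many documents it covered."""
    return {"agent": name, "covered": len(documents), "total": len(documents)}


def run_specialists(documents, disabled=frozenset()):
    """Run every active agent concurrently and index results by agent name."""
    active = [a for a in DEFAULT_AGENTS if a not in disabled]
    with ThreadPoolExecutor(max_workers=len(active) or 1) as pool:
        results = list(pool.map(lambda a: run_agent(a, documents), active))
    return {r["agent"]: r for r in results}


def coverage_gate(results, min_coverage=0.9):
    """Every active domain must meet the minimum coverage ratio."""
    return all(
        r["total"] > 0 and r["covered"] / r["total"] >= min_coverage
        for r in results.values()
    )
```

Threads suit this shape of work because each agent spends most of its time waiting on I/O rather than computing.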
Phase 5: Cross-Domain Analysis (Steps 18-20)
This phase performs neurosymbolic cross-domain analysis. Step 18 uses deterministic trigger rules to detect when findings in one domain imply risks in another (e.g., a Legal change-of-control (CoC) clause that threatens Finance revenue). Step 19 spawns targeted pass-2 agents to verify cross-domain risks. Step 20 merges pass-2 findings. These steps are budget-bounded and priority-ordered.
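A deterministic trigger rule of the kind described above can be modeled as a plain lookup with a priority and a budget. This is a sketch, not the pipeline's rule engine: the class names, category strings, and the second rule are hypothetical; only the Legal-to-Finance example comes from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    domain: str
    category: str
    subject: str


@dataclass(frozen=True)
class TriggerRule:
    source_domain: str
    category: str
    target_domain: str
    priority: int  # lower number = evaluated first


RULES = [
    # From the text: a Legal change-of-control clause implies a Finance risk.
    TriggerRule("Legal", "change_of_control", "Finance", priority=1),
    # Hypothetical second rule, for illustration only.
    TriggerRule("Cybersecurity", "data_breach", "Regulatory", priority=2),
]


def plan_pass2(findings, rules=RULES, budget=5):
    """Return (target_domain, subject) follow-ups, highest priority first, capped by budget."""
    tasks = []
    for rule in sorted(rules, key=lambda r: r.priority):
        for f in findings:
            if f.domain == rule.source_domain and f.category == rule.category:
                task = (rule.target_domain, f.subject)
                if task not in tasks:  # skip duplicate follow-ups
                    tasks.append(task)
    return tasks[:budget]
```

Because the rules are ordinary data, the same findings always produce the same pass-2 plan, which is what makes this step auditable.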
Phase 6: Merge & Judge Review (Steps 21-25)
Step 21 merges incremental results (if applicable). Steps 22-25 optionally spawn the
Judge agent for adversarial review of specialist findings (runs only when
judge.enabled is true in the deal config).
Phase 7: Validation & Reporting (Steps 26-34)
Step 26 runs pre-merge validation (cross-agent anomaly detection, citation verification).
Steps 27-28 merge and deduplicate findings across agents and identify coverage gaps.
Step 29 builds the numerical manifest. Step 30 runs the numerical audit
(blocking gate). Step 31 runs the full QA audit (blocking gate). Step 32 builds
the incremental diff. Step 33 generates both Excel and HTML reports, and also runs the
Executive Synthesis agent (Go/No-Go calibration), the Acquirer Intelligence agent
(when buyer_strategy is configured), and the Red Flag Scanner (when --quick-scan
is used). Step 34 is the post-generation validation blocking gate.
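Citation verification in step 26 amounts to checking that a quoted passage really appears in the extracted text, with some tolerance for OCR noise. The sketch below approximates that with difflib on a 0-100 scale; the scoring method, window stepping, and default threshold are assumptions, not the pipeline's implementation.

```python
from difflib import SequenceMatcher


def best_quote_score(quote: str, document_text: str) -> float:
    """Slide a quote-sized window over the text and return the best 0-100 similarity."""
    quote = " ".join(quote.lower().split())
    text = " ".join(document_text.lower().split())
    if not quote or not text:
        return 0.0
    if quote in text:
        return 100.0  # exact match after whitespace and case normalization
    window = len(quote)
    step = max(1, window // 4)  # quarter-window steps trade exactness for speed
    best = 0.0
    for start in range(0, max(1, len(text) - window + 1), step):
        candidate = text[start:start + window]
        best = max(best, 100.0 * SequenceMatcher(None, quote, candidate).ratio())
    return best


def citation_verified(quote: str, document_text: str, threshold: float = 80.0) -> bool:
    """Accept a citation when its best window score reaches the threshold."""
    return best_quote_score(quote, document_text) >= threshold
```

Lowering the threshold accepts noisier quotes at the cost of more false positives, which is the trade-off behind tuning citation matching for OCR-heavy data rooms.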
Phase 8: Finalization (Steps 35-38)
Write run metadata, update run history, save entity resolution cache, shut down.
Blocking Gates
Five steps are blocking gates that halt the pipeline on failure:
| Step | Gate | What It Checks |
|---|---|---|
| 5 | Bulk Extraction | Minimum extraction success rate |
| 17 | Coverage Gate | Agent coverage across all domains |
| 30 | Numerical Audit | Financial figure consistency across 6 validation layers |
| 31 | Full QA Audit | 31 definition-of-done checks |
| 34 | Post-Generation Validation | Report completeness and integrity |
When a gate fails, the pipeline stops with exit code 2 and prints the reason for failure.
How to recover from each gate failure:
| Gate | Common Cause | How to Fix |
|---|---|---|
| Bulk Extraction (step 5) | Corrupted or password-protected files | Remove or replace problem files, then --resume-from 4 |
| Coverage Gate (step 17) | Too few documents per subject | Add missing documents to the data room, then --resume-from 6 |
| Numerical Audit (step 30) | Contradictory financial figures | Review audit.json in the run directory, then --resume-from 30 |
| Full QA Audit (step 31) | Quality checks failed (missing citations) | Review dod_results.json for specifics, then --resume-from 31 |
| Post-Generation (step 34) | Incomplete report output | Check disk space, then --resume-from 33 |
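For automation, the recovery table can be encoded as a lookup from the failed gate to the step to resume from. A minimal sketch: the helper name is hypothetical, and it assumes --resume-from is accepted after the config path.

```python
# Resume points from the recovery table: failed gate step -> step to resume from.
RESUME_AFTER_GATE = {5: 4, 17: 6, 30: 30, 31: 31, 34: 33}
GATE_EXIT_CODE = 2  # exit code the pipeline uses when a blocking gate fails


def resume_command(config: str, failed_gate_step: int) -> list[str]:
    """Build the rerun command for a gate failure, per the recovery table."""
    if failed_gate_step not in RESUME_AFTER_GATE:
        raise ValueError(f"step {failed_gate_step} is not a blocking gate")
    resume_from = RESUME_AFTER_GATE[failed_gate_step]
    return ["dd-agents", "run", config, "--resume-from", str(resume_from)]


print(" ".join(resume_command("deal-config.json", 17)))
```

Note that extraction and coverage failures resume from an earlier step than the gate itself, because the fix (replacing or adding files) changes the inputs those earlier steps consume.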
Agents
The pipeline uses 13 specialized analyzers — 9 domain specialists that process contracts in parallel, plus 4 synthesis/validation components:
| Agent | Type | Phase | Description |
|---|---|---|---|
| Legal | Specialist | 4 | Contract clause analysis (CoC, TfC, IP, privacy, liability) with 18 canonical clause types |
| Finance | Specialist | 4 | Revenue recognition, SaaS metrics, financial risk |
| Commercial | Specialist | 4 | Customer concentration, pricing, renewal risk |
| ProductTech | Specialist | 4 | Technology dependencies, integration complexity |
| Cybersecurity | Specialist | 4 | Security governance, incident history, vulnerability management, compliance certifications |
| HR | Specialist | 4 | Workforce composition, compensation, key talent retention, labor compliance |
| Tax | Specialist | 4 | Income tax compliance, transfer pricing, NOL/tax attributes, deal structure tax |
| Regulatory | Specialist | 4 | License transferability, antitrust, data privacy regulation, AML/sanctions |
| ESG | Specialist | 4 | Environmental contamination, climate/carbon risk, ESG governance, supply chain sustainability |
| Judge | Validation | 6 | Adversarial review of specialist findings (optional) |
| Executive Synthesis | Synthesis | 7 | Go/No-Go calibration, severity recalibration |
| Acquirer Intelligence | Synthesis | 7 | Buyer thesis alignment, synergy validation (when buyer_strategy configured) |
| Red Flag Scanner | Triage | 7 | Quick stoplight triage (when --quick-scan used) |
The system is neurosymbolic: deterministic risk scoring, cross-domain trigger rules, and domain ontology graphs provide structured reasoning scaffolding that guides and constrains LLM analysis. All 9 specialists share a base execution engine (BaseAgentRunner) but are differentiated by substantive domain-specific prompts. The specialist set is extensible — external agents can be added via pip entry-points, and agents can be disabled per-deal via deal-config.json.
Output Directory Structure
All output goes under _dd/forensic-dd/ relative to the data room:
_dd/forensic-dd/
├── index/text/                        # PERMANENT: extracted document text
├── inventory/                         # FRESH: rebuilt each run (subject registry, file counts)
├── entity_resolution_cache.json       # PERMANENT: entity matching cache
└── runs/
    └── 20260307_143000/               # VERSIONED: timestamped per run
        ├── findings/
        │   ├── legal/                 # Per-agent raw findings
        │   ├── finance/
        │   ├── commercial/
        │   ├── product_tech/
        │   └── merged/                # Deduplicated merged findings
        ├── report/
        │   ├── dd_report.html         # Interactive HTML report
        │   └── dd_report.xlsx         # 14-sheet Excel report
        ├── pre_merge_validation.json  # Cross-agent validation report
        ├── audit.json                 # QA validation results
        ├── metadata.json              # Run metadata and costs
        └── dod_results.json           # Definition-of-done check results
Persistence Tiers
- PERMANENT: Never wiped between runs. Extraction cache, entity resolution cache, subject registry. Reused across full and incremental runs.
- VERSIONED: Archived per run in timestamped directories. Findings, reports, audit results. Each run gets its own copy.
- FRESH: Rebuilt each run. Working state, intermediate computations.
Handling Failures
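If the pipeline fails mid-run, note the step number from the error output and resume from it with --resume-from N, for example dd-agents run deal-config.json --resume-from 17.
If a blocking gate fails (exit code 2), fix the underlying issue (e.g., add missing documents to the data room) and resume from the step listed for that gate in the recovery table under Blocking Gates.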
When resuming from steps 3-5, the FRESH persistence tier is automatically wiped to prevent stale inventory data from a prior interrupted run.
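The tier rules and the resume behavior can be summarized in a few lines. This sketch models only what is stated above: the tier assignments come from the output directory layout, and the function names are hypothetical.

```python
from enum import Enum


class Tier(Enum):
    PERMANENT = "permanent"  # never wiped between runs
    VERSIONED = "versioned"  # archived per run in a timestamped directory
    FRESH = "fresh"          # rebuilt each run


# Tier assignments taken from the output directory layout.
TIERS = {
    "index/text": Tier.PERMANENT,
    "entity_resolution_cache.json": Tier.PERMANENT,
    "inventory": Tier.FRESH,
    "runs": Tier.VERSIONED,
}


def wipes_fresh_tier(resume_from: int) -> bool:
    """Resuming from steps 3-5 clears FRESH state left by an interrupted run."""
    return 3 <= resume_from <= 5


def paths_to_wipe(resume_from: int) -> list[str]:
    """List the FRESH paths that a resume from the given step would clear."""
    if not wipes_fresh_tier(resume_from):
        return []
    return sorted(p for p, tier in TIERS.items() if tier is Tier.FRESH)
```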
Advanced: Environment Variable Overrides
For advanced tuning (e.g., non-English documents, OCR-heavy data rooms), several algorithm thresholds can be overridden via environment variables. All use the DD_ prefix.
| Variable | Tunes |
|---|---|
| DD_QUOTE_MATCH_THRESHOLD | Fuzzy match score for citation verification; lower tolerates more OCR noise |
| DD_MIN_QUOTE_CHARS / DD_MAX_QUOTE_CHARS | Bounds on extracted quote length |
| DD_SYNTHESIS_BUDGET_CHARS | Character budget for synthesis-phase quote aggregation |
| DD_FUZZY_THRESHOLD_LONG / DD_FUZZY_THRESHOLD_MEDIUM | Entity-resolution fuzzy thresholds by name length |
| DD_SHORT_NAME_MAX_LEN | Names at or below this length are matched exactly (never fuzzy) |
| DD_TFIDF_THRESHOLD | Cosine-similarity threshold for TF-IDF entity matching |
Current defaults — and the full list of DD_ overrides — live in the code that reads them: src/dd_agents/utils/constants.py and src/dd_agents/search/analyzer.py. Run grep DD_ src/dd_agents/utils/constants.py to see every variable and its default.
Example:
# Loosen citation matching for OCR-heavy data rooms
DD_QUOTE_MATCH_THRESHOLD=65 dd-agents run deal-config.json
# Tighten entity resolution for data rooms with similar company names
DD_FUZZY_THRESHOLD_LONG=92 DD_FUZZY_THRESHOLD_MEDIUM=98 dd-agents run deal-config.json
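On the reading side, an override of this kind is typically resolved by checking the environment and falling back to a built-in default. The sketch below shows that pattern; env_number is a hypothetical helper, and the default of 80.0 is a placeholder, since the real defaults live in constants.py.

```python
import os


def env_number(name: str, default: float, env=None) -> float:
    """Read a numeric DD_ override, falling back to the default when unset."""
    env = os.environ if env is None else env
    raw = env.get(name)
    if raw is None or raw.strip() == "":
        return default
    try:
        return float(raw)
    except ValueError:
        # Fail loudly: a mistyped threshold should not be silently ignored.
        raise ValueError(f"{name} must be numeric, got {raw!r}") from None


# Placeholder default for illustration only; not the tool's actual value.
quote_threshold = env_number("DD_QUOTE_MATCH_THRESHOLD", default=80.0,
                             env={"DD_QUOTE_MATCH_THRESHOLD": "65"})
print(quote_threshold)
```

Passing env explicitly, as in the last lines, makes the lookup easy to test without touching the real process environment.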
Next Steps
- Reading the Report -- Navigate the generated reports
- Deal Configuration -- Adjust config settings
- CLI Reference -- Full command reference