Tool Specifications¶

Each tool is registered via mcp.AddTool with a typed input struct. This document specifies the contract for each tool — the schemas, behavior, caching, and error conditions that the implementation must satisfy.

Note: Output schemas shown below describe the JSON shape returned by each tool. They document the response contract; the actual implementation lives in the corresponding internal/tools/*.go file. Input schemas are auto-generated from the struct jsonschema tags.

Tool Registration Pattern¶

Each tool follows the pattern in internal/tools/registry.go: a typed input struct with jsonschema tags (the SDK auto-generates JSON Schema from these) and a register* function that calls mcp.AddTool. See internal/tools/search.go for a representative example.

Tool 1: `web_search`¶

Purpose¶

Perform a web search and return structured result URLs with metadata.

Input Schema¶

Field	Type	Required	Default	Constraints
`query`	string	yes	—	1-500 chars
`num_results`	int	no	5	1-10
`time_range`	string	no	—	`day`, `week`, `month`, `year`
`safe`	string	no	`medium`	`off`, `medium`, `high`
`language`	string	no	—	ISO 639-1 code
`site`	string	no	—	Domain restriction
`exact_terms`	string	no	—	Exact phrase match
`exclude_terms`	string	no	—	Terms to exclude
`country`	string	no	—	ISO 3166-1 alpha-2
`lens`	string	no	—	Search lens name

Output Schema¶

type SearchOutput struct {
    URLs        []string       `json:"urls"`
    Query       string         `json:"query"`
    ResultCount int            `json:"resultCount"`
    Results     []SearchResult `json:"results"`
}

type SearchResult struct {
    Title       string `json:"title"`
    URL         string `json:"url"`
    Snippet     string `json:"snippet"`
    DisplayLink string `json:"displayLink"`
}

Behavior¶

If SEARCH_ROUTING is set, route through the multi-provider Router (priority-ordered fallback with per-provider circuit breakers).
If lens is specified and has a dedicated cx, route directly to that Google PSE engine.
If lens is specified without cx, inject site: operators and route to the configured provider.
Apply time_range as date restriction parameter.
Return deduplicated URLs and full result objects.

Cache¶

Key: SHA-256 of (provider + query + all params)
TTL: 30 minutes
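
A minimal sketch of how such a key could be derived, assuming parameters arrive as a string map. The function name `searchCacheKey`, the length-prefixed encoding, and the sorted parameter order are illustrative assumptions, not the implementation's actual scheme.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
	"strings"
)

// searchCacheKey (hypothetical name) hashes the provider, query, and every
// parameter into a stable hex key. Parameters are sorted by name so that map
// iteration order cannot change the result.
func searchCacheKey(provider, query string, params map[string]string) string {
	names := make([]string, 0, len(params))
	for name := range params {
		names = append(names, name)
	}
	sort.Strings(names)

	var b strings.Builder
	// Length-prefix each field so ("ab", "c") and ("a", "bc") hash differently.
	writeField := func(s string) { fmt.Fprintf(&b, "%d:%s;", len(s), s) }
	writeField(provider)
	writeField(query)
	for _, name := range names {
		writeField(name)
		writeField(params[name])
	}

	sum := sha256.Sum256([]byte(b.String()))
	return hex.EncodeToString(sum[:])
}

func main() {
	key := searchCacheKey("google", "golang generics", map[string]string{
		"num_results": "5",
		"safe":        "medium",
	})
	fmt.Println(key) // 64 hex characters
}
```

Sorting the parameter names keeps the key deterministic, so two identical requests hit the same cache entry regardless of how the parameters were assembled.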

Error Conditions¶

Invalid API key → return error with setup instructions
Rate limited → circuit breaker opens, return 429
No results → return empty urls array (not an error)

Tool 2: `scrape_page`¶

Purpose¶

Extract content from a URL, supporting web pages, documents, and YouTube videos.

Input Schema¶

Field	Type	Required	Default	Constraints
`url`	string	yes	—	Valid HTTP(S) URL
`mode`	string	no	`full`	`full`, `preview`
`max_length`	int	no	50000	Bytes

Output Schema¶

type ScrapeOutput struct {
    URL             string            `json:"url"`
    Content         string            `json:"content"`
    ContentType     string            `json:"contentType"` // html, markdown, youtube, pdf, docx, pptx
    ContentLength   int               `json:"contentLength"`
    Truncated       bool              `json:"truncated"`
    EstimatedTokens int               `json:"estimatedTokens"`
    SizeCategory    string            `json:"sizeCategory"` // small, medium, large, very_large
    OriginalLength  *int              `json:"originalLength,omitempty"`
    Metadata        *DocumentMetadata `json:"metadata,omitempty"`
    Citation        *Citation         `json:"citation,omitempty"`
}

type DocumentMetadata struct {
    Title     string `json:"title,omitempty"`
    Author    string `json:"author,omitempty"`
    PageCount int    `json:"pageCount,omitempty"`
    CreatedAt string `json:"createdAt,omitempty"`
    FileSize  int64  `json:"fileSize,omitempty"`
}

type Citation struct {
    URL          string           `json:"url"`
    AccessedDate string           `json:"accessedDate"`
    Metadata     CitationMetadata `json:"metadata"`
    Formatted    CitationFormats  `json:"formatted"`
}

type CitationFormats struct {
    APA string `json:"apa"`
    MLA string `json:"mla"`
}

Scraping Strategy (Tiered Fallback)¶

1. SSRF VALIDATION
   └─ Resolve DNS, check all IPs against private ranges
   └─ Block: loopback, link-local, RFC1918, metadata endpoints

2. CONTENT TYPE DETECTION
   ├─ YouTube URL → YouTube extractor (3-strategy fallback):
   │     Strategy 1: Player response captions (primary + alt regex)
   │     Strategy 2: Direct timedtext API (en, en-US, en-GB)
   │     Strategy 3: Video description (shortDescription JSON field)
   ├─ .pdf / application/pdf → PDF parser
   ├─ .docx / application/vnd.openxmlformats-officedocument.wordprocessingml.document → DOCX parser
   └─ .pptx / application/vnd.openxmlformats-officedocument.presentationml.presentation → PPTX parser

3. WEB PAGE EXTRACTION (4-tier, ordered by speed)
   a) Tier 1: MARKDOWN NEGOTIATION (fastest, ~200ms)
      ├─ Send GET with Accept: text/markdown
      ├─ 5-second timeout
      ├─ Verify response is actually markdown (heuristic check)
      └─ If content-type mismatch or too short → next tier

   b) Tier 2: STEALTH HTTP CLIENT (fast, ~300ms)
      ├─ Browser-like TLS fingerprint (TLS 1.2+, HTTP/2)
      ├─ Full Chrome 131 headers (User-Agent, Sec-Ch-Ua, Sec-Fetch-*)
      ├─ Parse with goquery (article > [role=main] > main > body)
      ├─ Remove: script, style, nav, footer, aside, ads, popups
      ├─ SSRF protection via safe dialer when AllowPrivateIPs=false
      └─ If below 100-char threshold → next tier

   c) Tier 3: HTML EXTRACTION via goquery (standard, ~500ms)
      ├─ Fetch page with standard Accept header
      ├─ Parse with goquery
      ├─ Extract: article > main > body (priority order)
      ├─ Remove: script, style, nav, footer, aside, ads
      ├─ Minimum content: 100 bytes, 10% meaningful text ratio
      └─ If below threshold → next tier

   d) Tier 4: HEADLESS BROWSER via go-rod + stealth (slow, ~5s)
      ├─ Browser pool with lazy init + singleton pattern
      ├─ go-rod/stealth plugin (navigator spoofing, WebGL masking)
      ├─ Used for: Known SPA domains, JS-rendered content, bot challenges
      ├─ Wait for: page stability (500ms) OR 30s timeout
      ├─ Extract: rendered DOM via JavaScript evaluation
      └─ Graceful cleanup via Pipeline.Close()

4. CONTENT PROCESSING
   ├─ Sanitize: strip hidden text, zero-width chars, dangerous patterns
   ├─ Truncate: at paragraph/sentence boundary if > max_length
   ├─ Estimate tokens: length / 4
   └─ Extract citation: from <meta> tags, URL, response headers
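
The tier ordering in step 3 can be sketched as a loop over extractors that falls through whenever a tier errors or returns too little content. The `tier` type, the function names, and the stub extractors are hypothetical; the real pipeline also routes known SPA domains straight to the headless browser, which this linear sketch leaves out.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// tier pairs a label with an extraction function (hypothetical shape).
type tier struct {
	name    string
	extract func(url string) (string, error)
}

// minContentLen mirrors the 100-character threshold named in the tiers above.
const minContentLen = 100

var errAllTiersFailed = errors.New("all extraction tiers failed")

// scrapeWithFallback tries each tier in order and returns the first result
// that meets the minimum length, together with the name of the tier used.
func scrapeWithFallback(url string, tiers []tier) (content, usedTier string, err error) {
	for _, t := range tiers {
		out, tierErr := t.extract(url)
		if tierErr != nil || len(out) < minContentLen {
			continue // error or too little content: fall through to the next tier
		}
		return out, t.name, nil
	}
	return "", "", errAllTiersFailed
}

func main() {
	// Stub extractors stand in for the real tiers.
	tiers := []tier{
		{"markdown", func(string) (string, error) { return "", errors.New("not markdown") }},
		{"stealth", func(string) (string, error) { return "too short", nil }},
		{"goquery", func(string) (string, error) { return strings.Repeat("text ", 40), nil }},
	}
	_, used, err := scrapeWithFallback("https://example.com", tiers)
	fmt.Println(used, err) // prints "goquery <nil>"
}
```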

Known SPA Domains (require headless browser)¶

patents.google.com, scholar.google.com, news.google.com
trends.google.com, twitter.com, x.com
linkedin.com, facebook.com, instagram.com
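
One way the domain check could look, as a sketch: it matches the host exactly or as a subdomain of a listed entry. Whether the implementation also matches subdomains is an assumption, and `needsHeadlessBrowser` is a hypothetical name.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// spaDomains is copied from the list above.
var spaDomains = []string{
	"patents.google.com", "scholar.google.com", "news.google.com",
	"trends.google.com", "twitter.com", "x.com",
	"linkedin.com", "facebook.com", "instagram.com",
}

// needsHeadlessBrowser reports whether rawURL's host is a known SPA domain
// or a subdomain of one (the subdomain rule is an assumption).
func needsHeadlessBrowser(rawURL string) bool {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false
	}
	host := strings.ToLower(u.Hostname())
	for _, d := range spaDomains {
		// The "." prefix prevents "notx.com" from matching "x.com".
		if host == d || strings.HasSuffix(host, "."+d) {
			return true
		}
	}
	return false
}

func main() {
	for _, u := range []string{
		"https://x.com/golang",
		"https://www.linkedin.com/in/someone",
		"https://notx.com/",
	} {
		fmt.Println(u, needsHeadlessBrowser(u))
	}
}
```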

Cache¶

Key: SHA-256 of (url + mode)
TTL: 1 hour

Error Conditions¶

SSRF violation → return error, do not fetch
Timeout → return partial content if available, else error
404/5xx → return error with HTTP status
Empty content after extraction → return error
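
The SSRF rule (reject the request if any resolved address is private) can be sketched with the standard `net/netip` package. DNS resolution is omitted here and the addresses are passed in as strings; `isBlockedIP` and `checkResolvedAddrs` are hypothetical names. Cloud metadata endpoints such as 169.254.169.254 fall under the link-local check.

```go
package main

import (
	"fmt"
	"net/netip"
)

// isBlockedIP reports whether ip is loopback, RFC 1918 / unique-local,
// link-local (which covers 169.254.169.254), or unspecified.
func isBlockedIP(ip netip.Addr) bool {
	ip = ip.Unmap() // treat ::ffff:10.0.0.1 like 10.0.0.1
	return ip.IsLoopback() || ip.IsPrivate() || ip.IsLinkLocalUnicast() ||
		ip.IsLinkLocalMulticast() || ip.IsUnspecified()
}

// checkResolvedAddrs returns an error if any address is unparseable or falls
// in a blocked range. Every address must pass, not just the first one.
func checkResolvedAddrs(addrs []string) error {
	for _, s := range addrs {
		ip, err := netip.ParseAddr(s)
		if err != nil {
			return fmt.Errorf("ssrf: cannot parse %q: %w", s, err)
		}
		if isBlockedIP(ip) {
			return fmt.Errorf("ssrf: %s is in a blocked range", ip)
		}
	}
	return nil
}

func main() {
	fmt.Println(checkResolvedAddrs([]string{"93.184.216.34"}))
	fmt.Println(checkResolvedAddrs([]string{"93.184.216.34", "169.254.169.254"}))
}
```

Checking every resolved address matters because a hostname can resolve to a mix of public and private addresses.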

Tool 3: `search_and_scrape`¶

Purpose¶

Combined search + scrape pipeline with quality scoring, deduplication, and source ranking.

Input Schema¶

Field	Type	Required	Default	Constraints
`query`	string	yes	—	1-500 chars
`num_results`	int	no	3	1-10
`include_sources`	bool	no	true	—
`deduplicate`	bool	no	true	—
`max_length_per_source`	int	no	50000	Bytes
`total_max_length`	int	no	300000	Bytes
`filter_by_query`	bool	no	false	—

Output Schema¶

type SearchAndScrapeOutput struct {
    Query           string          `json:"query"`
    Sources         []SourceResult  `json:"sources"`
    CombinedContent string          `json:"combinedContent"`
    Summary         PipelineSummary `json:"summary"`
    SizeMetadata    SizeMetadata    `json:"sizeMetadata"`
}

type SourceResult struct {
    URL         string        `json:"url"`
    Title       string        `json:"title,omitempty"`
    Content     string        `json:"content"`
    ContentType string        `json:"contentType"`
    Scores      *QualityScore `json:"scores,omitempty"`
}

type QualityScore struct {
    Overall        float64 `json:"overall"`
    Relevance      float64 `json:"relevance"`
    Freshness      float64 `json:"freshness"`
    Authority      float64 `json:"authority"`
    ContentQuality float64 `json:"contentQuality"`
}

type PipelineSummary struct {
    URLsSearched     int `json:"urlsSearched"`
    URLsScraped      int `json:"urlsScraped"`
    ProcessingTimeMs int `json:"processingTimeMs"`
}

Behavior¶

Execute search (via configured provider)
Scrape all result URLs in parallel (bounded concurrency: 5)
If deduplicate: paragraph-level hashing (djb2), remove >85% similar blocks
Score and rank sources by quality (weighted: relevance 35%, freshness 20%, authority 25%, content 20%)
If filter_by_query: extract keywords, remove sources below relevance threshold
Combine content, truncate to total_max_length
Return structured result with scores and metadata
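
The weighted ranking step can be sketched as follows, assuming each component score is already normalized to the range 0 to 1 (the spec does not state the scale). The `source` type and function names are hypothetical; the weights are the ones listed above.

```go
package main

import (
	"fmt"
	"sort"
)

// QualityScore mirrors the output struct, without JSON tags.
type QualityScore struct {
	Overall, Relevance, Freshness, Authority, ContentQuality float64
}

// source is a trimmed-down stand-in for SourceResult.
type source struct {
	URL    string
	Scores QualityScore
}

// Weights from the behavior list: they sum to 1.0.
const (
	weightRelevance = 0.35
	weightFreshness = 0.20
	weightAuthority = 0.25
	weightContent   = 0.20
)

// overallScore combines the component scores into one weighted value.
func overallScore(s QualityScore) float64 {
	return weightRelevance*s.Relevance + weightFreshness*s.Freshness +
		weightAuthority*s.Authority + weightContent*s.ContentQuality
}

// rankSources fills in Overall and sorts the slice best-first, keeping the
// original order for ties.
func rankSources(sources []source) {
	for i := range sources {
		sources[i].Scores.Overall = overallScore(sources[i].Scores)
	}
	sort.SliceStable(sources, func(i, j int) bool {
		return sources[i].Scores.Overall > sources[j].Scores.Overall
	})
}

func main() {
	srcs := []source{
		{URL: "https://a.example", Scores: QualityScore{Relevance: 0.4, Freshness: 0.9, Authority: 0.3, ContentQuality: 0.5}},
		{URL: "https://b.example", Scores: QualityScore{Relevance: 0.9, Freshness: 0.5, Authority: 0.8, ContentQuality: 0.7}},
	}
	rankSources(srcs)
	for _, s := range srcs {
		fmt.Printf("%.3f %s\n", s.Scores.Overall, s.URL)
	}
}
```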

Cache¶

NOT cached as a whole (composed of cached sub-operations)
Individual search and scrape results are cached per their own TTLs

Tool 4: `image_search`¶

Input Schema¶

Field	Type	Required	Default	Constraints
`query`	string	yes	—	1-500 chars
`num_results`	int	no	5	1-10
`size`	string	no	—	huge, icon, large, medium, small, xlarge, xxlarge
`type`	string	no	—	clipart, face, lineart, stock, photo, animated
`color_type`	string	no	—	color, gray, mono, trans
`dominant_color`	string	no	—	black, blue, brown, gray, green, orange, pink, purple, red, teal, white, yellow
`file_type`	string	no	—	jpg, gif, png, bmp, svg, webp
`safe`	string	no	`medium`	off, medium, high

Output Schema¶

type ImageSearchOutput struct {
    Images      []ImageResult `json:"images"`
    Query       string        `json:"query"`
    ResultCount int           `json:"resultCount"`
}

type ImageResult struct {
    Title         string `json:"title"`
    Link          string `json:"link"`
    ThumbnailLink string `json:"thumbnailLink,omitempty"`
    DisplayLink   string `json:"displayLink"`
    ContextLink   string `json:"contextLink,omitempty"`
    Width         int    `json:"width,omitempty"`
    Height        int    `json:"height,omitempty"`
    FileSize      string `json:"fileSize,omitempty"`
}

Cache¶

Key: SHA-256 of (query + all filter params)
TTL: 30 minutes

Tool 5: `news_search`¶

Input Schema¶

Field	Type	Required	Default	Constraints
`query`	string	yes	—	1-500 chars
`num_results`	int	no	5	1-10
`freshness`	string	no	`week`	hour, day, week, month, year
`sort_by`	string	no	`relevance`	relevance, date
`news_source`	string	no	—	Domain filter

Output Schema¶

type NewsSearchOutput struct {
    Articles    []NewsArticle `json:"articles"`
    Query       string        `json:"query"`
    ResultCount int           `json:"resultCount"`
}

type NewsArticle struct {
    Title       string `json:"title"`
    URL         string `json:"url"`
    Source      string `json:"source"`
    PublishedAt string `json:"publishedAt,omitempty"`
    Snippet     string `json:"snippet"`
}

Behavior¶

Route to configured search provider's news endpoint.
Apply freshness as date restriction.
If news_source specified, add as domain filter.
Sort by sort_by parameter.
Return deduplicated articles.

Cache¶

TTL: 15 minutes (news is time-sensitive)

Tool 6: `academic_search`¶

Input Schema¶

Field	Type	Required	Default	Constraints
`query`	string	yes	—	1-500 chars
`num_results`	int	no	5	1-10
`year_from`	int	no	—	1900-2030
`year_to`	int	no	—	1900-2030
`source`	string	no	`all`	all, arxiv, pubmed, ieee, nature, springer
`pdf_only`	bool	no	false	—
`sort_by`	string	no	`relevance`	relevance, date

Academic Site Pool (site-restricted via configured provider)¶

arxiv.org, pubmed.ncbi.nlm.nih.gov, scholar.google.com, ieeexplore.ieee.org, dl.acm.org, nature.com, sciencedirect.com, link.springer.com, researchgate.net, plos.org, frontiersin.org, mdpi.com, wiley.com, jstor.org, semanticscholar.org, biorxiv.org, medrxiv.org
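
Site restriction of this kind is commonly expressed by prefixing the query with `site:` operators joined by `OR`. The sketch below assumes that Google-style syntax; the exact query format the configured provider expects may differ, and `siteRestrictedQuery` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"strings"
)

// siteRestrictedQuery prefixes query with a parenthesized OR-group of site:
// operators. With no sites it returns the query unchanged.
func siteRestrictedQuery(query string, sites []string) string {
	if len(sites) == 0 {
		return query
	}
	parts := make([]string, len(sites))
	for i, s := range sites {
		parts[i] = "site:" + s
	}
	return "(" + strings.Join(parts, " OR ") + ") " + query
}

func main() {
	q := siteRestrictedQuery("protein folding", []string{"arxiv.org", "biorxiv.org"})
	fmt.Println(q) // prints "(site:arxiv.org OR site:biorxiv.org) protein folding"
}
```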

Output Schema¶

type AcademicSearchOutput struct {
    Papers       []AcademicPaper `json:"papers"`
    Query        string          `json:"query"`
    TotalResults int             `json:"totalResults"`
    ResultCount  int             `json:"resultCount"`
    Source       string          `json:"source"`
}

type AcademicPaper struct {
    Title    string `json:"title"`
    URL      string `json:"url"`
    Source   string `json:"source"`
    Abstract string `json:"abstract"`
}

Cache¶

TTL: 24 hours (papers don't change)

Tool 7: `patent_search`¶

Input Schema¶

Field	Type	Required	Default	Constraints
`query`	string	yes	—	1-500 chars
`num_results`	int	no	5	1-10
`search_type`	string	no	`prior_art`	prior_art, specific, landscape
`patent_office`	string	no	`all`	all, US, EP, WO, JP, CN, KR
`assignee`	string	no	—	Company name
`inventor`	string	no	—	Inventor name
`cpc_code`	string	no	—	CPC classification (e.g., G06F)
`year_from`	int	no	—	1900-2030
`year_to`	int	no	—	1900-2030

Behavior¶

Generate company name variations (5-8 permutations: no spaces, with suffixes, base names)
Map patent office to prefix codes
Always restrict the search to site:patents.google.com via the configured provider
When patent_office is specified, post-filter results by patent number prefix (US, EP, WO, JP, CN, KR) and drop non-matching patents
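
The prefix post-filter can be sketched as below, assuming patent numbers are available as strings such as `US1234567B2`. The function name and the case-insensitive comparison are illustrative choices.

```go
package main

import (
	"fmt"
	"strings"
)

// filterByOffice keeps only patent numbers that start with the given office
// code. An empty office or "all" disables filtering.
func filterByOffice(patentNumbers []string, office string) []string {
	if office == "" || strings.EqualFold(office, "all") {
		return patentNumbers
	}
	prefix := strings.ToUpper(office)
	var kept []string
	for _, n := range patentNumbers {
		if strings.HasPrefix(strings.ToUpper(strings.TrimSpace(n)), prefix) {
			kept = append(kept, n)
		}
	}
	return kept
}

func main() {
	nums := []string{"US1234567B2", "EP1234567A1", "WO2020123456A1"}
	fmt.Println(filterByOffice(nums, "EP")) // prints "[EP1234567A1]"
}
```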

Cache¶

TTL: 24 hours

Tool 8: `sequential_search`¶

Purpose¶

Multi-step research tracking with session persistence, branching, and knowledge gap identification.

Input Schema¶

Field	Type	Required	Default	Constraints
`searchStep`	string	yes	—	Description of this step
`stepNumber`	int	yes	—	Starts at 1
`totalStepsEstimate`	int	no	—	Estimated total
`nextStepNeeded`	bool	yes	—	Whether more steps follow
`isRevision`	bool	no	false	Revising a previous step
`revisesStep`	int	no	—	Step being revised
`branchFromStep`	int	no	—	Branching point
`branchId`	string	no	—	Branch identifier
`knowledgeGap`	string	no	—	Gap identified

Session Management¶

Sessions created on first call (stepNumber=1)
Session ID: UUID v4, returned in output
TTL: 30 minutes of inactivity (configurable)
Max concurrent sessions: 50 per server instance
Cleanup: goroutine every 5 minutes
Per-tenant isolation: sessions keyed by {tenantID}:{sessionID}
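
A minimal sketch of these session rules: tenant-scoped keys, UUID v4 identifiers, a server-wide cap of 50, and an idle sweep. It uses a mutex-guarded map for brevity rather than `sync.Map`, and all type and function names are hypothetical. In a running server, `sweep` would be called from a goroutine driven by a 5-minute ticker.

```go
package main

import (
	"crypto/rand"
	"errors"
	"fmt"
	"sync"
	"time"
)

const (
	sessionTTL  = 30 * time.Minute // idle timeout from the list above
	maxSessions = 50               // per server instance
)

var errTooManySessions = errors.New("session limit reached")

// store tracks the last activity time of each session.
type store struct {
	mu         sync.Mutex
	lastActive map[string]time.Time // keyed by {tenantID}:{sessionID}
}

func newStore() *store { return &store{lastActive: make(map[string]time.Time)} }

func sessionKey(tenantID, sessionID string) string { return tenantID + ":" + sessionID }

// newUUIDv4 builds a random UUID by setting the version and variant bits.
func newUUIDv4() (string, error) {
	var b [16]byte
	if _, err := rand.Read(b[:]); err != nil {
		return "", err
	}
	b[6] = (b[6] & 0x0f) | 0x40 // version 4
	b[8] = (b[8] & 0x3f) | 0x80 // RFC 4122 variant
	return fmt.Sprintf("%x-%x-%x-%x-%x", b[0:4], b[4:6], b[6:8], b[8:10], b[10:16]), nil
}

// create registers a new session for the tenant and returns its ID.
func (s *store) create(tenantID string) (string, error) {
	id, err := newUUIDv4()
	if err != nil {
		return "", err
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	if len(s.lastActive) >= maxSessions {
		return "", errTooManySessions
	}
	s.lastActive[sessionKey(tenantID, id)] = time.Now()
	return id, nil
}

// sweep drops sessions idle for longer than sessionTTL as of now and returns
// how many were removed.
func (s *store) sweep(now time.Time) int {
	s.mu.Lock()
	defer s.mu.Unlock()
	removed := 0
	for key, last := range s.lastActive {
		if now.Sub(last) > sessionTTL {
			delete(s.lastActive, key)
			removed++
		}
	}
	return removed
}

func main() {
	s := newStore()
	id, err := s.create("tenant-a")
	if err != nil {
		panic(err)
	}
	fmt.Println("session:", sessionKey("tenant-a", id))
	fmt.Println("expired:", s.sweep(time.Now().Add(31*time.Minute)))
}
```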

Output Schema¶

type SequentialSearchOutput struct {
    SessionID          string           `json:"sessionId"`
    Question           string           `json:"question"`
    CurrentStep        int              `json:"currentStep"`
    TotalStepsEstimate int              `json:"totalStepsEstimate"`
    IsComplete         bool             `json:"isComplete"`
    Steps              []ResearchStep   `json:"steps"`
    Sources            []ResearchSource `json:"sources"`
    Gaps               []KnowledgeGap   `json:"gaps"`
    StartedAt          string           `json:"startedAt"`
    CompletedAt        string           `json:"completedAt,omitempty"`
}

State Management¶

In single-instance: sync.Map per tenant
In multi-instance: Redis hash per session (key: session:{tenantID}:{sessionID})
No cache (state tracking, not content)

Cross-Cutting Concerns¶

Timeouts (all configurable via env)¶

Operation	Default	Max
Search API call	10s	30s
Markdown negotiation	5s	10s
HTML scrape (goquery)	15s	30s
Browser scrape (go-rod)	30s	60s
YouTube transcript	30s	60s
Document download	30s	60s
Total tool execution	60s	120s
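
One plausible reading of the table is that the Default applies when no value is configured and the Max caps whatever is configured. The sketch below implements that reading; the environment variable name `SEARCH_TIMEOUT`, the duration-string format, and the function name are assumptions.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// Values from the "Search API call" row of the table.
const (
	searchTimeoutDefault = 10 * time.Second
	searchTimeoutMax     = 30 * time.Second
)

// clampTimeout parses raw (for example "20s") and returns def when raw is
// empty, invalid, or non-positive, and limit when raw exceeds it.
func clampTimeout(raw string, def, limit time.Duration) time.Duration {
	d, err := time.ParseDuration(raw)
	if err != nil || d <= 0 {
		return def
	}
	if d > limit {
		return limit
	}
	return d
}

func main() {
	// SEARCH_TIMEOUT is a hypothetical variable name used for illustration.
	raw := os.Getenv("SEARCH_TIMEOUT")
	fmt.Println(clampTimeout(raw, searchTimeoutDefault, searchTimeoutMax))
}
```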

Content Size Limits¶

Content	Max Size
Single page content	50 KB
Combined research content	300 KB
Document download	10 MB
YouTube transcript	100 KB

Token Estimation¶

Formula: len(content) / 4 (conservative, ~4 chars per token)
Size categories: small (<5K chars), medium (<20K), large (<50K), very_large (>=50K)
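
The formula and thresholds above translate directly into code. This sketch assumes "5K" means 5,000 and measures length with `len`, which counts bytes in Go, so non-ASCII text reports a larger length than its character count. The function names are hypothetical.

```go
package main

import "fmt"

// estimateTokens applies the len(content) / 4 rule.
func estimateTokens(content string) int {
	return len(content) / 4
}

// sizeCategoryFor maps a content length to the category names used in
// ScrapeOutput.SizeCategory.
func sizeCategoryFor(length int) string {
	switch {
	case length < 5_000:
		return "small"
	case length < 20_000:
		return "medium"
	case length < 50_000:
		return "large"
	default:
		return "very_large"
	}
}

func main() {
	content := "The quick brown fox jumps over the lazy dog."
	fmt.Println(estimateTokens(content), sizeCategoryFor(len(content))) // prints "11 small"
}
```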
