Each tool is registered via mcp.AddTool with a typed input struct. This document specifies the contract for each tool: the schemas, behavior, caching, and error conditions that the implementation must satisfy.
Note: Output schemas shown below describe the JSON shape returned by each tool. They document the response contract; see the corresponding internal/tools/*.go file for the actual implementation. Input schemas are auto-generated from the struct jsonschema tags.
Each tool follows the pattern in internal/tools/registry.go: a typed input struct with jsonschema tags (the SDK auto-generates JSON Schema from these) and a register* function that calls mcp.AddTool. See internal/tools/search.go for a representative example.
Purpose
Perform a web search and return structured result URLs with metadata.
| Field | Type | Required | Default | Constraints |
|---|---|---|---|---|
| query | string | yes | — | 1-500 chars |
| num_results | int | no | 5 | 1-10 |
| time_range | string | no | — | day, week, month, year |
| safe | string | no | medium | off, medium, high |
| language | string | no | — | ISO 639-1 code |
| site | string | no | — | Domain restriction |
| exact_terms | string | no | — | Exact phrase match |
| exclude_terms | string | no | — | Terms to exclude |
| country | string | no | — | ISO 3166-1 alpha-2 |
| lens | string | no | — | Search lens name |
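The defaults and constraints in the input table can be applied in a single normalization step. The following is a minimal sketch, not the project's implementation: the `SearchInput` struct name, the `Normalize` method, and the error messages are hypothetical, and only four of the ten fields are shown.

```go
package main

import (
	"errors"
	"fmt"
	"unicode/utf8"
)

// SearchInput is a hypothetical subset of the search tool's input fields.
type SearchInput struct {
	Query      string `json:"query"`
	NumResults int    `json:"num_results,omitempty"`
	TimeRange  string `json:"time_range,omitempty"`
	Safe       string `json:"safe,omitempty"`
}

// Normalize fills in the documented defaults and rejects values that
// violate the documented constraints.
func (in *SearchInput) Normalize() error {
	if n := utf8.RuneCountInString(in.Query); n < 1 || n > 500 {
		return errors.New("query must be 1-500 characters")
	}
	if in.NumResults == 0 {
		in.NumResults = 5 // documented default
	}
	if in.NumResults < 1 || in.NumResults > 10 {
		return errors.New("num_results must be between 1 and 10")
	}
	if in.Safe == "" {
		in.Safe = "medium" // documented default
	}
	switch in.Safe {
	case "off", "medium", "high":
	default:
		return errors.New("safe must be off, medium, or high")
	}
	switch in.TimeRange {
	case "", "day", "week", "month", "year":
	default:
		return errors.New("time_range must be day, week, month, or year")
	}
	return nil
}

func main() {
	in := SearchInput{Query: "model context protocol"}
	if err := in.Normalize(); err != nil {
		fmt.Println("invalid input:", err)
		return
	}
	fmt.Printf("%+v\n", in)
}
```

Because a zero `NumResults` is treated as "not provided", this sketch cannot distinguish an omitted field from an explicit 0; a pointer field would be needed if that distinction matters.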
Output Schema
type SearchOutput struct {
URLs []string `json:"urls"`
Query string `json:"query"`
ResultCount int `json:"resultCount"`
Results []SearchResult `json:"results"`
}
type SearchResult struct {
Title string `json:"title"`
URL string `json:"url"`
Snippet string `json:"snippet"`
DisplayLink string `json:"displayLink"`
}
Behavior
- If SEARCH_ROUTING is set, route through the multi-provider Router (priority-ordered fallback with per-provider circuit breakers).
- If lens is specified and has a dedicated cx, route directly to that Google PSE engine.
- If lens is specified without cx, inject site: operators and route to the configured provider.
- Apply time_range as date restriction parameter.
- Return deduplicated URLs and full result objects.
Cache
- Key: SHA-256 of (provider + query + all params)
- TTL: 30 minutes
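One way to derive such a key is sketched below. The serialization (NUL-separated fields, parameter names sorted) is an assumption made for illustration; the spec only states that the key is a SHA-256 over provider, query, and all parameters. `SearchCacheKey` is a hypothetical helper name.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
	"strings"
)

// SearchCacheKey hashes the provider, the query, and every parameter.
// Parameter names are sorted so map iteration order cannot change the key.
func SearchCacheKey(provider, query string, params map[string]string) string {
	names := make([]string, 0, len(params))
	for name := range params {
		names = append(names, name)
	}
	sort.Strings(names)

	var b strings.Builder
	// A NUL separator keeps ("ab", "c") distinct from ("a", "bc").
	b.WriteString(provider)
	b.WriteByte(0)
	b.WriteString(query)
	for _, name := range names {
		b.WriteByte(0)
		b.WriteString(name)
		b.WriteByte('=')
		b.WriteString(params[name])
	}

	sum := sha256.Sum256([]byte(b.String()))
	return hex.EncodeToString(sum[:])
}

func main() {
	key := SearchCacheKey("google", "golang mcp sdk", map[string]string{
		"num_results": "5",
		"safe":        "medium",
	})
	fmt.Println(key)
}
```

Including the provider in the key means a fallback to a different provider never serves results cached from another one.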
Error Conditions
- Invalid API key → return error with setup instructions
- Rate limited → circuit breaker opens, return 429
- No results → return empty urls array (not an error)
Tool 2: scrape_page
Purpose
Extract content from a URL, supporting web pages, documents, and YouTube videos.
| Field | Type | Required | Default | Constraints |
|---|---|---|---|---|
| url | string | yes | — | Valid HTTP(S) URL |
| mode | string | no | full | full, preview |
| max_length | int | no | 50000 | Bytes |
Output Schema
type ScrapeOutput struct {
URL string `json:"url"`
Content string `json:"content"`
ContentType string `json:"contentType"` // html, markdown, youtube, pdf, docx, pptx
ContentLength int `json:"contentLength"`
Truncated bool `json:"truncated"`
EstimatedTokens int `json:"estimatedTokens"`
SizeCategory string `json:"sizeCategory"` // small, medium, large, very_large
OriginalLength *int `json:"originalLength,omitempty"`
Metadata *DocumentMetadata `json:"metadata,omitempty"`
Citation *Citation `json:"citation,omitempty"`
}
type DocumentMetadata struct {
Title string `json:"title,omitempty"`
Author string `json:"author,omitempty"`
PageCount int `json:"pageCount,omitempty"`
CreatedAt string `json:"createdAt,omitempty"`
FileSize int64 `json:"fileSize,omitempty"`
}
type Citation struct {
URL string `json:"url"`
AccessedDate string `json:"accessedDate"`
Metadata CitationMetadata `json:"metadata"`
Formatted CitationFormats `json:"formatted"`
}
type CitationFormats struct {
APA string `json:"apa"`
MLA string `json:"mla"`
}
Scraping Strategy (Tiered Fallback)
1. SSRF VALIDATION
└─ Resolve DNS, check all IPs against private ranges
└─ Block: loopback, link-local, RFC1918, metadata endpoints
2. CONTENT TYPE DETECTION
├─ YouTube URL → YouTube extractor (3-strategy fallback):
│ Strategy 1: Player response captions (primary + alt regex)
│ Strategy 2: Direct timedtext API (en, en-US, en-GB)
│ Strategy 3: Video description (shortDescription JSON field)
├─ .pdf / application/pdf → PDF parser
├─ .docx / application/vnd.openxmlformats-officedocument.wordprocessingml.document → DOCX parser
└─ .pptx / application/vnd.openxmlformats-officedocument.presentationml.presentation → PPTX parser
3. WEB PAGE EXTRACTION (4-tier, ordered by speed)
a) Tier 1: MARKDOWN NEGOTIATION (fastest, ~200ms)
├─ Send GET with Accept: text/markdown
├─ 5-second timeout
├─ Verify response is actually markdown (heuristic check)
└─ If content-type mismatch or too short → next tier
b) Tier 2: STEALTH HTTP CLIENT (fast, ~300ms)
├─ Browser-like TLS fingerprint (TLS 1.2+, HTTP/2)
├─ Full Chrome 131 headers (User-Agent, Sec-Ch-Ua, Sec-Fetch-*)
├─ Parse with goquery (article > [role=main] > main > body)
├─ Remove: script, style, nav, footer, aside, ads, popups
├─ SSRF protection via safe dialer when AllowPrivateIPs=false
└─ If below 100-char threshold → next tier
c) Tier 3: HTML EXTRACTION via goquery (standard, ~500ms)
├─ Fetch page with standard Accept header
├─ Parse with goquery
├─ Extract: article > main > body (priority order)
├─ Remove: script, style, nav, footer, aside, ads
├─ Minimum content: 100 bytes, 10% meaningful text ratio
└─ If below threshold → next tier
d) Tier 4: HEADLESS BROWSER via go-rod + stealth (slow, ~5s)
├─ Browser pool with lazy init + singleton pattern
├─ go-rod/stealth plugin (navigator spoofing, WebGL masking)
├─ Used for: Known SPA domains, JS-rendered content, bot challenges
├─ Wait for: page stability (500ms) OR 30s timeout
├─ Extract: rendered DOM via JavaScript evaluation
└─ Graceful cleanup via Pipeline.Close()
4. CONTENT PROCESSING
├─ Sanitize: strip hidden text, zero-width chars, dangerous patterns
├─ Truncate: at paragraph/sentence boundary if > max_length
├─ Estimate tokens: length / 4
└─ Extract citation: from <meta> tags, URL, response headers
Known SPA Domains (require headless browser)
- patents.google.com, scholar.google.com, news.google.com
- trends.google.com, twitter.com, x.com
- linkedin.com, facebook.com, instagram.com
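A lookup against this list might look like the sketch below. Matching subdomains (for example `www.linkedin.com`) is an assumption; the list above only names the base domains. `NeedsHeadlessBrowser` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// spaDomains mirrors the known SPA domain list from this document.
var spaDomains = []string{
	"patents.google.com", "scholar.google.com", "news.google.com",
	"trends.google.com", "twitter.com", "x.com",
	"linkedin.com", "facebook.com", "instagram.com",
}

// NeedsHeadlessBrowser reports whether rawURL's host is a known SPA domain
// or a subdomain of one. The dot-prefixed suffix check prevents
// "notx.com" from matching "x.com".
func NeedsHeadlessBrowser(rawURL string) bool {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false
	}
	host := strings.ToLower(u.Hostname())
	for _, d := range spaDomains {
		if host == d || strings.HasSuffix(host, "."+d) {
			return true
		}
	}
	return false
}

func main() {
	for _, u := range []string{
		"https://www.linkedin.com/in/someone",
		"https://patents.google.com/patent/US1234567",
		"https://example.com/article",
	} {
		fmt.Printf("%v  %s\n", NeedsHeadlessBrowser(u), u)
	}
}
```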
Cache
- Key: SHA-256 of (url + mode)
- TTL: 1 hour
Error Conditions
- SSRF violation → return error, do not fetch
- Timeout → return partial content if available, else error
- 404/5xx → return error with HTTP status
- Empty content after extraction → return error
Purpose
Combined search + scrape pipeline with quality scoring, deduplication, and source ranking.
| Field | Type | Required | Default | Constraints |
|---|---|---|---|---|
| query | string | yes | — | 1-500 chars |
| num_results | int | no | 3 | 1-10 |
| include_sources | bool | no | true | — |
| deduplicate | bool | no | true | — |
| max_length_per_source | int | no | 50000 | Bytes |
| total_max_length | int | no | 300000 | Bytes |
| filter_by_query | bool | no | false | — |
Output Schema
type SearchAndScrapeOutput struct {
Query string `json:"query"`
Sources []SourceResult `json:"sources"`
CombinedContent string `json:"combinedContent"`
Summary PipelineSummary `json:"summary"`
SizeMetadata SizeMetadata `json:"sizeMetadata"`
}
type SourceResult struct {
URL string `json:"url"`
Title string `json:"title,omitempty"`
Content string `json:"content"`
ContentType string `json:"contentType"`
Scores *QualityScore `json:"scores,omitempty"`
}
type QualityScore struct {
Overall float64 `json:"overall"`
Relevance float64 `json:"relevance"`
Freshness float64 `json:"freshness"`
Authority float64 `json:"authority"`
ContentQuality float64 `json:"contentQuality"`
}
type PipelineSummary struct {
URLsSearched int `json:"urlsSearched"`
URLsScraped int `json:"urlsScraped"`
ProcessingTimeMs int `json:"processingTimeMs"`
}
Behavior
- Execute search (via configured provider)
- Scrape all result URLs in parallel (bounded concurrency: 5)
- If deduplicate: paragraph-level hashing (djb2), remove >85% similar blocks
- Score and rank sources by quality (weighted: relevance 35%, freshness 20%, authority 25%, content 20%)
- If filter_by_query: extract keywords, remove sources below relevance threshold
- Combine content, truncate to total_max_length
- Return structured result with scores and metadata
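Two pieces of the pipeline above are easy to make concrete: the weighted overall score and the djb2 paragraph hash. The sketch assumes component scores in the range 0 to 1 and uses the common 32-bit additive djb2 variant (`h = h*33 + c`, seeded with 5381); the spec names djb2 but not the variant. How the 85% similarity between blocks is measured is not specified here, so it is left out.

```go
package main

import "fmt"

// QualityScore mirrors the output struct of the same name in this document.
type QualityScore struct {
	Overall        float64
	Relevance      float64
	Freshness      float64
	Authority      float64
	ContentQuality float64
}

// WeightedOverall combines the component scores using the documented
// weights: relevance 35%, freshness 20%, authority 25%, content 20%.
func WeightedOverall(s QualityScore) float64 {
	return 0.35*s.Relevance + 0.20*s.Freshness + 0.25*s.Authority + 0.20*s.ContentQuality
}

// DJB2 computes the 32-bit additive djb2 hash of s (assumed variant).
// Overflow wraps around, which is the intended behavior for this hash.
func DJB2(s string) uint32 {
	var h uint32 = 5381
	for i := 0; i < len(s); i++ {
		h = h*33 + uint32(s[i])
	}
	return h
}

func main() {
	score := QualityScore{Relevance: 0.9, Freshness: 0.5, Authority: 0.8, ContentQuality: 0.7}
	score.Overall = WeightedOverall(score)
	fmt.Printf("overall=%.3f\n", score.Overall)

	// Identical paragraphs hash to the same value and can be dropped.
	seen := map[uint32]bool{}
	for _, p := range []string{"first paragraph", "second paragraph", "first paragraph"} {
		h := DJB2(p)
		fmt.Printf("%q duplicate=%v\n", p, seen[h])
		seen[h] = true
	}
}
```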
Cache
- NOT cached as a whole (composed of cached sub-operations)
- Individual search and scrape results are cached per their own TTLs
| Field | Type | Required | Default | Constraints |
|---|---|---|---|---|
| query | string | yes | — | 1-500 chars |
| num_results | int | no | 5 | 1-10 |
| size | string | no | — | huge, icon, large, medium, small, xlarge, xxlarge |
| type | string | no | — | clipart, face, lineart, stock, photo, animated |
| color_type | string | no | — | color, gray, mono, trans |
| dominant_color | string | no | — | black, blue, brown, gray, green, orange, pink, purple, red, teal, white, yellow |
| file_type | string | no | — | jpg, gif, png, bmp, svg, webp |
| safe | string | no | medium | off, medium, high |
Output Schema
type ImageSearchOutput struct {
Images []ImageResult `json:"images"`
Query string `json:"query"`
ResultCount int `json:"resultCount"`
}
type ImageResult struct {
Title string `json:"title"`
Link string `json:"link"`
ThumbnailLink string `json:"thumbnailLink,omitempty"`
DisplayLink string `json:"displayLink"`
ContextLink string `json:"contextLink,omitempty"`
Width int `json:"width,omitempty"`
Height int `json:"height,omitempty"`
FileSize string `json:"fileSize,omitempty"`
}
Cache
- Key: SHA-256 of (query + all filter params)
- TTL: 30 minutes
| Field | Type | Required | Default | Constraints |
|---|---|---|---|---|
| query | string | yes | — | 1-500 chars |
| num_results | int | no | 5 | 1-10 |
| freshness | string | no | week | hour, day, week, month, year |
| sort_by | string | no | relevance | relevance, date |
| news_source | string | no | — | Domain filter |
Output Schema
type NewsSearchOutput struct {
Articles []NewsArticle `json:"articles"`
Query string `json:"query"`
ResultCount int `json:"resultCount"`
}
type NewsArticle struct {
Title string `json:"title"`
URL string `json:"url"`
Source string `json:"source"`
PublishedAt string `json:"publishedAt,omitempty"`
Snippet string `json:"snippet"`
}
Behavior
- Route to configured search provider's news endpoint.
- Apply freshness as date restriction.
- If news_source specified, add as domain filter.
- Sort by sort_by parameter.
- Return deduplicated articles.
Cache
- TTL: 15 minutes (news is time-sensitive)
| Field | Type | Required | Default | Constraints |
|---|---|---|---|---|
| query | string | yes | — | 1-500 chars |
| num_results | int | no | 5 | 1-10 |
| year_from | int | no | — | 1900-2030 |
| year_to | int | no | — | 1900-2030 |
| source | string | no | all | all, arxiv, pubmed, ieee, nature, springer |
| pdf_only | bool | no | false | — |
| sort_by | string | no | relevance | relevance, date |
arxiv.org, pubmed.ncbi.nlm.nih.gov, scholar.google.com, ieeexplore.ieee.org, dl.acm.org, nature.com, sciencedirect.com, link.springer.com, researchgate.net, plos.org, frontiersin.org, mdpi.com, wiley.com, jstor.org, semanticscholar.org, biorxiv.org, medrxiv.org
Output Schema
type AcademicSearchOutput struct {
Papers []AcademicPaper `json:"papers"`
Query string `json:"query"`
TotalResults int `json:"totalResults"`
ResultCount int `json:"resultCount"`
Source string `json:"source"`
}
type AcademicPaper struct {
Title string `json:"title"`
URL string `json:"url"`
Source string `json:"source"`
Abstract string `json:"abstract"`
}
Cache
- TTL: 24 hours (papers don't change)
| Field | Type | Required | Default | Constraints |
|---|---|---|---|---|
| query | string | yes | — | 1-500 chars |
| num_results | int | no | 5 | 1-10 |
| search_type | string | no | prior_art | prior_art, specific, landscape |
| patent_office | string | no | all | all, US, EP, WO, JP, CN, KR |
| assignee | string | no | — | Company name |
| inventor | string | no | — | Inventor name |
| cpc_code | string | no | — | CPC classification (e.g., G06F) |
| year_from | int | no | — | 1900-2030 |
| year_to | int | no | — | 1900-2030 |
Behavior
- Generate company name variations (5-8 permutations: no spaces, with suffixes, base names)
- Map patent office to prefix codes
- Always uses site:patents.google.com restriction via configured provider
- Post-filter results by patent number prefix (US, EP, WO, JP, CN, KR); non-matching patents are dropped when patent_office is specified
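The post-filter step above can be sketched as a simple prefix check. This assumes patent numbers are available as strings such as `US1234567B2` and that the value `all` disables filtering; `FilterByOffice` is a hypothetical helper name, and how patent numbers are extracted from search results is not covered.

```go
package main

import (
	"fmt"
	"strings"
)

// FilterByOffice keeps only patent numbers whose prefix matches office.
// An empty office or "all" returns the input unchanged.
func FilterByOffice(numbers []string, office string) []string {
	if office == "" || strings.EqualFold(office, "all") {
		return numbers
	}
	prefix := strings.ToUpper(office)
	kept := make([]string, 0, len(numbers))
	for _, n := range numbers {
		if strings.HasPrefix(strings.ToUpper(n), prefix) {
			kept = append(kept, n)
		}
	}
	return kept
}

func main() {
	results := []string{"US1234567B2", "EP1234567A1", "WO2020123456A1"}
	fmt.Println(FilterByOffice(results, "EP"))
	fmt.Println(FilterByOffice(results, "all"))
}
```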
Cache
Purpose
Multi-step research tracking with session persistence, branching, and knowledge gap identification.
| Field | Type | Required | Default | Constraints |
|---|---|---|---|---|
| searchStep | string | yes | — | Description of this step |
| stepNumber | int | yes | — | Starts at 1 |
| totalStepsEstimate | int | no | — | Estimated total |
| nextStepNeeded | bool | yes | — | Whether more steps follow |
| isRevision | bool | no | false | Revising a previous step |
| revisesStep | int | no | — | Step being revised |
| branchFromStep | int | no | — | Branching point |
| branchId | string | no | — | Branch identifier |
| knowledgeGap | string | no | — | Gap identified |
Session Management
- Sessions created on first call (stepNumber=1)
- Session ID: UUID v4, returned in output
- TTL: 30 minutes of inactivity (configurable)
- Max concurrent sessions: 50 per server instance
- Cleanup: goroutine every 5 minutes
- Per-tenant isolation: sessions keyed by {tenantID}:{sessionID}
Output Schema
type SequentialSearchOutput struct {
SessionID string `json:"sessionId"`
Question string `json:"question"`
CurrentStep int `json:"currentStep"`
TotalStepsEstimate int `json:"totalStepsEstimate"`
IsComplete bool `json:"isComplete"`
Steps []ResearchStep `json:"steps"`
Sources []ResearchSource `json:"sources"`
Gaps []KnowledgeGap `json:"gaps"`
StartedAt string `json:"startedAt"`
CompletedAt string `json:"completedAt,omitempty"`
}
State Management
- In single-instance: sync.Map per tenant
- In multi-instance: Redis hash per session (key: session:{tenantID}:{sessionID})
- No cache (state tracking, not content)
Cross-Cutting Concerns
Timeouts (all configurable via env)
| Operation | Default | Max |
|---|---|---|
| Search API call | 10s | 30s |
| Markdown negotiation | 5s | 10s |
| HTML scrape (goquery) | 15s | 30s |
| Browser scrape (go-rod) | 30s | 60s |
| YouTube transcript | 30s | 60s |
| Document download | 30s | 60s |
| Total tool execution | 60s | 120s |
Content Size Limits
| Content | Max Size |
|---|---|
| Single page content | 50 KB |
| Combined research content | 300 KB |
| Document download | 10 MB |
| YouTube transcript | 100 KB |
Token Estimation
- Formula: len(content) / 4 (conservative, ~4 chars per token)
- Size categories: small (<5K chars), medium (<20K), large (<50K), very_large (>=50K)