Problem
Raw data sources are too verbose and noisy for effective LLM consumption. Full representations include invisible elements, implementation details, and irrelevant information that bloat context and confuse reasoning.
Research on boilerplate detection shows that 40-80% of web page content is typically navigation, footers, ads, and other boilerplate that should be filtered before semantic processing (Kohlschütter et al., WSDM 2010).
This creates several problems:
- Token explosion: Raw data exceeds context limits or becomes prohibitively expensive
- Poor signal-to-noise: LLM wastes reasoning capacity on irrelevant details
- Slower inference: More tokens = slower generation and higher costs
- Confused reasoning: Noise leads to hallucinations or wrong conclusions
The issue appears across domains:
- Web scraping: Full HTML DOM includes scripts, styles, tracking iframes
- API responses: JSON with nested metadata, internal fields, debug info
- Document processing: Headers, footers, navigation, boilerplate text
- Code analysis: Comments, whitespace, boilerplate code
Solution
Extract only the semantic, interactive, or relevant elements from raw data. Filter out noise and provide the LLM with a clean representation that captures what matters for reasoning.
How to use it
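As a minimal sketch of the pattern, the snippet below extracts visible text from raw HTML while skipping subtrees that are almost always noise. It uses only Python's stdlib `html.parser`; the tag whitelist and the `SemanticExtractor` / `extract_semantic_text` names are illustrative choices, not a reference implementation.

```python
from html.parser import HTMLParser

# Tags whose subtrees are almost always noise for LLM consumption
# (an assumption; tune this set per domain).
NOISE_TAGS = {"script", "style", "noscript", "nav", "footer", "iframe", "template"}

class SemanticExtractor(HTMLParser):
    """Collect visible text, skipping content inside noise subtrees."""

    def __init__(self):
        super().__init__()
        self.noise_depth = 0   # how many noise tags we are currently inside
        self.chunks = []       # kept text fragments

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.noise_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.noise_depth > 0:
            self.noise_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside every noise subtree.
        if self.noise_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def extract_semantic_text(html: str) -> str:
    parser = SemanticExtractor()
    parser.feed(html)
    return "\n".join(parser.chunks)
```

On a page where scripts, styles, navigation, and footer dominate the markup, this keeps only the body text, which is exactly the token reduction the pattern targets.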
Trade-offs
Pros:
- Dramatic token reduction: 10-100x smaller context
- Better LLM reasoning: Focus on signal, not noise
- Lower costs: Fewer tokens = cheaper
- Faster inference: Smaller context = faster generation
- Higher reliability: Less confusion and hallucination
Cons:
- Filter complexity: Need to build and maintain filter logic
- Information loss: May remove context that matters
- Domain-specific: Filters need to be tailored per use case
- Mapping overhead: Need to track filtered-to-original references
- Potential bugs: Filter might remove important elements
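The mapping-overhead point can be sketched concretely: give each kept element a compact ID in the filtered view and record its original selector, so the LLM can reference `e0` and the agent can resolve it back to the real element. The element schema and function name here are hypothetical, assumed for illustration.

```python
def build_filtered_view(elements):
    """elements: list of dicts like
    {"selector": str, "role": str, "text": str, "visible": bool}.

    Returns (view_lines, id_to_selector): the LLM sees compact IDs,
    and the agent resolves them back to real selectors when acting.
    """
    view, id_to_selector = [], {}
    for el in elements:
        if not el["visible"]:
            continue  # filter invisible noise
        eid = f"e{len(id_to_selector)}"   # stable ID within this view
        id_to_selector[eid] = el["selector"]
        view.append(f'[{eid}] {el["role"]}: {el["text"]}')
    return view, id_to_selector
```

If the filter and the mapping are built in the same pass, they cannot drift apart, which removes one class of "LLM referenced an element the agent can't find" bugs.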
Edge cases to handle:
- Hidden but content-rich: Accordions, tab panels, and collapsed content may be dropped by accessibility-tree or visibility-based filters even though they carry real content
- Dynamic content: AJAX-loaded content, infinite scroll, and lazy-loaded elements require wait/scroll strategies
- Canvas/SVG: Charts and custom-rendered content may need OCR or a fallback to the underlying HTML/data
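One generic wait strategy for dynamic content, independent of any browser library: poll a content snapshot until two consecutive reads match, as a heuristic that lazy loading has settled. `wait_until_stable` and its parameters are an assumed sketch, not a standard API.

```python
import time

def wait_until_stable(snapshot_fn, interval=0.5, timeout=10.0):
    """Poll snapshot_fn until two consecutive snapshots are equal,
    i.e. the page (or API response) has stopped changing.

    Raises TimeoutError if content never stabilizes within `timeout`.
    """
    deadline = time.monotonic() + timeout
    previous = snapshot_fn()
    while time.monotonic() < deadline:
        time.sleep(interval)
        current = snapshot_fn()
        if current == previous:
            return current   # two identical reads in a row: assume settled
        previous = current
    raise TimeoutError("content did not stabilize within timeout")
```

In a real scraper, `snapshot_fn` might serialize the accessibility tree or the extracted text; combining this with a scroll action between polls covers infinite-scroll pages.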
Mitigation strategies:
- Start conservative: Filter obvious noise, include borderline cases
- Add filter bypass for debugging
- Monitor LLM performance: Expand filter if accuracy drops
- Version filters alongside data schemas
- Provide hints to LLM: "Context has been filtered for relevance"
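Two of the mitigations above, the debug bypass and the filtering hint, can be combined in a thin wrapper around the filter. This is a minimal sketch; `build_context` and its signature are hypothetical names, not an established API.

```python
def build_context(raw_text: str, filter_fn, bypass: bool = False) -> str:
    """Apply filter_fn unless bypassed (for debugging), and prepend a
    hint so the model knows it is not seeing the full source."""
    if bypass:
        return raw_text  # debug path: inspect exactly what the filter saw
    filtered = filter_fn(raw_text)
    return "Note: context has been filtered for relevance.\n\n" + filtered
```

When accuracy drops, flipping `bypass=True` makes it easy to compare the model's behavior on raw versus filtered input before touching the filter logic itself.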
Security note: Semantic extraction can also harden agents. Extracting a safe intermediate representation and discarding the untrusted raw content gives resistance to prompt injection (see: Context-Minimization Pattern).
References
- HyperAgent GitHub Repository - Original accessibility tree implementation
- Kohlschütter et al., "Boilerplate Detection using Shallow Text Features", WSDM 2010 - Foundational research showing 40-80% of web content is boilerplate
- Beurer-Kellner et al., "Design Patterns for Securing LLM Agents", arXiv 2025 - Context-Minimization Pattern (security framework)
- WAI-ARIA Accessibility Tree - Browser accessibility API
- Related patterns: Context Window Anxiety Management, Curated Context Windows