Problem
Raw data sources are too verbose and noisy for effective LLM consumption. Full representations include invisible elements, implementation details, and irrelevant information that bloat context and confuse reasoning.
Research on boilerplate detection shows that 40-80% of web page content is typically navigation, footers, ads, and other boilerplate that should be filtered before semantic processing (Kohlschütter et al., WSDM 2010).
This creates several problems:
- Token explosion: Raw data exceeds context limits or becomes prohibitively expensive
- Poor signal-to-noise: LLM wastes reasoning capacity on irrelevant details
- Slower inference: More tokens = slower generation and higher costs
- Confused reasoning: Noise leads to hallucinations or wrong conclusions
The issue appears across domains:
- Web scraping: Full HTML DOM includes scripts, styles, tracking iframes
- API responses: JSON with nested metadata, internal fields, debug info
- Document processing: Headers, footers, navigation, boilerplate text
- Code analysis: Comments, whitespace, boilerplate code
Solution
Extract only the semantic, interactive, or relevant elements from raw data. Filter out noise and provide the LLM with a clean representation that captures what matters for reasoning.
How to use it
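As a minimal sketch of the pattern, the snippet below extracts visible text from raw HTML while skipping subtrees that are almost always noise. It uses only Python's stdlib `html.parser`; the tag whitelist and the `SemanticExtractor` / `extract_semantic_text` names are illustrative choices, not a reference implementation.

```python
from html.parser import HTMLParser

# Tags whose subtrees are almost always noise for LLM consumption
# (an assumption; tune this set per domain).
NOISE_TAGS = {"script", "style", "noscript", "nav", "footer", "iframe", "template"}

class SemanticExtractor(HTMLParser):
    """Collect visible text, skipping content inside noise subtrees."""

    def __init__(self):
        super().__init__()
        self.noise_depth = 0   # how many noise tags we are currently inside
        self.chunks = []       # kept text fragments

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.noise_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.noise_depth > 0:
            self.noise_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside every noise subtree.
        if self.noise_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def extract_semantic_text(html: str) -> str:
    parser = SemanticExtractor()
    parser.feed(html)
    return "\n".join(parser.chunks)
```

On a page where scripts, styles, navigation, and footer dominate the markup, this keeps only the body text, which is exactly the token reduction the pattern targets.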
Trade-offs
Pros:
- Dramatic token reduction: 10-100x smaller context
- Better LLM reasoning: Focus on signal, not noise
- Lower costs: Fewer tokens = cheaper
- Faster inference: Smaller context = faster generation
- Higher reliability: Less confusion and hallucination
Cons:
- Filter complexity: Need to build and maintain filter logic
- Information loss: May remove context that matters
- Domain-specific: Filters need to be tailored per use case
- Mapping overhead: Need to track filtered-to-original references
- Potential bugs: Filter might remove important elements
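The mapping-overhead point can be sketched concretely: give each kept element a compact ID in the filtered view and record its original selector, so the LLM can reference `e0` and the agent can resolve it back to the real element. The element schema and function name here are hypothetical, assumed for illustration.

```python
def build_filtered_view(elements):
    """elements: list of dicts like
    {"selector": str, "role": str, "text": str, "visible": bool}.

    Returns (view_lines, id_to_selector): the LLM sees compact IDs,
    and the agent resolves them back to real selectors when acting.
    """
    view, id_to_selector = [], {}
    for el in elements:
        if not el["visible"]:
            continue  # filter invisible noise
        eid = f"e{len(id_to_selector)}"   # stable ID within this view
        id_to_selector[eid] = el["selector"]
        view.append(f'[{eid}] {el["role"]}: {el["text"]}')
    return view, id_to_selector
```

If the filter and the mapping are built in the same pass, they cannot drift apart, which removes one class of "LLM referenced an element the agent can't find" bugs.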
Edge cases to handle:
- Hidden but content-rich: Accordions, tab panels, and collapsed content may be dropped by accessibility-tree or visibility-based filters even though they carry real content
- Dynamic content: AJAX-loaded content, infinite scroll, and lazy-loaded elements require wait/scroll strategies
- Canvas/SVG: Charts and custom-rendered content may need OCR or a fallback to the underlying HTML/data
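One generic wait strategy for dynamic content, independent of any browser library: poll a content snapshot until two consecutive reads match, as a heuristic that lazy loading has settled. `wait_until_stable` and its parameters are an assumed sketch, not a standard API.

```python
import time

def wait_until_stable(snapshot_fn, interval=0.5, timeout=10.0):
    """Poll snapshot_fn until two consecutive snapshots are equal,
    i.e. the page (or API response) has stopped changing.

    Raises TimeoutError if content never stabilizes within `timeout`.
    """
    deadline = time.monotonic() + timeout
    previous = snapshot_fn()
    while time.monotonic() < deadline:
        time.sleep(interval)
        current = snapshot_fn()
        if current == previous:
            return current   # two identical reads in a row: assume settled
        previous = current
    raise TimeoutError("content did not stabilize within timeout")
```

In a real scraper, `snapshot_fn` might serialize the accessibility tree or the extracted text; combining this with a scroll action between polls covers infinite-scroll pages.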
Mitigation strategies:
- Start conservative: Filter obvious noise, include borderline cases
- Add filter bypass for debugging
- Monitor LLM performance: Expand filter if accuracy drops
- Version filters alongside data schemas
- Provide hints to LLM: "Context has been filtered for relevance"
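Two of the mitigations above, the debug bypass and the filtering hint, can be combined in a thin wrapper around the filter. This is a minimal sketch; `build_context` and its signature are hypothetical names, not an established API.

```python
def build_context(raw_text: str, filter_fn, bypass: bool = False) -> str:
    """Apply filter_fn unless bypassed (for debugging), and prepend a
    hint so the model knows it is not seeing the full source."""
    if bypass:
        return raw_text  # debug path: inspect exactly what the filter saw
    filtered = filter_fn(raw_text)
    return "Note: context has been filtered for relevance.\n\n" + filtered
```

When accuracy drops, flipping `bypass=True` makes it easy to compare the model's behavior on raw versus filtered input before touching the filter logic itself.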
Security note: Semantic extraction can also harden agents. Extracting a safe intermediate representation and discarding the untrusted raw content gives resistance to prompt injection (see: Context-Minimization Pattern).
References
- HyperAgent GitHub Repository - Original accessibility tree implementation
- Kohlschütter et al., "Boilerplate Detection using Shallow Text Features", WSDM 2010 - Foundational research showing 40-80% of web content is boilerplate
- Beurer-Kellner et al., "Design Patterns for Securing LLM Agents", arXiv 2025 - Context-Minimization Pattern (security framework)
- WAI-ARIA Accessibility Tree - Browser accessibility API
- Related patterns: Context Window Anxiety Management, Curated Context Windows