Category: Context & Memory · Status: Emerging

Semantic Context Filtering Pattern

By Nikola Balic (@nibzard)
Published in *Awesome Agentic Patterns* (2026): https://agentic-patterns.com/patterns/semantic-context-filtering

01

Problem

Raw data sources are too verbose and noisy for effective LLM consumption. Full representations include invisible elements, implementation details, and irrelevant information that bloat context and confuse reasoning.

Research on boilerplate detection shows that 40-80% of web page content is typically navigation, footers, ads, and other boilerplate that should be filtered before semantic processing (Kohlschütter et al., SIGIR 2010).

This creates several problems:

  • Token explosion: Raw data exceeds context limits or becomes prohibitively expensive
  • Poor signal-to-noise: LLM wastes reasoning capacity on irrelevant details
  • Slower inference: More tokens = slower generation and higher costs
  • Confused reasoning: Noise leads to hallucinations or wrong conclusions

The issue appears across domains:

  • Web scraping: Full HTML DOM includes scripts, styles, tracking iframes
  • API responses: JSON with nested metadata, internal fields, debug info
  • Document processing: Headers, footers, navigation, boilerplate text
  • Code analysis: Comments, whitespace, boilerplate code
02

Solution

Extract only the semantic, interactive, or relevant elements from raw data. Filter out noise and provide the LLM with a clean representation that captures what matters for reasoning.
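For the API-response case, a minimal sketch of the idea (the payload shape, field names, and the underscore-prefix convention are illustrative, not a prescribed schema): strip internal metadata and debug fields before the JSON ever reaches the model.

```python
import json

# Hypothetical raw API response: nested metadata and debug fields
# bloat the payload far beyond what the LLM needs.
raw_response = json.loads("""
{
  "data": {"title": "Q3 Report", "status": "published", "views": 1042},
  "_meta": {"request_id": "abc-123", "cache": {"hit": true, "ttl": 300}},
  "_debug": {"query_ms": 41, "shard": "us-east-1b"},
  "_links": {"self": "/api/v2/docs/9", "next": "/api/v2/docs/10"}
}
""")

def semantic_filter(obj):
    """Recursively drop keys that look like internal metadata
    (here: any key starting with '_'). Tailor per data source."""
    if isinstance(obj, dict):
        return {k: semantic_filter(v) for k, v in obj.items()
                if not k.startswith("_")}
    if isinstance(obj, list):
        return [semantic_filter(v) for v in obj]
    return obj

clean = semantic_filter(raw_response)
# Only the task-relevant "data" subtree survives.
print(json.dumps(clean))
```

The filter is deliberately dumb (a key-name convention); in practice the rule would be derived from the API schema, per the "domain-specific" trade-off below.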

03

How to use it
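
For the web-scraping case, a stdlib-only sketch (the tag blocklist and sample markup are illustrative): walk the DOM, skip boilerplate containers such as `script`, `nav`, and `footer`, and keep only visible text for the LLM.

```python
from html.parser import HTMLParser

# Containers that are almost always noise for LLM consumption;
# tune this set per site or domain.
NOISE_TAGS = {"script", "style", "nav", "footer", "iframe", "aside"}

class SemanticExtractor(HTMLParser):
    """Collect visible text, skipping anything nested in a noise tag."""
    def __init__(self):
        super().__init__()
        self.depth = 0       # current nesting depth inside noise tags
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())

html = """<html><head><style>body{color:red}</style></head>
<body><nav><a href="/">Home</a></nav>
<main><h1>Quarterly Report</h1><p>Revenue grew 12%.</p></main>
<script>track()</script><footer>2026 footer</footer></body></html>"""

p = SemanticExtractor()
p.feed(html)
print(" ".join(p.chunks))   # → Quarterly Report Revenue grew 12%.
```

A few dozen characters of signal survive from a payload that was mostly styling, navigation, and tracking code; the same skeleton extends to keeping interactive elements (links, buttons, form fields) alongside the text.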

04

Trade-offs

Pros:

  • Dramatic token reduction: 10-100x smaller context
  • Better LLM reasoning: Focus on signal, not noise
  • Lower costs: Fewer tokens = cheaper
  • Faster inference: Smaller context = faster generation
  • Higher reliability: Less confusion and hallucination

Cons:

  • Filter complexity: Need to build and maintain filter logic
  • Information loss: May remove context that matters
  • Domain-specific: Filters need to be tailored per use case
  • Mapping overhead: Need to track filtered-to-original references
  • Potential bugs: Filter might remove important elements

Edge cases to handle:

  • Hidden but content-rich: Accordions, tab panels, and collapsed content may be dropped by accessibility-tree extraction
  • Dynamic content: AJAX-loaded content, infinite scroll, and lazy-loaded elements require wait/scroll strategies
  • Canvas/SVG: Charts and custom-rendered content may need OCR or fallback HTML

Mitigation strategies:

  • Start conservative: Filter obvious noise, include borderline cases
  • Add filter bypass for debugging
  • Monitor LLM performance: Expand filter if accuracy drops
  • Version filters alongside data schemas
  • Provide hints to LLM: "Context has been filtered for relevance"
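
The bypass and hint strategies above can be sketched together (the function names and hint wording are illustrative, not a fixed API):

```python
def build_context(raw_text, filter_fn, bypass=False):
    """Apply the filter unless debugging, and prefix a hint so the
    LLM knows the context was pruned for relevance."""
    if bypass:
        return raw_text          # debug path: inspect the unfiltered input
    return "[Context filtered for relevance]\n" + filter_fn(raw_text)

# Toy filter: drop lines that look like internal headers.
noise_filter = lambda t: "\n".join(
    line for line in t.splitlines() if not line.startswith("#"))

doc = "# internal header\nKey finding: latency doubled."
print(build_context(doc, noise_filter))
# → [Context filtered for relevance]
#   Key finding: latency doubled.
```

Keeping the bypass as a one-flag switch makes it cheap to diff filtered vs. unfiltered runs when monitoring for accuracy drops.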

Security note: Semantic extraction can also harden agents. By extracting a safe intermediate representation and discarding the untrusted raw content, the agent gains resistance to prompt injection (see: Context-Minimization Pattern).

06

References