--- title: 'Semantic Context Filtering Pattern' status: emerging authors: ['Nikola Balic (@nibzard)'] based_on: ['Hyperbrowser Team (@hyperbrowserai)'] category: 'Context & Memory' source: 'https://github.com/hyperbrowserai/HyperAgent' tags: [context-filtering, token-optimization, semantic-extraction, noise-reduction] --- ## Problem Raw data sources are too verbose and noisy for effective LLM consumption. Full representations include invisible elements, implementation details, and irrelevant information that bloat context and confuse reasoning. This creates several problems: - **Token explosion**: Raw data exceeds context limits or becomes prohibitively expensive - **Poor signal-to-noise**: LLM wastes reasoning capacity on irrelevant details - **Slower inference**: More tokens = slower generation and higher costs - **Confused reasoning**: Noise leads to hallucinations or wrong conclusions The issue appears across domains: - **Web scraping**: Full HTML DOM includes scripts, styles, tracking iframes - **API responses**: JSON with nested metadata, internal fields, debug info - **Document processing**: Headers, footers, navigation, boilerplate text - **Code analysis**: Comments, whitespace, boilerplate code ## Solution Extract only the semantic, interactive, or relevant elements from raw data. Filter out noise and provide the LLM with a clean representation that captures what matters for reasoning. ### Core Principle **Don't send raw data to the LLM. Send semantic abstractions.** ### Example 1: Browser Accessibility Tree Instead of full HTML DOM: ```html

``` Extract the accessibility tree (100-200 tokens): ```typescript { "interactiveElements": [ { "role": "link", "name": "Home", "xpath": "/html/body/nav/a[1]" }, { "role": "link", "name": "About", "xpath": "/html/body/nav/a[2]" }, { "role": "button", "name": "Login", "id": "login-button", "xpath": "/html/body/main/button" }, { "role": "textbox", "name": "Email", "id": "email", "xpath": "/html/body/main/input" } ] } ``` **Implementation:** ```typescript // Use browser's built-in accessibility tree const tree = await page.accessibility.snapshot({ interestingOnly: true, // Only interactive elements }); // Automatically filters: // - Elements with aria-hidden="true" // - Elements with display:none // - Ad/tracking iframes by domain // - Non-semantic divs and spans ``` ### Example 2: API Response Filtering Raw API responses often include internal metadata: ```json // Raw API response (2,000 tokens) { "data": { "users": [ { "id": "123", "name": "Alice", "email": "alice@example.com", "internalFlags": ["vip", "beta_tester"], "metadata": { "created_at": "2024-01-01", "updated_at": "2024-01-15", "version": 42 } } ] }, "_internal": { "requestId": "req-abc123", "latency": 45, "cache": "HIT", "debug": [] }, "_links": { "self": "/users", "next": "/users?page=2" } } ``` Filter to semantic fields only: ```typescript function filterAPIResponse(response: any, schema: FieldSchema): any { const filtered = {}; for (const field of schema.relevantFields) { if (response.data?.[field]) { filtered[field] = response.data[field]; } } return filtered; } // Result (200 tokens): { "users": [ { "name": "Alice", "email": "alice@example.com" } ] } ``` ### Example 3: Document Section Extraction Full documents include boilerplate: ``` FULL DOCUMENT: ==================== COMPANY CONFIDENTIAL [Header repeated on every page] Copyright 2024 Acme Corp. All rights reserved. [Legal disclaimer spanning 3 pages] EXECUTIVE SUMMARY ==================== The Q4 revenue increased by 15%... [Navigation menu] - Table of Contents - Index - Glossary ACTUAL CONTENT STARTS HERE ==================== Analysis of market trends shows... [50 more pages] APPENDIX ======== [Technical specifications] [Legal disclaimers] [Contact information - repeated] ``` Extract semantic sections: ```python def extract_semantic_content(document: str) -> dict: # Skip headers, footers, navigation sections = { "executive_summary": extract_section(document, "EXECUTIVE SUMMARY"), "analysis": extract_section(document, "ANALYSIS"), "conclusions": extract_section(document, "CONCLUSIONS"), } # Remove boilerplate for section in sections.values(): section = remove_legal_disclaimers(section) section = remove_navigation(section) return sections # Result: Only the actual content, ~20% of original size ``` ### Architecture ```mermaid graph LR A[Raw Data Source] --> B[Semantic Filter] B --> C[Clean Context] subgraph "Filter Layer" B --> D[Interactive Elements] B --> E[Relevant Fields] B --> F[Semantic Sections] end C --> G[LLM Processing] A -. "Noise removed" .-> B B -. "10-100x reduction" .-> C style C fill:#9f9,stroke:#333 style A fill:#f99,stroke:#333 ``` ### Key Benefits | Aspect | Raw Data | Semantic Filter | Improvement | | ------------- | ----------------- | ----------------- | ----------------------- | | Token count | 10,000 | 100-1,000 | **10-100x reduction** | | LLM reasoning | Confused by noise | Focused on signal | **Better decisions** | | Cost | High | Low | **10-100x cheaper** | | Latency | Slow | Fast | **2-5x faster** | | Accuracy | Prone to errors | More reliable | **Higher success rate** | ## How to use it ### 1. Identify Semantic Elements For your domain, determine what actually matters: ```typescript // Web scraping: interactive elements only const semanticElements = ['button', 'link', 'textbox', 'checkbox', 'radio', 'combobox', 'slider']; // API responses: business data only const relevantFields = ['name', 'email', 'status', 'amount']; // Documents: content sections only const contentSections = ['executive_summary', 'analysis', 'conclusions']; ``` ### 2. Build Filter Layer ```typescript class SemanticFilter { filter(data: any, domain: string): any { switch (domain) { case 'web': return this.filterAccessibilityTree(data); case 'api': return this.filterAPIResponse(data); case 'document': return this.filterDocumentSections(data); } } private filterAccessibilityTree(dom: any): any { // Only interactive elements with ARIA roles return dom .filter((el) => el.interactive) .filter((el) => !el.isHidden) .filter((el) => !this.isAdIframe(el)) .map((el) => ({ role: el.role, name: el.name, xpath: el.xpath, })); } } ``` ### 3. Apply Before LLM Call ```typescript // Wrong: Send raw data const response = await llm.generate({ prompt: `Analyze this page: ${rawHTML}`, }); // Right: Filter first const filtered = semanticFilter.filter(rawHTML, 'web'); const response = await llm.generate({ prompt: `Analyze this page: ${JSON.stringify(filtered)}`, }); ``` ### 4. Maintain Reference Mapping Keep track of filtered-to-original mappings for execution: ```typescript interface FilteredElement { semanticId: string; // "login-button" originalRef: string; // "frameIndex-backendNodeId" xpath: string; // "/html/body/main/button" } // Filtered context uses semantic IDs const filteredContext = [{ id: 'btn-1', name: 'Login', role: 'button' }]; // Execution layer maps back to original references const element = mapToOriginal(filteredContext[0].id); await page.click(element.xpath); ``` ## Trade-offs **Pros:** - **Dramatic token reduction**: 10-100x smaller context - **Better LLM reasoning**: Focus on signal, not noise - **Lower costs**: Fewer tokens = cheaper - **Faster inference**: Smaller context = faster generation - **Higher reliability**: Less confusion and hallucination **Cons:** - **Filter complexity**: Need to build and maintain filter logic - **Information loss**: May remove context that matters - **Domain-specific**: Filters need to be tailored per use case - **Mapping overhead**: Need to track filtered-to-original references - **Potential bugs**: Filter might remove important elements **Mitigation strategies:** - Start conservative: Filter obvious noise, include borderline cases - Add filter bypass for debugging - Monitor LLM performance: Expand filter if accuracy drops - Version filters alongside data schemas - Provide hints to LLM: "Context has been filtered for relevance" ## References - [HyperAgent GitHub Repository](https://github.com/hyperbrowserai/HyperAgent) - Original accessibility tree implementation - [WAI-ARIA Accessibility Tree](https://www.w3.org/TR/core-aam-1.1/) - Browser accessibility API - Related patterns: [Context Window Anxiety Management](context-window-anxiety-management.md), [Curated Context Windows](curated-context-windows.md)