Traditional web scraping relies on brittle CSS selectors and XPath expressions that break the moment a site updates its markup. LLM Scraper takes a fundamentally different approach: it uses large language models to understand page content semantically and extract exactly the data you need as structured JSON.
Built by mishushakov, this open-source tool bridges the gap between unstructured HTML and structured data pipelines. Instead of writing and maintaining selectors, you define a typed schema of what you want to extract, and the LLM handles the rest.
How LLM Scraper Works
LLM Scraper supports multiple LLM providers including OpenAI, Anthropic, and local models via Ollama. You provide a URL or HTML content along with a JSON schema describing the data fields you need, and the tool returns a structured JSON object matching your schema.
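For instance, a schema for extracting product data might look like the following JSON Schema. This is an illustrative example only; the field names are assumptions, and the library also lets you define schemas in code:

```json
{
  "type": "object",
  "properties": {
    "title": { "type": "string", "description": "Product name" },
    "price": { "type": "number", "description": "Price in USD" },
    "inStock": { "type": "boolean", "description": "Whether the item is available" }
  },
  "required": ["title", "price"]
}
```

The `description` fields double as instructions to the LLM, which is often enough to disambiguate similar-looking values on the page.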
| Feature | Description |
|---|---|
| Schema-based extraction | Define typed JSON schemas for any data structure |
| Multiple LLM providers | OpenAI, Anthropic, Ollama, and custom endpoints |
| Batch processing | Scrape multiple pages with a single command |
| Playwright integration | Handles JavaScript-rendered pages automatically |
| Retry and error handling | Built-in resilience for failed extractions |
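The exact retry behavior is internal to the library, but the idea behind the last row can be sketched as a generic helper. This is a hypothetical illustration, not llm-scraper's actual API:

```typescript
// Hypothetical retry helper illustrating "built-in resilience for failed
// extractions". llm-scraper's real implementation may differ.
function withRetry<T>(fn: () => T, attempts = 3): T {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return fn(); // return on the first successful attempt
    } catch (err) {
      lastError = err; // remember the failure and try again
    }
  }
  throw lastError; // every attempt failed; surface the last error
}
```

Wrapping an extraction call in such a helper turns transient failures (timeouts, malformed LLM output) into at most a few silent retries.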
Comparison with Traditional Scraping
| Approach | Maintenance | Accuracy | JavaScript Support | Setup Time |
|---|---|---|---|---|
| CSS Selectors | High (breaks often) | Variable | Requires Playwright | Medium |
| XPath | High (breaks often) | Variable | Requires Playwright | Medium |
| Regex parsing | Very High | Low | No | Low |
| LLM Scraper | Low (semantic) | High | Built-in | Low |
Data Extraction Pipeline
```mermaid
flowchart LR
    A[Web Page URL] --> B[Playwright Browser]
    B --> C[HTML Content]
    C --> D[LLM Processor]
    E[Schema Definition] --> D
    D --> F[Structured JSON]
    F --> G[Database]
    F --> H[Data Pipeline]
    F --> I[Analysis Tools]
```

The pipeline starts with a target URL. Playwright loads the page (handling JavaScript rendering), the raw HTML is passed to the LLM along with your schema, and the LLM returns clean, structured data ready for downstream use.
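Conceptually, the LLM step combines the rendered HTML and the schema into a single extraction prompt. A simplified, hypothetical sketch of that step (the library handles this internally, and its real prompt is richer):

```typescript
// Hypothetical sketch of the "LLM Processor" stage: assemble the prompt that
// pairs page HTML with the target schema. The actual library also calls the
// model and validates the response; here we only build the prompt text.
function buildExtractionPrompt(html: string, schema: object): string {
  return [
    "Extract data from the HTML below as JSON matching this schema.",
    `Schema: ${JSON.stringify(schema)}`,
    `HTML: ${html}`,
    "Respond with JSON only.",
  ].join("\n");
}
```

Because the model reasons over the whole document, the same prompt keeps working when the markup changes, which is where the "low maintenance" property comes from.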
Key Advantages
LLM Scraper excels where traditional tools struggle. Pages that load content dynamically, use complex JavaScript frameworks, or have frequently changing layouts are handled with zero maintenance. The LLM understands the meaning of content, not just its position in the DOM.
For more details, visit the official GitHub repository and check out the LLM Scraper documentation.
Frequently Asked Questions
Q: Which LLM models work best with LLM Scraper? A: GPT-4 and Claude 3.5 Sonnet offer the best extraction accuracy, while local models like Llama 3 provide a good free-tier option.
Q: Can LLM Scraper handle pages behind login walls? A: Yes, by passing cookies or session tokens through Playwright’s authentication context.
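Playwright can reuse a saved session via its storage state file, which you can load with `browser.newContext({ storageState: 'state.json' })`. An illustrative state file (the cookie values here are placeholders) might look like:

```json
{
  "cookies": [
    {
      "name": "session_id",
      "value": "PLACEHOLDER",
      "domain": "example.com",
      "path": "/",
      "expires": -1,
      "httpOnly": true,
      "secure": true,
      "sameSite": "Lax"
    }
  ],
  "origins": []
}
```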
Q: How much does it cost per page scraped? A: Costs depend on the LLM provider and page size, typically $0.001-$0.01 per page with cloud providers.
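That per-page range can be sanity-checked with back-of-the-envelope arithmetic. The token counts and per-1K-token rates below are illustrative assumptions, not current provider pricing:

```typescript
// Rough cost estimate for one scraped page. All numbers are illustrative
// assumptions; check your provider's current pricing.
function estimateCostUSD(
  inputTokens: number,   // tokens in the HTML + prompt
  outputTokens: number,  // tokens in the returned JSON
  inPricePer1K: number,  // input price per 1K tokens (USD)
  outPricePer1K: number  // output price per 1K tokens (USD)
): number {
  return (inputTokens / 1000) * inPricePer1K + (outputTokens / 1000) * outPricePer1K;
}
```

For example, a 4,000-token page producing 500 tokens of JSON at assumed rates of $0.0025/$0.01 per 1K tokens lands around one and a half cents, which is why trimming boilerplate HTML before sending it to the model pays off.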
Q: Does it support pagination and crawling? A: Yes, you can chain multiple pages and define crawling patterns to scrape entire sites.
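Chaining pages often starts with expanding a URL pattern into the list of pages to visit. A minimal hypothetical helper (the `{page}` placeholder is a convention defined here, not llm-scraper syntax):

```typescript
// Hypothetical helper: expand a page-number pattern into concrete URLs to
// scrape in sequence. "{page}" is a placeholder we define for this sketch.
function pageUrls(pattern: string, pages: number): string[] {
  const urls: string[] = [];
  for (let p = 1; p <= pages; p++) {
    urls.push(pattern.replace("{page}", String(p)));
  }
  return urls;
}
```

Each resulting URL can then be fed through the same schema-based extraction, with the results concatenated downstream.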
Q: What output formats are supported? A: JSON is the primary output, but you can pipe results to CSV, Parquet, or any format via post-processing.
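Piping JSON results to CSV is a few lines of post-processing. A minimal sketch that derives the header from the first record and escapes values per RFC 4180:

```typescript
// Minimal JSON-records-to-CSV converter for post-processing scraper output.
// Columns come from the first record's keys; values containing commas,
// quotes, or newlines are wrapped and escaped per RFC 4180.
function toCsv(records: Record<string, unknown>[]): string {
  if (records.length === 0) return "";
  const headers = Object.keys(records[0]);
  const escape = (v: unknown): string => {
    const s = String(v ?? "");
    return /[",\n]/.test(s) ? `"${s.replace(/"/g, '""')}"` : s;
  };
  const rows = records.map((r) => headers.map((h) => escape(r[h])).join(","));
  return [headers.join(","), ...rows].join("\n");
}
```

For columnar formats like Parquet you would hand the same JSON records to a dedicated library instead.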