Social media data is a goldmine for market research, trend analysis, and competitive intelligence – but accessing it programmatically is notoriously difficult. Platforms actively block scrapers, change their APIs, and require complex authentication flows. MediaCrawler has emerged as one of the most popular open-source solutions to this challenge, with over 30,000 GitHub stars and support for all major Chinese social media platforms.
The project at github.com/NanmiCoder/MediaCrawler provides a unified framework for crawling data from Xiaohongshu (Little Red Book), Douyin (TikTok China), Kuaishou, Bilibili, Weibo, and more. It combines Playwright browser automation with IP rotation and cookie management to bypass anti-scraping measures. The result is a reliable data pipeline for extracting posts, comments, user profiles, and engagement metrics.
MediaCrawler’s popularity stems from its pragmatic design. Rather than being a generic scraping library, it is specifically tuned to the quirks of each supported platform – login flows, rate limits, response formats, and anti-bot detection mechanisms are all handled internally. Users configure targets (keywords, user IDs, hashtags) and the crawler handles the rest.
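To make the "configure targets and the crawler handles the rest" workflow concrete, here is a minimal sketch of what such a configuration might look like. The option names below are illustrative assumptions, not MediaCrawler's actual config keys.

```python
# Hypothetical crawler configuration sketch -- these key names are
# illustrative, not MediaCrawler's real settings module.
config = {
    "platform": "xhs",                    # target platform (e.g. Xiaohongshu)
    "crawl_type": "search",               # "search", "user", or "comments"
    "keywords": ["coffee", "espresso"],   # search terms to crawl
    "max_notes": 200,                     # stop after this many posts
    "enable_comments": True,              # also harvest comments per post
    "save_format": "json",                # structured output for analysis
}

def validate(cfg: dict) -> bool:
    """Basic sanity checks before a crawl starts."""
    if cfg["crawl_type"] not in {"search", "user", "comments"}:
        return False
    if cfg["crawl_type"] == "search" and not cfg["keywords"]:
        return False
    return True
```

In the real project, options like these are set through a config module and command-line flags; the point here is only the shape of the target specification.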
What is MediaCrawler?
MediaCrawler is an open-source, multi-platform social media data scraper that uses Playwright-based browser automation to collect content from major social platforms. It supports search-based crawling (by keyword), user-based crawling (by user ID), and comment collection. Data is output in structured JSON format for downstream analysis.
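The structured output can be pictured as one JSON object per collected post. The field names below are an assumed normalization for illustration, not the project's exact schema.

```python
import json

# Illustrative normalized record -- field names are assumptions,
# not MediaCrawler's exact output schema.
record = {
    "platform": "xhs",
    "post_id": "abc123",
    "author_id": "user_456",
    "title": "Best coffee shops in Shanghai",
    "content": "A short write-up ...",
    "liked_count": 1024,
    "comment_count": 87,
    "created_at": 1700000000,   # Unix timestamp
}

# One JSON line per post round-trips cleanly into analysis tools.
line = json.dumps(record, ensure_ascii=False)
parsed = json.loads(line)
print(parsed["post_id"])
```

Keeping field names consistent across platforms is what makes downstream analysis (pandas, SQL, dashboards) straightforward.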
Which platforms are supported?
MediaCrawler supports all major Chinese social media platforms and a growing selection of international platforms.
| Platform | Type | Crawling Modes |
|---|---|---|
| Xiaohongshu (RED) | Lifestyle/content sharing | Search notes, user notes, comments |
| Douyin (TikTok CN) | Short video | Search videos, user videos, comments |
| Kuaishou | Short video | Search videos, user videos |
| Bilibili | Video streaming | Search videos, user videos, comments |
| Weibo | Microblogging | Search posts, user posts, comments |
| Zhihu | Q&A platform | Search questions, answers |
| Tieba (Planned) | Forums | Search threads |
| TikTok Global (Planned) | Short video | Search videos |
Each platform has its own crawling strategy tailored to its API behavior and anti-scraping measures.
What technology powers MediaCrawler?
MediaCrawler is built on a stack of well-established Python libraries for web automation and data processing.
| Component | Technology | Purpose |
|---|---|---|
| Browser automation | Playwright | Headless browser control |
| Proxy management | Custom IP rotation | Bypass rate limits and blocks |
| Cookie management | Persistent cookie store | Maintain login sessions |
| Data extraction | CSS/XPath selectors | Parse page content |
| Data storage | JSON, CSV, MySQL | Output collected data |
| Concurrency | asyncio | Parallel scraping |
| Anti-detection | Custom stealth patches | Avoid bot detection |
The Playwright-based approach means MediaCrawler interacts with pages like a real user, making it significantly harder for platforms to detect than simple HTTP request-based scrapers.
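One piece of the table above, persistent cookie storage, is what lets a logged-in session survive restarts. Here is a minimal stdlib sketch of the idea, not MediaCrawler's actual implementation; the file path and cookie fields are placeholders.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical storage location for the demo.
COOKIE_FILE = Path(tempfile.gettempdir()) / "mc_demo_cookies.json"

def save_cookies(cookies: list[dict]) -> None:
    """Persist browser cookies (e.g. from Playwright's context.cookies())."""
    COOKIE_FILE.write_text(json.dumps(cookies))

def load_cookies() -> list[dict]:
    """Reload cookies so a fresh browser context starts already logged in."""
    if COOKIE_FILE.exists():
        return json.loads(COOKIE_FILE.read_text())
    return []

save_cookies([{"name": "session_id", "value": "deadbeef", "domain": ".example.com"}])
restored = load_cookies()
print(restored[0]["name"])
```

With Playwright, the restored list would be fed back via `context.add_cookies(...)`, so a QR-code login only has to happen once.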
What are MediaCrawler’s key features?
MediaCrawler provides a comprehensive set of scraping capabilities beyond basic content extraction.
| Feature | Description |
|---|---|
| Keyword search scraping | Collect all posts/videos matching search terms |
| User profile scraping | Extract all content from a specific user |
| Comment harvesting | Collect comments and replies on posts |
| Auto login | Credential-based or QR-code login per platform |
| Proxy rotation | SOCKS5/HTTP proxy pools for IP diversity |
| Rate limiting | Configurable delays to avoid detection |
| Incremental crawling | Resume from last checkpoint |
| Structured output | JSON with normalized field names across platforms |
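Two of the features above, proxy rotation and rate limiting, can be sketched with the stdlib alone. The proxy list and delay range below are placeholders, not the project's defaults.

```python
import itertools
import random

# Placeholder proxy pool -- a real deployment would load these from config.
PROXIES = [
    "socks5://10.0.0.1:1080",
    "socks5://10.0.0.2:1080",
    "socks5://10.0.0.3:1080",
]
proxy_cycle = itertools.cycle(PROXIES)  # round-robin for IP diversity

def next_request_plan() -> tuple[str, float]:
    """Pick the next proxy and a jittered, human-like delay for one request."""
    proxy = next(proxy_cycle)
    delay = random.uniform(1.5, 4.0)  # randomized pause to avoid detection
    return proxy, delay

plans = [next_request_plan() for _ in range(4)]
print([p for p, _ in plans])  # cycles back to the first proxy on request 4
```

Randomized delays matter as much as rotation: fixed intervals are an easy bot signature, while jitter mimics human browsing rhythm.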
Is there a Pro version of MediaCrawler?
The core MediaCrawler project is fully open-source and free. The developers offer a “Pro” version with additional features for commercial users.
| Feature | Open Source | Pro Version |
|---|---|---|
| Platform support | 6 platforms | 10+ platforms |
| Proxy support | Basic SOCKS5 | Advanced rotating proxies |
| Data export | JSON + CSV | JSON, CSV, MySQL, Elasticsearch |
| Rate limiting | Manual config | Adaptive AI rate limiting |
| Support | GitHub Issues | Dedicated support channel |
| License | MIT | Commercial license |
The Pro version is primarily aimed at enterprises running large-scale data collection pipelines.
Frequently Asked Questions
What is MediaCrawler?
MediaCrawler is an open-source Python tool for scraping social media data from platforms like Xiaohongshu, Douyin, Bilibili, Weibo, and more. It uses Playwright browser automation to extract posts, comments, and user data.
Which social media platforms are supported?
Xiaohongshu (RED), Douyin (TikTok China), Kuaishou, Bilibili, Weibo, and Zhihu. TikTok Global support is planned for future releases.
What technology does MediaCrawler use?
Playwright for browser automation, asyncio for concurrent scraping, customizable IP rotation, and persistent cookie management for session maintenance.
What are MediaCrawler’s key features?
Keyword search scraping, user profile extraction, comment harvesting, auto login, proxy rotation, rate limiting, incremental crawling, and structured JSON output.
Is there a Pro/enterprise version of MediaCrawler?
Yes, a Pro version offers additional platforms, advanced proxy management, adaptive rate limiting, and commercial support for enterprise users.
Further Reading
- MediaCrawler GitHub Repository
- Playwright Python Documentation
- Web Scraping Best Practices Guide
- Social Media Data Analysis with Python
- Xiaohongshu Platform Overview
```mermaid
flowchart TB
    A[User Configuration] --> B[MediaCrawler Engine]
    B --> C{Select Platform}
    C --> D[Xiaohongshu]
    C --> E[Douyin]
    C --> F[Bilibili]
    C --> G[Weibo]
    D --> H[Launch Playwright]
    E --> H
    F --> H
    G --> H
    H --> I[Login + Cookie Management]
    I --> J[Navigate to Target]
    J --> K[Extract Data]
    K --> L[Parse & Normalize]
    L --> M[JSON Output]
    M --> N[Analysis Pipeline]
```

```mermaid
graph LR
    subgraph Data Pipeline
        A[Search Keywords] --> B[Auto Login]
        B --> C[Proxy Select]
        C --> D[Page Scrape]
        D --> E[Data Parse]
        E --> F[Format JSON]
    end
    subgraph Storage
        F --> G[Local File]
        F --> H[Database]
        F --> I[Data Warehouse]
    end
    subgraph Anti-Detection
        J[User Agent Rotation]
        K[Human-like Delays]
        L[IP Rotation]
        M[Stealth Patches]
    end
```