
MediaCrawler: Open-Source Social Media Data Scraper with 30K Stars

MediaCrawler is an open-source multi-platform social media scraper supporting Xiaohongshu, Douyin, Kuaishou, Bilibili, Weibo, and more with Playwright automation.


Social media data is a goldmine for market research, trend analysis, and competitive intelligence, but accessing it programmatically is notoriously difficult. Platforms actively block scrapers, change their APIs, and require complex authentication flows. MediaCrawler has emerged as one of the most popular open-source solutions to this challenge, with over 30,000 GitHub stars and support for six major Chinese social media platforms.

The project at github.com/NanmiCoder/MediaCrawler provides a unified framework for crawling data from Xiaohongshu (Little Red Book), Douyin (the Chinese counterpart of TikTok), Kuaishou, Bilibili, Weibo, and more. It combines Playwright browser automation with IP rotation and cookie management to get past anti-scraping measures. The result is a reliable data pipeline for extracting posts, comments, user profiles, and engagement metrics.

MediaCrawler’s popularity stems from its pragmatic design. Rather than being a generic scraping library, it is specifically tuned to the quirks of each supported platform – login flows, rate limits, response formats, and anti-bot detection mechanisms are all handled internally. Users configure targets (keywords, user IDs, hashtags) and the crawler handles the rest.

What is MediaCrawler?

MediaCrawler is an open-source, multi-platform social media data scraper that uses Playwright-based browser automation to collect content from major social platforms. It supports search-based crawling (by keyword), user-based crawling (by user ID), and comment collection. Data is output in structured JSON format for downstream analysis.
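To make "structured JSON for downstream analysis" concrete, here is a minimal sketch of what consuming such output could look like. The field names (`platform`, `post_id`, `like_count`, `comment_count`) and the sample records are assumptions made up for illustration; they are not MediaCrawler's actual schema.

```python
import json
from collections import Counter

# Hypothetical sample of crawler output; the field names are illustrative only.
SAMPLE_OUTPUT = """
[
  {"platform": "weibo", "post_id": "w1", "like_count": 12, "comment_count": 3},
  {"platform": "bilibili", "post_id": "b1", "like_count": 40, "comment_count": 10},
  {"platform": "weibo", "post_id": "w2", "like_count": 5, "comment_count": 0}
]
"""


def engagement_by_platform(raw_json: str) -> dict[str, int]:
    """Sum likes and comments per platform from a JSON array of post records."""
    totals: Counter[str] = Counter()
    for record in json.loads(raw_json):
        # Missing counters are treated as zero rather than raising KeyError.
        likes = record.get("like_count", 0)
        comments = record.get("comment_count", 0)
        totals[record["platform"]] += likes + comments
    return dict(totals)


if __name__ == "__main__":
    print(engagement_by_platform(SAMPLE_OUTPUT))
```

Because the records are plain JSON objects, the same data could just as easily be loaded into a dataframe or a database; the aggregation above is only one possible downstream step.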

Which platforms are supported?

MediaCrawler currently covers six Chinese social media platforms, with Tieba and TikTok Global listed as planned.

| Platform | Type | Crawling Modes |
| --- | --- | --- |
| Xiaohongshu (RED) | Lifestyle/content sharing | Search notes, user notes, comments |
| Douyin (TikTok CN) | Short video | Search videos, user videos, comments |
| Kuaishou | Short video | Search videos, user videos |
| Bilibili | Video streaming | Search videos, user videos, comments |
| Weibo | Microblogging | Search posts, user posts, comments |
| Zhihu | Q&A platform | Search questions, answers |
| Tieba (Planned) | Forums | Search threads |
| TikTok Global (Planned) | Short video | Search videos |

Each platform has its own crawling strategy tailored to its API behavior and anti-scraping measures.
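One way to picture per-platform strategies is a registry that records which crawling modes each platform supports and rejects anything else up front. The sketch below is an assumption-laden illustration: the platform keys, the mode names (`search`, `user`, `comments`), and the helper functions are hypothetical and only mirror the table above, not the project's real code.

```python
# Hypothetical registry derived from the platform table; not MediaCrawler's API.
SUPPORTED_MODES: dict[str, frozenset[str]] = {
    "xiaohongshu": frozenset({"search", "user", "comments"}),
    "douyin": frozenset({"search", "user", "comments"}),
    "kuaishou": frozenset({"search", "user"}),
    "bilibili": frozenset({"search", "user", "comments"}),
    "weibo": frozenset({"search", "user", "comments"}),
    "zhihu": frozenset({"search"}),
}


def supports(platform: str, mode: str) -> bool:
    """Return True if the platform is registered and offers the given mode."""
    return mode in SUPPORTED_MODES.get(platform, frozenset())


def plan_crawl(platform: str, mode: str, target: str) -> dict[str, str]:
    """Validate a request and return a small job description for it."""
    if platform not in SUPPORTED_MODES:
        raise ValueError(f"unsupported platform: {platform!r}")
    if not supports(platform, mode):
        raise ValueError(f"{platform!r} does not support mode {mode!r}")
    return {"platform": platform, "mode": mode, "target": target}


if __name__ == "__main__":
    print(plan_crawl("weibo", "search", "electric cars"))
```

Validating the request before any browser is launched keeps unsupported combinations, such as comment collection on Kuaishou in this table, from failing halfway through a run.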

What technology powers MediaCrawler?

MediaCrawler is built on a stack of well-established Python libraries for web automation and data processing.

| Component | Technology | Purpose |
| --- | --- | --- |
| Browser automation | Playwright | Headless browser control |
| Proxy management | Custom IP rotation | Bypass rate limits and blocks |
| Cookie management | Persistent cookie store | Maintain login sessions |
| Data extraction | CSS/XPath selectors | Parse page content |
| Data storage | JSON, CSV, MySQL | Output collected data |
| Concurrency | asyncio | Parallel scraping |
| Anti-detection | Custom stealth patches | Avoid bot detection |

The Playwright-based approach means MediaCrawler interacts with pages like a real user, making it significantly harder for platforms to detect compared to simple HTTP request-based scrapers.
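The asyncio row in the stack is worth a quick illustration. A common pattern for parallel scraping is to cap the number of in-flight tasks with a semaphore and add a small random pause before each one. The sketch below shows that pattern with the standard library only; `fetch_item` is a hypothetical stand-in that sleeps instead of driving a browser, and none of the names or default values come from MediaCrawler itself.

```python
import asyncio
import random


async def fetch_item(item_id: str) -> dict[str, str]:
    """Hypothetical stand-in for real page work; sleeps instead of loading a page."""
    await asyncio.sleep(0.01)
    return {"id": item_id}


async def crawl_all(
    item_ids: list[str],
    max_concurrency: int = 3,
    min_delay: float = 0.0,
    max_delay: float = 0.02,
) -> list[dict[str, str]]:
    """Fetch every item, never running more than max_concurrency at once."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def worker(item_id: str) -> dict[str, str]:
        async with semaphore:
            # Random jitter so requests are not fired at a fixed rhythm.
            await asyncio.sleep(random.uniform(min_delay, max_delay))
            return await fetch_item(item_id)

    # gather() returns results in the same order as the input IDs.
    return await asyncio.gather(*(worker(i) for i in item_ids))


if __name__ == "__main__":
    results = asyncio.run(crawl_all([f"post-{n}" for n in range(5)]))
    print(len(results))
```

The semaphore limit and delay range are the two knobs that trade throughput against the risk of tripping rate limits; the values here are placeholders chosen to keep the demo fast.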

What are MediaCrawler’s key features?

MediaCrawler provides a comprehensive set of scraping capabilities beyond basic content extraction.

| Feature | Description |
| --- | --- |
| Keyword search scraping | Collect all posts/videos matching search terms |
| User profile scraping | Extract all content from a specific user |
| Comment harvesting | Collect comments and replies on posts |
| Auto login | Credential-based or QR-code login per platform |
| Proxy rotation | SOCKS5/HTTP proxy pools for IP diversity |
| Rate limiting | Configurable delays to avoid detection |
| Incremental crawling | Resume from last checkpoint |
| Structured output | JSON with normalized field names across platforms |
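Incremental crawling is the feature in this list that benefits most from an example. The general idea is to persist the IDs that have already been processed and skip them on the next run. The sketch below is one simple way to do that with a JSON checkpoint file; the file layout and the function names are assumptions for illustration, not how MediaCrawler stores its checkpoints.

```python
import json
from pathlib import Path
from typing import Callable, Iterable


def load_checkpoint(path: Path) -> set[str]:
    """Return the set of already-processed IDs, or an empty set on first run."""
    if not path.exists():
        return set()
    return set(json.loads(path.read_text(encoding="utf-8")))


def save_checkpoint(path: Path, seen: set[str]) -> None:
    """Write the processed IDs as a sorted JSON list."""
    path.write_text(json.dumps(sorted(seen)), encoding="utf-8")


def crawl_incrementally(
    item_ids: Iterable[str],
    checkpoint_path: Path,
    process: Callable[[str], None],
) -> list[str]:
    """Process only IDs not in the checkpoint; return the IDs handled this run."""
    seen = load_checkpoint(checkpoint_path)
    handled: list[str] = []
    for item_id in item_ids:
        if item_id in seen:
            continue
        process(item_id)
        seen.add(item_id)
        # Save after every item so an interruption loses at most the current one.
        save_checkpoint(checkpoint_path, seen)
        handled.append(item_id)
    return handled


if __name__ == "__main__":
    path = Path("checkpoint.json")  # hypothetical file name
    print(crawl_incrementally(["a", "b", "c"], path, lambda item_id: None))
```

Running the script a second time with the same IDs handles nothing new, since every ID is already recorded in the checkpoint file. Rewriting the whole file on each item is fine for small runs; a larger crawl would more likely append to a log or use a database table.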

Is there a Pro version of MediaCrawler?

The core MediaCrawler project is fully open-source and free. The developers offer a “Pro” version with additional features for commercial users.

| Feature | Open Source | Pro Version |
| --- | --- | --- |
| Platform support | 6 platforms | 10+ platforms |
| Proxy support | Basic SOCKS5 | Advanced rotating proxies |
| Data export | JSON + CSV | JSON, CSV, MySQL, Elasticsearch |
| Rate limiting | Manual config | Adaptive AI rate limiting |
| Support | GitHub Issues | Dedicated support channel |
| License | MIT | Commercial license |

The Pro version is primarily aimed at enterprises running large-scale data collection pipelines.

Frequently Asked Questions

What is MediaCrawler?

MediaCrawler is an open-source Python tool for scraping social media data from platforms like Xiaohongshu, Douyin, Bilibili, Weibo, and more. It uses Playwright browser automation to extract posts, comments, and user data.

Which social media platforms are supported?

Xiaohongshu (RED), Douyin (TikTok China), Kuaishou, Bilibili, Weibo, and Zhihu. TikTok Global support is planned for future releases.

What technology does MediaCrawler use?

Playwright for browser automation, asyncio for concurrent scraping, customizable IP rotation, and persistent cookie management for session maintenance.

What are MediaCrawler’s key features?

Keyword search scraping, user profile extraction, comment harvesting, auto login, proxy rotation, rate limiting, incremental crawling, and structured JSON output.

Is there a Pro/enterprise version of MediaCrawler?

Yes, a Pro version offers additional platforms, advanced proxy management, adaptive rate limiting, and commercial support for enterprise users.
