Every data scientist has faced the same frustration: spending hours searching for a reliable dataset, only to find broken links, outdated information, or unclear licensing. According to recent surveys, data professionals spend an average of 12 hours per week just locating and preparing data for their projects. That is roughly one-third of a standard work week consumed by discovery alone.
Awesome Public Datasets solves this problem at scale. With over 59,800 GitHub stars and 9,700 forks, it is one of the most trusted community-driven catalogs of open data on the internet. Originally incubated at the OMNILab of Shanghai Jiao Tong University and now stewarded by the BaiYuLan Open AI community (Shanghai’s premier open AI ecosystem), this project has evolved from a simple curated list into a comprehensive data discovery platform.
What makes Awesome Public Datasets truly exceptional is its breadth. The list spans over 35 distinct categories — from agriculture and astronomy to social networks and sports — with each dataset entry carrying status indicators that tell you at a glance whether the source is actively maintained (a green checkmark) or needs attention (a warning icon). The entire listing is auto-generated using the apd-core tool, which means entries are structurally consistent and automatically verified.
Created in November 2014 and actively maintained through April 2026 — more than eleven years of continuous curation — this resource has powered research papers, startup MVPs, Kaggle competition entries, university coursework, and enterprise proof-of-concepts around the world. Whether you are training a large language model, analyzing climate trends, or building a recommendation engine, this is the first bookmark you should save.
What problem does Awesome Public Datasets solve?
The data discovery landscape is fractured. Government portals, university repositories, cloud provider marketplaces, and domain-specific archives all operate independently. Researchers often resort to forum posts and social media to learn about usable datasets. Awesome Public Datasets consolidates this chaos into a single, navigable index.
```mermaid
flowchart LR
    A[Researcher needs data] --> B{Browse Awesome<br/>Public Datasets}
    B --> C[Agriculture]
    B --> D[Biological]
    B --> E[Climate]
    B --> F[Finance]
    B --> G[Healthcare]
    B --> H[35+ categories]
    C --> I[Verified link + status]
    D --> I
    E --> I
    F --> I
    G --> I
    H --> I
    I --> J[Start analysis]
```

Before this project, finding a high-quality dataset might involve visiting a dozen government portals, university repositories, and forum threads. Now every major open dataset is a single click away.
The project’s longevity is a testament to its utility. Since November 2014, the collection has grown from a handful of links to hundreds of verified entries, with the community contributing new datasets and flagging dead links through pull requests. The apd-core automation ensures that contributions meet quality standards before they are merged.
How is the list organized?
The repository uses a straightforward category system with 35+ top-level domains. Each dataset entry in the README includes a direct link, a brief description, and a status icon. The categories themselves are alphabetized, making navigation predictable even as the list grows.
```mermaid
flowchart TD
    subgraph Browsing["Browsing Flow"]
        direction LR
        A1[Open README] --> A2[Pick category] --> A3[Scan entries] --> A4[Check status ✅⚠️] --> A5[Follow link]
    end
    subgraph Contributing["Contribution Flow"]
        direction LR
        B1[Fork apd-core] --> B2[Edit YAML metadata] --> B3[Submit PR] --> B4[Automated review] --> B5[Merge]
    end
```

Entries that display the green checkmark icon (✅) have been recently verified and the links are confirmed active. Entries marked with a warning icon (⚠️) may have broken links or need community attention — a transparent system that keeps expectations honest and invites contributions.
What categories does the list cover?
The breadth of the collection is one of its strongest features. Researchers in virtually any domain will find something relevant.
| Category | Description | Example Datasets | Approx. Entries |
|---|---|---|---|
| Agriculture | Crop yields, soil data, food nutrition | USDA Nutrient Database, Global Crop Yields, PLANTS Database | 15+ |
| Biology | Genomics, proteomics, cancer data | 1000 Genomes, TCGA, ENCODE, GEO, PDB, COSMIC | 45+ |
| Climate + Weather | Atmospheric, oceanic, climate projections | WorldClim, NOAA Models, NASA GIBS, Open-Meteo | 20+ |
| Finance | Market data, economic indicators | FRED, Quandl, Yahoo Finance, NASDAQ, CBOE | 25+ |
| Healthcare | Medical imaging, physiology, epidemiology | PhysioNet, TCIA, WHO Observatory, Medicare Data | 30+ |
| Machine Learning | Benchmark datasets, ML repositories | ImageNet, MNIST, Kaggle, UCI ML Repository | 40+ |
| Natural Language | Text corpora, embeddings, speech | Common Crawl, Wikipedia Dumps, LibriSpeech | 35+ |
| Social Networks | Graph data, user behavior, platform data | Stanford SNAP, Twitter Data, Reddit Datasets | 20+ |
| Government | Open government portals worldwide | Data.gov, EU Open Data Portal, city-level portals | 100+ |
| Transportation | Transit, traffic, mobility | NYC Taxi Trips, GTFS Feeds, OpenFlights | 15+ |
The Government category alone contains over 100 sub-entries, linking to open data portals from cities, states, provinces, and national governments worldwide. If you need demographic, economic, or administrative data, this is the place to start.
Which biology datasets are included?
The Biology section is the deepest category in the collection, subdivided into genomics, functional genomics, and cancer genomics. These are foundational resources that have powered thousands of research papers.
| Dataset | Description | Type | Access |
|---|---|---|---|
| 1000 Genomes Project | 2,500+ human genome sequences from diverse populations | Genomics | Open |
| The Cancer Genome Atlas (TCGA) | Multi-platform genomic data across 33 cancer types | Cancer Genomics | Controlled |
| ENCODE Project | Functional elements in the human genome | Functional Genomics | Open |
| Gene Expression Omnibus (GEO) | High-throughput gene expression and functional genomics | Functional Genomics | Open |
| COSMIC | Somatic mutation information in human cancer | Cancer Genomics | Open |
| Protein Data Bank (PDB) | 3D structures of biological macromolecules | Structural Biology | Open |
| PubChem | Information on chemical molecules and bioactivities | Chemoinformatics | Open |
| Human Microbiome Project (HMP) | Microbial communities across body sites | Metagenomics | Open |
Many of these datasets are too large to download in their entirety — the 1000 Genomes dataset alone exceeds 200 terabytes. Researchers typically use programmatic access (via APIs or cloud mirroring) to work with subsets relevant to their research.
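One common way to pull a small slice of a multi-terabyte file without downloading it all is a plain HTTP Range request, sketched below. The URL is a placeholder, not a real 1000 Genomes endpoint; real access typically goes through the project's official FTP/S3 mirrors or APIs.

```python
import urllib.request

# Placeholder URL for illustration -- substitute a real mirror path.
DATA_URL = "https://example.org/1000genomes/chr20.vcf.gz"

def ranged_request(url: str, start: int, end: int) -> urllib.request.Request:
    """Build a request asking the server for bytes [start, end] only."""
    req = urllib.request.Request(url)
    req.add_header("Range", f"bytes={start}-{end}")
    return req

# Ask for only the first megabyte (no network call is made here; pass
# the request to urllib.request.urlopen to actually download).
req = ranged_request(DATA_URL, 0, 1_048_575)
print(req.get_header("Range"))  # → bytes=0-1048575
```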
Which machine learning datasets are featured?
The machine learning category links to the most widely used benchmarks in the field. Whether you are working on computer vision, natural language processing, or tabular data, these datasets are industry standards.
| Dataset | Domain | Typical Use | Scale |
|---|---|---|---|
| ImageNet | Computer Vision | Image classification, object detection | 14M+ images, 22K categories |
| MNIST | Computer Vision | Handwritten digit recognition | 70K grayscale images |
| Common Crawl | Web Text | LLM pre-training, NLP corpora | Billions of web pages |
| LibriSpeech | Speech | ASR model training | 1,000 hours of speech |
| UCI ML Repository | Mixed | Benchmarking algorithms | 600+ datasets |
| Kaggle Datasets | Mixed | Competitions and exploration | 100,000+ datasets |
The presence of both foundational datasets (like MNIST) and massive-scale corpora (like Common Crawl) means the list serves everyone from students learning the basics to researchers training billion-parameter models.
How does the apd-core tool maintain data quality?
The apd-core repository is the engine behind Awesome Public Datasets. It stores all dataset metadata as structured YAML files, each containing the dataset name, URL, description, category tags, and verification history.
This structured format enables several automated quality checks:
- Link verification: Scripts test whether dataset URLs resolve correctly
- Metadata completeness: Each entry must include required fields before it is accepted
- Category consistency: Entries are classified under the correct domain heading
- License awareness: Dataset license terms can be tracked alongside the listing
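A link-verification pass like the one described above can be approximated in a few lines. This is a hypothetical sketch, not apd-core's actual implementation: it sends a HEAD request and maps the HTTP status code to the listing's status icon.

```python
import urllib.request
import urllib.error

def status_icon(http_status: int) -> str:
    """Map an HTTP status code to the listing's status indicator."""
    return "✅" if 200 <= http_status < 400 else "⚠️"

def check_link(url: str, timeout: float = 10.0) -> str:
    """HEAD-request a dataset URL and return ✅ (alive) or ⚠️ (broken)."""
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return status_icon(resp.status)
    except (urllib.error.URLError, TimeoutError):
        return "⚠️"

print(status_icon(200), status_icon(404))  # → ✅ ⚠️
```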
When you find a new dataset to contribute, you do not edit the README directly. Instead, you submit a pull request to apd-core with the new YAML entry. The automated pipeline validates your submission and, once merged, regenerates the README. This separation of data from presentation ensures the list remains consistent and machine-readable.
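A contribution might look something like the following YAML entry. The field names here are assumptions for illustration; check apd-core's existing files for the exact schema before submitting.

```yaml
# Hypothetical apd-core entry -- field names are illustrative,
# not the confirmed schema.
name: Example Climate Archive
url: https://example.org/climate-archive
description: Daily gridded temperature observations, 1950-present.
category: climate
tags:
  - weather
  - time-series
```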
Why has this project lasted over a decade?
Eleven years is an eternity for an open source project. Most curated lists atrophy when their maintainers move on to other interests. Awesome Public Datasets has thrived for several reasons:
The first is its clear scope. By limiting itself to publicly available datasets and organizing them by topic rather than format or size, the project avoids scope creep. It knows exactly what it is: a curated index, not a data marketplace, not a storage platform, not a community forum.
The second is automation. The apd-core toolchain means that adding a new dataset is as simple as writing a few lines of YAML. The maintainers do not need to manually format the README or check links. The machine handles the grunt work, and humans handle the curation judgment.
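To make the separation of data from presentation concrete, here is a toy Python sketch that regenerates markdown rows from structured entries — the same idea apd-core applies at scale. The entry fields and output format are assumptions, not the tool's actual behavior.

```python
# Toy regeneration step: structured metadata in, markdown listing out.
# Field names and formatting are illustrative assumptions.
entries = [
    {"name": "FRED", "url": "https://fred.stlouisfed.org", "ok": True},
    {"name": "Old Portal", "url": "https://example.org/gone", "ok": False},
]

def render(entry: dict) -> str:
    """Format one metadata entry as a markdown list row with a status icon."""
    icon = "✅" if entry["ok"] else "⚠️"
    return f"- [{entry['name']}]({entry['url']}) {icon}"

readme_section = "\n".join(render(e) for e in entries)
print(readme_section)
```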
The third is community stewardship. The transition from OMNILab at Shanghai Jiao Tong University to the BaiYuLan Open AI community ensured continuity. The project has institutional backing rather than relying on a single individual’s volunteer time.
What is the future of Awesome Public Datasets?
As we move deeper into 2026, several trends are shaping the project’s evolution. The rise of large language models has created unprecedented demand for high-quality text corpora — datasets like Common Crawl, C4, and The Pile are vital for pre-training. The project will likely expand its NLP and multimodal dataset sections accordingly.
Another trend is dataset versioning and provenance tracking. As datasets are filtered, deduplicated, and transformed for specific use cases, knowing the provenance chain has become essential for reproducibility. The YAML metadata in apd-core could naturally extend to track these relationships.
Finally, the spatial and climate data categories will continue to grow as planetary-scale environmental monitoring generates ever-larger streams of open Earth observation data. Awesome Public Datasets is well-positioned to remain the front door to these resources.
Frequently Asked Questions
What is Awesome Public Datasets?
Awesome Public Datasets is a topic-centric curated list of high-quality open datasets in public domains, maintained by the community and originally incubated at Shanghai Jiao Tong University.
How many datasets does Awesome Public Datasets cover?
The list covers datasets across dozens of categories including agriculture, biology, climate, economics, education, finance, government, healthcare, machine learning, and social networks.
Is Awesome Public Datasets free to use?
Yes, the list is completely free and licensed under MIT. The listed datasets are publicly available, though individual datasets may have their own license terms.
How is Awesome Public Datasets maintained?
The list is auto-generated using the apd-core tool, with community contributions reviewed regularly. Status indicators show which datasets are active or may need attention.
Who should use Awesome Public Datasets?
Researchers, data scientists, machine learning engineers, students, and anyone looking for high-quality open data for analysis, model training, or academic research.
How do I contribute a new dataset?
Fork the apd-core repository, add the dataset metadata as a YAML file under the appropriate category, and submit a pull request. The automated review process will verify the link and metadata before merging.
Can I use these datasets in commercial projects?
Most datasets listed are publicly available, but you must check each dataset’s individual license terms before commercial use. Some datasets have restrictions on redistribution or require attribution.
Further Reading
- Awesome Public Datasets on GitHub — The main repository with the full listing
- apd-core Repository — The metadata engine that generates the dataset listing
- BaiYuLan Open AI Community — The current stewarding organization
- Awesome Lists — The original awesome list format that inspired this project
- Papers With Code Datasets — A complementary resource linking datasets to research papers
- Kaggle Datasets — A platform for discovering and competing with open datasets
