
Awesome Public Datasets: The Definitive Collection of Open Data for AI and Research

Awesome Public Datasets is a community-curated collection of open datasets, with nearly 60,000 GitHub stars, spanning agriculture, biology, finance, healthcare, and more for AI research and data science.


Every data scientist has faced the same frustration: spending hours searching for a reliable dataset, only to find broken links, outdated information, or unclear licensing. By some survey estimates, data professionals spend an average of 12 hours per week just locating and preparing data for their projects. That is roughly one-third of a standard work week consumed before any analysis begins.

Awesome Public Datasets solves this problem at scale. With over 59,800 GitHub stars and 9,700 forks, it is one of the most trusted community-driven catalogs of open data on the internet. Originally incubated at the OMNILab of Shanghai Jiao Tong University and now stewarded by the BaiYuLan Open AI community (Shanghai’s premier open AI ecosystem), this project has evolved from a simple curated list into a comprehensive data discovery platform.

What makes Awesome Public Datasets truly exceptional is its breadth. The list spans over 35 distinct categories — from agriculture and astronomy to social networks and sports — with each dataset entry carrying status indicators that tell you at a glance whether the source is actively maintained (a green checkmark) or needs attention (a warning icon). The entire listing is auto-generated using the apd-core tool, which means entries are structurally consistent and automatically verified.

Created in November 2014 and actively maintained as of April 2026 (more than eleven years of continuous curation), this resource has powered research papers, startup MVPs, Kaggle competition entries, university coursework, and enterprise proof-of-concepts around the world. Whether you are training a large language model, analyzing climate trends, or building a recommendation engine, this is the first bookmark you should save.

What problem does Awesome Public Datasets solve?

The data discovery landscape is fractured. Government portals, university repositories, cloud provider marketplaces, and domain-specific archives all operate independently. Researchers often resort to forum posts and social media to learn about usable datasets. Awesome Public Datasets consolidates this chaos into a single, navigable index.

Before this project, finding a high-quality dataset might involve visiting a dozen government portals, university repositories, and forum threads. Now most major open datasets are a single click away.

The project’s longevity is a testament to its utility. Since November 2014, the collection has grown from a handful of links to hundreds of verified entries, with the community contributing new datasets and flagging dead links through pull requests. The apd-core automation ensures that contributions meet quality standards before they are merged.

How is the list organized?

The repository uses a straightforward category system with 35+ top-level domains. Each dataset entry in the README includes a direct link, a brief description, and a status icon. The categories themselves are alphabetized, making navigation predictable even as the list grows.

Entries that display the green checkmark icon (✅) have been recently verified and the links are confirmed active. Entries marked with a warning icon (⚠️) may have broken links or need community attention — a transparent system that keeps expectations honest and invites contributions.

What categories does the list cover?

The breadth of the collection is one of its strongest features. Researchers in virtually any domain will find something relevant.

| Category | Description | Example Datasets | Approx. Entries |
| --- | --- | --- | --- |
| Agriculture | Crop yields, soil data, food nutrition | USDA Nutrient Database, Global Crop Yields, PLANTS Database | 15+ |
| Biology | Genomics, proteomics, cancer data | 1000 Genomes, TCGA, ENCODE, GEO, PDB, COSMIC | 45+ |
| Climate + Weather | Atmospheric, oceanic, climate projections | WorldClim, NOAA Models, NASA GIBS, Open-Meteo | 20+ |
| Finance | Market data, economic indicators | FRED, Quandl, Yahoo Finance, NASDAQ, CBOE | 25+ |
| Healthcare | Medical imaging, physiology, epidemiology | PhysioNet, TCIA, WHO Observatory, Medicare Data | 30+ |
| Machine Learning | Benchmark datasets, ML repositories | ImageNet, MNIST, Kaggle, UCI ML Repository | 40+ |
| Natural Language | Text corpora, embeddings, speech | Common Crawl, Wikipedia Dumps, LibriSpeech | 35+ |
| Social Networks | Graph data, user behavior, platform data | Stanford SNAP, Twitter Data, Reddit Datasets | 20+ |
| Government | Open government portals worldwide | Data.gov, EU Open Data Portal, city-level portals | 100+ |
| Transportation | Transit, traffic, mobility | NYC Taxi Trips, GTFS Feeds, OpenFlights | 15+ |

The Government category alone contains over 100 sub-entries, linking to open data portals from cities, states, provinces, and national governments worldwide. If you need demographic, economic, or administrative data, this is the place to start.

Which biology datasets are included?

The Biology section is the deepest category in the collection, subdivided into genomics, functional genomics, and cancer genomics. These are foundational resources that have powered thousands of research papers.

| Dataset | Description | Type | Access |
| --- | --- | --- | --- |
| 1000 Genomes Project | 2,500+ human genome sequences from diverse populations | Genomics | Open |
| The Cancer Genome Atlas (TCGA) | Multi-platform genomic data across 33 cancer types | Cancer Genomics | Controlled |
| ENCODE Project | Functional elements in the human genome | Functional Genomics | Open |
| Gene Expression Omnibus (GEO) | High-throughput gene expression and functional genomics | Functional Genomics | Open |
| COSMIC | Somatic mutation information in human cancer | Cancer Genomics | Open |
| Protein Data Bank (PDB) | 3D structures of biological macromolecules | Structural Biology | Open |
| PubChem | Information on chemical molecules and bioactivities | Chemoinformatics | Open |
| Human Microbiome Project (HMP) | Microbial communities across body sites | Metagenomics | Open |

Many of these datasets are too large to download in their entirety — the 1000 Genomes dataset alone exceeds 200 terabytes. Researchers typically use programmatic access (via APIs or cloud mirroring) to work with subsets relevant to their research.
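
Because full downloads are often impractical, a common pattern is to fetch only the byte ranges you need over HTTP. The sketch below is a generic illustration of that pattern, not the actual access interface of any listed dataset; the URL and chunk size would be whatever your data source supports.

```python
from urllib.request import Request, urlopen

def byte_ranges(total_size: int, chunk_size: int):
    """Split a remote file of `total_size` bytes into inclusive (start, end) ranges."""
    return [
        (start, min(start + chunk_size, total_size) - 1)
        for start in range(0, total_size, chunk_size)
    ]

def fetch_range(url: str, start: int, end: int) -> bytes:
    """Fetch one byte range via an HTTP Range request (server must support it)."""
    req = Request(url, headers={"Range": f"bytes={start}-{end}"})
    with urlopen(req) as resp:  # a compliant server answers 206 Partial Content
        return resp.read()

# Example: plan a 1 GiB download as 256 MiB chunks (pure planning, no network).
ranges = byte_ranges(1 << 30, 256 << 20)
```

Each `(start, end)` pair can then be fetched independently, retried on failure, or distributed across workers.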

Which machine learning datasets are included?

The machine learning category links to the most widely used benchmarks in the field. Whether you are working on computer vision, natural language processing, or tabular data, these datasets are industry standards.

| Dataset | Domain | Typical Use | Scale |
| --- | --- | --- | --- |
| ImageNet | Computer Vision | Image classification, object detection | 14M+ images, 22K categories |
| MNIST | Computer Vision | Handwritten digit recognition | 70K grayscale images |
| Common Crawl | Web Text | LLM pre-training, NLP corpora | Billions of web pages |
| LibriSpeech | Speech | ASR model training | 1,000 hours of speech |
| UCI ML Repository | Mixed | Benchmarking algorithms | 600+ datasets |
| Kaggle Datasets | Mixed | Competitions and exploration | 100,000+ datasets |

The presence of both foundational datasets (like MNIST) and massive-scale corpora (like Common Crawl) means the list serves everyone from students learning the basics to researchers training billion-parameter models.

How does the apd-core tool maintain data quality?

The apd-core repository is the engine behind Awesome Public Datasets. It stores all dataset metadata as structured YAML files, each containing the dataset name, URL, description, category tags, and verification history.

This structured format enables several automated quality checks:

  • Link verification: Scripts test whether dataset URLs resolve correctly
  • Metadata completeness: Each entry must include required fields before it is accepted
  • Category consistency: Entries are classified under the correct domain heading
  • License awareness: Dataset license terms can be tracked alongside the listing

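In a minimal sketch, assuming illustrative field names rather than apd-core's real schema, the first two checks might look like this:

```python
# Hypothetical required fields for a dataset entry (not apd-core's actual schema).
REQUIRED_FIELDS = {"name", "url", "description", "category"}

def missing_fields(entry: dict) -> set:
    """Return the required metadata fields absent from a dataset entry."""
    return REQUIRED_FIELDS - entry.keys()

def link_status_icon(http_status: int) -> str:
    """Map an HTTP status code from a link check to the list's status indicator."""
    return "✅" if 200 <= http_status < 400 else "⚠️"

entry = {
    "name": "1000 Genomes Project",
    "url": "https://www.internationalgenome.org/",
    "description": "Human genome sequences from diverse populations",
    "category": "Biology",
}
```

A CI pipeline would run checks like these over every YAML file in a pull request and reject submissions with missing fields or dead links.
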
When you find a new dataset to contribute, you do not edit the README directly. Instead, you submit a pull request to apd-core with the new YAML entry. The automated pipeline validates your submission and, once merged, regenerates the README. This separation of data from presentation ensures the list remains consistent and machine-readable.
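
The regeneration step can be pictured as a pure function from a metadata entry to one README line. The output format below is illustrative, not the list's exact markup:

```python
def render_readme_line(entry: dict, verified: bool) -> str:
    """Render one README bullet from structured metadata (illustrative format)."""
    icon = "✅" if verified else "⚠️"
    return f"- {icon} [{entry['name']}]({entry['url']}) - {entry['description']}"

line = render_readme_line(
    {"name": "MNIST", "url": "http://yann.lecun.com/exdb/mnist/",
     "description": "Handwritten digit images"},
    verified=True,
)
# → "- ✅ [MNIST](http://yann.lecun.com/exdb/mnist/) - Handwritten digit images"
```

Because the README is a deterministic projection of the YAML files, contributors never hand-edit it and formatting drift cannot creep in.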

Why has this project lasted over a decade?

Eleven years is an eternity for an open source project. Most curated lists atrophy when their maintainers move on to other interests. Awesome Public Datasets has thrived for several reasons:

The first is its clear scope. By limiting itself to publicly available datasets and organizing them by topic rather than format or size, the project avoids scope creep. It knows exactly what it is: a curated index, not a data marketplace, not a storage platform, not a community forum.

The second is automation. The apd-core toolchain means that adding a new dataset is as simple as writing a few lines of YAML. The maintainers do not need to manually format the README or check links. The machine handles the grunt work, and humans handle the curation judgment.

The third is community stewardship. The transition from OMNILab at Shanghai Jiao Tong University to the BaiYuLan Open AI community ensured continuity. The project has institutional backing rather than relying on a single individual’s volunteer time.

What is the future of Awesome Public Datasets?

As we move deeper into 2026, several trends are shaping the project’s evolution. The rise of large language models has created unprecedented demand for high-quality text corpora — datasets like Common Crawl, C4, and The Pile are vital for pre-training. The project will likely expand its NLP and multimodal dataset sections accordingly.

Another trend is dataset versioning and provenance tracking. As datasets are filtered, deduplicated, and transformed for specific use cases, knowing the provenance chain has become essential for reproducibility. The YAML metadata in apd-core could naturally extend to track these relationships.
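
As a purely speculative sketch with hypothetical field names, provenance-aware metadata could look like this:

```python
from dataclasses import dataclass, field

@dataclass
class DatasetEntry:
    """Hypothetical entry shape with provenance fields (not apd-core's schema)."""
    name: str
    url: str
    derived_from: list = field(default_factory=list)   # upstream dataset names
    transformations: list = field(default_factory=list)  # e.g. "deduplication"

# C4 is a cleaned derivative of Common Crawl; the URL here is a placeholder.
c4 = DatasetEntry(
    name="C4",
    url="https://example.org/c4",
    derived_from=["Common Crawl"],
    transformations=["language filtering", "deduplication"],
)
```

Walking `derived_from` links upstream would let a reader trace any filtered corpus back to its raw source.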

Finally, the spatial and climate data categories will continue to grow as planetary-scale environmental monitoring generates ever-larger streams of open Earth observation data. Awesome Public Datasets is well-positioned to remain the front door to these resources.

Frequently Asked Questions

What is Awesome Public Datasets?

Awesome Public Datasets is a topic-centric curated list of high-quality open datasets in public domains, maintained by the community and originally incubated at Shanghai Jiao Tong University.

How many datasets does Awesome Public Datasets cover?

The list covers datasets across dozens of categories including agriculture, biology, climate, economics, education, finance, government, healthcare, machine learning, and social networks.

Is Awesome Public Datasets free to use?

Yes, the list is completely free and licensed under MIT. The listed datasets are publicly available, though individual datasets may have their own license terms.

How is Awesome Public Datasets maintained?

The list is auto-generated using the apd-core tool, with community contributions reviewed regularly. Status indicators show which datasets are active or may need attention.

Who should use Awesome Public Datasets?

Researchers, data scientists, machine learning engineers, students, and anyone looking for high-quality open data for analysis, model training, or academic research.

How do I contribute a new dataset?

Fork the apd-core repository, add the dataset metadata as a YAML file under the appropriate category, and submit a pull request. The automated review process will verify the link and metadata before merging.

Can I use these datasets in commercial projects?

Most datasets listed are publicly available, but you must check each dataset’s individual license terms before commercial use. Some datasets have restrictions on redistribution or require attribution.
