
Awesome Public Datasets: The Definitive Collection of Open Data for AI and Research

Awesome Public Datasets is a community-curated collection of open datasets, with nearly 60,000 GitHub stars, spanning agriculture, biology, finance, healthcare, and more for AI research and data science.


Every data scientist has faced the same frustration: spending hours searching for a reliable dataset, only to find broken links, outdated information, or unclear licensing. By some survey estimates, data professionals spend an average of 12 hours per week just locating and preparing data for their projects. That is roughly one-third of a standard work week consumed before any analysis begins.

Awesome Public Datasets solves this problem at scale. With over 59,800 GitHub stars and 9,700 forks, it is one of the most trusted community-driven catalogs of open data on the internet. Originally incubated at the OMNILab of Shanghai Jiao Tong University and now stewarded by the BaiYuLan Open AI community (Shanghai’s premier open AI ecosystem), this project has evolved from a simple curated list into a comprehensive data discovery platform.

What makes Awesome Public Datasets truly exceptional is its breadth. The list spans over 35 distinct categories — from agriculture and astronomy to social networks and sports — with each dataset entry carrying status indicators that tell you at a glance whether the source is actively maintained (a green checkmark) or needs attention (a warning icon). The entire listing is auto-generated using the apd-core tool, which means entries are structurally consistent and automatically verified.

Created in November 2014 and actively maintained as of April 2026 (more than eleven years of continuous curation), this resource has powered research papers, startup MVPs, Kaggle competition entries, university coursework, and enterprise proof-of-concepts around the world. Whether you are training a large language model, analyzing climate trends, or building a recommendation engine, this is the first bookmark you should save.

What problem does Awesome Public Datasets solve?

The data discovery landscape is fractured. Government portals, university repositories, cloud provider marketplaces, and domain-specific archives all operate independently. Researchers often resort to forum posts and social media to learn about usable datasets. Awesome Public Datasets consolidates this chaos into a single, navigable index.

Before this project, finding a high-quality dataset might involve visiting a dozen government portals, university repositories, and forum threads. Now most major open datasets are a single click away.

The project’s longevity is a testament to its utility. Since November 2014, the collection has grown from a handful of links to hundreds of verified entries, with the community contributing new datasets and flagging dead links through pull requests. The apd-core automation ensures that contributions meet quality standards before they are merged.

How is the list organized?

The repository uses a straightforward category system with 35+ top-level domains. Each dataset entry in the README includes a direct link, a brief description, and a status icon. The categories themselves are alphabetized, making navigation predictable even as the list grows.

Entries that display the green checkmark icon (✅) have been recently verified and the links are confirmed active. Entries marked with a warning icon (⚠️) may have broken links or need community attention — a transparent system that keeps expectations honest and invites contributions.

What categories does the list cover?

The breadth of the collection is one of its strongest features. Researchers in virtually any domain will find something relevant.

| Category | Description | Example Datasets | Approx. Entries |
| --- | --- | --- | --- |
| Agriculture | Crop yields, soil data, food nutrition | USDA Nutrient Database, Global Crop Yields, PLANTS Database | 15+ |
| Biology | Genomics, proteomics, cancer data | 1000 Genomes, TCGA, ENCODE, GEO, PDB, COSMIC | 45+ |
| Climate + Weather | Atmospheric, oceanic, climate projections | WorldClim, NOAA Models, NASA GIBS, Open-Meteo | 20+ |
| Finance | Market data, economic indicators | FRED, Quandl, Yahoo Finance, NASDAQ, CBOE | 25+ |
| Healthcare | Medical imaging, physiology, epidemiology | PhysioNet, TCIA, WHO Observatory, Medicare Data | 30+ |
| Machine Learning | Benchmark datasets, ML repositories | ImageNet, MNIST, Kaggle, UCI ML Repository | 40+ |
| Natural Language | Text corpora, embeddings, speech | Common Crawl, Wikipedia Dumps, LibriSpeech | 35+ |
| Social Networks | Graph data, user behavior, platform data | Stanford SNAP, Twitter Data, Reddit Datasets | 20+ |
| Government | Open government portals worldwide | Data.gov, EU Open Data Portal, city-level portals | 100+ |
| Transportation | Transit, traffic, mobility | NYC Taxi Trips, GTFS Feeds, OpenFlights | 15+ |

The Government category alone contains over 100 sub-entries, linking to open data portals from cities, states, provinces, and national governments worldwide. If you need demographic, economic, or administrative data, this is the place to start.

Which biology datasets are included?

The Biology section is the deepest category in the collection, subdivided into genomics, functional genomics, and cancer genomics. These are foundational resources that have powered thousands of research papers.

| Dataset | Description | Type | Access |
| --- | --- | --- | --- |
| 1000 Genomes Project | 2,500+ human genome sequences from diverse populations | Genomics | Open |
| The Cancer Genome Atlas (TCGA) | Multi-platform genomic data across 33 cancer types | Cancer Genomics | Controlled |
| ENCODE Project | Functional elements in the human genome | Functional Genomics | Open |
| Gene Expression Omnibus (GEO) | High-throughput gene expression and functional genomics | Functional Genomics | Open |
| COSMIC | Somatic mutation information in human cancer | Cancer Genomics | Open |
| Protein Data Bank (PDB) | 3D structures of biological macromolecules | Structural Biology | Open |
| PubChem | Information on chemical molecules and bioactivities | Chemoinformatics | Open |
| Human Microbiome Project (HMP) | Microbial communities across body sites | Metagenomics | Open |

Many of these datasets are too large to download in their entirety — the 1000 Genomes dataset alone exceeds 200 terabytes. Researchers typically use programmatic access (via APIs or cloud mirroring) to work with subsets relevant to their research.
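
Because full downloads are often impractical, a common pattern is to fetch only the byte ranges you need over HTTP. The sketch below is a generic illustration of that pattern, not the actual access interface of any listed dataset; the URL and chunk size would be whatever your data source supports.

```python
from urllib.request import Request, urlopen

def byte_ranges(total_size: int, chunk_size: int):
    """Split a remote file of `total_size` bytes into inclusive (start, end) ranges."""
    return [
        (start, min(start + chunk_size, total_size) - 1)
        for start in range(0, total_size, chunk_size)
    ]

def fetch_range(url: str, start: int, end: int) -> bytes:
    """Fetch one byte range via an HTTP Range request (server must support it)."""
    req = Request(url, headers={"Range": f"bytes={start}-{end}"})
    with urlopen(req) as resp:  # a compliant server answers 206 Partial Content
        return resp.read()

# Example: plan a 1 GiB download as 256 MiB chunks (pure planning, no network).
ranges = byte_ranges(1 << 30, 256 << 20)
```

Each `(start, end)` pair can then be fetched independently, retried on failure, or distributed across workers.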

Which machine learning datasets are included?

The machine learning category links to the most widely used benchmarks in the field. Whether you are working on computer vision, natural language processing, or tabular data, these datasets are industry standards.

| Dataset | Domain | Typical Use | Scale |
| --- | --- | --- | --- |
| ImageNet | Computer Vision | Image classification, object detection | 14M+ images, 22K categories |
| MNIST | Computer Vision | Handwritten digit recognition | 70K grayscale images |
| Common Crawl | Web Text | LLM pre-training, NLP corpora | Billions of web pages |
| LibriSpeech | Speech | ASR model training | 1,000 hours of speech |
| UCI ML Repository | Mixed | Benchmarking algorithms | 600+ datasets |
| Kaggle Datasets | Mixed | Competitions and exploration | 100,000+ datasets |

The presence of both foundational datasets (like MNIST) and massive-scale corpora (like Common Crawl) means the list serves everyone from students learning the basics to researchers training billion-parameter models.

How does the apd-core tool maintain data quality?

The apd-core repository is the engine behind Awesome Public Datasets. It stores all dataset metadata as structured YAML files, each containing the dataset name, URL, description, category tags, and verification history.

This structured format enables several automated quality checks:

  • Link verification: Scripts test whether dataset URLs resolve correctly
  • Metadata completeness: Each entry must include required fields before it is accepted
  • Category consistency: Entries are classified under the correct domain heading
  • License awareness: Dataset license terms can be tracked alongside the listing

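In a minimal sketch, assuming illustrative field names rather than apd-core's real schema, the first two checks might look like this:

```python
# Hypothetical required fields for a dataset entry (not apd-core's actual schema).
REQUIRED_FIELDS = {"name", "url", "description", "category"}

def missing_fields(entry: dict) -> set:
    """Return the required metadata fields absent from a dataset entry."""
    return REQUIRED_FIELDS - entry.keys()

def link_status_icon(http_status: int) -> str:
    """Map an HTTP status code from a link check to the list's status indicator."""
    return "✅" if 200 <= http_status < 400 else "⚠️"

entry = {
    "name": "1000 Genomes Project",
    "url": "https://www.internationalgenome.org/",
    "description": "Human genome sequences from diverse populations",
    "category": "Biology",
}
```

A CI pipeline would run checks like these over every YAML file in a pull request and reject submissions with missing fields or dead links.
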
When you find a new dataset to contribute, you do not edit the README directly. Instead, you submit a pull request to apd-core with the new YAML entry. The automated pipeline validates your submission and, once merged, regenerates the README. This separation of data from presentation ensures the list remains consistent and machine-readable.
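
The regeneration step can be pictured as a pure function from a metadata entry to one README line. The output format below is illustrative, not the list's exact markup:

```python
def render_readme_line(entry: dict, verified: bool) -> str:
    """Render one README bullet from structured metadata (illustrative format)."""
    icon = "✅" if verified else "⚠️"
    return f"- {icon} [{entry['name']}]({entry['url']}) - {entry['description']}"

line = render_readme_line(
    {"name": "MNIST", "url": "http://yann.lecun.com/exdb/mnist/",
     "description": "Handwritten digit images"},
    verified=True,
)
# → "- ✅ [MNIST](http://yann.lecun.com/exdb/mnist/) - Handwritten digit images"
```

Because the README is a deterministic projection of the YAML files, contributors never hand-edit it and formatting drift cannot creep in.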

Why has this project lasted over a decade?

Eleven years is an eternity for an open source project. Most curated lists atrophy when their maintainers move on to other interests. Awesome Public Datasets has thrived for several reasons:

The first is its clear scope. By limiting itself to publicly available datasets and organizing them by topic rather than format or size, the project avoids scope creep. It knows exactly what it is: a curated index, not a data marketplace, not a storage platform, not a community forum.

The second is automation. The apd-core toolchain means that adding a new dataset is as simple as writing a few lines of YAML. The maintainers do not need to manually format the README or check links. The machine handles the grunt work, and humans handle the curation judgment.

The third is community stewardship. The transition from OMNILab at Shanghai Jiao Tong University to the BaiYuLan Open AI community ensured continuity. The project has institutional backing rather than relying on a single individual’s volunteer time.

What is the future of Awesome Public Datasets?

As we move deeper into 2026, several trends are shaping the project’s evolution. The rise of large language models has created unprecedented demand for high-quality text corpora — datasets like Common Crawl, C4, and The Pile are vital for pre-training. The project will likely expand its NLP and multimodal dataset sections accordingly.

Another trend is dataset versioning and provenance tracking. As datasets are filtered, deduplicated, and transformed for specific use cases, knowing the provenance chain has become essential for reproducibility. The YAML metadata in apd-core could naturally extend to track these relationships.
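
As a purely speculative sketch with hypothetical field names, provenance-aware metadata could look like this:

```python
from dataclasses import dataclass, field

@dataclass
class DatasetEntry:
    """Hypothetical entry shape with provenance fields (not apd-core's schema)."""
    name: str
    url: str
    derived_from: list = field(default_factory=list)   # upstream dataset names
    transformations: list = field(default_factory=list)  # e.g. "deduplication"

# C4 is a cleaned derivative of Common Crawl; the URL here is a placeholder.
c4 = DatasetEntry(
    name="C4",
    url="https://example.org/c4",
    derived_from=["Common Crawl"],
    transformations=["language filtering", "deduplication"],
)
```

Walking `derived_from` links upstream would let a reader trace any filtered corpus back to its raw source.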

Finally, the spatial and climate data categories will continue to grow as planetary-scale environmental monitoring generates ever-larger streams of open Earth observation data. Awesome Public Datasets is well-positioned to remain the front door to these resources.

Frequently Asked Questions

What is Awesome Public Datasets?

Awesome Public Datasets is a topic-centric curated list of high-quality open datasets in public domains, maintained by the community and originally incubated at Shanghai Jiao Tong University.

How many datasets does Awesome Public Datasets cover?

The list covers datasets across dozens of categories including agriculture, biology, climate, economics, education, finance, government, healthcare, machine learning, and social networks.

Is Awesome Public Datasets free to use?

Yes, the list is completely free and licensed under MIT. The listed datasets are publicly available, though individual datasets may have their own license terms.

How is Awesome Public Datasets maintained?

The list is auto-generated using the apd-core tool, with community contributions reviewed regularly. Status indicators show which datasets are active or may need attention.

Who should use Awesome Public Datasets?

Researchers, data scientists, machine learning engineers, students, and anyone looking for high-quality open data for analysis, model training, or academic research.

How do I contribute a new dataset?

Fork the apd-core repository, add the dataset metadata as a YAML file under the appropriate category, and submit a pull request. The automated review process will verify the link and metadata before merging.

Can I use these datasets in commercial projects?

Most datasets listed are publicly available, but you must check each dataset’s individual license terms before commercial use. Some datasets have restrictions on redistribution or require attribution.
