Interesting datasets can make personal machine learning projects more fun and exciting. Here are some of my favorite places to go looking for datasets to hone my data science and ML skills.
Data Is Plural
Data Is Plural is my favorite place to find novel datasets on interesting topics. The site is managed by Jeremy Singer-Vine.
A new edition is published every week (250+ and counting!) and describes a handful of datasets and what makes them interesting, usually five or so per edition. One nice aspect is that some datasets are fairly raw, depending on the source, so you can practice cleaning and wrangling data before modeling.
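As a rough illustration of that kind of pre-modeling practice, here is a minimal sketch using only Python's standard library. The sample rows, the column names, and the clean_rows helper are all made up for illustration; they do not come from any actual Data Is Plural entry.

```python
import csv
import io

# Hypothetical raw export: stray whitespace, thousands separators, "N/A" markers.
RAW_CSV = """city, population ,median_income
 Springfield ,"167,000",54000
Shelbyville,N/A,"48,500"
"""

# Spellings we treat as "no value" (an assumption; real sources vary).
MISSING = {"", "n/a", "na", "null"}


def to_number(text):
    """Return an int parsed from text, or None if the value is marked missing."""
    cleaned = text.strip().replace(",", "")
    if cleaned.lower() in MISSING:
        return None
    return int(cleaned)


def clean_rows(raw_text):
    """Parse CSV text into dicts with trimmed headers and typed values."""
    reader = csv.DictReader(io.StringIO(raw_text))
    rows = []
    for record in reader:
        # Header names can carry stray spaces too, so strip the keys as well.
        record = {key.strip(): value for key, value in record.items()}
        rows.append({
            "city": record["city"].strip(),
            "population": to_number(record["population"]),
            "median_income": to_number(record["median_income"]),
        })
    return rows
```

Calling clean_rows(RAW_CSV) returns one dictionary per row, with numbers converted to integers and missing values set to None. Real raw data will usually need more rules than this, which is exactly what makes it good practice.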
Ease of Use: Medium
Interesting/Novel Datasets: High
Kaggle
Kaggle is known for machine learning competitions. However, one interesting aspect is the datasets feature. Users can post datasets and collaborate through tasks and discussions around them.
Datasets vary in how clean they are, but users can earn medals for high-quality uploads. As a result, highly rated datasets tend to be good quality, with documented data types and complete descriptions. Because the ranking is crowdsourced, only so many datasets are popular or trending at any one time. New ones are added all the time, but you may have to dig a bit more to find novel datasets like those in Data Is Plural. For this reason, I rank the novelty factor medium, although novel datasets can certainly be found.
Ease of Use: High
Interesting/Novel Datasets: Medium
Seaborn Datasets
Seaborn is a popular data visualization library in Python. One function, load_dataset, lets you load some out-of-the-box datasets with a single command. You can find additional information on the datasets by looking through the GitHub page.
This makes for an easy experience for the data scientist: a pandas DataFrame with the dataset loaded is just one function call away. The drawback is that many of these datasets are standard and widely used, so you may not find many brand-new, novel datasets here.
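A minimal sketch of that one-call workflow follows. sns.load_dataset and sns.get_dataset_names are real Seaborn functions, but the wrapper names (load_example, list_examples, shape_summary) and the choice of the "penguins" dataset are my own, for illustration only. Seaborn downloads the data the first time you ask for it, so this assumes an internet connection.

```python
def load_example(name="penguins"):
    """Fetch one of Seaborn's example datasets as a pandas DataFrame."""
    # Imported here so the helpers below can be used without Seaborn installed.
    import seaborn as sns

    return sns.load_dataset(name)


def list_examples():
    """Return the names of the example datasets Seaborn can load."""
    import seaborn as sns

    return sns.get_dataset_names()


def shape_summary(frame):
    """Describe the size of any object with a pandas-style .shape attribute."""
    rows, cols = frame.shape
    return f"{rows} rows x {cols} columns"
```

From there, shape_summary(load_example()) gives a quick sanity check on what was loaded, and list_examples() shows what else is available.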
I like to use Seaborn datasets to test out automation or new machine learning models. They provide a consistent baseline that can be used to compare results over time or be incorporated into automated tests when building out libraries.
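One way this baseline idea could look in a test suite is sketched below. It assumes the "penguins" dataset and its species column; the helper names and the majority-class baseline are illustrative choices of mine, not something Seaborn prescribes.

```python
from collections import Counter


def load_labels(dataset="penguins", column="species"):
    """Return one column of a Seaborn example dataset as a plain list."""
    # Imported lazily; the first call needs network access to fetch the data.
    import seaborn as sns

    frame = sns.load_dataset(dataset)
    return frame[column].dropna().tolist()


def majority_class_accuracy(labels):
    """Accuracy of always predicting the most common label."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


def beats_baseline(model_accuracy, labels, margin=0.0):
    """True if model_accuracy exceeds the majority-class baseline by more than margin."""
    return model_accuracy > majority_class_accuracy(labels) + margin
```

In an automated test, something like assert beats_baseline(score, load_labels()) would fail whenever a change drops the model to or below the trivial baseline, where score is the accuracy of whatever model is under test. Because the dataset does not change, the baseline stays the same from run to run.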
Ease of Use: Highest
Interesting/Novel Datasets: Low
data.world Open Datasets
data.world has an open data feature where users post datasets. This is similar to Kaggle, and I’ve found some good data there in the past when I’ve had a topic in mind.
The medal and reward structure isn’t as robust as Kaggle’s, so datasets can be a bit lighter on documentation and discussion. However, there are certainly some interesting and novel datasets to be found.
Ease of Use: Medium
Interesting/Novel Datasets: Medium
Summary
These are my favorites, but great datasets can be found everywhere. Do you have any favorites I’ve missed? Put them in the comments below.