As part of my ongoing quest to learn AI, I recently joined Kaggle, the Google-owned machine learning competition site. While I’m nowhere near ready to enter the contests hosted there, I thought I might contribute an original dataset for other coders to use.
The question was what kind of data to gather and compile. I knew I didn’t want to make another hotel bookings dataset, as those have already been done to death. I thought of doing something related to trilobites or exoplanets, but ultimately decided to start simpler.
The Internet Speculative Fiction Database is a project that aims to catalog every published literary work in the science fiction and fantasy genres. From what I can tell, they’ve basically succeeded in that task. As someone drawn more to the possible than the actual, I’ve been a science fiction fan all my life, and making a Kaggle dataset from ISFDB data seemed like a great idea.
Unfortunately, ISFDB doesn’t provide a convenient API to get the data, so I knew I’d have to write a script to scrape what I needed from the website. The site uses an unfamiliar cataloging scheme and has been gradually added to since 1995, so I knew this wasn’t going to be a pleasant process. Indeed, the final dataset I ended up with is fairly basic: just book title, author, publication date, and type (e.g. novel, short story, or anthology).
Several entries had interesting possible data categories such as thematic tags (e.g. lost colony), but as only some of the titles had these, including them in the final dataset would have led to a ton of missing values.
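To make the trade-off concrete, here is a minimal Python sketch, not the script I actually used. The three rows are hand-typed for illustration, the tag values are made up, and the column names (title, author, date, type, tags) are assumptions rather than the real dataset headers. It measures how sparse a column is and then writes only the four core columns to CSV.

```python
import csv
import io

# Hand-typed illustration rows. The "tags" values are hypothetical and
# were not taken from ISFDB; column names are assumptions as well.
records = [
    {"title": "Dune", "author": "Frank Herbert", "date": "1965",
     "type": "novel", "tags": "desert planet"},
    {"title": "Neuromancer", "author": "William Gibson", "date": "1984",
     "type": "novel", "tags": ""},
    {"title": "Nightfall", "author": "Isaac Asimov", "date": "1941",
     "type": "short story", "tags": ""},
]

CORE_COLUMNS = ["title", "author", "date", "type"]


def missing_ratio(rows, column):
    """Return the fraction of rows whose value for `column` is empty or absent."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if not row.get(column))
    return missing / len(rows)


def to_csv(rows, columns):
    """Serialize only the chosen columns to CSV text, dropping any others."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=columns, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(f"tags missing in {missing_ratio(records, 'tags'):.0%} of rows")
    print(to_csv(records, CORE_COLUMNS))
```

With a sparse column like this, most rows would carry an empty cell, which is why I kept only the fields that nearly every entry has.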
Below is the script I wrote to scrape the site and build the dataset. When it finishes, you’ll have a CSV file with the metadata of about 125,000 books. It takes several hours to run on a basic Linux server:
Below is how the dataset looks in Calc from LibreOffice:
Update: I’ve coded a Reddit bot that uses the dataset. It watches for comments containing mentions of books in the dataset, then searches YouTube for audiobook versions of the mentioned book.
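The matching step can be sketched roughly as follows. This is an illustration under my own assumptions, not the bot's actual code: the function names build_title_pattern and find_mentions are hypothetical, the titles are passed in as a plain list, and the Reddit and YouTube parts are left out entirely.

```python
import re


def build_title_pattern(titles):
    """Compile one case-insensitive pattern matching any title as a whole phrase.

    Longer titles come first so "Dune Messiah" wins over "Dune".
    """
    cleaned = sorted(
        {title.strip() for title in titles if title.strip()},
        key=len,
        reverse=True,
    )
    if not cleaned:
        return None
    alternation = "|".join(re.escape(title) for title in cleaned)
    # The lookarounds stop a title from matching inside a longer word.
    return re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)", re.IGNORECASE)


def find_mentions(comment, pattern):
    """Return the distinct titles found in `comment`, lowercased, in order."""
    if pattern is None:
        return []
    found = []
    for match in pattern.finditer(comment):
        key = match.group(0).lower()
        if key not in found:
            found.append(key)
    return found


if __name__ == "__main__":
    pattern = build_title_pattern(["Dune", "Neuromancer", "Dune Messiah"])
    print(find_mentions("I loved Dune Messiah more than dune, honestly.", pattern))
```

Plain phrase matching like this is likely to produce false positives for short or common titles, so a real bot would probably need extra filtering, such as a minimum title length or a check for the author's name nearby.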
2020-04-02 00:09 +0000