Leo's dev blog

Crawling Goodreads books data in Node.js


Since Goodreads no longer supports fetching a user's book data through its API, I decided to scrape that data from the user's RSS feed and parse it in Node.js.

Goodreads API

The idea here is to use the rss-parser package to parse the RSS feed and extract the book data.

goodreads.ts
import Parser from 'rss-parser'
import type { GoodreadsBook } from '~/types'
 
let parser = new Parser<{ [key: string]: any }, GoodreadsBook>({
  customFields: {
    // Define all the custom fields you want to extract from the RSS feed
    // Here I'm listing all the available fields from the Goodreads RSS feed
    item: [
      'guid',
      'pubDate',
      'title',
      'link',
      'book_id',
      'book_image_url',
      'book_small_image_url',
      'book_medium_image_url',
      'book_large_image_url',
      'book_description',
      'author_name',
      'isbn',
      'user_name',
      'user_rating',
      'user_read_at',
      'user_date_added',
      'user_date_created',
      'user_shelves',
      'user_review',
      'average_rating',
      'book_published',
    ],
  },
})

Then you can fetch the data from the RSS feed using the parser object, and process it as needed.

goodreads.ts
import { writeFileSync } from 'node:fs'

const GOODREADS_RSS_FEED_URL = '<YOUR_GOODREADS_RSS_FEED_URL>'

export async function fetchGoodreadsBooks() {
  if (GOODREADS_RSS_FEED_URL) {
    try {
      let data = await parser.parseURL(GOODREADS_RSS_FEED_URL)
      // All the books data will be stored in the `data.items` array
      // Use the parsed data as needed, for example, you can write it to a JSON file
      // (the `./json` directory must already exist):
      writeFileSync(`./json/books.json`, JSON.stringify(data.items))
    } catch (error) {
      // `error` is `unknown` under strict TypeScript, so narrow it before reading `.message`
      let message = error instanceof Error ? error.message : String(error)
      console.error(`Error fetching the Goodreads RSS feed: ${message}`)
    }
  } else {
    console.log('📚 No Goodreads RSS feed found.')
  }
}
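One thing to watch with the JSON step: `writeFileSync` throws if the target directory does not exist. Here is a minimal sketch of a safer write, assuming you want the directory created on demand. `saveBooksJson` and `loadBooksJson` are hypothetical helper names, not part of rss-parser.

```typescript
import { mkdirSync, readFileSync, writeFileSync } from 'node:fs'
import { dirname } from 'node:path'

// Hypothetical helper: writes a JSON-serializable list of books to `filePath`,
// creating any missing parent directories first.
function saveBooksJson(items: unknown[], filePath: string): void {
  mkdirSync(dirname(filePath), { recursive: true })
  writeFileSync(filePath, JSON.stringify(items, null, 2))
}

// Hypothetical helper: reads the saved file back into memory.
function loadBooksJson(filePath: string): unknown[] {
  return JSON.parse(readFileSync(filePath, 'utf8'))
}
```

You could then call `saveBooksJson(data.items, './json/books.json')` in place of the bare `writeFileSync` call.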

NOTE

You can get a Goodreads user's RSS feed URL by opening their profile, navigating to the bookshelf page, and copying the RSS feed link from there. This is my bookshelf page, for example: https://www.goodreads.com/review/list/179720035
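If you would rather build the feed URL from a user ID, here is a rough sketch. It assumes the feed lives under a `list_rss` path and accepts a `shelf` query parameter, which matches the feed URLs I have seen but is not something Goodreads documents. `buildGoodreadsRssUrl` is a hypothetical helper, so compare its output against the link you copied from the bookshelf page.

```typescript
// Hypothetical helper: builds an RSS feed URL from a Goodreads user ID.
// ASSUMPTION: the `/review/list_rss/<id>` path and the `shelf` parameter are
// inferred from commonly seen feed URLs; verify them on your own bookshelf page.
function buildGoodreadsRssUrl(userId: string, shelf: string = 'read'): string {
  const url = new URL(
    `https://www.goodreads.com/review/list_rss/${encodeURIComponent(userId)}`
  )
  url.searchParams.set('shelf', shelf)
  return url.toString()
}
```

For example, `buildGoodreadsRssUrl('179720035', 'to-read')` produces a URL ending in `/review/list_rss/179720035?shelf=to-read`.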

Now that you have the data, you will probably want to clean it up before storing it or using it in your application, since the feed delivers raw text with HTML tags and stray whitespace.

goodreads.ts
let data = await parser.parseURL(GOODREADS_RSS_FEED_URL)
// Loop through the `data.items` array to prettify the data
for (let book of data.items) {
  book.content = book.content.replace(/\n/g, '').replace(/\s\s+/g, ' ') // Remove line breaks and collapse repeated whitespace
  book.book_description = book.book_description
    .replace(/<[^>]*(>|$)/g, '') // Remove HTML tags
    .replace(/\s\s+/g, ' ') // Replace multiple spaces with a single space
    .replace(/^["“”]|["“”]$/g, '') // Remove leading and trailing quotation marks
    .replace(/\.([a-zA-Z0-9])/g, '. $1') // Add a space after a period
}
// Use the parsed and prettified data as needed...
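The same cleanup can be pulled into a small pure function, which is easier to test on its own. This is a sketch, not the exact code I run: `cleanDescription` is a hypothetical name, it tolerates a missing description, and it only adds the space after a period when a letter follows, so decimals like `4.5` stay intact.

```typescript
// Hypothetical helper: normalizes a raw `book_description` string from the feed.
function cleanDescription(raw: string | undefined): string {
  if (!raw) return ''
  return raw
    .replace(/<[^>]*(>|$)/g, '') // Remove HTML tags
    .replace(/\s+/g, ' ') // Collapse any whitespace run (including line breaks) into one space
    .trim()
    .replace(/^["“”]|["“”]$/g, '') // Remove one leading and one trailing quotation mark
    .replace(/\.([A-Za-z])/g, '. $1') // Add a space after a period that is glued to a letter
}
```

Note that abbreviations such as `e.g.` will still gain a space, so treat the last step as a heuristic.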

The GoodreadsBook type is defined here:

types.ts
export type GoodreadsBook = {
  guid: string
  pubDate: string
  title: string
  link: string
  book_id: string
  book_image_url: string
  book_small_image_url: string
  book_medium_image_url: string
  book_large_image_url: string
  book_description: string
  author_name: string
  isbn: string
  user_name: string
  user_rating: string
  user_read_at: string
  user_date_added: string
  user_date_created: string
  user_shelves: string
  user_review: string
  average_rating: string // custom fields come out of the XML as strings; convert to a number yourself
  book_published: string
  content: string
}
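Since the feed values arrive as text, you may want to convert a few of them before using them. Here is a minimal sketch under some assumptions: `RawBookFields`, `TypedBookFields` and `toTypedFields` are hypothetical names, and I am assuming `user_shelves` is a comma-separated list and that an empty `user_read_at` means the book has no read date.

```typescript
// Hypothetical subset of the raw feed fields, all plain strings.
type RawBookFields = {
  user_rating: string
  average_rating: string
  user_read_at: string
  user_shelves: string
}

// Hypothetical typed view of the same fields.
type TypedBookFields = {
  userRating: number | null
  averageRating: number | null
  readAt: Date | null
  shelves: string[]
}

// Parse a numeric string, returning null when it is not a number.
function toNumber(value: string): number | null {
  const parsed = Number.parseFloat(value)
  return Number.isNaN(parsed) ? null : parsed
}

// Parse a date string, returning null when it is empty or unparseable.
function toDate(value: string): Date | null {
  if (value.trim() === '') return null
  const parsed = new Date(value)
  return Number.isNaN(parsed.getTime()) ? null : parsed
}

function toTypedFields(book: RawBookFields): TypedBookFields {
  return {
    userRating: toNumber(book.user_rating),
    averageRating: toNumber(book.average_rating),
    readAt: toDate(book.user_read_at),
    // ASSUMPTION: shelves are separated by commas
    shelves: book.user_shelves
      .split(',')
      .map((shelf) => shelf.trim())
      .filter((shelf) => shelf.length > 0),
  }
}
```

Returning `null` instead of `NaN` or an invalid `Date` makes missing values easy to check for downstream.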

Caveat

The Goodreads RSS feed does not update instantly after you change your books on Goodreads. You might need to wait a few hours before the feed reflects the latest data.
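Because the feed can lag by hours, refetching it on every build or request buys you little. A minimal sketch of a freshness check follows. `shouldRefetch` is a hypothetical helper, and the six-hour default is an arbitrary choice of mine, not a documented Goodreads interval.

```typescript
// ASSUMPTION: six hours is an arbitrary default; tune it to your own needs.
const DEFAULT_MIN_INTERVAL_MS = 6 * 60 * 60 * 1000

// Hypothetical helper: returns true when the feed has never been fetched
// or when at least `minIntervalMs` has passed since the last fetch.
function shouldRefetch(
  lastFetchedAt: Date | null,
  now: Date = new Date(),
  minIntervalMs: number = DEFAULT_MIN_INTERVAL_MS
): boolean {
  if (lastFetchedAt === null) return true
  return now.getTime() - lastFetchedAt.getTime() >= minIntervalMs
}
```

You could store the last fetch time next to the JSON file and skip the network call whenever `shouldRefetch` returns `false`.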

Happy crawling!