Couldreads: Goodreads For Books That Could Have Been


  • Thu 11 November 2021

Last year I wanted an excuse to play with GPT-2, so I decided to generate fake book titles and descriptions for a parody version of Goodreads: Couldreads.

Sourcing the Data

Rather than scrape a large amount of book metadata from the web, I decided to obtain actual ebooks with a Usenet scraper, because that was something I hadn't done before. Ebooks are small, so I could download thousands of them quickly and delete them after extracting the metadata needed to prepare a model.

This code finds relevant content from Usenet based on message subjects and downloads the message parts:

import os
import psycopg2
import nntplib

nntp = nntplib.NNTP('nntp-hostname')
resp, count, first, last, name = nntp.group('alt.binaries.e-books')

# inserting headers into a database isn't strictly necessary, but while developing
# this having the table was helpful to explore/query the header data

con = psycopg2.connect("host=localhost dbname=ebooks")
cursor = con.cursor()

# retrieve 40 chunks of 1000 message headers at a time
for n in range(40):

    message_range = (last - ((n+1)*1000), last - (n*1000))

    resp, overviews = nntp.over(message_range)

    for _, over in overviews:

        keys = ["subject", "from", "message-id", "references", ":bytes", ":lines", "xref", "date"]

        args = [over[k].encode('utf-8', errors='replace').decode('utf-8') for k in keys]

        args[4] = int(args[4] or 0)
        args[5] = int(args[5] or 0)

        sql = "insert into nntp_headers values(%s,%s,%s,%s,%s,%s,%s,%s)"

        cursor.execute(sql, args)


con.commit()

sql = """SELECT message_id FROM nntp_headers 
            WHERE subject LIKE '%.epub%' or subject LIKE '%.mobi%' 
            ORDER BY 1 DESC"""

cursor.execute(sql)

# make sure the output directory exists before message bodies are written into it
os.makedirs('nntp-message-bodies', exist_ok=True)

for n, row in enumerate(cursor):
    nntp.body(message_spec=row[0], file='nntp-message-bodies/nntpbody-%02d' % n)

That script downloaded the raw messages, still encoded as ASCII for Usenet. Decoding them back to binary and assembling them into files wasn't a problem I needed to solve myself, thanks to the convenient UUDeview utility.

mkdir ./nntp-ebooks

uudeview -i -c -o -e .rar -p ./nntp-ebooks nntp-message-bodies/*

Now ./nntp-ebooks contains many .mobi and .epub ebooks, the exact number depending on how many headers and messages were downloaded from Usenet. All I want out of these files is a few pieces of metadata: title, author, genre, and most importantly the paragraph or two with the book synopsis. This is the data that will train/finetune GPT-2.

Preparing the Training Input

The simplest and most reliable way I found to get this metadata was to pull a specific XML file out of each ebook. Both file types are archive formats, and fortunately both include a metadata file in the same .opf format. The standard unzip utility extracts files from .epub archives, and mobiunpack does the same for .mobi files.

#!/bin/bash

# create a directory of .opf files extracted from .epub and .mobi
# files. name them after a hash of the source path to avoid overwrites.

mkdir -p ./opf

find ./nntp-ebooks -name "*.epub" |sort |while read line
do
    hash=`echo "$line" |md5sum |cut -d' ' -f1`
    unzip -q -d ./outdir "$line" "*.opf" || mkdir -p outdir
    find outdir -name "*.opf" |while read opf
    do
        mv "$opf" "./opf/${hash}.opf"
    done
    rm -rf outdir
done

find ./nntp-ebooks -name "*.mobi" |sort |while read line
do
    hash=`echo "$line" |md5sum |cut -d' ' -f1`
    mobiunpack "$line" outdir || mkdir -p outdir
    find outdir -name content.opf |while read opf
    do
        mv "$opf" "./opf/${hash}.opf"
    done
    rm -rf outdir
done

The .opf files have the metadata fields I need, which I pulled out of the XML and concatenated into a single text file like this:

========================================
TITLE: Blood risk
AUTHOR: Dean Koontz
SUBJECT: Fiction / Mystery / English Fiction
DESCRIPTION: Four men waited on the narrow mountain road for the Cadillac carrying 341,890, the biweekly taking of a Mafia cell. Four men who had never failed in a heist before, on their fourteenth operation in three years: Shirillo, watching in the long grass; Pete Harris with a submachine gun; Bachman in the getaway car; and Mike Tucker, art dealer and professional thief; the perfectionist. As the big Cadillac slewed round the bend, none of them realized that this time Tucker had made a fatal miscalcuation that would plunge them all into a blood war against the Mafia.
========================================
TITLE: Blood of Amber
AUTHOR: Roger Zelazny
SUBJECT: Fiction / Fantasy / General / Science Fiction
DESCRIPTION: Merle Corey, hero of Trumps of Doom (1985), escapes from prison with the help of a woman who has many shapes. Merle Corey escapes from prison into Amber, a world of wonders and confusions where friends and foes are sometimes indistinguishable, where a man is out to kill him and a woman to help him. This is the seventh Amber novel.
========================================
TITLE: Barrayar
AUTHOR: Lois McMaster Bujold
SUBJECT: Fiction / Science Fiction & Fantasy / General / Science Fiction
DESCRIPTION: Following her marriage to the notorious "Butcher of Komarr" - Lord Aral Vorkosigan - Captain Cordelia Naismith has become an outcast in her own world. Sick of combat and betrayal, she prepares to settle down to a quiet life on Barrayar. At least, that was the plan...
========================================

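Pulling those fields out of the .opf files is plain XML parsing: .opf metadata uses Dublin Core elements, so the interesting values live in dc:title, dc:creator, dc:subject, and dc:description. A rough sketch of that step, with the output filename and the tag-stripping details as illustrative choices rather than the exact script I ran:

import glob
import re
import xml.etree.ElementTree as ET

DC = '{http://purl.org/dc/elements/1.1/}'

def opf_fields(path):
    # pull title/author/subject/description out of a single .opf file
    root = ET.parse(path).getroot()

    def text(tag):
        el = root.find('.//' + DC + tag)
        return (el.text or '').strip() if el is not None else ''

    subjects = [s.text.strip() for s in root.findall('.//' + DC + 'subject') if s.text]

    # descriptions frequently contain embedded HTML, strip it crudely
    description = re.sub(r'<[^>]+>', ' ', text('description'))
    description = re.sub(r'\s+', ' ', description).strip()

    return text('title'), text('creator'), ' / '.join(subjects), description

with open('model-input.txt', 'w') as out:
    for path in sorted(glob.glob('./opf/*.opf')):
        try:
            title, author, subject, description = opf_fields(path)
        except ET.ParseError:
            continue
        if not title or not description:
            continue
        out.write('=' * 40 + '\n')
        out.write('TITLE: %s\n' % title)
        out.write('AUTHOR: %s\n' % author)
        out.write('SUBJECT: %s\n' % subject)
        out.write('DESCRIPTION: %s\n' % description)
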
Perhaps the least interesting yet most impactful effort towards getting good results was applying mundane filters and transformations to this model input. Removing edge cases like descriptions that are too short or too long. Removing meta blurbs like "From the Trade Paperback edition" or "All rights reserved". Filtering out non-English content. Discovering that there are hundreds of Doctor Who ebooks in the input and realizing that you don't want your output to be dominated by Doctor Who.

# there's like 10x the doctor who files that anyone would need
if 'doctor who' in subject.lower() and random.random() > 0.1:
    continue

(Nothing against Doctor Who.)

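The rest of these were the same kind of small, boring checks applied while building the training file. Roughly like the following sketch, where the length cutoffs are illustrative and langdetect stands in for the language check:

import re

from langdetect import detect, LangDetectException

BLURBS = ('From the Trade Paperback edition', 'All rights reserved')

def clean_description(description):
    # strip publisher boilerplate and collapse whitespace
    for blurb in BLURBS:
        description = description.replace(blurb, '')
    return re.sub(r'\s+', ' ', description).strip()

def keep_description(description):
    # drop descriptions that are too short or too long
    # (the cutoffs here are illustrative, not the exact values I used)
    if not 200 <= len(description) <= 2000:
        return False

    # skip anything that isn't detected as English
    try:
        return detect(description) == 'en'
    except LangDetectException:
        return False
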
The hope was that GPT-2 would pick up on this format and generate not only similar content but also the same structure, with formatted TITLE/AUTHOR fields that could be parsed just as easily.

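For what it's worth, pulling the fields back out of a block in this format is a one-regex job, which is part of what made the round trip appealing. A sketch, assuming the output keeps the TITLE/AUTHOR/SUBJECT/DESCRIPTION layout:

import re

RECORD_RE = re.compile(
    r'TITLE:\s*(?P<title>.*?)\n'
    r'AUTHOR:\s*(?P<author>.*?)\n'
    r'SUBJECT:\s*(?P<subject>.*?)\n'
    r'DESCRIPTION:\s*(?P<description>.*)',
    re.DOTALL)

def parse_record(block):
    # return a dict of fields, or None if a generated block doesn't follow the format
    match = RECORD_RE.search(block)
    return match.groupdict() if match else None
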
Of the 35,000 ebook files I started with, 16,000 ended up containing a usable title and description. I concatenated these into a 20 MB text file and used that as input to finetune a model with gpt-2-simple. Training took about 50 minutes and cost $6 (spot priced) on one p3.8xlarge EC2 instance.
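
The finetuning itself boils down to a few gpt-2-simple calls, roughly like this; the base model size, step count, and run name here are illustrative, and model-input.txt is the stand-in name for the concatenated training file from earlier:

import gpt_2_simple as gpt2

# fetch a base GPT-2 checkpoint to finetune from (model size is illustrative)
gpt2.download_gpt2(model_name="355M")

sess = gpt2.start_tf_sess()

# finetune on the concatenated TITLE/AUTHOR/SUBJECT/DESCRIPTION text file
gpt2.finetune(sess,
              dataset="model-input.txt",
              model_name="355M",
              steps=1000,
              run_name="couldreads")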

Generating Output

Once the model was ready I started generating output with it and quickly found it successful: the output was structured as neatly as the input. At a glance it contained some amusing fake book descriptions and some that were unusable. I figured the worst output could be filtered out, as long as I started with a big enough pool of candidates to account for that, so I generated enough output to yield 20,000 samples. Creating a batch of 80-100 samples on a p3.2xlarge instance cost about 40 cents with spot pricing.
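
Each batch came from a generate call along these lines; the prefix, length, temperature, and sample counts shown are examples of the runtime parameters that varied between batches, not a fixed recipe:

import gpt_2_simple as gpt2

sess = gpt2.start_tf_sess()
gpt2.load_gpt2(sess, run_name="couldreads")

# generate a batch of candidate records; the separator line from the
# training data makes a convenient prefix to start each sample on
samples = gpt2.generate(sess,
                        run_name="couldreads",
                        prefix="=" * 40,
                        length=300,
                        temperature=0.8,
                        nsamples=100,
                        batch_size=20,
                        return_as_list=True)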

The big weakness with GPT-2 was that it would generate loops of repeated words and phrases. Algorithmically throwing away samples with repeated n-grams saved a lot of time, since I didn't plan to manually curate the output.

Different batches of output samples were created from different iterations of the model and different runtime parameters passed to GPT-2. Each output sample was also tagged with the score from the function below, which increases with more occurrences of repeated 3-grams, 4-grams, and 5-grams.

from collections import Counter

def repeated_ngram_score(text):

    words = text.split()

    # lists of integers representing the number of times each ngram occurs in the text
    ngram3_counts = Counter(zip(words, words[1:], words[2:])).values()
    ngram4_counts = Counter(zip(words, words[1:], words[2:], words[3:])).values()
    ngram5_counts = Counter(zip(words, words[1:], words[2:], words[3:], words[4:])).values()

    # multiplying by n-1 zeroes out the ngrams that occur only once, which are desired
    return (sum(5 * (n-1) for n in ngram5_counts) + \
            sum(4 * (n-1) for n in ngram4_counts) + \
            sum(3 * (n-1) for n in ngram3_counts)) / len(words)

All output samples were saved to a database along with this extra metadata about how they were created, plus their scores. Querying this data helped me figure out which models and parameters worked better, and what n-gram score threshold to use.
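
The final cut was then mostly a query against that table. A sketch of what that looked like, with the table layout, column names, and cutoff value all illustrative:

import psycopg2

con = psycopg2.connect("host=localhost dbname=ebooks")
cursor = con.cursor()

# table/column names and the score cutoff are illustrative
cursor.execute("""
    SELECT sample_text
      FROM output_samples
     WHERE ngram_score < %s
""", (0.05,))

keepers = [row[0] for row in cursor]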

At the end of the funnel, there were about 4,000 unique passable output samples. You can download a dump of all of those samples here.

And the final version of the training data used to finetune the GPT-2 model is here.

Using GPT-2

I haven't gone into much detail about how I ran GPT-2 itself because it wasn't that interesting or novel a task compared to moving the data through the pipeline. Training models and generating output samples were both done with gpt-2-simple, on a p3.8xlarge instance for model training and a p3.2xlarge instance for text sample generation. I used Amazon's Deep Learning AMI (ami-079028d69001e1bc3), which had the ML stack already installed; adding gpt-2-simple was the only additional configuration necessary.