- Sat 03 September 2022
Often when browsing a media catalogue site like IMDb or Goodreads I wish I could query the underlying data because it would be faster than clicking around the site and would open up more kinds of analysis.
And eventually I wondered: what if all of Wikipedia could be turned into a structured database with tables of not only movies and books, but the entire long tail of every object type in the data set?
My vision was a database schema with thousands of tables, one for each object type, with columns for the fields relevant to that specific type. For example, the movies table would have columns for the title, director, runtime, and so on[1], all without having to code or curate anything for each specific type.
I tackled this a while ago, and when the topic came up on Hacker News recently[2][3] it reminded me that I had meant to write up my approach and results but never got around to it.
Wikipedia provides its full dataset as a download to help with analysis like this, and I was able to create the database I envisioned.
Obtaining the Data: A 78 GB XML File
Rather than scrape web pages, I downloaded the full data set by following links from here. The pages-articles-multistream.xml.bz2 file turned out to be the best format available: a single bzip2-compressed XML dump containing the wikitext of every article.
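If you want a quick look at what's inside before committing to a full run, the dump can be streamed without unpacking it to disk. Here's a minimal sketch, assuming the file was downloaded locally under its standard name:
import bz2

# Stream-decompress the dump and print the first chunk of lines: the
# <siteinfo> header followed by the first <page> elements. Nothing is
# written to disk, so this is a cheap way to inspect the XML structure.
dump_path = "enwiki-latest-pages-articles-multistream.xml.bz2"
with bz2.open(dump_path, "rt", encoding="utf-8") as f:
    for i, line in enumerate(f):
        print(line.rstrip())
        if i >= 200:
            break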
Processing the data efficiently was a goal: I wanted to avoid excessive cost or run time, which meant some framework for parallelization was necessary, so I looked into Spark.
A single very large XML file is not a convenient or natural format to start with, and I hoped to avoid preprocessing the file into another format. Fortunately, Databricks has open-sourced a project that lets Spark read and parallelize XML directly: spark-xml.
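spark-xml isn't part of Spark itself, so it has to be made available to the cluster; the usual route is passing its Maven coordinate to spark-submit via --packages. As a sketch of the alternative, the coordinate can also be pinned in code before the context is created (the exact version and Scala suffix below are assumptions that need to match your cluster):
from pyspark import SparkConf
from pyspark.context import SparkContext

# Ask Spark to fetch spark-xml from Maven when the context starts. The
# coordinate is illustrative; pick the version built for your Spark/Scala.
conf = SparkConf().set("spark.jars.packages",
                       "com.databricks:spark-xml_2.12:0.15.0")
sc = SparkContext.getOrCreate(conf)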
Before starting, the other piece I wanted to make sure was in place was a way to parse the wikitext format itself. A 20-year-old format in continuous use has likely grown very complex; writing a parser myself might look simple on the surface but could easily turn into a rabbit hole, and I didn't want to reinvent that wheel. Again, fortune smiled on me in the form of wikitextparser. (And yes, a project with 1300 commits over 7 years is something I'm glad I didn't try to recreate myself.)
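To give a feel for the library, here's a minimal example with made-up wikitext showing the two things the program below relies on: parsing a page's text and walking its templates, infoboxes being one kind of template:
import wikitextparser as wtp

sample = """{{Infobox film
| name    = Example Film
| runtime = 90 minutes
}}
'''Example Film''' is a 2022 film."""

parsed = wtp.parse(sample)
for template in parsed.templates:
    # Template names keep surrounding whitespace, hence the strip() calls.
    print(template.name.strip())                              # Infobox film
    print([arg.name.strip() for arg in template.arguments])   # ['name', 'runtime']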
Parallelizing the large XML file and parsing the wikitext were the challenges unique to this project; with those solved, the rest should be straightforward. So the first step was to prove these two solutions with a minimal Spark program that could process the entire data set in reasonable time and at reasonable cost.
Spark environment
My chosen environment and stack for using Spark was:
- Amazon EC2 - I don't love being locked into EMR; I prefer plain Linux machines without the proprietary Amazon stack
- Flintrock - a command-line tool that launches and manages clusters of EC2 machines, including installing Spark on them
- Python - My preferred language for Spark development and in general
Below is my first program[4]. It outputs a CSV with an entry containing the title and a few other metadata fields for every Wikipedia article, with multiple entries if a page contains more than one infobox. In my next post I'll talk more about infoboxes and where I'll take this code next.
This program is a lot of lame boilerplate that doesn't amount to more than running a single Python function on every <page></page> element, but that's where all of the risk and complexity of this project lives, so that's the place to start.
The program took 15 minutes on a Spark cluster with 8 m5a.2xlarge slaves. At about 34 cents per machine-hour, that's roughly two machine-hours of worker time, or well under a dollar for the whole run, so it was successful on both time and cost.
import re
import sys
import wikitextparser as wtp
from slugify import slugify
from pyspark.sql.types import *
from pyspark.sql import SQLContext
from pyspark.context import SparkContext
s3_input_path = sys.argv[1]
s3_output_bucket = sys.argv[2]
s3_output_path = sys.argv[3]
sc = SparkContext.getOrCreate()
# This configuration setting is important because it enables parallel writing
# of the output. Writing to S3 is very slow and single-threaded by default.
sc._jsc.hadoopConfiguration() \
.set("mapreduce.fileoutputcommitter.algorithm.version","2")
sqlContext = SQLContext(sc)
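# Schema covering only the parts of each <page> element this program needs.
# spark-xml maps XML attributes to fields prefixed with an underscore by
# default, which is why the redirect title appears below as "_title".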
wikiSchema = StructType([ \
StructField("title", StringType(), True), \
StructField("redirect", StructType([
StructField("_title", StringType(), True)]), True), \
StructField("ns", IntegerType(), True), \
StructField("revision", StructType([
StructField("id", StringType(), True), \
StructField("timestamp", StringType(), True), \
StructField("text", StringType(), True)]), True), \
StructField("id", StringType(), True)])
def get_infobox(template):
    # If this template is an infobox, return its type as a slug
    # (e.g. "Infobox film" -> "film"); otherwise return None.
    if "infobox" not in template.name.lower():
        return None
    name_slug = template.name.lower().replace("\n", " ")
    name_slug = name_slug.replace("infobox", "").strip()
    name_slug = re.sub("^([^|<]*).*", "\\1", name_slug, flags=re.I) \
        .strip()
    name_slug = slugify(name_slug)
    return name_slug
def parse_wt(row):
    # Parse one <page> row and emit one output tuple per infobox type found,
    # or a single tuple with an empty slug if the page has no infobox.
    _id, timestamp, text = row.revision
    doc = wtp.parse(text or "")
    infobox_slugs = [get_infobox(t) for t in doc.templates]
    infobox_slugs = set([s for s in infobox_slugs if s]) or [""]
    redirect = 1 if row.redirect and row.redirect._title else 0
    return [(row.id, row.ns, row.title, redirect, timestamp, slug)
            for slug in infobox_slugs]
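# Read every <page> element in parallel, expand each page into one row per
# infobox with parse_wt, and write the result to S3 as CSV shards.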
df = sqlContext.read \
.format("com.databricks.spark.xml") \
.options(rowTag="page") \
.load(s3_input_path, schema=wikiSchema)
df = df.rdd.flatMap(parse_wt).toDF()
df.repartition(32).write \
.format("com.databricks.spark.csv") \
.options(header="false") \
.save(f"s3a://{s3_output_bucket}/{s3_output_path}/")
[1] I'm familiar with the Wikidata project, but its coverage of facts isn't as thorough as Wikipedia infoboxes and I was looking for an excuse for a Spark project anyway.
[2] Wikidata or Scraping Wikipedia (hn)
[3] Data-Mining Wikipedia for Fun and Profit (hn)
[4] I'm not a Spark expert, so feel free to suggest improvements.