At Twitter, I've needed to do some ad-hoc data analysis over many terabytes of logs, and I've been surprised to discover that SQLite is an incredibly useful tool. The `sqlite3` command-line tool requires nearly no configuration, is usually already installed, and can execute complex ad-hoc queries pretty quickly. Its `.import` command makes it trivial to read a text file into a table. My workflow now is to write a Scalding job that uses Hadoop to extract the subset of data I want from the logs, then import a few GB of that data into SQLite so I can quickly and interactively compute aggregate statistics over subsets of it. This is many, many times faster than my previous approach, which was to write a one-off script to filter and aggregate the data in the way I wanted.
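To make that concrete, here's the kind of query I end up typing at the `sqlite3` prompt once the data is loaded. The table and column names are made up for illustration:

```
-- requests, country, and latency_ms are made-up names for illustration;
-- substitute whatever columns your extracted data actually has.
SELECT country,
       COUNT(*)        AS n,
       AVG(latency_ms) AS avg_latency_ms
FROM requests
GROUP BY country
ORDER BY n DESC
LIMIT 10;
```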
Here are some random tips if you are working with the `sqlite3` command line:

- Add a header row with column names to your file before running `.import`, since by default SQLite uses the first row as column names (there's an example session after this list).
- Use `.separator "\t"` (or whatever) to set the correct separator.
- `.mode column` makes the output a bit more readable.
- `PRAGMA page_size=4096; PRAGMA mmap_size=4294967296; PRAGMA temp_store=MEMORY;` can speed things up.
- If you declare your first column as `INTEGER PRIMARY KEY`, SQLite avoids creating an implicit `rowid` column, which is slightly more efficient. However, you should pre-sort your data (e.g. using `sort -n`) before importing, so the import is slightly faster (also sketched after this list).
- Compile `sqlite3` with `readline` support so the terminal supports editing history: `gcc -O2 -DHAVE_READLINE=1 -o sqlite3 shell.c sqlite3.c -lreadline`. You may need to add `-ltermcap` depending on your version of Linux.
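Here's a minimal sketch of how the first few tips fit together. The file and table names (`hits.tsv`, `hits.db`, `hits`) are made up, and I'm driving `sqlite3` from a heredoc so the steps are easy to copy; interactively you'd type the same lines at the prompt:

```
# hits.tsv, hits.db, and the hits table are made-up names for illustration.
# The first line of hits.tsv is a header row: since the table doesn't exist
# yet, .import creates it and uses that row as the column names.
sqlite3 hits.db <<'EOF'
PRAGMA page_size=4096;
PRAGMA mmap_size=4294967296;
PRAGMA temp_store=MEMORY;
.separator "\t"
.import hits.tsv hits
.mode column
SELECT * FROM hits LIMIT 5;
EOF
```

One thing this doesn't give you is the `INTEGER PRIMARY KEY` optimization; for that, create the table yourself first, as in the next sketch.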
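And a sketch of the `INTEGER PRIMARY KEY` tip, again with made-up names (`raw.tsv`, `events.db`, `events`). Because the table already exists here, `.import` loads every line as data, so this file has no header row:

```
# raw.tsv, events.db, and the events table are made-up names for illustration.
# Pre-sort on the integer key in the first column so rows arrive in key order.
sort -n raw.tsv > sorted.tsv

sqlite3 events.db <<'EOF'
CREATE TABLE events(
  event_id   INTEGER PRIMARY KEY,  -- aliases the rowid, so no extra implicit column
  country    TEXT,
  latency_ms INTEGER
);
.separator "\t"
.import sorted.tsv events
EOF
```

This only works if the key column really is unique, since `INTEGER PRIMARY KEY` enforces uniqueness.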