At Twitter, I've needed to do some ad-hoc data analysis over many terabytes of logs, and I've been surprised to discover that SQLite is an incredibly useful tool. The `sqlite3` command-line tool requires nearly no configuration, is usually already installed, and can execute complex ad-hoc queries pretty quickly. Its `.import` command makes it trivial to read a text file into a table. My workflow now is to write a Scalding job that uses Hadoop to extract the subset of data I want from the logs, then import a few GB of that data into SQLite so I can quickly and interactively compute aggregate statistics over subsets of it. This is many, many times faster than my previous approach, which was to write a one-off script to filter and aggregate the data in the way I wanted.
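To make that concrete, here's the kind of query I end up typing at the `sqlite3` prompt once the data is loaded. The table and column names are made up for illustration:

```
-- requests, country, and latency_ms are made-up names for illustration;
-- substitute whatever columns your extracted data actually has.
SELECT country,
       COUNT(*)        AS n,
       AVG(latency_ms) AS avg_latency_ms
FROM requests
GROUP BY country
ORDER BY n DESC
LIMIT 10;
```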
Here are some random tips if you are working with the `sqlite3` command line:

- Add a header row with column names to your file before running `.import`, since by default SQLite uses the first row as column names (there's an example session after this list).
- Use `.separator "\t"` (or whatever) to set the correct separator.
- `.mode column` makes the output a bit more readable.
- `PRAGMA page_size=4096; PRAGMA mmap_size=4294967296; PRAGMA temp_store=MEMORY;` can speed things up.
- If you declare your first column as `INTEGER PRIMARY KEY`, SQLite avoids creating an implicit `rowid` column, which is slightly more efficient. However, you should pre-sort your data (e.g. using `sort -n`) before importing, so the import is slightly faster (also sketched after this list).
- Compile `sqlite3` with `readline` support so the terminal supports editing history: `gcc -O2 -DHAVE_READLINE=1 -o sqlite3 shell.c sqlite3.c -lreadline`. You may need to add `-ltermcap` depending on your version of Linux.
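Here's a minimal sketch of how the first few tips fit together. The file and table names (`hits.tsv`, `hits.db`, `hits`) are made up, and I'm driving `sqlite3` from a heredoc so the steps are easy to copy; interactively you'd type the same lines at the prompt:

```
# hits.tsv, hits.db, and the hits table are made-up names for illustration.
# The first line of hits.tsv is a header row: since the table doesn't exist
# yet, .import creates it and uses that row as the column names.
sqlite3 hits.db <<'EOF'
PRAGMA page_size=4096;
PRAGMA mmap_size=4294967296;
PRAGMA temp_store=MEMORY;
.separator "\t"
.import hits.tsv hits
.mode column
SELECT * FROM hits LIMIT 5;
EOF
```

One thing this doesn't give you is the `INTEGER PRIMARY KEY` optimization; for that, create the table yourself first, as in the next sketch.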
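And a sketch of the `INTEGER PRIMARY KEY` tip, again with made-up names (`raw.tsv`, `events.db`, `events`). Because the table already exists here, `.import` loads every line as data, so this file has no header row:

```
# raw.tsv, events.db, and the events table are made-up names for illustration.
# Pre-sort on the integer key in the first column so rows arrive in key order.
sort -n raw.tsv > sorted.tsv

sqlite3 events.db <<'EOF'
CREATE TABLE events(
  event_id   INTEGER PRIMARY KEY,  -- aliases the rowid, so no extra implicit column
  country    TEXT,
  latency_ms INTEGER
);
.separator "\t"
.import sorted.tsv events
EOF
```

This only works if the key column really is unique, since `INTEGER PRIMARY KEY` enforces uniqueness.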