Following up on my last post about large JSON queries, I also benchmarked sub-string queries on large variable-length strings. I wanted to check if sub-string queries might be faster than HSTORE
or JSONB
key lookups. I tested both binary (BYTEA
) and Unicode text (TEXT
). Unfortunately, Postgres sub-string queries are about 6× slower than HSTORE
or JSONB
key queries. I also learned that Postgres regular expressions are extremely slow: about 3× slower than using LIKE
. Queries on BYTEA
are faster than queries on TEXT
. Perhaps the most interestingly, TOAST inline compressed storage was faster than the uncompressed storage. However, similar to my results with JSON values, out-of-line storage using a separate TOAST table is quite a bit slower than inline storage.
I spent a bit of time looking at the code that implements Postgres's LIKE operator and the implementation of the POSITION
function. I'm pretty sure they could be made quite a bit faster, at least for UTF-8 or binary strings. For example, the BYTEA
implementation of POSITION is a function called byteapos. It implements the simple O(nm) implementation, where for each byte of the string, you compare it to the substring. There are much faster implementations that can use SIMD instructions. I think there is an opportunity to make POSITION() and LIKE substantially faster.
For details on the benchmark setup, see my previous article. In this case, I tested TEXT
and BLOB columns that are near the 2004 byte limit for inline uncompressed tuples. I measured querying substrings that either did not match, or matched at the very end, which should represent the "worst case" search times. I tested LIKE
, regular expressions (~
operator), and the POSITION()
function. The entire workload was in memory, and parallel queries were disabled. For more details, see the benchmark source code in Github, which links to a Google sheet with the raw results.
This table shows the fastest query times for each of the column types and sizes. In general, BYTEA
is faster than TEXT
, and using LIKE is better than using POSITION, because it avoids strangly slow queries for TEXT
columns that match.
inline uncompressed | inline compressed | toast uncompressed | toast compressed | |
---|---|---|---|---|
TEXT | 3262 | 2900 | 6325 | 10623 |
BYTEA | 2067 | 1803 | 5010 | 7054 |