I have written about my experiments with Java string encoding performance before. However, someone asked me some questions about why my code can do better than the JDK. I did not have a good answer, so I dug into the JDK source code to find out. If you want to know what happens when you call String.getBytes("UTF-8"), read on. The short version for UTF-8 is:
- A temporary byte[] array is allocated with s.length() * 4 bytes. This is the maximum length for a UTF-8 string (although this maximum length is actually larger than needed; see the bug I filed on this issue).
- The string is encoded into the temporary array using a CharsetEncoder. The JDK does this by accessing the raw char[] array in the String object.
- The bytes are copied from the temporary byte[] array into the final byte[] array with the exact right length.
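Roughly, those three steps amount to the following sketch, written against the public CharsetEncoder API. It is a simplification, not the actual JDK code (which lives in java.lang.StringCoding and reads the String's internal char[] directly), and the class and method names are mine:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class GetBytesSketch {
    // Approximation of String.getBytes("UTF-8"): allocate a worst-case buffer,
    // encode into it, then copy the result into an exact-length array.
    static byte[] toUtf8(String s) {
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        // Step 1: worst-case sized temporary array (s.length() * 4 as above).
        byte[] temp = new byte[s.length() * 4];
        // Step 2: encode the characters into the temporary array.
        ByteBuffer out = ByteBuffer.wrap(temp);
        encoder.encode(CharBuffer.wrap(s), out, true);
        encoder.flush(out);
        // Step 3: copy into a new array with the exact encoded length;
        // the oversized temporary array immediately becomes garbage.
        return Arrays.copyOf(temp, out.position());
    }
}
```

For ASCII text the encoded output is only s.length() bytes, so the temporary array is four times larger than the result before it is thrown away.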
Conclusion: This allocates s.length() * 4 bytes of garbage, and has an "extra" copy. This is what permits custom code to be slightly faster than the JDK: custom code produces less garbage, particularly for ASCII or mostly ASCII text. Significant wins are possible when the destination does not need to be a byte[] array with the exact length. For example, writing directly to the output buffer or a ByteBuffer with "unused" bytes at the end can be faster. See my StringEncoder class in the source code used for these benchmarks if you want to try and take advantage of this in your own code.
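As an illustration of that idea (this is not the StringEncoder class from the benchmarks; ReusableUtf8Encoder and encodeInto are hypothetical names), a reusable encoder that writes straight into a caller-supplied ByteBuffer might look roughly like this:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class ReusableUtf8Encoder {
    // One encoder per instance; not thread-safe, so use one per thread.
    private final CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
            .onMalformedInput(CodingErrorAction.REPLACE)
            .onUnmappableCharacter(CodingErrorAction.REPLACE);

    // Encodes s directly into destination, leaving any "unused" bytes at the
    // end untouched. Returns the number of bytes written, or -1 if the
    // destination is too small. No temporary array, no exact-length copy.
    int encodeInto(String s, ByteBuffer destination) {
        encoder.reset();
        int start = destination.position();
        CoderResult result = encoder.encode(CharBuffer.wrap(s), destination, true);
        if (result.isOverflow()) {
            return -1;
        }
        encoder.flush(destination);
        return destination.position() - start;
    }
}
```

The win comes from reusing both the encoder and the destination buffer across calls, so neither the oversized temporary array nor the exact-length copy is ever made.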
The details, with links to the source code:
- String.getBytes looks up a cached StringEncoder for the requested charset, which sets its isTrusted boolean to true if the Charset is provided by the JDK (charset.getClass().getClassLoader0() == null).
- It allocates a byte[] array of size length * encoder.maxBytesPerChar(). Note that for UTF-8, the JDK reports that maxBytesPerChar() == 4.
- It checks isTrusted: If it is not, it makes a defensive copy of the input string. This is to prevent a user supplied CharsetEncoder from being able to mutate the internals of a String. This does not happen for UTF-8.
- The encoder is configured with REPLACE as default mode for malformed and unmappable input, and reset() is called before encoding.
- Finally, if the encoded output exactly fills the array and isTrusted is true, return the array as is. Otherwise, call Arrays.copyOf to copy the bytes into a newly allocated array (sketched below). This copy will happen every time for UTF-8, since the output will never fill the array.
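A minimal sketch of that final trim step, assuming it behaves as described in the last item (the class and method names here are mine, not the JDK's):

```java
import java.util.Arrays;

final class TrimSketch {
    // Keep the oversized buffer only if it is exactly full and the charset is
    // trusted; otherwise copy the encoded bytes into an exact-length array.
    // For UTF-8 the copy happens every time, because the encoded output never
    // fills the length * 4 buffer.
    static byte[] trim(byte[] buffer, int encodedLength, boolean isTrusted) {
        if (isTrusted && encodedLength == buffer.length) {
            return buffer;
        }
        return Arrays.copyOf(buffer, encodedLength);
    }
}
```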