Tuesday, December 28, 2010

-XX:+UseCompressedStrings explained

It looks like Oracle has finally released some documentation for those options they've been using in SPECjbb2005 submissions. The doc is here and it looks like it appeared on Christmas Eve.

Like I guessed, they're using a byte[] array instead of a char[] array for Strings wherever they can.

Presumably this makes the code path more complicated, because every time the JVM deals with a String it now needs to check which representation that String uses. The space savings are probably worth it, at least in some applications.
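
To make the trade-off concrete, here is a minimal sketch of the idea in plain Java. It is purely illustrative: the class CompactStringSketch and all of its methods are hypothetical names of mine, and nothing here is taken from HotSpot's actual implementation. It stores one byte per character when the text is pure ASCII and falls back to a char[] otherwise, so every charAt() call pays for one extra branch.

```java
import java.nio.charset.StandardCharsets;

/**
 * Hypothetical illustration only: a string-like value that keeps one byte
 * per character when the text is pure 7-bit ASCII and a char[] otherwise.
 * This is an assumption about the general idea, not the JVM's real code.
 */
final class CompactStringSketch {
    private final byte[] ascii; // non-null only when every char is ASCII
    private final char[] wide;  // non-null only when some char is not ASCII

    private CompactStringSketch(byte[] ascii, char[] wide) {
        this.ascii = ascii;
        this.wide = wide;
    }

    static CompactStringSketch of(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0x7F) {
                // At least one non-ASCII char: keep the 2-byte form.
                return new CompactStringSketch(null, s.toCharArray());
            }
        }
        return new CompactStringSketch(s.getBytes(StandardCharsets.US_ASCII), null);
    }

    boolean isCompressed() {
        return ascii != null;
    }

    int length() {
        return ascii != null ? ascii.length : wide.length;
    }

    char charAt(int index) {
        // This branch on every access is the extra cost discussed above.
        return ascii != null ? (char) ascii[index] : wide[index];
    }

    /** Bytes used by the character data alone (array headers ignored). */
    int payloadBytes() {
        return ascii != null ? ascii.length : 2 * wide.length;
    }

    public static void main(String[] args) {
        CompactStringSketch id = CompactStringSketch.of("plugin.id");
        CompactStringSketch text = CompactStringSketch.of("na\u00EFve");
        System.out.println(id.isCompressed() + " " + id.payloadBytes());
        System.out.println(text.isCompressed() + " " + text.payloadBytes());
    }
}
```

The sketch only counts the character payload. A real JVM also has object and array headers to account for, so the actual savings per String would be smaller than a straight 50%.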

Why isn't it on by default? Two possibilities:
  1. The penalty is too high in many applications. In my opinion, this would make it a bit of a benchmark special.
  2. The option isn't quite ready for prime time yet, but they plan to turn it on by default later.

Is this option "fair" to non-Western-European applications? I'd argue that it probably isn't unfair. A lot of String objects aren't involved in the user interface at all. In many applications, such as Eclipse, Strings are used extensively as internal identifiers for things like plug-ins, extension points, user interface elements, etc. Even if your application presents a non-ASCII user interface, there's a good chance that it still has a lot of ASCII strings under the surface. It might not benefit as much from this option, but it would probably still benefit.

(Of course that assumes that there's no penalty for using non-ASCII Strings beyond the extra space. If the option is implemented in an all-or-nothing fashion, e.g. if it stops using byte[] arrays the first time it encounters a non-ASCII String, then non-ASCII applications wouldn't benefit at all.)
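
The difference between the two possibilities is easy to put numbers on. The following is a rough sketch under my own assumptions: CompressionPolicySketch and its methods are hypothetical, the sample strings are made up, and only character payload bytes are counted. It compares a per-String decision with the all-or-nothing behavior described above, where compression is switched off for good once the first non-ASCII String shows up.

```java
import java.util.Arrays;
import java.util.List;

/**
 * Hypothetical comparison of two policies for byte[]-backed Strings.
 * Neither is claimed to be what the JVM really does.
 */
final class CompressionPolicySketch {

    static boolean isAscii(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0x7F) {
                return false;
            }
        }
        return true;
    }

    /** Every String is 2 bytes per char, as without the option. */
    static long uncompressedBytes(List<String> strings) {
        long total = 0;
        for (String s : strings) {
            total += 2L * s.length();
        }
        return total;
    }

    /** Each String decides for itself: 1 byte per char if it is pure ASCII. */
    static long perStringBytes(List<String> strings) {
        long total = 0;
        for (String s : strings) {
            total += isAscii(s) ? s.length() : 2L * s.length();
        }
        return total;
    }

    /** Compression stops for good at the first non-ASCII String. */
    static long allOrNothingBytes(List<String> strings) {
        long total = 0;
        boolean stillCompressing = true;
        for (String s : strings) {
            if (stillCompressing && !isAscii(s)) {
                stillCompressing = false;
            }
            total += stillCompressing ? s.length() : 2L * s.length();
        }
        return total;
    }

    public static void main(String[] args) {
        // Made-up mix: internal identifiers plus one non-ASCII UI label.
        List<String> strings = Arrays.asList(
                "org.example.plugin",
                "\u65E5\u672C\u8A9E",
                "extension.point",
                "menu.file.open");
        System.out.println("uncompressed:   " + uncompressedBytes(strings));
        System.out.println("per-String:     " + perStringBytes(strings));
        System.out.println("all-or-nothing: " + allOrNothingBytes(strings));
    }
}
```

With a mix like this, the per-String policy keeps most of the savings, while the all-or-nothing policy loses them for everything created after the first non-ASCII String.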


  1. I did some very preliminary testing for my purposes. I was parsing 10 kB XML files with Xalan (i.e. default parser) on JDK build 1.6.0_23-b05 (Windows 64-bit) and generating a text-only PDF from each of them with iText. Tests were done with all data in memory, all strings contained nothing but ASCII, and each loop ran at least 1000 transformations.

    The tests consistently show a speed penalty of about 10% when -XX:+UseCompressedStrings is used. No other JVM options were set, BTW.

    I would imagine that where in-memory caching is many times faster than any I/O, the trade-off might well be a good deal. It does consume measurably less heap: in my case, very roughly 30% less.

  2. Interesting. I would have assumed that in most cases this compression occurs naturally when serializing Objects (with String members) as UTF-8, so that there'd be little benefit to tweaking the in-memory representation, except for Strings that are seldom used (which could be the case for class definitions, method names, etc.). Then again, access via String.charAt() should be relatively fast, so maybe the overhead is not all that drastic.

  3. Presumably the repeated -XX: is a typo (as in -XX:-XX:+UseCompressedStrings)?