The Virtual Machinist: 2010

Friday, December 31, 2010

Die Klasse Namen

I just noticed that the IBM JDK class libraries include these klasses:

com.ibm.security.util.DerInputStream and
com.ibm.security.util.DerValue.

Huck huck.

Ach! Der InputStream ist ein ... NuisanceStream! Der Value, Der. No one who speaks German could be an evil man!

Happy New Year!

Tuesday, December 28, 2010

-XX:+UseCompressedStrings explained

It looks like Oracle has finally released some documentation for those options they've been using in SPECjbb2005 submissions. The doc is here and it looks like it appeared on Christmas Eve.

Like I guessed, they're using a byte[] array instead of a char[] array for Strings wherever they can.

Presumably this makes the code path more complicated, because every time the JVM deals with a String it now needs to check what kind it is. The space savings are probably worth it, at least in some applications.
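To make the extra check concrete, here is a minimal sketch of a string type that stores a byte[] when every character is ASCII and falls back to a char[] otherwise. The class name CompactText and all of its methods are hypothetical, and the ASCII-only rule is an assumption; this is not HotSpot's implementation, just an illustration of where the additional branch shows up.

```java
// Hypothetical sketch only; not how HotSpot implements -XX:+UseCompressedStrings.
final class CompactText {
    private final byte[] narrow; // set when every character is ASCII, otherwise null
    private final char[] wide;   // set when at least one character is non-ASCII, otherwise null

    CompactText(String source) {
        boolean allAscii = true;
        for (int i = 0; i < source.length(); i++) {
            if (source.charAt(i) > 0x7F) {
                allAscii = false;
                break;
            }
        }
        if (allAscii) {
            narrow = new byte[source.length()];
            for (int i = 0; i < narrow.length; i++) {
                narrow[i] = (byte) source.charAt(i);
            }
            wide = null;
        } else {
            narrow = null;
            wide = source.toCharArray();
        }
    }

    boolean isCompressed() {
        return narrow != null;
    }

    int length() {
        return isCompressed() ? narrow.length : wide.length;
    }

    // Every access has to ask which representation is in use.
    char charAt(int index) {
        return isCompressed() ? (char) narrow[index] : wide[index];
    }

    // Bytes used for character data alone, ignoring object and array headers.
    int payloadBytes() {
        return isCompressed() ? narrow.length : 2 * wide.length;
    }

    public static void main(String[] args) {
        CompactText id = new CompactText("plugin.id");
        CompactText label = new CompactText("caf\u00e9");
        System.out.println(id.isCompressed() + " " + id.payloadBytes());
        System.out.println(label.isCompressed() + " " + label.payloadBytes());
    }
}
```

The decision is made per string in this sketch, so one non-ASCII string does not affect the others. Whether the real option behaves that way is not something I can tell from the documentation.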

Why isn't it on by default? Two possibilities:

The penalty is too high in many applications. In my opinion, this would make it a bit of a benchmark special.
The option isn't quite ready for prime time yet, but they plan to turn it on by default later.

Is this option "fair" to non-Western-European applications? I'd argue that it probably isn't unfair. A lot of String objects aren't involved in the user interface at all. In many applications, such as Eclipse, Strings are used extensively as internal identifiers for things like plug-ins, extension points, user interface elements, etc. Even if your application presents a non-ASCII user interface there's a good chance that it still has a lot of ASCII strings under the surface. It might not benefit as much from this option, but it would probably still benefit.

(Of course that assumes that there's no penalty for using non-ASCII Strings beyond the extra space. If the option is implemented in an all-or-nothing fashion, e.g. if it stops using byte[] arrays the first time it encounters a non-ASCII String, then non-ASCII applications wouldn't benefit at all.)

Friday, December 24, 2010

Working backwards

Sometimes I get a bug that I just can't figure out. If the problem is reproducible with a good test case it's usually fairly easy to narrow the problem down pretty quickly. But what do you do if your product crashed once on a customer's server, and hasn't failed again?

Well, you start with the logs and diagnostic files. Usually we can figure it out from tracepoints and a core file. But sometimes this doesn't work.

In cases like this I don't like to throw in the towel without doing something. It feels like defeat (probably because it is). Instead, I always try to figure out what additional information could have helped me solve the problem.

How did we get to the point of failure? If I can identify two or three paths to the failure point and can't infer which one was taken I'll add some tracepoints to those paths. Or maybe I can add assertions on those paths to detect the error a bit earlier.

Since the problem isn't reproducible they won't help me now, but they might help me in the future. If the problem does occur again (and 'not reproducible' really just means 'very rare') hopefully these diagnostics will get me one step closer to the actual problem. And if it doesn't hit any of my new tracepoints or assertions when it recurs, that's useful (and potentially maddening) too.

Of course I still might not be able to figure out what's happening. Then I add another round of tracepoints and assertions. Each failure gets me one step closer to the solution.

Saturday, November 6, 2010

Ambiguous Java packages and inner classes

Yesterday, when I should have been paying attention to my breathing during yoga, I was instead thinking about inner classes. Specifically I was wondering how Java resolves ambiguities between package names and inner class names.

In Java, an inner class is specified using its parent's name, a dot, and its name. For example, Foo.Bar is an inner class named "Bar" in the parent class "Foo". But this is also how Java names packages. Bar could just as easily be a top-level class in package Foo.

Note that there's no ambiguity once the code has been compiled. The class file format uses / as a package separator and $ as an inner class separator, so Foo$Bar is an inner class, and Foo/Bar is a top level class in package Foo. The ambiguity is only present in the compiler, where "." serves both purposes.
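The two separators can be observed from reflection. This is a small sketch, with NameDemo, Bar and internalName as made-up names for illustration: Class.getName() returns the binary name, with '$' for the nested class, Class.getCanonicalName() returns the source-level form with '.', and replacing the dots in the binary name gives the '/'-separated form used inside class files.

```java
import java.util.Map;

// NameDemo, Bar and internalName are hypothetical names used only for this illustration.
final class NameDemo {
    static class Bar {
    }

    // Converts a binary name such as java.util.Map$Entry to the class file form java/util/Map$Entry.
    static String internalName(Class<?> type) {
        return type.getName().replace('.', '/');
    }

    public static void main(String[] args) {
        // Binary name: '$' separates the nested class from its enclosing class.
        System.out.println(Bar.class.getName());
        // Canonical name: the form you write in source code, with '.' throughout.
        System.out.println(Bar.class.getCanonicalName());
        // Class file form: '/' for packages, '$' for nesting.
        System.out.println(internalName(Map.Entry.class));
    }
}
```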

I wrote a simple test case. I created a simple Foo.java file with a public static inner class Bar, and a Foo/Bar.java file with a public class Bar. Then I wrote a test class like this:

public class Test {
        public static void main(String[] args) {
                System.out.println(new Foo.Bar());
        }
}

What happens when you compile and run this?

peter$ java Test
Foo$Bar@c53dce

Notice the '$' in the name? The inner class won, so it looks like javac resolves the ambiguity in favour of the inner class.

But what if we really want to mess with the compiler? What happens if you try to compile this file?

public class java {
        public static class lang {
                public static class Object {
                }
        }
}

Yes, the file is called java.java. None of these names are reserved in Java, so this ought to be a perfectly legitimate class named java. It has an inner class, java.lang, with its own inner class, java.lang.Object (not to be confused with java.lang.Object, the superclass of each of these classes).

javac doesn't seem to like this class very much, at least not the version installed on my Mac:

tmp peter$ javac java.java
An exception has occurred in the compiler (1.6.0_17). Please file a bug at the Java Developer Connection (http://java.sun.com/webapps/bugreport) after checking the Bug Parade for duplicates. Include your program and the following diagnostic in your report. Thank you.
java.lang.NullPointerException
    at com.sun.tools.javac.comp.Flow.visitIdent(Flow.java:1214)
    at com.sun.tools.javac.tree.JCTree$JCIdent.accept(JCTree.java:1547)
    at com.sun.tools.javac.tree.TreeScanner.scan(TreeScanner.java:35)
    ...

I don't think that this is a security hole, since the ambiguity is only present in the compiler. It's possible that there's a similar ambiguity in reflection, but I'm not sure. That might be somewhat higher risk.

Update: Eclipse compiles this java.java file with no problem. Looks like it's just an obscure bug in javac.

Further update: I'm not the first person to discover this: http://www.bodden.de/tag/name-resolution/

Saturday, October 16, 2010

RIP Benoit Mandelbrot

Benoit Mandelbrot, the father of fractals, is dead at 85.

When I was in high school I was fascinated by the Mandelbrot set. I wrote a program which rendered it (quite slowly) on my 286. It let you draw a line anywhere across the set and play it as musical tones. (Well, as musical as a PC's speaker could be.)

Saturday, October 2, 2010

Data Mining 101: Amazon Wishlists

I was searching for someone's Amazon wishlist and found this link, instead: Data Mining 101: Finding Subversives with Amazon Wishlists. It's nearly 5 years old, so you probably discovered this long before I did, but it's pretty interesting how much data he could find just linking together a few public, free databases.

It reminded me of an ad I saw on a website yesterday. (I should have captured it.) It said something like "Searching for Gary Gnu?" and then provided links to a couple of sites ostensibly selling Gary Gnu (Buy Gary Gnu on eBay.ca!). I had done a Google search for Gary Gnu a few days before; the ad must have been sniffing my browser history to figure that out, possibly using the old link coloring trick.

Is privacy dead? Probably, but I think it has been for a long time now. The only real difference is that we know it, and I guess that's half the battle.

Thursday, September 23, 2010

-XX:+UseCompressedStrings?

Our friends at Oracle have recently published some impressive SPECjbb2005 scores. Hotspot got a little over 3.3 million bops using 8 JVMs on a Sun Fire X4800.

I'm curious about some of the command line options they used. Most of them look like they're JIT and GC tuning options (would anyone really use -XX:InlineSmallCode except for a benchmark?), but one looks a bit different: -XX:+UseCompressedStrings. I also noticed that they've prepended some libraries called alt-string.jar and alt-rt.jar onto the bootstrap classpath.

I've googled this option but not much turns up. My best guess is that they're representing String data using byte arrays instead of char arrays. This would halve the space used for character data (8 bits vs. 16 bits per character), but would only work well for ASCII, i.e. good if you're Western European, not so good if you're Eastern European or Asian. Plus it wouldn't be very significant for small strings, since the object headers of the String and char[] objects take up a big portion of the space occupied by each String.
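A rough calculation shows why short strings gain less. The sizes below are assumptions chosen for illustration (a 24-byte String object, a 12-byte array header, 8-byte alignment), not measurements of any particular JVM, and StringFootprint is a made-up name.

```java
// Illustrative arithmetic only; the header sizes are assumptions, not measured values.
final class StringFootprint {
    static final int STRING_OBJECT_BYTES = 24; // assumed: header plus the String's own fields
    static final int ARRAY_HEADER_BYTES = 12;  // assumed: array header including the length
    static final int ALIGNMENT = 8;            // assumed: objects are padded to 8 bytes

    static int align(int bytes) {
        return (bytes + ALIGNMENT - 1) / ALIGNMENT * ALIGNMENT;
    }

    // Estimated total bytes for one String of the given length and character width.
    static int footprint(int length, int bytesPerChar) {
        return STRING_OBJECT_BYTES + align(ARRAY_HEADER_BYTES + length * bytesPerChar);
    }

    public static void main(String[] args) {
        for (int length : new int[] {4, 16, 256}) {
            int wide = footprint(length, 2);
            int narrow = footprint(length, 1);
            System.out.println("length " + length + ": char[] " + wide
                    + " bytes, byte[] " + narrow + " bytes");
        }
    }
}
```

Under these assumptions a 4-character string shrinks from 48 to 40 bytes, while a 256-character string shrinks from 552 to 296 bytes, which is much closer to the full 50%.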

According to the SPEC submission the JVM which supports these options should be available this month, so maybe we'll see some doc for this option to satisfy my curiosity.

Wednesday, September 1, 2010

What I'm reading

I'm heading off to Paris with Kristin for a week. We'll be looking for merchandise for her store, sightseeing and eating well.

I found a few papers which looked interesting to read on the flight:

The Economics of Garbage Collection -- the authors suggest applying microeconomics theory to memory management
Improved Replication-Based Incremental Garbage Collection for Embedded Systems -- I'm interested in incremental GC and region-based GC right now
Tracing Garbage Collection on Highly Parallel Systems -- multi-core GC is clearly something we need to focus on in the coming years

I might post a follow-up on these once I've read them.

Sunday, August 29, 2010

We're hiring

Interested in working on the J9 VM in Ottawa? We've got a job opening.

Note that a demonstrated aptitude for low-level programming is so important that we've written it twice. We've written it twice.

If you're interested (and you have a demonstrated aptitude for low-level programming) apply through the web site or send me a note.

Saturday, August 28, 2010

Debugging tip: timing holes

Parallel programming is hard. Really hard. And, as we're now firmly in the era of multi-core computing, it's an increasingly important skillset.

A couple of days ago we hit an assertion in some code. The failing test case reproduced the problem 2 out of 300 times. So it was reproducible, but highly intermittent.

The developer who was debugging the problem speculated that there might be a timing hole. There was a section of code which set two variables kind of like this:

section->hasRedObjects = true;
section->baseOfRedObjects = basePointer;

He hypothesized that if one thread ran through this code while another thread read the same fields, the second thread might see that the section has red objects, but, briefly, might think they start at NULL instead of the correct address somewhere in the middle of the section.

We used a really simple technique to prove this: make the timing hole bigger.

section->hasRedObjects = true;
usleep(10000); // sleep for 10 ms
section->baseOfRedObjects = basePointer;

Adding a 10ms sleep between the two assignments caused the problem to occur nearly every time, confirming his hypothesis.

(We fixed the problem by getting rid of the bool hasRedObjects field. We defined a new function, bool hasRedObjects() { return NULL != baseOfRedObjects; }. With only one field, the updates are atomic and the timing hole is eliminated.)

Monday, August 23, 2010

What's in a name?

When I first started at OTI, and as we were slowly digested into IBM, our team didn't have any formal coding guidelines. We were a small group, and everyone basically followed the same unwritten conventions and tried to keep things sensible.

As we grew it became apparent that this wasn't going to scale. So a few years ago we wrote down our coding guidelines for C and C++. (We write Smalltalk, too, but we've never had to codify the Smalltalk conventions. We do a lot less Smalltalk now, so we probably never will.) Most of them are fairly straightforward, but one overarching theme is simplicity, simplicity, simplicity.

I believe that reading code is significantly harder than writing code. (I think the opposite is true of prose.) When you're writing a new function you've already worked out the structure in your head (hopefully) and you just need to translate that into something a compiler can understand. But when you're reading code you've got to reverse-engineer the structure from the simple instructions to the compiler. Plus there's a multiplier: on any large project you'll probably come back to debug and read certain pieces of code dozens or hundreds of times, but (hopefully) you only need to write it once. So not only is reading harder, but you'll do it much more frequently.

One of our most important guidelines, and one which seems to be fairly unusual, is that variable names need to be spelled out. Don't use idx when index will do. Don't use firstOpt where firstOption (or firstOptimization?) will do.

It seems fairly minor when you're writing the code, but it's only a few extra keystrokes and it makes your code that much easier to decipher later on. Why waste a few brain cycles (regardless of how minor) to expand objcnt, when you could have just written objectCount in the first place? (Or object_count if you don't like CamelCase.)

(Of course there are a few exceptions. You can use i in a for loop, since it's such a common idiom. You can use abbreviations if they're at least as commonly used in speech as their expansions.)

So why is this so hard for new team members to follow sometimes? I think it largely comes down to bad examples, and there are a few reasons for this.

Computer science developed largely out of mathematics. Mathematicians love using short names. First they go through the Latin alphabet, then they capitalize all the letters, then they start stealing letters from other alphabets, and finally, like Prince, they just start making up new symbols. If they simply can't find a symbol for some concept, they might condescend to string together two or maybe three letters, but anything more than that seems to make them uncomfortable. But even very complex theorems are unlikely to use more than a few dozen variables and constants, and they've got a few hundred years of precedent, so they can get away with it.

Other sources of bad abbreviation precedents, I think, are academic papers. Because of the standard two column layout used in most journals, code examples must fit in narrow columns. I counted 36 characters in the examples here. Don't they know that impressionable young students read these things?

Finally, remember that you're not the only one who needs to read your code. Once you've moved on to newer and cooler projects someone else might have to read and maintain that code. So be polite and make their job a little bit easier -- please don't abbreviate your variable names!

Saturday, August 14, 2010

Picket fence comments

Here's a stupid trick for C++ or C99 programmers. Make your comments look like a picket fence:

// Here's the first line of my comment \\
\\ Here's the second line              //
// You can go on and on like this      \\
\\ but you have to be careful about    //
// how you end the comment, or the     \\
\\ next line might be commented out    //

Why does this work? You can't use \\ as a single line comment, can you?

Hint: this doesn't work:

// This is a comment
\\ But this isn't

Why can't you do this in C89? Because // comments weren't supported until the 1999 revision of the ISO C spec, although a number of compilers "embraced and extended" the standard (I'm looking at you, Microsoft and GNU).

Why would you do this? I guess you could use it to help document your Obfuscated C Code Contest entry. But in the end this is just another example of something you can do in C but shouldn't.

Sunday, August 8, 2010

What's the return value of memset?

I figured I'd start off with something I noticed last week for the first time: memset has a return value!

Every C programmer should be familiar with memset. You've probably used it like this:

void *memory = malloc(100);
if (NULL != memory) {
  memset(memory, 0, 100);
  doSomething(memory);
}

(You did test that malloc succeeded, right?)

I looked up memset last week, and was surprised to see that the man page said it returns void*. I'd always assumed that it was a void function. It turns out that memset returns the first argument.

So you could actually write the above code like this:

void *memory = malloc(100);
if (NULL != memory) {
  doSomething(memset(memory, 0, 100));
}

Why would you do this? Either you're running low on your budget for lines of code, or you like to confuse your collaborators.

So why does memset have a return value when it doesn't need one? I don't know, but my guess is that it's a historical artifact. In some cases it could let you avoid creating a temporary variable to store the pointer. Temporary variables take up space in the stack frame, but return values are usually passed around in registers. In the era before optimizing compilers, that mattered. But a modern compiler will remove temporary variables (or add them) all by itself -- it doesn't need a micro-optimizing programmer to tell it how to do that.