The Virtual Machinist

Flickr, beards and feature recognition

2014-12-16T10:46:00.001-05:00

So it looks like Flickr is applying image recognition beyond face recognition. A couple of months ago they demoed their "Park or Bird?" feature recognition system, but I hadn't realized they'd put anything into production until today.

Yesterday, I uploaded some old photos from mid 2000 that I'd taken with my first digital camera, and had recently recovered (thanks Dana). This morning I noticed that someone had found the photo below by searching for "Beard". That seemed odd, since I hadn't tagged any of the old photos. The only way that Flickr could know that Gac has a beard is by looking at the photo itself.

Something feels a little bit creepy about this, but on the other hand it's also pretty cool.

I would like to see Flickr expose this more explicitly, and give me an opportunity to edit these automatically added tags.

Picasa

2014-11-16T20:30:00.003-05:00

I learned something about Picasa today: when you edit a photo, it stores all the modifications as extra fields in the JPEG file and doesn't modify the displayed image. Until you export the image from Picasa, other tools only see the original image.

I have a bunch of photos from before 2006 that I salvaged from an old Thinkpad, and which I'd touched up with Picasa, and I want to upload them to Flickr. But because the touch-ups are all in metadata, none of the images reflect this.

I downloaded the latest version of Picasa for OS X, imported all of these photos, and amazingly it seems to have correctly applied the changes (which were made with a much older version of the tool). They're not pixel-for-pixel identical -- I presume that some of the enhancement algorithms have changed -- but they're damn close.

The case of the missing Citibikes

2014-08-22T07:51:00.000-04:00

The example of my friend and colleague Ben, with his amazing I Quant NY blog, has motivated me to try my hand at some open data hacking. Ben's written several posts where he analyzes Citibike bike share data. Citibike has made all of their trip data through the end of May 2014 available for free download. I'm a huge fan of New York's bike share program and of their open data policy.

Ben has analyzed trips and stations, but I have a different question: how many of New York's shared bikes have been stolen or lost?

The New York Post reports that bikes are routinely stolen from Manhattan stations and ridden to underserved parts of Brooklyn and Queens. I'm not too concerned about these bikes: they're recovered quickly, and Citibike may wish to treat the bikes' eventual destinations as a kind of desire line. Clearly Crown Heights residents can't wait for the program to expand to their neighborhood.

In July, the 109th precinct proudly reported that their detectives had detected a 68-year-old man riding a Citibike which had been liberated and repainted. The suspected thief was detained and his ride confiscated. That's what I'm looking for!

What happens when you steal a @CitibikeNYC, do a bad job repainting it & ride it around the #109pct ? You get Caught! pic.twitter.com/qLxcOChtKF
— NYPD 109th Precinct (@NYPD109Pct) July 25, 2014

So I got myself the trip data, put together a quick-and-dirty python script, and identified the first and last trip for each bike in the system. I presume that if a bike is stolen or destroyed it will disappear from the trip data, so we can guess that if a bike hasn't been ridden in some time, it's likely gone AWOL.
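The original script was Python and isn't reproduced here. As a rough C sketch of the same idea, assuming each trip has already been reduced to a bike id and a month index (the Trip struct, the max_bike_id bound and the function name are all made up for illustration):

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* One row of trip data, reduced to the two fields this analysis needs.
 * month is a hypothetical index (0 = first month of data). */
typedef struct {
    int bike_id;
    int month;
} Trip;

/* For every bike, find the month of its last recorded trip, then count how
 * many bikes had their final ride in each month. Returns 0 on success and
 * -1 if memory could not be allocated. Assumes 0 <= bike_id <= max_bike_id
 * and 0 <= month < months for every trip. */
int count_final_rides(const Trip *trips, size_t n_trips, int max_bike_id,
                      int *final_rides, int months)
{
    assert(max_bike_id >= 0 && months > 0);
    size_t n_bikes = (size_t)max_bike_id + 1;
    int *last_month = malloc(n_bikes * sizeof *last_month);
    if (last_month == NULL)
        return -1;

    for (size_t b = 0; b < n_bikes; b++)
        last_month[b] = -1;               /* -1: bike never seen */

    /* Remember the latest month in which each bike appears. */
    for (size_t i = 0; i < n_trips; i++) {
        int id = trips[i].bike_id;
        if (trips[i].month > last_month[id])
            last_month[id] = trips[i].month;
    }

    /* Histogram of final months. */
    for (int m = 0; m < months; m++)
        final_rides[m] = 0;
    for (size_t b = 0; b < n_bikes; b++) {
        if (last_month[b] >= 0)
            final_rides[last_month[b]]++;
    }

    free(last_month);
    return 0;
}
```

A bike whose final month is well before the end of the data is a candidate for "gone AWOL".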

Note that the bikeid field in the data doesn't appear to match the number stenciled on the bike's frame. It could correspond to the electronic identifier (probably an RFID tag) which the stations use to identify bikes. If that's the case, missing trip data could simply indicate that the electronics were damaged and replaced.

There are 6943 unique numbers in the trip data. This is roughly consistent with a New York Times story, published when the program launched, reporting 6,000 bikes in the system.

If we sort the bikes by their final trip, we can quickly get an estimate of losses.

Month	Final rides	Month	Final rides
2013-07	18	2014-01	26
2013-08	36	2014-02	60
2013-09	17	2014-03	68
2013-10	31	2014-04	428
2013-11	41	2014-05	6186
2013-12	32

The vast majority of the bikes showed activity in May 2014, meaning that they weren't stolen or lost. Before April, each month saw between 17 and 68 final rides, averaging 36.5 each month.

At first glance, April appears to have been a disastrous month for Citibike thefts. But a more likely explanation could be that those bikes have been removed for maintenance. If Citibike keeps 300-400 bikes in their warehouse for routine tuneups, and if it takes two months for the bikes to rotate back out into service, it could easily explain most of the 428 bikes which were ridden in April, but idle in May. We would expect most of them to return to service in June.

February and March also saw higher than average losses. Perhaps bike thieves are more active in those months, but this may be better explained by the unusually snowy winter. Plows, for example, may have taken a toll on the fleet.

Citibike can, at least in theory, bill a rider $1200 for failing to return a bike. If they collected this fee for each bike which went missing before April, they'd have raised nearly $400,000. However I've yet to hear of any rider receiving such a bill.

Assuming that these final trips do represent theft and loss, approximately half of 1% of Citibikes are lost each month, or about 6% every year. That's far better than the reputed 80% of Paris's Vélib' bikes which were stolen in that system's first year!
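The arithmetic behind these estimates can be checked with a few lines of C. This is only a sketch: the monthly figures are copied from the table, and the constant and function names are invented for illustration.

```c
#include <stddef.h>

/* Final rides per month from the table, July 2013 through March 2014. */
static const int FINAL_RIDES_BEFORE_APRIL[] = { 18, 36, 17, 31, 41, 32, 26, 60, 68 };
#define MONTHS_BEFORE_APRIL \
    (sizeof FINAL_RIDES_BEFORE_APRIL / sizeof FINAL_RIDES_BEFORE_APRIL[0])

#define FLEET_SIZE 6943         /* unique bike ids in the trip data */
#define LOST_BIKE_FEE_USD 1200  /* fee Citibike can bill, per the post */

/* Bikes whose final ride was before April 2014. */
int total_missing(void)
{
    int total = 0;
    for (size_t i = 0; i < MONTHS_BEFORE_APRIL; i++)
        total += FINAL_RIDES_BEFORE_APRIL[i];
    return total;
}

/* Average number of final rides per month over the same period. */
double average_per_month(void)
{
    return (double)total_missing() / (double)MONTHS_BEFORE_APRIL;
}

/* What Citibike would collect if every missing bike were billed. */
long potential_fees_usd(void)
{
    return (long)total_missing() * LOST_BIKE_FEE_USD;
}

/* Average monthly losses as a percentage of the fleet. */
double monthly_loss_percent(void)
{
    return 100.0 * average_per_month() / FLEET_SIZE;
}
```

With these inputs the total is 329 bikes, the potential fees come to $394,800, and the monthly rate is a little over 0.5% of the fleet, or roughly 6% a year.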

Update: the original version of the table showed 116 trips which ended in June. This is because there were a handful of trips which started on May 31 but finished after midnight, and were thus credited to the next month. To make it less confusing, I've merged these final rides into the data for May.

Variadic macros and trailing commas

2013-06-03T18:43:00.000-04:00

One of the great new (ok, 14-year-old) features of C99 is the variadic macro. This feature finally provides a portable way to write a logging macro, for example. Here's a simple example:

#define LOGIFY(format, ...) \
    fprintf(stderr, "LOG: " format "\n", \
        __VA_ARGS__)
...
LOGIFY("Found %d widgets.", widgetCount);
...

However, there's a subtle problem here. What if we call LOGIFY with no variadic arguments?

LOGIFY("Finished.");

This expands to:

fprintf(stderr, "LOG: Finished.\n",);

And that doesn't compile, because there's a trailing comma.

GCC provides an extension to the language which works around this problem. For that compiler, you can use token pasting to make the comma magically disappear:

#define LOGIFY(format, ...) \
    fprintf(stderr, "LOG: " format "\n", \
        ##__VA_ARGS__)

However this is non-portable.

StackOverflow suggests that you can make the format part of the variadic part of the macro:

#define LOGIFY(...) \
    fprintf(stderr, "LOG: " __VA_ARGS__)

However, this doesn't work for my example because I want to be able to paste a newline onto the end of the format string. Since the format string isn't isolated from the arguments, I can't append anything to it.
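One possible portable variant, offered as a sketch rather than a recommendation, keeps the format inside the variadic part and writes the newline with a separate call. LOGIFY_TO and logify_demo are names invented here; the explicit stream parameter is only there so the output can be redirected.

```c
#include <stdio.h>

/* Hypothetical variant of LOGIFY: the format string stays inside
 * __VA_ARGS__, so there is never a dangling comma, and the prefix and
 * newline are written by separate calls. */
#define LOGIFY_TO(stream, ...)              \
    do {                                    \
        fputs("LOG: ", (stream));           \
        fprintf((stream), __VA_ARGS__);     \
        fputc('\n', (stream));              \
    } while (0)

#define LOGIFY(...) LOGIFY_TO(stderr, __VA_ARGS__)

/* Example calls, with and without extra arguments. */
void logify_demo(FILE *out, int widgetCount)
{
    LOGIFY_TO(out, "Found %d widgets.", widgetCount);
    LOGIFY_TO(out, "Finished.");
}
```

The cost is that one log line now takes three stdio calls, so output from different threads can interleave between them, which the single fprintf() version avoids.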

An ugly work-around is to simply pass in an extra, unused argument if you have no real arguments:

LOGIFY("Finished.", 0);

It's always harmless to pass extra arguments to a variadic function, and this does compile, but it doesn't seem very elegant.

If the ANSI C committee ever revisits this, or if I design my own language, they (or I) could resolve this problem by permitting trailing commas in function calls (and function declarations). C already permits trailing commas in array initializers (as do Java and other C-like languages):

int fibs[] = {1,1,2,3,5,8,13,};

This is convenient for a number of reasons, including conditional inclusion (#ifdef) and source code generation. It can also be a good habit to always include a trailing comma, especially when declaring arrays of strings. Without a trailing comma, there's a risk that a developer might add a new string to the end of the list without a comma. Since C concatenates adjacent string literals, the results may be unexpected!

const char * messages[] = {
    "No error", // 0; added in 2002
    "First legacy error", // added in 2002
    "Other legacy error" // added in 2002
    "Brand new error" // new in 2013
};

The same problem can occur with numbers if sign tokens (+/-) are used.
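Both pitfalls are easy to observe by counting the elements the compiler actually sees. A small sketch (the offsets array and the helper functions are invented for illustration):

```c
#include <stddef.h>

/* Intended: four messages. The comma is missing after the third string, so
 * the last two literals are concatenated into one. */
static const char *const messages[] = {
    "No error",
    "First legacy error",
    "Other legacy error"
    "Brand new error"
};

/* Intended: { 10, 20, -5 }. Without the comma this is { 10, 20 - 5 }. */
static const int offsets[] = {
    10,
    20
    -5
};

size_t message_count(void) { return sizeof messages / sizeof messages[0]; }
size_t offset_count(void)  { return sizeof offsets / sizeof offsets[0]; }

/* Last element of each array, to show what the merge produced. */
const char *last_message(void) { return messages[message_count() - 1]; }
int last_offset(void) { return offsets[offset_count() - 1]; }
```

Here message_count() is 3 rather than 4, and offset_count() is 2 with a last element of 15. Neither case is an error as far as the language is concerned, so nothing forces the compiler to complain.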

Why not support trailing commas in functions, too? It would be more consistent, and there are a number of not-immediately-obvious advantages. In addition to this variadic macro case, consider the case where one or more of the arguments is conditionally included:

void
doWork(
    int howMuchWork,
    WorkType whatKindOfWork,
#ifdef THREAD_SAFE
    bool useLocks
#endif
    )
...

This doesn't compile if THREAD_SAFE isn't defined because there's an unused comma after the second argument. If C permitted trailing commas here we'd avoid awkward constructs like moving the comma into the conditional code (and even that wouldn't work if each of the arguments were conditional!).
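For reference, the comma-in-the-conditional workaround looks roughly like this. WorkType is a placeholder typedef and the body is a dummy that returns int only so the sketch has something to check; neither comes from real code.

```c
#include <stdbool.h>

typedef int WorkType;   /* placeholder; the real type is not shown */

/* The comma travels with the conditional parameter, so the parameter list
 * is valid whether or not THREAD_SAFE is defined. */
int
doWork(
    int howMuchWork,
    WorkType whatKindOfWork
#ifdef THREAD_SAFE
    , bool useLocks
#endif
    )
{
#ifdef THREAD_SAFE
    (void)useLocks;     /* unused in this sketch */
#endif
    return howMuchWork * whatKindOfWork;   /* dummy body */
}
```

It compiles either way, but the leading comma is easy to misread, and the trick has no good answer when the first parameter is the conditional one.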

Although it's not consistent with the normal usage of commas in prose, the trailing comma is surprisingly useful in programming languages, and ought to be supported more widely.

Square Root: My solution

2013-03-25T07:58:00.000-04:00

Wouter Coekaerts's square root puzzle presents a difficult challenge. We need to figure out the number he's thinking of without seeing the number. The only question we can ask is: "is it this number?" (In fact, it's a bit worse than that: when we ask that question, he just writes the answer on the console if we're right.) There are a few possible toeholds here where we could get started. I'll discuss some I rejected before I get to my actual solution.

Wouter gives us the source code for the answer() method, so we know how he's testing the answer. We also know how he generated the answer: a BigInteger of approximately 10000 random bits generated from a SecureRandom.

The easiest approach would be to just look at the answer. However the answer (actually the square of the answer) is stored in a private field. We can't get at it without using reflection, JNI, sun.misc.Unsafe, or something else of that nature. The challenge doesn't permit those types of attacks. What about accessing it as an inner class? Inner classes can access private fields of their enclosing classes, so could we trick the JVM into thinking our code is an inner class of square.Square? You can't do this, because inner classes are just a fiction created by the compiler. It might look like you can access private fields of your enclosing class, but the compiler actually adds hidden accessors for those fields which it calls under the covers. This approach won't work.

What if we could influence the answer? Can we control what SecureRandom does? We can do this by providing a different source of entropy from the command line or by modifying /dev/random. However I'm not sure that these can be done with a security manager in place or without root access, and I think that they violate the "no security vulnerabilities" spirit of the challenge.

One promising approach relies on the lack of a final modifier on BigInteger. What if we don't pass in a BigInteger for the answer, but pass in our own subclass of BigInteger? For example, could we override BigInteger.equals() to always return true? This looked like a promising line of attack, and I think it might actually work in some implementations. Unfortunately, the way Wouter wrote answer() he doesn't directly invoke any methods on the candidate object we control, and OpenJDK's implementation of BigInteger.divide() doesn't call any methods on the divisor object either, so there are no opportunities there (however GNU Classpath's BigInteger.divide() does call methods on the divisor). BigInteger probably should have been final, but unfortunately this oversight doesn't seem to be sufficient to solve this puzzle. (If answer() had been implemented as root.equals(n.divide(root)) instead of n.divide(root).equals(root), this solution would have worked well, because our own equals() would be invoked.)

If we can't subclass BigInteger, can we just replace it? We can write our own implementation of BigInteger which always returns true for equals, but we'd need to replace the built-in one. This would require modifying the bootstrap classpath with -Xbootclasspath/p:, which I think violates the rules, or at least the spirit, of the challenge.

What about a simple brute force approach? I was actually tempted to submit this. We know the upper bound of the answer (2¹⁰⁰⁰⁰), so a simple for loop ought to suffice. Assuming we can test one answer per microsecond, wolframalpha estimates that it would take 6.322×10²⁹⁹⁶ years, or 4.6×10²⁹⁸⁶ times the age of the universe. The solution is provably correct, if somewhat impractical.

When I was looking at the implementation of BigInteger.divide() to evaluate the feasibility of the subclassing attack, I noticed that there are a few special cases at the top of the method. Consider a/b. If a = b, the result of integral division will always be 1. If a < b, the result of integral division is always 0. If a > b, the code is much more complicated. This suggested that a timing attack might be feasible: we should be able to determine if our candidate is less than Wouter's recorded number by how long it takes for the answer() function to complete.

Note that this attack is dependent on a number of things. We know how Wouter is testing the answer: n.divide(root).equals(root). If he'd done the more efficient n.equals(root.multiply(root)) instead, this attack wouldn't work (or would be much more difficult). It also relies on the JDK not doing unnecessary work for division when the dividend is less than the divisor. In Java 6 the JDK still calculates the remainder for this case (and discards it), making the attack more difficult but still feasible. In Java 7 the timing difference is more pronounced.

Timing the function is also quite challenging. For one thing, JVMs use JIT compilers which may compile code multiple times, with increasing levels of optimization, as they discover what's important. Therefore it's important to give the JIT plenty of opportunity to compile the code before we start trying to time it. I did most of my initial testing using -Xint, which disables the JIT, to eliminate this variable until I was confident that the solution worked. Garbage collection can also interfere with the timing, but it can only increase the time. If we only look at minimum times, and do enough measurements, it shouldn't matter. Other programs running on the machine can interfere with our timing, as can power management features. I tried running the solution on my MacBook Pro at first, but the timing was all over the place. In the end I used an isolated Linux x86-64 Xeon X5690 (Westmere) machine and Hotspot Java 1.7.0_09_b05 to test the solution. It was much more reliable.

Once the JIT is warmed up so we can get reliable timing, the next step is to calculate a baseline. I do this by generating my own random 20000-bit number (the square of a 10000-bit number is a 20000-bit number) and testing how long it takes to divide that by a number slightly lower than it. I repeat this a million times and record the minimum time it took to run my own answer() function, which is identical to Wouter's. Now we have the baseline for a 'slow' divide.

Someone with a better statistical background than me could probably come up with a robust way of distinguishing between 'fast' and 'slow' division. I just used a heuristic and ran the test many times. If the divide time is >= 75% of the fastest baseline time, I consider it slow. If it's <= 50% of the fastest baseline time, I consider it fast. If it's somewhere in between I run the measurement again until I get a conclusive result. I short circuit for fast times, since I assume that these must be fast due to the number being tested, but not for slow times, since they could be slow for any number of the reasons discussed above.
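The actual solution is Java; the heuristic itself is language-neutral and can be sketched in C. The 50% and 75% thresholds come from the description above. Everything else, including the callback type, the replay helper and the rule of judging 'slow' by the minimum of a fixed number of samples, is an assumption made for this sketch.

```c
#include <limits.h>

typedef enum { TIMING_FAST, TIMING_SLOW, TIMING_INCONCLUSIVE } Timing;

/* Classify one sample against the fastest 'slow' baseline (both in ns). */
Timing classify_timing(long sample_ns, long baseline_ns)
{
    if (4 * sample_ns >= 3 * baseline_ns)   /* sample >= 75% of baseline */
        return TIMING_SLOW;
    if (2 * sample_ns <= baseline_ns)       /* sample <= 50% of baseline */
        return TIMING_FAST;
    return TIMING_INCONCLUSIVE;
}

/* Hypothetical measurement callback: returns one timing sample in ns. */
typedef long (*MeasureFn)(void *ctx);

/* Take up to max_trials samples. A single fast sample is decisive, because
 * noise only ever adds time. Slow samples are not decisive on their own, so
 * the verdict is based on the minimum over all trials. */
Timing classify_repeated(MeasureFn measure, void *ctx,
                         long baseline_ns, int max_trials)
{
    long min_ns = LONG_MAX;
    if (max_trials <= 0)
        return TIMING_INCONCLUSIVE;
    for (int i = 0; i < max_trials; i++) {
        long t = measure(ctx);
        if (classify_timing(t, baseline_ns) == TIMING_FAST)
            return TIMING_FAST;             /* short circuit */
        if (t < min_ns)
            min_ns = t;
    }
    return classify_timing(min_ns, baseline_ns);
}

/* Replays canned samples, for trying the logic without a real timer. */
typedef struct {
    const long *samples;
    int next;
} Replay;

long replay_measure(void *ctx)
{
    Replay *r = ctx;
    return r->samples[r->next++];
}
```

An inconclusive result from classify_repeated() would simply be retried by the caller.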

Then I start to identify bits. I start with the highest possible bit and work my way down. Whenever I discover that the divide is fast with a particular bit set, and slow with the same bit clear, I know that we've identified the next bit in n. When the first stage of my program finishes (it takes about 80 minutes), it should have identified the highest number for which answer() is 'slow'. In other words, we know n-1.
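Stripped of the timing, the bit-by-bit search is a greedy walk from the top bit down. A C sketch, with a 64-bit integer standing in for BigInteger and a direct comparison standing in for the timing measurement (the oracle type and both function names are invented here):

```c
#include <stdbool.h>
#include <stdint.h>

/* Oracle: returns true when dividing the secret by `candidate` takes the
 * slow path, i.e. when candidate < secret. In the real attack this would be
 * a timing measurement; here it is a hypothetical callback. */
typedef bool (*SlowOracle)(uint64_t candidate, void *ctx);

/* Recover the largest value for which the oracle reports 'slow', working
 * from the highest bit down. For a secret n >= 1 that fits in `bits` bits
 * this is n - 1. Assumes 1 <= bits <= 64. */
uint64_t largest_slow_candidate(SlowOracle is_slow, void *ctx, int bits)
{
    uint64_t known = 0;
    for (int bit = bits - 1; bit >= 0; bit--) {
        uint64_t trial = known | (UINT64_C(1) << bit);
        if (is_slow(trial, ctx))
            known = trial;      /* still below the secret: keep the bit */
        /* otherwise the divide was fast, so the bit stays clear */
    }
    return known;
}

/* Stand-in oracle for experiments: compares against the secret directly. */
bool simulated_oracle(uint64_t candidate, void *ctx)
{
    const uint64_t *secret = ctx;
    return candidate < *secret;
}
```

Each bit costs one oracle query here. In the timing version each query is many measurements, which is presumably where the 80 minutes go.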

Obviously determining n from n-1 is trivial, but Wouter doesn't actually want us to find n: he wants the square root of n. Unfortunately, there's no built-in library function to do this. I could have implemented Newton's method to solve the square root, but fortunately someone else already did this for me: BigIntegerMath in Google's Guava has a very efficient implementation that handles a lot of corner cases I probably wouldn't have bothered with. Wouter wanted the solution in a single class, and I didn't want to rely on libraries not included with the base JDK, so I copied the relevant code from Guava. Since this is an Apache licensed project, that's conveniently legal.

My program runs in about 80 minutes on Java 7 on an Intel Xeon Linux server. The actual time depends on the ratio of 0 bits to 1 bits in n. 0 bits can be identified much faster than 1 bits. It may or may not work on other machines without tweaking some of the timing parameters. There's no way to tell for sure from within the program if it got the right answer, since Wouter just prints something to the console, but I do check to see if it's feasible. Most integers have irrational square roots, but we know that n is a perfect square. Therefore I test that the number I identified is also a perfect square. If it's not, I repeat the whole process again. A more sophisticated solution might track a confidence level for each bit and try remeasuring those, first, but it's always found the solution on the first attempt.
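The square root and the perfect-square feasibility check are short when written for machine integers. This is a plain Newton iteration on uint64_t as an illustration, not the Guava code, which works on BigInteger and handles more corner cases:

```c
#include <stdbool.h>
#include <stdint.h>

/* Integer square root by Newton's method: returns floor(sqrt(n)). */
uint64_t isqrt_u64(uint64_t n)
{
    if (n < 2)
        return n;

    /* Start from a guess that is at least sqrt(n); the iteration then
     * decreases monotonically until it stops improving. */
    uint64_t x = n / 2 + 1;
    uint64_t y = (x + n / x) / 2;
    while (y < x) {
        x = y;
        y = (x + n / x) / 2;
    }
    return x;
}

/* Feasibility check: n can only be the recorded square if it has an
 * integer square root. */
bool is_perfect_square(uint64_t n)
{
    uint64_t r = isqrt_u64(n);
    return r * r == n;
}
```

For example, isqrt_u64(152399025) is 12345, and is_perfect_square() rejects 152399024.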

You can find the source code on GitHub.

Thanks to my colleagues at Two Sigma who provided useful insights into this problem. Andrew Berman and Yaron Gvili suggested attacking SecureRandom, either directly or via /dev/random and /dev/urandom. Yaron also discussed the brute force approach. Isaac Dooley steered me away from mean and towards min and also explored ClassLoader based solutions. Trammell Hudson encouraged me to continue with the timing attack even when initial results were frustratingly ambiguous. Another colleague who shall remain anonymous suggested I prove that P=NP, and then implement a polynomial-time solution.

Java Puzzle: Square Root

2013-03-20T21:37:00.003-04:00

I'm working on a solution to Wouter Coekaerts's square root puzzle. It's a challenging problem and will test your knowledge of Java and JVMs. I'll post my solution next week, but here's a hint about the direction I'm taking: Wouter's answer() method isn't the most efficient way to test for the correct solution.

Kobo surgery

2012-10-30T15:23:00.000-04:00

A few weeks ago my Kobo Wifi e-reader suffered an unfortunate accident. I got caught in a thunderstorm and the pocket of my messenger bag filled with water. (Good: my Brooks Barbican bag is waterproof. Bad: water can't get out once it's in.) It was a few hours before I noticed that my Kobo had been partly submerged for an extended period. The poor thing just displayed a plaintive "Please Charge Your eReader" message. Pressing the power button caused some lights to flash, but no activity on the screen. It wasn't completely dead, but it wasn't working either.

Coincidentally, the next day my friend Roo tweeted that he'd dropped his Kobo and broken the screen. I suspected that between the two of us we probably had enough parts to build a working Kobo, so I arranged to collect his e-reader so I could try to rebuild one.

Dead board (left) dead screen (right)

The first step was opening the cases. The Kobo cases just snap together, so you can pry them apart fairly easily. Be careful as you risk cracking the case if you bend it too much. I damaged the white case a bit but was able to get the black one off without any problem. The board is attached to the rear half of the case with four Phillips screws which are easily removed.

Naked Kobos

The Kobo is fairly simple inside. The e-ink screen is mounted on the circuit board and connected to it with a flexible flat cable which wraps over the right-hand side of the board to a plug on the reverse. There's a lithium-polymer battery in the lower left hand corner, and you can see the 5 buttons of the rubber navigation pad in the lower right corner.

Removing the screen.

It took me a while to figure out how the screen was attached. I was worried that it might be glued to the board. Fortunately, it's quite easy to remove once you know that it's only attached with four strips of double-sided tape.

First, unplug the flexible connector, but be careful! The plug has a plastic clamp to hold the connector in place. These break easily and without the clamp you'll get a poor connection. You need to disconnect this so that you can have unobstructed access to the edge of the screen.

I found that a thin utility blade slipped easily between the screen and the board. Just run this around all four edges and you'll loosen the tape. The tape will re-adhere pretty quickly, but it's quite easy to pry the screen off with your fingers once you've cut through the tape.

Screen detached. You can see where the four pieces of tape were.

Once both screens were off it was a simple matter to swap them. The boards also have Micro-SD cards which are used to store your books. (This is in addition to the SD expansion slot at the top of the card. This means that if you want to expand your Kobo's capacity you could probably easily replace the 2GB Micro-SD card with a larger one.) I swapped the Micro-SD cards, too.

It's alive!

While it was open, my friend Trammell pointed out that there were some connectors near the navigation pad which looked suspiciously like a serial port. (They were marked Tx and Rx, which was a bit of a give-away!) We connected it to a terminal emulator and were able to watch the Linux kernel boot as the Kobo powered on. It seemed a shame to hide that in the case, so before I put the case back on Trammell helped me make a few small modifications. We added a small hole to the case and soldered a connector onto the serial port.

Serial connector. From top to bottom: V, Tx, Rx

I haven't played much with the serial port yet, but I expect it might expose some interesting opportunities.

After swapping the Micro-SD cards the new Kobo showed all of my books in my library, but would only let me read some of them. This suggests that the DRM scheme is tied to some serial number on the device, and also that not all books are protected by DRM. I did a factory reset on the Kobo, connected it to my laptop and resynced all of my books. This made all the books readable again.

A Modest Proposal

2012-09-23T14:55:00.001-04:00

For Minimizing the Obstruction of Bike Lanes in New York, and for Making the Department of Police Beneficial to the Publick.

NYPD tow truck parked at 6th Ave and 32nd St

This morning at about 10:30 I encountered this oversized New York Police Department tow truck straddling the bike lane on 6th Avenue at 32nd Street. While I was watching, one of the officers returned to the truck with coffees, but, even caffeinated, they remained firmly in place, ignoring the cyclists (photographed) who had to detour into motor traffic to pass them. Particularly irksome is that they've pulled half way into the bike lane, but they're still obstructing a full lane of motor traffic. Instead of just blocking one lane, they're blocking two. (Also note that one of the cyclists photographed is going the wrong way, a.k.a. salmoning)

Taxis, town cars, delivery trucks, private cars and cops regularly block New York's otherwise excellent network of bike lanes (as, apparently, do fruit and vegetable carts). Many of the lanes are segregated, making it more challenging for drivers to block them, but many others, like the one pictured, are just painted and we rely on drivers' respect for the law and fellow road users to keep them clear. Like in other cities, this is a bit of wishful thinking, and given the poor example set by law enforcement it's not surprising that other drivers treat these lanes as short-term parking.

On Friday, over beers with some cow-orkers, I formulated a proposal which I think might be a pragmatic compromise to improve access to unobstructed bike lanes. First, we must recognize that we're not going to change the behavior of the police. So, instead of just complaining and building up resentment (as I've done above), let's work with them.

I propose that all unsegregated bike lanes in New York City be redesignated as police parking lanes. Let's change the bylaws so that these lanes are reserved for use by law enforcement, with an exception allowing cyclists to use them when they're not required for urgent police business (like the morning coffee run). We'll paint NYPD logos in the lanes alongside the cycling icons.

My theory is that the NYPD will be much more aggressive about ticketing motor vehicles obstructing police parking lanes. Those are their lanes! Other drivers aren't going to mess with the NYPD's parking lanes.

For cyclists, this Faustian bargain could significantly improve access to the approximately 1% of New York City's paved road surface currently designated as bike lanes. Of course we'd still have to swerve around parked cop cars, but at least the number of SUVs, FedEx trucks and taxis in the lanes would be lower.

In praise of idleness

2012-05-07T21:58:00.002-04:00

C has some odd baggage in its APIs, but one API which I think ought to be emulated more often is free(). Specifically, free(NULL).

Many programmers don't realize that you can pass a NULL pointer to free() and that this has no effect. This isn't undefined behaviour. It's specified to work like that: If ptr is a null pointer, no action shall occur.

This makes cleanup code simpler. Instead of

    if (ptr != NULL) free(ptr);

you can just use

    free(ptr);

This has a number of advantages.

First of all, it reduces the amount of code you need to write for mundane housekeeping tasks.

Secondly, it encourages good habits (freeing memory when you're done with it) by not punishing programmers for using it. If free() crashed when called with a NULL pointer (or worse, corrupted memory) that would discourage programmers from using it. While it doesn't really reward you for writing good code, at least it doesn't kick you in the shin.

Finally, it's consistent with an unfortunate misfeature of its companion, malloc(). malloc(0) is permitted to return NULL. At least you can always pass the result of malloc() to free().

Unfortunately, many other common APIs don't follow free()'s good example. close(-1) fails with EBADF instead of quietly doing nothing, and pthread_mutex_destroy(NULL) invokes undefined behaviour, for example. (And don't get me started about zero as a legal file descriptor!)

Whenever I'm designing my own APIs which include destructor-style functions, I always try to make sure that they quietly ignore NULL. It just makes life simpler for users.
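As a minimal sketch of that convention (Widget and its functions are hypothetical, invented only to show the shape):

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical resource type, used only to illustrate the convention. */
typedef struct Widget {
    char *name;
} Widget;

/* Returns a new Widget, or NULL if allocation fails. */
Widget *widget_create(const char *name)
{
    Widget *w = malloc(sizeof *w);
    if (w == NULL)
        return NULL;

    size_t len = strlen(name) + 1;
    w->name = malloc(len);
    if (w->name == NULL) {
        free(w);
        return NULL;
    }
    memcpy(w->name, name, len);
    return w;
}

/* Like free(): destroying NULL is a no-op. */
void widget_destroy(Widget *w)
{
    if (w == NULL)
        return;
    free(w->name);
    free(w);
}
```

Callers can then write widget_destroy(w) unconditionally on every cleanup path, including the one where widget_create() failed.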

Frickin' lasers!

2012-04-10T23:08:00.000-04:00

My apartment in New York is great. It's close to Central Park, close to the subway and close to Momofuku Milk Bar. But it's pretty small. Especially the kitchen. My kitchen drawers are only about six inches wide. Finding a cutlery organizer which fits in the drawer is pretty well impossible. So what to do? Build one!

Fortunately, one of my colleagues has access to a laser cutter at NYC Resistor. He helped me build a completely custom cutlery organizer which fits both my drawer and my cutlery perfectly.

Step 1: build a prototype from cardboard to check the dimensions and functionality.

Step 2: Using Inkscape, design all of the pieces (with raster images which are etched with the laser at low power).

Step 3: Bring on the laser!

Step 4: Prepare all the pieces

Step 5: Put it all together!

Step 6: A perfect fit!

Next I'm going to build a small tray to sit on top of this one, giving me a bit of additional storage space for little utensils.

C99 designated initializers

2012-03-25T16:07:00.000-04:00

One of the very nice features added in C99 is the designated initializer. This allows you to write code like the following to initialize a structure:

    div_t d = { .quot=3, .rem=2 };

In C89 there was no way to reliably initialize a structure like this. The specification says that the quot and rem members may be in any order, so if you write:

    div_t d = { 3, 2 };

you can't be sure which member will be 3 and which will be 2.

In general I'm enamored with designated initializers. But I've run into an unfortunate case where the specification is somewhat ambiguous.

Paragraph §6.7.8.19 of the C99 standard (draft version available for free here) has this to say about the ordering of designated initializers:

The initialization shall occur in initializer list order, each initializer provided for a particular subobject overriding any previously listed initializer for the same subobject;¹³⁰ all subobjects that are not initialized explicitly shall be initialized implicitly the same as objects that have static storage duration.

(Footnote 130 reads "Any initializer for the subobject which is overridden and so not used to initialize that subobject might not be evaluated at all.")

The spec says that "initialization shall occur in initializer list order". To me, this suggests that one initializer can safely rely on the result of a previous initializer, e.g.:

    div_t d = { .quot=42, .rem=d.quot };

So the following program ought to only invoke well defined behaviour, right?

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    div_t d1 = { .quot=1, .rem=2, };
    printf("d1: quot=%i, rem=%i\n", d1.quot, d1.rem);

    div_t d2 = { .quot=1, .rem=d2.quot };
    printf("d2: quot=%i, rem=%i\n", d2.quot, d2.rem);

    div_t d3 = { .rem=2, .quot=d3.rem };
    printf("d3: quot=%i, rem=%i\n", d3.quot, d3.rem);

    return 0;
}

Let's compile it and see what happens:

$ gcc -std=c99 -O3 -Wall -Wextra foo.c
$ ./a.out
d1: quot=1, rem=2
d2: quot=1, rem=1
d3: quot=2, rem=2

Excellent! That's exactly what I would expect to happen. The initializers are run in order, allowing the second designated initializer to depend on the result of the first.

Now here's where things start to get weird:

$ gcc -std=c99 -O0 -Wall -Wextra foo.c
$ ./a.out
d1: quot=1, rem=2
d2: quot=1, rem=0
d3: quot=0, rem=2

If we turn off optimization (-O0) the results suddenly change! Even worse, I've compiled with maximum warnings (-Wall -Wextra) and GCC doesn't even issue a warning about using an uninitialized variable!

How can we reconcile this behaviour with the specification? I think that we need to take paragraph §6.7.8.23 into account, as well:

The order in which any side effects occur among the initialization list expressions is unspecified.¹³¹

(Footnote 131 reads "In particular, the evaluation order need not be the same as the order of subobject initialization.")

This suggests that evaluation and initialization are two separate steps: an implementation may evaluate all of the initialization expressions (in any order), record the results in temporary storage, and then apply them all in order. If you inspect the generated assembly code this does seem to match what happens in GCC at -O0.

So why have paragraph 19 at all? I don't see how you could write a C program which can observe the initialization order, except for the special case of one designated initializer overriding another. If that is its only purpose the specification could certainly be more explicit about it. (It's also possible that the specification has actually been clarified in this respect; I don't have access to the final version.)

I've tried this test case with a handful of different compilers and get similar results. However, I would be curious to hear about results with other C99 compilers.

Update

2012-03-25T12:13:00.000-04:00

It's been a while since I updated this blog. I've been a bit busy, but new articles will start appearing shortly. Since my last post, I have left IBM Canada and joined Two Sigma Investments in New York. I'm no longer developing virtual machines, but the name and scope of the blog will remain the same. I'm still doing low-level software development and I'm learning a lot about areas I haven't investigated in depth before, and from my new colleagues.

IBM Java 7 is now available

2011-09-20T11:43:00.000-04:00

IBM Java 7 was officially released yesterday, September 19. You can download it from IBM DeveloperWorks.

There are a lot of exciting new features including a number of GC improvements.

The balanced GC policy, which I've mentioned before, is included in all 64-bit Java 7 JDKs. You can enable it with -Xgcpolicy:balanced.
The soft-realtime garbage collector is included for evaluation on Linux and AIX. It can be enabled with -Xgcpolicy:metronome.
The verbose GC format has been completely overhauled. It now provides more information and the XML format has been redesigned to make machine interpretation of the data simpler, allowing both IBM and customers to write tools to process and analyse the data.

"Don't do what Donny Don't does"

2011-09-02T17:12:00.000-04:00

Thanks to Evan Hughes for pointing out this paper: Conditional statements, looping constructs, and program comprehension: an experimental study.

Not surprisingly, negative conditions are more difficult to understand than positive conditions.

I always try to write conditions to be positive. Sometimes, I'll even include an empty 'if' block so that I can put code in the 'else' block instead of using a negative condition:

   if (isInRange(value)) {
      // expected case; do nothing
   } else {
      throw new OutOfRangeException(value);
   }

I haven't read the full paper, so maybe the researchers answered my next question: is the problem exacerbated by the syntax for negative conditions used in C-like languages? I find that the '!' operator uses very little horizontal space, making it less noticeable than other unary operators such as '~' or '*'.

e.g. would this statement:

   if (!isInRange(value)) { ...

be more obvious if it were written like this?

   if (not isInRange(value)) { ...

Balanced garbage collection

2011-08-04T10:21:00.000-04:00

We've just published an article on the IBM DeveloperWorks website describing the new garbage collection technology available in IBM Java 6 2.6 and IBM Java 7. This is a project I've been working on for several years and we're pretty excited that customers can now try it out for themselves.

You can read the article for yourself.

Reachability follow up

2011-07-15T21:59:00.000-04:00

We've been having quite an interesting series of conversations at work about this reachability problem. Today, one of my co-workers, Dan Heidinga, pointed out that Microsoft's CLR has the same issue. For the CLR, Microsoft has provided a special static method called GC.KeepAlive(Object). This acts as a hint to the virtual machine and JIT to extend an object's lifetime, but is otherwise a no-op. There's a good article about the problem on an old MSDN blog here. Note that the author considers and rejects the option of automatically extending the lifetime of all function arguments to the end of their functions, on the basis that it impacts codegen and therefore performance.

A subtle issue of reachability

2011-07-10T14:45:00.000-04:00

In the last few weeks I've run into two similar and very subtle problems in Java code. In some ways, these seem to illustrate an oversight in the design of the Java language and/or virtual machine. The problem has to do with objects being collected earlier than expected.

In one case a finalizer was run and in the other a PhantomReference was cleared. I'll describe an example based on finalization. It's easy enough to see how this could also apply to reference objects.

Consider a class like this:

public class Foo {
  private byte[] array = new byte[] { 1, 2, 3 };

  public void finalize() {
    array[0] = array[1] = array[2] = 0;
  }

  public byte[] getData() { return array.clone(); }
}

This class is able to return a copy of its data array. (This is a common pattern since it prevents the caller from modifying the master copy of the array). When instances of this class are garbage collected they wipe out the data in the array, overwriting it with zeros. (Let's ignore why it does this; it's just an example!) So far so good.

What result will we get if we invoke getData() on an instance of Foo?

  Foo f = new Foo();
  byte[] array = f.getData();
  System.out.println("array={" +
    array[0] + ", " +
    array[1] + ", " +
    array[2] + "}");

Intuitively, we expect this to print "array={1, 2, 3}". And it usually does. But is it legitimate for it to print "array={0, 0, 0}" (or even "array={1, 2, 0}")? If it did, that would mean that the object was finalized while we were still using it, wouldn't it?

Actually, that can happen. It happens quite often in the IBM Java VM, and seems to happen occasionally in Oracle HotSpot, too, but less frequently.

The Java VM is permitted to collect (or finalize) an object when it is no longer reachable. But how could the object become unreachable when we're running the getData() function? It's easier to understand if you imagine the function in-lined in the caller, and broken up into individual statements:

  Foo f = new Foo();
  byte[] masterArray = f.array; // ignore that array
                                // is private
  // what if a garbage collection happens here?
  // e.g. System.gc(); System.runFinalization();
  byte[] copyArray = masterArray.clone();
  System.out.println("array={" +
    copyArray[0] + ", " +
    copyArray[1] + ", " +
    copyArray[2] + "}");

Here we can see that if the garbage collector interrupts the program at just the right (wrong?) time, the finalize() function might run before we clone the array. Even though we don't explicitly assign null to f, a clever VM can analyze the program and determine that f is never used again. It can reclaim the memory for that object and, in this case, finalize it, before the clone() function runs. In most cases this is exactly what you want the VM to do: garbage collect objects as early as possible to recover as much memory as possible.

Ok, but is that really the same thing? Surely, the receiver of a function is kept alive until the function returns, right? In-lining the function isn't quite the same!

Actually, neither the Java language specification nor the Java Virtual Machine specification says anything about that. In the VM, the receiver of a function (i.e. this) isn't very special at all. It's just the first argument of a virtual function. Although the language doesn't allow it (keep in mind that the Java language and the Java VM have separate specifications), you can overwrite the receiver just as you can a local variable if you're writing bytecodes directly without the aid of javac:

byte[] getData() {
  byte[] masterArray = this.array;
  this = null; // not legal in Java language,
               // but is legal in class files!
  return masterArray.clone();
}

javac won't compile this, but the JVM's class file verifier won't report any problems in this function.

So, what's the right way to write your Java code so that your objects won't be finalized or collected earlier than expected? Unfortunately, I don't know the answer. You could add an extra reference to the receiver, like this:

byte[] getData() {
  byte[] result = array.clone();
  this.array = this.array;
  return result;
}

But that's a hack, not a real solution, and is unlikely to work reliably. The VM can easily determine that the dummy "this.array = this.array" statement has no effect and can be removed, leaving us exactly where we started.

Perhaps Java needs a new keyword like this:

byte[] getData() {
  keep_alive(this) {
    return array.clone();
  }
}

However, I doubt that something like that would be used correctly very often.

Unfortunately, the best advice is probably to avoid finalization whenever possible.
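
For what it's worth, Java 9 and later include a library method in the same spirit as the keep_alive idea: java.lang.ref.Reference.reachabilityFence(Object). Here is a minimal sketch of how getData() might use it, assuming a Java 9+ runtime. The KeepAliveSketch wrapper class is just a name I made up for illustration, and the finalizer is the same toy as before.

```java
import java.lang.ref.Reference;

public class KeepAliveSketch {

    static final class Foo {
        private final byte[] array = new byte[] { 1, 2, 3 };

        // Same toy finalizer as in the post: wipe the data on collection.
        // finalize() is deprecated in newer JDKs, hence the suppression.
        @SuppressWarnings({ "deprecation", "removal" })
        protected void finalize() {
            array[0] = array[1] = array[2] = 0;
        }

        byte[] getData() {
            try {
                return array.clone();
            } finally {
                // Keeps 'this' strongly reachable up to this point, so the
                // finalizer cannot run before the clone has completed.
                Reference.reachabilityFence(this);
            }
        }
    }

    public static void main(String[] args) {
        byte[] copy = new Foo().getData();
        System.out.println("array={" + copy[0] + ", " + copy[1] + ", " + copy[2] + "}");
    }
}
```

With the fence in place this prints "array={1, 2, 3}". It still relies on every method that touches the array remembering to add the fence, so the advice about avoiding finalization stands.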

Proportionality of incremental changes

2011-03-12T19:21:00.002-05:00

A few weeks ago I was watching a presentation on Project Lambda. The proposed syntax is generally quite nice. It's very simple, and avoids a lot of the noise which pollutes typical Java programs using inner classes. But one of the syntax examples didn't sit right with me, and it took me a few days to figure out exactly why.

The particular example isn't important, and I don't think it's currently in the public proposal. But what is important, I realized, is that code should be written to facilitate incremental changes, and this interfered with that.

What do I mean?

Here's an example: one of the rules in our coding guidelines is that control blocks must always use curly brackets. That is, don't write code like this:

if (condition)
  doSomething();
else
  doSomethingElse();

We require that this code be written like this, instead:

if (condition) {
  doSomething();
} else {
  doSomethingElse();
}

Here's why I don't like the syntax without the curly brackets (at least one of the reasons): it makes it harder to modify the code.

If you want to add an extra line inside the else block, you need to add the line and convert the single statement block into a compound block:

if (condition)
  doSomething();
else {
  fprintf(stderr, "Debugging doSomethingElse\n");
  doSomethingElse();
}

To add one line of code I need to modify three lines! This is a disproportionate amount of work considering the actual change.

In summary, code should be written in a way which encourages incremental changes. If small logical changes require large textual changes, then there's something wrong, either with the tools or the technique.

Overloading and varargs

2011-03-10T17:35:00.001-05:00

I learned something important today: don't overload a variadic function.

It's very tempting to have functions like this in your C++ class:

#include <cstdarg>  // for va_list

class Foo {
public:
  void write(const char* format, ...);
  void write(const char* format, va_list args);
};

The variadic function would be implemented like this:

void Foo::write(const char* format, ...) {
  va_list args;
  va_start(args, format);
  write(format, args);
  va_end(args);
}

It looks nice and clean -- a perfect application of overloading.

However there's a subtle problem! The C++ compiler will always try to resolve the method with the most specific signature when it encounters an overloaded call.

Calls like this are fine:

foo->write("%s\n", "Hello");
foo->write("...world");

But what if you have an argument which looks like a va_list? For example, if va_list is a pointer on your platform, what does this line do?

foo->write("How many chickens? Answer: %d\n", 0);

In C++, 0 isn't just a number. It's also a null pointer constant, so it converts implicitly to any pointer type. The compiler will decide that you're actually calling the non-variadic function and will use NULL for the va_list argument. Then your program will crash when you try to read an int argument from a NULL va_list.

Lesson learned. Don't overload functions if one of the functions is variadic.

Now I have to go back and fix some code I wrote yesterday...

Groan

2011-02-16T19:15:00.000-05:00

Die Klasse Namen

2010-12-31T18:55:00.002-05:00

I just noticed that the IBM JDK class libraries include these klasses:

com.ibm.security.util.DerInputStream and
com.ibm.security.util.DerValue.

Huck huck.

Ach! Der InputStream ist ein ... NuisanceStream! Der Value, Der. No one who speaks German could be an evil man!

Happy New Year!

-XX:+UseCompressedStrings explained

2010-12-28T11:38:00.000-05:00

It looks like Oracle has finally released some documentation for those options they've been using in SPECjbb2005 submissions. The doc is here and it looks like it appeared on Christmas Eve.

Like I guessed, they're using a byte[] array instead of a char[] array for Strings wherever they can.

Presumably this makes the code path more complicated, because every time the JVM deals with a String it now needs to check what kind it is. The space savings are probably worth it, at least in some applications.

Why isn't it on by default? Two possibilities:

The penalty is too high in many applications. In my opinion, this would make it a bit of a benchmark special.
The option isn't quite ready for prime time yet, but they plan to turn it on by default later.

Is this option "fair" to non-Western-European applications? I'd argue that it probably isn't unfair. A lot of String objects aren't involved in the user interface at all. In many applications, such as Eclipse, Strings are used extensively as internal identifiers for things like plug ins, extension points, user interface elements, etc. Even if your application presents a non-ASCII user interface there's a good chance that it still has a lot of ASCII strings under the surface. It might not benefit as much from this option, but it would probably still benefit.

(Of course that assumes that there's no penalty for using non-ASCII Strings beyond the extra space. If the option is implemented in an all-or-nothing fashion, e.g. if it stops using byte[] arrays the first time it encounters a non-ASCII String, then non-ASCII applications wouldn't benefit at all.)
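
To put rough numbers on the space savings, here's a small sketch in plain Java. It is not how the JVM implements the option. The isAscii and payloadBytes helpers are hypothetical names, and the 7-bit ASCII cutoff is my assumption about where the byte[] representation stops applying.

```java
public class CompressedStringSketch {

    // Hypothetical check: can every character be stored in a single byte?
    // Assumes the cutoff is 7-bit ASCII.
    static boolean isAscii(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0x7F) {
                return false;
            }
        }
        return true;
    }

    // Bytes needed for the character payload only: one byte per character
    // for ASCII strings, two (a char) otherwise. Headers are ignored.
    static int payloadBytes(String s) {
        return isAscii(s) ? s.length() : 2 * s.length();
    }

    public static void main(String[] args) {
        String id = "org.eclipse.ui.views";   // an internal identifier
        String label = "F\u00fc\u00dfe";      // user-visible, non-ASCII text
        System.out.println("identifier: " + payloadBytes(id) + " bytes");
        System.out.println("label: " + payloadBytes(label) + " bytes");
    }
}
```

The 20-character identifier needs 20 bytes instead of 40, while the 4-character non-ASCII label still needs 8. An application with a non-ASCII user interface would still save space on its internal identifiers.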

Working backwards

2010-12-24T13:59:00.001-05:00

Sometimes I get a bug that I just can't figure out. If the problem is reproducible with a good test case it's usually fairly easy to narrow the problem down pretty quickly. But what do you do if your product crashed once on a customer's server, and hasn't failed again?

Well, you start with the logs and diagnostic files. Usually we can figure it out from tracepoints and a core file. But sometimes this doesn't work.

In cases like this I don't like to throw in the towel without doing something. It feels like defeat (probably because it is). Instead, I always try to figure out what additional information could have helped me solve the problem.

How did we get to the point of failure? If I can identify two or three paths to the failure point and can't infer which one was taken I'll add some tracepoints to those paths. Or maybe I can add assertions on those paths to detect the error a bit earlier.

Since the problem isn't reproducible they won't help me now, but they might help me in the future. If the problem does occur again (and 'not reproducible' really just means 'very rare') hopefully these diagnostics will get me one step closer to the actual problem. And if it didn't hit any of my new tracepoints or assertions when it reoccurs, that's useful (and potentially maddening) too.

Of course I still might not be able to figure out what's happening. Then I add another round of tracepoints and assertions. Each failure gets me one step closer to the solution.
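
As a rough illustration of the idea, here's a sketch in Java. Everything in it is hypothetical: the two paths (viaCache and viaReload), the tiny in-memory trace buffer, and the check in use() all stand in for whatever the real product has.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class PathTraceSketch {
    // Keep only the most recent events, like a small trace buffer.
    private static final int CAPACITY = 8;
    private final Deque<String> recent = new ArrayDeque<>();

    // A "tracepoint": record that we passed through a named location.
    void trace(String event) {
        if (recent.size() == CAPACITY) {
            recent.removeFirst();
        }
        recent.addLast(event);
    }

    String dump() {
        return String.join(" -> ", recent);
    }

    // Two hypothetical paths that lead to the same failure point.
    void viaCache(int value) { trace("viaCache"); use(value); }
    void viaReload(int value) { trace("viaReload"); use(value); }

    // The check detects the bad state early and reports how we got here.
    private void use(int value) {
        if (value < 0) {
            throw new IllegalStateException("negative value, reached by: " + dump());
        }
    }

    public static void main(String[] args) {
        PathTraceSketch t = new PathTraceSketch();
        t.viaCache(1);
        try {
            t.viaReload(-1);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The next time the failure shows up, the message says which path was taken, which is exactly the piece of information that was missing the first time.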

Ambiguous Java packages and inner classes

2010-11-06T16:44:00.002-04:00

Yesterday, when I should have been paying attention to my breathing during yoga, I was instead thinking about inner classes. Specifically I was wondering how Java resolves ambiguities between package names and inner class names.

In Java, an inner class is specified using its parent's name, a dot, and its own name. For example, Foo.Bar is an inner class named "Bar" in the parent class "Foo". But this is also how Java names packages. Bar could just as easily be a top-level class in package Foo.

Note that there's no ambiguity once the code has been compiled. The class file format uses / as a package separator and $ as an inner class separator, so Foo$Bar is an inner class, and Foo/Bar is a top level class in package Foo. The ambiguity is only present in the compiler, where "." serves both purposes.
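
You can see the two naming schemes from a running program. This is a small sketch using java.util.Map.Entry, a nested interface from the standard library, as the example type. The sourceName and internalName helpers are names I made up.

```java
import java.util.Map;

public class NestedNameDemo {

    // The name as written in source code: '.' for both kinds of separator.
    static String sourceName(Class<?> c) {
        return c.getCanonicalName();
    }

    // The name as it appears inside class files: '/' between packages and
    // '$' before a nested class.
    static String internalName(Class<?> c) {
        return c.getName().replace('.', '/');
    }

    public static void main(String[] args) {
        System.out.println(sourceName(Map.Entry.class));   // java.util.Map.Entry
        System.out.println(internalName(Map.Entry.class)); // java/util/Map$Entry
    }
}
```

In the source form you can't tell whether Map is a package or a class. In the class file form you can.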

I wrote a simple test case. I created a simple Foo.java file with a public static inner class Bar, and a Foo/Bar.java file with a public class Bar. Then I wrote a test class like this:

public class Test {
        public static void main(String[] args) {
                System.out.println(new Foo.Bar());
        }
}

What happens when you compile and run this?

peter$ java Test
Foo$Bar@c53dce

Notice the '$' in the name? The inner class won, so it looks like javac resolves the ambiguity in favour of the inner class.

But what if we really want to mess with the compiler? What happens if you try to compile this file?

public class java {
        public static class lang {
                public static class Object {
                }
        }
}

Yes, the file is called java.java. None of these names are reserved in Java, so this ought to be a perfectly legitimate class named java. It has an inner class, java.lang, with its own inner class, java.lang.Object (not to be confused with java.lang.Object, the superclass of each of these classes).

javac doesn't seem to like this class very much, at least not the version installed on my Mac:

tmp peter$ javac java.java
An exception has occurred in the compiler (1.6.0_17). Please file a bug at the Java Developer Connection (http://java.sun.com/webapps/bugreport) after checking the Bug Parade for duplicates. Include your program and the following diagnostic in your report. Thank you.
java.lang.NullPointerException
    at com.sun.tools.javac.comp.Flow.visitIdent(Flow.java:1214)
    at com.sun.tools.javac.tree.JCTree$JCIdent.accept(JCTree.java:1547)
    at com.sun.tools.javac.tree.TreeScanner.scan(TreeScanner.java:35)
    ...

I don't think that this is a security hole, since the ambiguity is only present in the compiler. It's possible that there's a similar ambiguity in reflection, but I'm not sure. That might be somewhat higher risk.

Update: Eclipse compiles this java.java file with no problem. Looks like it's just an obscure bug in javac.

Further update: I'm not the first person to discover this: http://www.bodden.de/tag/name-resolution/

RIP Benoit Mandelbrot

2010-10-16T10:53:00.001-04:00

Benoit Mandelbrot, the father of fractals, is dead at 85.

When I was in high school I was fascinated by the Mandelbrot set. I wrote a program which rendered it (quite slowly) on my 286. It let you draw a line anywhere across the set and play it as musical tones. (Well, as musical as a PC's speaker could be.)