Julian Hyde on Streaming Data, Open Source OLAP. And stuff.: Roll your own high-performance Java collections classes

Friday, June 03, 2011

Roll your own high-performance Java collections classes

The Java collections framework is great. You can create maps, sets, lists with various element types, various performance characteristics (e.g. if you want O(1) insert, use a linked list), iterate over them, and you can decorate them to give them other behaviors.

But suppose that you want to create a high-performance, memory efficient immutable list of integers? You'd write

List<Integer> list =

  Collections.unmodifiableList(

    new ArrayList(

      Arrays.asList(1000, 1001, 1002)));

There will be 6 objects allocated in the JVM: three Integer objects, an array Object[3] to hold the Integers, an ArrayList, and an UnmodifiableRandomAccessList. Not to mention the Arrays.ArrayList and Integer[3] used to construct the list and quickly thrown away.

The resulting list is no longer high-performance. A call to say 'int n = list.get(2)' requires 3 method calls (UnmodifiableRandomAccessList.get, ArrayList.get, Integer.intValue) and 3 indirections. And the sheer number of objects created reduces the chance that a given stretch of code will be able to operate solely from the contents of L1 cache.

So, what next? Should I write my own class, like this?

public class UnmodifiableNativeIntArrayList

  implements List<Integer>

{

  ...

}

Well, maybe. But there are rather a lot of variations to cover, and each one needs to be hand-coded and tested.

Do I use library code? I searched and turned up Apache Commons Primitives, Primitive Collections for Java (PCJ), and GNU Trove (trove4j). Of these, only GNU Trove is still active.

None of the libraries supports features such as maps with two or more keys, unmodifiable collections, synchronized collections, flat collections (similar to Apache Flat3Map). It's not surprising that they don't: each combination of features would require its own class, so the size of the jar file would grow exponentially.

So, I'd like to propose an alternate approach. You configure a factory, specifying the precise kind of collection you would like, and the factory generates the collection class in bytecode. You can use the factory to quickly create as many instances of the collection as you wish. The collection implements the Java collections interfaces, plus additional interfaces that allow you to efficiently access the collection without boxing/unboxing.

The above example would be written as follows:

// Initialize the factory when the program is loaded.

// Then the bytecode gets generated just once.

static final Factory factory =

  new FactoryBuilder()

    .list()

    .elementType(Integer.TYPE)

    .modifiable(false)

    .factory();



int[] ints = {1000, 1001, 1002};

IntList list = factory.createIntList(ints);

Variants are expressed as FactoryBuilder methods:

FactoryBuilder FactoryBuilder.list()
FactoryBuilder FactoryBuilder.map()
FactoryBuilder FactoryBuilder.set()
FactoryBuilder FactoryBuilder.keyType(Class...) (for maps only)
FactoryBuilder FactoryBuilder.valueType(Class...) (for maps only)
FactoryBuilder FactoryBuilder.elementType(Class...) (for list and set only)
FactoryBuilder FactoryBuilder.sorted(boolean) (cf. the difference between Set and SortedSet)
FactoryBuilder FactoryBuilder.deterministic(boolean) (cf. the difference between HashMap and LinkedHashMap)
FactoryBuilder FactoryBuilder.modifiable(boolean)
FactoryBuilder FactoryBuilder.fixedSize(boolean) (cf. the difference between Flat3Map and Map)
FactoryBuilder FactoryBuilder.synchronized(boolean)

And so forth. Additional variants could be added as the project evolved. Templates could be fine-tuned for particular combinations of variants.

The projects I mentioned above clearly use a template system, and we could use and extend those templates. The janino facility can easily convert the generated java code into bytecode. And the JVM would be able to apply JIT (just-in-time compilation) to these classes; in fact, these classes would be more amenable to compilation, because they would be compact and final.

The existing projects have invested a lot of effort designing high-performance collections. I'd like to build on that work; this project could even be an extension to those projects.

I'd like to hear if you're interested in working with me on this.

12 comments:

Daniel Lemire said...: This is interesting, and I might be willing to work on this... But you'd have to convince me that this is not overengineered.

Ok. So you specify all the characteristics you want... but wait a minute, if you know you want sorted keys, then what stops you from going for the B-tree?

To put it differently, why isn't GNU Trove "good enough"? I know they don't support all possible variations... but they come pretty close?; 6/03/2011 1:06 PM
Daniel Lemire said...: Extra problem: if the classes really are generated dynamically, then this won't work well with some IDE such as Eclipse.; 6/03/2011 1:08 PM
Unknown said...: This is a great idea and I would love to help out. Janino lacks some features like Java generics but could very much be used for this regardless. The trick will be to come up with a predictable Java code-generator that is both generic and still high performance. I'm guessing that simplicity and predictability is key here.

Kettle has a bunch of specific classes like for mapping Long/Long using a hash index which is a lot more efficient (speed and memory) compared to the default Map implementations. It would be great if we could replace custom code like that with a single factory.; 6/03/2011 2:05 PM
Julian Hyde said...: @Daniel:

Note that I'm not committing to implement any particular recipes. This is a framework that allows coders to declare the behavior they need from collections, and the framework delivers the best implementation that it can.

At first, the best implementation of most combinations of properties would fall back and use some combination of trove4j and the Java collections. If someone believes that they have a better algorithm, they would contribute a template to generate that algorithm and also some benchmarks which prove that it is significantly better than the default implementation.

As the project evolved, and there were a few algorithms to choose from, the challenge would be curating that collection. There would need to be a set of rules for which ones to use.

But the coder who wrote their code against version 1 of the library would benefit when version 2 of the library comes up with a better implementation of their collection.

Extra problem: if the classes really are generated dynamically, then this won't work well with some IDE such as Eclipse.

We would need a few extra interfaces (e.g. IntList, to save boxing/unboxing cost) but I think most of the additional functionality (e.g. synchronized, fixed-size, copying vs. using a backing collection, sorted) would not extend the interface.

The one problem I can't think how to solve is a M-ary to N-ary map. Each such map would need its own interface, available at compile time, in order to prevent boxing overhead. For example, consider 3-ary to 2-ary maps. One such class is IntLongObjectKeyDoubleIntMap, and it has a method put(int k0, long k1, Object k2, double v0, int v1). Since there are 9 types (8 primitives: byte, char, short, int, long, float, double, boolean, plus Object), there are 9^5 = 59,049 combinations. Far too many to be in the jar file. But I'm sure that we can find a solution.; 6/03/2011 3:01 PM
Julian Hyde said...: @Matt: Thanks for the support. I kind of suspected that you had already done a lot of work in this area in Kettle.

I ping you when I have come with a project name and a minimal framework. Then you can see whether you can easily slot your classes in.; 6/03/2011 3:25 PM
Daniel Lemire said...: Ok. This sounds like fun.; 6/03/2011 6:32 PM
Anonymous said...: What about Guava?; 6/05/2011 8:53 PM
Julian Hyde said...: @Anonymous:

I wasn't aware of Guava. Thanks for bringing it to my attention.

The Apache license is a plus. Trove and fastutil are LGPL, which I find restricts a project's adoption these days.

It is interesting that MapMaker uses a similar approach to what I am proposing: use a builder to specify the physical properties of the collection, then call makeMap. MapMaker is concerned with the dynamic behavior of the map (e.g. weak keys, concurrency, generating values on demand), which is different things than I am interested in. But there still might be a fit.

However, I couldn't find many primitive collections in Guava. I found Ints.asList(int...) but I couldn't find say an int-set or int-to-double map. Can you point them out if I've missed them?; 6/06/2011 10:47 AM
rossjudson said...: Julian - check out fastutil. Pretty sure that it has all the native collection stuff you want. It's a template-driven system.; 6/08/2011 8:44 PM
Julian Hyde said...: Thanks for the tip. I've been checking out fastutil for a few days. It looks more promising than trove4j and guava.

First impressions of fastutil. It is template-driven, but the templates are invoked at library build time. (So you end up with a jar file with over 2,000 classes.)

Even with 2,000 classes, lots of things you might want from a collection are not covered (flat lists, synchronization, read-only). It would certainly benefit from my generate-on-the-fly approach.

Template expansion uses Bourne shell scripts and the C preprocessor. That doesn't fit into a Java runtime very easily.

And the license is BSD, from fastutil release 6 onwards. Yay!

The problem Daniel mentioned still exists, whichever framework we use. You need quite a few interfaces to access primitive collections without boxing/unboxing. For example, IntList, IntSet, IntToLongMap, IntIterator, IntListIterator, IntToLongMapIterator. Since people need to write code against these, they can't be generated on the fly. However, I think the numbers are manageable unless we allow M-to-N maps.; 6/09/2011 10:00 AM
Julian said...: Why do you write:
List list =
Collections.unmodifiableList(
new ArrayList(
Arrays.asList(1000, 1001, 1002)));

When you can write:
List list =
Collections.unmodifiableList(
Arrays.asList(1000, 1001, 1002));; 10/02/2011 12:35 AM
Julian Hyde said...: Julian,

(What is your full name by the way?)

Suppose you need to create a list that could not be modified by anyone, not even the original creator of the list. And suppose you write the following:

List immutableList(Integer[] integers) {
return Collections.unmodifibleList(
Arrays.asList(integers));
}

If someone modifies one of the elements of "integers", then the contents of the list will change. The list would be unmodifiable (in the sense that it cannot be changed via the list API) but not immutable (in the sense that it will never change for any reason whatsoever). So, I added an extra copy into an ArrayList to solve that problem.

I should have made that clear in the original post. But frankly, your simpler case using just an Arrays.ArrayList makes my point just as well, so if I was writing the post again, I would use your example.

Julian; 10/02/2011 8:23 AM