10 Billion Integers Walk Into an Array

Donald Raab
10 min read · Oct 11, 2024


How an experiment with 64-bit Pharo Smalltalk surprised me.

Storing 10 billion SmallInteger instances in an Array in Pharo Smalltalk 13.0

The 32-bit party that doesn’t want to end

I haven’t programmed professionally in Smalltalk since Y2K. I only worked professionally in 32-bit versions of Smalltalk. Every once in a while I try some things out in the Pharo dialect of Smalltalk, which has had a 64-bit version for a few years now. I waited a long time for an open source version of Smalltalk to support 64-bit addressing. I have been using 64-bit versions of Java for the past 18 years.

I’ve been wondering when Java will finally provide built-in support for large arrays, meaning arrays with a length of more than 2³¹-1. The Java Collections Framework, and all frameworks that implement the built-in collection interfaces like Collection, List, Set, and Map, max out at an int size. There can be at most 2³¹-1 things stored in an array in Java. (Update: I learned from a post on Hacker News from Stuart Marks that the max array size in the Java HotSpot JVM is actually 2³¹-2. I confirmed this in a test.) For all intents and purposes, this imposes a limit of slightly more than 2 billion objects or primitives in an array in Java. There are third-party libraries that provide BigArrays, like fastutil, and we can simulate large arrays ourselves using multi-dimensional arrays in Java, but this becomes trickier to manage. It is better to use a well-tested third-party library than to recreate code with hard-to-test, hard-to-find bugs.
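
To give a sense of what “trickier to manage” means, here is a minimal, hypothetical sketch of simulating a large long array with a two-dimensional array. The class name and chunk size are made up for illustration; real libraries like fastutil handle this properly, along with growth, bounds checks, and element types other than long.

// Hypothetical sketch of a chunked "big array" backed by long[][].
// Only illustrates the indexing arithmetic that a library would hide for you.
public class ChunkedLongArray
{
    private static final int CHUNK_BITS = 27;            // 2^27 = ~134M longs per chunk
    private static final int CHUNK_SIZE = 1 << CHUNK_BITS;
    private static final int CHUNK_MASK = CHUNK_SIZE - 1;

    private final long[][] chunks;
    private final long length;

    public ChunkedLongArray(long length)
    {
        this.length = length;
        int chunkCount = (int) ((length + CHUNK_SIZE - 1) >>> CHUNK_BITS);
        this.chunks = new long[chunkCount][];
        long remaining = length;
        for (int i = 0; i < chunkCount; i++)
        {
            this.chunks[i] = new long[(int) Math.min(CHUNK_SIZE, remaining)];
            remaining -= CHUNK_SIZE;
        }
    }

    public long get(long index)
    {
        return this.chunks[(int) (index >>> CHUNK_BITS)][(int) (index & CHUNK_MASK)];
    }

    public void set(long index, long value)
    {
        this.chunks[(int) (index >>> CHUNK_BITS)][(int) (index & CHUNK_MASK)] = value;
    }

    public long length()
    {
        return this.length;
    }
}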

I was curious whether 64-bit Pharo Smalltalk would be able to store more than 2³¹-1 elements in an array. I knew one way to find out. I would need a lot of RAM for this experiment. Thankfully, I bought a new MacBook Pro M2 Max last year with 96GB of RAM so I could experiment and test with larger memory footprints.

The newest MacBook Pro M3 Max supports up to 128GB of RAM today. This is an impressive leap from just over a year ago, when I bought my MacBook Pro M2 Max with 96GB of RAM. 128GB is double what I have in my 10-year-old Mac Pro trash can on my desktop, and 8x more than I had in my 10-year-old MacBook Pro laptop. The 2019 version of the Mac Pro supports up to a whopping 1.5TB of RAM. The current Apple Silicon Mac Pro (2023) offers a max configuration of 192GB of RAM, which is 3x what my 10-year-old Mac Pro has. I suspect in another 10 years we will see 256GB+ RAM on high-end commercial laptops, and easily TB+ RAM on high-end commercial desktops.

Note: Server class machines can be configured with terabytes of RAM today.

Who needs more than 2 billion elements in an array?

I haven’t needed more than a hundred million elements stored in a List in the past 20 years. That is still quite a large number of things, and I needed that around 2006 for the domain I was working in. I’m sure there are some folks who deal with domains that require enormous amounts of data in memory.

Estimates say there are ~8.2 billion people on the planet today. It would currently take four Java arrays to store references to all of them in memory. It would be very expensive to hold 8.2 billion simple Person objects in memory. By simple Person object, I am thinking of a Person class with first, middle, and last names as String instances. The cost of the array storage alone would be over 65GB (~8.2 billion x 8 bytes per reference). The cost of the Person instances would far exceed the available memory on my MacBook Pro laptop, which is pretty large at 96GB. Let’s estimate maybe 8.2 billion x 32 bytes for the Person instances, which comes to ~262GB. In total, we’d need ~327GB of RAM to fit everyone with their first, middle, and last names in memory. We could pool the String names, which would probably deduplicate down to a few hundred million instances, so maybe we’d need 32GB or more for the String instances. We’re not quite there yet in terms of commercially available desktop hardware. This would be possible with high-end servers with terabytes of RAM.
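
To make the arithmetic behind those estimates explicit, here is the back-of-the-envelope version in Java. The 8 bytes per reference and 32 bytes per Person instance are assumptions, not measurements.

// Rough estimate only: 8 bytes per reference and ~32 bytes per Person
// instance are assumed sizes, not measured values.
long people = 8_200_000_000L;
long referenceBytes = people * 8L;     // array slots holding the references
long personBytes = people * 32L;       // the Person instances themselves
System.out.println(referenceBytes / 1_000_000_000.0);   // 65.6 (GB)
System.out.println(personBytes / 1_000_000_000.0);      // 262.4 (GB)
System.out.println((referenceBytes + personBytes) / 1_000_000_000.0);
// 328.0, roughly the ~327GB quoted above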

This got me wondering. What if we started smaller than a Person, and went with an object like SmallInteger in Pharo Smalltalk? I started experimenting with creating arrays larger than 2³¹-1 in Pharo. I would learn during the experiment that Pharo Smalltalk provides a great optimization for SmallInteger. Instead of storing references to SmallInteger objects, the actual values of the SmallInteger are stored inline. This felt like the promise of value types we have been hearing about from Project Valhalla in the Java space. I figured this out by doing a bit of digging and experimenting with the simple sizeInMemory method. I was confused at first when the size reported for a SmallInteger instance was zero. This suggested to me that SmallInteger was getting some special handling in the language. I was also surprised that SmallInteger went beyond the max value for an int in Java.

Transcript show: SmallInteger maxVal.

// Prints - 1152921504606846975
// SmallInteger -> 1,152,921,504,606,846,975
// Max Java long -> 9,223,372,036,854,775,807

This value seemed more like a long in Java, where Long.MAX_VALUE = 9,223,372,036,854,775,807.

The article linked here explains the maximum value of SmallInteger in Smalltalk as well as how it is stored as values instead of as references. SmallInteger uses 61 bits instead of 64.
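
That maximum is exactly 2⁶⁰ - 1, which lines up with a signed value stored in 61 bits. A quick arithmetic check, written in Java only because the numbers are easy to verify there (nothing below is Pharo-specific):

// 2^60 - 1 matches SmallInteger maxVal in Pharo, which is what you'd expect
// when 61 of the 64 bits hold the signed value and the rest are used for tagging.
BigInteger smallIntegerMax = BigInteger.ONE.shiftLeft(60).subtract(BigInteger.ONE);
System.out.println(smallIntegerMax);      // 1152921504606846975
System.out.println(Long.MAX_VALUE);       // 9223372036854775807 (2^63 - 1)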

The biggest difference between Smalltalk’s SmallInteger and Java’s long is what you get when you add one to the maximum value.

In Smalltalk, we get a LargePositiveInteger. Dynamic typing to the rescue.

Transcript cr; show: SmallInteger maxVal.
Transcript cr; show: SmallInteger maxVal class.
Transcript cr; show: SmallInteger maxVal + 1.
Transcript cr; show: (SmallInteger maxVal + 1) class.

// Prints
// 1152921504606846975
// SmallInteger
// 1152921504606846976
// LargePositiveInteger

When we add 1 to the maximum value of a SmallInteger, Smalltalk dynamically creates an instance of LargePositiveInteger. This is a benefit of a dynamic language where everything is an object.

In Java, we get a silent but potentially deadly overflow.

System.out.println(BigInteger.valueOf(Long.MAX_VALUE));
System.out.println(BigInteger.valueOf(Long.MAX_VALUE + 1));

// Prints
// 9223372036854775807
// -9223372036854775808

Adding 1 to the maximum value of a long results in an overflow, and the result becomes negative. We cannot dynamically change the type from a long to anything else in Java. What’s long is long, and what’s short is short. This is one place where static typing and primitive types fall short. We have learned to deal with it in Java.
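
To be fair, Java does give us ways to opt in to safety here. Below is a minimal sketch of two of them: Math.addExact, which throws instead of silently wrapping, and BigInteger, which grows as needed, much like Smalltalk’s LargePositiveInteger.

long max = Long.MAX_VALUE;

// Math.addExact throws ArithmeticException on overflow instead of wrapping.
try
{
    Math.addExact(max, 1L);
}
catch (ArithmeticException e)
{
    System.out.println("long overflow detected: " + e.getMessage());
}

// BigInteger never overflows; it simply allocates a bigger representation.
System.out.println(BigInteger.valueOf(max).add(BigInteger.ONE));
// Prints 9223372036854775808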

So let’s move on and see the experiments I tried.

The Experiments

I couldn’t settle for having a solution in Smalltalk without trying a solution in Java. Pharo Smalltalk gave me all the tools I needed. In Java, I used a combination of the fastutil and Eclipse Collections libraries to repeat the experiment. The great thing about Java is that the ecosystem is rich and vibrant, and folks have built solutions to most of the problems you may face.

Pharo Smalltalk Version

I started off with storing 5 billion SmallInteger instances in an Array in Pharo. Once I saw this worked, I bumped the total number up to 8.2 billion. Things still worked. Then I tried 10 billion. It still worked. I was very surprised. I didn’t think I would have enough RAM for 10 billion. This is because I didn’t understand at the time how Smalltalk treats SmallInteger “instances”.

This is the source for the 10 billion element experiment. You’ll need 96GB with around 85GB free to run this code. You can downgrade to 5 billion if you have 64GB of RAM.

|array sum|

array := (1 to: 10000000000) collect: [ :each | each * 2 ].
sum := array sum.

Transcript cr; show: array size class.
Transcript cr; show: array size.
Transcript cr; show: sum.
Transcript cr; show: sum class.

(1 to: 10000000000 by: 1000000000) do: [ :index | Transcript cr;
    show: 'index: ';
    show: index;
    show: ' value: ';
    show: (array at: index);
    show: ' ';
    show: (array at: index) class ].

Transcript cr; show: 'Array memory (bytes) ';
    show: array sizeInMemory.
Transcript cr; show: 'Elements memory (bytes) ';
    show: (array sumNumbers: #sizeInMemory).

The results for this code are as follows:

SmallInteger
10000000000
100000000010000000000
LargePositiveInteger
index: 1 value: 2 SmallInteger
index: 1000000001 value: 2000000002 SmallInteger
index: 2000000001 value: 4000000002 SmallInteger
index: 3000000001 value: 6000000002 SmallInteger
index: 4000000001 value: 8000000002 SmallInteger
index: 5000000001 value: 10000000002 SmallInteger
index: 6000000001 value: 12000000002 SmallInteger
index: 7000000001 value: 14000000002 SmallInteger
index: 8000000001 value: 16000000002 SmallInteger
index: 9000000001 value: 18000000002 SmallInteger
Array memory (bytes) 80000000016
Elements memory (bytes) 0

What these results show is that there is no extra cost for the SmallInteger instances. The SmallInteger values are inlined into the Array. So the Array storage is all we need, which is ~80GB.
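
The reported Array memory also checks out: 10 billion 8-byte slots plus what appears to be a 16-byte object header. (The 16-byte header is my reading of the number, not something I dug out of the VM.)

// 10 billion 8-byte slots, plus what looks like a 16-byte header:
System.out.println(10_000_000_000L * 8L + 16L);   // 80000000016, matches sizeInMemory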

Java w/ fastutil version

This is the source for the 10 billion element experiment in Java. You’ll need 96GB of RAM with around 85GB free to run this code. I added a command-line setting of -Xmx85g. You can downgrade to 5 billion if you have 64GB of RAM. I used fastutil for a big long list. I used Eclipse Collections to sum the values as a BigInteger. You’ll see why I needed to use BigInteger in the results below.

First, I added a Maven dependency on the fastutil library.

<dependency>
    <groupId>it.unimi.dsi</groupId>
    <artifactId>fastutil</artifactId>
    <version>8.5.14</version>
</dependency>

Then I wrote a test that uses a LongBigArrayBigList to store 10 billion longs. This is roughly the equivalent of the 10 billion element array in Smalltalk storing SmallInteger.

@Test
public void fastutilTenBillion()
{
    LongBigArrayBigList longs = new LongBigArrayBigList(10_000_000_000L);
    LongStream.rangeClosed(1, 10_000_000_000L)
            .forEach(l -> longs.add(l * 2));

    long sum = longs.longStream().sum();
    BigInteger bigSum = longs.longStream()
            .boxed()
            .collect(Collectors2.summingBigInteger(BigInteger::valueOf));

    System.out.println("size: " + longs.size64());
    System.out.println("long sum: " + sum);
    System.out.println("BigInteger sum: " + bigSum);

    for (long l = 0; l < 10_000_000_000L; l += 1_000_000_000L)
    {
        System.out.println("index: " + l + " value: " + longs.getLong(l));
    }
}

The results are as follows:

size: 10000000000
long sum: 7766279641452241920
BigInteger sum: 100000000010000000000
index: 0 value: 2
index: 1000000000 value: 2000000002
index: 2000000000 value: 4000000002
index: 3000000000 value: 6000000002
index: 4000000000 value: 8000000002
index: 5000000000 value: 10000000002
index: 6000000000 value: 12000000002
index: 7000000000 value: 14000000002
index: 8000000000 value: 16000000002
index: 9000000000 value: 18000000002

Now, the first thing that might jump out at you, if you weren’t expecting it, is that Java uses zero-based indexing and Smalltalk uses one-based indexing. The values are at the correct indexes. We just have to adjust the indexes by one when comparing the results.

The next thing you will notice is that the long sum and the BigInteger sum are different. Why is that?

The following test will show us why.

@Test
public void longMaxValue()
{
    BigInteger maxLong = BigInteger.valueOf(Long.MAX_VALUE);
    BigInteger sum = new BigInteger("100000000010000000000");
    System.out.println(maxLong);
    System.out.println(sum);
}

The results are as follows:

9223372036854775807    // Max Long Value in Java
100000000010000000000 // Sum of 10 billion long values

The maximum long is two digits shorter than the sum. Summing ten billion numbers that are doubled results in a number larger than the maximum long value. This is why I had to construct the BigInteger for the sum using a String; the value is too big for a long literal. I honestly wasn’t expecting this test to overflow a long value. That is more typical when using int values. The max long value is HUGE. Not huge enough, as it turns out, for this example.
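
The closed form makes the overflow easy to see: the sum 2 + 4 + … + 2n is n x (n + 1), and with n = 10 billion that is 100,000,000,010,000,000,000, roughly 10²⁰, while the maximum long is only about 9.2 x 10¹⁸. A quick check:

// Sum of 2 + 4 + ... + 2n is n * (n + 1); with n = 10 billion that is ~1e20,
// comfortably past Long.MAX_VALUE (~9.2e18), so the primitive long sum wraps.
BigInteger n = BigInteger.valueOf(10_000_000_000L);
System.out.println(n.multiply(n.add(BigInteger.ONE)));   // 100000000010000000000
System.out.println(Long.MAX_VALUE);                      // 9223372036854775807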

When I saw the sum was incorrect, I decided to try the Collectors2.summingBigInteger() Collector from Eclipse Collections. This worked fine, with the downside that I had to box the long values in the LongStream to Long objects and then again to BigInteger objects. It’s possible I could have stared at the code for a few more minutes and figured out a way to just use the “Three-musketeer” collect method on LongStream, but it requires a mutable result container, so I didn’t bother.
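
For the curious, here is a sketch of what that three-argument collect might have looked like, using a one-element BigInteger array as the mutable holder the signature requires. I haven’t benchmarked it; it simply avoids boxing each long to a Long first.

// Hypothetical alternative using LongStream.collect with a mutable holder.
BigInteger[] total = longs.longStream().collect(
        () -> new BigInteger[] {BigInteger.ZERO},
        (holder, value) -> holder[0] = holder[0].add(BigInteger.valueOf(value)),
        (left, right) -> left[0] = left[0].add(right[0]));
System.out.println("BigInteger sum: " + total[0]);   // 100000000010000000000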

Reflections on software imposed limits

Most developers will not need arrays that support more than 2³¹-1 elements this year. The same will probably be true next year. In 5, 10, or 20 years, this number will begin creeping up on more folks as big data problems become more commonplace, and 2³¹-1 sized collection limits may become more problematic. We have solutions built into Java if we need a sum bigger than an int or long. It is sometimes challenging to write fast and correct software when you’re faced with the limits of primitives and arrays.

In the late 1980s and early 1990s, no one needed more than 640K of RAM. Now we can get 128GB of RAM on a commercial-grade laptop. Our great-grandchildren may laugh when they hear about how we suffered using only 64-bit computing. The trend in computing hardware is always that we get more.

When we encounter hardware or software imposed limits, we have to find solutions. I’ve dealt with memory-constrained environments since I began programming professionally in the late 1980s. The hardware and software have almost always found a way to catch up just in time. When they don’t, we need to be creative.

Java has many important problems to solve. I don’t know when the max array size issue will become a priority for Java. I suspect it will be in the next decade. In the meantime, libraries like fastutil can help fill the void.

One thing I continue to admire about the dynamic nature of Smalltalk is its ability to adapt. Just add one more thing that will threaten to push you over the edge, and Smalltalk will pleasantly surprise you and return a different type that can handle it. I miss this aspect of Smalltalk.

Thanks for reading!

I am the creator of and committer for the Eclipse Collections OSS project, which is managed at the Eclipse Foundation. Eclipse Collections is open for contributions. I am writing a book this year about Eclipse Collections. Stay tuned!

Written by Donald Raab

Java Champion. Creator of the Eclipse Collections OSS Java library (https://github.com/eclipse/eclipse-collections). Inspired by Smalltalk. Opinions are my own.
