Next: , Up: Behind the scenes

6.12.1 How Arrays Work

Smalltalk provides a very adequate selection of predefined classes from which to choose. Eventually, however, you will find the need to code a new basic data structure. Because Smalltalk's most fundamental storage allocation facilities are arrays, it is important that you understand how to use them to gain efficient access to this kind of storage.

The Array Class. Our examples have already shown the Array class, and its use is fairly obvious. For many applications, it will fill all your needs—when you need an array in a new class, you keep an instance variable, allocate a new Array and assign it to the variable, and then send array accesses via the instance variable.

This technique even works for string-like objects, although it is wasteful of storage. An Array object uses a Smalltalk pointer for each slot in the array; its exact size is transparent to the programmer, but you can generally guess that it'll be roughly the word size of your machine. 1 For storing an array of characters, therefore, an Array works but is inefficient.

Arrays at a Lower Level. So let's step down to a lower level of data structure. A ByteArray is much like an Array, but each slot holds only an integer from 0 to 255-and each slot uses only a byte of storage. If you only needed to store small quantities in each array slot, this would therefore be a much more efficient choice than an Array. As you might guess, this is the type of array which a String uses.

Aha! But when you go back to chapter 9 and look at the Smalltalk hierarchy, you notice that String does not inherit from ByteArray. To see why, we must delve down yet another level, and arrive at the basic methods for setting up the structure of the instances of a class.

When we implemented our NiledArray example, we used <shape: #inherit>. The shape is exactly that: the fundamental structure of Smalltalk objects created within a given class. Let's consider the differences in the next sub-sections.

The default shape specifies the simplest Smalltalk object. The object consists only of the storage needed to hold the instance variables. In C, this would be a simple structure with zero or more scalar fields.2.
Storage is still allocated for any instance variables, but the objects of the class must be created with a new: message. The number passed as an argument to new: causes the new object, in addition to the space for instance variables, to also have that many slots of unnamed (indexed) storage allocated. The analog in C would be to have a dynamically allocated structure with some scalar fields, followed at its end by a array of pointers.
The storage allocated as specified by new: is an array of bytes. The analog in C would be a dynamically allocated structure with scalar fields3, followed by a array of char.
The storage allocated as specified by new: is an array of C unsigned longs, which are represented in Smalltalk by Integer objects. The analog in C would be a dynamically allocated structure with scalar fields, followed by an array of long. This kind of subclass is only used in a few places in Smalltalk.
The storage allocated as specified by new: is an array of characters. The analog in C would be a dynamically allocated structure with scalar fields, followed by a array of char.

There are many more shapes for more specialized usage. All of them specify the same kind of “array-like” behavior, with different data types.

How to access this new arrays? You already know how to access instance variables—by name. But there doesn't seem to be a name for this new storage. The way an object accesses it is to send itself array-type messages like at:, at:put:, and so forth.

The problem is when an object wants to add a new level of interpretation to these messages. Consider a Dictionary—it is a pointer-holding object but its at: message is in terms of a key, not an integer index of its storage. Since it has redefined the at: message, how does it access its fundamental storage?

The answer is that Smalltalk has defined basicAt: and basicAt:put:, which will access the basic storage even when the at: and at:put: messages have been defined to provide a different abstraction.

This can get pretty confusing in the abstract, so let's do an example to show how it's pretty simple in practice. Smalltalk arrays tend to start at 1; let's define an array type whose permissible range is arbitrary.

        ArrayedCollection subclass: RangedArray [
            | offset |
            <comment: 'I am an Array whose base is arbitrary'>
            RangedArray class >> new: size [
                <category: 'instance creation'>
                ^self new: size base: 1
            RangedArray class >> new: size base: b [
                <category: 'instance creation'>
                ^(super new: size) init: b
            init: b [
                <category: 'init'>
                offset := (b - 1).   "- 1 because basicAt: works with a 1 base"
           rangeCheck: i [
                <category: 'basic'>
                (i <= offset) | (i > (offset + self basicSize)) ifTrue: [
                    'Bad index value: ' printOn: stderr.
                    i printOn: stderr.
                    Character nl printOn: stderr.
                    ^self error: 'illegal index'
            at: [
                self rangeCheck: i.
                ^self basicAt: i - offset
            at: i put: v [
                self rangeCheck: i.
                ^self basicAt: i- offset put: v

The code has two parts; an initialization, which simply records what index you wish the array to start with, and the at: messages, which adjust the requested index so that the underlying storage receives its 1-based index instead. We've included a range check; its utility will demonstrate itself in a moment:

        a := RangedArray new: 10 base: 5.
        a at: 5 put: 0
        a at: 4 put: 1

Since 4 is below our base of 5, a range check error occurs. But this check can catch more than just our own misbehavior!

        a do: [:x| x printNl]

Our do: message handling is broken! The stack backtrace pretty much tells the story:


Our code received a do: message. We didn't define one, so we inherited the existing do: handling. We see that an Integer loop was constructed, that a code block was invoked, and that our own at: code was invoked. When we range checked, we trapped an illegal index. Just by coincidence, this version of our range checking code also dumps the index. We see that do: has assumed that all arrays start at 1.

The immediate fix is obvious; we implement our own do:

        RangedArray extend [
            do: aBlock [
                <category: 'basic'>
                1 to: (self basicSize) do: [:x|
                    aBlock value: (self basicAt: x)

But the issues start to run deep. If our parent class believed that it knew enough to assume a starting index of 14, why didn't it also assume that it could call basicAt:? The answer is that of the two choices, the designer of the parent class chose the one which was less likely to cause trouble; in fact all standard Smalltalk collections do have indices starting at 1, yet not all of them are implemented so that calling basicAt: would work.5

Object-oriented methodology says that one object should be entirely opaque to another. But what sort of privacy should there be between a higher class and its subclasses? How many assumption can a subclass make about its superclass, and how many can the superclass make before it begins infringing on the sovereignty of its subclasses?

Alas, there are rarely easy answers, and this is just an example. For this particular problem, there is an easy solution. When the storage need not be accessed with peak efficiency, you can use the existing array classes. When every access counts, having the storage be an integral part of your own object allows for the quickest access—but remember that when you move into this area, inheritance and polymorphism become trickier, as each level must coordinate its use of the underlying array with other levels.


[1] For gnu Smalltalk, the size of a C long, which is usually 32 bits.

[2] C requires one or more; zero is allowed in Smalltalk

[3] This is not always true for other Smalltalk implementations, who don't allow instance variables in variableByteSubclasses and variableWordSubclasses.

[4] Actually, in gnu Smalltalk do: is not the only message assuming that.

[5] Some of these classes actually redefine do: for performance reasons, but they would work even if the parent class' implementation of do: was kept.