Collections

Collections are data structures that are fundamental to all types of programming. Whenever we need to refer to a group of objects, we have some kind of collection. At the core language level, Java supports collections in the form of arrays. But arrays are static and because they have a fixed length, they are awkward for groups of things that grow and shrink over the lifetime of an application. Arrays also do not represent abstract relationships between objects well. In the early days, the Java platform had only two basic classes to address these needs: the java.util.Vector class, which represents a dynamic list of objects, and the java.util.Hashtable class, which holds a map of key/value pairs. Today, Java has a more comprehensive approach to collections called the Collections Framework. The older classes still exist, but they have been retrofitted into the framework (with some eccentricities) and are generally no longer used.

Though conceptually simple, collections are one of the most powerful parts of any programming language. Collections implement data structures that lie at the heart of managing complex problems. A great deal of basic computer science is devoted to describing the most efficient ways to implement certain types of algorithms over collections. Having these tools at your disposal and understanding how to use them can make your code both much smaller and faster. It can also save you from reinventing the wheel.

Prior to Java 5, the Collections Framework had two major drawbacks. The first was that—not having generic types to work with—collections were by necessity untyped and worked only with anonymous Objects instead of real types like Dates and Strings. This meant that you had to perform a type cast every time you took an object out of a collection. This flew in the face of Java’s compile-time type safety. But in practice, this was less a problem than it was just plain cumbersome and tedious. The second issue was that, for practical reasons, collections could work only with objects and not with primitive types. This meant that any time you wanted to put a number or other primitive type into a collection, you had to store it in a wrapper class first and unpack it later upon retrieving it. The combination of these factors made code working with collections less readable and more dangerous to boot.

This all changed with the introduction of generic types and autoboxing of primitive values. First, the introduction of generic types, as we described in Chapter 8, has made it possible for truly typesafe collections to be under the control of the programmer. Second, the introduction of autoboxing and unboxing of primitive types means that you can generally treat objects and primitives as equals where collections are concerned. The combination of these new features can significantly reduce the amount of code you write and add safety as well. As we’ll see, all of the collections classes now take advantage of these features.

The Collections Framework is based around a handful of interfaces in the java.util package. These interfaces are divided into two hierarchies. The first hierarchy descends from the Collection interface. This interface (and its descendants) represents a container that holds other objects. The second, separate hierarchy is based on the Map interface, which represents a group of key/value pairs where the key can be used to retrieve the value in an efficient way.

The mother of all collections is an interface appropriately named Collection. It serves as a container that holds other objects, its elements. It doesn’t specify exactly how the objects are organized; it doesn’t say, for example, whether duplicate objects are allowed or whether the objects are ordered in any way. These kinds of details are left to child interfaces. Nevertheless, the Collection interface defines some basic operations common to all collections:

Additionally, the methods addAll(), removeAll(), and containsAll() accept another Collection and add, remove, or test for all of the elements of the supplied collection.

When using generics, the Collection type is parameterized with a specific type of element that the collection will hold. This makes a generic collection of “anything” into a specific collection of some type of element. The parameter type becomes the compile-time type of the element arguments in all of the methods of Collection (in this case, the add(), remove(), and contains() methods listed earlier). For example, in the following code, we create a Collection that works with Dates:

    Collection<Date> dates = new ArrayList<Date>(); // = new ArrayList<>() would 
                                                        // also work.
    dates.add( new Date() );
    dates.add( "foo" ) // Error; string type where Date expected!!

ArrayList is just one implementation of Collection; we’ll talk about it a bit later. The important thing is that we’ve declared the variable dates to be of the type Collection<Date>; that is, a collection of Dates, and we’ve allocated our ArrayList to match. Because our collection has been parameterized with the type Date, the add() method of the collection becomes add( Date date ) and attempting to add any type of object other than a Date to the list would have caused a compile-time error.

If you are working with very old Java code that predates generics, you can simply drop the types and perform the appropriate casts. For example:

    Collection dates = new ArrayList();
    dates.add( new Date() );  // unchecked, compile-time warning
    Date date = (Date)dates.get( 0 );

In this case, we’ll get a compile time warning that we’re using ArrayList in a potentially unsafe (nongeneric typesafe) way.

As we’ve described earlier in the book, this is essentially what the Java compiler is doing for us with generics. When using collections (or any generic classes) in this way under Java 5 or later, you will get compile-time warnings indicating that the usage is unchecked, meaning that it is possible to get an error at runtime if you have made a mistake. In this example, a mistake would not be caught until someone tried to retrieve the object from the collection and cast it to the expected type.

An iterator is an object that lets you step through a sequence of values. This kind of operation comes up so often that it is given a standard interface: java.util.Iterator. The Iterator interface has only two primary methods:

The following example shows how you could use an Iterator to print out every element of a collection:

    public void printElements(Collection c, PrintStream out) {
        Iterator iterator = c.iterator();
        while ( iterator.hasNext() )
            out.println( iterator.next() );
    }

In addition to the traversal methods, Iterator provides the ability to remove an element from a collection:

Not all iterators implement remove(). It doesn’t make sense to be able to remove an element from a read-only collection, for example. If element removal is not allowed, an UnsupportedOperationException is thrown from this method. If you call remove() before first calling next(), or if you call remove() twice in a row, you’ll get an IllegalStateException.

The Collection interface has three child interfaces. Set represents a collection in which duplicate elements are not allowed. List is a collection whose elements have a specific order. The Queue interface is a buffer for objects with a notion of a “head” element that’s next in line for processing.

A Queue is a collection that acts like a buffer for elements. The queue maintains the insertion order of items placed into it and has the notion of a “head” item. Queues may be first in, first out (FIFO) or last in, first out (LIFO) depending on the implementation:

Java 7 added Deque, which is a “double-ended” queue that supports adding, querying, and removing elements from either end of the queue (the head or the tail). Dequeue has versions of the queue methods—offer, poll, and peek—that operate on the first or last element: offerFirst(), pollFirst(), peekFirst(), offerLast(), pollLast(), peekLast(). Note that Deque extends Queue and so is still a type of Queue. If you use the plain Queue methods offer(), poll(), and peek() on a Deque, they operate as a FIFO queue. Specifically, calling offer() is equivalent to offerLast() and calling poll() or peek() is the same as calling pollFirst() or peekFirst(), respectively.

Finally, Java has a legacy Stack class that acts as a LIFO queue with “push” and “pop” operations, but Deque is generally better and should serve as a general replacement for Stack. Simply use addFirst() for “push” and pollFirst() for “pop.”

The Collections Framework also includes the java.util.Map, which is a collection of key/value pairs. Other names for map are “dictionary” or “associative array.” Maps store and retrieve elements with key values; they are very useful for things like caches or minimalist databases. When you store a value in a map, you associate a key object with a value. When you need to look up the value, the map retrieves it using the key.

With generics, a Map type is parameterized with two types: one for the keys and one for the values. The following snippet uses a HashMap, which is an efficient type of map implementation that we’ll discuss later:

    Map<String, Date> dateMap = new HashMap<String, Date>();
    dateMap.put( "today", new Date() );
    Date today = dateMap.get( "today" );

In legacy code, maps simply map Object types to Object types and require the appropriate cast to retrieve values.

The basic operations on Map are straightforward. In the following methods, the type K refers to the key parameter type and the type V refers to the value parameter type:

You can retrieve all the keys or values in the map:

Map has one child interface, SortedMap. A SortedMap maintains its key/value pairs sorted in a particular order according to the key values. It provides the subMap(), headMap(), and tailMap() methods for retrieving sorted map subsets. Like SortedSet, it also provides a comparator() method, which returns an object that determines how the map keys are sorted. We’ll talk more about that later. Java 7 adds a NavigableMap with functionality parallel to that of NavigableSet; namely, it adds methods to search the sorted elements for an element greater or lesser than a target value.

Finally, we should make it clear that although related, Map is not literally a type of Collection (Map does not extend the Collection interface). You might wonder why. All of the methods of the Collection interface would appear to make sense for Map, except for iterator(). A Map, again, has two sets of objects: keys and values, and separate iterators for each. This is why a Map does not implement Collection.

One more note about maps: some map implementations (including Java’s standard HashMap) allow null to be used as a key or value, but others may not.

Up until this point, we’ve talked only about interfaces. But you can’t instantiate interfaces; you need concrete implementations. Of course, the Collections Framework includes useful implementations of all of the collections interfaces. In some cases, there are several alternatives from which to choose. To understand the tradeoffs between these implementations, it helps to have a basic understanding of a few of the most common data structures used in all programming: arrays, linked lists, trees, and hash maps. Many books have been written about these data structures, and we will not drag you into the mind-numbing details here. We’ll hit the highlights briefly as a prelude to our discussion of the Java implementations. We should stress before we go on that the differences in the implementations of Java collections are only significant when working with very large numbers of elements or with extreme time sensitivity. For the most part, they all behave well enough to be interchangeable.

A linked list, shown in Figure 11-3, holds its elements in a chain of nodes, each referencing the node before and after it (if any). In this way, the linked list forms an ordered collection that looks something like an array. Unlike the magic of an array, however, to retrieve an element from a linked list, you must traverse the list from either the head or tail to reach the correct spot. As you might have guessed, this is a linear-time operation that gets more expensive as the number of elements grows. The flip side is that once you’re at the spot, inserting or deleting an element is a piece of cake: simply change the references and you’re done. This means that insertions and deletions—at least near the head and tail of a linked list—are said to be in constant time.

Linked lists are useful when you are doing a lot of insertions or deletions on a collection. An interesting variation on the basic linked list is the “skip list,” which is a kind of linked list that maintains a hierarchy of references spanning increasing ranges of elements instead of only pointing to the next element in the chain. The idea is that when you need to jump to the middle, you can use one of these “express lane” pointers to jump to the approximate location and then move forward or backward with finer granularity as needed by descending the pointer hierarchy. Java 7 adds skip list implementations of the NavigableMap and NavigableSet interfaces in the java.util.concurrent package for concurrent programming.

Hash maps are strange and magical beasts. A hash map (or hash table, as it is also called) uses a mathematical hash algorithm applied to its key value to distribute its element values into a series of “buckets.” The algorithm relies on the hash algorithm to distribute the elements as uniformly (randomly) as possible. To retrieve an element by its key simply involves searching the correct bucket. Because the hash calculation is fast and can have a large number of buckets, there are few elements to search and retrieval is very fast. As we described in Chapter 7, all Java Objects have a hash value as determined by the hashCode() method. We’ll say more about hash codes and key values for maps later in this chapter.

Hash map performance is governed by many factors, including the sophistication of the hash algorithm implemented by its elements (see Figure 11-5). In general, with a good hash function implementation, the Java HashMap operates in constant time for putting and retrieving elements. Hash maps are fast at mapping unsorted collections.

Table 11-7 lists the implementations of the Java Collections Framework by interface type.

ArrayList and LinkedList provide the array and linked list implementations of the List interface that was described earlier. ArrayList is satisfactory for most purposes, but you should use LinkedList when you plan to do a lot of insertions or deletions at various points in the list.

HashSet and HashMap provide a good hash map implementation of the Set and Map interfaces. The LinkedHashSet and LinkedHashMap implementations combine the hash algorithm with a linked list that maintains the insertion order of the elements. Note that these linked collections are ordered, but not sorted collections.

TreeSet and TreeMap maintain sorted collections using a tree data structure. In the case of TreeMap, it is the key values that are sorted. The sorting is accomplished by a comparator object. We’ll discuss sorting later in this chapter.

Queue is implemented both by LinkedList (which implements List, Queue, and—as of Java 7, Deque) and PriorityQueue. A PriorityQueue’s prioritization comes from a sorting order determined by a comparator supplied with its constructor. Elements that sort “least” or “lowest” have the highest priority. The various implementations of BlockingQueue mirror these for concurrency-aware queues.

Finally, IdentityHashMap is an alternate type of HashMap that uses object identity instead of object equality to determine which keys match which objects. Normally, any two objects that test equal with equals() operate as the same key in a Map. With IdentityHashMap, only the original object instance retrieves the element. We’ll talk about hash codes and keys more in the next section.

We should also mention three specialized collections that we’ll talk about later: EnumSet and EnumMap are specifically designed to work with Java enumerations. WeakHashMap uses weak references to cooperate with Java garbage collection.

The term hash in Hashtable and HashMap refers to the key hash value that these collections use to make their associations. Specifically, an element in a Hashtable or HashMap is not associated with a key strictly by the key object’s identity but rather by a function of the key’s contents. This allows keys that are equivalent to access the same object. By “equivalent,” we mean those objects that compare true with equals(). If you store an object in a Hashtable using one object as a key, you can use any other object that equals() tells you is equivalent to retrieve the stored object.

It’s easy to see why equivalence is important if you remember our discussion of strings. You may create two String objects that have the same characters in them but that are different objects in Java. In this case, the == operator tells you that the String objects are different, but the equals() method of the String class tells you that they are equivalent. Because they are equivalent, if we store an object in a HashMap using one of the String objects as a key, we can retrieve it using the other.

The hash code of an object makes this association based on content. As we mentioned in Chapter 7, the hash code is like a fingerprint of the object’s data content. HashMap uses it to store the objects so that they can be retrieved efficiently. The hash code is nothing more than a number (an integer) that is a function of the data. The number is always the same for identical data, but the hashing function is intentionally designed to generate as different (random looking) a number as possible for different combinations of data. In other words, a very small change in the data should produce a big difference in the number. It should be unlikely that two nonidentical datasets, even very similar ones, would produce the same hash code.

As we described earlier, internally, HashMap really just keeps a number of lists of objects, but it puts objects into the lists based on their hash code. When it wants to find the object again, it can look at the hash code and know immediately how to get to the appropriate list. The HashMap still might end up with a number of objects to examine, but the list should be short. For each object in the short list it finds, it does the following comparison to see if the key matches:

    if ((keyHashcode == storedKeyHashcode) && key.equals(storedKey))
        return object;

There is no prescribed way to generate hash codes. The only requirement is that they be somewhat randomly distributed and reproducible (based on the data). This means that two objects that are not the same could end up with the same hash code by accident. This is unlikely (there are 232 possible integer values); moreover, it shouldn’t cause a problem because as you can see in the preceding snippet, the HashMap ultimately checks the actual keys using equals(), as well as the hash codes, to find the match. Therefore, even if two key objects have the same hash code, they can still coexist in the HashMap as long as they don’t test equal to one another as well. (To put it another way, if two keys’ hashcodes are the same and the equals method says they are the same, then they will be considered the same key and retrieve the same value object.)

Hash codes are computed by an object’s hashCode() method, which is inherited from the Object class if it isn’t overridden. The default hashCode() method simply assigns each object instance a unique number to be used as a hash code. If a class does not override this method, each instance of the class will have a unique hash code. This goes along well with the default implementation of equals() in Object, which only compares objects for identity using ==; the effect being that these arbitrary objects serve as unique keys in maps.

You must override equals() in any classes for which equivalence of different objects is meaningful. Likewise, if you want equivalent objects to serve as equivalent keys, you must override the hashCode() method as well to return identical hash code values. To do this, you need to create some suitably randomizing, arbitrary function of the contents of your object. The only criterion for the function is that it should be almost certain to return different values for objects with different data, but the same value for objects with identical data.

The java.util.Collections class contains important static utility methods for working with Sets and Maps. All the methods in Collections operate on interfaces, so they work regardless of the actual implementation classes you’re using. The first methods we’ll look at involve creating synchronized versions of our collections.

Most of the default collection implementations are not synchronized; that is, they are not safe for concurrent access by multiple threads. The reason for this is performance. In many applications, there is no need for synchronization, so the Collections API does not provide it by default. Instead, you can create a synchronized version of any collection using the following methods of the Collections class:

    public static Collection synchronizedCollection(Collection c)
    public static Set synchronizedSet(Set s)
    public static List synchronizedList(List list)
    public static Map synchronizedMap(Map m)
    public static SortedSet synchronizedSortedSet(SortedSet s)
    public static SortedMap synchronizedSortedMap(SortedMap m)

These methods return synchronized, threadsafe versions of the supplied collection, by wrapping them (in a new object that implements the same interface and delegates the calls to the underlying collection). For example, the following shows how to create a threadsafe List:

    List list = new ArrayList();
    List syncList = Collections.synchronizedList(list);

Multiple threads can call methods on this list safely and they will block as necessary to wait for the other threads to complete.

In contrast to the norm, the older Hashtable and Vector collections are synchronized by default (and, therefore, may be a bit slower when that’s not needed). The “copy on write” collection implementations that we’ll talk about later also do not require synchronization for their special applications. Finally, the ConcurrentHashMap and ConcurrentLinkedQueue implementations that we’ll cover later are threadsafe and designed specifically to support a high degree of concurrent access without incurring a significant penalty for their internal synchronization.

You can use the Collections class to create read-only versions of any collection:

    public static Collection unmodifiableCollection(Collection c)
    public static Set unmodifiableSet(Set s)
    public static List unmodifiableList(List list)
    public static Map unmodifiableMap(Map m)
    public static SortedSet unmodifiableSortedSet(SortedSet s)
    public static SortedMap unmodifiableSortedMap(SortedMap m)

Making unmodifiable versions of collections is a useful way to ensure that a collection handed off to another part of your code is not modified intentionally or inadvertently. Attempting to modify a read-only collection results in an UnsupportedOperationException.

In Chapter 5, we introduced the idea of weak references—object references that don’t prevent their objects from being removed by the garbage collector. WeakHashMap is an implementation of Map that makes use of weak references in its keys and values. This means that you don’t have to remove key/value pairs from a Map when you’re finished with them. Normally, if you removed all references to a key object from the rest of your application, the Map would still contain a reference and keep the object “alive,” preventing garbage collection. WeakHashMap changes this; once you remove all references to a key object from the rest of the application, the WeakHashMap lets go of it, too and both the key and its corresponding value (if it is similarly unreferenced) are eligible for garbage collection.

EnumSet and EnumMap are collections designed to work specifically with the limited domain of objects defined by a Java enumerated type. (Enums are discussed in Chapter 5.) Java enums are Java objects and there is no reason you can’t use them as keys or values in collections otherwise. However, the EnumSet and EnumMap classes are highly optimized, taking advantage of the knowledge that the set or keys in the map, respectively, may be one of only a few individual objects. With this knowledge, storage can be compact and fast using bit fields internally. The idea is to make using collection operations on enumerations efficient enough to replace the general usage pattern of bit-flags and make binary logic operations unnecessary. Instead of using:

    int flags = getFlags();
    if ( flags & ( Constants.ERROR | Constants.WARNING ) != 0 )

we could use set operations:

    EnumSet flags = getFlags();
    if ( flags.contains( Constants.Error) || 
        flags.contains( Constants.Warning ) )

This code may not be as terse, but it is easier to understand and should be just as fast.

The Collections utilities include methods for performing common operations like sorting. Sorting comes in two varieties:

The sorted collections we discussed earlier, SortedSet and SortedMap, maintain their collections in a specified order using the Comparable interface of their elements. If the elements do not implement Comparable, you must supply a Comparator object yourself in the constructor of the implementation. For example:

    Comparator myComparator = ...
    SortedSet mySet = new TreeSet( myComparator );

Collections give you some other interesting capabilities, too. If you’re interested in learning more, check out the min(), max(), binarySearch(), and reverse() methods.

Collections is a bread-and-butter topic, which means it’s hard to create exciting examples. The example in this section reads a text file, parses all its words, counts the number of occurrences, sorts them, and writes the results to another file. It gives you a good feel for how to use collections in your own programs. This example also shows features including generics, autoboxing, and the Scanner API.

    import java.io.*;
    import java.util.*;
     
    public class WordSort
    {
      public static void main(String[] args) throws IOException
      {
        if ( args.length < 2 ) {
          System.out.println("Usage: WordSort inputfile outputfile");
          return;
        }
        String inputfile = args[0];
        String outputfile = args[1];
     
        /*  Create the word map. Each key is a word and each value is an
            Integer that represents the number of times the word occurs
            in the input file.
        */
        Map<String,Integer> map = new TreeMap<>();
     
        Scanner scanner = new Scanner( new File(inputfile) );
        while ( scanner.hasNext() ) {
            String word = scanner.next();
            Integer count = map.get( word );
            count = ( count == null ? 1 : count +1 );
            map.put( word, count );
        }
        scanner.close();
     
        // get the map's keys
        List<String> keys = new ArrayList<>( map.keySet() );
     
        // write the results to the output file
        PrintWriter out = new PrintWriter( new FileWriter(outputfile) );
        for ( String key : keys )
          out.println( key + " : " + map.get(key) );
        out.close();
      }
    }

Suppose, for example, that you have an input file named Ian Moore.txt:

    Well it was my love that kept you going
    Kept you strong enough to fall
    And it was my heart you were breaking
    When he hurt your pride

    So how does it feel
    How does it feel
    How does it feel
    How does it feel

You could run the example on this file using the following command:

    % java WordSort "Ian Moore.txt" count.txt

The output file, count.txt, looks like this:

    And : 1
    How : 3
    Kept : 1
    So : 1
    Well : 1
    When : 1
    breaking : 1
    does : 4
    enough : 1
    ...

The results are case-sensitive: “How” and “how” are recorded separately. You could modify this behavior by converting words to all lowercase after retrieving them from the Scanner.