Data in Java

Every application deals with data, and within certain domains, the handling of data dominates everything. It doesn’t matter how fancy your “web-scale dependency-injected synergetic micro-service” [1] is if you don’t manage data correctly. Even so, dealing with data in Java turns out to be harder than you’d expect.

Let’s take a deep dive through a few of the available options.

Minimalism

The most basic way to store data is to use a plain old Java object (POJO) with the bare necessities:

class Flight {
  String airline;
  String id;
  String destination;
  Set<String> delays;
}

What is the problem with this approach? Simple classes like this are not equivalent to structs in languages like C/Go/Rust; you can’t make many assumptions about their behavior.

The most apparent issue is that it is quite error-prone to initialize this data structure. Let’s have a look at an example:

Flight flight = new Flight();
flight.airline = "BA";
flight.id = "1759";
flight.destination = "Birmingham";

Have we initialized all of the fields? If delays is empty (unlikely), should we set it to null or Collections.emptySet() or a mutable set? If we add a field in the future, will we remember to set it? Does this result in a reasonable .toString(), .hashCode() or .equals()?
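To make those questions concrete, here is a small sketch of what the bare class does by default. The names BareFlight and BareFlightDemo are made up for illustration; the fields mirror the class above.

```java
import java.util.Set;

// Hypothetical copy of the minimal class above, renamed for this demo.
class BareFlight {
  String airline;
  String id;
  String destination;
  Set<String> delays;
}

class BareFlightDemo {
  // Builds a flight the same way as the snippet above: delays is never set.
  static BareFlight birmingham() {
    BareFlight flight = new BareFlight();
    flight.airline = "BA";
    flight.id = "1759";
    flight.destination = "Birmingham";
    return flight;
  }

  public static void main(String[] args) {
    BareFlight a = birmingham();
    BareFlight b = birmingham();

    // The forgotten field silently defaults to null.
    System.out.println(a.delays == null);

    // equals() and hashCode() are inherited from Object and compare identity,
    // so two flights with identical contents are not equal.
    System.out.println(a.equals(b));

    // toString() is inherited too: class name plus a hash, no field values.
    System.out.println(a);
  }
}
```

None of this is a compile error, which is exactly the problem: the mistakes only show up later, when the object is used.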

Encapsulation

Some common patterns have emerged to address these and related problems:

  • Create a constructor to offer a place to put validation of — and constraints on — the data.
  • Expose properties as “getters” à la airline() (or the older style getAirline()) so that properties can be mapped to/from different underlying fields.
  • Hide the object constructor and expose factory methods that allow you to create instances in different ways.
  • Make the fields immutable to reduce the API surface and aid with concurrent usage patterns.
  • Create a reasonable .toString()/.hashCode()/.equals() implementation using all of the fields.

The resulting structure looks something like this:

class Flight {
  private final String airline;
  // ... other fields
  
  private Flight(String airline/*, other fields */) {
    this.airline = airline;
    // ... other fields
  }
  
  public String airline() { return airline; }
  
  // ... other fields
  
  @Override
  public String toString() {
    return "Flight{airline=" + airline /* + other fields */ + "}";
  }
  
  @Override
  public boolean equals(Object that) { /* ... */ }
  
  @Override
  public int hashCode() { /* ... */ }
}

This is a lot of boilerplate! The class is a lot safer to use, but there is a new burden on the developer to type out all of that code, not make any mistakes, and keep everything up to date when new fields are added or changed.
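For a sense of scale, here is roughly what a filled-in version looks like for just three of the fields. This is a sketch, not the one true way to write it: the class name ManualFlight and the factory names of and onTime are my own, and it assumes Java 10 or later for Set.copyOf.

```java
import java.util.Objects;
import java.util.Set;

// Hypothetical hand-written value class; all names are for illustration only.
final class ManualFlight {
  private final String airline;
  private final String id;
  private final Set<String> delays;

  private ManualFlight(String airline, String id, Set<String> delays) {
    // Validate here so that no instance can exist in a bad state.
    this.airline = Objects.requireNonNull(airline, "airline");
    this.id = Objects.requireNonNull(id, "id");
    // Set.copyOf returns an unmodifiable copy and rejects null elements.
    this.delays = Set.copyOf(delays);
  }

  // Factory methods offer different ways to create an instance.
  static ManualFlight of(String airline, String id, Set<String> delays) {
    return new ManualFlight(airline, id, delays);
  }

  static ManualFlight onTime(String airline, String id) {
    return new ManualFlight(airline, id, Set.of());
  }

  public String airline() { return airline; }
  public String id() { return id; }
  public Set<String> delays() { return delays; }

  @Override
  public String toString() {
    return "ManualFlight{airline=" + airline + ", id=" + id + ", delays=" + delays + "}";
  }

  @Override
  public boolean equals(Object that) {
    if (this == that) return true;
    if (!(that instanceof ManualFlight)) return false;
    ManualFlight other = (ManualFlight) that;
    return airline.equals(other.airline)
        && id.equals(other.id)
        && delays.equals(other.delays);
  }

  @Override
  public int hashCode() {
    return Objects.hash(airline, id, delays);
  }
}
```

Every new field has to be threaded through the constructor, both factories, toString(), equals() and hashCode(), and forgetting any one of them still compiles.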

Code generation

There are several libraries that help solve this problem. Some use reflection or other “interesting” hacks to do the work at runtime. Others run at compile time and generate all of the necessary boilerplate. The compile-time approach is both safer and faster: the generated code is plain Java that the compiler checks like any other source, so no reflective machinery runs at runtime to cost time or hide bugs.
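To illustrate what the runtime flavor looks like, here is a minimal sketch of a reflective toString() helper. It is not taken from any particular library; ReflectiveToString and Ticket are hypothetical names, and the point is only that every call walks the fields reflectively and that failures can only surface while the program runs.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.StringJoiner;

// Hypothetical runtime helper, written only to illustrate the reflective approach.
final class ReflectiveToString {
  static String describe(Object value) {
    Class<?> type = value.getClass();
    StringJoiner joiner = new StringJoiner(", ", type.getSimpleName() + "{", "}");
    for (Field field : type.getDeclaredFields()) {
      // Skip compiler-generated and static fields; they are not part of the data.
      if (field.isSynthetic() || Modifier.isStatic(field.getModifiers())) {
        continue;
      }
      try {
        field.setAccessible(true); // needed to read private fields
        joiner.add(field.getName() + "=" + field.get(value));
      } catch (ReflectiveOperationException | RuntimeException e) {
        // Access problems are discovered here, at runtime, not by the compiler.
        joiner.add(field.getName() + "=<inaccessible>");
      }
    }
    return joiner.toString();
  }
}

// Hypothetical sample type to run the helper against.
class Ticket {
  private final String airline = "BA";
  private final String id = "1759";
}

class ReflectiveToStringDemo {
  public static void main(String[] args) {
    System.out.println(ReflectiveToString.describe(new Ticket()));
  }
}
```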

The ones I have used in the past include:

Lombok is a compiler plugin for the various versions of javac. I’ve found it to be unreliable, and it requires plugins for most of the major IDEs (like IntelliJ IDEA or Eclipse). For that reason, I’ve always preferred not to use it for big projects.

@AutoMatter uses annotation processing on an interface, and generates a few implementations of that interface (such as a builder and value class) that you can use in your code. The boilerplate above is thus reduced to simply:

import io.norberg.automatter.AutoMatter;
import java.util.Set;

@AutoMatter
public interface Flight {
  String airline();
  String id();
  String destination();
  Set<String> delays();
}

You can use the generated classes like so:

Flight flight = new FlightBuilder()
    .airline("SAS")
    .id("SK903")
    .destination("EWR")
    .build();

// Assumes: import static java.lang.System.out;
out.println("Airline: " + flight.airline());
out.println("ID: " + flight.id());
out.println("Destination: " + flight.destination());

This is in general quite nice, and removes a lot of the burden from the developer. However, there are still some problems:

  • Closed world assumption: Since you are now using an interface, users are free to make many different implementations of Flight, so you can’t assume that if f instanceof Flight, it will also behave like a value type.
  • Data invariants: You have no control over the constraints that should apply. It might be the case that if airline is SAS, the id must start with SK. It would be nice to be able to encode that in the code.
  • API surface control: Since @AutoMatter generates all of the classes, you as a developer have no control over the object’s API surface. You are at the mercy of what the @AutoMatter library has chosen to implement.

@AutoValue solves these problems with the trade-off that a bit more boilerplate is needed:

import com.google.auto.value.AutoValue;
import com.google.common.collect.ImmutableSet;

@AutoValue
public abstract class Flight {
  Flight() {}
  
  public abstract String airline();
  public abstract String id();
  public abstract String destination();
  public abstract ImmutableSet<String> delays();
  
  public static Flight create(
      String airline,
      String id,
      String destination,
      ImmutableSet<String> delays) {
    if ("SAS".equals(airline) && !id.startsWith("SK")) {
      throw new IllegalArgumentException("...");
    }
    return new AutoValue_Flight(airline, id, destination, delays);
  }
}

Here are some interesting differences:

  • By using an abstract class with a package-private constructor instead of an interface, it becomes possible to limit subclassing of the type to its own package.
  • The generated class is not accessible to users of the type; you should create your own factory methods, which is where invariants can be enforced.
  • You have complete control of the API surface. If you want a builder with a specific signature, you have to declare its interface yourself.
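To see how the pieces fit together without running an annotation processor, here is a hand-written approximation of the pattern, cut down to two properties. SketchFlight and Generated_SketchFlight are hypothetical names, and the code that @AutoValue really generates differs in its details; this only shows the shape.

```java
import java.util.Objects;

// Hypothetical, hand-written stand-in for the annotated abstract class.
abstract class SketchFlight {
  // Package-private constructor: only classes in this package can subclass.
  SketchFlight() {}

  public abstract String airline();
  public abstract String id();

  // The only way in from outside the package, so the invariant always holds.
  public static SketchFlight create(String airline, String id) {
    Objects.requireNonNull(airline, "airline");
    Objects.requireNonNull(id, "id");
    if ("SAS".equals(airline) && !id.startsWith("SK")) {
      throw new IllegalArgumentException("SAS flight ids must start with SK: " + id);
    }
    return new Generated_SketchFlight(airline, id);
  }
}

// Package-private and final: plays the role of the generated implementation.
final class Generated_SketchFlight extends SketchFlight {
  private final String airline;
  private final String id;

  Generated_SketchFlight(String airline, String id) {
    this.airline = airline;
    this.id = id;
  }

  @Override public String airline() { return airline; }
  @Override public String id() { return id; }

  @Override
  public boolean equals(Object that) {
    if (this == that) return true;
    if (!(that instanceof SketchFlight)) return false;
    SketchFlight other = (SketchFlight) that;
    return airline.equals(other.airline()) && id.equals(other.id());
  }

  @Override
  public int hashCode() {
    return Objects.hash(airline, id);
  }

  @Override
  public String toString() {
    return "SketchFlight{airline=" + airline + ", id=" + id + "}";
  }
}
```

Because callers outside the package can neither subclass SketchFlight nor reach the implementation class, every instance they ever see has passed through create().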

@AutoValue is also, in general, more production-hardened than @AutoMatter: it has a plugin system, integrates with common libraries such as Guava, and handles corner cases such as generics, existential quantification, and obscure primitive types quite well.

In summary, @AutoValue is my go-to tool for creating value types in Java, with @AutoMatter being useful in some cases when the boilerplate of @AutoValue becomes too much to bear.


  1. If you use any of those words unironically, this blog is not for you.