Overriding Equality Operators: A Cautionary Tale
tl;dr: This is a long-winded way of advising you to always be sure what you’re comparing when you build low-level algorithms that will be used with arbitrary generic arguments. The culprit in this case was the default comparator in a
HashSet, but it could be anything. It ends with cogitation about software processes in the real world.
Imagine that you have a framework with support for walking arbitrary object graphs in the form of a
GraphWalker. Implementations of this interface complement a generalized algorithm.
This algorithm generates nodes corresponding to various events generated by the graph traversal, like beginning or ending a node or edge or encountering a previously processed node (in the case of graphs with cycles). Such an algorithm is eminently useful for formatting graphs into a human-readable format, cloning said graphs or other forms of processing.
A crucial feature of such a
GraphWalker is to keep track of the nodes it has seen before in order to avoid traversing the same node multiple times and going into an infinite loop in graphs with cycles. For subsequent encounters with a node, the walker handles it differently—generating a reference event rather than a begin node event.
A common object graph is the AST for a programming language. The graph walker can be used to quickly analyze such ASTs for nodes that match particular conditions.
Processing a Little Language
Let’s take a look at a concrete example, with a little language that defines simple boolean expressions:
OR( (A < 2) (B > A) )
It’s just an example and we don’t really have to care about what it does, where
B came from or the syntax. What matters is the AST that we generate from it:
1 Operator (OR) 2 Operator (<) 3 Variable (A) 4 Constant (2) 5 Operator (>) 6 Constant (B) 7 Variable (A)
When the walker iterates over this tree, it generates the following events (note the numbers at the front of the line correspond to the object in the diagram above:
1 begin node 1 begin edge 2 begin node 2 begin edge 3 begin node 3 end node 4 begin node 4 end node 2 end edge 2 end node 5 begin node 5 begin edge 6 begin node 6 end node 7 begin node 7 end node 5 end edge 5 end node 1 end edge
Now that’s the event tree we expect. This is also the event tree that we get for the objects that we’ve chosen to represent our nodes (
Constant in this case). If, for example, we process the AST and pass it through a formatter for this little language, we expect to get back exactly what we put in (namely the code in Listing 1). Given the event tree, it’s quite easy to write such a formatter—namely, by handling the begin node (output the node text), begin edge (output a “(”) and end edge (output a “)”) events.
So far, so good?
Running Into Trouble
However, now imagine that we discover a bug in other code that uses these objects and we discover that when two different objects refer to the same variable, we need them to be considered equal. That is, we update the equality methods—in the case of .NET,
As soon as we do, however, the sample from Listing 1 now formats as:
OR( (A < 2) (B > ) )
Now we have to figure out what happened. A good first step is to see what the corresponding event tree looks like now. We discover the following:
1 begin node 1 begin edge 2 begin node 2 begin edge 3 begin node 3 end node 4 begin node 4 end node 2 end edge 2 end node 5 begin node 5 begin edge 6 reference 7 begin node 7 end node 5 end edge 5 end node 1 end edge
The change is highlighted and affects the sixth node, which has now become a reference because we changed how equality is handled for
Variables. The algorithm now considers any two
Variables with the same name to be equivalent even if they are two different object references.
Fix #1—Hack the Application Code)
If we look back at how we wrote the simple formatter above, we only handled the begin node, begin edge and end edge events. If we throw in a handler for the reference event and output the text of the node, we’re back in business and have “fixed” the formatter.
Fix #2—Fix the Algorithm
But we ignore the more subtle problem at our own peril: namely, that the graph walking-code is fragile in that its behavior changes due to seemingly unrelated changes in the arguments that are passed. Though we have a quick fix above, we need to think about providing more stability in the algorithm—especially if we’re providers of low-level framework functionality.
The walker algorithm uses a
HashSet to track the nodes that it has previously encountered. However, the default comparator—again, in .NET—leans on the equality functions of the objects stored in the map to determine membership.
The first solution—or rather, the second one, as we already “fixed” the problem with what amounts to a hack above by outputting references as well—is to change the equality comparator for the
HashSet to explicitly compare references. We make that change and we can once again remove the hack because the algorithm no longer generates references for subsequent variable encounters.
Fix #3—Giving the Caller More Control
However, we’re still not done. We’ve now not only gotten our code running but we’ve fixed the code for the algorithm itself so the same problem won’t crop up again in other instances. That’s not bad for a day’s work, but there’s still a nagging problem.
What happens if the behavior that was considered unexpected in this case is exactly the behavior that another use of the algorithm expects? That is, it may well be that other types of graph walker will actually want to be able to control what is and is not a reference by changing the equivalence functions for the nodes.
Luckily, callers of the algorithm already pass in the graph walker itself, the methods of which the algorithm already calls to process nodes and edges. A simple solution is to add a method to the graph walker interface to ask it to create the kind of
HashSet that it would like to use to track references.
Tough Decisions: Which Fix to Use?
So how much time does this all take to do? Well, the first solution—the hack in application code—is the quickest, with time spent only on writing the unit test for the AST and verifying that it once again outputs as expected.
If we make a change to the framework, as in the second solution where we change the equality operator, we have to create unit tests to test the behavior of the AST in application code, but using test objects in the framework unit tests. That’s a bit more work and we may not have time for it.
The last suggestion—to extend the graph walker interface—involves even more work because we then have to create two sets of test objects: one set that tests a graph walker that uses reference equality (as the AST in the application code) and one that uses object equality (to make sure that works as well).
It is at this point that we might get swamped and end up working on framework code and unit tests that verify functionality that isn’t even being used—and certainly isn’t being used by the application with the looming deadline. However, we’re right there, in the code, and will never be better equipped to get this all right than we are right now. But what if we just don’t have time? What if there’s a release looming and we should just thank our lucky stars that we found the bug? What if there’s no time to follow the process?
Well, sometimes the process has to take a back seat, but that doesn’t mean we do nothing. Here are a few possibilities:
- Do nothing in the framework; add an issue to the issue tracker explaining the problem and the work that needs to be done so that it can be fixed at a more opportune time (or by a developer with time). This costs a few minutes of time and is the least you should do.
- Make the fix in the framework to prevent others from getting bitten by this relatively subtle bug and add an issue to the issue tracker describing the enhanced fix (adding a method to the graph walker) and the tests that need to be written.
- Add the method to the graph walker interface so that not only do others not get bitten by the bug but, should they need to control equivalence, they can do so. Add an issue describing the tests that need to be written to verify the new functionality.
What about those who quite rightly frown at the third possibility because it would provide a solution for what amounts to a potential—as opposed to actual—problem? It’s really up to the developer here and experience really helps. How much time does it take to write the code? How much does it change the interface? How many other applications are affected? How likely is it that other implementations will need this fix? Are there potential users who won’t be able to make the fix themselves? Who won’t be able to recompile and just have to live with the reference-only equivalence? How likely is it that other code will break subtly if the fix is not made? It’s not an easy decision either way, actually.
Though purists might be appalled at the fast and loose approach to correctness outlined above, pragmatism and deadlines play a huge role in software development. The only way to avoid missing deadlines is to have fallback plans to ensure that the code is clean as soon as possible rather than immediately as a more stringent process would demand.
And thus ends the cautionary tale of making assumptions about how objects are compared and how frameworks are made.