Using serialization to find dirty fields in an object

Raji Sankar, November 18th, 2013. Last Updated: November 28th, 2013


Say you are developing a framework to auto-save objects into a database. You need to detect the changes made between two saves so that only modified fields are saved. How do you detect dirty fields? The easiest way is to traverse the original data and the current data and compare each field separately, as in the code below:

public static void getDirtyFields(Object obj, Object obj2, Class cls, Map<String, DiffFields> diff)
        throws Exception {
        Field[] flds = cls.getDeclaredFields();
        for (int i = 0; i < flds.length; i++) {
            flds[i].setAccessible(true);
            Object fobj = flds[i].get(obj);
            Object fobj2 = flds[i].get(obj2);
            if (fobj.equals(fobj2)) continue;
 
            if (checkPrimitive(flds[i].getType())) {
                // add to dirty fields
                continue;
            }
 
            Map<String, DiffFields> fdiffs = new HashMap<String, DiffFields>();
            getDirtyFields(fobj, fobj2, fobj.getClass(), fdiffs);
            // add to dirty fields
        }
 
        if (cls.getSuperclass() != null)
            getDirtyFields(obj, obj2, cls.getSuperclass(), diff);
    }

The above code leaves many conditions unhandled, such as null values and fields that are collections, maps or arrays. Still, it gives an idea of what can be done. It works well if the object is small and does not contain a deep hierarchy. When the change is very small in a huge hierarchical object, however, we have to traverse all the way down to the last object to find the difference. Moreover, using equals may not be the right approach to detecting dirty fields. Equals may not have been implemented, or it may compare only a few fields, so true dirty field detection is not done. You would have to traverse each field, regardless of what equals returns, until you hit a primitive to detect dirty fields.
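As a rough illustration of what handling some of those missing conditions might look like, here is a minimal sketch of a null-safe variant that never relies on a user-defined equals. The class names (ReflectionDiffSketch, Inner, Outer) and the isLeaf helper are hypothetical and not part of the original code. Collections, maps, arrays and cyclic references are still left unhandled.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.List;

class ReflectionDiffSketch {
    // Hypothetical test types, used only for this illustration.
    static class Inner { String name = "a"; int count = 1; }
    static class Outer { Inner inner = new Inner(); String label = "x"; Inner missing = null; }

    // Hypothetical helper: runtime types that are compared by value instead of traversed.
    static boolean isLeaf(Class<?> type) {
        return Number.class.isAssignableFrom(type) || type == String.class
                || type == Boolean.class || type == Character.class;
    }

    // Collects dotted paths (for example "inner.name") of fields whose values differ.
    static void dirtyFields(Object a, Object b, Class<?> cls, String prefix, List<String> out)
            throws IllegalAccessException {
        for (Field f : cls.getDeclaredFields()) {
            if (Modifier.isStatic(f.getModifiers()) || f.isSynthetic()) continue;
            f.setAccessible(true);
            Object va = f.get(a);   // primitives arrive boxed
            Object vb = f.get(b);
            String path = prefix + f.getName();
            if (va == null || vb == null) {
                if (va != vb) out.add(path);            // exactly one side is null
            } else if (va.getClass() != vb.getClass()) {
                out.add(path);                          // different runtime types
            } else if (isLeaf(va.getClass())) {
                if (!va.equals(vb)) out.add(path);      // boxed primitive or String
            } else {
                dirtyFields(va, vb, va.getClass(), path + ".", out);
            }
        }
        Class<?> parent = cls.getSuperclass();
        if (parent != null && parent != Object.class) {
            dirtyFields(a, b, parent, prefix, out);
        }
    }

    public static void main(String[] args) throws Exception {
        Outer before = new Outer();
        Outer after = new Outer();
        after.inner.name = "changed";
        after.label = null;
        List<String> dirty = new ArrayList<>();
        dirtyFields(before, after, Outer.class, "", dirty);
        System.out.println(dirty);
    }
}
```

Even with these fixes, every nested object is still visited field by field, which is the cost the rest of this article tries to avoid.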

Here I want to talk about a different approach to detecting dirty fields. Instead of using reflection, we can use serialization. We could simply replace the “equals” in the above code with a step that serializes both objects and continues further only if the bytes are different. But this is not optimal, since we would be serializing the same object multiple times. We need logic like the following:

Serialize the two objects being compared
While comparing the two bytestreams, detect the fields being compared
If the byte values are different, store the field as different
Collect all the fields that are different and return them
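The first and third steps above can be tried out on their own, before any field detection is added. The sketch below uses the standard ObjectOutputStream rather than the custom stream introduced later, and the names QuickDirtyCheck, serialize and firstDifference are made up for this illustration. It only tells us whether, and at which byte offset, two serialized objects diverge.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Arrays;

class QuickDirtyCheck {
    // Serializes one object into a byte array using standard Java serialization.
    static byte[] serialize(Serializable obj) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
        }
        return bytes.toByteArray();
    }

    // Returns the offset of the first differing byte, or -1 if the arrays are identical.
    static int firstDifference(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            if (a[i] != b[i]) return i;
        }
        return a.length == b.length ? -1 : n;
    }

    public static void main(String[] args) throws IOException {
        byte[] before = serialize("TestData0");
        byte[] after = serialize("TestData1");
        System.out.println("dirty: " + !Arrays.equals(before, after));
        System.out.println("first differing offset: " + firstDifference(before, after));
    }
}
```

A raw offset is not very useful by itself, which is why the remaining steps map positions in the stream back to field names.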

Thus a single traversal of the two byte streams can generate a list of the fields that are different. How do we implement this logic? Can we traverse a serialized stream and recognize the fields in it? We want to write code like the following:

public static void main(String[] args) throws Exception {
        ComplexTestObject obj = new ComplexTestObject();
        ComplexTestObject obj2 = new ComplexTestObject();
        obj2._simple._string = "changed";
 
        //serialize the first object and get the bytes
        ByteArrayOutputStream ostr = new ByteArrayOutputStream();
        CustomOutputStream str = new CustomOutputStream(ostr);
        str.writeObject(obj);
        str.close();
        byte[] bytes = ostr.toByteArray();
 
        //serialize the second object and get the bytes
        ostr = new ByteArrayOutputStream();
        str = new CustomOutputStream(ostr);
        str.writeObject(obj2);
        str.close();
        byte[] bytes1 = ostr.toByteArray();       
 
       //read and compare the bytes and get back a list of differing fields
        ReadSerializedStream check = new ReadSerializedStream(bytes, bytes1);
        Map diff = check.compare();
        System.out.println("Got difference: " + diff);
    }

The Map should contain _simple._string, so that we can directly go to _string and process it.

Explaining the Serialization Format

There are articles that explain what the standard serialization byte stream looks like. But we will use a custom format. While we could read the standard serialization format, doing so is unnecessary when the class structure is already defined by our classes. We will simplify the format and change it to write only the type of the fields. The type is necessary because class declarations can refer to interfaces, super classes and so on, while the contained value can be of a derived type.

To customize serialization, we create our own ObjectOutputStream and override the writeClassDescriptor method. Our ObjectOutputStream now looks as below:

public class CustomOutputStream extends ObjectOutputStream {
    public CustomOutputStream(OutputStream str)
        throws IOException  {
        super(str);
    }
    @Override
    protected void writeClassDescriptor(ObjectStreamClass desc)
        throws IOException  {
        String name = desc.forClass().getName();
        writeObject(name);
        String ldr = "system";
        ClassLoader l = desc.forClass().getClassLoader();
        if (l != null)  ldr = l.toString();
        if (ldr == null)  ldr = "system";
        writeObject(ldr);
    }
}

Let’s write a simple object to serialize and see how the byte stream looks:

public class SimpleTestObject implements java.io.Serializable {
    int _integer;
    String _string;
    public SimpleTestObject(int b)  {
        _integer = 10;
        _string = "TestData" + b;
    }
    public static void main(String[] args) throws Exception  {
        SimpleTestObject obj = new SimpleTestObject(0);
        FileOutputStream ostr = new FileOutputStream("simple.txt");
        CustomOutputStream str = new CustomOutputStream(ostr);
        str.writeObject(obj);
        str.close(); ostr.close();
    }
}

After running this class, calling “hexdump -C simple.txt” shows the following output:

00000000  ac ed 00 05 73 72 74 00  10 53 69 6d 70 6c 65 54  |....srt..SimpleT|
00000010  65 73 74 4f 62 6a 65 63  74 74 00 27 73 75 6e 2e  |estObjectt.'sun.|
00000020  6d 69 73 63 2e 4c 61 75  6e 63 68 65 72 24 41 70  |misc.Launcher$Ap|
00000030  70 43 6c 61 73 73 4c 6f  61 64 65 72 40 33 35 63  |pClassLoader@35c|
00000040  65 33 36 78 70 00 00 00  0a 74 00 09 54 65 73 74  |e36xp....t..Test|
00000050  44 61 74 61 30                                    |Data0|
00000055

Following the format in this article we can trace the bytes as:

AC ED: STREAM_MAGIC. Specifies that this is a serialization protocol.
00 05: STREAM_VERSION. The serialization version.
0x73: TC_OBJECT. Specifies that this is a new Object.

Now we need to read the class descriptor.

0x72: TC_CLASSDESC. Specifies that this is a new class.

The class descriptor is written by us, so we know the format. It consists of two strings.

0x74: TC_STRING. Marks the start of a string, here the type of the object.
0x00 0x10: The length of the string (16), followed by the 16 characters of the type of the object, i.e. SimpleTestObject
0x74: TC_STRING. Marks the start of the second string, the classloader name.
0x00 0x27: The length of the string (39), followed by the classloader name
0x78: TC_ENDBLOCKDATA, the end of the optional block data for an object.
0x70: TC_NULL. Follows the end block and represents the fact that there are no superclasses

After this, the values of the fields in the class are written. There are two fields in our class, _integer and _string, so we have the 4 bytes of the value of _integer, i.e. 0x00, 0x00, 0x00, 0x0A, followed by a string in the format:

0x74: TC_STRING
0x00 0x09: Length of the string
9 bytes of the string data
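The trace above can be checked mechanically. The following sketch rebuilds the 85 bytes of the hexdump by hand and then walks them with a DataInputStream, using the constants from java.io.ObjectStreamConstants. The class and method names (TraceSimpleStream, sampleBytes, trace, expect) are assumptions made for this example, and readUTF is used as a shortcut for "2-byte length followed by the characters", which matches the stream for plain ASCII strings like these.

```java
import static java.io.ObjectStreamConstants.*;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

class TraceSimpleStream {
    // Rebuilds the bytes shown in the hexdump, piece by piece.
    static byte[] sampleBytes() throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeShort(STREAM_MAGIC);                 // ac ed
        out.writeShort(STREAM_VERSION);               // 00 05
        out.writeByte(TC_OBJECT);                     // 73
        out.writeByte(TC_CLASSDESC);                  // 72
        out.writeByte(TC_STRING);                     // 74
        out.writeUTF("SimpleTestObject");             // 00 10 + 16 characters
        out.writeByte(TC_STRING);                     // 74
        out.writeUTF("sun.misc.Launcher$AppClassLoader@35ce36"); // 00 27 + 39 characters
        out.writeByte(TC_ENDBLOCKDATA);               // 78
        out.writeByte(TC_NULL);                       // 70
        out.writeInt(10);                             // 00 00 00 0a
        out.writeByte(TC_STRING);                     // 74
        out.writeUTF("TestData0");                    // 00 09 + 9 characters
        out.flush();
        return buf.toByteArray();
    }

    static void expect(int actual, int expected, String what) throws IOException {
        if (actual != expected) {
            throw new IOException("expected " + what + " but found 0x" + Integer.toHexString(actual & 0xFFFF));
        }
    }

    // Returns {type name, classloader name, value of _integer, value of _string}.
    static String[] trace(byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        expect(in.readShort(), STREAM_MAGIC, "STREAM_MAGIC");
        expect(in.readShort(), STREAM_VERSION, "STREAM_VERSION");
        expect(in.readByte(), TC_OBJECT, "TC_OBJECT");
        expect(in.readByte(), TC_CLASSDESC, "TC_CLASSDESC");
        expect(in.readByte(), TC_STRING, "TC_STRING");
        String typeName = in.readUTF();
        expect(in.readByte(), TC_STRING, "TC_STRING");
        String loaderName = in.readUTF();
        expect(in.readByte(), TC_ENDBLOCKDATA, "TC_ENDBLOCKDATA");
        expect(in.readByte(), TC_NULL, "TC_NULL");
        int integerValue = in.readInt();
        expect(in.readByte(), TC_STRING, "TC_STRING");
        String stringValue = in.readUTF();
        return new String[] {typeName, loaderName, String.valueOf(integerValue), stringValue};
    }

    public static void main(String[] args) throws IOException {
        for (String part : trace(sampleBytes())) {
            System.out.println(part);
        }
    }
}
```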

Comparing streams and detecting the dirty fields

Now that we understand and have simplified the serialization format, we can start writing the parser for the streams and comparing them. First we write the standard read functions for primitive fields. For example, getInt is written as below to read integers (the others are present in the sample code):

static int getInt(byte[] b, int off) {
        return ((b[off + 3] & 0xFF) << 0) +  ((b[off + 2] & 0xFF) << 8) +
               ((b[off + 1] & 0xFF) << 16) + ((b[off + 0]) << 24);
    }
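To make the byte arithmetic concrete, the sketch below applies getInt to the four bytes of _integer from the hexdump and cross-checks the result against java.nio.ByteBuffer, which reads big-endian by default. The getUnsignedShort method is a hypothetical companion for the 2-byte string length prefix. It is not taken from the sample code.

```java
import java.nio.ByteBuffer;

class PrimitiveReads {
    // Same logic as getInt above: big-endian, most significant byte first.
    static int getInt(byte[] b, int off) {
        return ((b[off + 3] & 0xFF)) + ((b[off + 2] & 0xFF) << 8)
             + ((b[off + 1] & 0xFF) << 16) + ((b[off]) << 24);
    }

    // Hypothetical companion: reads the unsigned 16-bit length that precedes a string.
    static int getUnsignedShort(byte[] b, int off) {
        return ((b[off] & 0xFF) << 8) | (b[off + 1] & 0xFF);
    }

    public static void main(String[] args) {
        byte[] intBytes = {0x00, 0x00, 0x00, 0x0A};   // value of _integer in the dump
        byte[] lenBytes = {0x00, 0x27};               // length of the classloader name
        System.out.println(getInt(intBytes, 0));                                      // prints 10
        System.out.println(getUnsignedShort(lenBytes, 0));                            // prints 39
        System.out.println(getInt(intBytes, 0) == ByteBuffer.wrap(intBytes).getInt()); // prints true
    }
}
```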

The class descriptor can be read with code like the following:

byte desc = _reading[_readIndex++]; //read TC_CLASSDESC
        byte cdesc = _compareTo[_compareIndex++];
        switch (desc) {
        case TC_CLASSDESC: {
                byte what = _reading[_readIndex++];  byte cwhat = _compareTo[_compareIndex++]; //read the type written TC_STRING
                if (what == TC_STRING) {
                    String[] clsname = readString(); //read the field Type 
                    if (_reading[_readIndex] == TC_STRING) {
                        what = _reading[_readIndex++];  cwhat = _compareTo[_compareIndex++];
                        String[] ldrname = readString(); //read the classloader name
                    }
                    ret.add(clsname[0]);
                    cret.add(clsname[1]);
                }
                byte end = _reading[_readIndex++]; byte cend = _compareTo[_compareIndex++]; //read 0x78 TC_ENDBLOCKDATA
                //we read again so that if there are super classes, their descriptors are also read
                //if we hit a TC_NULL, then the descriptor is read
                readOneClassDesc(); 
            }
            break;
        case TC_NULL:
            //ignore all subsequent nulls 
            while (_reading[_readIndex] == TC_NULL) desc = _reading[_readIndex++];
            while (_compareTo[_compareIndex] == TC_NULL) cdesc = _compareTo[_compareIndex++];
            break;
        }

Here, we read the first byte and, if it is TC_CLASSDESC, we read two strings. We then continue reading until we hit a TC_NULL. There are other conditions to be handled, such as TC_REFERENCE, which is a reference to a previously declared value. These can be found in the sample code.

Note: the functions read both byte streams simultaneously (_reading and _compareTo). Hence both indexes always point to the position where the next comparison has to begin. Bytes are read as blocks, which ensures that we always start at the correct position even if the values differ. For example, a string block has a length that indicates how far to read, a class descriptor has an end block that indicates where it stops, and so on.
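The readString function used in the listing is not shown in this article, so the following is only a guess at its shape, written to illustrate the block-wise reading described in the note. The class name DualStringReader and its accessors are hypothetical. It assumes both indexes sit just after a TC_STRING marker and decodes the bytes as UTF-8, which is equivalent to the stream's modified UTF-8 only for simple text such as ASCII.

```java
import java.nio.charset.StandardCharsets;

class DualStringReader {
    private final byte[] reading;
    private final byte[] compareTo;
    private int readIndex;
    private int compareIndex;

    DualStringReader(byte[] reading, byte[] compareTo) {
        this.reading = reading;
        this.compareTo = compareTo;
    }

    // Unsigned 16-bit big-endian length prefix.
    private static int length(byte[] b, int off) {
        return ((b[off] & 0xFF) << 8) | (b[off + 1] & 0xFF);
    }

    // Reads one length-prefixed string from each stream and advances each index
    // by that stream's own block size, so both stay aligned on the next block.
    String[] readString() {
        int len = length(reading, readIndex);
        int clen = length(compareTo, compareIndex);
        String value = new String(reading, readIndex + 2, len, StandardCharsets.UTF_8);
        String cvalue = new String(compareTo, compareIndex + 2, clen, StandardCharsets.UTF_8);
        readIndex += 2 + len;
        compareIndex += 2 + clen;
        return new String[] {value, cvalue};
    }

    int readIndex() { return readIndex; }
    int compareIndex() { return compareIndex; }

    public static void main(String[] args) {
        // Two strings of different length, each followed by a marker byte 0x7F.
        byte[] a = {0, 2, 'h', 'i', 0x7F};
        byte[] b = {0, 5, 'h', 'e', 'l', 'l', 'o', 0x7F};
        DualStringReader reader = new DualStringReader(a, b);
        String[] pair = reader.readString();
        System.out.println(pair[0] + " vs " + pair[1]);
        System.out.println(a[reader.readIndex()] == b[reader.compareIndex()]);
    }
}
```

Although the two strings have different lengths, both indexes end up on the byte that follows their own string, which is what keeps the two streams comparable after a value difference.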

We have not written the field sequence. How do we know what fields to read? For this, we can do the following:

Class cls = Class.forName(clsname, false, this.getClass().getClassLoader());
        ObjectStreamClass ostr = ObjectStreamClass.lookup(cls);
        ObjectStreamField[] flds = ostr.getFields();

This gives us the fields in the order in which they were serialized. If we iterate through flds, we visit them in the order in which the data was written. So we can iterate through them as below:

Map diffs = new HashMap();
for (int i = 0; i < flds.length; i++) {
    DiffFields dfld = new DiffFields(flds[i].getName());
    if (flds[i].isPrimitive()) { //read primitives
        Object[] read = readPrimitive(flds[i]);
        if (!read[0].equals(read[1])) diffs.put(flds[i].getName(), dfld); //Value is not the same so add as different
    }
    else if (flds[i].getType().equals(String.class)) { //read strings
        byte nxtread = _reading[_readIndex++]; byte nxtcompare = _compareTo[_compareIndex++];
        String[] rstr = readString();
        if (!rstr[0].equals(rstr[1])) diffs.put(flds[i].getName(), dfld); //String not same so add as difference
    }
}
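The loop above depends on ObjectStreamClass.getFields() returning fields in serialization order rather than declaration order. That order can be observed in isolation with a small sketch like the one below: primitive fields come first, sorted by name, followed by object fields sorted by name, and transient fields are left out. The class names FieldOrderDemo and Sample are invented for this example.

```java
import java.io.ObjectStreamClass;
import java.io.ObjectStreamField;
import java.io.Serializable;
import java.util.Arrays;

class FieldOrderDemo {
    // Hypothetical class whose declaration order differs from its serialization order.
    static class Sample implements Serializable {
        private static final long serialVersionUID = 1L;
        String zeta;
        int beta;
        String alpha;
        long gamma;
        transient int skipped;
    }

    // Returns the names of the serializable fields in the order they are written.
    static String[] serialFieldNames(Class<?> cls) {
        ObjectStreamClass desc = ObjectStreamClass.lookup(cls); // null if cls is not serializable
        if (desc == null) return new String[0];
        ObjectStreamField[] fields = desc.getFields();
        String[] names = new String[fields.length];
        for (int i = 0; i < fields.length; i++) {
            names[i] = fields[i].getName();
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(serialFieldNames(Sample.class)));
        // prints [beta, gamma, alpha, zeta]
    }
}
```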

Here, I have only explained how primitive and String fields in the class can be checked for differences. The logic, however, can be extended to sub-classes by recursively calling the same functions for object field types.

The sample code for this blog, which includes logic to compare sub-classes and super classes, can be found here. A neater implementation can be found here.

A word of caution. A few disadvantages exist with this method:

Only serializable objects and fields can be used by this method. Transient and static fields are not compared for differences.
If a writeObject overrides the default serialization, then the ObjectStreamClass does not reflect the serialized fields correctly. For this, we will have to either hardcode the reading of such classes (for example, the sample code has such a read for ArrayList) or use and parse the standard serialization format.
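One way a framework might guard against the second disadvantage is to detect, up front, which classes declare their own writeObject and therefore need special handling. The sketch below is an assumption about how such a check could look and is not part of the sample code. CustomWriteCheck, Plain and Custom are hypothetical names, and only the class itself is inspected, not its superclasses.

```java
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.ArrayList;

class CustomWriteCheck {
    // Relies on default serialization.
    static class Plain implements Serializable {
        private static final long serialVersionUID = 1L;
        int value;
    }

    // Writes extra data that ObjectStreamClass.getFields() knows nothing about.
    static class Custom implements Serializable {
        private static final long serialVersionUID = 1L;
        int value;
        private void writeObject(ObjectOutputStream out) throws IOException {
            out.defaultWriteObject();
            out.writeInt(42);
        }
    }

    // True if cls declares the private void writeObject(ObjectOutputStream) hook.
    static boolean hasCustomWriteObject(Class<?> cls) {
        try {
            Method m = cls.getDeclaredMethod("writeObject", ObjectOutputStream.class);
            return Modifier.isPrivate(m.getModifiers()) && m.getReturnType() == void.class;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("Plain: " + hasCustomWriteObject(Plain.class));
        System.out.println("Custom: " + hasCustomWriteObject(Custom.class));
        System.out.println("ArrayList: " + hasCustomWriteObject(ArrayList.class));
    }
}
```

Classes flagged this way could then be routed to a hardcoded reader or rejected, instead of being compared with a field list that does not match the bytes.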

Reference: Using serialization to find dirty fields in an object from our JCG partner Raji Sankar at the Reflections blog.