
Avro: Storing Null Values in Files

Apache Avro is a widely used data serialization system designed for efficient storage and transmission of structured data. It is commonly used in big data processing frameworks like Apache Hadoop and Apache Kafka. One of the challenges developers face while working with Avro is handling null values due to its strict schema enforcement. If fields are not explicitly defined to accept null, serialization and deserialization errors can occur. Let us delve into how Avro stores null values in files and explore best practices for handling them efficiently.

1. Introduction

Apache Avro is a widely used data serialization system that provides a compact and efficient way to store structured data. It is commonly used with big data processing frameworks like Apache Hadoop, Apache Kafka, and Apache Spark. Avro offers:

  • A schema-based approach to define data structures.
  • Efficient binary serialization.
  • Support for schema evolution.

However, handling null values in Avro files can be challenging due to Avro’s strict type enforcement. This article explores how to properly store and retrieve null values in Avro files using Java.

2. The Problem With Null Values in Avro

In Avro, every field in a record must conform to a predefined schema. Unlike other data formats such as JSON, Avro does not automatically allow null values unless explicitly defined. For example, consider the following schema:

{
  "type": "record",
  "name": "User",
  "fields": [
    {
      "name": "name",
      "type": "string"
    },
    {
      "name": "age",
      "type": "int"
    }
  ]
}

In this schema, the age field is defined as an int. If we try to store a record where age is null, we will encounter a schema validation error.
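To see the failure concretely, the following sketch (the class name NonNullableFailure is ours) attempts to serialize a record whose age is null against the non-nullable schema above. With a recent Avro release, the write fails with an exception that names the offending field:

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.EncoderFactory;
import java.io.ByteArrayOutputStream;

public class NonNullableFailure {
    public static void main(String[] args) throws Exception {
        // age is a plain int here, with no "null" branch in a union
        String schemaStr = "{ \"type\": \"record\", \"name\": \"User\", \"fields\": ["
                + "{\"name\": \"name\", \"type\": \"string\"},"
                + "{\"name\": \"age\", \"type\": \"int\"}"
                + "]}";
        Schema schema = new Schema.Parser().parse(schemaStr);

        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "Alice");
        user.put("age", null); // not allowed: age is a non-nullable int

        GenericDatumWriter<GenericRecord> writer = new GenericDatumWriter<>(schema);
        try {
            writer.write(user, EncoderFactory.get()
                    .binaryEncoder(new ByteArrayOutputStream(), null));
            System.out.println("Unexpectedly succeeded");
        } catch (Exception e) {
            // Avro reports the null value for the int field
            System.out.println("Write failed: " + e);
        }
    }
}
```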

3. Solutions for Handling Null Values

The recommended approach to allow null values in Avro is to use a union type. A union type in Avro allows multiple data types for a single field. To enable null values, we define the field type as a union of null and the actual data type:

{
  "type": "record",
  "name": "User",
  "fields": [
    {
      "name": "name",
      "type": "string"
    },
    {
      "name": "age",
      "type": [
        "null",
        "int"
      ],
      "default": null
    }
  ]
}

In this schema:

  • The age field can hold either an integer or null.
  • The order of types in the union matters: a field’s default value must match the first type in the union, so null comes first when the default is null.
  • The default value is set to null to ensure schema evolution compatibility.
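Assuming a reasonably recent Avro release (1.9 or later, where Schema.isNullable() and Field.hasDefaultValue() are available), the nullable union can also be verified programmatically. The class name UnionInspect is ours:

```java
import org.apache.avro.Schema;

public class UnionInspect {
    public static void main(String[] args) {
        String schemaStr = "{ \"type\": \"record\", \"name\": \"User\", \"fields\": ["
                + "{\"name\": \"name\", \"type\": \"string\"},"
                + "{\"name\": \"age\", \"type\": [\"null\", \"int\"], \"default\": null}"
                + "]}";
        Schema schema = new Schema.Parser().parse(schemaStr);

        // The field schema is the union ["null", "int"], not a plain int
        Schema ageSchema = schema.getField("age").schema();

        System.out.println("age is a union: " + (ageSchema.getType() == Schema.Type.UNION));
        System.out.println("age is nullable: " + ageSchema.isNullable());
        System.out.println("age has default: " + schema.getField("age").hasDefaultValue());
    }
}
```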

4. Implementation of Writing to File

Below is a Java implementation that writes Avro data with null values.

4.1. Maven Dependencies

To work with Avro in Java, include the following Maven dependencies in your pom.xml file:

<dependencies>
    <dependency>
        <groupId>org.apache.avro</groupId>
        <artifactId>avro</artifactId>
        <version>your__latest__jar__version</version>
    </dependency>
</dependencies>

4.2. Java Code

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.GenericData;
import java.io.File;
import org.apache.avro.io.DatumWriter;
 
public class AvroNullExample {
    public static void main(String[] args) throws Exception {
        String schemaStr = "{ \"type\": \"record\", \"name\": \"User\", \"fields\": ["
                + "{\"name\": \"name\", \"type\": \"string\"},"
                + "{\"name\": \"age\", \"type\": [\"null\", \"int\"], \"default\": null}"
                + "]}";
 
        Schema schema = new Schema.Parser().parse(schemaStr);
 
        GenericRecord user1 = new GenericData.Record(schema);
        user1.put("name", "Alice");
        user1.put("age", null); // Null age
 
        GenericRecord user2 = new GenericData.Record(schema);
        user2.put("name", "Bob");
        user2.put("age", 25); // Valid integer
 
        File file = new File("users.avro");
        DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<>(schema);
        DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<>(datumWriter);
        dataFileWriter.create(schema, file);
        dataFileWriter.append(user1);
        dataFileWriter.append(user2);
        dataFileWriter.close();
 
        System.out.println("Avro file created successfully!");
    }
}

4.2.1 Code Explanation

The AvroNullExample class demonstrates how to handle null values in Avro files using Java. It defines an Avro schema, creates records with nullable fields, and writes them to an Avro file.

First, the Avro schema is defined as a JSON string. This schema represents a record named User with two fields. The name field is a required string, whereas the age field is defined as a union type ["null", "int"]. This means that age can hold either an integer value or null. Additionally, the default value for age is explicitly set to null to ensure schema compatibility.

Next, the schema is parsed using Schema.Parser().parse(schemaStr), which converts the JSON schema into an Avro Schema object that will be used for record creation.

Using this schema, two user records are created with the GenericRecord class. The first record, user1, has the name “Alice” and an age value of null. The second record, user2, has the name “Bob” and an age value of 25.

After creating the records, an Avro file named users.avro is initialized. To handle writing operations, a DatumWriter instance is created using GenericDatumWriter<>(schema). This writer is then passed to a DataFileWriter instance, which manages file creation and appends data to the Avro file.

The file is created using the dataFileWriter.create(schema, file) method, ensuring that the schema is stored within the file. The user records are then written sequentially using dataFileWriter.append(user1) and dataFileWriter.append(user2).

Finally, the dataFileWriter.close() method is called to properly save the file and release system resources. Upon successful execution, the program prints the message Avro file created successfully!, confirming that the Avro file has been generated with the specified records.
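Since DataFileWriter implements Closeable, the close step can also be handled with try-with-resources, which flushes and closes the file even if an append fails. A minimal sketch (the class name AvroTryWithResources is ours):

```java
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import java.io.File;

public class AvroTryWithResources {
    public static void main(String[] args) throws Exception {
        String schemaStr = "{ \"type\": \"record\", \"name\": \"User\", \"fields\": ["
                + "{\"name\": \"name\", \"type\": \"string\"},"
                + "{\"name\": \"age\", \"type\": [\"null\", \"int\"], \"default\": null}"
                + "]}";
        Schema schema = new Schema.Parser().parse(schemaStr);

        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "Alice");
        user.put("age", null);

        // The writer is closed (and the file flushed) automatically,
        // even if create() or append() throws
        try (DataFileWriter<GenericRecord> writer =
                new DataFileWriter<>(new GenericDatumWriter<>(schema))) {
            writer.create(schema, new File("users.avro"));
            writer.append(user);
        }
        System.out.println("users.avro written");
    }
}
```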

5. Testing Our Solution

To verify the Avro file, we can read it back and print the contents.

5.1. Reading the Avro File

import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import java.io.File;
 
public class AvroReader {
    public static void main(String[] args) throws Exception {
        File file = new File("users.avro");
        GenericDatumReader<GenericRecord> datumReader = new GenericDatumReader<>();
        DataFileReader<GenericRecord> dataFileReader = new DataFileReader<>(file, datumReader);
 
        while (dataFileReader.hasNext()) {
            GenericRecord record = dataFileReader.next();
            System.out.println("User: " + record);
        }
 
        dataFileReader.close();
    }
}

5.1.1 Code Explanation and Output

The AvroReader class is responsible for reading and displaying records from an Avro file named users.avro. It utilizes the Avro API to deserialize the data and print each record to the console.

The program begins by creating a File object that points to the previously created Avro file (users.avro). If the file does not exist, the DataFileReader constructor will throw a FileNotFoundException.

Next, a GenericDatumReader<GenericRecord> instance is created. This reader is responsible for interpreting the Avro records in a generic format, meaning it does not require a precompiled Java class for the schema.

A DataFileReader<GenericRecord> object is then initialized by passing the Avro file and the datumReader. The DataFileReader reads the file sequentially, retrieving each record one by one.

The program then enters a while loop that iterates through the Avro file using dataFileReader.hasNext(). This method checks if there are more records to read. Inside the loop, dataFileReader.next() retrieves the next GenericRecord and prints it to the console using System.out.println("User: " + record).

Finally, the dataFileReader.close() method is called to properly close the file and release resources.

If the Avro file contains the two records created in the previous example, the output will be:

User: {"name": "Alice", "age": null}
User: {"name": "Bob", "age": 25}

The first record shows that the age field is null, while the second record correctly displays an integer value for age. This confirms that the Avro file correctly stored nullable values and that the reader successfully retrieved and displayed them.
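When consuming individual fields rather than printing whole records, keep in mind that Avro returns string fields as org.apache.avro.util.Utf8 instances (convert with toString()) and the nullable age field as either a boxed Integer or null. The following self-contained sketch (the class name NullSafeRead is ours) writes two records to a temporary file and reads them back with explicit null handling:

```java
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import java.io.File;

public class NullSafeRead {
    public static void main(String[] args) throws Exception {
        String schemaStr = "{ \"type\": \"record\", \"name\": \"User\", \"fields\": ["
                + "{\"name\": \"name\", \"type\": \"string\"},"
                + "{\"name\": \"age\", \"type\": [\"null\", \"int\"], \"default\": null}"
                + "]}";
        Schema schema = new Schema.Parser().parse(schemaStr);

        File file = File.createTempFile("users", ".avro");
        file.deleteOnExit();

        // Write one record with a null age and one with a concrete age
        try (DataFileWriter<GenericRecord> writer =
                new DataFileWriter<>(new GenericDatumWriter<>(schema))) {
            writer.create(schema, file);
            GenericRecord alice = new GenericData.Record(schema);
            alice.put("name", "Alice");
            alice.put("age", null);
            writer.append(alice);
            GenericRecord bob = new GenericData.Record(schema);
            bob.put("name", "Bob");
            bob.put("age", 25);
            writer.append(bob);
        }

        // Read back, handling the nullable field explicitly
        try (DataFileReader<GenericRecord> reader =
                new DataFileReader<>(file, new GenericDatumReader<>())) {
            for (GenericRecord record : reader) {
                String name = record.get("name").toString(); // Utf8 -> String
                Object age = record.get("age");              // Integer or null
                System.out.println(name + ": "
                        + (age == null ? "age not provided" : "age " + age));
            }
        }
    }
}
```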

6. Conclusion

Handling null values in Avro requires using a union type (["null", "&lt;type&gt;"]) to allow optional values. In this article, we implemented a solution in Java that stores and retrieves Avro data containing null values. With this approach, we can efficiently serialize and deserialize data while ensuring compatibility with Avro’s schema requirements.

Yatin Batra

An experienced full-stack engineer well versed in Core Java, Spring/Spring Boot, MVC, Security, AOP, frontend (Angular & React), and cloud technologies (such as AWS, GCP, Jenkins, Docker, and Kubernetes).