Create Avro Schema With List of Objects
Apache Avro is a data serialization framework that enables efficient data exchange and storage. It is widely used in big data applications, particularly within the Hadoop ecosystem, due to its compact binary format and schema-based structure. Avro schemas, written in JSON, define the data types and structure, ensuring compatibility between systems. This article shows how to create an Avro schema that contains a list of objects.
1. Introduction
Apache Avro is a data serialization framework that provides a compact, fast binary format for data exchange. It is particularly useful in big data applications, such as those using Apache Hadoop. Avro is schema-based, meaning that the data structure is defined in a schema, which is written in JSON.
An Avro schema describes the data types and structure of the data being serialized. This allows for dynamic data processing and ensures compatibility between different systems. The schema is essential for both the serialization and deserialization processes, enabling efficient data interchange.
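One reason the schema is needed on both sides is that Avro's binary encoding stores only the values, in schema order, without field names or type tags. The sketch below is a hand-rolled illustration of that encoding for a hypothetical record with a string field followed by an int field; it uses only the JDK and is not the Avro library's implementation. The class and method names are made up for this example.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class AvroEncodingSketch {

    // Avro writes int and long values as zig-zag encoded variable-length integers:
    // small magnitudes (positive or negative) take a single byte.
    public static void writeLong(ByteArrayOutputStream out, long value) {
        long n = (value << 1) ^ (value >> 63);   // zig-zag step
        while ((n & ~0x7FL) != 0) {              // more than 7 bits left
            out.write((int) ((n & 0x7F) | 0x80)); // low 7 bits plus continuation flag
            n >>>= 7;
        }
        out.write((int) n);
    }

    // A string is its UTF-8 byte length (encoded as a long) followed by the bytes.
    public static void writeString(ByteArrayOutputStream out, String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        writeLong(out, utf8.length);
        out.write(utf8, 0, utf8.length);
    }

    // A record is just its fields' encodings concatenated in schema order.
    public static byte[] encodeStringThenInt(String text, int number) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeString(out, text);
        writeLong(out, number);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] bytes = encodeStringThenInt("John Doe", 30);
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) {
            hex.append(String.format("%02x ", b));
        }
        // prints: 10 4a 6f 68 6e 20 44 6f 65 3c
        System.out.println(hex.toString().trim());
    }
}
```

Nothing in those ten bytes says "this is a name" or "this is an age", so a reader without the schema cannot tell where one field ends and the next begins.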
2. Creating the Avro Schema
To define an Avro schema that includes a list of objects, you can use the following JSON structure. Suppose we want to create a schema for a Person that includes a list of Address objects.
{
  "type": "record",
  "name": "Person",
  "fields": [
    { "name": "name", "type": "string" },
    { "name": "age", "type": "int" },
    {
      "name": "addresses",
      "type": {
        "type": "array",
        "items": {
          "type": "record",
          "name": "Address",
          "fields": [
            { "name": "street", "type": "string" },
            { "name": "city", "type": "string" },
            { "name": "zipCode", "type": "string" }
          ]
        }
      }
    }
  ]
}
In this schema, the Person record has an addresses field of type array whose items are Address records. Each Address record has fields for street, city, and zip code.
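To check that a schema like this parses and that the array really holds Address records, it can be loaded with Avro's Schema.Parser and inspected. The following is a minimal sketch, assuming the Avro library is on the classpath; the class name SchemaInspection and the inlined schema string are choices made for this example (in a real project the schema would normally be read from the .avsc file).

```java
import org.apache.avro.Schema;

public class SchemaInspection {

    // The Person schema from above, inlined so the example is self-contained.
    static final String PERSON_SCHEMA_JSON =
          "{\"type\":\"record\",\"name\":\"Person\",\"fields\":["
        + "{\"name\":\"name\",\"type\":\"string\"},"
        + "{\"name\":\"age\",\"type\":\"int\"},"
        + "{\"name\":\"addresses\",\"type\":{\"type\":\"array\",\"items\":{"
        + "\"type\":\"record\",\"name\":\"Address\",\"fields\":["
        + "{\"name\":\"street\",\"type\":\"string\"},"
        + "{\"name\":\"city\",\"type\":\"string\"},"
        + "{\"name\":\"zipCode\",\"type\":\"string\"}]}}}]}";

    public static Schema parsePersonSchema() {
        // Parsing fails with an exception if the JSON is not a valid Avro schema.
        return new Schema.Parser().parse(PERSON_SCHEMA_JSON);
    }

    public static void main(String[] args) {
        Schema person = parsePersonSchema();

        // The schema of the "addresses" field is the array type itself.
        Schema addresses = person.getField("addresses").schema();
        System.out.println(addresses.getType());   // ARRAY

        // The array's element type is the nested Address record.
        Schema address = addresses.getElementType();
        System.out.println(address.getName());     // Address

        for (Schema.Field field : address.getFields()) {
            System.out.println(field.name() + ": " + field.schema().getType());
        }
    }
}
```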
3. Using the Schema in Java
To use the Avro schema in a Java application, you need to include the Avro library in your project. If you are using Maven, add the following dependency to your pom.xml:
<dependency>
    <groupId>org.apache.avro</groupId>
    <artifactId>avro</artifactId>
    <version>1.10.2</version>
</dependency>
After adding the dependency, you can generate Java classes from the schema created above (saved as person.avsc) using Avro tools. This can be done with the following command:
java -jar avro-tools-1.10.2.jar compile schema person.avsc .
4. Working With Generated Classes
Once you have generated the Java classes from your Avro schema, you can use them in your application to serialize and deserialize data. Here’s an example of how to create a Person object and serialize it to a file:
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.specific.SpecificDatumWriter;

import java.io.File;
import java.util.Arrays;

public class AvroExample {
    public static void main(String[] args) throws Exception {
        // Create a Person object with two addresses
        Person person = Person.newBuilder()
                .setName("John Doe")
                .setAge(30)
                .setAddresses(Arrays.asList(
                        Address.newBuilder().setStreet("123 Main St").setCity("New York").setZipCode("10001").build(),
                        Address.newBuilder().setStreet("456 Elm St").setCity("Boston").setZipCode("02110").build()
                ))
                .build();

        // Create a DatumWriter for the generated Person class
        DatumWriter<Person> personDatumWriter = new SpecificDatumWriter<>(Person.class);

        // Create a DataFileWriter to write Person objects to a file
        DataFileWriter<Person> dataFileWriter = new DataFileWriter<>(personDatumWriter);
        dataFileWriter.create(person.getSchema(), new File("person.avro")); // Create the file with the schema
        dataFileWriter.append(person); // Append the Person object to the file
        dataFileWriter.close(); // Flush and close the file writer
    }
}
4.1 Code Explanation
The given code defines a Java class named AvroExample that demonstrates how to serialize a Person object using Apache Avro. The main method serves as the entry point for the application.
Inside the main method, a new Person object is created using the builder pattern. The Person.newBuilder() method initializes the builder, allowing us to set various attributes. The setName method assigns the name “John Doe” to the Person object, while the setAge method sets the age to 30.
The setAddresses method is used to assign a list of Address objects to the Person. Each Address object is created using its own builder, which allows us to specify the street, city, and zip code. In this example, two addresses are provided: one for “123 Main St, New York, 10001” and another for “456 Elm St, Boston, 02110”. After setting these values, the build() method is called to finalize the creation of the Person object.
Next, a DatumWriter specific to the Person class is created using SpecificDatumWriter. This writer is responsible for converting a Person object into Avro's binary encoding. A DataFileWriter is then instantiated with this DatumWriter; it handles writing the serialized data to a file.
The create method of the DataFileWriter initializes a new file named person.avro using the schema retrieved from the Person object. This file will store the serialized data according to the defined schema. Following this, the append method is called to write the person object to the file.
Finally, the close method is invoked on the DataFileWriter to ensure that all data is flushed to the file and that resources are released, completing the serialization process.
4.2 Code Output
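The generated person.avro is a binary Avro container file, so it is not directly human-readable. Decoded back to JSON (for example, with the tojson command of Avro tools), it contains the following record: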
{"name": "John Doe", "age": 30, "addresses": [ {"street": "123 Main St", "city": "New York", "zipCode": "10001"}, {"street": "456 Elm St", "city": "Boston", "zipCode": "02110"} ]}
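The stored record can also be checked from Java by reading the file back. The sketch below is one way to do that, assuming the Avro library is on the classpath. To stay self-contained it uses Avro's generic API (GenericRecord, GenericDatumWriter, GenericDatumReader) instead of the generated Person and Address classes, and it writes its own file first. The class name GenericRoundTrip and its helper methods are made up for this example.

```java
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class GenericRoundTrip {

    // Same Person schema as in section 2, inlined for a self-contained example.
    static final String PERSON_SCHEMA_JSON =
          "{\"type\":\"record\",\"name\":\"Person\",\"fields\":["
        + "{\"name\":\"name\",\"type\":\"string\"},"
        + "{\"name\":\"age\",\"type\":\"int\"},"
        + "{\"name\":\"addresses\",\"type\":{\"type\":\"array\",\"items\":{"
        + "\"type\":\"record\",\"name\":\"Address\",\"fields\":["
        + "{\"name\":\"street\",\"type\":\"string\"},"
        + "{\"name\":\"city\",\"type\":\"string\"},"
        + "{\"name\":\"zipCode\",\"type\":\"string\"}]}}}]}";

    static GenericRecord address(Schema schema, String street, String city, String zipCode) {
        GenericRecord record = new GenericData.Record(schema);
        record.put("street", street);
        record.put("city", city);
        record.put("zipCode", zipCode);
        return record;
    }

    // Writes one Person record to the given file, then reads every record back.
    public static List<GenericRecord> writeAndRead(File file) throws IOException {
        Schema personSchema = new Schema.Parser().parse(PERSON_SCHEMA_JSON);
        Schema addressSchema = personSchema.getField("addresses").schema().getElementType();

        GenericRecord person = new GenericData.Record(personSchema);
        person.put("name", "John Doe");
        person.put("age", 30);
        person.put("addresses", Arrays.asList(
                address(addressSchema, "123 Main St", "New York", "10001"),
                address(addressSchema, "456 Elm St", "Boston", "02110")));

        // try-with-resources closes the writer even if append throws.
        try (DataFileWriter<GenericRecord> writer =
                     new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(personSchema))) {
            writer.create(personSchema, file);
            writer.append(person);
        }

        // The reader takes the writer's schema from the file header.
        List<GenericRecord> records = new ArrayList<>();
        try (DataFileReader<GenericRecord> reader =
                     new DataFileReader<GenericRecord>(file, new GenericDatumReader<GenericRecord>())) {
            for (GenericRecord record : reader) {
                records.add(record);
            }
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        File file = File.createTempFile("person", ".avro");
        for (GenericRecord record : writeAndRead(file)) {
            System.out.println(record); // toString() renders the record as JSON-like text
        }
    }
}
```

With the generated classes, the same read would use SpecificDatumReader with Person.class in place of GenericDatumReader.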
5. Conclusion
Apache Avro provides a robust and efficient way to work with data serialization in Java applications. By defining an Avro schema that includes a list of objects, you can easily manage complex data structures. This article covered the basics of Avro, how to create an Avro schema, use it in Java, and work with the generated classes. With Avro, you can ensure that your data is structured, consistent, and easily transferable across different systems.