Get the Schema From an Avro File

Efficient data serialization and interoperability between systems are critical for building scalable applications. Apache Avro is a data serialization framework that offers compact storage, schema evolution, and multi-language support. Designed for big data and distributed systems, Avro data files are self-describing: the schema is stored in the file together with the data, which simplifies data exchange between systems.

A fundamental aspect of working with Avro is its schema, a JSON document that defines the structure of the data. Extracting and understanding this schema is vital for processing Avro-encoded data, whether you are managing batch jobs, real-time streams, or data storage.

Let us look at how to use Java and Avro to get the schema from a file.

1. Introduction

Apache Avro is a popular data serialization framework developed within the Apache Hadoop ecosystem. Its key features include compact storage, schema evolution, and support for many programming languages. These properties make Avro an ideal choice for big data workflows, especially when used with distributed systems like Apache Kafka or Apache Spark.

In an Avro data file, records are serialized in a binary format and stored together with their schema. The schema, defined in JSON, describes the structure of the data, including field names, types, and optional or default values. Storing the schema with the data ensures interoperability between systems, and because the schema is written once per file rather than once per record, the storage overhead stays low.
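To make the description of a schema more concrete, here is a minimal sketch that parses a schema with an optional field and lists each field with its type. The class name SchemaFieldsExample and the optional email field are assumptions made for illustration, and hasDefaultValue() assumes Avro 1.9 or later.

```java
import org.apache.avro.Schema;

public class SchemaFieldsExample {
    // Hypothetical schema: "email" is optional (a union with null) and defaults to null
    static final String SCHEMA_JSON = """
            {
                "type": "record",
                "name": "User",
                "fields": [
                    {"name": "id", "type": "int"},
                    {"name": "name", "type": "string"},
                    {"name": "email", "type": ["null", "string"], "default": null}
                ]
            }
            """;

    static Schema parse() {
        return new Schema.Parser().parse(SCHEMA_JSON);
    }

    public static void main(String[] args) {
        Schema schema = parse();
        // Print every field name with its Avro type and whether a default is declared
        for (Schema.Field field : schema.getFields()) {
            System.out.println(field.name() + " : " + field.schema().getType()
                    + (field.hasDefaultValue() ? " (has default)" : ""));
        }
    }
}
```

Declaring the optional field as a union with null listed first is what allows null to be used as its default value.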

1.1 Key Features

  • Compact Storage: Binary serialization reduces data size compared to text-based formats like JSON or XML.
  • Schema Evolution: Avro supports schema evolution, allowing systems to add or remove fields (given suitable default values) or rename them (through aliases) while maintaining backward and forward compatibility.
  • Language Support: Avro is supported in multiple programming languages, including Java, Python, C#, and Ruby.
  • Self-Describing Data: The schema is stored with the data, making it self-descriptive and easy to interpret.
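As a rough illustration of the schema evolution feature listed above, the sketch below writes a record with one schema and reads it back with a newer schema that adds a field with a default value. The class name SchemaEvolutionSketch, the country field, and its default are made up for this example; everything happens in memory, so no file is needed.

```java
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileStream;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;

public class SchemaEvolutionSketch {
    // Writer schema (old version): id and name only
    static final String V1 = """
            {"type": "record", "name": "User", "fields": [
                {"name": "id", "type": "int"},
                {"name": "name", "type": "string"}
            ]}
            """;

    // Reader schema (new version): adds a hypothetical "country" field with a default
    static final String V2 = """
            {"type": "record", "name": "User", "fields": [
                {"name": "id", "type": "int"},
                {"name": "name", "type": "string"},
                {"name": "country", "type": "string", "default": "unknown"}
            ]}
            """;

    static GenericRecord writeV1ReadV2() throws IOException {
        // Separate parsers, because one parser cannot define the name "User" twice
        Schema writerSchema = new Schema.Parser().parse(V1);
        Schema readerSchema = new Schema.Parser().parse(V2);

        // Write one record with the old schema into an in-memory Avro container
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        GenericDatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<>(writerSchema);
        try (DataFileWriter<GenericRecord> writer = new DataFileWriter<>(datumWriter)) {
            writer.create(writerSchema, out);
            GenericRecord user = new GenericData.Record(writerSchema);
            user.put("id", 1);
            user.put("name", "Alice");
            writer.append(user);
        }

        // Read it back, asking Avro to resolve the data against the new schema
        GenericDatumReader<GenericRecord> datumReader = new GenericDatumReader<>();
        datumReader.setExpected(readerSchema);
        try (DataFileStream<GenericRecord> reader =
                     new DataFileStream<>(new ByteArrayInputStream(out.toByteArray()), datumReader)) {
            return reader.next();
        }
    }

    public static void main(String[] args) throws IOException {
        GenericRecord user = writeV1ReadV2();
        // "country" was never written; its value comes from the reader schema's default
        System.out.println(user.get("name") + " / " + user.get("country"));
    }
}
```

The writer schema is taken from the container itself, so only the expected (reader) schema has to be supplied when reading.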

2. Code Example

2.1 Add Maven Dependency

Ensure your Java project is set up with the Avro dependency in your pom.xml file:

<dependency>
    <groupId>org.apache.avro</groupId>
    <artifactId>avro</artifactId>
    <version>your__jar__version</version>
</dependency>

2.2 Create an Avro File

Below is a program to write data into an Avro file. The schema defines the structure of a record containing user details such as ID, name, and email address.

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

import java.io.File;
import java.io.IOException;

public class AvroFileWriter {
    public static void main(String[] args) throws IOException {
        String schemaJson = """
                {
                    "type": "record",
                    "name": "User",
                    "fields": [
                        {"name": "id", "type": "int"},
                        {"name": "name", "type": "string"},
                        {"name": "email", "type": "string"}
                    ]
                }
                """;

        // Parse the JSON definition into an Avro Schema object
        Schema schema = new Schema.Parser().parse(schemaJson);

        File avroFile = new File("users.avro");
        GenericDatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<>(schema);
        DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<>(datumWriter);

        // Create the file and write the schema into its header
        dataFileWriter.create(schema, avroFile);

        GenericRecord user1 = new GenericData.Record(schema);
        user1.put("id", 1);
        user1.put("name", "Alice");
        user1.put("email", "alice@example.com");

        GenericRecord user2 = new GenericData.Record(schema);
        user2.put("id", 2);
        user2.put("name", "Bob");
        user2.put("email", "bob@example.com");

        // Append both records, then flush and close the file
        dataFileWriter.append(user1);
        dataFileWriter.append(user2);
        dataFileWriter.close();

        System.out.println("Avro file created: " + avroFile.getAbsolutePath());
    }
}

2.2.1 Code Explanation and Output

The Java code above writes data into an Avro file using a defined schema. It creates a file named users.avro that contains user details (ID, name, and email) with the help of Avro’s APIs.

The program begins by defining an Avro schema in JSON format as a multi-line string. This schema describes the structure of a record named User with three fields: id (of type int), name (of type string), and email (of type string). The schema is parsed using Avro’s Schema.Parser to create a Schema object.

Next, the program creates a file named users.avro to store the data. The GenericDatumWriter class is used to serialize the records according to the defined schema. A DataFileWriter object wraps the GenericDatumWriter and handles the writing process.

The dataFileWriter.create(schema, avroFile) method initializes the file and writes the schema to it. After this, the program creates two user records as instances of GenericRecord, assigning values to their fields using the put method. For example, the first record represents a user with id=1, name=Alice, and email=alice@example.com.

The two user records are appended to the Avro file using the dataFileWriter.append() method. Finally, the dataFileWriter.close() method flushes any buffered records to the file, releases all resources, and closes the file. A confirmation message displaying the file’s path is printed to the console.

Avro file created: /path/to/your/directory/users.avro
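If an exception is thrown between create() and close(), the writer in the program above is never closed. One possible variant, sketched here under the assumption that the same User schema is used, relies on try-with-resources so that the file is closed in every case. The class name AvroFileWriterWithResources and the helpers writeUsers and user are hypothetical.

```java
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

import java.io.File;
import java.io.IOException;

public class AvroFileWriterWithResources {
    // Same User schema as in the program above, condensed onto fewer lines
    static final String SCHEMA_JSON = """
            {"type": "record", "name": "User", "fields": [
                {"name": "id", "type": "int"},
                {"name": "name", "type": "string"},
                {"name": "email", "type": "string"}
            ]}
            """;

    // Hypothetical helper: writes two sample users to the given file
    static void writeUsers(File avroFile) throws IOException {
        Schema schema = new Schema.Parser().parse(SCHEMA_JSON);
        GenericDatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<>(schema);

        // The writer is flushed and closed automatically, even if append() throws
        try (DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<>(datumWriter)) {
            dataFileWriter.create(schema, avroFile);
            dataFileWriter.append(user(schema, 1, "Alice", "alice@example.com"));
            dataFileWriter.append(user(schema, 2, "Bob", "bob@example.com"));
        }
    }

    // Hypothetical helper: builds one User record for the given schema
    static GenericRecord user(Schema schema, int id, String name, String email) {
        GenericRecord record = new GenericData.Record(schema);
        record.put("id", id);
        record.put("name", name);
        record.put("email", email);
        return record;
    }

    public static void main(String[] args) throws IOException {
        File avroFile = new File("users.avro");
        writeUsers(avroFile);
        System.out.println("Avro file created: " + avroFile.getAbsolutePath());
    }
}
```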

2.3 Extract the Schema

The following program reads the Avro file created earlier and extracts the embedded schema:

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

import java.io.File;
import java.io.IOException;

public class AvroSchemaReader {
    public static void main(String[] args) throws IOException {
        File avroFile = new File("users.avro");
        GenericDatumReader<GenericRecord> datumReader = new GenericDatumReader<>();

        // try-with-resources closes the reader and the underlying file when done
        try (DataFileReader<GenericRecord> dataFileReader = new DataFileReader<>(avroFile, datumReader)) {
            Schema schema = dataFileReader.getSchema(); // Schema stored in the file header
            System.out.println("Extracted Schema:");
            System.out.println(schema.toString(true)); // Pretty print schema
        }
    }
}
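DataFileReader needs a File it can seek in. When the Avro data arrives as a plain InputStream instead (for example from a network response or an object store), a similar approach should work with DataFileStream, which reads the same container format sequentially. The following is a minimal sketch; the class name AvroStreamSchemaReader and the helper readSchema are hypothetical.

```java
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileStream;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class AvroStreamSchemaReader {
    // Hypothetical helper: returns the schema stored in the header of an Avro container stream
    static Schema readSchema(InputStream in) throws IOException {
        GenericDatumReader<GenericRecord> datumReader = new GenericDatumReader<>();
        // Constructing the stream reads the header, so the schema is available immediately
        try (DataFileStream<GenericRecord> stream = new DataFileStream<>(in, datumReader)) {
            return stream.getSchema();
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumes users.avro from the previous section exists in the working directory
        try (InputStream in = Files.newInputStream(Path.of("users.avro"))) {
            System.out.println(readSchema(in).toString(true));
        }
    }
}
```

Only the header is consumed to obtain the schema, so no records need to be deserialized.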

2.3.1 Code Explanation and Output

This Java program demonstrates how to read and extract the embedded schema from an Avro file. Apache Avro stores its schema alongside the data, allowing the data to be self-descriptive and simplifying schema discovery for downstream processing.

The program starts by creating a File object that points to the Avro file named users.avro. This file is assumed to have been created earlier and contains serialized data along with its schema.

To read the Avro file, a GenericDatumReader object is instantiated. This class is a generic reader that understands how to deserialize Avro records. Because it is created without a schema, it uses the schema that DataFileReader finds in the file. The GenericDatumReader is passed to a DataFileReader object, which facilitates reading records and metadata (such as the schema) from the Avro file.

The dataFileReader.getSchema() method is then used to retrieve the schema embedded within the Avro file. The schema is returned as a Schema object, which represents the structure of the data. To make the schema human-readable, the program prints the schema in a formatted JSON string using the schema.toString(true) method.

The output of this program is the Avro schema in its JSON representation, clearly showing the record name, field names, and their respective data types.

Extracted Schema:
{
  "type": "record",
  "name": "User",
  "fields": [
    {
      "name": "id",
      "type": "int"
    },
    {
      "name": "name",
      "type": "string"
    },
    {
      "name": "email",
      "type": "string"
    }
  ]
}

This schema can be used to understand the structure of the data or to generate additional Avro files with the same format.
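As one way to reuse the extracted schema, the sketch below reads the schema from an existing Avro file and uses it to create a second file with the same format. The class name SchemaReuseSketch, the helper writeWithSchemaOf, and the sample record are hypothetical, and the record assumes the schema has the id, name, and email fields of users.avro.

```java
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

import java.io.File;
import java.io.IOException;

public class SchemaReuseSketch {
    // Hypothetical helper: creates target with the schema embedded in source
    static void writeWithSchemaOf(File source, File target) throws IOException {
        // Step 1: extract the schema from the existing file
        Schema schema;
        GenericDatumReader<GenericRecord> datumReader = new GenericDatumReader<>();
        try (DataFileReader<GenericRecord> reader = new DataFileReader<>(source, datumReader)) {
            schema = reader.getSchema();
        }

        // Step 2: write a new file using that schema
        GenericDatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<>(schema);
        try (DataFileWriter<GenericRecord> writer = new DataFileWriter<>(datumWriter)) {
            writer.create(schema, target);

            // Assumption: the schema defines id, name, and email, as in users.avro
            GenericRecord user = new GenericData.Record(schema);
            user.put("id", 3);
            user.put("name", "Carol");
            user.put("email", "carol@example.com");
            writer.append(user);
        }
    }

    public static void main(String[] args) throws IOException {
        writeWithSchemaOf(new File("users.avro"), new File("more-users.avro"));
    }
}
```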

3. Conclusion

Apache Avro is a data serialization framework that supports schema evolution and stores data compactly. Because the schema is embedded directly in the file, Avro data is self-describing, which simplifies data exchange between diverse systems and has made Avro common in big data ecosystems for stream processing, batch workflows, and data warehousing. As shown above, a single call to DataFileReader.getSchema() is enough to recover the schema from an existing Avro file in Java.

Yatin Batra

An experienced full-stack engineer well versed in Core Java, Spring/Spring Boot, MVC, Security, AOP, Frontend (Angular & React), and cloud technologies (such as AWS, GCP, Jenkins, Docker, K8s).