Remove Byte Order Mark Characters from File

The Byte Order Mark (BOM) signals a file’s encoding, but it can cause problems when text is read without accounting for it. Text files that begin with a BOM are common, so it is worth knowing how to deal with them. Let’s look at how to remove Byte Order Mark (BOM) characters from a file in Java.

1. Understanding BOM Characters

The Byte Order Mark (BOM) is the Unicode character U+FEFF placed at the very beginning of a text stream to indicate its encoding and, for UTF-16 and UTF-32, its byte order. In UTF-8 it is encoded as the three bytes EF BB BF. Java’s UTF-8 decoder does not strip the BOM, so it survives as an invisible first character (\uFEFF) in the decoded text. That stray character can break string comparisons, parsing of the first line or field, and any other logic that assumes the text starts with visible content. Detecting and removing the BOM before processing avoids these surprises.
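To make the byte patterns concrete, here is a minimal sketch that inspects the first bytes of some data and reports which encoding a leading BOM implies. The class name BomSniffer and its detect method are hypothetical names chosen for illustration. The sketch covers only the UTF-8 and UTF-16 marks and ignores UTF-32, whose little-endian BOM also begins with FF FE.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of any library.
final class BomSniffer {

    /**
     * Returns the charset implied by a leading BOM, or null when the data
     * does not start with a UTF-8 or UTF-16 BOM. UTF-32 is not handled.
     */
    static Charset detect(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) {
            return StandardCharsets.UTF_8;
        }
        if (startsWith(head, 0xFE, 0xFF)) {
            return StandardCharsets.UTF_16BE;
        }
        if (startsWith(head, 0xFF, 0xFE)) {
            return StandardCharsets.UTF_16LE;
        }
        return null;
    }

    // Compares the leading bytes as unsigned values against the prefix.
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        System.out.println(detect(withBom));                                 // UTF-8
        System.out.println(detect("hi".getBytes(StandardCharsets.UTF_8)));   // null
    }
}
```

In practice only the first two or three bytes of a file need to be read to run this kind of check.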

2. Using InputStream and Reader

This example demonstrates how to read a file and handle BOM characters using InputStream and Reader.

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.io.IOException;

public class BOMHandling {
    public static void main(String[] args) {
        try (FileInputStream fis = new FileInputStream("file.txt");
             InputStreamReader isr = new InputStreamReader(fis, "UTF-8");
             BufferedReader br = new BufferedReader(isr)) {

            // Peek at the first decoded character. mark/reset must be called on
            // the BufferedReader: FileInputStream does not support them, and
            // reading from fis would return a single byte, never 0xFEFF.
            br.mark(1);
            int ch = br.read();
            if (ch != 0xFEFF) {
                // Not a BOM (or the file is empty): rewind so nothing is lost
                br.reset();
            }

            // Read and process the rest of the file
            String line;
            while ((line = br.readLine()) != null) {
                System.out.println(line);
            }

        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

The code defines:

  • FileInputStream fis = new FileInputStream("file.txt"); – Opens the file file.txt for reading.
  • InputStreamReader isr = new InputStreamReader(fis, "UTF-8"); – Creates an InputStreamReader that decodes the bytes as UTF-8.
  • BufferedReader br = new BufferedReader(isr); – Wraps the InputStreamReader in a BufferedReader for efficient reading and for mark/reset support.
  • br.mark(1); – Marks the current position in the reader so that one character can be read and then undone.
  • int ch = br.read(); – Reads the first decoded character from the file. A UTF-8 BOM (bytes EF BB BF) decodes to the single character 0xFEFF.
  • if (ch != 0xFEFF) { br.reset(); } – If the first character is not a BOM, resets the reader to the marked position. If it is a BOM, it has already been consumed and is skipped.
  • The rest of the file is read and processed line by line.

The code prints the following output:

This is the first line of the file.
This is the second line of the file.
This is the third line of the file.
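The same idea can also be applied at the byte level, before any decoding takes place. The following is a minimal sketch using PushbackInputStream from the core library; the class name Utf8BomSkipper and the method skipUtf8Bom are hypothetical, it handles only the UTF-8 BOM, and it assumes Java 9 or later for readNBytes. An in-memory byte array stands in for the file.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of any library.
final class Utf8BomSkipper {

    /** Returns a stream positioned after a leading EF BB BF sequence, if one is present. */
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = pushback.readNBytes(head, 0, 3); // Java 9+; returns 0..3
        boolean isBom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!isBom && n > 0) {
            // Not a BOM: put the bytes back so the caller sees the full content
            pushback.unread(head, 0, n);
        }
        return pushback;
    }

    public static void main(String[] args) throws IOException {
        // In-memory stand-in for a file that starts with a UTF-8 BOM
        byte[] data = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i', '\n', 'y', 'o'};
        try (BufferedReader br = new BufferedReader(new InputStreamReader(
                skipUtf8Bom(new ByteArrayInputStream(data)), StandardCharsets.UTF_8))) {
            String line;
            while ((line = br.readLine()) != null) {
                System.out.println(line); // prints "hi" then "yo"
            }
        }
    }
}
```

Working on bytes keeps the check independent of the decoder, at the cost of hard-coding the byte sequence for each encoding you want to support.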

3. Using Apache Commons IO

This example shows how to remove BOM characters using Apache Commons IO’s BOMInputStream.

import org.apache.commons.io.input.BOMInputStream;
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.io.IOException;

public class BOMHandlingWithCommonsIO {
    public static void main(String[] args) {
        try (FileInputStream fis = new FileInputStream("file.txt");
             BOMInputStream bomIn = new BOMInputStream(fis);
             InputStreamReader isr = new InputStreamReader(bomIn, "UTF-8");
             BufferedReader br = new BufferedReader(isr)) {

            String line;
            while ((line = br.readLine()) != null) {
                System.out.println(line);
            }

        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

The code defines:

  • FileInputStream fis = new FileInputStream("file.txt"); – Opens the file file.txt for reading.
  • BOMInputStream bomIn = new BOMInputStream(fis); – Wraps the FileInputStream in a BOMInputStream. With this constructor, it detects a UTF-8 BOM and excludes it from the data that is read.
  • InputStreamReader isr = new InputStreamReader(bomIn, "UTF-8"); – Creates an InputStreamReader with UTF-8 encoding.
  • BufferedReader br = new BufferedReader(isr); – Wraps the InputStreamReader in a BufferedReader for efficient reading.
  • The rest of the file is read and processed line by line.

The code prints the following output:

This is the first line of the file.
This is the second line of the file.
This is the third line of the file.

4. Using NIO (New I/O)

This example uses Java NIO to read a file and handle BOM characters.

import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.charset.StandardCharsets;
import java.io.IOException;

public class BOMHandlingWithNIO {
    public static void main(String[] args) {
        try {
            byte[] bytes = Files.readAllBytes(Paths.get("file.txt"));
            String content = new String(bytes, StandardCharsets.UTF_8);

            // Check and remove BOM
            if (content.startsWith("\uFEFF")) {
                content = content.substring(1);
            }

            System.out.println(content);

        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

The code defines:

  • byte[] bytes = Files.readAllBytes(Paths.get("file.txt")); – Reads all bytes from the file file.txt.
  • String content = new String(bytes, StandardCharsets.UTF_8); – Converts the byte array to a string using UTF-8 encoding.
  • if (content.startsWith("\uFEFF")) { content = content.substring(1); } – Checks if the content starts with a BOM and removes it if present.
  • System.out.println(content); – Prints the content of the file.

The code prints the following output:

This is the first line of the file.
This is the second line of the file.
This is the third line of the file.
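The examples so far skip the BOM while reading. If the goal is to remove it from the file itself, one option is to rewrite the file without its first three bytes. The sketch below assumes a UTF-8 file small enough to fit in memory; the class name BomStripper and the method stripUtf8Bom are hypothetical, and a temporary file is used so the demo does not touch real data.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper for illustration; not part of any library.
final class BomStripper {

    /**
     * Rewrites the file in place without a leading UTF-8 BOM.
     * Returns true if a BOM was found and removed.
     */
    static boolean stripUtf8Bom(Path file) throws IOException {
        byte[] bytes = Files.readAllBytes(file);
        boolean hasBom = bytes.length >= 3
                && (bytes[0] & 0xFF) == 0xEF
                && (bytes[1] & 0xFF) == 0xBB
                && (bytes[2] & 0xFF) == 0xBF;
        if (hasBom) {
            // Files.write truncates the existing file by default
            Files.write(file, Arrays.copyOfRange(bytes, 3, bytes.length));
        }
        return hasBom;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("bom-demo", ".txt");
        try {
            Files.write(tmp, new byte[] {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'o', 'k'});
            System.out.println(stripUtf8Bom(tmp));              // true
            System.out.println(Files.readAllBytes(tmp).length); // 2
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

For important files, writing to a temporary file and moving it over the original would be a safer variation than overwriting in place.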

5. BOM Handling Methods Comparison

Using InputStream and Reader
  • Advantages: simple and straightforward; uses only core Java libraries.
  • Disadvantages: BOM handling is manual; more code is required for BOM detection.
  • Memory usage: low, since the file is streamed line by line.
  • Performance: moderate; involves marking and resetting the reader.

Using Apache Commons IO
  • Advantages: automatic BOM detection and removal; less code and easier to implement.
  • Disadvantages: requires an additional library dependency; potential overhead from library abstractions.
  • Memory usage: low to moderate; the file is still streamed.
  • Performance: good; only the first few bytes are inspected for a BOM.

Using NIO (New I/O)
  • Advantages: concise; reads the whole file in a single call using core Java libraries.
  • Disadvantages: BOM handling is manual; requires understanding of the NIO API; unsuitable for very large files.
  • Memory usage: high, since the entire file is read into memory.
  • Performance: fast for small and medium files, limited by available memory for large ones.

6. Conclusion

Handling BOM characters is essential for accurate text file processing. In Java, you can manage BOM characters using InputStream and Reader, Apache Commons IO, or Java NIO. Apache Commons IO removes most of the manual work at the cost of an extra dependency, while the two core-library approaches need no dependency but leave the BOM check to you. Choose the approach that best suits your project’s requirements.

Yatin Batra

An experienced full-stack engineer well versed in Core Java, Spring/Spring Boot, MVC, Security, AOP, Frontend (Angular & React), and cloud technologies (such as AWS, GCP, Jenkins, Docker, K8s).

2 Comments
Paul King
4 months ago

Your first example should be using br not fis when calling mark(), read() and reset() otherwise you’ll get an IOException: mark/reset not supported.
