Remove Byte Order Mark Characters from File
The Byte Order Mark (BOM) signals a file’s encoding, but it can cause problems when text data is read without accounting for it. Files that begin with a BOM are common, so it is worth knowing how to deal with them. Let’s look at how to remove Byte Order Mark (BOM) characters from a file in Java.
1. Understanding BOM Characters
A Byte Order Mark (BOM) is the Unicode character U+FEFF placed at the very beginning of a text stream to indicate its encoding. In UTF-8 it is stored as the three bytes EF BB BF; in UTF-16 it is stored as FE FF (big-endian) or FF FE (little-endian). While the BOM helps in identifying the encoding, it causes problems when it is not handled. Java’s UTF-8 decoder does not strip it, so it appears as an invisible first character that can break string comparisons or the parsing of the first line. Knowing how to detect and remove it is therefore important for reliable text processing and file handling in Java.
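To make the byte patterns concrete, here is a minimal sketch that inspects the first bytes of some data and reports which BOM, if any, it starts with. The class name `BomSniffer` and its `detect` method are hypothetical names chosen for illustration, and only the UTF-8 and UTF-16 marks are covered (UTF-32 is left out for brevity).

```java
import java.nio.charset.StandardCharsets;

final class BomSniffer {

    /** Returns the encoding implied by a leading BOM, or null if none is recognized. */
    static String detect(byte[] b) {
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF
                && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            return "UTF-16LE";
        }
        return null;
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        System.out.println(detect(withBom)); // UTF-8

        // Decoding as UTF-8 keeps the BOM as a leading U+FEFF character.
        String decoded = new String(withBom, StandardCharsets.UTF_8);
        System.out.println(decoded.length());              // 3
        System.out.println(decoded.charAt(0) == '\uFEFF'); // true
    }
}
```

The last two lines of `main` show why the BOM matters in practice: the decoded string is one character longer than the visible text.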
2. Using InputStream and Reader
This example demonstrates how to read a file and handle BOM characters using `InputStream` and `Reader`.
    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;

    public class BOMHandling {
        public static void main(String[] args) {
            try (FileInputStream fis = new FileInputStream("file.txt");
                 InputStreamReader isr = new InputStreamReader(fis, "UTF-8");
                 BufferedReader br = new BufferedReader(isr)) {

                // Mark the reader, not the FileInputStream: FileInputStream
                // does not support mark/reset, and it returns single bytes,
                // which can never equal 0xFEFF
                br.mark(1);
                int ch = br.read(); // first decoded character
                if (ch != 0xFEFF) {
                    // Not a BOM: rewind so the character is not lost
                    br.reset();
                }

                // Read and process the rest of the file
                String line;
                while ((line = br.readLine()) != null) {
                    System.out.println(line);
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

The code defines:
- `FileInputStream fis = new FileInputStream("file.txt");` – Opens the file `file.txt` for reading.
- `InputStreamReader isr = new InputStreamReader(fis, "UTF-8");` – Creates an `InputStreamReader` with UTF-8 encoding.
- `BufferedReader br = new BufferedReader(isr);` – Wraps the `InputStreamReader` in a `BufferedReader` for efficient reading. Unlike `FileInputStream`, `BufferedReader` supports `mark()` and `reset()`.
- `br.mark(1);` – Marks the current position in the reader, allowing you to reset to this position after reading one character.
- `int ch = br.read();` – Reads the first decoded character from the file. After UTF-8 decoding, a BOM appears as the single character `0xFEFF`.
- `if (ch != 0xFEFF) { br.reset(); }` – If the first character is not a BOM, resets the reader to the marked position so that the character is not skipped.
- The rest of the file is read and processed line by line.
The code returns the following output:
    This is the first line of the file.
    This is the second line of the file.
    This is the third line of the file.
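The mark-and-reset check can also be pulled into a small helper so it is reusable and easy to try without a file on disk. This is a minimal sketch: `ReaderBomSkipper` and `skipBom` are hypothetical names, and the input is a `StringReader` purely for demonstration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

final class ReaderBomSkipper {

    /** Consumes a leading U+FEFF if present; otherwise leaves the reader at its start. */
    static BufferedReader skipBom(BufferedReader reader) throws IOException {
        reader.mark(1);              // one character of read-ahead is enough
        if (reader.read() != 0xFEFF) {
            reader.reset();          // not a BOM (or empty input): rewind
        }
        return reader;
    }

    public static void main(String[] args) throws IOException {
        BufferedReader br = skipBom(
                new BufferedReader(new StringReader("\uFEFFfirst line\nsecond line")));
        System.out.println(br.readLine()); // first line
    }
}
```

Because the check runs on decoded characters, it works the same way whichever encoding the underlying `InputStreamReader` uses, as long as that encoding decodes the BOM to U+FEFF.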
3. Using Apache Commons IO
This example shows how to remove BOM characters using Apache Commons IO’s `BOMInputStream`.
    import org.apache.commons.io.input.BOMInputStream;

    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;

    public class BOMHandlingWithCommonsIO {
        public static void main(String[] args) {
            try (FileInputStream fis = new FileInputStream("file.txt");
                 BOMInputStream bomIn = new BOMInputStream(fis);
                 InputStreamReader isr = new InputStreamReader(bomIn, "UTF-8");
                 BufferedReader br = new BufferedReader(isr)) {

                // The BOM, if any, has already been skipped by BOMInputStream
                String line;
                while ((line = br.readLine()) != null) {
                    System.out.println(line);
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
The code defines:
- `FileInputStream fis = new FileInputStream("file.txt");` – Opens the file `file.txt` for reading.
- `BOMInputStream bomIn = new BOMInputStream(fis);` – Wraps the `FileInputStream` in a `BOMInputStream`, which automatically detects and removes the BOM. With this constructor only the UTF-8 BOM is detected.
- `InputStreamReader isr = new InputStreamReader(bomIn, "UTF-8");` – Creates an `InputStreamReader` with UTF-8 encoding.
- `BufferedReader br = new BufferedReader(isr);` – Wraps the `InputStreamReader` in a `BufferedReader` for efficient reading.
- The rest of the file is read and processed line by line.
The code returns the following output:
    This is the first line of the file.
    This is the second line of the file.
    This is the third line of the file.
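To illustrate what a BOM-skipping stream wrapper does, the following sketch strips a UTF-8 BOM at the byte level using only the JDK. It is a simplified stand-in built on `PushbackInputStream`, not the Commons IO implementation; the names `Utf8BomStripper` and `strip` are hypothetical, and `readNBytes`/`readAllBytes` assume Java 9 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

final class Utf8BomStripper {

    /** Wraps the stream so that a leading EF BB BF sequence, if present, is skipped. */
    static InputStream strip(InputStream in) throws IOException {
        PushbackInputStream pin = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = pin.readNBytes(head, 0, 3); // may be fewer than 3 for short input
        boolean bom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!bom && n > 0) {
            pin.unread(head, 0, n);         // not a BOM: push the bytes back
        }
        return pin;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'o', 'k'};
        InputStream in = strip(new ByteArrayInputStream(data));
        System.out.println(new String(in.readAllBytes(), StandardCharsets.UTF_8)); // ok
    }
}
```

Working on raw bytes means the check happens before any decoding, which is also why a library wrapper of this kind sits between the `FileInputStream` and the `InputStreamReader`.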
4. Using NIO (New I/O)
This example uses Java NIO to read a file and handle BOM characters.
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class BOMHandlingWithNIO {
        public static void main(String[] args) {
            try {
                // Load the whole file into memory and decode it as UTF-8
                byte[] bytes = Files.readAllBytes(Paths.get("file.txt"));
                String content = new String(bytes, StandardCharsets.UTF_8);

                // Check and remove BOM
                if (content.startsWith("\uFEFF")) {
                    content = content.substring(1);
                }

                System.out.println(content);
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
The code defines:
- `byte[] bytes = Files.readAllBytes(Paths.get("file.txt"));` – Reads all bytes from the file `file.txt`.
- `String content = new String(bytes, StandardCharsets.UTF_8);` – Converts the byte array to a string using UTF-8 encoding.
- `if (content.startsWith("\uFEFF")) { content = content.substring(1); }` – Checks if the content starts with a BOM and removes it if present.
- `System.out.println(content);` – Prints the content of the file.
The code returns the following output:
    This is the first line of the file.
    This is the second line of the file.
    This is the third line of the file.
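The examples so far skip the BOM while reading. If the goal is to remove it from the file itself, the bytes can be written back without the first three. The sketch below assumes a UTF-8 file small enough to fit in memory; `BomFileCleaner` and `removeUtf8Bom` are hypothetical names, and the demo works on a temporary file.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

final class BomFileCleaner {

    /** Rewrites the file without its UTF-8 BOM; returns true if a BOM was removed. */
    static boolean removeUtf8Bom(Path file) throws IOException {
        byte[] bytes = Files.readAllBytes(file);
        boolean hasBom = bytes.length >= 3
                && (bytes[0] & 0xFF) == 0xEF
                && (bytes[1] & 0xFF) == 0xBB
                && (bytes[2] & 0xFF) == 0xBF;
        if (hasBom) {
            // Files.write truncates the existing file by default
            Files.write(file, Arrays.copyOfRange(bytes, 3, bytes.length));
        }
        return hasBom;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("bom-demo", ".txt");
        Files.write(tmp, new byte[]{(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'});
        System.out.println(removeUtf8Bom(tmp));             // true
        System.out.println(Files.readAllBytes(tmp).length); // 2
        Files.delete(tmp);
    }
}
```

Overwriting a file in place is not atomic, so for important data it may be safer to write to a temporary file first and then move it over the original.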
5. BOM Handling Methods Comparison
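Method | Advantages | Disadvantages | Memory Usage | Performance |
---|---|---|---|---|
Using InputStream and Reader | Uses only the JDK; streams the file | BOM check is written by hand; `mark()`/`reset()` must be called on the `BufferedReader`, not the `FileInputStream` | Low, reads through a fixed-size buffer | Good, the one-character mark and reset adds negligible cost |
Using Apache Commons IO | BOM is detected and skipped automatically | Adds an external dependency | Low, streams through the wrapped input | Good, the BOM is checked once at the start of the stream |
Using NIO (New I/O) | Short and simple code | Whole file is loaded into memory | High, reads entire file into memory | Fast for small files; a poor fit for very large files |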
6. Conclusion
Handling BOM characters is essential for accurate text file processing. In Java, you can manage BOM characters using `InputStream` and `Reader`, Apache Commons IO, or Java NIO. Each method has its own trade-offs: the `InputStream` and `Reader` approach needs no extra dependency, Apache Commons IO detects and skips the BOM for you, and Java NIO keeps the code short when the file fits in memory. Choose the approach that best suits your project’s requirements.