How to read CSV files in Java – A case study of Iterator and Decorator
In this post, I will talk about how to read CSV (comma-separated values) files using Apache Commons CSV. From this case study, we will learn how the Iterator and Decorator design patterns can improve reusability in different situations. But before we get started, I have to answer two questions first.
- Why do I need a third party library if there are more than enough DIY posts talking about how to read CSV files?
It is true that when you google “java csv parser”, you will get several related posts. But even if you are a beginner, you won’t be satisfied with these shallow methods. Of course, using BufferedReader and String.split() will successfully parse a typical CSV file, but you won’t learn anything from it except how to write redundant code. On the other hand, as I will show below, using and studying Apache Commons CSV will teach you several topics in design patterns, for instance Iterator and Decorator.
- Why Apache Commons CSV, not others?
As far as I know, there are several other libraries on SourceForge or Google Code. However, if you look into the details of their code, forgive my criticism, none of them is both flexible and manageable: some are too simple to meet users’ various requirements; others are too complicated and painful to use. Furthermore, most of those I’ve come across don’t have commercial-friendly licenses, which can really scare users off.
At the time of writing, Apache Commons CSV is still in the sandbox, which means there is currently no official download or stable release. But nightly builds may be available.
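To make the weakness of the DIY approach concrete, here is a minimal sketch that uses only the JDK. It shows String.split() cutting a quoted field that contains a comma into pieces, next to a small quote-aware splitter. The names SplitDemo, naiveSplit and splitQuoted are made up for this illustration and belong to no library; the sketch also ignores line breaks inside quotes, which a real parser has to handle.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; not part of any CSV library.
final class SplitDemo {

    // The DIY approach: split on every comma, quoted or not.
    static String[] naiveSplit(String line) {
        return line.split(",", -1);
    }

    // A small quote-aware splitter for a single line.
    // A doubled quote ("") inside a quoted field stands for one literal quote.
    static List<String> splitQuoted(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        current.append('"'); // escaped quote
                        i++;
                    } else {
                        inQuotes = false;    // closing quote
                    }
                } else {
                    current.append(c);       // commas are kept inside quotes
                }
            } else if (c == '"') {
                inQuotes = true;
            } else if (c == ',') {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());
        return fields;
    }

    public static void main(String[] args) {
        String line = "\"Smith, John\",42";
        System.out.println(naiveSplit(line).length); // 3 pieces for 2 fields
        System.out.println(splitQuoted(line));       // the two intended fields
    }
}
```

Even this small splitter already needs a state flag and an escape rule, which is the kind of detail a library takes off your hands.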
Using Iterator to hide underlying representation
Let me begin with a sample CSV file, where each record is located on a separate line, delimited by a line break. The first line is the header containing two names, COL1 and COL2, corresponding to the fields in the file. The rest of the file contains three records with fields separated by commas.
COL1,COL2
a,b
c,d
e,f
The code using Apache Commons CSV to read this file is:
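import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;

public void test() throws FileNotFoundException, IOException {
    // withHeader() treats the first line as the header, so fields can be read by name.
    CSVParser parser = new CSVParser(
            new FileReader("test.csv"),
            CSVFormat.DEFAULT.withHeader());
    for (CSVRecord record : parser) {
        System.out.printf("%s\t%s\n", record.get("COL1"), record.get("COL2"));
    }
    parser.close();
}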
CSVParser is used to parse CSV files according to the specified format. Here I use the default CSVFormat together with withHeader() with no argument. This makes the parser treat the first line of the CSV file as the header, which in turn makes record.get("COL1") valid. CSVParser provides an iterative way of reading records. Here we meet the first design pattern, Iterator. It provides a way to access the records of a CSV file sequentially without exposing the underlying representation, such as how comment lines are skipped and how column names are mapped to field values. For each record, we use CSVRecord.get(String name) to retrieve the field value by its name.
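To show what the Iterator pattern hides from the caller, here is a minimal sketch written against the JDK only. It is not how CSVParser is implemented; SimpleCsvReader is a hypothetical class that assumes '#' starts a comment line and that fields contain no quotes or embedded commas.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.NoSuchElementException;

// Hypothetical class for illustration; not part of Apache Commons CSV.
final class SimpleCsvReader implements Iterable<Map<String, String>> {
    private final BufferedReader in;
    private final String[] header;

    SimpleCsvReader(Reader reader) {
        this.in = new BufferedReader(reader);
        String first = nextDataLine();
        if (first == null) {
            throw new IllegalArgumentException("missing header line");
        }
        this.header = first.split(",", -1); // naive split: no quote handling
    }

    // Returns the next line that is neither empty nor a '#' comment, or null at end of input.
    private String nextDataLine() {
        try {
            String line;
            while ((line = in.readLine()) != null) {
                if (!line.isEmpty() && !line.startsWith("#")) {
                    return line;
                }
            }
            return null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // All iterators share the one underlying reader, so the data can be traversed only once.
    @Override
    public Iterator<Map<String, String>> iterator() {
        return new Iterator<Map<String, String>>() {
            private String pending; // look-ahead line, or null if nothing is buffered

            @Override
            public boolean hasNext() {
                if (pending == null) {
                    pending = nextDataLine();
                }
                return pending != null;
            }

            @Override
            public Map<String, String> next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                String[] values = pending.split(",", -1);
                pending = null;
                // Map each column name to the value at the same position.
                Map<String, String> record = new HashMap<>();
                for (int i = 0; i < header.length && i < values.length; i++) {
                    record.put(header[i], values[i]);
                }
                return record;
            }
        };
    }
}

final class SimpleCsvReaderDemo {
    public static void main(String[] args) {
        String csv = "COL1,COL2\n# a comment\na,b\nc,d\ne,f\n";
        for (Map<String, String> record : new SimpleCsvReader(new StringReader(csv))) {
            System.out.println(record.get("COL1") + "\t" + record.get("COL2"));
        }
    }
}
```

The calling loop never sees the header array, the look-ahead buffer, or the comment filter; it only sees one record after another.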
CSVRecord provides different ways to access a field value: by name or by index. If you are not sure whether a record actually has a value for a column (for example, when a row is shorter than the header), call CSVRecord.isSet(String name) first. If you just want to check whether a name has been defined to the parser, call CSVRecord.isMapped(String name) instead.
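The difference between the two checks can be sketched with a small stand-in class. SimpleRecord below is hypothetical and only mirrors the behavior described above (a name is "mapped" if it appears in the header, and "set" if the record also has a value at that position); it is not the library's CSVRecord.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a parsed record; not the library's CSVRecord.
final class SimpleRecord {
    private final Map<String, Integer> headerIndex = new HashMap<>();
    private final String[] values;

    SimpleRecord(String[] header, String[] values) {
        for (int i = 0; i < header.length; i++) {
            headerIndex.put(header[i], i);
        }
        this.values = values.clone();
    }

    // True if the column name was defined in the header.
    boolean isMapped(String name) {
        return headerIndex.containsKey(name);
    }

    // True if the name is mapped and this record has a value at that position.
    boolean isSet(String name) {
        Integer index = headerIndex.get(name);
        return index != null && index < values.length;
    }

    // Access by index.
    String get(int index) {
        return values[index];
    }

    // Access by name.
    String get(String name) {
        Integer index = headerIndex.get(name);
        if (index == null) {
            throw new IllegalArgumentException("Unknown column: " + name);
        }
        return values[index];
    }
}

final class SimpleRecordDemo {
    public static void main(String[] args) {
        // A short row: the header has two columns, the record only one value.
        SimpleRecord record = new SimpleRecord(
                new String[] {"COL1", "COL2"}, new String[] {"a"});
        System.out.println(record.isMapped("COL2")); // true: the name is in the header
        System.out.println(record.isSet("COL2"));    // false: no value at that position
    }
}
```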
Using Decorator to allow different behaviors
CSVFormat.DEFAULT and CSVFormat.RFC4180 follow the RFC 4180 format, so fields enclosed in double quotes can be handled too, such as
"COL1","COL2"
"a","b"
"c","d"
"e","f"
In RFC 4180, fields in a CSV file are separated by commas. But in general, the library can handle an arbitrary delimiter such as TAB or space. To make the code reusable, the library provides a way to create your own CSVFormat:
CSVFormat format = CSVFormat.newFormat(',')
        .withQuoteChar('"')
        .withHeader();
The above format uses the same delimiter and quote character as CSVFormat.DEFAULT. Here we encounter another design pattern, Decorator, which allows behavior to be added to an individual object, either statically or dynamically, without affecting the behavior of other objects of the same class. In the case of CSVFormat, every withXXX() method returns a new CSVFormat that is equal to the calling one but with one attribute modified. The question here might be: why not just return the self-reference this? I think it is because the latter approach would break the following code:
CSVFormat format = CSVFormat.newFormat(',');
CSVFormat format1 = format.withQuoteChar('"');
CSVFormat format2 = format.withHeader();
If we simply returned this, format, format1 and format2 would all refer to the same modified object, so format1 would be equal to format2, which is absolutely not what we expect.
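The copy-on-modify idea can be sketched in a few lines. SimpleFormat below is a hypothetical class written for this illustration, not the library's CSVFormat; it only assumes that every field is final and that each withXXX() method builds a new instance.

```java
// Hypothetical class for illustration; not the library's CSVFormat.
final class SimpleFormat {
    private final char delimiter;
    private final Character quoteChar; // null means quoting is disabled
    private final boolean header;

    private SimpleFormat(char delimiter, Character quoteChar, boolean header) {
        this.delimiter = delimiter;
        this.quoteChar = quoteChar;
        this.header = header;
    }

    static SimpleFormat newFormat(char delimiter) {
        return new SimpleFormat(delimiter, null, false);
    }

    // Each withXXX() copies the current settings and changes exactly one of them.
    SimpleFormat withQuoteChar(char newQuoteChar) {
        return new SimpleFormat(delimiter, newQuoteChar, header);
    }

    SimpleFormat withHeader() {
        return new SimpleFormat(delimiter, quoteChar, true);
    }

    char getDelimiter() { return delimiter; }
    Character getQuoteChar() { return quoteChar; }
    boolean hasHeader() { return header; }
}

final class SimpleFormatDemo {
    public static void main(String[] args) {
        SimpleFormat format = SimpleFormat.newFormat(',');
        SimpleFormat format1 = format.withQuoteChar('"');
        SimpleFormat format2 = format.withHeader();

        // format1 has a quote character but no header; format2 is the other way round.
        System.out.println(format1.getQuoteChar() + " " + format1.hasHeader());
        System.out.println(format2.getQuoteChar() + " " + format2.hasHeader());
        // The original object is untouched by both calls.
        System.out.println(format.getQuoteChar() + " " + format.hasHeader());
    }
}
```

Because no instance ever changes after construction, a format can be shared freely between parsers, and each derived format stays independent of the others.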
CSVFormat provides quite flexible ways of specifying a CSV format. Details can be found in its javadoc, which is well written. We can set the delimiter character, the comment start marker, the quote character, etc. Therefore, for the following CSV file, where fields are separated by TAB and comments start with #,
COL1	COL2
# comments
a	b
c	d
e	f
We can create a format
CSVFormat format = CSVFormat.newFormat('\t')
        .withCommentStart('#')
        .withIgnoreEmptyLines(true)
        .withNullString("")
        .withHeader();
In summary, Apache Commons CSV was started to provide a common and simple interface for reading and writing CSV files under the Apache License. It is still in the sandbox, but it is flexible enough to meet different requirements. Finally, I would like to emphasize that reading well-written code really helps improve programming skills. Therefore, I highly recommend reading this project’s source code, which is simple but powerful.