
CSV Import into Elasticsearch with Spring Boot

Elasticsearch is a powerful search and analytics engine used in various applications requiring fast retrieval of structured and unstructured data. Importing CSV data into Elasticsearch is a common use case, and Spring Boot makes this process seamless. This article will guide you through how to import CSV data into Elasticsearch using Spring Boot.

1. Setting Up Elasticsearch

To install Elasticsearch, follow the official Elasticsearch Installation Guide provided by Elastic. The commands below create a Docker network and pull the Elasticsearch Docker image.

docker network create elastic
docker pull docker.elastic.co/elasticsearch/elasticsearch:8.17.1

Next, run the following command to start an Elasticsearch container:

docker run --name elasticsearch --net elastic -p 9200:9200 -it -m 1GB docker.elastic.co/elasticsearch/elasticsearch:8.17.1

This starts an Elasticsearch instance on port 9200.
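Note that Elasticsearch 8.x enables security (TLS and password authentication) by default, while the client configuration later in this article connects over plain HTTP without credentials. For a quick local test setup only, security can be switched off when starting the container; a sketch of such a command (not something to do in production):

docker run --name elasticsearch --net elastic -p 9200:9200 -it -m 1GB \
  -e "discovery.type=single-node" \
  -e "xpack.security.enabled=false" \
  docker.elastic.co/elasticsearch/elasticsearch:8.17.1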

Alternatively, if you have Homebrew installed on your system, you can run a few simple brew commands to quickly install Elasticsearch along with all the dependencies required for it to function properly. This ensures a smooth setup process without manually handling configuration.

brew tap elastic/tap
brew install elastic/tap/elasticsearch-full

After completing the installation, you can start Elasticsearch directly from the terminal by running the elasticsearch command.

$ elasticsearch

2. Create a Spring Boot Project and Add Dependencies

2.1 Creating the Spring Boot Project

Go to Spring Initializr and generate a Spring Boot project with the following dependencies:

  • Spring Web – to create REST APIs
  • Spring Data Elasticsearch – to integrate with Elasticsearch

Download and unzip the project, then open it in your IDE.

2.2 Adding Required Dependencies

Open the pom.xml file and add the following dependencies:

<!-- Elasticsearch Rest High-Level Client -->
<dependency>
    <groupId>org.elasticsearch.client</groupId>
    <artifactId>elasticsearch-rest-high-level-client</artifactId>
    <version>7.17.27</version>
</dependency>
 
<!-- Apache Commons CSV for parsing CSV files -->
<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-csv</artifactId>
    <version>1.13.0</version>
</dependency>

3. Configuring Elasticsearch Rest High-Level Client

Create a configuration class to set up the Elasticsearch client:

@Configuration
public class ElasticsearchConfig {

    @Bean
    public RestHighLevelClient client() {
        return new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http"))
        );
    }
}

This configuration creates a RestHighLevelClient bean that connects to Elasticsearch on localhost:9200.
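If the Elasticsearch host and port differ between environments, they can be externalized instead of being hard-coded. A minimal sketch, assuming two illustrative properties named elasticsearch.host and elasticsearch.port in application.properties (both names are arbitrary choices for this example):

@Configuration
public class ElasticsearchConfig {

    // Hypothetical property names; define them in application.properties,
    // e.g. elasticsearch.host=localhost and elasticsearch.port=9200
    @Value("${elasticsearch.host:localhost}")
    private String host;

    @Value("${elasticsearch.port:9200}")
    private int port;

    @Bean
    public RestHighLevelClient client() {
        return new RestHighLevelClient(
                RestClient.builder(new HttpHost(host, port, "http"))
        );
    }
}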

4. Defining a Model for CSV Data

Assuming our CSV file (books.csv) contains the following data:

id,title,author,genre
1,Spring Boot in Action,Craig Walls,Technology
2,Effective Java,Joshua Bloch,Programming
3,Clean Code,Robert C. Martin,Software Engineering

Ensure this file is placed in the src/main/resources directory of your Spring Boot application.

Create a Book.java model class:

@Document(indexName = "books")
public class Book {
     
    @Id
    private String id;
    private String title;
    private String author;
    private String genre;
 
    public Book() {
    }
 
    public Book(String id, String title, String author, String genre) {
        this.id = id;
        this.title = title;
        this.author = author;
        this.genre = genre;
    }
 
  // Standard Getters and Setters
     
}
  • @Document(indexName = "books") tells Spring Data Elasticsearch to treat this class as an Elasticsearch document.
  • @Id marks the id field as the primary identifier in Elasticsearch.

5. Reading and Parsing CSV Data

We will use Apache Commons CSV to read the CSV file. Create CSVService.java:

@Service
public class CSVService {
     
    public List<Book> parseCSV(String filePath) {
        List<Book> books = new ArrayList<>();
        try (Reader reader = new InputStreamReader(new ClassPathResource(filePath).getInputStream(), StandardCharsets.UTF_8);
             CSVParser csvParser = new CSVParser(reader, CSVFormat.DEFAULT.withFirstRecordAsHeader())) {
 
            for (CSVRecord csvRecord : csvParser) {
                Book book = new Book(
                        csvRecord.get("id"),
                        csvRecord.get("title"),
                        csvRecord.get("author"),
                        csvRecord.get("genre")
                );
                books.add(book);
            }
        } catch (IOException e) {
            System.err.println("Failed to parse CSV file: " + e.getMessage());
        }
        return books;
    }
}

This method, parseCSV(String filePath), is responsible for reading and parsing a CSV file containing book data. It takes the file path as an argument and returns a list of Book objects. The method initializes an empty ArrayList to store books and uses a try-with-resources block to handle file reading safely.

Inside the try block, it creates a Reader to read the CSV file from the classpath using ClassPathResource. A CSVParser is then used to parse the file, with CSVFormat.DEFAULT.withFirstRecordAsHeader() ensuring that the first row is treated as column headers rather than data.

The method iterates over each CSVRecord in the parsed CSV file. For each record, it extracts values from the columns “id,” “title,” “author,” and “genre” and constructs a new Book object. This object is then added to the list. Finally, the method returns the list of parsed Book objects, making them available for further processing, such as indexing into Elasticsearch.
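As a quick sanity check, the parser can be exercised on application startup before any Elasticsearch wiring is in place. A minimal sketch using a CommandLineRunner (this bean is purely illustrative and not part of the import flow):

@Component
public class CsvParseCheck implements CommandLineRunner {

    @Autowired
    private CSVService csvService;

    @Override
    public void run(String... args) {
        // Parse books.csv from the classpath and print what was read
        List<Book> books = csvService.parseCSV("books.csv");
        System.out.println("Parsed " + books.size() + " books from CSV");
        books.forEach(book -> System.out.println(book.getTitle() + " by " + book.getAuthor()));
    }
}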

5.1 Indexing Data into Elasticsearch

Create ElasticsearchService.java to index data into Elasticsearch.

@Service
public class ElasticsearchService {
     
    @Autowired
    private RestHighLevelClient client;
    @Autowired
    private ObjectMapper objectMapper;
 
    public void indexBooks(List<Book> books) throws IOException {
        for (Book book : books) {
            IndexRequest indexRequest = new IndexRequest("books")
                    .id(book.getId())
                    .source(objectMapper.convertValue(book, Map.class));
            client.index(indexRequest, RequestOptions.DEFAULT);
        }
    }
}

The ElasticsearchService class is responsible for interacting with Elasticsearch using the Elasticsearch Rest High-Level Client. It provides a method to index a list of Book objects into an Elasticsearch index named "books". The class has two dependencies: RestHighLevelClient, which is used to communicate with Elasticsearch, and ObjectMapper, which converts Java objects into a format suitable for indexing.

The indexBooks(List<Book> books) method takes a list of books as input and iterates over them. For each Book object, it creates an IndexRequest for the "books" index, using the book’s id as the document ID. The source method is used to convert the Book object into a Map, ensuring it is stored in a JSON-compatible format. The request is then sent to Elasticsearch using the client.index() method with default request options.
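Indexing one document per request works fine for a small file, but every call is a separate HTTP round trip. The same client also supports bulk indexing; a sketch of an alternative method that could be added to the service above (not part of the original example) might look like this:

public void bulkIndexBooks(List<Book> books) throws IOException {
    BulkRequest bulkRequest = new BulkRequest();
    for (Book book : books) {
        bulkRequest.add(new IndexRequest("books")
                .id(book.getId())
                .source(objectMapper.convertValue(book, Map.class)));
    }
    // Send all index operations to Elasticsearch in a single request
    BulkResponse response = client.bulk(bulkRequest, RequestOptions.DEFAULT);
    if (response.hasFailures()) {
        System.err.println("Bulk indexing reported failures: " + response.buildFailureMessage());
    }
}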

5.2 Creating a Controller to Trigger Import

Now that we have set up our CSV parsing and Elasticsearch indexing logic, we need a way to trigger the import process. A REST controller will expose an API endpoint that reads the CSV file, converts the data into Book objects, and indexes them into Elasticsearch. This approach lets us initiate the import process with a simple HTTP request.

Here’s the implementation of the controller:

@RestController
@RequestMapping("/books")
public class BookController {
 
    @Autowired
    private CSVService csvService;
    @Autowired
    private ElasticsearchService elasticsearchService;
 
    @PostMapping("/import")
    public String importCSV() throws IOException {
        List<Book> books = csvService.parseCSV("books.csv");
        elasticsearchService.indexBooks(books);
        return "CSV Imported Successfully!";
    }
}

The importCSV() method is mapped to a POST request at /books/import. When this endpoint is called, it first invokes csvService.parseCSV("books.csv"), which reads and parses the CSV file. The returned list of Book objects is then passed to elasticsearchService.indexBooks(books), which indexes the books into Elasticsearch. If the process completes without errors, the method returns "CSV Imported Successfully!" as a response.

Now, start the application and trigger the CSV import with the following curl command in your terminal:

curl -X POST http://localhost:8080/books/import

If the import succeeds, the endpoint returns the plain-text response CSV Imported Successfully!.

[Figure: Spring Boot CSV Import to Elasticsearch example output]

6. Importing CSV into Elasticsearch Using Spring Batch

In addition to using a simple CSV parsing and indexing approach, we can leverage Spring Batch to efficiently handle large CSV files and import them into Elasticsearch. Spring Batch provides a scalable and robust way to process large datasets in chunks, ensuring reliability, fault tolerance, and better performance.

Spring Batch operates in three main steps:

  1. Reading the CSV file
  2. Processing and transforming data
  3. Writing the data to Elasticsearch

This method is useful when dealing with large CSV files that may not fit into memory at once. First, let us add the necessary dependency to our pom.xml file:

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-batch</artifactId>
    <version>3.4.2</version>
</dependency>

6.1 Implement CSV Reader

This block of code reads book records from the CSV file and maps them to Book objects.

@Bean
public FlatFileItemReader<Book> reader() {
    return new FlatFileItemReaderBuilder<Book>()
            .name("bookItemReader")
            .resource(new ClassPathResource("data.csv"))
            .delimited()
            .names("id", "title", "author", "genre")
            .targetType(Book.class)
            .build();
}

This code defines a FlatFileItemReader<Book> bean using FlatFileItemReaderBuilder for reading a CSV file (data.csv) and mapping its contents to the Book class. The reader is named "bookItemReader" and loads the file from the classpath. It expects a comma-separated format, mapping columns (id, title, author, genre) to corresponding fields in Book. The .targetType(Book.class) method automatically converts each row into a Book object. Finally, .build() constructs the reader, which will be used in the batch job to process CSV data.
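One detail to watch: the reader above treats every line as data. If data.csv starts with a header row, as books.csv does, the reader should be told to skip it. A sketch of the same bean with that adjustment:

@Bean
public FlatFileItemReader<Book> reader() {
    return new FlatFileItemReaderBuilder<Book>()
            .name("bookItemReader")
            .resource(new ClassPathResource("data.csv"))
            .linesToSkip(1) // skip the header row
            .delimited()
            .names("id", "title", "author", "genre")
            .targetType(Book.class)
            .build();
}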

6.2 Implement Processor (ItemProcessor, Optional)

We can add a processor to modify data before saving it. The ItemProcessor allows transformation or validation of the data before it is written to Elasticsearch.

@Component
public class BookItemProcessor implements ItemProcessor<Book, Book> {
 
    @Override
    public Book process(Book book) {
        book.setTitle(book.getTitle().toUpperCase()); // Convert title to uppercase
        return book;
    }
}

The BookItemProcessor is an optional step that allows us to transform the data before writing it to Elasticsearch. In this example, the book title is converted to uppercase. If additional processing is needed, such as data validation or enrichment, it can be done here.
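A processor can also act as a filter: returning null from process() tells Spring Batch to drop the item, so it never reaches the writer. A small validation sketch (an alternative to the uppercase example above, not part of the original article):

@Component
public class BookValidationProcessor implements ItemProcessor<Book, Book> {

    @Override
    public Book process(Book book) {
        // Returning null skips this record entirely
        if (book.getTitle() == null || book.getTitle().isBlank()) {
            return null;
        }
        return book;
    }
}

Note that a processor only takes effect if it is registered on the step, for example via .processor(...) between the reader and the writer in the step configuration shown later.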

6.3 Implement Elasticsearch Writer

This writer takes the processed Book objects and indexes them into Elasticsearch.

@Bean
public ItemWriter<Book> writer(RestHighLevelClient restHighLevelClient) {
    return books -> {
        for (Book book : books) {
            IndexRequest request = new IndexRequest("book")
                    .id(book.getId())
                    .source(Map.of(
                            "id", book.getId(),
                            "title", book.getTitle(),
                            "author", book.getAuthor(),
                            "genre", book.getGenre()
                    ));
            restHighLevelClient.index(request, RequestOptions.DEFAULT);
        }
    };
}

This ItemWriter<Book> bean writes Book objects to Elasticsearch using RestHighLevelClient. It iterates through the list of books, creating an IndexRequest for each one with the index name "book", setting the document ID from book.getId(), and mapping the book fields (id, title, author, genre) to a Map. The request is then sent using restHighLevelClient.index(request, RequestOptions.DEFAULT), ensuring each book is stored in Elasticsearch.

6.4 Configure Spring Batch Job

The following configuration defines the job workflow and the chunk size for efficient batch processing.

@Bean
public Job importBooksJob(Step step, JobRepository jobRepository) {
 
    var builder = new JobBuilder("importBooksJob", jobRepository);
    return builder
            .start(step)
            .build();
}
 
@Bean
public Step jsonstep(
        JobRepository jobRepository,
        PlatformTransactionManager transactionManager, RestHighLevelClient restHighLevelClient) {
 
    var builder = new StepBuilder("batch-step", jobRepository);
    return builder
            .<Book, Book>chunk(4, transactionManager)
            .reader(reader())
            .writer(writer(restHighLevelClient))
            .faultTolerant()
            .skip(FlatFileParseException.class)
            .skipLimit(10)
            .build();
}

The first bean, importBooksJob, defines a Spring Batch job named "importBooksJob". It takes a Step and a JobRepository as parameters. The JobBuilder is used to create the job, specifying the jobRepository for managing job execution metadata. The job starts with the provided Step and is then built, making it ready for execution.

The second bean, jsonstep, defines a batch step named "batch-step" using StepBuilder. It takes a JobRepository, PlatformTransactionManager, and RestHighLevelClient as dependencies. The step processes data in chunks of 4 records at a time, using reader() to read CSV data and writer(restHighLevelClient) to store the books in Elasticsearch. The .faultTolerant() configuration allows the step to skip up to 10 FlatFileParseException errors, ensuring minor issues don’t halt the entire batch job.
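Two configuration details are worth keeping in mind. By default, Spring Boot launches every Job bean automatically at startup; since the job is triggered from a controller below, that behaviour can be disabled. Spring Batch also stores its execution metadata in a DataSource. A sketch of the relevant application.properties entries, assuming an embedded database such as H2 is on the classpath:

# Do not run batch jobs automatically at startup; the controller triggers them
spring.batch.job.enabled=false

# Create the Spring Batch metadata tables in the configured DataSource
spring.batch.jdbc.initialize-schema=always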

6.5 Run the Job from a Controller

This controller exposes an API endpoint to manually start the batch import.

@RestController
@RequestMapping("/batch")
public class BatchController {
 
    @Autowired
    private JobLauncher jobLauncher;
     
    @Autowired
    private Job job;
 
    @GetMapping("/import")
    public String importCSV() throws Exception {
        jobLauncher.run(job, new JobParameters());
        return "Batch Import Started!";
    }
}

This BatchController is a REST controller that provides an endpoint to trigger the batch job. It is mapped to "/batch" using @RequestMapping. The JobLauncher and Job are autowired to facilitate launching the batch job. JobLauncher is responsible for executing jobs, while Job represents the defined import process.

The importCSV() method is mapped to GET /batch/import. When called, it triggers the batch job using jobLauncher.run(job, new JobParameters()), starting the CSV import process.
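One caveat: Spring Batch identifies a job instance by its parameters, so calling this endpoint a second time with an empty JobParameters will fail because that instance has already completed. A common workaround, sketched below, is to pass a unique parameter such as a timestamp on every run:

@GetMapping("/import")
public String importCSV() throws Exception {
    // A fresh timestamp makes every launch a new job instance
    JobParameters params = new JobParametersBuilder()
            .addLong("startedAt", System.currentTimeMillis())
            .toJobParameters();
    jobLauncher.run(job, params);
    return "Batch Import Started!";
}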

7. Conclusion

In this article, we explored two methods for importing CSV data into Elasticsearch using Spring Boot: Apache Commons CSV and Spring Batch. The Apache Commons CSV approach provided a simple way to read CSV files, manually parse records, and index them into Elasticsearch using RestHighLevelClient. This method is lightweight and useful for small-scale imports or applications that don’t require advanced job management.

The Spring Batch approach offered a more structured and scalable solution for handling large CSV imports. It leveraged FlatFileItemReader for reading data, ItemProcessor for optional transformations, and ItemWriter for writing to Elasticsearch. Spring Batch’s fault tolerance, chunk-based processing, and job execution tracking made it ideal for handling large datasets efficiently.

Both methods are effective depending on our needs. Apache Commons CSV is great for quick imports, while Spring Batch provides better scalability and resilience. By implementing these techniques, we can efficiently manage and index CSV data in Elasticsearch, making it accessible for search and analytics.

8. Download the Source Code

This article covered how to import CSV data into Elasticsearch using Spring Boot.

You can download the full source code of this example here: import CSV data into Elasticsearch using Spring Boot

Omozegie Aziegbe

Omos Aziegbe is a technical writer and web/application developer with a BSc in Computer Science and Software Engineering from the University of Bedfordshire. Specializing in Java enterprise applications with the Jakarta EE framework, Omos also works with HTML5, CSS, and JavaScript for web development. As a freelance web developer, Omos combines technical expertise with research and writing on topics such as software engineering, programming, web application development, computer science, and technology.