Enterprise Java

Integrate Amazon Athena with Spring Boot

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Let’s dive into understanding Spring Boot integration with Amazon Athena.

1. Introduction

1.1 What is Amazon S3?

Amazon Simple Storage Service (S3) is an object storage service that offers industry-leading scalability, data availability, security, and performance. With Amazon S3, you can store and protect any amount of data for a range of use cases, such as websites, mobile apps, backup and restore, archive, enterprise applications, IoT devices, and big data analytics.

S3 provides developers and IT teams with secure, durable, and highly-scalable object storage at low cost. Data is stored in “buckets,” and each object is identified by a unique key, making it easy to organize and retrieve files when needed.

1.1.1 Key Features of Amazon S3

  • Durability and Availability: Amazon S3 is designed to offer 99.999999999% (11 nines) durability and 99.99% availability.
  • Scalability: S3 can scale storage up or down to meet varying needs.
  • Security: S3 integrates with AWS Identity and Access Management (IAM) to securely control access to your data.
  • Data Organization: You can store objects in a structured way with folders and prefixes to improve organization.

1.2 What is Amazon Athena?

Amazon Athena is an interactive query service that makes it easy to analyze data directly in Amazon S3 using standard SQL. With Athena, there’s no need for complex ETL processes; you can start querying the data where it lives. Athena uses Presto, an open-source distributed SQL engine, to run queries, making it a powerful tool for analyzing large datasets stored in S3.

1.2.1 Key Features of Amazon Athena

  • Serverless: Athena is serverless, which means you don’t have to manage any infrastructure or servers. You simply pay per query or per TB of data scanned.
  • SQL-Based: Athena allows you to run queries using SQL syntax, making it accessible to users with SQL knowledge.
  • Scalable and Fast: Athena is built to handle large amounts of data and return results quickly by scaling the resources needed to execute queries efficiently.
  • Integration with AWS Glue: You can use AWS Glue to automatically catalog data in S3, which makes it easier to query with Athena.

1.3 Why Use Amazon S3 and Amazon Athena Together?

Using Amazon S3 and Athena together is a powerful combination for big data analytics. By storing data in S3, you can have a centralized, highly available, and cost-effective data lake. Then, with Athena, you can query this data without moving or transforming it, saving both time and resources. This setup is particularly useful for large-scale analytics, reporting, and business intelligence applications.

2. Why IAM is Important for S3 and Athena Integration?

AWS Identity and Access Management (IAM) is a web service that helps you securely control access to AWS resources. IAM enables you to manage who can access resources in your AWS account and what actions they can perform. By defining fine-grained permissions, you can ensure that the right users have access to the right resources while protecting sensitive data and preventing unauthorized actions.

When using Amazon S3 and Athena in a Spring Boot application, IAM plays a critical role in securing access to these services. For instance:

  • S3 Permissions: IAM can control which users or applications can upload, download, or manage data in S3 buckets, helping prevent unauthorized access to stored data.
  • Athena Permissions: IAM policies can also grant or restrict access to Athena, ensuring that only authorized applications or users can execute queries, view results, or manage resources in Athena.
  • Temporary Roles for Applications: By using IAM roles, you can give your Spring Boot application temporary access to S3 and Athena, without embedding sensitive credentials in your codebase.

2.1 IAM Policy for S3 and Athena Access

Here is an IAM policy that grants read access to a specific S3 bucket and permissions to run queries in Athena:

{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Action":["s3:GetObject","s3:ListBucket"],"Resource":["arn:aws:s3:::your-bucket-name","arn:aws:s3:::your-bucket-name/*"]},{"Effect":"Allow","Action":["athena:StartQueryExecution","athena:GetQueryExecution","athena:GetQueryResults"],"Resource":"*"}]}

In this policy, replace your-bucket-name with the name of your S3 bucket. This policy allows the application to access data in S3 and execute queries in Athena. To use this IAM policy, create an IAM user and attach the policy to it. This same IAM user will be used throughout the tutorial.

3. Setting up Amazon S3 and Athena

Go to the Amazon S3 Console and create a new bucket. Upload a CSV file or JSON file to this bucket that you want to query using Athena. Below is a sample CSV file that can be used:

id,name,age,city
1,John Doe,29,New York
2,Jane Smith,34,Los Angeles
3,Michael Johnson,45,Chicago
4,Emily Davis,23,Houston
5,William Brown,37,Phoenix
6,Jessica Wilson,30,Philadelphia
7,David Lee,40,San Antonio
8,Amy Clark,28,San Diego
9,Chris Walker,32,Dallas
10,Laura Scott,25,San Jose

3.1 Set Up an Athena Database and Table

Go to the Amazon Athena Console. Create a new database (say your_database) in Athena or use an existing one. Create a table in Athena that points to the data in your S3 bucket. Use SQL queries like:

CREATE EXTERNAL TABLE IF NOT EXISTS my_table (
    id INT,
    name STRING,
    age INT
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ('serialization.format' = ',')
LOCATION 's3://your-bucket-name/path/';

Make sure to replace your_database and your-bucket-name with the actual names.

4. Code Example

Create a new Spring Boot application or add dependencies to an existing project. Add the following dependencies to your pom.xml file:

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter</artifactId>
</dependency>

<dependency>
    <groupId>software.amazon.awssdk</groupId>
    <artifactId>athena</artifactId>
    <version>your_jar_version</version>
</dependency>

4.1 Configure AWS Credentials

Provide your AWS IAM credentials in the application.properties file.

server.port = 9090

# AWS config
aws.accessKeyId=YOUR_ACCESS_KEY
aws.secretAccessKey=YOUR_SECRET_KEY
aws.region=YOUR_REGION

Replace the IAM details with those created earlier.

  • aws.accessKeyId=YOUR_ACCESS_KEY: Replace YOUR_ACCESS_KEY with the IAM user AWS access key ID.
  • aws.secretAccessKey=YOUR_SECRET_KEY: Replace YOUR_SECRET_KEY with the IAM user AWS secret access key.
  • aws.region=YOUR_REGION: Replace YOUR_REGION with the AWS region you are working in (e.g., us-west-2).

4.2 Athena Query Service

Now, we’ll create a service in Spring Boot that uses the Athena SDK to run queries.

import org.springframework.stereotype.Service;
import software.amazon.awssdk.services.athena.AthenaClient;
import software.amazon.awssdk.services.athena.model.*;

@Service
public class AthenaService {

    private final AthenaClient athenaClient;

    public AthenaService(AthenaClient athenaClient) {
        this.athenaClient = athenaClient;
    }

    public String startQueryExecution(String query) {
        QueryExecutionContext context = QueryExecutionContext.builder().database("your_database").build();
        ResultConfiguration configuration = ResultConfiguration.builder()
                .outputLocation("s3://your-output-bucket/athena/results/")
                .build();

        StartQueryExecutionRequest request = StartQueryExecutionRequest.builder()
                .queryString(query)
                .queryExecutionContext(context)
                .resultConfiguration(configuration)
                .build();

        StartQueryExecutionResponse response = athenaClient.startQueryExecution(request);
        return response.queryExecutionId();
    }

    public GetQueryExecutionResponse getQueryExecutionStatus(String queryExecutionId) {
        GetQueryExecutionRequest request = GetQueryExecutionRequest.builder().queryExecutionId(queryExecutionId).build();
        return athenaClient.getQueryExecution(request);
    }

    public GetQueryResultsResponse getQueryResults(String queryExecutionId) {
        GetQueryResultsRequest request = GetQueryResultsRequest.builder().queryExecutionId(queryExecutionId).build();
        return athenaClient.getQueryResults(request);
    }
}

4.1.1 Code Explanation

The @Service annotation designates the AthenaService class as a Spring service component. This class handles interactions with Amazon Athena.

The AthenaService constructor takes an AthenaClient instance, which is used to make calls to Athena. The startQueryExecution method is responsible for initiating a query execution. It takes a SQL query string as input, defines the database context, and sets up a result configuration that specifies where the results will be stored in Amazon S3 (e.g., s3://your-output-bucket/athena/results/).

The method creates a StartQueryExecutionRequest object with the provided query, execution context, and result configuration, then calls startQueryExecution on the athenaClient to execute the query. It returns the queryExecutionId, which uniquely identifies the query execution.

The getQueryExecutionStatus method takes a queryExecutionId as input and returns the execution status of the query. It uses a GetQueryExecutionRequest object to retrieve details of the query execution by calling getQueryExecution on the athenaClient.

The getQueryResults method also takes a queryExecutionId and retrieves the results of a completed query. It constructs a GetQueryResultsRequest and calls getQueryResults on the athenaClient to get the data associated with the query execution ID.

4.2 Code the Athena Controller

Now create a controller to expose an endpoint for querying the data.

import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;
import software.amazon.awssdk.services.athena.model.*;

@RestController
public class AthenaController {

    private final AthenaService athenaService;

    public AthenaController(AthenaService athenaService) {
        this.athenaService = athenaService;
    }

    @GetMapping("/query")
    public String executeQuery() {
	    String query = "SELECT * FROM my_table;";
        String queryExecutionId = athenaService.startQueryExecution(query);

        GetQueryExecutionResponse queryExecution = athenaService.getQueryExecutionStatus(queryExecutionId);
        while (queryExecution.queryExecution().status().state() == QueryExecutionState.RUNNING) {
            queryExecution = athenaService.getQueryExecutionStatus(queryExecutionId);
        }

        GetQueryResultsResponse queryResults = athenaService.getQueryResults(queryExecutionId);
        return queryResults.resultSet().toString();
    }
}

4.2.1 Code Explanation

The @RestController annotation defines the AthenaController class as a REST controller in Spring Boot, allowing it to handle HTTP requests. This controller is responsible for interacting with Amazon Athena through the AthenaService class.

The AthenaController constructor takes an AthenaService instance, which provides methods for starting a query execution, checking its status, and retrieving results from Athena.

The @GetMapping("/query") annotation maps the executeQuery method to a GET request on the /query endpoint. When this endpoint is accessed, the executeQuery method runs a SQL query (in this example, SELECT * FROM my_table;) by calling the startQueryExecution method from AthenaService. This method returns a queryExecutionId that uniquely identifies the query execution in Athena.

The method then uses a loop to repeatedly check the status of the query execution. The getQueryExecutionStatus method retrieves the current status, and the loop continues until the query is no longer in the RUNNING state.

Once the query execution completes, the getQueryResults method is called to retrieve the results. The results are then returned as a string representation of the ResultSet for display or further processing.

5. Run the code

To run the application, first start your Spring Boot application. Then, send a request to http://localhost:9090/query. Finally, check the response for the query results.

{"ResultSet":{"Rows":[{"Data":[{"VarCharValue":"id"},{"VarCharValue":"name"},{"VarCharValue":"age"},{"VarCharValue":"city"}]},{"Data":[{"VarCharValue":"1"},{"VarCharValue":"John Doe"},{"VarCharValue":"29"},{"VarCharValue":"New York"}]},{"Data":[{"VarCharValue":"2"},{"VarCharValue":"Jane Smith"},{"VarCharValue":"34"},{"VarCharValue":"Los Angeles"}]},{"Data":[{"VarCharValue":"3"},{"VarCharValue":"Michael Johnson"},{"VarCharValue":"45"},{"VarCharValue":"Chicago"}]},{"Data":[{"VarCharValue":"4"},{"VarCharValue":"Emily Davis"},{"VarCharValue":"23"},{"VarCharValue":"Houston"}]},{"Data":[{"VarCharValue":"5"},{"VarCharValue":"William Brown"},{"VarCharValue":"37"},{"VarCharValue":"Phoenix"}]},{"Data":[{"VarCharValue":"6"},{"VarCharValue":"Jessica Wilson"},{"VarCharValue":"30"},{"VarCharValue":"Philadelphia"}]},{"Data":[{"VarCharValue":"7"},{"VarCharValue":"David Lee"},{"VarCharValue":"40"},{"VarCharValue":"San Antonio"}]},{"Data":[{"VarCharValue":"8"},{"VarCharValue":"Amy Clark"},{"VarCharValue":"28"},{"VarCharValue":"San Diego"}]},{"Data":[{"VarCharValue":"9"},{"VarCharValue":"Chris Walker"},{"VarCharValue":"32"},{"VarCharValue":"Dallas"}]},{"Data":[{"VarCharValue":"10"},{"VarCharValue":"Laura Scott"},{"VarCharValue":"25"},{"VarCharValue":"San Jose"}]}]}}

Feel free to modify the query in the controller and experiment with this tutorial.

6. Conclusion

Integrating Amazon Athena with Spring Boot provides a scalable solution for querying large datasets stored in S3. By leveraging Athena, you can quickly access and analyze your data with SQL queries, avoiding the complexity of ETL processes.

Yatin Batra

An experience full-stack engineer well versed with Core Java, Spring/Springboot, MVC, Security, AOP, Frontend (Angular & React), and cloud technologies (such as AWS, GCP, Jenkins, Docker, K8).
Subscribe
Notify of
guest

This site uses Akismet to reduce spam. Learn how your comment data is processed.

0 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
Back to top button