Spring AI Testing: AI Evaluators Example
Large Language Models (LLMs) have become an integral part of modern applications, but ensuring their accuracy and reliability is crucial. Spring AI provides tools to evaluate LLM responses effectively. Let us delve into understanding Spring AI testing and the role of AI evaluators in ensuring reliable AI models.
1. Introduction to Ollama and TestContainers
1.1 What is Ollama?
Ollama is a platform designed to run large language models (LLMs) efficiently on local machines. It provides an easy-to-use interface for loading, running, and managing AI models without requiring cloud dependencies. Ollama is optimized for speed and performance, making it ideal for developers and researchers who want to experiment with AI models directly on their hardware.
One of the key features of Ollama is its ability to work offline, ensuring data privacy and reducing reliance on external servers. It supports various open-source models, allowing users to choose the best fit for their applications. Additionally, Ollama simplifies model deployment by providing a streamlined installation process and a user-friendly API for integration with other applications.
Overall, Ollama is a powerful tool for AI enthusiasts, developers, and businesses looking to harness the potential of large language models without the complexities of cloud-based solutions.
1.2 What are TestContainers?
Testcontainers is an open-source library that enables developers to create lightweight, throwaway instances of databases, message brokers, and other services in Docker containers for integration testing. It simplifies the process of setting up and managing dependencies, ensuring consistent test environments. The key benefits of Testcontainers include improved test reliability, faster feedback loops, and easier setup of complex dependencies. It is widely used for testing microservices, database interactions, and cloud-native applications without requiring dedicated infrastructure. By leveraging Docker, Testcontainers provides a clean, isolated environment, reducing flakiness in tests and making continuous integration pipelines more robust.
By combining Ollama with TestContainers, developers can create a controlled environment for evaluating AI models. This approach not only improves the reliability of AI responses but also enhances the overall testing and deployment workflow of LLM-based applications.
2. Code Example
Create a Spring Boot application using Spring Initializr. This Spring Boot application integrates with Ollama running in a TestContainer and evaluates LLM responses using `RelevanceEvaluator` and `FactCheckingEvaluator`.
2.1 Adding Spring Dependencies
Add the following dependencies to the `pom.xml` file:
```xml
<dependencies>
    <!-- Spring Boot Web -->
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-web</artifactId>
    </dependency>
    <!-- Spring AI Core -->
    <dependency>
        <groupId>org.springframework.ai</groupId>
        <artifactId>spring-ai-core</artifactId>
        <version>0.8.0</version>
    </dependency>
    <!-- TestContainers for Ollama -->
    <dependency>
        <groupId>org.testcontainers</groupId>
        <artifactId>testcontainers</artifactId>
        <version>1.19.3</version>
    </dependency>
    <!-- JUnit for Testing -->
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-test</artifactId>
        <scope>test</scope>
    </dependency>
</dependencies>
```
2.2 Running Ollama in a TestContainer
We use TestContainers to run Ollama in an isolated environment. Ensure that Docker is installed on your machine before proceeding.
```java
package com.example.ollama;

import org.testcontainers.containers.GenericContainer;
import org.testcontainers.utility.DockerImageName;

/**
 * This class sets up and runs an Ollama container using TestContainers.
 */
public class OllamaContainer {

    // Create a GenericContainer instance for Ollama using the latest image.
    private static final GenericContainer<?> ollama =
            new GenericContainer<>(DockerImageName.parse("ollama/ollama:latest"))
                    .withExposedPorts(11434)      // Expose port 11434 for API access.
                    .withCommand("ollama serve"); // Run the Ollama server inside the container.

    // Static block to start the container when the class is loaded.
    static {
        ollama.start();
    }

    /**
     * Returns the base URL for accessing the Ollama instance.
     *
     * @return Base URL as a string in the format http://host:port
     */
    public static String getBaseUrl() {
        return "http://" + ollama.getHost() + ":" + ollama.getMappedPort(11434);
    }
}
```
2.2.1 Code Explanation
The provided Java code defines an `OllamaContainer` class to run an Ollama instance in an isolated environment using TestContainers. It leverages the `GenericContainer` class from TestContainers to pull and run the `ollama/ollama:latest` Docker image. The `.withExposedPorts(11434)` method ensures the Ollama server is accessible on port `11434`, while `.withCommand("ollama serve")` starts the server inside the container. The static block initializes and starts the container automatically when the class is loaded. The `getBaseUrl()` method constructs and returns the base URL for accessing Ollama by dynamically mapping the container's port to the host machine. This setup allows for seamless testing and integration of Ollama within a Spring Boot or Java-based application.
2.3 LLM Service (Queries Ollama)
This service sends prompts to Ollama and retrieves responses.
```java
package com.example.ollama.service;

import com.example.ollama.OllamaContainer;
import org.springframework.ai.client.OpenAiClient;
import org.springframework.ai.client.generative.GenerativeAiResponse;
import org.springframework.stereotype.Service;

/**
 * Service class for interacting with the AI model using OpenAI's client.
 */
@Service
public class LlmService {

    // OpenAiClient instance to communicate with the AI model.
    private final OpenAiClient client;

    /**
     * Constructor initializes the OpenAiClient with the base URL
     * retrieved from the OllamaContainer.
     */
    public LlmService() {
        this.client = new OpenAiClient(OllamaContainer.getBaseUrl());
    }

    /**
     * Sends a prompt to the AI model and returns the generated response.
     *
     * @param prompt The user-provided input for AI processing.
     * @return The AI-generated response as a string.
     */
    public String getResponse(String prompt) {
        GenerativeAiResponse response = client.complete(prompt);
        return response.getResult();
    }
}
```
2.3.1 Code Explanation
The `LlmService` class is a Spring service that interacts with an AI model using OpenAI's client. It initializes an `OpenAiClient` instance, connecting to the locally running Ollama instance via the base URL retrieved from `OllamaContainer.getBaseUrl()`. The `getResponse(String prompt)` method takes a user-provided prompt, sends it to the AI model using `client.complete(prompt)`, and returns the generated response. This setup allows seamless communication between the Spring Boot application and the AI model hosted in a TestContainers-managed environment.
2.4 Evaluators for Relevance and Fact-Checking
We integrate Spring AI's `RelevanceEvaluator` and `FactCheckingEvaluator` for AI model assessment.
2.4.1 RelevanceEvaluator
The `RelevanceEvaluator` is a component in Spring AI designed to assess how well an AI-generated response aligns with the given prompt. It analyzes the semantic similarity between the input and output, providing a relevance score that indicates the accuracy and contextual appropriateness of the response. This evaluator is useful for applications that require quality control over AI-generated content, ensuring meaningful and relevant outputs in conversational AI, chatbots, and automated text generation systems. For more details, visit the official Spring AI documentation.
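To build intuition for what a 0-to-1 relevance score expresses, here is a deliberately naive word-overlap (Jaccard) heuristic. This is not how Spring AI's `RelevanceEvaluator` works internally (it delegates judgment to a model); it is purely an illustrative toy:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class ToyRelevance {

    // Crude word-overlap (Jaccard) score between prompt and response.
    // NOT Spring AI's implementation -- only illustrates the idea of
    // a 0..1 relevance score.
    static double score(String prompt, String response) {
        Set<String> p = tokens(prompt);
        Set<String> r = tokens(response);
        Set<String> union = new HashSet<>(p);
        union.addAll(r);
        Set<String> inter = new HashSet<>(p);
        inter.retainAll(r);
        return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
    }

    private static Set<String> tokens(String text) {
        return new HashSet<>(Arrays.asList(text.toLowerCase().split("\\W+")));
    }

    public static void main(String[] args) {
        // Shared words {spring, ai, is} over union of six distinct words.
        System.out.println(score("What is Spring AI?", "Spring AI is a framework."));
    }
}
```

A semantic evaluator scores paraphrases highly even with no word overlap, which is exactly where this toy heuristic breaks down.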
```java
package com.example.ollama.evaluation;

import org.springframework.ai.evaluation.RelevanceEvaluator;
import org.springframework.stereotype.Service;

@Service
public class ResponseEvaluator {

    private final RelevanceEvaluator evaluator = new RelevanceEvaluator();

    public double getRelevanceScore(String prompt, String response) {
        return evaluator.evaluate(prompt, response).getScore();
    }
}
```
2.4.1.1 Code Explanation
The `ResponseEvaluator` class is a Spring service that utilizes the `RelevanceEvaluator` to assess the relevance of an AI-generated response based on a given prompt. It initializes an instance of `RelevanceEvaluator` and provides the `getRelevanceScore(String prompt, String response)` method, which evaluates the AI's response against the prompt and returns a relevance score. This helps in measuring the accuracy and appropriateness of AI-generated content within applications integrating AI models.
2.4.2 FactCheckingEvaluator
The `FactCheckingEvaluator` is a component in Spring AI that assesses the factual accuracy of AI-generated responses by comparing them against the given prompt. It assigns a fact-checking score, helping to determine how truthful and reliable the response is. This evaluator is particularly useful for applications that require high accuracy, such as news verification, research support, and knowledge-based AI systems. For more details, visit the official Spring AI documentation.
```java
package com.example.ollama.evaluation;

import org.springframework.ai.evaluation.FactCheckingEvaluator;
import org.springframework.stereotype.Service;

@Service
public class FactCheckService {

    private final FactCheckingEvaluator evaluator = new FactCheckingEvaluator();

    public double getFactCheckScore(String prompt, String response) {
        return evaluator.evaluate(prompt, response).getScore();
    }
}
```
2.4.2.1 Code Explanation
The `FactCheckService` class is a Spring service that leverages the `FactCheckingEvaluator` to verify the factual accuracy of an AI-generated response. It initializes an instance of `FactCheckingEvaluator` and provides the `getFactCheckScore(String prompt, String response)` method, which evaluates the response against the given prompt and returns a fact-checking score. This helps ensure that AI-generated content is reliable and factually correct, making it useful for applications where content accuracy is critical, such as news verification, research assistance, and knowledge-based AI systems.
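As with relevance, the shape of a fact-checking score can be illustrated with a toy heuristic: treat a response as "supported" in proportion to how many of its terms appear in a reference context. Spring AI's `FactCheckingEvaluator` instead asks a model to verify claims; this sketch only mimics the input/output shape:

```java
import java.util.Arrays;
import java.util.List;

public class ToyFactCheck {

    // Toy stand-in: the fraction of response terms found in the reference
    // context. NOT Spring AI's implementation -- illustration only.
    static double score(String context, String response) {
        List<String> terms = Arrays.asList(response.toLowerCase().split("\\W+"));
        String ctx = context.toLowerCase();
        long supported = terms.stream().filter(t -> !t.isEmpty() && ctx.contains(t)).count();
        long total = terms.stream().filter(t -> !t.isEmpty()).count();
        return total == 0 ? 0.0 : (double) supported / total;
    }

    public static void main(String[] args) {
        // Every response term appears in the context, so the score is 1.0.
        System.out.println(score("Spring AI integrates AI models into Java apps.",
                                 "Spring AI integrates AI into Java"));
    }
}
```

A real fact-checker must handle negation and paraphrase ("does not integrate" shares every word with "integrates" here), which is why production evaluators rely on an LLM judge rather than term matching.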
2.5 Controller (Integrating LLM & Evaluators)
Develop a controller that generates an LLM response, evaluates its relevance and factual accuracy, and returns the results in a well-structured JSON format.
```java
package com.example.ollama.controller;

import com.example.ollama.evaluation.FactCheckService;
import com.example.ollama.evaluation.ResponseEvaluator;
import com.example.ollama.service.LlmService;
import org.springframework.web.bind.annotation.*;

import java.util.HashMap;
import java.util.Map;

@RestController
@RequestMapping("/api/llm")
public class LlmController {

    private final LlmService llmService;
    private final ResponseEvaluator responseEvaluator;
    private final FactCheckService factCheckService;

    public LlmController(LlmService llmService,
                         ResponseEvaluator responseEvaluator,
                         FactCheckService factCheckService) {
        this.llmService = llmService;
        this.responseEvaluator = responseEvaluator;
        this.factCheckService = factCheckService;
    }

    @GetMapping("/generate")
    public Map<String, Object> generate(@RequestParam String prompt) {
        String response = llmService.getResponse(prompt);
        double relevanceScore = responseEvaluator.getRelevanceScore(prompt, response);
        double factCheckScore = factCheckService.getFactCheckScore(prompt, response);

        // Creating a structured response
        Map<String, Object> result = new HashMap<>();
        result.put("prompt", prompt);
        result.put("response", response);
        result.put("relevanceScore", relevanceScore);
        result.put("factCheckScore", factCheckScore);
        return result;
    }
}
```
2.5.1 Code Explanation
The `LlmController` class is a Spring Boot REST controller that handles AI-generated responses while evaluating their relevance and factual accuracy. It is mapped to the `/api/llm` endpoint and depends on three services: `LlmService` for generating responses, `ResponseEvaluator` for assessing relevance, and `FactCheckService` for verifying factual accuracy. The `/generate` endpoint accepts a prompt as a query parameter, fetches the AI-generated response, calculates the relevance and fact-checking scores, and returns a structured JSON response containing the prompt, generated response, and evaluation scores. This approach ensures that AI responses are assessed for quality and reliability before being presented to the user.
2.6 Setting up the properties
Include the following properties in the `application.properties` file located at `src/main/resources/application.properties`.
```properties
# Server Configuration
server.port=8080

# Ollama API (Dynamic URL from TestContainers)
# ollama.api.base-url=${OLLAMA_BASE_URL:http://localhost:11434}

# Spring AI Configurations
spring.ai.model=gpt-3.5-turbo
spring.ai.temperature=0.7
```
- `server.port=8080` – Specifies that the application will run on port `8080`.
- Spring AI Configurations:
  - `spring.ai.model=gpt-3.5-turbo` – Defines the AI model to be used, in this case, `gpt-3.5-turbo`.
  - `spring.ai.temperature=0.7` – Adjusts the AI response randomness, with higher values making outputs more creative and lower values making them more deterministic.
2.7 Running the Application
To start the application, run the main class from your IDE or use the following command:
```shell
mvn spring-boot:run
```
2.7.1 Testing the API
You can test the API using cURL or Postman.
Execute the following cURL command:
```shell
curl -G "http://localhost:8080/api/llm/generate" --data-urlencode "prompt=What is Spring AI?"
```
The API will respond with a structured JSON output similar to the following:
```json
{
  "prompt": "What is Spring AI?",
  "response": "Spring AI is a framework that integrates AI into Java applications.",
  "relevanceScore": 0.95,
  "factCheckScore": 0.87
}
```
3. Unit Testing (Validating the LLM response)
The `LlmEvaluatorTest` class is a unit test for validating the functionality of the `LlmController` in a Spring Boot application.
```java
package com.example.ollama;

import com.example.ollama.controller.LlmController;
import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;

import java.util.Map;

import static org.junit.jupiter.api.Assertions.*;

@SpringBootTest
public class LlmEvaluatorTest {

    @Autowired
    private LlmController llmController;

    @Test
    void testLlmResponse() {
        String prompt = "What is Spring AI?";
        Map<String, Object> result = llmController.generate(prompt);

        assertNotNull(result);
        assertTrue((double) result.get("relevanceScore") > 0.7,
                "Response is not relevant!");
        assertTrue((double) result.get("factCheckScore") > 0.8,
                "Response contains incorrect information!");
    }
}
```
The test class is annotated with `@SpringBootTest` to load the application context, and it uses `@Autowired` to inject an instance of `LlmController`. The `testLlmResponse()` method sends a test prompt, "What is Spring AI?", to the controller and retrieves the AI-generated response. It then asserts that the response is not null and verifies that the `relevanceScore` is above 0.7 and the `factCheckScore` is above 0.8. If these conditions are not met, appropriate failure messages are displayed.
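Stripped of the Spring machinery, the pass criteria reduce to a simple threshold gate over the two evaluator scores; a minimal sketch using the same thresholds as the test:

```java
public class QualityGate {

    // Mirrors the assertions in LlmEvaluatorTest: a response is accepted
    // only when both evaluator scores clear their thresholds.
    static boolean accept(double relevanceScore, double factCheckScore) {
        return relevanceScore > 0.7 && factCheckScore > 0.8;
    }

    public static void main(String[] args) {
        // Sample scores from the earlier JSON response: both thresholds cleared.
        System.out.println(accept(0.95, 0.87));
    }
}
```

Keeping this gate in one place makes it easy to tune the thresholds, or to reject a response in production the same way the test does.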
The test case will successfully pass if all conditions are met and no assertions fail during execution.
```text
All Tests passed.
```
If these conditions are not met, an assertion failure message is displayed.
4. Conclusion
By using Spring AI Evaluators along with Ollama in TestContainers, we can effectively test LLM responses in a reproducible environment. This approach ensures that the generated responses meet expectations, improving the reliability of AI-driven applications.