Lab 24 Solution: Creating an Evaluation Case for the Calculator Agent
Goal
This file contains the expected contents of the `eval_results/calculator_tests.evalset.json` file generated during the lab. It demonstrates the structure of a "golden path" test case that can be used for automated regression testing.
eval_results/calculator_tests.evalset.json
```json
{
  "eval_set_id": "calculator_tests",
  "eval_cases": [
    {
      "eval_id": "addition_test",
      "conversation": [
        {
          "user_content": {
            "role": "user",
            "parts": [
              {
                "text": "What is 10 + 5?"
              }
            ]
          },
          "final_response": {
            "role": "model",
            "parts": [
              {
                "text": "The result of 10 + 5 is 15."
              }
            ]
          },
          "intermediate_data": {
            "tool_uses": [
              {
                "name": "tools.calculator.add",
                "args": {
                  "a": 10,
                  "b": 5
                }
              }
            ],
            "tool_responses": [
              {
                "name": "tools.calculator.add",
                "response": {
                  "status": "success",
                  "result": 15
                }
              }
            ]
          }
        }
      ]
    }
  ]
}
```
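
As a quick local sanity check before automating anything, the short script below (a convenience sketch, not part of the lab deliverables) loads the evalset and prints the expected tool trajectory for each case. It assumes the file sits at the path shown above.

```python
import json
from pathlib import Path

# Load the evalset produced during the lab and print the recorded expectations.
evalset = json.loads(Path("eval_results/calculator_tests.evalset.json").read_text())

for case in evalset["eval_cases"]:
    print(f"Eval case: {case['eval_id']}")
    for turn in case["conversation"]:
        user_text = turn["user_content"]["parts"][0]["text"]
        expected = turn["final_response"]["parts"][0]["text"]
        print(f"  user: {user_text!r}")
        print(f"  expected response: {expected!r}")
        for tool_use in turn["intermediate_data"]["tool_uses"]:
            print(f"  expected tool call: {tool_use['name']}({tool_use['args']})")
```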
Self-Reflection Answers

- Why is testing the `tool_trajectory` often more important for ensuring an agent's correctness than just testing its final text response?
  - Answer: Testing the `tool_trajectory` is crucial because it validates the agent's underlying reasoning process. An LLM might coincidentally produce a correct final answer even if it used the wrong tools or called them in an incorrect order. Trajectory testing ensures that the agent follows the intended workflow, using the right tools with the correct arguments. This is vital for predictability, reliability, and preventing unexpected failures or inefficient resource usage in complex scenarios. A minimal trajectory comparison is sketched below.
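  As a concrete illustration, the sketch below compares an agent's actual tool calls against the `tool_uses` recorded in the eval case above. The `trajectory_matches` helper and its strict name/args/order policy are assumptions made for illustration, not the ADK's internal scoring logic.

  ```python
  def trajectory_matches(expected_tool_uses: list[dict], actual_tool_uses: list[dict]) -> bool:
      """Return True only if the agent called the same tools, in the same order,
      with the same arguments, as recorded in the eval case."""
      if len(expected_tool_uses) != len(actual_tool_uses):
          return False
      return all(
          exp["name"] == act["name"] and exp["args"] == act["args"]
          for exp, act in zip(expected_tool_uses, actual_tool_uses)
      )

  # With the addition_test case: a correct run matches, while a run that skipped
  # the calculator tool fails the check even if its final text happened to be right.
  expected = [{"name": "tools.calculator.add", "args": {"a": 10, "b": 5}}]
  print(trajectory_matches(expected, [{"name": "tools.calculator.add", "args": {"a": 10, "b": 5}}]))  # True
  print(trajectory_matches(expected, []))  # False: correct-looking text can hide a skipped tool call
  ```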
- The `response_match_score` is not a simple "equals" check. Why is this fuzzy matching necessary for evaluating LLM-generated text?
  - Answer: LLMs are non-deterministic. For the same input, they can produce semantically identical but syntactically different responses (e.g., "The weather is sunny" vs. "It's a sunny day"). A simple exact-match `equals` check would fail these valid variations. Fuzzy matching (such as ROUGE scores) measures n-gram overlap or semantic similarity, allowing tests to pass as long as the key information and meaning are conveyed, even if the exact wording differs from the reference. This makes evaluation more robust and less prone to false negatives. A small ROUGE example follows this item.
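  To see fuzzy matching in action, the snippet below uses the open-source `rouge-score` package to compare a reference answer against two candidate phrasings. This is a stand-alone illustration of the idea; the 0.5 pass threshold is an arbitrary assumption, not the ADK's default scoring.

  ```python
  # pip install rouge-score
  from rouge_score import rouge_scorer

  scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)

  reference = "The result of 10 + 5 is 15."
  candidates = [
      "The result of 10 + 5 is 15.",   # exact wording
      "10 + 5 equals 15.",             # different wording, same meaning
  ]

  for candidate in candidates:
      # score() returns precision/recall/F-measure for each requested ROUGE variant.
      f1 = scorer.score(reference, candidate)["rouge1"].fmeasure
      print(f"{candidate!r}: rouge1 f1 = {f1:.2f}, pass = {f1 >= 0.5}")
  ```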
- How could you integrate the `adk eval` command into a CI/CD pipeline (like GitHub Actions) to automatically test your agent every time you push new code?
  - Answer: You would add a step to your CI/CD workflow (e.g., a `.github/workflows/main.yml` file for GitHub Actions). After checking out your code and installing dependencies, you would run the `adk eval` command. For example:

    ```yaml
    jobs:
      test:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v3
          - name: Set up Python
            uses: actions/setup-python@v4
            with:
              python-version: '3.10'
          - name: Install dependencies
            run: |
              pip install --upgrade pip
              pip install google-adk
              # Install any agent-specific dependencies
          - name: Run agent evaluations
            run: adk eval . eval_results/calculator_tests.evalset.json
    ```

    If `adk eval` returns a non-zero exit code (indicating test failures), the CI/CD pipeline step will fail, preventing regressions from being merged or deployed. This provides an automated safety net for agent development.
- What is the difference between "Golden Path" testing and "User Simulation"?
  - Answer: "Golden Path" testing (Regression Testing) uses static, recorded conversations to ensure the agent performs exactly as it did in the past for known inputs. It verifies correctness and prevents breaking changes. "User Simulation" (Stress/Dynamic Testing) uses a generative model to act as a dynamic user. It creates varied, unpredictable conversations based on a scenario, helping to find edge cases, safety issues, or robustness failures that static tests might miss. The toy sketch below contrasts the two loops.