OpenHands CodeAct v2.1 vs. Tools + Claude 3.5 Sonnet
The recent release of the Claude 3.5 Sonnet (20241022) model has been a game-changer, revitalizing the SWE-Bench leaderboard. This model is the driving force behind several systems currently holding the top positions.
Looking at the top of the SWE-Bench leaderboard:
- Rank 1: Amazon Q Developer Agent (v20241202-dev). The submission README says "Our team is currently working on publishing details of the Q Dev Agent algorithm."
- Rank 2: devlo. The submission README does not share any details.
- Rank 3: OpenHands + CodeAct v2.1 (claude-3-5-sonnet-20241022). The submission entry explicitly highlights its use of the Claude 3.5 Sonnet model, and the agent is open-source and available on GitHub.
It makes for an interesting exercise to compare this Rank 3 submission against Anthropic's official entry, Tools + Claude 3.5 Sonnet (2024-10-22), which holds Rank 9 on the SWE-Bench_Verified leaderboard as of December 28, 2024.
NOTE: For our analysis, we will take a cursory approach, focusing on identifying overarching patterns rather than examining individual instances in detail.
| Rank | Model | % Resolved | Date |
|---|---|---|---|
| 3 | OpenHands + CodeAct v2.1 (claude-3-5-sonnet-20241022) | 53.00 | 2024-10-29 |
| 9 | Tools + Claude 3.5 Sonnet (2024-10-22) | 49.00 | 2024-10-22 |
The OpenHands system resolves about 4 percentage points more instances than Anthropic's official submission (53% vs. 49%, i.e. 265 vs. 245 of the 500 instances). Anthropic's official post makes a relevant observation about how much of this performance comes from the model itself rather than the scaffolding around it, and that observation provides an excellent foundation for this exercise.
Let’s begin.
```
Total instances: 500
OpenHands resolved instances: 265
Tools Sonnet resolved instances: 245

# is there overlap?
Number of instances resolved by both systems: 205
Number of instances resolved by OpenHands only: 60
Number of instances resolved by Tools Sonnet only: 40
```
Both systems resolve 205 instances in common, but 60 instances are resolved only by OpenHands and 40 only by Tools + Claude 3.5 Sonnet. This is interesting.
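These overlap numbers are straightforward to reproduce from the resolved-instance lists published with each submission. Below is a minimal sketch using Python set arithmetic, assuming the instance IDs have been extracted into two plain-text files (one ID per line); the file names are placeholders, not the actual artifact names.

```python
# Minimal sketch: set arithmetic over the two resolved-instance ID lists.
# The file names below are illustrative placeholders.

def load_ids(path: str) -> set[str]:
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

openhands = load_ids("openhands_resolved.txt")
tools_sonnet = load_ids("tools_sonnet_resolved.txt")

print(f"Resolved by both systems: {len(openhands & tools_sonnet)}")       # 205
print(f"Resolved by OpenHands only: {len(openhands - tools_sonnet)}")     # 60
print(f"Resolved by Tools Sonnet only: {len(tools_sonnet - openhands)}")  # 40
```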
Now, let's take a deeper dive into the details, leveraging the official posts from both submissions as well as the publicly accessible logs and trajectories.
The OpenHands + CodeAct v2.1 (claude-3-5-sonnet-20241022) submission README says:

> This submission is made with OpenHands CodeAct v2.1 at commit a86275e288087bc7833bc79b0b691ed2ec223205 of PR #4537 using `anthropic/claude-3-5-sonnet-20241022`.
and the official post outlines the changes below (a brief function-calling sketch follows the list):
- Switched to use function calling, a method used by language models to more precisely specify the functions available to them
- We switched to Anthropic's new claude-3.5 model, released in October
- Made a number of fixes to make it easier for agents to traverse directories
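To make the first change concrete, here is a minimal sketch of what function calling looks like with the Anthropic Python SDK: the agent declares its tools as JSON schemas, and the model responds with structured tool-use blocks instead of free-form text. The tool definition here is a simplified stand-in, not OpenHands' actual definition.

```python
# Minimal function-calling sketch with the Anthropic Python SDK.
# The tool below is a simplified stand-in for an agent's bash tool.
import anthropic

client = anthropic.Anthropic()

bash_tool = {
    "name": "execute_bash",
    "description": "Execute a bash command in the terminal.",
    "input_schema": {
        "type": "object",
        "properties": {
            "command": {"type": "string", "description": "The command to run."},
        },
        "required": ["command"],
    },
}

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    tools=[bash_tool],
    messages=[{"role": "user", "content": "List the files in the repository."}],
)

# The model returns structured tool_use blocks rather than free-form text.
for block in response.content:
    if block.type == "tool_use":
        print(block.name, block.input)
```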
Let's review the prompt for both systems first, and then the tools used by each.
Prompt
Tools + Claude 3.5 Sonnet (2024-10-22)
The Anthropic Tool Using Agent prompt is copied below from the official post. It outlines a suggested approach for the model, but it's not overly long or too detailed for this task.
```
<uploaded_files>
{location}
</uploaded_files>
I've uploaded a python code repository in the directory {location} (not in /tmp/inputs). Consider the following PR description:
<pr_description>
{pr_description}
</pr_description>
Can you help me implement the necessary changes to the repository so that the requirements specified in the <pr_description> are met?
I've already taken care of all changes to any of the test files described in the <pr_description>. This means you DON'T have to modify the testing logic or any of the tests in any way!
Your task is to make the minimal changes to non-tests files in the {location} directory to ensure the <pr_description> is satisfied.
Follow these steps to resolve the issue:
1. As a first step, it might be a good idea to explore the repo to familiarize yourself with its structure.
2. Create a script to reproduce the error and execute it with `python <filename.py>` using the BashTool, to confirm the error
3. Edit the sourcecode of the repo to resolve the issue
4. Rerun your reproduce script and confirm that the error is fixed!
5. Think about edgecases and make sure your fix handles them as well
Your thinking should be thorough and so it's fine if it's very long.
```
OpenHands + CodeAct v2.1 (claude-3-5-sonnet-20241022)
The OpenHands agent has an additional system message:
```
You are a helpful assistant that can interact with a computer to solve tasks.
<IMPORTANT>
* If user provides a path, you should NOT assume it's relative to the current working directory. Instead, you should explore the file system to find the file before working on it.
</IMPORTANT>
```
Apart from this system message, OpenHands retains Anthropic's prompt largely unchanged (Anthropic's trajectories are available at https://github.com/eschluntz/swe-bench-experiments/tree/main/evaluation/verified/20241022_tools_claude-3-5-sonnet-updated/trajs).
Below is a diff illustrating the differences between the prompts:
```diff
--- tools_sonnet_prompt.txt 2024-12-28 13:33:44
+++ openhands_prompt.txt 2024-12-28 13:39:38
@@ -1,7 +1,7 @@
 <uploaded_files>
-/repo
+/workspace/astropy__astropy__1.3
 </uploaded_files>
-I've uploaded a python code repository in the directory /repo (not in /tmp/inputs). Consider the following PR description:
+I've uploaded a python code repository in the directory astropy__astropy__1.3. Consider the following PR description:
 <pr_description>
 {pr_description}
```
We can make the following observations:

- Upon exploring the log files from the Anthropic submission, we can see that `{location}` is always set to `/repo` for Anthropic's Tool Using Agent. For OpenHands CodeAct 2.1, `{location}` is set to the directory where the git repo was cloned inside the sandbox, e.g. `/workspace/astropy__astropy__1.3`, `/workspace/django__django__3.2`, `/workspace/pallets__flask__2.3`, etc. The workspace directory name is a combination of the instance repo and version, computed in the OpenHands GitHub repo here:

```python
def _get_swebench_workspace_dir_name(instance: pd.Series) -> str:
    return f'{instance.repo}__{instance.version}'.replace('/', '__')
```
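For example, assuming the SWE-bench instance is a pandas Series with `repo` and `version` fields, the helper above maps an astropy instance to the workspace name seen in the trajectories:

```python
import pandas as pd

# Illustrative instance; the field values mirror the astropy example above.
instance = pd.Series({"repo": "astropy/astropy", "version": "1.3"})
print(_get_swebench_workspace_dir_name(instance))  # astropy__astropy__1.3
```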
- The OpenHands prompt does not update the `{location}` variable in the following sentence of the prompt; it is left as `/repo`:

```
Your task is to make the minimal changes to non-tests files in the {location} directory to ensure the <pr_description> is satisfied.
```
A quick grep of the OpenHands trajectories reveals several instances with the following stdout observation entries in the log files:

```
OBSERVATION:\nls -la /repo\r\nls: cannot access '/repo': No such file or directory
OBSERVATION:\nERROR:\nThe path /repo does not exist. Please provide a valid path.
'/repo/' is not a directory
I notice the repository path is not in /repo. Let's move it there. Let's create the /repo directory first
Let me check if we need to add /repo to the path
```
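A sketch of that search, assuming the OpenHands trajectories have been downloaded locally into a `trajs/` directory (the layout and file extension are assumptions):

```python
from pathlib import Path

# Phrases observed in the confused /repo interactions quoted above.
patterns = [
    "cannot access '/repo'",
    "The path /repo does not exist",
    "'/repo/' is not a directory",
]

for traj in sorted(Path("trajs").rglob("*.json")):
    text = traj.read_text(errors="ignore")
    hits = [p for p in patterns if p in text]
    if hits:
        print(f"{traj.name}: {hits}")
```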
These entries show that the Claude 3.5 Sonnet model gets confused by the two different path locations specified in the OpenHands prompt: `/repo` and `/workspace/astropy__astropy__1.3`.

Perhaps it was a typo/bug that was resolved before the submission, while the trajectories were not updated. The current version of the prompt in the OpenHands code has updated it to `/workspace`.
So, we can conclude that OpenHands retains Anthropic's prompt unchanged, but adds a system message that is absent from the Anthropic Tool Use Agent. Anthropic has not disclosed details about any system message they might have used during their evaluation.
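Mechanically, such a system message is supplied separately from the user prompt; with the Anthropic API it goes in the dedicated `system` parameter, as in this minimal sketch (the message text is OpenHands', everything else is illustrative):

```python
import anthropic

client = anthropic.Anthropic()

system_message = (
    "You are a helpful assistant that can interact with a computer to solve tasks.\n"
    "<IMPORTANT>\n"
    "* If user provides a path, you should NOT assume it's relative to the "
    "current working directory. Instead, you should explore the file system "
    "to find the file before working on it.\n"
    "</IMPORTANT>"
)

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=system_message,  # the system prompt is a top-level parameter
    messages=[{"role": "user", "content": "I've uploaded a python code repository ..."}],
)
```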
Tools
The Anthropic Tool Use Agent has two tools:
- a Bash Tool, for executing bash commands, and
- an Edit Tool, for viewing and editing files and directories.
The description of each tool contains detailed information for the Claude 3.5 Sonnet model about how to use it. I really like the tool descriptions - they are well-crafted and strike a balance between detail and clarity.
Bash Tool / CmdRunTool
Both OpenHands' `CmdRunTool` and Anthropic's Bash Tool have a very simple schema, taking only the command to be run in the environment, but they have different tool descriptions.
OpenHands `CmdRunTool` description:

```
Execute a bash command in the terminal.
* Long running commands: For commands that may run indefinitely, it should be run in the background and the output should be redirected to a file, e.g. command = `python3 app.py > server.log 2>&1 &`.
* Interactive: If a bash command returns exit code `-1`, this means the process is not yet finished. The assistant must then send a second call to terminal with an empty `command` (which will retrieve any additional logs), or it can send additional text (set `command` to the text) to STDIN of the running process, or it can send command=`ctrl+c` to interrupt the process.
* Timeout: If a command execution result says "Command timed out. Sending SIGINT to the process", the assistant should retry running the command in the background.
```
Anthropic's Bash Tool description:

```
Run commands in a bash shell
* When invoking this tool, the contents of the "command" parameter does NOT need to be XML-escaped.
* You don't have access to the internet via this tool.
* You do have access to a mirror of common linux and python packages via apt and pip.
* State is persistent across command calls and discussions with the user.
* To inspect a particular line range of a file, e.g. lines 10-25, try 'sed -n 10,25p /path/to/the/file'.
* Please avoid commands that may produce a very large amount of output.
* Please run long lived commands in the background, e.g. 'sleep 10 &' or start a server in the background.
```
Edit Tool
The OpenHands `StrReplaceEditorTool` uses the same a) tool description, b) schema, and c) code implementation as Anthropic's `str_replace_editor` edit tool. In CodeAct 2.1, `StrReplaceEditorTool` was used here and here.
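For a sense of what this shared interface looks like in practice, here is a hypothetical example of a single `str_replace` tool call; the field names follow the published `str_replace_editor` interface, while the path and strings are purely illustrative:

```python
# A hypothetical str_replace_editor invocation (illustrative values only).
edit_call = {
    "command": "str_replace",    # other commands: view, create, insert, undo_edit
    "path": "/repo/example.py",  # absolute path to the file being edited
    "old_str": "return value.strip()",  # must match the file contents exactly, once
    "new_str": "return value.strip() if value else ''",
}
```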
Wait, what?
If both systems, OpenHands + CodeAct v2.1 (claude-3-5-sonnet-20241022) at Rank 3 and Tools + Claude 3.5 Sonnet (2024-10-22) at Rank 9, share:

- the prompt,
- the Edit tool for viewing and editing files and directories, and
- the Bash tool schema,

and differ only in:

- an additional/different system instruction/prompt, and
- the description of the Bash tool,

then these differences raise an interesting question: are they significant enough to explain the following scenario?

```
60 instances only resolved by OpenHands
40 instances only resolved by Tools + Claude 3.5 Sonnet
205 instances resolved by both systems
```
Perhaps what Anthropic's official post suggests is true, even when there are minor differences in the scaffolding!
Takeaways

- I find it quite surprising that such seemingly minor differences could lead to markedly distinct outcomes on the benchmark.
- It's entirely plausible that repeated runs of the same system would yield entirely different outcomes. This raises an intriguing point about the consistency and reliability of these systems: how much of the observed performance is deterministic versus subject to randomness?
Update (Jan 3, 2025)
Response from Xingyao Wang @ All Hands AI / OpenHands:

> Yeah, I've been observing similar things. The most challenging thing about agent evaluation is that even if you run the SAME code on the SAME set of problems with the SAME temperature=0, you typically end up with different results every time 😅.
>
> We recently re-ran the commit…
>
> — Xingyao Wang (@xingyaow_) January 3, 2025