The year 2024 was remarkable for SWE-Agents: system performance on our cherished SWE-Bench benchmark advanced significantly, and progress was especially rapid on SWE-Bench Verified following its release.

In a previous post, I discussed why SWE-Bench Verified may not accurately represent real-world SWE tasks, primarily due to the low proportion of issues requiring code changes across multiple files. If you haven’t read that post yet, I recommend doing so before continuing with this one.

As highlighted in the previous post, the SWE-Bench Verified dataset, consisting of 500 instances, can be split into two distinct buckets based on how many files the gold patch touches (a sketch for reproducing this split follows the list):

  • Single-file changes: 429/500 instances (85.8%)
  • Multiple-file changes: 71/500 instances (14.2%)
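
Here is a minimal sketch of how this split can be reproduced, assuming the Hugging Face datasets library and the publicly released princeton-nlp/SWE-bench_Verified dataset, whose patch field holds the human-written gold patch; counting the files named in each gold diff yields the two buckets above.

```python
import re
from collections import Counter

from datasets import load_dataset  # pip install datasets


def files_in_patch(patch: str) -> set[str]:
    """Return the set of file paths touched by a unified git diff."""
    # Each modified file is introduced by a "diff --git a/<path> b/<path>" line.
    return {m.group(1) for m in re.finditer(r"^diff --git a/(\S+) b/", patch, re.M)}


ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
buckets = Counter(
    "multi-file" if len(files_in_patch(row["patch"])) > 1 else "single-file"
    for row in ds
)
print(buckets)  # expected, per the numbers above: 429 single-file, 71 multi-file
```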

Furthermore, as the table below shows, the performance of the top-10 systems drops sharply on tasks requiring changes across multiple files, underscoring a critical area for improvement (a sketch for computing the multi-file column follows the table).

| Model | Overall %Resolved (count/500) | Multi-file %Resolved (count/71) |
| --- | --- | --- |
| Amazon Q Developer Agent (v20241202-dev) | 55.0 (275) | 21.13 (15) |
| devlo | 54.2 (271) | 19.72 (14) |
| OpenHands + CodeAct v2.1 (claude-3-5-sonnet-20241022) | 53.0 (265) | 25.35 (18) |
| Engine Labs (2024-11-25) | 51.8 (259) | 18.31 (13) |
| Agentless-1.5 + Claude-3.5 Sonnet (20241022) | 50.8 (254) | 18.31 (13) |
| Solver (2024-10-28) | 50.0 (250) | 18.31 (13) |
| Bytedance MarsCode Agent | 50.0 (250) | 18.31 (13) |
| nFactorial (2024-11-05) | 49.2 (246) | 14.08 (10) |
| Tools + Claude 3.5 Sonnet (2024-10-22) | 49.0 (245) | 11.27 (8) |
| Composio SWE-Kit (2024-10-25) | 48.6 (243) | 8.45 (6) |
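
The multi-file column in this table can be derived with a small helper like the one below. It assumes you have already collected, for each system, the set of instance IDs it resolved (for example, from the published leaderboard results), together with the set of 71 multi-file instance IDs computed above; the function and argument names are illustrative.

```python
def resolution_rates(
    resolved_by_system: dict[str, set[str]],
    multi_file_ids: set[str],
    total: int = 500,
) -> dict[str, tuple[float, float]]:
    """Return (overall %resolved, multi-file %resolved) for each system."""
    rates = {}
    for system, resolved in resolved_by_system.items():
        overall = 100 * len(resolved) / total
        multi = 100 * len(resolved & multi_file_ids) / len(multi_file_ids)
        rates[system] = (round(overall, 1), round(multi, 2))
    return rates


# Usage (hypothetical inputs):
# rates = resolution_rates(resolved_by_system, multi_file_ids)
# rates["devlo"]  ->  (54.2, 19.72)
```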

In this post, we’ll take a closer look at the performance of the top-10 systems on the multiple-file subset of SWE-Bench Verified from a fresh perspective:

  1. How does the approach of our current state-of-the-art SWE-Agents compare to that of a human software engineer (SWE) when solving these issues?
  2. Do our SWE-Agents appropriately modify multiple files to resolve the issue, given that these instances inherently require changes across multiple files (using the human-generated patch as the gold standard)?

These are the questions we aim to explore today! 🚀

Let’s start by analyzing how many of the 71 multiple-file instances were resolved by at least one of the top-10 systems. Based on the compiled statistics, we get the following (a sketch of this computation follows the list):

  • Multiple-file Instances (requiring changes across multiple files): 71/500
  • Multiple-file Instances resolved by at least one system: 29/71
  • Multiple-file Instances resolved by at least one system with single-file changes: 20/29
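
These three counts can be reproduced with a sketch like the following, given two hypothetical inputs built from the leaderboard submissions: gold_files, mapping each instance ID to the files in its gold patch, and system_patch_files, mapping each system to the patch file sets of the instances it resolved.

```python
def count_surprising_instances(
    gold_files: dict[str, set[str]],
    system_patch_files: dict[str, dict[str, set[str]]],
) -> tuple[set[str], set[str], set[str]]:
    """Reproduce the counts above: multi-file instances, those resolved by
    at least one system, and those resolved with a single-file model patch."""
    multi = {iid for iid, files in gold_files.items() if len(files) > 1}
    resolved = {
        iid for iid in multi
        if any(iid in per_system for per_system in system_patch_files.values())
    }
    single_file_resolved = {
        iid for iid in resolved
        if any(
            len(per_system.get(iid, set())) == 1
            for per_system in system_patch_files.values()
        )
    }
    return multi, resolved, single_file_resolved  # expected sizes: 71, 29, 20
```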

This observation is quite surprising. There are 20 instances that:

  1. Required changes across multiple files, as indicated by the human-generated patch.
  2. Were marked as resolved (i.e., passed both PASS_TO_PASS and FAIL_TO_PASS tests in the evaluation) by at least one of the top-10 systems.
  3. Were resolved by modifying only one file, contradicting the multi-file nature of the issue.

This discrepancy raises important questions about how the systems achieve these resolutions and the robustness of the evaluation process. 😳

Disclaimer

If you are working on the SWE-Bench benchmark, please refrain from attempting to reverse engineer the anonymized instance IDs using the information provided about the gold patch files. Additionally, do not access the original pull requests on GitHub or the corresponding gold patches in the dataset. This ensures that hidden tests remain protected, maintaining ethical standards.

To achieve this:

  • Instance IDs are purposefully anonymized.
  • The tabular analysis focuses only on listing the file names in the model-generated patches to identify any patterns, while keeping both the instance IDs and the patches hidden.

This approach safeguards the integrity of the benchmark while facilitating responsible and ethical analysis.

Analysis

Let’s take a closer look at these 20 instances. For each instance, the following information is provided:

  1. Instance ID (anonymized).
  2. Number of systems (out of the top-10) that resolved the instance: n/10.
  3. The set of files K in the gold patch, partitioned into (a sketch for computing this partition appears after the color legend below):
    • k1 ⊆ K: files modified by all n resolving systems.
    • k2 ⊆ K: files modified by some, but not all, of the n resolving systems.
    • k3 ⊆ K: files not modified by any of the n resolving systems.

By clicking on the anonymized Instance ID, you can view additional details, such as:

  • The file names in the gold patch (color-encoded):
    • Green: Files modified by all systems.
    • Orange: Files modified by some systems.
    • Red: Files not modified by any systems.
  • The file names in the model-generated patches for each of the n/10 systems that resolved the instance.
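
For concreteness, here is a minimal sketch of the k1/k2/k3 partition (the green/orange/red coloring described above), given the gold-patch file set of one instance and the patch file sets of the n systems that resolved it; the function name is illustrative.

```python
def partition_gold_files(
    gold: set[str],
    resolving_patch_files: list[set[str]],
) -> tuple[set[str], set[str], set[str]]:
    """Split the gold-patch files by how many resolving systems touched them."""
    k1 = {f for f in gold if all(f in p for p in resolving_patch_files)}      # green
    k3 = {f for f in gold if not any(f in p for p in resolving_patch_files)}  # red
    k2 = gold - k1 - k3                                                       # orange
    return k1, k2, k3
```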

Based on the additional details, the 20 instances can be categorized as follows:


Category 1: No file from the gold patch was modified

In these instances, the model-generated patches resolved the issue by modifying files that were not part of the gold patch, while none of the files in the gold patch were altered.

Note: Click on the anonymized Instance ID to view additional details

scikit-learn__scikit-learn-x5x0x
- resolved by 4/10 systems
- 2 files in gold patch
    - 2 files not modified by any systems


files in gold patch:

  • sklearn/base.py
  • sklearn/feature_selection/_base.py

| Model | # files | model_patch files |
| --- | --- | --- |
| devlo | 1 | sklearn/utils/_set_output.py |
| OpenHands + CodeAct v2.1 (claude-3-5-sonnet-20241022) | 3 | reproduce_error.py, sklearn/utils/_set_output.py, test_edge_cases.py |
| Agentless-1.5 + Claude-3.5 Sonnet (20241022) | 1 | sklearn/utils/_set_output.py |
| Bytedance MarsCode Agent | 1 | sklearn/utils/_set_output.py |

django__django-1x9x8
- resolved by 1/10 systems
- 2 files in gold patch
    - 2 files not modified by any systems


files in gold patch:

  • django/core/serializers/python.py
  • django/core/serializers/xml_serializer.py

| Model | # files | model_patch files |
| --- | --- | --- |
| Agentless-1.5 + Claude-3.5 Sonnet (20241022) | 1 | django/db/models/query_utils.py |

Category 2: At least 1 file from the gold patch was not modified

In these instances, the model-generated patches resolved the issue, yet at least one file from the gold patch was left unmodified by some or all of the resolving systems.
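
Before walking through the Category 2 instances, here is a minimal sketch of how an instance is assigned to one of the two categories under my reading of the definitions above; the inputs are the same hypothetical per-instance file sets used in the earlier sketches.

```python
def categorize(gold: set[str], resolving_patch_files: list[set[str]]) -> int:
    """Category 1: no resolving system touched any gold-patch file.
    Category 2: the issue was resolved, yet at least one resolving system
    left one or more gold-patch files unmodified."""
    if not any(gold & patch for patch in resolving_patch_files):
        return 1
    if any(gold - patch for patch in resolving_patch_files):
        return 2
    raise ValueError("every resolving system modified every gold-patch file")
```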

Note: Click on the anonymized Instance ID to view additional details

django__django-xxxx3
- resolved by 3/10 systems
- 2 files in gold patch
    - 1 file modified by all 3 systems
    - 1 file modified by some systems


files in gold patch:

  • django/urls/resolvers.py
  • django/urls/base.py

| Model | # files | model_patch files |
| --- | --- | --- |
| Amazon Q Developer Agent (v20241202-dev) | 1 | django/urls/resolvers.py |
| OpenHands + CodeAct v2.1 (claude-3-5-sonnet-20241022) | 4 | django/urls/base.py, django/urls/resolvers.py, reproduce.py, test_cache_clear.py |
| nFactorial (2024-11-05) | 2 | django/urls/base.py, django/urls/resolvers.py |

django__django-x1xxx
- resolved by 7/10 systems
- 5 files in gold patch
    - 1 file modified by all 7 systems
    - 1 file modified by some systems
    - 3 files not modified by any systems


files in gold patch:

  • django/core/mail/utils.py
  • django/core/mail/message.py
  • django/core/validators.py
  • django/utils/encoding.py
  • django/utils/html.py

| Model | # files | model_patch files |
| --- | --- | --- |
| Amazon Q Developer Agent (v20241202-dev) | 2 | django/core/mail/message.py, django/core/mail/utils.py |
| devlo | 1 | django/core/mail/utils.py |
| OpenHands + CodeAct v2.1 (claude-3-5-sonnet-20241022) | 2 | django/core/mail/utils.py, reproduce.py |
| Engine Labs (2024-11-25) | 2 | django/core/mail/utils.py, reproduce_issue.py |
| Agentless-1.5 + Claude-3.5 Sonnet (20241022) | 1 | django/core/mail/utils.py |
| Bytedance MarsCode Agent | 1 | django/core/mail/utils.py |
| Tools + Claude 3.5 Sonnet (2024-10-22) | 2 | django/core/mail/utils.py, reproduce_error.py |

django__django-xxxx5
- resolved by 10/10 systems
- 2 files in gold patch
    - 1 file modified by all 10 systems
    - 1 file not modified by any systems


files in gold patch:

  • django/contrib/admindocs/utils.py
  • django/contrib/admindocs/views.py

| Model | # files | model_patch files |
| --- | --- | --- |
| Amazon Q Developer Agent (v20241202-dev) | 1 | django/contrib/admindocs/utils.py |
| devlo | 1 | django/contrib/admindocs/utils.py |
| OpenHands + CodeAct v2.1 (claude-3-5-sonnet-20241022) | 2 | django/contrib/admindocs/utils.py, reproduce_error.py |
| Engine Labs (2024-11-25) | 2 | django/contrib/admindocs/utils.py, test_docstring.py |
| Agentless-1.5 + Claude-3.5 Sonnet (20241022) | 1 | django/contrib/admindocs/utils.py |
| Solver (2024-10-28) | 1 | django/contrib/admindocs/utils.py |
| Bytedance MarsCode Agent | 1 | django/contrib/admindocs/utils.py |
| nFactorial (2024-11-05) | 1 | django/contrib/admindocs/utils.py |
| Tools + Claude 3.5 Sonnet (2024-10-22) | 3 | django/contrib/admindocs/utils.py, reproduce_docstring_issue.py, test_docstring_edge_cases.py |
| Composio SWE-Kit (2024-10-25) | 1 | django/contrib/admindocs/utils.py |

django__django-xxx25
- resolved by 1/10 systems
- 2 files in gold patch
    - 1 file modified by the single resolving system
    - 1 file not modified by any systems


files in gold patch:

  • django/db/models/base.py
  • django/db/models/options.py

| Model | # files | model_patch files |
| --- | --- | --- |
| Agentless-1.5 + Claude-3.5 Sonnet (20241022) | 1 | django/db/models/base.py |

django__django-xxxx1
- resolved by 4/10 systems
- 4 files in gold patch
    - 3 files modified by some systems
    - 1 file not modified by any systems


files in gold patch:

  • django/db/backends/base/operations.py
  • django/db/backends/mysql/operations.py
  • django/db/backends/sqlite3/operations.py
  • django/db/models/expressions.py

| Model | # files | model_patch files |
| --- | --- | --- |
| Amazon Q Developer Agent (v20241202-dev) | 1 | django/db/backends/sqlite3/operations.py |
| devlo | 1 | django/db/backends/base/operations.py |
| Engine Labs (2024-11-25) | 2 | django/db/backends/base/operations.py, repro.py |
| Tools + Claude 3.5 Sonnet (2024-10-22) | 6 | django/db/backends/mysql/operations.py, django/db/backends/sqlite3/operations.py, reproduce.py, test_app/__init__.py, test_app/models.py, test_settings.py |

django__django-1xxxx
- resolved by 10/10 systems
- 2 files in gold patch
    - 1 file modified by some systems
    - 1 file not modified by any systems


files in gold patch:

  • django/db/backends/base/schema.py
  • django/db/models/fields/__init__.py

| Model | # files | model_patch files |
| --- | --- | --- |
| Amazon Q Developer Agent (v20241202-dev) | 1 | django/db/backends/sqlite3/schema.py |
| devlo | 1 | django/db/backends/sqlite3/schema.py |
| OpenHands + CodeAct v2.1 (claude-3-5-sonnet-20241022) | 2 | db.sqlite3, django/db/backends/base/schema.py |
| Engine Labs (2024-11-25) | 2 | django/db/backends/base/schema.py, test_choices.py |
| Agentless-1.5 + Claude-3.5 Sonnet (20241022) | 1 | django/db/backends/base/schema.py |
| Solver (2024-10-28) | 2 | django/db/backends/base/schema.py, tests/migrations/test_operations.py |
| Bytedance MarsCode Agent | 1 | django/db/backends/base/schema.py |
| nFactorial (2024-11-05) | 1 | django/db/backends/base/schema.py |
| Tools + Claude 3.5 Sonnet (2024-10-22) | 2 | django/db/backends/base/schema.py, reproduce.py |
| Composio SWE-Kit (2024-10-25) | 1 | django/db/backends/base/schema.py |

django__django-1x0xx
- resolved by 4/10 systems
- 2 files in gold patch
    - 1 file modified by some systems
    - 1 file not modified by any systems


files in gold patch:

  • django/db/models/sql/query.py
  • django/db/models/fields/related_lookups.py

| Model | # files | model_patch files |
| --- | --- | --- |
| Amazon Q Developer Agent (v20241202-dev) | 2 | django/db/models/query.py, django/db/models/sql/query.py |
| OpenHands + CodeAct v2.1 (claude-3-5-sonnet-20241022) | 1 | django/db/models/lookups.py |
| Engine Labs (2024-11-25) | 3 | django/db/models/lookups.py, test_fix.py, test_settings.py |
| Solver (2024-10-28) | 3 | django/db/models/lookups.py, django/db/models/sql/query.py, tests/annotations/test_alias_subquery.py |

matplotlib__matplotlib-2xxxx
- resolved by 5/10 systems
- 3 files in gold patch
    - 1 file modified by all 5 systems
    - 2 files modified by some systems


files in gold patch:

  • lib/matplotlib/text.py
  • lib/matplotlib/backends/backend_agg.py
  • lib/matplotlib/backends/backend_cairo.py

| Model | # files | model_patch files |
| --- | --- | --- |
| devlo | 3 | lib/matplotlib/backends/backend_agg.py, lib/matplotlib/backends/backend_cairo.py, lib/matplotlib/text.py |
| OpenHands + CodeAct v2.1 (claude-3-5-sonnet-20241022) | 4 | lib/matplotlib/backends/backend_agg.py, lib/matplotlib/text.py, test_text_antialiasing.py, text_antialiasing_test.png |
| Engine Labs (2024-11-25) | 4 | lib/matplotlib/backends/backend_agg.py, lib/matplotlib/text.py, simple_test_text.py, test_text_antialiasing.py |
| Agentless-1.5 + Claude-3.5 Sonnet (20241022) | 1 | lib/matplotlib/text.py |
| nFactorial (2024-11-05) | 1 | lib/matplotlib/text.py |

pydata__xarray-xx9x
- resolved by 4/10 systems
- 2 files in gold patch
    - 1 file modified by all 4 systems
    - 1 file not modified by any systems


files in gold patch:

  • xarray/core/variable.py
  • xarray/core/indexing.py

| Model | # files | model_patch files |
| --- | --- | --- |
| Amazon Q Developer Agent (v20241202-dev) | 1 | xarray/core/variable.py |
| devlo | 1 | xarray/core/variable.py |
| Solver (2024-10-28) | 1 | xarray/core/variable.py |
| nFactorial (2024-11-05) | 1 | xarray/core/variable.py |

pydata__xarray-xx0x
- resolved by 9/10 systems
- 2 files in gold patch
    - 2 files modified by some systems


files in gold patch:

  • xarray/core/dataset.py
  • xarray/core/variable.py

| Model | # files | model_patch files |
| --- | --- | --- |
| Amazon Q Developer Agent (v20241202-dev) | 2 | xarray/core/dataset.py, xarray/core/variable.py |
| devlo | 1 | xarray/core/dataarray.py |
| OpenHands + CodeAct v2.1 (claude-3-5-sonnet-20241022) | 4 | reproduce.py, test_edge_cases.py, xarray/core/dataset.py, xarray/core/variable.py |
| Engine Labs (2024-11-25) | 3 | reproduce_issue.py, xarray/core/dataset.py, xarray/core/variable.py |
| Agentless-1.5 + Claude-3.5 Sonnet (20241022) | 1 | xarray/core/dataarray.py |
| Solver (2024-10-28) | 2 | xarray/core/dataset.py, xarray/core/variable.py |
| Bytedance MarsCode Agent | 1 | xarray/core/dataarray.py |
| nFactorial (2024-11-05) | 1 | xarray/core/dataarray.py |
| Tools + Claude 3.5 Sonnet (2024-10-22) | 3 | reproduce.py, xarray/core/dataset.py, xarray/core/variable.py |

pydata__xarray-xxx3
- resolved by 5/10 systems
- 2 files in gold patch
    - 1 file modified by all 5 systems
    - 1 file not modified by any systems


files in gold patch:

  • xarray/core/dataarray.py
  • xarray/core/dataset.py

| Model | # files | model_patch files |
| --- | --- | --- |
| devlo | 1 | xarray/core/dataarray.py |
| OpenHands + CodeAct v2.1 (claude-3-5-sonnet-20241022) | 2 | reproduce.py, xarray/core/dataarray.py |
| Engine Labs (2024-11-25) | 5 | test_final.py, test_integrate.py, test_integrate_changes.py, test_integrate_simple.py, xarray/core/dataarray.py |
| Solver (2024-10-28) | 1 | xarray/core/dataarray.py |
| Bytedance MarsCode Agent | 1 | xarray/core/dataarray.py |

pylint-dev__pylint-x5xx
- resolved by 4/10 systems
- 2 files in gold patch
    - 1 file modified by all 4 systems
    - 1 file not modified by any systems


files in gold patch:

  • pylint/lint/pylinter.py
  • pylint/lint/expand_modules.py

| Model | # files | model_patch files |
| --- | --- | --- |
| Amazon Q Developer Agent (v20241202-dev) | 1 | pylint/lint/pylinter.py |
| OpenHands + CodeAct v2.1 (claude-3-5-sonnet-20241022) | 6 | .a/foo.py, ar.py, az.py, pylint/lint/pylinter.py, reproduce_error.py, test_fix.py |
| Solver (2024-10-28) | 1 | pylint/lint/pylinter.py |
| Tools + Claude 3.5 Sonnet (2024-10-22) | 3 | pylint/lint/pylinter.py, reproduce.py, test_ignore.py |

pylint-dev__pylint-x89x
- resolved by 2/10 systems
- 3 files in gold patch
    - 1 file modified by all 2 systems
    - 2 files not modified by any systems


files in gold patch:

  • pylint/config/argument.py
  • pylint/utils/__init__.py
  • pylint/utils/utils.py

| Model | # files | model_patch files |
| --- | --- | --- |
| Agentless-1.5 + Claude-3.5 Sonnet (20241022) | 1 | pylint/config/argument.py |
| nFactorial (2024-11-05) | 2 | pylint/config/argument.py, pyproject.toml |

pytest-dev__pytest-xx9x
- resolved by 10/10 systems
- 2 files in gold patch
    - 1 file modified by all 10 systems
    - 1 file not modified by any systems


files in gold patch:

  • src/_pytest/unittest.py
  • src/_pytest/python.py

| Model | # files | model_patch files |
| --- | --- | --- |
| Amazon Q Developer Agent (v20241202-dev) | 1 | src/_pytest/unittest.py |
| devlo | 2 | src/_pytest/unittest.py, test_reproduce.py |
| OpenHands + CodeAct v2.1 (claude-3-5-sonnet-20241022) | 2 | reproduce_error.py, src/_pytest/unittest.py |
| Engine Labs (2024-11-25) | 2 | src/_pytest/unittest.py, test_unittest_fixture.py |
| Agentless-1.5 + Claude-3.5 Sonnet (20241022) | 1 | src/_pytest/unittest.py |
| Solver (2024-10-28) | 2 | src/_pytest/unittest.py, test_unittest_fixture.py |
| Bytedance MarsCode Agent | 1 | src/_pytest/unittest.py |
| nFactorial (2024-11-05) | 1 | src/_pytest/unittest.py |
| Tools + Claude 3.5 Sonnet (2024-10-22) | 2 | reproduce_issue.py, src/_pytest/unittest.py |
| Composio SWE-Kit (2024-10-25) | 1 | src/_pytest/unittest.py |

scikit-learn__scikit-learn-xx6xx
- resolved by 4/10 systems
- 2 files in gold patch
    - 1 file modified by all 4 systems
    - 1 file not modified by any systems


files in gold patch:

  • sklearn/decomposition/dict_learning.py
  • examples/decomposition/plot_sparse_coding.py

| Model | # files | model_patch files |
| --- | --- | --- |
| Amazon Q Developer Agent (v20241202-dev) | 1 | sklearn/decomposition/dict_learning.py |
| Engine Labs (2024-11-25) | 2 | sklearn/decomposition/dict_learning.py, test_sparsecoder.py |
| Agentless-1.5 + Claude-3.5 Sonnet (20241022) | 1 | sklearn/decomposition/dict_learning.py |
| Bytedance MarsCode Agent | 1 | sklearn/decomposition/dict_learning.py |

sphinx-doc__sphinx-xx2x
- resolved by 6/10 systems
- 2 files in gold patch
    - 2 files modified by some systems


files in gold patch:

  • sphinx/application.py
  • sphinx/locale/__init__.py

| Model | # files | model_patch files |
| --- | --- | --- |
| devlo | 1 | sphinx/locale/__init__.py |
| OpenHands + CodeAct v2.1 (claude-3-5-sonnet-20241022) | 4 | sphinx/config.py, sphinx/locale/__init__.py, sphinx/util/i18n.py, test_locale_override.py |
| Agentless-1.5 + Claude-3.5 Sonnet (20241022) | 1 | sphinx/locale/__init__.py |
| Solver (2024-10-28) | 1 | sphinx/application.py |
| Bytedance MarsCode Agent | 1 | sphinx/locale/__init__.py |
| Composio SWE-Kit (2024-10-25) | 1 | sphinx/application.py |

sphinx-doc__sphinx-x59x
- resolved by 3/10 systems
- 2 files in gold patch
    - 1 file modified by all 3 systems
    - 1 file not modified by any systems


files in gold patch:

  • sphinx/ext/autodoc/__init__.py
  • sphinx/ext/autodoc/importer.py

| Model | # files | model_patch files |
| --- | --- | --- |
| Amazon Q Developer Agent (v20241202-dev) | 1 | sphinx/ext/autodoc/__init__.py |
| OpenHands + CodeAct v2.1 (claude-3-5-sonnet-20241022) | 31 | sphinx/ext/autodoc/__init__.py, sphinx/pycode/parser.py, test_meta_public.py, test_meta_public_dir/_build/html/.buildinfo, test_meta_public_dir/_build/html/.doctrees/environment.pickle, test_meta_public_dir/_build/html/.doctrees/index.doctree, test_meta_public_dir/_build/html/_sources/index.rst.txt, test_meta_public_dir/_build/html/_static/alabaster.css, test_meta_public_dir/_build/html/_static/basic.css, test_meta_public_dir/_build/html/_static/custom.css, test_meta_public_dir/_build/html/_static/doctools.js, test_meta_public_dir/_build/html/_static/documentation_options.js, test_meta_public_dir/_build/html/_static/file.png, test_meta_public_dir/_build/html/_static/jquery-3.5.1.js, test_meta_public_dir/_build/html/_static/jquery.js, test_meta_public_dir/_build/html/_static/language_data.js, test_meta_public_dir/_build/html/_static/minus.png, test_meta_public_dir/_build/html/_static/plus.png, test_meta_public_dir/_build/html/_static/pygments.css, test_meta_public_dir/_build/html/_static/searchtools.js, test_meta_public_dir/_build/html/_static/underscore-1.3.1.js, test_meta_public_dir/_build/html/_static/underscore.js, test_meta_public_dir/_build/html/genindex.html, test_meta_public_dir/_build/html/index.html, test_meta_public_dir/_build/html/objects.inv, test_meta_public_dir/_build/html/py-modindex.html, test_meta_public_dir/_build/html/search.html, test_meta_public_dir/_build/html/searchindex.js, test_meta_public_dir/conf.py, test_meta_public_dir/example.py, test_meta_public_dir/index.rst |
| Bytedance MarsCode Agent | 1 | sphinx/ext/autodoc/__init__.py |

sympy__sympy-xxx1x
- resolved by 1/10 systems
- 2 files in gold patch
    - 1 file modified by the single resolving system
    - 1 file not modified by any systems


files in gold patch:

  • sympy/simplify/sqrtdenest.py
  • sympy/simplify/radsimp.py

| Model | # files | model_patch files |
| --- | --- | --- |
| Engine Labs (2024-11-25) | 1 | sympy/simplify/sqrtdenest.py |

The above analysis reveals notable discrepancies between the approaches of human software engineers (SWEs) and model-based SWE-Agents when resolving complex multi-file issues.

Key Takeaways from the Analysis

  1. Differences in Problem-Solving Approaches:
    SWE-Agents handle multi-file issues differently from human SWEs, often making the failing tests pass without modifying the same set of files as the gold patch.

    NOTE: It is plausible that the model-generated patches from SWE-Agents might, in some cases, surpass those created by human SWEs in terms of quality or efficiency.

    • However, this hypothesis remains to be validated through a thorough review by a Subject Matter Expert (such as the GitHub repository maintainer) for the specific library or codebase.
    • Such an expert review could provide critical insights into the suitability and maintainability of the model-generated patches, helping to assess their long-term value compared to human contributions.
  2. Benchmark Limitations:
    The benchmark does not require SWE-Agents to make their changes in the appropriate locations, even though changing the right files is crucial for maintainable and robust code over the long term (a sketch of one possible file-level check follows this list).
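
As a thought experiment, the sketch below shows one possible file-level localization check that a benchmark could report alongside test-based resolution; the metric name and threshold are illustrative and not part of SWE-Bench.

```python
def gold_file_recall(gold: set[str], model_patch_files: set[str]) -> float:
    """Fraction of gold-patch files that the model patch also modifies."""
    return len(gold & model_patch_files) / len(gold) if gold else 1.0


# A stricter notion of success could require both conditions:
# the tests pass AND gold_file_recall(gold, patch_files) meets some threshold (e.g. 1.0).
```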

Implications

If these model-generated patches were submitted as pull requests, many would likely require significant rework to move the changes into the appropriate files. This step is essential to align with real-world software engineering practices and to produce code that is easier to maintain and extend.

Conclusion

The analysis highlights a gap between achieving benchmark success and adhering to real-world software engineering best practices, underscoring the need for benchmarks that better reflect practical requirements and long-term code maintainability.