Arvin Xu
|
e7598fe90b
|
✨ feat: support agent benchmark (#12355)
* improve total
fix page size issue
fix error message handler
fix eval home page
try to fix batch run agent step issue
fix run list
fix dataset loading
fix abort issue
improve jump and table column
fix error streaming
try to fix error output in vercel
refactor qstash workflow client
improve passK
add evals to proxy
refactor metrics
try to fix build
refactor tests
improve detail page
fix passK issue
improve eval-rubric
fix types
support passK
fix type
update
fix db insert issue
improve dataset ui
improve run config
finish step limit now
add step limited
100% coverage to models
add failed tests todo
support interruptOperation
fix lint
improve report detail
improve pass rate
improve sort order issue
fix timeout issue
Update db schema
get a complete case running end to end
update database
improve error handling
refactor to improve database
improve test case processing flow
improve some UX details and implementation
mostly finish the full Benchmark workflow
improve run case display
improve run case numbering issue
improve eval test case page
add eval test mode
add dataset page
update schema
support
finish create test run
fix
update
improve import exp
refactor data flow
improve import workflow
rubric Benchmark detail page
improve import ux
update schema
finish eval home page
add eval workflow endpoint
implement benchmark run model
refactor RAG eval
implement backend
update db schema
update db migration
init benchmark
* support rerun error test case
* fix tests
* fix tests
|
2026-02-21 20:36:40 +08:00 |
|