Add tools for comparing translation quality between different models
(e.g., Sonnet vs Opus) or prompt variations. Useful for evaluating
translation improvements before deploying changes.
- run_test.py: Test runner with Dify API streaming
- compare.py: Generate similarity reports between variants
- Example spec and documentation included
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude <noreply@anthropic.com>