Fix stale PR translation revert issue (#630)

* Fix stale PR translation revert issue

When PR A is created before PR B but PR B merges first, the translation
workflow for PR A was reverting all of PR B's changes. This happened because
the translation workflow used PR A's working directory state (which is a
snapshot from before PR B existed) rather than applying only PR A's changes.

Root cause:
- setup_translation_branch() for new branches did:
  checkout -b branch → reset --soft origin/main → reset
  This kept the PR's working directory, which could be stale.

- For incremental branches, merge_docs_json_for_incremental_update() took
  the English section from the PR HEAD, which was also stale for old PRs.

Fix:
- For NEW branches: Create branch directly from origin/main (not from PR's
  working directory). This ensures we start with the latest state including
  all changes from PRs merged after this PR was created.

- For EXISTING branches: Merge main's docs.json structure with our
  translations (instead of taking EN section from stale PR)

- For BOTH: Selectively check out only the files that the PR actually changed
  from the PR's head, rather than bringing in its entire working directory.
  This prevents overwriting files from other PRs.
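
The EXISTING-branch merge strategy above can be sketched as a pure function (hypothetical helper name and simplified section shape, not the actual TranslationPRManager code):

```python
def merge_language_sections(main_languages, branch_languages, target_languages):
    """Take every language section from main's docs.json, but for target
    (translated) languages prefer the translation branch's section, which
    holds our existing translations."""
    branch_by_code = {lang.get("language"): lang for lang in branch_languages}
    merged = []
    for main_lang in main_languages:
        code = main_lang.get("language")
        if code in target_languages and code in branch_by_code:
            merged.append(branch_by_code[code])  # keep our translations
        else:
            merged.append(main_lang)  # main has the latest structure
    return merged
```

Main wins for the source language (so a stale PR can no longer revert it), while target-language sections survive from the branch.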

Example issue (PR #593):
- PR #593 only added one file
- Translation PR #611 tried to delete 11 files and revert massive docs.json changes
- This was because it used PR #593's stale state from before other PRs merged

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Fix: Use PR's docs.json for finding file positions in navigation

The initial fix had a side effect: since we start from main's docs.json,
and the PR's new files aren't in main's English section yet, sync_docs_json_incremental()
couldn't find where to place new files in the translation navigation.

Fix: Add a `reference_sha` parameter to sync_docs_json_incremental() that loads
the PR's docs.json for finding file positions, while still modifying main's
translation sections. This ensures:
1. Main's docs.json structure is preserved (no reverts)
2. New files are found in PR's docs.json
3. Translations are added at the correct positions
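
The position lookup can be illustrated with a minimal sketch (flat page lists only; the real find_dropdown_containing_file handles nested groups):

```python
def find_dropdown_containing_file(path, language_section):
    """Search a language section's dropdowns for a page path; return the
    dropdown name, or None if the page is absent."""
    for dropdown in language_section.get("dropdowns", []):
        if path in dropdown.get("pages", []):
            return dropdown.get("dropdown")
    return None
```

With a stale PR, main's EN section does not contain the new page, so the lookup must run against the PR's docs.json instead:

```python
main_en = {"dropdowns": [{"dropdown": "Guides", "pages": ["guides/intro"]}]}
pr_en = {"dropdowns": [{"dropdown": "Guides", "pages": ["guides/intro", "guides/new-page"]}]}
find_dropdown_containing_file("guides/new-page", main_en)  # None — main lacks it
find_dropdown_containing_file("guides/new-page", pr_en)    # found in the PR
```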

This also removes the unused _apply_pr_english_section_to_main() method.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Fix EN section not updated when using reference_sha

When the translation branch starts from main, the PR's docs.json
structural changes (new file entries in the EN section) were not being
incorporated. This caused the translation PR to have mismatched
navigation entries.

The fix now also updates the EN section of the working directory's
docs.json when processing added files found in the reference
docs.json (from the PR).
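
The EN-section insertion can be sketched as a simplified helper (hypothetical signature; the real add_page_at_location also takes the source dropdown and handles nested locations):

```python
def add_page_at_location(dropdown, path, index):
    """Insert `path` into the dropdown's top-level pages at `index`,
    clamping to the list length. Returns True if the page was added,
    False if it was already present."""
    pages = dropdown.setdefault("pages", [])
    if path in pages:
        return False
    pages.insert(min(index, len(pages)), path)
    return True
```

Returning False for an already-present page keeps the sync idempotent when a branch is rebuilt.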

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Also remove deleted files from EN section in stale PR scenario

When processing deleted files, the sync now also removes them from the
EN section of docs.json. This is needed when the translation branch
starts from main, which may still have the deleted file entries.
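
The recursive removal can be sketched as follows (simplified shape: a pages list of path strings and {"group": ..., "pages": [...]} entries, mirroring remove_page_from_structure):

```python
def remove_page_from_structure(pages, path):
    """Remove `path` from a nested pages list in place. Returns True if
    the page was found and removed anywhere in the tree."""
    for i, entry in enumerate(pages):
        if entry == path:
            del pages[i]
            return True
        if isinstance(entry, dict) and remove_page_from_structure(entry.get("pages", []), path):
            return True
    return False
```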

Verified with comprehensive local testing covering 10 scenarios:
- Basic stale PR, multiple files, modifications, deletions
- Nested groups, new dropdowns, mixed operations
- Backward compatibility, incremental syncs, structure changes

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Chenhe Gu
2025-12-23 22:46:11 -08:00
committed by GitHub
parent 61466c3f45
commit 61b37d88fc
2 changed files with 201 additions and 55 deletions


@@ -1383,11 +1383,19 @@ class DocsSynchronizer:
deleted_files: List[str] = None,
renamed_files: List[Tuple[str, str]] = None,
base_sha: str = None,
head_sha: str = None
head_sha: str = None,
reference_sha: str = None
) -> List[str]:
"""
Incrementally sync docs.json structure - only processes changed files.
Preserves existing dropdown names and only updates affected pages.
Args:
reference_sha: If provided, use this commit's docs.json for finding file positions
in the English section. This is needed when the working directory's
docs.json (from main) doesn't have the new files yet (stale PR scenario).
The translation sections are still modified in the working directory's
docs.json.
"""
sync_log = []
added_files = added_files or []
@@ -1447,7 +1455,7 @@ class DocsSynchronizer:
sync_log.append("ERROR: No languages found in navigation")
return sync_log
# Find language sections
# Find language sections in working directory docs.json (for modifying translations)
source_section = None
target_sections = {}
@@ -1457,6 +1465,40 @@ class DocsSynchronizer:
elif lang_data.get("language") in self.target_languages:
target_sections[lang_data.get("language")] = lang_data
# If reference_sha is provided, load that docs.json for finding file positions
# This is needed when working directory (main) doesn't have the new files yet
reference_source_section = source_section
if reference_sha and added_files:
try:
import subprocess
result = subprocess.run(
["git", "show", f"{reference_sha}:docs.json"],
cwd=self.base_dir,
capture_output=True,
text=True,
check=True
)
ref_docs = json.loads(result.stdout)
ref_navigation = ref_docs.get("navigation", {})
# Handle both structures
ref_languages = None
if "languages" in ref_navigation and isinstance(ref_navigation["languages"], list):
ref_languages = ref_navigation["languages"]
elif "versions" in ref_navigation and len(ref_navigation["versions"]) > 0:
if "languages" in ref_navigation["versions"][0]:
ref_languages = ref_navigation["versions"][0]["languages"]
if ref_languages:
for lang_data in ref_languages:
if lang_data.get("language") == self.source_language:
reference_source_section = lang_data
sync_log.append(f"INFO: Using reference docs.json from {reference_sha[:8]} for finding file positions")
break
except Exception as e:
sync_log.append(f"WARNING: Could not load reference docs.json from {reference_sha}: {e}")
# Fall back to working directory's source section
if not source_section:
sync_log.append("ERROR: Source language section not found")
return sync_log
@@ -1469,7 +1511,8 @@ class DocsSynchronizer:
continue
# Find which dropdown contains this file in source language section
result = self.find_dropdown_containing_file(source_file, source_section)
# Use reference_source_section (from PR's docs.json) to find new file positions
result = self.find_dropdown_containing_file(source_file, reference_source_section)
if not result:
sync_log.append(f"WARNING: Could not find {source_file} in source language navigation")
continue
@@ -1478,9 +1521,10 @@ class DocsSynchronizer:
sync_log.append(f"INFO: Found {source_file} in '{source_dropdown_name}' dropdown at location {file_location}")
# Get the source language dropdown for reference
# Use reference_source_section (from PR's docs.json) for new file structure
source_dropdown = None
source_dropdown_index = -1
for i, dropdown in enumerate(source_section.get("dropdowns", [])):
for i, dropdown in enumerate(reference_source_section.get("dropdowns", [])):
if dropdown.get("dropdown") == source_dropdown_name:
source_dropdown = dropdown
source_dropdown_index = i
@@ -1490,6 +1534,40 @@ class DocsSynchronizer:
sync_log.append(f"WARNING: Could not find source language dropdown '{source_dropdown_name}'")
continue
# STALE PR FIX: Also update the working directory's EN section if the file
# was found in reference docs.json but doesn't exist in current EN section.
# This happens when the translation branch is based on main but the PR's
# docs.json hasn't merged yet.
if reference_source_section != source_section:
# Find or create corresponding dropdown in working directory's EN section
wd_en_dropdown = None
wd_en_dropdowns = source_section.get("dropdowns", [])
if source_dropdown_index >= 0 and source_dropdown_index < len(wd_en_dropdowns):
wd_en_dropdown = wd_en_dropdowns[source_dropdown_index]
if not wd_en_dropdown:
# Dropdown doesn't exist in working directory - create it
wd_en_dropdown = {
"dropdown": source_dropdown.get("dropdown", ""),
"icon": source_dropdown.get("icon", "book-open"),
"pages": []
}
source_section.setdefault("dropdowns", [])
# Ensure we have enough slots
while len(source_section["dropdowns"]) <= source_dropdown_index:
source_section["dropdowns"].append(None)
source_section["dropdowns"][source_dropdown_index] = wd_en_dropdown
sync_log.append(f"INFO: Created EN dropdown '{wd_en_dropdown['dropdown']}' from PR")
if wd_en_dropdown:
# Add the page to EN section at the correct location
if "pages" not in wd_en_dropdown:
wd_en_dropdown["pages"] = []
added_to_en = self.add_page_at_location(wd_en_dropdown, source_file, file_location, source_dropdown)
if added_to_en:
sync_log.append(f"INFO: Added {source_file} to EN section (from PR's docs.json)")
# Add to each target language
for target_lang, target_section in target_sections.items():
target_file = self.convert_path_to_target_language(source_file, target_lang)
@@ -1548,6 +1626,14 @@ class DocsSynchronizer:
sync_log.append(f"INFO: Processing deletion of {source_file}")
# STALE PR FIX: Also remove from EN section if it exists there
# This is needed when starting from main's docs.json which may still have the file
for en_dropdown in source_section.get("dropdowns", []):
if "pages" in en_dropdown:
if self.remove_page_from_structure(en_dropdown["pages"], source_file):
sync_log.append(f"INFO: Removed {source_file} from EN section")
break
# Remove from each target language
for target_lang, target_section in target_sections.items():
target_file = self.convert_path_to_target_language(source_file, target_lang)


@@ -128,103 +128,155 @@ class TranslationPRManager:
)
return result.returncode == 0
def merge_docs_json_for_incremental_update(self) -> None:
def _merge_docs_json_from_main(self) -> None:
"""
Merge docs.json for incremental updates:
- Source language section from PR HEAD (latest structure)
- Translation sections from translation branch (preserve existing translations)
- Full structure from main (latest, includes changes from all merged PRs)
- Translation sections from translation branch (preserve our existing translations)
This fixes the issue where stale PRs would revert changes from PRs merged after them.
"""
print("Merging docs.json: Source language from PR, translations from translation branch...")
print("Merging docs.json: structure from main, translations from branch...")
# Get docs.json from PR HEAD (has latest source language structure)
result = self.run_git("show", f"{self.head_sha}:docs.json")
pr_docs = json.loads(result.stdout)
# Get docs.json from main (has latest structure including all merged PRs)
result = self.run_git("show", "origin/main:docs.json")
main_docs = json.loads(result.stdout)
# Get docs.json from translation branch (has target language translations)
# Get docs.json from translation branch (has our translations)
docs_json_path = self.repo_root / "docs.json"
with open(docs_json_path, 'r', encoding='utf-8') as f:
translation_docs = json.load(f)
branch_docs = json.load(f)
# Merge strategy: Replace source language section from PR, keep translations from translation branch
# Navigate to language sections
pr_navigation = pr_docs.get("navigation", {})
translation_navigation = translation_docs.get("navigation", {})
main_navigation = main_docs.get("navigation", {})
branch_navigation = branch_docs.get("navigation", {})
# Handle both direct languages and versions structure
if "versions" in pr_navigation:
pr_languages = pr_navigation["versions"][0].get("languages", [])
translation_languages = translation_navigation.get("versions", [{}])[0].get("languages", [])
if "versions" in main_navigation:
main_languages = main_navigation["versions"][0].get("languages", [])
branch_languages = branch_navigation.get("versions", [{}])[0].get("languages", [])
else:
pr_languages = pr_navigation.get("languages", [])
translation_languages = translation_navigation.get("languages", [])
main_languages = main_navigation.get("languages", [])
branch_languages = branch_navigation.get("languages", [])
# Build language lookup from translation branch
translation_langs_by_code = {}
for lang_data in translation_languages:
# Build lookup from translation branch
branch_langs_by_code = {}
for lang_data in branch_languages:
lang_code = lang_data.get("language")
if lang_code:
translation_langs_by_code[lang_code] = lang_data
branch_langs_by_code[lang_code] = lang_data
# Merge: Use source language from PR, translations from translation branch
# Merge strategy:
# - Use main's structure for everything (includes all recent changes)
# - For target languages, preserve our translations from branch
merged_languages = []
for pr_lang in pr_languages:
lang_code = pr_lang.get("language")
for main_lang in main_languages:
lang_code = main_lang.get("language")
if lang_code == self.source_language:
# Use source language section from PR (latest structure)
merged_languages.append(pr_lang)
elif lang_code in translation_langs_by_code:
# Use translations from translation branch (preserve existing translations)
merged_languages.append(translation_langs_by_code[lang_code])
if lang_code in self.target_languages and lang_code in branch_langs_by_code:
# For target languages, use branch's version (has our translations)
merged_languages.append(branch_langs_by_code[lang_code])
else:
# Fallback: use from PR
merged_languages.append(pr_lang)
# For source language and others, use main's version
merged_languages.append(main_lang)
# Update the merged docs.json
merged_docs = pr_docs.copy()
if "versions" in pr_navigation:
merged_docs = main_docs.copy()
if "versions" in main_navigation:
merged_docs["navigation"]["versions"][0]["languages"] = merged_languages
else:
merged_docs["navigation"]["languages"] = merged_languages
# Write merged docs.json preserving original formatting
# Use translation branch's docs.json as reference for format detection
success = save_json_with_preserved_format(
str(docs_json_path),
merged_docs,
reference_file=str(docs_json_path) # Current file from translation branch
reference_file=str(docs_json_path)
)
if success:
print(f"✓ Merged docs.json: Source language from PR {self.head_sha[:8]}, translations from {self.sync_branch}")
print(f"✓ Merged docs.json: structure from main, translations from {self.sync_branch}")
else:
print(f"⚠️ Warning: Could not preserve formatting, using default")
# Fallback to standard json.dump if format preservation fails
with open(docs_json_path, 'w', encoding='utf-8') as f:
json.dump(merged_docs, f, indent=2, ensure_ascii=False)
def _checkout_pr_changed_files(self, files_to_sync: List) -> None:
"""
Checkout only the specific source language files that the PR changed.
This prevents overwriting files from other PRs that were merged after this PR was created.
By only checking out the files this PR actually changed, we preserve the current state
of all other files (from main or translation branch).
"""
source_files = []
for file_info in files_to_sync:
file_path = file_info.get("path") if isinstance(file_info, dict) else file_info
# Only checkout source language files (not docs.json - handled separately)
if file_path.startswith(f"{self.source_dir}/") and file_path != "docs.json":
source_files.append(file_path)
if not source_files:
print("No source files to checkout from PR")
return
print(f"Checking out {len(source_files)} changed source files from PR {self.head_sha[:8]}...")
for file_path in source_files:
result = self.run_git("checkout", self.head_sha, "--", file_path, check=False)
if result.returncode == 0:
print(f"{file_path}")
else:
# File might be deleted in PR or not exist - that's OK
print(f" ⚠️ {file_path} (not in PR commit - may be deleted)")
def setup_translation_branch(self, branch_exists: bool) -> None:
"""Setup the translation branch (create or checkout existing)."""
"""
Setup the translation branch (create or checkout existing).
Key fix for stale PR issue:
- For NEW branches: Start from origin/main (not PR's working directory)
This ensures we have the latest state including changes from PRs merged after this PR was created.
- For EXISTING branches: Checkout branch, then merge main's docs.json structure
This preserves our translations while getting the latest navigation structure.
The actual PR files to translate are checked out selectively later in run_translation_from_sync_plan().
"""
# Store for later use in run_translation_from_sync_plan
self._branch_existed = branch_exists
if branch_exists:
print(f"✅ Fetching existing translation branch for incremental update: {self.sync_branch}")
self.run_git("fetch", "origin", f"{self.sync_branch}:{self.sync_branch}")
self.run_git("fetch", "origin", "main")
self.run_git("checkout", self.sync_branch)
# For incremental updates, checkout source language files only (not docs.json)
print(f"Checking out source language files from {self.head_sha[:8]}...")
self.run_git("checkout", self.head_sha, "--", f"{self.source_dir}/", check=False)
# Merge main's docs.json structure with our translations
# This ensures we have the latest navigation structure from main
# while preserving our existing translations
self._merge_docs_json_from_main()
# Merge docs.json: Source language from PR HEAD, translations from translation branch
self.merge_docs_json_for_incremental_update()
# Note: Source language files will be checked out selectively later
# in run_translation_from_sync_plan() based on what the PR actually changed
else:
print(f"🆕 Creating new translation branch: {self.sync_branch}")
self.run_git("checkout", "-b", self.sync_branch)
# Reset branch to main to avoid including source language file changes from PR
# Use --soft to keep working directory with PR files (needed for translation)
self.run_git("reset", "--soft", "origin/main")
# Unstage everything
self.run_git("reset")
# CRITICAL FIX: Start from origin/main, not from the PR's working directory
# This ensures we have the latest state including all changes from PRs
# that were merged after this PR was created.
#
# Old approach (buggy):
# checkout -b branch → reset --soft origin/main → reset
# This kept PR's working directory which could be stale
#
# New approach (fixed):
# checkout -b branch origin/main
# This starts fresh from main, then we selectively checkout PR's changed files
self.run_git("fetch", "origin", "main")
self.run_git("checkout", "-b", self.sync_branch, "origin/main")
# Note: Source language files will be checked out selectively later
# in run_translation_from_sync_plan() based on what the PR actually changed
async def run_translation(self) -> Dict:
"""Run the translation process using sync_and_translate logic."""
@@ -286,6 +338,11 @@ class TranslationPRManager:
base_sha = metadata.get("base_sha", self.base_sha)
head_sha = metadata.get("head_sha", self.head_sha)
# CRITICAL: Checkout only the files that this PR actually changed
# This is the key fix for the stale PR issue - we only bring in the PR's
# changed files, not its entire working directory state which could be outdated
self._checkout_pr_changed_files(files_to_sync)
# Detect added vs modified files and renames
added_files, modified_files, renamed_files = self.detect_file_changes(base_sha, head_sha)
@@ -500,12 +557,15 @@ class TranslationPRManager:
break
# Sync docs.json incrementally
# Pass head_sha as reference_sha so sync_docs_json_incremental can find
# new file positions in PR's docs.json (needed when main doesn't have them yet)
sync_log = synchronizer.sync_docs_json_incremental(
added_files=added_files,
deleted_files=deleted_files,
renamed_files=renamed_files,
base_sha=base_sha,
head_sha=head_sha
head_sha=head_sha,
reference_sha=head_sha
)
print("\n".join(sync_log))