Two round-trip puzzles, one good week
Sorry for all the delays in writing. Things are still happening, but slowly, even with going back to work. Thankfully even with reduced time it’s been really inspiring me to get more done on the project.
First, an announcement!
I’m so pleased to announce that, while my undub is not yet done… someone’s is!

Between a fellow MGS lover in my discord, Besofh, and my good colleague Green_Goblin, the USA version has been translated into Arabic!

I’m so blown away that other people have found uses for my tools. I think this is why the undub of mine hasn’t seen the light of day… having so many people who want to use my tools, it really gave me this feeling of ownership and responsibility. I still want to finish mine, but I also see how many people are looking at translating this game, and want to make sure they too can succeed in bringing this game to other languages.
Anyway, the undub is still coming, I promise. There’s a LOT of dialogue to go over in the Japanese version. I’m years into this project, but I’m not giving up. The translation is going to be my next big push, then I can look at finalizing texture work, credits, and maybe a bonus little easter egg ;)
But all in due time.
Now, onto Claude’s synopsis of a couple big milestones. There have been issues with zmovie on disc 2, mostly because I’m still working on disc 1. This should now be fixed. Further, I made some adjustments to the RADIO tools to fix something that has always bugged me; basically, if I want my tool to be as accurate to the original dev team as I can, I want it to be able to not only extract all dialogue, but insert it too. And there was one big thorn in my side needing to be pulled to get it to that 99% mark.
Read on, dear viewer, and you’ll see.
A short update with two long-tail bugs finally getting put to bed. Both were the kind of thing where the symptom looks like “the rebuild is slightly wrong somewhere,” you stare at hex dumps for a few days, and at the end you realize the original engineers had a very specific design in mind that none of our tooling was honoring. Neither story is glamorous. But they’re both the kind of fix that quietly unlocks a pile of downstream work.
The two topics:
- ZMOVIE.STR. The recompiler had been quietly corrupting the FMV container on every disc that has a multi-chunk movie entry. I think this was probably the cause of the disc-2 end-of-game crash people had been reporting.
- RADIO.DAT kinsoku flags. We finally figured out why the same Japanese character appeared in our tables under four different byte codes. (Spoiler: they weren’t duplicates. The original engineers were encoding line-wrap rules in two flag bits and we’d been throwing the bits away.)
Going to walk through both, and then close with what it took to get RADIO.DAT round-tripping byte-identical on USA and Integral disc 1.
Part 1: The ZMOVIE.STR recompile was eating graphics bytes
So, ZMOVIE.STR is the movie container. Each FMV has a little subtitle table tucked into the first sector of the entry, followed by graphics bytes (the actual tile data for any text that gets overlaid). On most discs each entry is one big chunk; on a couple of discs some entries span two chunks. That second case was where everything went sideways.
The old compile path hard-capped the subtitle area at 0x800 bytes per
block and then copied origSlice[0x800:] verbatim into the rest of the
entry. For a single-chunk movie that’s harmless. For a multi-chunk movie
it’s a disaster: the new subtitle terminator lands at a different offset
than the original, but we were splicing the original’s tail back on after
it. Anything past the new terminator was stale data referring to offsets
that no longer existed. That’s exactly the shape of bug that would
manifest as a graphics-area crash partway through a long FMV — i.e., the
end-of-game cutscene people kept reporting.
Five entries across the corpus had this problem: jpn-d1 zmovie-02,
jpn-d2 zmovie-00/01, integral-d1 zmovie-02, and integral-d2
zmovie-01.
The fix is structural. The recompiler now:
- Walks the original subtitle table to recover the original graphics bytes wholesale.
- Composes a payload of new_subtitle_table + original_graphics and threads it across block 0 [0x38:0x808], spilling into block 1 [0x28:0x808] only when needed.
- Zero-fills whichever block’s tail isn’t used, so there is no stale data left over from the previous build.
- Raises a hard error if the payload would exceed the combined 0x7D0 + 0x7E0 capacity, instead of silently truncating.
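To make the layout concrete, here is a minimal sketch of the steps above. It is not the recompiler's actual code: the function name, the constant names, and the optional block1 argument are all hypothetical, and it assumes each block slice is at least 0x808 bytes long and that bytes outside the two payload windows should be left alone.

```python
BLOCK_END = 0x808
BLOCK0_START = 0x38                      # payload window in block 0: [0x38:0x808]
BLOCK1_START = 0x28                      # payload window in block 1: [0x28:0x808]
BLOCK0_CAP = BLOCK_END - BLOCK0_START    # 0x7D0
BLOCK1_CAP = BLOCK_END - BLOCK1_START    # 0x7E0


def lay_out_payload(block0, block1, subtitle_table, graphics):
    """Thread subtitle_table + graphics across block 0, spilling into block 1.

    block1 may be None for a single-chunk entry (an assumption of this sketch).
    Returns the new (block0, block1) pair as bytes.
    """
    payload = bytes(subtitle_table) + bytes(graphics)
    capacity = BLOCK0_CAP + (BLOCK1_CAP if block1 is not None else 0)
    if len(payload) > capacity:
        # Hard error instead of silently truncating.
        raise ValueError(f"payload is {len(payload):#x} bytes, capacity is {capacity:#x}")
    if len(block0) < BLOCK_END or (block1 is not None and len(block1) < BLOCK_END):
        raise ValueError("blocks must be at least 0x808 bytes long")

    head, spill = payload[:BLOCK0_CAP], payload[BLOCK0_CAP:]

    out0 = bytearray(block0)
    # ljust pads the unused tail of the window with zeros, so no stale data survives.
    out0[BLOCK0_START:BLOCK_END] = head.ljust(BLOCK0_CAP, b"\x00")
    if block1 is None:
        return bytes(out0), None

    out1 = bytearray(block1)
    out1[BLOCK1_START:BLOCK_END] = spill.ljust(BLOCK1_CAP, b"\x00")
    return bytes(out0), bytes(out1)
```

The point of building one payload first and slicing it afterwards is that the spill boundary becomes a layout detail rather than something the subtitle encoder has to know about.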
While I was in there, I caught a second bug that had been hiding in
plain sight. The header field at byte 0x0E — which all of our tooling
called chunk_count and which the old code was using to decide how many
“subtitle continuation blocks” to read — is not the number of subtitle
blocks. It’s a CD-XA / stream field. I confirmed this by surveying all
six discs:
| Disc | 00 | 01 | 02 | 03 |
|---|---|---|---|---|
| jpn-d1 | 1 | 1 | 2 | 1 |
| jpn-d2 | 2 | 2 | 1 | 9 |
| usa-d1 | 1 | 1 | 1 | 1 |
| usa-d2 | 1 | 1 | 1 | 9 |
| integral-d1 | 1 | 1 | 2 | 1 |
| integral-d2 | 1 | 2 | 1 | 9 |
That 9 on every disc-2 zmovie-03 is not “nine continuation blocks.”
It’s a stream-control number, and the compile path now leaves it alone.
The bonus mystery: a phantom subtitle table on disc 2’s ending FMV
While checking that survey table, I noticed the disc-2 zmovie-03 entry
was decoding to obvious garbage:
"zmovie-03": [{ "01": "([8000]" }, { "01": "41975848,2107394" }]
A subtitle length of 41 megabytes. Right. What’s actually at block 0 of
that entry is a normal CD-XA stream sector — no subtitle table at all.
The mini-header at byte 0x32 reads 0x0280 instead of the expected
0x0010, and the “subtitle length” at 0x34 decodes to ~2 GB. The
movie just doesn’t have any overlay text.
To convince myself it really was the ending FMV and not something weird, I extracted the video stream out to MP4. It is in fact the disc-2 ending cinematic — 5:40, 320×160 at 15 fps with audio. No subtitles, by design, on every retail and Integral disc 2.
The fix is a sanity guard on both _extractEntrySubtitles and
movieSplitter.py: an entry only counts as having a subtitle table if
block0[0x32:0x34] == 0x0010 and block0[0x34:0x38] < 0x800. When
the guard fails, the entry gets omitted from the extracted JSON, the
per-entry .bin still gets written, and the injector passthrough leaves
the video/audio bit-for-bit identical. No more garbage entries in the
GUI’s subtitle editor, no more risk of the recompiler trying to rewrite
something that isn’t text.
End-of-game FMV is now safe to round-trip without touching its bytes.
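As a rough illustration, the guard can be written as a single predicate. This is a sketch, not the code in _extractEntrySubtitles or movieSplitter.py: the function name is made up, and reading both fields as little-endian is my assumption (it matches the PlayStation's byte order, but the post does not state it).

```python
import struct


def has_subtitle_table(block0: bytes) -> bool:
    """Return True only if block 0 looks like it starts a real subtitle table."""
    if len(block0) < 0x38:
        return False
    # Assumed little-endian: 2-byte mini-header at 0x32, 4-byte length at 0x34.
    (marker,) = struct.unpack_from("<H", block0, 0x32)
    (length,) = struct.unpack_from("<I", block0, 0x34)
    return marker == 0x0010 and length < 0x800
```

An entry that fails this check would simply be skipped by the subtitle extractor and passed through untouched by the injector.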
Part 2: The kinsoku mystery, finally explained
OK, switching files. This one I’ve wanted to write up for a while because it explains a thing that had been bugging me for months.
Our character table for RADIO.DAT had what looked like duplicate
entries. The glyph 。 (Japanese full stop) lived at 0x9003, in the kanji range. Fine.
But it also showed up at 0xB003, 0xD003, and 0xF003, and our old
decoder had been routing those to a totally separate “punctuation” table
that happened to mostly agree with kanji but disagreed in a few
places. The decode looked right on screen but the re-encode never
produced the original bytes back.
I had assumed this was OCR sloppiness in the table-building. It wasn’t.
What the renderer is actually doing
The font code in the Integral decomp reads a 2-byte code, then before doing the glyph lookup it strips two flag bits:
#define TOP_KINSOKU_MASK 0x4000
#define BACK_KINSOKU_MASK 0x2000
short character_mask = 0x9FFF;
next_mdata = mdata & character_mask;
That’s it. That’s the whole mystery. The four codes for 。 are the
same glyph with different line-wrap behavior:
| Raw code | Flags | Meaning |
|---|---|---|
| 0x9003 | none | bare glyph |
| 0xB003 | BACK only | should not end a wrapped line |
| 0xD003 | TOP only | should not start a wrapped line |
| 0xF003 | both | both rules apply |
This is kinsoku shori (禁則処理) — the Japanese typesetting rules
that say things like “a full stop should never be the first character on
a new line.” The high nibble pattern is the giveaway: 0x9? =
unflagged, 0xB? = BACK only, 0xD? = TOP only, 0xF? = both. Same
trick applies across the radioChar / hiragana / katakana ranges via the
0x80↔0xC0 / 0x81↔0xC1 / 0x82↔0xC2 pairings.
The fix: separate glyph from flags at both ends
Decoder side, read the raw 16-bit code, mask off 0x6000 for flags,
mask 0x9FFF for the glyph, look up the glyph, then re-emit the flags
as visible sentinels: ‹BK› and ‹TK›. (We picked single guillemets
because they’re visible to a human translator and a regex can find
them. Zero-width invisibles would have been cleaner for display but
worse for editing.)
Encoder side, encode the character first, peek for trailing sentinels,
then OR the flag bits back into the high byte of whatever was just
emitted. Since 0xD0 = 0x90 | 0x40, the punctuation path can safely
emit the unflagged 0x90 prefix and rely on the sentinel-driven OR to
raise it to 0xD0 only when the original had a TOP flag — instead of
always emitting 0xD0 and losing the information about which characters
were truly flagged.
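Here is a minimal sketch of that split, working on whole 16-bit codes rather than on the high byte. The two mask values and the sentinels come from the text above; the one-entry glyph table, the function names, and the order in which the sentinels are written out are hypothetical choices for illustration only.

```python
TOP_KINSOKU_MASK = 0x4000
BACK_KINSOKU_MASK = 0x2000
GLYPH_MASK = 0x9FFF
BK, TK = "‹BK›", "‹TK›"

# Hypothetical one-entry glyph table; the real tables are far larger.
GLYPHS = {0x9003: "。"}
REV_GLYPHS = {glyph: code for code, glyph in GLYPHS.items()}


def decode_code(code):
    """Look up the bare glyph, then re-emit any flags as visible sentinels."""
    text = GLYPHS[code & GLYPH_MASK]
    if code & BACK_KINSOKU_MASK:
        text += BK
    if code & TOP_KINSOKU_MASK:
        text += TK
    return text


def encode_text(text):
    """Encode each glyph unflagged, then OR in flags from trailing sentinels."""
    codes = []
    i = 0
    while i < len(text):
        code = REV_GLYPHS[text[i]]
        i += 1
        # Peek for sentinels that follow the character just encoded.
        while True:
            if text.startswith(BK, i):
                code |= BACK_KINSOKU_MASK
                i += len(BK)
            elif text.startswith(TK, i):
                code |= TOP_KINSOKU_MASK
                i += len(TK)
            else:
                break
        codes.append(code)
    return codes
```

With this shape, a translator who deletes a sentinel changes only the wrap behavior of that one glyph, and a string with its sentinels left alone encodes back to the code it came from.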
The flag-bit recovery was dramatic. Looking at RADIO.DAT from
jpn-d1:
| Byte class | Before fix | After fix | Original |
|---|---|---|---|
| 0xC0 (radioChar + TOP) | 7,690 | 13,380 | 13,609 |
| 0xC1 (hiragana + TOP) | 1,897 | 7,955 | 6,720 |
| 0xC2 (katakana + TOP) | 874 | 3,596 | 3,889 |
| 0xB0 (kanji + BACK) | 9,621 | 10,076 | 10,358 |
| 0xD0 (kanji + TOP) | 42,432 | 41,714 | 44,329 |
| 0xF0 (kanji + BOTH) | 5,315 | 5,315 | 5,209 |
The TOP_KINSOKU flags that had been slowly eroding from radio/hiragana/katakana glyphs were largely recovered, and the counts swung back toward the original distribution.
And then four more bugs fell out
Once the flag drift was gone, four pre-existing encoder bugs became the dominant source of diff. The brief version:
- revSpanish was checked before revKanji. Characters that exist in both (like … → Spanish prefix 0x1F4A vs kanji ellipsis 0x9017) got the wrong path. With the new mask logic this was loud: applying 0x40 (TOP) on top of 0x1F produces 0x5F, turning a Japanese ellipsis into the literal ASCII bytes _J. Pure reorder fix: a one-line swap that put kanji ahead of Spanish in the lookup chain.
- Per-call custom-character dictionaries were silently bypassed. Many radio calls embed a small custom-char dictionary at the head of the call and reference its entries with 0x96 / 0x97 / 0x98 escapes. When a character lived in both the global kanji table and the per-call dict, the encoder emitted the global kanji form where the original had used a custom-char escape. All subsequent indices into that dict then drifted. Fix: split the custom-character path into two priority tiers: a high-priority check that fires before any global table (“is this character already in this call’s dict?”), and a low-priority fallback that only fires if no global table matched.
- Encoder slot thresholds off by one. The decoder reads 0x96 N → dict index N, 0x97 N → N + 255, 0x98 N → N + 510. But the encoder was switching prefixes at index > 254 and > 508. The game reserves 0x97 00, and the encoder was emitting it for index 255, which the decoder reads as 256. Every tile from the 256th slot onward drifted by one. The fix is to line the encoder’s thresholds up at > 255 and > 510 (offsets −255/−510).
- The recompiler was passing the encoder an empty callDict. This one stung. After the threshold fix, the per-subtitle encoder test jumped to 99.99%, but the file recompile didn’t change at all. Same hash. Same 147k-byte diff. Turned out RadioDatRecompiler.main() was missing global currentCallDict, so the assignment was creating a local variable, and every subtitle’s encoder call was receiving the module default of ''. The whole “use existing custom-char slot first” priority I’d built the day before was firing only in tests and never in production. A four-token Python fix unblocked the entire file rebuild. That kind of multiplicative-zero bug is the worst to debug: every individual test passes, every individual unit looks right, and the file output is just stuck.
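The slot-threshold fix from the list above is easy to check in isolation. The sketch below uses the corrected thresholds and offsets as described; the function names are hypothetical, and treating 765 (510 + 255) as the largest encodable index is my own inference from the one-byte N, not something stated in the tools.

```python
SLOT_OFFSETS = {0x96: 0, 0x97: 255, 0x98: 510}


def encode_slot(index):
    """Map a custom-char dict index to a (prefix, N) escape pair."""
    if index > 510:
        prefix = 0x98
    elif index > 255:
        prefix = 0x97
    else:
        prefix = 0x96
    n = index - SLOT_OFFSETS[prefix]
    if not 0 <= n <= 0xFF:
        raise ValueError(f"index {index} does not fit in a one-byte slot")
    return prefix, n


def decode_slot(prefix, n):
    """Inverse mapping, matching the decoder rules quoted above."""
    return n + SLOT_OFFSETS[prefix]
```

With these thresholds, index 255 stays on the 0x96 prefix and index 256 is the first one to use 0x97, so the reserved 0x97 00 pair is never produced.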
The result
Per-subtitle round-trip on jpn-d1/MGS/RADIO.DAT (13,426 SUBTITLE
elements), stage by stage:
| Stage | Exact round-trip |
|---|---|
| Original (no kinsoku) | 0 / 13,426 |
| + kinsoku flag handling | 1,102 / 13,426 (8.2%) |
| + Spanish reorder | 1,633 / 13,426 (12.2%) |
| + custom-slot prioritization | 8,366 / 13,426 (62.3%) |
| + furigana fullwidth (# / .) | 9,394 / 13,426 (70.0%) |
| + duplicate-kanji disambiguation | 9,705 / 13,426 (72.3%) |
| + encoder thresholds + global propagation | 13,425 / 13,426 (99.99%) |
| ↳ clean subs (no custom-char escape) | 1,593 / 1,593 (100.0%) |
| ↳ dirty subs (with custom-char escapes) | 11,832 / 11,833 (99.99%) |
The one remaining byte on jpn-d1 is at offset 0x6a999: original
0x96 12, rebuild 0x96 11. Both indices land in the same call’s
custom-char dict, and the gap is a find() collision in the corpus
glyph tables — two adjacent dict slots both carry tiles that
graphicsData labels the same way, so the encoder matches the first
one. The in-game render is visually identical. It’s an OCR-data
artifact, not an encoder bug.
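The collision is the ordinary first-match behavior of a lookup. A toy version, with an invented list of slot labels and a hypothetical helper name, and reading the 12 and 11 above as hex byte values (my assumption):

```python
def first_slot_with_label(slot_labels, label):
    """Return the index of the first slot whose label matches."""
    return slot_labels.index(label)


# Hypothetical per-call dict: slots 0x11 and 0x12 carry tiles with the same label.
slot_labels = [f"tile{i:02x}" for i in range(0x11)] + ["same", "same"]

original_index = 0x12                                        # what the original bytes used
rebuilt_index = first_slot_with_label(slot_labels, "same")   # first match wins
```

Because both slots render the same tile, nothing in the label data can tell the encoder which of the two the original author picked.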
Cross-disc validation
The whole point of moving from per-disc overrides to structural rules
was that the fixes should hold on every disc, not just jpn-d1. So I
ran the same pipeline against USA disc 1 and MGS Integral disc 1, and
two more double-length-ASCII glyphs immediately surfaced:
- USA disc 1 had 1,006 instances of 0x80 22: opening and closing fullwidth quotation marks around quoted dialogue (said "The graveyard is full…").
- Integral disc 1 had 8 instances of 0xc0 2d: a TOP_KINSOKU hyphen in PSG-1 weapon callouts, so the dash never strands to the start of a wrapped line.
Same shape of fix as the jpn-d1 furigana case: change
radioChar['22'] from " to ＂ (U+FF02) and radioChar['2d'] from - to －
(U+FF0D). The decoder emits fullwidth, the encoder catches
it via revRadio, the double-length form survives. English
translations that happen to contain ASCII quotes or hyphens are
unaffected.
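A small sketch of why the fullwidth mapping keeps the two forms apart. The radioChar keys and the 0x80 prefix come from the text above; the shape of revRadio, the encode_char helper, and the single-byte ASCII fallback are assumptions made for the example, and kinsoku flags are ignored here.

```python
# Two entries only, after the fix: fullwidth forms instead of ASCII.
radioChar = {"22": "\uFF02", "2d": "\uFF0D"}

# Assumed reverse table: fullwidth character -> double-length byte pair.
revRadio = {ch: bytes([0x80, int(key, 16)]) for key, ch in radioChar.items()}


def encode_char(ch):
    """Encode one character, preferring the double-length radioChar form."""
    if ch in revRadio:
        return revRadio[ch]
    if ord(ch) < 0x80:
        return bytes([ord(ch)])   # assumed: plain ASCII stays a single byte
    raise KeyError(ch)
```

A fullwidth ＂ encodes to the two-byte 0x80 22 form, while an ASCII " typed by a translator takes the single-byte path and never collides with it.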
Final cross-disc results:
| Disc | Original size | Rebuilt size | Diff bytes | Status |
|---|---|---|---|---|
| jpn-d1 | 2,283,386 | 2,283,386 | 1 | OCR artifact in graphicsData |
| usa-d1 | 1,776,851 | 1,776,851 | 0 | ★ byte-identical |
| integral-d1 | 11,198,464 | 11,198,464 | 0 | ★ byte-identical |
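For reference, the size and diff columns can be produced with a plain positional comparison. This is a sketch with a hypothetical function name, not the script I actually run; it counts any length difference as differing bytes.

```python
def diff_report(original: bytes, rebuilt: bytes):
    """Return (original size, rebuilt size, number of differing bytes)."""
    differing = sum(a != b for a, b in zip(original, rebuilt))
    differing += abs(len(original) - len(rebuilt))
    return len(original), len(rebuilt), differing
```

A positional count like this is strict on purpose: a single byte that drifts early in a file makes nearly every later position compare unequal, so the number can look far worse than the underlying fault.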
The Integral diff was particularly satisfying. At the point where
jpn-d1 was down to 1 byte of diff, Integral was rebuilding at the
exact correct size but with 139,773 bytes differing. That sounded
structural — alignment, padding, something special to Integral’s layout
— and I was nervous about it. But the per-subtitle test was already at
63,138 / 63,146 (99.99%), and only 8 dirty subtitles were failing. The
8 failures all contained one byte sequence (0xc0 2d) that drifted on
re-encode, and every byte downstream of each one shifted by a step,
multiplying the apparent file-level diff by a factor of ~17,500. One
radioChar entry, eight subtitles, and an entire 11 MB file falls into
place.
Why both of these matter, together
The two stories rhyme. In both cases we’d been throwing away information that the original engineers had deliberately encoded into the format, because we hadn’t yet understood it was meaningful:
- For ZMOVIE.STR, the “stale graphics” came from treating block 0 and block 1 as independent containers when they’re actually one logical payload with a spill boundary.
- For RADIO.DAT, the four-codes-per-glyph “duplication” came from treating two flag bits as noise, when they were the whole point of Japanese typesetting rules.
Both of these were also load-bearing for translation work. The ZMOVIE
fix means the recompiler can now safely round-trip the disc-2 ending FMV
(by not touching it), and multi-chunk subtitle inserts on jpn-d1
zmovie-02 will actually fit in the available space without overwriting
graphics. The kinsoku/lookup-order/threshold fixes mean USA disc 1 and
Integral disc 1 both round-trip a fresh extract of RADIO.DAT to a
byte-identical rebuild — which is the strongest test I know how to
write for a recompiler. If the bytes match, the engine cannot tell our
output from the original.
What’s still open
A few honest caveats:
- JPN encoder fidelity for the ZMOVIE corpus. The kinsoku/bank-3 work fixed encoder fidelity for RADIO.DAT. ZMOVIE still surfaces a few glyphs that round-trip to slightly different code points than the original (e.g. 0x9601 vs 0x9c01). Symptom: round-trip text comes back kanji-jumbled even when structurally correct. USA-d1 ZMOVIE round-trips cleanly, which is the proof the structural ZMOVIE fix is sound, but the JPN ZMOVIE corpus needs another encoder pass.
- zMovieTextInjector.py (the CLI) is still on the old code path. Native-endian struct.pack("I", …) instead of <I, a dead injectSubtitles(), and the same chunk-blind block layout the GUI used to have. The GUI doesn’t touch this script. I’ll either delete it or port the fix; for now, use the GUI.
- Cleanup follow-ups in the kinsoku work. The legacy punctuation dict, the revPunct table, the 0xC0 / 0xC1 / 0xC2 / 0xB0 / 0xD0 decoder branches, and the unreachable addCharToDict helper are all functionally dead now. Removing them is a clean follow-up; I held off so the round-trip diffs stayed focused.
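On the struct.pack point in the CLI caveat, a quick illustration of the difference. The value is arbitrary; the only claim is about Python's struct module, where a format with no prefix uses the host's native byte order and "<" forces little-endian.

```python
import struct

value = 0x00000120

native = struct.pack("I", value)    # byte order depends on the machine running the tool
little = struct.pack("<I", value)   # always little-endian, 4 bytes
big = struct.pack(">I", value)      # always big-endian, shown for contrast
```

On a typical x86 or ARM desktop the native and little-endian results happen to match, which is why this kind of bug can sit unnoticed; spelling out <I removes the dependence on the host.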
A good week, all in all. Two long-standing mysteries closed, two discs flipped from “close enough” to byte-identical, and the recompiler is no longer at risk of eating the end of the game.
The Last Word
I’m finding my footing back at work. I’m sorry to say I lean more heavily on Claude to solve some of my issues, and I do miss the fun time I’ve had coding, but as we get deeper into the project, I want to make a lot more meaningful progress.
To that end, I’ll be pushing far further ahead in translation, and working on some semi-final changes to the GUI version. The reason the GUI progressed as far as it has is because aligning RADIO dialogue really needed a way to simultaneously edit both VOX and RADIO, and it made it much simpler to have a unified tool.
As far as I can tell, it’s now working, and splitting dialogue into multiple individual subtitles works too, which means I can arrange the subtitles almost any way I want so that they match the dialogue.
So, look forward to more updates coming soon!
J-Rush
