Sorry for all the delays in writing. Things are still happening, but slowly, even with going back to work. Thankfully even with reduced time it’s been really inspiring me to get more done on the project.

First, an announcement!

I’m so pleased to announce that, while my undub is not yet done… someone’s is!

Between a fellow MGS lover in my Discord, Besofh, and my good colleague Green_Goblin, the USA version has been translated into Arabic!

I’m so blown away that other people have found uses for my tools. I think this is why the undub of mine hasn’t seen the light of day… having so many people who want to use my tools, it really gave me this feeling of ownership and responsibility. I still want to finish mine, but I also see how many people are looking at translating this game, and want to make sure they too can succeed in bringing this game to other languages.

Anyway, the undub is still coming, I promise. There’s a LOT of dialogue to go over in the Japanese version. I’m years into this project, but I’m not giving up. The translation is going to be my next big push, then I can look at finalizing texture work, credits, and maybe a bonus little easter egg ;)

But all in due time.

Now, onto Claude’s synopsis of a couple of big milestones. There have been issues with zmovie on disc 2, mostly because I’m still working on disc 1. This should now be fixed. Further, I made some adjustments to the RADIO tools to deal with something that has always bugged me; basically, if I want my tool to be as faithful to the original dev team’s work as I can make it, I want it to be able to not only extract all dialogue, but insert it too. And there was one big thorn in my side needing to be pulled to get it to that 99% mark.

Read on, dear viewer, and you’ll see.

Two round-trip puzzles, one good week

A short update with two long-tail bugs finally getting put to bed. Both were the kind of thing where the symptom looks like “the rebuild is slightly wrong somewhere,” you stare at hex dumps for a few days, and at the end you realize the original engineers had a very specific design in mind that none of our tooling was honoring. Neither story is glamorous. But they’re both the kind of fix that quietly unlocks a pile of downstream work.

The two topics:

  1. ZMOVIE.STR. The recompiler had been quietly corrupting the FMV container on every disc that has a multi-chunk movie entry. I think this was probably the cause of the disc-2 end-of-game crash people had been reporting.
  2. RADIO.DAT kinsoku flags. We finally figured out why the same Japanese character appeared in our tables under four different byte codes. (Spoiler: they weren’t duplicates. The original engineers were encoding line-wrap rules in two flag bits and we’d been throwing the bits away.)

Going to walk through both, and then close with what it took to get RADIO.DAT round-tripping byte-identical on USA and Integral disc 1.


Part 1: The ZMOVIE.STR recompile was eating graphics bytes

So, ZMOVIE.STR is the movie container. Each FMV has a little subtitle table tucked into the first sector of the entry, followed by graphics bytes (the actual tile data for any text that gets overlaid). On most discs each entry is one big chunk; on a couple of discs some entries span two chunks. That second case was where everything went sideways.

The old compile path hard-capped the subtitle area at 0x800 bytes per block and then copied origSlice[0x800:] verbatim into the rest of the entry. For a single-chunk movie that’s harmless. For a multi-chunk movie it’s a disaster: the new subtitle terminator lands at a different offset than the original, but we were splicing the original’s tail back on after it. Anything past the new terminator was stale data referring to offsets that no longer existed. That’s exactly the shape of bug that would manifest as a graphics-area crash partway through a long FMV — i.e., the end-of-game cutscene people kept reporting.

Five entries across the corpus had this problem: jpn-d1 zmovie-02, jpn-d2 zmovie-00/01, integral-d1 zmovie-02, and integral-d2 zmovie-01.

The fix is structural. The recompiler now:

  • Walks the original subtitle table to recover the original graphics bytes wholesale.
  • Composes a payload of new_subtitle_table + original_graphics and threads it across block 0 [0x38:0x808], spilling into block 1 [0x28:0x808] only when needed.
  • Zero-fills whichever block’s tail isn’t used, so there is no stale data left over from the previous build.
  • Raises a hard error if the payload would exceed the combined 0x7D0 + 0x7E0 capacity, instead of silently truncating.
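
The four steps above can be sketched roughly like this. It is a minimal illustration, not the recompiler’s actual code: the name thread_payload, the idea of passing block 1 as None for a single-chunk entry, and the assumption that each block buffer is at least 0x808 bytes long are all mine. Only the slice bounds and the capacity check come from the description above.

```python
from typing import Optional, Tuple

BLOCK0_START = 0x38                     # payload area in block 0 is [0x38:0x808]
BLOCK1_START = 0x28                     # payload area in block 1 is [0x28:0x808]
BLOCK_END = 0x808
BLOCK0_CAP = BLOCK_END - BLOCK0_START   # 0x7D0
BLOCK1_CAP = BLOCK_END - BLOCK1_START   # 0x7E0


def thread_payload(block0: bytes, block1: Optional[bytes],
                   subtitle_table: bytes,
                   graphics: bytes) -> Tuple[bytes, Optional[bytes]]:
    """Hypothetical helper: lay new table + original graphics across the blocks."""
    payload = subtitle_table + graphics
    capacity = BLOCK0_CAP + (BLOCK1_CAP if block1 is not None else 0)
    if len(payload) > capacity:
        # Hard error instead of silent truncation.
        raise ValueError(
            f"payload is {len(payload):#x} bytes, capacity is {capacity:#x}")

    # Fill block 0 first; ljust zero-pads whatever part of the area is unused.
    head = payload[:BLOCK0_CAP].ljust(BLOCK0_CAP, b"\x00")
    new0 = block0[:BLOCK0_START] + head + block0[BLOCK_END:]
    if block1 is None:
        return new0, None

    # Anything that did not fit spills into block 1, again zero-padded.
    tail = payload[BLOCK0_CAP:].ljust(BLOCK1_CAP, b"\x00")
    new1 = block1[:BLOCK1_START] + tail + block1[BLOCK_END:]
    return new0, new1
```

Because both payload areas are rebuilt from scratch and padded with zeros, nothing from a previous build can survive past the new terminator.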

While I was in there, I caught a second bug that had been hiding in plain sight. The header field at byte 0x0E — which all of our tooling called chunk_count and which the old code was using to decide how many “subtitle continuation blocks” to read — is not the number of subtitle blocks. It’s a CD-XA / stream field. I confirmed this by surveying all six discs:

Disc          zmovie-00  zmovie-01  zmovie-02  zmovie-03
jpn-d1        1          1          2          1
jpn-d2        2          2          1          9
usa-d1        1          1          1          1
usa-d2        1          1          1          9
integral-d1   1          1          2          1
integral-d2   1          2          1          9

That 9 on every disc-2 zmovie-03 is not “nine continuation blocks.” It’s a stream-control number, and the compile path now leaves it alone.

The bonus mystery: a phantom subtitle table on disc 2’s ending FMV

While checking that survey table, I noticed the disc-2 zmovie-03 entry was decoding to obvious garbage:

"zmovie-03": [{ "01": "([8000]" }, { "01": "41975848,2107394" }]

A subtitle length of 41 megabytes. Right. What’s actually at block 0 of that entry is a normal CD-XA stream sector — no subtitle table at all. The mini-header at byte 0x32 reads 0x0280 instead of the expected 0x0010, and the “subtitle length” at 0x34 decodes to ~2 GB. The movie just doesn’t have any overlay text.

To convince myself it really was the ending FMV and not something weird, I extracted the video stream out to MP4. It is in fact the disc-2 ending cinematic — 5:40, 320×160 at 15 fps with audio. No subtitles, by design, on every retail and Integral disc 2.

The fix is a sanity guard on both _extractEntrySubtitles and movieSplitter.py: an entry only counts as having a subtitle table if block0[0x32:0x34] == 0x0010 and block0[0x34:0x38] < 0x800. When the guard fails, the entry gets omitted from the extracted JSON, the per-entry .bin still gets written, and the injector passthrough leaves the video/audio bit-for-bit identical. No more garbage entries in the GUI’s subtitle editor, no more risk of the recompiler trying to rewrite something that isn’t text.
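
As a rough sketch, the guard amounts to something like the following. The function name has_subtitle_table is made up for illustration, and reading both fields as little-endian is my assumption here; the offsets and the two expected values are the ones described above.

```python
import struct


def has_subtitle_table(block0: bytes) -> bool:
    """Hypothetical check: does block 0 start with a plausible subtitle table?"""
    if len(block0) < 0x38:
        return False
    # Assumption: both fields are stored little-endian.
    (marker,) = struct.unpack_from("<H", block0, 0x32)   # bytes 0x32..0x33
    (length,) = struct.unpack_from("<I", block0, 0x34)   # bytes 0x34..0x37
    return marker == 0x0010 and length < 0x800
```

An entry that fails this check is simply passed through untouched, which is what keeps the ending FMV bit-for-bit identical.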

End-of-game FMV is now safe to round-trip without touching its bytes.


Part 2: The kinsoku mystery, finally explained

OK, switching files. This one I’ve wanted to write up for a while because it explains a thing that had been bugging me for months.

Our character table for RADIO.DAT had what looked like duplicate entries. The kanji 。 (Japanese full stop) lived at 0x9003. Fine. But it also showed up at 0xB003, 0xD003, and 0xF003, and our old decoder had been routing those to a totally separate “punctuation” table that happened to mostly agree with kanji but disagreed in a few places. The decode looked right on screen but the re-encode never produced the original bytes back.

I had assumed this was OCR sloppiness in the table-building. It wasn’t.

What the renderer is actually doing

The font code in the Integral decomp reads a 2-byte code, then before doing the glyph lookup it strips two flag bits:

#define TOP_KINSOKU_MASK    0x4000
#define BACK_KINSOKU_MASK   0x2000
short character_mask = 0x9FFF;
next_mdata = mdata & character_mask;

That’s it. That’s the whole mystery. The four codes for 。 are the same glyph with different line-wrap behavior:

Raw code   Flags       Meaning
0x9003     none        bare glyph
0xB003     BACK only   should not end a wrapped line
0xD003     TOP only    should not start a wrapped line
0xF003     both        both rules apply

This is kinsoku shori (禁則処理) — the Japanese typesetting rules that say things like “a full stop should never be the first character on a new line.” The high nibble pattern is the giveaway: 0x9? = unflagged, 0xB? = BACK only, 0xD? = TOP only, 0xF? = both. Same trick applies across the radioChar / hiragana / katakana ranges via the 0x80↔0xC0 / 0x81↔0xC1 / 0x82↔0xC2 pairings.
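
To make the bit arithmetic concrete, here is a small sketch using the three mask values from the decomp snippet above. The helper name split_code is mine, not something from the decomp or the tools.

```python
TOP_KINSOKU_MASK = 0x4000    # "must not start a wrapped line"
BACK_KINSOKU_MASK = 0x2000   # "must not end a wrapped line"
CHARACTER_MASK = 0x9FFF      # everything except the two flag bits


def split_code(raw: int):
    """Hypothetical helper: separate a raw 16-bit code into glyph and flags."""
    glyph = raw & CHARACTER_MASK
    top = bool(raw & TOP_KINSOKU_MASK)
    back = bool(raw & BACK_KINSOKU_MASK)
    return glyph, top, back


# All four full-stop codes collapse onto the same glyph, 0x9003.
for raw in (0x9003, 0xB003, 0xD003, 0xF003):
    glyph, top, back = split_code(raw)
    print(f"{raw:#06x} -> glyph {glyph:#06x}, top={top}, back={back}")
```

The same function maps a 0xC0-prefixed code back onto its 0x80 counterpart with the TOP flag set, which is the pairing described above.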

The fix: separate glyph from flags at both ends

Decoder side, read the raw 16-bit code, mask off 0x6000 for flags, mask 0x9FFF for the glyph, look up the glyph, then re-emit the flags as visible sentinels: ‹BK› and ‹TK›. (We picked single guillemets because they’re visible to a human translator and a regex can find them. Zero-width invisibles would have been cleaner for display but worse for editing.)

Encoder side, encode the character first, peek for trailing sentinels, then OR the flag bits back into the high byte of whatever was just emitted. Since 0xD0 = 0x90 | 0x40, the punctuation path can safely emit the unflagged 0x90 prefix and rely on the sentinel-driven OR to raise it to 0xD0 only when the original had a TOP flag — instead of always emitting 0xD0 and losing the information about which characters were truly flagged.
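
A toy version of that decode/encode pair might look like the sketch below. Everything here is simplified for illustration: the two-entry GLYPHS table stands in for the real character tables, the prefix-byte-first (big-endian) read and the ‹BK›-before-‹TK› ordering are my assumptions, and the real tools handle many more code ranges than plain 2-byte glyphs.

```python
TOP, BACK, GLYPH_MASK = 0x4000, 0x2000, 0x9FFF
TK, BK = "‹TK›", "‹BK›"

# Toy stand-in for the real tables: unflagged code -> character.
GLYPHS = {0x9003: "。", 0x9017: "…"}
REVERSE = {ch: code for code, ch in GLYPHS.items()}


def decode(data: bytes) -> str:
    """Look up the glyph with flags stripped, then re-emit flags as sentinels."""
    out = []
    for i in range(0, len(data), 2):
        raw = int.from_bytes(data[i:i + 2], "big")  # assumption: prefix byte first
        out.append(GLYPHS[raw & GLYPH_MASK])
        if raw & BACK:
            out.append(BK)
        if raw & TOP:
            out.append(TK)
    return "".join(out)


def encode(text: str) -> bytes:
    """Encode each character unflagged, then OR in any trailing sentinels."""
    out = bytearray()
    i = 0
    while i < len(text):
        code = REVERSE[text[i]]
        i += 1
        while True:  # peek for sentinels that follow this character
            if text.startswith(BK, i):
                code |= BACK
                i += len(BK)
            elif text.startswith(TK, i):
                code |= TOP
                i += len(TK)
            else:
                break
        out += code.to_bytes(2, "big")
    return bytes(out)
```

With this shape, encode(decode(raw)) returns the original bytes for all four flag combinations, because the flags travel through the text as sentinels instead of being baked into separate table entries.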

The flag-bit recovery was dramatic. Looking at RADIO.DAT from jpn-d1:

Byte class               Before fix   After fix   Original
0xC0 (radioChar + TOP)   7,690        13,380      13,609
0xC1 (hiragana + TOP)    1,897        7,955       6,720
0xC2 (katakana + TOP)    874          3,596       3,889
0xB0 (kanji + BACK)      9,621        10,076      10,358
0xD0 (kanji + TOP)       42,432       41,714      44,329
0xF0 (kanji + BOTH)      5,315        5,315       5,209

What was previously slow erosion of every TOP_KINSOKU flag on radio/hiragana/katakana glyphs got recovered, and the counts swung back toward the original distribution.

And then four more bugs fell out

Once the flag drift was gone, four pre-existing encoder bugs became the dominant source of diff. The brief version:

  1. revSpanish was checked before revKanji. Characters that exist in both (like … → Spanish prefix 0x1F4A vs kanji ellipsis 0x9017) got the wrong path. With the new mask logic this was loud: applying 0x40 (TOP) on top of 0x1F produces 0x5F, turning a Japanese ellipsis into the literal ASCII bytes _J. Pure reorder fix: a one-line swap that put kanji ahead of Spanish in the lookup chain.

  2. Per-call custom-character dictionaries were silently bypassed. Many radio calls embed a small custom-char dictionary at the head of the call and reference its entries with 0x96 / 0x97 / 0x98 escapes. When a character lived in both the global kanji table and the per-call dict, the encoder emitted the global kanji form where the original had used a custom-char escape. All subsequent indices into that dict then drifted. Fix: split the custom-character path into two priority tiers — a high-priority check that fires before any global table (“is this character already in this call’s dict?”), and a low-priority fallback that only fires if no global table matched.

  3. Encoder slot thresholds off by one. The decoder reads 0x96 N → dict index N, 0x97 NN + 255, 0x98 NN + 510. But the encoder was switching prefixes at index > 254 and > 508. The game reserves 0x97 00, and the encoder was emitting it for index 255 — which the decoder reads as 256. Every tile from the 256th slot onward drifted by one. The fix is to line the encoder’s thresholds up at > 255 and > 510 (offsets −255 / −510).

  4. The recompiler was passing the encoder an empty callDict. This one stung. After the threshold fix, the per-subtitle encoder test jumped to 99.99% — but the file recompile didn’t change at all. Same hash. Same 147k-byte diff. Turned out RadioDatRecompiler.main() was missing global currentCallDict, so the assignment was creating a local variable, and every subtitle’s encoder call was receiving the module default of ''. The whole “use existing custom-char slot first” priority I’d built the day before was firing only in tests and never in production. A four-token Python fix unblocked the entire file rebuild. That kind of multiplicative-zero bug is the worst to debug: every individual test passes, every individual unit looks right, and the file output is just stuck.
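
Bugs 3 and 4 are both small enough to show in a few lines. The sketch below is illustrative only: encode_slot, decode_slot, encode_subtitle, main_buggy and main_fixed are hypothetical names, and the first half shows just the corrected thresholds as described in item 3, not the real encoder. Only currentCallDict and its '' default are taken from the text.

```python
# --- Item 3: slot prefixes with the corrected thresholds -------------------
def encode_slot(index: int) -> bytes:
    """Custom-char dict index -> 2-byte escape (thresholds > 255 and > 510)."""
    if index > 510:
        return bytes([0x98, index - 510])
    if index > 255:
        return bytes([0x97, index - 255])
    return bytes([0x96, index])


def decode_slot(pair: bytes) -> int:
    """2-byte escape -> dict index: 0x96 N, 0x97 N + 255, 0x98 N + 510."""
    prefix, n = pair
    return n + {0x96: 0, 0x97: 255, 0x98: 510}[prefix]


# --- Item 4: the missing `global` statement --------------------------------
currentCallDict = ""   # module-level default, as in the story above


def encode_subtitle() -> str:
    # Stand-in for the encoder, which reads the module-level value.
    return currentCallDict


def main_buggy(call_dict: str) -> str:
    currentCallDict = call_dict   # no `global`: this only binds a local name
    return encode_subtitle()      # so the encoder still sees ""


def main_fixed(call_dict: str) -> str:
    global currentCallDict
    currentCallDict = call_dict   # now rebinds the module-level name
    return encode_subtitle()
```

With these thresholds, index 255 stays on the 0x96 prefix and 0x97 00 is never emitted. In the second half, main_buggy hands the encoder an empty string no matter what it is given, which is why the per-subtitle tests could pass while the file rebuild did not move.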

The result

Per-subtitle round-trip on jpn-d1/MGS/RADIO.DAT (13,426 SUBTITLE elements), stage by stage:

Stage                                        Exact round-trip
Original (no kinsoku)                        0 / 13,426
+ kinsoku flag handling                      1,102 / 13,426 (8.2%)
+ Spanish reorder                            1,633 / 13,426 (12.2%)
+ custom-slot prioritization                 8,366 / 13,426 (62.3%)
+ furigana fullwidth ( / )                   9,394 / 13,426 (70.0%)
+ duplicate-kanji disambiguation             9,705 / 13,426 (72.3%)
+ encoder thresholds + global propagation    13,425 / 13,426 (99.99%)
  ↳ clean subs (no custom-char escape)       1,593 / 1,593 (100.0%)
  ↳ dirty subs (with custom-char escapes)    11,832 / 11,833 (99.99%)

The one remaining byte on jpn-d1 is at offset 0x6a999: original 0x96 12, rebuild 0x96 11. Both indices land in the same call’s custom-char dict, and the gap is a find() collision in the corpus glyph tables — two adjacent dict slots both carry tiles that graphicsData labels the same way, so the encoder matches the first one. The in-game render is visually identical. It’s an OCR-data artifact, not an encoder bug.

Cross-disc validation

The whole point of moving from per-disc overrides to structural rules was that the fixes should hold on every disc, not just jpn-d1. So I ran the same pipeline against USA disc 1 and MGS Integral disc 1, and two more double-length-ASCII glyphs immediately surfaced:

  • USA disc 1 had 1,006 instances of 0x80 22 — opening and closing fullwidth quotation marks around quoted dialogue (said "The graveyard is full…").
  • Integral disc 1 had 8 instances of 0xc0 2d — a TOP_KINSOKU hyphen in PSG-1 weapon callouts, so the dash never strands to the start of a wrapped line.

Same shape of fix as the jpn-d1 furigana case: change radioChar['22'] from " to ＂ (U+FF02) and radioChar['2d'] from - to － (U+FF0D). The decoder emits fullwidth, the encoder catches it via revRadio, the double-length form survives. English translations that happen to contain ASCII quotes or hyphens are unaffected.
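
Here is a stripped-down sketch of why the fullwidth mapping keeps the two forms apart. The two radioChar entries are the ones named above; encode_char is a hypothetical helper, and treating plain ASCII as a single byte is my assumption about the simpler path.

```python
# The two table entries from the fix: hex code -> fullwidth character.
radioChar = {"22": "\uff02", "2d": "\uff0d"}   # U+FF02, U+FF0D
revRadio = {ch: int(code, 16) for code, ch in radioChar.items()}


def encode_char(ch: str) -> bytes:
    """Hypothetical encoder step for a single character."""
    if ch in revRadio:
        # Fullwidth form: emit the double-length 0x80-prefixed code.
        return bytes([0x80, revRadio[ch]])
    # Assumption: ordinary ASCII stays a single byte.
    return ch.encode("ascii")
```

Because the decoder now produces U+FF02 for 0x80 22, the text itself records which form the original used, and an ASCII quote typed by a translator never gets promoted to the double-length code.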

Final cross-disc results:

Disc          Original size   Rebuilt size   Diff bytes   Status
jpn-d1        2,283,386       2,283,386      1            OCR artifact in graphicsData
usa-d1        1,776,851       1,776,851      0            ★ byte-identical
integral-d1   11,198,464      11,198,464     0            ★ byte-identical

The Integral diff was particularly satisfying. At the point where jpn-d1 was down to 1 byte of diff, Integral was rebuilding at the exact correct size but with 139,773 bytes differing. That sounded structural — alignment, padding, something special to Integral’s layout — and I was nervous about it. But the per-subtitle test was already at 63,138 / 63,146 (99.99%), and only 8 dirty subtitles were failing. The 8 failures all contained one byte sequence (0xc0 2d) that drifted on re-encode, and every byte downstream of each one shifted by a step, multiplying the apparent file-level diff by a factor of ~17,500. One radioChar entry, eight subtitles, and an entire 11 MB file falls into place.


Why both of these matter, together

The two stories rhyme. In both cases we’d been throwing away information that the original engineers had deliberately encoded into the format, because we hadn’t yet understood it was meaningful:

  • For ZMOVIE.STR, the “stale graphics” came from treating block 0 and block 1 as independent containers when they’re actually one logical payload with a spill boundary.
  • For RADIO.DAT, the four-codes-per-glyph “duplication” came from treating two flag bits as noise, when they were the whole point of Japanese typesetting rules.

Both of these were also load-bearing for translation work. The ZMOVIE fix means the recompiler can now safely round-trip the disc-2 ending FMV (by not touching it), and multi-chunk subtitle inserts on jpn-d1 zmovie-02 will actually fit in the available space without overwriting graphics. The kinsoku/lookup-order/threshold fixes mean USA disc 1 and Integral disc 1 both round-trip a fresh extract of RADIO.DAT to a byte-identical rebuild — which is the strongest test I know how to write for a recompiler. If the bytes match, the engine cannot tell our output from the original.

What’s still open

A few honest caveats:

  • JPN encoder fidelity for the ZMOVIE corpus. The kinsoku/bank-3 work fixed encoder fidelity for RADIO.DAT. ZMOVIE still surfaces a few glyphs that round-trip to slightly different code points than the original (e.g. 0x9601 vs 0x9c01). Symptom: round-trip text comes back kanji-jumbled even when structurally correct. USA-d1 ZMOVIE round-trips cleanly, which is the proof the structural ZMOVIE fix is sound — but the JPN ZMOVIE corpus needs another encoder pass.
  • zMovieTextInjector.py (the CLI) is still on the old code path. Native-endian struct.pack("I", …) instead of <I, a dead injectSubtitles(), and the same chunk-blind block layout the GUI used to have. The GUI doesn’t touch this script. I’ll either delete it or port the fix; for now, use the GUI.
  • Cleanup follow-ups in the kinsoku work. The legacy punctuation dict, the revPunct table, the 0xC0 / 0xC1 / 0xC2 / 0xB0 / 0xD0 decoder branches, and the unreachable addCharToDict helper are all functionally dead now. Removing them is a clean follow-up; I held off so the round-trip diffs stayed focused.
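
On the struct.pack point in the CLI caveat above: Python’s struct module treats a bare "I" as native byte order (with native size and alignment), while "<I" is always a 4-byte little-endian value. A quick illustration, using an arbitrary example value:

```python
import struct
import sys

value = 0x0120  # arbitrary example length

explicit_le = struct.pack("<I", value)   # always 4 bytes, little-endian
explicit_be = struct.pack(">I", value)   # always 4 bytes, big-endian
native = struct.pack("I", value)         # depends on the machine running the script

print(explicit_le.hex())   # 20010000
print(explicit_be.hex())   # 00000120
print(sys.byteorder, native == explicit_le)
```

On a typical little-endian PC the native and "<I" results happen to match, which is how this kind of thing goes unnoticed; spelling out "<I" removes the dependence on the host machine.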

A good week, all in all. Two long-standing mysteries closed, two discs flipped from “close enough” to byte-identical, and the recompiler is no longer at risk of eating the end of the game.

The Last Word

I’m finding my footing back at work. I’m sorry to say I lean more heavily on Claude to solve some of my issues, and I do miss the fun time I’ve had coding, but as we get deeper into the project, I want to make a lot more meaningful progress.

To that end, I’ll be pushing far further ahead in translation, and working on some semi-final changes to the GUI version. The reason the GUI has progressed as far as it has is that aligning RADIO dialogue really needed a way to simultaneously edit both VOX and RADIO, and it made things much simpler to have a unified tool.

As far as I can tell, it’s now working, and splitting dialogue into multiple individual subtitles does work, which means I can arrange the subtitles in almost any way I want so that they match the dialogue.

So, look forward to more updates coming soon!

J-Rush