Well this part kinda sucks.

I’ve mentioned before that Radio.dat has some exceedingly long calls, which are apparently here to break my entire project. Injecting English text in the space of the Japanese text tends to make the binary data longer. You can fit two English characters in the space of one Kanji, but that doesn’t mean the overall idea can be conveyed in a shorter manner.

The biggest example of this in the undub is in the Mei Ling blocks. Part of what makes them so large is the Save blocks, where all of the stage names (for saving data) are repeated roughly 50 times, producing an enormous amount of data on a per-call basis.

I summed up this problem for Claude, and here was his response after doing some research.

%% Now I have all the information I need to give you comprehensive advice. Let me put this together.


Research Findings and Option Analysis

GCL Research Summary

The short answer: no public GCL disassembler exists for MGS1’s .gcx script format in STAGE.DIR. What the community has:

  • The metalgeardev/MGS1 and Jayveer/Rex repos provide file format research but no GCX decompiler
  • MGS2/MGS3 have more GCL documentation, but their format diverged from the PSX variant
  • Your own RadioDatTools.py is the most complete disassembler for the RADIO.DAT GCL dialect
  • The STAGE.DIR .gcx triggers are known at the byte-offset level (your callsInStageDirFinder.py finds them), but rewriting or splitting them would require writing new GCX content with no compiler/validator — essentially hand-crafting bytecode blindly

GCL splitting is off the table without a significant new tooling investment. The risk of a malformed GCX crash or hard-to-debug game logic break is too high.


What the Data Reveals

Looking at the actual structure of the staff calls in integral-d1-split.xml, a few important things emerged:

  1. Random selection via RND_SWCH / RND_OPTN

Staff calls use RND_OPTN blocks (randomly selected trivia items) nested inside IF_CHECK/ELSE_IFS branches (stage-gated categories). The player only ever hears one trivia item per contact. So reducing the pool from 90 to 75 options is completely invisible to the player.

  2. Calls come in exact pairs

Every large staff call has a near-identical twin (same RND_OPTN count, byte size differing by ~85 bytes):

Pair 3: offset 4454400 (90 opts, 63444 bytes) | offset 4532224 (90 opts, 63529 bytes)
Pair 3 over limit: offset 10379264 (91 opts, 65863 bytes) | offset 10459136 (91 opts, 65948 bytes)

These pairs are the same trivia pool for two different game state conditions (likely different in-game checkpoints or disc sides). Any fix applied to one must be mirrored in its twin.

  3. Scale of the problem
  • 4 calls already over 65,535 bytes in the current XML (and the XML lengths include the Japanese content + null space overhead)
  • ~80 more calls are in the 55,000–65,000 byte range that English translations will push over
  • Typically 59–90 RND_OPTNs per large call
  • Mei Ling proverbs: same structural issue, but you said fewer calls are affected

The Viable Options

Option 1: Remove RND_OPTN entries from over-budget calls (most viable)

This is the cleanest engineering solution. Because the game randomly selects one trivia item, removing 15–20% of the pool is imperceptible to players. The recompiler already handles length recalculation. You’d:

  1. Run the recompiler on each translated call and capture which calls overflow and by how much
  2. Calculate bytes-per-RND_OPTN for that call and determine how many to drop
  3. Write a script to trim the n smallest/shortest items from each RND_OPTN set in XML
  4. Apply the same trim to the paired call

Downside: Some trivia content is permanently dropped. Given there are 80+ items remaining, this isn’t a meaningful loss to the player experience.
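The trimming steps above could be sketched roughly like this. To be clear, the tag names `RND_SWCH`/`RND_OPTN` and the `length` attribute are assumptions about how my XML encodes things, not a confirmed schema:

```python
import xml.etree.ElementTree as ET

def trim_rnd_optn(call_elem, bytes_over):
    """Drop the shortest RND_OPTN children until the call fits.

    Assumes each option element exposes its encoded byte size via a
    hypothetical 'length' attribute. Returns the remaining overflow
    (<= 0 means the call now fits).
    """
    for swch in list(call_elem.iter('RND_SWCH')):
        # Sort shortest-first so we sacrifice the least content.
        options = sorted(swch.findall('RND_OPTN'),
                         key=lambda o: int(o.get('length', '0')))
        for opt in options:
            if bytes_over <= 0:
                break
            bytes_over -= int(opt.get('length', '0'))
            swch.remove(opt)
    return bytes_over
```

The same function would then be run against the call’s paired twin so both pools stay in sync.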

Option 2: Cap translation length per subtitle (works as a safety net, not a primary fix)

Set a max of ~80–90 ASCII chars per subtitle text during translation generation. This requires going back to the translation dict and shortening any entries that push a call over. Works well for Mei Ling proverbs since those are self-contained one-liners. Harder for multi-line staff trivia that loses meaning when truncated.
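A minimal sketch of that guardrail; the 85-character limit and the word-boundary truncation are choices made here for illustration, not anything the game enforces:

```python
MAX_SUBTITLE_CHARS = 85  # guardrail value, not a hard game limit

def cap_subtitle(text, limit=MAX_SUBTITLE_CHARS):
    """Truncate a subtitle at the last word boundary under the limit,
    marking the cut with an ellipsis so a human can review it later."""
    if len(text) <= limit:
        return text
    cut = text.rfind(' ', 0, limit - 3)  # leave room for '...'
    if cut == -1:
        cut = limit - 3
    return text[:cut] + '...'
```

In practice you’d probably log every capped line rather than silently truncate, so the translation dict can be reworded by hand.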

Option 3: Hybrid — trim pool size + apply length cap (recommended)

For the 4+ calls that are hardest hit: remove enough RND_OPTNs to create comfortable headroom (say, target 70 options instead of 90). For everything else: apply a translation length cap of ~85 chars per subtitle as a guardrail during the injection step. This gives you:

  • Full-quality translations where they fit
  • Content reduction only where unavoidable, via random-pool trimming (invisible to players)
  • A systematic safeguard against future overflow as translations are refined

Option 4: Catalogue unique items and deduplicate across paired calls

Since each call in a pair differs by only ~1 RND_OPTN, the pools are nearly identical. If you identified which items appear in both calls of a pair vs. only one, you could remove the duplicates within each pair—but since both calls in a pair already have the same count, this doesn’t help with the size limit.

However, across different pairs, if certain trivia topics are repeated (e.g., “about the voice actors” appears in 3 different stage-gated pools), that cataloguing effort could justify removing them from some stages while keeping them in others. This is the highest-effort option and requires reading/categorizing all 124 calls' content.


For Mei Ling Proverbs

The same RND_OPTN removal approach applies if proverbs use RND_SWCH. If they use a simpler sequential or IF_CHECK structure without randomization, translation length capping is more appropriate since you can’t remove “half” of a proverb. Let me know the structure and I can advise further.


My Recommendation

Go with Option 3 (hybrid). Specifically:

  1. First, write a script that runs the recompiler, captures which calls overflow and by how many bytes, and calculates how many RND_OPTNs need to be removed per call
  2. Trim the XML to remove those options (targeting the shortest/least interesting trivia entries first, preserving the most content-rich ones)
  3. Add an 85-char hard cap in the translation injection step as a safeguard
  4. Apply the same treatment symmetrically to each call’s paired twin

This avoids any GCL/GCX work, is fully automatable, and the player-facing impact is essentially zero. %%

So. Wow. Well, I feel good knowing I’m writing the book on Radio.dat.

“Your own RadioDatTools.py is the most complete disassembler for the RADIO.DAT GCL dialect.”

Very nice of them to say. But the problem is that means I’m really forging the path ahead. It’s probably going to mean I need to do a lot of work to determine the following:

  1. Determine which calls are triggered by which stage
  2. Summarize the general content of each call
  3. Categorize the specific segments and see how often they appear
  4. Start to pare down the number of segments
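Step 3 could start with a simple cataloguing pass: hash each dialogue segment and count how often it appears across calls. The shape of the `calls` mapping here (call offset to list of subtitle strings) is hypothetical:

```python
import hashlib
from collections import Counter

def catalogue_segments(calls):
    """calls: mapping of call offset -> list of subtitle strings.

    Returns a Counter of segment-hash -> occurrence count, plus a
    lookup from hash back to the original text, so repeated trivia
    items can be spotted across stage-gated pools.
    """
    counts = Counter()
    lookup = {}
    for offset, segments in calls.items():
        for seg in segments:
            key = hashlib.md5(seg.encode('utf-8')).hexdigest()
            counts[key] += 1
            lookup[key] = seg
    return counts, lookup
```

Anything with a count above 1 is a candidate for being kept in one stage’s pool and dropped from the others.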

I threw in a hail mary asking two things:

  1. Slowbeef originally used a text pointer redirect when he was unable to fit a translation into Policenauts’s localization. He’d faced the same issue I have: the English text is just much longer than the Japanese encoding.

Unfortunately that was fruitless. At first I thought Claude had misunderstood what I wrote, but then I got what he was saying: since RADIO.DAT is a custom code structure, it’s unlikely (though possible) that the interpreter reading the text would handle a pointer redirect the way I’d need it to.

In my frustration I let Claude stew over that suggestion and another, and went to play some Helldivers 2 while it percolated.

But boy did I come back to a surprise.

The second option I suggested… the FOXDIE team had been reverse engineering the PSX Integral binary. In fact, several of those amazing people have lent a hand over time, especially netcat and some of the other Hackers Of Liberty who have been around this game a long time. So I figured… if we have the source code, and enough context deciphered, couldn’t we just check whether the game requires this to be a short unsigned int? How hard would it be to upgrade it to a long unsigned int? And if by some miracle we did…….. would it even work?

Well actually, yes!

In all honesty, it was not very hard to implement. Claude was even able to match the codebase’s conventions: basically a manual read of four bytes, shifted bitwise and summed together into a 4-byte integer.
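For reference, a sketch of that shift-and-sum convention as it would look in my Python tooling. I’m assuming little-endian byte order here to match the PSX’s native MIPS ordering; flip the shifts if the container format actually stores lengths big-endian:

```python
def read_u32_le(data, off):
    """Assemble a 4-byte little-endian integer from raw bytes,
    mirroring the manual shift-and-OR convention in the decompiled
    source (function names here are my own)."""
    return (data[off]
            | data[off + 1] << 8
            | data[off + 2] << 16
            | data[off + 3] << 24)

def write_u32_le(value):
    """Inverse operation: split a 32-bit value back into 4 bytes."""
    return bytes((value & 0xFF,
                  (value >> 8) & 0xFF,
                  (value >> 16) & 0xFF,
                  (value >> 24) & 0xFF))
```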

And not only could we change it, the game code would in theory still support that choice.

Holy shit…

Holy SHIT

So yes, this idea could work. Only then I slowly peeled back the layers of my own code to realize what a nightmare I’d unleashed.

So first off, I thought: hey, this’ll be easy. At first the scope was only the container listings (basically every time we have a script container that starts with 0x80). We just need to widen the length storage. I just wanted a proof of concept that the binary would compile, and that we could actually read the calls with the longer length field.

But then I realized… these are containers. Which means we need to account for those bytes in the containers above them. If each dialogue clip is 2 bytes longer, then the call grows by 2 bytes times the number of voice clips… Well, shit.

Okay, well we’re already recalculating lengths right? That shouldn’t be hard to iterate.

Only… my iterative process runs from the bottom up. The innermost XML child has to affect every parent. And that’s a mess, because not only can Python’s XML tools not directly reference an element’s parent (you have to build a whole lookup), but the iterative nature of this means I’d be adding those two bytes REPEATEDLY unless I could flag each element as “fixed”.

One benefit to AI is it can be your rubber duck (or sounding board) so I started with that problem. Claude told me what I already knew… we have to make a separate pass to adjust the length bytes.
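That separate pass could look something like this sketch. It goes top-down instead of bottom-up, so each container’s length is adjusted exactly once and no “fixed” flag is needed. It assumes each stored length covers only the container’s payload (so it grows by 2 bytes per container nested inside it), and the tag and attribute names are placeholders for whatever my XML actually uses:

```python
import xml.etree.ElementTree as ET

LENGTH_GROWTH = 2  # each length field widens from 2 to 4 bytes

def widen_lengths(root, container_tag='CONTAINER', attr='length'):
    """Top-down pass: every container's payload grows by 2 bytes for
    each container nested anywhere beneath it, since each nested
    length field itself widens. Counting descendants up front avoids
    the repeated-add bug of the bottom-up approach."""
    for elem in root.iter(container_tag):
        # iter() includes the element itself, so subtract one.
        nested = sum(1 for _ in elem.iter(container_tag)) - 1
        old = int(elem.get(attr, '0'))
        elem.set(attr, str(old + nested * LENGTH_GROWTH))
```

Because only attributes are mutated (never the tree structure), it’s safe to run during iteration.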

Okay… That’s… fine..? But it dawned on me.

So the problem we’re trying to solve is that a call’s binary data exceeds what a two-byte unsigned integer can describe, and we can expand the call header’s length field to 4 bytes. But that means we also need to expand every other FF command that contains a 0x80 script. Or so I thought.

But then I look at a couple of the staff calls and have my oh-shit moment. IF_CHECK containers’ initial length values can be upwards of 56k bytes as well, nearly the entire size of one of those long calls. It’s not proven yet that we’ll exceed that size limit, but something tells me that with a check that large and a lot of dialogue… it’s quite likely.

So this simple test is turning into a lot of work to confirm whether or not we can just widen all of the length fields from a short to a long.

But it’s worth the investment and time for multiple reasons:

  1. The staff calls are insurmountably large with all the repeated/random data, and we would literally need to trim content, which I’d prefer not to do. Cataloguing it all to ensure the content survives SOMEWHERE, just not repeated everywhere, is a much heavier task.
  2. We already face the same issue in the retail Japanese game because of Mei Ling’s save blocks. With English text I had to heavily abbreviate just the stage names (the ones in the saved-game menu), and that’s before modifying her actual dialogue. So we’d face the same thing: catalogue all the calls and content, delete some dupes here and there so the dialogue is still SOMEWHERE in the game, just not as often or in as many places.

So here we are, vibe coding changes to my recompiler in order to see if we can (as painlessly as possible) make it emit a 4-byte length value instead of two. And so far… I think we’re trucking along, but given that the recompiled text is not perfect, we are still ending up with different hashes each time.

Maybe we should’ve just stuck with re-injecting the original text? Well, maybe, but through these edits we also found different characters subbed into even the save-frequency names (because some punctuation has multiple possible encodings).

Claude has patched most of the existing code, but I’m wondering if he’s getting off track. He’s started pulling in other project directories that sound right (I had an older pull of the FOXDIE reversing project, and he’s started reading that for advice), so it’s best to cap it for now and resume later with some fresh context.

I know this is a heavy read, and there will be a lot of work ahead but for now… I think we’re on a great path. Having a modded executable isn’t the most ideal, because now I’ll have to review and ensure we can patch the original too… but honestly the gains outweigh the risks.

As long as it works……..

-J-Rush