Writing a Baldr Sky Decompiler, Compiler, and Other Bad Ideas


So Baldr Sky got announced! I definitely didn’t see that coming. In truth, I’ve been working on the game for some time to address technical issues for Aroduc’s translation of the game, but even I hadn’t heard about this ahead of time, so it was unexpected. Sadly, with an official release, I’d imagine a lot of this work will end up going to waste, since I’m certain they’ll have access to the original scripts. Nonetheless, I wanted to go over exactly what was done on the project, because I think it’s pretty cool.

There are a number of issues facing the Baldr Sky translation project, but the most obvious and glaring one was Makoto’s text. For the uninitiated, Makoto speaks mostly in a kind of telepathy, which is displayed in a rather unique manner using floating animated text. It’s pretty cool looking, basically like this:

As you might imagine, you can’t just change the Japanese text to English in this case… it looks completely and unreasonably awful. Like so:

[Image: the floating text effect with English text substituted in]

When I was asked to look into this, I heard the last set of hackers had declared that it would be effectively impossible to fix; they just couldn’t figure out how it worked. These are people far beyond my skill level in reverse engineering, and whose work I respect, so that had me worried. But Aroduc doesn’t take no for an answer, and in a worst case, the text could be swapped out for images or even normal text. …But even that would require an understanding of the game’s script format.

I wouldn’t be going in completely blind though, because while helping someone set up tools for Parfait (another Giga title, the translation I believe is dead), I found something amazing. Giga had accidentally included within the game archives a copy of their compiler tool and the uncompiled scripts. The formats weren’t exactly the same: Baldr Sky’s format was considerably more advanced (Parfait didn’t support loops and many other features), and the output format was a little different. But it would be enough to start figuring things out.

So first, the script format. I went at the scripts with the usual method, simply opening them in a hex editor. After comparing a bunch of scripts by hand, it became obvious the script files are split into four distinct sections.

1) A whole bunch of simple binary data. At a glance, it looks like a really big list of integers.
2) A list of strings. All the text is stored here, and it would be pretty trivial to swap out the strings without damaging the usability of the file. Presumably this is how other tools work.
3) A short list of strings, of what look like variable names.
4) A bit more binary data.

This format is a lot different from their older games, like Baldr Force and Duel Savior, where the strings were included within the binary data… it’s much easier to work with at least.

So here’s the most basic script I could compile, and the output generated by the compiler:

[Screenshot: the minimal script and the compiler’s output for it]

In addition to that one, I tried a lot of different simple scripts to figure out what everything meant. With a bit more figuring out, I could identify some of the input within the output:

18000000 64000000 – This represented the start(100) command.
19000000 00000000 – This represented the end() command.
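
To make the byte layout concrete, here’s a minimal sketch (not the actual tool) that decodes hex words like the ones above into opcode/operand pairs. It assumes each value is a little-endian 32-bit integer, which matches 64000000 decoding to 100; treating the values as signed is my assumption, and decode_ops is a made-up name.

```python
import struct

def decode_ops(hex_words: str) -> list[tuple[int, int]]:
    """Decode space-separated 4-byte hex words into (opcode, operand) pairs."""
    raw = bytes.fromhex(hex_words.replace(" ", ""))
    # ASSUMPTION: each operation is two little-endian signed 32-bit integers.
    return [struct.unpack_from("<ii", raw, off) for off in range(0, len(raw), 8)]

# The two operations quoted above: start(100) followed by end().
ops = decode_ops("18000000 64000000 19000000 00000000")
for opcode, operand in ops:
    print(f"0x{opcode:02X} {operand}")
```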

To break down the binary format with what I know now, the key point is that the output file opens with a 4 byte integer which represents the length of the script section, with each operation being fixed in length and composed of two integer values. This is then followed by a string table, a variable table (probably for debugging), and lastly, if there are any banks (basically functions), a table containing some details about those is appended at the end. Since the input code in this case contains no strings or variables, they just appear as 0’s following the last operation.
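
As a rough sketch of reading that layout: the snippet below pulls the script section off the front of a file and leaves the rest for the tables. The post doesn’t say whether the length field counts bytes or operations, so this assumes it counts operations; the sample bytes are hand-built for illustration rather than taken from a real file, and read_script_section is a hypothetical name.

```python
import struct

def read_script_section(data: bytes) -> tuple[list[tuple[int, int]], bytes]:
    """Split a compiled file into its operations and the remaining table bytes."""
    # ASSUMPTION: the leading 4-byte length counts operations, not bytes.
    (op_count,) = struct.unpack_from("<i", data, 0)
    ops = [struct.unpack_from("<ii", data, 4 + 8 * i) for i in range(op_count)]
    # Everything after the operations: string table, variable table, bank table.
    rest = data[4 + 8 * op_count:]
    return ops, rest

# Hand-built sample: two operations, then zeros standing in for empty tables.
sample = struct.pack("<i", 2) + struct.pack("<4i", 0x18, 100, 0x19, 0) + bytes(8)
ops, rest = read_script_section(sample)
print(ops, len(rest))
```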

Seeing the input represented in the output in an understandable format was encouraging, so I moved on to more complex examples. I wrote down the output from various commands and looked at the differences in the output of each one. At this point I excluded data that wasn’t part of the script code.

In addition to the command listed, the first line of the file declared the variables as “int i, j, k;”. In the output binary, those variables would be given ids of 0, 1, and 2.

As you can see, there are some clear patterns from input to output. Figuring this out is like solving some kind of obscure puzzle. It’s not obvious what everything does, and there are a lot of traps. One example is the 0x1A command you see at the start of each line. It isn’t obvious at first, but this is actually the line number! In all these cases, the value of the 1A operation is 2, which means the code comes from the second line of the file, which is where I wrote the command. I suppose this is handy to have for debugging purposes.

Let’s break down the last example on the list to see how it works. I’m going to skip a bunch of steps here and explain the final results of what the binary code means. It took a lot of trial and error to figure it out, but eventually I worked out what all the opcodes meant.

For the code: k = 32 + (1 - 3);

[Screenshot: the compiled operations for this statement]

It’s just like assembly! It’s effectively a simple virtual machine, with two registers (which I’ll call register 0 and register 1), and a stack. It’s a bit hard to follow, so I’ll step it through.

LINE 2 – This is the second line of the compiled script.
VAL 2 – Assigns the value 2 to register 0.
VAL 32 – Assigns the value 32 to register 0.
PUSH 0 – Pushes the value in register 0 to the stack.
VAL 1 – Assigns the value 1 to register 0.
PUSH 0 – Pushes the value in register 0 to the stack.
VAL 3 – Assigns the value 3 to register 0.
POP 1 – Pops the last object off the stack and assigns it to register 1.
SUB 0 – Subtracts register 0 from register 1, and assigns the value to register 0.
POP 1 – Pops the last object off the stack and assigns it to register 1.
ADD 0 – Adds the value of register 0 and register 1 together, and assigns the value to register 0.
ASSIGN 2 – The value of memory position 2 is set to the value of register 0.

With this broken down, you can follow through the code from start to finish and see that it does indeed produce the result you’d expect from the input. That second line (VAL 2) is a bit weird, but I found that the compiler erroneously emitted an extra VAL in some cases (line numbers, some operations). It doesn’t affect the output though. It’s also worth noting that there’s no way to directly assign a value to register 1; you have to go through the steps of assigning it to register 0, pushing it to the stack, and then popping the result off into register 1.
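
To check the walkthrough, here’s a small sketch of an interpreter for just these operations, written with the mnemonics from the listing rather than real opcode numbers. It’s a toy under stated assumptions, not the game’s VM: the operand of ADD and SUB is ignored because the post doesn’t say what it means, and the run function and three-slot memory are made up for the example.

```python
def run(program, memory_size=3):
    """Execute (mnemonic, operand) pairs on a two-register machine with a stack."""
    reg = [0, 0]
    stack = []
    memory = [0] * memory_size
    for op, arg in program:
        if op == "LINE":
            continue                   # line-number marker only, no effect
        elif op == "VAL":
            reg[0] = arg               # VAL always targets register 0
        elif op == "PUSH":
            stack.append(reg[arg])     # operand selects the register to push
        elif op == "POP":
            reg[arg] = stack.pop()     # operand selects the register to fill
        elif op == "SUB":
            reg[0] = reg[1] - reg[0]   # operand ignored (meaning unknown)
        elif op == "ADD":
            reg[0] = reg[1] + reg[0]   # operand ignored (meaning unknown)
        elif op == "ASSIGN":
            memory[arg] = reg[0]       # operand is the variable id
        else:
            raise ValueError(f"unknown operation {op}")
    return memory

# The listing for k = 32 + (1 - 3), where i, j, k have ids 0, 1, 2.
program = [
    ("LINE", 2), ("VAL", 2), ("VAL", 32), ("PUSH", 0), ("VAL", 1), ("PUSH", 0),
    ("VAL", 3), ("POP", 1), ("SUB", 0), ("POP", 1), ("ADD", 0), ("ASSIGN", 2),
]
memory = run(program)
print(memory[2])  # 30
```

Dropping the stray VAL 2 from the program gives the same result, since the next VAL overwrites register 0 before anything reads it.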

(I should note that I’m not very good with this stuff, what I call the VAL operation probably has a more correct term that I should be using.)

[Screenshot: a compiled engine function call]

Here’s another example, this time with a remote function call (i.e. calling a function from the engine itself). You can see that it’s pushing the variable values into the registers, and then uses what I called the PARAM operation to add them as parameters. Then the function call is made, which has a slightly different syntax. Unlike all other commands, this operation takes two 2 byte values, the first indicating the function to call, and the second listing the number of parameters being passed. Strings can also be used as parameters, which is done by loading the id of the string table entry into register 0 and issuing a PARAM with a value of 1 instead of 0. There’s actually no string handling at all in the script, other than using them as parameters.
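
The split operand of the call operation can be sketched like this. It assumes the two 2 byte values are unsigned and little-endian like the rest of the file; the function id 7 and the helper names are invented for the example.

```python
import struct

def split_call_operand(operand: bytes) -> tuple[int, int]:
    """Split a call's 4-byte operand into (function id, parameter count)."""
    # ASSUMPTION: two unsigned little-endian 16-bit values, function id first.
    function_id, param_count = struct.unpack("<HH", operand)
    return function_id, param_count

def split_call_operand_int(value: int) -> tuple[int, int]:
    """Same split when the operand was already read as one 32-bit integer."""
    # With little-endian bytes, the first 2-byte value is the low 16 bits.
    return value & 0xFFFF, (value >> 16) & 0xFFFF

# Hypothetical call: engine function 7 with 2 parameters.
operand = struct.pack("<HH", 7, 2)
print(split_call_operand(operand))
```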

After working through example after example, I built up an ever growing reference and understanding of how the output code works. I switched gears, and started parsing out the Baldr Sky scripts. The operation codes are all different, but it’s more or less the same otherwise. Eventually I got a strong enough knowledge of what I was doing to build a tool to automatically write readable code from the scripts (for certain definitions of readable). The results weren’t pretty, but they worked.

For those interested, here’s an example output from this process: http://puu.sh/pQJHV/916b2efef5.txt This is a late version of the script; originally I didn’t have as much detail filled in about how stuff worked. It’s worth pointing out that the extraneous VAL operations are very obvious at the start of the file. Giga uses header files with a lot of defines that for some reason output them. You can actually exclude them entirely and the scripts will run just fine.

Anyways, armed with disassembled scripts, I quickly whipped up a reassembler so I could test this stuff in game. Then I was able to actually start tinkering with and modifying the game operation itself. Again, this took the form of trial and error: adding things, deleting things, rewriting stuff, causing things to crash. Over and over again.
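
The disassemble/reassemble loop can be sketched in a few lines. This is only an illustration, not the real tool: the opcode table holds just the three Parfait values mentioned earlier (0x18, 0x19, 0x1A), the names START and END are my own labels, and Baldr Sky’s actual opcode numbers are different. Unknown opcodes fall back to an OP_xx name so nothing is lost on the way back.

```python
import struct

# Only the opcodes named in this post; everything else is left numeric.
MNEMONICS = {0x18: "START", 0x19: "END", 0x1A: "LINE"}
OPCODES = {name: code for code, name in MNEMONICS.items()}

def disassemble(data: bytes) -> str:
    """Turn raw operation bytes into one 'NAME operand' line per operation."""
    lines = []
    for off in range(0, len(data), 8):
        op, arg = struct.unpack_from("<ii", data, off)
        lines.append(f"{MNEMONICS.get(op, f'OP_{op:02X}')} {arg}")
    return "\n".join(lines)

def reassemble(text: str) -> bytes:
    """Turn the text listing back into raw operation bytes."""
    out = bytearray()
    for line in text.splitlines():
        name, arg = line.split()
        # Known names use the table; OP_xx names carry their opcode in hex.
        op = OPCODES[name] if name in OPCODES else int(name[3:], 16)
        out += struct.pack("<ii", op, int(arg))
    return bytes(out)

binary = struct.pack("<8i", 0x18, 100, 0x1A, 2, 0x05, 32, 0x19, 0)
listing = disassemble(binary)
print(listing)
```

Checking that reassemble(disassemble(x)) gives back x byte for byte is a cheap way to catch mistakes before loading anything into the game.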

Eventually I identified the code I was actually looking for. For all of Makoto’s lines, there was a huge block of commands, hundreds of them. By modifying them, I could influence how the text was being displayed. Makoto’s text was being generated in script! This explains why the other hackers were unable to identify how it worked. But I still needed to figure out exactly what the code does in order to modify it.

So I started working backwards through the code. I reckoned that if I understood the scripts well enough, I should be able to reconstruct the code that generated it in the first place. Then, it would be easy to understand what’s going on. After a few hundred lines of transcribing code though, I wasn’t getting much closer. I did, however, start to identify patterns. With some practice, you start to have a pretty decent grasp of what the original code would look like just by looking at the operations. Could that be automated?

That’s a pretty silly idea, writing a full decompiler. And certainly far beyond my skill level. Or was it? I ran it through my head a lot, and after about a week of thinking the problem over, I sat down and started coding.

I had a pretty simple idea. You would run through the code as if you were executing it. Once you hit an operation that modifies a register, instead of assigning it the expected result, you’d actually put a string representation of the operation into the register instead. Then, when you hit any kind of function call or assignment, you simply output everything that makes up that call to the file.

Let’s work this through using our previous example. Let’s say register 0 contains the value 3, and register 1 has the value 1, and then you reach a SUB operation. Instead of calculating the output of 1 - 3 and putting the value -2 into the register, you actually assign the string “1 - 3” to the register. You treat the string like the output of the operation. You can push it to the stack, pop it off, make changes to it, and each time you just keep building up the string. So then if you add it to 32, you’d modify the string to look like “32 + (1 - 3)”. Once you hit a line where the output of that operation is actually used, like an assignment or function call, you output the combined statement.
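
Here’s a minimal sketch of that idea, using the same mnemonics as the earlier listing. It is not the real decompiler: it handles only these few operations, and the parenthesization rule (wrap any operand that contains a space) is a simplification I picked for the example.

```python
def decompile(program, var_names):
    """Run the operations symbolically, keeping expression strings in registers."""
    reg = ["", ""]
    stack = []
    statements = []

    def wrap(expr):
        # Simplified rule: parenthesize anything that is already a compound expression.
        return f"({expr})" if " " in expr else expr

    for op, arg in program:
        if op == "LINE":
            continue
        elif op == "VAL":
            reg[0] = str(arg)              # store the literal's text, not its value
        elif op == "PUSH":
            stack.append(reg[arg])
        elif op == "POP":
            reg[arg] = stack.pop()
        elif op in ("ADD", "SUB"):
            symbol = "+" if op == "ADD" else "-"
            # Register 1 is the left operand, register 0 the right one.
            reg[0] = f"{wrap(reg[1])} {symbol} {wrap(reg[0])}"
        elif op == "ASSIGN":
            # The expression is finally used, so emit a full statement.
            statements.append(f"{var_names[arg]} = {reg[0]};")
        else:
            raise ValueError(f"unknown operation {op}")
    return statements

program = [
    ("LINE", 2), ("VAL", 2), ("VAL", 32), ("PUSH", 0), ("VAL", 1), ("PUSH", 0),
    ("VAL", 3), ("POP", 1), ("SUB", 0), ("POP", 1), ("ADD", 0), ("ASSIGN", 2),
]
statements = decompile(program, ["i", "j", "k"])
print(statements[0])  # k = 32 + (1 - 3);
```

The extraneous VAL 2 disappears on its own here, because its string is overwritten before any statement consumes it.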

It seemed like a pretty dumb idea, but it worked a lot better than expected. Like way better. I spent a lot of time working on edge cases, making sure operations weren’t being skipped or assembled improperly, but I was actually getting good looking code out of it. I started parsing more complex structures, like if/else statements, and then even while loops. I started putting in names for function calls with an obvious purpose. The code grew into more and more of a complete picture of what was happening.

Here’s the same script file as before, properly decompiled. It’s a real, readable, useful version of the input binary.

http://puu.sh/pQLoB/a69ea6eee3.txt

Once I had code that decompiles though, it’s not enough to simply be able to read the code; I also need to be able to write it. So I made a compiler to compile the output of the decompiler. I used Antlr 3, a code generation tool that uses grammar files to create a lexer/parser that you can use as a basis for a compiler. I don’t want to trivialize it, it wasn’t easy, but I’ve done script compilers in the past, so I was eventually able to complete a compiler that would take the output from my decompiler and create usable Baldr Sky .bin files again. At first it was pretty prone to errors and issues and couldn’t round trip (i.e. compiling and then decompiling would result in different output), but at this point I think it’s pretty stable.

Then, to put it into use. Here’s some output from when I was testing the output of the compiler. In this particular case I was making sure it was outputting loops correctly. We were making legit, working changes to the game script, and it was compiling. I was pretty hyped at this point.

Now that I knew I could make useful changes to the script, I needed to modify the offending code segment.

Here’s a single block of Makoto fancy text, and the code we’ll ultimately need to edit:

This is it, in all its lengthy glory. This code is actually duplicated for each and every line of text she speaks. Presumably the original script uses some kind of macro to do it, either that or they were just crazy. Either way, with actual code in hand, we can start making sense of the problem.

If you look through it, there are three distinct parts. The first part simply deals with the position of the text. The second part calculates the height of the text so it can start at the right height on screen, taking care to manually parse @n codes as line breaks. The third part loops over each character of text, creating a layer object for the character, assigning it a move action (so it flies in), and creating a background effect to go with it. Each letter is delayed, creating the typing out effect.
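
To give a feel for the shape of that logic, here’s a loose sketch of the height calculation and the per-character loop. It is not the game’s script: every number here (character size, gap, screen height, delay step) is made up, vertical centering is my guess at what “the right height” means, and the move action and background effect are left out entirely.

```python
def layout_text(text, char_size=24, line_gap=8, screen_height=600,
                start_x=100, delay_step=50):
    """Place each character of a message and give it a staggered start delay."""
    # Treat @n as a line break, as the original script does by hand.
    lines = text.split("@n")
    # Total height of the block, used to pick the starting y position.
    block_height = len(lines) * char_size + (len(lines) - 1) * line_gap
    top = (screen_height - block_height) // 2   # ASSUMPTION: vertically centered
    placed = []
    index = 0
    for row, line in enumerate(lines):
        y = top + row * (char_size + line_gap)
        for col, ch in enumerate(line):
            placed.append({
                "char": ch,
                "x": start_x + col * char_size,
                "y": y,
                "delay": index * delay_step,    # each letter starts a bit later
            })
            index += 1
    return placed

chars = layout_text("Hi@nyou")
for entry in chars:
    print(entry)
```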

Pretty sweet, right? Knowing this, we can write our own version of the function to replace it. In addition to fixing the text orientation, I needed to make it easy for Aroduc and the editors to tweak the block of code. Here’s the final version I came up with:

And here’s how it looks in game:

Now sadly, I think with the official announcement of Baldr Sky, most of this work will have gone to waste. I can’t think of a scenario where the original compiler and scripts couldn’t be used to solve this problem. Still, it was an interesting project, and if you’re at all interested, the C# code is all up on GitHub if you want to take a look and mess around.

https://github.com/Doddler/SkyTool2/ (Sorry code is offline for a bit, it will come back in the near future. Probably!)

Thanks for reading my rambling mess!

3 Responses to Writing a Baldr Sky Decompiler, Compiler, and Other Bad Ideas

  1. Anonymous says:

    Jesus fuck you are the god of hacking, man. It’s a bit of a bummer that the original compiler will “magically” do away with all those problems though

  2. Anonymous says:

    You’re the hero we needed but didn’t deserve. Even though it seems like this is all for naught (a shame you weren’t even informed until after the announcement!) it was still a very interesting read. Thanks for the post.

  3. mahdrills says:

    Well even if it’s not used, I want to say that I’m impressed. Keep up the good work 🙂
