Topic: Text hacking Cross of Venus (NDS)
Majin3
Guest
« on: May 19, 2011, 10:12:51 am »

Hello,
I hope this is the right section.

I want to try translating the Nintendo DS game "Dengeki Gakuen RPG: Cross of Venus Special".
The problem is, I have no clue where the text is... I tried memory hacking, which worked perfectly. The text is plain SJIS, but the game files are compressed/encrypted/whatever, so I don't even know which file the text is stored in.
The game files have a .pack extension. THIS is what the header looks like and THIS is what the content looks like.
Does anyone have any clues about this pack format?

Thanks in advance.
Ryusui
Guest
« Reply #1 on: May 19, 2011, 02:25:01 pm »

Ah. Archive files. You'll need some programming skills to work with these, but it should be trivial.

The first thing you'll need to work out is where the "start" and "end" (or "stored size") data in each header is stored. Once you know that, you can dismantle the archive into its constituent files. Then we can make a reasoned guess as to its compression format.

But if you want to jump the gun in that regard, odds are it's the DS's built-in LZ compression, which is well-documented, or a close cousin. Keep an eye out for permutations in the header entries: one of the bytes is likely a compression flag, i.e. not all the files stored in an archive are necessarily compressed.
Majin3
Guest
« Reply #2 on: May 19, 2011, 05:06:13 pm »

Thanks for your response. I'm lacking programming skills though...

This is what I've found out so far about the header:
XX 00 [file name] 00 00 02 00 00 00 00 YY YY 00 ZZ ZZ 00 00
XX is the file name length, YY YY is the stored size and ZZ ZZ seems to be some sort of hash (is there such a thing as a 2-byte hash?)
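
A minimal sketch of reading one such entry, taking the layout above literally. The little-endian byte order, the assumption that XX counts only the characters of the name, the function name `parse_entry`, the field names `fixed`, `stored_size` and `check`, and the sample bytes are all guesses made for illustration, not confirmed facts about the .pack format:

```python
import struct

def parse_entry(buf, pos=0):
    """Parse one header entry laid out as
    XX 00 [name] 00 00 02 00 00 00 00 YY YY 00 ZZ ZZ 00 00."""
    name_len = buf[pos]                       # XX: file name length
    name_start = pos + 2                      # skip XX and the 00 after it
    name = buf[name_start:name_start + name_len].decode("ascii")
    p = name_start + name_len
    fixed = bytes(buf[p:p + 7])               # 00 00 02 00 00 00 00 (meaning unknown)
    p += 7
    (stored_size,) = struct.unpack_from("<H", buf, p)   # YY YY, assumed little-endian
    p += 3                                    # YY YY plus the 00 after it
    (check,) = struct.unpack_from("<H", buf, p)         # ZZ ZZ, possibly a hash or checksum
    p += 4                                    # ZZ ZZ plus the trailing 00 00
    entry = {"name": name, "fixed": fixed, "stored_size": stored_size, "check": check}
    return entry, p

# Hand-made sample entry (not taken from the game).
sample = (bytes([12, 0]) + b"bunsyou1.bin" + bytes([0, 0, 2, 0, 0, 0, 0])
          + bytes([0x34, 0x12, 0x00, 0xCD, 0xAB, 0x00, 0x00]))
entry, next_pos = parse_entry(sample)
print(entry["name"], hex(entry["stored_size"]), hex(entry["check"]), next_pos)
```

Keeping the seven "fixed" bytes around makes it easy to check later whether the 02 ever changes from entry to entry.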

But I still have no clue about what compression it is... I've tried to run DSDecmp on an extracted file but it returned "no matching compression method found".


The first byte of almost every file is 8F (I found one file with 8D, though) and the next 4 bytes seem to identify the file format (.bin has 4D 4F 4A 49, as seen in the screenshot). Any clues?
Ryusui
Guest
« Reply #3 on: May 19, 2011, 06:06:27 pm »

MOJI is 文字, or "letters". A lot of the file names are in romanized Japanese: "bunsyou" is 文章 ("bunshou"), meaning "text". "Hissatu" is 必殺 ("hissatsu"), literally "certain death" but usually used to denote a special or super attack. "Syujinkou" is 主人公 ("shujinkou"), or "protagonist".

Anyway, I'm looking at the data, and it certainly doesn't look like LZ-compressed SJIS. Which file is this?
Majin3
Guest
« Reply #4 on: May 20, 2011, 07:19:40 am »

I already suspected it was 文字, but then I thought it could be a coincidence, because all .bin files have MOJI while, for example, .iba mostly has 03 03 0F 01, which makes no sense.
But you're most likely right: those 4 bytes do sometimes vary even between files with the same extension, so I guess they're simply the compressed beginning of the file.

It wouldn't surprise me if it's not SJIS. There are 640 .pack files and each of them contains a lot of files as well. That file was the first one from System.pack: bunsyou1.bin (as a test, since without decompressing them I can't find out where the text is...)
Maybe this one is more helpful since you can recognize some words: (System.pack: script.ifb)
« Last Edit: May 20, 2011, 10:47:40 am by Majin3 »
Ryusui
Guest
« Reply #5 on: May 20, 2011, 03:50:11 pm »

It might help if you tried opening the file using one of the Japanese tables we have on hand - here's Shift-JIS, and here's EUC.

The data starting at $8D2 looks promising:

45 FF 46 46 45 43 54 5F 54 41 A2 47

This comes out to:

E($FF)FFECT_TA($A2)G

And there, ladies and gentlemen, is your smoking gun.

You'll also see it earlier at $314:

FF 5A 59 55 55 4E 49 4E 4E 7A

Which is:

($FF)ZYUUNINN($7A) ("juunin", i.e. じゅうにん, possibly 住人 or "resident"; 獣人, "beastman", would be read "juujin")

See how "FF" precedes strings of precisely eight non-gibberish characters? It's a byte of compression flags.

Let's have a look at that 7A. In binary, that's 01111010. Each bit, read from the most significant bit down, tells the decompressor whether the corresponding item that follows is plaintext or a compression code. In FF, the bits are all set (i.e. 1) and eight plaintext bytes follow, so 1 means plaintext and we can assume that 0 indicates a compression code.

After it comes this sequence:

B2 1E 41 52 49 41 F7 0F 45 BC 2E 05

Or:

($B2)($1E)ARIA($F7)($0F)E($BC)($2E)($05)

If every compression code is two bytes, then this pairs up nicely with our compression flags:

0 - ($B2)($1E)
1 - A
1 - R
1 - I
1 - A
0 - ($F7)($0F)
1 - E
0 - ($BC)($2E)

And what of that mysterious 05 at the end? That's the compression flags for the next block!
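
The pairing above can be sketched in a few lines. This assumes, as the example suggests, that flag bits are read from the most significant bit down, that 1 means one plaintext byte and 0 means a two-byte code; the name `split_block` is made up for illustration:

```python
def split_block(flags, data):
    """Split the bytes following one flag byte into up to eight items.

    Bits are read from the most significant bit down:
    1 = one plaintext byte, 0 = one two-byte compression code.
    """
    items, pos = [], 0
    for bit in range(7, -1, -1):
        if pos >= len(data):
            break
        if (flags >> bit) & 1:
            items.append(("literal", data[pos:pos + 1]))
            pos += 1
        else:
            items.append(("code", data[pos:pos + 2]))
            pos += 2
    return items, pos

# The twelve bytes that follow the 7A flag byte in script.ifb.
block = bytes.fromhex("B2 1E 41 52 49 41 F7 0F 45 BC 2E 05")
items, used = split_block(0x7A, block)
for kind, raw in items:
    print(kind, raw.hex().upper())
print("next flag byte:", format(block[used], "02X"))
```

The eight items use up eleven bytes, which leaves the 05 as the first byte of the next block.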

Now. Without seeing what all this decompresses to, I really can't help you much further than this, but each of those compression codes comprises a "length/distance" pair - that is, they encode one value that tells the decompressor how far to look back for the next bit of data, and another that tells it how many bytes from that point to copy. Hopefully you can puzzle it out from here.
Majin3
Guest
« Reply #6 on: May 20, 2011, 05:22:19 pm »

Wow, nice reverse engineering. I think I got the main idea, more or less.
So this is a completely new compression? If there is no (de)compressor, I guess I'm out of luck.

Using SJIS I do get some results like 有力者 but looking through the memory seems to be a better solution:

The file was in memory twice: compressed and decompressed. I hope this helps.

One hypothetical thought: would it work to fetch the data from memory, change it, put FF every 8 bytes (and 05 every block?), place it back into the .pack files and adjust the stored size and the hash (however it works)? It'd be subpar, but at least I could skip compressing & decompressing.
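
A rough sketch of that "FF every 8 bytes" idea, assuming an FF flag byte really does mark the next eight items as plaintext. The name `store_uncompressed` is hypothetical, and the sketch does nothing about the leading 8F byte, the stored size or the hash, none of which are understood yet. How the game treats a final group shorter than eight bytes is also unknown:

```python
def store_uncompressed(data):
    """Emit a stream in which every item is a plaintext byte:
    one FF flag byte in front of each group of up to eight bytes."""
    out = bytearray()
    for i in range(0, len(data), 8):
        out.append(0xFF)            # all eight flag bits set = eight literals
        out += data[i:i + 8]
    return bytes(out)

packed = store_uncompressed(b"EFFECT_TAG")
print(packed.hex().upper())
```

The result is one byte larger for every eight bytes of input, so a file stored this way would be bigger than the original decompressed data.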
Ryusui
Guest
« Reply #7 on: May 20, 2011, 07:08:49 pm »

If you could get the game to do all the decompression work for you, yeah, that might work, but it'll probably take less time and effort overall to simply code together a decompressor. The first step is, as you've described, to get a memory dump of the decompressed output so you can compare the compressed and uncompressed files; then, it should be trivial to figure out the exact details of the compression codes.

The good news is that the game might flag which files are compressed and which are not. You should check to see if every file entry has a "02" in that spot - it might be a flag indicating which files are compressed and how.

Also, I think you misunderstood me - when I said "05 is the compression flags for the next block", I meant the next sequence of eight "codes". I say "codes", not "bytes", because each "code" might be a plaintext byte or a two-byte compression code. 05 is 00000101 in binary, so the next sequence (without looking at it myself) will look like this:

0 - Two-Byte Compression Code
0 - Two-Byte Compression Code
0 - Two-Byte Compression Code
0 - Two-Byte Compression Code
0 - Two-Byte Compression Code
1 - One-Byte Plaintext
0 - Two-Byte Compression Code
1 - One-Byte Plaintext
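
Since each flag bit stands for either one or two bytes, the flag byte alone tells you how many bytes its block occupies. A small sketch, assuming 1 = one plaintext byte and 0 = one two-byte code (`block_length` is a made-up name):

```python
def block_length(flags):
    """Number of input bytes consumed by the eight items after a flag byte."""
    literals = bin(flags).count("1")     # set bits = one-byte plaintext items
    codes = 8 - literals                 # clear bits = two-byte compression codes
    return literals + 2 * codes

for flags in (0xFF, 0x7A, 0x05):
    print(format(flags, "08b"), block_length(flags))
```

For 7A this gives 11, which matches the eleven bytes between the 7A and the 05 in the earlier example.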
jjjewel
Guest
« Reply #8 on: May 20, 2011, 07:58:07 pm »

If it might help, the conversation when the game starts is in
Chapter_Ep01.pack.

The compression is probably as Ryusui explained.
Ryusui
Guest
« Reply #9 on: May 20, 2011, 09:04:52 pm »

There's some similar plaintext, but I don't think the compressed and decompressed data you've got side-by-side there match up - the first eight bytes after the initial "FF" are plaintext and therefore should appear in the decompressed output, but the last two bytes in the sequence (26 77) are missing.

However, this still tells us something important. They may not match up, but we can still tell enough from the decompressed sample that we can puzzle out the compression code.

At address $25, we see "get($FF)OnceInst($BA)a($F8)($1F)". In the decompressed output, we can see this should be "getOnceInstance". Where did the "nce" go, then? The question we should ask is "where did it come from": the most likely place is the last three letters of "Once".

LZ uses compression codes to determine where in the decompressed output to look for the next snippet of data and how many bytes to copy from there. Since we know where it's looking and how many bytes it had to copy, we can figure out the compression code format.

The next compression flag is $BA, or 10111010; we know that the "a" is plaintext (giving us "getOnceInsta"), and the 0 tells us that what comes next is a compression code: F8 1F, or likely 1F F8 (if these two-byte codes are being read in little-endian format). 1F F8 is 0001111111111000  - quite a whopper (and a palindrome to boot). We know that the sequence is eight (1000) bytes back and three (11) bytes long...it certainly seems to match with what we know, but it's not enough information to go on. We only know what six out of those sixteen bits are likely to mean, and we're bound to run into trouble if we go on partial information.
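
The "eight bytes back, three bytes long" reading can be checked directly against the text. A minimal sketch of the copy step (`copy_match` is a name chosen for illustration):

```python
def copy_match(out, distance, length):
    """Append `length` bytes taken from `distance` bytes before the end of `out`.

    Copying one byte at a time lets a match overlap the bytes it is producing,
    which is how LZ schemes usually encode runs.
    """
    if not 0 < distance <= len(out):
        raise ValueError("match points before the start of the output")
    for _ in range(length):
        out.append(out[-distance])

out = bytearray(b"getOnceInsta")
copy_match(out, distance=8, length=3)
print(out.decode("ascii"))
```

Starting from "getOnceInsta", this appends the "nce" of "Once" and produces "getOnceInstance".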

This appears to be where any useful similarities between the compressed and uncompressed data ends, so you'll have to dig up a different comparison if we're going to make any headway on this.
jjjewel
Guest
« Reply #10 on: May 20, 2011, 10:39:29 pm »

Quote from: Ryusui on May 20, 2011, 09:04:52 pm

The next compression flag is $BA, or 10111010; we know that the "a" is plaintext (giving us "getOnceInsta"), and the 0 tells us that what comes next is a compression code: F8 1F, or likely 1F F8 (if these two-byte codes are being read in little-endian format). 1F F8 is 0001111111111000  - quite a whopper (and a palindrome to boot).

In the case of F8 1F:

F8 tells you how many bytes (in the decompressed output) you have to read back
1F tells you how many bytes you will copy to the output

For F8, subtract it from FF and add 1. So FF - F8 + 1 = 8. You read back 8 bytes from wherever your last byte of output is.
1F tells you to copy 3 bytes from where you read back. (0F = 2 bytes, 1F = 3 bytes, 2F = 4 bytes, and so on.)
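
Written out as code, the rule above looks like this. It follows the description literally and ignores the low nibble of the second byte, so it is only a sketch for codes that look like F8 1F; the name `read_code_bytewise` is made up:

```python
def read_code_bytewise(first, second):
    """Apply the rule as stated: first byte = how far back, second byte = how many."""
    distance = 0xFF - first + 1          # F8 -> 8
    length = (second >> 4) + 2           # 0F -> 2, 1F -> 3, 2F -> 4
    return distance, length

print(read_code_bytewise(0xF8, 0x1F))
for second in (0x0F, 0x1F, 0x2F):
    print(format(second, "02X"), read_code_bytewise(0xF8, second)[1])
```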


This is what I tried with Chapter_Ep01.pack. (I did it manually so some bytes might be a bit off.)

Ryusui
Guest
« Reply #11 on: May 20, 2011, 10:52:38 pm »

Brilliant. Still not perfect, mind, but I think it's safe to say you've cracked the case.

The value is little-endian: 1FF8. The first nibble (four bits) indicates how many bytes to copy; the rest is a signed twelve-bit value. I.e., FF8 is actually "-8", as in "subtract 8 from the current output address to get the source address". As for why the length value doesn't match up with the actual number of bytes to copy, that's common practice: since the shortest possible match is two bytes (though really, the shortest useful match is three), the length value stored in the compression code usually gets an offset added to it. So you take the 1, add 2 to it, and get 3.

So here's how to read a two-byte compression code!

Step 1. Take the two bytes and swap them around to get the little-endian value.
Step 2. Take the first four bits and add 2. That's the length value.
Step 3. The remaining 12 bits is the signed distance value. Subtract it from $FFF to get the positive equivalent.

Or, for those of a more technical mindset:

Compression Code (Little-Endian)
XXXX YYYY YYYY YYYY
X = Length (add 2 to get actual value)
Y = Distance (signed)

Congratulations! That's all you need to know to write up a decompression program!
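
As a starting point, here is a rough sketch of such a decompressor, not a tested tool. It assumes the input starts at a flag byte (so the leading 8F/8D byte and anything else in front would have to be skipped by the caller), that flag bits are read most significant bit first, and that the 12-bit distance is two's complement, so FF8 means 0x1000 - 0xFF8 = 8 bytes back. The function name `decompress` and the hand-built sample are illustrative only:

```python
def decompress(src):
    """LZ-style decompressor for the layout described in this thread."""
    out = bytearray()
    pos = 0
    while pos < len(src):
        flags = src[pos]
        pos += 1
        for bit in range(7, -1, -1):               # most significant bit first
            if pos >= len(src):
                break
            if (flags >> bit) & 1:                 # 1 = plaintext byte
                out.append(src[pos])
                pos += 1
                continue
            if pos + 1 >= len(src):
                raise ValueError("truncated compression code")
            code = src[pos] | (src[pos + 1] << 8)  # stored low byte first
            pos += 2
            length = (code >> 12) + 2              # top four bits, plus 2
            # 12-bit two's complement; a field of 0 becomes 4096 here, a guess
            distance = 0x1000 - (code & 0x0FFF)
            if distance > len(out):
                raise ValueError("match points before the start of the output")
            for _ in range(length):                # byte-wise so matches may overlap
                out.append(out[-distance])
    return bytes(out)

# Hand-built stream: FF + "getOnceI", then F0 + "nsta" + code F8 1F.
sample = bytes.fromhex("FF 67 65 74 4F 6E 63 65 49 F0 6E 73 74 61 F8 1F")
print(decompress(sample).decode("ascii"))
```

The sketch stops when the input runs out. A real file presumably carries a decompressed size somewhere, and a match that points before the start of the output would suggest that one of these assumptions is wrong.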

Here are a couple more examples:

($B2)($1E)ARIA($F7)($0F)E($BC)($2E)($05)

So the compression codes here are 1EB2, 0FF7, and 2EBC.

1EB2
XYYY

1 + 2 = 3 (length)
EB2 = -14D (distance)

0FF7
XYYY

0 + 2 = 2 (length)
FF7 = -9 (distance)

2EBC
XYYY
2 + 2 = 4 (length)
EBC = -143 (distance)
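
These values can be cross-checked mechanically. The sketch below treats the YYY field as strict 12-bit two's complement, which is an assumption; `decode_code` is a made-up name. Under that reading FF8 and FF7 come out as -8 and -9, while EB2 and EBC come out as -14E and -144, one more than subtracting from $FFF gives. Which reading is right can only be settled against real decompressed data:

```python
def decode_code(first, second):
    """Return (length, signed distance) for a two-byte code as stored in the file."""
    value = first | (second << 8)      # stored low byte first
    length = (value >> 12) + 2         # X nibble plus 2
    field = value & 0x0FFF             # YYY
    distance = field - 0x1000 if field & 0x800 else field   # 12-bit two's complement
    return length, distance

for first, second in ((0xF8, 0x1F), (0xB2, 0x1E), (0xF7, 0x0F), (0xBC, 0x2E)):
    length, distance = decode_code(first, second)
    print(format(first | second << 8, "04X"), length, "-%X" % -distance)
```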

I hope this explains it well enough!
jjjewel
Guest
« Reply #12 on: May 20, 2011, 11:15:02 pm »

^
^
Wow. I never knew the significance of those 2-byte indicators.
There are a few NDS games that use similar, but not exactly the same, compression,
and I've been trying to hack them. Now I'll give them another try.

Thank you so much. I'm glad I dropped by this thread.
(The data format just looked familiar so I gave it a try.
It's different from the games I'm hacking but the concept is very similar.)
Ryusui
Guest
« Reply #13 on: May 20, 2011, 11:51:06 pm »

You get familiar with LZ compression when you've played around with GBA and DS for a while.
Majin3
Guest
« Reply #14 on: May 21, 2011, 10:02:43 am »

Thanks, guys. I didn't understand it completely, though...
The first example (1FF8): Why is it -8? Isn't FFF-FF8=7? And by looking at the compressed file, the distance to "nce" is even 9.
The third example (0FF7): Once again, isn't it -8? But -9 is right here.
And as for the other 2, I don't get where they should be pointing at all. The second one points to B6 0F 8F (should be 00 4B 02) and the fourth one to 75 B7 1F 08 (should be 00 00 00 00).
But I think I got the rest now at least.

jjjewel, are you planning to write a (de)compressor for those "NDS games that use similar compression"?
Or I guess I could ask the writer of DSDecmp or something to implement this one...