RHDN Forum Archive - Bypassing compression?

RHDN Forum Archive

Romhacking

General Romhacking

Bypassing compression?

Pages: 1 [2]

Author

Topic: Bypassing compression? (Read 1 times)

DaMarsMan
Guest

Re: Bypassing compression?

« Reply #15 on: June 02, 2008, 10:03:57 am »

Certain types of compression you can just grab someone elses source and tweak it to your own use.

MathOnNapkins
Guest

Re: Bypassing compression?

« Reply #16 on: June 03, 2008, 10:12:27 am »

Quote from: MathOnNapkins on June 03, 2008, 10:12:27 am

Here is a straightforward compression routine implementation. I can't really say how good it is, as it doesn't really perform any analysis, like tree building.
It may be the case that this type of compression would not benefit significantly. As my test files I used the first 0x10000 bytes of Link to the Past (SNES) and it compressed down to about 0xA000 bytes, so that's about 62.5% compression ratio but I also wouldn't expect the output would be that great for the file I threw at it, b/c it's mostly code. Also threw a mixture of graphics and code at it and it came out roughly the same.

Code:

#include <stdio.h>

unsigned char buf1[0x10000];
unsigned char buf2[0x12000]; // the absolute worst scenario is that the compressed version actually comes out larger and the size will increase be by a factor of 9/8

unsigned short blockSize[0x12000];
unsigned short offset[0x12000];

// This looks similar if not identical to an LZ compression I've seen in Final Fantasy VI...

unsigned int decompress(unsigned int inputLength, unsigned char* readBuf, unsigned char* writeBuf)
{
unsigned int index = 1, j = 0;
unsigned char codeByte = readBuf[0];
unsigned int codeByteBitTest = 7; // a bitmask
unsigned int count = 0;
unsigned int outputLength = 0;

//read length
unsigned char nextByte;
unsigned int len;
unsigned int start_pos;
unsigned int k, oldJ;

while(index < inputLength)
{
// will test 0x80, then 0x40, ... down to 0x01
if((codeByte & (1 << codeByteBitTest)) != 0)
{
nextByte = readBuf[index];
len = nextByte >> 2; // length is stored as the upper 6 bits of the byte
count += len;
index++;

// offset into the read buffer is 10 bits, ranging 0x0000 to 0x03FF
start_pos = ((nextByte & 3) << 8) + readBuf[index];
index++;

//copy bytes
oldJ = j;

// Copy data from previous data that is already in the write buffer using an LZ-like method
for(k = 0; len != 0; k++)
{
writeBuf[j] = writeBuf[(oldJ - start_pos) + k];
j++; len--;
}
}
else
{
// copy without decompressing
writeBuf[j] = readBuf[index];
index++; j++; count++;
}

if(!codeByteBitTest)
{
codeByte = readBuf[index];
index++;

// reset the bitmask
codeByteBitTest = 7;
}
else
codeByteBitTest--;

}

return j;
}

unsigned int compress(unsigned int inputLength, unsigned char* readBuf, unsigned char* writeBuf)
{
// I'm using a different bit mask representation than Euclid did in the decompression routine
unsigned char mask = 0x80;

unsigned char* ctrlByte = NULL;

unsigned int index = 0, j = 0, k = 0;
unsigned short matchLen = 0;

// ------------------------------------------------

// see what the largest number of bytes we can get for this particular location with LZ
for(index = 0; index < inputLength; index++)
{
offset[index] = 0;
blockSize[index] = 0;

for(j = 0; j < i; j++)
{
k = 0;
matchLen = 0;

// the compression format cannot reach back any further than 0x3FF bytes in to the buffer
if((index - j) > 0x3FF)
j = index - 0x3FF;

// the decompression engine is only capable of tracking 6 bits of length for a block
while( (readBuf[j + k] == readBuf[index + k]) && (matchLen < 0x3F) )
{
if( (index + k) >= inputLength )
break;

matchLen++; k++;
}

// new winnar
if((matchLen > blockSize[index]) && (matchLen >= 3))
{
offset[index] = (index - j);
blockSize[index] = matchLen;
}
}
}

ctrlByte = &writeBuf[0];
(*ctrlByte) = 0; // no compression being used at initialization

for(index = 0, k = 1; index < inputLength; index++)
{
if(!mask)
{
mask = 0x80;
ctrlByte = &writeBuf[k++];
}

if(!blockSize[index])
writeBuf[k++] = readBuf[index];
else
{
(*ctrlByte) |= mask; // indicate that this series will use LZ

writeBuf[k++] = (blockSize[index] << 2) | (offset[index] >> 8);
writeBuf[k++] = offset[index] & 0xFF;

index += blockSize[index] - 1;
}

mask >>= 1;
}

return k;
}

//example code that uses the above routines
//note that this is not a self contained program in and of itself
// you'll need a main() function similar to Euclid's original file

FILE* fp = fopen("C:\\\\test.bin","rb");
FILE* wp = fopen("C:\\\\test2.bin","wb");
FILE* dp = fopen("C:\\\\test3.bin","wb");

unsigned int a = 0, b = 0;

// rather than using an explicit value of 0x10000 you could get the filesize
// as long as it's under 0x10000 bytes it should be fine
// test.bin just happened to be 0x10000 bytes so you'll get an error on fread if you change it to load a
// smaller file
fread(buf1,sizeof(unsigned char), 0x10000, fp);
fclose(fp);

a = compress(0x10000, buf1, buf2);

fwrite(buf2,sizeof(unsigned char), a ,wp);
fclose(wp);

b = decompress(a, buf2, buf1);

fwrite(buf1, sizeof(unsigned char), b , dp);
fclose(dp);


« Last Edit: June 16, 2008, 02:13:53 pm by MathOnNapkins »

elixirnova
Guest

Re: Bypassing compression?

« Reply #17 on: June 15, 2008, 02:00:05 am »

Alright so I got the time tonight to read through the program Euclid wrote up as well as his commented CPU code.

I figured out....

The pointer table in the ROM gives the location of the compressed block of graphics followed by 2 "length" bytes that represent the length that the uncompressed data will be, so it knows when to stop the routine.

The first byte in each block of compressed graphics is a "codeByte" as Euclid calls it. This byte represents where a "repeat length" byte followed by an "amount of bytes to repeat" byte can be found.

I also figured out that a new code byte is found after every 7 cycles of data are read. A cycle of data is referring to either a single byte being read or a series of bytes being copied, both of which count as one cycle.

For example a compressed graphics may look like:
codeByte, data, data, data, data, data, repeat length, amount of bytes to repeat, data, codeByte, data, data...

I believe this is correct. I could be a little off right now since it is 1AM...

I have a feeling I'm going to need to look more in depth at some CPU assembly to understand what is done with the codeBytes to derive from them where the repeat lengths are...

It seems like once I figure that out I can somewhat make a re compressor. I am also thinking that since every 7 data cycles a codeByte is found that the maximum length of a data that can be repeated is 6 bytes. So the re compressor would need to look for series of repeated bytes to compress them.

I hope I am moving in the right direction! Now that it is 2AM I better get some sleep... :police: . If anyone has any thoughts on what I figured out please let me know if I'm going about this correctly!

Here is Euclid's decompression routine commented a little more in depth so I could understand it better..

Code:

// The first byte of he compressed data is a "codeByte." A new codebyte is read every 7 cycles.
// codeByte values are used to point out where len values can be found that signify bytes to be copied.

// The next bytes are actual data bytes.
// Whenever one or more bytes can be repeated a "len" byte can be found followed by a byte that tells
// how many bytes back the block of bytes to be copied is. The block of data is then copied for the set "len" value

// codeByteBitTest is used to keep track of the fact that a new codebyte is to
// be read from the compressed block after 7 cycles of (reading a byte)/(copying bytes)

#include "stdafx.h"

#include <stdio.h>

unsigned char readBuf[0xFFFF]; //
unsigned char writeBuf[0xFFFF];
unsigned int length = 0x3C0; // Length is based on pointer table's data

void decompress()
{
unsigned int i = 0, j = 0; // i is readbuffer location, j is writebuffer location
unsigned char codeByte = readBuf[i]; // Read first byte of compressed block
unsigned int codeByteBitTest = 7; // intial value of codeByteBitTest is always 7
unsigned int count = 0; // Count initially starts at 0
i++;
while (count < length) // run decompression routine until counter reaches "length"
{

if ((codeByte & (1 << codeByteBitTest)) != 0) // If (codeByteBitTest bit shifted left 1) & codeByte != 0
{ // then find the read length and copy bytes!
//read length
unsigned char nextByte = readBuf[i]; // read initial byte that has to do with read length
unsigned int len = nextByte /4; // len is the amount of bytes to be written
unsigned int start_pos;
unsigned int k, oldJ;
i++;
count += len;
start_pos = (nextByte & 3) * 256 + readBuf[i];
i++;
//copy bytes
oldJ = j;
for (k = 0; len != 0; k++) // stop copying bytes when (initial byte / 4) = 0
{
writeBuf[j] = writeBuf[(oldJ - start_pos) + k]; // read a series of previous bytes and copy them until
j++; // "len" length is met
len--;
}
}
else // If (codeByteBitTest bit shifted left 1) & codeByte == 0
{
//read one byte
writeBuf[j] = readBuf[i];
i++;
j++;
count++;
}
if (codeByteBitTest == 0) // When the codeByteBitTest reaches zero it is reset to 8
{
codeByte = readBuf[i]; // The next byte in the compressed data is the new codeByte
i++;
codeByteBitTest = 8;
}
codeByteBitTest--; // After each byte or copied bytes are read codeByteBitTest is decreased...
}
}

int _tmain(int argc, _TCHAR* argv[])
{
FILE* fp = fopen("mmx3.smc","rb");
FILE* wp = fopen("out.bin","w");
fseek(fp,0x200000,0);
fread(readBuf,sizeof(unsigned char),0xFFFF,fp);
fclose(fp);
decompress();
fwrite(writeBuf,sizeof(unsigned char),0xFFFF,wp);
fclose(wp);

return 0;
}

EDIT:
Just noticed mathonNapkins wrote a compression routine. Thanks I'll check it out!
On a side note as I was looking through some of the data it was generally getting bigger due to all the codebytes most of the time! I suppose it pays off in the end with those kind of compression percentages that were mentioned! I am new to compressions though so what do I know!

Also what is this supposed to be? I notice that readBuf does not have an array position!

Code:

if(!blockSize)
writeBuf[k++] = readBuf;


« Last Edit: June 15, 2008, 11:31:44 am by elixirnova »

Tauwasser
Guest

Re: Bypassing compression?

« Reply #18 on: June 15, 2008, 12:12:14 pm »

What you described is a standard lzxx adaption. Also, you are probably off by 1 cycle (the 0th or 7th) that you just didn`t count as one.
Also, you might be interested in unlz gba as it does automatically search the rom for compressed data and shoes them to you as gfx (which is what you want). It has great recompression algorithms and is fairly easy to use.
Download...er...probably on this site Cheesy

?

Also, if this won`t do, notice how lzxx adaptions can vary only in few things:
The read direction of the bit byte (from 7 to 0 or vice versa), the meaning of bit "1" and "0" in the bit byte, the lengths of the reuse blocks (usually add + 3 for gba games), the amount of bytes to go back in already decompressed stuff (usually 3...4 bytes in gba). The read direction of those bytes and that`s it. It`s not a whole lot of differences. Also, I think unlz can handle most of this.

cYa,

Tauwasser

elixirnova
Guest

Re: Bypassing compression?

« Reply #19 on: June 15, 2008, 09:15:59 pm »

So I think I am correct right now in thinking that the "length" found in the pointer table doesn't need to be changed if I just plan on editing graphics and not adding new graphics.

Ho snap!

I almost got off Scott free! I managed to decompress some data with the routine. Looked fine.
I tried to recompress and reinsert the decompressed graphic and it was incorrect by a couple of bytes spread throughout. When running the game I can see that the graphic is offset a little bit.

I then tried decompressing the recompressed/reinserted graphic and sure enough at 0x1BA it reads:
06 0D 0D 0A F6
where as the origional reads
06 0D 0A F6

There is also another instance of this problem at 0x212
The recompressed graphic
01 0D 0D 0A 01
The origional
01 0D 0A 01

So it seems that the recompressor has duplicated 0D twice when it didn't need to be inserted twice! Guess I'll have to review the code for any errors to find the root!

Edit:
Looking through so far I have found in the ROM with the recompressed graphic
FC FC 00 06 06 0D 0A F6 FB 07
and in the decompressed origional graphics the data looks like
FC FC 06 06 0D 0A F6 FB 07

so 00 must be a "codebyte" but this does not explain why 0D is inserted one too many times.

In the recompressed rom it obviously is not using the compression and is just being read as 06 0D 0A probably would not occur in that order if there were the other compression bytes following it. Also this has to be the root of the problem since all other bytes in the file up until these occurrences match perfectly I believe...

I did also try replacing the graphic with some color bars that displayed just fine in the game so I'm guessing it is definitely a problem with the codebyte/recompression routine. I guess I'll need to wait till I get more time to read through and figure it out! This is what I get for not writing my own stuff, but oh well!


« Last Edit: June 15, 2008, 11:42:13 pm by elixirnova »

Tauwasser
Guest

Re: Bypassing compression?

« Reply #20 on: June 16, 2008, 12:04:00 am »

You approach seems meaningless, as your data might start somewhere else (because re-use codes and byte-copy can look identical in some cases). You`ll have to start with the last bit byte and go from there and have the original data matched to that section of decompression. What you did now was work with two different versions and comparing them, which would lead you to think that 00 is the bitcode used in the original too (hence producting the said bytes). However, you might find that in the original compressed data it compressed differently and therefore works differently and the compressor didn`t just write a byte too many, but actually messed up elsewhere so everything past that point is moot.

cYa,

Tauwasser

MathOnNapkins
Guest

Re: Bypassing compression?

« Reply #21 on: June 16, 2008, 02:12:41 pm »

All right...

Quote from: elixirnova

Also what is this supposed to be? I notice that readBuf does not have an array position!

I noticed that too... apparently [ i ] (BBCode) as an italics marker is not disabled in a code tag >_<. Whose brilliant idea was that?
My post is fixed now with a variable called index instead of a variable called "i".

Quote from: elixirnova

I then tried decompressing the recompressed/reinserted graphic and sure enough at 0x1BA it reads:
06 0D 0D 0A F6
where as the origional reads
06 0D 0A F6

This is b/c your file handle is being opened in text mode. Change

Code:

FILE* wp = fopen("out.bin","w");

Code:

FILE* wp = fopen("out.bin","wb""";

And you'll stop getting that. Basically it's because of a difference between how the C library handles text files in memory versus how they're stored on disk on various operating systems. If you want to read and write binary data, best to throw that "b" in there or you'll have problems.

And you said you're having trouble with my recompression routine or one from somewhere else? I had one issue with it but that was just one file that for whatever reason would compress larger than uncompressed, but then when I went to decompress it the file size was wrong. But for files that actually do compress I haven't had any problems so far.


« Last Edit: June 16, 2008, 02:22:22 pm by MathOnNapkins »

elixirnova
Guest

Re: Bypassing compression?

« Reply #22 on: June 16, 2008, 04:02:10 pm »

Thanks a bunch for the explanation! I didn't even look over the file write/read part.

Wow.... I'm so excited! I'm getting close to having enough to make a basic graphic level editor!


« Last Edit: June 18, 2008, 12:17:20 am by elixirnova »

Pages: 1 [2]