RHDN Forum Archive - How to determine the characters used in a text document

Topic: How to determine the characters used in a text document (Read 789 times)

javiskefka
Guest

How to determine the characters used in a text document

« on: September 03, 2007, 01:21:19 pm »

So I'm working on an English-to-Korean translation for a Windows 9x era PC game. We've only made a small dent in the script of the game, but if we want any of our translation to show up in-game, we will need an 8-bit Korean font. Someone has already developed a tool for making a font from a bitmap, but in order to generate that bitmap, I'll need something to determine the 233 glyphs to include. Is there any ready-made tool that can be fed a batch of 40+ files and tell me what glyphs are used in them? Or maybe someone would be interested in coding this?

creaothceann
Guest

Re: How to determine the characters used in a text document

« Reply #1 on: September 03, 2007, 02:33:09 pm »

So this tool would only have to go through each file, noting what bytes are used? ... Sounds easy enough.

javiskefka
Guest

Re: How to determine the characters used in a text document

« Reply #2 on: September 03, 2007, 03:01:07 pm »

The input will be a batch of plain text files encoded in UTF-16 with the .STR extension. The output that I envision is another text file that lists the characters that appeared in the files, and the number of characters that are listed. What I want to produce eventually is something similar to this, from which someone else generated a bitmap to be used for a Japanese font:

Code:

　！”＃＄％＆’（）＊＋，－．／
０１２３４５６７８９：；＜＝＞？
＠な゛ーすねみけほにめこれまのむ
゜きるそとゆ、を。ぬ・「￥」＾＿
‘あへしたえふつはいせからもんお
ひくろさちうりわやよて｛｜｝～　
アァぁイィぃウゥぅエェぇオォぉ
カガがキギぎクグぐケゲげコゴごン
サザざシジじスズずセゼぜソゾぞッ
タダだチヂぢツヅづテデでトドどっ
ナニヌネノハバパばぱヒビピびぴ
フブプぶぷヘベペべぺホボポぼぽ
マミムメモラリルレロヤャゃユュゅ
ヨョょワヮゎヲヰゐヱゑヴヵヶ　

And this is a picture of the original font used in the game. I can have that many characters. The pink squares are for the font generating program to recognize each character.


« Last Edit: September 03, 2007, 03:12:52 pm by javiskefka »

creaothceann
Guest

Re: How to determine the characters used in a text document

« Reply #3 on: September 03, 2007, 06:30:25 pm »

Try this program. It needs the Delphi 10 BPLs (see my vSNES thread).

The data stream is not checked for byte-order marks yet, and the output could be written to an HTML file. The character codes are already in that format though, so you could easily write a small test file and display the characters with a browser.
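For anyone without the Delphi runtime, the same idea can be sketched in a few lines of Python. This is not the program linked above, only an illustration under assumptions: the names `collect_characters` and `report` and the folder name in the usage note are made up, and the input is assumed to be UTF-16 text in .STR files as described earlier in the thread. Python's `utf-16` codec reads a leading byte-order mark when one is present, so no separate BOM check is needed here.

```python
from collections import Counter
from pathlib import Path


def collect_characters(paths):
    """Count every character found in a batch of UTF-16 text files."""
    counts = Counter()
    for path in paths:
        # The "utf-16" codec uses a leading byte-order mark, if present,
        # to pick the byte order, and leaves it out of the decoded text.
        text = Path(path).read_text(encoding="utf-16")
        counts.update(text)
    return counts


def report(folder):
    """Build a summary for all .STR files in a folder (hypothetical layout)."""
    files = sorted(Path(folder).glob("*.STR"))
    counts = collect_characters(files)
    lines = [
        "number of files : {}".format(len(files)),
        "number of scanned characters : {}".format(sum(counts.values())),
        "number of unique characters : {}".format(len(counts)),
        "characters:",
        "".join(sorted(counts)),
    ]
    return "\n".join(lines)
```

Something like `print(report("scripts"))` would then print the counts followed by the unique characters in code point order, where `scripts` stands in for wherever the script files actually live.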

javiskefka
Guest

Re: How to determine the characters used in a text document

« Reply #4 on: September 03, 2007, 10:32:37 pm »

This looks like it can perform just what I wanted.

Here's a small section of the output from a test document:

Code:

number of files : 1
number of scanned characters : 1991
number of unique characters : 478

characters:
&#0A0D;
&#0D22;
&#0D50;
&#0D65;
&#0D67;
&#0D6C;
&#0D6E;
&#0D72;
&#0D73;
&#0D74;
&#0D79;
&#2020;
&#2022;

I only know very basic html, so what will I need to add to get those characters to render?

creaothceann
Guest

Re: How to determine the characters used in a text document

« Reply #5 on: September 04, 2007, 07:04:49 am »

Please download the program again - turns out I forgot the "x" after the "#" (since the codepoints are in hexadecimal).
It should also handle byte-order marks now.

For displaying the characters in a browser you just have to add a basic HTML structure:

Code:

EDIT: And of course you need a font that can display these characters. Maybe a special statement like "<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">" is also required, dunno.
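As a rough illustration of the kind of test page meant here, the following Python sketch writes the characters as hexadecimal character references inside a bare HTML document. It is a guess at the general shape, not the markup originally posted: the function names `to_entity` and `build_test_page` are hypothetical, and the `charset` declaration is an assumption. The "x" matters because a browser reads `&#2020;` as decimal 2020, while `&#x2020;` refers to code point U+2020.

```python
def to_entity(ch):
    """Format one character as a hexadecimal HTML character reference."""
    return "&#x{:04X};".format(ord(ch))


def build_test_page(chars):
    """Wrap the character references in a bare-bones HTML document."""
    # Deduplicate and sort so each character appears once, in code point order.
    body = " ".join(to_entity(ch) for ch in sorted(set(chars)))
    return (
        "<html>\n"
        "<head><meta charset=\"utf-8\"><title>Character test</title></head>\n"
        "<body>\n" + body + "\n</body>\n"
        "</html>\n"
    )
```

Saving the returned string to a file with an .html extension and opening it in a browser should show the glyphs, provided an installed font covers them.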

javiskefka
Guest

Re: How to determine the characters used in a text document

« Reply #6 on: September 04, 2007, 10:44:32 am »

Of course I knew to add html and body tags :P

Thanks for making that adjustment so quickly!

Here's sample output from another test document after I rendered it with a web browser:

Code:

number of files : 1 number of scanned characters : 2227 number of unique characters : 215 characters:
! $ & ' ( ) * , . / 0 1 2 3 4 6 : A B C D E F G H I J L M N O P R S T U V W _ a b c d e f g h i j k l m n o p r s t u v w
x y 가 각 게 겠 결 경 고 과 관 권 그 기 나 내 뉴 는 니 닌 다 대 더 도 동 되 뒤 드 든 디 라 래 러 려 로 록 료 르 를 리 링 만 말 매
메 면 모 목 무 미 밀 반 버 변 보 브 빠 사 상 생 서 선 설 세 소 송 수 슈 스 슴 습 시 식 실 십 아 안 않 알 앞 액 야 어 언 업 없 에
엔 연 예 오 와 완 용 움 원 으 을 의 이 인 일 임 있 자 작 장 재 저 적 전 정 종 지 진 차 찾 쳐 취 치 컴 켓 콘 콤 크 타 택 터 텍 텐
템 트 파 패 퍼 페 포 폴 퓨 하 한 할 함 합 해 행 현 형 화 확 활

Works like a charm!

Now to just get through the 540 kb of script with my other translator.

edit: oops, I added some line breaks


« Last Edit: September 04, 2007, 10:53:53 am by javiskefka »

HyperHacker
Guest

Re: How to determine the characters used in a text document

« Reply #7 on: September 07, 2007, 05:49:44 pm »

Here's a hint: you don't need those pink squares. Put the font all on one line, and space the characters one pixel apart. Put a dot in the topmost pixel of that space. If you want variable height, put a second dot in the row containing the lowest pixel.
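To make the layout concrete, here is a small Python sketch of how a loader might find glyph boundaries from those dots. It rests on assumptions not stated in the post: the top pixel row is given as a list with a truthy value wherever a separator dot sits, glyph pixels never use the marker colour in that row, and the name `glyph_spans` is made up for illustration.

```python
def glyph_spans(top_row):
    """Return (start, end) column ranges of glyphs, with end exclusive.

    top_row holds one entry per pixel column: truthy where the topmost
    pixel is a separator dot, falsy everywhere else.
    """
    spans = []
    start = 0
    for x, is_marker in enumerate(top_row):
        if is_marker:
            # A dot closes the glyph that began at `start`, if it has any width.
            if x > start:
                spans.append((start, x))
            start = x + 1
    # The last glyph runs to the right edge when no dot follows it.
    if start < len(top_row):
        spans.append((start, len(top_row)))
    return spans
```

For a row like `[0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0]` this yields three glyphs of widths 3, 2 and 4, which is how a one-line strip can carry variable-width characters without boxes around them.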

javiskefka
Guest

Re: How to determine the characters used in a text document

« Reply #8 on: September 09, 2007, 01:29:38 pm »

Hmm, I don't know what to say about that, besides that the utility specifies a pink box around each character with a one-pixel black space between each. I think the placement in rows is just for easier human readability, because it scans from left to right and top to bottom.
