RHDN Forum Archive
Topic: How to determine the characters used in a text document
javiskefka
Guest
« on: September 03, 2007, 01:21:19 pm »

So I'm working on an English-to-Korean translation for a Windows 9x-era PC game.  We've only made a small dent in the script of the game, but if we want any of our translation to show up in-game, we will need an 8-bit Korean font.  Someone has already developed a tool for making a font from a bitmap, but in order to generate that bitmap, I'll need something to determine the 233 glyphs to include.  Is there any ready-made tool that can be fed a batch of 40+ files and tell me what glyphs are used in them?  Or maybe someone would be interested in coding this?
creaothceann
Guest
« Reply #1 on: September 03, 2007, 02:33:09 pm »

So this tool would only have to go through each file, noting what bytes are used? ... Sounds easy enough.
javiskefka
Guest
« Reply #2 on: September 03, 2007, 03:01:07 pm »

The input will be a batch of plain text files encoded in UTF-16 with the .STR extension.  The output that I envision is another text file that lists each character that appeared in the files, along with a count of how many characters are listed.  What I want to produce eventually is something similar to this, from which someone else generated a bitmap to be used for a Japanese font:

Code:
 !”#$%&’()*+,-./
0123456789:;<=>?
@な゛ーすねみけほにめこれまのむ
゜きるそとゆ、を。ぬ・「¥」^_
‘あへしたえふつはいせからもんお
ひくろさちうりわやよて{|}~ 
アァぁイィぃウゥぅエェぇオォぉ
カガがキギぎクグぐケゲげコゴごン
サザざシジじスズずセゼぜソゾぞッ
タダだチヂぢツヅづテデでトドどっ
ナニヌネノハバパばぱヒビピびぴ
フブプぶぷヘベペべぺホボポぼぽ
マミムメモラリルレロヤャゃユュゅ
ヨョょワヮゎヲヰゐヱゑヴヵヶ 


And this is a picture of the original font used in the game.  I can have that many characters.  The pink squares are for the font generating program to recognize each character.
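For illustration, a chart like the Japanese one above could be laid out from a list of characters roughly as follows. This is only a sketch: the function name `chart_rows`, the 16-per-row width (read off the sample chart), and the code-point sort order are assumptions, and the font tool may expect a different arrangement. The 233 limit is the glyph count mentioned earlier in the thread.

```python
def chart_rows(chars, per_row=16, limit=233):
    """Arrange the unique characters of `chars` into fixed-width rows.

    per_row=16 mirrors the sample chart; limit=233 is the glyph budget
    mentioned in the thread. Both are assumptions about what the font
    tool accepts.
    """
    unique = sorted(set(chars))  # sorted by code point
    if len(unique) > limit:
        raise ValueError(f"{len(unique)} glyphs needed, only {limit} available")
    return ["".join(unique[i:i + per_row]) for i in range(0, len(unique), per_row)]
```

Calling `chart_rows(text)` on the combined script text would give the rows to draw, or raise an error if the script needs more glyphs than the font has slots.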


« Last Edit: September 03, 2007, 03:12:52 pm by javiskefka »
creaothceann
Guest
« Reply #3 on: September 03, 2007, 06:30:25 pm »

Try this program. It needs the Delphi 10 BPLs (see my vSNES thread).

The data stream is not checked for byte-order marks yet, and the output could be written to an HTML file. The character codes are already in that format though, so you could easily write a small test file and display the characters with a browser.
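For readers without the Delphi runtime, the same idea can be sketched in Python. This is not the program linked above, just a minimal illustration of the approach described in the thread: read each UTF-16 file, honour a byte-order mark if present, and tally the characters. The function names are hypothetical, and treating BOM-less files as little-endian is an assumption.

```python
import codecs
from collections import Counter
from pathlib import Path


def decode_utf16(data: bytes) -> str:
    """Decode UTF-16 bytes, honouring a byte-order mark if one is present."""
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    # Assumption: files without a BOM are little-endian (common on Windows).
    return data.decode("utf-16-le")


def count_characters(paths):
    """Return a Counter of every character found in the given files."""
    counts = Counter()
    for path in paths:
        counts.update(decode_utf16(Path(path).read_bytes()))
    return counts


def report(counts, n_files):
    """Format a summary followed by one hex character reference per line."""
    lines = [
        f"number of files              :  {n_files}",
        f"number of scanned characters :  {sum(counts.values())}",
        f"number of unique  characters :  {len(counts)}",
        "",
        "characters:",
    ]
    lines += [f"&#x{ord(ch):04X};" for ch in sorted(counts)]
    return "\n".join(lines)
```

With the script files in a hypothetical folder named `script`, something like `files = sorted(Path("script").glob("*.STR"))` followed by `print(report(count_characters(files), len(files)))` would produce the listing.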
javiskefka
Guest
« Reply #4 on: September 03, 2007, 10:32:37 pm »

This looks like it can perform just what I wanted.

Here's a small section of the output from a test document:

Code:
number of files              :  1
number of scanned characters :  1991
number of unique  characters :  478

characters:
&#0A0D;
&#0D22;
&#0D50;
&#0D65;
&#0D67;
&#0D6C;
&#0D6E;
&#0D72;
&#0D73;
&#0D74;
&#0D79;
&#2020;
&#2022;

I only know very basic html, so what will I need to add to get those characters to render?
creaothceann
Guest
« Reply #5 on: September 04, 2007, 07:04:49 am »

Please download the program again - turns out I forgot the "x" after the "#" (since the codepoints are in hexadecimal).
It should also handle byte-order marks now.

For displaying the characters in a browser you just have to add a basic HTML structure:

Code:
<html>
<body>
&#x0A0D;
&#x0D22;
&#x0D50;
&#x0D65;
&#x0D67;
&#x0D6C;
&#x0D6E;
&#x0D72;
&#x0D73;
&#x0D74;
&#x0D79;
&#x2020;
&#x2022;
</body>
</html>

EDIT: And of course you need a font that can display these characters. Maybe a special statement like "<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">" is also required, dunno.
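The missing "x" matters because a reference without it is read as decimal: `&#2020;` means code point 2020 (U+07E4), while `&#x2020;` means U+2020. A small Python sketch of generating such a test page is below. The helper names are hypothetical, and the `meta charset` line is an addition not present in the page above; numeric references should resolve the same way without it.

```python
import html


def to_reference(ch: str) -> str:
    """Return the hexadecimal numeric character reference for one character."""
    return f"&#x{ord(ch):04X};"


def build_page(chars) -> str:
    """Wrap one reference per line in a minimal HTML page."""
    body = "\n".join(to_reference(ch) for ch in chars)
    return (
        "<html>\n"
        '<head><meta charset="utf-8"></head>\n'
        "<body>\n"
        f"{body}\n"
        "</body>\n"
        "</html>\n"
    )


# html.unescape resolves references the way a browser would:
hex_form = html.unescape("&#x2020;")  # hexadecimal: U+2020
dec_form = html.unescape("&#2020;")   # decimal 2020: U+07E4
```

Writing `build_page(sorted(counts))` to a `.html` file and opening it in a browser should show the glyphs, provided an installed font covers them.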
javiskefka
Guest
« Reply #6 on: September 04, 2007, 10:44:32 am »

Of course I knew to add html and body tags :P

Thanks for making that adjustment so quickly!

Here's sample output from another test document after I rendered it with a web browser:

Code:
number of files : 1 number of scanned characters : 2227 number of unique characters : 215 characters:
 ! $ & ' ( ) * , . / 0 1 2 3 4 6 : A B C D E F G H I J L M N O P R S T U V W _ a b c d e f g h i j k l m n o p r s t u v w
x y 가 각 게 겠 결 경 고 과 관 권 그 기 나 내 뉴 는 니 닌 다 대 더 도 동 되 뒤 드 든 디 라 래 러 려 로 록 료 르 를 리 링 만 말 매
메 면 모 목 무 미 밀 반 버 변 보 브 빠 사 상 생 서 선 설 세 소 송 수 슈 스 슴 습 시 식 실 십 아 안 않 알 앞 액 야 어 언 업 없 에
엔 연 예 오 와 완 용 움 원 으 을 의 이 인 일 임 있 자 작 장 재 저 적 전 정 종 지 진 차 찾 쳐 취 치 컴 켓 콘 콤 크 타 택 터 텍 텐
템 트 파 패 퍼 페 포 폴 퓨 하 한 할 함 합 해 행 현 형 화 확 활

Works like a charm!

Now to just get through the 540 kb of script with my other translator.

edit: oops, I added some line breaks
« Last Edit: September 04, 2007, 10:53:53 am by javiskefka »
HyperHacker
Guest
« Reply #7 on: September 07, 2007, 05:49:44 pm »

Here's a hint: you don't need those pink squares. Put the font all on one line, and space the characters one pixel apart. Put a dot in the topmost pixel of that space. If you want variable height, put a second dot in the row containing the lowest pixel.
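One way to read this scheme, sketched in Python: a font reader scans the top row of the strip, treats each marker dot as a separator, and takes the columns between markers as one glyph. The name `glyph_spans` and the input format (a list of booleans, True where the top pixel holds a marker dot) are assumptions for illustration, as is the idea that markers use a colour no glyph uses. The variable-height second dot is not handled here.

```python
def glyph_spans(top_row):
    """Split a one-line font strip into glyph column ranges.

    top_row[x] is True where the topmost pixel of column x holds a marker
    dot. Returns (start, end) pairs with end exclusive, one per glyph.
    """
    spans = []
    start = 0
    for x, is_marker in enumerate(top_row):
        if is_marker:
            if x > start:  # skip empty gaps between adjacent markers
                spans.append((start, x))
            start = x + 1
    if start < len(top_row):  # glyph after the last marker, if any
        spans.append((start, len(top_row)))
    return spans
```

A side effect of this layout is that glyph widths fall out of the marker positions, so proportional fonts need no extra data.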
javiskefka
Guest
« Reply #8 on: September 09, 2007, 01:29:38 pm »

Hmm, I don't know what to say about that, other than that the utility specifies a pink box around each character with a one-pixel black space between them.  I think the placement in rows is just to make it easier for humans to read, because it scans from left to right and top to bottom.