Author Topic: Google Translate updated  (Read 2 times)
BRPXQZME
Guest
« Reply #15 on: November 24, 2009, 09:50:09 pm »

Quote from: FinS on November 24, 2009, 09:04:06 pm
"You can release Japanese?" sounds strange but I wonder if that is the literal interpretation of it.
It’s confusing 話す with 離す. One of them means “speak”. One of them means “let go”.

☆HA☆NA☆SE☆
KaioShin
Guest
« Reply #16 on: November 25, 2009, 03:23:58 am »

Systran has been around for ages already; it's not a new engine by any means. As far as I know it's also the same translation engine that ATLAS is built on (the translation software, not the script inserter...). Judging from my experiences with AGTH, it's definitely not near the quality required to use it for anything but looking up a few words. It's better than Babelfish though.
BRPXQZME
Guest
« Reply #17 on: November 25, 2009, 03:27:20 am »

Babelfish is Systran-based.
KaioShin
Guest
« Reply #18 on: November 25, 2009, 03:29:15 am »

Quote from: BRPXQZME on November 25, 2009, 03:27:20 am
Babelfish is Systran-based.

 :banghead:

 Grin

Then I mixed things up. Do you know which engine Atlas is based on? I'm pretty sure it's something different then.
Orusaka
Guest
« Reply #19 on: November 25, 2009, 03:41:27 am »

Quote from: BRPXQZME on November 24, 2009, 09:50:09 pm
Quote from: FinS on November 24, 2009, 09:04:06 pm
"You can release Japanese?" sounds strange but I wonder if that is the literal interpretation of it.
It’s confusing 話す with 離す. One of them means “speak”. One of them means “let go”.

☆HA☆NA☆SE☆

I think it might be confusing it with 放す, actually. Either way, it doesn't make much sense. I understand that context is hard to program for, but if it had to pick one without considering it, why not pick the alternative with the highest frequency rate? 話す is more frequent than 放す. It wouldn't always net results, but if it's going to ignore context, it would probably be the best option.
BRPXQZME
Guest
« Reply #20 on: November 25, 2009, 03:54:45 am »

eh, that’ll show me to trust kotoeri again... *mumble mumble*

Well, it just sucks, and that’s that. But ultimately, kana vs. kanji is a much harder thing than mere statistics. Computers aren’t so great at figuring out where one word stops and another begins in Japanese; humans can only do it so well because we’re cacophonologists by nature.
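
To make the word-boundary point concrete, here is a minimal sketch, assuming a toy lexicon. The kana string, the word list, and the function name are illustrative and do not come from the thread or from any real engine. It simply enumerates every way to split an unspaced string into known words:

```python
# Toy lexicon (an assumption for illustration; a real system has a huge dictionary).
LEXICON = {"くるま", "で", "まつ", "くる", "まで"}

def segmentations(text, lexicon):
    """Return every way to split `text` into words found in `lexicon`."""
    if not text:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in lexicon:
            # Keep this prefix and segment the remainder recursively.
            for rest in segmentations(text[end:], lexicon):
                results.append([head] + rest)
    return results

for words in segmentations("くるまでまつ", LEXICON):
    print(" / ".join(words))
```

Even this short string has two valid splits, roughly "wait until (someone) comes" and "wait in the car", so a program still has to rank the alternatives somehow after it has found them.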

Quote from: KaioShin on November 25, 2009, 03:29:15 am
Quote from: BRPXQZME on November 25, 2009, 03:27:20 am
Babelfish is Systran-based.

 :banghead:

 Grin

Then I mixed things up. Do you know which engine Atlas is based on? I'm pretty sure it's something different then.
Atlas (or perhaps more properly “Atlas II”) is its own system. What you can get today is fundamentally the same thing that went to market in 1982, except now you get the benefit of a couple decades’ worth of tinkering and UI changes, as well as some big-ass dictionaries.

Atlas is rule-based, rather than statistical.
DarthNemesis
Guest
« Reply #21 on: November 25, 2009, 06:34:56 am »

Edit: whoops, didn't see that there was a second page.
FinS
Guest
« Reply #22 on: November 25, 2009, 08:06:39 am »

I double checked it with RikaiChan which is a dictionary plugin for Firefox, but you probably know that since you all seem pretty familiar with this stuff. It also picks up the "to release" definition from  放す and  はなす but I can see that 放 is not even in the sentence.  But the next definition is "to speak" which it picks up from 語す and  はなす and I can see those are all in the sentence but in different parts.

Quote from: BRPXQZME
Computers aren’t so great at figuring out where one word stops and another begins in Japanese; humans can only do it so well because we’re cacophonologists by nature.
It's too bad words or even phrases are not separated in writing.  That surely would make it easier, although I'm guessing the sentence becomes too complex in some situations, like in this example.
Orusaka
Guest
« Reply #23 on: November 25, 2009, 10:08:08 am »

Quote from: FinS on November 25, 2009, 08:06:39 am
I double checked it with RikaiChan which is a dictionary plugin for Firefox, but you probably know that since you all seem pretty familiar with this stuff. It also picks up the "to release" definition from  放す and  はなす but I can see that 放 is not even in the sentence.  But the next definition is "to speak" which it picks up from 語す and  はなす and I can see those are all in the sentence but in different parts.

Quote from: BRPXQZME
Computers aren’t so great at figuring out where one word stops and another begins in Japanese; humans can only do it so well because we’re cacophonologists by nature.
It's too bad words or even phrases are not separated in writing.  That surely would make it easier, although I'm guessing the sentence becomes too complex in some situations, like in this example.

I can't speak on how RikaiChan works, as I use a different solution. I can, however, explain what happened. It had nothing to do with word boundaries. It's true that those would be hard to program a computer to find, but that's not where it failed. It failed because it couldn't understand the context of the sentence, which is the real trick to getting automated translation to work.

The program had only はなす to go on. All the program knows is that it has to be either:

1. 話す - to speak
2. 放す - to separate, to set free
3. 離す - to part, to divide, to separate

All three of those are homophones (words pronounced the same way, but with different meanings). Since the program doesn't know which one is the correct one, it has to guess. Now, you and I will be able to do that based on the context. When we see "Japanese", we know it must refer to speaking, and not separation etc.

This isn't a problem that's exclusive to Japanese, though it is perhaps more prominent there. Automatic translation into other languages runs into problems with homophones as well.
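
A rough sketch of what "using the context" could look like in its crudest form, assuming a hand-made table of cue words. The table, the cue words, and the function name are all hypothetical; no real translation engine is claimed to work this way:

```python
# Hypothetical cue words that tend to appear near each spelling of はなす.
CONTEXT_CUES = {
    "話す": {"日本語", "英語", "言葉"},  # languages, words -> "to speak"
    "放す": {"鳥", "魚"},                # animals -> "to set free"
    "離す": {"手", "目"},                # hand, eye -> "to part, to let go"
}

def guess_by_context(sentence, candidates):
    """Pick the candidate whose cue words occur most often in the sentence.

    Returns None when no cue word is present, i.e. the context gives no hint.
    """
    scores = {
        word: sum(cue in sentence for cue in CONTEXT_CUES.get(word, ()))
        for word in candidates
    }
    best = max(candidates, key=lambda word: scores[word])
    return best if scores[best] > 0 else None

print(guess_by_context("日本語、はなせる?", ["話す", "放す", "離す"]))
```

With 日本語 in the sentence this picks 話す. With no cue word at all it returns None, which is the case where some other fallback is needed.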

My point, however, was merely that, if you are not going to attempt the daunting task of programming your software to understand context, or rather to fake it, it would probably be better to have it guess the statistically most likely candidate, which in this case would be "to speak". The three possibilities rank 151, 250 and 578 respectively. Those are actual frequency ranks based on newspaper occurrence, meaning that the kanji in 話す is the 151st most common kanji in Japanese newspapers, so lower numbers are "better". (Granted, my numbers are more than a few years old, but there is no reason to assume they have changed.)
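
The frequency fallback described above can be sketched in a few lines. The ranks are the ones quoted in this post; the homophone table and the function name are assumptions made up for the example:

```python
# Newspaper frequency ranks quoted in the post (lower rank = more common).
# These are kanji ranks, used here as a rough stand-in for word frequency.
KANJI_RANK = {"話す": 151, "放す": 250, "離す": 578}

# Hypothetical lookup table: kana reading -> possible kanji spellings.
HOMOPHONES = {"はなす": ["話す", "放す", "離す"]}

def guess_by_frequency(reading):
    """Return the most frequent spelling for a reading, ignoring context.

    Readings missing from the table are returned unchanged.
    """
    candidates = HOMOPHONES.get(reading)
    if not candidates:
        return reading
    # Unranked candidates sort last by treating their rank as infinite.
    return min(candidates, key=lambda word: KANJI_RANK.get(word, float("inf")))

print(guess_by_frequency("はなす"))
```

This always answers 話す for はなす, so it will be wrong whenever one of the rarer verbs was meant. It is only the "best guess without context" option, not a replacement for looking at the rest of the sentence.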
Moulinoski
Guest
« Reply #24 on: November 25, 2009, 10:35:08 am »

Hmm, now I'd like to know what are the most used kanji, period.
DarknessSavior
Guest
« Reply #25 on: November 25, 2009, 11:09:41 am »

Quote from: Garoth Moulinoski on November 25, 2009, 10:35:08 am
Hmm, now I'd like to know what are the most used kanji, period.

http://www.kanji-a-day.com/100kanji.php

~DS
BRPXQZME
Guest
« Reply #26 on: November 25, 2009, 01:45:57 pm »

Quote from: Orusaka on November 25, 2009, 10:08:08 am
I can't speak on how RikaiChan works, as I use a different solution. I can, however, explain what happened. It had nothing to do with word boundaries. It's true that those would be hard to program a computer to find, but that's not where it failed. It failed because it couldn't understand the context of the sentence, which is the real trick to getting automated translation to work.
You overestimate SYSTRAN’s ability to tell word boundaries in Japanese, though.

It didn’t just fail at translating a single word; the bigger problem is that it failed to translate the grammar naturally. If you type “日本語、話せる?” you get “Japanese, you can speak?” and other bits of Tarzan-speak. Note how none of your examples used the proper English: “Can you speak Japanese?” SYSTRAN is so afraid to switch sentence order around anywhere it isn’t strictly permitted that, in this case, it merely translates the sentence and then adds a question mark at the end for か. On one hand, this has a lot to do with its original focus on assisting translation for clients such as the EU and the UN that need to pass around technical documents and legal crap in multiple languages. On the other hand, it makes it utter poo for conversation and literary works (like video games, unlike what Old Man Ebert may tell you), and SYSTRAN’s academic papers would be the first to mention this.

Let’s take first a simple sentence and then a complicated one from my current project (Ace Combat 3):

もはや一刻の猶予も許されない。
Either postponement of moment is not permitted already.

The “もはや” is “already” and gets shoved to the end. “一刻の猶予も” is translated “either postponement of moment”, which makes a little bit of sense translating literally, but is ultimately gibberish. “許されない” is translated “is not permitted” and is shoehorned in right before the end, which is technically fine, but the meaning, in context of the “も〜ない” is more like “cannot permit” when you put it in English, despite this not being grammatically equivalent.

Now let’s play dirty:

死因も……どうやら人工心臓のネットワークに接続されていた自動制御プログラムが、何らかの原因で暴走した事によるものと見られ、関係者の間でも大きな波紋を呼んでいます。
Cause of death ......Somehow, the automatic control program which is connected to the network of the artificial heart, it is seen as the thing due to driving recklessly with a some cause calls the big ripple even between the authorized personnel.

Well, this is where a rule-based system like SYSTRAN shines. Except for a few word-for-word boo-boos and a couple of grammar niggles, this is actually reasonably close to how I translated it. All that cautiousness pays off if you end up with something that remotely makes sense.

Cf. Google, just for kicks:

もはや一刻の猶予も許されない。
Not allowed time to lose anymore.

Equally bad as above, but in some different ways.

死因も……どうやら人工心臓のネットワークに接続されていた自動制御プログラムが、何らかの原因で暴走した事によるものと見られ、関係者の間でも大きな波紋を呼んでいます。
Cause of death ... ... and Control Program also had an artificial heart is connected to the network apparently believed by the runaway thing for whatever reason, is causing a big stir among the people concerned.

Close, but no cigar. The part “is causing a big stir among the people concerned” is great, but “apparently believed by the runaway thing for whatever reason” is aswingandamiss; the “による” can mean “by” or “on account of”, which is basically the same thing, except for a subtlety in English in which “by” used with the passive voice indicates that something is directly responsible (and would be the subject of the sentence were the active voice used). In this case, Google Translate botched it, because the “と” is more important here, and SYSTRAN’s “it is seen as” is much closer to the mark.

---
As to はなす, there is a simpler explanation: usually, when kana only are used instead of a common kanji, it’s because they wanted a complex kanji but didn’t want to bother fishing through the IME to select it. SYSTRAN’s rule-based system, with its gazillions of language pairs, reflects this. Google’s statistics-based system does not, and just goes with what it thinks would make sense statistically.
Moulinoski
Guest
« Reply #27 on: November 25, 2009, 06:10:03 pm »

Quote from: DarknessSavior on November 25, 2009, 11:09:41 am
Quote from: Garoth Moulinoski on November 25, 2009, 10:35:08 am
Hmm, now I'd like to know what are the most used kanji, period.

http://www.kanji-a-day.com/100kanji.php

~DS

Ooh, ありがとう!!