Idiom World Server TM Tutorial

Introduction

I think this is the most important guide to understand, because the TM is the biggest advantage Idiom has over the other localization solutions currently available in the industry.

I will mainly be using the guide that comes with the Idiom 8.0 installation: WorldServer v8.0.2 Translation Memory Guide.

The main reasons I am writing this document are to gain more knowledge of Idiom TM and, at the same time, to prepare some reference material for my Idiom development work.

For any additional questions, please visit http://www.idiominc.com/resources/documentation/

Let’s delve into it.

Differences Between Trados and Idiom TM

Basically Idiom TM provides one additional matching scheme that Trados does not have.

Only available in Idiom - ICE (In Context Exact) / SPICE (SID Preferred In Context Exact) Match

- How Idiom determines ICE matches:

1. Find exact matches

2. Evaluate for the ICE condition (basically, check the usage context of the lookup text; here the usage context simply means the ‘surrounding text’)

3. Rank the ICE candidates through the ranking strategy (AIS file path or normalized file path comparison)

- How Idiom determines SPICE matches:

1. Find exact matches

2. Evaluate for the ICE condition (basically, check the usage context of the lookup text; here the usage context means the ‘SID’)
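The two lookups above can be sketched roughly as follows. This is only an illustrative model, not WorldServer code: the TMEntry fields, the classify function, and the choice to model ‘surrounding text’ as the previous and next source segments are all assumptions made for this example. Ranking of multiple ICE candidates is left out.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class TMEntry:
    # All fields are hypothetical; they only model the idea of "usage context".
    source: str
    target: str
    prev_source: Optional[str] = None  # surrounding text before the segment
    next_source: Optional[str] = None  # surrounding text after the segment
    sid: Optional[str] = None          # segment identifier, if the asset has one


def classify(source: str,
             prev_source: Optional[str],
             next_source: Optional[str],
             sid: Optional[str],
             entries: List[TMEntry]) -> Tuple[str, Optional[TMEntry]]:
    """Return (match type, entry) for one lookup segment."""
    # Step 1: find exact matches on the source text.
    exact = [e for e in entries if e.source == source]
    if not exact:
        return ("none", None)

    # Step 2 (SPICE): the usage context is the SID.
    if sid is not None:
        for entry in exact:
            if entry.sid == sid:
                return ("SPICE", entry)

    # Step 2 (ICE): the usage context is the surrounding text.
    for entry in exact:
        if entry.prev_source == prev_source and entry.next_source == next_source:
            return ("ICE", entry)

    # No context agreed, so this stays a plain 100% match.
    return ("exact", exact[0])
```

Checking the SID before the surrounding text reflects the "SID Preferred" part of the name; whether WorldServer orders the checks exactly this way is an assumption.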

Available in both:

100% / Exact Match

Fuzzy Match

TMX Import Tool

1. Importing Third Party TMX file

This option will be mostly used when you migrate your current TM to Idiom TM.

In my experience, a TMX file exported from a Trados TM imports fine except for some of the entries with attributes, so some cleanup is needed there.

The most important thing to remember when importing a third-party TMX file into Idiom is that Idiom keys the import on the name of the TMX file, including its path.

What does it mean?

Let’s say that after importing fr_FR.tmx into Idiom, you find that some entries in the TM need fixing. If you fix those entries and re-import the file under the same name, Idiom will import only the new and changed entries, much like the ‘Import with overwrite’ option in Trados. However, if you rename the TMX file to, for example, fr_FR_2.tmx, the Idiom import process will generate a whole new set of TM entries for that file. So try not to rename the TMX file, to prevent duplicate entries. One workaround would be to purge the TM and re-import it from scratch.
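To make the renaming problem concrete, here is a toy model of the behavior described above. It is not how Idiom stores entries internally; the dict keyed by (TMX path, source text) and the import_tmx function are assumptions used only to show why a renamed file produces duplicates.

```python
def import_tmx(tm, tmx_path, units):
    """Import (source, target) pairs into a toy TM.

    tm is a dict keyed by (tmx_path, source). Returns the number of
    entries that were added or changed by this import.
    """
    changed = 0
    for source, target in units:
        key = (tmx_path, source)
        if tm.get(key) != target:
            tm[key] = target  # new entry, or overwrite of a changed one
            changed += 1
    return changed


tm = {}
import_tmx(tm, "/tmx/fr_FR.tmx", [("Save", "Sauver"), ("Open", "Ouvrir")])

# Same file name: only the corrected entry is touched, no duplicates.
import_tmx(tm, "/tmx/fr_FR.tmx", [("Save", "Enregistrer"), ("Open", "Ouvrir")])

# Renamed file: every unit becomes a new entry next to the old ones.
import_tmx(tm, "/tmx/fr_FR_2.tmx", [("Save", "Enregistrer"), ("Open", "Ouvrir")])
```

After the three imports the toy TM holds four entries: two under the original file name and two duplicates under the new one.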

Note: Trados TMX export is in UTF-16 (Needs BOM).

TMX Specification: http://www.lisa.org/standards/tmx/tmx.htm#Intro_Enc

* One side note on Trados text TMs: a Trados text TM (Trados 7) must be encoded in UTF-8 and must have a BOM. If either is missing, the text TM will be treated as a legacy TM, which leads to corruption after import.
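Since both notes above depend on the BOM, a quick check before importing can save some trouble. The following is a minimal sketch using only the Python standard library; the function name detect_bom is my own, and UTF-32 is deliberately not handled.

```python
import codecs
from typing import Optional


def detect_bom(data: bytes) -> Optional[str]:
    """Return the encoding implied by a leading BOM, or None if there is no BOM.

    UTF-32 is not handled here; a UTF-32 LE file would be reported as UTF-16 LE.
    """
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    return None
```

In practice you would pass in the first few bytes of the file, for example the result of open(path, "rb").read(4). A Trados text TM should report "utf-8", and a Trados TMX export should report one of the UTF-16 variants.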

2. Importing Idiom TMX file

The TMX export from Idiom contains all the usage context information, so all ICE/SPICE information is exported to the TMX file. This is a good way to back up your TMs. I prefer to export with UTF-8 encoding, since it is script-friendly.

I found that importing third-party TMs that contain your legacy translations is very useful for improving leverage, but the different segmentation and the problems introduced during import sometimes caused engineering problems. I would still recommend using such a TM as a secondary TM while you are converting your assets from your legacy TMs to Idiom, to avoid retranslating content that already exists in your legacy TMs.

Search and Replace Tool

It is important to understand that fixing TM entries with the Search and Replace tool will not propagate the fixes to the corresponding assets. Simply choosing the Reapply TM option in the browser workbench is not recommended either, because this process will generally not update ICE or manually translated segments (*** until the assets are re-segmented after the segment cache is cleared).

Most companies using Idiom probably have some way to propagate TM changes to assets. In my experience, the Search and Replace tool is not a good way to apply global changes. It is better to implement a tool that searches the assets directly, creates a project with the assets it finds, and forces them to be re-segmented after clearing the segment cache. This way both the assets and the TM entries are guaranteed to be updated together.

Leveraging and Segmentation

- Filter configuration changes

No matter how much effort you put into testing your filters, you will run into a few cases that lead to filter configuration changes (if not, you should be well compensated for it :)). Unless it is a major revamp of the filters, this is still fine, since the fuzzies it introduces will only occur in re-segmented assets.

Here is one example of filter configuration change:

<Link>

<Url>aaa/bbb/ccc.php</Url>

<Text>Your available balance is </Text>

<Dynamic>Amount</Dynamic>

</Link>

In Idiom Browser Workbench with the wrong configuration:

- Placeholders

{1} = <Link> : Embeddable & Excluded

{2} = <Url>aaa/bbb/ccc.php</Url> : Embeddable & Excluded

{3} = <Text> : Included

{4} = </Text> : Included

{5} = <Dynamic>Amount</Dynamic></Link> : Embeddable & Excluded

The whole segment:

{1} {2} {3} Your available balance is {4} {5}

This needs to be re-arranged like this in Japanese:

{1} {2} {5} {3} Your available balance is {4}

However, the asset cannot be saved because of a well-formedness validation error.

The fix was to make the Link tag embeddable but not excluded.

The impact on leverage depends on how frequently this situation occurs in your existing assets, so it can be minimal or unexpectedly large.

- Leveraging

One unique thing about the Idiom leveraging process is that it stores segmentation data in the DB, so Idiom always presents this pre-segmented data to users for translation and review.

- Ranking matches

The ranking rule applies to both exact and ICE matches, but not to SPICE matches (there is only one SPICE match for a given segment).

Here is the ranking rule:

1. ICE Context

2. Metadata-based Matching: the same AIS-to-TM mapped attribute values

3. AIS Context

4. Position Rank

5. Most Recent
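One way to picture the ranking rule is as a sort with the five criteria as successive tie-breakers. The sketch below is my own illustration, not WorldServer code: the Candidate fields are hypothetical, and treating Position Rank as a distance where smaller is better is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    # Hypothetical fields, one per ranking criterion.
    target: str
    ice_context: bool        # 1. ICE context agrees
    metadata_match: bool     # 2. same AIS-to-TM mapped attribute values
    ais_context_match: bool  # 3. same (normalized) AIS path
    position_distance: int   # 4. position rank, assumed: smaller is better
    modified: int            # 5. last-modified timestamp, larger is more recent


def rank(candidates: List[Candidate]) -> List[Candidate]:
    """Sort candidates best-first; each criterion only breaks ties in the previous one."""
    return sorted(
        candidates,
        key=lambda c: (
            not c.ice_context,        # False sorts before True, so matches come first
            not c.metadata_match,
            not c.ais_context_match,
            c.position_distance,
            -c.modified,
        ),
    )
```

The tuple key means a candidate with matching ICE context always outranks one without it, no matter how the later criteria compare.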

TM Migration

Please see Idiom World Server Alignment Tutorial.

TM Groups

If a 100% match is found in a TM of higher precedence, then an ICE match candidate for that segment will not be sought in the lower-ranked TMs.

One other important behavior to understand is that if a matching entry is found in a non-write TM, it will not be written to the write TM. If you are using your legacy TMs as non-write TMs and plan to use them only as a reference, it is a good idea to enable the following property in the tm.properties file, so that all ICE match segments are saved to the write TM:

save_ice_match_segments = true
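The precedence behavior can be sketched as a loop that stops at the first TM containing any exact match. This is a simplified model under my own assumptions: each TM is a plain list of (source, context, target) tuples, and lookup_in_group is a hypothetical helper, not an Idiom API.

```python
def lookup_in_group(source, context, tms):
    """Look up a segment in a TM group.

    tms is a list of TMs in precedence order (highest first); each TM is a
    list of (source, context, target) tuples. Returns a tuple of
    (match type, index of the TM that answered, target).
    """
    for index, tm in enumerate(tms):
        exact = [entry for entry in tm if entry[0] == source]
        if not exact:
            continue  # nothing here, try the next TM in the group
        for entry in exact:
            if entry[1] == context:
                return ("ICE", index, entry[2])
        # A 100% match in this TM ends the search: lower-ranked TMs are
        # never consulted, even if one of them holds an ICE candidate.
        return ("exact", index, exact[0][2])
    return ("none", None, None)
```

With a high-precedence TM that holds "Save" in one context and a low-precedence TM that holds it in the lookup context, the function returns the plain exact match from the first TM rather than the ICE match from the second.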

Auto Split / Merge

This is a very important question to raise:

“Can Idiom find these manually merged and split segments next time they are re-segmented?”

The answer is Yes with some limitations.

e.g.

<Text> My account limit is </Text>

<Linebreak/>

<Dynamic>Account_Limit</Dynamic>

<Text> with some restrictions.</Text>

Let’s assume all of these tags should be merged into one segment to be translated into Japanese, like below:

<Text> My account limit is </Text> <Linebreak/> <Dynamic>Account_Limit</Dynamic> <Text> with some restrictions.</Text>

After the re-segmentation of the asset:

This merged segment will be ICE’d.

However, if a 100% or ICE match already exists for the first segment (<Text> My account limit is </Text>), the merged segment will never be selected. For this auto-merge to work, the existing translation for the first segment has to be removed from the TM.

*** This is crucial information for linguists, so that they can clean up as they merge/split. However, deleting a segment also carries a risk, since it might be in use by other assets.

tm.properties:

• do_fuzzy_automatic_merge (True to enable, false to disable; enabled by default)
• do_fuzzy_automatic_split (True to enable, false to disable; disabled by default)
• do_hyper_merge (True to enable, false to disable; enabled by default — does not apply to ICE process)
• do_ice_automatic_merge (True to enable, false to disable; enabled by default)
• do_ice_automatic_split (True to enable, false to disable; enabled by default)

Path Normalization

- Purpose:

The only purpose of the path normalization is to reduce TM entries.

- How:

e.g. the same file in two different releases:
/release1/en_US/users/account.xml

/release2/en_US/users/account.xml

After normalization (let’s say we decided to chop off the first two directories), the TM AIS context is the same for both files, so the TM sees both assets as the same asset and Idiom stores only one translation, with the following TM AIS context:

/users/account.xml
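The normalization in this example amounts to dropping a fixed number of leading directories. A minimal sketch, assuming forward-slash AIS paths; the function name and the strip_dirs parameter are mine and not part of the SDK:

```python
def normalize_path(path: str, strip_dirs: int = 2) -> str:
    """Drop the first strip_dirs directories from an AIS path."""
    parts = [part for part in path.split("/") if part]
    return "/" + "/".join(parts[strip_dirs:])


# Both release paths collapse to the same TM AIS context.
release1 = normalize_path("/release1/en_US/users/account.xml")
release2 = normalize_path("/release2/en_US/users/account.xml")
```

Both calls return /users/account.xml, which is why the TM can no longer tell the two assets apart.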

- Problem (The Ping-Pong Effect):

After the path normalization, the file above — /users/account.xml — will be ICE’d only if the files in both releases are identical. Let’s consider this example.

file 1:

segment 1

segment 2

file 2:

segment 1

new segment

segment 2

Once file 2 is translated, it will be ICE’d the next time. A couple of days later file 1 is updated, but since the ICE context was regenerated because of the new segment in file 2, segments 1 and 2 in file 1 will only be 100% matches.

As you can see from the example above, it is like playing ping pong: file1 -> file2 -> file1 -> file2 …

Note: One of many ways to prevent this problem is to set up an intermediate process that determines whether the modified file has actual content changes. If it does, there is no choice but to put it through the leveraging process. If it has code-only changes, create a new localized file through a custom autoaction rather than putting it through the Idiom leveraging process.
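The intermediate check from the note could look something like the sketch below. It assumes the assets are XML and, for simplicity, that all element text is translatable; a real check would have to follow your filter configuration. Both function names are made up for this example.

```python
import xml.etree.ElementTree as ET
from typing import List


def translatable_text(xml_string: str) -> List[str]:
    """Collect the non-empty text of every element, in document order.

    Assumption: all element text counts as translatable content.
    """
    root = ET.fromstring(xml_string)
    return [text.strip() for text in root.itertext() if text.strip()]


def has_content_change(old_xml: str, new_xml: str) -> bool:
    """True if the text differs; attribute and whitespace changes are ignored."""
    return translatable_text(old_xml) != translatable_text(new_xml)
```

A file whose only change is a new attribute or different indentation would come back as unchanged and could skip leveraging, while any added, removed, or edited text would send it through the normal process.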

- SDK Sample:

ModifiedTMServices.java

Caveat: This is a customized TM service, so it will be invoked every time the TM is accessed, including during TMX import. The simpler it is, the better.

Understanding How Idiom Handles TM entries

It takes some time to understand how Idiom creates and stores TM entries. Here are some examples that I hope will make this topic a little easier to understand.

- Let’s assume the file that you are translating has ICE, Exact, fuzzy, no matches.

New TM entries will be created for Exact/100% matches, Fuzzies and No matches, but not for ICE.

Let’s break down this further so that we understand the behavior better:

- ICE:

Since a segment does not need to have the same TM AIS path to be ICE’d, the translation can come from anywhere. However, Idiom will not create a new TM entry unless you change the existing translation to something else. There is one more exception with TM Groups (see TM Groups).

The important question is: if an ICE’d translation is changed to something else, does it overwrite the existing entry or create a new TM entry?

A new entry will be created if the ICE’d segment comes from a different asset. If it comes from the same asset, the existing entry will be overwritten.

This is where path normalization comes in handy to minimize additional TM entries.

- Exact:

It is important to understand that a new TM entry will be created for an Exact match even when the 100% translation has the same AIS context (comes from the same asset). The additional entry is created so that the segment can be ICE’d. This means you will see two or more identical TM entries for those segments, but it does not mean they have the same ICE context.

- Fuzzy and no match:

New TM entries will be created.

Auto Translation

Auto translation basically copies source to target if a segment satisfies the auto translation rules (see below).
The auto translation process happens after ICE and exact lookups have failed to retrieve a match.

1. Rules

#1 Labels (e.g. 1, NW, 50)
#2 Number-only segments
#3 Content variables (e.g. SN:213108124)

Note: The rules are applied to the entire segment.

e.g. where n = 3 characters:

Segment                  Auto-translated?  Rule(s)  Note
Abc                      Yes               #1
Hat                      Yes               #1
ABRR.                    No                         Word contains more than n characters.
(617) 123-4567           Yes               #2
{1} 234#, {2} 234.453    Yes               #2
AB. 435 {1}-43           Yes               #1
AB1234                   Yes               #1, #3
AB{1}1234                Yes               #1       Rule #3 fails because of the placeholder.
ABCD1234                 Yes               #3
ABCD{1}1234              No                         Rule #3 fails because of the placeholder.
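The guide does not spell out the exact rule definitions here, so the following is only one reading of the rules that happens to reproduce every row of the table above. The interpretation is an assumption on my part: rule #1 means the segment has at least one word and no word longer than n letters, rule #2 means it has digits and no letters, and rule #3 means it is a single token mixing letters and digits with no placeholder. Only ASCII letters are considered.

```python
import re

PLACEHOLDER = re.compile(r"\{\d+\}")
# One token made of letters, digits and colons, with at least one letter and one digit.
CONTENT_VARIABLE = re.compile(r"(?=.*[A-Za-z])(?=.*\d)[A-Za-z0-9:]+")


def matching_rules(segment: str, n: int = 3) -> set:
    """Return the numbers of the auto translation rules the segment satisfies."""
    rules = set()
    has_placeholder = bool(PLACEHOLDER.search(segment))
    text = PLACEHOLDER.sub(" ", segment)       # placeholders are not text
    words = re.findall(r"[A-Za-z]+", text)     # runs of ASCII letters
    has_digit = any(ch.isdigit() for ch in text)

    # Rule 1 (labels): at least one word, and no word longer than n letters.
    if words and all(len(word) <= n for word in words):
        rules.add(1)
    # Rule 2 (number-only): digits but no letters at all.
    if not words and has_digit:
        rules.add(2)
    # Rule 3 (content variable): a single mixed token and no placeholder.
    if not has_placeholder and CONTENT_VARIABLE.fullmatch(segment.strip()):
        rules.add(3)
    return rules


def is_auto_translated(segment: str, n: int = 3) -> bool:
    """A segment is copied source to target if any rule applies."""
    return bool(matching_rules(segment, n))
```

For example, matching_rules("AB{1}1234") gives {1} because the placeholder rules out #3, and matching_rules("ABCD{1}1234") gives an empty set, so that segment is not auto-translated.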

Segment Repair Technology

If you enable the following property in the tm.properties file, all repair options will be disabled. The only exception is placeholder repair.

prevent_all_segment_repairs = true or false (the default value is false)

Logging TM and Alignment Processes

# TM debug logging
log4j.category.com.idiominc.ws.tm=debug

# TMX alignment / asset alignment debug logging, routed to the alogfile appender below
log4j.category.com.idiominc.ws.autoalignment=debug, alogfile

# Appender for logging alignment engine debug messages
log4j.appender.alogfile=org.apache.log4j.RollingFileAppender
log4j.appender.alogfile.File=c:/logs/alignmentlog2.txt
log4j.appender.alogfile.MaxFileSize=100000KB
log4j.appender.alogfile.MaxBackupIndex=100
log4j.appender.alogfile.layout=org.apache.log4j.PatternLayout
log4j.appender.alogfile.layout.ConversionPattern=[%d]: %m%n

Note: add the options above to general.properties.


mukkamu