Glossary scripts
From the mozfr wiki.
I have written a set of Perl/PHP scripts to help with localisation and its QA. Let me introduce them and ask you to test them and give feedback.
The basis of these tools is a Perl script which takes the en-US and localised .dtd and .properties files from CVS and builds translation memory (TMX) files out of them.
This script runs as a cron job every night.
The translation memory uses the standard XML format TMX: http://www.lisa.org/Translation-Memory-e.34.0.html
The PHP scripts described below read this TMX file.
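The real converter is a Perl script whose source is not shown on this page. As a rough illustration of the idea only, here is a minimal Python sketch that pairs en-US and localised strings by key and writes them as TMX translation units. The function names, the `tuid` layout (directory:file:entity) and the header values are assumptions made for this example, and the .properties parser deliberately ignores line continuations and escapes.

```python
import re
import xml.etree.ElementTree as ET

# Matches <!ENTITY name "value"> or <!ENTITY name 'value'> (simplified).
DTD_ENTITY = re.compile(r'<!ENTITY\s+([\w.\-]+)\s+(["\'])(.*?)\2\s*>', re.DOTALL)


def parse_dtd(text):
    """Return {entity name: value} for a DTD file's content."""
    return {m.group(1): m.group(3) for m in DTD_ENTITY.finditer(text)}


def parse_properties(text):
    """Return {key: value} for simple key=value lines (no continuations)."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "!")):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        if sep:
            entries[key.strip()] = value.strip()
    return entries


def build_tmx(source, target, target_lang, prefix):
    """Build a TMX string from two {key: string} dicts sharing the same keys."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "sketch", "creationtoolversion": "0",  # placeholders
        "segtype": "sentence", "o-tmf": "plain", "adminlang": "en-US",
        "srclang": "en-US", "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for key in sorted(source):
        if key not in target:
            continue  # not localised yet, nothing to pair
        tu = ET.SubElement(body, "tu", tuid=f"{prefix}:{key}")
        for lang, text in (("en-US", source[key]), (target_lang, target[key])):
            tuv = ET.SubElement(tu, "tuv", {"xml:lang": lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")


en = parse_properties("updatesItem_downloadingFallback=Downloading Update…")
fr = parse_properties("updatesItem_downloadingFallback=Téléchargement d'une mise à jour…")
print(build_tmx(en, fr, "fr", "browser:browser.properties"))
```

Keying each unit by directory, file and entity mirrors the first column of the result tables described below, so the later lookups can work from the TMX alone.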
Glossary
Working from this TMX file, the PHP script at
- http://www.frenchmozilla.fr/glossaire/fr/index.php acts as a glossary.
(Replace fr with your locale code.)
You can search for a word (or multiple words) in the search box.
The script returns two tables: matches in the en-US strings and matches in the localised strings (here French). In each table, the first column is a concatenation of the main directory, the file name and the entity. The second column is the en-US string, and the third column is your localised string.
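The PHP source is not reproduced here, so the following is only a small Python sketch of the lookup as described: one pass over the translation units, producing one result list for en-US matches and one for localised matches. The `Unit` type, the sample data and the matching rule (case-insensitive substring of the typed query) are assumptions for the example.

```python
from typing import NamedTuple


class Unit(NamedTuple):
    entity: str      # "directory:file:entity", as in the first column
    en_us: str       # second column
    localised: str   # third column


def search(units, query):
    """Return (en-US matches, localised matches) for a query string."""
    needle = query.casefold()
    en_hits = [u for u in units if needle in u.en_us.casefold()]
    loc_hits = [u for u in units if needle in u.localised.casefold()]
    return en_hits, loc_hits


units = [
    Unit("browser:browser.properties:updatesItem_downloadingFallback",
         "Downloading Update…", "Téléchargement d'une mise à jour…"),
    Unit("browser:hypothetical.dtd:example.label",  # made-up entry
         "Bookmarks", "Marque-pages"),
]

en_table, fr_table = search(units, "update")
for row in en_table:
    print(" | ".join(row))
```

Searching both columns separately is what makes the glossary usable in either direction: you can start from an English term or from a term already used in your locale.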
Duplicates
The second script is more complicated and needs preprocessing by a Perl script (run by cron every night).
(Replace fr with your locale code.)
For the main directories you choose, this script lists the entities whose strings are identical in en-US but differ in your localisation.
For example, if you compare Browser with Calendar, the first line of the table is:
| Entity | en-US | fr 1 | Entity | fr 2 |
|---|---|---|---|---|
| browser:browser.properties:updatesItem_downloadingFallback | Downloading Update… | Téléchargement d'une mise à jour… | calendar:calendar.properties:updatesItem_downloadingFallback | Téléchargement de la mise à jour… |
For the entity updatesItem_downloadingFallback, found in browser.properties in browser and in calendar.properties in calendar, you have the same English string, "Downloading Update…", but two different translations: "Téléchargement d'une mise à jour…" and "Téléchargement de la mise à jour…".
If you want consistency in your localisation, you'll want to eliminate these.
I plan (perhaps with help, if someone is interested) to semi-automate the correction directly from the web page.
If the check box is ticked, the script limits the search to entities with the same name. If it is unticked, the list can be longer and will sometimes contain false positives.
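To make the comparison concrete, here is a hedged Python sketch of the check described above. It is not the actual Perl/PHP code: the function name, the tuple layout and the `same_entity_only` flag (standing in for the check box) are assumptions for illustration.

```python
from collections import defaultdict


def directory(key):
    """Main directory of a 'directory:file:entity' key."""
    return key.split(":", 1)[0]


def entity_name(key):
    """Entity name of a 'directory:file:entity' key."""
    return key.rsplit(":", 1)[-1]


def find_inconsistencies(units, dir_a, dir_b, same_entity_only=True):
    """List rows (entity A, en-US, translation A, entity B, translation B)
    where the en-US strings are identical but the translations differ."""
    by_en = defaultdict(list)
    for key, en, loc in units:
        if directory(key) == dir_b:
            by_en[en].append((key, loc))

    rows = []
    for key_a, en, loc_a in units:
        if directory(key_a) != dir_a:
            continue
        for key_b, loc_b in by_en.get(en, []):
            if loc_a == loc_b:
                continue  # consistent, nothing to report
            if same_entity_only and entity_name(key_a) != entity_name(key_b):
                continue  # check box ticked: same entity name required
            rows.append((key_a, en, loc_a, key_b, loc_b))
    return rows


units = [
    ("browser:browser.properties:updatesItem_downloadingFallback",
     "Downloading Update…", "Téléchargement d'une mise à jour…"),
    ("calendar:calendar.properties:updatesItem_downloadingFallback",
     "Downloading Update…", "Téléchargement de la mise à jour…"),
]

for row in find_inconsistencies(units, "browser", "calendar"):
    print(" | ".join(row))
```

Indexing one directory by its en-US string keeps the comparison to a single pass over each side, which matters when the nightly preprocessing covers every file in both directories.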
Alignment
(Replace fr with your locale code.)
The third script is a more advanced glossary, meant as a help for the localiser. It tries (very simply for now) to align a new string with the already localised ones, i.e. it searches for similarities between the new entry and the old ones.
You put an en-US string (for example a new entity that has just landed on the trunk) in the search box and it gives you the best matches it can find.
For now it is a very basic search:
- the first table gives you the perfect matches.
- the second table gives the almost perfect matches: entities whose string contains the whole searched string.
- the last table gives the entities in which your searched words can be found (all of the words, for now).
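The three tables above can be sketched as follows. This is an illustrative Python version, not the script's real code: the function name is made up, matching is assumed to be case-insensitive, and each entity is assumed to appear only in the best table it qualifies for.

```python
import re


def words(text):
    """Set of lowercased words in a string, punctuation ignored."""
    return set(re.findall(r"\w+", text.casefold()))


def align(units, query):
    """Split units into (perfect, contains whole string, contains all words)."""
    q = query.casefold()
    q_words = words(query)
    perfect, contained, all_words = [], [], []
    for unit in units:
        en = unit[1].casefold()
        if en == q:
            perfect.append(unit)
        elif q in en:
            contained.append(unit)
        elif q_words and q_words <= words(en):
            all_words.append(unit)
    return perfect, contained, all_words


# Hypothetical entries (entity, en-US, localised) for the example.
units = [
    ("a:a.properties:one", "Downloading Update…", "Téléchargement de la mise à jour…"),
    ("a:a.properties:two", "Downloading Update… please wait", "…"),
    ("a:a.properties:three", "The update is downloading", "…"),
    ("a:a.properties:four", "Bookmarks", "Marque-pages"),
]

perfect, contained, all_words = align(units, "Downloading Update…")
print(len(perfect), len(contained), len(all_words))
```

Because the last tier only tests word membership, it ignores word order, which is why it returns looser matches than the first two.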
In the future I'll try to implement better alignment searches. Some ideas: searching on only some of the words; giving the translation of each word or group of words (based on a search of the corpus in the translation memory), since for example in French "bookmarks" is always translated as "Marque-pages"; or applying the alignment methods described in scientific papers.
Perhaps it would be a good idea to include this sort of alignment in a future localisation tool that proposes translations for entities newly landed on the trunk?
Entities
(Replace fr with your locale code.)
This script looks for entities (e.g. &brandShortName;, &quot;, etc.). It lets you search for misspelled entities that may cause XML errors.
When you hover over an entity, a tooltip shows the entity in its context.
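One way such a check could work, shown here as a hedged Python sketch rather than the script's actual logic, is to extract the entity references from each localised string and flag those that do not appear in the en-US string. The regular expression and function names are assumptions for the example.

```python
import re

# Matches references such as &brandShortName; or &quot; (simplified).
ENTITY_REF = re.compile(r"&(#?[\w.\-]+);")


def entity_refs(text):
    """Set of entity names referenced in a string."""
    return set(ENTITY_REF.findall(text))


def suspicious_entities(units):
    """Return (key, [entity names]) for localised strings that reference
    entities absent from the corresponding en-US string."""
    report = []
    for key, en, loc in units:
        unknown = entity_refs(loc) - entity_refs(en)
        if unknown:
            report.append((key, sorted(unknown)))
    return report


# Hypothetical entries (entity, en-US, localised); the second has a typo.
units = [
    ("browser:hypothetical.dtd:about.ok",
     "About &brandShortName;", "À propos de &brandShortName;"),
    ("browser:hypothetical.dtd:about.typo",
     "About &brandShortName;", "À propos de &brandShorName;"),
]

for key, names in suspicious_entities(units):
    print(key, names)
```

A localised string may legitimately use an entity the English one does not (a character entity, for instance), so a list built this way is a set of candidates to review rather than a list of confirmed errors.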
Feedback, questions?
I encourage you to test these scripts and give us feedback.
- Do you think they can be useful for your localisation work?
- Do you have any ideas to improve them?
I know the pages are really ugly, but I'm not a web designer :)
If you are interested in using them for your locale, I can give you the sources and instructions for using them.
We could perhaps also include your localisation from CVS on our server (or on the l10n server?).
I will put the source of the Perl and PHP scripts on our server really soon (I have to add comments and clean them up to remove all the unnecessary stuff).
Hope it will interest you :)
Feature requests
- [Done] I would like to have searched terms highlighted in the results page. It should be an option, let's say a checkbox to tick "Highlight searched terms" (Goofy Jan '10)
- [Done] I would also appreciate a whole-word search option, that is: matching the complete searched string only (Goofy Jan '10)
- [Done] It could also improve the readability of the output to have a distinct set of colours (or some kind of separator?) according to the app where the searched string has been found. (Goofy Feb '10)
- [Done] When using the Duplicates tool it would be good to have the two <my language1><my language2> columns next to each other, instead of having to scroll horizontally to compare them in many cases. The entity column for <my language2> should then be the last one on the right, which would give
- entity1 | en-US string | mylang1 | mylang2 | entity2
- [Done] A function to search in entity names/property key names could be useful sometimes. E.g. I wish to list all entities (and related strings) with "accesskey" in their name.
Bug reports
- The closing quote is not displayed in the target language results when a variable like %1 is between quotes. E.g. http://www.frenchmozilla.fr/glossaire/fr/index.php?recherche=Expected
- The ul, li and p tags are preserved in the results columns, but they somewhat spoil the readability of the string because they run into the previous/next words. It would be better to keep spaces around the opening and closing tags (< li >, < p >, ...). The same goes for the newline sequence \n.
- It seems the .ini files are not parsed by Transvision, though some of these files include strings (e.g. toolkit/crashreporter/crashreporter.ini)