
digitalmars.D - [Suggestion] Standard version identifiers for language

reply Stewart Gordon <smjg_1998 yahoo.com> writes:
Suppose one wants to write an application with versions in different 
languages.  That's human languages, not programming languages.

At face value, that's simple - use version blocks to hold the various 
languages' UI text.  Or for Windows, define a separate resource file for 
each language.  (Or lists of string macros to be imported by one 
resource file.)

But what if you want to use one or more libraries from various sources, 
which also may have language versions?  Then you'd have to set all the 
version identifiers that the different library designers have chosen for 
your choice of language, which could lead to quite long command lines. 
It would be simpler if there could be a standard system of language 
identifiers for everyone to follow.

A system based on ISO 639-1 would probably be good.  One could then write

----------
version (en) {
     const char[][] DAY = [ "Sunday", "Monday", "Tuesday", "Wednesday",
       "Thursday", "Friday", "Saturday" ];
} else version (fr) {
     const char[][] DAY = [ "dimanche", "lundi", "mardi", "mercredi",
       "jeudi", "vendredi", "samedi" ];
} else {
     static assert (false);
}
----------

There are a few matters of debate to be considered:

1. Should we create a prefixed language namespace?  Or will these codes 
by themselves do?

2. How should a lib really be written to deal with unsupported 
languages, or with the case where no language has been specified?  Two 
possibilities I can see:
(a) the lib programmer would put his/her own language (or maybe the one 
predicted to be most popular) as the default.
(b) a static assert as above, effectively telling the app programmer 
"please set a language, or create a version block in me for your 
language".  Maybe a future D compiler could be configured to use a 
certain language version as the default if none is specified on the 
command line.

3. Should we really have them as version identifiers?  Or invent a new 
CC block called 'language' that would have the specifics of language 
designation built in, a corresponding command line option and a 
corresponding compiler configuration setting?

Dialects of a language could be indicated by replacing the hyphen in the 
ISO code with an underscore.  Libs would then have something like

----------
version (en_GB) {
     ...
} else version (en_US) {
     ...
} else version (en) {
     ...
}
----------

It would be necessary either for the compiler to automatically set en if 
en_GB or en_US or en_anything is set, and similarly for other language 
codes, or to persuade all D users to do this.  Of course, this would be 
done in the aforementioned default language setting.

This would give lib programmers a choice of writing for each dialect of 
each language, just covering the basic languages, or a mixture.  A 
default fallback for an unsupported dialect would, I guess, typically be 
some 'default' dialect if that makes sense.

This provides for compile-time localisation.  Of course, some might want 
run-time l10n, in which case the app would be explicitly programmed to 
do this.

In writing a lib one might choose to support run-time localisation (RTL) 
as well.  In that case the version/language blocks would be used to 
select the default language, which would make the lib usable by 
monolingual apps, compile-time-localised (CTL) apps and RTL apps alike. 
Of course, one could argue that there should be some global variable in 
Phobos or somewhere to hold the run-time language....

Stewart.

-- 
My e-mail is valid but not my primary mailbox, aside from its being the 
unfortunate victim of intensive mail-bombing at the moment.  Please keep 
replies on the 'group where everyone may benefit.
Jul 16 2004
next sibling parent Arcane Jill <Arcane_member pathlink.com> writes:
In article <cd8rji$ar0$1 digitaldaemon.com>, Stewart Gordon says...
<some interesting ideas>

Can I respond to this next week?

Short answer - this is a run-time problem, not a compile-time problem. A third
party should (IMO) be able to add additional human languages given access only
to the executable binary and some ini files (or something similar).

Long answer - can you wait? I've got lots of ideas about this, but I'm really
not up to a debate just yet.

Arcane Jill
Jul 16 2004
prev sibling parent reply J C Calvarese <jcc7 cox.net> writes:
Stewart Gordon wrote:
 Suppose one wants to write an application with versions in different 
 languages.  That's human languages, not programming languages.
 
 At face value, that's simple - use version blocks to hold the various 
 languages' UI text.  Or for Windows, define a separate resource file for 
 each language.  (Or lists of string macros to be imported by one 
 resource file.)
 
 But what if you want to use one or more libraries from various sources, 
 which also may have language versions?  Then you'd have to set all the 
 version identifiers that the different library designers have chosen for 
 your choice of language, which could lead to quite long command lines. 
 It would be simpler if there could be a standard system of language 
 identifiers for everyone to follow.
 
 A system based on ISO 639-1 would probably be good.  One could then write
 
 ----------
 version (en) {
     const char[][] DAY = [ "Sunday", "Monday", "Tuesday", "Wednesday",
       "Thursday", "Friday", "Saturday" ];
 } else version (fr) {
     const char[][] DAY = [ "dimanche", "lundi", "mardi", "mercredi",
       "jeudi", "vendredi", "samedi" ];
 } else {
     static assert (false);
 }
 ----------

There seem to be two schools of thought in the area of localization (of 
which language issues are a subset):

1. Compile-time generated, using version() as you described in your post.

2. Run-time generated, using some sort of a plugin architecture or 
language resource files (as Arcane Jill alludes to in her reply).

Since D is capable enough for either method, both parties can be happy. 
Personally, I think I'd prefer to use compile-time localization, but 
that doesn't prevent others from designing run-time localization functions.

I have a quick comment about the specifics of your ideas. I think the 
version identifiers should have a prefix (such as "lang_"). This should 
make it clear to most viewers of the code what's happening. Not everyone 
would intuitively know that ky_KG is a language feature, but lang_ky_KG 
is guessable.

----------
version (lang_en_GB) {
     ...
} else version (lang_en_US) {
     ...
} else version (lang_en) {
     ...
}
----------

Since I don't have any real experience with localization, I'd love to 
hear some opinions from those who have actually worked with it. Which 
programming languages make localization easy?  Which libraries are helpful?

-- 
Justin (a/k/a jcc7) http://jcc_7.tripod.com/d/
Jul 16 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cd9peo$oce$1 digitaldaemon.com>, J C Calvarese says...

There seems to be 2 schools of thought in the area of localization (of 
which language issues are a subset):

1. Compile-time generated using version() as you described in your post.

2. Runtime-time generated using some sort of a plugin architecture or 
language resource files (as Arcane Jill alludes to in her reply).

Since D is capable enough for either method, both parties can be happy.

Personally, I think I'd prefer to use compile-time localization, but 
that doesn't prevent others from designing runtime-time localization 
functions.

Compile-time localization is not something I'd thought about before, so it's
been kind of an interesting thing to think about.

The thing is, though, we already _have_ compile-time localization. We've always
had it. As Stewart said, you can do:

# version(fr)
# {
#     writef("something in French");
# }
# else version(de)
# {
#     writef("something in German");
# }
# else
# {
#     writef("something in English");
# }

And we've been able to do that in C ever since we learned how to use #ifdef. So
it requires no new language features. It's already there.

But, since this technique has been around for so long, you'd expect it to be
widely used ... unless, that is, it turns out to be not very useful. There are
a number of disadvantages I can think of.

For a start, your locale-specific code will end up distributed throughout your
source code, instead of all in one place. This could be a nightmare if you
decide to support a new locale.

Another problem is that, if you choose the locale at compile time, then the
end user (as opposed to the developer) has to have the source code, OR an
executable which was compiled especially for their locale. And that's just for
executables. For libraries, the situation is even worse. A
compile-time-localized executable would have to be linked with
compile-time-localized libraries, compiled for the same locale. It would be a
serious headache.

Another problem is that someone might compile it for version(en_US), without
realizing that they should have been using version(en).

And all for what? To save a small amount of run-time overhead. Well, /how much/
run-time overhead? In most cases, run-time localization amounts to looking
something up in a map. Is that bad? A matter of judgement, maybe, but I'd say
it was insignificant compared to the overhead incurred by writing that
localized string to printf() or a file.

Localization, to me, is the flip-side of internationalization (or i18n for
lazy typists). The way it's traditionally done is you "internationalize" your
code - a compile-time thing - and then "localize" it at run-time. Here's an
example. Start with some normal, unlocalized code:

# printf("hello world\n");

Now, internationalize it. It will become something like this:

# printf(local("hello world\n"));

Not much different really. The function local() would probably be something
like this, and would get inlined:

# char[][char[]] localizedLookup; // global variable
#
# char[] local(char[] s)
# {
#     return localizedLookup[s];
# }

"Localizing" this program now consists only of initializing the
localizedLookup map, which could happen in any number of ways.

My guess is that what would be most useful in terms of
internationalization/localization would be some classes and functions to make
stuff like the above easier, providing localized number formats and so on.
Plus of course the D definition of a "locale" - we have to standardize that
somehow. Me, I'd prefer enums to strings - saves all that messing about with
case, for one thing, and string-splitting to get at the two (possibly three)
parts.
I have a quick comment about the specifics of your ideas. I think the 
version identifiers should have a prefix (such as "lang_"). This should 
make it clear to most viewers of the code what's happening. Not everyone 
would intuitively know that ky_KG is a language feature, but lang_ky_KG 
is guessable.

Strictly speaking it should be "locale_", not "lang_". ("ky" is a lang, or
language; "KG" is a country. The combination "ky_KG" is a locale.) Sorry for
being a pedantic bugger here.

But the original question was, should there be "standard version identifiers"?
Thing is - I don't see how there can be. A version identifier is just D's name
for a #define, and there's nothing to stop anyone from using whatever
identifiers they want as such. A note in the style guide might help, but even
that won't force people to use said standard.

Just my thoughts.
Arcane Jill
Jul 18 2004
next sibling parent reply Juanjo Álvarez <juanjuxNO SPAMyahoo.es> writes:
What about just porting GNU gettext to Phobos? This way you have a
semi-standard way of localizing programs (which a lot of translators know
about), and a set of pre-written tools (even nice GUI ones).

Looking at the Python implementation, it should not be difficult; the Python
implementation is only 493 lines (gettext.py). I'll see if I can find enough
time to do it in the next few weeks.
Jul 18 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cdeva5$2p2o$1 digitaldaemon.com>, Juanjo Álvarez says...
What about just porting GNU gettext to phobos? This way you have a
semi-standart way of localizing programs (which a lot of translators know
about), and a set of pre-written tools (even nice GUI ones).

I'm not sure. I'll admit I don't know much about gettext, so perhaps you could
clear a few things up for me (and others)?

What is the encoding of a .po or .mo file? Specifically, is it UTF-8? (Or CAN
it be UTF-8?) Put another way - how good is its Unicode support?

I guess I have to say I'd be disappointed if we had to rely on yet another C
library. Maybe I'm just completely mad, but I'd prefer a pure D solution.
(That said, I have no objection to using files of the same format.)

gettext looks like it does some cool stuff, but it's ... well ... C. It's not
OO, unless I've misunderstood. It doesn't use exceptions. Moreover, it assumes
the Linux meaning of "locale", which is (again, in my opinion) not right for D.
The way I see it, D should define locales exclusively in terms of ISO language
and country codes, plus variant extensions. Unicode defines locales that way,
and the etc.unicode library will have no choice but to use the ISO codes.
Collation and stuff like that will need to rely on data from the CLDR (Common
Locale Data Repository - see http://www.unicode.org/cldr/).

I suppose my gut feeling is that internationalization /isn't that hard/, so it
ought to be a relatively simple task to come up with a native-D solution.
gettext seems to do string translation only (again, correct me if I'm wrong),
which is only a small part of internationalization/localization.

So, I guess, on balance, I'd vote against this one, at least pending some
persuasive argument. That said, I'm way too busy to volunteer for any work
(plus I'm still taking a bit of time off from coding for personal reasons) -
although I /do/ intend to tackle Unicode localization quite soon.

Does that help? Probably not, I guess. Ah well. Tell you what, let's start an
open discussion. (I've changed the thread title.) I think we should hear lots
of opinions before anyone actually DOES anything. A wrong early decision here
could hamper D's potential future as /the/ language for internationalization
(which I'd like it to become).

Arcane Jill
Jul 19 2004
next sibling parent reply Juanjo Álvarez <juanjuxNO SPAMyahoo.es> writes:
First things first, you can have all the documentation about GNU gettext
here:

http://www.gnu.org/software/gettext/manual/html_chapter/gettext_toc.html

Arcane Jill wrote:

 I'm not sure. I'll admit I don't know much about gettext, so perhaps you
 could clear a few things up for me (and others)?

Let's try.
 What is the encoding of a .po or .mo file? Specifically, is it UTF-8? (Or
 CAN it be UTF-8?). Put another way - how good is its Unicode support?

The encoding of the file is declared in the header of the po file, so it can
be (I think) anything. For example:

"Project-Id-Version: animail\n"
"POT-Creation-Date: 2003-12-07 02:02+0100\n"
"PO-Revision-Date: 2004-07-08 20:19+0200\n"
"Last-Translator: XXX XXX <XXX XXX.de>\n"
"Language-Team: Deutsch <de li.org>\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=UTF-8\n"
                                   ^^^^^
"Content-Transfer-Encoding: 8bit\n"
                            ^^^^
"X-Generator: KBabel 1.3.1\n"
 I guess I have to say I'd be disappointed if we had to rely on yet another
 C library. 

We don't; the Python implementation (the one of about 500 lines) doesn't use
any external C lib at all; it's 100% pure Python.
 gettext looks like it does some cool stuff, but it's ... well ...
 C. 

I repeat, it doesn't have to be C. gettext is more like a set of tools and
formats than a library (although the library exists, of course, and not only
for C but for a lot of languages).
 It's not OO.

It can be done as OO.
 unless I've misunderstood. It doesn't use exceptions.  

Our implementation could.
 Moreover, it assumes the Linux meaning of "locale", which is (again, in my
 opinion) not right for D.
 The way I see it, D should define locales exclusively in terms of ISO
 language and country codes, plus variant extensions. 
 Unicode defines 
 locales that way, and the etc.unicode library will have no choice but to
 use the ISO codes. Collation and stuff like that will need to rely on data
 from the CDLR (Common Locale Data Repository - see
 http://www.unicode.org/cldr/).

I don't know the answer to this one; gettext seems to use ISO 3166 country
codes and ISO 639 language codes.
 I suppose my gut feeling is that internationalization /isn't that hard/,
 so it ought to be relatively simple a task to come up with a native-D
 solution. 

 gettext seems to do string translation only (again, correct me 
 if I'm wrong), which only a small part of
 internationalization/localazation.

That's true. It also handles plural forms, which is not so simple
(http://www.gnu.org/software/gettext/manual/html_chapter/gettext_10.html#SEC150).

Anyway, if we can all discuss the matter and come up with a better solution
than gettext (which I'm sure is possible) I doubt many will be opposed.
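The plural-forms handling referred to above is exposed in the Python
implementation as ngettext(), which picks between a singular and a plural
msgid based on a count (a translation catalogue can override the selection
rule per language). A minimal sketch, using the built-in no-op catalogue:

```python
from gettext import NullTranslations

# NullTranslations applies the default English rule: singular iff n == 1.
t = NullTranslations()

for n in (1, 3):
    print(t.ngettext("%d file deleted", "%d files deleted", n) % n)
# -> 1 file deleted
# -> 3 files deleted
```

Languages with more than two plural forms (Polish, Russian, ...) supply their
own rule in the .po header, which is the part that's genuinely hard to get
right by hand.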
Jul 19 2004
next sibling parent reply "Thomas Kuehne" <eisvogel users.sourceforge.net> writes:
Juanjo Álvarez:
 Anyway if we all can discuss the matter and come with a better solution
 than gettext (which I'm sure it's possible) I doubt many will be opposed.

Just to point out some other - not necessarily better - localization libs:

qt/kde: http://doc.trolltech.com/3.0/linguist-manual.html
java: http://java.sun.com/j2se/1.5.0/docs/api/java/util/Formattable.html
Jul 19 2004
parent reply Juanjo Álvarez <juanjuxNO SPAMyahoo.es> writes:
Thomas Kuehne wrote:

 Juanjo Álvarez:
 Anyway if we all can discuss the matter and come with a better solution
 than gettext (which I'm sure it's possible) I doubt many will be opposed.

Just to point out some other - not necessarily better - localization libs:

qt/kde: http://doc.trolltech.com/3.0/linguist-manual.html
java: http://java.sun.com/j2se/1.5.0/docs/api/java/util/Formattable.html

I don't know about the Java implementation, but Qt's tr() is very similar to
gettext (the format of the translation files is different). I don't know if
KDE uses gettext internally, but they use po/mo files just like gettext (and
with the same format).
Jul 19 2004
parent Berin Loritsch <bloritsch d-haven.org> writes:
Juanjo Álvarez wrote:

 Thomas Kuehne wrote:
 
 
Juanjo Álvarez:

Anyway if we all can discuss the matter and come with a better solution
than gettext (which I'm sure it's possible) I doubt many will be opposed.

Just to point out some other - not necessarily better - localization libs:

qt/kde: http://doc.trolltech.com/3.0/linguist-manual.html
java: http://java.sun.com/j2se/1.5.0/docs/api/java/util/Formattable.html

I don't know about the Java implementation but Qt tr() is very similar to gettext (the format of the translation files is different.) I don't know if KDE uses gettext internally but they use po/mo files just like gettext (and with the same format.)

With the Java MessageFormat solution, things work fairly well. Consider for
instance:

MessageFormat.format("There was a problem in {0}, where {2} parse errors
  were encountered at line {1}", location, lineNum, numErrs);

There is a way to map which argument goes to which location in the line. Also,
MessageFormat does have a shorthand for a kind of if:then construct, so that
the same message can be interpreted differently for plurals/etc. That makes it
convenient to handle those issues in I18N.

However, I would not say that MessageFormat is super easy to use. It could
have a better interface, but the concepts are pretty decent.
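The essential feature here - numbered placeholders that a translator can
reorder freely - is easy to mimic elsewhere. A sketch of the same idea using
Python's str.format (the message text is just the example above, not a real
catalogue entry):

```python
# The placeholder indices {0}, {1}, {2} refer to argument positions,
# so a translated template can mention them in any order.
template = "There was a problem in {0}: {2} parse errors at line {1}"
msg = template.format("main.d", 42, 3)
print(msg)  # There was a problem in main.d: 3 parse errors at line 42

# A "translation" can reorder the placeholders without touching the call site:
reordered = "{2} parse errors at line {1} (file {0})"
print(reordered.format("main.d", 42, 3))
```

The plural-selection shorthand MessageFormat offers has no one-line analogue,
though; that part really does need catalogue support.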
Jul 19 2004
prev sibling parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cdg2mr$7dn$1 digitaldaemon.com>, Juanjo Álvarez says...

The encoding of the file is declared in the header of the po files; so it
can be (I think) anything, for example:

the Python implementation, (that of about 500 lines) don't use any
external C lib at all; it's 100% pure Python. 

It can be done as OO.

Our implementation could [use exceptions].

 gettext seems to use ISO 3166 country
 codes and ISO 639 language codes.

Wow! Well, you've quashed all of my objections. I'll change my vote, then.
Looks like a D implementation of gettext is the way to go.

Just one last thing though - we do need a D definition of a locale. In effect,
we need a class Locale (or possibly a struct Locale) containing those ISO
codes. Java uses strings internally (I _think_), but there are a whole bunch
of reasons why that's not such a good idea - such as the fact that "fr", "fra"
and "fre" are all, equivalently, the language code for French, and should all
compare as equal; such as case and other punctuation concerns
("en-us" == "en-US" == "en_us" == "en_US", etc.). I'd vote for putting enums
inside the class (enum Language and enum Country - the variant field will
still need to be a string).

I imagine that the gettext implementation will need to use our
yet-to-be-invented Locale class, and the unicode lib certainly will (and
soon). Any thoughts? Class or struct? Strings or enums? Something else?

Arcane Jill
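To make the equivalence problem concrete, here is a sketch of the kind of
normalization such a Locale type would have to do - alias tables, case
folding, and separator handling. (The class name, the tiny alias table, and
the enum members are all invented for illustration; a real version would be
generated from the full ISO tables.)

```python
from enum import Enum


class Language(Enum):
    EN = "en"
    FR = "fr"


# ISO 639 assigns two- and three-letter codes to the same language;
# "fra" (terminology) and "fre" (bibliography) both mean French.
ALIASES = {"fra": "fr", "fre": "fr", "eng": "en"}


class Locale:
    def __init__(self, tag: str):
        # Accept "en-US", "en_us", "EN_us", ... and normalize.
        parts = tag.replace("-", "_").split("_")
        lang = parts[0].lower()
        self.language = Language(ALIASES.get(lang, lang))
        self.country = parts[1].upper() if len(parts) > 1 else None

    def __eq__(self, other):
        return (self.language, self.country) == (other.language, other.country)


# All of these denote the same locale:
assert Locale("en-us") == Locale("en_US") == Locale("EN_us")
# ...and "fr", "fra", "fre" all map to the same language:
assert Locale("fra").language is Locale("fre").language
```

Storing enums rather than raw strings makes these equalities fall out of the
constructor instead of being re-checked at every comparison site.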
Jul 19 2004
parent reply Juanjo Álvarez <juanjuxNO SPAMyahoo.es> writes:
Arcane Jill wrote:


 Just one last thing though - we do need a D definition of a locale. In
 effect, we need a class Locale (or possibly a struct Locale) containing
 those ISO codes.

Yes, definitely.
 the language code for
 French, and should all compare as equal; such as case and other
 punctuation concerns ("en-us" == "en-US" == "en_us" == "en_US", etc.). I'd
 vote for putting enums inside the class (enum Language and enum Country -
 the variant field will still need to be a string).

I vote for that too.
 I imagine that the 
 gettext implementation will need to use our yet-to-be-invented Locale
 class, and the unicode lib certainly will (and soon). Any thoughts? Class
 or struct? Strings or enums? Something else?

Class + enums, IMHO :)

Also the Python locale module[1] (sorry, I'm a Pythonist :) could be a good
source of inspiration; it supports Unix, Windows and Mac style locales with a
bunch of useful functions (getlocale, getdefaultlocale, setlocale, normalize,
locale-aware atoi+atof+str+format+strcoll, etc...).

I'll take a look at it this weekend.

[1] http://doc.astro-wise.org/locale.html
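The normalize function mentioned above is a good example of what that module
offers: it is a pure table lookup that maps loosely written locale names onto
a canonical "language_COUNTRY.encoding" form, without touching the process
locale. A small sketch:

```python
import locale

# normalize() canonicalizes spelling and fills in a default encoding;
# it does not change the current locale (unlike setlocale()).
for raw in ("en_us", "en_US.UTF-8", "de"):
    print(raw, "->", locale.normalize(raw))
```

Something like this table-driven normalization is exactly what a D Locale
type would need behind its constructor.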
Jul 19 2004
parent Arcane Jill <Arcane_member pathlink.com> writes:
In article <cdgjdq$e82$1 digitaldaemon.com>, Juanjo Álvarez says...

Also the Python locale module[1] (sorry, I'm a pythonist :) could be a good
source of inspiration; it supports Unix, Windows and MAC style locales with
a bunch of useful functions (getlocale, getdefaultlocale, setlocale,
normalize, locate-aware atoi+atof+str+format+strcol, etc...). 

I'll take a look at it this weekend.

Be careful not to go too over-the-top here. I think that stuff like
locale-aware atoi(), etc., should NOT be member functions of class Locale.
I'll explain my reasoning below.

class Locale itself should be short, sweet and simple - little more than the
embodiment of those ISO codes, in fact. Locales can identify a resource by
being used as a map key, so you don't need tons of other stuff built in.

The reason I say this is circularity, or bootstrapping, or simplicity,
depending on your point of view. To implement (say) locale-aware atoi() would
require actual KNOWLEDGE of how to do that, for every locale. Now, we COULD
pull all that data from CLDR and implement it by hand, but it would be a lot
of work. Conceptually, it's simpler if we get the very basics up and running
first, and then overload functions such as strcoll() later.

In fact, in this /particular/ example (collation) we are most certainly better
off leaving it until later. I plan later to implement the Unicode Collation
Algorithm, based on the data in CLDR. That will end up as a function which
takes a Locale as one of its parameters, and whose behavior is controlled by
that parameter. Same with full casing. Where such an algorithm exists (along
with all the data) it makes sense to take advantage of it, but we are not in a
position to do that yet, because not enough of the basics are there.

But do take a look anyway. Look also at Java's class Locale. It basically does
nothing, except identify a locale. That's the kind of line I'm thinking along,
as it allows for unlimited expansion later without tying us down to anything.

Jill
Jul 19 2004
prev sibling parent reply Hauke Duden <H.NS.Duden gmx.net> writes:
As far as I remember (I looked at gettext a few of years ago) gettext 
has some serious drawbacks. The worst being that parameters that are 
inserted into the translated string have to be specified in "printf" 
formatting. That means that their order in the translated string must be 
the same as the order in the original text, which is not always possible 
and often awkward.

My memory is a little fuzzy about the specifics, so please correct me if 
I'm wrong.

Hauke

Arcane Jill wrote:

 In article <cdeva5$2p2o$1 digitaldaemon.com>, Juanjo Álvarez says...
 
What about just porting GNU gettext to Phobos? This way you have a
semi-standard way of localizing programs (which a lot of translators know
about), and a set of pre-written tools (even nice GUI ones).

I'm not sure. I'll admit I don't know much about gettext, so perhaps you could
clear a few things up for me (and others)?

What is the encoding of a .po or .mo file? Specifically, is it UTF-8? (Or CAN
it be UTF-8?) Put another way - how good is its Unicode support?

I guess I have to say I'd be disappointed if we had to rely on yet another C
library. Maybe I'm just completely mad, but I'd prefer a pure D solution.
(That said, I have no objection to using files of the same format.)

gettext looks like it does some cool stuff, but it's ... well ... C. It's not
OO, unless I've misunderstood. It doesn't use exceptions. Moreover, it assumes
the Linux meaning of "locale", which is (again, in my opinion) not right for D.
The way I see it, D should define locales exclusively in terms of ISO language
and country codes, plus variant extensions. Unicode defines locales that way,
and the etc.unicode library will have no choice but to use the ISO codes.
Collation and stuff like that will need to rely on data from the CLDR (Common
Locale Data Repository - see http://www.unicode.org/cldr/).

I suppose my gut feeling is that internationalization /isn't that hard/, so it
ought to be a relatively simple task to come up with a native-D solution.
gettext seems to do string translation only (again, correct me if I'm wrong),
which is only a small part of internationalization/localization.

So, I guess, on balance, I'd vote against this one, at least pending some
persuasive argument. That said, I'm way too busy to volunteer for any work
(plus I'm still taking a bit of time off from coding for personal reasons) -
although I /do/ intend to tackle Unicode localization quite soon.

Does that help? Probably not, I guess. Ah well. Tell you what, let's start an
open discussion. (I've changed the thread title.) I think we should hear lots
of opinions before anyone actually DOES anything. A wrong early decision here
could hamper D's potential future as /the/ language for internationalization
(which I'd like it to become).

Arcane Jill

Jul 19 2004
parent reply Juanjo Álvarez <juanjuxNO SPAMyahoo.es> writes:
Hauke Duden wrote:

 As far as I remember (I looked at gettext a few of years ago) gettext
 has some serious drawbacks. The worst being that parameters that are
 inserted into the translated string have to be specified in "printf"
 formatting. That means that their order in the translated string must be
 the same as the order in the original text, which is not always possible
 and often awkward.
 
 My memory is a little fuzzy about the specifics, so please correct me if
 I'm wrong.
 
 Hauke

Mmm, not the GNU gettext; you can put:

printf(_("There are %d %s %s\n"), count, _(color), _(name));

and the output po file will be:

"There are %1$d %2$s %3$s"

so the translator can change the numbers, thus changing the word order.
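The same trick exists in Python's %-formatting, where named mapping keys
(rather than bare %d/%s) are the usual recommendation for translatable
strings, precisely so a translator can reorder them. A sketch (the strings
here are invented examples, not catalogue entries):

```python
# Source-language template, using named keys instead of positional %s:
english = "There are %(count)d %(color)s %(name)s\n"

# A translator's version may mention the same keys in a different order:
translated = "%(name)s (%(color)s): %(count)d\n"

values = {"count": 3, "color": "red", "name": "apples"}
print(english % values, end="")     # There are 3 red apples
print(translated % values, end="")  # apples (red): 3
```

Either way - numbered (%1$d) or named (%(count)d) - the key point is that the
template, not the call site, controls the order.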
Jul 19 2004
next sibling parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cdgiqm$dua$1 digitaldaemon.com>, Juanjo Álvarez says...
"There are %1$d %2$s %3$s"

So translator can change the numbers thus changing the word order.

Is this a feature of printf()? If so, is it a Linux thing or an all-platform
thing? And (probably a silly question, but someone might know the answer) - is
this functionality available in the new writef()?
Jul 19 2004
next sibling parent Sean Kelly <sean f4.ca> writes:
In article <cdgq1p$gqi$1 digitaldaemon.com>, Arcane Jill says...
In article <cdgiqm$dua$1 digitaldaemon.com>, Juanjo Álvarez says...
"There are %1$d %2$s %3$s"

So translator can change the numbers thus changing the word order.

Is this a feature of printf()? If so, is a Linux thing or an all-platform thing? And (probably a silly question, but someone might know the answer) is this functionality available in the new writef()?

It's not a feature of printf, and AFAIK it's not in the new writef either.

Semi-related: I'm recoding my scanf implementation as unFormat (to match
doFormat) and changing the calling syntax to readf. So with any luck there
will be both input and output routines written in D.

Sean
Jul 19 2004
prev sibling parent Jonathan Leffler <jleffler earthlink.net> writes:
Arcane Jill wrote:

 In article <cdgiqm$dua$1 digitaldaemon.com>, Juanjo Álvarez says...
 
"There are %1$d %2$s %3$s"

So translator can change the numbers thus changing the word order.

Is this a feature of printf()? If so, is a Linux thing or an all-platform thing? And (probably a silly question, but someone might know the answer) is this functionality available in the new writef()?

It depends on whose printf() you're looking at. Standard C - no. POSIX - yes.
See:

http://www.opengroup.org/onlinepubs/009695399/functions/fprintf.html

I discussed this once before in this newsgroup, a few weeks after the thread
had gone stale (mainly because I had only just started to pay attention to D).
...dig...dig...dig... Friday 9th July 2004... digitalmars.D/5662

-- 
Jonathan Leffler #include <disclaimer.h>
Email: jleffler earthlink.net, jleffler us.ibm.com
Guardian of DBD::Informix v2003.04 -- http://dbi.perl.org/
Jul 19 2004
prev sibling parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cdgiqm$dua$1 digitaldaemon.com>, Juanjo Álvarez says...

Mmm, not the GNU gettext, you can put:

printf(_("There are %d %s %s\n"), count, _(color), _(name));

And the output po file will be:

"There are %1$d %2$s %3$s"

So translator can change the numbers thus changing the word order.

Well, that /sounds/ like the kind of thing we need, but your above example is
a little unclear to those of us who have not used gettext before. As I read
it, and assuming that _() is the text-localizing function, that wouldn't
change the word order. But you say it does, so I must have misunderstood
something. Can you break that down into steps?

Berin mentioned Java's MessageFormat class. This does the job of word-order
switching. It's cumbersome to use in practice, but we could still borrow the
technique if we so needed.

We will certainly find a way to do word reordering in D. The question is:
where is the right place for it? Does gettext do it? Should we petition Walter
to get writef() to do it? Would Hauke's string class be the right place? We
need more information....

Jill
Jul 19 2004
parent Hauke Duden <H.NS.Duden gmx.net> writes:
Arcane Jill wrote:
In article <cdgiqm$dua$1 digitaldaemon.com>, Juanjo Álvarez says...
 
 
Mmm, not the GNU gettext, you can put:

printf(_("There are %d %s %s\n"), count, _(color), _(name));

And the output po file will be:

"There are %1$d %2$s %3$s"

So translator can change the numbers thus changing the word order.

Well, that /sounds/ like the kind of thing we need, but your above example is a little unclear to those of us who have not used gettext() before. As I read the above, and assuming that _() is the text-localizing function, that wouldn't change the word order. But you say it does, so I must have misunderstood something. Can you break that down into steps? Berin mentioned Java's MessageFormat class. This does the job of word order switching. It's cumbersome to use in practice, but we could still borrow the technique if we so needed.

I've been using a pretty simple but effective technique for quite some time. The translatable string can contain placeholders of the form %NAME%, and the translation function can take a map parameter that supplies the correct values. This also has the advantage of better documentation: it is pretty hard to deduce the intended meaning of a string like "There are %d %ss in the %s". It gets easier with something like "There are %NUM% %OBJ%s in the %CONTAINER%". Less room for error.

I have also found that it can sometimes be helpful to include some kind of comment for the translator that describes the intended use or any constraints of the string - for example "keep this as short as possible" or "context is file I/O". I implemented this by adding an optional parameter that can be passed to the translation function. It is ignored at runtime, but the "harvester" tool that extracts the strings from the code files includes it in the translatable files.

And last but not least, I think translatable strings should have an ID (a string ID, not a number). Not all strings that are the same in one language are the same in other languages. So if the translation is bound to the original text, you can have situations where you need to specify two different texts for two different contexts, but you are not able to do so, because the original text serves as the ID/key.

A good example that I encountered a few years ago: at the time I played the German version of the game Baldur's Gate. It contained some horrible text bugs that obviously originated from a translation system where the original text served as the ID. One particular case was the text "XXX attacks YYY" that was displayed whenever one character attacked another. "attacks" in English can be the plural of the noun "attack" or a form of the verb "to attack"; here it is the verb form. Unfortunately it was translated with the German plural of the noun, which is different from the verb (probably because the same string was also used in a context where it meant the noun), so the translation made no sense at all.

Hauke
Jul 19 2004
prev sibling next sibling parent reply J C Calvarese <jcc7 cox.net> writes:
Arcane Jill wrote:
 In article <cd9peo$oce$1 digitaldaemon.com>, J C Calvarese says...
 
 
There seems to be 2 schools of thought in the area of localization (of 
which language issues are a subset):

1. Compile-time generated using version() as you described in your post.

2. Runtime-time generated using some sort of a plugin architecture or 
language resource files (as Arcane Jill alludes to in her reply).

Since D is capable enough for either method, both parties can be happy.

Personally, I think I'd prefer to use compile-time localization, but 
that doesn't prevent others from designing runtime-time localization 
functions.

Compile-time localization is not something I'd thought about before, so it's been kind of an interesting thing to think about. The thing is, though, we already _have_ compile-time localization. We've always had it. As Stewart said, you can do:

# version(fr)
# {
#     writef("something in French");
# }
# else version(de)
# {
#     writef("something in German");
# }
# else
# {
#     writef("something in English");
# }

And we've been able to do that in C ever since we learned how to use #ifdef. So it requires no new language features. It's already there. But, since this technique has been around for so long, you'd expect it to be widely used ... unless, that is, it turns out to be not very useful.

It's nothing revolutionary, but it's a start.
 
 There are a number of disadvantages I can think of. For a start, your
 locale-specific code will end up distributed throughout your source code,
 instead of all in one place. This could be a nightmare if you decide to support
 a new locale. Another problem is that, if you choose the locale at
compile-time,
 then the end-user (as opposed to the developer) has to have the source code, OR
 an executable which was compiled especially for their locale. And that's just
 for executables. For libraries, the situation is even worse. A
 compile-time-localized executable would have to be linked with
 compile-time-localized libraries, compiled for the same locale. It would be a
 serious headache.
 
 Another problem is that someone might compile it for version(en_US), without
 realizing that they should have been using version(en).
 
 And all for what? To save a small amount of run-time overhead. Well, /how much/
 run-time overhead? In most cases, run-time-localization amounts to looking
 something up in a map. Is that bad? A matter of judgement, maybe, but I'd say
it
 was insignificant compared to the overhead incurred by writing that localized
 string to printf() or a file. 

It could be a lot of run-time overhead. It could be a little. But ultimately, it should be left up to the individual programmer.
 
 Localization, to me, is the flip-side of internationalization (or i18n for lazy
 typists). The way it's traditionally done is you "internationalize" your code -
 a compile-time thing, and then "localize" it at run-time. Here's an example.

I didn't realize there was a difference between localization and internationalization. The OP was mostly concerned with "human languages", but I think that other issues such as date formats would naturally be discussed at the same time.
 Start with some normal, unlocalized code:
 
 #    printf("hello world\n");
 
 Now, internationalize it. It will become something like this:
 
 #    printf(local("hello world\n"));
 
 Not much different really. The function local() would probably be something
like
 this, and would get inlined:
 
 #    char[][char[]] localizedLookup; // global variable
 #
 #    char[] local(char[] s)
 #    {
 #        return localizedLookup[s];
 #    }
 
 "Localizing" this program now consists only of initializing the localizedLookup
 map, which could happen in any number of ways.
 
 My guess is that what would be most useful in terms of
 internationalization/localization would be some classes and functions to make
 stuff like the above easier, providing localized number formats and so on. Plus
 of course the D definition of a "locale" - we have to standardize that somehow.
 Me, I'd prefer enums to strings - saves all that messing about with case, for
 one thing, and string-splitting to get at the two (possibly three) parts.

Sure, Phobos should include some modules for run-time support (and maybe compile-time support, too).
 
 
I have a quick comment about the specifics of your ideas. I think the 
version identifiers should have a prefix (such as "lang_"). This should 
make it clear to most viewers of the code what's happening. Not everyone 
would intuitively know that ky_KG is a language feature, but lang_ky_KG 
is guessable.

Strictly speaking it should be "locale_", not "lang_". ("ky" is a lang, or language; "KG" is a country. The combination "ky_KG" is a locale). Sorry for being a pedantic bugger here.

I don't mind nit-picking (I do a lot of it myself). I hereby retract "lang_" in favor of either "locale_" or "loc_".
 
 But the original questions was, should there be "standard version identifiers"?
 Thing is - I don't see how there can be. A version identifier is just D's name
 for a #define, and there's nothing to stop anyone from using whatever they want as such.
 A note in the style guide might help, but even that won't force people to use
 said standard.

Right. I was thinking "convention" when I read "standard". I don't intend to compel anyone (and as you state, they can't really be compelled), but I think if the convention makes sense, many people would use it.
 
 Just my thoughts.
 
 Arcane Jill
 
 
 

-- Justin (a/k/a jcc7) http://jcc_7.tripod.com/d/
Jul 18 2004
parent Arcane Jill <Arcane_member pathlink.com> writes:
In article <cdfgli$2vdr$1 digitaldaemon.com>, J C Calvarese says...
I didn't realize there was a difference between localization and 
internationalization.

I found a good explanation of this when looking up gettext on the web. Have a look at http://www.gnu.org/software/gettext/manual/html_mono/gettext.html#SEC3. To quote its summary of itself: "Also, very roughly said, when it comes to multi-lingual messages, internationalization is usually taken care of by programmers, and localization is usually taken care of by translators."

I consider myself a good programmer, but I'd make a lousy translator. I'd want to leave that job to someone else.

Arcane Jill
Jul 19 2004
prev sibling next sibling parent reply Stewart Gordon <smjg_1998 yahoo.com> writes:
Arcane Jill wrote:

<snip>
 There are a number of disadvantages I can think of. For a start, your
 locale-specific code will end up distributed throughout your source code,
 instead of all in one place.

Only if you choose to do it that way. You can just as well have one version block per module, with the locale-specific data and code in it, and have the rest of the module use the stuff in there.
 This could be a nightmare if you decide to support a new locale.  
 Another problem is that, if you choose the locale at compile-time, 
 then the end-user (as opposed to the developer) has to have the 
 source code, OR an executable which was compiled especially for their 
 locale.

I believe it's quite common to offer separate downloadable versions in each language. That way, a unilingual end-user isn't faced with the bloat of a multilingual UI or the overhead of compiling it, and you can choose whether to release the source or not.
 And that's just for executables. For libraries, the situation is even worse. A
 compile-time-localized executable would have to be linked with
 compile-time-localized libraries, compiled for the same locale. It would be a
 serious headache.

To me it would seem straightforward to build a copy of the lib, and give it an identifying name, for each language that your app supports. <snip>
 Strictly speaking it should be "locale_", not "lang_". ("ky" is a lang, or
 language; "KG" is a country. The combination "ky_KG" is a locale). Sorry for
 being a pedantic bugger here.

ISO language codes are language-dialect pairs. So en-GB is British English, es-MX is Mexican Spanish. AIUI they don't cover other aspects of locale, such as time zones, date formats and the like. They tend to be managed by the OS - it would seem pointless to try and write apps to override this.

Stewart.

-- My e-mail is valid but not my primary mailbox, aside from its being the unfortunate victim of intensive mail-bombing at the moment. Please keep replies on the 'group where everyone may benefit.
Jul 19 2004
parent Arcane Jill <Arcane_member pathlink.com> writes:
In article <cdg8ab$98s$1 digitaldaemon.com>, Stewart Gordon says...

ISO language codes are language-dialect pairs.  So en-GB is British 
English, es-MX is Mexican Spanish.

I know.
AIUI they don't cover other aspects 
of locale, such as time zones, date formats and the like.

In a sense, they don't cover ANYTHING. They are just tuples of language/country/variant tags. However, if you think of these as map keys, you can turn them into anything else quite straightforwardly.
They tend to be managed by the OS

That would be nice, but collation, number formats, date formats, etc. (to give just a few examples) are not handled very well at all by any OS of which I am aware.
it would seem pointless to try and write apps to 
override this.

But not pointless for a library. The CLDR, which is maintained by the Unicode Consortium, contains just about every fragment of information you could possibly imagine wanting (short of actual language translation). Its data files are in XML - actually a custom format called LDML (Locale Data Markup Language). It absolutely DOES include such information as time zones, currencies, number formats, and so on. It's a resource we would be foolish to ignore, and since it's XML, it can be robot-parsed far, far more easily than the Unicode database. I will most certainly be using /some/ of the CLDR data for the Unicode collation algorithm.

If you want to write several monolingual applications from the same source, no-one is going to stop you. Go ahead and do it, and use whatever version identifiers you want. There's room in this world (and indeed in D) for BOTH compile-time language selection AND true internationalization/localization, so I guess we can all be happy.

Arcane Jill
Jul 19 2004
prev sibling parent Stewart Gordon <smjg_1998 yahoo.com> writes:
Arcane Jill wrote:

<snip>
 Another problem is that someone might compile it for version(en_US), without
 realizing that they should have been using version(en).

Yes, as I said in my original post:
 It would be necessary either for the compiler to automatically 
 set en if en_GB or en_US or en_anything is set, and similarly for 
 other language codes, or to persuade all D users to do this. Of 
 course, this would be done in the aforementioned default language 
 setting.



Stewart.

-- My e-mail is valid but not my primary mailbox, aside from its being the unfortunate victim of intensive mail-bombing at the moment. Please keep replies on the 'group where everyone may benefit.
Jul 19 2004