digitalmars.D - Character set conversions

Adam D. Ruppe <destructionator gmail.com> writes:
I've encountered some problems with other charsets recently. Phobos has
a std.encoding that can do some useful stuff, but there's some
encodings I've seen in the wild that it can't handle (indeed, it's
a fairly short list that it does support)

I used gnu iconv for one of my projects and it works for me, but
I wonder:

Is anyone planning to add more charset support to Phobos?
(alternatively, am I missing something already there?)


If no, maybe I'll do a few myself. I've never actually written code
to do this, but it can't be rocket science. I suspect it's more
tedious than anything else.
May 29 2011
Jonathan M Davis <jmdavisProg gmx.com> writes:
On 2011-05-29 19:21, Adam D. Ruppe wrote:
 I've encountered some problems with other charsets recently. Phobos has
 a std.encoding that can do some useful stuff, but there's some
 encodings I've seen in the wild that it can't handle (indeed, it's
 a fairly short list that it does support)
 
 I used gnu iconv for one of my projects and it works for me, but
 I wonder:
 
 Is anyone planning to add more charset support to Phobos?
 (alternatively, am I missing something already there?)
 
 
 If no, maybe I'll do a few myself. I've never actually written code
 to do this, but it can't be rocket science. I suspect it's more
 tedious than anything else.

Well, generally the idea is that you just use UTF-8, UTF-16, or UTF-32, and for the most part, I wouldn't really expect people to be using UTF-16 except when they need to interface with Windows system functions which require it. By definition, char is supposed to be UTF-8, wchar is supposed to be UTF-16, and dchar is supposed to be UTF-32. I don't really think that it's expected that you be using any other encodings within your typical D program.

Sometimes it may be necessary to translate from another encoding to UTF-8, UTF-16, or UTF-32 when getting input from somewhere, and sometimes it may be necessary to translate to another encoding from UTF-8, UTF-16, or UTF-32 when outputting somewhere, but it certainly isn't the norm. It may be that we need better support for dealing with those cases, but they should really only be for converting on input or output.

So, if you want to improve std.encoding to handle more charsets, then feel free, but don't expect the rest of Phobos to work with anything beyond UTF-8, UTF-16, and UTF-32. It's going to be throwing UtfExceptions if you do.

- Jonathan M Davis
May 29 2011
Adam D. Ruppe <destructionator gmail.com> writes:
Jonathan M Davis wrote:
 Sometimes it may be necessary to translate from another encoding to
 UTF-8, UTF-16, or UTF-32 when getting input from somewhere, and
 sometimes it may be necessary to translate to another encoding from
 UTF-8, UTF-16, or UTF-32 when outputting somewhere, but it
 certainly isn't the norm.

Translation is all I want. Internally, everything is utf8 strings, but sometimes the program is fed files in another encoding and it needs to handle them too.
May 29 2011
Daniel Gibson <metalcaedes gmail.com> writes:
On 30.05.2011 05:03, Adam D. Ruppe wrote:
 Jonathan M Davis wrote:
 Sometimes it may be necessary to translate from another encoding to
 UTF-8, UTF-16, or UTF-32 when getting input from somewhere, and
 sometimes it may be necessary to translate to another encoding from
 UTF-8, UTF-16, or UTF-32 when outputting somewhere, but it
 certainly isn't the norm.

Translation is all I want. Internally, everything is utf8 strings, but sometimes the program is fed files in another encoding and it needs to handle them too.

Hmm, on the one hand iconv already does this for a plethora of encodings. On the other hand, AFAIK there is no iconv implementation that could be shipped with Phobos, so if a module for translating between encodings should become part of Phobos, there seems to be no other way than writing one from scratch :/ (And I think having this in Phobos would make sense.)
May 29 2011
Kagamin <spam here.lot> writes:
Jonathan M Davis Wrote:

 especially with no feature requests or bug reports on the matter. Personally,
 I wasn't even aware that it was an issue. Pure UTF-8 has always worked just 
 fine for me. Presumably, you're running into issues with it because you're 
 actually using D at work.

May be, it's his cgi lib? :) Client is free to send requests in any encoding, I suppose.
May 30 2011
Adam D. Ruppe <destructionator gmail.com> writes:
Kagamin wrote:
 May be, it's his cgi lib? :)
 Client is free to send requests in any encoding, I suppose.

In practice, that hasn't been a problem because browsers tend to send requests in the same encoding as the html you served. Since the D program always outputs utf8, the browsers all send back utf8 too.

The first problem I had was that users can upload csv files, which they generally make in Excel... which apparently outputs Windows-1252. Fine for 99% of text, but then someone puts in a curly quote or an em dash and it throws an invalid UTF-8 sequence. Converting that is easy enough though.

The second problem is that now I want to fetch and process random websites on the internet, and they come in a variety of encodings... again, utf covers a big majority, but not all of them.
May 30 2011
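
Editor's sketch (not from the thread): the failure described above can be reproduced in a few lines. The example is in Python rather than D, purely for illustration, and the helper names `is_valid_utf8` and `cp1252_to_utf8` are made up for this sketch. It assumes the input really is Windows-1252, which the thread only says Excel "apparently" produces.

```python
# In Windows-1252 the right curly quote (U+2019) is the single byte 0x92 and
# the em dash (U+2014) is 0x97. Neither byte is valid on its own in UTF-8.
raw = "It\u2019s here \u2014 finally".encode("cp1252")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def cp1252_to_utf8(data: bytes) -> bytes:
    """Reinterpret Windows-1252 bytes as text and re-encode them as UTF-8."""
    return data.decode("cp1252").encode("utf-8")
```

Pure ASCII text is byte-identical in both encodings, which is why the problem only shows up once someone types a character outside that range.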
"Jérôme M. Berger" <jeberger free.fr> writes:
Adam D. Ruppe wrote:
 Kagamin wrote:
 May be, it's his cgi lib? :)
 Client is free to send requests in any encoding, I suppose.

In practice, that hasn't been a problem because browsers tend to send requests in the same encoding as the html you served. Since the D program always outputs utf8, the browsers all send back utf8 too.

The first problem I had was that users can upload csv files, which they generally make in Excel... which apparently outputs Windows-1252. Fine for 99% of text, but then someone puts in a curly quote or an em dash and it throws an invalid UTF-8 sequence. Converting that is easy enough though.

Fun fact about Excel generated CSV files: quite apart from encoding issues, the separator used between cells depends on the locale: for example, in English locales it uses a comma but in French locales it uses a semicolon...

Just thought I'd point it out in case you did not know.

Jerome
--
mailto:jeberger free.fr
http://jeberger.free.fr
Jabber: jeberger jabber.fr
May 30 2011
"Nick Sabalausky" <a a.a> writes:
"Jérôme M. Berger" <jeberger free.fr> wrote in message
news:is0m2h$1s32$1 digitalmars.com...
Fun fact about Excel generated CSV files: quite apart from encoding
issues, the separator used between cells depends on the locale: for
example, in English locales it uses a comma but in French locales it
uses a semicolon...

Just thought I'd point it out in case you did not know.

Heh, that's just wonderful: localized file format specs...
May 30 2011
Adam D. Ruppe <destructionator gmail.com> writes:
 Fun fact about Excel generated CSV files: quite apart from encoding
 issues, the separator used between cells depends on the locale: for
 example, in English locales it uses a comma but in French locales it
 uses a semicolon...

Yeah, I've seen the semicolon in the wild before too, though I didn't know it was a locale thing. My program solves it by confirming with the user.

When you upload a file, it tries to parse it with a few different assumptions. The one that looks best is presented back to the user. ("Looks best" means it has headings that roughly match what we expect and a number of columns that's more or less consistent.)

It does charset the same way, actually. First, guess UTF-8. If that doesn't validate, assume it's Windows-1252 unless told otherwise.

The user then confirms the guesses and organizes the final data import. It's worked out pretty well so far aside from unsupported charsets; the users seem to like it.
May 30 2011
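
Editor's sketch (not from the thread): the guess-then-confirm approach described above could look roughly like this. It is written in Python for illustration and is not Adam's actual code. The candidate delimiter list, the scoring formula, and the names `guess_text`, `score` and `guess_csv` are all assumptions made for this example.

```python
import csv
import io

# Delimiters to try; the thread mentions comma, semicolon and tab.
CANDIDATE_DELIMITERS = [",", ";", "\t"]


def guess_text(data: bytes, fallback: str = "cp1252"):
    """Return (text, encoding): try UTF-8 first, else the fallback charset."""
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        return data.decode(fallback, errors="replace"), fallback


def score(rows, expected_headings):
    """Higher is better: matching headings plus consistent column counts."""
    if not rows:
        return 0.0
    width = len(rows[0])
    if width <= 1:
        # A single column usually means the delimiter guess was wrong.
        return 0.0
    header = {cell.strip().lower() for cell in rows[0]}
    heading_hits = len(header & expected_headings)
    consistent = sum(1 for row in rows if len(row) == width) / len(rows)
    return heading_hits + consistent


def guess_csv(data: bytes, expected_headings):
    """Pick the best-scoring delimiter; the caller should confirm with the user."""
    text, encoding = guess_text(data)
    expected = {h.lower() for h in expected_headings}

    def parse(delimiter):
        return list(csv.reader(io.StringIO(text), delimiter=delimiter))

    best = max(CANDIDATE_DELIMITERS, key=lambda d: score(parse(d), expected))
    return {"encoding": encoding, "delimiter": best, "rows": parse(best)}
```

The result is only a guess. As in the post, it is meant to be shown back to the user for confirmation, not trusted blindly.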
Daniel Gibson <metalcaedes gmail.com> writes:
On 30.05.2011 22:20, Simen Kjaeraas wrote:
 On Mon, 30 May 2011 19:57:32 +0200, Jérôme M. Berger <jeberger free.fr>
 wrote:
 
     Fun fact about Excel generated CSV files: quite apart from encoding
 issues, the separator used between cells depends on the locale: for
 example, in English locales it uses a comma but in French locales it
 uses a semicolon...

     Just thought I'd point it out in case you did not know.

Fun? Gods, it's the most horrible idea I've witnessed in computing. If only they'd call it something other than CSV, at least - Comma Separated Values separated by semicolons? WTF? And the fantastic joy of opening one of those abominations in some other program... *shiver*

CSV in Excel is totally misleading anyway. At least in the German version, if you want to import a CSV file, the standard separator is tab, not comma. If you use File->Open this is all you can get; importing with custom separators is hidden somewhere else IIRC. (This refers to Office XP, dunno if newer versions are better in this regard.)

In plain C (at least on Linux) you have fun locale-dependent in/output as well: printf and scanf are locale dependent, so if you use sprintf to generate a string you'll write into a file (or fprintf directly) with one locale, reading it with scanf functions with another locale will fail. Pretty fucking stupid IMHO.

This was/is(?) a bug in GtkRadiant, a level editor for Quake-like games, which uses printf or something to write the map files. The map compiler will reject them if decimals use a , instead of a . and stuff like that. (The workaround is to always use the standard locale, i.e. "LC_ALL=C gtkradiant" to start it.)

Cheers,
- Daniel
May 30 2011
"Jérôme M. Berger" <jeberger free.fr> writes:
Daniel Gibson wrote:
 In plain C (at least on Linux) you have fun locale-dependent in/output
 as well: printf and scanf are locale dependent, so if you use sprintf
 to generate a string you'll write into a file (or fprintf directly) with
 one locale, reading it with scanf functions with another locale will fail.
 Pretty fucking stupid IMHO.
 This was/is(?) a bug in GtkRadiant, a level editor for Quake like games,
 which uses printf or something to write the map files. The map compiler
 will reject them if decimals use a , instead of a . and stuff like that.
 (The workaround is to always use the standard LOCALE, i.e. "LC_ALL=C
 gtkradiant" to start it).

Actually, that is the same issue: Excel outputs numbers to CSV in a locale dependent way (probably using printf), which means that in some locales the decimal point is a comma, which prevents using it as a field separator. Braindead of course, and a real pain when you want to interface with other software.

Jerome
--
mailto:jeberger free.fr
http://jeberger.free.fr
Jabber: jeberger jabber.fr
May 30 2011
Jonathan M Davis <jmdavisProg gmx.com> writes:
On 2011-05-30 14:40, Jérôme M. Berger wrote:
 Daniel Gibson wrote:
 In plain C (at least on Linux) you have fun locale-dependent in/output
 as well: printf and scanf are locale dependent, so if you use sprintf
 to generate a string you'll write into a file (or fprintf directly) with
 one locale, reading it with scanf functions with another locale will
 fail. Pretty fucking stupid IMHO.
 This was/is(?) a bug in GtkRadiant, a level editor for Quake like games,
 which uses printf or something to write the map files. The map compiler
 will reject them if decimals use a , instead of a . and stuff like that.
 (The workaround is to always use the standard LOCALE, i.e. "LC_ALL=C
 gtkradiant" to start it).

Actually, that is the same issue: Excel outputs numbers to CSV in a locale dependent way (probably using printf), which means that in some locales the decimal point is a comma, which prevents using it as a field separator. Braindead of course, and a real pain when you want to interface with other software.

Well, knowing Microsoft, they probably did it with printf (or fprintf or whatever), not realizing that it had locale issues, but once they figured it out, they wouldn't fix it because that would break backwards compatibility.

- Jonathan M Davis
May 30 2011
Kagamin <spam here.lot> writes:
Daniel Gibson Wrote:

 In plain C (at least on Linux) you have fun locale-dependent in/output
 as well: printf and scanf are locale dependent, so if you use sprintf
 to generate a string you'll write into a file (or fprintf directly) with
 one locale, reading it with scanf functions with another locale will fail.
 Pretty fucking stupid IMHO.

Doesn't the C standard specify the locale to be "C" until you set it explicitly?
May 31 2011
Daniel Gibson <metalcaedes gmail.com> writes:
On 31.05.2011 09:02, Kagamin wrote:
 Daniel Gibson Wrote:

 In plain C (at least on Linux) you have fun locale-dependent in/output
 as well: printf and scanf are locale dependent, so if you use sprintf
 to generate a string you'll write into a file (or fprintf directly) with
 one locale, reading it with scanf functions with another locale will fail.
 Pretty fucking stupid IMHO.

Doesn't the C standard specify the locale to be "C" until you set it explicitly?

At least on Linux it is usually set to whatever you specified on installation (usually you just say "I want a German/English/whatever installation" and the installer then sets the locales to de_DE.UTF8 or whatever). Applications use these settings to decide the language of their menus etc.
May 31 2011
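
Editor's sketch (not from the thread): the locale trap discussed above, and the "pin the locale to C" workaround, can be illustrated as follows. This is Python rather than C, chosen only for illustration. The helper names are made up, and the locale name "de_DE.UTF-8" is an assumption that only works where that locale is installed.

```python
import locale


def format_portable(value: float, digits: int = 2) -> str:
    """Format with '.' as the decimal point regardless of the active locale."""
    return f"{value:.{digits}f}"


def format_localized(value: float, digits: int = 2) -> str:
    """Format using whatever LC_NUMERIC is currently active."""
    return locale.format_string(f"%.{digits}f", value)


def try_set_numeric_locale(name: str) -> bool:
    """Switch LC_NUMERIC if the named locale is installed; report success."""
    try:
        locale.setlocale(locale.LC_NUMERIC, name)
        return True
    except locale.Error:
        return False
```

With LC_NUMERIC set to "C", both functions produce the same text. After a successful `try_set_numeric_locale("de_DE.UTF-8")`, `format_localized` switches to a decimal comma while `format_portable` does not. For data files read by other programs, the locale-independent form (or an explicitly pinned "C" locale) is the safer choice.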
Jonathan M Davis <jmdavisProg gmx.com> writes:
On 2011-05-30 13:20, Simen Kjaeraas wrote:
 On Mon, 30 May 2011 19:57:32 +0200, Jérôme M. Berger <jeberger free.fr>
 wrote:
 Fun fact about Excel generated CSV files: quite apart from encoding
 issues, the separator used between cells depends on the locale: for
 example, in English locales it uses a comma but in French locales it
 uses a semicolon...

 Just thought I'd point it out in case you did not know.

Fun? Gods, it's the most horrible idea I've witnessed in computing. If only they'd call it something other than CSV, at least - Comma Separated Values separated by semicolons? WTF? And the fantastic joy of opening one of those abominations in some other program... *shiver*

Well, then it isn't really CSV anymore. They definitely screwed the French on that one. Oh, you wanted your supposedly universal format to work with other programs? Sorry, no can do. But you can keep using Excel! See, no reason to be unhappy about it. :P

- Jonathan M Davis
May 30 2011
Jacob Carlborg <doob me.com> writes:
On 2011-05-30 19:57, "Jérôme M. Berger" wrote:
 Adam D. Ruppe wrote:
 Kagamin wrote:
 May be, it's his cgi lib? :)
 Client is free to send requests in any encoding, I suppose.

In practice, that hasn't been a problem because browser tend to send requests in the same encoding as the html you served. Since the D always outputs utf8, the browsers all send back utf8 too. The first problem I had was users can upload csv files, which they generally make in Excel... which apparently outputs Windows-1252. Fine for 99% of text, but then someone puts in a curly quote or an em dash and it throws an invalid utf 8 sequence. Converting that is easy enough though.

Fun fact about Excel generated CSV files: quite apart from encoding issues, the separator used between cells depends on the locale: for example, in English locales it uses a comma but in French locales it uses a semicolon... Just thought I'd point it out in case you did not know. Jerome

Yeah, that is a nightmare. I tried SYLK (Symbolic Link) as well. It's something like CSV but more advanced, but it didn't work out that well either. I ended up using real Excel documents with the help of the rubygem "spreadsheet".

--
/Jacob Carlborg
May 30 2011
Jonathan M Davis <jmdavisProg gmx.com> writes:
On 2011-05-29 20:03, Adam D. Ruppe wrote:
 Jonathan M Davis wrote:
 Sometimes it may be necessary to translate from another encoding to
 UTF-8, UTF-16, or UTF-32 when getting input from somewhere, and
 sometimes it may be necessary to translate to another encoding from
 UTF-8, UTF-16, or UTF-32 when outputting somewhere, but it
 certainly isn't the norm.

Translation is all I want. Internally, everything is utf8 strings, but sometimes the program is fed files in another encoding and it needs to handle them too.

Well, likely no one has done it yet because none of the Phobos developers have needed it enough to implement it, and no one outside of them has taken the time to do so and tried to get it into Phobos. And with everything else there is to do, it's the sort of thing that's likely not to get done anytime soon - especially with no feature requests or bug reports on the matter.

Personally, I wasn't even aware that it was an issue. Pure UTF-8 has always worked just fine for me. Presumably, you're running into issues with it because you're actually using D at work.

So, you can either implement it yourself and create a pull request for it, or you can create an enhancement request, and it'll probably get done eventually, but with everything else that needs doing, I don't know how quickly it'll get done.

- Jonathan M Davis
May 29 2011
"Simen Kjaeraas" <simen.kjaras gmail.com> writes:
On Mon, 30 May 2011 19:57:32 +0200, Jérôme M. Berger <jeberger free.fr>
wrote:

 Fun fact about Excel generated CSV files: quite apart from encoding
 issues, the separator used between cells depends on the locale: for
 example, in English locales it uses a comma but in French locales it
 uses a semicolon...

 Just thought I'd point it out in case you did not know.

Fun? Gods, it's the most horrible idea I've witnessed in computing. If only they'd call it something other than CSV, at least - Comma Separated Values separated by semicolons? WTF? And the fantastic joy of opening one of those abominations in some other program... *shiver*

--
Simen
May 30 2011
Sean Kelly <sean invisibleduck.org> writes:
I suggest looking into ICU if you're doing this stuff. I believe there's even a wrapper somewhere in the Mango tree on DSource.

On May 29, 2011, at 7:21 PM, Adam D. Ruppe wrote:

 I've encountered some problems with other charsets recently. Phobos has
 a std.encoding that can do some useful stuff, but there's some
 encodings I've seen in the wild that it can't handle (indeed, it's
 a fairly short list that it does support)

 I used gnu iconv for one of my projects and it works for me, but
 I wonder:

 Is anyone planning to add more charset support to Phobos?
 (alternatively, am I missing something already there?)

 If no, maybe I'll do a few myself. I've never actually written code
 to do this, but it can't be rocket science. I suspect it's more
 tedious than anything else.

May 30 2011