digitalmars.D - Working with utf

"Simen Haugen" <simen norstat.no> writes:

I hate it!

Say we have a string "øl". When I read this from a text file, it is two 
chars, but since this is not a utf8 string, I have to convert it to utf8 before 
I can do any string operations on it.
I can easily live with that. Say we have a file with several lines, and it's 
important that all lines are of equal length.
The string "ol" is two chars, but the string "øl" is 3 chars in utf8. 
Because of this I have to convert it back to latin-1 before checking 
lengths. The same applies to slicing, but even worse.
For all I care, "ø" is one character, not two. If I slice "ø" to get the 
first character, I only get the first half of the character. Isn't it more 
obvious that all string manipulation works with all utf8 characters as one 
character instead of two for values greater than 127?

I cannot find any nice solutions for this, and have to convert to and from 
latin-1/utf8 all the time.

There must be a better way...
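
To make the numbers concrete, here is a minimal sketch of the mismatch. It assumes a current D2 compiler and uses `std.utf.count`; the string is written with an escape so the example does not depend on the source file's encoding.

```d
import std.stdio : writeln;
import std.utf : count;

void main()
{
    string s = "\u00F8l";               // "øl" as UTF-8

    // .length counts UTF-8 code units (bytes), not characters.
    writeln(s.length);                   // 3
    // The raw bytes: 'ø' is C3 B8, 'l' is 6C.
    writeln(cast(const(ubyte)[]) s);     // [195, 184, 108]
    // count walks the string and counts code points.
    writeln(count(s));                   // 2
}
```

Slicing `s[0 .. 1]` here would return only the first byte of "ø", which is the half-character described above.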

Jun 14 2007

Regan Heath <regan netmail.co.nz> writes:

Simen Haugen Wrote:
 I hate it!
 
 Say we have a string "øl". When I read this from a text file, it is two 
 chars, but since this is not a utf8 string, I have to convert it to utf8 before 
 I can do any string operations on it.
 I can easily live with that. Say we have a file with several lines, and it's 
 important that all lines are of equal length.
 The string "ol" is two chars, but the string "øl" is 3 chars in utf8. 
 Because of this I have to convert it back to latin-1 before checking 
 lengths. The same applies to slicing, but even worse.
 For all I care, "ø" is one character, not two. If I slice "ø" to get the 
 first character, I only get the first half of the character. Isn't it more 
 obvious that all string manipulation works with all utf8 characters as one 
 character instead of two for values greater than 127?
 
 I cannot find any nice solutions for this, and have to convert to and from 
 latin-1/utf8 all the time.
 
 There must be a better way...

I think what we want for this is a String class which internally stores the
data as utf-8, 16 or 32 (making its own decision or being told which to use)
and provides slicing of characters as opposed to codepoints.  

Then, all you need is to convert from latin-1 to String, do all your work with
String and convert back to latin-1 only if/when you need to write it back to a
file or similar.

My gut feeling is that this functionality belongs in a class and not the
language itself.  After all, you may want/need to manipulate utf-8, 16, or 32
codepoints directly for some reason.

Regan Heath

Jun 14 2007

"Simen Haugen" <simen norstat.no> writes:

"Regan Heath" <regan netmail.co.nz> wrote in message 
news:f4rd7m$dlb$1 digitalmars.com...
 I think what we want for this is a String class which internally stores 
 the data as utf-8, 16 or 32 (making its own decision or being told which 
 to use) and provides slicing of characters as opposed to codepoints.

 Then, all you need is to convert from latin-1 to String, do all your work 
 with String and convert back to latin-1 only if/when you need to write it 
 back to a file or similar.

 My gut feeling is that this functionality belongs in a class and not the 
 language itself.  After all, you may want/need to manipulate utf-8, 16, or 
 32 codepoints directly for some reason.

 Regan Heath

That would have been a very nice addition. I cannot even count how many 
hard-to-find bugs I've had because of this (both slicing and length).

Utf8 and slicing are both supported by the language, right? To me it sounds more 
like a bug that these won't work together, as I tend to trust that language 
features work.

Jun 14 2007

Frits van Bommel <fvbommel REMwOVExCAPSs.nl> writes:

Regan Heath wrote:
 I think what we want for this is a String class which internally stores the
data as utf-8, 16 or 32 (making its own decision or being told which to use)
and provides slicing of characters as opposed to codepoints.  

Your time away from D (6 months was it?) is showing...
There's such a string implementation at 
http://www.dprogramming.com/dstring.php (Though IIRC it's a struct, not 
a class ;) )

Features:
* Indexing and slicing always work on code point indices, not code 
units.
* Contents are stored as chars, wchars, or dchars (whichever is sufficiently 
large to store every code point in the string as a single code unit).
* The size taken to store a string instance is equal to (char[]).sizeof: 
2 * size_t.sizeof.
* The upper two bits of the size_t containing the length are used as a 
flag for what type the pointer refers to (char/wchar/dchar). This causes 
the maximum length to be cut in 4, but that still allows strings up to 1 
GiB on 32-bit machines, and much bigger on 64-bit machines.
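
The length/flag packing described in the last point can be sketched roughly as follows. This is illustrative only and is not taken from the dstring source: the names `Width`, `pack`, `lengthOf` and `widthOf` and the tag values are assumptions, written for a current D2 compiler.

```d
// Illustrative only: not the actual dstring implementation.
enum Width : size_t { utf8 = 0, utf16 = 1, utf32 = 2 }

// The top two bits of the size_t hold the tag; the rest hold the length.
enum size_t tagShift = size_t.sizeof * 8 - 2;
enum size_t lengthMask = (cast(size_t) 1 << tagShift) - 1;

size_t pack(size_t length, Width w)
{
    assert(length <= lengthMask, "length does not fit in the remaining bits");
    return (cast(size_t) w << tagShift) | length;
}

size_t lengthOf(size_t packed) { return packed & lengthMask; }
Width widthOf(size_t packed) { return cast(Width)(packed >> tagShift); }

void main()
{
    auto p = pack(5, Width.utf16);
    assert(lengthOf(p) == 5);
    assert(widthOf(p) == Width.utf16);
}
```

With a 32-bit `size_t` the mask leaves 30 bits for the length, which is where the 1 Gi limit comes from.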

It can still be a problem that sometimes multiple code points are needed 
to encode one "logical" character; the extra ones for diacritics 
(accents). Above-mentioned size limitation can theoretically be a 
problem, but is probably a rare one. (When was the last time you needed 
more than a billion characters in a single string?)

Basically though, this allows you to pretend it's a dchar[] without the 
memory penalty (it only uses dchars internally if you use characters 
outside the BMP (> 0xFFFF), which are rarely needed in non-Asian languages).

Jun 14 2007

Derek Parnell <derek psych.ward> writes:

On Thu, 14 Jun 2007 14:40:02 +0200, Simen Haugen wrote:

 I hate it!
 
 Say we have a string "øl". When I read this from a text file, it is two 
 chars, but since this is not a utf8 string, I have to convert it to utf8 before 
 I can do any string operations on it.
 I can easily live with that. Say we have a file with several lines, and it's 
 important that all lines are of equal length.
 The string "ol" is two chars, but the string "øl" is 3 chars in utf8. 
 Because of this I have to convert it back to latin-1 before checking 
 lengths. The same applies to slicing, but even worse.
 For all I care, "ø" is one character, not two. If I slice "ø" to get the 
 first character, I only get the first half of the character. Isn't it more 
 obvious that all string manipulation works with all utf8 characters as one 
 character instead of two for values greater than 127?
 
 I cannot find any nice solutions for this, and have to convert to and from 
 latin-1/utf8 all the time.
 
 There must be a better way...

Convert to utf32 (dchar[]) then do your stuff and convert back to latin-1
when you're done. Each dchar[] element is a single character. 

-- 
Derek Parnell
Melbourne, Australia
"Justice for David Hicks!"
skype: derek.j.parnell

Jun 14 2007

"Simen Haugen" <simen norstat.no> writes:

"Derek Parnell" <derek psych.ward> wrote in message 
news:n1j2izm4a0x5.413sc7jjzk2x.dlg 40tude.net...
 Convert to utf32 (dchar[]) then do your stuff and convert back to latin-1
 when you're done. Each dchar[] element is a single character.

You're kidding me, right? Then I only have to convert to utf-32 when reading 
a file, and back to latin-1 when writing. That's great! (except I have to 
modify a lot of char[] to dchar[])

Jun 14 2007

Derek Parnell <derek psych.ward> writes:

On Thu, 14 Jun 2007 15:13:35 +0200, Simen Haugen wrote:

 "Derek Parnell" <derek psych.ward> wrote in message 
 news:n1j2izm4a0x5.413sc7jjzk2x.dlg 40tude.net...
 Convert to utf32 (dchar[]) then do your stuff and convert back to latin-1
 when you're done. Each dchar[] element is a single character.

 
 You're kidding me, right? Then I only have to convert to utf-32 when reading 
 a file, and back to latin-1 when writing. That's great! (except I have to 
 modify a lot of char[] to dchar[])

import std.utf;

dchar[] Y;
char[]  Z;   // must hold valid UTF-8

Y = std.utf.toUTF32(Z);

-- 
Derek Parnell
Melbourne, Australia
"Justice for David Hicks!"
skype: derek.j.parnell

Jun 14 2007

Frits van Bommel <fvbommel REMwOVExCAPSs.nl> writes:

Derek Parnell wrote:
 On Thu, 14 Jun 2007 15:13:35 +0200, Simen Haugen wrote:
 
 "Derek Parnell" <derek psych.ward> wrote in message 
 news:n1j2izm4a0x5.413sc7jjzk2x.dlg 40tude.net...
 Convert to utf32 (dchar[]) then do your stuff and convert back to latin-1
 when you're done. Each dchar[] element is a single character.

 You're kidding me, right? Then I only have to convert to utf-32 when reading 
 a file, and back to latin-1 when writing. That's great! (except I have to 
 modify a lot of char[] to dchar[])

 
 dchar[] Y;
 char[]  Z;
 
  Y = std.utf.toUTF32(Z);

Except his input is encoded as Latin-1, not UTF-8. Conversion is still 
trivial though:
---
auto latin1 = cast(ubyte[]) std.file.read("some_latin-1_file.txt");
dchar[] utf = new dchar[](latin1.length);
for(size_t i = 0; i < latin1.length; i++) {
     utf[i] = latin1[i];
}
---
and the other way around.
(The first 256 code points of Unicode are identical to Latin-1)
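
A minimal sketch of "the other way around", assuming a current D2 compiler. The helper name `utf32ToLatin1` and the choice to throw on characters outside Latin-1 are assumptions made for illustration, not something specified in this thread.

```d
import std.exception : enforce;

/// Hypothetical helper: narrows UTF-32 text to Latin-1 bytes.
ubyte[] utf32ToLatin1(const(dchar)[] text)
{
    auto latin1 = new ubyte[](text.length);
    foreach (i, c; text)
    {
        // Only U+0000 .. U+00FF have a Latin-1 equivalent.
        enforce(c <= 0xFF, "code point not representable in Latin-1");
        latin1[i] = cast(ubyte) c;
    }
    return latin1;
}

void main()
{
    auto bytes = utf32ToLatin1("\u00F8l"d);   // "øl"
    assert(bytes == cast(ubyte[]) [0xF8, 0x6C]);
}
```

Whether to throw, skip, or substitute a placeholder for unrepresentable characters depends on the application.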

Jun 14 2007

"Simen Haugen" <simen norstat.no> writes:

I tested this now, and it works like a charm. This means I can finally get 
rid of all my conversions between utf8 and latin1! (together with all these 
hidden bugs)

Thanks a lot for all your help.

"Frits van Bommel" <fvbommel REMwOVExCAPSs.nl> wrote in message 
news:f4rh01$lkt$2 digitalmars.com...
 Except his input is encoded as Latin-1, not UTF-8. Conversion is still 
 trivial though:
 ---
 auto latin1 = cast(ubyte[]) std.file.read("some_latin-1_file.txt");
 dchar[] utf = new dchar[](latin1.length);
 for(size_t i = 0; i < latin1.length; i++) {
     utf[i] = latin1[i];
 }
 ---
 and the other way around.
 (The first 256 code points of Unicode are identical to Latin-1)

Jun 14 2007

Frits van Bommel <fvbommel REMwOVExCAPSs.nl> writes:

Simen Haugen wrote:
 I tested this now, and it works like a charm. This means I can finally get 
 rid of all my conversions between utf8 and latin1! (together with all these 
 hidden bugs)
 
 Thanks a lot for all your help.
 
 "Frits van Bommel" <fvbommel REMwOVExCAPSs.nl> wrote in message 
 news:f4rh01$lkt$2 digitalmars.com...
 Except his input is encoded as Latin-1, not UTF-8. Conversion is still 
 trivial though:
 ---
 auto latin1 = cast(ubyte[]) std.file.read("some_latin-1_file.txt");
 dchar[] utf = new dchar[](latin1.length);
 for(size_t i = 0; i < latin1.length; i++) {
     utf[i] = latin1[i];
 }
 ---
 and the other way around.
 (The first 256 code points of Unicode are identical to Latin-1) 


If you only ever need to represent Latin-1 (but need string functions, 
not just array functions), wchar[] will also work, and only take half 
the memory.
If you don't need string functions, of course, you can just keep it as 
ubyte[]s the whole time.
(By "string functions" I mean stuff like case conversions, console 
output and so on. In particular, note that slicing & indexing works on 
all arrays, not just strings)

Jun 14 2007

"Simen Haugen" <simen norstat.no> writes:

Except that most functions in the string library take char and not dchar 
parameters.
Then I still have to convert to utf8 whenever I want to use those functions, 
and I'm back where I started.

"Simen Haugen" <simen norstat.no> wrote in message 
news:f4riug$rrt$1 digitalmars.com...
I tested this now, and it works like a charm. This means I can finally get 
rid of all my conversions between utf8 and latin1! (together with all these 
hidden bugs)

 Thanks a lot for all your help.

 "Frits van Bommel" <fvbommel REMwOVExCAPSs.nl> wrote in message 
 news:f4rh01$lkt$2 digitalmars.com...
 Except his input is encoded as Latin-1, not UTF-8. Conversion is still 
 trivial though:
 ---
 auto latin1 = cast(ubyte[]) std.file.read("some_latin-1_file.txt");
 dchar[] utf = new dchar[](latin1.length);
 for(size_t i = 0; i < latin1.length; i++) {
     utf[i] = latin1[i];
 }
 ---
 and the other way around.
 (The first 256 code points of Unicode are identical to Latin-1)

Jun 14 2007

Derek Parnell <derek psych.ward> writes:

On Thu, 14 Jun 2007 15:48:50 +0200, Frits van Bommel wrote:

 Derek Parnell wrote:
 On Thu, 14 Jun 2007 15:13:35 +0200, Simen Haugen wrote:
 
 "Derek Parnell" <derek psych.ward> wrote in message 
 news:n1j2izm4a0x5.413sc7jjzk2x.dlg 40tude.net...
 Convert to utf32 (dchar[]) then do your stuff and convert back to latin-1
 when you're done. Each dchar[] element is a single character.

 You're kidding me, right? Then I only have to convert to utf-32 when reading 
 a file, and back to latin-1 when writing. That's great! (except I have to 
 modify a lot of char[] to dchar[])

 
 dchar[] Y;
 char[]  Z;
 
  Y = std.utf.toUTF32(Z);

 
 Except his input is encoded as Latin-1, not UTF-8. 

I read the OP as saying he was already converting Latin-1 to utf8 and was
now concerned about converting utf8 to utf32, thus I gave that toUTF32()
hint. 

 Conversion is still 
 trivial though:
 ---
 auto latin1 = cast(ubyte[]) std.file.read("some_latin-1_file.txt");
 dchar[] utf = new dchar[](latin1.length);
 for(size_t i = 0; i < latin1.length; i++) {
      utf[i] = latin1[i];
 }
 ---
 and the other way around.
 (The first 256 code points of Unicode are identical to Latin-1)

I was not aware of that. So if one needs to convert from Latin-1 to utf8
...

  import std.utf;

   dchar[] Latin1toUTF32(ubyte[] pLatin1Text)
   {
       dchar[] utf;

       utf.length = pLatin1Text.length;
       foreach(i, b; pLatin1Text)
              utf[i] = b;
       return utf;
   }

   char[] Latin1toUTF8(ubyte[] pLatin1Text)
   {
       return std.utf.toUTF8(Latin1toUTF32(pLatin1Text));
   }

import std.stdio;

void main()
{
    ubyte[] td;

    td.length = 256;
    for (int i = 0; i < 256; i++)
       td[i] = cast(ubyte) i;  // explicit narrowing from int to ubyte

    // On windows, set the code page to 65001 
    // and the font to Lucida Console.
    // eg. C:\> chcp 65001
    //     Active code page: 65001
    std.stdio.writefln("%s", Latin1toUTF8(td));
}
-- 
Derek Parnell
Melbourne, Australia
"Justice for David Hicks!"
skype: derek.j.parnell

Jun 14 2007

Frits van Bommel <fvbommel REMwOVExCAPSs.nl> writes:

Derek Parnell wrote:
 On Thu, 14 Jun 2007 15:48:50 +0200, Frits van Bommel wrote:
 
 (The first 256 code points of Unicode are identical to Latin-1)

 
 I was not aware of that. So if one needs to convert from Latin-1 to utf8
 ...
 
   import std.utf;
 
    dchar[] Latin1toUTF32(ubyte[] pLatin1Text)
    {
        dchar[] utf;
 
        utf.length = pLatin1Text.length;
        foreach(i, b; pLatin1Text)
               utf[i] = b;
        return utf;
    }
 
    char[] Latin1toUTF8(ubyte[] pLatin1Text)
    {
        return std.utf.toUTF8(Latin1toUTF32(pLatin1Text));
    }

That'd work, but will allocate more memory than required (5 to 6 times 
the length of the Latin-1 text worth of allocation - 4 times for the 
utf-32, plus 1 to 2 times for the utf-8). How about this:
---
import std.utf;

char[] Latin1toUTF8(ubyte[] lat1) {
     char[] utf8;
     // preallocate
     utf8.length = lat1.length;
     /* optionally preallocate up to 2 * lat1.length characters
        instead (you'll never need more than that).
      */
     utf8.length = 0;
     foreach (latchar; lat1) {
         // appends latchar to utf8 as one or two UTF-8 code units
         utf8.encode(latchar);
     }
     return utf8;
}
---
This should allocate 1 to 3 times the length of the Latin-1 text: 1 time 
the length as initial allocation, plus a doubling on reallocation if 
there are any non-ascii characters. (If I remember the allocation policy 
correctly)
It'll be 2 times the Latin-1 length if you preallocate that beforehand.

All memory allocation sizes calculated above exclude whatever extra 
memory the allocator adds to get a nice round bin-size of course, so 
this is more of an estimate; it'll likely be a bit more.

Jun 14 2007

Oskar Linde <oskar.lindeREM OVEgmail.com> writes:

Simen Haugen skrev:
 I hate it!
 
 Say we have a string "øl". When I read this from a text file, it is two 
 chars, but since this is not a utf8 string, I have to convert it to utf8 before 
 I can do any string operations on it.
 I can easily live with that. Say we have a file with several lines, and it's 
 important that all lines are of equal length.
 The string "ol" is two chars, but the string "øl" is 3 chars in utf8. 
 Because of this I have to convert it back to latin-1 before checking 
 lengths. The same applies to slicing, but even worse.
 For all I care, "ø" is one character, not two. If I slice "ø" to get the 
 first character, I only get the first half of the character. Isn't it more 
 obvious that all string manipulation works with all utf8 characters as one 
 character instead of two for values greater than 127?
 
 I cannot find any nice solutions for this, and have to convert to and from 
 latin-1/utf8 all the time.
 
 There must be a better way...

The solution is simple. If all your data is latin-1, and your 
requirements are stated in the form of "number of latin-1 units", just 
use latin-1 as the encoding.

typedef ubyte latin1_char;
alias latin1_char[] latin1_string;
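
For readers on current D2, where the `typedef` keyword has been removed, a rough equivalent using plain aliases might look like the sketch below. The names are illustrative, and unlike `typedef` an alias does not create a distinct type, so a `Latin1String` is interchangeable with `ubyte[]`.

```d
alias Latin1Char = ubyte;
alias Latin1String = Latin1Char[];

void main()
{
    // "øl" in Latin-1: exactly one byte per character.
    Latin1String s = [0xF8, 0x6C];

    assert(s.length == 2);            // length is the character count
    assert(s[0] == 0xF8);             // indexing yields a whole character
    assert(s[0 .. 1].length == 1);    // slicing cannot split a character
}
```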

/Oskar

Jun 14 2007
