
digitalmars.D - String Class (was other things)

reply Arcane Jill <Arcane_member pathlink.com> writes:
Just to tidy up a few loose ends...

The Unicode Consortium /claim/ that the definition of a character is "The
smallest component of written language that has semantic value", but in fact, a
more accurate definition would be "a character is anything that the Unicode
Consortium say is a character". This is because when Unicode was first invented,
it wanted to be a superset of all other character standards, so it imported
"characters" from all of those other standards - no matter that those legacy
standards conflicted with each other. This is just "historical baggage", and we
now have to live with it.

A Unicode string can be interpreted on many levels. As you have pointed out,
char[] arrays, etc., work at the level of code units. You say that it would be
more appropriate to work at the character level. That's a reasonable point of
view, but the truth is we can already do this. A simple dchar[] array will do
the job. Or, if you want to stick with char[] arrays, there are functions in
std.utf which can cope with this level of abstraction quite happily.

But I don't think that even the /character/ level is particularly useful,
myself. I'd say you have to go up by at least one further level of abstraction -
to the /grapheme/ level at least, before you start dealing with what we /really/
think of as a string. Let me give you an example. Consider the following two
UTF-8 sequences:

(a) 0x63 0x61 0x66 0xC3 0x89
(b) 0x63 0x61 0x66 0x65 0xCC 0x81

Clearly these are different sequences of code units. They have different
lengths, for a start. If stored in char[] arrays then a.length would be 5 and
b.length would be 6. So now let's look at them, not at the code unit level, but
at the character level. Now we have:

(a) "caf\u00E9"
(b) "cafe\u0301"

Again, these are different sequences of characters. The first one is four
characters long, and contains the character U+00E9 (lowercase e with acute
accent). The second one is five characters long, and contains the character
U+0301 (combining acute accent). These strings would not compare equal either as
char[]s or as dchar[]s. But now let's look at them at the /grapheme/ level. Now
it gets interesting, because here we get:

(a) "café"
(b) "café"

This is what you /see/ when the string is rendered in a plain text editor such
as we programmers use. The strings look alike, and, conceptually, they /are/
identical at this level of abstraction. (They are "canonically equivalent", to
be pedantic). They both contain exactly four graphemes, and the final grapheme
is "é" in both cases. So - are you /sure/ that the number of "characters" is the
RIGHT THING to be focusing on?
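(The difference between the levels is easy to see in D itself. A minimal
sketch, assuming std.utf's toUTF32 and the char[] string literals of the day:)

```d
// Sketch: the two "café" encodings above, compared at the code unit
// and character levels.
import std.utf;

void main()
{
    char[] a = "caf\u00E9";   // (a): 5 UTF-8 code units
    char[] b = "cafe\u0301";  // (b): 6 UTF-8 code units
    assert(a.length == 5 && b.length == 6);   // code unit level

    dchar[] da = toUTF32(a);
    dchar[] db = toUTF32(b);
    assert(da.length == 4 && db.length == 5); // character level

    // Unequal at both of these levels; only at the grapheme level,
    // after canonical normalization, would they compare equal.
    assert(a != b && da != db);
}
```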

But it doesn't even stop there. There's another level of abstraction even beyond
graphemes. These objects are called "glyphs", and they are what you would see in
a non-plain text renderer, such as a web browser or Micro$oft Word. You can make
a glyph by joining one or more graphemes together by using join-controls. For
example, the character/grapheme/glyph 'æ' (U+00E6) is a ligature formed by
ligating 'a' and 'e'. It is the same glyph as the one formed from 'a' followed
by U+200D (ZERO WIDTH JOINER) followed by 'e', but you'll need a good text
rendering engine before they'll look identical.

Different levels are important for different people. Right now, D gives you the
code unit level in char[], wchar[] and dchar[] arrays, and the character level
in dchar[] arrays. Phase 2 of etc.unicode, when it arrives, will give you
functions enabling you to work at the grapheme level. All of these different
conceptual viewpoints could, in theory, be plumbed into the workings of a string
class, but only after the Unicode algorithms are actually implemented in
etc.unicode.

I'm not trying to throw a spanner into the works - I just wanted to challenge
the viewpoint that defining char as a UTF-8 fragment makes D deficient somehow.
I don't believe it does, and, even if that low level is not one that you would
use, there certainly will be programmers who need to work at that level.

Arcane Jill
Jul 28 2004
next sibling parent reply parabolis <parabolis softhome.net> writes:
Arcane Jill wrote:
 Just to tidy up a few loose ends...
 
 The Unicode Consortium /claim/ that the definition of a character is "The
 smallest component of written language that has semantic value", but in fact, a
 more accurate definition would be "a character is anything that the Unicode
 Consortium say is a character". This is because when Unicode was first
invented,
 it wanted to be a superset of all other character standards, so it imported
 "characters" from all of those other standards - no matter that those legacy
 standards conflicted with each other. This is just "historical baggage", and we
 now have to live with it.
 

 [lots of stuff]

the relationships:

     Character has a UTFImplementation
     Grapheme is a Character
     Glyph is a Grapheme

Which suggests an OO implementation of:
================================
interface UTFImplementation { /* ??? */ }
class Character : UTFImplementation { /* ??? */ }
class Grapheme : Character { /* ??? */ }
class Glyph : Grapheme { /* ??? */ }
================================
As I have said, any String class implementation that satisfies the most basic
string operations is a step in the right direction. It sounds like you would
like to see the following implementation:
================================
interface String { /* ??? */ }
class CharacterString : String { /* ??? */ }
class GraphemeString : String { /* ??? */ }
class GlyphString : String { /* ??? */ }

or perhaps

interface String { /* ??? */ }
class CharacterString : String { /* ??? */ }
class GraphemeString : CharacterString { /* ??? */ }
class GlyphString : GraphemeString { /* ??? */ }
================================
 
 Different levels are important for different people. Right now, D gives you the
 code unit level in char[], wchar[] and dchar[] arrays, and the character level
 in dchar[] arrays. Phase 2 of etc.unicode, when it arrives, will give you
 functions enabling you to work at the grapheme level. All of these different
 conceptual viewpoints could, in theory, be plumbed into the workings of a
string
 class, but only after the Unicode algorithms are actually implemented in
 etc.unicode.

You are trying (unsuccessfully) to make use of the fact that UTF-32 code units
(dchar) always correspond to a Character but are missing the fact that they are
still defined as code units in the D Docs Types section:
================================================================
         char   unsigned  8 bit  UTF-8   0xFF
         wchar  unsigned 16 bit  UTF-16  0xFFFF
         dchar  unsigned 32 bit  UTF-32
================================================================
If what you were suggesting were really the case then anybody who has a reason
to use UTF-8 or UTF-16 would have to count the Characters by converting to
UTF-32 and then checking the length of the resulting dchar array. This
implementation is a crude hack that works by virtue of the fact that the code
units in UTF-32 are guaranteed to be 1:1 with the Characters defined by the UC.
 I'm not trying to throw a spanner into the works - I just wanted to challenge
 the viewpoint that defining char as a UTF-8 fragment makes D deficient somehow.

The reason D uses the precise name 'char' is a direct result of D's ability to compile C directly and has nothing to do with the unfortunate and misleading association char sometimes has with Character.
Jul 28 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <ce8mjf$eh3$1 digitaldaemon.com>, parabolis says...

I missed this one. I think you're posting from a time-machine or something,
because the D forum seems to sort your posts into the wrong place.


You are trying (unsuccessfully) to make use of the fact that 
UTF-32 code units (dchar) always correspond to a Character but 
are missing the fact they are still defined as code units in the 
D Docs Types section:
================================================================
         char  unsigned 8 bit UTF-8  0xFF
         wchar  unsigned 16 bit UTF-16  0xFFFF
         dchar  unsigned 32 bit UTF-32
================================================================

Strictly speaking, there are distinctions between (a) an ASCII code unit, (b)
an ASCII codepoint, and (c) an ASCII character. But so what? The mapping
between them is utterly trivial, so nobody pays it a blind bit of notice - and
nor should they. So it is with dchar.
If what you were suggesting were really the case then anybody 
who has a reason to use UTF-8 or UTF-16 would have to count the 
Characters by converting to UTF-32 and then checking the length 
of the resulting dchar array.

That's not the /only/ way of doing it. You could iterate through the chars of a
UTF-8 string and count only the bytes < 0x80 or >= 0xC0. Similarly, you could
iterate through the wchars of a UTF-16 string and count only the wchars
< 0xD800 or >= 0xDC00. That would get you the right answer faster than doing an
actual conversion (no need to allocate memory, etc.). But yes, if you want
indexing by character, use dchar[]. Why is that a problem?
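(The counting technique just described might look like this - a sketch only,
assuming well-formed input that has already passed std.utf.validate:)

```d
// Sketch of the counting technique above: count one code unit per
// character, by counting only UTF-8 lead bytes and by skipping
// UTF-16 lead surrogates (0xD800..0xDBFF).
uint countChars(char[] s)
{
    uint n = 0;
    foreach (char c; s)
        if (c < 0x80 || c >= 0xC0) ++n;  // lead bytes only
    return n;
}

uint countChars(wchar[] s)
{
    uint n = 0;
    foreach (wchar w; s)
        if (w < 0xD800 || w >= 0xDC00) ++n;  // one unit per pair
    return n;
}
```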
This implementation is a crude 
hack

The implementation against which you argued was not one that I suggested. I can
spot a strawman argument a mile away, and I won't fall for it. I suggested
precisely two implementations: (1) the counting technique described above, and
(2) keep everything in dchars throughout - no conversion required.
that works by virtue of the fact that the number of code 
units in UTF32 are guaranteed to be 1:1 with Characters defined 
by the UC.

Isn't that a bit like saying "that works by virtue of the fact that one plus one equals two"? I mean, there /is/ such a guarantee.
The reason D uses the precise name 'char' is a direct result of 
D's ability to compile C directly

?
and has nothing to do with the 
unfortunate and misleading association char sometimes has with 
Character.

Look, it's just jargon, dude. It's just words. It doesn't matter what things
are called, so long as they are well-defined. I'm not greatly in love with all
of the Unicode Consortium's names for things either, to be honest. Even the
regular correspondents in the Unicode public forum will say "glyph" to each
other when the technical term as defined by the UC is actually "default
grapheme cluster".

A bit of common sense is in order here. And why do you keep capitalizing
"character"? - it's very disconcerting.

Jill
Jul 29 2004
parent parabolis <parabolis softhome.net> writes:
Arcane Jill wrote:

 
 
You are trying (unsuccessfully) to make use of the fact that 
UTF-32 code units (dchar) always correspond to a Character but 
are missing the fact they are still defined as code units in the 
D Docs Types section:
================================================================
        char  unsigned 8 bit UTF-8  0xFF
        wchar  unsigned 16 bit UTF-16  0xFFFF
        dchar  unsigned 32 bit UTF-32
================================================================

Strictly speaking, there are distinctions between (a) an ASCII code unit, (b) an ASCII codepoint, and (c) and ASCII character. But so what? The mapping between them is utterly trivial so nobody pays it a blind bit of notice - and nor should they. So it is with dchar.

It is only trivial if you know how to do it. Otherwise UTF-32 data might as
well be encrypted for someone who does not want to bother to spend the time
puzzling out the UTF encodings. My assumption before reading the UTF docs was
that UTF-n were roughly just Unicode mapped variables using 8, 16 or 32 bits. I
would have used char slicing, .length and perhaps even .sort happily until I
came upon an 8-bit character that caused some /really/ obscure bug. I don't
mean to be rude, but for people in such a situation your suggestion that just
using a dchar works seems pedantic in the worst way.

Consider an analogy. Let us suppose that in D any byte[] array access sequence
of 135 followed by 134 would result in all future array index calculations
assuming you meant (.length - index). Let us also suppose that any short[]
array access sequence of 2086 followed by 2085 would produce the same results.
Now let us further assume there are technical reasons for storing array data
using these WTF-8, WTF-16 and WTF-32 encodings (the WTF-32 encoding being
trivially similar to the way normal array access works in other languages). The
fact that there are no 'normal' arrays in D is covered only by a notation in
the doc section about types, and everywhere else they are referred to as byte
arrays and short arrays.

To your suggestion that if you want common sense array results then use an int
array in all cases, I say you are insane. I would not seriously consider using
D if there were no byte or short arrays. More importantly, D would not call
these things arrays, and by using sensible terminology would not have confused
these WTF things with arrays, and thus would have provided support for common
sense arrays *from the start*.
If what you were suggesting were really the case then anybody 
who has a reason to use UTF-8 or UTF-16 would have to count the 
Characters by converting to UTF-32 and then checking the length 
of the resulting dchar array.

That's not the /only/ way of doing it. You could iterate through the chars of a
UTF-8 string and count only the bytes < 0x80 or >= 0xC0. Similarly, you could
iterate through the wchars of a UTF-16 string and count only the wchars
< 0xD800 or >= 0xDC00. That would get you the right answer faster than doing an
actual conversion (no need to allocate memory, etc.).

Quite true ... I could. But only for char/wchar arrays that I have run
std.utf.validate() on and have not written to. Or I could implement my
suggested String class and get all the benefits I believe it will offer. But
these only solve my problem. My point, however, is that if D did not confuse
Character and Code Unit then I am sure the character counting issue would not
have been overlooked, and /something/ sensible would have been implemented much
earlier in the development process than it has.
  
This implementation is a crude 
hack

The implementation against which you argued was not one that I suggested. I can
spot a strawman argument a mile away, and I won't fall for it. I suggested
precisely two implementations: (1) the counting technique described above, and
(2) keep everything in dchars throughout - no conversion required.

I am not arguing against an implementation. I am arguing against /your/ suggestion that dchar[] was an intentional device. I believe basic functionality was overlooked that would have been included if the Character/Code Unit ambiguity did not exist (more on ambiguity below).
 
that works by virtue of the fact that the number of code 
units in UTF32 are guaranteed to be 1:1 with Characters defined 
by the UC.

Isn't that a bit like saying "that works by virtue of the fact that one plus one equals two"? I mean, there /is/ such a guarantee.

 
 Look, it's just jargon, dude. It's just words. It doesn't matter what things
are
 called, so long as they are well-defined. I'm not greatly in love with all of

I agree. But in this case I am saying string is not well-defined. It is used to
mean both a String of Characters and a Unicode String. You were surprised
char[].sort invalidated an encoding. The library implicitly suggests
char[].length and char[i..k] implement length-query and substring. Separating
the two terms in a meaningful way would avoid these kinds of ambiguities.
 A bit of common sense is in order here. And why do you keep capitalizing
 "character"? - it's very disconcerting.

Actually, because it is disconcerting, it is useful for drawing attention to
the fact that I do not mean char or a UTF-n code unit.
Jul 29 2004
prev sibling next sibling parent Daniel Horn <hellcatv hotmail.com> writes:
Seems like more spam filters need to start tokenizing at the grapheme 
level... or perhaps one level higher where char sequences that render 
similarly are compared similarly-- maybe more like text recognition
Arcane Jill wrote:

 think of as a string. Let me give you an example. Consider the following two
 UTF-8 sequences:
 
 (a) 0x63 0x61 0x66 0xC3 0x89
 (b) 0x63 0x61 0x66 0x65 0xCC 0x81
 
 Clearly these are different sequences of code units. They have different
 lengths, for a start. If stored in char[] arrays then a.length would be 5 and
 b.length would be 6. So now let's look at them, not at the code unit level, but
 at the character level. Now we have:
 
 (a) "caf\u00E9"
 (b) "cafe\u0301"
 
 Again, these are different sequences of characters. The first one is four
 characters long, and contains the character U+00E9 (lowercase e with acute
 accent). The second one is five characters long, and contains the character
 U+0301 (combining acute accent). These strings would not compare equal either
as
 char[]s or as dchar[]s. But now let's look at them at the /grapheme/ level. Now
 it gets interesting, because here we get:
 
 (a) "café"
 (b) "café"
 

Jul 28 2004
prev sibling parent reply "Walter" <newshound digitalmars.com> writes:
"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:ce869q$7o5$1 digitaldaemon.com...
 A Unicode string can be interpretted on many levels. As you have pointed

 char[] arrays, etc., work at the level of code units. You say that it

 more appropriate to work at the character level. That's a reasonable point

 view, but the truth is we can already do this. A simple dchar[] array will

 the job. Or, if you want to stick with char[] arrays, there are functions

 std.utf which can cope with this level of abstraction quite happily.

There's a little-known feature of foreach loops:

    char[] s;
    foreach (dchar c; s)
    {
        ... c is each dchar decoded from s[] ...
    }

which is pretty handy for decoding UTF-8 strings. Of course, it also works for
any combination of char, wchar, or dchar for either argument of foreach, making
it pretty convenient for writing generic code.
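(As a quick illustration - a sketch, not from the original post - this feature
makes character counting a short loop:)

```d
// Sketch: counting characters (not code units) with foreach's
// built-in UTF decoding.
uint charCount(char[] s)
{
    uint n = 0;
    foreach (dchar c; s)  // decodes UTF-8 on the fly
        ++n;
    return n;
}
// e.g. charCount("caf\u00E9") == 4, though the char[] has length 5
```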
Jul 28 2004
parent reply parabolis <parabolis softhome.net> writes:
Walter wrote:

std.utf which can cope with this level of abstration quite happily.

There's a little-known feature of foreach loops:

    char[] s;
    foreach (dchar c; s)
    {
        ... c is each dchar decoded from s[] ...
    }

which is pretty handy for decoding UTF-8 strings. Of course, it also works for
any combination of char, wchar, or dchar for either argument of foreach, making
it pretty convenient for writing generic code.

This was actually taken from a thread in which I explained some of the benefits
of a class String implementation. It ended with this assessment:

So to recap in big O terms:
     1) The memory requirements of String are identical to UTF16
     2) For length()
        2a) The time requirements of String are 1
        2b) The time requirements of UTF16 are N
     3) For substring()
        3a) The time requirements of String are 1
        3b) The time requirements of UTF16 are N
Jul 28 2004
next sibling parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <ce8pfd$fm5$1 digitaldaemon.com>, parabolis says...

This was actually taken from a Thread in which I explained some 
of the benefits of a class String implementation. It ended with 
this assessment:

So to recap in big O terms:
     1) The memory requirements of String are identical to UTF16
     2) For length()
        2a) The time requirements of String are 1
        2b) The time requirements of UTF16 are N
     3) For substring()
        3a) The time requirements of String are 1
        3b) The time requirements of UTF16 are N

This confuses me, but perhaps that's because I haven't understood how you
imagine implementing such a class. Example - let me construct a string:

#    dchar[] s = new dchar[0x110000];
#    for (uint i=0; i<0x110000; ++i)
#    {
#        s[(i*0x10001L)%0x110000L] = cast(dchar) i;
#    }

If I've got my maths right, all elements of s will be assigned; all elements of
s will be unique; and to compute the value of s[i] given only i is hard
(requires a modinv(), which is expensive). So, how can any class store s in
such a way as to guarantee the memory and time requirements which you claim?

Arcane Jill
Jul 28 2004
parent reply parabolis <parabolis softhome.net> writes:
Arcane Jill wrote:

 This confuses me, but perhaps that's because I haven't understood how you
 imagine implementing such a class. Example - let me construct a string:
 
 #    dchar[] s = new dchar[0x110000];
 #    for (uint i=0; i<0x110000; ++i)
 #    {
 #        s[(i*0x10001L)%0x110000L] = cast(dchar) i;
 #    }
 
 If I've got my maths right, all elements of s will be assigned; all elements of
 s will be unique; and to compute the value of s[i] given only i is hard
 (requires a modinv(), which is expensive). So, how can any class store s in
such
 a way as to guarantee the memory and time requirements which you claim?
 

I am assuming you accept 2b) and 3b):

     1) The memory requirements of String are identical to UTF16
     2) For length()
        2a) The time requirements of String are 1
        2b) The time requirements of UTF16 are N
     3) For substring()
        3a) The time requirements of String are 1
        3b) The time requirements of UTF16 are N

To show that 1), 2a) and 3a) hold I will make the example a bit simpler: let me
throw away the sparse array portion and just construct a class that stores
Characters with two wchar arrays:
================================================================
class FatString
{
    wchar[] loBits;
    wchar[] hiBits;

    private this( wchar[] lo, wchar[] hi )
    {
        loBits = lo;
        hiBits = hi;
    }

    uint length()
    {
        return loBits.length;
    }

    FatString substring( uint len, uint off )
    {
        return new FatString( loBits[off..(off+len)],
                              hiBits[off..(off+len)] );
    }
}
================================================================
Assuming we have N Characters to store...

The memory complexity is simply:

1) The FatString class uses 2*2*N bytes, or a big O of N. A wchar[] uses at
least 2*N bytes and at most 2*2*N, which is also a big O of N. Thus they have
the same memory requirements.

Now for the time complexity:

2a) length() is a simple example. There is one statement with no tricky side
effects, so length() has a big O time complexity of 1.

3a) substring() is a more interesting example. It is obvious that the big O of
the constructor is 1. The wonder that is array slicing simply creates dynamic
array stubs which point into the appropriate array for the appropriate length.
Again that is a big O of 1. Thus substring() has a big O of 1.
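(The claim about slicing rests on a documented property of D arrays: a slice is
just a new length/pointer stub over the same data, no copy. A tiny sketch:)

```d
// Sketch: D slicing copies no character data, which is what makes a
// slice-based substring() constant-time.
void main()
{
    wchar[] a = new wchar[4096];
    wchar[] b = a[4000 .. 4004];  // new stub, same underlying data
    assert(b.length == 4);
    b[0] = 'x';
    assert(a[4000] == 'x');       // both views share storage
}
```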
Jul 28 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
 #    dchar[] s = new dchar[0x110000];
 #    for (uint i=0; i<0x110000; ++i)
 #    {
 #        s[(i*0x10001L)%0x110000L] = cast(dchar) i;
 #    }


Whoops! I just realized that this example contains invalid characters! Ho hum.
(Hangs head in shame). Let's pretend no-one noticed and change it to:

#    dchar[] s = new dchar[0x110000];
#    for (uint i=0; i<0x110000; ++i)
#    {
#        dchar c = cast(dchar) i;
#        if (isValidDchar(c)) s[(i*0x10001L)%0x110000L] = c;
#    }

This doesn't change the following reasoning in any way, however.

In article <ce915n$jld$1 digitaldaemon.com>, parabolis says...
 How can any class store s in such
 a way as to guarantee the memory and time requirements which you claim?


     1) The memory requirements of String are identical to UTF16

1) The FatString class uses 2*2*N bytes or a big o N.
A wchar[] uses at least 2*N bytes and at most 2*2*N which is 
also a big O N. Thus they have the same memory requirements.

The string s, when converted to wchar[], will require precisely 0x42000 bytes.
Your FatString representing the same string will contain 0x44000 bytes. How is
that identical?

Jill
Jul 28 2004
parent reply parabolis <parabolis softhome.net> writes:
Arcane Jill wrote:

#    dchar[] s = new dchar[0x110000];
#    for (uint i=0; i<0x110000; ++i)
#    {
#        s[(i*0x10001L)%0x110000L] = cast(dchar) i;
#    }


Whoops! I just realized that this example contains invalid characters! Ho hum.
(Hangs head in shame). Let's pretend no-one noticed and change it to:

#    dchar[] s = new dchar[0x110000];
#    for (uint i=0; i<0x110000; ++i)
#    {
#        dchar c = cast(dchar) i;
#        if (isValidDchar(c)) s[(i*0x10001L)%0x110000L] = c;
#    }

This doesn't change the following reasoning in any way, however.

In article <ce915n$jld$1 digitaldaemon.com>, parabolis says...
How can any class store s in such
a way as to guarantee the memory and time requirements which you claim?


    1) The memory requirements of String are identical to UTF16

1) The FatString class uses 2*2*N bytes or a big o N.
A wchar[] uses at least 2*N bytes and at most 2*2*N which is 
also a big O N. Thus they have the same memory requirements.

The string s, when converted to wchar[], will require precisely 0x42000 bytes.
Your FatString representing the same string will contain 0x44000 bytes. How is
that identical?

Jill

If you want to compare precise memory requirements in bytes then they do not
have identical requirements. They have roughly identical requirements (which is
the claim I made in another thread when not talking about big O). The big O
requirements are not precise. They are intended to give a feeling for how
things compare as they get arbitrarily larger, while at the same time
simplifying certain issues.

How many bytes will a wchar[] with N Characters use? If you knew the
probability that the Ith Character requires 2 bytes then the answer would be
simple. However, you do not know P(I). So how are we to compare how the memory
requirements of a wchar[] generally compare to a FatString?
Jul 28 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <ce9583$l7u$1 digitaldaemon.com>, parabolis says...

So how are we to compare how the memory requirements of a 
wchar[] generally compares to a FatString?

sizeof wchar[] string <= sizeof dchar[] string == sizeof FatString

So far, FatString has shown no advantage over dchar[], and it looks a lot more
complicated.

Jill
Jul 28 2004
parent reply parabolis <parabolis softhome.net> writes:
Arcane Jill wrote:
So how are we to compare how the memory requirements of a 
wchar[] generally compares to a FatString?

sizeof wchar[] string <= sizeof dchar[] string == sizeof FatString

I somehow suspected that you would suggest this. Remember FatString is actually
a gross oversimplification of what I suggested. Let's try a less overly
simplified version...

The RLEString

Still not as good as a well written sparse array implementation, but it shows
more of the nature of the implementation I am suggesting, and I believe the
workings of RLE are self evident enough that I do not need to explain how it
works.

How would you compare RLEString if instead of wchar[] for hiBits it used a
uint[] whose entries are a run length encoding of the upper 16 bits of the
Characters in a String?

For N Characters which are all less than u+10000 the wchar[] string uses
exactly 8 bytes less than RLEString. That is an 8 byte overhead no matter how
long the string is. So in this case you save 8 bytes, but should somebody want
the last 4 Characters in a 4K string then a wchar[] string must do 4000 parsing
iterations as opposed to the RLEString which does effectively a single
operation.

But now how do we compare degenerate cases? Does a 4K string with a single
Character above u+FFFF and the associated memory cost of 16 additional bytes
negate the 4000 iteration loop? What about 2? 3? Eventually you are going to
make a value judgment that involves P(I).

Finally do not forget that (from rfc2781):
================================================================
Characters with values greater than 0x10FFFF cannot be encoded
in UTF-16.
================================================================
A great deal of this space overhead is forced because an RLEString can
represent u+00000000 to u+FFFFFFFF inclusively.
Jul 28 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <ce98li$n8v$1 digitaldaemon.com>, parabolis says...
Arcane Jill wrote:
So how are we to compare how the memory requirements of a 
wchar[] generally compares to a FatString?

sizeof wchar[] string <= sizeof dchar[] string == sizeof FatString

I somehow suspected that you would suggest this. Remember FatString is actually a gross oversimplification of what I suggested.

Which is what, exactly?
Lets try a less overly simplified version...

The RLEString

Still not as good as a well written sparse array implementation 
but it shows more of the nature of the implementation I am 
suggesting and I believe the workings of RLE are self evident 
enough that I do not need to explain how it works.

Yes you do. Without an implementation, you can claim anything. Just an algorithm in pseudocode would do.
How would you compare RLEString if instead of wchar[] for hiBits 
it used a uint[] whose entries are a run length encoding of the 
upper 16 bits of the Characters in a String?

Show me an implementation which provably has the characteristics you claim, and obviously the proof will speak for itself. This is logical, Captain.
For N Characters which are all less than u+10000 the wchar[] 
string uses exactly 8 bytes less than RLEString.

Implementation? Algorithm?
That is an 8 byte overhead no matter how long the string is. So 
in this case you save 8 bytes but should sombody want the last 4 
Characters in a 4K string then a wchar[] string must do 4000 
parsing iterations as opposed to the RLEString which does 
effectively a single operation.

I am not clairvoyant. The details of your algorithm have not been specified.
But now how do we compare degenerate cases? Does a 4k string 
with a single Character above u+FFFF and the associated memory 
cost of 16 additional bytes negate the 4000 iteration loop? What 
about 2? 3? Eventually you are going to make a value judgment 
that involves P(I).

And what exactly is P(I)?
Finally do not forget that (from rfc2781) :
================================================================
Characters with values greater than 0x10FFFF cannot be encoded 
in UTF-16.
================================================================

"do not forget"!? As if! Of COURSE they can't, by definition of UTF-16. I don't need an RFC to tell me that. This is, however, irrelevant, because there *are* no characters with codepoints greater than 0x10FFFF. Unless you have discovered some character standard of which I am unaware, and which has even more characters in it than Unicode.
A great deal of this space overhead is forced because an 
RLEString can represent u+00000000 to u+FFFFFFFF includsively.

Obviously nobody can argue with this without a reference implementation.

Summary: Thus far,

* UTF-8 has the edge on memory requirements /if/ the text is mostly ASCII
* UTF-16 has the edge on memory requirements otherwise
* UTF-32 has the edge on speed of random character access

Nothing you've come up with so far has, simultaneously, the memory requirements
of UTF-16 and the lookup speed of UTF-32, which was my understanding of your
original claim.

/Why/ do you want a string class? I ask because, so far, the only /sensible/
arguments in favor of a string class involve the explication of levels of
abstraction not present in char[], wchar[] or dchar[]. I may have misunderstood
you, in which case I apologize, but you seem to want to craft something to
replace char arrays *just because*, and offering no advantage. I don't get it.

Jill
Jul 28 2004
parent reply parabolis <parabolis softhome.net> writes:
Arcane Jill wrote:

 In article <ce98li$n8v$1 digitaldaemon.com>, parabolis says...
I somehow suspected that you would suggest this. Remember 
FatString is actually a gross oversimplification of what I 
suggested.

Which is what, exactly?

As I said in the next paragraph... A sparse array implementation (quoted below):
The RLEString

Still not as good as a well written sparse array implementation 
but it shows more of the nature of the implementation I am 
suggesting and I believe the workings of RLE are self evident 
enough that I do not need to explain how it works.

Yes you do. Without an implementation, you can claim anything. Just an algorithm in pseudocode would do.

RLE is a commonly understood algorithm. If you are not familiar with it then here is the first Google result which explains it:
================================================
The RLE compression algorithm
http://www.prepressure.com/techno/compressionrle.htm

Or perhaps some of these links would help more:

The Stony Brook Algorithm Repository
http://www.cs.sunysb.edu/~algorith/

Dictionary of Algorithms and Data Structures
http://www.nist.gov/dads/

The Complete Collection of Algorithm Animations (CCAA)
http://www.cs.hope.edu/~alganim/ccaa/

Or try searching for "RLE implementation" or "RLE compression implementation."
================================================

I will clarify that I was assuming the entries in the uint[] are to be taken in pairs (with index i even) to mean:

    hiBits[i+0]  the upper 16 bits of a Character
    hiBits[i+1]  the number of times hiBits[i+0] is repeated

So for all Characters less than u+10000 the upper 16 bits can be represented with only two entries (hence my 8 byte overhead):

    hiBits[0] = 0
    hiBits[1] = loBits.length
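The hiBits pairing scheme described above can be sketched concretely. This is an illustrative Python sketch (not D, and the function name is hypothetical): it run-length encodes the upper 16 bits of each code point into a flat list read in pairs of (value, run length).

```python
def rle_encode_hibits(codepoints):
    """Run-length encode the upper 16 bits of each code point.

    Returns a flat list read in pairs: (upper-16-bit value, run length).
    Sketch of the hiBits scheme described in the post above.
    """
    runs = []
    for cp in codepoints:
        hi = cp >> 16
        if runs and runs[-2] == hi:
            runs[-1] += 1          # extend the current run
        else:
            runs.extend([hi, 1])   # start a new run
    return runs

# A string entirely below u+10000 needs just one pair (two uints = 8 bytes):
print(rle_encode_hibits([0x63, 0x61, 0x66, 0xE9]))  # [0, 4]
# One supplementary character in the middle splits the run into three pairs:
print(rle_encode_hibits([0x63, 0x1F600, 0x61]))     # [0, 1, 1, 1, 0, 1]
```

This shows where the degenerate-case cost mentioned below comes from: each isolated character above u+FFFF breaks one run into three, adding two pairs (16 bytes) of overhead.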
But now how do we compare degenerate cases? Does a 4k string 
with a single Character above u+FFFF and the associated memory 
cost of 16 additional bytes negate the 4000 iteration loop? What 
about 2? 3? Eventually you are going to make a value judgment 
that involves P(I).

And what exactly is P(I)?

The second post of mine above this suggests you will have a difficult time judging an implementation without big O notation because a more detailed assessment will eventually stumble on:
================================================================
How many bytes will a wchar with N characters use? If you knew
the probability that the Ith character requires 2 bytes then the
answer would be simple. However you do not know P(I).
================================================================
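The dependence on P(I) can be made concrete with a toy expected-size formula. A hedged Python sketch, assuming each supplementary character costs 2 extra bytes (a surrogate pair) and using an arbitrary 1-in-1000 probability purely for illustration:

```python
def expected_utf16_bytes(n_chars, p_supplementary):
    """Expected UTF-16 size: every character costs 2 bytes, plus 2 more
    (a surrogate pair) with probability p_supplementary -- the unknown P(I)."""
    return n_chars * (2 + 2 * p_supplementary)

# A 4000-character string: all-BMP vs. 1-in-1000 supplementary characters.
print(expected_utf16_bytes(4000, 0.0))           # 8000.0
print(round(expected_utf16_bytes(4000, 0.001)))  # 8008
```

Without knowing the real distribution P(I), any comparison between encodings reduces to guessing this one parameter.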
 
Finally do not forget that (from rfc2781) :
================================================================
Characters with values greater than 0x10FFFF cannot be encoded 
in UTF-16.
================================================================

"do not forget"!? As if! Of COURSE they can't, by definition of UTF-16. I don't need an RFC to tell me that. This is, however, irrelevant, because there *are* no characters with codepoints greater than 0x10FFFF. Unless you have discovered some character standard of which I am unaware, and which has even more characters in it than Unicode.

Let us hope they do not discover they need more bits in the future...
 
 Nothing you've come up with so far has, simultaneously, the memory requirements
 of UTF-16 and the lookup speed of UTF-32, which was my understanding of your
 original claim.

This is invalid reasoning. If my suggestions have so far all failed because a lack of specifics does not allow me to make claims /for/ the implementation, then you certainly cannot say the implementation fails: that would assume you know enough about it to make judgments, which you have refused to do.
 
 /Why/ do you want a string class? I ask, because, so far, the only /sensible/
 arguments in favor of a string class involve the explication of levels of
 abstraction not present in char[], wchar[] or dchar[]. I may have misunderstood
 you, in which case I apologize, but you seem to want to craft something to
 replace char arrays *just because*, and offering no advantage. I don't get it.

Again, you know why I believe a String class would be better, and that my stated reasons have nothing to do with abstraction. I have only cited efficiency concerns.
Jul 29 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <ceate1$1e06$1 digitaldaemon.com>, parabolis says...

The RLEString


RLE is a commonly undertood algorithm. If you are not familiar with it

Of course I am, but being familiar with the concept of run length encoding does not tell me anything about how you imagine a class called RLEString would be implemented.
The second post of mine above this suggests you will have a 
difficult time judging an implementation without big O notation 

Come on! I'm a mathematician and I've been coding for three decades. The science of cryptography, with which I have a passing familiarity, relies heavily upon big O notation. But you are abusing that notation. It measures algorithmic complexity, not memory requirements.

You just can't get away with saying that /memory storage requirements/ are O(n), because to do so is meaningless. Yes, /of course/ the length of a character string is going to be less than or equal to some multiple of the number of characters. Saying that the /memory requirements/ of a string are O(n) tells you nothing. It is a useless piece of information. What you want to know (and compare) is the average number of bytes per character.
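The "average bytes per character" figure is easy to measure for concrete samples. A quick Python sketch (the sample strings are arbitrary; the "-le" codec variants are used to avoid BOM overhead skewing the numbers):

```python
# Average bytes per character for a few sample strings in each UTF encoding.
samples = ["cafe", "café", "日本語"]
for s in samples:
    n = len(s)  # number of code points
    for enc in ("utf-8", "utf-16-le", "utf-32-le"):
        print(f"{s!r:14} {enc:10} {len(s.encode(enc)) / n:.2f} bytes/char")
```

For ASCII text UTF-8 averages 1.0 byte/char, for CJK text it averages 3.0, while UTF-16 stays at 2.0 for all BMP text and UTF-32 at a constant 4.0, which is exactly the tradeoff summarized earlier in the thread.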
 This is, however, irrelevant, because there *are* no characters with
 codepoints greater than 0x10FFFF. Unless you have discovered some character
 standard of which I am unaware, and which has even more characters in it than
 Unicode.

Let us hope they do not discover they need more bits in the future...

With all due respect, that was a naive comment. It is trivial to invent a new encoding. I could do it standing on my head blindfold, /and/ make it backwardly compatible with any existing UTF.

So, alright then, let's imagine a post-Unicode character set with millions and millions of characters. It will utilise new encodings - let's call them XUTF-8, XUTF-16 and XUTF-32 - which will be backwardly compatible with our current UTF family. Applications will upgrade relatively smoothly. Future D will still be okay in this world: it would only have to redefine its char types to use the XUTF encodings instead of the UTF encodings (and that's the great thing about backward compatibility - nothing breaks).

You're getting hung up on academic arguments. The real world runs on practicality. And in any case, that imaginary character set is not going to happen any time soon.
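For what it's worth, such a backward-compatible extension already exists on paper: the original UTF-8 design (RFC 2279) encoded values up to U+7FFFFFFF with 5- and 6-byte sequences, before RFC 3629 restricted it to U+10FFFF. A Python sketch of that scheme, as a stand-in for the hypothetical XUTF-8 (the function name is made up for illustration):

```python
def xutf8_encode(cp: int) -> bytes:
    """Encode a code point using the original pre-RFC-3629 UTF-8 scheme,
    which extends to U+7FFFFFFF -- byte-for-byte compatible with standard
    UTF-8 for all code points up to U+10FFFF."""
    if cp < 0x80:
        return bytes([cp])
    # An n-byte sequence holds 5n+1 payload bits (11, 16, 21, 26, 31).
    for nbytes, lead in ((2, 0xC0), (3, 0xE0), (4, 0xF0), (5, 0xF8), (6, 0xFC)):
        if cp < (1 << (5 * nbytes + 1)):
            out = []
            for _ in range(nbytes - 1):
                out.append(0x80 | (cp & 0x3F))  # continuation byte: 10xxxxxx
                cp >>= 6
            out.append(lead | cp)               # leading byte
            return bytes(reversed(out))
    raise ValueError("code point too large for 31 bits")

print(xutf8_encode(0xE9).hex())       # c3a9 -- identical to standard UTF-8
print(len(xutf8_encode(0x2000000)))   # 5 -- a sequence beyond Unicode's range
```

Nothing below U+10FFFF changes its encoding, which is the "nothing breaks" property described above.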
 Nothing you've come up with so far has, simultaneously, the memory requirements
 of UTF-16 and the lookup speed of UTF-32, which was my understanding of your
 original claim.

This is invalid reasoning. If my suggestions have so far all failed because a lack of specifics does not allow me to make claims /for/ the implementation, then you certainly cannot say the implementation fails: that would assume you know enough about it to make judgments, which you have refused to do.

Fine. So write your class, make it open source and freely distributable, and I guarantee that /if/ people find it useful, they will use it.
Again, you know why I believe a String class would be better, and 
that my stated reasons have nothing to do with abstraction.

Right - but that's precisely /why/ I don't see such a class as being useful. According to the wiki (or some article I read somewhere) there was much discussion here on the D forum about a string class a while back (before I joined it), and consensus came down on the side of "we don't need one". I don't see that changing unless a String class offered some higher level of abstraction than that which can currently be achieved with D's built-in dchar[].
I have only cited efficiency concerns.

So write your class. I have absolutely no problem with my being wrong. (Every time I'm wrong, I learn something new, and that's /great/). Your implementation will be your evidence, and it will speak for itself. Arcane Jill
Jul 29 2004
parent parabolis <parabolis softhome.net> writes:
Arcane Jill wrote:
 In article <ceate1$1e06$1 digitaldaemon.com>, parabolis says...
 
 
The RLEString


I do not believe I am being unreasonable to expect that RLE is a commonly understood algorithm. If you are not familiar with it

Of course I am, but being familiar with the concept of run length encoding does not tell me anything about how you imagine a class called RLEString would be implemented.

No, but telling you it would use a uint[] to implement an RLE encoding of the upper 16 bits of each Character should be sufficient.
 
 
The second post of mine above this suggests you will have a 
difficult time judging an implementation without big O notation 

Come on! I'm a mathematician and I've been coding for three decades. The science of cryptography, with which I have a passing familiarity, relies heavily upon big O notation. But you are abusing that notation. It measures algorithmic complexity, not memory requirements. You just can't get away with saying that

Sorry, but big O is useful (and used) for both. I understand that the results can seem somewhat deceptive, but as I said in my original suggestion: O(1) vs O(N) is a non-trivial improvement, and the memory requirements do not do anything unexpected like being O(N^2). Any finer detail and we will end up debating over values for P(I).
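The O(1) vs O(N) contrast for length() can be made concrete: with a variable-width encoding you must scan all N code units, whereas a class can simply store the character count. A Python sketch counting UTF-8 code points (the function name is illustrative):

```python
def utf8_char_count(data: bytes) -> int:
    """O(N) character count for UTF-8: every byte must be inspected,
    skipping continuation bytes (those whose top two bits are 10)."""
    return sum(1 for b in data if (b & 0xC0) != 0x80)

data = "café".encode("utf-8")
print(len(data))              # 5 code units
print(utf8_char_count(data))  # 4 characters
```

A String class that caches this count pays the O(N) scan once at construction and answers every subsequent length() query in constant time.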
 /memory storage requirements/ are O(n) because to do so is meaningless. Yes /of
 course/ the length of a character string is going to be less than or equal to
 some multiple of the number of characters. Saying that the /memory
requirements/

Actually, I could suggest a storage algorithm that stores any-sized String in 8 bytes. The conditions require some insane assumptions, but it /is/ possible. I only mention it as more of an amusement than anything else. I do get what you meant.
 of a string are O(n) tells you nothing. It is a useless piece of information.
 What you want to know (and compare) is the average number of bytes per
 character.

Yes, which would relate to P(I) as I said before. I would like to know, but it is not possible to find precisely, and it would require a prohibitively immense amount of resources to determine an approximation experimentally.
Let us hope they do not discover they need more bits in the 
future...

With all due respect, that was a naive comment. It is trivial to invent a new

I agree.
Nothing you've come up with so far has, simultaneously, the memory requirements
of UTF-16 and the lookup speed of UTF-32, which was my understanding of your
original claim.

This is invalid reasoning. If my suggestions have so far all failed because a lack of specifics does not allow me to make claims /for/ the implementation, then you certainly cannot say the implementation fails: that would assume you know enough about it to make judgments, which you have refused to do.

Fine. So write your class, make it open source and freely distributable, and I guarantee that /if/ people find it useful, they will use it.

I was actually hoping this discussion would head more in the direction of somebody pointing out that the length and substring methods of a String class do not constitute the whole class itself, and thus you would only want to optimize them under certain circumstances. I was hoping to learn more about how people would use a String class and thus how it should be tuned. I have no class implementation in mind.
Again, you know why I believe a String class would be better, and 
that my stated reasons have nothing to do with abstraction.

Right - but that's precisely /why/ I don't see such a class as being useful. According to the wiki (or some article I read somewhere) there was much discussion here on the D forum about a string class a while back (before I joined it), and consensus came down on the side of "we don't need one". I don't see that changing unless a String class offered some higher level of abstraction than that which can currently be achieved with D's built-in dchar[].

I also did not say that it would /fail/ to offer a level of abstraction. What I was talking about was the bonus that class implementations get over non-class implementations: they get to make assumptions about data and leverage those assumptions for overall better performance.
Jul 29 2004
prev sibling parent reply Sean Kelly <sean f4.ca> writes:
In article <ce8pfd$fm5$1 digitaldaemon.com>, parabolis says...
So to recap in big O terms:
     1) The memory requirements of String are identical to UTF16
     2) For length()
        2a) The time requirements of String are 1
        2b) The time requirements of UTF16 are N
     3) For substring()
        3a) The time requirements of String are 1
        3b) The time requirements of UTF16 are N

Out of curiosity, why care about printable string length anyway? Unless this is a UI component where I might want to restrict the number of printable characters, I can't think of any reason. Sean
Jul 28 2004
next sibling parent Berin Loritsch <bloritsch d-haven.org> writes:
Sean Kelly wrote:

 In article <ce8pfd$fm5$1 digitaldaemon.com>, parabolis says...
 
So to recap in big O terms:
    1) The memory requirements of String are identical to UTF16
    2) For length()
       2a) The time requirements of String are 1
       2b) The time requirements of UTF16 are N
    3) For substring()
       3a) The time requirements of String are 1
       3b) The time requirements of UTF16 are N

Out of curiosity, why care about printable string length anyway? Unless this is a UI component where I might want to restrict the number of printable characters, I can't think of any reason.

Sometimes it helps getting the tail end of a string. For example, separating the extension from a file name:

    String filename = "filename.txt";
    String extension = filename.substring(filename.length() - 4);

(Using Java conventions.) It would be so much easier if there was a "tail" method:

    dchar[] filename = "filename.txt";
    dchar[] extension = tail(filename, 4);

Most operating systems still don't have proper Unicode support, so the fact that a filename might have international characters is not as likely. It just gives a representation of one of the non-printing needs for proper string length detection.
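A sketch of the "tail" helper suggested above, in Python for brevity (the function name follows the post; the behaviour for short strings is an assumption on my part):

```python
def tail(s: str, n: int) -> str:
    """Return the last n characters of s; the whole string if n >= len(s)."""
    return s[-n:] if n > 0 else ""

print(tail("filename.txt", 4))  # ".txt"
```

Note this counts characters from the end, so it only needs the string's length once, rather than a forward scan to compute the substring start index.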
Jul 28 2004
prev sibling parent parabolis <parabolis softhome.net> writes:
Sean Kelly wrote:

 In article <ce8pfd$fm5$1 digitaldaemon.com>, parabolis says...
 
So to recap in big O terms:
    1) The memory requirements of String are identical to UTF16
    2) For length()
       2a) The time requirements of String are 1
       2b) The time requirements of UTF16 are N
    3) For substring()
       3a) The time requirements of String are 1
       3b) The time requirements of UTF16 are N

Out of curiosity, why care about printable string length anyway? Unless this is a UI component where I might want to restrict the number of printable characters, I can't think of any reason.

I tend to deal quite a bit with text data.
Jul 28 2004