
digitalmars.D - String Class (was other things)

reply Arcane Jill <Arcane_member pathlink.com> writes:
Just to tidy up a few loose ends...

The Unicode Consortium /claim/ that the definition of a character is "The
smallest component of written language that has semantic value", but in fact, a
more accurate definition would be "a character is anything that the Unicode
Consortium say is a character". This is because when Unicode was first invented,
it wanted to be a superset of all other character standards, so it imported
"characters" from all of those other standards - no matter that those legacy
standards conflicted with each other. This is just "historical baggage", and we
now have to live with it.

A Unicode string can be interpreted on many levels. As you have pointed out,
char[] arrays, etc., work at the level of code units. You say that it would be
more appropriate to work at the character level. That's a reasonable point of
view, but the truth is we can already do this. A simple dchar[] array will do
the job. Or, if you want to stick with char[] arrays, there are functions in
std.utf which can cope with this level of abstraction quite happily.

But I don't think that even the /character/ level is particularly useful,
myself. I'd say you have to go up by at least one further level of abstraction -
to the /grapheme/ level at least, before you start dealing with what we /really/
think of as a string. Let me give you an example. Consider the following two
UTF-8 sequences:

(a) 0x63 0x61 0x66 0xC3 0x89
(b) 0x63 0x61 0x66 0x65 0xCC 0x81

Clearly these are different sequences of code units. They have different
lengths, for a start. If stored in char[] arrays then a.length would be 5 and
b.length would be 6. So now let's look at them, not at the code unit level, but
at the character level. Now we have:

(a) "caf\u00E9"
(b) "cafe\u0301"

Again, these are different sequences of characters. The first one is four
characters long, and contains the character U+00E9 (lowercase e with acute
accent). The second one is five characters long, and contains the character
U+0301 (combining acute accent). These strings would not compare equal either as
char[]s or as dchar[]s. But now let's look at them at the /grapheme/ level. Now
it gets interesting, because here we get:

(a) "café"
(b) "café"

This is what you /see/ when the string is rendered in a plain text editor such
as we programmers use. The strings look alike, and, conceptually, they /are/
identical at this level of abstraction. (They are "canonically equivalent", to
be pedantic). They both contain exactly four graphemes, and the final grapheme
is "é" in both cases. So - are you /sure/ that the number of "characters" is the
RIGHT THING to be focusing on?
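(The difference between the levels is easy to see in D itself. A minimal
sketch, assuming std.utf's toUTF32 and the char[] string literals of the day:)

```d
// Sketch: the two "café" encodings above, compared at the code unit
// and character levels.
import std.utf;

void main()
{
    char[] a = "caf\u00E9";   // (a): 5 UTF-8 code units
    char[] b = "cafe\u0301";  // (b): 6 UTF-8 code units
    assert(a.length == 5 && b.length == 6);   // code unit level

    dchar[] da = toUTF32(a);
    dchar[] db = toUTF32(b);
    assert(da.length == 4 && db.length == 5); // character level

    // Unequal at both of these levels; only at the grapheme level,
    // after canonical normalization, would they compare equal.
    assert(a != b && da != db);
}
```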

But it doesn't even stop there. There's another level of abstraction even beyond
graphemes. These objects are called "glyphs", and they are what you would see in
a non-plain text renderer, such as a web browser or Micro$oft Word. You can make
a glyph by joining one or more graphemes together by using join-controls. For
example, the character/grapheme/glyph 'æ' (U+00E6) is a ligature formed by
ligating 'a' and 'e'. It is the same glyph as the one formed from 'a' followed
by U+200D (ZERO WIDTH JOINER) followed by 'e', but you'll need a good text
rendering engine before they'll look identical.

Different levels are important for different people. Right now, D gives you the
code unit level in char[], wchar[] and dchar[] arrays, and the character level
in dchar[] arrays. Phase 2 of etc.unicode, when it arrives, will give you
functions enabling you to work at the grapheme level. All of these different
conceptual viewpoints could, in theory, be plumbed into the workings of a string
class, but only after the Unicode algorithms are actually implemented in
etc.unicode.

I'm not trying to throw a spanner into the works - I just wanted to challenge
the viewpoint that defining char as a UTF-8 fragment makes D deficient somehow.
I don't believe it does, and, even if that low level is not one that you would
use, there certainly will be programmers who need to work at that level.

Arcane Jill
Jul 28 2004
next sibling parent reply parabolis <parabolis softhome.net> writes:
Arcane Jill wrote:
 Just to tidy up a few loose ends...
 
 The Unicode Consortium /claim/ that the definition of a character is "The
 smallest component of written language that has semantic value", but in fact, a
 more accurate definition would be "a character is anything that the Unicode
 Consortium say is a character". This is because when Unicode was first
invented,
 it wanted to be a superset of all other character standards, so it imported
 "characters" from all of those other standards - no matter that those legacy
 standards conflicted with each other. This is just "historical baggage", and we
 now have to live with it.
 

 [lots of stuff]

the relationships:

     Character has a UTFImplementation
     Grapheme is a Character
     Glyph is a Grapheme

Which suggests an OO implementation of:
================================
interface UTFImplementation { /* ??? */ }
class Character : UTFImplementation { /* ??? */ }
class Grapheme : Character { /* ??? */ }
class Glyph : Grapheme { /* ??? */ }
================================
As I have said, any String class implementation that satisfies the most basic
string operations is a step in the right direction. It sounds like you would
like to see the following implementation:
================================
interface String { /* ??? */ }
class CharacterString : String { /* ??? */ }
class GraphemeString : String { /* ??? */ }
class GlyphString : String { /* ??? */ }

or perhaps

interface String { /* ??? */ }
class CharacterString : String { /* ??? */ }
class GraphemeString : CharacterString { /* ??? */ }
class GlyphString : GraphemeString { /* ??? */ }
================================
 
 Different levels are important for different people. Right now, D gives you the
 code unit level in char[], wchar[] and dchar[] arrays, and the character level
 in dchar[] arrays. Phase 2 of etc.unicode, when it arrives, will give you
 functions enabling you to work at the grapheme level. All of these different
 conceptual viewpoints could, in theory, be plumbed into the workings of a
string
 class, but only after the Unicode algorithms are actually implemented in
 etc.unicode.

You are trying (unsuccessfully) to make use of the fact that UTF-32 code units
(dchar) always correspond to a Character but are missing the fact that they are
still defined as code units in the D Docs Types section:
================================================================
         char   unsigned  8 bit  UTF-8   0xFF
         wchar  unsigned 16 bit  UTF-16  0xFFFF
         dchar  unsigned 32 bit  UTF-32
================================================================
If what you were suggesting were really the case then anybody who has a reason
to use UTF-8 or UTF-16 would have to count the Characters by converting to
UTF-32 and then checking the length of the resulting dchar array. This
implementation is a crude hack that works by virtue of the fact that the code
units in UTF-32 are guaranteed to be 1:1 with the Characters defined by the UC.
 I'm not trying to throw a spanner into the works - I just wanted to challenge
 the viewpoint that defining char as a UTF-8 fragment makes D deficient somehow.

The reason D uses the precise name 'char' is a direct result of D's ability to compile C directly and has nothing to do with the unfortunate and misleading association char sometimes has with Character.
Jul 28 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <ce8mjf$eh3$1 digitaldaemon.com>, parabolis says...

I missed this one. I think you're posting from a time-machine or something,
because the D forum seems to sort your posts into the wrong place.


You are trying (unsuccessfully) to make use of the fact that 
UTF-32 code units (dchar) always correspond to a Character but 
are missing the fact they are still defined as code units in the 
D Docs Types section:
================================================================
         char  unsigned 8 bit UTF-8  0xFF
         wchar  unsigned 16 bit UTF-16  0xFFFF
         dchar  unsigned 32 bit UTF-32
================================================================

Strictly speaking, there are distinctions between (a) an ASCII code unit, (b)
an ASCII codepoint, and (c) an ASCII character. But so what? The mapping
between them is utterly trivial, so nobody pays it a blind bit of notice - and
nor should they. So it is with dchar.
If what you were suggesting were really the case then anybody 
who has a reason to use UTF-8 or UTF-16 would have to count the 
Characters by converting to UTF-32 and then checking the length 
of the resulting dchar array.

That's not the /only/ way of doing it. You could iterate through the chars of a
UTF-8 string and count only the bytes < 0x80 or >= 0xC0. Similarly, you could
iterate through the wchars of a UTF-16 string and count only the wchars
< 0xD800 or >= 0xDC00. That would get you the right answer faster than doing an
actual conversion (no need to allocate memory, etc.). But yes, if you want
indexing by character, use dchar[]. Why is that a problem?
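(The counting technique just described might look like this - a sketch only,
assuming well-formed input that has already passed std.utf.validate:)

```d
// Sketch of the counting technique above: count one code unit per
// character, by counting only UTF-8 lead bytes and by skipping
// UTF-16 lead surrogates (0xD800..0xDBFF).
uint countChars(char[] s)
{
    uint n = 0;
    foreach (char c; s)
        if (c < 0x80 || c >= 0xC0) ++n;  // lead bytes only
    return n;
}

uint countChars(wchar[] s)
{
    uint n = 0;
    foreach (wchar w; s)
        if (w < 0xD800 || w >= 0xDC00) ++n;  // one unit per pair
    return n;
}
```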
This implementation is a crude 
hack

The implementation against which you argued was not one that I suggested. I can
spot a strawman argument a mile away, and I won't fall for it. I suggested
precisely two implementations: (1) the counting technique described above, and
(2) keep everything in dchars throughout - no conversion required.
that works by virtue of the fact that the number of code 
units in UTF32 are guaranteed to be 1:1 with Characters defined 
by the UC.

Isn't that a bit like saying "that works by virtue of the fact that one plus one equals two"? I mean, there /is/ such a guarantee.
The reason D uses the precise name 'char' is a direct result of 
D's ability to compile C directly

?
and has nothing to do with the 
unfortunate and misleading association char sometimes has with 
Character.

Look, it's just jargon, dude. It's just words. It doesn't matter what things
are called, so long as they are well-defined. I'm not greatly in love with all
of the Unicode Consortium's names for things either, to be honest. Even the
regular correspondents in the Unicode public forum will say "glyph" to each
other when the technical term as defined by the UC is actually "default
grapheme cluster".

A bit of common sense is in order here. And why do you keep capitalizing
"character"? - it's very disconcerting.

Jill
Jul 29 2004
parent parabolis <parabolis softhome.net> writes:
Arcane Jill wrote:

 
 
You are trying (unsuccessfully) to make use of the fact that 
UTF-32 code units (dchar) always correspond to a Character but 
are missing the fact they are still defined as code units in the 
D Docs Types section:
================================================================
        char  unsigned 8 bit UTF-8  0xFF
        wchar  unsigned 16 bit UTF-16  0xFFFF
        dchar  unsigned 32 bit UTF-32
================================================================

Strictly speaking, there are distinctions between (a) an ASCII code unit, (b) an ASCII codepoint, and (c) and ASCII character. But so what? The mapping between them is utterly trivial so nobody pays it a blind bit of notice - and nor should they. So it is with dchar.

It is only trivial if you know how to do it. Otherwise UTF-32 data might as
well be encrypted for someone who does not want to bother to spend the time
puzzling out the UTF encodings. My assumption before reading the UTF docs was
that UTF-n were roughly just Unicode mapped variables using 8, 16 or 32 bits. I
would have used char slicing, .length and perhaps even .sort happily until I
came upon an 8-bit character that caused some /really/ obscure bug. I don't
mean to be rude, but for people in such a situation your suggestion that just
using a dchar works seems pedantic in the worst way.

Consider an analogy. Let us suppose that in D any byte[] array access sequence
of 135 followed by 134 would result in all future array index calculations
assuming you meant (.length - index). Let us also suppose that any short[]
array access sequence of 2086 followed by 2085 would produce the same results.
Now let us further assume there are technical reasons for storing array data
using these WTF-8, WTF-16 and WTF-32 encodings (the WTF-32 encoding being
trivially similar to the way normal array access works in other languages). The
fact that there are no 'normal' arrays in D is covered only by a notation in
the doc section about types, and everywhere else they are referred to as byte
arrays and short arrays.

To your suggestion that if you want common sense array results then use an int
array in all cases, I say you are insane. I would not seriously consider using
D if there were no byte or short arrays. More importantly, D would not call
these things arrays, and by using sensible terminology would not have confused
these WTF things with arrays, and thus would have provided support for common
sense arrays *from the start*.
If what you were suggesting were really the case then anybody 
who has a reason to use UTF-8 or UTF-16 would have to count the 
Characters by converting to UTF-32 and then checking the length 
of the resulting dchar array.

That's not the /only/ way of doing it. You could iterate through the chars of a
UTF-8 string and count only the bytes < 0x80 or >= 0xC0. Similarly, you could
iterate through the wchars of a UTF-16 string and count only the wchars
< 0xD800 or >= 0xDC00. That would get you the right answer faster than doing an
actual conversion (no need to allocate memory, etc.).

Quite true ... I could. But only for char/wchar arrays that I have run
std.utf.validate() on and have not written to. Or I could implement my
suggested String class and get all the benefits I believe it will offer. But
these only solve my problem. My point, however, is that if D did not confuse
Character and Code Unit then I am sure the character counting issue would not
have been overlooked, and /something/ sensible would have been implemented much
earlier in the development process than it has.
  
This implementation is a crude 
hack

The implementation against which you argued was not one that I suggested. I can
spot a strawman argument a mile away, and I won't fall for it. I suggested
precisely two implementations: (1) the counting technique described above, and
(2) keep everything in dchars throughout - no conversion required.

I am not arguing against an implementation. I am arguing against /your/ suggestion that dchar[] was an intentional device. I believe basic functionality was overlooked that would have been included if the Character/Code Unit ambiguity did not exist (more on ambiguity below).
 
that works by virtue of the fact that the number of code 
units in UTF32 are guaranteed to be 1:1 with Characters defined 
by the UC.

Isn't that a bit like saying "that works by virtue of the fact that one plus one equals two"? I mean, there /is/ such a guarantee.

 
 Look, it's just jargon, dude. It's just words. It doesn't matter what things
are
 called, so long as they are well-defined. I'm not greatly in love with all of

I agree. But in this case I am saying string is not well-defined. It is used to
mean both a String of Characters and a Unicode String. You were surprised
char[].sort invalidated an encoding. The library implicitly suggests
char[].length and char[i..k] implement length-query and substring. Separating
the two terms in a meaningful way would avoid these kinds of ambiguities.
 A bit of common sense is in order here. And why do you keep capitalizing
 "character"? - it's very disconcerting.

Actually, because it is disconcerting, it is useful for drawing attention to
the fact that I do not mean char or a UTF-n code unit.
Jul 29 2004
prev sibling next sibling parent Daniel Horn <hellcatv hotmail.com> writes:
Seems like more spam filters need to start tokenizing at the grapheme 
level... or perhaps one level higher where char sequences that render 
similarly are compared similarly-- maybe more like text recognition
Arcane Jill wrote:

 think of as a string. Let me give you an example. Consider the following two
 UTF-8 sequences:
 
 (a) 0x63 0x61 0x66 0xC3 0x89
 (b) 0x63 0x61 0x66 0x65 0xCC 0x81
 
 Clearly these are different sequences of code units. They have different
 lengths, for a start. If stored in char[] arrays then a.length would be 5 and
 b.length would be 6. So now let's look at them, not at the code unit level, but
 at the character level. Now we have:
 
 (a) "caf\u00E9"
 (b) "cafe\u0301"
 
 Again, these are different sequences of characters. The first one is four
 characters long, and contains the character U+00E9 (lowercase e with acute
 accent). The second one is five characters long, and contains the character
 U+0301 (combining acute accent). These strings would not compare equal either
as
 char[]s or as dchar[]s. But now let's look at them at the /grapheme/ level. Now
 it gets interesting, because here we get:
 
 (a) "café"
 (b) "café"
 

Jul 28 2004
prev sibling parent reply "Walter" <newshound digitalmars.com> writes:
"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:ce869q$7o5$1 digitaldaemon.com...
 A Unicode string can be interpretted on many levels. As you have pointed

 char[] arrays, etc., work at the level of code units. You say that it

 more appropriate to work at the character level. That's a reasonable point

 view, but the truth is we can already do this. A simple dchar[] array will

 the job. Or, if you want to stick with char[] arrays, there are functions

 std.utf which can cope with this level of abstraction quite happily.

There's a little-known feature of foreach loops:

    char[] s;
    foreach (dchar c; s)
    {
        ... c is each dchar decoded from s[] ...
    }

which is pretty handy for decoding UTF-8 strings. Of course, it also works for
any combination of char, wchar, or dchar for either argument of foreach, making
it pretty convenient for writing generic code.
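(As a quick illustration - a sketch, not from the original post - this feature
makes character counting a short loop:)

```d
// Sketch: counting characters (not code units) with foreach's
// built-in UTF decoding.
uint charCount(char[] s)
{
    uint n = 0;
    foreach (dchar c; s)  // decodes UTF-8 on the fly
        ++n;
    return n;
}
// e.g. charCount("caf\u00E9") == 4, though the char[] has length 5
```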
Jul 28 2004
parent reply parabolis <parabolis softhome.net> writes:
Walter wrote:

std.utf which can cope with this level of abstration quite happily.

There's a little-known feature of foreach loops:

    char[] s;
    foreach (dchar c; s)
    {
        ... c is each dchar decoded from s[] ...
    }

which is pretty handy for decoding UTF-8 strings. Of course, it also works for
any combination of char, wchar, or dchar for either argument of foreach, making
it pretty convenient for writing generic code.

This was actually taken from a thread in which I explained some of the benefits
of a class String implementation. It ended with this assessment:

So to recap in big O terms:
     1) The memory requirements of String are identical to UTF16
     2) For length()
        2a) The time requirements of String are 1
        2b) The time requirements of UTF16 are N
     3) For substring()
        3a) The time requirements of String are 1
        3b) The time requirements of UTF16 are N
Jul 28 2004
next sibling parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <ce8pfd$fm5$1 digitaldaemon.com>, parabolis says...

This was actually taken from a Thread in which I explained some 
of the benefits of a class String implementation. It ended with 
this assessment:

So to recap in big O terms:
     1) The memory requirements of String are identical to UTF16
     2) For length()
        2a) The time requirements of String are 1
        2b) The time requirements of UTF16 are N
     3) For substring()
        3a) The time requirements of String are 1
        3b) The time requirements of UTF16 are N

This confuses me, but perhaps that's because I haven't understood how you
imagine implementing such a class. Example - let me construct a string:

#    dchar[] s = new dchar[0x110000];
#    for (uint i=0; i<0x110000; ++i)
#    {
#        s[(i*0x10001L)%0x110000L] = cast(dchar) i;
#    }

If I've got my maths right, all elements of s will be assigned; all elements of
s will be unique; and to compute the value of s[i] given only i is hard
(requires a modinv(), which is expensive). So, how can any class store s in
such a way as to guarantee the memory and time requirements which you claim?

Arcane Jill
Jul 28 2004
parent reply parabolis <parabolis softhome.net> writes:
Arcane Jill wrote:

 This confuses me, but perhaps that's because I haven't understood how you
 imagine implementing such a class. Example - let me construct a string:
 
 #    dchar[] s = new dchar[0x110000];
 #    for (uint i=0; i<0x110000; ++i)
 #    {
 #        s[(i*0x10001L)%0x110000L] = cast(dchar) i;
 #    }
 
 If I've got my maths right, all elements of s will be assigned; all elements of
 s will be unique; and to compute the value of s[i] given only i is hard
 (requires a modinv(), which is expensive). So, how can any class store s in
such
 a way as to guarantee the memory and time requirements which you claim?
 

I am assuming you accept 2b) and 3b):

     1) The memory requirements of String are identical to UTF16
     2) For length()
        2a) The time requirements of String are 1
        2b) The time requirements of UTF16 are N
     3) For substring()
        3a) The time requirements of String are 1
        3b) The time requirements of UTF16 are N

To show that 1), 2a) and 3a) hold I will make the example a bit simpler: let me
throw away the sparse array portion and just construct a class that stores
Characters with two wchar arrays:
================================================================
class FatString
{
    wchar[] loBits;
    wchar[] hiBits;

    private this( wchar[] lo, wchar[] hi )
    {
        loBits = lo;
        hiBits = hi;
    }

    uint length()
    {
        return loBits.length;
    }

    FatString substring( uint len, uint off )
    {
        return new FatString( loBits[off..(off+len)],
                              hiBits[off..(off+len)] );
    }
}
================================================================
Assuming we have N Characters to store...

The memory complexity is simply:

1) The FatString class uses 2*2*N bytes, or a big O of N. A wchar[] uses at
least 2*N bytes and at most 2*2*N, which is also a big O of N. Thus they have
the same memory requirements.

Now for the time complexity:

2a) length() is a simple example. There is one statement with no tricky side
effects, so length() has a big O time complexity of 1.

3a) substring() is a more interesting example. It is obvious that the big O of
the constructor is 1. The wonder that is array slicing simply creates dynamic
array stubs which point into the appropriate array for the appropriate length.
Again that is a big O of 1. Thus substring() has a big O of 1.
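(The claim about slicing rests on a documented property of D arrays: a slice is
just a new length/pointer stub over the same data, no copy. A tiny sketch:)

```d
// Sketch: D slicing copies no character data, which is what makes a
// slice-based substring() constant-time.
void main()
{
    wchar[] a = new wchar[4096];
    wchar[] b = a[4000 .. 4004];  // new stub, same underlying data
    assert(b.length == 4);
    b[0] = 'x';
    assert(a[4000] == 'x');       // both views share storage
}
```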
Jul 28 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
 #    dchar[] s = new dchar[0x110000];
 #    for (uint i=0; i<0x110000; ++i)
 #    {
 #        s[(i*0x10001L)%0x110000L] = cast(dchar) i;
 #    }


Whoops! I just realized that this example contains invalid characters! Ho hum.
(Hangs head in shame). Let's pretend no-one noticed and change it to:

#    dchar[] s = new dchar[0x110000];
#    for (uint i=0; i<0x110000; ++i)
#    {
#        dchar c = cast(dchar) i;
#        if (isValidDchar(c)) s[(i*0x10001L)%0x110000L] = c;
#    }

This doesn't change the following reasoning in any way, however.

In article <ce915n$jld$1 digitaldaemon.com>, parabolis says...
 How can any class store s in such
 a way as to guarantee the memory and time requirements which you claim?


     1) The memory requirements of String are identical to UTF16

1) The FatString class uses 2*2*N bytes or a big o N.
A wchar[] uses at least 2*N bytes and at most 2*2*N which is 
also a big O N. Thus they have the same memory requirements.

The string s, when converted to wchar[], will require precisely 0x42000 bytes.
Your FatString representing the same string will contain 0x44000 bytes. How is
that identical?

Jill
Jul 28 2004
parent reply parabolis <parabolis softhome.net> writes:
Arcane Jill wrote:

#    dchar[] s = new dchar[0x110000];
#    for (uint i=0; i<0x110000; ++i)
#    {
#        s[(i*0x10001L)%0x110000L] = cast(dchar) i;
#    }


Whoops! I just realized that this example contains invalid characters! Ho hum.
(Hangs head in shame). Let's pretend no-one noticed and change it to:

#    dchar[] s = new dchar[0x110000];
#    for (uint i=0; i<0x110000; ++i)
#    {
#        dchar c = cast(dchar) i;
#        if (isValidDchar(c)) s[(i*0x10001L)%0x110000L] = c;
#    }

This doesn't change the following reasoning in any way, however.

In article <ce915n$jld$1 digitaldaemon.com>, parabolis says...
How can any class store s in such
a way as to guarantee the memory and time requirements which you claim?


    1) The memory requirements of String are identical to UTF16

1) The FatString class uses 2*2*N bytes or a big o N.
A wchar[] uses at least 2*N bytes and at most 2*2*N which is 
also a big O N. Thus they have the same memory requirements.

The string s, when converted to wchar[], will require precisely 0x42000 bytes.
Your FatString representing the same string will contain 0x44000 bytes. How is
that identical?

Jill

If you want to compare precise memory requirements in bytes then they do not
have identical requirements. They have roughly identical requirements (which is
the claim I made in another thread when not talking about big O). The big O
requirements are not precise. They are intended to give a feeling for how
things compare as they get arbitrarily larger, while at the same time
simplifying certain issues.

How many bytes will a wchar[] with N Characters use? If you knew the
probability that the Ith Character requires 2 bytes then the answer would be
simple. However, you do not know P(I). So how are we to compare how the memory
requirements of a wchar[] generally compare to a FatString?
Jul 28 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <ce9583$l7u$1 digitaldaemon.com>, parabolis says...

So how are we to compare how the memory requirements of a 
wchar[] generally compares to a FatString?

sizeof wchar[] string <= sizeof dchar[] string == sizeof FatString

So far, FatString has shown no advantage over dchar[], and it looks a lot more
complicated.

Jill
Jul 28 2004
parent reply parabolis <parabolis softhome.net> writes:
Arcane Jill wrote:
So how are we to compare how the memory requirements of a 
wchar[] generally compares to a FatString?

sizeof wchar[] string <= sizeof dchar[] string == sizeof FatString

I somehow suspected that you would suggest this. Remember FatString is actually
a gross oversimplification of what I suggested. Let's try a less overly
simplified version...

The RLEString

Still not as good as a well written sparse array implementation, but it shows
more of the nature of the implementation I am suggesting, and I believe the
workings of RLE are self evident enough that I do not need to explain how it
works.

How would you compare RLEString if instead of wchar[] for hiBits it used a
uint[] whose entries are a run length encoding of the upper 16 bits of the
Characters in a String?

For N Characters which are all less than u+10000 the wchar[] string uses
exactly 8 bytes less than RLEString. That is an 8 byte overhead no matter how
long the string is. So in this case you save 8 bytes, but should somebody want
the last 4 Characters in a 4K string then a wchar[] string must do 4000 parsing
iterations as opposed to the RLEString which does effectively a single
operation.

But now how do we compare degenerate cases? Does a 4K string with a single
Character above u+FFFF and the associated memory cost of 16 additional bytes
negate the 4000 iteration loop? What about 2? 3? Eventually you are going to
make a value judgment that involves P(I).

Finally do not forget that (from rfc2781):
================================================================
Characters with values greater than 0x10FFFF cannot be encoded
in UTF-16.
================================================================
A great deal of this space overhead is forced because an RLEString can
represent u+00000000 to u+FFFFFFFF inclusively.
Jul 28 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <ce98li$n8v$1 digitaldaemon.com>, parabolis says...
Arcane Jill wrote:
So how are we to compare how the memory requirements of a 
wchar[] generally compares to a FatString?

sizeof wchar[] string <= sizeof dchar[] string == sizeof FatString

I somehow suspected that you would suggest this. Remember FatString is actually a gross oversimplification of what I suggested.

Which is what, exactly?
Lets try a less overly simplified version...

The RLEString

Still not as good as a well written sparse array implementation 
but it shows more of the nature of the implementation I am 
suggesting and I believe the workings of RLE are self evident 
enough that I do not need to explain how it works.

Yes you do. Without an implementation, you can claim anything. Just an algorithm in pseudocode would do.
How would you compare RLEString if instead of wchar[] for hiBits 
it used a uint[] whose entries are a run length encoding of the 
upper 16 bits of the Characters in a String?

Show me an implementation which provably has the characteristics you claim, and obviously the proof will speak for itself. This is logical, Captain.
For N Characters which are all less than u+10000 the wchar[] 
string uses exactly 8 bytes less than RLEString.

Implementation? Algorithm?
That is an 8 byte overhead no matter how long the string is. So 
in this case you save 8 bytes but should sombody want the last 4 
Characters in a 4K string then a wchar[] string must do 4000 
parsing iterations as opposed to the RLEString which does 
effectively a single operation.

I am not clairvoyant. The details of your algorithm have not been specified.
But now how do we compare degenerate cases? Does a 4k string 
with a single Character above u+FFFF and the associated memory 
cost of 16 additional bytes negate the 4000 iteration loop? What 
about 2? 3? Eventually you are going to make a value judgment 
that involves P(I).

And what exactly is P(I)?
Finally do not forget that (from rfc2781) :
================================================================
Characters with values greater than 0x10FFFF cannot be encoded 
in UTF-16.
================================================================

"do not forget"!? As if! Of COURSE they can't, by definition of UTF-16. I don't need an RFC to tell me that. This is, however, irrelevant, because there *are* no characters with codepoints greater than 0x10FFFF. Unless you have discovered some character standard of which I am unaware, and which has even more characters in it than Unicode.
A great deal of this space overhead is forced because an 
RLEString can represent u+00000000 to u+FFFFFFFF includsively.

Obviously nobody can argue with this without a reference implementation.

Summary: Thus far,

* UTF-8 has the edge on memory requirements /if/ the text is mostly ASCII
* UTF-16 has the edge on memory requirements otherwise
* UTF-32 has the edge on speed of random character access

Nothing you've come up with so far has, simultaneously, the memory requirements
of UTF-16 and the lookup speed of UTF-32, which was my understanding of your
original claim.

/Why/ do you want a string class? I ask because, so far, the only /sensible/
arguments in favor of a string class involve the explication of levels of
abstraction not present in char[], wchar[] or dchar[]. I may have misunderstood
you, in which case I apologize, but you seem to want to craft something to
replace char arrays *just because*, and offering no advantage. I don't get it.

Jill
Jul 28 2004
parent reply parabolis <parabolis softhome.net> writes:
Arcane Jill wrote:

 In article <ce98li$n8v$1 digitaldaemon.com>, parabolis says...
I somehow suspected that you would suggest this. Remember 
FatString is actually a gross oversimplification of what I 
suggested.

Which is what, exactly?

As I said in the next paragraph... A sparse array implementation (quoted below):
The RLEString

Still not as good as a well written sparse array implementation 
but it shows more of the nature of the implementation I am 
suggesting and I believe the workings of RLE are self evident 
enough that I do not need to explain how it works.

Yes you do. Without an implementation, you can claim anything. Just an algorithm in pseudocode would do.

RLE is a commonly understood algorithm. If you are not familiar with it then here is the first Google result which explains it:
================================================
The RLE compression algorithm
http://www.prepressure.com/techno/compressionrle.htm

Or perhaps some of these links would help more:

The Stony Brook Algorithm Repository
http://www.cs.sunysb.edu/~algorith/

Dictionary of Algorithms and Data Structures
http://www.nist.gov/dads/

The Complete Collection of Algorithm Animations (CCAA)
http://www.cs.hope.edu/~alganim/ccaa/

Or try searching for "RLE implementation" or "RLE compression implementation."
================================================

I will clarify that I was assuming the entries in the uint[] are to be taken in pairs (with index i even) to mean:

    hiBits[i+0]  the upper 16 bits of a Character
    hiBits[i+1]  the number of times hiBits[i+0] is repeated

So for all Characters less than u+10000 the upper 16 bits can be represented with only two entries (hence my 8 byte overhead):

    hiBits[0] = 0
    hiBits[1] = loBits.length
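The hiBits pairing scheme described above can be sketched concretely. This is an illustrative Python sketch (not D, and the function name is hypothetical): it run-length encodes the upper 16 bits of each code point into a flat list read in pairs of (value, run length).

```python
def rle_encode_hibits(codepoints):
    """Run-length encode the upper 16 bits of each code point.

    Returns a flat list read in pairs: (upper-16-bit value, run length).
    Sketch of the hiBits scheme described in the post above.
    """
    runs = []
    for cp in codepoints:
        hi = cp >> 16
        if runs and runs[-2] == hi:
            runs[-1] += 1          # extend the current run
        else:
            runs.extend([hi, 1])   # start a new run
    return runs

# A string entirely below u+10000 needs just one pair (two uints = 8 bytes):
print(rle_encode_hibits([0x63, 0x61, 0x66, 0xE9]))  # [0, 4]
# One supplementary character in the middle splits the run into three pairs:
print(rle_encode_hibits([0x63, 0x1F600, 0x61]))     # [0, 1, 1, 1, 0, 1]
```

This shows where the degenerate-case cost mentioned below comes from: each isolated character above u+FFFF breaks one run into three, adding two pairs (16 bytes) of overhead.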
But now how do we compare degenerate cases? Does a 4k string 
with a single Character above u+FFFF and the associated memory 
cost of 16 additional bytes negate the 4000 iteration loop? What 
about 2? 3? Eventually you are going to make a value judgment 
that involves P(I).

And what exactly is P(I)?

The second post of mine above this suggests you will have a difficult time judging an implementation without big O notation because a more detailed assessment will eventually stumble on:
================================================================
How many bytes will a wchar with N characters use? If you knew
the probability that the Ith character requires 2 bytes then the
answer would be simple. However you do not know P(I).
================================================================
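The dependence on P(I) can be made concrete with a toy expected-size formula. A hedged Python sketch, assuming each supplementary character costs 2 extra bytes (a surrogate pair) and using an arbitrary 1-in-1000 probability purely for illustration:

```python
def expected_utf16_bytes(n_chars, p_supplementary):
    """Expected UTF-16 size: every character costs 2 bytes, plus 2 more
    (a surrogate pair) with probability p_supplementary -- the unknown P(I)."""
    return n_chars * (2 + 2 * p_supplementary)

# A 4000-character string: all-BMP vs. 1-in-1000 supplementary characters.
print(expected_utf16_bytes(4000, 0.0))           # 8000.0
print(round(expected_utf16_bytes(4000, 0.001)))  # 8008
```

Without knowing the real distribution P(I), any comparison between encodings reduces to guessing this one parameter.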
 
Finally do not forget that (from rfc2781) :
================================================================
Characters with values greater than 0x10FFFF cannot be encoded 
in UTF-16.
================================================================

"do not forget"!? As if! Of COURSE they can't, by definition of UTF-16. I don't need an RFC to tell me that. This is, however, irrelevant, because there *are* no characters with codepoints greater than 0x10FFFF. Unless you have discovered some character standard of which I am unaware, and which has even more characters in it than Unicode.

Let us hope they do not discover they need more bits in the future...
 
 Nothing you've come up with so far has, simultaneously, the memory requirements
 of UTF-16 and the lookup speed of UTF-32, which was my understanding of your
 original claim.

This is invalid reasoning. If my suggestions have so far all failed because a lack of specifics does not allow me to make claims /for/ the implementation, then you certainly cannot say the implementation fails: that would assume you know enough about it to make judgments, which you have refused to do.
 
 /Why/ do you want a string class? I ask, because, so far, the only /sensible/
 arguments in favor of a string class involve the explication of levels of
 abstraction not present in char[], wchar[] or dchar[]. I may have misunderstood
 you, in which case I apologize, but you seem to want to craft something to
 replace char arrays *just because*, and offering no advantage. I don't get it.

Again, you know why I believe a String class would be better, and that my stated reasons have nothing to do with abstraction. I have only cited efficiency concerns.
Jul 29 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <ceate1$1e06$1 digitaldaemon.com>, parabolis says...

The RLEString


RLE is a commonly undertood algorithm. If you are not familiar with it

Of course I am, but being familiar with the concept of run length encoding does not tell me anything about how you imagine a class called RLEString would be implemented.
The second post of mine above this suggests you will have a 
difficult time judging an implementation without big O notation 

Come on! I'm a mathematician and I've been coding for three decades. The science of cryptography, with which I have a passing familiarity, relies heavily upon big O notation. But you are abusing that notation. It measures algorithmic complexity, not memory requirements.

You just can't get away with saying that /memory storage requirements/ are O(n), because to do so is meaningless. Yes, /of course/ the length of a character string is going to be less than or equal to some multiple of the number of characters. Saying that the /memory requirements/ of a string are O(n) tells you nothing. It is a useless piece of information. What you want to know (and compare) is the average number of bytes per character.
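The "average bytes per character" figure is easy to measure for concrete samples. A quick Python sketch (the sample strings are arbitrary; the "-le" codec variants are used to avoid BOM overhead skewing the numbers):

```python
# Average bytes per character for a few sample strings in each UTF encoding.
samples = ["cafe", "café", "日本語"]
for s in samples:
    n = len(s)  # number of code points
    for enc in ("utf-8", "utf-16-le", "utf-32-le"):
        print(f"{s!r:14} {enc:10} {len(s.encode(enc)) / n:.2f} bytes/char")
```

For ASCII text UTF-8 averages 1.0 byte/char, for CJK text it averages 3.0, while UTF-16 stays at 2.0 for all BMP text and UTF-32 at a constant 4.0, which is exactly the tradeoff summarized earlier in the thread.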
 This is, however, irrelevant, because there *are* no characters with
 codepoints greater than 0x10FFFF. Unless you have discovered some character
 standard of which I am unaware, and which has even more characters in it than
 Unicode.

Let us hope they do not discover they need more bits in the future...

With all due respect, that was a naive comment. It is trivial to invent a new encoding. I could do it standing on my head blindfold, /and/ make it backwardly compatible with any existing UTF.

So, alright then, let's imagine a post-Unicode character set with millions and millions of characters. It will utilise new encodings - let's call them XUTF-8, XUTF-16 and XUTF-32 - which will be backwardly compatible with our current UTF family. Applications will upgrade relatively smoothly. Future D will still be okay in this world: it would only have to redefine its char types to use the XUTF encodings instead of the UTF encodings (and that's the great thing about backward compatibility - nothing breaks).

You're getting hung up on academic arguments. The real world runs on practicality. And in any case, that imaginary character set is not going to happen any time soon.
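For what it's worth, such a backward-compatible extension already exists on paper: the original UTF-8 design (RFC 2279) encoded values up to U+7FFFFFFF with 5- and 6-byte sequences, before RFC 3629 restricted it to U+10FFFF. A Python sketch of that scheme, as a stand-in for the hypothetical XUTF-8 (the function name is made up for illustration):

```python
def xutf8_encode(cp: int) -> bytes:
    """Encode a code point using the original pre-RFC-3629 UTF-8 scheme,
    which extends to U+7FFFFFFF -- byte-for-byte compatible with standard
    UTF-8 for all code points up to U+10FFFF."""
    if cp < 0x80:
        return bytes([cp])
    # An n-byte sequence holds 5n+1 payload bits (11, 16, 21, 26, 31).
    for nbytes, lead in ((2, 0xC0), (3, 0xE0), (4, 0xF0), (5, 0xF8), (6, 0xFC)):
        if cp < (1 << (5 * nbytes + 1)):
            out = []
            for _ in range(nbytes - 1):
                out.append(0x80 | (cp & 0x3F))  # continuation byte: 10xxxxxx
                cp >>= 6
            out.append(lead | cp)               # leading byte
            return bytes(reversed(out))
    raise ValueError("code point too large for 31 bits")

print(xutf8_encode(0xE9).hex())       # c3a9 -- identical to standard UTF-8
print(len(xutf8_encode(0x2000000)))   # 5 -- a sequence beyond Unicode's range
```

Nothing below U+10FFFF changes its encoding, which is the "nothing breaks" property described above.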
 Nothing you've come up with so far has, simultaneously, the memory requirements
 of UTF-16 and the lookup speed of UTF-32, which was my understanding of your
 original claim.

This is invalid reasoning. If my suggestions have so far all failed because a lack of specifics does not allow me to make claims /for/ the implementation, then you certainly cannot say the implementation fails: that would assume you know enough about it to make judgments, which you have refused to do.

Fine. So write your class, make it open source and freely distributable, and I guarantee that /if/ people find it useful, they will use it.
Again, you know why I believe a String class would be better, and 
that my stated reasons have nothing to do with abstraction.

Right - but that's precisely /why/ I don't see such a class as being useful. According to the wiki (or some article I read somewhere) there was much discussion here on the D forum about a string class a while back (before I joined it), and consensus came down on the side of "we don't need one". I don't see that changing unless a String class offered some higher level of abstraction than that which can currently be achieved with D's built-in dchar[].
I have only cited efficiency concerns.

So write your class. I have absolutely no problem with my being wrong. (Every time I'm wrong, I learn something new, and that's /great/). Your implementation will be your evidence, and it will speak for itself. Arcane Jill
Jul 29 2004
parent parabolis <parabolis softhome.net> writes:
Arcane Jill wrote:
 In article <ceate1$1e06$1 digitaldaemon.com>, parabolis says...
 
 
The RLEString


I do not believe I am being unreasonable to expect that RLE is a commonly understood algorithm. If you are not familiar with it

Of course I am, but being familiar with the concept of run length encoding does not tell me anything about how you imagine a class called RLEString would be implemented.

No, but telling you it would use a uint[] to implement an RLE encoding of the upper 16 bits of each Character should be sufficient.
 
 
The second post of mine above this suggests you will have a 
difficult time judging an implementation without big O notation 

Come on! I'm a mathematician and I've been coding for three decades. The science of cryptography, with which I have a passing familiarity, relies heavily upon big O notation. But you are abusing that notation. It measures algorithmic complexity, not memory requirements. You just can't get away with saying that

Sorry, but big O is useful (and used) for both. I understand that the results can seem somewhat deceptive, but as I said in my original suggestion: O(1) vs O(N) is a non-trivial improvement, and the memory requirements do not do anything unexpected like being O(N^2). Any finer detail and we will end up debating over values for P(I).
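The O(1) vs O(N) contrast for length() can be made concrete: with a variable-width encoding you must scan all N code units, whereas a class can simply store the character count. A Python sketch counting UTF-8 code points (the function name is illustrative):

```python
def utf8_char_count(data: bytes) -> int:
    """O(N) character count for UTF-8: every byte must be inspected,
    skipping continuation bytes (those whose top two bits are 10)."""
    return sum(1 for b in data if (b & 0xC0) != 0x80)

data = "café".encode("utf-8")
print(len(data))              # 5 code units
print(utf8_char_count(data))  # 4 characters
```

A String class that caches this count pays the O(N) scan once at construction and answers every subsequent length() query in constant time.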
 /memory storage requirements/ are O(n) because to do so is meaningless. Yes /of
 course/ the length of a character string is going to be less than or equal to
 some multiple of the number of characters. Saying that the /memory
requirements/

Actually, I could suggest a storage algorithm that stores any-sized String in 8 bytes. The conditions require some insane assumptions, but it /is/ possible. I only mention it as more of an amusement than anything else. I do get what you meant.
 of a string are O(n) tells you nothing. It is a useless piece of information.
 What you want to know (and compare) is the average number of bytes per
 character.

Yes, which would relate to P(I) as I said before. I would like to know, but it is not possible to find precisely, and it would require a prohibitively immense amount of resources to determine an approximation experimentally.
Let us hope they do not discover they need more bits in the 
future...

With all due respect, that was a naive comment. It is trivial to invent a new

I agree.
Nothing you've come up with so far has, simultaneously, the memory requirements
of UTF-16 and the lookup speed of UTF-32, which was my understanding of your
original claim.

This is invalid reasoning. If my suggestions have so far all failed because a lack of specifics does not allow me to make claims /for/ the implementation, then you certainly cannot say the implementation fails: that would assume you know enough about it to make judgments, which you have refused to do.

Fine. So write your class, make it open source and freely distributable, and I guarantee that /if/ people find it useful, they will use it.

I was actually hoping this discussion would head more in the direction of somebody pointing out that the length and substring methods of a String class do not constitute the whole class itself, and thus you would only want to optimize them under certain circumstances. I was hoping to learn more about how people would use a String class and thus how it should be tuned. I have no class implementation in mind.
Again, you know why I believe a String class would be better, and 
that my stated reasons have nothing to do with abstraction.

Right - but that's precisely /why/ I don't see such a class as being useful. According to the wiki (or some article I read somewhere) there was much discussion here on the D forum about a string class a while back (before I joined it), and consensus came down on the side of "we don't need one". I don't see that changing unless a String class offered some higher level of abstraction than that which can currently be achieved with D's built-in dchar[].

I also did not say that it would /fail/ to offer a level of abstraction. What I was talking about was the bonus that class implementations get over non-class implementations: they get to make assumptions about data and leverage those assumptions for overall better performance.
Jul 29 2004
prev sibling parent reply Sean Kelly <sean f4.ca> writes:
In article <ce8pfd$fm5$1 digitaldaemon.com>, parabolis says...
So to recap in big O terms:
     1) The memory requirements of String are identical to UTF16
     2) For length()
        2a) The time requirements of String are 1
        2b) The time requirements of UTF16 are N
     3) For substring()
        3a) The time requirements of String are 1
        3b) The time requirements of UTF16 are N

Out of curiosity, why care about printable string length anyway? Unless this is a UI component where I might want to restrict the number of printable characters, I can't think of any reason. Sean
Jul 28 2004
next sibling parent Berin Loritsch <bloritsch d-haven.org> writes:
Sean Kelly wrote:

 In article <ce8pfd$fm5$1 digitaldaemon.com>, parabolis says...
 
So to recap in big O terms:
    1) The memory requirements of String are identical to UTF16
    2) For length()
       2a) The time requirements of String are 1
       2b) The time requirements of UTF16 are N
    3) For substring()
       3a) The time requirements of String are 1
       3b) The time requirements of UTF16 are N

Out of curiosity, why care about printable string length anyway? Unless this is a UI component where I might want to restrict the number of printable characters, I can't think of any reason.

Sometimes it helps getting the tail end of a string. For example, separating the extension from a file name:

    String filename = "filename.txt";
    String extension = filename.substring(filename.length() - 4);

(Using Java conventions.) It would be so much easier if there was a "tail" method:

    dchar[] filename = "filename.txt";
    dchar[] extension = tail(filename, 4);

Most operating systems still don't have proper Unicode support, so the fact that a filename might have international characters is not as likely. It just gives a representation of one of the non-printing needs for proper string length detection.
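A sketch of the "tail" helper suggested above, in Python for brevity (the function name follows the post; the behaviour for short strings is an assumption on my part):

```python
def tail(s: str, n: int) -> str:
    """Return the last n characters of s; the whole string if n >= len(s)."""
    return s[-n:] if n > 0 else ""

print(tail("filename.txt", 4))  # ".txt"
```

Note this counts characters from the end, so it only needs the string's length once, rather than a forward scan to compute the substring start index.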
Jul 28 2004
prev sibling parent parabolis <parabolis softhome.net> writes:
Sean Kelly wrote:

 In article <ce8pfd$fm5$1 digitaldaemon.com>, parabolis says...
 
So to recap in big O terms:
    1) The memory requirements of String are identical to UTF16
    2) For length()
       2a) The time requirements of String are 1
       2b) The time requirements of UTF16 are N
    3) For substring()
       3a) The time requirements of String are 1
       3b) The time requirements of UTF16 are N

Out of curiosity, why care about printable string length anyway? Unless this is a UI component where I might want to restrict the number of printable characters, I can't think of any reason.

I tend to deal quite a bit with text data.
Jul 28 2004