
digitalmars.D - Re: [review] new string type

reply foobar <foo bar.com> writes:
Steven Schveighoffer Wrote:
[snipped]
 3. You have no access to the underlying array unless you're dealing with
 an actual array of dchar.

I thought of adding some kind of access. I wasn't sure the best way. I was thinking of allowing direct access via opCast, because I think casting might be a sufficient red flag to let you know you are crossing into dangerous waters. But it could just be as easy as making the array itself public.

 -Steve

A string type should always maintain the invariant that it is a valid unicode string. Therefore I don't like having an unsafe opCast or providing direct access to the underlying array. I feel that there should be a read-only property for that. Algorithms that manipulate char[]'s should construct a new string instance which will validate that the char[] it is being built from is a valid utf string.

This looks like a great start for a proper string type. There's still the issue of literals that would require compiler/language changes.

There's one other issue that should be considered at some stage: normalization and the fact that a single "character" can be constructed from several code points. (acutes and such)
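The combining-code-point issue raised here is easy to see concretely. A minimal sketch in Python (used only for illustration, since its `unicodedata` module exposes the Unicode normalization forms):

```python
import unicodedata

# "é" as a single precomposed code point vs. base letter + combining acute
precomposed = "\u00e9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed  = "e\u0301"  # 'e' + U+0301 COMBINING ACUTE ACCENT

print(precomposed == decomposed)   # False: the raw code points differ
print(unicodedata.normalize("NFD", precomposed) == decomposed)   # True after NFD
print(len(decomposed))             # 2 code points, but one "character" to a reader
```

Without normalization, two strings that render identically compare unequal, which is exactly why a string type's invariant may want to cover more than UTF validity.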
Dec 01 2010
next sibling parent reply spir <denis.spir gmail.com> writes:
On Wed, 01 Dec 2010 03:30:07 -0500
foobar <foo bar.com> wrote:

 Steven Schveighoffer Wrote:
 [snipped]
 3. You have no access to the underlying array unless you're dealing with
 an actual array of dchar.

 I thought of adding some kind of access. I wasn't sure the best way.
 I was thinking of allowing direct access via opCast, because I think
 casting might be a sufficient red flag to let you know you are crossing
 into dangerous waters.

 But it could just be as easy as making the array itself public.

 -Steve

 A string type should always maintain the invariant that it is a valid
 unicode string. Therefore I don't like having an unsafe opCast or providing
 direct access to the underlying array. I feel that there should be a read-only
 property for that. Algorithms that manipulate char[]'s should construct a
 new string instance which will validate that the char[] it is being built from
 is a valid utf string.

But then, why not store a dchar[] array systematically? Validation and decoding is the same job. Once decoded, all methods work as expected (e.g. s[3] returns the 4th code point) and are blitz fast.

 This looks like a great start for a proper string type. There's still the
 issue of literals that would require compiler/language changes.

Yop...

 There's one other issue that should be considered at some stage: normalization
 and the fact that a single "character" can be constructed from several
 code points. (acutes and such)

This is my next little project. May build on Steve's job. (But it's not necessary, dchar is enough as a base, I guess.)

Denis
-- -- -- -- -- -- --
vit esse estrany ☣

spir.wikidot.com
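The dchar[] trade-off spir suggests can be illustrated with a small Python sketch (a Python str indexes by code point, much like a dchar[] would; the UTF-8 bytes play the role of char[]):

```python
s = "caf\u00e9"                 # "café" with a precomposed é: 4 code points
print(len(s.encode("utf-8")))   # 5 UTF-8 code units: é takes two bytes
print(len(s))                   # 4 code points: O(1) indexing once decoded
print(s[3])                     # é -- here the 4th code point is the 4th character

# ...but only because the é is precomposed; with a combining accent,
# code-point indexing no longer lines up with characters:
t = "cafe\u0301"                # same visible text, 5 code points
print(len(t))                   # 5
```

So decoding to code points buys fast indexing, but as the thread goes on to note, code points still aren't characters once combining marks are involved.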
Dec 01 2010
parent reply stephan <none example.com> writes:
 There's one other issue that should be considered at some stage: normalization
and the fact that a single "character" can be constructed from several code
points. (acutes and such)

This is my next little project. May build on Steve's job. (But it's not necessary, dchar is enough as a base, I guess.)

Hi Denis, you might want to consider helping us out.

We have got a feature-complete Unicode normalization, case-folding, and concatenation implementation passing all test cases in http://unicode.org/Public/6.0.0/ucd/NormalizationTest.txt (and then some) for all recent Unicode versions. This code was part of a bigger project that we have stopped working on.

We feel that the Unicode normalization part might be useful to others. Therefore we consider releasing it under an open source license. Before we can do so, we have to clean up things a bit. Some open issues are:

a) The code still contains some TODOs and FIXMEs (bugs, inefficiencies, some bigger issues like more efficient storing of data etc.).

b) No profiling and no benchmarking against the ICU implementation (http://site.icu-project.org/) has been done yet (we expect surprises).

c) Implementation of additional Unicode algorithms (e.g. full case mapping, matching, collation).

Since we have stopped working on the bigger project, we haven't made much progress. Any help would be welcome. Let me know whether this would be of interest to you.
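For readers unfamiliar with the test file mentioned here: each line of NormalizationTest.txt gives a source string and its four normal forms as semicolon-separated columns of hex code points. A sketch in Python of what one such line asserts (the example line is real data from the file):

```python
import unicodedata

# Columns: c1;c2;c3;c4;c5 where c2 == NFC(c1), c3 == NFD(c1),
# c4 == NFKC(c1), c5 == NFKD(c1).
line = "1E0A;1E0A;0044 0307;1E0A;0044 0307;"  # U+1E0A: D WITH DOT ABOVE

def parse(field):
    """Turn a field like '0044 0307' into the string it denotes."""
    return "".join(chr(int(cp, 16)) for cp in field.split())

c1, c2, c3, c4, c5 = (parse(f) for f in line.split(";")[:5])
assert unicodedata.normalize("NFC", c1) == c2
assert unicodedata.normalize("NFD", c1) == c3    # Ḋ decomposes to D + dot above
assert unicodedata.normalize("NFKC", c1) == c4
assert unicodedata.normalize("NFKD", c1) == c5
```

A conforming implementation has to satisfy these invariants (and several more spelled out in the file's header) for every line.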
Dec 01 2010
parent Ali Çehreli <acehreli yahoo.com> writes:
spir wrote:

 What I have in mind is a "UText" type that provides the right abstraction
 for text processing / string manipulation as one has when dealing with ASCII
 (in fact any legacy character set). All that is needed is having a true
 one-to-one mapping between characters (in the common sense) and elements of
 strings (what I call "code stacks"); one given stack unambiguously denotes
 one character. To reach this point, in addition to decoding (e.g. from utf8
 to code points), we must:

 * group codes into stacks
 * normalize (into 'NFD')
Are those operations independent of the context? Is stacking always desired? I guess one would use one of the D string types when grouping or normalization is not desired, right? Makes sense.
 * sorts points in stacks

Ok, I see that it is possible with NFD. I am not experienced with Unicode, but I think there will be issues with other types of Unicode normalization. (Judging from your posts, I know that you know all these. :) )
 Then, we can for instance index or slice in O(1) as usual, and get a
 consistent substring of _characters_ [...] I do not want to deal with
 anything related to script-, language-, locale-specific issues.
Is the concept of _character_ well defined in Unicode outside of the context of an alphabet? (I think your "script" covers alphabet.)

It is an interesting decision when we actually want to see an array of code points as characters. When would it be correct to do so? I think the answer is when we start treating the string as a piece of text. For a string to be considered as text, it must be based on an alphabet. ASCII strings are pieces of text, because they are based on the 26-letter alphabet.

I hope I don't sound like I'm saying anything against what you said. I am also thinking about the other common operations that work on pieces of text:

- sorting (e.g. ç is between c and d in many alphabets)
- lowercasing, uppercasing (e.g. i<->İ and ı<->I in many alphabets)

As a part of the Turkish D community, we've played with the idea of such a text type. It took advantage of D's support for Unicode encoded source code, so it's fully in Turkish. Yay! :)

Here is the module that takes care of sorting, capitalization, and producing the base forms of the letters of the alphabets:

     http://code.google.com/p/trileri/source/browse/trunk/tr/alfabe.d

It is also based on dchar[], as you recommend elsewhere in this thread. It is written with the older D2 operator overloading, doesn't support ranges, etc. But it currently supports ten alphabets (including the 26-letter English, and the Old Irish alphabet).

Going out of the context of this thread, we've also worked on a type that contains pieces of text from different alphabets to make a "text", where a text like "jim & ali" is correctly capitalized as "JIM & ALİ".

I am thinking of more than what you describe. But your string would be useful for implementing ours, as we don't have normalization or stacking support at all.

Thanks,
Ali
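The Turkish casing pairs listed above are exactly where locale-independent Unicode case mapping breaks down. A small Python sketch (the `turkish_upper` helper is a hypothetical illustration, not the trileri/alfabe.d API):

```python
# Default (locale-independent) Unicode casing maps i -> I, wrong for Turkish:
print("jim & ali".upper())      # JIM & ALI -- Turkish expects İ for i

# A tiny alphabet-aware sketch: apply the Turkish special cases first.
TURKISH_UPPER = {"i": "\u0130", "\u0131": "I"}   # i -> İ (dotted), ı -> I

def turkish_upper(s):
    return "".join(TURKISH_UPPER.get(ch, ch.upper()) for ch in s)

print(turkish_upper("jim & ali"))   # JİM & ALİ
```

Note that applying Turkish rules everywhere also turns the English name "jim" into "JİM"; producing "JIM & ALİ" requires tracking which alphabet each piece of text belongs to, which is precisely what the mixed-alphabet type described above does.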
Dec 03 2010
prev sibling next sibling parent spir <denis.spir gmail.com> writes:
On Wed, 01 Dec 2010 17:41:17 +0100
stephan <none example.com> wrote:

 There's one other issue that should be considered at some stage:
 normalization and the fact that a single "character" can be constructed
 from several code points. (acutes and such)

 This is my next little project. May build on Steve's job. (But it's not
 necessary, dchar is enough as a base, I guess.)

 Hi Denis, you might want to consider helping us out.

 We have got a feature-complete Unicode normalization, case-folding, and
 concatenation implementation passing all test cases in
 http://unicode.org/Public/6.0.0/ucd/NormalizationTest.txt (and then
 some) for all recent Unicode versions. This code was part of a bigger
 project that we have stopped working on.

 We feel that the Unicode normalization part might be useful to others.
 Therefore we consider releasing it under an open source license.
 Before we can do so, we have to clean up things a bit. Some open issues are:

 a) The code still contains some TODOs and FIXMEs (bugs, inefficiencies,
 some bigger issues like more efficient storing of data etc.).

 b) No profiling and no benchmarking against the ICU implementation
 (http://site.icu-project.org/) has been done yet (we expect surprises).

 c) Implementation of additional Unicode algorithms (e.g. full case
 mapping, matching, collation).

 Since we have stopped working on the bigger project, we haven't made
 much progress. Any help would be welcome. Let me know whether this would
 be of interest to you.

Yes, of course it would be useful, in any case. Either you wish to go on with your project, and I may be of some help; or it would anyway be a useful base or example of how to implement Unicode algorithms. Maybe it's time to give some more information on what I intend to write. I have done it already (partially in Python, nearly completely in Lua).

What I have in mind is a "UText" type that provides the right abstraction for text processing / string manipulation as one has when dealing with ASCII (in fact any legacy character set). All that is needed is having a true one-to-one mapping between characters (in the common sense) and elements of strings (what I call "code stacks"); one given stack unambiguously denotes one character. To reach this point, in addition to decoding (e.g. from utf8 to code points), we must:

* group codes into stacks
* normalize (into 'NFD')
* sort points in stacks

That's the base.

Then, we can for instance index or slice in O(1) as usual, and get a consistent substring of _characters_ (not "abstract characters"). We can search for substrings by simple, direct comparisons. When dealing with utf32 strings (or worse, utf8), simple indexing or counting is O(n), or rather O(k.n), where k represents the (high) cost of "stacking", normalizing, and sorting on the fly -- it's not only traversing the whole string instead of random access, it's heavy computation all along the way.

From this base, all kinds of usual routines can be built without any more complexity. That's all I want to implement. I wish to write all general-purpose ones (which means, for instance, nothing like casing).

Precisely, I do not want to deal with anything related to script-, language-, locale-specific issues. It's a completely separate and independent topic. This indeed includes the "compatibility" normalisation forms of Unicode (which precisely do not provide a normal form...). It seems part of your project was to cope with such issues.

I would be happy to cooperate if you feel like going on (then, let us communicate off list). I still have the Lua code (which used to run); even if useless as help for implementation (the languages are too different), it could give a more concrete picture of what I have in mind. Also, it includes several test datasets, reprocessed for usability, from Unicode's online files.

Denis
-- -- -- -- -- -- --
vit esse estrany ☣

spir.wikidot.com
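The "group codes into stacks" step spir describes can be sketched in a few lines of Python. This is only an approximation of his code stacks using combining classes, not the full Unicode grapheme-cluster algorithm:

```python
import unicodedata

def stacks(s):
    """Group NFD code points into 'stacks': a base code point plus any
    following combining marks (combining class != 0)."""
    s = unicodedata.normalize("NFD", s)   # decompose; marks in canonical order
    groups = []
    for ch in s:
        if unicodedata.combining(ch) and groups:
            groups[-1] += ch              # mark: attach to the current stack
        else:
            groups.append(ch)             # base code point: start a new stack
    return groups

g = stacks("g\u00e2t\u00e9")   # "gâté", whatever form the input arrived in
print(len(g))    # 4 stacks == 4 characters
print(g[1])      # 'a' + combining circumflex -- one element, indexable in O(1)
```

Once this pass is done, indexing, slicing, counting, and substring search all operate on whole characters, which is the point of paying the grouping cost once up front.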
Dec 01 2010
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Wed, 01 Dec 2010 03:30:07 -0500, foobar <foo bar.com> wrote:

 Steven Schveighoffer Wrote:
 [snipped]
 3. You have no access to the underlying array unless you're dealing with
 an actual array of dchar.

I thought of adding some kind of access. I wasn't sure the best way. I was thinking of allowing direct access via opCast, because I think casting might be a sufficient red flag to let you know you are crossing into dangerous waters. But it could just be as easy as making the array itself public.

 -Steve

A string type should always maintain the invariant that it is a valid unicode string. Therefore I don't like having an unsafe opCast or providing direct access to the underlying array. I feel that there should be a read-only property for that. Algorithms that manipulate char[]'s should construct a new string instance which will validate the char[] it is being built from is a valid utf string.

Copying is not a good idea, nor is runtime validation. We can only protect the programmer so much. The good news is that the vast majority of strings are literals, which should be properly constructed by the compiler, and immutable.
 This looks like a great start for a proper string type. There's still  
 the issue of literals that would require compiler/language changes.

That is essential, the compiler has to defer the type of string literals to the library somehow.
 There's one other issue that should be considered at some stage:  
 normalization and the fact that a single "character" can be constructed  
 from several code points. (acutes and such)

This is more solvable with a struct, but at this point, I'm not sure if it's worth worrying about. How common is that need?

-Steve
Dec 01 2010
prev sibling parent spir <denis.spir gmail.com> writes:
On Fri, 03 Dec 2010 14:11:35 -0800
Ali Çehreli <acehreli yahoo.com> wrote:

 spir wrote:

  > What I have in mind is a "UText" type that provides the right abstraction
  > for text processing / string manipulation as one has when dealing with ASCII
  > (in fact any legacy character set). All that is needed is having a true
  > one-to-one mapping between characters (in the common sense) and elements of
  > strings (what I call "code stacks"); one given stack unambiguously denotes
  > one character. To reach this point, in addition to decoding (e.g. from utf8
  > to code points), we must:

  > * group codes into stacks
  > * normalize (into 'NFD')

 Are those operations independent of the context? Is stacking always desired?

 I guess one would use one of the D string types when grouping or
 normalization is not desired, right? Makes sense.

We can consider there are two kinds of uses for texts in most applications:

1. just input/output (where input is also from literals in code), possibly with some concatenation.
2. string manipulation / text processing

For the 1st case, there is no need for any sophisticated toolkit like the type I intend to write. We can just read in latin-x, for instance, join it, output it back, without any problems. (As long as all pieces of text share the same encoding.) Problems arise as soon as text is to be manipulated or processed in any other way: indexing, searching, counting, slicing, replacing, etc... all these routines require isolating _characters_, in the ordinary sense of the word, inside the string of code units or code points.

A true text type, usable like ASCII in the old days, would either provide routines that do that in the background, but in a costly way, or first group codes into characters once only -- then every later operation is as cheap as possible. Normalising and sorting are also required so that each character has only one representation.

I intend to write a little article to explain the issue (and the misunderstandings created by Unicode's use of "abstract character"), and possible solutions.

To say it again: if all one needs is text input/output, then using such a tool is overkill. Actually, even the string type Steven is implementing is not strictly necessary. But it would have the advantage, if I understand correctly, of presenting a cleaner interface.
  > * sort points in stacks

 Ok, I see that it is possible with NFD. I am not experienced with
 Unicode, but I think there will be issues with other types of Unicode
 normalization. (Judging from your posts, I know that you know all these. :) )

Yes, the algorithm comes with Unicode's docs about "canonicalisation".
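The "sort points in stacks" step corresponds to Unicode canonical ordering, which NFD performs: within a run of combining marks, marks are sorted by combining class so that equivalent orderings collapse to one form. A quick Python check:

```python
import unicodedata

a = "q\u0323\u0307"   # q + dot below (ccc 220) + dot above (ccc 230)
b = "q\u0307\u0323"   # same marks, opposite order

print(a == b)   # False: different code-point sequences
print(unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b))  # True
print([unicodedata.combining(c) for c in unicodedata.normalize("NFD", b)])
# [0, 220, 230] -- the marks have been re-sorted by combining class
```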
  > Then, we can for instance index or slice in O(1) as usual, and get a
  > consistent substring of _characters_ [...] I do not want to deal with
  > anything related to script-, language-, locale-specific issues.

 Is the concept of _character_ well defined in Unicode outside of the
 context of an alphabet? (I think your "script" covers alphabet.)

 It is an interesting decision when we actually want to see an array of
 code points as characters. When would it be correct to do so? I think
 the answer is when we start treating the string as a piece of text.

 For a string to be considered as text, it must be based on an alphabet.
 ASCII strings are pieces of text, because they are based on the
 26-letter alphabet.

 I hope I don't sound like I'm saying anything against what you said. I am
 also thinking about the other common operations that work on pieces of text:

 - sorting (e.g. ç is between c and d in many alphabets)
 - lowercasing, uppercasing (e.g. i<->İ and ı<->I in many alphabets)

 As a part of the Turkish D community, we've played with the idea of such
 a text type. It took advantage of D's support for Unicode encoded source
 code, so it's fully in Turkish. Yay! :)

 Here is the module that takes care of sorting, capitalization, and
 producing the base forms of the letters of the alphabets:

      http://code.google.com/p/trileri/source/browse/trunk/tr/alfabe.d

 It is also based on dchar[], as you recommend elsewhere in this thread.

 It is written with the older D2 operator overloading, doesn't support
 ranges, etc. But it currently supports ten alphabets (including the
 26-letter English, and the Old Irish alphabet).

 Going out of the context of this thread, we've also worked on a type
 that contains pieces of text from different alphabets to make a "text",
 where a text like "jim & ali" is correctly capitalized as "JIM & ALİ".

 I am thinking of more than what you describe. But your string would be
 useful for implementing ours, as we don't have normalization or stacking
 support at all.

As said, I do not wish to enter the huge area of script-, natural language-, culture-specific issues, because it's not general; I just target a general-purpose tool. My type wouldn't even have a default uppercase routine, for instance: first, it's script specific; second, there is no general definition for that (*) -- even if Unicode provides such data. Sorting issues are even less decidable; it goes down to personal preferences (**). It's also too big, too diverse, too complicated an area.

But I guess the type I have in mind would provide a good basis for such work (or rather, hundreds of language- and domain-specific tools or applications). At least, the issues of grouping codes into characters, and of multiple forms of characters, are solved: "ALİ" would always be represented the same way, and in a logical way; so that if you count its characters, you get 3 for sure, and if you search for 'İ' you find it for sure (which is not true today).
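The last claim, that searching for 'İ' can fail today, is easy to reproduce; a Python sketch:

```python
import unicodedata

# Searching for 'İ' (U+0130) fails when the text happens to carry it in
# decomposed form (I + combining dot above):
text = "JIM & ALI\u0307"         # 'ALİ' with a decomposed İ
print("\u0130" in text)          # False: raw code-point search misses it
print("\u0130" in unicodedata.normalize("NFC", text))   # True once normalized
```

A type that canonicalizes its contents on construction makes the first, surprising result impossible.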
 Thanks,
 Ali

Thanks to you,
Denis

(*) Is the uppercase of "gâté" "GATE" or "GÂTÉ"?
(**) Search for instance the threads about users complaining when KDE decided to sort file names according to a supposedly user-friendly order ("natural", lol!).
-- -- -- -- -- -- --
vit esse estrany ☣

spir.wikidot.com
Dec 03 2010