www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - RFC: Case-Insensitive Strings (And usually they really do *have* case)

reply "Nick Sabalausky" <a a.a> writes:
Imagine things like Windows filenames or tokenized BASIC code (There's 
probably plenty of other examples too, for instance, I came across this 
issue when dealing with GOLD's pre-defined character set names: 
http://www.devincook.com/goldparser/doc/grammars/character-sets.htm ).

These things *do* have a specific case, but the vast majority of time their 
*comparisons* should be done insensitively. This can lead to awkwardness. 
For instance, how do you store it?

A. If you store it as lower-case: You lose the information of the original 
case. The casing information may not be important for comparisons, but it 
*is* an inherent part of the data. For instance, if the string gets output 
in some way, it's no longer what it originally was.

B. If you store it preserving case: You have to make sure to remember to 
convert to lower-case on most comparisons. Which has three problems:
    1. Easy to forget and cause a bug.
    2. May end up converting lower-case more often than really needed.
    3. What if the comparison occurs inside a routine that has no idea what 
it's supposed to be? You'd have to remember to convert to lower-case when 
passing to that routine. And then within the routines you're back to the 
problems of strategy "A".

C. If you store both the original and lower-case versions: You have to take 
special care to maintain two separate variables and keep them both in-sync.

All three options are kinda sucky. And in any way, how to do distinguish 
between variables that are suposed to be case-insensitive and ones that 
aren't? Naming convention?

I think treating case-sensitivity as an attribute of the data rather than 
the comparison, and utilizing the type-system, cleans things up 
considerably. So I've created a struct to represent a "case-insensitive 
string" (templated for string, wstring and dstring). I think it would be a 
good thing to have in phobos, and would like to submit it for review.

It's currently named Insensitive/WInsensitive/DInsensitive, but I'm thinking 
now that maybe it should be renamed something like 
istring/iwstring/idstring. They're used like this:

// Create
auto normalString = "Hello";
auto insensitiveString = Insensitive("Hello");

// Convert
normalString = insensitiveString.toString();
insensitiveString = Insensitive(normalString);

// Compare
assert(Insensitive("Hello") == Insensitive("hELLo"));

// Mixed-comparisons are deliberately disallowed at
// compile-time since the intent is ambiguous:
assert(normalString == insensitiveString); // Compile Error!

// To disambiguate:
assert(Insensitive(normalString) == insensitiveString)); // Case-Insensitive
assert(normalString == insensitiveString.toString())); // Case-Sensitive

// Preserves original case:
writeln(Insensitive("hELLo")); // Output: hELLo

It also works as expected as the key of an assoc array (which I've 
personally found useful). Also, it doesn't convert to lower-case until it 
actually needs to, and once it does, it caches it.

Here is the source:
http://www.dsource.org/projects/semitwist/browser/trunk/src/semitwist/util/text.d#L758

While it is in the middle of my big grab-bag library, I'm pretty sure it 
doesn't rely on anything that's not in phobos (unless I've missed something, 
in which case, I can correct). The unittests for it do use my own 
unittesting routines, but it should be trivial to see how they'd convert to 
unittest{}, assert() and assertPred.
Jan 09 2011
parent reply Jim <bitcirkel yahoo.com> writes:
I'm a firm believer of alternative B: Store the string with its original case,
unless it's particularly important to do otherwise.

The cost of case-insensitive comparison is REALLY small. Anytime you are to
compare two strings ask yourself whether case-sensitive or case-insensitive is
what you need. Have no inclination to prefer one type of comparison to the
other. Problem solved. Bloat avoided.


Creating specific types of strings that carry with them data on how they are to
be interpreted is over-engineering, solving a problem that doesn't exist.


Nick Sabalausky Wrote:

 Imagine things like Windows filenames or tokenized BASIC code (There's 
 probably plenty of other examples too, for instance, I came across this 
 issue when dealing with GOLD's pre-defined character set names: 
 http://www.devincook.com/goldparser/doc/grammars/character-sets.htm ).
 
 These things *do* have a specific case, but the vast majority of time their 
 *comparisons* should be done insensitively. This can lead to awkwardness. 
 For instance, how do you store it?
 
 A. If you store it as lower-case: You lose the information of the original 
 case. The casing information may not be important for comparisons, but it 
 *is* an inherent part of the data. For instance, if the string gets output 
 in some way, it's no longer what it originally was.
 
 B. If you store it preserving case: You have to make sure to remember to 
 convert to lower-case on most comparisons. Which has three problems:
     1. Easy to forget and cause a bug.
     2. May end up converting lower-case more often than really needed.
     3. What if the comparison occurs inside a routine that has no idea what 
 it's supposed to be? You'd have to remember to convert to lower-case when 
 passing to that routine. And then within the routines you're back to the 
 problems of strategy "A".
 
 C. If you store both the original and lower-case versions: You have to take 
 special care to maintain two separate variables and keep them both in-sync.
 
 All three options are kinda sucky. And in any way, how to do distinguish 
 between variables that are suposed to be case-insensitive and ones that 
 aren't? Naming convention?
 
 I think treating case-sensitivity as an attribute of the data rather than 
 the comparison, and utilizing the type-system, cleans things up 
 considerably. So I've created a struct to represent a "case-insensitive 
 string" (templated for string, wstring and dstring). I think it would be a 
 good thing to have in phobos, and would like to submit it for review.
 
 It's currently named Insensitive/WInsensitive/DInsensitive, but I'm thinking 
 now that maybe it should be renamed something like 
 istring/iwstring/idstring. They're used like this:
 
 // Create
 auto normalString = "Hello";
 auto insensitiveString = Insensitive("Hello");
 
 // Convert
 normalString = insensitiveString.toString();
 insensitiveString = Insensitive(normalString);
 
 // Compare
 assert(Insensitive("Hello") == Insensitive("hELLo"));
 
 // Mixed-comparisons are deliberately disallowed at
 // compile-time since the intent is ambiguous:
 assert(normalString == insensitiveString); // Compile Error!
 
 // To disambiguate:
 assert(Insensitive(normalString) == insensitiveString)); // Case-Insensitive
 assert(normalString == insensitiveString.toString())); // Case-Sensitive
 
 // Preserves original case:
 writeln(Insensitive("hELLo")); // Output: hELLo
 
 It also works as expected as the key of an assoc array (which I've 
 personally found useful). Also, it doesn't convert to lower-case until it 
 actually needs to, and once it does, it caches it.
 
 Here is the source:
 http://www.dsource.org/projects/semitwist/browser/trunk/src/semitwist/util/text.d#L758
 
 While it is in the middle of my big grab-bag library, I'm pretty sure it 
 doesn't rely on anything that's not in phobos (unless I've missed something, 
 in which case, I can correct). The unittests for it do use my own 
 unittesting routines, but it should be trivial to see how they'd convert to 
 unittest{}, assert() and assertPred.
 
 

Jan 09 2011
parent reply Jonathan M Davis <jmdavisProg gmx.com> writes:
On Sunday 09 January 2011 13:52:53 Jim wrote:
 I'm a firm believer of alternative B: Store the string with its original
 case, unless it's particularly important to do otherwise.
 
 The cost of case-insensitive comparison is REALLY small. Anytime you are to
 compare two strings ask yourself whether case-sensitive or
 case-insensitive is what you need. Have no inclination to prefer one type
 of comparison to the other. Problem solved. Bloat avoided.
 
 
 Creating specific types of strings that carry with them data on how they
 are to be interpreted is over-engineering, solving a problem that doesn't
 exist.

I don't know that it's over-engineering. I expect that there _are_ cases where it makes perfect sense. However, in the general case, I do think that it's overkill. std.string.icmp() deals with most cases where you need case- insensitive comparison, but what if you really do need it everywhere as in Nick's case? Or what about cases like associative arrays, which you can't give a comparison function to (it has to be built into the type)? I don't think that the cost of the comparison here is really the issue. If that's all you need, then there's icmp(). It's when you need the same comparison _everywhere_ that it matters. I don't know if this is worthwhile having in Phobos or not, but it might be. It definitely seems to me that in the general case, this sort of solution should not be used but that there are definitely cases where it would be extremely useful and get a rid of a host of potential problems. Now, I do wonder if perhaps this idea should be generalized to any type and/or a given binary predicate to test for equality rather than making it specific to strings and case-insensitive comparison. The issue here (in the general sense) is that you want to wrap a type so that it will use a specialized comparison function everywhere, and that seems like it should be highly generalizable, though doing it right may require alias this, which _is_ rather buggy at the moment. Still, it would seem to me to be worthwhile to consider how it could and/or should be generalized. - Jonathan M Davis
Jan 09 2011
parent reply "Nick Sabalausky" <a a.a> writes:
"Jonathan M Davis" <jmdavisProg gmx.com> wrote in message 
news:mailman.529.1294620116.4748.digitalmars-d puremagic.com...
 On Sunday 09 January 2011 13:52:53 Jim wrote:
 I'm a firm believer of alternative B: Store the string with its original
 case, unless it's particularly important to do otherwise.

 The cost of case-insensitive comparison is REALLY small. Anytime you are 
 to
 compare two strings ask yourself whether case-sensitive or
 case-insensitive is what you need. Have no inclination to prefer one type
 of comparison to the other. Problem solved. Bloat avoided.


 Creating specific types of strings that carry with them data on how they
 are to be interpreted is over-engineering, solving a problem that doesn't
 exist.


The problem certainly cropped up for me. See below.
 I don't know that it's over-engineering. I expect that there _are_ cases 
 where
 it makes perfect sense. However, in the general case, I do think that it's
 overkill. std.string.icmp() deals with most cases where you need case-
 insensitive comparison, but what if you really do need it everywhere as in
 Nick's case? Or what about cases like associative arrays, which you can't 
 give a
 comparison function to (it has to be built into the type)? I don't think 
 that
 the cost of the comparison here is really the issue. If that's all you 
 need,
 then there's icmp(). It's when you need the same comparison _everywhere_ 
 that it
 matters.

Right. FWIW, this is the scenario that originally inspired it: I was working on some code that processes a grammar definition (specifically, the BNF-style language that GOLD uses). The grammar definition language includes various pre-defined character sets, and allows user-defined character sets. These character sets are referred to be name (such as "AlphaNumeric", "Whitespace", or "Cyrillic Supplementary"). But those character set names are defined by the language as being case-insensitive. (Come to think of it, all the names of everything are case-insensitive: tokens, char sets, meta-data, etc.) Due to the usage patterns in my program, it made sense to store the character sets as an associative array where the keys were the names of the character sets and the values were the data describing what characters were included in the set. And there were plenty of other AAs for other things that were all indexed by case-insensitive names. Obviously, I needed to ensure that *all* comparisons involving these names were done insensitively (to do otherwise would be a bug). And there were also times when I needed to display one of the character set names (error messages, for instance), and it would be awkward not to show the original capitalization. So I had to follow the convention of always creating lower-case versions to insert into and lookup from the AAs, and also maintain the original names (and be very careful about all of it). This quickly became an awful mess. But as soon as I wrote and started using the "Insensitive" type, the whole thing was simplified enormously. While writing and dealing with all that code I realized something: While programmers are usually heavily conditioned to think of case-sensitivity as an attribute of the comparison, it's very frequent that the deciding factor in which comparison to use is *not* the comparison itself but *what* gets compared. And in those cases, you have to use the awful strategy of "relying on convention" to make sure you get it right in *every* place that particular data gets compared. It's analogous to how Asm has separate operators for signed-integer, unsigned-integer and floating-point math: Many times a specific memory location is *supposed* to be treated as either signed, unsigned or float in *all* operations they participate in. Handling this with separate operators that behave differently is notoriously tedious and error-prone. That's why non-asm languages, even ones as low-level as C, employ a type system which allows the programmer to *force* a variable to always, and automatically, be used with the proper version of the given operator. Heck, it all goes back to the whole original point of a type-system.
 Now, I do wonder if perhaps this idea should be generalized to any type 
 and/or a
 given binary predicate to test for equality rather than making it specific 
 to
 strings and case-insensitive comparison. The issue here (in the general 
 sense)
 is that you want to wrap a type so that it will use a specialized 
 comparison
 function everywhere, and that seems like it should be highly 
 generalizable,
 though doing it right may require alias this, which _is_ rather buggy at 
 the
 moment. Still, it would seem to me to be worthwhile to consider how it 
 could
 and/or should be generalized.

That's a very good thought. Have to say I'm not really sure offhand how I'd do that though.
Jan 09 2011
parent reply Jim <bitcirkel yahoo.com> writes:
 While writing and dealing with all that code I realized something: While 
 programmers are usually heavily conditioned to think of case-sensitivity as 
 an attribute of the comparison, it's very frequent that the deciding factor 
 in which comparison to use is *not* the comparison itself but *what* gets 
 compared. And in those cases, you have to use the awful strategy of "relying 
 on convention" to make sure you get it right in *every* place that 
 particular data gets compared.

You have a point. Your case-sensitivity-aware string types will guarantee correctness in a large and complex program. I like that. Ideally though, they would only be compile-time constraints (i.e. not carrying any other data). Perhaps it's possible to create such types? CaseSensitiveString, CaseInsensitiveString, and the standard string type being unspecified.
Jan 10 2011
next sibling parent reply "Nick Sabalausky" <a a.a> writes:
"Jim" <bitcirkel yahoo.com> wrote in message 
news:igfado$11g3$1 digitalmars.com...
 While writing and dealing with all that code I realized something: While
 programmers are usually heavily conditioned to think of case-sensitivity 
 as
 an attribute of the comparison, it's very frequent that the deciding 
 factor
 in which comparison to use is *not* the comparison itself but *what* gets
 compared. And in those cases, you have to use the awful strategy of 
 "relying
 on convention" to make sure you get it right in *every* place that
 particular data gets compared.

You have a point. Your case-sensitivity-aware string types will guarantee correctness in a large and complex program. I like that. Ideally though, they would only be compile-time constraints (i.e. not carrying any other data).

Not carrying any other data means not caching the lowercase version, which means recreating the lowercase version more than necessary. So it's the classic speed vs. space tradeoff. I would think there would be cases where they get compared enough for that to make a difference, although I suppose we'd really need benchmarks to see. OTOH, there are certainly cases (such as my original motivating case) where the extra space is not an issue at all. But that does raise a point worth exploring: Maybe it should support both caching and non-caching behavior? I can think of at least two different ways to do this: 1. Provide a "useCache" template parameter. Problem with that though, is that the caching and non-caching versions end up being two distinct types. 2. Make "useCache" a runtime property. Since the cached foldingCase var is entirely private, the "useCache" flag could probably be encoded as a specific value of foldingCase.length and/or foldingCase.ptr. So the overhead of the non-cached version would only be size_t*2 instead of size_t*2+str.length. One other idea: It's possible to change it so the folding case version is only created and cached when toHash is used. For other cases, it could just use icmp. So the storage space (beyond .length and .ptr) for the folding case version would never be allocated unless you're using it as an AA key (or some other use that needs hashing). Some benchmarking would be needed to be certain, but I think that would hit a nice sweet spot between space and speed, plus it wouldn't bother the user of the type with "cache or no cache?".
 Perhaps it's possible to create such types?
 CaseSensitiveString,
 CaseInsensitiveString,
 and the standard string type being unspecified.

I think the standard type can be assumed to be case-sensitive since that's how it already operates.
Jan 10 2011
parent reply Michel Fortin <michel.fortin michelf.com> writes:
On 2011-01-10 13:46:55 -0500, "Nick Sabalausky" <a a.a> said:

 Not carrying any other data means not caching the lowercase version, which
 means recreating the lowercase version more than necessary. So it's the
 classic speed vs. space tradeoff. I would think there would be cases where
 they get compared enough for that to make a difference, although I suppose
 we'd really need benchmarks to see. OTOH, there are certainly cases (such as
 my original motivating case) where the extra space is not an issue at all.

Comparing the lowercase version of two strings works well for ASCII, but I doubt it works very well for Unicode. Case conversion is not bidirectional (for instance both 'SS' and '▀' become 'ss' in lowercase in German), and what's equal and what is not sometime depends on the language. Checking for string equality is a special case of the Unicode collation algorithm. I'm not sure if implementing this part of Unicode is in the scope of Phobos (probably not), but short of having Unicode support it seems the utility of having a special string type dedicated to ASCII case-insensitive strings is quite limited. -- Michel Fortin michel.fortin michelf.com http://michelf.com/
Jan 10 2011
next sibling parent reply "Nick Sabalausky" <a a.a> writes:
"Michel Fortin" <michel.fortin michelf.com> wrote in message 
news:igft2o$291g$1 digitalmars.com...
 On 2011-01-10 13:46:55 -0500, "Nick Sabalausky" <a a.a> said:

 Not carrying any other data means not caching the lowercase version, 
 which
 means recreating the lowercase version more than necessary. So it's the
 classic speed vs. space tradeoff. I would think there would be cases 
 where
 they get compared enough for that to make a difference, although I 
 suppose
 we'd really need benchmarks to see. OTOH, there are certainly cases (such 
 as
 my original motivating case) where the extra space is not an issue at 
 all.

Comparing the lowercase version of two strings works well for ASCII, but I doubt it works very well for Unicode. Case conversion is not bidirectional (for instance both 'SS' and '▀' become 'ss' in lowercase in German), and what's equal and what is not sometime depends on the language. Checking for string equality is a special case of the Unicode collation algorithm. I'm not sure if implementing this part of Unicode is in the scope of Phobos (probably not), but short of having Unicode support it seems the utility of having a special string type dedicated to ASCII case-insensitive strings is quite limited.

Yea, Phobos doesn't even have folding-case functions yet (which is why I keep saying "lowercase"). (This is actually one place where Phobos is still behind Tango.) However, I really think that's orthogonal to this since std.string.icmp doesn't handle such non-english issues either (just the english a-z, A-Z, and that's it). When Phobos does become multilingual, then this can be updated to follow suit. One question though: Aren't 'SS' and '▀' considered the same in german anyway? If so, how does using lowercase instead of folding case cause a problem?
Jan 10 2011
next sibling parent Daniel Gibson <metalcaedes gmail.com> writes:
Am 10.01.2011 22:24, schrieb Nick Sabalausky:
 One question though: Aren't 'SS' and '▀' considered the same in german
 anyway? If so, how does using lowercase instead of folding case cause a
 problem?

It's not the same, see http://en.wikipedia.org/wiki/%C3%9F#Current_usage_in_German
Jan 10 2011
prev sibling parent "Nick Sabalausky" <a a.a> writes:
"Nick Sabalausky" <a a.a> wrote in message 
news:igftls$2afo$1 digitalmars.com...
 since std.string.icmp doesn't handle such non-english issues either (just 
 the english a-z, A-Z, and that's it).

Figured this belongs in bugilla: http://d.puremagic.com/issues/show_bug.cgi?id=5443
Jan 10 2011
prev sibling parent reply Daniel Gibson <metalcaedes gmail.com> writes:
Am 10.01.2011 22:16, schrieb Michel Fortin:
 On 2011-01-10 13:46:55 -0500, "Nick Sabalausky" <a a.a> said:

 Not carrying any other data means not caching the lowercase version, which
 means recreating the lowercase version more than necessary. So it's the
 classic speed vs. space tradeoff. I would think there would be cases where
 they get compared enough for that to make a difference, although I suppose
 we'd really need benchmarks to see. OTOH, there are certainly cases (such as
 my original motivating case) where the extra space is not an issue at all.

Comparing the lowercase version of two strings works well for ASCII, but I doubt it works very well for Unicode. Case conversion is not bidirectional (for instance both 'SS' and '▀' become 'ss' in lowercase in German),

That's wrong, '▀' is lowercase and no upper-case version is used really, though one exists in Unicode (see: http://en.wikipedia.org/wiki/Capital_%C3%9F ). Sometimes, when stuff is written in fullcaps, '▀' (which never is the first character of a word) is replaced by "SS", but I wouldn't expect that to be equal on icmp(). (e.g. "Strings vergleichen macht keinen Spa▀!" vs "STRINGS VERGLEICHEN MACHT KEINEN SPASS!") Anyway, in this case comparing in lowercase would cause no trouble at all (comparing in uppercase however would, if you don't use the not-really-existing-but-defined-by-unicode-Capital-▀). I don't know if there may be problems with special characters in other languages, though.
 and what's equal
 and what is not sometime depends on the language.

 Checking for string equality is a special case of the Unicode collation
 algorithm. I'm not sure if implementing this part of Unicode is in the scope of
 Phobos (probably not), but short of having Unicode support it seems the utility
 of having a special string type dedicated to ASCII case-insensitive strings is
 quite limited.

Jan 10 2011
next sibling parent reply "Nick Sabalausky" <a a.a> writes:
"Daniel Gibson" <metalcaedes gmail.com> wrote in message 
news:igfttf$p46$1 digitalmars.com...
 Am 10.01.2011 22:16, schrieb Michel Fortin:
 Comparing the lowercase version of two strings works well for ASCII, but 
 I doubt
 it works very well for Unicode. Case conversion is not bidirectional (for
 instance both 'SS' and '▀' become 'ss' in lowercase in German),

That's wrong, '▀' is lowercase and no upper-case version is used really, though one exists in Unicode (see: http://en.wikipedia.org/wiki/Capital_%C3%9F ). Sometimes, when stuff is written in fullcaps, '▀' (which never is the first character of a word) is replaced by "SS", but I wouldn't expect that to be equal on icmp(). (e.g. "Strings vergleichen macht keinen Spa▀!" vs "STRINGS VERGLEICHEN MACHT KEINEN SPASS!") Anyway, in this case comparing in lowercase would cause no trouble at all (comparing in uppercase however would, if you don't use the not-really-existing-but-defined-by-unicode-Capital-▀). I don't know if there may be problems with special characters in other languages, though.

One of the unicode documents mentions an example involving the three greek "sigma" letters, although I never quite understood how it demonstrated the inadequacy of using lower-case: http://www.unicode.org/reports/tr21/tr21-5.html#Caseless_Matching ...Which references some information near the end of this sub-section: http://www.unicode.org/reports/tr21/tr21-5.html#Introduction Actually, what probably should be stored is a *normalized* folding-case version of the string, because then (if I understand correctly) memcmp could be used. I don't think memcpy technically works on non-ASCII (unless it's in normalized form). In any case, Phobos doesn't currently handle any of that stuff at all, so my case-insensitive string type wouldn't be taking things backwards in that regard.
Jan 10 2011
parent reply Michel Fortin <michel.fortin michelf.com> writes:
On 2011-01-10 17:00:40 -0500, "Nick Sabalausky" <a a.a> said:

 Actually, what probably should be stored is a *normalized* folding-case
 version of the string, because then (if I understand correctly) memcmp could
 be used. I don't think memcpy technically works on non-ASCII (unless it's in
 normalized form).

Actually, what would be compatible with Unicode collation is probably an array of collation elements. This would also make it useful to sort case-insensitively, not just testing for equality. Details (perhaps too much details): <http://unicode.org/reports/tr10/>
 In any case, Phobos doesn't currently handle any of that stuff at all, so my
 case-insensitive string type wouldn't be taking things backwards in that
 regard.

Probably not. -- Michel Fortin michel.fortin michelf.com http://michelf.com/
Jan 10 2011
parent "Nick Sabalausky" <a a.a> writes:
"Michel Fortin" <michel.fortin michelf.com> wrote in message 
news:iggi34$cqm$1 digitalmars.com...
 On 2011-01-10 17:00:40 -0500, "Nick Sabalausky" <a a.a> said:

 Actually, what probably should be stored is a *normalized* folding-case
 version of the string, because then (if I understand correctly) memcmp 
 could
 be used. I don't think memcpy technically works on non-ASCII (unless it's 
 in
 normalized form).

Actually, what would be compatible with Unicode collation is probably an array of collation elements. This would also make it useful to sort case-insensitively, not just testing for equality. Details (perhaps too much details): <http://unicode.org/reports/tr10/>

I'll have to take a look. (I don't even know what "Unicode collation" is :P )
 In any case, Phobos doesn't currently handle any of that stuff at all, so 
 my
 case-insensitive string type wouldn't be taking things backwards in that
 regard.

Probably not.

Actually I was (mostly) wrong: There's already unicode-compatible upper/lower case characters functions in std.uni, and these are used by std.string.toupper and std.string.tolower (as well as their InPlace coutnerparts). Which incidentally means that my Insentitive type worked correctly for unicode all along (I didn't know about icmp when I wrote it) - well, except for things that need folding case instead of lower-case. Not only that, but Andrei just committed a unicode fix for icmp (bugzilla 5443) just a few hours ago. (That's gotta be a record for fastest issue fixed in D's bug tracker!) Still no folding-case though.
Jan 10 2011
prev sibling parent Michel Fortin <michel.fortin michelf.com> writes:
On 2011-01-10 16:30:50 -0500, Daniel Gibson <metalcaedes gmail.com> said:

 Am 10.01.2011 22:16, schrieb Michel Fortin:
 Comparing the lowercase version of two strings works well for ASCII, 
 but I doubt
 it works very well for Unicode. Case conversion is not bidirectional (for
 instance both 'SS' and '▀' become 'ss' in lowercase in German),

That's wrong, '▀' is lowercase and no upper-case version is used really, though one exists in Unicode (see: http://en.wikipedia.org/wiki/Capital_%C3%9F ).

Oops, you're right. I got it backward. -- Michel Fortin michel.fortin michelf.com http://michelf.com/
Jan 10 2011
prev sibling parent reply Jonathan M Davis <jmdavisProg gmx.com> writes:
On Monday, January 10, 2011 10:46:55 Nick Sabalausky wrote:
 "Jim" <bitcirkel yahoo.com> wrote in message
 news:igfado$11g3$1 digitalmars.com...
 
 While writing and dealing with all that code I realized something: While
 programmers are usually heavily conditioned to think of case-sensitivity
 as
 an attribute of the comparison, it's very frequent that the deciding
 factor
 in which comparison to use is *not* the comparison itself but *what*
 gets compared. And in those cases, you have to use the awful strategy
 of "relying
 on convention" to make sure you get it right in *every* place that
 particular data gets compared.

You have a point. Your case-sensitivity-aware string types will guarantee correctness in a large and complex program. I like that. Ideally though, they would only be compile-time constraints (i.e. not carrying any other data).

Not carrying any other data means not caching the lowercase version, which means recreating the lowercase version more than necessary. So it's the classic speed vs. space tradeoff. I would think there would be cases where they get compared enough for that to make a difference, although I suppose we'd really need benchmarks to see. OTOH, there are certainly cases (such as my original motivating case) where the extra space is not an issue at all.

Why is caching necessary? Shouldn't you just be able to use std.string.icmp() for comparisons internally, avoiding any copying or caching? That shouldn't need to duplicate anything. Or do you need to cache the lower-case version for something other than comparison? - Jonathan M Davis
Jan 10 2011
parent reply "Nick Sabalausky" <a a.a> writes:
"Jonathan M Davis" <jmdavisProg gmx.com> wrote in message 
news:mailman.538.1294690510.4748.digitalmars-d puremagic.com...
 On Monday, January 10, 2011 10:46:55 Nick Sabalausky wrote:
 "Jim" <bitcirkel yahoo.com> wrote in message
 news:igfado$11g3$1 digitalmars.com...

 While writing and dealing with all that code I realized something: 
 While
 programmers are usually heavily conditioned to think of 
 case-sensitivity
 as
 an attribute of the comparison, it's very frequent that the deciding
 factor
 in which comparison to use is *not* the comparison itself but *what*
 gets compared. And in those cases, you have to use the awful strategy
 of "relying
 on convention" to make sure you get it right in *every* place that
 particular data gets compared.

You have a point. Your case-sensitivity-aware string types will guarantee correctness in a large and complex program. I like that. Ideally though, they would only be compile-time constraints (i.e. not carrying any other data).

Not carrying any other data means not caching the lowercase version, which means recreating the lowercase version more than necessary. So it's the classic speed vs. space tradeoff. I would think there would be cases where they get compared enough for that to make a difference, although I suppose we'd really need benchmarks to see. OTOH, there are certainly cases (such as my original motivating case) where the extra space is not an issue at all.

Why is caching necessary? Shouldn't you just be able to use std.string.icmp() for comparisons internally, avoiding any copying or caching? That shouldn't need to duplicate anything. Or do you need to cache the lower-case version for something other than comparison?

Anything involving toHash (such as using Insensitive as an AA key) requires the use of a lower-case version. For anything else, you're probably right, icmp should be fine (Although I'd like to do a benchmark of icmp vs regular string comparison).
Jan 10 2011
parent "Nick Sabalausky" <a a.a> writes:
"Nick Sabalausky" <a a.a> wrote in message 
news:igfq0n$22u8$1 digitalmars.com...
 "Jonathan M Davis" <jmdavisProg gmx.com> wrote in message 
 news:mailman.538.1294690510.4748.digitalmars-d puremagic.com...
 On Monday, January 10, 2011 10:46:55 Nick Sabalausky wrote:
 "Jim" <bitcirkel yahoo.com> wrote in message
 news:igfado$11g3$1 digitalmars.com...

 While writing and dealing with all that code I realized something: 
 While
 programmers are usually heavily conditioned to think of 
 case-sensitivity
 as
 an attribute of the comparison, it's very frequent that the deciding
 factor
 in which comparison to use is *not* the comparison itself but *what*
 gets compared. And in those cases, you have to use the awful strategy
 of "relying
 on convention" to make sure you get it right in *every* place that
 particular data gets compared.

You have a point. Your case-sensitivity-aware string types will guarantee correctness in a large and complex program. I like that. Ideally though, they would only be compile-time constraints (i.e. not carrying any other data).

Not carrying any other data means not caching the lowercase version, which means recreating the lowercase version more than necessary. So it's the classic speed vs. space tradeoff. I would think there would be cases where they get compared enough for that to make a difference, although I suppose we'd really need benchmarks to see. OTOH, there are certainly cases (such as my original motivating case) where the extra space is not an issue at all.

Why is caching necessary? Shouldn't you just be able to use std.string.icmp() for comparisons internally, avoiding any copying or caching? That shouldn't need to duplicate anything. Or do you need to cache the lower-case version for something other than comparison?

Anything involving toHash (such as using Insensitive as an AA key) requires the use of a lower-case version. For anything else, you're probably right, icmp should be fine (Although I'd like to do a benchmark of icmp vs regular string comparison).

The other (smaller) thing I'm concerned about, and the reason I keep mentioning I want to do a benchmark, is that case-sensitive comparisons are able to use the max-speed memcmp whereas icmp has to not only avoid memcmp but also stick extra logic in the middle of the loop (and currently that also includes calls to utf.decode). Without checking, I'm not sure if that wouldn't make multiple calls to icmp slower than just making a lower-case version once and using case-sensitive comparisons on them.
Jan 10 2011