digitalmars.D - Re: Why foreach(c; someString) must yield dchar

Kagamin <spam here.lot> writes:
Jonathan M Davis Wrote:

 No, it doesn't hurt to have the iteration type larger than the actual type,
 but you're not going to have overflow.

Trivial: take byte and add 256.
 could have had overflow putting it in, but when you're taking it out, you know
 that it fits because it was already in there. You could have overflow issues
 with math or whatnot inside the body of your loop if you're assigning to the
 foreach variable, but that has nothing to do with what you're getting out of
 the loop.

As long as what you get out of the loop doesn't depend on the element type. Didn't you demonstrate how such a dependency can be introduced?
 It's fine with me to use narrow strings. Much as I'd love to avoid a lot of
 these issues, dstrings take up too much memory if you're going to be doing a
 lot of string processing.

If you're going to use a lot of memory, there probably won't be much difference between strings and dstrings; you'll use a lot of memory in both cases. And don't forget that a UTF-8 character takes up to 4 bytes.
 problem is that the default behavior is the abnormal (and therefore almost 
 certainly buggy) behavior. Generally D tries to make the normal behavior the 
 behavior that is less likely to cause bugs.

Type system hacks are likely to cause bugs.
 Very few people are actually going to 
 want to deal with code points. They want characters. The result is that it 
 becomes very easy to make mistakes with strings if you ever try and manipulate 
 them character-by-character.

If you care about people and want to force them to use dchar ranges, you can do it with the library: make it refuse narrow strings. As long as the library is unusable with narrow strings, people will have to do something about it, say, use wrappers like the one proposed in this thread (but providing a forward dchar range interface).
 It makes perfect sense for general arrays. It makes perfect sense if you don't
 really care about the contents of the array for your algorithm (that is,
 whether they're code points or characters or just bytes in memory doesn't
 matter for what you're doing). However, if you're actually processing
 characters, it makes no sense at all. This mess with foreach and strings is
 one of the big reasons why foreach tends to be avoided in std.algorithm.

The problem here is that integers are not much different from characters in this regard.
 and given the fact that the string module deals almost exclusively with 
 string rather than wstring or dstring, it really doesn't make sense to use 
 dstrings in the general case.

This is my point: you can do it with the library; if you can't, fix the library.
 Not to mention, the Linux I/O stuff uses UTF-8, and
 the Windows I/O stuff uses UTF-16, so dstring is less efficient for dealing
 with I/O.

Every string type is inefficient here, but a wrapper comparable to NSString can fix it for you.
 Perhaps what we need is some way to distinguish between the exact element type
 of an array and the conceptual element type. So, for most arrays, they'd both
 be whatever the element type of the array is, but for strings the exact element
 type would be char, wchar, or dchar while the conceptual type would be dchar.

Conceptually, a number is an infinite sequence of digits with a decimal point. What do you plan to do about this?
Aug 19 2010
Jonathan M Davis <jmdavisprog gmail.com> writes:
On Thursday, August 19, 2010 12:18:03 Kagamin wrote:
 Jonathan M Davis Wrote:
 No, it doesn't hurt to have the iteration type larger than the actual
 type, but you're not going to have overflow.

Trivial: take byte and add 256.

Except that that only happens once you do something to the element that you get from foreach. You read the byte just fine without having overflow problems. You can't do the same with char or wchar. You often need several of them to get anything meaningful, unlike bytes. If you want to change the iteration type to int or long or whatever when iterating over bytes so that you can change the variable without overflow issues, you can. But the byte is meaningful by itself. Such is not generally the case with char or wchar.
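
To make the distinction concrete, here is a minimal sketch (an illustration added for this point, not code from the thread; the function names are hypothetical). It assumes a current D compiler. Widening a byte to int during iteration only adds headroom for arithmetic, because each byte is already a complete value, whereas iterating a UTF-8 string by char yields code units that are not characters on their own.

```d
import std.stdio;

// Sum bytes using a wider iteration variable. Each byte is already a
// complete value; widening to int only gives headroom for the arithmetic.
int sumWidened(const(byte)[] data)
{
    int total = 0;
    foreach (int x; data)   // each element is converted from byte to int
        total += x + 256;   // x + 256 would not fit in a byte, but fits in an int
    return total;
}

// Count what foreach yields when the loop variable is char (UTF-8 code units).
size_t countCodeUnits(string s)
{
    size_t n = 0;
    foreach (char c; s)
        ++n;
    return n;
}

// Count what foreach yields when the loop variable is dchar (code points).
// The UTF-8 sequences are decoded as the loop advances.
size_t countCodePoints(string s)
{
    size_t n = 0;
    foreach (dchar c; s)
        ++n;
    return n;
}

void main()
{
    byte[] data = [1, 2, 3];
    string s = "h\u00E9llo";          // U+00E9 takes two code units in UTF-8

    writeln(sumWidened(data));        // 774
    writeln(countCodeUnits(s));       // 6
    writeln(countCodePoints(s));      // 5
}
```

The two counts differ only because of the declared type of the loop variable, which is the behavior being argued about here.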
 It's fine with me to use narrow strings. Much as I'd love to avoid a lot
 of these issues, dstrings take up too much memory if you're going to be
 doing a lot of string processing.

If you're going to use a lot of memory, there probably won't be much difference between strings and dstrings; you'll use a lot of memory in both cases. And don't forget that a UTF-8 character takes up to 4 bytes.

For ASCII characters, a UTF-32 character takes _4_ times as much memory as a UTF-8 character. Even if you use lots of Asian characters, as I understand it, most won't take more than 3 bytes. So, even if you're using primarily Asian characters with UTF-8, you still have 25% space savings. And since apparently many Asian characters will fit into one wchar, if you use UTF-16 when you have lots of Asian characters, you're getting closer to 50% space savings over UTF-32. If you have a lot of strings, that's a lot of wasted memory.
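
The arithmetic can be checked with a short sketch (an added illustration with hypothetical helper names; the sample strings are my own choice, not from the thread). It measures only the character data, ignoring allocator overhead.

```d
import std.stdio;

// Bytes occupied by the character data of a string in each encoding.
size_t utf8Bytes(string s)   { return s.length * char.sizeof; }
size_t utf16Bytes(wstring s) { return s.length * wchar.sizeof; }
size_t utf32Bytes(dstring s) { return s.length * dchar.sizeof; }

void main()
{
    // ASCII text: 1 byte per character in UTF-8, 4 in UTF-32.
    writeln(utf8Bytes("hello"), " ",
            utf16Bytes("hello"w), " ",
            utf32Bytes("hello"d));                    // 5 10 20

    // Three CJK characters (U+65E5, U+672C, U+8A9E), all in the
    // Basic Multilingual Plane: 3 bytes each in UTF-8, 2 in UTF-16.
    writeln(utf8Bytes("\u65E5\u672C\u8A9E"), " ",
            utf16Bytes("\u65E5\u672C\u8A9E"w), " ",
            utf32Bytes("\u65E5\u672C\u8A9E"d));       // 9 6 12
}
```

For the CJK sample, 9 of 12 bytes is the 25% saving and 6 of 12 bytes is the 50% saving mentioned above. Characters outside the Basic Multilingual Plane take 4 bytes in all three encodings, so the saving disappears for those.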
 If you care about people and want to force them to use dchar ranges, you
 can do it with the library: make it refuse narrow strings. As long as the
 library is unusable with narrow strings, people will have to do something
 about it, say, use wrappers like the one proposed in this thread (but
 providing a forward dchar range interface).

We _can't_ force everyone to use dstring. That defeats having string and wstring in the first place and is incredibly inefficient space-wise. The standard libraries _need_ to work well with all string types.
 It makes perfect sense for general arrays. It makes perfect sense if you
 don't really care about the contents of the array for your algorithm
 (that is, whether they're code points or characters or just bytes in
 memory doesn't matter for what you're doing). However, if you're
 actually processing characters, it makes no sense at all. This mess with
 foreach and strings is one of the big reasons why foreach tends to be
 avoided in std.algorithm.

The problem here is that integers are not much different from characters in this regard.

Integers are totally different. An integer may be limited in the size of the number that it can hold, but it makes perfect sense to process each integer individually. An integer is a full value on its own. char and wchar are not. They're only parts of a whole.
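
As a small illustration of "parts of a whole" (an added sketch, not code from the thread; `unitsForFirstCodePoint` is a hypothetical helper), `std.utf.decode` combines as many code units as the sequence needs and advances the index past them:

```d
import std.stdio;
import std.utf : decode;

// Decode the code point that starts at index 0 and return how many
// UTF-8 code units it consumed. The decoded value is stored in cp.
size_t unitsForFirstCodePoint(string s, out dchar cp)
{
    size_t index = 0;
    cp = decode(s, index);   // advances index past the whole sequence
    return index;
}

void main()
{
    string s = "\u00E9";     // U+00E9, encoded as 0xC3 0xA9 in UTF-8
    dchar cp;
    size_t used = unitsForFirstCodePoint(s, cp);

    // The first char alone (0xC3) is only a lead byte; the value U+00E9
    // exists only once both code units are combined.
    writefln("first unit: 0x%02X, code point: U+%04X, units used: %s",
             cast(uint) s[0], cast(uint) cp, used);
}
```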
 Conceptually, a number is an infinite sequence of digits with a decimal point.
 What do you plan to do about this?

That's a totally different issue. The solution for that is to use a BigInt type which combines multiple integers (or bytes or longs or whatever) together to make larger values than primitive integral types can hold. In that case, if you were to try and iterate over individual ints within the BigInt, then you'd be screwed because they don't mean anything on their own. string and wstring are effectively BigInt for chars and wchars. You have to combine multiple of them to get meaningful values. The fact that one of them can't hold a big enough (let alone infinite) range is the whole reason that they were created in the first place (that and the fact that making the type big enough (i.e. dchar) on its own wastes a lot of space).

- Jonathan M Davis
Aug 19 2010
Kagamin <spam here.lot> writes:
Jonathan M Davis Wrote:

 Trivial: take byte and add 256.

If you want to change the iteration type to int or long or whatever when iterating over bytes so that you can change the variable without overflow issues, you can. But the byte is meaningful by itself. Such is not generally the case with char or wchar.

I thought it was your point that having a meaning doesn't help to avoid bugs.
 If you care about people and want to force them to use dchar ranges, you
 can do it with the library: make it refuse narrow strings. As long as the
 library is unusable with narrow strings, people will have to do something
 about it, say, use wrappers like the one proposed in this thread (but
 providing a forward dchar range interface).

We _can't_ force everyone to use dstring.

I'm not talking about dstrings; I said a dchar range wrapper. Andrei mentioned byDchar; I don't know if that's the thing. Anyway, std.algorithm does iterate over dchars in narrow strings somehow. You can do it too.
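
A rough sketch of what such a wrapper looks like in use (an added illustration; it assumes present-day Phobos, where `std.utf.byDchar` exists, and whether that is the same thing Andrei mentioned is an assumption; `codePointCount` is a hypothetical helper). The string stays a narrow string in memory and is decoded lazily, so no dstring is allocated:

```d
import std.range : walkLength;
import std.stdio;
import std.utf : byDchar;

// Count code points of a narrow string through a lazy dchar range.
size_t codePointCount(string s)
{
    return s.byDchar.walkLength;
}

void main()
{
    string s = "h\u00E9llo";          // stored as 6 UTF-8 code units

    writeln(s.length);                // 6
    writeln(codePointCount(s));       // 5

    // Iterating the wrapper yields whole code points.
    foreach (dchar c; s.byDchar)
        writefln("U+%04X", cast(uint) c);
}
```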
Aug 20 2010