
digitalmars.D - Ascii matters

reply "bearophile" <bearophileHUGS lycos.com> writes:
I need to manage Unicode text, but in many cases I have a lot of 
7-bit or 8-bit ASCII text to process, and this has led to this 
discussion. For some time now, thanks to Jonathan Davis, we have 
an efficient translate() again:

http://d.puremagic.com/issues/show_bug.cgi?id=7515


The s2 array generated by this code is a dchar[] (if array() 
becomes pure you will probably be able to type s2 as dstring):

string s = "test string"; // UTF-8, but also 7-bit ASCII
dchar[] s2 = map!(x => x)(s).array(); // Uses the Id function

To produce a char[] (or string, using assumeUnique), you are free 
to use a cast:

auto s3 = map!(x => cast(char)x)(s).array();

But D casts are unsafe, and one thing I'm learning from Haskell 
is how important it is to give types to your code to prevent bugs. 
So maybe an AsciiString wrapper range (a subtype of string) can 
be invented for Phobos. Its constructor verifies that the input is 
7-bit ASCII and its "front" method yields chars, so map.array() 
gives a char[]:

astring a1 = "test string"; // enforced 7-bit ASCII
char[] s4 = map!(x => x)(a1).array();

This makes some algorithms working on ASCII text cleaner and 
safer, avoiding the need for casts.
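
A minimal sketch of what such a wrapper might look like, assuming a struct named AsciiString; neither the name nor the members below exist in Phobos, and a real design would need more care (mutable buffers, slicing, and so on). The constructor checks every code unit once, and the range primitives then yield char instead of decoded dchar:

```d
import std.algorithm : map;
import std.array : array;
import std.exception : enforce;
import std.range : ElementType;

// Hypothetical wrapper type, for illustration only (not part of Phobos).
struct AsciiString
{
    private string data;

    // Verify once, at construction, that every code unit is 7-bit ASCII.
    this(string s)
    {
        foreach (char c; s)
            enforce(c < 0x80, "AsciiString: non-ASCII code unit in input");
        data = s;
    }

    // Range primitives yield char, so no decoding to dchar takes place.
    @property bool empty() const { return data.length == 0; }
    @property char front() const { return data[0]; }
    void popFront() { data = data[1 .. $]; }
    @property AsciiString save() { return this; }
    @property size_t length() const { return data.length; }
}

// The element type seen by range-based code is char, not dchar.
static assert(is(ElementType!AsciiString == char));

void main()
{
    auto a1 = AsciiString("test string");   // validated here
    char[] s4 = map!(x => x)(a1).array();   // no cast needed
    assert(s4 == "test string");
}
```

The check happens in exactly one place, so code that receives an AsciiString can rely on the invariant without re-validating or casting.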

Is creating something like this possible and appreciated for 
Phobos?

Bye,
bearophile
Aug 22 2012
parent reply "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Thursday, August 23, 2012 00:11:18 bearophile wrote:
 Is creating something like this possible and appreciated for
 Phobos?
It could certainly be done. In fact, doing so would be incredibly trivial. But given that you can use ubyte[] just fine and the fact that using ASCII really shouldn't be encouraged, I don't like the idea of adding such a range to Phobos. I don't know what the general consensus on that would be though.

- Jonathan M Davis
Aug 22 2012
parent reply "bearophile" <bearophileHUGS lycos.com> writes:
Jonathan M Davis:

 But given that you can use ubyte[] just fine
The data I am processing is not generic octets, like 8 bits digitized by some old A/D converter, they are chars, and I expect to see strings when I print them :-)
 and the fact that using ASCII really shouldn't be encouraged,
For generic text I agree with you, using UTF-8 is safer and better. But there is plenty of scientific/technical text-encoded data that is in ASCII, and for both practical and performance reasons in D I want to process it as a sequence of chars (or a sequence of ubytes, as you say). So for some kinds of data that encouragement is a waste of your time.

Bye,
bearophile
Aug 22 2012
next sibling parent "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Thursday, August 23, 2012 02:07:52 bearophile wrote:
 Jonathan M Davis:
 But given that you can use ubyte[] just fine

 The data I am processing is not generic octets, like 8 bits digitized by some old A/D converter, they are chars, and I expect to see strings when I print them :-)

 and the fact that using ASCII really shouldn't be encouraged,

 For generic text I agree with you, using UTF-8 is safer and better. But there is plenty of scientific/technical text-encoded data that is in ASCII, and for both practical and performance reasons in D I want to process it as a sequence of chars (or a sequence of ubytes, as you say). So for some kinds of data that encouragement is a waste of your time.

Then just use ubyte[], and if you need char[] for printing out, then cast it. And if you don't like the casting, you can even wrap it in a function.

char[] fromASCII(ubyte[] str)
{
    return cast(char[])str;
}

Creating an ASCII range type will just encourage its use, when you should only be operating on ASCII when you really need it. Operating on ASCII is quite possible as it is and isn't even very hard. So, I really don't see much benefit in adding such a range, and the fact that it arguably would encourage bad behavior then makes it _undesirable_ rather than just not particularly beneficial.

- Jonathan M Davis
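
For illustration, the ubyte[] route suggested above might look roughly like this. The sketch assumes std.string.representation to view the string as bytes; the upper-casing lambda is just a stand-in for whatever per-character processing is needed, and nothing here checks that the input really is ASCII:

```d
import std.algorithm : map;
import std.array : array;
import std.string : representation;

// Reinterprets bytes that are known to be ASCII as text; nothing is verified.
char[] fromASCII(ubyte[] str)
{
    return cast(char[]) str;
}

void main()
{
    string s = "test string";

    // representation() views the string as immutable(ubyte)[], so map sees
    // ubyte elements instead of decoded dchar.
    ubyte[] upper = s.representation
        .map!(b => cast(ubyte)(b >= 'a' && b <= 'z' ? b - 32 : b))
        .array();

    // Cast back only at the point where text is needed again.
    assert(fromASCII(upper) == "TEST STRING");
}
```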
Aug 22 2012
prev sibling next sibling parent reply Sean Kelly <sean invisibleduck.org> writes:
On Aug 22, 2012, at 5:07 PM, bearophile <bearophileHUGS lycos.com> wrote:

 Jonathan M Davis:

 and the fact that using ASCII really shouldn't be encouraged,

 For generic text I agree with you, using UTF-8 is safer and better. But there is plenty of scientific/technical text-encoded data that is in ASCII, and for both practical and performance reasons in D I want to process it as a sequence of chars (or a sequence of ubytes, as you say). So for some kinds of data that encouragement is a waste of your time.

I'm clearly missing something. ASCII and UTF-8 are compatible. What's stopping you from just processing these as if they were UTF-8 strings?
Aug 22 2012
parent reply "bearophile" <bearophileHUGS lycos.com> writes:
Sean Kelly:

 I'm clearly missing something.  ASCII and UTF-8 are compatible.
  What's stopping you from just processing these as if they were 
 UTF-8 strings?
std.algorithm is not closed (http://en.wikipedia.org/wiki/Closure_%28mathematics%29) on UTF-8, its operations lead to UTF-32.

Bye,
bearophile
Aug 22 2012
parent reply Don Clugston <dac nospam.com> writes:
On 23/08/12 05:05, bearophile wrote:
 Sean Kelly:

 I'm clearly missing something.  ASCII and UTF-8 are compatible.
  What's stopping you from just processing these as if they were UTF-8
 strings?
 std.algorithm is not closed (http://en.wikipedia.org/wiki/Closure_%28mathematics%29) on UTF-8, its operations lead to UTF-32.

Which operations in std.algorithm over map 0-0x7F into higher characters?
Aug 23 2012
parent "bearophile" <bearophileHUGS lycos.com> writes:
Don Clugston:

 Which operations in std.algorithm over map 0-0x7F into higher 
 characters?
The first example I've shown:

string s = "test string";
dchar[] s2 = map!(x => x)(s).array(); // Uses the Id function

Bye,
bearophile
Aug 23 2012
prev sibling next sibling parent reply "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Wednesday, August 22, 2012 19:52:10 Sean Kelly wrote:
 I'm clearly missing something. ASCII and UTF-8 are compatible. What's
 stopping you from just processing these as if they were UTF-8 strings?
Range-based functions will treat arrays of char or wchar as forward ranges of dchar. Because of the variable length of their code points, they aren't considered to have length, be random access, or have slicing and will not generally work with range-based functions which require any of those operations (though some range-based functions do specialize on strings and use those operations where they can based on proper understanding of Unicode).

On the other hand, if you have a string that specifically holds ASCII and you know that it only holds ASCII, you know that you can safely use length, random access, and slicing as if each code unit were a full code point. But the range-based functions don't know that your string is guaranteed to be ASCII-only, so they continue to treat it as a range of dchar rather than char. The solution is to either create a wrapper range whose element type is char or to cast the char[] to ubyte[]. And Bearophile wants such a wrapper range to be added to Phobos.

- Jonathan M Davis
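
The behaviour described here can be checked at compile time. The following is a small sketch using the std.range traits, assuming a Phobos in which strings are decoded to dchar by the range primitives as described above; the alias name Bytes is only for illustration:

```d
import std.algorithm : map;
import std.array : array;
import std.range : ElementType, hasLength, isRandomAccessRange;

// A string is presented to range-based code as a forward range of dchar,
// without length or random access...
static assert(is(ElementType!string == dchar));
static assert(!hasLength!string);
static assert(!isRandomAccessRange!string);

// ...whereas the same data typed as ubyte behaves like an ordinary array.
alias Bytes = immutable(ubyte)[];
static assert(is(ElementType!Bytes == immutable(ubyte)));
static assert(hasLength!Bytes);
static assert(isRandomAccessRange!Bytes);

void main()
{
    string s = "test string";
    auto decoded = map!(x => x)(s).array();
    static assert(is(typeof(decoded) == dchar[]));

    // The two lengths agree only because s is pure ASCII.
    assert(decoded.length == s.length);
}
```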
Aug 22 2012
parent "bearophile" <bearophileHUGS lycos.com> writes:
Jonathan M Davis:

 And Bearophile wants such a wrapper range to be added to Phobos.
I am just asking if there is interest in it, if people see something wrong in having it in Phobos. Surely I am not demanding it :-)

Bye,
bearophile
Aug 22 2012
prev sibling parent reply Sean Kelly <sean invisibleduck.org> writes:
On Aug 22, 2012, at 8:03 PM, Jonathan M Davis <jmdavisProg gmx.com> wrote:

 On Wednesday, August 22, 2012 19:52:10 Sean Kelly wrote:
 I'm clearly missing something. ASCII and UTF-8 are compatible. What's
 stopping you from just processing these as if they were UTF-8 strings?

 Range-based functions will treat arrays of char or wchar as forward ranges of
 dchar. Because of the variable length of their code points, they aren't
 considered to have length, be random access, or have slicing and will not
 generally work with range-based functions which require any of those
 operations (though some range-based functions do specialize on strings and use
 those operations where they can based on proper understanding of unicode).

Yeah. I understand why the range-based functions use dchar, but for my own use I generally want to work directly with a char string of UTF-8 so I can slice buffers. Typing these as uchar buffers isn't ideal, but it does work.

 On the other hand, if you have a string that specifically holds ASCII and you
 know that it only holds ASCII, you know that you can safely use length, random
 access, and slicing as if each code unit were a full code point. But the
 range-based functions don't know that your string is guaranteed to be ASCII-only,
 so they continue to treat it as a range of dchar rather than char. The
 solution is to either create a wrapper range whose element type is char or to
 cast the char[] to ubyte[]. And Bearophile wants such a wrapper range to be
 added to Phobos.

Gotcha. Despite it being something I'd use regularly, I wouldn't want this in Phobos because it seems like it could cause maintenance problems. I'd rather explicitly cast to ubyte as a way to flag that I was doing something potentially unsafe.
Aug 22 2012
parent reply "bearophile" <bearophileHUGS lycos.com> writes:
Sean Kelly:

 Gotcha.  Despite it being something I'd use regularly, I 
 wouldn't want this in Phobos because it seems like it could 
 cause maintenance problems.  I'd rather explicitly cast to 
 ubyte as a way to flag that I was doing something potentially 
 unsafe.
What's unsafe in what I have presented? The constructor verifies every char to be in 7 bits, and then you use the new type safely. No casts, and no need to flag something as unsafe.

This usage of types to denote capabilities is quite common in functional languages, see articles I've recently linked here, such as:
http://tomasp.net/blog/type-first-development.aspx

Bye,
bearophile
Aug 23 2012
parent reply Sean Kelly <sean invisibleduck.org> writes:
On Aug 23, 2012, at 4:25 AM, bearophile <bearophileHUGS lycos.com> wrote:

 Sean Kelly:

 Gotcha.  Despite it being something I'd use regularly, I wouldn't want this in Phobos because it seems like it could cause maintenance problems. I'd rather explicitly cast to ubyte as a way to flag that I was doing something potentially unsafe.

 What's unsafe in what I have presented? The constructor verifies every char to be in 7 bits, and then you use the new type safely. No casts, and no need to flag something as unsafe.

 This usage of types to denote capabilities is quite common in functional languages, see articles I've recently linked here as:
 http://tomasp.net/blog/type-first-development.aspx

So it throws an exception if there are non-ASCII characters in the range? Is this really better than just casting the input array to ubyte?
Aug 23 2012
parent "bearophile" <bearophileHUGS lycos.com> writes:
Sean Kelly:

 So it throws an exception if there are non-ASCII characters in 
 the range?  Is this really better than just casting the input 
 array to ubyte?
The cast to ubyte[] doesn't perform a run-time test of the validity of the input, so yeah, the exception is better. Your code is also able to catch and manage the exception (like asking the user for another valid input file).

If you carry around some type as "Astring", later you don't have to cast it back to char[] to print the data as a string (this discussion is about data that is naturally text, this discussion is not about generic numerical octets).

An appropriate type statically encodes in your program that you are using an ASCII string. This makes your code more readable. But when in the code you see a variable of generic type ubyte[] it doesn't tell you a lot about its contents.

Bye,
bearophile
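
To make the difference concrete, here is a small sketch contrasting the unchecked cast with a checked conversion that throws. The function names uncheckedBytes and checkedAscii are hypothetical, made up for this example, and the recovery step is reduced to setting a flag:

```d
import std.exception : enforce;

// Unchecked: the cast accepts any input, ASCII or not.
immutable(ubyte)[] uncheckedBytes(string s)
{
    return cast(immutable(ubyte)[]) s;
}

// Checked (hypothetical helper): throws on the first non-ASCII code unit.
string checkedAscii(string s)
{
    foreach (char c; s)
        enforce(c < 0x80, "non-ASCII code unit in input");
    return s;
}

void main()
{
    string bad = "caf\u00E9"; // the last character takes two UTF-8 code units

    // The cast silently lets the non-ASCII bytes through.
    assert(uncheckedBytes(bad).length == 5);

    // The checked path reports the problem, and the caller can recover,
    // for example by asking for another input file.
    bool rejected = false;
    try
        checkedAscii(bad);
    catch (Exception e)
        rejected = true;
    assert(rejected);

    assert(checkedAscii("test string") == "test string");
}
```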
Aug 23 2012