digitalmars.D - Ascii matters

"bearophile" <bearophileHUGS lycos.com> writes:

I need to manage Unicode text, but in many cases I have a lot of 
7-bit or 8-bit ASCII text to process, and this has led to this 
discussion. For some time now, thanks to Jonathan Davis, we have had 
an efficient translate() again:

http://d.puremagic.com/issues/show_bug.cgi?id=7515


The s2 array generated by this code is a dchar[] (if array() 
becomes pure you will probably be able to declare s2 as a dstring):

string s = "test string"; // UTF-8, but also 7-bit ASCII
dchar[] s2 = map!(x => x)(s).array(); // Uses the Id function

To produce a char[] (or string, using assumeUnique), you are free 
to use a cast:

auto s3 = map!(x => cast(char)x)(s).array();
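
A minimal sketch that checks the two result types claimed above at compile time. It assumes a Phobos version in which strings are auto-decoded when used as ranges (the behavior discussed in this thread); the variable names simply mirror the snippets above.

```d
import std.algorithm : map;
import std.array : array;

void main()
{
    string s = "test string"; // UTF-8, but also 7-bit ASCII

    // Iterating a string as a range decodes it, so the elements are dchar.
    auto s2 = s.map!(x => x).array();
    static assert(is(typeof(s2) == dchar[]));

    // Casting each element narrows it back to char.
    auto s3 = s.map!(x => cast(char) x).array();
    static assert(is(typeof(s3) == char[]));
    assert(s3 == "test string");
}
```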

But D casts are unsafe, and one thing I'm learning from Haskell 
is how important it is to give types to your code to prevent bugs. 
So maybe an AsciiString wrapper range (a subtype of string) can 
be invented for Phobos. Its constructor verifies the input is 
7-bit ASCII and its "front" method yields chars, so map.array() 
gives a char[]:

astring a1 = "test string"; // enforced 7-bit ASCII
char[] s4 = map!(x => x)(a1).array();

This makes some algorithms working on ASCII text cleaner and 
safer, avoiding the need for casts.
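
One way such a wrapper could look, as a rough sketch only. The name AsciiString, its layout, and the choice of enforce() for validation are assumptions made for illustration; nothing like this exists in Phobos, and this version is only an input range, not a full subtype of string.

```d
import std.algorithm : map;
import std.array : array;
import std.exception : enforce;

// Hypothetical wrapper type, not part of Phobos.
struct AsciiString
{
    private string data;

    this(string s)
    {
        // Check every code unit once; throws if any is outside 7-bit ASCII.
        foreach (char c; s)
            enforce(c < 0x80, "input is not 7-bit ASCII");
        data = s;
    }

    // Input range primitives that yield char instead of decoded dchar.
    @property bool empty() const { return data.length == 0; }
    @property char front() const { return data[0]; }
    void popFront() { data = data[1 .. $]; }
}

void main()
{
    auto a1 = AsciiString("test string"); // validated at construction
    auto s4 = a1.map!(x => x).array();    // no cast needed
    static assert(is(typeof(s4) == char[]));
    assert(s4 == "test string");
}
```

Because front returns char, map and array never see a dchar, so the result stays a char[]. A fuller design would also need length, indexing and slicing to be useful with random-access algorithms.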

Is creating something like this possible and appreciated for 
Phobos?

Bye,
bearophile

Aug 22 2012

"Jonathan M Davis" <jmdavisProg gmx.com> writes:

On Thursday, August 23, 2012 00:11:18 bearophile wrote:
 Is creating something like this possible and appreciated for
 Phobos?

It could certainly be done. In fact, doing so would be incredibly trivial. But 
given that you can use ubyte[] just fine and the fact that using ASCII really 
shouldn't be encouraged, I don't like the idea of adding such a range to 
Phobos. I don't know what the general consensus on that would be though.

- Jonathan M Davis

Aug 22 2012

"bearophile" <bearophileHUGS lycos.com> writes:

Jonathan M Davis:

 But given that you can use ubyte[] just fine

The data I am processing is not generic octets, like 8 bits 
digitized by some old A/D converter, they are chars, and I expect 
to see strings when I print them :-)


 and the fact that using ASCII really shouldn't be encouraged,

For generic text I agree with you, using UTF-8 is safer and 
better.
But there is plenty of scientific/technical text-encoded data 
that is in ASCII, and for both practical and performance reasons 
in D I want to process it as a sequence of chars (or a sequence 
of ubytes, as you say). So for some kinds of data that 
encouragement is a waste of your time.

Bye,
bearophile

Aug 22 2012

"Jonathan M Davis" <jmdavisProg gmx.com> writes:

On Thursday, August 23, 2012 02:07:52 bearophile wrote:
 Jonathan M Davis:
 But given that you can use ubyte[] just fine

 
 The data I am processing is not generic octets, like 8 bits
 digitized by some old A/D converter, they are chars, and I expect
 to see strings when I print them :-)
 
 and the fact that using ASCII really shouldn't be encouraged,

 
 For generic text I agree with you, using UTF-8 is safer and
 better.
 But there is plenty of scientific/technical text-encoded data
 that is in ASCII, and for both practical and performance reasons
 in D I want to process it as a sequence of chars (or a sequence
 of ubytes, as you say). So for some kinds of data that
 encouragement is a waste of your time.

Then just use ubyte[], and if you need char[] for printing out, then cast it. 
And if you don't like the casting, you can even wrap it in a function.

char[] fromASCII(ubyte[] str)
{
 return cast(char[])str;
}

Creating an ASCII range type will just encourage its use, when you should only 
be operating on ASCII when you really need it. Operating on ASCII is quite 
possible as it is and isn't even very hard. So, I really don't see much benefit 
in adding such a range, and the fact that it arguably would encourage bad 
behavior then makes it _undesirable_ rather than just not particularly 
beneficial.
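
A short sketch of the ubyte[] approach described here, assuming the input is already known to be ASCII. It uses std.string.representation to view the string as bytes and casts back to char[] only at the boundary; the upper-casing step is just an illustrative workload.

```d
import std.algorithm : map;
import std.array : array;
import std.ascii : toUpper;
import std.string : representation;

void main()
{
    string s = "test string";

    // representation() views the string as immutable(ubyte)[] without
    // copying, so map sees ubyte elements and does no UTF decoding.
    auto upper = s.representation
                  .map!(b => cast(ubyte) toUpper(cast(char) b))
                  .array();
    static assert(is(typeof(upper) == ubyte[]));

    // Cast back to char[] only where text is needed, e.g. for printing.
    // Assumption: every byte is ASCII, so the result is valid UTF-8.
    auto text = cast(char[]) upper;
    assert(text == "TEST STRING");
}
```

Nothing in this version checks that the bytes really are ASCII; that responsibility stays with the caller.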

- Jonathan M Davis

Aug 22 2012

Sean Kelly <sean invisibleduck.org> writes:

On Aug 22, 2012, at 5:07 PM, bearophile <bearophileHUGS lycos.com> wrote:

 Jonathan M Davis:

 and the fact that using ASCII really shouldn't be encouraged,

 For generic text I agree with you, using UTF-8 is safer and better.
 But there is plenty of scientific/technical text-encoded data that is
 in ASCII, and for both practical and performance reasons in D I want to
 process it as a sequence of chars (or a sequence of ubytes, as you say).
 So for some kinds of data that encouragement is a waste of your time.

I'm clearly missing something.  ASCII and UTF-8 are compatible.  What's
stopping you from just processing these as if they were UTF-8 strings?

Aug 22 2012

"bearophile" <bearophileHUGS lycos.com> writes:

Sean Kelly:

 I'm clearly missing something.  ASCII and UTF-8 are compatible.
  What's stopping you from just processing these as if they were 
 UTF-8 strings?

std.algorithm is not closed 
(http://en.wikipedia.org/wiki/Closure_%28mathematics%29 ) on 
UTF-8, its operations lead to UTF-32.

Bye,
bearophile

Aug 22 2012

Don Clugston <dac nospam.com> writes:

On 23/08/12 05:05, bearophile wrote:
 Sean Kelly:

 I'm clearly missing something.  ASCII and UTF-8 are compatible.
  What's stopping you from just processing these as if they were UTF-8
 strings?

 std.algorithm is not closed
 (http://en.wikipedia.org/wiki/Closure_%28mathematics%29 ) on UTF-8, its
 operations lead to UTF-32.

Which operations in std.algorithm map 0-0x7F into higher characters?

Aug 23 2012

"bearophile" <bearophileHUGS lycos.com> writes:

Don Clugston:

 Which operations in std.algorithm map 0-0x7F into higher 
 characters?

The first example I've shown:

string s = "test string";
dchar[] s2 = map!(x => x)(s).array(); // Uses the Id function

Bye,
bearophile

Aug 23 2012

"Jonathan M Davis" <jmdavisProg gmx.com> writes:

On Wednesday, August 22, 2012 19:52:10 Sean Kelly wrote:
 I'm clearly missing something. ASCII and UTF-8 are compatible. What's
 stopping you from just processing these as if they were UTF-8 strings?

Range-based functions will treat arrays of char or wchar as forward ranges of 
dchar. Because of the variable length of their code points, they aren't 
considered to have length, be random access, or have slicing and will not 
generally work with range-based functions which require any of those 
operations (though some range-based functions do specialize on strings and use 
those operations where they can based on proper understanding of unicode).

On the other hand, if you have a string that specifically holds ASCII and you 
know that it only holds ASCII, you know that you can safely use length, random 
access, and slicing as if each code unit were a full code point. But the 
range-based functions don't know that your string is guaranteed to be ASCII-
only, so they continue to treat it as a range of dchar rather than char. The 
solution is to either create a wrapper range whose element type is char or to 
cast the char[] to ubyte[]. And Bearophile wants such a wrapper range to be 
added to Phobos.
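
The distinction can be checked at compile time with the range traits in std.range.primitives. This is a small sketch that assumes a Phobos version with string auto-decoding, as described above:

```d
import std.range.primitives : ElementType, hasLength, hasSlicing,
                              isRandomAccessRange;

// As a range, a string is a forward range of dchar without length,
// slicing or random access.
static assert(is(ElementType!string == dchar));
static assert(!hasLength!string);
static assert(!hasSlicing!string);
static assert(!isRandomAccessRange!string);

// A ubyte[] holding the same bytes keeps all of those capabilities.
static assert(is(ElementType!(ubyte[]) == ubyte));
static assert(hasLength!(ubyte[]));
static assert(hasSlicing!(ubyte[]));
static assert(isRandomAccessRange!(ubyte[]));

void main() {}
```

Note that the built-in array operations (s.length, s[i], s[a .. b]) still work on a string directly; it is only the range-based view used by generic algorithms that hides them.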

- Jonathan M Davis

Aug 22 2012

"bearophile" <bearophileHUGS lycos.com> writes:

Jonathan M Davis:

 And Bearophile wants such a wrapper range to be added to Phobos.

I am just asking if there is interest in it, if people see 
something wrong in having it in Phobos. Surely I am not demanding 
it :-)

Bye,
bearophile

Aug 22 2012

Sean Kelly <sean invisibleduck.org> writes:

On Aug 22, 2012, at 8:03 PM, Jonathan M Davis <jmdavisProg gmx.com> wrote:

 On Wednesday, August 22, 2012 19:52:10 Sean Kelly wrote:
 I'm clearly missing something. ASCII and UTF-8 are compatible. What's
 stopping you from just processing these as if they were UTF-8 strings?

 Range-based functions will treat arrays of char or wchar as forward ranges of
 dchar. Because of the variable length of their code points, they aren't
 considered to have length, be random access, or have slicing and will not
 generally work with range-based functions which require any of those
 operations (though some range-based functions do specialize on strings and use
 those operations where they can based on proper understanding of unicode).

Yeah.  I understand why the range-based functions use dchar, but for my
own use I generally want to work directly with a char string of UTF-8 so
I can slice buffers.  Typing these as uchar buffers isn't ideal, but it
does work.

 On the other hand, if you have a string that specifically holds ASCII and you
 know that it only holds ASCII, you know that you can safely use length, random
 access, and slicing as if each code unit were a full code point. But the
 range-based functions don't know that your string is guaranteed to be ASCII-only,
 so they continue to treat it as a range of dchar rather than char. The
 solution is to either create a wrapper range whose element type is char or to
 cast the char[] to ubyte[]. And Bearophile wants such a wrapper range to be
 added to Phobos.

Gotcha.  Despite it being something I'd use regularly, I wouldn't want
this in Phobos because it seems like it could cause maintenance
problems.  I'd rather explicitly cast to ubyte as a way to flag that I
was doing something potentially unsafe.

Aug 22 2012

"bearophile" <bearophileHUGS lycos.com> writes:

Sean Kelly:

 Gotcha.  Despite it being something I'd use regularly, I 
 wouldn't want this in Phobos because it seems like it could 
 cause maintenance problems.  I'd rather explicitly cast to 
 ubyte as a way to flag that I was doing something potentially 
 unsafe.

What's unsafe in what I have presented? The constructor verifies 
every char to be in 7 bits, and then you use the new type safely. 
No casts, and no need to flag something as unsafe.

This usage of types to denote capabilities is quite common in 
functional languages; see the articles I've recently linked here, such as:
http://tomasp.net/blog/type-first-development.aspx

Bye,
bearophile

Aug 23 2012

Sean Kelly <sean invisibleduck.org> writes:

On Aug 23, 2012, at 4:25 AM, bearophile <bearophileHUGS lycos.com> wrote:

 Sean Kelly:

 Gotcha.  Despite it being something I'd use regularly, I wouldn't
 want this in Phobos because it seems like it could cause maintenance
 problems.  I'd rather explicitly cast to ubyte as a way to flag that I
 was doing something potentially unsafe.

 What's unsafe in what I have presented? The constructor verifies every
 char to be in 7 bits, and then you use the new type safely. No casts,
 and no need to flag something as unsafe.

 This usage of types to denote capabilities is quite common in
 functional languages, see articles I've recently linked here as:
 http://tomasp.net/blog/type-first-development.aspx

So it throws an exception if there are non-ASCII characters in the
range?  Is this really better than just casting the input array to
ubyte?

Aug 23 2012

"bearophile" <bearophileHUGS lycos.com> writes:

Sean Kelly:

 So it throws an exception if there are non-ASCII characters in 
 the range?  Is this really better than just casting the input 
 array to ubyte?

The cast to ubyte[] doesn't perform a run-time test of the
validity of the input, so yeah, the exception is better. Your
code is also able to catch and manage the exception (like asking
the user for another valid input file).
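
To make the contrast concrete, here is a small sketch of a validating conversion next to a plain cast. The helper name checkedAscii is hypothetical and the use of enforce() is an assumption about how the check might be written.

```d
import std.exception : enforce;
import std.stdio : writeln;

// Hypothetical helper: returns its input unchanged, but only after
// checking that every code unit is 7-bit ASCII.
string checkedAscii(string s)
{
    foreach (char c; s)
        enforce(c < 0x80, "non-ASCII byte in input");
    return s;
}

void main()
{
    string bad = "caf\u00E9"; // the last character needs two UTF-8 bytes

    try
    {
        auto ok = checkedAscii("ACGT"); // passes
        auto oops = checkedAscii(bad);  // throws here
    }
    catch (Exception e)
    {
        // The caller can recover, e.g. by asking for another input file.
        writeln("rejected: ", e.msg);
    }

    // The cast accepts the same data silently; nothing is checked.
    auto raw = cast(immutable(ubyte)[]) bad;
    assert(raw.length == 5);
}
```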

If you carry around some type as "Astring", later you don't have
to cast it back to char[] to print the data as a string (this
discussion is about data that is naturally text, this discussion
is not about generic numerical octets).

An appropriate type statically encodes in your program that you
are using an ascii string. This makes your code more readable.
But when in the code you see a variable of generic type ubyte[]
it doesn't tell you a lot about its contents.

Bye,
bearophile

Aug 23 2012
