
digitalmars.D.learn - std.uni, std.ascii, std.encoding, std.utf ugh!

learner <learner nomail.com> writes:
Good morning,

Trying to do this:

```
bool foo(string s) nothrow { return s.all!isDigit; }
```

I realised that the conversion from char to dchar could throw.

I need to validate and operate over ascii strings and utf8 
strings, possibly in separate functions, what's the best way to 
transition between:

```
immutable(ubyte)[] -> validate utf8 -> string -> nothrow usage -> 
isDigit etc
immutable(ubyte)[] -> validate ascii -> AsciiString? -> nothrow 
usage -> isDigit etc
string             -> validate ascii -> AsciiString? -> nothrow 
usage -> isDigit etc
```

Thank you
May 05 2020
WebFreak001 <d.forum webfreak.org> writes:
On Tuesday, 5 May 2020 at 18:41:50 UTC, learner wrote:
 [...]
If you want nothrow operations on the sequence of characters (bytes) of the strings, use `str.representation` to get `immutable(ubyte)[]` and work on that. This is useful, for example, for indexOf (countUntil), startsWith, endsWith, etc. Make sure at least one of your inputs is validated though, to avoid handling or cutting off unfinished code points. I think this is the best way to go if you want to do simple things.

If your algorithm is complex enough that you would still like to decode but not crash, you can also manually call `.decode` with `UseReplacementDchar.yes` to make it emit \uFFFD for invalid characters.

To get the best of both worlds, use `.byUTF!dchar`, which gives you an input range to iterate over and defaults to using the replacement dchar. You can then call the various algorithm & array functions on it.

Unless you are working with encodings other than UTF-8 (for example in file or network operations), you shouldn't need std.encoding.

Also, a short explanation of the different modules:

std.ascii - simple functions to check and modify ASCII characters for various properties. Very easy to memorize everything inside it; you could easily rewrite what you need from scratch yourself. But of course this only handles the basic ASCII characters, meaning it's only really useful for low-level, almost binary file handling, not for user-facing parts which need to be international.

std.utf - ONLY encoding/decoding of Unicode code points to and from their UTF-8 / UTF-16 / UTF-32 representation. It doesn't have any idea what the characters actually mean; it only checks the format and enforces limits on code point values. You could still reasonably rewrite this from scratch if you ever chose to.

std.uni - all the categorization of every character into the different Unicode types, plus the algorithms for modifying / combining / normalizing / etc. code points into other code points. Doesn't do anything with UTF encoding. I honestly wouldn't want to be the one who rewrites this or ports this to another language.
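
As a rough illustration of the two nothrow approaches mentioned above, here is a minimal sketch. The function names `allDigitsBytes` and `allDigitsDecoded` are made up for this example; `representation`, `byUTF`, `all` and `isDigit` are the Phobos functions discussed in this thread. It assumes a reasonably recent Phobos where `byUTF!dchar` with the default replacement behaviour is inferred nothrow.

```d
import std.algorithm.searching : all;
import std.ascii : isDigit;
import std.string : representation;
import std.utf : byUTF;

// Hypothetical helper: looks at the raw bytes only. No decoding takes place,
// so there is no decoding exception to worry about.
bool allDigitsBytes(string s) nothrow
{
    return s.representation.all!(b => isDigit(b));
}

// Hypothetical helper: decodes lazily. Malformed sequences are yielded as
// U+FFFD (the replacement character) instead of throwing.
bool allDigitsDecoded(string s) nothrow
{
    return s.byUTF!dchar.all!(c => isDigit(c));
}

// Run with: rdmd -main -unittest thisfile.d
unittest
{
    // '1', an invalid UTF-8 byte, '2'
    immutable(ubyte)[] raw = [0x31, 0xFF, 0x32];
    auto broken = cast(string) raw;

    assert(allDigitsBytes("2020"));
    assert(!allDigitsBytes("20x0"));
    assert(allDigitsDecoded("2020"));
    assert(!allDigitsDecoded(broken)); // U+FFFD is not an ASCII digit
}
```

Both versions compile as nothrow because neither goes through the auto-decoding `front` of `string`, which is the part that can throw.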
May 05 2020
learner <learner nomail.com> writes:
On Tuesday, 5 May 2020 at 19:24:41 UTC, WebFreak001 wrote:
 [...]
Thank you WebFreak,
 if you want nothrow operations on the sequence of characters 
 (bytes) of the strings, use `str.representation` to get 
 `immutable(ubyte)[]` and work on that. This is useful for 
 example for doing indexOf (countUntil), startsWith, endsWith, 
 etc. Make sure at least one of your inputs is validated though 
 to avoid potentially handling or cutting off unfinished code 
 points. I think this is the best way to go if you want to do 
 simple things.
What I really want is a way to validate an immutable(ubyte)[] sequence as UTF8 or ASCII and, from that point forward, apply functions like isDigit in nothrow functions.
 If your algorithm is sufficiently complex that you would like 
 to still decode but not crash, you can also manually call 
 .decode with UseReplacementDchar.yes to make it emit \uFFFD for 
 invalid characters.
I will simply reject invalid UTF8 input, that's coming from I/O
 To get the best of both worlds, use `.byUTF!dchar` which gives 
 you an input range to iterate over and defaults to using 
 replacement dchar. You can then call the various algorithm & 
 array functions on it.
Can you explain better?
 Unless you are working with different encodings than UTF-8 
 (like doing file or network operations) you shouldn't be 
 needing std.encoding.
I'm expecting UTF8 and ASCII encoding from I/O.

Thank you!
May 06 2020
WebFreak001 <d.forum webfreak.org> writes:
On Wednesday, 6 May 2020 at 10:57:59 UTC, learner wrote:
 [...]
Using .representation would be like assuming UTF-8, while .byUTF!dchar will still test and replace invalid characters.

If you want to check whether a string is UTF-8 beforehand, use `std.utf : validate`; it will throw a UTFException in case of malformed UTF-8. However, this will not magically make your algorithms nothrow, although of course they won't actually throw decoding exceptions in that case.

If you want to give the nothrow attribute to your functions, you will need to work with .representation or .byUTF!dchar.
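
One possible shape for the "validate once at the I/O boundary, then stay nothrow" flow asked about earlier in the thread is sketched below. The helpers `tryUtf8`, `isAllAscii` and `allDigits` are hypothetical names invented for this example, not Phobos functions; only `validate`, `byCodeUnit`, `all` and `isDigit` come from the standard library. Treat it as a sketch under those assumptions rather than the recommended design.

```d
import std.algorithm.searching : all;
import std.ascii : isDigit;
import std.utf : byCodeUnit, validate;

// Hypothetical helper: returns true if `raw` is well-formed UTF-8. On success
// `text` refers to the same bytes typed as a string (no copy is made).
bool tryUtf8(immutable(ubyte)[] raw, out string text) nothrow
{
    auto s = cast(string) raw;
    try
    {
        validate(s); // throws UTFException on malformed input
    }
    catch (Exception)
    {
        return false;
    }
    text = s;
    return true;
}

// Hypothetical helper: true if every byte is in the 7-bit ASCII range.
bool isAllAscii(immutable(ubyte)[] raw) nothrow
{
    return raw.all!(b => b < 0x80);
}

// After validation, iterate over code units rather than auto-decoded dchars,
// so the function can carry the nothrow attribute.
bool allDigits(string s) nothrow
{
    return s.byCodeUnit.all!isDigit;
}

// Run with: rdmd -main -unittest thisfile.d
unittest
{
    immutable(ubyte)[] good = [0x34, 0x32]; // "42"
    immutable(ubyte)[] bad = [0x34, 0xFF];  // '4' followed by an invalid byte

    string text;
    assert(tryUtf8(good, text) && allDigits(text));
    assert(!tryUtf8(bad, text));
    assert(isAllAscii(good) && !isAllAscii(bad));
}
```

The try/catch lives in exactly one place, at the boundary; everything downstream only sees data that has already been checked. For ASCII-only checks such as isDigit, iterating code units gives the same answer as decoding, since every byte of a multi-byte sequence is 0x80 or above.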
May 06 2020