digitalmars.D.learn - Converting Unicode Escape Sequences to UTF-8


Nordlöw <per.nordlow gmail.com> writes:

How do I convert a `string` containing Unicode escape sequences 
such as "\uXXXX" into UTF-8?

Oct 22 2015

Ali Çehreli <acehreli yahoo.com> writes:

On 10/22/2015 11:10 AM, Nordlöw wrote:
 How do I convert a `string` containing Unicode escape sequences such as
 "\uXXXX" into UTF-8?

It's already UTF-8 because it's a 'string'. :)

import std.stdio;

void main() {
     auto s = "\u1234";

     foreach (codeUnit; s) {
         writefln("%02x %08b", codeUnit, codeUnit);
     }
}

The output shows three UTF-8 code units (bytes) for "U+1234 ETHIOPIC 
SYLLABLE SEE", not two:

e1 11100001
88 10001000
b4 10110100
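
To make the distinction concrete: the compiler only decodes the escape when it appears in a regular D string literal. Text read at run time, for example from a file, still contains the backslash and the 'u' as ordinary characters. A minimal sketch, using a WYSIWYG (backtick) literal to stand in for such external input (the variable names are just for illustration):

```d
import std.stdio : writeln;

void main()
{
    // Regular literal: the compiler replaces the escape with U+1234,
    // which is stored as three UTF-8 code units.
    string literal = "\u1234";

    // WYSIWYG literal: no escape processing, so this holds the six
    // ASCII characters \ u 1 2 3 4, just like text read from a file.
    string raw = `\u1234`;

    assert(literal.length == 3);
    assert(raw.length == 6);
    writeln(literal.length, " ", raw.length);
}
```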

Ali

Oct 22 2015

anonymous <anonymous example.com> writes:

On Thursday, October 22, 2015 08:10 PM, Nordlöw wrote:

 How do I convert a `string` containing Unicode escape sequences
 such as "\uXXXX" into UTF-8?

Ali explained that "\uXXXX" is already UTF-8.

But if you actually want to interpret such escape sequences from user input 
or some such, then find all occurrences, and for each of them do:

* Drop the backslash and the 'u'.
* Parse XXXX as a hexadecimal integer, and cast to dchar.
* Use std.utf.encode to convert to UTF-8. std.conv.to can probably do it 
too, and possibly simpler, but would allocate.

Also be aware of the longer variant with a capital U: \UXXXXXXXX (8 Xs)
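
The steps above could be sketched roughly as follows. This is only an illustration: `decodeUnicodeEscapes` is a made-up name, not a Phobos function. It assumes every `\u` is followed by exactly 4 hex digits and every `\U` by exactly 8, it does not treat `\\` as an escaped backslash, and it does not combine JSON-style surrogate pairs (std.utf.encode rejects surrogates with a UTFException).

```d
import std.array : appender;
import std.conv : to;
import std.stdio : writeln;
import std.utf : encode;

/// Hypothetical helper: replaces \uXXXX and \UXXXXXXXX with UTF-8.
string decodeUnicodeEscapes(string input)
{
    auto result = appender!string();
    size_t i = 0;
    while (i < input.length)
    {
        // Number of hex digits expected here, or 0 if no escape starts at i.
        size_t digits = 0;
        if (input[i] == '\\' && i + 1 < input.length)
        {
            if (input[i + 1] == 'u') digits = 4;
            else if (input[i + 1] == 'U') digits = 8;
        }

        if (digits != 0 && i + 2 + digits <= input.length)
        {
            // Drop the backslash and the 'u'/'U', then parse the hex digits.
            // to!uint throws a ConvException if they are not valid hex.
            string hex = input[i + 2 .. i + 2 + digits];
            const codePoint = to!uint(hex, 16);

            // Encode the code point as 1 to 4 UTF-8 code units on the stack.
            char[4] buf;
            const len = encode(buf, cast(dchar) codePoint);
            result.put(buf[0 .. len]);
            i += 2 + digits;
        }
        else
        {
            // Not an escape (or a truncated one): copy the code unit as is.
            result.put(input[i]);
            ++i;
        }
    }
    return result.data;
}

void main()
{
    // WYSIWYG literals keep the backslashes, mimicking external input.
    assert(decodeUnicodeEscapes(`caf\u00e9`) == "caf\u00e9");
    writeln(decodeUnicodeEscapes(`caf\u00e9 \U0001F600`));
}
```

The result is still built with an appender, so it allocates once for the output; only the per-character encoding goes through the fixed `char[4]` buffer.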

Oct 22 2015

Nordlöw <per.nordlow gmail.com> writes:

On Thursday, 22 October 2015 at 18:40:06 UTC, anonymous wrote:
 On Thursday, October 22, 2015 08:10 PM, Nordlöw wrote:

 How do I convert a `string` containing Unicode escape 
 sequences such as "\uXXXX" into UTF-8?

 Ali explained that "\uXXXX" is already UTF-8.

 But if you actually want to interpret such escape sequences 
 from user input or some such, then find all occurrences, and 
 for each of them do:

Yep, that's exactly what I want to do.

I want to use this to correctly decode DBpedia downloads, since 
DBpedia encodes its Unicode characters with these sequences.

 * Drop the backslash and the 'u'.
 * Parse XXXX as a hexadecimal integer, and cast to dchar.
 * Use std.utf.encode to convert to UTF-8. std.conv.to can probably do it
 too, and possibly simpler, but would allocate.

 Also be aware of the longer variant with a capital U: \UXXXXXXXX (8 Xs)

Hmm, why isn't this already in Phobos?

Oct 22 2015

Nordlöw <per.nordlow gmail.com> writes:

On Thursday, 22 October 2015 at 19:13:20 UTC, Nordlöw wrote:
 * Drop the backslash and the 'u'.
 * Parse XXXX as a hexadecimal integer, and cast to dchar.
 * Use std.utf.encode to convert to UTF-8. std.conv.to can probably do it
 too, and possibly simpler, but would allocate.

 Also be aware of the longer variant with a capital U: \UXXXXXXXX (8 Xs)


Can somebody point out in which function/file DMD does this 
decoding?

Oct 22 2015

Nordlöw <per.nordlow gmail.com> writes:

On Thursday, 22 October 2015 at 19:16:36 UTC, Nordlöw wrote:
 Can somebody point out in which function/file DMD does this 
 decoding?

std.conv.parseEscape includes this logic.

But why is it private?
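
One possible way to reach that logic through the public API, as far as I can tell, is std.conv.parse with an array-of-strings target: quoted elements appear to be run through the same escape handling. A small sketch, assuming the input can be wrapped as `["..."]` and contains no unescaped double quotes. Note that this interprets all D-style escapes (`\n`, `\x41`, and so on), not only the Unicode ones:

```d
import std.conv : parse;
import std.stdio : writeln;

void main()
{
    // WYSIWYG literal: the backslashes are ordinary characters here.
    // The text is wrapped in [" ... "] so it looks like a D array literal.
    string input = `["\u1234 and \U0001F600"]`;

    // parse takes its source by ref and consumes what it has parsed.
    auto parsed = parse!(string[])(input);

    assert(parsed.length == 1);
    assert(parsed[0] == "\u1234 and \U0001F600");
    writeln(parsed[0]);
}
```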

Oct 22 2015

anonymous <anonymous example.com> writes:

On 22.10.2015 21:13, Nordlöw wrote:
 Hmm, why isn't this already in Phobos?

I think parsing only Unicode escape sequences is not a common task. You 
usually need to parse some larger language of which escape sequences are 
only a part. For example, parsing JSON or XML are common tasks, and we 
have modules for them.

When we don't have a module for the language in question, then it's 
still likely that you need to parse more than just Unicode escape 
sequences. Some parseUnicodeEscapeSequence function would then probably 
not buy you much on the convenience side but cost you some on the 
performance side.

Also, since escape sequences are defined as part of larger languages, 
they are not well-defined by themselves. We could have a function that 
parses D style sequences, but strictly that would only be good for 
parsing D code.

Oct 22 2015

Nordlöw <per.nordlow gmail.com> writes:

On Thursday, 22 October 2015 at 21:52:05 UTC, anonymous wrote:
 On 22.10.2015 21:13, Nordlöw wrote:
 Hmm, why isn't this already in Phobos?


Working first version at

https://github.com/nordlow/justd/blob/master/conv_ex.d#L207

Next I'll make it a range.

Oct 24 2015

Nordlöw <per.nordlow gmail.com> writes:

On Saturday, 24 October 2015 at 08:54:40 UTC, Nordlöw wrote:
 Working first version at

 https://github.com/nordlow/justd/blob/master/conv_ex.d#L207

 Next I'll make it a range.

Made it a range:

https://github.com/nordlow/justd/blob/master/conv_ex.d#L207
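
For readers who cannot follow the link, a lazy version might look roughly like the sketch below. This is not the code behind the link; `UnicodeEscapeDecoder` is a hypothetical name. It is an input range of dchar that decodes `\uXXXX` and `\UXXXXXXXX` on demand and passes everything else through. It assumes well-formed escapes (to!uint throws on non-hex digits) and does not validate the resulting code point.

```d
import std.array : array;
import std.conv : to;
import std.stdio : writeln;
import std.utf : decode;

/// Hypothetical lazy input range of dchar over escaped text.
struct UnicodeEscapeDecoder
{
    private string rest;    // input not yet consumed
    private dchar current;  // element returned by front
    private bool hasFront;  // false once the input is exhausted

    this(string input)
    {
        rest = input;
        popFront(); // prime the first element
    }

    @property bool empty() const { return !hasFront; }
    @property dchar front() const { return current; }

    void popFront()
    {
        if (rest.length == 0)
        {
            hasFront = false;
            return;
        }

        // Number of hex digits expected, or 0 if no escape starts here.
        size_t digits = 0;
        if (rest.length >= 2 && rest[0] == '\\')
        {
            if (rest[1] == 'u') digits = 4;
            else if (rest[1] == 'U') digits = 8;
        }

        if (digits != 0 && rest.length >= 2 + digits)
        {
            // Skip the backslash and marker, parse the hex digits.
            current = cast(dchar) to!uint(rest[2 .. 2 + digits], 16);
            rest = rest[2 + digits .. $];
        }
        else
        {
            // Ordinary text: decode one UTF-8 code point.
            size_t index = 0;
            current = decode(rest, index);
            rest = rest[index .. $];
        }
        hasFront = true;
    }
}

void main()
{
    // Collect the lazily decoded code points into a dchar[].
    dchar[] decoded = UnicodeEscapeDecoder(`a\u1234b`).array;
    assert(decoded == "a\u1234b"d);
    writeln(decoded);
}
```

Because nothing is buffered, the range can be fed straight into other range algorithms without building an intermediate string.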

Oct 24 2015
