digitalmars.D.learn - Converting a character to upper case in string


NX <nightmarex1337 hotmail.com> writes:

How can I properly convert a character, say the first one, to upper 
case in a Unicode-correct manner?
At which level should I be working? Grapheme? Or maybe 
code point is sufficient?

There are a few Phobos functions like asCapitalized(), none of which 
are what I want.

Sep 21 2018

Laurent Tréguier <laurent.treguier.sink gmail.com> writes:

On Friday, 21 September 2018 at 12:15:52 UTC, NX wrote:
 How can I properly convert a character, say, first one to upper 
 case in a unicode correct manner?
 In which code level I should be working on? Grapheme? Or maybe 
 code point is sufficient?

 There are few phobos functions like asCapitalized() none of 
 which are what I want.

I would probably go for std.utf.decode [1] to get the character 
and its length in code units, capitalize it, and concatenate the 
result with the rest of the string.

[1] https://dlang.org/phobos/std_utf.html#.decode

Sep 21 2018

NX <nightmarex1337 hotmail.com> writes:

On Friday, 21 September 2018 at 12:34:12 UTC, Laurent Tréguier 
wrote:
 I would probably go for std.utf.decode [1] to get the character 
 and its length in code units, capitalize it, and concatenate 
 the result with the rest of the string.

 [1] https://dlang.org/phobos/std_utf.html#.decode

So by this I assume it is sufficient to work with dchars rather 
than graphemes?

Sep 21 2018

Laurent Tréguier <laurent.treguier.sink gmail.com> writes:

On Friday, 21 September 2018 at 13:32:54 UTC, NX wrote:
 On Friday, 21 September 2018 at 12:34:12 UTC, Laurent Tréguier 
 wrote:
 I would probably go for std.utf.decode [1] to get the 
 character and its length in code units, capitalize it, and 
 concatenate the result with the rest of the string.

 [1] https://dlang.org/phobos/std_utf.html#.decode

 So by this I assume it is sufficient to work with dchars rather 
 than graphemes?

From what I've tested, it seems sufficient. I might be wrong 
though; I'm no Unicode expert. It might still be a good idea to 
have a look at the grapheme-related functions.

Sep 21 2018

Laurent Tréguier <laurent.treguier.sink gmail.com> writes:

On Friday, 21 September 2018 at 13:32:54 UTC, NX wrote:
 On Friday, 21 September 2018 at 12:34:12 UTC, Laurent Tréguier 
 wrote:
 I would probably go for std.utf.decode [1] to get the 
 character and its length in code units, capitalize it, and 
 concatenate the result with the rest of the string.

 [1] https://dlang.org/phobos/std_utf.html#.decode

 So by this I assume it is sufficient to work with dchars rather 
 than graphemes?

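----------
import std.conv : text;
import std.stdio : writeln;
import std.string : capitalize;
import std.uni : decodeGrapheme, graphemeStride;

void main()
{
    size_t index = 1; // code unit offset of the grapheme to capitalize
    auto theString = "he\u0308llo, world"; // 'e' + combining diaeresis
    auto theStringPart = theString[index .. $];
    // decodeGrapheme pops one whole grapheme off the front of theStringPart
    auto firstLetter = theStringPart.decodeGrapheme;
    auto result = theString[0 .. index]
        ~ capitalize(firstLetter[].text)
        ~ theString[index + graphemeStride(theString, index) .. $];
    writeln(result);
}
----------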

This will capitalize graphemes as a whole, and might be better 
than what I previously wrote.

Sep 21 2018

Laurent Tréguier <laurent.treguier.sink gmail.com> writes:

On Friday, 21 September 2018 at 12:15:52 UTC, NX wrote:
 How can I properly convert a character, say, first one to upper 
 case in a unicode correct manner?
 In which code level I should be working on? Grapheme? Or maybe 
 code point is sufficient?

 There are few phobos functions like asCapitalized() none of 
 which are what I want.

----------
import std.conv : to;
import std.stdio : writeln;
import std.string : capitalize;
import std.utf : decode;

void main()
{
    size_t index = 1;
    size_t oldIndex = index; // decode advances index, so remember the start
    auto theString = "hëllo, world";
    auto firstLetter = theString.decode(index);
    auto result = theString[0 .. oldIndex]
        ~ capitalize(firstLetter.to!string)
        ~ theString[index .. $];
    writeln(result);
}
----------

(This could be a lot prettier, but this seems to basically work)
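
The same decode-based approach could be wrapped in a small function, roughly as sketched below. The name `upperCodePointAt` is hypothetical (it is not a Phobos function), and the sketch assumes `index` points at the start of a code point; it works on a single code point only, so it does not address the grapheme or locale questions.

```d
import std.conv : text;
import std.uni : toUpper;
import std.utf : decode;

/// Hypothetical helper: uppercases the single code point that starts at
/// code unit offset `index` and leaves the rest of the string untouched.
string upperCodePointAt(string s, size_t index)
{
    size_t next = index;
    // decode returns the code point and advances `next` past it
    immutable dchar c = s.decode(next);
    // text() re-encodes the uppercased dchar as UTF-8 between the two slices
    return text(s[0 .. index], c.toUpper, s[next .. $]);
}

void main()
{
    import std.stdio : writeln;

    // \u00EB is the precomposed 'ë', \u00CB the precomposed 'Ë'
    assert(upperCodePointAt("h\u00EBllo, world", 1) == "h\u00CBllo, world");
    writeln(upperCodePointAt("h\u00EBllo, world", 1));
}
```

Using `toUpper` on the `dchar` directly avoids the round trip through a temporary string that `capitalize` needs.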

Sep 21 2018

Gary Willoughby <dev nomad.uk.net> writes:

On Friday, 21 September 2018 at 12:15:52 UTC, NX wrote:
 How can I properly convert a character, say, first one to upper 
 case in a unicode correct manner?
 In which code level I should be working on? Grapheme? Or maybe 
 code point is sufficient?

 There are few phobos functions like asCapitalized() none of 
 which are what I want.

Use `asCapitalized` to capitalize the first letter or use 
something like this:


import std.conv;
import std.range;
import std.stdio;
import std.uni;

void main(string[] args)
{
	string input = "noe\u0308l";
	int index    = 2;

	auto graphemes    = input.byGrapheme.array;
	string upperCased = [graphemes[index]].byCodePoint.text.toUpper;

	graphemes[index] = upperCased.decodeGrapheme;
	string output    = graphemes.byCodePoint.text;

	writeln(output);
}
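
For comparison, here is a minimal sketch of what `asCapitalized` does on its own: it uppercases the first character and lowercases everything after it, which may be why it does not fit when the rest of the string has to stay as it is. The sample string is made up for illustration.

```d
import std.algorithm.comparison : equal;
import std.uni : asCapitalized;

void main()
{
    // First character is uppercased, all following characters are lowercased
    assert("hEllo wORLD".asCapitalized.equal("Hello world"));
}
```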

Sep 21 2018

Vladimir Panteleev <thecybershadow.lists gmail.com> writes:

On Friday, 21 September 2018 at 12:15:52 UTC, NX wrote:
 How can I properly convert a character, say, first one to upper 
 case in a unicode correct manner?

That would depend on how you'd define correctness. If your 
application needs to support "all" languages, then (depending how 
you interpret it) the task may not be meaningful, as some 
languages don't have the notion of "upper-case" or even 
"character" (as an individual glyph). Some languages do have 
those notions, but they serve a specific purpose that doesn't 
align with the one in English (e.g. Lojban).

 In which code level I should be working on? Grapheme? Or maybe 
 code point is sufficient?

Using graphemes is necessary if you need to support e.g. 
combining marks (e.g. ◌̏ + S = S̏).
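
A small sketch of the difference between the levels, using an S followed by a combining double grave accent (U+030F). The helper name `graphemeCount` is hypothetical; the Phobos calls are `byGrapheme`, `graphemeStride` and `walkLength`.

```d
import std.range : walkLength;
import std.uni : byGrapheme, graphemeStride;

/// Hypothetical helper: number of graphemes (user-perceived characters) in s.
size_t graphemeCount(string s)
{
    return s.byGrapheme.walkLength;
}

void main()
{
    string s = "S\u030F"; // 'S' + COMBINING DOUBLE GRAVE ACCENT

    assert(s.length == 3);               // UTF-8 code units
    assert(s.walkLength == 2);           // code points
    assert(graphemeCount(s) == 1);       // graphemes
    assert(graphemeStride(s, 0) == 3);   // code units covered by the first grapheme
}
```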

Sep 21 2018

Patrick Schluter <Patrick.Schluter bbox.fr> writes:

On Saturday, 22 September 2018 at 06:01:20 UTC, Vladimir 
Panteleev wrote:
 On Friday, 21 September 2018 at 12:15:52 UTC, NX wrote:
 How can I properly convert a character, say, first one to 
 upper case in a unicode correct manner?

 That would depend on how you'd define correctness. If your 
 application needs to support "all" languages, then (depending 
 how you interpret it) the task may not be meaningful, as some 
 languages don't have the notion of "upper-case" or even 
 "character" (as an individual glyph). Some languages do have 
 those notions, but they serve a specific purpose that doesn't 
 align with the one in English (e.g. Lojban).

There are other traps in the question of uppercase/lowercase 
which make it very difficult indeed to handle correctly if we 
don't define what "correctly" means.
Examples:
- It may be necessary to know the locale, i.e. the language of 
the string to uppercase. In Turkish, the uppercase of i is not I 
but İ, and the lowercase of I is ı (that was a reason for the 
calamitous low performance of toUpper/toLower in Java, for 
example).
- Some uppercases depend on what they are used for. German ß 
should be uppercased as SS in normal text (note also, btw, that 
1 code point becomes 2 in uppercase), but for calligraphic work, 
road signs and other usages it can be capital ẞ.
- Greek has one uppercase form Σ but two lowercase forms, σ and 
ς, depending on the position in the word.
- While it becomes less and less relevant, Serbo-Croatian may use 
digraphs when transcoding the script from Cyrillic (Serbian) to 
Latin (Croatian); these digraphs have 2 uppercase forms 
(all-capital and title-case):
   - ǆ -> Ǆ or ǅ
   - ǉ -> Ǉ or ǈ
   - ǌ -> Ǌ or ǋ
Normalization would normally take care of that case.
- Some languages may modify or remove diacritical signs when 
uppercasing. It is quite usual in French to not put accents on 
capitals.

It is also clear that the operation of uppercasing is not 
symmetric with lowercasing.
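
Two of these points can be checked with the single code point overloads of `std.uni.toUpper` and `std.uni.toLower`, which take no locale argument. This is only a sketch; the helper name `roundTrips` is hypothetical.

```d
import std.uni : toLower, toUpper;

/// Hypothetical helper: does upper-then-lower give back the same code point?
bool roundTrips(dchar c)
{
    return toLower(toUpper(c)) == c;
}

void main()
{
    // No locale is involved, so the Turkish dotted/dotless i is not handled
    assert(toUpper('i') == 'I');
    assert(toLower('I') == 'i');

    immutable dchar capitalSigma = '\u03A3'; // Σ
    immutable dchar smallSigma   = '\u03C3'; // σ
    immutable dchar finalSigma   = '\u03C2'; // ς

    // Both lowercase sigmas map to the same capital ...
    assert(toUpper(smallSigma) == capitalSigma);
    assert(toUpper(finalSigma) == capitalSigma);
    // ... but the capital maps back to only one of them
    assert(toLower(capitalSigma) == smallSigma);

    assert(roundTrips(smallSigma));
    assert(!roundTrips(finalSigma));
}
```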

 In which code level I should be working on? Grapheme? Or maybe 
 code point is sufficient?

 Using graphemes is necessary if you need to support e.g. 
 combining marks (e.g. ̏◌ + S = ̏S).

Sep 22 2018

bauss <jj_1337 live.dk> writes:

On Saturday, 22 September 2018 at 06:01:20 UTC, Vladimir 
Panteleev wrote:
 On Friday, 21 September 2018 at 12:15:52 UTC, NX wrote:
 How can I properly convert a character, say, first one to 
 upper case in a unicode correct manner?

 That would depend on how you'd define correctness. If your 
 application needs to support "all" languages, then (depending 
 how you interpret it) the task may not be meaningful, as some 
 languages don't have the notion of "upper-case" or even 
 "character" (as an individual glyph). Some languages do have 
 those notions, but they serve a specific purpose that doesn't 
 align with the one in English (e.g. Lojban).

 In which code level I should be working on? Grapheme? Or maybe 
 code point is sufficient?

 Using graphemes is necessary if you need to support e.g. 
 combining marks (e.g. ̏◌ + S = ̏S).

Uppercase and lowercase get even more funky with Turkish.

Sep 22 2018
