
digitalmars.D.learn - Get Character At?

reply okibi <okibi ratedo.com> writes:
Is there a getCharAt() function for D?

Thanks!
Apr 24 2007
parent reply Derek Parnell <derek psych.ward> writes:
On Tue, 24 Apr 2007 10:30:16 -0400, okibi wrote:

 Is there a getCharAt() function for D?

Get a character from what? A string, a file, a console screen, ...?

-- 
Derek Parnell
Melbourne, Australia
"Justice for David Hicks!"
skype: derek.j.parnell
Apr 24 2007
parent reply okibi <okibi ratedo.com> writes:
Derek Parnell wrote:

 Get a character from what? A string, a file, a console screen, ...?

Such as this:

    char[] text = "This is a test sentence.";
    int loc = 5;
    char num5 = text.getCharAt(loc);

Something along those lines.
Apr 24 2007
next sibling parent reply Tomas Lindquist Olsen <tomas famolsen.dk> writes:
okibi wrote:

 Such as this:
 char[] text = "This is a test sentence.";
 int loc = 5;
 char num5 = text.getCharAt(loc);

Why not just do:

    char[] text = "some text";
    char num5 = text[5];
Apr 24 2007
next sibling parent reply okibi <okibi ratedo.com> writes:
Tomas Lindquist Olsen wrote:

 Why not just do:
 char[] text = "some text";
 char num5 = text[5];

Because it isn't working for me. That was what I was trying to do seeing as char[] is simply an array of characters. However, it's returning an int and not a char.
Apr 24 2007
next sibling parent BCS <BCS pathlink.com> writes:
okibi wrote:

 Because it isn't working for me. That was what I was trying to do seeing as
 char[] is simply an array of characters. However, it's returning an int and not
 a char.

How about a little more code. What I've seen so far should work.
Apr 24 2007
prev sibling parent reply Tomas Lindquist Olsen <tomas famolsen.dk> writes:
okibi wrote:

 Because it isn't working for me. That was what I was trying to do seeing as char[] is simply an array of characters. However, it's returning an int and not a char.

    import std.stdio;

    void main()
    {
        char[] text = "this is a sentence";
        int loc = 5;
        writefln("%s", typeid(typeof(text[loc])));
    }

This prints 'char' as expected...
Apr 24 2007
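For readers trying this on a current D2 compiler (an assumption; the thread itself predates D2), the same check might be sketched as follows. The `.dup` call is needed there because string literals are immutable in D2; the variable names simply mirror the post above.

```d
import std.stdio : writeln;

void main()
{
    // In D2 a literal is immutable(char)[], so copy it to get a char[].
    char[] text = "this is a sentence".dup;
    size_t loc = 5;

    // Compile-time check: indexing a char[] yields a char, not an int.
    static assert(is(typeof(text[loc]) == char));

    writeln(typeid(typeof(text[loc]))); // name of the element type
    writeln(text[loc]);                 // the code unit at index 5
}
```

If an integer shows up instead of a character, the usual cause is the way the value is stored or printed afterwards (for example assigning it to an int), not the indexing itself.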
parent okibi <okibi ratedo.com> writes:
Tomas Lindquist Olsen wrote:

 writefln("%s", typeid(typeof(text[loc])));

 this prints 'char' as expected...

That fixed the problem, thanks!
Apr 24 2007
prev sibling parent reply Clay Smith <clayasaurus gmail.com> writes:
Tomas Lindquist Olsen wrote:

 Why not just do:
 char[] text = "some text";
 char num5 = text[5];

text[5] will return the sixth element in the array.
Apr 24 2007
parent Tomas Lindquist Olsen <tomas famolsen.dk> writes:
Clay Smith wrote:
 
 text[5] will return the sixth element in the array.

He never said anything about getCharAt starting at one...
Apr 24 2007
prev sibling next sibling parent Clay Smith <clayasaurus gmail.com> writes:
okibi wrote:

 Such as this:
 char[] text = "This is a test sentence.";
 int loc = 5;
 char num5 = text.getCharAt(loc);

Just use char num5 = text[loc-1]; ?
Apr 24 2007
prev sibling parent reply Derek Parnell <derek psych.ward> writes:
On Tue, 24 Apr 2007 11:56:19 -0400, okibi wrote:

 Such as this:
 char[] text = "This is a test sentence.";
 int loc = 5;
 char num5 = text.getCharAt(loc);

Because char[] represents a UTF-8 encoded Unicode string, to get the Nth character (the first character is at position 1), try this ...

    import std.stdio;
    import std.utf;

    T getCharAt(T)(T pText, uint pPos)
    {
        size_t lUTF_Index;
        uint   lStride;

        // Firstly, find out where the character starts in the string.
        lUTF_Index = std.utf.toUTFindex(pText, pPos-1);

        // Then find out its width (in bytes)
        lStride = std.utf.stride(pText, lUTF_Index);

        // Return the character encoded in UTF format.
        return pText[lUTF_Index .. lUTF_Index + lStride];
    }

    void main()
    {
        char[] text = "a\ua034bcdef";
        uint loc = 4;
        writefln("%s", getCharAt(text, loc)); // shows "c"
        writefln("%s", text[loc-1]);          // correctly fails
    }

If you just use 'text[loc]', you may not get the correct character, and you actually only get a UTF code point fragment anyway. Remember that char[] is not an array of characters. It is an array of UTF-8 code point fragments (each 1 byte wide), and a UTF-8 encoded character (code point) can have from 1 to 4 fragments.
Apr 24 2007
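The difference between bytes and characters described above can be seen directly. The following is a minimal sketch assuming a current D2 compiler and Phobos (newer than the library version used in this thread); it reuses the sample string from the post and relies on `std.utf.count` and `std.utf.stride`.

```d
import std.stdio : writefln;
import std.utf : count, stride;

void main()
{
    // U+A034 takes three UTF-8 code units, the ASCII letters one each.
    string text = "a\uA034bcdef";

    // .length counts code units (bytes), count() counts code points.
    writefln("code units:  %s", text.length);
    writefln("code points: %s", count(text));

    // text[1] is only the lead byte of the three-byte sequence.
    writefln("text[1] = 0x%02X", cast(ubyte) text[1]);
    writefln("stride at index 1 = %s", stride(text, 1));
}
```

What the post calls a "code point fragment" is usually called a code unit in Unicode terminology.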
next sibling parent Chris Nicholson-Sauls <ibisbasenji gmail.com> writes:
Derek Parnell wrote:

 Remember that char[] is not an array of characters. It is an array of UTF-8 code point fragments (each 1-byte wide) and a UTF-8 encoded character (code point) can have from 1 to 4 fragments.

Which is why I tend to try and bite the bullet and just use dchar[] for general purpose things. I only use char[] in cases where I know it's "safe" to do so (that is, cases where I know what the input will be, and know it will be within the single-byte character range). That said, it's a darn good thing Phobos has std.utf and Tango has tango.utils.Utf, otherwise we'd often be in a pickle. (Avoiding potential tango.io joke.)

-- Chris Nicholson-Sauls
Apr 24 2007
prev sibling parent reply Daniel Keep <daniel.keep.lists gmail.com> writes:
Derek Parnell wrote:

 Because char[] represents a UTF-8 encoded unicode string, to get the Nth character (first character is at position 1), try this ...

I was going to post a link to my old Text In D article[1], but I guess that'd be redundant now :P Incidentally, I don't suppose you know anything about the relative performance of your method up there ^^ and the one in my article down here vv:
 dchar nthCharacter(char[] string, int n)
 {
     int curChar = 0;
     foreach( dchar cp ; string )
         if( curChar++ == n )
             return cp;
     return dchar.init;
 }

I'm curious since I don't want to recommend a slow solution if I can help it :)

    -- Daniel

[1] http://www.prowiki.org/wiki4d/wiki.cgi?DanielKeep/TextInD

-- 
int getRandomNumber()
{
    return 4; // chosen by fair dice roll.
              // guaranteed to be random.
}

http://xkcd.com/

v2sw5+8Yhw5ln4+5pr6OFPma8u6+7Lw4Tm6+7l6+7D
i28a2Xs3MSr2e4/6+7t4TNSMb6HTOp5en5g6RAHCP    http://hackerkey.com/
Apr 24 2007
parent reply Derek Parnell <derek psych.ward> writes:
On Wed, 25 Apr 2007 13:41:25 +1000, Daniel Keep wrote:

 Incidentally, I don't suppose you know anything about the relative
 performance of your method up there ^^ and the one in my article down
 here vv:

It seems that your routine is about 3 times slower than the one I had shown. Here is my test program ...

I modified your routine slightly because the idiom "if (x++ == n)" is a dangerous one, as it is unclear if 'x' gets incremented before or after the comparison. I changed it to be more clear. I also changed my routine to output a dchar rather than a char[] and to test for invalid position input.

//-----------------------------
import std.perf;
import std.stdio;
import std.utf;

dchar getCharAt(T)(T pText, int pPos)
{
    size_t lUTF_Index;
    uint   lStride;

    // Reject out-of-range positions (note: .length counts bytes,
    // not characters, so this is only a coarse upper bound).
    if (pPos < 0 || pPos >= pText.length)
        return dchar.init;

    // Firstly, find out where the character starts in the string.
    lUTF_Index = std.utf.toUTFindex(pText, pPos);

    // Then find out its width (in bytes)
    lStride = std.utf.stride(pText, lUTF_Index);

    // Return the character encoded in UTF format.
    return std.utf.toUTF32(
             pText[lUTF_Index .. lUTF_Index + lStride])[0];
}

dchar nthCharacter(T)(T string, int n)
{
    int curChar = 0;
    foreach( dchar cp ; string )
    {
        if( curChar == n )
            return cp;
        curChar++;
    }
    return dchar.init;
}

void main()
{
    char[] text = "a\ua034bcdefa\ua034bcdefa\ua034bcdefa\ua034bcdefg1"
                  "a\ua034bcdefa\ua034bcdefa\ua034bcdefa\ua034bcdefg2"
                  "a\ua034bcdefa\ua034bcdefa\ua034bcdefa\ua034bcdefg3"
                  "a\ua034bcdefa\ua034bcdefa\ua034bcdefa\ua034bcdefg4"
                  "a\ua034bcdefa\ua034bcdefa\ua034bcdefa\ua034bcdefg5"
                  "a\ua034bcdefa\ua034bcdefa\ua034bcdefa\ua034bcdefg6"
                  ;

    // Test must locate the last character.
    int loc = std.utf.toUTF32(text).length-1;
    assert(getCharAt(text, loc) == '6');
    assert(nthCharacter(text, loc) == '6');

    PerformanceCounter counter = new PerformanceCounter();

    counter.start();
    volatile for(int i = 0; i < 10_000_000; ++i)
    {
        getCharAt(text, loc);
    }
    counter.stop();
    writefln("Derek Parnell: %10d", counter.microseconds());

    counter.start();
    volatile for(int i = 0; i < 10_000_000; ++i)
    {
        nthCharacter(text, loc);
    }
    counter.stop();
    writefln("  Daniel Keep: %10d", counter.microseconds());
}
//-----------------------------

On my machine (Intel Core 2 6600 2.40GHz, 2GB RAM) I got this result ...

c:\temp>test
Derek Parnell:    7939664
  Daniel Keep:   26683373
Apr 25 2007
parent reply Frits van Bommel <fvbommel REMwOVExCAPSs.nl> writes:
Derek Parnell wrote:

 I modified your routine slightly because the idiom "if (x++ == n)" is a dangerous one as it is unclear if 'x' gets incremented before or after the comparison. I changed it to be more clear.

How is it unclear? Postfix-increment clearly means that the value before incrementation is returned (and thus compared to n in that expression).
 I also changed my routine to output a dchar rather than a char[] and to
 test for invalid position input.
 
 //-----------------------------
 import std.perf;
 import std.stdio;
 import std.utf;
 
 
  dchar getCharAt(T)(T pText, int pPos)
  {
        size_t lUTF_Index;
        uint   lStride;
 
        if (pPos < 0 || pPos >= pText.length)
         return dchar.init;
        // Firstly, find out where the character starts in the string.
        lUTF_Index = std.utf.toUTFindex(pText, pPos);
 

        // Then find out its width (in bytes)
        lStride = std.utf.stride(pText, lUTF_Index);
 
        // Return the character encoded in UTF format.
        return std.utf.toUTF32(
                 pText[lUTF_Index .. lUTF_Index + lStride])[0];

I think you can change these last two statements to just:
---
return pText.decode(lUTF_Index);
---
(that's std.utf.decode, just to be clear)
That changes the index variable passed, but that doesn't matter here.
 }

 //-----------------------------
 
 On my machine (Intel Core 2 6600   2.40GHz, 2GB RAM) I got this result ...
 
 c:\temp>test
 Derek Parnell:    7939664
   Daniel Keep:   26683373

With mine added: (and obviously on _my_ machine)
---
urxae urxae:~/tmp$ dmd -O -release -inline -run test.d
   Derek Parnell:   17693368
     Daniel Keep:   54037341
Frits van Bommel:   12045495
urxae urxae:~/tmp$ gdc -O3 -finline -frelease -o test test.d && ./test
   Derek Parnell:   19567337
     Daniel Keep:   26750383
Frits van Bommel:   14332419
---
(My machine & compilers: AMD Sempron 3200+, 1GB RAM, 64-bit Ubuntu 6.10, running DMD 1.013 and GDC 0.23/x86_64)

So my version is even faster (about 30%), at least on my machine. And IMHO it's also more readable. No need to know what "stride" is, for example.
Apr 25 2007
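The approach discussed in the two posts above (skip the preceding characters by stride, then decode only the requested one) might look like this on a current D2 compiler. This is a sketch under that assumption, not the code that was benchmarked in the thread: the function name is reused for illustration, positions are 0-based, and the bounds handling (returning `dchar.init` past the end) is a choice made here.

```d
import std.utf : decode, stride;

/// Returns the code point at 0-based position `pos`,
/// or dchar.init if the string has fewer code points than that.
dchar getCharAt(const(char)[] text, size_t pos)
{
    size_t index = 0;

    // Skip `pos` code points using only their encoded width;
    // none of them is actually decoded.
    foreach (_; 0 .. pos)
    {
        if (index >= text.length)
            return dchar.init;
        index += stride(text, index);
    }

    if (index >= text.length)
        return dchar.init;

    // Decode just the requested code point (decode advances `index`).
    return decode(text, index);
}

void main()
{
    import std.stdio : writeln;
    writeln(getCharAt("a\uA034bcdef", 3));
}
```

Checking `index` against `text.length` inside the loop avoids relying on the byte length as a bound for a character position.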
next sibling parent Daniel Keep <daniel.keep.lists gmail.com> writes:
Frits van Bommel wrote:

 So my version is even faster (about 30%), at least on my machine. And IMHO it's also more readable. No need to know what "stride" is, for example.

Yoikes! I'm rather amazed that the "simple" foreach method is that much slower. I'll add the faster version to the article as soon as I get the chance. Thanks, guys.

    -- Daniel
Apr 25 2007
prev sibling parent reply Derek Parnell <derek psych.ward> writes:
On Wed, 25 Apr 2007 15:52:45 +0200, Frits van Bommel wrote:

 How is it unclear? Postfix-increment clearly means that the value before incrementation is returned (and thus compared to n in that expression).

Yes, I know what it is supposed to do, but written as it is, it can either be mistakenly read as incrementing the variable before the comparison, or it requires that extra bit of thinking to 'see' the process flow. For that reason, I prefer to either have ++ written as its own statement or write it so the casual reader can explicitly see the process flow.

For example, in the original code by Daniel, I was unsure as to whether he was using a 0-based index or a 1-based index, as I had done in my example. The code he supplied assumed a 0-based index if the ++ worked as you describe, but it assumed a 1-based index if it worked the other way. As my example was 1-based, and I assumed that Daniel knew how to use ++ correctly, I figured he had thus changed my definition of the Position parameter. But the point is, because it was not absolutely clear what Daniel's *intention* was, I decided to code it so the intention was more clear.
 I think you can change these last two statements to just:

 So my version is even faster (about 30%), at least on my machine. And 
 IMHO it's also more readable. No need to know what "stride" is, for example.

Well, if we were really into a pissing contest, we'd both remove the calls to library routines and code it inline, in assembler etc ... but that was not the point. Daniel's code is another example of 'foreach' not producing the best machine code to solve the problem at hand.
Apr 25 2007
parent Frits van Bommel <fvbommel REMwOVExCAPSs.nl> writes:
Derek Parnell wrote:
 On Wed, 25 Apr 2007 15:52:45 +0200, Frits van Bommel wrote:
 
 I think you can change these last two statements to just:

 So my version is even faster (about 30%), at least on my machine. And 
 IMHO it's also more readable. No need to know what "stride" is, for example.

Well, if we were really into a pissing contest, we'd both remove the calls to library routines and code it inline, in assembler etc ... but that was

I was just mentioning that you seemed to be over-complicating the code, and as a side-benefit the simpler code was faster as well.
 not the point. Daniel's code is another example of 'foreach' not producing
 the best machine code to solve the problem at hand.

Well to be fair, I don't think that's purely the fault of 'foreach' implementation problems in this case. 'foreach' is doing genuinely more work in this case. Specifically, the foreach loop is decoding all characters up to the one it returns, while the getCharAt() variants only actually decode the character asked for, using no more than the stride of the preceding ones.

What the foreach version does is therefore more like the following:
-----
dchar nthCharacter2(T)(T string, int n)
{
    int curChar = 0;
    for(size_t index = 0 ; index < string.length ; string.decode(index))
    {
        if( curChar == n )
            return string.decode(index); // return _next_ char
        curChar++;
    }
    return dchar.init;
}
-----
Which is also on the slow side. (Though on DMD this version is still faster than the 'foreach' version :( )

The results with this added as well:
=====
urxae urxae:~/tmp$ dmd -O -release -inline -run test.d
   Derek Parnell:   14416041
Frits van Bommel:    9803830
     Daniel Keep:   37386228
      for-decode:   33767606
urxae urxae:~/tmp$ gdc -O3 -finline -frelease -o test test.d && ./test
   Derek Parnell:   17267995
Frits van Bommel:   11836242
     Daniel Keep:   21390295
      for-decode:   25339226
=====
("for-decode" is the code above)
Apr 25 2007
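To repeat the comparison on a current toolchain, something along these lines could be used. It is only a sketch: it assumes a D2 compiler and the `benchmark` helper from `std.datetime.stopwatch` rather than the `std.perf` counter used in the thread, and the names `byStride`, `byForeach`, `runStride`, and `runForeach` are made up for this example. No timings are claimed here; results will vary with compiler, flags, and machine.

```d
import std.datetime.stopwatch : benchmark;
import std.stdio : writefln;
import std.utf : decode, stride;

// Skips earlier code points by stride and decodes only the requested one.
dchar byStride(const(char)[] text, size_t pos)
{
    size_t index = 0;
    foreach (_; 0 .. pos)
    {
        if (index >= text.length)
            return dchar.init;
        index += stride(text, index);
    }
    return index < text.length ? decode(text, index) : dchar.init;
}

// Decodes every code point up to and including the requested one.
dchar byForeach(const(char)[] text, size_t pos)
{
    size_t cur = 0;
    foreach (dchar cp; text)
    {
        if (cur == pos)
            return cp;
        ++cur;
    }
    return dchar.init;
}

void main()
{
    string text = "a\uA034bcdefg";
    enum size_t last = 7; // 0-based position of the final 'g'
    assert(byStride(text, last) == 'g');
    assert(byForeach(text, last) == 'g');

    uint sink; // results are accumulated so the calls have a visible effect
    void runStride()  { sink += byStride(text, last); }
    void runForeach() { sink += byForeach(text, last); }

    auto times = benchmark!(runStride, runForeach)(1_000_000);
    writefln("stride + decode: %s", times[0]);
    writefln("foreach:         %s", times[1]);
    writefln("(checksum %s)", sink);
}
```

Both functions return the same character; they differ only in how much decoding happens on the way to it, which is the point made in the post above.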