digitalmars.D.learn - Best way to count character spaces.

"Taylor Hillegeist" <taylorh140 gmail.com> writes:
So I am aware that Unicode is not simple... I have been working 
on a boxes-like project (http://boxes.thomasjensen.com/).

It basically puts a pretty border around stdin characters, like 
so:
  ________________________
/\                       \
\_|Different all twisty a|
   |of in maze are you,   |
   |passages little.      |
   |   ___________________|_
    \_/_____________________/

But I find that I need to know more than the length of the 
string, because of encoding differences.

I had a thought at one point to do this:

MyString.splitLines.map!(a => a.toUTF32.length).reduce!max();

That should get me the length of the longest line.

But this has a problem too, because control characters might not 
take up any space (backspace?), leaving an unwanted nasty space :( 
or might take up a weird amount of space (\t). Perhaps the first 
isn't really something to worry about.

https://en.wikipedia.org/wiki/Unicode_control_characters

Or should I do something like:

// per line, count only the characters that isGraphical accepts,
// then take the largest count
MyString.splitLines
        .map!(line => line.count!(c => c.isGraphical))
        .reduce!max

Mostly I am just curious about the best practice in this situation.

Both of the above fail with the input:
"hello \n People \nP\u0008ofEARTH"
on my command prompt at least.
Jun 30 2015
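
As a rough sketch of what the two attempts above actually measure on the sample input, they can be written as small functions. The names maxCodePoints and maxGraphical are made up for this example, and maxElement stands in for reduce!max, which assumes a reasonably recent Phobos.

```d
import std.algorithm : count, map, maxElement;
import std.string : splitLines;
import std.uni : isGraphical;
import std.utf : toUTF32;

/// Longest line measured in code points (the first attempt).
/// Assumes the input contains at least one line.
size_t maxCodePoints(string text)
{
    return text.splitLines.map!(line => line.toUTF32.length).maxElement;
}

/// Longest line counting only graphical code points (the second attempt).
/// Assumes the input contains at least one line.
size_t maxGraphical(string text)
{
    return text.splitLines
               .map!(line => line.count!(c => c.isGraphical))
               .maxElement;
}

unittest
{
    auto input = "hello \n People \nP\u0008ofEARTH";
    // The third line has 9 code points: P, backspace, then "ofEARTH".
    assert(maxCodePoints(input) == 9);
    // Backspace is not graphical, so the third line drops to 8,
    // tying with " People " (spaces count as graphical).
    assert(maxGraphical(input) == 8);
}
```

Neither number describes the third line as displayed: a terminal that honours the backspace would presumably show only "ofEARTH", which is 7 columns, so padding computed from either count comes out wrong for that line.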
Rikki Cattermole <alphaglosined gmail.com> writes:
On 1/07/2015 6:33 a.m., Taylor Hillegeist wrote:
 [...]
Well, I would personally use isWhite [0]. I would also use filter and
count along with it. So something like this:

// for each line, count the characters that isWhite accepts
size_t[] lengths = MyString.splitLines
        .map!(line => line.filter!isWhite.count)
        .array;

Untested of course, but may give you ideas :)

[0] http://dlang.org/phobos/std_uni.html#.isWhite
Jun 30 2015
"H. S. Teoh via Digitalmars-d-learn" <digitalmars-d-learn puremagic.com> writes:
On Tue, Jun 30, 2015 at 06:33:32PM +0000, Taylor Hillegeist via
Digitalmars-d-learn wrote:
[...]

Use std.uni.byGrapheme. That's the only reliable way to count anything
remotely resembling the display length of the string, which is not to
be confused with the number of code points, which is also different
from the length of the string in bytes or the number of code units.

Note that even with byGrapheme, you may still need some
post-processing, because certain terminals may output Asian block
characters in double width, meaning that 1 grapheme takes up two
columns on the screen. But byGrapheme should get you started on the
right footing.


T

-- 
If the comments and the code disagree, it's likely that *both* are wrong. -- Christopher
Jun 30 2015
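
A minimal sketch of the byGrapheme suggestion, with a crude version of the double-width post-processing mentioned above. Everything except the Phobos calls is hypothetical: graphemeCount, isProbablyWide, displayWidth and maxDisplayWidth are names invented for this example, and the three code point ranges are only a small, deliberately incomplete sample of characters that terminals commonly render in two columns. A real implementation would need the full East Asian Width data.

```d
import std.algorithm : map, maxElement;
import std.range : walkLength;
import std.string : splitLines;
import std.uni : byGrapheme;

/// Number of graphemes (user-perceived characters) in one line.
size_t graphemeCount(string line)
{
    return line.byGrapheme.walkLength;
}

/// Hypothetical and incomplete: true for a few ranges that are
/// usually drawn two columns wide.
bool isProbablyWide(dchar c)
{
    return (c >= 0x4E00 && c <= 0x9FFF)   // CJK Unified Ideographs
        || (c >= 0xAC00 && c <= 0xD7A3)   // Hangul syllables
        || (c >= 0xFF01 && c <= 0xFF60);  // fullwidth forms
}

/// Estimated column count: one per grapheme, two if its first code
/// point looks wide. Control characters are still counted as one.
size_t displayWidth(string line)
{
    size_t width = 0;
    foreach (g; line.byGrapheme)
        width += isProbablyWide(g[0]) ? 2 : 1;
    return width;
}

/// Largest estimated width over all lines; 0 for empty input.
size_t maxDisplayWidth(string text)
{
    auto lines = text.splitLines;
    return lines.length ? lines.map!displayWidth.maxElement : 0;
}

unittest
{
    // "e" plus U+0301 COMBINING ACUTE ACCENT: 2 code points, 1 grapheme.
    assert(graphemeCount("e\u0301") == 1);
    // Two CJK ideographs: 2 graphemes, estimated 4 columns.
    assert(displayWidth("\u4E16\u754C") == 4);
    assert(maxDisplayWidth("abc\n\u4E16\u754C") == 4);
}
```

This still does nothing special for backspace or tab, so the control-character problem from the original post would need separate handling.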
Steven Schveighoffer <schveiguy yahoo.com> writes:
On 7/1/15 1:25 AM, H. S. Teoh via Digitalmars-d-learn wrote:
 [...]
BTW, this exercise would make an EXCELLENT blog post highlighting both
the power of D's Unicode support and the hairy issues of Unicode. I
like the ASCII er... Unicode art concept :)

-Steve
Jul 01 2015