digitalmars.D.learn - Checking, whether string contains only ascii.

berni <berni example.com> writes:

In my program, I read a PostScript file. Normal PostScript files 
should consist only of ASCII characters, but one never knows 
what users will give us. Therefore I'd like to make sure that the 
string the program reads is made up only of ASCII characters. This 
simplifies the code that follows, because I can then assume that 
code unit == code point. Is there a simple way to do so?

Here is a sketch of my function:

void foo(string postscript)
{
    // throw Exception, if postscript is not all ascii
    // other stuff, assuming codeunit=codepoint
}

Feb 22 2017

"H. S. Teoh via Digitalmars-d-learn" <digitalmars-d-learn puremagic.com> writes:

On Wed, Feb 22, 2017 at 07:26:15PM +0000, berni via Digitalmars-d-learn wrote:
 In my program, I read a postscript file. Normal postscript files
 should only be composed of ascii characters, but one never knows what
 users give us.  Therefore I'd like to make sure that the string the
 program read is only made up of ascii characters. This simplifies the
 code thereafter, because I then can assume, that codeunit==codepoint.
 Is there a simple way to do so?

[...]

Hmm... What about:

	import std.range.primitives;

	bool isAsciiOnly(R)(R input)
		if (isInputRange!R && is(ElementType!R : dchar))
	{
		import std.algorithm.iteration : fold;
		return input.fold!((a, b) => a && b < 0x80)(true);
	}

	unittest
	{
		assert(isAsciiOnly("abcdefg"));
		assert(!isAsciiOnly("abcбвг"));
	}

Basically, it iterates over the string / range of characters and checks
that every character is less than 0x80, since anything that's 0x80 or
greater cannot be ASCII.


T

-- 
INTEL = Only half of "intelligence".

Feb 22 2017

"H. S. Teoh via Digitalmars-d-learn" <digitalmars-d-learn puremagic.com> writes:

On Wed, Feb 22, 2017 at 11:43:00AM -0800, H. S. Teoh via Digitalmars-d-learn
wrote:
[...]
 	import std.range.primitives;
 
 	bool isAsciiOnly(R)(R input)
 		if (isInputRange!R && is(ElementType!R : dchar))
 	{
 		import std.algorithm.iteration : fold;
 		return input.fold!((a, b) => a && b < 0x80)(true);
 	}
 
 	unittest
 	{
 		assert(isAsciiOnly("abcdefg"));
 		assert(!isAsciiOnly("abcбвг"));
 	}

[...]

Ah, missing the Exception part:

	void foo(string input)
	{
		if (!input.isAsciiOnly)
			throw new Exception("...");
	}


T

-- 
Why are you blatanly misspelling "blatant"? -- Branden Robinson

Feb 22 2017

jklm <jklm jklmjklm.hu> writes:

On Wednesday, 22 February 2017 at 19:26:15 UTC, berni wrote:
 In my program, I read a postscript file. Normal postscript 
 files should only be composed of ascii characters, but one 
 never knows what users give us. Therefore I'd like to make sure 
 that the string the program read is only made up of ascii 
 characters. This simplifies the code thereafter, because I then 
 can assume, that codeunit==codepoint. Is there a simple way to 
 do so?

 Here a sketch of my function:

void foo(string postscript)
{
    // throw Exception, if postscript is not all ascii
    // other stuff, assuming codeunit=codepoint
}



void foo(string postscript)
{
     import std.ascii, std.algorithm.iteration;
     if (!args[0].filter!(a => !isASCII(a)).empty)
         throw new Exception("bla");
}

Feb 22 2017

jklm <jklm jklmjklm.hu> writes:

On Wednesday, 22 February 2017 at 19:57:22 UTC, jklm wrote:
 On Wednesday, 22 February 2017 at 19:26:15 UTC, berni wrote:
 In my program, I read a postscript file. Normal postscript 
 files should only be composed of ascii characters, but one 
 never knows what users give us. Therefore I'd like to make 
 sure that the string the program read is only made up of ascii 
 characters. This simplifies the code thereafter, because I 
 then can assume, that codeunit==codepoint. Is there a simple 
 way to do so?

 Here a sketch of my function:

void foo(string postscript)
{
    // throw Exception, if postscript is not all ascii
    // other stuff, assuming codeunit=codepoint
}



 void foo(string postscript)
 {
     import std.ascii, std.algorithm.iteration;
     if (!postscript.filter!(a => !isASCII(a)).empty)
         throw new Exception("bla");
 }

s/args[0]/postscript/

Feb 22 2017

Adam D. Ruppe <destructionator gmail.com> writes:

On Wednesday, 22 February 2017 at 19:26:15 UTC, berni wrote:
 Therefore I'd like to make sure that the string the program read 
 is only made up of ascii characters.

Easiest:

foreach(char ch; postscript)
   if(ch > 127) throw new Exception("non-ascii detected");
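
A note on why the explicit `char` in that loop matters: with `foreach (char ch; s)` the string is walked one UTF-8 code unit at a time, while `foreach (dchar ch; s)` decodes to code points. Either way, any value above 127 means non-ASCII. A minimal sketch of the difference, where the helper names `countCodeUnits` and `countCodePoints` are made up for illustration:

```d
// Hypothetical helpers, named here only for illustration.
size_t countCodeUnits(string s)
{
    size_t n;
    foreach (char c; s) // no decoding: one iteration per UTF-8 code unit
        ++n;
    return n;
}

size_t countCodePoints(string s)
{
    size_t n;
    foreach (dchar c; s) // decoding: one iteration per code point
        ++n;
    return n;
}

unittest
{
    // U+00F6 (o with diaeresis) takes two code units in UTF-8.
    assert(countCodeUnits("\u00F6") == 2);
    assert(countCodePoints("\u00F6") == 1);
    // For pure ASCII both counts agree, which is what the OP wants to rely on.
    assert(countCodeUnits("abc") == countCodePoints("abc"));
}
```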

Feb 22 2017

aberba <karabutaworld gmail.com> writes:

On Wednesday, 22 February 2017 at 20:01:57 UTC, Adam D. Ruppe 
wrote:
 On Wednesday, 22 February 2017 at 19:26:15 UTC, berni wrote:
 Therefore I'd like to make sure that the string the program 
 read is only made up of ascii characters.

 Easiest:

 foreach(char ch; postscript)
   if(ch > 127) throw new Exception("non-ascii detected");

:)

Feb 22 2017

ag0aep6g <anonymous example.com> writes:

On Wednesday, 22 February 2017 at 19:26:15 UTC, berni wrote:
 In my program, I read a postscript file. Normal postscript 
 files should only be composed of ascii characters, but one 
 never knows what users give us. Therefore I'd like to make sure 
 that the string the program read is only made up of ascii 
 characters. This simplifies the code thereafter, because I then 
 can assume, that codeunit==codepoint. Is there a simple way to 
 do so?

 Here a sketch of my function:

void foo(string postscript)
{
    // throw Exception, if postscript is not all ascii
    // other stuff, assuming codeunit=codepoint
}


Making full use of the standard library:

----
import std.algorithm: all;
import std.ascii: isASCII;
import std.exception: enforce;

enforce(postscript.all!isASCII);
----

That checks on the code point level (because strings are ranges 
of dchars). If you want to be clever, you can avoid decoding and 
check on the code unit level:

----
/* other imports as above */
import std.utf: byCodeUnit;

enforce(postscript.byCodeUnit.all!isASCII);
----

Or you can do it manually, avoiding all those imports:

----
foreach (char c; postscript)
    if (c > 0x7F) throw new Exception("not ASCII");
----
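
For reference, here is how the code-unit variant might look once dropped into the `foo` from the original question. This is only a sketch: the error message text is an assumption, and `enforce` throws a plain `Exception` by default.

```d
import std.algorithm.searching : all;
import std.ascii : isASCII;
import std.exception : enforce;
import std.utf : byCodeUnit;

void foo(string postscript)
{
    // Throws Exception with the given message if any code unit is >= 0x80.
    // The message text is made up for this sketch.
    enforce(postscript.byCodeUnit.all!isASCII, "postscript is not all ASCII");

    // other stuff, assuming code unit == code point
}

unittest
{
    import std.exception : assertNotThrown, assertThrown;

    assertNotThrown(foo("%!PS-Adobe-3.0"));
    assertThrown!Exception(foo("hell\u00F6 w\u00F6rld"));
}
```

`all` stops at the first element that fails the predicate, so a bad file is rejected without scanning the rest of the string.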

Feb 22 2017

Ali Çehreli <acehreli yahoo.com> writes:

On 02/22/2017 12:02 PM, ag0aep6g wrote:
 On Wednesday, 22 February 2017 at 19:26:15 UTC, berni wrote:
 In my program, I read a postscript file. Normal postscript files
 should only be composed of ascii characters, but one never knows what
 users give us. Therefore I'd like to make sure that the string the
 program read is only made up of ascii characters. This simplifies the
 code thereafter, because I then can assume, that codeunit==codepoint.
 Is there a simple way to do so?

 Here a sketch of my function:

 void foo(string postscript)
 {
    // throw Exception, if postscript is not all ascii
    // other stuff, assuming codeunit=codepoint
 }


 Making full use of the standard library:

 ----
 import std.algorithm: all;
 import std.ascii: isASCII;
 import std.exception: enforce;

 enforce(postscript.all!isASCII);
 ----

 That checks on the code point level (because strings are ranges of
 dchars). If you want to be clever, you can avoid decoding and check on
 the code unit level:

 ----
 /* other imports as above */
 import std.utf: byCodeUnit;

 enforce(postscript.byCodeUnit.all!isASCII);
 ----

 Or you can do it manually, avoiding all those imports:

 ----
 foreach (char c; postscript)
     if (c > 0x7F) throw new Exception("not ASCII");
 ----

One more:

bool isAscii(string s) {
     import std.string : representation;
     import std.algorithm : canFind;
     return !s.representation.canFind!(c => c >= 0x80);
}

unittest {
     assert(isAscii("hello world"));
     assert(!isAscii("hellö wörld"));
}

Ali

Feb 22 2017

kinke <noone nowhere.com> writes:

On Wednesday, 22 February 2017 at 20:07:34 UTC, Ali Çehreli wrote:
 One more:

 bool isAscii(string s) {
     import std.string : representation;
     import std.algorithm : canFind;
     return !s.representation.canFind!(c => c >= 0x80);
 }

 unittest {
     assert(isAscii("hello world"));
     assert(!isAscii("hellö wörld"));
 }

 Ali

One more again as I couldn't believe no one went for 'any' yet:

---
import std.algorithm;
return !s.any!"a > 127"; // code-point level
---

Feb 22 2017

"H. S. Teoh via Digitalmars-d-learn" <digitalmars-d-learn puremagic.com> writes:

On Wed, Feb 22, 2017 at 09:16:24PM +0000, kinke via Digitalmars-d-learn wrote:
[...]
 One more again as I couldn't believe noone went for 'any' yet:
 
 ---
 import std.algorithm;
 return !s.any!"a > 127"; // code-point level
 ---

You win 1 intarwebs for the shortest solution posted so far. ;-)

Though, according to the OP, an exception is wanted, so it should be
more along the lines of:

	enforce(!s.any!"a > 127");


T

-- 
A bend in the road is not the end of the road unless you fail to make the turn.
-- Brian White

Feb 22 2017

berni <berni example.com> writes:

On Wednesday, 22 February 2017 at 21:23:45 UTC, H. S. Teoh wrote:
 	enforce(!s.any!"a > 127");

Phew, that's a lot of possibilities to choose from now... I had 
thought of something like the foreach loop but wasn't sure whether 
it is correct for all UTF encodings. All in all, I think I'll take 
the any approach, because it feels a little more like looking at 
the string as a whole, and I like to use enforce.
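
On the worry about whether the foreach loop is correct: for a D `string` (UTF-8) it is, because every code unit of a multi-byte sequence has its high bit set, so no byte of a non-ASCII character can look like ASCII. A small sketch to check this; the helper name `hasOnlyHighBytes` is invented for the example:

```d
import std.algorithm.searching : all;
import std.string : representation;

// Hypothetical helper: true if every UTF-8 code unit of s is >= 0x80.
bool hasOnlyHighBytes(string s)
{
    return s.representation.all!(b => b >= 0x80);
}

unittest
{
    // U+0431 (Cyrillic be) is encoded in UTF-8 as the bytes 0xD0 0xB1.
    assert("\u0431".representation.length == 2);
    assert(hasOnlyHighBytes("\u0431"));
    // An ASCII character is a single byte below 0x80.
    assert(!hasOnlyHighBytes("a"));
}
```

The same comparison against 127 should also hold for `wstring` and `dstring`, since in UTF-16 and UTF-32 a code unit below 0x80 always stands for the ASCII character itself.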

Thanks for all your answers!

Feb 23 2017

HeiHon <heiko.honrath gmx.de> writes:

On Thursday, 23 February 2017 at 08:34:53 UTC, berni wrote:
 On Wednesday, 22 February 2017 at 21:23:45 UTC, H. S. Teoh 
 wrote:
 	enforce(!s.any!"a > 127");

 Phew, that's a lot of possibilities to choose from now... I had 
 thought of something like the foreach loop but wasn't sure whether 
 it is correct for all UTF encodings. All in all, I think I'll take 
 the any approach, because it feels a little more like looking 
 at the string as a whole, and I like to use enforce.

 Thanks for all your answers!

All the examples given here are very nice.
But alas this will not work with postscript files as found in the 
wild.

 In my program, I read a postscript file. Normal postscript 
 files should only be composed of ascii characters, but one 
 never knows what users give us. Therefore I'd like to make sure 
 that the string the program read is only made up of ascii 
 characters.

Generally postscript files may contain binary data.
Think of included images or font data.
So in postscript files there should normally be no utf-8 encoded 
text, but binary data are quite usual.
Think of postscript files as a sequence of ubytes.
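
Treating the data as ubytes might look roughly like the following. It is a sketch only: `isSevenBitClean` is a made-up name, and which part of the buffer gets checked is up to the caller.

```d
import std.algorithm.searching : all;
import std.string : representation;

// Hypothetical helper: true if no byte in data has its high bit set.
bool isSevenBitClean(const(ubyte)[] data)
{
    return data.all!(b => b < 0x80);
}

unittest
{
    const(ubyte)[] header = "%!PS-Adobe-3.0\n".representation;
    const(ubyte)[] binary = [0x25, 0x21, 0xFF, 0xD8];

    assert(isSevenBitClean(header));
    assert(!isSevenBitClean(binary));
    // Slicing restricts the check to the part that must be 7-bit clean.
    assert(isSevenBitClean(binary[0 .. 2]));
}
```

A whole file can be obtained as bytes with `cast(const(ubyte)[]) std.file.read(path)`, since `std.file.read` returns `void[]`; no UTF validation is involved that way.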

Feb 23 2017

berni <berni example.com> writes:

On Thursday, 23 February 2017 at 17:44:05 UTC, HeiHon wrote:
 Generally postscript files may contain binary data.
 Think of included images or font data.
 So in postscript files there should normally be no utf-8 
 encoded text, but binary data are quite usual.
 Think of postscript files as a sequence of ubytes.

As far as I know, images and font data have to be in clean7bit 
too (they are not human readable though). But postscript files 
can contain preview images, which can be binary. I know about 
this. I just tried to keep my question simple -- and actually I'm 
only testing part of the postscript file, where I know, that 
binary data must not occur.

Feb 23 2017
