digitalmars.D.learn - Checking, whether string contains only ascii.

berni <berni example.com> writes:
In my program, I read a postscript file. Normal postscript files 
should only be composed of ascii characters, but one never knows 
what users give us. Therefore I'd like to make sure that the 
string the program read is only made up of ascii characters. This 
simplifies the code thereafter, because I can then assume that 
codeunit==codepoint. Is there a simple way to do so?

Here is a sketch of my function:

void foo(string postscript)
{
    // throw Exception, if postscript is not all ascii
    // other stuff, assuming codeunit=codepoint
}
Feb 22
"H. S. Teoh via Digitalmars-d-learn" <digitalmars-d-learn puremagic.com> writes:
On Wed, Feb 22, 2017 at 07:26:15PM +0000, berni via Digitalmars-d-learn wrote:
 In my program, I read a postscript file. Normal postscript files
 should only be composed of ascii characters, but one never knows what
 users give us.  Therefore I'd like to make sure that the string the
 program read is only made up of ascii characters. This simplifies the
 code thereafter, because I then can assume, that codeunit==codepoint.
 Is there a simple way to do so?
[...]

Hmm... What about:

	import std.range.primitives;

	bool isAsciiOnly(R)(R input)
		if (isInputRange!R && is(ElementType!R : dchar))
	{
		import std.algorithm.iteration : fold;
		return input.fold!((a, b) => a && b < 0x80)(true);
	}

	unittest
	{
		assert(isAsciiOnly("abcdefg"));
		assert(!isAsciiOnly("abcбвг"));
	}

Basically, it iterates over the string / range of characters and checks
that every character is less than 0x80, since anything that's 0x80 or
greater cannot be ASCII.

T

-- 
INTEL = Only half of "intelligence".
Feb 22
"H. S. Teoh via Digitalmars-d-learn" <digitalmars-d-learn puremagic.com> writes:
On Wed, Feb 22, 2017 at 11:43:00AM -0800, H. S. Teoh via Digitalmars-d-learn
wrote:
[...]
 	import std.range.primitives;
 
 	bool isAsciiOnly(R)(R input)
 		if (isInputRange!R && is(ElementType!R : dchar))
 	{
 		import std.algorithm.iteration : fold;
 		return input.fold!((a, b) => a && b < 0x80)(true);
 	}
 
 	unittest
 	{
 		assert(isAsciiOnly("abcdefg"));
 		assert(!isAsciiOnly("abcбвг"));
 	}
[...]

Ah, missing the Exception part:

	void foo(string input)
	{
		if (!input.isAsciiOnly)
			throw new Exception("...");
	}

T

-- 
Why are you blatanly misspelling "blatant"? -- Branden Robinson
Feb 22
jklm <jklm jklmjklm.hu> writes:
On Wednesday, 22 February 2017 at 19:26:15 UTC, berni wrote:
 In my program, I read a postscript file. Normal postscript 
 files should only be composed of ascii characters, but one 
 never knows what users give us. Therefore I'd like to make sure 
 that the string the program read is only made up of ascii 
 characters. This simplifies the code thereafter, because I then 
 can assume, that codeunit==codepoint. Is there a simple way to 
 do so?

 Here a sketch of my function:

void foo(string postscript)
{
    // throw Exception, if postscript is not all ascii
    // other stuff, assuming codeunit=codepoint
}
void foo(string postscript)
{
    import std.ascii, std.algorithm.iteration;
    if (!args[0].filter!(a => !isASCII(a)).empty)
        throw new Exception("bla");
}
Feb 22
jklm <jklm jklmjklm.hu> writes:
On Wednesday, 22 February 2017 at 19:57:22 UTC, jklm wrote:
 On Wednesday, 22 February 2017 at 19:26:15 UTC, berni wrote:
 In my program, I read a postscript file. Normal postscript 
 files should only be composed of ascii characters, but one 
 never knows what users give us. Therefore I'd like to make 
 sure that the string the program read is only made up of ascii 
 characters. This simplifies the code thereafter, because I 
 then can assume, that codeunit==codepoint. Is there a simple 
 way to do so?

 Here a sketch of my function:

void foo(string postscript)
{
    // throw Exception, if postscript is not all ascii
    // other stuff, assuming codeunit=codepoint
}
void foo(string postscript)
{
    import std.ascii, std.algorithm.iteration;
    if (!postscript.filter!(a => !isASCII(a)).empty)
        throw new Exception("bla");
}
s/args[0]/postscript/
Feb 22
Adam D. Ruppe <destructionator gmail.com> writes:
On Wednesday, 22 February 2017 at 19:26:15 UTC, berni wrote:
 Therefore I'd like to make sure that the string the program read 
 is only made up of ascii characters.
Easiest:

foreach(char ch; postscript)
    if(ch > 127)
        throw new Exception("non-ascii detected");
Feb 22
aberba <karabutaworld gmail.com> writes:
On Wednesday, 22 February 2017 at 20:01:57 UTC, Adam D. Ruppe 
wrote:
 On Wednesday, 22 February 2017 at 19:26:15 UTC, berni wrote:
 Therefore I'd like to make sure that the string the program 
 read is only made up of ascii characters.

 Easiest:

 foreach(char ch; postscript)
     if(ch > 127)
         throw new Exception("non-ascii detected");
:)
Feb 22
ag0aep6g <anonymous example.com> writes:
On Wednesday, 22 February 2017 at 19:26:15 UTC, berni wrote:
 In my program, I read a postscript file. Normal postscript 
 files should only be composed of ascii characters, but one 
 never knows what users give us. Therefore I'd like to make sure 
 that the string the program read is only made up of ascii 
 characters. This simplifies the code thereafter, because I then 
 can assume, that codeunit==codepoint. Is there a simple way to 
 do so?

 Here a sketch of my function:

void foo(string postscript)
{
    // throw Exception, if postscript is not all ascii
    // other stuff, assuming codeunit=codepoint
}
Making full use of the standard library:

----
import std.algorithm: all;
import std.ascii: isASCII;
import std.exception: enforce;

enforce(postscript.all!isASCII);
----

That checks on the code point level (because strings are ranges of 
dchars). If you want to be clever, you can avoid decoding and check on 
the code unit level:

----
/* other imports as above */
import std.utf: byCodeUnit;

enforce(postscript.byCodeUnit.all!isASCII);
----

Or you can do it manually, avoiding all those imports:

----
foreach (char c; postscript)
    if (c > 0x7F)
        throw new Exception("not ASCII");
----
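
To make the code point vs. code unit distinction concrete, here is a small sketch that is not part of the original post. The helper names asciiByCodePoint and asciiByCodeUnit are made up for illustration, and the static asserts assume a Phobos build with auto-decoding of narrow strings, as discussed in this thread.

```d
import std.algorithm : all;
import std.ascii : isASCII;
import std.range.primitives : ElementType;
import std.utf : byCodeUnit;

// Hypothetical helper: iterates decoded code points (dchar).
bool asciiByCodePoint(string s)
{
    return s.all!isASCII;
}

// Hypothetical helper: iterates raw UTF-8 code units (char), no decoding.
bool asciiByCodeUnit(string s)
{
    return s.byCodeUnit.all!isASCII;
}

void main()
{
    // Assumption: auto-decoding is active, so a string used as a range
    // yields dchar elements ...
    static assert(is(ElementType!string == dchar));
    // ... while byCodeUnit yields elements convertible to char.
    static assert(is(ElementType!(typeof("".byCodeUnit())) : char));

    assert(asciiByCodePoint("%!PS-Adobe-3.0"));
    assert(asciiByCodeUnit("%!PS-Adobe-3.0"));

    // U+00E9 is encoded as two UTF-8 code units, both above 0x7F.
    assert(!asciiByCodePoint("caf\u00E9"));
    assert(!asciiByCodeUnit("caf\u00E9"));
}
```

For valid UTF-8 both helpers should give the same answer; the code unit version merely skips the decoding step.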
Feb 22
Ali Çehreli <acehreli yahoo.com> writes:
On 02/22/2017 12:02 PM, ag0aep6g wrote:
 On Wednesday, 22 February 2017 at 19:26:15 UTC, berni wrote:
 In my program, I read a postscript file. Normal postscript files
 should only be composed of ascii characters, but one never knows what
 users give us. Therefore I'd like to make sure that the string the
 program read is only made up of ascii characters. This simplifies the
 code thereafter, because I then can assume, that codeunit==codepoint.
 Is there a simple way to do so?

 Here a sketch of my function:

 void foo(string postscript)
 {
    // throw Exception, if postscript is not all ascii
    // other stuff, assuming codeunit=codepoint
 }
 Making full use of the standard library:

 ----
 import std.algorithm: all;
 import std.ascii: isASCII;
 import std.exception: enforce;

 enforce(postscript.all!isASCII);
 ----

 That checks on the code point level (because strings are ranges of
 dchars). If you want to be clever, you can avoid decoding and check on
 the code unit level:

 ----
 /* other imports as above */
 import std.utf: byCodeUnit;

 enforce(postscript.byCodeUnit.all!isASCII);
 ----

 Or you can do it manually, avoiding all those imports:

 ----
 foreach (char c; postscript)
     if (c > 0x7F)
         throw new Exception("not ASCII");
 ----
One more:

bool isAscii(string s) {
    import std.string : representation;
    import std.algorithm : canFind;
    return !s.representation.canFind!(c => c >= 0x80);
}

unittest {
    assert(isAscii("hello world"));
    assert(!isAscii("hellö wörld"));
}

Ali
Feb 22
kinke <noone nowhere.com> writes:
On Wednesday, 22 February 2017 at 20:07:34 UTC, Ali Çehreli wrote:
 One more:

 bool isAscii(string s) {
     import std.string : representation;
     import std.algorithm : canFind;
     return !s.representation.canFind!(c => c >= 0x80);
 }

 unittest {
     assert(isAscii("hello world"));
     assert(!isAscii("hellö wörld"));
 }

 Ali
One more again as I couldn't believe no one went for 'any' yet:

---
import std.algorithm;
return !s.any!"a > 127"; // code-point level
---
Feb 22
"H. S. Teoh via Digitalmars-d-learn" <digitalmars-d-learn puremagic.com> writes:
On Wed, Feb 22, 2017 at 09:16:24PM +0000, kinke via Digitalmars-d-learn wrote:
[...]
 One more again as I couldn't believe no one went for 'any' yet:
 
 ---
 import std.algorithm;
 return !s.any!"a > 127"; // code-point level
 ---
You win 1 intarwebs for the shortest solution posted so far. ;-)

Though, according to the OP, an exception is wanted, so it should be
more along the lines of:

	enforce(!s.any!"a > 127");

T

-- 
A bend in the road is not the end of the road unless you fail to make the turn. -- Brian White
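
For reference, a minimal sketch (not from the original post) of how the enforce/any one-liner might sit inside the OP's foo. The exception message is an assumption chosen for illustration; enforce throws a plain Exception when its condition is false.

```d
import std.algorithm : any;
import std.exception : assertThrown, enforce;

void foo(string postscript)
{
    // Throws Exception if any decoded character is outside 0..127.
    // The message text is illustrative, not from the thread.
    enforce(!postscript.any!"a > 127",
            "input contains non-ASCII characters");

    // other stuff, assuming codeunit == codepoint
}

void main()
{
    foo("%!PS-Adobe-3.0");                    // passes silently
    assertThrown!Exception(foo("caf\u00E9")); // U+00E9 is rejected
}
```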
Feb 22
berni <berni example.com> writes:
On Wednesday, 22 February 2017 at 21:23:45 UTC, H. S. Teoh wrote:
 	enforce(!s.any!"a > 127");
Phew, that's a lot of possibilities to choose from now... I thought of 
something like the foreach loop but wasn't sure whether that is correct 
for all UTF encodings. All in all, I think I'll take the any approach, 
because it feels a little bit more like looking at the string as a 
whole, and I like to use enforce. Thanks for all your answers!
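
On the question of whether the foreach loop holds up across UTF encodings, here is a small sketch, not from the thread, with countHighUnits as a made-up helper name. It relies on two facts: in UTF-8 every code unit of a multi-byte sequence is 0x80 or above, and a foreach with a char loop variable over a wstring or dstring transcodes to UTF-8 code units on the fly.

```d
// Hypothetical helper: counts UTF-8 code units above 0x7F.
// Works for string, wstring and dstring, because foreach with a
// `char` loop variable re-encodes the input as UTF-8.
size_t countHighUnits(S)(S s)
{
    size_t n;
    foreach (char c; s)
        if (c > 0x7F)
            ++n;
    return n;
}

void main()
{
    // U+00F6 is encoded in UTF-8 as the two code units 0xC3 0xB6.
    assert("a\u00F6".length == 3);
    assert(countHighUnits("a\u00F6") == 2);

    // Same text as UTF-16 and UTF-32: still two high UTF-8 code units.
    assert(countHighUnits("a\u00F6"w) == 2);
    assert(countHighUnits("a\u00F6"d) == 2);

    // Pure ASCII never produces a code unit above 0x7F.
    assert(countHighUnits("%!PS") == 0);
}
```

So a non-zero count means the text is not pure ASCII, whichever of the three string types it arrives in.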
Feb 23
HeiHon <heiko.honrath gmx.de> writes:
On Thursday, 23 February 2017 at 08:34:53 UTC, berni wrote:
 On Wednesday, 22 February 2017 at 21:23:45 UTC, H. S. Teoh 
 wrote:
 	enforce(!s.any!"a > 127");
 Phew, that's a lot of possibilities to choose from now... I thought 
 of something like the foreach loop but wasn't sure whether that is 
 correct for all UTF encodings. All in all, I think I'll take the any 
 approach, because it feels a little bit more like looking at the 
 string as a whole, and I like to use enforce. Thanks for all your 
 answers!

All the examples given here are very nice. But alas this will not work with postscript files as found in the wild.
 In my program, I read a postscript file. Normal postscript 
 files should only be composed of ascii characters, but one 
 never knows what users give us. Therefore I'd like to make sure 
 that the string the program read is only made up of ascii 
 characters.
Generally postscript files may contain binary data. Think of included images or font data. So in postscript files there should normally be no utf-8 encoded text, but binary data are quite usual. Think of postscript files as a sequence of ubytes.
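
A minimal sketch of the "sequence of ubytes" view, not from the original post. The helper name isAsciiBytes is made up, and the byte arrays stand in for data that would in practice come from something like std.file.read. Because the elements are ubyte, no UTF decoding takes place, so arbitrary binary content cannot trigger a decoding error.

```d
import std.algorithm : all;

// Hypothetical helper: true if every byte is in the 7-bit ASCII range.
bool isAsciiBytes(const(ubyte)[] data)
{
    return data.all!(b => b < 0x80);
}

void main()
{
    immutable(ubyte)[] header = [0x25, 0x21, 0x50, 0x53]; // "%!PS"
    immutable(ubyte)[] binary = [0x25, 0x21, 0xFF, 0xD8]; // contains high bytes

    assert(isAsciiBytes(header));
    assert(!isAsciiBytes(binary));

    // Only after validation are the bytes viewed as text.
    auto text = cast(string) header;
    assert(text == "%!PS");
}
```

Validating first and casting afterwards keeps the "codeunit == codepoint" assumption limited to data that has actually been checked.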
Feb 23
berni <berni example.com> writes:
On Thursday, 23 February 2017 at 17:44:05 UTC, HeiHon wrote:
 Generally postscript files may contain binary data.
 Think of included images or font data.
 So in postscript files there should normally be no utf-8 
 encoded text, but binary data are quite usual.
 Think of postscript files as a sequence of ubytes.
As far as I know, images and font data have to be Clean7Bit too (though 
they are not human-readable). But postscript files can contain preview 
images, which can be binary. I know about this. I just tried to keep my 
question simple, and actually I'm only testing the part of the 
postscript file where I know that binary data must not occur.
Feb 23