
digitalmars.D - suggestion: clean white space / end of line definition

reply Thomas Kuehne <thomas-dloop kuehne.cn> writes:

Current definition (http://www.digitalmars.com/d/lex.html):
 EndOfLine:
	\u000D
	\u000A
	\u000D \u000A
	EndOfFile

 WhiteSpace:
	Space
	Space WhiteSpace

 Space:
	\u0020
	\u0009
	\u000B
	\u000C

DMD's frontend however doesn't strictly conform to those definitions:

 doc.c:1395: only \u0020, \u0009 and \u000A are treated as spaces
 html.c:351: \u000D and \u000A are treated as spaces too
 html.c:683: \u00A0 is treated as a space only if it was encountered via an
 HTML entity
 inifile.c:264: \u000D and \u000A are treated as spaces too
 lexer.c:2360: \u000B and \u000C aren't treated as spaces
 lexer.c: treats \u2028 and \u2029 as line separators too

The oddest case is entity.c:577: "\&nbsp;" is treated as "\u0020" instead
of "\u00A0".

suggested definition:
 EndOfLine:
	Unicode(all non-tailorable Line Breaking Classes causing a line break)
	EndOfFile

 WhiteSpace:
	Space
	Space WhiteSpace

 Space:
	( Unicode(General_Category == Space_Separator)
		|| Unicode(Bidi_Class == Segment_Separator)
		|| Unicode(Bidi_Class == Whitespace)
	) && !EndOfLine

this expands to:
 EndOfLine:
	000A		// LINE FEED
	000B		// LINE TABULATION
	000C		// FORM FEED
	000D		// CARRIAGE RETURN
	000D 000A	// CARRIAGE RETURN followed by LINE FEED
	0085		// NEXT LINE
	2028		// LINE SEPARATOR
	2029		// PARAGRAPH SEPARATOR

 Space:
	Unicode(General_Category == Space_Separator) && !EndOfLine
		0020       // SPACE
		00A0       // NO-BREAK SPACE
		1680       // OGHAM SPACE MARK
		180E       // MONGOLIAN VOWEL SEPARATOR
		2000..200A // EN QUAD..HAIR SPACE
		202F       // NARROW NO-BREAK SPACE
		205F       // MEDIUM MATHEMATICAL SPACE
		3000       // IDEOGRAPHIC SPACE

	Unicode(Bidi_Class == Segment_Separator) && !EndOfLine
		0009	// CHARACTER TABULATION
		001F	// INFORMATION SEPARATOR ONE

	Unicode(Bidi_Class == Whitespace) && !EndOfLine
		<all part of the Space_Separator listing>
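As a minimal sketch (not part of the proposal; the function names
is_end_of_line and is_space are hypothetical), the expanded sets above
could be written as a classifier over decoded code points:

```c
#include <stdint.h>

/* Sketch: classify a decoded Unicode code point according to the
 * proposed EndOfLine / Space sets listed above. */
static int is_end_of_line(uint32_t c)
{
    return c == 0x000A || c == 0x000B || c == 0x000C || c == 0x000D
        || c == 0x0085 || c == 0x2028 || c == 0x2029;
}

static int is_space(uint32_t c)
{
    if (is_end_of_line(c))
        return 0;           /* the Space rule explicitly excludes EndOfLine */
    switch (c) {
    case 0x0020:            /* SPACE */
    case 0x00A0:            /* NO-BREAK SPACE */
    case 0x1680:            /* OGHAM SPACE MARK */
    case 0x180E:            /* MONGOLIAN VOWEL SEPARATOR */
    case 0x202F:            /* NARROW NO-BREAK SPACE */
    case 0x205F:            /* MEDIUM MATHEMATICAL SPACE */
    case 0x3000:            /* IDEOGRAPHIC SPACE */
    case 0x0009:            /* CHARACTER TABULATION */
    case 0x001F:            /* INFORMATION SEPARATOR ONE */
        return 1;
    default:
        return c >= 0x2000 && c <= 0x200A;  /* EN QUAD..HAIR SPACE */
    }
}
```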

Thomas
Oct 28 2006
parent reply Walter Bright <newshound digitalmars.com> writes:
Thomas Kuehne wrote:
 DMD's frontend however doesn't strictly conform to those definitions.
 
 doc.c:1395: only \u0020, \u0009 and \u000A are treated as spaces
 html.c:351: \u000D and \u000A are treated as spaces too
 html.c:683: \u00A0 is treated as a space only if it was encountered via an
 HTML entity
 inifile.c:264: \u000D and \u000A are treated as spaces too
 lexer.c:2360: \u000B and \u000C aren't treated as spaces
 lexer.c: treats \u2028 and \u2029 as line separators too
 
 The oddest case is entity.c:577:
 "\&nbsp;" is treated as "\u0020" instead of "\u00A0"

Thanks, I'll try to get those fixed.
 suggested definition:
 EndOfLine:
 	Unicode(all non-tailorable Line Breaking Classes causing a line break)
 	EndOfFile

 WhiteSpace:
 	Space
 	Space WhiteSpace

 Space:
 	( Unicode(General_Category == Space_Separator)
 		|| Unicode(Bidi_Class == Segment_Separator)
 		|| Unicode(Bidi_Class == Whitespace)
 	) && !EndOfLine

this expands to:
 EndOfLine:
 	000A		// LINE FEED
 	000B		// LINE TABULATION
 	000C		// FORM FEED
 	000D		// CARRIAGE RETURN
 	000D 000A	// CARRIAGE RETURN followed by LINE FEED
 	0085		// NEXT LINE
 	2028		// LINE SEPARATOR
 	2029		// PARAGRAPH SEPARATOR

 Space:
 	Unicode(General_Category == Space_Separator) && !EndOfLine
 		0020       // SPACE
 		00A0       // NO-BREAK SPACE
 		1680       // OGHAM SPACE MARK
 		180E       // MONGOLIAN VOWEL SEPARATOR
 		2000..200A // EN QUAD..HAIR SPACE
 		202F       // NARROW NO-BREAK SPACE
 		205F       // MEDIUM MATHEMATICAL SPACE
 		3000       // IDEOGRAPHIC SPACE

 	Unicode(Bidi_Class == Segment_Separator) && !EndOfLine
 		0009	// CHARACTER TABULATION
 		001F	// INFORMATION SEPARATOR ONE

 	Unicode(Bidi_Class == Whitespace) && !EndOfLine
 		<all part of the Space_Separator listing>


Is it really worth doing all that?
Oct 30 2006
parent reply Thomas Kuehne <thomas-dloop kuehne.cn> writes:

Walter Bright wrote on 2006-10-31:
 Thomas Kuehne wrote:

<snip>
 suggested definition:
 EndOfLine:
 	Unicode(all non-tailorable Line Breaking Classes causing a line break)
 	EndOfFile

 WhiteSpace:
 	Space
 	Space WhiteSpace

 Space:
 	( Unicode(General_Category == Space_Separator)
 		|| Unicode(Bidi_Class == Segment_Separator)
 		|| Unicode(Bidi_Class == Whitespace)
 	) && !EndOfLine

this expands to:
 EndOfLine:
 	000A		// LINE FEED
 	000B		// LINE TABULATION
 	000C		// FORM FEED
 	000D		// CARRIAGE RETURN
 	000D 000A	// CARRIAGE RETURN followed by LINE FEED
 	0085		// NEXT LINE
 	2028		// LINE SEPARATOR
 	2029		// PARAGRAPH SEPARATOR

 Space:
 	Unicode(General_Category == Space_Separator) && !EndOfLine
 		0020       // SPACE
 		00A0       // NO-BREAK SPACE
 		1680       // OGHAM SPACE MARK
 		180E       // MONGOLIAN VOWEL SEPARATOR
 		2000..200A // EN QUAD..HAIR SPACE
 		202F       // NARROW NO-BREAK SPACE
 		205F       // MEDIUM MATHEMATICAL SPACE
 		3000       // IDEOGRAPHIC SPACE

 	Unicode(Bidi_Class == Segment_Separator) && !EndOfLine
 		0009	// CHARACTER TABULATION
 		001F	// INFORMATION SEPARATOR ONE

 	Unicode(Bidi_Class == Whitespace) && !EndOfLine
 		<all part of the Space_Separator listing>


Is it really worth doing all that?

What is actually changing for EndOfLine?

 000A	new
 000B	formerly white space
 000C	formerly white space
 0085	new
 2028	implemented but undocumented
 2029	implemented but undocumented

\v and \f were probably defined as white space due to C's isspace. Please
note however that \r and \n are recognised by isspace too. Implementing
2028 and 2029 seems implicit due to the use of UTF encodings.

All the different line endings can be converted to '\n' for non-UTF-8 D
files in Module::parse. UTF-8 encoded HTML sources can use a similar
approach in html.c (GDC currently uses an isLineSeperator there). UTF-8
encoded D files would require support at lexer.c:
303,709,763,835,1113,1301,1375,1457,1520,1520,2258,2272,2386.

The alternative and more robust solution would be a 'new line cleanup' at
module.c:485 and a goto from module.c:523. This way, all the '\r', LS and
PS tests sprinkled around lexer.c and html.c could be removed. In my
opinion the EndOfLine change is well worth it.

The SPACE change was prompted by the broken 00A0 (NO-BREAK SPACE) kludges
in html.c and entity.c. The issue isn't that the idea was bad but that the
reasons weren't laid out properly. If 00A0 is to be considered a SPACE,
then why 00A0 and not character foo-bar? At least the 2000..200A range
will become the same problem 00A0 originally was. Using the Unicode
standard as reference would direct all further debates about whether a
character is a space to the Unicode consortium and leave D out of
potentially lengthy debates. Changes would be required somewhere around
lexer.c:490,1331,2218,2368,2375,2404.

Using a function like

 // returns NULL or end of white space
 char* isUniSpace(char*)

would also clean up white space parsing. lexer.c currently tests for '\t'
on 6 occasions, 7 times for ' ', and only 3 times each for '\f' and '\v'.

Thomas
Oct 31 2006
parent reply Walter Bright <newshound digitalmars.com> writes:
There is a problem though with replacing it all with a function - lexing 
speed. Lexing speed is critically dependent on being able to consume 
whitespace fast, hence all the inline code to do it. Running the source 
through two passes makes it half as fast.
Oct 31 2006
parent reply Thomas Kuehne <thomas-dloop kuehne.cn> writes:

Walter Bright wrote on 2006-11-01:
 There is a problem though with replacing it all with a function - lexing 
 speed. Lexing speed is critically dependent on being able to consume 
 whitespace fast, hence all the inline code to do it. Running the source 
 through two passes makes it half as fast.

Here is a faster mock-up (untested!) using functions. Use of macros is
certainly possible too.

Thomas

# unsigned char* isEndOfLine(unsigned char* input){
# 	switch(input[0]){
# 	/* covered by the lexer:
# 	 * case 0x0A: // LINE FEED
# 	 */
# 	case 0x0B: // LINE TABULATION
# 	case 0x0C: // FORM FEED
# 		return input;
# 	case 0x0D: // CARRIAGE RETURN
# 		if(input[1] == 0x0A){
# 			return input + 1;
# 		}
# 		return input;
# 	case 0xC2: // NEXT LINE
# 		if(input[1] == 0x85){
# 			return input + 1;
# 		}
# 		break;
# 	case 0xE2: // LINE SEPARATOR || PARAGRAPH SEPARATOR
# 		if((input[1] == 0x80) && ((input[2] == 0xA8) || (input[2] == 0xA9))){
# 			return input + 2;
# 		}
# 		break;
# 	default:
# 		break;
# 	}
#
# 	return 0;
# }
#
# unsigned char* isSpace(unsigned char* input){
# 	switch(input[0]){
# 	/* covered by the lexer:
# 	 * case 0x20: // SPACE
# 	 */
# 	case 0x09: // CHARACTER TABULATION
# 	case 0x1F: // INFORMATION SEPARATOR ONE
# 		return input;
# 	case 0xC2:
# 		if(input[1] == 0xA0){
# 			// NO-BREAK SPACE
# 			return input + 1;
# 		}
# 		break;
# 	case 0xE1:
# 		switch(input[1]){
# 		case 0x9A: // note: UTF-8 for U+1680 is E1 9A 80
# 			if(input[2] == 0x80){
# 				// OGHAM SPACE MARK
# 				return input + 2;
# 			}
# 			break;
# 		case 0xA0:
# 			if(input[2] == 0x8E){
# 				// MONGOLIAN VOWEL SEPARATOR
# 				return input + 2;
# 			}
# 			break;
# 		default:
# 			break;
# 		}
# 		break;
# 	case 0xE2:
# 		switch(input[1]){
# 		case 0x80:
# 			if((0x80 <= input[2]) && (input[2] <= 0x8A)){
# 				// EN QUAD..HAIR SPACE
# 				return input + 2;
# 			}else if(input[2] == 0xAF){
# 				// NARROW NO-BREAK SPACE
# 				return input + 2;
# 			}
# 			break;
# 		case 0x81:
# 			if(input[2] == 0x9F){
# 				// MEDIUM MATHEMATICAL SPACE
# 				return input + 2;
# 			}
# 			break;
# 		default:
# 			break;
# 		}
# 		break;
# 	case 0xE3:
# 		if((input[1] == 0x80) && (input[2] == 0x80)){
# 			// IDEOGRAPHIC SPACE
# 			return input + 2;
# 		}
# 		break;
# 	default:
# 		break;
# 	}
# 	return 0;
# }
#
# void lexer(){
# 	unsigned char* p;
# 	unsigned char* tmp;
# 	while (1)
# 	{
# 		switch (*p)
# 		{
# 		Lspace:
# 		case ' ':
# 			p++;
# 			continue; // skip white space
#
# 		Lnew_line:
# 		case '\n':
# 			p++;
# 			//loc.linnum++;
# 			continue; // skip white space
#
# 		/* a lot more code goes here */
#
# 		default:
# 			if((tmp = isEndOfLine(p))){
# 				p = tmp;
# 				goto Lnew_line;
# 			}
# 			if((tmp = isSpace(p))){
# 				p = tmp;
# 				goto Lspace;
# 			}
#
# 			/* a lot more code goes here */
# 		}
# 	}
# }
Nov 01 2006
parent reply Georg Wrede <georg.wrede nospam.org> writes:
Thomas Kuehne wrote:
 Here is a faster mock-up

(Apologies in advance, and totally ignoring the good code, standards
compliance and some other good things,) I have to ask: Is this a Good
Thing?

Admittedly not having thought through this issue myself, all I have is a
gut feeling. But that gut feeling says that source code (especially in a
systems language in the C family) should strive to hinder all kinds of
Funny Stuff from entering the toolchain.

Accepting "foreign" characters within strings (and possibly even in
comments) is OK, but in the source code itself, that's my issue here. We
can already have variable names in D written in Afghan and Negro-Potamian,
which I definitely don't consider a good idea.

If we were to follow this line of thought, then the next thing we know
somebody might demand to have all D keywords translated to every single
language in the bushes, and to have the compiler accept them as Equal
synonyms to the Original Keywords. (This actually happened with the CP/M
operating system in Finland in the early eighties! You don't want to hear
the whole story.)

What will this do to cross-cultural study, reuse, and copying of example
code? Won't it eventually compartmentalize most all code written outside
of the Anglo-Centric world? That is, alienate it from us, but also from
each of the other cultures too.

And who says parentheses and operators should only be the ones you need a
Western keyboard to type? I bet there are cultures that use (or will
insist on using, once the rumour is it's possible) some preposterous ink
blots instead, for example.

And the next thing of course would be the idiot Humanists who'd demand
that a non-breaking space really has to be equal to the underscore "for
people think in words, and subjecting humans to CamelCase or under_scored
names constitutes deplorable Oppression". And this kind of people refuse
to see the [to us] obvious horrible ramifications of it.

And this I wrote in spite of my mother tongue needing non-ASCII
characters in every single sentence.

But, as I said at the outset, this is just a gut feeling, so I'm not
pressing the issue as if it were something I'd analyzed
through-and-through.

---

Now, what is obvious, however, is that the current compiler *should* be
consistent with whitespace and the like, instead of haphazardly
enumerating some of them each time. No argument there.
Nov 02 2006
parent Thomas Kuehne <thomas-dloop kuehne.cn> writes:

Georg Wrede wrote on 2006-11-02:
 Thomas Kuehne wrote:
 Here is a faster mock-up

 (Apologies in advance, and totally ignoring the good code, standards
 compliance and some other good things,) I have to ask: Is this a Good
 Thing?

 Admittedly not having thought through this issue myself, all I have is a
 gut feeling. But that gut feeling says that source code (especially in a
 systems language in the C family) should strive to hinder all kinds of
 Funny Stuff from entering the toolchain.

 Accepting "foreign" characters within strings (and possibly even in
 comments) is OK, but in the source code itself, that's my issue here. We
 can already have variable names in D written in Afghan and
 Negro-Potamian, which I definitely don't consider a good idea.

 If we were to follow this line of thought, then the next thing we know
 somebody might demand to have all D keywords translated to every single
 language in the bushes, and to have the compiler accept them as Equal
 synonyms to the Original Keywords. (This actually happened with the CP/M
 operating system in Finland in the early eighties! You don't want to
 hear the whole story.)

Keywords are a few "magic" words; teaching those doesn't require any
knowledge of the natural language they were taken from. I definitely
agree with your view on keywords.

The rest however ... sounds like a typical culture-centric view. Forcing
everyone - especially beginners and non-IT people - to use English isn't
a viable solution. No, transliteration doesn't cut it:

 mama ma ma ma

hint: This is a variant of a Chinese language joke and involves 4
different characters.

In addition there are quite a few words and concepts that have no English
equivalent. For simplicity's sake let's use an ASCII-representable German
word: "Heimat"

 * home - too narrow
 * native country - quite often wrong
 What will this do to cross-cultural study, reuse, and copying of example
 code? Won't it eventually compatmentalize most all code written outside
 of the Anglo-Centric world? That is, alienate it from us, but also from
 each of the other cultures too.

That's what coding standards are for. The same reuse issue exists for
C/C++ with the preprocessor, and that seems to work reasonably well.

Thomas
Nov 03 2006