
digitalmars.D - suggestion: clean white space / end of line definition

reply Thomas Kuehne <thomas-dloop kuehne.cn> writes:

Current definition (http://www.digitalmars.com/d/lex.html):
 EndOfLine:
	\u000D
	\u000A
	\u000D \u000A
	EndOfFile

 WhiteSpace:
	Space
	Space WhiteSpace

 Space:
	\u0020
	\u0009
	\u000B
	\u000C

DMD's frontend however doesn't strictly conform to those definitions:

 doc.c:1395: only \u0020, \u0009 and \u000A are treated as spaces
 html.c:351: \u000D and \u000A are treated as spaces too
 html.c:683: \u00A0 is treated as a space only if it was encountered via an
 HTML entity
 inifile.c:264: \u000D and \u000A are treated as spaces too
 lexer.c:2360: \u000B and \u000C aren't treated as spaces
 lexer.c: treats \u2028 and \u2029 as line separators too

The oddest case is entity.c:577: "\&nbsp;" is treated as "\u0020" instead
of "\u00A0".

suggested definition:
 EndOfLine:
	Unicode(all non-tailorable Line Breaking Classes causing a line break)
	EndOfFile

 WhiteSpace:
	Space
	Space WhiteSpace

 Space:
	( Unicode(General_Category == Space_Separator)
		|| Unicode(Bidi_Class == Segment_Separator)
		|| Unicode(Bidi_Class == Whitespace)
	) && !EndOfLine

this expands to:
 EndOfLine:
	000A		// LINE FEED
	000B		// LINE TABULATION
	000C		// FORM FEED
	000D		// CARRIAGE RETURN
	000D 000A	// CARRIAGE RETURN followed by LINE FEED
	0085		// NEXT LINE
	2028		// LINE SEPARATOR
	2029		// PARAGRAPH SEPARATOR

 Space:
	Unicode(General_Category == Space_Separator) && !EndOfLine
		0020       // SPACE
		00A0       // NO-BREAK SPACE
		1680       // OGHAM SPACE MARK
		180E       // MONGOLIAN VOWEL SEPARATOR
		2000..200A // EN QUAD..HAIR SPACE
		202F       // NARROW NO-BREAK SPACE
		205F       // MEDIUM MATHEMATICAL SPACE
		3000       // IDEOGRAPHIC SPACE

	Unicode(Bidi_Class == Segment_Separator) && !EndOfLine
		0009	// CHARACTER TABULATION
		001F	// INFORMATION SEPARATOR ONE

	Unicode(Bidi_Class == Whitespace) && !EndOfLine
		<all part of the Space_Separator listing>
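As a minimal sketch (not part of the proposal; the function names
is_end_of_line and is_space are hypothetical), the expanded sets above
could be written as a classifier over decoded code points:

```c
#include <stdint.h>

/* Sketch: classify a decoded Unicode code point according to the
 * proposed EndOfLine / Space sets listed above. */
static int is_end_of_line(uint32_t c)
{
    return c == 0x000A || c == 0x000B || c == 0x000C || c == 0x000D
        || c == 0x0085 || c == 0x2028 || c == 0x2029;
}

static int is_space(uint32_t c)
{
    if (is_end_of_line(c))
        return 0;           /* the Space rule explicitly excludes EndOfLine */
    switch (c) {
    case 0x0020:            /* SPACE */
    case 0x00A0:            /* NO-BREAK SPACE */
    case 0x1680:            /* OGHAM SPACE MARK */
    case 0x180E:            /* MONGOLIAN VOWEL SEPARATOR */
    case 0x202F:            /* NARROW NO-BREAK SPACE */
    case 0x205F:            /* MEDIUM MATHEMATICAL SPACE */
    case 0x3000:            /* IDEOGRAPHIC SPACE */
    case 0x0009:            /* CHARACTER TABULATION */
    case 0x001F:            /* INFORMATION SEPARATOR ONE */
        return 1;
    default:
        return c >= 0x2000 && c <= 0x200A;  /* EN QUAD..HAIR SPACE */
    }
}
```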

Thomas
Oct 28 2006
parent reply Walter Bright <newshound digitalmars.com> writes:
Thomas Kuehne wrote:
 DMD's frontend however doesn't strictly conform to those definitions.
 
 doc.c:1395: only \u0020, \u0009 and \u000A are treated as spaces
 html.c:351: \u000D and \u000A are treated as spaces too
 html.c:683: \u00A0 is treated as a space only if it was encountered via an
 HTML entity
 inifile.c:264: \u000D and \u000A are treated as spaces too
 lexer.c:2360: \u000B and \u000C aren't treated as spaces
 lexer.c: treats \u2028 and \u2029 as line separators too
 
 The oddest case is entity.c:577:
 "\&nbsp;" is treated as "\u0020" instead of "\u00A0"

Thanks, I'll try to get those fixed.
 suggested definition:
 EndOfLine:
 	Unicode(all non-tailorable Line Breaking Classes causing a line break)
 	EndOfFile

 WhiteSpace:
 	Space
 	Space WhiteSpace

 Space:
 	( Unicode(General_Category == Space_Separator)
 		|| Unicode(Bidi_Class == Segment_Separator)
 		|| Unicode(Bidi_Class == Whitespace)
 	) && !EndOfLine

this expands to:
 EndOfLine:
 	000A		// LINE FEED
 	000B		// LINE TABULATION
 	000C		// FORM FEED
 	000D		// CARRIAGE RETURN
 	000D 000A	// CARRIAGE RETURN followed by LINE FEED
 	0085		// NEXT LINE
 	2028		// LINE SEPARATOR
 	2029		// PARAGRAPH SEPARATOR

 Space:
 	Unicode(General_Category == Space_Separator) && !EndOfLine
 		0020       // SPACE
 		00A0       // NO-BREAK SPACE
 		1680       // OGHAM SPACE MARK
 		180E       // MONGOLIAN VOWEL SEPARATOR
 		2000..200A // EN QUAD..HAIR SPACE
 		202F       // NARROW NO-BREAK SPACE
 		205F       // MEDIUM MATHEMATICAL SPACE
 		3000       // IDEOGRAPHIC SPACE

 	Unicode(Bidi_Class == Segment_Separator) && !EndOfLine
 		0009	// CHARACTER TABULATION
 		001F	// INFORMATION SEPARATOR ONE

 	Unicode(Bidi_Class == Whitespace) && !EndOfLine
 		<all part of the Space_Separator listing>


Is it really worth doing all that?
Oct 30 2006
parent reply Thomas Kuehne <thomas-dloop kuehne.cn> writes:

Walter Bright wrote on 2006-10-31:
 Thomas Kuehne wrote:

<snip>
 suggested definition:
 EndOfLine:
 	Unicode(all non-tailorable Line Breaking Classes causing a line break)
 	EndOfFile

 WhiteSpace:
 	Space
 	Space WhiteSpace

 Space:
 	( Unicode(General_Category == Space_Separator)
 		|| Unicode(Bidi_Class == Segment_Separator)
 		|| Unicode(Bidi_Class == Whitespace)
 	) && !EndOfLine

this expands to:
 EndOfLine:
 	000A		// LINE FEED
 	000B		// LINE TABULATION
 	000C		// FORM FEED
 	000D		// CARRIAGE RETURN
 	000D 000A	// CARRIAGE RETURN followed by LINE FEED
 	0085		// NEXT LINE
 	2028		// LINE SEPARATOR
 	2029		// PARAGRAPH SEPARATOR

 Space:
 	Unicode(General_Category == Space_Separator) && !EndOfLine
 		0020       // SPACE
 		00A0       // NO-BREAK SPACE
 		1680       // OGHAM SPACE MARK
 		180E       // MONGOLIAN VOWEL SEPARATOR
 		2000..200A // EN QUAD..HAIR SPACE
 		202F       // NARROW NO-BREAK SPACE
 		205F       // MEDIUM MATHEMATICAL SPACE
 		3000       // IDEOGRAPHIC SPACE

 	Unicode(Bidi_Class == Segment_Separator) && !EndOfLine
 		0009	// CHARACTER TABULATION
 		001F	// INFORMATION SEPARATOR ONE

 	Unicode(Bidi_Class == Whitespace) && !EndOfLine
 		<all part of the Space_Separator listing>


Is it really worth doing all that?

What is actually changing for EndOfLine?

 000A	new
 000B	formerly white space
 000C	formerly white space
 0085	new
 2028	implemented but undocumented
 2029	implemented but undocumented

\v and \f were probably defined as white space due to C's isspace. Please
note however that \r and \n are recognised by isspace too. Implementing
2028 and 2029 seems implicit due to the use of UTF encodings.

All the different line endings can be converted to '\n' for non-UTF-8 D
files in Module::parse. UTF-8 encoded HTML sources can use a similar
approach in html.c (GDC currently uses an isLineSeperator there). UTF-8
encoded D files would require support at lexer.c:
303,709,763,835,1113,1301,1375,1457,1520,1520,2258,2272,2386.

The alternative and more robust solution would be a 'new line cleanup' at
module.c:485 and a goto from module.c:523. This way, all the '\r', LS and
PS tests sprinkled around lexer.c and html.c could be removed. In my
opinion the EndOfLine change is well worth it.

The SPACE change was prompted by the broken 00A0 (NO-BREAK SPACE) kludges
in html.c and entity.c. The issue isn't that the idea was bad but that the
reasons weren't laid out properly. If 00A0 is to be considered a SPACE,
then why 00A0 and not character foo-bar? At least the 2000..200A range
will become the same problem 00A0 originally was. Using the Unicode
standard as reference would direct all further debates about whether a
character is a space to the Unicode consortium and leave D out of
potentially lengthy debates. Changes would be required somewhere around
lexer.c:490,1331,2218,2368,2375,2404.

Using a function like

 // returns NULL or end of white space
 char* isUniSpace(char*)

would also clean up white space parsing. lexer.c currently tests for '\t'
on 6 occasions, 7 times for ' ', and only 3 times each for '\f' and '\v'.

Thomas
Oct 31 2006
parent reply Walter Bright <newshound digitalmars.com> writes:
There is a problem though with replacing it all with a function - lexing 
speed. Lexing speed is critically dependent on being able to consume 
whitespace fast, hence all the inline code to do it. Running the source 
through two passes makes it half as fast.
Oct 31 2006
parent reply Thomas Kuehne <thomas-dloop kuehne.cn> writes:

Walter Bright wrote on 2006-11-01:
 There is a problem though with replacing it all with a function - lexing 
 speed. Lexing speed is critically dependent on being able to consume 
 whitespace fast, hence all the inline code to do it. Running the source 
 through two passes makes it half as fast.

Here is a faster mock-up (untested!) using functions. Use of macros is
certainly possible too.

Thomas

# unsigned char* isEndOfLine(unsigned char* input){
# 	switch(input[0]){
# 	/* covered by the lexer:
# 	 * case 0x0A: // LINE FEED
# 	 */
# 	case 0x0B: // LINE TABULATION
# 	case 0x0C: // FORM FEED
# 		return input;
# 	case 0x0D: // CARRIAGE RETURN
# 		if(input[1] == 0x0A){
# 			return input + 1;
# 		}
# 		return input;
# 	case 0xC2: // NEXT LINE
# 		if(input[1] == 0x85){
# 			return input + 1;
# 		}
# 		break;
# 	case 0xE2: // LINE SEPARATOR || PARAGRAPH SEPARATOR
# 		if((input[1] == 0x80) && ((input[2] == 0xA8) || (input[2] == 0xA9))){
# 			return input + 2;
# 		}
# 		break;
# 	default:
# 		break;
# 	}
#
# 	return 0;
# }
#
# unsigned char* isSpace(unsigned char* input){
# 	switch(input[0]){
# 	/* covered by the lexer:
# 	 * case 0x20: // SPACE
# 	 */
# 	case 0x09: // CHARACTER TABULATION
# 	case 0x1F: // INFORMATION SEPARATOR ONE
# 		return input;
# 	case 0xC2:
# 		if(input[1] == 0xA0){
# 			// NO-BREAK SPACE
# 			return input + 1;
# 		}
# 		break;
# 	case 0xE1:
# 		switch(input[1]){
# 		case 0x9A: // note: UTF-8 for U+1680 is E1 9A 80
# 			if(input[2] == 0x80){
# 				// OGHAM SPACE MARK
# 				return input + 2;
# 			}
# 			break;
# 		case 0xA0:
# 			if(input[2] == 0x8E){
# 				// MONGOLIAN VOWEL SEPARATOR
# 				return input + 2;
# 			}
# 			break;
# 		default:
# 			break;
# 		}
# 		break;
# 	case 0xE2:
# 		switch(input[1]){
# 		case 0x80:
# 			if((0x80 <= input[2]) && (input[2] <= 0x8A)){
# 				// EN QUAD..HAIR SPACE
# 				return input + 2;
# 			}else if(input[2] == 0xAF){
# 				// NARROW NO-BREAK SPACE
# 				return input + 2;
# 			}
# 			break;
# 		case 0x81:
# 			if(input[2] == 0x9F){
# 				// MEDIUM MATHEMATICAL SPACE
# 				return input + 2;
# 			}
# 			break;
# 		default:
# 			break;
# 		}
# 		break;
# 	case 0xE3:
# 		if((input[1] == 0x80) && (input[2] == 0x80)){
# 			// IDEOGRAPHIC SPACE
# 			return input + 2;
# 		}
# 		break;
# 	default:
# 		break;
# 	}
# 	return 0;
# }
#
# void lexer(){
# 	unsigned char* p;
# 	unsigned char* tmp;
# 	while (1)
# 	{
# 		switch (*p)
# 		{
# 		Lspace:
# 		case ' ':
# 			p++;
# 			continue; // skip white space
#
# 		Lnew_line:
# 		case '\n':
# 			p++;
# 			//loc.linnum++;
# 			continue; // skip white space
#
# 		/* a lot more code goes here */
#
# 		default:
# 			if((tmp = isEndOfLine(p))){
# 				p = tmp;
# 				goto Lnew_line;
# 			}
# 			if((tmp = isSpace(p))){
# 				p = tmp;
# 				goto Lspace;
# 			}
#
# 			/* a lot more code goes here */
# 		}
# 	}
# }
Nov 01 2006
parent reply Georg Wrede <georg.wrede nospam.org> writes:
Thomas Kuehne wrote:
 Here is a faster mock-up

(Apologies in advance, and totally ignoring the good code, standards
compliance and some other good things,) I have to ask: Is this a Good
Thing?

Admittedly not having thought through this issue myself, all I have is a
gut feeling. But that gut feeling says that source code (especially in a
systems language in the C family) should strive to hinder all kinds of
Funny Stuff from entering the toolchain.

Accepting "foreign" characters within strings (and possibly even in
comments) is OK, but in the source code itself, that's my issue here. We
can already have variable names in D written in Afghan and Negro-Potamian,
which I definitely don't consider a good idea.

If we were to follow this line of thought, then the next thing we know
somebody might demand to have all D keywords translated to every single
language in the bushes, and to have the compiler accept them as Equal
synonyms to the Original Keywords. (This actually happened with the CP/M
operating system in Finland in the early eighties! You don't want to hear
the whole story.)

What will this do to cross-cultural study, reuse, and copying of example
code? Won't it eventually compartmentalize most all code written outside
of the Anglo-Centric world? That is, alienate it from us, but also from
each of the other cultures too.

And who says parentheses and operators should only be the ones you need a
Western keyboard to type? I bet there are cultures that use (or will
insist on using, once the rumour is it's possible) some preposterous ink
blots instead, for example.

And the next thing of course would be the idiot Humanists who'd demand
that a non-breaking space really has to be equal to the underscore "for
people think in words, and subjecting humans to CamelCase or under_scored
names constitutes deplorable Oppression". And this kind of people refuse
to see the [to us] obvious horrible ramifications of it.

And this I wrote in spite of my mother tongue needing non-ASCII
characters in every single sentence.

But, as I said at the outset, this is just a gut feeling, so I'm not
pressing the issue as if it were something I'd analyzed
through-and-through.

---

Now, what is obvious, however, is that the current compiler *should* be
consistent with whitespace and the like, instead of haphazardly
enumerating some of them each time. No argument there.
Nov 02 2006
parent Thomas Kuehne <thomas-dloop kuehne.cn> writes:

Georg Wrede wrote on 2006-11-02:
 Thomas Kuehne wrote:
 Here is a faster mock-up

 (Apologies in advance, and totally ignoring the good code, standards
 compliance and some other good things,) I have to ask: Is this a Good
 Thing?

 Admittedly not having thought through this issue myself, all I have is a
 gut feeling. But that gut feeling says that source code (especially in a
 systems language in the C family) should strive to hinder all kinds of
 Funny Stuff from entering the toolchain.

 Accepting "foreign" characters within strings (and possibly even in
 comments) is OK, but in the source code itself, that's my issue here. We
 can already have variable names in D written in Afghan and
 Negro-Potamian, which I definitely don't consider a good idea.

 If we were to follow this line of thought, then the next thing we know
 somebody might demand to have all D keywords translated to every single
 language in the bushes, and to have the compiler accept them as Equal
 synonyms to the Original Keywords. (This actually happened with the CP/M
 operating system in Finland in the early eighties! You don't want to
 hear the whole story.)

Keywords are a few "magic" words; teaching those doesn't require any
knowledge of the natural language they were taken from. I definitely
agree with your view on keywords.

The rest however ... sounds like a typical culture-centric view. Forcing
everyone - especially beginners and non-IT people - to use English isn't
a viable solution. No, transliteration doesn't cut it:

 mama ma ma ma

hint: This is a variant of a Chinese language joke and involves 4
different characters.

In addition there are quite a few words and concepts that have no English
equivalent. For simplicity's sake let's use an ASCII-representable German
word: "Heimat"

 * home - too narrow
 * native country - quite often wrong
 What will this do to cross-cultural study, reuse, and copying of example
 code? Won't it eventually compatmentalize most all code written outside
 of the Anglo-Centric world? That is, alienate it from us, but also from
 each of the other cultures too.

That's what coding standards are for. The same reuse issue exists for
C/C++ with the preprocessor, and that seems to work reasonably well.

Thomas
Nov 03 2006