Subject suggestion: clean white space / end of line definition
From Thomas Kuehne <thomas-dloop@kuehne.cn>
Date Sat, 28 Oct 2006 10:00:33 +0000 (UTC)
Newsgroups digitalmars.D

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Current definition (http://www.digitalmars.com/d/lex.html):
> EndOfLine:
>        \u000D
>        \u000A
>        \u000D \u000A
>        EndOfFile
>
> WhiteSpace:
>        Space
>        Space WhiteSpace
>
> Space:
>        \u0020
>        \u0009
>        \u000B
>        \u000C
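
For reference, the grammar quoted above can be read as a small matcher. This is only a sketch of the published definitions, not DMD's code; the names SPACE, match_end_of_line and match_white_space are made up for illustration.

```python
# Sketch of the lex.html definitions quoted above (illustrative names,
# not taken from DMD's sources).

# Space: \u0020, \u0009, \u000B, \u000C
SPACE = frozenset("\u0020\u0009\u000B\u000C")


def match_end_of_line(text, pos):
    """Return the length of the EndOfLine starting at pos, or None.

    EndOfFile counts as an EndOfLine of length 0.
    """
    if pos >= len(text):
        return 0  # EndOfFile
    if text.startswith("\r\n", pos):
        return 2  # \u000D \u000A is a single EndOfLine
    if text[pos] in "\r\n":
        return 1  # lone \u000D or lone \u000A
    return None


def match_white_space(text, pos):
    """Return how many consecutive Space characters start at pos."""
    n = 0
    while pos + n < len(text) and text[pos + n] in SPACE:
        n += 1
    return n
```

Under this reading, \u000B and \u000C are spaces and \u00A0 is neither a space nor a line end.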

DMD's frontend, however, doesn't strictly conform to those definitions:

doc.c:1395: only \u0020, \u0009 and \u000A are treated as spaces
html.c:351: \u000D and \u000A are treated as space too
html.c:683: \u00A0 is treated as space only if it was encountered via an HTML entity
inifile.c:264: \u000D and \u000A are treated as space too
lexer.c:2360: \u000B and \u000C aren't treated as spaces
lexer.c: \u2028 and \u2029 are treated as line separators too

The oddest case is entity.c:577:
"\&nbsp;" is treated as "\u0020" instead of "\u00A0"

Suggested definition:
> EndOfLine:
>        Unicode(all non-tailorable Line Breaking Classes causing a line break)
>        EndOfFile
>
> WhiteSpace:
>        Space
>        Space WhiteSpace
>
> Space:
>        ( Unicode(General_Category == Space_Separator)
>                || Unicode(Bidi_Class == Segment_Separator)
>                || Unicode(Bidi_Class == Whitespace)
>        ) && !EndOfLine

This expands to:
> EndOfLine:
>        000A                // LINE FEED
>        000B                // LINE TABULATION
>        000C                // FORM FEED
>        000D                // CARRIAGE RETURN
>        000D 000A        // CARRIAGE RETURN followed by LINE FEED
>        0085                // NEXT LINE
>        2028                // LINE SEPARATOR
>        2029                // PARAGRAPH SEPARATOR
>
> Space:
>        Unicode(General_Category == Space_Separator) && !EndOfLine
>                0020       // SPACE
>                00A0       // NO-BREAK SPACE
>                1680       // OGHAM SPACE MARK
>                180E       // MONGOLIAN VOWEL SEPARATOR
>                2000..200A // EN QUAD..HAIR SPACE
>                202F       // NARROW NO-BREAK SPACE
>                205F       // MEDIUM MATHEMATICAL SPACE
>                3000       // IDEOGRAPHIC SPACE
>
>        Unicode(Bidi_Class == Segment_Separator) && !EndOfLine
>                0009        // CHARACTER TABULATION
>                001F        // INFORMATION SEPARATOR ONE
>
>        Unicode(Bidi_Class == Whitespace) && !EndOfLine
>                <all already part of the Space_Separator listing>
>
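
To make the expanded tables concrete, here is a rough sketch of how a lexer could classify characters under the suggested definition. The code points are copied by hand from the listing above rather than looked up in a Unicode database, since property values can differ between Unicode versions. The names END_OF_LINE, SPACE, is_end_of_line, is_space and split_lines are hypothetical.

```python
# Sketch of the suggested definition, using the expanded listing above.
# All names are illustrative; nothing here is taken from DMD.

# EndOfLine code points (000D 000A is handled as a pair in split_lines).
END_OF_LINE = frozenset("\u000A\u000B\u000C\u000D\u0085\u2028\u2029")

# Space code points: the Space_Separator listing plus 0009 and 001F.
SPACE = frozenset(
    ["\u0020", "\u00A0", "\u1680", "\u180E", "\u202F", "\u205F", "\u3000",
     "\u0009", "\u001F"]
    + [chr(c) for c in range(0x2000, 0x200B)]  # 2000..200A inclusive
)


def is_end_of_line(ch):
    """True if ch is one of the listed EndOfLine characters."""
    return ch in END_OF_LINE


def is_space(ch):
    """True if ch is a listed Space character and not an EndOfLine."""
    return ch in SPACE and ch not in END_OF_LINE


def split_lines(text):
    """Split text at every EndOfLine, treating 000D 000A as one break."""
    lines, start, i = [], 0, 0
    while i < len(text):
        if text[i] in END_OF_LINE:
            lines.append(text[start:i])
            i += 2 if text.startswith("\r\n", i) else 1
            start = i
        else:
            i += 1
    lines.append(text[start:])  # last line, terminated by EndOfFile
    return lines
```

With these tables, \u000B and \u000C end a line instead of counting as spaces, and \u00A0 becomes a space regardless of how it was written in the source.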

Thomas

-----BEGIN PGP SIGNATURE-----

iD8DBQFFQzdILK5blCcjpWoRArgLAJ90xljYG+pNPEit3WU8JtAYlC+3PACfRPTU
J0cixnT2X7yynpjxBQx+rps=
=IDK6
-----END PGP SIGNATURE-----
