digitalmars.D.bugs - [Issue 3465] New: isIdeographic can be wrong in std.xml

d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=3465

           Summary: isIdeographic can be wrong in std.xml
           Product: D
           Version: 2.035
          Platform: Other
        OS/Version: All
            Status: NEW
          Severity: minor
          Priority: P2
         Component: Phobos
        AssignedTo: nobody puremagic.com
        ReportedBy: y0uf00bar gmail.com



The std.xml function isIdeographic caused my parser to fail one of the XML
conformance tests, for the character 0x4E00.

// As implemented in XML Piece Parser Project,  http://source.miryn.org/
// but I took it from std.xml

//WRONG in std.xml
//invariant IdeographicTable=[0x4E00,0x9FA5,0x3007,0x3007,0x3021,0x3029];

// RIGHT: the lookup function requires the
// range pairs in the table to be in ascending order!
dchar[] IdeographicTable=[0x3007,0x3007,0x3021,0x3029,0x4E00,0x9FA5];
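
To illustrate why the ordering matters, here is a minimal sketch of a binary search over inclusive range pairs. The function lookupPairs and the two table names are hypothetical stand-ins for illustration; this is not the actual std.xml lookup, only an assumption about how such a search behaves.

```d
import std.stdio;

// The two orderings discussed above, as flat arrays of [lo, hi] pairs.
immutable dchar[] unsortedTable = [0x4E00, 0x9FA5, 0x3007, 0x3007, 0x3021, 0x3029];
immutable dchar[] sortedTable   = [0x3007, 0x3007, 0x3021, 0x3029, 0x4E00, 0x9FA5];

// Hypothetical binary search over inclusive [lo, hi] pairs.
// It assumes the pairs are sorted in ascending order.
bool lookupPairs(const(dchar)[] table, dchar c)
{
    size_t lo = 0;
    size_t hi = table.length / 2;          // number of pairs
    while (lo < hi)
    {
        immutable mid = lo + (hi - lo) / 2;
        if (c < table[2 * mid])
            hi = mid;                      // continue in the lower half
        else if (c > table[2 * mid + 1])
            lo = mid + 1;                  // continue in the upper half
        else
            return true;                   // c lies inside pair number mid
    }
    return false;
}

void main()
{
    // With the unsorted table the search moves away from the
    // 0x4E00-0x9FA5 pair and never inspects it.
    writeln(lookupPairs(unsortedTable, cast(dchar) 0x4E00)); // false
    writeln(lookupPairs(sortedTable, cast(dchar) 0x4E00));   // true
}
```

With the unsorted table the first probe lands on the 0x3007 pair, the search then moves right, and the 0x4E00 pair at the front is never examined.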

// PERFORMANCE SUGGESTION
// A table lookup pays off for larger tables. For a small table like
// this one, or for a single character, a hard-coded comparison chain
// will surely be faster.
//
// It generates little extra code and is faster, since there is no
// function call to lookup and no array slicing.

bool isIdeographic(dchar c)
{
    if (c == 0x3007)
        return true;
    if (c >= 0x3007 && c <= 0x3029)
        return true;
    if (c >= 0x4E00 && c <= 0x9FA5)
        return true;
    return false;
}

// Only a suggestion here:
// isChar has to be called for every single character in the document,
// so it must be worth a bit of optimisation, especially for the
// common cases.

/**
 * Returns true if the character is a character according to the XML standard.
 * Character references must refer to one of these: any Unicode character,
 * excluding the surrogate blocks, FFFE, and FFFF.
 * #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
 * Also rejects the ranges [#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDEF].
 * The standard ASCII case takes at most 3 value comparisons.
 *
 * Standards: $(LINK2 http://www.w3.org/TR/1998/REC-xml-19980210, XML 1.0)
 *
 * Params:
 *    c = the character to be tested
 */
bool isChar(dchar c) 
{
    if (c <= 0xD7FF)
    {
        if (c >= 0x20)
        {
            if (c >= 0x7F)
            {
                if (c <= 0x84)
                    return false;
                if (c >= 0x86)
                {
                    if (c <= 0x9F)
                        return false;
                }
            }
            return true;
        }
        switch(c)
        {
        case 0xA:
        case 0x9:
        case 0xD:
            return true;
        default:
            return false;
        }
    }
    else if (c >= 0xE000)
    {
        if (c < 0xFFFE)
        {
            if (c >= 0xFDD0 && c <= 0xFDEF)
                return false;
            return true;
        }
        if (c >= 0x10000)
        {
            if (c <= 0x10FFFF)
            {
                /* Disabled: some conformance tests contain 0x10FFFF,
                   which this check would reject.
                if ((c & 0xFFFE) == 0xFFFE)
                {
                    return false;
                }
                */
                return true;
            }
        }
    }
    return false;
}

// Most digits are expected to be ASCII ones
bool isDigit(dchar c)
{
    if (c <= 0x0039 && c >= 0x0030)
        return true;
    else
        return lookup(DigitTable,c);
}

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Nov 01 2009
d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=3465




// A check of my code indicates afternoon doziness, so here is the
// better version:

bool isIdeographic(dchar c)
{
    if (c == 0x3007)
        return true;
    if (c <= 0x3029 && c >= 0x3021)
        return true;
    if (c <= 0x9FA5 && c >= 0x4E00)
        return true;
    return false;
}
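
The difference between the two posted versions can be checked with a short sketch. The names isIdeographicFirst and isIdeographicFixed are hypothetical, used here only to hold both versions side by side; the ranges are those of the XML 1.0 Ideographic production, [#x4E00-#x9FA5] | #x3007 | [#x3021-#x3029].

```d
// First posted version: the middle range starts at 0x3007 by mistake.
bool isIdeographicFirst(dchar c)
{
    return c == 0x3007
        || (c >= 0x3007 && c <= 0x3029)
        || (c >= 0x4E00 && c <= 0x9FA5);
}

// Corrected version: the middle range starts at 0x3021.
bool isIdeographicFixed(dchar c)
{
    return c == 0x3007
        || (c >= 0x3021 && c <= 0x3029)
        || (c >= 0x4E00 && c <= 0x9FA5);
}

void main()
{
    // U+3008 lies between #x3007 and #x3021, so it is not ideographic.
    assert(isIdeographicFirst(cast(dchar) 0x3008));    // wrongly accepted
    assert(!isIdeographicFixed(cast(dchar) 0x3008));   // correctly rejected

    // Both versions agree on the character from the failing test.
    assert(isIdeographicFirst(cast(dchar) 0x4E00));
    assert(isIdeographicFixed(cast(dchar) 0x4E00));
}
```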

Nov 01 2009
d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=3465


Shin Fujishiro <rsinfu gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED
                 CC|                            |rsinfu gmail.com
         AssignedTo|nobody puremagic.com        |rsinfu gmail.com


May 23 2010
d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=3465


Shin Fujishiro <rsinfu gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|                            |FIXED



---
Fixed in svn r1552.
Thanks for your contribution!

Excuse me: I removed a certain part of your code from the actual commit. The
contributed code took newer Unicode standards into account. I like new things,
but as far as supporting XML 1.0 goes, we have to stick to Unicode 2.0.
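
For comparison, here is a minimal sketch of a check that follows only the XML 1.0 Char production quoted in the report. This is not the code committed in r1552, and the thread does not say exactly which part was removed; the name isCharXml10 is hypothetical.

```d
// Hypothetical check following only the XML 1.0 Char production:
// #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
bool isCharXml10(dchar c)
{
    if (c >= 0x20 && c <= 0xD7FF)
        return true;                          // common case first
    if (c == 0x9 || c == 0xA || c == 0xD)
        return true;                          // tab, line feed, carriage return
    if (c >= 0xE000 && c <= 0xFFFD)
        return true;
    return c >= 0x10000 && c <= 0x10FFFF;
}

void main()
{
    assert(isCharXml10('A'));
    // 0x7F is allowed by the production, although the contributed
    // isChar rejects it.
    assert(isCharXml10(cast(dchar) 0x7F));
    assert(!isCharXml10(cast(dchar) 0xD800));   // surrogate block
    assert(!isCharXml10(cast(dchar) 0xFFFE));
    assert(isCharXml10(cast(dchar) 0x10FFFF));
}
```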

May 23 2010