digitalmars.D.bugs - [Issue 4250] New: std.regex does not support character sets other than unicode

d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=4250

           Summary: std.regex does not support character sets other than
                    unicode
           Product: D
           Version: 2.041
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Phobos
        AssignedTo: nobody puremagic.com
        ReportedBy: lio+bugzilla lunesu.com


--- Comment #0 from Lionello Lunesu <lio+bugzilla lunesu.com> 2010-05-29
07:46:59 PDT ---
Created an attachment (id=647)
Patch against phobos/std/regex.d in dmd.2.046.zip

I'm writing an application that works with Chinese text encoded in GBK
(http://en.wikipedia.org/wiki/GBK). I could convert all the text to UTF-8 first,
before using regex, but it's much faster to leave the text as-is and convert
only the regular expression to GBK instead.

I suspect the following opcodes need patching:
1. REanychar uses std.utf.stride;
2. REdchar and REidchar are used when the character in the regex >= 0x80;
3. REichar and REidchar use std.ctype.toupper (during creation and execution)

Points 1 and 3 are easily solved by providing the user with callback functions.
To prevent unnecessary indirection, these can be aliases if
__traits(compiles, std.utf.stride(new E[0], 0)) is true.

Attached is a proof-of-concept patch for point 1. If this is OK, I can do the
same for points 2 and 3 as well. (Point 2 might not even need a patch; I'm not
sure about that yet.)
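
For illustration, the callback idea for point 1 could look roughly like the
sketch below. It is not the attached patch: gbkStride, utfStride and
countChars are hypothetical names, and the GBK rule used here (a lead byte in
0x81..0xFE starts a two-byte sequence) ignores the four-byte sequences of
GB18030. The stride function is passed as an alias parameter, so no run-time
indirection is involved.

```d
import std.utf : stride;

/// Hypothetical stride for GBK text: a lead byte in 0x81..0xFE starts a
/// two-byte sequence; every other byte stands alone.
size_t gbkStride(const(ubyte)[] s, size_t i) pure nothrow @safe
{
    immutable b = s[i];
    return (b >= 0x81 && b <= 0xFE && i + 1 < s.length) ? 2 : 1;
}

/// Thin wrapper so the UTF-8 case has the same shape as gbkStride.
size_t utfStride(const(char)[] s, size_t i)
{
    return stride(s, i);
}

/// Counts characters by repeatedly skipping one character, which is all an
/// "any character" opcode needs to know about the encoding.
size_t countChars(alias strideFn, E)(const(E)[] s)
{
    size_t n = 0;
    for (size_t i = 0; i < s.length; i += strideFn(s, i))
        ++n;
    return n;
}

unittest
{
    // 0xD6 0xD0 is U+4E2D in GBK; 0x61 is a plain ASCII 'a'.
    immutable(ubyte)[] gbk = [0xD6, 0xD0, 0x61];
    assert(countChars!gbkStride(gbk) == 2);
    assert(countChars!utfStride("\u4E2Dx") == 2);
}
```

The same alias-parameter approach would presumably carry over to the
case-folding function needed for point 3.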

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
May 29 2010
d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=4250


Lionello Lunesu <lio+bugzilla lunesu.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
 Attachment #647 is|0                           |1
           obsolete|                            |


--- Comment #1 from Lionello Lunesu <lio+bugzilla lunesu.com> 2010-05-29
17:53:38 PDT ---
Created an attachment (id=648)
Patch against phobos/std/regex.d in dmd.2.046.zip

Fixed the diff.

May 29 2010
d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=4250



--- Comment #2 from Lionello Lunesu <lio+bugzilla lunesu.com> 2010-05-29
18:02:11 PDT ---
Created an attachment (id=649)
Testcase (using GB18030 encoded data with std.regex)

May 29 2010
d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=4250


Walter Bright <bugzilla digitalmars.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |bugzilla digitalmars.com
           Severity|normal                      |enhancement


--- Comment #3 from Walter Bright <bugzilla digitalmars.com> 2010-05-30
11:02:48 PDT ---
It's not designed to do anything but UTF, so marked as an enhancement request.

May 30 2010
d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=4250


Andrei Alexandrescu <andrei metalanguage.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED
                 CC|                            |andrei metalanguage.com
         AssignedTo|nobody puremagic.com        |andrei metalanguage.com


Jan 09 2011
d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=4250


Dmitry Olshansky <dmitry.olsh gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |dmitry.olsh gmail.com
         AssignedTo|andrei metalanguage.com     |dmitry.olsh gmail.com


--- Comment #4 from Dmitry Olshansky <dmitry.olsh gmail.com> 2012-03-12
03:34:56 PDT ---
The first straightforward step would be to add an option that skips UTF
processing and assumes the input is plain ASCII; that covers an important use
case. The next move largely depends on std.encoding, or whatever replaces it.
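
As a rough sketch of what such an option might mean at the lowest level
(Decoding and nextIndex are hypothetical names, not part of std.regex), the
choice can be made at compile time so that the ASCII path never touches the
UTF code:

```d
import std.utf : stride;

/// Hypothetical compile-time switch between full UTF decoding and an
/// ASCII fast path.
enum Decoding { utf, ascii }

/// Returns the index of the character that follows the one at index i.
size_t nextIndex(Decoding mode)(const(char)[] s, size_t i)
{
    static if (mode == Decoding.ascii)
        return i + 1;             // every code unit is one character
    else
        return i + stride(s, i);  // skip a whole UTF-8 sequence
}

unittest
{
    assert(nextIndex!(Decoding.ascii)("abc", 0) == 1);
    // U+00E9 takes two code units in UTF-8.
    assert(nextIndex!(Decoding.utf)("\u00E9x", 0) == 2);
}
```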

Mar 12 2012
d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=4250


Dmitry Olshansky <dmitry.olsh gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
 Attachment #648 is|0                           |1
           obsolete|                            |


--- Comment #5 from Dmitry Olshansky <dmitry.olsh gmail.com> 2012-07-22
08:21:13 PDT ---
(From update of attachment 648)
Old regex is gone for good since 2.056.

Jul 22 2012