digitalmars.D - Unicode, graphemes and D

bearophile <bearophileHUGS lycos.com> writes:
For people interested in better Unicode handling in D: Perl has some support
for graphemes, and the regex /\X/ matches an extended grapheme cluster:

http://perldoc.perl.org/perl5120delta.html#Unicode-overhaul

http://perldoc.perl.org/perluniprops.html

Perl seems to be one of the best languages for handling Unicode (D and Go are good too):
http://rosettacode.org/wiki/String_length#Grapheme_Length_2
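For comparison, here is a minimal D sketch of the three "lengths" that the Rosetta Code task distinguishes. It assumes a Phobos version that ships `std.uni.byGrapheme`, which is newer than this thread; the helper name `graphemeLength` is made up for illustration.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

/// Hypothetical helper: counts extended grapheme clusters,
/// i.e. user-perceived characters.
size_t graphemeLength(string s)
{
    return s.byGrapheme.walkLength;
}

void main()
{
    import std.stdio : writeln;

    // "noël" written with a combining diaeresis (U+0308) after the 'e'.
    string s = "noe\u0308l";

    writeln(s.length);          // UTF-8 code units: 6
    writeln(s.walkLength);      // code points: 5
    writeln(graphemeLength(s)); // grapheme clusters: 4
}
```

Only the last count matches what Perl's /\X/ would iterate over; the first two depend on the encoding and on whether the text is precomposed.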

Bye,
bearophile
Apr 05 2012
Dmitry Olshansky <dmitry.olsh gmail.com> writes:
FYI: http://www.google-melange.com/gsoc/proposal/review/google/gsoc2012/dolsh/20002#

-- Dmitry Olshansky
Apr 05 2012
"stephan" <stephanfmueller+dlang gmail.com> writes:
 FYI
 http://www.google-melange.com/gsoc/proposal/review/google/gsoc2012/dolsh/20002#

This may be helpful for your GSoC project: as part of a larger code base, we have implemented many standard Unicode algorithms (normalization; case folding; graphemes; properties such as general category, Bidi class, joining type, etc.). The docs and source can be found at http://stephan.bitbucket.org/. As this was just a helper, it is not fully polished, but it works and is reasonably fast.
Apr 05 2012
Dmitry Olshansky <dmitry.olsh gmail.com> writes:
Nice. I'll add a link to it in my proposal. I can only use it if the license is Boost-compatible, though.

-- Dmitry Olshansky
Apr 05 2012
"stephan" <stephanfmueller+dlang gmail.com> writes:
On Thursday, 5 April 2012 at 16:17:46 UTC, Dmitry Olshansky wrote:
 Though I can use it iff the license is Boost compatible.

Ah, the licensing question. I am not a lawyer and I don't know much about copyright law, so you will have to do your own research. But here is my view of the unicodedata.d license situation.

Our code is Boost licensed. It is, however, not a clean-room implementation. Although almost all algorithms and data structures are different and there is minimal (and clearly marked) direct copying, we looked quite a bit at the ICU implementation (and its predecessors) for inspiration. The ICU license is very permissive, so you should be OK here.

Furthermore, data files from the Unicode Consortium are part of the distribution. They are used in the "script mode" (version SCRIPT_DATA) to generate the relevant Unicode data in an appropriate format. They are also used in the extensive unit tests (version ALL_UNIT_TESTS) to check correctness against various test files and derived property files. Again, the data files have a very permissive license.

Let me know if I can be of any help.
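For readers unfamiliar with D's `version` conditions, here is a rough sketch of how two identifiers like these could gate a data-generation path and a heavy test suite. Only the names SCRIPT_DATA and ALL_UNIT_TESTS come from the post; the function, its body, and the layout are assumptions and do not reflect the actual unicodedata.d source.

```d
// Hypothetical layout; only the two version identifiers are from the post.

/// Reports whether this build contains the table-generation path.
bool scriptModeEnabled()
{
    version (SCRIPT_DATA)
        return true;   // built with -version=SCRIPT_DATA
    else
        return false;  // normal library build
}

version (ALL_UNIT_TESTS)
{
    // Compiled only with -version=ALL_UNIT_TESTS, and run only when
    // -unittest is passed as well. A real suite would read the Unicode
    // Consortium test files here; this placeholder only shows the gating.
    unittest
    {
        assert(scriptModeEnabled() || !scriptModeEnabled());
    }
}

void main()
{
    import std.stdio : writeln;

    writeln("script mode: ", scriptModeEnabled());
}
```

With dmd, something like `dmd -version=SCRIPT_DATA app.d` would enable the first branch, so the bulky data files are only needed by builds that opt in.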
Apr 05 2012