
digitalmars.D - Unicode character module (unichar)

reply Hauke Duden <H.NS.Duden gmx.net> writes:

As promised in another thread, here's the unichar module that I've 
written. It provides basic Unicode character property functions (like 
charIsDigit, charToLower, etc).

It is documented in doxygen style and the compiled docs are included in 
the zip file.

Let me know what you think!


Hauke
Jun 04 2004
parent reply "Walter" <newshound digitalmars.com> writes:
"Hauke Duden" <H.NS.Duden gmx.net> wrote in message
news:c9qlqe$1me1$1 digitaldaemon.com...
> As promised in another thread, here's the unichar module that I've
> written. It provides basic Unicode character property functions (like
> charIsDigit, charToLower, etc).
>
> It is documented in doxygen style and the compiled docs are included in
> the zip file.
>
> Let me know what you think!

Great! Some quick comments:

o Can the enum be changed to enum CHARCATEGORY, and then replace
  CHARCATEGORY_LETTER, etc., with CHARCATEGORY.LETTER?
o Change inout in the foreach in charToTitle to nothing.
o "Descimal" should be "Decimal".
o No need for the 'char' prefix on functions; the module name should suffice.

The 2 MB of RAM at runtime is a little costly, so I think it should remain a
separate package from std.ctype.
Jun 04 2004
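
A minimal sketch of the enum suggestion above, written for a current D compiler. Only CHARCATEGORY and LETTER come from the thread; the other members and the categoryOf helper are hypothetical, limited to ASCII for brevity, and the real unichar module may look quite different.

```d
import std.stdio;

// Hypothetical category list; only LETTER is named in the thread.
enum CHARCATEGORY
{
    LETTER,
    DIGIT,
    OTHER,
}

// Hypothetical ASCII-only classifier, just enough to exercise the enum.
CHARCATEGORY categoryOf(dchar c)
{
    if (c >= '0' && c <= '9')
        return CHARCATEGORY.DIGIT;
    if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'))
        return CHARCATEGORY.LETTER;
    return CHARCATEGORY.OTHER;
}

void main()
{
    // Members are reached through the enum name, so no CHARCATEGORY_
    // prefix is needed on the member names themselves.
    writeln(categoryOf('7') == CHARCATEGORY.DIGIT);  // true
    writeln(categoryOf('x') == CHARCATEGORY.LETTER); // true
}
```

The named enum gives the members their own scope, which is the same idea as using the module name instead of a 'char' prefix on functions.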
next sibling parent reply Hauke Duden <H.NS.Duden gmx.net> writes:
Walter wrote:
>> Let me know what you think!
>
> Great! Some quick comments:
> o Can the enum be changed to enum CHARCATEGORY, and then replace
>   CHARCATEGORY_LETTER, etc., with CHARCATEGORY.LETTER?

Yes. I guess that's just a C++-ism I got used to.

> o Change inout in the foreach in charToTitle to nothing.

Hmmm. I don't know that much about the inner workings of foreach, but won't that create a copy of the referenced element?

> o "Descimal" should be "Decimal".

Whoops ;).

> o No need for the 'char' prefix on functions; the module name should suffice.

As I said in another post, I'm reluctant to change this. Mostly because I want the functions to look different from the ctype ones but also because of D's overloading issue.

> The 2 MB of RAM at runtime is a little costly, so I think it should remain a
> separate package from std.ctype.

I agree.

Hauke
Jun 04 2004
parent reply "Walter" <newshound digitalmars.com> writes:
"Hauke Duden" <H.NS.Duden gmx.net> wrote in message
news:c9r0ri$26gi$1 digitaldaemon.com...
>> o change inout in the foreach in charToTitle to nothing.
>
> won't that create a copy of the referenced element?

Yes.
>> o no need for 'char' prefix on functions, the module name should
>> suffice.
>
> As I said in another post, I'm reluctant to change this. Mostly because
> I want the functions to look different from the ctype ones but also
> because of D's overloading issue.

I just don't understand what the D overloading issue is. D has much tighter control over overloading than C++ has, overloads from one module aren't going to be mistaken for another one if both are imported. One reason for the package/module system in D is to pitch the C-ism of decorating names with a pseudo-package name into the ash heap of history <g>.
Jun 04 2004
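
The foreach point settled above (dropping inout makes the loop variable a copy of the element) can be sketched as follows. This assumes a current D compiler, where the by-reference storage class is spelled ref rather than the inout used in 2004; the array and its contents are purely illustrative.

```d
import std.stdio;

void main()
{
    int[] values = [1, 2, 3];

    // No storage class: each iteration works on a copy of the element,
    // so the array itself is left untouched.
    foreach (v; values)
        v *= 10;
    writeln(values); // [1, 2, 3]

    // ref (written inout in the 2004 code under discussion): the loop
    // variable refers to the element itself, so the array is modified.
    foreach (ref v; values)
        v *= 10;
    writeln(values); // [10, 20, 30]
}
```

For a loop that only reads its elements, as the suggestion implies charToTitle does, the by-value form states the intent more clearly.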
parent reply Hauke Duden <H.NS.Duden gmx.net> writes:
Walter wrote:
  >>>o no need for 'char' prefix on functions, the module name should
 
>>> suffice.
>>
>> As I said in another post, I'm reluctant to change this. Mostly because
>> I want the functions to look different from the ctype ones but also
>> because of D's overloading issue.
>
> I just don't understand what the D overloading issue is. D has much tighter
> control over overloading than C++ has, overloads from one module aren't
> going to be mistaken for another one if both are imported. One reason for
> the package/module system in D is to pitch the C-ism of decorating names
> with a pseudo-package name into the ash heap of history <g>.

Here's an example of what I mean:

    module unichar:
        bool isSeparator(dchar chr);

    module funkyMenu:
        bool isSeparator(MenuItem item);

    module myApp:
        import unichar;
        import funkyMenu;

        void foo(MenuItem item)
        {
            if(isSeparator(item))
                ....
        }

This will cause a compiler error because D stops looking for more overloads as soon as it finds unichar.isSeparator and never finds funkyMenu.isSeparator. And to make matters worse, the error message will not even tell you that there is some kind of conflict. No, the compiler will tell you that there is no isSeparator(MenuItem) even though there most certainly is.

In C++ there'd be no such problem because the call is not actually ambiguous! It is perfectly clear that the MenuItem version is the one that should be called.

That's what I mean and that's the reason why I don't want to define any global functions with names that may also occur in other contexts. Otherwise there may be weird effects for the library's user like working code failing to compile once another import is added - even though there is no ambiguity.

Hauke
Jun 05 2004
next sibling parent reply "Walter" <newshound digitalmars.com> writes:
"Hauke Duden" <H.NS.Duden gmx.net> wrote in message
news:c9t8un$2jfc$1 digitaldaemon.com...
>> I just don't understand what the D overloading issue is. D has much tighter
>> control over overloading than C++ has, overloads from one module aren't
>> going to be mistaken for another one if both are imported. One reason for
>> the package/module system in D is to pitch the C-ism of decorating names
>> with a pseudo-package name into the ash heap of history <g>.
>
> Here's an example of what I mean:
>
>     module unichar:
>         bool isSeparator(dchar chr);
>
>     module funkyMenu:
>         bool isSeparator(MenuItem item);
>
>     module myApp:
>         import unichar;
>         import funkyMenu;
>
>         void foo(MenuItem item)
>         {
>             if(isSeparator(item))
>                 ....
>         }
>
> This will cause a compiler error because D stops looking for more overloads
> as soon as it finds unichar.isSeparator and never finds
> funkyMenu.isSeparator.

No, that isn't what happens. What happens is that isSeparator appears in multiple modules, and the compiler doesn't know which one to use, so issues an error. You'll find the same error if you use a dchar argument for isSeparator. Next, overloading does NOT happen across modules. Overloading happens AFTER the symbol lookup. Only functions in the same scope are overloadable.
> And to make matters worse, the error message will
> not even tell you that there is some kind of conflict. No, the compiler
> will tell you that there is no isSeparator(MenuItem) even though there
> most certainly is.

The error message I get is:

    unichar.d(2): function isSeparator conflicts with funkyMenu.isSeparator at funkyMenu.d(3)
> In C++ there'd be no such problem because the call is not actually
> ambiguous! It is perfectly clear that the MenuItem version is the one
> that should be called.
>
> That's what I mean and that's the reason why I don't want to define any
> global functions with names that may also occur in other contexts.
> Otherwise there may be weird effects for the library's user like working
> code failing to compile once another import is added - even though there
> is no ambiguity.

There is an ambiguity, and the compiler issues an error for it. The reason it behaves this way is to avoid the C++ global namespace pollution problem, where two completely unrelated functions in two unrelated source files happen to have the same name, and inadvertently overload against each other, causing some very strange errors.

This doesn't happen in D. If you want two names in different modules to overload against each other, a specific action is required to make it happen (an alias declaration). It will NOT happen by default. You'll get the "conflicts" error above.

Next, instead of the C++ 'fix' for this problem by adding a pseudo-package name to each global symbol, in D you can just use the module name for it, i.e.:

    unichar.isSeparator()
    funkyMenu.isSeparator()

which is better than the C++ unichar_isSeparator(), isn't it?
Jun 05 2004
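
A small sketch of the module-qualified call style described above, assuming a current D compiler and its standard library. The thread's unichar and funkyMenu modules are not available here, so two Phobos modules that both declare isAlpha(dchar), std.ascii and std.uni, stand in for them.

```d
import std.ascii;
import std.stdio;
import std.uni;

void main()
{
    dchar c = '\u00E9'; // e with acute accent

    // Both imported modules declare isAlpha(dchar). Prefixing the call
    // with the module name states which one is meant, with no need for
    // a decorated function name.
    writeln(std.ascii.isAlpha(c)); // false: not an ASCII letter
    writeln(std.uni.isAlpha(c));   // true: a Unicode letter
}
```

An unqualified isAlpha(c) in this file should be rejected at compile time as a conflict between the two modules, which is the behaviour the post argues for.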
next sibling parent reply Hauke Duden <H.NS.Duden gmx.net> writes:
Walter wrote:
> Next, overloading does NOT happen across modules. Overloading happens AFTER
> the symbol lookup. Only functions in the same scope are overloadable.
>
>> And to make matters worse, the error message will
>> not even tell you that there is some kind of conflict. No, the compiler
>> will tell you that there is no isSeparator(MenuItem) even though there
>> most certainly is.
>
> The error message I get is:
>
>     unichar.d(2): function isSeparator conflicts with funkyMenu.isSeparator at funkyMenu.d(3)

My apologies. I now get the same error. I distinctly remember getting a much more misleading error when I experimented with overloads some time ago, though. Was there a related compiler error in earlier DMD versions?

> There is an ambiguity, and the compiler issues an error for it. The reason
> it behaves this way is to avoid the C++ global namespace pollution problem,
> where two completely unrelated functions in two unrelated source files
> happen to have the same name, and inadvertently overload against each other
> causing some very strange errors.

What kind of strange errors are these? It seems to me that overloads with different argument types are unproblematic. You can sometimes have ambiguous calls, for example, if one function takes the base class type of the other function's parameter, but that'd simply cause a compiler error. Nothing that I would call "strange".

Hauke
Jun 05 2004
parent reply "Walter" <newshound digitalmars.com> writes:
"Hauke Duden" <H.NS.Duden gmx.net> wrote in message
news:c9thhq$2urf$1 digitaldaemon.com...
> I distinctly remember getting a much more misleading error when I
> experimented with overloads some time ago, though. Was there a related
> compiler error in earlier DMD versions?

That's possible. I don't remember.

>> There is an ambiguity, and the compiler issues an error for it. The reason
>> it behaves this way is to avoid the C++ global namespace pollution problem,
>> where two completely unrelated functions in two unrelated source files
>> happen to have the same name, and inadvertently overload against each other
>> causing some very strange errors.
>
> What kind of strange errors are these? It seems to me that overloads
> with different argument types are unproblematic. You can sometimes have
> ambiguous calls, for example, if one function takes the base class type of
> the other function's parameter, but that'd simply cause a compiler error.
> Nothing that I would call "strange".

Suppose, in file 'a.h', you have:

    void output(int);
    void output(long);

which sends its argument to stdout. You download 'b.h' off the net, which has:

    void output(char);

buried in it somewhere which writes its argument out to the serial port. Now,

    #include "a.h"
    output('c');

and all is fine. Now,

    #include "a.h"
    #include "b.h"
    output('c');

and your program breaks at runtime, possibly in invisible ways. In D, this would break in an obvious manner at compile time. Much more reliable.
Jun 05 2004
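
The mechanism behind the a.h/b.h example is ordinary overload resolution: a better-matching overload wins without any change at the call site. The sketch below shows that mechanism inside a single D module, where overloading is intended. It assumes a current D compiler; the WithB version identifier is a hypothetical stand-in for "b.h has been included", and the functions return a label instead of writing to stdout or a serial port.

```d
import std.stdio;

// Stand-ins for the two declarations from a.h.
string output(int)  { return "output(int)"; }
string output(long) { return "output(long)"; }

// Stand-in for the declaration buried in b.h.
// Build with -version=WithB to add it to the overload set.
version (WithB)
{
    string output(char) { return "output(char)"; }
}

void main()
{
    // Without WithB, the char argument is implicitly converted and the
    // int overload is selected. With WithB, the exact char match is
    // selected instead, although this line has not changed.
    writeln(output('c'));
}
```

When the extra overload comes from an unrelated module rather than the same one, D as described in this thread reports a conflict instead of silently picking the new function.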
next sibling parent reply hellcatv hotmail.com writes:
I think these features to enable catching errors at compile time are necessary.
However, I was wondering if there are a few shortcuts.
I'm running into two situations.

First (and simplest): I have a file ftoa with a single class or function, let's say

    char[] ftoa(real);

In my other file I say

    import ftoa;
    alias ftoa.ftoa ftoa;

Nope! Can't do it. Is there any way I can call it ftoa without having to name
my file differently from my function (or especially class, as the case may be)?

Second: it would be nice if I could alias everything in a module:

    alias ftoa.* *;

or something :-)
--Daniel


Jun 05 2004
parent "Walter" <newshound digitalmars.com> writes:
<hellcatv hotmail.com> wrote in message
news:c9tjiv$3e$1 digitaldaemon.com...
> I think these features to enable catching errors at compile time are
> necessary.
> however I was wondering if there are a few shortcuts.
> I'm running into two situations:
> first (and simplest)
> I have a file ftoa with a single class or function ... lets say char[]
> ftoa(real);
>
> in my other file I say
> import ftoa;
> alias ftoa.ftoa ftoa;
> nope! can't do it.
> Is there any way I can call it ftoa without having to name
> my file different than my function (or especially class as the case may
> be)

There's no way to distinguish the names if you don't name them something different.
> secondly:
> it would be nice if I could alias everything in a module;
> alias ftoa.* *;
> or something :-)

Seems a little too easy <g>.
Jun 05 2004
prev sibling parent Hauke Duden <H.NS.Duden gmx.net> writes:
Walter wrote:
 "Hauke Duden" <H.NS.Duden gmx.net> wrote in message
 news:c9thhq$2urf$1 digitaldaemon.com...
 
>> I distinctly remember getting a much more misleading error when I
>> experimented with overloads some time ago, though. Was there a related
>> compiler error in earlier DMD versions?
>
> That's possible. I don't remember.
>
>>> There is an ambiguity, and the compiler issues an error for it. The reason
>>> it behaves this way is to avoid the C++ global namespace pollution problem,
>>> where two completely unrelated functions in two unrelated source files
>>> happen to have the same name, and inadvertently overload against each other
>>> causing some very strange errors.
>>
>> What kind of strange errors are these? It seems to me that overloads
>> with different argument types are unproblematic. You can sometimes have
>> ambiguous calls, for example, if one function takes the base class type of
>> the other function's parameter, but that'd simply cause a compiler error.
>> Nothing that I would call "strange".
>
> Suppose, in file 'a.h', you have:
>
>     void output(int);
>     void output(long);
>
> which sends its argument to stdout. You download 'b.h' off the net, which
> has:
>
>     void output(char);
>
> buried in it somewhere which writes its argument out to the serial port.
> Now,
>
>     #include "a.h"
>     output('c');
>
> and all is fine. Now,
>
>     #include "a.h"
>     #include "b.h"
>     output('c');
>
> and your program breaks at runtime, possibly in invisible ways.

Heh, the other reason being the implicit conversion between char and int, of course (hint, hint) ;). But I get your point. Thanks for the example.

Hauke
Jun 05 2004
prev sibling parent reply "Carlos Santander B." <carlos8294 msn.com> writes:
"Walter" <newshound digitalmars.com> wrote in message
news:c9tf74$2rrl$1 digitaldaemon.com
| The error message I get is:
| unichar.d(2): function isSeparator conflicts with funkyMenu.isSeparator at
| funkyMenu.d(3)
|

I believe there's a problem with this message. Suppose you're compiling a
whole bunch of files and you get messages like this one (like what happened
to me trying to compile mango beta 7), how could you possibly know where the
conflict is? It should rather say something like "myApp.d(9): do you mean
unichar.isSeparator or funkyMenu.isSeparator?".

-----------------------
Carlos Santander Bernal
Jun 06 2004
parent J C Calvarese <jcc7 cox.net> writes:
Carlos Santander B. wrote:
> "Walter" <newshound digitalmars.com> wrote in message
> news:c9tf74$2rrl$1 digitaldaemon.com
> | The error message I get is:
> | unichar.d(2): function isSeparator conflicts with funkyMenu.isSeparator at
> | funkyMenu.d(3)
> |
>
> I believe there's a problem with this message. Suppose you're compiling a
> whole bunch of files and you get messages like this one (like what happened
> to me trying to compile mango beta 7), how could you possibly know where the
> conflict is? It should rather say something like "myApp.d(9): do you mean
> unichar.isSeparator or funkyMenu.isSeparator?".

Yes, in fact I think it'd be ideal to provide the 3 locations involved: caller and 2 definitions...

    myApp.d(9): ambiguous "isSeparator" = unichar.d(946) or funkyMenu.d(86).

This could be tremendously helpful! As libraries get more complicated, these issues get harder and harder for the code maintainer to track down.

> -----------------------
> Carlos Santander Bernal

-- Justin (a/k/a jcc7) http://jcc_7.tripod.com/d/
Jun 07 2004
prev sibling parent reply Sean Kelly <sean f4.ca> writes:
Hauke Duden wrote:
> In C++ there'd be no such problem because the call is not actually
> ambiguous! It is perfectly clear that the MenuItem version is the one
> that should be called.
>
> That's what I mean and that's the reason why I don't want to define any
> global functions with names that may also occur in other contexts.
> Otherwise there may be weird effects for the library's user like working
> code failing to compile once another import is added - even though there
> is no ambiguity.

I prefer to think of modules as C++ namespaces. And in C++ I rarely import symbols with a "using" declaration, but rather fully qualify them: std::cout, etc. So why not the same thing here? unichar.toLower, etc. Or come up with a shorter module name if that one is too long.

One thing I haven't tried... is it possible to import a package and still be required to provide module names when referring to symbols stored within each module in that package? That would be ideal.

Sean
Jun 05 2004
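
On the question of importing while still being required to write the module name: a sketch assuming a current D compiler, where static imports and renamed imports provide this per module. Nothing in the thread says these forms were available to the posters in 2004, and the Phobos modules below are stand-ins for unichar.

```d
static import std.uni;      // symbols must be written as std.uni.name
import ascii = std.ascii;   // renamed import: symbols must be written as ascii.name
import std.stdio : writeln; // selective import: only writeln is brought into scope

void main()
{
    dchar upper = '\u00C4'; // A with diaeresis

    // Every call names its module, much like std::cout in C++.
    writeln(std.uni.toLower(upper) == '\u00E4'); // true
    writeln(ascii.toLower('Q') == 'q');          // true

    // An unqualified toLower('Q') should not compile here, because none
    // of the imports above puts that name into scope.
}
```

The renamed form also covers the "shorter module name" idea without renaming the module itself.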
parent Antti Sykäri <jsykari gamma.hut.fi> writes:
In article <c9tghe$2tec$1 digitaldaemon.com>, Sean Kelly wrote:
> Hauke Duden wrote:
>> In C++ there'd be no such problem because the call is not actually
>> ambiguous! It is perfectly clear that the MenuItem version is the one
>> that should be called.
>>
>> That's what I mean and that's the reason why I don't want to define any
>> global functions with names that may also occur in other contexts.
>> Otherwise there may be weird effects for the library's user like working
>> code failing to compile once another import is added - even though there
>> is no ambiguity.
>
> I prefer to think of modules as C++ namespaces. And in C++ I rarely import
> symbols with a "using" declaration, but rather fully qualify them:
> std::cout, etc. So why not the same thing here? unichar.toLower, etc. Or
> come up with a shorter module name if that one is too long.

In C++, that's a good habit. But in D you don't have to, because the potential conflicts between modules/namespaces will be detected automatically by the compiler.

In short: If you want to be absolutely sure, C++ forces you to specify everything. In D you can use unqualified names as much as you like, and only when it is necessary to resolve the conflict you have to fully qualify them. Productive (because you're not likely to bump into a conflict too often) and safe (no surprises when you do).

-Antti

--
I will not be using Plan 9 in the creation of weapons of mass destruction to be used by nations other than the US.
Jun 06 2004
prev sibling parent Hauke Duden <H.NS.Duden gmx.net> writes:
I have updated the unichar module incorporating (most of) Walter's 
suggestions and also written a utype module as a drop-in replacement for 
ctype.

Available here:

http://www.hazardarea.com/unichar.zip

Hauke
Jun 05 2004