
digitalmars.D - Unichar optimization: performance vs RAM usage. Your opinion?

reply Hauke Duden <H.NS.Duden gmx.net> writes:
I had an idea last night on how to optimize the unichar module to only 
use about 50 KB instead of the current 2 MB of RAM at runtime. It may 
also be possible to reduce the impact on the executable size (12 KB 
currently) but I haven't worked that out fully yet.

Unfortunately the change would make the lookup code slower. For example, 
translating "chr" to lower case would have to be changed from

return lowerData[chr]

to

return chr + lowerData[chr>>9][chr & 511];

In other words instead of one memory lookup it will do two memory 
lookups, a shift, an AND, and an addition. It is not THAT expensive, but 
in an inner loop it might add up.
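The two-level scheme can be sketched as follows. This is C rather than the module's D, and the names (pageIndex, zeroPage, initTables) are illustrative, with only an ASCII page filled in to keep the sketch small:

```c
#include <stdint.h>

#define PAGE_BITS 9
#define PAGE_SIZE (1 << PAGE_BITS)            /* 512 codepoints per page */
#define NUM_PAGES (0x110000 >> PAGE_BITS)     /* pages covering all of Unicode */

static int16_t zeroPage[PAGE_SIZE];           /* shared by every page with no mappings */
static int16_t asciiPage[PAGE_SIZE];          /* page 0 covers U+0000..U+01FF */
static const int16_t *pageIndex[NUM_PAGES];   /* first-level table of page pointers */

static void initTables(void)
{
    for (int i = 0; i < NUM_PAGES; i++)
        pageIndex[i] = zeroPage;              /* identical pages share one block */
    for (int c = 'A'; c <= 'Z'; c++)
        asciiPage[c] = 32;                    /* delta to lower case: 'A' + 32 == 'a' */
    pageIndex[0] = asciiPage;
}

uint32_t toLower(uint32_t chr)
{
    /* two memory lookups, a shift, an AND, and an addition */
    return chr + pageIndex[chr >> PAGE_BITS][chr & (PAGE_SIZE - 1)];
}
```

Only distinct pages cost memory: since most of Unicode has no case mapping, nearly every slot of pageIndex can point at the same zeroPage.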

And people have lots of RAM nowadays - 2 MB is a small amount, and since 
only small parts of it will actually be used (depending on the language 
the program is working with), most of it will be paged out to disk by the OS.

So what do you think: is the lower RAM usage worth the performance 
decrease? I could also implement both versions and select one at compile 
time, but then the question is which one should be the default?


Hauke
Jun 08 2004
next sibling parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <ca3vvv$gpf$1 digitaldaemon.com>, Hauke Duden says...

> return lowerData[chr]

Huh? There are over a million codepoints in Unicode (1,114,112, in fact). That must be a ----ing HUGE lookup table! Even if the data were only one byte per codepoint, that's a meg already.

Of course it's reasonable to compress that. There are huge swathes of codepoints that are just totally unused (yet), and many, many blocks in which all characters have identical properties. One humungous lookup table is definitely not how I would have done it. Maybe I'm just not enough of a speed freak.

Jill
Jun 08 2004
parent reply Hauke Duden <H.NS.Duden gmx.net> writes:
Arcane Jill wrote:

> In article <ca3vvv$gpf$1 digitaldaemon.com>, Hauke Duden says...
>
>> return lowerData[chr]
>
> Huh? There are over a million codepoints in Unicode (1,114,112, in fact). That must be a ----ing HUGE lookup table! Even if the data were only one byte per codepoint, that's a meg already.

Not quite ;). If you look at the data, you'll see that only the first ~200,000 codepoints actually have case mappings. That still translates to about 2 MB of needed RAM (for all the tables together).

> Of course it's reasonable to compress that. There are huge swathes of codepoints that are just totally unused (yet), and many, many blocks in which all characters have identical properties. One humungous lookup table is definitely not how I would have done it. Maybe I'm just not enough of a speed freak.

Dividing the Unicode range into pages and using the same memory for identical ones is exactly the optimization I am thinking about doing. The question is whether the modest (by today's standards) decrease in needed RAM is worth the decrease in performance.

Hauke
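The page sharing described here can be done at table-build time by interning pages: each freshly built page is compared against the distinct pages seen so far, and duplicates reuse the same memory. A sketch in C; PagePool and internPage are hypothetical names, not the module's code:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_BITS 9
#define PAGE_SIZE (1 << PAGE_BITS)   /* 512 entries per page */

/* Pool of distinct pages seen so far. */
typedef struct {
    const int16_t **unique;          /* array of distinct page pointers */
    int count;
} PagePool;

/* Return a shared pointer for this page: an existing identical page
   if one is in the pool, otherwise a freshly stored copy. */
const int16_t *internPage(PagePool *pool, const int16_t page[PAGE_SIZE])
{
    for (int i = 0; i < pool->count; i++)
        if (memcmp(pool->unique[i], page, PAGE_SIZE * sizeof(int16_t)) == 0)
            return pool->unique[i];  /* duplicate: reuse existing memory */

    int16_t *copy = malloc(PAGE_SIZE * sizeof(int16_t));
    memcpy(copy, page, PAGE_SIZE * sizeof(int16_t));
    pool->unique = realloc(pool->unique,
                           (pool->count + 1) * sizeof(*pool->unique));
    pool->unique[pool->count++] = copy;
    return copy;
}
```

With most of the Unicode range unassigned, the all-zero page is interned once and reused thousands of times, which is where the 2 MB to ~50 KB reduction would come from.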
Jun 08 2004
parent Kevin Bealer <Kevin_member pathlink.com> writes:
In article <ca4aap$11jk$1 digitaldaemon.com>, Hauke Duden says...
> Arcane Jill wrote:
>
>> In article <ca3vvv$gpf$1 digitaldaemon.com>, Hauke Duden says...
>>
>>> return lowerData[chr]
>>
>> Huh? There are over a million codepoints in Unicode (1,114,112, in fact). That must be a ----ing HUGE lookup table! Even if the data were only one byte per codepoint, that's a meg already.
>
> Not quite ;). If you look at the data, you'll see that only the first ~200,000 codepoints actually have case mappings. That still translates to about 2 MB of needed RAM (for all the tables together).
>
>> Of course it's reasonable to compress that. There are huge swathes of codepoints that are just totally unused (yet), and many, many blocks in which all characters have identical properties. One humungous lookup table is definitely not how I would have done it. Maybe I'm just not enough of a speed freak.
>
> Dividing the Unicode range into pages and using the same memory for identical ones is exactly the optimization I am thinking about doing. The question is whether the modest (by today's standards) decrease in needed RAM is worth the decrease in performance.
>
> Hauke

I've seen the idea of a two-level table mentioned on one of the Unicode sites; it was one of the recommended implementations for a lookup table. I used a version of this at IBM to handle ASCII/EBCDIC/Unicode translation.

I think doing an extra table lookup per character will not be a big issue, since the first table (the pointers to the other tables) will probably stay in cache.

Kevin
Jun 09 2004
prev sibling next sibling parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <ca3vvv$gpf$1 digitaldaemon.com>, Hauke Duden says...

> return chr + lowerData[chr>>9][chr & 511];

You can get faster than that.
       union U
       {
           dchar c;
           ubyte[3] n;
       }

       U u;
       u.c = chr;
       return lowerData[u.n[2]][u.n[1]][u.n[0]];

No bitshifting. You just need tables for blocks of 256 characters, then tables of 256 pointers to these, then one table of 17 pointers to those. Duplicate blocks (in particular, those for unassigned codepoints) can simply be the same block. Well, that's how I would have done it.

Jill
Jun 08 2004
next sibling parent Arcane Jill <Arcane_member pathlink.com> writes:
In article <ca46cu$rfj$1 digitaldaemon.com>, Arcane Jill says...
>        union U
>        {
>            dchar c;
>            ubyte[3] n;
>        }

That's assuming a little-endian architecture, of course. You'd need two versions to cover both possibilities.

Jill
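One way to sidestep the two-versions problem is to extract the three index bytes with shifts, which behaves identically on any architecture. A C sketch; the function name is made up:

```c
#include <stdint.h>

/* The union trick reads the low three bytes of a 32-bit codepoint, which
   lands on the right bytes only on little-endian machines. Plain shifts
   yield the same three indices regardless of byte order. */
void codepointBytes(uint32_t chr, uint8_t out[3])
{
    out[2] = (uint8_t)(chr >> 16);   /* plane index, 0..16 */
    out[1] = (uint8_t)(chr >> 8);    /* middle-table index */
    out[0] = (uint8_t)chr;           /* leaf-table index */
}
/* the lookup is then lowerData[out[2]][out[1]][out[0]] as in the post */
```

The cost is reintroducing the shifts the union was meant to avoid, so it trades a little speed for portability.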
Jun 08 2004
prev sibling parent Hauke Duden <H.NS.Duden gmx.net> writes:
Arcane Jill wrote:
> In article <ca3vvv$gpf$1 digitaldaemon.com>, Hauke Duden says...
>
>> return chr + lowerData[chr>>9][chr & 511];
>
> You can get faster than that.
>
>       union U
>       {
>           dchar c;
>           ubyte[3] n;
>       }
>
>       U u;
>       u.c = chr;
>       return lowerData[u.n[2]][u.n[1]][u.n[0]];
>
> No bitshifting. You just need tables for blocks of 256 characters, then tables of 256 pointers to these, then one table of 17 pointers to those. Duplicate blocks (in particular, those for unassigned codepoints) can simply be the same block. Well, that's how I would have done it.

Splitting the data into blocks and using the same memory for equal blocks is indeed the main idea behind the optimization. But doing 3 memory lookups with byte indices is probably slower than doing the bit shift and 2 lookups. Memory is usually the bottleneck, while shifts have been 1-cycle operations for ages (even on the 386, IIRC) and can be pipelined to some degree. Also, byte-sized operations are slower than int-sized ones on many CPUs.

Hauke
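The two access patterns being compared can be put side by side. A C sketch with toy tables and made-up names; both variants use case deltas as in Hauke's version, and the three-level variant is written with shifts here for portability, where the union trick would replace them with byte reads:

```c
#include <stdint.h>

static int16_t page[512];                 /* one shared 512-entry page */
static int16_t *twoLevel[0x110000 >> 9];  /* chr>>9 -> page */

static int16_t leaf[256];                 /* one shared leaf table */
static int16_t *mid[256];                 /* high byte -> leaf */
static int16_t **threeLevel[17];          /* plane -> mid */

static void initToyTables(void)
{
    for (int i = 0; i < 0x110000 >> 9; i++) twoLevel[i] = page;
    for (int i = 0; i < 256; i++) mid[i] = leaf;
    for (int i = 0; i < 17; i++) threeLevel[i] = mid;
    page['A'] = 32;                       /* toy mapping: 'A' -> 'a' */
    leaf['A'] = 32;
}

/* two dependent loads, plus shift/AND/add */
uint32_t lowerTwoLevel(uint32_t chr)
{
    return chr + twoLevel[chr >> 9][chr & 511];
}

/* three dependent loads; the union trick removes the shifts
   but the chain of loads remains */
uint32_t lowerThreeLevel(uint32_t chr)
{
    return chr + threeLevel[chr >> 16][(uint8_t)(chr >> 8)][(uint8_t)chr];
}
```

The loads are dependent (each index comes from the previous load), so with cold caches the three-level walk pays one extra memory latency per character, which is the bottleneck Hauke is pointing at.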
Jun 08 2004
prev sibling parent reply "Walter" <newshound digitalmars.com> writes:
It's such a huge reduction in memory usage, I'd go for it. Reducing memory
consumption has its own associated performance speedups - less swapping,
less memory for the gc to scan, etc. Keep the faster version there with a
version {} statement, though, in case experience shows otherwise.
Jun 08 2004
parent DemmeGod <me demmegod.com> writes:
I think what Walter means to say is that it'll never make it into Phobos
using 2 MB of RAM.  Yes?

John

On Tue, 08 Jun 2004 09:23:08 -0700, Walter wrote:

> It's such a huge reduction in memory usage, I'd go for it. Reducing memory
> consumption has its own associated performance speedups - less swapping,
> less memory for the gc to scan, etc. Keep the faster version there with a
> version {} statement, though, in case experience shows otherwise.

Jun 08 2004