
digitalmars.D - New Lookup Table (MixString)

Salih Dincer <salihdb hotmail.com> writes:
Hi,

This combines an InputRange with random access; it also stores a 
wchar in the otherwise unused upper bits of each dchar.

Please criticize this code. Each element of a UTF string is stored 
as a dchar, paired with its counterpart, and read back as a wchar. 
The code is self-explanatory. Do you think it's useful?

```d
import std.stdio, std.algorithm;
import std.range, std.conv;

enum alphabets
{
   u = "ÂABCÇDEFGHIJKLMNOPQRSTUVĞİWXYZÖŞÜÇÎÛ",
   l = "âabcçdefghıjklmnopqrstuvğiwxyzöşüçîû",

   ASCII65_95 = "AABCCDEFGHIJKLMNOPQRSTUVGIWXYZOSUCIU",
   ASCII96_127 = "aabccdefghijklmnopqrstuvgiwxyzosuciu",
   ASCII65_127 = ASCII65_95 ~ ASCII96_127
}

//enum dictU = alphabets.u.to!(wchar[]);
enum dictU = "ÂABCÇDEFGHIJKLMNOPQRSTUVĞİWXYZÖŞÜÇÎÛ".to!(wchar[]);
enum dictL = alphabets.l.to!(wchar[]);

struct MixString(T, T[] leftLiterals)
{
   size_t index;
   dchar[] dict;

   this(string d)
   {
     // load dictionary
     foreach(dchar c; d) dict ~= c;

     // place counterparts
     foreach(i, wchar c; leftLiterals)
     {
       dict[i] |= c << 16;
     }
   }

   // input range functions
   bool empty() { return index == dict.length; }
   T front() { return dict[index] & 0x0000_FFFF; }
   void popFront() { ++index; }

   // search elements
   auto nextIndexOf(wchar key)
   {
     scope(exit) index = 0;

     size_t i = 1;
     while(!empty)
     {
       if(front == key)
       {
         return i;
       } else i++;
       popFront();
     }
     return 0;
   }
}
//
alias ConvUpper = MixString!(wchar, dictU);
alias ConvLower = MixString!(wchar, dictL);

void main()
{
   auto test = ConvUpper(alphabets.l);/*
   foreach(wchar c; test)
   {
     c.writefln!"%4X: %s"(c);
   }//*/

   string text = "fıstıkçı şâhap bir insandır!";
   foreach(wchar c; text)
   {
     if(auto result = test.nextIndexOf(c))
     {
       wchar lookup = test.dict[result - 1] >> 16;
       lookup.write;
     } else
       c.write;
   }
   writeln;
}
/*

FISTIKÇI ŞÂHAP BİR İNSANDIR!

*/
```

SDB 79
Sep 02 2023
"Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:
Let's see:

O(n) search for alphabet index

Limited tables, that do not scale to other languages.

Tables limited to BMP.

Not particularly useful generally speaking, but with some improvements 
it may be useful in a limited capacity.



The search can be replaced with either binary search (when the 
probability of a particular character is unknown) or Fibonacci search 
(when the probability is known and skews towards the start of the ranges).
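
A minimal sketch of the binary-search variant, assuming the table stays within the BMP and is sorted once up front. `CasePair`, `buildTable`, and `convert` are hypothetical names that do not appear in the posted code, and the two alphabet strings are the standard Turkish letters, chosen only for illustration.

```d
import std.algorithm.sorting : sort;

// One table entry: a source code unit and its counterpart (hypothetical type).
struct CasePair { wchar src; wchar dst; }

// Pair the letters position by position, then sort by source letter
// so that the table can be searched with binary search.
CasePair[] buildTable(wstring src, wstring dst)
{
    assert(src.length == dst.length, "alphabets must have equal length");
    CasePair[] table;
    foreach (i, wchar c; src)
        table ~= CasePair(c, dst[i]);
    table.sort!((a, b) => a.src < b.src);
    return table;
}

// O(log n) lookup; letters that are not in the table are returned unchanged.
wchar convert(const CasePair[] table, wchar key)
{
    size_t lo = 0, hi = table.length;
    while (lo < hi)
    {
        immutable mid = lo + (hi - lo) / 2;
        if (table[mid].src < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return (lo < table.length && table[lo].src == key) ? table[lo].dst : key;
}

void main()
{
    import std.stdio : writeln;

    auto table = buildTable("abcçdefgğhıijklmnoöprsştuüvyz"w,
                            "ABCÇDEFGĞHIİJKLMNOÖPRSŞTUÜVYZ"w);
    wchar[] result;
    foreach (wchar c; "fıstıkçı"w)
        result ~= table.convert(c);
    writeln(result);
}
```

Unlike `nextIndexOf`, this lookup keeps no cursor state in the table, so nothing has to be reset between calls.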

Typically, such tables are implemented using a multi-level trie, which 
makes the lookup O(1). It costs more ROM, but is well worth it 
for the speed.
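
A rough sketch of the two-level idea, limited to the BMP and built at run time for brevity; production tables are usually generated ahead of time and cover all planes. `CaseTrie`, `insert`, `lookup`, and `buildTrie` are hypothetical names. `wchar.init` (0xFFFF, a noncharacter) is assumed here as the "no mapping" marker.

```d
// Two-stage table: the high byte of a code unit selects a block,
// the low byte selects the entry inside that block.
struct CaseTrie
{
    ushort[256] stage1;   // high byte -> block number (0 = shared empty block)
    wchar[256][] stage2;  // blocks; wchar.init (0xFFFF) means "no mapping"

    void insert(wchar src, wchar dst)
    {
        wchar[256] empty; // default-initialised to wchar.init
        if (stage2.length == 0)
            stage2 ~= empty;              // block 0, shared by all unmapped pages
        immutable hi = src >> 8;
        immutable lo = src & 0xFF;
        if (stage1[hi] == 0)
        {
            stage2 ~= empty;              // first mapping on this page: new block
            stage1[hi] = cast(ushort)(stage2.length - 1);
        }
        stage2[stage1[hi]][lo] = dst;
    }

    // Two array indexings regardless of how many letters are stored.
    wchar lookup(wchar c) const
    {
        if (stage2.length == 0)
            return c;
        immutable mapped = stage2[stage1[c >> 8]][c & 0xFF];
        return mapped != wchar.init ? mapped : c;
    }
}

// Pair two equally long alphabets position by position.
CaseTrie buildTrie(wstring src, wstring dst)
{
    assert(src.length == dst.length, "alphabets must have equal length");
    CaseTrie trie;
    foreach (i, wchar c; src)
        trie.insert(c, dst[i]);
    return trie;
}

void main()
{
    import std.stdio : writeln;

    auto trie = buildTrie("abcçdefgğhıijklmnoöprsştuüvyz"w,
                          "ABCÇDEFGĞHIİJKLMNOÖPRSŞTUÜVYZ"w);
    wchar[] result;
    foreach (wchar c; "fıstıkçı"w)
        result ~= trie.lookup(c);
    writeln(result);
}
```

Only pages that actually contain a mapping get their own 256-entry block, which is where the extra memory goes.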

Unicode Demystified covers the standard method for doing this sort of 
lookup as well as how to do the case conversion correctly. 
https://www.amazon.com/Unicode-Demystified-Practical-Programmers-Encoding/dp/0201700522
Sep 02 2023
Salih Dincer <salihdb hotmail.com> writes:
On Saturday, 2 September 2023 at 14:20:58 UTC, Richard (Rikki) 
Andrew Cattermole wrote:
 Let's see:

 O(n) search for alphabet index
I don't think speed is a big issue: an old text of about a thousand pages, using possibly 47 letters (kutadgu-bilig-fergana-holograph.txt, ~2 MB), is processed in under 1 second. That conversion includes reading from the file, finding the counterparts, and writing to the file.

For example:

```d
enum abece
{
   b = "AEINRLİDKMUYTBSOÜŞZGÇHĞVCÖPFJXWÂÎÛĖĀĪŪĦŜŊĠŻṬẒḲĮ".to!(wchar[]),
   k = "aeınrlidkmuytbsoüşzgçhğvcöpfjxwâîûėāīūħŝŋġżṭẓḳį".to!(wchar[]),
   ele = "gusiocCOISUG".to!(wchar[])
}

void main()
{
   alias MSbyk = MixString!(wchar, abece.b);
   enum bütünSözlük = "aeınrlidkmuytbsoüşzgçhğvcöpfjxwâîûėāīūħŝŋġżṭẓḳį"; // abece.k.to!string;
   auto büyük = MSbyk(bütünSözlük);

   // Source: https://archive.org/download/kutadgu-bilig-fergana-nushasi/681053_djvu.txt
   auto dosya = File("KutadguBilig.txt", "r");
   while (!dosya.eof)
   {
     foreach(wchar c; dosya.readln)
     {
       if(auto result = büyük.nextIndexOf(c))
       {
         wchar lookup = büyük.dict[result - 1] >> 16;
         lookup.write;
       }
       else
       {
         c.write;
       }
     }
     writeln;
   }
}
/*
pico enpi:~/Projeler/NewLookup$ time ./newLookupTable > result.txt

real    0m0,875s
user    0m0,859s
sys     0m0,016s
*/
```

On Saturday, 2 September 2023 at 14:20:58 UTC, Richard (Rikki) Andrew Cattermole wrote:
 Unicode Demystified covers the standard method for doing this 
 sort of lookup as well as how to do the case conversion 
 correctly. 
 https://www.amazon.com/Unicode-Demystified-Practical-Programmers-Encoding/dp/0201700522
Thank you, I will read the book you mentioned.

SDB 79
Sep 03 2023
Salih Dincer <salihdb hotmail.com> writes:
On Sunday, 3 September 2023 at 10:36:58 UTC, Salih Dincer wrote:
 For-example:
 ```d
 enum abece
 {
   b = 
 "AEINRLİDKMUYTBSOÜŞZGÇHĞVCÖPFJXWÂÎÛĖĀĪŪĦŜŊĠŻṬẒḲĮ".to!(wchar[]),
   k = 
 "aeınrlidkmuytbsoüşzgçhğvcöpfjxwâîûėāīūħŝŋġżṭẓḳį".to!(wchar[]),
   ele = "gusiocCOISUG".to!(wchar[])
 }

 void main()
 {
   alias MSbyk = MixString!(wchar, abece.b);
   enum bütünSözlük = 
 "aeınrlidkmuytbsoüşzgçhğvcöpfjxwâîûėāīūħŝŋġżṭẓḳį"; // 
 abece.k.to!string;
 ```

I wonder why I can't use abece.k directly. The error it gives is as follows:
 core.exception.ArrayIndexError newLookupTable.d(39):
 index [1] is out of bounds for array of length 1
SDB 79
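
A possible explanation, not confirmed anywhere in this thread: `std.conv.to!string` applied to an enum member yields the member's name rather than its value, so `abece.k.to!string` would be the one-character string "k". A `dict` of length 1 would match the reported error once the constructor reaches index 1. The sketch below uses a hypothetical enum `Abece` with a shortened alphabet to show the difference; `memberName` and `memberValue` are illustrative helpers.

```d
import std.conv : to;

// Hypothetical, shortened stand-in for the `abece` enum in the post.
enum Abece
{
    k = "aeı".to!(wchar[])
}

// to!string on the enum value itself gives the member's name.
string memberName()
{
    return Abece.k.to!string;
}

// Converting to the base type first gives the stored letters.
string memberValue()
{
    wchar[] raw = Abece.k;
    return raw.to!string;
}

void main()
{
    import std.stdio : writeln;

    writeln(memberName());
    writeln(memberValue());
}
```

If this is indeed the cause, passing the letters through the base type first (as `memberValue` does) should give the constructor the full alphabet.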
Sep 03 2023
"Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:
On 03/09/2023 10:36 PM, Salih Dincer wrote:
 I don't think speed is a big issue because a thousand pages and possibly 
 47 letters old text (kutadgu-bilig-fergana-holograph.txt: ~ 2 MB.) is 
 completed in under 1 second.
Yeah, your lookup table is small enough that it won't matter. The problem is that it won't scale. Unicode as a whole goes up to 0x10FFFF, with the first plane (the BMP) being 64k. Imagine trying to throw hardware at those sorts of numbers.
Sep 03 2023