www.digitalmars.com         C & C++   DMDScript  

digitalmars.D.learn - Replacing tango.text.Ascii.isearch

reply torhu <torhu yahoo.com> writes:
I need a case-insensitive check to see if a string contains 
another string for a "quick filter" feature. It should 
preferrably be perceived as instant by the user, and needs to 
check a few thousand strings in typical cases. Is a regex the 
best option, or what would you suggest?
Oct 05 2022
parent reply Steven Schveighoffer <schveiguy gmail.com> writes:
On 10/5/22 12:59 PM, torhu wrote:
 I need a case-insensitive check to see if a string contains another 
 string for a "quick filter" feature. It should preferrably be perceived 
 as instant by the user, and needs to check a few thousand strings in 
 typical cases. Is a regex the best option, or what would you suggest?
https://dlang.org/phobos/std_uni.html#asLowerCase ```d bool isearch(S1, S2)(S1 haystack, S2 needle) { import std.uni; import std.algorithm; return haystack.asLowerCase.canFind(needle.asLowerCase); } ``` untested. -Steve
Oct 05 2022
next sibling parent reply torhu <torhu yahoo.com> writes:
On Wednesday, 5 October 2022 at 17:29:25 UTC, Steven 
Schveighoffer wrote:
[...]


I wanted to do some quick benchmarking to figure out what works.

When I run this:

```d
import std.stdio;
import std.datetime.stopwatch;

void main()
{
	auto sw = StopWatch();	
	sw.stop();
	writeln(sw.peek().toString());
}
```

It prints this:
 2 weeks, 6 days, 9 hours, 34 minutes, 43 secs, 214 ms, 946 ╬╝s, 
 and 7 hnsecs
Am I doing something wrong here?
Oct 05 2022
next sibling parent reply torhu <torhu yahoo.com> writes:
On Wednesday, 5 October 2022 at 20:40:46 UTC, torhu wrote:

 Am I doing something wrong here?
Right, you can instantiate structs without arguments. It's been ten years since I last used D, I was thinking of structs like if they were classes.
Oct 05 2022
parent torhu <torhu yahoo.com> writes:
On Wednesday, 5 October 2022 at 20:45:55 UTC, torhu wrote:
 On Wednesday, 5 October 2022 at 20:40:46 UTC, torhu wrote:

 Am I doing something wrong here?
Right, you can instantiate structs without arguments. It's been ten years since I last used D, I was thinking of structs like if they were classes.
I think there should be sensible default here, seems like an easy trap to remove.
Oct 05 2022
prev sibling parent =?UTF-8?Q?Ali_=c3=87ehreli?= <acehreli yahoo.com> writes:
On 10/5/22 13:40, torhu wrote:

      auto sw = StopWatch();
Either this: auto sw = StopWatch(AutoStart.yes); or this: auto sw = StopWatch(); sw.start(); Ali
Oct 05 2022
prev sibling next sibling parent reply torhu <torhu yahoo.com> writes:
On Wednesday, 5 October 2022 at 17:29:25 UTC, Steven 
Schveighoffer wrote:

 ```d
 bool isearch(S1, S2)(S1 haystack, S2 needle)
 {
     import std.uni;
     import std.algorithm;
     return haystack.asLowerCase.canFind(needle.asLowerCase);
 }
 ```

 untested.

 -Steve
I did some basic testing, and regex was two orders of magnitude faster. So now I know, I guess.
Oct 05 2022
next sibling parent reply Siarhei Siamashka <siarhei.siamashka gmail.com> writes:
On Wednesday, 5 October 2022 at 21:50:32 UTC, torhu wrote:
 I did some basic testing, and regex was two orders of magnitude 
 faster. So now I know, I guess.
Substring search functionality is currently in a very bad shape in Phobos. I discovered this myself a few weeks ago when I was trying to solve the https://www.facebook.com/codingcompetitions/hacker-cup/2022/round-1/problems/A2 puzzle using D language. A part of the solution requires a fast substring search (actually a subarray search) and Phobos doesn't offer anything with even remotely acceptable performance. Phobos does have a Boyer-Moore implementation. This algorithm is historically famous in computer science as one of the first attempts to optimize substring search, but it also has pathologically bad performance on certain input data and probably shouldn't be recommended for any practical use nowadays. The users of old versions of Python discovered this too: https://codeforces.com/blog/entry/106849?#comment-952483 The standard 'find' function from the following simple example also becomes awfully slow when arrays 'a' and 'b' are large and/or are adversarially constructed: ```D import std; void main() { auto a = [1, 2, 3, 4]; auto b = [2, 3]; writefln("Is %s a subarray of %s? The answer is %s.", b, a, a.find(b).empty ? "no" : "yes"); } ``` I think that the best fit for your use case is the https://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm and there's an old issue in bugzilla about this: https://issues.dlang.org/show_bug.cgi?id=16066 BTW, if anyone is curious, one of the possible solutions for the hacker-cup/2022/round-1/problems/A2 puzzle in D language can be found here: https://codeforces.com/blog/entry/106768?#comment-952808
Oct 06 2022
parent Sergey <kornburn yandex.ru> writes:
On Thursday, 6 October 2022 at 08:15:10 UTC, Siarhei Siamashka 
wrote:
 On Wednesday, 5 October 2022 at 21:50:32 UTC, torhu wrote:
Please don’t tell us that D will be slower than Python again?)
Oct 06 2022
prev sibling parent reply rassoc <rassoc posteo.de> writes:
On 10/5/22 23:50, torhu via Digitalmars-d-learn wrote:
 I did some basic testing, and regex was two orders of magnitude faster. So now
I know, I guess.
And what kind of testing was that? Mind to share? Because I did the following real quick and wasn't able to measure a "two orders of magnitude" difference. Sure, the regex version came on top, but they were both faster than the ruby baseline I cooked up. First, generate a word file with 100k entries of various lengths: $> dmd -run words.d foobaz 100000 --- import std; string randomWord(ulong n) { static chars = letters.array; return generate!(() => chars.choice).take(n).text; } void main(string[] args) { enforce(args.length == 3, "Usage: dmd -run words.d needle num"); auto f = File("words.txt", "w"); foreach (i; 0..args[2].to!ulong) { ulong n = uniform(0, 50), m = uniform(0, 50); if (i % 2 == 0) f.writeln(randomWord(n), args[1], randomWord(m)); else f.writeln(randomWord(n + m)); } } --- And then for the actual measuring: $> dmd -O -version={range,regex} -of=search-{range,regex} search.d $> ldc -O -d-version={range,regex} -of=search-{range,regex}-ldc search.d $> time ./search-{range,regex}{,-ldc} foobaz --- import std; void main(string[] args) { enforce(args.length == 2, "Usage: search 'needle'"); version (regex) auto rx = regex(args[1], "i"); else version (range) auto needle = args[1].asLowerCase.text; else static assert(0, "use -version={regex,range}"); ulong matches; foreach (line; File("words.txt").byLine) { version (regex) if (line.matchFirst(rx)) matches++; version (range) if (line.asLowerCase.canFind(needle)) matches++; } writeln(matches); } ---
Oct 06 2022
parent reply torhu <torhu yahoo.com> writes:
On Thursday, 6 October 2022 at 21:36:48 UTC, rassoc wrote:

 And what kind of testing was that? Mind to share? Because I did 
 the following real quick and wasn't able to measure a "two 
 orders of magnitude" difference. Sure, the regex version came 
 on top, but they were both faster than the ruby baseline I 
 cooked up.
Originally I just loaded a one megabyte file and searched the whole thing. I changed it to split it into (40 000) lines instead, regex is about ten times faster then. I compile with -release -O -inline. Here is the second version: ```d import std; import std.datetime.stopwatch; enum FILE = "test.lst"; string text; string needle; void test(bool delegate(string haystack) dg) { auto sw = StopWatch(AutoStart.yes); int counter = 0; foreach (line; lineSplitter(text)) { if (dg(line)) counter++; } sw.stop(); writefln("%s", sw.peek()); writefln("counter: %s", counter); } void main(char[][] args) { enforce(args.length > 1, "Need a needle argument."); text = cast(string)read(FILE); needle = args[1].idup; auto re = regex(to!string(escaper(needle)), "i"); string needleLower = needle.toLower(); test((h) => !!h.matchFirst(re)); test((h) => h.asLowerCase().canFind(needleLower)); } ```
Oct 06 2022
parent reply rassoc <rassoc posteo.de> writes:
On 10/7/22 01:39, torhu via Digitalmars-d-learn wrote:
 regex is about ten times faster then.
Interesting! Using your code, I'm seeing a 1.5x max difference for ldc, nothing close to 10x. Welp, the woes of superficial benchmarking. :)
Oct 06 2022
parent reply Siarhei Siamashka <siarhei.siamashka gmail.com> writes:
On Friday, 7 October 2022 at 00:57:38 UTC, rassoc wrote:
 On 10/7/22 01:39, torhu via Digitalmars-d-learn wrote:
 regex is about ten times faster then.
Interesting! Using your code, I'm seeing a 1.5x max difference for ldc, nothing close to 10x. Welp, the woes of superficial benchmarking. :)
Benchmark results depend on many things, such as the actual text in both needle and haystack and the needle length. Are we dealing with unicode text by the way? One example is searching for something like "äußere" in https://www.gutenberg.org/ebooks/6343.txt.utf-8 If it's the source code, then searching for "sqlite3_value_bytes16" in the sqlite3.c file from https://www.sqlite.org/2022/sqlite-amalgamation-3390400.zip may be a good test too. I'm getting at least 5x difference in favor of regex with LDC on these two examples. Also are we allowed to artificially construct needle and haystack to blow up this test rather than only benchmarking it on typical real data?
Oct 06 2022
parent reply Siarhei Siamashka <siarhei.siamashka gmail.com> writes:
On Friday, 7 October 2022 at 06:34:50 UTC, Siarhei Siamashka 
wrote:
 Also are we allowed to artificially construct needle and 
 haystack to blow up this test rather than only benchmarking it 
 on typical real data?
Such as generating the input data via running: python -c "print(('a' * 49 + 'b') * 20000)" > test.lst And then using "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa" (the character 'a' replicated 50 times) as the needle to search for. Much longer needles work even better. In Linux the command line size is limited by 128K, so there's a huge room for improvement.
Oct 07 2022
parent reply bachmeier <no spam.net> writes:
On Friday, 7 October 2022 at 07:16:19 UTC, Siarhei Siamashka 
wrote:
 On Friday, 7 October 2022 at 06:34:50 UTC, Siarhei Siamashka 
 wrote:
 Also are we allowed to artificially construct needle and 
 haystack to blow up this test rather than only benchmarking it 
 on typical real data?
Such as generating the input data via running: python -c "print(('a' * 49 + 'b') * 20000)" > test.lst And then using "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa" (the character 'a' replicated 50 times) as the needle to search for. Much longer needles work even better. In Linux the command line size is limited by 128K, so there's a huge room for improvement.
https://www.cs.utexas.edu/users/moore/best-ideas/string-searching/ "the longer the pattern is, the faster the algorithm goes"
Oct 07 2022
parent reply Siarhei Siamashka <siarhei.siamashka gmail.com> writes:
On Friday, 7 October 2022 at 12:19:59 UTC, bachmeier wrote:
 https://www.cs.utexas.edu/users/moore/best-ideas/string-searching/

 "the longer the pattern is, the faster the algorithm goes"
Yes, that's how substring search works in the standard libraries of the other programming languages. Now please take the torhu's test program (posted a few comments above) and generate input for it using python -c "print(('a' * 49 + 'b') * 20000)" > test.lst Run it to search for "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa" (the character 'a' replicated 50 times) and then compare its performance against the other similar programs doing the same thing: Python: ```Python import sys assert len(sys.argv) >= 2, "Need a needle argument." needle = sys.argv[1].lower() print(sum([1 if line.lower().find(needle) != -1 else 0 for line in open("test.lst")])) ``` Ruby/Crystal: ```Ruby abort "Need a needle argument." unless ARGV.size >= 1 needle = ARGV[0].downcase puts File.open("test.lst").each_line.sum {|line| line.downcase.index(needle) ? 1 : 0 } ``` A generic substring search is tuned to be fast on any input in the other programming languages. But in Phobos a simpleminded slow search algorithm is used by default. The Boyer-Moore algorithm can be used too if it's explicitly requested, but it has some gotchas of its own.
Oct 07 2022
parent reply rassoc <rassoc posteo.de> writes:
On 10/8/22 00:50, Siarhei Siamashka via Digitalmars-d-learn wrote:
 On Friday, 7 October 2022 at 12:19:59 UTC, bachmeier wrote:
 python -c "print(('a' * 49 + 'b') * 20000)" > test.lst
That's generating a file with a single line: $> wc -l test.lst 1 test.lst Going with an appropriate 100k mixed line file and your mentioned needle, D is still quite a bit slower, but the results aren't as drastic and nowhere near "two orders of magnitude". $> crystal build --release -o search-cr search.cr abort "Need a needle argument." unless ARGV.size >= 1 needle = ARGV[0].downcase puts File.open("words.txt").each_line.count(&.downcase.includes? needle) $> ldc2 -O2 --release -of search-ldc search.d import std; void main(string[] args) { enforce(args.length > 1, "Need a needle argument."); auto needle = args[1].toLower; File("words.txt").byLine.count!(ln => ln.asLowerCase.canFind(needle)).writeln; }
Oct 07 2022
parent reply Siarhei Siamashka <siarhei.siamashka gmail.com> writes:
On Saturday, 8 October 2022 at 01:07:46 UTC, rassoc wrote:
 On 10/8/22 00:50, Siarhei Siamashka via Digitalmars-d-learn 
 wrote:
 On Friday, 7 October 2022 at 12:19:59 UTC, bachmeier wrote:
 python -c "print(('a' * 49 + 'b') * 20000)" > test.lst
That's generating a file with a single line: $> wc -l test.lst 1 test.lst
If you insist on having 100K lines in the input data, then you can try a different test (run it as a shell script): ```bash python -c "print(('a' * 9999 + 'b') * 89 + '\\n' * 99999 + 'a' * 10000)" > words.txt cp words.txt test.lst wc -l words.txt NEEDLE=`python -c "print('a' * 10000, end='')"` echo "=== Python ===" time python search.py "$NEEDLE" echo "=== Ruby ===" time ruby search.rb "$NEEDLE" echo "=== Crystal ===" time ./search-cr "$NEEDLE" echo "=== D (LDC) ===" time ./search-ldc "$NEEDLE" ``` It is testing a 1MB file with 100K lines and a 10K characters long needle. Results from my computer (with Python 3.10, because earlier versions of Python are slower): ``` 100000 words.txt === Python === 1 real 0m0.083s user 0m0.072s sys 0m0.010s === Ruby === 1 real 0m0.313s user 0m0.313s sys 0m0.000s === Crystal === 1 real 0m1.492s user 0m1.482s sys 0m0.010s === D (LDC) === 1 real 1m10.314s user 1m10.234s sys 0m0.050s ```
 Going with an appropriate 100k mixed line file and your 
 mentioned needle, D is still quite a bit slower, but the 
 results aren't as drastic
Your "appropriate" file is also entirely artificial and happens to be specifically crafted to work best with the current ".canFind" implementation from Phobos. The primary source of slowness are partial matches. Such as the `"int i = "` prefix when searching for `"int i = 0;"` substring in a string, which contains `"int i = 1;"`. The time is wasted on comparing the initial 8 characters before encountering a mismatch and bailing out. But when searching in a randomly generated gibberish, the chances of encountering long partial matches are very low. At least lower than in the real text files.
 and nowhere near "two orders of magnitude".
Does the difference really have to be two orders of magnitude for you to acknowledge that there might be a performance problem in Phobos? Moreover, you are quoting torhu, who compared two D implementations compiled by DMD (regex vs. canFind) rather than standard libraries of different programming languages.
 ```D
 import std;
 void main(string[] args) {
     enforce(args.length > 1, "Need a needle argument.");
     auto needle = args[1].toLower;
     File("words.txt").byLine.count!(ln => 
 ln.asLowerCase.canFind(needle)).writeln;
 }
 ```
It's a nicely looking one-liner implementation in D language and everything is fine. Except that similar one-liners implemented using other programming languages are faster and more versatile (can handle any input data without catastrophic performance pitfalls). BTW, changing ".asLowerCase" to ".toLower" introduces an extra memory allocation, but still significantly improves handling of the worst case. Conversion to lowercase is expensive for unicode characters and revisiting the same character to repeat such conversion again and again is bad for performance.
Oct 08 2022
parent rassoc <rassoc posteo.de> writes:
On 10/9/22 03:08, Siarhei Siamashka via Digitalmars-d-learn wrote:
 Does the difference really have to be two orders of magnitude for you to
acknowledge that there might be a performance problem in Phobos? [...] Except
that similar one-liners implemented using other programming languages are
faster and more versatile (can handle any input data without catastrophic
performance pitfalls).
Oh, I get all that, there's no reason to argue with me or win me over, I can see that the implementation is subpar. Since I'm never hitting this performance bottleneck in my code, and being a regular dev and not a core maintainer, I simply haven't been motivated enough to contribute an improvement. Change like that isn't happening in the forums. Various optimizations have made it into Phobos in the past, don't think you would get any pushback if you can show that a new implementation improves the situation in almost all cases while maintaining compatibility.
Oct 09 2022
prev sibling parent reply bauss <jacobbauss gmail.com> writes:
On Wednesday, 5 October 2022 at 17:29:25 UTC, Steven 
Schveighoffer wrote:
 On 10/5/22 12:59 PM, torhu wrote:
 I need a case-insensitive check to see if a string contains 
 another string for a "quick filter" feature. It should 
 preferrably be perceived as instant by the user, and needs to 
 check a few thousand strings in typical cases. Is a regex the 
 best option, or what would you suggest?
https://dlang.org/phobos/std_uni.html#asLowerCase ```d bool isearch(S1, S2)(S1 haystack, S2 needle) { import std.uni; import std.algorithm; return haystack.asLowerCase.canFind(needle.asLowerCase); } ``` untested. -Steve
This doesn't actually work properly in all languages. It will probably work in most, but it's not entirely correct. Ex. Turkish will not work with it properly. Very interesting article: http://www.moserware.com/2008/02/does-your-code-pass-turkey-test.html
Oct 13 2022
next sibling parent reply rikki cattermole <rikki cattermole.co.nz> writes:
On 13/10/2022 9:27 PM, bauss wrote:
 This doesn't actually work properly in all languages. It will probably 
 work in most, but it's not entirely correct.
 
 Ex. Turkish will not work with it properly.
 
 Very interesting article: 
 http://www.moserware.com/2008/02/does-your-code-pass-turkey-test.html
Yes turkic languages, they require a state machine and quite a bit of LUTs to work correctly. You also need to provide a language and it has to operate on the whole string, not individual characters. I didn't think it was relevant since Ascii was in the original post ;)
Oct 13 2022
parent reply bauss <jacobbauss gmail.com> writes:
On Thursday, 13 October 2022 at 08:30:04 UTC, rikki cattermole 
wrote:
 On 13/10/2022 9:27 PM, bauss wrote:
 This doesn't actually work properly in all languages. It will 
 probably work in most, but it's not entirely correct.
 
 Ex. Turkish will not work with it properly.
 
 Very interesting article: 
 http://www.moserware.com/2008/02/does-your-code-pass-turkey-test.html
Yes turkic languages, they require a state machine and quite a bit of LUTs to work correctly. You also need to provide a language and it has to operate on the whole string, not individual characters. I didn't think it was relevant since Ascii was in the original post ;)
I think it's relevant when it comes to D since D is arguably a unicode language, not ascii. D should strive to be correct, rather than fast.
Oct 13 2022
parent reply bauss <jacobbauss gmail.com> writes:
On Thursday, 13 October 2022 at 08:35:50 UTC, bauss wrote:
 On Thursday, 13 October 2022 at 08:30:04 UTC, rikki cattermole 
 wrote:
 On 13/10/2022 9:27 PM, bauss wrote:
 This doesn't actually work properly in all languages. It will 
 probably work in most, but it's not entirely correct.
 
 Ex. Turkish will not work with it properly.
 
 Very interesting article: 
 http://www.moserware.com/2008/02/does-your-code-pass-turkey-test.html
Yes turkic languages, they require a state machine and quite a bit of LUTs to work correctly. You also need to provide a language and it has to operate on the whole string, not individual characters. I didn't think it was relevant since Ascii was in the original post ;)
I think it's relevant when it comes to D since D is arguably a unicode language, not ascii. D should strive to be correct, rather than fast.
Oh and to add onto this, IFF you have to do it the hacky way, then converting to uppercase instead of lowercase should be preferred, because not all lowercase characters can perform round trip, although a small group of characters, then using uppercase fixes it, so that's a relatively easy fix. A round trip is basically converting characters from one culture to another and then back. It's impossible with some characters when converting to lowercase, but should always be possible when converting to uppercase.
Oct 13 2022
parent reply rikki cattermole <rikki cattermole.co.nz> writes:
On 13/10/2022 9:42 PM, bauss wrote:
 Oh and to add onto this, IFF you have to do it the hacky way, then 
 converting to uppercase instead of lowercase should be preferred, 
 because not all lowercase characters can perform round trip, although a 
 small group of characters, then using uppercase fixes it, so that's a 
 relatively easy fix. A round trip is basically converting characters 
 from one culture to another and then back. It's impossible with some 
 characters when converting to lowercase, but should always be possible 
 when converting to uppercase.
You will want to repeat this process with normalize to NFKC and normalize to NFD before transforming. Otherwise there is a possibility that you will miss some transformations as the simplified mappings are 1:1 for characters and not everything is representable as a single character.
Oct 13 2022
parent reply bauss <jacobbauss gmail.com> writes:
On Thursday, 13 October 2022 at 08:48:49 UTC, rikki cattermole 
wrote:
 On 13/10/2022 9:42 PM, bauss wrote:
 Oh and to add onto this, IFF you have to do it the hacky way, 
 then converting to uppercase instead of lowercase should be 
 preferred, because not all lowercase characters can perform 
 round trip, although a small group of characters, then using 
 uppercase fixes it, so that's a relatively easy fix. A round 
 trip is basically converting characters from one culture to 
 another and then back. It's impossible with some characters 
 when converting to lowercase, but should always be possible 
 when converting to uppercase.
You will want to repeat this process with normalize to NFKC and normalize to NFD before transforming. Otherwise there is a possibility that you will miss some transformations as the simplified mappings are 1:1 for characters and not everything is representable as a single character.
Yeah, text isn't easy :D
Oct 13 2022
parent rikki cattermole <rikki cattermole.co.nz> writes:
On 13/10/2022 9:55 PM, bauss wrote:
 Yeah, text isn't easy :D
Indeed! It has me a bit concerned actually, I'm wondering if my string stuff will even work correctly for UI's due to performance issues. My string builder for instance allocates like crazy just to do slicing. But hey, at least I can feel confident that my general purpose allocator & infrastructure is working correctly!
Oct 13 2022
prev sibling next sibling parent Patrick Schluter <Patrick.Schluter bbox.fr> writes:
On Thursday, 13 October 2022 at 08:27:17 UTC, bauss wrote:
 On Wednesday, 5 October 2022 at 17:29:25 UTC, Steven 
 Schveighoffer wrote:
 On 10/5/22 12:59 PM, torhu wrote:
 I need a case-insensitive check to see if a string contains 
 another string for a "quick filter" feature. It should 
 preferrably be perceived as instant by the user, and needs to 
 check a few thousand strings in typical cases. Is a regex the 
 best option, or what would you suggest?
https://dlang.org/phobos/std_uni.html#asLowerCase ```d bool isearch(S1, S2)(S1 haystack, S2 needle) { import std.uni; import std.algorithm; return haystack.asLowerCase.canFind(needle.asLowerCase); } ``` untested. -Steve
This doesn't actually work properly in all languages. It will probably work in most, but it's not entirely correct. Ex. Turkish will not work with it properly.
Greek will also be problematic. 2 different lowercase sigmas but only 1 uppercase. Other languages that may make issues, German where normally ß uppercases as SS (or not) but not the other way round, but here we already arrived to Unicode land and the normalization conundrum.
Oct 13 2022
prev sibling parent reply Siarhei Siamashka <siarhei.siamashka gmail.com> writes:
On Thursday, 13 October 2022 at 08:27:17 UTC, bauss wrote:
 ```d
 bool isearch(S1, S2)(S1 haystack, S2 needle)
 {
     import std.uni;
     import std.algorithm;
     return haystack.asLowerCase.canFind(needle.asLowerCase);
 }
 ```

 untested.

 -Steve
This doesn't actually work properly in all languages. It will probably work in most, but it's not entirely correct. Ex. Turkish will not work with it properly. Very interesting article: http://www.moserware.com/2008/02/does-your-code-pass-turkey-test.html
Wow, I didn't expect anything like this and just thought that the nightmares of handling 8-bit codepages for non-English languages ceased to exist nowadays. Too bad. What are the best practices to deal with Turkish text in D language? For example, [Ukrainian letters 'і' and 'І'](https://en.wikipedia.org/wiki/Dotted_I_(Cyrillic)) don't share the same codes with Latin 'i' and 'I' and this is working fine. Except for a possible [phishing opportunity](https://www.theguardian.com/technology/2017/apr/19/phishing url-trick-hackers). Why haven't the standard committees done the same for Turkish 'I' yet? As for the [German letter 'ß'](https://en.wikipedia.org/wiki/%C3%9F), wikipedia says that the uppercase variant 'ẞ' exists since 2008 (ISO 10646). Do German people use it now? ```D import std; void main() { "ß".asUpperCase.writeln; // prints "SS" "ẞ".asLowerCase.writeln; // prints "ß" "ẞ".asLowerCase.asUpperCase.writeln; // prints "SS" } ```
Oct 24 2022
parent reply rikki cattermole <rikki cattermole.co.nz> writes:
On 25/10/2022 5:17 PM, Siarhei Siamashka wrote:
 Wow, I didn't expect anything like this and just thought that the 
 nightmares of handling 8-bit codepages for non-English languages ceased 
 to exist nowadays. Too bad. What are the best practices to deal with 
 Turkish text in D language?
std.uni doesn't support it. For casing it only supports the simple mappings which are 1:1 and not language dependent. I haven't got to it yet for my own string handling library, so I can't point you to that (even if it was not ready). I'm sure somebody has got it but you may end up wanting to use ICU unfortunately.
Oct 24 2022
parent reply Siarhei Siamashka <siarhei.siamashka gmail.com> writes:
On Tuesday, 25 October 2022 at 06:32:00 UTC, rikki cattermole 
wrote:
 On 25/10/2022 5:17 PM, Siarhei Siamashka wrote:
 What are the best practices to deal with Turkish text in D 
 language?
std.uni doesn't support it.
OK, I'm not specifically interested in this personally and I even would be happy to remain blissfully ignorant. Just wondered whether a preferred solution already exists, considering that this forum has a Turkish section. Should we ignore the `"D should strive to be correct, rather than fast"` comment from bauss for now? Or some actions can be taken to improve the current situation?
Oct 25 2022
parent reply rikki cattermole <rikki cattermole.co.nz> writes:
On 26/10/2022 6:06 PM, Siarhei Siamashka wrote:
 Should we ignore the `"D should strive to be correct, rather than fast"` 
 comment from bauss for now? Or some actions can be taken to improve the 
 current situation?
Bauss is correct. It should be implemented but it does not need to be fast. But yeah, if you are able to ignore that Unicode is a thing, I'd recommend it. It is complicated, as we humans are very complicated ;)
Oct 25 2022
parent reply Siarhei Siamashka <siarhei.siamashka gmail.com> writes:
On Wednesday, 26 October 2022 at 05:17:06 UTC, rikki cattermole 
wrote:
 if you are able to ignore that Unicode is a thing, I'd 
 recommend it. It is complicated, as we humans are very 
 complicated ;)
I can't ignore Unicode, because I frequently have to deal with Cyrillic alphabet ;) Also Unicode is significantly simpler than a set of various incompatible 8-bit encodings (such as [CP1251](https://en.wikipedia.org/wiki/Windows-1251) vs. variants of [KOI-8](https://en.wikipedia.org/wiki/KOI-8) vs. [ISO/IEC 8859-5](https://en.wikipedia.org/wiki/ISO/IEC_8859-5)) that were simultaneously in use earlier and caused a lot of pain. But I'm surely able to ignore the peculiarities of modern Turkish Unicode and wait for the other people to come up with a solution for D language if they really care.
Oct 25 2022
next sibling parent rikki cattermole <rikki cattermole.co.nz> writes:
On 26/10/2022 6:49 PM, Siarhei Siamashka wrote:
 On Wednesday, 26 October 2022 at 05:17:06 UTC, rikki cattermole wrote:
 if you are able to ignore that Unicode is a thing, I'd recommend it. 
 It is complicated, as we humans are very complicated ;)
I can't ignore Unicode, because I frequently have to deal with Cyrillic alphabet ;) Also Unicode is significantly simpler than a set of various incompatible 8-bit encodings (such as [CP1251](https://en.wikipedia.org/wiki/Windows-1251) vs. variants of [KOI-8](https://en.wikipedia.org/wiki/KOI-8) vs. [ISO/IEC 8859-5](https://en.wikipedia.org/wiki/ISO/IEC_8859-5)) that were simultaneously in use earlier and caused a lot of pain. But I'm surely able to ignore the peculiarities of modern Turkish Unicode and wait for the other people to come up with a solution for D language if they really care.
Cyrillic isn't an issue. Lithuanian, Turkish and Azeri are the ones with the biggest issues. There is a bunch of non-simple mappings for Latin, Armenian and Greek, but they are not language dependent. There is six conditional ones which are all Greek. So if you are not dealing with these languages (even if you are, a simple replace should be easy to do for most), you should be fine with the simple mappings supported by std.uni.
Oct 25 2022
prev sibling parent reply =?UTF-8?Q?Ali_=c3=87ehreli?= <acehreli yahoo.com> writes:
On 10/25/22 22:49, Siarhei Siamashka wrote:

 Unicode is significantly simpler than a set of various
 incompatible 8-bit encodings
Strongly agreed.
 I'm surely
 able to ignore the peculiarities of modern Turkish Unicode
The problem with Unicode is its main aim of allowing characters of multiple writing systems in the same text. When multiple writing systems are in play, conflicts and ambiguities will appear.
 and wait for
 the other people to come up with a solution for D language if they
 really care.
I solved my problem by writing an Alphabet hierarchy in the past. I don't like that code but it still works: https://bitbucket.org/acehreli/ddili/src/4c0552fe8352dfe905c9734a57d84d36ce4ed476/src/alphabet.d#lines-50 It handles capitalization, ordering, etc. I use it when preparing the Index section of the Turkish edition of "Programming in D": http://ddili.org/ders/d/ix.html One of the ambiguities is what came up on this thread: Should a word that starts with I (capital i) be listed under I (because it's Turkish) or under İ (because it's English)? So far, I am lucky because the only word that starts with I happens to be the English "IDE", so it goes under i (which appears as İ in the Turkish edition) and would make sense to a Turkish reader because a Turkish reader might (really?) accept it as the capital of ide. It's confusing but it seems to work. :) It doesn't matter. Life is imperfect and things will somehow work in the end. Ali
Oct 25 2022
parent reply Siarhei Siamashka <siarhei.siamashka gmail.com> writes:
On Wednesday, 26 October 2022 at 06:05:14 UTC, Ali Çehreli wrote:
 The problem with Unicode is its main aim of allowing characters 
 of multiple writing systems in the same text. When multiple 
 writing systems are in play, conflicts and ambiguities will 
 appear.
I personally don't think that it's the problem of the Unicode itself. Based on what I can see, it looks like the individuals or the committees responsible for mapping the Turkish alphabet to Unicode just made a blunder. For example, let's compare the Latin uppercase "B" and the Cyrillic uppercase "В". Looks exactly the same, right? Would it be a smart idea for them to share the same index in the Unicode table? But wait. What happens if we convert these letters to lowercase? The Latin "B" becomes "b" and the Cyrillic "В" becomes "в". Oops! So by having different indexes for the Latin uppercase "B" and the Cyrillic uppercase "В", we dodged a whole bunch of nasty problems. Another example. Patrick Schluter mentioned the Greek sigma letter and the [wikipedia article](https://en.wikipedia.org/wiki/Sigma) says: "uppercase Σ, lowercase σ, lowercase in word-final position ς", which makes everything rather problematic. Now let's compare this to the Belarusian language and its letter "у". The Belarusian "у" transforms into "ў" depending on context, however this transformation doesn't happen for the first letter of proper nouns or in acronyms (and this theoretically makes the uppercase "ў" redundant). Just imagine an alternative Greek-inspired reality, where both "у" and "ў" uppercase to "У". And yet the uppercase "Ў" exists in Unicode, so luckily in our reality we don't have to deal with uppercase/lowercase round trip failures. This is computers friendly. And as I already mentioned in an earlier comment, the Germans also got the uppercase "ẞ" in Unicode since 2008 (better late than never).
 I solved my problem by writing an Alphabet hierarchy in the 
 past. I don't like that code but it still works:

 [...]

 It's confusing but it seems to work. :) It doesn't matter. Life 
 is imperfect and things will somehow work in the end.
What's your opinion/conclusion? Is it fine the way it is? Do you think that some unique property of the Turkish language/alphabet made these difficulties unavoidable? Or do you think that it was a mistake, but now it has to live with us forever for compatibility reasons? Anything else? And as for the D language and Phobos, should "ß" still uppercase to "SS"? Or can we change it to uppercase "ẞ" and remove German from the list of tricky languages at https://dlang.org/library/std/uni/to_upper.html ? Should Turkish be listed there?
Oct 28 2022
parent rikki cattermole <rikki cattermole.co.nz> writes:
On 29/10/2022 11:05 AM, Siarhei Siamashka wrote:
 And as for the D language and Phobos, should "ß" still uppercase to 
 "SS"? Or can we change it to uppercase "ẞ" and remove German from the 
 list of tricky languages at 
 https://dlang.org/library/std/uni/to_upper.html ? Should Turkish be 
 listed there?
That particular function, is based upon the simple mappings provided by UnicodeData.txt and (should be) in compliance of the Unicode standard. The only thing we need to do is regenerate the tables backing it whenever Unicode updates. Note the behavior you are asking for is defined in the Unicode database file SpecialCasing.txt which have not been implemented. ``` titlecase(uppercase(<es-zed>)) ``` That file is how you support languages like Turkish. We currently don't have it implemented. It requires operating on a whole string and to pass in what language rules to apply (i.e. Turkish, Azeri).
Oct 28 2022