digitalmars.D - Fixing dub search

aberba <karabutaworld gmail.com> writes:

Current dub registry search is inaccurate because it uses the 
built-in MongoDB search which isn't designed for accurate search.

To fix this, a real search engine is needed. ElasticSearch is an 
overkill for what we need... basic accurate string search.

Solution: use MeiliSearch. It's very lightweight and fast (1GB 
vps is more than enough). Very easy to use... just a REST API 
call. I already have a package for meilisearch on dub.


What's needed: a hosted running instance of MeiliSearch for use 
in dub search. Since only the search functionality needs to be 
fixed, MeiliSearch will index a copy of all packages and 
re-index when they change. The MeiliSearch index will handle 
search queries whilst MongoDB continues to handle everything else.


I can make a PR for the MeiliSearch integration but I need to 
know whether the foundation is willing to host a MeiliSearch 
instance for that.
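
A minimal sketch of what the two REST calls described above might look like, written in Python for illustration only. It builds the requests without sending them. The `/indexes/{uid}/documents` and `/indexes/{uid}/search` routes are MeiliSearch's documented endpoints; the index name `packages`, the `id` field used as primary key, and the local address are assumptions. The authentication header has changed between MeiliSearch releases (older ones used `X-Meili-API-Key`), so check the documentation for the deployed version.

```python
import json
import urllib.request

MEILI_URL = "http://localhost:7700"  # assumption: a local instance on the default port
INDEX_UID = "packages"               # hypothetical index name


def _post(path, payload, api_key):
    """Build (but do not send) a JSON POST request to the MeiliSearch instance."""
    return urllib.request.Request(
        MEILI_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumption: bearer-token auth, as used by recent releases.
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def build_index_request(packages, api_key):
    """Request that adds or replaces package documents in the index."""
    return _post(f"/indexes/{INDEX_UID}/documents", packages, api_key)


def build_search_request(query, api_key, limit=20):
    """Request that runs a search query against the index."""
    return _post(f"/indexes/{INDEX_UID}/search", {"q": query, "limit": limit}, api_key)
```

Sending a request would be `urllib.request.urlopen(build_search_request("vibe", key))`; the registry would call the indexing request whenever a package is added or updated.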

Dec 28 2020

Imperatorn <johan_forsberg_86 hotmail.com> writes:

On Monday, 28 December 2020 at 10:49:58 UTC, aberba wrote:
 Current dub registry search is inaccurate because it uses the 
 built-in MongoDB search which isn't designed for accurate 
 search.

 [...]

https://github.com/dlang/dub-registry/pull/481

Dec 28 2020

aberba <karabutaworld gmail.com> writes:

On Monday, 28 December 2020 at 18:55:09 UTC, Imperatorn wrote:
 On Monday, 28 December 2020 at 10:49:58 UTC, aberba wrote:
 Current dub registry search is inaccurate because it uses the 
 built-in MongoDB search which isn't designed for accurate 
 search.

 [...]

 https://github.com/dlang/dub-registry/pull/481

I've sent him an email about using MeiliSearch instead of a hack

Dec 28 2020

Imperatorn <johan_forsberg_86 hotmail.com> writes:

On Monday, 28 December 2020 at 19:15:33 UTC, aberba wrote:
 On Monday, 28 December 2020 at 18:55:09 UTC, Imperatorn wrote:
 On Monday, 28 December 2020 at 10:49:58 UTC, aberba wrote:
 Current dub registry search is inaccurate because it uses the 
 built-in MongoDB search which isn't designed for accurate 
 search.

 [...]

 https://github.com/dlang/dub-registry/pull/481

 I've sent him an email about using MeiliSearch instead of a hack

👍

Dec 29 2020

sarn <sarn theartofmachinery.com> writes:

On Monday, 28 December 2020 at 10:49:58 UTC, aberba wrote:
 To fix this, a real search engine is needed. ElasticSearch is 
 an overkill for what we need... basic accurate string search.

 Solution: use MeiliSearch. It's very lightweight and fast (1GB 
 vps is more than enough). Very easy to use... just a REST API 
 call. I already have a package for meilisearch on dub.

ElasticSearch also has a simple REST API and would do this job on 
whatever hardware we'd realistically use.  I'm not a huge ES fan, 
personally, but do you have more reasons to dismiss it as 
overkill and recommend MeiliSearch instead?

The best place for discussion is here, though:
https://github.com/dlang/dub-registry/issues/93

But I have to say something again: please, please, please, I beg, 
consider using an embedded search tool before adding an external 
server (or, worse, an external SaaS) as a runtime dependency to 
the dub registry.  There are only a few thousand packages, and 
they don't update much.  Even grepping the whole dataset every 
request would be fast enough (just not featureful enough).
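
To illustrate the "grep the whole dataset" baseline, here is a minimal Python sketch of a linear scan. The package records and their `name`/`description` fields are made up for the example. It shows why this is fast but not featureful: there is no ranking, stemming, or typo tolerance.

```python
def grep_packages(packages, query):
    """Return packages whose name or description contains every query word."""
    words = query.lower().split()
    hits = []
    for pkg in packages:
        haystack = f"{pkg['name']} {pkg['description']}".lower()
        # Plain substring test per word; results keep the input order.
        if all(word in haystack for word in words):
            hits.append(pkg)
    return hits


# Hypothetical sample data, not taken from the real registry.
SAMPLE = [
    {"name": "vibe-d", "description": "Asynchronous I/O and web framework"},
    {"name": "requests", "description": "HTTP client library"},
    {"name": "mir", "description": "Numeric library"},
]
```

For a few thousand short records a scan like this is a handful of string comparisons per request, which is the point being made above.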

Dec 28 2020

aberba <karabutaworld gmail.com> writes:

On Monday, 28 December 2020 at 20:49:12 UTC, sarn wrote:
 On Monday, 28 December 2020 at 10:49:58 UTC, aberba wrote:
 To fix this, a real search engine is needed. ElasticSearch is 
 an overkill for what we need... basic accurate string search.

 Solution: use MeiliSearch. It's very lightweight and fast (1GB 
 vps is more than enough). Very easy to use... just a REST API 
 call. I already have a package for meilisearch on dub.

 ElasticSearch also has a simple REST API and would do this job 
 on whatever hardware we'd realistically use.  I'm not a huge ES 
 fan, personally, but do you have more reasons to dismiss it as 
 overkill and recommend MeiliSearch instead?

 The best place for discussion is here, though:
 https://github.com/dlang/dub-registry/issues/93

 But I have to say something again: please, please, please, I 
 beg, consider using an embedded search tool before adding an 
 external server (or, worse, an external SaaS) as a runtime 
 dependency to the dub registry.  There are only a few thousand 
 packages, and they don't update much.  Even grepping the whole 
 dataset every request would be fast enough (just not featureful 
 enough).

If you've looked at the very discussion you referenced, you'd 
realize they went around and still came back to using mongodb for 
search.

Not only is elasticsearch built in Java, hence bloatware, it's 
also designed to do more than just search... hence more bloatware 
and overkill for just basic search. You may compare the size of 
elastic with meilisearch which is just a small binary. 
MeiliSearch can run on very little ram...

The accuracy you'd get from a search engine just isn't possible 
with brute-force and hacks. Search is a complex problem involving 
stemming, plurals, synonyms, stop words, ranking, etc. You'd want 
to use a real search engine.

And between elasticsearch and MeiliSearch, MeiliSearch is 
simpler, lightweight and easy to use.

Dec 29 2020

sarn <sarn theartofmachinery.com> writes:

On Tuesday, 29 December 2020 at 08:45:02 UTC, aberba wrote:
 On Monday, 28 December 2020 at 20:49:12 UTC, sarn wrote:
 The best place for discussion is here, though:
 https://github.com/dlang/dub-registry/issues/93

 If you've looked at the very discussion you referenced, you'd 
 realize they went around and still came back to using mongodb 
 for search.

I read it.  It's the bug report for this issue, and it's the 
discussion thread for this community project.  There's no them vs 
you.  There isn't even a final decision in that thread.  
MongoDB is being used now because that's what's implemented.

 And between elasticsearch and MeiliSearch, MeiliSearch is 
 simpler, lightweight and easy to use.

Have you considered anything other than ElasticSearch, 
MeiliSearch and custom hacks?  There are at least four other 
options mentioned in the thread I linked to.  Maybe add 
MeiliSearch and your reasons for using it.

Dec 29 2020

Ola Fosheim Grøstad <ola.fosheim.grostad gmail.com> writes:

On Tuesday, 29 December 2020 at 22:27:17 UTC, sarn wrote:
 Have you considered anything other than ElasticSearch, 
 MeiliSearch and custom hacks?  There are at least four other 
 options mentioned in the thread I linked to.  Maybe add 
 MeiliSearch and your reasons for using it.

The easiest option is to see if an indexing service (like the 
mentioned Algolia) is willing to sponsor Dub as an open source 
project, then they get some free advertising in return.

Dec 29 2020

James Blachly <james.blachly gmail.com> writes:

On 12/29/20 5:34 PM, Ola Fosheim Grøstad wrote:
 The easiest option is to see if an indexing service (like the mentioned 
 Algolia) is willing to sponsor Dub as an open source project, then they 
 get some free advertising in return.
 

This is an excellent suggestion!

Dec 30 2020

sarn <sarn theartofmachinery.com> writes:

On Thursday, 31 December 2020 at 02:39:44 UTC, James Blachly 
wrote:
 On 12/29/20 5:34 PM, Ola Fosheim Grøstad wrote:
 The easiest option is to see if an indexing service (like the 
 mentioned Algolia) is willing to sponsor Dub as an open source 
 project, then they get some free advertising in return.
 

 This is an excellent suggestion!

Here's one problem: new contributors won't have access to the 
third-party service and will need to set up their own indexes 
from scratch, but the scripts or whatever to do that probably 
won't be maintained because existing contributors won't need them.  
It's healthier for the 
project if anyone can just download the codebase and hack on it 
without signing up for some third party service.

By the way, I think there's a good project in here and I'm 
willing to contribute my 2c and maybe more, but I know I'll lose 
any discussion in this thread.  I'm following this one: 
https://github.com/dlang/dub-registry/issues/93

Jan 01 2021

Ola Fosheim Grøstad <ola.fosheim.grostad gmail.com> writes:

On Friday, 1 January 2021 at 11:02:49 UTC, sarn wrote:
 Here's one problem: new contributors won't have access to the 
 third-party service and will need to set up their own indexes 
 from scratch, but the scripts or whatever to do that probably

Two standard solutions:

1. make a tiny webservice for it; have one set up for production 
and another one for testing.

2. make a tiny local in-memory emulator for it (no need for 
advanced matching or ranking)
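
A rough sketch of the second option, in Python for illustration: a small interface that both the hosted service client and a local emulator could implement, so contributors can run the registry without an account. The names `SearchBackend` and `InMemorySearch` are hypothetical, and the emulator deliberately does only naive matching.

```python
from typing import Dict, List, Protocol


class SearchBackend(Protocol):
    """Hypothetical interface the registry code would depend on."""

    def index(self, doc_id: str, text: str) -> None: ...
    def delete(self, doc_id: str) -> None: ...
    def search(self, query: str) -> List[str]: ...


class InMemorySearch:
    """Local stand-in for a hosted search service: no ranking, no fuzziness."""

    def __init__(self) -> None:
        self._docs: Dict[str, str] = {}

    def index(self, doc_id: str, text: str) -> None:
        self._docs[doc_id] = text.lower()

    def delete(self, doc_id: str) -> None:
        self._docs.pop(doc_id, None)

    def search(self, query: str) -> List[str]:
        words = query.lower().split()
        # Return matching ids in sorted order so results are deterministic.
        return sorted(
            doc_id for doc_id, text in self._docs.items()
            if all(word in text for word in words)
        )
```

A production client wrapping the third-party API would expose the same three methods, and tests would run against the emulator.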

(I didn't quite get the argument about "losing")

Jan 01 2021

bachmeier <no spam.net> writes:

On Monday, 28 December 2020 at 20:49:12 UTC, sarn wrote:
 On Monday, 28 December 2020 at 10:49:58 UTC, aberba wrote:

 The best place for discussion is here, though:
 https://github.com/dlang/dub-registry/issues/93

In that thread you wrote

 The FTS features of DBs like Sqlite and Postgres are really 
 nice if you're already using those DBs (otherwise other tools 
 are more powerful). Moving all data to Sqlite or PG is 
 obviously a whole bigger decision.

sqlite was the first thing I thought about when I saw this 
thread. How much data would have to be copied into an sqlite 
database for searching of packages? That has the advantage of 
more or less no dependencies, dead simple to add, claimed good 
results...
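
As a hedged illustration of the SQLite idea, here is a minimal Python sketch that copies package names and descriptions into an in-memory FTS5 table and queries it. It assumes the SQLite build bundled with Python was compiled with FTS5 (common, but not guaranteed); the table and column names are made up.

```python
import sqlite3


def build_index(packages):
    """Create an in-memory FTS5 table from (name, description) pairs."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE VIRTUAL TABLE pkg USING fts5(name, description)")
    db.executemany("INSERT INTO pkg (name, description) VALUES (?, ?)", packages)
    return db


def search(db, query, limit=10):
    """Return package names matching the query, best match first.

    `rank` is FTS5's built-in relevance ordering (BM25 by default).
    The query uses FTS5 syntax, so raw user input containing characters
    such as '-' or '"' would need escaping first.
    """
    rows = db.execute(
        "SELECT name FROM pkg WHERE pkg MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    )
    return [name for (name,) in rows]
```

Only the searchable fields need to be copied, so the amount of duplicated data stays small.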

Dec 29 2020

Ola Fosheim Grøstad <ola.fosheim.grostad gmail.com> writes:

On Tuesday, 29 December 2020 at 22:53:55 UTC, bachmeier wrote:
 sqlite was the first thing I thought about when I saw this 
 thread. How much data would have to be copied into an sqlite 
 database for searching of packages? That has the advantage of 
 more or less no dependencies, dead simple to add, claimed good 
 results...

I'm not being dismissive (I also don't use Dub), but in general 
this would not scale very well. Unless you want to do all 
searches locally. Also, a high quality search engine requires 
custom ranking, so not really sure if it is overall less work 
than rolling your own if you want high quality search results. 
The text corpus is tiny, so there is really no point in using a 
generic on-disk solution?

Dec 29 2020

bachmeier <no spam.net> writes:

On Tuesday, 29 December 2020 at 23:17:33 UTC, Ola Fosheim Grøstad 
wrote:
 On Tuesday, 29 December 2020 at 22:53:55 UTC, bachmeier wrote:
 sqlite was the first thing I thought about when I saw this 
 thread. How much data would have to be copied into an sqlite 
 database for searching of packages? That has the advantage of 
 more or less no dependencies, dead simple to add, claimed good 
 results...

 I'm not being dismissive (I also don't use Dub), but in general 
 this would not scale very well. Unless you want to do all 
 searches locally. Also, a high quality search engine requires 
 custom ranking, so not really sure if it is overall less work 
 than rolling your own if you want high quality search results. 
 The text corpus is tiny, so there is really no point in using a 
 generic on-disk solution?

Well, except that sqlite works now and has been extensively 
tested. I don't want to discourage anyone from rolling their own, 
but knowing how long things take around here, and using actuarial 
tables to compute my life expectancy, it's not obvious that it 
would impact me. That's also why adding another dependency 
concerns me.

Dec 29 2020

Ola Fosheim Grøstad <ola.fosheim.grostad gmail.com> writes:

On Tuesday, 29 December 2020 at 23:32:28 UTC, bachmeier wrote:
 On Tuesday, 29 December 2020 at 23:17:33 UTC, Ola Fosheim 
 Grøstad wrote:
 Well, except that sqlite works now and has been extensively 
 tested. I don't want to discourage anyone from rolling their 
 own, but knowing how long things take around here, and using 
 actuarial tables to compute my life expectancy, it's not 
 obvious that it would impact me. That's also why adding another 
 dependency concerns me.

I actually implemented Damerau–Levenshtein in Python the other 
day in order to validate an exam question... It takes <15 minutes 
from scratch. A faster version on a trie can be done in an 
evening, debugged and tested. A full system in a weekend.
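
For readers unfamiliar with the algorithm, here is a minimal Python sketch of the restricted Damerau–Levenshtein distance (also called optimal string alignment). This is not the code referred to above, just a textbook dynamic-programming version. The restricted variant does not allow a substring to be edited more than once, so it can differ from the unrestricted distance on some inputs.

```python
def osa_distance(a, b):
    """Edit distance with insert, delete, substitute and adjacent transposition."""
    # d[i][j] holds the distance between the prefixes a[:i] and b[:j].
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # Adjacent transposition, e.g. "vbie" -> "vibe".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]
```

This runs in time proportional to the product of the two string lengths, which is fine for short package names.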

But, the advantage in using an existing online service is that 
you get automatic scaling and better uptime: write once, run 
forever... I think such a service should be grateful if Dub 
provided them with:

1. cheap advertising
2. a maintained API to their service

Maybe they even will pay for the work, who knows?

Dec 29 2020

Ola Fosheim Grøstad <ola.fosheim.grostad gmail.com> writes:

On Tuesday, 29 December 2020 at 23:40:44 UTC, Ola Fosheim Grøstad 
wrote:
 On Tuesday, 29 December 2020 at 23:32:28 UTC, bachmeier wrote:
 On Tuesday, 29 December 2020 at 23:17:33 UTC, Ola Fosheim 
 Grøstad wrote:
 Well, except that sqlite works now and has been extensively 
 tested. I don't want to discourage anyone from rolling their 
 own, but knowing how long things take around here, and using 
 actuarial tables to compute my life expectancy, it's not 
 obvious that it would impact me. That's also why adding 
 another dependency concerns me.

 I actually implemented Damerau–Levenshtein in Python the other 
 day in order to validate an exam question... It takes <15 
 minutes from scratch. A faster version on a trie can be done in 
 an evening, debugged and tested. A full system in a weekend.

Also, keep in mind that the fuzzy search does not have to be 
crazy fast when people often search for the same stuff. Just log 
all search phrases and preload the caches with the most common 
ones. With some luck maybe 90% of all searches hit caches?
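
A small Python sketch of the logging-and-preloading idea, under the assumption that the slow fuzzy search is available as a plain function. The class name `CachedSearch` and its methods are hypothetical.

```python
from collections import Counter


class CachedSearch:
    """Wrap a slow search function with a query log and a result cache."""

    def __init__(self, search_fn):
        self._search = search_fn
        self._cache = {}
        self.log = Counter()  # how often each normalized phrase was searched

    def query(self, phrase):
        key = phrase.strip().lower()
        self.log[key] += 1
        if key not in self._cache:
            self._cache[key] = self._search(key)
        return self._cache[key]

    def preload(self, top_n):
        """Rebuild the cache from the top_n most frequently logged phrases.

        Intended to run after the index changes, so cached results stay fresh.
        """
        self._cache = {
            phrase: self._search(phrase)
            for phrase, _count in self.log.most_common(top_n)
        }
```

Whether anything close to 90% of searches would hit the cache depends on the real query distribution, which the log would reveal.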

Dec 29 2020

Ola Fosheim Grøstad <ola.fosheim.grostad gmail.com> writes:

On Tuesday, 29 December 2020 at 23:40:44 UTC, Ola Fosheim Grøstad 
wrote:
 But, the advantage in using an existing online service is that 
 you get automatic scaling and better uptime: write once, run 
 forever... I think such a service should be grateful if Dub 
 provided them with:

Here is the SLA for Algolia; they offer 99.99% and 99.999%, which 
translates to about 53 minutes and 5 minutes of downtime per year. 
It would be difficult (highly improbable) to compete with that for 
a self-hosted solution.

https://www.algolia.com/blog/for-slas-theres-no-such-thing-as-100-uptime-only-100-transparency/
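
The downtime figures follow directly from the availability percentages. A short worked calculation in Python, assuming a 365-day year:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_minutes_per_year(availability_percent):
    """Maximum yearly downtime allowed by a given availability percentage."""
    return (100 - availability_percent) / 100 * MINUTES_PER_YEAR


# 99.99%  allows about 52.56 minutes per year
# 99.999% allows about 5.26 minutes per year
```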

Dec 29 2020

Ola Fosheim Grøstad <ola.fosheim.grostad gmail.com> writes:

On Monday, 28 December 2020 at 10:49:58 UTC, aberba wrote:
 I can make a PR for the MeiliSearch integration but I need to 
 know foundation is willing to host a MeiliSearch instance for 
 that.

It is written in Rust...

But seriously, in-memory-search is easy to implement, so it would 
look better if it is done in D.

An alternative is to use an existing online indexing service, 
probably cheaper and more scalable than setting up a dedicated 
service yourself.

Dec 29 2020

Ola Fosheim Grøstad <ola.fosheim.grostad gmail.com> writes:

On Tuesday, 29 December 2020 at 11:47:11 UTC, Ola Fosheim Grøstad 
wrote:
 But seriously, in-memory-search is easy to implement, so it 
 would look better if it is done in D.

You could just use a trie for the tokens and implement 
Levenshtein-Damerau fuzzy matching on that. That is a fun 
exercise to do. The next fun exercise is to abstract it in a way 
that fits into Phobos!
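
One possible shape of the trie idea, sketched in Python rather than D to keep it short. Each trie node computes one row of the edit-distance table from its parent's row, so shared prefixes are only processed once, and a branch is abandoned when every entry in its row exceeds the allowed distance. The transposition rule is the restricted (optimal string alignment) one; all names here are hypothetical.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.word = None  # set on nodes that end a stored token


def insert(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.word = word


def fuzzy_search(root, query, max_dist):
    """Return sorted (distance, word) pairs within max_dist of query."""
    results = []
    first_row = list(range(len(query) + 1))
    for ch, child in root.children.items():
        _walk(child, ch, None, query, first_row, None, max_dist, results)
    return sorted(results)


def _walk(node, ch, prev_ch, query, prev_row, prev_prev_row, max_dist, results):
    # Row for the trie prefix ending in ch, one column per query prefix.
    row = [prev_row[0] + 1]
    for j in range(1, len(query) + 1):
        cost = 0 if query[j - 1] == ch else 1
        best = min(row[j - 1] + 1, prev_row[j] + 1, prev_row[j - 1] + cost)
        # Adjacent transposition needs the row from two levels up.
        if (prev_prev_row is not None and j > 1
                and ch == query[j - 2] and prev_ch == query[j - 1]):
            best = min(best, prev_prev_row[j - 2] + 1)
        row.append(best)
    if node.word is not None and row[-1] <= max_dist:
        results.append((row[-1], node.word))
    # The row minimum never decreases deeper in the trie, so prune here.
    if min(row) <= max_dist:
        for next_ch, child in node.children.items():
            _walk(child, next_ch, ch, query, row, prev_row, max_dist, results)
```

With the tokens `vibe`, `vibe-d`, `mir` and `dub` inserted, searching for `vbie` with a maximum distance of 1 finds `vibe` through a single transposition.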

(Fun fact: I've just read a bunch of suggestions for how to do 
this as I am spending my holiday grading exams in text search... 
:-P Ok, not so fun...)

Dec 29 2020

aberba <karabutaworld gmail.com> writes:

On Tuesday, 29 December 2020 at 11:47:11 UTC, Ola Fosheim Grøstad 
wrote:
 On Monday, 28 December 2020 at 10:49:58 UTC, aberba wrote:
 I can make a PR for the MeiliSearch integration but I need to 
 know foundation is willing to host a MeiliSearch instance for 
 that.

 It is written in Rust...

 But seriously, in-memory-search is easy to implement, so it 
 would look better if it is done in D.

 An alternative is to use an existing online indexing service, 
 probably cheaper and more scalable than setting up a dedicated 
 service yourself.

Read the previous GitHub discussion. They've gone through that 
route.

Any PaaS costs more than IaaS. If cost isn't an issue then we can 
go with that too.

But since the registry is hosted, it's quite straightforward to 
do ./meilisearch --master-key PRIVATE_KEY and be done with it.

Dec 29 2020

aberba <karabutaworld gmail.com> writes:

On Tuesday, 29 December 2020 at 11:47:11 UTC, Ola Fosheim Grøstad 
wrote:

 It is written in Rust...

If anyone has one written in D too, we can use that as well. I 
just want to have the embarrassing search fixed.

Dec 29 2020

Ola Fosheim Grøstad <ola.fosheim.grostad gmail.com> writes:

On Tuesday, 29 December 2020 at 12:04:28 UTC, aberba wrote:
 On Tuesday, 29 December 2020 at 11:47:11 UTC, Ola Fosheim 
 Grøstad wrote:

 It is written in Rust...

 If anyone has one written in D too, we can use that as well. I 
 just want to have the embarrassing search fixed.

Alright, if someone wants to start on it, then I'm willing to 
help out with suggestions and code reviews.

This is a decent starting point:

https://nlp.stanford.edu/IR-book/information-retrieval-book.html

And also Wikipedia

https://en.wikipedia.org/wiki/Inverted_index
https://en.wikipedia.org/wiki/Approximate_string_matching
https://en.wikipedia.org/wiki/Suffix_array
https://en.wikipedia.org/wiki/Trie
https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm
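
As a concrete starting point for the first link, here is a minimal Python sketch of an inverted index: a map from each token to the set of documents containing it, with multi-word queries answered by intersecting those sets. The tokenizer is a deliberately crude assumption (lowercase runs of letters and digits) with no stemming or stop-word handling.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every token of the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    # Copy the first posting set so intersecting does not mutate the index.
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result
```

Fuzzy matching and ranking would be layered on top of this structure, for example by looking up near-miss tokens before intersecting.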

Dec 29 2020
