
digitalmars.D - IDEA: Text search engine tailored to a specific schema

reply "Casey" <sybrandy gmail.com> writes:
O.K.  This is just an idea that's been running through my head, 
so I figured someone here may be interested.

Text search engines that I know of are meant to index 
unstructured data or apply a schema to data at runtime.  However, 
since D has the ability to do things at compile time, perhaps it 
would be an ideal solution for situations where a specific schema 
is used and must be searched on.  Instead of generic data 
structures used to represent the data, specialized data 
structures could be created at compile time to allow for better 
indexing and performance.

That's about as far as I got with it.  To me, it seemed 
interesting enough to share.

Enjoy.
Apr 16 2015
next sibling parent Rikki Cattermole <alphaglosined gmail.com> writes:
On 17/04/2015 2:26 p.m., Casey wrote:
 O.K.  This is just an idea that's been running through my head, so I
 figured someone here may be interested.

 Text search engines that I know of are meant to index unstructured data
 or apply a schema to data at runtime.  However, since D has the ability
 to do things at compile time, perhaps it would be an ideal solution for
 situations where a specific schema is used and must be searched on.
 Instead of generic data structures used to represent the data,
 specialized data structures could be created at compile time to allow
 for better indexing and performance.

 That's about as far as I got with it.  To me, it seemed interesting
 enough to share.

 Enjoy.
This sounds a lot like an ORM. Only the schema is specified as a struct/class. In this case it wouldn't need to generate anything as it has already been done.
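
As a rough illustration of a struct acting as the schema, here is a minimal sketch that inspects a struct's fields at compile time. The `Article` type and the `describe` helper are made up for this example; `Fields` and `FieldNameTuple` come from `std.traits`.

```d
import std.stdio : write;
import std.traits : FieldNameTuple, Fields;

// Hypothetical document type; the struct itself is the schema.
struct Article
{
    int id;
    string title;
    double score;
}

// Lists each field of T as "name: type", one per line.
string describe(T)()
{
    alias Types = Fields!T;
    string result;
    // Unrolled by the compiler: one statement per field of T.
    static foreach (i, name; FieldNameTuple!T)
        result ~= name ~ ": " ~ Types[i].stringof ~ "\n";
    return result;
}

// Evaluated by the compiler (CTFE), so the schema is known before run time.
static assert(describe!Article() == "id: int\ntitle: string\nscore: double\n");

void main()
{
    write(describe!Article());
}
```

Anything that can walk the fields like this could, in principle, also decide how each field should be stored or indexed.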
Apr 16 2015
prev sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2015-04-17 04:26, Casey wrote:
 O.K.  This is just an idea that's been running through my head, so I
 figured someone here may be interested.

 Text search engines that I know of are meant to index unstructured data
 or apply a schema to data at runtime.  However, since D has the ability
 to do things at compile time, perhaps it would be an ideal solution for
 situations where a specific schema is used and must be searched on.
 Instead of generic data structures used to represent the data,
 specialized data structures could be created at compile time to allow
 for better indexing and performance.

 That's about as far as I got with it.  To me, it seemed interesting
 enough to share.
Sounds a bit like the regular expression module. If you provide the regular expression at compile time, it will generate an engine specific to that regular expression.
-- /Jacob Carlborg
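
For reference, a minimal sketch of the compile-time regular expression facility in `std.regex`. The `firstDate` helper and the date pattern are only illustrative choices; `ctRegex` and `matchFirst` are the library calls.

```d
import std.regex : ctRegex, matchFirst;
import std.stdio : writeln;

// Returns the first ISO-style date in `text`, or an empty string if none.
string firstDate(string text)
{
    // The pattern is a compile-time constant, so ctRegex can generate
    // matching code specialized for this one expression.
    auto pattern = ctRegex!(`\d{4}-\d{2}-\d{2}`);
    auto m = matchFirst(text, pattern);
    return m.empty ? "" : m.hit;
}

void main()
{
    writeln(firstDate("posted on 2015-04-17")); // prints 2015-04-17
}
```

The same pattern passed to `regex(...)` at run time would go through the general-purpose engine instead.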
Apr 17 2015
parent reply "Casey Sybrandy" <sybrandy gmail.com> writes:
I was thinking something a bit more specific without having to 
manually generate the structs.

For example, let's say I have a JSON document that has a number
of fields in it.  Some are numbers, some are strings, etc.  What
I'm thinking is that, based on either a) the JSON structure or b)
a schema that describes the JSON, the objects and/or indices are
defined at compile time in an optimal manner.  For example, if the
schema tells us that a field is an enumeration, a simple
associative array that maps each value to an array of matching
document IDs is used instead of an inverted index.  This way, if
I search on that specific field, it can be done in the most
efficient way possible.  Also, the documents themselves would be
stored more optimally.
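
A rough sketch of what that could look like, under a lot of assumptions: the schema is a plain struct rather than JSON, and `Article`, `Status`, `FieldIndex`, and `Index` are all hypothetical names. `static if` picks the index layout per field type, so an enumeration field gets a direct value-to-IDs map while a string field gets a naive inverted index.

```d
import std.array : split;
import std.meta : staticIndexOf, staticMap;
import std.stdio : writeln;
import std.traits : FieldNameTuple, Fields;
import std.uni : toLower;

alias DocId = size_t;

enum Status { draft, published, archived }

// Hypothetical schema: a plain struct describes the document shape.
struct Article
{
    Status status; // enumeration field
    string text;   // free-text field
}

// Index for a single field; the layout is picked at compile time from T.
struct FieldIndex(T)
{
    static if (is(T == enum))
    {
        // Few distinct values: map each value straight to its document IDs.
        DocId[][T] postings;

        void add(DocId id, T value) { postings[value] ~= id; }

        const(DocId)[] search(T value) const
        {
            auto ids = value in postings;
            return ids ? *ids : null;
        }
    }
    else static if (is(T == string))
    {
        // Free text: a naive inverted index from lower-cased term to IDs.
        // (No stemming, ranking, or de-duplication of repeated terms.)
        DocId[][string] postings;

        void add(DocId id, string text)
        {
            foreach (term; text.toLower.split)
                postings[term] ~= id;
        }

        const(DocId)[] search(string term) const
        {
            auto ids = term.toLower in postings;
            return ids ? *ids : null;
        }
    }
    else
        static assert(0, "no index strategy for " ~ T.stringof);
}

// One specialized FieldIndex per schema field, generated from Doc.
struct Index(Doc)
{
    staticMap!(FieldIndex, Fields!Doc) byField;

    void add(DocId id, Doc doc)
    {
        static foreach (i; 0 .. Doc.tupleof.length)
            byField[i].add(id, doc.tupleof[i]);
    }

    // Look up by field name; the name is resolved at compile time.
    auto search(string field, V)(V value) const
    {
        enum i = staticIndexOf!(field, FieldNameTuple!Doc);
        static assert(i >= 0, "unknown field " ~ field);
        return byField[i].search(value);
    }
}

void main()
{
    Index!Article index;
    index.add(0, Article(Status.published, "Compile time regex engines"));
    index.add(1, Article(Status.draft, "Schema specific search engines"));
    index.add(2, Article(Status.published, "Thrift code generation"));

    writeln(index.search!"status"(Status.published));
    writeln(index.search!"text"("engines"));
}
```

A misspelled field name in `search!"..."` fails at compile time instead of returning nothing at run time.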

So, no, this isn't an ORM as I'm not mapping objects to an 
underlying data store.  I guess what I'm thinking of is the text 
search equivalent of the regular expression engine.  Thinking 
about it now, I should have mentioned that this would be like 
Sphinx/Lucene/ElasticSearch except it would be optimized to a 
specific document structure vs. more general purpose.  The 
optimizations would be generated at compile-time based on a 
sample document structure or schema vs. coding everything 
manually.
Apr 17 2015
parent reply Jacob Carlborg <doob me.com> writes:
On 2015-04-17 16:21, Casey Sybrandy wrote:
 I was thinking something a bit more specific without having to manually
 generate the structs.

 For example, let's say I have a JSON document that has a number of
 fields in it.  Some are numbers, some are strings, etc.  What I'm
 thinking is that, based on either a) the JSON structure or b) a schema
 that describes the JSON, the objects and/or indices are defined at
 compile time in an optimal manner.  For example, if the schema tells
 us that a field is an enumeration, a simple associative array that
 maps each value to an array of matching document IDs is used instead
 of an inverted index.  This way, if I search on that specific field,
 it can be done in the most efficient way possible.  Also, the
 documents themselves would be stored more optimally.
I think this is similar to how the D implementation of Thrift works.
-- /Jacob Carlborg
Apr 17 2015
parent "Casey" <sybrandy gmail.com> writes:
 I think this is similar to how the D implementation of Thrift
 works.
Yes, exactly! Except instead of writing code to send/receive messages, I'm thinking that indices are built that are specific to the data. That, I think, is the harder part, as you have to know what is optimal for the data, what operations are expected on the data, and how to put it all together. However, I have to wonder whether making the indices very specific would improve query performance by a significant amount.
Apr 17 2015