www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - Is it possible to collect object usage information during compilation?

reply "DaveG" <daveg inter.net> writes:
Let me preface this by saying I only have a general conceptual 
understanding of compilers and know nothing about actual 
implementation.

One common problem with Object-Relational Mapping (ORM) is what 
data to load and when. There is basically 2 options:
1. Load everything: This certainly works, but is very 
inefficient, particularly when you have a large number of fields 
or, even worse, have a derived field that causes a join on one or 
more tables. If you need all the data this is fine, but most of 
the time only a small subset is actually used. (lazy loading can 
mitigate some issues, but introduces other problems)
2. Specify what fields to populate: This can work, but makes more 
work for the user, adds complexity to user code, and often 
introduces bugs particularly over time as code changes. 
Implementation details are leaking into the interface.

Basically, I'm looking for a way to "look ahead" to see what 
properties on an object are actually referenced (or not 
referenced) so we know what data needs to be loaded. Simple 
analysis for things like unused scope variables already exist, 
but this is needed for properties on each instance of a class (or 
struct). I guess this would require the compiler to make 2 passes 
once to trace each object and a second to do something with the 
data collected. This would potential cost a lot in compilation 
time so there would probably need to be some sort of annotation 
on the definition to indicate this type of check is necessary.

I might be crazy, but it seems like the compiler has all the 
information necessary to figure this out and it would make user 
code simpler, less error prone, and more efficient. So does 
anybody have any idea on how to actually achieve this?

-Dave
Jan 09 2015
parent reply Jacob Carlborg <doob me.com> writes:
On 2015-01-10 07:46, DaveG wrote:
 Let me preface this by saying I only have a general conceptual
 understanding of compilers and know nothing about actual implementation.

 One common problem with Object-Relational Mapping (ORM) is what data to
 load and when. There is basically 2 options:
 1. Load everything: This certainly works, but is very inefficient,
 particularly when you have a large number of fields or, even worse, have
 a derived field that causes a join on one or more tables. If you need
 all the data this is fine, but most of the time only a small subset is
 actually used. (lazy loading can mitigate some issues, but introduces
 other problems)
 2. Specify what fields to populate: This can work, but makes more work
 for the user, adds complexity to user code, and often introduces bugs
 particularly over time as code changes. Implementation details are
 leaking into the interface.

 Basically, I'm looking for a way to "look ahead" to see what properties
 on an object are actually referenced (or not referenced) so we know what
 data needs to be loaded. Simple analysis for things like unused scope
 variables already exist, but this is needed for properties on each
 instance of a class (or struct). I guess this would require the compiler
 to make 2 passes once to trace each object and a second to do something
 with the data collected. This would potential cost a lot in compilation
 time so there would probably need to be some sort of annotation on the
 definition to indicate this type of check is necessary.

 I might be crazy, but it seems like the compiler has all the information
 necessary to figure this out and it would make user code simpler, less
 error prone, and more efficient. So does anybody have any idea on how to
 actually achieve this?
I'm not exactly sure if this is what you want but you can implement the "opDispatch" [1] method in a class or struct. This method will be called if no other method exists with the same name. There's also something called "alias this" [2] that allows you to do something similar. class Foo { void foo () {} void opDispatch (string name)() {} } auto f = new Foo; f.foo(); // will call "foo" f.bar(); // will be lowered to f.opDispatch!("bar")(); If you're implementing an ORM I would recommend executing queries lazily. You can do something like this: class Person : ORM.Base { String name; Int age; // this method returns a range/proxy that implements the range api [3] static ORM.Range!(Person) all () {} } "String" would look something like this: struct String { alias get this; // this method will fetch the data from the database private string get (); } Using the interface would look something like this: auto p = Person.all(); // no database query has been performed yet // the range interface makes it possible to use a foreach // when starting the foreach loop is when the first query will happen foreach (e ; p) { // this call will trigger a call to the "get" method in "String" // via the "alias this" string name = e.name; writeln(name); } [1] http://dlang.org/operatoroverloading.html#dispatch [2] http://dlang.org/class.html#alias-this [3] http://dlang.org/phobos/std_range.html#isInputRange -- /Jacob Carlborg
Jan 10 2015
parent reply Martin Nowak <code+news.digitalmars dawg.eu> writes:
On 01/10/2015 11:20 AM, Jacob Carlborg wrote:
 On 2015-01-10 07:46, DaveG wrote:
 I might be crazy, but it seems like the compiler has all the information
 necessary to figure this out and it would make user code simpler, less
 error prone, and more efficient. So does anybody have any idea on how to
 actually achieve this?
I'm not exactly sure if this is what you want but you can implement the "opDispatch" [1] method in a class or struct. This method will be called if no other method exists with the same name. There's also something called "alias this" [2] that allows you to do something similar. class Foo { void foo () {} void opDispatch (string name)() {} } auto f = new Foo; f.foo(); // will call "foo" f.bar(); // will be lowered to f.opDispatch!("bar")(); If you're implementing an ORM I would recommend executing queries lazily. You can do something like this: class Person : ORM.Base { String name; Int age; // this method returns a range/proxy that implements the range api [3] static ORM.Range!(Person) all () {} } "String" would look something like this: struct String { alias get this; // this method will fetch the data from the database private string get (); } Using the interface would look something like this: auto p = Person.all(); // no database query has been performed yet // the range interface makes it possible to use a foreach // when starting the foreach loop is when the first query will happen foreach (e ; p) { // this call will trigger a call to the "get" method in "String" // via the "alias this" string name = e.name; writeln(name); }
The idea isn't bad, but the performance will suck. This is generally known as N+1 query, only that this is even worse, as each field is queried individually. Here is a sketch for an optimal solution. I'm actually eagerly waiting that someone finally implements it. http://dpaste.dzfl.pl/cd375ac594cf
Jan 10 2015
next sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2015-01-10 13:36, Martin Nowak wrote:

 The idea isn't bad, but the performance will suck. This is generally
 known as N+1 query, only that this is even worse, as each field is
 queried individually.
Since the "all" method was called I would assume all rows in the person table are fetched in one single query. Although I don't know if that will work if not the whole row should be loaded.
 Here is a sketch for an optimal solution. I'm actually eagerly waiting
 that someone finally implements it.

 http://dpaste.dzfl.pl/cd375ac594cf
How would you handled fetching multiple rows and a foreach loop, i.e. my example? Perhaps a detail but using a wrapped type instead of the raw types in Person you could handle things like null in the database. -- /Jacob Carlborg
Jan 10 2015
parent reply Martin Nowak <code+news.digitalmars dawg.eu> writes:
On 01/10/2015 01:52 PM, Jacob Carlborg wrote:
 On 2015-01-10 13:36, Martin Nowak wrote:

 The idea isn't bad, but the performance will suck. This is generally
 known as N+1 query, only that this is even worse, as each field is
 queried individually.
Since the "all" method was called I would assume all rows in the person table are fetched in one single query. Although I don't know if that will work if not the whole row should be loaded.
For row or document oriented databases you want to query all fields in parallel. For columnar stores it might be possible to efficiently query specific fields for many documents.
 Here is a sketch for an optimal solution. I'm actually eagerly waiting
 that someone finally implements it.

 http://dpaste.dzfl.pl/cd375ac594cf
How would you handled fetching multiple rows and a foreach loop, i.e. my example?
I'd simple produce multiple rows, the principle remains the same.
 Perhaps a detail but using a wrapped type instead of the raw types in
 Person you could handle things like null in the database.
The example already uses Variant.
Jan 10 2015
next sibling parent Jacob Carlborg <doob me.com> writes:
On 2015-01-10 14:19, Martin Nowak wrote:

 I'd simple produce multiple rows, the principle remains the same.
Ok, I think I understand the code now. You managed to register the fields at compile time. Pretty neat. I thought the query would need to be delayed to the first call to opDispatch.
 The example already uses Variant.
Yes, but when you get the value out of the of the variant. I think one also needs to be able to check if a field was is "null" or not. Or am I missing something? -- /Jacob Carlborg
Jan 10 2015
prev sibling parent reply "DaveG" <daveg inter.net> writes:
On Saturday, 10 January 2015 at 13:19:19 UTC, Martin Nowak wrote:
 On 01/10/2015 01:52 PM, Jacob Carlborg wrote:
 On 2015-01-10 13:36, Martin Nowak wrote:

 The idea isn't bad, but the performance will suck. This is 
 generally
 known as N+1 query, only that this is even worse, as each 
 field is
 queried individually.
Since the "all" method was called I would assume all rows in the person table are fetched in one single query. Although I don't know if that will work if not the whole row should be loaded.
The issue is not with the rows returned, but the columns (or object properties - which may map to multiple tables or be derived in some other way). Which rows need to returned is determined by some type of filtering mechanism, which is not an issue because that (logically) has to be explicit. The issue is determining which properties (for each "row") actually need to be returned without the need to explicitly request them (the data is already implicit within the user code itself).
 Here is a sketch for an optimal solution. I'm actually 
 eagerly waiting
 that someone finally implements it.

 http://dpaste.dzfl.pl/cd375ac594cf
Martin, that is brilliant! It seemed like all the pieces where there, I just couldn't put them together. I'm glad I'm not the only one thinking about this. I have never been able to find an ORM (in any language) that comes close to working for us. We are currently looking into switching off PHP and the front runner is C# because it's a safe bet, we run Windows, and some people are sold on the concept of Entity Framework. Entity is (or was) built in to the .NET so they could theoretically do some neat tricks like compile query logic at compilation, and infer what data is actually needed by the program (the issue being discussed). Turns out they do query caching, but that's about it. I'm not sure I can sell the idea of D (this is a very small and conservative group). I would also have to sell the idea of writing an ORM which is certainly not on the roadmap, but this will certainly help my argument. Oh, we will also need a good SQL Server library which, to my knowledge, D is lacking. This is going to be a hard sell... -Dave
Jan 10 2015
next sibling parent reply "Paolo Invernizzi" <paolo.invernizzi no.address> writes:
On Saturday, 10 January 2015 at 17:31:42 UTC, DaveG wrote:
 On Saturday, 10 January 2015 at 13:19:19 UTC, Martin Nowak 
 wrote:
 Here is a sketch for an optimal solution. I'm actually eagerly 
 waiting that someone finally implements it.

 http://dpaste.dzfl.pl/cd375ac594cf
I would also have to sell the idea of writing an ORM which is certainly not on the roadmap, but this will certainly help my argument.
Maybe not, something simpler than a full ORM should be compelling also. I guess you know about the ORM Vietnam [1], but also this [2] can be of some help in selling a simple D solution. I would like to see, someday, something in D that: - can check at compile time the syntax of SQL; - can check at compile time the SQL query statement against the current DB schema; - can read the output of a DB schema dump at CT, and parse it into what is needed for the previous points (more complicated); The first point should be easy today, the second and the last one involve more work... [1] http://blogs.tedneward.com/2006/06/26/The+Vietnam+Of+Computer+Science.aspx [2] http://wozniak.ca/what-orms-have-taught-me-just-learn-sql --- Paolo
Jan 10 2015
parent reply "DaveG" <daveg inter.net> writes:
On Saturday, 10 January 2015 at 18:31:18 UTC, Paolo Invernizzi 
wrote:
 On Saturday, 10 January 2015 at 17:31:42 UTC, DaveG wrote:
 On Saturday, 10 January 2015 at 13:19:19 UTC, Martin Nowak 
 wrote:
 Here is a sketch for an optimal solution. I'm actually 
 eagerly waiting that someone finally implements it.

 http://dpaste.dzfl.pl/cd375ac594cf
I would also have to sell the idea of writing an ORM which is certainly not on the roadmap, but this will certainly help my argument.
Maybe not, something simpler than a full ORM should be compelling also. I guess you know about the ORM Vietnam [1], but also this [2] can be of some help in selling a simple D solution. I would like to see, someday, something in D that: - can check at compile time the syntax of SQL; - can check at compile time the SQL query statement against the current DB schema; - can read the output of a DB schema dump at CT, and parse it into what is needed for the previous points (more complicated); The first point should be easy today, the second and the last one involve more work... [1] http://blogs.tedneward.com/2006/06/26/The+Vietnam+Of+Computer+Science.aspx [2] http://wozniak.ca/what-orms-have-taught-me-just-learn-sql --- Paolo
I have no intention of writing anything as massive as Entity Framework or Hibernate. We have been successful over the past 4 years with just a small collection of functions to reduce some of the pain (and redundancy) in writing a lot of dynamic SQL. Now that we have an opportunity to start fresh we have a chance to do something better. The traditional problems with ORMs in general are well known and these are the reasons why I have never used one in production. 1. Complexity. You basically need to learn an entire new language (sometimes literally). This is an investment which can be worth it if the abstraction is successful. The following problems are why I think the investment is not worth it. 2. Limitations. Unfortunately too often you need to drop in to SQL to really get things done. This alone is a non-starter. If I need to bypass the abstraction to do anything really interesting or complex, it has failed. Sometimes (usually) this is for performance, other times it's because there is simply no way (or it's too complicated) to express what I want through the abstraction. 3. Compilation/Translation. The time to translate commands to SQL (or whatever backend) can be a high price. Most ORMs do some type of caching now which is generally sufficient. In D most of the work can be done at compile time which is even better. 4. Unnecessary Data. Greedy data retrieval is way to common, the default is usually to get everything. For small queries and data sets you can write it off as "not a problem", but when your model gets large and interconnects, this can be catastrophic. Again, thanks Martin for the clever basis for a solution in D. 5. DB Performance. The efficiency of the SQL that is actually generated. People seem to focus on this because the generated SQL is generally quite verbose. Interestingly, in my experience, this is often the smallest performance problem because the query optimizer (at least in SQL Server with good indexes and statistics) will generate the same execution plan regardless. This is also a code gen problem that can be tweaked without breaking user code. You may have noticed that 4 of 5 problems are about performance. That's because, at least in our case, it is that important and it is that much of a problem. Current ORMs often look great, but in my experience, the price is always to high. Some "micro-ORMs" avoid the performance problems, but they do so by sacrificing most of the features (you still have to write raw SQL for example). Some of the problems are inherit to solution and cannot be "solved", but they can be reduced. For a long time I thought some of these problems where fundamental and had basically written off the concept of ORMs [see: Vietnam of Computer Science]. The good news is most of the problems appear to be solvable. #1 is unavoidable obviously there will be something new (whether it's a DSL or just an API) #2 is really dependent on the other problems and implementation. #3 is "just" implementation. #4 has a conceptual solution, now it's "just" implementation. #5 does not have a solution because it will depend on the backend, but I think it's reasonable to expect a solution that works for almost all cases. It will be impossible to know without testing. One final note. You may have noticed I didn't mention the schema syncing problem (keeping database and code in sync). There was a time I would have said that was essential and while it would be nice in a perfect world, I'm comfortable keeping them in sync manually (or semi-manual with scripts). I can generate a bunch of classes from an existing database fairly easily and when I change a table I can manually update a class. If I was writing SQL directly I would have to update my query, this is really no different. Doing validation in unit tests is perfectly acceptable to me. Sorry for long post. -Dave
Jan 10 2015
next sibling parent Jacob Carlborg <doob me.com> writes:
On 2015-01-10 21:53, DaveG wrote:

 One final note. You may have noticed I didn't mention the schema syncing
 problem (keeping database and code in sync). There was a time I would
 have said that was essential and while it would be nice in a perfect
 world, I'm comfortable keeping them in sync manually (or semi-manual
 with scripts). I can generate a bunch of classes from an existing
 database fairly easily and when I change a table I can manually update a
 class. If I was writing SQL directly I would have to update my query,
 this is really no different. Doing validation in unit tests is perfectly
 acceptable to me.
With D you can inspect all fields on a class/struct. Then you can create a migration tool that inspects your models and make sure the database match. -- /Jacob Carlborg
Jan 11 2015
prev sibling parent reply "Paolo Invernizzi" <paolo.invernizzi no.address> writes:
On Saturday, 10 January 2015 at 20:53:47 UTC, DaveG wrote:
 On Saturday, 10 January 2015 at 18:31:18 UTC, Paolo Invernizzi 
 wrote:
 I would like to see, someday, something in D that:

 - can check at compile time the syntax of SQL;
 - can check at compile time the SQL query statement against 
 the current DB schema;
 - can read the output of a DB schema dump at CT, and parse it 
 into what is needed for the previous points (more complicated);
One final note. You may have noticed I didn't mention the schema syncing problem (keeping database and code in sync). There was a time I would have said that was essential and while it would be nice in a perfect world, I'm comfortable keeping them in sync manually (or semi-manual with scripts). I can generate a bunch of classes from an existing database fairly easily and when I change a table I can manually update a class. If I was writing SQL directly I would have to update my query, this is really no different. Doing validation in unit tests is perfectly acceptable to me.
I think basically we have the same feeling over the ORM topic. Doing validation in unit tests is for sure acceptable, but my point is that I would like CT validation of plain SQL query over the current DB schema.... without having to use an ORM. ;-) --- Paolo
Jan 11 2015
parent "DaveG" <daveg inter.net> writes:
On Sunday, 11 January 2015 at 10:06:53 UTC, Paolo Invernizzi 
wrote:
 On Saturday, 10 January 2015 at 20:53:47 UTC, DaveG wrote:
 On Saturday, 10 January 2015 at 18:31:18 UTC, Paolo Invernizzi 
 wrote:
 I would like to see, someday, something in D that:

 - can check at compile time the syntax of SQL;
 - can check at compile time the SQL query statement against 
 the current DB schema;
 - can read the output of a DB schema dump at CT, and parse it 
 into what is needed for the previous points (more 
 complicated);
One final note. You may have noticed I didn't mention the schema syncing problem (keeping database and code in sync). There was a time I would have said that was essential and while it would be nice in a perfect world, I'm comfortable keeping them in sync manually (or semi-manual with scripts). I can generate a bunch of classes from an existing database fairly easily and when I change a table I can manually update a class. If I was writing SQL directly I would have to update my query, this is really no different. Doing validation in unit tests is perfectly acceptable to me.
I think basically we have the same feeling over the ORM topic. Doing validation in unit tests is for sure acceptable, but my point is that I would like CT validation of plain SQL query over the current DB schema.... without having to use an ORM. ;-) --- Paolo
I agree. That's one thing Couldfusion did well that I haven't really seen since. You could write blocks of SQL complete with validation and syntax highlighting (depending on the editor). Because the SQL parser was built in you could then take resultsets returned from the database and perform additional queries on them locally. The problem a SQL parser doesn't solve is writing dynamic queries which require piecing together a bunch of partial statements. This is where an abstraction layer can really be useful. -Dave
Jan 11 2015
prev sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2015-01-10 18:31, DaveG wrote:

 Oh, we will also need a good SQL Server library which, to my knowledge,
 D is lacking. This is going to be a hard sell...
That would be FreeTDS [1] with D bindings [2]. Unless Microsoft have headers available for interfacing with SQL Server. [1] http://www.freetds.org/ [2] https://github.com/jacob-carlborg/dstep -- /Jacob Carlborg
Jan 10 2015
parent reply "Adam D. Ruppe" <destructionator gmail.com> writes:
On Saturday, 10 January 2015 at 20:51:44 UTC, Jacob Carlborg 
wrote:
 That would be FreeTDS [1] with D bindings [2]. Unless Microsoft 
 have headers available for interfacing with SQL Server.
You can use ODBC if it is a Windows program. If you want to talk to SQL Server from a Linux program though, FreeTDS is what you'll want.
Jan 10 2015
parent reply "DaveG" <daveg inter.net> writes:
On Saturday, 10 January 2015 at 20:57:00 UTC, Adam D. Ruppe wrote:
 On Saturday, 10 January 2015 at 20:51:44 UTC, Jacob Carlborg 
 wrote:
 That would be FreeTDS [1] with D bindings [2]. Unless 
 Microsoft have headers available for interfacing with SQL 
 Server.
You can use ODBC if it is a Windows program. If you want to talk to SQL Server from a Linux program though, FreeTDS is what you'll want.
In the past I have used FreeTDS, through PHP, and it had a lot of problems. This was several years ago and could have been at least partially due to the PHP layer. Last year I messed around with the ODBC wrapper and got amazingly poor performance, I believe the project was abandoned before I figured out the problem. Anybody actually using this in D? I'll have to write some tests and fire up the SQL profiler.
Jan 10 2015
parent reply Jacob Carlborg <doob me.com> writes:
On 2015-01-11 02:08, DaveG wrote:

 In the past I have used FreeTDS, through PHP, and it had a lot of
 problems. This was several years ago and could have been at least
 partially due to the PHP layer.

 Last year I messed around with the ODBC wrapper and got amazingly poor
 performance, I believe the project was abandoned before I figured out
 the problem. Anybody actually using this in D? I'll have to write some
 tests and fire up the SQL profiler.
We used Ruby on Rails with an SQL Server at my previous work. We used TinyTDS which uses FreeTDS. It worked surprisingly well but it did had some problems. One of those problems were encoding problems, but that mostly because we used an older version of SQL Server. -- /Jacob Carlborg
Jan 11 2015
parent reply "DaveG" <daveg inter.net> writes:
On Sunday, 11 January 2015 at 09:54:42 UTC, Jacob Carlborg wrote:
 On 2015-01-11 02:08, DaveG wrote:

 In the past I have used FreeTDS, through PHP, and it had a lot 
 of
 problems. This was several years ago and could have been at 
 least
 partially due to the PHP layer.

 Last year I messed around with the ODBC wrapper and got 
 amazingly poor
 performance, I believe the project was abandoned before I 
 figured out
 the problem. Anybody actually using this in D? I'll have to 
 write some
 tests and fire up the SQL profiler.
We used Ruby on Rails with an SQL Server at my previous work. We used TinyTDS which uses FreeTDS. It worked surprisingly well but it did had some problems. One of those problems were encoding problems, but that mostly because we used an older version of SQL Server.
It was probably around 2011 when last I used FreeTDS, and even then I think it was an older version, so it's quite possible those issues have been resolved. My bias against it probably unjustified. We are only targeting Windows anyway so ODBC is probably a safe bet.
Jan 11 2015
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/11/15 8:21 AM, DaveG wrote:
 On Sunday, 11 January 2015 at 09:54:42 UTC, Jacob Carlborg wrote:
 On 2015-01-11 02:08, DaveG wrote:

 In the past I have used FreeTDS, through PHP, and it had a lot of
 problems. This was several years ago and could have been at least
 partially due to the PHP layer.

 Last year I messed around with the ODBC wrapper and got amazingly poor
 performance, I believe the project was abandoned before I figured out
 the problem. Anybody actually using this in D? I'll have to write some
 tests and fire up the SQL profiler.
We used Ruby on Rails with an SQL Server at my previous work. We used TinyTDS which uses FreeTDS. It worked surprisingly well but it did had some problems. One of those problems were encoding problems, but that mostly because we used an older version of SQL Server.
It was probably around 2011 when last I used FreeTDS, and even then I think it was an older version, so it's quite possible those issues have been resolved. My bias against it probably unjustified. We are only targeting Windows anyway so ODBC is probably a safe bet.
Should be easy to extract the necessaries for a generic ODBC driver for D from https://github.com/prestodb/presto-odbc. -- Andrei
Jan 11 2015
prev sibling next sibling parent Jacob Carlborg <doob me.com> writes:
On 2015-01-10 13:36, Martin Nowak wrote:

 I'm actually eagerly waiting that someone finally implements it.
There are two ORM libraries at code.dlang.org [1] [2]. Although I don't know how usable they are. [1] http://code.dlang.org/packages/hibernated [2] http://code.dlang.org/packages/dvorm -- /Jacob Carlborg
Jan 10 2015
prev sibling parent Martin Nowak <code+news.digitalmars dawg.eu> writes:
On 01/10/2015 01:36 PM, Martin Nowak wrote:
 The idea isn't bad, but the performance will suck. This is generally
 known as N+1 query, only that this is even worse, as each field is
 queried individually.

 Here is a sketch for an optimal solution. I'm actually eagerly waiting
 that someone finally implements it.

 http://dpaste.dzfl.pl/cd375ac594cf
I also added a where clause, with a very simple expression template capture. http://dpaste.dzfl.pl/cd375ac594cf#line-140
Jan 10 2015