
digitalmars.D.learn - D language manipulation of dataframe type structures

reply "Jay Norwood" <jayn prismnet.com> writes:
I've been playing with the python pandas app, which enables 
interactive manipulation of tables of data in their dataframe 
structure, which they say is similar to the structures used in R.

It appears pandas has laid claim to being a faster version of R, 
but is basically limited to what it can exploit by moving 
operations back and forth to the underlying cython code.

Has anyone written an example app in D that manipulates dataframe 
type structures?
Sep 24 2013
next sibling parent reply "lomereiter" <lomereiter gmail.com> writes:
I thought about it once but quickly abandoned the idea. The 
primary reason was that D doesn't have REPL and is thus not 
suitable for interactive data exploration.
Sep 24 2013
next sibling parent reply "bearophile" <bearophileHUGS lycos.com> writes:
lomereiter:

 I thought about it once but quickly abandoned the idea. The 
 primary reason was that D doesn't have REPL and is thus not 
 suitable for interactive data exploration.
The quick compile times could allow interactive data exploration 
in D, perhaps a little less well than Python. People have created 
a D repl two or more times, but Walter & Andrei seem not 
interested in it.

Bye,
bearophile
Sep 25 2013
parent "Jay Norwood" <jayn prismnet.com> writes:
While the interactive exploratory aspects of pandas are 
attractive, in my case the interaction has just been a crutch to 
discover how to correctly use their api.

Once through that api learning curve, I'd mainly be interested in 
repeating the operations that worked correctly.  The execution 
speed would be more important to me at that point.

In the recent pandas documents, they describe some speed 
improvements available from using eval(expression_string) calls 
that get executed by the numexpr library.  Their testing shows it 
only improves execution time when table sizes go beyond about 10k 
rows.  Seems like this puts the improvements beyond the reach of 
my particular app.
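For reference, the eval path described above can be exercised like this (a minimal sketch; the ~10k-row crossover figure is pandas' own, and whether numexpr actually engages depends on it being installed — pd.eval falls back to a pure-Python engine otherwise):

```python
import numpy as np
import pandas as pd

# A frame in the size range where the pandas docs say numexpr starts to win.
n = 100_000
df = pd.DataFrame({"a": np.random.rand(n), "b": np.random.rand(n)})

# Plain vectorised expression: each intermediate op allocates a temporary.
plain = df["a"] * df["b"] + df["a"]

# pd.eval parses the whole expression string and, when numexpr is
# available, evaluates it in a single pass over the data.
fast = pd.eval("df.a * df.b + df.a")

assert np.allclose(plain, fast)
```

The point of the string form is to fuse the expression and avoid the temporaries, which is also why it only pays off once the arrays are large.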

ok, thanks.  I'll have to dig into it some more.
Sep 25 2013
prev sibling parent "anon" <anon mail.invalid> writes:
On Wednesday, 25 September 2013 at 04:35:57 UTC, lomereiter wrote:
 I thought about it once but quickly abandoned the idea. The 
 primary reason was that D doesn't have REPL and is thus not 
 suitable for interactive data exploration.
https://github.com/MartinNowak/drepl https://drepl.dawg.eu/
Dec 26 2014
prev sibling next sibling parent reply "Jared Miller" <jared economicmodeling.com> writes:
I agree with other posters that a D REPL and
interactive/visualization data environment would be very cool,
but unfortunately they don't exist. Batch computing is more
practical, but REPLs really hook new users. I see statistical
computing as a huge opportunity for D adoption. (R is just
super-ugly and slow, leaving Python + its various native-code
cyborg appendages as the hot new stats environment).

There are tons of ways of accomplishing the same thing in D, but
as far as I know there isn't a "standard" at this point. A
statically typed dataframe is, at minimum, just a range of
structs -- even more minimally, a bare *array* of structs, or
alternatively just a 2-D array in a thin wrapper that provides
access via column labels rather than indexes. You can manipulate
these ranges with functions from std.range and std.algorithm.
Missing or N/A data is a common issue, and can be represented in
a variety of ways, with integers being the most annoying since
there is no built-in NaN value for ints (check out the Nullable
template from std.typecons).
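A rough Python analogue of the minimal "array of structs" dataframe described above, with `None` standing in for the missing integer NaN (all names and data here are illustrative, not from any library):

```python
from dataclasses import dataclass
from typing import Optional

# One row of the frame: a plain record (the "struct").
@dataclass
class Row:
    city: str
    year: int
    population: Optional[int]  # None plays the role of a NaN for ints

# The frame itself is just an array of records.
frame = [
    Row("Oslo", 2013, 623_000),
    Row("Bergen", 2013, None),  # missing value
]

# Column access is a projection over the rows.
populations = [r.population for r in frame]

# Aggregations have to skip missing entries explicitly.
total = sum(p for p in populations if p is not None)
```

In D the same shape falls out of a struct plus `Nullable!int` and a range over an array of those structs; the design choice is the same in either language.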

Supporting features like having *both* rows and columns
accessible via labels rather than indexes requires a little bit
more wrapping. We have a NamedMatrix class at my workplace for
that purpose. It's easy to overload the index operator [] for
access, * for matrix multiplication, etc.

CSV loads can be done with std.csv; unfortunately there's no
corresponding support in that module for *writing* CSV (I've
rolled my own). At my workplace we also have a MysqlConnection
class that provides one-liner loading from a SQL query into
minimalist, range-of-structs dataframes.

Beyond that, it really depends on how you want to manipulate the
dataframes. What specific things do you want to do? If you've got
an idea, I could work up some sample code.

So yes, there are people doing it in The Real World.
Unfortunately my colleagues don't have a nice, tidy,
self-contained DataFrame module to share (yet). But having one
would be a great thing for D. The bigger problem though is
matching the huge 3rd-party stats libraries (like CRAN for R).


On Wednesday, 25 September 2013 at 03:41:36 UTC, Jay Norwood
wrote:
 [...]
Sep 25 2013
parent reply "John Colvin" <john.loughran.colvin gmail.com> writes:
On Wednesday, 25 September 2013 at 18:37:48 UTC, Jared Miller 
wrote:
 [...]
I had considered one day making a semi-port of pandas, at the very least stealing Wes's basic algorithms (no point reinventing the hard stuff). The interface could be better in D than in Python, I reckon, although of course the lack of a repl is a bit of a show-stopper.
Sep 25 2013
parent reply "bearophile" <bearophileHUGS lycos.com> writes:
John Colvin:

 although of course the lack of a repl is a bit of a 
 show-stopper.
There are (or were) two different repls for D. The second is for D2.

Bye,
bearophile
Sep 25 2013
parent reply "Laeeth Isharc" <Laeeth.nospam nospam-laeeth.com> writes:
"
 I thought about it once but quickly abandoned the idea. The 
 primary reason was that D doesn't have REPL and is thus not 
 suitable for interactive data exploration.
The quick compile times could allow interactive data exploration I agree with other posters that a D REPL and interactive/visualization data environment would be very cool, but unfortunately doesn't exist. Batch computing is more practical, but REPLs really hook new users. I see statistical computing as a huge opportunity for D adoption. (R is just super-ugly and slow, leaving Python + its various native-code cyborg appendages as the hot new stats environment). There are tons of ways of accomplishing the same thing in D, but as far as I know there isn't a "standard" at this point. A statically typed dataframe is, at minimum, just a range of structs -- even more minimally, a bare *array* of structs, or alternatively just a 2-D array in a thin wrapper that provides access via column labels rather than indexes. You can manipulate these ranges with functions from std.range and std.algorithm. Missing or N/A data is a common issue, and can be represented in a variety of ways, with integers being the most annoying since there is no built-in NaN value for ints (check out the Nullable template from std.typecons). Supporting features like having *both* rows and columns are accessible via labels rather than indexes requires a little bit more wrapping. We have a NamedMatrix class at my workplace for that purpose. It's easy to overload the index operator [] for access, * for matrix multiplication, etc. CSV loads can be done with std.csv; unfortunately there's no corresponding support in that module for *writing* CSV (I've rolled my own). At my workplace we also have a MysqlConnection class that provides one-liner loading from a SQL query into minimalist, range-of-structs dataframes. Beyond that, it really depends on how you want to manipulate the dataframes. What specific things do you want to do? If you've got an idea, I could work up some sample code. So yes, there are people doing it in The Real World. 
Unfortunately my colleagues don't have a nice, tidy, self-contained DataFrame module to share (yet). But having one would be a great thing for D. The bigger problem though is matching the huge 3rd-party stats libraries (like CRAN for R). " ---- Since we do have an interactive shell (the pastebin), and now bindings and wrappers for hdf5 (key for large data sets) and basic seeds for a matrix library, should we start to think about what would be needed for a dataframe, and the best way to approach it, starting very simply? One doesn't need to have a comparable library to R for it to start being useful in particular use cases. Pandas and Julia would be obvious potential sources of inspiration (and it may be that one still uses them to call out to D in some cases), but rather than trying to just port pandas to D, it seems to make sense to ask how one should do it from scratch to better suit D. Laeeth.
Dec 26 2014
parent reply Russel Winder via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:
On Fri, 2014-12-26 at 20:44 +0000, Laeeth Isharc via Digitalmars-d-learn
wrote:
[…]
 I agree with other posters that a D REPL and
 interactive/visualization data environment would be very cool,
 but unfortunately doesn't exist. Batch computing is more
 practical, but REPLs really hook new users. I see statistical
 computing as a huge opportunity for D adoption. (R is just
 super-ugly and slow, leaving Python + its various native-code
 cyborg appendages as the hot new stats environment).
REPLs are over-hyped and have become a fashion touchstone that few dare argue against for fear of being denounced as un-hip. REPLs have their place, but in the main are nowhere near as useful as people claim. IPython Notebooks, on the other hand, are a balance between editor/execution environment and REPL that really has a lot going for it.

Stats folks using R love R and hate Python. Stats folks using Python love Python and hate R. In the end it's all about what you know and can use to get the job done.

To be frank (as in open rather than Jill), D hasn't got the infrastructure to compete with either R or Python and so is a non-starter in the data science arena.
 [...]

 So yes, there are people doing it in The Real World.
 Unfortunately my colleagues don't have a nice, tidy,
 self-contained DataFrame module to share (yet). But having one
 would be a great thing for D. The bigger problem though is
 matching the huge 3rd-party stats libraries (like CRAN for R).
 "
Nor the whole Python/SciPy/Matplotlib thing.
 ----
 Since we do have an interactive shell (the pastebin), and now
 bindings and wrappers for hdf5 (key for large data sets) and
 basic seeds for a matrix library, should we start to think about
 what would be needed for a dataframe, and the best way to
 approach it, starting very simply?

 One doesn't need to have a comparable library to R for it to
 start being useful in particular use cases.
Whilst I can do workshops for data science folk using Python and have an argument why Python beats R for almost all cases so far brought up, there is no way I can even start to mention D.
 Pandas and Julia would be obvious potential sources of
 inspiration (and it may be that one still uses them to call out
 to D in some cases), but rather than trying to just port pandas
 to D, it seems to make sense to ask how one should do it from
 scratch to better suit D.
Pandas is just one of the "native code cyborg appendages" you were railing about earlier. It happens to be "a big thing" in data science and one of the reasons Python is running away with the market, reducing R's market penetration and only being a little bit dented in some places by Julia.

It's not about the language, it's about the total milieu. Whether or not Python is a good language vs D is irrelevant: Python/SciPy/Matplotlib/Pandas/IPython is there and ready, D has no play in the game.

--
Russel.
Dr Russel Winder      t: +44 20 7585 2200   voip: sip:russel.winder ekiga.net
41 Buckmaster Road    m: +44 7770 465 077   xmpp: russel winder.org.uk
London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder
Dec 26 2014
parent reply "Laeeth Isharc" <Laeeth.nospam nospam-laeeth.com> writes:
"
 REPLs are over-hyped and have become a fashion touchstone that 
 few dare argue against for fear of being denounced as un-hip. 
 REPLs have their
 place, but in the main are nowhere near as useful as people 
 claim.
 IPython Notebooks on the other hand are a balance between
 editor/execution environment and REPL that really has a lot 
 going for
 it."
Fair argument against an earlier poster, but from my perspective all I meant is that the absence of a shell is not a good reason to write off D for exploring data: there is a shell already that could be developed, and one can call D from python / Julia in a notebook.
Stats folks using R, love R and hate Python. Stats folk using
 Python, love Python and hate R. In the end it's all about what 
 you know and can use to get the job done. To be frank (as in 
 open rather than Jill), D hasn't got the infrastructure to 
 compete with either R or Python and so is a non-starter in the 
 data science arena.
About the future you may or may not be right. (Whether it is commercially interesting to run workshops in D for stats people is certainly an interesting question. However, given the ways that technology unfolds, it may be less relevant for the question I am most interested in answering today.)

I want to do things in D myself, and I would find a data frame helpful. I understand you don't program much in D these days, and that's a reasonable decision, but for those who want to use it to do quantish things with dataframes, perhaps we could think about how to approach the problem. And having weighed your warnings, if you have any suggestions on how best to implement this, I would be open to those also.

Laeeth.
Dec 26 2014
parent reply Russel Winder via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:
On Sat, 2014-12-27 at 01:33 +0000, Laeeth Isharc via Digitalmars-d-learn
wrote:
[…]
 Fair argument against an earlier poster but from my perspective,
 all I meant is that the absence of a shell is not a good reason
 to write off D for exploring data.  Because there is a shell
 already that could be developed, and because one can call D from
 python / Julia in a notebook.
I think we are agreeing. A very lightweight editor and executor of code fragments is as good as, if not better than, the one-line REPL.

[…]
 About the future you may or may not be right.  (Whether it is
 commercially interesting to run workshops in D for stats people
 is certainly a interesting question.  However given the ways that
 technology unfolds it may be that it is less relevant for the
 question I am most interested today in answering.)
Part of the problem here is tribalism. Most data science people want to use the same tools that other data science people use, even though the issue is to differentiate themselves. Currently R and Python are the tools of the moment. Julia hasn't made deep penetration, but is totally focused on trying to replace R and Python for data analysis.
 I want to do things in D myself, and I would find a data frame
 helpful.  I understand you don't program much in D these days,
 and that's a reasonable decision, but for those who want to use
 it to do quantish things with dataframes, perhaps we could think
 about how to approach the problem.  And having weighed your
 warnings, if you have any suggestions on how best to implement
 this, I would be open to these also.
A BLAS library is certainly a precursor, as are very good data visualization tools, graphs, diagrams etc. It isn't the language per se that makes R, Python and increasingly Julia, but the fact that the results of the analysis can be rendered graphically.

I know much less about R, but the whole Python/NumPy thing works only because it is faster and easier than Python alone. NumPy performance is actually quite poor. I am finding I can write Python + Numba code that hugely outperforms the same algorithm using NumPy.

Go is making great play of the fact that it can attract people using Python for system style programming. Go has Gtk and Qt for graphics. D has Gtk, but no real Qt. But in the end D isn't getting the traction as the C/Python replacement that Go has. Go has masses of people putting a lot of effort into Web. It's not the ideas, it's the number of people getting on board and doing things.

To get some traction in any of these areas, finance data analysis and model building, or systems activity, it is all about people doing it, publicizing it and making things available for others to use.

Taking the R array types and Pandas' DataFrames and TimeSeries and building and using D versions is going to be needed for D to get traction. But it needs to be better than Julia in some way that makes others sit up and take notice. There has to be the ability to create some hype.

--
Russel.
Dec 27 2014
parent reply "aldanor" <i.s.smirnov gmail.com> writes:
On Saturday, 27 December 2014 at 10:54:01 UTC, Russel Winder via 
Digitalmars-d-learn wrote:
 I know much less about R, but the whole Python/NumPy thing 
 works but
 only because it is faster and easier than Python alone. NumPy
 performance is actually quite poor. I am finding I can write 
 Python +
 Numba code that hugely outperforms that same algorithm using 
 NumPy.
There will surely be some algorithms where numba/cython would do better (especially if they cannot be easily vectorized), but that's not the point. The thing about numpy is that it provides a unified, accepted interface (plus a reasonable set of reasonably fast tools and algorithms) for arrays and buffers for a multitude of scientific libraries (scipy, pytables, h5py, pandas, scikit-*, just to name a few), which then makes it much easier to use them together and write your own.
Dec 27 2014
parent Russel Winder via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:
On Sat, 2014-12-27 at 13:46 +0000, aldanor via Digitalmars-d-learn
wrote:
 On Saturday, 27 December 2014 at 10:54:01 UTC, Russel Winder via=20
 Digitalmars-d-learn wrote:
 I know much less about R, but the whole Python/NumPy thing=20
 works but
 only because it is faster and easier than Python alone. NumPy
 performance is actually quite poor. I am finding I can write=20
 Python +
 Numba code that hugely outperforms that same algorithm using=20
 NumPy.
 There will sure be some algorithms where numba/cython would do
 better (especially if they cannot be easily vectorized), but
 that's not the point. The thing about numpy is that it provides
 a unified accepted interface (plus a reasonable set of
 reasonably fast tools and algorithms) for arrays and buffers for
 a multitude of scientific libraries (scipy, pytables, h5py,
 pandas, scikit-*, just to name a few), which then makes it much
 easier to use them together and write your own ones.
Agreed, it is not NumPy that is the win, it is PyTables, Pandas, SciKit-Learn etc. These are the standard tools because they are domain specific and aimed at the audience. The audience neither knows nor cares that NumPy is actually not very good, because they have the tools they need and nothing to compare them against – unless Julia gets real traction, or a language like D can use its one or two entries in the field to create a usable set of libraries.

As with the Vibe.d and Dub experience: pick a field, write and use something that does the job better than anything else in that field, then market the experience.

--
Russel.
Dec 27 2014
prev sibling parent reply "aldanor" <i.s.smirnov gmail.com> writes:
On Wednesday, 25 September 2013 at 03:41:36 UTC, Jay Norwood 
wrote:
 [...]
Pandas has numpy as "backend" which does a lot of heavy lifting, so first things first -- imo D needs a fast and flexible blas/lapack-compatible multi-dimensional rectangular array library that could later serve as backend for pandas-like libraries.
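The "heavy lifting" a numpy-style rectangular-array backend supplies can be seen in a few lines (a sketch; the sizes are arbitrary): one contiguous buffer, whole-array arithmetic with no interpreter-level loop, and zero-copy reshaping into a 2-D view — the properties a dataframe layer builds on.

```python
import numpy as np

# One contiguous buffer of a million doubles.
a = np.arange(1_000_000, dtype=np.float64)

# A whole-array expression: the loop runs in compiled code.
b = a * 2.0 + 1.0

# The same buffer viewed as a 2-D rectangular array, without copying.
m = b.reshape(1000, 1000)
```

A D equivalent would need exactly this layer — typed contiguous storage, element-wise kernels, and cheap multi-dimensional views — before a pandas-like library could sit on top of it.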
Dec 26 2014
parent reply "Laeeth Isharc" <laeethnospam spammenot_laeeth.com> writes:
On Friday, 26 December 2014 at 21:31:00 UTC, aldanor wrote:
 On Wednesday, 25 September 2013 at 03:41:36 UTC, Jay Norwood 
 wrote:
 [...]
Pandas has numpy as "backend" which does a lot of heavy lifting, so first things first -- imo D needs a fast and flexible blas/lapack-compatible multi-dimensional rectangular array library that could later serve as backend for pandas-like libraries.
I don't believe I agree that we need a perfect multi-dimensional rectangular array library to serve as a backend before thinking and doing much on data frames (although it will certainly be very useful when ready).

First, it seems we do have matrices, even if lacking in complete functionality for linear algebra and the like. There is a chicken-and-egg aspect in the development of tools - it is rarely the case that one kind of tool necessarily totally precedes another, and there are often complementarities and dynamic effects between different stages. If one waits till one has everything one needs done for one, one won't get much done.

Secondly, much of the kind of thing Pandas is useful for is not exactly rocket science from a quantitative perspective, but it's just the kind of thing that is very useful if you are thinking about working with data sets of a decent size. The concepts seem to me to fit very well with std.algorithm and std.range, and can be thought of as just a way to bring out the power of the tools we already have when working with data in the world as it is. See here for an example of just how simple. Remember Excel pivottables?

http://pandas.pydata.org/pandas-docs/stable/groupby.html

Thirdly, one of the reasons Pandas is popular is that it is written in C/Cython and very fast - significantly faster than Julia. One might hit roadblocks down the line when it comes to the Global Interpreter Lock and the difficulty of processing larger sets quickly in Python, but at least this stage is fast and easy. So people do care about speed, but they also care about the frictions being taken away, so that they can spend their energies on addressing the problem at hand. I.e. a dataframe will be helpful, in my view.

Processing of log data is a growing domain - partly from the internet, but also from the internet of things.
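The groupby page linked above boils down to split-apply-combine, which is a small amount of machinery; a minimal sketch (the table here is invented for illustration):

```python
import pandas as pd

# A small table of the pivot-table kind.
df = pd.DataFrame({
    "region": ["EU", "EU", "US", "US"],
    "year":   [2013, 2014, 2013, 2014],
    "sales":  [10, 12, 20, 25],
})

# Split by key, apply an aggregate, combine into a keyed result.
totals = df.groupby("region")["sales"].sum()
```

The same split-apply-combine shape maps naturally onto D's std.algorithm (sort plus chunkBy plus a fold), which is the point being made above.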
See below for one company using D to process logs:

http://venturebeat.com/2014/11/12/adroll-hits-gigantic-130-terabytes-of-ad-data-processed-daily-says-size-matters/
http://tech.adroll.com/blog/data/2014/11/17/d-is-for-data-science.html

A poster on this forum is already using D as a library to call from R (from Reddit), which brings home the point that it isn't necessary for D to be able to do every part of the process for it to be able to take over some of the heavy work.

"[–]bachmeier 6 points 1 month ago

I call D shared libraries from R. I've put together a library that offers similar functionality to Rcpp. I've got a presentation showing its use on Linux. Both the presentation and library code should be made available within the next couple of days.

My library makes available the R API and anything in Gretl. You can allocate and manipulate R objects in D, add R assert statements in your D code, and so on. What I'm working on now is calling into GSL for optimization. These are all mature libraries - my code is just an interface. It's generally easy to call any C library from D, and modern Fortran, which provides C interoperability, is not too much harder."

See here for just one use case in the internet of things. They don't use D, but maybe they should have. And it shows an example where perhaps at least log processing could easily be handled by what we have, with a few small additional data structures - even if people use outside libraries for the machine learning part.

http://www.forbes.com/sites/danwoods/2014/11/04/how-splunk-caught-wall-streets-eye-by-taming-the-messy-world-of-iot-data/3/

"By using Splunk software, Hrebek said that his division's leader product is able to offer customers a real-time view of operations on a train and to use machine learning to suggest optimal strategies for driving trains along various routes. Just shaving a small percentage off of fuel costs can mean huge savings for a railroad.

Why Doesn't BI Work for the IoT?
In both of the use cases just mentioned, for years existing business intelligence technology had been applied to the problem of making sense of the data, with little success. The problem is not that it is impossible to use traditional ETL technology and an RDBMS or, more commonly, spreadsheets to get something working so that some of the data becomes useful. It is just that the effort involved is great and the technical effort involved in maintaining such systems is massive. Hrebek compared using spreadsheets for IoT data to living in the ninth circle of hell in Dante's Inferno, because the process is so tedious and error prone.

Machine data is different from the flat files that are the paradigm for BI technology, which works in rows and columns. Also, machine data can be naturally organized into a time series, but this is not the default way that a spreadsheet or an RDBMS works.

Why Does Splunk Work for the IoT?

IoT data essentially looks exactly the same as the machine data from servers in a data center that Splunk Enterprise was initially created to handle. The software allows you to:

- Automatically parse fields
- Identify several different types of records as a related group
- Organize and store records by timestamp
- Create dashboards and analytics that are updated in real time

With each successive release, Splunk is making the process of parsing machine data as automatic and machine-assisted as possible. Its software handles variations of IoT data by allowing a simple mapping of a field into a standard name. For example, the GPS coordinates of a train car might be recorded in six or seven different ways in various forms of machine data, but can be unified via Splunk Enterprise. Splunk software allows these mappings to be implemented and maintained with a minimum of effort.

The bottom line is that there is no way to avoid the imperfections that naturally occur in the real world.
We are always going to have lots of trees and to have to deal with them both as individuals and as a forest, in a normalized aggregate form. The reason Splunk is making such inroads in IoT applications is that it can handle both the trees and the forest and turn the information from the real world into a clear view of what is happening that allows useful models of reality to be created. If you are building an IOT application, you must find a way to handle the messy nature of the real world." Many more similar oppties for D here: https://www.google.de/search?q=internet+of+things+massive+log+processing+growth&btnG=Search&oe=utf-8&gws_rd=cr Laeeth.
Dec 26 2014
parent reply Russel Winder via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:
On Sat, 2014-12-27 at 06:21 +0000, Laeeth Isharc via Digitalmars-d-learn
wrote:
[…]
 I don't believe I agree that we need a perfect multi-dimensional
 rectangular array library to serve as a backend before thinking
 and doing much on data frames (although it will certainly be very
 useful when ready).
Also, if there is a ready made C or C++ library that can be made use of, do it.
 First, it seems we do have matrices, even if lacking in complete
 functionality for linear algebra, and the like.  There is a
 chicken and egg aspect in the development of tools - it is rarely
 the case that one kind of tool necessarily totally precedes
 another, and often complementarities and dynamic effects between
 different stages.  If one waits till one has everything one needs
 done for one, one won't get much done.
In the end there is no point in a language/compiler/editor if there is not the perceived support for the things that large numbers of people want to do. C, C++, C#, F#, Java, Scala, Groovy, Python, R, Julia, Go, all find themselves with a vocal audience doing things. The language evolves with the libraries and "end user" applications. In the end it is all about people doing things with a language and hyping it up.
 Secondly, much of the kind of thing Pandas is useful for is not
 exactly rocket science from a quantitative perspective, but it's
 just the kinds of thing that is very useful if you are thinking
 about working with data sets of a decent size. The concepts seem
 to me to fit very well with std.algorithm and std.range, and can
 be thought of as just a way to bring out the power of the tools
 we already have when working with data in the world as it is.
 See here for an example of just how simple.  Remember Excel
 pivot tables?

 http://pandas.pydata.org/pandas-docs/stable/groupby.html
I recently discovered a number of hedge funds work solely on moving average based algorithmic trading. NumPy, SciPy and Pandas all have variations on this basic algorithm. Isn't "group by" standard in all languages? Certainly Python, Groovy, Scala, Haskell, …
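For reference, both of those operations already fall out of Phobos ranges. A rough sketch with synthetic data (assuming a compiler recent enough to have std.range.slide; the group-by mirrors the pandas groupby examples):

```d
import std.algorithm : chunkBy, map, sort, sum;
import std.array : array;
import std.range : slide;
import std.typecons : tuple;

void main()
{
    // group-by: total value per key, as in the pandas groupby docs
    auto rows = [tuple("a", 1.0), tuple("b", 2.0), tuple("a", 3.0)];
    auto totals = rows
        .sort!((x, y) => x[0] < y[0])       // chunkBy needs equal keys adjacent
        .chunkBy!((x, y) => x[0] == y[0])   // one chunk per key
        .map!(g => tuple(g.front[0], g.map!(r => r[1]).sum))
        .array;
    assert(totals == [tuple("a", 4.0), tuple("b", 2.0)]);

    // 3-period simple moving average, the basic trend-following primitive
    auto prices = [1.0, 2.0, 3.0, 4.0, 5.0];
    auto sma = prices.slide(3).map!(w => w.sum / 3).array;
    assert(sma == [2.0, 3.0, 4.0]);
}
```

Not a dataframe, of course - just evidence that the building blocks are already in std.algorithm/std.range.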
 Thirdly, one of the reasons Pandas is popular is because it is
 written in C/Cython and very fast.  It's significantly faster
 than Julia.  One might hit roadblocks down the line when it comes
 to the Global Interpreter Lock and difficulty of processing
 larger sets quickly in Python, but at least this stage is fast
 and easy.  So people do care about speed, but they also care
 about the frictions being taken away, so that they can spend
 their energies on addressing the problem at hand.  I.e. a dataframe
 will be helpful, in my view.
Perceived to be fast. In fact it isn't anything like as fast as it should be. NumPy (which underpins Pandas and provides all the data structures and basic algorithms) is actually quite slow.

I have ranted many times about the GIL in Python, and on two occasions spent 2 or 3 hours trying to convince Guido of the lunacy of a GIL-based interpreter in 2014. Armin Rigo has an STM-based version in PyPy and CPython and has shown it can work just fine. Guido though is I/O bound rather than CPU bound in his work and doesn't see a need for anything other than multiprocessing for accessing parallelism in Python. Sadly, it can be shown that multiprocessing is slow and inefficient at what it does, and it needs replacing.

NumPy's approach to parallelism is nice as an abstraction, but doesn't really "cut it" unless you do not know any better. In principle this is fertile territory for a new language to take the stage. Hence Julia. I fear D has missed the boat of this opportunity now. On the other hand, if some real data science people begin to do data science with D and show that more can be done with less, and without loss of functionality, then there is an opportunity for marketing and possible traction in the market.
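On the GIL point, this is one thing D can already demonstrate concretely: std.parallelism covers the easy data-parallel cases in a couple of lines, with no multiprocessing machinery. A minimal sketch:

```d
import std.parallelism : taskPool;
import std.range : iota;

void main()
{
    // map a function over 10,000 elements across all cores -
    // no GIL, no pickling, no multiprocessing fork dance
    auto squares = taskPool.amap!(x => x * x)(iota(10_000));
    assert(squares[100] == 10_000);
}
```

taskPool.amap eagerly evaluates into an array using the default thread pool; taskPool.map is the lazy, pipelined variant.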
 Processing of log data is a growing domain - partly from
 internet, but also from the internet of things.  See below for
 one company using D to process logs:

 http://venturebeat.com/2014/11/12/adroll-hits-gigantic-130-terabytes-of-ad-data-processed-daily-says-size-matters/
 http://tech.adroll.com/blog/data/2014/11/17/d-is-for-data-science.html
This is worth hyping up; it should be front and centre on the dlang pages, along with Facebook funding bug fixes. Having the tweets list is great but too ephemeral; the "D is for Data Science" tweet will fade too quickly.
 A poster on this forum is already using D as a library to call
 from R (from Reddit), which brings home the point that it isn't
 necessary for D to be able to do every part of the process for it
 to be able to take over some of the heavy work.
Funny, isn't it, how every language must do everything. So for all new languages you have to have a new build system and a new event loop. The problem, though, is that C is the language of extension for R, Python, … even though it is a language that should now only be used for working "right on the metal", if at all.
 "[–]bachmeier 6 points 1 month ago

 I call D shared libraries from R. I've put together a library
 that offers similar functionality to Rcpp. I've got a
 presentation showing its use on Linux. Both the presentation and
 library code should be made available within the next couple of
 days.

 My library makes available the R API and anything in Gretl. You
 can allocate and manipulate R objects in D, add R assert
 statements in your D code, and so on. What I'm working on now is
 calling into GSL for optimization.

 These are all mature libraries - my code is just an interface.
 It's generally easy to call any C library from D, and modern
 Fortran, which provides C interoperability, is not too much
 harder.
 "
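For anyone curious about the mechanics behind bachmeier's description, a stripped-down version of the idea looks something like this (the file and symbol names here are invented for illustration; his actual library surely does far more):

```d
// mylib.d - build as e.g.:  dmd -shared -fPIC mylib.d -of=libmylib.so
// (names invented for illustration)

// R's .C interface passes every argument as a pointer
extern (C) void d_mean(double* x, int* n, double* result)
{
    double s = 0;
    foreach (i; 0 .. *n)
        s += x[i];
    *result = *n > 0 ? s / *n : 0;
}

// On the R side, roughly:
//   dyn.load("libmylib.so")
//   .C("d_mean", as.double(c(1, 2, 3)), as.integer(3), result = double(1))$result
// D code that touches the GC would additionally need the D runtime
// initialized (core.runtime) - which is where a wrapper library earns its keep.
```

The point is just that the C linkage boundary is trivially reachable from D, so R (or Python) can treat a D shared library exactly like a C one.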
But if all the libraries are C, C++ and Fortran, is there any value-add role for D? Lots of C++ systems embed Python or Lua for dynamic scripting capability; lots of Python and R systems call out to C. This seems a well established milieu. Is there a good way for D to establish a permanent foothold in an evolutionary way? Certainly it cannot be a revolutionary one.

[…]

Splunk stuff is just an example of using dataflow networks for processing data rather than using SQL. The "Big Data using JVM" community are already on this road, cf. various proprietary frameworks running over Hadoop and Spark. Dataflow frameworks are likely to be the next big thing. Java and Groovy have established offerings; no other language really does, other than Go. If D could get a really good dataflow framework before C++, Rust, etc. then that might be a route to traction.

--
Russel.
=============================================================================
Dr Russel Winder      t: +44 20 7585 2200   voip: sip:russel.winder ekiga.net
41 Buckmaster Road    m: +44 7770 465 077   xmpp: russel winder.org.uk
London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder
Dec 27 2014
parent reply "Ola Fosheim =?UTF-8?B?R3LDuHN0YWQi?= writes:
On Saturday, 27 December 2014 at 13:39:59 UTC, Russel Winder via 
Digitalmars-d-learn wrote:
 I have ranted many times about the GIL in Python, and on two
 occasions spent 2 or 3 hours trying to convince Guido of the
 lunacy of a GIL-based interpreter in 2014. Armin Rigo has an
 STM-based version in PyPy and CPython and has shown it can work
 just fine.
I wonder how TSX would work with the GIL. I suppose most GIL locks are short-lived enough to be covered by TSX before it fails and takes a lock.
 In principle this is fertile territory for a new language to 
 take the
 stage. Hence Julia. I fear D has missed the boat of this 
 opportunity
 now. On the other hand if some real data science people begin 
 to do data
 science with D and show that more can be done with less, and 
 without
 loss of functionality, then there is an opportunity for 
 marketing and
 possible traction in the market.
To be fair, you also have to compete against commercial solutions such as SPSS, SAS and others. Then you have OpenMP for C++ and Fortran, which it will be difficult for D to compete with in terms of performance vs effort.
Dec 27 2014
parent reply Russel Winder via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:
On Sat, 2014-12-27 at 13:53 +0000, via Digitalmars-d-learn wrote:
[…]

 I wonder how TSX would work with the GIL. I suppose most GIL locks
 are short-lived enough to be covered by TSX before it fails and
 takes a lock.
For Intel chips this is good stuff (stolen from Sun's Rock processor). Hardware-supported transactional memory easily beats software transactional memory, but the latter is portable.

[…]

 To be fair, you also have to compete against commercial solutions
 such as SPSS, SAS and others.
It is relatively easy to compete against these generally. Small organizations (which actually make up the bulk of users) prefer not to pay the extortionate fees. Anecdotal evidence clearly shows a mass move from Matlab to Python+NumPy+… – the anecdotes being my Python workshops last year, where 40%+ of people were in this position.
 Then you have OpenMP for C++ and Fortran, which it will be
 difficult for D to compete with in terms of performance vs effort.
If you said MPI, then yes: it is the de facto standard native-code clustering system; on the JVM there is Netty and a few other systems. OpenMP is really just a way of hacking sequential code to create parallel code on a multicore single address space, and a very good hack it is too. But it remains a hack, and not a good way of transitioning from fundamentally sequential code to fundamentally parallel code. OpenMP exists exactly because Fortran, C and C++ codes had to be made data parallel without being rewritten. D should not be in this boat.

--
Russel.
Dec 27 2014
next sibling parent reply "Ola Fosheim =?UTF-8?B?R3LDuHN0YWQi?= writes:
On Saturday, 27 December 2014 at 14:07:51 UTC, Russel Winder via 
Digitalmars-d-learn wrote:
 sequential code to fundamentally parallel code. OpenMP exists
 exactly because Fortran, C and C++ codes had to be made data
 parallel without being rewritten. D should not be in this boat.
I don't disagree in principle, but if an OpenMP-supporting compiler can generate code for GPGPU then D will be miles behind for many homogeneous workloads.
Dec 27 2014
parent Russel Winder via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:
On Sat, 2014-12-27 at 14:28 +0000, via Digitalmars-d-learn wrote:
[…]

 I don't disagree in principle, but if an OpenMP-supporting
 compiler can generate code for GPGPU then D will be miles behind
 for many homogeneous workloads.
No-one with resources showed any interest in having a D with GPGPU capability, so I think we can more or less say that C++ has won this arena. Well, except that everyone uses C, including the Python folk. I am awaiting the Java play in this space from the IBM folk.

--
Russel.
Dec 27 2014
prev sibling parent reply "Laeeth Isharc" <laeethnospam spammenot_laeeth.com> writes:
Russel:
"I think we are agreeing. Very lightweight editor and executor of 
code
fragments is as good, if not better, that the one line REPL."

Yes - the key for me is that the absence of a shell is by no 
means a reason to say that D is not suited to this task.  One may 
wish to refine what exists, but that is another question entirely.

"Part of the problem here is tribalism. Most data science people 
want to
use the same tools that other data science people use, even 
though the
issue is to differentiate themselves."

Yes - we are answering two different questions.  I could not care 
less about persuading anyone en masse in a broad sector, those 
who think of themselves as being 'data scientists' included.  
It's silly, in my view, to think of it as an established field 
very distinct from others, and with a fixed way of doing things.  
If for no other reason than that things are in flux and the sector is
growing quickly, which means that there is room for many 
different approaches, and it is premature to think the popularity 
of approach X or Y today means that approach 'D' can't be 
productive tomorrow.

But as I said, I am less convinced in persuading anyone, and 
rather more concerned with getting a basic data frame in D up and 
running because I could certainly use it, and the hard work has 
been done already.  The basics should be an evening's work for an 
advanced D hacker, but it will probably take me longer than that. 
  In any case, since nobody else has come forward, I will keep 
working away at it.

"A BLAS library is certainly a precusor, as is very good data
visualization tools, graphs, diagrams etc."

Perhaps a prerequisite to D being seen as a contender, but I 
don't see how it's a prerequisite just to have a dataframe, which 
is really a very simple yet incredibly useful thing.

"Go has masses of people putting a lot of effort into Web. It's 
not the ideas, it's the number of people getting on board and 
doing things".

Also about the quality of the people.  (I have no view about Go, 
but have a very positive view on D).  When things get big there 
is a danger they get cluttered.  That's one blessing for D.

"To get some traction in any of these areas, finance data 
analysis and
model building, or systems activity, it is all about people doing 
it,
publicizing it and making things available for others to use".

Yes - so do you have any thoughts on what a data frame structure 
should look like?  I am trying to do so, and after that will make
it available.

"But it needs to be better than Julia in some way that makes
others sit up and take notice. There has to be the ability to 
create
some hype."

Don't care ;)  This concept of "what is your edge" is not my cup 
of tea because I do not see the world in those terms.  Something 
of high quality that's highly productive will over time stand a 
decent chance of becoming more widely adopted, whereas trying to 
force it into some kind of marketing framework can prove 
counterproductive.

Right now, the main thing I care about is solving the problem at 
hand, because if it solves my problem well then I am pretty sure 
it will be useful to others too, and be so better than if one had 
adopted a more consciously 'commercial/marketing' mindset.

I would post the dataframe skeleton here, but it's too 
embarrassing right now and I want to read the std.variant library to
see what tricks I can learn.  (A data series seems kind of like a 
variant, but with every cell the same type).  Obviously in some 
cases the data frame type is defined at compile time, like a 
struct, and that's easy.  But if you are loading from a file you 
need to be able to have dynamic typing for the column.
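As a starting point for that, one sketch (entirely illustrative - the types and names here are invented, not a finished design) is to put the runtime tag on the column rather than on each cell, so each column stays a homogeneous array while the frame as a whole is dynamically typed:

```d
import std.variant : Algebraic;

// One runtime tag per column rather than per cell, so storage stays
// a plain homogeneous array.  (All names here are invented.)
alias Column = Algebraic!(double[], long[], string[]);

struct Frame
{
    string[] names;
    Column[] columns;
}

void main()
{
    auto f = Frame(["price", "ticker"],
                   [Column([101.5, 99.25]), Column(["AAA", "BBB"])]);

    // a column loaded from a file gets its type chosen at runtime,
    // but once chosen, every cell in the column shares it
    auto prices = f.columns[0].get!(double[]);
    assert(prices[1] == 99.25);
}
```

That keeps the "variant, but with every cell the same type" property, and the compile-time-typed struct case can sit alongside it.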

"> I don't believe I agree that we need a perfect multi-dimensional
 rectangular array library to serve as a backend before thinking
 and doing much on data frames (although it will certainly be
 very useful when ready).
Also, if there is a ready made C or C++ library that can be made use of, do it."

Well, the hard parts of arrays themselves (and it's not that fiendishly hard, I would think) seem to need to be tightly integrated with the language, so I don't see how a C/C++ library will help so much. For the linear algebra, yes... hyping it up.

"I recently discovered a number of hedge funds work solely on moving average based algorithmic trading. NumPy, SciPy and Pandas all have variations on this basic algorithm."

Well, having worked for more or less quanty hedge funds since '98, I would think it unlikely that anyone depends only on moving averages, although basic old-school trend-following certainly does work - it is just a hard sell to herding institutional investors, and does not fit very well with the concept of a 'career'. (You have to be able to see the five years of subdued returns since 2009 as just part of the cycle, which indeed may be the correct view when one sees markets as a natural phenomenon, but is not the view of asset allocators, or of talented people one may want to hire in other areas.)

"Perceived to be fast. In fact it isn't anything like as fast as it should be. NumPy (which underpins Pandas and provides all the data structures and basic algorithms) is actually quite slow."

Yes - I was tired when I wrote that, and meant to say that Pandas is fast for key things such as parsing large data files, e.g. CSVs - significantly faster than Julia, from what I have seen. And yes - I agree about NumPy, and don't need to be persuaded of the benefits of moving to something else if one can make it slightly less inconvenient. Which is how this conversation started - you really don't need a perfect BLAS implementation/wrapper to start to benefit from a dataframe.

"Guido though is I/O bound rather than CPU bound in his work and doesn't see a need for anything other than multiprocessing for accessing parallelism in Python. 
Sadly, it can be shown that multiprocessing is slow and inefficient at what it does and it needs replacing."

I cannot claim deep expertise here, but this was one of the things that got me looking at D originally. Just too frustrating trying to fit with the restrictions to write nogil Cython code, knowing that one might need to rewrite when one has mentally long moved on. I.e. I feel like I am short options building my platform that way, and I don't like being short options when they don't cost much to buy. Hence D. It also struck me that there was a degree of complacency amongst some Python people, whereas hunger and insecurity may be a spur to greater and more creative efforts.

"In principle this is fertile territory for a new language to take the stage. Hence Julia. I fear D has missed the boat of this opportunity now."

I really don't see why one can't just take the next boat arriving in fifteen minutes. Or establish a new boat service going somewhere better that hooks up with the existing network. Conditions are changing so quickly, and the gap between the talk about big data etc. and what people have actually done so far is so large, that to me the field seems wide open. I don't see an alternative acceptable way to do what I would like, so D it will be. And if I think that way today, probably others will have the same thoughts in coming years. (Perhaps not.)

"This is worth hyping up, it should be front and centre on the dlang pages along with Facebook funding bug fixes."

I agree. Also, in a few lines, a punchier summary of why Sociomantic use D, what the benefits have been, and how they deal with the standard sorts of hurdles that might have been objections in a more mature and conventional company ("how are you going to hire experienced D programmers?").

"But if all the libraries are C, C++ and Fortran, is there any value-add role for D?"

I guess we vote with our feet/fingers. 
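On the CSV point, Phobos does at least have std.csv for typed reading already, even if nobody would claim it matches pandas' C parser on speed. A minimal sketch with made-up data:

```d
import std.array : array;
import std.csv : csvReader;
import std.typecons : Tuple;

void main()
{
    auto text = "ticker,price\nAAA,101.5\nBBB,99.25";

    // typed records; passing null for the header argument makes
    // csvReader consume the first line as a header row
    auto records = text.csvReader!(Tuple!(string, double))(null).array;

    assert(records.length == 2);
    assert(records[0][0] == "AAA");
    assert(records[1][1] == 99.25);
}
```

So the missing piece is less the parsing than the dataframe to parse into.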
Sounds like you don't find D especially useful (since you don't use it much currently), whereas I do. De gustibus non est disputandum, particularly when tastes reflect being in different situations.

"Lots of C++ systems embed Python or Lua for dynamic scripting capability, lots of Python and R systems call out to C. This seems a well established milieu. Is there a good way for D to, in an evolutionary way, establish a permanent foothold? Certainly it cannot be a revolutionary one."

You write as if Christensen's book "The Innovator's Dilemma" had never been written, and nor had it been a standard textbook in business schools for some years. You may have good arguments as to why he is wrong, or why it doesn't apply to D, but you haven't set them out, as far as I am aware.

Not Russel:

"There will sure be some algorithms where numba/cython would do better (especially if they cannot be easily vectorized), but that's not the point. The thing about numpy is that it provides a unified accepted interface (plus a reasonable set of reasonably fast tools and algorithms) for arrays and buffers for a multitude of scientific libraries (scipy, pytables, h5py, pandas, scikit-*, just to name a few), which then makes it much easier to use them together and write your own ones."

Yes. But one has to start somewhere (if not happy with the Python route), and we start to have equivalents of scipy, pytables/h5py. So why not pandas?

"Splunk stuff is just an example of using dataflow networks for processing data rather than using SQL. The "Big Data using JVM" community are already on this road, cf. various proprietary frameworks running over Hadoop and Spark."

Yes - technically, it may well be "nothing more than". But many of the practical problems which have a high commercial return to solving are "nothing more than" quite simple things technically. One doesn't need to be a technical genius to make valuable commercial contributions. 
And maybe Hadoop and Spark are just the perfect solution for most people (maybe not!), but that certainly leaves some room for others. So... data frames!?
Dec 27 2014
next sibling parent Russel Winder via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:
On Sat, 2014-12-27 at 15:33 +0000, Laeeth Isharc via Digitalmars-d-learn
wrote:
[…]

 I guess we vote with our feet/fingers.  Sounds like you don't
 find D especially useful (since you don't use it much currently),
 whereas I do.  De gustibus non est disputandum, particularly when
 tastes reflect being in different situations.
[…]

For the avoidance of confusion, the reason I am not using D just now is that I am not actually doing much (other than some training workshops) just now. I was going to use D for a start-up a couple of years ago and Go for a start-up last year, but both projects fell through. These days my only real programming is tinkering with a few toy problems. Oh, and tinkering with GPars and Spock, but that is JVM stuff and so likely not interesting to the folk on this list.

--
Russel.
Dec 27 2014
prev sibling parent reply Russel Winder via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:
On Sat, 2014-12-27 at 15:33 +0000, Laeeth Isharc via Digitalmars-d-learn
wrote:
[…lots of agreed uncontentious stuff :-) …]


 You write as if Christensen's book "The Innovator's Dilemma" had
 never been written, and nor had it been a standard textbook in
 business schools for some years.  You may have good arguments as
 to why he is wrong, or why it doesn't apply to D, but you haven't
 set them out, as far as I am aware.
In the post-production world as I know it (Nuke, etc.) the C++/Python combination has never failed to be adequate to the innovation demanded by film makers. In the image processing world the C++/Lua combination has never failed to adapt to the innovation needed by photograph tinkerers.

My point was really that the customers have never found an innovative need that the extant platforms couldn't provide. I felt this was somewhat different to the Christensen argument. On the other hand, I may have missed the point…

--
Russel.
Dec 27 2014
parent reply "Laeeth Isharc" <Laeeth.nospam nospam-laeeth.com> writes:
On Saturday, 27 December 2014 at 16:41:04 UTC, Russel Winder via 
Digitalmars-d-learn wrote:
 On Sat, 2014-12-27 at 15:33 +0000, Laeeth Isharc via 
 Digitalmars-d-learn
 wrote:
 […lots of agreed uncontentious stuff :-) …]


 You write as if Christensen's book "The Innovator's Dilemma" 
 had never been written, and nor had it been a standard 
 textbook in business schools for some years.  You may have 
 good arguments as to why he is wrong, or why it doesn't apply 
 to D, but you haven't set them out, as far as I am aware.
In the post-production world as I know it (Nuke, etc.) the C++/Python combination has never failed to be adequate to the innovation demanded by film makers. In the image processing world the C++/Lua combination has never failed to adapt to the innovation needed by photograph tinkerers.

My point was really that the customers have never found an innovative need that the extant platforms couldn't provide. I felt this was somewhat different to the Christensen argument. On the other hand, I may have missed the point…
No matter how plugged in a person may be, it is impossible to be aware of everything that is going on, especially in exactly the kind of domains Christensen talks about - ones that aren't by any standard important in a spot sense to the bigger picture, but that critically provide a quiet, relatively uncontested niche for the seeds of something to unfold until it is ready to break out into the broader world.

So I think the point is that one shouldn't be bothered one jot by the disinclination of the people you know to want to use D, particularly since you are so plugged in to all these other worlds (and being an insider in a sense that matters today has an opportunity cost, because it means one is not spending time and attention speaking to non-insiders as much at that instant). New growth will come from the fringes.

I think one should be very worried if the Adam Ruppes of the world would start to say D sucks - nice idea, but just not expressive enough for me, and I am switching back to Ruby and Python. Because that would indicate a loss of ground in the home niche. But somehow I don't think so...! And meantime, quietly, things continue to develop.

What matters is not the challenges one faces, but how one deals with them. An outpouring of frustration in recent days, and the result is we are going to get better docs, better examples, and who knows what else. That's a sign of health.

Will post the code I have in a few days.

Laeeth.
Dec 27 2014
parent reply "Vlad Levenfeld" <vlevenfeld gmail.com> writes:
Laeeth - I am not sure exactly what your needs are but I have a
fairly complete solution for generic multidimensional interfaces
(template-based, bounds checked, RAII-ready, non-integer indices,
the whole shebang) that I have been building. Anyway I don't want
to spam the forum if I've missed the point of this discussion,
but perhaps we could speak about it further over email and you
could give me your opinion? I'm at vlevenfeld gmail.com
Dec 28 2014
parent "Laeeth Isharc" <laeethnospam spammenot_laeeth.com> writes:
On Monday, 29 December 2014 at 04:08:58 UTC, Vlad Levenfeld wrote:
 Laeeth - I am not sure exactly what your needs are but I have a
 fairly complete solution for generic multidimensional interfaces
 (template-based, bounds checked, RAII-ready, non-integer 
 indices,
 the whole shebang) that I have been building. Anyway I don't 
 want
 to spam the forum if I've missed the point of this discussion,
 but perhaps we could speak about it further over email and you
 could give me your opinion? I'm at vlevenfeld gmail.com
Hi Vlad. Thanks v much for getting in touch. Your work sounds very interesting. I will drop you a line in coming days. Happy new year. Laeeth.
Dec 29 2014