
digitalmars.D - [GSoC] Dataframes for D

reply Prateek Nayak <lelouch.cpp gmail.com> writes:
Hello everyone,

I have begun work on my Google Summer of Code 2019 project, 
DataFrame for D.

-----------------
About the Project
-----------------

DataFrames have become a standard for handling and manipulating 
data. They give a neat representation of the data, along with easy 
access and the power to manipulate it the way the user wants.
This project aims at bringing a native DataFrame to D, one which 
brings with it:

* A User Friendly API
* Multi-Indexing
* Writing to CSV and parsing from CSV
* Column binary operation in the form: df["Index1"] = 
df["Index2"] + df["Index3"];
* groupBy on an arbitrary number of columns
* Data Aggregation

Disclaimer: The entire structuring was inspired by Pandas, the 
most popular DataFrame library in Python, and hence most of the 
usage will look very similar to Pandas.

The main focus of this project is the user-friendliness of the API, 
while also maintaining a fair amount of speed and power.
The preliminary road map can be viewed here -> 
https://docs.google.com/document/d/1Zrf_tFYLauAd_NM4-UMBGt_z-fORhFMrGvW633x8rZs/edit?usp=sharing

The core developments can be seen here -> 
https://github.com/Kriyszig/magpie


-----------------------------
Brief idea of what is to come
-----------------------------

This month
----------
* Finish up with structure of DataFrame
* Finish Terminal Output (What good is data which cannot be seen)
* Finish writing to CSV
* Parsing DataFrame from CSV (Both single and multi-indexed)
* Accessing Elements
* Accessing Rows and Columns
* Assignment of an element, an entire row or column
* Binary operation on rows and columns

Next Month
----------
* groupBy
* join
* Begin writing ops for aggregation


-----------
Speed Bumps
-----------

I am relatively new to D and hail from a functional C background. 
Sometimes (most of the time) my code can start to look more like C 
than D.
However, I am adapting thanks to my mentors Nicholas Wilson and 
Ilya Yaroshenko. They have helped me a ton - whether it be with 
debugging errors or my falling back on my functional C past, they 
have always come to my rescue and I am grateful for their 
support.


-------------------------------------
Addressing Suggestions from Community
-------------------------------------

This suggestion comes from Laeeth Isharc
Source: 
https://github.com/dlang/projects/issues/15#issuecomment-495831750

Though this is not on my current road map, I would love to pursue 
this idea. Adding an easy way to interoperate with other 
libraries would be very beneficial.
Although I haven't formally addressed this in the road map, I 
would love to implement msgpack-based I/O as I continue to 
develop the library. JSON I/O was also something I had in mind to 
implement after the data aggregation part. (I had prioritised 
JSON as I believed there were many more datasets available as JSON 
than in any other format.)
May 29 2019
next sibling parent reply Andre Pany <andre s-e-a-p.de> writes:
On Wednesday, 29 May 2019 at 18:00:02 UTC, Prateek Nayak wrote:
 [...]
The outlook looks really great. CSV as a starting point is good. I am not sure whether adding a second text format like JSON brings real value.

Text formats have two disadvantages. First, converting strings to numbers and vice versa slows down loading and saving files. If I remember correctly, saving a file as CSV took around 50 seconds while saving the same data as a binary file (Parquet) took 3 seconds. The second issue is file size. A CSV of 580 MB shrinks to around 280 MB when saved as e.g. a Parquet file. The file size isn't an issue on your local file system, but it is a big issue when storing these files in the cloud, e.g. Amazon S3, where the larger size causes longer transfer times.

Adding a packed binary format would be great, if possible.

Kind regards
Andre
May 29 2019
parent reply Laeeth Isharc <laeeth kaleidic.io> writes:
On Wednesday, 29 May 2019 at 18:26:46 UTC, Andre Pany wrote:
 [...]
Interoperability with pandas would be important in our use case, and I think probably for quite a few others. So yes, I agree that it's not ideal to use JSON, but lots of things are not ideal. And I think people use JSON, msgpack and HDF5 for interop with pandas. CSV is more complicated in practice than one might initially think. Finally, of course, Excel interop. It's not my cup of tea, but gzipped JSON is quite compact...

I have a little streaming msgpack deserialiser to our own Variable type (it can store a primitive or a variant). It's not long and I could share that. I don't think the initial dataframe necessarily needs to have all this stuff in it from day one.

It's worth reusing another JSON implementation. Since you work with Ilya - asdf isn't bad and quite fast, though the error messages leave something to be desired. You might look at John Colvin's repo from his talk on nogc map, filter, fold etc. He didn't do chunkBy yet.
May 29 2019
parent Prateek Nayak <lelouch.cpp gmail.com> writes:
On Wednesday, 29 May 2019 at 22:31:57 UTC, Laeeth Isharc wrote:
 On Wednesday, 29 May 2019 at 18:26:46 UTC, Andre Pany wrote:
 On Wednesday, 29 May 2019 at 18:00:02 UTC, Prateek Nayak wrote:
 [...]
 [...]
Interop is important, and I see many binary formats being suggested as a way of achieving it. Pandas' I/O tools cover some of the popular formats: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html

Apache Arrow was something new that I hadn't heard of before. I'll look into it, but then again I couldn't find any way to integrate it with D right away.

I would love to hear which binary format the community wants added as a way of interop in the future.
May 29 2019
prev sibling next sibling parent reply jmh530 <john.michael.hall gmail.com> writes:
On Wednesday, 29 May 2019 at 18:00:02 UTC, Prateek Nayak wrote:
 [snip]
Glad to see progress being made on this!

Somewhat tangentially, on the interoperability: I have also made quite a bit of use of R's data.frames. One difference between those and what I have seen of this implementation is that R's data.frames allow different columns to be of different types. This makes certain kinds of analysis of groups very easy. For instance, right now I'm working with a dataset whose columns are doubles, dates, integers, bools, and strings. I can do the equivalent of groupby on the strings as "factors" in R, and it's pretty straightforward to get everything working nicely.
May 29 2019
next sibling parent reply Yatheendra <3df4 gmail.ru> writes:
Is it worth considering some binary format that is standard-ish & 
whose toolset can write out text when viewing? I'm thinking 
https://capnproto.org .
May 29 2019
next sibling parent jmh530 <john.michael.hall gmail.com> writes:
On Wednesday, 29 May 2019 at 19:33:38 UTC, Yatheendra wrote:
 Is it worth considering some binary format that is standard-ish 
 & whose toolset can write out text when viewing? I'm thinking 
 https://capnproto.org .
I have heard of that, but I don't know too much about it. I think there are some hierarchical data formats that have some popularity, like hdf5. I think there's already a D wrapper for the C library. I don't have much experience with this kind of stuff though.
May 29 2019
prev sibling parent Laeeth Isharc <laeeth kaleidic.io> writes:
On Wednesday, 29 May 2019 at 19:33:38 UTC, Yatheendra wrote:
 Is it worth considering some binary format that is standard-ish 
 & whose toolset can write out text when viewing? I'm thinking 
 https://capnproto.org .
Ultimately you need to be able to talk to the world because this kind of thing is social and you may not have a choice about formats. However no point trying to do it all in one summer...
May 29 2019
prev sibling next sibling parent reply Prateek Nayak <lelouch.cpp gmail.com> writes:
On Wednesday, 29 May 2019 at 18:41:28 UTC, jmh530 wrote:
 On Wednesday, 29 May 2019 at 18:00:02 UTC, Prateek Nayak wrote:
 [snip]
Glad to see progress being made on this! Somewhat tangentially on the interoperability, I have also made quite a bit of use with R's data.frames. One difference between that and what I have seen of the implementation is that R's data.frames allow for different columns to be different types. This makes certain kinds of analysis of groups very easy. For instance, right now I'm working with a dataset whose columns are doubles, dates, integers, bools, and strings. I can do the equivalent of groupby on the strings as "factors" in R and it's pretty straightforward to get everything working nicely.
The DataFrame currently uses Mir's ndslice at its core, which allows homogeneous data to be stored within it. Right now, we are considering operable data to be homogeneous to keep the API simpler.

I'm not sure how something like Variant will play out in this scenario. It may allow the data to be flexible, but parsing will probably require an assertion library.
May 29 2019
parent jmh530 <john.michael.hall gmail.com> writes:
On Thursday, 30 May 2019 at 03:38:50 UTC, Prateek Nayak wrote:
 [snip]

 The DataFrame currently uses Mir's ndslice at the core of it 
 which allows for homogeneous data to be stored within it.
 Right now, we are considering operable data to be homogeneous 
 keeping the API simpler.
 I'm not sure how something like Variant will play out in this 
 scenario. It may allow for data to be flexible but parsing will 
 probably require an assertion library.
It's probably smart to focus on getting the homogeneous case working first. I don't think of it as the entire thing being a Variant, so much as a tuple containing 1-dimensional mir slices that are all the same length. The idea is that each column should have its own type. I had done a simple implementation of this a year or so ago and had shown it to Ilya.
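To make that concrete, here's a bare-bones standalone sketch of the idea - plain dynamic arrays standing in for mir slices, and all names hypothetical; it only illustrates the "one type per column" layout, not a proposed implementation:

```d
import std.meta : staticMap;
import std.stdio : writeln;

// one dynamic array per column; each column keeps its own element type
alias ColumnOf(T) = T[];

struct Columns(Types...)
{
    // a built-in tuple of fields, e.g. (double[], int[], string[])
    staticMap!(ColumnOf, Types) columns;
}

void main()
{
    Columns!(double, int, string) c;
    c.columns[0] = [1.5, 2.5];
    c.columns[1] = [10, 20];
    c.columns[2] = ["a", "b"];
    writeln(c.columns[2][1]); // "b" - second row of the third column
}
```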
May 30 2019
prev sibling next sibling parent reply Prateek Nayak <lelouch.cpp gmail.com> writes:
On Wednesday, 29 May 2019 at 18:41:28 UTC, jmh530 wrote:
 On Wednesday, 29 May 2019 at 18:00:02 UTC, Prateek Nayak wrote:
 [snip]
Glad to see progress being made on this! Somewhat tangentially on the interoperability, I have also made quite a bit of use with R's data.frames. One difference between that and what I have seen of the implementation is that R's data.frames allow for different columns to be different types. This makes certain kinds of analysis of groups very easy. For instance, right now I'm working with a dataset whose columns are doubles, dates, integers, bools, and strings. I can do the equivalent of groupby on the strings as "factors" in R and it's pretty straightforward to get everything working nicely.
On second thought, my mentor Nicholas Wilson led me to an interesting GitHub Gist ->
https://gist.github.com/aG0aep6G/a1b87df1ac5930870ffe

A similar structure can be used to represent non-homogeneous data. The DataFrame structure can be overloaded for such an integration. However, the homogeneous DataFrame still remains the main objective for now. This integration will definitely happen once the homogeneous DataFrame comes close to looking and working like an actual DataFrame.

I'll keep you updated here in case I find anything better for non-homogeneous data and when the whole thing starts to take shape.
May 29 2019
parent jmh530 <john.michael.hall gmail.com> writes:
On Thursday, 30 May 2019 at 04:38:09 UTC, Prateek Nayak wrote:
 

 On a second thought, my mentor Nicholas Wilson led me to an 
 interesting Github Gist
 -> https://gist.github.com/aG0aep6G/a1b87df1ac5930870ffe

 A similar structure can be used to represent non homogeneous 
 data. The DataFrame structure can be overloaded for such an 
 integration. However homogeneous DataFrame still remain the 
 main objective for now. This integration will definitely happen 
 once the homogeneous DataFrame comes close to looking and 
 working like an actual DataFrame.

 I'll keep you updated here in case I find anything better for 
 non homogeneous data and when the whole things starts to take 
 shape.
Hmm, my point above was for a tuple of mir slices, which seems to correspond more to a struct of arrays. I wonder if your implementation would be able to take an array of structs approach using built-in mir. I know mir can handle complex numbers, which requires a struct. I don't know off the top of my head how it can handle more generally a slice whose type is a struct.
May 30 2019
prev sibling parent reply welkam <wwwelkam gmail.com> writes:
On Wednesday, 29 May 2019 at 18:41:28 UTC, jmh530 wrote:
 R's data.frames allow for different columns to be different 
 types.
I'm not familiar with data frames, but this sounds like an array of structs.
May 30 2019
parent reply jmh530 <john.michael.hall gmail.com> writes:
On Thursday, 30 May 2019 at 09:06:58 UTC, welkam wrote:
 [snip]

 Im not familiar with data frames but this sounds like array of 
 structs
I think I was thinking of it more like a struct of arrays, but I think an array of structs may also work (see my responses to Prateek)...
May 30 2019
parent reply Prateek Nayak <lelouch.cpp gmail.com> writes:
On Thursday, 30 May 2019 at 09:41:59 UTC, jmh530 wrote:
 On Thursday, 30 May 2019 at 09:06:58 UTC, welkam wrote:
 [snip]

 Im not familiar with data frames but this sounds like array of 
 structs
I think I was thinking of it more like a struct of arrays, but I think an array of structs may also work (see my responses to Prateek)...
Due to the popularity of heterogeneous DataFrames, we decided to take care of them in the early stages of development before it's too late.

The heterogeneous DataFrame is now live at:
https://github.com/Kriyszig/magpie/tree/experimental

Some parts are still under development, but the goals in the road map will be reached on time.

---------------------------------
Summing up the first week of GSoC
---------------------------------
* Base and file I/O ops were built for the homogeneous DataFrame.
* Based on the type of data the community has worked with, it seemed evident homogeneous DataFrames weren't going to cut it, so a rebuild was initiated over the weekend to allow for heterogeneous data.
* The API was overhauled to allow for heterogeneous DataFrames.
* New parser that can parse selective columns.

The code will land in master once it's cleaned up and is deemed stable.

----------------------------------------
Things that will be dealt with this week
----------------------------------------
This week will be for:
* Improving the parser
* Overhauling the code structure (in experimental)
* Adding setters for data and index in DataFrame
* Adding functions to create a multi-indexed DataFrame, the same way one can in Python
* Adding documentation and examples
* Index operations
* Retrieving rows and columns

The last one will set in motion the implementation of column binary ops of the form:
df["Index1"] = df["Index2"] + df["Index3"];

Meanwhile, if you have any more suggestions, please feel free to contact me - you can use this thread, open an issue on GitHub, reach out to me on Slack (Prateek Nayak) or email me directly (lelouch.cpp gmail.com).
Jun 03 2019
next sibling parent jmh530 <john.michael.hall gmail.com> writes:
On Tuesday, 4 June 2019 at 03:13:03 UTC, Prateek Nayak wrote:
 [snip]

 Due to the popularity of heterogeneous DataFrames, we decided 
 to take care of it the early stages of development before it's 
 too late.
Excellent!
Jun 04 2019
prev sibling next sibling parent reply James Blachly <james.blachly gmail.com> writes:
On 6/3/19 11:13 PM, Prateek Nayak wrote:
 Due to the popularity of heterogeneous DataFrames, we decided to take 
 care of it the early stages of development before it's too late.
 
 The heterogeneous DataFrame is now live at: 
 https://github.com/Kriyszig/magpie/tree/experimental
Amazing, thanks!

The experimental branch readme code snippet has a typo; the var name should be `heterogeneous`:

```
// Creating a heterogeneous DataFrame of 10 integer columns and 10 double columns
DataFrame!(int, 10, double, 10) homogeneous;
```

Again, I cannot thank you enough!
Jun 05 2019
parent Prateek Nayak <lelouch.cpp gmail.com> writes:
On Thursday, 6 June 2019 at 00:24:17 UTC, James Blachly wrote:
 On 6/3/19 11:13 PM, Prateek Nayak wrote:
 Due to the popularity of heterogeneous DataFrames, we decided 
 to take care of it the early stages of development before it's 
 too late.
 
 The heterogeneous DataFrame is now live at: 
 https://github.com/Kriyszig/magpie/tree/experimental
Amazing, thanks! experimental branch readme code snippet has typo; var name should be `heterogeneous` ``` // Creating a heterogeneous DataFrame of 10 integer columns and 10 double columns DataFrame!(int, 10, double, 10) homogeneous; ``` Again I cannot thank you enough!
Oops! Thanks for spotting that. I'll update it today with a complete example for the usage and a snippet each for the functions added since it was last modified.
Jun 05 2019
prev sibling parent reply Prateek Nayak <lelouch.cpp gmail.com> writes:
On Tuesday, 4 June 2019 at 03:13:03 UTC, Prateek Nayak wrote:
 [...]
I have decided to make weekly updates so the community knows the progress of the project.

-------------------------------
Summing up last week's progress
-------------------------------
* Brought heterogeneous DataFrames to the same point as the previous homogeneous DataFrame development.
* Assignment ops - both direct and indexed.
* Retrieving an entire column using the index operation.
* Retrieving an entire row using the index operation.
* Small redesigns here and there to reduce code size.

The index op for rows and columns returns the value as an Axis structure. Binary ops on the Axis structure basically translate to column and row binary operations on the DataFrame.

---------------------------------------
Tasks that will be dealt with this week
---------------------------------------
* Column and row binary operations.
* There are a few places where O(n) operations can be converted to O(log n) operations - these few optimisations will be done.
* Updating the documentation with the developments of the week.

So far no major roadblocks have been encountered.

P.S. Found this interesting Reddit post regarding file formats on r/datasets:
https://www.reddit.com/r/datasets/comments/bymru3/are_there_any_data_formats_for_storing_text_worth/
Jun 10 2019
parent reply Prateek Nayak <lelouch.cpp gmail.com> writes:
On Tuesday, 11 June 2019 at 04:35:22 UTC, Prateek Nayak wrote:
 [...]
-------------
Weekly Update
-------------

This week, development was a bit slower compared to the last couple of weeks - I had to attend college for a couple of days and it took more time than I would have liked. That said, everything from last week's goals has been achieved.

-----------------------
What happened last week
-----------------------
* Redesigned Axis - the structure that returns the values during column binary operations.
* Added binary operations for Axis [this is the equivalent of binary operations on the DataFrame].
* Tested whether the binary operations work fine on DataFrames.
* Fixed a couple of tiny bugs here and there.
* Added more ways to build an index.
* HashMap-like implementation to check for duplicates in the index (see the small sketch after this post).

-------------------
Goals for this week
-------------------
* Work on apply - which applies a function to values in a row/column or the entire DataFrame.
* Add some inbuilt operations [operations like mean and median - operations that are essential].
* Optimize parts of DataFrame.
* Addition of helpers that will eventually trigger the development of groupBy.

----------
Roadblocks
----------
There were a few moments where I ran into trouble while returning the Axis structure because of the variable return type. My mentors helped me a lot when I got stuck with the implementation. There were also a couple of bugs which were simple but still took a while to solve. Other than that, things went really smoothly.

---------------------
Community Suggestions
---------------------
Petar Kirov [ZombineDev] suggested using static arrays as a way to declare a DataFrame. Soon it will be possible to declare a DataFrame as:

DataFrame!(int[5], double[10]) df;

[Implemented in the BinaryOps branch - will land in master with the BinaryOps implementation soon]

Thank you for the suggestion.
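The duplicate check itself is conceptually nothing more than the following standalone sketch (illustrative only - not the exact code that went into Magpie):

```d
import std.stdio : writeln;

// Flag duplicate index labels using a built-in associative array
// as the "HashMap".
bool hasDuplicates(string[] labels)
{
    bool[string] seen;
    foreach (label; labels)
    {
        if (label in seen)
            return true;
        seen[label] = true;
    }
    return false;
}

void main()
{
    writeln(hasDuplicates(["a", "b", "c"])); // false
    writeln(hasDuplicates(["a", "b", "a"])); // true
}
```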
Jun 17 2019
parent reply Prateek Nayak <lelouch.cpp gmail.com> writes:
 [snip]
-------------
Week 4 Update
-------------

This marks the completion of Stage I of Google Summer of Code 2019. It seems like it was only yesterday when I started working on this project, and it has already been a month.

--------------------------
So what happened last week
--------------------------
* apply - to apply a function on a row/column.
* A function to convert a column of data to a level of index.
* drop - to drop a row/column.

Going back to the original proposal, I had allocated some time for optimisations in case there was time left over: I was testing the old parser with large files and it failed miserably. So I redesigned the from_csv function and added it to the library as fastCSV. fastCSV gives a 40x speed improvement over from_csv and will eventually replace it.

* fastCSV was added to the library.

-------------------
Plans for this week
-------------------
Plans for this entire stage aren't strictly on a week-by-week timeline, but the following things will be dealt with sequentially throughout this stage:

This stage is reserved for the implementation of groupBy. To begin with, the internal structure and grouping will be decided. Later, things like display and combining into a DataFrame struct will be dealt with. These tasks were scheduled for Stage III but will again fall under sequential implementation.

If the above tasks are done, the following tasks will be dealt with:
* Aggregate [with a complete set of popular operations]
* Join will be implemented to merge two DataFrames.

Aggregate was reserved for later stages to support implementation for both the normal DataFrame and groupBy at once.

----------
Roadblocks
----------
This week there haven't been any roadblocks. I needed the help of my mentors to solve a couple of errors here and there, but other than that things were smooth. As for future roadblocks, I cannot see any apparent ones, but then again they show up when you are least expecting them :(
Jun 25 2019
parent reply jmh530 <john.michael.hall gmail.com> writes:
On Tuesday, 25 June 2019 at 17:25:34 UTC, Prateek Nayak wrote:
 [snip]
------------- Week 4 Update ------------- This marks the completion of Stage I of Google Summer of Code 2019. It seems like it was only yesterday when I started working on this project and it has already been a month. -------------------------- So what happened last week -------------------------- * apply - to apply a function on a row/column * function to convert a column of data to level of Index * drop - to drop a row/column
[snip]

Glad to see you're still making great progress.

I had worked on the byDim function in mir.ndslice.topology because I had wanted the same sort of functionality as R's apply. It works a little differently than R's, but I find it very good for a lot of things. Your version of apply (I'm looking at the apply branch of magpie) looks like it operates a bit like a byDim chained with an each, so byDim!N.each!f. However, it also has this index variable allowing it to skip rows or something (I'm not really sure this feature pulls its weight...).

So I have two questions: 1) does byDim also work with DataFrames?, 2) can you add an overload that is apply(f, axis) without the index parameter?

One of my take-aways from looking at the apply function (again, just looking at that apply branch) is that you might benefit from using more of what Ilya has already put in mir.ndslice where available. For instance, the overload of apply that is just apply!f is basically the same as mir's each, but each has more features.
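For anyone who hasn't used that combination, here is a small standalone mir-algorithm example of byDim chained with each - nothing Magpie-specific, just the pattern I mean:

```d
import mir.ndslice.slice : sliced;
import mir.ndslice.topology : byDim;
import mir.algorithm.iteration : each;
import mir.math.sum : sum;
import std.stdio : writeln;

void main()
{
    // a 2x3 matrix
    auto m = [1.0, 2, 3, 4, 5, 6].sliced(2, 3);

    // byDim!0 yields each row as a 1-D slice; each applies the function to every row
    m.byDim!0.each!(row => writeln(sum(row))); // prints 6 then 15
}
```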
Jun 25 2019
next sibling parent jmh530 <john.michael.hall gmail.com> writes:
On Tuesday, 25 June 2019 at 17:54:36 UTC, jmh530 wrote:
 [...]
Stupid typos. I had worked on the byDim function in mir.ndslice.topology because ..."
Jun 25 2019
prev sibling parent reply Prateek Nayak <lelouch.cpp gmail.com> writes:
On Tuesday, 25 June 2019 at 17:54:36 UTC, jmh530 wrote:
 Glad to see you're still making great progress.

 I had worked on the byDim function in mir.ndslice.topology is 
 byDim because I had wanted the same sort of functionality as 
 R's apply. It works a little differently than R's, but I find 
 it very good for a lot of things. Your version of apply (I'm 
 looking at the apply branch of magpie) looks like it operates a 
 bit like a byDim chained with an each, so byDim!N.each!f. 
 However, it also has this index variable allowing it to skip 
 rows or something (I'm not really sure if this feature pulls 
 its weight...).

 So I have two questions: 1) does byDim also work with 
 dataframes?, 2) can you add an overload that is apply(f, axis) 
 without the index parameter?

 One of my take-a-ways from looking at the apply function (again 
 just looking at that apply branch) is that you might benefit 
 from using more of what is Ilya has already put in mir.ndslice 
 where available. For instance, the overload of apply that is 
 just apply!f is basically the same as mir's each, but each has 
 more features.
1) Currently, byDim doesn't work on DataFrame.
2) Sure, the overload can be made, but what are you specifically looking for? apply(f, axis)(indexes)?

You are right, apply works like byDim!axis.each on particular columns/rows. I'll look into Mir's implementation. Thanks for that advice. I do believe apply can be strengthened to account for different use cases.

When the heterogeneous DataFrame support came, mir-algorithm was dropped from the dependencies and a Structure of Arrays implementation was taken up using TypeTuples. Once the basic working is solid, I'll port useful features from Mir to Magpie.
Jun 25 2019
parent reply jmh530 <john.michael.hall gmail.com> writes:
On Tuesday, 25 June 2019 at 20:44:43 UTC, Prateek Nayak wrote:
 [snip]
 2) Sure, the overload can be made but what are you specifically 
 looking for?
 apply(f, axis)(indexes) ?
 [snip]
I see
void apply(alias Fn, int axis, T)(T index)
and
void apply(alias Fn)()
in the current implementation.

I think you interpreted what I am asking for as something like
void apply(alias Fn, int axis, T[])(T[] indices)
which also might make sense. But I guess I was suggesting something a little simpler, like
void apply(alias Fn, int axis)()
so that it applies to all the rows or columns. This is particularly relevant in the homogeneous data case.

My motivation reflects a common use case of the apply function in R: calculating summary statistics of an array/matrix by column or row. For instance, I might want to calculate the standard deviation of every column.
Jun 25 2019
parent reply Prateek Nayak <lelouch.cpp gmail.com> writes:
On Tuesday, 25 June 2019 at 21:07:35 UTC, jmh530 wrote:
 On Tuesday, 25 June 2019 at 20:44:43 UTC, Prateek Nayak wrote:
 [snip]
 2) Sure, the overload can be made but what are you 
 specifically looking for?
 apply(f, axis)(indexes) ?
 [snip]
I see void apply(alias Fn, int axis, T)(T index) and void apply(alias Fn)() in the current implementation. I think you interpreted what I am asking as something like void apply(alias Fn, int axis, T[])(T[] indices) which also might make sense. But I guess I was suggesting a little simpler as void apply(alias Fn, int axis)() so that it applies to all the rows or columns. This is particularly relevant in the homogeneous data case. My motivation reflects a common use case of the apply function in R to calculate summary statistics of an array/matrix by column or row. For instance, I might want to calculate the standard deviation of every column.
The apply right now works exactly as
void apply(alias Fn, int axis, T)(T indices)
where indices can be an array of integers or a 2D array of string indexes:
https://github.com/Kriyszig/magpie/blob/dec86d1942f9c9b4db31438407798329af0aed96/source/magpie/dataframe.d#L1200

The overload you need also exists: apply(Fn)
https://github.com/Kriyszig/magpie/blob/dec86d1942f9c9b4db31438407798329af0aed96/source/magpie/dataframe.d#L1246

Unittest for apply -
https://github.com/Kriyszig/magpie/blob/dec86d1942f9c9b4db31438407798329af0aed96/source/magpie/dataframe.d#L2821

I agree, things like mean and standard deviation calculations are of utmost importance in data science. Aggregate will bring such features as inbuilt functions: Count, Min, Max, Mean, SD, Variance, etc. This will be added soon (by soon I mean somewhere between the final week of this stage [possibly sooner] and the first week of the next - as soon as groupBy is stable, I will get onto aggregate).

Sorry for the inconvenience.
Jun 25 2019
parent reply jmh530 <john.michael.hall gmail.com> writes:
On Wednesday, 26 June 2019 at 05:41:48 UTC, Prateek Nayak wrote:
 [snip]

 I agree, things like mean and standard deviation calculations 
 are of utmost importance in data science. Aggregate will bring 
 such features as inbuilt functions. Count, Min, Max, Mean, SD, 
 Variance, etc.
 This will be added soon (by soon I mean somewhere between the 
 final week of this stage [possibly sooner] and the fist week of 
 the next - As soon as groupBy is stable, I will get onto 
 aggregate)
 Sorry for the inconvenience.
By no means do you need to apologize for any inconvenience. I suppose what I am thinking is more about leveraging work that is already done as much as possible. For instance, I know that count/sum/min/max are part of mir-algorithm already and I had helped add sd and variance to numir. Do you mind if I send you an email?
Jun 26 2019
parent reply Prateek Nayak <lelouch.cpp gmail.com> writes:
On Wednesday, 26 June 2019 at 12:50:23 UTC, jmh530 wrote:
 On Wednesday, 26 June 2019 at 05:41:48 UTC, Prateek Nayak wrote:
 [snip]

 I agree, things like mean and standard deviation calculations 
 are of utmost importance in data science. Aggregate will bring 
 such features as inbuilt functions. Count, Min, Max, Mean, SD, 
 Variance, etc.
 This will be added soon (by soon I mean somewhere between the 
 final week of this stage [possibly sooner] and the fist week 
 of the next - As soon as groupBy is stable, I will get onto 
 aggregate)
 Sorry for the inconvenience.
By no means do you need to apologize for any inconvenience. I suppose what I am thinking is more about leveraging work that is already done as much as possible. For instance, I know that count/sum/min/max are part of mir-algorithm already and I had helped add sd and variance to numir. Do you mind if I send you an email?
I'm sorry I couldn't reply sooner - I was sick for the past couple of days.

I don't mind emails one bit. The email id is: lelouch.cpp gmail.com [I know it's weird :)]

I'll reply as soon as I see the mail [at worst it will take 12 hrs to get a reply from me, when my phone doesn't notify me of a new mail; at best I'll reply immediately].

Again, sorry for the delayed response. Hope to hear from you soon :)
Jun 28 2019
parent reply Prateek Nayak <lelouch.cpp gmail.com> writes:
On Saturday, 29 June 2019 at 04:39:39 UTC, Prateek Nayak wrote:
 [snip]
-------------
Weekly Update
-------------

I caught a flu this week, which was really unfortunate. However, I'm getting better and the work is going forward :)

-----------------------
What happened last week
-----------------------
I mostly dealt with the internal structure of `Group` - the structure that is returned by the groupBy operation. At first I thought an array of DataFrames might be a good idea, but I soon dropped it, as it would mean some parts - like the column index - stay the same but would need to be copied to every DataFrame structure in the array, which is just a waste of space.

The implementation now looks somewhat similar to the DataFrame structure itself - there is an `Index` and `data`. Indexes are sorted based on the groups formed as a result of groupBy. There are a few places where optimizations can be made [mostly with respect to the space used] and I'll work on them this week.

Some of the functionality added to `Group` so far:
* display - the user can choose to display a single group or multiple groups
* combine - returns a DataFrame combining the groups the user would like

At this point there was a need for a function in DataFrame which could convert a level of indexing to a column of operable data if required. This is because combine on a groupBy result doesn't remember the position from where the data was extracted. Hence, if a level of data is used for groupBy, it would automatically be converted to a level of index in the result of combine. `indexToData` was therefore added to revert this if the user desires.

There were a few minor updates here and there, nothing major. They include a new argument for `extends` in `Index`, which can now insert the index at the position of the user's choice. The other was stripping the trailing white spaces which appeared in display.

--------------------------
What will happen this week
--------------------------
This week will deal with optimizations of Group, and add binary operations to `Group` which may be helpful. Document the changes once stability is reached. Start work on aggregate/join.

----------
Roadblocks
----------
I can't spot any major roadblocks up ahead. Work should go smoothly this week :)

-> Thank you jmh530 for sharing your work. This should help in improving the functionality of DataFrames further.
Jul 02 2019
parent reply Prateeek Nayak <lelouch.cpp gmail.com> writes:
On Wednesday, 3 July 2019 at 05:04:20 UTC, Prateek Nayak wrote:
 On Saturday, 29 June 2019 at 04:39:39 UTC, Prateek Nayak wrote:
 [snip]
[snip]
---------------
Progress Update
---------------

The past couple of weeks went as expected, without any roadblocks.

* groupBy can group a DataFrame based on an arbitrary number of columns.
* groupBy returns a Group structure which supports binary operations.
* Retrieve a single group or multiple groups as a DataFrame.
* Merge two or more Groups into a single DataFrame.
* Index operations on Group. An entire column/row is returned as an Axis, the same way the index operation is implemented on DataFrame.
* Display Group on the terminal.

Works on DataFrame:
* Added shorthand data operations which I missed before! \(°^°)/
* Added a function to convert an index to an operable data column and vice versa.

---------------------
What is due this week
---------------------
This week was mostly reserved for refactoring. Mr. Wilson introduced me to the beautiful lockstep in std.range and I worked it into the codebase wherever necessary (a small illustration follows this update).

This week I am adding ways to retrieve data as a Slice and assign a Slice to a DataFrame. This IMHO is important, as ndslice is used widely and it opens a lot of doors for data computations. A way to easily retrieve data as a Slice, operate on it and assign the data back to the DataFrame seems valuable. I hope to get the initial PR ready by the beginning of next week.

After this will come Aggregate - on a whole Frame/Group, on a selected column/row of a DataFrame or Group, and selective operations on selective columns/rows.

-----------------
Future Roadblocks
-----------------
I can't see any obvious roadblocks, but then you never do see them coming ¯\_(ツ)_/¯
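For anyone unfamiliar with it, lockstep simply walks several ranges in parallel - a tiny standalone example, unrelated to Magpie's internals:

```d
import std.range : lockstep;
import std.stdio : writeln;

void main()
{
    auto labels = ["a", "b", "c"];
    auto values = [1, 2, 3];

    // iterate both arrays side by side without manual indexing
    foreach (label, value; lockstep(labels, values))
        writeln(label, " -> ", value);
}
```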
Jul 17 2019
parent reply jmh530 <john.michael.hall gmail.com> writes:
On Thursday, 18 July 2019 at 05:03:38 UTC, Prateeek Nayak wrote:
 [snip]
Thanks for the update. I'm glad you're still making good progress.

I'm just looking over the readme.md. I noticed the "at" function has a signature like at!(row, column)(). Because it uses a template, doesn't that imply that the row and column parameters must be known at compile time? What if we want run-time access using a function style instead of like df[0, 0]? mir's ndslice also has a set of select functions that are useful for access.

There's also a typo in the GroupBy text: "Group DataFrame based on na arbitrary number of columns."

I noticed that you make a lot of use of static foreach over RowType in dataframe.d. Does this mean there isn't any extra cost if you use a homogeneous dataframe with RowType.length == 1? If you can advertise that it doesn't have any additional overhead for working with homogeneous data, then that's probably a win. You might also add a trait for isHomogeneous that checks if RowType.length == 1.
Jul 18 2019
parent reply Prateeek Nayak <lelouch.cpp gmail.com> writes:
On Thursday, 18 July 2019 at 10:55:55 UTC, jmh530 wrote:
 On Thursday, 18 July 2019 at 05:03:38 UTC, Prateeek Nayak wrote:
 [snip]
Thanks for the update. I'm glad you're still making good progress. I'm just looking over the readme.md. I noticed the "at" function has a signature like at!(row, column)(). Because it uses a template, doesn't that imply that the row and column parameters must be known at compile-time? What if we want run-time access using a function style instead of like df[0, 0]? mir's ndslice also has a set of select functions that are also useful for access. There's also a typo in the GroupBy text: "Group DataFrame based on na arbitrary number of columns." I noticed that you make a lot of use of static foreach's over RowType in dataframe.d. Does that this means that this means there isn't any extra cost if you use a homogeneous dataframe with RowType.length == 1? If you can advertise that it doesn't have any additional overhead for working with homogeneous, then that's probably a win. You might also add a trait for isHomogeneous that checks if RowType.length == 1.
* "at" was meant for fast access to an element. It's only necessary to know one of the two arguments at compile time, to be honest, but df[i1, i2] would have to be written as at!(i2)(i1), which reverses the two positions. Hence I thought at!(i1, i2) could reduce the mishaps that the position reversal can cause. I agree a method to access the element at runtime is needed; I will overload at for that.

* Sorry about the typo, will fix it soon (^_^)

* The data in DataFrame is stored as a TypeTuple, which requires the column index to be known statically. When trying to do a runtime operation on the data, I was forced to traverse the tuple statically to find the particular index. A homogeneous DataFrame defined as DataFrame!(int, 5) will give RowType as (int, int, int, int, int). For now that overhead still exists, but I think an isHomogeneous template can open some new doors for optimization. I will definitely look into this over the next week. Thanks for bringing it to my notice.
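A stripped-down illustration of that overhead - a runtime column number has to be matched against every compile-time position of the tuple (hypothetical code, not Magpie's):

```d
import std.stdio : writeln;

// Print the column selected by a runtime index from a compile-time
// tuple of columns; the tuple itself can only be walked statically.
void printColumn(Columns...)(size_t index, Columns columns)
{
    static foreach (i, T; Columns)
    {
        if (i == index)
            writeln(columns[i]);
    }
}

void main()
{
    int[] ages = [25, 31];
    string[] names = ["Alice", "Bob"];
    printColumn(1, ages, names); // prints ["Alice", "Bob"]
}
```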
Jul 18 2019
parent reply jmh530 <john.michael.hall gmail.com> writes:
On Thursday, 18 July 2019 at 15:34:32 UTC, Prateeek Nayak wrote:
 [snip]

 * The data in DataFrame is stored as TypeTuple which requires 
 the column index to be known statically. When trying to do a 
 runtime operation on data, I was forced to traverse the tuple 
 statically to find the particular index. Homogeneous DataFrame 
 defined as DataFrame!(int, 5) will give RowType as (int, int, 
 int, int, int).
 For now that overhead still exists but I think isHomogeneous 
 template can open some new door for optimization. I will 
 definitely look into this over the next week. Thanks for 
 bringing it to my notice.
Ah, so what you would want to check is that all the RowTypes are the same instead.
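Something along these lines, for instance - just a sketch, with the names made up:

```d
import std.meta : allSatisfy;

// true when every column type matches the first one
template isHomogeneous(RowType...)
{
    static if (RowType.length <= 1)
        enum isHomogeneous = true;
    else
    {
        enum sameAsFirst(T) = is(T == RowType[0]);
        enum isHomogeneous = allSatisfy!(sameAsFirst, RowType[1 .. $]);
    }
}

static assert(isHomogeneous!(int, int, int));
static assert(!isHomogeneous!(int, double));
```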
Jul 18 2019
parent Prateeek Nayak <lelouch.cpp gmail.com> writes:
On Thursday, 18 July 2019 at 16:23:20 UTC, jmh530 wrote:
 On Thursday, 18 July 2019 at 15:34:32 UTC, Prateeek Nayak wrote:
 [snip]
Ah, so what you would want to check is that all the RowTypes are the same instead.
Yes. Will require a small redesign in the internal structure and some optimizations here and there but can seriously cut down overheads.
Jul 18 2019
prev sibling next sibling parent reply James Blachly <james.blachly gmail.com> writes:
On 5/29/19 2:00 PM, Prateek Nayak wrote:
 (snip)
Outstanding, and greatly needed. Congratulations to you and your mentors.

Our lab has transitioned to D for new software but still relies on python+pandas for some analytics pipelines. I second the notions about the importance of interoperability.

An interesting in-memory interop framework I haven't seen mentioned here yet is Apache Arrow. In this 2017 blog post, Wes McKinney, the author of Pandas, discusses it in the context of mistakes made designing pandas; recommended reading if you have not:

https://wesmckinney.com/blog/apache-arrow-pandas-internals/
May 29 2019
parent jmh530 <john.michael.hall gmail.com> writes:
On Thursday, 30 May 2019 at 02:16:16 UTC, James Blachly wrote:
 [snip]

 https://wesmckinney.com/blog/apache-arrow-pandas-internals/
This was a good read. The columnar data structures he describes sound more like struct of arrays than array of structs.
May 30 2019
prev sibling next sibling parent reply bachmeier <no spam.net> writes:
Looking at the readme, I see the following example for accessing 
elements by name:

df[["1"], ["0"]];

Why can't that instead be

df["1", "0"];

Something that gets in the way of adoption is verbose notation, 
and I'm not seeing any advantage to the array notation.

Also, for this example:

Index indx;
indx.setIndex([1, 2, 3, 4], ["Row Index"], [1, 2, 3], ["Column 
Index"]);

That's pretty verbose/hard to parse compared to

rownames(x) = [1, 2, 3, 4];
colnames(x) = [1, 2, 3];
Jul 18 2019
parent Prateek Nayak <lelouch.cpp gmail.com> writes:
On Thursday, 18 July 2019 at 16:02:40 UTC, bachmeier wrote:
 Looking at the readme, I see the following example for 
 accessing elements by name:

 df[["1"], ["0"]];

 Why can't that instead be

 df["1", "0"];

 Something that gets in the way of adoption is verbose notation, 
 and I'm not seeing any advantage to the array notation.

 Also, for this example:

 Index indx;
 indx.setIndex([1, 2, 3, 4], ["Row Index"], [1, 2, 3], ["Column 
 Index"]);

 That's pretty verbose/hard to parse compared to

 rownames(x) = [1, 2, 3, 4];
 colnames(x) = [1, 2, 3];
I'm really sorry I overlooked this. Sorry about that (-‸ლ)

I'll fix the first case in the PR where I'll make optimizations for the homogeneous DataFrame. I'll address the second problem of verbosity soon, but not in the immediate PR.

Thanks for the feedback ٩(^‿^)۶
Jul 23 2019
prev sibling parent reply Dejan Lekic <dejan.lekic gmail.com> writes:
On Wednesday, 29 May 2019 at 18:00:02 UTC, Prateek Nayak wrote:
 Hello everyone,

 I have began work on my Google Summer of Code 2019 project 
 DataFrame for D.
Really glad to see someone working on that. I hope you will have time to implement a good CSV/TSV reader/writer based on the fantastic iopipe project (which IMHO should go into Phobos one way or another)...
Jul 19 2019
parent reply Prateek Nayak <lelouch.cpp gmail.com> writes:
On Friday, 19 July 2019 at 14:50:35 UTC, Dejan Lekic wrote:
 On Wednesday, 29 May 2019 at 18:00:02 UTC, Prateek Nayak wrote:
 Hello everyone,

 I have began work on my Google Summer of Code 2019 project 
 DataFrame for D.
Really glad to see someone working on that. I hope you will have time to implement a good CSV/TSV reader/writer based on the fantastic iopipe project (that should IMHO go into Phobos this way or another)...
Right now there is a CSV reader in Magpie, but it isn't mature enough to go into Phobos yet. I'll improve the parser, and when I'm happy with the read speed, I'll send a PR (^_^)
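For context, the generic std.csv reader in Phobos is the obvious baseline to measure against; a minimal usage sketch with made-up data:

import std.csv : csvReader;
import std.typecons : Tuple;

void main()
{
    auto text = "1,2.5\n3,4.0";
    double total = 0;
    foreach (record; csvReader!(Tuple!(int, double))(text))
        total += record[1];   // second column of each row
    assert(total == 6.5);
}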
Jul 23 2019
next sibling parent jmh530 <john.michael.hall gmail.com> writes:
On Tuesday, 23 July 2019 at 18:21:30 UTC, Prateek Nayak wrote:
 [snip]

 Right now there is a CSV reader in Magpie but it isn't perfect 
 enough to go into Phobos yet. I'll improve the parser and when 
 I'm happy with the read speed, I'll send a PR (^_^)
mir was originally intended to be included in Phobos, but it got split off into its own library. If anything, Magpie has a better place in mir than in Phobos. However, I think there is probably value in splitting the csv reader off into a separate project and putting it up on dub when it is ready for broader use.
Jul 23 2019
prev sibling parent reply Suliman <evermind live.ru> writes:
Could you do any benchmarks against Python's Pandas?
Jul 25 2019
parent reply Prateek Nayak <lelouch.cpp gmail.com> writes:
On Thursday, 25 July 2019 at 11:55:48 UTC, Suliman wrote:
 Could you do any benchmarks against Python Pandas.
As soon as aggregate is done, I'll get on to this. It's best to benchmark with real-world examples IMHO, and aggregate brings most of the analytics functionality. I'll keep the thread updated (^_^)
Jul 25 2019
parent reply Prateeek Nayak <lelouch.cpp gmail.com> writes:
-----------
Update Time
-----------

Pardon me for the delay; my university has just started its term and it has been a busy first week. However, I have some good news:

* Aggregate implementation is under review - the preliminary 
implementation restricted the set of operations that aggregate 
could perform, but then Mr. Wilson suggested there should be a way 
to expand its usability, so we worked on a revamp that takes the 
function you desire as input and applies it to a row/column of the 
DataFrame (a minimal sketch of the idea follows this list)
* There is a new way to set the index using the index operation
* to_csv supports setting the precision for floating point numbers - 
this was a problem I knew existed but hadn't addressed till now. 
Better late than never.
* Homogeneous DataFrames don't use TypeTuple anymore
* An at overload is coming soon
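For anyone curious, a minimal sketch of the aggregate idea (plain slices standing in for DataFrame columns; not the actual Magpie signature):

import std.algorithm.iteration : fold;

// apply a user-supplied binary operation across one column
double aggregateColumn(alias fn)(double[] column)
{
    return column.fold!fn;
}

void main()
{
    auto col = [1.0, 2.0, 3.0, 4.0];
    assert(aggregateColumn!((a, b) => a + b)(col) == 10.0);        // sum
    assert(aggregateColumn!((a, b) => a > b ? a : b)(col) == 4.0); // max
}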


--------------------
What is to come next
--------------------

* The first few responses from the community were mostly about 
bringing in binary file I/O support, because of the lean file 
sizes and fast read/write. I will explore this further.
* Time series are gaining importance with the rise of machine 
learning. I would like to implement something along the lines of 
the time series functionality Pandas has.
* Something you would like to see. I am open to suggestions (^_^)

--------------
Problems faced
--------------

One small implementation detail remains - a dispatch function. 
Given that non-homogeneous cases still require a traversal to 
reach a column, a function that applies an alias statically or 
non-statically depending on the DataFrame is under discussion 
(a rough sketch of the idea is below).
This will reduce code redundancy; however, my preliminary attempts 
to tackle it have ended in failure. I will try to finish it by 
the weekend. If I cannot solve it by then, I will seek your help 
in the Learn section (^_^)
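To make the problem concrete, here is a rough sketch with toy types and hypothetical names (not what will land in Magpie): one entry point that indexes a homogeneous frame with a runtime index but has to walk the type tuple statically for a mixed frame:

import std.meta : AliasSeq;

struct Homogeneous { double[][3] cols; enum isHomogeneous = true; }
struct Mixed { AliasSeq!(int[], double[], string[]) cols; enum isHomogeneous = false; }

void applyToColumn(alias fn, DF)(ref DF df, size_t idx)
{
    static if (DF.isHomogeneous)
    {
        fn(df.cols[idx]);   // all columns share one type: runtime indexing is enough
    }
    else
    {
        // the type tuple needs a compile-time index, so match the runtime idx statically
        static foreach (i; 0 .. df.cols.length)
            if (i == idx)
                fn(df.cols[i]);
    }
}

void main()
{
    Homogeneous h;
    h.cols[1] = [1.0, 2.0];
    applyToColumn!(c => assert(c.length == 2))(h, 1);

    Mixed m;
    m.cols[2] = ["a", "b", "c"];
    applyToColumn!(c => assert(c.length == 3))(m, 2);
}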
Thank you
Aug 08 2019
next sibling parent reply bioinfornatics <bioinfornatics fedoraproject.org> writes:
On Thursday, 8 August 2019 at 16:49:09 UTC, Prateeek Nayak wrote:
 -----------
 Update Time
 -----------

 Pardon me for the delay, my university just started and it has 
 been a busy first week. However I have some good news

 * Aggregate implementation is under review - The preliminary 
 implementation restricted the set of operations that aggregate 
 could do but then Mr. Wilson suggested there should be a way to 
 expand it's usability so we worked on a revamp which takes the 
 function you desire as input and operates them on row/column of 
 DataFrame
 * There is a new way set index using index operation
 * to_csv supports setting precision for floating point numbers 
 - this was a problem I knew existed but I hadn't addressed it 
 till now. Better late then never.
 * Homogeneous DataFrame don't use TypeTuple anymore
 * at overload coming soon


 --------------------
 What is to come next
 --------------------

 * The first few responses from the community were mostly 
 regarding bringing binary file I/O support because of their 
 lean size and fast read/write. I will explore more regarding 
 this.
 * Time Series is gaining importance with the rise of Machine 
 Learning. I would like to implement something along the lines 
 of time series functionality Pandas has.
 * Something you would line to see. I am open to suggestions 
 (^_^)

 --------------
 Problems faced
 --------------

 There remains a small implementation detail that remains - a 
 dispatch function. Given non-homogeneous cases still require 
 traversal to a column, a function to apply an alias statically 
 or non-statically depending on the DataFrame is under 
 discussion.
 This will reduce code redundancy however my preliminary 
 attempts to tackle this have ended in failure. I will try to 
 finish it by the weekend. If I cannot solve it by then, I will 
 seek your help in the Learn section (^_^)
 Thank you
Dear D community,

Thanks, Prateek Nayak, for your work.
As I currently work with pandas (Python DataFrames), there is an extra part of it that I appreciate a lot: the IO tools.

* SQL
Methods: read_sql and to_sql
Description: these allow reading from and saving to a database. Combined with SQLAlchemy, they are awesome.

* Parquet
Methods: read_parquet and to_parquet
Description: in Big Data environments, Parquet is a commonly used file format.

These abilities made pandas and its DataFrame API a core library to have. Used like this, it standardizes the data structures used in our applications while offering a rich statistics API at the same time.

This matters for code maintainability. From the FAIR data point of view, an application is a set of input data + program features = result, so putting the data structures first when thinking about how to develop an application is important. The application becomes more robust and flexible when we can handle multiple input data file formats.

I hope to see such features in D.

Best regards

Source:
- https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_sql.html
- https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_sql.html
- https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#parquet
Aug 09 2019
parent reply Prateek Nayak <lelouch.cpp gmail.com> writes:
On Friday, 9 August 2019 at 08:08:39 UTC, bioinfornatics wrote:
 Dear D community,

 Thanks, Prateeek Nayak for your works.
 As currently, I am working with pandas (python, dataframe ...) 
 . They are an extra feature that I appreciate a lot, it is the 
 IO tool part:

 * SQL
 method: read_sql and to_sql
 Description: which allow to read and save from a DataBase. 
 These methods combined with SqlAlchemy are awesome.

 * Parquet
 method: read_parquet and to_parquet
 Description: In BigData environment Parquet is a file format 
 often used

 These abilities made Panda and its Dataframe API a core library 
 to have. Using like this, allow standardizing data structured 
 used into our application and in same time offer rich 
 statistics API.

 Indeed it is important for tho code maintainability. And the 
 FairData point that an application is a set of input data + 
 program's feature = result. Thus put data structured as the 
 first component to think how to develop an application is 
 important.
 The application is more robust and flexible as we can handle 
 multiple input data file format.

 I hope to see such features in D.


 Best regards

 Source:
 - 
 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_sql.html
 - 
 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_sql.html
 - 
 https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#parquet
I was looking into Parquet, and it even came up in the reddit post I had linked to earlier on - the smaller file size and better I/O make it really good for industrial use. A quick search on DUB didn't give any result for a parser, so I'll probably work on a library for working with Parquet files.
I looked into Cap'n Proto too - it looks promising, but it's missing from the Pandas I/O section, which was disappointing.
Thanks for mentioning SQL. I will start working on these features soon.
 Again, thank you so much for working on this!
 We will be excited to put Magpie through its paces in our lab, 
 but it is missing* a few key (really, basic IMO) features we 
 make heavy use of in pandas.
 * I have read the README and glanced at code but not used 
 Magpie yet, so if I am > wrong about below please correct me!
 Since you are soliciting ideas:
 1. Selecting/indexing into data with boolean vectors. e.g:
 df[df.A > 30 && df.B != "ignore"]
 1a. This really means returning a boolean vector for df.COL 
 <op> <operand>
 1b. ...and being able to subset data by a bool vector
 2. We make heavy use of "pivot" functionality.
 Kind regards
I was thinking of the same feature as 1 - a filter-like function for DataFrame and Group - and I'm looking into possible ways to implement it (a rough sketch of the idea is below).
I'm really embarrassed to admit I never even thought about pivot. It looks like a beautiful feature to have - it will definitely be added to Magpie soon (possibly over the next couple of weeks; I'm a bit tied down right now with the commencement of university academics, but it will definitely come soon).
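A rough sketch of what I mean, with toy columns instead of a real DataFrame (illustrative only, not Magpie's API):

import std.algorithm : filter, map;
import std.array : array;
import std.range : iota, zip;

void main()
{
    int[]    A = [10, 42, 55, 7];
    string[] B = ["keep", "ignore", "keep", "keep"];

    // 1a: column <op> operand produces a bool vector
    bool[] mask = zip(A, B).map!(t => t[0] > 30 && t[1] != "ignore").array;

    // 1b: keep only the rows where the mask is true
    auto rows = iota(A.length).filter!(i => mask[i]).array;

    assert(rows.length == 1 && rows[0] == 2); // only row 2 satisfies both conditions
}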
Aug 09 2019
parent reply Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
On Saturday, 10 August 2019 at 04:10:31 UTC, Prateek Nayak wrote:
 I was looking into Parquet and it even came up in the reddit 
 post i had linked to earlier on - smaller file size and better 
 I/O makes it really good for industrial use.
 A quick search on DUB didn't give any result for a parser so 
 I'll probably work on a library to work with Parquet files.
 I looked into Cap'n Proto too - it looks promising but its 
 missing from Pandas I/O section which was disappointing.
 Thanks for mentioning SQL. I will start working on these 
 features soon.
It's clearly important that your project supports the same data exchange formats as pandas, but it doesn't seem inherently a problem to support other formats as well, assuming you have the time and inclination to do so.
Aug 10 2019
parent Prateek Nayak <lelouch.cpp gmail.com> writes:
On Saturday, 10 August 2019 at 12:38:19 UTC, Joseph Rushton 
Wakeling wrote:
 It's clearly important that your project supports the same data 
 exchange formats as pandas, but it doesn't seem inherently a 
 problem to support other formats as well, assuming you have the 
 time and inclination to do so.
It is never an inherent problem to support a new file format, but the initial comments from the community were mainly about easy interop with Python. That is why I was thinking of Parquet support.
Cap'n Proto is great, and I'd love to implement Cap'n Proto I/O sooner or later, but Parquet seems to have a heavier presence due to the popularity of Pandas, so I decided to look into it first.
Aug 10 2019
prev sibling parent James Blachly <james.blachly gmail.com> writes:
On 8/8/19 12:49 PM, Prateeek Nayak wrote:
 --------------------
 What is to come next
 --------------------
 
 * The first few responses from the community were mostly regarding 
 bringing binary file I/O support because of their lean size and fast 
 read/write. I will explore more regarding this.
 * Time Series is gaining importance with the rise of Machine Learning. I 
 would like to implement something along the lines of time series 
 functionality Pandas has.
 * Something you would line to see. I am open to suggestions (^_^)
Again, thank you so much for working on this! We will be excited to put Magpie through its paces in our lab, but it is missing* a few key (really, basic IMO) features we make heavy use of in pandas.

* I have read the README and glanced at the code but have not used Magpie yet, so if I am wrong about the below please correct me!

Since you are soliciting ideas:

1. Selecting/indexing into data with boolean vectors, e.g.:
df[df.A > 30 && df.B != "ignore"]
1a. This really means returning a boolean vector for df.COL <op> <operand>
1b. ...and being able to subset the data by a bool vector
2. We make heavy use of "pivot" functionality.

Kind regards
Aug 09 2019