
digitalmars.D - [GSoC] Dataframes for D

reply Prateek Nayak <lelouch.cpp gmail.com> writes:
Hello everyone,

I have begun work on my Google Summer of Code 2019 project, 
DataFrame for D.

-----------------
About the Project
-----------------

DataFrames have become a standard for handling and manipulating 
data. They give a neat representation of the data, along with easy 
access and the power to manipulate it the way the user wants.
This project aims at bringing a native DataFrame to D, one which 
brings with it:

* A User Friendly API
* Multi-Indexing
* Writing to CSV and parsing from CSV
* Column binary operation in the form: df["Index1"] = 
df["Index2"] + df["Index3"];
* groupBy on an arbitrary number of columns
* Data Aggregation

Disclaimer: The entire structuring was inspired by Pandas, the 
most popular DataFrame library in Python, and hence most of the 
usage will look very similar to Pandas.

The main focus of this project is the user-friendliness of the API, 
while also maintaining a fair amount of speed and power.
The preliminary road map can be viewed here -> 
https://docs.google.com/document/d/1Zrf_tFYLauAd_NM4-UMBGt_z-fORhFMrGvW633x8rZs/edit?usp=sharing

The core developments can be seen here -> 
https://github.com/Kriyszig/magpie


-----------------------------
Brief idea of what is to come
-----------------------------

This month
----------
* Finish up with structure of DataFrame
* Finish Terminal Output (What good is data which cannot be seen)
* Finish writing to CSV
* Parsing DataFrame from CSV (Both single and multi-indexed)
* Accessing Elements
* Accessing Rows and Columns
* Assignment of an element, an entire row or column
* Binary operation on rows and columns

Next Month
----------
* groupBy
* join
* Begin writing ops for aggregation


-----------
Speed Bumps
-----------

I am relatively new to D and hail from a functional C background. 
Sometimes (most of the time) my code can start to look more like C 
than D.
However, I am adapting thanks to my mentors Nicholas Wilson and 
Ilya Yaroshenko. They have helped me a ton - whether it be with 
debugging errors or my falling back on my functional C past, they 
have always come to my rescue and I am grateful for their 
support.


-------------------------------------
Addressing Suggestions from Community
-------------------------------------

This suggestion comes from Laeeth Isharc
Source: 
https://github.com/dlang/projects/issues/15#issuecomment-495831750

Though this is not on my current road map, I would love to pursue 
this idea. Adding an easy way to interoperate with other 
libraries would be very beneficial.
Although I haven't formally addressed this in the road map, I 
would love to implement msgpack-based I/O as I continue to 
develop the library. JSON I/O was also something I had in mind to 
implement after the data aggregation part. (I had prioritised 
JSON as I believed there were many more datasets available as JSON 
than in any other format.)
May 29 2019
next sibling parent reply Andre Pany <andre s-e-a-p.de> writes:
On Wednesday, 29 May 2019 at 18:00:02 UTC, Prateek Nayak wrote:
 [...]
The outlook looks really great. CSV as a starting point is good. I am not sure whether adding a second text format like JSON brings real value.

Text formats have two disadvantages. First, converting strings to numbers and vice versa slows down loading and saving files. If I remember correctly, saving a file as CSV took around 50 seconds while saving the same data as a binary file (Parquet) took 3 seconds. The second issue is file size. A CSV of 580 MB shrinks to around 280 MB when saved as e.g. a Parquet file. The file size isn't an issue on your local file system, but it is a big issue when storing these files in the cloud, e.g. Amazon S3, where the larger size causes longer transfer times.

Adding a packed binary format would be great, if possible.

Kind regards
Andre
May 29 2019
parent reply Laeeth Isharc <laeeth kaleidic.io> writes:
On Wednesday, 29 May 2019 at 18:26:46 UTC, Andre Pany wrote:
 [...]
Interoperability with pandas would be important in our use case, and I think probably for quite a few others. So yes, I agree that it's not ideal to use JSON, but lots of things are not ideal. And I think people use JSON, msgpack and HDF5 for interop with pandas. CSV is more complicated in practice than one might initially think. Finally, of course, Excel interop. It's not my cup of tea, but gzipped JSON is quite compact...

I have a little streaming msgpack deserialiser to our own Variable type (it can store a primitive or a variant). It's not long and I could share that. I don't think the initial dataframe necessarily needs to have all this stuff in it from day one.

It's worth reusing another JSON implementation. Since you work with Ilya - asdf isn't bad and quite fast, though the error messages leave something to be desired. You might look at John Colvin's repo from his talk on nogc map, filter, fold etc. He didn't do chunkBy yet.
May 29 2019
parent Prateek Nayak <lelouch.cpp gmail.com> writes:
On Wednesday, 29 May 2019 at 22:31:57 UTC, Laeeth Isharc wrote:
 On Wednesday, 29 May 2019 at 18:26:46 UTC, Andre Pany wrote:
 On Wednesday, 29 May 2019 at 18:00:02 UTC, Prateek Nayak wrote:
 [...]
 [...]
Interop is important, and I see many binary formats being suggested as a way of achieving it. Pandas' I/O tools cover some of the popular formats: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html

Apache Arrow was something new that I hadn't heard of before. I'll look into it, but then again I couldn't find any way to integrate it with D right away.

I would love to hear which binary format the community wants added as a way of interop in the future.
May 29 2019
prev sibling next sibling parent reply jmh530 <john.michael.hall gmail.com> writes:
On Wednesday, 29 May 2019 at 18:00:02 UTC, Prateek Nayak wrote:
 [snip]
Glad to see progress being made on this!

Somewhat tangentially, on the interoperability: I have also made quite a bit of use of R's data.frames. One difference between those and what I have seen of this implementation is that R's data.frames allow different columns to be of different types. This makes certain kinds of analysis of groups very easy. For instance, right now I'm working with a dataset whose columns are doubles, dates, integers, bools, and strings. I can do the equivalent of groupby on the strings as "factors" in R, and it's pretty straightforward to get everything working nicely.
May 29 2019
next sibling parent reply Yatheendra <3df4 gmail.ru> writes:
Is it worth considering some binary format that is standard-ish & 
whose toolset can write out text when viewing? I'm thinking 
https://capnproto.org .
May 29 2019
next sibling parent jmh530 <john.michael.hall gmail.com> writes:
On Wednesday, 29 May 2019 at 19:33:38 UTC, Yatheendra wrote:
 Is it worth considering some binary format that is standard-ish 
 & whose toolset can write out text when viewing? I'm thinking 
 https://capnproto.org .
I have heard of that, but I don't know too much about it. I think there are some hierarchical data formats that have some popularity, like hdf5. I think there's already a D wrapper for the C library. I don't have much experience with this kind of stuff though.
May 29 2019
prev sibling parent Laeeth Isharc <laeeth kaleidic.io> writes:
On Wednesday, 29 May 2019 at 19:33:38 UTC, Yatheendra wrote:
 Is it worth considering some binary format that is standard-ish 
 & whose toolset can write out text when viewing? I'm thinking 
 https://capnproto.org .
Ultimately you need to be able to talk to the world because this kind of thing is social and you may not have a choice about formats. However no point trying to do it all in one summer...
May 29 2019
prev sibling next sibling parent reply Prateek Nayak <lelouch.cpp gmail.com> writes:
On Wednesday, 29 May 2019 at 18:41:28 UTC, jmh530 wrote:
 On Wednesday, 29 May 2019 at 18:00:02 UTC, Prateek Nayak wrote:
 [snip]
Glad to see progress being made on this! Somewhat tangentially on the interoperability, I have also made quite a bit of use with R's data.frames. One difference between that and what I have seen of the implementation is that R's data.frames allow for different columns to be different types. This makes certain kinds of analysis of groups very easy. For instance, right now I'm working with a dataset whose columns are doubles, dates, integers, bools, and strings. I can do the equivalent of groupby on the strings as "factors" in R and it's pretty straightforward to get everything working nicely.
The DataFrame currently uses Mir's ndslice at its core, which allows homogeneous data to be stored within it. Right now, we are considering operable data to be homogeneous to keep the API simpler.

I'm not sure how something like Variant will play out in this scenario. It may allow the data to be flexible, but parsing will probably require an assertion library.
May 29 2019
parent jmh530 <john.michael.hall gmail.com> writes:
On Thursday, 30 May 2019 at 03:38:50 UTC, Prateek Nayak wrote:
 [snip]

 The DataFrame currently uses Mir's ndslice at the core of it 
 which allows for homogeneous data to be stored within it.
 Right now, we are considering operable data to be homogeneous 
 keeping the API simpler.
 I'm not sure how something like Variant will play out in this 
 scenario. It may allow for data to be flexible but parsing will 
 probably require an assertion library.
It's probably smart to focus on getting the homogeneous case working first. I don't think of it as the entire thing being a Variant, so much as a tuple containing 1-dimensional mir slices that are all the same length. The idea is that each column should have its own type. I had done a simple implementation of this a year or so ago and had shown it to Ilya.
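To make that concrete, here's a bare-bones standalone sketch of the idea - plain dynamic arrays standing in for mir slices, and all names hypothetical; it only illustrates the "one type per column" layout, not a proposed implementation:

```d
import std.meta : staticMap;
import std.stdio : writeln;

// one dynamic array per column; each column keeps its own element type
alias ColumnOf(T) = T[];

struct Columns(Types...)
{
    // a built-in tuple of fields, e.g. (double[], int[], string[])
    staticMap!(ColumnOf, Types) columns;
}

void main()
{
    Columns!(double, int, string) c;
    c.columns[0] = [1.5, 2.5];
    c.columns[1] = [10, 20];
    c.columns[2] = ["a", "b"];
    writeln(c.columns[2][1]); // "b" - second row of the third column
}
```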
May 30 2019
prev sibling next sibling parent reply Prateek Nayak <lelouch.cpp gmail.com> writes:
On Wednesday, 29 May 2019 at 18:41:28 UTC, jmh530 wrote:
 On Wednesday, 29 May 2019 at 18:00:02 UTC, Prateek Nayak wrote:
 [snip]
Glad to see progress being made on this! Somewhat tangentially on the interoperability, I have also made quite a bit of use with R's data.frames. One difference between that and what I have seen of the implementation is that R's data.frames allow for different columns to be different types. This makes certain kinds of analysis of groups very easy. For instance, right now I'm working with a dataset whose columns are doubles, dates, integers, bools, and strings. I can do the equivalent of groupby on the strings as "factors" in R and it's pretty straightforward to get everything working nicely.
On second thought, my mentor Nicholas Wilson led me to an interesting GitHub Gist ->
https://gist.github.com/aG0aep6G/a1b87df1ac5930870ffe

A similar structure can be used to represent non-homogeneous data. The DataFrame structure can be overloaded for such an integration. However, the homogeneous DataFrame still remains the main objective for now. This integration will definitely happen once the homogeneous DataFrame comes close to looking and working like an actual DataFrame.

I'll keep you updated here in case I find anything better for non-homogeneous data and when the whole thing starts to take shape.
May 29 2019
parent jmh530 <john.michael.hall gmail.com> writes:
On Thursday, 30 May 2019 at 04:38:09 UTC, Prateek Nayak wrote:
 

 On a second thought, my mentor Nicholas Wilson led me to an 
 interesting Github Gist
 -> https://gist.github.com/aG0aep6G/a1b87df1ac5930870ffe

 A similar structure can be used to represent non homogeneous 
 data. The DataFrame structure can be overloaded for such an 
 integration. However homogeneous DataFrame still remain the 
 main objective for now. This integration will definitely happen 
 once the homogeneous DataFrame comes close to looking and 
 working like an actual DataFrame.

 I'll keep you updated here in case I find anything better for 
 non homogeneous data and when the whole things starts to take 
 shape.
Hmm, my point above was for a tuple of mir slices, which seems to correspond more to a struct of arrays. I wonder if your implementation would be able to take an array of structs approach using built-in mir. I know mir can handle complex numbers, which requires a struct. I don't know off the top of my head how it can handle more generally a slice whose type is a struct.
May 30 2019
prev sibling parent reply welkam <wwwelkam gmail.com> writes:
On Wednesday, 29 May 2019 at 18:41:28 UTC, jmh530 wrote:
 R's data.frames allow for different columns to be different 
 types.
I'm not familiar with data frames, but this sounds like an array of structs.
May 30 2019
parent reply jmh530 <john.michael.hall gmail.com> writes:
On Thursday, 30 May 2019 at 09:06:58 UTC, welkam wrote:
 [snip]

 Im not familiar with data frames but this sounds like array of 
 structs
I think I was thinking of it more like a struct of arrays, but I think an array of structs may also work (see my responses to Prateek)...
May 30 2019
parent reply Prateek Nayak <lelouch.cpp gmail.com> writes:
On Thursday, 30 May 2019 at 09:41:59 UTC, jmh530 wrote:
 On Thursday, 30 May 2019 at 09:06:58 UTC, welkam wrote:
 [snip]

 Im not familiar with data frames but this sounds like array of 
 structs
I think I was thinking of it more like a struct of arrays, but I think an array of structs may also work (see my responses to Prateek)...
Due to the popularity of heterogeneous DataFrames, we decided to take care of them in the early stages of development before it's too late.

The heterogeneous DataFrame is now live at:
https://github.com/Kriyszig/magpie/tree/experimental

Some parts are still under development, but the goals in the road map will be reached on time.

---------------------------------
Summing up the first week of GSoC
---------------------------------
* Base and file I/O ops were built for the homogeneous DataFrame.
* Based on the type of data the community has worked with, it seemed evident homogeneous DataFrames weren't going to cut it, so a rebuild was initiated over the weekend to allow for heterogeneous data.
* The API was overhauled to allow for heterogeneous DataFrames.
* New parser that can parse selective columns.

The code will land in master once it's cleaned up and is deemed stable.

----------------------------------------
Things that will be dealt with this week
----------------------------------------
This week will be for:
* Improving the parser
* Overhauling the code structure (in experimental)
* Adding setters for data and index in DataFrame
* Adding functions to create a multi-indexed DataFrame, the same way one can in Python
* Adding documentation and examples
* Index operations
* Retrieving rows and columns

The last one will set in motion the implementation of column binary ops of the form:
df["Index1"] = df["Index2"] + df["Index3"];

Meanwhile, if you have any more suggestions, please feel free to contact me - you can use this thread, open an issue on GitHub, reach out to me on Slack (Prateek Nayak) or email me directly (lelouch.cpp gmail.com).
Jun 03 2019
next sibling parent jmh530 <john.michael.hall gmail.com> writes:
On Tuesday, 4 June 2019 at 03:13:03 UTC, Prateek Nayak wrote:
 [snip]

 Due to the popularity of heterogeneous DataFrames, we decided 
 to take care of it the early stages of development before it's 
 too late.
Excellent!
Jun 04 2019
prev sibling next sibling parent reply James Blachly <james.blachly gmail.com> writes:
On 6/3/19 11:13 PM, Prateek Nayak wrote:
 Due to the popularity of heterogeneous DataFrames, we decided to take 
 care of it the early stages of development before it's too late.
 
 The heterogeneous DataFrame is now live at: 
 https://github.com/Kriyszig/magpie/tree/experimental
Amazing, thanks!

The experimental branch readme code snippet has a typo; the var name should be `heterogeneous`:

```
// Creating a heterogeneous DataFrame of 10 integer columns and 10 double columns
DataFrame!(int, 10, double, 10) homogeneous;
```

Again, I cannot thank you enough!
Jun 05 2019
parent Prateek Nayak <lelouch.cpp gmail.com> writes:
On Thursday, 6 June 2019 at 00:24:17 UTC, James Blachly wrote:
 On 6/3/19 11:13 PM, Prateek Nayak wrote:
 Due to the popularity of heterogeneous DataFrames, we decided 
 to take care of it the early stages of development before it's 
 too late.
 
 The heterogeneous DataFrame is now live at: 
 https://github.com/Kriyszig/magpie/tree/experimental
Amazing, thanks! experimental branch readme code snippet has typo; var name should be `heterogeneous` ``` // Creating a heterogeneous DataFrame of 10 integer columns and 10 double columns DataFrame!(int, 10, double, 10) homogeneous; ``` Again I cannot thank you enough!
Oops! Thanks for spotting that. I'll update it today with a complete example for the usage and a snippet each for the functions added since it was last modified.
Jun 05 2019
prev sibling parent reply Prateek Nayak <lelouch.cpp gmail.com> writes:
On Tuesday, 4 June 2019 at 03:13:03 UTC, Prateek Nayak wrote:
 [...]
I have decided to make weekly updates so the community knows the progress of the project.

-------------------------------
Summing up last week's progress
-------------------------------
* Brought heterogeneous DataFrames to the same point as the previous homogeneous DataFrame development.
* Assignment ops - both direct and indexed.
* Retrieving an entire column using the index operation.
* Retrieving an entire row using the index operation.
* Small redesigns here and there to reduce code size.

The index op for rows and columns returns the value as an Axis structure. Binary ops on the Axis structure basically translate to column and row binary operations on the DataFrame.

---------------------------------------
Tasks that will be dealt with this week
---------------------------------------
* Column and row binary operations.
* There are a few places where O(n) operations can be converted to O(log n) operations - these few optimisations will be done.
* Updating the documentation with the developments of the week.

So far no major roadblocks have been encountered.

P.S. Found this interesting Reddit post regarding file formats on r/datasets:
https://www.reddit.com/r/datasets/comments/bymru3/are_there_any_data_formats_for_storing_text_worth/
Jun 10 2019
parent reply Prateek Nayak <lelouch.cpp gmail.com> writes:
On Tuesday, 11 June 2019 at 04:35:22 UTC, Prateek Nayak wrote:
 [...]
-------------
Weekly Update
-------------

This week, development was a bit slower compared to the last couple of weeks - I had to attend college for a couple of days and it took more time than I would have liked. That said, everything from last week's goals has been achieved.

-----------------------
What happened last week
-----------------------
* Redesigned Axis - the structure that returns the values during column binary operations.
* Added binary operations for Axis [this is the equivalent of binary operations on the DataFrame].
* Tested whether the binary operations work fine on DataFrames.
* Fixed a couple of tiny bugs here and there.
* Added more ways to build an index.
* HashMap-like implementation to check for duplicates in the index (see the small sketch after this post).

-------------------
Goals for this week
-------------------
* Work on apply - which applies a function to values in a row/column or the entire DataFrame.
* Add some inbuilt operations [operations like mean and median - operations that are essential].
* Optimize parts of DataFrame.
* Addition of helpers that will eventually trigger the development of groupBy.

----------
Roadblocks
----------
There were a few moments where I ran into trouble while returning the Axis structure because of the variable return type. My mentors helped me a lot when I got stuck with the implementation. There were also a couple of bugs which were simple but still took a while to solve. Other than that, things went really smoothly.

---------------------
Community Suggestions
---------------------
Petar Kirov [ZombineDev] suggested using static arrays as a way to declare a DataFrame. Soon it will be possible to declare a DataFrame as:

DataFrame!(int[5], double[10]) df;

[Implemented in the BinaryOps branch - will land in master with the BinaryOps implementation soon]

Thank you for the suggestion.
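The duplicate check itself is conceptually nothing more than the following standalone sketch (illustrative only - not the exact code that went into Magpie):

```d
import std.stdio : writeln;

// Flag duplicate index labels using a built-in associative array
// as the "HashMap".
bool hasDuplicates(string[] labels)
{
    bool[string] seen;
    foreach (label; labels)
    {
        if (label in seen)
            return true;
        seen[label] = true;
    }
    return false;
}

void main()
{
    writeln(hasDuplicates(["a", "b", "c"])); // false
    writeln(hasDuplicates(["a", "b", "a"])); // true
}
```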
Jun 17 2019
parent reply Prateek Nayak <lelouch.cpp gmail.com> writes:
 [snip]
-------------
Week 4 Update
-------------

This marks the completion of Stage I of Google Summer of Code 2019. It seems like it was only yesterday when I started working on this project, and it has already been a month.

--------------------------
So what happened last week
--------------------------
* apply - to apply a function on a row/column.
* A function to convert a column of data to a level of index.
* drop - to drop a row/column.

Going back to the original proposal, I had allocated some time for optimisations in case there was time left over: I was testing the old parser with large files and it failed miserably. So I redesigned the from_csv function and added it to the library as fastCSV. fastCSV gives a 40x speed improvement over from_csv and will eventually replace it.

* fastCSV was added to the library.

-------------------
Plans for this week
-------------------
Plans for this entire stage aren't strictly on a week-by-week timeline, but the following things will be dealt with sequentially throughout this stage:

This stage is reserved for the implementation of groupBy. To begin with, the internal structure and grouping will be decided. Later, things like display and combining into a DataFrame struct will be dealt with. These tasks were scheduled for Stage III but will again fall under sequential implementation.

If the above tasks are done, the following tasks will be dealt with:
* Aggregate [with a complete set of popular operations]
* Join will be implemented to merge two DataFrames.

Aggregate was reserved for later stages to support implementation for both the normal DataFrame and groupBy at once.

----------
Roadblocks
----------
This week there haven't been any roadblocks. I needed the help of my mentors to solve a couple of errors here and there, but other than that things were smooth. As for future roadblocks, I cannot see any apparent ones, but then again they show up when you are least expecting them :(
Jun 25 2019
parent reply jmh530 <john.michael.hall gmail.com> writes:
On Tuesday, 25 June 2019 at 17:25:34 UTC, Prateek Nayak wrote:
 [snip]
------------- Week 4 Update ------------- This marks the completion of Stage I of Google Summer of Code 2019. It seems like it was only yesterday when I started working on this project and it has already been a month. -------------------------- So what happened last week -------------------------- * apply - to apply a function on a row/column * function to convert a column of data to level of Index * drop - to drop a row/column
[snip]

Glad to see you're still making great progress.

I had worked on the byDim function in mir.ndslice.topology because I had wanted the same sort of functionality as R's apply. It works a little differently than R's, but I find it very good for a lot of things. Your version of apply (I'm looking at the apply branch of magpie) looks like it operates a bit like a byDim chained with an each, so byDim!N.each!f. However, it also has this index variable allowing it to skip rows or something (I'm not really sure this feature pulls its weight...).

So I have two questions: 1) does byDim also work with DataFrames?, 2) can you add an overload that is apply(f, axis) without the index parameter?

One of my take-aways from looking at the apply function (again, just looking at that apply branch) is that you might benefit from using more of what Ilya has already put in mir.ndslice where available. For instance, the overload of apply that is just apply!f is basically the same as mir's each, but each has more features.
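For anyone who hasn't used that combination, here is a small standalone mir-algorithm example of byDim chained with each - nothing Magpie-specific, just the pattern I mean:

```d
import mir.ndslice.slice : sliced;
import mir.ndslice.topology : byDim;
import mir.algorithm.iteration : each;
import mir.math.sum : sum;
import std.stdio : writeln;

void main()
{
    // a 2x3 matrix
    auto m = [1.0, 2, 3, 4, 5, 6].sliced(2, 3);

    // byDim!0 yields each row as a 1-D slice; each applies the function to every row
    m.byDim!0.each!(row => writeln(sum(row))); // prints 6 then 15
}
```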
Jun 25 2019
next sibling parent jmh530 <john.michael.hall gmail.com> writes:
On Tuesday, 25 June 2019 at 17:54:36 UTC, jmh530 wrote:
 [...]
Stupid typos. I had worked on the byDim function in mir.ndslice.topology because ..."
Jun 25 2019
prev sibling parent reply Prateek Nayak <lelouch.cpp gmail.com> writes:
On Tuesday, 25 June 2019 at 17:54:36 UTC, jmh530 wrote:
 Glad to see you're still making great progress.

 I had worked on the byDim function in mir.ndslice.topology is 
 byDim because I had wanted the same sort of functionality as 
 R's apply. It works a little differently than R's, but I find 
 it very good for a lot of things. Your version of apply (I'm 
 looking at the apply branch of magpie) looks like it operates a 
 bit like a byDim chained with an each, so byDim!N.each!f. 
 However, it also has this index variable allowing it to skip 
 rows or something (I'm not really sure if this feature pulls 
 its weight...).

 So I have two questions: 1) does byDim also work with 
 dataframes?, 2) can you add an overload that is apply(f, axis) 
 without the index parameter?

 One of my take-a-ways from looking at the apply function (again 
 just looking at that apply branch) is that you might benefit 
 from using more of what is Ilya has already put in mir.ndslice 
 where available. For instance, the overload of apply that is 
 just apply!f is basically the same as mir's each, but each has 
 more features.
1) Currently, byDim doesn't work on DataFrame.
2) Sure, the overload can be made, but what are you specifically looking for? apply(f, axis)(indexes)?

You are right, apply works like byDim!axis.each on particular columns/rows. I'll look into Mir's implementation. Thanks for that advice. I do believe apply can be strengthened to account for different use cases.

When the heterogeneous DataFrame support came, mir-algorithm was dropped from the dependencies and a Structure of Arrays implementation was taken up using TypeTuples. Once the basic working is solid, I'll port useful features from Mir to Magpie.
Jun 25 2019
parent reply jmh530 <john.michael.hall gmail.com> writes:
On Tuesday, 25 June 2019 at 20:44:43 UTC, Prateek Nayak wrote:
 [snip]
 2) Sure, the overload can be made but what are you specifically 
 looking for?
 apply(f, axis)(indexes) ?
 [snip]
I see
void apply(alias Fn, int axis, T)(T index)
and
void apply(alias Fn)()
in the current implementation.

I think you interpreted what I am asking for as something like
void apply(alias Fn, int axis, T[])(T[] indices)
which also might make sense. But I guess I was suggesting something a little simpler, like
void apply(alias Fn, int axis)()
so that it applies to all the rows or columns. This is particularly relevant in the homogeneous data case.

My motivation reflects a common use case of the apply function in R: calculating summary statistics of an array/matrix by column or row. For instance, I might want to calculate the standard deviation of every column.
Jun 25 2019
parent reply Prateek Nayak <lelouch.cpp gmail.com> writes:
On Tuesday, 25 June 2019 at 21:07:35 UTC, jmh530 wrote:
 On Tuesday, 25 June 2019 at 20:44:43 UTC, Prateek Nayak wrote:
 [snip]
 2) Sure, the overload can be made but what are you 
 specifically looking for?
 apply(f, axis)(indexes) ?
 [snip]
I see void apply(alias Fn, int axis, T)(T index) and void apply(alias Fn)() in the current implementation. I think you interpreted what I am asking as something like void apply(alias Fn, int axis, T[])(T[] indices) which also might make sense. But I guess I was suggesting a little simpler as void apply(alias Fn, int axis)() so that it applies to all the rows or columns. This is particularly relevant in the homogeneous data case. My motivation reflects a common use case of the apply function in R to calculate summary statistics of an array/matrix by column or row. For instance, I might want to calculate the standard deviation of every column.
The apply right now works exactly as
void apply(alias Fn, int axis, T)(T indices)
where indices can be an array of integers or a 2D array of string indexes:
https://github.com/Kriyszig/magpie/blob/dec86d1942f9c9b4db31438407798329af0aed96/source/magpie/dataframe.d#L1200

The overload you need also exists: apply(Fn)
https://github.com/Kriyszig/magpie/blob/dec86d1942f9c9b4db31438407798329af0aed96/source/magpie/dataframe.d#L1246

Unittest for apply -
https://github.com/Kriyszig/magpie/blob/dec86d1942f9c9b4db31438407798329af0aed96/source/magpie/dataframe.d#L2821

I agree, things like mean and standard deviation calculations are of utmost importance in data science. Aggregate will bring such features as inbuilt functions: Count, Min, Max, Mean, SD, Variance, etc. This will be added soon (by soon I mean somewhere between the final week of this stage [possibly sooner] and the first week of the next - as soon as groupBy is stable, I will get onto aggregate).

Sorry for the inconvenience.
Jun 25 2019
parent reply jmh530 <john.michael.hall gmail.com> writes:
On Wednesday, 26 June 2019 at 05:41:48 UTC, Prateek Nayak wrote:
 [snip]

 I agree, things like mean and standard deviation calculations 
 are of utmost importance in data science. Aggregate will bring 
 such features as inbuilt functions. Count, Min, Max, Mean, SD, 
 Variance, etc.
 This will be added soon (by soon I mean somewhere between the 
 final week of this stage [possibly sooner] and the fist week of 
 the next - As soon as groupBy is stable, I will get onto 
 aggregate)
 Sorry for the inconvenience.
By no means do you need to apologize for any inconvenience. I suppose what I am thinking is more about leveraging work that is already done as much as possible. For instance, I know that count/sum/min/max are part of mir-algorithm already and I had helped add sd and variance to numir. Do you mind if I send you an email?
Jun 26 2019
parent reply Prateek Nayak <lelouch.cpp gmail.com> writes:
On Wednesday, 26 June 2019 at 12:50:23 UTC, jmh530 wrote:
 On Wednesday, 26 June 2019 at 05:41:48 UTC, Prateek Nayak wrote:
 [snip]

 I agree, things like mean and standard deviation calculations 
 are of utmost importance in data science. Aggregate will bring 
 such features as inbuilt functions. Count, Min, Max, Mean, SD, 
 Variance, etc.
 This will be added soon (by soon I mean somewhere between the 
 final week of this stage [possibly sooner] and the fist week 
 of the next - As soon as groupBy is stable, I will get onto 
 aggregate)
 Sorry for the inconvenience.
By no means do you need to apologize for any inconvenience. I suppose what I am thinking is more about leveraging work that is already done as much as possible. For instance, I know that count/sum/min/max are part of mir-algorithm already and I had helped add sd and variance to numir. Do you mind if I send you an email?
I'm sorry I couldn't reply sooner - I was sick for the past couple of days.

I don't mind emails one bit. The email id is: lelouch.cpp gmail.com [I know it's weird :)]

I'll reply as soon as I see the mail [at worst it will take 12 hrs to get a reply from me, when my phone doesn't notify me of a new mail; at best I'll reply immediately].

Again, sorry for the delayed response. Hope to hear from you soon :)
Jun 28 2019
parent reply Prateek Nayak <lelouch.cpp gmail.com> writes:
On Saturday, 29 June 2019 at 04:39:39 UTC, Prateek Nayak wrote:
 [snip]
-------------
Weekly Update
-------------

I caught a flu this week, which was really unfortunate. However, I'm getting better and the work is going forward :)

-----------------------
What happened last week
-----------------------
I mostly dealt with the internal structure of `Group` - the structure that is returned by the groupBy operation. At first I thought an array of DataFrames might be a good idea, but I soon dropped it, as it would mean some parts - like the column index - stay the same but would need to be copied to every DataFrame structure in the array, which is just a waste of space.

The implementation now looks somewhat similar to the DataFrame structure itself - there is an `Index` and `data`. Indexes are sorted based on the groups formed as a result of groupBy. There are a few places where optimizations can be made [mostly with respect to the space used] and I'll work on them this week.

Some of the functionality added to `Group` so far:
* display - the user can choose to display a single group or multiple groups
* combine - returns a DataFrame combining the groups the user would like

At this point there was a need for a function in DataFrame which could convert a level of indexing to a column of operable data if required. This is because combine on a groupBy result doesn't remember the position from where the data was extracted. Hence, if a level of data is used for groupBy, it would automatically be converted to a level of index in the result of combine. `indexToData` was therefore added to revert this if the user desires.

There were a few minor updates here and there, nothing major. They include a new argument for `extends` in `Index`, which can now insert the index at the position of the user's choice. The other was stripping the trailing white spaces which appeared in display.

--------------------------
What will happen this week
--------------------------
This week will deal with optimizations of Group, and add binary operations to `Group` which may be helpful. Document the changes once stability is reached. Start work on aggregate/join.

----------
Roadblocks
----------
I can't spot any major roadblocks up ahead. Work should go smoothly this week :)

-> Thank you jmh530 for sharing your work. This should help in improving the functionality of DataFrames further.
Jul 02 2019
parent reply Prateeek Nayak <lelouch.cpp gmail.com> writes:
On Wednesday, 3 July 2019 at 05:04:20 UTC, Prateek Nayak wrote:
 On Saturday, 29 June 2019 at 04:39:39 UTC, Prateek Nayak wrote:
 [snip]
[snip]
---------------
Progress Update
---------------

The past couple of weeks went as expected, without any roadblocks.

* groupBy can group a DataFrame based on an arbitrary number of columns.
* groupBy returns a Group structure which supports binary operations.
* Retrieve a single group or multiple groups as a DataFrame.
* Merge two or more Groups into a single DataFrame.
* Index operations on Group. An entire column/row is returned as an Axis, the same way the index operation is implemented on DataFrame.
* Display Group on the terminal.

Works on DataFrame:
* Added shorthand data operations which I missed before! \(°^°)/
* Added a function to convert an index to an operable data column and vice versa.

---------------------
What is due this week
---------------------
This week was mostly reserved for refactoring. Mr. Wilson introduced me to the beautiful lockstep in std.range and I worked it into the codebase wherever necessary (a small illustration follows this update).

This week I am adding ways to retrieve data as a Slice and assign a Slice to a DataFrame. This IMHO is important, as ndslice is used widely and it opens a lot of doors for data computations. A way to easily retrieve data as a Slice, operate on it and assign the data back to the DataFrame seems valuable. I hope to get the initial PR ready by the beginning of next week.

After this will come Aggregate - on a whole Frame/Group, on a selected column/row of a DataFrame or Group, and selective operations on selective columns/rows.

-----------------
Future Roadblocks
-----------------
I can't see any obvious roadblocks, but then you never do see them coming ¯\_(ツ)_/¯
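For anyone unfamiliar with it, lockstep simply walks several ranges in parallel - a tiny standalone example, unrelated to Magpie's internals:

```d
import std.range : lockstep;
import std.stdio : writeln;

void main()
{
    auto labels = ["a", "b", "c"];
    auto values = [1, 2, 3];

    // iterate both arrays side by side without manual indexing
    foreach (label, value; lockstep(labels, values))
        writeln(label, " -> ", value);
}
```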
Jul 17 2019
parent reply jmh530 <john.michael.hall gmail.com> writes:
On Thursday, 18 July 2019 at 05:03:38 UTC, Prateeek Nayak wrote:
 [snip]
Thanks for the update. I'm glad you're still making good progress.

I'm just looking over the readme.md. I noticed the "at" function has a signature like at!(row, column)(). Because it uses a template, doesn't that imply that the row and column parameters must be known at compile time? What if we want run-time access using a function style instead of like df[0, 0]? mir's ndslice also has a set of select functions that are useful for access.

There's also a typo in the GroupBy text: "Group DataFrame based on na arbitrary number of columns."

I noticed that you make a lot of use of static foreach over RowType in dataframe.d. Does this mean there isn't any extra cost if you use a homogeneous dataframe with RowType.length == 1? If you can advertise that it doesn't have any additional overhead for working with homogeneous data, then that's probably a win. You might also add a trait for isHomogeneous that checks if RowType.length == 1.
Jul 18 2019
parent reply Prateeek Nayak <lelouch.cpp gmail.com> writes:
On Thursday, 18 July 2019 at 10:55:55 UTC, jmh530 wrote:
 On Thursday, 18 July 2019 at 05:03:38 UTC, Prateeek Nayak wrote:
 [snip]
Thanks for the update. I'm glad you're still making good progress. I'm just looking over the readme.md. I noticed the "at" function has a signature like at!(row, column)(). Because it uses a template, doesn't that imply that the row and column parameters must be known at compile-time? What if we want run-time access using a function style instead of like df[0, 0]? mir's ndslice also has a set of select functions that are also useful for access. There's also a typo in the GroupBy text: "Group DataFrame based on na arbitrary number of columns." I noticed that you make a lot of use of static foreach's over RowType in dataframe.d. Does that this means that this means there isn't any extra cost if you use a homogeneous dataframe with RowType.length == 1? If you can advertise that it doesn't have any additional overhead for working with homogeneous, then that's probably a win. You might also add a trait for isHomogeneous that checks if RowType.length == 1.
* "at" was meant for fast access to an element. It's only necessary to know one of the two arguments at compile time, to be honest, but df[i1, i2] would have to be written as at!(i2)(i1), which reverses the two positions. Hence I thought at!(i1, i2) could reduce the mishaps that the position reversal can cause. I agree a method to access the element at runtime is needed; I will overload at for that.

* Sorry about the typo, will fix it soon (^_^)

* The data in DataFrame is stored as a TypeTuple, which requires the column index to be known statically. When trying to do a runtime operation on the data, I was forced to traverse the tuple statically to find the particular index. A homogeneous DataFrame defined as DataFrame!(int, 5) will give RowType as (int, int, int, int, int). For now that overhead still exists, but I think an isHomogeneous template can open some new doors for optimization. I will definitely look into this over the next week. Thanks for bringing it to my notice.
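A stripped-down illustration of that overhead - a runtime column number has to be matched against every compile-time position of the tuple (hypothetical code, not Magpie's):

```d
import std.stdio : writeln;

// Print the column selected by a runtime index from a compile-time
// tuple of columns; the tuple itself can only be walked statically.
void printColumn(Columns...)(size_t index, Columns columns)
{
    static foreach (i, T; Columns)
    {
        if (i == index)
            writeln(columns[i]);
    }
}

void main()
{
    int[] ages = [25, 31];
    string[] names = ["Alice", "Bob"];
    printColumn(1, ages, names); // prints ["Alice", "Bob"]
}
```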
Jul 18 2019
parent reply jmh530 <john.michael.hall gmail.com> writes:
On Thursday, 18 July 2019 at 15:34:32 UTC, Prateeek Nayak wrote:
 [snip]

 * The data in DataFrame is stored as TypeTuple which requires 
 the column index to be known statically. When trying to do a 
 runtime operation on data, I was forced to traverse the tuple 
 statically to find the particular index. Homogeneous DataFrame 
 defined as DataFrame!(int, 5) will give RowType as (int, int, 
 int, int, int).
 For now that overhead still exists but I think isHomogeneous 
 template can open some new door for optimization. I will 
 definitely look into this over the next week. Thanks for 
 bringing it to my notice.
Ah, so what you would want to check is that all the RowTypes are the same instead.
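Something along these lines, for instance - just a sketch, with the names made up:

```d
import std.meta : allSatisfy;

// true when every column type matches the first one
template isHomogeneous(RowType...)
{
    static if (RowType.length <= 1)
        enum isHomogeneous = true;
    else
    {
        enum sameAsFirst(T) = is(T == RowType[0]);
        enum isHomogeneous = allSatisfy!(sameAsFirst, RowType[1 .. $]);
    }
}

static assert(isHomogeneous!(int, int, int));
static assert(!isHomogeneous!(int, double));
```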
Jul 18 2019
parent Prateeek Nayak <lelouch.cpp gmail.com> writes:
On Thursday, 18 July 2019 at 16:23:20 UTC, jmh530 wrote:
 On Thursday, 18 July 2019 at 15:34:32 UTC, Prateeek Nayak wrote:
 [snip]
Ah, so what you would want to check is that all the RowTypes are the same instead.
Yes. Will require a small redesign in the internal structure and some optimizations here and there but can seriously cut down overheads.
Jul 18 2019
prev sibling next sibling parent reply James Blachly <james.blachly gmail.com> writes:
On 5/29/19 2:00 PM, Prateek Nayak wrote:
 (snip)
Outstanding, and greatly needed. Congratulations to you and your mentors.

Our lab has transitioned to D for new software but still relies on python+pandas for some analytics pipelines. I second the notions about the importance of interoperability.

An interesting in-memory interop framework I haven't seen mentioned here yet is Apache Arrow. In this 2017 blog post, Wes McKinney, the author of Pandas, discusses it in the context of mistakes made designing pandas; recommended reading if you have not:

https://wesmckinney.com/blog/apache-arrow-pandas-internals/
May 29 2019
parent jmh530 <john.michael.hall gmail.com> writes:
On Thursday, 30 May 2019 at 02:16:16 UTC, James Blachly wrote:
 [snip]

 https://wesmckinney.com/blog/apache-arrow-pandas-internals/
This was a good read. The columnar data structures he describes sound more like struct of arrays than array of structs.
May 30 2019
prev sibling next sibling parent reply bachmeier <no spam.net> writes:
Looking at the readme, I see the following example for accessing 
elements by name:

df[["1"], ["0"]];

Why can't that instead be

df["1", "0"];

Something that gets in the way of adoption is verbose notation, 
and I'm not seeing any advantage to the array notation.

Also, for this example:

Index indx;
indx.setIndex([1, 2, 3, 4], ["Row Index"], [1, 2, 3], ["Column 
Index"]);

That's pretty verbose/hard to parse compared to

rownames(x) = [1, 2, 3, 4];
colnames(x) = [1, 2, 3];
Jul 18 2019
parent Prateek Nayak <lelouch.cpp gmail.com> writes:
On Thursday, 18 July 2019 at 16:02:40 UTC, bachmeier wrote:
 Looking at the readme, I see the following example for 
 accessing elements by name:

 df[["1"], ["0"]];

 Why can't that instead be

 df["1", "0"];

 Something that gets in the way of adoption is verbose notation, 
 and I'm not seeing any advantage to the array notation.

 Also, for this example:

 Index indx;
 indx.setIndex([1, 2, 3, 4], ["Row Index"], [1, 2, 3], ["Column 
 Index"]);

 That's pretty verbose/hard to parse compared to

 rownames(x) = [1, 2, 3, 4];
 colnames(x) = [1, 2, 3];
I'm really sorry I overlooked this. Sorry about that (-‸ლ)

I'll fix the first case in the PR where I'll make optimizations for the homogeneous DataFrame. I'll address the second problem of verbosity soon, but not in the immediate PR.

Thanks for the feedback ٩(^‿^)۶
Jul 23 2019
prev sibling parent reply Dejan Lekic <dejan.lekic gmail.com> writes:
On Wednesday, 29 May 2019 at 18:00:02 UTC, Prateek Nayak wrote:
 Hello everyone,

 I have began work on my Google Summer of Code 2019 project 
 DataFrame for D.
Really glad to see someone working on that. I hope you will have time to implement a good CSV/TSV reader/writer based on the fantastic iopipe project (which IMHO should go into Phobos one way or another)...
Jul 19 2019
parent reply Prateek Nayak <lelouch.cpp gmail.com> writes:
On Friday, 19 July 2019 at 14:50:35 UTC, Dejan Lekic wrote:
 On Wednesday, 29 May 2019 at 18:00:02 UTC, Prateek Nayak wrote:
 Hello everyone,

 I have began work on my Google Summer of Code 2019 project 
 DataFrame for D.
Really glad to see someone working on that. I hope you will have time to implement a good CSV/TSV reader/writer based on the fantastic iopipe project (that should IMHO go into Phobos this way or another)...
Right now there is a CSV reader in Magpie, but it isn't mature enough to go into Phobos yet. I'll improve the parser, and when I'm happy with the read speed, I'll send a PR (^_^)
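For context, the generic std.csv reader in Phobos is the obvious baseline to measure against; a minimal usage sketch with made-up data:

import std.csv : csvReader;
import std.typecons : Tuple;

void main()
{
    auto text = "1,2.5\n3,4.0";
    double total = 0;
    foreach (record; csvReader!(Tuple!(int, double))(text))
        total += record[1];   // second column of each row
    assert(total == 6.5);
}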
Jul 23 2019
next sibling parent jmh530 <john.michael.hall gmail.com> writes:
On Tuesday, 23 July 2019 at 18:21:30 UTC, Prateek Nayak wrote:
 [snip]

 Right now there is a CSV reader in Magpie but it isn't perfect 
 enough to go into Phobos yet. I'll improve the parser and when 
 I'm happy with the read speed, I'll send a PR (^_^)
mir was originally intended to be included in Phobos, but it got split off into its own library. If anything, Magpie has a better place in mir than in Phobos. However, I think there is probably value in splitting the csv reader off into a separate project and putting it up on dub when it is ready for broader use.
Jul 23 2019
prev sibling parent reply Suliman <evermind live.ru> writes:
Could you do any benchmarks against Python's Pandas?
Jul 25 2019
parent reply Prateek Nayak <lelouch.cpp gmail.com> writes:
On Thursday, 25 July 2019 at 11:55:48 UTC, Suliman wrote:
 Could you do any benchmarks against Python Pandas.
As soon as aggregate is done, I'll get on to this. It's best to benchmark with real-world examples IMHO, and aggregate brings most of the analytics functionality. I'll keep the thread updated (^_^)
Jul 25 2019
parent reply Prateeek Nayak <lelouch.cpp gmail.com> writes:
-----------
Update Time
-----------

Pardon me for the delay; my university has just started its term and it has been a busy first week. However, I have some good news:

* Aggregate implementation is under review - the preliminary 
implementation restricted the set of operations that aggregate 
could perform, but then Mr. Wilson suggested there should be a way 
to expand its usability, so we worked on a revamp that takes the 
function you desire as input and applies it to a row/column of the 
DataFrame (a minimal sketch of the idea follows this list)
* There is a new way to set the index using the index operation
* to_csv supports setting the precision for floating point numbers - 
this was a problem I knew existed but hadn't addressed till now. 
Better late than never.
* Homogeneous DataFrames don't use TypeTuple anymore
* An at overload is coming soon
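For anyone curious, a minimal sketch of the aggregate idea (plain slices standing in for DataFrame columns; not the actual Magpie signature):

import std.algorithm.iteration : fold;

// apply a user-supplied binary operation across one column
double aggregateColumn(alias fn)(double[] column)
{
    return column.fold!fn;
}

void main()
{
    auto col = [1.0, 2.0, 3.0, 4.0];
    assert(aggregateColumn!((a, b) => a + b)(col) == 10.0);        // sum
    assert(aggregateColumn!((a, b) => a > b ? a : b)(col) == 4.0); // max
}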


--------------------
What is to come next
--------------------

* The first few responses from the community were mostly about 
bringing in binary file I/O support, because of the lean file 
sizes and fast read/write. I will explore this further.
* Time series are gaining importance with the rise of machine 
learning. I would like to implement something along the lines of 
the time series functionality Pandas has.
* Something you would like to see. I am open to suggestions (^_^)

--------------
Problems faced
--------------

One small implementation detail remains - a dispatch function. 
Given that non-homogeneous cases still require a traversal to 
reach a column, a function that applies an alias statically or 
non-statically depending on the DataFrame is under discussion 
(a rough sketch of the idea is below).
This will reduce code redundancy; however, my preliminary attempts 
to tackle it have ended in failure. I will try to finish it by 
the weekend. If I cannot solve it by then, I will seek your help 
in the Learn section (^_^)
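To make the problem concrete, here is a rough sketch with toy types and hypothetical names (not what will land in Magpie): one entry point that indexes a homogeneous frame with a runtime index but has to walk the type tuple statically for a mixed frame:

import std.meta : AliasSeq;

struct Homogeneous { double[][3] cols; enum isHomogeneous = true; }
struct Mixed { AliasSeq!(int[], double[], string[]) cols; enum isHomogeneous = false; }

void applyToColumn(alias fn, DF)(ref DF df, size_t idx)
{
    static if (DF.isHomogeneous)
    {
        fn(df.cols[idx]);   // all columns share one type: runtime indexing is enough
    }
    else
    {
        // the type tuple needs a compile-time index, so match the runtime idx statically
        static foreach (i; 0 .. df.cols.length)
            if (i == idx)
                fn(df.cols[i]);
    }
}

void main()
{
    Homogeneous h;
    h.cols[1] = [1.0, 2.0];
    applyToColumn!(c => assert(c.length == 2))(h, 1);

    Mixed m;
    m.cols[2] = ["a", "b", "c"];
    applyToColumn!(c => assert(c.length == 3))(m, 2);
}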
Thank you
Aug 08 2019
next sibling parent reply bioinfornatics <bioinfornatics fedoraproject.org> writes:
On Thursday, 8 August 2019 at 16:49:09 UTC, Prateeek Nayak wrote:
 -----------
 Update Time
 -----------

 Pardon me for the delay, my university just started and it has 
 been a busy first week. However I have some good news

 * Aggregate implementation is under review - The preliminary 
 implementation restricted the set of operations that aggregate 
 could do but then Mr. Wilson suggested there should be a way to 
 expand it's usability so we worked on a revamp which takes the 
 function you desire as input and operates them on row/column of 
 DataFrame
 * There is a new way set index using index operation
 * to_csv supports setting precision for floating point numbers 
 - this was a problem I knew existed but I hadn't addressed it 
 till now. Better late then never.
 * Homogeneous DataFrame don't use TypeTuple anymore
 * at overload coming soon


 --------------------
 What is to come next
 --------------------

 * The first few responses from the community were mostly 
 regarding bringing binary file I/O support because of their 
 lean size and fast read/write. I will explore more regarding 
 this.
 * Time Series is gaining importance with the rise of Machine 
 Learning. I would like to implement something along the lines 
 of time series functionality Pandas has.
 * Something you would line to see. I am open to suggestions 
 (^_^)

 --------------
 Problems faced
 --------------

 There remains a small implementation detail that remains - a 
 dispatch function. Given non-homogeneous cases still require 
 traversal to a column, a function to apply an alias statically 
 or non-statically depending on the DataFrame is under 
 discussion.
 This will reduce code redundancy however my preliminary 
 attempts to tackle this have ended in failure. I will try to 
 finish it by the weekend. If I cannot solve it by then, I will 
 seek your help in the Learn section (^_^)
 Thank you
Dear D community,

Thanks, Prateek Nayak, for your work.
As I currently work with pandas (Python DataFrames), there is an extra part of it that I appreciate a lot: the IO tools.

* SQL
Methods: read_sql and to_sql
Description: these allow reading from and saving to a database. Combined with SQLAlchemy, they are awesome.

* Parquet
Methods: read_parquet and to_parquet
Description: in Big Data environments, Parquet is a commonly used file format.

These abilities made pandas and its DataFrame API a core library to have. Used like this, it standardizes the data structures used in our applications while offering a rich statistics API at the same time.

This matters for code maintainability. From the FAIR data point of view, an application is a set of input data + program features = result, so putting the data structures first when thinking about how to develop an application is important. The application becomes more robust and flexible when we can handle multiple input data file formats.

I hope to see such features in D.

Best regards

Source:
- https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_sql.html
- https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_sql.html
- https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#parquet
Aug 09 2019
parent reply Prateek Nayak <lelouch.cpp gmail.com> writes:
On Friday, 9 August 2019 at 08:08:39 UTC, bioinfornatics wrote:
 Dear D community,

 Thanks, Prateeek Nayak for your works.
 As currently, I am working with pandas (python, dataframe ...) 
 . They are an extra feature that I appreciate a lot, it is the 
 IO tool part:

 * SQL
 method: read_sql and to_sql
 Description: which allow to read and save from a DataBase. 
 These methods combined with SqlAlchemy are awesome.

 * Parquet
 method: read_parquet and to_parquet
 Description: In BigData environment Parquet is a file format 
 often used

 These abilities made Panda and its Dataframe API a core library 
 to have. Using like this, allow standardizing data structured 
 used into our application and in same time offer rich 
 statistics API.

 Indeed it is important for tho code maintainability. And the 
 FairData point that an application is a set of input data + 
 program's feature = result. Thus put data structured as the 
 first component to think how to develop an application is 
 important.
 The application is more robust and flexible as we can handle 
 multiple input data file format.

 I hope to see such features in D.


 Best regards

 Source:
 - 
 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_sql.html
 - 
 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_sql.html
 - 
 https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#parquet
I was looking into Parquet, and it even came up in the reddit post I had linked to earlier on - the smaller file size and better I/O make it really good for industrial use. A quick search on DUB didn't give any result for a parser, so I'll probably work on a library for working with Parquet files.
I looked into Cap'n Proto too - it looks promising, but it's missing from the Pandas I/O section, which was disappointing.
Thanks for mentioning SQL. I will start working on these features soon.
 Again, thank you so much for working on this!
 We will be excited to put Magpie through its paces in our lab, 
 but it is missing* a few key (really, basic IMO) features we 
 make heavy use of in pandas.
 * I have read the README and glanced at code but not used 
 Magpie yet, so if I am > wrong about below please correct me!
 Since you are soliciting ideas:
 1. Selecting/indexing into data with boolean vectors. e.g:
 df[df.A > 30 && df.B != "ignore"]
 1a. This really means returning a boolean vector for df.COL 
 <op> <operand>
 1b. ...and being able to subset data by a bool vector
 2. We make heavy use of "pivot" functionality.
 Kind regards
I was thinking of the same feature as 1 - a filter-like function for DataFrame and Group - and I'm looking into possible ways to implement it (a rough sketch of the idea is below).
I'm really embarrassed to admit I never even thought about pivot. It looks like a beautiful feature to have - it will definitely be added to Magpie soon (possibly over the next couple of weeks; I'm a bit tied down right now with the commencement of university academics, but it will definitely come soon).
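A rough sketch of what I mean, with toy columns instead of a real DataFrame (illustrative only, not Magpie's API):

import std.algorithm : filter, map;
import std.array : array;
import std.range : iota, zip;

void main()
{
    int[]    A = [10, 42, 55, 7];
    string[] B = ["keep", "ignore", "keep", "keep"];

    // 1a: column <op> operand produces a bool vector
    bool[] mask = zip(A, B).map!(t => t[0] > 30 && t[1] != "ignore").array;

    // 1b: keep only the rows where the mask is true
    auto rows = iota(A.length).filter!(i => mask[i]).array;

    assert(rows.length == 1 && rows[0] == 2); // only row 2 satisfies both conditions
}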
Aug 09 2019
parent reply Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
On Saturday, 10 August 2019 at 04:10:31 UTC, Prateek Nayak wrote:
 I was looking into Parquet and it even came up in the reddit 
 post i had linked to earlier on - smaller file size and better 
 I/O makes it really good for industrial use.
 A quick search on DUB didn't give any result for a parser so 
 I'll probably work on a library to work with Parquet files.
 I looked into Cap'n Proto too - it looks promising but its 
 missing from Pandas I/O section which was disappointing.
 Thanks for mentioning SQL. I will start working on these 
 features soon.
It's clearly important that your project supports the same data exchange formats as pandas, but it doesn't seem inherently a problem to support other formats as well, assuming you have the time and inclination to do so.
Aug 10 2019
parent Prateek Nayak <lelouch.cpp gmail.com> writes:
On Saturday, 10 August 2019 at 12:38:19 UTC, Joseph Rushton 
Wakeling wrote:
 It's clearly important that your project supports the same data 
 exchange formats as pandas, but it doesn't seem inherently a 
 problem to support other formats as well, assuming you have the 
 time and inclination to do so.
It is never an inherent problem to support a new file format, but the initial comments from the community were mainly about easy interop with Python. That is why I was thinking of Parquet support.
Cap'n Proto is great, and I'd love to implement Cap'n Proto I/O sooner or later, but Parquet seems to have a heavier presence due to the popularity of Pandas, so I decided to look into it first.
Aug 10 2019
prev sibling parent James Blachly <james.blachly gmail.com> writes:
On 8/8/19 12:49 PM, Prateeek Nayak wrote:
 --------------------
 What is to come next
 --------------------
 
 * The first few responses from the community were mostly regarding 
 bringing binary file I/O support because of their lean size and fast 
 read/write. I will explore more regarding this.
 * Time Series is gaining importance with the rise of Machine Learning. I 
 would like to implement something along the lines of time series 
 functionality Pandas has.
 * Something you would line to see. I am open to suggestions (^_^)
Again, thank you so much for working on this! We will be excited to put Magpie through its paces in our lab, but it is missing* a few key (really, basic IMO) features we make heavy use of in pandas.

* I have read the README and glanced at the code but have not used Magpie yet, so if I am wrong about the below please correct me!

Since you are soliciting ideas:

1. Selecting/indexing into data with boolean vectors, e.g.:
df[df.A > 30 && df.B != "ignore"]
1a. This really means returning a boolean vector for df.COL <op> <operand>
1b. ...and being able to subset the data by a bool vector
2. We make heavy use of "pivot" functionality.

Kind regards
Aug 09 2019