digitalmars.D.learn - Does something like std.algorithm.iteration:splitter with multiple
- ParticlePeter (13/13) Mar 23 2016 I need to parse an ascii with multiple tokens. The tokens can be
- ParticlePeter (5/9) Mar 23 2016 file
- Andrea Fontana (2/12) Mar 23 2016 Any input => output example?
- ParticlePeter (21/22) Mar 23 2016 Sure, it is ensight gold case file format:
- Simen Kjaeraas (57/70) Mar 23 2016 Without a bit more detail, it's a bit hard to help.
- ParticlePeter (12/69) Mar 23 2016 Thanks Simen,
- Simen Kjaeraas (23/34) Mar 23 2016 My pleasure. :) Testing it on your example data shows it to work
- wobbles (39/53) Mar 23 2016 This isn't tested, but this is my first thought:
- ParticlePeter (36/38) Mar 27 2016 Thanks Wobbles, I took your approach. There were some minor
- wobbles (2/7) Mar 28 2016 Great, thanks for fixing it up!
I need to parse an ascii with multiple tokens. The tokens can be seen as keys. After every token there is a bunch of lines belonging to that token, the values. The order of tokens is unknown. I would like to read the file in as a whole string, and split the string with: splitter(fileString, [token1, token2, ... tokenN]); And would like to get a range of strings each starting with tokenX and ending before the next token. Does something like this exist? I know how to parse the string line by line and create new strings and append the appropriate lines, but I don't know how to do this with a lazy result range and new allocations.
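To illustrate the line-by-line version I already know how to write (eager, allocating; splitByTokens is just an illustrative name, not an existing function), a rough sketch:

```d
import std.algorithm.searching : canFind;
import std.string : lineSplitter;

// Eager line-by-line grouping: every token line starts a new block,
// every other line is appended to the current block. This allocates
// a new array per block, which is what I'd like to avoid.
string[][] splitByTokens(string fileString, string[] tokens)
{
    string[][] blocks;
    foreach (line; fileString.lineSplitter)
    {
        if (tokens.canFind(line))
            blocks ~= [line];      // token line: start a new block
        else if (blocks.length)
            blocks[$ - 1] ~= line; // value line: belongs to current block
    }
    return blocks;
}
```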
Mar 23 2016
On Wednesday, 23 March 2016 at 11:57:49 UTC, ParticlePeter wrote:

Stupid typos:

> I need to parse an ascii [file] with multiple tokens.
> ...
> ...to do this with a lazy result range and [without] new allocations.
Mar 23 2016
On Wednesday, 23 March 2016 at 12:00:15 UTC, ParticlePeter wrote:
> On Wednesday, 23 March 2016 at 11:57:49 UTC, ParticlePeter wrote:
> Stupid typos:
> "I need to parse an ascii [file] with multiple tokens."
> "...to do this with a lazy result range and [without] new allocations."

Any input => output example?
Mar 23 2016
On Wednesday, 23 March 2016 at 14:20:12 UTC, Andrea Fontana wrote:
> Any input => output example?

Sure, it is the ensight gold case file format:

    FORMAT
    type: ensight gold

    GEOMETRY
    model: 1 exgold2.geo**

    VARIABLE
    scalar per node: 1 Stress exgold2.scl**
    vector per node: 1 Displacement exgold2.dis**

    TIME
    time set: 1
    number of steps: 3
    filename start number: 0
    filename increment: 1
    time values: 1.0 2.0 3.0

The separators would be ["FORMAT", "TIME", "VARIABLE", "GEOMETRY"]. The blank lines between the blocks and the order of the separators in the file are not known. I would expect a range of four ranges of lines: one for each text-block above.
Mar 23 2016
On Wednesday, 23 March 2016 at 11:57:49 UTC, ParticlePeter wrote:
> I need to parse an ascii with multiple tokens. The tokens can be seen
> as keys. [...] Does something like this exist?

Without a bit more detail, it's a bit hard to help.

std.algorithm.splitter has an overload that takes a function instead of a separator:

    import std.algorithm;
    auto a = "a,b;c";
    auto b = a.splitter!(e => e == ';' || e == ',');
    assert(equal(b, ["a", "b", "c"]));

However, not only are the separators lost in the process, it only allows single-element separators. This might be good enough given the information you've divulged, but I'll hazard a guess it isn't.

My next stop is std.algorithm.chunkBy:

    auto a = ["a", "b", "c", "d", "e"];
    auto b = a.chunkBy!(e => e == "a" || e == "d");
    auto result = [
        tuple(true,  ["a"]),
        tuple(false, ["b", "c"]),
        tuple(true,  ["d"]),
        tuple(false, ["e"])
    ];

No assert here, since the ranges in the tuples are not arrays. My immediate concern is that two consecutive tokens with no intervening values will mess it up. Also, the result looks a bit messy.

A little more involved, and according to documentation not guaranteed to work:

    bool isToken(string s) {
        return s == "a" || s == "d";
    }

    bool tokenCounter(string s) {
        static string oldToken;
        static bool counter = true;
        if (s.isToken && s != oldToken) {
            oldToken = s;
            counter = !counter;
        }
        return counter;
    }

    unittest {
        import std.algorithm;
        import std.stdio;
        import std.typecons;
        import std.array;

        auto a = ["a", "b", "c", "d", "e", "a", "d"];
        auto b = a.chunkBy!tokenCounter.map!(e => e[1]);
        auto result = [
            ["a", "b", "c"],
            ["d", "e"],
            ["a"],
            ["d"]
        ];
        writeln(b);
        writeln(result);
    }

Again no assert, but b and result have basically the same contents. Also handles consecutive tokens neatly (but consecutive identical tokens will be grouped together).

Hope this helps.

--
  Simen
Mar 23 2016
On Wednesday, 23 March 2016 at 15:23:38 UTC, Simen Kjaeraas wrote:
> [...] Again no assert, but b and result have basically the same
> contents. Also handles consecutive tokens neatly (but consecutive
> identical tokens will be grouped together).
>
> Hope this helps.

Thanks Simen, your tokenCounter is inspirational; for the rest I'll take some time for testing. But some additional thoughts from my side: I get all the lines of the file into one range. Calling array on it should give me an array, but how would I use find to get an index into this array? With the indices I could slice the array up into four slices, no allocation required. If there is no easy way to get just an index instead of a range, I would try to use something like the tokenCounter to find all the indices.
Mar 23 2016
On Wednesday, 23 March 2016 at 18:10:05 UTC, ParticlePeter wrote:
> Thanks Simen, your tokenCounter is inspirational, for the rest I'll
> take some time for testing.

My pleasure. :) Testing it on your example data shows it to work there. However, as stated above, the documentation says it's undefined, so future changes (even optimizations and bugfixes) to Phobos could make it stop working:

"This predicate must be an equivalence relation, that is, it must be reflexive (pred(x,x) is always true), symmetric (pred(x,y) == pred(y,x)), and transitive (pred(x,y) && pred(y,z) implies pred(x,z)). If this is not the case, the range returned by chunkBy may assert at runtime or behave erratically."

> But some additional thoughts from my side: I get all the lines of the
> file into one range. Calling array on it should give me an array, but
> how would I use find to get an index into this array? With the indices
> I could slice the array up into four slices, no allocation required.

The chunkBy example should not allocate. chunkBy itself is lazy, as are its sub-ranges. No copying of string contents is performed. So unless you have very specific reasons to use slicing, I don't see why chunkBy shouldn't be good enough.

Full disclosure: There is a malloc call in RefCounted, which is used for optimization purposes when chunkBy is called on a forward range. When chunkBy is called on an array, that's a 6-word allocation (24 bytes on 32-bit, 48 bytes on 64-bit), happening once. There are no other dependencies that allocate. Such is the beauty of D. :)

--
  Simen
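If you really do want plain indices rather than sub-ranges, std.algorithm's countUntil returns one. A minimal sketch (not from the original thread; the token list is assumed from the example data):

```d
import std.algorithm.searching : countUntil;
import std.algorithm.comparison : among;

void main()
{
    auto lines = ["FORMAT", "type: ensight gold",
                  "GEOMETRY", "model: 1 exgold2.geo"];

    // Index of the next token line after the first line; countUntil
    // returns -1 when no further token exists.
    auto idx = lines[1 .. $]
        .countUntil!(l => l.among("FORMAT", "GEOMETRY", "VARIABLE", "TIME") != 0);

    // Slice off the first block; a slice allocates nothing.
    auto firstBlock = (idx == -1) ? lines : lines[0 .. idx + 1];
    // firstBlock is ["FORMAT", "type: ensight gold"]
}
```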
Mar 23 2016
On Wednesday, 23 March 2016 at 11:57:49 UTC, ParticlePeter wrote:
> I need to parse an ascii with multiple tokens. The tokens can be seen
> as keys. [...] Does something like this exist?

This isn't tested, but this is my first thought:

    import std.stdio     : writefln;
    import std.algorithm : canFind;
    import std.string    : indexOf;

    void main() {
        string testString = "this:is:a-test;";
        foreach(str; testString.multiSlice([":", "-", ";"]))
            writefln("Got: %s", str);
    }

    auto multiSlice(string str, string[] delims) {
        struct MultiSliceRange {
            string m_str;
            string[] m_delims;

            bool empty() { return m_str.length == 0; }

            void popFront() {
                auto idx = findNextIndex;
                m_str = m_str[idx..$];
            }

            string front() {
                auto idx = findNextIndex;
                return m_str[0..idx];
            }

            private long findNextIndex() {
                long foundIndex = -1;
                foreach(delim; m_delims) {
                    if(m_str.canFind(delim)) {
                        if(foundIndex == -1 ||
                           (m_str.indexOf(delim) < foundIndex && m_str.indexOf(delim) >= 0)) {
                            foundIndex = m_str.indexOf(delim);
                        }
                    }
                }
                return foundIndex;
            }
        }
        return MultiSliceRange(str, delims);
    }

Again, totally untested, but I think logically it should work. (No D compiler on this machine so it mightn't even compile :] )
Mar 23 2016
On Wednesday, 23 March 2016 at 20:00:55 UTC, wobbles wrote:
> Again, totally untested, but I think logically it should work. (No D
> compiler on this machine so it mightn't even compile :] )

Thanks Wobbles, I took your approach. There were some minor issues, here is a working version:

    auto multiSlice(string data, string[] delims)
    {
        import std.algorithm : canFind;
        import std.string : indexOf;

        struct MultiSliceRange
        {
            string m_str;
            string[] m_delims;

            bool empty() { return m_str.length == 0; }

            void popFront() {
                auto idx = findNextIndex;
                m_str = m_str[idx..$];
            }

            string front() {
                auto idx = findNextIndex;
                return m_str[0..idx];
            }

            private size_t findNextIndex()
            {
                auto index = size_t.max;
                foreach(delim; m_delims)
                {
                    if(m_str.canFind(delim))
                    {
                        auto foundIndex = m_str.indexOf(delim);
                        if(index > foundIndex && foundIndex > 0)
                        {
                            index = foundIndex;
                        }
                    }
                }
                // No further delimiter beyond position 0: the rest of
                // the string is the last chunk.
                return index == size_t.max ? m_str.length : index;
            }
        }
        return MultiSliceRange(data, delims);
    }
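A quick usage sketch against the example data from earlier in the thread (only the first chunk is inspected here; the data string is an abbreviated assumption, not the full case file):

```d
void main()
{
    import std.stdio : writeln;

    auto data = "FORMAT\ntype: ensight gold\nGEOMETRY\nmodel: 1 exgold2.geo\n";
    auto blocks = data.multiSlice(["FORMAT", "GEOMETRY", "VARIABLE", "TIME"]);

    // front is everything from the leading token up to (but not
    // including) the next token, i.e. the FORMAT block.
    writeln(blocks.front);
}
```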
Mar 27 2016
On Sunday, 27 March 2016 at 07:45:00 UTC, ParticlePeter wrote:
> On Wednesday, 23 March 2016 at 20:00:55 UTC, wobbles wrote:
>> [...]
>
> Thanks Wobbles, I took your approach. There were some minor issues,
> here is a working version: [...]

Great, thanks for fixing it up!
Mar 28 2016