digitalmars.D - Python-like slicing and handling UTF-8 strings as a bonus


FG <home fgda.pl> writes:

Slices are great but not really what I had expected, coming from Python.
I've seen code like s[a..$-b] used without checking the values, just to end up 
with a Range violation. But there are 3 constraints to check here:
	a >= 0 && a + b <= s.length && b >= 0

That's way too much coding for a simple program/script that shortens a string, 
before it prints it on a screen. If I can't write s[0..80] without fear, then 
let there at least be a function that does it like Python would.

Additionally, as strings are UTF-8-encoded, I'd like such a function to give me 
proper substrings, without multibyte characters cut in the middle, where 
s[0..80] would mean 80 characters on the screen and not 80 bytes.

I would envision it being part of std.string eventually.
Forgive me if such a function already exists -- I couldn't find it.
I also still don't speak D too well, so don't laugh. :)




import std.array, std.range, std.stdio;


auto getSlice(T)(T[] s, ptrdiff_t start, ptrdiff_t end = ptrdiff_t.max)
pure @safe
{
     bool start_from_back, end_from_back;
     size_t full_len = s.length;
     ptrdiff_t len;
     if (full_len > ptrdiff_t.max)
         len = ptrdiff_t.max;
     else len = cast(ptrdiff_t) full_len;
     if (end < 0)
     {
         end_from_back = true;
         end += len;
     }
     if (end > len) end = len;
     if (start < 0)
     {
         if (0 - start >= len)
             start = 0;
         else
         {
             start += len;
             start_from_back = true;
         }
     }
     if (start < 0) start = 0;
     if (start > end || start >= len || end <= 0)
         return s[0..0];

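     // Only UTF-8/UTF-16 code units need decoding; comparing the
     // immutable-qualified types ignores const/immutable on T.
     static if (is(immutable T == immutable char) ||
             is(immutable T == immutable wchar))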
     {
         ptrdiff_t real_start = -1, real_end = -1, loop, last_pos;
         if (!start_from_back || !end_from_back)
         {
             foreach (ptrdiff_t i, dchar c; s)
             {
                 if (!start_from_back && loop >= start && real_start < 0)
                     real_start = i;
                 if (!end_from_back && loop >= end && real_end < 0)
                     real_end = i;
                 if ((start_from_back || real_start > -1) &&
                         (end_from_back || real_end > -1 || end == len))
                     break;
                 loop++;
             }
         }
         start -= len;
         end -= len;
         loop = -1;
         if (start_from_back || end_from_back)
         {
             foreach_reverse (ptrdiff_t i, dchar c; s)
             {
                 if (start_from_back && loop <= start && real_start < 0)
                     real_start = i;
                 if (end_from_back && loop <= end && real_end < 0)
                     real_end = i;
                 if ((!start_from_back || real_start > -1) &&
                         (!end_from_back || real_end > -1))
                     break;
                 loop--;
             }
         }
         if (real_end < 0) real_end = (end_from_back ? 0 : len);
         if (real_start < 0) real_start = (start_from_back ? 0 : len);
         if (real_start > real_end) real_start = real_end = 0;
         return s[real_start..real_end];
     }
     else return s[start..end];
}

unittest {
     string s = "okrągły stół";
     dstring d = "okrągły stół"d;
     auto t = [0, 1, 2, 3, 4];
     assert(t.getSlice(0, -1) == [0, 1, 2, 3]);
     assert(t.getSlice(1, -2) == [1, 2]);
     assert(t.getSlice(-4, -2) == [1, 2]);
     assert(t.getSlice(-5, 7) == [0, 1, 2, 3, 4]);
     assert(s.getSlice(0, 0) == "");
     assert(s.getSlice(0, 1) == "o");
     assert(s.getSlice(0) == s);
     assert(s.getSlice(8) == "stół");
     assert(s.getSlice(8, -1) == "stó");
     assert(s.getSlice(8, -2) == "st");
     assert(s.getSlice(8, -4) == "");
     assert(s.getSlice(10, 11) == "ó");
     assert(s.getSlice(10, -1) == "ó");
     assert(s.getSlice(10, 12) == "ół");
     assert(s.getSlice(11, 12) == "ł");
     assert(s.getSlice(11, 15) == "ł");
     assert(d.getSlice(0, 0) == ""d);
     assert(d.getSlice(0, 1) == "o"d);
     assert(d.getSlice(0) == d);
     assert(d.getSlice(8) == "stół"d);
     assert(d.getSlice(8, -1) == "stó"d);
     assert(d.getSlice(8, -2) == "st"d);
     assert(d.getSlice(8, -4) == ""d);
     assert(d.getSlice(10, 11) == "ó"d);
     assert(d.getSlice(10, -1) == "ó"d);
     assert(d.getSlice(10, 12) == "ół"d);
     assert(d.getSlice(11, 12) == "ł"d);
     assert(d.getSlice(11, 15) == "ł"d);
}

Dec 29 2012

"Vladimir Panteleev" <vladimir thecybershadow.net> writes:

On Saturday, 29 December 2012 at 22:25:35 UTC, FG wrote:
 Slices are great but not really what I had expected, coming 
 from Python.
 I've seen code like s[a..$-b] used without checking the values, 
 just to end up with a Range violation. But there are 3 
 constraints to check here:
 	a >= 0 && a + b <= s.length && b >= 0

 That's way too much coding for a simple program/script that 
 shortens a string, before it prints it on a screen. If I can't 
 write s[0..80] without fear, then let there at least be a 
 function that does it like Python would.

Why?

 Additionally, as strings are UTF-8-encoded, I'd like such a 
 function to give me proper substrings, without multibyte 
 characters cut in the middle, where s[0..80] would mean 80 
 characters on the screen and not 80 bytes.

This is a common fallacy when dealing with Unicode. Please see 
the linked point and the ones that follow it:

http://utf8everywhere.org/#myth.utf32.o1
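
To make the point concrete, here is a minimal sketch (not from the original post) that counts the same text three ways. It assumes std.range's walkLength and std.uni's byGrapheme; the sample string is made up for illustration.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

unittest
{
    // "é" written as 'e' followed by U+0301 COMBINING ACUTE ACCENT.
    string s = "e\u0301";

    assert(s.length == 3);                 // UTF-8 code units (bytes)
    assert(s.walkLength == 2);             // code points
    assert(s.byGrapheme.walkLength == 1);  // user-perceived characters
}
```

Even the grapheme count is only an approximation of "characters on the screen", since some characters occupy more than one column.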

Dec 29 2012

FG <home fgda.pl> writes:

On 2012-12-29 23:35, Vladimir Panteleev wrote:
 On Saturday, 29 December 2012 at 22:25:35 UTC, FG wrote:
 Slices are great but not really what I had expected, coming from Python.
 I've seen code like s[a..$-b] used without checking the values, just to end up
 with a Range violation. But there are 3 constraints to check here:
     a >= 0 && a + b <= s.length && b >= 0

 That's way too much coding for a simple program/script that shortens a string,
 before it prints it on a screen. If I can't write s[0..80] without fear, then
 let there at least be a function that does it like Python would.

 Why?

Probably because I like concise code. I always prefer:
     if (A) print(getMessage().getSlice(0, 100));

to writing something like this:
     auto message = getMessage();
     if (A) print(message.length > 100 ? message[0..100] : message);


 Additionally, as strings are UTF-8-encoded, I'd like such a function to give
 me proper substrings, without multibyte characters cut in the middle, where
 s[0..80] would mean 80 characters on the screen and not 80 bytes.

 This is a common fallacy when dealing with Unicode. Please see the linked and
 the following points:

 http://utf8everywhere.org/#myth.utf32.o1

True. I didn't think about all the languages out there.
Just some common European ones.

Dec 29 2012

FG <home fgda.pl> writes:

On 2012-12-29 23:55, FG wrote:
 Probably because I like concise code. I always prefer:
      if (A) print(getMessage().getSlice(0, 100));

 to writing something like this:
      auto message = getMessage();
      if (A) print(message.length > 100 ? message[0..100] : message);

Actually, when I look at this, it can be a one-liner after all. :)

     if (A) print(getMessage()[0..($>100?100:$)]);

Didn't expect this to work.

Dec 29 2012

"bearophile" <bearophileHUGS lycos.com> writes:

FG:

 to writing something like this:
     auto message = getMessage();
     if (A) print(message.length > 100 ? message[0..100] : message);

In std.algorithm there is min(), that helps a little:

if (A)
     print(message[0 .. min($, 100)]);
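
The same idea can be wrapped so that both bounds are clamped. This is only a sketch; clampedSlice is a hypothetical name, not something in Phobos, and for strings the bounds are still code units.

```d
import std.algorithm : min;

/// Hypothetical helper: slice like s[a .. b], but clamp both bounds
/// to the array instead of failing with a range violation.
T[] clampedSlice(T)(T[] s, size_t a, size_t b)
{
    immutable hi = min(b, s.length);  // upper bound never exceeds the length
    immutable lo = min(a, hi);        // lower bound never exceeds the upper
    return s[lo .. hi];
}

unittest
{
    auto t = [0, 1, 2, 3, 4];
    assert(clampedSlice(t, 1, 3) == [1, 2]);
    assert(clampedSlice(t, 3, 100) == [3, 4]);
    assert(clampedSlice(t, 7, 9).length == 0);
    // For string, the bounds are UTF-8 code units, not characters.
    assert(clampedSlice("stół", 0, 2) == "st");
}
```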

Bye,
bearophile

Dec 29 2012

"Peter Alexander" <peter.alexander.au gmail.com> writes:

On Saturday, 29 December 2012 at 22:25:35 UTC, FG wrote:
 Forgive me if such a function already exists -- I couldn't find 
 it.

std.range has drop and take, which work on code points, not code 
units. They also handle over-dropping or over-taking gracefully. 
For example:

string s = "okrągły stół";
writeln(s.drop(8).take(3)); // "stó"
writeln(s.drop(8).take(100)); // "stół"
writeln(s.drop(100).take(100)); // ""

http://dpaste.dzfl.pl/2f8ebf49

It doesn't support negative indexing.

Generally speaking though, the vast majority of user code should 
never need to index into a Unicode string.
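
If an actual string slice is needed rather than the lazy wrapper that take returns, one possible sketch is to let drop find the boundary and slice by the length difference. takeSlice is a hypothetical helper name introduced here for illustration.

```d
import std.range : drop;

/// Hypothetical helper: the first n code points of s, as a slice of s.
string takeSlice(string s, size_t n)
{
    // s.drop(n) is itself a slice of s (its tail), so the difference in
    // code-unit lengths is the offset where the n-th code point ends.
    return s[0 .. s.length - s.drop(n).length];
}

unittest
{
    string s = "okrągły stół";
    assert(s.drop(8).takeSlice(3) == "stó");
    assert(s.drop(8).takeSlice(100) == "stół");
    assert(s.drop(100).takeSlice(100) == "");
}
```

No allocation happens here; the result shares memory with the original string.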

Dec 29 2012

"bearophile" <bearophileHUGS lycos.com> writes:

Peter Alexander:

 Generally speaking though, the vast majority of user code 
 should never need to index into a Unicode string.

Right, 90% of the code doesn't need to slice strings (and 
generally strings are Unicode). But the other 90% of the code 
needs to slice things...

Bye,
bearophile

Dec 29 2012

FG <home fgda.pl> writes:

On 2012-12-30 00:03, Peter Alexander wrote:
 On Saturday, 29 December 2012 at 22:25:35 UTC, FG wrote:
 Forgive me if such a function already exists -- I couldn't find it.

 std.range has drop and take, which work on code points, not code units. They
 also handle over-dropping or over-taking gracefully. For example:

 string s = "okrągły stół";
 writeln(s.drop(8).take(3)); // "stó"
 writeln(s.drop(8).take(100)); // "stół"
 writeln(s.drop(100).take(100)); // ""

Ah, so this is the way of doing it. Thanks.


 It doesn't support negative indexing.

At least dropping off the back is also possible, e.g. for s[2..$-5]:

     writeln(s.retro.drop(5).retro.drop(2)); // "rągły"

     (or with dropBack, without retro, if available)

I have no idea how to do s[$-4..$-2] though.

Dec 29 2012

"monarch_dodra" <monarchdodra gmail.com> writes:

On Sunday, 30 December 2012 at 00:02:17 UTC, FG wrote:
 On 2012-12-30 00:03, Peter Alexander wrote:
 On Saturday, 29 December 2012 at 22:25:35 UTC, FG wrote:
 Forgive me if such a function already exists -- I couldn't 
 find it.

 std.range has drop and take, which work on code points, not 
 code units. They also handle over-dropping or over-taking 
 gracefully. For example:

 string s = "okrągły stół";
 writeln(s.drop(8).take(3)); // "stó"
 writeln(s.drop(8).take(100)); // "stół"
 writeln(s.drop(100).take(100)); // ""

 Ah, so this is the way of doing it. Thanks.


 It doesn't support negative indexing.

 At least dropping off the back is also possible s[2..$-5]:

     writeln(s.retro.drop(5).retro.drop(2)); // "rągły"

     (or with dropBack, without retro, if available)

dropBack is available IFF retro is available. (AFAIK)

 I have no idea how to do s[$-4..$-2] though.

But as a general rule, making a range out of the first (or last) 
elements of a non-random-access (non-RA) range is awkward, because 
ranges can "only shrink". Strings are a special case: as ranges they 
are non-RA and non-sliceable, yet as arrays you can still index and 
slice them (by code unit)...

Anyways, you can always get creative with length:

//----
string s = "hello world";
auto sub = s[s.dropBack(4).length .. s.dropBack(2).length]; // "or"
//----

In this particular example, it is a bit suboptimal, but quite 
frankly, I'd assume readability trumps performance for this kind 
of code (and is what I'd use in my end code).

One last thing: keep in mind "drop/take" are linear operations. 
If you are handling Unicode, then everything is linear anyways, 
so I'm not saying these functions are slow or anything, just 
don't forget they aren't the O(1) functions you'd get with ASCII.
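
Putting the pieces of this thread together, a Python-style slice over code points could be sketched roughly as below. pySlice is a hypothetical name, the negative-index and clamping rules are my reading of Python's behaviour, and it counts code points only, so combining sequences can still be split.

```d
import std.range : drop, walkLength;

/// Hypothetical sketch of Python's s[start:end], counted in code points.
string pySlice(string s, ptrdiff_t start, ptrdiff_t end = ptrdiff_t.max)
{
    immutable n = cast(ptrdiff_t) s.walkLength;  // code points, linear time

    // Negative indices count from the back; then clamp to [0, n].
    ptrdiff_t norm(ptrdiff_t i)
    {
        if (i < 0) i += n;
        return i < 0 ? 0 : (i > n ? n : i);
    }

    immutable a = norm(start), b = norm(end);
    if (a >= b) return s[0 .. 0];

    auto tail = s.drop(a);         // starts at code point a
    auto rest = tail.drop(b - a);  // starts at code point b
    return tail[0 .. tail.length - rest.length];
}

unittest
{
    string s = "okrągły stół";
    assert(s.pySlice(0, 1) == "o");
    assert(s.pySlice(8) == "stół");
    assert(s.pySlice(8, -1) == "stó");
    assert(s.pySlice(-4, -2) == "st");
    assert(s.pySlice(11, 15) == "ł");
}
```

This walks the string up to twice, which fits the linear-cost caveat above; it trades speed for brevity.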

Dec 30 2012
