digitalmars.D - Python-like slicing and handling UTF-8 strings as a bonus

reply FG <home fgda.pl> writes:
Slices are great but not really what I had expected, coming from Python.
I've seen code like s[a..$-b] used without checking the values, just to end up 
with a Range violation. But there are 3 constraints to check here:
	a >= 0 && a + b <= s.length && b >= 0

That's way too much coding for a simple program/script that shortens a string, 
before it prints it on a screen. If I can't write s[0..80] without fear, then 
let there at least be a function that does it like Python would.

Additionally, as strings are UTF-8-encoded, I'd like such a function to give me 
proper substrings, without multibyte characters cut in the middle, where 
s[0..80] would mean 80 characters on the screen and not 80 bytes.

I would envision it being part of std.string eventually.
Forgive me if such a function already exists -- I couldn't find it.
I also still don't speak D too well, so don't laugh. :)




import std.array, std.range, std.stdio;


auto getSlice(T)(T[] s, ptrdiff_t start, ptrdiff_t end = ptrdiff_t.max)
pure @safe
{
     bool start_from_back, end_from_back;
     size_t full_len = s.length;
     ptrdiff_t len;
     if (full_len > ptrdiff_t.max)
         len = ptrdiff_t.max;
     else len = cast(ptrdiff_t) full_len;
     if (end < 0)
     {
         end_from_back = true;
         end += len;
     }
     if (end > len) end = len;
     if (start < 0)
     {
         if (0 - start >= len)
             start = 0;
         else
         {
             start += len;
             start_from_back = true;
         }
     }
     if (start < 0) start = 0;
     if (start > end || start >= len || end <= 0)
         return s[0..0];

     static if(is(T == char) || is(T == immutable(char)) ||
             is(T : wchar) || is(T : immutable(wchar)))
     {
         ptrdiff_t real_start = -1, real_end = -1, loop, last_pos;
         if (!start_from_back || !end_from_back)
         {
             foreach (ptrdiff_t i, dchar c; s)
             {
                 if (!start_from_back && loop >= start && real_start < 0)
                     real_start = i;
                 if (!end_from_back && loop >= end && real_end < 0)
                     real_end = i;
                 if ((start_from_back || real_start > -1) &&
                         (end_from_back || real_end > -1 || end == len))
                     break;
                 loop++;
             }
         }
         start -= len;
         end -= len;
         loop = -1;
         if (start_from_back || end_from_back)
         {
             foreach_reverse (ptrdiff_t i, dchar c; s)
             {
                 if (start_from_back && loop <= start && real_start < 0)
                     real_start = i;
                 if (end_from_back && loop <= end && real_end < 0)
                     real_end = i;
                 if ((!start_from_back || real_start > -1) &&
                         (!end_from_back || real_end > -1))
                     break;
                 loop--;
             }
         }
         if (real_end < 0) real_end = (end_from_back ? 0 : len);
         if (real_start < 0) real_start = (start_from_back ? 0 : len);
         if (real_start > real_end) real_start = real_end = 0;
         return s[real_start..real_end];
     }
     else return s[start..end];
}

unittest {
     string s = "okrągły stół";
     dstring d = "okrągły stół"d;
     auto t = [0, 1, 2, 3, 4];
     assert(t.getSlice(0, -1) == [0, 1, 2, 3]);
     assert(t.getSlice(1, -2) == [1, 2]);
     assert(t.getSlice(-4, -2) == [1, 2]);
     assert(t.getSlice(-5, 7) == [0, 1, 2, 3, 4]);
     assert(s.getSlice(0, 0) == "");
     assert(s.getSlice(0, 1) == "o");
     assert(s.getSlice(0) == s);
     assert(s.getSlice(8) == "stół");
     assert(s.getSlice(8, -1) == "stó");
     assert(s.getSlice(8, -2) == "st");
     assert(s.getSlice(8, -4) == "");
     assert(s.getSlice(10, 11) == "ó");
     assert(s.getSlice(10, -1) == "ó");
     assert(s.getSlice(10, 12) == "ół");
     assert(s.getSlice(11, 12) == "ł");
     assert(s.getSlice(11, 15) == "ł");
     assert(d.getSlice(0, 0) == ""d);
     assert(d.getSlice(0, 1) == "o"d);
     assert(d.getSlice(0) == d);
     assert(d.getSlice(8) == "stół"d);
     assert(d.getSlice(8, -1) == "stó"d);
     assert(d.getSlice(8, -2) == "st"d);
     assert(d.getSlice(8, -4) == ""d);
     assert(d.getSlice(10, 11) == "ó"d);
     assert(d.getSlice(10, -1) == "ó"d);
     assert(d.getSlice(10, 12) == "ół"d);
     assert(d.getSlice(11, 12) == "ł"d);
     assert(d.getSlice(11, 15) == "ł"d);
}
Dec 29 2012
next sibling parent reply "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Saturday, 29 December 2012 at 22:25:35 UTC, FG wrote:
 Slices are great but not really what I had expected, coming 
 from Python.
 I've seen code like s[a..$-b] used without checking the values, 
 just to end up with a Range violation. But there are 3 
 constraints to check here:
 	a >= 0 && a + b <= s.length && b >= 0

 That's way too much coding for a simple program/script that 
 shortens a string, before it prints it on a screen. If I can't 
 write s[0..80] without fear, then let there at least be a 
 function that does it like Python would.

Why?

 Additionally, as strings are UTF-8-encoded, I'd like such a 
 function to give me proper substrings, without multibyte 
 characters cut in the middle, where s[0..80] would mean 80 
 characters on the screen and not 80 bytes.

This is a common fallacy when dealing with Unicode. Please see the linked and the following points: http://utf8everywhere.org/#myth.utf32.o1
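
To make the point concrete, here is a minimal sketch (not from the linked page) of how the three possible "lengths" of a D string can disagree. It assumes a Phobos version that provides std.uni.byGrapheme; the sample string is made up for illustration. Even the grapheme count is only an approximation of screen width, since some characters occupy two columns and others none.

```d
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;

void main()
{
    // "e" followed by U+0301 COMBINING ACUTE ACCENT: displayed as one "é".
    string s = "e\u0301";

    writeln(s.length);                // UTF-8 code units (bytes): 3
    writeln(s.walkLength);            // code points: 2
    writeln(s.byGrapheme.walkLength); // graphemes: 1
}
```

So a slice counted in code points can still split what the user sees as one character.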
Dec 29 2012
parent reply FG <home fgda.pl> writes:
On 2012-12-29 23:35, Vladimir Panteleev wrote:
 On Saturday, 29 December 2012 at 22:25:35 UTC, FG wrote:
 Slices are great but not really what I had expected, coming from Python.
 I've seen code like s[a..$-b] used without checking the values, just to end up
 with a Range violation. But there are 3 constraints to check here:
     a >= 0 && a + b <= s.length && b >= 0

 That's way too much coding for a simple program/script that shortens a string,
 before it prints it on a screen. If I can't write s[0..80] without fear, then
 let there at least be a function that does it like Python would.

 Why?

Probably because I like concise code. I always prefer:

     if (A) print(getMessage().getSlice(0..100));

to writing something like this:

     auto message = getMessage();
     if (A) print(message.length > 100 ? message[0..100] : message);

 Additionally, as strings are UTF-8-encoded, I'd like such a function to give
 me proper substrings, without multibyte characters cut in the middle, where
 s[0..80] would mean 80 characters on the screen and not 80 bytes.

 This is a common fallacy when dealing with Unicode. Please see the linked and
 the following points: http://utf8everywhere.org/#myth.utf32.o1

True. I didn't think about all the languages out there. Just some common European ones.
Dec 29 2012
parent FG <home fgda.pl> writes:
On 2012-12-29 23:55, FG wrote:
 Probably because I like concise code. I always prefer:
      if (A) print(getMessage().getSlice(0..100));

 to writing something like this:
      auto message = getMessage();
      if (A) print(message.length > 100 ? message[0..100] : message);

Actually, when I look at this, it can be a one-liner after all. :)

     if (A) print(getMessage()[0..($>100?100:$)]);

Didn't expect this to work.
Dec 29 2012
prev sibling next sibling parent reply "Peter Alexander" <peter.alexander.au gmail.com> writes:
On Saturday, 29 December 2012 at 22:25:35 UTC, FG wrote:
 Forgive me if such a function already exists -- I couldn't find 
 it.

std.range has drop and take, which work on code points, not code units. They also handle over-dropping or over-taking gracefully. For example:

     string s = "okrągły stół";
     writeln(s.drop(8).take(3));     // "stó"
     writeln(s.drop(8).take(100));   // "stół"
     writeln(s.drop(100).take(100)); // ""

http://dpaste.dzfl.pl/2f8ebf49

It doesn't support negative indexing. Generally speaking though, the vast majority of user code should never need to index into a Unicode string.
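
If a real string slice is wanted rather than a lazy range, one possible sketch is below. The helper name substr is hypothetical (it is not part of Phobos); it relies only on drop returning a string when given a string, so the byte offsets can be recovered from the lengths.

```d
import std.range : drop;
import std.stdio : writeln;

// Hypothetical helper, not a Phobos function: skip `start` code points,
// then keep at most `count` code points, clamping at the end of the string.
string substr(string s, size_t start, size_t count)
{
    auto rest = s.drop(start);    // still a string, cut on a code point boundary
    auto tail = rest.drop(count); // whatever lies beyond the requested window
    return rest[0 .. $ - tail.length];
}

void main()
{
    string s = "okrągły stół";
    writeln(substr(s, 8, 3));     // stó
    writeln(substr(s, 8, 100));   // stół
    writeln(substr(s, 100, 100)); // prints an empty line
}
```

The result is a slice of the original string, so nothing is copied, but both drop calls still walk the string linearly.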
Dec 29 2012
parent FG <home fgda.pl> writes:
On 2012-12-30 00:03, Peter Alexander wrote:
 On Saturday, 29 December 2012 at 22:25:35 UTC, FG wrote:
 Forgive me if such a function already exists -- I couldn't find it.

 std.range has drop and take, which work on code points, not code units. They
 also handle over-dropping or over-taking gracefully. For example:
      string s = "okrągły stół";
      writeln(s.drop(8).take(3));     // "stó"
      writeln(s.drop(8).take(100));   // "stół"
      writeln(s.drop(100).take(100)); // ""

Ah, so this is the way of doing it. Thanks.
 It doesn't support negative indexing.

At least dropping off the back is also possible, e.g. for s[2..$-5]:

     writeln(s.retro.drop(5).retro.drop(2)); // "rągły"

(or with dropBack, without retro, if available)

I have no idea how to do s[$-4..$-2] though.
Dec 29 2012
prev sibling next sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
FG:

 to writing something like this:
     auto message = getMessage();
     if (A) print(message.length > 100 ? message[0..100] : message);

In std.algorithm there is min(), which helps a little:

     if (A) print(message[0 .. min($, 100)]);

Bye,
bearophile
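
A small sketch of that clamp, plus a variant that avoids cutting a multibyte sequence in half. Both helper names are hypothetical and neither is from Phobos. Note that both count UTF-8 code units (bytes), not characters.

```d
import std.algorithm : min;
import std.stdio : writeln;

// Hypothetical helper: at most `n` code units from the front of `s`.
// May cut a multibyte sequence in the middle.
string firstUnits(string s, size_t n)
{
    return s[0 .. min($, n)];
}

// Hypothetical helper: like firstUnits, but backs off to the nearest
// code point boundary so the result is still valid UTF-8.
string firstUnitsSafe(string s, size_t n)
{
    if (n >= s.length) return s;
    // UTF-8 continuation bytes have the bit pattern 10xxxxxx.
    while (n > 0 && (s[n] & 0xC0) == 0x80) --n;
    return s[0 .. n];
}

void main()
{
    writeln(firstUnits("hello", 100));  // hello
    writeln(firstUnitsSafe("stół", 3)); // st
    writeln(firstUnitsSafe("stół", 4)); // stó
}
```

The safe variant only looks at a few bytes around the cut, so it stays O(1), unlike the code point based approaches.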
Dec 29 2012
prev sibling next sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
Peter Alexander:

 Generally speaking though, the vast majority of user code 
 should never need to index into a Unicode string.

Right, 90% of the code doesn't need to slice strings (and generally strings are Unicode). But the other 90% of the code needs to slice things...

Bye,
bearophile
Dec 29 2012
prev sibling parent "monarch_dodra" <monarchdodra gmail.com> writes:
On Sunday, 30 December 2012 at 00:02:17 UTC, FG wrote:
 On 2012-12-30 00:03, Peter Alexander wrote:
 On Saturday, 29 December 2012 at 22:25:35 UTC, FG wrote:
 Forgive me if such a function already exists -- I couldn't 
 find it.

 std.range has drop and take, which work on code points, not code units. They
 also handle over-dropping or over-taking gracefully. For example:
      string s = "okrągły stół";
      writeln(s.drop(8).take(3));     // "stó"
      writeln(s.drop(8).take(100));   // "stół"
      writeln(s.drop(100).take(100)); // ""

 Ah, so this is the way of doing it. Thanks.

 It doesn't support negative indexing.

 At least dropping off the back is also possible s[2..$-5]:
      writeln(s.retro.drop(5).retro.drop(2)); // "rągły"
 (or with dropBack, without retro, if available)

dropBack is available IFF retro is available. (AFAIK)
 I have no idea how to do s[$-4..$-2] though.

But as a general rule, the difficulty of making a range out of the first (or last) elements of a non-random-access (non-RA) range is a consequence of how ranges can "only shrink". Strings are a special case: a non-RA, non-sliceable range that you can nevertheless index and slice...

Anyways, you can always get creative with length:

//----
s = "hello world";
s[s.dropBack(4).length .. s.dropBack(2).length];
//----

In this particular example, it is a bit suboptimal, but quite frankly, I'd assume readability trumps performance for this kind of code (and it is what I'd use in my end code).

One last thing: keep in mind "drop/take" are linear operations. If you are handling Unicode, then everything is linear anyways, so I'm not saying these functions are slow or anything, just don't forget they aren't the O(1) functions you'd get with ASCII.
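
Wrapped into a function, the length trick might look like the sketch below. The name sliceFromBack and the choice to return an empty slice for inverted bounds (as Python does) are assumptions made for illustration, not an existing Phobos API.

```d
import std.range : dropBack;
import std.stdio : writeln;

// Hypothetical helper: roughly Python's s[-fromEnd : -toEnd], counted in
// code points from the back. toEnd == 0 means "up to the end".
string sliceFromBack(string s, size_t fromEnd, size_t toEnd)
{
    // dropBack on a string returns a string, so .length is the byte
    // offset at which the cut falls. Over-dropping yields an empty string.
    immutable lo = s.dropBack(fromEnd).length;
    immutable hi = s.dropBack(toEnd).length;
    if (lo > hi) return s[0 .. 0]; // inverted bounds: empty, like Python
    return s[lo .. hi];
}

void main()
{
    writeln(sliceFromBack("hello world", 4, 2));  // or
    writeln(sliceFromBack("okrągły stół", 4, 2)); // st
    writeln(sliceFromBack("stół", 100, 0));       // stół
}
```

Each call walks the string from the back twice, which matches the "suboptimal but readable" trade-off described above.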
Dec 30 2012