digitalmars.D.learn - String created from buffer has wrong length and strip() result is

reply "Lucas Burson" <ljdelight+dlang gmail.com> writes:
When creating a string from a ubyte[], I have an invalid length 
and string.strip() doesn't strip off all whitespace. I'm new to 
the language. Is this a compiler issue?


import std.string : strip;
import std.stdio  : writefln;

int main()
{
    const string ATA_STR = " ATA ";

    // this works fine
    {
       ubyte[] buffer = [' ', 'A', 'T', 'A', ' ' ];
       string test = strip(cast(string)(buffer));
       assert(test == strip(ATA_STR));
    }

    // This is where things break
    {
       ubyte[] buff = new ubyte[16];
       buff[0..ATA_STR.length] = cast(ubyte[])(ATA_STR);

       // read the string back from the buffer, stripping whitespace
       string stringFromBuffer = strip(cast(string)(buff[0..16]));
       // this shows strip() doesn't remove all whitespace
       writefln("StrFromBuff is '%s'; length %d",
                stringFromBuffer, stringFromBuffer.length);

       // !! FAILS. stringFromBuffer is length 15, not 3.
       assert(stringFromBuffer.length == strip(ATA_STR).length);

    }

    return 0;
}
Oct 16 2014
parent reply "thedeemon" <dlang thedeemon.com> writes:
Unlike C, strings in D are not zero-terminated by default; they are just arrays, i.e. a pair of pointer and length. You create an array of 16 bytes and cast it to string, so now you have a 16-char string. You fill the first few chars with data from ATA_STR, but the remaining 10 bytes of the array are still part of the string; nothing overwrites them, so they keep their initial zeroes. Since this tail of zeroes is not whitespace (tabs, spaces, etc.), 'strip' doesn't remove it.
Oct 17 2014
next sibling parent "thedeemon" <dlang thedeemon.com> writes:
Edit: you fill the first 5 chars and have 11 bytes of zeroes in the tail, not 10. My counting skill is too bad. ;)
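A minimal sketch of the byte layout described in this reply, reusing the buffer size and ATA_STR from the original post. The helper name zeroTailLength is made up for illustration; it is not a Phobos function.

```d
/// Hypothetical helper: counts the zero bytes at the end of a buffer.
size_t zeroTailLength(const(ubyte)[] buf)
{
    size_t n = 0;
    while (n < buf.length && buf[$ - 1 - n] == 0)
        ++n;
    return n;
}

unittest
{
    const string ATA_STR = " ATA ";
    ubyte[] buff = new ubyte[16];            // `new` zero-fills all 16 bytes
    buff[0 .. ATA_STR.length] = cast(const(ubyte)[]) ATA_STR;

    string s = cast(string) buff;            // reinterprets the same 16 bytes
    assert(s.length == 16);                  // length comes from the array, not from a '\0'
    assert(zeroTailLength(buff) == 11);      // 5 bytes of data, 11 zero bytes
}
```

Compiling with -unittest runs the checks; the cast never scans for a terminator, so the string keeps the full array length.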
Oct 17 2014
prev sibling parent reply spir via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:
Side note: since your string has those zeroes at the end, strip only removes the space at the start (hence the final length of 15) instead of at both ends.
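To see where the 15 comes from, here is a small sketch under the same assumptions as the original post (16-byte buffer, " ATA " copied to the front). The helper strippedLength is hypothetical and only wraps the cast plus strip.

```d
import std.string : strip;

/// Hypothetical helper: length of the buffer's text after strip().
size_t strippedLength(const(ubyte)[] raw)
{
    return strip(cast(const(char)[]) raw).length;
}

unittest
{
    ubyte[] buff = new ubyte[16];
    buff[0 .. 5] = cast(const(ubyte)[]) " ATA ";

    auto stripped = strip(cast(const(char)[]) buff);
    assert(stripped.length == 15);        // only the leading space is gone
    assert(stripped[0 .. 4] == "ATA ");   // the space before the zeroes survives
    assert(stripped[$ - 1] == '\0');      // the right end is a zero byte, not whitespace

    // Stripping just the 5 bytes that hold data gives the expected result.
    assert(strippedLength(buff[0 .. 5]) == 3);
}
```

strip stops at the first non-whitespace character from each end, and from the right that character is a zero byte, so nothing is removed on that side.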
Oct 17 2014
parent reply "Lucas Burson" <ljdelight+dlang gmail.com> writes:
Okay, things are becoming clearer. The cast to string is nothing like the C++ string ctor; I made a bad assumption.

So given the buffer below, would I use fromStringz (is this in the stdlib?) to convert it from a null-terminated buffer to a good string? Shouldn't the compiler give a warning about casting a buffer to a string without using fromStringz?

Buffer = [ 0x20, 0x41, 0x54, 0x41, 0x20, 0x00, 0x00, ...]
Oct 17 2014
next sibling parent reply ketmar via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:
if you are really-really sure that your buffer is null-terminated, you can use this trick:

    import std.conv;
    string s = to!string(cast(char*)buff.ptr);

please note, that this is NOT SAFE. you'd better doublecheck that your buffer is not empty and is null-terminated.
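If the terminator cannot be trusted, one alternative (not what the post above uses) is to search for the first zero byte inside the slice and never touch a raw pointer. This is only a sketch: textBeforeNul is a hypothetical name, and it assumes the buffer holds ASCII text, since strip expects valid UTF-8.

```d
import std.algorithm : countUntil;
import std.string : strip;

/// Hypothetical helper: the stripped text before the first zero byte, or the
/// whole buffer when no zero byte is present. Never reads past buf.length.
string textBeforeNul(const(ubyte)[] buf)
{
    immutable idx = buf.countUntil(0);                // -1 when there is no zero byte
    auto text = cast(const(char)[]) (idx < 0 ? buf : buf[0 .. idx]);
    return text.strip.idup;                           // copy, so the result owns its memory
}

unittest
{
    ubyte[] buff = new ubyte[16];
    buff[0 .. 5] = cast(const(ubyte)[]) " ATA ";
    assert(textBeforeNul(buff) == "ATA");

    ubyte[] full = [' ', 'A', 'T', 'A', ' '];         // no terminator at all
    assert(textBeforeNul(full) == "ATA");
    assert(textBeforeNul(null) == "");
}
```

Because the search is bounded by the slice length, a missing terminator just means the whole buffer is used instead of reading past its end.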
Oct 17 2014
parent reply "Lucas Burson" <ljdelight+dlang gmail.com> writes:
The buffer is populated from a SCSI ioctl, so it "should" be only ASCII and null-terminated, but it's a good idea to harden the code a bit. Thank you for your help!
Oct 17 2014
parent reply ketmar via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:
i developed a habit of making such buffers one byte bigger than necessary and just setting the last byte to 0 before converting. this way it's guaranteed to be 0-terminated.
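a rough sketch of that habit, using std.string.fromStringz for the conversion. the helper name terminatedText and the 16-byte payload size are assumptions made for the example, not something from the posts above.

```d
import std.string : fromStringz;

/// Hypothetical helper: `buff` is expected to be one byte larger than the
/// payload. Forces that spare last byte to zero, then reads up to the first zero.
const(char)[] terminatedText(ubyte[] buff)
{
    assert(buff.length > 0, "need at least the spare byte");
    buff[$ - 1] = 0;
    return fromStringz(cast(const(char)*) buff.ptr);
}

unittest
{
    enum payloadSize = 16;
    auto buff = new ubyte[payloadSize + 1];   // one spare byte at the end
    buff[0 .. payloadSize] = 0x58;            // simulate a read that fills every byte with 'X'

    auto text = terminatedText(buff);
    assert(text.length == payloadSize);       // the scan stops at the spare zero byte

    buff[3] = 0;                              // an earlier zero ends the text sooner
    assert(terminatedText(buff) == "XXX");
}
```

the result is a slice into buff, so it has to be copied (for example with .idup) if the buffer is going to be reused.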
Oct 17 2014
parent reply "Lucas Burson" <ljdelight+dlang gmail.com> writes:
Perfect, great idea. Below is my utility method to pull strings out of a buffer.

import std.conv   : to;
import std.string : strip;

/**
 * Get a string from buffer where the string spans [offset_start, offset_end).
 * Params:
 *    buffer = Buffer with an ASCII string to obtain.
 *    offset_start = Beginning byte offset within the buffer where the string starts.
 *    offset_end = Ending byte offset which is not included in the string.
 */
string bufferGetString(ubyte[] buffer, ulong offset_start, ulong offset_end)
in
{
    assert(buffer != null);
    assert(offset_start < offset_end);
    assert(offset_end <= buffer.length);
}
body
{
    ulong bufflen = offset_end - offset_start;

    // add one to the length for null-termination
    ubyte[] temp = new ubyte[bufflen+1];
    temp[0..bufflen] = buffer[offset_start..offset_end];
    temp[bufflen] = '\0';

    return strip(to!string(cast(const char*) temp.ptr));
}

unittest
{
    ubyte[] no_null = [' ', 'A', 'B', 'C', ' '];
    assert("ABC" == bufferGetString(no_null, 0, no_null.length));
    assert("ABC" == bufferGetString(no_null, 1, no_null.length-1));
    assert("A" == bufferGetString(no_null, 1, 2));
}
Oct 17 2014
next sibling parent ketmar via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:
note that you can make your code slightly simpler (and more correct):

    size_t bufflen = offset_end-offset_start;
    // add one to the length for null-termination
    auto temp = new ubyte[bufflen+1]; // compiler knows the type ;-)
    temp[0..$-1] = buffer[offset_start..offset_end];
    // this is not necessary, as 'temp' is initialized with zeroes
    //temp[$-1] = '\0';
    return strip(to!string(cast(const char*) temp.ptr));

also note that this allocates like crazy. ;-) this can be tolerable, but good to remember anyway.

besides, slices rock, so you can just pass a slice there. so:

    string bufferGetString (const(ubyte)[] buffer) {
      import std.conv : to;
      import std.string : strip;
      if (buffer.length == 0) return null; // or ""
      if (buffer[$-1] == 0) return to!string(cast(char*)buffer.ptr).strip;
      auto temp = new ubyte[](buffer.length+1);
      temp[0..$-1] = buffer[];
      return to!string(cast(char*)temp.ptr).strip;
    }

    unittest {
      ubyte[] no_null = [' ', 'A', 'B', 'C', ' '];
      immutable ubyte[] no_nullI = [' ', 'A', 'B', 'C', ' '];
      assert("ABC" == bufferGetString(no_null[0..$]));
      assert("ABC" == bufferGetString(no_null[1..$-1]));
      // look, we can use const/immutable buffers too!
      assert("A" == bufferGetString(no_nullI[1..2]));
    }

slices are cheap, and you'll get range checking at the call site.
Oct 17 2014
prev sibling parent reply ketmar via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:
p.s. it's ok to take '.length' from 'null' array. compiler is smart
enough.
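a tiny sketch of what that means in practice: a null slice simply has a length of 0, so a plain length check works without a separate null test. the helper name isEmptyBuffer is made up for the example.

```d
/// Hypothetical helper: true for both null and empty slices.
bool isEmptyBuffer(const(ubyte)[] buffer)
{
    return buffer.length == 0;   // safe on a null slice, nothing is dereferenced
}

unittest
{
    ubyte[] nothing = null;
    assert(nothing.length == 0);
    assert(isEmptyBuffer(nothing));

    ubyte[] data = [0x41];
    assert(!isEmptyBuffer(data));
    assert(isEmptyBuffer(data[0 .. 0]));   // empty but non-null slice
}
```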
Oct 17 2014
parent reply "Lucas Burson" <ljdelight+dlang gmail.com> writes:
Wow, your changes made it much simpler. Thank you for the suggestions and expertise ketmar :)
Oct 18 2014
parent ketmar via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:
you're welcome.
Oct 18 2014
prev sibling parent ketmar via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:
On Fri, 17 Oct 2014 18:30:43 +0300
ketmar via Digitalmars-d-learn <digitalmars-d-learn puremagic.com>
wrote:

 Shouldn't the compiler give a warning about casting a
 buffer to a string without using fromStringz?
nope. such casting is perfectly legal, as D strings can contain embedded '\0's.
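a short sketch of why there is nothing for the compiler to warn about: a zero byte is just another character inside a D string. the helper name hasEmbeddedNul is hypothetical.

```d
/// Hypothetical helper: reports whether a string contains a '\0' character.
bool hasEmbeddedNul(const(char)[] s)
{
    foreach (c; s)
        if (c == '\0')
            return true;
    return false;
}

unittest
{
    string s = "AT\0A";
    assert(s.length == 4);             // the zero byte counts like any other char
    assert(hasEmbeddedNul(s));
    assert(!hasEmbeddedNul("ATA"));

    ubyte[] raw = [0x41, 0x00, 0x42];
    auto viaCast = cast(const(char)[]) raw;
    assert(viaCast.length == 3);       // the cast keeps every byte, zeroes included
}
```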
Oct 17 2014