digitalmars.D.learn - size of a string in bytes

Nestor <nestorperez2016 yopmail.com> writes:

Hi,

One can get the length of a string easily; however, since strings 
are UTF-8, characters sometimes take more than one byte. I would 
like to know how many bytes a string takes, but this 
code didn't work as I expected:

import std.stdio;
void main() {
   string mystring1;
   string mystring2 = "A string of just 48 characters for testing size.";
   writeln(mystring1.sizeof);
   writeln(mystring2.sizeof);
}

In both cases the size is 8, so apparently sizeof is giving me 
just the default size of a string type and not the size of the 
variable in memory, which is what I want.

Ideas?

Jan 28 2017

rikki cattermole <rikki cattermole.co.nz> writes:

A few misconceptions going on here.
A string element is not a grapheme; it is a char (a UTF-8 code unit), which is one byte.

So what you want is mystring.length

Now sizeof is not telling you about the elements; it's telling you how 
big the reference to them is, specifically length + pointer. It would have 
been 16 if you had compiled in 64-bit mode, for example.
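
To make the difference concrete, here is a minimal sketch (not from the original post; the variable names are only illustrative). It contrasts .sizeof, which is fixed by the type, with .length, which is the byte count of the contents. The 8 and 16 in the comment assume the usual 32-bit and 64-bit targets.

```d
import std.stdio : writeln;

void main()
{
    string empty;
    string text = "A string of just 48 characters for testing size.";

    // .sizeof is a compile-time property of the type: a slice is a
    // (length, pointer) pair, so it is twice the size of size_t.
    static assert(string.sizeof == 2 * size_t.sizeof);
    writeln(text.sizeof);   // 8 on a 32-bit target, 16 on a 64-bit target

    // .length is the run-time number of char elements, i.e. bytes.
    writeln(empty.length);  // 0
    writeln(text.length);   // 48 (the text is pure ASCII)
}
```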

If you want to know about graphemes and code points that is another story.
For that you'll want std.uni[0] and std.utf[1].

[0] http://dlang.org/phobos/std_uni.html
[1] http://dlang.org/phobos/std_utf.html

Jan 28 2017

Nestor <nestorperez2016 yopmail.com> writes:

I do not want string length or code points. Perhaps I didn't 
explain myself.

I want to know the variable's size in memory. For example, say I have 
a UTF-8 string of only 2 characters, but each of them takes 2 
bytes. String length would be 2, but the content of the string 
would take 4 bytes in memory (excluding overhead for type size).

How can I get that?

Jan 28 2017

rikki cattermole <rikki cattermole.co.nz> writes:

.length

You are misunderstanding: a char is always exactly one byte in size.

Check[0] for proof.

Keep in mind that this is the definition of string[1]:
alias immutable(char)[]  string;

There is nothing fancy going on.
What you are calling "characters" are actually graphemes in terms of 
the Unicode standard. A grapheme can span multiple code points, and 
therefore multiple bytes, but a char cannot.
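
As a rough illustration of the three levels (this sketch is an addition, not part of the original reply), the string below is assumed to hold 'e' followed by the combining acute accent U+0301, which displays as a single "é". It uses std.range.walkLength and std.uni.byGrapheme from Phobos.

```d
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;

void main()
{
    // 'e' plus U+0301 (combining acute accent): one visible "character".
    string s = "e\u0301";

    writeln(s.length);                // 3: UTF-8 code units (bytes)
    writeln(s.walkLength);            // 2: code points
    writeln(s.byGrapheme.walkLength); // 1: grapheme
}
```

Only the first of these is O(1); the other two have to walk the whole string.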

[0] http://dlang.org/spec/type.html
[1] https://github.com/dlang/druntime/blob/master/src/object.d

Jan 28 2017

Ivan Kazmenko <gassa mail.ru> writes:

On Saturday, 28 January 2017 at 15:32:33 UTC, Nestor wrote:
 I want to know the variable's size in memory. For example, say I have 
 a UTF-8 string of only 2 characters, but each of them takes 2 
 bytes. String length would be 2, but the content of the string 
 would take 4 bytes in memory (excluding overhead for type size).

As said, the byte count is indeed string.length.
The number of code points can be found by std.range.walkLength, 
but be aware it takes O(answer) time to compute.

Example:

-----
import std.range, std.stdio;
void main () {
	auto s = "Привет!";
	writeln (s.length); // 13 bytes
	writeln (s.walkLength); // 7 code points
}
-----

Ivan Kazmenko.

Jan 28 2017

Nestor <nestorperez2016 yopmail.com> writes:

On Saturday, 28 January 2017 at 16:01:38 UTC, Ivan Kazmenko wrote:
 As said, the byte count is indeed string.length.
 The number of code points can be found by std.range.walkLength, 
 but be aware it takes O(answer) time to compute.

 Example:

 -----
 import std.range, std.stdio;
 void main () {
 	auto s = "Привет!";
 	writeln (s.length); // 13 bytes
 	writeln (s.walkLength); // 7 code points
 }

Thank you Ivan,

I believe I saw somewhere that in D a char was not necessarily 
the same as a ubyte because chars sometimes take more than one 
byte, so since a string is an array of chars, I thought length 
behaved like walkLength (which I had not seen), in other words, 
that it simply returned the amount of elements in the array.

Jan 28 2017

Adam D. Ruppe <destructionator gmail.com> writes:

On Saturday, 28 January 2017 at 18:04:58 UTC, Nestor wrote:
 I believe I saw somewhere that in D a char was not necessarily 
 the same as a ubyte because chars sometimes take more than

Not true in the language, but the Phobos library does treat char 
and ubyte differently because of the multi-char things.

But the built-in .length on a string and indexing all work the 
same as bytes.

Note that .length on a wstring or dstring (UTF-16 or UTF-32) is 
not in bytes, but in elements. So wstring.length = number of wchars = 
number of 16-bit items, and a dstring's items are 32-bit. It is exactly 
the same as ushort[].length or int[].length: it is the number of elements, 
so if you actually want the byte length, you'd cast it first or 
something.
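
A small sketch of that last point (an addition for illustration; byteLength is a hypothetical helper, not a Phobos function). It computes the byte length as element count times element size, and also shows the cast approach.

```d
import std.stdio : writeln;

// Hypothetical helper: number of bytes occupied by an array's elements.
size_t byteLength(T)(T[] arr)
{
    return arr.length * T.sizeof;
}

void main()
{
    string  s = "Привет!";   // UTF-8:  1-byte code units
    wstring w = "Привет!"w;  // UTF-16: 2-byte code units
    dstring d = "Привет!"d;  // UTF-32: 4-byte code units

    writeln(s.length, " ", byteLength(s)); // 13 13
    writeln(w.length, " ", byteLength(w)); // 7 14
    writeln(d.length, " ", byteLength(d)); // 7 28

    // Casting to a ubyte slice rescales .length to bytes.
    writeln((cast(const(ubyte)[]) w).length); // 14
}
```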

Jan 28 2017

ag0aep6g <anonymous example.com> writes:

On Saturday, 28 January 2017 at 18:04:58 UTC, Nestor wrote:
 I believe I saw somewhere that in D a char was not necessarily 
 the same as a ubyte because chars sometimes take more than one 
 byte,

In D, a `char` is a UTF-8 code unit. Its size is one byte, 
exactly and always.

A `char` is not a "character" in the common meaning of the word. 
There's a more specialized word for "character" as a visual unit: 
grapheme. For example, 'Ä' is a grapheme (a visual unit, a 
"character"), but there is no single `char` for it. To encode 'Ä' 
in UTF-8, a sequence of multiple code units is used.

 so since a string is an array of chars, I thought length 
 behaved like walkLength (which I had not seen), in other words, 
 that it simply returned the amount of elements in the array.

The elements of a `string` are (immutable) `char`s. That is, 
`string` is an array of UTF-8 code units. It's not an array of 
graphemes.

A `string`'s .length gives you the number of `char`s in it, i.e. 
the number of UTF-8 code units, i.e. the number of bytes.
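
For anyone who wants to see this directly, here is a minimal sketch (added for illustration, assuming the precomposed form U+00C4 of 'Ä'). Iterating with a char loop variable visits the code units, while a dchar loop variable makes foreach decode code points.

```d
import std.stdio : writefln, writeln;

void main()
{
    string s = "\u00C4";   // 'Ä' as the single code point U+00C4

    writeln(s.length);     // 2: encoded as two UTF-8 code units

    // char iteration: one step per code unit (byte).
    foreach (char c; s)
        writefln("%02X", cast(ubyte) c);   // C3, then 84

    // dchar iteration: foreach decodes one code point per step.
    foreach (dchar c; s)
        writefln("U+%04X", cast(uint) c);  // U+00C4
}
```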

Jan 28 2017

Nestor <nestorperez2016 yopmail.com> writes:

On Saturday, 28 January 2017 at 19:09:01 UTC, ag0aep6g wrote:
 In D, a `char` is a UTF-8 code unit. Its size is one byte, 
 exactly and always.

 A `char` is not a "character" in the common meaning of the 
 word. There's a more specialized word for "character" as a 
 visual unit: grapheme. For example, 'Ä' is a grapheme (a visual 
 unit, a "character"), but there is no single `char` for it. To 
 encode 'Ä' in UTF-8, a sequence of multiple code units is used.
 
 ...
 
 The elements of a `string` are (immutable) `char`s. That is, 
 `string` is an array of UTF-8 code units. It's not an array of 
 graphemes.

 A `string`'s .length gives you the number of `char`s in it, 
 i.e. the number of UTF-8 code units, i.e. the number of bytes.

Very good explanation.
Thank you all for making this clear to me.

Jan 28 2017

"H. S. Teoh via Digitalmars-d-learn" <digitalmars-d-learn puremagic.com> writes:

On Sat, Jan 28, 2017 at 03:32:33PM +0000, Nestor via Digitalmars-d-learn wrote:
[...]
 I do not want string length or code points. Perhaps I didn't explain
 myself.

The .length property of a string is the number of bytes used to store
the string.


 I want to know the variable's size in memory. For example, say I have a
 UTF-8 string of only 2 characters, but each of them takes 2 bytes.
 String length would be 2, but the content of the string would take 4
 bytes in memory (excluding overhead for type size).

What you call "string length" is called grapheme count in D.  What you
want is the .length property.

The number of bytes in a UTF-8 string is the same thing as the number of
code units (note: do not confuse with code points, which is something
else).


--T

Jan 28 2017
