digitalmars.D.learn - size of a string in bytes

Nestor <nestorperez2016 yopmail.com> writes:
Hi,

One can get the length of a string easily. However, since strings 
are UTF-8, characters sometimes take more than one byte. I would 
like to know how many bytes a string takes, but this code didn't 
work as I expected:

import std.stdio;
void main() {
   string mystring1;
   string mystring2 = "A string of just 48 characters for testing size.";
   writeln(mystring1.sizeof); // prints 8
   writeln(mystring2.sizeof); // also prints 8
}

In both cases the size is 8, so apparently sizeof is giving me 
just the default size of a string type and not the size of the 
variable in memory, which is what I want.

Ideas?
Jan 28 2017
rikki cattermole <rikki cattermole.co.nz> writes:
There are a few misconceptions here. An element of a string is not a grapheme; it is a char, which is always one byte. So what you want is mystring.length.

sizeof is not telling you about the elements; it's telling you how big the reference to them is, specifically a length plus a pointer. It would have been 16 if you had compiled in 64-bit mode, for example.

If you want to know about graphemes and code points, that is another story. For that you'll want std.uni[0] and std.utf[1].

[0] http://dlang.org/phobos/std_uni.html
[1] http://dlang.org/phobos/std_utf.html
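
The difference between the two properties can be sketched as follows. This is only an illustration under stated assumptions: the variable names are made up for the example, and the header size is expressed in terms of size_t so that it holds for both 32-bit and 64-bit builds.

```d
import std.stdio;

void main()
{
    // Hypothetical variable names, reusing the test string from the question.
    string empty;
    string text = "A string of just 48 characters for testing size.";

    // .sizeof is a compile-time property of the type: the slice header,
    // i.e. a length field plus a pointer, regardless of the contents.
    static assert(string.sizeof == size_t.sizeof + (void*).sizeof);
    writeln(empty.sizeof == text.sizeof); // true

    // .length is a run-time property: the number of char elements (bytes).
    writeln(empty.length); // 0
    writeln(text.length);  // 48
}
```

The test string contains only ASCII, so here the byte count and the character count happen to coincide.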
Jan 28 2017
Nestor <nestorperez2016 yopmail.com> writes:
I do not want the string length or code points. Perhaps I didn't explain myself. I want to know the variable's size in memory. For example, say I have a UTF-8 string of only 2 characters, but each of them takes 2 bytes. The string length would be 2, but the content of the string would take 4 bytes in memory (excluding overhead for type size). How can I get that?
Jan 28 2017
rikki cattermole <rikki cattermole.co.nz> writes:
Use .length.

You are misunderstanding: a char will always be exactly one byte in size. Check [0] for proof.

Keep in mind that this is the definition of string [1]:

alias immutable(char)[] string;

There is nothing fancy going on. What you were calling "characters" are actually graphemes as per the Unicode standard. A grapheme can be multiple bytes and multiple code points in size, but a char cannot.

[0] http://dlang.org/spec/type.html
[1] https://github.com/dlang/druntime/blob/master/src/object.d
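
As a minimal sketch of the case described in the question (two characters of two bytes each), the following may help. The sample text is an assumption chosen for illustration: two copies of U+00E9, written with escapes so that the result does not depend on the encoding of the source file.

```d
import std.stdio;

// These hold at compile time, so the program would not build otherwise.
static assert(is(string == immutable(char)[]));
static assert(char.sizeof == 1);

void main()
{
    // Hypothetical sample: two 'é' characters, each encoded as two UTF-8 bytes.
    string s = "\u00E9\u00E9";

    writeln(s.length);               // 4 (number of char elements)
    writeln(s.length * char.sizeof); // 4 (bytes occupied by the content)
}
```

Since char.sizeof is 1, the element count and the byte count of a string are the same number.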
Jan 28 2017
Ivan Kazmenko <gassa mail.ru> writes:
On Saturday, 28 January 2017 at 15:32:33 UTC, Nestor wrote:
 I want to know variable size in memory. For example, say I have 
 a UTF-8 string of only 2 characters, but each of them takes 2 
 bytes. string length would be 2, but the content of the string 
 would take 4 bytes in memory (excluding overhead for type size).
As said, the byte count is indeed string.length. The number of code points can be found by std.range.walkLength, but be aware it takes O(answer) time to compute.

Example:

-----
import std.range, std.stdio;
void main () {
	auto s = "Привет!";
	writeln (s.length); // 13 bytes
	writeln (s.walkLength); // 7 code points
}
-----

Ivan Kazmenko.
Jan 28 2017
Nestor <nestorperez2016 yopmail.com> writes:
Thank you Ivan. I believe I saw somewhere that in D a char was not necessarily the same as a ubyte because chars sometimes take more than one byte. So, since a string is an array of chars, I thought length behaved like walkLength (which I had not seen); in other words, that it simply returned the number of elements in the array.
Jan 28 2017
Adam D. Ruppe <destructionator gmail.com> writes:
On Saturday, 28 January 2017 at 18:04:58 UTC, Nestor wrote:
 I believe I saw somewhere that in D a char was not necessarily 
 the same as a ubyte because chars sometimes take more than one byte
Not true in the language, but the Phobos library does treat char and ubyte differently because of the multi-char sequences. The built-in .length on a string and indexing both work the same as with bytes, though.

Note that .length on a wstring or dstring (UTF-16 or UTF-32) does not count bytes, but 16-bit or 32-bit code units. So wstring.length = number of wchars = number of 16-bit items, and dstring.length counts 32-bit items. This is exactly the same as ushort[].length or int[].length: it is the number of elements, so if you actually want the byte length, you'd cast the array first or multiply by the element size.
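
A short sketch of the point about wstring and dstring, reusing the sample text from the earlier example in this thread. The two ways of getting a byte length shown here (multiplying by the element size, and casting to const(ubyte)[]) are just one reading of "cast it first or something", not necessarily what the author had in mind.

```d
import std.stdio;

void main()
{
    string  s = "Привет!";  // UTF-8,  elements are char  (1 byte each)
    wstring w = "Привет!"w; // UTF-16, elements are wchar (2 bytes each)
    dstring d = "Привет!"d; // UTF-32, elements are dchar (4 bytes each)

    // .length always counts elements (code units), not bytes.
    writeln(s.length); // 13
    writeln(w.length); // 7
    writeln(d.length); // 7

    // Byte length: multiply by the element size...
    writeln(w.length * wchar.sizeof); // 14
    writeln(d.length * dchar.sizeof); // 28

    // ...or reinterpret the array as bytes and take its length.
    writeln((cast(const(ubyte)[]) w).length); // 14
}
```

For a plain string the multiplication is a no-op, because char.sizeof is 1.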
Jan 28 2017
ag0aep6g <anonymous example.com> writes:
On Saturday, 28 January 2017 at 18:04:58 UTC, Nestor wrote:
 I believe I saw somewhere that in D a char was not necessarily 
 the same as a ubyte because chars sometimes take more than one 
 byte,
In D, a `char` is a UTF-8 code unit. Its size is one byte, exactly and always.

A `char` is not a "character" in the common meaning of the word. There's a more specialized word for "character" as a visual unit: grapheme. For example, 'Ä' is a grapheme (a visual unit, a "character"), but there is no single `char` for it. To encode 'Ä' in UTF-8, a sequence of multiple code units is used.
 so since a string is an array of chars, I thought length 
 behaved like walkLength (which I had not seen), in other words, 
 that it simply returned the number of elements in the array.
The elements of a `string` are (immutable) `char`s. That is, `string` is an array of UTF-8 code units. It's not an array of graphemes.

A `string`'s .length gives you the number of `char`s in it, i.e. the number of UTF-8 code units, i.e. the number of bytes.
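
To make the three levels (code units, code points, graphemes) concrete, here is a hedged sketch using 'Ä'. The two spellings of 'Ä' and the variable names are assumptions picked for the example; walkLength and std.uni.byGrapheme are the Phobos functions used for counting.

```d
import std.range : walkLength;
import std.stdio;
import std.uni : byGrapheme;

void main()
{
    // Two hypothetical spellings of the same visible character.
    string precomposed = "\u00C4";  // 'Ä' as a single code point
    string decomposed  = "A\u0308"; // 'A' followed by a combining diaeresis

    writeln(precomposed.length);                // 2 code units (bytes)
    writeln(precomposed.walkLength);            // 1 code point
    writeln(precomposed.byGrapheme.walkLength); // 1 grapheme

    writeln(decomposed.length);                 // 3 code units (bytes)
    writeln(decomposed.walkLength);             // 2 code points
    writeln(decomposed.byGrapheme.walkLength);  // 1 grapheme
}
```

Both strings display as one character, yet they differ in byte count and in code point count, which is why only .length answers the question of memory size.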
Jan 28 2017
Nestor <nestorperez2016 yopmail.com> writes:
Very good explanation. Thank you all for making this clear to me.
Jan 28 2017
"H. S. Teoh via Digitalmars-d-learn" <digitalmars-d-learn puremagic.com> writes:
On Sat, Jan 28, 2017 at 03:32:33PM +0000, Nestor via Digitalmars-d-learn wrote:
[...]
 I do not want the string length or code points. Perhaps I didn't
 explain myself.
The .length property of a string is the number of bytes used to store the string.
 I want to know variable size in memory. For example, say I have a
 UTF-8 string of only 2 characters, but each of them takes 2 bytes.
 string length would be 2, but the content of the string would take 4
 bytes in memory (excluding overhead for type size).
What you call "string length" is called the grapheme count in D. What you want is the .length property. The number of bytes in a UTF-8 string is the same thing as the number of code units (note: do not confuse these with code points, which are something else).

--T
Jan 28 2017