digitalmars.D - std.file

novice2 <novice2_member pathlink.com> writes:
Hello.
What can I do if my program must work with file/folder names that contain
non-English symbols? These names may be read from the Windows registry or
entered by the user.
Environment: localised Windows XP
Local code page: 1251 (8-bit, Cyrillic; non-English letters have codes > 0x80)
Test program:
/*********/
private import std.file;
private import std.utf;

char[] test(char[] dirName)
{
    try
    {
        exists(dirName);
        return "passed";
    }
    catch(UtfError)
    {
        return "failed";
    }
}

void main()
{
    char[] dir1 = "not exists dir A";
    char[] dir2 = "not exists dir \xC0"; //this is cyrillic letter "A"
    printf("dir1=%.*s\n", dir1);
    printf("dir2=%.*s\n", dir2);

    printf("test1 %.*s\n", test(dir1));
    printf("test2 %.*s\n", test(dir2));
    printf("test3 %.*s\n", test(toUTF8(dir1)));
    printf("test4 %.*s\n", test(toUTF8(dir2)));
}
/*********/

In my environment this program prints:
dir1=not exists dir A
dir2=not exists dir 
test1 passed
test2 failed
test3 passed
test4 failed
Sep 30 2004
novice <novice_member pathlink.com> writes:
I skipped the explanation of the problem: std.file.exists() throws a "Bad UTF sequence" exception if the directory name contains a non-English letter. Can I bypass this problem?
Sep 30 2004
M <M_member pathlink.com> writes:
Maybe the function requires a UTF-8 string and you aren't giving it one. You can
pass non-English letters, but they must be UTF-encoded.
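
One way to see whether a string meets that requirement is to validate it before calling any file function. This is only a sketch: it assumes current D (D2) and Phobos rather than the 2004 dialect of the test program, and `isValidUtf8` is a hypothetical helper name, not part of the library. The invalid string is built by appending the raw byte 0xC0, the same byte that "\xC0" produces.

```d
import std.stdio : writeln;
import std.utf : validate, UTFException;

/// Hypothetical helper: true if `s` is well-formed UTF-8.
bool isValidUtf8(const(char)[] s)
{
    try
    {
        validate(s);        // throws UTFException on a malformed sequence
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    // Plain ASCII is always valid UTF-8.
    writeln(isValidUtf8("not exists dir A"));        // true

    // A lone 0xC0 byte (the code-page-1251 code for the letter) is not UTF-8.
    char[] bad = "not exists dir ".dup;
    bad ~= cast(char) 0xC0;
    writeln(isValidUtf8(bad));                       // false

    // The same letter written as a Unicode escape is valid.
    writeln(isValidUtf8("not exists dir \u0410"));   // true
}
```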

M
Sep 30 2004
Arcane Jill <Arcane_member pathlink.com> writes:
In article <cjgfna$2507$1 digitaldaemon.com>, novice2 says...
Hello.

Hiya
What can I do if my program must work with file/folder names that contain
non-English symbols? These names may be read from the Windows registry or
entered by the user.
Environment: localised Windows XP
Local code page: 1251 (8-bit, Cyrillic; non-English letters have codes > 0x80)

Understand that your local code page is something about which D doesn't know or care. I'll try to explain more further on. Bear with me.
char[] dir2 = "not exists dir \xC0"; //this is cyrillic letter "A"

No it isn't. It's an invalid UTF-8 sequence. What you should do instead is this:

# char[] dir2 = "not exists dir \u0410"; //Cyrillic capital letter A

(or simply insert the Cyrillic capital letter A straight into your source code as a single character).

In D, source code is portable. The sequence "\u0410" emits the Unicode character U+0410 (CYRILLIC CAPITAL LETTER A), and - importantly - it will do so for /all users/, not just folk who use Windows code page 1251.

In D, "\x##" emits UTF fragments, not characters. Therefore you should never use "\x##" in a string unless you are prepared to encode UTF-8 by hand. "\u####" or "\U########" emit characters, so that's what you want. The codepoint (character code) should always be the /Unicode/ codepoint, not the Windows-1251 codepoint.

Arcane Jill

--------------------------------------------------------------------------------

PS. Walter - I change my mind about things occasionally, and I'm now starting to agree with Regan in suggesting that "\x" should be deprecated, precisely because it causes this kind of confusion. It's reasonable to assume that people who want to do UTF-encoding by hand are likely to be knowledgeable enough to figure out some other way of doing this.
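
The difference between "\x##" and "\u####" can be checked directly. The sketch below assumes current D (D2) rather than the 2004 compiler, and the variable names are made up for illustration. It shows that "\u0410" is stored as the two UTF-8 code units 0xD0 0x90, which is exactly what you would have to type with "\x" to encode the letter by hand.

```d
import std.stdio : writefln;

void main()
{
    string viaUnicodeEscape = "\u0410";     // one character: U+0410
    string viaHandEncoding  = "\xD0\x90";   // the same character, UTF-8 encoded by hand

    // The length of a string counts UTF-8 code units, not characters.
    assert(viaUnicodeEscape.length == 2);
    assert(viaUnicodeEscape == viaHandEncoding);

    // Iterating by dchar decodes the code units back into one code point.
    foreach (dchar c; viaUnicodeEscape)
        assert(c == 0x0410);

    // Print the individual code units: 0xD0, then 0x90.
    foreach (char unit; viaUnicodeEscape)
        writefln("0x%02X", cast(ubyte) unit);
}
```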
Sep 30 2004
novice <novice_member pathlink.com> writes:
Hi, Arcane Jill

or care. I'll try to explain more further on. Bear with me.

Thank you.
U+0410 (CYRILLIC CAPITAL LETTER A), and - importantly - it will do so for /all
users/, not just folk who use Windows code page 1251.

Unfortunately (?) code page 1251 is the standard for Russian Windows localization. Many editors (standard Notepad, for example) use it. It is the standard for exchanging text between Russian Windows machines. By contrast, Unicode is used by Windows only internally. I would have to find a special text editor to produce Unicode text :(
(or simply insert the Cyrillic capital letter A straight into your source
 code as a single character).

I tried simply inserting a Cyrillic letter into the source before I posted my question. The compiler reports "bad utf sequence" :(
In D, source code is portable.

Yes, Unicode is portable. But where would I see it? My friends on Unix use the ISO 8859-5 or KOI8-R code pages (8-bit code pages like 1251), and ALL Russian users on Windows MUST use code page 1251...
Sep 30 2004
Arcane Jill <Arcane_member pathlink.com> writes:
In article <cjgst9$2blj$1 digitaldaemon.com>, novice says...

Unfortunately (?) code page 1251 is the standard for Russian Windows localization.

Yes, I understand that - just as WINDOWS-1252 is standard for Western European. However, that has got nothing to do with UTF-8, which is independent of localization - and that's the whole point, of course. Windows /does/ understand Unicode. Windows 95 understood Unicode, and every version of Windows thereafter uses Unicode internally.
Many editors (standard Notepad, for example) use it.

Standard Notepad /also/ uses UTF-8. Click on "Save As..."; Go to the "Encoding" pull-down menu and select "UTF-8". That's all you have to do. Really - it's that simple.
It is the standard for exchanging text between Russian Windows machines.
By contrast, Unicode is used by Windows only internally. I would have to
find a special text editor to produce Unicode text :(

I think you may be surprised to learn that /almost all/ text editors these days can save in UTF-8. There's usually an "Encoding" option on the "Save As..." menu item. Not only that, many text editors can auto-detect UTF encodings, so a UTF-8 text file created using one text editor can be loaded up in another with no problems. What text editor are you using? Even in the unlikely event that your text editor can't cope with UTF, there are plenty that can. (And you're going to want other features too, like syntax highlighting, so maybe a text editor upgrade wouldn't be a bad thing).
(or simply insert the Cyrillic capital letter A straight into your source
 code as a single character).

I tried simply inserting a Cyrillic letter into the source before I posted my question. The compiler reports "bad utf sequence" :(

Yes, that's because you didn't save your source code as UTF-8. Saving your source code as UTF-8 before passing it to DMD will fix this.
Yes, Unicode is portable. But where would I see it? My friends on Unix use
the ISO 8859-5 or KOI8-R code pages (8-bit code pages like 1251), and ALL
Russian users on Windows MUST use code page 1251...

That's just not true. Windows uses Unicode internally (its filenames are stored in UTF-16, for example). And Windows can understand a great variety of encodings.

For example - have you ever used Google? (You know, www.google.com)? If so, you've been using UTF-8. (For proof, Google something, then view the page source. You'll see it starts: <html><head><meta HTTP-EQUIV="content-type" CONTENT="text/html; charset=UTF-8">)

Saying that all Russian users of Windows MUST use encoding Windows-1251 is simply not true. Windows has been using Unicode for nearly a decade. If you open a Unicode file, it will "just work". You probably won't even notice you've done it.

Arcane Jill
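
For names that arrive at run time as code-page-1251 bytes (from user input, say), the bytes can be converted to UTF-8 before they reach std.file. What follows is a rough sketch under stated assumptions, not a complete converter: it assumes current D (D2), `fromWindows1251` is a hypothetical helper, and it handles only ASCII, the letters 0xC0..0xFF (which map contiguously to U+0410..U+044F) and the two bytes for Ё/ё. Every other byte is replaced with U+FFFD.

```d
import std.file : exists;
import std.stdio : writeln;
import std.utf : toUTF16, validate;

/// Hypothetical helper: decode Windows-1251 bytes into a UTF-8 string.
/// Covers ASCII, 0xC0..0xFF and the bytes for Ё/ё only.
string fromWindows1251(const(ubyte)[] bytes)
{
    string result;
    foreach (b; bytes)
    {
        dchar c;
        if (b < 0x80)
            c = cast(dchar) b;                      // ASCII maps to itself
        else if (b >= 0xC0)
            c = cast(dchar)(0x0410 + (b - 0xC0));   // 0xC0..0xFF -> U+0410..U+044F
        else if (b == 0xA8)
            c = '\u0401';                           // Ё
        else if (b == 0xB8)
            c = '\u0451';                           // ё
        else
            c = '\uFFFD';                           // not covered by this sketch
        result ~= c;                                // appending a dchar encodes it as UTF-8
    }
    return result;
}

void main()
{
    // "dir " followed by byte 0xC0, as it might arrive from a 1251 source.
    immutable(ubyte)[] raw = [0x64, 0x69, 0x72, 0x20, 0xC0];
    string name = fromWindows1251(raw);

    validate(name);                   // would throw if the result were not valid UTF-8
    assert(name == "dir \u0410");

    wstring wide = toUTF16(name);     // the UTF-16 form Windows uses for file names
    assert(wide.length == 5);

    // No UTF error is raised now; the printed value depends on the file system.
    writeln(exists(name));
}
```

A real program would more likely avoid the 8-bit code page altogether, for instance by asking the operating system for the name in a Unicode form, but the mapping above shows why the byte 0xC0 and the character U+0410 are not interchangeable.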
Sep 30 2004
J C Calvarese <jcc7 cox.net> writes:
Arcane Jill wrote:
In article <cjgst9$2blj$1 digitaldaemon.com>, novice says...

Unfortunately (?) code page 1251 is the standard for Russian Windows localization.

Yes, I understand that - just as WINDOWS-1252 is standard for Western European. However, that has got nothing to do with UTF-8, which is independent of localization - and that's the whole point, of course. Windows /does/ understand Unicode. Windows 95 understood Unicode, and every version of Windows thereafter uses Unicode internally.
Many editors (standard Notepad, for example) use it.

Standard Notepad /also/ uses UTF-8. Click on "Save As..."; Go to the "Encoding" pull-down menu and select "UTF-8". That's all you have to do. Really - it's that simple.

Actually, I think that whether Notepad.exe supports Unicode depends on the version of Windows. Notepad supports Unicode on Windows 2000/XP. I think that with Win95/Win98/WinME, Notepad doesn't have an option to save in Unicode. (I'd hate to guess whether WinNT's Notepad supports Unicode, but I doubt that it does.) In any case, I'm sure there are several free Unicode-enabled editors out there. If the OP uses one of those, I suspect he'll have much more success with D.
 
 
 

-- Justin (a/k/a jcc7) http://jcc_7.tripod.com/d/
Sep 30 2004