digitalmars.D.learn - removing ansi control escape characters from a string

reply Timothee Cour <thelastmammoth gmail.com> writes:
Not sure this is the right forum but here we go:

When using unix command 'script' to log the terminal input/output commands,
it includes special ansi control escape characters.
I'd like to filter those character sequences out of the generated script
file, so that it preserves the content (including newlines) but
removes escape codes for coloring, etc.
That way I can apply tools like grep as if those characters were absent.

Here's an example, after a short script session where I typed 1234 BS BS BS
BS 6789 (i.e. 4 backspaces), saved into log.txt.
Running 'cat log.txt|grep 1234' prints the line that displays as '6789'.
1234 doesn't appear when we cat the file in a terminal, because cat-ing in a
terminal replays the backspace sequence, but 1234 really is in the file.

So I'd like help on writing a D utility function that will convert a string
s1 into a string s2 such that:
* writeln(s1) prints the same as writeln(s2) (modulo removing colors from
escape sequences)
* s2 doesn't contain any escape sequence (as given by std.uni.isControl)
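
The second requirement can be checked mechanically. Below is a minimal sketch of such a check; the helper name isCleanOutput is hypothetical and not part of any library. Note that std.uni.isControl also reports '\n' as a control character, so the sketch assumes newlines are exempt, since the content (including newlines) is supposed to be preserved.

```d
import std.algorithm : all;
import std.uni : isControl;

/// Hypothetical helper: true if `s` contains no control characters
/// other than newline (assumption: newlines must survive the filtering).
bool isCleanOutput(string s)
{
    // Iterating a string with `all` decodes it to dchar code points.
    return s.all!(c => c == '\n' || !isControl(c));
}
```

A filter function could be tested by asserting isCleanOutput on its result.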

Note, I'm NOT just looking into filtering out the 'isControl'
characters (that's easy), because that leaves all the '[33m' garbage in the
string; also the behavior of backspace needs to be emulated.

Here's a first stab at the problem, but it's incomplete (ie doesn't deal
with backspaces which should erase a char from the string etc).
string remove_terminal_escape_codes(string a){
    // Build the pattern: a character class holding every control
    // character, followed by the bodies of some common escape sequences.
    string pattern=(){
        import std.uni;
        import std.conv;
        import std.range;
        string pattern="[";
        foreach(char ci; 0..255){
            if(isControl(ci)){
                pattern~=ci.to!dchar;
            }
        }
        pattern~="]";
        return pattern~`\[(\d{2}m|\dm|\d\d;\d\dm|J|K|\d\d[A-Z])`;
    }();
    import std.regex;
    // Remove every match ("g" = global).
    return replace(a,regex(pattern,"g"),"");
}
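
For comparison, here is a sketch of a more general pattern, assuming the sequences of interest are CSI sequences as laid out in ECMA-48: ESC, '[', any parameter bytes (0x30 to 0x3F), any intermediate bytes (0x20 to 0x2F), then one final byte (0x40 to 0x7E). The name stripCsi is hypothetical. It does not emulate backspace and does not touch other control characters.

```d
import std.regex : regex, replaceAll;

/// Hypothetical helper: removes CSI sequences such as ESC[33m or ESC[199C.
string stripCsi(string s)
{
    // ESC '[' + parameter bytes + intermediate bytes + one final byte
    auto csi = regex("\x1b\\[[0-9:;<=>?]*[ -/]*[@-~]");
    return replaceAll(s, csi, "");
}
```

This covers cursor movements like ^[[199C and ^[[222D from the log as well as the color codes, without listing each form by hand.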





More details:

cat -v log.txt:
-------------------------
Script started on Fri May 31 17:32:55 2013
^[[1m^[[7m%^[[27m^[[1m^[[0m


                                           ^M
^M^M^[[0m^[[27m^[[24m^[[J^[[32mprompt:M ^[[33m~/shortcuts
^[[00m%^[[K^[[199C^[[33mprompt_end:^[[1m17:32^[[0m
^[[1m#27378^[[0m^[[222D1^H1234^H ^H^H ^H^H^H1 ^H^H ^H6^H6789^M^M
^[[1m^[[7m%^[[27m^[[1m^[[0m


                                           ^M
^M^M^[[0m^[[27m^[[24m^[[J^[[32mprompt:M ^[[33m~/shortcuts
^[[00m%^[[K^[[199C^[[34mprompt_end:Err 1^[[37m ^[[1m#27378^[[0m^[[222D^M^M

Script done on Fri May 31 17:33:08 2013
-------------------------

cat log.txt:
-------------------------
Script started on Fri May 31 17:32:55 2013
prompt:M ~/shortcuts %6789


prompt_end:17:32 #27378
prompt:M ~/shortcuts %


prompt_end:Err 1 #27378

Script done on Fri May 31 17:33:08 2013
-------------------------

I can attach the file as .txt if forum allows.
May 31 2013
next sibling parent "Adam D. Ruppe" <destructionator gmail.com> writes:
On Saturday, 1 June 2013 at 01:08:46 UTC, Timothee Cour wrote:
> removes escape codes for coloring, etc.
Getting all these is a very difficult task because the escape sequences aren't all well defined. But you should get pretty good results by just filtering out anything that starts with "\033[" through the first char >= 'A'.

"\033" is often written ^[ by the shell. If you hold ctrl and press the [ key, it sends the same character as pressing the esc key, which is \033.

    string outputString;
    bool inEscape = false;
    bool justEnteredEscape = false;
    foreach(c; inputString) {
        if(justEnteredEscape) {
            justEnteredEscape = false;
            if(c == '[')
                inEscape = true;
            else {
                // NOTE: this is actually likely wrong but prolly good enough
                outputString ~= c;
            }
        } else if(inEscape) {
            if(c >= 'A')
                inEscape = false;
            // otherwise we want to skip this character, since it is part of e.g. a color sequence
        } else if(c == '\033') {
            justEnteredEscape = true;
        } else {
            if(c == 8) continue; // skip backspace
            if(c == 0) continue;
            // and so on for whatever else you don't want....
            outputString ~= c;
        }
    }
    // outputString should be ok now

I didn't actually run that code since I don't have test data available, but I think it will work to strip out the majority of the escape sequences you'll see on an output stream.

Now, I said the if(justEnteredEscape) part is wrong, but probably good enough. The reason it is probably wrong is that some terminals will use other characters there, especially on input. Terminal input is a huge mess. For example, if I hit F1 on my xterm, the sequence it sends is ^[OP. We're only looking for ^[[ there.

The problem is: how do you tell the difference between xterm sending ^[OP and the user hitting escape, then typing O and P? This is why unix sucks btw... you really can't. Real apps tell the difference by looking at the time delay. In fact, if you open vim or something and type <esc> OP really quickly in xterm, vim will pop open its help screen! (That assumes your xterm sends the same sequences as mine; it might not! This is where the termcap and terminfo databases come in, and omg that's painful.) Whereas if you type it a little slower, it will go to command mode, then open a new line and type P.

The reason is that if you hit it fast, the application has no way to tell whether you were doing it manually or xterm was sending the F1 key. And that's the problem you'll face with the log file: unless it has timing information, you can't use that method, so there's no way for you to tell.

Also btw, those codes have variable length and content, so you'd really have to understand them to strip them (see my terminal.d [1] for an example of this; it is a lot of code). So even if it were totally possible, that's a lot of work for filtering a log file.

If the majority is output or regular key input, though, you don't have to worry too much about this. The color sequences are pretty well defined and by far the most common in output, and this should handle them.

[1] https://github.com/adamdruppe/misc-stuff-including-D-programming-language-web-stuff/blob/master/terminal.d
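
To illustrate the "variable length and content" point, here is a small sketch that parses a single sequence into its numeric parameters and final byte. It is not taken from terminal.d; the Csi struct and the parseCsi function are hypothetical names, and the sketch assumes the simple form ESC '[' digits-and-semicolons final-byte, which covers the color and cursor codes in the log above but not every sequence a terminal may emit.

```d
import std.algorithm : splitter;
import std.conv : to;

/// Hypothetical result type for this sketch.
struct Csi
{
    int[] params;    // numeric parameters, e.g. [1, 33] for ESC[01;33m
    char finalByte;  // the byte that ends the sequence, e.g. 'm'
    size_t length;   // bytes consumed, including ESC and '['
}

/// Returns true and fills `result` if `s` starts with ESC '[' followed by
/// digits and ';' and then a final byte in '@' .. '~'.
bool parseCsi(string s, out Csi result)
{
    if (s.length < 3 || s[0] != '\x1b' || s[1] != '[')
        return false;
    size_t i = 2;
    while (i < s.length && ((s[i] >= '0' && s[i] <= '9') || s[i] == ';'))
        ++i;
    if (i == s.length || s[i] < '@' || s[i] > '~')
        return false;
    if (i > 2) // only split when there is at least one parameter byte
        foreach (part; s[2 .. i].splitter(';'))
            result.params ~= (part.length ? part.to!int : 0); // empty parameter treated as 0 here
    result.finalByte = s[i];
    result.length = i + 1;
    return true;
}
```

With this, "\x1b[01;33m" yields params [1, 33] and final byte 'm', while the F1 sequence ^[OP is rejected because it does not start with ^[[.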
May 31 2013
prev sibling parent "Adam D. Ruppe" <destructionator gmail.com> writes:
On Saturday, 1 June 2013 at 01:08:46 UTC, Timothee Cour wrote:
> string; also the behavior of backspace needs to be emulated.
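Oops, I missed this on my first read-through. But if instead of if(c == 8) continue; you did

    if(c == 8) { if(outputString.length) outputString = outputString[0 .. $-1]; continue; }

that should be good enough for this. (The length check just avoids slicing an empty string when a backspace comes first.)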
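
A slightly closer emulation, as a sketch only: a terminal does not delete on backspace, it moves the cursor left, and whatever is printed next overwrites what was there. The hypothetical replayLineEdits below keeps a line buffer and a cursor column, handles backspace, carriage return and newline, and assumes escape sequences have already been stripped and that the text is ASCII (it works byte by byte).

```d
/// Hypothetical helper: replays backspace and carriage return on a line
/// buffer, the way a terminal would overwrite characters in place.
string replayLineEdits(string input)
{
    string result;
    char[] line;     // the line as currently "displayed"
    size_t cursor;   // cursor column within `line`
    foreach (char c; input)
    {
        switch (c)
        {
            case '\b':            // backspace: move left, erase nothing
                if (cursor > 0) --cursor;
                break;
            case '\r':            // carriage return: back to column 0
                cursor = 0;
                break;
            case '\n':            // newline: emit the finished line
                result ~= line.idup;
                result ~= '\n';
                line.length = 0;
                cursor = 0;
                break;
            default:              // printable: overwrite or append
                if (cursor < line.length) line[cursor] = c;
                else line ~= c;
                ++cursor;
                break;
        }
    }
    result ~= line.idup;          // flush a last line without newline
    return result;
}
```

On the keystroke run from the log (1^H1234^H ^H^H ^H^H^H1 ^H^H ^H6^H6789) this ends up with 6789, matching what cat shows in a terminal.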
May 31 2013