www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - A crazy idea for accurately tracking source position

reply Alix Pexton <alix.DOT.pexton gmail.DOT.com> writes:
TL;DR

Here is some under documented, incomplete and untested code.
CAVEAT IMPLEMENTOR: some details have been omitted to keep things brief!

struct someRange
{
	ulong seq;
	bool fresh = true;
	long line;
	dchar front;
	// and lets just pretend that there is
	// somewhere for more characters to come from!

	void popFront()
	{
		// advance by whatever means to update front.
		if (front.isNewline)
		{
			++line;
			fresh = true;
			return;
		}
		if (fresh)
		{
			if (front.isTab)
			{
				seq = 0xffff,ffff,ffff,fffeL;
			}
			else
			{
				seq = 0x1L;
			}
			fresh = false;
		}
		else
		{
			seq <<= 1;
			if (!front.isTab)
			{
				seq |= 0x1L;
			}
		}
	}

	// and the rest...
}


A long time ago I wrote a very rudimentary XML lexer/parser in pascal. 
At the time I thought it was a good idea to point to the exact character 
where a error was detected. Knowing that tabs could be involved and that 
they can have different widths I stored the line position as a 
tabs/spaces tuple, because no one would ever put a space anywhere but at 
the beginning of the line, right!

Jump forward a decade or so and I know better. I.e. just knowing the 
number of tabs and spaces isn't enough, because when tabs can be 
anywhere, sometimes the spaces are swallowed up. What is needed is a 
string of tabs and spaces that matches the sequence of tabs and non-tabs 
in the source. Such could be built while lexing for immediate use if an 
invalid character is encountered and then thrown away at each newline, 
but it would not be practical to store that much information in every 
token. The sequence could be split between tokens from a single line, 
with each token having just the pattern since the last token, but 
reversing the reading of the tokens in order to reconstruct the sequence 
or building it while parsing just in case it is needed are at best 
impractical.

What would help would be a way of fitting that sequence of tabs and 
spaces into a smaller format.

Here is the crazy part...

Using a ulong (or longer if possible) to store our tab sequence...

Upon starting to lex a fresh line we check the first character, if its a 
tab we set all but the lsb to 1 . If that first char is anything other 
than a tab, we set all bits but the lsb to 0 .

On each subsequent character in the line we shift the sequence left 1 
bit and set the new lsb as 0 if its a tab and 1 if it is anything else.

If the line is longer than the number of bits in our ulong[er] we throw 
our toys out of the pram and go home in tears.

Any time a token is emitted the current (or most relevant) value of the 
ulong can be stored in it.

To decode the sequence if it is needed, we check the msb, if it is 1 
then the first character is a tab and we shift the whole value until the 
first 0 reaches the msb (keeping track of how many shifts we do so as 
not to reach apple headquarters) and then one more shift to account for 
the first character. If the msb is 0 then the first character is a space 
and we shift left until one past the first 1 . For each remaining bit we 
add a tab when the msb is 0 and a space when it is 1 .

Thus we have reconstructed a string that when displayed above or below 
the line that generated it, will end at the correct character, 
regardless of the number of tabs or spaces used to represent them. Hurrah!

For any type of source where lexed lines are regularly contain more 
characters than there are bits in our longest integer, this technique 
will fail. However, I reason that in most cases the lines that are all 
non tabs and full width are often not parsed (i.e. they are comments 
etc). Lines that start hard to the left are often short and lines that 
reach the right are often the ones with many tabs in them. In other 
words, many lines that are too wide, are not too long.

Am I on to something or should I be on something?

A...
Apr 16 2014
next sibling parent reply Alix Pexton <alix.DOT.pexton gmail.DOT.com> writes:
Just fixing an obvious typo in my code (that is still incomplete).
====

struct someRange
{
     ulong seq;
     bool fresh = true;
     long line;
     dchar front;
     // and lets just pretend that there is
     // somewhere for more characters to come from!

     void popFront()
     {
         // advance by whatever means to update front.
         if (front.isNewline)
         {
             ++line;
             fresh = true;
             return;
         }
         if (fresh)
         {
             if (front.isTab)
             {
                 seq = 0xffff_ffff_ffff_fffeL;
             }
             else
             {
                 seq = 0x1L;
             }
             fresh = false;
         }
         else
         {
             seq <<= 1;
             if (!front.isTab)
             {
                 seq |= 0x1L;
             }
         }
     }

     // and the rest...
}
Apr 17 2014
parent reply "matovitch" <camille.brugel laposte.net> writes:
You are doing it all wrong. The easiest way to compute
the col position is the following :

col_pos = 0;

if (non_tab_character_encounter)
      col_pos++;
else
      col_pos += tab_length - col_pos % tab_length;

That's it.
Apr 17 2014
parent Alix Pexton <alix.DOT.pexton gmail.DOT.com> writes:
On 17/04/2014 8:20 PM, matovitch wrote:
 You are doing it all wrong. The easiest way to compute
 the col position is the following :

 col_pos = 0;

 if (non_tab_character_encounter)
       col_pos++;
 else
       col_pos += tab_length - col_pos % tab_length;

 That's it.

Tabs can have variable widths, with your technique the end only lines up with the desired column if tab_length matches the width of tabs being used to display the output, which is at best happenstance. By constructing a string that matches the pattern of characters that precede the column exactly, everything lines up no matter how wide the display tabs are. A... A...
Apr 18 2014
prev sibling parent Alix Pexton <alix.DOT.pexton gmail.DOT.com> writes:
A complete, "tested" and working proof of concept!

Pipe the output to a file and load it in an editor that allows you to 
mess with the size of tabs and no matter what width they have things 
will still line up (as long as the font is fixed width).

I'm pretty sure this is sub optimal even though I did change from my 
original design so that there is less going on on the path for 
characters and spaces which I assume would be the more common case.

A...

=====

import std.stdio;

enum lsb = 0x1L;
enum msb = 0x8000_0000_0000_0000L;
enum max = 0xFFFF_FFFF_FFFF_FFFFL;

struct columnRange
{
	private string input;
	private ulong pattern;
	private long check;

	this(string s)
	{
		input = s;
	}

	 property bool empty()
	{
		return input.length == 0;
	}

	 property string front()
	{
		return input;
	}

	columnRange save()
	{
		return this;
	}

	void popFront()
	{
		if (input.length > 0)
		{
			if (pattern == 0)
			{
				if (input[0] != 0x09)
				{
					pattern = ~lsb;
				}
				else
				{
					pattern = lsb;
				}
				input = input[1..$];
				++check;
				return;
			}
			pattern <<= 1;
			if (input[0] == 0x09)
			{
				pattern |= lsb;
			}
			input = input[1..$];
			++check;
		}
	}

	 property string indent()
	{
		if (pattern == 0)
		{
			return "";
		}
		if (check >= 64)
		{
			return "somewhere way way way down there                           --->";
		}

		auto copy = pattern;
		auto ret = "";
		size_t i;

		if ((pattern & msb) == 0)
		{
			ret ~= "\t";
			do
			{
				copy <<= 1;
				++i;
			} while ((copy & msb) == 0);
		}
		else
		{
			ret ~= " ";
			do
			{
				copy <<= 1;
				++i;
			} while ((copy & msb) != 0);
		}

		if (i < 64)
		{
			copy <<= 1;
			++i;
			while (i < 64)
			{
				if ((copy & msb) == 0)
				{
					ret ~= " ";
				}
				else
				{
					ret ~= "\t";
				}
				copy <<= 1;
				++i;
			}
		}
		return ret;
	}
}

void main()
{
	enum input1 = "1s in input1 \twill have 1 caret directly below them,\t 
1 per line, except this 1";

	writefln("%s", input1);
	auto r1 = columnRange(input1);

	while(!r1.empty)
	{
		if (r1.front[0] == '1')
		{
			writefln("%s^", r1.indent);
		}
		r1.popFront;
	}

}
Apr 18 2014