digitalmars.D - Arbitrary identifiers - syntax

reply Cecil Ward <cecil cecilward.com> writes:
This is an intentionally vague post about an idea without a clear 
solution, so this is not a concrete proposal, but is intended to 
solicit suggestions and ideas.

In mathematics or physics, you might have variables such as t and 
t′; the second character of the latter is U+2032 (prime), 
and there’s also a similar glyph at U+02B9. I posted a while back 
about the use of Unicode, and in that I was thinking about text 
in various non-English human languages. The docs say that D 
identifiers such as variable names are chosen from a subset of 
Unicode defined by an appendix of C99. This gives a massive list 
of acceptable characters in umpteen writing systems and human 
languages. How does D deal with that in the lexer? Enormous table 
lookup? I would be interested to know, compiler authors.

However, in maths many symbols, such as my earlier example, 
contain characters that are not legal in identifiers, as Unicode 
considers them to be maybe punctuation or a similar non-identifier 
category. How can we make D maths-friendly? Yes, we can and do write 
things like t_prime, but it doesn’t look great, and it’s 
long-winded. Yes, I hear you about the ease-of-use of Unicode, but 
that was discussed before and belongs to the earlier thread. Is 
there a way of allowing (almost) ‘arbitrary’ content in 
identifiers in D’s grammar? Think of the kind of syntax that 
exploits say "my file.ext"-type double quoting for otherwise 
unacceptable filenames such as this example one with a space in 
it.

Is it at all possible that a future D might have a mechanism like 
that to accommodate arbitrary identifiers for maths? Maybe even a 
kind of extensible lexer? Perhaps that is way too hard, and an easier 
but less attractive solution like the bracketing could be found. 
But whatever is suggested would have to be compact, neat and 
minimal so that mathematical equations could clearly resemble D 
statements and expressions.

I thought about all the imaginative literal string syntax that we 
already have, where a lot of work was done to make literal 
strings more workable in various use-cases.

I’d be very interested to hear suggestions as to how we do 
special relativity with t, t′, and then t″. It may simply be 
too hard to do it cleverly. I’m thinking about making D 
the most maths-friendly language. Let’s displace Fortran ;-). 
(We would need to make complex numbers friendlier for that though, 
maybe with more of the syntactic sugar brought back, but that’s 
another story.) I think it would possibly be a good idea to 
restrict ‘arbitrary’ characters to a certain subset, not allowing 
absolutely any Unicode character, so no whitespace, no control 
characters, no existing D tokens such as ‘=’, maybe disallow all 
punctuation characters that are already ‘taken’ in D, that is, 
already in use in the existing lexer’s grammar, but I’m unsure 
about that. What to do about ‘-’ hyphen-minus? It is allowed in 
some languages, such as XSLT and used there a lot. Perhaps ban it 
because of the confusion with minus for subtraction. I don’t 
know. It doesn’t seem to be used in physics, for that same reason.

Thoughts?
Jul 04 2023
parent "Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:
Ah huh!

This is something that I am very familiar with, as I'm updating dmd to 
use UAX31 identifiers (Unicode 15).

What you are wanting is called Medial.

The definition of a UAX31 identifier is: ``<Identifier> := <Start> 
<Continue>* (<Medial> <Continue>+)*``

For possible characters for Medial: 
https://unicode.org/reports/tr31/#Table_Optional_Medial

https://unicode.org/reports/tr31
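
To make the grammar above concrete, here is a minimal sketch in Python 
(chosen only for brevity; this is not dmd code). The ASCII classes used for 
Start and Continue are stand-ins for the real Unicode properties, and the 
Medial set is a hypothetical pick for illustration, so treat every name 
here as an assumption.

```python
import re

# Illustrative stand-ins only: real UAX31 uses Unicode properties such as
# XID_Start and XID_Continue, not these ASCII classes.
START = r"[A-Za-z_]"
CONTINUE = r"[A-Za-z0-9_]"
# Hypothetical Medial set for this sketch: apostrophe (U+0027) and
# hyphen (U+2010). Consult the UAX31 table for the real candidates.
MEDIAL = r"['\u2010]"

# <Identifier> := <Start> <Continue>* (<Medial> <Continue>+)*
IDENTIFIER = re.compile(rf"{START}{CONTINUE}*(?:{MEDIAL}{CONTINUE}+)*")


def is_identifier(text: str) -> bool:
    """Return True if the whole string matches the sketched grammar."""
    return IDENTIFIER.fullmatch(text) is not None
```

Note that the grammar as quoted requires at least one Continue character 
after each Medial, so in this sketch ``t'x`` matches but ``t'`` does not. 
A trailing prime as in t′ would therefore not be accepted through Medial 
alone, even if U+2032 were added to the Medial set.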

As for how to represent it... the way that dmd does it currently is with 
a ``wchar[2][]`` and then a binary search with a start + end. This of 
course isn't standard and is not the best.
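
As a rough illustration of the start + end approach described above, 
sketched in Python rather than D: a sorted table of inclusive ranges plus a 
binary search on the range starts. The table contents and the names 
``RANGES``, ``STARTS`` and ``in_ranges`` are hypothetical and much smaller 
than any real identifier table.

```python
from bisect import bisect_right

# Hypothetical, tiny table of inclusive (start, end) code point ranges,
# sorted by start and non-overlapping. This is not dmd's actual table.
RANGES = [(0x41, 0x5A), (0x61, 0x7A), (0xC0, 0xD6), (0x4E00, 0x9FFF)]
STARTS = [start for start, _ in RANGES]


def in_ranges(code_point: int) -> bool:
    """Binary search for the last range starting at or before code_point."""
    i = bisect_right(STARTS, code_point) - 1
    # i == -1 means code_point lies before the first range.
    return i >= 0 and code_point <= RANGES[i][1]
```

Each lookup costs one binary search plus one comparison against the end of 
the candidate range, for example ``in_ranges(ord("A"))`` is true while 
``in_ranges(0x2032)`` (the prime) is false with this table.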

The standard solution as per Unicode Demystified (strongly recommend 
buying it if you are interested in this subject) is to use an inversion 
list, which stores just the boundaries where membership flips (the start 
of each range and the first code point after it), and to use whether the 
index is odd or even to determine if a character is in the range or not. 
You would use a search algorithm such as binary search to do the lookup.
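
A minimal sketch of an inversion list lookup, again in Python and with a 
made-up table: the list holds only the code points at which membership 
flips, and the parity of the search result says whether a code point is 
inside a range. ``INVERSION_LIST`` and ``in_set`` are hypothetical names, 
and plain binary search is used here.

```python
from bisect import bisect_right

# Hypothetical inversion list. Entries at even indices start a range;
# entries at odd indices are the first code point after that range.
# This one encodes A-Z, a-z and U+00C0..U+00D6.
INVERSION_LIST = [0x41, 0x5B, 0x61, 0x7B, 0xC0, 0xD7]


def in_set(code_point: int) -> bool:
    """Count boundaries at or below code_point; an odd count means inside."""
    return bisect_right(INVERSION_LIST, code_point) % 2 == 1
```

Compared with a table of start + end pairs, this stores one flat sorted 
array and needs no second comparison after the search.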

I will be switching dmd over, should my C23 PR go in, to an inversion list 
+ Fibonacci search to take advantage of ASCII, BMP, then per plane 
probabilities. I've been talking about this quite a bit recently on 
Discord #langdev channel.
Jul 04 2023