I've been reading links such as this question and of course this question on preparing for the upcoming "utf8" char type char8_t and its corresponding string type in C++20, and I can say, up to a point, that it's about time. Also that it's a mess.
Feel free to correct me where I'm wrong:
C++, in any of its standards, has no means to specify that the source code has a given text encoding (something like Python's # encoding:... metadata), nor which standard it should be compiled under (like, say, #!/bin/env g++ -std=c++14).
Up until C++11, there was also no way to specify that any given string literal would have a given encoding - the compiler was free to reparse a UTF8 string literal into say UTF16 or even EBCDIC if it so desired.
C++11 introduces u16"text" and u32"text" and associated char types to produce UTF16 and UTF32-encoded text, but does not provide string or stream facilities to work with them, so they're basically useless.
C++11 also introduces u8"text" for producing an UTF8-encoded string...
but does not even introduce either a proper UTF8 char type or string type (that's what char8_t is intended to be in C++20?), so it's even uselesser than the above.
Because of all this, when char8_t is finally introduced, it kills lots of code that was intended to be valid and so far some of the remediations sought include disabling char8_t behaviour altogether.
Even then, there's no readily available tooling (as in: not the same crap tier interface as <random>) to check, transform (within the same string) or convert (copying across string types) text encodings in C++. Even codecvt seems to have been dropped.
Given all of the above, I have some questions about why we are in this weird state and whether it will ever get better. Historically, Unicode support has been one of the lowest points of C++.
Similarly, I am wondering how useful a poor-man's emulation of the whole concept would be (disclaimer: I am the maintainer of cxxomfort and already backport lots of things. Work needs: the latest MSVC target at the office is MSVC 2012).
Why did C++ not add char8_t at the proper time when u8"text" was introduced or otherwise delay introduction of u8?
Alternatively, why wasn't another, non-breaking prefix like c8"text" introduced with char8_t in C++20 instead of introducing a wide-scope breaking change? I thought TPTB hated breaking changes, even more something that literally breaks the simplest possible case: cout<< prefix"hello world".
Is char8_t intended to functionally be (closer to) an alias of unsigned char or of char?
If the former, is working my way up to e.g. typedef std::basic_string<unsigned char> u8string a viable emulation strategy? Are there backport / reference implementations available that one can look into before writing my own?
What's the closest we have in C++17-or-below to marking text as (intended to be) UTF-8 *for storage only*?
re: char8_t as unsigned char, this is more or less what I'm looking at in terms of pseudocode:
// this is here basically only for type-distinctiveness
class char8_t {
unsigned char value;
public:
constexpr char8_t (unsigned char ch = 0x00) noexcept; // deliberately not explicit
operator unsigned char () const noexcept;
// implement all operators to mirror operations on unsigned char
};
// public adapter, just in case
unsigned char to_char (char8_t);
// note we're *not* using our new char-type here
namespace std {
typedef std::basic_string<unsigned char> u8string;
}
// unsure if these two would actually be needed
// (couldn't make a compelling case so far,
// even testing with Windows's broken conhost)
namespace std {
extern basic_istream<char8_t> u8cin;
extern basic_ostream<char8_t> u8cout;
}
// we work up operator<<, operator>> and string conversion from there
// adding utf8-validity checks where needed
std::ostream& operator<< (std::ostream&, std::u8string const&);
std::istream& operator>> (std::istream&, std::u8string&);
// likely a macro; we'll see
#define u8c(ch) static_cast<char8_t>(ch)
// char8_t ch = u8c('x');
// very likely not a macro pre-C++20; can't skip utf-8 validity check on [2]?
u8string u8s (char8_t const* str); // [1], likely trivial
u8string u8s (char const* str); // [2], non-trivial
// C++20 and up
#define u8s(str) u8##str // or something; not sure
// end result:
// no, I can't even think how one would spell this:
u8string text = u8s("H€łlo Ẅørλd");
// this wouldn't work without refactoring u8string into a full specialization,
// to add the required constructor, but doing so is a PITA because
// the basic_string interface is YAIM (yet another infamous mess):
u8string text = u8"H€łlo Ẅørλd";
I've tagged this as general C++, but this is more about (the value of) implementations for standards pre-C++20. More importantly, I'm not looking for "perfect" solutions or justifications; given the context, poor-man's is more than good enough.
I'm the author of the P0482 and P1423 char8_t papers.
Also that it's a mess.
I completely agree. SG16 is working to improve all things Unicode and text related, but we're having to start near ground level, so it is going to take a while.
If you haven't seen it yet, the repository linked below provides some utilities for writing code that will work in C++17 and C++20.
https://github.com/tahonermann/char8_t-remediation
C++, in any of its standards, has no means to specify that the source code has a given text encoding (something like Python's # encoding:... metadata), nor which standard it should be compiled under (like, say, #!/bin/env g++ -std=c++14).
This is correct, but not without precedent. IBM's xlC compiler supports a #pragma filetag directive that behaves similarly to Python's encoding declaration. I started on a paper exploring this space and had hoped to submit it for the Prague meeting, but did not complete it in time. I expect to submit it for the Varna meeting (in June).
Up until C++11, there was also no way to specify that any given string literal would have a given encoding - the compiler was free to reparse a UTF8 string literal into say UTF16 or even EBCDIC if it so desired.
Correct, and this technically remained true for char16_t and char32_t string literals until C++20 and the adoption of P1041. Note though that there is no reparsing going on. In translation phase 1, the source code contents are converted to the compiler's internal encoding and then in translation phase 5, character and string literals are converted to the encoding of the appropriate execution character set.
C++11 introduces u16"text" and u32"text" and associated char types to produce UTF16 and UTF32-encoded text, but does not provide string or stream facilities to work with them, so they're basically useless.
Correct. P1629 is one of the more significant changes we're hoping to complete for C++23. The goal is to provide text encoders, decoders, and transcoders that facilitate working with text at the code unit and code point levels. We would also provide support for enumerating grapheme clusters.
C++11 also introduces u8"text" for producing an UTF8-encoded string... but does not even introduce either a proper UTF8 char type or string type (that's what char8_t is intended to be in C++20?), so it's even uselesser than the above.
Correct. The goal for C++20 was to 1) enable differentiating "text" and u8"text" in the type system, 2) enable separating locale dependent and UTF-8 text (with enforcement from the type system), 3) ensure use of an unsigned type for UTF-8 code units, and 4) avoid the char type aliasing penalty. That was all we had time to get done for C++20 (standardization is not a rapid process).
Because of all this, when char8_t is finally introduced, it kills lots of code that was intended to be valid and so far some of the remediations sought include disabling char8_t behaviour altogether.
Correct, char8_t was proposed as a breaking change; something not to be taken lightly. In this case, it was deemed acceptable because 1) code searches found little use of u8 character and string literals, 2) the options for addressing backward compatibility concerns as discussed in P1423 were considered adequate, and 3) a non-breaking proposal would have added long term baggage to the language for little gain.
Even then, there's no readily available tooling (as in: not the same crap tier interface as <random>) to check, transform (within the same string) or convert (copying across string types) text encodings in C++. Even codecvt seems to have been dropped.
Correct. We'll be working to improve this situation, but it will take time. codecvt has not been dropped (yet); the <codecvt> header and various UTF converters were deprecated in C++17. std::codecvt suffers from performance and usability issues, so is not considered something we can continue to build on. We believe P1629 is a superior direction.
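For context, the kind of usage covered by that C++17 deprecation looks like the following. This is only an illustrative sketch (the to_utf8 helper name is made up); it typically still compiles today, usually with a deprecation warning:
#include <codecvt>   // the header deprecated in C++17
#include <locale>
#include <string>

// hypothetical helper: convert UTF-32 to UTF-8 via the deprecated facilities
std::string to_utf8(const std::u32string& s)
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.to_bytes(s);
}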
Why did C++ not add char8_t at the proper time when u8"text" was introduced or otherwise delay introduction of u8?
I asked one of the C++ committee members who was involved in that original effort. He told me that he asked the people working on Unicode at the time if a new type should be added and the response was, "eh, we don't need it".
Alternatively, why wasn't another, non-breaking prefix like c8"text" introduced with char8_t in C++20 instead of introducing a wide-scope breaking change? I thought TPTB hated breaking changes, even more something that literally breaks the simplest possible case: cout<< prefix"hello world".
A different prefix was considered and at one point I briefly favored that approach. However, as mentioned earlier, that would have left us with two ways of spelling UTF-8 literals and related historical baggage. In the long run, it was felt that a breaking change, so long as we had reasonable means to mitigate the breakage, offered more benefits.
With regard to that simple test case, take a minute to think about what that code should do. Then go read this: What is the printf() formatting character for char8_t *?.
Is char8_t intended to functionally be (closer to) an alias of unsigned char or of char?
char8_t is intentionally and explicitly not an alias (because that has negative performance implications) but is specified to have the same underlying representation as unsigned char. The reason for unsigned char over char is to avoid expressions like u8'\x80' < 0 ever evaluating to true (which may or may not be the case with char today).
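A small C++20 snippet that makes those properties concrete (illustrative only):
#include <type_traits>

static_assert(!std::is_same_v<char8_t, char>);            // a distinct type, not an alias
static_assert(!std::is_same_v<char8_t, unsigned char>);   // of either narrow character type
static_assert(std::is_unsigned_v<char8_t>);               // unsigned representation
static_assert(sizeof(char8_t) == sizeof(unsigned char));  // same size as unsigned char
static_assert(u8'\x80' > 0);                              // never negative, unlike char '\x80' may be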
If the former, is working my way up to e.g. typedef std::basic_string<unsigned char> u8string a viable emulation strategy? Are there backport / reference implementations available that one can look into before writing my own?
I won't comment on whether this approach is a good idea or not, but it has been done before. For example, EASTL has such a typedef (that project also provides a definition of char8_t if the native type isn't available).
What's the closest we have in C++17-or-below to marking text as (intended to be) UTF-8 for storage only?
I don't think there is one right answer to this question. I've seen projects use unsigned char or provide a char8_t like type via a class.
With regard to your pseudocode, some tweaks to the code in the previously mentioned char8_t-remediation repository to provide unsigned char types instead of char should enable code like the following to work. See the definitions of the _as_char user-defined literals and U8 macro.
typedef std::basic_string<unsigned char> u8string;
u8string u8s(U8("text"));
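A rough sketch of how such a macro could be put together portably across C++11/14/17 and C++20 follows. This is illustrative only; it is not the repository's actual U8/_as_char code, and the u8s/U8S names here are made up:
#include <cstddef>
#include <string>

// note: relies on the implementation providing a usable std::char_traits<unsigned char>
typedef std::basic_string<unsigned char> u8string;

#if defined(__cpp_char8_t)
// C++20: u8"..." yields const char8_t[]; char8_t has the same representation
// as unsigned char, so copying the code units over is well-defined
inline u8string u8s(const char8_t* s, std::size_t n)
{
    return u8string(reinterpret_cast<const unsigned char*>(s), n);
}
#else
// C++11..17: u8"..." yields const char[]; the code units are still UTF-8
inline u8string u8s(const char* s, std::size_t n)
{
    return u8string(reinterpret_cast<const unsigned char*>(s), n);
}
#endif

// paste the u8 prefix onto the literal and drop the terminating NUL
#define U8S(lit) u8s(u8##lit, sizeof(u8##lit) - 1)

// usage: u8string text = U8S("text");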
I am in a C++ class right now so this question will concern itself primarily with that language, though I haven't been able to find any information for any other language either and I have a feeling whatever the answer is it's probably largely cross language.
In C++, unmarked numbers are assumed to be of integral type (4, for example, is an integer); various bounding marks allow the number to be interpreted differently ('4', for example, is a character, "4" a string).
As far as I know, there is only one kind of unary mark: the decimal point. ('4.' is a double).
I would like to create a new unary mark that designates a constant number in the code to be interpreted as a member of a created datatype. More fundamentally, I would like to know what the decimal point, the comma, the double quote, and the single quote are (they aren't operators, keywords, or statements, so what are they?) and how the compiler deals with/interprets them.
More information, if you feel it is necessary:
I am trying to make a complex number header that I can include in any project to do complex math. I am aware of the <complex> library but it is, IMHO, ugly and if used extensively slows down coding time. Also, I'm mostly trying to code this to improve my programming skills. My goal is to be able to declare a complex variable by doing something of the form cmplx num1 = 3 + 4i; where 3 and 4 are arbitrary and i is a mark, similar to the decimal point, which indicates that 4 is imaginary.
I would like to create a new unary mark that designates a constant number in the code to be interpreted as a member of a created datatype.
You can use user-defined literals, which were introduced in C++11. As an example, assuming you have a class type Type and you want to use the num_y syntax, where num is a NumericType (literal operators may only take certain parameter types, such as unsigned long long, long double, or const char*), you can do:
Type operator"" _y(NumericType i) {
return Type(i);
}
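To make this concrete for the complex-number case in the question, here is a possible sketch. The _i suffix is made up (user-provided suffixes must begin with an underscore); note that since C++14 the standard <complex> header already provides an operator""i for std::complex under std::literals::complex_literals:
#include <complex>
#include <iostream>

// hypothetical _i suffix for imaginary parts
std::complex<double> operator"" _i(long double v) {
    return std::complex<double>(0.0, static_cast<double>(v));
}
std::complex<double> operator"" _i(unsigned long long v) {
    return std::complex<double>(0.0, static_cast<double>(v));
}

int main() {
    std::complex<double> num1 = 3.0 + 4.0_i;  // (3,4)
    std::cout << num1 << '\n';                // prints (3,4)
}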
Things like 4, "4" and 4. are all single tokens, indivisible. There's no way you can add new tokens to the language. In C++11, it is possible to define user defined literals, but they still consist of several tokens; for complex, a much more natural solution would be to support a constant i, to allow writing things like 4 + 3*i. (But you'd still need the C++11 support for constexpr for it to be a compile time constant.)
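A quick sketch of that "constant i" approach (illustrative only; in real code the constant would live in a namespace rather than clobbering every loop variable named i):
#include <complex>

const std::complex<double> i(0.0, 1.0);

// 4 + 3*i works through the mixed double/complex operators in <complex>;
// making z a true compile-time constant would additionally need constexpr
// arithmetic on std::complex, which arrives only in C++20.
const std::complex<double> z = 4.0 + 3.0 * i;  // (4,3)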
Is defining enums in C++ with values set in random order valid? Is it used in any well-known code?
e.g.:
enum ESampleType{ STPositiveControl='P', STNegativeControl='N', STSample='S' };
VS2008 compiles without warnings. I wonder whether gcc would.
In my opinion, it at least harms the "least surprise rule", as iterating over all values using
for(int i=STPositiveControl; i<=STSample; ++i)
would fail.
P.S: The rationale for this approach:
In my DB application I'm defining wrapper methods. Some columns contain "constants" encoded as a single char. I'm trying to make it a little more type safe.
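For illustration, the kind of wrapper I have in mind is roughly this (just a sketch; the function names are made up):
#include <stdexcept>

enum ESampleType{ STPositiveControl='P', STNegativeControl='N', STSample='S' };

// map the raw char read from the column to the enum, rejecting unknown codes
ESampleType sampleTypeFromChar(char c)
{
    switch (c) {
    case 'P': return STPositiveControl;
    case 'N': return STNegativeControl;
    case 'S': return STSample;
    default:  throw std::invalid_argument("unknown sample type code");
    }
}

// the enumerator values are exactly the stored codes, so the reverse is a cast
char charToStore(ESampleType t) { return static_cast<char>(t); }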
It's a standard and widely used feature. Most of the time (but not always) it will be used to create bitmaps, but it can also be used as you show, to give "printable" values to the enum constants.
It shouldn't surprise anyone who knows C++, as it is widely used.
Yes it is valid.
There is an example use of it in the book The C++ Programming Language (3rd Edition) by Bjarne Stroustrup, section "6.1 A Desk Calculator [expr.calculator]" (and more precisely "6.1.1 The Parser [expr.parser]"), for the parser code of a simple arithmetic calculator. Here's an excerpt:
The parser uses a function get_token() to get input. The value of the most recent call of get_token() can be found in the global variable curr_tok. The type of curr_tok is the enumeration Token_value:
enum Token_value {
NAME, NUMBER, END,
PLUS='+', MINUS='-', MUL='*', DIV='/',
PRINT=';', ASSIGN='=', LP='(', RP=')'
};
Token_value curr_tok = PRINT;
Representing each token by the integer value of its character is convenient and efficient and can be a help to people using debuggers. This works as long as no character used as input has a value used as an enumerator – and no character set I know of has a printing character with a single-digit integer value. (...)
(That last sentence is specific to this example, because the enumeration mixes "default-value" and "explicit-value" enumerators and wants each one to be unique.)
However it's only an educative example (and notably it uses a global variable, and CAPS names for enumerators when you should reserve them for macros (but Stroustrup doesn't like macros :p)).
Now, indeed you can't iterate over it (at least with a plain for loop; but see this question). (And as James Kanze pointed out, an enum's values are not always ordered, contiguous, and unique.)
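One common workaround for the iteration issue (a sketch, not taken from the linked question) is to list the enumerators explicitly and loop over that list, since the underlying values are neither contiguous nor ordered:
#include <cstddef>

// reuses the ESampleType enum from the question
void forEachSampleType()
{
    const ESampleType all[] = { STPositiveControl, STNegativeControl, STSample };
    for (std::size_t i = 0; i != sizeof(all) / sizeof(all[0]); ++i) {
        ESampleType t = all[i];
        // ... use t ...
    }
}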
I frequently tutor fellow students in programming, most often in C++ or Java.
It is uniquely aggravating to try to verbally convey the essential syntax of a C++ expression. The speaker must give either an idiomatic translation into English, or a full specification of the code in verbal longhand, using explicit yet slow terms such as "opening parenthesis", "bitwise and", et cetera. Neither of these solutions is optimal.
In C++, there is a finite set of keywords—63—and operators—54, discounting named operators and treating compound assignment operators and prefix versus postfix auto-increment and decrement as distinct. There are just a few types of literal, a similar number of grouping symbols, and the semicolon. Unless I'm utterly mistaken, that's about it.
Would it not then be feasible to ascribe a concise, unique pronunciation to each of these distinct concepts (including one for whitespace, where it is required) and go from there? Programming languages are far more regular than natural languages, so the pronunciation could be standardised.
Instead of creating new "words" to describe them, for things such as "include" you could simply prefix it with "keyword" when saying it aloud. You could use words/phrases commonly known to say other parts as well. As with any new programmer, you have to literally describe everything anyway, so I don't think that requires special attention. I think creating new words is the harder method...
So, for example:
#include <iostream>
int main()
{
if (1 < 2)
return 1;
else
return 0;
}
Could be read out as:
(keyword) include iostream new-line
(keyword) int main no params start
block if number 1 (operator) less than
number 2 new-line (keyword) return
number 1 new-line (keyword) else
new-line (keyword) return number 0 end
block
Treat words in () as optional descriptive words, most likely to be used in more complex code. You could use the word 'literal' if you want them to actually write the descriptive word. For example
(keyword) if literal number (operator)
less than literal keyword
becomes
if (number < keyword)
Other words could be given defined meanings as well, such as 'split-line' when you want them to continue on the next line, without closing any currently open parenthesis, etc.
I personally find this method quite simple to use and easy to teach. YMMV, as always.
Of course, this doesn't solve the internationalisation issue, but at worst, would result in 'new words' being used in the non-English languages, which is no worse than the proposed solution you offered.
As a blind developer, programming since I was 13, I found this question really interesting. First of all, as mentioned by other people, learning a new language to be able to understand code is not a practical solution, as it would probably take longer to learn the spoken utterances than it would to learn the actual programming language.
Reading the question/answers, two further points occurred to me:
Firstly, you'd be surprised how important "thinking time" is. I have previously programmed in C/C++/Java and now use C# as my primary language, and consider myself very competent. But when I did a couple of projects in Python, I found the reduced punctuation robbed me of my "thinking time" - subconsciously, I was using the punctuation to digest what I'd just heard - fascinating... However, the situation is a bit different when it comes to identifiers, as these aren't well known by the listener - I personally find it hard to listen to code with acronym variables (RGXRatio, RGVRatio) as I don't have time to figure out what they mean. On the flip side, Hungarian notation and initial underscores make code hard to listen to, as the length of the variables (in terms of time taken to speak) is much longer than the more important operations being performed on those variables.
Another thing to consider is that the length of the audio stream is an end result, but not the root cause. The reason the audio is so long is that audio is a one-dimensional medium, whereas reading text is a 2D medium with the ability to jump around and skip past irrelevant/familiar text. It wouldn't work for a face-to-face lecture, but what if there were keyboard commands for controlling the speech? In text documents my screen reader lets me jump to the next line, but what if this were adapted to the semantics of a programming language? Some research, such as that by T. V. Raman at Google, includes using different voices for syntax highlighting, and audio cues to mark metadata like capitals.
I know the original question specifically related to a lecture given to a class, but if, like myself, you have to listen to entire files of source code, I also find the structure of the code makes a huge difference. I personally read code like a story - left to right, top to bottom. So it's very hard to trace through unfamiliar code when it's written bottom-up.
So would it not then be feasible to simply ascribe a concise, unique pronunciation to each of these distinct concepts (including one for whitespace, where it is required) and go from there? Programming languages are far more regular than natural languages, so the pronunciation could be standardised
Perhaps, but you've lost sight of your goal. The premise was that the person listening did not already know the language. If he does, we can simply say "include iostream" when we mean #include <iostream>, or "vector of int" when we mean std::vector<int>.
Your premise was that the person listening is not familiar enough with the language to understand what you read out loud unless you read out exactly what it says.
Now, inventing a whole new language just to describe the primitives that occur in your source code doesn't solve the problem. Instead, you still have to read out every syntactic token (with simpler, more "standardized" pronunciations, yes, but they still have to be read out loud), and the person listening still won't understand you, because if they don't know C++ well enough to understand "include iostream", they won't understand your standardized pronunciation either. And if you're going to teach them your pronunciation, why bother, when you could've just taught them to understand C++ syntax directly instead?
There's also the root problem that C++ code tends to consist of a lot of syntactic tokens. Take a line as simple as this:
std::vector<int> v;
I count eight tokens. Not one of them can be omitted. If the person listening does not understand the code and syntax well enough to understand a high-level description such as "declare a vector of int, named v", then you'll have to read out all eight tokens in some form. Even if you come up with simpler names than "namespace resolution operator" and "less than sign", you still have to list eight token names. Which is a lot of work.
In short, no, I don't think it'd work. First, it's still too cumbersome, and second, it's presuming prior knowledge on the part of the person listening, when the motivation for this was that the person listening was a student without the prior knowledge that made it possible to understand a high-level description of the code.
I was working with a new C++ developer a while back when he asked the question: "Why can't variable names start with numbers?"
I couldn't come up with an answer except that some numbers can have text in them (123456L, 123456U) and that wouldn't be possible if the compilers were thinking everything with some amount of alpha characters was a variable name.
Was that the right answer? Are there any more reasons?
string 2BeOrNot2Be = "that is the question"; // Why won't this compile?
Because then a string of digits would be a valid identifier as well as a valid number.
int 17 = 497;
int 42 = 6 * 9;
String 1111 = "Totally text";
Well think about this:
int 2d = 42;
double a = 2d;
What is a? 2.0? or 42?
Hint, if you don't get it: in languages such as Java or C#, d after a number means the number before it is a double literal.
It's a convention now, but it started out as a technical requirement.
In the old days, parsers of languages such as FORTRAN or BASIC did not require the use of spaces. So, basically, the following are identical:
10 V1=100
20 PRINT V1
and
10V1=100
20PRINTV1
Now suppose that numeral prefixes were allowed. How would you interpret this?
101V=100
as
10 1V = 100
or as
101 V = 100
or as
1 01V = 100
So, this was made illegal.
Because backtracking is avoided in lexical analysis while compiling. A variable like:
Apple;
the compiler will know it's an identifier right away when it meets the letter 'A'.
However a variable like:
123apple;
the compiler won't be able to decide whether it's a number or an identifier until it hits 'a', and it needs backtracking as a result.
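To illustrate the point, here is a tiny hypothetical classifier (not taken from any real compiler): because identifiers may not start with a digit, a single character of lookahead is enough to pick the scanning routine:
#include <cctype>

enum TokenKind { TK_Number, TK_Identifier, TK_Other };

// decide what kind of token starts with character c, using only that character
TokenKind classify(unsigned char c)
{
    if (std::isdigit(c))             return TK_Number;     // 123, 0x7F, 1.5e3, ...
    if (std::isalpha(c) || c == '_') return TK_Identifier; // apple, _tmp, x1
    return TK_Other;                                       // operators, punctuation, ...
}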
Compilers/parsers/lexical analyzers were a long, long time ago for me, but I think I remember there being difficulty in unambiguously determining whether a numeric character in the compilation unit represented a literal or an identifier.
Languages where space is insignificant (like ALGOL and the original FORTRAN if I remember correctly) could not accept numbers to begin identifiers for that reason.
This goes way back - before special notations to denote storage or numeric base.
I agree it would be handy to allow identifiers to begin with a digit. One or two people have mentioned that you can get around this restriction by prepending an underscore to your identifier, but that's really ugly.
I think part of the problem comes from number literals such as 0xdeadbeef, which make it hard to come up with easy-to-remember rules for identifiers that can start with a digit. One way to do it might be to allow anything matching [0-9A-Za-z_]+ that is NOT a keyword or number literal. The problem is that it would lead to weird things like 0xdeadpork being allowed, but not 0xdeadbeef. Ultimately, I think we should be fair to all meats :P.
When I was first learning C, I remember feeling the rules for variable names were arbitrary and restrictive. Worst of all, they were hard to remember, so I gave up trying to learn them. I just did what felt right, and it worked pretty well. Now that I've learned a lot more, it doesn't seem so bad, and I finally got around to learning it right.
It's likely a decision that came for a few reasons, when you're parsing the token you only have to look at the first character to determine if it's an identifier or literal and then send it to the correct function for processing. So that's a performance optimization.
The other option would be to check if it's not a literal and leave the domain of identifiers to be the universe minus the literals. But to do this you would have to examine every character of every token to know how to classify it.
There is also the stylistic implications identifiers are supposed to be mnemonics so words are much easier to remember than numbers. When a lot of the original languages were being written setting the styles for the next few decades they weren't thinking about substituting "2" for "to".
Variable names cannot start with a digit, because it can cause some problems like below:
int a = 2;
int 2 = 5;
int c = 2 * a;
What is the value of c? Is it 4, or is it 10?
another example:
float 5 = 25;
float b = 5.5;
Is the first 5 a number, or is it an object (followed by the . operator)?
There is a similar problem with the second 5.
Maybe there are some other reasons, too. So, we shouldn't use any digit at the beginning of a variable name.
The restriction is arbitrary. Various Lisps permit symbol names to begin with numerals.
COBOL allows variables to begin with a digit.
Use of a digit to begin a variable name makes error checking during compilation or interpretation a lot more complicated.
Allowing use of variable names that began like a number would probably cause huge problems for the language designers. During source code parsing, whenever a compiler/interpreter encountered a token beginning with a digit where a variable name was expected, it would have to search through a huge, complicated set of rules to determine whether the token was really a variable, or an error. The added complexity added to the language parser may not justify this feature.
As far back as I can remember (about 40 years), I don't think that I have ever used a language that allowed use of a digit to begin variable names. I'm sure that this was done at least once. Maybe, someone here has actually seen this somewhere.
As several people have noticed, there is a lot of historical baggage about valid formats for variable names. And language designers are always influenced by what they know when they create new languages.
That said, pretty much all of the time a language doesn't allow variable names to begin with numbers is because those are the rules of the language design. Often it is because such a simple rule makes the parsing and lexing of the language vastly easier. Not all language designers know this is the real reason, though. Modern lexing tools help, because if you tried to define it as permissible, they will give you parsing conflicts.
OTOH, if your language has a uniquely identifiable character to herald variable names, it is possible to set it up for them to begin with a number. Similar rule variations can also be used to allow spaces in variable names. But the resulting language is likely to not to resemble any popular conventional language very much, if at all.
For an example of a fairly simple HTML templating language that does permit variables to begin with numbers and have embedded spaces, look at Qompose.
Because if you allowed keywords and identifiers to begin with numeric characters, the lexer (part of the compiler) couldn't readily differentiate between the start of a numeric literal and a keyword without getting a whole lot more complicated (and slower).
C++ can't have it because the language designers made it a rule. If you were to create your own language, you could certainly allow it, but you would probably run into the same problems they did and decide not to allow it. Examples of variable names that would cause problems:
0x, 2d, 5555
One of the key problems about relaxing syntactic conventions is that it introduces cognitive dissonance into the coding process. How you think about your code could be deeply influenced by the lack of clarity this would introduce.
Wasn't it Dijkstra who said that the "most important aspect of any tool is its effect on its user"?
The compiler has 7 phases, as follows:
Lexical analysis
Syntax Analysis
Semantic Analysis
Intermediate Code Generation
Code Optimization
Code Generation
Symbol Table
Backtracking is avoided in the lexical analysis phase while compiling the piece of code. For a variable like Apple, the compiler will know it's an identifier right away when it meets the letter 'A' in the lexical analysis phase. However, for a variable like 123apple, the compiler won't be able to decide whether it's a number or an identifier until it hits 'a', and it would need backtracking within the lexical analysis phase to identify it as a variable. But that is not supported in the compiler.
When you’re parsing the token you only have to look at the first character to determine if it’s an identifier or literal and then send it to the correct function for processing. So that’s a performance optimization.
Probably because it makes it easier for the human to tell whether it's a number or an identifier, and because of tradition. Having identifiers that could begin with a digit wouldn't complicate the lexical scans all that much.
Not all languages have forbidden identifiers beginning with a digit. In Forth, they could be numbers, and small integers were normally defined as Forth words (essentially identifiers), since it was faster to read "2" as a routine to push a 2 onto the stack than to recognize "2" as a number whose value was 2. (In processing input from the programmer or the disk block, the Forth system would split up the input according to spaces. It would try to look the token up in the dictionary to see if it was a defined word, and if not would attempt to translate it into a number, and if not would flag an error.)
Suppose you did allow symbol names to begin with numbers. Now suppose you want to name a variable 12345foobar. How would you differentiate this from 12345? It's actually not terribly difficult to do with a regular expression. The problem is actually one of performance. I can't really explain why this is in great detail, but it essentially boils down to the fact that differentiating 12345foobar from 12345 requires backtracking. This makes the regular expression non-deterministic.
There's a much better explanation of this here.
It is easy for a compiler to identify a variable by the ASCII characters at a memory location rather than by a number.
I think the simple answer is that it can, the restriction is language based. In C++ and many others it can't because the language doesn't support it. It's not built into the rules to allow that.
The question is akin to asking why the King can't move four spaces at a time in chess. It's because in chess that is an illegal move. Can it in another game? Sure. It just depends on the rules being played by.
Originally it was simply because it is easier to remember (you can give it more meaning) variable names as strings rather than numbers, although numbers can be included within the string to enhance the meaning of the string, or to allow the use of the same variable name while designating it as having a separate but closely related meaning or context. For example, loop1, loop2, etc. would always let you know that you were in a loop and/or that loop2 was a loop within loop1.
Which would you prefer (has more meaning) as a variable: address or 1121298? Which is easier to remember?
However, if the language uses something to denote that it not just text or numbers (such as the $ in $address) it really shouldn't make a difference as that would tell the compiler that what follows is to be treated as a variable (in this case).
In any case it comes down to what the language designers want to use as the rules for their language.
The variable may also be treated as a value at compile time by the compiler, so the value may end up referring to the value again and again, recursively.
Backtracking is avoided in the lexical analysis phase while compiling the piece of code. For a variable like Apple;, the compiler will know it's an identifier right away when it meets the letter 'A' in the lexical analysis phase. However, for a variable like 123apple;, the compiler won't be able to decide whether it's a number or an identifier until it hits 'a', and it would need backtracking within the lexical analysis phase to identify that it is a variable. But that is not supported in the compiler.
There might be nothing wrong with it when it comes to declaring a variable, but there is some ambiguity when you try to use that variable somewhere else, like this:
let 1 = "Hello world!"
print(1)
print(1)
print is a generic method that accepts all types of variables, so in that situation the compiler does not know which 1 the programmer refers to: the 1 with the integer value, or the 1 that stores a string value.
Maybe it would be better for the compiler in this situation to allow defining something like that, but then to raise an error when this ambiguous name is used, together with a suggestion for how to fix the error and clear up the ambiguity.