Why can't variable names start with numbers? - c++

I was working with a new C++ developer a while back when he asked the question: "Why can't variable names start with numbers?"
I couldn't come up with an answer except that some numbers can have text in them (123456L, 123456U) and that wouldn't be possible if the compilers were thinking everything with some amount of alpha characters was a variable name.
Was that the right answer? Are there any more reasons?
string 2BeOrNot2Be = "that is the question"; // Why won't this compile?

Because then a string of digits would be a valid identifier as well as a valid number.
int 17 = 497;
int 42 = 6 * 9;
String 1111 = "Totally text";

Well think about this:
int 2d = 42;
double a = 2d;
What is a? 2.0? or 42?
Hint, if you don't get it: in languages such as Java and C#, a d after a number means the number before it is a double literal.

It's a convention now, but it started out as a technical requirement.
In the old days, parsers of languages such as FORTRAN or BASIC did not require the use of spaces. So, basically, the following are identical:
10 V1=100
20 PRINT V1
and
10V1=100
20PRINTV1
Now suppose that numeral prefixes were allowed. How would you interpret this?
101V=100
as
10 1V = 100
or as
101 V = 100
or as
1 01V = 100
So, this was made illegal.

Because backtracking is avoided in lexical analysis while compiling. With a variable like:
Apple;
the compiler knows it's an identifier right away when it meets the letter 'A'.
However, with a variable like:
123apple;
the compiler can't decide whether it's a number or an identifier until it hits 'a', so it needs backtracking.
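A minimal sketch of the one-character decision described above, assuming a toy lexer. The names TokenKind and classifyByFirstChar are hypothetical and not taken from any real compiler.

```cpp
#include <cctype>
#include <string>

// Hypothetical token categories for a toy lexer.
enum class TokenKind { Identifier, Number, Other };

// Decide the token category from the first character alone.
// This only works because identifiers may not start with a digit.
TokenKind classifyByFirstChar(const std::string& text) {
    if (text.empty()) return TokenKind::Other;
    const unsigned char first = static_cast<unsigned char>(text[0]);
    if (std::isalpha(first) || first == '_') return TokenKind::Identifier;
    if (std::isdigit(first)) return TokenKind::Number;
    return TokenKind::Other;
}
```

With this rule, 123apple is committed to the number path as soon as '1' is seen; a real lexer would then report it as a malformed number rather than go back and reconsider.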

It has been a long, long time since I worked with compilers, parsers, and lexical analyzers, but I think I remember there being difficulty in unambiguously determining whether a numeric character in the compilation unit represented a literal or an identifier.
Languages where space is insignificant (like ALGOL and the original FORTRAN if I remember correctly) could not accept numbers to begin identifiers for that reason.
This goes way back - before special notations to denote storage or numeric base.

I agree it would be handy to allow identifiers to begin with a digit. One or two people have mentioned that you can get around this restriction by prepending an underscore to your identifier, but that's really ugly.
I think part of the problem comes from number literals such as 0xdeadbeef, which make it hard to come up with easy-to-remember rules for identifiers that can start with a digit. One way to do it might be to allow anything matching [A-Za-z0-9_]+ that is NOT a keyword or number literal. The problem is that it would lead to weird things like 0xdeadpork being allowed, but not 0xdeadbeef. Ultimately, I think we should be fair to all meats :P.
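To make the meat problem concrete, here is a rough sketch of that relaxed rule. Everything in it is an assumption for illustration: the function names are made up, keywords are not checked, and only plain decimal and hexadecimal literals are recognized.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// True if s looks like a hexadecimal literal such as 0xdeadbeef.
// Integer suffixes (u, l, ...) are deliberately ignored in this sketch.
bool isHexLiteral(const std::string& s) {
    if (s.size() < 3 || s[0] != '0' || (s[1] != 'x' && s[1] != 'X')) return false;
    for (std::size_t i = 2; i < s.size(); ++i) {
        if (!std::isxdigit(static_cast<unsigned char>(s[i]))) return false;
    }
    return true;
}

// True if s consists only of decimal digits.
bool isDecimalLiteral(const std::string& s) {
    if (s.empty()) return false;
    for (unsigned char c : s) {
        if (!std::isdigit(c)) return false;
    }
    return true;
}

// Hypothetical relaxed rule: any run of letters, digits and underscores
// that is not a number literal counts as an identifier.
bool isIdentifierUnderRelaxedRule(const std::string& s) {
    if (s.empty()) return false;
    for (unsigned char c : s) {
        if (!std::isalnum(c) && c != '_') return false;
    }
    return !isHexLiteral(s) && !isDecimalLiteral(s);
}
```

Under this sketch 0xdeadpork and 2BeOrNot2Be are accepted as identifiers, while 0xdeadbeef and 5555 are rejected, which is exactly the hard-to-remember behaviour complained about above.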
When I was first learning C, I remember feeling the rules for variable names were arbitrary and restrictive. Worst of all, they were hard to remember, so I gave up trying to learn them. I just did what felt right, and it worked pretty well. Now that I've learned a lot more, it doesn't seem so bad, and I finally got around to learning it right.

It's likely a decision that came about for a few reasons. When you're parsing a token, you only have to look at the first character to determine whether it's an identifier or a literal, and then send it to the correct function for processing. So that's a performance optimization.
The other option would be to check if it's not a literal and leave the domain of identifiers to be the universe minus the literals. But to do this you would have to examine every character of every token to know how to classify it.
There are also stylistic implications: identifiers are supposed to be mnemonics, so words are much easier to remember than numbers. When a lot of the original languages were being written, setting the styles for the next few decades, their designers weren't thinking about substituting "2" for "to".

Variable names cannot start with a digit, because it can cause some problems like below:
int a = 2;
int 2 = 5;
int c = 2 * a;
What is the value of c? Is it 4, or is it 10?
Another example:
float 5 = 25;
float b = 5.5;
Is the first 5 a number, or is it an object (with the . operator)?
There is a similar problem with the second 5.
Maybe there are some other reasons, too. So we shouldn't use a digit at the beginning of a variable name.

The restriction is arbitrary. Various Lisps permit symbol names to begin with numerals.

COBOL allows variables to begin with a digit.

Use of a digit to begin a variable name makes error checking during compilation or interpretation a lot more complicated.
Allowing variable names that begin like a number would probably cause huge problems for the language designers. During source code parsing, whenever a compiler/interpreter encountered a token beginning with a digit where a variable name was expected, it would have to search through a huge, complicated set of rules to determine whether the token was really a variable, or an error. The complexity added to the language parser may not justify this feature.
As far back as I can remember (about 40 years), I don't think that I have ever used a language that allowed use of a digit to begin variable names. I'm sure that this was done at least once. Maybe, someone here has actually seen this somewhere.

As several people have noticed, there is a lot of historical baggage about valid formats for variable names. And language designers are always influenced by what they know when they create new languages.
That said, pretty much whenever a language doesn't allow variable names to begin with numbers, it is because those are the rules of the language design. Often it is because such a simple rule makes the parsing and lexing of the language vastly easier. Not all language designers know this is the real reason, though. Modern lexing tools help, because if you try to define it as permissible, they will give you parsing conflicts.
OTOH, if your language has a uniquely identifiable character to herald variable names, it is possible to set it up for them to begin with a number. Similar rule variations can also be used to allow spaces in variable names. But the resulting language is likely not to resemble any popular conventional language very much, if at all.
For an example of a fairly simple HTML templating language that does permit variables to begin with numbers and have embedded spaces, look at Qompose.

Because if you allowed keywords and identifiers to begin with numeric characters, the lexer (part of the compiler) couldn't readily differentiate between the start of a numeric literal and a keyword without getting a whole lot more complicated (and slower).

C++ can't have it because the language designers made it a rule. If you were to create your own language, you could certainly allow it, but you would probably run into the same problems they did and decide not to allow it. Examples of variable names that would cause problems:
0x, 2d, 5555

One of the key problems about relaxing syntactic conventions is that it introduces cognitive dissonance into the coding process. How you think about your code could be deeply influenced by the lack of clarity this would introduce.
Wasn't it Dijkstra who said that the "most important aspect of any tool is its effect on its user"?

A compiler is usually described as having six phases, all of which share a symbol table:
Lexical analysis
Syntax analysis
Semantic analysis
Intermediate code generation
Code optimization
Code generation
Backtracking is avoided in the lexical analysis phase while compiling a piece of code. With a variable like Apple, the compiler knows it's an identifier right away when it meets the letter ‘A’ in the lexical analysis phase. However, with a variable like 123apple, the compiler can't decide whether it's a number or an identifier until it hits ‘a’, and it would need to backtrack in the lexical analysis phase to identify it as a variable. Compilers do not support that.
When you’re parsing a token, you only have to look at the first character to determine whether it’s an identifier or a literal, and then send it to the correct function for processing. So that’s a performance optimization.

Probably because it makes it easier for the human to tell whether it's a number or an identifier, and because of tradition. Having identifiers that could begin with a digit wouldn't complicate the lexical scans all that much.
Not all languages have forbidden identifiers beginning with a digit. In Forth, they could be numbers, and small integers were normally defined as Forth words (essentially identifiers), since it was faster to read "2" as a routine to push a 2 onto the stack than to recognize "2" as a number whose value was 2. (In processing input from the programmer or the disk block, the Forth system would split up the input according to spaces. It would try to look the token up in the dictionary to see if it was a defined word, and if not would attempt to translate it into a number, and if not would flag an error.)
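The lookup order described in the parenthetical can be sketched roughly as follows. This is not how any particular Forth system is implemented; the names Resolution and resolveToken and the use of a std::set as the dictionary are assumptions made for illustration.

```cpp
#include <cstdlib>
#include <set>
#include <string>

// Hypothetical outcome of resolving one whitespace-delimited token.
enum class Resolution { DefinedWord, Number, Error };

// Try the dictionary first, then fall back to number parsing,
// and flag an error if neither works.
Resolution resolveToken(const std::set<std::string>& dictionary,
                        const std::string& token) {
    if (dictionary.count(token) > 0) return Resolution::DefinedWord;
    if (token.empty()) return Resolution::Error;
    char* end = nullptr;
    std::strtol(token.c_str(), &end, 10);
    // The token is a number only if strtol consumed every character.
    if (end != token.c_str() && *end == '\0') return Resolution::Number;
    return Resolution::Error;
}
```

Because the dictionary is consulted first, a word named 2 shadows the number 2, which is why digits at the start of a name cause no lexing trouble in this scheme.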

Suppose you did allow symbol names to begin with numbers. Now suppose you want to name a variable 12345foobar. How would you differentiate this from 12345? It's actually not terribly difficult to do with a regular expression. The problem is actually one of performance. I can't really explain why this is in great detail, but it essentially boils down to the fact that differentiating 12345foobar from 12345 requires backtracking. This makes the regular expression non-deterministic.
There's a much better explanation of this here.

It is easier for a compiler to identify a variable by an ASCII letter than by a number.

I think the simple answer is that it can, the restriction is language based. In C++ and many others it can't because the language doesn't support it. It's not built into the rules to allow that.
The question is akin to asking why the King can't move four spaces at a time in Chess. It's because in Chess that is an illegal move. Can it in another game? Sure. It just depends on the rules being played by.

Originally it was simply because it is easier to remember (you can give it more meaning) variable names as strings rather than numbers although numbers can be included within the string to enhance the meaning of the string or allow the use of the same variable name but have it designated as having a separate, but close meaning or context. For example loop1, loop2 etc would always let you know that you were in a loop and/or loop 2 was a loop within loop1.
Which would you prefer (has more meaning) as a variable: address or 1121298? Which is easier to remember?
However, if the language uses something to denote that it is not just text or numbers (such as the $ in $address), it really shouldn't make a difference, as that would tell the compiler that what follows is to be treated as a variable (in this case).
In any case it comes down to what the language designers want to use as the rules for their language.

The variable may be considered as a value also during compile time by the compiler
so the value may call the value again and again recursively


There may be nothing wrong with it when it comes to declaring a variable, but there is some ambiguity when you try to use that variable somewhere else, like this:
let 1 = "Hello world!"
print(1)
print(1)
print is a generic method that accepts all types of variables, so in that situation the compiler does not know which 1 the programmer is referring to: the integer value 1, or the variable 1 that stores a string value.
Maybe it would be better for the compiler to allow such a definition, but to report an error, with a suggested fix to clear up the ambiguity, when the ambiguous name is used.

Related

what exactly is a token, in relation to parsing

I have to use a parser and writer in C++. I am trying to implement the functions; however, I do not understand what a token is. One of my functions/operations is to check whether there are more tokens to produce:
bool Parser::hasMoreTokens()
How exactly do I go about this? Please help.
So!
I am opening a text file with text in it; all words are lowercase. How do I go about checking whether it has more tokens?
This is what I have:
bool Parser::hasMoreTokens() {
while(source.peek()!=NULL){
return true;
}
return false;
}
Tokens are the output of lexical analysis and the input to parsing. Typically they are things like
numbers
variable names
parentheses
arithmetic operators
statement terminators
That is, roughly, the biggest things that can be unambiguously identified by code that just looks at its input one character at a time.
One note, which you should feel free to ignore if it confuses you: The boundary between lexical analysis and parsing is a little fuzzy. For instance:
Some programming languages have complex-number literals that look, say, like 2+3i or 3.2e8-17e6i. If you were parsing such a language, you could make the lexer gobble up a whole complex number and make it into a token; or you could have a simpler lexer and a more complicated parser, and make (say) 3.2e8, -, 17e6i be separate tokens; it would then be the parser's job (or even the code generator's) to notice that what it's got is really a single literal.
In some programming languages, the lexer may not be able to tell whether a given token is a variable name or a type name. (This happens in C, for instance.) But the grammar of the language may distinguish between the two, so that you'd like "variable foo" and "type name foo" to be different tokens. (This also happens in C.) In this case, it may be necessary for some information to be fed back from the parser to the lexer so that it can produce the right sort of token in each case.
So "what exactly is a token?" may not always have a perfectly well defined answer.
A token is whatever you want it to be. Traditionally (and for
good reasons), language specifications broke the analysis into
two parts: the first part broke the input stream into tokens,
and the second parsed the tokens. (Theoretically, I think you
can write any grammar in only a single level, without using
tokens—or what is the same thing, using individual
characters as tokens. I wouldn't like to see the results of
that for a language like C++, however.) But the definition of
what a token is depends entirely on the language you are
parsing: most languages, for example, treat white space as
a separator (but not Fortran); most languages will predefine
a set of punctuation/operators using punctuation characters, and
not allow these characters in symbols (but not COBOL, where
"abc-def" would be a single symbol). In some cases (including
in the C++ preprocessor), what is a token depends on context, so
you may need some feedback from the parser. (Hopefully not;
that sort of thing is for very experienced programmers.)
One thing is probably sure (unless each character is a token):
you'll have to read ahead in the stream. You typically can't
tell whether there are more tokens by just looking at a single
character. I've generally found it useful, in fact, for the
tokenizer to read a whole token at a time, and keep it until the
parser needs it. A function like hasMoreTokens would in fact
scan a complete token.
(And while I'm at it, if source is an istream:
istream::peek does not return a pointer, but an int.)
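As a rough illustration of the read-ahead approach, here is a minimal sketch that assumes tokens are simply whitespace-separated words, as in the question's lowercase text file. Only the names Parser, hasMoreTokens and source come from the question; nextToken, pending and buffered are hypothetical.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical whitespace-delimited tokenizer.
class Parser {
public:
    explicit Parser(std::istream& in) : source(in) {}

    // Reads ahead one whole token (if none is buffered) and reports
    // whether that succeeded.
    bool hasMoreTokens() {
        if (!buffered) {
            buffered = static_cast<bool>(source >> pending);
        }
        return buffered;
    }

    // Returns the buffered token, or an empty string at end of input.
    std::string nextToken() {
        if (!hasMoreTokens()) return std::string();
        buffered = false;
        return pending;
    }

private:
    std::istream& source;
    std::string pending;
    bool buffered = false;
};
```

Constructed over a std::istringstream holding "to be or not " (note the trailing space), this yields four tokens and then reports no more. Peeking at a single character would see the trailing space and wrongly suggest that another token follows.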
A token is the smallest unit of a programming language that has a meaning. A parenthesis (, a name foo, an integer 123, are all tokens. Reducing a text to a series of tokens is generally the first step of parsing it.
A token is usually akin to a word in spoken language. In C++, (int, float, 5.523, const) will be tokens. It is the minimal unit of text which constitutes a semantic element.
When you split a large unit (long string) into a group of sub-units (smaller strings), each of the sub-units (smaller strings) is referred to as a "token". If there are no more sub-units, then you are done parsing.
How do I tokenize a string in C++?
A token is a terminal in a grammar, a sequence of one or more symbol(s) that is defined by the sequence itself, i.e. it does not derive from any other production defined in the grammar.

Where does the k prefix for constants come from?

it's a pretty common practice that constants are prefixed with k (e.g. k_pi). But what does the k mean?
Is it simply that c already meant char?
It's a historical oddity, still common practice among teams who like to blindly apply coding standards that they don't understand.
Long ago, most commercial programming languages were weakly typed; automatic type checking, which we take for granted now, was still mostly an academic topic. This meant that it was easy to write code with category errors; it would compile and run, but go wrong in ways that were hard to diagnose. To reduce these errors, a chap called Simonyi suggested that you begin each variable name with a tag to indicate its (conceptual) type, making it easier to spot when they were misused. Since he was Hungarian, the practise became known as "Hungarian notation".
Some time later, as typed languages (particularly C) became more popular, some idiots heard that this was a good idea, but didn't understand its purpose. They proposed adding redundant tags to each variable, to indicate its declared type. The only use for them is to make it easier to check the type of a variable; unless someone has changed the type and forgotten to update the tag, in which case they are actively harmful.
The second (useless) form was easier to describe and enforce, so it was blindly adopted by many, many teams; decades later, you still see it used, and even advocated, from time to time.
"c" was the tag for type "char", so it couldn't also be used for "const"; so "k" was chosen, since that's the first letter of "konstant" in German, and is widely used for constants in mathematics.
I haven't seen it that much, but maybe it comes from certain languages' (the germanic ones in particular) spelling of the word constant - konstant.
Don't use Hungarian Notation. If you want constants to stand out, make them all caps.
As a side note: there are a lot of things in the Google Coding Standards that are poor practice (in terms of code readability). That is what happens when you design a coding standard by committee.
It means the value is k-onstant.
I think mathematical convention was the precedent. k is used in maths all the time as just some constant.
K stands for konstant, a wordplay on constant. It relates to Coding Styles.
It's just a matter of preference, some people and projects use them which means they also embrace the Hungarian notation, many don't. That's not that important.
If you're unsure what a prefix or style might mean, always check if the project has a coding style reference and read that.
Actually, whenever I define constants in typescript, I do something like this -
NODE_ENV = 'production';
But recently, I saw that the k prefix is being used in the Flutter SDK. It makes sense to me to keep using the k prefix cuz' it helps your editor/IDE in searching out constants in your codebase.
It's a convention, probably from math. But there are other suggestions for constants too; for example, Kernighan and Ritchie in their book "The C Programming Language" suggest writing constants' names in capital letters (e.g. #define MAX 55).
I think, it means coefficient (as k in math means)

Semicolon in C++?

Is the "missing semicolon" error really required? Why not treat it as a warning?
When I compile this code
int f = 1
int h=2;
the compiler intelligently tells me where I am missing it. But to me it's like: "If you know it, just treat it as if it's there and go ahead." (Later I can fix the warning.)
int sdf = 1, df=2;
sdf=1 df =2
Even for this code, it behaves the same. That is, even if multiple statements (without ;) are in the same line, the compiler knows.
So, why not just remove this requirement? Why not behave like Python, Visual Basic, etc.
Summary of discussion
Two examples were raised where a missing semicolon would actually cause a problem.
1.
return
(a+b)
This was presented as one of the worst aspects of JavaScript. But, in this scenario, semicolon insertion is a problem for JavaScript, but not for C++. In C++, you will get another error if ; insertion is done after return. That is, a missing return value.
2.
int *y;
int f = 1
*y = 2;
For this, I guess there is no better way than to introduce a statement separator, that is, a semicolon.
It's very good that the C++ compiler doesn't do this. One of the worst aspects of JavaScript is the semicolon insertion. Picture this:
return
(a + b);
The C++ compiler will happily continue on the next line as expected, while a language that "inserts" semicolons, like JavaScript, will treat it as "return;" and miss out the "(a + b);".
Instead of relying on compiler error-fixing, make it a habit to use semicolons.
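For reference, the C++ side of that comparison looks like this as a complete function. The name sum is only an illustrative choice.

```cpp
// The return statement continues across the line break;
// the semicolon, not the newline, ends it.
int sum(int a, int b) {
    return
        (a + b);
}
```

Calling sum(2, 3) gives 5, because the compiler treats the two lines as the single statement return (a + b);.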
There are many cases where a semicolon is needed.
What if you had:
int *y;
int f = 1
*y = 2;
This would be parsed as
int *y;
int f = 1 * y = 2;
So without the semicolons it is ambiguous.
First, this is only a small example; are you sure the compiler can intelligently tell you what's wrong for more complex code? For any piece of code? Could all compilers intelligently recognize this in the same way, so that a piece of C++ code could be guaranteed portable with missing semicolons?
Second, C++ was created decades ago, when computing resources weren't nearly what they are now. Even today, builds can take a considerable amount of time. Semicolons help to clearly demarcate different commands (for the user and for the compiler!) and assist both the programmer and the compiler in understanding what's happening.
; is for the programmer's convenience. If the line of code is very long then we can press enter and go to second line because we have ; for line separator. It is programming conventions. There must be a line separator.
Having semi-colons (or line breaks, pick one) makes the compiler vastly simpler and error messages more readable.
But contrary to what other people have said, neither form of delimiters (as an absolute) is strictly necessary.
Consider, for example, Haskell, which doesn’t have either. Even the current version of VB allows line breaks in many places inside a statement, as does Python. Neither requires line continuations in many places.
For example, VB now allows the following code:
Dim result = From element in collection
Where element < threshold
Select element
No statement delimiters, no line continuations, and yet no ambiguities whatsoever.
Theoretically, this could be driven much further. All ambiguities can be eliminated (again, look at Haskell) by introducing some rules. But again, this makes the parser much more complicated (it has to be context sensitive in a lot of places, e.g. your return example, which cannot be resolved without first knowing the return type of the function). And again, it makes it much harder to output meaningful diagnostics since an erroneous line break could mean any of several things so the compiler cannot know which error the user has made, and not even where the error was made.
In C programs semicolons are statement terminators, not separators. You might want to read this fun article.
+1 to you both.
The semi-colon is a command line delimiter, unlike VB, python etc. C and C++ ignore white space within lines of code including carriage returns! This was originally because at inception of C computer monitors could only cope with 80 characters of text and as C++ is based on the C specification it followed suit.
I could post up the question "Why must I keep getting errors about missing \ characters in VB when I try and write code over several lines, surely if VB knows of the problem it can insert it?"
Auto insertion as has already been pointed out could be a nightmare, especially on code that wraps onto a second line.
I won't extend much of the need for semi-colon vs line continuation characters, both have advantages and disadvantages and in the end it's a simple language design choice (even though it affects all the users).
I am more worried about the suggestion for the compiler to fix the code.
If you have ever seen a marvelous tool (such as... hum let's pick up a merge tool) and the way it does its automated work, you would be very glad indeed that the compiler did not modify the code. Ultimately if the compiler knew how to fix the code, then it would mean it knew your intent, and thought transmission has not been implemented yet.
As for the warning? Any programmer worth their salt knows that warnings should be treated as errors (and compilation stopped), so what would be the advantage?
int sdf = 1,df=2;
sdf=1 df =2
I think the general problem is that without the semicolon there's no telling what the programmer could actually have meant (e.g. maybe the second line was intended as sdf = 1 + df - 2; with serious typos). Something like this might well result from completely arbitrary typos and have any intended meaning, so it might not be such a good idea after all to have the compiler silently "correct" errors.
You may also have noticed that you often get "expected semicolon" where the real problem is not a lack of a semicolon but something completely different instead. Imagine a malformed expression that the compiler could make sense out of by silently going and inserting semicolons.
The semicolon may seem redundant but it is a simple way for the programmer to confirm "yes, that was my intention".
Also, warnings instead of compiler errors are too weak. People compile code with warnings off, ignore warnings they get, and AFAIK the standard never prescribes what the compiler must warn about.

How to Predict if Function Name Follows Convention

Suppose you have a repository of 10,000 function names and possibly their frequency of use in a corpus of code which can be in C/C#/C++. (they have different conventions usually prescribed)
Some Samples may be:
DoPaint
OnPaint
CloseWindow
DeleteGraphOnClose
FreeConnection
ConnectInternat (smallTypo, but part of code)
FreeSoH
Now given a function name, how can we predict if the name follows the convention of Human Generated Name?
Note:
Obviously all candidate names will be valid names
generated names can have arbitrary characters and will be treated as bad
Letter cases can get garbled up
Some candidates:
Z090292 - not likely
onDelete - likely
CloseWindow - likely
iGetIndex - unlikely
Any pointers on technique and software are welcome
You could try conducting some Bayesian analysis on the text:
Load the list of names (and their frequencies) into your program. It might be worth tokenising the names at this point. So e.g. CloseWindow becomes Close and Window, with the frequency of both incremented. At this point it would also be useful to load in some non-human function names to train the program on negatives as well.
Take a function name, and using the data you have just gathered find the probability of each part coming up
P(HumanGenerated | Seeing the Token) = P(Seeing the Token | HumanGenerated) * P(HumanGenerated) / P(Seeing the Token)
In this case the probability of something being human or computer generated would be decided based on known knowledge i.e. what percentage of function names are thought to be human generated.
The probability of seeing the token, P(Seeing the Token), would have to gradually evolve. It would consist of the number of times the token is seen in human functions and the number of times it is seen in computer functions. This solution is based on the premise that the program learns over time (and thus needs to be trained).
The result, P(HumanGenerated | Seeing the Token), will give you a probability of the function name being generated by a human.
NB: This is only a rough outline; many details are missing. If you are interested in this line of investigation, then I would suggest reading up on probability theory and in particular Bayesian analysis.
Split the identifiers into individual words (based on capitalization), and put the words into a spell checker (such as ispell). Consider all words with spelling errors as non-human-generated, along with the identifiers in which they occur.
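A small sketch of the splitting step, assuming a new word starts at every uppercase letter that follows a non-uppercase character. The name splitCamelCase is hypothetical, and the spell-checking itself is left out.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Split an identifier into words at each uppercase letter that follows
// a non-uppercase character. Runs of capitals are kept together.
std::vector<std::string> splitCamelCase(const std::string& name) {
    std::vector<std::string> words;
    std::string current;
    for (std::size_t i = 0; i < name.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(name[i]);
        const bool boundary =
            i > 0 && std::isupper(c) &&
            !std::isupper(static_cast<unsigned char>(name[i - 1]));
        if (boundary && !current.empty()) {
            words.push_back(current);
            current.clear();
        }
        current.push_back(name[i]);
    }
    if (!current.empty()) words.push_back(current);
    return words;
}
```

With this rule CloseWindow becomes Close and Window, and onDelete becomes on and Delete, while Z090292 stays as a single word that a spell checker would reject. A name that starts with a run of capitals is not split after the run, so this is only a starting point.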
A friend of mine might help. He is doing a PhD on this very subject, as far as I can tell.
Home page
Predicting if it's human-generated is a very tricky question. Analyzing the code base to find the function names is easier - you might look at tools such as NDepend.
You can probably detect camelcase. Also, you could possibly do a regex search for typical words like: do, get, set, in, etc. before the next capitalized word.
Using a dictionary, as Martin V. Lowes suggested, is a good idea, but you have to remember to also account for the following common forms of variables:
Single-letter variable names.
Variable names that use underscores instead of camel case.
Metasyntactic variables.
Hungarian notation.
Keywords/types with a character attached (i.e. $return or list_).

Is it feasible to ascribe pronunciations to distinct source code concepts?

I frequently tutor fellow students in programming, most often in C++ or Java.
It is uniquely aggravating to try to verbally convey the essential syntax of a C++ expression. The speaker must give either an idiomatic translation into English, or a full specification of the code in verbal longhand, using explicit yet slow terms such as "opening parenthesis", "bitwise and", et cetera. Neither of these solutions is optimal.
In C++, there is a finite set of keywords—63—and operators—54, discounting named operators and treating compound assignment operators and prefix versus postfix auto-increment and decrement as distinct. There are just a few types of literal, a similar number of grouping symbols, and the semicolon. Unless I'm utterly mistaken, that's about it.
Would it not then be feasible to ascribe a concise, unique pronunciation to each of these distinct concepts (including one for whitespace, where it is required) and go from there? Programming languages are far more regular than natural languages, so the pronunciation could be standardised.
Instead of creating new "words" to describe them, for things such as "include" you could simply prefix it with "keyword" when saying it aloud. You could use words/phrases commonly known to say other parts as well. As with any new programmer, you have to literally describe everything anyway, so I don't think that requires special attention. I think creating new words is the harder method...
So, for example:
#include <iostream>
int main()
{
if (1 < 2)
return 1;
else
return 0;
}
Could be read out as:
(keyword) include iostream new-line
(keyword) int main no params start
block if number 1 (operator) less than
number 2 new-line (keyword) return
number 1 new-line (keyword) else
new-line (keyword) return number 0 end
block
Treat words in () as optional descriptive words, most likely to be used in more complex code. You could use the word 'literal' if you want them to actually write the descriptive word. For example
(keyword) if literal number (operator)
less than literal keyword
becomes
if (number < keyword)
Other words could be given defined meanings as well, such as 'split-line' when you want them to continue on the next line, without closing any currently open parenthesis, etc.
I personally find this method quite simple to use and easy to teach. YMMV, as always.
Of course, this doesn't solve the internationalisation issue, but at worst, would result in 'new words' being used in the non-English languages, which is no worse than the proposed solution you offered.
As a blind developer, programming since I was 13, I found this question really interesting. First of all, as mentioned by other people, learning a new language to be able to understand code is not a practical solution, as it would probably take longer to learn the spoken utterances than it would to learn the actual programming language.
Reading the question and answers, two further points occurred to me:
Firstly, you'd be surprised how important "thinking time" is. I have previously programmed in C/C++/Java and now use C# as my primary language, and consider myself very competent. But when I did a couple of projects in Python, I found the reduced punctuation robbed me of my "thinking time" - subconsciously, I was using the punctuation to digest what I'd just heard - fascinating... However, the situation is a bit different when it comes to identifiers, as these aren't well known by the listener - I personally find it hard to listen to code with acronym variables (RGXRatio, RGVRatio) as I don't have time to figure out what they mean. On the flip side, Hungarian notation and initial underscores make code hard to listen to, as the length of the variables (in terms of time taken to speak) is much longer than the more important operations being performed on those variables.
Another thing to consider is that the length of the audio stream is an end result, but not the root cause. The reason the audio is so long is that audio is a one-dimensional medium, whereas reading text is a two-dimensional medium with the ability to jump around and skip past irrelevant/familiar text. It wouldn't work for a face-to-face lecture, but what if there were keyboard commands for controlling the speech? In text documents my screen reader lets me jump to the next line, but what if this were adapted to the semantics of a programming language? Some research, such as by T V Raman at Google, includes using different voices for syntax highlighting, and audio cues to mark metadata like capitals.
I know the original question specifically related to a lecture given to a class, but if, like me, you have to listen to entire files of source code, the structure of the code also makes a huge difference. I personally read code like a story - left to right, top to bottom. So it's very hard to trace through unfamiliar code when it's written bottom-up.
So would it not then be feasible to simply ascribe a concise, unique pronunciation to each of these distinct concepts (including one for whitespace, where it is required) and go from there? Programming languages are far more regular than natural languages, so the pronunciation could be standardised
Perhaps, but you've lost sight of your goal. The premise was that the person listening did not already know the language. If he does, we can simply say "include iostream" when we mean #include <iostream>, or "vector of int" when we mean std::vector<int>.
Your premise was that the person listening is not familiar enough with the language to understand what you read out loud unless you read out exactly what it says.
Now, inventing a whole new language just to describe the primitives that occur in your source code doesn't solve the problem. Instead, you still have to read out every syntactic token (with simpler, more "standardized" pronunciations, yes, but they still have to be read out loud), and the person listening still won't understand you, because if they don't know C++ well enough to understand "include iostream", they won't understand your standardized pronunciation either. And if you're going to teach them your pronunciation, why bother, when you could've just taught them to understand C++ syntax directly instead?
There's also the root problem that C++ code tends to consist of a lot of syntactic tokens. Take a line as simple as this:
std::vector<int> v;
I count 8 tokens: std, ::, vector, <, int, >, v and the semicolon. Not one of them can be omitted. If the person listening does not understand the code and syntax well enough to understand a high-level description such as "declare a vector of int, named v", then you'll have to read out all 8 tokens in some form. Even if you come up with simpler names than "namespace resolution operator" and "less than sign", you still have to list 8 token names. Which is a lot of work.
In short, no, I don't think it'd work. First, it's still too cumbersome, and second, it's presuming prior knowledge on the part of the person listening, when the motivation for this was that the person listening was a student without the prior knowledge that made it possible to understand a high-level description of the code.