Method's curly brackets do not match when code is too long - C++

I have a C++ method with multiple nested "if statements" which are enclosed in the curly brackets of the method. I was using Gedit's bracket matching to check that all my curly brackets were correctly matched.
Problem: Gedit stated that my last two curly brackets were "out of range"!
I checked the same code using Geany, and it showed correct matches for all my curly brackets.
However, when compiling, the method's local variables, defined at the beginning of the method, were not recognized within some of the later nested "if statements" within the method. Is there a limit on the number of lines of code contained between curly brackets? Or is there a limit on nested "if and else if statements" that would cause this problem?

Is there a limit on the number of lines of code contained between curly brackets?
Only available disk space and memory.
Or is there a limit on nested "if and else if statements" that would cause this problem?
Ditto.
Clearly you are mistaken about your braces matching. If you have a method that is so long you can't be sure, refactor it.

There are obviously limits, since the compiler has to keep track
of everything in memory, and memory is finite. I've actually
had an out of memory error with g++ (with machine generated
code). Reaching the limit should result in a compile time
error, however.
Practically, for hand written code, you can consider that there
are no limits on a modern machine. You generally shouldn't have
more than about ten or fifteen lines in a function (although
there are exceptions), and you shouldn't nest more than about
two levels. Of course, one of the cases where more lines might
be justified is a long sequence of if/else if, and in the
case of if/else if, the compiler sees more levels of nesting
than you do. But I would still expect a modern compiler on
a modern machine to handle a couple of hundred if/else if
without a problem.

This issue has been discussed here:
http://ubuntuforums.org/showthread.php?t=1175657
It seems there is indeed a limit on the number of characters that Gedit searches when looking for the matching bracket.

Adding indentation

I'm trying to make a parser for a made-up programming language. I'm now at the part of the exercise where we're required to make sure the parser's output is a conversion in C of the input.
So things like...
STARTMAIN a=b+2; return a ENDMAIN
...must become...
int main () { a=b+2; return a; }
So far so good, almost. The exercise also requires that, at the same time as we convert, we add proper indentation and (as I had to learn the hard way last year) newlines.
The obvious part is that each time a { opens, you increase a counter and then add the appropriate tabs on each new line. However, closing brackets ('}') are a different story, as you can't detect them beforehand, and once you've parsed them, you can't just move them a tab to the left by removing the last tab printed.
Is there a solution to this, and/or a consistent way of checking and adding indentation?
Well, you've now discovered one reason why people do not always bother to format generated output neatly; it is relatively hard to do so.
Indeed, one way to deal with the problem is to provide an official formatter for the language. Google's Go programming language comes with the 'gofmt' program to encourage the official format. C does not have such a standard, hence the religious wars over the placement of braces, but it does have programs such as indent which can in fact format the code neatly for you.
The trick is not to output anything on a line until you know how many tabs to output. So, on a line with a close brace, you decrement the indent counter (making sure it never goes negative) and only then do you output the leading tabs and the following brace.
Note that some parts of C require a semi-colon (or comma) after the close brace (think initializers and structure definitions); others do not (think statement blocks).

Semicolon in C++?

Is the "missing semicolon" error really required? Why not treat it as a warning?
When I compile this code
int f = 1
int h=2;
the compiler intelligently tells me where I am missing it. But to me it's like: "If you know it, just treat it as if it's there and go ahead." (Later I can fix the warning.)
int sdf = 1, df=2;
sdf=1 df =2
Even for this code, it behaves the same. That is, even if multiple statements (without ;) are in the same line, the compiler knows.
So, why not just remove this requirement? Why not behave like Python, Visual Basic, etc.?
Summary of discussion
Two examples were raised in which a missing semicolon would actually cause a problem.
1.
return
(a+b)
This was presented as one of the worst aspects of JavaScript. But in this scenario, semicolon insertion is a problem for JavaScript, not for C++. In C++, you will get another error if ; insertion is done after return. That is, a missing return value.
2.
int *y;
int f = 1
*y = 2;
For this, I guess there is no better way than to introduce a statement separator, that is, a semicolon.
It's very good that the C++ compiler doesn't do this. One of the worst aspects of JavaScript is the semicolon insertion. Picture this:
return
(a + b);
The C++ compiler will happily continue on the next line as expected, while a language that "inserts" semicolons, like JavaScript, will treat it as "return;" and miss out the "(a + b);".
Instead of relying on compiler error-fixing, make it a habit to use semicolons.
There are many cases where a semicolon is needed.
What if you had:
int *y;
int f = 1
*y = 2;
This would be parsed as
int *y;
int f = 1 * y = 2;
So without the semicolons it is ambiguous.
First, this is only a small example; are you sure the compiler can intelligently tell you what's wrong for more complex code? For any piece of code? Could all compilers intelligently recognize this in the same way, so that a piece of C++ code could be guaranteed portable with missing semicolons?
Second, C++ was created several decades ago, when computing resources weren't nearly what they are now. Even today, builds can take a considerable amount of time. Semicolons help to clearly demarcate different commands (for the user and for the compiler!) and assist both the programmer and the compiler in understanding what's happening.
; is for the programmer's convenience. If a line of code is very long, we can press Enter and continue on a second line, because ; is the line separator. It is a programming convention. There must be a line separator.
Having semi-colons (or line breaks, pick one) makes the compiler vastly simpler and error messages more readable.
But contrary to what other people have said, neither form of delimiters (as an absolute) is strictly necessary.
Consider, for example, Haskell, which doesn’t have either. Even the current version of VB allows line breaks in many places inside a statement, as does Python. Neither requires line continuations in many places.
For example, VB now allows the following code:
Dim result = From element in collection
Where element < threshold
Select element
No statement delimiters, no line continuations, and yet no ambiguities whatsoever.
Theoretically, this could be driven much further. All ambiguities can be eliminated (again, look at Haskell) by introducing some rules. But again, this makes the parser much more complicated (it has to be context sensitive in a lot of places, e.g. your return example, which cannot be resolved without first knowing the return type of the function). And again, it makes it much harder to output meaningful diagnostics since an erroneous line break could mean any of several things so the compiler cannot know which error the user has made, and not even where the error was made.
In C programs semicolons are statement terminators, not separators. You might want to read this fun article.
+1 to you both.
The semi-colon is a command line delimiter, unlike in VB, Python etc. C and C++ ignore white space within lines of code, including carriage returns! This was originally because, at the inception of C, computer monitors could only cope with 80 characters of text, and as C++ is based on the C specification it followed suit.
I could post the question "Why must I keep getting errors about missing _ characters in VB when I try to write code over several lines? Surely if VB knows of the problem it can insert it?"
Auto insertion as has already been pointed out could be a nightmare, especially on code that wraps onto a second line.
I won't dwell much on the need for semi-colons vs line continuation characters; both have advantages and disadvantages, and in the end it's a simple language design choice (even though it affects all the users).
I am more worried about the suggestion for the compiler to fix the code.
If you have ever seen a marvelous tool (such as... hum let's pick up a merge tool) and the way it does its automated work, you would be very glad indeed that the compiler did not modify the code. Ultimately if the compiler knew how to fix the code, then it would mean it knew your intent, and thought transmission has not been implemented yet.
As for the warning? Any programmer worth their salt knows that warnings should be treated as errors (and compilation stopped), so what would be the advantage?
int sdf = 1,df=2;
sdf=1 df =2
I think the general problem is that without the semicolon there's no telling what the programmer could actually have meant (e.g. maybe the second line was intended as sdf = 1 + df - 2; with serious typos). Something like this might well result from completely arbitrary typos and have any intended meaning, which is why it might not be such a good idea after all to have the compiler silently "correct" errors.
You may also have noticed that you often get "expected semicolon" where the real problem is not a lack of a semicolon but something completely different instead. Imagine a malformed expression that the compiler could make sense out of by silently going and inserting semicolons.
The semicolon may seem redundant but it is a simple way for the programmer to confirm "yes, that was my intention".
Also, warnings instead of compiler errors are too weak. People compile code with warnings off, ignore warnings they get, and AFAIK the standard never prescribes what the compiler must warn about.

Is it feasible to write a regex that can validate simple math?

I’m using a commercial application that has an option to use RegEx to validate field formatting. Normally this works quite well. However, today I’m faced with validating the following strings: quoted alphanumeric codes with simple arithmetic operators (+-/*). Apparently the issue is sometimes users add additional spaces (e.g. “ FLR01” instead of “FLR01”) or have other typos such as mismatched parenthesis that cause issues with downstream processing.
The first examples all had 5 codes being added:
"FLR01"+"FLR02"+"FLR03"+"FMD01"+"FMR05"
So I started going down the road of matching 5 alphanumeric characters enclosed in quotes:
"[0-9a-zA-Z]{5}"[+-*/]
However, the formulas quickly got harder and I don’t know how to get around the following complications:
I need to test for one of the four simple math operators (+-*/) between each code, but not after the last one.
There can be any number of codes being added together, not just five as in the example above.
Enclosed parentheses are okay: ("X"+"Y")/"2"
Mismatched parentheses are not okay.
No formula (e.g. a blank) is okay.
Valid:
"FLR01"+"FLR02"+"FLR03"+"FMD01"+"FMR05"
"0XT"+"1SEAL"+"1XT"+"23LSL"+"23NBL"
("LS400"+"LT400")*"LC430"/("EL414"+"EL414R"+"LC407"+"LC407R"+"LC410"+"LC410R"+"LC420"+"LC420R")
Invalid:
" FLR01" +"FLR02"
"FLR01"J"FLR02"
("FLR01"+"FLR02"
Is this not something you can easily do with RegExp? Based on Jeff’s answer to 230517, I suspect I’m failing at least the ‘matched pairing’ issue. Even a partial solution to the problem (e.g. flagging extra spaces, invalid operators) would likely be better than nothing, even if I can't solve the parenthesis issue. Suggestions welcomed!
Thanks,
Stephen
As you are aware you can't check for matching parentheses with regular expressions. You need something more powerful since regexes have no way of remembering state and counting the nested parentheses.
This is a simple enough syntax that you could hand code a simple parser which counts the parentheses, incrementing and decrementing a counter as it goes. You'd simply have to make sure the counter never goes negative.
As for the rest, how about this?
("[0-9a-zA-Z]+"([+\-*/]"[0-9a-zA-Z]+")*)?
You could also use this regular expression to check the parentheses. It wouldn't verify that they're nested properly but it would verify that the open and close parentheses show up in the right places. Add in the counter described above and you'd have a proper validator.
(\(*"[0-9a-zA-Z]+"\)*([+\-*/]\(*"[0-9a-zA-Z]+"\)*)*)?
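The counter mentioned above might look like the sketch below. The function name parenthesesBalanced is hypothetical, and the code assumes (as in the examples from the question) that parentheses never appear inside the quoted codes. It checks nesting only; the regular expressions above would still be needed for the rest of the format.

```cpp
#include <string>

// Hypothetical helper: returns true when every ')' closes an earlier '('
// and no '(' is left open at the end of the formula.
bool parenthesesBalanced(const std::string& formula) {
    int depth = 0;
    for (char c : formula) {
        if (c == '(') {
            ++depth;
        } else if (c == ')') {
            // A close with nothing open means the counter would go negative.
            if (--depth < 0) {
                return false;
            }
        }
    }
    // Anything still open at the end is a mismatch as well.
    return depth == 0;
}
```

An empty formula passes, which matches the requirement that a blank field is okay.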
You can easily use regexes to match your tokens (numbers, operators, etc.), but you cannot match balanced parentheses. This isn't too big of a problem though, as you just need to create a state machine that operates on the tokens you match. If you're not familiar with these, think of it as a flow chart within your program where you keep track of where you are, and where you can go. You can also have a look at the Wikipedia page.

Could replacing statements with expressions using the C++ comma operator allow more compiler optimizations?

The C++ comma operator is used to chain individual expressions, yielding the value of the last executed expression as the result.
For example the skeleton code (6 statements, 6 expressions):
step1;
step2;
if (condition) {
step3;
return step4;
} else {
return step5;
}
May be rewritten to: (1 statement, 6 expressions)
return step1,
step2,
condition?
step3, step4 :
step5;
I noticed that it is not possible to perform step-by-step debugging of such code, as the expression chain seems to be executed as a whole. Does it mean that the compiler is able to perform special optimizations which are not possible with the traditional statement approach (especially if the steps are const or inline)?
Note: I'm not talking about the coding style merit of that way of expressing sequence of expressions! Just about the possible optimisations allowed by replacing statements by expressions.
Most compilers will break your code down into "basic blocks", which are stretches of code with no jumps/branches in or out. Optimisations will be performed on a graph of these blocks: that graph captures all the control flow in the function. The basic blocks are equivalent in your two versions of the code, so I doubt that you'd get different optimisations. That the basic blocks are the same isn't entirely obvious: it relies on the fact that the control flow between the steps is the same in both cases, and so are the sequence points. The most plausible difference is that you might find in the second case there is only one block including a "return", and in the first case there are two. The blocks are still equivalent, since the optimiser can replace two blocks that "do the same thing" with one block that is jumped to from two different places. That's a very common optimisation.
It's possible, of course, that a particular compiler doesn't ignore or eliminate the differences between your two functions when optimising. But there's really no way of saying whether any differences would make the result faster or slower, without examining what that compiler is doing. In short there's no difference between the possible optimisations, but it doesn't necessarily follow that there's no difference between the actual optimisations.
The reason you can't single-step your second version of the code is just down to how the debugger works, not the compiler. Single-step usually means, "run to the next statement", so if you break your code into multiple statements, you can more easily debug each one. Otherwise, if your debugger has an assembly view, then in the second case you could switch to that and single-step the assembly, allowing you to see how it progresses. Or if any of your steps involve function calls, then you may be able to "do the hokey-cokey", by repeatedly doing "step in, step out" of the functions, and separate them that way.
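To make the "same control flow, same sequence points" argument concrete, here is a small sketch of the two forms from the question. The step function and its trace parameter are hypothetical stand-ins for step1 to step5; they only record the order in which the steps run. It says nothing about what a particular optimiser will actually emit.

```cpp
#include <string>

// Hypothetical step: appends its name to a trace and returns a value.
int step(std::string& trace, char name, int value) {
    trace += name;
    return value;
}

// Statement form, as in the first skeleton.
int withStatements(bool condition, std::string& trace) {
    step(trace, '1', 0);
    step(trace, '2', 0);
    if (condition) {
        step(trace, '3', 0);
        return step(trace, '4', 40);
    } else {
        return step(trace, '5', 50);
    }
}

// Comma-operator form: one return statement, same steps in the same order.
int withCommas(bool condition, std::string& trace) {
    return step(trace, '1', 0),
           step(trace, '2', 0),
           condition ? (step(trace, '3', 0), step(trace, '4', 40))
                     : step(trace, '5', 50);
}
```

Both functions run the same steps in the same order and return the same value for either value of condition, which is why there is little reason to expect the optimiser to treat them differently.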
Using the comma operator neither promotes nor hinders optimization in any circumstances I'm aware of, because the C++ standard guarantee is only that evaluation will be in left-to-right order, not that statement execution necessarily will be. (This is the same guarantee you get with statement line order.)
What it is likely to do, though, is turn your code into a confusing mess, since many programmers are unaware that the comma-as-operator even exists, and are apt to confuse it with commas used as parameter separators. (Want to really make your code unreadable? Call a function like my_func((++i, y), x).)
The "best" use of the comma operator I've seen is to work with multiple variables in the iteration statement of a for loop:
for (int i = 0, j = 0;
i < 10 && j < 12;
i += j, ++j) // each time through the loop we're tinkering with BOTH i and j
{
}
Very unlikely IMHO. The thing gets compiled down to assembler/machine code, then further low-level optimizations are done, so it probably turns out to be the same thing.
OTOH, if the comma operator is overloaded, the game changes completely. But I'm sure you know that. ;)
The obligatory list:
Don't worry about rewriting almost equivalent code to gain performance
If you have a perf-problem, profile to see what the problem is
If you can't get it faster by algorithmic ops, look at the disassembly and see that the compiler does what you intended
If not, ask here and post source and disassembly for both versions. :)

Why can't variable names start with numbers?

I was working with a new C++ developer a while back when he asked the question: "Why can't variable names start with numbers?"
I couldn't come up with an answer except that some numbers can have text in them (123456L, 123456U) and that wouldn't be possible if the compilers were thinking everything with some amount of alpha characters was a variable name.
Was that the right answer? Are there any more reasons?
string 2BeOrNot2Be = "that is the question"; // Why won't this compile?
Because then a string of digits would be a valid identifier as well as a valid number.
int 17 = 497;
int 42 = 6 * 9;
String 1111 = "Totally text";
Well think about this:
int 2d = 42;
double a = 2d;
What is a? 2.0? or 42?
Hint, if you don't get it: in some languages (Java, for example), d after a number means the number before it is a double literal.
It's a convention now, but it started out as a technical requirement.
In the old days, parsers of languages such as FORTRAN or BASIC did not require the use of spaces. So, basically, the following are identical:
10 V1=100
20 PRINT V1
and
10V1=100
20PRINTV1
Now suppose that numeral prefixes were allowed. How would you interpret this?
101V=100
as
10 1V = 100
or as
101 V = 100
or as
1 01V = 100
So, this was made illegal.
Because backtracking is avoided in lexical analysis while compiling. With a variable like:
Apple;
the compiler will know it's an identifier right away when it meets the letter 'A'.
However, with a variable like:
123apple;
the compiler won't be able to decide whether it's a number or an identifier until it hits 'a', and it needs backtracking as a result.
Compilers/parsers/lexical analyzers were a long, long time ago for me, but I think I remember there being difficulty in unambiguously determining whether a numeric character in the compilation unit represented a literal or an identifier.
Languages where space is insignificant (like ALGOL and the original FORTRAN if I remember correctly) could not accept numbers to begin identifiers for that reason.
This goes way back - before special notations to denote storage or numeric base.
I agree it would be handy to allow identifiers to begin with a digit. One or two people have mentioned that you can get around this restriction by prepending an underscore to your identifier, but that's really ugly.
I think part of the problem comes from number literals such as 0xdeadbeef, which make it hard to come up with easy to remember rules for identifiers that can start with a digit. One way to do it might be to allow anything matching [A-Za-z_]+ that is NOT a keyword or number literal. The problem is that it would lead to weird things like 0xdeadpork being allowed, but not 0xdeadbeef. Ultimately, I think we should be fair to all meats :P.
When I was first learning C, I remember feeling the rules for variable names were arbitrary and restrictive. Worst of all, they were hard to remember, so I gave up trying to learn them. I just did what felt right, and it worked pretty well. Now that I've learned a lot more, it doesn't seem so bad, and I finally got around to learning it right.
It's likely a decision that came about for a few reasons. When you're parsing the token, you only have to look at the first character to determine if it's an identifier or a literal and then send it to the correct function for processing. So that's a performance optimization.
The other option would be to check if it's not a literal and leave the domain of identifiers to be the universe minus the literals. But to do this you would have to examine every character of every token to know how to classify it.
There are also stylistic implications: identifiers are supposed to be mnemonics, so words are much easier to remember than numbers. When a lot of the original languages were being written, setting the styles for the next few decades, they weren't thinking about substituting "2" for "to".
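The first-character dispatch described above can be sketched like this. It is a simplification under stated assumptions: the names TokenClass and classifyByFirstChar are made up for illustration, and real C++ lexing has extra cases (a literal such as .5 starts with a dot, for instance) that this ignores.

```cpp
#include <cctype>
#include <string>

// Hypothetical token classes for illustration only.
enum class TokenClass { Identifier, Number, Other };

// Decides how to treat a token by looking at its first character alone,
// which is possible because identifiers may not start with a digit.
TokenClass classifyByFirstChar(const std::string& token) {
    if (token.empty()) {
        return TokenClass::Other;
    }
    const unsigned char first = static_cast<unsigned char>(token[0]);
    if (std::isalpha(first) || first == '_') {
        return TokenClass::Identifier;  // letters and '_' start identifiers
    }
    if (std::isdigit(first)) {
        return TokenClass::Number;      // digits start numeric literals
    }
    return TokenClass::Other;           // operators, punctuation, etc.
}
```

Under the current rule, 123456L and 0xdeadbeef are sent to the number routine after one character; if identifiers could start with digits, the lexer would have to read further before choosing.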
Variable names cannot start with a digit, because it can cause some problems like below:
int a = 2;
int 2 = 5;
int c = 2 * a;
What is the value of c? Is it 4, or is it 10?
Another example:
float 5 = 25;
float b = 5.5;
Is the first 5 a number, or is it an object (with the . operator)?
There is a similar problem with the second 5.
Maybe there are some other reasons. So, we shouldn't use a digit at the beginning of a variable name.
The restriction is arbitrary. Various Lisps permit symbol names to begin with numerals.
COBOL allows variables to begin with a digit.
Use of a digit to begin a variable name makes error checking during compilation or interpretation a lot more complicated.
Allowing use of variable names that began like a number would probably cause huge problems for the language designers. During source code parsing, whenever a compiler/interpreter encountered a token beginning with a digit where a variable name was expected, it would have to search through a huge, complicated set of rules to determine whether the token was really a variable, or an error. The complexity added to the language parser may not justify this feature.
As far back as I can remember (about 40 years), I don't think that I have ever used a language that allowed use of a digit to begin variable names. I'm sure that this was done at least once. Maybe, someone here has actually seen this somewhere.
As several people have noticed, there is a lot of historical baggage about valid formats for variable names. And language designers are always influenced by what they know when they create new languages.
That said, pretty much all of the time a language doesn't allow variable names to begin with numbers is because those are the rules of the language design. Often it is because such a simple rule makes the parsing and lexing of the language vastly easier. Not all language designers know this is the real reason, though. Modern lexing tools help, because if you tried to define it as permissible, they will give you parsing conflicts.
OTOH, if your language has a uniquely identifiable character to herald variable names, it is possible to set it up for them to begin with a number. Similar rule variations can also be used to allow spaces in variable names. But the resulting language is unlikely to resemble any popular conventional language very much, if at all.
For an example of a fairly simple HTML templating language that does permit variables to begin with numbers and have embedded spaces, look at Qompose.
Because if you allowed keywords and identifiers to begin with numeric characters, the lexer (part of the compiler) couldn't readily differentiate between the start of a numeric literal and a keyword without getting a whole lot more complicated (and slower).
C++ can't have it because the language designers made it a rule. If you were to create your own language, you could certainly allow it, but you would probably run into the same problems they did and decide not to allow it. Examples of variable names that would cause problems:
0x, 2d, 5555
One of the key problems about relaxing syntactic conventions is that it introduces cognitive dissonance into the coding process. How you think about your code could be deeply influenced by the lack of clarity this would introduce.
Wasn't it Dijkstra who said that the "most important aspect of any tool is its effect on its user"?
The compiler has 7 phases, as follows:
Lexical analysis
Syntax Analysis
Semantic Analysis
Intermediate Code Generation
Code Optimization
Code Generation
Symbol Table
Backtracking is avoided in the lexical analysis phase while compiling a piece of code. With a variable like Apple, the compiler will know it's an identifier right away when it meets the letter ‘A’ in the lexical analysis phase. However, with a variable like 123apple, the compiler won’t be able to decide whether it's a number or an identifier until it hits ‘a’, and it would need to backtrack in the lexical analysis phase to identify it as a variable. That is not supported in the compiler.
When you’re parsing the token you only have to look at the first character to determine if it’s an identifier or literal and then send it to the correct function for processing. So that’s a performance optimization.
Probably because it makes it easier for the human to tell whether it's a number or an identifier, and because of tradition. Having identifiers that could begin with a digit wouldn't complicate the lexical scans all that much.
Not all languages have forbidden identifiers beginning with a digit. In Forth, they could be numbers, and small integers were normally defined as Forth words (essentially identifiers), since it was faster to read "2" as a routine to push a 2 onto the stack than to recognize "2" as a number whose value was 2. (In processing input from the programmer or the disk block, the Forth system would split up the input according to spaces. It would try to look the token up in the dictionary to see if it was a defined word, and if not would attempt to translate it into a number, and if not would flag an error.)
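The lookup order described in the parenthesis (dictionary first, then number, then error) can be sketched as below. This is not how any real Forth system is implemented; the names classifyToken and TokenKind, and the use of std::map as the dictionary, are assumptions made for the illustration.

```cpp
#include <cstdlib>
#include <map>
#include <string>

// Hypothetical result kinds for a space-delimited token.
enum TokenKind { DEFINED_WORD, NUMBER_LITERAL, UNKNOWN_TOKEN };

// Try the dictionary first; only if the token is not a defined word,
// try to read the whole token as a base-10 integer.
TokenKind classifyToken(const std::map<std::string, int>& dictionary,
                        const std::string& token) {
    if (dictionary.count(token) != 0) {
        return DEFINED_WORD;
    }
    if (!token.empty()) {
        char* end = nullptr;
        std::strtol(token.c_str(), &end, 10);
        // The token is a number only if strtol consumed every character.
        if (*end == '\0') {
            return NUMBER_LITERAL;
        }
    }
    return UNKNOWN_TOKEN;
}
```

With a dictionary that defines "2" as a word, the token 2 is found before any numeric conversion is attempted, while 17 falls through to the number path and 12345foobar is flagged as an error.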
Suppose you did allow symbol names to begin with numbers. Now suppose you want to name a variable 12345foobar. How would you differentiate this from 12345? It's actually not terribly difficult to do with a regular expression. The problem is actually one of performance. I can't really explain why this is in great detail, but it essentially boils down to the fact that differentiating 12345foobar from 12345 requires backtracking. This makes the regular expression non-deterministic.
There's a much better explanation of this here.
It is easy for a compiler to identify a variable using ASCII at a memory location rather than a number.
I think the simple answer is that it can, the restriction is language based. In C++ and many others it can't because the language doesn't support it. It's not built into the rules to allow that.
The question is akin to asking why the King can't move four spaces at a time in chess. It's because in chess that is an illegal move. Could it in another game? Sure. It just depends on the rules being played by.
Originally it was simply because it is easier to remember (you can give it more meaning) variable names as strings rather than numbers although numbers can be included within the string to enhance the meaning of the string or allow the use of the same variable name but have it designated as having a separate, but close meaning or context. For example loop1, loop2 etc would always let you know that you were in a loop and/or loop 2 was a loop within loop1.
Which would you prefer (has more meaning) as a variable: address or 1121298? Which is easier to remember?
However, if the language uses something to denote that it is not just text or numbers (such as the $ in $address), it really shouldn't make a difference, as that would tell the compiler that what follows is to be treated as a variable (in this case).
In any case it comes down to what the language designers want to use as the rules for their language.
The variable may also be considered as a value at compile time by the compiler,
so the value may refer to itself again and again recursively.
Backtracking is avoided in the lexical analysis phase while compiling a piece of code. With a variable like Apple; the compiler will know it's an identifier right away when it meets the letter ‘A’ in the lexical analysis phase. However, with a variable like 123apple; the compiler won’t be able to decide whether it's a number or an identifier until it hits ‘a’, and it would need to backtrack in the lexical analysis phase to identify it as a variable. That is not supported in the compiler.
There may be nothing wrong with it when it comes to declaring the variable, but there is some ambiguity when you try to use that variable somewhere else, like this:
let 1 = "Hello world!"
print(1)
print(1)
print is a generic method that accepts all types of variable, so in that situation the compiler does not know which 1 the programmer refers to: the integer value 1 or the variable 1 that stores a string.
Maybe it would be better for the compiler to allow such a definition but, when the ambiguous name is used, to raise an error with a suggested correction that clears up the ambiguity.