C++ language symbol separators

I need to parse some C++ files to extract some information from them. One use case: I have an enum value "ID_XYZ", and I want to find out how many times it appears in a source file. So my question is: what are the separators that divide symbols in C++?

You can't really tokenize C or C++ source code based purely on separator characters -- you pretty much need to read in a character at a time, and figure out whether that character can be part of the current token or not.
Just for a couple of examples, when you see a C-style begin-comment token, you need to look at characters until you encounter a close-comment token. Likewise, strings and pre-processor directives (e.g., #if 0 .... #endif sequences). To do it truly correctly, you also need to handle trigraphs. For example, consider something like this:
// Why doesn't this work??/
ID_XYZ = 1;
If the lexer doesn't handle trigraphs correctly, it will probably identify this as an instance of your ID_XYZ -- but in reality, it's not -- the ??/ at the end of the previous line is really a trigraph that resolves to \, which means the "single-line" comment actually extends to the end of the next line, and the apparent instance of ID_XYZ is really part of the comment.
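A minimal sketch of that character-at-a-time approach (count_identifier is my own name for illustration; this handles // and /* */ comments and string/character literals, but deliberately ignores trigraphs, raw strings, and preprocessor logic):

#include <cctype>
#include <string>

// Count occurrences of identifier `ident` in C++ source `src`,
// skipping comments and string/character literals.
int count_identifier(const std::string& src, const std::string& ident) {
    int count = 0;
    std::size_t i = 0, n = src.size();
    while (i < n) {
        if (src[i] == '/' && i + 1 < n && src[i + 1] == '/') {        // line comment
            while (i < n && src[i] != '\n') ++i;
        } else if (src[i] == '/' && i + 1 < n && src[i + 1] == '*') { // block comment
            i += 2;
            while (i + 1 < n && !(src[i] == '*' && src[i + 1] == '/')) ++i;
            i += 2;
        } else if (src[i] == '"' || src[i] == '\'') {                 // literal
            char quote = src[i++];
            while (i < n && src[i] != quote) {
                if (src[i] == '\\') ++i;  // skip escaped character
                ++i;
            }
            ++i;  // closing quote
        } else if (std::isalpha(static_cast<unsigned char>(src[i])) || src[i] == '_') {
            std::size_t start = i;        // scan a whole identifier
            while (i < n && (std::isalnum(static_cast<unsigned char>(src[i])) || src[i] == '_'))
                ++i;
            if (src.compare(start, i - start, ident) == 0) ++count;
        } else {
            ++i;
        }
    }
    return count;
}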

Related

Why do preprocessor commands have to start as first nonwhite space

I am trying to use a #ifndef partway through a setter line, and I received this error:
"Error 20 error C2014: preprocessor command must start as first nonwhite space"
I am aware of what the error means; I am just curious why it is like that. Is it a compiler choice? What is the reasoning behind it? Is it so that it is easier for the user to notice?
Here is the code if someone is wondering:
inline void SetSomething(int value) { #ifndef XBOX ASSERT(value <= 1); #endif test = value; };
At first C did not have any standard preprocessor. Then people started using preprocessing as an external tool. You might note that # is the same character that introduces comments in Unix-land shell scripts.
As the language evolved, the preprocessor became integrated with the compiler and more a part of the language proper, but it kept its totally different structure; in particular, it is line-oriented while the C and C++ core languages are free-form.
After that the lines have blurred a bit more. Now the preprocessor typically adds #line directives or the equivalent for use by the core language compiler, #pragma directives are for the core language compiler, and in the other direction we now have _Pragma (IIRC). Still, the structure is mostly as it was originally. C and C++ are evolved languages, not designed languages.
Taking a look into the standard (section 16, "Preprocessing directives"), starting with # as the first non-whitespace character is what makes a preprocessing directive by definition.
A preprocessing directive consists of a sequence of preprocessing tokens that satisfies the following constraints:
The first token in the sequence is a # preprocessing token that (at the start of translation phase 4) is either
the first character in the source file (optionally after white space containing no new-line characters) or that
follows white space containing at least one new-line character.
If you want the most important reason, it's because the standard says so.
If you want to know why the standard says so, it's the easiest way to get the necessary functionality.
Remember that preprocessing and compiling are two potentially completely separate tasks, and the preprocessor has no idea at all about the language of its output.
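Applied to the code in the question, the fix is simply to give each directive its own line; a sketch:

inline void SetSomething(int value)
{
#ifndef XBOX
    ASSERT(value <= 1);
#endif
    test = value;
}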

Encode/decode certain text sequences in Qt

I have a QTextEdit where the user can insert arbitrary text. In this text, there may be some special sequences of characters which I wish to translate automatically. And from the translated version, I wish I could go back to the sequences.
Take for instance this:
QMessageBox::information(0, "Foo", MAGIC_TRANSLATE(myTextEdit->text()));
If the user wrote the sequence \n inside myTextEdit's text, I would like MAGIC_TRANSLATE to convert that string to an actual newline character.
In the same way, if I give it a text with a newline inside, MAGIC_UNTRANSLATE will convert the newline back into the string \n.
Now, of course I can implement these two functions by myself, but what I am asking is if there is something already made, easy to use, in Qt, which allows me to specify a dictionary and it does the rest for me.
Note that sequences with common prefix can create some conflicts, for example converting:
\foo -> FOO
\foobar -> FOOBAR
can give rise to issues when translating the text asd \foobar lol, because if \foo is searched and replaced before \foobar, then the resulting text will be asd FOObar lol instead of the (more natural) asd FOOBAR lol.
I hope I have made my needs clear. I believe this may be a common task, so I hope there is a Qt solution which takes this kind of conflicting-prefix issue into account.
I am sorry if this is a trivial topic (as I think it may be), but I am not familiar at all with encoding techniques and issues, and my knowledge of Qt encoding covers only very simple Unicode-related issues.
EDIT:
Btw, in my case a data-oriented approach, based on resources or external files or anything that does not require a recompilation, would be great.
It sounds like your question is, "I want to run a sequence of regular expression or simple string replacements to map between two encodings of some text".
First you need to work out your mapping, exactly. As you say, if your escape sequences like \foo and \foobar are fiddly, you might find that you don't have a bidirectional, lossless mapping. No library in the world can help you if your design or encoding is flawed.
When you end up with a precise design (which we can't help you on given the complete lack of information provided on the purpose of this function), you'll probably find that a sequence of string replacements is fine. If it really is more complicated, then some QRegExps should be enough.
It is always a bit ugly to self-answer questions, but... Maybe this solution is useful to someone.
As suggested by Nicholas in his answer, a good strategy is to use replacement. It is simple and effective in most cases, for example in the plain C/C++ escaping:
\n \r \t etc
This works because they are all different. A replacement will always work if the sequences are all different and, in particular, if no sequence is a prefix of another sequence.
For example, if your sequences are the ones above plus some Greek letters, you will not like the \nu sequence, which should be translated to ν.
If the replacing function tests for \n before \nu, the result is wrong.
Assuming that both sequences must be translated into two completely different entities, there are two solutions: place a close-sequence character, for example \nu;, or simply replace from the longest string to the shortest. This ensures that no sequence which is a prefix of another is replaced before it.
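A minimal sketch of the longest-first strategy (translate_sequences and the dictionary layout are my own; it scans the text once, trying longer keys before shorter ones at each position):

#include <algorithm>
#include <string>
#include <utility>
#include <vector>

std::string translate_sequences(std::string text,
                                std::vector<std::pair<std::string, std::string>> dict) {
    // Longest keys first, so a key that is a prefix of another cannot win early.
    std::sort(dict.begin(), dict.end(), [](const auto& a, const auto& b) {
        return a.first.size() > b.first.size();
    });
    std::string out;
    std::size_t i = 0;
    while (i < text.size()) {
        bool matched = false;
        for (const auto& entry : dict) {
            if (text.compare(i, entry.first.size(), entry.first) == 0) {
                out += entry.second;
                i += entry.first.size();
                matched = true;
                break;
            }
        }
        if (!matched) out += text[i++];
    }
    return out;
}

With the dictionary {{"\\foobar", "FOOBAR"}, {"\\foo", "FOO"}}, the text asd \foobar lol comes out as asd FOOBAR lol, as desired.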
For various reasons, I tried another way: using a trie, which is a tree of all the prefixes of a dictionary of words. Long story short: it works fairly well and probably works faster than (most) regexes and replacements.
Regexes are state machines, and it is not rare for them to re-process parts of the input; with a trie you avoid matching the same characters twice, so it is pretty fast.
Code for tries is pretty easy to find on the internet, and the modifications to do efficient matching are trivial, so I will not write the code here.
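For concreteness, though, here is a minimal sketch of the matching idea (my own illustration, not the author's code): walk the trie from each position and remember the last accepting node seen, so the longest known sequence wins.

#include <map>
#include <memory>
#include <string>

struct TrieNode {
    std::map<char, std::unique_ptr<TrieNode>> next;
    bool accepting = false;     // true if a complete sequence ends here
    std::string replacement;
};

void trie_insert(TrieNode& root, const std::string& key, const std::string& value) {
    TrieNode* node = &root;
    for (char c : key) {
        auto& child = node->next[c];
        if (!child) child = std::make_unique<TrieNode>();
        node = child.get();
    }
    node->accepting = true;
    node->replacement = value;
}

std::string trie_replace(const TrieNode& root, const std::string& text) {
    std::string out;
    std::size_t i = 0;
    while (i < text.size()) {
        const TrieNode* node = &root;
        const TrieNode* best = nullptr;
        std::size_t best_len = 0;
        for (std::size_t j = i; j < text.size(); ++j) {   // follow the trie as far as possible
            auto it = node->next.find(text[j]);
            if (it == node->next.end()) break;
            node = it->second.get();
            if (node->accepting) { best = node; best_len = j - i + 1; }
        }
        if (best) { out += best->replacement; i += best_len; }
        else      { out += text[i++]; }
    }
    return out;
}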

C++ is a white space independent language: exceptions to the rule

This Wikipedia page defines C++ as a "white space independent language". While this is mostly true, as with all languages there are exceptions to the rule. The only one I can think of at the moment is this:
vector<vector<double> >
This must have a space, otherwise the compiler interprets the >> as a stream operator. What other ones are around? It would be interesting to compile a list of the exceptions.
Following that logic, you can use any two-character lexeme to produce such "exceptions" to the rule. For example, += and + = would be interpreted differently. I wouldn't call them exceptions though. In C++ in many contexts "no space at all" is quite different from "one or more spaces". When someone says that C++ is space-independent they usually mean that "one space" in C++ is typically the same as "more than one space".
This is reflected in the language specification, which states (see 2.1/1) that at phase 3 of translation the implementation is allowed to replace sequences of multiple whitespace characters with one space character.
The syntax and semantic rules for parsing C++ are indeed quite complex (I'm trying to be nice; I think one is authorized to say "a mess"). Proof of this fact is that for YEARS different compiler authors were arguing about what was legal C++ and what was not.
In C++ for example you may need to parse an unbounded number of tokens before deciding what is the semantic meaning of the first of them (the dreaded "most vexing parse rule" that also often bites newcomers).
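The classic illustration (Widget and Gadget are placeholder names):

Widget w1(Gadget());   // declares a function w1 taking Gadget(*)() and returning Widget
Widget w2((Gadget())); // the extra parentheses force an object declaration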
Your objection IMO however doesn't really make sense... for example ++ has a different meaning from + +, and in Pascal begin is not the same as beg in. Does this make Pascal a space-dependent language? Is there any space-independent language (except brainf*ck)?
The only problem with C++03's >> / > > is that this typing mistake was very common, so for C++11 they decided to add even more complexity to the language definition to solve the issue.
The cases in which one whitespace character instead of several can make a difference (the thing that distinguishes space-dependent languages, and which plays no role in the > > / >> case) are indeed few:
inside double-quoted strings (but everyone wants that and every language that supports string literals that I know does the same)
inside single quotes (the same, even if something that not many C++ programmers know is that there can be more than one char inside single quotes)
in the preprocessor directives because they work on a line basis (newline is a whitespace and it makes a difference there)
in line continuations, as noticed by stefanv: to continue a single line you can put a backslash right before a newline, and in that case the language will ignore both characters (you can do this even in the middle of an identifier, although the typical use is just to make long preprocessor macros readable). If you put other whitespace characters between the backslash and the newline, however, the line continuation is not recognized (some compilers accept it anyway and simply check whether the last non-whitespace character of a line is a backslash). Line continuation can also be specified with ??/, the trigraph equivalent of backslash (any reasonable compiler should IMO emit a warning when finding a trigraph, as it most probably was not intended by the programmer).
inside single-line comments //, because there too, adding a newline to the other whitespace in the middle of a comment makes a difference
Like it or not, macros are also part of C++, and multi-line macros must be continued with a backslash immediately followed by EOL; no whitespace may appear between the backslash and the EOL.
Not a big issue, but still a whitespace exception.
This was because of a limitation in the parser pre-C++11; it is no longer the case.
The reason is that it was hard to parse >> as the end of two nested template argument lists rather than as operator >>.
While C++03 did interpret >> as the shift operator in all cases (which was overloaded for use with streams, but it's still the shift operator), the C++11 parser will now treat >> as closing two template argument lists where that is reasonable.
Nested template parameters: set<set<int> >.
Character literals: ' '.
String literals: " ".
Juxtaposition of keywords and identifiers: else return x;, void foo(){}, etc.
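A few of these side by side (illustrative fragments, not a complete program):

vector<vector<double> > a;  // C++03: the space is required; ">>" would lex as one token
vector<vector<double>> b;   // C++11: ">>" may now close both template argument lists
x += y;                     // fine: "+=" is one token
x + = y;                    // error: "+" and "=" are two tokens, not "+="
else return x;              // the space between the keywords is mandatory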

C++ - Escaping or disabling backslash on string

I am writing a C++ program to solve a common problem of message decoding. Part of the problem requires me to get a bunch of random characters, including '\', and map them to a key, one by one.
My program works fine in most cases, except that when I read characters such as '\' from a string, I obviously get a completely different character representation (e.g. '\0' yields a null character, or '\' simply escapes itself when it needs to be treated as a character).
Since I am not supposed to have any control on what character keys are included, I have been desperately trying to find a way to treat special control characters such as the backslash as the character itself.
My questions are basically these:
Is there a way to turn all special characters off within the scope of my program?
Is there a way to override the current digraph definitions of special characters and define them as something else (like digraphs using very rare keys)?
Is there some obscure method on the string class that I missed which can force the actual character in the string to be read instead of the pre-defined constant?
I have been trying to look for a solution for hours now but all possible fixes I've found are for other languages.
Any help is greatly appreciated.
If you read in a string like "\0" from stdin or a file, it will be treated as two separate characters: '\\' and '0'. There is no additional processing that you have to do.
Escaping characters is only used for string/character literals. That is to say, when you want to hard-code something into your source code.
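A quick way to convince yourself; a minimal sketch reading one line from stdin:

#include <iostream>
#include <string>

int main() {
    std::string line;
    std::getline(std::cin, line);            // type: \0
    std::cout << line.size() << '\n';        // prints 2
    std::cout << (line[0] == '\\') << '\n';  // prints 1: a literal backslash
}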

what exactly is a token, in relation to parsing

I have to use a parser and writer in C++. I am trying to implement the functions; however, I do not understand what a token is. One of my functions/operations is to check whether there are more tokens to produce:
bool Parser::hasMoreTokens()
How exactly do I go about this? Please help.
So: I am opening a text file with text in it, all words lowercased. How do I check whether it has more tokens?
This is what I have:
bool Parser::hasMoreTokens() {
    while (source.peek() != NULL) {
        return true;
    }
    return false;
}
Tokens are the output of lexical analysis and the input to parsing. Typically they are things like
numbers
variable names
parentheses
arithmetic operators
statement terminators
That is, roughly, the biggest things that can be unambiguously identified by code that just looks at its input one character at a time.
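For example, a lexer for a C-like language would break this statement into eight tokens:

x = (y + 42);
// identifier "x", "=", "(", identifier "y", "+", number "42", ")", ";"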
One note, which you should feel free to ignore if it confuses you: The boundary between lexical analysis and parsing is a little fuzzy. For instance:
Some programming languages have complex-number literals that look, say, like 2+3i or 3.2e8-17e6i. If you were parsing such a language, you could make the lexer gobble up a whole complex number and make it into a token; or you could have a simpler lexer and a more complicated parser, and make (say) 3.2e8, -, 17e6i be separate tokens; it would then be the parser's job (or even the code generator's) to notice that what it's got is really a single literal.
In some programming languages, the lexer may not be able to tell whether a given token is a variable name or a type name. (This happens in C, for instance.) But the grammar of the language may distinguish between the two, so that you'd like "variable foo" and "type name foo" to be different tokens. (This also happens in C.) In this case, it may be necessary for some information to be fed back from the parser to the lexer so that it can produce the right sort of token in each case.
So "what exactly is a token?" may not always have a perfectly well defined answer.
A token is whatever you want it to be. Traditionally (and for good reasons), language specifications broke the analysis into two parts: the first part broke the input stream into tokens, and the second parsed the tokens. (Theoretically, I think you can write any grammar in only a single level, without using tokens, or what is the same thing, using individual characters as tokens. I wouldn't like to see the results of that for a language like C++, however.) But the definition of what a token is depends entirely on the language you are parsing: most languages, for example, treat white space as a separator (but not Fortran); most languages will predefine a set of punctuation/operators using punctuation characters, and not allow these characters in symbols (but not COBOL, where "abc-def" would be a single symbol). In some cases (including in the C++ preprocessor), what is a token depends on context, so you may need some feedback from the parser. (Hopefully not; that sort of thing is for very experienced programmers.)
One thing is probably sure (unless each character is a token): you'll have to read ahead in the stream. You typically can't tell whether there are more tokens by just looking at a single character. I've generally found it useful, in fact, for the tokenizer to read a whole token at a time, and keep it until the parser needs it. A function like hasMoreTokens would in fact scan a complete token.
(And while I'm at it, if source is an istream: istream::peek does not return a pointer, but an int.)
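For the code in the question, a minimal corrected sketch (assuming source is a std::istream member of Parser):

#include <istream>
#include <string>

bool Parser::hasMoreTokens() {
    source >> std::ws;  // skip whitespace between tokens
    return source.peek() != std::char_traits<char>::eof();
}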
A token is the smallest unit of a programming language that has a meaning. A parenthesis (, a name foo, an integer 123, are all tokens. Reducing a text to a series of tokens is generally the first step of parsing it.
A token is usually akin to a word in spoken language. In C++, int, float, 5.523, and const would all be tokens. It is the minimal unit of text which constitutes a semantic element.
When you split a large unit (long string) into a group of sub-units (smaller strings), each of the sub-units (smaller strings) is referred to as a "token". If there are no more sub-units, then you are done parsing.
How do I tokenize a string in C++?
A token is a terminal in a grammar: a sequence of one or more symbols that is defined by the sequence itself, i.e., it does not derive from any other production defined in the grammar.