Lex/Yacc based C parser: why unterminated string literal is not diagnosed? - regex

I built a C parser from Lex/Flex and YACC/Bison grammars (1, 2) as follows:
$ flex c.l && yacc -d c.y && gcc lex.yy.c y.tab.c -o c
and then tested on this C code:
char* s = "xxx;
which is expected to produce a missing terminating " character (or syntax error) diagnostic.
However, it doesn't:
$ ./c t1.c
char* s = xxx;
Why? How to fix it?
Note: The STRING_LITERAL is defined in lex specification as:
L?\"(\\.|[^\\"])*\" { count(); return(STRING_LITERAL); }
Here we see the [^\\"] part, which represents "any member of the source character set except the double-quote ", backslash \, or new-line character" (C11, 6.4.5 String literals, 1), and the \\. part, which (incorrectly?) represents the escape-sequence (C11, 6.4.4.4 Character constants, 1). -- end note
UPD: Fix: The STRING_LITERAL is defined in lex specification as:
L?\"(\\.|[^\\"\n])*\" { count(); return(STRING_LITERAL); }

The lexer you link has a rule:
. { /* Add code to complain about unmatched characters */ }
so when it sees an unmatched ", it will silently ignore it. If you add code here to complain about the character, you'll see that.
If you want a syntax error, you could have this action just return *yytext;
Note that your STRING_LITERAL pattern will match strings that contain embedded newlines, so if you have a mismatched " in a larger program with another string later, it will be recognized as one long string with embedded newlines. This will likely lead to poor error reporting, since the error would be reported after the buggy string rather than where it starts, making it hard for a user to debug.
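The embedded-newline effect described above can be reproduced outside of flex. The following is a minimal sketch that uses std::regex as a stand-in for the lexer; that substitution is an assumption for illustration only (flex has its own matching engine), and first_string_token, kOriginalRule, kFixedRule and kSource are hypothetical names. It compares the original rule with the fixed rule from the update on two lines of C where the first literal lacks its closing quote.

```cpp
#include <regex>
#include <string>

// Hypothetical helper standing in for the lexer's STRING_LITERAL rule:
// returns the first substring of `src` matched by `pattern`, or "" if none.
std::string first_string_token(const std::string& src, const std::string& pattern) {
    const std::regex re(pattern);
    std::smatch m;
    return std::regex_search(src, m, re) ? m.str(0) : std::string();
}

// Original rule: the character class does not exclude new-line.
const std::string kOriginalRule = R"re(L?"(\\.|[^\\"])*")re";

// Fixed rule from the update: new-line is excluded as well.
const std::string kFixedRule = R"re(L?"(\\.|[^\\"\n])*")re";

// Two lines of C; the first string literal is missing its closing quote.
const std::string kSource = "char* s = \"xxx;\nchar* t = \"yyy\";\n";
```

With the original rule the first match runs from the unterminated quote on line 1 to the opening quote on line 2. With the fixed rule the unterminated quote matches nothing, so in the real lexer it falls through to the catch-all . rule, which is where a diagnostic can be emitted.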

Related

c++ complains about __VA_ARGS__

The following code has been compiled with gcc-5.4.0 with no issues:
% gcc -W -Wall a.c
...
#include <stdio.h>
#include <stdarg.h>
static int debug_flag;
static void debug(const char *fmt, ...)
{
va_list ap;
va_start(ap, fmt);
vfprintf(stderr, fmt, ap);
va_end(ap);
}
#define DEBUG(...) \
do { \
if (debug_flag) { \
debug("DEBUG:"__VA_ARGS__); \
} \
} while(0)
int main(void)
{
int dummy = 10;
debug_flag = 1;
DEBUG("debug msg dummy=%d\n", dummy);
return 0;
}
However compiling this with g++ has interesting effects:
% g++ -W -Wall -std=c++11 a.c
a.c: In function ‘int main()’:
a.c:18:10: error: unable to find string literal operator ‘operator""__VA_ARGS__’ with ‘const char [8]’, ‘long unsigned int’ arguments
debug("DEBUG: "__VA_ARGS__); \
% g++ -W -Wall -std=c++0x
<same error>
% g++ -W -Wall -std=c++03
<no errors>
Changing debug("DEBUG:"__VA_ARGS__); to debug("DEBUG:" __VA_ARGS__); (i.e. adding a space before __VA_ARGS__) makes it compile with all three -std= options.
What is the reason for such behaviour? Thanks.
Since C++11 there is support for user-defined literals, which are literals, including string literals, immediately (without whitespace) followed by an identifier. A user-defined literal is considered a single preprocessor token. See https://en.cppreference.com/w/cpp/language/user_literal for details on their purpose.
Therefore "DEBUG:"__VA_ARGS__ is a single preprocessor token and it has no special meaning in a macro definition. The correct behavior is to simply place it unchanged into the macro expansion, where it then fails to compile because no user-defined literal operator for a __VA_ARGS__ suffix was declared.
So GCC is correct to reject it as C++11 code.
This is one of the backwards-incompatible changes between C++03 and C++11 listed in the appendix of the C++11 standard draft N3337: https://timsong-cpp.github.io/cppwp/n3337/diff.cpp03.lex
Before C++11 the string literal (up to the closing ") would be its own preprocessor token and the following identifier a second preprocessor token, even without whitespace between them.
So GCC is also correct to accept it in C++03 mode. (-std=c++0x is the same as -std=c++11, C++0x was the placeholder name for C++11 when it was still in drafting)
It is also an incompatibility with C (in all revisions up to now) since C doesn't support user-defined literals either and considers the two parts of "DEBUG:"__VA_ARGS__ as two preprocessor tokens as well.
Therefore it is correct for GCC to accept it as C code as well (which is how the gcc command interprets .c files in contrast to g++ which treats them as C++).
To fix this, add whitespace between "DEBUG:" and __VA_ARGS__ as you suggested. That should make it compatible with all C and C++ revisions.
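As a small illustration of the fix, here is a sketch in which the prefix literal and __VA_ARGS__ are separated by a space, so they stay two preprocessor tokens and the adjacent string literals are concatenated as intended. DEBUG_MSG and format_debug are hypothetical names, and writing into a buffer with snprintf (instead of printing to stderr as in the question) is only done here so the result can be inspected.

```cpp
#include <cstdio>
#include <string>

// Note the space between "DEBUG:" and __VA_ARGS__. Without it, C++11 and
// later lex "DEBUG:"__VA_ARGS__ as one user-defined-literal token.
#define DEBUG_MSG(buf, size, ...) std::snprintf(buf, size, "DEBUG:" __VA_ARGS__)

// Hypothetical helper: formats a debug line into a std::string.
std::string format_debug(int value) {
    char buf[64];
    // Expands to: std::snprintf(buf, sizeof buf, "DEBUG:" " dummy=%d", value)
    DEBUG_MSG(buf, sizeof buf, " dummy=%d", value);
    return buf;
}
```

The same macro text is accepted as C and as C++03, so the space costs nothing in the modes that already worked.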

C++ unknown escape sequences on regex + groups captured by half

I'm trying to use the following regex
(https?|rtsp):\/\/(?:([^\s#\/]+?)[#])?([^\s\/:]+)(?:[:]([0-9]+))?(?:(\/[^\s?#]+)([?][^\s#]+)?)?([#]\S*)?
on C++ like this:
#include <iostream>
#include <string>
#include <regex>
int main() {
std::string str("rtsp://3333:1232#hellowebsite.com:2222");
std::regex r("(https?|rtsp):\/\/(?:([^\s#\/]+?)[#])?([^\s\/:]+)(?:[:]([0-9]+))?(?:(\/[^\s?#]+)([?][^\s#]+)?)?([#]\S*)?");
std::smatch m;
std::regex_search(str, m, r);
std::cout << str << std::endl;
for(auto v: m) std::cout << v << std::endl;
}
To match rtsp or http URLs, but this is the output of compilation + running:
main.cpp:7:33: warning: unknown escape sequence '\/' [-Wunknown-escape-sequence]
std::regex r("(https?|rtsp):\/\/(?:([^\s#\/]+?)[#])?([^\s\/:]+)(?:[:]([0-9]+))?(?...
^~
main.cpp:7:35: warning: unknown escape sequence '\/' [-Wunknown-escape-sequence]
std::regex r("(https?|rtsp):\/\/(?:([^\s#\/]+?)[#])?([^\s\/:]+)(?:[:]([0-9]+))?(?...
^~
main.cpp:7:43: warning: unknown escape sequence '\s' [-Wunknown-escape-sequence]
std::regex r("(https?|rtsp):\/\/(?:([^\s#\/]+?)[#])?([^\s\/:]+)(?:[:]([0-9]+))?(?...
^~
main.cpp:7:46: warning: unknown escape sequence '\/' [-Wunknown-escape-sequence]
std::regex r("(https?|rtsp):\/\/(?:([^\s#\/]+?)[#])?([^\s\/:]+)(?:[:]([0-9]+))?(?...
^~
main.cpp:7:60: warning: unknown escape sequence '\s' [-Wunknown-escape-sequence]
std::regex r("(https?|rtsp):\/\/(?:([^\s#\/]+?)[#])?([^\s\/:]+)(?:[:]([0-9]+))?(?...
^~
main.cpp:7:62: warning: unknown escape sequence '\/' [-Wunknown-escape-sequence]
std::regex r("(https?|rtsp):\/\/(?:([^\s#\/]+?)[#])?([^\s\/:]+)(?:[:]([0-9]+))?(?...
^~
main.cpp:7:88: warning: unknown escape sequence '\/' [-Wunknown-escape-sequence]
...r("(https?|rtsp):\/\/(?:([^\s#\/]+?)[#])?([^\s\/:]+)(?:[:]([0-9]+))?(?:(\/[^\s?#]+)([...
^~
main.cpp:7:92: warning: unknown escape sequence '\s' [-Wunknown-escape-sequence]
...r("(https?|rtsp):\/\/(?:([^\s#\/]+?)[#])?([^\s\/:]+)(?:[:]([0-9]+))?(?:(\/[^\s?#]+)([...
^~
main.cpp:7:105: warning: unknown escape sequence '\s' [-Wunknown-escape-sequence]
...\s#]+)?)?([#]\S*)?");
^~
main.cpp:7:118: warning: unknown escape sequence '\S' [-Wunknown-escape-sequence]
...\S*)?");
^~
10 warnings generated.
 ./main
rtsp://3333:1232#hellowebsite.com:2222
rtsp://3333:1232#helloweb
rtsp
3333:1232
helloweb
check here..
First of all, why am I getting unknown escape sequences? \\, \s, etc. are pretty well known.
Most importantly, why do I get these unfinished groups? It works fine in online regex testers.
Especially when you're doing regexes, raw string literals are your friend. So, as a starting point, I'd do something like this:
std::regex r(R"--((https?|rtsp):\/\/(?:([^\s#\/]+?)[#])?([^\s\/:]+)(?:[:]([0-9]+))?(?:(\/[^\s?#]+)([?][^\s#]+)?)?([#]\S*)?)--");
If you really don't want to use raw string literals, note that a backslash in an ordinary C++ string literal introduces an escape sequence, so when you want the string to actually contain a backslash you need to write two backslashes in a row. At a bare minimum you need to convert those, so it starts something like this:
std::regex r("(https?|rtsp):\\/\\/(?:
...continuing for all the other back-slashes it contains. There might be a bit more to do after that, but that's the minimum that it's immediately obvious you need to do.
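To make the raw-string suggestion concrete, here is a minimal sketch that wraps the pattern from the question in a function (match_url is a hypothetical name). It also hints at the most likely cause of the truncated groups: in the non-raw literal, compilers typically warn and then drop the backslash of an unknown escape, so \s reaches the regex as a plain s and [^\s\/:] turns into "anything except s, / or :", which stops at the first s in hellowebsite.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns all capture groups (index 0 is the whole
// match), or an empty vector when the input does not match.
std::vector<std::string> match_url(const std::string& url) {
    // Raw string literal: every backslash reaches std::regex unchanged.
    static const std::regex re(
        R"--((https?|rtsp):\/\/(?:([^\s#\/]+?)[#])?([^\s\/:]+)(?:[:]([0-9]+))?(?:(\/[^\s?#]+)([?][^\s#]+)?)?([#]\S*)?)--");
    std::smatch m;
    std::vector<std::string> groups;
    if (std::regex_search(url, m, re)) {
        for (const auto& sub : m) {
            groups.push_back(sub.str());  // unmatched groups become ""
        }
    }
    return groups;
}
```

For the URL from the question, the host group should then contain hellowebsite.com in full and the port group 2222.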

How to pass raw string literals to [[deprecated(message)]] attribute?

I want to pass a raw string literal to the [[deprecated(message)]] attribute as the message. The message is used again and again, so I want to avoid repeating it.
First, I tried to use a static constexpr variable.
static constexpr auto str = R"(
Use this_func()
Description: ...
Parameter: ...
)";
[[deprecated(str)]]
void test1() {
}
I got the error "deprecated message is not a string". It seems that a static constexpr variable isn't accepted by [[deprecated(message)]].
Then I tried to define the raw string literal as a preprocessor macro.
#define STR R"(
Use this_func()
Description: ...
Parameter: ...
)"
[[deprecated(STR)]]
void test2() {
}
It works as I expected on clang++ 8.0.0:
prog.cc:38:5: warning: 'test2' is deprecated:
Use this_func()
Description: ...
Parameter: ...
[-Wdeprecated-declarations]
test2();
^
Demo: https://wandbox.org/permlink/gN4iOrul8Y0F76TZ
But g++ 9.2.0 outputs the compile error as follows:
prog.cc:19:13: error: unterminated raw string
19 | #define STR R"(
| ^
prog.cc:23:2: warning: missing terminating " character
23 | )"
| ^
https://wandbox.org/permlink/e62pQ2Dq9vTuG6Or
#define STR R"( \
Use this_func() \
Description: ... \
Parameter: ... \
)"
If I add a backslash at the end of each line, no compile error occurs, but the output message differs from what I expected:
prog.cc:38:11: warning: 'void test2()' is deprecated: \\nUse this_func() \\nDescription: ... \\nParameter: ... \\n [-Wdeprecated-declarations]
I'm not sure which compiler works correctly.
Is there any way to pass the raw string literals variable/macro to [[deprecated]] attribute?
There is no such thing as a "raw string literal variable". There may be a variable which points to a string literal, but it is a variable, not the literal itself. The deprecated attribute does not take a C++ constant expression evaluating to a string. It takes a string literal: an actual token sequence.
So the most you can do is use a macro to contain your string literal. Of course, macros and raw string literals don't play nice together, since the raw string is supposed to consume the entire text. So the \ characters will act as both continuations for the macro and be part of the string.
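One way to keep a single shared message in a macro is to give up the raw string and use ordinary string literals with explicit \n escapes, relying on adjacent-literal concatenation. This is a swapped-in technique, not what the question asked for, and the sketch below is only an illustration: DEPRECATION_MESSAGE and deprecation_message are hypothetical names. The backslashes at the line ends are macro continuations and sit outside the literals, so they do not end up in the message.

```cpp
#include <string>

// Assumption for illustration: ordinary string literals with explicit \n,
// joined by adjacent-literal concatenation, instead of one raw string.
#define DEPRECATION_MESSAGE   \
    "\nUse this_func()"       \
    "\nDescription: ..."      \
    "\nParameter: ...\n"

// The attribute sees a single string literal after concatenation.
[[deprecated(DEPRECATION_MESSAGE)]]
void test1() {}

[[deprecated(DEPRECATION_MESSAGE)]]
void test2() {}

// Exposes the macro's text so it can be inspected at run time.
std::string deprecation_message() { return DEPRECATION_MESSAGE; }
```

Calling test1() or test2() should then produce the deprecation warning with the multi-line message; how the embedded newlines are rendered in the diagnostic is up to the compiler.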

Why don't these Unicode variable names work with -fextended-identifiers? «, » and ≠ [duplicate]

This question already has answers here:
😃 (and other Unicode characters) in identifiers not allowed by g++
(3 answers)
Closed 7 years ago.
I heard that it is possible to use unicode variable names using the -fextended-identifiers flag in gcc. So I made a test program in C++ but it does not compile.
#include <iostream>
#include <string>
#define ¬ !
#define ≠ !=
#define « <<
#define » >>
/* uniq: remove duplicate lines from stdin */
int main() {
std::string s;
std::string t = "";
while (cin » s) {
if (s ≠ t)
cout « s;
t = s;
}
return 0;
}
I get these errors:
g++ -fextended-identifiers -g3 -o a main.cpp
main.cpp:10:3: error: stray ‘\342’ in program
if (s ≠ t)
^
main.cpp:10:3: error: stray ‘\211’ in program
main.cpp:10:3: error: stray ‘\240’ in program
main.cpp:11:4: error: stray ‘\302’ in program
cout « s;
^
main.cpp:11:4: error: stray ‘\253’ in program
What is going on? Aren't these macro names supposed to work with -fextended-identifiers?
G++ doesn't support Unicode characters in the source yet:
What is the status of adding the UTF-8 support for identifier names in GCC?
Notably, the errors generated by your program are for the individual octets of the UTF-8 encoding, not for the Unicode character they represent. ≠ is being seen as three bytes: \342\211\240 and « as two: \302\253.
The C++ Standard requires (section 2.10):
An identifier is an arbitrarily long sequence of letters and digits. Each universal-character-name in an identifier shall designate a character whose encoding in ISO 10646 falls into one of the ranges specified in E.1. The initial element shall not be a universal-character-name designating a character whose encoding falls into one of the ranges specified in E.2. Upper- and lower-case letters are different. All characters are significant.
And E.1:
Ranges of characters allowed [charname.allowed]
00A8, 00AA, 00AD, 00AF, 00B2-00B5, 00B7-00BA, 00BC-00BE, 00C0-00D6, 00D8-00F6, 00F8-00FF
0100-167F, 1681-180D, 180F-1FFF
200B-200D, 202A-202E, 203F-2040, 2054, 2060-206F
2070-218F, 2460-24FF, 2776-2793, 2C00-2DFF, 2E80-2FFF
3004-3007, 3021-302F, 3031-303F
3040-D7FF
F900-FD3D, FD40-FDCF, FDF0-FE44, FE47-FFFD
10000-1FFFD, 20000-2FFFD, 30000-3FFFD, 40000-4FFFD, 50000-5FFFD,
60000-6FFFD, 70000-7FFFD, 80000-8FFFD, 90000-9FFFD, A0000-AFFFD,
B0000-BFFFD, C0000-CFFFD, D0000-DFFFD, E0000-EFFFD
0300-036F, 1DC0-1DFF, 20D0-20FF, FE20-FE2F
Your angle quotes « and » are U+00AB and U+00BB (the bytes \302\253 above decode to U+00AB), which are not included. Not equal (≠) is U+2260, also disallowed.
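The range check can be done mechanically. The sketch below decodes the first code point of a UTF-8 byte sequence and tests it against the E.1 ranges quoted above; decode_first_utf8 and allowed_in_identifier are hypothetical helper names, the decoder assumes well-formed input, and only the E.1 table is modelled (not E.2 or the basic source character set).

```cpp
#include <string>

// Decodes the first code point of a UTF-8 string (1- to 4-byte sequences).
// Sketch only: assumes well-formed, non-empty input and does no validation.
char32_t decode_first_utf8(const std::string& s) {
    const auto b = [&](std::size_t i) { return static_cast<unsigned char>(s[i]); };
    if (b(0) < 0x80) return b(0);
    if ((b(0) & 0xE0) == 0xC0) return ((b(0) & 0x1Fu) << 6) | (b(1) & 0x3Fu);
    if ((b(0) & 0xF0) == 0xE0)
        return ((b(0) & 0x0Fu) << 12) | ((b(1) & 0x3Fu) << 6) | (b(2) & 0x3Fu);
    return ((b(0) & 0x07u) << 18) | ((b(1) & 0x3Fu) << 12) |
           ((b(2) & 0x3Fu) << 6) | (b(3) & 0x3Fu);
}

struct Range { char32_t first, last; };

// True if `cp` falls into one of the E.1 ranges quoted in the answer.
bool allowed_in_identifier(char32_t cp) {
    static const Range kRanges[] = {
        {0x00A8, 0x00A8}, {0x00AA, 0x00AA}, {0x00AD, 0x00AD}, {0x00AF, 0x00AF},
        {0x00B2, 0x00B5}, {0x00B7, 0x00BA}, {0x00BC, 0x00BE}, {0x00C0, 0x00D6},
        {0x00D8, 0x00F6}, {0x00F8, 0x00FF}, {0x0100, 0x167F}, {0x1681, 0x180D},
        {0x180F, 0x1FFF}, {0x200B, 0x200D}, {0x202A, 0x202E}, {0x203F, 0x2040},
        {0x2054, 0x2054}, {0x2060, 0x206F}, {0x2070, 0x218F}, {0x2460, 0x24FF},
        {0x2776, 0x2793}, {0x2C00, 0x2DFF}, {0x2E80, 0x2FFF}, {0x3004, 0x3007},
        {0x3021, 0x302F}, {0x3031, 0x303F}, {0x3040, 0xD7FF}, {0xF900, 0xFD3D},
        {0xFD40, 0xFDCF}, {0xFDF0, 0xFE44}, {0xFE47, 0xFFFD},
    };
    for (const Range& r : kRanges) {
        if (cp >= r.first && cp <= r.last) return true;
    }
    // Planes 1 to 14: 10000-1FFFD, 20000-2FFFD, ..., E0000-EFFFD.
    return cp >= 0x10000 && cp <= 0xEFFFD && (cp & 0xFFFF) <= 0xFFFD;
}
```

Feeding it the octets from the error messages, "\302\253" and "\342\211\240", gives U+00AB and U+2260, and neither is in the table.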

Getting error when comparing a character with component of a string in C++

#include <iostream>
#include <fstream>
#include <cstring>
#define MAX_CHARS_PER_LINE 512
#define MAX_TOKENS_PER_LINE 20
#define DELIMITER " "
using namespace std;
int main ()
{
//char buf[MAX_CHARS_PER_LINE];
string buf; // string to store a line in file
fstream fin;
ofstream fout;
fin.open("PiCalculator.h", ios::in);
fout.open("op1.txt", ios::out);
fout<< "#include \"PiCalculator.h\"\n";
static int flag=0; //this variable counts the no of curly brackets
while (!fin.eof())
{
// read an entire line into memory
getline(fin, buf);
//fout<<buf.back()<<endl;
if(buf.back()== "{" && buf.front() !='c'){
flag++;
fout<<buf<<endl;
}
if(flag > 0)
fout<<buf<<endl;
if(buf.back()== "}"){
flag--;
fout<<buf<<endl;
}
}
cout<<buf.back()<<endl;
return 0;
}
Here I'm getting the error in the if condition:
if(buf.back()== "{" && buf.front() !='c')
The error states that: ISO C++ forbids comparison between pointer and integer [-fpermissive]
Can anyone help me in sorting out the problem ??
The comparison is invalid because you check buf.back() against "{" but buf.front() against 'c'. Both functions return a character, so both should be compared with a character. 'c' is in single quotes, while "{" is in double quotes. That's the issue.
Change "{" to '{'
Hope this helps.
As noted by Samuel, std::string::back returns a char& (e.g., see http://en.cppreference.com/w/cpp/string/basic_string/back) and you're comparing it to a string literal (double quotes signify a string, single quotes a char).
It's not clear why it isn't necessary to have a #include <string> directive in this case, but it would be best practice to have it. It's not obvious that you need <cstring>, though.
I recommend compiling with warnings on; if I compile your code that way, I get reasonably helpful error messages:
g++ -Wall -Wextra -std=c++11 q3.cc -o q3
q3.cc: In function ‘int main()’:
q3.cc:24:25: warning: comparison with string literal results in unspecified behaviour [-Waddress]
if(buf.back()== "{" && buf.front() !='c'){
^
q3.cc:24:25: error: ISO C++ forbids comparison between pointer and integer [-fpermissive]
q3.cc:30:25: warning: comparison with string literal results in unspecified behaviour [-Waddress]
if(buf.back()== "}"){
^
q3.cc:30:25: error: ISO C++ forbids comparison between pointer and integer [-fpermissive]
This is gcc version 4.8.2.
buf.back() returns either a char& or a const char&.
What you are comparing is a char with the string literal "{", which won't work.
Change this line (and all others where you compare with a string literal) to:
if(buf.back()== '{' && buf.front() !='c'){
Note the '{'
buf.back()== "{"
The string literal "{" decays into a pointer (const char*), while buf.back() returns a char, which is promoted to an integer for the comparison; hence you get the error.
Please change that to
buf.back() == '{'