Ensure every string literal is wrapped inside macro - c++

I want to make sure every string literal in my project is wrapped in a macro, and I would like an external tool to report the location of any string literal that is not wrapped.
Is there any way I could use Clang Plugins to ensure that every string literal is wrapped inside macro?
Cases I want to handle:
#define MY_ASSERT(Y) {if(!(Y)) throw Exception(#Y); }
The #Y should be warned as unwrapped string literal.
"a" "b" "c"
It will require that the whole thing will be inside a macro, like this:
MY_STR("a" "b" "c")
How could I do that with a Clang plugin, or is there another way in general to do it?
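To make the two cases concrete, here is a minimal sketch of what is being asked for. MY_STR is written as an identity macro and std::runtime_error stands in for the Exception class; both are assumptions for illustration, since the question does not define them. The point is that #Y creates a string literal during preprocessing, so a checking tool has to look at preprocessed tokens (or the macro definition itself) to see it.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical wrapper macro: an identity macro here, purely for illustration.
#define MY_STR(s) s

// #Y is turned into a string literal by the preprocessor, so the literal
// never appears in the source text. Wrapping it means writing MY_STR(#Y).
// std::runtime_error is an assumed stand-in for the question's Exception.
#define MY_ASSERT(Y) { if (!(Y)) throw std::runtime_error(MY_STR(#Y)); }

// Returns the text carried by the exception when the assertion fails.
std::string failing_condition_text() {
    try {
        MY_ASSERT(1 + 1 == 3);
    } catch (const std::runtime_error& e) {
        return e.what();
    }
    return "";
}

// A sequence of adjacent literals wrapped as a whole, as the question requires.
std::string wrapped_sequence() {
    return MY_STR("a" "b" "c");
}
```

Calling failing_condition_text() yields the stringized condition, and wrapped_sequence() yields the concatenated text of the three literals.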

You could do that with the DMS Software Reengineering Toolkit and its C++ front end.
DMS reads source code according to an explicit grammar definition of C++ (it handles C++17 in GCC and MS dialects), builds an AST, applies the rewrite rules you supply to modify the tree, and then prettyprints the AST back to source text, preserving comments, text alignment, number radixes, etc.
To do this, you need just one DMS rule (see DMS Rewrite Rules for details):
rule wrap_string_in_macro(s:string_literal):primary_expression->primary_expression
= "\s" -> " my_macro_name(\s) ";
The nonterminal string_literal covers the wide variety of C++ strings (8-bit, ISO, wide, raw, sequences of strings, ...), so this single rule picks them all up. Your macro, however, might need to distinguish between them, so you could write a larger set of rules that specialize the macro call:
rule wrap_ISO_string_in_macro(s:ISO_STRING_LITERAL):primary_expression->primary_expression
= "\s" -> " my_macro_name_for_ISO_string(\s) ";
rule wrap_wide_string_in_macro(s:WIDE_STRING_LITERAL):primary_expression->primary_expression
= "\s" -> " my_macro_name_for_wide_string(\s) ";
...
These rules will pick up individual strings, but that leaves the problem of handling sequences of strings:
rule wrap_ISO_string_list_in_macro(seq: string_literal_list,s:ISO_STRING_LITERAL):primary_expression->primary_expression
= " \string_literal_list \s" -> " my_macro_name_for_ISO_string_list(\s) ";
...

Related

Expand escape sequences detected by flex

In my scanner.lex file I have this:
{Some rule that matches strings} return STRING; //STRING is enum
in my c++ file I have this:
if (yylex() == STRING) {
cout << "STRING: " << yytext << endl;
}
Obviously with some logic to take the input from stdin.
Now if this program gets the input "Hello\nWorld", my output is "STRING: Hello\nWorld", while I would want my output to be:
Hello
World
The same goes for other escape characters such as \",\0, \x<hex_number>, \t, \\... But I'm not sure how to achieve this. I'm not even sure if that's a flex issue or if I can solve this using only c++ tools...
How can I get this done?
As @Some programmer dude mentions in a comment, there is an example of how to do this using start conditions in the Flex documentation. That example puts the escaping rules into a separate start condition; each rule is implemented by appending the unescaped text to a buffer. And that's the way it's normally done.
Of course, you might find an external library which unescapes C-style escaped strings, which you could call on the string returned by flex. But that would be both slower and less flexible than the approach suggested in the Flex manual: slower because it requires a second scan of the string, and less flexible because the library is likely to have its own idea of what escapes to handle.
If you're using C++, you might find it more elegant to modify that example to use a std::string buffer instead of an arbitrary fixed-size character array. You can compile a flex-generated scanner with C++, so there is no problem using C++ standard library objects in your scanner code.
Depending on the various semantic value types you are managing, you will probably want to modify the yylex prototype to either use an additional reference parameter or a more structured return type, in order to return the token value to the caller. Note that while it is OK to use yytext before the next call to yylex, it's not generally considered good style since it won't work with most parsers: in general, parsers require the ability to look one or more tokens ahead, and thus yytext is likely to be overwritten by the time your parser needs its value. The flex manual documents the macro hook used to modify the yylex() prototype.
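For comparison, the second-pass alternative mentioned above (unescaping the matched text after flex returns it) could look roughly like the sketch below. The names unescape and hex_value are hypothetical helpers, not part of flex, and the set of escapes handled is an assumption based on the ones listed in the question. The start-condition approach from the Flex manual remains the usual way to do this.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: value of a hex digit, or -1 if c is not one.
int hex_value(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Hypothetical helper: expands a few C-style escapes in a second pass over
// the scanned text. Unknown escapes are copied through unchanged.
std::string unescape(const std::string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] != '\\' || i + 1 == in.size()) {
            out += in[i];  // ordinary character, or a lone trailing backslash
            continue;
        }
        char c = in[++i];  // character following the backslash
        switch (c) {
            case 'n':  out += '\n'; break;
            case 't':  out += '\t'; break;
            case 'r':  out += '\r'; break;
            case '0':  out += '\0'; break;
            case '"':  out += '"';  break;
            case '\\': out += '\\'; break;
            case 'x': {
                // Accept up to two hex digits, e.g. \x41 becomes 'A'.
                int value = 0;
                int digits = 0;
                while (digits < 2 && i + 1 < in.size() && hex_value(in[i + 1]) >= 0) {
                    value = value * 16 + hex_value(in[++i]);
                    ++digits;
                }
                if (digits == 0) out += "\\x";  // no digits: keep the text as-is
                else out += static_cast<char>(value);
                break;
            }
            default:
                out += '\\';
                out += c;
                break;
        }
    }
    return out;
}
```

With this, cout << unescape(yytext) would print the two lines the question asks for, at the cost of scanning the string a second time.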

How to stringize string with trailing backslash

When I build my C++ project the compiler generates this equivalent macro:
#define SOLUTION_DIR "c:\dev\my_project\"
In a normally #defined macro, the trailing escaped double quote would trigger a compiler error due to the unterminated string, but the compiler can do whatever it wants, and it makes this value available to the code literally even though the string is invalid.
The usual way to expand macro values to C strings:
#define STRINGIZE( x ) #x
#define EXPAND( x ) STRINGIZE( x )
doesn't work in this case due to the unterminated string passed as argument.
std::string s = EXPAND( SOLUTION_DIR );
...
error: newline in constant
Is there a way to extract the string value of this macro and use it in my code equivalent to:
std::string str = R"(c:\dev\my_project\)";
where R is raw character prefix described here https://en.cppreference.com/w/cpp/language/string_literal
Notes:
I tried re-writing these macros using the R prefix to avoid escaping
the final quote mark but couldn't get to a functional version.
I can tell the compiler to define SOLUTION_DIR string without the
surrounding quotes, but I can't avoid the trailing backslash. In
this case however I get other warnings and errors due to the unknown
escape sequences (\d) and the fact that the trailing
backslash is taken to indicate that the macro is continuing on
the next line.
Update:
Here's the context for those who think something is broken and needs to be fixed.
I use Visual Studio 2019 (VS). In the project properties "C++/Preprocessor/Preprocessor Definitions" one can define various macros in the format:
NAME1=VALUE1;NAME2=VALUE2;...
which are then made available at compile time as
#define NAME1 VALUE1
#define NAME2 VALUE2
VS generates a number of predefined macros (not C++ but build environment macros) for various directories and other values (debug/release, 32 or 64 bit etc). They take the form $(Name) and are set to some string value such as:
$(Configuration) Debug
$(SolutionDir) C:\dev\some_project\
They are used to create location independent project settings such as the temp or binary output directories, or set the correct environment for whatever version of the project is being built (for instance Debug/x64).
In my case I need to get a hold of the current solution path directly in my code, and using the $(SolutionDir) VS macro seemed the easiest way to do it.
So here's how I defined my SOLUTION_DIR macro in "Properties/Preprocessor/Preprocessor Definitions":
SOLUTION_DIR="$(SolutionDir)"
which translates into the compile time macro described initially:
#define SOLUTION_DIR "c:\dev\my_project\"
However, by default many macros that expand to paths, including $(SolutionDir), contain a trailing backslash that can't be removed, hence the "broken" macro above.
Generally an executable binary doesn't need to and should not know anything about its build directories, so the path related macros are not necessarily designed to be used to define C++ macros, and the trailing backslash is not an issue. But my project needs that information because it itself triggers other build actions that depend on the current environment.
So this is not a malfunction of any of the components, everything works as designed, it just happens that for my specific project it would be very useful to be able to do things this way, even if it's non-standard.
I was able to make this work by adding a trailing ".":
SOLUTION_DIR="$(SolutionDir)."
which results in the equivalent:
#define SOLUTION_DIR "C:\dev\my_project\."
which points to the same directory and now compiles with no errors.
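For readers unfamiliar with the two-level stringize idiom quoted in the question, the sketch below shows how it behaves on well-formed macros. ANSWER and GREETING are hypothetical macros introduced only for this illustration. Note that stringizing a macro whose value is already a quoted string keeps the quotes as part of the resulting text.

```cpp
#include <cassert>
#include <string>

#define STRINGIZE(x) #x
#define EXPAND(x) STRINGIZE(x)

// Hypothetical macros, used only to illustrate the idiom.
#define ANSWER 42
#define GREETING "hi"

// One level: the argument is stringized before it is expanded.
std::string direct() { return STRINGIZE(ANSWER); }

// Two levels: the argument is expanded first, then stringized.
std::string expanded() { return EXPAND(ANSWER); }

// Stringizing a string literal escapes its quotes, so the result
// contains the quote characters themselves.
std::string quoted() { return EXPAND(GREETING); }
```

Here direct() gives the macro's name, expanded() gives its value, and quoted() gives a four-character string that starts and ends with a double quote.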

Escaping blocks of foreign code in C++

I'm currently working on a toy language that works like this: one can embed blocks written in this language into a C++ source, and before compilation, these blocks are translated into C++ in an extra preprocessing step, producing a valid C++ source.
I want to make sure that these blocks can always be identified in the source unambiguously and also, whenever such a block is present in the source, it cannot be valid C++. Moreover, I want to achieve these by putting as few constraints to the embedded language as possible (the language itself is still somewhat fluid).
The obvious way would be to introduce a pair of special multi-character parentheses, made of characters that cannot appear together in valid C++ code (or in the embedded language). However, I'm not sure how to ensure that a particular character sequence is good for this purpose (not after GotW #78, anyway (: ).
So what is a good way to escape these blocks?
If your compiler can be made to accept C++11 standard, you could use raw string literals like eg:
std::cout << R"*(<!DOCTYPE html>
<html>
<head>
<title>Title with a backslash \ here
and double " quote</title>)*";
With raw string literals there is therefore no forbidden sequence of characters: any sequence can appear inside them, because you choose the delimiter that ends the literal.
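As a small illustration of choosing the delimiter, the sketch below uses * so that the text can contain )" without ending the literal. The function name and the sample text are made up for this example.

```cpp
#include <cassert>
#include <string>

// With the default R"( ... )" form, the first )" would end the literal.
// Using * as the delimiter means only )*" terminates it.
std::string fragment() {
    return R"*(say(")"))*";
}
```

The returned text is say(")") exactly as typed, eight characters long, with no escaping needed.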
You could also use #{ and }#, as I do in MELT macro-strings. MELT is a Lisp-like domain-specific language for extending GCC, and you can embed code in it with, e.g.:
(code_chunk hellocount_chk
#{ /* $HELLOCOUNT_CHK chunk */
static int $HELLOCOUNT_CHK#_counter;
$HELLOCOUNT_CHK#_counter++;
$HELLOCOUNT_CHK#_lab:
printf ("Hello World, counted %d\n",
$HELLOCOUNT_CHK#_counter);
if (random() % 4 == 0) goto $HELLOCOUNT_CHK#_lab;
}#)
The #{ and }# are enclosing macro-strings (these character sequences are unlikely to appear in C or C++ code, except in string literals and comments), with the $ starting symbols in such macro-strings (up to a non-letter or # character).
Using #{ and }# is not fool-proof (e.g. because of raw string literals) but good enough: a cooperative user could manage to avoid them.
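A preprocessing step that locates such blocks could start from something like the following sketch. The function extract_blocks is hypothetical, and it is deliberately naive: it assumes the delimiters never occur inside C++ string literals or comments and that blocks do not nest, which is exactly the cooperation expected from the user above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Naive sketch: returns the text between each "#{" and the next "}#".
// It does not skip string literals or comments and does not handle nesting.
std::vector<std::string> extract_blocks(const std::string& src) {
    std::vector<std::string> blocks;
    std::size_t pos = 0;
    while (true) {
        std::size_t open = src.find("#{", pos);
        if (open == std::string::npos) break;
        std::size_t close = src.find("}#", open + 2);
        if (close == std::string::npos) break;  // unterminated block: stop
        blocks.push_back(src.substr(open + 2, close - (open + 2)));
        pos = close + 2;  // continue after the closing delimiter
    }
    return blocks;
}
```

A real translator would replace each block with generated C++ instead of collecting it, but the search loop would have the same shape.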

Why must C/C++ string literal declarations be single-line?

Is there any particular reason that multi-line string literals such as the following are not permitted in C++?
string script =
"
Some
Formatted
String Literal
";
I know that multi-line string literals may be created by putting a backslash before each newline.
I am writing a programming language (similar to C) and would like to allow the easy creation of multi-line strings (as in the above example).
Is there any technical reason for avoiding this kind of string literal? Otherwise I would have to use a python-like string literal with a triple quote (which I don't want to do):
string script =
"""
Some
Formatted
String Literal
""";
Why must C/C++ string literal declarations be single-line?
The terse answer is "because the grammar prohibits multiline string literals." I don't know whether there is a good reason for this other than historical reasons.
There are, of course, ways around this. You can use line splicing:
const char* script = "\
Some\n\
Formatted\n\
String Literal\n\
";
If the \ appears as the last character on the line, the newline will be removed during preprocessing.
Or, you can use string literal concatenation:
const char* script =
" Some\n"
" Formatted\n"
" String Literal\n";
Adjacent string literals are concatenated during translation, before the program is parsed, so these will end up as a single string literal at compile-time.
Using either technique, the string literal ends up as if it were written:
const char* script = " Some\n Formatted\n String Literal\n";
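The equivalence of the two techniques can be checked with a short sketch like the one below. The variable and function names are chosen for this example only, and the continuation lines start in the first column on purpose, since any leading whitespace on a spliced line becomes part of the string.

```cpp
#include <cassert>
#include <cstring>

// Line splicing: each backslash immediately followed by a newline is removed
// early in translation, so the pieces join into one literal.
const char* spliced = "\
Some\n\
Formatted\n\
";

// Concatenation: adjacent literals are merged into one.
const char* concatenated =
    "Some\n"
    "Formatted\n";

// True when both techniques produced the same text.
bool same_text() {
    return std::strcmp(spliced, concatenated) == 0;
}
```

Both variables end up pointing at the same two-line text, so same_text() returns true.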
One has to consider that C was not written to be an "Applications" programming language but a systems programming language. It would not be inaccurate to say it was designed expressly to rewrite Unix. With that in mind, there was no EMACS or VIM and your user interfaces were serial terminals. Multiline string declarations would seem a bit pointless on a system that did not have a multiline text editor. Furthermore, string manipulation would not be a primary concern for someone looking to write an OS at that particular point in time. The traditional set of UNIX scripting tools such as AWK and SED (amongst MANY others) are a testament to the fact they weren't using C to do significant string manipulation.
Additional considerations: it was not uncommon in the early 70s (when C was written) to submit your programs on PUNCH CARDS and come back the next day to get them. Would it have eaten up extra processing time to compile a program with multiline strings literals? Not really. It can actually be less work for the compiler. But you were going to come back for it the next day anyhow in most cases. But nobody who was filling out a punch card was going to put large amounts of text that wasn't needed in their programs.
In a modern environment, there is probably no reason not to include multiline string literals other than designer's preference. Grammatically speaking, it's probably simpler because you don't have to take linefeeds into consideration when parsing the string literal.
In addition to the existing answers, you can work around this using C++11's raw string literals, e.g.:
#include <iostream>
#include <string>
int main() {
std::string str = R"(a
b)";
std::cout << str;
}
/* Output:
a
b
*/
[n3290: 2.14.5/4]: [ Note: A source-file new-line in a raw string
literal results in a new-line in the resulting execution
string-literal. Assuming no whitespace at the beginning of lines in
the following example, the assert will succeed:
const char *p = R"(a\
b
c)";
assert(std::strcmp(p, "a\\\nb\nc") == 0);
—end note ]
Though non-normative, this note and the example that follows it in [n3290: 2.14.5/5] serve to complement the indication in the grammar that the production r-char-sequence may contain newlines (whereas the production s-char-sequence, used for normal string literals, may not).
Others have mentioned some excellent workarounds, I just wanted to address the reason.
The reason is simply that C was created at a time when processing was at a premium and compilers had to be simple and as fast as possible. These days, if C were to be updated (I'm looking at you, C1X), it's quite possible to do exactly what you want. It's unlikely, however. Mostly for historical reasons; such a change could require extensive rewrites of compilers, and so will likely be rejected.
The C preprocessor works on a line-by-line basis, but with lexical tokens. That means that the preprocessor understands that "foo" is a token. If C were to allow multi-line literals, however, the preprocessor would be in trouble. Consider:
"foo
#ifdef BAR
bar
#endif
baz"
The preprocessor isn't able to mess with the inside of a token - but it's operating line-by-line. So how is it supposed to handle this case? The easy solution is to simply forbid multiline strings entirely.
Actually, you can break it up thus:
string script =
"\n"
" Some\n"
" Formatted\n"
" String Literal\n";
Adjacent string literals are concatenated by the compiler.
Strings can span multiple lines, but each line has to be quoted individually:
string script =
" \n"
" Some \n"
" Formatted \n"
" String Literal ";
I am writing a programming language
(similar to C) and would like to let
write multi-line strings easily (like
in above example).
There is no reason why you couldn't create a programming language that allows multi-line strings.
For example, Vedit Macro Language (a C-like scripting language for the VEDIT text editor) allows multi-line strings:
Reg_Set(1,"
Some
Formatted
String Literal
")
It is up to you how you define your language syntax.
You can also do:
string useMultiple = "this "
"is "
"a string in C.";
Place one literal after another without any special chars.
Literal declarations don't have to be single-line.
GPUImage inlines multiline shader code. Check out its SHADER_STRING macro.

Implementation of string literal concatenation in C and C++

AFAIK, this question applies equally to C and C++.
Step 6 of the "translation phases" specified in the C standard (5.1.1.2 in the draft C99 standard) states that adjacent string literals have to be concatenated into a single literal. I.e.
printf("helloworld.c" ": %d: Hello "
"world\n", 10);
Is equivalent (syntactically) to:
printf("helloworld.c: %d: Hello world\n", 10);
However, the standard doesn't seem to specify which part of the compiler has to handle this - should it be the preprocessor (cpp) or the compiler itself. Some online research tells me that this function is generally expected to be performed by the preprocessor (source #1, source #2, and there are more), which makes sense.
However, running cpp in Linux shows that cpp doesn't do it:
eliben@eliben-desktop:~/test$ cat cpptest.c
int a = 5;
"string 1" "string 2"
"string 3"
eliben@eliben-desktop:~/test$ cpp cpptest.c
# 1 "cpptest.c"
# 1 "<built-in>"
# 1 "<command-line>"
# 1 "cpptest.c"
int a = 5;
"string 1" "string 2"
"string 3"
So, my question is: where should this feature of the language be handled, in the preprocessor or the compiler itself?
Perhaps there's no single good answer. Heuristic answers based on experience, known compilers, and general good engineering practice will be appreciated.
P.S. If you're wondering why I care about this... I'm trying to figure out whether my Python based C parser should handle string literal concatenation (which it doesn't do, at the moment), or leave it to cpp which it assumes runs before it.
The standard doesn't specify a preprocessor vs. a compiler, it just specifies the phases of translation you already noted. Traditionally, phases 1 through 4 were in the preprocessor, phases 5 through 7 in the compiler, and phase 8 the linker -- but none of that is required by the standard.
Unless the preprocessor is specified to handle this, it's safe to assume it's the compiler's job.
Edit:
Your "I.e." link at the beginning of the post answers the question:
Adjacent string literals are concatenated at compile time; this allows long strings to be split over multiple lines, and also allows string literals resulting from C preprocessor defines and macros to be appended to strings at compile time...
In the ANSI C standard, this detail is covered in section 5.1.1.2, item (6):
5.1.1.2 Translation phases
...
4. Preprocessing directives are executed and macro invocations are expanded. ...
5. Each source character set member and escape sequence in character constants and string literals is converted to a member of the execution character set.
6. Adjacent character string literal tokens are concatenated and adjacent wide string literal tokens are concatenated.
The standard does not define that the implementation must use a pre-processor and compiler, per se.
Step 4 is clearly a preprocessor responsibility.
Step 5 requires that the "execution character set" be known. This information is also required by the compiler. It is easier to port the compiler to a new platform if the preprocessor does not contain platform dependencies, so the tendency is to implement step 5, and thus step 6, in the compiler.
I would handle it in the token-scanning part of the parser, so in the compiler. It seems more logical. The preprocessor does not need to know the "structure" of the language, and in fact it usually ignores it, which is why macros can generate uncompilable code. It handles nothing more than the directives that are specifically addressed to it (# ...) and their "consequences" (like those of a #define x h, which would make the preprocessor change a lot of x into h).
There are tricky rules for how string literal concatenation interacts with escape sequences.
Suppose you have
const char x1[] = "a\15" "4";
const char y1[] = "a\154";
const char x2[] = "a\r4";
const char y2[] = "al";
then x1 and x2 must wind up equal according to strcmp, and the same for y1 and y2. (This is what Heath is getting at in quoting the translation steps - escape conversion happens before string constant concatenation.) There's also a requirement that if any of the string constants in a concatenation group has an L or U prefix, you get a wide or Unicode string. Put it all together and it winds up being significantly more convenient to do this work as part of the "compiler" rather than the "preprocessor."
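The claim about x1, x2, y1 and y2 can be verified with a short sketch like this one. The function name is made up for the example, and the y1/y2 comparison assumes an ASCII execution character set, where octal 154 is the letter l.

```cpp
#include <cassert>
#include <cstring>

// "\15" is converted to a single character (carriage return) before the
// adjacent "4" is appended, so x1 holds 'a', '\r', '4' and a terminator.
const char x1[] = "a\15" "4";
const char x2[] = "a\r4";

// Written as one literal, \154 is a single octal escape ('l' in ASCII).
const char y1[] = "a\154";
const char y2[] = "al";

// True when escape conversion happened before concatenation.
bool escapes_convert_before_concatenation() {
    return sizeof x1 == 4 && std::strcmp(x1, x2) == 0
        && sizeof y1 == 3 && std::strcmp(y1, y2) == 0;
}
```

If concatenation happened first, x1 would contain the same bytes as y1, and the size check on x1 would fail.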