Say I have this C program.
#include <stdio.h>
int main(void)
{
int monday = 1;
int tuesday = 2;
if(monday == tuesday) { printf("I should quit my day job"); }
return 1;
}
What would the tokens be?
What does bison provide me, as a programmer? Certainly, bison does not generate machine code from just a parser grammar? So how do I interface with bison? I am not expecting a full answer here, just a pointer to good websites and books.
Bison generates LALR(1) parsers by default and can also generate generalized LR (GLR) parsers. See http://www.gnu.org/software/bison/manual/bison.html for fairly extensive documentation, with examples. You don't get back a parse tree per se; instead, you write "actions" that run on each reduction. If you want a parse tree, your actions can simply build one. Modern Bison also provides many directives for tuning the generated parser.
The tokens, line by line:
#include <stdio.h>
The above line is not a C statement; it is a directive handled by the C preprocessor.
int main(void)
Five tokens: The keyword int, the identifier main, the symbol (, the keyword void and the symbol ).
{
One token: The symbol {.
int monday = 1;
Five tokens: The keyword int, the identifier monday, the symbol =, the integer number 1, and the symbol ;.
... And so on
It should also be noted that = and == are two separate tokens, and that the string is one single token.
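To make the token counts above concrete, here is a minimal sketch of a hand-written tokenizer. It is not how any real C compiler is implemented; the function name tokenize_line is made up for illustration, and it only knows about identifiers, integer constants, string literals, the == operator and single-character punctuation.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper: split one line of C source into token spellings.
// Handles identifiers/keywords, integer constants, string literals,
// the two-character operator "==", and single-character punctuation.
std::vector<std::string> tokenize_line(const std::string& src) {
    std::vector<std::string> tokens;
    std::size_t i = 0;
    while (i < src.size()) {
        unsigned char c = static_cast<unsigned char>(src[i]);
        if (std::isspace(c)) { ++i; continue; }
        std::size_t start = i;
        if (std::isalpha(c) || c == '_') {
            while (i < src.size() &&
                   (std::isalnum(static_cast<unsigned char>(src[i])) || src[i] == '_')) ++i;
        } else if (std::isdigit(c)) {
            while (i < src.size() && std::isdigit(static_cast<unsigned char>(src[i]))) ++i;
        } else if (c == '"') {
            ++i;                                   // opening quote
            while (i < src.size() && src[i] != '"') {
                if (src[i] == '\\' && i + 1 < src.size()) ++i;  // skip escaped character
                ++i;
            }
            if (i < src.size()) ++i;               // closing quote
        } else if (c == '=' && i + 1 < src.size() && src[i + 1] == '=') {
            i += 2;                                // take the longest match: "==" is one token
        } else {
            ++i;                                   // any other single-character symbol
        }
        tokens.push_back(src.substr(start, i - start));
    }
    return tokens;
}
```

Calling tokenize_line on the declaration of monday yields the five tokens listed above, and on the if line the == and the whole quoted string each come back as a single element.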
Related
I am printing a line like this
cout<<"Hello //stackoverflow";
And this produces the following output
Hello //stackoverflow
I want to know why it does not give me an error as I commented half of the statement and there should be
missing terminating " character
error.
The grammar of C++ (like that of most programming languages) is context-sensitive. Put simply, // does not start a comment if it appears within a string literal.
For an in depth analysis of this, you'd have to refer to the language grammar, and the string literal production rules in particular.
Informally speaking, the fact that // appears in the quoted string literal means that it does not denote a comment block. The same applies to /* and */.
The converse applies to other constructs, where maximal munch requires parsing into the token denoting the start of a comment block; a space is needed before the pointer dereference operator in
#include <iostream>
using namespace std;
int main() {
int n = 1;
int* p = &n;
cout << 1 / *p; // Removing the final space will fail compilation.
}
In simple terms, this is because everything inside the quotes is recognized as a string, so the compiler does not treat // as the start of a comment.
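The rule the answers describe can be sketched as a tiny scanner that tracks whether it is currently inside a string literal. This is only an illustration of the idea, not the compiler's actual lexer; the function name strip_line_comment is hypothetical, and character literals and /* */ comments are deliberately ignored to keep it short.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: remove a trailing // comment from one line of code,
// ignoring any // that appears inside a double-quoted string literal.
std::string strip_line_comment(const std::string& line) {
    bool in_string = false;
    for (std::size_t i = 0; i < line.size(); ++i) {
        char c = line[i];
        if (in_string) {
            if (c == '\\') ++i;                    // skip the escaped character
            else if (c == '"') in_string = false;  // end of the string literal
        } else if (c == '"') {
            in_string = true;                      // start of a string literal
        } else if (c == '/' && i + 1 < line.size() && line[i + 1] == '/') {
            return line.substr(0, i);              // a real comment starts here
        }
    }
    return line;                                   // no comment found
}
```

Passing the cout line from the question returns it unchanged, because the // is seen while in_string is true.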
I'm working on an old network engine and the type of package sent over the network is made up of 2 bytes.
This is more or less human readable form, for example "LO" stands for Login.
In the part that reads the data there is an enormous switch, like this:
short sh=(((int)ad.cData[p])<<8)+((int)ad.cData[p+1]);
switch(sh)
{
case CMD('M','D'):
..some code here
break
where CMD is a define:
#define CMD(a,b) ((a<<8)+b)
I know there are better ways but just to clean up a bit and also to be able to search for the tag (say "LO") more easily (and not search for different types of "'L','O'" or "'L' , 'O'" or the occasional "'L', 'O'" <- spaces make it hard to search) I tried to make a MACRO for the switch so I could use "LO" instead of the define but I just can't get it to compile.
So here is the question: how do you change the #define to a macro that I can use like this instead:
case CMD("MD"):
..some code here
break
It started out as a little subtask to make life a little bit easier but now I can't get it out of my head, thanks for any help!
Cheers!
[edit] The code works, it's the world that's wrong! I.e. Visual Studio 2010 has a bug concerning this. No wonder I cut my teeth on it.
Macro-based solution
A string-literal is really an instance of char const[N] where N is the length of the string, including the terminating null-byte. With this in mind you can easily access any character within the string-literal by using string-literal[idx] to specify that you'd like to read the character stored at offset idx.
#define CMD(str) ((str[0]<<8)+str[1])
CMD("LO") => (("LO"[0]<<8)+"LO"[1]) => (('L'<<8)+'O')
You should however keep in mind that nothing prevents you from using the above macro with a string shorter than two characters, meaning that you can run into undefined behavior if you read an offset that is not actually valid.
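As a small check of the claims above, the following sketch (assuming a C++11 compiler) verifies at compile time that "LO" occupies three chars and that the macro produces a constant usable in a case label. The macro here adds parentheses around its argument, and is_login is a hypothetical helper added only for illustration.

```cpp
#include <cassert>

// Parenthesised variant of the macro; the name CMD follows the text above.
#define CMD(str) (((str)[0] << 8) + (str)[1])

// "LO" has type char const[3]: two characters plus the terminating null byte.
static_assert(sizeof("LO") == 3, "two characters plus the terminator");

// Indexing a string literal is a constant expression in C++11.
static_assert("LO"[0] == 'L' && "LO"[1] == 'O', "characters are read by offset");

// So the macro yields a compile-time constant.
static_assert(CMD("LO") == (('L' << 8) + 'O'), "same value as the two-argument form");

// Hypothetical helper showing the macro in a case label.
inline bool is_login(int tag) {
    switch (tag) {
        case CMD("LO"): return true;
        default:        return false;
    }
}
```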
RECOMMENDED: C++11, use a constexpr function
You could create a function usable in constant-expressions (and with that, in case-labels), with a parameter of reference to const char[3], which is the "real" type of your string-literal "FO".
constexpr short cmd (char const(&ref)[3]) {
return (ref[0]<<8) + ref[1];
}
int main () {
short data = ...;
switch (data) {
case cmd("LO"):
...
}
}
C++11 and user-defined literals
In C++11 we were granted the possibility to define user-defined literals. This will make your code far easier to maintain and interpret, as well as having it be safer to use:
#include <cstddef>
#include <stdexcept>
constexpr short operator"" _cmd (char const * s, std::size_t len) {
return len != 2 ? throw std::invalid_argument ("") : ((s[0]<<8)+s[1]);
}
int main () {
short data = ...;
switch (data) {
case "LO"_cmd:
...
}
}
The value associated with a case-label must be produced by a constant-expression. It might look as if the above could throw an exception at runtime, but since a case-label requires a constant-expression, the compiler must be able to evaluate "LO"_cmd during translation.
If this is not possible, as in "FOO"_cmd, the compiler will issue a diagnostic saying that the code is ill-formed.
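Putting the pieces together, here is a minimal sketch of how the literal might be used to dispatch on two received bytes. The enum Command and the function decode_command are made-up names for illustration; only the "LO" meaning (Login) comes from the question, and the literal operator takes std::size_t as the length parameter, which is what the standard requires.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Pack a two-character tag into a short, high byte first.
// A tag of any other length makes the expression non-constant,
// so using it in a case label fails to compile.
constexpr short operator""_cmd(char const* s, std::size_t len) {
    return len != 2 ? throw std::invalid_argument("tag must be two characters")
                    : static_cast<short>((s[0] << 8) + s[1]);
}

// Illustrative command set; only "LO" (Login) is taken from the question.
enum class Command { Login, Md, Unknown };

// Hypothetical dispatcher: combine two received bytes and switch on the tag.
inline Command decode_command(unsigned char hi, unsigned char lo) {
    short tag = static_cast<short>((hi << 8) + lo);
    switch (tag) {
        case "LO"_cmd: return Command::Login;
        case "MD"_cmd: return Command::Md;
        default:       return Command::Unknown;
    }
}
```

Searching the source for "LO" now finds the case label directly, which was the original goal.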
// K&R syntax
int foo(a, p)
int a;
char *p;
{
return 0;
}
// ANSI syntax
int foo(int a, char *p)
{
return 0;
}
As you see, in K&R style, the types of the parameters are declared on separate lines after the parameter list instead of inside the parentheses. How can I convert a K&R function declaration to an ANSI function declaration automatically? Does anybody know of an easy-to-use tool for this on Linux?
You can use cproto or protoize (part of GCC) to generate function prototypes or convert old style (K&R) functions to ANSI format.
Since you want to convert a multiline string, you could consider Perl.
you have
void old_style( c , a ) char c; int a; { /* some multiline code */ }
and must have
void old_style( char c, int a) {}
So
perl -i.bkp -nle 's/\((void|int|char|float|long) [a-zA-Z0-9_-]*\)([a-zA-Z0-9_-] ?,[a-zA-Z0-9_-] ?)\(.*{\)/\1(\2)/g'
or something like it, would do the trick.
It would be easier to pin down the correct regex for this if you try it out and post in the comments the output of
diff file.c file.c.bkp
for each of your source files.
If you want to create standard C prototypes for a .h file, use mkproto.c:
mkproto thisoldfile.c > thisoldfile.h
You then could also paste over the old K&R code in the C file definition if desired.
Another answer by Robert contained the following (useful) information before it was deleted for being a 'link-only' answer:
You can find mkproto.c at:
https://www.pcorner.com/list/C
There are plenty of other utilities there.
The site hosts two versions of "mkproto" — MKPROTO.ZIP dated 1989-09-07 and MKPROTOB.ZIP dated 1992-06-27. You have to register with the host site, The Programmer's Corner, to download the files. The files appear about 2/3 of the way down a long page of possible downloads.
My gcc installation had neither cproto nor mkproto. But I did have Vim and I figured out a global substitute to do this one parameter at a time...
:%s/^\(\%(\w\+\%(\s*\*\+\s*\|\s\+\)\)\+\)\(\w\+\)\s*(\#=\(.\{-}\)\([(,)]\)\s*\(\w\+\)\s*\([,)].*\)\n\s*\(.\{-}\)\5\([^;]*\);/\1\2\3\4\7\5\8\6/
where the recorded subexpressions are:
1. the function return type
2. the function name
3. the already-prototyped parameters (if any)
4. the delimiter before the next K&R parameter to fix - '(' or ','
5. the next K&R parameter to fix
6. the following delimiter - ',' or ')' followed by any remaining (unprocessed) K&R parameter identifiers and closing ')'
7. the parameter's type
8. the parameter's post-identifier characters (e.g., brackets)
Repeat until no more matches in the file. Note, the pattern presumes the K&R function declaration is on one line, followed by individual parameter declarations on successive lines.
Two applications of this substitute successfully process:
int main(argc,argv)
int argc;
char *argv[];
{
printf("Hello world!\n");
return 0;
}
into:
int main(int argc,char *argv[])
{
printf("Hello world!\n");
return 0;
}
If I have an arithmetic expression like x+y-12 / z in a string (c-style or otherwise) in c or c++, how can I extract one item at a time (including the operator)? There may or may not be a space in the expression and multiple digits are allowed for constants.
If your input is simple you can start with something like this:
typedef struct token {
int type;
int ival;
char sval[256];
int ssize;
} Token;
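char *get_next_tok(char *buffer, Token *token) {
char *p = buffer;
while (isspace((unsigned char)*p)) p++; // skip leading whitespace
token->ssize = 0; // reset the identifier buffer for this token
if (my_isopchar(*p)) // checks -+*...
p=my_get_op(p, token); // a function to handle multi-char ops
else if (isdigit((unsigned char)*p)) {
token->ival=strtol(p, &p, 10);
token->type=TK_CONST;
}
else if (isalpha((unsigned char)*p)) {
while (isalpha((unsigned char)*p) && token->ssize < (int)sizeof token->sval - 1) {
token->sval[token->ssize++] = *p; p++;
}
token->sval[token->ssize] = '\0'; // keep sval a valid C string
token->type = TK_VAR;
}
return p;
}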
Easy way: strtok
Hard way: Flex+Bison
Look into parsing. What you describe can, in fact, be quite easily implemented using regular expressions, or hand-written parsing. Think of what makes up your expression's individual tokens, and how code to extract the next token would look.
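As a rough illustration of the regular-expression route mentioned above, the following sketch uses std::regex from C++11 to pull out identifiers, integer constants and single-character operators one at a time. The function name split_expression is hypothetical, and characters that match none of the alternatives are silently skipped, which a real tokenizer would want to report.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: return the tokens of an arithmetic expression in order.
// Alternatives: identifier | integer constant | one-character operator or parenthesis.
std::vector<std::string> split_expression(const std::string& expr) {
    static const std::regex token_re("[A-Za-z_][A-Za-z0-9_]*|[0-9]+|[-+*/()]");
    std::vector<std::string> tokens;
    for (std::sregex_iterator it(expr.begin(), expr.end(), token_re), end; it != end; ++it) {
        tokens.push_back(it->str());   // whitespace never matches, so it is skipped
    }
    return tokens;
}
```

For the expression from the question, x+y-12 / z, this produces seven tokens: x, +, y, -, 12, / and z.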
There was a very nice tutorial on Flipcode on implementing scripting engines. You can read a few of the first chapters.
Basically you need to implement a lexical analyzer that breaks the string into tokens (identifier / constant / operator). From the tokens you can create a parse tree or reverse Polish notation, e.g. by recursive descent or by using an LL parser, which is rather elegant if you are only interested in parsing arithmetic expressions.
Reverse Polish notation is then evaluated using stack-based interpreter or parse tree is evaluated using a recursive algorithm.
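A recursive-descent approach can be sketched in a few dozen lines. The version below is only a minimal illustration, not the evaluation class mentioned in the answers: it evaluates while parsing instead of building a tree, handles only + - * / on integers and named variables (no parentheses or unary minus), and the class name ExprEvaluator is made up.

```cpp
#include <cassert>
#include <cctype>
#include <map>
#include <stdexcept>
#include <string>

// Minimal recursive-descent evaluator; all names are illustrative.
class ExprEvaluator {
public:
    ExprEvaluator(const std::string& text, const std::map<std::string, long>& vars)
        : text_(text), vars_(vars), pos_(0) {}

    long evaluate() {
        long value = parse_sum();
        skip_spaces();
        if (pos_ != text_.size()) throw std::runtime_error("unexpected trailing input");
        return value;
    }

private:
    // sum := product (('+' | '-') product)*
    long parse_sum() {
        long value = parse_product();
        for (;;) {
            skip_spaces();
            if (accept('+')) value += parse_product();
            else if (accept('-')) value -= parse_product();
            else return value;
        }
    }
    // product := atom (('*' | '/') atom)*
    long parse_product() {
        long value = parse_atom();
        for (;;) {
            skip_spaces();
            if (accept('*')) value *= parse_atom();
            else if (accept('/')) {
                long divisor = parse_atom();
                if (divisor == 0) throw std::runtime_error("division by zero");
                value /= divisor;
            } else return value;
        }
    }
    // atom := integer constant | variable name
    long parse_atom() {
        skip_spaces();
        std::size_t start = pos_;
        if (pos_ < text_.size() && std::isdigit(peek())) {
            while (pos_ < text_.size() && std::isdigit(peek())) ++pos_;
            return std::stol(text_.substr(start, pos_ - start));
        }
        if (pos_ < text_.size() && std::isalpha(peek())) {
            while (pos_ < text_.size() && std::isalnum(peek())) ++pos_;
            std::map<std::string, long>::const_iterator it =
                vars_.find(text_.substr(start, pos_ - start));
            if (it == vars_.end()) throw std::runtime_error("unknown variable");
            return it->second;
        }
        throw std::runtime_error("expected a number or a variable");
    }
    unsigned char peek() const { return static_cast<unsigned char>(text_[pos_]); }
    bool accept(char c) {
        if (pos_ < text_.size() && text_[pos_] == c) { ++pos_; return true; }
        return false;
    }
    void skip_spaces() { while (pos_ < text_.size() && std::isspace(peek())) ++pos_; }

    std::string text_;
    std::map<std::string, long> vars_;
    std::size_t pos_;
};
```

Each grammar rule becomes one function, and operator precedence falls out of which function calls which: parse_sum calls parse_product, so / binds tighter than -.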
I have written a small expression evaluation class in C++ which supports simple expressions with variables.
I want to encrypt/encode a string at compile time so that the original string does not appear in the compiled executable.
I've seen several examples but they can't take a string literal as argument. See the following example:
template<char c> struct add_three {
enum { value = c+3 };
};
template <char... Chars> struct EncryptCharsA {
static const char value[sizeof...(Chars) + 1];
};
template<char... Chars>
char const EncryptCharsA<Chars...>::value[sizeof...(Chars) + 1] = {
add_three<Chars>::value...
};
int main() {
std::cout << EncryptCharsA<'A','B','C'>::value << std::endl;
// prints "DEF"
}
I don't want to provide each character separately like it does. My goal is to pass a string literal like follows:
EncryptString<"String to encrypt">::value
There's also some examples like this one:
#define CRYPT8(str) { CRYPT8_(str "\0\0\0\0\0\0\0\0") }
#define CRYPT8_(str) (str)[0] + 1, (str)[1] + 2, (str)[2] + 3, (str)[3] + 4, (str)[4] + 5, (str)[5] + 6, (str)[6] + 7, (str)[7] + 8, '\0'
// calling it
const char str[] = CRYPT8("ntdll");
But it limits the size of the string.
Is there any way to achieve what I want?
I think this question deserves an updated answer.
When I asked this question several years ago, I didn't consider the difference between obfuscation and encryption. Had I known this difference then, I'd have included the term Obfuscation in the title before.
C++11 and C++14 have features that make it possible to implement compile-time string obfuscation (and possibly encryption, although I haven't tried that yet) in an effective and reasonably simple way, and it's already been done.
ADVobfuscator is an obfuscation library created by Sebastien Andrivet that uses C++11/14 to generate compile-time obfuscated code without using any external tool, just C++ code. There's no need to create extra build steps, just include it and use it. I don't know a better compile-time string encryption/obfuscation implementation that doesn't use external tools or build steps. If you do, please share.
It not only obfuscates strings, but it has other useful things like a compile-time FSM (Finite State Machine) that can randomly obfuscate function calls, and a compile-time pseudo-random number generator, but these are out of the scope of this answer.
Here's a simple string obfuscation example using ADVobfuscator:
#include "MetaString.h"
using namespace std;
using namespace andrivet::ADVobfuscator;
void Example()
{
/* Example 1 */
// here, the string is compiled in an obfuscated form, and
// it's only deobfuscated at runtime, at the very moment of its use
cout << OBFUSCATED("Now you see me") << endl;
/* Example 2 */
// here, we store the obfuscated string into an object to
// deobfuscate whenever we need to
auto narrator = DEF_OBFUSCATED("Tyler Durden");
// note: although the function is named `decrypt()`, it's still deobfuscation
cout << narrator.decrypt() << endl;
}
You can replace the macros DEF_OBFUSCATED and OBFUSCATED with your own macros. E.g.:
#define _OBF(s) OBFUSCATED(s)
...
cout << _OBF("klapaucius");
How does it work?
If you take a look at the definition of these two macros in MetaString.h, you will see:
#define DEF_OBFUSCATED(str) MetaString<andrivet::ADVobfuscator::MetaRandom<__COUNTER__, 3>::value, andrivet::ADVobfuscator::MetaRandomChar<__COUNTER__>::value, Make_Indexes<sizeof(str) - 1>::type>(str)
#define OBFUSCATED(str) (DEF_OBFUSCATED(str).decrypt())
Basically, there are three different variants of the MetaString class (the core of the string obfuscation). Each has its own obfuscation algorithm. One of these three variants is chosen randomly at compile-time, using the library's pseudo-random number generator (MetaRandom), along with a random char that is used by the chosen algorithm to xor the string characters.
"Hey, but if we do the math, 3 algorithms * 255 possible char keys (0 is not used) = 765 variants of the obfuscated string"
You're right. The same string can only be obfuscated in 765 different ways. If you have a reason to need something safer (you're paranoid / your application demands increased security) you can extend the library and implement your own algorithms, using stronger obfuscation or even encryption (White-Box cryptography is in the lib's roadmap).
Where / how does it store the obfuscated strings?
One thing I find interesting about this implementation is that it doesn't store the obfuscated string in the data section of the executable.
Instead, it is statically stored into the MetaString object itself (on the stack) and the algorithm decodes it in place at runtime. This approach makes it much harder to find the obfuscated strings, statically or at runtime.
You can dive deeper into the implementation by yourself. That's a very good basic obfuscation solution and can be a starting point to a more complex one.
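To give a feel for the general technique, here is a minimal C++14 sketch of compile-time XOR obfuscation using an index sequence. It is not ADVobfuscator's implementation: the names XorString and make_xor_string are made up, the key is a fixed template argument rather than a random one, and whether the plain text is really absent from the binary depends on the compiler and optimization settings, so inspect the output before relying on it.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Illustrative only: stores text XOR-ed with a one-byte key.
template <std::size_t N, char Key>
class XorString {
public:
    // Encode every character (including the terminator) during constant evaluation.
    constexpr explicit XorString(const char (&text)[N])
        : XorString(text, std::make_index_sequence<N>{}) {}

    // Decode at run time into a std::string, dropping the terminator.
    std::string decode() const {
        std::string out(N - 1, '\0');
        for (std::size_t i = 0; i + 1 < N; ++i) {
            out[i] = static_cast<char>(data_[i] ^ Key);
        }
        return out;
    }

    // Access to the encoded bytes, mainly for inspection.
    constexpr char raw(std::size_t i) const { return data_[i]; }

private:
    template <std::size_t... I>
    constexpr XorString(const char (&text)[N], std::index_sequence<I...>)
        : data_{static_cast<char>(text[I] ^ Key)...} {}

    char data_[N];
};

// Deduces the array size so the caller only has to name the key.
template <char Key, std::size_t N>
constexpr XorString<N, Key> make_xor_string(const char (&text)[N]) {
    return XorString<N, Key>(text);
}
```

A typical use would be to declare a constexpr object with make_xor_string<0x5A>("ntdll") and call decode() only at the point where the plain text is needed.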
Save yourself a heap of trouble down the line with template metaprogramming and just write a stand alone program that encrypts the string and produces a cpp source file which is then compiled in. This program would run before you compile and would produce a cpp and/or header file that would contain the encrypted string for you to use.
So here is what you start with:
encrypted_string.cpp and encrypted_string.h (which are blank)
A script or standalone app that takes a text file as input and overwrites encrypted_string.cpp and encrypted_string.h
If the script fails, your compiling will fail because there will be references in your code to a variable that does not exist. You could get smarter, but that's enough to get you started.
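The core of such a generator could look like the sketch below. It is only an assumption about how one might write it: the function name generate_encrypted_source, the single-byte XOR "encryption" and the variable names in the generated file are all invented for illustration, and the file handling (reading the input text, writing encrypted_string.cpp) is left to the surrounding build script.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical generator: XOR-encodes `plain` with `key` and returns the text
// of a C++ source file that defines the encoded bytes. A pre-build step would
// write this text to encrypted_string.cpp.
std::string generate_encrypted_source(const std::string& plain, unsigned char key) {
    assert(!plain.empty());  // an empty initializer list would not compile
    std::ostringstream out;
    out << "extern const unsigned char encrypted_string[] = {";
    for (std::size_t i = 0; i < plain.size(); ++i) {
        unsigned value = static_cast<unsigned char>(plain[i]) ^ key;
        out << (i ? ", " : " ") << value;   // emit each byte as a decimal number
    }
    out << " };\n";
    out << "extern const unsigned long encrypted_string_size = " << plain.size() << ";\n";
    return out.str();
}
```

The matching header would declare the two variables, and the application XORs the bytes with the same key at run time to recover the text.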
The examples you found can't take string literals as template arguments because the ISO C++ standard does not allow it. Even though C++ has a string class, a string literal is still a plain array of const char (const char[N]). So you can't, or shouldn't, alter it (doing so is undefined behaviour), even though you can read the characters of such a compile-time string literal.
The only way I see is using defines, as they are handled by the preprocessor before the compiler. Maybe boost will give you a helping hand in that case.
A macro based solution would be to take a variadic argument and pass in each part of the string as a single token. Then stringify the token and encrypt it and concatenate all tokens. The end result would look something like this
CRYPT(m y _ s t r i n g)
Where _ is some placeholder for a whitespace character literal. Horribly messy and I would prefer every other solution over this.
Something like this could do it although the Boost.PP Sequence isn't making it any prettier.
#include <iostream>
#include <boost/preprocessor/stringize.hpp>
#include <boost/preprocessor/seq/for_each.hpp>
#define GARBLE(x) GARBLE_ ## x
#define GARBLE_a x
#define GARBLE_b y
#define GARBLE_c z
#define SEQ (a)(b)(c)
#define MACRO(r, data, elem) BOOST_PP_STRINGIZE(GARBLE(elem))
int main() {
const char* foo = BOOST_PP_SEQ_FOR_EACH(MACRO, _, SEQ);
std::cout << foo << std::endl;
}