What is causing this format string attack? - C++

Look at this vulnerable snippet:
#include <stdio.h>

int main(int argc, char **argv) {
    printf(argv[1], "bla");
    return 0;
}
Compiling it without optimization leads to
./test "asd"
asd
./test "asd %s"
asd bla
./test "asd %0\$s"
asd %0$s
./test "asd %45\$s"
asd XDG_VTNR=7 <-- What the...
Well, actually it seems like "%(number)\$s" tries to interpret the (number)th argument as a string, looking up the stack, where I ran into my environment variables. Is the use of such a format string documented anywhere, especially the curious "\$"? I couldn't find any references.
Finally, compiling with optimization enabled leads to:
*** invalid %N$ use detected ***
asd zsh: abort ./test "asd %46\$s"
I've never seen such an error before. Where does it come from?
(I'm using Gentoo Linux / GCC 4.8.2 / glibc 2.18)

Sure, it's mentioned in the manual page like you'd expect. It seems to come from the Single Unix Specification (i.e. not C99).
It's used in internationalization, where you often need to swap around the order of various pieces of information to fit the translation. The number is an argument index:
One can also specify explicitly which argument is taken, at each place where an argument is required, by writing "%m$" instead of '%' and "*m$" instead of '*', where the decimal integer m denotes the position in the argument list of the desired argument, indexed starting from 1.
So in a more sensible program, this:
printf("%2$d %1$d", 1, 2);
prints
2 1
It's possible that building with optimizations enabled makes the compiler perform more heavy-weight analysis of the code, so that it can "know" more about the actual argument list and generate a better error.
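To illustrate, here is a minimal complete sketch of the positional form (the file name and line number are made up for the example):

#include <stdio.h>

int main(void) {
    const char *file = "data.txt";
    int line = 42;

    /* Plain order: the arguments are consumed left to right. */
    printf("%s: error on line %d\n", file, line);

    /* Positional form: the format string decides which argument goes
     * where, so a translation can reorder the message without touching
     * the printf call. */
    printf("line %2$d of %1$s: error\n", file, line);

    return 0;
}

The second call prints "line 42 of data.txt: error" even though the file argument is still passed first.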


Difference between return 0 and -1 [closed]

Can anyone please explain the difference between return 0 and return -1 as used in C++? I have read many articles and posts from other programmers saying that return 0 means the program is successful and -1 means the program has an error. But what I don't get is why use these statements when the compiler will generate an error if there is one in the program anyway? Please explain in detail what these statements really mean.
This has absolutely nothing to do with the compiler.
The compiler will report syntax errors.
The return codes are used to report if the program completed successfully.
("Success" depends on what the program was intended to do).
For example:
// Program SearchTerm "term to find"
int main(int argc, char* argv[])
{
bool search_successful = false;
[ ... do work ... ]
if (search_successful)
{
return 0; // The search worked.
}
else
{
return -1; // The search failed
}
}
Example usage:
SearchTerm "Microsoft"
More than 1 million results found... returned Success
SearchTerm "asldfjpu"
No results found... returned Failure
When a program reports success or failure, it can be integrated into scripts such as:
#!/bin/bash
if SearchTerm "Microsoft"; then
    GetTopResults "Microsoft"
else
    echo "No results found, no Top Results to retrieve"
fi
The return value of int main() is the so-called exit code of the program. This was already the case in C (which has no exceptions) and basically tells the caller how things went. An exit code of zero means success and every other exit code (normally only positive ones are used) means that something went wrong. Sometimes programs document what a certain exit code means, e.g. a file was not found or an allocation failed.
This is a very important part of scripting, for example in bash scripts, which use it to know whether a called command succeeded or failed. Even if your program crashes with an exception, it will still produce an exit code (which will be non-zero). In bash you can see the exit code of the last run program with echo $?, so you can check that out for yourself.
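As a sketch of the same idea from C (assuming a POSIX system; the path in the command is made up just to force a failure), a caller can inspect a child's exit code programmatically, not only from the shell:

#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>

int main(void) {
    /* system() runs a shell command and returns its wait status. */
    int status = system("ls /this/path/does/not/exist");
    if (status == -1) {
        perror("system");            /* could not run the command at all */
    } else if (WIFEXITED(status)) {
        /* WEXITSTATUS extracts the command's exit code (non-zero = failure). */
        printf("command exited with code %d\n", WEXITSTATUS(status));
    }
    return EXIT_SUCCESS;
}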
A return value is simply the value handed back by a function. At the end of a function, a single value can be passed back to the code that called it. In the case of the function int main(), with which you are probably familiar, the return value is used as an indication of the success of the program.
main() is simply the entry point of the program you are running. It's the first function called by your computer when the program starts (with some exceptions).
Theoretically, you could return any integer from main().
The only "accepted" thing is that anything that is non-zero is generally some error or fault. So a return 0; indicates success, while return -1; might indicate some kind of error. The truth is, the value returned from main() doesn't really matter, and won't affect how your program runs.
If you're asking about the value returned by main, there are two defined values that you can return: EXIT_SUCCESS and EXIT_FAILURE. Those values are defined in the header <stdlib.h> and in the header <cstdlib>. You can also return the value 0, which is equivalent to returning EXIT_SUCCESS. There are no other meaningful values in C or C++, although your implementation can provide meanings for other values.
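A minimal sketch of how those constants are typically used (the file-name handling here is just an example):

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s <filename>\n", argv[0]);
        return EXIT_FAILURE;          /* portable "something went wrong" */
    }

    FILE *f = fopen(argv[1], "r");
    if (f == NULL) {
        perror(argv[1]);
        return EXIT_FAILURE;
    }

    fclose(f);
    return EXIT_SUCCESS;              /* equivalent to return 0 */
}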
I am assuming that you are returning 0 or -1 from the main program as your program exits. What this does is inform the calling program of the success or failure of your program. This is not very handy if you just run the program at a command prompt, but if your program is called from a script (Perl, PHP, Python, PowerShell, etc.) you can test whether the program was successful.
In a nutshell, if your program is called by another program, the calling program can test for success or failure and respond appropriately.
return 0 means the program is successful and -1 means the program has an error.
Please explain in detail what these statements really mean.
There are situations where a function cannot proceed further and cannot complete the task specified for it. Such a situation is typically called an error. There are many possible sources of errors. One example is a function whose pre-conditions on the inputs were not satisfied by the caller. An example of such a pre-condition is a square root function that works with real numbers (i.e. not complex ones): there is no result for negative input.
When an error is encountered, it must somehow be communicated to the caller. There are many techniques, but we shall not cover all of them in detail here. I'll mention that one option in C++ is exceptions. The exception technique has some useful properties, but it is not applicable to all situations.
Another, very simple technique is to return an error code which is an integer value. If we choose 0 to signify no error, the caller of the function can check for error with following pattern:
int error = function(arguments);
if (error) {
    // handle error here
}
This pattern is very common in C APIs. C++ inherits C standard library, and it is common to take advantage of C libraries so it is common to encounter this in C++ as well.
Sometimes, a function needs to use the return value for some other purpose. For example, the open call from the POSIX standard returns an integer called a "file descriptor", which is a non-negative integer. Since some of the values in the domain of the return type are not used (negative numbers), they can be used to represent an error condition as well. Although any negative number is available, -1 was chosen, and this is also quite conventional. In some other APIs different negative numbers represent different errors. Another approach is to store error information elsewhere. The approach chosen in POSIX is to store the error code in a separate variable (errno).
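A minimal sketch of that convention with open (the path is a placeholder chosen so the call fails):

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    int fd = open("/no/such/file", O_RDONLY);
    if (fd == -1) {
        /* The reason for the failure is stored separately, in errno. */
        fprintf(stderr, "open failed: %s (errno=%d)\n", strerror(errno), errno);
        return 1;
    }
    close(fd);
    return 0;
}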
In conclusion: Integer return values are a technique of communicating errors, and 0 is conventionally used to represent success and -1 is often used to represent erroneous execution. The meaning of the return value of a function should be documented by the implementer, and the documentation should be studied carefully by the user of the function.
But what I don't get is why use these statements when the compiler will generate an error if there is one in the program anyway?
It's unclear what you expect here. The compiler cannot read the programmer's mind and know all cases where the program should have an error. It can only tell whether the program is well-formed or not (and maybe give some helpful warnings about obvious mistakes if you're lucky).
It is the programmer's responsibility to consider the error conditions of the program.
The only difference between 0 and -1, as far as the compiler is concerned, is that they are two different numbers. Nothing is assumed whatsoever about their "success" or "failure" status (except that in main you may omit the return statement entirely, which is equivalent to returning 0).
The rest are (mostly bad) conventions used by developers: usually < 0 for failure and 0 for success. You have created a nice bug the moment someone tests the result as a bool (where -1 is true and 0 is false).
One should use enums or, at least, something more recognizable, as with Windows' HRESULT.
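A minimal sketch of the bool-testing pitfall mentioned above (do_work is a hypothetical function that returns 0 on success and -1 on failure):

#include <stdio.h>

/* Hypothetical helper: returns 0 on success, -1 on failure. */
static int do_work(void) {
    return -1;  /* pretend the work failed */
}

int main(void) {
    /* Reads like "if it worked", but -1 is truthy, so this branch
     * actually runs on FAILURE. */
    if (do_work()) {
        printf("this runs on failure, not success\n");
    }

    /* The explicit comparison says what is meant. */
    if (do_work() == 0) {
        printf("success\n");
    } else {
        printf("failure\n");
    }
    return 0;
}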
It basically means that anything other than 0 indicates that something bad happened during the execution of your program. The program could not deal with it, so it exited with that status code.
Often when working in a shell you will notice that some commands exit with a non-zero status code. You can check that value by running the echo $? command.
And finally, you can use return codes to communicate to the client what happened, provided you document which return codes your program can return.

Fortran - large arrays and invalid $LARGE usage

I have some huge arrays in my program. I use gfortran (tmd64-1) version 4.7.1 and I am trying to compile code which begins with:
$LARGE
1
Error: Invalid character in name at (1)
but I get the above error. I use the -fdollar-ok option but, as far as I know, it doesn't affect the first character of a symbol name. How can I get past this problem?

What does the dash mean in windows command line syntax?

I know basic c++ coding and command line syntax. Often, the main() function looks like this:
int main(int argc, char *argv[])
When I execute my program via command line, I just do this:
cd c:\path\to\foo 1
Where 1 is the argument. And of course, argc would be equal to '2' and the '1' would be at element '1' of the array argv (as opposed to 0).
I've seen the dash used in a lot of places. Specifically:
gcc -v
And if I type just 'gcc', it says there are no arguments. And if I type 'gcc v' I get "error: no such file or directory". But when I take a look at the minGW bin folder where gcc is, there is no folder 'v'.
This is just a style of setting options in programs. It comes from the getopt library provided in Unix/Linux etc. The -- form that kc7zax mentions in their answer is the long-form of these options (allowing a long identifier instead of a single character).
There's nothing magic about these. They are simply parsed out of the argv array. You can implement similar or identical functionality yourself if you want. But it's a pain. That's why libraries exist.
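As a rough sketch of what such parsing looks like by hand (only a made-up -v flag is handled here; a real option parser such as getopt covers far more cases):

#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[]) {
    int verbose = 0;

    /* Walk argv and treat anything starting with '-' as an option. */
    for (int i = 1; i < argc; i++) {
        if (strcmp(argv[i], "-v") == 0) {
            verbose = 1;
        } else if (argv[i][0] == '-') {
            fprintf(stderr, "unknown option: %s\n", argv[i]);
            return 1;
        } else {
            printf("positional argument: %s\n", argv[i]);
        }
    }

    if (verbose)
        printf("verbose mode enabled\n");
    return 0;
}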
It is simply a convention for distinguishing the name of a command line parameter from the parameter value that may follow it, e.g. gcc -o myfile.o
The Windows world usually sees a - or a / as the marker. I've often encountered -- in the Unix/BSD world as well.

value of argc when * is passed as one of the arguments calling a program in c

There is an abnormality in the value of argc when '*' is passed as one of the arguments when calling a program in C.
I wrote a simple program in C and saved it as 'test2.c'. Here is the code of 'test2.c':
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char* argv[])
{
    printf("%d\n", argc);
    return 0;
}
I compiled it and called it like this:
dev#ubuntu:~$ gcc test2.c -o t
dev#ubuntu:~$ ./t *
31
So I am getting the argument count value as 31, whereas if '*' is replaced by any other binary operator the value of argc is 2 (which is also logically correct).
dev#ubuntu:~$ ./t +
2
I am not able to fathom why this is so. And there is one more interesting thing: when '-*' is used in place of '*', the answer is 2 (which is again logically correct).
dev#ubuntu:~$ ./t -*
2
Can anyone help me with this? Thanks in advance.
It's just shell expansion. The shell will expand * to the filenames (directories included) in the current directory in which the program is executing.
+ or - has no special interpretation, so they are treated as normal strings, but * does: it is expanded to the list of all files in the current directory, which is then passed to your program.
Try using
./t '*' or ./t \*
This stops the special interpretation of *. The first uses bash single quotes (it looks like you are in bash), which suppress any special interpretation of the quoted text, and the second uses an escape sequence.
It's the shell that expands * to the list of files in the current directory. Print out the argv to see that this is the case.
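For example, a minimal sketch that just dumps its arguments makes the expansion visible:

#include <stdio.h>

int main(int argc, char *argv[]) {
    for (int i = 0; i < argc; i++)
        printf("argv[%d] = %s\n", i, argv[i]);
    return 0;
}

Run as ./t * it prints one line per file in the current directory, while ./t '*' prints the literal asterisk.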
argc will always give you the number of elements in argv[].
The arguments passed to argv[] will always contain at least the name of the executable so argc is always the number of arguments + 1.
The * case is special, though, since your shell will expand * to all files and directories within your current directory. If you don't want * to be expanded, pass it as "*" with double quotes.
Try echo * vs echo "*" in your shell to see how * gets expanded.
This is just the shell expanding the star operator * to the contents of the current directory from which you are executing this program.
There are several expansions performed before the shell initiates execution, for example ~ (tilde), {} (braces) and $ (parameters), of course depending on your operating system. You can find the GNU list of expansions in the link below.
GNU BASH Manual - 3.5.3 Shell Parameter Expansion
Tip: If you had printed the values of your arguments you would probably see a list of directories and files, and the argument count 'argc' would change depending on which directory you execute from.

Parsing C/C++ source: How are token boundaries/interactions specified in lex/yacc?

I want to parse some C++ code, and as a guide I've been looking at the C lex/yacc definitions here: http://www.lysator.liu.se/c/ANSI-C-grammar-l.html and http://www.lysator.liu.se/c/ANSI-C-grammar-y.html
I understand the specifications of the tokens themselves, but not how they interact. E.g. it's OK to have an operator such as = directly follow an identifier without intervening white space (i.e. "foo="), but it's not OK to have a numerical constant immediately followed by an identifier (i.e. "123foo"). However, I don't see any way that such rules are represented.
What am I missing? Or is this lex/yacc grammar just too liberal in its acceptance of errors?
The lexer converts a character stream into a token stream (I think that's what you mean by token specification). The grammar specifies what sequences of tokens are acceptable. Hence, you won't see that something is not allowed; you only see what is allowed. Does that make sense?
EDIT
If the point is to get the lexer to distinguish the sequence "123foo" from the sequence "123 foo" one way is to add a specification for "123foo". Another way is to treat spaces as significant.
EDIT2
A syntax error can be "detected" from the lexer or the grammar production or the later stages of the compiler (think of, say, type errors, which are still "syntax errors"). Which part of the whole compilation process detects which error is largely a design issue (as it affects the quality of error messages), I think. In the given example, it probably makes more sense to outlaw "123foo" via a tokenization to an invalid token rather than relying on the non-existence of a production with a numeric literal followed by an identifier (at least, this is the behaviour of gcc).
The lexer is fine with 123foo and will split it into two tokens:
an integer constant
and an identifier.
But try to find the part of the grammar that allows those two tokens to sit side by side like that. Thus I bet the parser generates an error when it sees these two tokens.
Note the lexer does not care about whitespace (unless you explicitly tell it to worry about it). In this case it just throws white space away:
[ \t\v\n\f] { count(); } // Throw away white space without looking.
Just to check this is what I built:
wget -O l.l http://www.lysator.liu.se/c/ANSI-C-grammar-l.html
wget -O y.y http://www.lysator.liu.se/c/ANSI-C-grammar-y.html
Edited file l.l to stop the compiler complaining about undeclared functions:
#include "y.tab.h"
// Add the following lines
int yywrap();
void count();
void comment();
void count();
int check_type();
// Done adding lines
%}
Create the following file: main.c:
#include <stdio.h>

extern int yylex();

int main()
{
    int x;
    while ((x = yylex()) != 0)
    {
        fprintf(stdout, "Token(%d)\n", x);
    }
}
Build it:
$ bison -d y.y
y.y: conflicts: 1 shift/reduce
$ flex l.l
$ gcc main.c lex.yy.c
$ ./a.out
123foo
123Token(259)
fooToken(258)
Yes it split it into two tokens.
What's essentially going on is that the lexical rules for each token type are greedy. For instance, the character sequence foo= cannot be interpreted as a single identifier, because identifiers don't contain symbols. On the other hand, 123abc is effectively a numerical constant, though a malformed one, because numerical constants can end with a sequence of alphabetic characters used to express the type of the constant.
You won't be able to parse C++ with plain lex and yacc, as it's an ambiguous grammar. You'd need a more powerful approach such as GLR, or some hackish solution which modifies the lexer at runtime (that's what most current C++ parsers do).
Take a look at Elsa/Elkhound.