Bizarre behavior from sprintf in C++/VS2010

Bizarre behavior from sprintf in C++/VS2010 - c++

I have a super-simple class representing a decimal # with fixed precision, and when I want to format it I do something like this:
assert(d.DENOMINATOR == 1000000);
char buf[100];
sprintf(buf, "%d.%06d", d._value / d.DENOMINATOR, d._value % d.DENOMINATOR);
Astonishingly (to me at least) this does not work. The %06d term comes out all 0s even when d.DENOMINATOR does not evenly divide d._value. And if I throw an extra %d in the format string, I see the right value show up in the third spot -- so it's like something is secretly creating an extra argument between my two.
If I compute the two terms outside of the call to sprintf, everything behaves how I expect. I thought to reproduce this with a more simple test case:
char testa[200];
char testb[200];
int x = 12345, y = 1000;
sprintf(testa, "%d.%03d", x/y, x%y);
int term1 = x/y, term2 = x%y;
sprintf(testb, "%d.%03d", term1, term2);
...but this works properly. So I'm completely baffled as to exactly what's going on, how to avoid it in the future, etc. Can anyone shed light on this for me?
(EDIT: Problem ended up being that d._value and d.DENOMINATOR are both long longs so %d doesn't suffice. Thanks very much to Serge's comment below which pointed to the problem, and Mark's answer submitted shortly thereafter.)

Almost certainly your term components are a 64-bit type (perhaps long on a 64-bit system) which is getting passed into the non-type-safe sprintf. Thus when you create an intermediate int the size is right and it works fine.
g++ will warn about this and many other useful things with -Wall. The preferred solution is of course to use C++ iostreams for your formatting as they're totally type safe.
The alternate solution is to cast the result of your expression to the type that you told sprintf to expect so it pulls the proper number of bytes out of memory.
Finally, never use sprintf when almost every compiler supports snprintf which prevents all sorts of silly mistakes. Your code is fine now but when someone modifies it later and it runs off the end of the buffer you may spend days tracking down the corruption.

Related

A unique type of data conversion

in the following code
tt=5;
for(i=0;i<tt;i++)
{
int c,d,l;
scanf("%lld%lld%lld",&c,&d,&l);
printf("%d %d %d %d",c,d,l,tt);
}
in the first iteration, the value of 'tt' is changing to 0 automatically.
I know that i have declared c,d,l as int and taking input as long long so it is making c,d=0. But still, i m not able to understand how tt is becoming 0.

Small, but obligatory announcement. As it was said in comments, you face undefined behavior, so
don't be surprised by tt assigned to zero
don't be surprised by tt not assigned to zero after insignificant code changes (e.g. reordering initialization from "int i,tt;" to "int tt, i;" or vice versa)
don't be surprised by tt not assigned to zero after compiling with different flags or different compiler version or for different platform or for testing with different input
don't be surprised by anything. Any behavior is possible.
You can't expect this code to work one way or another, so don't ever use it in real program.
However, you seem to be OK with that, and the question is "what is actually happening with tt". IMHO this question is really great, it reveals passion to understand programming deeper, and it helps in digging into lower layer. So lets get started.
Possible explanation
I failed to reproduce behavior on VS2015, but situation is quite clear. Actual data aligning, variable sizes, endianness, stack growth direction and other details may differ on your PC, but the general idea should be the same.
Variables i, tt, c, d, l are local, so they are stored on stack. Lets assume, sizeof(int) is 4 and sizeof(long long) is 8 which is quite common. Then one of possible data alignments is shown on picture (addresses grow from left to right, each cell represents one byte):
When doing scanf, you pass address of c (blue arrow on next pict) for filling with data. But size of data is 8 bytes, so data of both c and tt are overwritten (blue cells on the pict). For little-endian representation, you always write zeroes to tt unless really big number is entered by user, while c actually gets valid data for small numbers.
However, valid data in c will be rewritten the same way during filling d, the same will happen to d while filling l. So only l will get nonzero value in described case. Easy test: enter large number for c, d, l and check if tt is still zero.
How to get precise answer
You can get all answers from assembly code. Enable disassembly listing (exact steps depend on toolchain: gcc has -S option, visual studio has "goto disassembly" item in context menu while on breakpoint) and analyze listing. It's really helpful to see exact instructions your CPU is going to execute. Some debuggers allow executing instructions one by one. So you need to find out how variables are alligned on stack and when exactly are they overwritten. Analyzing scanf is hard for beginners, so you can start with the simplified version of your program: replace scanf with the following (can't test, but should work):
*((long long *)(&c)) = 1; //or any other user specified value
*((long long *)(&d)) = 2;
*((long long *)(&l)) = 3;

Which to use? int32_t vs uint32_t [duplicate]

When is it appropriate to use an unsigned variable over a signed one? What about in a for loop?
I hear a lot of opinions about this and I wanted to see if there was anything resembling a consensus.
for (unsigned int i = 0; i < someThing.length(); i++) {
SomeThing var = someThing.at(i);
// You get the idea.
}
I know Java doesn't have unsigned values, and that must have been a concious decision on Sun Microsystems' part.

I was glad to find a good conversation on this subject, as I hadn't really given it much thought before.
In summary, signed is a good general choice - even when you're dead sure all the numbers are positive - if you're going to do arithmetic on the variable (like in a typical for loop case).
unsigned starts to make more sense when:
You're going to do bitwise things like masks, or
You're desperate to to take advantage of the sign bit for that extra positive range .
Personally, I like signed because I don't trust myself to stay consistent and avoid mixing the two types (like the article warns against).

In your example above, when 'i' will always be positive and a higher range would be beneficial, unsigned would be useful. Like if you're using 'declare' statements, such as:
#declare BIT1 (unsigned int 1)
#declare BIT32 (unsigned int reallybignumber)
Especially when these values will never change.
However, if you're doing an accounting program where the people are irresponsible with their money and are constantly in the red, you will most definitely want to use 'signed'.
I do agree with saint though that a good rule of thumb is to use signed, which C actually defaults to, so you're covered.

I would think that if your business case dictates that a negative number is invalid, you would want to have an error shown or thrown.
With that in mind, I only just recently found out about unsigned integers while working on a project processing data in a binary file and storing the data into a database. I was purposely "corrupting" the binary data, and ended up getting negative values instead of an expected error. I found that even though the value converted, the value was not valid for my business case.
My program did not error, and I ended up getting wrong data into the database. It would have been better if I had used uint and had the program fail.

C and C++ compilers will generate a warning when you compare signed and unsigned types; in your example code, you couldn't make your loop variable unsigned and have the compiler generate code without warnings (assuming said warnings were turned on).
Naturally, you're compiling with warnings turned all the way up, right?
And, have you considered compiling with "treat warnings as errors" to take it that one step further?
The downside with using signed numbers is that there's a temptation to overload them so that, for example, the values 0->n are the menu selection, and -1 means nothing's selected - rather than creating a class that has two variables, one to indicate if something is selected and another to store what that selection is. Before you know it, you're testing for negative one all over the place and the compiler is complaining about how you're wanting to compare the menu selection against the number of menu selections you have - but that's dangerous because they're different types. So don't do that.

size_t is often a good choice for this, or size_type if you're using an STL class.

effect of using sprintf / printf using %ld format string instead of %d with int data type

We have some legacy code that at one point in time long data types were refactored to int data types. During this refactor a number of printf / sprintf format statements were left incorrect as %ld instead of changed to %d. For example:
int iExample = 32;
char buf[200];
sprintf(buf, "Example: %ld", iExample);
This code is compiled on both GCC and VS2012 compilers. We use Coverity for static code analysis and code like in the example was flagged as a 'Printf arg type mismatch' with a Medium level of severity, CWE-686: Function Call With Incorrect Argument Type I can see this would be definitely a problem had the format string been that of an signed (%d) with an unsigned int type or something along these lines.
I am aware that the '_s' versions of sprintf etc are more secure, and that the above code can also be refactored to use std::stringstream etc. It is legacy code however...
I agree that the above code really should be using %d at the very least or refactored to use something like std::stringstream instead.
Out of curiosity is there any situation where the above code will generate incorrect results? As this legacy code has been around for quite some time and appears to be working fine.
UPDATED
Removed the usage of the word STL and just changed it to be std::stringstream.

As far as the standard is concerned, the behavior is undefined, meaning that the standard says exactly nothing about what will happen.
In practice, if int and long have the same size and representation, it will very likely "work", i.e., behave as if the correct format string has been used. (It's common for both int and long to be 32 bits on 32-bit systems).
If long is wider than int, it could still work "correctly". For example, the calling convention might be such that both types are passed in the same registers, or that both are pushed onto the stack as machine "words" of the same size.
Or it could fail in arbitrarily bad ways. If int is 32 bits and long is 64 bits, the code in printf that tries to read a long object might get a 64-bit object consisting of the 32 bits of the actual int that was passed combined with 32 bits of garbage. Or the extra 32 bits might consistently be zero, but with the 32 significant bits at the wrong end of the 64-bit object. It's also conceivable that fetching 64 bits when only 32 were passed could cause problems with other arguments; you might get the correct value for iExample, but following arguments might be fetched from the wrong stack offset.
My advice: The code should be fixed to use the correct format strings (and you have the tools to detect the problematic calls), but also do some testing (on all the C implementations you care about) to see whether it causes any visible symptoms in practice. The results of the testing should be used only to determine the priority of fixing the problems, not to decide whether to fix them or not. If the code visibly fails now, you should fix it now. If it doesn't, you can get away with waiting until later (presumably you have other things to work on).

It's undefined and depends on the implementation. On implementations where int and long have the same size, it will likely work as expected. But just try it on any system with 32-bit int and 64-bit long, especially if your integer is not the last format argument, and you're likely to get problems where printf reads 64 bits where only 32 were provided, the rest quite possibly garbage, and possibly, depending on alignment, the following arguments also cannot get accessed correctly.

Any better alternatives for getting the digits of a number? (C++)

I know that you can get the digits of a number using modulus and division. The following is how I've done it in the past: (Psuedocode so as to make students reading this do some work for their homework assignment):
int pointer getDigits(int number)
initialize int pointer to array of some size
initialize int i to zero
while number is greater than zero
store result of number mod 10 in array at index i
divide number by 10 and store result in number
increment i
return int pointer
Anyway, I was wondering if there is a better, more efficient way to accomplish this task? If not, is there any alternative methods for this task, avoiding the use of strings? C-style or otherwise?
Thanks. I ask because I'm going to be wanting to do this in a personal project of mine, and I would like to do it as efficiently as possible.
Any help and/or insight is greatly appreciated.

The time it takes to extract the digits will be dwarfed by the time required to dynamically allocate the array. Consider returning the result in a struct:
struct extracted_digits
{
int number_of_digits;
char digits[12];
};
You'll want to pick a suitable value for the maximum number of digits (12 here, which is enough for a 32-bit integer). Alternatively, you could return a std::array<char, 12> and encode the terminal by using an invalid value (so, after the last value, store a 10 or something else that isn't a digit).
Depending on whether you want to handle negative values, you'll also have to decide how to report the unary minus (-).

Unless you want the representation of the number in a base that's a power of 2, that's about the only way to do it.

Smacks of premature optimisation. If profiling proves it matters, then be sure to compare your algo to itoa - internally it may use some CPU instructions that you don't have explicit access to from C++, and which your compiler's optimiser may not be clever enough to employ (e.g. AAM, which divs while saving the mod result). Experiment (and benchmark) coding the assembler yourself. You might dig around for assembly implementations of ITOA (which isn't identical to what you're asking for, but might suggest the optimal CPU instructions).

By "avoiding the use of strings", I'm going to assume you're doing this because a string-only representation is pretty inefficient if you want an integer value.
To that end, I'm going to suggest a slightly unorthodox approach which may be suitable. Don't store them in one form, store them in both. The code below is in C - it will work in C++ but you may want to consider using c++ equivalents - the idea behind it doesn't change however.
By "storing both forms", I mean you can have a structure like:
typedef struct {
int ival;
char sval[sizeof("-2147483648")]; // enough for 32-bits
int dirtyS;
} tIntStr;
and pass around this structure (or its address) rather than the integer itself.
By having macros or inline functions like:
inline void intstrSetI (tIntStr *is, int ival) {
is->ival = i;
is->dirtyS = 1;
}
inline char *intstrGetS (tIntStr *is) {
if (is->dirtyS) {
sprintf (is->sval, "%d", is->ival);
is->dirtyS = 0;
}
return is->sval;
}
Then, to set the value, you would use:
tIntStr is;
intstrSetI (&is, 42);
And whenever you wanted the string representation:
printf ("%s\n" intstrGetS(&is));
fprintf (logFile, "%s\n" intstrGetS(&is));
This has the advantage of calculating the string representation only when needed (the fprintf above would not have to recalculate the string representation and the printf only if it was dirty).
This is a similar trick I use in SQL with using precomputed columns and triggers. The idea there is that you only perform calculations when needed. So an extra column to hold the indexed lowercased last name along with an insert/update trigger to calculate it, is usually a lot more efficient than select lower(non_lowercased_last_name). That's because it amortises the cost of the calculation (done at write time) across all reads.
In that sense, there's little advantage if your code profile is set-int/use-string/set-int/use-string.... But, if it's set-int/use-string/use-string/use-string/use-string..., you'll get a performance boost.
Granted this has a cost, at the bare minimum extra storage required, but most performance issues boil down to a space/time trade-off.
And, if you really want to avoid strings, you can still use the same method (calculate only when needed), it's just that the calculation (and structure) will be different.
As an aside: you may well want to use the library functions to do this rather than handcrafting your own code. Library functions will normally be heavily optimised, possibly more so than your compiler can make from your code (although that's not guaranteed of course).
It's also likely that an itoa, if you have one, will probably outperform sprintf("%d") as well, given its limited use case. You should, however, measure, not guess! Not just in terms of the library functions, but also this entire solution (and the others).

It's fairly trivial to see that a base-100 solution could work as well, using the "digits" 00-99. In each iteration, you'd do a %100 to produce such a digit pair, thus halving the number of steps. The tradeoff is that your digit table is now 200 bytes instead of 10. Still, it easily fits in L1 cache (obviously, this only applies if you're converting a lot of numbers, but otherwise efficientcy is moot anyway). Also, you might end up with a leading zero, as in "0128".

Yes, there is a more efficient way, but not portable, though. Intel's FPU has a special BCD format numbers. So, all you have to do is just to call the correspondent assembler instruction that converts ST(0) to BCD format and stores the result in memory. The instruction name is FBSTP.

Mathematically speaking, the number of decimal digits of an integer is 1+int(log10(abs(a)+1))+(a<0);.
You will not use strings but go through floating points and the log functions. If your platform has whatever type of FP accelerator (every PC or similar has) that will not be a big deal ,and will beat whatever "sting based" algorithm (that is noting more than an iterative divide by ten and count)

How do I safely format floats/doubles with C's sprintf()?

I'm porting one of my C++ libraries to a somewhat wonky compiler -- it doesn't support stringstreams, or C99 features like snprintf(). I need to format int, float, etc values as char*, and the only options available seem to be 1) use sprintf() 2) hand-roll formatting procedures.
Given this, how do I determine (at either compile- or run-time) how many bytes are required for a formatted floating-point value? My library might be used for fuzz-testing, so it needs to handle even unusual or extreme values.
Alternatively, is there a small (100-200 lines preferred), portable implementation of snprintf() available that I could simply bundle with my library?
Ideally, I would end up with either normal snprintf()-based code, or something like this:
static const size_t FLOAT_BUFFER_SIZE = /* calculate max buffer somehow */;
char *fmt_double(double x)
{
char *buf = new char[FLOAT_BUFFER_SIZE + 1];
sprintf(buf, "%f", x);
return buf;
}
Related questions:
Maximum sprintf() buffer size for integers
Maximum sprintf() buffer size for %g-formatted floats

Does the compiler support any of ecvt, fcvt or gcvt? They are a bit freakish, and hard to use, but they have their own buffer (ecvt, fcvt) and/or you may get lucky and find the system headers have, as in VC++, a definition of the maximum number of chars gcvt will produce. And you can take it from there.
Failing that, I'd consider the following quite acceptable, along the lines of the code provided. 500 chars is pretty conservative for a double; valid values are roughly 10^-308 to 10^308, so even if the implementation is determined to be annoying by printing out all the digits there should be no overflow.
char *fmt_double(double d) {
static char buf[500];
sprintf(buf,"%f",d);
assert(buf[sizeof buf-1]==0);//if this fails, increase buffer size!
return strdup(buf);
}
This doesn't exactly provide any amazing guarantees, but it should be pretty safe(tm). I think that's as good as it gets with this sort of approach, unfortunately. But if you're in the habit of regularly running debug builds, you should at least get early warning of any problems...

I think GNU Libiberty is what you want. You can just include the implementation of snprintf.
vasprintf.c - 152 LOC.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js