Magic Numbers In Arrays? - C++

Magic Numbers In Arrays? - C++ - c++

I'm a fairly new programmer, and I apologize if this information is easily available out there, I just haven't been able to find it yet.
Here's my question:
Is is considered magic numbers when you use a literal number to access a specific element of an array?
For example:
arrayOfNumbers[6] // Is six a magic number in this case?
I ask this question because one of my professors is adamant that all literal numbers in a program are magic numbers. It would be nice for me just to access an element of an array using a real number, instead of using a named constant for each element.
Thanks!

That really depends on the context. If you have code like this:
arr[0] = "Long";
arr[1] = "sentence";
arr[2] = "as";
arr[3] = "array.";
...then 0..3 are not considered magic numbers. However, if you have:
int doStuff()
{
return my_global_array[6];
}
...then 6 is definitively a magic number.

It's pretty magic.
I mean, why are you accessing the 6th element? What's are the semantics that should be applied to that number? As it stands all we know is "the 6th (zero-based) number". If we knew the declaration of arrayOfNumbers we would further know its type (e.g. an int or a double).
But if you said:
arrayOfNumbers[kDistanceToSaturn];
...now it has much more meaning to someone reading the code.
In general one iterates over an array, performing some operation on each element, because one doesn't know how long the array is and you can't just access it in a hardcoded manner.
However, sometimes array elements have specific meanings, for example, in graphics programming. Sometimes an array is always the same size because the data demands it (e.g. certain transform matrices). In these cases it may or may not be okay to access the specific element by number: domain experts will know what you're doing, but generalists probably won't. Giving the magic index number a name makes it more obvious to those who have to maintain your code, and helps you to prevent typing the wrong one accidentally.
In my example above I assumed your array holds distances from the sun to a planet. The sun would be the zeroth element, thus arrayOfNumbers[kDistanceToSun] = 0. Then as you increment, each element contains the distance to the next farthest planet: mercury, venus, etc. This is much more readable than just typing the number of the planet you want. In this case the array is of a fixed size because there are a fixed number of planets (well, except the whole Pluto debacle).
The other problem is that "arrayOfNumbers" tells us nothing about the contents of the array. We already know its an array of numbers because we saw the declaration somewhere where you said int arrayOfNumers[12345]; or however you declared it. Instead, something like:
int distanceToPlanetsFromSol[kNumberOfPlanets];
...gives us a much better idea of what the data actually is and what its semantics are. One of your goals as a programmer should be to write code that is self-documenting in this manner.
And then we can argue elsewhere if kNumberOfPlanets should be 8 or 9. :)

You should ask yourself why are you accessing that particular position. In this case, I assume that if you are doing arrayOfNumbers[6] the sixth position has some special meaning. If you think what's that meaning, you probably realize that it's a magic number hiding that.

another way to look at it:
What if after some chance the program needs to access 7th element instead of 6th? HOw would you or a maintainer know that? If for example if the 6th entry is the count of trees in CA it would be a good thing to put
#define CA_STATE_ENTRY 6
Then if now the table is reordered somebody can see that they need to change this to 9 (say). BTW I am not saying this is the best way to maintain an array for tree counts by state - it probably isnt.
Likewise, if later people want to change the program to deal with trees in oregon, then they know to replace
trees[CA_STATE_ENTRY]
with
trees[OR_STATE_ENTRY]
The point is
trees[6]
is not self-documenting
Of course for c++ it should be an enum not a #define

You'd have to provide more context for a meaningful answer. Not all literal numbers are magic, but many are. In a case like that there is no way at all to tell for sure, though most cases I can think of off-hand with an explicit array index >>1 probably qualify as magic.

Not all literals in a program really qualify as "magic numbers" -- but this one certainly seems to. The 6 gives us no clue of why you're accessing that particular element of the array.
To not be a magic number, you need its meaning to be quite clear even on first examination (or at least minimal examination) why that value is being used. Just for example, a lot of code will do things like: &x[0]. In this case, it's typically pretty clear that the '0' really just means "the beginning of the array."

If you need to access a particular element of the array, chances are you're doing it wrong.
You should almost always be iterating over the entire array.

It's only not a magic number if your program is doing something very special involving the number six specifically. Could you provide some context?

That's the problem with professors, they're often too academic. In theory he's right, as usual, but usually magic numbers are used in a stricter context, when the number is embedded in a data stream, allowing you to detect certain properties of the stream (like the signature header of a file type for instance).
See also this Wikipedia entry.

Usually not all constant values in software are called magic numbers.
A java class files always starts with the hex value 0xcafebabe a windows .exe
file with MZ 0x4d, 0x5a , this allows you quickly (but not for sure) to identify
the content of a binary file.

In a MISRA compliant system, all values except 0 and 1 are considered magic numbers. My opinion has always been if the constant value is obvious or likely won't change then leave it as a number. If in doubt create a unique constant since long term maintenance will be easier.

Related

What are common values for uninitialized memory for debugging?

A long time ago I learned about filling unused / uninitialized memory with 0xDEADBEEF so that in a debugger or a crash report if I ever see that value I know I'm looking at uninitialized memory. I saw from a crash report iOS uses 0xBBADBEEF.
What other creative values have people used? Do any particular values have any kind of specific benefit?
The most obvious benefit of values that turn into words is that, at least of most people, if the words are in their language they stick out easily where as some strictly numeric value is less likely to stick out.
But, maybe there are other reason to pick numbers? For example an odd number might crash a processors (68000) for example on certain memory accesses so it's probably better to pick 0x0BADBEEF over 0xBADBEEF0. Are their any other values (maybe processor specific) that have a concrete benefit for using for uninitialized memory?

Generally speaking, you want a value which is unlikely to happen to "work" when interpreted as either an integer, a pointer, or a string. So, here are a few constraints:
Don't use a value that's a multiple of the smallest "usual" alignment on your target architecture. For x86, that's 4 (bytes), so no values that are divisible by 4. This ensures that if the value is interpreted as a pointer, it'll be obviously-incorrect. If you're on a non-x86 architecture, you might even be able to use a value that will cause an alignment trap if used as a pointer.
Don't use a value which could reasonably be a small (positive or negative) integer. Your typical "int" variable in a C program never gets larger than 1,000 or so, so don't use small numbers as your empty data fill.
Don't use a value which is composed entirely of valid ASCII characters. Make sure there's at least one byte in there with the high bit set. These days, you'd want to make sure they weren't valid UTF-8 or possibly UTF-16 values, either.
Don't have any zero bytes in the value. There are too many cases where this would work out to be "helpful" to keeping the program from crashing - terminating a string, giving a non-int field a reasonable-looking value, etc.
Don't use a single (or two) byte values, repeated over and over. Having a full-word length pattern can make it easier to determine how your wild pointer ended up pointing where it is, at least narrowing down which operations offset it from the start of the pattern.
Don't use a value that maps to an valid address for a "typical" process. If the highest bits are set, it'll typically take a whole lot of malloc() before your process will grow large enough to make that a valid address.
Perhaps unsurprisingly, patterns like 0xDEADBEEF meet basically all of these requirements.

One technical term for values like this is "poison value".
Hex numbers that form English words are called Hexspeak. Wikipedia's Hexspeak article pretty much answers this question, cataloguing many known constants in use for various things, including several that are used as poison values / canaries / sanity checks, as well as other uses like error codes or IPv6 addresses.
I seem to recall some variation of 0xBADF00D. (maybe with a repeated letter like your 2nd example).
There's also 0xDEADC0DE. (Googling for where I've seen this used found the wikipedia article linked above).
Other English words in hex I've seen: Java .class files use 0xCAFEBABE as the magic number (first 4 bytes of the file). As a play on this, I guess, the Jikes JVM uses 0xDEADBABE as a sanity check constant.
Apparently Java wasn't the first user of 0xCAFEBABE. Wikipedia says "It was originally created by NeXTSTEP developers as a reference to the baristas at Peet's Coffee & Tea", and was used by the people developing Java before they thought of the name "Java". So it didn't come out of Java -> coffee (if anything the other way around), it's just plain old non-feminist tech culture. :(
re: update: Choosing a good value. For a poison value (not an error code), you want all the bytes to be different and not 0x00 or 0xFF, since those are probably the most likely values for an errant single-byte store. This applies especially for things like stack canaries (to detect buffer overruns), or other cases where detecting that it didn't get overwritten is important.
Your speculation about picking an odd value makes a lot of sense. Not being a valid memory address in the virtual memory layout of typical processes is a big advantage. Failing noisily as early as possible is optimal for debugging. Anyway, this probably means that having the high bit set is a good idea, so 0x0... is probably not a good idea.

Use of Literals, yay/nay in C++

I've recently heard that in some cases, programmers believe that you should never use literals in your code. I understand that in some cases, assigning a variable name to a given number can be helpful (especially in terms of maintenance if that number is used elsewhere). However, consider the following case studies:
Case Study 1: Use of Literals for "special" byte codes.
Say you have an if statement that checks for a specific value stored in (for the sake of argument) a uint16_t. Here are the two code samples:
Version 1:
// Descriptive comment as to why I'm using 0xBEEF goes here
if (my_var == 0xBEEF) {
//do something
}
Version 2:
const uint16_t kSuperDescriptiveVarName = 0xBEEF;
if (my_var == kSuperDescriptiveVarName) {
// do something
}
Which is the "preferred" method in terms of good coding practice? I can fully understand why you would prefer version 2 if kSuperDescriptiveVarName is used more than once. Also, does the compiler do any optimizations to make both versions effectively the same executable code? That is, are there any performance implications here?
Case Study 2: Use of sizeof
I fully understand that using sizeof versus a raw literal is preferred for portability and also readability concerns. Take the two code examples into account. The scenario is that you are computing the offset into a packet buffer (an array of uint8_t) where the first part of the packet is stored as my_packet_header, which let's say is a uint32_t.
Version 1:
const int offset = sizeof(my_packet_header);
Version 2:
const int offset = 4; // good comment telling reader where 4 came from
Clearly, version 1 is preferred, but what about for cases where you have multiple data fields to skip over? What if you have the following instead:
Version 1:
const int offset = sizeof(my_packet_header) + sizeof(data_field1) + sizeof(data_field2) + ... + sizeof(data_fieldn);
Version 2:
const int offset = 47;
Which is preferred in this case? Does is still make sense to show all the steps involved with computing the offset or does the literal usage make sense here?
Thanks for the help in advance as I attempt to better my code practices.

Which is the "preferred" method in terms of good coding practice? I can fully understand why you would prefer version 2 if kSuperDescriptiveVarName is used more than once.
Sounds like you understand the main point... factoring values (and their comments) that are used in multiple places. Further, it can sometimes help to have a group of constants in one place - so their values can be inspected, verified, modified etc. without concern for where they're used in the code. Other times, there are many constants used in proximity and the comments needed to properly explain them would obfuscate the code in which they're used.
Countering that, having a const variable means all the programmers studying the code will be wondering whether it's used anywhere else, keeping it in mind as they inspect the rest of the scope in which it's declared etc. - the less unnecessary things to remember the surer the understanding of important parts of the code will be.
Like so many things in programming, it's "an art" balancing the pros and cons of each approach, and best guided by experience and knowledge of the way the code's likely to be studied, maintained, and evolved.
Also, does the compiler do any optimizations to make both versions effectively the same executable code? That is, are there any performance implications here?
There's no performance implications in optimised code.
I fully understand that using sizeof versus a raw literal is preferred for portability and also readability concerns.
And other reasons too. A big factor in good programming is reducing the points of maintenance when changes are done. If you can modify the type of a variable and know that all the places using that variable will adjust accordingly, that's great - saves time and potential errors. Using sizeof helps with that.
Which is preferred [for calculating offsets in a struct]? Does is still make sense to show all the steps involved with computing the offset or does the literal usage make sense here?
The offsetof macro (#include <cstddef>) is better for this... again reducing maintenance burden. With the this + that approach you illustrate, if the compiler decides to use any padding your offset will be wrong, and further you have to fix it every time you add or remove a field.
Ignoring the offsetof issues and just considering your this + that example as an illustration of a more complex value to assign, again it's a balancing act. You'd definitely want some explanation/comment/documentation re intent here (are you working out the binary size of earlier fields? calculating the offset of the next field?, deliberately missing some fields that might not be needed for the intended use or was that accidental?...). Still, a named constant might be enough documentation, so it's likely unimportant which way you lean....

In every example you list, I would go with the name.
In your first example, you almost certainly used that special 0xBEEF number at least twice - once to write it and once to do your comparison. If you didn't write it, that number is still part of a contract with someone else (perhaps a file format definition).
In the last example, it is especially useful to show the computation that yielded the value. That way, if you encounter trouble down the line, you can easily see either that the number is trustworthy, or what you missed and fix it.
There are some cases where I prefer literals over named constants though. These are always cases where a name is no more meaningful than the number. For example, you have a game program that plays a dice game (perhaps Yahtzee), where there are specific rules for specific die rolls. You could define constants for One = 1, Two = 2, etc. But why bother?

Generally it is better to use a name instead of a value. After all, if you need to change it later, you can find it more easily. Also it is not always clear why this particular number is used, when you read the code, so having a meaningful name assigned to it, makes this immediately clear to a programmer.
Performance-wise there is no difference, because the optimizers should take care of it. And it is rather unlikely, even if there would be an extra instruction generated, that this would cause you troubles. If your code would be that tight, you probably shouldn't rely on an optimizer effect anyway.

I can fully understand why you would prefer version 2 if kSuperDescriptiveVarName is used more than once.
I think kSuperDescriptiveVarName will definitely be used more than once. One for check and at least one for assignment, maybe in different part of your program.
There will be no difference in performance, since an optimization called Constant Propagation exists in almost all compilers. Just enable optimization for your compiler.

`short int` vs `int`

Should I bother using short int instead of int? Is there any useful difference? Any pitfalls?

short vs int
Don't bother with short unless there is a really good reason such as saving memory on a gazillion values, or conforming to a particular memory layout required by other code.
Using lots of different integer types just introduces complexity and possible wrap-around bugs.
On modern computers it might also introduce needless inefficiency.
const
Sprinkle const liberally wherever you can.
const constrains what might change, making it easier to understand the code: you know that this beastie is not gonna move, so, can be ignored, and thinking directed at more useful/relevant things.
Top-level const for formal arguments is however by convention omitted, possibly because the gain is not enough to outweight the added verbosity.
Also, in a pure declaration of a function top-level const for an argument is simply ignored by the compiler. But on the other hand, some other tools may not be smart enough to ignore them, when comparing pure declarations to definitions, and one person cited that in an earlier debate on the issue in the comp.lang.c++ Usenet group. So it depends to some extent on the toolchain, but happily I've never used tools that place any significance on those consts.
Cheers & hth.,

Absolutely not in function arguments. Few calling conventions are going to make any distinction between short and int. If you're making giant arrays you could use short if your data fits in short to save memory and increase cache effectiveness.

What Ben said. You will actually create less efficient code since all the registers need to strip out the upper bits whenever any comparisons are done. Unless you need to save memory because you have tons of them, use the native integer size. That's what int is for.
EDIT: Didn't even see your sub-question about const. Using const on intrinsic types (int, float) is useless, but any pointers/references should absolutely be const whenever applicable. Same for class methods as well.

The question is technically malformed "Should I use short int?". The only good answer will be "I don't know, what are you trying to accomplish?".
But let's consider some scenarios:
You know the definite range of values that your variable can take.
The ranges for signed integers are:
signed char — -2⁷ – 2⁷-1
short — -2¹⁵ – 2¹⁵-1
int — -2¹⁵ – 2¹⁵-1
long — -2³¹ – 2³¹-1
long long — -2⁶³ – 2⁶³-1
We should note here that these are guaranteed ranges, they can be larger in your particular implementation, and often are. You are also guaranteed that the previous range cannot be larger than the next, but they can be equal.
You will quickly note that short and int actually have the same guaranteed range. This gives you very little incentive to use it. The only reason to use short given this situation becomes giving other coders a hint that the values will be not too large, but this can be done via a comment.
It does, however, make sense to use signed char, if you know that you can fit every potential value in the range -128 — 127.
You don't know the exact range of potential values.
In this case you are in a rather bad position to attempt to minimise memory useage, and should probably use at least int. Although it has the same minimum range as short, on many platforms it may be larger, and this will help you out.
But the bigger problem is that you are trying to write a piece of software that operates on values, the range of which you do not know. Perhaps something wrong has happened before you have started coding (when requirements were being written up).
You have an idea about the range, but realise that it can change in the future.
Ask yourself how close to the boundary are you. If we are talking about something that goes from -1000 to +1000 and can potentially change to -1500 – 1500, then by all means use short. The specific architecture may pad your value, which will mean you won't save any space, but you won't lose anything. However, if we are dealing with some quantity that is currently -14000 – 14000, and can grow unpredictably (perhaps it's some financial value), then don't just switch to int, go to long right away. You will lose some memory, but will save yourself a lot of headache catching these roll-over bugs.

short vs int - If your data will fit in a short, use a short. Save memory. Make it easier for the reader to know how much data your variable may fit.
use of const - Great programming practice. If your data should be a const then make it const. It is very helpful when someone reads your code.

Any better alternatives for getting the digits of a number? (C++)

I know that you can get the digits of a number using modulus and division. The following is how I've done it in the past: (Psuedocode so as to make students reading this do some work for their homework assignment):
int pointer getDigits(int number)
initialize int pointer to array of some size
initialize int i to zero
while number is greater than zero
store result of number mod 10 in array at index i
divide number by 10 and store result in number
increment i
return int pointer
Anyway, I was wondering if there is a better, more efficient way to accomplish this task? If not, is there any alternative methods for this task, avoiding the use of strings? C-style or otherwise?
Thanks. I ask because I'm going to be wanting to do this in a personal project of mine, and I would like to do it as efficiently as possible.
Any help and/or insight is greatly appreciated.

The time it takes to extract the digits will be dwarfed by the time required to dynamically allocate the array. Consider returning the result in a struct:
struct extracted_digits
{
int number_of_digits;
char digits[12];
};
You'll want to pick a suitable value for the maximum number of digits (12 here, which is enough for a 32-bit integer). Alternatively, you could return a std::array<char, 12> and encode the terminal by using an invalid value (so, after the last value, store a 10 or something else that isn't a digit).
Depending on whether you want to handle negative values, you'll also have to decide how to report the unary minus (-).

Unless you want the representation of the number in a base that's a power of 2, that's about the only way to do it.

Smacks of premature optimisation. If profiling proves it matters, then be sure to compare your algo to itoa - internally it may use some CPU instructions that you don't have explicit access to from C++, and which your compiler's optimiser may not be clever enough to employ (e.g. AAM, which divs while saving the mod result). Experiment (and benchmark) coding the assembler yourself. You might dig around for assembly implementations of ITOA (which isn't identical to what you're asking for, but might suggest the optimal CPU instructions).

By "avoiding the use of strings", I'm going to assume you're doing this because a string-only representation is pretty inefficient if you want an integer value.
To that end, I'm going to suggest a slightly unorthodox approach which may be suitable. Don't store them in one form, store them in both. The code below is in C - it will work in C++ but you may want to consider using c++ equivalents - the idea behind it doesn't change however.
By "storing both forms", I mean you can have a structure like:
typedef struct {
int ival;
char sval[sizeof("-2147483648")]; // enough for 32-bits
int dirtyS;
} tIntStr;
and pass around this structure (or its address) rather than the integer itself.
By having macros or inline functions like:
inline void intstrSetI (tIntStr *is, int ival) {
is->ival = i;
is->dirtyS = 1;
}
inline char *intstrGetS (tIntStr *is) {
if (is->dirtyS) {
sprintf (is->sval, "%d", is->ival);
is->dirtyS = 0;
}
return is->sval;
}
Then, to set the value, you would use:
tIntStr is;
intstrSetI (&is, 42);
And whenever you wanted the string representation:
printf ("%s\n" intstrGetS(&is));
fprintf (logFile, "%s\n" intstrGetS(&is));
This has the advantage of calculating the string representation only when needed (the fprintf above would not have to recalculate the string representation and the printf only if it was dirty).
This is a similar trick I use in SQL with using precomputed columns and triggers. The idea there is that you only perform calculations when needed. So an extra column to hold the indexed lowercased last name along with an insert/update trigger to calculate it, is usually a lot more efficient than select lower(non_lowercased_last_name). That's because it amortises the cost of the calculation (done at write time) across all reads.
In that sense, there's little advantage if your code profile is set-int/use-string/set-int/use-string.... But, if it's set-int/use-string/use-string/use-string/use-string..., you'll get a performance boost.
Granted this has a cost, at the bare minimum extra storage required, but most performance issues boil down to a space/time trade-off.
And, if you really want to avoid strings, you can still use the same method (calculate only when needed), it's just that the calculation (and structure) will be different.
As an aside: you may well want to use the library functions to do this rather than handcrafting your own code. Library functions will normally be heavily optimised, possibly more so than your compiler can make from your code (although that's not guaranteed of course).
It's also likely that an itoa, if you have one, will probably outperform sprintf("%d") as well, given its limited use case. You should, however, measure, not guess! Not just in terms of the library functions, but also this entire solution (and the others).

It's fairly trivial to see that a base-100 solution could work as well, using the "digits" 00-99. In each iteration, you'd do a %100 to produce such a digit pair, thus halving the number of steps. The tradeoff is that your digit table is now 200 bytes instead of 10. Still, it easily fits in L1 cache (obviously, this only applies if you're converting a lot of numbers, but otherwise efficientcy is moot anyway). Also, you might end up with a leading zero, as in "0128".

Yes, there is a more efficient way, but not portable, though. Intel's FPU has a special BCD format numbers. So, all you have to do is just to call the correspondent assembler instruction that converts ST(0) to BCD format and stores the result in memory. The instruction name is FBSTP.

Mathematically speaking, the number of decimal digits of an integer is 1+int(log10(abs(a)+1))+(a<0);.
You will not use strings but go through floating points and the log functions. If your platform has whatever type of FP accelerator (every PC or similar has) that will not be a big deal ,and will beat whatever "sting based" algorithm (that is noting more than an iterative divide by ten and count)

Is long long in C++ known to be very nasty in terms of precision?

The Given Problem:
Given a theater with n rows, m seats, and a list of seats that are reserved. Given these values, determine how many ways two friends can sit together in the same row.
So, if the theater was a size of 2x3 and the very first seat in the first row was reserved, there would be 3 different seatings that these two guys can take.
The Problem That I'm Dealing With
The function itself is supposed to return the number of seatings that there are based on these constraints. The return value is a long long.
I've gone through my code many many times...and I'm pretty sure that it's right. All I'm doing is incrementing this one value. However, ALL of the values that my function return differ from the actual solution by 1 or 2.
Any ideas? And if you think that it's just something wrong with my code, please tell me. I don't mind being called an idiot just as long as I learn something.

Unless you're overflowing or underflowing, it definitely sounds like something is wrong with your code. For integral types, there are no precision ambiguities in c or c++

First, C++ doesn't have a long long type. Second, in C99, long long can represent any integral value from LLONG_MIN (<= -2^63) to LLONG_MAX (>= 2^63 - 1) exactly. The problem lies elsewhere.

Given the description of the problem, I think it is unambiguous.
Normally, the issue is that you don't know if the order in which the combinations are taken is important or not, but the example clearly disambiguate: if the order was important we would have 6 solutions, not 3.
What is the value that your code gives for this toy example ?
Anyway I can add a few examples with my own values if you wish, so that you can compare against them, I can't do much more for you unless you post your code. Obviously, the rows are independent so I'm only going to show the result row by row.
X occupied seat
. free seat
1: X..X
1: .X..
2: X...X
3: X...X..
5: ..X.....
From a computation point of view, I should note it's (at least) an O(N) process where N is the number of seats: you have to inspect nearly each seat once, except the first (and last) ones in case the second (and next to last) are occupied; and that's effectively possible to solve this linearly.
From a technic point of view:
make sure you initialize your variable to 0
make sure you don't count too many seats on toy example
I'd be happy to help more but I would not like to give you the full solution before you have a chance to think it over and review your algorithm calmly.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js