Converting arrays of one type to another - c++

Basically I have an array of doubles. I want to pass this array to a function (ProcessData) which will treat them as short ints. Is creating a short pointer and pointing it to the array, then passing this pointer to the function ok (code 1) ?
Is this in effect the same as creating a short array, iterating through each element and converting each element of the double array to a short and then passing the short array pointer (code 2) ? Thanks
//code 1
//.....
short* shortPtr = (short*)doubleArr;
ProcessData(shortPtr);
..
//code 2
//...
short shortArr [ARRSIZE];
int i;
for (i = 0; i < ARRSIZE; i++)
{
shortArr[i] = (short)doubleArr[i];
}
ProcessData(shortArr);

You can't just cast, as the various comments have said. But if you use iterators you can get more or less the same effect:
void do_something_with_short(short i) {
/* whatever */
}
template <class Iter>
void do_something(Iter first, Iter last) {
while (first != last)
do_something_with_short(*first++);
}
You can call that template function with iterators into an array of any arithmetic type (in fact, any type that's implicitly convertible to short or, if you add a cast at the point of the call to do_something_with_short, with a type that requires a cast):
double data[10]; // needs to be initialized, of course
do_something(std::begin(data), std::end(data));
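For comparison, here is a minimal sketch of what "code 2" in the question amounts to, written with std::transform. The helper name to_shorts is made up for illustration, and the sketch assumes every value fits in a short (converting an out-of-range double to short is undefined behavior, so real code would clamp or check first).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the question): converts each double to a
// short by value, which is what the loop in "code 2" does.
// Assumption: every input value is within the range of short.
std::vector<short> to_shorts(const double* src, std::size_t count) {
    assert(src != nullptr || count == 0);
    std::vector<short> out(count);
    std::transform(src, src + count, out.begin(),
                   [](double d) { return static_cast<short>(d); });
    return out;
}
```

The result's data() pointer could then be handed to something like ProcessData, and each element is a real short rather than a reinterpreted slice of a double.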

No you can't do that. Here's at least one reason why:
An array is a contiguous sequence of several memory allocations accessed by way of an index, like so
[----][----][----]
Note the four dashes inside the square brackets. That is to indicate that in most situations in C/C++, an int is four bytes long. Array cells can be accessed by their index because, if we know the memory address of the first cell (m) and how big each cell is meant to be (c), in this case four bytes, we can easily find the memory location of any index by computing m + index * c
[----][----][----]
^ array[0]
[----][----][----]
---- ---- ^ array[2]
Fundamentally, this is why pointers can be treated like arrays in C/C++, because when you are accessing arrays, you are basically doing pointer arithmetic anyway.
In most cases in C/C++, a short is 2 bytes long, so to represent it in the same way
[--][--][--]
If you create a short pointer and try to use it as an array, it is expected to point to something arranged like the above. If you try to index it, it is going to have problems: if you were dealing with an array of shorts, the location of array[2] would be m + index * c with c = 2, that is m + 2 * 2, as shown below
[--][--][--]
-- -- ^ array[2] (note moving along four bytes)
But since we are in reality dealing with an array of integers, the following will happen
[----][----][----]
---- ^ array[2] (again moving along four bytes)
Which is clearly wrong
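The m + index * c arithmetic can be sketched as a tiny compile-time function. The name byte_offset is invented for this illustration, and the actual sizes of short and int depend on the platform, so the sketch only relies on sizeof.

```cpp
#include <cstddef>

// The "index * c" part of m + index * c: how many bytes past the start
// of the array element `index` lives when each cell is `cell_size` bytes.
constexpr std::size_t byte_offset(std::size_t index, std::size_t cell_size) {
    return index * cell_size;
}

// Where a short* believes element 2 is, versus where element 2 of an
// int array really is. The two only agree if short and int have the same size.
constexpr std::size_t offset_seen_by_short_ptr = byte_offset(2, sizeof(short));
constexpr std::size_t offset_in_int_array = byte_offset(2, sizeof(int));
```

On a platform with 2-byte shorts and 4-byte ints these come out as 4 and 8, which is the mismatch the diagrams above show.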

No, because ++ptr actually does something like ptr = (char*)ptr + sizeof *ptr (with sizeof (char) being 1 by definition). So incrementing a double pointer moves it by (usually) 8 bytes, while incrementing a short pointer moves it by only 2 bytes.

Suppose that your kids study piano and occasionally ask you to scan for them a stack of sheet music given to them by their teacher, who was born in the 20th century (just like yourself). You take those sheets to your office and feed them to the photocopier. It creates decent digital scans that your kids can use on their piano equipped with a touch screen. All goes well until one day your child brings you an old rare set of vinyl records. She despairs of finding those melodies in sheet music form but asks you to at least copy the records. Inexperienced in musical matters, you take those disks to your office, load them in the automatic document feeder of the scanner and realize that you are deep in ... um... crap only as you hear the sounds of the vinyl disks breaking inside the stupid machine. Even if the photocopier were not equipped with an ADF, and you had to place all the originals on its glass flatbed manually, you would hardly receive your fair share of praise when you sent the scans to your daughter.
The scanner doesn't care what you put into it - as long as it fits inside. It does its best, but the result is not up to the expectations. However, had you first taken the vinyl records to an experienced musician who would write them down as musical score, scanning those sheets would result in real delight of your child.
In C++, different types may differ to the extent that a printed sheet of paper differs from a CD. A C++ function expecting to receive an array of shorts will process any sequence of bytes/bits as an array of shorts. It doesn't care that the memory area is actually filled with values of a different type, having a completely different representation, just like the scanner didn't care about the contents of the stack on the ADF. Assuming that a function will internally convert each element of the array from double to short is the same as believing that a photocopier includes a gramophone and a musician that will automatically transcribe vinyl recordings to sheet form. Note that the latter is a possible design for a real-world photocopier, and some other programming languages work like that. But existing implementations of C++ do not [1].
[1] In theory, a standard-compliant implementation of C/C++ is possible that would interpret all provisions of UB in the language in favor of the opposite answer to your question, rather than in favor of best performance. But that would make little sense for a language like C/C++.

Related

Should I use int or unsigned int when working with STL container?

Referring to this guide:
https://google.github.io/styleguide/cppguide.html#Integer_Types
Google suggests using int most of the time.
I try to follow this guide and the only problem is with STL containers.
Example 1.
void setElement(int index, int value)
{
if (index > someExternalVector.size()) return;
...
}
Comparing index and .size() is generating a warning.
Example 2.
for (int i = 0; i < someExternalVector.size(); ++i)
{
...
}
Same warning between i and .size().
If I declare index or i as unsigned int, the warning is off, but the type declaration will propagate, then I have to declare more variables as unsigned int, then it contradicts the guide and loses consistency.
The best way I can think is to use a cast like:
if (index > static_cast<int>(someExternalVector.size()))
or
for (int i = 0; i < static_cast<int>(someExternalVector.size()); ++i)
But I really don't like the casts.
Any suggestion?
Some detailed thoughts below:
The advantage of using only signed integers is this: I can avoid signed/unsigned warnings and casts, and be sure every value can be negative (to be consistent), so -1 can be used to represent invalid values.
There are many cases that the usage of loop counters are mixed with some other constants or struct members. So it would be problematic if signed/unsigned is not consistent. There will be full of warnings and castings.
Unsigned types have three characteristics, one of which is qualitatively 'good' and one of which is qualitatively 'bad':
They can hold twice as many values as the same-sized signed type (good)
The size_t version (that is, 32-bit on a 32-bit machine, 64-bit on a 64-bit machine, etc) is useful for representing memory (addresses, sizes, etc) (neutral)
They wrap below 0, so subtracting 1 in a loop or using -1 to represent an invalid index can cause bugs (bad.) Signed types are not immune either: overflowing a signed integer is undefined behavior in C++.
The STL uses unsigned types because of the first two points above: in order to not limit the potential size of array-like classes such as vector and deque (although you have to question how often you would want 4294967296 elements in a data structure); because a negative value will never be a valid index into most data structures; and because size_t is the correct type to use for representing anything to do with memory, such as the size of a struct, and related things such as the length of a string (see below.) That's not necessarily a good reason to use it for indexes or other non-memory purposes such as a loop variable. The reason it's best practice to do so in C++ is kind of a reverse construction, because it's what's used in the containers as well as other methods, and once used the rest of the code has to match to avoid the same problem you are encountering.
You should use a signed type when the value can become negative.
You should use an unsigned type when the value cannot become negative (possibly different to 'should not'.)
You should use size_t when handling memory sizes (the result of sizeof, often things like string lengths, etc.) It is often chosen as a default unsigned type to use, because it matches the platform the code is compiled for. For example, the length of a string is size_t because a string can only ever have 0 or more elements, and there is no reason to limit a string's length method arbitrarily shorter than what can be represented on the platform, such as a 16-bit length (0-65535) on a 32-bit platform. Note (thanks commenter Morwen) std::intptr_t or std::uintptr_t which are conceptually similar - will always be the right size for your platform - and should be used for memory addresses if you want something that's not a pointer. Note 2 (thanks commenter rubenvb) that a string can only hold size_t-1 elements due to the value of npos. Details below.
This means that if you use -1 to represent an invalid value, you should use signed integers. If you use a loop to iterate backwards over your data, you should consider using a signed integer if you are not certain that the loop construct is correct (and as noted in one of the other answers, they are easy to get wrong.) IMO, you should not resort to tricks to ensure the code works - if code requires tricks, that's often a danger signal. In addition, it will be harder to understand for those following you and reading your code. Both of these are reasons not to follow @Jasmin Gray's answer above.
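To make the wrap-around concrete, here is a small sketch that can be checked entirely at compile time. The constant name kInvalidIndex is hypothetical; the point is only what happens to 0 - 1 and to -1 once an unsigned type is involved.

```cpp
#include <cstddef>
#include <limits>

// Hypothetical sentinel: with a signed index, -1 can mean "no such element".
constexpr int kInvalidIndex = -1;

// Subtracting 1 from an unsigned zero wraps to the largest unsigned value.
constexpr unsigned wrapped = 0u - 1u;
static_assert(wrapped == std::numeric_limits<unsigned>::max(),
              "unsigned arithmetic wraps below 0");

// The same -1 converted to size_t turns into a huge index, not a marker
// that can be told apart from a real position by testing for < 0.
constexpr std::size_t as_size = static_cast<std::size_t>(kInvalidIndex);
static_assert(as_size == std::numeric_limits<std::size_t>::max(),
              "-1 converts to the maximum size_t value");
```

This is why a sentinel of -1 and an unsigned index type do not mix well: the test `index < 0` can never be true for an unsigned index.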
Iterators
However, using integer-based loops to iterate over the contents of a data structure is the wrong way to do it in C++, so in a sense the argument over signed vs unsigned for loops is moot. You should use an iterator instead:
std::vector<foo> bar;
for (std::vector<foo>::const_iterator it = bar.begin(); it != bar.end(); ++it) {
// Access using *it or it->, e.g.:
const foo & a = *it;
}
When you do this, you don't need to worry about casts, signedness, etc.
Iterators can be forward (as above) or reverse, for iterating backwards. The loop keeps the same shape, with rbegin() and rend() in place of begin() and end(): the comparison becomes it != bar.rend(), because the end iterator signals the end of the iteration, not the end of the underlying conceptual array, tree, or other structure.
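As a rough sketch of the backwards case, the loop below walks a vector from last element to first with reverse iterators. The function name reversed_copy and the choice of int elements are just for illustration.

```cpp
#include <cassert>
#include <vector>

// Copies the elements of `in` in reverse order using reverse iterators.
// No index variable, signed or unsigned, appears anywhere in the loop.
std::vector<int> reversed_copy(const std::vector<int>& in) {
    std::vector<int> out;
    out.reserve(in.size());
    for (std::vector<int>::const_reverse_iterator it = in.rbegin();
         it != in.rend(); ++it) {
        out.push_back(*it);
    }
    assert(out.size() == in.size());
    return out;
}
```

An empty vector needs no special case: rbegin() equals rend() and the body never runs.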
In other words, the answer to your question 'Should I use int or unsigned int when working with STL containers?' is 'Neither. Use iterators instead.' Read more about:
Why use iterators instead of array indices in C++?
Why again (some more interesting points in the answers to this question)
Iterators in general - the different kinds, how to use them, etc.
What's left?
If you don't use an integer type for loops, what's left? Your own values, which are dependent on your data, but which in your case include using -1 for an invalid value. This is simple. Use signed. Just be consistent.
I am a big believer in using natural types, such as enums, and signed integers fit into this. They match our conceptual expectation more closely. When your mind and the code are aligned, you are less likely to write buggy code and more likely to expressively write correct, clean code.
Use the type that the container returns. In this case, size_t - which is an integer type that is unsigned.
(To be technical, it's std::vector<MyType>::size_type, but that's usually defined to size_t, so you're safe using size_t. unsigned is also fine)
But in general, use the right tool for the right job. Is the 'index' ever supposed to be negative? If not, don't make it signed.
By the by, you don't have to type out 'unsigned int'. 'unsigned' is shorthand for the same variable type:
int myVar1;
unsigned myVar2;
The page linked to in the original question said:
Some people, including some textbook authors, recommend using unsigned
types to represent numbers that are never negative. This is intended
as a form of self-documentation. However, in C, the advantages of such
documentation are outweighed by the real bugs it can introduce.
It's not just self-documentation, it's use the right tool for the right job. Saying that 'unsigned variables can cause bugs so don't use unsigned variables' is silly. Signed variables can also cause bugs. So can floats (more than integers). The only guaranteed bug-free code is code that doesn't exist.
Their example of why unsigned is evil, is this loop:
for (unsigned int i = foo.Length()-1; i >= 0; --i)
I have difficulty iterating backwards over a loop, and I usually make mistakes (with signed or unsigned integers) with it. Do I subtract one from size? Do I make it greater-than-AND-equal-to 0, or just greater than? It's a sloppy situation to begin with.
So what do you do with code that you know you have problems with? You change your coding style to fix the problem, make it simpler, and make it easier to read, and make it easier to remember. There is a bug in the loop they posted. The bug is, they wanted to allow a value below zero, but they chose to make it unsigned. It's their mistake.
But here's a simple trick that makes it easier to read, remember, write, and run. With unsigned variables. Here's the intelligent thing to do (obviously, this is my opinion).
for(unsigned i = myContainer.size(); i--> 0; )
{
std::cout << myContainer[i] << std::endl;
}
It's unsigned. It always works. No negative to the starting size. No worrying about underflows. It just works. It's just smart. Do it right, don't stop using unsigned variables because someone somewhere once said they had a mistake with a for() loop and failed to train themselves to not make the mistake.
The trick to remembering it:
Set 'i' to the size. (don't worry about subtracting one)
Make 'i' point to 0 like an arrow. i --> 0 (it's a combination of post-decrementing (i--) and greater-than comparison (i > 0))
It's better to teach yourself tricks to code right than to throw away tools because you don't code right.
Which would you want to see in your code?
for(unsigned i = myContainer.size()-1; i >= 0; --i)
Or:
for(unsigned i = myContainer.size(); i--> 0; )
Not because it's less characters to type (that'd be silly), but because it's less mental clutter. It's simpler to mentally parse when skimming through code, and easier to spot mistakes.
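Here is a small sketch of that loop wrapped in a function so the visiting order can be checked. The name indices_backwards is invented, and std::size_t is used for the counter on the assumption that it matches what size() returns; plain unsigned behaves the same way for small containers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Records the indices visited by the "i-- > 0" loop: size-1 down to 0.
// The condition tests the old value of i and then decrements it, so the
// body never sees an index that has wrapped around.
std::vector<std::size_t> indices_backwards(std::size_t size) {
    std::vector<std::size_t> visited;
    for (std::size_t i = size; i-- > 0; ) {
        visited.push_back(i);
    }
    assert(visited.size() == size);
    return visited;
}
```

For a size of 0 the condition is false on the first test and the body is skipped, which is exactly the case where the `size() - 1; i >= 0` version goes wrong.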

Pointer arithmetic disguised &(array[0])

Today I browsed some source code (it was an example file explaining the use of a software framework) and discovered a lot of code like this:
int* array = new int[10]; // or malloc, who cares. Please, no language wars. This is applicable to both languages
for ( int* ptr = &(array[0]); ptr <= &(array[9]); ptr++ )
{
...
}
So basically, they've done "take the address of the object that lies at address array + x".
Normally I would say that this is plain stupidity, as writing array + 0 or array + 9 directly does the same. I would even always rewrite such code to a size_t for loop, but that's a matter of style.
But the overuse of this got me thinking: Am I missing something blatantly obvious or something subtly hidden in the dark corners of the language?
For anyone wanting to take a look at the original source code, with all its nasty gotos, mallocs and of course this pointer thing, feel free to look at it online.
Yeah, there's no good reason for the first one. This is exactly the same thing:
int *ptr = array;
I agree on the second also, may as well just write:
ptr < (array + 10)
Of course you could also just make it a for loop from 0-9 and set the temp pointer to point to the beginning of the array.
for(int i = 0, *ptr = array; i < 10; ++i, ++ptr)
/* ... */
That of course assumes that ptr is not being modified within the body of the loop.
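A minimal sketch of the half-open form, with the pointer compared against array + count instead of the address of the last element. The function name sum_range is made up for the example.

```cpp
#include <cassert>
#include <cstddef>

// Sums `count` ints by walking a pointer over the half-open range
// [array, array + count). `array` plays the role of &(array[0]) and
// `array + count` is the one-past-the-end position.
int sum_range(const int* array, std::size_t count) {
    assert(array != nullptr || count == 0);
    int total = 0;
    for (const int* ptr = array; ptr < array + count; ++ptr) {
        total += *ptr;
    }
    return total;
}
```

Forming array + count is fine as long as it is never dereferenced, which is why the comparison uses < rather than <=.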
You're not missing anything, they do mean the same thing.
However, to try to shed some more light on this, I should say that I also write expressions like that from time to time, for added clarity.
I personally tend to think in terms of object-oriented programming, meaning that I prefer to refer to "the address of the nth element of the array", rather than "the nth offset from the beginning address of the array". Even though those two things are equivalent in C, when I'm writing the code, I have the former in mind - so I express that.
Perhaps that's the reasoning of the person who wrote this as well.
Edit: this is partially incorrect. Read the comments.
The problem with &(array[0]) is that it expands to &(*(array + 0)), which involves a dereference. Now, every compiler will obviously optimize this into the same thing as array + 0, but as far as the language is concerned the dereference can cause UB in places where array + 0 would not.
I think the reason why they wrote it this way was that
&(array[0])
and
&(array[9])
just look similar. Another way would be to write it
array + 0
and
array + 9
respectively. As you already mentioned, they essentially do the same (at least most compilers treat it as the same, I hope).
You could interpret the two different type of expressions differently: The first one can be read as "the address of element 0 / 9". The second one can be read as "array pointer with an element offset of 0 / 9". The first one sounds more high-level, the second more low-level. However, most people tend to use the second form, though.
Now since array + 0 of course is the same as array, you could just write array. I think the point here is that the begin and end of the loop look "analogous" to each other. A question of personal taste.
According to classical mathematics:
Array[n]
refers to the nth element in the array.
To "take the address of" the nth element, the & or address of operator is applied:
&Array[n]
To clear out any assumed ambiguities, parentheses are added:
&(Array[n])
To a reader, reading from left to right, this expression has the meaning:
Return the address of the element at position 'n'
The insurance may have developed as a protection against old faulty compilers.
Some people consider it more readable than:
Array + n
Sorry, but I am old school and prefer using the '&' version, paren or without. I'll waste my time making code easier to read than worrying about which version takes longer to compile or which version is more efficient.
A clear, commented section of code has a higher Return On Investment than a section of code that is micro-optimized for efficiency or uses sections of the language that are unfamiliar to non language lawyers.

Strange C++ Memory Allocation

I created a simple class, Storer, in C++, playing with memory allocation. It contains six field variables, all of which are assigned in the constructor:
int x;
int y;
int z;
char c;
long l;
double d;
I was interested in how these variables were being stored, so I wrote the following code:
Storer *s=new Storer(5,4,3,'a',5280,1.5465);
cout<<(long)s<<endl<<endl;
cout<<(long)&(s->x)<<endl;
cout<<(long)&(s->y)<<endl;
cout<<(long)&(s->z)<<endl;
cout<<(long)&(s->c)<<endl;
cout<<(long)&(s->l)<<endl;
cout<<(long)&(s->d)<<endl;
I was very interested in the output:
33386512
33386512
33386516
33386520
33386524
33386528
33386536
Why is the char c taking up four bytes? sizeof(char) returns, of course, 1, so why is the program allocating more memory than it needs? That too much memory is being allocated is confirmed by the following code:
cout<<sizeof(s->c)<<endl;
cout<<sizeof(Storer)<<endl;
cout<<sizeof(int)+sizeof(int)+sizeof(int)+sizeof(char)+sizeof(long)+sizeof(double)<<endl;
which prints:
1
32
29
confirming that, indeed, 3 bytes are being allocated needlessly. Can anyone explain to me why this is happening? Thanks.
Data alignment and compiler padding say hi!
The CPU has no notion of type, what it gets in its 32-bit (or 64-bit, or 128-bit (SSE), or 256-bit (AVX) - let's keep it simple at 32) registers needs to be properly aligned in order to be processed correctly and efficiently. Imagine a simple scenario, where you have a char, followed by an int. In a 32-bit architecture, that's 1 byte for a char and 4 bytes for an integer.
A 32-bit register would have to break on its boundary, only taking in 3 bytes of the integer and leaving the 4th byte for "a second run". It cannot process the data properly that way, so the compiler will add padding in order to make sure all the stuff is processed efficiently. And that means adding a certain amount of padding depending on the type in question.
Why is misalignment a problem?
The computer is not human; it can't just pick the bytes out with a pair of eyes and a brain. It has to be very deterministic and cautious about how it goes about doing things. First it loads one block which contains n bytes of the given information and shifts it around to prune out unrelated information, then it loads another and again shifts out the bytes which have nothing to do with the operation at hand. And usually you have two operands, so that is just one of them complete. Only when all that work is done can the data actually be processed. That is way too much performance overhead when you can simply align the data properly (and most of the time, compilers do it for you, if you're not doing anything fancy).
Could you visualize it?
Visually - the first green byte is the mentioned char, and the three green bytes plus the first red one of the second block is the 4-byte int, colorcoded on a 4-byte access boundary (we're talking about a 32-bit register). The "instead part" at the bottom shows an ideal setup where the int hits the register properly (the char getting padded into obedience somewhere off image):
Read more on data alignment, which comes quite handy when you're dealing with fancy extensions of the instruction set like SSE (128-bit regs) or AVX (256-bit regs), so special care must be taken so that the optimizations of vectorization are not defeated ( aligning on a 16-byte boundary for SSE, 16*8 -> 128-bits).
Additional remarks on user defined alignment
phonetagger made a valid point in the comments that there are pragma directives which can be used to force the compiler to align the data in the way the programmer specifies. But such directives, like #pragma pack(...), are a statement to the compiler that you know what you're doing and what's best for you. Be sure that you do, because if you fail to accommodate your environment, you might experience various penalties, the most obvious being clashes with external libraries you didn't write yourself which differ in the way they pack data.
Things simply explode when they clash. It is best to be cautious in such cases and to really be intimate with the issue at hand. If you're not sure, leave it to the defaults. If you are not sure but have to use something like SSE, where alignment is king (and neither the default nor simple by a long shot), consult various resources online or ask another question here.
I will make an analogy to help you understand.
Assume there is a long loaf of bread and you have a cutting machine that can cut it into slices of equal thickness. Then you are giving out these breads to, let's say children. Every child takes their bread and fairly do what they want to do with them (put Nutella on them and eat, etc.). They can even make thinner slices out of it and use it like that.
If one child comes up to you and says that he does not want that slice everyone is getting, but a thinner slice instead, then you will have difficulties, because your cutting machine is optimized to cut at least a minimum amount, which makes everyone happy. But when one child asks for a thinner slice, then you have to reinvent the machine or put additional complexity to it like introducing two cutting modes. You don't want that. Eventually you give up and just give him a big slice anyway.
This is the same reason why it happens. Hope you could relate to the analogy.
Data alignment is why the char appears to take up 4 bytes.
char does not take up four bytes: it takes up a single byte as usual. You can check it by printing sizeof(char). The other three bytes are padding that the compiler inserts to optimize access to other members of your class. Depending on hardware, it is often much faster to access multi-byte types, say, 4-byte integers, when they are located at an address divisible by four. A compiler may insert up to three bytes of padding before an int member to align it with a good memory address for faster access.
If you would like to experiment with class layouts, you can use a handy macro called offsetof (declared in <cstddef>). It takes two parameters, the name of the class and the name of the member, and it returns the number of bytes from the base address of your struct to the position of the member in memory.
cout << offsetof(Storer, x) << endl;
cout << offsetof(Storer, y) << endl;
cout << offsetof(Storer, z) << endl;
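A fuller sketch along the same lines, assuming Storer looks like the class in the question (the constructor is left out for brevity). The constant padding_after_c is an invented name; its value depends on the platform, so no specific number is claimed here.

```cpp
#include <cstddef>  // offsetof, std::size_t

// Same members, in the same order, as the Storer class in the question.
struct Storer {
    int x;
    int y;
    int z;
    char c;
    long l;
    double d;
};

// Number of padding bytes the compiler placed between c and l.
// c itself occupies exactly sizeof(char) == 1 byte; the rest of the gap
// exists only so that l starts at an address suitable for a long.
constexpr std::size_t padding_after_c =
    offsetof(Storer, l) - (offsetof(Storer, c) + sizeof(char));
```

With the addresses printed in the question (c at ...524, l at ...528) this gap would be 3 bytes, matching the difference between 32 and 29.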
Structure members are aligned in particular ways. In general, if you want the most compact representation, list the members in decreasing order of size.
http://en.wikipedia.org/wiki/Data_structure_alignment#Typical_alignment_of_C_structs_on_x86
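The effect of member order can be seen with two made-up structs that hold the same members. These are hypothetical types, not the Storer class from the question, and the exact sizes depend on the ABI.

```cpp
// Hypothetical struct: a double sandwiched between two chars, so padding
// is typically needed both before d and after b.
struct Interleaved {
    char a;
    double d;
    char b;
};

// Same members listed in decreasing order of size: the two chars sit
// next to each other and share one run of trailing padding.
struct LargestFirst {
    double d;
    char a;
    char b;
};
```

On an ABI where double is aligned to 8 bytes, sizeof(Interleaved) is typically 24 and sizeof(LargestFirst) is typically 16, though the standard does not promise either number.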

What's the rationale for null terminated strings?

As much as I love C and C++, I can't help but scratch my head at the choice of null terminated strings:
Length prefixed (i.e. Pascal) strings existed before C
Length prefixed strings make several algorithms faster by allowing constant time length lookup.
Length prefixed strings make it more difficult to cause buffer overrun errors.
Even on a 32 bit machine, if you allow the string to be the size of available memory, a length prefixed string is only three bytes wider than a null terminated string. On 16 bit machines this is a single byte. On 64 bit machines, 4GB is a reasonable string length limit, but even if you want to expand it to the size of the machine word, 64 bit machines usually have ample memory making the extra seven bytes sort of a null argument. I know the original C standard was written for insanely poor machines (in terms of memory), but the efficiency argument doesn't sell me here.
Pretty much every other language (e.g. Perl, Pascal, Python, Java, C#, etc) uses length prefixed strings. These languages usually beat C in string manipulation benchmarks because they are more efficient with strings.
C++ rectified this a bit with the std::basic_string template, but plain character arrays expecting null terminated strings are still pervasive. This is also imperfect because it requires heap allocation.
Null terminated strings have to reserve a character (namely, null), which cannot exist in the string, while length prefixed strings can contain embedded nulls.
Several of these things have come to light more recently than C, so it would make sense for C to not have known of them. However, several were plain well before C came to be. Why would null terminated strings have been chosen instead of the obviously superior length prefixing?
EDIT: Since some asked for facts (and didn't like the ones I already provided) on my efficiency point above, they stem from a few things:
Concat using null terminated strings requires O(n + m) time complexity. Length prefixing often requires only O(m).
Length using null terminated strings requires O(n) time complexity. Length prefixing is O(1).
Length and concat are by far the most common string operations. There are several cases where null terminated strings can be more efficient, but these occur much less often.
From answers below, these are some cases where null terminated strings are more efficient:
When you need to cut off the start of a string and need to pass it to some method. You can't really do this in constant time with length prefixing even if you are allowed to destroy the original string, because the length prefix probably needs to follow alignment rules.
In some cases where you're just looping through the string character by character you might be able to save a CPU register. Note that this works only in the case that you haven't dynamically allocated the string (Because then you'd have to free it, necessitating using that CPU register you saved to hold the pointer you originally got from malloc and friends).
None of the above are nearly as common as length and concat.
There's one more asserted in the answers below:
You need to cut off the end of the string
but this one is incorrect -- it's the same amount of time for null terminated and length prefixed strings. (Null terminated strings just stick a null where you want the new end to be, length prefixers just subtract from the prefix.)
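To illustrate the length point, here is a rough sketch contrasting the two representations. PrefixedString is a hypothetical type invented for this comparison, not something from C or C++, and terminated_length does by hand what strlen does.

```cpp
#include <cassert>
#include <cstddef>

// Length of a null terminated string: walks until the terminator, O(n).
std::size_t terminated_length(const char* s) {
    assert(s != nullptr);
    std::size_t n = 0;
    while (s[n] != '\0') {
        ++n;
    }
    return n;
}

// Hypothetical length prefixed view: the count is stored alongside the
// data, so the length is a single read, O(1), and the data may contain
// embedded nulls.
struct PrefixedString {
    std::size_t length;
    const char* data;
};

std::size_t prefixed_length(const PrefixedString& s) {
    return s.length;
}
```

Given seven bytes with nulls in the middle, the prefixed view still reports 7, while the terminated scan stops at the first null.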
From the horse's mouth
None of BCPL, B, or C supports character data strongly in the language; each treats strings much like vectors of integers and supplements general rules by a few conventions. In both BCPL and B a string literal denotes the address of a static area initialized with the characters of the string, packed into cells. In BCPL, the first packed byte contains the number of characters in the string; in B, there is no count and strings are terminated by a special character, which B spelled *e. This change was made partially to avoid the limitation on the length of a string caused by holding the count in an 8- or 9-bit slot, and partly because maintaining the count seemed, in our experience, less convenient than using a terminator.
Dennis M Ritchie, Development of the C Language
C doesn't have a string as part of the language. A 'string' in C is just a pointer to char. So maybe you're asking the wrong question.
"What's the rationale for leaving out a string type" might be more relevant. To that I would point out that C is not an object oriented language and only has basic value types. A string is a higher level concept that has to be implemented by in some way combining values of other types. C is at a lower level of abstraction.
in light of the raging squall below:
I just want to point out that I'm not trying to say this is a stupid or bad question, or that the C way of representing strings is the best choice. I'm trying to clarify that the question would be more succinctly put if you take into account the fact that C has no mechanism for differentiating a string as a datatype from a byte array. Is this the best choice in light of the processing and memory power of todays computers? Probably not. But hindsight is always 20/20 and all that :)
The question is asked as a Length Prefixed Strings (LPS) vs zero terminated strings (SZ) thing, but mostly expose benefits of length prefixed strings. That may seem overwhelming, but to be honest we should also consider drawbacks of LPS and advantages of SZ.
As I understand it, the question may even be understood as a biased way to ask "what are the advantages of Zero Terminated Strings ?".
Advantages (I see) of Zero Terminated Strings:
very simple, no need to introduce new concepts in the language; char arrays/char pointers can do.
the core language just includes minimal syntactic sugar to convert something between double quotes to a bunch of chars (really a bunch of bytes). In some cases it can be used to initialize things completely unrelated to text. For instance, the xpm image file format is valid C source that contains image data encoded as a string.
by the way, you can put a zero in a string literal; the compiler will just add another one at the end of the literal: "this\0is\0valid\0C". Is it a string? Or four strings? Or a bunch of bytes...
flat implementation, no hidden indirection, no hidden integer.
no hidden memory allocation involved (well, some infamous non-standard functions like strdup perform allocation, but that's mostly a source of problems).
no specific issue for small or large hardware (imagine the burden of managing a 32-bit prefix length on 8-bit microcontrollers, or the restriction of limiting string size to less than 256 bytes, which was a problem I actually had with Turbo Pascal eons ago).
implementation of string manipulation is just a handful of very simple library functions.
efficient for the main use of strings: constant text read sequentially from a known start (mostly messages to the user).
the terminating zero is not even mandatory; all the necessary tools to manipulate chars like a bunch of bytes are available. When performing array initialisation in C, you can even avoid the NUL terminator. Just set the right size. char a[3] = "foo"; is valid C (not C++) and won't put a final zero in a.
coherent with the Unix point of view that "everything is a file", including "files" that have no intrinsic length like stdin and stdout. You should remember that the open, read and write primitives are implemented at a very low level. They are not library calls, but system calls. And the same API is used for binary or text files. File reading primitives get a buffer address and a size and return the new size. And you can use strings as the buffer to write. Using another kind of string representation would imply you can't easily use a literal string as the buffer to output, or you would have to give it a very strange behavior when casting it to char*: namely, to return not the address of the string but the actual data.
very easy to manipulate text data read from a file in place, without a useless copy of the buffer: just insert zeroes at the right places (well, not on string literals in modern C, as double-quoted strings are nowadays usually kept in a non-modifiable data segment).
prepending an int value of whatever size would imply alignment issues. The initial length should be aligned, but there is no reason to do that for the character data (and again, forcing alignment of strings would imply problems when treating them as a bunch of bytes).
length is known at compile time for constant literal strings (sizeof). So why would anyone want to store it in memory, prepended to the actual data?
in a way C is doing as (nearly) everyone else: strings are viewed as arrays of char. As array length is not managed by C, it is logical that length is not managed for strings either. The only surprising thing is the 0 item added at the end, but that happens only at the core language level when typing a string between double quotes. Users can perfectly well call string manipulation functions passing a length, or even use plain memcpy instead. SZ are just a facility. In most other languages array length is managed, so it's logical that the same holds for strings.
in modern times anyway, 1-byte character sets are not enough and you often have to deal with encoded Unicode strings where the number of characters is very different from the number of bytes. It implies that users will probably want more than "just the size", but also other information. Keeping the length gives us nothing (in particular no natural place to store it) regarding these other useful pieces of information.
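Two of the points in the list above (a literal's length being a compile-time constant, and cutting up a buffer in place by writing zeroes into it) can be sketched in a few lines of C. The names LITERAL_LEN and split_in_place are made up for this illustration, and the buffer passed in is assumed to be a writable array, not a string literal.

```c
#include <assert.h>
#include <stddef.h>

/* Length of a string literal, computed by the compiler (final NUL excluded). */
#define LITERAL_LEN(s) (sizeof(s) - 1)

/*
 * Split a writable, NUL-terminated buffer into fields by overwriting each
 * separator with '\0'. A pointer to the start of each field is stored in
 * out[], and the number of fields stored (at most max_fields) is returned.
 * Nothing is allocated and no characters are copied.
 */
size_t split_in_place(char *buf, char sep, char **out, size_t max_fields)
{
    size_t count = 0;

    assert(buf != NULL && out != NULL);
    if (max_fields == 0)
        return 0;

    out[count++] = buf;
    for (char *p = buf; *p != '\0'; ++p) {
        if (*p == sep) {
            if (count == max_fields)
                break;            /* leave the rest in the last field */
            *p = '\0';            /* terminate the previous field */
            out[count++] = p + 1; /* the next field starts right after */
        }
    }
    return count;
}
```

Each pointer in out[] is an ordinary C string afterwards, so it can be handed to any function expecting a char *.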
That said, no need to complain in the rare case where standard C strings are indeed inefficient. Libs are available. If I followed that trend, I should complain that standard C does not include any regex support functions... but really everybody knows it's not a real problem as there are libraries available for that purpose. So when string manipulation efficiency is wanted, why not use a library like bstring? Or even C++ strings?
EDIT: I recently had a look at D strings. It is interesting to see that the solution chosen is neither a size prefix nor zero termination. As in C, literal strings enclosed in double quotes are just shorthand for immutable char arrays, and the language also has a string keyword meaning exactly that (immutable char array).
But D arrays are much richer than C arrays. In the case of static arrays the length is known at compile time, so there is no need to store it. In the case of dynamic arrays, the length is available but the D documentation does not state where it is kept. For all we know, the compiler could choose to keep it in some register, or in some variable stored far away from the character data.
On normal char arrays or non-literal strings there is no final zero, hence the programmer has to add it himself if he wants to call some C function from D. In the particular case of literal strings, however, the D compiler still puts a zero at the end of each string (to allow an easy cast to C strings, making it easier to call C functions?), but this zero is not part of the string (D does not count it in the string size).
The only thing that disappointed me somewhat is that strings are supposed to be UTF-8, but length apparently still returns a number of bytes (at least that's true on my compiler, gdc) even when using multi-byte chars. It is unclear to me whether it's a compiler bug or on purpose. (OK, I have probably found out what happened. To tell the D compiler your source uses UTF-8 you have to put some stupid byte order mark at the beginning. I write stupid because I know of no editor doing that, especially for UTF-8, which is supposed to be ASCII compatible.)
I think, it has historical reasons and found this in wikipedia:
At the time C (and the languages that it was derived from) were developed, memory was extremely limited, so using only one byte of overhead to store the length of a string was attractive. The only popular alternative at that time, usually called a "Pascal string" (though also used by early versions of BASIC), used a leading byte to store the length of the string. This allows the string to contain NUL and made finding the length need only one memory access (O(1) (constant) time). But one byte limits the length to 255. This length limitation was far more restrictive than the problems with the C string, so the C string in general won out.
Calavera is right, but as people don't seem to get his point, I'll provide some code examples.
First, let's consider what C is: a simple language, where all code has a pretty direct translation into machine language. All types fit into registers and on the stack, and it doesn't require an operating system or a big run-time library to run, since it was meant for writing these things (a task to which it is superbly well-suited, considering there isn't even a likely competitor to this day).
If C had a string type, like int or char, it would be a type which didn't fit in a register or on the stack, and would require memory allocation (with all its supporting infrastructure) to be handled in any way. All of which goes against the basic tenets of C.
So, a string in C is:
char *s;
So, let's assume then that this were length-prefixed. Let's write the code to concatenate two strings:
char* concat(char* s1, char* s2)
{
/* What? What is the type of the length of the string? */
int l1 = *(int*) s1;
/* How much? How much must I skip? */
char *s1s = s1 + sizeof(int);
int l2 = *(int*) s2;
char *s2s = s2 + sizeof(int);
int l3 = l1 + l2;
char *s3 = (char*) malloc(l3 + sizeof(int));
char *s3s = s3 + sizeof(int);
memcpy(s3s, s1s, l1);
memcpy(s3s + l1, s2s, l2);
*(int*) s3 = l3;
return s3;
}
Another alternative would be using a struct to define a string:
struct {
int len; /* cannot be left implementation-defined */
char* buf;
};
At this point, all string manipulation would require two allocations to be made, which, in practice, means you'd go through a library to do any handling of it.
The funny thing is... structs like that do exist in C! They are just not used for your day-to-day displaying messages to the user handling.
So, here is the point Calavera is making: there is no string type in C. To do anything with it, you'd have to take a pointer and decode it as a pointer to two different types, and then it becomes very relevant what is the size of a string, and cannot just be left as "implementation defined".
Now, C can handle memory in any way, and the mem functions in the library (in <string.h>, even!) provide all the tooling you need to handle memory as a pair of pointer and size. The so-called "strings" in C were created for just one purpose: showing messages in the context of writing an operating system intended for text terminals. And, for that, null termination is enough.
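To make the "pair of pointer and size" idea concrete, here is a minimal sketch using only memcpy and malloc. The struct span type and the span_concat function are hypothetical names invented for this example; nothing like them exists in the standard library.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical "pointer plus size" view of some bytes; not a standard type. */
struct span {
    const char *ptr;
    size_t len;
};

/*
 * Concatenate two spans into a freshly malloc'ed buffer. Embedded zero bytes
 * are copied like any other byte. A NUL is appended as a convenience so the
 * result can also be passed to functions expecting a C string; *out_len
 * receives the length without that terminator.
 * Returns NULL if the allocation fails. The caller frees the result.
 */
char *span_concat(struct span a, struct span b, size_t *out_len)
{
    char *buf;

    assert(a.ptr != NULL && b.ptr != NULL);
    buf = malloc(a.len + b.len + 1);
    if (buf == NULL)
        return NULL;

    memcpy(buf, a.ptr, a.len);
    memcpy(buf + a.len, b.ptr, b.len);
    buf[a.len + b.len] = '\0';
    if (out_len != NULL)
        *out_len = a.len + b.len;
    return buf;
}
```

Unlike the length-prefixed concat shown earlier, no cast from char * to int * is needed, because the length travels next to the pointer instead of in front of the data.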
Obviously for performance and safety, you'll want to keep the length of a string while you're working with it rather than repeatedly performing strlen or the equivalent on it. However, storing the length in a fixed location just before the string contents is an incredibly bad design. As Jörgen pointed out in the comments on Sanjit's answer, it precludes treating the tail of a string as a string, which for example makes a lot of common operations like path_to_filename or filename_to_extension impossible without allocating new memory (and incurring the possibility of failure and error handling). And then of course there's the issue that nobody can agree how many bytes the string length field should occupy (plenty of bad "Pascal string" languages used 16-bit fields or even 24-bit fields which preclude processing of long strings).
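As a rough illustration of the "tail of a string is a string" point, the two operations named above can be written so that they return a pointer into the original string. The function bodies are my own sketch (only '/' is treated as a path separator), not code from any particular library.

```c
#include <assert.h>
#include <string.h>

/*
 * Return the part of path after the last '/', or path itself if there is
 * no '/'. The result points into path: nothing is allocated or copied.
 */
const char *path_to_filename(const char *path)
{
    const char *slash;

    assert(path != NULL);
    slash = strrchr(path, '/');
    return slash != NULL ? slash + 1 : path;
}

/*
 * Return the extension including its dot, or a pointer to the terminating
 * NUL (an empty string) if the name contains no dot.
 */
const char *filename_to_extension(const char *filename)
{
    const char *dot;

    assert(filename != NULL);
    dot = strrchr(filename, '.');
    return dot != NULL ? dot : filename + strlen(filename);
}
```

Neither function can fail, because neither needs memory. With a length stored immediately before the characters, the returned tail would need its own length field and therefore its own storage.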
C's design of letting the programmer choose if/where/how to store the length is much more flexible and powerful. But of course the programmer has to be smart. C punishes stupidity with programs that crash, grind to a halt, or give your enemies root.
Laziness, register frugality and portability, considering the assembly guts of any language, especially C, which is one step above assembly (thus inheriting a lot of assembly legacy code).
You would agree that a null char would otherwise be useless in those ASCII days, so it was probably as good a terminator as an EOF control char.
let's see in pseudo code
function readString(string) // 1 parameter: 1 register or 1 stack entry
pointer=addressOf(string)
while(string[pointer]!=CONTROL_CHAR) do
read(string[pointer])
increment pointer
total 1 register use
case 2
function readString(length,string) // 2 parameters: 2 register used or 2 stack entries
pointer=addressOf(string)
while(length>0) do
read(string[pointer])
increment pointer
decrement length
total 2 register used
That might seem shortsighted now, but consider the frugality in code and registers (which were at a PREMIUM at that time, the time when, you know, they used punch cards). Being faster too (when processor speed could be counted in kHz), this "hack" was pretty darn good and easily portable to register-less processors.
For argument sake I will implement 2 common string operation
stringLength(string)
pointer=addressOf(string)
while(string[pointer]!=CONTROL_CHAR) do
increment pointer
return pointer-addressOf(string)
Complexity is O(n), whereas for a PASCAL string it is in most cases O(1), because the length of the string is prepended to the string structure (which also means the counting has to be carried out at an earlier stage).
concatString(string1,string2)
length1=stringLength(string1)
length2=stringLength(string2)
string3=allocate(length1+length2+1) // +1 for the terminating CONTROL_CHAR
pointer1=addressOf(string1)
pointer3=addressOf(string3)
while(string1[pointer1]!=CONTROL_CHAR) do
string3[pointer3]=string1[pointer1]
increment pointer3
increment pointer1
pointer2=addressOf(string2)
while(string2[pointer2]!=CONTROL_CHAR) do
string3[pointer3]=string2[pointer2]
increment pointer3
increment pointer2
string3[pointer3]=CONTROL_CHAR
return string3
Complexity is O(n), and prepending the string length wouldn't change the complexity of the operation, though I admit it would take roughly a third of the time.
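For readers who prefer real code to pseudocode, the two operations might look like this in C. This is only a sketch: the names string_length and concat_string are mine, CONTROL_CHAR is taken to be '\0', and room for the terminator is allocated and filled in explicitly.

```c
#include <assert.h>
#include <stdlib.h>

/* O(n): walk the string until the terminator is found. */
size_t string_length(const char *s)
{
    const char *p = s;

    assert(s != NULL);
    while (*p != '\0')
        ++p;
    return (size_t)(p - s);
}

/*
 * Concatenate two NUL-terminated strings into newly allocated memory.
 * Each input is scanned twice: once to measure it, once to copy it.
 * Returns NULL if the allocation fails. The caller frees the result.
 */
char *concat_string(const char *s1, const char *s2)
{
    size_t n1 = string_length(s1);
    size_t n2 = string_length(s2);
    char *s3 = malloc(n1 + n2 + 1);   /* +1 for the terminator */
    char *dst = s3;

    if (s3 == NULL)
        return NULL;
    while (*s1 != '\0')
        *dst++ = *s1++;
    while (*s2 != '\0')
        *dst++ = *s2++;
    *dst = '\0';
    return s3;
}
```

The measuring pass is the part a stored length would save; the copy itself is needed in both representations.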
On the other hand, if you use PASCAL strings you would have to redesign your API to take register length and byte endianness into account. PASCAL strings had the well-known limitation of 255 chars (0xFF) because the length was stored in 1 byte (8 bits), and if you wanted a longer string (16 bits -> anything) you would have to take the architecture into account in one layer of your code, which in most cases would mean incompatible string APIs.
Example:
One file was written with your prepended-string API on an 8-bit computer and then has to be read on, say, a 32-bit computer. What would the lazy program do? It would assume that the first 4 bytes are the length of the string, allocate that much memory, and then attempt to read that many bytes.
Another case would be a 32-bit length written on a PPC (big endian) and read on an x86 (little endian); of course, if you don't know that one was written by the other there would be trouble.
A length of 1 (0x00000001) would become 16777216 (0x01000000), that is 16 MB, for reading a 1-byte string.
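The arithmetic is easy to check by decoding the same four bytes both ways. The helper names read_be32 and read_le32 below are illustrative, not part of any standard API.

```c
#include <assert.h>
#include <stdint.h>

/* Decode 4 bytes stored most significant byte first (big endian). */
uint32_t read_be32(const unsigned char *b)
{
    assert(b != 0);
    return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
           ((uint32_t)b[2] << 8)  |  (uint32_t)b[3];
}

/* Decode 4 bytes stored least significant byte first (little endian). */
uint32_t read_le32(const unsigned char *b)
{
    assert(b != 0);
    return  (uint32_t)b[0]        | ((uint32_t)b[1] << 8) |
           ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
}
```

For the byte sequence 00 00 00 01, read_be32 yields 1 while read_le32 yields 16777216. Decoding byte by byte like this, with the byte order fixed by the file format rather than by the CPU, is one common way to avoid the problem.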
Of course you would say that people should agree on one standard but even 16bit unicode got little and big endianness.
Of course C would have its issues too but, would be very little affected by the issues raised here.
In many ways, C was primitive. And I loved it.
It was a step above assembly language, giving you nearly the same performance with a language that was much easier to write and maintain.
The null terminator is simple and requires no special support by the language.
Looking back, it doesn't seem that convenient. But I used assembly language back in the 80s and it seemed very convenient at the time. I just think software is continually evolving, and the platforms and tools continually get more and more sophisticated.
Assuming for a moment that C implemented strings the Pascal way, by prefixing them by length: is a 7 char long string the same DATA TYPE as a 3-char string? If the answer is yes, then what kind of code should the compiler generate when I assign the former to the latter? Should the string be truncated, or automatically resized? If resized, should that operation be protected by a lock as to make it thread safe? The C approach side stepped all these issues, like it or not :)
Somehow I understood the question to imply there's no compiler support for length-prefixed strings in C. The following example shows, at least you can start your own C string library, where string lengths are counted at compile time, with a construct like this:
#include <stdio.h>
#define PREFIX_STR(s) ((prefix_str_t){ sizeof(s)-1, (s) })
typedef struct { int n; char * p; } prefix_str_t;
int main(void) {
prefix_str_t string1, string2;
string1 = PREFIX_STR("Hello!");
string2 = PREFIX_STR("Allows \0 chars (even if printf directly doesn't)");
printf("%d %s\n", string1.n, string1.p); /* prints: "6 Hello!" */
printf("%d %s\n", string2.n, string2.p); /* prints: "48 Allows " */
return 0;
}
This is not free of issues, however: you need to be careful about when to free that string pointer and when it is statically allocated (a literal char array).
Edit: As a more direct answer to the question, my view is this was the way C could support both having string length available (as a compile time constant), should you need it, but still with no memory overhead if you want to use only pointers and zero termination.
Of course it seems like working with zero-terminated strings was the recommended practice, since the standard library in general doesn't take string lengths as arguments, and since extracting the length isn't as straightforward code as char * s = "abc", as my example shows.
"Even on a 32 bit machine, if you allow the string to be the size of available memory, a length prefixed string is only three bytes wider than a null terminated string."
First, extra 3 bytes may be considerable overhead for short strings. In particular, a zero-length string now takes 4 times as much memory. Some of us are using 64-bit machines, so we either need 8 bytes to store a zero-length string, or the string format can't cope with the longest strings the platform supports.
There may also be alignment issues to deal with. Suppose I have a block of memory containing 7 strings, like "solo\0second\0\0four\0five\0\0seventh". The second string starts at offset 5. The hardware may require that 32-bit integers be aligned at an address that is a multiple of 4, so you have to add padding, increasing the overhead even further. The C representation is very memory-efficient in comparison. (Memory-efficiency is good; it helps cache performance, for example.)
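The overhead can be estimated with a small calculation. The sketch below assumes a 4-byte length prefix that must start at an offset that is a multiple of 4 and that the block itself is suitably aligned; both function names are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Bytes needed to pack the strings back to back, each followed by a NUL. */
size_t packed_size_nul(const char *const *strs, size_t count)
{
    size_t total = 0;

    assert(strs != NULL);
    for (size_t i = 0; i < count; ++i)
        total += strlen(strs[i]) + 1;
    return total;
}

/*
 * Bytes needed when each string is stored as a 4-byte length followed by
 * its characters, with padding inserted so every length field starts at a
 * multiple of 4. Padding after the last string is not counted.
 */
size_t packed_size_prefixed(const char *const *strs, size_t count)
{
    size_t total = 0;

    assert(strs != NULL);
    for (size_t i = 0; i < count; ++i) {
        total = (total + 3) & ~(size_t)3;   /* round up to a multiple of 4 */
        total += 4 + strlen(strs[i]);
    }
    return total;
}
```

For the seven strings in the example ("solo", "second", "", "four", "five", "", "seventh") the NUL-terminated block takes 32 bytes, while the prefixed layout takes 55 under these assumptions.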
One point not yet mentioned: when C was designed, there were many machines where a 'char' was not eight bits (even today there are DSP platforms where it isn't). If one decides that strings are to be length-prefixed, how many 'char's worth of length prefix should one use? Using two would impose an artificial limit on string length for machines with 8-bit char and 32-bit addressing space, while wasting space on machines with 16-bit char and 16-bit addressing space.
If one wanted to allow arbitrary-length strings to be stored efficiently, and if 'char' were always 8 bits, one could--for some expense in speed and code size--define a scheme where a string prefixed by an even number N would be N/2 bytes long, a string prefixed by an odd value N and an even value M (reading backward) could be ((N-1) + M*char_max)/2, etc. and require that any buffer which claims to offer a certain amount of space to hold a string must allow enough bytes preceding that space to handle the maximum length. The fact that 'char' isn't always 8 bits, however, would complicate such a scheme, since the number of 'char' required to hold a string's length would vary depending upon the CPU architecture.
The null termination allows for fast pointer based operations.
Not a Rationale necessarily but a counterpoint to length-encoded
Certain forms of dynamic length encoding are superior to static length encoding as far as memory is concerned, it all depends on usage. Just look at UTF-8 for proof. It's essentially an extensible character array for encoding a single character. This uses a single bit for each extended byte. NUL termination uses 8 bits. Length-prefix I think can be reasonably termed infinite length as well by using 64 bits. How often you hit the case of your extra bits is the deciding factor. Only 1 extremely large string? Who cares if you're using 8 or 64 bits? Many small strings (Ie Strings of English words)? Then your prefix costs are a large percentage.
Length-prefixed strings allowing time savings is not a real thing. Whether your supplied data is required to have length provided, you're counting at compile time, or you're truly being provided dynamic data that you must encode as a string. These sizes are computed at some point in the algorithm. A separate variable to store the size of a null terminated string can be provided. Which makes the comparison on time-savings moot. One just has an extra NUL at the end... but if the length encode doesn't include that NUL then there's literally no difference between the two. There's no algorithmic change required at all. Just a pre-pass you have to manually design yourself instead of having a compiler/runtime do it for you. C is mostly about doing things manually.
Length-prefix being optional is a selling point. I don't always need that extra info for an algorithm so being required to do it for a every string makes my precompute+compute time never able to drop below O(n). (Ie hardware random number generator 1-128. I can pull from an "infinite string". Let's say it only generates characters so fast. So our string length changes all the time. But my usage of the data probably doesn't care how many random bytes I have. It just wants the next available unused byte as soon as it can get it after a request. I could be waiting on the device. But I could also have a buffer of characters pre-read. A length comparison is a needless waste of computation. A null check is more efficient.)
Length-prefix is a good guard against buffer overflow? So is sane usage of library functions and implementation. What if I pass in malformed data? My buffer is 2 bytes long but I tell the function it's 7! Ex: If gets() was intended to be used on known data it could've had an internal buffer check that tested compiled buffers and malloc() calls and still follow spec. If it was meant to be used as a pipe for unknown STDIN to arrive at unknown buffer then clearly one can't know about the buffer size which means a length arg is pointless, you need something else here like a canary check. For that matter, you can't length-prefix some streams and inputs, you just can't. Which means the length check has to be built into the algorithm and not a magic part of the typing system. TL;DR NUL-terminated never had to be unsafe, it just ended up that way via misuse.
counter-counter point: NUL-termination is annoying on binary. You either need to do length-prefix here or transform NUL bytes in some way: escape-codes, range remapping, etc... which of course means more-memory-usage/reduced-information/more-operations-per-byte. Length-prefix mostly wins the war here. The only upside to a transform is that no additional functions have to be written to cover the length-prefix strings. Which means on your more optimized sub-O(n) routines you can have them automatically act as their O(n) equivalents without adding more code. Downside is, of course, time/memory/compression waste when used on NUL heavy strings. Depending on how much of your library you end up duplicating to operate on binary data, it may make sense to work solely with length-prefix strings. That said one could also do the same with length-prefix strings... -1 length could mean NUL-terminated and you could use NUL-terminated strings inside length-terminated.
Concat: "O(n+m) vs O(m)" I'm assuming you're referring to m as the total length of the string after concatenating because they both have to have that number of operations minimum (you can't just tack-on to string 1, what if you have to realloc?). And I'm assuming n is a mythical amount of operations you no longer have to do because of a pre-compute. If so, then the answer is simple: pre-compute. If you're insisting you'll always have enough memory to not need to realloc and that's the basis of the big-O notation then the answer is even more simple: do binary search on allocated memory for end of string 1, clearly there's a large swath of infinite zeros after string 1 for us to not worry about realloc. There, easily got n to log(n) and I barely tried. Which if you recall log(n) is essentially only ever as large as 64 on a real computer, which is essentially like saying O(64+m), which is essentially O(m). (And yes that logic has been used in run-time analysis of real data structures in-use today. It's not bullshit off the top of my head.)
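The binary search mentioned in the Concat point only works under a strong assumption: every byte of the buffer after the string is zero, and the string itself contains no zero byte. Under that assumption, a sketch could look like this (the function name is invented for the example):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Find the length of a string stored at the start of a buffer of cap bytes,
 * ASSUMING all bytes from the end of the string up to cap are zero and the
 * string contains no zero byte. Under that assumption "is this byte zero?"
 * is monotonic over the buffer, so a binary search needs only O(log cap)
 * probes. Returns cap if no zero byte is found.
 */
size_t length_in_zero_filled(const char *buf, size_t cap)
{
    size_t lo = 0, hi = cap;

    assert(buf != NULL);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (buf[mid] != '\0')
            lo = mid + 1;   /* the string extends past mid */
        else
            hi = mid;       /* the end is at mid or earlier */
    }
    return lo;
}
```

If the buffer can hold leftover garbage after the terminator, which is the usual situation for C strings, this gives wrong answers and a linear scan is the only safe option.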
Concat()/Len() again: Memoize results. Easy. Turns all computes into pre-computes if possible/necessary. This is an algorithmic decision. It's not an enforced constraint of the language.
String suffix passing is easier/possible with NUL termination. Depending on how length-prefix is implemented it can be destructive on original string and can sometimes not even be possible. Requiring a copy and pass O(n) instead of O(1).
Argument-passing/de-referencing is less for NUL-terminated versus length-prefix. Obviously because you're passing less information. If you don't need length, then this saves a lot of footprint and allows optimizations.
You can cheat. It's really just a pointer. Who says you have to read it as a string? What if you want to read it as a single character or a float? What if you want to do the opposite and read a float as a string? If you're careful you can do this with NUL-termination. You can't do this with length-prefix, it's a data type distinctly different from a pointer typically. You'd most likely have to build a string byte-by-byte and get the length. Of course if you wanted something like an entire float (probably has a NUL inside it) you'd have to read byte-by-byte anyway, but the details are left to you to decide.
TL;DR Are you using binary data? If no, then NUL-termination allows more algorithmic freedom. If yes, then code quantity vs speed/memory/compression is your main concern. A blend of the two approaches or memoization might be best.
Many design decisions surrounding C stem from the fact that when it was originally implemented, parameter passing was somewhat expensive. Given a choice between e.g.
void add_element_to_next(arr, offset)
char arr[];
int offset;
{
arr[offset] += arr[offset+1];
}
char array[40];
void test()
{
int i;
for (i=0; i<39; i++)
add_element_to_next(array, i);
}
versus
void add_element_to_next(p)
char *p;
{
p[0]+=p[1];
}
char array[40];
void test()
{
int i;
for (i=0; i<39; i++)
add_element_to_next(array+i);
}
the latter would have been slightly cheaper (and thus preferred) since it only required passing one parameter rather than two. If the method being called didn't need to know the base address of the array nor the index within it, passing a single pointer combining the two would be cheaper than passing the values separately.
While there are many reasonable ways in which C could have encoded string lengths, the approaches that had been invented up to that time would have all required functions that should be able to work with part of a string to accept the base address of the string and the desired index as two separate parameters. Using zero-byte termination made it possible to avoid that requirement. Although other approaches would be better with today's machines (modern compilers often pass parameters in registers, and memcpy can be optimized in ways strcpy()-equivalents cannot) enough production code uses zero-byte terminated strings that it's hard to change to anything else.
PS--In exchange for a slight speed penalty on some operations, and a tiny bit of extra overhead on longer strings, it would have been possible to have methods that work with strings accept pointers directly to strings, bounds-checked string buffers, or data structures identifying substrings of another string. A function like "strcat" would have looked something like [modern syntax]
void strcat(unsigned char *dest, unsigned char *src)
{
struct STRING_INFO d,s;
str_size_t copy_length;
get_string_info(&d, dest);
get_string_info(&s, src);
if (d.si_buff_size > d.si_length) // Destination is resizable buffer
{
copy_length = d.si_buff_size - d.si_length;
if (s.si_length < copy_length)
copy_length = s.si_length;
memcpy(d.buff + d.si_length, s.buff, copy_length);
d.si_length += copy_length;
update_string_length(&d);
}
}
A little bigger than the K&R strcat method, but it would support bounds-checking, which the K&R method doesn't. Further, unlike the current method, it would be possible to easily concatenate an arbitrary substring, e.g.
/* Concatenate 10th through 24th characters from src to dest */
void catpart(unsigned char *dest, unsigned char *src)
{
struct SUBSTRING_INFO inf;
src = temp_substring(&inf, src, 10, 24);
strcat(dest, src);
}
Note that the lifetime of the string returned by temp_substring would be limited by those of inf and src, whichever was shorter (which is why the method requires inf to be passed in--if it was local to temp_substring, it would die when that method returned).
In terms of memory cost, strings and buffers up to 64 bytes would have one byte of overhead (same as zero-terminated strings); longer strings would have slightly more (whether one allowed amounts of overhead between two bytes and the maximum required would be a time/space tradeoff). A special value of the length/mode byte would be used to indicate that a string function was given a structure containing a flag byte, a pointer, and a buffer length (which could then index arbitrarily into any other string).
Of course, K&R didn't implement any such thing, but that's most likely because they didn't want to spend much effort on string handling--an area where even today many languages seem rather anemic.
According to Joel Spolsky in this blog post,
It's because the PDP-7 microprocessor, on which UNIX and the C programming language were invented, had an ASCIZ string type. ASCIZ meant "ASCII with a Z (zero) at the end."
After seeing all the other answers here, I'm convinced that even if this is true, it's only part of the reason for C having null-terminated "strings". That post is quite illuminating as to how simple things like strings can actually be quite hard.
I don't buy the "C has no string" answer. True, C does not support built-in higher-level types but you can still represent data-structures in C and that's what a string is. The fact a string is just a pointer in C does not mean the first N bytes cannot take on special meaning as the length.
Windows/COM developers will be very familiar with the BSTR type which is exactly like this - a length-prefixed C string where the actual character data starts not at byte 0.
So it seems that the decision to use null-termination is simply what people preferred, not a necessity of the language.
One advantage of NUL-termination over length-prefixing, which I have not seen anyone mention, is the simplicity of string comparison. Consider the comparison standard which returns a signed result for less-than, equal, or greater-than. For length-prefixing the algorithm has to be something along the following lines:
1. Compare the two lengths; record the smaller, and note if they are equal (this last step might be deferred to step 3).
2. Scan the two character sequences, subtracting characters at matching indices (or use a dual pointer scan). Stop either when the difference is nonzero, returning the difference, or when the number of characters scanned is equal to the smaller length.
3. When the smaller length is reached, one string is a prefix of the other. Return negative or positive value according to which is shorter, or zero if of equal length.
Contrast this with the NUL-termination algorithm:
1. Scan the two character sequences, subtracting characters at matching indices [note that this is handled better with moving pointers]. Stop when the difference is nonzero, returning the difference. NOTE: If one string is a PROPER prefix of the other, one of the characters in the subtraction will be NUL, i.e. zero, and the comparison will naturally stop there.
2. If the difference is zero, -only then- check if either character is NUL. If so, return zero, otherwise continue to next character.
The NUL-terminated case is simpler, and very easy to implement efficiently with a dual pointer scan. The length-prefixed case does at least as much work, nearly always more. If your algorithm has to do a lot of string comparisons [e.g a compiler!], the NUL-terminated case wins out. Nowadays that might not be as important, but back in the day, heck yeah.
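The two algorithms can be put side by side in C. This is a sketch: the function names are mine, and the "length-prefixed" variant simply takes the lengths as separate arguments rather than reading them from a prefix.

```c
#include <assert.h>
#include <stddef.h>

/* NUL-terminated: a single loop, the terminator doubles as the end test. */
int compare_nul(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    for (;; ++a, ++b) {
        int diff = (unsigned char)*a - (unsigned char)*b;
        if (diff != 0)
            return diff;      /* also covers "one is a proper prefix" */
        if (*a == '\0')
            return 0;         /* both ended at the same place */
    }
}

/* Counted: find the shorter length, scan, then break ties on length. */
int compare_counted(const char *a, size_t alen, const char *b, size_t blen)
{
    size_t n = alen < blen ? alen : blen;

    assert(a != NULL && b != NULL);
    for (size_t i = 0; i < n; ++i) {
        int diff = (unsigned char)a[i] - (unsigned char)b[i];
        if (diff != 0)
            return diff;
    }
    if (alen == blen)
        return 0;
    return alen < blen ? -1 : 1;
}
```

The counted version carries an index bound and a final length comparison that the NUL-terminated loop gets for free. On the other hand, it can compare data that contains zero bytes.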
gcc accepts the code below:
char s[4] = "abcd";
and that's OK if we treat it as an array of chars rather than a string. That is, we can access it with s[0], s[1], s[2], and s[3], or even with memcpy(dest, s, 4). But we'll get garbage characters when we try puts(s), or worse, strcpy(dest, s).
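If such an unterminated array does need to be handed to string functions later, one option is to copy it with an explicit length and add the terminator yourself. The helper below is a hypothetical sketch, not a standard function.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy n chars from src, which need not be NUL-terminated, into dest and
 * append a terminator. dest_size is the total capacity of dest.
 * Returns 1 on success, 0 if dest cannot hold n chars plus the terminator.
 */
int to_c_string(char *dest, size_t dest_size, const char *src, size_t n)
{
    assert(dest != NULL && src != NULL);
    if (n >= dest_size)
        return 0;
    memcpy(dest, src, n);
    dest[n] = '\0';
    return 1;
}
```

For printing only, no copy is needed: printf("%.4s", s) stops after at most four characters.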
I think the better question is why you think C owes you anything. C was designed to give you what you need, nothing more. You need to lose the mentality that the language must provide you with everything. Or just continue to use your higher level languages that will give you the luxury of String, Calendar, Containers; and in the case of Java you get one thing in tonnes of variety. Multiple types of String, multiple types of unordered_map(s).
Too bad for you, this was not the purpose of C. C was not designed to be a bloated language that offers everything from a pin to an anchor. Instead you must rely on third party libraries or your own. And there is nothing easier than creating a simple struct that will contain a string and its size.
struct String
{
const char *s;
size_t len;
};
You know what the problem is with this though. It is not standard. Another language might decide to organize the len before the string. Another language might decide to use a pointer to end instead. Another might decide to use six pointers to make the String more efficient. However a null terminated string is the most standard format for a string; which you can use to interface with any language. Even Java JNI uses null terminated strings.
Lastly, as the common saying goes: the right data structure for the task. If you find that you need to know the size of a string more than anything else, well, use a string structure that allows you to do that optimally. But don't claim that that operation is used more than anything else by everybody. Why is knowing the size of a string more important than reading its contents? I find that reading the contents of a string is what I mostly do, so I use null terminated strings instead of std::string, which saves me 5 pointers on a GCC compiler. If I can even save 2 pointers, that is good.
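A struct like the one shown in this answer becomes useful with a couple of helpers. The two functions below are illustrative only (their names are not from any library): one measures a C string once and keeps the result, the other uses the stored lengths to reject unequal strings without touching their contents.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct String
{
    const char *s;
    size_t len;
};

/* Wrap a NUL-terminated string, measuring its length exactly once. */
struct String string_from_cstr(const char *cstr)
{
    struct String str;

    assert(cstr != NULL);
    str.s = cstr;
    str.len = strlen(cstr);
    return str;
}

/* Equality test: different lengths are rejected before any byte is read. */
int string_equal(struct String a, struct String b)
{
    return a.len == b.len && memcmp(a.s, b.s, a.len) == 0;
}
```

Since s still points at a NUL-terminated string here, the same value can be passed to code that expects a plain char * by using the s member.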

Magic Numbers In Arrays? - C++

I'm a fairly new programmer, and I apologize if this information is easily available out there, I just haven't been able to find it yet.
Here's my question:
Is it considered a magic number when you use a literal number to access a specific element of an array?
For example:
arrayOfNumbers[6] // Is six a magic number in this case?
I ask this question because one of my professors is adamant that all literal numbers in a program are magic numbers. It would be nice for me just to access an element of an array using a literal number, instead of using a named constant for each element.
Thanks!
That really depends on the context. If you have code like this:
arr[0] = "Long";
arr[1] = "sentence";
arr[2] = "as";
arr[3] = "array.";
...then 0..3 are not considered magic numbers. However, if you have:
int doStuff()
{
return my_global_array[6];
}
...then 6 is definitely a magic number.
It's pretty magic.
I mean, why are you accessing the 6th element? What are the semantics that should be applied to that number? As it stands, all we know is "the element at index 6 (zero-based)". If we knew the declaration of arrayOfNumbers we would also know its element type (e.g. int or double), but still not what the value means.
But if you said:
arrayOfNumbers[kDistanceToSaturn];
...now it has much more meaning to someone reading the code.
In general you iterate over an array, performing some operation on each element, because you don't know how long the array is and can't just access it in a hardcoded manner.
However, sometimes array elements have specific meanings, for example, in graphics programming. Sometimes an array is always the same size because the data demands it (e.g. certain transform matrices). In these cases it may or may not be okay to access the specific element by number: domain experts will know what you're doing, but generalists probably won't. Giving the magic index number a name makes it more obvious to those who have to maintain your code, and helps you to prevent typing the wrong one accidentally.
In my example above I assumed your array holds distances from the sun to a planet. The sun would be the zeroth element, thus arrayOfNumbers[kDistanceToSun] = 0. Then as you increment, each element contains the distance to the next farthest planet: Mercury, Venus, etc. This is much more readable than just typing the number of the planet you want. In this case the array is of a fixed size because there are a fixed number of planets (well, except the whole Pluto debacle).
The other problem is that "arrayOfNumbers" tells us nothing about the contents of the array. We already know it's an array of numbers because we saw the declaration somewhere where you said int arrayOfNumbers[12345]; or however you declared it. Instead, something like:
int distanceToPlanetsFromSol[kNumberOfPlanets];
...gives us a much better idea of what the data actually is and what its semantics are. One of your goals as a programmer should be to write code that is self-documenting in this manner.
And then we can argue elsewhere if kNumberOfPlanets should be 8 or 9. :)
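Put together, the idea might look something like the sketch below. It assumes the Sun occupies entry 0 as described above, which happens to put Saturn at index 6. The sentinel kNumberOfEntries, the DistanceTable alias, and both functions are names invented for this sketch, and the distances are rounded figures in millions of km, for illustration only.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Index names for the table. The Sun is entry 0, so Saturn lands on index 6.
enum PlanetIndex : std::size_t {
    kDistanceToSun = 0,
    kDistanceToMercury,
    kDistanceToVenus,
    kDistanceToEarth,
    kDistanceToMars,
    kDistanceToJupiter,
    kDistanceToSaturn,
    kDistanceToUranus,
    kDistanceToNeptune,
    kNumberOfEntries  // hypothetical sentinel: one past the last valid index
};

static_assert(kDistanceToSaturn == 6, "Saturn is the entry read as [6]");

using DistanceTable = std::array<int, kNumberOfEntries>;

// Rounded mean distances in millions of km (illustrative values).
DistanceTable makeDistanceTable() {
    return DistanceTable{{0, 58, 108, 150, 228, 778, 1430, 2870, 4500}};
}

// Named lookup: the call site says which distance is wanted.
int distanceTo(const DistanceTable& table, PlanetIndex which) {
    assert(which < kNumberOfEntries);
    return table[which];
}
```

A call such as distanceTo(table, kDistanceToSaturn) reads the same slot as table[6], but the reader no longer has to count planets to find out why.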
You should ask yourself why you are accessing that particular position. In this case, I assume that if you are doing arrayOfNumbers[6], the sixth position has some special meaning. If you think about what that meaning is, you will probably realize that the literal is a magic number hiding it.
Another way to look at it:
What if, after some change, the program needs to access the 7th element instead of the 6th? How would you or a maintainer know that? If, for example, the 6th entry is the count of trees in CA, it would be a good thing to put
#define CA_STATE_ENTRY 6
Then if the table is later reordered, somebody can see that they need to change this to 9 (say). BTW, I am not saying this is the best way to maintain an array of tree counts by state; it probably isn't.
Likewise, if later people want to change the program to deal with trees in Oregon, then they know to replace
trees[CA_STATE_ENTRY]
with
trees[OR_STATE_ENTRY]
The point is
trees[6]
is not self-documenting
Of course, for C++ it should be an enum, not a #define.
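One possible shape for the enum version, as a sketch only: the State values, the TreeCounts alias, and treesIn are all hypothetical names, and the three states listed are just placeholders for however the real table is laid out.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical table rows. Count is a sentinel, not a real state; it is
// always equal to the number of rows that precede it.
enum class State : std::size_t { California, Oregon, Washington, Count };

using TreeCounts = std::array<long, static_cast<std::size_t>(State::Count)>;

// Named access: the numeric position lives only in the enum definition,
// so reordering the table means editing the enum and nothing else.
long& treesIn(TreeCounts& trees, State state) {
    assert(state != State::Count);
    return trees[static_cast<std::size_t>(state)];
}
```

Compared with a #define, the scoped enum keeps the names out of the global macro namespace and makes it a compile error to pass an arbitrary integer where a State is expected.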
You'd have to provide more context for a meaningful answer. Not all literal numbers are magic, but many are. In a case like that there is no way at all to tell for sure, though most cases I can think of off-hand with an explicit array index >>1 probably qualify as magic.
Not all literals in a program really qualify as "magic numbers" -- but this one certainly seems to. The 6 gives us no clue of why you're accessing that particular element of the array.
For a literal not to be a magic number, the reason that value is being used needs to be clear on first (or at least minimal) examination. Just for example, a lot of code will do things like: &x[0]. In this case, it's typically pretty clear that the '0' really just means "the beginning of the array."
If you need to access a particular element of the array, chances are you're doing it wrong.
You should almost always be iterating over the entire array.
It's only not a magic number if your program is doing something very special involving the number six specifically. Could you provide some context?
That's the problem with professors, they're often too academic. In theory he's right, as usual, but usually magic numbers are used in a stricter context, when the number is embedded in a data stream, allowing you to detect certain properties of the stream (like the signature header of a file type for instance).
See also this Wikipedia entry.
Usually not all constant values in software are called magic numbers.
A Java class file always starts with the hex value 0xCAFEBABE, and a Windows .exe file starts with "MZ" (0x4D, 0x5A). This lets you quickly (but not with certainty) identify the content of a binary file.
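A minimal sketch of that kind of signature check, using only the two signatures mentioned above. The function names and the returned labels are invented for illustration, and a match only says what the file probably is.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// True if bytes begins with the n-byte signature sig.
bool startsWith(const std::vector<std::uint8_t>& bytes,
                const std::uint8_t* sig, std::size_t n) {
    assert(sig != nullptr);
    return bytes.size() >= n && std::equal(sig, sig + n, bytes.begin());
}

// Guesses a format from the leading bytes. The labels are made up here.
std::string guessFormat(const std::vector<std::uint8_t>& bytes) {
    static const std::uint8_t kJavaClass[] = {0xCA, 0xFE, 0xBA, 0xBE};
    static const std::uint8_t kDosExe[] = {0x4D, 0x5A};  // ASCII "MZ"

    if (startsWith(bytes, kJavaClass, sizeof kJavaClass)) return "java-class";
    if (startsWith(bytes, kDosExe, sizeof kDosExe)) return "dos-exe";
    return "unknown";
}
```

In practice you would read the first few bytes of the file into the vector and pass it in; a buffer shorter than the signature simply falls through to "unknown".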
In a MISRA compliant system, all values except 0 and 1 are considered magic numbers. My opinion has always been if the constant value is obvious or likely won't change then leave it as a number. If in doubt create a unique constant since long term maintenance will be easier.