I am working through C++ Program Design by Cohoon and Davidson. This is what it says about string class attributes (3rd Edition, Page 123):
Characters that comprise the string
The number of characters in the string
My question is: if we know the characters in the string, doesn't that imply we already know the number of characters in the string? What is the need to explicitly specify the second attribute?
You are right, but the length is needed in many places, such as counting characters or knowing where an allocated buffer ends, so it is better to store it as an additional attribute to make the program run fast.
Consider what happens if the program has to count the characters from the start every time just to tell you how many there are, especially when that operation is performed frequently.
Storing the length simply saves time.
That is why practically all real implementations of string classes do store the length of the string.
If we know the characters in the string, doesn't that imply we already know the number of characters in the string?
Well, in C we know the number of elements because we can count up to the null terminator. But think about how expensive it is to get the length of a string: it takes walking the entire string. For such a common operation, why wouldn't we want it to be a constant-time operation?
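To make the cost concrete: with a plain C string the length must be recomputed on demand, while std::string returns its stored size directly. A minimal sketch of the difference:

#include <cstring>
#include <string>

int main() {
    const char *cs = "hello";
    std::size_t n1 = std::strlen(cs);  // walks all 5 characters to find '\0': O(n)
    std::string s = "hello";
    std::size_t n2 = s.size();         // reads the stored count: O(1)
    return (n1 == n2) ? 0 : 1;
}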
The problem is that I'm processing some UTF-8 strings and I would like to design a class, or some other way, to prevent unsafe string manipulation.
String manipulation is undesirable for strings of multibyte characters, as splitting the string at a random position (which is measured in bytes) may split a character halfway.
I have thought about using const std::string&, but the user/developer can still create a substring by calling std::string::substr.
Another way would be to create a wrapper around const std::string& and expose the string only through getters.
Is this even possible?
Another way would be to create a wrapper around const std::string& and expose the string only through getters.
You need a class wrapping a std::string or std::u8string, not a reference to one. The class then owns the string and its contents, basically just using it as storage, and can provide whatever interface you see fit for operating on Unicode code points or characters, instead of letting callers modify the storage directly.
However, there is nothing in the standard library that will help you implement this. So a better approach would be to use a third party library that already does this for you. Operating on code points in a UTF-8 string is still reasonably simple and you can implement that part yourself, but if you want to operate on characters (in the sense of grapheme clusters or whatever else is suitable) implementation is going to be a project in itself.
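As a rough illustration of the owning-wrapper idea, here is a minimal sketch handling code points only (the class name and interface are made up for this example, not taken from any library):

#include <cstddef>
#include <string>

// Owns the UTF-8 bytes privately; callers cannot split the string at a
// byte position because the raw storage is never exposed for mutation.
class Utf8String {
    std::string bytes_;
public:
    explicit Utf8String(std::string s) : bytes_(std::move(s)) {}

    const std::string &str() const { return bytes_; }  // read-only view

    // Count code points: UTF-8 continuation bytes look like 10xxxxxx,
    // so counting the non-continuation bytes counts the code points.
    std::size_t codePoints() const {
        std::size_t n = 0;
        for (unsigned char c : bytes_)
            if ((c & 0xC0) != 0x80) ++n;
        return n;
    }
};

A code-point-aware substr would walk the bytes the same way to find safe split positions; grapheme clusters, as noted above, are much more work.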
I would use a wrapper where your external interface provides access to either code points, or to characters. So, foo.substr(3, 4) (for example) would skip the first 3 code points, and give you the next 4 code points. Alternatively, it would skip the first 3 characters, and give you the next 4 characters.
Either way, that would be independent of the number of bytes used to represent those code points or characters.
A quick aside for anybody unaccustomed to Unicode terminology: ISO 10646 is basically a long list of code points, each assigned a name and a number from 0 to (about) 2^20 - 1. UTF-8 encodes a code point number as a sequence of 1 to 4 bytes.
A character can consist of a (more or less) arbitrary number of code points. It will consist of a base character (e.g., a letter) followed by some number of combining diacritical marks. For example, à would normally be encoded as an a followed by a "combining grave accent" (U+0300).
The a and the U+0300 are each a code point. When encoded in UTF-8, the a is encoded in a single byte and the U+0300 in two bytes. So, it's one character composed of two code points encoded in three bytes.
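A quick way to check those byte counts, assuming a C++20 compiler (u8 literals guarantee UTF-8 encoding):

const char8_t s[] = u8"a\u0300";  // 'a' plus COMBINING GRAVE ACCENT
static_assert(sizeof s == 4);     // 1 byte + 2 bytes + the trailing '\0'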
That's not quite all there is to characters (as opposed to code points), but it's sufficient for quite a few languages (especially the typical European languages like Spanish, German, French, and so on).
There are a fair number of other points that become non-trivial though. For example, German has a letter "ß". This is one character, but when you're doing string comparison, it should (at least normally) compare as equal to "ss". I believe there's been a move to change this but at least classically, it hasn't had an upper-case equivalent either, so both comparison and case conversion with it get just a little bit tricky.
And that's fairly mild compared to situations that arise in some of the more "exotic" languages. But it gives a general idea of the fact that yes, if you want to deal intelligently with Unicode strings, you basically have two choices: either have your code use ICU [1] to do most of the real work, or else resign yourself to this being a multi-year project in itself.
[1] In theory, you could use another suitable library, but in this case I'm not aware of such a thing existing.
I know that strlen() does not include the NUL terminator in its count. I know for a fact that this is so. Thus, this question is NOT asking why strlen() might "presumably" not return the right string length, which has already been asked and answered well many times here on StackOverflow, e.g. in this thread, or this one.
So let's get to my question:
ISO/IEC 9899:1990 (E), 7.1.1 states:
A string is a contiguous sequence of characters terminated by and including the first null character.
What is the reason that strlen() deviates from this standard definition and does not "want" to count a string's NUL terminator?
Why?
Because you would expect this assertion to hold true:

char str1[] = "foo";
char str2[] = "bar";
char str3[7] = "";
strcat(strcat(str3, str1), str2);  /* str3 = concatenation of str1 and str2 */
assert(strlen(str1) + strlen(str2) == strlen(str3));  /* 3 + 3 == 6 */
If the terminating '\0' were counted by strlen, the above assertion would not hold (it would compare 4 + 4 against 7), which would be much more of a headache overall than the current C string behavior. More importantly, it would in my opinion be quite unintuitive and illogical.
Taking your doubt as a reasonable point, we can state that a C string consists of two parts:
the string's useful content ("the text");
the null terminating character;
The null terminating character is purely a technical measure that lets the C library functions determine the end of the string. Still, if one writes the declaration:
char * str = "some string";
they would logically expect its length to be 11, which is the number of characters visible in the statement. Hence strlen() yields only the length of part 1 of the string.
Not really an answer to your question, but consider this example:
char string[] = "string";
printf("sizeof: %zu\n", sizeof(string));
printf("strlen: %zu\n", strlen(string));
This prints
sizeof: 7
strlen: 6
So sizeof counts the \0, but strlen doesn't.
Questions like this, that ask why a certain age-old decision was made one way and not another way, are hard to answer. I can say that it's perfectly obvious to me, anyway, that strlen should count just the real, "interesting" characters that are in the string, and ignore the \0 at the end that merely terminates it. I'm used to accounting for the \0 separately. I imagine it would have been considerably more of a nuisance overall if strlen had been defined the other way. But I can't prove this with convincing arguments, and I've been using strlen with its current definition for so long that I'm probably hopelessly biased; I might be saying "it's perfectly obvious to me that..." even if strlen's definition were quite wrong.
There is a difference between the physical, stored representation of a C style string and the logical representation of a C style string.
The physical representation, i.e. how the string is actually stored in memory or other media, includes the null character. The null character is included when discussing the physical representation because it takes up an additional piece of storage: in order to be a C style string, the null character must be stored.
However the logical representation of a string does not include the null character. The logical representation of a string includes only the text characters that the programmer is wanting to manipulate.
I suspect that the null character, a value of binary zero, was chosen because the original ASCII character set defined the character value zero as NUL. Sitting among the low-valued teletype control codes, it seems the least likely ASCII character to appear in ordinary text. See ASCII Character Codes.
Another nice quality of using binary zero as the string terminator is that zero is the value representing logical false. Iterating over a string thus becomes a matter of incrementing an array index or a pointer while the current character is logically true, since every character other than the end-of-string indicator has a non-zero, logically true value.
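That property is what makes the classic scanning idiom so terse. A sketch of a length count written that way (my_strlen is just an illustrative name):

#include <stddef.h>

size_t my_strlen(const char *s) {
    const char *p = s;
    while (*p)       /* '\0' is the only character value that tests false */
        ++p;
    return (size_t)(p - s);
}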
Because the C programming language sits so close to the hardware, the programmer needs to be concerned about both representations: the physical representation when allocating memory to store a string, which includes the null character, and the logical representation, which is the string without the null character.
The various C style string manipulation functions in the Standard Library (strlen(), strcpy(), etc.) are all designed around the logical representation of a C style string. They perform their actions by using the null character as not being part of the text but rather as a special indicator character which indicates the end of the string. However as a part of their operations they need to be aware of the null character and its use as a special symbol. For instance when strcpy() or strcat() are used to copy strings, they must also copy the null character that indicates the end of the string even though it is not part of the actual text of the logical representation.
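For example, a textbook-style copy loop copies the terminator as part of its normal operation (my_strcpy is an illustrative sketch, not the library implementation):

char *my_strcpy(char *dst, const char *src) {
    char *d = dst;
    while ((*d++ = *src++) != '\0')  /* copies the '\0' itself, then stops */
        ;
    return dst;
}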
This choice allows text strings to be stored as arrays of characters, as befits the hardware orientation and efficiency characteristics of C. There is no need to create an additional built in type for text strings and it fits well with the lean character of the C programming language.
C++ is able to provide std::string because it is object oriented and has the additional language facilities that allow objects to be created and managed. The C programming language, with its simple syntax and lack of object-oriented facilities, does not have this convenience.
The problem with this approach is that the programmer needs to be aware of both the physical representation and the logical representation of text strings and be able to accommodate the needs of both when writing programs.
Either way, we have to define an array to store the string:
char a[10];
So suppose I want to store "smcck" in this array. What is the advantage of using gets(a)? My teacher said that the extra space in the array is wasted when we use cin.getline(a, 20), but doesn't that apply to gets(a) too?
Also, an extra question: what exactly is stored in the empty "boxes" of the array?
gets() is a C function. It does no bounds checking and is considered dangerous; it was kept all these years for compatibility and nothing else (and it has since been removed from the C standard entirely).
You can check the following link to clear your doubt :
http://www.gidnetwork.com/b-56.html
Don't mix C features with C++. Most C features work in C++, but mixing them is not recommended. If you are working in C++, you should avoid gets(); use getline() instead.
Well, I don't think gets(a) is better, because it does not check the size of the string. If you try to read a long string with it, it may cause a buffer overflow: it will use all 10 positions you allocated and then start writing over space allocated to other variables (which is going to make your application crash).
cin.getline() receives an int parameter that tells it not to read more than that number of characters. Note that if you allocate an array of only 10 elements and tell it to read up to 20 characters, it can cause the same problem I described for gets().
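A minimal example of matching the count to the buffer:

#include <iostream>

int main() {
    char a[10];
    std::cin.getline(a, sizeof a);  // reads at most 9 characters, then stores '\0'
    std::cout << a << '\n';
}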
Regarding how strings are represented in memory: if you store "smcck" in an array
char v[10];
the word will take the first 5 positions (0 to 4), and position 5 will hold a null character (written '\0') that marks the end of the string. What comes next in the array usually does not matter; those elements keep whatever values they had before. The null terminator marks where the string ends so that you can work with the string safely.
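In code, that layout looks like this (the values of the trailing elements are indeterminate here, since the array was never initialized):

#include <string.h>

int main(void) {
    char v[10];           /* contents indeterminate at this point */
    strcpy(v, "smcck");   /* writes 's','m','c','c','k','\0' into v[0..5] */
    /* v[6..9] still hold whatever values happened to be there before */
    return 0;
}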
What are the problems of a zero-terminated string that length-prefixed strings overcome?
I was reading the book Write Great Code vol. 1 and I had that question in mind.
One problem is that with zero-terminated strings you have to keep finding the end of the string repeatedly. The classic example where this is inefficient is concatenating into a buffer:
char buf[1024] = "first";
strcat(buf, "second");
strcat(buf, "third");
strcat(buf, "fourth");
On every call to strcat the program has to start from the beginning of the string and find the terminator to know where to start appending. This means the function spends more and more time finding the place to append as the string grows longer.
With a length-prefixed string the equivalent of the strcat function would know where the end is immediately, and would just update the length after appending to it.
There are pros and cons to each way of representing strings, and whether they cause problems for you depends on what you are doing with strings and which operations need to be efficient. The problem described above can be overcome by manually keeping track of the end of the string as it grows, so by changing the code you can avoid the performance cost.
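For example, tracking the end with a pointer turns the repeated scans into a single one (a sketch, not a library routine):

#include <string.h>

int main(void) {
    char buf[1024] = "first";
    char *end = buf + strlen(buf);     /* find the terminator once */
    const char *parts[] = { "second", "third", "fourth" };
    for (int i = 0; i < 3; ++i) {
        size_t n = strlen(parts[i]);
        memcpy(end, parts[i], n + 1);  /* copy including the '\0' */
        end += n;                      /* remember the new end */
    }
    return 0;
}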
One problem is that you cannot store null characters (value zero) in a zero-terminated string. This makes it impossible to store some character encodings as well as encrypted data.
Length-prefixed strings do not suffer that limitation.
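std::string, which stores its length, illustrates the difference:

#include <cstring>
#include <string>

int main() {
    std::string s("ab\0cd", 5);  // length is stored, so the embedded zero is kept
    // s.size() is 5, but std::strlen(s.c_str()) is 2: the C view stops at '\0'
    return (s.size() == 5 && std::strlen(s.c_str()) == 2) ? 0 : 1;
}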
First a clarification: C++ strings (i.e. std::string) weren't required to end with zero until C++11. They have always provided access to a zero-terminated C string, though.
C-style strings end with a 0 character for historical reasons.
The problems you're referring to are mainly security issues: zero-terminated strings need to have their zero terminator. If it is missing (for whatever reason), the string's length becomes unreliable, which can lead to buffer overrun problems (which a malicious attacker can exploit by writing arbitrary data in places where it shouldn't be; DEP helps mitigate these issues, but it's off-topic here).
It is best summarized in The Most Expensive One-byte Mistake by Poul-Henning Kamp.
Performance Costs: It is cheaper to manipulate memory in chunks, which cannot be done if you always have to look for the NUL character. In other words, if you know beforehand that you have a 129-character string, it would likely be more efficient to manipulate it in sections of 64, 64, and 1 bytes, instead of character by character.
Security: Marco A. already hit this pretty hard. Over and under-running string buffers is still a major route for attacks by hackers.
Compiler Development Costs: Big costs are associated with making compilers optimize code that uses null-terminated strings; the work would have been easier with an address-and-length format.
Hardware Development Costs: Hardware development costs are also significant for string-specific instructions built around null-terminated strings.
A few more bonus features that can be implemented with length-prefixed strings:
It's possible to have multiple styles of length prefix, identifiable through one or more bits of the first byte identified by the string pointer/reference. In exchange for a little extra time determining string length, one could e.g. use a single-byte prefix for short strings and longer prefixes for longer strings. If one uses a lot of 1-3 byte strings that could save more than 50% on overall memory consumption for such strings compared with using a fixed four-byte prefix; such a format could also accommodate strings whose length exceeded the range of 32-bit integers.
One may store variable-length strings within bounds-checked buffers at a cost of only one or two bits in the length prefix. The number N combined with the other bits would indicate one of three things:
An N-byte string
(Optional) An N-byte buffer holding a zero-length string
An N-byte buffer which, if its last byte B is less than 248, holds a string of length N-B-1; if B is 248 or more, the preceding B-247 bytes store the difference between the buffer size and the string length. Note that if the length of the string is precisely N-1, the string will be followed by a NUL byte, and if it's shorter, the byte following the string will be unused and could be set to NUL.
Using such an approach, one would need to initialize string buffers before use (to indicate their length), but would then no longer need to pass the length of a string buffer to a routine that was going to store data there.
One may use certain prefix values to indicate various special things. For example, one may have a prefix that indicates that it is not followed by a string, but rather by a string-data pointer and two integers giving buffer size and current length. If methods that operate on strings call a method to get the data pointer, buffer size, and length, one may pass such a method a reference to a portion of a string cheaply provided that the string itself will outlive the method call.
One may extend the above feature with a bit to indicate that the string data is in a region that was generated by malloc and may be resized if needed; additionally, one could safely have methods that sometimes return a dynamically-generated string allocated on the heap, and sometimes return an immutable static string, and have the recipient perform a "free this string if it isn't static".
I don't know if any prefixed-string implementations implement all those bonus features, but they can all be accommodated for very little cost in storage space, relatively little cost in code, and less cost in time than would be required to use NUL-terminated strings whose length was neither known nor short.
What are the problems of a zero-terminated string that length-prefixed strings overcome?
None whatsoever.
It's just eye candy.
Length-prefixed strings have, as part of their structure, information on how long the string is. If you want to do the same with zero-terminated strings you can use a helper variable:
lpstring = "foobar"; // saves '6' somewhere "inside" lpstring
ztstring = "foobar";
ztlength = 6; // saves '6' in a helper variable
Lots of C library functions work with zero-terminated strings and cannot use anything past the '\0' byte. That's an issue with the functions themselves, not the string structure. If you need functions which deal with zero-terminated strings with embedded zeroes, write your own.
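Such functions would simply carry an explicit length and use the mem* family; a sketch:

#include <string.h>

int main(void) {
    char data[] = { 'f', 'o', 'o', '\0', 'b', 'a', 'r' };
    size_t len = sizeof data;        /* 7, tracked alongside the bytes */
    char copy[sizeof data];
    memcpy(copy, data, len);         /* copies straight past the embedded '\0' */
    return memcmp(copy, data, len);  /* 0: the whole buffer matched */
}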
In C++ Primer it says that a null character is added at the end of every string literal. Why does the compiler do so?
Wikipedia:
"At the time C (and the languages that it was derived from) was developed, memory was extremely limited, so using only one byte of overhead to store the length of a string was attractive. The only popular alternative at that time, usually called a "Pascal string" (though also used by early versions of BASIC), used a leading byte to store the length of the string. This allowed the string to contain NULL and made finding the length need only one memory access (O(1) (constant) time).
However, C designer Dennis Ritchie chose to follow the convention of NULL-termination, already established in BCPL 'to avoid the limitation on the length of a string caused by holding the count in an 8- or 9-bit slot, and partly because maintaining the count seemed, in our experience, less convenient than using a terminator'..."
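A Pascal-style string as described in the quote might be laid out like this (illustrative only):

unsigned char pstr[] = { 6, 'f', 'o', 'o', 'b', 'a', 'r' };  /* length byte first, no terminator */
/* the length is pstr[0]: a single memory access, but capped at 255 */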
It is the best way to find the end of a string in a chunk of memory!
And all the string library functions assume strings are null-terminated ;)
Because C strings are null-terminated.