Efficiency of insert in string - c++

I'm trying to write this custom addition class for very large integers, bigger than long long. One approach which I'm investigating is keeping the integer as a string and then converting the characters to their int components and then adding each "column". Another approach I'm considering is to split up the string into multiple strings each of which is the size of a long long and then casting it using a string stream into a long long adding and then recombining.
Regardless I came across the fact that addition is done most easily in reverse to allow to the carrying over of digits. This being the case I was wondering the efficiency of the insert method for the string. It seems since a string is an array of chars that all the chars would have to be shifted over one. So it would vary but it would seem the efficiency is O(n) where n is the number of chars in the string.
Is this correct, or is this only with a naive interpretation?
Edit: I now have answer to my question but I was wondering on a related topic which is more efficient, inserting a string into a stream then extracting into an int. Or doing 10^n*char1+10^n-1*char2...etc?

As far as I know, you are correct. The C++ implementation of String will perform an insert in O(n) time. It treats the string as a character array.
For your numeric implementation, why not store the numbers as arrays of integers and convert to string only for output?

It's probably correct with all real implementations of std::string. Under the circumstances, you might want to either store the digits in reverse (though that could be clumsy in other ways) or else use something like an std::deque<char>. Better still, use an std::deque<unsigned long long>, which will reduce the number of operations involved.
Of course, for real use you usually want to use an existing library rather than rolling your own.

Related

Is there a way to restrict string manipulation e.g substring?

The problem is that I'm processing some UTF8 strings and I would like to design a class or a way to prevent string manipulations.
String manipulation is not desirable for strings of multibyte characters as splitting the string at a random position (which is measured in bytes) may split a character half way.
I have thought about using const std::string& but the user/developer can create a substring by calling std::substr.
Another way would be create a wrapper around const std::string& and expose only the string through getters.
Is this even possible?
Another way would be create a wrapper around const std::string& and expose only the string through getters.
You need a class wrapping a std::string or std::u8string, not a reference to one. The class then owns the string and its contents, basically just using it as a storage, and can provide an interface as you see fit to operate on unicode code points or characters instead of modifying the storage directly.
However, there is nothing in the standard library that will help you implement this. So a better approach would be to use a third party library that already does this for you. Operating on code points in a UTF-8 string is still reasonably simple and you can implement that part yourself, but if you want to operate on characters (in the sense of grapheme clusters or whatever else is suitable) implementation is going to be a project in itself.
I would use a wrapper where your external interface provides access to either code points, or to characters. So, foo.substr(3, 4) (for example) would skip the first 3 code points, and give you the next 4 code points. Alternatively, it would skip the first 3 characters, and give you the next 4 characters.
Either way, that would be independent of the number of bytes used to represent those code points or characters.
Quick aside on terminology for anybody unaccustomed to Unicode terminology: ISO 10646 is basically a long list of code points, each assigned a name and a number from 0 to (about) 220-1. UTF-8 encodes a code point number in a sequence of 1 to 4 bytes.
A character can consist of a (more or less) arbitrary number of code points. It will consist of a base character (e.g., a letter) followed by some number of combining diacritical marks. For example, à would normally be encoded as an a followed by a "combining grave accent" (U+0300).
The a and the U+0300 are each a code point. When encoded in UTF-8, the a would be encoded in a single byte and the U+0300 would be encoded in three bytes. So, it's one character composed of two code points encoded in 4 characters.
That's not quite all there is to characters (as opposed to code points) but it's sufficient for quite a few languages (especially, for the typical European languages like Spanish, German, French, and so on).
There are a fair number of other points that become non-trivial though. For example, German has a letter "ß". This is one character, but when you're doing string comparison, it should (at least normally) compare as equal to "ss". I believe there's been a move to change this but at least classically, it hasn't had an upper-case equivalent either, so both comparison and case conversion with it get just a little bit tricky.
And that's fairly mild compared to situations that arise in some of the more "exotic" languages. But it gives a general idea of the fact that yes, if you want to deal intelligently with Unicode strings, you basically have two choices: either have your code use ICU1 to do most of the real work, or else resign yourself to this being a multi-year project in itself.
1. In theory, you could use another suitable library--but in this case, I'm not aware of such a thing existing.

What are the problems of a zero-terminated string that length-prefixed strings overcome?

What are the problems of a zero-terminated string that length-prefixed strings overcome?
I was reading the book Write Great Code vol. 1 and I had that question in mind.
One problem is that with zero-terminated strings you have to keep finding the end of the string repeatedly. The classic example where this is inefficient is concatenating into a buffer:
char buf[1024] = "first";
strcat(buf, "second");
strcat(buf, "third");
strcat(buf, "fourth");
On every call to strcat the program has to start from the beginning of the string and find the terminator to know where to start appending. This means the function spends more and more time finding the place to append as the string grows longer.
With a length-prefixed string the equivalent of the strcat function would know where the end is immediately, and would just update the length after appending to it.
There are pros and cons to each way of representing strings and whether they cause problems for you depend on what you are doing with strings, and which operations need to be efficient. The problem described above can be overcome by manually keeping track of the end of the string as it grows, so by changing the code you can avoid the performance cost.
One problem is that you can not store null characters (value zero) in a zero terminated string. This makes it impossible to store some character encodings as well as encrypted data.
Length-prefixed strings do not suffer that limitation.
First a clarification: C++ strings (i.e. std::string) aren't weren't required to end with zero until C++11. They always provided access to a zero-terminated C string though.
C-style strings end with a 0 character for historical reasons.
The problems you're referring to are mainly bound to security issues: zero ended strings need to have a zero terminator. If they lack it (for whatever reason), the string's length becomes unreliable and they can lead to buffer overrun problems (which a malicious attacker can exploit by writing arbitrary data in places where it shouldn't be.. DEP helps in mitigating these issues but it's off-topic here).
It is best summarized in The Most Expensive One-byte Mistake by Poul-Henning Kamp.
Performance Costs: It is cheaper to manipulate memory in chunks, which cannot be done if you're always having to look for the NULL character. In other words if you know before hand you have a 129 character string, it would likely be more efficient to manipulate it in sections of 64, 64, and 1 bytes, instead of character by character.
Security: Marco A. already hit this pretty hard. Over and under-running string buffers is still a major route for attacks by hackers.
Compiler Development Costs: Big costs are associated with optimizing compilers for null terminating strings that would have been easier with the address and length format.
Hardware Development Costs: Hardware development costs are also large for string specific instructions associated with null terminating strings.
A few more bonus features that can be implemented with length-prefixed strings:
It's possible to have multiple styles of length prefix, identifiable through one or more bits of the first byte identified by the string pointer/reference. In exchange for a little extra time determining string length, one could e.g. use a single-byte prefix for short strings and longer prefixes for longer strings. If one uses a lot of 1-3 byte strings that could save more than 50% on overall memory consumption for such strings compared with using a fixed four-byte prefix; such a format could also accommodate strings whose length exceeded the range of 32-bit integers.
One may store variable-length strings within bounds-checked buffers at a cost of only one or two bits in the length prefix. The number N combined with the other bits would indicate one of three things:
An N-byte string
(Optional) An N-byte buffer holding a zero-length string
An N-byte buffer which, if its last byte B is less than 248, holds a string of length N-B-1; if the 248 or more, the preceding B-247 bytes would store the difference between the buffer size and the string length. Note that if the length of the string is precisely N-1, the string will be followed by a NUL byte, and if it's less than that the byte following the string will be unused and could be set to NUL.
Using such an approach, one would need to initialize strong buffers before use (to indicate their length), but would then no longer need to pass the length of a string buffer to a routine that was going to store data there.
One may use certain prefix values to indicate various special things. For example, one may have a prefix that indicates that it is not followed by a string, but rather by a string-data pointer and two integers giving buffer size and current length. If methods that operate on strings call a method to get the data pointer, buffer size, and length, one may pass such a method a reference to a portion of a string cheaply provided that the string itself will outlive the method call.
One may extend the above feature with a bit to indicate that the string data is in a region that was generated by malloc and may be resized if needed; additionally, one could safely have methods that sometimes return a dynamically-generated string allocated on the heap, and sometimes return an immutable static string, and have the recipient perform a "free this string if it isn't static".
I don't know if any prefixed-string implementations implement all those bonus features, but they can all be accommodated for very little cost in storage space, relatively little cost in code, and less cost in time than would be required to use NUL-terminated strings whose length was neither known nor short.
What are the problems of a zero-terminated string that length-prefixed strings overcome?
None whatsoever.
It's just eye candy.
Length-prefixed strings have, as part of their structure, information on how long the string is. If you want to do the same with zero-terminated strings you can use a helper variable;
lpstring = "foobar"; // saves '6' somewhere "inside" lpstring
ztstring = "foobar";
ztlength = 6; // saves '6' in a helper variable
Lots of C library functions work with zero-terminated strings and cannot use anything past the '\0' byte. That's an issue with the functions themselves, not the string structure. If you need functions which deal with zero-terminated strings with embedded zeroes, write your own.

Is it better to use std::string or single char when possible?

Is it better to use std::string or single char when possible?
In my class I want to store certain characters. I have CsvReader
class, and I want to store columnDelimiter character. I wonder,
is it better to have it as char, or just use std::string?
In terms of usage I suppose std::string is far better, but I wonder
maybe there will be major performance differences?
If your delimiter is constrained to be a single character, use a char.
If your delimiter may be a string, use a std::string.
Seems fairly self-explanatory. Refer to the requirements of the project, and the constraints of the feature that follow from those requirements.
Personally it seems to me that a CSV field delimiter will always be a single character, in which case std::string is not only misleading, but pointlessly heavy.
In terms of usage I suppose std::string is far better
I have largely ignored this claim as you did not provide any rationale, but let me just say that I reject the hypothetical premise of the claim.
I wonder maybe there will be major performance differences?
Absolutely! A string consists of a dynamically-allocated block of characters; this is entirely more heavy than a single byte in memory. Notwithstanding the small-string-optimisation that your implementation may perform, it's simply pointless to add all this weight when all you wish to represent is a single character. A single character is a char, so use a char in such a case.
A character is a character. A string is a string; conceptually, a set of N characters, where N is any natural number.
If your design requires a character, use char. If it requires a string, use string.
In both cases you may have multilanguage issues (what happens if the characteer is 青? what happens if the string is 青い?), but these are totally independent of your choice of whether you need a character or a set of N characters, i.e. a string.

Properties of a string class in C++

I am working through C++ program design by Cohoon and Davidson. This is what it says about string class attributes (3rd Edition, Page 123):
Characters that comprise the string
The number of characters in the string
My question is: If we know the characters in the string, does not it implies we already know about number of characters in the string? What is the need to explicitly specify the second attribute?
You are right but length is required in many places like counting, or knowing the length/end of malloc memory so it is better to store length as additional property to make your program run fast.
Consider what will happen if the program needs to count the chars all the way just to tell you how many are there in it. Moreover when this feature is accessed frequently.
So it simply saves time storing length too.
So all actual implementations of string classes do store length of the string.
If we know the characters in the string, does not it implies we already know about number of characters in the string?
Well in C we know the number of elements because we can count up to the NULL terminal. But think how expensive it is to get the length of a string? It takes walking the entire string. For such a common operation, why wouldn't we want this to be a constant-time operation?

How to store a long hex value message in C++

I'm learning a crypto class and one of the assignment asked us to xor a bunch of hex ciphertext and try to find the encrypted message.
I know that you can do '0x' in front of int or long to hold a hex value in a variable, but what if my message is this long:
271946f9bbb2aeadec111841a81abc300ecaa01bd8069d5cc91005e9fe4aad6e04d513e96d99de2569bc5e50eeeca709b50a8a987f4264edb6896fb537d0a716132ddc938fb0f836480e06ed0fcd6e9759f40462f9cf57f4564186a2c1778f1543efa270bda5e933421cbe88a4a52222190f471e9bd15f652b653b7071aec59a2705081ffe72651d08f822c9ed6d76e48b63ab15d0208573a7eef027
I would get an overflow. Is there a way to put the whole message into one variable? I could split the message into subparts, but I prefer it to be in variable instead of many (if that is possible). I tried to use string to hold the massage, but how can I use the operator, '^', for xor?
Or is there a more simple technique that I do not know of?
Thanks
For something like this, you'd typically use a string or a vector<char> to hold the data. You can't use the entire string/vector as an operand to ^, but you can apply it one byte at a time.
If you want to simplify the rest of the code, you could create a class that overloaded operator^ to do a byte-wise XOR, so your code would look something like result = key ^ message;.
You could use an array of, well, any size integer, and apply your operators to it an element at a time (which will probably be a bit more efficient than an array of characters). #JerryCoffin's idea of wrapping it inside a class w/ an overloaded operator is a good one, regardless of the actual representation you use.
Put it in a separate text file
read the file into a buffer
convert ascii chars to hex values
Jerry & Scott have sound suggestions. Another option is to use an existing library: for example, the GNU GMP arbitrary-precision maths library at http://gmplib.org, which supports XOR (see http://gmplib.org/manual/Integer-Logic-and-Bit-Fiddling.html#Integer-Logic-and-Bit-Fiddling) and a "scanf" style function to read in hex (see http://gmplib.org/manual/Formatted-Input-Strings.html#Formatted-Input-Strings), and explicitly aims to provide excellent support for cryptography.