Are raw strings faster than normal strings? - c++

I wanted to ask whether raw strings are faster than normal strings at compile time.
Let me explain what I mean by "raw" and "normal" strings...
We know there is an R prefix for string literals.
const char * raw = R"(Hello\nWorld!)"; will output
Hello\nWorld!
const char * normal = "Hello\nWorld!"; will output
Hello
World!
So what's actually faster? I think using the R literal for strings like "Hi, how are you?" is faster than the 'normal' way we use strings.

So what's actually faster? I think using the R literal for strings like "Hi, how are you?" is faster than the 'normal' way we use strings.
OK, since you're asking about the compile-time impact of "normal" versus raw string literals: raw string literals may well be handled slightly faster, since the compiler doesn't need to parse and translate escape sequences.
That said, I believe the difference won't be significant.
The major advantage of raw string literals is that you don't need to worry about escaping special characters when writing the source code.

Raw strings might be slightly faster or slower to parse in a particular compiler, but the difference will almost certainly be too small to notice.
The purpose of raw strings isn't to improve compilation speed. It's to let you write string literals that contain lots of special characters (like backslashes and quotes) in a more-readable way, without having to insert lots of additional backslashes for escaping.
Use normal string literals unless your string needs lots of escaping that makes it look awkward in the source code. Use raw string literals just for those cases.


Is it better to use std::string or single char when possible?
In my class I want to store certain characters. I have a CsvReader class, and I want to store a columnDelimiter character. I wonder: is it better to have it as a char, or just use std::string? In terms of usage I suppose std::string is far better, but I wonder whether there will be major performance differences?
If your delimiter is constrained to be a single character, use a char.
If your delimiter may be a string, use a std::string.
Seems fairly self-explanatory. Refer to the requirements of the project, and the constraints of the feature that follow from those requirements.
Personally it seems to me that a CSV field delimiter will always be a single character, in which case std::string is not only misleading, but pointlessly heavy.
In terms of usage I suppose std::string is far better
I have largely ignored this claim as you did not provide any rationale, but let me just say that I reject the hypothetical premise of the claim.
I wonder maybe there will be major performance differences?
Absolutely! A std::string manages a dynamically allocated block of characters; this is far heavier than a single byte in memory. Even allowing for the small-string optimisation that your implementation may perform, it's simply pointless to add all this weight when all you wish to represent is a single character. A single character is a char, so use a char in such a case.
A character is a character. A string is a string: conceptually, a sequence of N characters, where N is any natural number.
If your design requires a character, use char. If it requires a string, use std::string.
In both cases you may have multilanguage issues (what happens if the character is 青? what happens if the string is 青い?), but these are totally independent of your choice between a character and a sequence of N characters, i.e. a string.
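To make the advice concrete, here is a minimal sketch of a reader that stores its delimiter as a plain char (the class shape and method name are illustrative, not taken from the question):

```cpp
#include <sstream>
#include <string>
#include <vector>

// A CSV field delimiter is a single character, so a char member is
// enough: one byte, no allocation, trivially copyable.
class CsvReader {
public:
    explicit CsvReader(char columnDelimiter) : delim_(columnDelimiter) {}

    // Split one line of input into fields at the delimiter.
    std::vector<std::string> splitLine(const std::string& line) const {
        std::vector<std::string> fields;
        std::string field;
        std::istringstream in(line);
        while (std::getline(in, field, delim_))
            fields.push_back(field);
        return fields;
    }

private:
    char delim_;
};
```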

C++: String with multiple languages

This is my first attempt at dealing with multiple languages in a program. I would really appreciate if someone could provide me with some study material and how to approach this type of issue.
The question is representing a string which has multiple languages. For example, think of a string that has "Hello" in many languages, all comma separated. What I want to do is to separate these words. So my questions are:
Can I use std::string for this or should I use std::wstring?
If I want to tokenize each of the words in the string and put them into a char*, should I use wchar_t? But some encodings, such as UTF-32, can be wider than what wchar_t can support.
Overall, what is the 'accepted' way of handling this type of case?
Thank you.
Can I use std::string for this or should I use std::wstring?
Both can be used. If you use std::string, the encoding should be UTF-8 so as to avoid null-bytes which you'd get if you were to use UTF-16, UCS-2 etc. If you use std::wstring, you can also use encodings that require larger numbers to represent the individual characters, i.e. UCS-2 and UCS-4 will typically be fine, but strictly speaking this is implementation-dependent. In C++11, there is also std::u16string (good for UTF-16 and UCS-2) and std::u32string (good for UCS-4).
So, which of these types to use depends on which encoding you prefer, not on the number or type of languages you want to represent.
As a rule of thumb, UTF-8 is great for storage of large texts, while UCS-4 is best if memory footprint does not matter so much, but you want character-level iterations and position-arithmetic to be convenient and fast. (Example: Skipping n characters in an UTF-8 string is an O(n) operation, while it is an O(1) operation in UCS-4.)
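To illustrate why skipping is O(n) in UTF-8: each code point occupies a variable number of bytes, so you must scan them all. A minimal sketch (the function name is illustrative, and it assumes well-formed UTF-8 input) that advances past n code points:

```cpp
#include <cstddef>
#include <string>

// Return the byte index after skipping n code points, starting at pos.
// UTF-8 continuation bytes have the bit pattern 10xxxxxx (0x80..0xBF),
// so a new code point starts at every byte that does NOT match it.
std::size_t skipUtf8(const std::string& s, std::size_t pos, std::size_t n) {
    while (n > 0 && pos < s.size()) {
        ++pos;  // step over the lead byte of the current code point
        while (pos < s.size() &&
               (static_cast<unsigned char>(s[pos]) & 0xC0) == 0x80)
            ++pos;  // step over its continuation bytes
        --n;
    }
    return pos;
}
```

In a UCS-4 string the same operation is just `pos + n`, which is why UCS-4 is preferable when you need fast position arithmetic.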
If I want to tokenize each of the words in the string and put them into a char*, should I use wchar_t? But some encodings, such as UTF-32, can be wider than what wchar_t can support.
I would use the same data type for the words as I would use for the text itself. I.e. words of a std::string text should also be std::string, and words from a std::wstring should be std::wstring.
(If there is really a good reason to switch from a string datatype to a character-pointer datatype, then of course char* is right for std::string and wchar_t* is right for std::wstring. Similarly for the C++11 types, there are char16_t* and char32_t*.)
Overall, what is the 'accepted' way of handling this type of case?
The first question you need to answer to yourself is which encoding you want to use for storage and processing. In highly international settings, only Unicode encodings are truly eligible, but there are still more than one to choose from: UTF-8, UCS-2 and UCS-4 are the most common ones. As described above, which one you choose has implications for memory footprint and processing speed, so think carefully about what types of operations you need to perform. It may be required to convert from one encoding to another at certain points in your program for optimal space and time behavior. Once you know which encoding you want to use in each part of the program, choose the data type accordingly.
Once encoding and data types have been decided, you might also need to look into Unicode normalization. In many languages, the same character (or character/diacritics combination) can be represented by more than one sequence of Unicode code points (esp. when combining characters are used). To deal with these cases properly, you may need to apply Unicode normalizations (such as NFKC) to the strings. Note that there is no built-in support for this in the C++ Standard Library.

Is there no possible loss of data when converting from wstring to string using string constructor?

When I do the following, my compiler warns me of a possible loss of data (but the compilation is successful):
std::vector<wchar_t> v1;
v1.push_back(L'a');
std::vector<char> v2(v1.begin(), v1.end());
When I do the following I get no such warning, and as far as I can tell I have not lost data when I've done it in the past:
std::wstring w1;
w1 = L"a";
std::string s1(w1.begin(), w1.end());
Is there in fact no possible loss of data in the second snippet? And if not, why not? Is there something in the basic_string constructor that handles iterators over the other character type? Or is it something special about the iterators themselves?
To give a concrete example, if you write
std::wstring w1 = L"τ"; // That's a Unicode Greek Small Letter Tau (U+03C4)
std::string s1(w1.begin(), w1.end());
Most likely you’ll end up with a string containing character 0xC4, which is an “Ä” in both Windows ANSI and ISO Latin-1. That probably isn’t what you wanted, and while it will work OK on most platforms if you stick to ASCII, even that isn’t guaranteed (e.g. if your code runs on an IBM mainframe, you might find that narrow strings are EBCDIC and wide strings could be in any number of unusual encodings).
If you want to convert wide strings to narrow strings, you need to use appropriate functions to cope with the fact that character encodings are involved. C++ doesn’t really provide a decent way to do this; typically you have to revert to C’s wcstombs() function, or use platform-specific APIs. (Someone might point you at the narrow ctype facet, but that just means that any character that can’t be represented by a single byte gets replaced with a specified character; that isn’t really converting. Also, C++11 has some support for converting between Unicode strings using wstring_convert, but that only copes with Unicode, and not everyone is using that for both narrow and wide characters.)
Yes, the second snippet will lose data (truncate the character values) in the same way the first snippet will. Your library implementation is probably doing something that suppresses the warning message. It's impossible to know without looking at the source for your particular library implementation.

How to efficiently replace german umlauts in C++?

Assume that I get a few hundred lines of text as a string (C++) from an API, and sprinkled through that data are German umlauts, such as ä or ö, which need to be replaced with ae and oe.
I'm familiar with encoding (well, I've read http://www.joelonsoftware.com/articles/Unicode.html) and solving the problem was trivial (basically, searching through the string, removing the char and adding 2 others instead).
However, I do not know enough about C++ to do this fast. I've just stumbled upon StringBuilder (http://www.codeproject.com/Articles/647856/4350-Performance-Improvement-with-the-StringBuilde), which improved speed a lot, but I was curious if there are any better or smarter ways to do this?
If you must improve efficiency at such a small scale, consider doing the replacement in two phases:
The first phase calculates the number of characters in the result after the replacement. Go through the string, and add 1 to the count for each normal character; for characters such as ä or ö, add 2.
At this point, you have enough information to allocate the string for the result. Make a string of the length that you counted in the first phase.
The second phase performs the actual replacement: go through the string again, copying the regular characters, and replacing umlauted ones with their corresponding pairs.
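A minimal sketch of the two-phase scheme, assuming a single-byte encoding such as Latin-1 where each umlaut occupies one char (the byte values in the switch are the Latin-1 codes; the function name is illustrative):

```cpp
#include <cstddef>
#include <string>

std::string replaceUmlauts(const std::string& in) {
    // Map an umlaut to its two-character replacement, or nullptr.
    auto expansion = [](char c) -> const char* {
        switch (static_cast<unsigned char>(c)) {
            case 0xE4: return "ae";  // ä
            case 0xF6: return "oe";  // ö
            case 0xFC: return "ue";  // ü
            default:   return nullptr;
        }
    };

    // Phase 1: count the output length so we allocate exactly once.
    std::size_t len = 0;
    for (char c : in)
        len += expansion(c) ? 2 : 1;

    // Phase 2: build the result into the pre-sized buffer.
    std::string out;
    out.reserve(len);
    for (char c : in) {
        if (const char* rep = expansion(c)) out += rep;
        else                                out += c;
    }
    return out;
}
```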
When the text is encoded in UTF-8, the German umlauts are all two-byte sequences, and so are their replacements such as ae or oe. So if you work on a char[] (or mutate the std::string in place), you wouldn't have to reallocate any memory and could just replace the bytes while iterating.
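A sketch of that in-place idea, assuming the input is valid UTF-8 (ä is the byte pair 0xC3 0xA4, and "ae" is likewise two bytes, so the string length never changes; the function name is illustrative):

```cpp
#include <cstddef>
#include <string>

void replaceUmlautsInPlace(std::string& s) {  // s is UTF-8 encoded
    for (std::size_t i = 0; i + 1 < s.size(); ++i) {
        // All three lowercase umlauts share the UTF-8 lead byte 0xC3.
        if (static_cast<unsigned char>(s[i]) != 0xC3) continue;
        switch (static_cast<unsigned char>(s[i + 1])) {
            case 0xA4: s[i] = 'a'; s[i + 1] = 'e'; ++i; break;  // ä -> ae
            case 0xB6: s[i] = 'o'; s[i + 1] = 'e'; ++i; break;  // ö -> oe
            case 0xBC: s[i] = 'u'; s[i + 1] = 'e'; ++i; break;  // ü -> ue
        }
    }
}
```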

Simplest way to mix sequences of types with iostreams?

I have a function template <typename T> void write(const T&) which is implemented in terms of writing the T object to an ostream, and a matching template <typename T> T read() that reads a T from an istream. I am basically using iostreams as a plain-text serialisation format, which obviously works fine for most built-in types, although I'm not sure how to handle std::strings effectively just yet.
I'd like to be able to write out a sequence of objects too, e.g. template <typename T> void write(const std::vector<T>&) or an iterator-based equivalent (although in practice it would always be used with a vector). However, while writing an overload that iterates over the elements and writes them out is easy enough, this doesn't add enough information to let the matching read operation know how the elements are delimited, which is essentially the same problem I have with a single std::string.
Is there a single approach that can work for all basic types and std::string? Or perhaps I can get away with 2 overloads, one for numerical types, and one for strings? (Either using different delimiters or the string using a delimiter escaping mechanism, perhaps.)
EDIT: I appreciate the often sensible tendency when confronted with questions like this is to say, "you don't want to do that" and to suggest a better approach, but I would really like suggestions that relate directly to what I asked, rather than what you believe I should have asked instead. :)
A general-purpose serialisation framework is hard, and the built-in features of the iostream library are really not up to it - even dealing with strings satisfactorily is quite difficult. I suggest you either sit down and design the framework from scratch, ignoring iostreams (which then become an implementation detail), or (more realistically) use an existing library, or at least an existing format, such as XML.
Basically, you will have to create a file format. When you're restricted to built-ins, strings, and sequences of those, you could use whitespace as a delimiter, write strings wrapped in double quotes (escaping any embedded " with a backslash, and therefore any \ as well), and pick anything that isn't used for streaming built-in types as the sequence delimiter. It might be helpful to store the size of a sequence, too.
For example,
5 1.4 "a string containing \" and \\" { 3 "blah" "blubb" "frgl" } { 2 42 21 }
might be the serialization of an int (5), a float (1.4), a string whose content is: a string containing " and \, a sequence of 3 strings ("blah", "blubb", and "frgl"), and a sequence of 2 ints (42 and 21).
Alternatively you could do as Neil suggests in his comment and treat strings as sequences of characters:
{ 27 'a' ' ' 's' 't' 'r' 'i' 'n' 'g' ' ' 'c' 'o' 'n' 't' 'a' 'i' 'n' 'i' 'n' 'g' ' ' '"' ' ' 'a' 'n' 'd' ' ' '\' }
If you want to avoid escaping strings, you can look at how ASN.1 does things. It's overkill for your stated requirements: strings, fundamental types and arrays of these things, but the principle is that the stream contains unambiguous length information. Therefore nothing needs to be escaped.
For a very simple equivalent, you could output a uint32_t as "ui4" followed by 4 bytes of data, a int8_t as "si1" followed by 1 byte of data, an IEEE float as "f4", IEEE double as "f8", and so on. Use some additional modifier for arrays: "a134ui4" followed by 536 bytes of data. Note that arbitrary lengths need to be terminated, whereas bounded lengths like the number of bytes in the following integer can be fixed size (one of the reasons ASN.1 is more than you need is that it uses arbitrary lengths for everything). A string could then either be a<len>ui1 or some abbreviation like s<len>:. The reader is very simple indeed.
This has obvious drawbacks: the size and representation of types must be independent of platform, and the output is neither human readable nor particularly compressed.
You can make it mostly human-readable by using ASCII instead of a binary representation for the arithmetic types (be careful with arrays: you may want to calculate the length of the whole array before outputting any of it, or you may use a separator and a terminator, since there's no need for character escapes), and by optionally adding a big, human-visible separator that the deserializer ignores. For example, s16:hello, worlds12:||s12:hello, world is considerably easier to read than s16:hello, worlds12:s12:hello, world. Just beware when reading that what looks like a separator sequence might not actually be one; you have to avoid traps like assuming that s5:hello|| in the middle of the stream means there's a string 5 chars long: it might be part of s15:hello||s5:hello||.
Unless you have very tight constraints on code size, it's probably easier to use a general-purpose serializer off the shelf than it is to write a specialized one. Reading simple XML with SAX isn't difficult. That said, everyone and his dog has written "finally, the serializer/parser/whatever that will save us ever hand-coding a serializer/parser/whatever ever again", with greater or lesser success.
You may consider using boost::spirit, which simplifies parsing of basic types from arbitrary input streams.