Assume that I get a few hundred lines of text as a string (C++) from an API, and sprinkled into that data are German umlauts, such as ä or ö, which need to be replaced with ae and oe.
I'm familiar with encoding (well, I've read http://www.joelonsoftware.com/articles/Unicode.html) and solving the problem was trivial (basically, searching through the string, removing the char and adding 2 others instead).
However, I do not know enough about C++ to do this fast. I've just stumbled upon StringBuilder (http://www.codeproject.com/Articles/647856/4350-Performance-Improvement-with-the-StringBuilde), which improved speed a lot, but I was curious if there are any better or smarter ways to do this?
If you must improve efficiency on such a small scale, consider doing the replacement in two phases:
The first phase calculates the number of characters in the result after the replacement. Go through the string, and add 1 to the count for each normal character; for characters such as ä or ö, add 2.
At this point, you have enough information to allocate the string for the result. Make a string of the length that you counted in the first phase.
The second phase performs the actual replacement: go through the string again, copying the regular characters, and replacing umlauted ones with their corresponding pairs.
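The two phases could be sketched roughly as follows. This is only an illustration: it assumes the input is in a single-byte encoding (ISO 8859-1 here), and the names umlautReplacement and replaceUmlautsLatin1 are made up for the example.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns the two-letter replacement for an
// ISO 8859-1 umlaut byte, or nullptr for every other byte.
const char* umlautReplacement(unsigned char c) {
    switch (c) {
        case 0xE4: return "ae";  // ä
        case 0xF6: return "oe";  // ö
        case 0xFC: return "ue";  // ü
        case 0xC4: return "Ae";  // Ä
        case 0xD6: return "Oe";  // Ö
        case 0xDC: return "Ue";  // Ü
        case 0xDF: return "ss";  // ß
        default:   return nullptr;
    }
}

// Assumption: `in` is ISO 8859-1 encoded (one byte per character).
std::string replaceUmlautsLatin1(const std::string& in) {
    // Phase 1: count the length of the result.
    std::size_t n = 0;
    for (unsigned char c : in) n += umlautReplacement(c) ? 2 : 1;

    // Phase 2: allocate once, then copy or replace.
    std::string out(n, '\0');
    std::size_t pos = 0;
    for (unsigned char c : in) {
        if (const char* r = umlautReplacement(c)) {
            out[pos++] = r[0];
            out[pos++] = r[1];
        } else {
            out[pos++] = static_cast<char>(c);
        }
    }
    return out;
}
```

The result is allocated exactly once, at the cost of walking the input twice.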
When the text is encoded in UTF-8, the German umlauts are all two-byte sequences, and so are their replacements like ae or oe. So if you work on the bytes directly (a char[], or the string's own buffer), you don't have to reallocate any memory and can just overwrite the bytes while iterating.
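A minimal sketch of that in-place idea, assuming the input is well-formed UTF-8 and the umlauts are precomposed (a single code point each, not a base letter plus a combining diaeresis). The function name is hypothetical.

```cpp
#include <cstddef>
#include <string>

// Overwrites each two-byte UTF-8 umlaut with its two-letter ASCII
// replacement. The length of the string never changes.
void replaceUmlautsUtf8InPlace(std::string& s) {
    for (std::size_t i = 0; i + 1 < s.size(); ++i) {
        // All the characters of interest start with the lead byte 0xC3.
        if (static_cast<unsigned char>(s[i]) != 0xC3) continue;

        const char* r = nullptr;
        switch (static_cast<unsigned char>(s[i + 1])) {
            case 0xA4: r = "ae"; break;  // ä U+00E4
            case 0xB6: r = "oe"; break;  // ö U+00F6
            case 0xBC: r = "ue"; break;  // ü U+00FC
            case 0x84: r = "Ae"; break;  // Ä U+00C4
            case 0x96: r = "Oe"; break;  // Ö U+00D6
            case 0x9C: r = "Ue"; break;  // Ü U+00DC
            case 0x9F: r = "ss"; break;  // ß U+00DF
            default: break;              // some other U+00C0..U+00FF character
        }
        if (r) {
            s[i] = r[0];
            s[i + 1] = r[1];
        }
        ++i;  // skip the second byte of the pair
    }
}
```

Other two-byte characters that start with 0xC3 (é, for instance) are left untouched.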
The problem is that I'm processing some UTF8 strings and I would like to design a class or a way to prevent string manipulations.
String manipulation is not desirable for strings of multibyte characters as splitting the string at a random position (which is measured in bytes) may split a character half way.
I have thought about using const std::string& but the user/developer can still create a substring by calling substr on it.
Another way would be to create a wrapper around const std::string& and expose only the string through getters.
Is this even possible?
Another way would be to create a wrapper around const std::string& and expose only the string through getters.
You need a class wrapping a std::string or std::u8string, not a reference to one. The class then owns the string and its contents, basically just using it as storage, and can provide whatever interface you see fit to operate on Unicode code points or characters instead of modifying the storage directly.
However, there is nothing in the standard library that will help you implement this. So a better approach would be to use a third party library that already does this for you. Operating on code points in a UTF-8 string is still reasonably simple and you can implement that part yourself, but if you want to operate on characters (in the sense of grapheme clusters or whatever else is suitable) implementation is going to be a project in itself.
I would use a wrapper where your external interface provides access to either code points, or to characters. So, foo.substr(3, 4) (for example) would skip the first 3 code points, and give you the next 4 code points. Alternatively, it would skip the first 3 characters, and give you the next 4 characters.
Either way, that would be independent of the number of bytes used to represent those code points or characters.
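As a rough illustration of such a wrapper, here is a sketch that owns the bytes and exposes size and substr in units of code points. The class name Utf8Text and its members are invented for this example, and it assumes the stored bytes are well-formed UTF-8. It works on code points only, not on characters in the grapheme sense.

```cpp
#include <cstddef>
#include <string>
#include <utility>

class Utf8Text {  // hypothetical name
public:
    explicit Utf8Text(std::string bytes) : bytes_(std::move(bytes)) {}

    // Number of code points: every byte that is not a continuation byte.
    std::size_t size() const {
        std::size_t n = 0;
        for (unsigned char b : bytes_)
            if (!isContinuation(b)) ++n;
        return n;
    }

    // Skips `pos` code points, then returns the next `count` code points.
    // Both values are clamped to the end of the text.
    Utf8Text substr(std::size_t pos, std::size_t count) const {
        const std::size_t first = advance(0, pos);
        const std::size_t last = advance(first, count);
        return Utf8Text(bytes_.substr(first, last - first));
    }

    // Read-only access to the underlying storage.
    const std::string& bytes() const { return bytes_; }

private:
    static bool isContinuation(unsigned char b) { return (b & 0xC0) == 0x80; }

    // Returns the byte offset reached after moving `n` code points
    // forward from byte offset `from`.
    std::size_t advance(std::size_t from, std::size_t n) const {
        std::size_t i = from;
        while (i < bytes_.size() && n > 0) {
            ++i;  // step over the lead byte
            while (i < bytes_.size() &&
                   isContinuation(static_cast<unsigned char>(bytes_[i])))
                ++i;  // step over its continuation bytes
            --n;
        }
        return i;
    }

    std::string bytes_;
};
```

With this, Utf8Text("grüße").substr(2, 2) yields "üß" regardless of how many bytes those two code points occupy.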
Quick aside on terminology for anybody unaccustomed to Unicode terminology: ISO 10646 is basically a long list of code points, each assigned a name and a number from 0 to 0x10FFFF (a little over 2^20). UTF-8 encodes a code point number in a sequence of 1 to 4 bytes.
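For reference, the 1-to-4-byte scheme can be sketched like this. The function name encodeUtf8 is made up, and the sketch assumes the caller passes a valid code point (it does not reject surrogates or values above 0x10FFFF).

```cpp
#include <string>

// Encodes one code point as UTF-8.
// Assumption: cp is a valid Unicode scalar value.
std::string encodeUtf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {                 // 1 byte:  0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {         // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {       // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                         // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For instance, U+0061 (a) comes out as one byte, U+0300 (combining grave accent) as two, U+20AC (€) as three, and U+1F600 as four.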
A character can consist of a (more or less) arbitrary number of code points. It will consist of a base character (e.g., a letter) followed by some number of combining diacritical marks. For example, à can be encoded as an a followed by a "combining grave accent" (U+0300).
The a and the U+0300 are each a code point. When encoded in UTF-8, the a would be encoded in a single byte and the U+0300 would be encoded in two bytes (0xCC 0x80). So, it's one character composed of two code points encoded in 3 bytes.
That's not quite all there is to characters (as opposed to code points) but it's sufficient for quite a few languages (especially, for the typical European languages like Spanish, German, French, and so on).
There are a fair number of other points that become non-trivial though. For example, German has a letter "ß". This is one character, but when you're doing string comparison, it should (at least normally) compare as equal to "ss". I believe there's been a move to change this but at least classically, it hasn't had an upper-case equivalent either, so both comparison and case conversion with it get just a little bit tricky.
And that's fairly mild compared to situations that arise in some of the more "exotic" languages. But it gives a general idea of the fact that yes, if you want to deal intelligently with Unicode strings, you basically have two choices: either have your code use ICU [1] to do most of the real work, or else resign yourself to this being a multi-year project in itself.
[1] In theory, you could use another suitable library--but in this case, I'm not aware of such a thing existing.
When trying to answer a question, How to use enqueu, dequeue, push, and peek in a Palindrome?, I suggested a palindrome can be found using std::string by:
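#include <algorithm>
#include <string>

bool isPalindrome(const std::string& str)
{
    // Compare the string, byte by byte, against itself read backwards.
    return std::equal(str.begin(), str.end(), str.rbegin(), str.rend());
}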
For a Unicode string, I suggested:
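#include <algorithm>
#include <string>

bool isPalindrome(const std::u8string& str)
{
    // Reverse a copy code unit by code unit (i.e. byte by byte) and compare.
    std::u8string rstr{str};
    std::reverse(rstr.begin(), rstr.end());
    return str == rstr;
}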
I now think this will create problems when you have multibyte characters in the string because the byte-order of the multibyte character is also reversed. Also, some characters will be equivalent to each other in different locales. Therefore, in C++20:
How do you make the comparison robust to multibyte characters?
How do you make the comparison robust to different locales when there can be equivalency between multiple characters?
Reversing a Unicode string becomes non-trivial. Converting from UTF-8 to UTF-32/UCS-4 is a good start, but not sufficient by itself--Unicode also has combining code points, so two (or more) consecutive code points form a single resulting grapheme (the added code point(s) add(s) diacritic marking to the base character), and for things to work correctly, you need to keep these in the correct order.
So, basically instead of code points, you need to divide the input up into a series of graphemes, and reverse the order of the graphemes, not just the code points.
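To make the idea concrete, here is a deliberately simplified sketch. It assumes well-formed UTF-8 input, and it treats only the code points U+0300 to U+036F (the Combining Diacritical Marks block) as "attaching to the previous base character". Real grapheme segmentation follows the rules of Unicode Standard Annex #29 and is considerably more involved; a library such as ICU provides it. All function names here are invented for the example, and no normalization or case folding is done.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Decodes UTF-8 into code points. Assumption: the input is well-formed.
std::u32string decodeUtf8(const std::string& s) {
    std::u32string out;
    for (std::size_t i = 0; i < s.size();) {
        const unsigned char b = static_cast<unsigned char>(s[i]);
        const std::size_t len = b < 0x80 ? 1 : b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4;
        // Keep only the payload bits of the lead byte.
        char32_t cp = len == 1 ? b : b & (0xFF >> (len + 1));
        for (std::size_t k = 1; k < len && i + k < s.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        out.push_back(cp);
        i += len;
    }
    return out;
}

// Simplification: only this one block counts as combining marks.
bool isCombiningMark(char32_t cp) { return cp >= 0x0300 && cp <= 0x036F; }

// Groups each base code point with the combining marks that follow it.
std::vector<std::u32string> splitClusters(const std::u32string& cps) {
    std::vector<std::u32string> clusters;
    for (char32_t cp : cps) {
        if (clusters.empty() || !isCombiningMark(cp)) clusters.emplace_back();
        clusters.back().push_back(cp);
    }
    return clusters;
}

// Compares the sequence of clusters with its reverse, so the code points
// inside each cluster keep their order.
bool isPalindromeByCluster(const std::string& utf8) {
    const auto c = splitClusters(decodeUtf8(utf8));
    return std::equal(c.begin(), c.end(), c.rbegin(), c.rend());
}
```

With this, a string like "àbà" written with a + U+0300 is recognised as a palindrome, whereas reversing it byte by byte would not even produce valid UTF-8.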
To deal with multiple different sequences of code points that represent the same sequence of characters, you normally want to do normalization. There are four different normalization forms. In this case, you'd probably want to use NFC or NFD (should be equivalent for this purpose). The NFKC/NFKD forms are primarily for compatibility with other character sets, which it sounds like you probably don't care about.
This can also be non-trivial though. Just for one well known example, consider the German character "ß". This is sort of equivalent to "ss", but only exists in lower-case, since it never occurs at the beginning of a word. So, there's probably room for argument about whether something like Ssaß is a palindrome or not (for the moment ignoring the minor detail that it's not actually a word). For palindromes, most people ignore letter case, so it would be--but your code in the question seems to treat case as significant, in which case it probably shouldn't be.
I am determining the length of certain strings of characters in C++ with the function length(), but noticed something strange: say I define in the main function
string str;
str = "canción";
Then, when I calculate the length of str by str.length() I get as output 8. If instead I define str = "cancion" and calculate str's length again, the output is 7. In other words, the accent on the letter 'o' is altering the real length of the string. The same thing happens with other accents. For example, if str = "für" it will tell me its length is 4 instead of 3.
I would like to know how to ignore these accented characters when determining the length of a string; however, I wouldn't want to ignore isolated characters like '. For example, if str = livin', the length of str must be 6.
It is a difficult subject. Your string is likely UTF-8 encoded, and str.length() counts bytes. An ASCII character is encoded in 1 byte, but characters with codes larger than 127 are encoded in more than 1 byte.
Counting Unicode code points may not give you the answer you need. Instead, you need to take into account the width of each code point to handle separate (combining) accents and code points with double width (and maybe there are other cases as well). So it is difficult to do this properly without using a library.
You may want to check out ICU.
If you have a constrained case and you don't want to use a library for this, you may want to check out UTF-8 encoding (it is not difficult), and create a simple UTF-8 code point counter (a simple algorithm could be to count bytes where (b&0xc0)!=0x80).
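Such a counter might look like the sketch below. It assumes the string is well-formed UTF-8, and the function name is made up. It counts code points, so a letter followed by a separate combining accent would still count as two.

```cpp
#include <cstddef>
#include <string>

// Counts code points by counting every byte that is NOT a UTF-8
// continuation byte (continuation bytes look like 10xxxxxx).
std::size_t countCodePoints(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char b : utf8)
        if ((b & 0xC0) != 0x80) ++n;
    return n;
}
```

For "canción" stored as UTF-8 this gives 7 where length() gives 8, and for "livin'" it gives 6, since the apostrophe is an ordinary one-byte character.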
Sounds like UTF-8 encoding. Since the characters with the accents cannot be stored in a single byte, they are stored in 2 bytes. See https://en.wikipedia.org/wiki/UTF-8
string s="x1→(y1⊕y2)∧z3";
for(auto i=s.begin(); i!=s.end();i++){
    if(*i=='→'){
        ...
    }
}
The char comparison is definitely wrong; what's the correct way to do it? I am using VS2013.
First you need some basic understanding of how programs handle Unicode. If you don't have that yet, you should read up on it; I quite like this post on Joel on Software.
You actually have 2 problems here:
Problem #1: getting the string into your program
Your first problem is getting that actual string in your string s. Depending on the encoding of your source code file, MSVC may corrupt any non-ASCII characters in that string.
One option is to save your C++ file as UTF-16 (which Windows confusingly calls Unicode), and use wchar_t and wstring (effectively encoding the expression as UTF-16). Saving as UTF-8 with BOM will also work. With any other encoding, your L"..." character literals will contain the wrong characters.
Note that other platforms may define wchar_t as 4 bytes instead of 2. So the handling of characters above U+FFFF will be non-portable.
In all other cases, you can't just write those characters in your source file. The most portable way is encoding your string literals as UTF-8, using \x escape codes for all non-ASCII characters. Like this: "x1\xe2\x86\x92(a\xe2\x8a\x95" "b)" rather than "x1→(a⊕b)". (The literal is split in two before the b because a \x escape swallows every hex digit that follows it.)
And yes, that's as unreadable and cumbersome as it gets. The root problem is MSVC doesn't really support using UTF-8. You can go through this question here for an overview: How to create a UTF-8 string literal in Visual C++ 2008.
But, also consider how often those strings will actually show up in your source code.
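As a small illustration of the escaped-literal approach, the sketch below spells "x1→(a⊕b)" with UTF-8 byte escapes. The function name is made up for the example.

```cpp
#include <string>

// "x1→(a⊕b)" as UTF-8 bytes: → is E2 86 92, ⊕ is E2 8A 95.
// The literal is split before "b)" on purpose: a \x escape consumes
// every hex digit after it, so "\x95b" would be read as one (too large)
// escape instead of the byte 0x95 followed by the letter b.
std::string expressionUtf8() {
    return "x1\xe2\x86\x92(a\xe2\x8a\x95" "b)";
}
```

The resulting string holds 12 bytes for 8 visible characters, because each of the two symbols takes 3 bytes.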
Problem #2: finding the character
(If you're using UTF-16, you can just find the L'→' character, since that character is representable as one wchar_t. For characters above U+FFFF you'll have to use the wide version of the workaround below.)
It's impossible to define a char representing the arrow character. You can however with a string: "\xe2\x86\x92" (that's a string with 3 chars for the arrow, plus the \0 terminator).
You can now search for this string in your expression:
s.find("\xe2\x86\x92");
The UTF-8 encoding scheme guarantees this always finds the correct character, but keep in mind this is an offset in bytes.
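If you need every occurrence rather than just the first, a loop around find is enough. The sketch below assumes both strings are UTF-8; the helper name findAll is invented for the example.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Returns the byte offsets of every occurrence of `needle` in `haystack`.
std::vector<std::size_t> findAll(const std::string& haystack,
                                 const std::string& needle) {
    std::vector<std::size_t> offsets;
    if (needle.empty()) return offsets;  // avoid looping forever
    for (std::size_t pos = haystack.find(needle);
         pos != std::string::npos;
         pos = haystack.find(needle, pos + needle.size())) {
        offsets.push_back(pos);
    }
    return offsets;
}
```

For the expression from the question, searching for "\xe2\x86\x92" reports offset 2, which is a byte offset and happens to equal the character index only because "x1" is plain ASCII.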
My comment is too large, so I am submitting it as an answer.
The problem is that everybody is concentrating on the issue of different encodings that Unicode may use (UTF-8, UTF-16, UCS2, etc). But your problems here will just begin.
There is also an issue of composite characters, which will really mess up any search that you are trying to make.
Let's say you are looking for the character 'é'. You find it in Unicode as U+00E9 and do your search, but it is not guaranteed that this is the only way to represent this character. The document may also contain the combination U+0065 U+0301, which is actually exactly the same character.
Yes, not just "character that looks the same", but it is exactly the same, so any software and even some programming libraries will freely convert from one to another without even telling you.
So if you wish to make a search, that is robust, you will need something that represents not just different encodings of Unicode, but Unicode characters themselves with equality between Composite and Ready-Made chars.
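For a single known character, one crude workaround is to search for both spellings, as in the sketch below (UTF-8 input assumed, function name made up). It does not scale: the general answer is to normalize both the text and the search term to the same form (for example with ICU) before comparing.

```cpp
#include <string>

// Looks for 'é' in either of its two canonical spellings.
// Caveat: the decomposed pattern also matches an e + U+0301 that is
// followed by further combining marks, which is a different character.
bool containsEAcute(const std::string& utf8) {
    static const std::string precomposed = "\xC3\xA9";   // U+00E9
    static const std::string decomposed  = "e\xCC\x81";  // U+0065 U+0301
    return utf8.find(precomposed) != std::string::npos ||
           utf8.find(decomposed) != std::string::npos;
}
```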
I want to add a diacritic mark to my string in C++. Assume I want to modify the wordz string in the following manner:
String respj = resp[j];
std::string respjz1 = respj; // create respjz1 and respjz2
std::string respjz2 = respj;
respjz1[i] = 'ź'; // put diacritic marks
respjz2[i] = 'ż';
I keep receiving: wordş and wordĽ (instead of wordź and wordż). I tried to google it but I keep getting results related to the opposite problem - diacritic normalization to non-diacritic mark.
First, what is String? Does it support accented characters or not?
But the real issue is one of encodings. When you say "I keep
receiving", what do you mean? What the string will contain is
not a character, but a numeric value, representing a code point
of a character, in some encoding. If the encoding used by the
compiler for accented characters is the same as the encoding
used by whatever you use to visualize them, then you will get
the same character. If it isn't, you will get something
different. Thus, for example, depending on the encoding, LATIN
SMALL LETTER Z WITH DOT ABOVE (what I think you're trying to assign to
respjz2[i]) can be 0xFD or 0xBF in the encoding tables I have
access to (and it's absent in most single byte encodings); in
the single byte encoding I normally use (ISO 8859-1), these code
points correspond to LATIN SMALL LETTER Y WITH ACUTE and
INVERTED QUESTION MARK, respectively.
In the end, there is no real solution. Long term, I think you
should probably move to UTF-8, and try to ensure that all of the
tools you use (and all of the tools used by your users)
understand that. Short term, it may not be that simple: for
starters, you're more or less stuck with what your compiler
provides (unless you enter the characters in the form \u017A
or \u017C, and even then the compiler may do some funny
mappings when it puts them into a string literal). And you may
not even know what other tools your users use.
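If you do go the UTF-8 route, note that ź and ż each take two bytes, so assigning one to a single element with respjz1[i] = ... cannot work. A sketch of one way around that, using byte escapes so the result does not depend on the source file's encoding (the helper and constant names are hypothetical, and it assumes position i currently holds a one-byte ASCII character):

```cpp
#include <cstddef>
#include <string>

const std::string kZAcute    = "\xC5\xBA";  // ź, U+017A, as UTF-8
const std::string kZDotAbove = "\xC5\xBC";  // ż, U+017C, as UTF-8

// Returns a copy of `s` in which the single byte at index `i` is replaced
// by the (possibly multi-byte) UTF-8 sequence `utf8Char`.
// Assumption: s[i] is a one-byte (ASCII) character.
std::string withReplacedByte(std::string s, std::size_t i,
                             const std::string& utf8Char) {
    if (i < s.size()) s.replace(i, 1, utf8Char);
    return s;
}
```

So withReplacedByte("wordz", 4, kZAcute) gives the six bytes of "wordź"; whether it is displayed correctly still depends on the viewer interpreting the output as UTF-8.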