<regex> having trouble with Cyrillic characters - c++

I'm trying to use the standard <regex> library to match some Cyrillic words:
// This is a UTF-8 file.
std::locale::global(std::locale("en_US.UTF-8"));
string s {"Каждый охотник желает знать где сидит фазан."};
regex re {"[А-Яа-яЁё]+"};
for (sregex_iterator it {s.begin(), s.end(), re}, end {}; it != end; it++) {
cout << it->str() << "#";
}
However, that doesn't seem to work. The code above results in the following:
Кажд�#й#о�#о�#ник#желае�#зна�#�#где#�#иди�#�#азан#
rather than the expected:
Каждый#охотник#желает#знать#где#сидит#фазан
The code of the '�' symbol above is \321.
I've checked the regular expression I used with grep and it works as expected. My locale is en_US.UTF-8. Both GCC and Clang produce the same result.
Is there anything I'm missing? Is there a way to "tame" <regex> so it would work with Cyrillic characters?

For ranges like А-Я to work properly, you must use std::regex::collate
Constants
...
collate Character ranges of the form "[a-b]" will be locale sensitive.
Changing the regular expression to
std::regex re{"[А-Яа-яЁё]+", std::regex::collate};
gives the expected result.
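Depending on the encoding of your source file, you might need to prefix the regular expression string with u8 (this applies before C++20; from C++20 on, u8 literals have type char8_t and no longer convert to a char-based std::regex):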
std::regex re{u8"[А-Яа-яЁё]+", std::regex::collate};

Cyrillic letters are represented as multibyte sequences in UTF-8, and a std::regex over std::string works on individual bytes. One way of handling the problem is therefore to use the "wide" version of string, called wstring. Other functions and types that work with characters need to be replaced with their wide counterparts as well; generally this is done by prepending w to their name. This works:
std::locale::global(std::locale("en_US.UTF-8"));
wstring s {L"Каждый охотник желает знать где сидит фазан."};
wregex re {L"[А-Яа-яЁё]+"};
for (wsregex_iterator it {s.begin(), s.end(), re}, end {}; it != end; it++) {
wcout << it->str() << "#";
}
Output:
Каждый#охотник#желает#знать#где#сидит#фазан#
(Thanks #JohnDing for pitching this solution.)
An alternative solution is to use regex::collate to make regexes locale-sensitive with ordinary strings; see this answer by #OlafDietsche for details. This topic will shed some light on which solution might be preferable in your circumstances. (Turns out in my case collate was a better idea!)

Related

Replace single backslash with double in a string c++

I am trying to replace one backslash with two. To do that I tried using the following code
str = "d:\test\text.txt"
str.replace("\\","\\\\");
The code does not work. The whole idea is to pass str to the deletefile function, which requires double backslashes.
Since C++11, you can try using regex:
#include <regex>
#include <iostream>
int main() {
auto s = std::string(R"(\tmp\)");
s = std::regex_replace(s, std::regex(R"(\\)"), R"(\\)");
std::cout << s << std::endl;
}
A bit overkill, but it does the trick if you want a "quick" solution.
There are two errors in your code.
First line: you forgot to double the \ in the literal string.
It happens that \t is a valid escape representing the tab character, so you get no compiler error, but your string doesn't contain what you expect.
Second line: according to the reference of string::replace,
you can replace a substring by another substring based on the substring position.
However, there is no version that makes a substitution, i.e. replaces all occurrences of a given substring with another one.
This doesn't exist in the standard library. It exists for example in the boost library, see boost string algorithms. The algorithm you are looking for is called replace_all.

C++ Escape occurrences of \ in a string

Is there a simple way to escape all occurrences of \ in a string? I start with the following string:
#include <string>
#include <iostream>
std::string escapeSlashes(std::string str) {
// I have no idea what to do here
return str;
}
int main () {
std::string str = "a\b\c\d";
std::cout << escapeSlashes(str) << "\n";
// Desired output:
// a\\b\\c\\d
return 0;
}
Basically, I am looking for the inverse to this question. The problem is that I cannot search for \ in the string, because C++ already treats it as an escape sequence.
NOTE: I am not able to change the string str in the first place. It is parsed from a LaTeX file. Thus, this answer to a similar question does not apply. Edit: The parsing failed due to an unrelated problem; the question here is about string literals.
Edit: There are nice solutions to find and replace known escape sequences, such as this answer. Another option is to use boost::regex("\p{cntrl}"). However, I haven't found one that works for unknown (erroneous) escape sequences.
You can use raw string literal. See http://en.cppreference.com/w/cpp/language/string_literal
#include <string>
#include <iostream>
int main() {
std::string str = R"(a\b\c\d)";
std::cout << str << "\n";
return 0;
}
Output:
a\b\c\d
It is not possible to convert the string literal a\b\c\d to a\\b\\c\\d, i.e. escaping the backslashes.
Why? Because the compiler converts \c and \d directly to c and d, respectively, giving you a warning about Unknown escape sequence \c and Unknown escape sequence \d (\b is fine as it is a valid escape sequence). This happens directly to the string literal before you have any chance to work with it.
To see this, you can compile to assembler
gcc -S main.cpp
and you will find the following line somewhere in your assembler code:
.string "a\bcd"
Thus, your problem is either in your parsing function or you use string literals for experimenting and you should use raw strings R"(a\b\c\d)" instead.

How to count characters in a string encoded in an arbitrary character set

Given a std::string containing text encoded in an arbitrary but known character set, what is the easiest way in C++ to count the characters? It should be able to handle things like combining characters and Unicode code points.
It would be nice to have something like:
std::string test = "éäöü";
std::cout << test.size("utf-8") << std::endl;
Unfortunately, life isn't always easy with C++. :)
For Unicode, I have seen that one can use the ICU library: Cross-platform iteration of Unicode string (counting Graphemes using ICU)
But is there a more general solution?
I'm afraid it depends on the particular encoding. If you use UTF-8 (and I really don't see why you should not), you could use UTF8-CPP.
It would appear they have a function to do just this:
::std::string test = "éäöü";
auto length = ::utf8::distance(test.begin(), test.end());
::std::cout << length << "\n"; // should print 4.

Why is the comparison of two strings in UTF-8 not correct?

I have two words, both of type std::string, and both are Unicode words. They are the same: when I write them to a file, they both have the same representation. But when I call word1.compare(word2), I don't get the right result. Why are they not the same?
Or should I use a function other than compare to compare two Unicode strings?
Thanks
ifstream myfile;
string term = "";
myfile.open("homograph.txt");
istream_iterator<string> i(myfile);
multiset<string> s(i, istream_iterator<string>());
for(multiset<string>::const_iterator i = s.begin(); i != s.end(); i = s.upper_bound(*i))
{
term = *i;
}
pugi::xml_document doc;
std::ifstream stream("words0.xml");
pugi::xml_parse_result result = doc.load(stream);
pugi::xml_node words = doc.child("Words");
for (pugi::xml_node_iterator it = words.begin(); it != words.end(); ++it)
{
std::string wordValue = as_utf8(it->child("WORDVALUE").child_value());
if(!wordValue.compare(term))
{
o << wordValue << endl;
}
}
The first word is "term" and the second word is wordValue.
The as_utf8() overload is:
std::string wordNet::as_utf8(const char* str)
{
return str;
}
In Unicode (and UTF-8 is Unicode), there is the problem of composition. A character like é can be represented by its own code point (U+00E9), or by the code point for e followed by a combining acute accent (U+0301). It could be that one string uses the precomposed form (é) and the other the decomposed form (e plus the combining accent). Both will usually be displayed the same way. To avoid the problem, one should normalize both strings to one of these forms.
Of course, there could be another problem, but this is one of the problems that can make equal looking strings not compare as equal. OTOH, if your text does not have any characters outside ASCII, this is hardly the problem.
The correct way to compare the strings is to normalize them first. You can do this in Python with the unicodedata module.
Unicode Standard Annex #15 describes composition and normalization in detail.
Unicode is more complicated than you think. There are combining characters, invisible code points and what not. If two strings look the same when printed, it doesn't mean they are byte-to-byte identical.
To take all the complications of Unicode into account, you need to use a Unicode-aware string library. One such library is ICU. The C++ standard library is most definitely not Unicode-aware. It probably can correctly count characters in a UTF-8 string, but that's about it.
Try using std::wstring instead.

How to remove accents and tilde in a C++ std::string

I have a problem with a string in C++ that contains several words in Spanish, which means I have a lot of words with accents and tildes. I want to replace them with their unaccented counterparts. Example: I want to replace the word "había" with "habia". I tried to replace them directly with the replace method of the string class, but I could not get that to work.
I'm using this code:
for (it= dictionary.begin(); it != dictionary.end(); it++)
{
strMine=(it->first);
found=toReplace.find_first_of(strMine);
while (found!=std::string::npos)
{
strAux=(it->second);
toReplace.erase(found,strMine.length());
toReplace.insert(found,strAux);
found=toReplace.find_first_of(strMine,found+1);
}
}
Where dictionary is a map like this (with more entries):
dictionary.insert ( std::pair<std::string,std::string>("á","a") );
dictionary.insert ( std::pair<std::string,std::string>("é","e") );
dictionary.insert ( std::pair<std::string,std::string>("í","i") );
dictionary.insert ( std::pair<std::string,std::string>("ó","o") );
dictionary.insert ( std::pair<std::string,std::string>("ú","u") );
dictionary.insert ( std::pair<std::string,std::string>("ñ","n") );
and toReplace strings is:
std::string toReplace="á-é-í-ó-ú-ñ-á-é-í-ó-ú-ñ";
I obviously must be missing something. I can't figure it out.
Is there any library I can use?.
Thanks,
I disagree with the currently "approved" answer. The question makes perfect sense when you are indexing text. Like case-insensitive search, accent-insensitive search is a good idea. "naïve" matches "Naïve" matches "naive" matches "NAİVE" (you do know that an uppercase i is İ in Turkish? That's why you ignore accents)
Now, the best algorithm is hinted at in the approved answer: use NFKD (decomposition) to decompose accented letters into the base letter and a separate accent, and then remove all accents.
There is little point in the re-composition afterwards, though. You removed most sequences which would change, and the others are for all intents and purposes identical anyway. What's the difference between æ in NFKC and æ in NFKD?
First, this is a really bad idea: you’re mangling somebody’s language by removing letters. Although the extra dots in words like “naïve” seem superfluous to people who only speak English, there are literally thousands of writing systems in the world in which such distinctions are very important. Writing software to mutilate someone’s speech puts you squarely on the wrong side of the tension between using computers as means to broaden the realm of human expression vs. tools of oppression.
What is the reason you’re trying to do this? Is something further down the line choking on the accents? Many people would love to help you solve that.
That said, libicu can do this for you. Open the transform demo; copy and paste your Spanish text into the “Input” box; enter
NFD; [:M:] remove; NFC
as “Compound 1” and click transform.
(With help from slide 9 of Unicode Transforms in ICU. Slides 29-30 show how to use the API.)
I definitely think you should look into the root of the problem. That is, look for a solution that will allow you to support characters encoded in Unicode or for the user's locale.
That being said, your problem is that you're dealing with multibyte characters: in an encoding such as UTF-8, each accented letter takes more than one char. There is std::wstring, but I'm not sure I'd use that. For one thing, wide characters aren't meant to handle variable-width encodings. This rabbit hole goes deep, so I'll leave it at that.
Now, as for the rest of your code, it is error prone because you mix the looping logic with translation logic. Thus, at least two kinds of bugs can occur: translation bugs and looping bugs. Do use the STL, it can help you a lot with the looping part.
The following is a rough solution for replacing characters in a string.
main.cpp:
#include <iostream>
#include <string>
#include <iterator>
#include <algorithm>
#include "translate_characters.h"
using namespace std;
int main()
{
string text;
cin.unsetf(ios::skipws);
transform(istream_iterator<char>(cin), istream_iterator<char>(),
inserter(text, text.end()), translate_characters());
cout << text << endl;
return 0;
}
translate_characters.h:
#ifndef TRANSLATE_CHARACTERS_H
#define TRANSLATE_CHARACTERS_H
#include <map>
// Function object that translates a single char through a lookup table.
class translate_characters {
public:
translate_characters();
// Returns the mapped char, or c itself when there is no mapping.
char operator()(const char c);
private:
std::map<char, char> characters_map;
};
#endif // TRANSLATE_CHARACTERS_H
translate_characters.cpp:
#include "translate_characters.h"
using namespace std;
translate_characters::translate_characters()
{
characters_map.insert(make_pair('e', 'a'));
}
char translate_characters::operator()(const char c)
{
map<char, char>::const_iterator translation_pos(characters_map.find(c));
if( translation_pos == characters_map.end() )
return c;
return translation_pos->second;
}
You might want to check out the boost (http://www.boost.org/) library.
It has a regexp library, which you could use.
In addition, it has a dedicated library with functions for string manipulation, including replace.
Try using std::wstring instead of std::string. UTF-16 should work (as opposed to ASCII).
I could not link the ICU libraries, but I still think it's the best solution. As I need this program to be functional as soon as possible, I made a little program (that I have to improve) and I'm going to use that. Thank you all for the suggestions and answers.
Here's the code I'm gonna use:
for (it= dictionary.begin(); it != dictionary.end(); it++)
{
strMine=(it->first);
found=toReplace.find(strMine);
while (found != std::string::npos)
{
strAux=(it->second);
toReplace.erase(found,2);
toReplace.insert(found,strAux);
found=toReplace.find(strMine,found+1);
}
}
I will change it next time I have to turn my program in for correction (in about 6 weeks).
If you can (if you're running Unix), I suggest using the tr facility for this: it's custom-built for this purpose. Remember, no code == no buggy code. :-)
Edit: Sorry, you're right, tr doesn't seem to work. How about sed? It's a pretty stupid script I've written, but it works for me.
#!/bin/sed -f
s/á/a/g;
s/é/e/g;
s/í/i/g;
s/ó/o/g;
s/ú/u/g;
s/ñ/n/g;
/// <summary>
///
/// Replaces accented and foreign characters with their ASCII equivalents.
/// In other words, converts a string to an ASCII-compliant string.
///
/// This also gets rid of special hidden characters, like EOF, NUL, TAB and other '\0', except \n\r
///
/// Tests with accents and foreign characters:
/// Before: "äæǽaeöœoeüueÄAeÜUeÖOeÀÁÂÃÄÅǺĀĂĄǍΑΆẢẠẦẪẨẬẰẮẴẲẶАAàáâãåǻāăąǎªαάảạầấẫẩậằắẵẳặаaБBбbÇĆĈĊČCçćĉċčcДDдdÐĎĐΔDjðďđδdjÈÉÊËĒĔĖĘĚΕΈẼẺẸỀẾỄỂỆЕЭEèéêëēĕėęěέεẽẻẹềếễểệеэeФFфfĜĞĠĢΓГҐGĝğġģγгґgĤĦHĥħhÌÍÎÏĨĪĬǏĮİΗΉΊΙΪỈỊИЫIìíîïĩīĭǐįıηήίιϊỉịиыїiĴJĵjĶΚКKķκкkĹĻĽĿŁΛЛLĺļľŀłλлlМMмmÑŃŅŇΝНNñńņňʼnνнnÒÓÔÕŌŎǑŐƠØǾΟΌΩΏỎỌỒỐỖỔỘỜỚỠỞỢОOòóôõōŏǒőơøǿºοόωώỏọồốỗổộờớỡởợоoПPпpŔŖŘΡРRŕŗřρрrŚŜŞȘŠΣСSśŝşșšſσςсsȚŢŤŦτТTțţťŧтtÙÚÛŨŪŬŮŰŲƯǓǕǗǙǛŨỦỤỪỨỮỬỰУUùúûũūŭůűųưǔǖǘǚǜυύϋủụừứữửựуuÝŸŶΥΎΫỲỸỶỴЙYýÿŷỳỹỷỵйyВVвvŴWŵwŹŻŽΖЗZźżžζзzÆǼAEßssIJIJijijŒOEƒf'ξksπpβvμmψpsЁYoёyoЄYeєyeЇYiЖZhжzhХKhхkhЦTsцtsЧChчchШShшshЩShchщshchЪъЬьЮYuюyuЯYaяya"
/// After: "aaeooeuueAAeUUeOOeAAAAAAAAAAAAAAAAAAAAAAAaaaaaaaaaaaaaaaaaaaaaaaBbCCCCCCccccccDdDDjddjEEEEEEEEEEEEEEEEEEeeeeeeeeeeeeeeeeeeFfGGGGGgggggHHhhIIIIIIIIIIIIIiiiiiiiiiiiiJJjjKKkkLLLLllllMmNNNNNnnnnnOOOOOOOOOOOOOOOOOOOOOOooooooooooooooooooooooPpRRRRrrrrSSSSSSssssssTTTTttttUUUUUUUUUUUUUUUUUUUUUUUUuuuuuuuuuuuuuuuuuuuuuuuYYYYYYYYyyyyyyyyVvWWwwZZZZzzzzAEssIJijOEf'kspvmpsYoyoYeyeYiZhzhKhkhTstsChchShshShchshchYuyuYaya"
///
/// Tests with invalid 'special hidden characters':
/// Before: "\0\0\000\0000Bj��rk�\'\"\\\0\a\b\f\n\r\t\v\u0020���oacu\'\\\'te�"
/// After: "00000Bjrk'\"\\\n\r oacu'\\'te"
///
/// </summary>
private string Normalize(string StringToClean)
{
string normalizedString = StringToClean.Normalize(NormalizationForm.FormD);
StringBuilder Buffer = new StringBuilder(StringToClean.Length);
for (int i = 0; i < normalizedString.Length; i++)
{
if (CharUnicodeInfo.GetUnicodeCategory(normalizedString[i]) != UnicodeCategory.NonSpacingMark)
{
Buffer.Append(normalizedString[i]);
}
}
string PreAsciiCompliant = Buffer.ToString().Normalize(NormalizationForm.FormC);
StringBuilder AsciiComplient = new StringBuilder(PreAsciiCompliant.Length);
foreach (char character in PreAsciiCompliant)
{
//Reject all special characters except \n\r (Carriage-Return and Line-Feed).
//Get rid of special hidden character, like EOF, NUL, TAB and other '\0'
if (((int)character >= 32 && (int)character < 127) || ((int)character == 10 || (int)character == 13))
{
AsciiComplient.Append(character);
}
}
return AsciiComplient.ToString().Trim(); // Remove spaces at start and end of string if any
}