character and number Persian in regular Expression C++ [duplicate] - c++

I have to use a Unicode range in a regex in C++. Basically, I need a regex that accepts all valid Unicode characters. I tried the test expression below and ran into some issues with it.
std::regex reg("^[\\u0080-\\uDB7Fa-z0-9!#$%&'*+/=?^_`{|}~-]+$");
Is the issue with \\u?

This should work fine but you need to use std::wregex and std::wsmatch. You will need to convert the source string and regular expression to wide character unicode (UTF-32 on Linux, UTF-16(ish) on Windows) to make it work.
This works for me where source text is UTF-8:
inline std::wstring from_utf8(const std::string& utf8)
{
    // code to convert from utf8 to utf32/utf16
}
inline std::string to_utf8(const std::wstring& ws)
{
    // code to convert from utf32/utf16 to utf8
}
int main()
{
    std::string test = "john.doe#神谕.com"; // utf8
    std::string expr = "[\\u0080-\\uDB7F]+"; // utf8
    std::wstring wtest = from_utf8(test);
    std::wstring wexpr = from_utf8(expr);
    std::wregex we(wexpr);
    std::wsmatch wm;
    if(std::regex_search(wtest, wm, we))
    {
        std::cout << to_utf8(wm.str(0)) << '\n';
    }
}
Output:
神谕
Note: If you need a UTF conversion library I used THIS ONE in the example above.
Edit: Or, you could use the functions given in this answer:
Any good solutions for C++ string code point and code unit?
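For readers who want to try the approach without an external library, here is a minimal sketch of the two conversion helpers plus the search itself. It assumes std::wstring_convert with std::codecvt_utf8<wchar_t> is acceptable (it is deprecated since C++17 but still shipped by the major standard libraries); this is not the library the answer above used. first_non_ascii_run is a hypothetical name introduced only for illustration.

```cpp
#include <codecvt>
#include <locale>
#include <regex>
#include <string>

// Assumption: std::wstring_convert is available (deprecated since C++17).
inline std::wstring from_utf8(const std::string& utf8) {
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    return conv.from_bytes(utf8);
}

inline std::string to_utf8(const std::wstring& ws) {
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    return conv.to_bytes(ws);
}

// Hypothetical helper: returns the first run of characters in the range
// U+0080..U+DB7F as UTF-8, or an empty string if there is none.
inline std::string first_non_ascii_run(const std::string& utf8) {
    static const std::wregex re(L"[\\u0080-\\uDB7F]+");
    const std::wstring wide = from_utf8(utf8);
    std::wsmatch m;
    return std::regex_search(wide, m, re) ? to_utf8(m.str(0)) : std::string();
}
```

Called with the test string from the answer, first_non_ascii_run should return the two CJK characters; from_bytes throws std::range_error on malformed input, so real code may want to catch that.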

Related

<regex> having trouble with Cyrillic characters

I'm trying to use the standard <regex> library to match some Cyrillic words:
// This is a UTF-8 file.
std::locale::global(std::locale("en_US.UTF-8"));
string s {"Каждый охотник желает знать где сидит фазан."};
regex re {"[А-Яа-яЁё]+"};
for (sregex_iterator it {s.begin(), s.end(), re}, end {}; it != end; it++) {
    cout << it->str() << "#";
}
However, that doesn't seem to work. The code above results in the following:
Кажд�#й#о�#о�#ник#желае�#зна�#�#где#�#иди�#�#азан#
rather than the expected:
Каждый#охотник#желает#знать#где#сидит#фазан
The code of the '�' symbol above is \321.
I've checked the regular expression I used with grep and it works as expected. My locale is en_US.UTF-8. Both GCC and Clang produce the same result.
Is there anything I'm missing? Is there a way to "tame" <regex> so it would work with Cyrillic characters?
For ranges like А-Я to work properly, you must use std::regex::collate:
Constants
...
collate Character ranges of the form "[a-b]" will be locale sensitive.
Changing the regular expression to
std::regex re{"[А-Яа-яЁё]+", std::regex::collate};
gives the expected result.
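Depending on the encoding of your source file, you might need to prefix the regular expression string with u8 (before C++20; from C++20 on, a u8 literal is an array of char8_t and can no longer be passed to the std::regex constructor):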
std::regex re{u8"[А-Яа-яЁё]+", std::regex::collate};
Cyrillic letters are represented as multibyte sequences in UTF-8. Therefore, one way of handling the problem is to use the "wide" version of string, called wstring. Other functions and types that work with characters need to be replaced with their wide-character counterparts as well; generally this is done by prepending w to their name. This works:
std::locale::global(std::locale("en_US.UTF-8"));
wstring s {L"Каждый охотник желает знать где сидит фазан."};
wregex re {L"[А-Яа-яЁё]+"};
for (wsregex_iterator it {s.begin(), s.end(), re}, end {}; it != end; it++) {
    wcout << it->str() << "#";
}
Output:
Каждый#охотник#желает#знать#где#сидит#фазан#
(Thanks @JohnDing for pitching this solution.)
An alternative solution is to use regex::collate to make regexes locale-sensitive with ordinary strings; see this answer by @OlafDietsche for details. This topic will shed some light on which solution might be preferable in your circumstances. (Turns out in my case collate was a better idea!)
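The likely reason the narrow version misbehaves is that std::regex over char compares individual bytes, while every Cyrillic letter occupies two bytes in UTF-8, so a bracket expression ends up matching halves of letters. The wide approach can be sketched as a small reusable function; cyrillic_words is a hypothetical name, and the pattern uses universal character names so that it does not depend on the encoding of the source file.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns every maximal run of Cyrillic letters in `text`.
// The bracket expression covers U+0410..U+042F (capital letters),
// U+0430..U+044F (small letters), plus U+0401 and U+0451 (IO, capital and small).
std::vector<std::wstring> cyrillic_words(const std::wstring& text) {
    static const std::wregex re(L"[\u0410-\u042F\u0430-\u044F\u0401\u0451]+");
    std::vector<std::wstring> words;
    for (std::wsregex_iterator it(text.begin(), text.end(), re), end; it != end; ++it) {
        words.push_back(it->str());
    }
    return words;
}
```

Because each wchar_t holds a whole letter here, the ranges compare code points directly and no locale needs to be set for the matching itself; a locale is still needed if the words are printed through wcout.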

How to convert a text like "\320\272\320\276\320\274..." to std::wstring in C++?

I am working on code that processes messages from Ubuntu. Some of the messages contain, for example:
localhost sshd 1658 - - Invalid user \320\272\320\276\320\274\320\274\321\320\275\320\270\320\267\320\274 from 172.28.60.28 port 50712 ]
where "\320\272\320\276\320\274\320\274\321\320\275\320\270\320\267\320\274" is the user name, which was originally in Russian. How do I convert it to std::wstring?
The numbers after the backslashes are the UTF-8 byte sequence values of the Cyrillic letters, each byte represented as an octal number.
You could for example use a regex replace to replace each \ooo with its value so that you get a real UTF-8 string out:
See it on Wandbox
#include <cstdlib>   // EXIT_SUCCESS
#include <iostream>
#include <string>
#include <boost/regex.hpp>
int main()
{
    std::string const source = R"(Invalid user \320\272\320\276\320\274\320\274\321\320\275\320\270\320\267\320\274 from 172.28.60.28 port 50712)";
    // A backslash followed by three digits
    boost::regex const re(R"(\\\d\d\d)");
    // Called once per match; writes the decoded byte to the output iterator
    auto const replacer = [](boost::smatch const& match, auto it) {
        // Skip the leading backslash and parse the digits as an octal number
        auto const byteVal = std::stoi(match[0].str().substr(1), nullptr, 8);
        *it = static_cast<char>(byteVal);
        return ++it;
    };
    std::string const out = boost::regex_replace(source, re, replacer);
    std::cout << out << std::endl;
    return EXIT_SUCCESS;
}
If you really need to, you can then convert this std::string to std::wstring using e.g. Thomas's method.
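If Boost is not available, the same decoding step can be sketched with a plain loop. This is an alternative to the regex replace above, not the answer's method; decode_octal_escapes is a hypothetical name. Unlike the \d\d\d pattern, this sketch only accepts the digits 0 to 7 and leaves anything else untouched.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: replaces every backslash followed by exactly three
// octal digits with the single byte those digits encode.
std::string decode_octal_escapes(const std::string& in) {
    auto is_octal = [](char c) { return c >= '0' && c <= '7'; };
    std::string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == '\\' && i + 3 < in.size() &&
            is_octal(in[i + 1]) && is_octal(in[i + 2]) && is_octal(in[i + 3])) {
            const int value = (in[i + 1] - '0') * 64 +
                              (in[i + 2] - '0') * 8 +
                              (in[i + 3] - '0');
            if (value <= 0xFF) {  // values above \377 do not fit in one byte
                out.push_back(static_cast<char>(value));
                i += 3;  // skip the three digits just consumed
                continue;
            }
        }
        out.push_back(in[i]);  // not an escape: copy the character unchanged
    }
    return out;
}
```

The result is a byte string; whether it is valid UTF-8 depends entirely on the escapes in the log line, so a later conversion to std::wstring can still fail on malformed input.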
If you have a std::string containing UTF-8 code-points and you wish to convert this to std::wstring you can do this in the following way, using the std::codecvt_utf8 facet and the std::wstring_convert class template:
#include <codecvt>   // std::codecvt_utf8
#include <locale>    // std::wstring_convert
#include <string>
std::wstring convert(const std::string& utf8String) {
    std::wstring_convert<std::codecvt_utf8<wchar_t>> converter{};
    return converter.from_bytes(utf8String);
}
The format of the resulting std::wstring will either be UCS2 (on Windows platforms) or UCS4 (most non-Windows platforms).
Note that the std::codecvt_utf8 facet is deprecated as of C++17; users are instead encouraged to rely on specialized Unicode/text-processing libraries. But this should suffice for now.

Extract double quotes using boost::regex in C++Builder

I am working on a lexer analyzer using C++Builder XE6 and this is what I've done so far: I have two memos (memoIN, memoOUT). memoIN contains the text to be analyzed and memoOUT the output (list of tokens).
First, I strip the memoIN content from all comments using boost::regex, and this works like a charm. Now I'm stuck on how to extract all double quotes from the text and display them as a string in the output memo.
All I have so far is an expression that removes all double quotes, which is not what I need. I need to extract them and display them, for example:
memoIN :
This is a "Double" Quote and this is "another one"
memoOUT :
<(String "Double") #Line 01 #Length 06)>
<(String "another one") #Line 01 #Length 11)>
Using Boost.Regex
Here's some sample code that demonstrates using boost::regex to extract text within quotes.
#include <string>
#include <iostream>
#include <boost/regex.hpp>
// No using-directives: pulling in both std and boost would make names
// such as regex ambiguous, and everything below is fully qualified anyway.
int main(int argc, char **argv) {
    // Capture any non-quotes that occur within double quotes.
    boost::regex re("\"([^\"]+)\"");
    // Input text
    std::string memoIN = "This is a \"Double\" Quote and this is \"another one\"";
    // Iterate through memoIN
    boost::sregex_iterator m1(memoIN.begin(), memoIN.end(), re);
    // Ending iterator (using the default constructor)
    boost::sregex_iterator m2;
    for (; m1 != m2; ++m1) {
        // Replace this with code to organize memoOUT
        std::cout << (*m1)[1].str() << std::endl;
    }
    return 0;
}
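One way to fill in the "organize memoOUT" step is sketched below. It uses std::regex instead of Boost.Regex purely to stay dependency-free, and it mirrors the token format shown in the question as literally as possible. quoted_tokens is a hypothetical name, and the two-digit zero padding is an assumption read off the sample output.

```cpp
#include <algorithm>
#include <iomanip>
#include <regex>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: returns one formatted token line per quoted string.
std::vector<std::string> quoted_tokens(const std::string& text) {
    static const std::regex re("\"([^\"]+)\"");
    std::vector<std::string> tokens;
    for (std::sregex_iterator it(text.begin(), text.end(), re), end; it != end; ++it) {
        const std::smatch& m = *it;
        // Line number = 1 + number of newlines before the opening quote.
        const auto line = 1 + std::count(text.begin(), text.begin() + m.position(0), '\n');
        std::ostringstream out;
        out << "<(String \"" << m[1].str() << "\") #Line "
            << std::setw(2) << std::setfill('0') << line
            << " #Length "
            << std::setw(2) << std::setfill('0') << m[1].length() << ")>";
        tokens.push_back(out.str());
    }
    return tokens;
}
```

Each returned string can then be appended to the output memo; the length is counted in char units, so it only equals the number of visible characters for single-byte text.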
Using a lexer library
Depending on how sophisticated your needs are, you may find that you're better off in the long run using a dedicated lexer and parser generator (like ANTLR3 C) than writing your own with Boost.Regex.
Interfacing with UnicodeString
There are several approaches to handling mismatches between C++Builder's AnsiString and UnicodeString and Standard C++'s std::string and std::wstring. One simple approach is to convert UnicodeString to std::string for internal text manipulation then convert it back to UnicodeString for the UI. For example:
// Use AnsiString to convert from UTF-16 to a narrow character encoding
std::string memoIN_text = AnsiString(MemoIN->Text).c_str();
std::string memoOUT_text;
// Insert Boost.Regex manipulation here and assign the results to memoOUT_text
// Use implicit conversion from const char* to AnsiString/UnicodeString
MemoOUT->Text = memoOUT_text.c_str();
Converting from Unicode to ANSI may lose data, so you may want to use SetMultiByteConversionCodePage to tell C++Builder to use UTF-8 for AnsiString. (Character encoding is complicated enough to be its own topic.)

C++ Strip non-ASCII Characters from string

Before you get started: yes, I know this is a duplicate question, and yes, I have looked at the posted solutions. My problem is that I could not get them to work.
bool invalidChar (char c)
{
    return !isprint((unsigned)c);
}
void stripUnicode(string & str)
{
    str.erase(remove_if(str.begin(),str.end(), invalidChar), str.end());
}
I tested this method on "Prusæus, Ægyptians," and it did nothing.
I also attempted to substitute isalnum for isprint.
The real problem occurs when, in another section of my program, I convert string->wstring->string. The conversion balks if there are Unicode chars in the string->wstring conversion.
Ref:
How can you strip non-ASCII characters from a string? (in C#)
How to strip all non alphanumeric characters from a string in c++?
Edit:
I still would like to remove all non-ASCII chars regardless yet if it helps, here is where I am crashing:
// Convert to wstring
wchar_t* UnicodeTextBuffer = new wchar_t[ANSIWord.length()+1];
wmemset(UnicodeTextBuffer, 0, ANSIWord.length()+1);
mbstowcs(UnicodeTextBuffer, ANSIWord.c_str(), ANSIWord.length());
wWord = UnicodeTextBuffer; //CRASH
Error Dialog
MSVC++ Debug Library
Debug Assertion Failed!
Program: //myproject
File: f:\dd\vctools\crt_bld\self_x86\crt\src\isctype.c
Line: //Above
Expression:(unsigned)(c+1)<=256
Edit:
Further compounding the matter: the .txt file I am reading in from is ANSI encoded. Everything within should be valid.
Solution:
bool invalidChar (char c)
{
    return !(c>=0 && c <128);
}
void stripUnicode(string & str)
{
    str.erase(remove_if(str.begin(),str.end(), invalidChar), str.end());
}
If someone else would like to copy/paste this, I can check this question off.
EDIT:
For future reference: try using the __isascii and iswascii functions.
At least one problem is in your invalidChar function. It should be:
return !isprint( static_cast<unsigned char>( c ) );
Casting a char to unsigned is likely to give some very, very big values if the char is negative (UINT_MAX+1 + c). Passing such a value to isprint is undefined behavior.
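The difference between the two casts can be made concrete with a small sketch. Both function names are hypothetical, and signed char is used so the behavior does not depend on whether plain char is signed on a given platform.

```cpp
// Mirrors the cast in the question: a negative value wraps around to a
// huge unsigned int (UINT_MAX + 1 + c), far outside what isprint accepts.
unsigned int cast_like_question(signed char c) {
    return (unsigned)c;
}

// Mirrors the fix: the value is first reduced to the range 0..UCHAR_MAX,
// which is what the <cctype> functions expect.
int cast_like_fix(signed char c) {
    return static_cast<unsigned char>(c);
}
```

For the byte 0xE6 (the letter æ in Latin-1 and Windows-1252), a signed char holds -26; the first cast turns that into a value above four billion on platforms with a 32-bit unsigned int, while the second yields 230.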
Another solution that doesn't require defining two functions but uses a lambda expression (available since C++11):
void stripUnicode(string & str)
{
    str.erase(remove_if(str.begin(),str.end(), [](char c){return !(c>=0 && c <128);}), str.end());
}
I think it looks cleaner.
isprint depends on the locale, so the character in question must be printable in the current locale.
If you want strictly ASCII, check the range for [0..127]. If you want printable ASCII, check the range and isprint.
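The two variants described above can be sketched side by side. The names is_ascii, is_printable_ascii and keep_if are hypothetical; the predicates cast to unsigned char first so that they behave the same whether plain char is signed or unsigned.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Strictly ASCII: byte value in the range 0..127.
bool is_ascii(char c) {
    return static_cast<unsigned char>(c) < 128;
}

// Printable ASCII: in range and accepted by isprint.
bool is_printable_ascii(char c) {
    return is_ascii(c) && std::isprint(static_cast<unsigned char>(c)) != 0;
}

// Returns a copy of `s` containing only the characters for which keep() is true.
std::string keep_if(std::string s, bool (*keep)(char)) {
    s.erase(std::remove_if(s.begin(), s.end(),
                           [keep](char c) { return !keep(c); }),
            s.end());
    return s;
}
```

With is_ascii, control characters such as tabs and newlines survive; with is_printable_ascii they are removed as well, which may or may not be what the caller wants.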

C++ & Boost: encode/decode UTF-8

I'm trying to do a very simple task: take a unicode-aware wstring and convert it to a string, encoded as UTF8 bytes, and then the opposite way around: take a string containing UTF8 bytes and convert it to unicode-aware wstring.
The problem is, I need it cross-platform and I need it work with Boost... and I just can't seem to figure a way to make it work. I've been toying with
http://www.edobashira.com/2010/03/using-boost-code-facet-for-reading-utf8.html and
http://www.boost.org/doc/libs/1_46_0/libs/serialization/doc/codecvt.html
Trying to convert the code to use stringstream/wstringstream instead of files or whatever, but nothing seems to work.
For instance, in Python it would look like so:
>>> u"שלום"
u'\u05e9\u05dc\u05d5\u05dd'
>>> u"שלום".encode("utf8")
'\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d'
>>> '\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d'.decode("utf8")
u'\u05e9\u05dc\u05d5\u05dd'
What I'm ultimately after is this:
wchar_t uchars[] = {0x5e9, 0x5dc, 0x5d5, 0x5dd, 0};
wstring ws(uchars);
string s = encode_utf8(ws);
// s now holds "\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d"
wstring ws2 = decode_utf8(s);
// ws2 now holds {0x5e9, 0x5dc, 0x5d5, 0x5dd}
I really don't want to add another dependency on the ICU or something in that spirit... but to my understanding, it should be possible with Boost.
Some sample code would greatly be appreciated! Thanks
Thanks everyone, but ultimately I resorted to http://utfcpp.sourceforge.net/ -- it's a header-only library that's very lightweight and easy to use. I'm sharing a demo code here, should anyone find it useful:
inline void decode_utf8(const std::string& bytes, std::wstring& wstr)
{
    utf8::utf8to32(bytes.begin(), bytes.end(), std::back_inserter(wstr));
}
inline void encode_utf8(const std::wstring& wstr, std::string& bytes)
{
    utf8::utf32to8(wstr.begin(), wstr.end(), std::back_inserter(bytes));
}
Usage:
wstring ws(L"\u05e9\u05dc\u05d5\u05dd");
string s;
encode_utf8(ws, s);
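To show what such a library does underneath, here is a dependency-free sketch of the two directions. It is not utfcpp's implementation; it uses std::u32string rather than std::wstring so that one element is always one code point regardless of platform, and the decoder is deliberately minimal (an assumption worth stating: it rejects bad lead bytes, bad continuation bytes and truncated input, but not overlong encodings).

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Encodes Unicode code points as UTF-8 bytes.
std::string encode_utf8(const std::u32string& code_points) {
    std::string out;
    for (char32_t cp : code_points) {
        if (cp >= 0x110000 || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw std::invalid_argument("not a Unicode scalar value");
        }
        if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {             // 3 bytes
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                               // 4 bytes
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Decodes UTF-8 bytes into code points (minimal validation, see lead-in).
std::u32string decode_utf8(const std::string& bytes) {
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char b0 = static_cast<unsigned char>(bytes[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (b0 < 0x80)                { cp = b0;        extra = 0; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; extra = 1; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; extra = 2; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");
        if (bytes.size() - i - 1 < extra) {
            throw std::invalid_argument("truncated UTF-8 sequence");
        }
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char b = static_cast<unsigned char>(bytes[i + k]);
            if ((b & 0xC0) != 0x80) {
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            }
            cp = (cp << 6) | (b & 0x3F);
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

For production use, a maintained library is the safer choice, since complete UTF-8 validation has more corner cases than this sketch covers.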
There's already a boost link in the comments, but in the almost-standard C++0x, there is wstring_convert, which does this:
#include <iostream>
#include <string>
#include <locale>
#include <codecvt>
int main()
{
    wchar_t uchars[] = {0x5e9, 0x5dc, 0x5d5, 0x5dd, 0};
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    std::string s = conv.to_bytes(uchars);
    std::wstring ws2 = conv.from_bytes(s);
    std::cout << std::boolalpha
              << (s == "\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d" ) << '\n'
              << (ws2 == uchars ) << '\n';
}
Output when compiled with MS Visual Studio 2010 EE SP1 or with Clang++ 2.9:
true
true
Boost.Locale was released in Boost 1.48 (November 15th, 2011), making it easier to convert to and from UTF-8/UTF-16.
Here are some convenient examples from the docs:
string utf8_string = to_utf<char>(latin1_string,"Latin1");
wstring wide_string = to_utf<wchar_t>(latin1_string,"Latin1");
string latin1_string = from_utf(wide_string,"Latin1");
string utf8_string2 = utf_to_utf<char>(wide_string);
Almost as easy as Python encoding/decoding :)
Note that Boost.Locale is not a header-only library.
For a drop-in replacement for std::string/std::wstring that handles UTF-8, see TINYUTF8.
In combination with <codecvt>, you can convert between UTF-8 and pretty much every other encoding, and then handle the UTF-8 text through the above library.