Does POSIX regex.h support Unicode or non-ASCII characters? - c++

Hi, I am using the Standard Regex Library (regcomp, regexec, ...), but now I need to add Unicode support to my regular-expression code.
Does the Standard Regex Library support Unicode, or at least non-ASCII characters? I researched this on the web and think it does not.
My project is resource-critical, so I don't want to pull in large libraries for this (ICU and Boost.Regex).
Any help would be appreciated.

POSIX regex appears to work properly with a UTF-8 locale. I just wrote a simple test (see below) and used it to match a string containing Cyrillic characters against the regex "[[:alpha:]]" (for example), and everything works fine.
Note: the main thing to remember is that the regex functions are locale-dependent, so you must call setlocale() before using them.
#include <sys/types.h>
#include <string.h>
#include <regex.h>
#include <stdio.h>
#include <locale.h>

int main(int argc, char** argv) {
    int ret;
    regex_t reg;
    regmatch_t matches[10];

    if (argc != 3) {
        fprintf(stderr, "Usage: %s regex string\n", argv[0]);
        return 1;
    }

    setlocale(LC_ALL, ""); /* Use system locale instead of default "C" */

    if ((ret = regcomp(&reg, argv[1], 0)) != 0) {
        char buf[256];
        regerror(ret, &reg, buf, sizeof(buf));
        fprintf(stderr, "regcomp() error (%d): %s\n", ret, buf);
        return 1;
    }

    if ((ret = regexec(&reg, argv[2], 10, matches, 0)) == 0) {
        int i;
        char buf[256];
        int size;

        for (i = 0; i < sizeof(matches) / sizeof(regmatch_t); i++) {
            if (matches[i].rm_so == -1) break;
            size = matches[i].rm_eo - matches[i].rm_so;
            if (size >= sizeof(buf)) {
                fprintf(stderr, "match (%d-%d) is too long (%d)\n",
                        matches[i].rm_so, matches[i].rm_eo, size);
                continue;
            }
            buf[size] = '\0';
            printf("%d: %d-%d: '%s'\n", i, matches[i].rm_so, matches[i].rm_eo,
                   strncpy(buf, argv[2] + matches[i].rm_so, size));
        }
    }
    return 0;
}
Usage example:
$ locale
LANG=ru_RU.UTF-8
LC_CTYPE="ru_RU.UTF-8"
LC_COLLATE="ru_RU.UTF-8"
... (skip)
LC_ALL=
$ ./reg '[[:alpha:]]' ' 359 фыва'
0: 5-7: 'ф'
$
The match is two bytes long because Cyrillic letters take two bytes each in UTF-8; rm_so and rm_eo are byte offsets, not character positions.
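If you need character positions rather than byte offsets, you can count the characters in the prefix before the match. A minimal sketch, assuming the locale has already been set with setlocale() as above (char_offset is a hypothetical helper name, not part of any API):

#include <cstdlib>
#include <string>

// Number of characters in the first byte_off bytes of a multibyte string.
// Assumes byte_off falls on a character boundary and the input is valid in
// the current locale; mbstowcs() with a null destination only counts.
size_t char_offset(const std::string& s, size_t byte_off) {
    std::string prefix = s.substr(0, byte_off);
    return mbstowcs(nullptr, prefix.c_str(), 0);
}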

Basically, POSIX regexes are not Unicode-aware. You can try to use them on Unicode characters, but there can be problems with glyphs that have multiple encodings and other such issues that Unicode-aware libraries handle for you.
From the standard, IEEE Std 1003.1-2008:
Matching shall be based on the bit pattern used for encoding the character, not on the graphic representation of the character. This means that if a character set contains two or more encodings for a graphic symbol, or if the strings searched contain text encoded in more than one codeset, no attempt is made to search for any other representation of the encoded symbol. If that is required, the user can specify equivalence classes containing all variations of the desired graphic symbol.
Maybe libpcre would work for you? It's slightly heavier than POSIX regexes, but I would think it lighter than ICU or Boost.
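For illustration, here is a minimal sketch using PCRE2 (libpcre's successor), assuming libpcre2-8 is installed and linked with -lpcre2-8. PCRE2_UTF makes the matcher treat the subject as UTF-8, and PCRE2_UCP makes classes like [[:alpha:]] use Unicode properties instead of the locale:

#define PCRE2_CODE_UNIT_WIDTH 8
#include <pcre2.h>
#include <cstdio>
#include <cstring>

int main() {
    int errorcode;
    PCRE2_SIZE erroroffset;
    const char *subject = " 359 фыва";
    // PCRE2_UTF: subject/pattern are UTF-8; PCRE2_UCP: Unicode character classes
    pcre2_code *re = pcre2_compile((PCRE2_SPTR)"[[:alpha:]]+",
                                   PCRE2_ZERO_TERMINATED,
                                   PCRE2_UTF | PCRE2_UCP,
                                   &errorcode, &erroroffset, NULL);
    if (!re) return 1;
    pcre2_match_data *md = pcre2_match_data_create_from_pattern(re, NULL);
    if (pcre2_match(re, (PCRE2_SPTR)subject, strlen(subject), 0, 0, md, NULL) > 0) {
        PCRE2_SIZE *ov = pcre2_get_ovector_pointer(md);
        // ov[0]/ov[1] are byte offsets of the whole match
        printf("matched bytes %u-%u: '%.*s'\n", (unsigned)ov[0], (unsigned)ov[1],
               (int)(ov[1] - ov[0]), subject + ov[0]);
    }
    pcre2_match_data_free(md);
    pcre2_code_free(re);
    return 0;
}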

If you really mean "Standard", i.e. std::regex from C++11, then all you need to do is switch to std::wregex (and std::wstring of course).
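A minimal sketch of that route; whether non-ASCII letters match [[:alpha:]] depends on the locale's wide-character classification, so the locale is set first (an assumption, just as with the POSIX example above):

#include <regex>
#include <string>
#include <iostream>
#include <locale>

int main() {
    std::locale::global(std::locale(""));  // use the system locale
    std::wcout.imbue(std::locale());       // so wcout can encode the output
    std::wstring text = L" 359 фыва";
    std::wregex re(L"[[:alpha:]]+");       // constructed after the locale is set
    std::wsmatch m;
    if (std::regex_search(text, m, re))
        std::wcout << L"matched: " << m.str() << L'\n';
}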

Related

On Windows, stat and GetFileAttributes fail for paths containing strange characters

The code below demonstrates how stat and GetFileAttributes fail when the path contains some strange (but valid) characters.
As a workaround, I would use the 8.3 DOS file name. But this does not work when the drive has 8.3 names disabled.
(8.3 names are disabled with the fsutil command: fsutil behavior set disable8dot3 1).
Is it possible to get stat and/or GetFileAttributes to work in this case?
If not, is there another way of determining whether or not a path is a directory or file?
#include "stdafx.h"
#include <sys/stat.h>
#include <string>
#include <Windows.h>
#include <atlpath.h>
std::wstring s2ws(const std::string& s)
{
int len;
int slength = (int)s.length() + 1;
len = MultiByteToWideChar(CP_ACP, 0, s.c_str(), slength, 0, 0);
wchar_t* buf = new wchar_t[len];
MultiByteToWideChar(CP_ACP, 0, s.c_str(), slength, buf, len);
std::wstring r(buf);
delete[] buf;
return r;
}
// The final characters in the path below are 0xc3 (Ã) and 0x3f (?).
// Create a test directory with the name à and set TEST_DIR below to your test directory.
const char* TEST_DIR = "D:\\tmp\\VisualStudio\\TestProject\\ConsoleApplication1\\test_data\\Ã";
int main()
{
std::string testDir = TEST_DIR;
// test stat and _wstat
struct stat st;
const auto statSucceeded = stat(testDir.c_str(), &st) == 0;
if (!statSucceeded)
{
printf("stat failed\n");
}
std::wstring testDirW = s2ws(testDir);
struct _stat64i32 stW;
const auto statSucceededW = _wstat(testDirW.data(), &stW) == 0;
if (!statSucceededW)
{
printf("_wstat failed\n");
}
// test PathIsDirectory
const auto isDir = PathIsDirectory(testDirW.c_str()) != 0;
if (!isDir)
{
printf("PathIsDirectory failed\n");
}
// test GetFileAttributes
const auto fileAttributes = ::GetFileAttributes(testDirW.c_str());
const auto getFileAttributesWSucceeded = fileAttributes != INVALID_FILE_ATTRIBUTES;
if (!getFileAttributesWSucceeded)
{
printf("GetFileAttributes failed\n");
}
return 0;
}
The problem you have encountered comes from using the MultiByteToWideChar function with CP_ACP, which can default to a code page that does not support some characters. If you change the default system code page to UTF-8, your code will work. Since you cannot tell your clients what code page to use, you can use a third-party library such as International Components for Unicode (ICU) to convert from the host code page to UTF-16.
I ran your code using console code page 65001 and VS2015, and it worked as written. I also added positive printfs to verify that it did work.
Don't start with a narrow string literal and try to convert it, start with a wide string literal - one that represents the actual filename. You can use hexadecimal escape sequences to avoid any dependency on the encoding of the source code.
If the actual code doesn't use string literals, the best resolution depends on the situation; for example, if the file name is being read from a file, you need to make sure that you know what encoding the file is in and perform the conversion accordingly.
If the actual code reads the filename from the command line arguments, you can use wmain() instead of main() to get the arguments as wide strings.
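For example, a minimal sketch of the wmain route (Windows-only; GetFileAttributesW is used here as a stand-in for whatever the real code does with the path):

#include <Windows.h>
#include <cstdio>

// Receive the path as a wide string straight from the command line,
// so no narrow-to-wide conversion is needed at all.
int wmain(int argc, wchar_t* argv[]) {
    if (argc < 2) return 1;
    const DWORD attrs = ::GetFileAttributesW(argv[1]);
    if (attrs == INVALID_FILE_ATTRIBUTES) {
        wprintf(L"GetFileAttributesW failed for '%ls'\n", argv[1]);
        return 1;
    }
    wprintf(L"'%ls' is a %ls\n", argv[1],
            (attrs & FILE_ATTRIBUTE_DIRECTORY) ? L"directory" : L"file");
    return 0;
}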

pcre2 UTF32 usage

I've just spent some time figuring out the pcre2 interface and think I've got it for the most part. I want to support UTF-32; pcre2 is already built with support, and the code unit width has been set to 32.
The code below is what I've got working with the code unit width set to 8.
How do I change this to work with UTF-32?
#include "gtest/gtest.h"
#include <pcre2.h>
TEST(PCRE2, example) {
//iterate over all matches in a string
PCRE2_SPTR subject = (PCRE2_SPTR) string("this is it").c_str();
PCRE2_SPTR pattern = (PCRE2_SPTR) string("([a-z]+)|\\s").c_str();
int errorcode;
PCRE2_SIZE erroroffset;
pcre2_code *re = pcre2_compile(pattern, PCRE2_ZERO_TERMINATED, PCRE2_ANCHORED | PCRE2_UTF, &errorcode,
&erroroffset, NULL);
if (re) {
uint32_t groupcount = 0;
pcre2_pattern_info(re, PCRE2_INFO_BACKREFMAX, &groupcount);
pcre2_match_data *match_data = pcre2_match_data_create_from_pattern(re, NULL);
uint32_t options_exec = PCRE2_NOTEMPTY;
PCRE2_SIZE subjectlen = strlen((const char *) subject);
errorcode = pcre2_match(re, subject, subjectlen, 0, options_exec, match_data, NULL);
while (errorcode >= 0) {
PCRE2_UCHAR *result;
PCRE2_SIZE resultlen;
for (int i = 0; i <= groupcount; i++) {
pcre2_substring_get_bynumber(match_data, i, &result, &resultlen);
printf("Matched:%.*s\n", (int) resultlen, (const char *) result);
pcre2_substring_free(result);
}
// Advance through subject
PCRE2_SIZE *ovector = pcre2_get_ovector_pointer(match_data);
errorcode = pcre2_match(re, subject, subjectlen, ovector[1], options_exec, match_data, NULL);
}
pcre2_match_data_free(match_data);
pcre2_code_free(re);
} else {
// Syntax error in the regular expression at erroroffset
PCRE2_UCHAR error[256];
pcre2_get_error_message(errorcode, error, sizeof(error));
printf("PCRE2 compilation failed at offset %d: %s\n", (int) erroroffset, (char *) error);
}
Presumably subject and pattern need to be converted somehow, and result would be of the same type? I couldn't find anything in the pcre2 header to indicate support for that.
And I guess subjectlen would no longer simply be strlen.
Finally, I put this example together from going through some of the docs and the header; is there anything else I should be doing or worth knowing?
I left pcre2 in the end; after evaluating RE2, PCRE2 and ICU, I chose ICU. Its Unicode support (from what I've seen so far) is much more complete than the other two's. It also provides a very clean API and lots of utilities for manipulation. Importantly, like PCRE2 it provides a Perl-style regex engine which works great with Unicode out of the box.
If you set the code unit width properly, this could be the problem:
(PCRE2_SPTR) subject_str.c_str();
Converting the c_str() to PCRE2_SPTR doesn't make the string UTF-32.
If you are unsure about setting the proper code unit width, you can force 32 bit by adding the _32 suffix to everything, e.g. pcre2_compile_32.
It depends on which character type you are going to use and which system you are targeting.
The basic unit of std::string is char, which is generally 8 bits and supports UTF-8 (this can differ depending on the implementation/system). So you can't use std::string("some string") and similar code when dealing with UTF-32 on such systems.
PCRE2_CODE_UNIT_WIDTH must match the bit width of the basic character unit you are going to use: for an 8-bit unit it should be defined as 8, for a 16-bit unit as 16, etc.
On GNU/Linux you can use wchar_t, i.e. std::wstring, which is 32 bits wide and UTF-32 capable. On Windows wchar_t is 16 bits (with UTF-16).
In C++11 and later you can use char32_t, i.e. std::u32string, which is at least 32 bits (you will have to make sure it's exactly 32 bits on your target system).
I have a wrapper for PCRE2 in C++ which contains some examples (in src directory) on how to handle UTF-16 and UTF-32 modes.
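For reference, a minimal sketch of the 32-bit API under those assumptions (PCRE2 built with 32-bit support, linked with -lpcre2-32); note that lengths and offsets are in code units, not bytes:

#define PCRE2_CODE_UNIT_WIDTH 32
#include <pcre2.h>
#include <cstdio>
#include <string>

int main() {
    std::u32string subject = U"this is it";
    std::u32string pattern = U"([a-z]+)";
    int errorcode;
    PCRE2_SIZE erroroffset;
    pcre2_code *re = pcre2_compile(
        reinterpret_cast<PCRE2_SPTR>(pattern.c_str()),
        pattern.length(),                  // length in 32-bit code units
        PCRE2_UTF, &errorcode, &erroroffset, NULL);
    if (!re) return 1;
    pcre2_match_data *md = pcre2_match_data_create_from_pattern(re, NULL);
    int rc = pcre2_match(re, reinterpret_cast<PCRE2_SPTR>(subject.c_str()),
                         subject.length(), 0, 0, md, NULL);
    if (rc > 0) {
        PCRE2_SIZE *ov = pcre2_get_ovector_pointer(md);
        printf("match at code units %u-%u\n", (unsigned)ov[0], (unsigned)ov[1]);
    }
    pcre2_match_data_free(md);
    pcre2_code_free(re);
    return 0;
}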

What is the correct way of processing different strings encodings via c++ main char** args?

I need some clarifications.
The problem is that I have a Windows program written in C++ which uses the Windows-specific wmain function, which accepts wchar_t** as its args. So there is an opportunity to pass whatever you like as command-line parameters to such a program: for example, Chinese symbols, Japanese ones, etc.
To be honest, I have no information about the encoding this function is usually used with. Probably UTF-32, or even UTF-16.
So, the questions:
What is the non-Windows-specific, i.e. Unix/Linux, way to achieve this with the standard main function? My first thought was to use UTF-8 encoded input strings with some kind of locale specified?
Can somebody give a simple example of such a main function? How can a std::string hold Chinese symbols?
Can we operate on Chinese symbols encoded in UTF-8 and contained in std::strings as usual, just accessing each char (byte) like string_object[i]?
Disclaimer: All Chinese words provided by GOOGLE translate service.
1) Just proceed as normal using a normal std::string. A std::string can hold any character encoding, and argument processing is simple pattern matching, so on a Chinese computer with the Chinese version of the program installed, all it needs to do is compare the Chinese versions of the flags to what the user inputs.
2) For example:
#include <string>
#include <vector>
#include <iostream>

std::string arg_switch = "开关";
std::string arg_option = "选项";
std::string arg_option_error = "缺少参数选项";

int main(int argc, char* argv[])
{
    const std::vector<std::string> args(argv + 1, argv + argc);

    bool do_switch = false;
    std::string option;

    for(auto arg = args.begin(); arg != args.end(); ++arg)
    {
        if(*arg == "--" + arg_switch)
            do_switch = true;
        else if(*arg == "--" + arg_option)
        {
            if(++arg == args.end())
            {
                // option needs a value - not found
                std::cout << arg_option_error << '\n';
                return 1;
            }
            option = *arg;
        }
    }

    std::cout << arg_switch << ": " << (do_switch ? "on":"off") << '\n';
    std::cout << arg_option << ": " << option << '\n';

    return 0;
}
Usage:
./program --开关 --选项 wibble
Output:
开关: on
选项: wibble
3) No.
For UTF-8/UTF-16 data we need to use special libraries like ICU.
For character-by-character processing you need to use, or convert to, UTF-32; see the sketch below.
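For instance, one way to get to UTF-32 for per-character work is std::wstring_convert (a sketch; note this facility is deprecated since C++17, with ICU being the more robust choice):

#include <codecvt>
#include <locale>
#include <string>

// Convert a UTF-8 std::string to UTF-32 so each element is one code point.
std::u32string to_utf32(const std::string& utf8) {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.from_bytes(utf8);  // throws std::range_error on invalid UTF-8
}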
In short:
#include <locale.h>

int main(int argc, char **argv) {
    setlocale(LC_CTYPE, "");
    // ...
}
http://unixhelp.ed.ac.uk/CGI/man-cgi?setlocale+3
And then you use multibyte string functions. You can still use a normal std::string for storing multibyte strings, but beware that characters in them may span multiple array cells. After successfully setting the locale, you can also use the wide streams (wcin, wcout, wcerr) to read and write wide strings on the standard streams.
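A minimal sketch of that combination (assuming a UTF-8 system locale):

#include <iostream>
#include <locale>
#include <string>

int main() {
    std::locale::global(std::locale(""));  // system locale, e.g. UTF-8
    std::wcin.imbue(std::locale());
    std::wcout.imbue(std::locale());
    std::wstring line;
    std::getline(std::wcin, line);
    // line.size() counts wide characters, not the bytes that were read
    std::wcout << L"read " << line.size() << L" characters\n";
}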
1) With Linux, you'd use the standard main() and standard char. It would use UTF-8 encoding, so Chinese characters would be included in the string with a multibyte encoding.
Edit: sorry, yes: you have to set the default "" locale as in the setlocale() answer above, as well as cout.imbue().
2) All the classic main() examples would be good examples. As said, Chinese characters would be included in the string with a multibyte encoding. So if you cout such a string with the default UTF-8 locale, the cout stream would interpret the special UTF-8 encoded sequences, knowing it has to aggregate several bytes (three for typical Chinese characters) to produce the Chinese output.
3) You can operate on the strings as usual. There are some issues, however, if you take the string length, for example: there is a difference between the size in memory (e.g. 3 bytes) and the characters the user sees (e.g. only 1). The same applies if you move a pointer forward or backward: you have to make sure you interpret the multibyte encoding correctly, so as not to produce an invalid encoding.
You could be interested in this other SO question.
Wikipedia explains the logic of the UTF-8 multibyte encoding. From that article you'll understand that any byte u is the lead byte of a multibyte-encoded character if:
( ((u & 0xE0) == 0xC0)
|| ((u & 0xF0) == 0xE0)
|| ((u & 0xF8) == 0xF0)
|| ((u & 0xFC) == 0xF8)
|| ((u & 0xFE) == 0xFC) )
It is followed by one or more continuation bytes satisfying:
((u & 0xC0) == 0x80)
All other bytes are plain ASCII (i.e. single-byte characters).
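Putting those bit patterns to work: a minimal sketch that counts characters (rather than bytes) by skipping continuation bytes (utf8_length is a hypothetical helper; it assumes the input is valid UTF-8):

#include <cstddef>
#include <string>

// Count UTF-8 code points: every byte that is NOT of the form 10xxxxxx
// starts a new character (either ASCII or the lead byte of a sequence).
std::size_t utf8_length(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char u : s)
        if ((u & 0xC0) != 0x80)
            ++count;
    return count;
}
// e.g. utf8_length("\xE5\xBC\x80\xE5\x85\xB3") == 2  ("开关", 6 bytes)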

Renaming File With Unicode

How would I place a Unicode character into a file name?
I have an ostringstream that I use to build the file name for an ofstream, but I cannot use Unicode characters. What would be the simplest way of doing this? Renaming the file in a Unicode format? Please explain how I would do so.
Your question is unclear. If you want to place Unicode characters: every string/stream class in the standard library has a wide-character equivalent: std::string/std::wstring, std::stringstream/std::wstringstream. If you use std::wstringstream, here is how you would put wide characters into it:
std::wstringstream wideStream;
wideStream << L"Hello, world";
std::wstring wideString = wideStream.str();
Hope this helps.
/* This program attempts to rename a file named
 * CRT_RENAMER.OBJ to CRT_RENAMER.JBO. For this operation
 * to succeed, a file named CRT_RENAMER.OBJ must exist and
 * a file named CRT_RENAMER.JBO must not exist.
 */
#include <stdio.h>

int main(void)
{
    int result;
    /* 'new' is a C++ keyword, so use different variable names */
    char oldName[] = "CRT_RENAMER.OBJ", newName[] = "CRT_RENAMER.JBO";

    /* Attempt to rename file: */
    result = rename(oldName, newName);
    if (result != 0)
        printf("Could not rename '%s'\n", oldName);
    else
        printf("File '%s' renamed to '%s'\n", oldName, newName);
    return 0;
}
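On Windows, the narrow rename() goes through the ANSI code page, so for arbitrary Unicode names you would use the wide-character CRT counterpart instead. A minimal sketch (assumption: MSVC's _wrename, which takes wchar_t paths):

#include <stdio.h>

int main(void)
{
    /* Wide literals can carry any Unicode characters in the names. */
    const wchar_t *oldName = L"CRT_RENAMER.OBJ";
    const wchar_t *newName = L"CRT_RENAMÉR.JBO";
    if (_wrename(oldName, newName) != 0)
        printf("Could not rename file\n");
    else
        printf("File renamed\n");
    return 0;
}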

How to test a u32string for letters only (with locale)

I'm writing a compiler (for my own programming language) and I want to allow users to use any of the characters in the Unicode letter categories in identifiers (modern languages like Go already allow such syntax).
I've read a lot about character encoding in C++11, and based on all the information I've found, it should be fine to use UTF-32 encoding (it is fast to iterate over in a lexer and has better support than UTF-8 in C++).
C++ has the isalpha function. How can I test a char32_t to see whether it is a letter (a Unicode code point classified as a letter in any language)?
Is it even possible?
Use ICU to iterate over the string and check whether the appropriate Unicode properties are fulfilled. Here is an example in C that checks whether the UTF-8 command line argument is a valid identifier:
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unicode/uchar.h>
#include <unicode/utf8.h>

int main(int argc, char **argv) {
    if (argc != 2) return EXIT_FAILURE;
    const char *const str = argv[1];
    int32_t off = 0;
    // U8_NEXT has a bug causing length < 0 to not work for characters in [U+0080, U+07FF]
    const size_t actual_len = strlen(str);
    if (actual_len > INT32_MAX) return EXIT_FAILURE;
    const int32_t len = actual_len;
    if (!len) return EXIT_FAILURE;
    UChar32 ch = -1;
    U8_NEXT(str, off, len, ch);
    if (ch < 0 || !u_isIDStart(ch)) return EXIT_FAILURE;
    while (off < len) {
        U8_NEXT(str, off, len, ch);
        if (ch < 0 || !u_isIDPart(ch)) return EXIT_FAILURE;
    }
}
Note that ICU here uses the Java definitions, which are slightly different from those in UAX #31. In a real application you might also want to normalize to NFC beforehand.
There is a u_isalpha function in the ICU project; I think you can use that.
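A minimal sketch of that idea (assumes ICU is installed; link with -licuuc). u_isalpha takes a UChar32 code point, so char32_t values can be passed directly:

#include <unicode/uchar.h>
#include <string>

// True if every code point in the string is a Unicode letter (category L*).
bool is_letters_only(const std::u32string& s) {
    for (char32_t c : s)
        if (!u_isalpha(static_cast<UChar32>(c)))
            return false;
    return !s.empty();
}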