How to match ANSI data encoded as Unicode with regex - regex

I have some ANSI data encoded as UTF16 Little Endian.
It therefore looks like a \0 b \0 c \0. I suspect this is asking way too much of regex, but just on the off-chance, is there any way of matching this data specifically?
I am able to use ^[\w\x00]+$ but that doesn't really ensure that the null bytes are in the right place. Is there any way to have an alternating pattern match, or match based on character position mod 2 so that the even positions must be filled with null bytes, and null bytes are not allowed elsewhere?
If not I'll write a little bit of manual code, just would be helpful to know.
Thank you.

Related

How to search a non-ASCII character in a c++ string?

string s="x1→(y1⊕y2)∧z3";
for(auto i=s.begin(); i!=s.end();i++){
if(*i=='→'){
...
}
}
The char comparing is definitely wrong, what's the correct way to do it? I am using vs2013.
First you need some basic understanding of how programs handle Unicode. Otherwise, you should read up, I quite like this post on Joel on Software.
You actually have 2 problems here:
Problem #1: getting the string into your program
Your first problem is getting that actual string in your string s. Depending on the encoding of your source code file, MSVC may corrupt any non-ASCII characters in that string.
either save your C++ file as UTF-16 (which Windows confusingly calls Unicode), and use whcar_t and wstring (effectively encoding the expression as UTF-16). Saving as UTF-8 with BOM will also work. Any other encoding and your L"..." character literals will contain the wrong characters.
Note that other platforms may define wchar_t as 4 bytes instead of 2. So the handling of characters above U+FFFF will be non-portable.
In all other cases, you can't just write those characters in your source file. The most portable way is encoding your string literals as UTF-8, using \x escape codes for all non-ASCII characters. Like this: "x1\xe2\x86\x92a\xe2\x8a\x95" "b)" rather than "x1→(a⊕b)".
And yes, that's as unreadable and cumbersome as it gets. The root problem is MSVC doesn't really support using UTF-8. You can go through this question here for an overview: How to create a UTF-8 string literal in Visual C++ 2008 .
But, also consider how often those strings will actually show up in your source code.
Problem #2: finding the character
(If you're using UTF-16, you can just find the L'→' character, since that character is representable as one whcar_t. For characters above U+FFFF you'll have to use the wide version of the workaround below.)
It's impossible to define a char representing the arrow character. You can however with a string: "\xe2\x86\x92". (that's a string with 3 chars for the arrow, and the \0 terminator.
You can now search for this string in your expression:
s.find("\xe2\x86\x92");
The UTF-8 encoding scheme guarantees this always finds the correct character, but keep in mind this is an offset in bytes.
My comment is too large, so i am submitting it as an answer.
The problem is that everybody is concentrating on the issue of different encodings that Unicode may use (UTF-8, UTF-16, UCS2, etc). But your problems here will just begin.
There is also an issue of composite characters, which will really mess up any search that you are trying to make.
Let's say you are looking for a character 'é', you find it in Unicode as U+00E9 and do your search, but it is not guaranteed that this is the only way to represent this character. The document may also contain U+0065 U+0301 combination. Which is actually exactly the same character.
Yes, not just "character that looks the same", but it is exactly the same, so any software and even some programming libraries will freely convert from one to another without even telling you.
So if you wish to make a search, that is robust, you will need something that represents not just different encodings of Unicode, but Unicode characters themselves with equality between Composite and Ready-Made chars.

Does Re2 use string size or null termination?

The title is pretty much it. If a standard C++ string with UTF-8 characters has no zero bytes does the scanning terminate at the end of the string defined by it's size? Conversely, if the string has a zero byte does scanning stop at that byte, or continue to the full length of the string?
I've look at the Re2.h file and it does not seem to address this issue.
A std::string containing UTF-8 characters can´t have 0-bytes a part of the text
(only as termination), because UTF-8 doesn´t allow 0´s anywhere.
And given you´re using something C++11-compliant, a terminating 0 is guaranteed
(doesn´t matter if you use data() or c_str(). And data is the original data, so...).
See http://en.cppreference.com/w/cpp/string/basic_string/data
or the standard (21.4.7.1/1 etc.).
=> The processing of a string will stop at the 0
The interface to Re2 seems to use std::string, which almost
certainly means that it uses the begin and the end of the
string, and that null characters are characters like any other.
(The are, after all, defined in Unicode and in UTF-8.) Of
course, '\0' is in the category control characters, so it won't
match something like "\pL" (which matches a letter). But it
should match "\pC". And of course, '\u0000' and other representations of the null character.

Is it possible to 'trim' trailing spaces/tabs from a string in an arbitrary encoding using ICU without doing any conversions

Specifically, given the following:
A pointer to a buffer containing string data in some encoding X
supported by ICU
The length of the data in the buffer, in bytes
The encoding of the buffer (i.e. X)
Can I compute the length of the string, minus the trailing space/tab characters, without actually converting it into ICU's internal encoding first, then converting back? (this itself could be problematic due to unicode normalizations).
For certain encodings, such as any ascii-based encoding along with utf-8/16/32 the solution is pretty simple, just iterate from the back of the string comparing either 1/2/4 bytes at a time against the two constants.
For others it could be harder (variable-length encodings come to mind). I would like this to be as efficient as possible.
For a large subset of encodings, and for the limited set of U+0020 SPACE and HORIZONTAL TAB U+0009, this is pretty simple.
In ASCII, single-byte Windows code pages, and single-byte ISO code pages, these characters all have the same value. You can simply work backwards, byte-by-byte, lopping them off as long as the value is either 9 or 32.
This approach also works for UTF-8, which has the nice property that a byte less than 128 is always that ASCII character. You don't have to wonder whether it's a lead byte or a continuation byte, as those always have the high bit set.
Given UTF-16, you work two bytes at a time, looking for 0x0009 and 0x0020, being careful to handle byte order. Like UTF-8, UTF-16 has the nice property that if you see this value, you don't have to wonder if it's part of a surrogate pair, as those always have a distinct value.
The problematic cases are the variable-byte encodings that don't give you the assurance that a given unit is unique. If you see a byte with a value 9, you don't necessarily know whether it's a tab character or a random byte from a multibyte encoding. Even for some of these, however, it may be possible that the specific values you care about (9 and 32) are unique. For example, looking at Windows code page 950, it seems that lead bytes have the high value set, and tail bytes steer clear of the lower values (it would take a lot of checking to be absolutely sure). So for your limited case, this might be sufficient.
For the general problem of stripping out an arbitrary set of characters from absolutely any encoding, you need to parse the string according to the rules of that encoding (as well as knowing all the character mappings). For the general case, it's almost certainly best to convert the string to some Unicode encoding, do the trimming, and then convert back. This should round-trip correctly if you're careful to use the K normalization forms.
I use the rather simplistic STL approach of:
std::string mystring;
mystring.erase(mystring.find_last_not_of(" \n\r\t")+1);
Which seems to work for all my needs so far (your mileage may vary), but after years of using it it seems to do the job:)
Let me know if you need more information:)
If you restrict "arbitrary encoding" requirement to "any encoding that uses same codevalue for space and tab as ascii" which is still rather general you even don't need ICU at all. boost::trim_right or boost::trim_right_if is all you need.
http://www.boost.org/doc/libs/1_55_0/doc/html/string_algo/usage.html#idp206822440

Distinguishing between string formats

Having an untyped pointer pointing to some buffer which can hold either ANSI or Unicode string, how do I tell whether the current string it holds is multibyte or not?
Unless the string itself contains information about its format (e.g. a header or a byte order mark) then there is no foolproof way to detect if a string is ANSI or Unicode. The Windows API includes a function called IsTextUnicode() that basically guesses if a string is ANSI or Unicode, but then you run into this problem because you're forced to guess.
Why do you have an untyped pointer to a string in the first place? You must know exactly what and how your data is representing information, either by using a typed pointer in the first place or provide an ANSI/Unicode flag or something. A string of bytes is meaningless unless you know exactly what it represents.
Unicode is not an encoding, it's a mapping of code points to characters. The encoding is UTF8 or UCS2, for example.
And, given that there is zero difference between ASCII and UTF8 encoding if you restrict yourself to the lower 128 characters, you can't actually tell the difference.
You'd be better off asking if there were a way to tell the difference between ASCII and a particular encoding of Unicode. And the answer to that is to use statistical analysis, with the inherent possibility of inaccuracy.
For example, if the entire string consists of bytes less than 128, it's ASCII (it could be UTF8 but there's no way to tell and no difference in that case).
If it's primarily English/Roman and consists of lots of two-byte sequences with a zero as one of the bytes, it's probably UTF16. And so on. I don't believe there's a foolproof method without actually having an indicator of some sort (e.g., BOM).
My suggestion is to not put yourself in the position where you have to guess. If the data type itself can't contain an indicator, provide different functions for ASCII and a particular encoding of Unicode. Then force the work of deciding on to your client. At some point in the calling hierarchy, someone should now the encoding.
Or, better yet, ditch ASCII altogether, embrace the new world and use Unicode exclusively. With UTF8 encoding, ASCII has exactly no advantages over Unicode :-)
In general you can't
You could check for the pattern of zeros - just one at the end probably means ansi 'c', every other byte a zero probably means ansi text as UTF16, 3zeros might be UTF32

How does the UTF-8 support of TinyXML work?

I'm using TinyXML to parse/build XML files. Now, according to the documentation this library supports multibyte character sets through UTF-8. So far so good I think. But, the only API that the library provides (for getting/setting element names, attribute names and values, ... everything where a string is used) is through std::string or const char*. This has me doubting my own understanding of multibyte character set support. How can a string that only supports 8-bit characters contain a 16 bit character (unless it uses a code page, which would negate the 'supports Unicode' claim)? I understand that you could theoretically take a 16-bit code point and split it over 2 chars in a std::string, but that wouldn't transform the std::string to a 'Unicode' string, it would make it invalid for most purposes and would maybe accidentally work when written to a file and read in by another program.
So, can somebody explain to me how a library can offer an '8-bit interface' (std::string or const char*) and still support 'Unicode' strings?
(I probably mixed up some Unicode terminology here; sorry about any confusion coming from that).
First, utf-8 is stored in const char * strings, as #quinmars said. And it's not only a superset of 7-bit ASCII (code points <= 127 always encoded in a single byte as themselves), it's furthermore careful that bytes with those values are never used as part of the encoding of the multibyte values for code points >= 128. So if you see a byte == 44, it's a '<' character, etc. All of the metachars in XML are in 7-bit ASCII. So one can just parse the XML, breaking strings where the metachars say to, sticking the fragments (possibly including non-ASCII chars) into a char * or std::string, and the returned fragments remain valid UTF-8 strings even though the parser didn't specifically know UTF-8.
Further (not specific to XML, but rather clever), even more complex things genrally just work (tm). For example, if you sort UTF-8 lexicographically by bytes, you get the same answer as sorting it lexicographically by code points, despite the variation in # of bytes used, because the prefix bytes introducing the longer (and hence higher-valued) code points are numerically greater than those for lesser values).
UTF-8 is compatible to 7-bit ASCII code. If the value of a byte is larger then 127, it means a multibyte character starts. Depending on the value of the first byte you can see how many bytes the character will take, that can be 2-4 bytes including the first byte (technical also 5 or 6 are possible, but they are not valid utf-8). Here is a good resource about UTF-8: UTF-8 and Unicode FAQ, also the wiki page for utf8 is very informative. Since UTF-8 is char based and 0-terminated, you can use the standard string functions for most things. The only important thing is that the character count can differ from the byte count. Functions like strlen() return the byte count but not necessarily the character count.
By using between 1 and 4 chars to encode one Unicode code point.