How to make std::regex match UTF-8 - C++

I would like a pattern like ".c" to match "." against any single UTF-8 character followed by 'c', using std::regex.
I've tried this under Microsoft C++ and g++ and get the same result each time: the "." only matches a single byte.
Here's my test case:
#include <stdio.h>
#include <iostream>
#include <string>
#include <regex>
using namespace std;
int main(int argc, char** argv)
{
    // make a string with 3 UTF8 characters
    const unsigned char p[] = { 'a', 0xC2, 0x80, 'c', 0 };
    string tobesearched((char*)p);
    // want to match the UTF8 character before c
    string pattern(".c");
    regex re(pattern);
    std::smatch match;
    bool r = std::regex_search(tobesearched, match, re);
    if (r)
    {
        // m.size() will be bytes, and we expect 3
        // expect 0xC2, 0x80, 'c'
        string m = match[0];
        cout << "match length " << m.size() << endl;
        // but we only get 2, we get the 0x80 and the 'c'.
        // so it's matching on single bytes and not utf8
        // code here is just to dump out the byte values.
        for (int i = 0; i < m.size(); ++i)
        {
            int c = m[i] & 0xff;
            printf("%02X ", c);
        }
        printf("\n");
    }
    else
        cout << "not matched\n";
    return 0;
}
I wanted the pattern ".c" to match 3 bytes of my tobesearched string: the 2-byte UTF-8 character followed by 'c'.

Some regex flavours support \X, which will match a single Unicode character; that character may consist of a number of bytes, depending on the encoding. Regex engines commonly convert the bytes of the subject string into an encoding the engine is designed to work with, so you shouldn't have to worry about the actual encoding (whether it is US-ASCII, UTF-8, UTF-16 or UTF-32).
Another option is \uFFFF, where FFFF is the hexadecimal index of the character in the Unicode charset. With that, you could create a ranged match inside a character class, i.e. [\u0000-\uFFFF]. Again, it depends on what the regex flavour supports. There is another variant of \u in \x{...}, which does the same thing except that the Unicode character index must be supplied inside curly braces and need not be padded, e.g. \x{65}.
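Note that std::regex's default ECMAScript grammar does not support \X or the \x{...} form. One workaround (a sketch, not part of the answer above) is to convert the UTF-8 bytes to a wide string and use std::wregex, so that "." matches one whole code point:
#include <codecvt>
#include <iostream>
#include <locale>
#include <regex>
#include <string>

int main()
{
    // Same 3-character UTF-8 input as the question: 'a', U+0080, 'c'
    const unsigned char p[] = { 'a', 0xC2, 0x80, 'c', 0 };
    std::string utf8(reinterpret_cast<const char*>(p));

    // Convert UTF-8 to a wide string so each element is one code point.
    // std::wstring_convert/std::codecvt_utf8 are deprecated since C++17 but
    // still available; on Windows (16-bit wchar_t) this only covers the BMP.
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    std::wstring wide = conv.from_bytes(utf8);

    std::wregex re(L".c");
    std::wsmatch m;
    if (std::regex_search(wide, m, re))
        std::wcout << L"match length " << m[0].length() << L"\n"; // prints 2: U+0080 and 'c'
}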
Edit: This website is amazing for learning more about regex across various flavours https://www.regular-expressions.info
Edit 2: To match any Unicode-exclusive character, i.e. excluding the 1-byte characters in the ASCII table, you can try "[\x{80}-\x{10FFFF}]", i.e. any character with a value from 128 (the first code point outside the ASCII range) up to 0x10FFFF (the last code point in the Unicode charset, which in UTF-8 takes up to a 4-byte representation; the scheme originally allowed up to 6 bytes and may change in future).
A loop through the individual bytes would be more efficient, though (a sketch follows after these steps):
If the lead bit is 0, i.e. if the byte's signed value is > -1, it is a 1-byte character. Skip to the next byte and start again.
Else if the lead bits are 11110, i.e. if its signed value is > -17, n=4.
Else if the lead bits are 1110, i.e. if its signed value is > -33, n=3.
Else if the lead bits are 110, i.e. if its signed value is > -65, n=2.
Optionally, check that the next n-1 bytes each start with 10; i.e. if any of them has a signed value > -65 (so it does not start with 10), the input is not valid UTF-8.
You now know that these n bytes constitute a single multi-byte (unicode-exclusive) character. So, if the NEXT byte is 'c', i.e. == 99, you can say it matched - return true.
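A minimal sketch of that loop, using unsigned comparisons instead of the signed thresholds above, assuming well-formed UTF-8 and following the answer's rule that only multi-byte characters count (matchAnyUtf8BeforeC is a made-up name):
#include <cstddef>
#include <iostream>
#include <string>

// Returns true if the string contains a multi-byte UTF-8 character
// immediately followed by 'c'.
bool matchAnyUtf8BeforeC(const std::string& s)
{
    for (std::size_t i = 0; i < s.size(); )
    {
        unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t n;
        if (lead < 0x80)       n = 1; // 0xxxxxxx: ASCII, single byte
        else if (lead >= 0xF0) n = 4; // 11110xxx: 4-byte sequence
        else if (lead >= 0xE0) n = 3; // 1110xxxx: 3-byte sequence
        else if (lead >= 0xC0) n = 2; // 110xxxxx: 2-byte sequence
        else                   return false; // stray continuation byte: invalid

        if (n > 1 && i + n < s.size() && s[i + n] == 'c')
            return true; // multi-byte character followed by 'c'
        i += n;
    }
    return false;
}

int main()
{
    const unsigned char p[] = { 'a', 0xC2, 0x80, 'c', 0 };
    std::cout << (matchAnyUtf8BeforeC(reinterpret_cast<const char*>(p))
                      ? "matched\n" : "not matched\n");
}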

Related

Difference between converting int to char by (char) and by ASCII

I have an example:
int var = 5;
char ch = (char)var;
char ch2 = var+48;
cout << ch << endl;
cout << ch2 << endl;
I had some other code. (char) returned wrong answer, but +48 didn't. When I changed ONLY (char) to +48, then my code got corrected.
What is the difference between converting int to char by using (char) and +48 (ASCII) in C++?
char ch = (char)var; has the same effect as char ch = var; and assigns the numeric value 5 to ch. You're using ASCII (supported by all modern systems), and ASCII character code 5 represents Enquiry ('ENQ'), an old terminal control code. Perhaps some old-timer has a clue what it did!
char ch2 = var+48; assigns the numeric value 53 to ch2 which happens to represent the ASCII character for the digit '5'. ASCII 48 is zero (0) and the digits all appear in the ASCII table in order after that. So 48+5 lands on 53 (which represents the character '5').
In C++ char is an integer type. The value is interpreted as representing an ASCII character, but it should be thought of as holding a number.
Its numeric range is either [-128,127] or [0,255]. That's because C++ requires sizeof(char)==1 and all modern platforms have 8 bit bytes.
NB: C++ doesn't actually mandate ASCII, but again that will be the case on all modern platforms.
PS: I think it's an unfortunate artifact of C (inherited by C++) that sizeof(char)==1 and there isn't a separate fundamental type called byte.
A char is simply the smallest integral type in C++. Output facilities like cout and printf map char values to the corresponding entry in the character mapping; on Windows computers this is typically ASCII.
Note that ASCII code 5 maps to the Enquiry character, which has no printable representation, while code 53 maps to the printable character '5'.
A generally accepted hack to store a number 0-9 in a char is to do const char ch = var + '0';. It's important to note the shortcomings here:
If your code is running on some non-ASCII character mapping, then the characters 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9 may not be laid out in order, in which case this wouldn't work (in practice the C++ standard does guarantee that the decimal digits are contiguous and in order, so this is a largely theoretical concern)
If var is outside the 0 - 9 range, then var + '0' will map to something other than a numeric character
A guaranteed way to get the most significant digit of a number regardless of 1 or 2 is to use:
const auto ch = to_string(var).front();
Generally char represents a number just as int does. Casting an int value to char doesn't produce its ASCII representation.
The ASCII codes for the digits range from 48 (== '0') to 57 (== '9'). So to get the printable digit you have to add '0' (or 48).
The difference is that casting to char with (char) explicitly converts the value to a char, while adding 48 does not.
It's important to note that an int is typically 32 bit and a char is typically 8 bit. This means that the range you can store in a char is -128 to +127 (or 0 to 255, i.e. 2^8 - 1, for unsigned char), while an int ranges from −2,147,483,648 (−2^31) to 2,147,483,647 (2^31 − 1) (or 0 to 2^32 − 1 for unsigned).
Adding 48 to a value is not changing the type to char.
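A minimal demonstration of the difference described above (a sketch; (char)5 is an unprintable control character, so the values are also shown as ints):
#include <iostream>

int main()
{
    int var = 5;
    char ch  = (char)var;   // holds the value 5  -> ASCII control character ENQ
    char ch2 = var + '0';   // holds the value 53 -> ASCII character '5'

    std::cout << "ch  as int: " << (int)ch  << "\n"; // 5
    std::cout << "ch2 as int: " << (int)ch2 << "\n"; // 53
    std::cout << "ch2 printed: " << ch2 << "\n";     // 5
}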

Why does the size of this std::string change, when characters are changed?

I have an issue in which the size of the string is affected by the presence of a '\0' character. I searched all over SO and still could not find the answer.
Here is the snippet.
#include <iostream>
#include <string>

int main()
{
    std::string a = "123123\0shai\0";
    std::cout << a.length();
}
http://ideone.com/W6Bhfl
The output in this case is
6
Whereas the same program with a different string, having numerals instead of letters after the \0,
#include <iostream>
#include <string>

int main()
{
    std::string a = "123123\0123\0";
    std::cout << a.length();
}
http://ideone.com/mtfS50
gives an output of
8
What exactly is happening under the hood? How does the presence of a '\0' character change the behavior?
The sequence \012 when used in a string (or character) literal is an octal escape sequence. It's the octal number 12 which corresponds to the ASCII linefeed ('\n') character.
That means your second string is actually equal to "123123\n3\0" (plus the actual string literal terminator).
It would have been very clear if you tried to print the contents of the string.
Octal sequences are one to three digits long, and the compiler will use as many digits as possible.
If you check the coloring at ideone you will see that \012 has a different color. That is because this is a single character written in octal.
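To see this concretely, here is a small sketch (not from the answers above) contrasting the const char* constructor, which stops at the first '\0', with the explicit-length constructor, which keeps embedded NUL bytes:
#include <iostream>
#include <string>

int main()
{
    // Constructing from a const char* stops at the first '\0'.
    std::string a = "123123\0123\0";
    std::cout << a.length() << "\n";          // 8: "123123" + '\012' (newline) + '3'

    // Passing the length explicitly keeps every byte of the literal.
    std::string b("123123\0shai\0", 12);
    std::cout << b.length() << "\n";          // 12, including the two embedded '\0' bytes
}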

std::string optimal way to truncate utf-8 at safe place

I have a valid UTF-8 encoded string in a std::string, and a limit in bytes, MAX_SIZE. I would like to truncate the string and add "..." at MAX_SIZE - 3 - x, where x is the value needed to avoid cutting a UTF-8 character in half.
Is there a function that could determine x based on MAX_SIZE without having to start from the beginning of the string?
If you have a location in a string, and you want to go backwards to find the start of a UTF-8 character (and therefore a valid place to cut), this is fairly easily done.
You start from the last byte in the sequence. If the top two bits of the last byte are 10, then it is part of a UTF-8 sequence, so keep backing up until the top two bits are not 10 (or until you reach the start).
The way UTF-8 works is that a byte can be one of three things, based on its upper bits. If the topmost bit is 0, then the byte is an ASCII character, and the remaining 7 bits are the Unicode codepoint value itself. If the top two bits are 10, then the 6 bits that follow are extra bits for a multi-byte sequence. But the beginning of a multibyte sequence is coded with 11 in the top 2 bits.
So if the top bits of a byte are not 10, then it's either an ASCII character or the start of a multibyte sequence. Either way, it's a valid place to cut.
Note however that, while this algorithm will break the string at codepoint boundaries, it ignores Unicode grapheme clusters. This means that combining characters can be cut away from the base characters that they combine with; accents can be lost from characters, for example. Doing proper grapheme cluster analysis would require access to the Unicode table that says whether a codepoint is a combining character.
But it will at least be a valid Unicode UTF-8 string. So that's better than most people do ;)
The code would look something like this (in C++14):
#include <cassert>
#include <string>

auto FindCutPosition(const std::string &str, size_t max_size)
{
    assert(str.size() >= max_size && "Make sure stupidity hasn't happened.");
    assert(str.size() > 3 && "Make sure stupidity hasn't happened.");

    max_size -= 3;
    for(size_t pos = max_size; pos > 0; --pos)
    {
        unsigned char byte = static_cast<unsigned char>(str[pos]); //Perfectly valid
        if((byte & 0xC0) != 0x80)   //not a continuation byte: a valid place to cut
            return pos;
    }

    unsigned char byte = static_cast<unsigned char>(str[0]); //Perfectly valid
    if((byte & 0xC0) != 0x80)
        return size_t{0};           //keep the deduced return type consistent (size_t)

    //If your first byte isn't even a valid UTF-8 starting point, then something terrible has happened.
    throw bad_utf8_encoded_text(...);
}
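For completeness, a possible way to use it might look like this (a sketch; truncate_utf8 is a made-up helper name that calls the FindCutPosition above):
#include <string>

// Truncate a valid UTF-8 string to at most max_size bytes, appending "..."
// without cutting a multi-byte character in half.
std::string truncate_utf8(const std::string &str, size_t max_size)
{
    if (str.size() <= max_size)
        return str;                          // nothing to do
    size_t cut = FindCutPosition(str, max_size);
    return str.substr(0, cut) + "...";
}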

Regarding conversion of text to hex via ASCII in C++

So, I've looked up how to do conversion from text to hexadecimal according to ASCII, and I have a working solution (proposed here). My problem is that I don't understand why it works. Here's my code:
#include <string>
#include <iostream>
int main()
{
    std::string str1 = "0123456789ABCDEF";
    std::string output[2];
    std::string input;
    std::getline(std::cin, input);
    output[0] = str1[input[0] & 15];
    output[1] = str1[input[0] >> 4];
    std::cout << output[1] << output[0] << std::endl;
}
Which is all well and good - it returns the hexadecimal value for single characters. However, what I don't understand is this:
input[0] & 15
input[0] >> 4
How can you perform bitwise operations on a character from a string? And why does it oh-so-nicely return the exact values we're after?
Thanks for any help! :)
In C++ a char is 8 bits long (on all common platforms).
If you '&' it with 15 (binary 1111), only the least significant 4 bits remain, which gives you the first (low-order) hex digit.
Right-shifting by 4 is equivalent to dividing the character value by 16, which gives you the most significant 4 bits for the second (high-order) digit.
Once these two digit values are calculated, the corresponding characters are picked out of the constant string str1, which holds the hex characters in their respective positions.
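As a concrete illustration, a small sketch (not part of the answer; note the unsigned char, which avoids the problem the original code would have with byte values above 127 when char is signed):
#include <iostream>
#include <string>

int main()
{
    const std::string str1 = "0123456789ABCDEF";
    unsigned char c = 'A';        // ASCII 0x41 = binary 0100 0001
    std::cout << str1[c >> 4]     // 0100 -> 4
              << str1[c & 15]     // 0001 -> 1
              << "\n";            // prints "41"
}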
"Characters in a string" are not characters (individual strings of one character only). In some programming languages they are. In Javascript, for example,
var string = "testing 1,2,3";
var character = string[0];
returns "t".
In C and C++, however, 'strings' are arrays of 8-bit characters; each element of the array is an 8-bit number (0..255, or -128..127 where char is signed).
Characters are just integers. In ASCII the character '0' is the integer 48. C++ makes this conversion implicitly in many contexts, including the one in your code.

Convert char[] of hexadecimal digits to char[] of the corresponding ASCII characters, and reverse it

I have a char a[] of hexadecimal characters like this:
"315c4eeaa8b5f8aaf9174145bf43e1784b8fa00dc71d885a804e5ee9fa40b16349c146fb778cdf2d3aff021dfff5b403b510d0d0455468aeb98622b137dae857553ccd8883a7bc37520e06e515d22c954eba5025b8cc57ee59418ce7dc6bc41556bdb36bbca3e8774301fbcaa3b83b220809560987815f65286764703de0f3d524400a19b159610b11ef3e"
I want to convert it to the characters corresponding to each pair of hexadecimal digits, like this:
68656c6c6f = hello
and store it in char b[] and then do the reverse
I don't want a block of code, please; I want an explanation of which libraries are used and how to use them.
Thanks
Assuming you are talking about ASCII codes, the first step is to find the size of b. Assuming every character is represented by 2 hexadecimal digits (for example, a tab would be 09), the size of b is simply strlen(a) / 2 + 1.
That done, you need to go through the characters of a, two by two, convert them to their integer value and store them in b. Written as a formula you have:
b[i] = (to_digit(a[2*i]) << 4) + to_digit(a[2*i+1])
where to_digit(x) converts '0'-'9' to 0-9 and 'a'-'f' or 'A'-'F' to 10-15.
Note that if characters below 0x10 are shown with only one digit (the only one I can think of is tab), then instead of using 2*i as the index into a, you should keep a next_index in your loop, which is advanced by 2 if a[next_index] < '8', or by 1 otherwise. In the latter case, b[i] = to_digit(a[next_index]).
The reverse of this operation is very similar. Each character b[i] is written as:
a[2*i] = to_char(b[i] >> 4)
a[2*i+1] = to_char(b[i] & 0xf)
where to_char is the opposite of to_digit.
Converting the hexadecimal string to a character string can be done by using std::string::substr to get the next two characters of the hex string, then using std::stoi to convert the substring to an integer. This can be cast to a character that is added to a std::string. The std::stoi function is C++11 only; if you don't have it you can use e.g. std::strtol.
To do the opposite you loop over each character in the input string, cast it to an integer and put it in an std::ostringstream preceded by manipulators to have it presented as a two-digit, zero-prefixed hexadecimal number. Append to the output string.
Use std::string::c_str to get an old-style C char pointer if needed.
No external library, only using the C++ standard library.
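A minimal sketch of that approach, assuming the input is well-formed (every byte encoded as exactly two hex digits; from_hex and to_hex are made-up names):
#include <iomanip>
#include <iostream>
#include <sstream>
#include <string>

// Hex string -> text, using substr + stoi as described above.
std::string from_hex(const std::string& hex)
{
    std::string out;
    for (std::size_t i = 0; i + 1 < hex.size(); i += 2)
        out += static_cast<char>(std::stoi(hex.substr(i, 2), nullptr, 16));
    return out;
}

// Text -> hex string, using an ostringstream with hex manipulators.
std::string to_hex(const std::string& text)
{
    std::ostringstream os;
    for (unsigned char c : text)
        os << std::hex << std::setw(2) << std::setfill('0') << static_cast<int>(c);
    return os.str();
}

int main()
{
    std::cout << from_hex("68656c6c6f") << "\n";   // hello
    std::cout << to_hex("hello") << "\n";          // 68656c6c6f
}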
Forward:
Read two hex chars from input.
Convert to int (0..255). (hint: sscanf is one way)
Append int to output char array
Repeat 1-3 until out of chars.
Null terminate the array
Reverse:
Read single char from array
Convert to 2 hexadecimal chars (hint: sprintf is one way).
Concat buffer from (2) to final output string buffer.
Repeat 1-3 until out of chars.
Almost forgot to mention: only stdio.h and the regular C runtime are required, assuming you're using sscanf and sprintf (a sketch using them follows below). You could alternatively create a pair of conversion tables that would radically speed up the conversions.
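A rough sketch of those steps with sscanf and sprintf (assuming a well-formed input string and sufficiently large buffers):
#include <cstdio>
#include <cstring>

int main()
{
    const char a[] = "68656c6c6f";
    char b[64] = {0};

    // Forward: read two hex chars at a time, convert to int, append to b.
    std::size_t n = std::strlen(a) / 2;
    for (std::size_t i = 0; i < n; ++i)
    {
        unsigned int value = 0;
        std::sscanf(a + 2 * i, "%2x", &value);   // "%2x": at most two hex digits
        b[i] = static_cast<char>(value);
    }
    b[n] = '\0';                                  // null terminate
    std::printf("%s\n", b);                       // hello

    // Reverse: write each char back out as two hex digits.
    char hex[64] = {0};
    for (std::size_t i = 0; i < n; ++i)
        std::sprintf(hex + 2 * i, "%02x", static_cast<unsigned char>(b[i]));
    std::printf("%s\n", hex);                     // 68656c6c6f
}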
Here's a simple piece of code to do the trick:
#include <cstddef>
#include <string>

int hex_digit_value(char c)
{
    if ('0' <= c && c <= '9') { return c - '0'; }
    if ('a' <= c && c <= 'f') { return c + 10 - 'a'; }
    if ('A' <= c && c <= 'F') { return c + 10 - 'A'; }
    return -1;   // not a hex digit
}

std::string dehexify(std::string const & s)
{
    std::string result(s.size() / 2, '\0');       // one output byte per two input digits
    for (std::size_t i = 0; i != s.size() / 2; ++i)
    {
        result[i] = hex_digit_value(s[2 * i]) * 16
                  + hex_digit_value(s[2 * i + 1]);
    }
    return result;
}
Usage:
char const a[] = "12AB";
std::string s = dehexify(a);
Notes:
A proper implementation would add checks that the input string length is even and that each digit is in fact a valid hex numeral.
Dehexifying has nothing to do with ASCII. It just turns any hexified sequence of nibbles into a sequence of bytes. I just use std::string as a convenient "container of bytes", which is exactly what it is.
There are dozens of answers on SO showing you how to go the other way; just search for "hexify".
Each hexadecimal digit corresponds to 4 bits, because 4 bits has 16 possible bit patterns (and there are 16 possible hex digits, each standing for a unique 4-bit pattern).
So, two hexadecimal digits correspond to 8 bits.
And on most computers nowadays (some Texas Instruments digital signal processors are an exception) a C++ char is 8 bits.
This means that each C++ char is represented by 2 hex digits.
So, simply read two hex digits at a time, convert to int using e.g. an istringstream, convert that to char, and append each char value to a std::string.
The other direction is just opposite, but with a twist.
Because char is signed on most systems, you need to convert to unsigned char before converting that value again to hex digits.
Conversion to and from hexadecimal can be done using the hex stream manipulator, e.g.
cout << hex << x;
cin >> hex >> x;
for a suitable definition of x, e.g. int x
This should work for string streams as well.
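A compact sketch of this stream-based variant, including the conversion to unsigned char mentioned above (same assumption as before: well-formed input):
#include <iomanip>
#include <iostream>
#include <sstream>
#include <string>

int main()
{
    std::string hexed = "68656c6c6f";
    std::string text;

    // Hex -> text: read two digits at a time via an istringstream with hex.
    for (std::size_t i = 0; i + 1 < hexed.size(); i += 2)
    {
        int value = 0;
        std::istringstream(hexed.substr(i, 2)) >> std::hex >> value;
        text += static_cast<char>(value);
    }
    std::cout << text << "\n";                    // hello

    // Text -> hex: convert to unsigned char first (char is often signed).
    std::ostringstream os;
    for (char c : text)
        os << std::hex << std::setw(2) << std::setfill('0')
           << static_cast<int>(static_cast<unsigned char>(c));
    std::cout << os.str() << "\n";                // 68656c6c6f
}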