Sorting std::strings with numbers in them? - c++

I'm currently sorting with the std::string < operator. The problem with it is that
30 < 9: the "30" shows up before the "9" since '3' < '9' (Windows 9x had this issue too). How could I go about sorting them numerically so that "30 Foxes" shows up after "9 dogs"? I should also add that I'm using UTF-8 encoding.
Thanks

You can create a custom comparison function to use with std::sort. This function would have to check if the string begins with a numeric value. If it does, convert the numeric part of each string to an int using some mechanism like a stringstream. Then compare the two integer values. If the values compare equally, compare the non-numeric part of the strings lexicographically. Otherwise, if the strings don't contain a numeric part, simply compare the two strings lexicographically as normal.
Basically, something like the following (untested) comparison function:
#include <algorithm>
#include <cctype>
#include <sstream>
#include <string>

bool is_not_digit(char c)
{
    return !std::isdigit(static_cast<unsigned char>(c));
}

bool numeric_string_compare(const std::string& s1, const std::string& s2)
{
    // handle empty strings: an empty string compares less than anything non-empty
    if (s1.empty() || s2.empty())
        return s1 < s2;

    std::string::const_iterator it1 = s1.begin(), it2 = s2.begin();

    if (std::isdigit(static_cast<unsigned char>(s1[0])) &&
        std::isdigit(static_cast<unsigned char>(s2[0]))) {
        int n1, n2;
        std::stringstream ss(s1);
        ss >> n1;
        ss.clear();
        ss.str(s2);
        ss >> n2;

        if (n1 != n2) return n1 < n2;

        it1 = std::find_if(s1.begin(), s1.end(), is_not_digit);
        it2 = std::find_if(s2.begin(), s2.end(), is_not_digit);
    }

    return std::lexicographical_compare(it1, s1.end(), it2, s2.end());
}
And then...
std::sort(string_array.begin(), string_array.end(), numeric_string_compare);
EDIT: Of course, this algorithm is only useful if you're sorting strings where the numeric portion appears at the beginning of the string. If you're dealing with strings where the numeric portion can appear anywhere in the string, then you need a more sophisticated algorithm. See http://www.davekoelle.com/alphanum.html for more information.
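For reference, that chunk-by-chunk idea can be sketched as follows. This is a minimal illustration, not Dave Koelle's actual alphanum implementation; natural_less is a name chosen here:

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Compare two strings chunk by chunk: runs of digits are compared
// numerically, everything else byte-wise. Digit runs longer than
// ~19 digits would overflow unsigned long long; fine for a sketch.
bool natural_less(const std::string& a, const std::string& b)
{
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (std::isdigit((unsigned char)a[i]) && std::isdigit((unsigned char)b[j])) {
            // extract both digit runs as unsigned values
            unsigned long long x = 0, y = 0;
            while (i < a.size() && std::isdigit((unsigned char)a[i]))
                x = x * 10 + (a[i++] - '0');
            while (j < b.size() && std::isdigit((unsigned char)b[j]))
                y = y * 10 + (b[j++] - '0');
            if (x != y) return x < y;
        } else {
            if (a[i] != b[j]) return a[i] < b[j];
            ++i, ++j;
        }
    }
    // the string with characters left over is the greater one
    return (a.size() - i) < (b.size() - j);
}
```

You would pass it to std::sort exactly like the comparator above.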

If you are targeting Windows (XP+) and can afford to convert your strings to UTF-16, you can use the StrCmpLogicalW function from Shlwapi. See MSDN for details.
Otherwise, ICU provides this functionality in its collators. See UCOL_NUMERIC_COLLATION.

Here's a version that doesn't convert to integer and thus works for long strings of digits regardless of sizeof(int).
#include <cctype>
#include <cstddef>
#include <cstring>
#include <string>
int numcmp(const char *a, const char *aend, const char *b, const char *bend)
{
    for (;;) {
        if (a == aend) {
            if (b == bend)
                return 0;
            return -1;
        }
        if (b == bend)
            return 1;
        if (*a == *b) {
            ++a, ++b;
            continue;
        }
        if (!isdigit((unsigned char) *a) || !isdigit((unsigned char) *b))
            return *a - *b;
        // skip leading zeros in both strings
        while (*a == '0' && ++a != aend)
            ;
        while (*b == '0' && ++b != bend)
            ;
        // skip to the end of the consecutive digits
        const char *aa = a;
        while (a != aend && isdigit((unsigned char) *a))
            ++a;
        std::ptrdiff_t alen = a - aa;
        const char *bb = b;
        while (b != bend && isdigit((unsigned char) *b))
            ++b;
        std::ptrdiff_t blen = b - bb;
        if (alen != blen)
            return alen < blen ? -1 : 1; // avoid narrowing ptrdiff_t to int
        // same number of consecutive digits in both strings
        while (aa != a) {
            if (*aa != *bb)
                return *aa - *bb;
            ++aa, ++bb;
        }
    }
}

int numcmp(const std::string& a, const std::string& b)
{
    return numcmp(a.data(), a.data() + a.size(),
                  b.data(), b.data() + b.size());
}

int numcmp(const char *a, const char *b)
{
    return numcmp(a, a + strlen(a), b, b + strlen(b));
}

This is what worked for me (assuming no leading zeros): the idea is that lexicographic comparison is only valid between numbers with the same number of digits, so compare lengths first.
auto numeric_str_cmp = [](const std::string& a, const std::string& b) -> bool {
    return (a.size() < b.size()) || (a.size() == b.size() && a < b);
};
std::sort(numeric_strings.begin(), numeric_strings.end(), numeric_str_cmp);
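For instance, wrapping that comparator in a helper (sort_numeric_strings is a name chosen here) and sorting bare decimal strings:

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Sorts a vector of bare, non-negative decimal strings (no leading
// zeros, no other text) numerically: shorter strings are smaller
// numbers, and equal-length strings compare correctly byte-wise.
void sort_numeric_strings(std::vector<std::string>& v)
{
    std::sort(v.begin(), v.end(),
              [](const std::string& a, const std::string& b) {
                  return a.size() < b.size()
                      || (a.size() == b.size() && a < b);
              });
}
```

Note this breaks as soon as a string contains non-digit text or leading zeros, which is exactly the "assuming no leading zeroes" caveat above.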

Related

Storing all vector values in a data type

I have a vector declared containing n integers.
vector<int> tostore(n);
I want to store all the numbers in the vector inside a string in the format of their subscripts, like 12345..n
Example:
vector <int> store_vec{1,2,3,4,5};
int store_str; //to store the digits in order from vector store_vec
cout<<store_str;
Desired Output:
12345
How do I store it in store_str without printing it?
Instead of using an integer, which if it is 32 bits wide can only reliably hold 9 digits, you could instead build a string that has all of the elements combined, like
vector <int> store_vec{1,2,3,4,5};
std::string merged;
merged.reserve(store_vec.size());
for (auto num : store_vec)
merged += '0' + num;
// now merged is "12345"
One way would be to just multiply by 10 each iteration
int result = 0;
for (auto it : tostore)
{
result = result * 10 + it;
}
As mentioned in comments, a more robust approach would be concatenating to an actual string, or at least using a 64-bit integer.
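A sketch of the 64-bit variant of the same loop (digits_to_int is a name chosen here; int64_t safely holds up to 18 decimal digits):

```cpp
#include <cstdint>
#include <vector>

// Packs a vector of single-digit values into one integer by
// shifting left one decimal place per digit. Beyond 18 digits
// this overflows, at which point a string is the right tool.
std::int64_t digits_to_int(const std::vector<int>& digits)
{
    std::int64_t result = 0;
    for (int d : digits)
        result = result * 10 + d;
    return result;
}
```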
Since you confirmed that store_vec only contains single digit numbers, a simple way of doing this would be :
std::vector<uint8_t> store_vec = {1,2,3,4,5};
std::string str = std::accumulate(store_vec.begin(), store_vec.end(), std::string{},
[](const std::string& res, uint8_t num){ return res + char('0' + num); });
int resnum = atoi(str.c_str());
or just use the str resulting from accumulate, since it already represents the sequence.
Since you know that each value in tostore will only be a single digit, you could use int8_t or uint8_t data types to store the values instead. This way you can still perform arithmetic on the values within the vector (so long as the result of the arithmetic falls within the range of -128 to 127 or 0 to 255 respectively; see integer types for more details). These data types have the advantage of being only a single byte long, allowing your vector to potentially be more densely packed and faster to traverse. You can then use std::cout << unsigned(tostore[i]) to print each byte as a number rather than as a (non-printable) character. The whole thing would look something like this
#include <cstdint>
#include <iostream>
#include <vector>

int main()
{
    std::vector<uint8_t> tostore;
    tostore.reserve(32);
    for (int i = 0; i < 32; ++i) {
        tostore.push_back(i % 10);
    }
    for (std::size_t i = 0; i < tostore.size(); ++i) {
        std::cout << unsigned(tostore[i]);
    }
}
Alternatively, if you know that your digit will always be positive, it opens a whole new range of possibilities. When converting an integer to a list of characters, a program needs to break the integer into its individual digits and then add 48 to each digit's value to find its ASCII character code equivalent (see asciiTable for more details). The process of splitting the integer into its base-10 digits may be too cumbersome (you decide) if you plan to display these characters often or perform only a few arithmetic operations on the data. In this case you could create a struct that stores the value as a char but performs arithmetic with the data as if it were an integer. This way, no work is needed to format the data for printing, and the only work needed to format it for arithmetic is a few fast subtractions by 48. Such a struct could look something like this:
#include <iostream>
#include <vector>
struct notANumber {
    char data;
    notANumber() {}
    notANumber(const int& a) : data(a + 48) {}
    notANumber(const char& a) : data(a) {}
    notANumber operator+(const notANumber& b) const {
        notANumber c;
        c.data = data + b.data - 48;
        return c;
    }
    notANumber operator-(const notANumber& b) const {
        notANumber c;
        c.data = data - b.data + 48;
        return c;
    }
    notANumber operator*(const notANumber& b) const {
        notANumber c;
        c.data = (data - 48) * (b.data - 48) + 48;
        return c;
    }
    notANumber operator/(const notANumber& b) const {
        notANumber c;
        c.data = (data - 48) / (b.data - 48) + 48;
        return c;
    }
};
int operator+(const int& a, const notANumber& b) {
    return a + b.data - 48;
}
int operator-(const int& a, const notANumber& b) {
    return a - b.data + 48;
}
int operator*(const int& a, const notANumber& b) {
    return a * (b.data - 48);
}
int operator/(const int& a, const notANumber& b) {
    return a / (b.data - 48);
}
int main()
{
    std::vector<notANumber> tostore;
    tostore.reserve(32);
    for (int i = 0; i < 32; ++i) {
        tostore.push_back(i % 10);
    }
    std::cout.write(reinterpret_cast<char*>(tostore.data()), tostore.size());
}
Now this might not be what you are looking for, but I hope it showcases an important aspect of programming: the more you know about the data you are working with, the more you can optimize the program. So make sure you have a good feel for the range of your data, which operations you will perform on it most often (arithmetic or printing, for example), and the cost of those operations.

How to calculate the length of a string by characters, not by code units (UTF-8, UTF-16)? [duplicate]

My std::string is UTF-8 encoded, so obviously str.length() returns the wrong result.
I found this information but I'm not sure how I can use it to do this:
The following byte sequences are used to represent a character. The sequence to be used depends on the UCS code number of the character:
0x00000000 - 0x0000007F: 0xxxxxxx
0x00000080 - 0x000007FF: 110xxxxx 10xxxxxx
0x00000800 - 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
0x00010000 - 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
How can I find the actual length of a UTF-8 encoded std::string? Thanks
Count all first-bytes (the ones that don't match 10xxxxxx).
int len = 0;
while (*s) len += (*s++ & 0xc0) != 0x80;
C++ knows nothing about encodings, so you can't expect to use a
standard function to do this.
The standard library indeed does acknowledge the existence of character encodings, in the form of locales. If your system supports a locale, it is very easy to use the standard library to compute the length of a string. In the example code below I assume your system supports the locale en_US.utf8. If I compile the code and execute it as "./a.out ソニーSony", the output is that there were 13 char-values and 7 characters. And all without any reference to the internal representation of UTF-8 character codes or having to use 3rd party libraries.
#include <clocale>
#include <cstdlib>
#include <iostream>
#include <string>
using namespace std;
int main(int argc, char *argv[])
{
    string str(argv[1]);
    unsigned int strLen = str.length();
    cout << "Length (char-values): " << strLen << '\n';
    setlocale(LC_ALL, "en_US.utf8");
    unsigned int u = 0;
    const char *c_str = str.c_str();
    unsigned int charCount = 0;
    while (u < strLen)
    {
        int charLen = mblen(&c_str[u], strLen - u);
        if (charLen < 1)   // invalid or incomplete sequence; stop
            break;
        u += charLen;
        charCount += 1;
    }
    cout << "Length (characters): " << charCount << endl;
}
This is a naive implementation, but it should be helpful for you to see how this is done:
std::size_t utf8_length(std::string const &s) {
    std::size_t len = 0;
    std::string::const_iterator begin = s.begin(), end = s.end();
    while (begin != end) {
        unsigned char c = *begin;
        int n;
        if      ((c & 0x80) == 0)    n = 1;
        else if ((c & 0xE0) == 0xC0) n = 2;
        else if ((c & 0xF0) == 0xE0) n = 3;
        else if ((c & 0xF8) == 0xF0) n = 4;
        else throw std::runtime_error("utf8_length: invalid UTF-8");
        if (end - begin < n) {
            throw std::runtime_error("utf8_length: string too short");
        }
        for (int i = 1; i < n; ++i) {
            if ((begin[i] & 0xC0) != 0x80) {
                throw std::runtime_error("utf8_length: expected continuation byte");
            }
        }
        ++len;       // one character, however many bytes its encoding takes
        begin += n;
    }
    return len;
}
You should probably take the advice of Omry and look into a specialized library for this. That said, if you just want to understand the algorithm to do this, I'll post it below.
Basically, you can convert your string into a wider-element format, such as wchar_t. Note that wchar_t has a few portability issues, because wchar_t is of varying size depending on your platform. On Windows, wchar_t is 2 bytes, and therefore ideal for representing UTF-16. But on UNIX/Linux, it's four bytes and is therefore used to represent UTF-32. Therefore, for Windows this will only work if you don't include any Unicode codepoints above 0xFFFF. For Linux you can include the entire range of codepoints in a wchar_t. (Fortunately, this issue will be mitigated with the C++0x Unicode character types.)
With that caveat noted, you can create a conversion function using the following algorithm:
template <class OutputIterator>
inline OutputIterator convert(const unsigned char* it, const unsigned char* end, OutputIterator out)
{
    while (it != end)
    {
        if (*it < 192) *out++ = *it++;   // single-byte character
        else if (*it < 224 && it + 1 < end && *(it+1) > 127) {
            // double-byte character
            *out++ = ((*it & 0x1F) << 6) | (*(it+1) & 0x3F);
            it += 2;
        }
        else if (*it < 240 && it + 2 < end && *(it+1) > 127 && *(it+2) > 127) {
            // triple-byte character
            *out++ = ((*it & 0x0F) << 12) | ((*(it+1) & 0x3F) << 6) | (*(it+2) & 0x3F);
            it += 3;
        }
        else if (*it < 248 && it + 3 < end && *(it+1) > 127 && *(it+2) > 127 && *(it+3) > 127) {
            // 4-byte character
            *out++ = ((*it & 0x07) << 18) | ((*(it+1) & 0x3F) << 12) |
                     ((*(it+2) & 0x3F) << 6) | (*(it+3) & 0x3F);
            it += 4;
        }
        else ++it;   // invalid byte sequence (throw an exception here if you want)
    }
    return out;
}
int main()
{
    std::string s = "\u00EAtre";
    std::cout << s.length() << std::endl;       // byte count
    std::wstring output;
    convert(reinterpret_cast<const unsigned char*>(s.c_str()),
            reinterpret_cast<const unsigned char*>(s.c_str()) + s.length(),
            std::back_inserter(output));
    std::cout << output.length() << std::endl;  // actual character count
}
The algorithm isn't fully generic, because the InputIterator needs to be an unsigned char, so you can interpret each byte as having a value between 0 and 0xFF. The OutputIterator is generic, (just so you can use an std::back_inserter and not worry about memory allocation), but its use as a generic parameter is limited: basically, it has to output to an array of elements large enough to represent a UTF-16 or UTF-32 character, such as wchar_t, uint32_t or the C++0x char32_t types. Also, I didn't include code to convert character byte sequences greater than 4 bytes, but you should get the point of how the algorithm works from what's posted.
Also, if you just want to count the number of characters, rather than output to a new wide-character buffer, you can modify the algorithm to include a counter rather than an OutputIterator. Or better yet, just use Marcelo Cantos' answer to count the first-bytes.
I recommend you use UTF8-CPP. It's a header-only library for working with UTF-8 in C++. With this lib, it would look something like this:
int LengthOfUtf8String(const std::string &utf8_string)
{
    return utf8::distance(utf8_string.begin(), utf8_string.end());
}
(Code is from the top of my head.)
Most of my personal C library code has only really been tested in English, but here is how I've implemented my UTF-8 string length function. I originally based it on the bit pattern described in this wiki page table. It isn't the most readable code, but it benchmarks well with my compiler. Also, sorry for this being C code; it should translate over to std::string in C++ pretty easily with some slight modifications :).
size_t utf8len(const char* const str) {
    size_t len = 0;
    unsigned char c = str[0];
    for (size_t i = 0; c != 0; ++len) {
        int v0 = (c & 0x80) >> 7;
        int v1 = (c & 0x40) >> 6;
        int v2 = (c & 0x20) >> 5;
        int v3 = (c & 0x10) >> 4;
        i += 1 + v0 * v1 + v0 * v1 * v2 + v0 * v1 * v2 * v3;
        c = str[i];
    }
    return len;
}
Note that this does not validate any of the bytes (much like all the other suggested answers here). Personally I would separate string validation from my string length function, as that is not its responsibility. If we were to move string validation to another function, the validation could be done something like the following.
bool utf8valid(const char* const str) {
    if (str == NULL)
        return false;
    const char* c = str;
    bool valid = true;
    for (size_t i = 0; c[0] != 0 && valid;) {
        valid = (c[0] & 0x80) == 0
             || ((c[0] & 0xE0) == 0xC0 && (c[1] & 0xC0) == 0x80)
             || ((c[0] & 0xF0) == 0xE0 && (c[1] & 0xC0) == 0x80 && (c[2] & 0xC0) == 0x80)
             || ((c[0] & 0xF8) == 0xF0 && (c[1] & 0xC0) == 0x80 && (c[2] & 0xC0) == 0x80 && (c[3] & 0xC0) == 0x80);
        int v0 = (c[0] & 0x80) >> 7;
        int v1 = (c[0] & 0x40) >> 6;
        int v2 = (c[0] & 0x20) >> 5;
        int v3 = (c[0] & 0x10) >> 4;
        i += 1 + v0 * v1 + v0 * v1 * v2 + v0 * v1 * v2 * v3;
        c = str + i;
    }
    return valid;
}
If you are going for readability, I'll admit that the other suggestions are quite a bit more readable!
Try using an encoding library like iconv; it probably has the API you want.
An alternative is to implement your own utf8strlen, which determines the length of each codepoint and iterates over codepoints instead of char units.
A slightly lazy approach would be to only count lead bytes, but visit every byte. This saves the complexity of decoding the various lead byte sizes, but obviously you pay to visit all the bytes, though there usually aren't that many (2x-3x):
size_t utf8Len(const std::string& s)
{
    return std::count_if(s.begin(), s.end(),
        [](char c) { return (static_cast<unsigned char>(c) & 0xC0) != 0x80; });
}
Note that certain code values are illegal as lead bytes: those that would represent bigger values than the 21 bits needed for extended Unicode, for example. But then the other approach would not know how to deal with that code, anyway.
UTF-8 CPP library has a function that does just that. You can either include the library into your project (it is small) or just look at the function. http://utfcpp.sourceforge.net/
const char* twochars = "\xe6\x97\xa5\xd1\x88";
size_t dist = utf8::distance(twochars, twochars + 5);
assert (dist == 2);
This code is ported from php-iconv to C++; you need iconv installed first. Hope it's useful:
// porting from PHP
// http://lxr.php.net/xref/PHP_5_4/ext/iconv/iconv.c#_php_iconv_strlen
#define GENERIC_SUPERSET_NBYTES 4
#define GENERIC_SUPERSET_NAME "UCS-4LE"
UInt32 iconvStrlen(const char *str, size_t nbytes, const char* encode)
{
    UInt32 retVal = (unsigned int)-1;
    unsigned int cnt = 0;
    iconv_t cd = iconv_open(GENERIC_SUPERSET_NAME, encode);
    if (cd == (iconv_t)(-1))
        return retVal;

    const char* in;
    size_t inLeft;
    char *out;
    size_t outLeft = 0;
    char buf[GENERIC_SUPERSET_NBYTES * 2] = {0};

    for (in = str, inLeft = nbytes, cnt = 0; inLeft > 0; cnt += 2)
    {
        size_t prev_in_left;
        out = buf;
        outLeft = sizeof(buf);
        prev_in_left = inLeft;
        // some iconv implementations declare the input buffer as char**
        if (iconv(cd, (char **) &in, &inLeft, &out, &outLeft) == (size_t)-1) {
            if (prev_in_left == inLeft) {
                break;
            }
        }
    }
    iconv_close(cd);

    if (outLeft > 0)
        cnt -= outLeft / GENERIC_SUPERSET_NBYTES;
    retVal = cnt;
    return retVal;
}

UInt32 utf8StrLen(const std::string& src)
{
    return iconvStrlen(src.c_str(), src.length(), "UTF-8");
}
Just another naive implementation to count characters in a UTF-8 string:
int utf8_strlen(const std::string& str)
{
    int c, i, ix, q;
    for (q = 0, i = 0, ix = str.length(); i < ix; i++, q++)
    {
        c = (unsigned char) str[i];
        if      (c >= 0 && c <= 127) i += 0;
        else if ((c & 0xE0) == 0xC0) i += 1;
        else if ((c & 0xF0) == 0xE0) i += 2;
        else if ((c & 0xF8) == 0xF0) i += 3;
        //else if ((c & 0xFC) == 0xF8) i += 4; // 111110bb, byte 5, unnecessary in 4-byte UTF-8
        //else if ((c & 0xFE) == 0xFC) i += 5; // 1111110b, byte 6, unnecessary in 4-byte UTF-8
        else return 0; // invalid UTF-8
    }
    return q;
}


hex string arithmetic in c++

I want to do basic arithmetic (addition, subtraction, and comparison) with 64-digit hex numbers represented as strings. For example:
"ffffa" + "2" == "ffffc"
Since the binary representation of such a number requires 256 bits, I cannot convert the string to a basic integer type. One solution is to use gmp or boost/xint, but they are too big for this simple functionality.
Is there a lightweight solution that can help me?
Just write a small library that converts between hex characters and ints and adds one digit at a time, taking care of the carry. It took minutes to implement such an algorithm:
#include <iostream>
#include <string>
#include <utility>
using namespace std;

namespace hexstr {

char int_to_hexchar(int v) {
    if (0 <= v && v <= 9) {
        return v + '0';
    } else {
        return v - 10 + 'a';
    }
}

int hexchar_to_int(char c) {
    if ('0' <= c && c <= '9') {
        return c - '0';
    } else {
        return c - 'a' + 10;
    }
}

int add_digit(char a, char b) {
    return hexchar_to_int(a) + hexchar_to_int(b);
}

void reverseStr(string& str) {
    int n = str.length();
    for (int i = 0; i < n / 2; i++)
        swap(str[i], str[n - i - 1]);
}

void _add_val_to_string(string& s, int& val) {
    s.push_back(int_to_hexchar(val % 16));
    val /= 16;
}

string add(string a, string b)
{
    auto ita = a.end();
    auto itb = b.end();
    int tmp = 0;
    string ret;
    while (ita != a.begin() && itb != b.begin()) {
        tmp += add_digit(*--ita, *--itb);
        _add_val_to_string(ret, tmp);
    }
    while (ita != a.begin()) {
        tmp += hexchar_to_int(*--ita);
        _add_val_to_string(ret, tmp);
    }
    while (itb != b.begin()) {
        tmp += hexchar_to_int(*--itb);
        _add_val_to_string(ret, tmp);
    }
    while (tmp) {
        _add_val_to_string(ret, tmp);
    }
    reverseStr(ret);
    return ret;
}

}  // namespace hexstr

int main()
{
    std::cout
        << "1bd5adead01230ffffc" << endl
        << hexstr::add(
               std::string() + "dead0000" + "00000" + "ffffa",
               std::string() + "deaddead" + "01230" + "00002")
        << endl;
    return 0;
}
This can be optimized: the string reversal could be omitted and some CPU cycles and memory allocations spared. Error handling is also lacking, and it will only work on implementations that use ASCII as the character set. But it's as simple as that, and this small library can handle hex strings well over 64 digits, limited only by host memory.
Implementing addition, subtraction and comparison over fixed-base numeric strings yourself should be quite easy.
For instance, for addition and subtraction, simply do it as you would on paper: start at the right-hand end of both strings, parse the chars, compute the result digit, carry over, etc. Comparison is even easier, and you go left to right.
Of course, all this is assuming you don't need performance (otherwise you should be using a proper library).
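As a sketch of the comparison part under those assumptions (hex_compare is a name chosen here; it assumes lowercase hex digits, whose ASCII order conveniently matches their numeric order):

```cpp
#include <cstddef>
#include <string>

// Compares two lowercase hex strings numerically, paper-style:
// strip leading zeros, then a longer string is a bigger number,
// and equal-length strings compare digit by digit left to right.
// Returns <0, 0, or >0 like strcmp.
int hex_compare(const std::string& a, const std::string& b)
{
    std::size_t i = a.find_first_not_of('0');
    std::size_t j = b.find_first_not_of('0');
    std::size_t alen = (i == std::string::npos) ? 0 : a.size() - i;
    std::size_t blen = (j == std::string::npos) ? 0 : b.size() - j;
    if (alen != blen)
        return alen < blen ? -1 : 1;
    for (std::size_t k = 0; k < alen; ++k) {
        // '0'..'9' < 'a'..'f' in ASCII, matching hex digit values
        if (a[i + k] != b[j + k])
            return a[i + k] < b[j + k] ? -1 : 1;
    }
    return 0;
}
```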

How to change a number's digits in a recursive function? C++

I have to input a number n, a digit a, and a digit b, and output the number n with all the a digits in it replaced by b. For example:
Input:
n = 1561525
a = 5
b = 9
Output:
n = 1961929
It should be recursive! I didn't post any code, as I did it in a non-recursive way, but apparently that's not even close to what I need.
Thanks for the help !
Check this; it works, but maybe it is too much C:
int convert(int num, int a, int b)
{
    if (num)
    {
        int res = convert(num / 10, a, b);
        int t = num % 10;
        res *= 10;
        t = (t == a) ? b : t;
        return res + t;
    }
    return 0;
}
Divide by 10 the initial number, until nothing left of it, and then construct it again replacing a with b.
To make things easier, you can convert the number into a string (a char[] in C++). Then, it's a simple matter of iterating over it and checking at each step if the number we want to replace was found in the current position. For a possible solution, here's an implementation of the algorithm in Python - one of the nice things of the language is that it reads almost as pseudocode, and it should be relatively simple to port to C++:
def aux(n, a, b, i):
if i == len(n):
return ''
elif n[i] == a:
return b + aux(n, a, b, i+1)
else:
return n[i] + aux(n, a, b, i+1)
def change(n, a, b):
return int(aux(str(n), str(a), str(b), 0))
It works as expected:
change(1561525, 5, 9)
=> 1961929
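Since the question asks for C++, a direct port of the same recursion might look like this (replace_digits and change are names chosen here):

```cpp
#include <cstddef>
#include <string>

// Recursive port of the Python sketch: walk the string from
// position i, replacing digit character a with digit character b.
std::string replace_digits(const std::string& n, char a, char b,
                           std::size_t i = 0)
{
    if (i == n.size())
        return "";
    char c = (n[i] == a) ? b : n[i];
    return c + replace_digits(n, a, b, i + 1);
}

// Wrapper mirroring the Python change(): convert the number to a
// string, replace, and convert back.
int change(int n, int a, int b)
{
    std::string s = replace_digits(std::to_string(n), '0' + a, '0' + b);
    return std::stoi(s);
}
```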
So the easiest and safest way I can think of is to use std::replace:
int replace(int num, int d1, int d2) {
    std::string s = std::to_string(num);            // convert to string
    std::replace(s.begin(), s.end(),
                 char('0' + d1), char('0' + d2));   // replace the digit characters
    return std::stoi(s);                            // return the int
}
Now if you really have to use recursion (there is no need for it here), here's one possible solution:
using std::string;

// recursive function: accepts a string and the current index; c2 replaces c1
string replace_rec(string s, unsigned index, char c1, char c2) {
    // past the end: nothing left to check
    if (index >= s.size())
        return s;
    // if this is a char to be converted, do so
    if (s[index] == c1)
        s[index] = c2;
    // recurse on the rest of the string and return the final result
    return replace_rec(s, index + 1, c1, c2);
}

// call this function with the input, the digit to be replaced, and its replacement
int replace(int num, int d1, int d2) {
    string s = std::to_string(num);   // convert to string
    // convert the result back to int and return it
    return std::stoi(replace_rec(s, 0, '0' + d1, '0' + d2));
}
In any case, you can call your replace() function like this:
int main() {
    std::cout << replace(4578, 4, 9); // output: 9578
    std::cin.get();
}