wxStyledTextCtrl non ASCII characters - c++

I realized that in wxStyledTextCtrl, if the user's comments contain non-ASCII characters, the positions reported by WordStartPosition and WordEndPosition are wrong. What is a good way of dealing with non-ASCII characters in wxStyledTextCtrl? How can I identify the characters that are non-ASCII?

You've probably answered this question by now, but in the experiments I've done, WordStartPosition and WordEndPosition still work with non-ASCII characters. The control stores its data internally as UTF-8, and those functions return byte offsets into that data for the start and end of the word, not character counts. If that's not what's happening for you, can you post a sample where they don't work?
As for determining which characters are and aren't ASCII, something like the following seems to work (assuming a is the start and b is the end position):
wxString s = m_stc->GetTextRange(a,b);
for (wxString::const_iterator i = s.begin(); i != s.end(); ++i)
{
wxUniChar uni_ch = *i;
if(uni_ch.IsAscii())
{
//something
}
else
{
//something else
}
}
One thing I did notice is that if you use a value for a or b that falls in the middle of one of the non-ASCII characters, the resulting string will be empty. I hope this is of some help if you haven't already found a solution.
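Since the positions are byte offsets into UTF-8 data, one way to avoid the empty-string case is to check that an offset does not land inside a multi-byte character before using it. The following is a minimal sketch in plain C++, assuming you already have the text as a UTF-8 std::string; the helper names isUtf8Continuation, isUtf8Boundary and snapToUtf8Boundary are made up for this illustration and are not part of wxWidgets.

```cpp
#include <cstddef>
#include <string>

// True if the byte is a UTF-8 continuation byte (bit pattern 10xxxxxx).
inline bool isUtf8Continuation(unsigned char byte)
{
    return (byte & 0xC0) == 0x80;
}

// True if `pos` is a safe place to cut `utf8`: either the end of the
// buffer or the first byte of a code point.
inline bool isUtf8Boundary(const std::string& utf8, std::size_t pos)
{
    if (pos >= utf8.size())
        return pos == utf8.size();
    return !isUtf8Continuation(static_cast<unsigned char>(utf8[pos]));
}

// Moves `pos` back to the nearest code point boundary at or before it.
inline std::size_t snapToUtf8Boundary(const std::string& utf8, std::size_t pos)
{
    if (pos > utf8.size())
        pos = utf8.size();
    while (pos > 0 && pos < utf8.size() &&
           isUtf8Continuation(static_cast<unsigned char>(utf8[pos])))
        --pos;
    return pos;
}
```

Any byte with the high bit set (value 0x80 or above) belongs to a non-ASCII character, so the same byte-level view also answers the "which characters are non-ASCII" part of the question.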

Related

C++, how to remove char from a string

I have to remove some chars from a string, but I have some problems. I found this part of code online, but it does not work so well, it removes the chars but it even removes the white spaces
string messaggio = "{Questo e' un messaggio} ";
char chars[] = {'Ì', '\x1','\"','{', '}',':'};
for (unsigned int i = 0; i < strlen(chars); ++i)
{
messaggio.erase(remove(messaggio.begin(), messaggio.end(), chars[i]), messaggio.end());
}
Can someone tell me how this part of code works and why it even removes the white spaces?
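Because you use strlen on your chars array. This function stops only when it encounters a \0, and your array contains none, so strlen keeps reading memory past the end of the array. That is undefined behavior and may even cause a segfault.
Use sizeof to get the array size instead. Keep the erase call: std::remove on its own only shifts the kept characters to the front and returns the new logical end; it does not shorten the string.
A correction could be:
char chars[] = {'I', '\x1','\"','{', '}',':'};
for (unsigned int i = 0; i < sizeof(chars); ++i)
{
    messaggio.erase(std::remove(messaggio.begin(), messaggio.end(), chars[i]), messaggio.end());
}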
Wisblade's answer is more or less correct; it just lacks some technical details.
As mentioned, strlen searches for the terminating character '\0'.
Since chars does not contain such a character, this code invokes undefined behavior (a buffer over-read).
Undefined behavior means anything can happen: the code may work, may crash, or may give invalid results.
So the first step is to drop strlen and use a different way to get the size of the array.
There is also another problem. Your code uses a non-ASCII character: 'Ì'.
I assume that you are using Windows and Visual Studio. By default the MSVC compiler assumes that the file is encoded using your system locale and uses the same locale to generate the executable. Windows by default uses a single-byte encoding specific to your language (to be compatible with very old software). Only in such a case does your code have a chance to work. On platforms or configurations with a multibyte encoding, like UTF-8, this code can't work even after Wisblade's fixes.
Wisblade's fix can take this form (note that I changed the order of the loops; the iteration over the characters to remove is now the inner loop):
#include <algorithm>
#include <iterator>
#include <string>

bool isCharToRemove(char ch)
{
    // 'Ì' only fits in a char when the source uses a single-byte encoding.
    constexpr char skipChars[] = {'Ì', '\x1','\"','{', '}',':'};
    return std::find(std::begin(skipChars), std::end(skipChars), ch) != std::end(skipChars);
}
std::string removeMagicChars(std::string message)
{
    message.erase(
        std::remove_if(message.begin(), message.end(), isCharToRemove),
        message.end());
    return message;
}
Let me know if you need a solution that can handle more complex text encodings.
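For UTF-8 text, one possible direction is to remove byte sequences instead of single char values, so that a multi-byte character such as 'Ì' (bytes 0xC3 0x8C in UTF-8) is erased as a unit. This is only a sketch under the assumption that both the message and the tokens are valid UTF-8; the function name removeTokens is hypothetical.

```cpp
#include <string>
#include <vector>

// Removes every occurrence of each token from `message`. Tokens are byte
// sequences, so a multi-byte UTF-8 character is removed as a whole.
std::string removeTokens(std::string message, const std::vector<std::string>& tokens)
{
    for (const std::string& token : tokens) {
        if (token.empty())
            continue; // an empty token would match forever
        std::string::size_type pos = 0;
        while ((pos = message.find(token, pos)) != std::string::npos)
            message.erase(pos, token.size());
    }
    return message;
}
```

A call such as removeTokens(messaggio, {"\xC3\x8C", "\x01", "\"", "{", "}", ":"}) leaves spaces untouched because no token contains one. Writing the non-ASCII token with explicit byte escapes keeps the example independent of the source file's encoding.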

Reading alphabetical characters only from file - c++

I am to read words from a text file. Word is defined as a consecutive sequence of letters. So for example in the following string:
"It’s a ver5y good #” idea of a line. You know it?"
the words are:
it s a ver y good idea of line you know
('it' and 'a' are doubled)
I was wondering, if there's any clever function that reads words until it finds a non-alphabetical character? Or the only way to do it is to read char by char and use push_back until we find non-alphabetical one?
When you read a string from a stream, the stream reads a contiguous run of non-white-space characters as the string. It then ignores any white-space characters. The next non-white-space character is the beginning of the next string it'll read. This is pretty much the behavior you want, with one more exception: you want everything other than letters to be treated like white-space.
Fortunately, the stream doesn't hard-code its idea of what's "white space". It uses a locale to tell it what's white space. A locale, in turn, is composed of pieces that deal with individual aspects ("facets") of localization. The facet that deals specifically with classifying characters is a ctype facet. So, if we write a ctype facet that classifies everything other than a letter as white space, we can read "words" from the stream quite easily.
Here's some code to do exactly that:
struct alpha_only: std::ctype<char> {
    alpha_only(): std::ctype<char>(get_table()) {}
    static std::ctype_base::mask const* get_table() {
        static std::vector<std::ctype_base::mask>
            rc(std::ctype<char>::table_size,std::ctype_base::space);
        // std::fill takes a half-open range, so go one past 'z' and 'Z'.
        std::fill(&rc['a'], &rc['z'] + 1, std::ctype_base::lower);
        std::fill(&rc['A'], &rc['Z'] + 1, std::ctype_base::upper);
        return &rc[0];
    }
};
The char specialization of a ctype facet is (always) table driven. All we really have to do is create a table that classifies characters properly. In this case, that means alphabetical characters are classified as upper- or lower-case, and everything else is classified as white-space. We do that by filling the table with ctype_base::space, then for the alphabetical characters basically saying: "oops, no that's not white-space, that's upper- or lower-case."
Technically, the way I've done that is slightly incorrect--it assumes that upper-case and lower-case letters are contiguous. This is true of any sane character set, but not of EBCDIC. If we wanted to be technically correct, instead of the two "std::fill" calls, we could write a loop something like this:
auto max = std::numeric_limits<unsigned char>::max();
// <= so that the last table entry (index max) is classified too
for (int i=0; i<=max; i++)
    if (islower(i))
        rc[i] = std::ctype_base::lower;
    else if (isupper(i))
        rc[i] = std::ctype_base::upper;
    else
        rc[i] = std::ctype_base::space;
Either way, the conclusion is fairly simple: upper case is upper case, lower case is lower case, everything else is "white space".
Once we've written that, we need to tell the stream to use that locale; then we can read our words really easily:
int main() {
std::istringstream infile("It’s a ver5y good #” idea of a line. You know it?");
// Tell the stream to use our character classifier:
infile.imbue(std::locale(std::locale(), new alpha_only));
std::string word;
while (infile >> word)
std::cout << word << "\n";
}
[I've put a new-line between each "word" so you can easily see what it's reading as a word.]
Result:
It
s
a
ver
y
good
idea
of
a
line
You
know
it
Based on your result in the question, you apparently also want each word to appear only once in the output. To do that, you'd typically insert each word in a set as it's read, and only write it to the output if insertion in the set was successful.
std::unordered_set<std::string> words;
std::string word;
while (infile >> word)
if (words.insert(word).second)
std::cout << word << "\n";
The insert for set and unordered_set returns a pair<iterator, bool>, where the bool indicates whether insertion was successful. If it was previously present, that will fail and return false, so based on that we decide whether to write the word out or not.
With this modification, "it" still appears in the output twice: the first instance has the i capitalized, and the second doesn't. To filter that out, you'll need to convert each string entirely to lower-case (or entirely to upper-case) before inserting it into the set.
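The lower-casing step could look something like the sketch below. It works on a list of words that has already been read (for example with the alpha_only facet above); the function names toLowerCopy and uniqueWordsIgnoringCase are invented for this example, and a std::vector is used for the result only to keep the first-seen order.

```cpp
#include <cctype>
#include <string>
#include <unordered_set>
#include <vector>

// Returns a lower-cased copy of `word`. Each char goes through unsigned char
// because std::tolower is undefined for negative values other than EOF.
std::string toLowerCopy(std::string word)
{
    for (char& ch : word)
        ch = static_cast<char>(std::tolower(static_cast<unsigned char>(ch)));
    return word;
}

// Returns each distinct word once, in first-seen order, ignoring case.
std::vector<std::string> uniqueWordsIgnoringCase(const std::vector<std::string>& words)
{
    std::unordered_set<std::string> seen;
    std::vector<std::string> result;
    for (const std::string& word : words) {
        std::string lowered = toLowerCopy(word);
        if (seen.insert(lowered).second) // true only for the first occurrence
            result.push_back(lowered);
    }
    return result;
}
```

Inside the reading loop the same idea applies directly: lower-case word first, then test words.insert(word).second as before.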

How do I remove only the first character of a string that is not a digit? (MFC, C++)

I want to remove only the first character in a string that is NOT a digit. The first character can be anything from ‘A’ to ‘Z’ or it may be a special character like ‘&’ or ‘#’. This legacy code is written in MFC. I've looked at the CString class but cannot figure out how to make this work.
I have strings that may look like any of the following:
J22008943452GF or 22008943452GF or K33423333333IF or 23000526987IF or #12000895236GF. You get the idea by now.
My dilemma is that I need to remove the character in the first position of each string, unless the string starts with a digit. Strings that begin with a digit must be left alone. Also, none of the other characters in the string should be altered. For example, the ‘G’, ‘I’ or ‘F’ in the later part of the string should not be changed. The length of the string will always be 13 or 14 characters.
Here is what I have so far.
CString GAbsMeterCalibration::TrimMeterSNString (CString meterSN)
{
meterSN.MakeUpper();
CString TrimmedMeterSNString = meterSN;
int strlength = strlen(TrimmedMeterSNString);
if (strlength == 13)
{
// Check the first character anyway, even though it’s
// probably okay. If it is a digit, life’s good.
// Return unaltered TrimmedMeterSNString;
}
if (strlength == 14)
{
//Check the first character, it’s probably going
// to be wrong and is a character, not a digit.
// if I find a char in the first position of the
// string, delete it and shift everything to the
// left. Make this my new TrimmedMeterSNString
// return altered TrimmedMeterSNString;
}
}
The string lengths are checked and validated before the calls.
From my investigations, I’ve found that MFC does not have a regular expression
class. Nor does it have the substring methods.
How about:
CString GAbsMeterCalibration::TrimMeterSNString (CString meterSN)
{
    meterSN.MakeUpper();
    CString TrimmedMeterSNString = meterSN;
    if (std::isdigit(TrimmedMeterSNString.GetAt(0)) )
    {
        // The first character is a digit: return the string unaltered.
        return TrimmedMeterSNString;
    }
    // Otherwise drop the first character and return the rest.
    return TrimmedMeterSNString.Mid(1);
}
From what I understand, you want to remove the first letter if it is not a digit. So you may make this function simpler:
CString GAbsMeterCalibration::TrimMeterSNString(CString meterSN)
{
meterSN.MakeUpper();
int length = meterSN.GetLength();
    // remove the first character unless it is a digit; after subtracting
    // '0', only the digits map to the range 0..9
    if (length > 0 && unsigned(meterSN[0] - TCHAR('0')) > 9u)
{
return meterSN.Right(length - 1);
}
return meterSN;
}
I use the unsigned comparison trick instead of isdigit because CString uses TCHAR, which can be either char or wchar_t.
The solution is fairly straightforward:
CString GAbsMeterCalibration::TrimMeterSNString(CString meterSN) {
meterSN.MakeUpper();
return _istdigit(meterSN.GetAt(0)) ? meterSN :
meterSN.Mid(1);
}
The implementation can be compiled for both ANSI and Unicode project settings by using _istdigit. This is required since you are using CString, which stores either MBCS or Unicode character strings. The desired substring is extracted using CStringT::Mid.
(Note that CString is a typedef for a specific CStringT template instantiation, depending on your project settings.)
CString test="12355adaddfca";
if((test.GetAt(0)>=48)&&(test.GetAt(0)<=57))
{
//48 and 57 are ascii values of 0&9, hence this is a digit
//do your stuff
//CString::GetBuffer may help here??
}
else
{
//it is not a digit, do your stuff
}
Compare the ASCII value of the first character in the string and you know whether it's a digit or not.
I don't know if you've tried this, but, it should work.
CString str = _T("#12000895236GF");
// check string to see if it starts with digit.
CString result = str.SpanIncluding(_T("0123456789"));
// if result is empty, string does not start with a number
// and we can remove the first character. Otherwise, string
// remains intact.
if (result.IsEmpty())
str = str.Mid(1);
Seems a little easier than what's been proposed.
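For comparison, the same rule can be written against std::string when the code does not need to stay on CString. This is a portable sketch rather than MFC code, and trimLeadingNonDigit is a hypothetical name; it leaves out the MakeUpper step that the MFC versions perform.

```cpp
#include <string>

// Returns `serial` without its first character, unless that character is an
// ASCII digit (or the string is empty), in which case it is returned as-is.
std::string trimLeadingNonDigit(const std::string& serial)
{
    if (serial.empty())
        return serial;
    const char first = serial[0];
    const bool isDigit = first >= '0' && first <= '9';
    return isDigit ? serial : serial.substr(1);
}
```

The explicit range check sidesteps the char versus wchar_t question that comes up with isdigit and TCHAR.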

Reverse string with non-ASCII characters

I want to change the order in the string with special characters like this:
ZAŻÓŁĆ GĘŚLĄ JAŹŃ
to
ŃŹAJ ĄŁŚĘG ĆŁÓŻAZ
I tried to use std::reverse:
std::string text("ZAŻÓŁĆ GĘŚLĄ JAŹŃ!");
std::cout << text << std::endl;
std::reverse(text.rbegin(), text.rend());
std::cout << text << std::endl;
but the output shows this:
ZAŻÓŁĆ GĘŚLĄ JAŹŃ!
!\203Ź\305AJ \204\304L\232Ř\304G \206āœû\305AZ <- reversed string
So I tried to do this "manually":
std::string text1("ZAŻÓŁĆ GĘŚLĄ JAŹŃ!");
std::cout << text1 << std::endl;
int count = (int) floorf(text1.size() /2.f);
std::cout << count << " " << text1.size() << std::endl;
unsigned int maxIndex = text1.size() - 1;
for (int i = 0; i < count ; i++)
{
char tmp = text1[i];
text1[i] = text1[maxIndex];
text1[maxIndex] = tmp;
maxIndex--;
}
std::cout << text1 << std::endl;
But in this case I have a problem with text1.size(), because every special character is counted twice:
ZAŻÓŁĆ GĘŚLĄ JAŹŃ!
13 27 <- second number is text1.size()
!\203Ź\305AJ \204\304L\232Ř\304G \206āœû\305AZ
What is the proper way to reverse a string with special characters?
Your code really does correctly reverse bytes in your string, there's nothing wrong here. The problem, however, is that your compiler stores your literal string "ZAŻÓŁĆ GĘŚLĄ JAŹŃ!" in UTF-8 encoding.
And UTF-8 stores all characters except those that match ASCII as variable-length sequences of bytes. This means that one char (one byte) is no longer one character, so reversing chars is no longer the same as reversing characters.
To achieve your goal you have at least two options:
Use some utf-8 library that will let you iterate characters instead of bytes. One example is http://utfcpp.sourceforge.net/
Somehow (and that depends a lot on the compiler and OS you are using) switch to UTF-32 encoding, which has a constant character length, and go back to good old constant-character-size strings without all these variable-character-size troubles.
UPD: A nice link for you: http://www.joelonsoftware.com/articles/Unicode.html
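You might code a reverseUtf8 function yourself. This version only keeps two-byte sequences that start with the lead byte 0xC3 together; other lead bytes need the same treatment:
std::string getMultiByteReversed(char trail, char lead)
{
    if (lead == '\xc3') // lead byte of U+00C0..U+00FF
        return std::string(1, lead) + std::string(1, trail);
    return std::string(); // not a recognized pair
}
std::string reverseMultiByteString(const std::string &s)
{
    std::string result;
    for (std::string::const_reverse_iterator it = s.rbegin(); it != s.rend(); ++it) {
        std::string pair;
        // Look at this byte together with the one before it in the original string.
        if ((it + 1) != s.rend())
            pair = getMultiByteReversed(*it, *(it + 1));
        if (!pair.empty()) {
            result += pair; // keep both bytes in their original order
            ++it;
        } else {
            result += *it;
        }
    }
    return result;
}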
You can look up the utf8 codes at: http://www.utf8-chartable.de/
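A more general variant does not need a table of lead bytes: in UTF-8 every continuation byte has the bit pattern 10xxxxxx, so you can walk backwards until you reach a byte that is not a continuation byte and copy the whole sequence in its original order. The sketch below assumes the input is valid UTF-8, and reverseUtf8CodePoints is just an illustrative name. It reverses code points, not user-perceived characters.

```cpp
#include <string>

// Reverses a UTF-8 string code point by code point. The bytes inside each
// code point keep their original order, so the result is still valid UTF-8
// (assuming the input was).
std::string reverseUtf8CodePoints(const std::string& s)
{
    std::string result;
    result.reserve(s.size());
    std::string::size_type end = s.size();
    while (end > 0) {
        // Step back over continuation bytes (10xxxxxx) to the lead byte.
        std::string::size_type start = end - 1;
        while (start > 0 && (static_cast<unsigned char>(s[start]) & 0xC0) == 0x80)
            --start;
        result.append(s, start, end - start);
        end = start;
    }
    return result;
}
```

This also handles three- and four-byte sequences, since it never looks at the value of the lead byte.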
There are a couple of issues here. The answer is complex and can depend on exactly what you're trying to do.
First is that (as other answers have stated) if your string is UTF-8 encoded, one Unicode code point may consist of multiple bytes. If you just reverse the bytes, you'll break the UTF-8 encoding. The simplest (though not necessarily the best) fix for this is to convert the string to UTF-32 and reverse the 32-bit code points rather than bytes.
The next problem is that a single grapheme might consist of multiple Unicode code points. For example, a "é" might be encoded as the two code points U+0065 followed by U+0301. If you reverse the order of these, that will break it, as the combining character U+0301 will now be associated with a different base character. So "Pokémon" reversed this way would become "noḿekoP" with the accent over the "m" instead of the "e".
Now you might think that you can get around this problem by normalizing the string into a composed form first. That has its own problems, however, because not every grapheme can be represented by a single code point. For example, the Canadian flag emoji (🇨🇦) is represented by the code point U+1F1E8 followed by the code point U+1F1E6. There is no single code point for it. If you reverse its code points, you get the flag for Ascension Island (🇦🇨) instead.
Then you have languages where characters change form based on context, and I don't yet know much about dealing with those.
It may be closer to what you want to reverse grapheme clusters. See UAX29: Unicode text segmentation.
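As a very rough illustration of the combining-character case only, the sketch below reverses a UTF-32 string while keeping each base character together with any following marks from the Combining Diacritical Marks block (U+0300 to U+036F). It is not an implementation of UAX29: flags, ZWJ emoji sequences and marks from other blocks are not handled, and the names isCombiningMark and reverseKeepingMarks are made up for this example.

```cpp
#include <cstddef>
#include <string>

// True for code points in the Combining Diacritical Marks block only.
bool isCombiningMark(char32_t cp)
{
    return cp >= 0x0300 && cp <= 0x036F;
}

// Reverses `text`, treating a base code point plus the combining marks that
// follow it as one unit so that accents stay on their original letter.
std::u32string reverseKeepingMarks(const std::u32string& text)
{
    std::u32string result;
    result.reserve(text.size());
    std::size_t end = text.size();
    while (end > 0) {
        // Step back over combining marks to their base character.
        std::size_t start = end - 1;
        while (start > 0 && isCombiningMark(text[start]))
            --start;
        result.append(text, start, end - start);
        end = start;
    }
    return result;
}
```

With the decomposed spelling U"Poke\u0301mon", the accent stays on the "e" after reversal. For real text, a library that implements grapheme cluster segmentation is the safer route.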
Have you tried swapping characters one by one?
For example, if the string length is odd, swap the first character with the last, the second with the second last, and so on until only the middle character is left. If the string length is even, swap the 1st with the last, the 2nd with the 2nd last, and so on until both middle characters are swapped. That way, the string will be reversed.

C++ substr() problems when string contains special characters

I'm trying to split a c++ string into a number of substrings (NUM_LINES) each with the length of CHAR_PER_LINE.
for(int i = 0; i < NUM_LINES; i++) {
lines[i] = totalstring.substr(i*CHAR_PER_LINE,CHAR_PER_LINE);
}
Works fine as long as there's no special character in the string. Otherwise substr() gets me a string that isn't CHAR_PER_LINE characters long, but stops right before a special character and exits the loop.
Any hints?
ok, edit:
1) I'm definitely not reaching the end of my string. If my totalstring.length() is 1000 and I have a special character in the first line (that is the first CHAR_PER_LINE (30) chars of the string) the loop exits.
2) Special characters I had problems with are for instance 'ö' and '–' (the long one)
EDIT 2:
std::string text = "aaaabbbbccccdödd";
std::string line[4];
for(int i = 0; i < 4; i++)
line[i] = text.substr(i*4,4);
for(int i = 0; i < 4; i++)
std::cout << line[i] << "\n";
This example works. I get a '%' for the ö.
So the problem wasn't substr(). Sorry. I'm using Cairo to create a gui and it seems my Cairo output is causing the troubles, not substr().
How about a hint of what special characters you're talking about?
My guess is that you reached the end of the string.
The STL doesn't care about special characters. If there are multibyte sequences (e.g. UTF-8), std::string treats them as a sequence of single one-byte characters. If you need proper Unicode handling, do not use the built-in substr or length.
You can, however, use std::wstring (from your posting it isn't clear whether you're already using it, but I guess not) - it holds wchar_t characters - large enough for the native character set of your target platform.
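If the text stays in a UTF-8 std::string, another option is to split on code point boundaries instead of byte offsets, so that a line never ends in the middle of a character like 'ö'. The following is a sketch under the assumption that the input is valid UTF-8; splitByCodePoints is a hypothetical helper, and it counts code points, which is not always the same as visible characters.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Splits `utf8` into pieces of at most `charsPerLine` code points each.
// Continuation bytes (10xxxxxx) never start a new code point, so a
// multi-byte character is never cut in half.
std::vector<std::string> splitByCodePoints(const std::string& utf8, std::size_t charsPerLine)
{
    std::vector<std::string> lines;
    if (charsPerLine == 0)
        return lines;
    std::size_t lineStart = 0; // byte offset where the current line begins
    std::size_t count = 0;     // code points seen in the current line
    for (std::size_t i = 0; i < utf8.size(); ++i) {
        const bool startsCodePoint =
            (static_cast<unsigned char>(utf8[i]) & 0xC0) != 0x80;
        if (!startsCodePoint)
            continue;
        if (count == charsPerLine) {
            lines.push_back(utf8.substr(lineStart, i - lineStart));
            lineStart = i;
            count = 0;
        }
        ++count;
    }
    if (lineStart < utf8.size())
        lines.push_back(utf8.substr(lineStart));
    return lines;
}
```

For the example string from the question, split with 4 characters per line, the last piece holds "dödd" as five bytes but four code points.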
What's happening is that you're running off the end of the string on the last line. It isn't exiting the loop after skipping characters. It exits the loop precisely when it should, and the last line contains the right number of characters; it's just that some of them are garbage, so your diagnostic printout is showing that the line is short.
The only way the loop could be exited early is if an exception were thrown.