Remove a substring from a Korean string in C++

I have a Korean string: "태권소녀 1". I want to remove the substring " 1" (a space followed by the character '1'). How can I do this in C++?
With an English string it works fine, but I have not been able to make it work with Korean.
Any ideas would be much appreciated.

thestring.erase(thestring.find(" 1"),2);
assuming the substring is there. This is not code to use as-is; it is a hint about what to look up in the documentation.
The problem you have is probably working out the size in bytes of a string whose length you know in characters. That depends on the encoding, but generally you may want to look at the family of functions with mb in their names (mb stands for multibyte).
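As a minimal sketch of the hint above, assuming the text is held in a std::string as UTF-8: the helper name erase_first is made up for illustration. It checks the result of find() before erasing and takes the byte count from the needle itself instead of hard-coding 2.

```cpp
#include <string>

// Hypothetical helper: removes the first occurrence of `needle` from `text`.
// Both strings are assumed to hold bytes in the same encoding (e.g. UTF-8).
// Returns true if something was removed.
bool erase_first(std::string& text, const std::string& needle) {
    if (needle.empty()) {
        return false;  // nothing sensible to remove
    }
    const std::string::size_type pos = text.find(needle);
    if (pos == std::string::npos) {
        return false;  // needle not present, leave text untouched
    }
    // needle.size() is a byte count, which is the unit erase() works in.
    text.erase(pos, needle.size());
    return true;
}
```

With UTF-8 this byte-level search is safe for an ASCII needle such as " 1", because ASCII byte values never appear inside a multi-byte UTF-8 sequence. In some legacy double-byte encodings a trail byte can coincide with an ASCII value, and that is the case where the mb functions become relevant.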

Related

Check if a C++ char/char[] is DBCS/CJK

I need a way to check whether characters in a string input are CJK. I've searched, but I've only been able to detect whether the characters are multibyte; I need to be able to tell Japanese, Chinese or Korean characters apart from other multibyte-encoded characters.
The string encoding is UTF-8 and it would be simpler to keep it that way, but I welcome any solution.
I've tried writing out the bytes and using information found here to determine the size and bit content of the characters. Perhaps there is a continuous range of code points for representing CJK characters, though I'm not sure it is that simple.
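One possible direction, sketched under assumptions: decode the UTF-8 bytes into code points and compare each code point against a few Unicode block ranges. The names decode_utf8, Script and classify are invented for this sketch, the decoder assumes well-formed input, and the range list is deliberately incomplete (it leaves out, for example, the supplementary ideograph extensions).

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Decodes UTF-8 into code points. Assumes well-formed input; a stray
// continuation byte is passed through as its own value.
std::vector<char32_t> decode_utf8(const std::string& s) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char b = static_cast<unsigned char>(s[i]);
        char32_t cp = b;
        std::size_t extra = 0;  // number of continuation bytes to read
        if (b >= 0xF0)      { cp = b & 0x07; extra = 3; }
        else if (b >= 0xE0) { cp = b & 0x0F; extra = 2; }
        else if (b >= 0xC0) { cp = b & 0x1F; extra = 1; }
        ++i;
        for (; extra > 0 && i < s.size(); --extra, ++i) {
            cp = (cp << 6) | (static_cast<unsigned char>(s[i]) & 0x3F);
        }
        out.push_back(cp);
    }
    return out;
}

enum class Script { Other, Han, Kana, Hangul };

// Classifies a code point by a small, incomplete set of block ranges.
Script classify(char32_t cp) {
    // CJK Unified Ideographs and Extension A
    if ((cp >= 0x4E00 && cp <= 0x9FFF) || (cp >= 0x3400 && cp <= 0x4DBF)) return Script::Han;
    // Hiragana and Katakana
    if (cp >= 0x3040 && cp <= 0x30FF) return Script::Kana;
    // Hangul Syllables and Hangul Jamo
    if ((cp >= 0xAC00 && cp <= 0xD7A3) || (cp >= 0x1100 && cp <= 0x11FF)) return Script::Hangul;
    return Script::Other;
}
```

Kana and Hangul identify Japanese and Korean text fairly directly. Han ideographs are shared between Chinese, Japanese and Korean in Unicode, so a code point in the Han range does not by itself say which language it belongs to.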

FLUTTER - Checking if a string contains another one

I am working on an English vocabulary learning app. Some of the exercises given to the users are written quizzes. They have to translate French words into English words and vice versa.
To make the checking a little more sophisticated than just "1" or "0" (typedWord == expectedWord), I have been working with similarities between strings, and that worked well (for spelling mistakes, for example).
I also used the contains function so that, for example, if the user adds an article in front of the expected word, the answer is not considered wrong (e.g. for "Ecole", "School" is expected, but the user writes "A school").
So I was checking with lines such as "if (typedWord.contains(word)==true) then...". It works fine for the article problem.
But it raises another issue:
Ex: A bough --> the expected French word is "branche". If the user types "une branche", it is considered correct, which is great. But if the user types "débrancher" (to unplug), it is considered correct as well, because the word "branche" is part of "débrancher"...
How could I keep this from happening? Any ideas for other ways to go about it?
I read the three proposed answers, which are really interesting. The thing is that some of the words are compound, e.g. "kitchen appliance", "garden tool", etc., so I think the approaches based on splitting on spaces might be problematic...

In this case, split the whole answer on spaces, then compare each piece with the correct word.
For example:
User's answer: That is my school
Split it on spaces, and you get an array of words:
that, is, my, school.
Then compare each word with your word. It will give you the correct answer.
The Flutter code will look like this:
usersAnswer?.split(" ").forEach((word) {
  if (word == correctAnswer) {
    print("this is a correct answer");
  }
});

You can split the string on spaces and check whether the resulting array contains the word you're looking for.
typedWord.split(' ').contains('branche');
So if typedWord is 'une branche', split(' ') will turn it into this array: ['une', 'branche'].
Now when you check whether this array contains('branche'), it checks whether the exact string branche exists, which in this case it does, so it returns true.
However, if it's 'une debranche', the resulting array is ['une', 'debranche'], and because this array has no value equal to 'branche', it returns false. Remember that split turns the string into an array, and contains on an array checks whether an item with exactly the value you provide exists, whereas contains on a string checks whether any part of that string matches the given value.
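The split-and-compare logic is not specific to Dart. Here is a minimal sketch of the same idea in C++, used only to illustrate the logic; the function name contains_word is made up for this example.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: true if `answer`, split on whitespace, contains a
// token exactly equal to `word`.
bool contains_word(const std::string& answer, const std::string& word) {
    std::istringstream tokens(answer);
    std::string token;
    while (tokens >> token) {  // operator>> skips whitespace between tokens
        if (token == word) {
            return true;
        }
    }
    return false;
}
```

This also shows the limitation raised in the question: a compound expected answer such as "kitchen appliance" never equals a single token, so it is always reported as missing.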

You could check for whitespace before and after the correct word, with something like if (typedWord.contains(' '+word+' ')==true) then..., so that "débrancher" gets marked as wrong. This is rather strict, though: a word at the very start or end of the answer is rejected, and so is a word followed by punctuation. You'll probably want a RegExp that also accepts punctuation, not just whitespace, around the word.
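A boundary check of this kind can be sketched without a regular expression. The version below is written in C++ purely to illustrate the logic, assumes UTF-8 input, and uses made-up names (is_separator, contains_phrase). It accepts the start or end of the string, ASCII whitespace and ASCII punctuation as boundaries, and treats every non-ASCII byte as part of a word so that the "é" in "débrancher" does not count as a boundary.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// True if byte `c` separates words: ASCII whitespace or ASCII punctuation.
// Bytes >= 0x80 (parts of UTF-8 letters such as 'é') count as word bytes.
bool is_separator(unsigned char c) {
    return c < 0x80 && (std::isspace(c) || std::ispunct(c));
}

// True if `phrase` occurs in `text` with a separator or the string
// boundary on both sides. Comparison is case-sensitive.
bool contains_phrase(const std::string& text, const std::string& phrase) {
    if (phrase.empty()) return false;
    std::size_t pos = text.find(phrase);
    while (pos != std::string::npos) {
        const std::size_t end = pos + phrase.size();
        const bool left_ok =
            pos == 0 || is_separator(static_cast<unsigned char>(text[pos - 1]));
        const bool right_ok =
            end == text.size() || is_separator(static_cast<unsigned char>(text[end]));
        if (left_ok && right_ok) return true;
        pos = text.find(phrase, pos + 1);  // keep looking for a later match
    }
    return false;
}
```

Because the phrase is searched for as a whole, a compound answer such as "kitchen appliance" is handled the same way as a single word.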

Right to left isolate string. C++

Does anyone have experience with Unicode?
I am facing a tough problem with Farsi Unicode text.
I have an std::wstring s = (L"\u0634\u0646\u0628\u0647"); which is a Farsi word. When I debug it, I see that the underlying word is exactly what I want, but reversed. I did some research and found that U+2067 marks a string to be read right to left.
NOTE:
I cannot reverse the string manually, because Farsi characters change their shape depending on their position in the string.
So I added U+2067 at the beginning and got
std::wstring s = (L"\u2067\u0634\u0646\u0628\u0647");.
But now the underlying string is the same, with just a square added at the beginning of the string instead of being reversed.
Does anyone have experience with this? Please suggest a solution. Thanks!

The underlying string will be the same. You haven't changed the order of bytes, which is written right there in the code. But a renderer that understands Unicode should take those bytes and display the characters right-to-left. That's a visual thing. It has nothing to do with the encoding. From your question, it's not entirely clear what else you expected. It may be that you are viewing the string in a debugger, and the debugger does not support this feature of Unicode. If you try outputting the string to a proper console you ought to see it as you expect.
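To make the point concrete, here is a small sketch; the helper name rtl_isolate is made up. It wraps a string in U+2067 (RIGHT-TO-LEFT ISOLATE) and the matching U+2069 (POP DIRECTIONAL ISOLATE). The stored characters keep their original, logical order; only two control characters are added around them.

```cpp
#include <string>

// Hypothetical helper: wraps `text` in U+2067 (RIGHT-TO-LEFT ISOLATE) and
// U+2069 (POP DIRECTIONAL ISOLATE). The characters of `text` are copied
// unchanged and stay in their logical order.
std::wstring rtl_isolate(const std::wstring& text) {
    return L"\u2067" + text + L"\u2069";
}
```

For the word in the question, the result still has U+0634 as the first letter after the isolate character. Whether it is drawn right to left depends entirely on the program displaying it; a square in a debugger view is presumably that viewer's placeholder for a control character it does not interpret.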

Inputting a string containing Greek characters in Linux

I have a function which returns a string.
I have to define that string with Greek characters in the function itself and return it.
I am working on Linux and my code is in C++.
My function is as follows:
string gen_string()
{
    string str = "αγρω";
    return str;
}
But I am not able to enter the string.
When I try to copy and paste the Greek characters I want, they appear as garbage characters.
Can someone please help me with this?
Thanks in advance.
EDIT:
Thanks for all your responses.
It's not about using wstring or string.
When I copy the string into vim to enter it, it appears as something like this:
▒~^▒~T▒~A▒~A201604¸▒~B▒žMDF_F▒~S123▒~T▒~B▒▒~B▒
I also tried by keeping the text in the file and opening the text file from vim.
But still it's the same.

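std::string is not limited to ASCII, as far as I know: it stores bytes, so it can hold Greek text encoded as UTF-8, although one char is then no longer one character.
You have international, likely Unicode, characters. If you need one element per character, consider using std::wstring, a "wide" string.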
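As a minimal sketch, assuming UTF-8 is the target encoding: a std::string can carry the four Greek letters from the question as eight bytes. The bytes are written as escapes here so the result does not depend on how the editor or terminal saved the source file; count_code_points is a helper name made up for this example.

```cpp
#include <cstddef>
#include <string>

// Returns "αγρω" encoded as UTF-8 (two bytes per letter), spelled out with
// hex escapes so the source file itself stays pure ASCII.
std::string gen_string() {
    return "\xCE\xB1\xCE\xB3\xCF\x81\xCF\x89";
}

// Counts code points by skipping UTF-8 continuation bytes (10xxxxxx).
// Assumes well-formed UTF-8.
std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}
```

Here gen_string().size() is 8 while count_code_points(gen_string()) is 4, which is the difference between bytes and characters mentioned above. Printing the string shows Greek letters only if the terminal is also set to UTF-8.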

If you mean copying from some text into the terminal input, then how to do this depends on the terminal. If it's GNOME Terminal, you need to specify UTF-8 in the locale settings, though I'm not sure whether that alone would get you the Greek alphabet.
The locale command lists the current locale settings (typically set in locale.conf). You likely want to change the LANG setting. A way to do this system-wide is
localectl set-locale LANG=en_country_code.UTF-8
Replace country_code. It's US for the United States, but I don't know what the Greek code is. You may need to be root. To change it just for yourself, modify
~/.config/locale.conf
(or $XDG_CONFIG_HOME/locale.conf or $HOME/.config/locale.conf), whichever gets you to the locale.conf file. On most systems all of them do.

How to compute a Unicode string whose bidirectional representation is specified?

Fellows, I have a rather perverse question. Please forgive me :)
There's an official algorithm that describes how bidirectional Unicode text should be presented.
http://www.unicode.org/reports/tr9/tr9-15.html
I receive a string (from some third-party source) which contains Latin/Hebrew characters, as well as digits, whitespace, punctuation symbols, etc.
The problem is that the string I receive is already in presentation form, i.e. the sequence of characters that I receive should just be presented from left to right.
Now, my goal is to find the Unicode string whose representation is exactly the same. That is, I need to pass that string to another entity, which would then render it according to the official algorithm, and the result should be the same.
Assuming the following:
The default text direction (of the rendering entity) is RTL.
I don't want to inject "special unicode characters" that explicitly override the text direction (such as RLO, RLE, etc.)
I suspect several solutions may exist. If so, I'd like to preserve the RTL look of the string as much as possible. The string usually consists mostly of Hebrew words. I'd like to preserve the correct order of those words and of the characters inside them, whereas other character sequences may (and should) be transposed.
One naive way to solve this is to reverse the whole string (this takes care of the Hebrew words), and then reverse the sequences of non-Hebrew characters inside it. However, this doesn't always produce correct results, because the actual rules of representation are rather complex.
The only comprehensive algorithm that I see so far is a brute-force check. The string can be divided into sequences of same-class characters. Those sequences may be joined in any order, and any of them may be reversed. I can check all those combinations to obtain the correct result.
This technique can also be optimized. For instance, the order of Hebrew words is known, so we only have to check different combinations of their "joining" sequences.
Any better ideas? If you have an idea, not necessarily the whole solution, that's OK. I'll appreciate any idea.
Thanks in advance.

If you want to check whether a character is bidirectional, you have to use the UCD (Unicode Character Database), which is provided by Unicode.org and includes lots of information about characters. In one of that database's attributes you can find the bidirectionality of a character.
So you have to download the UCD, then write a class to look up your character in the XML and return the answer.
I did this in an open-source C# application and you can find it here: http://Unicode.Codeplex.com
Please let me know whether this resolves your issue.

Nasser, thanks for the answer.
Unfortunately it doesn't fully resolve my problem.
So far I can find out the directionality of every character. Still, I don't see how I can compute the whole string so that its representation matches what I need.
Imagine you want to have the following text written from left to right, where Hebrew/Arabic characters are denoted by uppercase letters:
ABC eng 123 456 DEF
The correct string would be like this:
FED 456 123 eng CBA
or also:
FED eng 456 123 CBA
Or, if using explicit direction override codes, it can be written like this:
FED eng 123 456 CBA
Currently I solve this problem by injecting explicit directionality override codes into the string: I isolate sequences of Hebrew/Arabic words, and for all the joining LTR/weak/neutral characters I explicitly override the direction to LTR.
However, I'd like to do this without injecting explicit override codes.
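For reference, the naive approach described in the question (reverse everything, then re-reverse the non-Hebrew pieces) can be sketched as follows. This is only an illustration under strong assumptions: capital ASCII letters stand in for Hebrew/Arabic characters as in the example above, pieces are delimited by single spaces, and the names is_rtl_letter and naive_visual_to_logical are made up. As the question itself notes, this shortcut does not always produce correct results, so it is not a replacement for the full bidirectional algorithm.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>

// Stand-in for "is a Hebrew/Arabic letter": uppercase ASCII, following the
// convention of the example above. A real version would look up the
// character's bidirectional class in the Unicode Character Database.
bool is_rtl_letter(char c) {
    return std::isupper(static_cast<unsigned char>(c)) != 0;
}

// Naive visual-to-logical conversion: reverse the whole string, then
// re-reverse every space-delimited word that contains no RTL letter.
std::string naive_visual_to_logical(std::string s) {
    std::reverse(s.begin(), s.end());
    std::size_t start = 0;
    while (start < s.size()) {
        std::size_t end = s.find(' ', start);
        if (end == std::string::npos) end = s.size();
        const bool has_rtl =
            std::any_of(s.begin() + start, s.begin() + end, is_rtl_letter);
        if (!has_rtl) {
            std::reverse(s.begin() + start, s.begin() + end);
        }
        start = end + 1;  // continue after the space
    }
    return s;
}
```

For the visual input "ABC eng 123 456 DEF" this yields "FED 456 123 eng CBA", which is the first candidate string listed above.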
