I need a way to check if characters in a string input are cjk, I've searched and I've only been able to detect if the characters are multibyte, however, I need to be able to tell Japanese, chinese or korean characters apart from other multibyte-encoded character.
The string encoding is utf8 and it'd be simpler to keep it that way, but I welcome any solution.
I've tried writing out the bytes and using information found here to determine the size and bit-content of the characters. perhaps if there was a continous range of digits for representing cjk chars, not sure it'd be that simple however.
Related
I have a mobile app that stores data in dynamoDB tables. There is a group of users in Taiwan that attempted to store there names in the database. when the data is stored it become garbled. I have researched this and see that it is because dynamoDB uses UTF encoding while tradional chinese uses big 5 text encoding. How do I setup dynamoDB so that it will store and recall the proper characters??
So you start with a string in your head. It's a sequence of Unicode characters. There's no inherent byte encoding to the characters. The same string could be encoded into bytes in a variety of ways. Big5 is one. UTF-8 is another.
When you say that Traditional Chinese uses Big5, that's not entirely true. It may be commonly encoded in Big5, but it could be in UTF-8 instead, and UTF-8 has this cool property that it can encode all Unicode character sequences. That's why it's become the standard encoding for situations where you don't want to optimize for one character set.
So your challenge is make sure to carefully control the characters and encodings so that you're sending UTF-8 sequences to DynamoDB. The standard SDKs would do this correctly as long as you're creating the strings as basic strings in them.
You also have to make sure you're not confusing yourself when you look at the data. If you look at UTF-8 bytes but in a way where you're interpreting them as Big5 then it's going to look like gibberish, or vice versa.
You don't say how they're loading the data. If they're starting with a file, could be that. You'd want to read the file in a language saying it's Big5, then you'll have the string version, and then you can send the string version and rely on the SDK to correctly translate to UTF-8 on the wire.
I remember when I first learned this stuff it was all kind of confusing. The thing to remember is a capital A exists as an idea (and is a defined character in Unicode) and there's a whole lot of mechanisms you could use to put that letter into ones and zeros on disk. Each of those ways is an encoding. ASCII is popular but EBCDIC was another contender from the past, and UTF-16 is yet another contender now. Traditional Chinese is a character set (a set of characters) and you can encode each those characters a bunch of ways too. It's just a question of how you map characters to bits and bytes and back again.
I am building a language analysis program I have a program which counts the words in text and give the ratio of every word in text as a output, but this program can not work on file containing Urdu text. how can I make it work
Encoding
Urdu may be presented in two¹ forms: Unicode and Code Page 868. This is convenient to you because the two ranges do not overlap. It is inconvenient because the Unicode code range is U+0600 – U+06FF, which means encoding is an issue:
CP-868 will encode each one as a single-byte value in the range 128–252
UTF-8 will encode each one as a two-byte sequence with bits 110x xxxx and 10xx xxxx
UTF-16 encodes every character as two-byte entities
UTF-32 encodes every character as four-byte entities
This means that you should be aware of encoding issues, and for an easy life, use UTF-16 internally (std::u16string), and accept files as (default) UTF-8 / CP-868, or as UTF-16/32 if there is a BOM indicating such.
Your other option is to simply require all input to be UTF-8 / CP-868.
¹ AFAIK. There may be other ways of storing Urdu text.
Three forms. See comments below.
Word separation
As you know, the end of a word is generally marked with a special letter form.
So, all you need is a table of end-of-word letters listing letters in both the CP-868 range and the Unicode Arabic text range.
Then, every time you find a space or a letter in that table you know you have found the end of a word.
Histogram
As you read words, store them in a histogram. For C++ a map <u16string, size_t> will do. The actual content of each word does not matter.
After that you have all the information necessary to print stats about the text.
Edit
The approach presented above is designed to be simple at the cost of some correctness. If you are doing something for the workplace, for example, and assuming it matters, you should also consider:
Normalizing word forms
For example, the same word may be presented in standard Arabic text codes or using the Urdu-specific codes. If you do not convert to the Urdu equivalent characters then you will have two words that should compare equal but do not.
Use something internally consistent. I recommend UZT, as it is the most complete Urdu text representation. You will also need an additional lookup for the original text representation from the UZT representation.
Dictionaries
As complete a dictionary (as an unordered_set <u16string>) of words in Urdu as you can get.
This is how it is done with languages like Japanese, for example, to find breaks between words.
Then use the dictionary to find all the words you can, and fall back on letterform recognition and/or spaces for what remains.
I'm programming an application that converts .txt files to bags of words for text mining. However, I keep getting non-alphabetic characters ( like ¾ and =) even though my application filters non-alphabetic characters:
My vector passes through a loop which erases strings that begins with a char with an ASCII value other than [65,90] (from A to Z). These characters also pass the isalpha test. It seems like these characters can't be distinguished from alphabetic characters.
I don't see how I can remove these weird strings dynamically from my vector of strings. I need help.
My code because it is quite long for a forum post.
This part of my code fails to get rid of the strings beginning with non-aphabetic characters:
for (unsigned int i=0; i<token24.size();i++){
string temp = token24[i];
char c = temp[0];
if(c>90||c<65){
token24.erase(token24.begin()+i);
i--;
}
}
I also tried with the condition
(c>'Z'||c<'A')
You could always do a string replace the characters with whitespace, but that just handles the specific cases of specific characters, not the larger problem.
I don't think we can do anything for you until we see the code.
The most important part in programs like yours is handling the content of .txt file. Such file can be a Unicode text, which in turn can be encoded, for eample, with UTF-8. Then, single byte can be only a part of a character, not character itself. Are you sure you load (and possibly, decode) the file in a proper way?
Also, don't you think that lower letters are also valid alpha characters?
I have a Korean string: "태권소녀 1". And now I want to remove a substring, " 1" (a space and '1' character). How can I do it in C++?
With the English string it works ok, but I cannot do it with Korean yet.
Thank you so much if you can give me some ideas.
thestring.erase(thestring.find(" 1"),2);
assuming, it's there. This is not the code to use it's a hint about what to look up in the documentation.
The problem you have is probably to determine the size in bytes of the particular string in characters. It depends on the encoding, but generally you may want to look at the family of functions with mb in their names (which stands for multibyte).
Problem is categorized in two steps:
Problem Step 1. Access 97 db containing XML strings that are encoded in UTF-8.
The problem boils down to this: the Access 97 db contains XML strings that are encoded in UTF-8. So I created a patch tool for separate conversion for the XML strings from UTF-8 to Unicode. In order to covert UTF8 string to Unicode, I have used function
MultiByteToWideChar(CP_UTF8, 0, PChar(OriginalName), -1, #newName, Size);.(where newName is array as declared "newName : Array[0..2048] of WideChar;" ).
This function works good on most of the cases, I have checked it with Spainsh, Arabic, characters. but I am working on Greek and Chineese Characters it is choking.
For some greek characters like "Ευγ. ΚαÏαβιά" (as stored in Access-97), the resultant new string contains null charaters in between, and when it is stored to wide-string the characters are getting clipped.
For some chineese characters like "?¢»?µ?"(as stored in Access-97), the result is totally absurd like "?¢»?µ?".
Problem Step 2. Access 97 db Text Strings, Application GUI takes unicode input and saved in Access-97
First I checked with Arabic and Spainish Characters, it seems then that no explicit characters encoding is required. But again the problem comes with greek and chineese characters.
I tried the above mentioned same function for the text conversion( Is It correct???), the result was again disspointing. The Spainsh characters which are ok with out conversion, get unicode character either lost or converted to regular Ascii Alphabets.
The Greek and Chineese characters shows similar behaviour as mentined in step 1.
Please guide me. Am I taking the right approach? Is there some other way around???
Well Right now I am confused and full of Questions :)
There is no special requirement for working with Greek characters. The real problem is that the characters were stored in an encoding that Access doesn't recognize in the first place. When the application stored the UTF8 values in the database it tried to convert every single byte to the equivalent byte in the database's codepage. Every character that had no correspondence in that encoding was replaced with ? That may mean that the Greek text is OK, while the chinese text may be gone.
In order to convert the data to something readable you have to know the codepage they are stored in. Using this you can get the actual bytes and then convert them to Unicode.