Maximum number of codepoints in a grapheme cluster - c++

I am using the C++ ICU library. I wish to split a utf-8 string into approximately equal chunks. However, I want the chunks to be demarcated at grapheme cluster boundaries. I do not wish to convert my entire string into utf-16 to do this for both memory and speed efficiency. Instead, I want to translate a small number of utf-8 codepoints close to my estimated chunk boundaries into utf-16. I can then use ICU's BreakIterator to work out the exact boundaries.
Is there a hard upper limit of the number of codepoints that can make up a grapheme cluster? If so, what is it? I need to know this in order to determine the minimum number of codepoints that I need to translate from utf-8 to utf-16.

Is there a hard upper limit of the number of codepoints that can make up a grapheme cluster?
No. There is no hard upper limit on how many code points a grapheme cluster - i.e. a user-perceived character - consists of.
You could, for example, repeatedly add a ZERO WIDTH JOINER followed by another joined character.
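Since there is no fixed bound, a practical approach for your chunking problem is to convert a generously sized UTF-8 window around each estimated cut point and let ICU snap the cut to the nearest preceding grapheme-cluster boundary. Below is a minimal sketch assuming ICU4C; the function name, the window size, and the choice of root locale are illustrative assumptions, not guarantees.

```cpp
// Sketch: snap an estimated split offset (a UTF-16 index into a converted
// window) to the nearest grapheme-cluster boundary at or before it.
#include <unicode/brkiter.h>
#include <unicode/locid.h>
#include <unicode/unistr.h>
#include <memory>
#include <string>

int32_t snapToGraphemeBoundary(const icu::UnicodeString& window16, int32_t offset) {
    if (offset <= 0) return 0;                        // 0 is always a boundary
    UErrorCode status = U_ZERO_ERROR;
    std::unique_ptr<icu::BreakIterator> bi(
        icu::BreakIterator::createCharacterInstance(icu::Locale::getRoot(), status));
    if (U_FAILURE(status)) return offset;             // fall back to the raw estimate
    bi->setText(window16);
    if (bi->isBoundary(offset)) return offset;        // already on a cluster boundary
    return bi->preceding(offset);                     // otherwise snap backwards
}

// Usage sketch: window8 is a slice of the original UTF-8 string near the estimated cut.
// icu::UnicodeString window16 = icu::UnicodeString::fromUTF8(window8);
// int32_t cut16 = snapToGraphemeBoundary(window16, window16.length() / 2);
```

The window size is a heuristic you choose; if a window happens to start or end inside an unusually long cluster, you would need to widen it and retry.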

Just to add an example to the accepted answer.
You can for example create arbitrarily large grapheme clusters using this page:
https://glitchtextgenerator.com/
As an example here is a "letter X" that occupies 73 bytes on disk:
x̧̡̬̘͓̖̲̻̻̲̠̪̻͓͙̜̂̓̊̔̀̀͗̑̀̅̀̂̚͘̕̚͘͢͜͠
I also created another one that is close to 10 kilobytes, but it is perhaps better not to post such monsters here because they could cause problems. Depending on the software, these render in interesting ways.

Related

Using Traditional Chinese with AWS DynamoDB

I have a mobile app that stores data in DynamoDB tables. There is a group of users in Taiwan who attempted to store their names in the database. When the data is stored it becomes garbled. I have researched this and see that it is because DynamoDB uses UTF encoding while Traditional Chinese uses the Big5 text encoding. How do I set up DynamoDB so that it will store and recall the proper characters?
So you start with a string in your head. It's a sequence of Unicode characters. There's no inherent byte encoding to the characters. The same string could be encoded into bytes in a variety of ways. Big5 is one. UTF-8 is another.
When you say that Traditional Chinese uses Big5, that's not entirely true. It may be commonly encoded in Big5, but it could be in UTF-8 instead, and UTF-8 has this cool property that it can encode all Unicode character sequences. That's why it's become the standard encoding for situations where you don't want to optimize for one character set.
So your challenge is to make sure you carefully control the characters and encodings so that you're sending UTF-8 sequences to DynamoDB. The standard SDKs will do this correctly as long as you're creating the strings as basic strings in them.
You also have to make sure you're not confusing yourself when you look at the data. If you look at UTF-8 bytes but in a way where you're interpreting them as Big5 then it's going to look like gibberish, or vice versa.
You don't say how they're loading the data. If they're starting with a file, that could be the problem. You'd want to read the file while telling your language that it's Big5; then you'll have the string version, and then you can send the string version and rely on the SDK to correctly translate to UTF-8 on the wire.
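For instance, if the client were written in C++, ICU's converters could do the Big5-to-UTF-8 step. This is only a hedged sketch under that assumption; the function name and file-reading approach are illustrative, and there is nothing DynamoDB-specific in it.

```cpp
// Read bytes known to be Big5, decode them via ICU, and re-encode as UTF-8
// so the string handed to the SDK is valid UTF-8.
#include <unicode/unistr.h>
#include <fstream>
#include <sstream>
#include <string>

std::string big5FileToUtf8(const std::string& big5_path) {
    std::ifstream in(big5_path, std::ios::binary);
    std::ostringstream raw;
    raw << in.rdbuf();
    std::string bytes = raw.str();

    // Decode the Big5 bytes into ICU's internal UTF-16 representation...
    icu::UnicodeString decoded(bytes.data(), static_cast<int32_t>(bytes.size()), "Big5");

    // ...then serialize back out as UTF-8.
    std::string utf8;
    decoded.toUTF8String(utf8);
    return utf8;
}
```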
I remember when I first learned this stuff it was all kind of confusing. The thing to remember is that a capital A exists as an idea (and is a defined character in Unicode), and there are a whole lot of mechanisms you could use to put that letter into ones and zeros on disk. Each of those ways is an encoding. ASCII is popular, EBCDIC was another contender from the past, and UTF-16 is yet another contender now. Traditional Chinese is a character set (a set of characters), and you can encode each of those characters in a bunch of ways too. It's just a question of how you map characters to bits and bytes and back again.

I am building a program for Urdu language analysis, so how can I make my program accept a text file in the Urdu language in C++?

I am building a language analysis program. I have a program which counts the words in a text and gives the ratio of every word in the text as output, but this program cannot work on a file containing Urdu text. How can I make it work?
Encoding
Urdu may be presented in two¹ forms: Unicode and Code Page 868. This is convenient to you because the two ranges do not overlap. It is inconvenient because the Unicode code range is U+0600 – U+06FF, which means encoding is an issue:
CP-868 will encode each one as a single-byte value in the range 128–252
UTF-8 will encode each one as a two-byte sequence with bits 110x xxxx and 10xx xxxx
UTF-16 encodes each of these characters as a single two-byte unit
UTF-32 encodes every character as four-byte entities
This means that you should be aware of encoding issues, and for an easy life, use UTF-16 internally (std::u16string), and accept files as (default) UTF-8 / CP-868, or as UTF-16/32 if there is a BOM indicating such.
Your other option is to simply require all input to be UTF-8 / CP-868.
¹ AFAIK. There may be other ways of storing Urdu text.
  Three forms. See comments below.
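If you accept UTF-16/32 input when a BOM is present, the check itself is small. Here is a hedged sketch; the enum and function names are just placeholders, not part of any library.

```cpp
// Peek at the first bytes of the file to decide between UTF-8, UTF-16 (LE/BE),
// UTF-32 (LE/BE), and "no BOM" (treat as UTF-8 / CP-868 as described above).
#include <fstream>
#include <string>

enum class Bom { None, Utf8, Utf16LE, Utf16BE, Utf32LE, Utf32BE };

Bom detectBom(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    unsigned char b[4] = {0, 0, 0, 0};
    in.read(reinterpret_cast<char*>(b), 4);
    if (b[0] == 0xFF && b[1] == 0xFE && b[2] == 0x00 && b[3] == 0x00) return Bom::Utf32LE;
    if (b[0] == 0x00 && b[1] == 0x00 && b[2] == 0xFE && b[3] == 0xFF) return Bom::Utf32BE;
    if (b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF)                 return Bom::Utf8;
    if (b[0] == 0xFF && b[1] == 0xFE)                                 return Bom::Utf16LE;
    if (b[0] == 0xFE && b[1] == 0xFF)                                 return Bom::Utf16BE;
    return Bom::None;   // no BOM: assume UTF-8 or CP-868
}
```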
Word separation
As you know, the end of a word is generally marked with a special letter form.
So, all you need is a table of end-of-word letters listing letters in both the CP-868 range and the Unicode Arabic text range.
Then, every time you find a space or a letter in that table you know you have found the end of a word.
Histogram
As you read words, store them in a histogram. For C++ a std::map<std::u16string, std::size_t> will do. The actual content of each word does not matter.
After that you have all the information necessary to print stats about the text.
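A minimal sketch of that step, assuming word extraction has already produced a vector of words (the function name is illustrative):

```cpp
// Count each word, then report each word's share of the total word count.
#include <cstddef>
#include <map>
#include <string>
#include <vector>

std::map<std::u16string, double>
wordRatios(const std::vector<std::u16string>& words) {
    std::map<std::u16string, double> ratios;
    if (words.empty()) return ratios;

    std::map<std::u16string, std::size_t> counts;
    for (const auto& w : words) ++counts[w];                   // the histogram itself

    for (const auto& [word, n] : counts)                       // word -> share of total
        ratios[word] = static_cast<double>(n) / words.size();
    return ratios;
}
```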
Edit
The approach presented above is designed to be simple at the cost of some correctness. If you are doing something for the workplace, for example, and assuming it matters, you should also consider:
Normalizing word forms
For example, the same word may be presented in standard Arabic text codes or using the Urdu-specific codes. If you do not convert to the Urdu equivalent characters then you will have two words that should compare equal but do not.
Use something internally consistent. I recommend UZT, as it is the most complete Urdu text representation. You will also need an additional lookup for the original text representation from the UZT representation.
Dictionaries
Use as complete a dictionary (as an unordered_set <u16string>) of Urdu words as you can get.
This is how it is done with languages like Japanese, for example, to find breaks between words.
Then use the dictionary to find all the words you can, and fall back on letterform recognition and/or spaces for what remains.
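A hedged sketch of that dictionary-first pass, using greedy longest-match. The maximum word length is an arbitrary assumption, and the single-code-unit fallback stands in for the letterform/space handling described earlier.

```cpp
// Greedy longest-match segmentation against a set of known words.
#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

std::vector<std::u16string>
segment(const std::u16string& text,
        const std::unordered_set<std::u16string>& dict,
        std::size_t maxWordLen = 8) {                 // assumed longest dictionary word
    std::vector<std::u16string> words;
    std::size_t i = 0;
    while (i < text.size()) {
        std::size_t best = 0;
        std::size_t limit = std::min(maxWordLen, text.size() - i);
        for (std::size_t len = limit; len >= 1; --len) {
            if (dict.count(text.substr(i, len))) { best = len; break; }
        }
        if (best == 0) best = 1;                      // fallback: unknown single code unit
        words.push_back(text.substr(i, best));
        i += best;
    }
    return words;
}
```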

Where can I get sample Reed-Solomon encoded data?

I want to write a Reed-Solomon decoder and experiment with performance improvements. Where could I find sample data with appended Reed-Solomon parity bytes?
I am aware that Reed-Solomon is used in all kinds of 1D and 2D bar codes, but I would like to have the raw data (an array of bytes) with clear separation of payload and parity bytes.
Any help is appreciated.
Basically, a Reed-Solomon code will be composed of symbols with values between 0 and 2^m - 1, where m is the exponent of the Galois field used to generate the RS code. For example, in GF(2^8) (2^8 = 256), you will get an RS code composed of symbols between 0 and 255 (compatible with ASCII, UTF-8 and usual binary encodings). In GF(2^16), you will get symbols between 0 and 65535 (compatible with UTF-16, or with UTF-8 if you encode 2 characters as one).
Other than the range of values of each symbol of an RS code, all the rest can basically be considered random from an external point of view (it's not if you have the generator polynomial and the Galois field, but for your purpose of getting a sample, you can assume a random distribution of values in the valid range).
If you want to generate real samples of an RS code with the corresponding data block, you can use the Python library pyFileFixity (disclaimer, I am the author). By default, each ecc block is separated by an MD5 digest, so that you can clearly separate them. The original data is not stored, but you can easily do that by modifying the script structural_adaptive_ecc.py or header_ecc.py (the latter will be easier to modify) to also store the original data (it's just a file.write() to edit). If Python is not your thing, you can probably find a Reed-Solomon library for your language of choice, and just do a slight modification to print or save into a file the original data along with the ecc blocks.
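If you would rather generate your own samples in C++, the sketch below is a minimal systematic encoder over GF(2^8) with the commonly used primitive polynomial 0x11D. That is one conventional parameter choice, not necessarily what any particular library or barcode standard uses; its output is the payload followed by nsym parity bytes, which gives exactly the clear payload/parity separation asked for.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// GF(2^8) multiplication, primitive polynomial x^8 + x^4 + x^3 + x^2 + 1 (0x11D).
static uint8_t gfMul(uint8_t a, uint8_t b) {
    uint8_t r = 0;
    while (b) {
        if (b & 1) r ^= a;
        uint8_t hi = a & 0x80;
        a <<= 1;
        if (hi) a ^= 0x1D;
        b >>= 1;
    }
    return r;
}

// Generator polynomial g(x) = (x - a^0)(x - a^1)...(x - a^(nsym-1)), a = 2.
// Coefficients stored highest degree first.
static std::vector<uint8_t> generatorPoly(int nsym) {
    std::vector<uint8_t> g = {1};
    uint8_t alphaPow = 1;                              // a^0
    for (int i = 0; i < nsym; ++i) {
        std::vector<uint8_t> ng(g.size() + 1, 0);
        for (std::size_t j = 0; j < g.size(); ++j) {
            ng[j]     ^= g[j];                         // times x
            ng[j + 1] ^= gfMul(g[j], alphaPow);        // times a^i (minus == xor)
        }
        g = std::move(ng);
        alphaPow = gfMul(alphaPow, 2);                 // a^(i+1)
    }
    return g;
}

// Systematic encoding: output is the message followed by nsym parity bytes.
std::vector<uint8_t> rsEncode(const std::vector<uint8_t>& msg, int nsym) {
    std::vector<uint8_t> gen = generatorPoly(nsym);
    std::vector<uint8_t> out(msg);
    out.resize(msg.size() + nsym, 0);
    for (std::size_t i = 0; i < msg.size(); ++i) {     // polynomial (synthetic) division
        uint8_t coef = out[i];
        if (coef == 0) continue;
        for (std::size_t j = 1; j < gen.size(); ++j)
            out[i + j] ^= gfMul(gen[j], coef);
    }
    std::copy(msg.begin(), msg.end(), out.begin());    // restore the payload bytes
    return out;
}
```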

How to identify compressed/uncompressed bit groups?

I'm using a static dictionary file with some words and values for those words. These values are not fixed size; for example, "the" is 1, "love" is 01, "kill" is 101, etc. When I try to compress a group of words, I traverse every word and look it up in the dictionary to see if a value exists for it. If one exists I replace the word with the value; if it doesn't exist I encode the word as bytes. After compression I get a chunk of bits, and because these dictionary values and uncompressed words are not fixed size I cannot group the bits and decode them.
I have thought about using a 1-bit flag for every group of bits to determine whether it is compressed or uncompressed, but I can't detect the flag bit because of the unknown length of a codeword or regular word.
If I use a 1-byte delimiter, it still has problems. Let's say my delimiter is 00000000, and before the delimiter I have 100 and after the delimiter I have 001, so we have 10000000000001; how am I supposed to know which group of these bits is my delimiter?
Can I use some other method to group these compressed/uncompressed bits to decode them? Thank you.
First off, what language and system are you intending to deploy this on? Many languages provide their own libraries and tools for compression and may suit your needs without major low-level design effort.
The answer here is to establish some more rigorous bookkeeping and file formatting to be able to undo the compression. Most compression systems have some amount of overhead in their file format which is why when you compress something twice you don't necessarily save anything and can actually increase the size of the file.
Files often take advantage of a header at the start of the file to provide key information, which would be a good place to define any rules that are specific to the compressed file.
Create a fixed-size delimiter to use between code words only. This can be determined after analyzing the file but before actually writing out the compressed data.
If you generate your delimiter rather than using a fixed known value, include it as one of your header items.
Keep your header in a simple ASCII format so that you can easily extract it with standard tools like sscanf and fscanf.
If you want a header that can contain extra information, you may need a consistent way to tell where the header ends and the data begins. Including something to the effect of "ENDHEADER" should be enough and still easily identifiable.
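A hedged sketch of such a header layout; the field name "delimiter" and the function names are placeholders for whatever your format actually needs.

```cpp
// Short ASCII header terminated by the literal token ENDHEADER, followed by
// the raw compressed bytes.
#include <cstdio>
#include <cstring>
#include <vector>

void writeCompressedFile(const char* path, unsigned delimiter,
                         const std::vector<unsigned char>& payload) {
    FILE* f = std::fopen(path, "wb");
    if (!f) return;
    std::fprintf(f, "delimiter=%u\nENDHEADER\n", delimiter);
    std::fwrite(payload.data(), 1, payload.size(), f);
    std::fclose(f);
}

bool readHeader(const char* path, unsigned* delimiter, long* dataOffset) {
    FILE* f = std::fopen(path, "rb");
    if (!f) return false;
    char line[128];
    bool ok = false;
    while (std::fgets(line, sizeof line, f)) {
        if (std::strncmp(line, "ENDHEADER", 9) == 0) { ok = true; break; }
        std::sscanf(line, "delimiter=%u", delimiter);   // parse known header fields
    }
    *dataOffset = std::ftell(f);   // the compressed bits start right after the header
    std::fclose(f);
    return ok;
}
```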

Cocos2d custom font file - issue with character id

I am developing a cocos2d game which supports multiple languages. I created a font file (.png and .fnt) with all supported characters.
The issue is that some of the character IDs are in the range 917505-917631. So I set kCCBMFontMaxChars = 917632. But this is taking a lot of memory.
Can anyone please tell me how to handle this situation?
kCCBMFontMaxChars = 0xffff; // 65k
This should suffice for all characters in Unicode's Basic Multilingual Plane; it certainly works for all the Asian and Cyrillic languages. The memory usage will be about 2 MB.
Don't worry about the IDs; I believe they are offsets into the BMFont char array and not indexes. Each entry is 32 bytes. 917632 divided by 32 gives you 28676, which, if it is an index, fits within the Unicode character range.
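As a quick back-of-the-envelope check of the memory claims, using the 32-bytes-per-entry figure assumed above, something like this shows why 917632 entries hurt:

```cpp
// Rough memory estimate for the BMFont char array, assuming 32 bytes per entry.
// 0xFFFF entries come to about 2 MB; 917632 entries come to roughly 28 MB.
#include <cstddef>
#include <cstdio>

int main() {
    const std::size_t entrySize = 32;   // assumed size of one entry, in bytes
    std::printf("0xFFFF entries:  %zu bytes\n", static_cast<std::size_t>(0xFFFF) * entrySize);
    std::printf("917632 entries: %zu bytes\n", static_cast<std::size_t>(917632) * entrySize);
    return 0;
}
```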