Are there, or could there be, any old or new characters beyond the UTF-32 character set? Is there a bigger character set?
Just a random question.
I've got a table that I populate with tab-separated data from files whose encoding doesn't seem to be exactly UTF-8, like so:
CREATE TABLE tab (
url varchar(2000),
...
);
COPY tab
FROM 's3://input.tsv'
After the copy has completed I run
SELECT
MAX(LEN(url))
FROM tab
which returns 1525. I figure that, since I'm wasting space, I might as well shrink the column by almost a quarter, using varchar(1525) instead of varchar(2000). But neither redoing the COPY nor setting up a new table and inserting the already-imported data works. In both cases I get
error: Value too long for character type
Why won't the column hold these values?
Your file might be in a multi-byte format.
From the LEN Function documentation:
The LEN function returns an integer indicating the number of characters in the input string. The LEN function returns the actual number of characters in multi-byte strings, not the number of bytes. For example, a VARCHAR(12) column is required to store three four-byte Chinese characters. The LEN function will return 3 for that same string.
The extra size of a VARCHAR will not waste disk space due to the compression methods used by Amazon Redshift, but it will waste in-memory buffer space when a block is read from disk and decompressed into memory.
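To see the mismatch concretely, here is a small Python sketch (the URL is made up): LEN counts characters, but Redshift's VARCHAR(n) limit is measured in bytes, so a multi-byte string can pass the LEN check and still overflow the column.

url = "https://example.com/caf\u00e9/\u4e2d\u6587"  # hypothetical multi-byte URL
chars = len(url)                    # what LEN() reports: 27
bytes_ = len(url.encode("utf-8"))   # what counts against VARCHAR(n): 32
print(chars, bytes_)                # 27 32 -- needs VARCHAR(32), not VARCHAR(27)

So a column sized from MAX(LEN(url)) can be too small by however many extra bytes the widest multi-byte value needs.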
I have a column with a length of 8. However, when I view the data, the value in the column is more than 8 characters long.
[Screenshot omitted.]
You're mixing up Length and Format.
http://blogs.sas.com/content/sasdummy/2007/11/20/lengths-and-formats-the-long-and-short-of-it/
Length: The column length, in SAS terms, is the amount of storage allocated in the data set to hold the column values. The length is specified in bytes. For numeric columns, the valid lengths are usually 3 through 8. The longer the length, the greater the precision allowed within the column values. For character columns, the length can be 1 through 32767. For single-byte data values, that equates to the number of characters the column can hold. For multibyte data values (DBCS, Unicode, or UTF-8), where a character can occupy more than one byte, the number of characters that fit might be less than the length value of the column.
Format: The column format, in SAS terms, is basically an instruction for how to transform a raw value into an appearance that is suitable for a given purpose. A basic attribute of a format is the format length, which controls how much of the value is displayed. For example, a character column might have a storage length of 10 bytes but a format length of 5 characters (the $5. format), so when you see the formatted values you will see at most 5 characters for each record.
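The same distinction can be sketched in Python (an analogy, not SAS): storage length and display format are independent, so an 8-byte numeric value can easily be rendered as more than 8 characters.

import struct
x = 1234567.89
print(len(struct.pack("d", x)))   # 8 -- bytes of storage, like a SAS numeric length of 8
print(f"{x:,.2f}")                # 1,234,567.89 -- a 12-character formatted display

That is likely what the screenshot showed: a length-8 column wearing a format wider than 8.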
I'm trying to use PROC FREQ on a subset of my data called dataname. I would like it to include all rows where varname doesn't equal "A.Never Used". I have the following code:
proc freq data=dataname(where=(varname NE 'A.Never Used'));
run;
I thought there might be a problem with trailing or leading blanks so I also tried:
proc freq data=dataname(where=(strip(varname) NE 'A.Never Used'));
run;
My guess is that for some reason my string values are not exactly "A.Never Used", but whenever I print the data this is the value I see.
This is a common issue when dealing with string data (and a good reason to avoid it where you can!). Consider the source of your data: did it come from web forms? Then it probably contains nonbreaking spaces ('A0'x) instead of regular spaces ('20'x). Did it come from a Unicode environment (say, one where Japanese characters are legal)? Then you may have transcoding issues.
A few options that work for a large majority of these problems:
Compress out everything but alphabetic characters, e.g. where=(compress(varname,,'ka') ne 'ANeverUsed'). The 'ka' modifiers mean 'keep only' and 'alphabetic characters'.
UPCASE or LOWCASE to ensure you're not running into case issues.
Use put varname $HEX.; in a data step to look at the underlying bytes; every pair of hex digits is one byte, and 20 is a regular space (which strip would remove). Sort by varname before doing this so that the rows you think should share the value sit next to each other, and look for the difference: probably some special character, or multibyte characters, or who knows what, but it should be apparent there. A Python sketch of the same idea follows this list.
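Here is that hex idea in Python rather than SAS (the strings are hypothetical): two values that print identically can differ in their underlying bytes, which is why the WHERE comparison fails.

clean = "A.Never Used"
dirty = "A.Never\u00a0Used"   # nonbreaking space ('A0'x) where a regular space should be
print(clean == dirty)                    # False, though both look the same when printed
print(clean.encode("latin-1").hex(" "))  # ... 20 ... a regular space
print(dirty.encode("latin-1").hex(" "))  # ... a0 ... the hidden culprit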
I am developing a cocos2d game which supports multiple languages. I created a font file (.png and .fnt) with all supported characters.
The issue is that some of the character IDs are in the range 917505-917631, so I set kCCBMFontMaxChars = 917632. But this takes a lot of memory.
Can anyone please tell me how to handle this situation?
kCCBMFontMaxChars = 0xffff; // 65k
This should suffice for every character in the Basic Multilingual Plane; it certainly works for all the Asian and Cyrillic languages. The memory usage will be exactly 2 MB.
Don't worry about the IDs; I believe they are offsets into the BMFont char array rather than indexes. Each entry is 32 bytes, and 917632 divided by 32 gives 28676, which, if it is an index, fits within the Unicode character range.
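For what it's worth, the numbers can be checked directly; here is a small Python sketch of the arithmetic (reading 917505-917631 as Unicode code points is my interpretation, not something from the cocos2d source):

print(hex(917505), hex(917631))     # 0xe0001 0xe007f -- the Unicode "Tags" block
print(0xffff * 32 / (1024 * 1024))  # ~2.0 -- a 0xffff-entry table at 32 bytes each, in MB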
The problem breaks down into two steps:
Problem Step 1. Access 97 db containing XML strings that are encoded in UTF-8.
The problem boils down to this: the Access 97 db contains XML strings that are encoded in UTF-8. So I created a patch tool that separately converts the XML strings from UTF-8 to Unicode. To convert a UTF-8 string to Unicode, I have used the function
MultiByteToWideChar(CP_UTF8, 0, PChar(OriginalName), -1, @newName, Size);
(where newName is an array declared as "newName : Array[0..2048] of WideChar;").
This function works well in most cases; I have checked it with Spanish and Arabic characters. But when I work with Greek and Chinese characters it chokes.
For some Greek characters like "Ευγ. ΚαÏαβιά" (as stored in Access 97), the resulting string contains null characters in between, and when it is stored to a wide string the characters get clipped.
For some Chinese characters like "?¢»?µ?" (as stored in Access 97), the result is totally absurd, like "?¢»?µ?".
Problem Step 2. Access 97 db text strings: the application GUI takes Unicode input, which is saved in Access 97.
First I checked with Arabic and Spanish characters, and it seemed that no explicit character encoding was required. But again the problem comes with Greek and Chinese characters.
I tried the same function mentioned above for the text conversion (is that correct???); the result was again disappointing. The Spanish characters, which are OK without conversion, have their Unicode characters either lost or converted to plain ASCII letters.
The Greek and Chinese characters show behaviour similar to that mentioned in step 1.
Please guide me. Am I taking the right approach? Is there some other way around???
Well, right now I am confused and full of questions :)
There is no special requirement for working with Greek characters. The real problem is that the characters were stored in an encoding that Access doesn't recognize in the first place. When the application stored the UTF-8 values in the database, it tried to convert every single byte to the equivalent byte in the database's codepage. Every character that had no correspondence in that encoding was replaced with '?'. That may mean the Greek text is OK, while the Chinese text may be gone.
In order to convert the data to something readable, you have to know the codepage it is stored in. Using that, you can get the actual bytes and then convert them to Unicode.
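The repair can be sketched in Python (an illustration only, assuming the bytes went through a Windows-1252 codepage; substitute whatever codepage the database actually used): re-encode the garbled text back to its raw bytes, then decode those bytes as UTF-8.

garbled = "\u00ce\u2022\u00cf\u2026\u00ce\u00b3."  # UTF-8 for the Greek "Ευγ." misread as cp1252
raw = garbled.encode("cp1252")                     # back to the raw bytes: b'\xce\x95\xcf\x85\xce\xb3.'
print(raw.decode("utf-8"))                         # Ευγ.

Characters that were flattened to '?' during storage are unrecoverable, which is why the Chinese text may be gone for good.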