I've got a table that I populate with tab-separated data from files whose encoding doesn't seem to be exactly UTF-8, like so:
CREATE TABLE tab (
url varchar(2000),
...
);
COPY tab
FROM 's3://input.tsv'
After the copy has completed I run
SELECT
MAX(LEN(url))
FROM tab
which returns 1525. I figure, since I'm wasting space, I might as well shrink the column by almost a quarter by using varchar(1525) instead of varchar(2000). But neither redoing the COPY nor setting up a new table and inserting the already imported data works. In both cases I get
error: Value too long for character type
Why won't the column hold these values?
Your file might be in a multi-byte format.
From the LEN Function documentation:
The LEN function returns an integer indicating the number of characters in the input string. The LEN function returns the actual number of characters in multi-byte strings, not the number of bytes. For example, a VARCHAR(12) column is required to store three four-byte Chinese characters. The LEN function will return 3 for that same string.
The extra size of a VARCHAR will not waste disk space due to the compression methods used by Amazon Redshift, but it will waste in-memory buffer space when a block is read from disk and decompressed into memory.
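If you want to confirm that, a quick check like the following (reusing the tab table and url column from the question) compares characters with bytes; OCTET_LENGTH counts bytes while LEN counts characters, and since Redshift sizes VARCHAR columns in bytes, max_bytes is the number the column actually has to cover:
SELECT MAX(LEN(url))          AS max_chars,
       MAX(OCTET_LENGTH(url)) AS max_bytes
FROM tab;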
Related
I have a column with a length of 8. However, when I view the data, the value in the column is above 8 characters long.
You're mixing up Length and Format.
http://blogs.sas.com/content/sasdummy/2007/11/20/lengths-and-formats-the-long-and-short-of-it/
Length: The column length, in SAS terms, is the amount of storage allocated in the data set to hold the column values. The length is specified in bytes. For numeric columns, the valid lengths are usually 3 through 8. The longer the length, the greater the precision allowed within the column values. For character columns, the length can be 1 through 32767. For single-byte data values, that equates to the number of characters the column can hold. For multibyte data values (DBCS, Unicode, or UTF-8), where a character can occupy more than one byte, the number of characters that fit might be less than the length value of the column.
Format: The column format, in SAS terms, is basically an instruction for how to transform a raw value into an appearance that is suitable for a given purpose. A basic attribute of a format is the format length, which controls how much of the value is displayed. For example, a character column might have a storage length of 10 bytes, but a format length of 5 characters ($5. format), so when you see the formatted values you will see at most 5 characters for each record.
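To see the difference in a quick data step (the variable name and values here are just for illustration, a minimal sketch rather than anything from your data):
data _null_;
    length city $ 10;    /* storage length: 10 bytes are allocated for the value  */
    format city $5.;     /* format length: only 5 characters are shown on display */
    city = 'Washington';
    put city=;           /* prints city=Washi, even though all 10 bytes are stored */
run;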
Our database predates good Unicode support in our database software; in its place it has a pseudo-base64 encoding which it uses to store UTF-16 characters in an ASCII field. I am writing a function to convert this type of field into straight UTF-8 within SAS.
The function loops through the string, converting each set of three ASCII characters into a Unicode character and placing it in an array. When experimenting with code in a data step I had used cat(of final{*}) to convert the array into a string, but the same code does not appear to be valid within a function.
I am currently collating the string in the loop with collate = trim(collate)!!trim(final{i}) and a collate string of arbitrary length, but I would like to produce the result directly from the array, or at least set the size of the collate string based on the length of the input string.
I've included a pastebin of the data and function here.
Edit: The version of SAS I was using is 9.3
The same code is valid in a function in SAS 9.4 TS1M3; it may not be in earlier versions (significant changes were made to how arrays were handled in FCMP in 9.4 and in maintenance releases TS1M2 and 3).
However, this doesn't really solve your arbitrary length problem; when I run your function with
outtext = cat(of final{*});
return (outtext);
I get... 1 character! And when I run
return(cats(of final{*}));
output:
Obs text_enc finaltext
1 ABCABlABjABhAB1ABzABlAAgABVABUABGAA4AAgABpABzAAgABoABhAByABk BecauseU
2 ABTABpABtABwABsABlAByAAgABsABpABrABlAAgAB0ABoABpABz Simplerl
3 ABJABvAAgABJABvAAgABCAByABvABtABpABvABz IoIoBrom
which is a bit better (cats trims for you), but I still only get 8 characters. That's because 8 characters is the default length in SAS for an undeclared character variable. Expand the length (using a length statement for outtext) and you get:
Obs text_enc finaltext
1 ABCABlABjABhAB1ABzABlAAgABVABUABGAA4AAgABpABzAAgABoABhAByABk BecauseUTF8ishard
2 ABTABpABtABwABsABlAByAAgABsABpABrABlAAgAB0ABoABpABz Simplerlikethis
3 ABJABvAAgABJABvAAgABCAByABvABtABpABvABz IoIoBromios
You'll still need to define whatever length you need, then. FCMP doesn't, as far as I know, allow for a way to have an undefined-length string; you need to define the default (and maximum) length for the string you're going to return. The user is welcome to define a shorter length, and should, when it's appropriate.
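Here's a minimal sketch of that pattern (9.4 TS1M3 or later, per the above); this is not your pastebin code, and the function name, the 200-byte maximum, and the hard-coded array values are just assumptions for illustration:
proc fcmp outlib=work.funcs.demo;
    function join_demo(suffix $) $ 200;   /* declared maximum length of the returned string */
        array final[3] $ 8;
        final[1] = 'Because';
        final[2] = 'UTF8is';
        final[3] = suffix;
        length outtext $ 200;             /* without this, outtext defaults to $8 */
        outtext = cats(of final[*]);
        return (outtext);
    endsub;
run;

options cmplib=work.funcs;

data _null_;
    length result $ 200;                  /* the calling data step needs a length too */
    result = join_demo('hard');
    put result=;                          /* result=BecauseUTF8ishard */
run;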
When reading varbinary data from Amazon RDS, SAS is appending spaces to the end of the data.
proc sql;
select emailaddr from tablename1;
quit;
The column emailaddr is varbinary(20)
For example:
I inserted "XX#WWW.com ", but while reading from db, it is appending spaces equal to the length of the column.
Since the column length is 20 it is returning "XX#WWW.com " ( note the spaces appended. I cannot use the trim() function since this also removes spaces that might genuinely be part of the original inserted data.
How can I stop SAS from appending these spaces?
For my program I need to get the exact data as present in the database, without any extra spaces attached.
That's how SAS works; SAS has only a CHAR-equivalent datatype (in Base SAS, anyway; DS2 is different), with no VARCHAR concept. Whatever the length of the column is (20 here), the value will be padded with trailing spaces to that full length of 20 characters.
Most of the time, it doesn't matter; when SAS inserts into another RDBMS for example it will typically treat trailing spaces as nonexistent (so they won't be inserted). You can use TRIM and similar to deal with the spaces if you're using regular expressions or concatenation to work with these values; CATS and similar functions perform concatenation-with-trimming.
If trailing spaces are part of your data, you are mostly out of luck in SAS. SAS considers trailing spaces irrelevant (equivalent to null characters). You can append a non-space character in SQL, or translate the spaces to NBSPs ('A0'x) or something else, while still in SQL, or use quotes or something around your actual values - but whatever you do will be complicated.
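For example, if you can get the database side to hand genuine spaces back as non-breaking spaces ('A0'x), a sketch like this (dataset and column names are assumed) captures the real length and then turns the placeholders back into ordinary spaces on the SAS side:
data want;
    set have;                                      /* have = the table as read from RDS       */
    real_len  = length(emailaddr);                 /* pad blanks are not counted, NBSPs are   */
    emailaddr = translate(emailaddr, ' ', 'A0'x);  /* placeholders back into ordinary spaces  */
run;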
I'm new to C++ and probably have a silly question. I have an ifstream which I'd like to split approximately in half.
The file in question is a sorted csv and I wish to search on the first value of each line of the file.
Eventually the file will be very large so I am trying to avoid having to read every line of the file.
e.g.
If the file contains 7 lines I'd like to split the ifstream to give 1 stream containing the first 3 lines and 1 stream containing the last 4 lines.
First, use the answer to this question to determine the size of your file. Then divide that number by two. Read the input line by line, and write it to the first output stream; check file.tellg() after each call. Once you're past the half-way point, switch the output to the second file.
This wouldn't split the strings evenly between the files, but the total number of characters in these strings should be close enough, and it wouldn't split your file in the middle of a string.
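A rough sketch of that approach; the file names are assumptions, and the input is opened in binary mode so that tellg() reports plain byte offsets:
#include <fstream>
#include <string>

int main() {
    std::ifstream in("input.csv", std::ios::binary);    // assumed input file name
    in.seekg(0, std::ios::end);
    const std::streamoff half = static_cast<std::streamoff>(in.tellg()) / 2;
    in.seekg(0, std::ios::beg);

    std::ofstream out1("first_half.csv");
    std::ofstream out2("second_half.csv");

    std::string line;
    bool past_half = false;
    while (std::getline(in, line)) {
        (past_half ? out2 : out1) << line << '\n';       // a line is never split across files
        if (!past_half && static_cast<std::streamoff>(in.tellg()) >= half)
            past_half = true;                            // switch files once past the midpoint
    }
}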
Think of it as a relational database with one huge table. In order to find a certain piece of data, you can either do a sequential scan over the entire table, or use an index (which must be usable for the type of query you want to perform).
A typical index for a text file would be a list of offsets inside the file, sorted by the index expression. If the csv file is sorted by a specific column already, then the offsets in the index would be ascending, which is useful to know when building the index.
So basically you have to read the file once anyway, to find out where lines end; this is the index for the sort column. To find a particular element, use a binary search, using the index to find individual elements in the data set.
Depending on the data type, you can extend your index to allow for quick comparison without reading the actual data table. For example, in a word list you could keep the first four letters of the word next to the offset, which allows you to get into the right area quickly and only requires data reads for the last accesses (which you can then optimize to a sequential scan, as filesystems handle that a lot better).
The same technique can be applied to the other columns as well; the offsets stored in the index would no longer be ascending in file order, of course.
Since it is CSV data, a special case also applies: If the only index you have is in the same order as the file data itself and the end of record can be determined easily (that is, either you have a fixed record length, or there is a clear record separator, such as an EOL character), then building the actual index can be omitted and the values guessed (for fixed length records, offset is always equal to record length times offset in the index; for separated records you can just jump into the middle of a record and seek for the next terminator; be aware that there are nasty corner cases with binary search here). This does however mean that you will always be reading data pages here, which is less efficient than just reading the index.
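A sketch of the index-plus-binary-search idea, assuming the CSV is sorted lexicographically on its first field; the file name and search key are made up for illustration:
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

// One pass over the file to record where each line starts -- this is the index.
static std::vector<std::streamoff> build_index(std::ifstream& in) {
    std::vector<std::streamoff> offsets;
    in.clear();
    in.seekg(0);
    std::string line;
    std::streamoff pos = 0;
    while (std::getline(in, line)) {
        offsets.push_back(pos);
        pos = static_cast<std::streamoff>(in.tellg());
    }
    return offsets;
}

// First comma-separated field of the line starting at the given offset.
static std::string key_at(std::ifstream& in, std::streamoff pos) {
    in.clear();
    in.seekg(pos);
    std::string line;
    std::getline(in, line);
    return line.substr(0, line.find(','));
}

int main() {
    std::ifstream in("sorted.csv", std::ios::binary);    // assumed file name
    const std::vector<std::streamoff> index = build_index(in);

    const std::string wanted = "somekey";                // assumed search value
    std::size_t lo = 0, hi = index.size();
    while (lo < hi) {                                    // binary search over the index
        const std::size_t mid = lo + (hi - lo) / 2;
        if (key_at(in, index[mid]) < wanted) lo = mid + 1;
        else                                 hi = mid;
    }
    if (lo < index.size() && key_at(in, index[lo]) == wanted) {
        in.clear();
        in.seekg(index[lo]);
        std::string record;
        std::getline(in, record);                        // the full matching line
        std::cout << record << '\n';
    }
}
The guessing trick from the last paragraph would replace build_index with a seekg into the middle of the range plus one getline to re-sync on the next line boundary, at the cost of the corner cases mentioned above.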
I have an encoded character buffer array of size 512 in C, and a varchar database field in MySQL. Is it possible to store the encoded character buffer in the varchar?
I have tried this, but the problem I face is that it only stores a limited part of the buffer in the database and ignores the rest. What is the actual problem, and how do I solve it?
It is not clear what you mean by encoded.
If you mean that you have an arbitrary string of byte values, then varchar is a bad fit because it will attempt to trim trailing spaces. A better choice in such cases is to use varbinary fields.
If the string you are inserting contains control characters, you might be best off converting it into a hex string and inserting it as follows:
create table xx (
    v varbinary(512) not null
);
insert into xx values (0x68656C6C6F20776F726C64);
This will prevent any component in the tool chain from choking on NUL characters and so forth.
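To check what actually landed in the table (using the xx table from above), something like:
select hex(v), cast(v as char) from xx;   -- raw bytes, and the same bytes reinterpreted as text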
What size is your varchar declared for the table?
Often varchar fields are set to 255 bytes, not characters. Starting with MySQL 5.0.3 you can have longer varchar fields.
Sounds like you need a varchar(512) field, is that what you have?
See http://dev.mysql.com/doc/refman/5.0/en/char.html
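If the column turns out to be shorter than 512, something along these lines would widen it (the table and column names here are placeholders):
alter table your_table modify your_column varchar(512);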