I am struggling to import a CSV file with Cyrillic symbols into a table in HP Vertica, and every time I get the error
[Vertica][ODBC] (10170) String data right truncation on data from data source: String data is too big for the driver's data buffer
I tried to import utf8-saved .csv file and cp1251-saved .csv file, but the error is still there.
Any ideas?
The "String data right truncation" error can have two causes:
Target column is not big enough. Remember: (a) Vertica uses a byte-oriented length semantic for CHAR/VARCHAR and (b) it wants UTF-8 in input. So, for example, if you want to store the Euro sign (one, single, character) the target column should be - at least - CHAR(3) because the Euro sign requires three bytes when encoded UTF-8
The other possibility is that your ODBC loader does not allocate enough memory to store your string and/or does not understand field separators
Forget about CP-1251. Vertica wants UTF-8 in input.
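To size the target column, it can help to measure the encoded length of a sample value. This is a minimal sketch; the sample word and the helper name utf8_len are made up for illustration and are not part of Vertica or its ODBC driver:

```python
# Vertica sizes CHAR/VARCHAR in bytes, so measure the UTF-8 encoding,
# not the number of characters.
def utf8_len(text: str) -> int:
    """Return how many bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))

euro = "€"
word = "Привет"  # six Cyrillic letters, sample value for illustration

print(len(euro), utf8_len(euro))  # characters vs. UTF-8 bytes
print(len(word), utf8_len(word))

# The same word saved as CP-1251 uses one byte per letter, and those
# bytes are not valid UTF-8, so they cannot be loaded as-is.
cp1251_bytes = word.encode("cp1251")
try:
    cp1251_bytes.decode("utf-8")
    cp1251_is_valid_utf8 = True
except UnicodeDecodeError:
    cp1251_is_valid_utf8 = False
print(len(cp1251_bytes), cp1251_is_valid_utf8)
```

Each Cyrillic letter takes two bytes in UTF-8, so a column meant to hold N Cyrillic characters needs at least VARCHAR(2*N).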
I'm working on a big dataset encoded in UTF-16LE that holds 1 Billion records containing text strings in over 50 languages ( not all known to me).
I need to get these into our MySQL 5.7 database using LOAD DATA INFILE (for import speed), but I just found out that MySQL does not support the UTF-16LE text encoding when I tried to load the data with the Workbench import tool. Querying this data with Athena also returns no records with this encoding.
Which encoding is best for MySQL 5.7, handles multiple languages, and works with LOAD DATA INFILE?
Will this keep the text safe and not garble the text strings?
I have searched around a bit and found some people asking a similar question, but I have not found an answer I can make work.
I have tab-delimited .txt files which I need to read into a SAS database. The files contain a serial number which is 18 digits long, so SAS imports this as "5.2231309E17".
Ideally SAS would import all the fields as if they were text, not numbers.
To add a complexity to this, the import files have 2 different formats, these are only visible once the file is open, I cannot tell which format the file is from the name. Also there are no column names in the file. So I don't know which column is which until I have read in the file.
Currently my starting point is:
data Readin;
infile foo dsd dlm='09'x truncover;
input item1-item25;
run;
foo is the file something like 'c:\myfile.txt'
Any help is appreciated.
There are two separate issues here. One is that the "9.234E17" is displaying in scientific notation, and the other is that you are reading in numbers that can't be stored exactly as numbers anyway.
First, this is how the BEST12. format works; it is the default numeric format for values like this. It is not truncating the value in a meaningful way. If you simply change the format, to BEST32. for example, it will display the entire number, within the limits of precision, and the value will always behave as the full number, again within the limits of precision. For example, if I took 12345678 and formatted it BEST6., it would display as 1.23E7, but if I wrote if x=12345678 then do; put x; end;, it would still put x, because the stored value is exactly equal to 12345678.
However, that last part is important, and it is the second part of your problem. You can't store an 18-digit number precisely; 15 digits is the most you can store precisely in Windows and similar Intel-type environments (results differ slightly on mainframes). So you definitely need these to be stored as character, unless you don't care about the last few digits (it sounds like you do).
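The precision limit is easy to see outside SAS as well. The sketch below uses Python, whose float is an 8-byte IEEE 754 double, on the assumption that SAS numerics on Windows behave the same way. The serial number is made up for illustration:

```python
# An 8-byte double has a 53-bit mantissa, so integers are exact only
# up to 2**53 (about 9.007e15, i.e. every 15-digit integer is safe).
serial = 123456789012345678          # made-up 18-digit serial number
round_trip = int(float(serial))      # store as a double, read it back

print(serial)
print(round_trip)
print(serial == round_trip)          # the last digits do not survive

small = 999999999999999              # largest 15-digit integer
print(int(float(small)) == small)    # 15 digits round-trip exactly

# Keeping the value as a string (character variable) loses nothing.
as_text = str(serial)
print(as_text)
```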
If you have an (anything)-delimited file, your best bet is to simply write a data step to read it in, at which point you can assign the fields as character yourself. Don't use PROC IMPORT for most text files, unless they're the really easy, impossible-to-screw-up sort of thing. What you can do is look at your log after PROC IMPORT has run and copy the generated code from the log into a program; then make adjustments to turn the serial number into a character field (and fix anything else you want).
I had a similar problem: I was trying to import a file that had a 20-digit field. One workaround I found was to open the file in Excel and change the column's format from General to Number. When I then imported the file, the field was imported as a number and not in scientific notation.
I am totally new to Python. I have to parse a .txt file that contains network byte order binary encoded numbers (see here for the details on the data). I know that I have to use the package struct.unpack in Python. My questions are the following:
(1) Since I don't really understand how the function struct.unpack works, is it straightforward to parse the data? By that, I mean that if you look at the data structure, it seems that I have to write code for each type of message. But if I look at the online documentation for struct.unpack, it seems more straightforward, and I am not sure how to write the code. A short sample would be appreciated.
(2) What's the best practice once I parse the data? I would like to save the parsed file in order to avoid parsing the file each time I need to make a query. In what format should I keep the parsed file that would be the most efficient?
This should be relatively straightforward. I can't comment on how you're actually supposed to get the byte-encoded packets of information, but I can help you parse them.
First, here's a list of some of the packet types you'll be dealing with that I gathered from section 4 of the documentation:
TimeStamp
System Event Message
Stock Related Messages
Stock Directory
Stock Trading Action
Reg SHO Short Sale Price Test Restricted Indicator
Market Participant Position
Add Order Message
This continues on. But as an example, let's see how to decode one or two of these:
System Event Message
A System Event Message packet is 6 bytes long and has 3 portions:
A Message Type, which starts at byte 0, is 1 byte long, with a Value of S (a Single Character)
A TimeStamp, which starts at byte 1, is 4 bytes long, and should be interpreted as an Integer.
An Event Code, which starts at byte 5, is 1 byte long and is a String (Alpha).
Looking up each type in the struct.unpack code table, we'll need to build a string to represent this sequence. First, we have a Character, then a 4-byte Unsigned Integer, then another Character. This corresponds to "cIc". Because the data is in network byte order, the format needs the "!" prefix, giving the encoding and decoding string "!cIc". Without the prefix, struct uses native byte order and alignment, which on typical platforms pads the packet to 9 bytes instead of 6.
*NOTE: The unsigned portion of the Integer is documented in Section 3: Data Types of their documentation
Construct a fake packet
This could probably be done better, but it's functional:
>>> import struct
>>> import time
>>> data = struct.pack('!cIc', b'S', int(time.time()), b'O')
>>> print(repr(data)) # What does the bytestring look like?
b'SR\x8dn\xa6O' # Yep, that's bytes alright! (The middle four bytes depend on the current time.)
Unpack the data
In this example, we'll use the fake packet above, but in the real world we'd use a real data response:
>>> response_tuple = struct.unpack('!cIc', data)
>>> print(repr(response_tuple))
(b'S', 1385000614, b'O')
In this case, the 3rd item in the tuple (the 'O') is a key, to be looked up in the tables called System Event Codes - Daily and System Event Codes - As Needed.
If you need additional examples, feel free to ask, but that's the gist of it.
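On the question of writing code for each message type: one way to keep that manageable is a lookup table keyed by the message type byte. The sketch below only fills in the System Event layout described above, assuming network byte order (hence the '!' prefix). The names SystemEvent, MESSAGE_FORMATS and parse_message are made up for illustration, and the layouts of the other message types would have to be taken from the documentation:

```python
import struct
from collections import namedtuple

# Hypothetical record type for the System Event Message described above.
SystemEvent = namedtuple("SystemEvent", "message_type timestamp event_code")

# '!' = network (big-endian) byte order with no padding:
# 1-byte char, 4-byte unsigned int, 1-byte char = 6 bytes.
SYSTEM_EVENT = struct.Struct("!cIc")

# Message type byte -> (layout, record class). Only 'S' is filled in;
# other types need their own layouts from the specification.
MESSAGE_FORMATS = {
    b"S": (SYSTEM_EVENT, SystemEvent),
}

def parse_message(buf: bytes):
    """Decode one message, choosing the layout from its first byte."""
    layout, record = MESSAGE_FORMATS[buf[:1]]
    return record(*layout.unpack(buf[:layout.size]))

# Build a fake packet with a fixed timestamp and decode it again.
packet = SYSTEM_EVENT.pack(b"S", 1385000614, b"O")
event = parse_message(packet)
print(event)
```

Adding a new message type then means adding one entry to the table rather than another branch of parsing code.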
Recommendations on how to store this data. Well, I suppose that depends on what you'd like to do long term to this data. Probably, a database makes sense here. However, without further information, I cannot say.
Hope that helps!
Are there situations where I should prefer a binary file to a text file? I'm using C++ as the programming language.
For example, if I have to store a large text file, is it better to use a text file or a binary file?
Edit
For the moment, the file has no requirement to be readable by humans. Are there performance differences, security differences and so on?
Edit
Sorry for omitting the other requirements (thanks to Carey Gregory):
The records to save are in ASCII encoding.
The file must be encrypted (AES).
The machine can power off at any time, so I have to try to prevent errors.
I have to know if the file changes outside the program; I think I'll use a SHA1 digest of the file.
As a general rule, define a text format, and use it. It's much
easier to develop and debug, and it's much easier to see what is
going wrong if it doesn't work.
If you find that the files are becoming too big, or taking too
much time to transfer over the wire, consider compressing them.
A compressed text file is often smaller than you can do with
binary. Or consider a less verbose text format; it's possible
to reliably transmit a text representation of your data with
a lot less characters than XML uses.
And finally, if you do end up having to use binary, try to choose
an existing format (e.g. Google's protocol buffers), or base your
format on an existing format. Just remember that:
Binary is a lot more work than text, since you practically
have to write all of the << operators again, including those
in the standard library.
Binary is a lot more difficult to debug, because you can't
easily see what you've actually done.
Concerning your last edit:
Once you've encrypted, the results will be binary. You can
use a text representation of the binary (base64 or some such),
but the results won't be any more readable than the binary, so
it's not worth the bother. If you're encrypting in process,
before writing to disk, you automatically lose all of the
advantages of text.
The issues concerning powering off mean that you cannot use
ofstream directly. You must open or create the file with the
necessary options for full transactional integrity (O_SYNC as
a flag to open under Unix). You must write each record as
a single write request to the system.
It's always a good idea to have a checksum, just in case. If
you're worried about security, SHA1 is a good choice. But keep
in mind that if someone has access to the file, and wants to
intentionally change it, they can recalculate the SHA1 and
insert the new value as well.
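The last two points can be sketched in a few lines. The example is in Python rather than C++ to keep it short, but it uses the same Unix system calls (open with O_SYNC, one write per record). The function names and the 0o600 file mode are assumptions made for illustration, and O_SYNC is looked up defensively because it is not available on every platform:

```python
import hashlib
import os

def append_record(path: str, record: bytes) -> None:
    """Append one record with a single synchronous write request."""
    # O_SYNC makes write() return only after the data reached the disk.
    flags = os.O_WRONLY | os.O_CREAT | os.O_APPEND | getattr(os, "O_SYNC", 0)
    fd = os.open(path, flags, 0o600)
    try:
        written = os.write(fd, record)   # one record, one write call
        if written != len(record):
            raise OSError("short write, record may be incomplete")
    finally:
        os.close(fd)

def file_sha1(path: str) -> str:
    """Return the SHA1 hex digest of the file, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Comparing file_sha1(path) with a digest stored elsewhere detects accidental changes; as noted above, it does not protect against someone who can rewrite both the file and the stored digest.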
All files are binary; the data within them is a binary representation of some information. If you have to store a large amount of text then the file will contain the binary representation of that text. The difference between a "binary file" and a "text file" is that creating the latter involves converting data to a text form before saving it. This is typically done so humans can read it.
The distinction between binary and text is usually made when storing data that is for computer consumption. Typically this data would not be text - it might be a list of numerical configuration values, for example: 1, 2, 3.
If you stored this in text format, your file could contain a list of human-readable numbers, and if you opened the file in Notepad you might see one number per line. But what you're actually saving here is not the binary values 1, 2, 3 - you're saving a string "1\n2\n3\n". Note that this string is 6 characters long, and the binary values (assuming ASCII) would actually be 49, 10, 50, 10, 51, 10!
If the same data were stored in binary format, you would store the numbers in the smallest useful space, and write the file as individual bytes that can often only be read by the code that created them. Opening this file in Notepad would likely display junk characters, because the data makes no sense as text. In this case you would be saving a byte array with actual values { 1, 2, 3 } - or even a single byte with the three values embedded. This could be much smaller than the human-readable equivalent.
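The difference between the two representations of 1, 2, 3 can be checked directly. A minimal sketch, with Python used for brevity:

```python
values = [1, 2, 3]

# Text form: each number becomes digit characters plus a newline.
as_text = "".join(f"{v}\n" for v in values).encode("ascii")

# Binary form: each number is stored as one raw byte.
as_binary = bytes(values)

print(list(as_text), len(as_text))      # character codes of "1\n2\n3\n"
print(list(as_binary), len(as_binary))  # the values themselves
```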
Binary files store a sequence of bytes, like all other files. You can store numeric values such as integers in 4 bytes each, characters in a single byte each, or even serialized class objects, or anything else you want.
When you know how to read a binary file (ie. you know what is stored in it) you can extract all the information from it. However, text files use text encodings like UTF8, ANSI etc. and they are intended to encode text characters to be processed by text editors.
Binary files are for machines only to interpret, whereas a text file, a human can also open and interpret its content.
So it depends whether you want your file to be readable by a human or not.
It depends on a lot of factors. I can think of two right now:
Do you require the file to be readable by humans?
Is compression a factor? A 10-digit number will take at least 10 bytes as text, but might take as little as four bytes as binary (if it fits in 32 bits; a two-byte integer only holds values up to 65535).
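As a rough illustration of the size difference, here is a sketch using Python's struct module. The number is arbitrary, and the big-endian format codes are just one possible choice:

```python
import struct

number = 1234567890                      # ten digits, fits in 32 bits

as_text = str(number).encode("ascii")    # one byte per digit
as_binary = struct.pack("!I", number)    # 4-byte unsigned integer

print(len(as_text), len(as_binary))

# Not every 10-digit number fits in 32 bits; larger ones need 8 bytes.
big = 9999999999
as_binary_big = struct.pack("!Q", big)   # 8-byte unsigned integer
print(len(str(big)), len(as_binary_big))
```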
All data is binary. You always need a machine to interpret it for you. Even if the data is compressed, as with protocol buffers, Avro, Thrift etc., it is binary, and if it is uncompressed, it is still binary. If you want to read protocol buffers in Notepad, there is a two-step process: uncompress, then read. In the case of text, this uncompressing step is not needed. The same is the case with encrypted data: first decrypt, and then read. Humans cannot read binary (as some commenters are mentioning). We still need Notepad to interpret and display the binary (so-called text).
All data stored in a text file consists of human-readable graphic characters. Each line of data ends with a newline character.
In the case of a binary file, data is stored in the same format as it is stored in memory. There are no lines or newline characters. There is an end-of-file marker.
Moreover, binary files are usually more space efficient, because values are stored in their in-memory binary form rather than as characters.
I have a candidate key (the MongoDB candidate key, _id) that looks like the following in protocol buffers:
message qrs_signature
{
required uint32 region_id = 1;
repeated fixed32 urls = 2;
};
Naturally I can't use a protocol buffers encoded string (via SerializeToString(std::string*)) in my BSON document, since it can contain non-printing characters. Therefore, I am using the ascii85 encoding to encode the data (using this library). I have two questions.
Is b85 encoding BSON-safe?
What is BSON's binary type for? Is there some way that I can put my (binary) string into that field using a MongoDB API call, or is it just syntactic sugar to denote a value type that needs to be processed in some form (i.e., not a native MongoDB entity)?
edit
The append-binary APIs show data being encoded as hex (OMG!); base85 is therefore more space efficient (22 bytes per record in my case).
BSON safe, yes. The output of ASCII85 encoding is also valid utf-8 iirc.
It's used to store chunks of binary data. Binary data is an officially supported type and you should be able to push binary values to BSON fields using the appropriate driver code, BSONObj in your case. Refer to your driver docs or the source code for details.
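The first point can be checked with a quick sketch. It uses the Ascii85 variant from Python's standard library, which may differ in details from the library used in the question, and the input bytes are arbitrary:

```python
import base64

raw = bytes(range(256))                # every possible byte value

encoded = base64.a85encode(raw)        # Ascii85 output as bytes
as_text = encoded.decode("utf-8")      # succeeds: plain ASCII is valid UTF-8

# Every output byte is a printable, non-space ASCII character.
all_printable = all(33 <= b <= 126 for b in encoded)

# The encoding is reversible, so no information is lost.
decoded = base64.a85decode(encoded)

print(all_printable, decoded == raw, len(raw), len(encoded))
```

Ascii85 expands the data by about 25% (5 output characters per 4 input bytes), whereas the binary type stores the raw bytes without any text encoding.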