SAS infile truncating long numbers with E - sas

I have searched around a bit and found some people asking a similar question, but I have not found an answer I can make work.
I have tab-delimited .txt files which I need to read into a SAS database. The files contain a serial number which is 18 digits long, so SAS imports this as "5.2231309E17".
Ideally SAS would import all the fields as if they were text, not numbers.
To add some complexity, the import files come in 2 different formats. The format is only visible once the file is open; I cannot tell which format a file uses from its name. There are also no column names in the file, so I don't know which column is which until I have read in the file.
Currently my starting point is:
data Readin;
infile foo dsd dlm='09'x truncover;
input item1-item25;
run;
foo is the file, something like 'c:\myfile.txt'.
Any help is appreciated.

There are two separate issues here. One is that the "9.234E17" is displaying in scientific notation, and two that you are reading in numbers that can't be stored exactly as numbers anyway.
First, this is how the BEST12. format works, which is the default numeric format for things like this. It's not truncating the value in a meaningful way. If you simply change the format, to BEST32. for example, it will display the entire number, within the limits of precision. The value always behaves as the full number, again within the limits of precision: if I took 12345678 and formatted it BEST6., it would display as 1.23E7, but if I said if x=12345678 then do; put x; end;, it would put x, because x is still exactly equal to that value.
However, that last part is important, and it is the second part of your problem. You can't store an 18-digit number precisely; 15 digits is the most you can store precisely in Windows and similar Intel-type environments, with slightly different results on mainframes. So you definitely need these to be stored as character, unless you don't care about the last few digits (it sounds like you do).
If you have an (anything)-delimited file, your best bet is to simply write a data step to read it in, at which point you can assign the fields as character yourself. Don't use PROC IMPORT for most text files, unless they're really easy, impossible-to-screw-up sorts of things. What you can do is look at your log after PROC IMPORT has run, and copy the data step from that log into a program; then make adjustments to turn the serial number into a character field (and anything else you want to fix).
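The precision limit described above is not specific to SAS; it comes from the 8-byte floating-point representation. A minimal sketch in Python (not SAS), assuming the numeric type is an IEEE 754 double as on Windows; the serial number is a made-up 18-digit value and survives_double is a hypothetical helper name:

```python
def survives_double(n: int) -> bool:
    """Return True if the integer n round-trips through a 64-bit float unchanged."""
    return int(float(n)) == n


# A 15-digit value is stored exactly.
print(survives_double(123456789012345))

# A hypothetical 18-digit serial number is not: the last digits are lost.
serial = 522313091234567891
print(survives_double(serial))

# Display format and stored value are separate things: the short
# scientific display hides digits, but the comparison still uses
# the full stored value.
x = 12345678.0
print(f"{x:.2e}", x == 12345678)

# Kept as text, the serial number is preserved digit for digit.
print(str(serial))
```

This is why reading the field as character is the safe choice: no display format can bring back digits that were never stored.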

I had a similar problem: I was trying to import a file that had a 20-digit field. One workaround I found was opening the file in Excel and changing the column's format from General to Number; when I then imported the file, the field was imported as a number and not in scientific notation.

How to convert imported date variable to the original format in Stata?

My original date variable is like this 19jun2015 16:52:04. After importing, it looks like this: 1.77065e+12
The storage type for the new imported variable is str11 and display format is %11s
I wonder how I can restore it back to date format?
William Lisowski gives excellent advice in his comment. For anyone using date-times in Stata, there is a minimal level of understanding without which confusion and outright error are unavoidable. Only study of the help so that your specific needs are understood can solve your difficulty.
There is a lack of detail in the question which makes precise advice difficult (imported -- from what kind of file? using which commands and/or third party programs?), except to diagnose that your dates are messed up and can only be corrected by going back to the original source.
Date strings such as "19jun2015 16:52:04" can be held in Stata as strings but to be useful they need to be converted to double numeric variables which hold the number of milliseconds since the beginning of 1960. This is a number that people cannot interpret, but Stata provides display formats so that displayed dates are intelligible.
Your example, when converted, is a number of the order of a trillion, but if it is held as a string with only 6 significant figures you have, at a minimum, lost detail irretrievably.
These individual examples make my points concrete. di is an abbreviation for the display command.
clock() (and also Clock(), not shown or discussed here: see the help) converts string dates to milliseconds since Stata's origin. With a variable, you would use generate double.
. di %23.0f clock("19jun2015 16:52:04", "DMY hms")
1750351924000
If displayed with a specific format, you can check that Stata is interpreting your date-times correctly. There are also many small variations on the default %tc format to control precise display of date-time elements.
. di %tc clock("19jun2015 16:52:04", "DMY hms")
19jun2015 16:52:04
The first example shows that even date-times which are recent dates (~2016) and in integer seconds need 10 significant figures to be accurate; the default display gives 4; somehow you have 6, but that is not enough.
. di clock("19jun2015 16:52:04", "DMY hms")
1.750e+12
You need to import the dates again. If you import them exactly as shown, the rest can be done in Stata.
See https://en.wikipedia.org/wiki/Significant_figures if that phrase is unfamiliar.
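The arithmetic behind the clock() examples above can be checked outside Stata. This is a rough sketch in Python, not Stata code; clock_ms is a hypothetical helper that only handles strings shaped exactly like "19jun2015 16:52:04" and, like clock(), ignores leap seconds:

```python
from datetime import datetime

STATA_ORIGIN = datetime(1960, 1, 1)
MONTHS = {name: number for number, name in enumerate(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"], start=1)}


def clock_ms(text: str) -> int:
    """Milliseconds from 01jan1960 00:00:00 to a 'DDmonYYYY hh:mm:ss' string."""
    date_part, time_part = text.split()
    day = int(date_part[:2])
    month = MONTHS[date_part[2:5].lower()]
    year = int(date_part[5:])
    hour, minute, second = (int(piece) for piece in time_part.split(":"))
    delta = datetime(year, month, day, hour, minute, second) - STATA_ORIGIN
    return (delta.days * 86400 + delta.seconds) * 1000


ms = clock_ms("19jun2015 16:52:04")
print(ms)            # 1750351924000, the same value Stata displays
print(f"{ms:.5e}")   # 1.75035e+12, what survives with 6 significant figures
```

At this magnitude the sixth significant figure sits in the 10^7 ms place, so a value kept to 6 figures only pins the time down to within about 10,000 seconds.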

How to parse, read, and store only one column of .CSV file into an array in C++

I have a .CSV file that's storing data from a laser. It records the height of the laser beam every second.
The .CSV file ends up having rows for each measurement that are all in this format:
DR,04,#
where the # is the height reading.
For example, if the beam is at a height of 10, the reading would say:
DR,04,10.
I want my program in C++ to read only the height (third column of the .CSV) from each row and put it into an array. I do not want the first two columns at all. That way I end up with an array with just a bunch of height values from each measurement.
How do I do that?
You can use strtok() to separate out the three columns. And then just get the last value.
You could also just take the string and scan for the first comma, and then scan from there for the second comma. What follows is the value you are after.
You could also use sscanf() to parse out the individual values.
This really isn't a difficult problem, and there are many ways to approach it. That is why people are complaining that you probably should have tried something first and then asked a question here once you got stuck on a specific point.
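The "split on commas and keep the last field" idea can be sketched briefly. The sketch is in Python rather than C++ to keep it short; the same logic carries over to C++ with std::getline and a ',' delimiter. The function name read_heights is hypothetical, and rows are assumed to look like DR,04,<height>:

```python
def read_heights(lines):
    """Collect the third comma-separated field of each row as a float."""
    heights = []
    for line in lines:
        fields = line.strip().split(",")
        if len(fields) < 3:
            continue  # skip blank or malformed rows
        heights.append(float(fields[2]))
    return heights


# Works on any iterable of lines, including an open file object.
sample = ["DR,04,10\n", "DR,04,12.5\n", "\n", "DR,04,9\n"]
print(read_heights(sample))
```

With a real file this would be called as read_heights(open(path)), where path is whatever file the laser software writes.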

Excel international date formatting

I am having problems formatting Excel datetimes, so that it works internationally. Our program is written in C++ and uses COM to export data from our database to Excel, and this includes datetime fields.
If we don't supply a formatting mask, some installations of Excel display these dates as serial numbers (days since 1900.01.01 followed by time as a 24-hour fraction). This is unreadable to a human, so we have found out that we MUST supply a date formatting mask to be sure that it displays readably.
The problem - as I see it - is that Excel uses international formatting masks. For example; the UK datetime format mask might be "YYYY-MM-DD HH:MM".
But if the format mask is sent to an Excel that is installed in Sweden, it fails since the Swedish version of the Excel uses "ÅÅÅÅ-MM-DD tt:mm".
It's highly impractical to have 150 different national datetime formatting masks in our application to support different countries.
Is there a way to write formatting masks so that they include locale, such that we would be allowed to use ONE single mask?
Unless you are using the date functionality in Excel, the easiest way to handle this is to decide on a format and then create a string yourself in that format and set the cell accordingly.
This comic: http://xkcd.com/1179/ might help you choose a standard to go with. Otherwise, clients that open your file in different countries will have differently formatted data. Just pick a standard and force your data to that standard.
Edited to add: There are libraries that can make this really easy for you as well... http://www.libxl.com/read-write-excel-date-time.html
Edited to add part 2: Basically what I'm trying to get at is to avoid asking for the mask and just format the data yourself (if that makes sense).
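As a small illustration of "pick a standard and build the string yourself", here is a sketch in Python (the asker's program is C++, where strftime offers the same numeric directives). The function name cell_text and the choice of an ISO 8601 style layout are assumptions for the example:

```python
from datetime import datetime


def cell_text(moment: datetime) -> str:
    """Format a datetime as fixed 'YYYY-MM-DD HH:MM' text, independent of locale."""
    # Only numeric directives are used, so the result does not depend
    # on the regional settings of the machine or of Excel.
    return moment.strftime("%Y-%m-%d %H:%M")


print(cell_text(datetime(2015, 6, 19, 16, 52)))   # 2015-06-19 16:52
```

The string is written to the cell as text, so Excel never needs a format mask; the trade-off is that the cell is no longer a real Excel date.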
I recommend doing the following: create an Excel file with date formatting on a specific cell and save it for your program to use.
Now when the program runs, it will open this Excel file and retrieve the local date formatting from the specified cell.
When you have multiple formats to save, just use different cells for them.
It is not a nice way but will work afaik.
Alternatively, you could consider creating an xla(m) file that uses VBA to feed back the local formatting characters through a function like:
Public Function localChar(charIn As Range) As String
localChar = charIn.NumberFormatLocal
End Function
Also not a very clean method, but it might do the trick for you.

File Binary vs Text

Are there situations where I should prefer a binary file to a text file? I'm using C++ as the programming language.
For example, if I have to store some large text, is it better to use a text file or a binary file?
Edit
For the moment the file has no requirement to be readable by humans. Are there performance differences, security differences and so on?
Edit
Sorry for omitting the other requirements (thanks to Carey Gregory):
The records to save are in ASCII encoding.
The file must be encrypted (AES).
The machine can power off at any time, so I have to try to prevent errors.
I have to know if the file changes outside the program; I think I'll use a SHA1 digest of the file.
As a general rule, define a text format, and use it. It's much
easier to develop and debug, and it's much easier to see what is
going wrong if it doesn't work.
If you find that the files are becoming too big, or taking too
much time to transfer over the wire, consider compressing them.
A compressed text file is often smaller than you can do with
binary. Or consider a less verbose text format; it's possible
to reliably transmit a text representation of your data with
a lot fewer characters than XML uses.
And finally, if you do end up having to use binary, try to choose
an existing format (e.g. Google's Protocol Buffers), or base your
format on an existing format. Just remember that:
Binary is a lot more work than text, since you practically
have to write all of the << operators again, including those
in the standard library.
Binary is a lot more difficult to debug, because you can't
easily see what you've actually done.
Concerning your last edit:
Once you've encrypted, the results will be binary. You can
use a text representation of the binary (base64 or some such),
but the results won't be any more readable than the binary, so
it's not worth the bother. If you're encrypting in process,
before writing to disk, you automatically lose all of the
advantages of text.
The issues concerning powering off mean that you cannot use
ofstream directly. You must open or create the file with the
necessary options for full transactional integrity (O_SYNC as
a flag to open under Unix). You must write each record as
a single write request to the system.
It's always a good idea to have a checksum, just in case. If
you're worried about security, SHA1 is a good choice. But keep
in mind that if someone has access to the file, and wants to
intentionally change it, they can recalculate the SHA1 and
insert the new value as well.
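The power-off advice above (open with synchronous flags, one write request per record) might look roughly like the following. It is a sketch in Python rather than C++, using the same underlying open/write calls; append_record and demo are hypothetical names, and O_SYNC is only used where the platform provides it:

```python
import os
import tempfile


def append_record(path: str, record: bytes) -> None:
    """Append one record with a single write call and force it to disk."""
    # O_SYNC exists on Unix; fall back to 0 where it is missing and
    # rely on the explicit fsync below.
    flags = os.O_WRONLY | os.O_CREAT | os.O_APPEND | getattr(os, "O_SYNC", 0)
    fd = os.open(path, flags, 0o644)
    try:
        written = os.write(fd, record)   # the whole record in one request
        if written != len(record):
            raise OSError("short write; record may be incomplete")
        os.fsync(fd)
    finally:
        os.close(fd)


def demo() -> bytes:
    """Write two records to a temporary file and read them back."""
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "records.dat")
        append_record(path, b"rec1\n")
        append_record(path, b"rec2\n")
        with open(path, "rb") as handle:
            return handle.read()


print(demo())
```

Opening and closing the file for every record is slow; it is done here only to keep the example small.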
All files are binary; the data within them is a binary representation of some information. If you have to store a large amount of text then the file will contain the binary representation of that text. The difference between a "binary file" and a "text file" is that creating the latter involves converting data to a text form before saving it. This is typically done so humans can read it.
The distinction between binary and text is usually made when storing data that is for computer consumption. Typically this data would not be text - it might be a list of numerical configuration values, for example: 1, 2, 3.
If you stored this in text format, your file could contain a list of human-readable numbers, and if you opened the file in Notepad you might see one number per line. But what you're actually saving here is not the binary values 1, 2, 3 - you're saving a string "1\n2\n3\n". Note that this string is 6 characters long, and the byte values (assuming ASCII) would actually be 49, 10, 50, 10, 51, 10!
If the same data were stored in binary format, you would store the numbers in the smallest useful space, and write the file as individual bytes that can often only be read by the code that created them. Opening this file in Notepad would likely display junk characters, because the data makes no sense as text. In this case you would be saving a byte array with actual values { 1, 2, 3 } - or even a single byte with the three values embedded. This could be much smaller than the human-readable equivalent.
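The difference between the two representations of 1, 2, 3 can be checked directly. A minimal sketch in Python, with as_text and as_binary as hypothetical helper names, assuming ASCII text and one byte per value:

```python
def as_text(values):
    """One decimal number per line, encoded as ASCII bytes."""
    return "".join(f"{value}\n" for value in values).encode("ascii")


def as_binary(values):
    """One raw byte per value (each value must be in 0..255)."""
    return bytes(values)


values = [1, 2, 3]
print(list(as_text(values)))     # [49, 10, 50, 10, 51, 10]
print(list(as_binary(values)))   # [1, 2, 3]
```

Both results are byte sequences; only the first one makes sense when opened in a text editor.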
Binary files store a sequence of bytes, like all other files. You can store numeric values such as integers in 4 bytes each, characters in a single byte each, or even serialized class objects and anything else you want.
When you know how to read a binary file (i.e. you know what is stored in it), you can extract all the information from it. Text files, however, use text encodings like UTF-8, ANSI etc., and they are intended to hold text characters to be processed by text editors.
Binary files are for machines only to interpret, whereas a text file can also be opened and interpreted by a human.
So it depends whether you want your file to be readable by a human or not.
It depends on a lot of factors. I can think of two right now:
Do you require the file to be readable by humans?
Is compression a factor? A 10-digit number will take at least 10 bytes as text, but can take as few as four bytes as binary if it fits in 32 bits.
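The size claim is easy to verify. A small sketch in Python, assuming an unsigned 32-bit little-endian layout for the binary form; text_size and binary_size are hypothetical helper names:

```python
import struct


def text_size(number: int) -> int:
    """Bytes needed to store the number as ASCII decimal digits."""
    return len(str(number).encode("ascii"))


def binary_size(number: int) -> int:
    """Bytes needed as an unsigned 32-bit integer (raises if it does not fit)."""
    return len(struct.pack("<I", number))


n = 1234567890
print(text_size(n), binary_size(n))   # 10 4
```

A 10-digit number above 4294967295 no longer fits in 32 bits and would need a wider type such as "<Q" (8 bytes).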
All data is binary. You always need a machine to interpret it for you. Even if the data is encoded with something like Protocol Buffers, Avro, Thrift etc., it is binary, and if it is not encoded that way, it is still binary. If you want to read Protocol Buffers data in Notepad, there is a two-step process: decode, then read. In the case of text, this decoding step is not needed. The same goes for encrypted data: first decrypt, then read. Humans cannot read binary (as some commenters are mentioning). We still need Notepad to interpret and display binary (so-called text).
All data stored in a text file are human-readable graphic characters. Each line of data ends with a new line character.
In case of a binary file - data is stored in the same format as they are stored in the memory. There are no lines or new line characters. There is an end of file marker.
Moreover, binary files are usually more space-efficient, because values are stored in their raw in-memory form rather than as characters.

SQLite Int to Hex and Hex to Int in query

I have some numeric data that is given to me in INTEGER form. I insert it along with other data into SQLite, but when I write it out the INTEGER numbers need to be 8-digit hex numbers with leading zeros.
Ex.
Input
400
800
25
76
Output
00000190
00000320
00000019
0000004C
Originally I was converting them as I read them in and storing them as TEXT like this.
stringstream temp;
temp << right << setw(8) << setfill('0') << hex << uppercase << VALUE;
But life is never easy, and now I have to create a second output in INTEGER form, not HEX. Is there a way to convert INTEGER numbers to HEX or HEX numbers to INTEGERS in SQLite?
I'd like to avoid using C++ to change data after it is in SQLite, because I've written a few convenient export functions that take a query's result and print it to a file. If I need to touch the data while the query returns, I couldn't use them.
I've looked at the HEX() function in SQLite, but that didn't give the desired results. Could I make a function, or would that be very inefficient? I'm doing this over a really big data set, so anything expensive should be avoided.
Note:
I'm using SQLite's C/C++ interface with Visual Studio 2010.
You can use sqlite3 printf() function:
CREATE TABLE T (V integer);
insert into T(V) values(400), (800), (25), (76);
select printf('%08X', V) from T;
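The query above can be tried without a C++ build. This is a sketch using Python's built-in sqlite3 module, assuming the bundled SQLite is 3.8.3 or newer (the version that introduced the SQL printf() function); hex_column is a hypothetical helper name:

```python
import sqlite3


def hex_column(values):
    """Store integers in SQLite and read them back as 8-digit uppercase hex text."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE T (V INTEGER)")
        conn.executemany("INSERT INTO T(V) VALUES (?)", [(v,) for v in values])
        rows = conn.execute("SELECT printf('%08X', V) FROM T ORDER BY rowid")
        return [row[0] for row in rows]
    finally:
        conn.close()


print(hex_column([400, 800, 25, 76]))
# ['00000190', '00000320', '00000019', '0000004C']
```

The table keeps the INTEGER values, so the same column serves both the integer output and the hex output.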
You can use sqlite3_create_function. See an example
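For the opposite direction (hex TEXT back to INTEGER), a user-defined function is one option. The sketch below uses Python's sqlite3 module, whose Connection.create_function wraps sqlite3_create_function from the C API; the SQL function name hex_to_int and the helper ints_from_hex are made up for the example:

```python
import sqlite3


def hex_to_int(text):
    """Convert hex text such as '0000004C' to an integer."""
    return int(text, 16)


def ints_from_hex(hex_strings):
    """Store hex text in SQLite and convert it to integers inside the query."""
    conn = sqlite3.connect(":memory:")
    try:
        # Register the Python function as a one-argument SQL function.
        conn.create_function("hex_to_int", 1, hex_to_int)
        conn.execute("CREATE TABLE T (H TEXT)")
        conn.executemany("INSERT INTO T(H) VALUES (?)", [(h,) for h in hex_strings])
        rows = conn.execute("SELECT hex_to_int(H) FROM T ORDER BY rowid")
        return [row[0] for row in rows]
    finally:
        conn.close()


print(ints_from_hex(["00000190", "00000320", "00000019", "0000004C"]))
# [400, 800, 25, 76]
```

A user-defined function is called once per row, so on a very large data set it costs more than reading a stored INTEGER column.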
While it's not my favorite answer, I've decided to add a column to the table that holds the INTEGER value.
If someone finds a better way to do this I'm all ears.
EDIT:
After implementing this answer and looking at its effect on the program, this appears to be a really good way to get the data without adding much to the run-time load. This answer does add a little to the size of the database, but it doesn't require any extra processing to get the values from SQLite, because the query just grabs a different column.
Also, because I had these values to start with, this answer gave a HUGE cost saving: adding them to the table is far cheaper than processing later to recover values I threw away earlier in the program.