I am trying to import an Excel file into SAS using the SAS EG Import wizard. I have a field X of type char4, but it holds numeric-looking data, e.g. 1101.
During import, SAS truncates leading zeros even though the data type is character. Is there a way to retain the leading zeros? I cannot pad with zeros in SAS because a few values are legitimately 3 digits and a few have become 3 digits through leading-zero truncation, so it is hard to tell which values should be padded with a leading zero.
If it's General/Text, SAS might have a hard time understanding that it's a character field when there are only numbers in the Excel file. There are two ways of making it understand that the field is a pure text field:
Either change the column to Text in Excel and re-import,
or, in the 3rd screen of the import wizard, set the type to character, the informat to $char4., the length to 4, and the output format to $char4., and you should be OK.
I am importing an Excel spreadsheet into SAS using Proc Import:
Proc Import out=OUTPUT
Datafile = "(filename)"
DBMS=XLSX Replace;
Range = "Sheet1$A:Z";
run;
My numeric data columns contain a mixture of values held in Excel as numerics and '0 values held as text - i.e. with a leading apostrophe / single quote. When SAS imports these it treats them all the same (i.e. it returns Character strings of the values with the leading apostrophe stripped out).
This results in differences from the spreadsheet when calculations are applied (e.g. averaging) as Excel treats the '0 values as missing but SAS treats them as 0.
Is it possible to import the values as strings including the leading single quote / apostrophe, so that I can replace the '0 with missing values but keep the 0 records as 0? I would like to avoid having to manually manipulate the data in Excel as this data is drawn from an external source (don't ask...)
I doubt it. I think Excel doesn’t really consider the leading apostrophe as part of the value. It’s just a crazy way to indicate that a value is a text string (rather than numeric). When SAS imports the data, it recognizes that the quote is not part of the value. So if you’ve got an Excel column with '0 in some cells and 0 in others, it’s going to come in as character, and I don’t think you can tell the difference between them.
Unfortunately, the xlsx engine doesn’t support the DBSASTYPE option. Other engines that import Excel have the DBSASTYPE option. That should allow you to tell SAS to import a column as a numeric variable, even if it sees character values. If it’s the case that you want all text values in the cell converted to missing, that might do the trick. But it’s possible it would still treat '0 the same as 0. I’m away from SAS, so can’t test.
Option:
The ~ (tilde) format modifier enables you to read and retain single quotation marks.
http://support.sas.com/documentation/cdl/en/lrcon/62955/HTML/default/viewer.htm#a003209907.htm
Is it possible to convert the .xlsx to .txt keeping the single quotes? Because it is not possible to infile xlsx in a data step.
filename df disk 'C:\data_temp\ex.txt';
data test;
infile df firstobs=2;
input ID $2. x ~$3. ;
run;
proc print data=test;
run;
So I have some code that does essentially this:
REAL, DIMENSION(31) :: month_data
INTEGER :: no_days
no_days = get_no_days()
month_data = [fill array with some values]
WRITE(1000,*) (month_data(d), d=1,no_days)
So I have an array with values for each month, in a loop I fill the array with a certain number of values based on how many days there are in that month, then write out the results into a file.
It took me quite some time to wrap my head around the whole 'write out an array in one go' aspect of WRITE, but this seems to work.
However this way, it writes out the numbers in the array like this (example for January, so 31 values):
0.00000 10.0000 20.0000 30.0000 40.0000 50.0000 60.0000
70.0000 80.0000 90.0000 100.000 110.000 120.000 130.000
140.000 150.000 160.000 170.000 180.000 190.000 200.000
210.000 220.000 230.000 240.000 250.000 260.000 270.000
280.000 290.000 300.000
So it prefixes a lot of spaces (presumably to make columns line up even when there are larger values in the array), and it wraps lines to make it not exceed a certain width (I think 128 chars? not sure).
I don't really mind the extra spaces (although they inflate my file sizes considerably, so it would be nice to fix that too...) but the line breaks screw up my other tooling. I've tried reading several Fortran manuals, but while some of them mention 'output formatting', I have yet to find one that mentions newlines or columns.
So, how do I control how arrays are written out when using the syntax above in Fortran?
(also, while we're at it, how do I control the nr of decimal digits? I know these are all integer values so I'd like to leave out any decimals all together, but I can't change the data type to INTEGER in my code because of reasons).
You probably want something similar to
WRITE(1000,'(31(F6.0,1X))') (month_data(d), d=1,no_days)
Explanation:
The use of * as the format specification is called list directed I/O: it is easy to code, but you are giving away all control over the format to the processor. In order to control the format you need to provide explicit formatting, via a label to a FORMAT statement or via a character variable.
Use the F edit descriptor for real variables in decimal form. Its syntax is Fw.d, where w is the total width of the field (including any sign and the decimal point) and d is the number of digits after the decimal point. F6.0 therefore means a field 6 characters wide with no digits after the decimal point (the decimal point itself is still printed).
Spaces can be added with the X control edit descriptor.
Repetitions of edit descriptors can be indicated with the number of repetitions before a symbol.
Groups can be created with (...), and they can be repeated if preceded by a number of repetitions.
No more items are printed beyond the last provided variable, even if the format specifies how to print more items than the ones actually provided - so you can ask for 31 repetitions even if for some months you will only print data for 30 or 28 days.
Besides,
New lines could be added with the / control edit descriptor; e.g., if you wanted to print the data with 10 values per row, you could do
WRITE(1000,'(4(10(F6.0,:,1X),/))') (month_data(d), d=1,no_days)
Note the : control edit descriptor in this second example: it indicates that, if there are no more items to print, nothing else should be printed - not even spaces corresponding to control edit descriptors such as X or /. While it could have been used in the previous example, it is more relevant here, in order to ensure that, if no_days is a multiple of 10, there isn't an empty line after the 3 rows of data.
If you want to remove the decimal symbol completely, you instead need to print the nearest integers using the nint intrinsic and the Iw (integer) descriptor:
WRITE(1000,'(31(I6,1X))') (nint(month_data(d)), d=1,no_days)
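To get a feel for what Fw.d produces before changing the Fortran code, the rule can be imitated in a few lines of Python. This is only a rough sketch, not Fortran itself: fortran_f and write_row are hypothetical helper names, the month_data values are made up, and the rounding of exact halves may differ from what your compiler does.

```python
def fortran_f(value, w, d):
    """Rough imitation of Fortran's Fw.d output (hypothetical helper)."""
    # '#' keeps the decimal point even when d == 0, as Fortran does.
    text = f"{value:#.{d}f}"
    # Fortran fills the field with asterisks when the value does not fit.
    return text.rjust(w) if len(text) <= w else "*" * w


def write_row(values, w=6, d=0):
    """Join fields with one blank, like '(31(F6.0,1X))' written as one record."""
    return " ".join(fortran_f(v, w, d) for v in values)


# Made-up data: 0, 10, 20, ... as in the question's January example.
month_data = [10.0 * day for day in range(31)]
no_days = 28

line = write_row(month_data[:no_days])
print(line)       # one line, no wrapping, 6 columns per value
print(len(line))  # no_days * 6 fields plus (no_days - 1) separating blanks
```

The point of the sketch is that the width w covers the whole field, decimal point included, so F6.0 leaves at most five columns for the sign and digits.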
I am trying to create an MDM file using HLM 7 Student version, but since I don't have access to SPSS I am trying to import my data using ASCII input. As part of this process I am required to input the data format Fortran style. Try as I might I have not been able to understand this step. Could someone familiar with Fortran (or even better HLM itself) explain to me how this works? Here is my current understanding
From the example EG3.DAT they give
(A4,1X,3F7.1)
I think
A4 signifies that the ID is 4 characters long.
1X means skip a space.
F.1 means that it should read 1 decimal place.
I am very confused about what 3F7 might mean.
EG3.DAT
2020 380.0 40.3 12.5
2040 502.0 83.1 18.6
2180 777.0 96.6 44.4
Below are examples from the help documents.
[Images not reproduced: Rules for format statement; Format statement example; EG1 data format; EG2 data format; EG3 data format]
One similar question is Explaining Fortran Write Format. Unfortunately it does not explicitly treat the F descriptor.
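3F7.1 means 3 floating-point numbers, each occupying 7 characters, each with one digit after the decimal point. Leading characters are blanks.
For reading, the .1 matters only when a field contains no decimal point (standard Fortran then treats the last digit as the decimal place). Every value in EG3.DAT has an explicit decimal point, so the reader simply takes a floating-point number from those 7 characters.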
You guessed the meaning of A4 (string of four characters) and 1X (one blank) correctly.
In Fortran, so-called data edit descriptors (which format the input or output of data) may have repeat specifications.
In the format (A4,1X,3F7.1) the data edit descriptors are A4 and F7.1. Only F7.1 has a repeat specification (the number before the F). This simply means that the format is as though the descriptor appeared repeated: like F7.1, F7.1, F7.1. With a repeat specification of 1, or not given, there is just the single appearance.
The format of the question, then, is like
(A4,1X,F7.1,F7.1,F7.1)
This format is one that is covered by the rules provided in one of the images of the question. In particular, the aspect of repeat specification is given in rule 2 with the corresponding example of rule 3.
Further, in Fortran proper, a repeat count specifier may also be * as special case: that's like an exceptionally large repeat count. *(F7.1) would be like F7.1, F7.1, F7.1, .... I see no indication that this is supported by HLM but if this is needed a very large repeat count may be given instead.
In 1X the 1 isn't a repeat specification but an integral, and necessary, part of the position edit descriptor.
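If it helps to see the format as column positions, here is a small Python sketch of how (A4,1X,3F7.1) carves up one record. It is an illustration only, not what HLM runs internally: parse_record is a hypothetical name, and the sample line is rebuilt with each number right-justified in 7 columns, which is an assumption about how EG3.DAT is padded.

```python
def parse_record(line):
    """Slice one fixed-width record laid out as (A4,1X,3F7.1)."""
    record_id = line[0:4]   # A4: four characters of ID
    pos = 4 + 1             # 1X: skip exactly one column
    values = []
    for _ in range(3):      # 3F7.1 is F7.1,F7.1,F7.1
        field = line[pos:pos + 7]    # each F7.1 field is 7 columns wide
        values.append(float(field))  # leading blanks are ignored
        pos += 7
    return record_id, values


# Assumed padding: each number right-justified in its 7-column field.
sample = "2020" + " " + "".join(f"{v:7.1f}" for v in (380.0, 40.3, 12.5))

print(repr(sample))
print(parse_record(sample))
```

The record is 4 + 1 + 3 * 7 = 26 columns long, and a value that starts even one column early or late ends up split across two fields.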
Procedure for making an MDM file from Excel for HLM:
-Make sure ALL the characters in ALL the columns line up
  Select a column, then right-click and select Format Cells
  Then click on 'Custom', go to the 'Type' box, and enter the number of 0s you need to line everything up
-Remove all the tabs from the document and replace them with spaces
  Open the document in Word and use find and replace
-To save the document as .dat
  First save it as .txt
  Then open it in Notepad and save it as .dat
To enter the data format (FORTRAN-style):
The program reads the data file one character position at a time, so you have to specify the layout exactly for the whole set to be read properly.
If something is off, even by a single space, your descriptive stats will be wonky compared with what you get when you check them in another program.
Enclose the format in parentheses ()
Separate the entries with commas ,
-Need an ID column for all levels
  The ID column needs to be sorted so that it is in order from smallest to largest
  Use A# with # being the number of characters in the ID
  Use 1X to move from the ID to the next column
-Need to say how many characters are needed in each column
  Use F# (# = the number of characters needed for that column)
  There need to be enough character spaces to provide one 'gap' space between each column
  There need to be enough character spaces to allow for the decimal point
  As part of the F you need to specify the number of decimal places
  You do this by adding a decimal point after the F number and then the number of decimal places you need: F#.#
  You can put a number in front of the F to 'repeat' it, although this is not necessary: #F#.#
All in all, it should look something like this:
(A4,1X,F4.0,F5.1)
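As an alternative to lining the columns up by hand in Excel, the file can be written with fixed widths from a script. Below is a minimal Python sketch, assuming a layout of a 4-character ID, one blank, a 4-column number with no decimals, and a 5-column number with one decimal. The function name format_row and the sample rows are made up for illustration.

```python
def format_row(record_id, x, y):
    """Build one fixed-width line: 4-char ID, 1 blank, 4-wide x, 5-wide y."""
    id_field = f"{record_id:<4.4}"  # pad or cut the ID to exactly 4 characters
    x_field = f"{x:4.0f}"           # 4 columns, no decimals
    y_field = f"{y:5.1f}"           # 5 columns, one decimal place
    return id_field + " " + x_field + y_field


# Hypothetical rows, sorted by ID from smallest to largest.
rows = [("2020", 12.0, 3.5), ("2040", 7.0, 10.2)]
lines = [format_row(*row) for row in sorted(rows)]

for text in lines:
    print(repr(text))  # every line has the same length: 4 + 1 + 4 + 5 = 14
```

Because the number fields are right-justified, a value needs at least one spare column in its field to leave a visible gap between columns.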
Helpful links:
https://books.google.de/books?id=VdmVtz6Wtc0C&pg=PA78&lpg=PA78&dq=data+format+fortran+style+hlm&source=bl&ots=kURJ6USN5e&sig=fdtsmTGSKFxn04wkxvRc2Vw1l5Q&hl=en&sa=X&ved=0ahUKEwi_yPurjYrYAhWIJuwKHa0uCuAQ6AEIPzAC#v=onepage&q&f=false
http://www.ssicentral.com/hlm/help6/error/Problems_creating_MDM_files.pdf
http://www.ssicentral.com/hlm/help7/faq/FAQ_Format_specifications_for_ASCII_data.pdf
I am looking to convert a scientific notation number into the full integer number.
E.g:
8.18234E+11 => 818234011668
Excel reformatted all my upc codes within a csv and this solution is not working for me.
I have my csv open in Notepad++ and would love to do this using a regex find and replace.
Thanks.
The damage is already done and cannot be recovered from the CSV file. 8.18234E+11 could be anything* from 818233500000 to 818234499999.
To prevent Excel from rounding large numbers, you need to store them as text. If you set the cell format to text, any value inserted from then on should be automatically interpreted as text. In OpenOffice Calc (I don't have MS Excel), you can also prefix a numeric value with ' to get it interpreted as text no matter the cell format.
There is a chance that the correct value is stored in the original XLS (or XLSX or ODS or the live Excel session or ...) file. If not, then you'll have to enter the data again. If the data is there, you need to store it as text or increase the number of significant digits in the exported CSV. If you only have the exported data, then you're out of luck.
*UPC has a single check digit, so only 100 000 out of the 1 000 000 codes are actually valid UPC codes.
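The counting argument above can be checked with a short Python sketch. It assumes the codes are 12-digit UPC-A numbers with the standard check digit (three times the sum of the digits in odd positions plus the sum of the digits in even positions, brought up to a multiple of 10); the function names are hypothetical.

```python
def upc_a_check_digit(first11):
    """Check digit for the first 11 digits of a UPC-A code."""
    digits = [int(c) for c in first11]
    # Positions 1,3,...,11 are weighted 3; positions 2,4,...,10 are weighted 1.
    total = 3 * sum(digits[0::2]) + sum(digits[1::2])
    return (10 - total % 10) % 10


def is_valid_upc_a(code):
    """True if code is 12 digits and its last digit is the right check digit."""
    return (len(code) == 12 and code.isdigit()
            and upc_a_check_digit(code[:11]) == int(code[11]))


# Every integer that could have been displayed as 8.18234E+11.
LOW, HIGH = 818233500000, 818234499999
candidates = HIGH - LOW + 1
valid = sum(1 for n in range(LOW, HIGH + 1) if is_valid_upc_a(str(n)))

print(f"{818234011668:.5E}")          # how the example value gets displayed
print(is_valid_upc_a("818234011668"))
print(candidates, valid)
```

Even with the check digit narrowing things down, a regex in Notepad++ has no way to pick the right code among the remaining candidates.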
I have a long ID number (say, 12184447992012111111). When I use proc import on a CSV file, that number is shortened and an 'E' is inserted between the digits (1.2184448E19, with format best12. and informat best32.). Browsing here, I learned that the CSV file itself had already shortened it, so it has nothing to do with SAS. So I tried copying about 5 numbers and using a datalines statement, but the result is the same. It would be helpful if anyone could suggest which format I need to use. Using the best32. format I do not get the original number, most probably because it modifies the already altered number; it gives me 12184447992012111872, which is not my desired number.
Because your ID variable is really an identifier rather than a "real" number, you need to read it in as a character string. The value you show as an example is too large to be represented as an integer, so since SAS stores all numerics as floating point, you are losing "precision".
Since you mention using PROC IMPORT, copy the SAS program it generates and change the FORMAT and INFORMAT specifications from "21." and "best32." to "$32." (or whatever value matches your data).
Best of course would be if you had SAS Access to PC File formats, in which case you could format the column as "text" in Excel and let SAS read it directly.
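The precision loss is not specific to SAS; any program that stores the ID as an 8-byte floating-point number shows it. Here is a small Python sketch for illustration, assuming (as on common platforms) that SAS numerics are IEEE doubles like Python floats:

```python
original = "12184447992012111111"  # the ID from the question, kept as text

as_double = float(original)   # what happens when it is read as a number
print(int(as_double))         # the nearest value a double can hold
print(int(original) > 2**53)  # beyond 2**53 not every integer is representable

kept_as_text = original       # reading it as a character string loses nothing
print(kept_as_text == "12184447992012111111")
```

The first print gives 12184447992012111872, the same altered value reported in the question, which is why no numeric format can bring the original digits back.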
I'm not sure about the csv changing the value (they are just plain text files) - unless you are saving an excel spreadsheet as a csv file. If you are using excel just set the column to number format, no decimal places.
It might be easier to treat the column as text when importing it to SAS - unless you need to perform mathematical operations on it! If you really need to keep it as a number the format 32. should force it to be a 32 digit number - best is fairly sensibly changing it into scientific notation (though I suspect the data is there in the background and just displayed unhelpfully).
There is a SAS informat for reading exponential notation - Ew.d where w is the width and d the number of decimal places. In your case, it probably won't help because you will "lose" the complete number - and the value stored in case you read with this informat will be 1.2184448 * (10^19). The only way in your case is to ensure that the program which produces the CSV file outputs it in the right way. If you are creating the data from an Excel worksheet, then format the number in the Excel worksheet to display all the digits correctly.