getting input format - sas

I just bought an introductory book on SAS, but it only contains tons of examples with little or no explanation. I tried to find a tutorial online, but I can't find an explanation of this formatting. I just wonder what the difference is between these two:
INPUT Name $16. Age 3. +1 height 5.1
I wonder what the "." means. What is the difference between:
INPUT Name $16
and
INPUT Name $ 1-16
What does the symbol "+1" mean?
What does "5.1" mean, and how is that different from "5."? Thanks.

Formats always contain a period. The period can separate the width from the number of decimals, i.e. 5.1 is a total width of 5 with 1 decimal, so xxx.d (actually -xx.d, but it will also display xxx.d correctly). For character values and other values that cannot have a decimal portion, there is never a number after the period, but the period is still present; so DATE9. is a date format (it looks like "19JAN2013") and is 9 characters long (as opposed to DATE7., which gives 19JAN13).
In general, SAS has many different input options. Find a better book, or read the online documentation (http://support.sas.com/documentation/92/index.html or similar for your version of SAS). input Name $16. reads Name as a character variable 16 characters wide. There are many variants of input, so look at the documentation to find out more.
+1 specifically tells SAS to move the pointer forward one - so instead of 16 characters of Name, then 3 digits of Age, then 5 digits of Height, it skips a space between Age and Height; so NAMENAMENAMENAMEage heigh not NAMENAMENAMENAMEageheigh.
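As a rough illustration of the pointer logic (a Python sketch, not SAS; the function name read_record and the sample record are made up for this example, and missing values are not handled), the statement from the question slices each record roughly like this:

```python
def read_record(line):
    """Slice one fixed-width record the way
    INPUT Name $16. Age 3. +1 height 5.1 would (illustrative only)."""
    pos = 0
    name = line[pos:pos + 16].strip()   # $16. : 16 columns of character data
    pos += 16
    age = int(line[pos:pos + 3])        # 3.   : 3 columns, numeric
    pos += 3
    pos += 1                            # +1   : move the pointer one column
    raw = line[pos:pos + 5].strip()     # 5.1  : 5 columns, numeric
    # With w.d, a decimal point present in the data wins; otherwise the
    # last d digits are taken as the fractional part.
    height = float(raw) if "." in raw else int(raw) / 10
    return name, age, height


# Hypothetical record: 16 columns of name, 3 of age, 1 skipped, 5 of height.
record = "Alice Johnson".ljust(16) + " 34" + " " + "170.5"
print(read_record(record))
```

The same function reads a height field of 01705 as 170.5, which is the difference between 5.1 and 5. when the data has no decimal point.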

You can start here:
Input statement

Controlling newlines when writing out arrays in Fortran

So I have some code that does essentially this:
REAL, DIMENSION(31) :: month_data
INTEGER :: no_days
no_days = get_no_days()
month_data = [fill array with some values]
WRITE(1000,*) (month_data(d), d=1,no_days)
So I have an array with values for each month, in a loop I fill the array with a certain number of values based on how many days there are in that month, then write out the results into a file.
It took me quite some time to wrap my head around the whole 'write out an array in one go' aspect of WRITE, but this seems to work.
However this way, it writes out the numbers in the array like this (example for January, so 31 values):
0.00000 10.0000 20.0000 30.0000 40.0000 50.0000 60.0000
70.0000 80.0000 90.0000 100.000 110.000 120.000 130.000
140.000 150.000 160.000 170.000 180.000 190.000 200.000
210.000 220.000 230.000 240.000 250.000 260.000 270.000
280.000 290.000 300.000
So it prefixes a lot of spaces (presumably to make columns line up even when there are larger values in the array), and it wraps lines to make it not exceed a certain width (I think 128 chars? not sure).
I don't really mind the extra spaces (although they inflate my file sizes considerably, so it would be nice to fix that too...) but the line breaks screw up my other tooling. I've tried reading several Fortran manuals, but while some of them mention 'output formatting', I have yet to find one that mentions newlines or columns.
So, how do I control how arrays are written out when using the syntax above in Fortran?
(Also, while we're at it, how do I control the number of decimal digits? I know these are all integer values so I'd like to leave out any decimals altogether, but I can't change the data type to INTEGER in my code because of reasons.)
You probably want something similar to
WRITE(1000,'(31(F6.0,1X))') (month_data(d), d=1,no_days)
Explanation:
The use of * as the format specification is called list directed I/O: it is easy to code, but you are giving away all control over the format to the processor. In order to control the format you need to provide explicit formatting, via a label to a FORMAT statement or via a character variable.
Use the F edit descriptor for real variables in decimal form. Its syntax is Fw.d, where w is the total width of the field, including the sign and the decimal point, and d is the number of digits after the decimal point. F6.0 therefore means a field 6 characters wide with no digits after the decimal point (the decimal point itself is still printed).
Spaces can be added with the X control edit descriptor.
Repetitions of edit descriptors can be indicated with the number of repetitions before a symbol.
Groups can be created with (...), and they can be repeated if preceded by a number of repetitions.
No more items are printed beyond the last provided variable, even if the format specifies how to print more items than the ones actually provided - so you can ask for 31 repetitions even if for some months you will only print data for 30 or 28 days.
Besides,
New lines could be added with the / control edit descriptor; e.g., if you wanted to print the data with 10 values per row, you could do
WRITE(1000,'(4(10(F6.0,:,1X),/))') (month_data(d), d=1,no_days)
Note the : control edit descriptor in this second example: it indicates that, if there are no more items to print, nothing else should be printed - not even spaces corresponding to control edit descriptors such as X or /. While it could have been used in the previous example, it is more relevant here, in order to ensure that, if no_days is a multiple of 10, there isn't an empty line after the 3 rows of data.
If you want to remove the decimal point completely, you would instead print the nearest integers, using the nint intrinsic and the Iw (integer) descriptor:
WRITE(1000,'(31(I6,1X))') (nint(month_data(d)), d=1,no_days)
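For checking the expected file layout from other tooling, here is a rough Python sketch that combines the last two ideas (nearest integers, 10 values per row). The names nint and format_rows are only illustrative, and the code merely imitates the layout a format such as '(4(10(I6,:,1X),/))' is expected to give; it is not Fortran's formatting engine.

```python
import math


def nint(x):
    """Round half away from zero, like Fortran's NINT (Python's round() differs)."""
    return int(math.copysign(math.floor(abs(x) + 0.5), x))


def format_rows(values, per_row=10, width=6):
    """Right-justify each rounded value in `width` columns, one blank between
    fields, `per_row` fields per line, and no empty line at the end."""
    fields = [f"{nint(v):{width}d}" for v in values]
    rows = [" ".join(fields[i:i + per_row])
            for i in range(0, len(fields), per_row)]
    return "\n".join(rows)


# Same sample data as in the question: 31 values, 0, 10, ..., 300.
month_data = [10.0 * d for d in range(31)]
print(format_rows(month_data))
```

With 30 values this gives exactly three rows, which mirrors the role of the : descriptor in the Fortran format.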

Fortran code to read formatted data file record wise

I found some Fortran code for K-means clustering online.
As I am new to Fortran, I have no idea what input file format the code below requires.
How do I prepare an input file for this Fortran code?
infile1='D1_TR_all_cent63.dat'
OPEN(1,FILE=infile1,form='formatted',access='direct',
+ recl=429)
istep=nvectors/nclusters
DO i=1,nclusters
READ(1,23,rec=istep*(i-1)+1)(vec(j),j=1,42)
DO k=1,nelements
centroid(i,k)=vec(k)
END DO
END DO
23 format(42(f10.3))
It's looking for a file where each line contains 42 fields in which it expects to find fixed-point numbers. That's the 42(f10.3) in the format statement. Each of those fields is 10 characters wide; there are no commas or other separator characters between them, although spaces are allowed.
For example:
123456.789   123.123  -123.123 12345.678     0.000
(and onward until there are 42 of those).
Most Fortran implementations will be a bit merciful on reading: you can leave off some of the digits after the decimal point, or have a space after the number (but you have to stay within the 10-character fields). Not all are, though, so it might be better to prepare your input with 6 digits (including the sign, if needed, and allowing spaces instead) before the decimal point and three after.
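If the data comes from another tool, a record in that layout could be produced along these lines. This is only a Python sketch: the function name format_record is made up, and padding every record to 429 bytes with the newline counted as the last byte is an assumption about how the compiler treats recl=429 for formatted direct access, so check it against your compiler's documentation.

```python
def format_record(values, width=10, decimals=3, recl=429):
    """Return one fixed-length record: each value right-justified in a
    `width`-character field, padded with blanks to recl - 1, plus a newline."""
    fields = []
    for v in values:
        field = f"{v:{width}.{decimals}f}"
        if len(field) != width:
            raise ValueError(f"{v} does not fit in an F{width}.{decimals} field")
        fields.append(field)
    body = "".join(fields)
    if len(body) > recl - 1:
        raise ValueError("too many fields for the record length")
    return body.ljust(recl - 1) + "\n"


# Hypothetical data: 42 values per record, as in READ ... (vec(j), j=1,42).
record = format_record([float(i) for i in range(42)])
print(len(record))
```

When writing such records to disk from Python, opening the file with newline="" keeps the line ending at one byte on every platform.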

Meaning of 3F7.1 in Fortran data format

I am trying to create an MDM file using HLM 7 Student version, but since I don't have access to SPSS I am trying to import my data using ASCII input. As part of this process I am required to input the data format Fortran style. Try as I might I have not been able to understand this step. Could someone familiar with Fortran (or even better HLM itself) explain to me how this works? Here is my current understanding
From the example EG3.DAT they give
(A4,1X,3F7.1)
I think
A4 signifies that the ID is 4 characters long.
1X means skip a space.
F.1 means that it should read 1 decimal place.
I am very confused about what 3F7 might mean.
EG3.DAT
2020 380.0 40.3 12.5
2040 502.0 83.1 18.6
2180 777.0 96.6 44.4
Below are examples from the help documents.
Rules for format statement
Format statement example
EG1 data format
EG2 data format
EG3 data format
One similar question is Explaining Fortran Write Format. Unfortunately it does not explicitly treat the F descriptor.
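3F7.1 means 3 floating point numbers, each printed over 7 characters, each with one digit after the decimal point. Leading characters are blanks.
For reading, the .1 only matters when a field contains no decimal point (the last digit is then taken as the fractional part). When the data already contains a decimal point, as here, the .1 is ignored and a floating point number is simply read from those 7 characters.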
You guessed the meaning of A4 (string of four characters) and 1X (one blank) correctly.
In Fortran, so-called data edit descriptors (which format the input or output of data) may have repeat specifications.
In the format (A4,1X,3F7.1) the data edit descriptors are A4 and F7.1. Only F7.1 has a repeat specification (the number before the F). This simply means that the format is as though the descriptor appeared repeated: like F7.1, F7.1, F7.1. With a repeat specification of 1, or not given, there is just the single appearance.
The format of the question, then, is like
(A4,1X,F7.1,F7.1,F7.1)
This format is one that is covered by the rules provided in one of the images of the question. In particular, the aspect of repeat specification is given in rule 2 with the corresponding example of rule 3.
Further, in Fortran proper, a repeat count specifier may also be * as special case: that's like an exceptionally large repeat count. *(F7.1) would be like F7.1, F7.1, F7.1, .... I see no indication that this is supported by HLM but if this is needed a very large repeat count may be given instead.
In 1X the 1 isn't a repeat specification but an integral, and necessary, part of the position edit descriptor.
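To make the repeat specification concrete, here is a small Python sketch that expands a flat format such as (A4,1X,3F7.1) and uses it to slice a line. The function names expand and read_line are made up, only A, F and X without nested groups are handled, and the sample line assumes the values are right-justified in their 7-character fields.

```python
import re


def expand(fmt):
    """Expand repeat counts in a flat format like '(A4,1X,3F7.1)'."""
    items = []
    for token in fmt.strip().strip("()").split(","):
        token = token.strip()
        m = re.fullmatch(r"(\d*)([AF]\d+(?:\.\d+)?)", token)
        if m:   # data edit descriptor with an optional repeat count
            items.extend([m.group(2)] * int(m.group(1) or 1))
        else:   # e.g. 1X, where the number is part of the descriptor
            items.append(token)
    return items


def read_line(line, fmt):
    """Slice `line` according to the expanded descriptors."""
    pos, out = 0, []
    for d in expand(fmt):
        if d.endswith("X"):                 # nX: skip n characters
            pos += int(d[:-1])
            continue
        width = int(re.match(r"[AF](\d+)", d).group(1))
        field = line[pos:pos + width]
        pos += width
        out.append(field if d[0] == "A" else float(field))
    return out


print(expand("(A4,1X,3F7.1)"))
sample = "2020" + " " + f"{380.0:7.1f}{40.3:7.1f}{12.5:7.1f}"
print(read_line(sample, "(A4,1X,3F7.1)"))
```

Note that the sketch relies on the data containing explicit decimal points, as the EG3.DAT values do.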
Procedure for making an MDM file from Excel for HLM:
- Make sure ALL the characters in ALL the columns line up.
    Select a column, then right-click and select Format Cells.
    Then click on 'Custom', go to the 'Type' box and enter the number of 0s you need to line everything up.
- Remove all the tabs from the document and replace them with spaces.
    Open the document in Word and use find and replace.
- Save the document as .dat.
    First save it as .txt.
    Then open it in Notepad and save it as .dat.
To enter the data format (FORTRAN-style):
The program reads the data file character by character, so you have to specify the format exactly for it to read the whole set properly.
If something is off, even by a single space, your descriptive stats will be wonky compared with what you get in another program.
- Enclose the format in parentheses ().
- Divide the entries with commas.
- You need an ID column for all levels.
    The ID column needs to be sorted from smallest to largest.
    Use A# with # being the number of characters in the ID.
    Use 1X to move from the ID to the next column.
- You need to say how many characters each column takes.
    Use F# (# = number), where the number after F is the number of characters needed for that column.
    There need to be enough characters to provide one 'gap' space between columns.
    There need to be enough characters to allow for the decimal point.
- As part of the F you need to specify the number of decimal places.
    Add a decimal point after the F number and then the number of decimal places: F#.#
- You can put a number in front of the F to 'repeat' it (#F#.#). This is not necessary, though.
All in all, it should look something like this:
(A4,1X,F4.0,F5.1)
Helpful links:
https://books.google.de/books?id=VdmVtz6Wtc0C&pg=PA78&lpg=PA78&dq=data+format+fortran+style+hlm&source=bl&ots=kURJ6USN5e&sig=fdtsmTGSKFxn04wkxvRc2Vw1l5Q&hl=en&sa=X&ved=0ahUKEwi_yPurjYrYAhWIJuwKHa0uCuAQ6AEIPzAC#v=onepage&q&f=false
http://www.ssicentral.com/hlm/help6/error/Problems_creating_MDM_files.pdf
http://www.ssicentral.com/hlm/help7/faq/FAQ_Format_specifications_for_ASCII_data.pdf

How does the reverse function in SAS work?

I have a time data field, say, 10/1/2014.
I want to extract the month and the year information dynamically in SAS, given any date.
I wrote the following code in SAS to extract the month info:
month = substr(time_field, 1, index(time_field, '/')-1);
This worked fine.
I wrote the following snippet to extract the year info:
year = substr(reverse(time_field), 1, 4);
This doesn't work; it throws a blank. Have I missed something? Please help.
SAS will return the year for you. No need to write any custom function for this purpose. Look:
data _null_;
length year 4.;
year=year(today());
put "we are on the year of " year;
run;
Your variable has trailing spaces most likely. So when you reverse it, the trailing spaces become leading spaces and then you take the first four characters which are blanks.
You can verify this by running the reverse function alone on the variable and see the results.
Try adding the compress function, and reverse the result again so the digits come out in the right order.
year = reverse(substr(reverse(compress(time_field)), 1, 4));
Though this may solve your problem, you should really convert your date to a SAS date and then use the Month/Day/Year functions.
data have;
length time_field $20.;
time_field="10/1/2014";
year_bad = substr(reverse(time_field),1, 4);
year_good = reverse(substr(reverse(compress(time_field)),1, 4));
year_better = year(input(time_field, mmddyy10.));
put "year_bad:" year_bad;
put "year_good:" year_good;
put "year_better:" year_better;
run;
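The trailing-blank effect is easy to reproduce outside SAS. The following Python sketch only imitates it: the padded string stands in for a $20 character variable, and rstrip plays the role that compress plays above (compress removes all blanks, not just trailing ones, which makes no difference for this value).

```python
# A SAS character variable of length 20 is blank-padded on the right.
time_field = "10/1/2014".ljust(20)

# Reversing moves the padding to the front, so the first 4 characters are blanks.
year_bad = time_field[::-1][:4]

# Remove the padding first, take 4 characters from the reversed string,
# then reverse those 4 characters back into reading order.
year_good = time_field.rstrip()[::-1][:4][::-1]

print(repr(year_bad))
print(repr(year_good))
```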
Your data is either a date stored in a character field, or it is a numeric value formatted as a date. While you can use text expressions on numerics, you shouldn't; you should explicitly convert them.
When you don't, then you end up with things like this - ie, improper lengths of fields, because the automatic conversion is very loose. It tends to allow a huge amount of extra space where it's not required to.
If your data is numeric, use MONTH() or YEAR() and be done with it; there's no reason to play in text here. Look at the field in the data explorer; it will tell you if it's numeric or not. (Numeric with a format can still look like text, so actually look at it!)
If your data is text, then you have some better options than REVERSE.
First is SCAN. SCAN splits by word, similar to many other languages; often strsplit (R) or similar.
month=scan(mdy_var,1,'/');
day =scan(mdy_var,2,'/');
year =scan(mdy_var,3,'/');
Second, you could still use SUBSTR, along with LENGTH.
year = substr(mdy_var,length(mdy_var)-3,4);
LENGTH tells you how long the string really is (minus trailing spaces), so '10/1/2014' is 9 long; 6th character (9-3) is the 2, and then 4 characters after that [which should be unnecessary]. This method wouldn't really work with Day, of course, only with year (and only with 4 digit year). Scan is better really, but this is a good example of how this works.
Going along the same lines, you can use FIND and look backwards, also, using a negative start position.
year = substr(mdy_var,find(mdy_var,'/',-99)+1,4);
That starts the search at the 99th character (which is realistically more than your maximum length, right?) and goes left, returning the position of the first '/' it finds.
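For readers more familiar with other languages, the three text approaches correspond roughly to the following Python sketch. The variable names are only illustrative, and the comments name the SAS function each line is meant to mirror.

```python
mdy_var = "10/1/2014"

# Like SCAN(mdy_var, n, '/'): split on the delimiter and pick a piece.
month, day, year_scan = mdy_var.split("/")

# Like SUBSTR(mdy_var, LENGTH(mdy_var)-3, 4): the last four characters.
year_substr = mdy_var[len(mdy_var) - 4:]

# Like FIND with a negative start: locate the last '/' and read what follows.
year_find = mdy_var[mdy_var.rfind("/") + 1:][:4]

print(month, day, year_scan, year_substr, year_find)
```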

Convert alphanumeric string to 16 digit GCID

I'm building our inventory feed for Amazon Seller Central in OpenOffice Calc but can't work out how to convert our inhouse product IDs to the Amazon required format GCID.
The standard-product-id must have a specific number of characters according to type: GCID (16 alphanumeric characters), UPC (12 digit number), EAN (13 digit number) or GTIN(14 digit number).
Our product IDs vary by manufacturer, eg:-
123456
AB123456
1234AB
Where the ID is numerical only I can format the cells with leading zeros, however this doesn't work if the cell contains letters.
My file has over 10,000 products so I'm wondering if there is a formula I can apply to all cells to instantly convert them to GCID?
The question seems to have been asked under a misapprehension. However, the example (123456, AB123456, 1234AB) represents three different IDs, and padding to a specified length is quite a common requirement (e.g. see String.PadLeft Method), so a suggestion for OpenOffice might be of use to someone, one day.
The convention is to pad with 0s. Since some spreadsheets automatically strip these off the front of numbers (as in the first example), and databases tend to prefer fields of a consistent format, I suggest separating the padding from the ID with a hyphen, to aid identification of alphanumeric codes and to force text format:
=REPT(0;15-LEN(A1))&"-"&A1
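The same padding rule can be sketched in Python, for instance to check the spreadsheet output or to build the feed in a script. The function name pad_id and its parameters are made up for this example, and the result is only a 16-character padded string in the style of the formula above, not a validated GCID.

```python
def pad_id(product_id, total=16, fill="0", sep="-"):
    """Left-pad product_id with `fill`, then `sep`, to `total` characters."""
    product_id = str(product_id)
    n_fill = total - len(sep) - len(product_id)
    if n_fill < 0:
        raise ValueError(f"{product_id!r} is too long for {total} characters")
    return fill * n_fill + sep + product_id


# The three example IDs from the question.
for pid in ["123456", "AB123456", "1234AB"]:
    print(pad_id(pid))
```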