SAS Input: Integers Separated by Commas - sas

This seems incredibly basic, but I simply can't find the right informat in SAS to read in the kind of data I have, which looks like this:
9 Bittersweet #FD7C6E (253, 124, 110) 48 1949
10 Black #000000 (0,0,0) 8 1903
I need to read in the values in parentheses into three separate numeric variables, and there's no informat I can find that simply "reads in numeric characters until it encounters a non-numeric character." The file is not completely comma-separated, more's the pity (whoever "designed" this file format should be shot, dead, buried, resurrected, and shot again!) The problem with the data in parentheses is that sometimes there is a space after a comma, and sometimes not. I've gotten the first number and the first set of characters after the number read in via column input, since the # is always in column 32. I've read in the six-digit hex value (just using character there).
Here is my MWE:
Data crayons;
Infile 'path\crayons.dat' MISSOVER;
Input crayon_number
color_name $ 4-31
hex_code $ 33-38 #42
red 3. #','
green 3. #','
blue 3. #')'
pack_size
year_issued
year_retired;
Run;
The Bittersweet line is read in correctly, but not the Black line. (year_retired is blank for both of these - I'm not concerned about that.) In the Black line, I get the hex_code variable correctly, but nothing after that.
So I guess the central question is this: how do I read in an integer of varying length that is guaranteed NOT to contain a comma, particularly when it is immediately followed by a comma?
Perhaps at a higher level: where can I go to find these sorts of things out? I have these questions about reading in dirty data, and I don't know where to go to find out. The SAS Language Reference is woefully inadequate for this, in my experience. If data fits into their neat little boxes, you're good to go. Anything outside of that, and their reference is useless.
Thank you very much for your time!

I would use list input with dlm=' (,)', so that blanks, parentheses, and commas are all treated as delimiters:
Data crayons;
infile cards dlm=' (,)' missover;
Input crayon_number
color_name &$28.
hex_code $
red
green
blue
pack_size
year_issued
year_retired;
list;
cards;
9 Bittersweet #FD7C6E (253, 124, 110) 48 1949
10 Black #000000 (0,0,0) 8 1903
;
run;

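Another option is to read the whole parenthesized field into a single character variable and split it with SCAN() afterwards. SCAN() returns character values, so wrap each call in INPUT() to get numeric variables.
Since you mention that the position is fixed, it sounds like you're reading a fixed-width file?
red = input(scan(orig_var, 1, "(,)"), 8.);
green = input(scan(orig_var, 2, "(,)"), 8.);
blue = input(scan(orig_var, 3, "(,)"), 8.);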
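For readers who want to see the splitting logic outside SAS, here is a minimal Python sketch of the same idea. The `scan` helper is a hypothetical stand-in that imitates only the default behaviour of SAS SCAN() (leading delimiters ignored, runs of delimiters treated as one); it is not the SAS function itself, and the sample field is taken from the Bittersweet line.

```python
import re

def scan(text, n, delims):
    """Hypothetical imitation of SAS SCAN(): return the n-th word (1-based)."""
    # Split on runs of any delimiter character and drop empty pieces.
    words = [w for w in re.split("[" + re.escape(delims) + "]+", text) if w]
    return words[n - 1] if n <= len(words) else ""

def rgb_from_field(field):
    """Turn '(253, 124, 110)' or '(0,0,0)' into three integers."""
    # int() tolerates the blank that may follow a comma.
    return tuple(int(scan(field, i, "(,)")) for i in (1, 2, 3))

print(rgb_from_field("(253, 124, 110)"))
print(rgb_from_field("(0,0,0)"))
```

The conversion step plays the role of INPUT() in the SAS version: splitting alone yields text, and a separate step turns it into numbers.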

Strange sorting in Stata after doing encode

Variable X used to be a string, so I used the encode command to make it numeric.
But when I sort it afterwards, it is sorted this way:
1000
10000
10001
10003
10005
1003
But usually, it should be sorted like
1000
1001
1003
1005
Why is the sorting so strange after encode?
It also appears that the 1003 created by encode and the 1003 in the using dataset are treated as different numbers.
Not strange at all. Right near the top of help encode Stata tells you "Do not use encode if varname contains numbers that merely happen to be stored as strings".
encode maps strings in alphabetical (here alphanumeric) order to numeric values 1 up (unless you specify otherwise with a label() option).
So "1000" will sort before "10000" before "1001", and so forth.
You probably need destring but why was the variable read as string? That's what you need to worry about.
encode is for strings when you want a numeric equivalent. So "cat" "dog" "frog" "toad" will map to 1 2 3 4 and the string values will become value labels.
destring is for mistaken strings. The variable should be numeric, but something went wrong on reading the data. So, what was it that went wrong? Common errors include:
-Header data from a spreadsheet that should be a variable label (or ignored) got read in as data.
-Codes for missing data such as NA that make sense to people or to some other program but do not correspond to Stata representations of missing.
-Garbage of some kind.
To check for problems, you could look at the values that wouldn't translate to numbers:
tab whatever if missing(real(whatever))
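The difference between the two commands can be illustrated with a short Python sketch (Python rather than Stata, purely to show the logic). The names `encoded`, `to_number`, and the sample list are hypothetical; the encode-like mapping simply numbers the distinct strings in sorted order, as described above.

```python
codes = ["1003", "10000", "1000", "10005", "10001", "10003"]

# encode-like: distinct strings, in string order, are mapped to 1, 2, 3, ...
encoded = {s: i for i, s in enumerate(sorted(set(codes)), start=1)}

# destring-like: convert the text itself to a number
def to_number(s):
    """Return the numeric value, or None when the text is not a number."""
    try:
        return float(s)
    except ValueError:
        return None

print(sorted(codes))                  # string order: "10005" comes before "1003"
print(encoded["1003"])                # a code assigned by position, not the value 1003
print(sorted(int(s) for s in codes))  # numeric order

# Analogue of the tab check: list the values that would not translate to numbers
problems = [s for s in codes + ["NA"] if to_number(s) is None]
print(problems)
```

This is also why the encoded 1003 does not match a genuine numeric 1003 in another dataset: the encoded variable holds the position of the string in the sorted list, with the original text attached only as a label.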

Meaning of 3F7.1 in Fortran data format

I am trying to create an MDM file using HLM 7 Student version, but since I don't have access to SPSS I am trying to import my data using ASCII input. As part of this process I am required to input the data format Fortran style. Try as I might I have not been able to understand this step. Could someone familiar with Fortran (or even better HLM itself) explain to me how this works? Here is my current understanding
From the example EG3.DAT they give
(A4,1X,3F7.1)
I think
A4 signifies that the ID is 4 characters long.
1X means skip a space.
F.1 means that it should read 1 decimal place.
I am very confused about what 3F7 might mean.
EG3.DAT
2020 380.0 40.3 12.5
2040 502.0 83.1 18.6
2180 777.0 96.6 44.4
Below are examples from the help documents (images): rules for format statement; format statement example; EG1, EG2, and EG3 data formats.
One similar question is Explaining Fortran Write Format. Unfortunately it does not explicitly treat the F descriptor.
3F7.1 means 3 floating-point numbers, each occupying 7 characters, with one digit after the decimal point. Unused leading characters are blanks.
For reading, the .1 matters only when a field contains no decimal point; your data contain explicit decimal points, so each floating-point number is simply read from its 7 characters.
You guessed the meaning of A4 (string of four characters) and 1X (one blank) correctly.
In Fortran, so-called data edit descriptors (which format the input or output of data) may have repeat specifications.
In the format (A4,1X,3F7.1) the data edit descriptors are A4 and F7.1. Only F7.1 has a repeat specification (the number before the F). This simply means that the format is as though the descriptor appeared repeated: like F7.1, F7.1, F7.1. With a repeat specification of 1, or not given, there is just the single appearance.
The format of the question, then, is like
(A4,1X,F7.1,F7.1,F7.1)
This format is one that is covered by the rules provided in one of the images of the question. In particular, the aspect of repeat specification is given in rule 2 with the corresponding example of rule 3.
Further, in Fortran proper, a repeat count specifier may also be * as a special case: that's like an exceptionally large repeat count. *(F7.1) would be like F7.1, F7.1, F7.1, .... I see no indication that this is supported by HLM, but if this is needed a very large repeat count may be given instead.
In 1X the 1 isn't a repeat specification but an integral, and necessary, part of the position edit descriptor.
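To make the repeat specification concrete, here is a rough Python sketch that expands a format such as (A4,1X,3F7.1) and then cuts a record into fields accordingly. It is not how Fortran or HLM is implemented; `expand_format` and `read_record` are hypothetical helpers that handle only the A, X, and F descriptors, and the exact column spacing of the sample record is an assumption.

```python
import re

def expand_format(fmt):
    """'(A4,1X,3F7.1)' -> ['A4', '1X', 'F7.1', 'F7.1', 'F7.1']"""
    expanded = []
    for item in fmt.strip().strip("()").split(","):
        item = item.strip().upper()
        m = re.fullmatch(r"(\d*)([AF])(\d+)(?:\.(\d+))?", item)
        if m:
            count = int(m.group(1) or 1)        # missing repeat count means 1
            expanded.extend([item[len(m.group(1)):]] * count)
        else:
            expanded.append(item)               # e.g. 1X, where the 1 is not a repeat count
    return expanded

def read_record(line, fmt):
    values, pos = [], 0
    for d in expand_format(fmt):
        if d.endswith("X"):                     # nX: skip n columns
            pos += int(d[:-1])
            continue
        width_text, _, decimals_text = d[1:].partition(".")
        field = line[pos:pos + int(width_text)]
        pos += int(width_text)
        if d[0] == "A":
            values.append(field)
        else:
            text = field.strip()
            if "." in text:                     # explicit decimal point wins
                values.append(float(text))
            else:                               # otherwise the last d digits are decimals
                values.append(int(text or 0) / 10 ** int(decimals_text or 0))
    return values

# Assumed layout: 4-character ID, one blank, three 7-character fields
record = "2020" + " " + "  380.0" + "   40.3" + "   12.5"
print(expand_format("(A4,1X,3F7.1)"))
print(read_record(record, "(A4,1X,3F7.1)"))
```

If a field held the digits 1234 without a decimal point, the same F7.1 descriptor would read it as 123.4, which is the only situation in which the .1 affects input.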
Procedure for making an MDM file from Excel for HLM:
-Make sure ALL the characters in ALL the columns line up
Select a column, then right-click and select Format Cells
Then click on 'Custom', go to the 'Type' box, and enter the number of 0s you need to line everything up
-Remove all the tabs from the document and replace them with spaces
Open the document in Word and use find and replace
-To save the document as .dat
First save it as .txt
Then open it in Notepad and save it as .dat
To enter the data format (FORTRAN-style):
The program reads the data file character by character, so you have to specify the format exactly for it to read the whole set properly.
If something is off, even by a single space, your descriptive stats will be wrong compared with what you get in another program.
Enclose the format in parentheses ()
Separate the entries with commas ,
-Need an ID column for all levels
The ID column needs to be sorted from smallest to largest
Use A# with # being the number of characters in the ID
Use 1X to move from the ID to the next column
-Need to say how many characters are needed in each column
Use F# (# = number of characters needed for that column)
There need to be enough characters to provide one 'gap' space between each column
There need to be enough characters to allow for the decimal point
As part of the F you need to specify the number of decimal places
You do this by adding a decimal point after the F number and then the number of decimal places you need: F#.#
You can put a number in front of the F to 'repeat' it (#F#.#). This is not necessary, though.
All in all, it should look something like this:
(A4,1X,F4.0,F5.1)
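As an alternative to lining columns up by hand, a few lines of script can write records that fit a format of this shape, with the skip written in the standard form 1X. The Python sketch below is only an illustration under assumed names (`write_record`, `score`, `rate`) and made-up values; it pads each field to the stated width so that every line has the same length.

```python
def write_record(record_id, score, rate):
    """Build one line matching (A4,1X,F4.0,F5.1)."""
    id_field = f"{record_id:<4.4}"   # A4: ID padded or cut to exactly 4 characters
    gap = " "                        # 1X: one blank column
    score_field = f"{score:4.0f}"    # F4.0: 4 characters, no decimals
    rate_field = f"{rate:5.1f}"      # F5.1: 5 characters, 1 decimal
    return id_field + gap + score_field + rate_field

# Hypothetical rows, sorted by ID as the procedure requires
rows = [("1001", 12, 3.5), ("1002", 7, 10.25), ("1003", 105, 0.0)]
lines = [write_record(*row) for row in rows]
for line in lines:
    print(line)

# Every record should have the same length: 4 + 1 + 4 + 5 = 14
print({len(line) for line in lines})
```

Checking that all lines share one length is a quick way to catch the off-by-one-space problems mentioned above before the file reaches HLM.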
Helpful links:
https://books.google.de/books?id=VdmVtz6Wtc0C&pg=PA78&lpg=PA78&dq=data+format+fortran+style+hlm&source=bl&ots=kURJ6USN5e&sig=fdtsmTGSKFxn04wkxvRc2Vw1l5Q&hl=en&sa=X&ved=0ahUKEwi_yPurjYrYAhWIJuwKHa0uCuAQ6AEIPzAC#v=onepage&q&f=false
http://www.ssicentral.com/hlm/help6/error/Problems_creating_MDM_files.pdf
http://www.ssicentral.com/hlm/help7/faq/FAQ_Format_specifications_for_ASCII_data.pdf

Fortran 90: reading a generic string with enclosed some "/" characters

Hi everybody, I've run into a problem reading unformatted character strings from a simple file. When the first / is found, everything after it is lost.
This is the example of the text I would like to read: after the first 18 character blocks that are fixed (from #Mod to Flow[kW]), there is a list of chemical species' names, that are variables (in this case 5) within the program I'm writing.
#Mod ID Mod Name Type C. #Coll MF[kg/s] Pres.[Pa] Pres.[bar] Temp.[K] Temp.[C] Ent[kJ/kg K] Power[kW] RPM[rad/s] Heat Flow[kW] METHANE ETHANE PROPANE NITROGEN H2O
I would like to skip, after some formal checks, the first 18 blocks, then read the chemical species. To do the former, I created a character array with dimension of 18, each with a length of 20.
character(20), dimension(18) :: chapp
Then I would like to associate the 18 blocks to the character array
read(1,*) (chapp(i),i=1,18)
...but this is the result: chapp(1) to chapp(7) hold the correct first 7 strings, but this is chapp(8):
chapp(8) = 'MF[kg '
and from there on, everything is left blank!
How could I overcome this reading problem?
The problem is due to your using list-directed input (the * as the format). List-directed input is useful for quick and dirty input, but it has its limitations and quirks.
You stumbled across a quirk: A slash (/) in the input terminates assignment of values to the input list for the READ statement. This is exactly the behavior that you described above.
This is not a choice of the compiler writer; it is mandated by all relevant Fortran standards.
The solution is to use formatted input. There are several options for this:
If you know that your labels will always be in the same columns, you can use a format string like '(1X,A4,2X,A2,1X,A3,2X)' (this is not complete) to read in the individual labels. This is error-prone, and it breaks if the program that writes out the data changes its format for some reason, or if the labels are edited by hand.
If you can control the program that writes the labels, you can use tab characters to separate the individual labels (and also, later, the data). Read in the whole line, split it into tab-separated substrings using INDEX, and read in the individual fields using an (A) format. Don't use list-directed format, or you will get hit by the / quirk mentioned above. This has the advantage that your labels can also include spaces, and that the data can be imported from/to Excel rather easily. This is what I usually do in such cases.
Otherwise, you can read in the whole line and split on multiple spaces. A bit more complicated than splitting on single tab characters, but it may be the best option if you cannot control the data source. You cannot have labels containing spaces then.
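The last option (read the whole line, then split on runs of spaces) can be sketched in a few lines. The sketch is in Python rather than Fortran and only shows the splitting logic; the header is the one from the question, and the count of 18 fixed blocks is taken from there as well.

```python
header = ("#Mod ID Mod Name Type C. #Coll MF[kg/s] Pres.[Pa] Pres.[bar] "
          "Temp.[K] Temp.[C] Ent[kJ/kg K] Power[kW] RPM[rad/s] Heat Flow[kW] "
          "METHANE ETHANE PROPANE NITROGEN H2O")

# Treat the line as plain text: a slash is just another character here.
tokens = header.split()      # splits on runs of blanks

fixed = tokens[:18]          # the 18 fixed blocks, '#Mod' through 'Flow[kW]'
species = tokens[18:]        # whatever follows: the chemical species names

print(tokens[7])             # the block that list-directed input cut short
print(species)
```

The limitation mentioned above is visible in the output: 'Heat Flow[kW]' arrives as two separate blocks because it contains a space.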

Replacing consecutive embedded blanks with another character in SAS

I'm trying to replace embedded spaces in one of my variables (QPR) with a new character. Here is my (abbreviated) code:
data sas2;
input QPR $ & 1-9;
QPR=tranwrd(strip(QPR)," ","0");
run;
proc print data=sas2;
run;
The tranwrd function seems to work for observations with one embedded blank; however, it does not work when there are two blanks in a row.
For example, "234 2345" becomes 23402345, but "234  345" (two blanks in a row) becomes 234 (i.e., the rest gets cut off, I assume because of strip). Instead, I want 23400345.
I also tried tranwrd without the strip function, but then "234  345" becomes 23400000 instead. Translate does the same thing.
Any ideas on why this won't work and how to fix it? Alternatively, are there easier/better ways to do this in the data step?
The "&" symbol in your input statement causes SAS to stop reading the value as soon as it encounters two consecutive spaces. SAS then pads the rest of the string with spaces up to a total length of 9 characters. This is why you had a bunch of zeros at the end of the string when you didn't use strip. Removing the "&" should fix it.
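Once the whole value is read, the replacement itself is a strip followed by a character-for-character substitution. The Python sketch below shows that logic only (it is not SAS); `fill_embedded_blanks` is a hypothetical helper, and the 9-character padding imitates a fixed-length character variable.

```python
def fill_embedded_blanks(raw, fill="0"):
    """Remove leading/trailing blanks, then replace every remaining blank."""
    return raw.strip().replace(" ", fill)

one_blank = "234 2345"
two_blanks = "234  345"

print(fill_embedded_blanks(one_blank))
print(fill_embedded_blanks(two_blanks))

# Without the strip, the trailing padding of a 9-character field is replaced too.
padded = two_blanks.ljust(9)
print(padded.replace(" ", "0"))
```

The padded case shows why the strip matters: every blank is replaced, including the ones that only pad the value out to its declared length.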

getting input format

So, I just bought an introductory book for SAS, but it only contains tons of examples with little or no explanation. I tried to find a tutorial online, but I can't find an explanation of this formatting. I just wonder what the difference is between these:
INPUT Name $16. Age 3. +1 height 5.1
I wonder what the "." means. What's the difference between:
INPUT Name $16
and
INPUT Name $ 1-16
What does the symbol "+1" mean?
What does "5.1" mean? How is that different from "5."? Thanks.
Formats always contain periods; the period can serve to separate width from decimal, i.e. 5.1 is 5 total width, 1 decimal, so xxx.d (actually, -xx.d, but it will also display xxx.d correctly). For character values and other values that cannot have decimal portions, there is never a number after the period, but it is still present; so DATE9. is a DATE formatted variable (specifically, looks like "19JAN2013") and is 9 characters long (as opposed to DATE7., or 19JAN13).
In general, SAS has many different input options. Find a better book, or read the online documentation (http://support.sas.com/documentation/92/index.html or similar for your version of SAS). input Name $16. reads Name as a 16-character variable. There are a lot of variants of input options, so look at the documentation to find out more.
+1 specifically tells SAS to move the pointer forward one - so instead of 16 characters of Name, then 3 digits of Age, then 5 digits of Height, it skips a space between Age and Height; so NAMENAMENAMENAMEage heigh not NAMENAMENAMENAMEageheigh.
You can start here:
Input statement
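To picture how the column pointer moves through INPUT Name $16. Age 3. +1 height 5.1, here is a simplified Python sketch of the same bookkeeping. It is not SAS: `read_w_d` and `parse_line` are hypothetical helpers, the sample records are made up, and `read_w_d` covers only the basic w.d rule (an explicit decimal point in the data wins; otherwise the last d digits are taken as decimals).

```python
def read_w_d(field, decimals):
    """Simplified w.d reading: returns None for a blank field."""
    text = field.strip()
    if not text:
        return None
    if "." in text:
        return float(text)
    return int(text) / 10 ** decimals

def parse_line(line):
    pos = 0
    name = line[pos:pos + 16].rstrip()         # $16. : 16 characters of text
    pos += 16
    age = read_w_d(line[pos:pos + 3], 0)       # 3.   : 3 columns, no decimals
    pos += 3
    pos += 1                                   # +1   : skip one column
    height = read_w_d(line[pos:pos + 5], 1)    # 5.1  : 5 columns, 1 implied decimal
    return name, age, height

with_point = "Alice Smith".ljust(16) + " 42" + " " + " 65.5"
without_point = "Bob".ljust(16) + " 37" + " " + "  702"

print(parse_line(with_point))
print(parse_line(without_point))
```

The second record shows where 5.1 differs from 5.: with no decimal point in the data, 5.1 treats the last digit as a decimal, whereas 5. would read the same columns as the whole number 702.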