Strange sorting in Stata after doing encode

Variable X used to be string. So I used encode command to make it non-string.
But after that when I sort it, it's sorted in this way.
1000
10000
10001
10003
10005
1003
But usually, it should be sorted like
1000
1001
1003
1005
Why is sorting so strange after doing encode?
And it appears 1003 created from encode and 1003 in using dataset are considered different numbers.

Not strange at all. Right near the top of help encode Stata tells you "Do not use encode if varname contains numbers that merely happen to be stored as strings".
encode maps strings in alphabetical (here alphanumeric) order to numeric values 1 up (unless you specify otherwise with a label() option).
So "1000" will sort before "10000" before "1001", and so forth.
You probably need destring but why was the variable read as string? That's what you need to worry about.
encode is for strings when you want a numeric equivalent. So "cat" "dog" "frog" "toad" will map to 1 2 3 4 and the string values will become value labels.
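The mapping described above can be imitated outside Stata. The following is only a rough Python sketch of the idea, not Stata's implementation; the helper name encode_like is made up for illustration. It gives the distinct strings codes 1, 2, ... in sorted string order, which is why "1003" ends up after "10005" and why its code is not the number 1003.

```python
def encode_like(values):
    """Assign integer codes 1, 2, ... to the distinct strings in sorted order.

    Hypothetical helper that mimics the idea behind encode; it is not Stata code.
    """
    levels = sorted(set(values))
    return {s: i for i, s in enumerate(levels, start=1)}


x = ["1000", "10000", "10001", "10003", "10005", "1003"]

codes = encode_like(x)

# Sorting by the assigned code reproduces the "strange" order from the question.
by_code = sorted(x, key=codes.get)

# Converting the strings to numbers first (the destring idea) gives the expected order.
by_value = sorted(x, key=int)

print(by_code)
print(by_value)
print(codes["1003"])
```

Because the code assigned to "1003" is just its rank among the strings, it will not match the number 1003 in another dataset, which is consistent with the mismatch reported in the question.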
destring is for mistaken strings. The variable should be numeric, but something went wrong on reading the data. So, what was it that went wrong? Common errors include
Header data from a spreadsheet that should be a variable label (or ignored) got read in as data.
Codes for missing data such as NA that make sense to people or to some other program but do not correspond to Stata representations of missing.
Garbage of some kind.
To check for problems, you could look at the values that wouldn't translate to numbers:
tab whatever if missing(real(whatever))
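The same kind of check can be sketched in Python for readers who want to see the logic spelled out. This is a loose analogy under stated assumptions: the sample values are invented, and Python's float() accepts some strings (for example "nan" and "inf") that you may not want to count as numbers.

```python
def not_numeric(values):
    """Return the strings that cannot be converted to a number."""
    bad = []
    for v in values:
        try:
            float(v)
        except ValueError:
            bad.append(v)
    return bad


# Invented sample: a stray header, a missing-data code, a thousands separator, an empty cell.
raw = ["1000", "1003", "NA", "income", "10,005", ""]

problems = not_numeric(raw)
print(problems)
```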

Related

Attempting to identify non-integer values using for loop

I am trying to identify values that are not integers in Stata. My dataset is the following:
var1 var2 var3
1 2 3
2 4 5
3 6 7
4 2 3
5 1 1
6 2 8
My code is the following:
foreach var in var1 var2 var3 {
gen flag_`var' = 1 if format(`var') == %int
replace flag_`var' = 0 if flag_`var' ==.
}
I am getting an error message stating
unknown function format()
I also tried replacing the parentheses around format(`var') with format[`var'] but then I got an error stating format not found. Is there something wrong with the format I am using or is there a better way to identify non-integer values?
The first answer is what Stata told you: there is no format() function.
But a deeper answer is that thinking of (display) formats is the wrong way round for this question. A display format is in essence an instruction to show data in a certain way and has nothing to do with its stored value, or to be more precise the decimal equivalent of its stored value. Thus 42 displayed with format %4.3f is shown as 42.000 while 6.789 displayed with format %1.0f is shown as 7. Otherwise put, no value has an inherent format, but a display format is used to display a value, either by default or because a user specified a format. Stata is here just using the same broad ideas as say C and various C-like languages.
Nothing to do with its stored value is a slight exaggeration, as only numeric formats make sense for numbers and only string formats make sense for strings, but display format has nothing to do with whether a stored value is integer.
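Since these format strings follow the same broad conventions as C, the two examples can be tried in any language with printf-style formatting. A small Python sketch (the variable names are made up for illustration):

```python
a = 42
b = 6.789

# The format only changes how the value is shown ...
shown_a = "%4.3f" % a
shown_b = "%1.0f" % b

print(shown_a)  # 42.000
print(shown_b)  # 7

# ... the stored values are untouched.
print(a, b)
```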
Further, %int is not a display format anyway. If you were checking for a format, it would have to be a literal string enclosed in "".
To show non-integers various methods could be used, say using rounding functions such as round(), int(), floor() or ceil(). So an indicator for whether x is integer could be
gen is_int_x = x == floor(x)
All the values in your data example are integers anyway, but I take it that you are looking for non-integers elsewhere.
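The test x == floor(x) is not specific to Stata. A minimal Python sketch of the same indicator, with invented sample values:

```python
import math


def is_integer_valued(x):
    """True when x has no fractional part, whatever its storage type."""
    return x == math.floor(x)


values = [1, 2.0, 6.789, -3.5, 42]

# 1 for integer values, 0 otherwise, like the indicator variable above.
flags = [int(is_integer_valued(v)) for v in values]
print(flags)
```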

SAS Input: Integers Separated by Commas

This seems incredibly basic, but I simply can't find the right informat in SAS to read in the kind of data I have, which looks like this:
9 Bittersweet #FD7C6E (253, 124, 110) 48 1949
10 Black #000000 (0,0,0) 8 1903
I need to read in the values in parentheses into three separate numeric variables, and there's no informat I can find that simply "reads in numeric characters until it encounters a non-numeric character." The file is not completely comma-separated, more's the pity (whoever "designed" this file format should be shot, dead, buried, resurrected, and shot again!) The problem with the data in parentheses is that sometimes there is a space after a comma, and sometimes not. I've gotten the first number and the first set of characters after the number read in via column input, since the # is always in column 32. I've read in the six-digit hex value (just using character there).
Here is my MWE:
Data crayons;
Infile 'path\crayons.dat' MISSOVER;
Input crayon_number
color_name $ 4-31
hex_code $ 33-38 #42
red 3. #','
green 3. #','
blue 3. #')'
pack_size
year_issued
year_retired;
Run;
The Bittersweet line is read in correctly, but not the Black line. (year_retired is blank for both of these - I'm not concerned about that.) In the Black line, I get the hex_code variable correctly, but nothing after that.
So I guess the central question is this: how do I read in an integer of varying length that is guaranteed NOT to contain a comma, particularly when it is immediately followed by a comma?
Perhaps at a higher level: where can I go to find these sorts of things out? I have these questions about reading in dirty data, and I don't know where to go to find out. The SAS Language Reference is woefully inadequate for this, in my experience. If data fits into their neat little boxes, you're good to go. Anything outside of that, and their reference is useless.
Thank you very much for your time!
I would use list input with delimiters=' (,)'
Data crayons;
infile cards dlm=' (,)' missover;
Input crayon_number
color_name &$28.
hex_code $
red
green
blue
pack_size
year_issued
year_retired;
list;
cards;
9 Bittersweet #FD7C6E (253, 124, 110) 48 1949
10 Black #000000 (0,0,0) 8 1903
;
run;
Another option is to read it in as a character as a whole field and use SCAN() later on.
Since you mention that the position is fixed, it sounds like you're reading a fixed width type file?
red = scan(orig_var, 1, "(,)");
green = scan(orig_var, 2, "(,)");
blue = scan(orig_var, 3, "(,)");
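To see why splitting on the characters "(,)" copes with both spacing styles, here is a rough Python imitation of that approach. The function scan below is a hypothetical stand-in written for this example, not the SAS function, and it only covers the behaviour needed here (runs of delimiters count as one, pieces are numbered from 1).

```python
import re


def scan(text, n, delims):
    """Return the n-th piece of text (1-based) after splitting on any delimiter character.

    Hypothetical helper that loosely imitates a scan-style split.
    """
    pattern = "[" + re.escape(delims) + "]"
    pieces = [p for p in re.split(pattern, text) if p != ""]
    return pieces[n - 1]


def rgb(field):
    """Convert a field such as "(253, 124, 110)" into three integers."""
    # int() ignores the leading blank left behind when a space follows the comma.
    return tuple(int(scan(field, i, "(,)")) for i in (1, 2, 3))


print(rgb("(253, 124, 110)"))
print(rgb("(0,0,0)"))
```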

Controlling newlines when writing out arrays in Fortran

So I have some code that does essentially this:
REAL, DIMENSION(31) :: month_data
INTEGER :: no_days
no_days = get_no_days()
month_data = [fill array with some values]
WRITE(1000,*) (month_data(d), d=1,no_days)
So I have an array with values for each month, in a loop I fill the array with a certain number of values based on how many days there are in that month, then write out the results into a file.
It took me quite some time to wrap my head around the whole 'write out an array in one go' aspect of WRITE, but this seems to work.
However this way, it writes out the numbers in the array like this (example for January, so 31 values):
0.00000 10.0000 20.0000 30.0000 40.0000 50.0000 60.0000
70.0000 80.0000 90.0000 100.000 110.000 120.000 130.000
140.000 150.000 160.000 170.000 180.000 190.000 200.000
210.000 220.000 230.000 240.000 250.000 260.000 270.000
280.000 290.000 300.000
So it prefixes a lot of spaces (presumably to make columns line up even when there are larger values in the array), and it wraps lines to make it not exceed a certain width (I think 128 chars? not sure).
I don't really mind the extra spaces (although they inflate my file sizes considerably, so it would be nice to fix that too...) but the breaking-up-lines screws up my other tooling. I've tried reading several Fortran manuals, but while some of them mention 'output formatting', I have yet to find one that mentions newlines or columns.
So, how do I control how arrays are written out when using the syntax above in Fortran?
(also, while we're at it, how do I control the number of decimal digits? I know these are all integer values so I'd like to leave out any decimals altogether, but I can't change the data type to INTEGER in my code because of reasons).
You probably want something similar to
WRITE(1000,'(31(F6.0,1X))') (month_data(d), d=1,no_days)
Explanation:
The use of * as the format specification is called list directed I/O: it is easy to code, but you are giving away all control over the format to the processor. In order to control the format you need to provide explicit formatting, via a label to a FORMAT statement or via a character variable.
Use the F edit descriptor for real variables in decimal form. Its syntax is Fw.d, where w is the total width of the field (including any sign and the decimal point) and d is the number of digits after the decimal point. F6.0 therefore means a field 6 characters wide with no digits after the decimal point.
Spaces can be added with the X control edit descriptor.
Repetitions of an edit descriptor can be indicated by putting the number of repetitions before it.
Groups can be created with (...), and they can be repeated if preceded by a number of repetitions.
No more items are printed beyond the last provided variable, even if the format specifies how to print more items than the ones actually provided - so you can ask for 31 repetitions even if for some months you will only print data for 30 or 28 days.
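For readers who think more easily in another language, the effect of a format like 31(F6.0,1X) can be approximated in Python. This is only a sketch of the layout, not Fortran's formatting rules: the function name write_row is made up, and the # flag is used so that Python keeps the decimal point the way F6.0 does.

```python
def write_row(values):
    """Put every value in a 6-character field with no decimals, each followed by one blank."""
    # "#6.0f": width 6, zero digits after the point, '#' keeps the point itself.
    return "".join(f"{v:#6.0f} " for v in values)


# Only as many fields are produced as there are values, as with the Fortran format.
row = write_row([0.0, 10.0, 300.0])
print(repr(row))
```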
Besides,
New lines could be added with the / control edit descriptor; e.g., if you wanted to print the data with 10 values per row, you could do
WRITE(1000,'(4(10(F6.0,:,1X),/))') (month_data(d), d=1,no_days)
Note the : control edit descriptor in this second example: it indicates that, if there are no more items to print, nothing else should be printed - not even spaces corresponding to control edit descriptors such as X or /. While it could have been used in the previous example, it is more relevant here, in order to ensure that, if no_days is a multiple of 10, there isn't an empty line after the 3 rows of data.
If you want to completely remove the decimal symbol, you would need to rather print the nearest integers using the nint intrinsic and the Iw (integer) descriptor:
WRITE(1000,'(31(I6,1X))') (nint(month_data(d)), d=1,no_days)
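The two ideas above (a fixed number of values per row, and rounding to integers before printing) can be sketched together in Python. The names nint and rows_of_ten are illustrative. Note that Python's built-in round() rounds halves to even, so the sketch defines its own half-away-from-zero rounding to match what NINT does.

```python
import math


def nint(x):
    """Nearest integer, with halves rounded away from zero."""
    return int(math.copysign(math.floor(abs(x) + 0.5), x))


def rows_of_ten(values, per_row=10):
    """Format values as integers in 6-character fields, per_row values per line."""
    lines = []
    for start in range(0, len(values), per_row):
        chunk = values[start:start + per_row]
        lines.append(" ".join(f"{nint(v):6d}" for v in chunk))
    return lines


month_data = [10.0 * d for d in range(31)]

lines_31 = rows_of_ten(month_data)       # 31 days: three full rows plus one value
lines_30 = rows_of_ten(month_data[:30])  # 30 days: exactly three rows, no empty fourth

print("\n".join(lines_31))
```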

Meaning of 3F7.1 in Fortran data format

I am trying to create an MDM file using HLM 7 Student version, but since I don't have access to SPSS I am trying to import my data using ASCII input. As part of this process I am required to input the data format Fortran style. Try as I might I have not been able to understand this step. Could someone familiar with Fortran (or even better HLM itself) explain to me how this works? Here is my current understanding
From the example EG3.DAT they give
(A4,1X,3F7.1)
I think
A4 signifies that the ID is 4 characters long.
1X means skip a space.
F.1 means that it should read 1 decimal place.
I am very confused about what 3F7 might mean.
EG3.DAT
2020 380.0 40.3 12.5
2040 502.0 83.1 18.6
2180 777.0 96.6 44.4
Below are examples from the help documents.
Rules for format statement
Format statement example
EG1 data format
EG2 data format
EG3 data format
One similar question is Explaining Fortran Write Format. Unfortunately it does not explicitly treat the F descriptor.
3F7.1 means 3 floating point numbers, each occupying 7 characters, each with one digit after the decimal point. Leading characters are blanks.
For reading, the .1 is ignored whenever the field already contains a decimal point, as it does in this file: a floating point number is simply read from those 7 characters.
You guessed the meaning of A4 (string of four characters) and 1X (one blank) correctly.
In Fortran, so-called data edit descriptors (which format the input or output of data) may have repeat specifications.
In the format (A4,1X,3F7.1) the data edit descriptors are A4 and F7.1. Only F7.1 has a repeat specification (the number before the F). This simply means that the format is as though the descriptor appeared repeated: like F7.1, F7.1, F7.1. With a repeat specification of 1, or not given, there is just the single appearance.
The format of the question, then, is like
(A4,1X,F7.1,F7.1,F7.1)
This format is one that is covered by the rules provided in one of the images of the question. In particular, the aspect of repeat specification is given in rule 2 with the corresponding example of rule 3.
Further, in Fortran proper, a repeat count specifier may also be * as special case: that's like an exceptionally large repeat count. *(F7.1) would be like F7.1, F7.1, F7.1, .... I see no indication that this is supported by HLM but if this is needed a very large repeat count may be given instead.
In 1X the 1 isn't a repeat specification but an integral, and necessary, part of the position edit descriptor.
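To make the column arithmetic concrete, here is a small Python sketch that reads one record the way (A4,1X,3F7.1) describes it. It is an illustration only: the function name read_eg3 is made up, and the sample line is rebuilt with fixed-width formatting on the assumption that the real file has each number right-aligned in a 7-character field.

```python
def read_eg3(line):
    """Split one record according to (A4,1X,3F7.1)."""
    ident = line[0:4]            # A4: four characters for the ID
    pos = 4 + 1                  # 1X: skip one column
    values = []
    for _ in range(3):           # 3F7.1: three fields of 7 characters each
        values.append(float(line[pos:pos + 7]))
        pos += 7
    return ident, values


# Assumed layout of the first EG3.DAT record (spacing reconstructed for this example).
line = "2020" + " " + f"{380.0:7.1f}{40.3:7.1f}{12.5:7.1f}"

ident, values = read_eg3(line)
print(repr(line))
print(ident, values)
```

If a record is shifted by even one column, the slices no longer line up with the numbers, which is why the format has to describe the file exactly.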
Procedure for making MDM file from excel for HLM:
-Make sure ALL the characters in ALL the columns line up
Select a column, then right click and select Format Cells
Then click on 'Custom' and go to the 'Type' box and enter the number
of 0s you need to line everything up
-Remove all the tabs from the document and replace them with spaces.
Open the document in word and use find and replace
-To save the document as .dat
First save it as .txt
Then open it in Notepad and save it as .dat
To enter the data format (FORTRAN-Style)
The program wants to read the data file space by space, so you have to specify it perfectly so that it reads the whole set properly.
If something is off, even by a single space, then your descriptive stats will be wonky compared to if you check them in another program.
Enclose the code with brackets ()
Divide the entries with commas ,
-Need ID column for all levels
ID column needs to be sorted so that it is in order from smallest to largest
Use A# with # being the number of characters in the ID
Use 1X to move from the ID to the next column
-Need to say how many characters are needed in each column
Use F# (# = the number of characters needed for that column)
There need to be enough character spaces to provide one 'gap' space between each column
There need to be enough character spaces to allow for the decimal point
As part of the F you need to specify the number of decimal places
You do this by adding a decimal point after the F number and then the number of decimal places you need -F#.#
You can use a number in front of the F so as to 'repeat' it. Not necessary though. -#F#.#
All in all, it should look something like this:
(A4,1X,F4.0,F5.1)
Helpful links:
https://books.google.de/books?id=VdmVtz6Wtc0C&pg=PA78&lpg=PA78&dq=data+format+fortran+style+hlm&source=bl&ots=kURJ6USN5e&sig=fdtsmTGSKFxn04wkxvRc2Vw1l5Q&hl=en&sa=X&ved=0ahUKEwi_yPurjYrYAhWIJuwKHa0uCuAQ6AEIPzAC#v=onepage&q&f=false
http://www.ssicentral.com/hlm/help6/error/Problems_creating_MDM_files.pdf
http://www.ssicentral.com/hlm/help7/faq/FAQ_Format_specifications_for_ASCII_data.pdf

Replacing consecutive embedded blanks with another character in SAS

I'm trying to replace embedded spaces in one of my variables (QPR) with a new character. Here is my (abbreviated) code:
data sas2;
input QPR $ & 1-9;
QPR=tranwrd(strip(QPR)," ","0");
run;
proc print data=sas2;
run;
The tranwrd function seems to work for observations with one embedded blank; however, it does not work when there are two blanks in a row.
For example, 234 2345 becomes 23402345, but 234  345 becomes 234 (i.e., the rest gets cut off, I assume because of strip). Instead, I want 23400345.
I also tried tranwrd without the strip function, but I go from 234  345 to 23400000 instead. Translate does the same thing.
Any ideas on why this won't work and how to fix it? Alternatively, are there easier/better ways to do this in the data step?
The "&" symbol in your input statement causes SAS to stop reading the data after two spaces. After SAS stops reading the data, it pads the rest of the string with spaces up to a total length of 9 chars. This is why you had a bunch of zeros at the end of the string when you didn't use strip. Removing the "&" should fix it.
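The effect described in this answer can be modelled in a few lines of Python. This is a simplified model written for illustration, not how SAS works internally: the function names are made up, and the width of 9 follows the answer above.

```python
def read_with_ampersand(record, width=9):
    """Model of the '&' behaviour: stop at the first two consecutive blanks, then pad."""
    value = record.split("  ", 1)[0]
    return value[:width].ljust(width)


def read_columns(record, width=9):
    """Model of plain column input: take the whole field, padded to the width."""
    return record[:width].ljust(width)


record = "234  345"

# With '&', only "234" is kept; the padding blanks then all turn into zeros.
truncated = read_with_ampersand(record).replace(" ", "0")

# Without '&', the embedded blanks survive; trailing padding is stripped before replacing.
whole = read_columns(record).rstrip().replace(" ", "0")

print(truncated)
print(whole)
```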