I am trying to identify values that are not integers in Stata. My dataset is the following:
var1 var2 var3
1 2 3
2 4 5
3 6 7
4 2 3
5 1 1
6 2 8
My code is the following:
foreach var in var1 var2 var3 {
    gen flag_`var' = 1 if format(`var') == %int
    replace flag_`var' = 0 if flag_`var' ==.
}
I am getting an error message stating
unknown function format()
I also tried replacing the parentheses around format(`var') with format[`var'] but then I got an error stating format not found. Is there something wrong with the format I am using or is there a better way to identify non-integer values?
The first answer is what Stata told you: there is no format() function.
But a deeper answer is that thinking of (display) formats is the wrong way round for this question. A display format is in essence an instruction to show data in a certain way and has nothing to do with its stored value, or to be more precise the decimal equivalent of its stored value. Thus 42 displayed with format %4.3f is shown as 42.000 while 6.789 displayed with format %1.0f is shown as 7. Otherwise put, no value has an inherent format, but a display format is used to display a value, either by default or because a user specified a format. Stata is here just using the same broad ideas as say C and various C-like languages.
Nothing to do with its stored value is a slight exaggeration, as only numeric formats make sense for numbers and only string formats make sense for strings, but display format has nothing to do with whether a stored value is integer.
Further, %int is not a display format anyway. When formats are checked for, they must be literal strings enclosed in "".
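The separation between a stored value and its display format can be sketched outside Stata. This is only a Python illustration, not Stata code: the helper name `show` is made up for the example, and Python's format specifications correspond only roughly to Stata's %w.df formats.

```python
def show(value: float, decimals: int) -> str:
    """Render value with a fixed number of decimals; the value itself is untouched."""
    return format(value, f".{decimals}f")


x = 42
y = 6.789

print(show(x, 3))  # the stored 42, displayed with three decimals
print(show(y, 0))  # the stored 6.789, displayed with no decimals
print(y)           # the stored value has not changed
```

Formatting returns a new string each time, so there is nothing about the number that could be inspected to learn "its" format.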
To show non-integers various methods could be used, say using rounding functions such as round(), int(), floor() or ceil(). So an indicator for whether x is integer could be
gen is_int_x = x == floor(x)
All the values in your data example are integers anyway, but I take it that you are looking for non-integers elsewhere.
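For readers more used to other languages, the same floor-based test can be sketched in Python. The function name `is_integer_valued` and the sample values are assumptions made for this illustration only.

```python
import math


def is_integer_valued(x: float) -> bool:
    """True when x has no fractional part, i.e. x equals its own floor."""
    return x == math.floor(x)


values = [1, 2.5, 3.0, -4.75]
# 1 marks integer values and 0 marks non-integers, like the Stata indicator
flags = [int(is_integer_valued(v)) for v in values]
print(flags)
```

The comparison works on the stored value directly, so no display format is involved at any point.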
I am trying to write a command that ereturns a scalar that is percentage rounded to 2 decimal places. The percentage can be negative or positive, with unknown number of digits before the decimal point.
Here's MRE that shows the problem I am having.
#delimit;
capture program drop my_note;
program my_note, eclass;
local my_x: display %-9.2f 92.23999999999999;
ereturn scalar my_x = `my_x';
end;
ereturn clear;
my_note;
ereturn list;
display %-9.2f 92.23999999999999;
display 92.23999999999999;
I am puzzled why display seems to do the right thing (turn 92.23999999999999 to 92.24, though regardless of format), but that e(my_x) does not seem to inherit that format.
When you create your scalar you copy the value of the local `my_x'. That value is still 92.23999999999999 as : display is not changing the underlying data, only how it is displayed. Think of it as the data in 5.00e+2 and 500 is the same, it is just how that value is shown that differs.
You need to use strings to work with how the value is displayed. However, there are two issues with strings in your code example.
While scalars normally can hold both strings and numeric values, returned scalars cannot hold strings (don't ask me why). Would it be possible to return a local instead?
In %-9.2f you specify a display format that is 9 characters wide, so your result will be e(my_x) : "92.24    " (padded with trailing spaces). You can adjust %-9.2f, but since you are now working with strings you can remove the excess spaces using the trim() function.
Try the code below and see if that works given the context of this function. If not tell us more about what you are about to do.
#delimit;
capture program drop my_note;
program my_note, eclass;
local my_x: display %-9.2f 92.23999999999999;
ereturn local my_x = trim("`my_x'");
end;
ereturn clear;
my_note;
ereturn list;
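The padding and trimming described above can be sketched in Python. This is an analogy rather than Stata code: the format spec `<9.2f` is my assumed counterpart of %-9.2f, and `strip()` stands in for trim().

```python
value = 92.23999999999999

padded = f"{value:<9.2f}"  # left-aligned in a field 9 characters wide
trimmed = padded.strip()   # drop the padding, as trim() does in Stata

print(repr(padded))
print(repr(trimmed))
```

The formatted result is a string, which is why it has to be returned as a string (a local in the Stata program) rather than as a numeric scalar.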
I'm using an If-statement to assign integers to strings from another cell. This seems to be working, but if I reference these columns, I'm getting a NaN value. This is my formula below. I tried adding INT() around the output values, but that seemed to break everything. Am I missing something?
IF(FIND('1',{Functional response}),-4,
IF(FIND('2',{Functional response}),-2,
IF(FIND('3',{Functional response}),0,
IF(FIND('4',{Functional response}),2,
IF(FIND('5',{Functional response}),4,"")))))
Assuming Functional response can only store a number 1 to 5 as a string, a simple option in Excel would be to first convert the string to a number and then use the CHOOSE function to assign a value. This works because the numbers are sequential integers. Assuming cell K2 holds the value of Functional response, your formula could be any of:
=CHOOSE(--K2,-4,-2,0,2,4)
=CHOOSE(K2+0,-4,-2,0,2,4)
=CHOOSE(K2-0,-4,-2,0,2,4)
=CHOOSE(K2*1,-4,-2,0,2,4)
=CHOOSE(K2/1,-4,-2,0,2,4)
Basically, sending the string of a pure number through a math operation makes Excel convert it to a number. By sending it through a math operation that does not change its value, you get the string as a number.
CHOOSE is like a sequential IF function. Supply it with an integer as the first argument and it will return the value at that position in the subsequent list. If the number you supply is greater than the number of options, you will get an error.
Alternatively, you could just do a straight math conversion on the number stored as a string in K2 using the following formula:
=(K2-3)*2
And as my final option, you could build a table and use VLOOKUP or INDEX/MATCH.
NOTE: If B2:B6 was stored as strings instead of numbers, K2 instead of --K2 would need to be used.
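To check that the CHOOSE approach and the arithmetic shortcut agree, here is a small Python sketch. The function names `choose`, `score_choose` and `score_linear` are hypothetical, and `choose` only imitates the 1-based lookup described above, not Excel's full behaviour.

```python
def choose(index: int, *options):
    """1-based pick from options; an out-of-range index is an error, as with CHOOSE."""
    if not 1 <= index <= len(options):
        raise ValueError(f"index {index} is outside 1..{len(options)}")
    return options[index - 1]


def score_choose(response: str) -> int:
    # int() plays the role of --K2: turn the numeric string into a number
    return choose(int(response), -4, -2, 0, 2, 4)


def score_linear(response: str) -> int:
    # same mapping as =(K2-3)*2
    return (int(response) - 3) * 2


for r in ["1", "2", "3", "4", "5"]:
    print(r, score_choose(r), score_linear(r))
```

The two functions give the same score for every response from 1 to 5, which is why the shorter arithmetic form is enough when the outputs are evenly spaced.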
So I have some code that does essentially this:
REAL, DIMENSION(31) :: month_data
INTEGER :: no_days
no_days = get_no_days()
month_data = [fill array with some values]
WRITE(1000,*) (month_data(d), d=1,no_days)
So I have an array with values for each month, in a loop I fill the array with a certain number of values based on how many days there are in that month, then write out the results into a file.
It took me quite some time to wrap my head around the whole 'write out an array in one go' aspect of WRITE, but this seems to work.
However this way, it writes out the numbers in the array like this (example for January, so 31 values):
0.00000 10.0000 20.0000 30.0000 40.0000 50.0000 60.0000
70.0000 80.0000 90.0000 100.000 110.000 120.000 130.000
140.000 150.000 160.000 170.000 180.000 190.000 200.000
210.000 220.000 230.000 240.000 250.000 260.000 270.000
280.000 290.000 300.000
So it prefixes a lot of spaces (presumably to make columns line up even when there are larger values in the array), and it wraps lines to make it not exceed a certain width (I think 128 chars? not sure).
I don't really mind the extra spaces (although they inflate my file sizes considerably, so it would be nice to fix that too...) but the breaking-up-lines screws up my other tooling. I've tried reading several Fortran manuals, but while some of them mention 'output formatting', I have yet to find one that mentions newlines or columns.
So, how do I control how arrays are written out when using the syntax above in Fortran?
(also, while we're at it, how do I control the number of decimal digits? I know these are all integer values so I'd like to leave out any decimals altogether, but I can't change the data type to INTEGER in my code because of reasons).
You probably want something similar to
WRITE(1000,'(31(F6.0,1X))') (month_data(d), d=1,no_days)
Explanation:
The use of * as the format specification is called list directed I/O: it is easy to code, but you are giving away all control over the format to the processor. In order to control the format you need to provide explicit formatting, via a label to a FORMAT statement or via a character variable.
Use the F edit descriptor for real variables in decimal form. Its syntax is Fw.d, where w is the total width of the field (counting any sign and the decimal point) and d is the number of digits after the decimal point. F6.0 therefore means a field 6 characters wide with no digits after the decimal point.
Spaces can be added with the X control edit descriptor.
Repetitions of edit descriptors can be indicated with the number of repetitions before a symbol.
Groups can be created with (...), and they can be repeated if preceded by a number of repetitions.
No more items are printed beyond the last provided variable, even if the format specifies how to print more items than the ones actually provided - so you can ask for 31 repetitions even if for some months you will only print data for 30 or 28 days.
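The layout produced by a repeated (F6.0,1X) group can be imitated in Python to see what ends up on the line. This is a rough sketch, not Fortran: the function name `format_record` is hypothetical, and Python's `#6.0f` spec is my assumed stand-in for F6.0 (the `#` keeps the decimal point even when no digits follow it).

```python
def format_record(values, width=6):
    """Put every value on one line: fixed-width fields separated by one space."""
    fields = [f"{v:#{width}.0f}" for v in values]
    return " ".join(fields)


month_data = [10.0 * d for d in range(31)]
no_days = 28  # a month shorter than the 31 repetitions the format allows

line = format_record(month_data[:no_days])
print(line)
```

Only as many fields appear as there are items supplied, and nothing wraps, which is the behaviour the explicit format gives you in place of list-directed output.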
Besides,
New lines could be added with the / control edit descriptor; e.g., if you wanted to print the data with 10 values per row, you could do
WRITE(1000,'(4(10(F6.0,:,1X),/))') (month_data(d), d=1,no_days)
Note the : control edit descriptor in this second example: it indicates that, if there are no more items to print, nothing else should be printed - not even spaces corresponding to control edit descriptors such as X or /. While it could have been used in the previous example, it is more relevant here, in order to ensure that, if no_days is a multiple of 10, there isn't an empty line after the 3 rows of data.
If you want to completely remove the decimal symbol, you would need to rather print the nearest integers using the nint intrinsic and the Iw (integer) descriptor:
WRITE(1000,'(31(I6,1X))') (nint(month_data(d)), d=1,no_days)
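The combination of rows of 10 and rounding to the nearest integer can be sketched in Python as follows. The names `nint` and `format_rows` are written for this example; `nint` here rounds halves away from zero, which is what I understand the Fortran intrinsic to do, whereas Python's built-in round() rounds halves to even.

```python
import math


def nint(x: float) -> int:
    """Nearest integer, with halves rounded away from zero."""
    magnitude = abs(x)
    whole = math.floor(magnitude)
    if magnitude - whole >= 0.5:
        whole += 1
    return whole if x >= 0 else -whole


def format_rows(values, per_row=10, width=6):
    """Integer fields of fixed width, per_row values per line, no trailing blank line."""
    rows = []
    for start in range(0, len(values), per_row):
        chunk = values[start:start + per_row]
        rows.append(" ".join(f"{nint(v):{width}d}" for v in chunk))
    return "\n".join(rows)


month_data = [10.0 * d for d in range(31)]
print(format_rows(month_data))      # 31 days
print(format_rows(month_data[:30])) # 30 days
```

Joining rows only between chunks mirrors the role of the : descriptor: a month whose length is a multiple of 10 does not pick up an extra empty line.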
Variable X used to be string. So I used encode command to make it non-string.
But after that when I sort it, it's sorted in this way.
1000
10000
10001
10003
10005
1003
But usually, it should be sorted like
1000
1001
1003
1005
Why is sorting so strange after doing encode?
And it appears 1003 created from encode and 1003 in using dataset are considered different numbers.
Not strange at all. Right near the top of help encode Stata tells you "Do not use encode if varname contains numbers that merely happen to be stored as strings".
encode maps strings in alphabetical (here alphanumeric) order to numeric values 1 up (unless you specify otherwise with a label() option).
So "1000" will sort before "10000" before "1001", and so forth.
You probably need destring but why was the variable read as string? That's what you need to worry about.
encode is for strings when you want a numeric equivalent. So "cat" "dog" "frog" "toad" will map to 1 2 3 4 and the string values will become value labels.
destring is for mistaken strings. The variable should be numeric, but something went wrong on reading the data. So, what was it that went wrong? Common errors include
Header data from a spreadsheet that should be a variable label (or ignored) got read in as data.
Codes for missing data such as NA that make sense to people or to some other program but do not correspond to Stata representations of missing.
Garbage of some kind.
To check for problems, you could look at the values that wouldn't translate to numbers:
tab whatever if missing(real(whatever))
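The difference between the two commands can be imitated in a few lines of Python. This is an analogy under stated assumptions, not Stata: the dictionary `codes` mimics how encode numbers the distinct strings in sorted order, `to_number` mimics real() returning missing for non-numeric text, and the sample values (including "NA") are made up.

```python
raw = ["1000", "10000", "10001", "10003", "10005", "1003", "NA"]

# encode-like: codes 1, 2, 3, ... assigned in alphanumeric order of the strings
codes = {s: i for i, s in enumerate(sorted(set(raw)), start=1)}


def to_number(s: str):
    """destring-like: the numeric value, or None when the text is not a number."""
    try:
        return float(s)
    except ValueError:
        return None


numbers = sorted(n for n in map(to_number, raw) if n is not None)
problems = [s for s in raw if to_number(s) is None]

print(codes)     # "1003" gets code 6, not the value 1003
print(numbers)   # genuine numeric order
print(problems)  # values that would not translate to numbers
```

The codes are only positions in the sorted list of strings, which is why an encoded "1003" is not equal to the number 1003 in another dataset.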
Consider the following program and output:
data _null_;
input a;
length b $64;
do i = 1 to 64;
fmtname = cats('binary',i);
b = cats(putn(a,fmtname));
put i= b=;
end;
cards;
1
;
run;
Output (SAS 9.1.3, Windows 7 x64):
i=1 b=1
i=2 b=01
i=3 b=001
i=4 b=0001
i=5 b=00001
/*Skipped a few very similar lines*/
i=58 b=0000000000000000000000000000000000000000000000000000000001
i=59 b=11111110000000000000000000000000000000000000000000000000000
i=60 b=111111110000000000000000000000000000000000000000000000000000
i=61 b=1111111110000000000000000000000000000000000000000000000000000
i=62 b=11111111110000000000000000000000000000000000000000000000000000
i=63 b=011111111110000000000000000000000000000000000000000000000000000
i=64 b=0011111111110000000000000000000000000000000000000000000000000000
Last few lines of output from SAS 9.4 on Linux x64:
i=60 b=000000000000000000000000000000000000000000000000000000000001
i=61 b=1111111110000000000000000000000000000000000000000000000000000
i=62 b=11111111110000000000000000000000000000000000000000000000000000
i=63 b=011111111110000000000000000000000000000000000000000000000000000
i=64 b=0011111111110000000000000000000000000000000000000000000000000000
This behaviour is rather unexpected, to me at least, and doesn't seem to be documented on the help page. It agrees with the document I found here for width 64 - standard double precision - but I don't understand why it flips over at width 59.
I don't quite get the same result - mine switches at 61 - but I believe the answer is the same.
Up to some point - 58, 60, somewhere around there - SAS is showing you the fixed-point integer representation of the number. Test this with a decimal, like so:
data _null_;
a=3.14159265358979323846264338327950288419716939937510582;
length b $64;
put a= hex4.;
put a= hex8.;
put a= hex16.;
do i = 1 to 64;
fmtname = cats('binary',i);
b = cats(putn(a,fmtname));
put i= b=;
end;
run;
And you will get a sort-of-surprising result - you see 000...0011 for most of your rows, up through 60. The documentation doesn't explicitly mention this, but it does show it in the example (123.45 and 123 are identical in binary8.).
Then starting at 61, or 59 for you I'm guessing, you see the actual representation of the number as SAS internally stores it (or, arguably, how Intel internally stores it).
The binary documentation doesn't explain this well, but the HEX. documentation does explain it pretty clearly in a tip:
If w< 16, the HEXw. format converts real binary numbers to fixed-point integers before writing them as hexadecimal characters. It also writes negative numbers in two's complement notation, and right aligns digits. If w is 16, HEXw. displays floating-point values in their hexadecimal form.
Binary is doing the same, and on my machine it happens right at the point HEX would also make the change - at 15x4=60. And HEX. shows the same - notice below; hex4. and hex8. show a different result than hex16..
To be clear, the value shown at binary64. is correct, and not any sort of truncation (though 61-63, and in your example 59-60, are left-truncated).
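The two kinds of output can be reproduced in Python to see where each pattern comes from. This is a sketch under assumptions, not SAS: `float_bits` shows the IEEE 754 double bit pattern, and `integer_bits` assumes the conversion to a fixed-point integer simply drops the fractional part, keeping the rightmost `width` bits.

```python
import struct


def float_bits(x: float) -> str:
    """The 64-bit IEEE 754 pattern of x, most significant bit first."""
    (as_int,) = struct.unpack(">Q", struct.pack(">d", x))
    return format(as_int, "064b")


def integer_bits(x: float, width: int) -> str:
    """Bits of the integer part of x (assumed non-negative), right-aligned in width."""
    return format(int(x), f"0{width}b")[-width:]


print(integer_bits(1, 5))      # like the small binaryw. widths
print(integer_bits(3.14, 8))   # the fractional part does not show up
print(float_bits(1.0))         # like binary64.
print(float_bits(1.0)[-59:])   # left-truncated to 59 characters
```

The last line reproduces the seven leading ones seen at width 59 in the question's output, which is consistent with the floating-point pattern being cut off on the left.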
I did find a SAS usage note regarding this, though it's clearly out of date based on our tests:
Beginning with SAS® Version 7, the BINARYw. format was changed to be more consistent with the HEXw. format. When the HEXw. format uses a width of 16, (corresponding to 8 bytes of data), it produces a hexadecimal representation of the floating point value. The BINARYw. format changed so that widths of 57-64 produce a binary representation of the floating point value, since widths of 57-64 correspond to 8 bytes of data.
It also contains a suggestion for how to get consistent results for integers, which may be of use.
BIN_64=PUT(PUT(VALUE,S370FIB8.),$BINARY64.);
S370FIB8. is a format that converts numbers to their fixed integer binary representation, in IBM Mainframe format. (I.e., it writes the integer in Big-Endian format, which is not what you'd get on an Intel machine.)
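What that two-step conversion produces can be approximated in Python. I am assuming here, from the description above, that S370FIB8. amounts to an 8-byte signed big-endian integer; the function name `fixed_integer_bits` is hypothetical.

```python
def fixed_integer_bits(value: int, byteorder: str = "big") -> str:
    """The 64 bits of value stored as an 8-byte signed integer, byte by byte."""
    data = value.to_bytes(8, byteorder, signed=True)
    return "".join(f"{byte:08b}" for byte in data)


print(fixed_integer_bits(1))            # big-endian: the 1 is the last bit
print(fixed_integer_bits(1, "little"))  # little-endian: the low byte comes first
```

With big-endian byte order the bit string reads like an ordinary binary number, which is presumably why the usage note goes through the mainframe format.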