How can I sort variables based on part of a string variable? - stata

I have a dataset with string variables and I am trying to generate a new binary variable based on the first two characters. All strings are 5 characters long, but I'm only concerned with the first two in order to sort.
For example, I could have 22001 and 22005. Since both are of the form 22XXX, I want to assign value 1 for both in the variable type_A. And if I have 25001 and 25005, since neither is of the form 22XXX, I want to assign value 0 for both in the variable type_A.

This should do the job:
clear
set obs 4
generate str5 var1 = "22001" in 1
replace var1 = "22005" in 2
replace var1 = "25001" in 3
replace var1 = "25005" in 4
gen type_A = substr(var1, 1, 2) == "22"
Please note that, as you describe your problem, it looks like you are storing 22005 as text, which may not necessarily be the best idea.
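If the codes were held as numbers instead, the same flag could be built with arithmetic. This is only a sketch: the variable name code is made up for illustration, and it assumes every code is a five-digit integer.

```stata
clear
input long code
22001
22005
25001
25005
end
* dividing a five-digit code by 1000 and rounding down leaves its first two digits
generate byte type_A = floor(code / 1000) == 22
list
```

Whether numeric storage is actually preferable depends on the data; codes that can carry leading zeros are usually safer as strings.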

Related

Create variables by multiplying variables that have the same name suffix

I have a Stata dataset that looks like this:
stock8201  stock8202  stock8203  immigrantshare8201  immigrantshare8202  immigrantshare8203
      123         24         21           0.0004696           0.0001165           0.0016181
      123         24         21           0.0004696           0.0001165           0.0016181
   123243         24         21           0.0004696           0.0001165           0.0016181
I want a command that creates three new variables, each the product of a stock variable and the immigrantshare variable with the same suffix: stock8201 times immigrantshare8201, and likewise for the others. The table I want at the end would look something like this:
 Predi8201  Predi8202  Predi8203
 0.0577608   0.002796  0.0339801
 0.0577608   0.002796  0.0339801
57.8749128   0.002796  0.0339801
For instance, Predi8201 is equal to stock8201*immigrantshare8201.
forval j = 1/3 {
    gen Predi820`j' = stock820`j' * immigrantshare820`j'
}
For a larger set of variables, you might want something like
foreach v of var stock* {
    local suffix : subinstr local v "stock" ""
    gen Predi`suffix' = `v' * immigrantshare`suffix'
}
Your question hints that you are holding data for different months (January 1982, February 1982, ...) in a wide layout. In Stata most things are easier in a long layout, which usually calls for reshape long.
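A minimal sketch of the long-layout route, under some assumptions: id and period are names invented here (they must not already exist in the dataset), and the numeric suffixes such as 8201 become the values of period.

```stata
* id is a hypothetical row identifier; reshape needs one to identify observations
generate long id = _n
* stack stock8201, stock8202, ... and immigrantshare8201, ... into two long variables
reshape long stock immigrantshare, i(id) j(period)
* a single product variable then replaces the whole set of Predi* variables
generate Predi = stock * immigrantshare
```

If the wide layout is needed again afterwards, reshape wide should undo the restructuring.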

Is there a SAS function to delete negative and missing values from a variable in a dataset?

Variable name is PRC. This is what I have so far. First block to delete negative values. Second block is to delete missing values.
data work.crspselected;
set work.crspraw;
where crspyear=2016;
if (PRC < 0)
then delete;
where ticker = 'SKYW';
run;
data work.crspselected;
set work.crspraw;
where ticker = 'SKYW';
where crspyear=2016;
where=(PRC ne .) ;
run;
Instead of using a function to remove negative and missing values, it can be done more simply when inputting or outputting the data. It can also be done with only one data step:
data work.crspselected;
set work.crspraw(where = (PRC >= 0 & PRC ^= .)); * drop rows where PRC is negative or missing;
* subsetting IF is used here because a WHERE statement is ignored when where= is set on the same dataset;
if crspyear = 2016 and ticker = 'SKYW';
run;
The section that does it is:
(where = (PRC >= 0 & PRC ^= .))
Which can be done for either the input dataset (work.crspraw) or the output dataset (work.crspselected).
If you must use a function, missing() returns true only for missing values, as per this answer. Hence ^missing() does the opposite and keeps only non-missing values. There is no function for non-negative values, but I think it's easier and quicker to handle both conditions together without a function.
You don't need more than your first test to remove negative and missing values. SAS treats all 28 missing values (., ._, .A ... .Z) as less than any actual number.

Loop through a set of variables based on condition in another variable

I have a list of variables a_23 a_24_1 a_24_2 a_24_3 a_24_4 a_24_5 a_24_6 a_24_7 a_24_8.
The values in variables a_24* are based on the response in a_23.
If a_23==1, then at least one variable in a_24* must be equal to 1.
I therefore want to check if any of the variables a_24* does not contain the value 1 if a_23==1
I tried the loop below,
foreach var of varlist a_24_1* {
br a_23 a_24* if a_23==1 & `var' != 1
}
but it returns all the variables that do not contain 1 in the set of variables. However, I only need cases where all variables do not contain the value 1 if the determining variable is equal to 1.
A data example as well as code would be a good idea, so that you then base your question on an MCVE: see https://stackoverflow.com/help/mcve for explanation.
As I understand it an intermediate variable would help here:
egen maxa_24 = rowmax(a_24_*)
as the maximum will be 0 if and only if all values are 0 (assuming the a_24_* variables are coded 0 or 1).
Note that your loop
foreach var of varlist a_24_1* {
br a_23 a_24* if a_23 == 1 & `var' != 1
}
is a loop over the single variable a_24_1, because the wildcard a_24_1* matches only that name; presumably you mean a_24_* in the foreach line.
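Another way to do the check without a loop, which does not depend on how the non-1 values are coded, is egen's anymatch() function. This is a sketch only; any1 is a name invented for illustration.

```stata
* any1 is 1 if at least one of a_24_* equals 1 in that row, and 0 otherwise
egen any1 = anymatch(a_24_*), values(1)
* rows that break the rule: a_23 is 1 but none of the a_24_* variables is 1
list a_23 a_24_* if a_23 == 1 & !any1
```

Replacing list with browse or count gives the same selection in the Data Browser or as a simple tally.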

how to extract last 4 characters of the string in SAS

I am a bit stuck: I am not able to extract the last 4 characters of the string. When I write:
indikan=substr(Indikation,length(Indikation)-3,4);
it gives an "invalid argument" error.
How can I do this?
This code works:
data temp;
indikation = "Idontknow";
run;
data temp;
set temp;
indikan = substrn(indikation,max(1,length(indikation)-3),4);
run;
Can you provide more context on the variable? If indikation is length 3 or smaller, then I could see this erroring; or, if it were numeric, it may cause issues because SAS right-justifies the numbers (http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000245907.htm).
If it's likely to be under four characters in some cases, I would recommend adding max:
indikan = substrn(indikation,max(1,length(indikation)-3),4);
I've also added substrn as Rob suggests given it better handles a not-long-enough string.
Or one could use the reverse function twice, like this:
data _null_;
my_string = "Fri Apr 22 13:52:55 +0000 2016";
_day = substr(my_string, 9, 2);
_month = lowcase(substr(my_string, 5, 3));
* Check the _year out;
_year = reverse(substr(reverse(trim(my_string)), 1, 4));
created_at = input(compress(_day || _month || _year), date9.);
put my_string=;
put created_at=weekdatx29.;
run;
Wrong results might be caused by trailing blanks, so strip or trim your string before you apply substr:
indikan=substr(strip(Indikation),length(strip(Indikation))-3);
This should give you the last 4 characters.
Or you can try this approach, which, while initially a bit less intuitive, is stable, shorter, uses fewer functions, and works with numeric and text values:
indikan = prxchange("s/.*(.{4}$)/$1/",1,indikation);
data temp;
input trt$;
cards;
treat123
treat121
treat21
treat1
treat1
trea2
;run;
data abc;
set temp;
b=substr(trt,length(trt)-3);
run;

Stata factor value from label

I would like to look up a value/code associated with a label, and store that value in a scalar or local macro. While the information I want is stored in the definition of the label vector, apparently I need to go through some contortions to get it.
Extending Roberto Ferrer's answer to my last question, I came up with this approach:
// sample data
clear
input str5 mystr int mynum
a 5
b 5
b 6
c 4
end
encode mystr, gen(myfactor)
// get code for "b"
gen tmp = 0
replace tmp = myfactor if myfactor == "b":myfactor
sort tmp
scalar bcode = tmp[_N]
This seems woefully inefficient in terms of data manipulation and code maintenance, especially considering how the information I want is already saved (and viewable with label list).
This uses labellist, from SSC. Download using ssc install labellist.
clear
set more off
*----- example data -----
input str7 mystr
"good"
"bad"
"bad"
"regular"
end
encode mystr, gen(myfactor)
*----- what you want -----
labellist
local faclab = r(myfactor_labels)
local facval = r(myfactor_values)
// get # for "good"
local i : list posof "good" in faclab
local j : word `i' of `facval'
display "`j'"
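For what it's worth, the "label":labelname expression already used in the question can be evaluated directly, with no temporary variable and no add-on command. The sketch below assumes, as in the examples above, that encode named the value label myfactor after the new variable; goodcode is a name invented for illustration.

```stata
clear
input str7 mystr
"good"
"bad"
"bad"
"regular"
end
encode mystr, gen(myfactor)
* "good":myfactor evaluates to the numeric code that the label myfactor assigns to "good"
scalar goodcode = "good":myfactor
local goodcode = "good":myfactor
display goodcode
display `goodcode'
```

This relies on knowing the label name in advance; labellist remains handy when the labels have to be discovered first.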