I am trying to run 3 arrays in my SAS code and input each value into the variables in the last array, however, each time I run this code, it only populates the CREVASC_age column. Please let me know any thoughts on how to populate each age variable, in the third array, using the matching variables in the other arrays.
data Outc_adjust3; set Outc_adjust2;
ARRAY outcvars{3} CABG MI CREVASC;
ARRAY outcdys{3} CABGDY MIDY CREVASCDY;
ARRAY outc_age{3} CABG_age MI_age CREVASC_age;
Do I= 1 to 3;
if outcvars{3} = 1 then outc_age{3} = ageatenroll + (outcdys(3)/365.25);
else if outcvars{3} = 0 then do;
if EXTFLAG = 0 AND EXT2FLAG = 0 then outc_age{3} = AGE_WHIENDFU;
if EXTFLAG = 1 AND EXT2FLAG = 0 then outc_age{3} = AGE_EXT1ENDFU;
if EXT2FLAG = 1 AND EXT2MRC = 1 then outc_age{3} = age_endfu;
if EXT2FLAG = 1 AND EXT2SRC = 1 then outc_age{3} = age_ext1endfu;
end;
end;
run;
You hard coded {3} when you needed {I}. That is why only the third variable of the array was being dealt with.
Change the {3} to {I} for all the array indexed references inside the loop.
Related
I created a SAS function using fcmp to calculate the jaccard distance between two strings. I do not want to use macros, as I'm going to use it through a large dataset for multiples variables. the substrings I have are missing others.
proc fcmp outlib=work.functions.func;
function distance_jaccard(string1 $, string2 $);
n = length(string1);
m = length(string2);
ngrams1 = "";
do i = 1 to (n-1);
ngrams1 = cats(ngrams1, substr(string1, i, 2) || '*');
end;
/*ngrams1= ngrams1||'*';*/
put ngrams1=;
ngrams2 = "";
do j = 1 to (m-1);
ngrams2 = cats(ngrams2, substr(string2, j, 2) || '*');
end;
endsub;
options cmplib=(work.functions);
data test;
string1 = "joubrel";
string2 = "farjoubrel";
jaccard_distance = distance_jaccard(string1, string2);
run;
I expected ngrams1 and ngrams2 to contain all the substrings of length 2 instead I got this
ngrams1=jo*ou*ub
ngrams2=fa*ar*rj
If you want real help with your algorithm you need to explain in words what you want to do.
I suspect your problem is that you never defined how long you new character variables NGRAM1 and NGRAM2 should be. From the output you show it appears that FCMP defaulted them to length $8.
To define a variable you need use a LENGTH statement (or an ATTRIB statement with the LENGTH= option) before you start referencing the variable.
I have a dataframe that looks like the following. The rightmost two columns are my desired columns:
Open Close open_to_close close_to_next_open open_desired close_desired
0 0 0 3 0 0
0 0 4 8 3 7
0 0 1 1 15 16
The calculations are as the following:
open_desired = close_desired(prior row) + close_to_next_open(prior row)
close_desired = open_desired + open_to_close
How do I implement the following in a loop manner? I am trying to do this until the last row.
df = pd.DataFrame({'open': [0,0,0], 'close': [0,0,0], 'open_to_close': [0,4,1], 'close_to_next_open': [3,8,1]})
df['close_desired'] = 0
df['open_desired'] = 0
##First step is to create open_desired in current row which is dependent on close_desired in previous row
df['open_desired'] = df['close_desired'].shift() + df['close_to_next_open'].shift()
##second step is to create close_desired in current row which is dependent on open_desired in current row
df['close_desired'] = df['open_desired'] + df['open_to_close']
df.fillna(0,inplace=True)
The only way I can think of doing this is with iterrows()
for row, v in df.iterrows():
if row>0:
df.loc[row,'open_desired'] = df.shift(1).loc[row, 'close_desired'] + df.shift(1).loc[row, 'close_to_next_open']
df.loc[row,'close_desired'] = df.loc[row, 'open_desired'] + df.loc[row, 'open_to_close']
It's the variable number of lists that is confusing me. It's easy to (pseudo code)...
for i=1 to list1_size
for ii=1 to list2_size
for iii=1 to list3_size
results_list.add(list1[i]+list2[ii]+list3[iii])
... but I could do with a nudge in the right direction regarding how to do this with a varying number of lists.
I've started with getting a table of the number of entries in each list to work through and it does seem like a bit of recursion is required but from there I'm drawing a blank.
edit: To clarify, what I am looking for is...
Input: A varying number of lists/tables with a varying numbers of entries.
{"A","B","C"} {"1","2","3","4"} {"*","%"}
Output:
{"A1*", "A1%", "A2*", "A2%", "A3*", "A3%", "A4*", "A4%",
"B1*", "B1%", "B2*", "B2%", "B3*", "B3%", "B4*", "B4%",
"C1*", "C1%", "C2*", "C2%", "C3*", "C3%", "C4*", "C4%"}
local list_of_lists = {{"A","B","C"}, {"1","2","3","4"}, {"*","%"}}
local positions = {}
for i = 1, #list_of_lists do
positions[i] = 1
end
local function increase_positions()
for i = #list_of_lists, 1, -1 do
positions[i] = positions[i] % #list_of_lists[i] + 1
if positions[i] > 1 then return true end
end
end
local function get_concatenated_selection()
local selection = {}
for i = 1, #list_of_lists do
selection[i] = list_of_lists[i][positions[i]]
end
return table.concat(selection)
end
local result = {}
repeat
table.insert(result, get_concatenated_selection())
until not increase_positions()
I am trying to script a dynamic way way to only take the first two elements in a list and I am having some trouble. Below is a breakdown of what I have in my List
Declaration:
Set List = CreateObject("Scripting.Dictionary")
List Contents:
List(0) = 0-0-0-0
List(1) = 0-1-0-0
List(2) = 0-2-0-0
Code so far:
for count = 0 To UBound(List) -1 step 1
//not sure how to return
next
What I currently have does not work.
Desired Return List:
0-0-0-0
0-1-0-0
You need to use the Items method of the Dictionary. For more info see here
For example:
Dim a, i
a = List.Items
For i = 0 To List.Count - 1
MsgBox(a(i))
Next i
or if you just want the first 2:
For i = 0 To 1
MsgBox(a(i))
Next i
UBound() is for arrays, not dictionaries. You need to use the Count property of the Dictionary object.
' Show all dictionary items...
For i = 0 To List.Count - 1
MsgBox List(i)
Next
' Show the first two dictionary items...
For i = 0 To 1
MsgBox List(i)
Next
Sorry that title is confusing. Hopefully it's clear below.
I'm using Stata and I'd like to assign the value 1 to a variable that depends on the value within a different variable. I have 20 order variables and also 20 corresponding variables. For example if order1 = 3, I'd like to assign variable3 = 1. Below is a snippet of what the final dataset would look like if I had only 3 of each variable.
Right now I'm doing this with two loops but I have to another loop around this that goes through this 9 more times plus I'd doing this for a couple hundred data files. I'd like to make it more efficient.
forvalues i = 1/20 {
forvalues j = 1/20 {
replace variable`j' = 1 if order`i'==`j'
}
}
Is it possible to use the value of order'i' to assign the variable[order`i'VALUE] directly? Then I can get rid of the j loop above. Something like this.
forvalues i = 1/20 {
replace variable[`order`i'value] = 1
}
Thanks for your help!
***** CLARIFICATION ADDED Feb 2nd.**
I simplified my problem and the dataset too much bc the solutions suggested work for what I presented but, are not getting at what I'm really attempting to do. Thank you three for your solutions though. I was not clear enough in my post.
To clarify, my data doesn't have a one to one correspondence of each order# assigning variable# a 1 if it's not missing. For example, the first observation for order1=3, variable1 isn't supposed to get a 1, variable3 should get a 1. What I didn't include in my original post is that I'm actually checking for other conditions to set it equal to 1.
For more background, I'm counting up births of women by birth order(1st child, 2nd child, etc) that occurred at different ages of mothers. So in the data, each row is a woman, each order# is the number birth (order1=3, it's her third child). The corresponding variable#s are the counts (variable# means the woman has a child of birth order #). I mentioned in the post, that I do this 9 times bc I'm doing it for 5 year age groups (15-19; 20-24; etc). So the first set of variable# would be counts of birth by order when women were ages 15-19; the second set of variable# would be counts of births by order when women were 20-24. etc etc. After this, I sum up the counts in different ways (by woman's education, geography, etc).
So with the additional loop what I do is something more like this
forvalues k = 1/9{
forvalues i = 1/20 {
forvalues j = 1/20 {
replace variable`k'_`j' = 1 if order`i'==`j' & age`i'==`k' & birth_age`i'<36
}
}
}
Not sure if it's possible, but I wanted to simplify so I only need to cycle through each child once, without cycling through the birth orders and directly use the value in order# to assign a 1 to the correct variable. So if order1=3 and the woman had the child at the specific age group, assign variable[agegroup][3]=1; if order1=2, then variable[agegroup][2] should get a 1.
forvalues k=1/9{
forvalues i = 1/20 {
replace variable`k'_[`order`i'value] = 1 if age`i'==`k' & birth_age`i'<36
}
}
I would reshape twice. First reshape to long, then condition variable on !missing(order), then reshape back to wide.
* generate your data
clear
set obs 3
forvalues i = 1/3 {
generate order`i' = .
local k = (3 - `i' + 1)
forvalues j = 1/`k' {
replace order`i' = (`k' - `j' + 1) if (_n == `j')
}
}
list
*. list
*
* +--------------------------+
* | order1 order2 order3 |
* |--------------------------|
* 1. | 3 2 1 |
* 2. | 2 1 . |
* 3. | 1 . . |
* +--------------------------+
* I would rehsape to long, then back to wide
generate id = _n
reshape long order, i(id)
generate variable = !missing(order)
reshape wide order variable, i(id) j(_j)
order order* variable*
drop id
list
*. list
*
* +-----------------------------------------------------------+
* | order1 order2 order3 variab~1 variab~2 variab~3 |
* |-----------------------------------------------------------|
* 1. | 3 2 1 1 1 1 |
* 2. | 2 1 . 1 1 0 |
* 3. | 1 . . 1 0 0 |
* +-----------------------------------------------------------+
Using a simple forvalues loop with generate and missing() is orders of magnitude faster than other proposed solutions (until now). For this problem you need only one loop to traverse the complete list of variables, not two, as in the original post. Below some code that shows both points.
*----------------- generate some data ----------------------
clear all
set more off
local numobs 60
set obs `numobs'
quietly {
forvalues i = 1/`numobs' {
generate order`i' = .
local k = (`numobs' - `i' + 1)
forvalues j = 1/`k' {
replace order`i' = (`k' - `j' + 1) if (_n == `j')
}
}
}
timer clear
*------------- method 1 (gen + missing()) ------------------
timer on 1
quietly {
forvalues i = 1/`numobs' {
generate variable`i' = !missing(order`i')
}
}
timer off 1
* ----------- method 2 (reshape + missing()) ---------------
drop variable*
timer on 2
quietly {
generate id = _n
reshape long order, i(id)
generate variable = !missing(order)
reshape wide order variable, i(id) j(_j)
}
timer off 2
*--------------- method 3 (egen, rowmax()) -----------------
drop variable*
timer on 3
quietly {
// loop over the order variables creating dummies
forvalues v=1/`numobs' {
tab order`v', gen(var`v'_)
}
// loop over the domain of the order variables
// (may need to change)
forvalues l=1/`numobs' {
egen variable`l' = rmax(var*_`l')
drop var*_`l'
}
}
timer off 3
*----------------- method 4 (original post) ----------------
drop variable*
timer on 4
quietly {
forvalues i = 1/`numobs' {
gen variable`i' = 0
forvalues j = 1/`numobs' {
replace variable`i' = 1 if order`i'==`j'
}
}
}
timer off 4
*-----------------------------------------------------------
timer list
The timed procedures give
. timer list
1: 0.00 / 1 = 0.0010
2: 0.30 / 1 = 0.3000
3: 0.34 / 1 = 0.3390
4: 0.07 / 1 = 0.0700
where timer 1 is the simple gen, timer 2 the reshape, timer 3 the egen, rowmax(), and timer 4 the original post.
The reason you need only one loop is that Stata's approach is to execute the command for all observations in the database, from top (first observation) to bottom (last observation). For example, variable1 is generated but according to whether order1 is missing or not; this is done for all observations of both variables, without an explicit loop.
I wonder if you actually need to do this. For future questions, if you have a further goal in mind, I think a good strategy is to mention it in your post.
Note: I've reused code from other posters' answers.
Here's a simpler way to do it (that still requires 2 loops):
// loop over the order variables creating dummies
forvalues v=1/20 {
tab order`v', gen(var`v'_)
}
// loop over the domain of the order variables (may need to change)
forvalues l=1/3 {
egen variable`l' = rmax(var*_`l')
drop var*_`l'
}