How to subset data and set values based on matching dates

I have a table like this:
id   date        value
x    01/01/2019  1
y    01/01/2019  2
x    01/02/2019  1
z    01/03/2019  1
I am trying to select the rows where id is in (x, y).
Then, once I have that subset, I want to check whether x has a corresponding y with the same date.
If it does, I want to set new to the value from the row with that date and id=y, for both x and y;
otherwise, new is just set to value.
So the table would become:
id   date        value  new
x    01/01/2019  1      2
y    01/01/2019  2      2
x    01/02/2019  1      1
z    01/03/2019  1      1
I am unsure how to go about this.
I began by subsetting my data:
IF id='x' OR id='y' THEN DO;
/*...*/
END;
ELSE new=value;
Any help would be appreciated.
Also, I cannot use the actual dates in my code.
I don't need to generalize the id: I know specifically that I need to compare x and y (y is a follow-up for x in the data I am using).

I've recently answered a similar question. Base SAS isn't very convenient for problems like this: solving it with DATA steps involves sorting in the right order, then reading in BY groups and retaining data across rows.
This is easier in SQL: the expected result you've described can be obtained by LEFT JOINing the original table (T) with a subset (Y) of itself (where id='y') on date. Once joined, new can be calculated as coalesce(Y.value, T.value).
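As a rough illustration of the join described above, here is a minimal sketch that runs the SQL through Python's built-in sqlite3 module rather than SAS PROC SQL. The table name t, the alias y, the output column name new_value, and the extra join condition t.id IN ('x', 'y') (which leaves other ids such as z untouched even if they share a date with a y row) are assumptions of this sketch. It also assumes there is at most one y row per date.

```python
import sqlite3

# Sample rows from the question: (id, date, value).
rows = [
    ("x", "01/01/2019", 1),
    ("y", "01/01/2019", 2),
    ("x", "01/02/2019", 1),
    ("z", "01/03/2019", 1),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id TEXT, date TEXT, value INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?, ?)", rows)

# Left join every row to the y row of the same date (if any), then
# fall back to the row's own value when no y row matched.
# Assumption: only x and y rows are allowed to pick up the y value.
query = """
SELECT t.id, t.date, t.value,
       COALESCE(y.value, t.value) AS new_value
FROM t
LEFT JOIN (SELECT date, value FROM t WHERE id = 'y') AS y
       ON t.date = y.date AND t.id IN ('x', 'y')
ORDER BY t.date, t.id
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

No date is hard-coded anywhere: the match is made purely on t.date = y.date.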

Related

Plotting categorical variables using a bar diagram/bar chart

I am trying to plot a bar graph for both the Sept and Oct waves. As you can see in the image, id identifies the individuals who are surveyed over time. On a single graph I need to plot sept in-house, oct in-house, sept out-house and oct out-house, showing only the proportion of people who said yes in each. The other categories do not have to be taken into account.
I also have to show whiskers for the 95% confidence intervals of each of these categories.

* Example generated by -dataex-. For more info, type help dataex
clear
input float(id sept_outhouse sept_inhouse oct_outhouse oct_inhouse)
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 4 3 3 3
5 4 4 3 3
6 4 4 3 3
7 4 4 4 1
8 1 1 1 1
9 1 1 1 1
10 1 1 1 1
end
label values sept_outhouse codes
label values sept_inhouse codes
label values oct_outhouse codes
label values oct_inhouse codes
label def codes 1 "yes", modify
label def codes 2 "no", modify
label def codes 3 "don't know", modify
label def codes 4 "refused", modify
save tokenexample, replace
rename (*house) (house*)
reshape long house, i(id) j(which) string
replace which = subinstr(proper(which), "_", " ", .)
gen yes = house == 1
label def WHICH 1 "Sept Out" 2 "Sept In" 3 "Oct Out" 4 "Oct In"
encode which, gen(WHICH) label(WHICH)
statsby, by(WHICH) clear: ci proportion yes, jeffreys
set scheme s1color
twoway scatter mean WHICH ///
|| rspike ub lb WHICH, xla(1/4, noticks valuelabel) xsc(r(0.9 4.1)) ///
xtitle("") legend(off) subtitle(Proportion Yes with 95% confidence interval)
This has to be solved backwards.
The means and confidence intervals have to be plotted using twoway, because graph bar is a dead end here: it does not allow whiskers as well.
The confidence limits have to be put in variables before the graph is drawn. Some graph commands, notably graph bar, will calculate means for you, but as said that is a dead end. So we need to calculate the means too.
To do that you need an indicator variable for Yes.
The best way I know to get the results is then to reshape to a different structure and apply ci proportion under statsby.
As a detail, the option jeffreys is spelled out as a signal that there are different methods for calculating the confidence interval. You should choose one knowingly.
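To make the "indicator, then proportion, then interval" logic concrete outside Stata, here is a minimal Python sketch using the same example data. It deliberately swaps in the Wilson score interval rather than the Jeffreys interval used above, because Wilson needs only the standard library; the limits will therefore differ somewhat from the Stata output. The names waves, wilson_interval and summary are hypothetical.

```python
import math
from statistics import NormalDist

# Responses per wave from the example data
# (1 = yes, 2 = no, 3 = don't know, 4 = refused).
waves = {
    "Sept Out": [1, 2, 3, 4, 4, 4, 4, 1, 1, 1],
    "Sept In":  [1, 2, 3, 3, 4, 4, 4, 1, 1, 1],
    "Oct Out":  [1, 2, 3, 3, 3, 3, 4, 1, 1, 1],
    "Oct In":   [1, 2, 3, 3, 3, 3, 1, 1, 1, 1],
}

def wilson_interval(successes, n, level=0.95):
    """Wilson score interval for a binomial proportion (not Jeffreys)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

summary = {}
for label, answers in waves.items():
    yes = [1 if a == 1 else 0 for a in answers]  # indicator for "yes"
    n = len(yes)
    lb, ub = wilson_interval(sum(yes), n)
    summary[label] = (sum(yes) / n, lb, ub)

for label, (mean, lb, ub) in summary.items():
    print(f"{label}: mean={mean:.2f} lb={lb:.3f} ub={ub:.3f}")
```

Each entry of summary holds the three numbers that the graph needs: the point (mean) and the two ends of the whisker (lb, ub).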

What is best practice for creating Year on Year columns in Power BI?

I have a table with many columns containing various metrics by year & week:
Year Week_Number Metric1 Metric2
---- ----------- ------- -------
2020 1 11 21
2019 1 10 20
I need to add columns calculating Year On Year growth for each of these metrics to use on a report; the data needs to look something like this:
Year Week_Number Metric1 Metric2 Metric1_YOY Metric2_YOY
---- ----------- ------- ------- ----------- -----------
2020 1 11 21 10% 5%
2019 1 10 20 N/A N/A
I can do this using DAX, but there are a lot of columns needed and it takes a lot of time to copy my formula, create a new column, paste the formula in, edit it, and repeat for each column.
Does anyone know of a quicker, more efficient way to create many such columns?

sas hash table with multiple key value pairs

I have a lookup table which looks like this (Name: LOOKUP_TABLE):
Obs Member_id plan_id Plan_desc group_id Group_name
1 164-234 XYZ HMO_Salaried G123 Umbrellas, Inc.
2 297-123 ABC PPO_Hourly G123 Umbrellas, Inc.
3 344-123 JKL HMO_Executive G456 Toy Company
4 395-123 XYZ HMO_Salaried G123 Umbrellas, Inc.
5 495-987 ABC PPO_Hourly G456 Toy Company
6 562-987 ABC PPO_Hourly G123 Umbrellas, Inc.
7 697-123 XYZ HMO_Salaried G456 Toy Company
I have another table with the following data (Name: MAIN_TABLE):
Obs Member_id zip income svc_dt dx plan_id group_id Obs old_id new_id
1 164-234 04021 $45,000 2005/01/01 250 XYZ G123 1 164-234 N164-234
2 297-123 22003-1234 $56,999 2005/02/03 4952 ABC G123 2 297-123 N297-123
3 344-123 45459-0306 $72,999 2005/03/15 78910 JKL G456 3 344-123 C344-123
4 395-123 03755 $75,000 2005/04/14 250 XYZ G123 4 N164-234 M164-234
5 495-987 94305 $96,000 2005/08/19 12345 ABC G456 5 N297-123 B297-123
6 562-987 78277-8310 $32,999 2005/09/13 250 ABC G123 6 M164-234 P164-234
7 697-123 88044-3760 $47,999 2005/11/01 4952 XYZ G456 7 P164-234 A164-234
My SAS data step is as follows:
data MAIN_TABLE_1;
set MAIN_TABLE;
declare hash pd_lookup(dataset:"&LOOKUP_TABLE.");
rc_pd_definekey = pd_lookup.definekey
(
'plan_id',
'group_id'
);
rc_pd_definedata = pd_lookup.definedata
(
'Plan_desc',
'Group_name'
);
rc_pd_definedone = pd_lookup.definedone();
call missing (
Plan_desc,
Group_name
);
put "rc_pd_definekey is " rc_pd_definekey;
put "rc_pd_definedata is " rc_pd_definedata;
put "rc_pd_definedone is " rc_pd_definedone;
drop rc_pd_definekey rc_pd_definedata rc_pd_definedone;
rc_pd_lookup = pd_lookup.find();
run;
My question is about what is happening behind the scenes in this lookup, mainly with regard to the key-value pairs being generated.
That is, are individual key-value pairs generated for each key and each data variable?
For example, the key-value pairs would be
: "plan_id" -> "Plan_desc"
: "plan_id" -> "Group_name"
: "group_id" -> "Plan_desc"
: "group_id" -> "Group_name"
Or are the keys concatenated together, and likewise the values, and then paired up?
That is, something like this:
:"plan_id"+"group_id" -> "Plan_desc" + "Group_name"
I ask because I have to convert the same logic into R, and if I misunderstand it, the whole R code will be wrong.

Each combination of plan_id and group_id is used to retrieve a unique entry from the hash table containing values of both plan_desc and group_name.
However, currently there are duplicate rows with the same combination of these ids in the lookup table, which may cause errors or unexpected behaviour - e.g. obs 1 and 4. You should create a deduplicated copy of the lookup table and use that to declare the hash object.
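The composite-key behaviour described above can be sketched with a plain dictionary. This is an illustration in Python, not SAS or R: the names lookup_rows, lookup and find are hypothetical, and keeping the first row when a key combination repeats is an assumption of this sketch, not a statement about what the SAS hash object does.

```python
# Rows of LOOKUP_TABLE: (Member_id, plan_id, Plan_desc, group_id, Group_name).
lookup_rows = [
    ("164-234", "XYZ", "HMO_Salaried",  "G123", "Umbrellas, Inc."),
    ("297-123", "ABC", "PPO_Hourly",    "G123", "Umbrellas, Inc."),
    ("344-123", "JKL", "HMO_Executive", "G456", "Toy Company"),
    ("395-123", "XYZ", "HMO_Salaried",  "G123", "Umbrellas, Inc."),
    ("495-987", "ABC", "PPO_Hourly",    "G456", "Toy Company"),
    ("562-987", "ABC", "PPO_Hourly",    "G123", "Umbrellas, Inc."),
    ("697-123", "XYZ", "HMO_Salaried",  "G456", "Toy Company"),
]

# One entry per (plan_id, group_id) combination; the entry carries both
# data values together as a tuple.
lookup = {}
for _member_id, plan_id, plan_desc, group_id, group_name in lookup_rows:
    # Assumption of this sketch: the first row seen for a key is kept.
    lookup.setdefault((plan_id, group_id), (plan_desc, group_name))

def find(plan_id, group_id):
    """Return (Plan_desc, Group_name), or (None, None) when the key is absent."""
    return lookup.get((plan_id, group_id), (None, None))

print(len(lookup))
print(find("XYZ", "G123"))
print(find("JKL", "G123"))
```

The seven lookup rows collapse to five entries because two key combinations repeat, which is why deduplicating the lookup table first makes the behaviour explicit.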

Reshaping dataset wide

I have a dataset from a small clinic which looks something like this:
What I am trying to do is make the top long form of the dataset look like the bottom wide form.
My code is the following:
reform date injury_code_1 .... , i(ID) j(VisitNum)
The error code I get is this:
There are variables other than a, b, ID, VisitNum in your data. They must be constant within ID because that is the only way they can fit into wide data without loss of information.
The variable or variables listed above are not constant within ID. Perhaps the values are in error. Type reshape error for a list of the problem observations.
Either that, or the values vary because they should vary, in which case you must either add the variables to the list of xij variables to be reshaped, or drop them.
Why is my code wrong?

Using the data as illustrated in the screenshot, the following works for me:
clear
input ID VisitNum str6 date Injury_1 Injury_2 Injury_3 gender
1 1 "12-Mar" 1 2 3 0
2 1 "2-Apr" 4 . . 1
1 2 "23-Jun" 1 2 . 0
3 1 "1-Feb" 5 6 . 1
1 3 "30-Aug" 8 9 10 0
end
reshape wide date Injury_1 Injury_2 Injury_3, i(ID) j(VisitNum)
order ID gender
list, abbreviate(15)
+----------------------------------------------------------------------------------------------------------------------------------------------------+
| ID gender date1 Injury_11 Injury_21 Injury_31 date2 Injury_12 Injury_22 Injury_32 date3 Injury_13 Injury_23 Injury_33 |
|----------------------------------------------------------------------------------------------------------------------------------------------------|
1. | 1 0 12-Mar 1 2 3 23-Jun 1 2 . 30-Aug 8 9 10 |
2. | 2 1 2-Apr 4 . . . . . . . . |
3. | 3 1 1-Feb 5 6 . . . . . . . |
+----------------------------------------------------------------------------------------------------------------------------------------------------+
The command provided in the question is not valid Stata syntax: the command name is reshape, followed by wide or long, not reform.

Stata Deleting Multiple Observations

I have the following data matrix containing ideology scores in a customized dataset:
year state cdnum party name dwnom1
1946 23 10 200 WOODRUFF 0.43
1946 23 11 200 BRADLEY F. 0.534
1946 23 11 200 POTTER C. 0.278
1946 23 12 200 BENNETT J. 0.189
My unit of analysis is a given congressional district in a given year. As one can see, state #23, cdnum #11 has two observations in 1946.
What I would like to do is delete the earlier observation, in this case the one corresponding to name BRADLEY F. This happens when a Congressional district has two members in a given Congress. The code I have tried is as follows:
drop if year==[_n+1] & statenum==[_n+1] & cdnum==[_n+1]
My attempt is a conditional: drop the observation if the year, the statenum and the cdnum are each the same as in the next observation. In this way, I can ensure each district has only one corresponding observation for a given year. When I attempt to run the code I get:
drop if year==[_n-1] & statenum==[_n-1] & cdnum==[_n-1]
(0 observations deleted)

Brief alternative: You should check out the duplicates command.
Detailed explanation of error:
You don't mean what you say to Stata.
Your conditions such as
if year == [_n-1]
should be
if year == year[_n-1]
and so forth.
[_n-1]
by itself is treated as if you typed
_n-1
which is the observation number, minus 1.
Here is a dopey example. Read in the auto data.
. sysuse auto
(1978 Automobile Data)
. list foreign if foreign == [_n-1], nola
+---------+
| foreign |
|---------|
1. | 0 |
+---------+
The variable foreign is equal to _n - 1 precisely once, in observation 1 when foreign is 0 and _n is 1.
In short, [_n-1] is not to be interpreted as the previous value (of the variable I just mentioned).
help subscripting gives very basic help.
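The same point, that the comparison must be between values in adjacent observations and not between a value and an observation number, can be sketched outside Stata. This Python illustration uses the four rows from the question; the names rows, key and kept are hypothetical, and it assumes the rows are already sorted so that the later member of a district follows the earlier one.

```python
# Rows from the question, in their original order.
rows = [
    {"year": 1946, "state": 23, "cdnum": 10, "party": 200, "name": "WOODRUFF", "dwnom1": 0.43},
    {"year": 1946, "state": 23, "cdnum": 11, "party": 200, "name": "BRADLEY F.", "dwnom1": 0.534},
    {"year": 1946, "state": 23, "cdnum": 11, "party": 200, "name": "POTTER C.", "dwnom1": 0.278},
    {"year": 1946, "state": 23, "cdnum": 12, "party": 200, "name": "BENNETT J.", "dwnom1": 0.189},
]

def key(row):
    """The unit of analysis: one district in one year."""
    return (row["year"], row["state"], row["cdnum"])

# Drop a row when the NEXT row has the same key values; that is, compare
# this row's values with the next row's values, not with a row number.
kept = [
    row
    for i, row in enumerate(rows)
    if i + 1 == len(rows) or key(row) != key(rows[i + 1])
]

for row in kept:
    print(row["cdnum"], row["name"])
```

Only the last observation of each district-year survives, so the earlier of the two cdnum 11 rows is the one removed.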
