Suppose I have two id variables in my survey, id1 and id2. I want to consolidate all observations at a level defined by these two variables, i.e. join all the observations that share the same id1 or the same id2. More precisely, two observations are defined to be pairwise related when they share the same id1 or the same id2, and I want to define a new variable id3 such that observation 1 and observation 2 have the same id3 iff there is a chain of pairwise related observations from observation 1 to observation 2. This is a particular application of the problem of finding connected components.
Intuitively, this would be similar to a command egen id = group(var1 var2) that uses OR logic rather than AND logic. How should one do it in Stata?
I have found the answer to my question. The program a2group, written by Amine Ouazad, does just that. It can be installed via the a2reg package.
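For readers who want to see the logic rather than the command, here is a minimal sketch of the connected-components idea in Python, using a union-find structure. It is not how a2group is implemented (the source does not say), and the function name connected_ids and the sample observations are made up for illustration.

```python
def connected_ids(rows):
    """Return an id3 label per (id1, id2) row; rows linked by a chain
    of shared id1 or id2 values receive the same label."""
    parent = list(range(len(rows)))

    def find(i):
        # Follow parent links to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)

    # Remember the first row in which each id1 / id2 value was seen,
    # and merge every later row carrying that value into its component.
    first_seen = {}
    for idx, (id1, id2) in enumerate(rows):
        for key in (("id1", id1), ("id2", id2)):
            if key in first_seen:
                union(idx, first_seen[key])
            else:
                first_seen[key] = idx

    # Relabel the roots as 1, 2, 3, ... in order of first appearance.
    labels = {}
    return [labels.setdefault(find(idx), len(labels) + 1)
            for idx in range(len(rows))]


# Hypothetical observations: (id1, id2)
observations = [(1, "a"), (1, "b"), (2, "b"), (3, "c"), (4, "c"), (5, "d")]
id3 = connected_ids(observations)
print(id3)  # [1, 1, 1, 2, 2, 3]
```

The third row shares no id1 with the first, but it is chained to it through the second row (same id2 "b" as row three, same id1 as row one), which is exactly the OR logic asked for. Missing ids are not treated specially in this sketch.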
I have data that I would like to rank by two separate categories, State and ServiceType. Essentially, there are multiple years of data for each ServiceType across various states, and I was hoping to get the sum of all years for each ServiceType by State, meaning each State is treated independently and the sums of the various categories are ranked only within that state, not nationally.
I've tried
bys State ServiceCategory (quant_variable): ///
egen rank_quant_variable= rank(sum(quant_variable)), field
as well as a version of the above where I used a pre-calculated sum variable. Neither really works.
This lacks a reproducible example, as you do not give your data or phrase your problem in terms of a dataset we could download, for example one that can be loaded with sysuse or webuse in Stata. There is no need to give the full dataset, just a minimal example with the same structure.
The call to sum() here would be to Stata's sum() function, which yields the cumulative or running sum, which evidently isn't what you want. So that case is easy to dismiss.
The remaining question is exactly what you did in the code you don't show, the version with a pre-calculated sum.
At a guess you worked out
bys State ServiceCategory: egen sum = total(quant_variable)
and then pushed that sum through rank(). But that would use each value of sum as many times as it occurred.
Perhaps you want something more like this:
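* tag one observation per State-ServiceCategory combination
egen tag = tag(State ServiceCategory)
* rank the tagged totals within each state (field: highest total is ranked 1)
bysort State: egen rank_quant_variable = rank(sum) if tag, field
* copy each rank to the untagged observations of its group (missings sort last)
bysort State ServiceCategory (rank_quant_variable): replace rank_quant_variable = rank_quant_variable[1]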
But it's really hard (for me) to visualize this without details on what you did or an example to work on.
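In the absence of example data, the intended logic can at least be sketched outside Stata. The Python below sums a quantity per state and service category, then ranks those totals within each state so that the largest total gets rank 1 and tied totals share a rank, which is how I understand the field option of rank() to behave. The records and the function name are invented for illustration.

```python
from collections import defaultdict


def rank_totals_within_state(records):
    """records: iterable of (state, service, value).
    Returns {(state, service): rank}, ranking totals within each state."""
    # Step 1: one total per (state, service), summed over all years.
    totals = defaultdict(float)
    for state, service, value in records:
        totals[(state, service)] += value

    # Step 2: collect the totals belonging to each state.
    by_state = defaultdict(list)
    for (state, _service), total in totals.items():
        by_state[state].append(total)

    # Step 3: field ranking = 1 + number of strictly larger totals
    # in the same state, so ties share a rank.
    return {
        (state, service): 1 + sum(other > total for other in by_state[state])
        for (state, service), total in totals.items()
    }


# Hypothetical data: (State, ServiceCategory, quant_variable), one row per year
records = [
    ("CA", "A", 10), ("CA", "A", 5), ("CA", "B", 20), ("CA", "C", 15),
    ("TX", "A", 1), ("TX", "B", 7), ("TX", "B", 2),
]
ranks = rank_totals_within_state(records)
for key in sorted(ranks):
    print(key, ranks[key])
```

Each total is ranked once per group, not once per observation, which is the point of tagging a single observation per group before calling rank().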
Hi, I am trying to create dyads from household ids clustered within villages in Stata. My problem is that I do not know how to do something like a vlookup, so as to have a list of household ids linked to every household.
Without a bit more information this question is tough to answer, but here are some places to look. First, use tabulate to see your data broken down by variables. Next, check the bysort and gen commands; together they will probably be the answer you're looking for, although it is tough to tell from the question. Finally, you may want to look into encode if your village variable is a string: it gives you a unique numeric id for each village.
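Since the question leaves the target structure open, the following is only a guess at it: one row per unordered pair of households within the same village. The Python sketch shows that structure on invented data; the village and household ids and the function name household_dyads are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def household_dyads(households):
    """households: iterable of (village, household_id).
    Returns every unordered within-village pair as (village, hh_a, hh_b)."""
    by_village = defaultdict(list)
    for village, hh in households:
        by_village[village].append(hh)

    dyads = []
    for village in sorted(by_village):
        # combinations() yields each pair once, never pairing a household with itself.
        for hh_a, hh_b in combinations(sorted(by_village[village]), 2):
            dyads.append((village, hh_a, hh_b))
    return dyads


# Hypothetical data
households = [("v1", 101), ("v1", 102), ("v1", 103), ("v2", 201), ("v2", 202)]
dyads = household_dyads(households)
for row in dyads:
    print(row)
```

If ordered dyads are wanted instead (A-B and B-A as separate rows), itertools.permutations with length 2 would be the analogous choice.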
I have a very weird thing happening with my code. I have a panel data set with the panel id being p_id, and I am trying to create another variable using that panel id. My code is below, where p_id is the panel id, marital_status is the marital status of the person observed in each time period, and x is the variable I want to create.
bys p_id: gen count =_N
bys p_id: gen count1 =_n
bys p_id: gen x= marital_status if count1 ==1
However when I do
tab x
I get different numbers for rows (row total does not change) each time I run this code. The numbers are pretty closely clustered, but I need to understand why this is happening.
Although the lack of a reproducible example is poor practice, it is possible to guess at what is going on. The first line of code is not problematic, but the second and third together have the same effect as
bys p_id: gen x = marital_status if _n == 1
In words, the new variable contains marital status data from the first observation in each group of observations for distinct p_id. But sorting on p_id says nothing about sort order for the observations with the same p_id and that within-group sort order is not reproducible without some sufficient constraint. So the first observation could easily be different (unless naturally there is only one observation in each group), with the results you report.
Concretely, suppose that there are 3 observations for p_id 42. Then any of 6 possible orders of those observations is consistent with sorting on p_id. And so forth.
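The counting argument can be checked with a small Python sketch. The three observations and the wave variable are assumptions made up for the illustration; the point is only that sorting on p_id alone admits all six orders, while adding a variable that is unique within p_id leaves exactly one.

```python
from itertools import permutations

# Hypothetical observations: (p_id, wave, marital_status).
# "wave" stands in for whatever time variable the panel has.
rows = [(42, 1, "single"), (42, 2, "married"), (42, 3, "divorced")]

# Every ordering is sorted on p_id alone, because p_id is constant here.
orders_sorted_on_p_id = [
    order for order in permutations(rows)
    if [r[0] for r in order] == sorted(r[0] for r in order)
]

# So the "first observation" can carry any of the three statuses.
possible_first_status = {order[0][2] for order in orders_sorted_on_p_id}

# Requiring the order to respect (p_id, wave) leaves a single ordering.
orders_sorted_on_both = [
    order for order in permutations(rows)
    if [r[:2] for r in order] == sorted(r[:2] for r in order)
]

print(len(orders_sorted_on_p_id))         # 6
print(sorted(possible_first_status))      # ['divorced', 'married', 'single']
print(len(orders_sorted_on_both))         # 1
print(orders_sorted_on_both[0][0][2])     # single
```

In Stata terms, the analogous constraint would be to name such a variable in parentheses after the by-variable, as in bys p_id (wave):, assuming the data contain one.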
Presumably there is something special about one observation in each group. You would need to explain more about your data and what you want to get to allow fuller advice, but this problem is not a puzzle.
I'm trying to create a table using SAS 9.3 that shows information on current and past projects. For current projects, I want to show whether they've met various criteria ("yes", "no", OR "n/a"). In the same table, I want to show summary information of past projects (i.e. how many projects met the criteria, how many did not, and how many were n/a). Having one table to show current projects and one table to show past projects is easy. I'm struggling to show them together in a single table. Using proc tabulate, my code looks like this:
proc tabulate data = projects order=formatted missing;
class project;
var dt criteria1 criteria2 criteria3;
table
(dt="Start Date"*min=''*f=year_date.)
(criteria1="Criteria 1")*sum=''*f=ans.
(criteria2="Criteria 2")*sum=''*f=ans.
(criteria3="Criteria 3")*sum=''*f=ans.
,(project='');
format project $project_label.;
run;
The values for each criterion are 1 for yes, 0 for no, and . for n/a. The year format distinguishes current from past projects, and the ans format shows "yes" for 1 and "no" for 0. This works for the current projects. It also gives me the total number of past projects with "yes" answers. What I don't know how to do is the break-out for past projects showing no and n/a. (I'm also in trouble if the sum for past projects is 1 or 0, because the format would replace those with "yes" or "no".)
Any suggestions?
Thanks.
Brandon
Edit: I'll try to add some sample data that looks reasonable...
Criteria    ActiveProject1   ActiveProject2   Past_Projects
Criteria1   yes              no               5/10/5
Criteria2   yes              yes              7/9/4
Criteria3   no               yes              2/15/3
While I can't visualize what you're trying to do, one suggestion I would have is to use the ODS DOCUMENT and PROC DOCUMENT facility, or PROC REPORT.
You can in this way build your two separate tables that you like, then use PROC DOCUMENT to put them together so they show up in one place. This might suffice for what you're aiming to do.
If it doesn't, then PROC REPORT is probably more apt than PROC TABULATE when you are in some places summarizing and in other places not, if that's what you're trying to do. It allows limited data step functionality along with the summarization elements of the tabulation procs. I can't suggest a specific example because I don't understand what you're doing, but it may be the superior choice.
(first time posting)
I have a data set where I need to create a new variable (in SAS), based on meeting a condition related to another variable. So, the data contains three variables from a survey: Site, IDnumb (person), and Date. There can be multiple responses from different people but at the same site (see person 1 and 3 from site A).
Site   IDnumb   Date
a      1        6/12
b      2        3/4
c      4        5/1
a      3        .
d      5        .
I want to create a new variable called Complete, but it can't contain duplicates. So, when I go to proc freq, I want site A to be counted once, using the 6/12 Date of the Completed Survey. So basically, if a site is represented twice and contains a Date in one, I want to only count that one and ignore the duplicate site without a date.
             N    %
Complete     3    75%
Last Month   1    25%
My question may be around the NODUP and NODUPKEY possibilities. If I do a Proc Sort (nodupkey) by Site and Date, would that eliminate obs "a 3 ."?
Any help would be greatly appreciated. Sorry for the jumbled "table", as this is my first post (hints on making that better are also welcomed).
You can do this a number of ways.
First off, you need a complete/not complete binary variable. If you're in the datastep anyway, might as well just do it all there.
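/* DESCENDING precedes the variable it modifies; missing dates then sort last within each site */
proc sort data=yourdata;
  by site descending date;
run;
/* keep one row per site: the one with the latest date, if the site has any date */
data yourdata_want;
  set yourdata;
  by site descending date;
  if first.site then do;
    /* a missing date compares lower than any number, so it yields comp = 0 */
    comp = ifn(date>0,1,0);
    output;
  end;
run;
proc freq data=yourdata_want;
  tables comp;
run;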
If you used NODUPKEY, you'd first sort it by SITE DESCENDING DATE, then by SITE with NODUPKEY. That way the latest date is on top within each site and is the row that survives. You could also format COMP to show the text labels you list rather than just 1/0.
You can also do it with a format on DATE, so you can skip the data step (still need the sort/sort nodupkey). Format all nonmissing values of DATE to "Complete" and missing value of date to "Last Month", then include the missing option in your proc freq.
Finally, you could do the table in SQL (though getting two rows like that is a bit harder, you have to UNION two queries together).
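Whichever route is taken, the underlying rule is the same: collapse to one row per site, and call the site complete if any of its rows has a date. A minimal Python sketch of that rule on the example data from the question follows; None stands in for the SAS missing value, dates are kept as plain strings for simplicity, and the function name is made up.

```python
from collections import Counter


def completion_by_site(rows):
    """rows: iterable of (site, idnumb, date), with date None when missing.
    Returns {site: True if any row for that site has a date}."""
    complete = {}
    for site, _idnumb, date in rows:
        complete[site] = complete.get(site, False) or (date is not None)
    return complete


# The example data from the question
rows = [
    ("a", 1, "6/12"),
    ("b", 2, "3/4"),
    ("c", 4, "5/1"),
    ("a", 3, None),
    ("d", 5, None),
]

status = completion_by_site(rows)
counts = Counter("Complete" if done else "Last Month" for done in status.values())
shares = {label: 100 * n / len(status) for label, n in counts.items()}

for label in ("Complete", "Last Month"):
    print(label, counts[label], f"{shares[label]:.0f}%")
```

Site a is counted once, as complete, because one of its two rows has a date; that reproduces the 3 and 1 (75% and 25%) in the desired table.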