Merge rows with unique ID in Stata

I have a dataset of unique county FIPS codes, some of which need to be merged together. The dataset looks like:
FIPS yr1990 yr2000 yr2010
1001 1 0 1
1002 1 1 0
1003 1 0 0
1004 0 0 0
1005 0 0 1
County boundaries have changed and I need to merge several FIPS codes together. Essentially, I need the dataset to look like:
FIPS yr1990 yr2000 yr2010
1001/1003 1 1 1
1002 1 1 0
1004/1005 0 0 1
Is there a way to select specific FIPS codes and merge them across rows?

This solution might not scale to very large datasets, since the replace statements must be written by hand, but it keeps the exact format used in your example. A more scalable approach may also be difficult if there is no systematic rule for how the FIPS codes were combined.
* Example generated by -dataex-. For more info, type help dataex
clear
input str4 FIPS byte(yr1990 yr2000 yr2010)
"1001" 1 0 1
"1002" 1 1 0
"1003" 1 0 0
"1004" 0 0 0
"1005" 0 0 1
end
*Combine the FIPS codes
replace FIPS = "1001/1003" if inlist(FIPS,"1001","1003")
replace FIPS = "1004/1005" if inlist(FIPS,"1004","1005")
*Collapse rows by FIPS value, taking the max of each variable matching yr????
collapse (max) yr???? , by(FIPS)
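If the combinations are known up front, a somewhat more scalable possibility is to keep them in a separate crosswalk dataset and merge that in, rather than hard-coding replace statements. A minimal sketch, starting again from the original data and assuming a hypothetical file fips_crosswalk.dta with variables FIPS and newFIPS (for example "1001" -> "1001/1003" and "1003" -> "1001/1003"):
*Attach the new combined code to every old code listed in the crosswalk
merge m:1 FIPS using fips_crosswalk, keep(master match) nogenerate
*Codes not listed in the crosswalk keep their own value
replace newFIPS = FIPS if missing(newFIPS)
*Collapse as before, now by the combined code
collapse (max) yr???? , by(newFIPS)
rename newFIPS FIPS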

Related

Power BI Report with dynamic columns according to slicer selection

I have a table with the columns shown in the sample data below. The rows show the allocation of various people to various projects. The month columns extend to Dec,20 and continue on from Jan,21 in the same pattern.
One Staff can be tagged to any number of projects in a given month.
Now I want to prepare a Power BI report on this in the format as below -
Staff ID, Project ID and End Date are the slicers to be present.
For the End Date slicer, options can be selected in (MMM,YY) format (e.g. Jan,23). Based on this slicer, I want to show the preceding 6 months of data, as portrayed by the above sample image.
I have tried using parameters, but those have to be specified for each combination, so they are not usable here since the data will keep growing over time.
Is there any way to do this or am I missing some simple thing in particular?
Any help on this will be highly appreciated.
Adding in the sample data as below -
Staff ID  Project ID  Jan,20  Feb,20  Mar,20  Apr,20  May,20  Jun,20  Jul,20
1         20          0       0       0       100     80      10      0
1         30          0       0       0       0       20      90      100
2         20          100     100     100     0       0       0       0
3         50          80      100     0       0       0       0       0
3         60          15      0       0       0       20      0       0
3         70          5       0       100     100     80      0       0

Giving subjects a binary id they keep for every period

In Stata I have a list of subjects and contributions from an economic experiment.
There are multiple rounds played for each treatment. I want to keep track of who contributed in the first period and give them a 1 if they were a contributor or a 0 if they were a defector. The game is played for multiple periods, but I only care about the first round. My current code looks like this:
g firstroundcont = 0
replace firstroundcont = 1 if c>0 & period==1
This, however, results in everyone getting a 0 for every subsequent period, meaning that they are not "identified" as either a first-round contributor or a defector in all other periods of the dataset. The table below shows a snippet of my data and how the variable firstroundcont should look.
sessionID  period  subject  group  contribution  firstroundcont
1          1       1        1      4             1
1          1       2        1      0             0
1          1       3        1      2             1
1          1       4        2      10            1
1          1       5        2      0             0
1          1       6        2      0             0
1          2       1        1      0             1
1          2       2        1      5             0
1          2       3        1      0             1
#JR96 is right: this sorely and surely needs a data example. But I guess you want something with the flavour of
bysort id (period) : gen wanted = c[1] > 0
See https://www.stata.com/support/faqs/data-management/creating-dummy-variables/ and https://www.stata-journal.com/article.html?article=dm0099 for more on how to get indicators in one step. The business of generating with 0 and then replacing with 1 can usually be cut to a direct one-line statement.
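For what it is worth, here is a minimal sketch of that idea using the variable names in the table above (subject and contribution rather than id and c), with the posted rows keyed in directly:
clear
input byte(sessionID period subject group contribution)
1 1 1 1  4
1 1 2 1  0
1 1 3 1  2
1 1 4 2 10
1 1 5 2  0
1 1 6 2  0
1 2 1 1  0
1 2 2 1  5
1 2 3 1  0
end
*1 if the subject's period-1 contribution was positive, carried to every period
bysort sessionID subject (period) : gen firstroundcont = contribution[1] > 0
list, sepby(subject)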

Scikit-learn labelencoder: how to preserve mappings between batches?

I have 185 million samples, each of which is about 3.8 MB. To prepare my dataset, I need to one-hot encode many of the features, after which I end up with over 15,000 features.
But I need to prepare the dataset in batches, since the memory footprint exceeds 100 GB for the features alone when one-hot encoding only 3 million samples.
The question is how to preserve the encodings/mappings/labels between batches?
The batches are not going to have all the levels of a category necessarily. That is, batch #1 may have: Paris, Tokyo, Rome.
Batch #2 may have Paris, London.
But in the end I need to have Paris, Tokyo, Rome, London all mapped to one encoding all at once.
Assuming that I cannot determine the levels of my Cities column across all 185 million samples at once, since they won't fit in RAM, what should I do?
If I apply the same LabelEncoder instance to different batches, will the mappings remain the same?
After this I will also need to apply one-hot encoding in batches, either with scikit-learn or Keras' np_utils.to_categorical. So, same question: how do I use these methods in batches, or apply them at once to a file format stored on disk?
I suggest using Pandas' get_dummies() for this, since sklearn's OneHotEncoder() needs to see all possible categorical values when .fit(), otherwise it will throw an error when it encounters a new one during .transform().
# Create toy dataset and split to batches
data_column = pd.Series(['Paris', 'Tokyo', 'Rome', 'London', 'Chicago', 'Paris'])
batch_1 = data_column[:3]
batch_2 = data_column[3:]
# Convert categorical feature column to matrix of dummy variables
batch_1_encoded = pd.get_dummies(batch_1, prefix='City')
batch_2_encoded = pd.get_dummies(batch_2, prefix='City')
# Row-bind (append) Encoded Data Back Together
final_encoded = pd.concat([batch_1_encoded, batch_2_encoded], axis=0)
# Final wrap-up. Replace nans with 0, and convert flags from float to int
final_encoded = final_encoded.fillna(0)
final_encoded[final_encoded.columns] = final_encoded[final_encoded.columns].astype(int)
final_encoded
output
City_Chicago City_London City_Paris City_Rome City_Tokyo
0 0 0 1 0 0
1 0 0 0 0 1
2 0 0 0 1 0
3 0 1 0 0 0
4 1 0 0 0 0
5 0 0 1 0 0
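If the batches come from a file on disk, another option (a rough sketch, assuming a hypothetical cities.csv with a column named City) is a two-pass approach: stream through the data once just to collect the distinct category levels, which typically fit in RAM even when the rows do not, then encode every batch against that fixed list so the dummy columns always line up between batches.
import pandas as pd

# Pass 1: collect the distinct levels without loading all rows at once
levels = set()
for chunk in pd.read_csv("cities.csv", usecols=["City"], chunksize=1_000_000):
    levels.update(chunk["City"].dropna().unique())
levels = sorted(levels)

# Pass 2: encode each batch against the same fixed category list, so every
# batch produces the same dummy columns in the same order
for chunk in pd.read_csv("cities.csv", usecols=["City"], chunksize=1_000_000):
    chunk["City"] = pd.Categorical(chunk["City"], categories=levels)
    encoded = pd.get_dummies(chunk["City"], prefix="City")
    # ... write `encoded` to disk or feed it to the model here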

Logistic Discrete-Time Hazard Model Parameter Estimates Interpretation

I am using PROC GLIMMIX, and I'm curious as to why my parameter estimates are behaving strangely.
proc glimmix data=blah pconv=1e-3;
class strata1;
model event(event=LAST)=time1--time20/
noint solution link=logit dist=binary;
nloptions tech=nrridg;
covtest 'var(strata1)=0'/WALD;
random intercept/subject=strata1;
run;
Since I'm using a logistic discrete-time hazard model (without any censored observations), my dataset is constructed in 'person-period' format. Here is an example of what a person-period dataset looks like:
id time1 time2 time3 time4 event
100 1 0 0 0 0
100 0 1 0 0 0
100 0 0 1 0 1
101 1 0 0 0 1
102 1 0 0 0 0
102 0 1 0 0 0
102 0 0 1 0 0
102 0 0 0 1 0
Essentially, each 'time' variable indicates whether that period is occurring: time1=1 during the first period, 0 otherwise; time2=1 during the second period, 0 otherwise; and so on. I am modelling the probability that the event occurs during each of these periods. When I use PROC LOGISTIC, I get sensible parameter estimates.
proc logistic data=blah;
model event (event=LAST)=time1--time20 /noint;
run;
This code delivers a parameter estimate for time1 of -3.0052, which gives me a probability of the event occurring in time period 1 of .047. These estimates slowly get smaller for each time[i] variable, which is what I would expect. However, when I run my GLIMMIX code and add the random effect for strata1, it blows up my model: the parameter estimates for time flip their sign. time1=2.84, time2=2.67, time3=2.41, and they consistently get smaller. I'm really confused as to why: this model is telling me that the probability of the event occurring is over 90% in this period, which I know to be untrue. Does anyone have any idea why this is? I would expect these estimates to essentially be the same but with the opposite (negative) sign.
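(For reference, with a logit link those probabilities come from the inverse logit, p = 1/(1 + exp(-b)): b = -3.0052 gives p of about .047, while b = 2.84 gives p of about .94, which is where the "over 90%" figure comes from.)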
Thanks.

Cleaner way of handling addition of summarizing rows to table?

I have a dataset that is unique by 5 variables, with two dependent variables. My goal is to append additional rows to this dataset with TOTAL as the value of one or more of the independent variables, with the values of the dependent variables aggregated accordingly.
Doing this for a single independent variable is not a problem; I would do something along the lines of:
proc sql;
create table want as
select "TOTAL" as independent_var1,
independent_var2,
...
independent_var5,
sum(dependent_1) as dependent_1,
sum(dependent_2) as dependent_2
from have
group by independent_var2,...,independent_var5;
quit;
Followed by appending the original dataset in whatever fashion you choose. However, I want the above repeated for each of the 5 independent variables, and then again for each possible combination of TOTAL/non-total across the 5 independent variables. I'm not sure off the top of my head just how many datasets that is (2^5 - 1 = 31, I believe), but it's a decent amount.
So the best strategy I've come up with so far is to use the above with some mildly creative macro code to generate all possible combinations of total/non-total, but it seems like SAS just might have a better way, maybe tucked away in an esoteric PROC step I've never heard of...
--
An attempt to show an example, using three independent variables and one dependent variable:
Ind1 Ind2 Ind3 Dependent1
0 0 0 1
0 0 1 3
0 1 0 5
0 1 1 7
Desired output would be:
0 0 ALL 4
0 1 ALL 12
0 ALL 0 6
0 ALL 1 10
ALL 0 0 1
ALL 0 1 3
ALL 1 0 5
ALL 1 1 7
0 ALL ALL 16
ALL 0 ALL 4
ALL 1 ALL 12
ALL ALL 0 6
ALL ALL 1 10
ALL ALL ALL 16
0 0 0 1
0 0 1 3
0 1 0 5
0 1 1 7
I may have forgotten some combinations, but that should serve to get the point across.
PROC MEANS should do this for you trivially. You will need to clean up the output to get it to match your example exactly (PROC MEANS uses missing values where your example shows INDx = "ALL"), but otherwise it gets the calculations done properly.
data have;
input Ind1 Ind2 Ind3 Dependent1;
datalines;
0 0 0 1
0 0 1 3
0 1 0 5
0 1 1 7
;;;;
run;
proc means data=have;
class ind1 ind2 ind3;
var dependent1;
output out=want sum=;
run;
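If you want the literal "ALL" labels from the example rather than missing values, a possible cleanup sketch (the names cInd1-cInd3 and want_clean are made up) is a short data step:
*Label the missing class levels that PROC MEANS uses for subtotal rows as "ALL";
data want_clean;
    set want;
    length cInd1 cInd2 cInd3 $3;
    cInd1 = ifc(missing(Ind1), "ALL", put(Ind1, 1.));
    cInd2 = ifc(missing(Ind2), "ALL", put(Ind2, 1.));
    cInd3 = ifc(missing(Ind3), "ALL", put(Ind3, 1.));
    drop Ind1-Ind3 _type_ _freq_;
run;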