Sorting complex data - stata

I'm using 2014 and 2018 SIPP data in Stata to examine the effects of children's presence in a household (my proxy is grade level) on parent divorce rates.
The data are clustered by household (identified by ssuid). Within each household there are multiple people (identified by pnum), who have attributes (grade, marr_stat) recorded over time (monthcode). Where I am running into issues is linking the marital status (marr_stat) of one person in the household to the grade level of a different person in the same household, so that I can then run regressions. Simply put, I am trying to create a row that contains the marital status of person 101 and the grade level of person 102.
Here is a snapshot of what my data looks like.
I have tried converting between wide and long formats, leaving grade in wide form, but have been unsuccessful. Any help would be greatly appreciated!
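One way to picture the row being asked for is a self-join within household and month: every person with a marital status is paired with every other household member who has a grade. The sketch below shows that logic in plain Python rather than Stata, on made-up rows; the values, the marr_stat codes, and the output names status_pnum and grade_pnum are assumptions for illustration only. In Stata, the same idea corresponds to forming pairwise combinations within ssuid and monthcode (for example with joinby), but that is not shown here.

```python
from collections import defaultdict

# Hypothetical long-format rows: one row per person per month.
# The marr_stat codes and grade values are invented for illustration.
rows = [
    {"ssuid": "A", "pnum": 101, "monthcode": 1, "marr_stat": 1, "grade": None},
    {"ssuid": "A", "pnum": 102, "monthcode": 1, "marr_stat": None, "grade": 3},
    {"ssuid": "A", "pnum": 101, "monthcode": 2, "marr_stat": 4, "grade": None},
    {"ssuid": "A", "pnum": 102, "monthcode": 2, "marr_stat": None, "grade": 3},
    {"ssuid": "B", "pnum": 101, "monthcode": 1, "marr_stat": 1, "grade": None},
]


def link_within_household(rows):
    """Pair each person's marr_stat with every other member's grade, by household and month."""
    groups = defaultdict(list)
    for r in rows:
        groups[(r["ssuid"], r["monthcode"])].append(r)

    linked = []
    for ssuid, month in sorted(groups):
        members = groups[(ssuid, month)]
        for a in members:
            if a["marr_stat"] is None:
                continue  # this person carries no marital status
            for b in members:
                if b["pnum"] == a["pnum"] or b["grade"] is None:
                    continue  # skip self-pairs and people without a grade
                linked.append({
                    "ssuid": ssuid,
                    "monthcode": month,
                    "status_pnum": a["pnum"],
                    "marr_stat": a["marr_stat"],
                    "grade_pnum": b["pnum"],
                    "grade": b["grade"],
                })
    return linked


linked = link_within_household(rows)
```

Household B contributes no rows because nobody in it has a grade recorded. Whether such households should be kept (with a missing grade) depends on the regression being run.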

Related

Cohort Tracking in Power BI

I need to build a student cohort tracking pbix so as to show students who have progressed onto the next consecutive year, students who have continued their studies and other similar metrics. Currently, I have a standard star schema as follows:
Fact Enrolment – Logs all enrolment activity for each student (multiple records can exist in the fact for each student based on different years, statuses, courses etc)
Student – Shows all students and their personal details such as email addresses, phone numbers etc. I’d rather not build upon this table as it is already quite large.
Year of Study – This table helps to identify which year a student is studying in (e.g. second year)
University Academic Year – This lists all academic years (e.g. 2017/18)
Student Status Per Year – This table lists all the possible statuses a student can have for a particular year of their degree, such as ‘Current Student’, ‘Withdrawn’, ‘Transferred’
I was thinking of building a dimension in Power Query which shows cohort tracking for each student and links back to the fact in the standard one-to-many relationship. This will enable end-users to slice the data further by faculty etc. However, I’m not entirely sure how to do this. I was thinking of using Cohort Analysis but this does not appear to do what I need it to.
Any advice would be much appreciated.

longitudinal dataset categorical variables

I have a longitudinal dataset which contains variables on individuals from 2 waves, from Feb and June, which measure economic activity across these individuals. The variables from the Feb and May waves are categorical, and I am running the proportion command in Stata to get the individual change in economic activity. For example, I am looking for changes in hours worked across the 2 waves. I run proportion but cannot figure out the if condition, as I only want individuals who responded in both Feb and June. I want to drop all those who responded in Feb but not in May, or vice versa.
Let's suppose you have an identifier variable id and a time-like variable, wave that takes values 1 and 2. If so, you are looking for individuals that satisfy
bysort id (wave) : gen wanted = wave[1] == 1 & wave[2] == 2
So wanted is an indicator that is 1 for individuals present in both waves and 0 otherwise, and if wanted is an if condition that selects those individuals. This assumes at most one observation per individual per wave.
There are many variations on this, depending on: your variable names; your data layout; how the information on waves is held (could be also, say, a string variable containing values like "Feb", "May" or "June", or a numeric variable holding dates).
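For readers who want to check the logic outside Stata, here is a minimal sketch of the same indicator in plain Python. The (id, wave) pairs are made up, and the function name present_in_both_waves is hypothetical. Unlike the one-line Stata version, it collects waves into a set, so repeated rows for the same wave do not matter.

```python
from collections import defaultdict

# Hypothetical (id, wave) pairs; wave takes the values 1 and 2.
records = [(1, 1), (1, 2), (2, 1), (3, 2), (4, 1), (4, 2)]


def present_in_both_waves(records, waves=(1, 2)):
    """Return {id: True/False}, True when the id appears in every listed wave."""
    seen = defaultdict(set)
    for pid, wave in records:
        seen[pid].add(wave)
    required = set(waves)
    return {pid: required <= observed for pid, observed in seen.items()}


wanted = present_in_both_waves(records)

# Keep only rows for individuals observed in both waves.
balanced = [(pid, wave) for pid, wave in records if wanted[pid]]
```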
You gave a broad-brush description sketching the problem but almost no precise information on the data. The stata tag wiki gives much detailed advice on how to post a question and flags the importance of giving a concrete data example.

In SAS, how do you create a certain number of records where the primary outcome does not occur based on the value of another variable?

I am examining the effect of passing vs running plays on injuries across a few football seasons. The way the data were collected, all injuries were recorded, along with information about the play in which the injury occurred (i.e. position, quarter, play type), game info (i.e. weather conditions, playing surface, etc.), and team info (i.e. number of pass vs run plays in the game).
I would like to use one play as the primary exposure with the outcome as injury vs no injury with analysis using logistic regression, but to do so I would need to create all the records with no injury. There is a range from 0 to around 6-7 injuries in a game for a team, and the total passing and running plays are recorded so I would need to find a way to add X (total passing plays minus injuries on passing plays) and Y (total running plays - injuries on running plays) records that share all the details for that particular game but have no injury as the outcome. I imagine there is a way in proc sql to do this, but I could not find it online. How would I go about coding this?
I have attached an example of the relevant data. An example of what I would need to do is for game 1 add 30 records for passing plays and 38 records for running plays with outcome of no injury and otherwise the same data (team A, dry weather, game plays).
You can use the FREQ statement in PROC LOGISTIC to avoid having to de-aggregate the data.
"The FREQ statement identifies a variable that contains the frequency of occurrence of each observation. PROC LOGISTIC treats each observation as if it appears n times, where n is the value of the FREQ variable for the observation. If it is not an integer, the frequency value is truncated to an integer." (SAS Documentation)
De-aggregating the data would require a DATA step with a DO loop, which is not recommended.
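To make the two options concrete, here is a minimal sketch in plain Python (not SAS) of what the frequency-weighted layout and the de-aggregated layout look like for a single game. The play and injury totals are invented so that they reproduce the 30 passing and 38 running non-injury records mentioned in the question; the field names and the functions frequency_rows and expand are hypothetical.

```python
# Hypothetical game-level record; totals chosen so that
# 32 - 2 = 30 non-injury passing plays and 39 - 1 = 38 non-injury running plays.
game1 = {
    "game": 1, "team": "A", "weather": "dry",
    "pass_plays": 32, "run_plays": 39,
    "pass_injuries": 2, "run_injuries": 1,
}


def frequency_rows(game):
    """One row per (play type, injury outcome) with a freq count: the weighted layout."""
    shared = {k: game[k] for k in ("game", "team", "weather")}
    rows = []
    for play_type in ("pass", "run"):
        total = game[f"{play_type}_plays"]
        injured = game[f"{play_type}_injuries"]
        for injury, freq in ((1, injured), (0, total - injured)):
            if freq > 0:
                rows.append({**shared, "play_type": play_type,
                             "injury": injury, "freq": freq})
    return rows


def expand(rows):
    """De-aggregate: repeat each row freq times, dropping the freq field."""
    out = []
    for row in rows:
        record = {k: v for k, v in row.items() if k != "freq"}
        out.extend(dict(record) for _ in range(row["freq"]))
    return out


weighted = frequency_rows(game1)   # 4 rows, suitable for a frequency variable
plays = expand(weighted)           # 71 rows, one per play
```

The weighted layout carries the same information in four rows instead of 71, which is why a frequency variable is usually the more convenient route.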

How do I perform spatial logistic regression in SAS?

I am trying to develop a spatiotemporal logistic regression model to predict the presence/absence of a disease in U.S. counties (contiguous U.S.) based on climatologic variables, with data points for each year between 2007 and 2014; ideally, I would like a model with functionality to score additional datasets, e.g., use the model developed for 2006-2014 to predict disease probability in future climate scenarios. The model needs to account for spatial autocorrelation, and (again, ideally) repeated measures (each county has one data point per year). Unfortunately, my SAS abilities are not up to the task. Would anyone have suggestions for developing the model? The data, in csv format, take the form of:
countyFIPS year outcome predictor1 predictor2 predictor3 latitude longitude
where
countyFIPS = unique 5-digit identifier for U.S. counties
outcome = at least one case in the county for the given year, coded 0/1
latitude and longitude denote the centroid of the county
I'm really bad at this, so please be gentle and use small words...

AWS Machine Learning prediction by three fields

I have created a model in AWS
that contains sales records by date,
for example:
Type: Sale,Time:2016-08-01,Success:1 (1 is a boolean)
I want to predict how many sales there will be one month after the latest date (2016-08-01),
which means a combination of Type=Sale AND Time > 2016-08-01 AND Success=1.
Any idea how to achieve this?
Thank you.
You need to aggregate your data into a wider array of attributes to be able to use Amazon ML for such predictions. You can use different levels of aggregation, for example daily, weekly and monthly.
You should also add any relevant information for the items that you are selling. For example, if you are selling umbrellas, you should add information about the amount of rain on that day, or if you are selling flowers, you should add information about day of the week or proximity to holidays, when people are buying more flowers.
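As a rough illustration of the aggregation step, the sketch below counts successful sales per day, ISO week, or month in plain Python. The records are made up and only mirror the Type/Time/Success fields from the question; the function name aggregate and the key formats are assumptions, and nothing here calls an AWS API. The resulting counts, joined with extra attributes such as weather or holidays, would form the training rows.

```python
from collections import Counter
from datetime import date

# Hypothetical raw records shaped like the example in the question.
sales = [
    {"type": "Sale", "time": "2016-07-30", "success": 1},
    {"type": "Sale", "time": "2016-07-31", "success": 0},
    {"type": "Sale", "time": "2016-08-01", "success": 1},
    {"type": "Sale", "time": "2016-08-01", "success": 1},
]


def aggregate(records, level):
    """Count successful sales per period; level is 'daily', 'weekly' or 'monthly'."""
    counts = Counter()
    for r in records:
        if r["type"] != "Sale" or r["success"] != 1:
            continue  # only successful sales are counted
        d = date.fromisoformat(r["time"])
        if level == "daily":
            key = d.isoformat()
        elif level == "weekly":
            iso = d.isocalendar()  # (ISO year, ISO week, ISO weekday)
            key = f"{iso[0]}-W{iso[1]:02d}"
        elif level == "monthly":
            key = f"{d.year}-{d.month:02d}"
        else:
            raise ValueError(f"unknown level: {level}")
        counts[key] += 1
    return dict(counts)


daily = aggregate(sales, "daily")
monthly = aggregate(sales, "monthly")
```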