I have data that looks like this:
where month is the number of months that have passed, vegetable is the category of interest, and n_spoiled is the number of vegetables from that category that spoiled after that many months.
I am interested in running a survival analysis to compare the curves for these three categories (proc lifetest).
It is my understanding that to run a survival analysis in SAS we need the 'uncollapsed' version of this data, so that, for example, we would see 3289 entries with month=1 and potato, 9 entries with month=1 and onion, and so on. None of these entries would need to be censored for the analysis, as all non-completions were omitted from the aggregated data.
I would really appreciate if someone could help me modify the data so that it runs OR alternatively, instruct me as to how to run the test without 'uncollapsing' the data.
Thank you.
I'm attempting to write a formula for Excel that will look at 3 different columns of data in a query table that I'm pulling from my team's SharePoint site. I want it to return a Yes/No result.
The formula I've written so far is
=IF(AND(ISTEXT(Governance_Master_List__2[Is the check c]="Yes")+(AND(ISTEXT(Governance_Master_List__2[Merged Month and Year Due]="January 2023")+(AND(ISTEXT(Governance_Master_List__2[Frequency]="Weekly")))))),"Yes","No")
So, in this example I want the formula to look at [Is the check c] for a "Yes", then look at [Merged Month and Year Due] for January 2023, and then look at [Frequency] for "Weekly", and if all 3 of those are met, show the result Yes on another worksheet.
So far the formula isn't throwing an error when making it, but it only returns the No result, even when I ensure that all the Weekly, January 2023 entries are showing as Yes.
I'm very new to Excel formulas and am learning as I go. I suspect that, because the reference query table I'm looking at has many different values (e.g. the frequency has weekly, 6 monthly, monthly, quarterly etc., and there are 12 months to choose from), the formula is stopping because it's looking at those other options rather than ignoring them, hence the constant false result.
I've tried the following variations with no result:
=IF(AND(ISTEXT(Governance_Master_List__2[Is the check c]="Yes")(AND(ISTEXT(Governance_Master_List__2[Merged Month and Year Due]="January 2023")(AND(ISTEXT(Governance_Master_List__2[Frequency]="Weekly")))))),"Yes","No")
=IF(AND('Imported data'!C1:C200="Weekly")+(AND('Imported data'!E1:E200="January 2023")+(AND('Imported data'!H1:H200="Yes"))),"Yes","No")
=IF(AND('Imported data'!C1:C200="Weekly")(AND('Imported data'!E1:E200="January 2023")(AND('Imported data'!H1:H200="Yes"))),"Yes","No")
The output Excel sheet is to show our quarterly compliance reporting. I'm simply trying to automate the process of entering the data into that report, as the teams use a SharePoint list to enter their compliance tasks.
Anyone have a suggestion on how I can get this working? Eventually it'll be used to populate a number of different yes/no report cells based on the relevant month and frequency of the check.
I'm new to Google AutoML Tables and have a basic question about which data is worthwhile including in the training of my model.
I have a dataset of golfers and will be looking at the averages of scores over different periods. For example, average over the past 3 months, 6 months, 1 year etc.
My question is: is it worthwhile also including the sample size for each date range for each player? For example, over the past 3 months, some players will have a sample size of 28 while some will only have 2. Those players that have 28 rounds will have more accurate averages than those with 2. However, I don't know whether Google AutoML Tables would pick up this link automatically, whether I could create a different weighting/reliability variable, or whether there's a way to specify a link between columns. Or is this automated type of AutoML not really suitable?
Thanks in advance
I am examining the effect of passing vs running plays on injuries across a few football seasons. The way the data was collected, all injuries were recorded, as well as information about the play in which the injury occurred (i.e. position, quarter, play type), game info (i.e. weather conditions, playing surface, etc.), and team info (i.e. number of pass vs run plays in the game).
I would like to use play type as the primary exposure and injury vs no injury as the outcome, analysed with logistic regression, but to do so I would need to create all the records with no injury. A team has anywhere from 0 to around 6-7 injuries in a game, and the total passing and running plays are recorded, so I would need to find a way to add X (total passing plays minus injuries on passing plays) and Y (total running plays minus injuries on running plays) records that share all the details for that particular game but have no injury as the outcome. I imagine there is a way to do this in proc sql, but I could not find it online. How would I go about coding this?
I have attached an example of the relevant data. An example of what I would need to do is for game 1 add 30 records for passing plays and 38 records for running plays with outcome of no injury and otherwise the same data (team A, dry weather, game plays).
You can use the FREQ statement to avoid having to de-aggregate the data.
"The FREQ statement identifies a variable that contains the frequency of occurrence of each observation. PROC LOGISTIC treats each observation as if it appears n times, where n is the value of the FREQ variable for the observation. If it is not an integer, the frequency value is truncated to an integer." (SAS Documentation)
De-aggregating the data would require a DATA step with a DO loop that outputs each observation as many times as its count. It's not recommended to do this.
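To illustrate why a frequency variable can stand in for de-aggregation, here is a minimal Python sketch (not SAS). It assumes a hypothetical layout with one row per game, play type and outcome plus a freq count. The 30 and 38 no-injury counts come from the game 1 example in the question; the injury counts of 2 and 1 are invented for illustration.

```python
from collections import Counter

# Hypothetical aggregated rows: (game, play_type, injured, freq).
# 30 and 38 are the no-injury counts from the question's game 1 example;
# the injury counts (2 and 1) are made up for this sketch.
aggregated = [
    (1, "pass", 0, 30),
    (1, "pass", 1, 2),
    (1, "run", 0, 38),
    (1, "run", 1, 1),
]


def expand(rows):
    """De-aggregate: repeat each row freq times, as a DATA step DO loop would."""
    return [
        (game, play, injured)
        for game, play, injured, freq in rows
        for _ in range(freq)
    ]


def weighted_counts(rows):
    """Count each (play_type, injured) cell directly from the freq column."""
    counts = Counter()
    for _game, play, injured, freq in rows:
        counts[(play, injured)] += freq
    return counts


expanded = expand(aggregated)
expanded_counts = Counter((play, injured) for _game, play, injured in expanded)

# Both routes give the same cell counts, so the expansion adds rows but no information.
print(expanded_counts == weighted_counts(aggregated))
```

The expanded list has one element per play, while the aggregated form stays at four rows no matter how many plays there are, which is the practical reason to prefer the frequency variable.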
I have a rather simple question regarding the output of tabstat command in Stata.
To be more specific, I have a large panel dataset containing several hundred thousand observations over a 9-year period.
The context:
bysort year industry: egen total_expenses=total(expenses)
This line should create total expenses by year and industry (that is, the sum of all expenses across all ids in one particular year for one particular industry).
Then I'm using:
tabstat total_expenses, by(country)
As far as I understand, tabstat should show the means of expenses in table format. Please note that ids are different from countries.
In this case, does tabstat calculate the mean over all 9 years and all industries for a particular country, or is it just the mean of one year and one industry for each country in my panel data?
What would happen if this command is used in the following context:
bysort year industry: egen mean_expenses=mean(expenses)
tabstat mean_expenses, by(country)
Does tabstat create means of means? This is a little bit confusing.
I don't know what is confusing you about what tabstat does, but you need to be clear about what calculating means implies. Your dataset is far too big to post here, but for your sake as well as ours creating a tiny sandbox dataset would help you see what is going on. You should experiment with examples where the correct answer (what you want) is obvious or at least easy to calculate.
As a detail, your explanation that ids are different from countries is itself confusing. My guess is that your data are on firms and the identifier concerned identifies the firm. Then you have aggregations by industry and by country and separately by year.
bysort year industry: egen total_expenses = total(expenses)
This does calculate totals and assigns them to every observation. Thus if there are 123 observations for industry A and 2013, there will be 123 identical values of the total in the new variable.
tabstat total_expenses, by(country)
The important detail is that tabstat by default calculates and shows a mean. It just works on all the observations available, unless you specify otherwise. Stata has no memory or understanding of how total_expenses was just calculated. The mean will take no account of different numbers in each (industry, year) combination. There is no selection of individual values for (industry, year) combinations.
Your final question really has the same flavour. What your command asks for is a brute force calculation using all available data. In effect your calculations are weighted by the numbers of observations in whatever combinations of industry, country and year are being aggregated.
I suspect that you need to learn about two commands (1) collapse and (2) egen, specifically its tag() function. If you are using Stata 16, frames may be useful to you. That should apply to any future reader of this using a later version.
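The weighting effect described above can be seen in a tiny sandbox. This is a Python sketch rather than Stata, with made-up numbers: three firms in industry A and one in industry B, all in the same year and country. The first calculation imitates the egen total() followed by tabstat sequence; the second keeps one value per (year, industry) combination, which is roughly what tagging one observation per group would achieve.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical firm-level rows: (country, industry, year, expenses).
rows = [
    ("X", "A", 2013, 10.0),
    ("X", "A", 2013, 20.0),
    ("X", "A", 2013, 30.0),
    ("X", "B", 2013, 100.0),
]

# Like egen total(): compute each (year, industry) total ...
totals = defaultdict(float)
for _country, industry, year, expenses in rows:
    totals[(year, industry)] += expenses

# ... and attach it to every observation in that group.
total_expenses = [totals[(year, industry)] for _c, industry, year, _e in rows]

# Like tabstat's default mean: every observation counts once,
# so industry A's total is counted three times.
mean_over_observations = mean(total_expenses)

# One value per (year, industry) combination instead.
mean_over_groups = mean(list(totals.values()))

print(total_expenses)          # [60.0, 60.0, 60.0, 100.0]
print(mean_over_observations)  # 70.0
print(mean_over_groups)        # 80.0
```

The two means differ (70 versus 80) only because industry A contributes three observations, which is the implicit weighting by group size.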
I have a couple of different tables in my report. For demonstration purposes, let's say that I have one data source that is Actual Invoice amounts and another table that is Forecasted amounts. Each table has several dimensions that are the same between them, let's say Country, Region, Product Classification and Product.
What I want is to be able to display a table/matrix that pulls information from both of these data sources, like this:
Description                 Invoice  Forecast  vs Forecast
USA                             300       325          92%
  East                          150       175          86%
    Product Grouping 1          125       125         100%
      Product 1                  50        75          67%
      Product 2                  75        50         150%
    Product Grouping 3           25        50          50%
      Product 3                  25        50          50%
  West                          150       150         100%
    Product Grouping 1           75       100          75%
      Product 1                  25        50          50%
      Product 2                  50        50         100%
    Product Grouping 3           75        50         150%
      Product 3                  75        50         150%
I have not been able to figure out a way to combine the information from the multiple data sources into a single matrix table, so any help would be appreciated. The one thing I did find was that somebody had hard-coded the structure of the rows into a separate data source and then used DAX expressions to pull the pieces of information into the columns, but I don't like this solution because the structure of the rows is not constant.
What you're asking about is a common part of the star schema: combining facts from different fact tables together into a single visual or report.
What Not To Do (That You Might Be Tempted To)
What you don't want to do is combine the 2 fact tables into a single table in your Power BI data model. That's a lot of work and there's absolutely no need, especially since there are likely dimensions that the 2 fact tables do not have in common (e.g. actual amounts might be associated with a customer dimension, but forecast amounts wouldn't be).
What you also don't want to do is relate the 2 fact tables to each other in any way. Again, that's a lot of work. (Especially since there's no natural way to relate them at the row level.)
What To Do
Generally, you handle 2 fact tables the same way you handle a single fact table. First, you have your dimensions (country, region, classification, product, date, customer). Then you load your fact tables and join them to the dimensions. You do not join your fact tables to each other. You then create measures (i.e. DAX expressions).
When you want to combine measures from the two facts together in a single matrix, you only use rows/columns that are meaningful to both fact tables. For example, actual amounts might be associated with a customer, but forecast amounts aren't. So you can't include customer information in the matrix. Another possibility is that actual amounts are recorded each day, whereas forecasts were done for the whole month. In this situation, you could put month in your matrix (since that's meaningful to both), but you wouldn't want to use date because Power BI wouldn't know how to divide up forecasts to individual dates.
As long as you're only using dimensions & attributes that are meaningful to both fact tables, you can easily create a matrix as you envision above. Simply drag on the attributes you want, then add the measures (i.e. DAX expressions).
The Invoice & Forecast columns would both be measures. The two measures from different fact tables can be combined into a 3rd measure for the vs. Forecast measure. Everything will work as long as you're just using dimensions/attributes that mean something to both fact tables.
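The idea can be sketched outside Power BI. The Python below is not DAX; it is a minimal illustration in which each "measure" is just a sum over one fact table grouped by dimensions both tables share. The row layout, the customer values and the helper names (measure, matrix) are assumptions made for this sketch; the amounts are chosen to reproduce the East rows of the example table.

```python
from collections import defaultdict

# Hypothetical fact rows. Invoices carry a customer; forecasts do not.
invoices = [
    {"region": "East", "product": "Product 1", "customer": "C1", "amount": 20},
    {"region": "East", "product": "Product 1", "customer": "C2", "amount": 30},
    {"region": "East", "product": "Product 2", "customer": "C1", "amount": 75},
]
forecasts = [
    {"region": "East", "product": "Product 1", "amount": 75},
    {"region": "East", "product": "Product 2", "amount": 50},
]


def measure(fact_rows, dims):
    """Sum amount per combination of the given dimension attributes."""
    totals = defaultdict(int)
    for row in fact_rows:
        totals[tuple(row[d] for d in dims)] += row["amount"]
    return totals


def matrix(dims):
    """Invoice, Forecast and Invoice / Forecast for each shared-dimension key."""
    inv = measure(invoices, dims)
    fc = measure(forecasts, dims)
    keys = sorted(set(inv) | set(fc))
    # Skip keys with no forecast to avoid dividing by zero.
    return {k: (inv[k], fc[k], inv[k] / fc[k]) for k in keys if fc[k]}


by_product = matrix(("region", "product"))
by_region = matrix(("region",))

for key, (invoice, forecast, ratio) in by_product.items():
    print(key, invoice, forecast, f"{ratio:.0%}")
```

Grouping by ("region", "customer") would raise a KeyError on the forecast rows, which mirrors the point above: only attributes that are meaningful to both fact tables can go on the matrix.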
I don't see anything in your proposed pivot table that strikes me as problematic.
Other Situations
If you have a situation where forecasts are at a month level and actual is at a date level, then you may be wondering how you'd relate them both to the same date dimension. This situation is called having different granularities, and there's a good article here I'd recommend reading that has advice: https://www.daxpatterns.com/handling-different-granularities/. Indeed, there's a whole section on comparing budget with revenue that you might find useful.
Finally, you mention that someone hard-coded the structure of the rows and used DAX expressions to build everything. This does, admittedly, sound like overkill. The goal with Power BI is flexibility. Once you have your facts, measures & dimensions, you can combine them in any way that makes sense. Hard-coding the rows eliminates that flexibility, and is a good clue that something isn't right. (Another good clue that something isn't right is when DAX expressions seem really complicated for something that should be easy.)
I hope my answer helps. It's a general answer since your question is general. If you have specific questions about your specific situation, definitely post additional questions. (Sample data, a description of the model, the problem you're seeing, and what you want to see is helpful to get a good answer.)
If you're brand new to Power BI, data models, and the star schema, Alberto Ferrari and Marco Russo have an excellent book that I'd recommend reading to get a crash course: https://www.sqlbi.com/books/analyzing-data-with-microsoft-power-bi-and-power-pivot-for-excel/