I am using Stata and completing a competing risks regression with secondary cancer diagnosis as the failure and death as a competing risk.
I am not sure if I am using the stset command correctly. The code I am using is this:
stset diagtime, time0(diagnosisdate1) origin(time diagnosisdate1) exit(diagnosisdate2) failure(fail==1)
Where "diagtime" is the time between primary and secondary diagnosis and fail == 1 is the occurrence of a secondary diagnosis.
I need to specify death as a competing failure when I run the regression, but I am not sure whether this should be specified as death alone, or as death as well as no second diagnosis.
A delayed response, but in case others find it helpful.
I can't speak to the t0 and origin options being correct without seeing the dataset. For the failure option, though: regardless of what type of competing risks model you're estimating, your stset setup is already in the right form. Stripped down to the key parts:
stset diagtime, failure(fail==1)
Because fail==1 represents your event of interest--secondary diagnosis.
If you're using stcrreg, you must specify the competing event as an option. Say death (your competing event) is represented by iAmDeath==1. The stcrreg syntax would be:
stcrreg [varlist] [if] [in], compete(iAmDeath==1)
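Putting the two pieces together, a minimal sketch might look like the following (the covariates age and i.sex and the death indicator iAmDeath are placeholders, not from your data):
* declare secondary diagnosis as the event of interest
stset diagtime, failure(fail==1)
* Fine-Gray competing risks regression with death as the competing event
stcrreg age i.sex, compete(iAmDeath==1)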
For competing risks with any other type of canned survival model in Stata, you're implicitly taking a latent approach to competing risks. That means you're treating all events other than the 'primary' one of interest as right censored. Ergo, there is nothing additional you must do beyond setting stset's failure option correctly (i.e., to your primary event of interest, as you do in your stset statement).
I want to get a general idea of how I can optimise query performance in a Redshift database. I have huge queries with lots of joins. I understand that some of this can be achieved with sort and dist keys, but is there a method we can follow to get optimal results?
What should I look for in a table, and how should I approach query optimisation in Redshift?
What are the necessary steps to follow in order to arrive at a concrete optimisation plan?
Any guidance will help a lot
Having improved many queries on Redshift, there are a few things I can point you towards. First, let me list a few tools / techniques to make sure you have these in your toolbox:
Ability to read an EXPLAIN plan and find expected costly points
Know where to find the query "actual" execution report
Know the system tables to find join, distribution, and disk io reports
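As a hedged example of that last point, queries along these lines against the standard system tables will show per-step row counts, disk-based steps, and network distribution (the query ID 123456 is a placeholder; find yours in STL_QUERY):
-- Per-step actuals for one query: rows, bytes, and whether a step spilled to disk.
SELECT query, seg, step, label, rows, bytes, is_diskbased
FROM svl_query_summary
WHERE query = 123456
ORDER BY seg, step;

-- Rows and bytes distributed across the network for the same query.
SELECT query, slice, segment, step, rows, bytes
FROM stl_dist
WHERE query = 123456
ORDER BY segment, step;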
So with those understood, let's look at where many queries go sideways on Redshift. I will try to list these in Pareto order, but any of these, or combinations of them, can create significant issues.
#1 - Fat in the middle queries. When joining, it is possible to expand the number of rows being operated upon many fold. Cross joining is a clear way this can happen, but it isn't how this usually happens. If the join-on conditions create a many-to-many join pattern, the number of rows can expand. When the table sizes are very large, this "multiplication" can produce absurd data sizes. The explain plan can show this, but not always - use of DISTINCT and GROUP BY can "hide" the true size of the dataset in play. Performing a SELECT COUNT(*) on your join tree can help show how big this is. You may also need to look at pieces of the join tree if a later join is collapsing the rows (failure of the query optimizer?). Redshift is a columnar database and is not well set up for the creation of data - this includes during the execution of a query.
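For example, a quick check along these lines (table and column names are hypothetical) reveals the true row count of the join tree before any DISTINCT or GROUP BY collapses it:
SELECT COUNT(*)
FROM fact_sales f
JOIN dim_customer c ON c.customer_id = f.customer_id
JOIN dim_product p ON p.product_id = f.product_id;
-- If this count is orders of magnitude larger than the input tables, the query is fat in the middle.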
#2 - Distribution of large amounts of data. Redshift is a cluster, and the nodes are connected together by ethernet cables; these connections are the slowest part of the cluster. A lot of work is done by the query optimizer to minimize the amount of data that needs to move around the network. However, it doesn't know your data as well as you do and doesn't always do this well. Look at the types of joins you are getting - is distribution needed? How much data is being distributed? Also, GROUP BY (and window functions) need to combine rows and therefore may need redistribution to complete. How big are the data sets entering your aggregation steps?
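One way to see what the optimizer intends to do is the join attribute in the EXPLAIN output (the query below is a hypothetical illustration):
EXPLAIN
SELECT c.region, SUM(f.amount)
FROM fact_sales f
JOIN dim_customer c ON c.customer_id = f.customer_id
GROUP BY c.region;
-- In the plan, DS_DIST_NONE / DS_DIST_ALL_NONE mean no redistribution is needed,
-- while DS_DIST_INNER, DS_BCAST_INNER, or DS_DIST_BOTH mean data will move across the network.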
Moving a lot of data around the network will be slow. The difficulty is that it isn't always clear how to reduce this movement. Large join trees like you say you have can do "odd" things when it comes to the resulting distribution of the "joined" data. Joins are performed one at a time, and the order in which they happen can matter. The query optimizer makes a number of decisions about the order of joins and how to organize the resulting data from each join. The choices it makes are based on what it sees in the table metadata, so completeness of metadata matters. WHERE conditions can also impact the optimizer's choices. There are just way too many interactions to itemize them here. The best advice is to look at the performance per step and see if data distribution is a factor. Then work to control how data is distributed in the query's execution. This may mean changing the join trees or even decomposing the query into several, with temp tables that have their distribution set so that data movement is minimized.
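A rough sketch of that decomposition idea, with hypothetical table and column names - pare the data down first, pin its distribution, then join:
CREATE TEMP TABLE recent_orders
DISTKEY (customer_id)
SORTKEY (order_date)
AS
SELECT customer_id, order_date, amount
FROM fact_orders
WHERE order_date >= '2023-01-01';

-- The follow-up join can now be collocated on customer_id
-- (assuming dim_customer is also distributed on it or is DISTSTYLE ALL).
SELECT c.region, SUM(o.amount)
FROM recent_orders o
JOIN dim_customer c ON c.customer_id = o.customer_id
GROUP BY c.region;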
#3 - Excessive IO traffic. While not as slow as the network, the disk IO subsystem is often a bottleneck. This shows up in a few ways. Are you reading more data from disk than is needed? (Is the metadata up to date?) Do you need a redundant WHERE clause to eliminate data? (A redundant WHERE clause is one that isn't needed functionally but is added so Redshift can perform the metadata comparisons that reduce the data read at scan.) Data spill is another way that disk IO can be strained (this goes back to #1). If data needs to spill to disk, it can bring disk IO performance down considerably. Use your metadata and WHERE clauses well.
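To illustrate the redundant WHERE clause idea (hypothetical tables; assume fact_events is sorted on event_date):
SELECT f.*
FROM fact_events f
JOIN recent_batch b ON b.event_id = f.event_id
-- Functionally redundant if recent_batch only contains June events, but it lets
-- Redshift use the sort-key metadata to skip blocks at scan time.
WHERE f.event_date BETWEEN '2023-06-01' AND '2023-06-30';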
Now these 3 areas often team up to kill your performance. Read too many rows from your tables, join all these extra rows together across the network while also making many new rows. This data doesn't fit in memory so now Redshift needs to spill to disk to complete the query. Things slow down real fast in these conditions.
Lastly, these factors I've listed are cluster-wide "resources" of Redshift. If one query takes up a lot of one of these, then there is less for other queries running at the same time. What often happens is that the query writers on a cluster follow similar patterns (good or bad), and when their pattern is costly on one axis then many of their queries are costly on the same axis. This shows up as queries that work "ok" when run in isolation but very badly when others are using the cluster. This generally means that many queries are contributing to pushing the cluster "over the edge" on some limited resource. There are system tables you can look at to see aggregated IO or network traffic to see these effects.
To sum up, good queries:
Don't make a lot of new "rows" during execution (not fat in the middle)
Keep large data sets "on node" and only redistribute data once the data has been pared down significantly
Don't read more data from disk than is necessary and don't spill
The problem is that doing all of these isn't always possible; the trick is to not oversubscribe the cluster resources you have.
Having read up on many Kimball design tips regarding fact tables (transaction, accumulating, periodic, etc.), I'm still vague about what I should do in my case of updating a fact table, which I believe is not that uncommon. On to the case.
We're processing complaints from clients, and we want to be able to reflect the current status of a complaint in the Data Warehouse. Our complaints have a workflow of statuses they go through and different assignees that deal with them over time, but for our analysis this is irrelevant as of now. We would like to review what the current situation of a complaint is.
To my understanding, the grain of the fact table would be a single complaint, with columns (it is irrelevant for this question whether they should be junk dimensions, degenerate dimensions, etc.) such as:
Complaint Number
Current Status
Current Status Date
Current Assignee
Type of complaint
As far as I understand, since we don't want to view the process history but instead see what the current status of the process is, storing multiple rows for each complaint representing its state is overkill, so instead we store only one row per complaint and update it.
Now, is my reasoning correct? In the above case, complaint number and type of complaint store values that don't change, while the "Current" columns do, and we need to update the row, so we could implement a Change Data Capture mechanism (just like we do for dimensions right now) to compare incoming rows from the source system with the currently stored fact rows, to improve the time cost of such an operation.
It honestly looks like a dimension table with mixed SCD Type 0 & 1 to me, but it stores facts of receiving complaints.
SO Post for reference: Fact table with information that is regularly updatable in source system
Edit
I'm aware that I could use an accumulating fact table with timestamps, which is somewhat akin to SCD Type 2, but the end user doesn't really care about the history of the process. There are more facts involved in the analysis later on, so separating this need from the data warehouse doesn't really work in this case.
I’ve encountered similar use cases in the past, where an accumulating snapshot would be the default solution.
However, the accumulating snapshot doesn't allow processes of varying length. I've designed a different pattern, where 2 rows are added for each event: if an object goes from state A to state B, you first insert a row with state A and quantity -1, then a new one with state B and quantity +1.
The end result allows:
- no updates necessary, only inserts;
- map-reduce friendly;
- arbitrary length processes;
- counting how many of each in each state at any point in time (with the help of a periodic snapshot for performance reasons);
- how many entered or left any state at any point in time;
- calculate time in each state and age overall.
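To make the pattern concrete, here is a minimal SQL sketch for the complaint case (table and column names are hypothetical; the actual loading would be done in your ETL tool):
-- A complaint moving from 'OPEN' to 'IN PROGRESS' produces two inserted rows.
INSERT INTO fact_complaint_status (complaint_id, status, event_date, quantity)
VALUES (1001, 'OPEN', '2023-05-10', -1),
       (1001, 'IN PROGRESS', '2023-05-10', 1);

-- How many complaints are in each state as of a given date: sum the quantities.
SELECT status, SUM(quantity) AS complaints_in_state
FROM fact_complaint_status
WHERE event_date <= '2023-05-31'
GROUP BY status;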
Details in 5 blog posts here (with implementation in Pentaho Data Integration):
http://ubiquis.co.uk/dwh/status-change-fact-table-part-1-the-problem/
I am using PROC GLIMMIX to analyze repeated measures data about specific sexual events. The original data came from a weekly diary study of about 400 people. During each week they reported on behaviours from their most recent sexual encounter. We also have baseline data on their demographics. Twelve weeks of observations were collected and we had a high completion rate.
I would like to create a mixed effect model, but I am unsure exactly how this is done in SAS. I want to test the effect of event-specific factors as well as some person-level demographics and would like to get odds ratios for each factor of interest. The outcome is whether or not drugs were used during the event, and the explanatory factors will be things like age, gender, etc., as well as characteristics of the event (e.g., the partner's HIV status, whether the partner was a regular sexual partner, etc.).
The code I'm working with follows this pattern:
PROC GLIMMIX DATA=work.dataset oddsratio;
CLASS VISIT_NUMBER PARTICIPANT_ID BINARY_EVENTLEVEL_OUTCOME BINARY_EVENTLEVEL_EXPLANATORY_FACTOR CATEGORICAL_PERSONLEVEL_EXPLANATORY_FACTOR;
MODEL BINARY_EVENTLEVEL_OUTCOME = BINARY_EVENTLEVEL_EXPLANATORY_FACTOR CATEGORICAL_PERSONLEVEL_EXPLANATORY_FACTOR /DIST=binary link=logit CL S ddfm=kr;
RANDOM ?????;
RUN;
option 1 for ?????: residual / subject=PARTICIPANT_ID
option 2 for ?????: INTERCEPT / subject=PARTICIPANT_ID
option 3 for ?????: VISIT_NUM / subject=PARTICIPANT_ID residual type=ar(1)
INTERCEPT / subject=VISIT_NUM(PARTICIPANT_ID)
option 4 for ?????: Other?
I am also unclear whether I should use ddfm=kr in my model statement or method=laplace in my proc statement -- both have been recommended elsewhere for this sort of repeated measures analysis.
I've come across several potential options for modelling this which often give similar results, but option 1 gives a statistically significant result for an event-level factor, while the others give non-significant results. The inclusion of ddfm=kr makes the result of interest more significant. method=laplace does not allow for option 1.
I may not be answering your question, but might be able to provide a couple of directions:
To start with the simplest part, your MODEL statement looks correct to me as you want to test event-level factors and person-level demographics which are thus considered as fixed effects.
Now, as far as the random effects are concerned:
the RANDOM statements you propose for options (1) and (2):
(1) RANDOM _residual_ / subject=PARTICIPANT_ID;
or
(2) RANDOM intercept / subject=PARTICIPANT_ID;
are modeling two different parts of the random effects: the R-side and the G-side, respectively.
If you are already familiar with PROC MIXED, you may notice that your option (1) of using RANDOM _residual_ in PROC GLIMMIX is equivalent to using the REPEATED statement in PROC MIXED, which says that you have repeated measures for PARTICIPANT_ID, which is clearly your case (Ref: "Comparing the GLIMMIX and MIXED Procedures").
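For concreteness, here is a minimal sketch of option (2) written out in full, using the placeholder names from your post. The (EVENT='1') response option assumes the outcome is coded so that 1 is the event of interest; whether you add ddfm=kr or switch to METHOD=LAPLACE is a separate decision. This just shows where the G-side random intercept goes:
PROC GLIMMIX DATA=work.dataset;
   CLASS PARTICIPANT_ID BINARY_EVENTLEVEL_EXPLANATORY_FACTOR
         CATEGORICAL_PERSONLEVEL_EXPLANATORY_FACTOR;
   MODEL BINARY_EVENTLEVEL_OUTCOME(EVENT='1') =
         BINARY_EVENTLEVEL_EXPLANATORY_FACTOR
         CATEGORICAL_PERSONLEVEL_EXPLANATORY_FACTOR
         / DIST=binary LINK=logit ODDSRATIO CL S;
   RANDOM intercept / subject=PARTICIPANT_ID;  /* G-side: participant-level random intercept */
RUN;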
As for option (3):
RANDOM VISIT_NUM / subject=PARTICIPANT_ID residual type=ar(1);
RANDOM intercept / subject=VISIT_NUM(PARTICIPANT_ID);
here you are modeling the time component of the repeated measures (visit_num) as a random effect, and this should be included when you believe there would be random variation of the response at each of the measurement times (i.e. at each event). At first glance, I don't believe this is relevant in your case, since you are taking this into account already through the fixed effects... but of course I may be wrong by not seeing your data.
Up to here is what I can contribute at this time.
As next steps for you to have a better understanding, I would suggest that you:
Read the Overview of the PROC GLIMMIX documentation, in particular the mathematical model specification and all 3 sections therein:
The Basic Model
G-Side and R-Side Random Effects and Covariance Structures
Relationship with Generalized Linear Models
If you are still unsure, ask your question at communities.sas.com which might be able to help you better.
HTH
We’re using JCo 3.0 to connect to RFCs and read data from SAP R/3. We use one RFC RFC_READ_TABLE often and use a second custom RFC to read employee information. My questions revolve around a third RFC RSAQ_REMOTE_QUERY_CALL. I'm calling an ad-hoc query I built in SAP using this RFC but I’m not getting the expected results. The main problem is that it appears that SAP is ignoring one of my selection criteria and using what was saved in SAP when I originally built it. The date criterion stored in my ad-hoc is 6/23/2013. If I pass in 6/28/2013 from JCo, I get the same results as if I had passed 6/23/2013 from JCo.
We have built several ad-hoc queries whose only criteria is a personnel number and call them successfully using RFC RSAQ_REMOTE_QUERY_CALL.
Background on my ad-hoc query: reporting period of today, joining together four aspects of an employee’s information: their latest action (hire, rehire, etc.), organization (e.g. company), pay (e.g. pay scale level) and communication (e.g. email). The query will run every workday.
Here are my questions:
1. My ad-hoc has three selection criteria. The first two are simple strings. The third is a date. The date will vary each time the query runs. We are referencing the first criterion using SP$00001, the second with SP$00002 and the third with SP$00003. The order of the criteria changes from the ad-hoc to SQ01 (what was SP$00001 in the ad-hoc is now SP$00003). Shouldn't we reference them in the order defined in the ad-hoc (e.g. SP$00001)?
2. The two simple string selections are using OPTION "EQ". The date criterion is using OPTION "GT" (greater than). Is "GT" correct?
3. We have some limited accessibility to SAP. Is there a way to see which SP$ parameters are mapped to which criteria?
4. If my ad-hoc was saved with five criteria but four of them never change when I call the ad-hoc from JCo, do I just need to set the value of the one, or do I need to set the other four as well?
5. Do I have to call this ad-hoc using a variant (function.getImportParameterList().setValue("VARIANT", "VARIANT_NAME"))?
6. Does the Reporting Period have an impact on the date criterion? I have tried changing the Reporting Period to be PNPBEGDA = today and PNPENDDA = today and noticed no change.
7. Is there a way in SAP to get a "declaration" of your ad-hoc (name, inputs, outputs, criteria)? I have looked at JCoFunction.toXml() and JCoFunctionTemplate. These are good if you want to see something at runtime before it goes to SAP, but I'm looking for something I can use on the JCo end to help me write Java code that matches the ad-hoc.
I have looked at length on the web for answers to my questions and have not found anything that is useful. If there is anything which would help me, please let me know.
Thanks,
LM
Since I don't know much about SQnn, I won't be able to answer all of your questions...
1. I don't know, sorry.
2. It should be, at least it's the usual operator for greater than.
3. Yes - set an external breakpoint right inside the function module and trace its execution while performing the RFC call. Warning: at least basic ABAP knowledge required.
4. I don't know, sorry.
5. I don't know either, sorry.
6. That would depend on the query, I suspect...
7. JCo won't be able to help you out there - it doesn't know about queries, it only knows function modules. There might be other RSAQ_* function modules to get that information, though.
I played with setting up a variant in SQ01 for my query. I added some settings in the variant that solved my problem and answered several of my questions in my post. The main thing I did was add a dynamically calculated date as part of my criteria. Here's how:
1. In SQ01, access menu "Go To" -> "Maintain Variants".
2. Choose your variant and in subobjects, choose "Attributes" and click "Change".
3. In the displayed list, find your date criterion.
4. Choose "D" in Selection Variable, choose a comparison option (mine was GT for greater than), and a "Name of a Variable" (really, this is the type of dynamic date calculation you need).
5. Go back to the Subobjects panel, choose "Values" and click "Change".
6. Enter any other criteria you need in the "Program selections" section.
7. Save the variant.
By doing this, I don't need to pass any selection criteria into the query from JCo. Also, SAP will automatically update the date criterion you entered in step #4 above.
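For anyone who wants to see it end to end, here is a rough JCo 3.0 sketch of what the call reduces to with the variant approach. The destination name and the QUERY/USERGROUP parameter names are my assumptions and should be checked against the function module's actual interface (e.g. in SE37); only the VARIANT parameter comes from my original post.
import com.sap.conn.jco.JCoDestination;
import com.sap.conn.jco.JCoDestinationManager;
import com.sap.conn.jco.JCoException;
import com.sap.conn.jco.JCoFunction;

public class AdHocQueryCall {
    public static void main(String[] args) throws JCoException {
        // "MY_SAP_SYSTEM" is a placeholder JCo destination.
        JCoDestination destination = JCoDestinationManager.getDestination("MY_SAP_SYSTEM");
        JCoFunction function = destination.getRepository().getFunction("RSAQ_REMOTE_QUERY_CALL");

        // Identify the query (parameter names assumed - verify against the interface) and the variant.
        function.getImportParameterList().setValue("QUERY", "MY_ADHOC_QUERY");
        function.getImportParameterList().setValue("USERGROUP", "MY_USERGROUP");
        function.getImportParameterList().setValue("VARIANT", "VARIANT_NAME");

        // No selection criteria are passed; the variant carries the dynamically calculated date.
        function.execute(destination);

        // The query output comes back in the function's table parameters.
        System.out.println(function.getTableParameterList());
    }
}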
So, to answer my questions from my original post:
1 and 4. It doesn't matter because I'm no longer passing anything in from JCo.
2. "GT" is Greater Than.
3 and 7. If anyone knows, I'd really like to find out.
5. Use the variant name as it is in SAP (step #2 above).
6. I still don't know, but it's not holding me up.
I'm posting this in case anyone out there needs this type of information. Thanks to Esti and vwegert for helping me out.
I have to perform a data mining task on a database containing information about insurance policies. Each tuple holds data about a single policy, along with information regarding the agency that issued it, the customer it refers to, and other fields. It is like a product between hypothetical tables Policies, Customers and Agencies. The fields are the following:
Policy Type, ID Number, Policy Status, Product Description, Product Combinations, Issue Date, Effective Date, Maturity Date, Policy Duration, Loan Duration, Cancellation Date, Reason for cancellation, Total Premium, Splitter Premium, ID Partners, ID Agency, Country Agency, ID Zone, Agency potential, Sex Contractor, Birth Year Contractor, Job Contractor, Sex Insured, Job Insured, Birth Year Insured, Product Area, Legal Form, ID Claim, Year Claim, Status Claim, Provision Claim, Payments Claim
This is an academic task and our professor wants us to identify churn rates, cross-selling and up-selling. I am not really familiar with the field, so I looked those terms up on Wikipedia. I started with churn rate, and it appears to me that in this case I have to characterize the properties of customers whose Policy Status is set to "canceled" and whose Reason for cancellation is "customer cancellation".
With RapidMiner, I tried to apply decision trees and rule mining, but the subset of interest is so small that the output model, despite having good accuracy overall, has very poor accuracy in predicting canceled policies. This happens because the subset of canceled policies is really small. I also tried to apply the MetaCost operator with a cost matrix in which the cost of misclassifying canceled policies is outrageously high compared to the others (like a million times higher), but this did not change the result at all.
My best option now is to use the sequential covering algorithm for rule mining, but RapidMiner does not implement it and I would have to code it manually.
Do you have any suggestion on how to build a good model for that small subset of canceled policies, so that we could use it to identify customers that would potentially cancel their policy in the future?
N.B.: since it comes from a real source, albeit anonymized, I cannot disclose the database or any data contained within.
Did you try Naive Bayes? It works well with small sets of data. You could also try a variant of it like AODE. AODE is not available in RapidMiner itself; you need to install the Weka extension to access it in RapidMiner.
You need to balance your dataset, so that the classes (cancelled / not cancelled) are the same size. This means (temporarily) discarding lots of data.
You can use the Sample operator with the Balance Labels checkbox to do this.