How to make an Extended Entry Decision Table for Testing - unit-testing

So I need to make a Decision Table to test a bit of code. The code is simple and straight forward but requires two user inputs to determine two proceeding variables. This is why I need to make this more than a binary (true/false) table.
The user is prompted to input total income, this will, in turn, determine which bracket the user falls into. Then the user is prompted to input the amount of dependents, which will determine the final tax. Depending on the income bracket, a switch statement will then set
tax = income * [certain percentage]. After this the amount of dependents will determine what percentage of the tax will be removed.
Essentially what I need to know is how to set up my Conditions, Actions, and Rules.
Here is an example of decision table but this one is binary (true/false)
I am using Java for the code but that isn't entirely relevant. What I need specifically is deciding what my Conditions should be, whether it's solely income or the combinations of income and dependents, etc. I do not need to write the code, just need to write the test table for it.
If anyone can help inform me as to what I should do or where I should look, that would be appreciated. I am willing to provide anymore information if need be!
Thanks!

If you understand, what the regular (binary) decision table is, then it'll be easy for you to get the extended table as well - the difference is in conditions only.
Conditions in the extended decision table can have more than two values. For example, in your case you have two conditions:
Tax Bracket with values (for example): [0, 9275], [9276, 37650], [37651, 91150] etc.
Number of Dependents with values: 0, 1, 2, 3 etc. (what's the max?)
For Actions I'd choose the tax rate and the deduction amount - they depend on the bracket and on the number of dependents correspondingly.
Rules (=columns in the table) connect conditions and actions - for all the possible combinations of condition values you have a list of actions to perform. In your case these actions will be simply two numbers, which you will use in a tax formula.
(Frankly I don't see why you need a decision table in this case at all... I think, the tax formula with parameters, depending on income, will be enough)

Related

Analyze Repeated Measures Data Using PROC GLIMMIX

I am using PROC GLIMMIX to analyze repeated measures data about specific sexual events. The original data came from a weekly diary study of about 400 people. During each week they reported on behaviours from their most recent sexual encounter. We also have basline data on their demographics. 12 weeks of observation were collected and we had a high completion rate.
I would like to create a mixed effect model, but I am unsure exactly how this is done in SAS. I want to test the effect of event-specific factors as well as some person level demographics and would like to get odds ratios for each factor of interest. The outcome is whether or not drugs were used during the event and the explanatory factors will be things like age, gender, etc. as well as characteristics about the event (i.e., partner HIV status), whether the partner was a regular sexual partner, etc..
The code I'm working with follows this pattern:
PROC GLIMMIX DATA=work.dataset oddsratio;
CLASS VISIT_NUMBER PARTICIPANT_ID BINARY_EVENTLEVEL_OUTCOME BINARY_EVENTLEVEL_EXPLANATORY_FACTOR CATEGORICAL_PERSONLEVEL_EXPLANATORY_FACTOR;
MODEL BINARY_EVENTLEVEL_OUTCOME = BINARY_EVENTLEVEL_EXPLANATORY CATEGORICAL_PERSONLEVEL_EXPLANATORY_FACTOR /DIST=binary link=logit CL S ddfm=kr;
RANDOM ?????;
RUN;
option 1 for ?????: residual / subject=PARTICIPANT_ID
option 2 for ?????: INTERCEPT / subject=PARTICIPANT_ID
option 3 for ?????: VISIT_NUM / subject=PARTICIPANT_ID residual type=ar(1)
INTERCEPT / subject=VISIT_NUM(PARTICIPANT_ID)
option 4 for ?????: Other?
I am also unclear whether I should use ddfm=kr in my model statement or method=laplace in my proc statement -- both have been recommended elsewhere for this sort of repeated measures analysis.
I've come across several potential options for modelling this which often give similar results, but option 1 gives a statistically significant result for an event-level, while the others give non-significant results. The inclusion of the ddfm=kr makes the result of interest more significant. The method=laplace does not allow for option 1.
I may not be answering your question, but might be able to provide a couple of directions:
To start with the simplest part, your MODEL statement looks correct to me as you want to test event-level factors and person-level demographics which are thus considered as fixed effects.
Now, as far as the random effects are concerned:
the RANDOM statements you propose for options (1) and (2):
(1) RANDOM _residual_ / subject=PARTICIPANT_ID;
or
(2) RANDOM intercept / subject=PARTICIPANT_ID;
are modeling two different parts of the random effects: the R-side and the G-side, respectively.
If you are already familiar with PROC MIXED, you may want to notice that your option (1) of using RANDOM _residual_ in PROC GLIMMIX is equivalent to using the REPEATED statement in PROC MIXED that tells that you have repeated measures for PARTICIPANT_ID, which is clearly your case (Ref: "Comparing the GLIMMIX and MIXED Procedures")
As for option (3):
RANDOM VISIT_NUM / subject=PARTICIPANT_ID residual type=ar(1) INTERCEPT / subject=VISIT_NUM(PARTICIPANT_ID);
here you are modeling the time component of the repeated measures (visit_num) as a random effect, and this should be included when you believe that there would be a random variation of the response at each of the measurements times (i.e. at each event). At first glance, I don't believe this is relevant in your case, since you are taking this into account already by the fixed effects... but of course I may be wrong by not seeing your data.
Up to here is what I can contribute at this time.
As next steps for you to have a better understanding, I would suggest that you:
Read the Overview of the PROC GLIMMIX documentation, in particular the mathematical model specification and all 3 sections therein:
The Basic Model
G-Side and R-Side Random Effects and Covariance Structures
Relationship with Generalized Linear Models
If you are still unsure, ask your question at communities.sas.com which might be able to help you better.
HTH

dax code for count and distinct count measures (or calculated columns)

I hope somebody can help me with some hints for the following analysis. The students may do some actions for some courses (enroll, join, grant,...) and also the reverse - to cancel the latest action.
The first metric is to count all the action occurred in the system between two dates - these are exposed like a filter/slicer.
Some sample data :
person-id,person-name,course-name,event,event-rank,startDT,stopDT
11, John, CS101, enrol,1,2000-01-01,2000-03-31
11, John, CS101, grant,2,2000-04-01,2000-04-30
11, John, CS101, cancel,3,2000-04-01,2000-04-30
11, John, PHIL, enrol, 1, 2000-02-01,2000-03-31
11, John, PHIL, grant, 2, 2000-04-01,2000-04-30
The data set (ds) is above and I have added the following code for the count metric:
evaluate
sumx(
addcolumns( ds
,"z+", if([event] <> "cancel",1,0)
,"z-", if([event] = "cancel",-1,0)
)
,[z+] + [z-])
}
The metric should display : 3 subscriptions (John-CS101 = 1 , John-PHIL=2).
There are some other rules but I don't know how to add them to the DAX code, the cancel date is the same as the above action (non-cancel) and the rank of the cancel-action = the non-cancel-action + 1.
Also there is a need for adding the number for distinct student and course, the composite key . How to add this to the code, please ? (via summarize, rankx)
Regards,
Q
This isn't technically an answer, but more of a recommendation.
It sounds like your challenge is that you have actions that may then be cancelled. There is specific logic that determines whether an action is cancelled or not (i.e. the cancellation has to be the immediate next row and the dates must match).
What I would recommend, which doesn't answer your specific question, is to adjust your data model rather than put the cancellation logic in DAX.
For example, if you could add a column to your data model that flags a row as subsequently cancelled, then all DAX has to do is check that flag to know if an action is cancelled or not. A CALCULATE statement. You don't have to have lots of logic to determine whether the event was cancelled. You entirely eliminate the need for SUMX, which can be slow when working with a lot of rows since it works row by row.
The logic for whether an action is cancelled or not moves to your source system (e.g. SQL or even a calculated column in Excel), or to your ETL (e.g. the Query Editor in Power BI) which are better equipped for such tasks. The logic is applied 1 time and then exists in your data model for all measures, instead of needing to apply the logic each time a measure is used.
I know this doesn't help you solve your logic question, but the reason I make this recommendation is that DAX is fundamentally a giant calculator. It adds things up. It's great at filters (adding some things up but not others), but it works best when everything is reduced to columns that it can sum or count. Once you go beyond that (e.g. wanting to look at the row below to adjust something about the current row), your DAX is going to get very complicated (and slow), whereas a source system or the Query Editor will likely be able to handle such requirements more easily.

How can I restrict the output of an Amazon Machine Learning model? (Predicting cricket team results)

I am trying to predict match winner based on the historical data set as shown below,
The data set comprises of IPL seasons and Team_Name_id vs Opponent Team are the team names in IPL. I have set the match id as Row id and created the model. When running realtime testing, the result is not as expected (shown below)
Target is set as Match_winner_id.
Am I missing any configurations? Please help
The model is working perfectly correctly. There's just two problems:
Your input data is not very good
There's no way for the model to know that only one of those two teams should win
Data Quality
A predictive model needs good quality input data on which to reverse-engineer a model that explains a given result. This input data should contain information that can be used to predict a result given a different set of input data.
For example, when predicting house prices, it would need to know the suburb (category), number of bedrooms/bathrooms/parking spaces, age of the building and selling price. It could then predict the selling price for other houses with a slightly different mix of variables.
However, based on your screenshot, you are giving the following information (and probably more) on which to make your prediction:
Teams: Not great, because you are separating Column C and Column D. The model will assume they are unrelated information. It doesn't realise that those two values could be swapped.
Match date: Useless information unless the outcome varies in proportion to time (eg a team continually gets better)
Season: As with Match Date, this is probably useless because you're always predicting the future -- you won't be predicting for a past season
Venue: Only relevant if a particular team always wins at a given venue
Toss Decision: Would this really influence the outcome? Also, it's only known once the game begins, so not great for predicting a future game.
Win Type: You won't know the win type until a game is over, so it's not suitable for predicting a future game.
Score: Again, not known until the actual game, so no good for future predictions.
Man of the Match: Not known for future games.
Umpire: How does an umpire influence the result of a game?
City: Yes, given that home teams often have an advantage.
You have provided very little information that could be used to predict a future game. There is really only the teams and the venue. Everything else is either part of the game itself or irrelevant.
Picking only one of the two teams
When the ML model looks at your data and tries to make a prediction, it will look at all the data you have provided. For example, it might notice that for a given venue and season, Team 8 has a higher propensity to win. Therefore, given that venue and season, it will favour a win by Team 8. The model has no concept that the only possible outcome is one of the two teams given in columns C and D.
You are predicting for two given teams and you are listing the teams in either Column C or Column D and this makes no sense -- the result is the same if you swapped the teams between columns, but the model has no concept of this. Also, information about Team 1 vs Team 2 is totally irrelevant for Team 3 vs Team 4.
What you should do is create one dataset per team, listing all their matches, plus a column that shows the outcome -- either a boolean (Win/Lose) or a value that represents the number of runs by which they won (where negative is a loss). You would then ask them model to predict the result for that team, given the input data, which would be win/lose or a points above/below the other team.
But at the core, I think that your input data doesn't have enough rich content to be able to make a sensible prediction. Just ask yourself: "What data would I like to know if I were to guess which team would win?" It would probably be past results, weather conditions, which players were on each team, how many matches they played in the last week, etc. None of this information is being provided as input on each line of your input data.

Propensity score matching on stata

I have a group of treated firms in a country, and for each firm I would like to find the closest match in terms of industry, size and profitability in the rest of the country. I am working on Stata. All I need is to form a control group- could anybody guide me with the code? That'd be greatly appreciated! I currently have the following, which doesn't get me what I need:
psmatch2 (logpension) (treated sector logassets logebitda), logit ate
Here's how you might match on x1 and x2 using Mahalanobis distance as a metric, to get the effect on y from treatment t:
use http://ssc.wisc.edu/sscc/pubs/files/psm, clear
psmatch2 t, mahalanobis(x1 x2) outcome(y) ate
The variable _n1 stores the observation number of the matched control observation for every treatment observation.
The following is a full set of code you can run to find your average treatment effect on the treated (your most important indicator result) and to check if the data is balanced (whether your result is valid). Before you run it, you need to make sure your treated is labeled in the following manner: 0 should be labeled as the control group and 1 should be labeled as the experimental/treatment. "neighbor(1)" means I chose the option nearest-neighbor matching. It basically pairs each treated observation with a control observation whose propensity score is closest in absolute value.
psmatch2 treated sector logassets logebitda, outcome (logpension) neighbor(1) common
After running psmatch, you need to make sure your data is balanced. So you need to run this:
pstest sector logassets logebitda, treated(treated)
if your t-test shows any significance below 0.05, it means your data is not balanced. to check the balance of your data visually, you can also run
psgraph
right after your psmatch2 command.
Good luck!

Using an element in a data set to drive If/Then logic decision

I'm working with our IT group to develop an optimizer for logistics operations. The basic design is that it will look at shipments, run a search for additional shipments originating with in XX miles of the previous shipments destination, and link them together in a loop. It will continue to do this until it hits a user defined set of shipment legs where the loop ends at or close to 1st shipment origin.
The issue we are facing is that the materials we ship are chemicals, which can have interactions if placed in a tank that contained XX chemical before it. The obvious solution is to use a different tank or wash it out, but we also need it to compute solutions prior to that.
My problem is, currently, there is no way on the market to do that prior product optimization.
The question is: Is there some kind of logic table function I can write that will allow the optimizer to see an element in the data set (say, Product Family of 1) that will pull from a product database containing predefined product families (i.e. PF 1 = Chemicals A1-B7, PF 2 = Chemical B8-J8, etc.) and then ping off of a logic table that defines a do not ship with list (i.e. PF 1 cannot ship if PF 2 was on the previous leg.