UPDATE:
I solved the first part of the problem. I created unique ids for each observation:
gen id=_n
Then, I used
fillin id categ
which essentially created what I was looking for.
However, for the rest of the variables (except id and categ), almost all observations are missing. Now, I need your help to duplicate the rest of the variables instead of having them missing.
Just as an example, each observation is associated with a particular week. I am missing most of them. Or another dummy variable indicates whether a purchase was made at a drug or grocery store. Most of them are missing too.
Thanks!
ORIGINAL MESSAGE:
Need your help in Stata!
Each observation in my database is a 1-unit purchase of a beer product made by a customer. These product purchases are categorized unto 8 general categories such that the variable "categ" has values from 1 to 8 (1=import, 2=craft, 3=premium, 4=light, etc).
For my multinomial logit model, I need to observe all categories purchased or not purchased by the customer in each observation.
Assume, this is my initial dataset:
customer id-------beer category-----units purchased
----------1------------------1--------------------- 1
----------2----------------- 3--------------------- 1
----------3 -----------------2 ---------------------1
This is what I am looking for:
customer id-------beer category-----units purchased
----------1------------------1--------------------- 1
----------1 -----------------2 ---------------------0
----------1----------------- 3--------------------- 0
----------2----------------- 1--------------------- 0
----------2----------------- 3--------------------- 1
----------2 -----------------3--------------------- 0
----------3----------------- 1--------------------- 0
----------3----------------- 2--------------------- 0
----------3 -----------------2 ---------------------1
Currently, my dataset is 600,000 obs. After this procedure, I should have 600,000*8=4,800,000 obs.
When constructing this code, it is necessary that all other variables in the dataset are duplicated according to the associated category of beer.
I assume that "fillin" and less likely "expand" might work.
You help will tremendously help.
Thanks!
This is an old question, but i'll post a possible answer if someone else is having this problem.
In this case, you could generate variables for every option of your "choice variable", and after that, apply the reshape long command:
tab beercategory, gen(b)
reshape long b , i(customerid) j(newvarname)
Greetings
Related
I need a little help since im new to PowerBI. I have a data set which says what have been eaten in a specific day. At the end there are columns which show if the day before the overall feeling was better (so in this specific day it got worse). I got up to 30 Ingredients and 5 days before. The "1" in e.g. Day2 means its TRUE for condition "2 days before it got worse"
The data looks like this:
Data set example
Now, I want to retrieve and add up all ingredients, which has been eaten at "Day1", "Day2" and so on, so I can see which food is maybe causing problems because it should appear more often in the days before or at least appear in every case there. How do I achieve this?
For example I can see then, that on Day2 overy often Ingredient "Apple" appears, so there could be an assumption that Chicken meat is not good for this person.
I tried to pivot the table, as well as disconnect the "DayX" into another table and make relationship between them, but nothing adds up the things in the way I want it to.
I'm building a staff allocation sheet for our production teams so that management can see (graphically) the peaks and troughs for each department. I've anonymised the data and shared it here: https://docs.google.com/spreadsheets/d/140w_v_ApksXH2q7h_dK5Iglm0VOTF1Zk9fq9jPxF5BM/edit?usp=sharing
What I am attempting to achieve is to have the week commencing dates across the top, the employees down the left (Based on a unique list of employees form the department tabs) and the individual cells populated with the relevant show number (I've manually entered the first two employees).
I had done this successfully by having hidden proxy columns for each show that pull a start and end date from the department sheet via and indirect lookup. Where this falls down is if an employee works on a production at two separate times, e.g. all of May and all of August but not between those months.
I initially attempted a series of nested ifs:
=if(and(indirect($A3&"!$B:$B")=$B:$B,
indirect($A3&"!$E:$E")<=AK$2,
indirect($A3&"!$F:$F")>AK$2,
indirect($A3&"!$C:$C")="SHOW #1"
),1,
if(and(indirect($A3&"!$B:$B")=$B:$B,
indirect($A3&"!$E:$E")<=AK$2,
indirect($A3&"!$F:$F")>AK$2,
indirect($A3&"!$C:$C")="SHOW #2"
),2,
if(and(indirect($A3&"!$B:$B")=$B:$B,
indirect($A3&"!$E:$E")<=AK$2,
indirect($A3&"!$F:$F")>AK$2,
indirect($A3&"!$C:$C")="SHOW #3"
),3,
. . .
if(and(indirect($A3&"!$B:$B")=$B:$B,
indirect($A3&"!$E:$E")<=AK$2,
indirect($A3&"!$F:$F")>AK$2,
indirect($A3&"!$C:$C")="SHOW #17"
),17,
0)))))))))))))))))
Hoping that this would calculate in each cell and return the correct result, however this did not work, that's when I discovered if() doesn't work with ranges in this context! I've effectively hit a wall on this.
I feel like I'm missing something obvious in this and would appreciate any help!!!
I am trying to predict match winner based on the historical data set as shown below,
The data set comprises of IPL seasons and Team_Name_id vs Opponent Team are the team names in IPL. I have set the match id as Row id and created the model. When running realtime testing, the result is not as expected (shown below)
Target is set as Match_winner_id.
Am I missing any configurations? Please help
The model is working perfectly correctly. There's just two problems:
Your input data is not very good
There's no way for the model to know that only one of those two teams should win
Data Quality
A predictive model needs good quality input data on which to reverse-engineer a model that explains a given result. This input data should contain information that can be used to predict a result given a different set of input data.
For example, when predicting house prices, it would need to know the suburb (category), number of bedrooms/bathrooms/parking spaces, age of the building and selling price. It could then predict the selling price for other houses with a slightly different mix of variables.
However, based on your screenshot, you are giving the following information (and probably more) on which to make your prediction:
Teams: Not great, because you are separating Column C and Column D. The model will assume they are unrelated information. It doesn't realise that those two values could be swapped.
Match date: Useless information unless the outcome varies in proportion to time (eg a team continually gets better)
Season: As with Match Date, this is probably useless because you're always predicting the future -- you won't be predicting for a past season
Venue: Only relevant if a particular team always wins at a given venue
Toss Decision: Would this really influence the outcome? Also, it's only known once the game begins, so not great for predicting a future game.
Win Type: You won't know the win type until a game is over, so it's not suitable for predicting a future game.
Score: Again, not known until the actual game, so no good for future predictions.
Man of the Match: Not known for future games.
Umpire: How does an umpire influence the result of a game?
City: Yes, given that home teams often have an advantage.
You have provided very little information that could be used to predict a future game. There is really only the teams and the venue. Everything else is either part of the game itself or irrelevant.
Picking only one of the two teams
When the ML model looks at your data and tries to make a prediction, it will look at all the data you have provided. For example, it might notice that for a given venue and season, Team 8 has a higher propensity to win. Therefore, given that venue and season, it will favour a win by Team 8. The model has no concept that the only possible outcome is one of the two teams given in columns C and D.
You are predicting for two given teams and you are listing the teams in either Column C or Column D and this makes no sense -- the result is the same if you swapped the teams between columns, but the model has no concept of this. Also, information about Team 1 vs Team 2 is totally irrelevant for Team 3 vs Team 4.
What you should do is create one dataset per team, listing all their matches, plus a column that shows the outcome -- either a boolean (Win/Lose) or a value that represents the number of runs by which they won (where negative is a loss). You would then ask them model to predict the result for that team, given the input data, which would be win/lose or a points above/below the other team.
But at the core, I think that your input data doesn't have enough rich content to be able to make a sensible prediction. Just ask yourself: "What data would I like to know if I were to guess which team would win?" It would probably be past results, weather conditions, which players were on each team, how many matches they played in the last week, etc. None of this information is being provided as input on each line of your input data.
Consider the fictional data to illustrate my problem, which contains in reality thousands of rows.
Figure 1
Each individual is characterized by values attached to A,B,C,D,E. In figure1, I show 3 individuals for which some characteristics are missing. Do you have any idea how can I get the following completed table (figure 2)?
Figure 2
With the ID in figure 1 I could have used the carryforward command to filling in the values. But since each individual has a different number of rows I don't know how to create the ID.
Edit: All individual share the characteristic "A".
Edit: the existing order of observations is informative.
To detect the change of id, the idea is to compare if the precedent value of char is >= in each rows.
This works only if your data are ordered, but it seems mandatory in your data.
gen id= 1 if (char[_n-1] >= char[_n]) | _n ==1
replace id = sum(id) if id==1
replace id = id[_n-1] if missing(id)
fillin id char
drop _fillin
If an individual as only the characteristics A and C and another individual as only the characteristics D and E, this won't work, but it seems impossible to detect with your data.
I have a dataset with observations at specific timepoints, but those timepoints (and the length of time between them) vary by group. I'm trying to "fill down" the data so that existing observations are carried down into missing cells. But I only want to do this for a certain number of rows after the original observation. So for example, I could have a dataset that looks like this:
For group A, I'd want to fill in the value for 2002 with 2001's value, 2004 with 2003, etc. I wouldn't want to fill in 2000 at all, since I don't have the preceding value. And I ALSO wouldn't want to fill in the 2011 value, because the "cyclelength" variable tells me that group A's observations are supposed to take place every two years, so I don't want to carry data forward past that. 2011 is just a genuinely missing value.
Similarly, in group B, I'd want to carry 2000's value forward into years 2001, 2002, and 2003 (because the "cyclelength" here is 4 years). I'd want to carry 2004's value into 2005, 2006, and 2007, but not beyond that--the later years should stay missing.
I've tried setting this up with the "carryforward" command, but haven't figured out how to have it stop filling down after a specified number of years that varies by group. Is there a way to do this, either with carryforward or otherwise?
This is a variation on a problem documented since 2000 as an FAQ: see here
The variation lies in limiting how far non-missing values are copied. But it falls easily to the same idea.
The last known value was recorded in certain years which we can copy down the dataset:
gen when_last_known = year if !missing(value)
bysort group (year) : replace when_last_known = when_last_known[_n-1] if missing(when_last_known)
Now the replacement wanted is
by group : replace value = value[_n-1] if missing(value) & (year - when_last_known) < cyclelength
That statement presupposes the sort order of the previous statement.
On Statalist (see here) you'd be expected to document that carryforward is a user-written command to be installed from SSC. That's a good convention here too.
In practice, it's good data management to keep the original data exactly as they arrive and do this on a clone of the variable. Sooner or later someone will ask to see the original values, and then you could be seriously embarrassed.