Stata moving products - stata

Using Stata I want a formula (line of code) that takes all of the previous entries for a given group G at a given cell and returns the product for all of the values at that cell and above. For example:
G X Y
1 1 1
1 2 2
1 6 12
1 3 36
2 2 2
2 4 8
3 2 2
4 2 2
4 11 22
4 7 154
G = Group ID, X = Value, Y = Moving Product
The way I have been doing this is pretty long and involves creating a good number of variables. There must be a way in Stata to just have it do a moving product by group ID (G).
Any insight is helpful

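Here is the solution:
sort G, stable
by G: gen moving_product = exp(sum(ln(X)))
This should make moving_product equal to Y. The stable option keeps the original order of observations within each group, which the running product depends on. Because of ln(), X must be strictly positive, and the result is a floating-point approximation, so round it if you need exact integers.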
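For readers who want to check the exp(sum(ln(X))) identity outside Stata, here is a minimal Python sketch applied to the example table. The function name moving_product and the list-of-tuples layout are illustrative assumptions, not part of the original answer.

```python
import math
from itertools import groupby

# (G, X) rows from the example table, already ordered by group.
rows = [(1, 1), (1, 2), (1, 6), (1, 3), (2, 2), (2, 4),
        (3, 2), (4, 2), (4, 11), (4, 7)]

def moving_product(rows):
    """Running product of X within each group G, computed as exp(sum(ln(X)))."""
    out = []
    for _, group in groupby(rows, key=lambda r: r[0]):
        log_sum = 0.0  # the running sum of logs restarts for every group
        for _, x in group:
            log_sum += math.log(x)  # requires x > 0
            out.append(round(math.exp(log_sum)))  # round away floating-point error
    return out

result = moving_product(rows)
print(result)
```

The printed list should match the Y column of the example, one value per row.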

how to normalize rating in scale of 1 to 5?

In the Yahoo! Movies dataset, the rating scale runs from 1 to 13. Here, 1 represents a good rating and 13 represents the lowest rating for the movie.
A 0 means that the user didn't rate that movie.
rating { 13 12 11 10 9 8 7 6 5 4 3 2 1 0} OR
rating { A+ A A- B+ B B- C+ C C- D+ D D- F 0}
eg. user m1 m2 m3
1 2 3 13
2 0 1 7
I don't know how to normalize ratings on the scale of 1 to 13 into a scale of 1 to 5.
One simple thing I can do is:
{A+,A,A-} = 5
{B+,B,B-} = 4
{C+,C,C-} = 3
{D+,D,D-} = 2
{F} = 1
Is there any other method, or a formula I could use?
If floating-point values are allowed, simply multiply by 5/13. Round to whole numbers if necessary.
If 5 should be the best rating, subtract the result from 6 (handle 0 with an if clause).
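As a hedged illustration of the formula approach, the sketch below uses a min-max variant rather than the plain 5/13 factor: multiplying by 5/13 maps a rating of 1 to about 0.38, whereas the linear map here sends 1 to 1 and 13 to 5 exactly. The function name rescale_rating and its parameters are assumptions made for this example.

```python
def rescale_rating(r, old_max=13, new_max=5, flip=False):
    """Map a rating in 1..old_max linearly onto 1..new_max.

    A rating of 0 means "not rated" and is passed through unchanged.
    With flip=True the direction is reversed, so the old 1 becomes new_max.
    """
    if r == 0:
        return 0
    scaled = 1 + (r - 1) * (new_max - 1) / (old_max - 1)
    if flip:
        scaled = (new_max + 1) - scaled  # e.g. subtract from 6 on a 1..5 scale
    return scaled

# Ratings of user 1 from the example row: m1=2, m2=3, m3=13
print([round(rescale_rating(r), 2) for r in (2, 3, 13)])
```

Apply round() to the result if whole-number ratings are required.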

SAS - Selecting optimal quantities

I'm trying to solve a problem in SAS where I have quantities of customers across a range of groups, and the quantities I select need to be as even across the different categories as possible. This will be easier to explain with a small table, which is a simplification of a much larger problem I'm trying to solve.
Here is the table:
Customer Category | Revenue band | Churn Band | # Customers
A 1 1 4895
A 1 2 383
A 1 3 222
A 2 1 28
A 2 2 2828
A 2 3 232
B 1 1 4454
B 1 2 545
B 1 3 454
B 2 1 4534
B 2 2 434
B 2 3 454
Suppose I need to select 3000 customers from category A, and 3000 customers from category B. From the second category, within each A and B, I need to select an equal amount from 1 and 2. If possible, I need to select a proportional amount across each 1, 2, and 3 subcategories. Is there an elegant solution to this problem? I'm relatively new to SAS and so far I've investigated OPTMODEL, but the examples are either too simple or too advanced to be much use to me yet.
Edit: I've thought about using PROC SURVEYSELECT. I can use it to select equal sizes across the Revenue Bands 1, 2, and 3. However, where I'm lacking customers in the individual churn bands, SURVEYSELECT may not select the maximum number of customers available where those numbers are low, and I'm back to manually selecting customers.
There are still some ambiguities in the problem statement, but I hope that the PROC OPTMODEL code below is a good start for you. I tried to add examples of many different features, so that you can toy around with the model and hopefully get closer to what you actually need.
Of the many things you could optimize, I am minimizing the maximum violation from your "If possible" goal, e.g.:
min MaxMismatch = MaxChurnMismatch;
I was able to model your constraints as a linear program, which means that it should scale very well. You probably have other constraints you did not mention, but those would probably be beyond the scope of this site.
With the data you posted, you can see from the output of the print statements that the optimal penalty corresponds to choosing 1500 customers from A,1,1, where the ideal would be 1736. This is more expensive than ignoring the customers from several groups:
[1] ChooseByCat
A 3000
B 3000
[1] [2] [3] Choose IdealProportion
A 1 1 1500 1736.670
A 1 2 0 135.882
A 1 3 0 78.762
A 2 1 28 9.934
A 2 2 1240 1003.330
A 2 3 232 82.310
B 1 1 1500 1580.210
B 1 2 0 193.358
B 1 3 0 161.072
B 2 1 1500 1608.593
B 2 2 0 153.976
B 2 3 0 161.072
Proportion MaxChurnMisMatch
0.35478 236.67
That is probably not the ideal solution, but figuring out exactly how to model your requirements would not be as useful for this site. You can contact me offline if that is relevant.
I've added quotes from your problem statement as comments in the code below.
Have fun!
data custCounts;
input cat $ rev churn n;
datalines;
A 1 1 4895
A 1 2 383
A 1 3 222
A 2 1 28
A 2 2 2828
A 2 3 232
B 1 1 4454
B 1 2 545
B 1 3 454
B 2 1 4534
B 2 2 434
B 2 3 454
;
proc optmodel printlevel = 0;
   set CATxREVxCHURN init {} inter {<'A',1,1>};
   set CAT = setof{<c,r,ch> in CATxREVxCHURN} c;
   num n{CATxREVxCHURN};
   read data custCounts into CATxREVxCHURN=[cat rev churn] n;
   put n[*]=;

   var Choose{<c,r,ch> in CATxREVxCHURN} >= 0 <= n[c,r,ch]
     , MaxChurnMisMatch >= 0, Proportion >= 0 <= 1
   ;

   /* From OP:
      Suppose I need to select 3000 customers from category A,
      and 3000 customers from category B. */
   num goal = 3000;
   /* See "implicit slice" for the parenthesis notation, i.e. (c) below. */
   impvar ChooseByCat{c in CAT} =
      sum{<(c),r,ch> in CATxREVxCHURN} Choose[c,r,ch];
   con MatchCatGoal{c in CAT}:
      ChooseByCat[c] = goal;

   /* From OP:
      From the second category, within each A and B,
      I need to select an equal amount from 1 and 2 */
   con MatchRevenueGroupsWithinCat{c in CAT}:
      sum{<(c),(1),ch> in CATxREVxCHURN} Choose[c,1,ch]
    = sum{<(c),(2),ch> in CATxREVxCHURN} Choose[c,2,ch]
   ;

   /* From OP:
      If possible, I need to select a proportional amount
      across each 1, 2, and 3 subcategories. */
   con MatchBandProportion{<c,r,ch> in CATxREVxCHURN, sign in / 1 -1 /}:
      MaxChurnMismatch >= sign * ( Choose[c,r,ch] - Proportion * n[c,r,ch] );

   min MaxMismatch = MaxChurnMismatch;
   solve;

   print ChooseByCat;
   impvar IdealProportion{<c,r,ch> in CATxREVxCHURN} = Proportion * n[c,r,ch];
   print Choose IdealProportion;
   print Proportion MaxChurnMismatch;
quit;
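The printed solution can be sanity-checked outside SAS. The Python sketch below does not solve the linear program. It only re-checks the posted numbers against the three requirements: 3000 per category, equal totals per revenue band, and the largest deviation from Proportion * n. The dictionary layout and variable names are assumptions for illustration, and the proportion is the rounded value from the output, so the mismatch will agree with the printed 236.67 only approximately.

```python
# (category, revenue band, churn band) -> (available n, chosen),
# copied from the input data and the printed Choose column.
solution = {
    ("A", 1, 1): (4895, 1500), ("A", 1, 2): (383, 0),    ("A", 1, 3): (222, 0),
    ("A", 2, 1): (28, 28),     ("A", 2, 2): (2828, 1240), ("A", 2, 3): (232, 232),
    ("B", 1, 1): (4454, 1500), ("B", 1, 2): (545, 0),    ("B", 1, 3): (454, 0),
    ("B", 2, 1): (4534, 1500), ("B", 2, 2): (434, 0),    ("B", 2, 3): (454, 0),
}
proportion = 0.35478  # rounded value from the printed output

by_cat = {}      # total chosen per customer category
by_cat_rev = {}  # total chosen per (category, revenue band)
for (cat, rev, _churn), (_n, chosen) in solution.items():
    by_cat[cat] = by_cat.get(cat, 0) + chosen
    by_cat_rev[(cat, rev)] = by_cat_rev.get((cat, rev), 0) + chosen

# Every choice must respect the bounds 0 <= Choose <= n.
within_bounds = all(0 <= chosen <= n for n, chosen in solution.values())

# Objective value: largest absolute gap between chosen and proportional share.
max_mismatch = max(abs(chosen - proportion * n) for n, chosen in solution.values())

print(by_cat)
print(by_cat_rev)
print(within_bounds, round(max_mismatch, 2))
```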

SAS PROC GMAP Annotate Regions

I am having difficulty annotating a map I have created using the Gmap procedure (SAS 9.4).
I have a custom shape data set I have created for two regions (XX and YY). XX is actually a disjoint region made up of two shapes.
I am having two issues:
The Proc is trying to draw the Area XX as one contiguous region, even though I've defined it as two separate subpolygons.
The labels are not populating in the centroid of the shapes, even though I've tried using the %centroid macro to build the annotation set. The coordinates look to be correct, but the text is not showing up in the right place.
Here is the code I've put together.
data map;
input Area $ Y X POINTORDER SUB_POLYGON_NUMBER POLYGON_NUMBER;
cards;
XX 1 1 1 1 1
XX 2 1 2 1 1
XX 3 1 3 1 1
XX 3 2 4 1 1
XX 3 3 5 1 1
XX 2 3 6 1 1
XX 1 3 7 1 1
XX 1 2 8 1 1
XX -1 0 1 2 1
XX -2 0 2 2 1
XX -1 -2 3 2 1
YY 7 7 1 1 2
YY 7 8 2 1 2
YY 8 9 3 1 2
;
run;
data sales;
input Area $ Sales;
datalines;
XX 500
YY 200
;
run;
%annomac;
%CENTROID(map,anno,Area,segonly=1);
data anno;
set anno;
text=Area;
function='label';
style="'Albany AMT/bold'";
run;
proc gmap data = sales map=map;
id Area;
choro Sales / nolegend annotate=anno;
run;
quit;
As Joe said, this would definitely be better as two questions. I'll respond to the first part, since Joe has answered the second one.
By opening MAPS.Sweden, I found out that the region identifiers, your POLYGON_NUMBER and SUB_POLYGON_NUMBER, are called ID and SEGMENT. So if you rename your columns accordingly in the map definition, you'll get the desired outcome.
data map;
input Area $ Y X POINTORDER SEGMENT ID;
cards;
XX 1 1 1 1 1
XX 2 1 2 1 1
XX 3 1 3 1 1
XX 3 2 4 1 1
XX 3 3 5 1 1
XX 2 3 6 1 1
XX 1 3 7 1 1
XX 1 2 8 1 1
XX -1 0 1 2 1
XX -2 0 2 2 1
XX -1 -2 3 2 1
YY 7 7 1 1 2
YY 7 8 2 1 2
YY 8 9 3 1 2
;
run;
I hadn't worked with gmap before, so it was quite interesting. I tried to read the documentation to find out how the columns should be named to get this to work. I did not find anything, but it should be there somewhere. Please drop a comment if you know where I can read about it.
I'm not sure about the first part of your question, but you probably should split them into two questions - these are two separate issues.
As far as the issue in the question title, the position of the annotate text, you have two problems.
One: your annotate text isn't using the same coordinate system as the map. In SAS/GRAPH, this is controlled with the XSYS, YSYS, etc. variables. The default is 4, which positions text relative to the entire image; that's not what you want here. What you want is 2, which uses the data space only (i.e., the coordinates actually drawn on the axes).
You also need to make it visible: by default it won't be drawn "over" a graph element.
data anno;
set anno;
text=Area;
function='label';
style="'Albany AMT/bold'";
color='Red';
when='After';
xsys='2';
ysys='2';
run;
I made it red to make it more visible, but you of course can use black.
Note that I tested this using the single polygon (I deleted the subpolygon=2); I'm not sure what would happen if you had both, but the centering would probably be a bit odd.
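To judge independently whether the annotation coordinates are plausible, one option is to compute each polygon's area centroid by hand. The Python sketch below uses the standard shoelace formula. It is an assumption that this matches what %CENTROID returns, since the macro's exact method is not stated in this thread. Points are given as (X, Y), whereas the data step reads them in Y X order.

```python
def polygon_centroid(points):
    """Area centroid of a simple polygon given as [(x, y), ...] in drawing order."""
    area2 = 0.0    # twice the signed area (the sign depends on winding direction)
    cx = cy = 0.0
    # Pair each vertex with the next one, wrapping around to close the polygon.
    for (x0, y0), (x1, y1) in zip(points, points[1:] + points[:1]):
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    # Centroid = sum / (6 * area) = sum / (3 * area2)
    return cx / (3 * area2), cy / (3 * area2)

# Segment 1 of area XX and the triangle YY from the map data set.
xx_segment1 = [(1, 1), (1, 2), (1, 3), (2, 3), (3, 3), (3, 2), (3, 1), (2, 1)]
yy = [(7, 7), (8, 7), (9, 8)]

xx_center = polygon_centroid(xx_segment1)
yy_center = polygon_centroid(yy)
print(xx_center, yy_center)
```

If the x and y values in the annotate data set are far from these, the problem is in the coordinates; if they are close, the coordinate system settings are the more likely cause.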

pandas - Perform computation against a reference record within groups

For each row of data in a DataFrame I would like to compute the number of unique values in columns A and B for that particular row and a reference row within the group identified by another column ID. Here is a toy dataset:
import pandas as pd

d = {'ID': pd.Series([1,1,1,2,2,2,2,3,3]),
     'A': pd.Series([1,2,3,4,5,6,7,8,9]),
     'B': pd.Series([1,2,3,4,11,12,13,14,15]),
     'REFERENCE': pd.Series([1,0,0,0,0,1,0,1,0])}
data = pd.DataFrame(d)
The data looks like this:
In [3]: data
Out[3]:
A B ID REFERENCE
0 1 1 1 1
1 2 2 1 0
2 3 3 1 0
3 4 4 2 0
4 5 11 2 0
5 6 12 2 1
6 7 13 2 0
7 8 14 3 1
8 9 15 3 0
Now, within each group defined using ID I want to compare each record with the reference record and I want to compute the number of unique A and B values for the combination. For instance, I can compute the value for data record 3 by taking len(set([4,4,6,12])) which gives 3. The result should look like this:
A B ID REFERENCE CARDINALITY
0 1 1 1 1 1
1 2 2 1 0 2
2 3 3 1 0 2
3 4 4 2 0 3
4 5 11 2 0 4
5 6 12 2 1 2
6 7 13 2 0 4
7 8 14 3 1 2
8 9 15 3 0 4
The only way I can think of implementing this is using for loops that loop over each grouped object and then each record within the grouped object and computes it against the reference record. This is non-pythonic and very slow. Can anyone please suggest a vectorized approach to achieve the same?
I would create a new column that combines A and B into a tuple, and then group the frame. After that, use groups = dict(list(groupby)) and get the length of each frame using len().
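One possible vectorized alternative, offered as a sketch rather than as the answerer's method, is to merge each group's reference row back onto every row and count distinct values across the four resulting columns. It assumes exactly one row with REFERENCE == 1 per ID; the column names A_REF and B_REF are made up for this example.

```python
import pandas as pd

data = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 2, 2, 3, 3],
    "A": [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "B": [1, 2, 3, 4, 11, 12, 13, 14, 15],
    "REFERENCE": [1, 0, 0, 0, 0, 1, 0, 1, 0],
})

# One reference row per ID (assumption), with its A and B renamed.
ref = (data.loc[data["REFERENCE"] == 1, ["ID", "A", "B"]]
           .rename(columns={"A": "A_REF", "B": "B_REF"}))

# A left merge keeps the row order of `data`, so results line up by position.
merged = data.merge(ref, on="ID", how="left")

# Count distinct values among A, B and the reference A, B for each row.
data["CARDINALITY"] = (
    merged[["A", "B", "A_REF", "B_REF"]].nunique(axis=1).to_numpy()
)
print(data)
```

For row 3 this counts the distinct values in (4, 4, 6, 12), which is the len(set([4,4,6,12])) computation from the question.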

Formula that uses previous value

In Stata I want to have a variable calculated by a formula, which includes multiplying by the previous value, within blocks defined by a variable ID. I tried using a lag but that did not work for me.
In the formula below the Y-1 is intended to signify the value above (the lag).
gen Y = 0
replace Y = 1 if count == 1
sort ID
by ID: replace Y = (1+X)*Y-1 if count != 1
X Y count ID
. 1 1 1
2 3 2 1
1 6 3 1
3 24 4 1
2 72 5 1
. 1 1 2
1 2 2 2
7 16 3 2
Your code can be made a little more concise. Here's how:
input X count ID
. 1 1
2 2 1
1 3 1
3 4 1
2 5 1
. 1 2
1 2 2
7 3 2
end
gen Y = count == 1
bysort ID (count) : replace Y = (1 + X) * Y[_n-1] if count > 1
The creation of a dummy (indicator) variable can exploit the fact that true or false expressions are evaluated as 1 or 0.
Sorting before by and the subsequent by command can be condensed into one. Note that I spelled out that within blocks of ID, count should remain sorted.
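The recurrence is easy to trace by hand outside Stata. This Python sketch mirrors the logic of sorting by ID and count, setting Y to 1 on the first row of each block, and otherwise multiplying (1 + X) by the previous Y. The function name recursive_y and the use of None for Stata's missing value are illustrative assumptions.

```python
# (ID, count, X) rows from the example; None stands in for Stata's missing ".".
rows = [
    (1, 1, None), (1, 2, 2), (1, 3, 1), (1, 4, 3), (1, 5, 2),
    (2, 1, None), (2, 2, 1), (2, 3, 7),
]

def recursive_y(rows):
    """Y = 1 where count == 1, else (1 + X) times the previous Y in the block."""
    y_values = []
    prev = None
    # Sort by ID, then count, like `bysort ID (count)`.
    for _id, count, x in sorted(rows, key=lambda r: (r[0], r[1])):
        y = 1 if count == 1 else (1 + x) * prev
        y_values.append(y)
        prev = y  # plays the role of Y[_n-1] on the next row
    return y_values

result = recursive_y(rows)
print(result)
```

The output should reproduce the Y column shown in the question.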
This is really a comment, not another answer, but it would be less clear if presented as such.
The lag you wrote as Y-1 in the formula translates to Y[_n-1] in Stata, as shown below.
gen Y = 0
replace Y = 1 if count == 1
sort ID
by ID: replace Y = (1+X)*Y[_n-1] if count != 1