How to find the maximum distance between values within a variable - Stata

I create a working example dataset:
input ///
group value
1 3
1 2
1 3
2 4
2 6
2 7
3 4
3 4
3 4
3 4
4 17
4 2
5 3
5 5
5 12
end
My goal is to compute the maximum gap between consecutive values within each group, once the values are sorted. For group 2 this is 2, because the next highest value after 4 is 6. Note that the only value relevant to 4 is 6, not 7, because 7 is not the next highest value after 4. The result for group 3 is 0 because group 3 contains only one distinct value. There will be exactly one result per group.
What I want to get:
input ///
group value result
1 3 1
1 2 1
1 3 1
2 4 2
2 6 2
2 7 2
3 4 0
3 4 0
3 4 0
3 4 0
4 17 15
4 2 15
5 3 7
5 5 7
5 12 7
end
The order is not important, so the order just above can change with no problem.
Any tips?

I may have figured it out:
bys group (value): gen d = value[_n+1] - value[_n]
bys group: egen result = max(d)
drop d
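For anyone cross-checking the logic outside Stata, here is a minimal pandas sketch of the same computation (the DataFrame below simply re-enters the example data):

```python
import pandas as pd

# Re-enter the example data from the question.
df = pd.DataFrame({
    'group': [1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 5, 5, 5],
    'value': [3, 2, 3, 4, 6, 7, 4, 4, 4, 4, 17, 2, 3, 5, 12],
})

# Sort by value, diff consecutive values within each group, take the max gap.
# This mirrors: bys group (value): gen d = value[_n+1] - value
#               bys group: egen result = max(d)
df['result'] = (df.sort_values('value')
                  .groupby('group')['value']
                  .transform(lambda s: s.diff().max()))
```

Groups whose values are all equal get a maximum diff of 0, matching the desired output.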

Related

Sort rows in a dataframe based on highest values in the whole dataframe

I have a dataframe with probability values in 3 category columns [A, B, C]. I want to sort the rows so that the row with the highest probability value in the whole dataframe (irrespective of column) comes first, followed by the row with the second highest value, and so on.
Can someone help me with this?
In [15]: df = pd.DataFrame(np.random.randint(1, 10, (10,3)))
In [16]: df
Out[16]:
0 1 2
0 9 2 8
1 6 6 9
2 2 4 9
3 2 1 2
4 2 5 3
5 3 4 9
6 8 7 3
7 6 4 1
8 3 3 8
9 7 2 7
In [17]: df.iloc[df.apply(np.max, axis=1).sort_values(ascending=False).index]
Out[17]:
0 1 2
5 3 4 9
2 2 4 9
1 6 6 9
0 9 2 8
8 3 3 8
6 8 7 3
9 7 2 7
7 6 4 1
4 2 5 3
3 2 1 2
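The `apply(np.max, axis=1)` call can be shortened: `DataFrame.max(axis=1)` computes the same row-wise maximum, and `.loc` reorders the frame by the resulting index labels (with the default RangeIndex here, `.iloc` happens to coincide). A sketch, with a seed added only to make it reproducible:

```python
import numpy as np
import pandas as pd

np.random.seed(42)  # seed added for reproducibility; not in the original
df = pd.DataFrame(np.random.randint(1, 10, (10, 3)))

# Row-wise max, sorted descending, then reindex the frame in that order.
order = df.max(axis=1).sort_values(ascending=False).index
sorted_df = df.loc[order]
```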

pandas - Perform computation against a reference record within groups

For each row of data in a DataFrame I would like to compute the number of unique values in columns A and B for that particular row and a reference row within the group identified by another column ID. Here is a toy dataset:
d = {'ID' : pd.Series([1,1,1,2,2,2,2,3,3])
,'A' : pd.Series([1,2,3,4,5,6,7,8,9])
,'B' : pd.Series([1,2,3,4,11,12,13,14,15])
,'REFERENCE' : pd.Series([1,0,0,0,0,1,0,1,0])}
data = pd.DataFrame(d)
The data looks like this:
In [3]: data
Out[3]:
A B ID REFERENCE
0 1 1 1 1
1 2 2 1 0
2 3 3 1 0
3 4 4 2 0
4 5 11 2 0
5 6 12 2 1
6 7 13 2 0
7 8 14 3 1
8 9 15 3 0
Now, within each group defined using ID I want to compare each record with the reference record and I want to compute the number of unique A and B values for the combination. For instance, I can compute the value for data record 3 by taking len(set([4,4,6,12])) which gives 3. The result should look like this:
A B ID REFERENCE CARDINALITY
0 1 1 1 1 1
1 2 2 1 0 2
2 3 3 1 0 2
3 4 4 2 0 3
4 5 11 2 0 4
5 6 12 2 1 2
6 7 13 2 0 4
7 8 14 3 1 2
8 9 15 3 0 3
The only way I can think of implementing this is using for loops that loop over each grouped object and then each record within the grouped object and computes it against the reference record. This is non-pythonic and very slow. Can anyone please suggest a vectorized approach to achieve the same?
One idea: create a new column combining A and B into a tuple, group by ID, build a dictionary with groups = dict(list(grouped)), and then take len() of each frame.

Create a dummy variable for the last rows based on on another variable

I would like to create a dummy variable based on the variable "count", labelling rows as 1 starting from the last row of each id. For example, ID 1 has a count of 3, so its five rows follow the pattern 0,0,1,1,1. Similarly, ID 4, which has a count of 1, will have 0,0,0,1. The IDs have different numbers of rows. The variable "wish" shows the final output I want to obtain.
input byte id count wish str9 date
1 3 0 22sep2006
1 3 0 23sep2006
1 3 1 24sep2006
1 3 1 25sep2006
1 3 1 26sep2006
2 4 1 22mar2004
2 4 1 23mar2004
2 4 1 24mar2004
2 4 1 25mar2004
3 2 0 28jan2003
3 2 0 29jan2003
3 2 1 30jan2003
3 2 1 31jan2003
4 1 0 02dec1993
4 1 0 03dec1993
4 1 0 04dec1993
4 1 1 05dec1993
5 1 0 08feb2005
5 1 0 09feb2005
5 1 0 10feb2005
5 1 1 11feb2005
6 3 0 15jan1999
6 3 0 16jan1999
6 3 1 17jan1999
6 3 1 18jan1999
6 3 1 19jan1999
end
For future questions, you should include your failed attempts. This shows that you have done your part, namely, researching your problem.
One way is:
clear
set more off
*----- example data -----
input ///
byte id count wish str9 date
1 3 0 22sep2006
1 3 0 23sep2006
1 3 1 24sep2006
1 3 1 25sep2006
1 3 1 26sep2006
2 4 1 22mar2004
2 4 1 23mar2004
2 4 1 24mar2004
2 4 1 25mar2004
3 2 0 28jan2003
3 2 0 29jan2003
3 2 1 30jan2003
3 2 1 31jan2003
4 1 0 02dec1993
4 1 0 03dec1993
4 1 0 04dec1993
4 1 1 05dec1993
5 1 0 08feb2005
5 1 0 09feb2005
5 1 0 10feb2005
5 1 1 11feb2005
6 3 0 15jan1999
6 3 0 16jan1999
6 3 1 17jan1999
6 3 1 18jan1999
6 3 1 19jan1999
end
list, sepby(id)
*----- what you want -----
bysort id: gen wish2 = _n > (_N - count)
list, sepby(id)
I assume your data are already sorted by date within each id.
One way to accomplish this would be to use within-group row numbers using 'bysort'-type logic:
***Create variable of within-group row numbers.
bysort id: gen obsnum = _n
***Calculate total number of rows within each group.
by id: egen max_obsnum = max(obsnum)
***Subtract the count variable from the group row count.
***This is the number of rows where we want the dummy to equal zero.
gen max_obsnum_less_count = max_obsnum - count
***Create the dummy to equal one when the row number is
***greater than this last variable.
gen dummy = (obsnum > max_obsnum_less_count)
***Clean up.
drop obsnum max_obsnum max_obsnum_less_count
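The same `_n > (_N - count)` logic can be sketched in pandas for cross-checking: cumcount()+1 plays the role of _n, and the group size that of _N (the frame below abbreviates the example data to four ids):

```python
import pandas as pd

# Abbreviated example data: id and its per-id count.
df = pd.DataFrame({
    'id':    [1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4],
    'count': [3, 3, 3, 3, 3, 4, 4, 4, 4, 2, 2, 2, 2, 1, 1, 1, 1],
})

n = df.groupby('id').cumcount() + 1            # _n: within-group row number
N = df.groupby('id')['id'].transform('size')   # _N: group size
# Flag the last `count` rows of each id.
df['wish'] = (n > N - df['count']).astype(int)
```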

If current week has missing value, how to replace it with the value from previous week?

I have a dataset that shows how much was paid ("cenoz", cents per ounce) per product category during a specific week and in a specific store.
clear
set more off
input week store cenoz category
1 1 2 1
1 1 4 2
1 1 3 3
1 2 5 1
1 2 7 2
1 2 8 3
2 1 4 1
2 1 1 2
2 1 10 3
2 2 3 1
2 2 4 2
2 2 7 3
3 1 5 1
3 1 3 2
3 2 5 1
3 2 4 2
end
I create a new variable cenoz3 that indicates how much on average was paid for category 3 in a given week and store; likewise for cenoz1 and cenoz2.
egen cenoz1 = mean(cenoz/ (category == 1)), by(week store)
egen cenoz2 = mean(cenoz/ (category == 2)), by(week store)
egen cenoz3 = mean(cenoz/ (category == 3)), by(week store)
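The division trick works because cenoz / (category == k) is missing wherever the indicator is 0, so those rows drop out of the mean. For readers more familiar with pandas, the same masking-then-mean can be sketched with where() (data abbreviated to week 1):

```python
import pandas as pd

# Abbreviated example: week 1 only, both stores.
df = pd.DataFrame({
    'week':     [1, 1, 1, 1, 1, 1],
    'store':    [1, 1, 1, 2, 2, 2],
    'cenoz':    [2, 4, 3, 5, 7, 8],
    'category': [1, 2, 3, 1, 2, 3],
})

# cenoz / (category == k) in Stata turns non-k rows into missing;
# .where() applies the same mask here before the group mean.
for k in (1, 2, 3):
    masked = df['cenoz'].where(df['category'] == k)
    df[f'cenoz{k}'] = masked.groupby([df['week'], df['store']]).transform('mean')
```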
It turns out that category 3 was not sold in any of the stores (1 and 2) in week 3. As a result, missing values are generated.
week store cenoz category cenoz1 cenoz2 cenoz3
1 1 2 1 2 4 3
1 1 4 2 2 4 3
1 1 3 3 2 4 3
1 2 5 1 5 7 8
1 2 7 2 5 7 8
1 2 8 3 5 7 8
2 1 4 1 4 1 10
2 1 1 2 4 1 10
2 1 10 3 4 1 10
2 2 3 1 3 4 7
2 2 4 2 3 4 7
2 2 7 3 3 4 7
3 1 5 1 5 3 .
3 1 3 2 5 3 .
3 2 5 1 5 4 .
3 2 4 2 5 4 .
I would like to replace missing values of a particular week with values of the previous week and matching store. That's to say:
replace missing values for category 3 in week 3 in store 1
with values for category 3 in week 2 in store 1
and
replace missing values for category 3 in week 3 in store 2
with values for category 3 in week 2 in store 2
Can I use the replace command, or is it something more complicated than that?
Something like:
replace cenoz1 = cenoz1[_n-1] if missing(cenoz1)
But I also need the stores to match, not just the time variable week.
I found this code provided by Nicholas Cox at
http://www.stata.com/support/faqs/data-management/replacing-missing-values/:
by id (time), sort: replace myvar = myvar[_n-1] if myvar >= .
Do you think
by store (week), sort: replace cenoz1 = cenoz1[_n-1] if missing(cenoz1)
is sufficient?
UPDATE:
When I use the code
by store (week category), sort: replace cenoz3 = cenoz3[_n-1] if missing(cenoz3)
It seems it delivers correct values:
week store cenoz category cenoz1 cenoz2 cenoz3
1 1 2 1 2 4 3
1 1 4 2 2 4 3
1 1 3 3 2 4 3
1 2 5 1 5 7 8
1 2 7 2 5 7 8
1 2 8 3 5 7 8
2 1 4 1 4 1 10
2 1 1 2 4 1 10
2 1 10 3 4 1 10
2 2 3 1 3 4 7
2 2 4 2 3 4 7
2 2 7 3 3 4 7
3 1 5 1 5 3 10
3 1 3 2 5 3 10
3 2 5 1 5 4 7
3 2 4 2 5 4 7
Is there any way to double check this code given that my dataset is quite large?
How can I make this code less specific, so that it applies to any cenoz variable with missing values (cenoz1, cenoz2, cenoz3, ... cenoz12)?
If you want to use the previous information for the same store and the same category, that should be
by store category (week), sort: replace cenoz3 = cenoz3[_n-1] if missing(cenoz3)
A generalization could be
sort store category week
forval j = 1/12 {
by store category: replace cenoz`j' = cenoz`j'[_n-1] if missing(cenoz`j')
}
However this carrying forward is a fairly crude method of interpolation. Consider linear, cubic, cubic spline, PCHIP methods of interpolation. Use search to find Stata programs.
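For a cross-check of the carry-forward logic in pandas: sort by week, group by store (the category is already baked into each cenozj column), and forward-fill. The tiny frame below is hypothetical:

```python
import pandas as pd

# Hypothetical weekly means with a gap in week 3 for both stores.
df = pd.DataFrame({
    'week':   [1, 2, 3, 1, 2, 3],
    'store':  [1, 1, 1, 2, 2, 2],
    'cenoz3': [3.0, 10.0, None, 8.0, 7.0, None],
})

# Carry the last observed value forward within each store,
# mirroring: by store category (week), sort: replace ... [_n-1] if missing(...)
df = df.sort_values(['store', 'week'])
for col in ['cenoz3']:  # extend the list to cenoz1..cenoz12, as in the forval loop
    df[col] = df.groupby('store')[col].ffill()
```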
A quick note on why your code
by store (category week), sort: replace cenoz3 = cenoz3[_n-1] if missing(cenoz3)
won't work.
It will work for the example dataset you give. But a slight modification can give unexpected results. Consider the following example:
clear all
set more off
input week store cenoz category
1 1 2 1
1 1 4 2 /*
1 1 3 3 deleted observation */
1 2 5 1
1 2 7 2
1 2 8 3
2 1 4 1
2 1 1 2
2 1 10 3
2 2 3 1
2 2 4 2
2 2 7 3
3 1 5 1
3 1 3 2
3 1 999 3 // new observation
3 2 5 1
3 2 4 2
end
egen cenoz1 = mean(cenoz/ (category == 1)), by(week store)
egen cenoz2 = mean(cenoz/ (category == 2)), by(week store)
egen cenoz3 = mean(cenoz/ (category == 3)), by(week store)
order store category week
sort store category week
list, sepby(store category)
*----- method 1 (your code) -----
gen cenoz3x1 = cenoz3
by store (category week), sort: replace cenoz3x1 = cenoz3x1[_n-1] if missing(cenoz3x1)
*----- method 2 (Nick's code) -----
gen cenoz3x2 = cenoz3
by store category (week), sort: replace cenoz3x2 = cenoz3x2[_n-1] if missing(cenoz3x2)
list, sepby(store category)
Method 1 will assign the price of a category 1 article to a category 2 article (observation 4 of cenoz3x1). Presumably, something you don't want. If you want to avoid this, then the groups should be based on store category and not just store.
The best place to start reading is help and the manuals.

if condition intended to be fulfilled for a whole group based on the value of another variable

At the moment my code reads: gen lateFirms = 1 if firmage0 != .
So at the moment the dataset which I get looks like this:
firm_id lateFirms firmage0
1
1
1
1
1
3
3
3
3
3
4
4
4
4
4
5
5
6 1 110
6
6
6
6
7
7
7
7
7
8 1 90
8
8
8
8
But what I want is this:
firm_id lateFirms firmage0
1
1
1
1
1
3
3
3
3
3
4
4
4
4
4
5
5
6 1 110
6 1
6 1
6 1
6 1
7
7
7
7
7
8 1 90
8 1
8 1
8 1
8 1
NOTE: All blank entries are missing values!
So "lateFirms" should equal 1 if, for a given "firm_id", there exists at least one observation for which firmage0 is not missing.
bysort firm_id : egen present = count(firmage0)
replace lateFirms = present > 0
The count() function of egen counts non-missings and assigns the count to all values for each firm.
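The egen count() logic translates directly to a pandas sketch: transform('count') counts non-missing firmage0 per firm and broadcasts the count back to every row (data abbreviated to three firms):

```python
import pandas as pd

# Abbreviated example: only firm 6 and firm 8 have a non-missing firmage0.
df = pd.DataFrame({
    'firm_id':  [1, 1, 6, 6, 6, 8, 8],
    'firmage0': [None, None, 110, None, None, 90, None],
})

# count() skips missing values, like egen's count() function.
present = df.groupby('firm_id')['firmage0'].transform('count')
df['lateFirms'] = (present > 0).astype(int)
```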
Maybe this helps:
bysort firm_id: gen dum = 1 if sum(firmage0) != 0
To get exactly what you want, you can use replace instead of generate:
bysort firm_id: replace lateFirms = 1 if sum(firmage0) != 0
As Nick Cox pointed out, this solution is specific to the example dataset you provided.