Generating a variable including only the top 4 firms with the largest sales - Stata

My question is closely related to this question:
Calculate industry concentration in Stata based on four biggest numbers
I want to generate a variable that includes only the 4 firms with the largest sales and excludes the rest.
In other words, the new variable will hold values only for the 4 firms with the largest sales in a given industry in a given year; for all other firms it will be missing (.).

Consider this example. Within each year, sorting in ascending order of invest puts the four largest values in the last four observations (_n >= _N - 3), so cond() keeps those and sets all others to missing:
webuse grunfeld, clear
bysort year (invest) : gen largest4 = cond(_n < _N - 3, ., invest)
sort year invest
list year largest4 if largest4 < . in 1/40, sepby(year)
     +-----------------+
     | year   largest4 |
     |-----------------|
  7. | 1935      39.68 |
  8. | 1935      40.29 |
  9. | 1935      209.9 |
 10. | 1935      317.6 |
     |-----------------|
 17. | 1936      50.73 |
 18. | 1936      72.76 |
 19. | 1936      355.3 |
 20. | 1936      391.8 |
     |-----------------|
 27. | 1937      74.24 |
 28. | 1937       77.2 |
 29. | 1937      410.6 |
 30. | 1937      469.9 |
     |-----------------|
 37. | 1938       51.6 |
 38. | 1938      53.51 |
 39. | 1938      257.7 |
 40. | 1938      262.3 |
     +-----------------+
If you had missing values, they would sort to the end of each block and mess up the results, so one more trick is needed:
generate OK = !missing(invest)
bysort OK year (invest) : gen Largest4 = cond(_n < _N - 3, ., invest) if OK
sort year invest
list year Largest4 if Largest4 < . in 1/40, sepby(year)
With this example, which you can run, there are no missing values and the results are the same.
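The same approach extends to industry-year panels, which is what the question asks about. A minimal sketch, assuming (hypothetical) variables industry, year, and sales:
* hypothetical variable names: industry, year, sales
generate OK = !missing(sales)
bysort OK industry year (sales) : gen largest4 = cond(_n < _N - 3, ., sales) if OK
Each industry-year block is sorted in ascending order of sales, so only its last four observations, the four largest, keep their values.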


Calculating if Start time occurs within 1 hour range for each person (Single column)

I'm trying to figure out how to calculate whether the start time for each subject occurs within 1 hour of the others. However, I only have one column, and two groups with two different dates. I have no comparative variable for a dhms time difference, as the values all occur under the same column variable. I have thought of doing a lag on the first time and then an intck to calculate the 24-hour time difference between each, but I don't think I have sufficient arguments for the intck function. Alternatively, I could do a proc transpose and then a timediff between each array variable, but that seems messy. Does anyone have a less clunky, more efficient solution? I might be overthinking this.
Sample Data:
+----------+-------+------+------------+------------+
| CLIENTID | GRPID | date | start_date | start_time |
+----------+-------+------+------------+------------+
|        2 |     1 |   -2 | 10Nov2019  | 23:19:52   |
|        3 |     1 |   -2 | 10Nov2019  | 23:22:51   |
|        4 |     1 |   -2 | 10Nov2019  | 23:20:16   |
|        5 |     1 |   -2 | 10Nov2019  | 23:21:30   |
|        6 |     1 |   -2 | 10Nov2019  | 23:23:51   |
|       23 |     2 |   -2 | 11Nov2019  | 23:11:38   |
|       24 |     2 |   -2 | 11Nov2019  | 23:38:33   |
|       25 |     2 |   -2 | 11Nov2019  | 23:15:01   |
|       26 |     2 |   -2 | 11Nov2019  | 23:08:43   |
+----------+-------+------+------------+------------+
You can compile the start date and time into a temporary datetime variable (_start_dt) to ease the comparison. Then, taking the first datetime for each GRPID as the baseline, you could use a RETAIN statement to pass that baseline datetime (_base_dt) down the related data rows and find the time difference (time_diff) using the INTCK function with a dtsecond interval.
proc sort data=your_data;
   by grpid clientid;
run;

data your_results (drop=_:);   /* drop the temporary underscore variables on output */
   retain CLIENTID GRPID DATE start_date start_time _base_dt;
   format _base_dt _start_dt datetime16. time_diff time8.;
   set your_data;
   by grpid clientid;
   /* combine the date and time into a single datetime value */
   _start_dt = dhms(start_date, hour(start_time), minute(start_time), second(start_time));
   /* the first row of each group sets the baseline */
   if first.grpid then _base_dt = _start_dt;
   /* seconds from the baseline, displayed as hh:mm:ss */
   time_diff = intck('dtsecond', _base_dt, _start_dt);
run;
This gives the following results dataset:
+----------+-------+------+------------+------------+-----------+
| CLIENTID | GRPID | date | start_date | start_time | time_diff |
+----------+-------+------+------------+------------+-----------+
|        2 |     1 |   -2 | 10Nov2019  | 23:19:52   | 00:00:00  |
|        3 |     1 |   -2 | 10Nov2019  | 23:22:51   | 00:02:59  |
|        4 |     1 |   -2 | 10Nov2019  | 23:20:16   | 00:00:24  |
|        5 |     1 |   -2 | 10Nov2019  | 23:21:30   | 00:01:38  |
|        6 |     1 |   -2 | 10Nov2019  | 23:23:51   | 00:03:59  |
|       23 |     2 |   -2 | 11Nov2019  | 23:11:38   | 00:00:00  |
|       24 |     2 |   -2 | 11Nov2019  | 23:38:33   | 00:26:55  |
|       25 |     2 |   -2 | 11Nov2019  | 23:15:01   | 00:03:23  |
|       26 |     2 |   -2 | 11Nov2019  | 23:08:43   | -0:02:55  |
+----------+-------+------+------------+------------+-----------+
I think I’ve interpreted your requirements correctly; let me know if not.
It sounds like you want to check whether the RANGE of start_time over each group is < 1 hour. Coerce start_date to a datetime value and add start_time before computing the range:
data have;
   input CLIENTID GRPID date start_date: date9. start_time: hhmmss6.;
   format start_date date9. start_time time8.;
   datalines;
2 1 -2 10Nov2019 23:19:52
3 1 -2 10Nov2019 23:22:51
4 1 -2 10Nov2019 23:20:16
5 1 -2 10Nov2019 23:21:30
6 1 -2 10Nov2019 23:23:51
23 2 -2 11Nov2019 23:11:38
24 2 -2 11Nov2019 23:38:33
25 2 -2 11Nov2019 23:15:01
26 2 -2 11Nov2019 23:08:43
;
run;
proc sql;
   create table want (label="start range status by group") as
   select
      grpid,
      range(dhms(start_date,0,0,0) + start_time) as start_range format time8.,
      calculated start_range < '01:00:00't as one_hr_start_flag
   from have
   group by grpid;
quit;
If you want to ignore the groups and compare only the time of day, disregarding the date, the range computation would be:
* presuming 'noon' is the center of the day;
proc sql;
   create table want (label="time of day start range status overall") as
   select
      range(start_time) as range format time8.,
      calculated range < '01:00:00't as one_hr_start_flag
   from have;
quit;
Looking only at the time of day is always troublesome when a time value falls slightly after midnight.

Transposing variables

Is there an easy way to transpose my variables in Stata?
From:
-.48685038 -.13912173 -.91550094 -.96246505
-1.4760038 1.2873173 -.22300169 .25329232
-.01091149 -.58777297 .49454963 2.2842488
-.01376025 -.03060045 -.26231077 .32238093
.51557881 -2.1968436 .36612388 -.40590465
To:
-.48685038 -1.4760038 -.01091149 -.01376025 .51557881
-.13912173 1.2873173 -.58777297 -.03060045 -2.1968436
-.91550094 -.22300169 .49454963 -.26231077 .36612388
-.96246505 .25329232 2.2842488 .32238093 -.40590465
My understanding is that I have to create a matrix first:
mkmat *, matrix(data)
matrix data = data'
svmat data
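As written, this leaves the original variables in place and svmat simply appends the new ones next to them. A fuller sketch of the matrix route (all variables must be numeric, and in older Stata the matrix is subject to matsize limits):
mkmat *, matrix(data)    // copy all (numeric) variables into matrix data
drop _all                // clear the dataset
matrix data = data'      // transpose the matrix
svmat data               // write it back as variables data1, data2, ...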
Try xpose:
. webuse xposexmpl, clear
. list
     +--------------------------------+
     | county   year1   year2   year3 |
     |--------------------------------|
  1. |      1    57.2    11.3    19.5 |
  2. |      2    12.5     8.2    28.9 |
  3. |      3      18    14.2    33.2 |
     +--------------------------------+
. xpose, clear varname
. list
     +-------------------------------+
     |   v1     v2     v3   _varname |
     |-------------------------------|
  1. |    1      2      3     county |
  2. | 57.2   12.5     18      year1 |
  3. | 11.3    8.2   14.2      year2 |
  4. | 19.5   28.9   33.2      year3 |
     +-------------------------------+

Pandas: consecutive rows' value change comparison

I have a DataFrame with date as the index:
Index    | Opp id | Pipeline_Type | Amount
---------+--------+---------------+-------
20170104 |      1 | Null          |     10
20170104 |      2 | Sou           |     20
20170104 |      3 | Inf           |     25
20170118 |      1 | Inf           |     12
20170118 |      2 | Null          |     27
20170118 |      3 | Inf           |     25
Now I want to calculate the number of records (Opp id) for which Pipeline_Type has changed or Amount has changed (+/- difference) between the two dates. For the data above, that count will be 2 for Pipeline_Type and 2 for Amount.
Please help me frame the solution.

How to calculate the maximum of last 5 years in panel data

I have panel data. I am interested in calculating the maximum of one variable (Var_C) over the last 5 years. I tried several different functions and loops but did not manage to get what I wanted.
Here is a reproducible example. You must install tsegen with ssc install tsegen before you can use it.
webuse grunfeld
tsset
tsegen max_invest = rowmax(L(0/4).invest)
list *invest if company == 1
     +-------------------+
     | invest   max_in~t |
     |-------------------|
  1. |  317.6      317.6 |
  2. |  391.8      391.8 |
  3. |  410.6      410.6 |
  4. |  257.7      410.6 |
  5. |  330.8      410.6 |
     |-------------------|
  6. |  461.2      461.2 |
  7. |    512        512 |
  8. |    448        512 |
  9. |  499.6        512 |
 10. |  547.5      547.5 |
     |-------------------|
 11. |  561.2      561.2 |
 12. |  688.1      688.1 |
 13. |  568.9      688.1 |
 14. |  529.2      688.1 |
 15. |  555.1      688.1 |
     |-------------------|
 16. |  642.9      688.1 |
 17. |  755.9      755.9 |
 18. |  891.2      891.2 |
 19. | 1304.4     1304.4 |
 20. | 1486.7     1486.7 |
     +-------------------+
If the definition of the last 5 years doesn't include the current year, but means the previous 5 years, the syntax would be L(1/5). If you want a minimum of 5 years in each window, there is syntax to match.
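For instance, tsegen's row functions accept an optional final argument giving the minimum number of nonmissing values required in each window. A sketch of the previous-5-years variant, requiring all 5 values to be present:
* previous 5 years, excluding the current one; require all 5 values nonmissing
tsegen max_prev5 = rowmax(L(1/5).invest, 5)
With fewer than 5 usable lags (e.g., at the start of each panel), max_prev5 is set to missing.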

Stata: Cumulative number of new observations

I would like to check if a value has appeared in some previous row of the same column.
At the end I would like to have a cumulative count of the number of distinct observations.
Is there any other solution than concatenating all _n rows and using regular expressions? I'm getting there with concatenating the rows, but given the limit of 244 characters for string variables (in Stata <13), this is sometimes not applicable.
Here's what I'm doing right now:
gen tmp=x
replace tmp = tmp[_n-1]+ "," + tmp if _n > 1
gen cumu=0
replace cumu=1 if regexm(tmp[_n-1],x+"|"+x+",|"+","+x+",")==0
replace cumu= sum(cumu)
Example
     +-----+
     |   x |
     |-----|
  1. |  12 |
  2. |  32 |
  3. |  12 |
  4. |  43 |
  5. |  43 |
  6. |   3 |
  7. |   4 |
  8. |   3 |
  9. |   3 |
 10. |   3 |
     +-----+
becomes
     +----+--------------------------+
     |  x | tmp                      |
     |----+--------------------------|
  1. | 12 | 12                       |
  2. | 32 | 12,32                    |
  3. | 12 | 12,32,12                 |
  4. | 43 | 12,32,12,43              |
  5. | 43 | 12,32,12,43,43           |
  6. |  3 | 12,32,12,43,43,3         |
  7. |  4 | 12,32,12,43,43,3,4       |
  8. |  3 | 12,32,12,43,43,3,4,3     |
  9. |  3 | 12,32,12,43,43,3,4,3,3   |
 10. |  3 | 12,32,12,43,43,3,4,3,3,3 |
     +----+--------------------------+
and finally
     +----+------+
     |  x | cumu |
     |----+------|
  1. | 12 |    1 |
  2. | 32 |    2 |
  3. | 12 |    2 |
  4. | 43 |    3 |
  5. | 43 |    3 |
  6. |  3 |    4 |
  7. |  4 |    5 |
  8. |  3 |    5 |
  9. |  3 |    5 |
 10. |  3 |    5 |
     +----+------+
Any ideas how to avoid the middle step? (For me that becomes very important when x holds strings instead of numbers.)
Thanks!
Regular expressions are great, but here as often elsewhere simple calculations suffice. With your sample data
. input x

          x
  1. 12
  2. 32
  3. 12
  4. 43
  5. 43
  6. 3
  7. 4
  8. 3
  9. 3
 10. 3
 11. end
you can identify first occurrences of each distinct value:
. gen long order = _n
. bysort x (order) : gen first = _n == 1
. sort order
. l
     +--------------------+
     |  x   order   first |
     |--------------------|
  1. | 12       1       1 |
  2. | 32       2       1 |
  3. | 12       3       0 |
  4. | 43       4       1 |
  5. | 43       5       0 |
     |--------------------|
  6. |  3       6       1 |
  7. |  4       7       1 |
  8. |  3       8       0 |
  9. |  3       9       0 |
 10. |  3      10       0 |
     +--------------------+
The number of distinct values seen so far is then just a cumulative sum of first using sum(). This works with string variables too. In fact, this problem is one of several discussed in
http://www.stata-journal.com/sjpdf.html?articlenum=dm0042
which is accessible to all as a .pdf. In Stata, search distinct would have pointed you to this article.
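Concretely, the final step for the example above is a single line:
gen cumu = sum(first)
sum() returns the running sum, so cumu counts the distinct values seen so far; because first is numeric, the same line works unchanged when x is a string variable.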
Becoming fluent with what you can do with by:, sort, _n and _N is an important skill in Stata. See also
http://www.stata-journal.com/sjpdf.html?articlenum=pr0004
for another article accessible to all.