I have panel data. I am interested in calculating the maximum of one variable (Var_C) over the last 5 years. I tried several different functions and loops but did not manage to get what I wanted.
Here is a reproducible example. You must install tsegen with ssc install tsegen before you can use it.
webuse grunfeld
tsset
tsegen max_invest = rowmax(L(0/4).invest)
list *invest if company == 1
+-------------------+
| invest max_in~t |
|-------------------|
1. | 317.6 317.6 |
2. | 391.8 391.8 |
3. | 410.6 410.6 |
4. | 257.7 410.6 |
5. | 330.8 410.6 |
|-------------------|
6. | 461.2 461.2 |
7. | 512 512 |
8. | 448 512 |
9. | 499.6 512 |
10. | 547.5 547.5 |
|-------------------|
11. | 561.2 561.2 |
12. | 688.1 688.1 |
13. | 568.9 688.1 |
14. | 529.2 688.1 |
15. | 555.1 688.1 |
|-------------------|
16. | 642.9 688.1 |
17. | 755.9 755.9 |
18. | 891.2 891.2 |
19. | 1304.4 1304.4 |
20. | 1486.7 1486.7 |
+-------------------+
If the definition of the last 5 years doesn't include the current year, but instead means the previous 5 years, the syntax would be L(1/5). If you want a minimum of 5 years present in each window, there is syntax to match, as sketched below.
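A minimal sketch of that variant, relying on tsegen's optional second argument giving the minimum number of non-missing values required in the window (max_invest5 is just an illustrative name; see help tsegen to confirm the details):

* previous 5 years; result is missing unless all 5 values are present
tsegen max_invest5 = rowmax(L(1/5).invest, 5)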
Could you please help me solve this problem? I am totally new to DAX, and English is not my first language, so I am struggling even to find the correct question to ask.
Here's the problem.
I have two tables:
start_balance
+------+---------------+
| Type | Start balance |
+------+---------------+
| A | 0 |
| B | 10 |
+------+---------------+
in_out
+------+-------+------+----+-----+
| Year | Month | Type | In | Out |
+------+-------+------+----+-----+
| 2020 | 1 | A | 20 | 20 |
| 2020 | 1 | A | 0 | 10 |
| 2020 | 2 | B | 20 | 0 |
| 2020 | 2 | B | 20 | 10 |
+------+-------+------+----+-----+
I'd like to get the result as follows:
Unfiltered:
+------+-------+------+---------+----+-----+------+
| Year | Month | Type | Balance | In | Out | Left |
+------+-------+------+---------+----+-----+------+
| 2020 | 1 | A | 0 | 20 | 20 | 0 |
| 2020 | 1 | B | 10 | 20 | 10 | 20 |
| 2020 | 2 | A | 0 | 20 | 10 | 10 |
| 2020 | 2 | B | 20 | 20 | 10 | 30 |
+------+-------+------+---------+----+-----+------+
Filtered (for example year/month 2020/2):
+------+-------+------+---------+----+-----+------+
| Year | Month | Type | Balance | In | Out | Left |
+------+-------+------+---------+----+-----+------+
| 2020 | 2 | A | 0 | 20 | 10 | 10 |
| 2020 | 2 | B | 20 | 20 | 10 | 30 |
+------+-------+------+---------+----+-----+------+
So when a year/month is selected in a slicer, the measure should calculate the balance accumulated before the selected year/month and then show the values for the selected year/month.
Edit: corrected start_balance table.
Is the sample data correct?
A -> the starting balance is 10, but in your unfiltered table example, it is 0.
Do you have any relationship between these tables?
Does the opening balance always apply to the current year? What if 2021 appears in the in_out table? How do you know from when the start balance applies?
Here is an example without the starting balance.
If you want to show a value that ignores a given filter, you should use the ALL function or the REMOVEFILTERS function (available in Analysis Services 2019 and in Power BI since October 2019).
CALCULATE ( SUM ( in_out[In] ) - SUM ( in_out[Out] ), ALL ( in_out[Year], in_out[Month] ) )
More helpful information:
https://www.sqlbi.com/articles/managing-all-functions-in-dax-all-allselected-allnoblankrow-allexcept/
I have a Dataframe with date as index:
Index    | Opp id | Pipeline_Type | Amount
20170104 | 1      | Null          | 10
20170104 | 2      | Sou           | 20
20170104 | 3      | Inf           | 25
20170118 | 1      | Inf           | 12
20170118 | 2      | Null          | 27
20170118 | 3      | Inf           | 25
Now I want to calculate the number of records (Opp id) for which the pipeline type has changed or the amount has changed (+/- difference) between the two dates. In the data above, that count is 2 for Pipeline_Type (Opp ids 1 and 2) and 2 for Amount (Opp ids 1 and 2).
Please help me frame the solution.
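A minimal pandas sketch of one way to frame it, assuming exactly two snapshot dates per Opp id (the frame below reconstructs the example; the column names opp_id, pipeline_type, and amount are illustrative):

import pandas as pd

# Reconstruction of the example data shown above
df = pd.DataFrame(
    {"opp_id": [1, 2, 3, 1, 2, 3],
     "pipeline_type": ["Null", "Sou", "Inf", "Inf", "Null", "Inf"],
     "amount": [10, 20, 25, 12, 27, 25]},
    index=pd.to_datetime(["2017-01-04"] * 3 + ["2017-01-18"] * 3),
)

# Compare each opportunity's earliest and latest snapshot
snaps = df.sort_index().groupby("opp_id")
first, last = snaps.first(), snaps.last()

n_type_changed = (first["pipeline_type"] != last["pipeline_type"]).sum()
n_amount_changed = (first["amount"] != last["amount"]).sum()
print(n_type_changed, n_amount_changed)  # 2 2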
So I'm trying to figure out a good way of vectorizing a calculation and I'm a bit stuck.
| A | B (Calculation) | B (Value) |
|---|----------------------|-----------|
| 1 | | |
| 2 | | |
| 3 | | |
| 4 | =SUM(A1:A4)/4 | 2.5 |
| 5 | =(1/4)*A5 + (3/4)*B4 | 3.125 |
| 6 | =(1/4)*A6 + (3/4)*B5 | 3.84375 |
| 7 | =(1/4)*A7 + (3/4)*B6 | 4.6328125 |
I'm basically trying to replicate Wilder's Average True Range (without using TA-Lib). In the case of my simplified example, column A is the precomputed True Range.
Any ideas of how to do this without looping? Breaking down the equation, it's effectively a weighted cumulative sum... but it's definitely not something that the existing pandas cumsum allows out of the box.
This is indeed an ewm problem. The trick is that the first 4 rows are collapsed into a single seed row (their mean); from there, ewm takes over:
import numpy as np
import pandas as pd
a = df.A.values
# seed with the mean of the first 4 values, then append the remainder
d1 = pd.DataFrame(dict(A=np.append(a[:4].mean(), a[4:])), df.index[3:])
d1.ewm(adjust=False, alpha=.25).mean()  # y[t] = .75*y[t-1] + .25*x[t]
A
3 2.500000
4 3.125000
5 3.843750
6 4.632812
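Wrapped up as a reusable helper for any period n (a sketch; wilder_smooth is my name for it, not an established API):

import numpy as np
import pandas as pd

def wilder_smooth(s: pd.Series, n: int = 4) -> pd.Series:
    """Seed with the mean of the first n values, then recurse
    y[t] = (1 - 1/n) * y[t-1] + (1/n) * x[t] via ewm."""
    seeded = pd.Series(np.append(s.values[:n].mean(), s.values[n:]),
                       index=s.index[n - 1:])
    return seeded.ewm(adjust=False, alpha=1.0 / n).mean()

# e.g. wilder_smooth(df.A, 4) reproduces column B above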
I know how to do this in Excel and have been copying and pasting the data into Excel, using SUBTOTAL to add various values, and then pasting the results back into Stata. But the dataset I'm dealing with is huge, and one round takes 15 minutes in Excel. Is there an easier way in Stata?
Without a data example or any attempt at code, the question needs guesswork, not least for anyone who never or hardly ever uses Excel.
Please study https://stackoverflow.com/help/mcve for hints on what makes a good question.
Columns of a Stata dataset are called variables.
You may find this silly example helps. You can type the commands in your own copy of Stata.
If not, you may need to be much more specific about what you are doing. We don't need to see the whole of a huge dataset, just something with similar structure.
. sysuse auto, clear
(1978 Automobile Data)
. egen totalprice = total(price), by(rep78)
. tabdisp rep78, c(totalprice)
----------------------
Repair |
Record |
1978 | totalprice
----------+-----------
1 | 9129
2 | 47741
3 | 192877
4 | 109287
5 | 65043
. | 32152
----------------------
. sort rep78
. list rep78 price totalprice, sepby(rep78)
+---------------------------+
| rep78 price totalp~e |
|---------------------------|
1. | 1 4,934 9129 |
2. | 1 4,195 9129 |
|---------------------------|
3. | 2 14,500 47741 |
4. | 2 5,104 47741 |
5. | 2 4,010 47741 |
6. | 2 5,886 47741 |
7. | 2 3,667 47741 |
8. | 2 4,172 47741 |
9. | 2 4,060 47741 |
10. | 2 6,342 47741 |
|---------------------------|
11. | 3 5,222 192877 |
12. | 3 4,099 192877 |
13. | 3 15,906 192877 |
14. | 3 3,895 192877 |
15. | 3 4,723 192877 |
16. | 3 4,647 192877 |
17. | 3 11,385 192877 |
18. | 3 6,165 192877 |
19. | 3 10,372 192877 |
20. | 3 13,466 192877 |
21. | 3 3,291 192877 |
22. | 3 13,594 192877 |
23. | 3 5,172 192877 |
24. | 3 4,187 192877 |
25. | 3 11,497 192877 |
26. | 3 4,296 192877 |
27. | 3 4,733 192877 |
28. | 3 4,516 192877 |
29. | 3 5,788 192877 |
30. | 3 4,749 192877 |
31. | 3 4,082 192877 |
32. | 3 4,181 192877 |
33. | 3 3,299 192877 |
34. | 3 4,816 192877 |
35. | 3 3,955 192877 |
36. | 3 10,371 192877 |
37. | 3 5,189 192877 |
38. | 3 4,482 192877 |
39. | 3 4,504 192877 |
40. | 3 6,295 192877 |
|---------------------------|
41. | 4 3,829 109287 |
42. | 4 5,798 109287 |
43. | 4 4,389 109287 |
44. | 4 4,890 109287 |
45. | 4 7,827 109287 |
46. | 4 3,995 109287 |
47. | 4 9,735 109287 |
48. | 4 8,814 109287 |
49. | 4 6,303 109287 |
50. | 4 7,140 109287 |
51. | 4 6,850 109287 |
52. | 4 4,697 109287 |
53. | 4 5,705 109287 |
54. | 4 5,079 109287 |
55. | 4 4,499 109287 |
56. | 4 6,229 109287 |
57. | 4 8,129 109287 |
58. | 4 5,379 109287 |
|---------------------------|
59. | 5 3,748 65043 |
60. | 5 5,899 65043 |
61. | 5 5,719 65043 |
62. | 5 11,995 65043 |
63. | 5 4,589 65043 |
64. | 5 5,799 65043 |
65. | 5 5,397 65043 |
66. | 5 3,984 65043 |
67. | 5 9,690 65043 |
68. | 5 4,425 65043 |
69. | 5 3,798 65043 |
|---------------------------|
70. | . 3,799 32152 |
71. | . 6,486 32152 |
72. | . 12,990 32152 |
73. | . 4,424 32152 |
74. | . 4,453 32152 |
+---------------------------+
I would like to check if a value has appeared in some previous row of the same column.
At the end I would like to have a cumulative count of the number of distinct observations.
Is there any solution other than concatenating all _n rows and using regular expressions? I'm getting there with concatenating the rows, but given the limit of 244 characters for string variables (in Stata <13), this is sometimes not applicable.
Here's what I'm doing right now:
gen tmp=x
replace tmp = tmp[_n-1]+ "," + tmp if _n > 1
gen cumu=0
replace cumu=1 if regexm(tmp[_n-1],x+"|"+x+",|"+","+x+",")==0
replace cumu= sum(cumu)
Example
+-----+
| x |
|-----|
1. | 12 |
2. | 32 |
3. | 12 |
4. | 43 |
5. | 43 |
6. | 3 |
7. | 4 |
8. | 3 |
9. | 3 |
10. | 3 |
+-----+
becomes
     +----------------------------------+
     |  x | tmp                         |
     |----|-----------------------------|
  1. | 12 | 12                          |
  2. | 32 | 12,32                       |
  3. | 12 | 12,32,12                    |
  4. | 43 | 12,32,12,43                 |
  5. | 43 | 12,32,12,43,43              |
  6. |  3 | 12,32,12,43,43,3            |
  7. |  4 | 12,32,12,43,43,3,4          |
  8. |  3 | 12,32,12,43,43,3,4,3        |
  9. |  3 | 12,32,12,43,43,3,4,3,3      |
 10. |  3 | 12,32,12,43,43,3,4,3,3,3    |
     +----------------------------------+
and finally
     +------------+
     |  x  | cumu |
     |-----|------|
  1. | 12  |    1 |
  2. | 32  |    2 |
  3. | 12  |    2 |
  4. | 43  |    3 |
  5. | 43  |    3 |
  6. |  3  |    4 |
  7. |  4  |    5 |
  8. |  3  |    5 |
  9. |  3  |    5 |
 10. |  3  |    5 |
     +------------+
Any ideas how to avoid the 'middle step'? (For me that becomes very important when x holds strings instead of numbers.)
Thanks!
Regular expressions are great, but here as often elsewhere simple calculations suffice. With your sample data
. input x
x
1. 12
2. 32
3. 12
4. 43
5. 43
6. 3
7. 4
8. 3
9. 3
10. 3
11. end
end of do-file
you can identify first occurrences of each distinct value:
. gen long order = _n
. bysort x (order) : gen first = _n == 1
. sort order
. l
+--------------------+
| x order first |
|--------------------|
1. | 12 1 1 |
2. | 32 2 1 |
3. | 12 3 0 |
4. | 43 4 1 |
5. | 43 5 0 |
|--------------------|
6. | 3 6 1 |
7. | 4 7 1 |
8. | 3 8 0 |
9. | 3 9 0 |
10. | 3 10 0 |
+--------------------+
The number of distinct values seen so far is then just a cumulative sum of first using sum(). This works with string variables too.
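Concretely, that last step is one line (cumu is just an illustrative name):

. gen cumu = sum(first)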
In fact, this problem is one of several discussed in
http://www.stata-journal.com/sjpdf.html?articlenum=dm0042
which is accessible to all as a .pdf. Typing search distinct in Stata would have pointed you to this article.
Becoming fluent with what you can do with by:, sort, _n and _N is an important skill in Stata. See also
http://www.stata-journal.com/sjpdf.html?articlenum=pr0004
for another article accessible to all.