I have data looking like this:
| ID |OpID|
| -- | -- |
| 10 | 1 |
| 10 | 2 |
| 10 | 4 |
| 11 |null|
| 12 | 3 |
| 12 | 4 |
| 13 | 1 |
| 13 | 2 |
| 13 | 3 |
| 14 | 2 |
| 14 | 4 |
Here OpID 4 means 1 and 2.
I would like to count, for each of OpID 1, 2 and 3, the number of distinct IDs that have it.
With the data above, the count for OpID 1 would be 4, for 2 would be 4, and for 3 would be 2.
If an ID has OpID 4 but already has both 1 and 2, the 4 adds nothing. But if 4 exists and only 1 (or only 2) is present, the count for the missing one, 2 (or 1), is incremented.
The expected output would be:
|OpID|Count|
| -- | -- |
| 1 | 4 |
| 2 | 4 |
| 3 | 2 |
(Going to be using the results in a column chart)
Hope this makes sense...
Edit: there are other columns too, and the same ID and OpID combination can appear more than once, so the rows need to be grouped (deduplicated) first.
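The question does not name a tool, so the following is only a minimal sketch of the counting rule in plain Python. The names `EXPAND` and `count_ops`, and the extra duplicate row at the end of `rows`, are assumptions made for illustration. Each OpID is expanded into the base OpIDs it stands for, and the (ID, OpID) pairs are collected in a set so that duplicates and an already-present 1 or 2 are not counted twice.

```python
from collections import Counter

# Sample rows from the question as (ID, OpID); None stands for null.
# The last row is a deliberate duplicate to mimic repeated ID/OpID pairs.
rows = [(10, 1), (10, 2), (10, 4), (11, None), (12, 3), (12, 4),
        (13, 1), (13, 2), (13, 3), (14, 2), (14, 4), (10, 1)]

# Assumption: OpID 4 is shorthand for "1 and 2"; other OpIDs map to themselves.
EXPAND = {4: (1, 2)}

def count_ops(rows):
    """Count distinct IDs per base OpID after expanding composite OpIDs."""
    seen = set()  # distinct (ID, base OpID) pairs
    for id_, op in rows:
        if op is None:
            continue  # null OpIDs contribute nothing
        for base in EXPAND.get(op, (op,)):
            seen.add((id_, base))
    counts = Counter(base for _, base in seen)
    return dict(sorted(counts.items()))

result = count_ops(rows)
print(result)  # {1: 4, 2: 4, 3: 2}
```

The same idea carries over to a query language: expand 4 into two rows, take the distinct (ID, OpID) pairs, then group by OpID and count.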
I have a table like this:
|   | a | b | c |
| x | 1 | 8 | 6 |
| y | 5 | 4 | 2 |
| z | 7 | 3 | 5 |
What I want to do is find a value based on the row and column titles; for example, given c and y, it should return 2. What function(s) should I use to do this in OpenOffice Calc?
later:
I tried =INDEX(B38:K67;MATCH('c';B37:K37;0);MATCH('y';A38:A67;0)), but it reports an invalid argument.
It turned out I had passed the arguments of INDEX in the wrong order: the second argument is the row number, not the column number. The formula =INDEX(B38:K67;MATCH('y';A38:A67;0);MATCH('c';B37:K37;0)) works properly.
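For readers who think in code, here is a rough Python analogue of the two-MATCH-then-INDEX pattern, using the small example table from the question. The names `rows`, `cols`, `data` and `lookup` are made up for this sketch and are not part of Calc. The row position is found first and the column position second, which mirrors the argument order INDEX expects.

```python
rows = ["x", "y", "z"]          # row titles
cols = ["a", "b", "c"]          # column titles
data = [[1, 8, 6],
        [5, 4, 2],
        [7, 3, 5]]

def lookup(row_label, col_label):
    """Return the cell at the intersection of a row title and a column title."""
    r = rows.index(row_label)   # position of the row title (exact match)
    c = cols.index(col_label)   # position of the column title (exact match)
    return data[r][c]           # row index first, then column index

value = lookup("y", "c")
print(value)  # 2
```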
I have a string variable whose values contain no spaces:
clear
input str100 var
"ihaveanewspaper"
"watchingthenewsonthetv"
"watchthenewsandreadthenewspaper"
end
I am using the following command:
gen news = regexm(var, "(news)")
This outputs 1 1 1 because all 3 rows of var contain the word news.
I'm trying to alter the regular expression "(news)" to create two columns, one for news and one for newspaper. regexm(var, "(newspaper)") checks that the row contains newspaper, but I also need a way to check that the characters following news are not "paper", as I'm trying to count the two separately.
EDIT:
Is there a way to count the third entry as 1, since it contains an occurrence of news that is not part of newspaper?
You can quantify as follows without a regular expression:
clear
input str100 var
"ihaveanewspaper"
"watchingthenewsonthetv"
"watchthenewsandreadthenewspaper"
"fdgdnews"
"fgogodigjhoigjnewspaper"
"fgeogeionnewsfgdgfpaper"
"45pap9358newsfjfgni"
end
generate news = strmatch(var, "*news*") & !strmatch(var, "*newspaper*")
list, separator(0)
     +----------------------------------------+
     |                             var   news |
     |----------------------------------------|
  1. |                 ihaveanewspaper      0 |
  2. |          watchingthenewsonthetv      1 |
  3. | watchthenewsandreadthenewspaper      0 |
  4. |                        fdgdnews      1 |
  5. |         fgogodigjhoigjnewspaper      0 |
  6. |         fgeogeionnewsfgdgfpaper      1 |
  7. |             45pap9358newsfjfgni      1 |
     +----------------------------------------+
count if news
4
count if !news
3
EDIT:
One way to do this is to eliminate all instances of the word newspaper and repeat the process:
generate var2 = subinstr(var, "newspaper", "", .)
replace news = 1 if strmatch(var2, "*news*")
list, separator(0)
     +------------------------------------------------------------------+
     |                             var   news                      var2 |
     |------------------------------------------------------------------|
  1. |                 ihaveanewspaper      0                    ihavea |
  2. |          watchingthenewsonthetv      1    watchingthenewsonthetv |
  3. | watchthenewsandreadthenewspaper      1    watchthenewsandreadthe |
  4. |                        fdgdnews      1                  fdgdnews |
  5. |         fgogodigjhoigjnewspaper      0            fgogodigjhoigj |
  6. |         fgeogeionnewsfgdgfpaper      1   fgeogeionnewsfgdgfpaper |
  7. |             45pap9358newsfjfgni      1       45pap9358newsfjfgni |
     +------------------------------------------------------------------+
count if news
5
count if !news
2
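For comparison, the rule "news that is not immediately followed by paper" can be written as a single pattern in regex engines that support negative lookahead. The sketch below uses Python's `re` module only to illustrate the logic; it makes no claim about which Stata functions accept this syntax, and the names `values` and `has_plain_news` are made up for the example.

```python
import re

values = [
    "ihaveanewspaper",
    "watchingthenewsonthetv",
    "watchthenewsandreadthenewspaper",
    "fdgdnews",
    "fgogodigjhoigjnewspaper",
    "fgeogeionnewsfgdgfpaper",
    "45pap9358newsfjfgni",
]

# "news" that is NOT immediately followed by "paper"
plain_news = re.compile(r"news(?!paper)")

def has_plain_news(s):
    """1 if s contains at least one 'news' that is not the start of 'newspaper'."""
    return 1 if plain_news.search(s) else 0

news = [has_plain_news(s) for s in values]
print(news)       # [0, 1, 1, 1, 0, 1, 1]
print(sum(news))  # 5
```

This gives the same flags as the subinstr() approach above, including 1 for the third entry.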
So I'm trying to figure out a good way of vectorizing a calculation and I'm a bit stuck.
| A | B (Calculation) | B (Value) |
|---|----------------------|-----------|
| 1 | | |
| 2 | | |
| 3 | | |
| 4 | =SUM(A1:A4)/4 | 2.5 |
| 5 | =(1/4)*A5 + (3/4)*B4 | 3.125 |
| 6 | =(1/4)*A6 + (3/4)*B5 | 3.84375 |
| 7 | =(1/4)*A7 + (3/4)*B6 | 4.6328125 |
I'm basically trying to replicate Wilder's Average True Range (without using TA-Lib). In the case of my simplified example, column A is the precomputed True Range.
Any ideas of how to do this without looping? Breaking down the equation it's effectively a weighted cumulative sum... but it's definitely not something that the existing pandas cumsum allows out of the box.
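This is indeed an ewm problem. The catch is that the first 4 rows have to be collapsed into a single seed value (their mean); from there ewm with adjust=False and alpha=1/4 reproduces the recursion.
import numpy as np
import pandas as pd
a = df.A.values
# Seed with the mean of the first 4 rows, then append the remaining values
d1 = pd.DataFrame(dict(A=np.append(a[:4].mean(), a[4:])), df.index[3:])
d1.ewm(adjust=False, alpha=.25).mean()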
A
3 2.500000
4 3.125000
5 3.843750
6 4.632812
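As a cross-check on the recursion in the question's table, here is a minimal plain-Python sketch that seeds with the mean of the first n values and then applies (1/n) * A[t] + (1 - 1/n) * B[t-1]. The function name `wilder_smooth` and the parameter `n` are hypothetical. `itertools.accumulate` still iterates internally, so this is a reference for checking values rather than a vectorized solution.

```python
from itertools import accumulate

def wilder_smooth(values, n=4):
    """Seed with the mean of the first n values, then apply
    out[t] = (1/n) * values[t] + (1 - 1/n) * out[t-1]."""
    alpha = 1.0 / n
    seed = sum(values[:n]) / n
    step = lambda prev, x: alpha * x + (1 - alpha) * prev
    # initial= requires Python 3.8+; the seed is the first element of the output
    return list(accumulate(values[n:], step, initial=seed))

smoothed = wilder_smooth([1, 2, 3, 4, 5, 6, 7])
print(smoothed)  # [2.5, 3.125, 3.84375, 4.6328125]
```

The values agree with column B (Value) in the question and with the ewm output above.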
I would like to check if a value has appeared in some previous row of the same column.
At the end I would like to have a cumulative count of the number of distinct observations.
Is there any solution other than concatenating all _n rows and using regular expressions? I'm getting there by concatenating the rows, but given the limit of 244 characters for string variables (in Stata <13), this is sometimes not applicable.
Here's what I'm doing right now:
gen tmp=x
replace tmp = tmp[_n-1]+ "," + tmp if _n > 1
gen cumu=0
replace cumu=1 if regexm(tmp[_n-1],x+"|"+x+",|"+","+x+",")==0
replace cumu= sum(cumu)
Example
+-----+
| x |
|-----|
1. | 12 |
2. | 32 |
3. | 12 |
4. | 43 |
5. | 43 |
6. | 3 |
7. | 4 |
8. | 3 |
9. | 3 |
10. | 3 |
+-----+
becomes
     +--------------------------------+
     |  x  | tmp                      |
     |-----|--------------------------|
  1. | 12  | 12                       |
  2. | 32  | 12,32                    |
  3. | 12  | 12,32,12                 |
  4. | 43  | 12,32,12,43              |
  5. | 43  | 12,32,12,43,43           |
  6. |  3  | 12,32,12,43,43,3         |
  7. |  4  | 12,32,12,43,43,3,4       |
  8. |  3  | 12,32,12,43,43,3,4,3     |
  9. |  3  | 12,32,12,43,43,3,4,3,3   |
 10. |  3  | 12,32,12,43,43,3,4,3,3,3 |
     +--------------------------------+
and finally
     +------------+
     |  x  | cumu |
     |-----|------|
  1. | 12  |    1 |
  2. | 32  |    2 |
  3. | 12  |    2 |
  4. | 43  |    3 |
  5. | 43  |    3 |
  6. |  3  |    4 |
  7. |  4  |    5 |
  8. |  3  |    5 |
  9. |  3  |    5 |
 10. |  3  |    5 |
     +------------+
Any ideas on how to avoid the 'middle step'? (For me that becomes very important when x holds strings instead of numbers.)
Thanks!
Regular expressions are great, but here as often elsewhere simple calculations suffice. With your sample data
. input x
x
1. 12
2. 32
3. 12
4. 43
5. 43
6. 3
7. 4
8. 3
9. 3
10. 3
11. end
end of do-file
you can identify first occurrences of each distinct value:
. gen long order = _n
. bysort x (order) : gen first = _n == 1
. sort order
. l
     +--------------------+
     |  x   order   first |
     |--------------------|
  1. | 12       1       1 |
  2. | 32       2       1 |
  3. | 12       3       0 |
  4. | 43       4       1 |
  5. | 43       5       0 |
     |--------------------|
  6. |  3       6       1 |
  7. |  4       7       1 |
  8. |  3       8       0 |
  9. |  3       9       0 |
 10. |  3      10       0 |
     +--------------------+
The number of distinct values seen so far is then just a cumulative sum of first using sum(). This works with string variables too. In fact this problem is one of several discussed within
http://www.stata-journal.com/sjpdf.html?articlenum=dm0042
which is accessible to all as a .pdf. search distinct would have pointed you to this article.
Becoming fluent with what you can do with by:, sort, _n and _N is an important skill in Stata. See also
http://www.stata-journal.com/sjpdf.html?articlenum=pr0004
for another article accessible to all.
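The two-step logic of the answer (flag the first occurrence of each distinct value, then take a running sum of the flags) can also be sketched outside Stata. The Python below is only an illustration of that logic; the helper name `first_flags` is hypothetical, and a set stands in for the sort-by-value step.

```python
from itertools import accumulate

def first_flags(values):
    """Return 1 where a value appears for the first time, else 0."""
    seen = set()
    flags = []
    for v in values:
        flags.append(0 if v in seen else 1)
        seen.add(v)
    return flags

x = [12, 32, 12, 43, 43, 3, 4, 3, 3, 3]
first = first_flags(x)          # same role as the variable first above
cumu = list(accumulate(first))  # running sum = distinct values seen so far

print(first)  # [1, 1, 0, 1, 0, 1, 1, 0, 0, 0]
print(cumu)   # [1, 2, 2, 3, 3, 4, 5, 5, 5, 5]
```

Nothing here depends on the values being numeric, so the same sketch works unchanged for strings.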