I would like to check if a value has appeared in some previous row of the same column.
At the end I would like to have a cumulative count of the number of distinct observations.
Is there any solution other than concatenating all _n rows and using regular expressions? I'm getting there with concatenating the rows, but given the limit of 244 characters for string variables (in Stata <13), this is sometimes not applicable.
Here's what I'm doing right now:
gen tmp=x
replace tmp = tmp[_n-1]+ "," + tmp if _n > 1
gen cumu=0
replace cumu=1 if regexm(tmp[_n-1],x+"|"+x+",|"+","+x+",")==0
replace cumu= sum(cumu)
Example
+-----+
| x |
|-----|
1. | 12 |
2. | 32 |
3. | 12 |
4. | 43 |
5. | 43 |
6. | 3 |
7. | 4 |
8. | 3 |
9. | 3 |
10. | 3 |
+-----+
becomes
+-------------------------------+
| x | tmp |
|-----|--------------------------
1. | 12 | 12 |
2. | 32 | 12,32 |
3. | 12 | 12,32,12 |
4. | 43 | 12,32,12,43 |
5. | 43 | 12,32,12,43,43 |
6. | 3 | 12,32,12,43,43,3 |
7. | 4 | 12,32,12,43,43,3,4 |
8. | 3 | 12,32,12,43,43,3,4,3 |
9. | 3 | 12,32,12,43,43,3,4,3,3 |
10. | 3 | 12,32,12,43,43,3,4,3,3,3|
+--------------------------------+
and finally
+-----------+
| x | cumu|
|-----|------
1. | 12 | 1 |
2. | 32 | 2 |
3. | 12 | 2 |
4. | 43 | 3 |
5. | 43 | 3 |
6. | 3 | 4 |
7. | 4 | 5 |
8. | 3 | 5 |
9. | 3 | 5 |
10. | 3 | 5 |
+-----------+
Any ideas how to avoid the 'middle step'? (For me that becomes very important when x holds strings instead of numbers.)
Thanks!
Regular expressions are great, but here, as often elsewhere, simple calculations suffice. With your sample data
. input x
x
1. 12
2. 32
3. 12
4. 43
5. 43
6. 3
7. 4
8. 3
9. 3
10. 3
11. end
end of do-file
you can identify first occurrences of each distinct value:
. gen long order = _n
. bysort x (order) : gen first = _n == 1
. sort order
. l
+--------------------+
| x order first |
|--------------------|
1. | 12 1 1 |
2. | 32 2 1 |
3. | 12 3 0 |
4. | 43 4 1 |
5. | 43 5 0 |
|--------------------|
6. | 3 6 1 |
7. | 4 7 1 |
8. | 3 8 0 |
9. | 3 9 0 |
10. | 3 10 0 |
+--------------------+
The number of distinct values seen so far is then just a cumulative sum of first using sum(). This works with string variables too. In fact this problem is one of several discussed within
http://www.stata-journal.com/sjpdf.html?articlenum=dm0042
which is accessible to all as a .pdf. search distinct would have pointed you to this article.
Becoming fluent with what you can do with by:, sort, _n and _N is an important skill in Stata. See also
http://www.stata-journal.com/sjpdf.html?articlenum=pr0004
for another article accessible to all.
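For readers comparing with other tools, the same two-step idea (flag first occurrences, then take a running sum) can be sketched in plain Python. This is only an illustration of the logic on the sample data, not a replacement for the Stata commands above; the names `first` and `cumu` simply mirror the variables used in this thread.

```python
from itertools import accumulate

# Sample data from the question
x = [12, 32, 12, 43, 43, 3, 4, 3, 3, 3]

# Flag the first occurrence of each distinct value (1 = first, 0 = repeat)
seen = set()
first = []
for value in x:
    first.append(int(value not in seen))
    seen.add(value)

# Running sum of the flags = number of distinct values seen so far
cumu = list(accumulate(first))
print(cumu)
```

Because membership is tested on a set, the same sketch works unchanged when x holds strings.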
Related
I have data looking like this:
| ID |OpID|
| -- | -- |
| 10 | 1 |
| 10 | 2 |
| 10 | 4 |
| 11 |null|
| 12 | 3 |
| 12 | 4 |
| 13 | 1 |
| 13 | 2 |
| 13 | 3 |
| 14 | 2 |
| 14 | 4 |
Here an OpID of 4 means both 1 and 2.
I would like to count, across distinct IDs, how many IDs have each of the OpIDs 1, 2 and 3.
For the data above, the count for OpID 1 would be 4, for OpID 2 it would be 4, and for OpID 3 it would be 2.
If an ID has an OpID of 4 but already has rows for both 1 and 2, the 4 adds nothing. But if 4 exists and only 1 (or only 2) is present, the count for 2 (or 1) is incremented.
The expected output would be:
|OpID|Count|
| -- | --  |
| 1 | 4 |
| 2 | 4 |
| 3 | 2 |
(Going to be using the results in a column chart)
Hope this makes sense...
Edit: there are other columns too, and an (ID, OpID) pair can be duplicated, so a group-by step is needed first.
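The question does not say which tool is in use, so here is a minimal plain-Python sketch of the counting rule as I read it: expand 4 into {1, 2}, collect the effective OpIDs per ID in a set (which also absorbs duplicate rows), then count IDs per OpID. The `EXPAND` table and the variable names are assumptions made for illustration.

```python
from collections import defaultdict

# Sample rows from the question as (ID, OpID); None stands in for null
rows = [(10, 1), (10, 2), (10, 4), (11, None), (12, 3), (12, 4),
        (13, 1), (13, 2), (13, 3), (14, 2), (14, 4)]

# Assumed expansion table: OpID 4 stands for both 1 and 2
EXPAND = {1: {1}, 2: {2}, 3: {3}, 4: {1, 2}}

# Collect the set of effective OpIDs per ID; sets absorb duplicate rows
ops_by_id = defaultdict(set)
for id_, op in rows:
    if op is not None:
        ops_by_id[id_] |= EXPAND[op]

# Count how many distinct IDs carry each effective OpID
counts = {op: sum(op in ops for ops in ops_by_id.values()) for op in (1, 2, 3)}
print(counts)
```

On the sample data this reproduces the expected counts of 4, 4 and 2. The same steps map onto a group-by on ID followed by a count per OpID in whatever tool holds the data.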
I have a variable num with values 1-10.
I would like to create a new variable type with values odd or even:
generate type = odd if inlist(num, 1,3,5,7,9)
Questions:
What is the cleanest way to also label even numbers?
Could I use a negation somewhere and keep the command all in one line?
The code you provide is not valid syntax:
clear
set obs 10
generate num = _n
generate type = odd if inlist(num, 1,3,5,7,9)
odd not found
r(111);
You could get what you want with:
generate type = "odd" if inlist(num, 1,3,5,7,9)
And you can do both at the same time using the cond() function:
generate type = cond(inlist(num, 1,3,5,7,9), "odd", "even")
However, having this variable as a string will be of limited value for later use.
You could subsequently use the encode command to create a new variable of numeric type:
encode type, generate(type2)
list
+--------------------+
| num type type2 |
|--------------------|
1. | 1 odd odd |
2. | 2 even even |
3. | 3 odd odd |
4. | 4 even even |
5. | 5 odd odd |
|--------------------|
6. | 6 even even |
7. | 7 odd odd |
8. | 8 even even |
9. | 9 odd odd |
10. | 10 even even |
+--------------------+
Although seemingly identical, type and type2 variables are indeed of a different type:
list, nolabel
+--------------------+
| num type type2 |
|--------------------|
1. | 1 odd 2 |
2. | 2 even 1 |
3. | 3 odd 2 |
4. | 4 even 1 |
5. | 5 odd 2 |
|--------------------|
6. | 6 even 1 |
7. | 7 odd 2 |
8. | 8 even 1 |
9. | 9 odd 2 |
10. | 10 even 1 |
+--------------------+
This is how you can do it with type being a numeric variable:
generate type = mod(num, 2)
list
+------------+
| num type |
|------------|
1. | 1 1 |
2. | 2 0 |
3. | 3 1 |
4. | 4 0 |
5. | 5 1 |
|------------|
6. | 6 0 |
7. | 7 1 |
8. | 8 0 |
9. | 9 1 |
10. | 10 0 |
+------------+
You then create the value label and attach it to the variable type:
label define numlab 0 "even" 1 "odd"
label values type numlab
list
+------------+
| num type |
|------------|
1. | 1 odd |
2. | 2 even |
3. | 3 odd |
4. | 4 even |
5. | 5 odd |
|------------|
6. | 6 even |
7. | 7 odd |
8. | 8 even |
9. | 9 odd |
10. | 10 even |
+------------+
If you only want the odd numbers labeled you can simply do:
label define numlab 1 "odd"
If you later change your mind and want to add a label for even numbers:
label define numlab 0 "even", add
Once your command (with "odd" in quotes) has been run, the value of type for the odd numbers is "odd" and the value for the even numbers is "", that is, an empty (missing) string.
You could tag the even numbers using
replace type = "even" if type==""
I cannot think of a way to keep it all in one line, since you have to both generate the variable and fill in two different string values.
If you could use a numeric variable (I name it flag) as your type variable, you could try this:
gen flag = mod(num,2)
This will flag the odd numbers as 1 and the even numbers as 0. You could then create a label for the flag variable, if you need to display its values as "odd" and "even".
I am using Stata 13 to stack several variables into one variable using
stack stand1-stand10, into(all)
However, I need to do it for each unique id which is pasted parallel to all, something like:
bysort familyid: stack stand1-stand10,into(all) keep familyid
We can use a simpler analogue of your data example.
clear
set obs 3
gen familyid = _n
forval j = 1/3 {
gen stand`j' = _n * `j'
}
list
+-------------------------------------+
| familyid stand1 stand2 stand3 |
|-------------------------------------|
1. | 1 1 2 3 |
2. | 2 2 4 6 |
3. | 3 3 6 9 |
+-------------------------------------+
save original
To stack with an identifier, just repeat the identifier variable name. For more than a few variables, it's easiest to prepare a call using a loop.
forval j = 1/3 {
local call `call' familyid stand`j'
}
di "`call'"
familyid stand1 familyid stand2 familyid stand3
stack `call', into(familyid stand)
sort familyid _stack
list, sepby(familyid)
+---------------------------+
| _stack familyid stand |
|---------------------------|
1. | 1 1 1 |
2. | 2 1 2 |
3. | 3 1 3 |
|---------------------------|
4. | 1 2 2 |
5. | 2 2 4 |
6. | 3 2 6 |
|---------------------------|
7. | 1 3 3 |
8. | 2 3 6 |
9. | 3 3 9 |
+---------------------------+
That said, it's easier to use reshape long.
use original, clear
reshape long stand, i(familyid) j(which)
list, sepby(familyid)
+--------------------------+
| familyid which stand |
|--------------------------|
1. | 1 1 1 |
2. | 1 2 2 |
3. | 1 3 3 |
|--------------------------|
4. | 2 1 2 |
5. | 2 2 4 |
6. | 2 3 6 |
|--------------------------|
7. | 3 1 3 |
8. | 3 2 6 |
9. | 3 3 9 |
+--------------------------+
I have a DataFrame with the date as index:
Index | Opp id | Pipeline_Type |Amount
20170104 | 1 | Null | 10
20170104 | 2 | Sou | 20
20170104 | 3 | Inf | 25
20170118 | 1 | Inf | 12
20170118 | 2 | Null | 27
20170118 | 3 | Inf | 25
Now I want to calculate the number of records (Opp id) for which Pipeline_Type has changed, and the number for which Amount has changed (in either direction). For the data above, the answer is 2 for Pipeline_Type and 2 for Amount.
Please help me frame the solution.
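One possible way to frame it, as a hedged sketch in pandas: group by `Opp id` and count the distinct values of each column, treating an opportunity as changed when it has more than one. It assumes that "Null" in the table is a missing value and that it should count as a state of its own; `per_opp` and `changed` are names chosen here for illustration.

```python
import pandas as pd

# Sample data from the question; None stands in for the "Null" entries
df = pd.DataFrame(
    {
        "Opp id": [1, 2, 3, 1, 2, 3],
        "Pipeline_Type": [None, "Sou", "Inf", "Inf", None, "Inf"],
        "Amount": [10, 20, 25, 12, 27, 25],
    },
    index=pd.Index(
        [20170104, 20170104, 20170104, 20170118, 20170118, 20170118],
        name="Index",
    ),
)

# Assumption: a missing pipeline type is its own state, so replace it with a
# placeholder string (nunique would otherwise ignore missing values)
snap = df.assign(Pipeline_Type=df["Pipeline_Type"].fillna("Null"))

# Distinct values per opportunity; more than one means the column changed
per_opp = snap.groupby("Opp id")[["Pipeline_Type", "Amount"]].nunique()
changed = (per_opp > 1).sum()
print(changed)
```

Note that this flags any variation across all dates for an opportunity. If only the first and last snapshot should be compared, the comparison would need to be restricted to those two dates.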
How can I extend the column values from:
1 | 1 | 7.317073
2 | 1 | 14.634146
3 | 1 | 24.390244
4 | 2 | 7.317073
5 | 2 | 14.634146
6 | 2 | 24.390244
To:
1 | 1 | 7.317073
2 | 1 | 14.634146
3 | 1 | 24.390244
4 | 2 | 7.317073
5 | 2 | 14.634146
6 | 2 | 24.390244
7 | 3 | 7.317073
8 | 3 | 14.634146
9 | 3 | 24.390244
10 | 4 | 7.317073
11 | 4 | 14.634146
12 | 4 | 24.390244
I'm using Open Office.
Assuming that the top left corner is A1, set the fourth row as follows:
A4: =A3+1
B4: =ROUNDUP(A4/3)
C4: =C1
Then fill them down to row 12.
For ColumnA simply selecting the first three rows, grabbing the fill handle (black square at the bottom right of the range) and dragging down to suit should be sufficient.
An alternative here to ROUNDUP is, in B1 and copied down:
=INT((ROW()-1)/3)+1
For ColumnC, do as for ColumnA but with Ctrl held down.
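As a quick check of the arithmetic behind these formulas, here is a small Python sketch (not spreadsheet code) that builds the twelve target rows. It assumes the three values in C1:C3 repeat in order, and it confirms that the ROUNDUP and INT variants give the same group number.

```python
import math

values = [7.317073, 14.634146, 24.390244]  # the three values in C1:C3

rows = []
for n in range(1, 13):                  # row numbers 1..12 (column A)
    group = (n - 1) // 3 + 1            # same as =INT((ROW()-1)/3)+1
    assert group == math.ceil(n / 3)    # same as =ROUNDUP(A/3)
    rows.append((n, group, values[(n - 1) % 3]))  # column C cycles every 3 rows

for row in rows:
    print(row)
```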