Consecutive tagging in Stata

The task is to count, for each product (upc) in a given store, how many consecutive weeks it has been on promotion.
clear
input ///
upc week store promo
1 1 86 1
1 2 86 1
1 3 86 1
1 4 86 1
3 1 86 0
3 2 86 1
4 1 86 0
4 2 86 1
4 3 86 1
end
The end result should look something like this:
upc week store promo promocount
1 1 86 1 1
1 2 86 1 2
1 3 86 1 3
1 4 86 1 4
3 1 86 0 0
3 2 86 1 1
4 1 86 0 0
4 2 86 1 1
4 3 86 1 2
end
I have 800K observations and am running into a problem with the real data set. When I run bysort upc week store promo: gen prcount = _n if promo==1, the data set ends up sorted in a different way, which yields the wrong tagging:
upc week store promo
1 1 86 1
3 1 86 0
4 1 86 0
1 2 86 1
3 2 86 1
4 2 86 1
1 3 86 1
4 3 86 1
1 4 86 1
Anyway, I now realize my code is wrong. Any suggestions?

I think
. quietly input ///
> upc week store promo
. generate promocount = 0
. bysort store upc (week): replace promocount = 1+cond(_n==1,0,promocount[_n-1]) if promo>0
(7 real changes made)
. list, clean noobs
upc week store promo promoc~t
1 1 86 1 1
1 2 86 1 2
1 3 86 1 3
1 4 86 1 4
3 1 86 0 0
3 2 86 1 1
4 1 86 0 0
4 2 86 1 1
4 3 86 1 2
does what you want.
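A note on why the original attempt misbehaves: every variable listed before the colon in bysort defines the groups. With upc, week, store and promo all listed, each group is typically a single observation, so _n is always 1. The sketch below is an alternative under one added assumption that the question does not state: a skipped week should also restart the count. The helper variables newrun and run are hypothetical names.

```stata
clear
input upc week store promo
1 1 86 1
1 2 86 1
1 3 86 1
1 4 86 1
3 1 86 0
3 2 86 1
4 1 86 0
4 2 86 1
4 3 86 1
end

* Groups are store-product pairs; week (in parentheses) only orders rows within a group.
* A new run starts when promo changes or when the previous week is not week - 1.
* On the first row of a group, [_n-1] is missing, so newrun is 1 there.
bysort store upc (week): gen byte newrun = (promo != promo[_n-1]) | (week != week[_n-1] + 1)

* The running sum of the flags labels each run within the group.
by store upc: gen run = sum(newrun)

* Position within a promotional run is the consecutive-week count; 0 when off promotion.
bysort store upc run (week): gen promocount = cond(promo == 1, _n, 0)

list upc week store promo promocount, clean noobs
```

On the example data this should reproduce the promocount column requested in the question. If weeks are never skipped in the real data, the week condition is redundant but harmless.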

Related

How to add a row where there is a disruption in series of numbers in Stata

I'm attempting to format a table of 40 different age-race-sex strata to be inputted into R-INLA and noticed that it's important to include all strata (even if they are not present in a county). These should be zeros. However, at this point my table only contains records for strata that are not empty. I can identify places where strata are missing for each county by looking at my strata variable and finding the breaks in the series 1 through 40 (marked with a red x in the image below).
In these places (marked by the red x) I need to add the missing rows and fill in the corresponding county code, strata code, population=0, and the correct corresponding race, sex, age code for the strata.
If I can figure out a way to add an empty row in the spaces with the red Xs from the image, and correctly assign the strata code (and county code) to these empty/missing rows, I am able to populate the rest of the values with the code below:
replace race = 1 if strata == 4
replace sex = 1 if strata == 4
replace age = 4 if strata == 4
...etc
I'm wondering if there is a way to add the missing rows using an if statement that considers the fact that there are supposed to be forty strata for each county code. It would be ideal if this could populate the correct county code and strata code as well!
Dataex sample data:
* Example generated by -dataex-. To install: ssc install dataex
clear
input float OID str5 fips_statecounty double population byte(race sex age) float strata
1 "" 672 1 1 1 1
2 "" 1048 1 1 2 2
3 "" 883 1 1 3 3
4 "" 1129 1 1 4 4
5 "" 574 1 2 1 5
6 "" 986 1 2 2 6
7 "" 899 1 2 3 7
8 "" 1820 1 2 4 8
9 "" 96 2 1 1 9
10 "" 142 2 1 2 10
11 "" 81 2 1 3 11
12 "" 99 2 1 4 12
13 "" 71 2 2 1 13
14 "" 125 2 2 2 14
15 "" 103 2 2 3 15
16 "" 162 2 2 4 16
17 "" 31 3 1 1 17
18 "" 32 3 1 2 18
19 "" 18 3 1 3 19
20 "" 31 3 1 4 20
21 "" 22 3 2 1 21
22 "" 28 3 2 2 22
23 "" 28 3 2 3 23
24 "" 44 3 2 4 24
25 "" 20 4 1 1 25
26 "" 24 4 1 2 26
27 "" 21 4 1 3 27
28 "" 43 4 1 4 28
29 "" 19 4 2 1 29
30 "" 26 4 2 2 30
31 "" 24 4 2 3 31
32 "" 58 4 2 4 32
33 "" 6 5 1 1 33
34 "" 11 5 1 2 34
35 "" 13 5 1 3 35
36 "" 7 5 1 4 36
37 "" 7 5 2 1 37
38 "" 9 5 2 2 38
39 "" 10 5 2 3 39
40 "" 11 5 2 4 40
41 "01001" 239 1 1 1 1
42 "01001" 464 1 1 2 2
43 "01001" 314 1 1 3 3
44 "01001" 232 1 1 4 4
45 "01001" 284 1 2 1 5
46 "01001" 580 1 2 2 6
47 "01001" 392 1 2 3 7
48 "01001" 440 1 2 4 8
49 "01001" 41 2 1 1 9
50 "01001" 38 2 1 2 10
51 "01001" 23 2 1 3 11
52 "01001" 26 2 1 4 12
53 "01001" 34 2 2 1 13
54 "01001" 52 2 2 2 14
55 "01001" 40 2 2 3 15
56 "01001" 50 2 2 4 16
57 "01001" 4 3 1 1 17
58 "01001" 2 3 1 2 18
59 "01001" 3 3 1 3 19
60 "01001" 6 3 2 1 21
61 "01001" 4 3 2 2 22
62 "01001" 6 3 2 3 23
63 "01001" 4 3 2 4 24
64 "01001" 1 4 1 4 28
65 "01003" 1424 1 1 1 1
66 "01003" 2415 1 1 2 2
67 "01003" 1680 1 1 3 3
68 "01003" 1823 1 1 4 4
69 "01003" 1545 1 2 1 5
70 "01003" 2592 1 2 2 6
71 "01003" 1916 1 2 3 7
72 "01003" 2527 1 2 4 8
73 "01003" 68 2 1 1 9
74 "01003" 82 2 1 2 10
75 "01003" 52 2 1 3 11
76 "01003" 54 2 1 4 12
77 "01003" 72 2 2 1 13
78 "01003" 129 2 2 2 14
79 "01003" 81 2 2 3 15
80 "01003" 106 2 2 4 16
81 "01003" 10 3 1 1 17
82 "01003" 14 3 1 2 18
83 "01003" 8 3 1 3 19
84 "01003" 4 3 1 4 20
85 "01003" 8 3 2 1 21
86 "01003" 14 3 2 2 22
87 "01003" 17 3 2 3 23
88 "01003" 10 3 2 4 24
89 "01003" 4 4 1 1 25
90 "01003" 1 4 1 3 27
91 "01003" 2 4 1 4 28
92 "01003" 2 4 2 1 29
93 "01003" 3 4 2 2 30
94 "01003" 4 4 2 3 31
95 "01003" 10 4 2 4 32
96 "01003" 5 5 1 1 33
97 "01003" 4 5 1 2 34
98 "01003" 3 5 1 3 35
99 "01003" 5 5 1 4 36
100 "01003" 5 5 2 2 38
end
label values race race
label values sex sex

My answer to your previous question, "Nested for-loop: error variable already defined", detailed how to create a minimal dataset with all strata present. You should therefore just merge that with your main dataset and replace missing values on the absent strata with whatever your other software expects (zeros, it seems).
The complication most obvious at this point is you need to factor in a county variable. I can't see any information on how many counties you have in your dataset, which may affect what is practical. You should be able to break down the preparation into: first, prepare a minimal county dataset with identifiers only; then merge that with a complete strata dataset.
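As a rough sketch of the same idea in one pass, fillin can stand in for the merge described above (this is a swapped-in technique, not the approach from the earlier answer). fillin only creates combinations of values that already occur somewhere in the data, so it assumes every stratum appears in at least one county. The toy data below are a few rows from the question's sample. The formulas for race, sex and age are inferred from that sample (strata = 8*(race-1) + 4*(sex-1) + age) and should be checked against the real coding.

```stata
clear
input str5 fips_statecounty double population float strata
"01001" 239  1
"01001" 464  2
"01001"   6 21
"01003"   1 27
"01003"   5 38
"01003"  10 40
end

* Add one row for every county-stratum combination that is absent;
* fillin marks the added rows with _fillin == 1.
fillin fips_statecounty strata
replace population = 0 if _fillin == 1

* Rebuild the stratum components (coding inferred from the sample data).
gen byte race = ceil(strata / 8)
gen byte sex  = 1 + floor(mod(strata - 1, 8) / 4)
gen byte age  = 1 + mod(strata - 1, 4)

sort fips_statecounty strata
list, sepby(fips_statecounty)
```

In the toy data only six strata occur, so each county ends up with six rows. On the full data, each county would get one row per stratum observed anywhere in the file.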

Stata: Changing Number Format

I am using estpost and esttab to export tabulation results in Stata.
sysuse auto, clear
estpost tabulate turn foreign
esttab ., cells("b(fmt(0))") unstack
---------------------------------------------------
                      (1)
                 Domestic      Foreign        Total
                        b            b            b
---------------------------------------------------
31                      1            0            1
32                      0            1            1
33                      1            1            2
34                      2            4            6
35                      2            4            6
36                      1            8            9
37                      2            2            4
38                      1            2            3
39                      1            0            1
40                      6            0            6
41                      4            0            4
42                      7            0            7
43                     12            0           12
44                      3            0            3
45                      3            0            3
46                      3            0            3
48                      2            0            2
51                      1            0            1
Total                  52           22           74
---------------------------------------------------
N                      74
---------------------------------------------------
Although I can change the format of the cells, I couldn't find a way to change the format of the number of observations (N) and the total number of observations in each column. I tried adding obs(fmt(%10.2fc)) as an esttab option, but it didn't work.
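One thing that may be worth trying is estout's stats() option, which takes its own fmt() suboption and which esttab is understood to pass through. The sketch below assumes that pass-through behaves as documented for estout and addresses only the N line. The Total row is part of the b cells, so its format follows b(fmt()).

```stata
* Requires the estout package (ssc install estout).
sysuse auto, clear
estpost tabulate turn foreign

* stats(N, fmt()) requests N as a summary statistic with its own display format.
esttab ., cells("b(fmt(0))") unstack stats(N, fmt(%10.2fc))
```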

Duplication of data entries by id if they meet a certain condition

In the original choice data set, individuals (id) are captured making purchases (choice) among all the product options possible (assortchoice is a product code). Every individual always faces the same set of products to choose from; as a result the value of choice is always either 0 or 1 ("was the product chosen or not?").
clear
input id assortchoice choice sumchoice
2 12 1 2
2 13 0 2
2 14 0 2
2 15 0 2
2 16 0 2
2 17 0 2
2 18 0 2
2 19 0 2
2 20 0 2
2 21 0 2
2 22 0 2
2 23 1 2
3 12 1 1
3 13 0 1
3 14 0 1
3 15 0 1
3 16 0 1
3 17 0 1
3 18 0 1
3 19 0 1
3 20 0 1
3 21 0 1
3 22 0 1
3 23 0 1
4 12 1 3
4 13 0 3
4 14 1 3
4 15 1 3
4 16 0 3
4 17 0 3
4 18 0 3
4 19 0 3
4 20 0 3
4 21 0 3
4 22 0 3
4 23 0 3
end
I created the following code to understand how many choices were made by each individual:
egen sumchoice=total(choice), by(id)
In this example, individual 3 (id=3) chose only one product (sumchoice=1), individual 2 made two choices (sumchoice=2), and individual 4 made three (sumchoice=3).
Since these are choice data, I need to transform every instance of multiple choices into a set of single choices.
That is, if an individual made two purchases, I need to duplicate that individual's choice set twice; for an individual who made three purchases, I need to replicate the choice set three times, so that the final structure looks like the data set below.
clear
input id transaction assortchoice choice
2 1 12 1
2 1 13 0
2 1 14 0
2 1 15 0
2 1 16 0
2 1 17 0
2 1 18 0
2 1 19 0
2 1 20 0
2 1 21 0
2 1 22 0
2 1 23 0
2 2 12 0
2 2 13 0
2 2 14 0
2 2 15 0
2 2 16 0
2 2 17 0
2 2 18 0
2 2 19 0
2 2 20 0
2 2 21 0
2 2 22 0
2 2 23 1
3 1 12 1
3 1 13 0
3 1 14 0
3 1 15 0
3 1 16 0
3 1 17 0
3 1 18 0
3 1 19 0
3 1 20 0
3 1 21 0
3 1 22 0
3 1 23 0
4 1 12 1
4 1 13 0
4 1 14 0
4 1 15 0
4 1 16 0
4 1 17 0
4 1 18 0
4 1 19 0
4 1 20 0
4 1 21 0
4 1 22 0
4 1 23 0
4 2 12 0
4 2 13 0
4 2 14 1
4 2 15 0
4 2 16 0
4 2 17 0
4 2 18 0
4 2 19 0
4 2 20 0
4 2 21 0
4 2 22 0
4 2 23 0
4 3 12 0
4 3 13 0
4 3 14 0
4 3 15 1
4 3 16 0
4 3 17 0
4 3 18 0
4 3 19 0
4 3 20 0
4 3 21 0
4 3 22 0
4 3 23 0
end
Update: transaction records which transaction (replicate of the choice set) a row belongs to:
bysort id assortchoice (choice): gen transaction=_n
Hence, choice=1 should appear exactly once in each transaction.

The answer isn't quite "use expand" as there is a twist that you don't want exact replicates.
expand sumchoice
bysort id assortchoice (choice) : replace choice = 0 if _n != _N & choice == 1
list if id == 2 , sepby(assortchoice)
     +-----------------------------------+
     | id   assort~e   choice   sumcho~e |
     |-----------------------------------|
  1. |  2         12        0          2 |
  2. |  2         12        1          2 |
     |-----------------------------------|
  3. |  2         13        0          2 |
  4. |  2         13        0          2 |
     |-----------------------------------|
  5. |  2         14        0          2 |
  6. |  2         14        0          2 |
     |-----------------------------------|
  7. |  2         15        0          2 |
  8. |  2         15        0          2 |
     |-----------------------------------|
  9. |  2         16        0          2 |
 10. |  2         16        0          2 |
     |-----------------------------------|
 11. |  2         17        0          2 |
 12. |  2         17        0          2 |
     |-----------------------------------|
 13. |  2         18        0          2 |
 14. |  2         18        0          2 |
     |-----------------------------------|
 15. |  2         19        0          2 |
 16. |  2         19        0          2 |
     |-----------------------------------|
 17. |  2         20        0          2 |
 18. |  2         20        0          2 |
     |-----------------------------------|
 19. |  2         21        0          2 |
 20. |  2         21        0          2 |
     |-----------------------------------|
 21. |  2         22        0          2 |
 22. |  2         22        0          2 |
     |-----------------------------------|
 23. |  2         23        0          2 |
 24. |  2         23        1          2 |
     +-----------------------------------+
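In the listing above, both purchases of individual 2 end up in the same (last) replicate. If each replicate should carry exactly one purchase, as in the layout the question asks for, one possible sketch is to number the purchases before expanding. The helper variable pick is a hypothetical name. Ordering purchases by product code is an assumption, since the data do not record the order in which purchases were made. The toy data are a shortened version of the question's example.

```stata
clear
input id assortchoice choice
2 12 1
2 13 0
2 14 1
3 12 0
3 13 1
3 14 0
end

egen sumchoice = total(choice), by(id)

* Running count of purchases within id: the k-th chosen product gets pick == k.
bysort id (assortchoice): gen pick = sum(choice)

* One copy of the choice set per purchase.
expand sumchoice

* Number the copies of each product row; the copies are identical at this point,
* so their order within id-assortchoice does not matter.
bysort id assortchoice: gen transaction = _n

* Keep the k-th purchase only in the k-th transaction.
replace choice = 0 if choice == 1 & transaction != pick

sort id transaction assortchoice
list id transaction assortchoice choice, sepby(id transaction)
```

Individuals with no purchases keep a single copy of their choice set, because expand treats values below 1 as 1.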

fortran read and write from file(reading from .msh and writing to dat)

I am trying to read a .msh file and generate a .dat file with the data rearranged as (node number, x1, y1, z1, x2, y2, z2).
$MeshFormat
2.2 0 8
$EndMeshFormat
$PhysicalNames
4
1 1 "inlet"
1 2 "top"
1 3 "exit"
1 4 "bottom"
$EndPhysicalNames
$Nodes
45
1 -2 -2 0
2 2 -2 0
3 2 2 0
4 -2 2 0
5 -1.666666666666667 -2 0
6 -1.333333333333333 -2 0
7 -1 -2 0
8 -0.6666666666666665 -2 0
9 -0.3333333333333335 -2 0
10 0 -2 0
11 0.3333333333333335 -2 0
12 0.666666666666667 -2 0
13 1 -2 0
14 1.333333333333333 -2 0
15 1.666666666666667 -2 0
16 2 -1.666666666666667 0
17 2 -1.333333333333333 0
18 2 -1 0
19 2 -0.6666666666666665 0
20 2 -0.3333333333333335 0
21 2 0 0
22 2 0.3333333333333335 0
23 2 0.666666666666667 0
24 2 1 0
25 2 1.333333333333333 0
26 2 1.666666666666667 0
27 1.666666666666667 2 0
28 1.333333333333333 2 0
29 1 2 0
30 0.6666666666666665 2 0
31 0.3333333333333335 2 0
32 0 2 0
33 -0.3333333333333335 2 0
34 -0.666666666666667 2 0
35 -1 2 0
36 -1.333333333333333 2 0
37 -1.666666666666667 2 0
38 -2 1.555555555555556 0
39 -2 1.111111111111111 0
40 -2 0.6666666666666667 0
41 -2 0.2222222222222223 0
42 -2 -0.2222222222222223 0
43 -2 -0.6666666666666665 0
44 -2 -1.111111111111111 0
45 -2 -1.555555555555555 0
$EndNodes
$Elements
45
1 1 2 4 1 1 5
2 1 2 4 1 5 6
3 1 2 4 1 6 7
4 1 2 4 1 7 8
5 1 2 4 1 8 9
6 1 2 4 1 9 10
7 1 2 4 1 10 11
8 1 2 4 1 11 12
9 1 2 4 1 12 13
10 1 2 4 1 13 14
11 1 2 4 1 14 15
12 1 2 4 1 15 2
13 1 2 3 2 2 16
14 1 2 3 2 16 17
15 1 2 3 2 17 18
16 1 2 3 2 18 19
17 1 2 3 2 19 20
18 1 2 3 2 20 21
19 1 2 3 2 21 22
20 1 2 3 2 22 23
21 1 2 3 2 23 24
22 1 2 3 2 24 25
23 1 2 3 2 25 26
24 1 2 3 2 26 3
25 1 2 2 3 3 27
26 1 2 2 3 27 28
27 1 2 2 3 28 29
28 1 2 2 3 29 30
29 1 2 2 3 30 31
30 1 2 2 3 31 32
31 1 2 2 3 32 33
32 1 2 2 3 33 34
33 1 2 2 3 34 35
34 1 2 2 3 35 36
35 1 2 2 3 36 37
36 1 2 2 3 37 4
37 1 2 1 4 4 38
38 1 2 1 4 38 39
39 1 2 1 4 39 40
40 1 2 1 4 40 41
41 1 2 1 4 41 42
42 1 2 1 4 42 43
43 1 2 1 4 43 44
44 1 2 1 4 44 45
45 1 2 1 4 45 1
$EndElements
I have tried using allocatable arrays. I want to skip lines until the string '$Nodes' appears, skip one more line, and then read the node data into an array. After that I want to skip the next three lines, read the element data into another array, and finally rearrange the numbers as described above.
program coordinates
implicit none
INTEGER:: ierror, nodeno, elementno, i, j, k , t, p, l=0, n
CHARACTER:: command
real::data(2,100)
!CHARACTER (len=5)::N!odes
!CHARACTER (len=8)::EndN!odes
!CHARACTER (len=8)::E!lements
!CHARACTER (len=11)::EndE!lements
!CHARACTER :: No*5, EndN*8, E*8, EndE*11
! CHARACTER*5 :: Nod
! CHARACTER*8 :: Ele
! real, allocatable, dimension(:,4)::node
! real, allocatable, dimension(:,7)::element
! real, allocatable, dimension(:)::n,x,y,z,a,b,c,d,g,h
!call system(l='grep -n '$Nodes' /home/user/Nitesh/Fortran/rect.msh|tail -LineNumberToStartWith|grep regEX')
! 'l = 'grep -n '$Nodes' /home/user/Nitesh/Fortran/rect.msh
command = 'grep -n $Nodes /home/user/Nitesh/Fortran/rect.msh|cut -f1 -d:'
! call system('command')
call system('grep -n '$Nodes' /home/user/Nitesh/Fortran/rect.msh|cut -f1 -d:')
! call system('l')
! print*, "enter the no. of nodes"
! read*,t
! print*, "enter the no. of elements"
! read*,p
! print*, "enter the line no. nodes data(array) starting from"
! read*,l
!allocate(node(t),element(j))
print*, "opening file"
! allocate(n(t),x(t),y(t),z(t),a(t),b(t),c(t),d(t),g(t),h(t))
OPEN (FILE='/home/user/Nitesh/Fortran/my.dat',UNIT=8, STATUS='OLD', ACTION='READ', &
IOSTAT=ierror)
if(ierror/=0)then
print*,"File rect.msh cannot be open"
stop
end if
do i=1,n
read(8,'(/)')
end do
do i=1,t+l
read(8,*) data(:,i)
!read(8,*) j,k
end do
! 8 format('',F10.6,F10.6,F10.6,F10.6)
! if(Nod == 'Nodes') then
! read*,(n(i),x(i),y(i),z(i),i=1,t)
! end if
! if(Ele == 'Elements') then
! read*,(n(j),a(j),b(j),c(j),d(j),g(j),h(j),k=1,p)
! end if
! OPEN (UNIT=10, FILE='/home/Nitesh/rect_new.msh', STATUS='NEW', ACTION='WRITE', &
! IOSTAT=ierror)
! write(*,10)
! 10 format (' ',n())
! CLOSE (UNIT=10)
CLOSE (UNIT=8)
print*, "file read"
! do i=1,n
! n(t)=g(j)
! x(t),y(t),z(t)
open(file='/home/user/Nitesh/Fortran/rect.dat', unit=24, status='replace', action='write', &
IOSTAT=ierror)
if(ierror/=0)then
print*,"File rect.msh cannot be open"
stop
end if
print*,"writing data"
do i=1,l+t
write(24,*) data(:,i)
end do
print*,"data written"
! OPEN (UNIT=10, FILE='rect_new.msh', STATUS='NEW', ACTION='WRITE', &
! IOSTAT=ierror)
! write(*,10)
! 10 format (' ','The coordinates of elements')
! CLOSE (UNIT=10)
end program coordinates

select minimum value by ID, over range of visits

I'm trying to create a variable holding the lowest value over a range of visits. Specifically, I want the lowest value over the first 3 days of admission (admission day 1, 2, or 3), by visitID. Any suggestions?
visitID value day of admission
1 941 1
1 948 2
1 935 4
2 83 1
2 84 2
2 50 4
2 79 5
and I would want:
visitID value visit minvalue
1 941 1 941
1 948 2 941
1 935 4 941
2 83 1 83
2 84 2 83
2 50 4 83
2 79 5 83

It would have been helpful if you had presented your data in an easily usable form. But here's an approach that should point you in a useful direction.
* Example generated by -dataex-. To install: ssc install dataex
clear
input byte visitid int value byte day
1 941 1
1 948 2
1 935 4
2 83 1
2 84 2
2 50 4
2 79 5
end
bysort visitid (day) : egen minvalue = min(cond(day<=3,value,.))
Which results in
. list, sepby(visitid)
     +----------------------------------+
     | visitid   value   day   minvalue |
     |----------------------------------|
  1. |       1     941     1        941 |
  2. |       1     948     2        941 |
  3. |       1     935     4        941 |
     |----------------------------------|
  4. |       2      83     1         83 |
  5. |       2      84     2         83 |
  6. |       2      50     4         83 |
  7. |       2      79     5         83 |
     +----------------------------------+
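The approach works because cond() turns values outside days 1 to 3 into missing, and egen's min() ignores missing values. A small sketch of the edge case, using made-up rows in which visit 2 has no record in the first three days:

```stata
clear
input byte visitid int value byte day
1 941 1
1 935 4
2  50 4
2  79 5
end

* Values recorded after day 3 become missing inside cond() and are ignored by min().
egen minvalue = min(cond(day <= 3, value, .)), by(visitid)

list, sepby(visitid)
```

Here minvalue is 941 for visit 1 and missing for visit 2, which signals that the visit has nothing to summarize in the first three days.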
