Count row-wise in Stata

I am working with a relatively unstructured table in Stata. I am trying to count the number of observations between two specific rows, by group. Specifically, I want to count the number of observations between the row with the value TITLE and the following blank row, as seen below:
v1 id
AGENCY: HHS-ACF 1
EXPIRATION DATE: 11/30/2023 1
TITLE: Annual Survey of Refugees 1
TOTAL ANNUAL RESPONSES: 1
3,000 1
ASSOCIATED INFORMATION COLLECTIONS: 1
TITLE 1
ORR-9 Annual Survey of Refugees 1
Introduction Letter and Postcard 1
AGENCY: HHS-ACF 2
EXPIRATION DATE: 02/29/2024 2
TITLE: Unaccompanied Refugee Minors Program: 2 ORR-3 Placement Report and ORR-4 Outcomes
Report 2
TOTAL ANNUAL RESPONSES: 2
8,058 2
ASSOCIATED INFORMATION COLLECTIONS: 2
TITLE 2
ORR-3 (Unaccompanied Refugee Minors Placement Report) - State Agencies 2
ORR-4 (Unaccompanied Refugee Minors Outcomes Report) - State Agencies 2
ORR-3 (Unaccompanied Refugee Minors Placement Report) - URM Provider Agencies 2
ORR-4 (Unaccompanied Refugee Minors Outcomes Report) - URM Provider Agencies 2
ORR-4 (Unaccompanied Refugee Minors Outcomes Report) - URM Youth
I want the final dataset to hold, for each unique ID, a count of the values between TITLE and the blank row. I have already been able to construct the ID, but I can't get the count right. What is the best way to do that?
I want the final dataset to look like:
ID count
1 2
2 5

If you have a single column v1, then you can do this:
gen id=sum(v1=="")+1
bysort id: gen ct=_N-_n
keep if v1=="TITLE"
drop v1
Output:
id ct
1. 1 2
2. 2 5
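The same logic can be sketched outside Stata. Here is a minimal Python sketch of the idea behind the answer (blank rows delimit groups, mirroring gen id=sum(v1=="")+1, and we count the rows that follow the bare TITLE marker); the row values are made up for illustration:

```python
def count_after_title(rows):
    """For each blank-delimited group, count rows after the bare 'TITLE' row."""
    counts = {}
    group_id, counting = 1, False
    for v in rows:
        if v == "":              # blank row starts a new group
            group_id += 1
            counting = False
        elif v == "TITLE":       # bare marker: start counting from here
            counting = True
            counts[group_id] = 0
        elif counting:
            counts[group_id] += 1
    return counts

rows = ["AGENCY: HHS-ACF", "TITLE", "row a", "row b", "",
        "AGENCY: HHS-ACF", "TITLE", "row c", "row d", "row e"]
print(count_after_title(rows))   # {1: 2, 2: 3}
```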

Related

Changing organization of data so that each observation represents a new variable (I tried)

I am working in Stata with a dataset on electric vehicle charging stations. Variables include
station_name: name of charging station
review_text: all of the customer reviews for a specific station, delimited by }{
num_reviews: number of customer reviews
I'm trying to make a new file where each observation represents one customer review in a new variable customer_review and another variable station_id has the name of the corresponding station. So, if the original dataset had 100 observations (one per station) with 5 reviews each, the new file should have 500 observations.
How can I do this? I would include some code I have tried but I have no idea how to start.
If your data look like this:
station reviews n
1. 1 {good}{bad}{great} 3
2. 2 {poor}{excellent} 2
Then the following:
split reviews, parse("}{")
drop reviews n
reshape long reviews, i(station) j(review_num)
drop if reviews==""
replace reviews = subinstr(reviews, "}","",.)
replace reviews = subinstr(reviews, "{","",.)
will produce:
station review~m reviews
1. 1 1 good
2. 1 2 bad
3. 1 3 great
4. 2 1 poor
5. 2 2 excellent
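For comparison, the same split-then-go-long transformation can be sketched in Python (using the same hypothetical two-station data as above):

```python
# Split "}{"-delimited review strings into one row per review,
# keeping the station id alongside each review (long format).

data = [
    (1, "{good}{bad}{great}"),
    (2, "{poor}{excellent}"),
]

long_rows = []
for station, reviews in data:
    # strip the outer braces, then split on the "}{" delimiter
    for num, text in enumerate(reviews.strip("{}").split("}{"), start=1):
        long_rows.append((station, num, text))

print(long_rows)
# [(1, 1, 'good'), (1, 2, 'bad'), (1, 3, 'great'),
#  (2, 1, 'poor'), (2, 2, 'excellent')]
```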

DAX Summing Values in Another Table

I am new to DAX.
I have 2 tables. Let's call them Table_1 and Table_2.
Let's say they look like this:
Table_1
ID Table_2_ID Person
1 1 Steve
2 1 Steve
3 1 Steve
4 2 John
5 2 John
6 3 Sally
Table_2
ID Sales
1 100
2 50
3 5
I want to return results that look something like this:
ID Table_2_ID Person Sales
1 1 Steve 100
2 1 Steve 100
3 1 Steve 100
4 2 John 50
5 2 John 50
6 3 Sally 5
How can I return this with a Dax function?
I know I need to use LOOKUPVALUE and/or the RELATED function, in combination with SUM, but I'm not sure how.
I'm not looking to produce a table, but a measure that when I use it in combination with other columns in Power BI, it applies the appropriate amount to each person in Table_1.
This can be done either by a calculated column or by a measure.
CC in Table_1:
Sum_Tab2 =
var t2_ID = [Table_2_ID]
return
CALCULATE(
    SUM('Table_2'[Sales]),
    'Table_2'[ID] = t2_ID
)
Measure:
SumTab2_measure =
var currentT2ID = MAX('Table_1'[Table_2_ID])
return
CALCULATE(
    SUM('Table_2'[Sales]),
    'Table_2'[ID] = currentT2ID
)
No relationships are needed. However, for the measure to work in a table visual, 'Table_1'[Table_2_ID] needs to be present in the visual with this solution.
These may have to be slightly altered depending on your other filter dependencies and such so that they behave as you want them to.
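Outside DAX, what both formulas do amounts to a keyed lookup of Table_2's Sales onto each Table_1 row; a minimal Python sketch of that mapping, using the example tables from the question:

```python
# Map Table_2 sales onto each Table_1 row via Table_2_ID
# (a keyed lookup, analogous to the filtered CALCULATE/SUM above).

table_1 = [
    {"ID": 1, "Table_2_ID": 1, "Person": "Steve"},
    {"ID": 2, "Table_2_ID": 1, "Person": "Steve"},
    {"ID": 3, "Table_2_ID": 1, "Person": "Steve"},
    {"ID": 4, "Table_2_ID": 2, "Person": "John"},
    {"ID": 5, "Table_2_ID": 2, "Person": "John"},
    {"ID": 6, "Table_2_ID": 3, "Person": "Sally"},
]
table_2 = {1: 100, 2: 50, 3: 5}   # ID -> Sales

for row in table_1:
    row["Sales"] = table_2[row["Table_2_ID"]]

print([r["Sales"] for r in table_1])   # [100, 100, 100, 50, 50, 5]
```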

days overlap based on two conditions

I have the following dataset. I need to count each id only once, based on the highest order achieved, into:
mono
dual
3 or more
(For example: if the same patient has two drugs with no overlap and another two drugs that do overlap, count this patient one time under the two-drug overlap, i.e. dual.) A drug combination counts based on the presence of one of two conditions:
an overlap of 60 days or more between the drugs, or
the same drugs overlapping in two different time periods by 30 days or more each (for example, if the same two drugs overlap by 30 days in one period and by 40 days in another, count this id as dual).
The output would be:
mono, or one drug: 1 (patient 3 counted here)
dual, or two drugs overlap: 2 (patients 2 and 4 counted here)
three or more: 1 (patient 1 counted here)
I don't need the actual drugs that overlap, just a count of the frequency, where each patient can be counted only once.
There are a total of 6 drugs.
data have;
input id drug $ start :mmddyy10. end :mmddyy10.;
format start end mmddyy10.;
cards;
1 a 1/1/2004 4/4/2004
1 b 2/2/2004 6/6/2004
1 d 1/4/2005 4/4/2005
2 a 3/1/2006 4/2/2006
2 b 2/2/2006 5/3/2006
2 c 2/2/2006 4/4/2006
2 d 2/3/2001 4/4/2001
3 a 3/3/2001 4/3/2001
3 b 3/2/2002 4/2/2002
4 a 6/1/2001 8/2/2001
4 b 6/1/2001 7/7/2001
4 a 2/2/2001 4/4/2001
4 b 2/5/2001 3/28/2001
;
run;
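No answer is shown here, but the core building block (the number of days two drug intervals for the same id overlap) can be sketched in Python; the classification rules above (a single overlap of 60+ days, or two separate overlaps of 30+ days each for the same drug pair) would then be applied per pair. The dates below are taken from patient 1 in the sample data:

```python
from datetime import date

def overlap_days(start1, end1, start2, end2):
    """Number of days two [start, end] date intervals overlap, inclusive."""
    latest_start = max(start1, start2)
    earliest_end = min(end1, end2)
    return max(0, (earliest_end - latest_start).days + 1)

# patient 1: drug a 1/1/2004-4/4/2004, drug b 2/2/2004-6/6/2004
d = overlap_days(date(2004, 1, 1), date(2004, 4, 4),
                 date(2004, 2, 2), date(2004, 6, 6))
print(d)          # 63 (2/2/2004 through 4/4/2004, inclusive)
print(d >= 60)    # True: this pair meets the 60-day condition
```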

Creating and doing Market basket analysis from raw data

I have a data set with many items and their sales data, in terms of amount and quantity sold, rolled up per week. I want to figure out whether there is some correlation between items: whether the sales of one item affect another's sales, in terms of any positive or negative effect.
Consider the following type of data:
Week # Product # Sale($) Quantity
Week 1 Product 1 1 1
Product 2 2 1
Product 3 3 1
Week 2 Product 1 3 2
Product 3 2 1
Product 6 2 2
Week 3 Product 4 2 1
Product 3 1 2
Product 5 4 2
So, from the above weekly data, how can I convert this into the form of market basket data with the parameters available to me? There isn't any market basket data as such.
The parameters I could think of are:
the count of occurrences of each product in a given week;
the total quantity sold;
the total sales, to find correlation.
So, basically, I have to work out how one item can be correlated with another, i.e. the affinity of one product with another, whether positively or negatively correlated. The only issue is that I do not have any primary key to bind the items to a basket or an order number, since these are rolled-up sales.
Any answers or help on this topic are highly appreciated. In case you find it incomplete, let me know and I can clarify further.
You can't do this, because you have no information about co-occurrence. Your data have also been rolled up from transaction (basket) grain to weekly grain, and aggregates won't let you recover it.

Run a regression of countries by quartiles for a specific year

I am exploring an effect that I think will vary by GDP level, in a data set that has, vertically, country and year (1960 to 2015), so each country label appears on 56 rows. I ran
sort year
by year: egen yrank = xtile(rgdp), nquantiles(4)
which tags every year row with what quartile of GDP they were in that year. I want to run this:
xtreg fiveyearg taxratio if yrank == 1 & year==1960
which would regress my variable (tax ratio) against some averaged gdp data from countries that were in the bottom quartile of GDPs in 1960 alone. So even if later on they grew enough to change ranks, the later data would still be in the regression pool. Sadly, I cannot get this code, or any variation, to run.
My current approach is to try to generate a new variable that gives every row with country label X a value of 1 if that country was in the bottom quartile in 1960, but I can't get that to work either. I have run out of ideas, so I thought I would ask!
Based on your latest comment, which describes the (un)expected behavior:
clear
set more off
*----- example data -----
input ///
country year rank
1 1960 2
1 1961 1
1 1962 2
2 1960 1
2 1961 1
2 1962 1
3 1960 3
3 1961 3
3 1962 3
end
list, sepby(country)
*----- what you want -----
// tag countries whose first observation for -rank- is 1
// (sorting by -year- within country so the first observation is 1960)
bysort country (year) : gen toreg = rank[1] == 1
list, sepby(country)
// run regression conditional on -toreg-
xtreg ... if toreg
Check help subscripting if in doubt.
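For readers more used to other tools, the tagging step (flag every row of a country whose first-year rank was 1) can be sketched in Python with the same example data:

```python
# Tag all rows of a country if its rank in the earliest year equals 1,
# mirroring: bysort country (year) : gen toreg = rank[1] == 1

rows = [
    {"country": 1, "year": 1960, "rank": 2},
    {"country": 1, "year": 1961, "rank": 1},
    {"country": 1, "year": 1962, "rank": 2},
    {"country": 2, "year": 1960, "rank": 1},
    {"country": 2, "year": 1961, "rank": 1},
    {"country": 2, "year": 1962, "rank": 1},
    {"country": 3, "year": 1960, "rank": 3},
    {"country": 3, "year": 1961, "rank": 3},
    {"country": 3, "year": 1962, "rank": 3},
]

# rank of each country's earliest year (rows sorted by country, then year)
first_rank = {}
for r in sorted(rows, key=lambda r: (r["country"], r["year"])):
    first_rank.setdefault(r["country"], r["rank"])

for r in rows:
    r["toreg"] = first_rank[r["country"]] == 1

print([r["toreg"] for r in rows])
# [False, False, False, True, True, True, False, False, False]
```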