Stata code to conditionally sum values based on a group rank - stata

I'm trying to write code for a fairly large dataset (3 million observations) that is divided into smaller groups (ID). For each observation (described in the table below), I want to create a cumulative sum of the variable "Value" over all observations ranked below it, subject to the condition that the lower-ranked observation's Condition equals its own.
I want to write this code without using loops, if there is a way to do so.
Could someone help me?
Thank you!
UPDATE:
I have pasted the equation for the output variable below.
UPDATE 2:
The CSV format of the above table is:
ID,Rank,Condition,Value,Expected output
1,1,30,10,0
1,2,40,20,0
1,3,20,30,0
1,4,30,40,10
1,5,40,50,20
1,6,20,60,30
1,7,30,70,80
2,1,40,80,0
2,2,20,90,0
2,3,30,100,0
2,4,40,110,80
2,5,20,120,90
2,6,30,130,100
2,7,40,140,190
2,8,20,150,210
2,9,30,160,230
Equation

If I understand correctly, for each combination of ID and Condition, you want to calculate a running sum, ordered by Rank, of the variable Value, excluding the current observation. If that is indeed your goal, the following untested code might set you on the path to a solution:
sort ID Condition Rank
// be sure there is a single observation for each combination
isid ID Condition Rank
// generate the running sum
by ID Condition (Rank): generate output = sum(Value)
// subtract out the current observation
replace output = output - Value
// return to the original order
sort ID Rank
As I said, this is untested, because my copy of Stata cannot read pictures of data. If your testing shows that it is imperfect and you cannot resolve the problem yourself, providing your sample data in a usable format will increase the likelihood someone will be able to help.
Added in edit: Corrected the isid command.
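For anyone who wants to check the logic outside Stata, here is a minimal Python sketch of the same idea: a running sum of Value within each (ID, Condition) pair, taken in Rank order and excluding the current row. The function name `prior_sums` and the tuple layout are assumptions made for illustration; the rows are the ID 2 rows from the question's CSV.

```python
from collections import defaultdict

def prior_sums(rows):
    """For each (id, rank, condition, value) row, return the sum of value
    over lower-ranked rows that share the same id and condition."""
    running = defaultdict(int)  # (id, condition) -> sum of values seen so far
    result = {}
    # Visit rows in (id, rank) order so "seen so far" means "ranked below".
    for id_, rank, condition, value in sorted(rows, key=lambda r: (r[0], r[1])):
        result[(id_, rank)] = running[(id_, condition)]  # excludes current row
        running[(id_, condition)] += value
    return result

# ID 2 rows from the question: (ID, Rank, Condition, Value)
rows = [
    (2, 1, 40, 80), (2, 2, 20, 90), (2, 3, 30, 100),
    (2, 4, 40, 110), (2, 5, 20, 120), (2, 6, 30, 130),
    (2, 7, 40, 140), (2, 8, 20, 150), (2, 9, 30, 160),
]
sums = prior_sums(rows)
print([sums[(2, rank)] for rank in range(1, 10)])
# [0, 0, 0, 80, 90, 100, 190, 210, 230]
```

The printed list agrees with the "Expected output" column for ID 2, which suggests the sort-then-running-sum reading of the question is the intended one.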

Related

How to choose the MIN of a calculated column (not in Power Query)

Working with basketball data, I'm trying to get the time on court for the players (there are some columns that have information about a player or players).
I tried to obtain the value with a calculated column named "TimeOnCourt". The code works for most cases, but in one case, due to a mistake by the data entry team, the player columns hold different values for the same "TimeOnCourt", so the data entry mistake shows up when I visualize the information.
I guess I could use the "Index" column and add a piece of code that chooses the MIN value of the "TimeOnCourt" column but, after trying some options, I don't know where to put it or whether I have to change the whole code.
I also tried Test_Flags measures, but they don't work for all cases (they fix 2 of the 4 cases).
Here is the link to the pbix file with the Test_Flag measures I tried: Link to pbix file v3
Below is the image with the mistake marked. The expected time in the right visualization should be 0:40:00 instead of 0:43:03 (the difference is due to the duplicate in Full Quarter = 2Q and Time_Def = 0:04:00). This could happen again, even though I have talked with the team, so the solution should be general rather than filtering out this specific case.
Problem

Index Match with multiple results greater than zero

I am trying to simplify a table that shows the amount of time that people are working on certain jobs, and I want to present the dataset in a table that only shows the values greater than zero.
The image below shows how the table currently looks, where each person has a % of their time allocated to 1 of 5 jobs across columns.
I am trying to create a table that looks like the below, where it only shows the jobs that each person is working on, and excludes the ones where they have no % of their time allocated.
Wondering if I am going about this in the wrong fashion, any help greatly appreciated!
Thanks
I have been trying to use an INDEX/MATCH function with some IF logic for values greater than zero, but I have only been able to get the first value greater than zero to populate.

Power BI How to return a column with different data type summarized

My question is: is there a way to return a Matrix column that summarizes values of different data types, as shown in the picture (using SWITCH)?
I am not sure whether this has been asked in this form before, but hopefully someone knows a simpler solution than what I've tried.
I'm trying to return a column in a Matrix with different data types to be summarized. I have tried something similar to the following in Transform data.
MixedFormatColumn = SWITCH('Cars'[Attribute],
"Socks",CONVERT('Socks'[Value],STRING) ,
"Paper",FORMAT('Paper'[Value], "#,0.0" ) ,
"Plastics",FORMAT('Plastics'[Value], "$#,0" ) ,
CONVERT('Crayons'[Value],STRING)
)
Although it's not exact, I'm sure you get the idea. I keep getting stuck because I'm not sure whether this is a Power Query issue or a measure issue, and I don't know how to go about it. If someone could at least point me in the right direction, it would be greatly appreciated. Thank you to whoever is reading this for your time.
A column or a measure cannot have mixed data types or mixed formatting. In order to get the $ value of socks sold, you would need the $ value for the sale. In order to get a count of socks sold, you would need a number, unless you want to count the rows for socks, but a row might be about more than just one pair of socks.
Mixing percentages into all this in one single matrix column is not possible. You may want to rethink your approach.

Countif and ArrayFormula with multiple levels

I have a formula. It works - but feels like it could be made much simpler.
I have many departments across several columns. Each row has an item that we're tracking and each column has a status text that changes as we do the work.
'queue' - it's in line waiting to be done and weighs down the average
'active' - in process and provides a half value across the average
'done', 'ok'd', 'rcvd' - finished and contributes to the final average
'none' - denotes a department that's inactive on this job and should not count in the final average.
The formula is:
=iferror(((ArrayFormula(sum(countif(B3:O3,{"done","ok'd","rcvd"}))))+(countif(B3:O3,"active")/2))/(counta(B3:O3)-(countif(B3:O3,"none"))),)
The formula works but I'm looking to see if there's an easier way to approach it. Would a query or array modification work better in this scenario?
What if I wanted to add other text strings based on syntax for my current application?
Here's a link to a sample sheet with it in context.
https://docs.google.com/spreadsheets/d/1zPFAcSxM7tYjZmlATYde7qKsDoeH6AW_xjFooOZFOf4/edit#gid=0
EDIT:
As a followup question - how do I get the same thing to work across the columns?
I did some reverse engineering to the solution and can see the formula working across the top of my sheet - but it's giving me an error:
"MMULT has incompatible matrix sizes. Number of columns in first matrix (13) must equal number of rows in second matrix (1)."
Here's the formula I've added (it's also in the linked sheet).
=ARRAYFORMULA(IF(LEN(B4:N4), MMULT(IFERROR(( N(REGEXMATCH(B4:N9, "ok'd|done|ready|rcvd"))+ N(REGEXMATCH(B4:N9, "active"))/2)/MMULT(N(REGEXMATCH(B4:N9, "[^none]")),TRANSPOSE(ROW(B4:B9)^0)), 0), TRANSPOSE(ROW(B4:B9)^0)),))
As a followup question - how do I get the same thing to work across the columns?
=ARRAYFORMULA(TRANSPOSE(IF(LEN(TRANSPOSE(B4:N4)), MMULT(IFERROR((
N(REGEXMATCH(TRANSPOSE(B4:N16), "ok'd|done|ready|rcvd"))+
N(REGEXMATCH(TRANSPOSE(B4:N16), "active"))/2)/MMULT(
N(REGEXMATCH(TRANSPOSE(B4:N16), "[^none]")),
(ROW(B4:B16)^0)), 0),
(ROW(B4:B16)^0)), )))
=ARRAYFORMULA(IF(LEN(B3:B9), MMULT(IFERROR((
N(REGEXMATCH(B3:N9, "ok'd|done|ready|rcvd"))+
N(REGEXMATCH(B3:N9, "active"))/2)/MMULT(
N(REGEXMATCH(B3:N9, "[^none]")),
TRANSPOSE(COLUMN(B3:N3)^0)), 0),
TRANSPOSE(COLUMN(B3:N3)^0)), ))
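As a cross-check on what these formulas compute for a single row, the weighting rules from the question can be written out as a minimal Python sketch. The function name `progress` and the sample row are assumptions for illustration only; the status strings are the ones listed in the question.

```python
DONE = {"done", "ok'd", "rcvd"}  # finished statuses, each worth a full point

def progress(statuses):
    """Return the completion ratio for one row of status cells,
    or None when no department counts toward the average."""
    cells = [s for s in statuses if s != ""]        # like COUNTA: skip empty cells
    counted = [s for s in cells if s != "none"]     # "none" departments are excluded
    if not counted:
        return None                                 # like IFERROR(..., ) giving a blank
    score = sum(1.0 for s in counted if s in DONE)  # finished work
    score += sum(0.5 for s in counted if s == "active")  # in-process work counts half
    return score / len(counted)                     # "queue" adds 0 but stays in the denominator

print(progress(["done", "active", "queue", "none"]))  # 0.5
```

Here one finished cell plus one active cell gives 1.5 points over three counted departments, so the row is 50% complete.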

Creating an ID based on factor and filling down with Stata

Consider the fictional data below, which illustrates my problem; the real data contains thousands of rows.
Figure 1
Each individual is characterized by values attached to A, B, C, D, E. In Figure 1, I show 3 individuals for which some characteristics are missing. Do you have any idea how I can get the following completed table (Figure 2)?
Figure 2
If I had the ID in Figure 1, I could have used the carryforward command to fill in the values. But since each individual has a different number of rows, I don't know how to create the ID.
Edit: All individuals share the characteristic "A".
Edit: the existing order of observations is informative.
To detect where the id changes, compare each row's value of char with the preceding row's: a new individual starts wherever the preceding value is greater than or equal to the current one.
This works only if your data are ordered, which seems to be a given in your data.
gen id= 1 if (char[_n-1] >= char[_n]) | _n ==1
replace id = sum(id) if id==1
replace id = id[_n-1] if missing(id)
fillin id char
drop _fillin
If one individual has only the characteristics A and C and the next individual has only the characteristics D and E, this won't work, but that case seems impossible to detect from your data.
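The detection rule can also be illustrated outside Stata with a minimal Python sketch. The function name `assign_ids` and the sample sequence of characteristics are hypothetical, since the figures from the question are not available; only the comparison rule is taken from the code above.

```python
def assign_ids(chars):
    """Start a new id on the first row and whenever the characteristic
    does not increase relative to the preceding row."""
    ids = []
    current = 0
    for i, c in enumerate(chars):
        # Same test as (char[_n-1] >= char[_n]) | _n == 1 in the Stata code.
        if i == 0 or chars[i - 1] >= c:
            current += 1
        ids.append(current)  # rows without a break inherit the current id
    return ids

# Hypothetical ordered data: three individuals with some characteristics missing.
chars = ["A", "B", "D", "A", "C", "A", "B", "C", "E"]
print(assign_ids(chars))  # [1, 1, 1, 2, 2, 3, 3, 3, 3]
```

Once each row carries an id, filling in the missing id/char combinations is what fillin does in the Stata version.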