Is there a way to count from last condition x? - if-statement

I have a complex set of data that can return 3 different conditions per row. I need to be able to count the last x rows matching one of the specific conditions.
The following formula has been working well for me, but I have discovered a glitch in one instance of it (the formula is replicated at least a dozen times):
=ArrayFormula(LOOKUP(9.99999999999999E+307,IF(FREQUENCY(IF(AQ:AQ)=1,ROW(AQ:AQ)),IF(AQ:AQ<>1,ROW(AQ:AQ)))=0,FREQUENCY(IF(AQ:AQ=1,ROW(AQ:AQ)),IF(AQ:AQ=0,ROW(AQ:AQ))))))
Current criteria are as such:
0: Condition x met - Reset counter
1: Condition y met - Increment counter
2: Condition z met - Ignore this row
Therefore this:
1
2
2
2
1
1
0
1
1
1
Should output: 3
This:
1
2
0
2
2
1
2
1
Should output: 2
However, the glitch I have encountered is that the counter isn't reset when 0 is reached, for example:
1
2
1
2
1
1
2
2
2
2
0
Should output: 0
But in fact is outputting: 4
I have tested all possible conditions with that specific data set and I cannot rectify the issue. I believe there is an error in the formula (specifically the 9.99999999999999E+307) but I wrote it so long ago that I cannot successfully debug it. I have tried 1E+306 but the result is the same.
EDIT1: Upon request, I have included as stripped-down a version of the sheet as I can while still recreating the issue.
https://docs.google.com/spreadsheets/d/1SOXiFMEQelqptBvjcabMZGNgG60TRRbe_b65rzT1bi0/edit?usp=sharing
If you scroll to the bottom of the sheet you can see that Col AQ has a 0; as a result, the value in cell AF2 should be 0.
You will notice in the sheet that I am using Named Ranges.
EDIT2: player0's answer was PERFECT!! <3
I modified the new formula to adapt to my spreadsheet so it could accommodate Named Ranges and drop-down lists. This question helped me a lot with that:
Convert column index into corresponding column letter
The final formula (just FYI) turned out to be:
=ARRAYFORMULA(COUNTIF(
INDIRECT(REGEXEXTRACT(ADDRESS(ROW(), column(INDIRECT($A$1 & Z$1 & "L"))), "[A-Z]+")&
MAX(IF((INDIRECT($A$1 & Z$1 & "L")=0)*(INDIRECT($A$1 & Z$1 & "L")<>""),
ROW(INDIRECT($A$1 & Z$1 & "L"))+1,5))&":"&
REGEXEXTRACT(ADDRESS(ROW(), column(INDIRECT($A$1 & Z$1 & "L"))), "[A-Z]+")), 1))

=ARRAYFORMULA(COUNTIF(INDIRECT("A"&
MAX(IF((A2:A=0)*(A2:A<>""), ROW(A2:A)+1, ROW(A2)))&":A"), 1))
spreadsheet demo
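For anyone who wants to check a formula against the three examples in the question, the intended logic (count the 1s after the last 0, ignoring 2s and blanks) can be sketched outside Sheets. This is a minimal Python sketch and not part of the original answer; the function name is made up for illustration, and blanks are assumed to be represented as None.

```python
def count_since_last_reset(values):
    """Count the 1s that appear after the last 0; 2s and blanks (None) are ignored."""
    count = 0
    for v in values:
        if v == 0:
            count = 0      # condition x: reset the counter
        elif v == 1:
            count += 1     # condition y: increment the counter
        # anything else (2 or None) leaves the counter unchanged
    return count


if __name__ == "__main__":
    print(count_since_last_reset([1, 2, 2, 2, 1, 1, 0, 1, 1, 1]))     # 3
    print(count_since_last_reset([1, 2, 0, 2, 2, 1, 2, 1]))           # 2
    print(count_since_last_reset([1, 2, 1, 2, 1, 1, 2, 2, 2, 2, 0]))  # 0
```

The third call is the case that the original formula got wrong: the column ends in 0, so the count is 0.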

Related

combining multiple items to create one dummy variable

I have 7 items/variables in Stata that address the same survey question. These 7 items are each different weight control behaviors (diet, exercise, pills, etc.). I am trying to combine these variables to create a single weight control behavior dummy variable that is coded as yes (did engage in weight control) and no (did not engage in weight control).
The response options for each variable look something like this for a given weight control behavior
dieted
11438 0 not marked
2771 1 marked
16 6 refused
6508 7 legitimate skip
13 8 don’t know
Here is my code. I re-coded 6,7,8 for all 7 vars as missing:
tab1 h1gh30a-h1gh30g, m
foreach X of varlist h1gh30a-h1gh30g {
replace `X'=. if `X' > 1
}
egen wgt_control= rowmax(h1gh30a-h1gh30g)
ta wgt_control
gen wgt_control_new=wgt_control
replace wgt_control_new = 1 if wgt_control>0 & wgt_control!=.
replace wgt_control_new= 0 if wgt_control <1
ta wgt_control_new
I used rowmax() to combine all 7 items but my issue is that the response option 0 or No doesn't appear when I tabulate it. I only get those who responded yes=1.
Here is a suggestion with a reproducible example for what I think is the cleanest approach. I also included some unsolicited advice about survey data best practices.
* Example generated by -dataex-. For more info, type help dataex
clear
input double(h1gh30a h1gh30b h1gh30c)
1 1 1
1 0 1
6 1 8
0 0 0
7 6 8
end
* Explicit coding is better, so if possible (and it is with 7 vars),
* create a local in which the vars are explicitly listed
local wgt_controls h1gh30a h1gh30b h1gh30c
* Recode is a better command to use here. And do not destroy information:
* for survey data quality assurance it matters whether the respondent
* refused to answer, did not know, or the question was skipped. You can replace
* these survey codes with extended missing values, which behave like missing
* values but retain the differences between the survey codes
recode `wgt_controls' (6=.a) (7=.b) (8=.c)
* While rowmax() could be used, I think it seems like anymatch() fits
* what you are trying to do better
egen wgt_control = anymatch(`wgt_controls'), values(1)
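To make the difference between the two egen functions concrete, here is a rough Python sketch of how rowmax() and anymatch() treat the five example rows after the recode. It only imitates the row-wise logic; the helper names are hypothetical and NaN stands in for Stata's missing values (., .a, .b, .c).

```python
import math

MISSING = math.nan  # stand-in for Stata missing values


def recode(row):
    """Turn survey codes 6, 7 and 8 into missing, like the recode step above."""
    return [MISSING if v in (6, 7, 8) else v for v in row]


def rowmax(row):
    """Largest non-missing value in the row, or missing if every value is missing."""
    present = [v for v in row if not math.isnan(v)]
    return max(present) if present else MISSING


def anymatch(row, value=1):
    """1 if any variable equals `value`, otherwise 0 (never missing)."""
    return int(any(v == value for v in row))


rows = [[1, 1, 1], [1, 0, 1], [6, 1, 8], [0, 0, 0], [7, 6, 8]]
for raw in rows:
    r = recode(raw)
    print(raw, "rowmax:", rowmax(r), "anymatch:", anymatch(r))
```

The last row is the one to watch: with every item missing, the rowmax-style result is missing while the anymatch-style result is 0, so it may be worth deciding how such respondents should be coded.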
There is no minimal reproducible example here, so we can't reproduce the problem independently.
From your code, it seems that h1gh30a-h1gh30g are recoded so that all are 0, 1 or missing, so their maximum takes one of the same values.
gen wgt_control_new = wgt_control
replace wgt_control_new = 1 if wgt_control>0 & wgt_control!=.
replace wgt_control_new= 0 if wgt_control <1
seems to boil down to cloning the variable:
gen wgt_control_new = wgt_control
In short, I can't see a reason in your code why you should never see 0 as a possible result.
EDIT
A minimal check on whether there are zeros that aren't showing up as they should might be
egen max = rowmax(h1gh30a-h1gh30g)
list h1gh30a-h1gh30g if max == 0

Trajectory Analysis (SAS): Incorrect number of start values

I am attempting a trajectory analysis in SAS (proc traj).
Following instructions found online, I first begin by testing two quadratic models, then three, then four (i.e., order 2 2, order 2 2 2, order 2 2 2 2, order 2 2 2 2 2).
I determined that a three-group linear model is the best fit (order 1 1 1;).
I then wish to add time-stable covariates with the risk command. As found online, I did this by adding the start parameters provided in the Log.
At this point, I receive a notice: "Incorrect number of start values. There should be 10 start values based on the model specifications."
I understand that it's possible to delete some of the 12 parameter estimates provided, but how do I select which ones to remove?
Thank you.
Code:
proc traj data=followupyes outplot=op outstat=os out=of outest=oe itdetail;
id youthid;
title3 'linear 3-gp model ';
var pronoun_allpar1-pronoun_allpar3;
indep time1-time3;
model logit;
ngroups 3;
order 1 1 1;
weight wgt_00;
start 0.031547 0.499724 1.969017 0.859566 -1.236747 0.007471
0.771878 0.495458 0.000000 0.000000 0.000000 0.000000;
risk P00_45_1;
run;
%trajplot (OP, OS, "linear 3-gp model ", "Traj of Pronoun Support", "Pron Support", "Time");
Because you are estimating a model with 3 linear trajectories, you will need 2 start values for each of your 3 groups.
See here for more info: https://www.andrew.cmu.edu/user/bjones/example.htm
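The count of 10 in the error message can be reproduced with a little arithmetic. The Python sketch below is only a guess at how the total is built: it assumes each group contributes (order + 1) trajectory parameters, that the logit model adds no variance parameter, and that every group except the reference contributes one constant plus one coefficient per risk covariate. None of this is stated in the answer above, and the function name is made up.

```python
def expected_start_values(orders, n_risk_covariates):
    """Rough count of start values under the assumptions stated above."""
    n_groups = len(orders)
    # intercept + polynomial terms for each trajectory group
    trajectory_params = sum(order + 1 for order in orders)
    # assumed: constant + one coefficient per risk covariate, for each non-reference group
    membership_params = (n_groups - 1) * (1 + n_risk_covariates)
    return trajectory_params + membership_params


# order 1 1 1 with a single risk covariate (P00_45_1)
print(expected_start_values([1, 1, 1], 1))  # 10
```

Under these assumptions the trajectory part accounts for 6 values (2 per group, as the answer says) and the group-membership part for the remaining 4.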

GCP Data Prep- forward and backward fill

I have the following table which I am trying to wrangle in GCP Data prep:
Timestamp Event
2018-04-01 0
2018-04-02 0
2018-04-03 0
2018-04-04 0
2018-04-05 1
2018-04-06 0
2018-04-07 0
2018-04-08 0
I am trying to transform it in a way such that if Event is 1, then the previous 3 entries in the Event are set to 1 and the next 2 entries in Event are set to 2.
So, essentially the data set will look like the below after transformation
Timestamp Event
2018-04-01 0
2018-04-02 1
2018-04-03 1
2018-04-04 1
2018-04-05 1
2018-04-06 2
2018-04-07 2
2018-04-08 0
I have tried to use window and conditionals to achieve this, but w/o success.
Any ideas on how this transformation can be achieved? I am open to splitting the column or creating a new derived column if that can help achieve this result.
Thanks!
You can use window functions as part of your conditions in your IF statements. Using the PREV and NEXT window functions you can get the values at X rows above or below the current row in your window. Once you got the values, you can compare if they match the expected value and shape your IF statement accordingly.
For your use case, you need to check whether the PREV value 1 or 2 positions back is equal to 1 and, if so, replace the row with the number 2. Otherwise, if the NEXT value at position 1, 2 or 3 is equal to 1, the row should be replaced with the number 1. Lastly, a row whose current value is 1 stays 1, and all remaining rows become 0. Converting this into a formula accepted by Dataprep would look like the following:
IF(PREV(Event, 1) == 1 || PREV(Event, 2) == 1, 2, IF(NEXT(Event, 1) == 1 || NEXT(Event, 2) == 1 || NEXT(Event, 3) == 1, 1, IF(Event == 1, 1, 0)))
To enter this formula on Dataprep, under the Function tab, select “Custom Formula”. Under the custom formula window, set the formula type to “Multiple row formula”, as the PREV and NEXT functions require an additional argument specifying which column to sort by.
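The same nested IF can be traced by hand with a small Python sketch. This is not Dataprep code: the helper names are hypothetical, the rows are assumed to be already sorted by Timestamp, and a look-back or look-ahead that falls outside the table is assumed to behave like a null that never equals 1.

```python
def value_at(values, index):
    """Value at `index`, or None when the position falls outside the table."""
    return values[index] if 0 <= index < len(values) else None


def fill_around_events(events):
    """Mirror the nested IF: 2 after an event, 1 before and on it, 0 elsewhere."""
    result = []
    for i, current in enumerate(events):
        # PREV(Event, 1) == 1 || PREV(Event, 2) == 1
        after_event = any(value_at(events, i - k) == 1 for k in (1, 2))
        # NEXT(Event, 1) == 1 || NEXT(Event, 2) == 1 || NEXT(Event, 3) == 1
        before_event = any(value_at(events, i + k) == 1 for k in (1, 2, 3))
        if after_event:
            result.append(2)
        elif before_event or current == 1:
            result.append(1)
        else:
            result.append(0)
    return result


print(fill_around_events([0, 0, 0, 0, 1, 0, 0, 0]))
# [0, 1, 1, 1, 1, 2, 2, 0]
```

The printed list matches the desired Event column from the question for 2018-04-01 through 2018-04-08.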

Replace zeros with missing values in certain cases

I was wondering if anyone knew an easier way of doing the following:
I have a dataset of health facility caseload by year, where each observation is one health facility. Facilities were 'brought online' in different years, so some have zeros before they have values for caseload. Also, some 'discontinue', as in they did provide services, but don't any more. I would like to replace the zeros with missing values for the years in which a facility discontinued. In the following example, the 3rd and 4th facilities discontinued, so I'd like missing for y2014 for the 3rd and y2013 & y2014 for the 4th.
y2011 y2012 y2013 y2014
0 0 76 82
0 0 29 13
0 0 25 0
5 10 0 0
0 0 17 24
I tried the following, which worked, but I'm going to have many years worth of data to work on (2000-2014), so was wondering if there was a more efficient way.
replace y2014=. if y2014==0 & (y2013>0 | y2012>0 | y2011>0)
replace y2013=. if y2013==0 & ( y2012>0 | y2011>0)
replace y2012=. if y2012==0 & ( y2011>0)
I messed around with egen rowlast to identify the facilities with a zero in the last year (meaning they discontinued), but then wasn't sure where to go with it.
Your problem would benefit from a loop over the variables.
We'll initialise started to 0, change our mind about started when we see a positive value, and change any subsequent 0s to missings if started is 1.
gen started = 0
forval y = 2000/2014 {
replace started = 1 if y`y' > 0
replace y`y' = . if started == 1 & y`y' == 0
}
Note that this scheme allows re-starts.
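For readers who want to trace the loop on the example rows, here is the same started-flag idea as a minimal Python sketch. It is an illustration only: the function name is invented, None stands in for Stata's missing value, and the input is assumed to contain no missing values to begin with.

```python
def blank_zeros_after_start(row):
    """Replace zeros with None once a positive value has been seen in the row."""
    started = False
    cleaned = []
    for value in row:
        if value > 0:
            started = True            # the facility has come online
        if started and value == 0:
            cleaned.append(None)      # zero after the start becomes missing
        else:
            cleaned.append(value)
    return cleaned


facilities = [
    [0, 0, 76, 82],
    [0, 0, 29, 13],
    [0, 0, 25, 0],
    [5, 10, 0, 0],
    [0, 0, 17, 24],
]
for row in facilities:
    print(blank_zeros_after_start(row))
```

A row such as [5, 0, 7] comes out as [5, None, 7], which is the re-start behaviour mentioned above.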
A more general comment is that this is not the best data structure for such panel or longitudinal data. This particular problem is not too challenging, but most problems with such data will be easier after reshape long.
See here for a survey of "rowwise" techniques in Stata.

Calculating the distance between characters

Problem: I have a large number of scanned documents that are linked to the wrong records in a database. Each image has the correct ID on it somewhere that says where it belongs in the db.
I.E. A DB row could be:
| user_id | img_id | img_loc |
| 1 | 1 | /img.jpg|
img.jpg would have the user_id (1) on the image somewhere.
Method/Solution: Loop through the database. Pull the image text into a variable with OCR and check if user_id is found anywhere in the variable. If not, flag the record/image in a log; if so, do nothing and move on.
My example is simple, in the real world I have a guarantee that user_id wouldn't accidentally show up on the wrong form (it is of a specific format that has its own significance)
Right now it is working. However, it is incredibly strict. If you've worked with OCR you understand how fickle it can be. Sometimes a 7 = 1 or a 9 = 7, etc. The result is a large number of false positives. Especially among images with low quality scans.
I've addressed some of the image quality issues with some processing on my side - increase image size, adjust the black/white threshold and had satisfying results. I'd like to add the ability for the prog to recognize, for example, that "81*7*23103" is not very far from "81*9*23103"
The only way I know how to do that is to check for strings at least as long as what I'm looking for, calculate the distance between each pair of characters, take the average, and set a limit on what counts as a good average.
Some examples:
Ex 1
81723103 - Looking for this
81923103 - Found this
--------
00200000 - distances between characters
0 + 0 + 2 + 0 + 0 + 0 + 0 + 0 = 2
2/8 = .25 (pretty good match. 0 = perfect)
Ex 2
81723103 - Looking
81158988 - Found
--------
00635885 - distances
0 + 0 + 6 + 3 + 5 + 8 + 8 + 5 = 35
35/8 = 4.375 (Not a very good match. 9 = worst)
This way I can tell it "Flag the bottom 30% only" and dump anything with an average distance > 6.
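The scoring described above could be sketched roughly as follows in Python. This is only an illustration of the idea in the question, with assumptions: the function names are made up, the IDs are assumed to be purely numeric, and the OCR text is scanned with a sliding window the same length as the ID.

```python
def avg_digit_distance(expected, found):
    """Mean absolute difference between digits at the same position (0 = perfect, 9 = worst)."""
    if len(expected) != len(found):
        raise ValueError("both strings must have the same length")
    total = sum(abs(int(a) - int(b)) for a, b in zip(expected, found))
    return total / len(expected)


def best_match(expected, ocr_text):
    """Slide over the OCR text and return (score, candidate) for the closest all-digit window."""
    width = len(expected)
    best = None
    for start in range(len(ocr_text) - width + 1):
        window = ocr_text[start:start + width]
        if not (window.isascii() and window.isdigit()):
            continue  # skip windows containing letters, spaces or punctuation
        score = avg_digit_distance(expected, window)
        if best is None or score < best[0]:
            best = (score, window)
    return best  # None when no all-digit window exists


print(avg_digit_distance("81723103", "81923103"))      # 0.25
print(avg_digit_distance("81723103", "81158988"))      # 4.375
print(best_match("81723103", "ID: 81923103 page 1"))   # (0.25, '81923103')
```

The two scores reproduce Ex 1 and Ex 2; a record would then be flagged when the best score for its image exceeds whatever limit is chosen.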
I figure I'm reinventing the wheel and wanted to share this for feedback. I see a huge increase in run time and a performance hit doing all these string operations over what I'm currently doing.