Increment 2 different Ids based on selected option - if-statement

I'm trying to update a working solution of incrementing one ID based on several conditions, so I was using the ROW() function without any issue. But now I'm trying to increment 2 different IDs based on selected option as shown in the screenshot below, where I've started the following so far:
=ARRAYFORMULA(IF(LEN(A2:A),COUNTIFS(A2:A, A2:A, ROW(A2:A), "<="&ROW(A2:A),A2:A,"Option 2"),))
Can anyone bring some light on this scenario: thanks
Link of spreadsheet illustrating my situation: here

You have to first check if the value is Option 1/Option 2 or not. A way to do this without using OR (which can't be iterated over an array) is this:
IF(A2:A="Option 1",0,1)*IF(A2:A="Option 2",0,1)
Next, you can wrap this into another IF so that the returned value depends on whether the previous condition is true. So, if option is not 1 nor 2, the corresponding value should result from the count of all the previous values which are not 1 or 2. So the COUNTIFS should check if the option is not 1 nor 2. Something like this:
29999 + COUNTIFS(A2:A,"<>Option 1",A2:A,"<>Option 2",ROW(A2:A), "<="&ROW(A2:A))
Finally, if the option is 1 or 2, the returned value should result from hte count of all previous 1 and 2 values. Since that's an OR condition, you have to sum two different COUNTIFS, one for option 1 and one for 2. Could be like this:
9999 + COUNTIFS(A2:A,"=Option 1",ROW(A2:A), "<="&ROW(A2:A)) + COUNTIFS(A2:A,"=Option 2",ROW(A2:A), "<="&ROW(A2:A))
Putting it all together, it could be like this:
=ARRAYFORMULA(IF(LEN(A2:A),IF(IF(A2:A="Option 1",0,1)*IF(A2:A="Option 2",0,1),
29999 + COUNTIFS(A2:A,"<>Option 1",A2:A,"<>Option 2",ROW(A2:A), "<="&ROW(A2:A)),
9999 + COUNTIFS(A2:A,"=Option 1",ROW(A2:A), "<="&ROW(A2:A)) + COUNTIFS(A2:A,"=Option 2",ROW(A2:A), "<="&ROW(A2:A))),""))

slight alternative:
=ARRAYFORMULA(IF(A2:A="",,IF(REGEXMATCH(A2:A, H2&"$|"&H3&"$"),
9999+COUNTIFS(REGEXMATCH(A2:A, H2&"$|"&H3&"$"),
REGEXMATCH(A2:A, H2&"$|"&H3&"$"), ROW(A2:A), "<="&ROW(A2:A)),
29999+COUNTIFS(A2:A, "<>"&H2, A2:A, "<>"&H3, ROW(A2:A), "<="&ROW(A2:A)))))

Related

Google sheets IF stops working correctly when wrapped in ARRAYFORMULA

I want this formula to calculate a date based on input from two other dates. I first wrote it for a single cell and it gives the expected results but when I try to use ARRAYFORMULA it returns the wrong results.
I first use two if statements specifycing what should happen if either one of the inputs is missing. Then the final if statement calculates the date if both are present based on two conditions. This seems to work perfectly if I write the formula for one cell and drag it down.
=IF( (LEN(G19)=0);(U19+456);(IF((LEN(U19)=0) ;(G19);(IF((AND((G19<(U19+456));(G19>(U19+273)) ));(G19);(U19+456))))))
However, when I want to use arrayformula to apply it to the entire column, it always returns the value_if_false if neither cell is empty, regardless of whether the conditions in the if statement are actually met or not. I am specifically talking about the last part of the formula that calculates the date if both input values are present, it always returns the result of U19:U+456 even when the result should be G19:G. Here is how I tried to write the ARRAYFORMULA:
={"Date deadline";ARRAYFORMULA(IF((LEN(G19:G400)=0);(U19:U400+456);(IF((LEN(U19:U400)=0);
(G19:G400);(IF((AND((G19:G400<(U19:U400+456));(G19:G400>(U19:U400+273)) ));(G19:G400);(U19:U400+456)))))))}
I am a complete beginner who only learned to write formulas two weeks ago, so any help or tips would be greatly appreciated!
AND and OR are not compatible with ARRAYFORMULA
Replace them by * or +
Try
={"Date deadline";ARRAYFORMULA(
IF((LEN(G19:G400)=0),(U19:U400+456),
(IF((LEN(U19:U400)=0), (G19:G400),
(IF((((G19:G400<(U19:U400+456))*(G19:G400>(U19:U400+273)) )),(G19:G400),
(U19:U400+456)))
))
)
)}
Keep in mind you cannot use AND, OR operators in an arrayformula, so you must find an alternative method such as multiplying the values together and checking them for 0 or 1 (true*true=1)
I am gathering based on your formula's and work that you want to have the following:
If G19 is blank show U19 + 456
If U19 is blank show G19
If G19 is less than U19 + 456 but greater than U19 + 273 show G19
Otherwise show U19 + 456
I'm not too sure what you want to happen when both columns G and U are empty. Based on your current formula you are returning an empty cell + 456... but with this formula it returns an empty cell rather than Column U + 456
Formula
={"Date deadline";ARRAYFORMULA(TO_DATE(ARRAYFORMULA(IFS((($G19:$G400="")*($U19:$U400=""))>0,"",$G19:$G400="",$U19:$U400+456,$U19:$U400="",$G19:$G400,(($G19:$G400<$U19:$U400+456)*($G19:$G400>$U19:$U400+273))>0,$G19:$G400,TRUE,$U19:$U400+456))))}

Make =IF Function Output Numbers For "Scoring": Google Sheets

I'm am exploring methods of giving scores to different datapoints within a dataset. These points come from a mix of numbers and text string attributes looking for certain characteristics, e.g. if Col. A contains more than X number of "|", then give it a 1. If not, it gets a 0 for that category. I also have some that give the point when the value is >X.
I have been trying to do this with =IF, for example, =IF([sheet] = [Text], "1","0").
I can get it to give me 1 or 0, but I am unable to get a point total with sum.
I have tried changing the formatting of the text to both "number", "plain text", and have left it as automatic, but I can't get it to sum. Thoughts? Is there maybe a better way to do this?
FWIW - I'm trying to score based on about 12 factors.
Best,
Alex
The issue here might be that you're having the cell evaluate to either the string "0" or the string "1" rather than the number 0 or the number 1. That would explain why you're seeing the right things but the math isn't coming out right - the cell contents look like numbers, but they're really text, which the summation would then ignore.
One option would be to drop the quotation marks and write something like this:
=IF(condition, 1, 0)
This has the condition evaluate to 1 if it's true and 0 if it's false.
Alternatively, you could write something like this:
=(condition) * 1
This will take the boolean TRUE or FALSE returned by condition and convert it to either the numeric value 1 (true) or the numeric value 0 (false).

How to distribute values into group in python

I have a dataset of actions doing over time, an attribute 'Hour' ( contains values from 0 ->23 ). Now I want to create another attribute, say 'PartOfDay', which group 24 hours into 4 parts. For tuples have 'Hour' value of 0 to 5, then the 'PartOfDay' value should be 1; if 'Hour' value in [6,11], then the 'PartOfDay' value should be 2;...How can I do?
The codes would do this:
train['PartOfDay']=1
train.loc[(train.Hour>=6) & (train.hour<=11),'PartOfDay']=2
train.loc[(train.Hour>=12) & (train.hour<=17),'PartOfDay']=3
train.loc[(train.Hour>=18) & (train.hour<=23),'PartOfDay']=4
but it seems not so beautiful, I would like to know a more decent one if possible
Thank you for all your supports!!
While it is not clear what train.loc represents, a general approach to your problem is to use modulus function to set the RHS:
1 + int(train.Hour / 6)

Generating rolling z-scores of panel data in Stata

I have an unbalanced panel data set (countries and years). For simplicity let's say I have one variable, x, that I am measuring. The panel data sorted first by country (a 3-digit numeric country-code) and then by year. I would like to write a .do file that generates a new variable, z_x, containing the standardized values of the variable x. The variables should be standardized by subtracting the mean from the preceding (exclusive) m time periods, and then dividing by the standard deviation from those same time periods. If this is not possible, return a missing value.
Currently, the code I am using to accomplish this is the following (edited now for clarity)
xtset weocountrycode year
sort weocountrycode year
local win_len = 5 // Defining rolling window length.
quietly: rolling sd_x=r(sd) mean_x=r(mean), window(`win_len') saving(stats_x, replace): sum x
use stats_x, clear
rename end year
save, replace
use all_data_PROCESSED_FINAL.dta, clear
quietly: merge 1:1 (weocountrycode year) using stats_x
replace sd_x = . if `x'[_n-`win_len'+1] == . | weocountrycode[_n-`win_len'+1] != weocountrycode[_n] // This and next line are for deleting values that rolling calculates when I actually want missing values.
replace mean_`x' = . if `x'[_n-`win_len'+1] == . | weocountrycode[_n-`win_len'+1] != weocountrycode[_n]
gen z_`x' = (`x' - mean_`x'[_n-1])/sd_`x'[_n-1] // calculate z-score
UPDATE:
My struggle with rolling is that when rolling is set up to use a window length 5 rolling mean, it automatically does window length 1,2,3,4 means for the first, second, third and fourth entries (when there are not 5 preceding entries available to average out). In fact, it does this in general - if the first non-missing value is on entry 5, it will do a length 1 rolling average on entry 5, length 2 rolling average on entry 6, ..... and then finally start doing length 5 moving averages on entry 9. My issue is that I do not want this, so I would like to avoid performing these calculations. Until now, I have only been able to figure out how to delete them after they are done, which is both inefficient and bothersome.
I tried adding an if clause to the -rolling- statement:
quietly: rolling sd_x=r(sd) mean_x=r(mean) if x[_n-`win_len'+1] != . & weocountrycode[_n-`win_len'+1] != weocountrycode[_n], window(`win_len') saving(stats_x, replace): sum x
But it did not fix the problem and the output is "weird" in the sense that
1) If `win_len' is equal to, say, 10, there are 15 missing values in the resulting z_x variable, instead of 9.
2) Even though there are "extra" missing values in z_x, the observations still start out as window length 1 means, then window length 2 means, etc. which makes no sense to me.
Which leads me to believe I fundamentally don't understand 1) what -rolling- is doing and 2) how an if clause works in the context of -rolling-.
Does this help?
Thanks!
I'm not sure I understand completely but I'll try to answer based on what I think your problem is, and based on a comment by #NickCox.
You say:
... when rolling is set up to use a window length 5 rolling mean...
if the first non-missing value is
on entry 5, it will do a length 1 rolling average on entry 5, length 2
rolling average on entry 6, ...
This is expected. help rolling states:
The window size refers to calendar periods, not the number of
observations. If there
are missing data (for example, because of weekends), the actual number of observations used by command may be less than
window(#).
It's not actually doing a "length 1 rolling average", but I get to that later.
Below some examples to see what rolling does:
clear all
set more off
*-------------------------- example data -----------------------------
set obs 92
gen dat = _n - 1
format dat %tq
egen seq = fill(1 1 1 1 2 2 2 2)
tsset dat
tempfile main
save "`main'"
list in 1/12, separator(4)
*------------------- Example 1. None missing ------------------------
rolling mean=r(mean), window(4) stepsize(4) clear: summarize seq, detail
list in 1/12, separator(0)
*------- Example 2. All but one value, missing in first window ------
use "`main'", clear
replace seq = . in 1/3
list in 1/8
rolling mean=r(mean), window(4) stepsize(4) clear: summarize seq, detail
list in 1/12, separator(0)
*------------- Example 3. All missing in first window --------------
use "`main'", clear
replace seq = . in 1/4
list in 1/8
rolling mean=r(mean), window(4) stepsize(4) clear: summarize seq, detail
list in 1/12, separator(0)
Note I use the stepsize option to make things much easier to follow. Because the date variable is in quarters, I set windowsize(4) and stepsize(4) so rolling is just computing averages by year. I hope that's easy to see.
Example 1 does as expected. No problem here.
Example 2 on the other hand, should be more interesting for you. We've said that what matters are calendar periods, so the mean is computed for the whole year (four quarters), even though it contains missings. There are three missings and one non-missing. summarize is computing the mean over the whole year, but summarize ignores missings, so it just outputs the mean of non-missings, which in this case is just one value.
Example 3 has missings for all four quarters of the year. Therefore, summarize outputs . (missing).
Your problem, as I understand it, is that when you face a situation like Example 2, you'd like the output to be missing. This is where I think Nick Cox's advice comes in. You could try something like:
rolling mean=r(mean) N=r(N), window(4) stepsize(4) clear: summarize seq, detail
replace mean = . if N != 4
list in 1/12, separator(0)
This says: if the number of non-missings for the window (r(N), also computed by summarize), is not the same as the window size, then replace it with missing.

Filtering on the count with the Django ORM

I have a query that's basically "count all the items of type X, and return the items that exist more than once, along with their counts". Right now I have this:
Item.objects.annotate(type_count=models.Count("type")).filter(type_count__gt=1).order_by("-type_count")
but it returns nothing (the count is 1 for all items). What am I doing wrong?
Ideally, it should get the following:
Type
----
1
1
2
3
3
3
and return:
Type, Count
-----------
1 2
3 3
In order to count the number of occurrences of each type, you have to group by the type field. In Django this is done by using values to get just that field. So, this should work:
Item.objects.values('group').annotate(
type_count=models.Count("type")
).filter(type_count__gt=1).order_by("-type_count")
It's logical error ;)
type_count__gt=1 means type_count > 1 so if the count == 1 it won't be displayed :)
use type_count__gte=1 instead - it means type_count >= 1 :)