I have data like:
Folder Replied Complied
1 testing 1 1
2 /complete/ 0 1
3 none 1 1
4 Incomplete 0 1
5 complete// 0 0
6 Incomplete 1 0
7 ABCcomplete 1 1
I would like a measure that calculates the average of Complied (sum divided by count), only where Folder contains the string complete AND Replied is 0 (both conditions simultaneously).
Therefore rows 2, 4, 5 should be used in the count, resulting in 0.66... (1 + 1 + 0)/3
I've tried several things, but the formula either results in an error or returns the wrong result,
e.g.
Measure = CALCULATE (
Average( [Complied]),
CONTAINSSTRING([Folder],"complete") && [replied] = 0
)
DAX is very confusing to me. Thanks in advance
edit:
I've seen examples like
= CALCULATE(AVERAGE([col]), CONTAINSSTRING([Folder],"complete") , [replied] = 0)
Note the , instead of &&, but that doesn't work for some reason either. Neither does AND(condition1, condition2).
This DAX measure should be the one you are looking for:
Measure = CALCULATE(AVERAGE(Sheet1[Complied]),
CONTAINSSTRING(Sheet1[Folder],"complete") && Sheet1[Replied]=0)
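So how does it work?
CONTAINSSTRING checks whether Folder contains "complete"; it works like the VBA InStr function and is not case-sensitive.
&& requires both conditions to be met at the same time.
CALCULATE(expression, filter) evaluates the expression only over the rows that pass the filter.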
You can first test the condition on its own with the following expression (for example, as a calculated column) to check that it works in your case:
IF(CONTAINSSTRING(Sheet1[Folder],"complete") && Sheet1[Replied]=0,"True","False")
Only three rows (2, 4 and 5) are True here.
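As a cross-check outside DAX, the same two-condition filter can be sketched in plain Python on the sample rows from the question. This is only an illustration: the tuple layout and the helper name keep are made up here, and lower-casing is used to mimic a case-insensitive substring test.

```python
# Sample rows from the question: (Folder, Replied, Complied)
rows = [
    ("testing", 1, 1),
    ("/complete/", 0, 1),
    ("none", 1, 1),
    ("Incomplete", 0, 1),
    ("complete//", 0, 0),
    ("Incomplete", 1, 0),
    ("ABCcomplete", 1, 1),
]


def keep(folder, replied):
    # Hypothetical helper: both conditions must hold at the same time.
    # Lower-casing mimics a case-insensitive substring match.
    return "complete" in folder.lower() and replied == 0


# 1-based row numbers that pass the filter
matched_rows = [i for i, (f, r, _) in enumerate(rows, start=1) if keep(f, r)]

# Average of Complied over the rows that pass the filter
selected = [c for f, r, c in rows if keep(f, r)]
average_complied = sum(selected) / len(selected)

print(matched_rows, average_complied)
```

Rows 2, 4 and 5 pass the filter, so the average is (1 + 1 + 0)/3, as expected in the question.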
How can I create a new variable that takes value X+1 if an event doesn't occur in X periods of time?
Specifically, I have data on many people over 12 years. For one question, they could answer yes (1) or no (0). I care about the first time someone says Yes during the 12 years, and I created a variable that takes the value of the number of years with Yes replies.
But if someone replies No for all 12 years, I want to set that variable equal to 13, and I'm stuck on how to do that.
by hhidpn (wave), sort: gen byte EarlyHeart = sum(rhearte) == 1
gen EarlyHeart1=year if EarlyHeart==1
(what's next?)
If the last cumulative sum for an individual is 0, then they all are.
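* keep the cumulative sum itself so that its last value can be tested
by hhidpn (wave), sort: gen cumheart = sum(rhearte)
by hhidpn: gen byte EarlyHeart = cumheart == 1
by hhidpn: replace EarlyHeart = 13 if cumheart[_N] == 0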
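The idea behind the answer (take a running sum, and if its last value is 0 the person never said yes) can be sketched for a single person in Python. This is a rough illustration, not a translation of the Stata code: the function name first_yes_or_sentinel is hypothetical, and it returns the position of the first yes rather than an indicator.

```python
from itertools import accumulate


def first_yes_or_sentinel(answers):
    """Return the 1-based position of the first yes (1) in answers,
    or len(answers) + 1 when every answer is no (0)."""
    running = list(accumulate(answers))  # cumulative sum, like sum() in Stata
    if not running or running[-1] == 0:
        # The last cumulative sum is 0, so all of them are 0: never a yes.
        return len(answers) + 1
    # The first place where the cumulative sum hits 1 is the first yes.
    return running.index(1) + 1


print(first_yes_or_sentinel([0] * 12))
print(first_yes_or_sentinel([0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0]))
```

With 12 waves, a person who always answers no gets 13, matching the X+1 convention in the question.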
I want to set a variable to 0 if the "Independent" column is equal to 0,1,or null. I have been trying something like this:
df["Iflag"] = df.Independent.where((df.Independent == 0) | (df.Independent == 1 )|(df.Independent.isnull())).astype(int)
Iflag = df[df.Iflag == 0]
pd.DataFrame(Iflag, columns=["LocIdent","Independent"]).to_csv(Targcsv, mode='ab')
I get an error that says I cannot convert NA to integer. This code works when I drop the check to see if Independent is null. My question is, what is the best way to write an if statement that includes null values in Pandas?
I'd just fill the NaN values first, and then your code would work. NaN cannot be represented as an int, hence the error.
So something like
# fill the nan values
df.Independent = df.Independent.fillna(0)
# set any values that are 1 to 0
df.loc[df.Independent == 1, 'Independent'] = 0
# take a view of the df where the value is 0
Iflag = df[df.Independent == 0]
pd.DataFrame(Iflag, columns=["LocIdent","Independent"]).to_csv(Targcsv, mode='ab')
It's redundant to check whether a value is 0 if all you're going to do is set it to 0 again anyway. So all you need to do is find the rows where it's 1, set these to 0, and then take a view of the df where the condition is satisfied.
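The underlying reason for the error can be reproduced without pandas: NaN is a float with no integer counterpart. The sketch below uses a plain list as a stand-in for the Independent column (the sample values are made up) and follows the same order of operations, filling first and flagging afterwards.

```python
import math

# Made-up stand-in for the "Independent" column, with missing values as NaN
values = [0.0, 1.0, float("nan"), 2.0, float("nan")]

# Converting NaN to int fails, which is what the error message is about
try:
    int(float("nan"))
    nan_converts = True
except ValueError:
    nan_converts = False

# Fill the missing values first, then the integer logic is safe
filled = [0.0 if math.isnan(v) else v for v in values]

# Values 0 and 1 (and the filled NaNs) become 0; other values are kept
flags = [0 if int(v) in (0, 1) else int(v) for v in filled]

print(nan_converts, flags)
```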
So I have this function that takes in an integer, but it doesn't work, and I suspect that the if statement is not valid. I could not find anything on Google regarding the issue; maybe my googling skills just suck.
if mynumber != (0 or 1 or 2 or 3 or 4 or 5 or 6 or 7 or 8) then
print("Please choose an integer number between 1-8")
end
Thanks for any help!!
Correct. That is not how you test things like that. You cannot test multiple values that way.
or requires expressions on either side and evaluates to a single expression. So (0 or 1 or 2 or 3 or 4 or 5 or 6 or 7 or 8) evaluates to 0 and your final expression is just if mynumber != 0 then.
To test multiple values like that you need to use or around multiple comparison expressions.
if (mynumber ~= 0) or (mynumber ~= 1) or (mynumber ~= 2) ... then (also notice ~= is the not-equal operator not !=).
Also be sure to note YuHao's answer about the logic in this line and how to test for this correctly.
Others have pointed out the major problems you have, i.e., 0 or 1 or 2 or 3 or 4 or 5 or 6 or 7 or 8 evaluates to 0, and the rest is ignored because of short-circuiting. You need to test the number against these numbers one by one.
However, there's one last trap. The condition
if mynumber ~= 0 or mynumber ~= 1 then
is always true, because a number is either not equal to 0, in which case mynumber ~= 0 is true; or it is equal to 0, in which case mynumber ~= 1 is true.
The correct logic should be:
if mynumber ~= 0 and mynumber ~= 1 then
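The trap is a matter of boolean logic rather than of Lua itself, so it can be checked by brute force in Python (where the not-equal operator is != instead of ~=). The candidate range below is arbitrary and only serves as test data.

```python
# Arbitrary test values, some inside and some outside 0..8
candidates = list(range(-3, 12))

# "not equal to 0 OR not equal to 1" holds for every number
or_version = [n for n in candidates if n != 0 or n != 1]

# "not equal to 0 AND not equal to 1" excludes exactly 0 and 1
and_version = [n for n in candidates if n != 0 and n != 1]

# Rejecting everything outside 0..8 is a chain of ANDed inequalities,
# which a membership test expresses more compactly
out_of_range = [n for n in candidates if n not in range(0, 9)]

print(len(or_version) == len(candidates), and_version, out_of_range)
```

The or version keeps all 15 candidates, which is why the original condition would always print the message.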
Etan's answer explains the behaviour as observed in lua. I'd suggest writing a custom FindIn function for searching:
function FindIn( tInput, Value )
for _, v in pairs( tInput ) do
if v == Value then return true end
end
return false
end
if FindIn( {1,2,3,4,5,6,7,8}, mynumber ) then
-- ...
end
Try this:
In Lua, you check whether two items are NOT EQUAL with "~=" instead of "!=".
When you compare items in an if statement, each comparison must itself produce a boolean, so instead of mynumber != (0 or 1 or...) write separate comparisons such as (mynumber ~= 0) and (mynumber ~= 1) ...
You can do it more simply with a range check (mynumber has to be an integer variable):
if mynumber<0 or mynumber>8 then
print("Please choose an integer number between 1-8")
end
I need to program a nearest neighbor algorithm in Stata from scratch because my dataset does not allow me to use any of the available solutions (as far as I know).
To be precise, I have a dataset with a structure similar to the following (the original has around 14k observations):
input id value treatment match
1 0.14 0 .
2 0.32 0 .
3 0.465 1 2
4 0.878 1 2
5 0.912 1 2
6 0.001 1 1
end
I want to generate a variable called match (already included in the example above). For each observation with treatment == 1 the variable match should store the id of another observation from within treatment == 0 whose value is closest to value of the considered observation (treatment == 1).
I am new to Stata programming, so I am not yet familiar with the syntax. My first attempt is the following; however, it does not produce any changes to the match variable. I am sure this is a novice question, but I am hoping for some advice on how to get the code running.
EDIT: I have changed the code slightly and now it seems to work. Do you see any problems that may arise if I run it on a bigger dataset?
set more off
clear all
input id pscore treatment
1 0.14 0
2 0.32 0
3 0.465 1
4 0.878 1
5 0.912 1
6 0.001 1
end
gen match = .
forval i = 1/`= _N' {
if treatment[`i'] == 1 {
local dist 1
forvalues j = 1/`= _N' {
if (treatment[`j'] == 0) {
local current_dist (pscore[`i'] - pscore[`j'])^2
if `dist' > `current_dist' {
local dist `current_dist' // update smallest distance
replace match = id[`j'] in `i' // write match
}
}
}
}
}
Consider some simulated data: 1,000 observations, 200 of them untreated (treat == 0) and the rest treated (treat == 1). The code included below will then be much more efficient than the code originally posted. (Ties, as in your code, are not explicitly handled.)
clear
set more off
*----- example data -----
set obs 1000
set seed 32956
gen id = _n
gen pscore = runiform()
gen treat = cond(_n <= 200, 0, 1)
*----- new method -----
timer clear
timer on 1
// get id of last non-treated and first treated
// (data is sorted by treat and ids are consecutive)
bysort treat (id): gen firsttreat = id[1]
local firstt = firsttreat[_N]
local lastnt = `firstt' - 1
// start loop
gen match = .
gen dif = .
quietly forvalues i = `firstt'/`=_N' {
// compute distances
replace dif = (pscore[`i'] - pscore)^2
summarize dif in 1/`lastnt', meanonly
// identify id of minimum-distance observation
replace match = . in 1/`lastnt'
replace match = id in 1/`lastnt' if dif == r(min)
summarize match in 1/`lastnt', meanonly
// save the minimum-distance id
replace match = r(max) in `i'
}
// clean variable and drop
replace match = . in 1/`lastnt'
drop dif firsttreat
timer off 1
tempfile first
save `first'
*----- your method -----
drop match
timer on 2
gen match = .
quietly forval i = 1/`= _N' {
if treat[`i'] == 1 {
local dist 1
forvalues j = 1/`= _N' {
if (treat[`j'] == 0) {
local current_dist (pscore[`i'] - pscore[`j'])^2
if `dist' > `current_dist' {
local dist `current_dist' // update smallest distance
replace match = id[`j'] in `i' // write match
}
}
}
}
}
timer off 2
tempfile second
save `second'
// check for equality of results
cf _all using `first'
// check times
timer list
The results in seconds to finish execution:
. timer list
1: 0.19 / 1 = 0.1930
2: 10.79 / 1 = 10.7900
The difference is huge, especially considering this data set has only 1,000 observations.
An interesting thing to notice is that as the number of non-treated cases increases relative to the number of treated, the original method improves, but never reaches the efficiency of the new method. As an example, invert the number of cases, so there are now 800 untreated and 200 treated (change the data setup to gen treat = cond(_n <= 800, 0, 1)). The result is
. timer list
1: 0.07 / 1 = 0.0720
2: 4.45 / 1 = 4.4470
You can see that the new method also improves and is still much faster. In fact, the relative difference is roughly the same (about 56 times faster in the first run and about 62 times in the second).
Another way to do this is using joinby or cross. The problem is that they temporarily expand (a lot) the size of your dataset. In many cases, they are not feasible due to the hard limit Stata has on the number of possible observations (see help limits). You can find an example of joinby here: https://stackoverflow.com/a/19784222/2077064.
Edit
If there's a large number of treated relative to untreated, your code suffers
because you go through the whole first loop many more times (due to the first if).
Furthermore, going through that whole loop once implies going through another
loop, which itself has two if conditions, _N more times.
In the opposite case, with few treated observations, you go through the whole
first loop only a small number of times, speeding up your code substantially.
The reason my code can maintain its efficiency is due to the use of in. This always
offers speed gains over if. Stata will go directly to those observations with no
logical checking needed. Your problem provides an opportunity for that replacement
and it's wise to seize it.
If my code used if where in is in place, the results would be different.
Your code would be faster for the
case in which there's a large number of untreated relative to treated, and again, that
is because in your code there would not be the need to go through the complete loop,
requiring very little work;
the first loop is short-circuited with the first if. For the opposite case,
my code would still dominate.
The key is to "separate" treated from untreated and work on each group using in.
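For comparison, here is a minimal Python sketch of the same matching task on the six example observations from the question. It uses a different technique from either Stata version: the untreated scores are sorted once and each treated score is located by binary search, so only the two neighbours around the insertion point need to be compared. The function name nearest_control_matches and the tuple layout are assumptions made for this illustration, and ties go to the lower-scoring control.

```python
from bisect import bisect_left


def nearest_control_matches(records):
    """records: list of (id, pscore, treated) tuples.
    Returns {treated_id: id of the untreated record with the closest pscore}."""
    # Separate the untreated group and sort it by score once
    controls = sorted((p, i) for i, p, t in records if t == 0)
    scores = [p for p, _ in controls]
    matches = {}
    for i, p, t in records:
        if t != 1 or not controls:
            continue
        pos = bisect_left(scores, p)
        # Only the neighbours on either side of the insertion point can be nearest
        neighbours = controls[max(pos - 1, 0):pos + 1]
        matches[i] = min(neighbours, key=lambda c: abs(c[0] - p))[1]
    return matches


# Example data from the question: (id, pscore, treatment)
data = [
    (1, 0.14, 0),
    (2, 0.32, 0),
    (3, 0.465, 1),
    (4, 0.878, 1),
    (5, 0.912, 1),
    (6, 0.001, 1),
]
result = nearest_control_matches(data)
print(result)
```

On the example data this reproduces the match column shown in the question (2, 2, 2 and 1 for ids 3 to 6).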
Can someone please tell me what a rolling sum is and how to implement it in Informatica?
My requirement is as below (given by the client):
ETI_DUR :
SUM(CASE WHEN AGENT_EXPNCD_DIM.EXCEPTION_CD='SYS/BLDG ISSUES ETI' THEN IEX_AGENT_DEXPN.SCD_DURATION ELSE 0 END)
ETI_30_DAY :
ROLLING SUM(CASE WHEN (SYSDATE-IEX_AGENT_DEXPN.ROW_DT)<=30 AND AGENT_EXPNCD_DIM.EXCEPTION_CD = 'SYS/BLDG ISSUES ETI'
THEN IEX_AGENT_DEXPN.SCD_DURATION ELSE 0 END)
ETI_30_DAY_OVRG :
CASE WHEN ETI_DUR > 0 THEN
CASE
WHEN ROLLINGSUM(ETI_DUR_30_DAY FOR LAST 29 DAYS) BETWEEN 0 AND 600 AND ROLLINGSUM(ETI_DUR_30_DAY FOR LAST 29 DAYS) + ETI_DUR > 600 THEN ROLLINGSUM(ETI_DUR_30_DAY FOR LAST 30 DAYS) - 600
WHEN ROLLINGSUM(ETI_DUR_30_DAY FOR LAST 29 DAYS) > 600 THEN ETI_DUR
ELSE 0 END
ELSE 0 END
I have implemented it as below in Informatica.
Expression Transformation:
o_ETI_DUR-- IIF(UPPER(EXCEPTION_CD_AGENT_EXPNDIM)='SYS/BLDG ISSUES ETI',SCD_DURATION,0)
o_ETI_29_DAY-- IIF(DATE_DIFF(TRUNC(SYSDATE),trunc(SCHD_DATE),'DD') <=29 AND UPPER(EXCEPTION_CD_AGENT_EXPNDIM) = 'SYS/BLDG ISSUES ETI' ,SCD_DURATION,0)
o_ETI_30_DAY -- IIF(DATE_DIFF(TRUNC(SYSDATE),trunc(SCHD_DATE),'DD') <=30 AND UPPER(EXCEPTION_CD_AGENT_EXPNDIM) = 'SYS/BLDG ISSUES ETI' ,SCD_DURATION,0)
Aggregator transformation:
o_ETI_30_DAY_OVRG:
IIF(sum(i_ETI_DUR) > 0,
IIF((sum(i_ETI_29_DAY)>=0 and sum(i_ETI_29_DAY)<=600) and (sum(i_ETI_29_DAY)+sum(i_ETI_DUR)) > 600,
sum(i_ETI_30_DAY) - 600,
IIF(sum(i_ETI_29_DAY)>600,sum(i_ETI_DUR),0)),0)
But it is not working. Please help.
Thanks a lot!
A rolling sum is just the sum of some amount over a fixed duration of time. For example, every day you can calculate the sum of expenses for the last 30 days.
I guess you can use an aggregator to calculate ETI_DUR, ETI_30_DAY and ETI_29_DAY. After that, in an expression you can implement the logic for ETI_30_DAY_OVRG. Note that you cannot write an IIF expression like that in an aggregator. Output ports must use an aggregate function.
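Once the sums are available, the ETI_30_DAY_OVRG rule itself is plain branching logic. A rough Python sketch of one reading of the client's CASE expression follows; it assumes that the 30-day rolling sum equals the sum of the previous 29 days plus today's ETI_DUR, which the requirement does not state explicitly. The function and parameter names are made up for this illustration.

```python
def eti_30_day_overage(prior_29_day_sum, eti_dur, limit=600):
    """One reading of the client's CASE expression (names are hypothetical).

    prior_29_day_sum: rolling sum over the previous 29 days
    eti_dur: today's ETI duration
    Assumption: 30-day rolling sum = prior_29_day_sum + eti_dur
    """
    if eti_dur <= 0:
        return 0
    if 0 <= prior_29_day_sum <= limit and prior_29_day_sum + eti_dur > limit:
        # Today's duration pushes the total over the limit: report only the excess
        return prior_29_day_sum + eti_dur - limit
    if prior_29_day_sum > limit:
        # Already over the limit: all of today's duration counts as overage
        return eti_dur
    return 0


print(eti_30_day_overage(500, 200))
print(eti_30_day_overage(700, 50))
```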
Here is a rolling sum example:
count, rolling_sum
1,1
2,3
5,8
1,9
1,10
Basically, each output is the sum of the current value and all the values listed before it. To implement it in Informatica, use 'local variables' (variable ports in an Expression transformation) as follows:
input port: count
variable port: v_sum_count = v_sum_count + count
output port: rolling_sum = v_sum_count
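The variable-port pattern above is a running total: the variable keeps its value from one row to the next. The same behaviour can be sketched in Python on the example column, purely as an illustration of the arithmetic.

```python
from itertools import accumulate

# The "count" column from the example
counts = [1, 2, 5, 1, 1]

# Each output is the previous total plus the current value,
# like v_sum_count = v_sum_count + count
rolling_sum = list(accumulate(counts))

print(rolling_sum)
```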
There is a moving sum function among the numeric functions of the Expression transformation:
MOVINGSUM(n as numeric, i as integer, [where as expression]).
Please check whether it helps.
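A moving sum differs from a running total in that it covers only the last i rows. A minimal Python sketch of that idea follows; it is not a reimplementation of MOVINGSUM, the function name moving_sum is hypothetical, and returning None until the window is full is a choice made for this sketch.

```python
from collections import deque


def moving_sum(values, window):
    """Sum of the most recent `window` values for each row.
    Rows that do not yet have a full window get None (a choice in this sketch)."""
    recent = deque(maxlen=window)  # old values drop out automatically
    out = []
    for v in values:
        recent.append(v)
        out.append(sum(recent) if len(recent) == window else None)
    return out


window_sums = moving_sum([1, 2, 5, 1, 1], 3)
print(window_sums)
```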