I'm trying to measure the completeness of survey results that were exported to Excel format; I'm working on them in Google Sheets. Like most surveys, mine has questions and conditional sub-questions. For example, "4. How are you today?" has three answer options (Good, Bad, Prefer not to say), and choosing "Good" triggers the sub-question "Why?". My survey has 8 questions, and questions 4, 7 and 8 each have a conditional sub-question that appears if someone answers "Yes".
To calculate the percentage of completeness I used this ratio: number of inputs in the survey / number of expected answers. As mentioned above, the conditionals affect the number of expected answers: it is dynamic and depends on the answers to questions 4, 7 and 8. I would like to obtain this number for every respondent. Each respondent gets an ID, so 20 people taking the survey means 20 IDs, and for every record the expected number changes depending on the inputs to questions 4, 7 and 8.
I have prepared a document in Google Sheets with the full approach that I tried, but it is still hard to get right. I would appreciate some help with this.
Link to the spreadsheet
Here is an image about it:
If filling in any part of a MAIN or SUB question should make the whole question count as 1, use:
QUERY(TRANSPOSE(B2:F12),,9^9)
If every part of a MAIN/SUB question should count as 1 on its own, use a regular range:
AS2:BB12
Change these ranges if I got them wrong:
=ARRAYFORMULA(IF(""={TRANSPOSE(TRIM({
QUERY(TRANSPOSE(B2:F12),,9^9); QUERY(TRANSPOSE(G2:X12),,9^9); QUERY(TRANSPOSE(Y2:AH12),,9^9);
QUERY(TRANSPOSE(AI2:AR12),,9^9); QUERY(TRANSPOSE(BC2:BD12),,9^9); QUERY(TRANSPOSE(BG2:BH12),,9^9);
QUERY(TRANSPOSE(BI2:BR12),,9^9); QUERY(TRANSPOSE(BS2:BU12),,9^9); QUERY(TRANSPOSE(BV2:BX12),,9^9);
QUERY(TRANSPOSE(BY2:CA12),,9^9); QUERY(TRANSPOSE(CB2:CD12),,9^9); QUERY(TRANSPOSE(CE2:CG12),,9^9);
QUERY(TRANSPOSE(CH2:CJ12),,9^9); QUERY(TRANSPOSE(CK2:CM12),,9^9); QUERY(TRANSPOSE(CN2:CP12),,9^9);
QUERY(TRANSPOSE(CQ2:CS12),,9^9); QUERY(TRANSPOSE(CT2:CV12),,9^9)})), AS2:BB12, BE2:BF12}, 0, 1))
To sum this up per row: there are 17 queries (one column each) and 2 regular ranges with 12 columns in total, i.e. 17 + 12 = 29 columns, so multiply by a column of 29 ones:
=ARRAYFORMULA(MMULT(IF(""={TRANSPOSE(TRIM({
QUERY(TRANSPOSE(B2:F12),,9^9); QUERY(TRANSPOSE(G2:X12),,9^9); QUERY(TRANSPOSE(Y2:AH12),,9^9);
QUERY(TRANSPOSE(AI2:AR12),,9^9); QUERY(TRANSPOSE(BC2:BD12),,9^9); QUERY(TRANSPOSE(BG2:BH12),,9^9);
QUERY(TRANSPOSE(BI2:BR12),,9^9); QUERY(TRANSPOSE(BS2:BU12),,9^9); QUERY(TRANSPOSE(BV2:BX12),,9^9);
QUERY(TRANSPOSE(BY2:CA12),,9^9); QUERY(TRANSPOSE(CB2:CD12),,9^9); QUERY(TRANSPOSE(CE2:CG12),,9^9);
QUERY(TRANSPOSE(CH2:CJ12),,9^9); QUERY(TRANSPOSE(CK2:CM12),,9^9); QUERY(TRANSPOSE(CN2:CP12),,9^9);
QUERY(TRANSPOSE(CQ2:CS12),,9^9); QUERY(TRANSPOSE(CT2:CV12),,9^9)})), AS2:BB12, BE2:BF12}, 0, 1),
SEQUENCE(29, 1, 1, 0)))
Now, to skip a SUB question if it is empty, we can do:
=ARRAYFORMULA(IF(""=TRANSPOSE(TRIM({
QUERY(TRANSPOSE(AS2:BB12),,9^9)})), 1, 0))
and then:
Again, if you have more to skip, add them like:
So the last step is to get the "% completeness":
=ARRAYFORMULA(MMULT(IF(""={TRANSPOSE(TRIM({
QUERY(TRANSPOSE(B2:F12),,9^9); QUERY(TRANSPOSE(G2:X12),,9^9); QUERY(TRANSPOSE(Y2:AH12),,9^9);
QUERY(TRANSPOSE(AI2:AR12),,9^9); QUERY(TRANSPOSE(BC2:BD12),,9^9); QUERY(TRANSPOSE(BG2:BH12),,9^9);
QUERY(TRANSPOSE(BI2:BR12),,9^9); QUERY(TRANSPOSE(BS2:BU12),,9^9); QUERY(TRANSPOSE(BV2:BX12),,9^9);
QUERY(TRANSPOSE(BY2:CA12),,9^9); QUERY(TRANSPOSE(CB2:CD12),,9^9); QUERY(TRANSPOSE(CE2:CG12),,9^9);
QUERY(TRANSPOSE(CH2:CJ12),,9^9); QUERY(TRANSPOSE(CK2:CM12),,9^9); QUERY(TRANSPOSE(CN2:CP12),,9^9);
QUERY(TRANSPOSE(CQ2:CS12),,9^9); QUERY(TRANSPOSE(CT2:CV12),,9^9)})), AS2:BB12, BE2:BF12}, 0, 1),
SEQUENCE(29, 1, 1, 0))/(29-IF(""=TRANSPOSE(TRIM({QUERY(TRANSPOSE(AS2:BB12),,9^9)})), 1, 0)))
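The same calculation can be sketched outside Sheets, which may make the logic easier to check by hand. The Python below is only an illustration and does not mirror the spreadsheet's real layout: the `completeness` function, the group names and the sample record are hypothetical. Each question group counts as 1 if any of its cells is filled, and a conditional sub-question that was left empty is assumed to have been skipped and is removed from the denominator.

```python
def completeness(groups, conditional):
    """Return the fraction of expected question groups that were answered.

    groups: dict mapping a group name to the list of its cell values.
    conditional: names of sub-question groups that are only expected
                 when shown (assumption: an empty one was skipped).
    """
    def answered(cells):
        # A group counts as answered if any cell holds non-blank text.
        return any(str(c).strip() for c in cells if c is not None)

    n_answered = sum(answered(cells) for cells in groups.values())
    # Expected answers: all groups minus the empty conditional ones.
    n_skipped = sum(not answered(groups[name]) for name in conditional)
    expected = len(groups) - n_skipped
    return n_answered / expected if expected else 0.0


# Hypothetical record: four groups, one conditional sub-question left empty.
record = {
    "q1": ["Good", ""],
    "q2": ["", ""],
    "q4": ["No"],
    "q4_sub": ["", ""],
}
print(completeness(record, conditional=["q4_sub"]))
```

Here two of the three expected groups are answered, so the result is two thirds rather than two quarters.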
Related
I'm trying to count with a condition. Some context first: I have a survey with Yes/No questions, and in some cases answering "Yes" reveals a sub-question such as "If yes, please tell us why."
I'm counting the total number of questions asked, but this number is dynamic: it depends on whether a given user answers Yes or No. If the user with ID 1 answers Yes, the sub-question counts as one more question. So if my survey has 7 questions and one of them is conditional, the total number of questions asked for ID 1 is 8.
But if ID 2 answers No, the total stays at 7.
I have an example with some questions and answers in this image:
I'm using this formula to count: =IF(B3="Yes", 1, ""), but it ignores the No answers. I would like something like: if Yes, count the question and its sub-question; if not, count just the question.
This may be hard to follow, so let me know if anything is unclear.
Any approach would help, whether it counts per respondent or in total.
try:
={"Total"; INDEX(IF(A3:A10="",,
MMULT(IF({B3:B10, C3:C10&D3:D10, E3:E10}="", 0, 1),
SEQUENCE(COLUMNS({B3:B10, C3:C10&D3:D10, E3:E10}), 1, 1, 0))))}
update:
={"Total"; INDEX(IF(A3:A10="",,
MMULT({IF(C3:C10&D3:D10="", 0, 2), SEQUENCE(ROWS(A3:A10), 3, 1, 0)},
SEQUENCE(COLUMNS(C3:C10&D3:D10)+3, 1, 1, 0))))}
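To check the counting rule by hand, here is a minimal Python sketch of the logic described in the question: every respondent is asked a fixed number of main questions, plus one sub-question for each conditional question answered "Yes". The function name, the question keys and the sample data are assumptions made for illustration; they are not taken from the spreadsheet.

```python
def questions_asked(answers, conditional, n_main):
    """Count the questions shown to one respondent.

    answers: dict of question name -> answer text.
    conditional: names of questions that open a sub-question on "Yes".
    n_main: number of main questions in the survey.
    """
    extra = sum(1 for q in conditional if answers.get(q) == "Yes")
    return n_main + extra


# Hypothetical survey with 7 main questions, one of them conditional.
id_1 = {"q3": "Yes"}
id_2 = {"q3": "No"}
print(questions_asked(id_1, ["q3"], n_main=7))  # 8
print(questions_asked(id_2, ["q3"], n_main=7))  # 7
```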
In my code
from PIL import Image
import pytesseract
print(pytesseract.image_to_string(Image.open('test.png')))
The results I get from here (just from the question and answers) are:
Which team surrendered
the biggest lead in Super
Bowl history?
Atlanta Falcons
Denver Broncos
Buffalo Bills
Is there any way to say that lines 1, 2, and 3 are the question, then line 5 is answer 1, etc.?
Depending on how your data differs between images, this should work, provided you always have the '?' to split on.
image_text=pytesseract.image_to_string(Image.open('test.png'))
text_list=image_text.split('?')
This gives you a list with two elements (assuming the text contains a single '?'): everything before the ? and everything after it. For example:
print(text_list)
['Which team surrendered\nthe biggest lead in Super\nBowl history',
'\n\nAtlanta Falcons\n\nDenver Broncos\n\nBuffalo Bills']
From here you can define q and a as the question and the answers.
q = text_list[0] + '?'  # split() drops the delimiter, so add the '?' back
a = [line for line in text_list[1].split('\n') if line]  # keep non-blank lines only
The logic above keeps the newlines in the question, leaving it formatted as:
Which team surrendered
the biggest lead in Super
Bowl history?
The variable a then holds the list of answers with no blank entries, so print(a) returns:
['Atlanta Falcons', 'Denver Broncos', 'Buffalo Bills']
Keep in mind, this fix is dependent on the text having a ? in it to define which half of the string is the question vs which is the answer.
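The steps above can be wrapped in one small function. This is a sketch rather than the only way to do it: it swaps `split('?')` for `str.partition`, so that a later '?' inside an answer would not produce extra list elements, and it joins the wrapped question lines into one line. The function name is made up for illustration, and the sample string is the OCR output from the question, so the snippet runs without pytesseract installed.

```python
def split_question_answers(ocr_text):
    """Split OCR output into (question, answers) at the first '?'."""
    question, sep, rest = ocr_text.partition("?")
    if not sep:
        raise ValueError("no '?' found, cannot tell question from answers")
    # Join the wrapped question lines and put the '?' back on the end.
    question = " ".join(question.split()) + "?"
    # Every non-blank line after the question is treated as one answer.
    answers = [line.strip() for line in rest.splitlines() if line.strip()]
    return question, answers


sample = ("Which team surrendered\nthe biggest lead in Super\nBowl history?"
          "\n\nAtlanta Falcons\n\nDenver Broncos\n\nBuffalo Bills")
question, answers = split_question_answers(sample)
print(question)
print(answers)
```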
In my IF-function the “otherwise” argument should perform the subtraction “6 - value”. It works fine for cells containing numbers, but unfortunately it is also applied to blank cells. This results in a lot of cells with 6 (6 - 0 = 6) instead of empty cells.
In detail:
I want to import and select data collected in an online questionnaire.
I import my extract of the raw data in sheet “Sample” with the following formula:
=IF(LOOKUP(D$1,'Analysis'!$A$2:$A,'Analysis'!$G$2:$G)="No",FILTER(FILTER(Import!$A$2:$CV,Import!$A$1:$CV$1=D$1),Import!$A$2:$A=0),ARRAYFORMULA(6-FILTER(FILTER(Import!$A$2:$CV,Import!$A$1:$CV$1=D$1),Import!$A$2:$A=0)))
That is: if the question does not have to be reversed (“No”), import the values as they are; otherwise (the question has to be reversed, “Yes”) compute 6 - value.
Sheets in Google Spreadsheets:
“Import”: This sheet contains the raw data. For each person that participated in the study, there is a row with the corresponding answers (that is 1, 2, 3, 4 or 5 according to the rating scale in the questionnaire). Because not every person in the list started or completed the questionnaire, there are blank cells where no answers were registered and blank cells at the end of the sheet.
“Sample”: This sheet should contain an extract of the raw data for further analysis. It’s the sheet where the IF-formula is applied.
“Analysis”: This sheet contains information about the questions, e.g. whether the answers to a question have to be reversed (reversed rating scale: 1 -> 5, 2 -> 4, 3 stays 3 and so on).
Coordinates:
Sheet “Sample”: Cell D$1, E$1, F$1 and so on contain the names of the questions (e.g. question_1).
Sheet “Analysis”: A2 to A contain the names of the questions.
Sheet “Analysis”: G2 to G contain the information if the answers of the questions have to be reversed. If the answers have to be reversed (“Yes”), the raw data needs to be adjusted with “6-” (6-5 = 1, 6-4 = 2, 6-3 = 3 and so on).
Sheet “Import”: A2 to A contains if there are any missing values. Zero means there are no missing values. Only data rows with no missing values should be imported.
Problem:
The formula works fine and displays the answers and reversed answers for the questions of interest. BUT at the end of the sheet “Sample” the columns continue with 6, 6, 6, 6, 6, 6, 6… (only for reversed questions); for not reversed questions the cells after the last valid import are blank.
Attempts to fix it:
I tried different variations of nested if-functions that unfortunately don’t have any effect, e.g.:
=IF(ISBLANK(Import!E2:I8)," ",IF(LOOKUP(D$1,Analysis!$A$2:$A,Analysis!$G$2:$G)="No",FILTER(FILTER(Import!$A$2:$CV,Import!$A$1:$CV$1=D$1),Import!$A$2:$A=0),ARRAYFORMULA(6-FILTER(FILTER(Import!$A$2:$CV,Import!$A$1:$CV$1=D$1),Import!$A$2:$A=0))))
or:
=IF(LOOKUP(D$1,Analysis!$A$2:$A,Analysis!$G$2:$G)="No",FILTER(FILTER(Import!$A$2:$CV,Import!$A$1:$CV$1=D$1),Import!$A$2:$A=0),IF(Import!E2:E=" "," ",ARRAYFORMULA(6-FILTER(FILTER(Import!$A$2:$CV,Import!$A$1:$CV$1=D$1),Import!$A$2:$A=0))))
Alternatively, I could delete the cells with 6, 6, 6,… but that would be very time-consuming for all questionnaires.
Thanks for your help!
The following is the simple pattern:
=IF(ISBLANK(A1),,6-A1)
If A1 is blank, this returns a blank; otherwise it returns the result of 6-A1.
To apply the above to an open-ended reference, nest the above pattern inside FILTER in the following way:
=FILTER(IF(NOT(ISBLANK(A:A)),6-A:A,),LEN(A:A))
Replace A:A with a single column of the imported data, or with a formula that returns a column of values.
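For comparison, the blank-preserving pattern looks like this outside Sheets. It is a minimal Python sketch under the assumption that blanks arrive as `None` or empty strings; the function name and the `scale_max` parameter are illustrative and not part of the spreadsheet.

```python
def reverse_score(values, scale_max=5):
    """Reverse a rating column, leaving blanks blank."""
    # (scale_max + 1) - value maps 1 -> 5, 2 -> 4, ... on a 1..5 scale.
    return [None if v is None or v == "" else (scale_max + 1) - v
            for v in values]


print(reverse_score([5, 4, 3, None, "", 1]))
# [1, 2, 3, None, None, 5]
```

The point is the same as in the formula: test for blank first, and only then subtract.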
I am working in SharePoint 2013. I am trying to evaluate the work of my staff; I will be using a list to track this and want to assign a grade to the results. If a person gets 4 out of 5, I want them to receive 80%.
The way I'm doing this is with a list that contains 5 questions, each with a drop-down menu whose responses are Yes, No and N/A. A second set of 5 columns converts the drop-down selection to 1, 0 or an empty record: Yes=1, No=0, N/A=Empty. This is how I'm converting the responses to numbers: =IF([Question1]="Yes",1,IF([Question1]="N/A"," ",0))
So a response of (Yes, Yes, No, Yes, Yes) should convert to (1, 1, 0, 1, 1) which should = 4 out of 5. (80%)
The problem is how to calculate the grade so that an empty record doesn't factor into the calculation.
So a response of (Yes, Yes, No, Yes, N/A) should convert to (1, 1, 0, 1) which should = 3 out of 4. (75%)
I've got the conversion of the responses to numbers down pat, just can't get the calculation of the grade to work.
A True value counts as 1 and a False value as 0.
So you can count the "Yes" strings (true) and the strings that are not "N/A" (the number of Yes/No answers).
There is no need for any intermediate calculations; it can all go in one column:
=INT(AND(A1="Yes")+AND(A2="Yes")+AND(A3="Yes"))
/
INT(NOT(A1="N/A")+NOT(A2="N/A")+NOT(A3="N/A"))
*100
For less typing, you can leave out the AND and just write (A1="Yes"), which returns True. You cannot leave out the NOT, because we need NOT(false) = true to count every column that is not N/A.
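The same counting idea, written as a short Python sketch so the two examples from the question can be checked. The function name is hypothetical, and returning `None` when every response is N/A is an assumption, since the question does not say what should happen in that case.

```python
def grade(responses):
    """Percentage of "Yes" among the responses that are not "N/A"."""
    yes = sum(r == "Yes" for r in responses)      # True counts as 1
    counted = sum(r != "N/A" for r in responses)  # Yes/No answers only
    # Assumption: with nothing to count, there is no grade to report.
    return None if counted == 0 else 100 * yes / counted


print(grade(["Yes", "Yes", "No", "Yes", "Yes"]))  # 80.0
print(grade(["Yes", "Yes", "No", "Yes", "N/A"]))  # 75.0
```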
I have an unbalanced panel data set (countries and years). For simplicity, let's say I have one variable, x, that I am measuring. The panel data is sorted first by country (a 3-digit numeric country code) and then by year. I would like to write a .do file that generates a new variable, z_x, containing the standardized values of x. Each value should be standardized by subtracting the mean of the preceding (exclusive) m time periods and then dividing by the standard deviation of those same time periods. If this is not possible, the result should be a missing value.
Currently, the code I am using to accomplish this is the following (edited now for clarity)
xtset weocountrycode year
sort weocountrycode year
local win_len = 5 // Defining rolling window length.
quietly: rolling sd_x=r(sd) mean_x=r(mean), window(`win_len') saving(stats_x, replace): sum x
use stats_x, clear
rename end year
save, replace
use all_data_PROCESSED_FINAL.dta, clear
quietly: merge 1:1 (weocountrycode year) using stats_x
replace sd_x = . if x[_n-`win_len'+1] == . | weocountrycode[_n-`win_len'+1] != weocountrycode[_n] // This and next line are for deleting values that rolling calculates when I actually want missing values.
replace mean_x = . if x[_n-`win_len'+1] == . | weocountrycode[_n-`win_len'+1] != weocountrycode[_n]
gen z_x = (x - mean_x[_n-1])/sd_x[_n-1] // calculate z-score
UPDATE:
My struggle with rolling is that when rolling is set up to use a window length 5 rolling mean, it automatically does window length 1,2,3,4 means for the first, second, third and fourth entries (when there are not 5 preceding entries available to average out). In fact, it does this in general - if the first non-missing value is on entry 5, it will do a length 1 rolling average on entry 5, length 2 rolling average on entry 6, ..... and then finally start doing length 5 moving averages on entry 9. My issue is that I do not want this, so I would like to avoid performing these calculations. Until now, I have only been able to figure out how to delete them after they are done, which is both inefficient and bothersome.
I tried adding an if clause to the -rolling- statement:
quietly: rolling sd_x=r(sd) mean_x=r(mean) if x[_n-`win_len'+1] != . & weocountrycode[_n-`win_len'+1] != weocountrycode[_n], window(`win_len') saving(stats_x, replace): sum x
But it did not fix the problem and the output is "weird" in the sense that
1) If `win_len' is equal to, say, 10, there are 15 missing values in the resulting z_x variable, instead of 9.
2) Even though there are "extra" missing values in z_x, the observations still start out as window length 1 means, then window length 2 means, etc. which makes no sense to me.
Which leads me to believe I fundamentally don't understand 1) what -rolling- is doing and 2) how an if clause works in the context of -rolling-.
Does this help?
Thanks!
I'm not sure I understand completely but I'll try to answer based on what I think your problem is, and based on a comment by @NickCox.
You say:
... when rolling is set up to use a window length 5 rolling mean...
if the first non-missing value is on entry 5, it will do a length 1 rolling average on entry 5, length 2 rolling average on entry 6, ...
This is expected. help rolling states:
The window size refers to calendar periods, not the number of observations. If there are missing data (for example, because of weekends), the actual number of observations used by command may be less than window(#).
It's not actually doing a "length 1 rolling average", but I'll get to that later.
Below are some examples to see what rolling does:
clear all
set more off
*-------------------------- example data -----------------------------
set obs 92
gen dat = _n - 1
format dat %tq
egen seq = fill(1 1 1 1 2 2 2 2)
tsset dat
tempfile main
save "`main'"
list in 1/12, separator(4)
*------------------- Example 1. None missing ------------------------
rolling mean=r(mean), window(4) stepsize(4) clear: summarize seq, detail
list in 1/12, separator(0)
*------- Example 2. All but one value, missing in first window ------
use "`main'", clear
replace seq = . in 1/3
list in 1/8
rolling mean=r(mean), window(4) stepsize(4) clear: summarize seq, detail
list in 1/12, separator(0)
*------------- Example 3. All missing in first window --------------
use "`main'", clear
replace seq = . in 1/4
list in 1/8
rolling mean=r(mean), window(4) stepsize(4) clear: summarize seq, detail
list in 1/12, separator(0)
Note I use the stepsize option to make things much easier to follow. Because the date variable is in quarters, I set window(4) and stepsize(4) so rolling is just computing averages by year. I hope that's easy to see.
Example 1 does as expected. No problem here.
Example 2 on the other hand, should be more interesting for you. We've said that what matters are calendar periods, so the mean is computed for the whole year (four quarters), even though it contains missings. There are three missings and one non-missing. summarize is computing the mean over the whole year, but summarize ignores missings, so it just outputs the mean of non-missings, which in this case is just one value.
Example 3 has missings for all four quarters of the year. Therefore, summarize outputs . (missing).
Your problem, as I understand it, is that when you face a situation like Example 2, you'd like the output to be missing. This is where I think Nick Cox's advice comes in. You could try something like:
rolling mean=r(mean) N=r(N), window(4) stepsize(4) clear: summarize seq, detail
replace mean = . if N != 4
list in 1/12, separator(0)
This says: if the number of non-missings for the window (r(N), also computed by summarize), is not the same as the window size, then replace it with missing.
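The rule "use the statistic only when the window is complete" can also be illustrated outside Stata. The Python sketch below applies it to the z-score from the question: each value is standardized against the preceding m values (exclusive), and the result is missing (`None`) unless all m of them are present. It assumes a single country's series with one entry per period and no gaps in the time index, and it uses the sample standard deviation. The function name and the data are made up for illustration; this is not a translation of -rolling-.

```python
from statistics import mean, stdev


def rolling_z(values, m):
    """z-score of each value against the m values immediately before it."""
    if m < 2:
        raise ValueError("need at least 2 periods for a standard deviation")
    out = []
    for i, x in enumerate(values):
        window = values[max(0, i - m):i]  # preceding periods, exclusive
        complete = len(window) == m and all(v is not None for v in window)
        if x is None or not complete:
            out.append(None)              # same idea as: replace ... if N != m
            continue
        sd = stdev(window)
        out.append((x - mean(window)) / sd if sd > 0 else None)
    return out


series = [1.0, 2.0, 3.0, 4.0, None, 6.0, 7.0, 8.0, 9.0]
print(rolling_z(series, m=3))
# [None, None, None, 2.0, None, None, None, None, 2.0]
```

A single missing value blanks out the next m results as well, which is the behaviour the question asks for and the opposite of summarize quietly averaging whatever is left in the window.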