Creating variables with pytesseract - python-2.7

In my code
from PIL import Image
import pytesseract
print(pytesseract.image_to_string(Image.open('test.png')))
The results I get from here (just from the question and answers) are:
Which team surrendered
the biggest lead in Super
Bowl history?
Atlanta Falcons
Denver Broncos
Buffalo Bills
Is there any way to say that lines 1, 2, and 3 are the question, then line 5 is answer 1, etc.?

Depending on how your data differs between images this should work. If you always have the '?' to split on.
image_text=pytesseract.image_to_string(Image.open('test.png'))
text_list=image_text.split('?')
This will give you a list with 2 elements. First being all before the ? and second after. Such as:
print(text_list)
['Which team surrendered\nthe biggest lead in Super\nBowl history',
'\n\nAtlanta Falcons\n\nDenver Broncos\n\nBuffalo Bills']
From here you can define q and a. As the question and answer.
q = text_list[0]
a = [a for a in text_list[1].split('\n') if a]
The logic above will keep the new lines for the question leaving it formatted as:
Which team surrendered
the biggest lead in Super
Bowl history?
Then variable a will be filled with a list of the answers without any blank lines in the list. So a print(a) would return:
['Atlanta Falcons', 'Denver Broncos', 'Buffalo Bills']
Keep in mind, this fix is dependent on the text having a ? in it to define which half of the string is the question vs which is the answer.

Related

Survey completness problems in Google Sheets

I'm trying to have information about the completness in some survey results exported in to a Excel Format I'm using Google Sheets, as every survey there is questions and subquestions the subquestions have a conditional, Example: 4. How are you today? multiple choise answers: Good, Bad, Prefer not to say, so there we have 3 answer options if we click good there is a conditional and the subquestion will be: Why?. So in my survey there is 8 questions and question 4, 7 and 8 has conditional questions if someone answer "Yes". Now here is my problem to calculate the percentage of completness I used this relation: number of inputs in the survey/number of expected answers, But as I mentioned before the conditional affects this expected answers this Variable is dynamic depending on the answers from question 4, 7 and 8. So I would like to obtain this Variable for every case, if someone put information will have an ID if we have 20 persons doing the survey we will have 20 ID's. So for every record of answers the number will change depending on the inputs from Question 4, 7 and 8. I have prepared a document in Google sheets will the full aproach that I tried but is still hard to have it right I would like to have some help with this.
Link to the spreesheet
Here is an image about it:
if any parts of MAIN or SUB question make a whole MAIN/SUB question count as 1 use:
QUERY(TRANSPOSE(B2:F12),,9^9)
if all parts of MAIN/SUB question count as 1 use regular range:
AS2:BB12
change those ranges if I got them wrong:
=ARRAYFORMULA(IF(""={TRANSPOSE(TRIM({
QUERY(TRANSPOSE(B2:F12),,9^9); QUERY(TRANSPOSE(G2:X12),,9^9); QUERY(TRANSPOSE(Y2:AH12),,9^9);
QUERY(TRANSPOSE(AI2:AR12),,9^9); QUERY(TRANSPOSE(BC2:BD12),,9^9); QUERY(TRANSPOSE(BG2:BH12),,9^9);
QUERY(TRANSPOSE(BI2:BR12),,9^9); QUERY(TRANSPOSE(BS2:BU12),,9^9); QUERY(TRANSPOSE(BV2:BX12),,9^9);
QUERY(TRANSPOSE(BY2:CA12),,9^9); QUERY(TRANSPOSE(CB2:CD12),,9^9); QUERY(TRANSPOSE(CE2:CG12),,9^9);
QUERY(TRANSPOSE(CH2:CJ12),,9^9); QUERY(TRANSPOSE(CK2:CM12),,9^9); QUERY(TRANSPOSE(CN2:CP12),,9^9);
QUERY(TRANSPOSE(CQ2:CS12),,9^9); QUERY(TRANSPOSE(CT2:CV12),,9^9)})), AS2:BB12, BE2:BF12}, 0, 1))
to sum this up there are 17 queries and 2 regular ranges with 12 columns eg 17+12 = 29:
=ARRAYFORMULA(MMULT(IF(""={TRANSPOSE(TRIM({
QUERY(TRANSPOSE(B2:F12),,9^9); QUERY(TRANSPOSE(G2:X12),,9^9); QUERY(TRANSPOSE(Y2:AH12),,9^9);
QUERY(TRANSPOSE(AI2:AR12),,9^9); QUERY(TRANSPOSE(BC2:BD12),,9^9); QUERY(TRANSPOSE(BG2:BH12),,9^9);
QUERY(TRANSPOSE(BI2:BR12),,9^9); QUERY(TRANSPOSE(BS2:BU12),,9^9); QUERY(TRANSPOSE(BV2:BX12),,9^9);
QUERY(TRANSPOSE(BY2:CA12),,9^9); QUERY(TRANSPOSE(CB2:CD12),,9^9); QUERY(TRANSPOSE(CE2:CG12),,9^9);
QUERY(TRANSPOSE(CH2:CJ12),,9^9); QUERY(TRANSPOSE(CK2:CM12),,9^9); QUERY(TRANSPOSE(CN2:CP12),,9^9);
QUERY(TRANSPOSE(CQ2:CS12),,9^9); QUERY(TRANSPOSE(CT2:CV12),,9^9)})), AS2:BB12, BE2:BF12}, 0, 1),
SEQUENCE(29, 1, 1, 0)))
now to skip SUB question if empty we can do:
=ARRAYFORMULA(IF(""=TRANSPOSE(TRIM({
QUERY(TRANSPOSE(AS2:BB12),,9^9)})), 1, 0))
and then:
again, if you got more to skip add it like:
so the last step is to get the "% completeness":
=ARRAYFORMULA(MMULT(IF(""={TRANSPOSE(TRIM({
QUERY(TRANSPOSE(B2:F12),,9^9); QUERY(TRANSPOSE(G2:X12),,9^9); QUERY(TRANSPOSE(Y2:AH12),,9^9);
QUERY(TRANSPOSE(AI2:AR12),,9^9); QUERY(TRANSPOSE(BC2:BD12),,9^9); QUERY(TRANSPOSE(BG2:BH12),,9^9);
QUERY(TRANSPOSE(BI2:BR12),,9^9); QUERY(TRANSPOSE(BS2:BU12),,9^9); QUERY(TRANSPOSE(BV2:BX12),,9^9);
QUERY(TRANSPOSE(BY2:CA12),,9^9); QUERY(TRANSPOSE(CB2:CD12),,9^9); QUERY(TRANSPOSE(CE2:CG12),,9^9);
QUERY(TRANSPOSE(CH2:CJ12),,9^9); QUERY(TRANSPOSE(CK2:CM12),,9^9); QUERY(TRANSPOSE(CN2:CP12),,9^9);
QUERY(TRANSPOSE(CQ2:CS12),,9^9); QUERY(TRANSPOSE(CT2:CV12),,9^9)})), AS2:BB12, BE2:BF12}, 0, 1),
SEQUENCE(29, 1, 1, 0))/(29-IF(""=TRANSPOSE(TRIM({QUERY(TRANSPOSE(AS2:BB12),,9^9)})), 1, 0)))

How to filter a Django queryset, after it is modified in a loop

I have recently been working with Django, and it has been confusing me a lot (although I also like it).
The problem I am facing right now is when I am looping, and in the loop modifying the queryset, in the next loop the .filter is not working.
So let's take the following simplified example:
I have a dictionary that is made from the queryset like this
animal_dict = {chicken: 6,
cows: 7,
fish: 1,
sheep: 2}
The queryset is called self.animals
for key in dict:
if dict[key] < 3:
remove_animal = max(dict, key=dict.get)
remove = self.animals.filter(animal = remove_animal)[-2:]
self.animals = self.animals.difference(remove)
key[replaced_industry] = key[replaced_industry] - 2
What I am trying to do is as follows: my goal is that there needs to be a balance under the animals. So since there are not enough fish, 2 of the animals with the highest n have to go (cows). And then in the second loop - since there are not enough sheep, 2 of the animals with the highest n have to go again (chicken).
Now the first time it loops (with fish), the .filter does exactly as it should. However, when I loop it a second time (for sheep), the remove = self.animals.filter(animal = remove_animal)[-2:] gives me an output is not in line with animal = filter. When I print the remove in the second loop, it returns a list of all different animals (instead of just 1).
After the loops, the dict should look like this: {chicken: 4,
cows: 5,
fish: 1,
sheep: 2}
This because first cow will go down 2 and it is the max, and then chicken will go down 2, as it is then the max
I am definitely missing some Django logic here, but to me this seems very strange. I hope the question is well-understood, else happy to clarify further.
As others have pointed out, every time you call self.animals.filter you are making a request to the database, and this should not be done in a loop.
It isn't very clear what you are trying to achieve, but it seems like you want the number of each type of animal to be (almost?) the same number, and that the only operation you can perform on it is reducing the number of animals.
Its always better to avoid loops if you can.
If you want them all to be the same number, and you can only reduce the number of animals you have, then the best solution would be
fewest_number = min(self.animals.values())
self.animals = {animal: fewest_number for animal in self.animals.keys()}
If you want, to say, allow a tolerance +1 of the fewest animal
fewest_number = min(self.animals.values()) + 1
self.animals = {animal: fewest_number for animal in self.animals.keys()}
If you can increase the number of each type of animal, then you could find the average:
average_number = sum(self.animals.values()) / len(self.animals)
self.animals = {animal: average_number for animal in self.animals.keys()}
Answering my own question here. Apparently it didn't work because only count(), order_by(), values(), values_list() and slicing of union queryset is allowed. You can't filter on union queryset and the same applies to .difference.
More about this here: Django: Filter a Queryset made of unions not working
Because in the second loop it is made into a union queryset, the filter function simply doesn't work. The weird thing is that this doesn't show an error, but will simply not filter, what makes it hard to detect.

“Otherwise” argument of my IF-function applies to blank cells, but should ignore them. What can I add to my formula to stop it?

In my IF-function the “otherwise” argument should conduct the subtraction “6 - value”. It works fine for cells containing numbers, but unfortunately also works fine with blank cells. This results in a lot of cells with 6 (6 - 0 = 6) instead of empty cells.
In detail:
I want to import and select data collected in an online questionnaire.
I import my extract of the raw data in sheet “Sample” with the following formula:
=IF(LOOKUP(D$1,'Analysis'!$A$2:$A,'Analysis'!$G$2:$G)="No",FILTER(FILTER(Import!$A$2:$CV,Import!$A$1:$CV$1=D$1),Import!$A$2:$A=0),ARRAYFORMULA(6-FILTER(FILTER(Import!$A$2:$CV,Import!$A$1:$CV$1=D$1),Import!$A$2:$A=0)))
= If the question has not to be reversed (“No”), then import the values as they are, otherwise (if the question has to be reversed, “Yes”) subtract 6 - value.
Sheets in Google Spreadsheets:
“Import”: This sheet contains the raw data. For each person that participated in the study, there is a row with the corresponding answers (that is 1, 2, 3, 4 or 5 according to the rating scale in the questionnaire). Because not every person in the list started or completed the questionnaire, there are blank cells where no answers were registered and blank cells at the end of the sheet.
“Sample”: This sheet should contain an extract of the raw data for further analysis. It’s the sheet where the IF-formula is applied.
“Analysis”: This sheet contains informations concerning the questions, e.g. if the answers of some questions have to be reversed (reversed rating scale: 1 -> 5, 2 -> 4, 3 stays 3 and so on).
Coordinates:
Sheet “Sample”: Cell D$1, E$1, F$1 and so on contain the names of the questions (e.g. question_1).
Sheet “Analysis”: A2 to A contain the names of the questions.
Sheet “Analysis”: G2 to G contain the information if the answers of the questions have to be reversed. If the answers have to be reversed (“Yes”), the raw data needs to be adjusted with “6-” (6-5 = 1, 6-4 = 2, 6-3 = 3 and so on).
Sheet “Import”: A2 to A contains if there are any missing values. Zero means there are no missing values. Only data rows with no missing values should be imported.
Problem:
The formula works fine and displays the answers and reversed answers for the questions of interest. BUT at the end of the sheet “Sample” the columns continue with 6, 6, 6, 6, 6, 6, 6… (only for reversed questions); for not reversed questions the cells after the last valid import are blank.
Attempts to fix it:
I tried different variations of nested if-functions that unfortunately don’t have any effect, e.g.:
=IF(ISBLANK(Import!E2:I8)," ",IF(LOOKUP(D$1,Analysis!$A$2:$A,Analysis!$G$2:$G)="No",FILTER(FILTER(Import!$A$2:$CV,Import!$A$1:$CV$1=D$1),Import!$A$2:$A=0),ARRAYFORMULA(6-FILTER(FILTER(Import!$A$2:$CV,Import!$A$1:$CV$1=D$1),Import!$A$2:$A=0))))
or:
=IF(LOOKUP(D$1,Analysis!$A$2:$A,Analysis!$G$2:$G)="No",FILTER(FILTER(Import!$A$2:$CV,Import!$A$1:$CV$1=D$1),Import!$A$2:$A=0),IF(Import!E2:E=" "," ",ARRAYFORMULA(6-FILTER(FILTER(Import!$A$2:$CV,Import!$A$1:$CV$1=D$1),Import!$A$2:$A=0))))
Alternatively, I could delete the cells with 6, 6, 6,… but that would be very time-consuming for all questionnaires.
Thanks for your help!
The following is the simple pattern
=IF(ISBLANK(A1),,6-A1)
This if A1 is blank, the will return a blank, otherwise, will return the result of 6-A1.
To apply the above to an open-ended reference, nest the above pattern inside FILTER in the following way:
=FILTER(IF(NOT(ISBLANK(A:A)),6-A:A,),LEN(A:A))
Replace A:A by a single column of the imported data, or a formula that returns a column of values.

Testing for an item in lists - Python 3

As part of a school project we are creating a trouble shooting program. I have come across a problem that I cannot solve:
begin=['physical','Physical','Software','software',]
answer=input()
if answer in begin[2:3]:
print("k")
software()
if answer in begin[0:1]:
print("hmm")
physical()
When I try to input software/Software no output is created. Can anybody see a hole in my code as it is?
In Python, slice end values are exclusive. You are slicing a smaller list than you think you are:
>>> begin=['physical','Physical','Software','software',]
>>> begin[2:3]
['Software']
>>> begin[0:1]
['physical']
Use begin[2:4] and begin[0:2] or even begin[2:] and begin[:2] to get all elements from the 3rd to the end, and from the start until the 2nd (inclusive):
>>> begin[2:]
['Software', 'software']
>>> begin[2:4]
['Software', 'software']
>>> begin[:2]
['physical', 'Physical']
>>> begin[0:2]
['physical', 'Physical']
Better yet, use str.lower() to limit the number of inputs you need to provide:
if answer.lower() == 'software':
With only one string to test, you can now put your functions in a dictionary; this gives you the option to list the various valid answers too:
options = {'software': software, 'physical': physical}
while True:
answer = input('Please enter one of the following options: {}\n'.format(
', '.join(options))
answer = answer.lower()
if answer in options:
options[answer]()
break
else:
print("Sorry, {} is not a valid option, try again".format(answer))
Your list slicing is wrong, Try the following script.
begin=['physical','Physical','Software','software',]
answer=input()
if answer in begin[2:4]:
print("k")
software()
if answer in begin[0:2]:
print("hmm")
physical()

Avoiding pandas chained selection

I'm trying to determine "best practice" to do the following without incurring a SettingWithCopyWarning. I'm using python 2.7 and pandas 15.2
What I want to do is subselect a dataframe and then use this selection as a new dataframe, without risking modification to the original. Here's an example of what I'm doing:
import pandas as pd
def select_blue_cars(df):
"""Returns a new dataframe of blue cars"""
return df[df['color'] == 'blue']
cars = pd.DataFrame({'color': ['blue', 'blue', 'red'], 'make': ['Ford', 'BMW', 'Ford']})
blue_cars = select_blue_cars(cars)
blue_cars['price'] = 10000
The above generates a SettingWithCopyWarning in current pandas but otherwise behaves as I want it to (ie. the cars df has not been modified).
What is the best way to implement select_blue_cars so that the subsequent code doesn't trigger this warning?
Should I be using .copy() everywhere?
return df[df['color'] == 'blue'].copy()
(Aside) What's the performance of copy() like?
Eventually I'd like to chain simple transform functions like select_blue_cars:
blue_fords = select_fords(select_blue_cars(cars))
Edit: Having thought about this a bit more I think that I'm looking for a single transform which selects a copy from the dataframe without explicitly calling .copy(). That way I can write functions to do little transformations on the df and chain them.
Transposition for example df.T gives a new dataframe. There's no need to call .copy().
df2 = df.T
df2 = df.T.copy() # no need
It looks like, in the case of selection, .copy() is required for this pattern.
How you get around the SettingWithCopyWarning depends a bit on how long you plan on keeping the subset around. If you just want to briefly look at the price within a particular colour and then return to the overall dataframe, the suggestions JohnE has given are pretty good. If you actually want to keep the subset around and perform a bunch of separate analyses on it, then what I usually do is subset with .loc and explicitly copy, e.g.:
subset = df.loc[df['condition'] > 5, :].copy()
In your code, this would be:
import pandas as pd
def select_blue_cars(df):
"""Returns a new dataframe of blue cars"""
return df.loc[df['color'] == 'blue', :].copy()
cars = pd.DataFrame({'color': ['blue', 'blue', 'red'], 'make': ['Ford', 'BMW', 'Ford']})
blue_cars = select_blue_cars(cars)
blue_cars['price'] = 10000
I think this remains one of the more confusing parts of pandas. You are actually asking 2 or 3 questions and the answers may be less simple than you'd think. Consequently, I'll make the simplifying assumption that you'll just keep everything in one dataset (if not, it's not that big a deal though), and give a simple answer.
What you want to do (in pseudocode):
price = 10000 if color == blue
The simplest way to do this is actually with numpy where():
cars['price'] = np.where( cars['color'] == 'blue', 10000, np.nan )
color make price
0 blue Ford 10000
1 blue BMW 10000
2 red Ford NaN
You can also nest where() so it's really very powerful and simple method for conditional setting like this. You can also use ix/loc/iloc (though you need to create an empty column for 'price' first):
cars.ix[ cars.color == 'blue', 'price' ] = 10000
And to briefly address the chained indexing warning, what it's mostly saying is don't try to do too much on the left hand side when setting values:
df[ df.y > 5 ]['x'] = df['z']
this is OK though:
df['x'] = df[ df.y > 5 ]['z']
Because the result of chained indexing may by a copy rather than reference, which will cause the former to fail but not the latter. You can also get around this by using ix/loc/iloc.