How to equate ['1'] with 1 in flask - flask

I have 2 variables
a=['1'] # This is an array
b=1 # This is a integer
I want (a==b) to be true.

In your array a=['1'] is a String not an integer so it won't become equal
a = [1]
b = 1
for i in a:
if i == b:
print('a==b')
else:
print('a!=b')

Related

IF calculation with aggregate functions

A B UA
1 0 Negative
1 1 Negative
1 1 Positive
2 5 Negative
2 2 Positive
I want to calculate %UA Negative such that when A = B then count all the Negatives in the UA column and divide that by total number of results where A = B. So for A=1 and B=1, %UA Negative = 1/2 = 0.5
I tried:
IF [A] = [B] THEN
SUM(IF[UA] = 'Negative' THEN 1 ElSE 0 END)/COUNT([UA]) END
but I'm getting the error:
Cannot mix aggregate and non-aggregate comparisons or results in 'IF'
expressions
You can either place the first IF statement inside your Sum and Count aggs or place ATTR around the first IF statement.
SUM(IF [A] = [B] THEN IF[UA] = 'Negative' THEN 1 ElSE 0 END END)/COUNT(IF [A] = [B] THEN [UA] end)
or
IF ATTR([A]) = ATTR([B]) THEN
SUM(IF[UA] = 'Negative' THEN 1 ElSE 0 END)/COUNT([UA]) END
The latter converts your first IF statement to an agg.

Grabbing columns with special characters and upper case letters

I have a data frame and I'm trying to loop through the data frame to identify those columns which contain a special character or which are all capital letters.
I have tried a few things but nothing where I'm apple to catch the column names within the loop.
data = data.frame(one=c(1,3,5,1,3,5,1,3,5,1,3,5), two=c(1,3,5,1,3,5,1,3,5,1,3,5),
thr=c("A","B","D","E","F","G","H","I","J","H","I","J"),
fou=c("A","B","D","A","B","D","A","B","D","A","B","D"),
fiv=c(1,3,5,1,3,5,1,3,5,1,3,5),
six=c("A","B","D","E","F","G","H","I","J","H","I","J"),
sev=c("A","B","D","A","B","D","A","B","D","A","B","D"),
eig=c("A","B","D","A","B","D","A","B","D","A","B","D"),
nin=c(1.24,3.52,5.33,1.44,3.11,5.33,1.55,3.66,5.33,1.32,3.54,5.77),
ten=c(1:12),
ele=rep(1,12),
twe=c(1,2,1,2,1,2,1,2,1,2,1,2),
thir=c("THiS","THAT34","T(&*(", "!!!","#$#","$Q%J","who","THIS","this","this","this","this"),
stringsAsFactors = FALSE)
data
colls <- c()
spec=c("$","%","&")
for( col in names(data) ) {
if( length(strings[stringr::str_detect(data[,col], spec)]) >= 1 ){
print("HORRAY")
colls <- c(collls, col)
}
else print ("NOOOOOOOOOO")
}
for( col in names(data) ) {
if( any(data[,col]) %in% spec ){
print("HORRAY")
colls <- c(collls, col)
}
else print ("NOOOOOOOOOO")
}
Can anyone shed light on a good way to tackle this problem.
EDIT:
The end goal is to have a vector with a name of column names which meet that criteria. Sorry for my poor SO question, but hopefully this will help with what I'm trying to do
I would use grep() to search for the pattern you are interested in. See here.
[:upper:] Matches any upper case letters.
Combining it with anchors (^,$) and match one or more times (+) gives ^[[:upper:]]+$ and should only match entries completely in capitals.
The following would match the special characters in your toy data set (but is not guaranteed to match all special characters in your real data set i.e form feeds, carriage returns)
[:punct:] #Matches punctuation - ! " # $ % & ' ( ) * + , - . / : ; < = > ? # [ \ ] ^ _ ` { | } ~.
Note that rather than use [:punct:] you could define your special characters manually.
We can try the resultant code on the first row of your data set:
#Using grepl() rather than grep() so that we return a list of logical values.
grepl(x= data[1,], pattern = "^[[:upper:]]+$|[[:punct:]]")
[1] FALSE FALSE TRUE TRUE FALSE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE
This gives us our expected response except for column nine which has the value 1.24. Here the decimal point is being recognised as punctuation and is being flagged as a match.
We can add a "negative lookahead assertion" - (?!\\.) - to remove any periods from consideration, before they are even tested for being punctuation characters. Note we use \ to escape the period.
grepl(x= data[1,], perl = TRUE, pattern = "(?!\\.)(^[[:upper:]]+$|[[:punct:]])")
[1] FALSE FALSE TRUE TRUE FALSE TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE
This returns a better response - it now no longer matches decimal places. NOTE: This might not be what you want as this pattern also won't match any fullstops in character fields. You would need to refine the pattern further.
Rather than use a 'for loop' to reiterate this code across every row in your dataframe I would use vectorization instead which is 'more R like'.
To do this we must convert our script into a function which we will call with apply()
myFunction <- function(x){
matches <- grepl(x= x, perl = TRUE, pattern = "(?!\\.)(^[[:upper:]]+$|[[:punct:]])")
#Given a set of logical vectors 'matches', is at least one of the values true? using any()
return(any(matches))
}
apply(X = data, 1, myFunction)
The 1 above instructs apply() to reiterate across rows rather than columns.
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
In your example data set all rows have an entry containing a special character or a string of all capital letters. This is unsurprising as many columns in your example data set are a list of single capital letters.
If you are just interested in which values in column thirteen fit the stated criteria you can use:
matches <- grepl(x= data$thir, perl = TRUE, pattern = "(?!\\.)(^[[:upper:]]+$|[[:punct:]])")
matches
[1] FALSE FALSE TRUE TRUE TRUE TRUE FALSE TRUE FALSE FALSE FALSE FALSE
To subset your dataframe on matching rows:
data[matches,]
one two thr fou fiv six sev eig nin ten ele twe thir
3 5 5 D D 5 D D D 5.33 3 1 1 T(&*(
4 1 1 E A 1 E A A 1.44 4 1 2 !!!
5 3 3 F B 3 F B B 3.11 5 1 1 #$#
6 5 5 G D 5 G D D 5.33 6 1 2 $Q%J
8 3 3 I B 3 I B B 3.66 8 1 2 THIS
To subset your dataframe on non-matching rows:
data[!matches,]
one two thr fou fiv six sev eig nin ten ele twe thir
1 1 1 A A 1 A A A 1.24 1 1 1 THiS
2 3 3 B B 3 B B B 3.52 2 1 2 THAT34
7 1 1 H A 1 H A A 1.55 7 1 1 who
9 5 5 J D 5 J D D 5.33 9 1 1 this
10 1 1 H A 1 H A A 1.32 10 1 2 this
11 3 3 I B 3 I B B 3.54 11 1 1 this
12 5 5 J D 5 J D D 5.77 12 1 2 this
Note that the regular expression used doesn't match THAT34 as it isn't composed wholly of capitalised letters, having the number 34 at the end.
EDIT:
To get a list of column names identifying columns that fulfill the criteria in your edit use myFunction described above with:
colnames(data)[apply(X = data, 2, myFunction)]
"thr" "fou" "six" "sev" "eig" "thir"
The number in apply() changes from 1 to 2 to reiterate across columns rather than rows. We pass the output from apply(), a list of logical matches (TRUE or FALSE), to colnames(data) - this returns the matching column names via subsetting.
I would collapse the data into strings (one string per row)
strings = apply(data, 1, paste, collapse = "")
contains_only_caps = strings == toupper(strings)
strings[contains_only_caps]
# [1] "33BB3BBB3.52 212THAT34" "55DD5DDD5.33 311T(&*(" "11EA1EAA1.44 412!!!" "33FB3FBB3.11 511#$#"
# [5] "55GD5GDD5.33 612$Q%J" "33IB3IBB3.66 812THIS"
# escaping special characters
spec=c("\\$","%","\\&")
contains_spec = stringr::str_detect(strings, pattern = paste(spec, collapse = "|"))
strings[contains_spec]
# [1] "55DD5DDD5.33 311T(&*(" "33FB3FBB3.11 511#$#" "55GD5GDD5.33 612$Q%J"
You could also use which on contains_spec or contains_only_caps to get the corresponding row numbers for the original data frame. I think that using strings rather than row-wise data frame elements will by much faster - as long as you want to search the whole strings, not certain columns for certain conditions.

How to compare two columns and create new dataframe

My table:
A B C D E F G H I
0 0.292090 0.806958 0.255845 0.855154 0.590744 0.937458 0.192190 0.548974 0.703214
1 0.094978 NaN NaN NAN 0.350109 0.635469 0.025525 0.108062 0.510891
2 0.918005 0.568802 0.041519 NaN NaN 0.882552 0.086663 0.908168 0.221058
3 0.882920 0.230281 0.172843 0.948232 0.560853 NaN NaN 0.664388 0.393678
4 0.086579 0.819807 0.712273 0.769890 0.448730 0.853134 0.508932 0.630004 0.579961
Output:
A B&C D&E F&G H&I
0.292090 Present Present Present Present
0.094978 Not There Not There Present Present
0.918005 Present Not There Present Present
0.882920 Present Present Not There Present
0.086579 Present Present Present Present
If both B and C is not there then show not there else present
If anyone D and E is not there then show not there else present
If anyone F and G is not equal to 0 present else not there
If H and I sum is greater than 2, then show not there else present
I want to write if functions or lambda whatever is fast in pandas and I want to generate a new dataframe as I have given an output. But I am not able to understand how should I write these following statements in pandas.
if (B & C):
df.at[0, 'B&C'] = 'Present'
elif
df.at[0, 'B&C'] = 'Not there'
if (D | E):
df.at[0, 'D&E'] = 'Present'
elif
df.at[0, 'D&E'] = 'Not there'
So is there anyway in pandas with that I can complete my newset of dataframe.
You can use isnull to determine which entries are NaN:
In [3]: df
Out[3]:
A B C D E
0 -0.600684 -0.112947 -0.081186 -0.012543 1.951430
1 -1.198891 NaN NaN NaN 1.196819
2 -0.342050 0.971968 -1.097107 NaN NaN
3 -0.908169 0.095141 -1.029277 1.533454 0.171399
In [4]: df.B.isnull()
Out[4]:
0 False
1 True
2 False
3 False
Name: B, dtype: bool
Use the & and | operators to combine two boolean Series, and use the where function from numpy to select between two values based on a Series of booleans. It returns a numpy array, but you can assign it to a column of a DataFrame:
In [5]: df['B&C'] = np.where(df.B.isnull() & df.C.isnull(), 'Not There', 'Present')
In [6]: df['B&C']
Out[6]:
0 Present
1 Not There
2 Present
3 Present
Name: B&C, dtype: object
Here you need to take two columns and need to check the corresponding entries having "NAN" or not. In Pandas, there is a one stop solution to all kind of indexing and selecting from a data frame.
http://pandas.pydata.org/pandas-docs/stable/indexing.html
I have explained this using iloc but you can do it in many other ways. I have not run the below code, but I hope the logic will be clear.
def tmp(col1,col2):
df = data[[col1,col2]]
for i in range(df.shape[0]):
if(df.iloc[i,0] == np.nan or df.iloc[i,1] == np.nan):
df.iloc[i,2]="Not Present"
else:
df.iloc[i,2]="Present"

How should I check more than 10 columns for nan values and select those rows having nan values, ie keepna() instead of dropna()

Output = df[df['TELF1'].isnull() | df['STCEG'].isnull() | df['STCE1'].isnull()]
This is my code I am checking here if a column contains nan value than only select that row. But here I have over 10 columns to do that. This will make my code huge. Is there any short or more pythonic way to do it.
df.dropna(subset=['STRAS','ORT01','LAND1','PSTLZ','STCD1','STCD2','STCEG','TELF1','BANKS','BANKL','BANKN','E-MailAddress'])
Is there any way to get the opposite of the above command.It will give me the same output what I was trying above but it was getting very long.
Using loc with a simple boolean filter should work:
df = pd.DataFrame(np.random.random((5,4)), columns=list('ABCD'))
subset = ['C', 'D']
df.at[0, 'C'] = None
df.at[4, 'D'] = None
>>> df
A B C D
0 0.985707 0.806581 NaN 0.373860
1 0.232316 0.321614 0.606824 0.439349
2 0.956236 0.169002 0.989045 0.118812
3 0.329509 0.644687 0.034827 0.637731
4 0.980271 0.001098 0.918052 NaN
>>> df.loc[df[subset].isnull().any(axis=1), :]
A B C D
0 0.985707 0.806581 NaN 0.37386
4 0.980271 0.001098 0.918052 NaN
df[subset].isnull() returns boolean values of whether or not any of the subset columns have a NaN.
>>> df[subset].isnull()
C D
0 True False
1 False False
2 False False
3 False False
4 False True
.any(axis=1) will return True if any value in the row (because axis=1, otherwise the column) is True.
>>> df[subset].isnull().any(axis=1)
0 True
1 False
2 False
3 False
4 True
dtype: bool
Finally, use loc (rows, columns) to locate rows that satisfy a boolean condition. The : symbol means to select everything, so it selects all columns for rows 0 and 4.

Extracting columns with a difference in aligned data

I have some aligned data (something bioinformatic related) as so:
reference_string = 'yearning'
string2 = 'learning'
string3 = 'aligning'
I need to extract only columns showing differences in relation to the reference data.
The output should show only positional information of the columns containing differences in relation to the reference string and the corresponding reference item.
1 2 3 4
y e a r
l
a l i g
My current code does most things okay except that it also reports columns with no difference.
string1 = 'yearning'
string2 = 'learning'
string3 = 'aligning'
string_list = [string1, string2]
reference = reference_string
diffs_top, diffs = [], []
all_diffs = set()
for s in string_list:
diffs = []
for i, c in enumerate(s):
if s[i] != reference[i]:
diffs.append(i)
all_diffs.add(i)
diffs_top.append(diffs)
for d in all_diffs:
print str(int(d+1)),
print
for c in reference:
print str(c),
print
for i, s in enumerate(string_list):
for j, c in enumerate(s):
if j in diffs_top[i]:
print str(c),
else:
print str(' '),
print
This code would give:
1 2 3 4
y e a r n i n g
l
a l i g
Any help appreciated.
EDIT: I have picked some section of real data to make the problem as clearer as possible and my attempt at solving it thus far:
reference_string = 'MAHEWGPQRLAGGQPQAS'
string1 = 'MAQQWSLQRLAGRHPQDS'
string2 = 'MAQRWGAHRLTGGQLQDT'
string3 = 'MAQRWGPHALSGVQAQDA'
string_list = [string1, string2, string3]
reference = reference_string
diffs_top, diffs = [], []
all_diffs = set()
for s in string_list:
diffs = []
for i, c in enumerate(s):
if s[i] != reference[i]:
diffs.append(i)
all_diffs.add(i)
diffs_top.append(diffs)
#print diffs_top
#print all_diffs
for d in all_diffs:
print str(int(d+1)), # retains natural positions of the reference residues
print
for d in all_diffs:
for i, c in enumerate(reference):
if i == d:
print c,
print
The print out will be an output showing the position at which there is any difference to other non-reference strings and the corresponding reference letter.
3 4 6 7 8 9 11 13 14 15 17 18
H E G P Q R A G Q P A S
Then the next step is to write a code that will process non reference strings by printing out the difference with the reference (at that position). If there is no difference it will leave blank (' ').
Doing it manually the output will be:
3 4 6 7 8 9 11 13 14 15 17 18
H E G P Q R A G Q P A S
Q Q S L R H D
Q R A H T L D T
Q R H A S V A D A
My entire code as an attempt to get to the solution above as been messy to say the least:
reference_string = 'MAHEWGPQRLAGGQPQAS'
string1 = 'MAQQWSLQRLAGRHPQDS'
string2 = 'MAQRWGAHRLTGGQLQDT'
string3 = 'MAQRWGPHALSGVQAQDA'
string_list = [string1, string2, string3]
reference = reference_string
diffs_top, diffs = [], []
all_diffs = set()
for s in string_list:
diffs = []
for i, c in enumerate(s):
if s[i] != reference[i]:
diffs.append(i)
all_diffs.add(i)
diffs_top.append(diffs)
#print diffs_top
#print all_diffs
for d in all_diffs:
print str(int(d+1)),
print
for d in all_diffs:
for i, c in enumerate(reference):
if i == d:
print c,
print
# this is my attempt to look into non-reference strings
# to check for the difference with the reference, and print an output.
for d in all_diffs:
for i, s in enumerate(string_list):
for j, c in enumerate(s):
if j == d:
print c,
else:
print str(' '),
print
Your code is working perfectly fine (as per your logic).
What is happening , is that while printing the output, when you come across the reference string, Python looks for the corresponding entry in the diffs_top list and because while storing in diff_top, you have no entry stored for the reference string, Python just prints blank spaces for your reference string.
1 2 3 4
y e a r n i n g #prints the reference string, because you've coded in that way
#prints blank as string_list[0] and reference string are the same
l
a l i g
The question here is how exactly do you define your difference for reference string.
Besides, I also found some fundamental flaws in your code implementation. If you try to run your code by setting string_list[1] as your reference string, you would get your output as :
1 2 3 4
l e a r n i n g
y
a l i g
Is this what you need? Please spend some time in properly defining difference for all cases and then try to implement you code.
EDIT:
As per you updated requirements, replace the last block in your code with this:
for i, s in enumerate(string_list):
for d in all_diffs:
if d in diffs_top[i]:
print s[d],
else:
print ' ',
print
Cheers!
I think there is a general problem in your logic. If you need to extract only columns showing difference in relation to the reference data and string1 is the reference the output should be:
1 2 3 4
l
a l i g
So, 'yearning' shouldn't show any character because it has no difference to string1.
If you delete or put the following lines in comments, you will exactly get what I expect is the right answer:
#for c in reference:
# print str(c),
#print
Consider to review your logic if this solution is not what you actually want.
Update
Here is a shorter solution which solves your task:
from itertools import compress, izip_longest
def delta(reference, string):
return [ '' if a == b else b for a, b in izip_longest(reference, string)]
ref_string = 'MAHEWGPQRLAGGQPQAS'
strings = ['MAQQWSLQRLAGRHPQDS',
'MAQRWGAHRLTGGQLQDT',
'MAQRWGPHALSGVQAQDA']
delta_strings = [delta(ref_string, string) for string in strings]
selectors = [1 if any(tup) else 0 for tup in izip_longest(*delta_strings)]
indices = [str(i+1) for i in range(len(selectors))]
output_data = [indices, ref_string] + delta_strings
for line in output_data:
print ''.join(x.rjust(3) for x in compress(line, selectors))
Explanation:
I defined a function delta(reference, string) which returns the delta between the string and the referenced string. For example: delta("ABFF", "AECF") returns the list ['', E, C, ''].
The variable delta_strings holds all the deltas between each string in the list strings and the reference string ref_string.
The variable selector is a list containing only 1 and 0 values, where 0 specifies the collumns which shouldn't be printed and vice versa.