Extracting columns with a difference in aligned data - python-2.7

I have some aligned data (something bioinformatic related) as so:
reference_string = 'yearning'
string2 = 'learning'
string3 = 'aligning'
I need to extract only columns showing differences in relation to the reference data.
The output should show only positional information of the columns containing differences in relation to the reference string and the corresponding reference item.
1 2 3 4
y e a r
l
a l i g
My current code does most things okay except that it also reports columns with no difference.
string1 = 'yearning'
string2 = 'learning'
string3 = 'aligning'
string_list = [string1, string2]
reference = reference_string
diffs_top, diffs = [], []
all_diffs = set()
for s in string_list:
    diffs = []
    for i, c in enumerate(s):
        if s[i] != reference[i]:
            diffs.append(i)
            all_diffs.add(i)
    diffs_top.append(diffs)
for d in all_diffs:
    print str(int(d+1)),
print
for c in reference:
    print str(c),
print
for i, s in enumerate(string_list):
    for j, c in enumerate(s):
        if j in diffs_top[i]:
            print str(c),
        else:
            print str(' '),
    print
This code would give:
1 2 3 4
y e a r n i n g
l
a l i g
Any help appreciated.
EDIT: I have picked a section of real data to make the problem as clear as possible, along with my attempt at solving it thus far:
reference_string = 'MAHEWGPQRLAGGQPQAS'
string1 = 'MAQQWSLQRLAGRHPQDS'
string2 = 'MAQRWGAHRLTGGQLQDT'
string3 = 'MAQRWGPHALSGVQAQDA'
string_list = [string1, string2, string3]
reference = reference_string
diffs_top, diffs = [], []
all_diffs = set()
for s in string_list:
    diffs = []
    for i, c in enumerate(s):
        if s[i] != reference[i]:
            diffs.append(i)
            all_diffs.add(i)
    diffs_top.append(diffs)
#print diffs_top
#print all_diffs
for d in all_diffs:
    print str(int(d+1)), # retains natural positions of the reference residues
print
for d in all_diffs:
    for i, c in enumerate(reference):
        if i == d:
            print c,
print
The print out will be an output showing the position at which there is any difference to other non-reference strings and the corresponding reference letter.
3 4 6 7 8 9 11 13 14 15 17 18
H E G P Q R A G Q P A S
The next step is to write code that processes the non-reference strings, printing each character that differs from the reference at that position. Where there is no difference, it should leave a blank (' ').
Doing it manually the output will be:
3 4 6 7 8 9 11 13 14 15 17 18
H E G P Q R A G Q P A S
Q Q S L R H D
Q R A H T L D T
Q R H A S V A D A
My entire code, as an attempt to get to the solution above, has been messy to say the least:
reference_string = 'MAHEWGPQRLAGGQPQAS'
string1 = 'MAQQWSLQRLAGRHPQDS'
string2 = 'MAQRWGAHRLTGGQLQDT'
string3 = 'MAQRWGPHALSGVQAQDA'
string_list = [string1, string2, string3]
reference = reference_string
diffs_top, diffs = [], []
all_diffs = set()
for s in string_list:
    diffs = []
    for i, c in enumerate(s):
        if s[i] != reference[i]:
            diffs.append(i)
            all_diffs.add(i)
    diffs_top.append(diffs)
#print diffs_top
#print all_diffs
for d in all_diffs:
    print str(int(d+1)),
print
for d in all_diffs:
    for i, c in enumerate(reference):
        if i == d:
            print c,
print
# this is my attempt to look into non-reference strings
# to check for the difference with the reference, and print an output.
for d in all_diffs:
    for i, s in enumerate(string_list):
        for j, c in enumerate(s):
            if j == d:
                print c,
            else:
                print str(' '),
        print

Your code is working perfectly fine (as per your logic).
What is happening is that, while printing the output, Python looks up each string's entry in the diffs_top list. When it comes across the reference string, there are no differences stored for it in diffs_top, so Python just prints blank spaces for that row.
1 2 3 4
y e a r n i n g #prints the reference string, because you've coded in that way
#prints blank as string_list[0] and reference string are the same
l
a l i g
The question here is how exactly do you define your difference for reference string.
Besides, I also found some fundamental flaws in your code implementation. If you try to run your code by setting string_list[1] as your reference string, you would get your output as :
1 2 3 4
l e a r n i n g
y
a l i g
Is this what you need? Please spend some time properly defining the difference for all cases and then try to implement your code.
EDIT:
As per your updated requirements, replace the last block in your code with this:
for i, s in enumerate(string_list):
    for d in all_diffs:
        if d in diffs_top[i]:
            print s[d],
        else:
            print ' ',
    print
Cheers!

I think there is a general problem in your logic. If you need to extract only columns showing difference in relation to the reference data and string1 is the reference the output should be:
1 2 3 4
l
a l i g
So, 'yearning' shouldn't show any character because it has no difference to string1.
If you delete or put the following lines in comments, you will exactly get what I expect is the right answer:
#for c in reference:
# print str(c),
#print
Consider to review your logic if this solution is not what you actually want.
Update
Here is a shorter solution which solves your task:
from itertools import compress, izip_longest
def delta(reference, string):
    return [ '' if a == b else b for a, b in izip_longest(reference, string)]
ref_string = 'MAHEWGPQRLAGGQPQAS'
strings = ['MAQQWSLQRLAGRHPQDS',
           'MAQRWGAHRLTGGQLQDT',
           'MAQRWGPHALSGVQAQDA']
delta_strings = [delta(ref_string, string) for string in strings]
selectors = [1 if any(tup) else 0 for tup in izip_longest(*delta_strings)]
indices = [str(i+1) for i in range(len(selectors))]
output_data = [indices, ref_string] + delta_strings
for line in output_data:
    print ''.join(x.rjust(3) for x in compress(line, selectors))
Explanation:
I defined a function delta(reference, string) which returns the delta between the string and the reference string. For example: delta("ABFF", "AECF") returns the list ['', 'E', 'C', ''].
The variable delta_strings holds all the deltas between each string in the list strings and the reference string ref_string.
The variable selectors is a list containing only 1 and 0 values, where 0 marks the columns which shouldn't be printed and 1 marks those which should.
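For readers on Python 3, the same idea can be sketched without itertools. This is only an illustrative sketch, not the code from the answer above: the helper name diff_columns is made up here, and it assumes every string has the same length as the reference. The differing columns are collected in order, since iterating over a set (as in the question's code) does not guarantee any order.

```python
def diff_columns(reference, strings):
    """Return (header, ref_row, rows) restricted to columns with a difference."""
    # Columns where at least one string differs from the reference, in order.
    cols = [i for i in range(len(reference))
            if any(s[i] != reference[i] for s in strings)]
    header = [str(i + 1) for i in cols]      # 1-based positions
    ref_row = [reference[i] for i in cols]   # reference residues
    # For each string, keep the character only where it differs.
    rows = [[s[i] if s[i] != reference[i] else ' ' for i in cols]
            for s in strings]
    return header, ref_row, rows


reference = 'MAHEWGPQRLAGGQPQAS'
strings = ['MAQQWSLQRLAGRHPQDS',
           'MAQRWGAHRLTGGQLQDT',
           'MAQRWGPHALSGVQAQDA']

header, ref_row, rows = diff_columns(reference, strings)
for line in [header, ref_row] + rows:
    print(''.join(x.rjust(3) for x in line))
```

Returning the three pieces separately, rather than printing inside the function, keeps the column selection easy to check on its own.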

Related

Find starting and ending index of each unique character in a string in Python

I have a string with repeated characters. My job is to find the starting index and ending index of each unique character in that string. Below is my code.
import re
x = "aaabbbbcc"
xs = set(x)
for item in xs:
    mo = re.search(item,x)
    flag = item
    m = mo.start()
    n = mo.end()
    print(flag,m,n)
Output :
a 0 1
b 3 4
c 7 8
Here the end indices of the characters are not correct. I understand why this happens, but how can I pass the character to be matched dynamically to the regex search function? For instance, if I hardcode the character in the search function, it gives the desired output:
x = 'aabbbbccc'
xs = set(x)
mo = re.search("[b]+",x)
flag = item
m = mo.start()
n = mo.end()
print(flag,m,n)
output:
b 2 5
The above code gives the correct result, but here I can't pass the character to be matched dynamically.
It would be a real help if someone could let me know how to achieve this; any hint will also do. Thanks in advance.
String literal formatting to the rescue:
import re
x = "aaabbbbcc"
xs = set(x)
for item in xs:
    # Patterns are best written as raw strings; format the letter into the pattern.
    mo = re.search(fr"{item}+", x)  # fr and rf both work: a raw formatted literal
    flag = item
    m = mo.start()
    n = mo.end()
    print(flag, m, n)  # mo.end() is exclusive; use n-1 for an inclusive end index
Output:
a 0 3 # the upper limit is exclusive, so it is one past the last matched index
b 3 7 # see above for the fix
c 7 9
Your pattern does not need the [] around the letter - you are matching just one anyhow.
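One caveat when building the pattern from the character itself: characters such as . or + have a special meaning in a regex. Below is a hedged sketch, assuming you want the first run of a given character with an inclusive end index. The helper name first_run is hypothetical; re.escape is the standard-library function that quotes special characters.

```python
import re


def first_run(char, text):
    """Return (start, inclusive_end) of the first run of char, or None."""
    # re.escape makes characters like '.' or '+' match literally.
    match = re.search(re.escape(char) + "+", text)
    if match is None:
        return None
    # match.end() is exclusive, so subtract 1 for the inclusive end index.
    return match.start(), match.end() - 1


x = "aaabbbbcc"
for item in sorted(set(x)):
    print(item, *first_run(item, x))
```

Sorting the set only makes the print order predictable; a set has no guaranteed iteration order.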
Without regex [1]:
x = "aaabbbbcc"
last_ch = x[0]
start_idx = 0
# process the remainder
for idx,ch in enumerate(x[1:],1):
    if last_ch == ch:
        continue
    else:
        print(last_ch,start_idx, idx-1)
        last_ch = ch
        start_idx = idx
print(ch,start_idx,idx)
output:
a 0 2 # not off by 1
b 3 6
c 7 8
[1] RegEx: And now you have 2 problems...
Looking at the output, I'm guessing that another option would be,
import re
x = "aaabbbbcc"
xs = re.findall(r"((.)\2*)", x)
start = 0
output = ''
for item in xs:
    end = start + len(item[0])
    output += (f"{item[1]} {start} {end}\n")
    start = end
print(output)
Output
a 0 3
b 3 7
c 7 9
I think it'll be O(N); you can benchmark it, though, if you like.
import re, time
timer_on = time.time()
for i in range(10000000):
    x = "aabbbbccc"
    xs = re.findall(r"((.)\2*)", x)
    start = 0
    output = ''
    for item in xs:
        end = start + len(item[0])
        output += (f"{item[1]} {start} {end}\n")
        start = end
timer_off = time.time()
timer_total = timer_off - timer_on
print(timer_total)
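Another regex-free option, sketched here under the assumption that inclusive end indices are wanted, is itertools.groupby, which groups consecutive equal characters. The function name char_runs is made up for illustration. Note that if a character occurs in several separate runs, each run is reported on its own.

```python
from itertools import groupby


def char_runs(text):
    """Return a list of (char, start, inclusive_end) for each run in text."""
    runs = []
    start = 0
    for ch, group in groupby(text):
        length = sum(1 for _ in group)  # size of this run
        runs.append((ch, start, start + length - 1))
        start += length
    return runs


for ch, first, last in char_runs("aaabbbbcc"):
    print(ch, first, last)
```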

Use regular expression to extract elements from a pandas data frame

From the following data frame:
d = {'col1':['a-1524112-124', 'b-1515', 'c-584854', 'a-15154']}
df = pd.DataFrame.from_dict(d)
My ultimate goal is to extract the letters a, b or c (as string) in a pandas series. For that I am using the .findall() method from the re module, as shown below:
# import the module
import re
# define the patterns
pat = 'a|b|c'
# extract the patterns from the elements in the specified column
df['col1'].str.findall(pat)
The problem is that the output i.e. the letters a, b or c, in each row, will be present in a list (of a single element), as shown below:
Out[301]:
0 [a]
1 [b]
2 [c]
3 [a]
While I would like to have the letters a, b or c as string, as shown below:
0 a
1 b
2 c
3 a
I know that if I combine re.search() with .group() I can get a string, but if I do:
df['col1'].str.search(pat).group()
I will get the following error message:
AttributeError: 'StringMethods' object has no attribute 'search'
Using .str.split() won't do the job because, in my original dataframe, I want to capture strings that might contain the delimiter (e.g. I might want to capture a-b)
Does anyone know a simple solution for that, perhaps avoiding iterative operations such as a for loop or list comprehension?
Use extract with capturing groups:
import pandas as pd
d = {'col1':['a-1524112-124', 'b-1515', 'c-584854', 'a-15154']}
df = pd.DataFrame.from_dict(d)
result = df['col1'].str.extract('(a|b|c)')
print(result)
Output
0
0 a
1 b
2 c
3 a
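Note that the result above is a one-column DataFrame. If a Series of plain strings is wanted, as in the question, str.extract also takes expand=False. A small sketch using the question's data (the variable name letters is just for illustration):

```python
import pandas as pd

d = {'col1': ['a-1524112-124', 'b-1515', 'c-584854', 'a-15154']}
df = pd.DataFrame.from_dict(d)

# With a single capturing group and expand=False, the result is a Series.
letters = df['col1'].str.extract(r'(a|b|c)', expand=False)
print(letters)
```

Rows where the pattern does not match come back as NaN rather than raising an error.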
Fix your code
pat = 'a|b|c'
df['col1'].str.findall(pat).str[0]
Out[309]:
0 a
1 b
2 c
3 a
Name: col1, dtype: object
Simply try str.split() and keep the first piece, like this: df["col1"].str.split("-", n=1, expand=True)[0]
import pandas as pd
d = {'col1':['a-1524112-124', 'b-1515', 'c-584854', 'a-15154']}
df = pd.DataFrame.from_dict(d)
# expand=True returns a DataFrame of the split pieces; [0] selects the first piece
df['col1'] = df["col1"].str.split("-", n=1, expand=True)[0]
print(df.head())
Output:
  col1
0    a
1    b
2    c
3    a

Combining values from an arbitrary number of pandas columns into a new column — a 'join' in the not-SQL sense

I'm trying to do what's described here, but it's not the case that only one of my columns is populated, and I want to have a delimiter.
The code I'd like to replace (with something that will take an arbitrary number of k's) is:
raw_df["all ks"] = raw_df["k1"].fillna("") + "/" + \
raw_df["k2"].fillna("") + "/" + \
raw_df["k3"].fillna("") + "/" + \
raw_df["k4"].fillna("")
I wondered if this solution could be somehow responsive, but I'm hoping for something simpler.
Thanks for any helpful suggestions. Searching the web has been frustrating because I'm trying to do a join (in the pythonic sense) and most search results relate to joining columns in the database sense (including as adapted in pandas).
You could use the cat string method to concatenate the string values. With this method you can specify the delimiter and what the NaN values should be replaced with.
For example, here's a DataFrame:
>>> df = pd.DataFrame({'a': ['x', np.nan, 'x'],
...                    'b': ['y', 'y', np.nan],
...                    'c': ['z', 'z', np.nan]})
>>> df
     a    b    c
0    x    y    z
1  NaN    y    z
2    x  NaN  NaN
Then starting with column a and passing in the remaining columns using a list comprehension:
>>> df['a'].str.cat(others=[df[col] for col in df.columns[1:]],
...                 sep='/', na_rep='')
0    x/y/z
1     /y/z
2      x//
So this is what I came up with. It uses apply() and a function. Not as concise as I hoped, but it works with an arbitrary number of Ks. Maybe someone will come up with something better.
Generating a dataframe:
d = {'k1' : [np.nan,'a','b'], 'k2' : ['c', np.nan, 'c'], 'k3' : ['r','t',np.nan], 'k4': [np.nan,'t','e']}
raw_df = pd.DataFrame(d)
raw_df
    k1   k2   k3   k4
0  NaN    c    r  NaN
1    a  NaN    t    t
2    b    c  NaN    e
Define a function:
def concatKs(s):
    allK = ''
    for k in s:
        if k is not np.nan:
            allK += k + '/'
        else:
            allK += '' + '/'
    return allK
Then call apply(), passing our function:
raw_df['all ks'] = raw_df.apply(concatKs, axis=1)
raw_df
    k1   k2   k3   k4   all ks
0  NaN    c    r  NaN   /c/r//
1    a  NaN    t    t  a//t/t/
2    b    c  NaN    e  b/c//e/
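The function above leaves a trailing "/" on every row, unlike the chained fillna expression in the question. A shorter sketch that avoids this, assuming all the relevant columns hold strings or NaN and that their names start with "k" (the k_cols selection is an assumption made for illustration), is to fill the gaps and join each row:

```python
import numpy as np
import pandas as pd

raw_df = pd.DataFrame({
    'k1': [np.nan, 'a', 'b'],
    'k2': ['c', np.nan, 'c'],
    'k3': ['r', 't', np.nan],
    'k4': [np.nan, 't', 'e'],
})

# Pick the columns to combine; here, every column whose name starts with 'k'.
k_cols = [col for col in raw_df.columns if col.startswith('k')]

# Replace NaN with '' and join the values of each row with '/'.
raw_df['all ks'] = raw_df[k_cols].fillna('').apply(
    lambda row: '/'.join(row), axis=1)
print(raw_df)
```

This is still a row-wise apply, so on very large frames the str.cat approach from the other answer may be the faster choice.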

Understanding Recursive Function

I'm working through the book NLP with Python, and I came across this example from an 'advanced' section. I'd appreciate help understanding how it works. The function computes all possibilities of a number of syllables to reach a 'meter' length n. Short syllables "S" take up one unit of length, while long syllables "L" take up two units of length. So, for a meter length of 4, the return statement looks like this:
['SSSS', 'SSL', 'SLS', 'LSS', 'LL']
The function:
def virahanka1(n):
    if n == 0:
        return [""]
    elif n == 1:
        return ["S"]
    else:
        s = ["S" + prosody for prosody in virahanka1(n-1)]
        l = ["L" + prosody for prosody in virahanka1(n-2)]
        return s + l
The part I don't understand is how the 'SSL', 'SLS', and 'LSS' matches are made, if s and l are separate lists. Also in the line "for prosody in virahanka1(n-1)," what is prosody? Is it what the function is returning each time? I'm trying to think through it step by step but I'm not getting anywhere. Thanks in advance for your help!
Adrian
Let's just build the function from scratch. That's a good way to understand it thoroughly.
Suppose then that we want a recursive function to enumerate every combination of Ls and Ss to make a given meter length n. Let's just consider some simple cases:
n = 0: Only way to do this is with an empty string.
n = 1: Only way to do this is with a single S.
n = 2: You can do it with a single L, or two Ss.
n = 3: LS, SL, SSS.
Now, think about how you might build the answer for n = 4 given the above data. Well, the answer would either involve adding an S to a meter length of 3, or adding an L to a meter length of 2. So, the answer in this case would be LL, LSS from n = 2 and SLS, SSL, SSSS from n = 3. You can check that this is all possible combinations. We can also see that n = 2 and n = 3 can be obtained from n = 0,1 and n=1,2 similarly, so we don't need to special-case them.
Generally, then, for n ≥ 2, you can derive the strings for length n by looking at strings of length n-1 and length n-2.
Then, the answer is obvious:
if n = 0, return just an empty string
if n = 1, return a single S
otherwise, return the result of adding an S to all strings of meter length n-1, combined with the result of adding an L to all strings of meter length n-2.
By the way, the function as written is a bit inefficient because it recalculates a lot of values. That would make it very slow if you asked for e.g. n = 30. You can make it faster very easily by using lru_cache from functools (available since Python 3.2):
from functools import lru_cache
@lru_cache(maxsize=None)
def virahanka1(n):
    ...
This caches results for each n, making it much faster.
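A runnable sketch of the cached version might look as follows. It is a variation rather than the book's code: the name virahanka is used to keep it apart from virahanka1, and it returns tuples instead of lists, on the assumption that cached results should not be mutable (a caller modifying a cached list would corrupt later calls).

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def virahanka(n):
    """Return all S/L strings of meter length n as a tuple."""
    if n == 0:
        return ("",)
    if n == 1:
        return ("S",)
    # Prepend S to every string of length n-1 and L to every string of length n-2.
    s = tuple("S" + prosody for prosody in virahanka(n - 1))
    l = tuple("L" + prosody for prosody in virahanka(n - 2))
    return s + l


print(list(virahanka(4)))
```

The cache removes the repeated recursive calls, but the number of strings still grows like the Fibonacci numbers, so very large n remains expensive simply because the output is large.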
I tried to melt my brain. I added print statements to explain to me what was happening. I think the most confusing part about recursive calls is that it seems to go into the call forward but come out backwards, as you may see with the prints when you run the following code;
def virahanka1(n):
    if n == 4:
        print 'Lets Begin for ', n
    else:
        print 'recursive call for ', n, '\n'
    if n == 0:
        print 'n = 0 so adding "" to below'
        return [""]
    elif n == 1:
        print 'n = 1 so returning S for below'
        return ["S"]
    else:
        print 'next recursivly call ' + str(n) + '-1 for S'
        s = ["S" + prosody for prosody in virahanka1(n-1)]
        print '"S" + each string in s equals', s
        if n == 4:
            print '**Above is the result for s**'
        print 'n =',n,'\n', 'next recursivly call ' + str(n) + '-2 for L'
        l = ["L" + prosody for prosody in virahanka1(n-2)]
        print '\t','what was returned + each string in l now equals', l
        if n == 4:
            print '**Above is the result for l**','\n','**Below is the end result of s + l**'
        print 'returning s + l',s+l,'for below', '\n','='*70
        return s + l
virahanka1(4)
Still confusing for me, but with this and Jocke's elegant explanation, I think I can understand what is going on.
How about you?
Below is what the code above produces;
Lets Begin for 4
next recursivly call 4-1 for S
recursive call for 3
next recursivly call 3-1 for S
recursive call for 2
next recursivly call 2-1 for S
recursive call for 1
n = 1 so returning S for below
"S" + each string in s equals ['SS']
n = 2
next recursivly call 2-2 for L
recursive call for 0
n = 0 so adding "" to below
what was returned + each string in l now equals ['L']
returning s + l ['SS', 'L'] for below
======================================================================
"S" + each string in s equals ['SSS', 'SL']
n = 3
next recursivly call 3-2 for L
recursive call for 1
n = 1 so returning S for below
what was returned + each string in l now equals ['LS']
returning s + l ['SSS', 'SL', 'LS'] for below
======================================================================
"S" + each string in s equals ['SSSS', 'SSL', 'SLS']
**Above is the result for s**
n = 4
next recursivly call 4-2 for L
recursive call for 2
next recursivly call 2-1 for S
recursive call for 1
n = 1 so returning S for below
"S" + each string in s equals ['SS']
n = 2
next recursivly call 2-2 for L
recursive call for 0
n = 0 so adding "" to below
what was returned + each string in l now equals ['L']
returning s + l ['SS', 'L'] for below
======================================================================
what was returned + each string in l now equals ['LSS', 'LL']
**Above is the result for l**
**Below is the end result of s + l**
returning s + l ['SSSS', 'SSL', 'SLS', 'LSS', 'LL'] for below
======================================================================
This function says that:
virahanka1(n) is the same as [""] when n is zero, ["S"] when n is 1, and s + l otherwise.
Where s is the result of prepending "S" to each element in the list returned by virahanka1(n - 1), and l is the result of prepending "L" to each element of virahanka1(n - 2).
So the computation would be:
When n is 0:
[""]
When n is 1:
["S"]
When n is 2:
s = ["S" + "S"]
l = ["L" + ""]
s + l = ["SS", "L"]
When n is 3:
s = ["S" + "SS", "S" + "L"]
l = ["L" + "S"]
s + l = ["SSS", "SL", "LS"]
When n is 4:
s = ["S" + "SSS", "S" + "SL", "S" + "LS"]
l = ["L" + "SS", "L" + "L"]
s + l = ["SSSS", "SSL", "SLS", "LSS", "LL"]
And there you have it, step by step.
You need to know the results of the other function calls in order to calculate the final value, which can be pretty messy to do manually, as you can see. It is important, though, that you do not try to think recursively in your head. This would cause your mind to melt. I described the function in words so that you can see that these kinds of functions are descriptions, and not a sequence of commands.
The prosody you see in the definitions of s and l is a variable. It is used in a list comprehension, which is a way of building lists: it takes, in turn, each string in the list returned by the recursive call. I've described earlier how this list is built.

Regular expression puzzle

This is not homework, but an old exam question. I am curious to see the answer.
We are given an alphabet S={0,1,2,3,4,5,6,7,8,9,+}. Define the language L as the set of strings w from this alphabet such that w is in L if:
a) w is a number such as 42 or w is the (finite) sum of numbers such as 34 + 16 or 34 + 2 + 10
and
b) The number represented by w is divisible by 3.
Write a regular expression (and a DFA) for L.
This should work:
^(?:0|(?:(?:[369]|[147](?:0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[147]0*(?:\+?(?:0\
+)*[369]0*)*\+?(?:0\+)*[258])*(?:0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[258]|0*(?:
\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[147]0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[147])|[
258](?:0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[258]0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0
\+)*[147])*(?:0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[147]|0*(?:\+?(?:0\+)*[369]0*)
*\+?(?:0\+)*[258]0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[258]))0*)+)(?:\+(?:0|(?:(?
:[369]|[147](?:0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[147]0*(?:\+?(?:0\+)*[369]0*)
*\+?(?:0\+)*[258])*(?:0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[258]|0*(?:\+?(?:0\+)*
[369]0*)*\+?(?:0\+)*[147]0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[147])|[258](?:0*(?
:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[258]0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[147])*
(?:0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[147]|0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)
*[258]0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[258]))0*)+))*$
It works by having three states representing the sum of the digits so far modulo 3. It disallows leading zeros on numbers, and plus signs at the start and end of the string, as well as two consecutive plus signs.
Generation of regular expression and test bed:
a = r'0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*'
b = r'a[147]'
c = r'a[258]'
r1 = '[369]|[147](?:bc)*(?:c|bb)|[258](?:cb)*(?:b|cc)'
r2 = '(?:0|(?:(?:' + r1 + ')0*)+)'
r3 = '^' + r2 + r'(?:\+' + r2 + ')*$'
r = r3.replace('b', b).replace('c', c).replace('a', a)
print r
# Test on 10000 examples.
import random, re
random.seed(1)
r = re.compile(r)
for _ in range(10000):
    x = ''.join(random.choice('0123456789+') for j in range(random.randint(1,50)))
    if re.search(r'(?:\+|^)(?:\+|0[0-9])|\+$', x):
        valid = False
    else:
        valid = eval(x) % 3 == 0
    result = re.match(r, x) is not None
    if result != valid:
        print 'Failed for ' + x
Note that my memory of DFA syntax is woefully out of date, so my answer is undoubtedly a little broken. Hopefully this gives you a general idea. I've chosen to ignore + completely. As AmirW states, abc+def and abcdef are the same for divisibility purposes.
Accept state is C.
A=1,4,7,BB,AC,CA
B=2,5,8,AA,BC,CB
C=0,3,6,9,AB,BA,CC
Notice that the above language uses all 9 possible ABC pairings. It will always end at either A,B,or C, and the fact that every variable use is paired means that each iteration of processing will shorten the string of variables.
Example:
1490 = AACC = BCC = BC = B (Fail)
1491 = AACA = BCA = BA = C (Success)
Not a full solution, just an idea:
(B) alone: The "plus" signs don't matter here. abc + def is the same as abcdef for the sake of divisibility by 3. For the latter case, there is a regexp here: http://blog.vkistudios.com/index.cfm/2008/12/30/Regular-Expression-to-determine-if-a-base-10-number-is-divisible-by-3
To combine this with requirement (A), we can take the solution of (B) and modify it:
First read character must be in 0..9 (not a plus)
Input must not end with a plus, so: Duplicate each state (will use S for the original state and S' for the duplicate to distinguish between them). If we're in state S and we read a plus we'll move to S'.
When reading a number we'll go to the new state as if we were in S. S' states cannot accept (another) plus.
Also, S' is not "accept state" even if S is. (because input must not end with a plus).
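The state-duplication idea can be sketched as a small DFA simulation in Python. This is a sketch under assumptions, not a worked exam answer: like the regex answer above, it rejects leading zeros, a leading or trailing plus, and two consecutive plus signs, and it assumes the input contains no spaces (the alphabet has none). The state is a pair of the digit sum modulo 3 and a phase; the names accepts, START, ZERO and NUM are made up for illustration.

```python
def accepts(w):
    """Simulate a DFA for '+'-separated decimal numbers whose sum is divisible by 3."""
    # Phases: START = a number must begin here, ZERO = the current number
    # is exactly "0", NUM = inside a number with a nonzero leading digit.
    START, ZERO, NUM = "start", "zero", "num"
    remainder, phase = 0, START
    for ch in w:
        if ch == "+":
            if phase == START:
                return False  # leading '+' or two '+' in a row
            phase = START
        elif ch in "0123456789":
            if phase == ZERO:
                return False  # a digit after a leading zero
            if phase == START and ch == "0":
                phase = ZERO
            else:
                # 10 is congruent to 1 mod 3, so only the digit sum matters.
                remainder = (remainder + int(ch)) % 3
                phase = NUM
        else:
            return False  # symbol outside the alphabet
    # Reject the empty string and a trailing '+'; accept when the sum is 0 mod 3.
    return phase != START and remainder == 0


for example in ["42", "12+30", "34+2+10", "3+", "03"]:
    print(example, accepts(example))
```

With three remainders and three phases there are nine live states plus one dead state, which is the same bookkeeping as duplicating each state S into S and S'.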