Compare strings and just keep those who have on same positions different characters [closed] - regex

My question is the following: I have a file that contains around 70 strings, each 6 characters long (a, c, g, or t at every position; these are short DNA sequences).
For example:
accggt agctta gggatc gactta ccttgg
What I need are the strings that are completely distinct from one another, that is, strings that have a different character (base) at every position compared with the other strings.
In this case I would get two matches (I write them as lists, but that is only an idea for the output format):
[accggt , gggatc]
[gggatc , ccttgg]
The elements of list one differ at every position, and so do the elements of list two.
Is there a built-in function that can do this? I also thought of regular expressions, but I'm not that familiar with that approach.
Thanks in advance!
Edit:
OK, it seems this is not that easy to describe, so let's go into more detail:
Let's take the five strings mentioned above:
I would start to compare the first string with all the other strings and then continue with string 2 comparing with all other strings and so on.
The first character of the first string is an a.
The first character of the second string is also an a.
This means I would discard the second string.
The first character of the third string is a g.
Fine.
The second character of the first string is a c.
The second character of the third string is a g.
Fine.
The third character of the first string is a c.
The third character of the third string is a g.
Fine.
The fourth ... and so on.
And if all characters of a string differ from the characters of another string (at every position, as described above), I would keep those two strings and then search for further strings that differ at every position from the strings I have already found. Because there are only four letters, a group should contain at most four mutually different strings.
I should end up with something like a list that contains the groups of strings that differ at every position.
I hope this helps.

You can use the following algorithm: iterate over all pairs of words in your string and check that no position holds the same character in both words, with if [x == y for (x, y) in zip(word, nextWord)].count(True) == 0:.
Here is a snippet:
s = "accggt agctta gggatc gactta ccttgg"
chks = s.split(" ")
for word in chks:
    for nextWord in chks:
        if word != nextWord:
            if [x == y for (x, y) in zip(word, nextWord)].count(True) == 0:
                print([word, nextWord])
Result of the IDEONE demo:
['accggt', 'gggatc']
['gggatc', 'accggt']
['gggatc', 'ccttgg']
['ccttgg', 'gggatc']
UPDATE
You can deduplicate the list with a custom function. Here is an updated snippet:
def dedup(lst):
    seen = set()
    result = []
    for item in lst:
        fs = frozenset(item)
        if fs not in seen:
            result.append(item)
            seen.add(fs)
    return result
res = []
s = "accggt agctta gggatc gactta ccttgg"
chks = s.split(" ")
for word in chks:
    for nextWord in chks:
        if word != nextWord:
            if [x == y for (x, y) in zip(word, nextWord)].count(True) == 0:
                res.append([word, nextWord])
print(dedup(res))
Result: [['accggt', 'gggatc'], ['gggatc', 'ccttgg']].
To check the words in groups of three, generate all 3-word permutations of the list and use something like:
from itertools import permutations
def dedup(lst):
    seen = set()
    result = []
    for item in lst:
        fs = frozenset(item)
        if fs not in seen:
            result.append(item)
            seen.add(fs)
    return result
res = []
s = "accggt agctta gggatc gactta ccttgg"
chks = s.split(" ")
perms = [p for p in permutations(chks, 3)]
for perm in perms:
    if [(x == y or y == z or x == z) for (x, y, z) in zip(*perm)].count(True) == 0:
        res.append(perm)
print(dedup(res))
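As a more compact variant (a sketch, not the answer's own method), itertools.combinations yields each unordered group only once, so no dedup step is needed, and the same check works for groups of any size. The helper names all_positions_differ and disjoint_groups are made up for this illustration.

```python
from itertools import combinations

def all_positions_differ(words):
    """True if, at every position, the characters of all words are pairwise different."""
    # zip(*words) yields one column of characters per position
    return all(len(set(column)) == len(column) for column in zip(*words))

def disjoint_groups(words, size=2):
    """Return every group of `size` words that differ at every position."""
    return [group for group in combinations(words, size)
            if all_positions_differ(group)]

chks = "accggt agctta gggatc gactta ccttgg".split()
print(disjoint_groups(chks, 2))  # [('accggt', 'gggatc'), ('gggatc', 'ccttgg')]
print(disjoint_groups(chks, 3))  # [] for this sample: no three words are mutually different
```

With only four letters, size can meaningfully go up to 4 at most.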

To find the DNA strings that differ from all others at every position, you have to check every string against every other string and see whether any character of the given string equals the character at the same position in the string it is compared with.
Here is some example code for that:
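# read all DNA strings into a list of strings
dna = ['accggt', 'agctta', 'gggatc', 'gactta', 'ccttgg', '123456']
def compare_two_dna(dna1, dna2):
    # True as soon as both strings hold the same character at some position
    i = 0
    l = len(dna1)
    while i < l:
        if dna1[i] == dna2[i]:
            return True
        i += 1
    return False
def is_dna_unique(d, dna_strings):
    # every string matches itself, so exactly one match means no other string shares a position
    # (filter returns an iterator in Python 3, hence list())
    return len(list(filter(lambda x: compare_two_dna(d, x), dna_strings))) == 1
# keep only the items that share no position with any other item in the list
unique_dna = list(filter(lambda d: is_dna_unique(d, dna), dna))
print(unique_dna)
The result here is: ['123456']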

var dnaList = "accggt agctta gggatc gactta ccttgg".split( " " );
function getUniqueDnas( dna_list ){
    var result = [];
    for( var d1 in dna_list ){
        var isRepeat = false;
        var dna1 = dna_list[ d1 ];
        for( var d2 in dna_list ){
            var dna2 = dna_list[ d2 ];
            // skip comparing an entry with itself, otherwise every entry counts as a repeat
            if( d1 !== d2 && dna1 == dna2 ){
                isRepeat = true;
                break;
            }
        }
        if( !isRepeat )
            result.push( dna1 );
    }
    return result;
}
var uniqueDnaList = getUniqueDnas( dnaList );

Related

Find starting and ending index of each unique character in a string in Python

I have a string with repeated characters. My job is to find the starting index and ending index of each unique character in that string. Below is my code.
import re
x = "aaabbbbcc"
xs = set(x)
for item in xs:
    mo = re.search(item,x)
    flag = item
    m = mo.start()
    n = mo.end()
    print(flag,m,n)
Output :
a 0 1
b 3 4
c 7 8
Here the end indices of the characters are not correct. I understand why this is happening, but how can I pass the character to be matched dynamically to the regex search function? For instance, if I hardcode the character in the search function, it provides the desired output:
import re
x = 'aabbbbccc'
xs = set(x)
mo = re.search("[b]+",x)
flag = "b"  # hardcoded here, since there is no loop variable in this snippet
m = mo.start()
n = mo.end()
print(flag,m,n)
output:
b 2 5
The code above gives the correct result, but here I can't pass the character to be matched dynamically.
It would be a real help if someone could tell me how to achieve this; any hint will do. Thanks in advance.
String literal formatting to the rescue:
import re
x = "aaabbbbcc"
xs = set(x)
for item in xs:
    # raw strings are better for patterns; format the letter into it
    mo = re.search(fr"{item}+",x) # fr and rf both work :) it's a raw formatted literal
    flag = item
    m = mo.start()
    n = mo.end()
    print(flag,m,n) # fix the upper limit by using n-1
Output:
a 0 3 # you do see that the upper limit is off by 1?
b 3 7 # see above for fix
c 7 9
Your pattern does not need the [] around the letter - you are matching just one anyhow.
Without regex [1]:
x = "aaabbbbcc"
last_ch = x[0]
start_idx = 0
# process the remainder
for idx,ch in enumerate(x[1:],1):
    if last_ch == ch:
        continue
    else:
        print(last_ch,start_idx, idx-1)
        last_ch = ch
        start_idx = idx
print(ch,start_idx,idx)
output:
a 0 2 # not off by 1
b 3 6
c 7 8
[1] RegEx: And now you have 2 problems...
Looking at the output, I'm guessing that another option would be,
import re
x = "aaabbbbcc"
xs = re.findall(r"((.)\2*)", x)
start = 0
output = ''
for item in xs:
    end = start + len(item[0])
    output += (f"{item[1]} {start} {end}\n")
    start = end
print(output)
Output
a 0 3
b 3 7
c 7 9
I think it runs in O(N) time; you can benchmark it if you like:
import re, time
timer_on = time.time()
for i in range(10000000):
    x = "aabbbbccc"
    xs = re.findall(r"((.)\2*)", x)
    start = 0
    output = ''
    for item in xs:
        end = start + len(item[0])
        output += (f"{item[1]} {start} {end}\n")
        start = end
timer_off = time.time()
timer_total = timer_off - timer_on
print(timer_total)
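Another regex-free option, sketched here with itertools.groupby from the standard library, walks the string once and reports each run with an inclusive end index. The function name run_spans is made up for this example, and like the loop-based answer it reports every run separately, so a character that appears in two separate runs is listed twice.

```python
from itertools import groupby

def run_spans(text):
    """Return (char, start, end) for each run of equal characters; end is inclusive."""
    spans = []
    start = 0
    for ch, group in groupby(text):
        length = sum(1 for _ in group)  # number of characters in this run
        spans.append((ch, start, start + length - 1))
        start += length
    return spans

for ch, start, end in run_spans("aaabbbbcc"):
    print(ch, start, end)
# a 0 2
# b 3 6
# c 7 8
```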

Questions about Tuples

So I was able to get part of a program running with the code below (using tuples):
def reverse_string():
    string_in = str(input("Enter a string:"))
    length = -int(len(string_in))
    y = 0
    print("The reverse of your string is:")
    while y != length:
        print(string_in[y-1], end="")
        y = y - 1
reverse_string()
The output is:
Enter a string:I Love Python
The reverse of your string is:
nohtyP evoL I
I am still working out how to make the program reverse the order of the words instead of the letters.
The desired output would be "Python Love I".
Is there any way to input a string and then convert it to a tuple, as below?
So if I enter I Love Python, some code would produce variable = ("I", "Love", "Python"), and I could add further code from there...
Newbie Programmer,
Mac
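One possible approach, as a minimal sketch: str.split() breaks the input into words, tuple() turns the result into the tuple described above, and reversed() walks it back to front. The function name reverse_words is invented here, and the sentence is passed as an argument instead of being read with input() so that the example runs on its own.

```python
def reverse_words(sentence):
    """Return the sentence with its words in reverse order."""
    words = tuple(sentence.split())      # e.g. ("I", "Love", "Python")
    return " ".join(reversed(words))

print(reverse_words("I Love Python"))    # Python Love I
```

To use it interactively, the argument could be replaced by input("Enter a string:").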

Find the total number of 1's in binary form for a group of numbers in a list in Python 3

I want to count the total number of '1's in the binary format of each number in a list.
z = ['0b111000','0b1000011'] # z is a list
d = z.count('1')
print(d)
The output is 0.
The required output, however, should be of the form [3,3],
which is the number of ones in each element that z contains.
Here it is:
z=['0b111000','0b1000011']
finalData = []
for word in z:
    finalData.append(word.count('1'))
print(finalData)
The problem with your code was that you were calling the count() method on the list, where it counts matching elements, rather than on each string. You first need to get each string from the list and then call count() on it.
Hope this helps :)
z = ['0b111000','0b1000011']
d = z.count('1')
This attempts to find the number of times the string '1' is in z. This obviously returns 0 since z contains '0b111000' and '0b1000011'.
You should iterate over every string in z and count the number of '1' characters in each string:
z = ['0b111000','0b1000011']
output = [string.count('1') for string in z]
print(output)
# [3, 3]
list.count(x) counts the elements of the list that are equal to x; it does not look inside the elements.
Use a list comprehension to loop through each string and count the number of 1s in it, such as:
z = ['0b111000','0b1000011']
d = [x.count("1") for x in z]
print(d)
This will output:
[3, 3]
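If the list held integers rather than the strings produced by bin() (an assumption; the question only shows strings), the same idea would still work by converting each number first. The function name count_ones is made up for this sketch.

```python
def count_ones(numbers):
    """Return the number of 1 bits in each integer of the list."""
    # bin(56) gives '0b111000'; the '0b' prefix contains no '1', so count() is safe
    return [bin(n).count("1") for n in numbers]

print(count_ones([0b111000, 0b1000011]))  # [3, 3]
```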

ROT 13 Cipher: Creating a Function Python

I need to create a function that replaces a letter with the letter 13 letters after it in the alphabet (without using encode). I'm relatively new to Python so it has taken me a while to figure out a way to do this without using Encode.
Here's what I have so far. When I pass in a normal word like "hello" it works, but if I pass in a sentence with special characters I can't figure out how to include JUST letters of the alphabet and skip numbers, spaces, or special characters completely.
def rot13(b):
    b = b.lower()
    a = [chr(i) for i in range(ord('a'),ord('z')+1)]
    c = []
    d = []
    x = a[0:13]
    for i in b:
        c.append(a.index(i))
    for i in c:
        if i <= 13:
            d.append(a[i::13][1])
        elif i > 13:
            y = len(a[i:])
            z = len(x)- y
            d.append(a[z::13][0])
    e = ''.join(d)
    return e
EDIT
I tried using .isalpha() but this doesn't seem to be working for me - characters are duplicating for some reason when I use it. Is the following format correct:
def rot13(b):
    b1 = b.lower()
    a = [chr(i) for i in range(ord('a'),ord('z')+1)]
    c = []
    d = []
    x = a[0:13]
    for i in b1:
        if i.isalpha():
            c.append(a.index(i))
            for i in c:
                if i <= 12:
                    d.append(a[i::13][1])
                elif i > 12:
                    y = len(a[i:])
                    z = len(x)- y
                    d.append(a[z::13][0])
        else:
            d.append(i)
    if message[0].istitle() == True:
        d[0] = d[0].upper()
    e = ''.join(d)
    return e
Following on from the comments: the OP was advised to use isalpha and is wondering why that causes duplication (see the OP's edit).
This isn't tied to the use of isalpha; it has to do with the second for loop.
for i in c:
isn't necessary and is causing the duplication, so you should remove it. Instead, you can do the same by just using index = a.index(i). You were already computing this, but for some reason appending it to a list instead, which caused the confusion.
Use the index variable anywhere you would have used i inside the for i in c loop. On a side note, in nested for loops try not to reuse the same variable names. It just causes confusion, but that's a matter for code review.
Assuming you do all that right, it should work.
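For comparison, here is a minimal sketch of a single-loop version. It is not the OP's code with the fix applied but a simplified rewrite: it computes the shifted index with modulo arithmetic instead of slicing, and it keeps the case of every letter, which is a design choice of this sketch (the OP's version lowercases the input first).

```python
import string

def rot13(text):
    """Shift each ASCII letter 13 places, wrapping around; leave everything else as is."""
    lower = string.ascii_lowercase
    upper = string.ascii_uppercase
    result = []
    for ch in text:
        if ch in lower:
            result.append(lower[(lower.index(ch) + 13) % 26])
        elif ch in upper:
            result.append(upper[(upper.index(ch) + 13) % 26])
        else:
            # digits, spaces and punctuation pass through unchanged
            result.append(ch)
    return "".join(result)

print(rot13("Hello, World! 123"))  # Uryyb, Jbeyq! 123
```

Because 13 is half of 26, applying rot13 twice returns the original text.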

Python - Number with a variable part

I have a number given in the form 623746xyz3, and I have to write a Python script that prints all the numbers that can be created by combining all the values (from 0 to 9) that x, y, and z can take.
Can someone help me?
If those xyz are always next to each other, you can just loop from 0 to 999 and replace that part of the string accordingly.
s = "623746xyz3"
for xyz in range(1000):
    # zero-pad so that, for example, 7 fills all three places as 007
    sxyz = s.replace('xyz', str(xyz).zfill(3))
    print(int(sxyz))
In case the x, y, and z can be more 'spread out', you will need three nested loops:
for x in range(10):
    sx = s.replace('x', str(x))
    for y in range(10):
        sxy = sx.replace('y', str(y))
        for z in range(10):
            sxyz = sxy.replace('z', str(z))
            print(int(sxyz))
(And in case you do not know the 'variables' a priori, you will first need to find the non-digit characters and use a recursive approach to replace them.)
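For the case where the placeholders are not known in advance, one option is sketched below. Note that it swaps the recursion mentioned above for itertools.product, which enumerates the same digit combinations iteratively. The function name expand is invented for this example, and it assumes that every non-digit character in the template is a placeholder.

```python
from itertools import product

def expand(template, digits="0123456789"):
    """Return every number obtained by replacing each placeholder letter with a digit."""
    # distinct non-digit characters, e.g. ['x', 'y', 'z']
    placeholders = sorted({ch for ch in template if not ch.isdigit()})
    results = []
    for values in product(digits, repeat=len(placeholders)):
        mapping = dict(zip(placeholders, values))
        results.append(int("".join(mapping.get(ch, ch) for ch in template)))
    return results

numbers = expand("623746xyz3")
print(len(numbers), numbers[0], numbers[-1])  # 1000 6237460003 6237469993
```

A placeholder that occurs more than once in the template receives the same digit at each occurrence.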
My first idea:
for x in range(0, 10):
    for y in range(0, 10):
        for z in range(0, 10):
            print(6*1000000000+2*100000000+3*10000000+7*1000000+4*100000+6*10000+x*1000+y*100+z*10+3)