How to capture repeated occurrence using python 3 regex


Consider the sentence: W U T Sample A B C D
I'm trying to use re.search followed by groups() on the match to fetch A, B, C, D (the capital letters after 'Sample'). There could be a variable number of letters.
A few unsuccessful attempts:
A = re.search('Sample\s([A-Z])\s*([A-Z])*', 'W U T Sample A B C D')
A.groups()
('A', 'B')
A = re.search('Sample\s([A-Z])(\s*([A-Z]))*', 'W U T Sample A B C D')
A.groups()
('A', ' D', 'D')
A = re.search('Sample\s([A-Z])(?:\s*([A-Z]))*', 'W U T Sample A B C D')
A.groups()
('A', 'D')
I'm expecting A.groups() to give ('A', 'B', 'C', 'D')
Taking another example, 'XSS 55 D W Sample R G Y BH' should give the output ('R', 'G', 'Y', 'B', 'H')

Most regex engines, including Python's, overwrite a repeated capture group on each iteration, so only the final repetition is retained and your current approach cannot work. As a workaround, first isolate the substring you want, then apply re.findall to it:
import re

sentence = "W U T Sample A B C D"
text = re.search(r'Sample\s([A-Z](?:\s*[A-Z])*)', sentence).group(1)  # 'A B C D'
result = re.findall(r'[A-Z]', text)
print(result)  # ['A', 'B', 'C', 'D']
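The two-step idea above can be wrapped in a small helper. This is only a minimal sketch, not the answerer's code: the function name letters_after and its marker parameter are made up for illustration, and it returns an empty list when the marker is absent rather than raising.

```python
import re

def letters_after(sentence, marker="Sample"):
    """Return the capital letters that follow `marker` (hypothetical helper)."""
    # Step 1: isolate the run of capitals (optionally space-separated) after the marker.
    match = re.search(re.escape(marker) + r'\s([A-Z](?:\s*[A-Z])*)', sentence)
    if match is None:
        return []
    # Step 2: pull every capital letter out of the isolated substring.
    return re.findall(r'[A-Z]', match.group(1))

print(letters_after('W U T Sample A B C D'))
print(letters_after('XSS 55 D W Sample R G Y BH'))
```

Both sentences from the question should come back as lists of single letters, with BH split into B and H.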

Related

How to compare two lists of different length and map the items based on count

I have two lists of different lengths and I want to match the items based on their actual relation. One list holds the secondary structure elements and the other holds the aligned sequence. I want to match each secondary structure element to its residue in the other list, and pad the secondary structure with '-' wherever the aligned sequence has a gap. The items in ss correspond to RRCAVVTG in seq.
ss=['-', '-', 'E', 'E', 'E', 'E', 'S', 'S']
seq=["---------------RRCAVVTG"]
for m in seq:
    found=[i for i in list(m)]
    sscount=0
    sscount1=0
    for char,ssi in zip(found,ss):
        if char!='-' :
            print char , sscount, ssi
            sscount+=1
        else:
            print char, sscount1, '#'
            sscount1+=1
The expected results:
---------------##EEEESS
---------------RRCAVVTG
But I get the following results:
- 0 #
- 1 #
- 2 #
- 3 #
- 4 #
- 5 #
- 6 #
- 7 #

I hope I understood the question right. First we left-pad the string built from ss with '-' to the length of the sequence, then compare it character by character with the string inside seq using zip():
ss = ['-', '-', 'E', 'E', 'E', 'E', 'S', 'S']
seq = ["---------------RRCAVVTG"]
out = ''
for ch1, ch2 in zip('{:->{}}'.format(''.join(ss), len(seq[0])), seq[0]):
    if ch1=='-' and ch2 !='-':
        out += '#'
    elif ch1=='-' and ch2 == '-':
        out += '-'
    else:
        out += ch1
print(out)
print(seq[0])
Prints:
---------------##EEEESS
---------------RRCAVVTG
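The padding approach above relies on all the gaps sitting at the start of the sequence. If gaps can also occur in the middle, one alternative is to walk the aligned sequence and consume one ss item per residue. This is a sketch under that assumption; the function name align_ss is hypothetical and not part of the original answer.

```python
def align_ss(ss, aligned_seq):
    """Spread `ss` over the residues of `aligned_seq`, keeping '-' at gap positions."""
    remaining = iter(ss)
    out = []
    for ch in aligned_seq:
        if ch == '-':
            # Gap in the alignment: no structure element is consumed.
            out.append('-')
        else:
            # Residue: take the next element; if ss is exhausted, treat it as unassigned.
            element = next(remaining, '-')
            out.append('#' if element == '-' else element)
    return ''.join(out)

ss = ['-', '-', 'E', 'E', 'E', 'E', 'S', 'S']
print(align_ss(ss, "---------------RRCAVVTG"))
print(align_ss(ss, "RR--CAVVTG"))
```

The second call illustrates the case the padding trick does not cover: the two internal gaps stay as '-' and the structure elements shift past them.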

import itertools

d = []  # assumed: collects the ss symbol matched to each residue
for m in seq:
    found=[i for i in list(m)]
    sscount=0
    sscount1=0
    num=0
    for char,ssi in zip(found,itertools.cycle(ss)):
        if char!='-' :
            print char , sscount, ss[num]
            d.append(ss[num])
            num+=1
            sscount+=1
        else:
            print char, sscount1, '#'
            sscount1+=1

Regex for optional end-part of substring

Consider the following (highly simplified) string:
'a b a b c a b c a b c'
This is a repeating pattern of 'a b c' except at the beginning where the 'c' is missing.
I seek a regular expression which can give me the following matches by the use of re.findall():
[('a', 'b'), ('a', 'b', 'c'), ('a', 'b', 'c'), ('a', 'b', 'c')]
The string above thus has 4 matches of 'a b c', although the first match is a special case since the 'c' is missing.
My simplest attempt is where I try to capture 'a' and 'b' and use an optional capture for 'c':
re.findall(r'(a).*?(b).*?(c)?', 'a b a b c a b c a b c')
I get:
[('a', 'b', ''), ('a', 'b', ''), ('a', 'b', ''), ('a', 'b', '')]
Clearly, it has just ignored the c. When using non-optional capture for 'c' the search skips ahead prematurely and misses 'a' and 'b' in the second 'a b c'-substring. This results in 3 wrong matches:
[('a', 'b', 'c'), ('a', 'b', 'c'), ('a', 'b', 'c')]
I have tried several other techniques (for instance, '(?<=c)') to no avail.
Note: The string above is just a skeleton example of my "real-world" problem where the three letters above are themselves strings (from a long log-file) in between other strings and newlines from which I need to extract named groups.
I use Python 3.5.2 on Windows 7.

Since your a, b, and c are placeholders that may be single characters, character sequences, or anything else, you need a tempered greedy token to make sure the pattern does not run over into the next match in the same string. And since the c is optional, just wrap it in an optional non-capturing group, (?:...)?:
(a)(?:(?!a|b).)*(b)(?:(?:(?!a|b|c).)*(c))?
   ^^^^^^^^^^^^^   ^^^ ^^^^^^^^^^^^^^   ^
Details:
(a) - Group 1, capturing some a
(?:(?!a|b).)* - a tempered greedy token matching any char (other than a newline) that does not start an a or b sequence
(b) - Group 2, capturing some b
(?: - start of an optional non-capturing group, repeated 1 or 0 times
(?:(?!a|b|c).)* - a tempered greedy token matching any char (other than a newline) that does not start an a, b or c pattern
(c) - Group 3, capturing some c pattern
)? - end of the optional non-capturing group.
To obtain the list of tuples you need, build it yourself with a comprehension:
import re
r = r'(a)(?:(?!a|b).)*(b)(?:(?:(?!a|b|c).)*(c))?'
s = 'a b a b c a b c a b c'
# print(re.findall(r,s))
# That one is bad: [('a', 'b', ''), ('a', 'b', 'c'), ('a', 'b', 'c'), ('a', 'b', 'c')]
print([(a,b,c) if c else (a,b) for a,b,c in re.findall(r,s)])
# This one is good: [('a', 'b'), ('a', 'b', 'c'), ('a', 'b', 'c'), ('a', 'b', 'c')]
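Since the question mentions multi-character tokens and named groups, here is a sketch of the same tempered-greedy pattern with named groups. START, MID and END are made-up markers standing in for a, b and c, and re.DOTALL is added on the assumption that a record may span several lines; neither comes from the original answer.

```python
import re

# Hypothetical markers: START ~ a, MID ~ b, END ~ c.
pattern = re.compile(
    r'(?P<a>START)(?:(?!START|MID).)*'
    r'(?P<b>MID)(?:(?:(?!START|MID|END).)*(?P<c>END))?',
    re.DOTALL,  # assumption: let '.' cross newlines inside one record
)

log = "START x MID\nSTART y MID z END\nSTART MID END"

records = []
for m in pattern.finditer(log):
    # Drop the optional group when it did not participate in the match.
    records.append({name: value for name, value in m.groupdict().items()
                    if value is not None})

for record in records:
    print(record)
```

Using groupdict() avoids the empty-string placeholder that re.findall() produces for an unmatched optional group, because unmatched groups come back as None and can be filtered out directly.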

Writing with numbers

I need to write a function that receives a string of digits and outputs the letters that correspond to them, as if you were typing a message on a cell phone keypad. For example, to get the letter 'A' the input should be '2', to get the letter 'B' the input should be '22', etc. For example:
>>> number_to_word('222 2 333 33')
"CAFE"
The program needs to "go around" when a key's last letter is passed. For example, the letter 'A' can come from any of these inputs: '2', '2222', '2222222', etc. Just as when you're sending a text message on a cell phone, once you get past 'C' (which is '222') the program goes back to 'A', making '2222' another of its keys. Also, if the input string is '233', the program must separate the different digits, so '233' is equivalent to '2' '33'.
I made a dictionary like this:
dic={'2':'A', '22':'B', '222':'C', '3':'D', '33':'E', '333':'F',..... etc}
But I don't know how to make the program "go around" if the input is '2222', and how do I take that number and assign it to the letter 'A'.
Feel free to ask any questions if you don't understand what I'm asking I would be glad to explain it with more detail. Thank you.

This seems to give the desired result:
NUMBERS = {'2': 'A', '22': 'B', '222': 'C', '3': 'D', '33':'E', '333': 'F'}

def normalize_number(number):
    for item in number.split():
        if len(set(item)) == 1:
            yield item
        else:
            last = item[0]
            res = [last]
            for entry in item[1:]:
                if entry == last:
                    res.append(entry)
                else:
                    yield ''.join(res)
                    res = [entry]
                    last = entry
            yield ''.join(res)

def number_to_word(number):
    res = []
    for item in normalize_number(number):
        try:
            res.append(NUMBERS[item])
        except KeyError:
            if len(item) >= 4:
                end = len(item) % 3
                if end == 0:
                    end = 3
                res.append(NUMBERS[item[:end]])
    return ''.join(res)
Test it:
>>> number_to_word('222 2 333 33')
'CAFE'
>>> number_to_word('222 2 333 3333')
'CAFD'
>>> number_to_word('222 2 333 333333')
'CAFF'
>>> number_to_word('222 2333 3333')
'CAFD'
The function normalize_number() turns numbers that have different digits into multiple numbers with only the same digit:
>>> list(normalize_number('222 2 333 3333'))
['222', '2', '333', '3333']
>>> list(normalize_number('222 2333 3333'))
['222', '2', '333', '3333']
>>> list(normalize_number('222 2 333 53333'))
['222', '2', '333', '5', '3333']
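As an alternative to spelling out every key sequence in a dictionary, the wrap-around can be computed with a modulo on the number of presses. The following is a sketch, not the answerer's code: it assumes the standard phone keypad layout, only handles the digits 2 to 9, and uses itertools.groupby to split mixed input such as '233' into runs of equal digits.

```python
from itertools import groupby

# Standard phone keypad letters (assumed layout).
KEYPAD = {
    '2': 'ABC', '3': 'DEF', '4': 'GHI', '5': 'JKL',
    '6': 'MNO', '7': 'PQRS', '8': 'TUV', '9': 'WXYZ',
}

def number_to_word(number):
    letters = []
    for chunk in number.split():
        # groupby yields one run per stretch of identical digits, e.g. '233' -> '2', '33'.
        for digit, run in groupby(chunk):
            presses = len(list(run))
            keys = KEYPAD[digit]
            # (presses - 1) % len(keys) wraps around after the key's last letter.
            letters.append(keys[(presses - 1) % len(keys)])
    return ''.join(letters)

print(number_to_word('222 2 333 33'))
print(number_to_word('2222'))
```

Because the modulo uses the length of each key's letter string, keys with four letters such as 7 and 9 wrap after four presses rather than three.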

How to use the regular expression to make the Pig Latin game?

I am trying to get a single match for the first consonant or consonant cluster in an input. Then the program should move that consonant or cluster to the end of the word and add "ay" after it.
Here is my code
import re
consonants = [ 'bl', 'cl', 'fl', 'gl', 'pl', 'sl', 'br', 'cr', 'dr', 'fr', 'gr','pr', 'tr', 'sc', 'sk', 'sm', 'sn', 'sp', 'st', 'sw', 'tw','b', 'c', 'd', 'f', 'g', 'h', 'j', 'k', 'l', 'm', 'n', 'p', 'q', 'r', 's', 't', 'v', 'w', 'x', 'y', 'z']
word1 = str(input("Type something"))
word2 = re.split(r'[b-df-hj-np-tv-z]' or '[bl]''[cl]''[fl]', word1)
if any(consonants in word2 for consonants in consonants):
    print(word2[1] + word2[0] + word2[2] + "ay")
The output does not appear in the interactive console.

Right, Python does not do "magic": or is a well-defined operator that evaluates to its first truthy operand, not something that magically combines two regular expression strings into a new regular expression string. Since the first string is non-empty, r'[b-df-hj-np-tv-z]' or '[bl]''[cl]''[fl]' is simply r'[b-df-hj-np-tv-z]'. (You have to remember that you're talking to a computer, and computers are very stupid!)
To do the pig latin game you'll probably want to just gather the leading substring of non-vowels and then check whether it's 0-length (the word starts with a vowel) or not.
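A minimal sketch of that suggestion follows. The function name pig_latin is made up, 'y' is treated as a consonant, and words that start with a vowel simply get "ay" appended, which is only one of several Pig Latin conventions. The first statement also shows what the or expression from the question evaluates to.

```python
import re

# `or` returns its first truthy operand, so only the first pattern is kept.
chosen = r'[b-df-hj-np-tv-z]' or '[bl]''[cl]''[fl]'
print(chosen)

def pig_latin(word):
    """Move the leading non-vowel cluster to the end and append 'ay'."""
    # The pattern can match the empty string, so re.match always succeeds here.
    cluster = re.match(r'[^aeiouAEIOU]*', word).group(0)
    rest = word[len(cluster):]
    # An empty cluster means the word starts with a vowel; it is left in place.
    return rest + cluster + 'ay'

for word in ('string', 'glove', 'apple'):
    print(word, '->', pig_latin(word))
```

Matching the whole leading run of non-vowels makes the explicit list of clusters ('bl', 'cl', 'str', ...) unnecessary.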

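Just solved the program.
import re
words1 = input("Input Sentence:")
# The groups alternate between runs of non-vowels and runs of vowels.
b1 = re.search(r"([^aeoiu]*)([aeoiu]*)([^aeoiu]*)([aeoiu]*)([^aeoiu]*)", words1)
b2 = b1.group(1)  # leading consonant cluster (empty if the input starts with a vowel)
b3 = b1.group(2)
b4 = b1.group(3)
b5 = b1.group(4)
b6 = b1.group(5)
# Move the leading consonant cluster to the end and append "ay".
print(b3 + b4 + b5 + b6 + b2 + "ay")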

Python- How do I sort a list that the script is building to replicate another word?

I'm trying to implement a hangman game. I want part of the function to check whether a letter is correct or incorrect. After a letter is found to be correct, it is placed in a "used letters" list and a "correct letters" list. The correct letters list is built up as the game goes on. I'd like the list to be sorted to match the hidden word as the game progresses.
For instance let's use the word "hardware"
If someone guessed "e, a, and h" it would come out like
correct = ["e", "a", "h"]
I would like it to sort the list so it would go
correct = ["h", "a", "e"]
then
correct = ["h", "a", "r", "a", "e"] after r has been guessed.
I also need to know whether it would see that "a" is in there twice and place it twice.
My code doesn't allow you to win yet, but you can lose. It's a work in progress.
I also can't get the letters-left counter to work. I've made the code print the list to check whether it was adding the letters, and it is, so I don't know what's up there.
import random

def hangman():
    correct = []
    guessed = []
    guess = ""
    words = ["source", "alpha", "patch", "system"]
    sWord = random.choice(words)
    wLen = len(sWord)
    cLen = len(correct)
    remaining = int(wLen - cLen)
    print "Welcome to hangman.\n"
    print "You've got 3 tries or the guy dies."
    turns = 3
    while turns > 0:
        guess = str(raw_input("Take a guess. >"))
        if guess in sWord:
            correct.append(guess)
            guessed.append(guess)
            print "Great!, %d letters left." % remaining
        else:
            print "Incorrect, this poor guy's life is in your hands."
            guessed.append(guess)
            turns -= 1
            print "You have %d turns left." % turns
        if turns == 0:
            print "HE'S DEAD AND IT'S ALL YOUR FAULT! ARE YOU HAPPY?"
            print "YOU LOST ON PURPOSE, DIDN'T YOU?!"

hangman()

I'm not entirely clear on the desired behavior because:
correct = ["h", "a", "r", "a", "e"] after r has been guessed.
This is strange because a has only been guessed once, but shows up once for each time it appears in hardware. Should r also appear twice? If that is the correct behavior, then a very simple list comprehension can be used:
def result(guesses, key):
    print [c for c in key if c in guesses]
In [560]: result('eah', 'hardware')
['h', 'a', 'a', 'e']
In [561]: result('eahr', 'hardware')
['h', 'a', 'r', 'a', 'r', 'e']
Iterate the letters in key and include them if the letter has been used as a "guess".
You can also have it insert a place holder for unfound characters fairly easily by using:
def result(guesses, key):
    print [c if c in guesses else '_' for c in key]
    print ' '.join([c if c in guesses else '_' for c in key])
In [567]: result('eah', 'hardware')
['h', 'a', '_', '_', '_', 'a', '_', 'e']
h a _ _ _ a _ e
In [568]: result('eahr', 'hardware')
['h', 'a', 'r', '_', '_', 'a', 'r', 'e']
h a r _ _ a r e
In [569]: result('eahrzw12', 'hardware')
['h', 'a', 'r', '_', 'w', 'a', 'r', 'e']
h a r _ w a r e
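On the letters-left counter from the question: remaining is computed once before the loop starts, so it never changes while guesses are added. One possible fix is to recompute it from the word and the guesses after every turn. The sketch below is written for Python 3, and the helper names reveal and letters_left are hypothetical.

```python
def reveal(secret, guesses):
    """Show guessed letters in word order and '_' for everything else."""
    return ' '.join(c if c in guesses else '_' for c in secret)

def letters_left(secret, guesses):
    """Count the positions in `secret` that have not been guessed yet."""
    return sum(1 for c in secret if c not in guesses)

guesses = set()
for letter in 'eah':
    guesses.add(letter)
    # Recompute both values after each guess instead of once up front.
    print(reveal('hardware', guesses), '|', letters_left('hardware', guesses), 'left')
```

A result of zero from letters_left() could also serve as the win condition that the posted loop is missing.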
