Regex for optional end-part of substring


Consider the following (highly simplified) string:
'a b a b c a b c a b c'
This is a repeating pattern of 'a b c' except at the beginning where the 'c' is missing.
I seek a regular expression which can give me the following matches by the use of re.findall():
[('a', 'b'), ('a', 'b', 'c'), ('a', 'b', 'c'), ('a', 'b', 'c')]
The string above thus has 4 matches of 'a b c', although the first match is a special case since the 'c' is missing.
My simplest attempt is where I try to capture 'a' and 'b' and use an optional capture for 'c':
re.findall(r'(a).*?(b).*?(c)?', 'a b a b c a b c a b c')
I get:
[('a', 'b', ''), ('a', 'b', ''), ('a', 'b', ''), ('a', 'b', '')]
Clearly, it has just ignored the c. When using non-optional capture for 'c' the search skips ahead prematurely and misses 'a' and 'b' in the second 'a b c'-substring. This results in 3 wrong matches:
[('a', 'b', 'c'), ('a', 'b', 'c'), ('a', 'b', 'c')]
I have tried several other techniques (for instance, '(?<=c)') to no avail.
Note: The string above is just a skeleton example of my "real-world" problem where the three letters above are themselves strings (from a long log-file) in between other strings and newlines from which I need to extract named groups.
I use Python 3.5.2 on Windows 7.

Since your a, b, and c are placeholders, and you cannot know if those are single characters, or character sequences, or anything else, you need to use a tempered greedy token to make sure the pattern does not overflow to the other matches in the same string, and since the c is optional, just wrap it with a (?:...)? optional non-capturing group:
(a)(?:(?!a|b).)*(b)(?:(?:(?!a|b|c).)*(c))?
^^^^^^^^^^^^^ ^^^ ^^^^^^^^^^^^^^ ^
See the regex demo
Details:
(a) - Group 1 capturing some a
(?:(?!a|b).)* - a tempered greedy token matching any char (other than a newline) that does not start an a or b sequence
(b) - Group 2 capturing some b
(?: - start of an optional non-capturing group, repeated 1 or 0 times
(?:(?!a|b|c).)* - a tempered greedy token that matches any char (other than a newline) that does not start an a, b or c pattern
(c) - Group 3 capturing some c pattern
)? - end of the optional non-capturing group.
To obtain the tuple list you need, you need to build it yourself using comprehension:
import re
r = r'(a)(?:(?!a|b).)*(b)(?:(?:(?!a|b|c).)*(c))?'
s = 'a b a b c a b c a b c'
# print(re.findall(r,s))
# That one is bad: [('a', 'b', ''), ('a', 'b', 'c'), ('a', 'b', 'c'), ('a', 'b', 'c')]
print([(a,b,c) if c else (a,b) for a,b,c in re.findall(r,s)])
# This one is good: [('a', 'b'), ('a', 'b', 'c'), ('a', 'b', 'c'), ('a', 'b', 'c')]
See the Python demo
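Since the question mentions longer marker strings, newlines and named groups, the same pattern can be sketched for that case. This is only an illustration under assumptions: the markers BEGIN, DATA and DONE and the helper names build_pattern and find_records are hypothetical, and re.DOTALL is added so that the dot also crosses newlines, which the single-line example above does not need.

```python
import re

def build_pattern(a, b, c):
    """Compile the tempered-greedy-token pattern for three literal markers."""
    # Escape the markers so they are treated as plain text, not regex syntax.
    a, b, c = (re.escape(x) for x in (a, b, c))
    template = (
        r"(?P<a>{a})(?:(?!{a}|{b}).)*(?P<b>{b})"
        r"(?:(?:(?!{a}|{b}|{c}).)*(?P<c>{c}))?"
    )
    # re.DOTALL is an assumption: it lets '.' match newlines in a log file.
    return re.compile(template.format(a=a, b=b, c=c), re.DOTALL)

def find_records(text, a, b, c):
    """Return one tuple per match, dropping the optional group when absent."""
    records = []
    for m in build_pattern(a, b, c).finditer(text):
        groups = (m.group("a"), m.group("b"), m.group("c"))
        records.append(tuple(g for g in groups if g is not None))
    return records

# Hypothetical log text: the first record has no closing marker.
log = "BEGIN x DATA\nBEGIN y DATA z DONE BEGIN DATA DONE"
print(find_records(log, "BEGIN", "DATA", "DONE"))
```

With finditer, an optional group that did not take part in the match is None rather than an empty string, so the filter can test for None directly.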

Related

How to capture repeated occurrence using python 3 regex

Consider sentence : W U T Sample A B C D
I'm trying to use re.groups after re.search to fetch A, B, C, D (letters in caps after 'Sample'). There could be variable number of letters
Few unsuccessful attempts :
A = re.search('Sample\s([A-Z])\s*([A-Z])*', 'W U T Sample A B C D')
A.groups()
('A', 'B')
A = re.search('Sample\s([A-Z])(\s*([A-Z]))*', 'W U T Sample A B C D')
A.groups()
('A', ' D', 'D')
A = re.search('Sample\s([A-Z])(?:\s*([A-Z]))*', 'W U T Sample A B C D')
A.groups()
('A', 'D')
I'm expecting A.groups() to give ('A', 'B', 'C', 'D')
Taking another example, 'XSS 55 D W Sample R G Y BH' should give the output ('R', 'G', 'Y', 'B', 'H')

Most regex engines, including Python's, will overwrite a repeating capture group. So, the repeating capture group you see will just be the final one, and your current approach will not work. As a workaround, we can try first isolating the substring you want, and then applying re.findall:
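import re
s = "W U T Sample A B C D"
# Isolate everything after 'Sample' that consists of capital letters and whitespace
text = re.search(r'Sample\s([A-Z](?:\s*[A-Z])*)', s).group(1) # 'A B C D'
# Then pull out the individual letters
result = re.findall(r'[A-Z]', text)
print(result)
# ['A', 'B', 'C', 'D']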
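The two-step workaround can be wrapped in a small helper that also covers the second example from the question. This is a sketch only: the function name letters_after and the choice to return an empty list when the marker is missing are assumptions, not part of the original answer.

```python
import re

def letters_after(marker, text):
    """Return the capital letters that follow `marker`, one per list item."""
    # Step 1: isolate the run of capital letters (and whitespace) after the marker.
    m = re.search(re.escape(marker) + r'\s([A-Z](?:\s*[A-Z])*)', text)
    if m is None:
        return []
    # Step 2: split that run into single letters.
    return re.findall(r'[A-Z]', m.group(1))

print(letters_after('Sample', 'W U T Sample A B C D'))
print(letters_after('Sample', 'XSS 55 D W Sample R G Y BH'))
```

Because \s* allows zero whitespace characters, 'BH' is picked up as part of the run and then split into 'B' and 'H' by the second step.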

Regex to split by square brackets and dots with python and re module

I want to build a regex expression to split by '.' and '[]', but here, I would want to keep the result between square brackets.
I mean:
import re
pattern = re.compile("\.|[\[-\]]")
my_string = "a.b.c[0].d.e[12]"
pattern.split(my_string)
# >>> ['a', 'b', 'c', '0', '', 'd', 'e', '12', '']
But I would wish to get the following output (without any empty string):
# >>> ['a', 'b', 'c', '0', 'd', 'e', '12']
Would be it possible? I've tested with a lot of regex patterns and that is the best which I've found but it's not perfect.

You can use a quantifier in your regex and filter:
>>> pattern = re.compile(r'[.\[\]]+')
>>> my_string = "a.b.c[0].d.e[12]"
>>> list(filter(None, pattern.split(my_string)))
['a', 'b', 'c', '0', 'd', 'e', '12']
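A different technique from the one in the answer, shown for comparison: instead of splitting on the separators and filtering out empty strings, one can match the runs of non-separator characters directly with re.findall. The function name split_path is hypothetical.

```python
import re

def split_path(path):
    """Return the pieces of `path` between dots and square brackets."""
    # Match maximal runs of characters that are not '.', '[' or ']'.
    return re.findall(r'[^.\[\]]+', path)

print(split_path("a.b.c[0].d.e[12]"))
```

Since the pattern requires at least one character per match, no empty strings can appear in the result.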

Replacing multiple occurrences of a character or string inside parentheses in R

I am trying to replace commas within all sets of parentheses with a semicolon, but not change any commas outside of the parentheses.
So, for example:
"a, b, c (1, 2, 3), d, e (4, 5)"
should become:
"a, b, c (1; 2; 3), d, e (4; 5)"
I have started attempting this with gsub, but I am having a really hard time figuring out how to identify the commas within the parentheses.
I would call myself an advanced beginner with R, but with regular expressions and text manipulations, a total noob. Any help you can provide would be great.

The simplest solution
The most common workaround, which works as long as all parentheses are balanced and not nested:
,(?=[^()]*\))
See the regex demo. R code:
a <- "a, b, c (1, 2, 3), d, e (4, 5)"
gsub(",(?=[^()]*\\))", ";", a, perl=T)
## [1] "a, b, c (1; 2; 3), d, e (4; 5)"
See IDEONE demo
The regex matches...
, - a comma if...
(?=[^()]*\)) - it is followed by 0 or more characters other than ( or ) (with [^()]*) and a literal ).
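The lookahead is not specific to R. As a rough illustration, the same pattern works unchanged with Python's re module; the function name commas_to_semicolons is hypothetical.

```python
import re

def commas_to_semicolons(text):
    """Replace commas that are followed by a ')' with no '(' or ')' in between."""
    return re.sub(r",(?=[^()]*\))", ";", text)

print(commas_to_semicolons("a, b, c (1, 2, 3), d, e (4, 5)"))
```

A comma outside the parentheses fails the lookahead because [^()]* stops at the next '(' before any ')' is reached.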
Alternative solutions
If you need to make sure only commas inside the closest open and close parentheses are replaced, it is safer to use a gsubfn based approach:
library(gsubfn)
x <- 'a, b, c (1, 2, 3), d, e (4, 5)'
gsubfn('\\(([^()]*)\\)', function(match) gsub(',', ';', match, fixed=TRUE), x, backref=0)
## => [1] "a, b, c (1; 2; 3), d, e (4; 5)"
Here, \(([^()]*)\) matches (, then 0+ chars other than ( and ) and then ), and after that the match found is passed to the anonymous function where all , chars are replaced with semi-colons using gsub.
If you need to perform this replacement inside balanced parentheses with unknown level depth use a PCRE regex with gsubfn:
x1 <- 'a, b, c (1, (2, (3, 4)), 5), d, e (4, 5)'
gsubfn('\\(((?:[^()]++|(?R))*)\\)', function(match) gsub(',', ';', match, fixed=TRUE), x1, backref=0, perl=TRUE)
## => [1] "a, b, c (1; (2; (3; 4)); 5), d, e (4; 5)"
Pattern details
\( # Open parenthesis
( # Start group 1
(?: # Start of a non-capturing group:
[^()]++ # Any 1 or more chars other than '(' and ')'
| # OR
(?R) # Recursively match the entire pattern
)* # End of the non-capturing group and repeat it zero or more times
) # End of Group 1 (its value will be passed to the `gsub` via `match`)
\) # A literal ')'
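Recursion with (?R) is a PCRE feature; Python's built-in re module does not support it. Where no recursive regex is available, a plain depth counter gives the same result for nested parentheses. The sketch below swaps in that technique and is not the method from the answer above; the function name and its parameters are hypothetical, and treating a stray ')' as depth 0 is an assumption.

```python
def replace_in_parens(text, old=",", new=";"):
    """Replace `old` with `new` wherever the character sits inside parentheses."""
    depth = 0
    out = []
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            # Assumption: an unmatched ')' is ignored rather than going negative.
            depth = max(depth - 1, 0)
        out.append(new if ch == old and depth > 0 else ch)
    return "".join(out)

print(replace_in_parens("a, b, c (1, (2, (3, 4)), 5), d, e (4, 5)"))
```

The counter handles any nesting depth in a single pass over the string.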

gsub("(?<=\\d),", ";", string, perl=T)

Python- How do I sort a list that the script is building to replicate another word?

I'm trying to implement a hangman game. I want part of the function to check if a letter is correct or incorrect. After a letter is found to be correct, it will be placed in a "used letters" list and a "correct letters" list. The correct letters list will be built as the game goes on. I'd like it to sort the list to match the hidden word as the game is going.
For instance let's use the word "hardware"
If someone guessed "e, a, and h" it would come out like
correct = ["e", "a", "h"]
I would like it to sort the list so it would go
correct = ["h", "a", "e"]
then
correct = ["h", "a", "r", "a", "e"] after r has been guessed.
I also need to know if it would also see that "a" is in there twice and place it twice.
My code doesn't let you win yet, but you can lose. It's a work in progress.
I also can't get the letters-left counter to work. I've made the code print the list to check whether it was adding the letters, and it is. So I don't know what's up there.
def hangman():
    correct = []
    guessed = []
    guess = ""
    words = ["source", "alpha", "patch", "system"]
    sWord = random.choice(words)
    wLen = len(sWord)
    cLen = len(correct)
    remaining = int(wLen - cLen)
    print "Welcome to hangman.\n"
    print "You've got 3 tries or the guy dies."
    turns = 3
    while turns > 0:
        guess = str(raw_input("Take a guess. >"))
        if guess in sWord:
            correct.append(guess)
            guessed.append(guess)
            print "Great!, %d letters left." % remaining
        else:
            print "Incorrect, this poor guy's life is in your hands."
            guessed.append(guess)
            turns -= 1
            print "You have %d turns left." % turns
        if turns == 0:
            print "HE'S DEAD AND IT'S ALL YOUR FAULT! ARE YOU HAPPY?"
            print "YOU LOST ON PURPOSE, DIDN'T YOU?!"
hangman()

I'm not entirely clear on the desired behavior because:
correct = ["h", "a", "r", "a", "e"] after r has been guessed.
This is strange because a has only been guessed once, but shows up each time it appears in hardware. Should r also appear twice? If that is the correct behavior, then a very simple list comprehension can be used:
def result(guesses, key):
    print [c for c in key if c in guesses]
In [560]: result('eah', 'hardware')
['h', 'a', 'a', 'e']
In [561]: result('eahr', 'hardware')
['h', 'a', 'r', 'a', 'r', 'e']
Iterate the letters in key and include them if the letter has been used as a "guess".
You can also have it insert a place holder for unfound characters fairly easily by using:
def result(guesses, key):
    print [c if c in guesses else '_' for c in key]
    print ' '.join([c if c in guesses else '_' for c in key])
In [567]: result('eah', 'hardware')
['h', 'a', '_', '_', '_', 'a', '_', 'e']
h a _ _ _ a _ e
In [568]: result('eahr', 'hardware')
['h', 'a', 'r', '_', '_', 'a', 'r', 'e']
h a r _ _ a r e
In [569]: result('eahrzw12', 'hardware')
['h', 'a', 'r', '_', 'w', 'a', 'r', 'e']
h a r _ w a r e
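The same comprehension can also drive the letters-left counter from the question. In the question's code, remaining is computed once before the loop, so it never changes. A minimal sketch that recomputes it from the current guesses follows; the function names reveal and letters_left are hypothetical, and the functions return values instead of printing so that they run under both Python 2 and Python 3.

```python
def reveal(guesses, key):
    """Return the word as a list, with '_' for letters not guessed yet."""
    return [c if c in guesses else '_' for c in key]

def letters_left(guesses, key):
    """Count the positions in `key` whose letter has not been guessed."""
    return sum(1 for c in key if c not in guesses)

guesses = 'eah'
print(' '.join(reveal(guesses, 'hardware')))
print(letters_left(guesses, 'hardware'))
```

Calling letters_left after every guess keeps the count current, and a result of 0 can serve as the winning condition.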

Perform multiple search-and-replaces on the colnames of a dataframe

I have a dataframe with 95 cols and want to batch-rename a lot of them with simple regexes, like the snippet at bottom, there are ~30 such lines. Any other columns which don't match the search regex must be left untouched.
**** Example: names(tr) = c('foo', 'bar', 'xxx_14', 'xxx_2001', 'yyy_76', 'baz', 'zzz_22', ...) ****
I started out with a wall of 25 gsub()s - crude but effective:
names(tr) <- gsub('_1$', '_R', names(tr))
names(tr) <- gsub('_14$', '_I', names(tr))
names(tr) <- gsub('_22$', '_P', names(tr))
names(tr) <- gsub('_50$', '_O', names(tr))
... yada yada
@Joshua: mapply doesn't work, turns out it's more complicated and impossible to vectorize. names(tr) contains other columns, and when these patterns do occur, you cannot assume all of them occur, let alone in the exact order we defined them. Hence, try 2 is:
pattern <- paste('_', c('1','14','22','50','52','57','76','1018','2001','3301','6005'), '$', sep='')
replace <- paste('_', c('R','I', 'P', 'O', 'C', 'D', 'M', 'L', 'S', 'K', 'G'), sep='')
do.call(gsub, list(pattern, replace, names(tr)))
Warning messages:
1: In function (pattern, replacement, x, ignore.case = FALSE, perl = FALSE, :
argument 'pattern' has length > 1 and only the first element will be used
2: In function (pattern, replacement, x, ignore.case = FALSE, perl = FALSE, :
argument 'replacement' has length > 1 and only the first element will be used
Can anyone fix this for me?
EDIT: I read all around SO and R doc on this subject for over a day and couldn't find anything... then when I post it I think of searching for '[r] translation table' and I find xlate. Which is not mentioned anywhere in the grep/sub/gsub documentation.
Is there anything in base/gsubfn/data.table etc. to allow me to write one search-and-replacement instruction? (like a dictionary or translation table)
Can you improve my clunky syntax to be call-by-reference to tr? (mustn't create temp copy of entire df)
EDIT2: my best effort after reading around was:
The dictionary approach (xlate) might be a partial answer, but this is more than a simple translation table since the regex must be terminal (e.g. '_14$').
I could use gsub() or strsplit() to split on '_' then do my xlate translation on the last component, then paste() them back together. Looking for a cleaner 1/2-line idiom.
Or else I just use walls of gsub()s.

A wall of gsub calls can always be replaced by a for loop, and you can write it as a function:
renamer <- function(x, pattern, replace) {
  for (i in seq_along(pattern))
    x <- gsub(pattern[i], replace[i], x)
  x
}
names(tr) <- renamer(
  names(tr),
  sprintf('_%s$', c('1','14','22','50','52','57','76','1018','2001','3301','6005')),
  sprintf('_%s' , c('R','I', 'P', 'O', 'C', 'D', 'M', 'L', 'S', 'K', 'G'))
)
I also find sprintf more useful than paste for creating this kind of string.
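For readers who want the translation-table idea from the question in a single pass, here is a rough sketch in Python rather than R. It is a different technique from the loop above: one regex captures the trailing numeric suffix and a dictionary supplies the replacement. The names SUFFIX_CODES and rename_columns are hypothetical, and leaving unknown suffixes untouched is an assumption based on the question's requirement.

```python
import re

# Translation table taken from the question: numeric suffix -> letter code.
SUFFIX_CODES = {
    '1': 'R', '14': 'I', '22': 'P', '50': 'O', '52': 'C', '57': 'D',
    '76': 'M', '1018': 'L', '2001': 'S', '3301': 'K', '6005': 'G',
}

def rename_columns(names, codes=SUFFIX_CODES):
    """Replace a trailing '_<number>' with '_<code>' when the number is known."""
    def swap(match):
        number = match.group(1)
        # Unknown suffixes are left as they are.
        return '_' + codes.get(number, number)
    return [re.sub(r'_(\d+)$', swap, name) for name in names]

print(rename_columns(['foo', 'bar', 'xxx_14', 'xxx_2001', 'yyy_76', 'baz', 'zzz_22']))
```

Because the whole suffix is captured before the lookup, '_14' cannot be mistaken for '_1', and the order of the table entries does not matter.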

The question predates the boom of the tidyverse but this is easily solved with the c(pattern1 = replacement1) option in stringr::str_replace_all.
tr <- data.frame("whatevs_1" = NA, "something_52" = NA)
tr
#> whatevs_1 something_52
#> 1 NA NA
patterns <- sprintf('_%s$', c('1','14','22','50','52','57','76','1018','2001','3301','6005'))
replacements <- sprintf('_%s' , c('R','I', 'P', 'O', 'C', 'D', 'M', 'L', 'S', 'K', 'G'))
names(replacements) <- patterns
names(tr) <- stringr::str_replace_all(names(tr), replacements)
tr
#> whatevs_R something_C
#> 1 NA NA
And of course, this particular case can benefit from dplyr
dplyr::rename_all(tr, stringr::str_replace_all, replacements)
#> whatevs_R something_C
#> 1 NA NA

Using do.call() nearly does it, it objects to differing arg lengths. I think I need to nest do.call() inside apply(), like in apply function to elements over a list.
But I need a partial do.call() over pattern and replace.
This is all starting to make a wall of gsub(..., fixed=TRUE) look like a more efficient idiom, if flabby code.
pattern <- paste('_', c('1','14','22','50'), '$', sep='')
replace <- paste('_', c('R','I', 'P', 'O'), sep='')
do.call(gsub, list(pattern, replace, names(tr)))
Warning messages:
1: In function (pattern, replacement, x, ignore.case = FALSE, perl = FALSE, :
argument 'pattern' has length > 1 and only the first element will be used
2: In function (pattern, replacement, x, ignore.case = FALSE, perl = FALSE, :
argument 'replacement' has length > 1 and only the first element will be used
