Related
I'm trying to match a mathematical-expression-like string, that have nested parentheses.
import re
p = re.compile('\(.+\)')
str = '(((1+0)+1)+1)'
print p.findall(s)
['(((1+0)+1)+1)']
I wanted it to match all the enclosed expressions, such as (1+0), ((1+0)+1)...
I don't even care if it matches unwanted ones like (((1+0), I can take care of those.
Why it's not doing that already, and how can I do it?
As others have mentioned, regular expressions are not the way to go for nested constructs. I'll give a basic example using pyparsing:
import pyparsing # make sure you have this installed
thecontent = pyparsing.Word(pyparsing.alphanums) | '+' | '-'
parens = pyparsing.nestedExpr( '(', ')', content=thecontent)
Here's a usage example:
>>> parens.parseString("((a + b) + c)")
Output:
( # all of str
[
( # ((a + b) + c)
[
( # (a + b)
['a', '+', 'b'], {}
), # (a + b) [closed]
'+',
'c'
], {}
) # ((a + b) + c) [closed]
], {}
) # all of str [closed]
(With newlining/indenting/comments done manually)
Edit: Modified to eliminate unnecessary Forward, as per Paul McGuire's suggestions.
To get the output in nested list format:
res = parens.parseString("((12 + 2) + 3)")
res.asList()
Output:
[[['12', '+', '2'], '+', '3']]
There is a new regular engine module being prepared to replace the existing one in Python. It introduces a lot of new functionality, including recursive calls.
import regex
s = 'aaa(((1+0)+1)+1)bbb'
result = regex.search(r'''
(?<rec> #capturing group rec
\( #open parenthesis
(?: #non-capturing group
[^()]++ #anyting but parenthesis one or more times without backtracking
| #or
(?&rec) #recursive substitute of group rec
)*
\) #close parenthesis
)
''',s,flags=regex.VERBOSE)
print(result.captures('rec'))
Output:
['(1+0)', '((1+0)+1)', '(((1+0)+1)+1)']
Related bug in regex: http://code.google.com/p/mrab-regex-hg/issues/detail?id=78
Regex languages aren't powerful enough to matching arbitrarily nested constructs. For that you need a push-down automaton (i.e., a parser). There are several such tools available, such as PLY.
Python also provides a parser library for its own syntax, which might do what you need. The output is extremely detailed, however, and takes a while to wrap your head around. If you're interested in this angle, the following discussion tries to explain things as simply as possible.
>>> import parser, pprint
>>> pprint.pprint(parser.st2list(parser.expr('(((1+0)+1)+1)')))
[258,
[327,
[304,
[305,
[306,
[307,
[308,
[310,
[311,
[312,
[313,
[314,
[315,
[316,
[317,
[318,
[7, '('],
[320,
[304,
[305,
[306,
[307,
[308,
[310,
[311,
[312,
[313,
[314,
[315,
[316,
[317,
[318,
[7, '('],
[320,
[304,
[305,
[306,
[307,
[308,
[310,
[311,
[312,
[313,
[314,
[315,
[316,
[317,
[318,
[7,
'('],
[320,
[304,
[305,
[306,
[307,
[308,
[310,
[311,
[312,
[313,
[314,
[315,
[316,
[317,
[318,
[2,
'1']]]]],
[14,
'+'],
[315,
[316,
[317,
[318,
[2,
'0']]]]]]]]]]]]]]]],
[8,
')']]]]],
[14,
'+'],
[315,
[316,
[317,
[318,
[2,
'1']]]]]]]]]]]]]]]],
[8, ')']]]]],
[14, '+'],
[315,
[316,
[317,
[318, [2, '1']]]]]]]]]]]]]]]],
[8, ')']]]]]]]]]]]]]]]],
[4, ''],
[0, '']]
You can ease the pain with this short function:
def shallow(ast):
if not isinstance(ast, list): return ast
if len(ast) == 2: return shallow(ast[1])
return [ast[0]] + [shallow(a) for a in ast[1:]]
>>> pprint.pprint(shallow(parser.st2list(parser.expr('(((1+0)+1)+1)'))))
[258,
[318,
'(',
[314,
[318, '(', [314, [318, '(', [314, '1', '+', '0'], ')'], '+', '1'], ')'],
'+',
'1'],
')'],
'',
'']
The numbers come from the Python modules symbol and token, which you can use to build a lookup table from numbers to names:
map = dict(token.tok_name.items() + symbol.sym_name.items())
You could even fold this mapping into the shallow() function so you can work with strings instead of numbers:
def shallow(ast):
if not isinstance(ast, list): return ast
if len(ast) == 2: return shallow(ast[1])
return [map[ast[0]]] + [shallow(a) for a in ast[1:]]
>>> pprint.pprint(shallow(parser.st2list(parser.expr('(((1+0)+1)+1)'))))
['eval_input',
['atom',
'(',
['arith_expr',
['atom',
'(',
['arith_expr',
['atom', '(', ['arith_expr', '1', '+', '0'], ')'],
'+',
'1'],
')'],
'+',
'1'],
')'],
'',
'']
The regular expression tries to match as much of the text as possible, thereby consuming all of your string. It doesn't look for additional matches of the regular expression on parts of that string. That's why you only get one answer.
The solution is to not use regular expressions. If you are actually trying to parse math expressions, use a real parsing solutions. If you really just want to capture the pieces within parenthesis, just loop over the characters counting when you see ( and ) and increment a decrement a counter.
Stack is the best tool for the job: -
import re
def matches(line, opendelim='(', closedelim=')'):
stack = []
for m in re.finditer(r'[{}{}]'.format(opendelim, closedelim), line):
pos = m.start()
if line[pos-1] == '\\':
# skip escape sequence
continue
c = line[pos]
if c == opendelim:
stack.append(pos+1)
elif c == closedelim:
if len(stack) > 0:
prevpos = stack.pop()
# print("matched", prevpos, pos, line[prevpos:pos])
yield (prevpos, pos, len(stack))
else:
# error
print("encountered extraneous closing quote at pos {}: '{}'".format(pos, line[pos:] ))
pass
if len(stack) > 0:
for pos in stack:
print("expecting closing quote to match open quote starting at: '{}'"
.format(line[pos-1:]))
In the client code, since the function is written as a generator function simply use the for loop pattern to unroll the matches: -
line = '(((1+0)+1)+1)'
for openpos, closepos, level in matches(line):
print(line[openpos:closepos], level)
This test code produces following on my screen, noticed the second param in the printout indicates the depth of the parenthesis.
1+0 2
(1+0)+1 1
((1+0)+1)+1 0
From a linked answer:
From the LilyPond convert-ly utility (and written/copyrighted by myself, so I can show it off here):
def paren_matcher (n):
# poor man's matched paren scanning, gives up
# after n+1 levels. Matches any string with balanced
# parens inside; add the outer parens yourself if needed.
# Nongreedy.
return r"[^()]*?(?:\("*n+r"[^()]*?"+r"\)[^()]*?)*?"*n
convert-ly tends to use this as paren_matcher (25) in its regular expressions which is likely overkill for most applications. But then it uses it for matching Scheme expressions.
Yes, it breaks down after the given limit, but the ability to just plug it into regular expressions still beats the "correct" alternatives supporting unlimited depth hands-down in usability.
Balanced pairs (of parentheses, for example) is an example of a language that cannot be recognized by regular expressions.
What follows is a brief explanation of the math for why that is.
Regular expressions are a way of defining finite state automata (abbreviated FSM). Such a device has a finite amount of possible state to store information. How that state can be used is not particularly restricted, but it does mean that there are an absolute maximum number of distinct positions it can recognize.
For example, the state can be used for counting, say, unmatched left parentheses. But because the amount of state for that kind of counting must be completely bounded, then a given FSM can count to a maximum of n-1, where n is the number of states the FSM can be in. If n is, say, 10, then the maximum number of unmatched left parenthesis the FSM can match is 10, until it breaks. Since it's perfectly possible to have one more left parenthesis, there is no possible FSM that can correctly recognize the complete language of matched parentheses.
So what? Suppose you just pick a really large n? The problem is that as a way of describing FSM, regular expressions basically describe all of the transitions from one state to another. Since for any N, an FSM would need 2 state transitions (one for matching a left parenthesis, and one for matching right), the regular expression itself must grow by at least a constant factor multiple of n
By comparison, the next better class of languages, (context free grammars) can solve this problem in a totally compact way. Here's an example in BNF
expression ::= `(` expression `)` expression
| nothing
I believe this function may suit your need, I threw this together fast so feel free to clean it up a bit. When doing nests its easy to think of it backwards and work from there =]
def fn(string,endparens=False):
exp = []
idx = -1
for char in string:
if char == "(":
idx += 1
exp.append("")
elif char == ")":
idx -= 1
if idx != -1:
exp[idx] = "(" + exp[idx+1] + ")"
else:
exp[idx] += char
if endparens:
exp = ["("+val+")" for val in exp]
return exp
You can use regexps, but you need to do the recursion yourself. Something like the following does the trick (if you only need to find, as your question says, all the expressions enclosed into parentheses):
import re
def scan(p, string):
found = p.findall(string)
for substring in found:
stripped = substring[1:-1]
found.extend(scan(p, stripped))
return found
p = re.compile('\(.+\)')
string = '(((1+0)+1)+1)'
all_found = scan(p, string)
print all_found
This code, however, does not match the 'correct' parentheses. If you need to do that you will be better off with a specialized parser.
Here is a demo for your question, though it is clumsy, while it works
import re s = '(((1+0)+1)+1)'
def getContectWithinBraces( x , *args , **kwargs):
ptn = r'[%(left)s]([^%(left)s%(right)s]*)[%(right)s]' %kwargs
Res = []
res = re.findall(ptn , x)
while res != []:
Res = Res + res
xx = x.replace('(%s)' %Res[-1] , '%s')
res = re.findall(ptn, xx)
print(res)
if res != []:
res[0] = res[0] %('(%s)' %Res[-1])
return Res
getContectWithinBraces(s , left='\(\[\{' , right = '\)\]\}')
my solution is that: define a function to extract content within the outermost parentheses, and then you call that function repeatedly until you get the content within the innermost parentheses.
def get_string_inside_outermost_parentheses(text):
content_p = re.compile(r"(?<=\().*(?=\))")
r = content_p.search(text)
return r.group()
def get_string_inside_innermost_parentheses(text):
while '(' in text:
text = get_string_inside_outermost_parentheses(text)
return text
You should write a proper parser for parsing such expression (e.g. using pyparsing).
Regular expressions are not an appropriate tool for writing decent parsers.
Many posts suggest that for nested braces,
REGEX IS NOT THE WAY TO DO IT. SIMPLY COUNT THE BRACES:
For example, see: Regular expression to detect semi-colon terminated C++ for & while loops
Here is a complete python sample to iterate through a string and count braces:
# decided for nested braces to not use regex but brace-counting
import re, string
texta = r'''
nonexistent.\note{Richard Dawkins, \textit{Unweaving the Rainbow: Science, Delusion
and the Appetite for Wonder} (Boston: Houghton Mifflin Co., 1998), pp. 302, 304,
306-309.} more text and more.
This is a statistical fact, not a
guess.\note{Zheng Wu, \textit{Cohabitation: An Alternative Form
of Family Living} (Ontario, Canada: Oxford University Press,
2000), p. 149; \hbox{Judith} Treas and Deirdre Giesen, ``Title
and another title,''
\textit{Journal of Marriage and the Family}, February 2000,
p.\,51}
more and more text.capitalize
'''
pos = 0
foundpos = 0
openBr = 0 # count open braces
while foundpos <> -1:
openBr = 0
foundpos = string.find(texta, r'\note',pos)
# print 'foundpos',foundpos
pos = foundpos + 5
# print texta[pos]
result = ""
while foundpos > -1 and openBr >= 0:
pos = pos + 1
if texta[pos] == "{":
openBr = openBr + 1
if texta[pos] == "}":
openBr = openBr - 1
result = result + texta[pos]
result = result[:-1] # drop the last } found.
result = string.replace(result,'\n', ' ') # replace new line with space
print result
Consider the following (highly simplified) string:
'a b a b c a b c a b c'
This is a repeating pattern of 'a b c' except at the beginning where the 'c' is missing.
I seek a regular expression which can give me the following matches by the use of re.findall():
[('a', 'b'), ('a', 'b', 'c'), ('a', 'b', 'c'), ('a', 'b', 'c')]
The string above thus have 4 matches of 'a b c' - although with the first match as a special case since the 'c' is missing.
My simplest attempt is where I try to capture 'a' and 'b' and use an optional capture for 'c':
re.findall(r'(a).*?(b).*?(c)?', 'a b a b c a b c a b c')
I get:
[('a', 'b', ''), ('a', 'b', ''), ('a', 'b', ''), ('a', 'b', '')]
Clearly, it has just ignored the c. When using non-optional capture for 'c' the search skips ahead prematurely and misses 'a' and 'b' in the second 'a b c'-substring. This results in 3 wrong matches:
[('a', 'b', 'c'), ('a', 'b', 'c'), ('a', 'b', 'c')]
I have tried several other techniques (for instance, '(?<=c)') to no avail.
Note: The string above is just a skeleton example of my "real-world" problem where the three letters above are themselves strings (from a long log-file) in between other strings and newlines from which I need to extract named groups.
I use Python 3.5.2 on Windows 7.
Since your a, b, and c are placeholders, and you cannot know if those are single characters, or character sequences, or anything else, you need to use a tempered greedy token to make sure the pattern does not overflow to the other matches in the same string, and since the c is optional, just wrap it with a (?:...)? optional non-capturing group:
(a)(?:(?!a|b).)*(b)(?:(?:(?!a|b|c).)*(c))?
^^^^^^^^^^^^^ ^^^ ^^^^^^^^^^^^^^ ^
See the regex demo
Details:
(a) - Group 1 capturing some a
(?:(?!a|b).)* - a tempered greedy token matching any char not starting a a or b sequences
(b) - Group 2 capturing some b
(?: - start of an optional non-capturing group, repeated 1 or 0 times
(?:(?!a|b|c).)* - a tempered greedy token that matches any char but a newline that starts a a, b or c pattern
(c) - Group 3 capturing some c pattern
)? - end of the optional non-capturing group.
To obtain the tuple list you need, you need to build it yourself using comprehension:
import re
r = r'(a)(?:(?!a|b).)*(b)(?:(?:(?!a|b|c).)*(c))?'
s = 'a b a b c a b c a b c'
# print(re.findall(r,s))
# That one is bad: [('a', 'b', ''), ('a', 'b', 'c'), ('a', 'b', 'c'), ('a', 'b', 'c')]
print([(a,b,c) if c else (a,b) for a,b,c in re.findall(r,s)])
# This one is good: [('a', 'b'), ('a', 'b', 'c'), ('a', 'b', 'c'), ('a', 'b', 'c')]
See the Python demo
I'm trying to implement a hangman game. I want part of the function to check if a letter is correct or incorrect. After a letter is found to be correct it will place the letter in a "used letters" list and a "correct letters list" The correct letters list will be built as the game goes on. I'd like it to sort the list to match the hidden word as the game is going.
For instance let's use the word "hardware"
If someone guessed "e, a, and h" it would come out like
correct = ["e", "a", "h"]
I would like it to sort the list so it would go
correct = ["h", "a", "e"]
then
correct = ["h", "a", "r", "a", "e"] after r has been guessed.
I also need to know if it would also see that "a" is in there twice and place it twice.
My code that doesn't allow you to win but you can lose. It's a work in progress.
I also can't get the letters left counter to work. I've made the code print the list to check if it was adding the letters. it is. So I don't know what's up there.
def hangman():
correct = []
guessed = []
guess = ""
words = ["source", "alpha", "patch", "system"]
sWord = random.choice(words)
wLen = len(sWord)
cLen = len(correct)
remaining = int(wLen - cLen)
print "Welcome to hangman.\n"
print "You've got 3 tries or the guy dies."
turns = 3
while turns > 0:
guess = str(raw_input("Take a guess. >"))
if guess in sWord:
correct.append(guess)
guessed.append(guess)
print "Great!, %d letters left." % remaining
else:
print "Incorrect, this poor guy's life is in your hands."
guessed.append(guess)
turns -= 1
print "You have %d turns left." % turns
if turns == 0:
print "HE'S DEAD AND IT'S ALL YOUR FAULT! ARE YOU HAPPY?"
print "YOU LOST ON PURPOSE, DIDN'T YOU?!"
hangman()
I'm not entirely clear on the desired behavior because:
correct = ["h", "a", "r", "a", "e"] after r has been guessed.
This is strange because a has only been guessed once, but shows up for each time it appears in hardware. Should r should also appear twice? If that is the correct behavior, then a very simple list comprehension can be used:
def result(guesses, key):
print [c for c in key if c in guesses]
In [560]: result('eah', 'hardware')
['h', 'a', 'a', 'e']
In [561]: result('eahr', 'hardware')
['h', 'a', 'r', 'a', 'r', 'e']
Iterate the letters in key and include them if the letter has been used as a "guess".
You can also have it insert a place holder for unfound characters fairly easily by using:
def result(guesses, key):
print [c if c in guesses else '_' for c in key]
print ' '.join([c if c in guesses else '_' for c in key])
In [567]: result('eah', 'hardware')
['h', 'a', '_', '_', '_', 'a', '_', 'e']
h a _ _ _ a _ e
In [568]: result('eahr', 'hardware')
['h', 'a', 'r', '_', '_', 'a', 'r', 'e']
h a r _ _ a r e
In [569]: result('eahrzw12', 'hardware')
['h', 'a', 'r', '_', 'w', 'a', 'r', 'e']
h a r _ w a r e
This question already has answers here:
Replace multiple letters with accents with gsub
(11 answers)
Closed 9 years ago.
I'm currently running the following code to clean my data from accent characters:
df <- gsub('Á|Ã', 'A', df)
df <- gsub('É|Ê', 'E', df)
df <- gsub('Í', 'I', df)
df <- gsub('Ó|Õ', 'O', df)
df <- gsub('Ú', 'U', df)
df <- gsub('Ç', 'C', df)
However, I would like to do it in just one line (using another function for it would be ok). How can I do this?
Try something like this
iconv(c('Á'), "utf8", "ASCII//TRANSLIT")
You can just add more elements to the c().
EDIT: it is machine dependent, check help(iconv)
Here is the R solution
mychar <- c('ÁÃÉÊÍÓÕÚÇ')
iconv(mychar, "latin1", "ASCII//TRANSLIT") # one line, as requested
[1] "AAEEIOOUC"
It an encoding problem, Normally you resolve it by indicating the right encoding. If you still want to use regular expression to do it , you can use gsubfn to write one liner solution:
library(gsubfn)
ll <- list('Á'='A', 'Ã'='A', 'É'='E',
'Ê'='E', 'Í'='I', 'Ó'='O',
'Õ'='O', 'Ú'='U', 'Ç'='C')
gsubfn('Á|Ã|É|Ê|Í|Ó|Õ|Ú|Ç',ll,'ÁÃÉÊÍÓÕÚÇ')
[1] "AAEEIOOUC"
gsubfn('Á|Ã|É|Ê|Í|Ó|Õ|Ú|Ç',ll,c('ÁÃÉÊÍÓÕÚÇ','ÍÓÕÚÇ'))
[1] "AAEEIOOUC" "IOOUC"
One option could be chartr
> toreplace <- LETTERS
> replacewith <- letters
> (somestring <- paste(sample(LETTERS,10),collapse=""))
[1] "MUXJVYNZQH"
>
> chartr(
+ old=paste(toreplace,collapse=""),
+ new=paste(replacewith,collapse=""),
+ x=somestring
+ )
[1] "muxjvynzqh"
df = as.data.frame(apply(df,2,function(x) gsub('Á|Ã', 'A', df)))
2 indicates columns and 1 indicates rows
I have a dataframe with 95 cols and want to batch-rename a lot of them with simple regexes, like the snippet at bottom, there are ~30 such lines. Any other columns which don't match the search regex must be left untouched.
**** Example: names(tr) = c('foo', 'bar', 'xxx_14', 'xxx_2001', 'yyy_76', 'baz', 'zzz_22', ...) ****
I started out with a wall of 25 gsub()s - crude but effective:
names(tr) <- gsub('_1$', '_R', names(tr))
names(tr) <- gsub('_14$', '_I', names(tr))
names(tr) <- gsub('_22$', '_P', names(tr))
names(tr) <- gsub('_50$', '_O', names(tr))
... yada yada
#Joshua: mapply doesn't work, turns out it's more complicated and impossible to vectorize. names(tr) contains other columns, and when these patterns do occur, you cannot assume all of them occur, let alone in the exact order we defined them. Hence, try 2 is:
pattern <- paste('_', c('1','14','22','50','52','57','76','1018','2001','3301','6005'), '$', sep='')
replace <- paste('_', c('R','I', 'P', 'O', 'C', 'D', 'M', 'L', 'S', 'K', 'G'), sep='')
do.call(gsub, list(pattern, replace, names(tr)))
Warning messages:
1: In function (pattern, replacement, x, ignore.case = FALSE, perl = FALSE, :
argument 'pattern' has length > 1 and only the first element will be used
2: In function (pattern, replacement, x, ignore.case = FALSE, perl = FALSE, :
argument 'replacement' has length > 1 and only the first element will be used
Can anyone fix this for me?
EDIT: I read all around SO and R doc on this subject for over a day and couldn't find anything... then when I post it I think of searching for '[r] translation table' and I find xlate. Which is not mentioned anywhere in the grep/sub/gsub documentation.
Is there anything in base/gsubfn/data.table etc. to allow me to write one search-and-replacement instruction? (like a dictionary or translation table)
Can you improve my clunky syntax to be call-by-reference to tr? (mustn't create temp copy of entire df)
EDIT2: my best effort after reading around was:
The dictionary approach (xlate) might be a partial answer to, but this is more than a simple translation table since the regex must be terminal (e.g. '_14$').
I could use gsub() or strsplit() to split on '_' then do my xlate translation on the last component, then paste() them back together. Looking for a cleaner 1/2-line idiom.
Or else I just use walls of gsub()s.
Wall of gsub could be always replace by for-loop. And you can write it as a function:
renamer <- function(x, pattern, replace) {
for (i in seq_along(pattern))
x <- gsub(pattern[i], replace[i], x)
x
}
names(tr) <- renamer(
names(tr),
sprintf('_%s$', c('1','14','22','50','52','57','76','1018','2001','3301','6005')),
sprintf('_%s' , c('R','I', 'P', 'O', 'C', 'D', 'M', 'L', 'S', 'K', 'G'))
)
And I found sprintf more useful than paste for creation this kind of strings.
The question predates the boom of the tidyverse but this is easily solved with the c(pattern1 = replacement1) option in stringr::str_replace_all.
tr <- data.frame("whatevs_1" = NA, "something_52" = NA)
tr
#> whatevs_1 something_52
#> 1 NA NA
patterns <- sprintf('_%s$', c('1','14','22','50','52','57','76','1018','2001','3301','6005'))
replacements <- sprintf('_%s' , c('R','I', 'P', 'O', 'C', 'D', 'M', 'L', 'S', 'K', 'G'))
names(replacements) <- patterns
names(tr) <- stringr::str_replace_all(names(tr), replacements)
tr
#> whatevs_R something_C
#> 1 NA NA
And of course, this particular case can benefit from dplyr
dplyr::rename_all(tr, stringr::str_replace_all, replacements)
#> whatevs_R something_C
#> 1 NA NA
Using do.call() nearly does it, it objects to differing arg lengths. I think I need to nest do.call() inside apply(), like in apply function to elements over a list.
But I need a partial do.call() over pattern and replace.
This is all starting to make a wall of gsub(..., fixed=TRUE) look like a more efficient idiom, if flabby code.
pattern <- paste('_', c('1','14','22','50'), '$', sep='')
replace <- paste('_', c('R','I', 'P', 'O'), sep='')
do.call(gsub, list(pattern, replace, names(tr)))
Warning messages:
1: In function (pattern, replacement, x, ignore.case = FALSE, perl = FALSE, :
argument 'pattern' has length > 1 and only the first element will be used
2: In function (pattern, replacement, x, ignore.case = FALSE, perl = FALSE, :
argument 'replacement' has length > 1 and only the first element will be used