import re

pattern = r"\.?(?P<sentence>.*?\(([A-Za-z0-9_]+)\).*?)\."
with open(searchfile) as f:
    for line in f:
        match = re.search(pattern, line)
        if match is not None:
            print(match.group("sentence"))
I am trying to extract every sentence that contains an acronym in parentheses (essentially 2-4 capital letters in parentheses).
In: Here is an (ABC) example. Do not include this sentence. Include this (AB) one. And (AVCD) this one.
Out: Here is an (ABC) example. Include this (AB) one. And (AVCD) this one.
You can use this:
[^.]*?\([A-Z]{2,4}\)[^.]*\.
But note that this is a particularly inefficient approach, since the pattern starts with a very permissive subpattern. You can improve it a little by adding a kind of anchor at the beginning:
(?:(?<=\.)|^)[^.]*?\([A-Z]{2,4}\)[^.]*\.
Unfortunately, even with this anchor, the regex engine must test both alternatives at most positions in the string.
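As a quick check, the first pattern can be tried against the question's sample text with re.findall. This is only a sketch: the helper name sentences_with_acronyms is made up for illustration, and stripping the leading space from each match is my own addition.

```python
import re

# First pattern from the answer: a sentence containing a 2-4 letter
# upper-case acronym in parentheses, ending at the next dot.
ACRONYM_SENTENCE = re.compile(r'[^.]*?\([A-Z]{2,4}\)[^.]*\.')

def sentences_with_acronyms(text):
    """Return the sentences of `text` that contain a parenthesised acronym."""
    # Hypothetical helper: each match keeps the space that followed the
    # previous sentence, so it is stripped here.
    return [m.strip() for m in ACRONYM_SENTENCE.findall(text)]

sample = ('Here is an (ABC) example. Do not include this sentence. '
          'Include this (AB) one. And (AVCD) this one.')
print(' '.join(sentences_with_acronyms(sample)))
```

Sentences without an acronym are never matched because [^.] cannot step over the dot that ends them, so every attempt starting inside such a sentence fails. That is also where the wasted work comes from.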
A better approach might be to match either a sentence-ending punctuation mark or a substring running from the acronym to the end of its sentence, and then to extract the sentences using the end offset of each match:
#!/usr/bin/python
import re

txt = 'Here is an (ABC) example. Do not include this sentence. Include this (AB) one. And (AVCD) this one.'
pattern = re.compile(r'([!.?])(?=\s)|\([A-Z]{2,4}\)[^.]*(?:\.|$)')
offset = 0
result = ''
for m in pattern.finditer(txt):
    # group 1 is set only for a bare sentence terminator (no acronym found)
    if m.group(1) is None:
        result += txt[offset:m.end()]
    offset = m.end()
print(result)
Note: you can't be sure that a dot marks the end of a sentence; it can be something else (part of an abbreviation or a decimal number, for example).
A slightly more efficient pattern (the possessive quantifiers ++ need Python 3.11+ or the third-party regex package):
([^.(]++\([^.)]++\)[^.)]++\.)
I'd like to match three-character sequences of letters (only the letters 'a', 'b', 'c' are allowed) separated by commas (the last group is not followed by a comma).
Examples:
abc,bca,cbb
ccc,abc,aab,baa
bcb
I have written the following regular expression:
re.match('([abc][abc][abc],)+', "abc,defx,df")
However, it doesn't work correctly, because for the example above:
>>> print(bool(re.match('([abc][abc][abc],)+', "abc,defx,df"))) # defx in second group
True
>>> print(bool(re.match('([abc][abc][abc],)+', "axc,defx,df"))) # 'x' in first group
False
It seems to check only the first group of three letters and ignore the rest. How do I write this regular expression correctly?
Try the following regex:
^[abc]{3}(,[abc]{3})*$
^...$ anchors the match from the start to the end of the string
[...] matches one of the given characters
...{3} repeats the preceding element three times
(...)* repeats the group in parentheses zero or more times
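A minimal sketch of how this pattern might be used for validation. The function name is_valid is hypothetical, and re.fullmatch is used in place of the explicit ^...$ anchors (which also avoids $ matching just before a trailing newline).

```python
import re

# Same pattern as above, without the anchors: fullmatch requires the
# whole string to match.
TRIPLES = re.compile(r'[abc]{3}(,[abc]{3})*')

def is_valid(s):
    """True if s is a comma-separated list of three-letter groups over a, b, c."""
    return TRIPLES.fullmatch(s) is not None

for s in ("abc,bca,cbb", "bcb", "abc,defx,df", "abc,"):
    print(s, is_valid(s))
```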
What you're asking it to find with your regex is "at least one triple of letters a, b, c" - that's what "+" gives you. Whatever follows after that doesn't really matter to the regex. You might want to include "$", which means "end of the line", to be sure that the line must all consist of allowed triples. However in the current form your regex would also demand that the last triple ends in a comma, so you should explicitly code that it's not so.
Try this:
re.match('([abc][abc][abc],)*([abc][abc][abc])$', s)
This finds any number of allowed triples followed by a comma (maybe zero), then a triple without a comma, then the end of the line.
Edit: including the "^" (start of string) symbol is not necessary, because the match method already checks for a match only at the beginning of the string.
The obligatory "you don't need a regex" solution:
all(letter in 'abc,' for letter in data) and all(len(item) == 3 for item in data.split(','))
You need to iterate over the sequence of found values.
data_string = "abc,bca,df"
imatch = re.finditer(r'(?P<value>[abc]{3})(,|$)', data_string)
for match in imatch:
    print(match.group('value'))
So a regex that checks whether the whole string matches the pattern will be
data_string = "abc,bca,df"
match = re.match(r'^[abc]{3}(,[abc]{3})*$', data_string)
if match:
    print("data string is correct")
Your result is not surprising since the regular expression
([abc][abc][abc],)+
only requires one or more repetitions of three [abc] characters followed by a comma at the start of the string; whatever comes after them is ignored. So the most important part is to make sure that there is nothing more in the string, as scessor suggests by adding ^ (start of string) and $ (end of string) to the regular expression.
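To see what the original pattern actually consumed, it can help to print the matched text. A small illustration using only the standard re module:

```python
import re

# re.match anchors at the start of the string but not at the end,
# so the match stops as soon as the repetition cannot continue.
m = re.match(r'([abc][abc][abc],)+', "abc,defx,df")
matched_text = m.group(0)
print(repr(matched_text))            # the prefix that satisfied the pattern
print(m.end(), len("abc,defx,df"))   # the match ends well before the string does
```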
An alternative without using regex (albeit a brute force way):
>>> import itertools
>>> def matcher(x):
...     total = ["".join(p) for p in itertools.product(('a','b','c'),repeat=3)]
...     for i in x.split(','):
...         if i not in total:
...             return False
...     return True
...
>>> matcher("abc,bca,aaa")
True
>>> matcher("abc,bca,xyz")
False
>>> matcher("abc,aaa,bb")
False
If your aim is to validate a string as being composed of triplets of the letters a, b and c:
for ss in ("abc,bbc,abb,baa,bbb",
           "acc",
           "abc,bbc,abb,bXa,bbb",
           "abc,bbc,ab,baa,bbb"):
    print(ss, ' ', bool(re.match(r'([abc]{3},?)+\Z', ss)))
Result:
abc,bbc,abb,baa,bbb True
acc True
abc,bbc,abb,bXa,bbb False
abc,bbc,ab,baa,bbb False
\Z means the end of the string. Its presence forces the match to extend to the very end of the string.
By the way, I like Sonya's form too; in a way it is clearer:
bool(re.match('([abc]{3},)*[abc]{3}\Z',ss))
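The two patterns are not quite equivalent, though. The sketch below compares them on two edge cases of my own (not from the original post): because the comma in the first pattern is optional, it also accepts groups with no separator and a trailing comma.

```python
import re

lenient = re.compile(r'([abc]{3},?)+\Z')          # comma optional after each triple
strict = re.compile(r'([abc]{3},)*[abc]{3}\Z')    # comma required between triples

for ss in ("abc,bca", "abcabc", "abc,"):
    print(ss, bool(lenient.match(ss)), bool(strict.match(ss)))
```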
To just repeat a sequence of patterns, you need to use a non-capturing group, a (?:...)-like construct, and apply a quantifier right after the closing parenthesis. The question mark and the colon after the opening parenthesis are the syntax that creates a non-capturing group.
For example:
(?:abc)+ matches strings like abc, abcabc, abcabcabc, etc.
(?:\d+\.){3} matches strings like 1.12.2., 000.00000.0., etc.
Here, you can use
^[abc]{3}(?:,[abc]{3})*$
          ^^
Note that using a capturing group is fraught with unwelcome effects in a lot of Python regex methods. A classic issue is described in the "re.findall behaves weird" post, for example: re.findall, and every other regex method that uses it behind the scenes, returns only the captured substrings if there is a capturing group in the pattern.
In Pandas, it is also important to use non-capturing groups when you just need to group a pattern sequence: Series.str.contains will complain that "This pattern has match groups. To actually get the groups, use str.extract.", and Series.str.extract, Series.str.extractall and Series.str.findall will behave like re.findall.
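The difference is easy to see with re.findall. A short illustration on a made-up input string:

```python
import re

text = "abcabc xx abc"

# With a capturing group, findall returns what the group captured
# (its last repetition), not the whole match.
with_capture = re.findall(r'(abc)+', text)

# With a non-capturing group, findall returns the whole match.
without_capture = re.findall(r'(?:abc)+', text)

print(with_capture)
print(without_capture)
```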
My input is of this format: (xxx)yyyy(zz)(eee)fff where {x,y,z,e,f} are all numbers. The fff part is optional, though.
Input: x = (123)4567(89)(660)
Expected output: Only the eee part, i.e. the number inside the 3rd "()", which is 660 in my example.
I am able to achieve this so far:
re.search("\((\d*)\)", x).group()
Output: (123)
Expected: (660)
I am surely missing something fundamental. Please advise.
Edit 1: Just added fff to the input data format.
You could find all the matches enclosed in parentheses () with findall and print the third one:
import re
n = "(123)4567(89)(660)999"
r = re.findall("\(\d*\)", n)
print(r[2])
Output:
(660)
The (eee) part is identical to the (xxx) part in your regex. If you don't provide an anchor, or some sequencing requirement, then an unanchored search will match the first thing it finds, which is (xxx) in your case.
If you know the (eee) always appears at the end of the string, you could append an "at-end" anchor ($) to force the match at the end. Or perhaps you could append a following character, like a space or comma or something.
Otherwise, you might do well to match the other parts of the pattern and not capture them:
pattern = r'[0-9()]{13}\((\d{3})\)'
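The end-anchor idea could look like the sketch below. It assumes that (eee) is always the last parenthesised group and that only the optional digits fff may follow it; the function name last_group is made up for illustration.

```python
import re

def last_group(s):
    """Return the digits inside the final (...) group, or None if there is none."""
    # \d*$ allows the optional trailing fff digits before the end of the string.
    m = re.search(r'\((\d+)\)\d*$', s)
    return m.group(1) if m else None

print(last_group("(123)4567(89)(660)"))
print(last_group("(123)4567(89)(660)999"))
```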
If you want to get the third group of numbers in brackets, you need to skip the first two groups which you can do with a repeating non-capturing group which looks for a set of digits enclosed in () followed by some number of non ( characters:
x = '(123)4567(89)(660)'
print(re.search("(?:\(\d+\)[^(]*){2}(\(\d+\))", x).group(1))
Output:
(660)
I am trying to pull a certain number from various strings. The number has to be standalone, before ', or before (. The regex I came up with was:
\b(?<!\()(x)\b(,|\(|'|$) <- x is the numeric number.
If x is 2, this pulls the following string (almost) fine, except it also pulls 2'abd'. Any advice what I did wrong here?
2(2'Abf',3),212,2'abc',2(1,2'abd',3)
Your actual question, as I understand it, is how to get this specific number except where it occurs inside parentheses.
To do so, I suggest using the skip_what_to_avoid|what_i_want pattern like this:
(\((?>[^()\\]++|\\.|(?1))*+\))
|\b(2)(?=\b(?:,|\(|'|$))
The idea here is to completely disregard the overall matches and the first group, which uses a recursive pattern to capture everything between parentheses: (\((?>[^()\\]++|\\.|(?1))*+\)). That's the trash bin. Instead, we only need to check capture group 2, which, when set, contains the number found outside the parentheses.
Sample Code:
import regex as re  # third-party package; the standard re module has no recursion

regex = r"(\((?>[^()\\]++|\\.|(?1))*+\))|\b(2)(?=\b(?:,|\(|'|$))"
test_str = "2(2'Abf',3),212,2'abc',2(1,2'abd',3)"
matches = re.finditer(regex, test_str, re.MULTILINE)
for matchNum, match in enumerate(matches):
    matchNum = matchNum + 1
    # group 2 is set only when the number was found outside parentheses
    if match.groups()[1] is not None:
        print("Found at {start}-{end}: {group}".format(start=match.start(2), end=match.end(2), group=match.group(2)))
Output:
Found at 0-1: 2
Found at 16-17: 2
Found at 23-24: 2
This solution requires the alternative Python regex package.
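If installing the regex package is not an option, a rough standard-library alternative is to record the parenthesis depth of every character and keep only the matches found at depth zero. This is a sketch of a different technique from the recursive pattern above: it assumes balanced parentheses, ignores backslash escapes, and the function name find_outside_parens is hypothetical.

```python
import re

def find_outside_parens(text, number):
    """Return start offsets of standalone `number` occurrences outside parentheses."""
    # depths[i] is the parenthesis nesting depth at character i
    # (the parentheses themselves count as depth 0 at the outermost level).
    depth, depths = 0, []
    for ch in text:
        if ch == ')':
            depth -= 1
        depths.append(depth)
        if ch == '(':
            depth += 1
    # Standalone number followed by a comma, an opening parenthesis,
    # a single quote, or the end of the string.
    pattern = re.compile(r"\b" + re.escape(str(number)) + r"\b(?=[,(']|$)")
    return [m.start() for m in pattern.finditer(text) if depths[m.start()] == 0]

test_str = "2(2'Abf',3),212,2'abc',2(1,2'abd',3)"
print(find_outside_parens(test_str, 2))
```

The trade-off is one extra pass over the string in exchange for avoiding the dependency.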
There is a problem that I need to do, but there are some caveats that make it hard.
Problem: Match on all non-empty strings over the alphabet {abc} that contain at most one a.
Examples
a
abc
bbca
bbcabb
Nonexample
aa
bbaa
Caveats: You cannot use a lookahead/lookbehind.
What I have is this:
^[bc]*a?[bc]*$
but it matches empty strings. Maybe a hint? Idk anything would help
(And if it matters, I'm using python).
As I understand your question, the only problem is that your current pattern matches empty strings. To prevent this, you can use a word boundary \b to require at least one word character.
^\b[bc]*a?[bc]*$
Another option would be to alternate in a group. Match an a surrounded by any amount of [bc] or one or more [bc] from start to end which could look like: ^(?:[bc]*a[bc]*|[bc]+)$
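Both patterns can be checked against the examples from the question. A small test sketch (the empty string is added as an extra case):

```python
import re

with_boundary = re.compile(r'^\b[bc]*a?[bc]*$')
with_alternation = re.compile(r'^(?:[bc]*a[bc]*|[bc]+)$')

examples = ["a", "abc", "bbca", "bbcabb"]
nonexamples = ["aa", "bbaa", ""]

for pattern in (with_boundary, with_alternation):
    # Collect every test string the pattern accepts
    accepted = [s for s in examples + nonexamples if pattern.match(s)]
    print(pattern.pattern, accepted)
```

The \b version rejects the empty string because a word boundary needs a word character on at least one side, and an empty string has none.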
The way I understood the issue was that any character in the alphabet should match, just only one a character.
Match on all non-empty strings over the alphabet... at most one a
^[b-z]*a?[b-z]*$
If spaces can be included:
^([b-z]*\s?)*a?([b-z]*\s?)*$
You do not even need a regex here; you might as well use .count() and a list comprehension:
data = """a,abc,bbca,bbcabb,aa,bbaa,something without the bespoken letter,ooo"""

def filter(string, char):
    return [word
            for word in string.split(",")
            for c in [word.count(char)]
            if c in [0, 1]]

print(filter(data, 'a'))
Yielding
['a', 'abc', 'bbca', 'bbcabb', 'something without the bespoken letter', 'ooo']
You've got to positively match something excluding the empty string,
using only a, b, or c letters. But can't use assertions.
Here is what you do.
The regex ^(?:[bc]*a[bc]*|[bc]+)$
The explanation
 ^                      # BOS
 (?:                    # Cluster choice
      [bc]* a [bc]*     # only 1 [a] allowed, arbitrary [bc]'s
   |                    # or,
      [bc]+             # no [a]'s only [bc]'s ( so must be some )
 )                      # End cluster
 $                      # EOS
I've got two files:
1st: Entries.txt
confirmation.resend
send
confirmation.showResendForm
login.header
login.loginBtn
2nd: Used_Entries.txt
confirmation.showResendForm = some value
login.header = some other value
I want to find all entries from the first file (Entries.txt) that have not been assigned a value in the 2nd file (Used_Entries.txt).
In this example I'd like the following result:
confirmation.resend
send
login.loginBtn
In the result confirmation.showResendForm and login.header do not show up because these exist in the Used_Entries.txt
How do I do this? I've been playing around with regular expressions but haven't been able to solve it. A bash script or sth would be much appreciated!
You can do this with regex. A single regex can't read two files at once, though, and we do want to match both contents in one pass, so you will need a little code in your language of choice to concatenate the contents of the two files with at least one newline in between.
This regex solution expects your string to be matched to be in this format:
text (no equals sign)
text
text
...
key (no equals sign) ␣ (optional whitespace) = (literal equal) whatever (our regex will skip this part.)
key=whatever
key=whatever
Do I have your attention? Yes? Please see the following regex (using techniques accessible to most regex engines):
/(^[^=\n]+$)(?!(?s).*^\1\s*=)/m
Inspired by a recent answer I saw from zx81, you can use the (?s) flag in the middle of the pattern to switch to DOTALL mode from that point on, allowing . to match across lines in the rest of the regex. Using this technique and the set syntax above, here's what the regex does, as an explanation:
(^[^=\n]+$) Goes through all the text (no equals sign) elements. Enforces no equals signs or newlines in the capture. This means our regex hits every text element as a line, and tries to match it appropriately.
(?! Opens a negative lookahead group. Asserts that this match will not locate the following:
(?s).* Any number of characters or new lines - As this is a greedy match, will throw our matcher pointer to the very end of the string, skipping to the last parts of the document to backtrack and scoop up quickly.
^\1\s*= The captured key at the start of a line, followed by optional whitespace and an equals sign.
) Ends our group.
I'm stupid. I could have just put this:
/(^[^=\n]+$)(?!.*^\1\s*=)/sm
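In Python, the simplified pattern might be applied as sketched below, with re.M and re.S standing in for the /sm flags. The file contents are inlined here so the example is self-contained; in practice you would read and concatenate the two files.

```python
import re

entries = """confirmation.resend
send
confirmation.showResendForm
login.header
login.loginBtn"""

used_entries = """confirmation.showResendForm = some value
login.header = some other value"""

# Concatenate both contents with a newline in between.
combined = entries + "\n" + used_entries

# re.M makes ^ and $ match at line boundaries; re.S lets . cross newlines.
pattern = re.compile(r'(^[^=\n]+$)(?!.*^\1\s*=)', re.M | re.S)
unused = pattern.findall(combined)
print("\n".join(unused))
```

Because the pattern has a single capturing group, findall returns the captured keys directly.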
I've been going at this in a way that was a little too complex, and just solved it with a small script in Scala:
import scala.io.Source

object HelloWorld {
  def main(args: Array[String]) {
    val entries = (for(line <- Source.fromFile("Entries.txt").getLines()) yield {
      line
    }).toList
    val usedEntries = (for(line <- Source.fromFile("Used_Entries.txt").getLines()) yield {
      line.dropRight(line.length - line.indexOf(' '))
    }).toList
    println(entries)
    println(usedEntries)
    val missingEntries = (for {
      entry <- entries
      if !usedEntries.exists(_ == entry)
    } yield {
      entry
    }).toList
    println(missingEntries)
    println("Missing Entries: ")
    println()
    for {
      missingEntry <- missingEntries
    } yield {
      println(missingEntry)
    }
  }
}
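import re

with open("Entries.txt") as e:
    entries = e.readlines()
with open("Used_Entries.txt") as u:
    used = u.read()
# Drop everything from "= " onwards, leaving only the keys
used_keys = [k.strip() for k in re.sub(r"= .*", "", used).split("\n")]
for entry in entries:
    if entry.strip() not in used_keys:
        print(entry.strip())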