import re
line = "Cats are smarter than dogs"
matchObj = re.match(r'(.*) are (.*)', line)
if matchObj:
    print("matchObj.group(2) : ", matchObj.group(2))
else:
    print("No match!!")
When I run this code I get the output: smarter than dogs
But if I put an extra space at the end of my RE
matchObj = re.match( r'(.*) are (.*) ', line)
I get the output: smarter than
Can anyone explain why I am getting this difference in output?
When you add the extra space, as in matchObj = re.match( r'(.*) are (.*) ', line), you are asking the second (.*) to match as many characters as it can while still being followed by a space.
In this case that is smarter than: the trailing space in the pattern matches the space before dogs, and since re.match does not have to consume the whole string, dogs is simply left unmatched.
Without the space, .* can match any number of characters other than a newline, so it runs to the end of the string and captures smarter than dogs.
Read the documentation on regex for more info.
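To see the backtracking described above, here is a small sketch that runs both patterns on the same line and also prints the full match, so you can see where the trailing space landed. The variable names no_space and with_space are only for illustration.

```python
import re

line = "Cats are smarter than dogs"

# Same pattern, without and with a trailing space
no_space = re.match(r'(.*) are (.*)', line)
with_space = re.match(r'(.*) are (.*) ', line)

print(no_space.group(2))          # smarter than dogs
print(with_space.group(2))        # smarter than
# The whole match ends with the space before "dogs"
print(repr(with_space.group(0)))  # 'Cats are smarter than '
```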
How can I add a new line every time a pattern from a list of regexes is found in a string?
I am using Python 3.6.
I got the following input:
12.13.14 Here is supposed to start a new line.
12.13.15 Here is supposed to start a new line.
Here is some text. It is written in one lines. 12.13. Here is some more text. 2.12.14. Here is even more text.
I wish to have the following output:
12.13.14
Here is supposed to start a new line.
12.13.15
Here is supposed to start a new line.
Here is some text. It is written in one lines.
12.13.
Here is some more text.
2.12.14.
Here is even more text.
My first try produces output identical to the input:
in_file2 = 'work1-T1.txt'
out_file2 = 'work2-T1.txt'
start_rx = re.compile('|'.join(
    ['\d\d\.\d\d\.', '\d\.\d\d\.\d\d','\d\d\.\d\d\.\d\d']))
with open(in_file2,'r', encoding='utf-8') as fin2, open(out_file2, 'w', encoding='utf-8') as fout2:
    text_list = fin2.read().split()
    fin2.seek(0)
    for string in fin2:
        if re.match(start_rx, string):
            string = str.replace(start_rx, '\n\n' + start_rx + '\n')
        fout2.write(string)
My second try returns an error 'TypeError: unsupported operand type(s) for +: '_sre.SRE_Pattern' and 'str''
in_file2 = 'work1-T1.txt'
out_file2 = 'work2-T1.txt'
start_rx = re.compile('|'.join(
    ['\d\d\.\d\d\.', '\d\.\d\d\.\d\d','\d\d\.\d\d\.\d\d']))
with open(in_file2,"r") as fin2, open(out_file2, 'w') as fout3:
    for line in fin2:
        start = False
        if re.match(start_rx, line):
            start = True
        if start == False:
            print ('do something')
        if start == True:
            line = '\n' + line ## blank line before the item number
            line = line.replace(start_rx, start_rx + '\n')
        fout3.write(line)
First of all, to search and replace with a regex, you need to use re.sub, not str.replace.
Second, when you use re.sub, you can't use the compiled regex object inside the replacement string. Group the parts of the match you want to keep and refer to them with backreferences in the replacement (or, if you just want to refer to the whole match, use the \g<0> backreference; no capturing groups are required).
Third, when you build an unanchored alternation pattern, make sure longer alternatives come first, i.e. start_rx = re.compile('|'.join([r'\d\d\.\d\d\.\d\d', r'\d\.\d\d\.\d\d', r'\d\d\.\d\d\.'])). However, you may write a more precise pattern by hand here.
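As a small illustration of the second and third points, the sketch below compares two orderings of the same alternatives and then uses \g<0> in a replacement. The names sample, short_first and long_first are made up for this example.

```python
import re

sample = "12.13.14 Here is a line."

# Same alternatives, different order
short_first = re.compile(r'\d\d\.\d\d\.|\d\d\.\d\d\.\d\d')
long_first = re.compile(r'\d\d\.\d\d\.\d\d|\d\d\.\d\d\.')

# The first alternative that matches wins, even if a longer one would fit
print(short_first.match(sample).group(0))  # 12.13.
print(long_first.match(sample).group(0))   # 12.13.14

# \g<0> refers to the whole match, so no capturing group is needed
wrapped = long_first.sub(r'\n\g<0>\n', sample)
print(repr(wrapped))  # '\n12.13.14\n Here is a line.'
```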
Here is how your code can be fixed:
with open(in_file2,'r', encoding='utf-8') as fin2, open(out_file2, 'w', encoding='utf-8') as fout2:
    text = fin2.read()
    fout2.write(re.sub(r'\s*(\d+(?:\.\d+)+\.?)\s*', r'\n\n\1\n', text))
The pattern is
\s*(\d+(?:\.\d+)+\.?)\s*
Details
\s* - 0+ whitespaces
(\d+(?:\.\d+)+\.?) - Group 1 (\1 in the replacement pattern):
    \d+ - 1+ digits
    (?:\.\d+)+ - 1 or more repetitions of . and 1+ digits
    \.? - an optional .
\s* - 0+ whitespaces
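Here is a self-contained sketch of the same substitution applied to an in-memory string instead of a file, so it can be run without the input file. The text below is the sample input from the question joined into a single line, which is an assumption about how the file is laid out.

```python
import re

# Sample input from the question, joined into one line (assumed layout)
text = ("12.13.14 Here is supposed to start a new line. "
        "12.13.15 Here is supposed to start a new line. "
        "Here is some text. It is written in one lines. "
        "12.13. Here is some more text. "
        "2.12.14. Here is even more text.")

# Surrounding whitespace is consumed; the number (group 1) goes on its own line
result = re.sub(r'\s*(\d+(?:\.\d+)+\.?)\s*', r'\n\n\1\n', text)
print(result)
```

Note that the output starts with two empty lines because the very first number also gets the \n\n prefix; call result.lstrip() if that is not wanted.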
Try this, where text holds the contents read from in_file2:
text = re.sub(r'(\d+) ', r'\1\n', text)
text = re.sub(r'(\w+)\.', r'\1.\n', text)
I'm looking to extract prices from a string of scraped data.
I'm using this at the moment:
re.findall(r'£(?:\d+\.)?\d+.\d+', '£1.01')
['£1.01']
Which works fine 99% of the time. However, I occasionally see this:
re.findall(r'£(?:\d+\.)?\d+.\d+', '£1,444.01')
['£1,444']
I'd like to see ['1444.01'] ideally.
This is an example of the string I'm extracting the prices from.
'\n £1,000.73 \n\n\n + £1.26\nUK delivery\n\n\n'
I'm after some help putting together the regex to get ['1000.73', '1.26'] from the string above.
You may grab all the values with '£(\d[\d.,]*)\b' and then remove all the commas with str.replace:
import re
s = '\n £1,000.73 \n\n\n + £1.26\nUK delivery\n\n\n'
r = re.compile(r'£(\d[\d.,]*)\b')
print([x.replace(',', '') for x in re.findall(r, s)])
# => ['1000.73', '1.26']
The £(\d[\d.,]*)\b pattern finds £, then captures a digit followed by zero or more digits, commas, or dots, as many as possible, and backtracks to the last position where there is a word boundary.
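The effect of the trailing \b is easiest to see when a price is followed by punctuation. The sketch below wraps the same pattern in a helper; the name extract_prices is hypothetical and only used for this example.

```python
import re

PRICE_RX = re.compile(r'£(\d[\d.,]*)\b')

def extract_prices(s):
    # Hypothetical helper: capture the number after each £ and drop thousands commas
    return [m.replace(',', '') for m in PRICE_RX.findall(s)]

print(extract_prices('\n £1,000.73 \n\n\n + £1.26\nUK delivery\n\n\n'))
# ['1000.73', '1.26']

# The sentence-ending dot is given back because there is no word boundary after it
print(extract_prices('Total: £1,444.01.'))
# ['1444.01']
```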
I have the following code that returns each line of a text in which a certain word occurs:
import re

with open('/Users/Statistical_NLP/Project/text.txt') as f:
    haystack = f.read()
with open('/Users/Statistical_NLP/Project/test.txt') as f:
    for line in f:
        needle = line.strip()
        pattern = '^.*{}.*$'.format(re.escape(needle))
        for match in re.finditer(pattern, haystack, re.MULTILINE):
            print(match.group(0))
How can I search for a word and return not the whole line, but just the three words before and the three words after that word?
Something has to be changed in this line in my code:
pattern = '^.*{}.*$'.format(re.escape(needle))
Thanks a lot
The following regex will help you achieve what you want.
((?:\w+\s+){3}YOUR_WORD_HERE(?:\s+\w+){3})
For a better understanding of the regex, I suggest you go to the following page and experiment with it.
https://regex101.com/r/eS8zW5/3
This will match the three words before, the matched word, and the three words after.
The following will match up to 3 words before and after, if they exist:
((?:\w+\s+){0,3}YOUR_WORD_HERE(?:\s+\w+){0,3})
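One way to plug this into the code from the question is to build the pattern from the needle with re.escape, as sketched below. The helper name words_around and the added \b anchors (so that the needle only matches as a whole word) are assumptions for this example, not part of the regex above.

```python
import re

def words_around(needle, haystack, n=3):
    # Hypothetical helper: up to n words before and after the needle
    pattern = r'(?:\w+\s+){0,%d}\b%s\b(?:\s+\w+){0,%d}' % (n, re.escape(needle), n)
    return [m.group(0) for m in re.finditer(pattern, haystack)]

text = "the quick brown fox jumps over the lazy dog"
print(words_around("jumps", text))
# ['quick brown fox jumps over the lazy']
```

Because \s also matches newlines, a match can span a line break in the haystack.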
with open(searchfile) as f:
    pattern = r"\.?(?P<sentence>.*?\(([A-Za-z0-9_]+)\).*?)\."
    for line in f:
        match = re.search(pattern, line)
        if match is not None:
            print(match.group("sentence"))
I am trying to extract every sentence that contains an acronym in parentheses (essentially 2-4 capital letters in parentheses).
In: Here is an (ABC) example. Do not include this sentence. Include this (AB) one. And (AVCD) this one.
Out: Here is an (ABC) example. Include this (AB) one. And (AVCD) this one.
You can use this:
[^.]*?\([A-Z]{2,4}\)[^.]*\.
But note that this is a particularly inefficient way, since the pattern starts with a very permissive subpattern. You can improve that a little by adding a kind of anchor at the beginning:
(?:(?<=\.)|^)[^.]*?\([A-Z]{2,4}\)[^.]*\.
Unfortunately, even with this anchor, the regex engine must check the two alternatives for most of the characters of the string.
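For reference, here is a short sketch of the first pattern used with findall on the example from the question. The matches after the first one start with the space that follows the previous dot, so they are stripped before being joined; the variable names are only for illustration.

```python
import re

txt = ('Here is an (ABC) example. Do not include this sentence. '
       'Include this (AB) one. And (AVCD) this one.')

pattern = re.compile(r'[^.]*?\([A-Z]{2,4}\)[^.]*\.')

# Each match is one sentence containing a 2-4 letter acronym in parentheses
sentences = [s.strip() for s in pattern.findall(txt)]
print(' '.join(sentences))
# Here is an (ABC) example. Include this (AB) one. And (AVCD) this one.
```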
A better approach might be to match either sentence-ending punctuation or a substring that starts with the acronym and runs to the end of the sentence, and then to extract the sentences using the end offset of each result:
#!/usr/bin/python
import re
txt = 'Here is an (ABC) example. Do not include this sentence. Include this (AB) one. And (AVCD) this one.'
pattern = re.compile(r'([!.?])(?=\s)|\([A-Z]{2,4}\)[^.]*(?:\.|$)')
offset = 0
result = ''
for m in pattern.finditer(txt):
    # group 1 is None when the acronym alternative matched
    if m.group(1) is None:
        result += txt[offset:m.end()]
    # in both cases the next sentence starts after this match
    offset = m.end()
print(result)
Note: you can't be sure that a dot stands for the end of a sentence; it can be something else.
A slightly more efficient pattern (it uses possessive quantifiers, which Python's re module only supports from version 3.11 on):
([^.(]++\([^.)]++\)[^.)]++\.)
I want to match two strings that differ only in tags, spaces, and newlines.
$string1 = "perl is <match>scripting language</match>";
$string2 = "perl<TAG> is<TAG> scr<TAG>ipt<TAG>inglanguage";
Note: spaces, <TAG>, and newlines can appear anywhere in $string2. A space may or may not be present in $string2; for example, in the instance above the space between the words scripting and language is missing in $string2. We have to ignore spaces, tags, and newlines while matching $string1 against $string2. The <match> tag in $string1 marks the data to be matched against $string2.
Required output:
the whole content of $string2 with the <match> tag added.
perl<TAG> is<TAG> <match>scr<TAG>ipt<TAG>inglanguage</match>
Code I tried:
while ($string1 =~ /<match>(.*?)<\/match>/gs)
{
    my $data_to_match = $1;
    $data_to_match = add_pat($data_to_match);
    $string2 =~ s{($data_to_match)}
    {
        "<match>$&<\/match>"
    }esi;
}
sub add_pat
{
    my ($data) = (@_);
    my @array = split//,$data;
    foreach my $each(@array)
    {
        $each = quotemeta $each;
        $each = '(?:(<TAG>|\s)+)?'.$each.'(?:(<TAG>|\s)+)?';
    }
    $data = join '',@array;
    return $data;
}
Problem: since the space is missing in $string2, it does not match. I tried making the space optional when appending the pattern to each character, but with the space optional the pattern match just keeps running.
In reality, I have a large string to match, and these spaces are causing the problem. Please suggest a fix.
Use regular expressions to remove all the characters that you wish to ignore from both of the strings. Then compare the remaining values of the two strings.
So you will end up with both strings reduced to the same form, for example:
'perlisscriptinglanguage' and 'perlisscriptinglanguage'
If you want, you can also convert them to upper or lower case so that the comparison ignores case.
If they match, then just return the original string 2.
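A minimal sketch of this normalize-and-compare idea, written in Python rather than Perl for brevity. The helper name normalize is made up, and the set of things to strip (<match>, <TAG>, whitespace) is taken from the question.

```python
import re

def normalize(s):
    # Hypothetical helper: drop <match>/</match>, <TAG> and all whitespace, then lowercase
    return re.sub(r'</?match>|<TAG>|\s+', '', s).lower()

string1 = "perl is <match>scripting language</match>"
string2 = "perl<TAG> is<TAG> scr<TAG>ipt<TAG>inglanguage"

print(normalize(string1))                        # perlisscriptinglanguage
print(normalize(string1) == normalize(string2))  # True
```

This only tells you whether the two strings agree; it does not place the <match> tag into string 2.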
I think it's odd that you are expected to "match" when $string2, even with the tags taken out, doesn't match the original string.
Anyway, since your code is tolerant of additional spaces and tags in $string2, you can wipe all spaces (and tags, if applicable) from $string1.
I added $data_to_match =~ s/ +//; before your call to add_pat. That didn't quite work, because the line "$each = '(?:(<TAG>|\s)+)?'.$each.'(?:(<TAG>|\s)+)?';" adds (?:(<TAG>|\s)+)? even before the first letter of the match from $string1. You actually have a lot of redundant TAG patterns, since you add one to the front and back of each letter. I don't know what quotemeta does, so I'm not sure how to fix the code there. I just added a
$data_to_match =~ s/\Q(?:(<TAG>|\s)+)?\E//; line after the call to add_pat to strip the first TAG pattern off the front of the pattern. Otherwise it matches wrongly and outputs 'perl<TAG> is<match><TAG> scr<TAG>ipt<TAG>inglanguage</match>'.
Really, you should only put one "(?:(<TAG>|\s)+)?" in between each pair of letters of the $string1 match and, more importantly, you should not put "(?:(<TAG>|\s)+)?" before the first letter or after the last letter.
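To illustrate the "one separator between letters, none at the ends" idea, here is a sketch in Python rather than Perl. The helper names build_pattern and mark_matches are hypothetical, and the separator (?:<TAG>|\s)* is a simplified stand-in for the (?:(<TAG>|\s)+)? used in the question.

```python
import re

def build_pattern(data):
    # Strip whitespace from the text to match, escape each character, and allow
    # any run of <TAG> or whitespace only BETWEEN consecutive characters
    chars = [re.escape(c) for c in re.sub(r'\s+', '', data)]
    return r'(?:<TAG>|\s)*'.join(chars)

def mark_matches(string1, string2):
    # Wrap the first occurrence of each <match>...</match> payload found in string2
    for m in re.finditer(r'<match>(.*?)</match>', string1, re.S):
        pattern = build_pattern(m.group(1))
        string2 = re.sub(pattern,
                         lambda mm: '<match>' + mm.group(0) + '</match>',
                         string2, count=1, flags=re.I)
    return string2

string1 = "perl is <match>scripting language</match>"
string2 = "perl<TAG> is<TAG> scr<TAG>ipt<TAG>inglanguage"
print(mark_matches(string1, string2))
# perl<TAG> is<TAG> <match>scr<TAG>ipt<TAG>inglanguage</match>
```

Since the match has to begin on the first letter and end on the last one, the <TAG> before scr stays outside the <match> tag.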