I am searching a text file consisting of single words on each line for the following:
Lines that have two consecutive a’s in them but which don’t start with an a
import re
import sys
pattern = '^[^Aa][A-Za-z]*[Aa]{2}'
regexp = re.compile(pattern)
inFile = open('words.txt', 'r')
outFile = open('exercise04.log', 'w')
for line in inFile:
    match = regexp.search(line)
    if match:
        outFile.write(line)
inFile.close()
outFile.close()
My main concern is my regex search pattern rather than the Python itself. I understand that the ^[^Aa] at the start stops the first character from being 'A' or 'a', but is there a better way to go on from there and check for two consecutive 'a's in each word than the one I have used?
Your pattern looks fine.
If you want to make sure the first character is a letter, use
pattern = '^[B-Zb-z][A-Za-z]*[Aa]{2}'
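To see where the two patterns differ, here is a minimal sketch that runs both against a handful of words. The word list is a hypothetical sample, not the contents of words.txt.

```python
import re

# Pattern from the question: the first character is anything except 'A'/'a',
# followed by letters, then two consecutive a's.
loose = re.compile(r'^[^Aa][A-Za-z]*[Aa]{2}')
# Pattern from the answer: the first character must be a letter other than 'A'/'a'.
strict = re.compile(r'^[B-Zb-z][A-Za-z]*[Aa]{2}')

# Hypothetical sample words (assumption: the real file has one word per line).
words = ['bazaar', 'aardvark', 'Kanaan', 'banana', '1baa', 'Isaac']

loose_hits = [w for w in words if loose.search(w)]
strict_hits = [w for w in words if strict.search(w)]

print(loose_hits)   # ['bazaar', 'Kanaan', '1baa', 'Isaac']
print(strict_hits)  # ['bazaar', 'Kanaan', 'Isaac']
```

The only difference in this sample is '1baa': the question's pattern accepts any non-'a' first character, including a digit, while the answer's pattern insists on a letter.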
#open text file
with open('words') as f:
    for line in f.readlines():
        #pull out all 3 letter words using regular expression and add to wordlist
        word_list += re.findall(r'\b(\w{3})\b', line)
I use this to find all 3 letter words in a dictionary. From there, I want to add a question mark to the beginning of each word. I assume I need the re.sub function, but can't seem to get the syntax right.
You can do this a few ways. One is to collect all the 3-letter words first and prefix them afterwards; another is to keep your current approach and extend a list as you go. There's no real need for re.sub here if the goal is a list of 3-letter words prefixed with ?.
Sample words file:
the quick brown fox called bob jumped over the lazy dog
and went straight to bed
cos bob needed to sleep right now
Sample code:
import re
word_list = []
with open('words') as fin:
    for line in fin:
        matches = re.findall(r'\b(\w{3})\b', line)
        word_list.extend(f'?{word}' for word in matches)
Sample word_list after run:
['?the',
'?fox',
'?bob',
'?the',
'?dog',
'?and',
'?bed',
'?cos',
'?bob',
'?now']
You can use re.sub, where \1 refers to the first capture group:
re.sub(r'\b(\w{3})\b', r'?\1', line)
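As a quick check of what that call returns, here is a small sketch using the first line of the sample words file. Note that re.sub gives back the whole line with the replacements made, not a list of words.

```python
import re

# First line of the sample words file shown earlier in this thread.
line = 'the quick brown fox called bob jumped over the lazy dog'

# \1 in the replacement refers to the three-letter word captured by group 1.
prefixed = re.sub(r'\b(\w{3})\b', r'?\1', line)

print(prefixed)  # ?the quick brown ?fox called ?bob jumped over ?the lazy ?dog
```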
First compile the pattern:
pattern = re.compile(r'\b(\w{3})\b')
and then use it like this:
word_list += ['?' + word for word in pattern.findall(line)]
How can I add a new line every time a pattern from a list of regexes is found in a string?
I am using Python 3.6.
I got the following input:
12.13.14 Here is supposed to start a new line.
12.13.15 Here is supposed to start a new line.
Here is some text. It is written in one lines. 12.13. Here is some more text. 2.12.14. Here is even more text.
I wish to have the following output:
12.13.14
Here is supposed to start a new line.
12.13.15
Here is supposed to start a new line.
Here is some text. It is written in one lines.
12.13.
Here is some more text.
2.12.14.
Here is even more text.
My first try returns as the output the same as the input:
in_file2 = 'work1-T1.txt'
out_file2 = 'work2-T1.txt'
start_rx = re.compile('|'.join(
    ['\d\d\.\d\d\.', '\d\.\d\d\.\d\d','\d\d\.\d\d\.\d\d']))
with open(in_file2,'r', encoding='utf-8') as fin2, open(out_file2, 'w', encoding='utf-8') as fout2:
    text_list = fin2.read().split()
    fin2.seek(0)
    for string in fin2:
        if re.match(start_rx, string):
            string = str.replace(start_rx, '\n\n' + start_rx + '\n')
        fout2.write(string)
My second try returns an error 'TypeError: unsupported operand type(s) for +: '_sre.SRE_Pattern' and 'str''
in_file2 = 'work1-T1.txt'
out_file2 = 'work2-T1.txt'
start_rx = re.compile('|'.join(
    ['\d\d\.\d\d\.', '\d\.\d\d\.\d\d','\d\d\.\d\d\.\d\d']))
with open(in_file2,"r") as fin2, open(out_file2, 'w') as fout3:
    for line in fin2:
        start = False
        if re.match(start_rx, line):
            start = True
        if start == False:
            print ('do something')
        if start == True:
            line = '\n' + line ## whitespace before the position number
            line = line.replace(start_rx, start_rx + '\n')
        fout3.write(line)
First of all, to search and replace with a regex, you need to use re.sub, not str.replace.
Second, if you use re.sub, you can't use the compiled regex pattern inside the replacement string. You need to group the parts of the match you want to keep and refer to them with backreferences in the replacement (or, if you just want to refer to the whole match, use the \g<0> backreference, which requires no capturing groups).
Third, when you build an unanchored alternation pattern, make sure longer alternatives come first, i.e. start_rx = re.compile('|'.join([r'\d\d\.\d\d\.\d\d', r'\d\.\d\d\.\d\d', r'\d\d\.\d\d\.'])). Alternatively, you can write a single, more precise pattern by hand.
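The second and third points can be illustrated with a short sketch. It uses the first line of the sample input and a reduced two-alternative pattern, which is an assumption made to keep the example small.

```python
import re

sample = '12.13.14 Here is supposed to start a new line.'

# Shorter alternative first: the engine takes the first alternative that matches.
short_first = re.compile(r'\d\d\.\d\d\.|\d\d\.\d\d\.\d\d')
# Longer alternative first, as recommended above.
long_first = re.compile(r'\d\d\.\d\d\.\d\d|\d\d\.\d\d\.')

short_match = short_first.match(sample).group(0)
long_match = long_first.match(sample).group(0)
print(short_match)  # 12.13.
print(long_match)   # 12.13.14

# \g<0> in the replacement stands for the whole match, so no capturing group is needed.
split_line = long_first.sub(r'\g<0>\n', sample)
print(repr(split_line))
```

With the shorter alternative first, the trailing 14 is left behind in the text, which is why the order matters for unanchored alternations.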
Here is how your code can be fixed:
with open(in_file2,'r', encoding='utf-8') as fin2, open(out_file2, 'w', encoding='utf-8') as fout2:
    text = fin2.read()
    fout2.write(re.sub(r'\s*(\d+(?:\.\d+)+\.?)\s*', r'\n\n\1\n', text))
The pattern is
\s*(\d+(?:\.\d+)+\.?)\s*
Details
\s* - 0+ whitespaces
(\d+(?:\.\d+)+\.?) - Group 1 (\1 in the replacement pattern):
    \d+ - 1+ digits
    (?:\.\d+)+ - 1 or more repetitions of . and 1+ digits
    \.? - an optional .
\s* - 0+ whitespaces
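To check the pattern without touching any files, here is a minimal sketch that applies it to the sample input from the question as an in-memory string. The final lstrip call is an addition for readability and not part of the answer's code.

```python
import re

# Sample input from the question, kept as one string.
text = (
    '12.13.14 Here is supposed to start a new line.\n'
    '12.13.15 Here is supposed to start a new line.\n'
    'Here is some text. It is written in one lines. 12.13. Here is some more text. '
    '2.12.14. Here is even more text.'
)

number_rx = re.compile(r'\s*(\d+(?:\.\d+)+\.?)\s*')

# Put each dotted number on its own line, preceded by a blank line.
result = number_rx.sub(r'\n\n\1\n', text)

# Drop the blank lines that the replacement adds at the very start (optional).
print(result.lstrip('\n'))
```

Because the surrounding whitespace is consumed by the two \s* parts, the number ends up alone on its line regardless of whether it was followed by a space or preceded by a newline.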
Try this, with text holding the contents of the input file:
text = re.sub(r'(\d+) ', r'\1\n', text)
text = re.sub(r'(\w+)\.', r'\1.\n', text)
I have the following code that can return a line from text where a certain word exists
with open('/Users/Statistical_NLP/Project/text.txt') as f:
    haystack = f.read()
with open('/Users/Statistical_NLP/Project/test.txt') as f:
    for line in f:
        needle = line.strip()
        pattern = '^.*{}.*$'.format(re.escape(needle))
        for match in re.finditer(pattern, haystack, re.MULTILINE):
            print(match.group(0))
How can I search for a word and return not the whole line, but just the three words before and the three words after that word?
Something has to be changed in this line in my code:
pattern = '^.*{}.*$'.format(re.escape(needle))
Thanks a lot
The following regex will help you achieve what you want.
((?:\w+\s+){3}YOUR_WORD_HERE(?:\s+\w+){3})
For a better understanding of the regex, I suggest you go to the following page and experiment with it.
https://regex101.com/r/eS8zW5/3
This will match the three words before, the matched word and three words after.
The following will match up to 3 words before and after, if they exist:
((?:\w+\s+){0,3}YOUR_WORD_HERE(?:\s+\w+){0,3})
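Here is a rough sketch of how that pattern could be plugged into the pattern line from the question. The haystack and needle below are hypothetical stand-ins for the two files, and the doubled braces are needed so that str.format leaves the {0,3} quantifiers alone.

```python
import re

# Hypothetical haystack and needle, standing in for the two files in the question.
haystack = 'one two three four five six seven eight nine ten'
needle = 'five'

# Up to three words on each side; re.escape guards against regex
# metacharacters in the needle.
pattern = r'((?:\w+\s+){{0,3}}{}(?:\s+\w+){{0,3}})'.format(re.escape(needle))

contexts = [m.group(1) for m in re.finditer(pattern, haystack)]
print(contexts)  # ['two three four five six seven eight']
```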
I am trying to read a file and match words that have over 6 characters in them, but I keep getting this error:
Traceback (most recent call last):
File "dummy.py", line 9, in <module>
matches = re.findall("\w{6,}", f.read().split())
File "/usr/lib/python2.7/re.py", line 181, in findall
return _compile(pattern, flags).findall(string)
TypeError: expected string or buffer
And I can't figure out why I am getting this error? The code is pasted below
import re
with open('test.txt', 'r') as f:
    matches = re.findall("\w{6,}", f.read().split())
    nr_long_words = len(matches)
    print (matches)
f.read().split() gives a list of strings, but re.findall expects a single string, thus the TypeError: expected string or buffer. You could apply the regex to each of the substrings in a loop or list comprehension, but you do not need to split() at all:
matches = re.findall("\w{6,}", f.read())
Note that if the file is very large, then f.read() might not be a good idea (but for text files it's probably not an issue, as those are rarely larger than a few megabytes, if at all). In this case, you could read the file line-by-line and sum up the long words per line:
nr_long_words = sum(len(re.findall(r"\w{6,}", line)) for line in f)
Also, as noted in the comments, \w{6,} might not be the best regex for "long words" to start with. \w will, e.g., also match digits or the underscore _. If you want to match exclusively ASCII letters, better use [A-Za-z], but this might cause problems with non-ASCII letters, such as umlauts, accented letters, Arabic, etc. Also, you might want to include word boundaries, i.e. \b, to make sure that the six letters are not part of a longer, non-word sequence, i.e. use a regex like r'\b[A-Za-z]{6,}\b'.
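The difference between the two patterns shows up on mixed input. The following sketch uses a made-up sample string (an assumption, not data from the question) and Python 3, where \w also matches non-ASCII letters by default.

```python
import re

# Hypothetical sample text mixing letters, digits, underscores and an umlaut.
text = 'a_long_identifier 1234567 elephant Füllung giraffes cat'

word_chars = re.findall(r'\w{6,}', text)
ascii_letters = re.findall(r'\b[A-Za-z]{6,}\b', text)

print(word_chars)     # ['a_long_identifier', '1234567', 'elephant', 'Füllung', 'giraffes']
print(ascii_letters)  # ['elephant', 'giraffes']
```

The stricter pattern drops the identifier and the number as intended, but it also drops 'Füllung', which is the trade-off mentioned above for non-ASCII letters.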
Try:
import re
nr_long_words = 0
with open('input.txt', 'r') as f:
    for line in f:
        matches = re.findall(r"\w{6,}", line)
        nr_long_words += len(matches)
print(nr_long_words)
It should print the number of words with six or more characters in the file.
with open(searchfile) as f:
    pattern = r"\.?(?P<sentence>.*?\(([A-Za-z0-9_]+)\).*?)\."
    for line in f:
        match = re.search(pattern, line)
        if match is not None:
            print(match.group("sentence"))
I am trying to extract every sentence that contains an acronym in parentheses (essentially 2-4 letters, all caps, in parentheses).
In: Here is an (ABC) example. Do not include this sentence. Include this (AB) one. And (AVCD) this one.
Out: Here is an (ABC) example. Include this (AB) one. And (AVCD) this one.
You can use this:
[^.]*?\([A-Z]{2,4}\)[^.]*\.
But note that this is a particularly inefficient approach, since the pattern starts with a very permissive subpattern. You can improve that a little by adding a kind of anchor at the beginning:
(?:(?<=\.)|^)[^.]*?\([A-Z]{2,4}\)[^.]*\.
Unfortunately, even with this anchor, the regex engine must check the two alternatives for most of the characters of the string.
A better approach might be to match both the sentence-ending punctuation marks and the substrings that run from an acronym to the end of its sentence, and then to extract the sentences using the end offset of each result:
#!/usr/bin/python
import re
txt = 'Here is an (ABC) example. Do not include this sentence. Include this (AB) one. And (AVCD) this one.'
pattern = re.compile(r'([!.?])(?=\s)|\([A-Z]{2,4}\)[^.]*(?:\.|$)')
offset = 0
result = ''
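for m in pattern.finditer(txt):
    # group 1 is empty when the acronym alternative matched
    if m.group(1) is None:
        result += txt[offset:m.end()]
    # remember where the current sentence ended
    offset = m.end()
print(result)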
Note: you can't be sure that a dot stands for the end of a sentence; it can be something else (for example, part of an abbreviation).
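For comparison, the simple pattern given at the start of this answer can be checked against the sample input with re.findall. This is only a sketch for small inputs; the strip and join calls are additions to tidy up the leading spaces that each match carries.

```python
import re

txt = ('Here is an (ABC) example. Do not include this sentence. '
       'Include this (AB) one. And (AVCD) this one.')

# Simple pattern: a run without dots, an acronym in parentheses, then up to the next dot.
sentence_rx = re.compile(r'[^.]*?\([A-Z]{2,4}\)[^.]*\.')

sentences = [s.strip() for s in sentence_rx.findall(txt)]
print(' '.join(sentences))
# Here is an (ABC) example. Include this (AB) one. And (AVCD) this one.
```

The engine retries the pattern at every character of the sentence without an acronym before giving up on it, which is the inefficiency described above.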
A slightly more efficient pattern (note that possessive quantifiers such as ++ need Python 3.11+ or the third-party regex module):
([^.(]++\([^.)]++\)[^.)]++\.)