Python RE: Find all matches of pattern b following pattern a - regex

I have a text file that looks like this:
Warning-[blah1]
few lines
Warning-[blah2]
few more lines
Total warnings: 2
few more lines
Warning-[blah3]
more of random lines
Warning-[blah4]
My objective is to find all matches of Warnings that come after the line "Total warnings: 2".
So far I have tried two approaches:
regex = re.compile('Total\swarnings.*(Warning-\[\S+\])',re.DOTALL)
regex = re.compile('Total\swarnings.*?(Warning-\[\S+\])',re.DOTALL)
The first approach is greedy and matches only blah4; the second matches only blah3. How can I get it to match both?
I am using findall.

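import re
with open('sample.txt') as f:
    text = f.read()
# Keep only the text that comes after the "Total warnings: 2" line.
parts = text.split('Total warnings: 2')[1:]
for part in parts:
    lines = part.split("\n")
    # Keep the lines that start with a bracketed warning.
    warnings = [x for x in lines if re.match(r'Warning\-\[.*?\]', x, flags=re.IGNORECASE)]
    print(warnings)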

You could try splitting the text file on "Total warnings", and then only processing the second half of the file:
import re
with open('yourfile.txt') as f:
    halves = f.read().split('Total warnings')
regex = re.compile(r'Warning-\[(\S+)\]')
matches = re.findall(regex, halves[1])
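The split-then-search idea above can be tried without a file. This is a minimal sketch, assuming Python 3; the helper name `warnings_after` is made up for illustration, and the sample text is copied from the question.

```python
import re

# Sample text from the question, inlined so the sketch is self-contained.
text = """Warning-[blah1]
few lines
Warning-[blah2]
few more lines
Total warnings: 2
few more lines
Warning-[blah3]
more of random lines
Warning-[blah4]
"""

def warnings_after(text, marker="Total warnings"):
    """Return the bracketed names of warnings that appear after `marker`."""
    # partition() splits at the first occurrence of the marker;
    # `tail` is an empty string if the marker is absent.
    _, _, tail = text.partition(marker)
    return re.findall(r"Warning-\[(\S+)\]", tail)

print(warnings_after(text))  # ['blah3', 'blah4']
```

Using `partition` rather than `split(...)[1]` avoids an IndexError when the marker line is missing; in that case the result is simply an empty list.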

selecting variations of phone numbers using regex

import re
s = 'so the 1234 2-1-1919 215.777.9839 1333331234 20-20-2000 A1234567 (515)2331129 7654321B (511)231-1134 512-333-1134 7777777 a7727373 there 1-22-2001 *1831 5647 and !2783 '
reg = r'[()\d-]{7,}'
r1 = re.findall(reg,s)
This reg gives the following result:
r1
['2-1-1919',
'1333331234',
'20-20-2000',
'1234567',
'(515)2331129',
'7654321',
'(511)231-1134',
'512-333-1134',
'7777777',
'7727373',
'1-22-2001']
I want to get the following output
['(515)2331129',
'(511)231-1134',
'512-333-1134']
So I tried to alter reg = r'[()\d-]{7,}' by adding \b
reg = r'[\b()\b\d-]{7,}'
But this doesn't work. How do I change reg = r'[()\d-]{7,}' to get the output I want?
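One reason the `\b` attempt has no effect: inside a character class, `\b` stands for the backspace character rather than a word boundary, so boundaries have to go outside the brackets. Below is a minimal sketch, assuming Python 3 and assuming that a wanted number is a whole whitespace-delimited token shaped like (515)2331129, (511)231-1134 or 512-333-1134; the name `phone` is illustrative.

```python
import re

s = ('so the 1234 2-1-1919 215.777.9839 1333331234 20-20-2000 A1234567 '
     '(515)2331129 7654321B (511)231-1134 512-333-1134 7777777 a7727373 '
     'there 1-22-2001 *1831 5647 and !2783 ')

phone = re.compile(r'''
    (?<!\S)                  # preceded by whitespace or start of string
    (?:\(\d{3}\)|\d{3}-)     # area code written as (515) or 512-
    \d{3}-?\d{4}             # seven digits with an optional dash
    (?!\S)                   # followed by whitespace or end of string
''', re.VERBOSE)

print(phone.findall(s))  # ['(515)2331129', '(511)231-1134', '512-333-1134']
```

The lookarounds `(?<!\S)` and `(?!\S)` reject tokens such as A1234567 or 7654321B, where a letter touches the digits.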
To put my two cents in, you could use a regex/parser combination as in:
from parsimonious.grammar import Grammar
from parsimonious.expressions import IncompleteParseError, ParseError
import re
junk = """so the 1234 2-1-1919 215.777.9839 1333331234 20-20-2000 A1234567 (515)2331129 7654321B
(511)231-1134 512-333-1134 7777777 a7727373 there 1-22-2001 *1831 5647 and !2783"""
rx = re.compile(r'[-()\d]+')
grammar = Grammar(
r"""
phone = area part part
area = (lpar digits rpar) / digits
part = dash? digits
lpar = "("
rpar = ")"
dash = "-"
digits = ~"\d{3,4}"
"""
)
for match in rx.finditer(junk):
    possible_number = match.group(0)
    try:
        tree = grammar.parse(possible_number)
        print(possible_number)
    except (ParseError, IncompleteParseError):
        pass
This yields
(515)2331129
(511)231-1134
512-333-1134
The idea here is to first match possible candidates which are then checked with the parser grammar.
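The same two-stage idea (collect loose candidates, then validate each one) can be sketched with only the standard library. This is an illustration under assumptions, not a port of the grammar above: it assumes Python 3, and the stricter `valid` pattern and the helper `phone_numbers` are made up for the example.

```python
import re

junk = ("so the 1234 2-1-1919 215.777.9839 1333331234 20-20-2000 A1234567 "
        "(515)2331129 7654321B\n"
        "(511)231-1134 512-333-1134 7777777 a7727373 there 1-22-2001 "
        "*1831 5647 and !2783")

# Stage 1: any run of digits, dashes and parentheses is a candidate.
candidate = re.compile(r'[-()\d]+')
# Stage 2: the whole candidate must have one of the accepted shapes.
valid = re.compile(r'\(\d{3}\)\d{3}-?\d{4}|\d{3}-\d{3}-\d{4}')

def phone_numbers(text):
    """Return the candidates that fully match an accepted phone shape."""
    return [m.group(0) for m in candidate.finditer(text)
            if valid.fullmatch(m.group(0))]

print(phone_numbers(junk))  # ['(515)2331129', '(511)231-1134', '512-333-1134']
```

Note that stage 1 drops neighbouring letters, so a token like A512-333-1134 would still be accepted; add boundary checks if that matters.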
Maybe, we could use alternation based on the cases you might have:
\d{3}-\d{3}-\d{4}|\(\s*\d{3}\s*\)\d{7}|\(\s*\d{3}\s*\)\s*\d{3}-\d{4}
We can also add boundaries if necessary:
(?<!\S)(?:\d{3}-\d{3}-\d{4}|\(\s*\d{3}\s*\)\d{7}|\(\s*\d{3}\s*\)\s*\d{3}-\d{4})(?!\S)
Test
import re
expression = r"\d{3}-\d{3}-\d{4}|\(\s*\d{3}\s*\)\d{7}|\(\s*\d{3}\s*\)\s*\d{3}-\d{4}"
string = """
so the 1234 2-1-1919 215.777.9839 1333331234 20-20-2000 A1234567 (515)2331129 7654321B (511)231-1134 512-333-1134 7777777 a7727373 there 1-22-2001 *1831 5647 and !2783 (511) 231-1134 ( 511)231-1134 (511 ) 231-1134
511-2311134
"""
print(re.findall(expression, string))
Output
['(515)2331129', '(511)231-1134', '512-333-1134', '(511) 231-1134', '( 511)231-1134', '(511 ) 231-1134']
If you wish to explore, simplify, or modify the expression, it is explained in the top right panel of regex101.com, where you can also watch how it matches against some sample inputs.

Parse text between multiple lines - Python 2.7 and re Module

I have a text file I want to parse. The file has multiple items I want to extract. I want to capture everything between a colon ":" and a particular word. Take the following example.
Description : a pair of shorts
amount : 13 dollars
requirements : must be blue
ID1 : 199658
----
The following code parses the information out.
import re
f = open ("parse.txt", "rb")
fileRead = f.read()
Description = re.findall("Description :(.*?)amount", fileRead, re.DOTALL)
amount = re.findall("amount :(.*?)requirements", fileRead, re.DOTALL)
requirements = re.findall("requirements :(.*?)ID1", fileRead, re.DOTALL)
ID1 = re.findall("ID1 :(.*?)-", fileRead, re.DOTALL)
print Description[0]
print amount[0]
print requirements[0]
print ID1[0]
f.close()
The problem is that sometimes the text file will have a new line such as this
Description
: a pair of shorts
amount
: 13 dollars
requirements: must be blue
ID1: 199658
----
In this case my code will not work because it cannot find "Description :", which is now split across two lines. If I change the search to ":(.*?)requirements", it does not return just "13 dollars"; it returns "a pair of shorts" and "13 dollars", because all of that text lies between the first colon and the word "requirements". I want a way to parse out the information whether or not there is a line break. I have hit a roadblock and your help would be greatly appreciated.
You can use a regex like this:
Description[^:]*(.*)
^--- use the keyword you want
Quoting your code you could use:
import re
f = open ("parse.txt", "rb")
fileRead = f.read()
Description = re.findall("Description[^:]*(.*)", fileRead)
amount = re.findall("amount[^:]*(.*)", fileRead)
requirements = re.findall("requirements[^:]*(.*)", fileRead)
ID1 = re.findall("ID1[^:]*(.*)", fileRead)
print Description[0]
print amount[0]
print requirements[0]
print ID1[0]
f.close()
You can simply do this:
import re
f = open ("new.txt", "rb")
fileRead = f.read()
keyvals = {k.strip():v.strip() for k,v in dict(re.findall(r'([^:]*):(.*)(?=\b[^:]*:|$)',fileRead,re.M)).iteritems()}
print(keyvals)
f.close()
Output:
{'amount': '13 dollars', 'requirements': 'must be blue', 'Description': 'a pair of shorts', 'ID1': '199658'}
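For comparison, here is a minimal Python 3 sketch of the same key/value idea that tolerates a line break between the key and its colon. It assumes each key is a single word and each value fits on one line; `FIELD` and `parse_fields` are illustrative names, and the sample strings are inlined instead of being read from a file.

```python
import re

# A key, optional whitespace (including a newline), a colon, then the rest of the line.
FIELD = re.compile(r'^\s*(\w+)\s*:\s*(.*?)\s*$', re.MULTILINE)

def parse_fields(text):
    """Return a dict mapping each key to its stripped value."""
    return dict(FIELD.findall(text))

one_line = ("Description : a pair of shorts\n"
            "amount : 13 dollars\n"
            "requirements : must be blue\n"
            "ID1 : 199658\n"
            "----\n")

split_lines = ("Description\n"
               ": a pair of shorts\n"
               "amount\n"
               ": 13 dollars\n"
               "requirements: must be blue\n"
               "ID1: 199658\n"
               "----\n")

print(parse_fields(one_line))
print(parse_fields(split_lines) == parse_fields(one_line))  # True
```

Because `\s*` also matches a newline, the key may sit on the line before the colon; the "----" separator is skipped since it contains no colon.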

How to count words with one syllable in a list of strings of one word using regular expressions

I'm trying to count the number of words, in a pretty long text, that have one syllable. This was defined as words that have zero or more consonants followed by 1 or more vowels followed by zero or more consonants.
The text has been lowercased and split into a list of single-word strings. Yet every time I try to use regular expressions to get the count, I get an error because the object is a list and not a string.
How would I do this in a list?
f = open('pg36.txt')
war = f.read()
warlow = war.lower()
warsplit = warlow.split()
import re
def syllables():
    count = len(re.findall('[bcdfghjklmnpqrstvwxyz]*[aeiou]+[bcdfghjklmnpqrstvwxyz]*', warsplit))
    return count
print (count)
syllables()
That happens because you are calling findall on a list, and findall only works on a string. You could try the following:
import re
f = open('file')
war = f.read()
warlow = war.lower()
warsplit = warlow.split()
def syllables():
    count = 0
    for i in warsplit:
        if re.match(r'^[bcdfghjklmnpqrstvwxyz]*[aeiou]+[bcdfghjklmnpqrstvwxyz]*$', i):
            count += 1
    return count
print(syllables())
f.close()
Or use findall directly on the warlow variable:
import re
f = open('file')
war = f.read()
warlow = war.lower()
print(len(re.findall(r'(?<!\S)[bcdfghjklmnpqrstvwxyz]*[aeiou]+[bcdfghjklmnpqrstvwxyz]*(?!\S)', warlow)))
f.close()
Try this regex instead:
^[^aeiouAEIOU]*[aeiouAEIOU]+[^aeiouAEIOU]*$
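A compact way to apply the pattern to a list of words is to test each word with `fullmatch` and sum the hits. This is a sketch assuming Python 3, with a made-up sample sentence in place of the book text; it keeps the question's definition, so "y" counts as a consonant and words with attached punctuation do not match.

```python
import re

# Zero or more consonants, one or more vowels, zero or more consonants.
ONE_SYLLABLE = re.compile(
    r'[bcdfghjklmnpqrstvwxyz]*[aeiou]+[bcdfghjklmnpqrstvwxyz]*')

def count_one_syllable(words):
    """Count the words that match the one-syllable pattern in full."""
    return sum(1 for w in words if ONE_SYLLABLE.fullmatch(w))

words = "the cat sat on a table near the tree".lower().split()
print(count_one_syllable(words))  # 8 ("table" is the only word rejected)
```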

Regex match a spanish word ending with a dot(.) and underscore

This is the regex I'm trying:
([\w\s\/áéíóúüñçåÁÉÍÓÚÜÑÇÐ\-]+)(\.\_)
Here are two examples that it should match against:
EL ROSARIO / ESCUINAPA._ Con poco más de 4 mil pesos...
and
Cuautitlán._ Con poco más de 4 mil pesos...
The expression works for the first example but not for the second, probably because of encoding:
docHtml = urllib.urlopen(link).read()
#using the lxml function html
tree = html.fromstring(docHtml)
newsCity = CSSSelector('#pid p')
try:
    city_paragraph = newsCity(tree)
    city_match = re.search('([\w\s\/áéíóúüñçåÁÉÍÓÚÜÑÇÐ\-]+\._)',city_paragraph[0].text)
Your regular expression appears to be correct. I suspect that the bug is in how you're reading the strings that you're matching against. You want something like:
import codecs
f = codecs.open('spanish.txt', encoding='utf-8')
for line in f:
    print(repr(line))
Finally figured it out:
newsCity = CSSSelector('#tamano5 p')
city_paragraph = newsCity(tree)
city_p = city_paragraph[0].text
city_utf=city_p.encode("utf-8")
city_match = re.search('([\w\s\/áéíóúüñçåÁÉÍÓÚÜÑÇÐ\-]+\._)',city_utf)
This gives me the expected result which in this case was to extract the city string using re.search.
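As a point of comparison, in Python 3 every `str` is Unicode and `\w` matches accented letters by default, so the accented characters do not need to be listed by hand. The sketch below assumes Python 3 (the code above is Python 2) and uses the two sample lines from the question; `CITY` is an illustrative name.

```python
import re

# Word characters (including accented letters), whitespace, "/" and "-",
# followed by the literal "._" marker.
CITY = re.compile(r'([\w\s/-]+)\._')

samples = [
    'EL ROSARIO / ESCUINAPA._ Con poco más de 4 mil pesos...',
    'Cuautitlán._ Con poco más de 4 mil pesos...',
]

for text in samples:
    match = CITY.search(text)
    if match:
        print(match.group(1))
# EL ROSARIO / ESCUINAPA
# Cuautitlán
```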

How to grab a letter after ';' with regular expressions?

How can I grab a letter after ; using regular expressions? For example:
c ; d
e ; f ; m ; k ; s
import re
f = open('file.txt')
regex = re.compile(r"(?<=\; )\w+")
for line in f:
    match = regex.search(line)
    if match:
        print(match.group())
This code only grabs d and f. I need the outcome to look like:
d
f
m
k
s
Replace all occurrences of "; " with a newline character and trim all spaces from the ends of every line.
Use a regex similar to this if you want to "blacklist" the ";" character:
[;]
I don't know much about Python, but here is how you would use it in JavaScript:
var desired_chars = myString.replace(/[;]/gi, '')
Instead of regex.search use regex.findall. That'll give you a list of matches for each line which you can then manipulate and print on separate lines.
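The findall suggestion might look like the sketch below. It assumes Python 3 and puts the two sample lines in a list instead of reading file.txt.

```python
import re

regex = re.compile(r"(?<=; )\w+")

# The two sample lines from the question.
lines = ["c ; d", "e ; f ; m ; k ; s"]

for line in lines:
    # findall returns every match on the line, not just the first one.
    for letter in regex.findall(line):
        print(letter)
# d
# f
# m
# k
# s
```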