Using python regex to find repeated values after a header - regex

If I have a string that looks something like:
s = """
...
Random Stuff
...
HEADER
a 1
a 3
# random amount of rows
a 17
RANDOM_NEW_HEADER
a 200
a 300
...
More random stuff
...
"""
Is there a clean way to use regex (in Python) to find all instances of a \d* after HEADER, but before the pattern is broken by SOMETHING_TOTALLY_DIFFERENT? I thought about something like:
import re
pattern = r'HEADER(?:\na \d*)*\na (\d*)'
print re.findall(pattern, s)
Unfortunately, regex doesn't find overlapping matches. If there's no sensible way to do this with regex, I'm okay with anything faster than writing my own for loop to extract this data.
(TL;DR -- There's a distinct header, followed by a pattern that repeats. I want to catch each instance of that pattern, as long as there isn't a break in the repetition.)
EDIT:
To clarify, I don't necessarily know what SOMETHING_TOTALLY_DIFFERENT will be, only that it won't match a \d+. I want to collect all consecutive instances of \na \d+ that follow HEADER\n.

How about a simple loop?
import re
e = re.compile(r'(a\s+\d+)')
header = 'whatever your header field is'
breaker = 'something_different'
breaker_reached = False
header_reached = False
results = []
with open('yourfile.txt') as f:
for line in f:
if line == header:
# skip processing lines unless we reach the header
header_reached = True
continue
if header_reached:
i = e.match(line)
if i and not breaker_reached:
results.append(i.groups()[0])
else:
# There was no match, check if we reached the breaker
if line == breaker:
breaker_reached = True

Not completly sure where you want the regex to stop please clarify
'((a \d*)\s){1,}'

import re
sentinel_begin = 'HEADER'
sentinel_end = 'SOMETHING_TOTALLY_DIFFERENT'
re.findall(r'(a \d*)', s[s.find(sentinel_begin): s.find(sentinel_end)])

Related

Entire text is matched but not able to group in named groups

I have following example text:
my_app|key1=value1|user_id=testuser|ip_address=10.10.10.10
I want to extract sub-fields from it in following way:
appName = my_app,
[
{key = key1, value = value1},
{key = user_id, value = testuser},
{key = ip_address, value = 10.10.10.10}
]
I have written following regex for doing this:
(?<appName>\w+)\|(((?<key>\w+)?(?<equals>=)(?<value>[^\|]+))\|?)+
It matches the entire text but is not able to group it correctly in named groups.
Tried testing it on https://regex101.com/
What am I missing here?
I think the main problem you have is trying to write a regex that matches ALL the key=value pairs. That's not the way to do it. The correct way is based on a pattern that matches ONLY ONE key=value, but is applied by a function that finds all accurances of the pattern. Every languages supplies such a function. Here's the code in Python for example:
import re
txt = 'my_app|key1=value1|user_id=testuser|ip_address=10.10.10.10'
pairs = re.findall(r'(\w+)=([^|]+)', txt)
print(pairs)
This gives:
[('key1', 'value1'), ('user_id', 'testuser'), ('ip_address', '10.10.10.10')]
The pattern matches a key consisting of alpha-numeric chars - (\w+) with a value. The value is designated by ([^|]+), that is everything but a vertical line, because the value can have non-alpha numeric values, such a dot in the ip address.
Mind the findall function. There's a search function to catch a pattern once, and there's a findall function to catch all the patterns within the text.
I tested it on regex101 and it worked.
I must comment, though, that the specific text pattern you work on doesn't require regex. All high level languages supply a split function. You can split by vertical line, and then each slice you get (expcept the first one) you split again by the equal sign.
Use the PyPi regex module with the following code:
import regex
s = "my_app|key1=value1|user_id=testuser|ip_address=10.10.10.10"
rx = r"(?<appName>\w+)(?:\|(?<key>\w+)=(?<value>[^|]+))+"
print( [(m.group("appName"), dict(zip(m.captures("key"),m.captures("value")))) for m in regex.finditer(rx, s)] )
# => [('my_app', {'ip_address': '10.10.10.10', 'key1': 'value1', 'user_id': 'testuser'})]
See the Python demo online.
The .captures property contains all the values captured into a group at all the iterations.
Not sure, but maybe regular expression might be unnecessary, and splitting similar to,
data='my_app|key1=value1|user_id=testuser|ip_address=10.10.10.10'
x= data.split('|')
appName = []
for index,item in enumerate(x):
if index>0:
element = item.split('=')
temp = {"key":element[0],"value":element[1]}
appName.append(temp)
appName = str(x[0] + ',' + str(appName))
print(appName)
might return an output similar to the desired output:
my_app,[{'key': 'key1', 'value': 'value1'}, {'key': 'user_id', 'value': 'testuser'}, {'key': 'ip_address', 'value': '10.10.10.10'}]
using dict:
temp = {"key":element[0],"value":element[1]}
temp can be modified to other desired data structure that you like to have.

Get digits between slashes or on the end in URL

I need a reg expression (for groovy) to match 7 digits between 2 slashes (in a url) or on the end of the url. So fe:
https://stackoverflow.com/questions/6032324/problem-with-this-reg-expression
I need 6032324 but it should also match:
https://stackoverflow.com/questions/6032324
If it has 1 digit more/less, I should not match.
Maybe its an easy reg exp but Im not so familiar with this :)
Thanks for you help!
Since you are parsing a URL, it makes sense to use an URL parser to first grab the path part to split with /. Then, you will have direct access to the slash-separated path parts that you may test against a very simple [0-9]{7} pattern and get them all with
def results = new URL(surl).path.split("/").findAll { it.matches(/\d{7}/) }
You may also take the first match:
def results = new URL(surl).path.split("/").findAll { it.matches(/\d{7}/) }.first()
Or last:
def results = new URL(surl).path.split("/").findAll { it.matches(/\d{7}/) }.last()
See the Groovy demo:
def surl = "https://stackoverflow.com/questions/6032324/problem-with-this-reg-expression"
def url = new URL(surl)
final result = url.path.split("/").findAll { it.matches(/\d{7}/) }.first()
print(result) // => 6032324

Python and Regex with special characters

I can't get my regex to work as desired in my Python 3 code.
I am trying to parse a file find a specific pattern (the exact pattern is Total Optimized)
I am doing this because the file can contain lines which say """Total Optimization (Active)""" and other permutations. I have tried the following lines. None work
PkOp = re.compile(r'Total Optimized\t\d')
PkOp = re.compile(r'Total Optimized\t\d')
PkOp = re.compile(r'Total Optimized\t[^(Active)]')
My basic code (which is simplified here) to just print the matching line out. If I got that working I would then choose the array item I wanted such as
PkOp = PkOpArray[4]
App = re.compile(r'Appliance\s(Active)')
PkOp = re.compile(r"Total Optimized\t\d")
with open("SteelheadMetric2.txt","r") as f:
with open("mydumbfile.txt","w") as o:
for line in f:
line = line.lstrip()
matches = PkOp.findall(line)
for firestick in matches:
PkOpArray = line.split()
PkOp = PkOpArray
print(PkOp)
Mostly I get this error
matches = PkOp.findall(line)
AttributeError: 'list' object has no attribute 'findall'
If I remove the slash characters I can get it to show lines with 'Total Optimization' or 'Appliance' whatever. I just can't be more specific in what I want.
What am I missing? It works fine if I just compile a text string but to use special regex like whitespace, tab digit it fails. The regex checks out in notepad++
When you write PkOp = PkOpArray you have just changed your regex into a list.
If you delete that line, and change your print(PkOp) to print(PkOpArray), it should fix your problem, assuming the rest of your code is correct.

Python: get the string between two capitals

I'd like your opinion as you might be more experienced on Python as I do.
I came from C++ and I'm still not used to the Pythonic way to do things.
I want to loop under a string, between 2 capital letters. For example, I could do that this way:
i = 0
str = "PythonIsFun"
for i, z in enumerate(str):
if(z.isupper()):
small = ''
x = i + 1
while(not str[x].isupper()):
small += str[x]
I wrote this on my phone, so I don't know if this even works but you caught the idea, I presume.
I need you to help me get the best results on this, not just in a non-forced way to the cpu but clean code too. Thank you very much
This is one of those times when regexes are the best bet.
(And don't call a string str, by the way: it shadows the built-in function.)
s = 'PythonIsFun'
result = re.search('[A-Z]([a-z]+)[A-Z]', s)
if result is not None:
print result.groups()[0]
you could use regular expressions:
import re
re.findall ( r'[A-Z]([^A-Z]+)[A-Z]', txt )
outputs ['ython'], and
re.findall ( r'(?=[A-Z]([^A-Z]+)[A-Z])', txt )
outputs ['ython', 's']; and if you just need the first match,
re.search ( r'[A-Z]([^A-Z]+)[A-Z]', txt ).group( 1 )
You can use a list comprehension to do this easily.
>>> s = "PythonIsFun"
>>> u = [i for i,x in enumerate(s) if x.isupper()]
>>> s[u[0]+1:u[1]]
'ython'
If you can't guarantee that there are two upper case characters you can check the length of u to make sure it is at least 2. This does iterate over the entire string, which could be a problem if the two upper case characters occur at the start of a lengthy string.
There are many ways to tackle this, but I'd use regular expressions.
This example will take "PythonIsFun" and return "ythonsun"
import re
text = "PythonIsFun"
pattern = re.compile(r'[a-z]') #look for all lower-case characters
matches = re.findall(pattern, text) #returns a list of lower-chase characters
lower_string = ''.join(matches) #turns the list into a string
print lower_string
outputs:
ythonsun

Take first successful match from a batch of regexes

I'm trying to extract set of data from a string that can match one of three patterns. I have a list of compiled regexes. I want to run through them (in order) and go with the first match.
regexes = [
compiled_regex_1,
compiled_regex_2,
compiled_regex_3,
]
m = None
for reg in regexes:
m = reg.match(name)
if m: break
if not m:
print 'ARGL NOTHING MATCHES THIS!!!'
This should work (haven't tested yet) but it's pretty fugly. Is there a better way of boiling down a loop that breaks when it succeeds or explodes when it doesn't?
There might be something specific to re that I don't know about that allows you to test multiple patterns too.
You can use the else clause of the for loop:
for reg in regexes:
m = reg.match(name)
if m: break
else:
print 'ARGL NOTHING MATCHES THIS!!!'
If you just want to know if any of the regex match then you could use the builtin any function:
if any(reg.match(name) for reg in regexes):
....
however this will not tell you which regex matched.
Alternatively you can combine multiple patterns into a single regex with |:
regex = re.compile(r"(regex1)|(regex2)|...")
Again this will not tell you which regex matched, but you will have a match object that you can use for further information. For example you can find out which of the regex succeeded from the group that is not None:
>>> match = re.match("(a)|(b)|(c)|(d)", "c")
>>> match.groups()
(None, None, 'c', None)
However this can get complicated however if any of the sub-regex have groups in them as well, since the numbering will be changed.
This is probably faster than matching each regex individually since the regex engine has more scope for optimising the regex.
Since you have a finite set in this case, you could use short ciruit evaluation:
m = compiled_regex_1.match(name) or
compiled_regex_2.match(name) or
compiled_regex_3.match(name) or
print("ARGHHHH!")
In Python 2.6 or better:
import itertools as it
m = next(it.ifilter(None, (r.match(name) for r in regexes)), None)
The ifilter call could be made into a genexp, but only a bit awkwardly, i.e., with the usual trick for name binding in a genexp (aka the "phantom nested for clause idiom"):
m = next((m for r in regexes for m in (r.match(name),) if m), None)
but itertools is generally preferable where applicable.
The bit needing 2.6 is the next built-in, which lets you specify a default value if the iterator is exhausted. If you have to simulate it in 2.5 or earlier,
def next(itr, deft):
try: return itr.next()
except StopIteration: return deft
I use something like Dave Kirby suggested, but add named groups to the regexps, so that I know which one matched.
regexps = {
'first': r'...',
'second': r'...',
}
compiled = re.compile('|'.join('(?P<%s>%s)' % item for item in regexps.iteritems()))
match = compiled.match(my_string)
print match.lastgroup
Eric is in better track in taking bigger picture of what OP is aiming, I would use if else though. I would also think that using print function in or expression is little questionable. +1 for Nathon of correcting OP to use proper else statement.
Then my alternative:
# alternative to any builtin that returns useful result,
# the first considered True value
def first(seq):
for item in seq:
if item: return item
regexes = [
compiled_regex_1,
compiled_regex_2,
compiled_regex_3,
]
m = first(reg.match(name) for reg in regexes)
print(m if m else 'ARGL NOTHING MATCHES THIS!!!')