pyspark not working with regex - regex

I've made an RDD from a file containing a list of URLs:
url_data = sc.textFile("url_list.txt")
Now I'm trying to make another RDD with all rows that contain 'net.com' preceded by a character that is neither a digit nor a letter. I mean, include lines with .net.com or \tnet.com and exclude internet.com or cnet.com.
filtered_data = url_data.filter(lambda x: '[\W]net\.com' in x)
But this line gives no results.
How can I make the pyspark shell work with regex?

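Why not define a function in Python that uses the re package (or re2, which can be faster) and returns a bool indicating whether there is a match?
import re
def url_filter(url):
    pattern = re.compile(r'REGEX_PATTERN')
    # search() finds the pattern anywhere in the string; match() only tests the start
    return bool(pattern.search(url))
Then just pass it to the filter function: url_data.filter(url_filter)
The original filter gives no results because the in operator tests for a literal substring, so '[\W]net\.com' is never interpreted as a regex.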
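As a quick check of the predicate idea outside Spark, here is a minimal sketch that applies it to a plain Python list. The pattern \Wnet\.com, the name contains_net_com and the sample lines are assumptions based on the question; with an RDD the same function would be passed to url_data.filter.

```python
import re

# Assumed pattern: "net.com" preceded by one non-word character.
NET_COM = re.compile(r"\Wnet\.com")

def contains_net_com(line):
    """Return True if the line contains net.com preceded by a non-word character."""
    return NET_COM.search(line) is not None

# Hypothetical sample lines standing in for the contents of url_list.txt.
sample = ["http://www.net.com/a", "\tnet.com", "http://internet.com", "cnet.com/news"]
kept = [line for line in sample if contains_net_com(line)]
print(kept)
```

Note that \W has to consume a character, so a line that begins with net.com is not matched by this pattern.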

Related

Entire text is matched but not able to group in named groups

I have following example text:
my_app|key1=value1|user_id=testuser|ip_address=10.10.10.10
I want to extract sub-fields from it in following way:
appName = my_app,
[
{key = key1, value = value1},
{key = user_id, value = testuser},
{key = ip_address, value = 10.10.10.10}
]
I have written following regex for doing this:
(?<appName>\w+)\|(((?<key>\w+)?(?<equals>=)(?<value>[^\|]+))\|?)+
It matches the entire text but is not able to group it correctly in named groups.
Tried testing it on https://regex101.com/
What am I missing here?
I think the main problem you have is trying to write a regex that matches ALL the key=value pairs. That's not the way to do it. The correct way is to write a pattern that matches ONLY ONE key=value pair and apply it with a function that finds all occurrences of the pattern. Most languages supply such a function. Here's the code in Python, for example:
import re
txt = 'my_app|key1=value1|user_id=testuser|ip_address=10.10.10.10'
pairs = re.findall(r'(\w+)=([^|]+)', txt)
print(pairs)
This gives:
[('key1', 'value1'), ('user_id', 'testuser'), ('ip_address', '10.10.10.10')]
The pattern matches a key consisting of alphanumeric characters, (\w+), followed by = and a value. The value is matched by ([^|]+), that is, everything except a vertical bar, because the value can contain non-alphanumeric characters, such as the dots in the IP address.
Mind the findall function. There's a search function to catch a pattern once, and there's a findall function to catch all the patterns within the text.
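To make the difference concrete, here is a small sketch that runs the same compiled pattern through search and findall on the sample text. The variable names are illustrative only.

```python
import re

txt = 'my_app|key1=value1|user_id=testuser|ip_address=10.10.10.10'
pair = re.compile(r'(\w+)=([^|]+)')

# search() stops at the first occurrence and returns a single match object.
first = pair.search(txt)
first_pair = first.groups()

# findall() scans the whole string and returns every occurrence as a tuple of groups.
all_pairs = pair.findall(txt)

print(first_pair)
print(all_pairs)
```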
I tested it on regex101 and it worked.
I must comment, though, that the specific text pattern you are working with doesn't require regex. Most high-level languages supply a split function. You can split on the vertical bar, and then split each slice you get (except the first one) again on the equals sign.
Use the PyPI regex module with the following code:
import regex
s = "my_app|key1=value1|user_id=testuser|ip_address=10.10.10.10"
rx = r"(?<appName>\w+)(?:\|(?<key>\w+)=(?<value>[^|]+))+"
print( [(m.group("appName"), dict(zip(m.captures("key"),m.captures("value")))) for m in regex.finditer(rx, s)] )
# => [('my_app', {'ip_address': '10.10.10.10', 'key1': 'value1', 'user_id': 'testuser'})]
See the Python demo online.
The .captures() method returns all the values captured by a group across all of its iterations.
A regular expression may be unnecessary here; splitting along these lines,
data = 'my_app|key1=value1|user_id=testuser|ip_address=10.10.10.10'
x = data.split('|')
pairs = []
for index, item in enumerate(x):
    if index > 0:
        element = item.split('=')
        temp = {"key": element[0], "value": element[1]}
        pairs.append(temp)
appName = x[0] + ',' + str(pairs)
print(appName)
returns output close to the desired one:
my_app,[{'key': 'key1', 'value': 'value1'}, {'key': 'user_id', 'value': 'testuser'}, {'key': 'ip_address', 'value': '10.10.10.10'}]
The dict built in
temp = {"key":element[0],"value":element[1]}
can be replaced with whatever data structure you prefer.
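As one possible variation on the split approach, the sketch below returns the application name together with a plain dict of the pairs. The function name parse_record is hypothetical, and it assumes every field after the first contains an equals sign (a field without one would raise a ValueError).

```python
def parse_record(record):
    """Split 'app|k1=v1|k2=v2' into ('app', {'k1': 'v1', 'k2': 'v2'})."""
    app_name, *fields = record.split('|')
    # Split each field on the first '=' only, so values may themselves contain '='.
    pairs = dict(field.split('=', 1) for field in fields)
    return app_name, pairs

print(parse_record('my_app|key1=value1|user_id=testuser|ip_address=10.10.10.10'))
```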

Extract rows from csv file using regex substring?

I have a csv file that looks like this (obviously < anystring > means just that).
<anystring>tony_UPearly_start,1,2,3
<anystring>tony_UPlate_start,4,5,6
<anystring>tony_UP<anystring>_start,7,8,9
<anystring>jane_UPearly_start,1,2,3
<anystring>jane_UPlate_start,4,5,6
<anystring>jane_UP<anystring>_start,7,8,9
I am trying to extract the UP<anystring> rows (rows 3 and 6 in this example), using a negative lookahead to exclude rows 1, 2, 4 and 5:
import re
import csv
search = re.compile(r'.*_UP(?!early|late)')
output = []
with open('test.csv', mode='r', encoding='utf-8') as f:
    csvfile = csv.reader(f)
    for row in csvfile:
        if row[0] == search:
            output.append(row)
print(output)
>>>[]
when I am after
print (output)
[<anystring>tony_UP<anystring>_start,7,8,9, <anystring>jane_UP<anystring>_start,7,8,9]
The regex search works when I test it on a regex platform, but not in Python. Why?
Thanks for the comments: the search code now looks like
search = re.compile(r'^.*?_UP(?!early|late).*$')
output = []
with open('test.csv', mode='r', encoding='utf-8') as f:
    csvfile = csv.reader(f)
    for row in csvfile:
        search.search(row[0])  # I think this needs an "if", but it won't accept a boolean here?
        output.append(row)
This now returns all rows (i.e. it filters nothing, whereas before it filtered everything).
You want to return a list of rows that contain _UP not followed by early or late.
The pattern should look like
search = re.compile(r'_UP(?!early|late)')
You do not need any ^, .*, etc. because when you use re.search, you are looking for a pattern match anywhere inside a string.
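Then, all you need is to test each row for a regex match (with csv.reader each row is a list, so test row[0] instead; in the demo that follows, each row is a plain string):
if search.search(row):
    output.append(row)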
See the Python demo:
import re
csvfile="""<anystring>tony_UPearly_start,1,2,3
<anystring>tony_UPlate_start,4,5,6
<anystring>tony_UP<anystring>_start,7,8,9
<anystring>jane_UPearly_start,1,2,3
<anystring>jane_UPlate_start,4,5,6
<anystring>jane_UP<anystring>_start,7,8,9""".splitlines()
search = re.compile(r'_UP(?!early|late)')
output = []
for row in csvfile:
    if search.search(row):
        output.append(row)
print(output)
And the output is your expected list:
['<anystring>tony_UP<anystring>_start,7,8,9', '<anystring>jane_UP<anystring>_start,7,8,9']
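For the csv.reader case from the question, a minimal sketch might look like the following. The sample rows are hypothetical and io.StringIO stands in for open('test.csv'); since csv.reader yields each row as a list of strings, the pattern is tested against the first field.

```python
import csv
import io
import re

search = re.compile(r"_UP(?!early|late)")

# Hypothetical file contents; io.StringIO stands in for the opened test.csv.
data = io.StringIO(
    "x_tony_UPearly_start,1,2,3\n"
    "x_tony_UPlate_start,4,5,6\n"
    "x_tony_UPmid_start,7,8,9\n"
)

# Keep non-empty rows whose first field contains _UP not followed by early or late.
output = [row for row in csv.reader(data) if row and search.search(row[0])]
print(output)
```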

Using python regex to find repeated values after a header

If I have a string that looks something like:
s = """
...
Random Stuff
...
HEADER
a 1
a 3
# random amount of rows
a 17
RANDOM_NEW_HEADER
a 200
a 300
...
More random stuff
...
"""
Is there a clean way to use regex (in Python) to find all instances of a \d* after HEADER, but before the pattern is broken by SOMETHING_TOTALLY_DIFFERENT? I thought about something like:
import re
pattern = r'HEADER(?:\na \d*)*\na (\d*)'
print(re.findall(pattern, s))
Unfortunately, regex doesn't find overlapping matches. If there's no sensible way to do this with regex, I'm okay with anything faster than writing my own for loop to extract this data.
(TL;DR -- There's a distinct header, followed by a pattern that repeats. I want to catch each instance of that pattern, as long as there isn't a break in the repetition.)
EDIT:
To clarify, I don't necessarily know what SOMETHING_TOTALLY_DIFFERENT will be, only that it won't match a \d+. I want to collect all consecutive instances of \na \d+ that follow HEADER\n.
How about a simple loop?
import re
e = re.compile(r'(a\s+\d+)')
header = 'whatever your header field is'
breaker = 'something_different'
header_reached = False
results = []
with open('yourfile.txt') as f:
    for line in f:
        line = line.rstrip('\n')  # lines read from a file keep their trailing newline
        if line == header:
            # skip processing lines until we reach the header
            header_reached = True
            continue
        if header_reached:
            i = e.match(line)
            if i:
                results.append(i.group(1))
            elif line == breaker:
                # there was no match and we reached the breaker, so stop
                break
Not completely sure where you want the regex to stop; please clarify.
'((a \d*)\s){1,}'
import re
sentinel_begin = 'HEADER'
sentinel_end = 'SOMETHING_TOTALLY_DIFFERENT'
re.findall(r'(a \d*)', s[s.find(sentinel_begin): s.find(sentinel_end)])
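When the terminating line is not known in advance, one option is a two-step sketch: first capture the whole run of rows that directly follows the header, then extract the numbers from that run. The function name rows_after_header is hypothetical, and the sketch assumes each row has exactly the form "a <digits>" and that the header sits on a line of its own.

```python
import re

def rows_after_header(text, header="HEADER"):
    """Return the numbers from the consecutive 'a <digits>' rows right after the header line."""
    # Step 1: capture the run of rows that directly follows the header line.
    block = re.search(r"(?m)^" + re.escape(header) + r"((?:\na \d+)+)", text)
    if block is None:
        return []
    # Step 2: pull the individual numbers out of the captured run.
    return re.findall(r"\d+", block.group(1))

s = """
Random Stuff
HEADER
a 1
a 3
a 17
RANDOM_NEW_HEADER
a 200
a 300
More random stuff
"""
print(rows_after_header(s))
```

The run ends at the first line that does not look like a row, so nothing after RANDOM_NEW_HEADER is collected.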

Regular expression to group a pattern OR group empty string as ""

I'm using Python 3.3.2 with regular expressions. I have a pretty simple function
def DoRegexThings(somestring):
    m = re.match(r'(^\d+)( .*$)?', somestring)
    return m.group(1)
Which I am using to just get a numeric portion at the beginning of string, and discard the rest. However, it fails on the case of an empty string, since it is unable to match a group.
I've looked at this similar question which was asked previously, and changed my regular expression to this:
(^$)|(^\d+)( .*$)?
But it only causes it to return "None" every time, and still fails on empty strings. What I really want is a regular expression which I can use to either grab the numeric portion of my record, e.g. if the record is 1234 sometext, I just want 1234, or if the string is empty I want m.group(1) to return an empty string. My workaround right now is
m = re.match(r'(^\d+)( .*$)?', somestring)
if m is None:  # Handle empty string case
    return somestring
else:
    return m.group(1)
But if I can avoid checking the match object for None, I'd like to. Is there a way to accomplish this?
I think you're making this overly complicated:
re.match(r"\d*", somestring).group()
will return a number if it's at the start of the string (.match() ensures this) or the empty string if there is no number.
>>> import re
>>> somestring = "987kjh"
>>> re.match(r"\d*", somestring).group()
'987'
>>> somestring = "kjh"
>>> re.match(r"\d*", somestring).group()
''
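Wrapped into a function, the same idea might look like the sketch below. The name leading_digits is hypothetical. Because \d* can match zero characters, re.match always succeeds here and no None check is needed.

```python
import re

def leading_digits(somestring):
    """Return the digits at the start of the string, or '' if there are none."""
    # \d* may match the empty string, so match() never returns None for a str input.
    return re.match(r"\d*", somestring).group()

print(repr(leading_digits("1234 sometext")))
print(repr(leading_digits("")))
```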

Take first successful match from a batch of regexes

I'm trying to extract set of data from a string that can match one of three patterns. I have a list of compiled regexes. I want to run through them (in order) and go with the first match.
regexes = [
compiled_regex_1,
compiled_regex_2,
compiled_regex_3,
]
m = None
for reg in regexes:
    m = reg.match(name)
    if m: break
if not m:
    print('ARGL NOTHING MATCHES THIS!!!')
This should work (haven't tested yet) but it's pretty fugly. Is there a better way of boiling down a loop that breaks when it succeeds or explodes when it doesn't?
There might be something specific to re that I don't know about that allows you to test multiple patterns too.
You can use the else clause of the for loop:
for reg in regexes:
    m = reg.match(name)
    if m: break
else:
    print('ARGL NOTHING MATCHES THIS!!!')
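A self-contained sketch of the for/else form follows, with three hypothetical patterns standing in for compiled_regex_1 to compiled_regex_3. Raising ValueError in the else branch is an assumption about what "explodes" should mean; returning the regex alongside the match also records which pattern succeeded.

```python
import re

# Hypothetical patterns standing in for compiled_regex_1..3.
regexes = [re.compile(r"\d+"), re.compile(r"[a-z]+"), re.compile(r"[A-Z]+")]

def match_first(name):
    """Return (regex, match) for the first regex that matches, or raise ValueError."""
    for reg in regexes:
        m = reg.match(name)
        if m:
            break
    else:
        # This branch runs only when the loop finished without hitting break.
        raise ValueError("nothing matches %r" % name)
    return reg, m

reg, m = match_first("abc123")
print(reg.pattern, m.group())
```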
If you just want to know if any of the regex match then you could use the builtin any function:
if any(reg.match(name) for reg in regexes):
....
however this will not tell you which regex matched.
Alternatively you can combine multiple patterns into a single regex with |:
regex = re.compile(r"(regex1)|(regex2)|...")
Again this will not tell you which regex matched, but you will have a match object that you can use for further information. For example you can find out which of the regex succeeded from the group that is not None:
>>> match = re.match("(a)|(b)|(c)|(d)", "c")
>>> match.groups()
(None, None, 'c', None)
However, this can get complicated if any of the sub-regexes have groups in them as well, since the group numbering will change.
This is probably faster than matching each regex individually since the regex engine has more scope for optimising the regex.
Since you have a finite set in this case, you could use short-circuit evaluation (the parentheses are needed so the expression can span several lines):
m = (compiled_regex_1.match(name) or
     compiled_regex_2.match(name) or
     compiled_regex_3.match(name) or
     print("ARGHHHH!"))
In Python 2.6 or better:
import itertools as it
m = next(it.ifilter(None, (r.match(name) for r in regexes)), None)
The ifilter call could be made into a genexp, but only a bit awkwardly, i.e., with the usual trick for name binding in a genexp (aka the "phantom nested for clause idiom"):
m = next((m for r in regexes for m in (r.match(name),) if m), None)
but itertools is generally preferable where applicable.
The bit needing 2.6 is the next built-in, which lets you specify a default value if the iterator is exhausted. If you have to simulate it in 2.5 or earlier,
def next(itr, deft):
    try: return itr.next()
    except StopIteration: return deft
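For readers on Python 3, where itertools.ifilter no longer exists and the built-in filter is already lazy, the same idea can be sketched as follows. The two patterns and the name first_match are placeholders.

```python
import re

# Placeholder patterns; substitute your own compiled regexes.
regexes = [re.compile(r"\d+"), re.compile(r"[a-z]+")]

def first_match(name):
    """Return the first successful match object, or None if no regex matches."""
    # filter(None, ...) lazily drops the None results; next() stops at the first match.
    return next(filter(None, (r.match(name) for r in regexes)), None)

print(first_match("abc123"))
print(first_match("!!!"))
```

Because both the generator and filter are lazy, regexes after the first successful one are never tried.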
I use something like Dave Kirby suggested, but add named groups to the regexps, so that I know which one matched.
regexps = {
'first': r'...',
'second': r'...',
}
compiled = re.compile('|'.join('(?P<%s>%s)' % item for item in regexps.items()))
match = compiled.match(my_string)
print(match.lastgroup)
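A concrete instance of the named-group approach, using two hypothetical patterns, might look like this. It assumes a Python version where dicts preserve insertion order (3.7 and later), since the order of the alternatives decides which one is tried first.

```python
import re

# Hypothetical named patterns; the names become group names in the combined regex.
regexps = {"number": r"\d+", "word": r"[a-z]+"}
combined = re.compile("|".join("(?P<%s>%s)" % item for item in regexps.items()))

def which_matched(text):
    """Return (name of the alternative that matched, matched text), or None."""
    m = combined.match(text)
    return (m.lastgroup, m.group()) if m else None

print(which_matched("abc123"))
print(which_matched("42x"))
```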
Eric is on a better track in taking the bigger picture of what the OP is aiming for, though I would use if/else. I also think that using the print function in an or expression is a little questionable. +1 to Nathon for correcting the OP to use a proper else clause.
Then my alternative:
# alternative to any builtin that returns useful result,
# the first considered True value
def first(seq):
    for item in seq:
        if item: return item
regexes = [
compiled_regex_1,
compiled_regex_2,
compiled_regex_3,
]
m = first(reg.match(name) for reg in regexes)
print(m if m else 'ARGL NOTHING MATCHES THIS!!!')