Extracting Prices with Regex - regex

I'm looking to extract prices from a string of scraped data.
I'm using this at the moment:
re.findall(r'£(?:\d+\.)?\d+.\d+', '£1.01')
['£1.01']
Which works fine 99% of the time. However, I occasionally see this:
re.findall(r'£(?:\d+\.)?\d+.\d+', '£1,444.01')
['£1,444']
I'd like to see ['1444.01'] ideally.
This is an example of the string I'm extracting the prices from.
'\n £1,000.73 \n\n\n + £1.26\nUK delivery\n\n\n'
I'm after some help putting together the regex to get ['1000.73', '1.26'] from the string above.

You can grab all the values with '£(\d[\d.,]*)\b' and then remove the commas:
import re
s = '\n £1,000.73 \n\n\n + £1.26\nUK delivery\n\n\n'
r = re.compile(r'£(\d[\d.,]*)\b')
print([x.replace(',', '') for x in re.findall(r, s)])
# => ['1000.73', '1.26']
The £(\d[\d.,]*)\b pattern matches £, then captures a digit followed by zero or more digits, commas, or dots (as many as possible), backtracking if necessary to the last position where there is a word boundary.
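To see why the trailing \b matters, here is a small sketch with a made-up input where a price is directly followed by punctuation. The sample sentence and the helper name extract_prices are assumptions for illustration, not part of the original question.

```python
import re

# Same pattern as above: £, one digit, then any run of digits, dots, commas,
# trimmed back to the last position that is a word boundary.
pattern = re.compile(r'£(\d[\d.,]*)\b')

def extract_prices(text):
    """Return the captured prices with thousands separators removed."""
    return [m.replace(',', '') for m in pattern.findall(text)]

sample = 'Total: £1,000.73, plus £5.'  # hypothetical input

# The greedy class first grabs "1,000.73," and "5.", then backtracks past the
# trailing comma and dot because no word boundary follows them.
print(pattern.findall(sample))   # ['1,000.73', '5']
print(extract_prices(sample))    # ['1000.73', '5']
```

Without the \b, the captures would keep the trailing comma and dot.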

Related

How to combine independent regular expressions and apply them on all rows of a dataset using Pandas?

Problem Statement:
I have two separate regular expressions that I am trying to "combine" into one and apply to each row in a dataset. The matching part of each row should go to a new Pandas dataframe column called "Wanted". Please see the example data below for how values that match should be formatted in the "Wanted" column.
Example Data (how I want it to look):
Column0 | Wanted (want "Column0" to look like this)
Alice\t12-345-623/ 10-1234 | Alice, 12-345-623, 10-1234
Bob 201-888-697 / 12-0556a | Bob, 201-888-697, 12-0556a
Tim 073-110-101 / 13-1290 | Tim, 073-110-101, 13-1290
Joe 74-111-333/ 33-1290 and Amy(12-345-623)/10-1234c | Joe, 74-111-333, 33-1290, Amy, 12-345-623, 10-1234c
In other words...:
2-3 digits ----- hyphen ---- 3 digits --- hyphen ---- 3 digits ---- any character ----
2 digits --- hyphen --- 4 digits ---- permit one single character
What I have tried #1:
After dinking around for a while I figured out two different regular expressions that on their own will solve part of the problem. Kinda.
This matches the first group of numbers in each row, but not the second group, which I also want. I'm not sure how robust it is, though.
Example Problem Row (regex = r"(?:\d{1,3}-){0,3}\d{1,3}")
import re
search_in = "Alice\t12-345-623/ 10-1234"
wanted_regex = r"(?:\d{1,3}\-){0,3}\d{1,3}"
match = re.search(wanted_regex, search_in)
match.group(0)
Wanted: Alice, 12-345-623, 10-1234
Got: 12-345-623 # matches the group of numbers but isn't formatted how I would like (see example data)
What I have tried #2:
This matches the second part in each row, but only if it's the only value in the column. The problem I have is that it matches on the first group of digits instead of the second.
Example Problem Row (regex = r"(?:\d{2,3}-){1}\d{3,4}") # different regex than above!
search_in = "Alice\t12-345-623/ 10-1234"
wanted_regex = r"(?:\d{2,3}\-){1}\d{3,4}"
match = re.search(wanted_regex, search_in)
match.group(0)
Wanted : Alice, 12-345-623, 10-1234
Got: 12-345 # matched on the first part
Known Problems:
When I try, "Alice\t12-345-623/ 10-1234", it will match "12-345" when I'm trying to match "10-1234"
Thank you!
Thanks in advance to all you wizards being willing to help me with this problem. I really appreciate it:)
Note: I have asked a related regex question that may make this problem easier to solve. It might not, but here is the link anyway --> How to use regex to select a row and a fixed number of rows following a row containing a specific substring in a pandas dataframe
This works for the four test examples you gave. How about using the .split() method? Technically it returns a list of values rather than a string.
import re
# text here
text = "Joe 74-111-333/ 33-1290 and Amy(12-345-623)/10-1234c"
# split this out to a list. remove the ending parenthesis since you are *splitting* on this
new_splits = re.split(r'\t|/|and|\(| ', text.replace(')',''))
# filter out the blank spaces
list(filter(None,new_splits))
['Joe', '74-111-333', '33-1290', 'Amy', '12-345-623', '10-1234c']
and if you are using pandas you can try the same steps above:
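# remove the closing parenthesis first, as in the plain-Python version above
# r'\\t' covers a literal backslash-t in the data, r'\t' a real tab character
df['answer_Step1'] = df['Column0'].str.replace(')', '', regex=False).str.split(r'\\t|\t|/|and|\(| ')
df['answer_final'] = df['answer_Step1'].apply(lambda x: list(filter(None,x)))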
You can use
re.sub(r'\s*\band\b\s*|[^\w-]+', ', ', text)
Pandas version:
df['Wanted'] = df['Column0'].str.replace(r'\s*\band\b\s*|[^\w-]+', ', ', regex=True)
Details:
\s*\band\b\s* - the whole word and (\b are word boundaries), surrounded by zero or more whitespace chars
| - or
[^\w-]+ - one or more chars other than letters, digits, _ and -
See a Python demo:
import re
texts = ['Alice 12-345-623/ 10-1234',
'Bob 201-888-697 / 12-0556a','Tim 073-110-101 / 13-1290',
'Joe 74-111-333/ 33-1290 and Amy(12-345-623)/10-1234c']
for text in texts:
    print(re.sub(r'\s*\band\b\s*|[^\w-]+', ', ', text))
# => Alice, 12-345-623, 10-1234
# Bob, 201-888-697, 12-0556a
# Tim, 073-110-101, 13-1290
# Joe, 74-111-333, 33-1290, Amy, 12-345-623, 10-1234c
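Going back to the question's original idea of combining two independent patterns: another option is to join them with alternation and collect every token with findall. The sketch below builds the two number patterns from the digit layout described in the question; the name pattern, the exclusion of the word "and", and the helper name wanted are assumptions for illustration.

```python
import re

# One alternation instead of two separate searches. Order matters: the longer
# three-part number is tried before the two-part one.
token = re.compile(
    r'\b(?!and\b)[A-Za-z]+\b'    # a name, but not the connecting word "and"
    r'|\d{2,3}-\d{3}-\d{3}'      # 2-3 digits - 3 digits - 3 digits
    r'|\d{2}-\d{4}[A-Za-z]?'     # 2 digits - 4 digits, optionally one letter
)

def wanted(text):
    """Join every name/number token found in text with ', '."""
    return ', '.join(token.findall(text))

rows = ['Alice\t12-345-623/ 10-1234',
        'Bob 201-888-697 / 12-0556a',
        'Joe 74-111-333/ 33-1290 and Amy(12-345-623)/10-1234c']
for row in rows:
    print(wanted(row))
```

In pandas, the same idea could probably be written as df['Column0'].str.findall(token.pattern).str.join(', '). Unlike the re.sub approach, this keeps only the tokens that fit the expected layout and silently drops everything else, which may or may not be what you want.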

Get segment of string in between characters

I have a giant data set that includes lots of file names with various parts of strings that I need to grab.
I have this code segment currently:
def fps(data):
    for i in data:
        pattern = r'.(\d{4}).' # finds data in between the periods
        frames = re.findall(pattern, ' '.join(data)) #puts info into frames list
    frames.sort()
    for i in range(len(frames)): #Turns the str into integers
        frames[i] = int(frames[i])
    return frames
This is great and all, but it only returns exactly 4 characters between two periods.
How would I grab the part of the string after a period and before the next period?
Preferably without using regular expressions, because they're a little too complex for a simpleton like me.
For example:
One string may look like this
string = ['filename.0530.extension']
while the others may look like this
string2 = ['filename.042.extension']
string3 = ['filename.045363.extension']
I would need to output the numbers in between the periods on the terminal so:
0530, 042, 045363
To match your example data, you could match a dot, capture one or more digits in a group with \d+ (instead of exactly four with \d{4}), and then match another dot:
\.(\d+)\.
If you want to match everything between the dots, you could use a negated character class [^.] to match any character that is not a dot:
\.([^.]+)\.
Note that to match a literal dot you have to escape it: \.
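Applied to the file names from the question, the first pattern could be used roughly as below. Since the question asked for something simpler than regex, the sketch also shows a plain str.split alternative; both function names are made up for illustration, and the split version assumes every name has the form name.middle.extension.

```python
import re

names = ['filename.0530.extension',
         'filename.042.extension',
         'filename.045363.extension']

def frames_regex(names):
    """Digits between the first pair of dots, kept as strings."""
    found = []
    for name in names:
        m = re.search(r'\.(\d+)\.', name)
        if m:
            found.append(m.group(1))
    return found

def frames_split(names):
    """Same result without regex: take the second dot-separated field."""
    return [name.split('.')[1] for name in names if name.count('.') >= 2]

print(frames_regex(names))  # ['0530', '042', '045363']
print(frames_split(names))  # ['0530', '042', '045363']
```

Keeping the values as strings preserves the leading zeros; converting with int() would turn '0530' into 530.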
To match the numbers between your periods in your example, you can use this:
^.*\.[^.\s]*?\.?(\d+)\..*$

Python regex negative lookbehind embedded numeric number

I am trying to pull a certain number from various strings. The number has to be standalone, before ', or before (. The regex I came up with was:
\b(?<!\()(x)\b(,|\(|'|$) <- x is the numeric number.
If x is 2, this pulls the following string (almost) fine, except it also pulls 2'abd'. Any advice what I did wrong here?
2(2'Abf',3),212,2'abc',2(1,2'abd',3)
As I understand it, your actual question is how to get that specific number everywhere except inside parentheses.
To do so, I suggest using the skip_what_to_avoid|what_i_want pattern like this:
(\((?>[^()\\]++|\\.|(?1))*+\))
|\b(2)(?=\b(?:,|\(|'|$))
The idea here is to disregard the overall matches and the first group, which is used by the recursive pattern (\((?>[^()\\]++|\\.|(?1))*+\)) to capture everything between parentheses: that's the trash bin. Instead, we only need to check capture group 2, which, when set, contains the number found outside the parentheses.
Sample Code:
import regex as re
regex = r"(\((?>[^()\\]++|\\.|(?1))*+\))|\b(2)(?=\b(?:,|\(|'|$))"
test_str = "2(2'Abf',3),212,2'abc',2(1,2'abd',3)"
matches = re.finditer(regex, test_str, re.MULTILINE)
for matchNum, match in enumerate(matches):
    matchNum = matchNum + 1
    if match.groups()[1] is not None:
        print ("Found at {start}-{end}: {group}".format(start = match.start(2), end = match.end(2), group = match.group(2)))
Output:
Found at 0-1: 2
Found at 16-17: 2
Found at 23-24: 2
This solution requires the third-party regex package (pip install regex), because the standard re module does not support recursion such as (?1).
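If the parentheses are never nested, as in the sample string, the same skip-or-capture idea can be sketched with the standard re module. This is a simplified variant rather than the answer's pattern: it swaps the recursive group for \([^()]*\), so it would not handle nested parentheses or escaped ones.

```python
import re

# Alternative 1 consumes a whole (non-nested) parenthesised block: the trash bin.
# Alternative 2 captures a standalone 2 followed by , ( ' or end of string.
pattern = re.compile(r"\([^()]*\)|\b(2)\b(?=[,(']|$)")

test_str = "2(2'Abf',3),212,2'abc',2(1,2'abd',3)"

# Keep only the matches where group 1 was set, i.e. outside parentheses.
starts = [m.start(1) for m in pattern.finditer(test_str)
          if m.group(1) is not None]
print(starts)  # [0, 16, 23]
```

The positions agree with the output of the regex-package version above for this input.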

Extracting email addresses from messy text in OpenRefine

I am trying to extract just the emails from a text column in OpenRefine. Some cells have just the email, but others have the name and email in john doe <john@doe.com> format. I have been using the following GREL/regex, but it does not return the entire email address. For the above example I'm getting ["n@doe.com"]
value.match(
/.*([a-zA-Z0-9_\-\+]+@[\._a-zA-Z0-9-]+).*/
)
Any help is much appreciated.
The n is captured because you are using .* before the capturing group. Since .* greedily matches any 0+ chars other than line break chars, the only char that can land in Group 1 during backtracking is the char right before @.
If you can get partial matches, get rid of the .* and use
/[^<\s]+@[^\s>]+/
Details
[^<\s]+ - 1 or more chars other than < and whitespace
@ - a @ char
[^\s>]+ - 1 or more chars other than whitespace and >.
Python/Jython implementation:
import re
res = ''
m = re.search(r'[^<\s]+@[^\s>]+', value)
if m:
    res = m.group(0)
return res
There are other ways to match these strings. In case you need a full string match, use .*<([^<]+@[^>]+)>.* where .* will not gobble the name since it will stop before an obligatory <.
If some cells contain just the email, it's probably better to use @wiktor-stribiżew's partial match. In the development version of OpenRefine, there is now a value.find() function that can do this, but it will only be officially implemented in the next version (2.9). In the meantime, you can reproduce it using Python/Jython instead of GREL:
import re
return re.findall(r"[^<\s]+@[^\s>]+", value)[0]
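The greedy-backtracking effect described above can be reproduced outside OpenRefine. The following is a plain Python sketch, assuming a cell of the form john doe <john@doe.com>; the helper name first_email is made up, and unlike the [0] indexing above it returns an empty string when a cell has no address.

```python
import re

cell = 'john doe <john@doe.com>'  # sample cell, assumed format

# The leading .* swallows the whole string, then gives back one char at a
# time until the group can match, so only one char before @ is left for it.
greedy = re.match(r'.*([a-zA-Z0-9_\-+]+@[._a-zA-Z0-9-]+).*', cell)
print(greedy.group(1))  # n@doe.com

def first_email(text):
    """Partial match: first run of non-space, non-bracket chars around @."""
    m = re.search(r'[^<\s]+@[^\s>]+', text)
    return m.group(0) if m else ''

print(first_email(cell))            # john@doe.com
print(first_email('john@doe.com'))  # john@doe.com
```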

How to better this regex?

I have a list of strings like this:
/soccer/poland/ekstraklasa-2008-2009/results/
/soccer/poland/orange-ekstraklasa-2007-2008/results/
/soccer/poland/orange-ekstraklasa-youth-2010-2011/results/
From each string I want to take a middle part resulting in respectively:
ekstraklasa
orange ekstraklasa
orange ekstraklasa youth
My code here does the job but it feels like it can be done in fewer steps and probably with regex alone.
name = re.search(r'/([-a-z\d]+)/results/', string).group(1) # take the middle part
name = re.search(r'[-a-z]+', name).group() # trim numbers
if name.endswith('-'):
    name = name[:-1] # trim trailing `-` if needed
name = name.replace('-', ' ')
Can anyone see how to make it better?
This regex should do the work:
/(?:\/\w+){2}\/([\w\-]+)(?:-\d+){2}/
Explanation:
(?:\/\w+){2} - eat the first two words delimited by /
\/ - eat the next /
([\w\-]+) - match the word characters or hyphens (this is what we're looking for)
(?:-\d+){2} - eat the hyphens and the numbers after the part we're looking for
The result is in the first match group
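In Python the surrounding / delimiters are dropped and the slashes need no escaping. A minimal sketch of how this pattern might be applied to the three sample paths, with the final hyphen-to-space replacement from the question added:

```python
import re

paths = ['/soccer/poland/ekstraklasa-2008-2009/results/',
         '/soccer/poland/orange-ekstraklasa-2007-2008/results/',
         '/soccer/poland/orange-ekstraklasa-youth-2010-2011/results/']

# Two leading path segments, then the name, then exactly two "-digits" parts.
pattern = re.compile(r'(?:/\w+){2}/([\w-]+)(?:-\d+){2}')

# Assumes every path matches; search() would return None otherwise.
names = [pattern.search(p).group(1).replace('-', ' ') for p in paths]
print(names)  # ['ekstraklasa', 'orange ekstraklasa', 'orange ekstraklasa youth']
```

The capture group is greedy, but it backtracks just far enough to leave the two year parts for (?:-\d+){2}.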
I can't test it because I am not using Python, but I would use an expression like
^(/soccer/poland/)([a-z\-]*)(.*)$
or
^(/[a-z]*/[a-z]*/)([a-z\-]*)(.*)$
This expression matches "/soccer/poland/" at the beginning, then everything made of lowercase a to z or "-", and then the rest of the string.
Then take the 2nd group.
The groups should hold these strings:
/soccer/poland/
orange-ekstraklasa-youth-
2010-2011/results/
Then simply replace "-" with " " and trim the spaces afterwards.
PS: If you're using regex101.com, for example, you need to escape / and test one line of input at a time.
Expression
^(\/soccer\/poland\/)([a-z\-]*)(.*)$
and one line of your input:
/soccer/poland/orange-ekstraklasa-youth-2010-2011/results/
If you prefer to use the expression for more than just soccer and poland, use
^(\/[a-z]*\/[a-z]*\/)([a-z\-]*)(.*)$
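Since this answer was written without access to Python, here is a short sketch of how the generic expression might look there. The function name league_name is made up for illustration, and the slashes are left unescaped because Python does not need them escaped.

```python
import re

def league_name(path):
    """Take group 2, replace hyphens with spaces, trim the trailing space."""
    m = re.match(r'^(/[a-z]*/[a-z]*/)([a-z\-]*)(.*)$', path)
    if m is None:
        return None
    return m.group(2).replace('-', ' ').strip()

print(league_name('/soccer/poland/orange-ekstraklasa-youth-2010-2011/results/'))
# orange ekstraklasa youth
```

Group 2 stops at the first digit, so it ends with a hyphen ('orange-ekstraklasa-youth-'), which is why the final strip() is needed.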