Extracting a number following specific text in R - regex

I have a data frame which contains a column full of text. I need to capture the number (can potentially be any number of digits from most likely 1 to 4 digits in length) that follows a certain phrase, namely 'Floor Area' or 'floor area'. My data will look something like the following:
"A beautiful flat on the 3rd floor with floor area: 50 sqm and a lift"
"Newbuild flat. Floor Area: 30 sq.m"
"6 bed house with floor area 50 sqm, lot area 25 sqm"
If I try to extract just the number, or if I look back from sqm, I will sometimes get the lot area by mistake. If someone could help me with a lookahead regex or similar in stringr, I'd appreciate it. Regex is a weak point for me. Many thanks in advance.

A common technique to extract a number before or after a word is to use sub: match the string up to the word and the number (or the number and the word), capture the number, match the rest of the string, and replace the whole match with the captured substring:
# Extract the first number after a word:
as.integer(sub(".*?<WORD_OR_PATTERN_HERE>.*?(\\d+).*", "\\1", x))
# Extract the first number before a word:
as.integer(sub(".*?(\\d+)\\s*<WORD_OR_PATTERN_HERE>.*", "\\1", x))
NOTE: Replace \\d+ with \\d+(?:\\.\\d+)? to match int or float numbers (for consistency with the code above, remember to change as.integer to as.numeric). \\s* matches 0 or more whitespace characters in the second sub.
For the current scenario, a possible solution will look like
v <- c("A beautiful flat on the 3rd floor with floor area: 50 sqm and a lift","Newbuild flat. Floor Area: 30 sq.m","6 bed house with floor area 50 sqm, lot area 25 sqm")
as.integer(sub("(?i).*?\\bfloor area:?\\s*(\\d+).*", "\\1", v))
# [1] 50 30 50
See the regex demo.
You may also leverage a capturing mechanism with str_match from stringr and get the second column value ([,2]):
> library(stringr)
> v <- c("A beautiful flat on the 3rd floor with floor area: 50 sqm and a lift","Newbuild flat. Floor Area: 30 sq.m","6 bed house with floor area 50 sqm, lot area 25 sqm")
> as.integer(str_match(v, "(?i)\\bfloor area:?\\s*(\\d+)")[,2])
[1] 50 30 50
See the regex demo.
The regex matches:
(?i) - in a case-insensitive way
\\bfloor area:? - a whole word (\b is a word boundary) floor area followed by an optional : (one or zero occurrence, ?)
\\s* - zero or more whitespace
(\\d+) - Group 1 (will be in [,2]) capturing one or more digits
See R demo online
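For readers working in Python rather than R, the same capture-group idea can be sketched with the re module. This is only an illustration of the pattern explained above; the helper name floor_area is made up for the example.

```python
import re

texts = [
    "A beautiful flat on the 3rd floor with floor area: 50 sqm and a lift",
    "Newbuild flat. Floor Area: 30 sq.m",
    "6 bed house with floor area 50 sqm, lot area 25 sqm",
]

# Same pattern as in the R answer; re.IGNORECASE plays the role of (?i).
pattern = re.compile(r"\bfloor area:?\s*(\d+)", re.IGNORECASE)


def floor_area(text):
    """Return the first number after 'floor area' as an int, or None (hypothetical helper)."""
    m = pattern.search(text)
    return int(m.group(1)) if m else None


areas = [floor_area(t) for t in texts]
print(areas)  # [50, 30, 50]
```

Returning None for rows without a match mirrors the NA that str_match gives in R.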

The following regex may get you started:
[Ff]loor\s+[Aa]rea:?\s+(\d{1,4})
The DEMO.

use following regex with Case Insensitive matching:
floor\s*area:?\s*(\d{1,4})

You need a lookbehind-style regex. \K discards everything matched so far, so only the digits are returned.
str_extract_all(x, "\\b[Ff]loor [Aa]rea:?\\s*\\K\\d+", perl=T)
or
str_extract_all(x, "(?i)\\bfloor area:?\\s*\\K\\d+", perl=T)
DEMO
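I don't know why the above code won't return anything (possibly because stringr uses the ICU regex engine, which does not support \K, and str_extract_all has no perl argument). You may also try sub: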
> sub(".*\\b[Ff]loor\\s+[Aa]rea:?\\s*(\\d+).*", "\\1", x)
[1] "50" "30" "50"

text<- "A beautiful flat on the 3rd floor with floor area: 50 sqm and a lift"
unique(na.omit(as.numeric(unlist(strsplit(unlist(text), "[^0-9]+")))))
# [1] 3 50
Hope this helped.

Related

Regex to get any numbers after the occurrence of a string in a line

Hi guys, I'm trying to get a substring as well as the corresponding number from this string:
text = "Milk for human consumption may be taken only from cattle from 80 hours after the last treatment."
I want to select the word milk and the corresponding number 80 from this sentence. This is part of a larger file and I want a generic solution to get the word milk in a line and then the first number that occurs after this word anywhere in that line.
(Milk+)\d
This is what I came up with, thinking that I can make a group milk and then check for digits, but I'm stumped on how to search for numbers anywhere on the line and not just immediately after the word milk. Also, is there any way to make the search case insensitive?
Edit: I'm looking to get both the word and the number if possible, e.g. "milk" "80", and I'm using Python.
/(?<!\p{L})([Mm]ilk)(?!\p{L})\D*(\d+)/
This matches the following strings, with the match and the contents of the two capture groups noted.
"The Milk99" # "Milk99" 1:"Milk" 2:"99"
"The milk99 is white" # "milk99" 1:"milk" 2:"99"
"The 8 milk is 99" # "milk is 99" 1:"milk" 2:"99"
"The 8milk is 45 or 73" # "milk is 45" 1:"milk" 2:"45"
The following strings are not matched.
"The Milk is white"
"The OJ is 99"
"The milkman is 37"
"Buttermilk is 99"
"MILK is 99"
This regular expression could be made self-documenting by writing it in free-spacing mode:
/
(?<!\p{L}) # the following match is not preceded by a Unicode letter
([Mm]ilk) # match 'M' or 'm' followed by 'ilk' in capture group 1
(?!\p{L}) # the preceding match is not followed by a Unicode letter
\D* # match zero or more characters other than digits
(\d+) # match one or more digits in capture group 2
/x # free-spacing regex definition mode
\D* could be replaced with .*?, ? making the match non-greedy. If the greedy variant were used (.*), the second capture group for "The 8milk is 45 or 73" would contain "3".
To match "MILK is 99", change ([Mm]ilk) to (?i)(milk).
This seems to do what you want in Java (I overlooked that the questioner wanted Python, or the question was edited later):
String example =
"Test 40\n" +
"Test Test milk for human consumption may be taken only from cattle from hours after the last treatment." +
"\nTest Milk for human consumption may be taken only from cattle from 80 hours after the last treatment." +
"\nTest miLk for human consumption may be taken only from cattle from 80 hours after the last treatment.";
Matcher m = Pattern.compile("((?i)(milk).*?(\\d+).*\n?)+").matcher(example);
m.find();
System.out.print(m.group(2) + m.group(3));
Look at how it tests whether the word "milk" appears, case-insensitively, anywhere before a number on the same line, and prints only those two. It also prints only the first occurrence found (finding all occurrences is easy with a small modification of the given code).
I hope the way it extracts both of these from the matching pattern fits your task.
You should try this one
(Milk).*?(\d+)
Based on your language, you can also specify a case-insensitive search. Example in JS: /(Milk).*?(\d+)/i, the final i makes the search case insensitive.
Note the *?, the most important part! This is a lazy quantifier. In other words, it reads any character, but as soon as it can stop and match the next part of the pattern successfully, it does. Here, as soon as a digit can be read, it is read. A plain greedy * would instead have consumed as much as possible, leaving only the final digit of the last number on the line after Milk for the capture group.
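Since the question asks for Python, here is a minimal sketch of the same idea with re. It assumes \b word boundaries instead of the letter-only lookarounds shown earlier, so digits and underscores next to "milk" also block a match; the function name word_and_number is made up for the example.

```python
import re

# (milk) as a whole word, then any non-digits, then the first run of digits.
# \b is an assumption: it treats digits and underscores as word characters.
MILK_THEN_NUMBER = re.compile(r"\b(milk)\b\D*(\d+)", re.IGNORECASE)


def word_and_number(line):
    """Return (word, number) as strings, or None if the line has no match (hypothetical helper)."""
    m = MILK_THEN_NUMBER.search(line)
    return m.groups() if m else None


text = ("Milk for human consumption may be taken only from cattle "
        "from 80 hours after the last treatment.")
print(word_and_number(text))  # ('Milk', '80')
```

For a larger file, the function can be applied line by line, keeping only the results that are not None.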

Regex match characters when not preceded by a string

I am trying to match spaces just after punctuation marks so that I can split up a large corpus of text, but I am seeing some common edge cases with places, titles and common abbreviations:
I am from New York, N.Y. and I would like to say hello! How are you today? I am well. I owe you $6. 00 because you bought me a No. 3 burger. -Sgt. Smith
I am using this with the re.split function in Python 3. I want to get this:
["I am from New York, N.Y. and I would like to say hello!",
"How are you today?",
"I am well.",
"I owe you $6. 00 because you bought me a No. 3 burger.",
"-Sgt. Smith"]
This is currently my regex:
(?<=[\.\?\!])(?<=[^A-Z].)(?<=[^0-9].)(?<=[^N]..)(?<=[^o].)
I decided to try to fix the No. first, with the last two conditions. But it relies on matching the N and the o independently, which I think is going to cause false positives elsewhere. I cannot figure out how to get it to match just the string No behind the period. I will then use a similar approach for Sgt. and any other "problem" strings I come across.
I am trying to use something like:
(?<=[\.\?\!])(?<=[^A-Z].)(?<=[^0-9].)^(?<=^No$)
But it doesn't capture anything after that. How can I get it to exclude certain strings which I expect to have a period in it, and not capture them?
Here is a regexr of my situation: https://regexr.com/4sgcb
This is the closest regex I could get (the trailing space is the one we match):
(?<=(?<!(No|\.\w))[\.\?\!])(?! *\d+ *)
which will split also after Sgt. for the simple reason that a lookbehind assertion has to be fixed width in Python (what a limitation!).
This is how I would do it in vim, which has no such limitation (the trailing space is the one we match):
\(\(No\|Sgt\|\.\w\)\#<![?.!]\)\( *\d\+ *\)\#!\zs
For the OP as well as the casual reader, this question and the answers to it are about lookarounds and are very interesting.
You may consider a matching approach, it will offer you better control over the entities you want to count as single words, not as sentence break signals.
Use a pattern like
\s*((?:\d+\.\s*\d+|(?:No|M[rs]|[JD]r|S(?:r|gt))\.|\.(?!\s+-?[A-Z0-9])|[^.!?])+(?:[.?!]|$))
See the regex demo
It is very similar to what I posted here, but it contains a pattern to match poorly formatted float numbers, added No. and Sgt. abbreviation support and a better handling of strings not ending with final sentence punctuation.
Python demo:
import re
p = re.compile(r'\s*((?:\d+\.\s*\d+|(?:No|M[rs]|[JD]r|S(?:r|gt))\.|\.(?!\s+-?[A-Z0-9])|[^.!?])+(?:[.?!]|$))')
s = "I am from New York, N.Y. and I would like to say hello! How are you today? I am well. I owe you $6. 00 because you bought me a No. 3 burger. -Sgt. Smith"
for m in p.findall(s):
    print(m)
Output:
I am from New York, N.Y. and I would like to say hello!
How are you today?
I am well.
I owe you $6. 00 because you bought me a No. 3 burger.
-Sgt. Smith
Pattern details
\s* - matches 0 or more whitespace (used to trim the results)
(?:\d+\.\s*\d+|(?:No|M[rs]|[JD]r|S(?:r|gt))\.|\.(?!\s+-?[A-Z0-9])|[^.!?])+ - one or more occurrences of several alternatives:
\d+\.\s*\d+ - 1+ digits, ., 0+ whitespaces, 1+ digits
(?:No|M[rs]|[JD]r|S(?:r|gt))\. - abbreviated strings like No., Mr., Ms., Jr., Dr., Sr., Sgt.
\.(?!\s+-?[A-Z0-9]) - matches a dot not followed by 1 or more whitespace, then an optional -, and then an uppercase letter or digit
| - or
[^.!?] - any character but a ., !, or ?
(?:[.?!]|$) - a ., !, or ?, or the end of the string.
As mentioned in my comment above, if you are not able to define a fixed set of edge cases, this might not be possible without false positives or false negatives. Again, without context you are not able to distinguish between abbreviations like "-Sgt. Smith" and ends of sentences like "Sergeant is oftentimes abbreviated as Sgt. This makes it shorter.".
However, if you can define a fixed set of edge cases, it's probably easier and much more readable to do this in multiple steps.
1. Identify your edge cases
For example, you can distinguish "I'll have a No. 3" and "No. I am your father" by checking for a subsequent number. So you would identify that edge case with a regex like this: No\. \d (Again, context matters. Sentences like "Is 200 enough? No. 200 is not enough." will still give you a false positive.)
2. Mask your edge cases
For each edge case, mask the string with a respective string that will 100% not be part of the original text. E.g. "No." => "======NUMBER======"
3. Run your algorithm
Now that you got rid of your unwanted punctuations, you can run a simpler regex like this to identify the true positives: [\.\!\?]\s
4. Unmask your edge cases
Turn "======NUMBER======" back into "No."
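The four steps above can be sketched in Python roughly as follows. This is a minimal illustration that handles only the "No. <digit>" edge case; the mask string and the function name split_sentences are assumptions, and other abbreviations would need their own masks.

```python
import re

# Placeholder assumed never to occur in the real text (step 2).
MASK = "======NUMBER======"


def split_sentences(text):
    """Mask 'No.' before a digit, split on sentence punctuation, then unmask."""
    # Steps 1 and 2: identify "No." followed by a space and a digit, and mask it.
    masked = re.sub(r"\bNo\.(?= \d)", MASK, text)
    # Step 3: split on whitespace that follows '.', '!' or '?'.
    parts = re.split(r"(?<=[.!?])\s+", masked)
    # Step 4: turn the mask back into "No.".
    return [part.replace(MASK, "No.") for part in parts]


sample = "I am well. You bought me a No. 3 burger. Thanks!"
print(split_sentences(sample))
# ['I am well.', 'You bought me a No. 3 burger.', 'Thanks!']
```

Keeping each edge case as its own masking step makes it easy to add or remove cases without touching the splitting regex.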
Doing it with only one regex will be tricky - as stated in comments, there are lots of edge cases.
Myself I would do it with three steps:
Replace spaces that should stay with some special character (re.sub)
Split the text (re.split)
Replace the special character with space
For example:
import re
zero_width_space = '\u200B'
s = 'I am from New York, N.Y. and I would like to say hello! How are you today? I am well. I owe you $6. 00 because you bought me a No. 3 burger. -Sgt. Smith'
s = re.sub(r'(?<=\.)\s+(?=[\da-z])|(?<=,)\s+|(?<=Sgt\.)\s+', zero_width_space, s)
s = re.split(r'(?<=[.?!])\s+', s)
from pprint import pprint
pprint([line.replace(zero_width_space, ' ') for line in s])
Prints:
['I am from New York, N.Y. and I would like to say hello!',
'How are you today?',
'I am well.',
'I owe you $6. 00 because you bought me a No. 3 burger.',
'-Sgt. Smith']

Regex for extracting exactly 10 digits from String

I have multiple formats of strings from which I have to extract exactly 10 digit number.
I have tried the following regexes for it. But it extracts the first 10 digits from the number instead of ignoring it.
([0-9]{10}|[0-9\s]{12})
([[:digit:]]{10})
These are the formats
Format 1
KINDLY AUTH FOR FUNDS
ACC 1469007967 (Number needs to be extracted)
AMT R5 000
DD 15/5
FROM:006251
Format 2
KINDLY AUTH FOR FUNDS
ACC 146900796723423 **(Want to ignore this number)**
AMT R5 000
AMT R30 000
DD 15/5
FROM:006251
Format 3
PLEASE AUTH FUNDS
ACC NAME-PREMIER FISHING
ACC NUMBER -1186 057 378 **(the number after - sign needs to be extracted)**
CHQ NOS-7132 ,7133,7134
AMOUNTS-27 000,6500,20 000
THANKS
FROM:190708
Format 4
PLEASE AUTHORISE FOR FUNDS ON AC
**1162792833** CHQ:104-R8856.00 AND (The number in ** needs to be extracted)
CHQ:105-R2772.00
REGARDS,
To match those numbers in either format, 10 consecutive digits or 4 digits, a space, 3 digits, a space and 3 digits, you might use a backreference \1 to a capturing group that matches an optional space. The backreference forces both separators to be the same: either two spaces or none.
Surround the pattern by word boundaries \b to prevent the digits being part of a larger word.
\b\d{4}( ?)\d{3}\1\d{3}\b
Regex demo
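As a quick check of the backreference pattern above, here is a small Python sketch run against single lines taken from the question's formats. The function name account_number and the choice to strip the spaces from the result are assumptions made for the example.

```python
import re

# Either both separators are a space or neither is, thanks to the \1 backreference.
ACCOUNT = re.compile(r"\b\d{4}( ?)\d{3}\1\d{3}\b")


def account_number(line):
    """Return the 10-digit number with spaces removed, or None (hypothetical helper)."""
    m = ACCOUNT.search(line)
    return m.group(0).replace(" ", "") if m else None


print(account_number("ACC 1469007967"))            # 1469007967
print(account_number("ACC 146900796723423"))       # None
print(account_number("ACC NUMBER -1186 057 378"))  # 1186057378
```

The 15-digit number is rejected because \b cannot match between two digits, so no 10-digit window inside it satisfies both boundaries.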
Your expression seems to be fine; it is just missing a word boundary, and we might also want to tighten the second alternative, just in case:
\b([0-9]{10}|[0-9]{4}\s[0-9]{3}\s[0-9]{3})\b
In this demo, the expression is explained, if you might be interested.
Adding a word boundary \b helps. The regex becomes: (\b([0-9]{10}|[0-9\s]{12})\b).
Check it here https://regex101.com/r/6Hm8PD/2

Regular expression: matching multiple words

I am using regular expressions in R to extract strings from a variable. The variable contains distinct values that look like:
MEDIUM /REGULAR INSEAM
XX LARGE /SHORT INSEAM
SMALL /32" INSM
X LARGE /30" INSM
I have to capture two things: the value before the / as a whole (SMALL, XX LARGE) and the string (alphabetic or numeric) after it. I don't want the " INSM or the INSEAM part.
The regular expression for first two I am using is ([A-Z]\w+) \/([A-Z]\w+) INSEAM and for the last two I am using ([A-Z]\w+) \/([0-9][0-9])[" INSM].
The part ([A-Z]\w+) only captures one word, so it works fine for MEDIUM and SMALL, but fails for X LARGE, XX LARGE etc. Is there a way I can modify it to capture two occurrences of a word before the / character? Or is there a better way to do it?
Thanks in advance!
From your description, Wiktor's regex will fail on "XX LARGE/SHORT" due to the extra space. It is safer to capture everything before the forward slash as a group:
sub("^(.*/\\w+).*", "\\1", x)
#[1] "MEDIUM /REGULAR" "XX LARGE /SHORT" "SMALL /32" "X LARGE /30"
It seems you can use
(\w+(?: \w+)?) */ *(\w+)
See the regex demo
Pattern details:
(\w+(?: \w+)?) - Group 1 capturing one or more word chars followed with an optional sequence of a space + one or more word chars
*/ * - a / enclosed with 0+ spaces
(\w+) - Group 2 capturing 1 or more word chars
R code with stringr:
> library(stringr)
> v <- c("MEDIUM /REGULAR INSEAM", "XX LARGE /SHORT INSEAM", "SMALL /32\" INSM", "X LARGE /30\" INSM")
> str_match(v, "(\\w+(?: \\w+)?) */ *(\\w+)")
     [,1]              [,2]       [,3]
[1,] "MEDIUM /REGULAR" "MEDIUM"   "REGULAR"
[2,] "XX LARGE /SHORT" "XX LARGE" "SHORT"
[3,] "SMALL /32"       "SMALL"    "32"
[4,] "X LARGE /30"     "X LARGE"  "30"
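For comparison, the same pattern behaves the same way in Python's re module. This is only a sketch mirroring the R session above; the variable names are chosen for the example.

```python
import re

values = [
    "MEDIUM /REGULAR INSEAM",
    "XX LARGE /SHORT INSEAM",
    'SMALL /32" INSM',
    'X LARGE /30" INSM',
]

# Group 1: one or two words before the slash; group 2: the word after it.
size_and_inseam = re.compile(r"(\w+(?: \w+)?) */ *(\w+)")

pairs = [size_and_inseam.search(v).groups() for v in values]
for pair in pairs:
    print(pair)
# ('MEDIUM', 'REGULAR')
# ('XX LARGE', 'SHORT')
# ('SMALL', '32')
# ('X LARGE', '30')
```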

Problems with regex parsing

I am trying to write a program using the lynx command on this page "http://www.rottentomatoes.com/movie/box_office.php" and I can't seem to wrap my head around a certain problem.... getting the title by itself. My problem is a title can contain special characters, numbers, and all titles are variable in length. I want to write a regex that could parse the entire page and find lines like this....
(I added spaces between the title and the next number, which is how many weeks it has been out, to distinguish between title and weeks released)
1 -- 30% The Vow 1 $41.2M $41.2M $13.9k 2958
2 -- 53% Safe House 1 $40.2M $40.2M $12.9k 3119
3 -- 42% Journey 2: The Mysterious Island 1 $27.3M $27.3M $7.9k 3470
4 -- 57% Star Wars: Episode I - The Phantom Menace (in 3D) 1 $22.5M $22.5M $8.5k 2655
5 1 86% Chronicle 2 $12.1M $40.0M $4.2k 2908
the regex I have started out with is:
/(\d+)\s(\d+|\-\-)\s(\d+\%)\s
If someone can help me figure out how to grab the title successfully, that would be much appreciated! Thanks in advance.
Capture all the things!!
^(\d+)\s+(\d+|\-\-)\s+(\d+\%)\s+(.*?)\s+(\d+)\s+(\$\d+(?:\.\d+)?[Mk])\s+(\$\d+(?:\.\d+)?[Mk])\s+(\$\d+(?:\.\d+)?[Mk])\s+(\d+)$
Explained:
^ <- Start of the line
(\d+)\s+ <- Numbers (captured) followed by as many spaces as you want
(\d+|\-\-)\s+ <- Numbers [or "--"] (captured) followed by as many spaces as you want
(\d+\%)\s+ <- Numbers [with '%'] (captured) followed by as many spaces as you want
(.*?)\s+ <- Anything you can match [don't be greedy] (captured) followed by as many spaces as you want
(\d+)\s+ <- Numbers (captured) followed by as many spaces as you want
(\$\d+(?:\.\d+)?[Mk])\s+ <- "$" and Numbers [with floating point] and "M or k" (captured) followed by as many spaces as you want
(\$\d+(?:\.\d+)?[Mk])\s+ <- "$" and Numbers [with floating point] and "M or k" (captured) followed by as many spaces as you want
(\$\d+(?:\.\d+)?[Mk])\s+ <- "$" and Numbers [with floating point] and "M or k" (captured) followed by as many spaces as you want
(\d+) <- Numbers (captured)
$ <- End of the line
So to be serious, this is what I've done: I cheated a bit and captured everything (as I think you'll do in the end) so that the rest of the pattern acts like a lookahead for the title capture.
The lazy (.*?) captures as few characters as possible, and the rest of the regex has to match everything else on the line.
The title group therefore ends up holding only the title (the only thing left).
You could also use an actual lookahead and make assertions.
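To see the full-line approach in action, here is a minimal Python sketch. It assumes a lazy title group and escaped decimal points, and the function name parse_row is made up for the example; the sample rows are taken from the question.

```python
import re

# One capture group per column; the lazy (.*?) is the title.
MONEY = r"(\$\d+(?:\.\d+)?[Mk])"
ROW = re.compile(
    r"^(\d+)\s+(\d+|--)\s+(\d+%)\s+(.*?)\s+(\d+)\s+"
    + MONEY + r"\s+" + MONEY + r"\s+" + MONEY + r"\s+(\d+)$"
)


def parse_row(line):
    """Return (title, weeks_released) or None if the line is not a table row (hypothetical helper)."""
    m = ROW.match(line)
    return (m.group(4), int(m.group(5))) if m else None


rows = [
    "3 -- 42% Journey 2: The Mysterious Island 1 $27.3M $27.3M $7.9k 3470",
    "4 -- 57% Star Wars: Episode I - The Phantom Menace (in 3D) 1 $22.5M $22.5M $8.5k 2655",
    "5 1 86% Chronicle 2 $12.1M $40.0M $4.2k 2908",
]
for row in rows:
    print(parse_row(row))
# ('Journey 2: The Mysterious Island', 1)
# ('Star Wars: Episode I - The Phantom Menace (in 3D)', 1)
# ('Chronicle', 2)
```

Digits inside a title, such as the 2 in "Journey 2:", do not end the title early because the weeks column must be followed by whitespace and a dollar amount.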
Resources:
regular-expressions.info - Lookaround
regexr.com - This regex tested