Regular expression to extract number before/after word - regex

I have 10000 descriptions and I want to use regular expressions to extract the number associated with the phrase "arrested".
For example:
"police arrests 4 people"
"7 people were arrested".
The numbers range from 1-99.
I have tried the following code:
gen arrest= regexm(description, "(^[1-9][0-9]$)[ ]*(arrests|arrested)")
I cannot simply extract just the number, because the descriptions also mention numbers that have nothing to do with arrests.

You can use this regex:
(?:([1-9]?[0-9])[a-zA-Z ]{0,20}(?:arrests|arrested))|(?:(?:arrests|arrested)[a-zA-Z ]{0,20}([1-9]?[0-9]))
It divides the search into 2 by alternation, whether the number is before or after 'arrests|arrested'.
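Each alternative is wrapped in a non-capturing group. In the first one, a capturing group matches an optional digit from 1-9 followed by a digit from 0-9. This is followed by 0 to 20 letters or spaces (the other words) before it matches 'arrests' or 'arrested'. The second alternative handles the opposite situation, where the number comes last. The number therefore ends up in capture group 1 or capture group 2, depending on which alternative matched.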
This will match, if the number is within 20 chars from 'arrests|arrested'.
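As a rough illustration of how the two capture groups can be read back, here is a minimal Python sketch. The function name `extract_arrest_count` is made up for this example, and Python's `re` module is only one possible engine; the pattern itself is copied from above.

```python
import re
from typing import Optional

# Pattern from the answer above: the number lands in group 1 when it comes
# before the keyword and in group 2 when it comes after it.
ARREST_RE = re.compile(
    r"(?:([1-9]?[0-9])[a-zA-Z ]{0,20}(?:arrests|arrested))"
    r"|(?:(?:arrests|arrested)[a-zA-Z ]{0,20}([1-9]?[0-9]))"
)


def extract_arrest_count(description: str) -> Optional[int]:
    """Return the number found near 'arrests'/'arrested', or None."""
    match = ARREST_RE.search(description)
    if match is None:
        return None
    # Only one of the two groups takes part in a successful match.
    return int(match.group(1) or match.group(2))


print(extract_arrest_count("police arrests 4 people"))
print(extract_arrest_count("7 people were arrested"))
```

Picking whichever group is set, rather than the whole match, returns only the number and drops the surrounding words.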

Perhaps something like this?
(\d+)[^,.\d\n]+?(?=arrest|custody)|(?<=arrest|custody)[^,.\d\n]+?(\d+)
Regex101
Keep in mind, this will not match textual versions of the number (i.e., five people were arrested) - so you would have to incorporate that if desired.
Breaking down the pattern
(\d+)[^,.\d\n]+?(?=arrest|custody) First option if # comes before watched terms
(\d+) the number to capture, with + one or more digits
[^,.\d\n]+? matches anything except a comma ,, period ., digit \d, or new line \n. These prevent FPs from different sentences (must be contained in the same sentence) - +? one or more times (lazy)
(?=arrest|custody) positive look ahead checking for either word:
(?<=arrest|custody)[^,.\d\n]+?(\d+) Second option if # comes after watched terms
(?<=arrest|custody) positive lookbehind checking that the word comes before #
[^,.\d\n]+? matches anything except a comma ,, period ., digit \d, or new line \n. These prevent FPs from different sentences (must be contained in the same sentence) - +? one or more times (lazy)
(\d+) the number to capture, with + one or more digits
Miscellaneous Notes
If you want to add textual representations of your numbers, then you would incorporate that into the (\d+) capturing group.
If you have any additional terms to watch for other than arrested or custody, you would add those terms to both lookaround groups
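If you try this pattern in Python, note that the `re` module rejects a lookbehind whose alternatives differ in length, so `(?<=arrest|custody)` has to be split into two lookbehinds. The following is a sketch under that assumption; the function name `arrest_number` and the third sample sentence are invented for illustration.

```python
import re
from typing import Optional

# Same idea as the pattern above. Python's re needs fixed-width lookbehinds,
# so (?<=arrest|custody) is written as two separate lookbehinds.
ARREST_RE = re.compile(
    r"(\d+)[^,.\d\n]+?(?=arrest|custody)"
    r"|(?:(?<=arrest)|(?<=custody))[^,.\d\n]+?(\d+)"
)


def arrest_number(sentence: str) -> Optional[str]:
    """Return the digits tied to 'arrest'/'custody' in the same sentence."""
    match = ARREST_RE.search(sentence)
    if match is None:
        return None
    # Group 1: number before the term; group 2: number after the term.
    return match.group(1) or match.group(2)


print(arrest_number("7 people were arrested"))
print(arrest_number("police arrests 4 people"))
# The period stops the first number from being linked to "arrested".
print(arrest_number("1 suspect fled. Police arrested 3 men"))
```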

The following works for me (solution based on @PoulBak's idea):
clear
input strL var1
"This is 1 long string saying that police arrests 4 people"
"3 news outlets today reported that 7 people were arrested"
"several witnesses saw 5 people arrested and other 3 killed"
end
generate var2 = ustrregexs(0) if ustrregexm(var1, "(?:([1-9]?[0-9])[a-zA-Z ]{0,20}(?:arrests|arrested))|(?:(?:arrests|arrested)[a-zA-Z ]{0,20}([1-9]?[0-9]))")
list
     +-------------------------------------------------------------------------------------+
     |                                                       var1                     var2 |
     |-------------------------------------------------------------------------------------|
  1. |  This is 1 long string saying that police arrests 4 people                arrests 4 |
  2. |  3 news outlets today reported that 7 people were arrested   7 people were arrested |
  3. | several witnesses saw 5 people arrested and other 3 killed        5 people arrested |
     +-------------------------------------------------------------------------------------+

Related

Regex: How to find a phone number (or number sequence) that begins with a particular single digit (multiple numbers on the same line)

Newbie question, but how can I check for instances where there are multiple numbers on the same line? For instance, the content reads: contact 408-555-5454 or reach out to 408-555-4545. Right now the best I can do is ^4, but that only catches multiple matches if the multiline flag is turned on. Any ideas?
You could try the regex below
/4\d{2}(-| )?\d{3}(-| )?\d{4}/g
This of course assumes that you're looking for numbers that start with 4. You can have a look at the Regex Snippet here and you can experiment with trying different variations of the regex to suit your needs.
Here's a key to the regex elements included:
1. 4 = matches the literal digit 4
2. \d{2} = matches 2 digits (0-9)
3. (-| )? = matches either a hyphen or a single space, but makes it optional, i.e. you can have a space, a hyphen, or neither
4. \d{3} = matches 3 digits (0-9)
5. (-| )? = same as #3 above
6. \d{4} = matches 4 digits (0-9)
The g flag ensures that you search through the whole text instead of stopping after the first match.
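For comparison, here is a small Python sketch of the same search. Python has no g flag; iterating with `re.finditer` plays that role. `match.group(0)` is used because the pattern contains capturing groups, which would make `re.findall` return the separators instead of the full numbers.

```python
import re

# Same pattern as above, without the JavaScript-style delimiters and g flag.
PHONE_RE = re.compile(r"4\d{2}(-| )?\d{3}(-| )?\d{4}")

text = "contact 408-555-5454 or reach out to 408-555-4545"

# finditer walks through every match on the line, not just the first one.
numbers = [match.group(0) for match in PHONE_RE.finditer(text)]
print(numbers)
```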
If you like the answer please Accept it :)

Regex to match less than a two-digit number

I have some records as such, in a file:
A 20 year old programmer
A 52 year old politician
A 12 year old practitioner
Many many more of these...
I want to match only lines that contain a number less than 20. I have tried:
^[0-20]{2}$
But it works for only numbers 0-2. How should I construct a regular expression to match numbers < 20? For instance, it should match:
A 12 year old practitioner
But not
A 20 year old programmer
A 52 year old politician
You may use
\b1?[0-9]\b
See the regex demo
Details
\b - a word boundary
1?[0-9] - an optional 1 and any digit after it
\b - a word boundary
Word boundary variations
To match anywhere in a string, even if glued to a word:
(?<!\d)1?[0-9](?!\d)
To only match in between whitespaces:
(?<!\S)1?[0-9](?!\S)
Using regex to match digit ranges is usually a bit clumsy, but here, you can do it pretty simply with:
\b1?\d\b
https://regex101.com/r/YCWmNo/2
In plain language: an optional one, followed by a digit. So, any standalone digit is allowed, but a two-digit number needs its first digit to be a 1.
If you want to permit leading zeros, change to \b[01]?\d\b.
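A quick way to check the word-boundary version against the sample records is a sketch like the one below, written in Python purely as an illustration. The variable names are arbitrary.

```python
import re

# A standalone number below 20: an optional leading 1, then one digit.
UNDER_20 = re.compile(r"\b1?[0-9]\b")

records = [
    "A 20 year old programmer",
    "A 52 year old politician",
    "A 12 year old practitioner",
]

# Keep only the lines that contain such a number somewhere.
matching = [line for line in records if UNDER_20.search(line)]
print(matching)
```

Keep in mind that the pattern looks at any standalone number on the line, so a record that mentions an age of 20 and also some other number below 20 would still be kept.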

How to limit optional whitespace matches in regex

YARP. (Yup, another regex problem).
Not sure the clearest way to describe this other than concrete examples.
Sample text:
1. 4444 4444 4444 4444
2. 4444444444444444
3. 44 44 44 44 44 44 44 44
4. 4444-4444-4444-4444
5. 4444 (multiple spaces) 4444 (multiple spaces) 4444 (multiple spaces) 4444
6. 0.4444444444444444
7. 0.4444 4444 4444 4444
I need to build a regex that will match 1, 2 and 4 only. Requirements 13-16 digits, dashes and spaces optional, but only if single space, and no more than 3 total.
This is obviously CC info search related, and I've done a ton of research and found many examples that find matches for most, all or none, but nothing that will eliminate excessive false positives like 3 and 5 above. I'm using PowerGREP 5, I've read the entire tutorial on https://www.regular-expressions.info/tutorial.html and I cannot figure out how to limit the number of optional whitespaces in the overall match, i.e. "1 2 3 4 5 6 7 8 9" matches just as well as "123 456 789" if I make space(s) optional. Essentially, I want the regex to stop matching if more than 3 spaces/dashes are detected.
Side note: I work for a company that deals with a TON of calendar data, so grepping a huge drive with many "1 2 3 4 5 6 7 8 ..." style text strings is generating a ton of false hits, even if I take time to tailor searches to CC inclusive patterns.
Any help would be super appreciated.
The closest I've found is:
\b(?:\d[ -]*?){13,16}\b
Which grabs any 13-16 digits (allowing for a dash or space in between) as expected, but it will also match "1 2 3 4 5 6 7 8 9 10 11" which is obviously not helpful.
All inclusive CC branded regex that fails to find valid numbers if they contain spaces/dashes: (but will find UK telephone numbers, heh):
\b(?:4[0-9]{12}(?:[0-9]{3})?|(?:5[1-5][0-9]{2}|222[1-9]|22[3-9][0-9]|2[3-6][0-9]{2}|27[01][0-9]|2720)[0-9]{12}|3[47][0-9]{13}|3(?:0[0-5]|[68][0-9])[0-9]{11}|6(?:011|5[0-9]{2})[0-9]{12}|(?:2131|1800|35\d{3})\d{11})\b
So then I tried replacing any [0-9] character class instances above with (?:\d[ -]*?) and that will find valid CCs with dashes/spaces, but it also matches all the "1 2 3 4 5 6 7 8 9 10 11" type false positives.
I am very new to regex, so if I'm committing a huge noob error, please feel free to point me in the right direction. Thank you!
Edit:
Replacing [0-9] with (?:\d[ -]?) for just the bigger consecutive string parts seems to be pretty close to what I need. Grepped same drive as before and only got 311 matches, and all 3 positive files found, I can live with just 308 false matches, but I gotta imagine there's a better way to do this still. And it's still matching strings of 13-16 digits with more than 3 delimiters...
Current regex:
\b(?:4(?:\d[ -]?){12}(?:[0-9]{3})?|(?:5[1-5][0-9]{2}|222[1-9]|22[3-9][0-9]|2[3-6][0-9]{2}|27[01][0-9]|2720)(?:\d[ -]?){12}|3[47](?:\d[ -]?){13}|3(?:0[0-5]|[68][0-9])(?:\d[ -]?){11}|6(?:011|5[0-9]{2})(?:\d[ -]?){12}|(?:2131|1800|35\d{3})(?:\d[ -]?){11})\b
Since it looks like you want every fourth digit to be followed by either a dash, a single space, or nothing, the simplest way would be to use
^(\d{4}[\s\-]?){3}\d{4}$
This would meet your written criteria, but it allows a mixture of delimiters like: 1234-5678 9012. If that's not acceptable, you can use a positive lookahead to check that the same delimiter is used each time:
^(?=(\d{4}){3}|(\d{4}-){3}|(\d{4}\s){3})(\d{4}[\s-]?){3}\d{4}$
The first regex
Starts at the beginning of the string: ^
Finds four digits (0-9), optionally followed by space or dash, and repeats this pattern 3 times: (\d{4}[\s\-]?){3}
Then is followed by four more digits and the end of the string: \d{4}$
Taking just the lookahead from the second regex: (?=(\d{4}){3}|(\d{4}-){3}|(\d{4}\s){3})
Before the pattern starts to capture anything, we again start at the beginning of the string, look at the first three repeated groups, and ensure that the delimiter after each one is the same.
I see that in your example regex, you want to allow 13-16 digits and mine was specifically for 16. For 13-16 digits, you need to determine where you want those delimiters to be. Can they be anywhere, as long as there are only three of them and they don't repeat? I also see that you're using word boundaries, so I'm guessing that you're trying to match substrings. You can do that, but it'll be a little more difficult. Dashes and spaces are both word boundaries, so you might get some false positives without some lookarounds.
As far as integrating into your CC regex, you're lazy matching an infinite number of dashes or spaces; you just want ? instead of *?. If you need more flexibility where those spaces/numbers go, while still limiting them then I'd probably use a negative regex to validate.
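To see how the two 16-digit patterns behave on the sample text from the question, here is a rough Python sketch. It only covers the whole-line, 16-digit case discussed in this answer, and the names `ANY_DELIM` and `SAME_DELIM` are invented for the example.

```python
import re

# First pattern: four groups of four digits, each optionally followed by
# one space or dash.
ANY_DELIM = re.compile(r"^(\d{4}[\s\-]?){3}\d{4}$")

# Second pattern: the lookahead is meant to reject mixed delimiters.
SAME_DELIM = re.compile(
    r"^(?=(\d{4}){3}|(\d{4}-){3}|(\d{4}\s){3})(\d{4}[\s-]?){3}\d{4}$"
)

samples = [
    "4444 4444 4444 4444",
    "4444444444444444",
    "44 44 44 44 44 44 44 44",
    "4444-4444-4444-4444",
    "4444   4444   4444   4444",
    "0.4444444444444444",
    "0.4444 4444 4444 4444",
]

accepted = [s for s in samples if ANY_DELIM.match(s)]
print(accepted)

# A mixed-delimiter string passes the first pattern but not the second.
print(bool(ANY_DELIM.match("1234-5678 9012 3456")))
print(bool(SAME_DELIM.match("1234-5678 9012 3456")))
```

One thing to check before relying on the lookahead version: a string with no delimiter after the first two groups and one before the last, such as 123456789012-3456, still passes, because the lookahead's twelve-digit alternative succeeds and the main pattern then accepts the single dash.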

Regular Expression

I wonder if anyone can help.
I need to write a regular expression that throws away everything apart from the last word if that last word is an alphanumeric (numbers and letters) or a single number or a single letter.
For example
Ground floor Apartment 2
Garden Apartment 1A
Block 2D
Suite 12
Unit C
Basement Flat
General Office
I would like to remove all words and characters that are not part of the actual number i.e.
Ground Floor Apartment 2 should become 2
Garden Apartment 1A should become 1A
Block 2D should become 2D
Suite 12 should become 12
Unit C should become C
Basement Flat should become blank as there are no numbers involved
General Office should become blank
Many Thanks in advance
You could try using a positive lookahead which asserts your requirements at the end of the string.
(?:\b[A-Za-z]{1}|\d+|(?=.*\d)[a-zA-Z0-9]+)$
Explanation
A non capturing group (?:
A word boundary \b
Match a single letter [A-Za-z]{1}
Or |
One or more digits \d+
Or |
A positive lookahead which asserts that the last word contains a digit (?=.*\d)
Match one or more lower/upper case characters or digits [a-zA-Z0-9]+
Close non capturing group )
The end of the string $
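Since the question does not name a language, the following Python sketch is just one way to try the pattern against the examples given. The function name `unit_identifier` is made up, and an empty string stands in for "blank".

```python
import re

# Pattern from the answer above, anchored at the end of the string.
UNIT_RE = re.compile(r"(?:\b[A-Za-z]{1}|\d+|(?=.*\d)[a-zA-Z0-9]+)$")


def unit_identifier(address: str) -> str:
    """Return the trailing number/letter part, or '' when there is none."""
    match = UNIT_RE.search(address)
    return match.group(0) if match else ""


for line in ["Ground floor Apartment 2", "Garden Apartment 1A", "Unit C", "Basement Flat"]:
    print(repr(unit_identifier(line)))
```

The sketch assumes there is no trailing whitespace; stripping the input first would be a reasonable addition.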
What language are you using? You should be able to get the last word by splitting/exploding the string using spaces, then apply the regex to the last word.
You may want to just handle if the length of the word is 1 to make your regex simpler to understand and troubleshoot. This regex works for any word that is 2 letters or longer.
Here's a regex that should work for that last word. It uses a positive lookahead to ensure one letter and one number are present. https://regex101.com/r/i5R9bq/1/
(?=.*[0-9])(?=.*[A-Za-z])[0-9A-Za-z]+
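The split-then-test idea could look roughly like this in Python. This is a sketch with several assumptions: the letter class is written as `A-Za-z` (a range like `A-z` also covers a few punctuation characters), all-digit words such as 12 are handled by a separate check because the pattern requires both a letter and a digit, and the function name is hypothetical.

```python
import re

# Requires at least one digit and at least one letter in the word.
HAS_DIGIT_AND_LETTER = re.compile(r"(?=.*[0-9])(?=.*[A-Za-z])[0-9A-Za-z]+")


def unit_identifier_by_split(address: str) -> str:
    """Inspect only the last word of the address; return '' if it is a plain word."""
    words = address.split()
    if not words:
        return ""
    last = words[-1]
    # Single characters are handled without the regex, as suggested above.
    if len(last) == 1:
        return last if last.isalnum() else ""
    # Assumption: pure numbers count as identifiers too (e.g. "Suite 12").
    if last.isdigit() or HAS_DIGIT_AND_LETTER.fullmatch(last):
        return last
    return ""


print(repr(unit_identifier_by_split("Garden Apartment 1A")))
print(repr(unit_identifier_by_split("General Office")))
```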

Regex matching multiple inputs

I am trying to do a smart input field for UK style weight input, e.g. "6 stone and 3 lb" or "6 st 11 pound", capturing the 2 numbers in groups.
For now I got: ([0-9]{1,2}).*?([0-9]{1,2}).*
Problem is it matches "12 stone" in 2 groups, 1 and 2 instead of just 12. Is it possible to make a regex which captures correctly in both cases?
You need to make the first part possessive so it never gets backtracked into.
([0-9]{1,2}+).*?([0-9]{1,2})
Because . matches everything, including digits, try this:
/(\d{1,2})\D+(\d{1,2})?/
Something like this?
\b(\d+)\b.*?\b(\d+)\b
Groups 1 and 2 will have your numbers in either case.
Explanation :
"
\b # Assert position at a word boundary
( # Match the regular expression below and capture its match into backreference number 1
\d # Match a single digit 0..9
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
)
\b # Assert position at a word boundary
. # Match any single character that is not a line break character
*? # Between zero and unlimited times, as few times as possible, expanding as needed (lazy)
\b # Assert position at a word boundary
( # Match the regular expression below and capture its match into backreference number 2
\d # Match a single digit 0..9
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
)
\b # Assert position at a word boundary
"
This works, then look at capture groups 1 and 3:
([0-9]{1,2})[^0-9]+(([0-9]{1,2})?.+)?
The idea is to make a number and text mandatory, but make a second number and text optional.
Here is my suggestion for a regex to match both variants you showed:
(?<stone>\d+\s(?:stone|st))(?:\s(and)?\s?)(?<pound>\d+\s(?:pound|lb))
It's a bit vague at the moment, this works:
/([0-9]{1,2})(?:[^0-9]+([0-9]{1,2}).*)?/
for this data:
6 stone and 3 lb
6 st 11 pound
12 stone
12 st and 11lbs
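To make the difference concrete, here is a short Python sketch comparing the pattern from the question with the one from this answer on the data above. The names `NAIVE` and `FIXED` are just labels for this example.

```python
import re

# Pattern from the question: the lazy .*? lets the first group give up a digit.
NAIVE = re.compile(r"([0-9]{1,2}).*?([0-9]{1,2}).*")

# Pattern from this answer: the second number sits in an optional group and
# must be separated from the first by at least one non-digit.
FIXED = re.compile(r"([0-9]{1,2})(?:[^0-9]+([0-9]{1,2}).*)?")

print(NAIVE.match("12 stone").groups())  # ('1', '2'): the problem described

for line in ["6 stone and 3 lb", "6 st 11 pound", "12 stone", "12 st and 11lbs"]:
    print(line, "->", FIXED.match(line).groups())
```

With the second pattern, "12 stone" yields "12" in group 1 and nothing in group 2.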
Seeing as everyone is having a go, here's mine:
(\d+)(?:\D+(\d+))?
It's definitely the most concise so far. This will match one or two groups of digits anywhere:
"12": ("12", null)
"12st": ("12", null)
"12 st": ("12", null)
"12st 34 lb": ("12", "34")
"cabbage 12st 34 lb": ("12", "34")
"12 potato 34 moo": ("12", "34")
The next step would be making it catch the name of the units that were used.
Edit: as pointed out above, we don't know what language you're using, and not all regex functionality is available in all implementations. However, as far as I know, \d for digits and \D for non-digits is fairly universal.
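The results listed above depend on the whole trailing group being optional, i.e. `(\d+)(?:\D+(\d+))?`. If only the inner `(\d+)` is optional, `\D+` becomes mandatory and a bare "12" does not match at all. Here is a Python sketch that checks this; the helper name `parse_weight` is invented, and Python's `None` plays the role of null.

```python
import re
from typing import Optional, Tuple

# The trailing non-capturing group is optional as a whole.
WEIGHT_RE = re.compile(r"(\d+)(?:\D+(\d+))?")

# Variant where only the inner group is optional; \D+ is then required.
STRICT_RE = re.compile(r"(\d+)(?:\D+(\d+)?)")


def parse_weight(text: str) -> Optional[Tuple[str, Optional[str]]]:
    """Return (first number, second number or None), searching anywhere in text."""
    match = WEIGHT_RE.search(text)
    return match.groups() if match else None


for sample in ["12", "12st", "12st 34 lb", "cabbage 12st 34 lb"]:
    print(repr(sample), "->", parse_weight(sample))

print(STRICT_RE.search("12"))  # None: nothing follows the digits
```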