Extract digit after or before a specific word - regex

I have several articles about terrorist attacks which include the number of people killed and wounded. I am trying to extract the number of people wounded.
This is a sample of the sentences to target:
at least 22 others were wounded
additional 20 soldiers were wounded
more than 40 people had been wounded
wounding at least six people
injuring at least 60 others
wounding more than 25
27 others were wounded
wounding 14
wounding 33
185 people were wounded
28 people wounded
As you can see, the words wounded, wounding and injuring come either before or after the number I want to extract, usually within 3 or 4 words of it.
At this link you can find a sample of the articles and the regular expression that I am trying to apply without success:
https://regex101.com/r/0DRayP/10

You need to use capturing groups to collect your desired matches, like:
(\d+)?.*?(wound(?:ed|ing)|injured).*?(\d+)
You are interested in groups $1, $2 and $3
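As a rough illustration of the capturing-group idea in Python, here is a minimal sketch. It does not use the pattern above: `WOUNDED` and `wounded_count` are hypothetical names, and the pattern assumes the number is written in digits and sits within four words of the keyword (so "six" in "wounding at least six people" is not covered).

```python
import re

# Hypothetical pattern (an assumption, not the answer's regex):
#   group 1: a number followed, within up to 4 words, by "wounded"/"injured"
#   group 2: a number preceded, within up to 4 words, by "wounding"/"injuring"
WOUNDED = re.compile(
    r"(\d+)(?:\s+[A-Za-z]+){0,4}?\s+(?:wounded|injured)"
    r"|(?:wounding|injuring)(?:\s+[A-Za-z]+){0,4}?\s+(\d+)"
)


def wounded_count(sentence):
    """Return the wounded count found in the sentence, or None if there is none."""
    m = WOUNDED.search(sentence)
    if m is None:
        return None
    # Exactly one of the two groups is set, depending on which branch matched.
    return int(m.group(1) or m.group(2))


for s in ["at least 22 others were wounded", "wounding more than 25"]:
    print(s, "->", wounded_count(s))
```

The words between the number and the keyword are restricted to letters so that the match cannot jump over another number, such as a death toll mentioned earlier in the same sentence.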

Regex matching issue to Test-String

I have a problem that I can't figure out.
My Regex:
My Test-String:
I have two issues and one general question :)
As you can see in my Test-String, the very last (German) phone number (the big yellow one in the Test-String attachment) is not matched correctly by my regex pattern. What is the problem here? The "0049" ends up in Group 5 but should be in Group 2. Why is that?
My second problem: how can I get rid of the spaces before and after every match? (The 7 small yellow circles in the Test-String attachment.)
For copy/paste purposes, here is the Regex and Test-String again:
Regex:
((\+\d{2}|00\d{2})?([ ])?(\()?(\d{2,4})(\))?([-| |/])?(\d{3,})([ ])?(\d+)?([ ])?(\d+)?)
Test-String:
Vorwahl 089, die E.123 ebenfalls , also (089) 1234567. Die DIN 5008, also +49 89 1234567 respectivly 0049 89 1234567. Die E.123 empfiehlt, also +49 89 123456 0 respectivly 0049 89 123456 0 oder +49 89 123456 789. Also +49 89 123 456 789. Klammern 089/1234567 und 0151 19406041. Test +49 151 123 456 789 respectivly 0049 151 123 456 789
Last but not least, my general question:
Is it a good approach to group each logical part as I did in my example?
One last piece of information: I validate my regex with https://regex101.com/ and use it in Python with the re module.
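What makes the behaviour hard to predict are the numerous optional groups (..)?. Because the prefix group and the space after it are both optional, a match may begin at the space in front of a number: for the last number the engine starts at the space before 0049, skips the prefix group, and lets (\d{2,4}) take 0049 (group 5), since the three-digit 151 that follows satisfies (\d{3,}).
As a first step I recommend replacing ([ ])?(\d+)? with the coupled expression ([ ]?\d+)?, which avoids spaces at the end of the match (your point #2).
As a second step I recommend coupling the first optional space with the expression for the country-code prefix: ((\+|00)\d{2}([ ])?)?. Now we are lucky, because this fixes both the space at the beginning and the recognition of the whole number, since there are fewer possible ways to match.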
The new expression now looks like this:
(((\+|00)\d{2}([ ])?)?(\()?(\d{2,4})(\))?([-| |/])?(\d{3,})([ ]?\d+)?([ ]?\d+)?)
I would now simplify the last part, if you don't need the individual group values:
(((\+|00)\d{2}([ ])?)?(\()?(\d{2,4})(\))?([-| |/])?(\d{3,})([ ]?\d+){0,2})
For better performance I suggest you remove the parentheses/groups where possible, or mark them as non-capturing, if you don't need the specific group values.
In many programming languages you will not need the outermost parentheses, as the whole match is always available as group 0.
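The difference between the original and the simplified expression can be checked in Python with a short sketch like the one below. The names `ORIGINAL`, `REVISED` and `sample` are only illustrative, and `sample` is just the tail of the test string from the question.

```python
import re

# Pattern from the question.
ORIGINAL = re.compile(
    r"((\+\d{2}|00\d{2})?([ ])?(\()?(\d{2,4})(\))?([-| |/])?(\d{3,})"
    r"([ ])?(\d+)?([ ])?(\d+)?)"
)
# Simplified pattern with the optional spaces coupled to their neighbours.
REVISED = re.compile(
    r"(((\+|00)\d{2}([ ])?)?(\()?(\d{2,4})(\))?([-| |/])?(\d{3,})"
    r"([ ]?\d+){0,2})"
)

sample = "Test +49 151 123 456 789 respectivly 0049 151 123 456 789"

original_hits = list(ORIGINAL.finditer(sample))
revised_matches = [m.group(0) for m in REVISED.finditer(sample)]

# With ORIGINAL the second match starts at the space before "0049", the
# prefix group (2) stays empty and "0049" lands in group 5.
second = original_hits[1]
print(repr(second.group(0)), second.group(2), second.group(5))

# With REVISED both numbers are matched whole, without surrounding spaces.
print(revised_matches)
```

Since group 0 already holds the whole match, the outermost parentheses could be dropped here without changing what `m.group(0)` returns.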

Italian phone 10-digit number regex issue

I'm trying to use the regex from this site
/^([+]39)?((38[{8,9}|0])|(34[{7-9}|0])|(36[6|8|0])|(33[{3-9}|0])|(32[{8,9}]))([\d]{7})$/
for Italian mobile phone numbers, but a simple number such as 3491234567 is reported as invalid.
(I don't care about spaces, as I'll trim them.)
should pass:
349 1234567
+39 349 1234567
TODO: 0039 349 1234567
TODO: (+39) 349 1234567
TODO: (0039) 349 1234567
regex101 and regexr both validate it, so what's wrong?
UPDATE:
To clarify:
The regex should match any number that starts with either
388/389/380 (38[{8,9}|0])|
or
347/348/349/340 (34[{7-9}|0])|
or
366/368/360 (36[6|8|0])|
or
333/334/335/336/337/338/339/330 (33[{3-9}|0])|
328/329 (32[{8,9}])
plus 7 digits ([\d]{7})
and the +39 at the start optionally ([+]39)?
The following regex appears to fulfill your requirements. I took out the syntax errors and guessed a bit, and added the missing parts to cover your TODO comments.
^(\((00|\+)39\)|(00|\+)39)?(38[890]|34[7-90]|36[680]|33[3-90]|32[89])\d{7}$
Demo: https://regex101.com/r/yF7bZ0/1
Your test cases fail to cover many of the variations captured by the regex; perhaps you'll want to beef up the test set to make sure it does what you want.
The beginning allows for an optional international prefix with or without the parentheses. The basic pattern is (00|\+)39 and it is repeated with or without parentheses around it. (Perhaps a better overall approach would be to trim parentheses and punctuation as well as whitespace before processing begins; you'll want to keep the plus as significant, of course.)
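A minimal Python sketch of how this regex could be exercised against the test cases from the question. It assumes, as the question says, that whitespace is stripped before matching; the helper name `is_italian_mobile` is made up for illustration.

```python
import re

# The regex from this answer, split over two lines for readability.
MOBILE = re.compile(
    r"^(\((00|\+)39\)|(00|\+)39)?"
    r"(38[890]|34[7-90]|36[680]|33[3-90]|32[89])\d{7}$"
)


def is_italian_mobile(raw):
    """Strip all whitespace (assumption from the question), then validate."""
    compact = re.sub(r"\s+", "", raw)
    return MOBILE.match(compact) is not None


for number in [
    "349 1234567",
    "+39 349 1234567",
    "0039 349 1234567",
    "(+39) 349 1234567",
    "(0039) 349 1234567",
    "341 1234567",  # 341 is not one of the listed prefixes
]:
    print(number, is_italian_mobile(number))
```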
Updated with information from @Edoardo's answer; wrapped for legibility and added comments:
^ # beginning of line
(\((00|\+)39\)|(00|\+)39)? # country code or trunk code, with or without parentheses
( # followed by one of the following
32[89]| # 328 or 329
33[013-9]| # 33x where x != 2
34[04-9]| # 34x where x not in 1,2,3
35[01]| # 350 or 351
36[068]| # 360 or 366 or 368
37[019]| # 370 or 371 or 379
38[089]) # 380 or 388 or 389
\d{6,7} # ... followed by 6 or 7 digits
$ # and end of line
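In Python, a commented layout like this can be kept as-is with the `re.VERBOSE` flag, which ignores whitespace and `#` comments inside the pattern. The sketch below assumes an alternation bar between the 37x and 38x branches (without it the two would be concatenated), and `MOBILE_VERBOSE` is only an illustrative name.

```python
import re

MOBILE_VERBOSE = re.compile(
    r"""
    ^                            # beginning of line
    (\((00|\+)39\)|(00|\+)39)?   # country code, with or without parentheses
    (                            # followed by one of the following
        32[89]    |              # 328 or 329
        33[013-9] |              # 33x where x != 2
        34[04-9]  |              # 34x where x not in 1,2,3
        35[01]    |              # 350 or 351
        36[068]   |              # 360 or 366 or 368
        37[019]   |              # 370 or 371 or 379
        38[089]                  # 380 or 388 or 389
    )
    \d{6,7}                      # ... followed by 6 or 7 digits
    $                            # and end of line
    """,
    re.VERBOSE,
)

print(bool(MOBILE_VERBOSE.match("3701234567")))   # 370 prefix, 7 digits
print(bool(MOBILE_VERBOSE.match("3321234567")))   # 332 is excluded
```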
There are obvious accidental gaps which will probably also get filled over time. Generalizing this further is likely to improve resilience toward future changes, but of course may at the same time increase the risk of false positives. Make up your mind about which is worse.
I found this and updated it with prefixes for new operators and MVNOs (Iliad, ho.):
^(\((00|\+)39\)|(00|\+)39)?(38[890]|34[4-90]|36[680]|33[13-90]|32[89]|35[01]|37[019])\d{6,7}$
I improved the regex by adding a case that handles spaces between the digit groups:
^(\((00|\+)39\)|(00|\+)39)?(38[890]|34[4-90]|36[680]|33[13-90]|32[89]|35[01]|37[019])(\s?\d{3}\s?\d{3,4}|\d{6,7})$
So, for example, I can match a phone number like (0039) 349 123 4567 or 349 123 4567.
According to this page:
https://it.qaz.wiki/wiki/Telephone_numbers_in_Italy
a simple regex for Italian MOBILE numbers without special characters is:
/^3[0-9]{8,9}$/
It matches a string starting with the digit '3' followed by 8 or 9 digits, e.g.:
3345678103
You can then add the Italian prefix, '+39 ' or '0039 ':
/^\+39 3[0-9]{8,9}$/ --- match --> +39 3345678103
/^0039 3[0-9]{8,9}$/ --- match --> 0039 3345678103

regex filter street number and zip from address

I have some detail fields from which I need to filter out the house number and zip code, but leave all other numbers in.
example 1:
Van: ION VORM.-VR.TIJD-SOC.TOER HOOGSTRAAT NR. 42 1000 BRUSSEL België IBAN: BE80877459990177 Mededeling: VK 60
example 2:
Van: SYND ABVV-REGIO ANTWERPEN OMMEGANCKSTRAAT 35 2018 ANTWERPEN België IBAN: BE15877800950130 Mededeling: VK38
In the first one I need to filter out 42 and 1000, in the second 35 and 2018.
So basically I need a regex that will filter out the numbers from (any)straat(some chars that may include spaces)number(space)number.
Thanks
This regex works for your two examples:
.+ ([0-9]+) ([0-9]+) .+
Live example: https://regex101.com/r/zA3fM9/1
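A small Python sketch of how this regex could be used, both to read the two numbers and to cut them out of the line. It assumes "filter out" means removing them, and the helper name `strip_house_and_zip` is made up for illustration.

```python
import re

ADDRESS = re.compile(r".+ ([0-9]+) ([0-9]+) .+")

examples = [
    "Van: ION VORM.-VR.TIJD-SOC.TOER HOOGSTRAAT NR. 42 1000 BRUSSEL België "
    "IBAN: BE80877459990177 Mededeling: VK 60",
    "Van: SYND ABVV-REGIO ANTWERPEN OMMEGANCKSTRAAT 35 2018 ANTWERPEN België "
    "IBAN: BE15877800950130 Mededeling: VK38",
]

# Group 1 is the house number, group 2 the zip code.
extracted = [ADDRESS.match(line).groups() for line in examples]


def strip_house_and_zip(line):
    """Remove the 'number number ' pair found by ADDRESS; leave the rest as-is."""
    m = ADDRESS.match(line)
    if m is None:
        return line
    # Keep everything before the house number and everything after the
    # space that follows the zip code.
    return line[: m.start(1)] + line[m.end(2) + 1 :]


for line in examples:
    print(strip_house_and_zip(line))
```

Because the leading `.+` is greedy, the regex picks the last place in the line where two space-separated numbers are followed by a space, which is the house number and zip in both examples.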

Regular Expression for group

I have a text where I need to find 3 strings.
I tried the expression \r?\n\r?\n\r?[0-9A-Z].*\d{7} but it finds only 2 strings instead of 3.
It should highlight 00170784, HEDINV and 00173575, but I get only 00170784 and 00173575.
This is the text:
BUY
USM4
200 contracts
04/28/2014 15:50
00170784
56
contracts
HEDINV
64
contracts
00173575
80
contracts
At average price of USD 134.375
SELL
USM4
200 contracts
04/28/2014 15:50
00170784
56
contracts
HEDINV
64
contracts
00173575
80
contracts
At average price of USD 134.5938
May I suggest using this instead?
^\d{8}$|^[A-Z]{6}$
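It has two alternatives it looks for (no capture groups are needed). One is an 8-digit sequence that fills a whole line. The other is a 6-letter uppercase sequence that fills a whole line. Multiline mode must be enabled so that ^ and $ match at line boundaries. That grabs what you're looking for, unless there's a specific reason you're using all those linebreak matches.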
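For reference, a minimal Python sketch of this suggestion applied to the BUY block of the sample text. It assumes the block is held in a single string with `\n` line endings; `re.MULTILINE` is what lets `^` and `$` match at every line.

```python
import re

text = "\n".join([
    "BUY",
    "USM4",
    "200 contracts",
    "04/28/2014 15:50",
    "00170784",
    "56",
    "contracts",
    "HEDINV",
    "64",
    "contracts",
    "00173575",
    "80",
    "contracts",
    "At average price of USD 134.375",
])

# A whole line of 8 digits, or a whole line of 6 uppercase letters.
ids = re.findall(r"^\d{8}$|^[A-Z]{6}$", text, flags=re.MULTILINE)
print(ids)  # ['00170784', 'HEDINV', '00173575']
```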

What's the best Regular Expression to use for returning some phone numbers, but not all?

I'm new to regular expressions and working on something that will return all UK phone numbers with an area code beginning 01, 02, 03 or 07 only. It must not pick up 08 or 09. It also has to take into account the different grouping styles. But here's the kicker... it's got to be 80 characters or less.
This was my best shot:
(01|02|03|07|44\D*1|44\D*2|44\D*3|44\D*7|)(\d\D*){9}
The problem is that it's returning any 9 digit or less number and I can't figure out why.
Any help would be grand!
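The trailing | just before the closing parenthesis in your pattern adds an empty alternative, so the whole prefix group can match nothing; that is why any run of 9 digits is returned. Without it, (01|02|03|07|44\D*1|44\D*2|44\D*3|44\D*7) matches either 0 or 44\D* followed by 1, 2, 3 or 7, which simplifies to: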
(?:44\D*|0)[1237]
Putting that with the rest gives:
(?:44\D*|0)[1237](\D*\d\D*){9}
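A quick Python check of the two patterns, as a sketch only. The phone numbers are made-up samples, and `ORIGINAL` and `FIXED` are illustrative names.

```python
import re

# Pattern from the question: note the empty alternative before ")".
ORIGINAL = re.compile(r"(01|02|03|07|44\D*1|44\D*2|44\D*3|44\D*7|)(\d\D*){9}")
# Simplified pattern from this answer.
FIXED = re.compile(r"(?:44\D*|0)[1237](\D*\d\D*){9}")

samples = [
    "123456789",         # bare 9-digit number, no valid prefix
    "01234 567890",
    "+44 20 7946 0958",
    "07911 123456",
    "0800 1234567",      # 08 numbers must not be returned
]

for s in samples:
    print(s, bool(ORIGINAL.search(s)), bool(FIXED.search(s)))

print(len(FIXED.pattern))  # well under the 80-character limit
```

The first sample shows the reported problem: the original pattern accepts it because its prefix group can match the empty string, while the simplified pattern rejects it.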