Regex problem with telephone number capturing part of ID

I have a regex that extracts French phone numbers from the text of emails:
(?:(?:\+|00)33|0)\s*[1-9](?:[\s.-]*\d{2}){4}
It works pretty well, but if an email contains no phone number, it grabs part of the ID of a Facebook page, e.g. www.facebook.com/leboncoin-1565**0575204105**27, and then I have people trying to ring that number :X
I don't want that match. I tried negative lookahead and lookbehind, but without any success.
See problem at regex101.
Note that the phone number could be anywhere, not necessarily at the beginning of a line.

Use
(?:(?:\+|\b00)33|\b0)\s*[1-9](?:[\s.-]*\d{2}){4}\b
See regex proof.
EXPLANATION
--------------------------------------------------------------------------------
(?: group, but do not capture:
--------------------------------------------------------------------------------
(?: group, but do not capture:
--------------------------------------------------------------------------------
\+ '+'
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
\b the boundary between a word char (\w)
and something that is not a word char
--------------------------------------------------------------------------------
00 '00'
--------------------------------------------------------------------------------
) end of grouping
--------------------------------------------------------------------------------
33 '33'
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
\b the boundary between a word char (\w)
and something that is not a word char
--------------------------------------------------------------------------------
0 '0'
--------------------------------------------------------------------------------
) end of grouping
--------------------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0 or
more times (matching the most amount
possible))
--------------------------------------------------------------------------------
[1-9] any character of: '1' to '9'
--------------------------------------------------------------------------------
(?: group, but do not capture (4 times):
--------------------------------------------------------------------------------
[\s.-]* any character of: whitespace (\n, \r,
\t, \f, and " "), '.', '-' (0 or more
times (matching the most amount
possible))
--------------------------------------------------------------------------------
\d{2} digits (0-9) (2 times)
--------------------------------------------------------------------------------
){4} end of grouping
--------------------------------------------------------------------------------
\b the boundary between a word char (\w) and
something that is not a word char
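
To see the effect of the added word boundaries, here is a minimal Python sketch. The sample strings are made up for illustration (the Facebook URL is the one from the question with the ** markers removed), and the names PHONE and samples are hypothetical.

```python
import re

# Pattern from the answer above: \b before the leading 0/00 and after the
# last digit pair keeps a match from starting or ending inside a longer
# run of digits.
PHONE = re.compile(r'(?:(?:\+|\b00)33|\b0)\s*[1-9](?:[\s.-]*\d{2}){4}\b')

# Hypothetical sample inputs, made up for illustration.
samples = [
    "Call me on 06 12 34 56 78 after 6pm",
    "Office: +33 1 23 45 67 89",
    "www.facebook.com/leboncoin-1565057520410527",
]

found = [PHONE.findall(text) for text in samples]
for text, numbers in zip(samples, found):
    print(text, "->", numbers)
```

The first two strings should each yield one number, and the Facebook ID should yield nothing, because no 0 inside the digit run is preceded by a word boundary.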

Related

Regex pattern to extract Hearst patterns

I am new to Regex and I am unable to extract hyponym-hypernym pairs in the form of a list or tuple.
I tried using this pattern but I get no matches
(NP_[\w.]*(, NP_[\w.]*)*,? (and)? other NP_[\w.]*)
I have the following annotated sentences for 'and other' pattern:
NP_kimmel faces NP_dui , NP_fleeing or NP_evading_police , and other NP_possible_charges .
The NP_network has asked NP_big_bang_theory_co-creator_bill prady to mastermind the NP_revival , which would see the NP_return of NP_kermit the NP_frog , NP_miss_piggy , NP_fozzie_bear and other NP_old_favorites .
I want to extract a list such as :
[NP_dui,NP_fleeing or NP_evading_police, NP_possible_charges]
OR
(NP_dui,NP_possible_charges)
(NP_fleeing or NP_evading_police,NP_possible_charges)
Similarly for the sentence 2:
[NP_kermit the NP_frog , NP_miss_piggy , NP_fozzie_bear, NP_old_favorites]
or similar tuples.
Any help would be appreciated.
Use
NP_[\w.]*(?:\s*(?:,|\bor\b|,?\s*and(?:\s+other)?\b)\s*NP_[\w.]*)+
This matches each whole enumeration as one string. Next, extract the individual entities from each match with NP_[\w.]*\b.
Python code:
import re

test_strs = [
    "NP_kimmel faces NP_dui , NP_fleeing or NP_evading_police , and other NP_possible_charges.",
    "The NP_network has asked NP_big_bang_theory_co-creator_bill prady to mastermind the NP_revival , which would see the NP_return of NP_kermit the NP_frog , NP_miss_piggy , NP_fozzie_bear and other NP_old_favorites .",
]
# Matches a whole enumeration of NP_ chunks joined by ",", "or", "and" or "and other"
p = r'NP_[\w.]*(?:\s*(?:,|\bor\b|,?\s*and(?:\s+other)?\b)\s*NP_[\w.]*)+'
for test_str in test_strs:
    matches = []
    for match in re.findall(p, test_str):
        # Pull the individual NP_ entities out of each enumeration
        matches.extend(re.findall(r'NP_[\w.]*\b', match))
    print(matches)
Results:
['NP_dui', 'NP_fleeing', 'NP_evading_police', 'NP_possible_charges']
['NP_frog', 'NP_miss_piggy', 'NP_fozzie_bear', 'NP_old_favorites']
EXPLANATION
--------------------------------------------------------------------------------
NP_ 'NP_'
--------------------------------------------------------------------------------
[\w.]* any character of: word characters (a-z, A-
Z, 0-9, _), '.' (0 or more times (matching
the most amount possible))
--------------------------------------------------------------------------------
(?: group, but do not capture (1 or more times
(matching the most amount possible)):
--------------------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0
or more times (matching the most amount
possible))
--------------------------------------------------------------------------------
(?: group, but do not capture:
--------------------------------------------------------------------------------
, ','
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
\b the boundary between a word char (\w)
and something that is not a word char
--------------------------------------------------------------------------------
or 'or'
--------------------------------------------------------------------------------
\b the boundary between a word char (\w)
and something that is not a word char
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
,? ',' (optional (matching the most
amount possible))
--------------------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ")
(0 or more times (matching the most
amount possible))
--------------------------------------------------------------------------------
and 'and'
--------------------------------------------------------------------------------
(?: group, but do not capture (optional
(matching the most amount possible)):
--------------------------------------------------------------------------------
\s+ whitespace (\n, \r, \t, \f, and " ")
(1 or more times (matching the most
amount possible))
--------------------------------------------------------------------------------
other 'other'
--------------------------------------------------------------------------------
)? end of grouping
--------------------------------------------------------------------------------
\b the boundary between a word char (\w)
and something that is not a word char
--------------------------------------------------------------------------------
) end of grouping
--------------------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0
or more times (matching the most amount
possible))
--------------------------------------------------------------------------------
NP_ 'NP_'
--------------------------------------------------------------------------------
[\w.]* any character of: word characters (a-z,
A-Z, 0-9, _), '.' (0 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
)+ end of grouping
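
The question also asks for (hyponym, hypernym) tuples. A possible way to build them from the matches above is sketched below. It assumes the hypernym is always the last NP_ chunk of an enumeration that contains "and other"; that is an assumption about the data, not something the pattern guarantees. The function name hearst_pairs is hypothetical.

```python
import re

# Enumeration pattern and entity pattern from the answer above.
ENUM = re.compile(r'NP_[\w.]*(?:\s*(?:,|\bor\b|,?\s*and(?:\s+other)?\b)\s*NP_[\w.]*)+')
ENTITY = re.compile(r'NP_[\w.]*\b')

def hearst_pairs(sentence):
    """Return (hyponym, hypernym) tuples for 'X, Y and other Z' enumerations."""
    pairs = []
    for m in ENUM.finditer(sentence):
        # Assumption: only enumerations containing "and other" carry a hypernym.
        if not re.search(r'\band\s+other\b', m.group()):
            continue
        entities = ENTITY.findall(m.group())
        hypernym = entities[-1]  # assumed: the chunk after "and other"
        pairs.extend((hyponym, hypernym) for hyponym in entities[:-1])
    return pairs

sentence = ("NP_kimmel faces NP_dui , NP_fleeing or NP_evading_police , "
            "and other NP_possible_charges .")
pairs = hearst_pairs(sentence)
print(pairs)
```

Note that, like the answer's code, this treats NP_fleeing and NP_evading_police as separate hyponyms, because "or" is one of the separators in the pattern.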

Remove a sequence of pipe separated numbers using regex

I am trying to match a sequence of four numbers that are separated by pipes in a string. The numbers may be negative, float, or double digits, for example:
13|5|-1|3 or 5|5|0|3 or 13|4|1.5|1
The string may also contain additional numbers and words; a full example looks like so:
SOME STRING CONTENT 13|5|-1|3 MORE 1.6 CONTENT HERE
How could I identify those numbers between and to the left/right of the pipes using regex?
I have tried [\d\-.\|] which matches all digits, decimals, pipes, and negative signs but also find it matches the additional number/decimal content in the string. Any help on just selecting that one section would be appreciated!
You can use
-?\b\d+(?:\.\d+)?(?:\|\-?\d+(?:\.\d+)?){3}\b
The pattern matches:
-? Match an optional -
\b A word boundary to prevent a partial match
\d+(?:\.\d+)? Match 1+ digits with an optional decimal part
(?:\|\-?\d+(?:\.\d+)?){3} Repeat 3 times the same as previous part preceded by a pipe
\b A word boundary
Regex demo
You can also use
(?<!\S)-?\d*\.?\d+(?:\|-?\d*\.?\d+){3}(?!\S)
See proof.
EXPLANATION
--------------------------------------------------------------------------------
(?<! look behind to see if there is not:
--------------------------------------------------------------------------------
\S non-whitespace (all but \n, \r, \t, \f,
and " ")
--------------------------------------------------------------------------------
) end of look-behind
--------------------------------------------------------------------------------
-? '-' (optional (matching the most amount
possible))
--------------------------------------------------------------------------------
\d* digits (0-9) (0 or more times (matching
the most amount possible))
--------------------------------------------------------------------------------
\.? '.' (optional (matching the most amount
possible))
--------------------------------------------------------------------------------
\d+ digits (0-9) (1 or more times (matching
the most amount possible))
--------------------------------------------------------------------------------
(?: group, but do not capture (3 times):
--------------------------------------------------------------------------------
\| '|'
--------------------------------------------------------------------------------
-? '-' (optional (matching the most amount
possible))
--------------------------------------------------------------------------------
\d* digits (0-9) (0 or more times (matching
the most amount possible))
--------------------------------------------------------------------------------
\.? '.' (optional (matching the most amount
possible))
--------------------------------------------------------------------------------
\d+ digits (0-9) (1 or more times (matching
the most amount possible))
--------------------------------------------------------------------------------
){3} end of grouping
--------------------------------------------------------------------------------
(?! look ahead to see if there is not:
--------------------------------------------------------------------------------
\S non-whitespace (all but \n, \r, \t, \f,
and " ")
--------------------------------------------------------------------------------
) end of look-ahead

Replace all occurrences except for the 3rd

I am scanning a QR code and need a script that replaces the commas with a tab (\t).
My results are:
820-20171-002, ,Nov 24, 2020,,,13,283.40,,Mike Shmow
My problem is that I don't want the comma inside the date to be replaced. Right now I have the following, which does replace every comma with a tab:
decodeResults[0].content.replace(/,/g, "\t");
I am trying to replace the /,/g with an expression to replace all commas except for the 3rd occurrence.
Use
.replace(/(?<!\b[a-zA-Z]{3}\s+\d{1,2}(?=,\s*\d{4})),/g, '\t')
See proof
Explanation
--------------------------------------------------------------------------------
(?<! Negative lookbehind start, fail if pattern matches
--------------------------------------------------------------------------------
\b the boundary between a word char (\w)
and something that is not a word char
--------------------------------------------------------------------------------
[a-zA-Z]{3} any character of: 'a' to 'z', 'A' to 'Z'
(3 times)
--------------------------------------------------------------------------------
\s+ whitespace (\n, \r, \t, \f, and " ") (1
or more times (matching the most amount
possible))
--------------------------------------------------------------------------------
\d{1,2} digits (0-9) (between 1 and 2 times
(matching the most amount possible))
--------------------------------------------------------------------------------
(?= look ahead to see if there is:
--------------------------------------------------------------------------------
, ','
--------------------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ")
(0 or more times (matching the most
amount possible))
--------------------------------------------------------------------------------
\d{4} digits (0-9) (4 times)
--------------------------------------------------------------------------------
) end of look-ahead
--------------------------------------------------------------------------------
) end of negative lookbehind
--------------------------------------------------------------------------------
, ','
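
The JavaScript pattern above relies on a variable-width lookbehind, which Python's re module does not accept. For anyone testing the idea in Python, the sketch below swaps in a different technique: an alternation that matches a whole date first and puts it back unchanged, and otherwise matches a single comma. The names DATE_OR_COMMA and commas_to_tabs are hypothetical, and the date shape (three letters, day, comma, four-digit year) is taken from the answer's pattern.

```python
import re

# Alternation: first try to match a whole date such as "Nov 24, 2020" and
# capture it; otherwise match a single comma.
DATE_OR_COMMA = re.compile(r'(\b[a-zA-Z]{3}\s+\d{1,2},\s*\d{4})|,')

def commas_to_tabs(text):
    # Captured dates are returned unchanged; lone commas become tabs.
    return DATE_OR_COMMA.sub(lambda m: m.group(1) or '\t', text)

scanned = "820-20171-002, ,Nov 24, 2020,,,13,283.40,,Mike Shmow"
converted = commas_to_tabs(scanned)
print(repr(converted))
```

Because the date alternative is tried first at each position, its inner comma is consumed as part of the date and never reaches the comma alternative.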

How to select specific content between second and third comma using Regex in sublime text?

Hi everyone, I'm trying to select the content between the second and third comma using regex.
This is my content:
INSERT INTO table (column1, column2, column3) VALUES ('Alejandro', 'dislike', '', 20, 'otro nombre')
INSERT INTO table (column1, column2, column3) VALUES ('Jando', 'like', '', 30, 'wtf')
As you can see, between the second and third comma there are just single quotes '', and I want to select them using regex because I need to modify about 5000 lines in Sublime Text 3. I hope you can help me. I tried ,(.*){2} with no success; I know it's wrong, as I have no experience with regex.
Note: The value will not always be single quotes ''.
Use
\bVALUES\s(?:[^\n,]*,){2}\h*\K[^,\n]+
See proof
Explanation
\b the boundary between a word char (\w) and
something that is not a word char
--------------------------------------------------------------------------------
VALUES 'VALUES'
--------------------------------------------------------------------------------
\s whitespace (\n, \r, \t, \f, and " ")
--------------------------------------------------------------------------------
(?: group, but do not capture (2 times):
--------------------------------------------------------------------------------
[^\n,]* any character except: '\n' (newline),
',' (0 or more times (matching the most
amount possible))
--------------------------------------------------------------------------------
, ','
--------------------------------------------------------------------------------
){2} end of grouping
--------------------------------------------------------------------------------
\h* horizontal whitespace (0 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
\K match reset operator (discarding text matched so far)
--------------------------------------------------------------------------------
[^,\n]+ any character except: ',', '\n' (newline)
(1 or more times (matching the most amount
possible))
Another attempt:
\bVALUES\s*\((?:\s*'(?:''|[^'])*'\s*,){2}\s*\K'(?:''|[^'])*'
See another proof
Explanation
--------------------------------------------------------------------------------
\b the boundary between a word char (\w) and
something that is not a word char
--------------------------------------------------------------------------------
VALUES 'VALUES'
--------------------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0 or
more times (matching the most amount
possible))
--------------------------------------------------------------------------------
\( '('
--------------------------------------------------------------------------------
(?: group, but do not capture (2 times):
--------------------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0
or more times (matching the most amount
possible))
--------------------------------------------------------------------------------
' ' char
--------------------------------------------------------------------------------
(?: group, but do not capture (0 or more
times (matching the most amount
possible)):
--------------------------------------------------------------------------------
'' ''
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
[^'] any character except: '
--------------------------------------------------------------------------------
)* end of grouping
--------------------------------------------------------------------------------
' '\''
--------------------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0
or more times (matching the most amount
possible))
--------------------------------------------------------------------------------
, ','
--------------------------------------------------------------------------------
){2} end of grouping
--------------------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0 or
more times (matching the most amount
possible))
--------------------------------------------------------------------------------
\K match reset operator (discarding text matched so far)
--------------------------------------------------------------------------------
' '
--------------------------------------------------------------------------------
(?: group, but do not capture (0 or more times
(matching the most amount possible)):
--------------------------------------------------------------------------------
'' '\'\''
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
[^'] any character except: '''
--------------------------------------------------------------------------------
)* end of grouping
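
The patterns above target Sublime Text's regex engine. To try the same idea in a script, here is a hedged Python sketch. Python's re module has no \K, so the prefix up to the third value is captured in a group and written back instead. The function name replace_third_value and the replacement literal 'neutral' are made up for illustration.

```python
import re

# Second pattern from the answer, with the part before \K wrapped in a
# capturing group because Python's re module does not support \K.
THIRD_VALUE = re.compile(
    r"(\bVALUES\s*\((?:\s*'(?:''|[^'])*'\s*,){2}\s*)'(?:''|[^'])*'"
)

def replace_third_value(line, new_literal):
    # Keep the captured prefix and swap only the third quoted value.
    return THIRD_VALUE.sub(lambda m: m.group(1) + new_literal, line, count=1)

line = ("INSERT INTO table (column1, column2, column3) "
        "VALUES ('Alejandro', 'dislike', '', 20, 'otro nombre')")
updated = replace_third_value(line, "'neutral'")
print(updated)
```

Like the second pattern, this only handles the case where the first three values are single-quoted strings.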

Split the string using tags using RegEx

I need help splitting the multi-tag string below on tags such as <eyn>, <un>, and <an>:
Your colleague <eyn id='test#test.com'>user</eyn> is now communicating with <un id='test#test.com'>user</un> from <an id='4442729'>test, Inc.</an>
Doing it with Regex
It's not advisable to use a regex to parse HTML because of all the obscure edge cases that can crop up, but it seems that you have some control over the HTML, so you should be able to avoid many of the edge cases the regex police cry about.
Proposed Solution
I'd probably want to collect the entire tag, the ID value, and the raw text between the open and close tags all in one action.
This Regex
<(eyn|un|an)\b(?=\s)(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\bid=('[^']*'|"[^"]*"|[^'"\s>]*))(?:[^>=]|='[^']*'|="[^"]*"|=[^'"\s]*)*\s?\/?>(.*?)<\/\w+>
This regex does the following:
finds all the eyn, un, and an tags
requires the tag to have an ID attribute
allows the ID attribute value to be unquoted or surrounded by ' or "
avoids difficult edge cases that make pattern matching in HTML difficult
Creates the following capture groups
group 0 the entire tag from open to close
group 1 the tag name
group 2 the ID value
group 3 the raw text inside between the open and close tags
Examples
See also Live demo
Sample Text
Note the difficult edge cases nested inside the second block of text.
Your colleague <eyn id='test#test.com'>user</eyn> is now communicating with <un id='test#test.com'>user</un> from <an id='4442729'>test, Inc.</an>
Your colleague <eyn onmouseover=' if ( 3 > a ) { var
string=" <eyn id=NotTheDroidYouAreLookingFor>R2D2</eyn>; "; } '
id='DesiredDroids'>This is the droid I'm looking for</eyn> is now communicating with <un id="test#test.com">user</un> from <an id=4442729>test, Inc.</an>
Sample Matches
Match 1
Full match 15-49 `<eyn id='test#test.com'>user</eyn>`
Group 1. 16-19 `eyn`
Group 2. 23-38 `'test#test.com'`
Group 3. 39-43 `user`
Match 2
Full match 76-108 `<un id='test#test.com'>user</un>`
Group 1. 77-79 `un`
Group 2. 83-98 `'test#test.com'`
Group 3. 99-103 `user`
Match 3
Full match 114-146 `<an id='4442729'>test, Inc.</an>`
Group 1. 115-117 `an`
Group 2. 121-130 `'4442729'`
Group 3. 131-141 `test, Inc.`
Match 4
Full match 163-326 `<eyn onmouseover=' if ( 3 > a ) { var
string=" <eyn id=NotTheDroidYouAreLookingFor>R2D2</eyn>; "; } '
id='DesiredDroids'>This is the droid I'm looking for</eyn>`
Group 1. 164-167 `eyn`
Group 2. 271-286 `'DesiredDroids'`
Group 3. 287-320 `This is the droid I'm looking for`
Match 5
Full match 353-385 `<un id="test#test.com">user</un>`
Group 1. 354-356 `un`
Group 2. 360-375 `"test#test.com"`
Group 3. 376-380 `user`
Match 6
Full match 391-421 `<an id=4442729>test, Inc.</an>`
Group 1. 392-394 `an`
Group 2. 398-405 `4442729`
Group 3. 406-416 `test, Inc.`
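
The capture groups can be read out programmatically. Below is a minimal Python sketch that runs the regex on the first sample line. The names TAG and parts are hypothetical, and stripping the quotes from the ID value is an extra step not done by the regex itself.

```python
import re

# The regex from the answer, in a triple-quoted raw string because it
# contains both ' and " characters.
TAG = re.compile(
    r"""<(eyn|un|an)\b(?=\s)(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\bid=('[^']*'|"[^"]*"|[^'"\s>]*))(?:[^>=]|='[^']*'|="[^"]*"|=[^'"\s]*)*\s?\/?>(.*?)<\/\w+>"""
)

text = ("Your colleague <eyn id='test#test.com'>user</eyn> is now "
        "communicating with <un id='test#test.com'>user</un> from "
        "<an id='4442729'>test, Inc.</an>")

# (tag name, ID value without surrounding quotes, inner text) per match
parts = [
    (m.group(1), m.group(2).strip("'\""), m.group(3))
    for m in TAG.finditer(text)
]
for tag, tag_id, inner in parts:
    print(tag, tag_id, inner)
```

If the goal is to split the surrounding sentence as well, m.start() and m.end() give the positions of each tag, so the plain text between tags can be sliced out of the original string.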
Explained
NODE EXPLANATION
--------------------------------------------------------------------------------
< '<'
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
eyn 'eyn'
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
un 'un'
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
an 'an'
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
\b the boundary between a word char (\w) and
something that is not a word char
--------------------------------------------------------------------------------
(?= look ahead to see if there is:
--------------------------------------------------------------------------------
\s whitespace (\n, \r, \t, \f, and " ")
--------------------------------------------------------------------------------
) end of look-ahead
--------------------------------------------------------------------------------
(?= look ahead to see if there is:
--------------------------------------------------------------------------------
(?: group, but do not capture (0 or more
times (matching the least amount
possible)):
--------------------------------------------------------------------------------
[^>=] any character except: '>', '='
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
=' '=\''
--------------------------------------------------------------------------------
[^']* any character except: ''' (0 or more
times (matching the most amount
possible))
--------------------------------------------------------------------------------
' '\''
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
=" '="'
--------------------------------------------------------------------------------
[^"]* any character except: '"' (0 or more
times (matching the most amount
possible))
--------------------------------------------------------------------------------
" '"'
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
= '='
--------------------------------------------------------------------------------
[^'"] any character except: ''', '"'
--------------------------------------------------------------------------------
[^\s>]* any character except: whitespace (\n,
\r, \t, \f, and " "), '>' (0 or more
times (matching the most amount
possible))
--------------------------------------------------------------------------------
)*? end of grouping
--------------------------------------------------------------------------------
\b the boundary between a word char (\w)
and something that is not a word char
--------------------------------------------------------------------------------
id= 'id='
--------------------------------------------------------------------------------
( group and capture to \2:
--------------------------------------------------------------------------------
' '\''
--------------------------------------------------------------------------------
[^']* any character except: ''' (0 or more
times (matching the most amount
possible))
--------------------------------------------------------------------------------
' '\''
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
" '"'
--------------------------------------------------------------------------------
[^"]* any character except: '"' (0 or more
times (matching the most amount
possible))
--------------------------------------------------------------------------------
" '"'
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
[^'"\s>]* any character except: ''', '"',
whitespace (\n, \r, \t, \f, and " "),
'>' (0 or more times (matching the
most amount possible))
--------------------------------------------------------------------------------
) end of \2
--------------------------------------------------------------------------------
) end of look-ahead
--------------------------------------------------------------------------------
(?: group, but do not capture (0 or more times
(matching the most amount possible)):
--------------------------------------------------------------------------------
[^>=] any character except: '>', '='
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
=' '=\''
--------------------------------------------------------------------------------
[^']* any character except: ''' (0 or more
times (matching the most amount
possible))
--------------------------------------------------------------------------------
' '\''
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
=" '="'
--------------------------------------------------------------------------------
[^"]* any character except: '"' (0 or more
times (matching the most amount
possible))
--------------------------------------------------------------------------------
" '"'
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
= '='
--------------------------------------------------------------------------------
[^'"\s]* any character except: ''', '"',
whitespace (\n, \r, \t, \f, and " ") (0
or more times (matching the most amount
possible))
--------------------------------------------------------------------------------
)* end of grouping
--------------------------------------------------------------------------------
\s? whitespace (\n, \r, \t, \f, and " ")
(optional (matching the most amount
possible))
--------------------------------------------------------------------------------
\/? '/' (optional (matching the most amount
possible))
--------------------------------------------------------------------------------
> '>'
--------------------------------------------------------------------------------
( group and capture to \3:
--------------------------------------------------------------------------------
.*? any character except \n (0 or more times
(matching the least amount possible))
--------------------------------------------------------------------------------
) end of \3
--------------------------------------------------------------------------------
< '<'
--------------------------------------------------------------------------------
\/ '/'
--------------------------------------------------------------------------------
\w+ word characters (a-z, A-Z, 0-9, _) (1 or
more times (matching the most amount
possible))
--------------------------------------------------------------------------------
> '>'