How to match this pattern (with emoji)?

I have a text file of a few thousand entries structured like this:
11111111111: text text text text text :: word11111111: text text text text :: word111111111:
Where:
11111111 is a big number
text text text text can be anything including emoji
word is one of 8 words
the second 111111111 is another number, but different.
I tried, but I just couldn't match it.
I don't know how to handle the emoji. Another problem is that the spacing is not consistent: sometimes it is a space, sometimes a tab, and so on.

Description
^([0-9]+):\s*((?:(?!\s::).)*)\s::\s*([^:]+)\s*:\s*((?:(?!\s::).)*)\s::\s*([^:]+):$
This regular expression will do the following:
Capture the leading 11111111
Match the :
Capture the text text text text text which may contain emojis.
Match the ::
Capture the first word11111111
Match the :
Capture the second text text text text, which may contain emojis.
Match the ::
Capture the final word111111111
Match the trailing :
Treat : and :: as the delimiters
Exclude the whitespace character before each :: and any whitespace after a delimiter from the captures.
Example
Live Demo
https://regex101.com/r/qG7uZ7/1
Sample text
11111111111: text text text text text :: word11111111: text text text text :: word111111111:
Capture Groups from match
0. 11111111111: text text text text text :: word11111111: text text text text :: word111111111:
1. `11111111111`
2. `text text text text text`
3. `word11111111`
4. `text text text text`
5. `word111111111`
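As a quick check outside regex101, here is a minimal Python sketch that applies the expression above line by line. The second sample line (with emoji and tab spacing) is made up for illustration, and the helper name parse_entry is hypothetical. Python 3 strings are Unicode, so the emoji need no special handling.

```python
import re

# The expression from the answer, unchanged.
ENTRY_RE = re.compile(
    r"^([0-9]+):\s*((?:(?!\s::).)*)\s::\s*([^:]+)\s*:\s*((?:(?!\s::).)*)\s::\s*([^:]+):$"
)

def parse_entry(line):
    """Return the five captured fields as a tuple, or None if the line does not match."""
    m = ENTRY_RE.match(line)
    return m.groups() if m else None

# One line from the question, plus a made-up line with emoji and tabs.
samples = [
    "11111111111: text text text text text :: word11111111: text text text text :: word111111111:",
    "123: hello \U0001F600 world\t:: alpha456:\tmore \U0001F389 text :: beta789:",
]

for line in samples:
    print(parse_entry(line))
```

When reading from a file, strip the trailing newline from each line (for example with line.rstrip("\n")) before calling parse_entry.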
Explanation
NODE EXPLANATION
----------------------------------------------------------------------
^ the beginning of a "line"
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
[0-9]+ any character of: '0' to '9' (1 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
: ':'
----------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0 or
more times (matching the most amount
possible))
----------------------------------------------------------------------
( group and capture to \2:
----------------------------------------------------------------------
(?: group, but do not capture (0 or more
times (matching the most amount
possible)):
----------------------------------------------------------------------
(?! look ahead to see if there is not:
----------------------------------------------------------------------
\s whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
:: '::'
----------------------------------------------------------------------
) end of look-ahead
----------------------------------------------------------------------
. any character except \n
----------------------------------------------------------------------
)* end of grouping
----------------------------------------------------------------------
) end of \2
----------------------------------------------------------------------
\s whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
:: '::'
----------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0 or
more times (matching the most amount
possible))
----------------------------------------------------------------------
( group and capture to \3:
----------------------------------------------------------------------
[^:]+ any character except: ':' (1 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
) end of \3
----------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0 or
more times (matching the most amount
possible))
----------------------------------------------------------------------
: ':'
----------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0 or
more times (matching the most amount
possible))
----------------------------------------------------------------------
( group and capture to \4:
----------------------------------------------------------------------
(?: group, but do not capture (0 or more
times (matching the most amount
possible)):
----------------------------------------------------------------------
(?! look ahead to see if there is not:
----------------------------------------------------------------------
\s whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
:: '::'
----------------------------------------------------------------------
) end of look-ahead
----------------------------------------------------------------------
. any character except \n
----------------------------------------------------------------------
)* end of grouping
----------------------------------------------------------------------
) end of \4
----------------------------------------------------------------------
\s whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
:: '::'
----------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0 or
more times (matching the most amount
possible))
----------------------------------------------------------------------
( group and capture to \5:
----------------------------------------------------------------------
[^:]+ any character except: ':' (1 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
) end of \5
----------------------------------------------------------------------
: ':'
----------------------------------------------------------------------
$ before an optional \n, and the end of a
"line"
----------------------------------------------------------------------

Related

Regular expression to extract fields with different whitespace

I'm processing some PDFs (each with a big table inside) with pdftotext and my output is a plain text file with different lines.
From the file, which I would like to read line by line, I want to extract only those lines with the following formatting and ignoring the rest:
YYYYYYYY A ZZZ BBBBBB BBB BBBBBB ZZZZZZZ C ZZZ DDDDDD DDD DD ZZZZZ E ZZZ FFFFFF FFF FFF ZZZZZ G (ZZZZ HHHH HHH HHHHH)
^ ^ ^ ^ ^ ^ ^ ^
Y = whitespace or text => IGNORED
A = 1 digit => CAPTURED
B = text with spaces => CAPTURED
C = 1 digit => CAPTURED
D = text with spaces => CAPTURED
E = 1 digit => CAPTURED
F = text with spaces => CAPTURED
G = 1 digit => CAPTURED
H = text with spaces => CAPTURED, if present
Z = whitespace (variable length >= 3 chars) => IGNORED
() = this part may or may not be present
I tried with the following regex:
^.+(\d)\s+(.{3,}?)\s{3,}(\d)\s+(.{3,}?)\s{3,}(\d)\s+(.{3,}?)\s{3,}(\d)\s+(.{3,}|)$
It works, but RegexBuddy reports that it leads to "catastrophic backtracking". So what would be the correct way to handle it? I need to capture the groups A..H to do my processing.
EDIT: Adding a couple of testing lines with fictional words:
01/11/2020 03/11/2020
N. First header N. Secondo header N. Third head N. Last optional header
1 COURAGE AND STOCKS EVERYTHING TO 1 SHE SAYS SHE HAS THE ABILITY 1 CHILD BY THE TO 1 JITTERY X) CLASSIC - A6
2 SHE LIKES ALL THE SAME THINGS 2 CROWD YELLS AND SCREMS MORE THEN MEMESU 2 WEDGES PROBABLY ARE NOT 2 FRAGILE Y) BOILINGE - A6
3 SOUNDTRACK OF ME 3 DROPPING ALL THE MONTHS 3 THE BEST FOR RELATIONSHIPS AT 3 NATURAL Z) ENVIOUSLYER - A6
Nothere 4 GOING AND HE COULD HEAR THUNDER IN KUNE 4 WONDERED IF HER LIE 4 CHANGED FOREV LT.200/250 4
5 MUSTER N.2 GROUNDKEE OF MY. 230 5 LEMONADE SHOWERS OF 5 FOCUS JUST MOST 5
Expected results, line by line:
Fail (there are two dates)
Fail (empty line)
Fail (doesn't have numbers, format mismatch)
Fail (empty line)
Match
Fail (empty line)
Match
Match
Match (the "Nothere" text should be ignored of course)
Match

Use
^.*?\s(\d)\s+(\S.*?)\s+(\d)\s+(\S.*?)\s+(\d)\s+(\S.*?)\s+(\d)(?:\s+(\S.*))?$
Explanation
--------------------------------------------------------------------------------
^ the beginning of the string
--------------------------------------------------------------------------------
.*? any character except \n (0 or more times
(matching the least amount possible))
--------------------------------------------------------------------------------
\s whitespace (\n, \r, \t, \f, and " ")
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
\d digits (0-9)
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
\s+ whitespace (\n, \r, \t, \f, and " ") (1 or
more times (matching the most amount
possible))
--------------------------------------------------------------------------------
( group and capture to \2:
--------------------------------------------------------------------------------
\S non-whitespace (all but \n, \r, \t, \f,
and " ")
--------------------------------------------------------------------------------
.*? any character except \n (0 or more times
(matching the least amount possible))
--------------------------------------------------------------------------------
) end of \2
--------------------------------------------------------------------------------
\s+ whitespace (\n, \r, \t, \f, and " ") (1 or
more times (matching the most amount
possible))
--------------------------------------------------------------------------------
( group and capture to \3:
--------------------------------------------------------------------------------
\d digits (0-9)
--------------------------------------------------------------------------------
) end of \3
--------------------------------------------------------------------------------
\s+ whitespace (\n, \r, \t, \f, and " ") (1 or
more times (matching the most amount
possible))
--------------------------------------------------------------------------------
( group and capture to \4:
--------------------------------------------------------------------------------
\S non-whitespace (all but \n, \r, \t, \f,
and " ")
--------------------------------------------------------------------------------
.*? any character except \n (0 or more times
(matching the least amount possible))
--------------------------------------------------------------------------------
) end of \4
--------------------------------------------------------------------------------
\s+ whitespace (\n, \r, \t, \f, and " ") (1 or
more times (matching the most amount
possible))
--------------------------------------------------------------------------------
( group and capture to \5:
--------------------------------------------------------------------------------
\d digits (0-9)
--------------------------------------------------------------------------------
) end of \5
--------------------------------------------------------------------------------
\s+ whitespace (\n, \r, \t, \f, and " ") (1 or
more times (matching the most amount
possible))
--------------------------------------------------------------------------------
( group and capture to \6:
--------------------------------------------------------------------------------
\S non-whitespace (all but \n, \r, \t, \f,
and " ")
--------------------------------------------------------------------------------
.*? any character except \n (0 or more times
(matching the least amount possible))
--------------------------------------------------------------------------------
) end of \6
--------------------------------------------------------------------------------
\s+ whitespace (\n, \r, \t, \f, and " ") (1 or
more times (matching the most amount
possible))
--------------------------------------------------------------------------------
( group and capture to \7:
--------------------------------------------------------------------------------
\d digits (0-9)
--------------------------------------------------------------------------------
) end of \7
--------------------------------------------------------------------------------
(?: group, but do not capture (optional
(matching the most amount possible)):
--------------------------------------------------------------------------------
\s+ whitespace (\n, \r, \t, \f, and " ") (1
or more times (matching the most amount
possible))
--------------------------------------------------------------------------------
( group and capture to \8:
--------------------------------------------------------------------------------
\S non-whitespace (all but \n, \r, \t,
\f, and " ")
--------------------------------------------------------------------------------
.* any character except \n (0 or more
times (matching the most amount
possible))
--------------------------------------------------------------------------------
) end of \8
--------------------------------------------------------------------------------
)? end of grouping
--------------------------------------------------------------------------------
$ before an optional \n, and the end of the
string
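The expression above can be tried with a short Python sketch. The input lines below are made up for illustration (they assume leading whitespace and wide gaps between columns, as in the pdftotext output described in the question), and the helper name parse_row is hypothetical.

```python
import re

# The expression from the answer, unchanged.
ROW_RE = re.compile(
    r"^.*?\s(\d)\s+(\S.*?)\s+(\d)\s+(\S.*?)\s+(\d)\s+(\S.*?)\s+(\d)(?:\s+(\S.*))?$"
)

def parse_row(line):
    """Return the eight groups (A..H) as a tuple, or None for non-matching lines."""
    m = ROW_RE.match(line)
    return m.groups() if m else None

# Made-up lines: a date line, a header line, a full row, and a row without column H.
rows = [
    "01/11/2020 03/11/2020",
    "   N. First header   N. Second header   N. Third head",
    "   1   COURAGE AND STOCKS   1   SHE SAYS   1   CHILD BY THE TO   1   JITTERY X",
    "Nothere 5   MUSTER   5   LEMONADE   5   FOCUS   5",
]

for row in rows:
    print(parse_row(row))
```

The optional last group comes back as None when column H is absent, so the caller can tell "missing" apart from "empty".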

How to select specific content between second and third comma using Regex in sublime text?

Hi everyone, I'm trying to select the content between the second and third commas using a regex.
This is my content
INSERT INTO table (column1, column2, column3) VALUES ('Alejandro', 'dislike', '', 20, 'otro nombre')
INSERT INTO table (column1, column2, column3) VALUES ('Jando', 'like', '', 30, 'wtf')
As you can see, between the second and third commas there are just single quotes (''), and I want to select them using a regex because I need to modify about 5000 lines in Sublime Text 3. I tried ,(.*){2} with no success; I know it's wrong, as I have no experience with regex.
Note: the value will not always be single quotes ('').

Use
\bVALUES\s(?:[^\n,]*,){2}\h*\K[^,\n]+
Explanation
\b the boundary between a word char (\w) and
something that is not a word char
--------------------------------------------------------------------------------
VALUES 'VALUES'
--------------------------------------------------------------------------------
\s whitespace (\n, \r, \t, \f, and " ")
--------------------------------------------------------------------------------
(?: group, but do not capture (2 times):
--------------------------------------------------------------------------------
[^\n,]* any character except: '\n' (newline),
',' (0 or more times (matching the most
amount possible))
--------------------------------------------------------------------------------
, ','
--------------------------------------------------------------------------------
){2} end of grouping
--------------------------------------------------------------------------------
\h* horizontal whitespace (0 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
\K match reset operator (discarding text matched so far)
--------------------------------------------------------------------------------
[^,\n]+ any character except: ',', '\n' (newline)
(1 or more times (matching the most amount
possible))
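For comparison outside Sublime Text: Python's built-in re module does not support \K or \h, so a rough equivalent captures the prefix in a group instead of discarding it. This is only a sketch under that substitution ([ \t]* approximates \h*), and the helper names third_value and replace_third_value are hypothetical.

```python
import re

# Group 1 holds the prefix that \K would discard; group 2 is the value to edit.
# [ \t]* stands in for \h*, which Python's re module does not provide.
THIRD_VALUE_RE = re.compile(r"(\bVALUES\s(?:[^\n,]*,){2}[ \t]*)([^,\n]+)")

def third_value(line):
    """Return the text between the second and third comma after VALUES, or None."""
    m = THIRD_VALUE_RE.search(line)
    return m.group(2) if m else None

def replace_third_value(line, new_value):
    """Replace that text with new_value, keeping everything before it unchanged."""
    return THIRD_VALUE_RE.sub(lambda m: m.group(1) + new_value, line, count=1)

line = "INSERT INTO table (column1, column2, column3) VALUES ('Alejandro', 'dislike', '', 20, 'otro nombre')"
print(third_value(line))
print(replace_third_value(line, "NULL"))
```

Like the original expression, this counts raw commas, so it would miscount if one of the first two values contained a comma inside its quotes.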
Another attempt:
\bVALUES\s*\((?:\s*'(?:''|[^'])*'\s*,){2}\s*\K'(?:''|[^'])*'
Explanation
--------------------------------------------------------------------------------
\b the boundary between a word char (\w) and
something that is not a word char
--------------------------------------------------------------------------------
VALUES 'VALUES'
--------------------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0 or
more times (matching the most amount
possible))
--------------------------------------------------------------------------------
\( '('
--------------------------------------------------------------------------------
(?: group, but do not capture (2 times):
--------------------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0
or more times (matching the most amount
possible))
--------------------------------------------------------------------------------
' '\''
--------------------------------------------------------------------------------
(?: group, but do not capture (0 or more
times (matching the most amount
possible)):
--------------------------------------------------------------------------------
'' '\'\''
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
[^'] any character except: '''
--------------------------------------------------------------------------------
)* end of grouping
--------------------------------------------------------------------------------
' '\''
--------------------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0
or more times (matching the most amount
possible))
--------------------------------------------------------------------------------
, ','
--------------------------------------------------------------------------------
){2} end of grouping
--------------------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0 or
more times (matching the most amount
possible))
--------------------------------------------------------------------------------
\K match reset operator (discarding text matched so far)
--------------------------------------------------------------------------------
' '\''
--------------------------------------------------------------------------------
(?: group, but do not capture (0 or more times
(matching the most amount possible)):
--------------------------------------------------------------------------------
'' '\'\''
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
[^'] any character except: '''
--------------------------------------------------------------------------------
)* end of grouping

What regex to use in this case

Correct code:
"key1=val1;key2=val2;key3=val3" -- Valid, as each pair ends with ";" except the last pair
Incorrect code:
"key1=val1;key2=val2; key3=val3;" -- Invalid, as the last pair ends with ";"
"key1=val1;;;key2=val2;;;key3=val3" -- Invalid, as there are multiple ";" in the middle
I got the regex below from an old Stack Overflow link, but it does not work for the cases above:
^(?:\s*\w+\s*=\s*[^;]*;)+$

You might use
^\w+\s*=\s*\w+(?:;\s*\w+\s*=\s*\w+)*$
Explanation
^ Start of string
\w+\s*=\s*\w+ Match 1+ word chars, = and 1+ word chars with optional whitespace chars
(?: Non capture group
;\s*\w+\s*=\s*\w+ Match ; and the same pattern as mentioned above
)* Close the group and repeat 0+ times
$ End of string
With doubled backslashes, as needed inside a string literal in languages such as Java:
^\\w+\\s*=\\s*\\w+(?:;\\s*\\w+\\s*=\\s*\\w+)*$
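A minimal Python sketch of the same check, run against the inputs from the question. The helper name is_valid is hypothetical. Using fullmatch is an extra precaution, since Python's $ on its own would also accept a trailing newline.

```python
import re

# The expression from the answer, unchanged.
PAIRS_RE = re.compile(r"^\w+\s*=\s*\w+(?:;\s*\w+\s*=\s*\w+)*$")

def is_valid(text):
    """True when text is key=value pairs separated by single ';' with no trailing ';'."""
    return PAIRS_RE.fullmatch(text) is not None

for candidate in [
    "key1=val1;key2=val2;key3=val3",
    "key1=val1;key2=val2; key3=val3;",
    "key1=val1;;;key2=val2;;;key3=val3",
]:
    print(candidate, is_valid(candidate))
```

Only the first input is accepted. Whitespace after a ";" is allowed by the \s* in the pattern, so the second input is rejected only because of its trailing ";".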

Also, an alternative:
^(?:\s*\w+\s*=\s*\w+(?:;(?!\s*$)|\s*$))+\s*$
Explanation
--------------------------------------------------------------------------------
^ the beginning of the string
--------------------------------------------------------------------------------
(?: group, but do not capture (1 or more times
(matching the most amount possible)):
--------------------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0
or more times (matching the most amount
possible))
--------------------------------------------------------------------------------
\w+ word characters (a-z, A-Z, 0-9, _) (1 or
more times (matching the most amount
possible))
--------------------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0
or more times (matching the most amount
possible))
--------------------------------------------------------------------------------
= '='
--------------------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0
or more times (matching the most amount
possible))
--------------------------------------------------------------------------------
\w+ word characters (a-z, A-Z, 0-9, _) (1 or
more times (matching the most amount
possible))
--------------------------------------------------------------------------------
(?: group, but do not capture:
--------------------------------------------------------------------------------
; ';'
--------------------------------------------------------------------------------
(?! look ahead to see if there is not:
--------------------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ")
(0 or more times (matching the most
amount possible))
--------------------------------------------------------------------------------
$ before an optional \n, and the end
of the string
--------------------------------------------------------------------------------
) end of look-ahead
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ")
(0 or more times (matching the most
amount possible))
--------------------------------------------------------------------------------
$ before an optional \n, and the end of
the string
--------------------------------------------------------------------------------
) end of grouping
--------------------------------------------------------------------------------
)+ end of grouping
--------------------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0 or
more times (matching the most amount
possible))
--------------------------------------------------------------------------------
$ before an optional \n, and the end of the
string

Try below:
.*\w;\w.*\w;\w.*[^;]$
Explanation:
.* --> matches any character except a newline, zero or more times
\w --> matches any word character
[^;]$ --> requires the last character to be something other than ;, so any line ending with ; is excluded

I find things like this much easier without regex. For example, with JavaScript:
function isValid(string) {
  // Split into pairs on ";", then require each pair to split into exactly two parts on "=".
  return string.split(/;/).map(e => e.split(/=/)).every(e => e.length === 2);
}

RegEx: find a specific string enclosed in double quotes

I have the following string:
<img src="/images/site_graphics/newsite/foo_com_logo.png" alt="foo.com" width="82" height="14"/>
What is the regex to match only the double-quoted string that follows src= ?

\ssrc\s*=\s*"([^"]*)"
The result will be in group 1.
Explanation:
\s : Whitespace
* : Zero or more of the preceding token
[^"] : Any character that is not a double quote
( ) : Capture group
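A minimal Python sketch of reading group 1, using the tag from the question:

```python
import re

# The expression from the answer, unchanged.
SRC_RE = re.compile(r'\ssrc\s*=\s*"([^"]*)"')

html = '<img src="/images/site_graphics/newsite/foo_com_logo.png" alt="foo.com" width="82" height="14"/>'

m = SRC_RE.search(html)
print(m.group(1) if m else None)
```

The leading \s keeps the pattern from matching attributes whose names merely end in src, such as a hypothetical data-src.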

Foreword
It's not advisable to use a regex to parse HTML because of all the obscure edge cases that can crop up, but it seems that you have some control over the HTML, so you should be able to avoid many of the edge cases the regex police cry about.
Description
This regular expression
<img\b(?=\s)(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\ssrc=['"]([^"]*)['"]?)(?:[^>=]|='[^']*'|="[^"]*"|=[^'"\s]*)*"\s?\/?>
Will do the following:
Capture the entire IMG tag
Place the src attribute value into capture group 1, without the surrounding quotes if they exist
Allow attributes to have single, double or no quotes
Can be modified to validate any number of other attributes
Avoid difficult edge cases which tend to make parsing HTML difficult
Example
Live Demo
https://regex101.com/r/qW9nG8/16
Sample text
Note the difficult edge case in the first line where we are looking for a specific droid.
<img onmouseover=' if ( 6 > 3 { funSwap(" src="NotTheDroidYourLookingFor.jpg", 6 > 3 ) } ; ' src="http://website/ThisIsTheDroidYourLookingFor.jpeg" onload="img_onload(this);" onerror="img_onerror(this);" data-pid="jihgfedcba" data-imagesize="ppew" />
some text
<img src="http://website/someurl.jpeg" onload="img_onload(this);" />
more text
<img src="https://en.wikipedia.org/wiki/File:BH_LMC.png"/>
Sample Matches
Capture group 0 gets the entire IMG tag
Capture group 1 gets just the src attribute value
[0][0] = <img onmouseover=' if ( 6 > 3 { funSwap(" src="NotTheDroidYourLookingFor.jpg", 6 > 3 ) } ; ' src="http://website/ThisIsTheDroidYourLookingFor.jpeg" onload="img_onload(this);" onerror="img_onerror(this);" data-pid="jihgfedcba" data-imagesize="ppew" />
[0][1] = http://website/ThisIsTheDroidYourLookingFor.jpeg
[1][0] = <img src="http://website/someurl.jpeg" onload="img_onload(this);" />
[1][1] = http://website/someurl.jpeg
[2][0] = <img src="https://en.wikipedia.org/wiki/File:BH_LMC.png"/>
[2][1] = https://en.wikipedia.org/wiki/File:BH_LMC.png
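The matches above can be reproduced with a short Python sketch. It uses the expression exactly as given and the sample text from this answer. The variable names are hypothetical, and this is only a check of the pattern, not a substitute for an HTML parser.

```python
import re

# The expression from the answer, unchanged (triple-quoted so both quote types fit).
IMG_RE = re.compile(
    r"""<img\b(?=\s)(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\ssrc=['"]([^"]*)['"]?)(?:[^>=]|='[^']*'|="[^"]*"|=[^'"\s]*)*"\s?\/?>"""
)

sample = """<img onmouseover=' if ( 6 > 3 { funSwap(" src="NotTheDroidYourLookingFor.jpg", 6 > 3 ) } ; ' src="http://website/ThisIsTheDroidYourLookingFor.jpeg" onload="img_onload(this);" onerror="img_onerror(this);" data-pid="jihgfedcba" data-imagesize="ppew" />
some text
<img src="http://website/someurl.jpeg" onload="img_onload(this);" />
more text
<img src="https://en.wikipedia.org/wiki/File:BH_LMC.png"/>"""

# Group 1 of each match is the src value; group 0 is the whole tag.
sources = [m.group(1) for m in IMG_RE.finditer(sample)]
for src in sources:
    print(src)
```

The src that appears inside the onmouseover value is skipped because the lookahead consumes each quoted attribute value as a unit before it looks for a whitespace-preceded src=.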
Explanation
NODE EXPLANATION
----------------------------------------------------------------------
<img '<img'
----------------------------------------------------------------------
\b the boundary between a word char (\w) and
something that is not a word char
----------------------------------------------------------------------
(?= look ahead to see if there is:
----------------------------------------------------------------------
\s whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
) end of look-ahead
----------------------------------------------------------------------
(?= look ahead to see if there is:
----------------------------------------------------------------------
(?: group, but do not capture (0 or more
times (matching the least amount
possible)):
----------------------------------------------------------------------
[^>=] any character except: '>', '='
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
=' '=\''
----------------------------------------------------------------------
[^']* any character except: ''' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
' '\''
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
=" '="'
----------------------------------------------------------------------
[^"]* any character except: '"' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
= '='
----------------------------------------------------------------------
[^'"] any character except: ''', '"'
----------------------------------------------------------------------
[^\s>]* any character except: whitespace (\n,
\r, \t, \f, and " "), '>' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
)*? end of grouping
----------------------------------------------------------------------
\s whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
src= 'src='
----------------------------------------------------------------------
['"] any character of: ''', '"'
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
[^"]* any character except: '"' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
['"]? any character of: ''', '"' (optional
(matching the most amount possible))
----------------------------------------------------------------------
) end of look-ahead
----------------------------------------------------------------------
(?: group, but do not capture (0 or more times
(matching the most amount possible)):
----------------------------------------------------------------------
[^>=] any character except: '>', '='
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
=' '=\''
----------------------------------------------------------------------
[^']* any character except: ''' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
' '\''
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
=" '="'
----------------------------------------------------------------------
[^"]* any character except: '"' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
= '='
----------------------------------------------------------------------
[^'"\s]* any character except: ''', '"',
whitespace (\n, \r, \t, \f, and " ") (0
or more times (matching the most amount
possible))
----------------------------------------------------------------------
)* end of grouping
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
\s? whitespace (\n, \r, \t, \f, and " ")
(optional (matching the most amount
possible))
----------------------------------------------------------------------
\/? '/' (optional (matching the most amount
possible))
----------------------------------------------------------------------
> '>'
----------------------------------------------------------------------

Use LoadRunner web_reg_save_param_regexp to save the number after tvNode_R- from a block of HTML code

The regular expression I attached works without the highlighted "ORG_40365", but once I add "ORG_40365" it stops working.
However, I need to tie the occurrence to a specific node, so I need "ORG_40365" at the end. Otherwise an unexpected value is returned.
Please click here to see the web_reg_save_param_regexp used
I cannot paste the code here because the post fails to save with an error.
From the string above I want to save 9704, the number after tvNode_R-. However, the regular expression does not work.
Any help would be greatly appreciated!

Foreword
Pattern matching HTML can be rather difficult, so it's generally recommended to use an HTML parsing tool.
Also I manually transcribed your photo of text. I recommend either replacing the photo with real text or inserting real text in addition to the photo.
Description
<tr(?=\s)(?=(?:[^>=]|=(?:'[^']*'|"[^"]*"|[^'"][^\s>]*))*?\sclass=['"]dxtlNode)(?=(?:[^>=]|=(?:'[^']*'|"[^"]*"|[^'"][^\s>]*))*?\sid=['"][^"]*tvNode_R-([0-9]{4}))(?:[^>=]|=(?:'[^']*'|"[^"]*"|[^'"\s]*))*\s?\/?>
This regular expression will do the following:
Find tr tags
Require the tr tag to have a class dxtlNode
Require the tr tag to have an id with tvNode_R- followed by 4 digits
Capture the 4 digits identified above into Capture Group 1
Capture the entire opening tr tag into Capture Group 0
Avoid some difficult edge cases which make pattern matching in HTML difficult
Example
Live Demo
https://regex101.com/r/gK5dM0/1
Sample text
<td class="dxtlHSEC"></td></tr><tr id="m_m_splitMaster_P_TC_p_n_tvNode_R-9704"
class="dxtlNode" oncontextmenup="return aspxTLMenu('m_m_splitMaster_P_TC_p_n_tvNode','Node',';9704',event)">
<span class="btxt">ORG_40365</span></td><td class="dstlHSEC"></td></tr>
Sample Matches
Capture Group 0. [31-211] `<tr id="m_m_splitMaster_P_TC_p_n_tvNode_R-9704" class="dxtlNode" oncontextmenup="return aspxTLMenu('m_m_splitMaster_P_TC_p_n_tvNode','Node',';9704',event)">`
Capture Group 1. [73-77] `9704`
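To sanity-check the expression outside LoadRunner, it can be run against the sample text in Python's re module. This is only a sketch: the names ATTRS_LAZY, TR_NODE_PATTERN and SAMPLE_HTML are invented for the example, and the pattern is the one above, split into pieces for readability.

```python
import re

# Lazy attribute walker shared by both lookaheads (hypothetical name).
ATTRS_LAZY = r"""(?:[^>=]|=(?:'[^']*'|"[^"]*"|[^'"][^\s>]*))*?"""

# The expression from the Description, reassembled piece by piece.
TR_NODE_PATTERN = re.compile(
    r"<tr(?=\s)"
    + r"(?=" + ATTRS_LAZY + r"""\sclass=['"]dxtlNode)"""
    + r"(?=" + ATTRS_LAZY + r"""\sid=['"][^"]*tvNode_R-([0-9]{4}))"""
    + r"""(?:[^>=]|=(?:'[^']*'|"[^"]*"|[^'"\s]*))*\s?\/?>"""
)

# Sample text from this answer.
SAMPLE_HTML = (
    '<td class="dxtlHSEC"></td></tr><tr id="m_m_splitMaster_P_TC_p_n_tvNode_R-9704"\n'
    'class="dxtlNode" oncontextmenup="return aspxTLMenu('
    "'m_m_splitMaster_P_TC_p_n_tvNode','Node',';9704',event)\">\n"
    '<span class="btxt">ORG_40365</span></td><td class="dstlHSEC"></td></tr>'
)

match = TR_NODE_PATTERN.search(SAMPLE_HTML)
if match:
    print("Group 0:", match.group(0))
    print("Group 1:", match.group(1))
```

Because the id and class checks are lookaheads anchored right after <tr, the two attributes can appear in either order inside the tag.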
Explanation
NODE EXPLANATION
----------------------------------------------------------------------
<tr '<tr'
----------------------------------------------------------------------
(?= look ahead to see if there is:
----------------------------------------------------------------------
\s whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
) end of look-ahead
----------------------------------------------------------------------
(?= look ahead to see if there is:
----------------------------------------------------------------------
(?: group, but do not capture (0 or more
times (matching the least amount
possible)):
----------------------------------------------------------------------
[^>=] any character except: '>', '='
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
= '='
----------------------------------------------------------------------
(?: group, but do not capture:
----------------------------------------------------------------------
' '\''
----------------------------------------------------------------------
[^']* any character except: ''' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
' '\''
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
[^"]* any character except: '"' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
[^'"] any character except: ''', '"'
----------------------------------------------------------------------
[^\s>]* any character except: whitespace
(\n, \r, \t, \f, and " "), '>' (0 or
more times (matching the most amount
possible))
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
)*? end of grouping
----------------------------------------------------------------------
\s whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
class= 'class='
----------------------------------------------------------------------
['"] any character of: ''', '"'
----------------------------------------------------------------------
dxtlNode 'dxtlNode'
----------------------------------------------------------------------
) end of look-ahead
----------------------------------------------------------------------
(?= look ahead to see if there is:
----------------------------------------------------------------------
(?: group, but do not capture (0 or more
times (matching the least amount
possible)):
----------------------------------------------------------------------
[^>=] any character except: '>', '='
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
= '='
----------------------------------------------------------------------
(?: group, but do not capture:
----------------------------------------------------------------------
' '\''
----------------------------------------------------------------------
[^']* any character except: ''' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
' '\''
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
[^"]* any character except: '"' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
[^'"] any character except: ''', '"'
----------------------------------------------------------------------
[^\s>]* any character except: whitespace
(\n, \r, \t, \f, and " "), '>' (0 or
more times (matching the most amount
possible))
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
)*? end of grouping
----------------------------------------------------------------------
\s whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
id= 'id='
----------------------------------------------------------------------
['"] any character of: ''', '"'
----------------------------------------------------------------------
[^"]* any character except: '"' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
tvNode_R- 'tvNode_R-'
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
[0-9]{4} any character of: '0' to '9' (4 times)
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
) end of look-ahead
----------------------------------------------------------------------
(?: group, but do not capture (0 or more times
(matching the most amount possible)):
----------------------------------------------------------------------
[^>=] any character except: '>', '='
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
= '='
----------------------------------------------------------------------
(?: group, but do not capture:
----------------------------------------------------------------------
' '\''
----------------------------------------------------------------------
[^']* any character except: ''' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
' '\''
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
[^"]* any character except: '"' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
[^'"\s]* any character except: ''', '"',
whitespace (\n, \r, \t, \f, and " ")
(0 or more times (matching the most
amount possible))
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
)* end of grouping
----------------------------------------------------------------------
\s? whitespace (\n, \r, \t, \f, and " ")
(optional (matching the most amount
possible))
----------------------------------------------------------------------
\/? '/' (optional (matching the most amount
possible))
----------------------------------------------------------------------
> '>'
----------------------------------------------------------------------
