How to match the shortest string around a keyword with regex

I have the string below (one long string).
{"type":"Execution","typeValue":"Custom","targetValue":"_self","params":{"_report":"reportname","hyperlinkInput":"2","Organization":"orgid","As_Of_Date":"2016-04-01"},"id":"1111","href":"href?hyperlinkInput=2&Organization=orgid&As_Of_Date=2016-04-01&Locale=en_US","selector":"ExecutionEnd"},{"type":"Execution","typeValue":"Custom","targetValue":"_self","params":{"_report":"reportname","hyperlinkInput":"2","Organization":"orgid","As_Of_Date":"2016-04-01","CustomerID":"2222"},"id":"1234","href":"href?hyperlinkInput=2&Organization=orgid&As_Of_Date=2016-04-01&CustomerID=2222&Locale=en_US","selector":"ExecutionEnd"},{"type":"Execution","typeValue":"Custom","targetValue":"_self","params":{"_report":"reportname","hyperlinkInput":"2","Organization":"orgid","As_Of_Date":"2016-04-01"},"id":"1112","href":"href?hyperlinkInput=2&Organization=orgid&As_Of_Date=2016-04-01&Locale=en_US","selector":"ExecutionEnd"},{"type":"Execution","typeValue":"Custom","targetValue":"_self","params":{"_report":"reportname","hyperlinkInput":"2","Organization":"orgid","As_Of_Date":"2016-04-01","CustomerID":"2223"},"id":"1235","href":"href?hyperlinkInput=2&Organization=orgid&As_Of_Date=2016-04-01&CustomerID=22223&Locale=en_US","selector":"ExecutionEnd"},
Please note:
The string is on one line, and you can see that the content of each {} pair is very similar.
I can only use a regex; I cannot split the string with any function.
I want to use a regular expression to extract only the entries that contain CustomerID, each matched at its minimum complete length. For example, I want to extract the following:
{"type":"Execution","typeValue":"Custom","targetValue":"_self","params":{"_report":"reportname","hyperlinkInput":"2","Organization":"orgid","As_Of_Date":"2016-04-01","CustomerID":"2222"},"id":"1234","href":"href?hyperlinkInput=2&Organization=orgid&As_Of_Date=2016-04-01&CustomerID=2222&Locale=en_US","selector":"ExecutionEnd"}
{"type":"Execution","typeValue":"Custom","targetValue":"_self","params":{"_report":"reportname","hyperlinkInput":"2","Organization":"orgid","As_Of_Date":"2016-04-01","CustomerID":"2223"},"id":"1235","href":"href?hyperlinkInput=2&Organization=orgid&As_Of_Date=2016-04-01&CustomerID=22223&Locale=en_US","selector":"ExecutionEnd"},
But I'm not sure how to do this. I have tried many times with zero-width assertions but still cannot figure it out. Could you please enlighten me? Thanks!

Foreword
I don't recommend using a regex to parse JSON because of all the possible edge cases. But it appears you have some control over the data and can therefore limit the edge cases.
Description
Based on your source text, this regex will do the following:
Find all the JSON entries that have a CustomerID field nested inside the params object and also embedded in the href string
Validate that the CustomerID values in params and in href are identical
Work with both compressed and expanded JSON
Avoid some obvious edge cases that the regex police complain about
Note: I ran this regex with the case-insensitive flag.
\{(?=(?:"[^"]*"|[^{}"]*|\{[^{}]*})*?"params":\{(?:"[^"]*"|[^{}"]*|\{[^{}]*})*?"CustomerID":"([^"]*)")(?=(?:"[^"]*"|[^{}"]*|\{[^{}]*})*?"href":"[^"]*&CustomerID=\1)(?:"[^"]*"|[^{}"]*|\{[^{}]*})*}
Examples
Source Text
{"type":"Execution","typeValue":"Custom","targetValue":"_self","params":{"_report":"reportname","hyperlinkInput":"2","Organization":"orgid","As_Of_Date":"2016-04-01"},"id":"1111","href":"href?hyperlinkInput=2&Organization=orgid&As_Of_Date=2016-04-01&Locale=en_US","selector":"ExecutionEnd"}
,{"type":"Execution","typeValue":"Custom","targetValue":"_self","params":{"_report":"reportname","hyperlinkInput":"2","Organization":"orgid","As_Of_Date":"2016-04-01","CustomerID":"2222"},"id":"1234","href":"href?hyperlinkInput=2&Organization=orgid&As_Of_Date=2016-04-01&CustomerID=2222&Locale=en_US","selector":"ExecutionEnd"},{"type":"Execution","typeValue":"Custom","targetValue":"_self","params":{"_report":"reportname","hyperlinkInput":"2","Organization":"orgid","As_Of_Date":"2016-04-01"},"id":"1112","href":"href?hyperlinkInput=2&Organization=orgid&As_Of_Date=2016-04-01&Locale=en_US","selector":"ExecutionEnd"},{"type":"Execution","typeValue":"Custom","targetValue":"_self","params":{"_report":"reportname","hyperlinkInput":"2","Organization":"orgid","As_Of_Date":"2016-04-01","CustomerID":"2223"},"id":"1235","href":"href?hyperlinkInput=2&Organization=orgid&As_Of_Date=2016-04-01&CustomerID=22223&Locale=en_US","selector":"ExecutionEnd"},
{"type":"Execution","typeValue":"Custom","targetValue":"_self","params":{"_report":"reportname","hyperlinkInput":"2","Organization":"orgid","As_Of_Date":"2016-04-01","CustomerID":"44444"},"id":"1235","href":"href?hyperlinkInput=2&Organization=orgid&As_Of_Date=2016-04-01&CustomerID=44444&Locale=en_US","selector":"ExecutionEnd"},
Matches
[0][0] = {"type":"Execution","typeValue":"Custom","targetValue":"_self","params":{"_report":"reportname","hyperlinkInput":"2","Organization":"orgid","As_Of_Date":"2016-04-01","CustomerID":"2222"},"id":"1234","href":"href?hyperlinkInput=2&Organization=orgid&As_Of_Date=2016-04-01&CustomerID=2222&Locale=en_US","selector":"ExecutionEnd"}
[0][1] = 2222
[1][0] = {"type":"Execution","typeValue":"Custom","targetValue":"_self","params":{"_report":"reportname","hyperlinkInput":"2","Organization":"orgid","As_Of_Date":"2016-04-01","CustomerID":"44444"},"id":"1235","href":"href?hyperlinkInput=2&Organization=orgid&As_Of_Date=2016-04-01&CustomerID=44444&Locale=en_US","selector":"ExecutionEnd"}
[1][1] = 44444
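The matches above can be reproduced with a short script. The following is a minimal Python sketch, not part of the original answer: the pattern is copied verbatim, but the sample entries are shortened, hypothetical versions of the source text (fewer fields, same shape), and Python's `re` module is assumed as the engine.

```python
import re

# The answer's pattern, copied verbatim and split across three raw strings.
# Group 1 holds the CustomerID found inside "params".
PATTERN = re.compile(
    r'\{(?=(?:"[^"]*"|[^{}"]*|\{[^{}]*})*?"params":\{(?:"[^"]*"|[^{}"]*|\{[^{}]*})*?"CustomerID":"([^"]*)")'
    r'(?=(?:"[^"]*"|[^{}"]*|\{[^{}]*})*?"href":"[^"]*&CustomerID=\1)'
    r'(?:"[^"]*"|[^{}"]*|\{[^{}]*})*}',
    re.IGNORECASE,  # the answer notes that the case-insensitive flag was used
)

# Shortened, hypothetical entries modelled on the source text:
#   1. no CustomerID at all
#   2. CustomerID 2222 in both params and href
#   3. CustomerID 2223 in params but 22223 in href (mismatch)
text = (
    '{"type":"Execution","params":{"_report":"r","As_Of_Date":"2016-04-01"},'
    '"id":"1111","href":"href?x=2&Locale=en_US"},'
    '{"type":"Execution","params":{"_report":"r","CustomerID":"2222"},'
    '"id":"1234","href":"href?x=2&CustomerID=2222&Locale=en_US"},'
    '{"type":"Execution","params":{"_report":"r","CustomerID":"2223"},'
    '"id":"1235","href":"href?x=2&CustomerID=22223&Locale=en_US"},'
)

matches = list(PATTERN.finditer(text))
blocks = [m.group(0) for m in matches]  # group 0: the whole JSON block
ids = [m.group(1) for m in matches]     # group 1: the CustomerID value

for block, customer_id in zip(blocks, ids):
    print(customer_id, block)
```

Only the second entry is reported, because the back-reference `\1` requires the href to contain the same CustomerID that was captured from params.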
Explained
Capture Groups
group 0 gets the entire matching JSON block
group 1 gets the value associated with CustomerID
Expanded
NODE                      EXPLANATION
----------------------------------------------------------------------
\{                        '{'
(?=                       look ahead to see if there is:
  (?:                     group, but do not capture (0 or more times (matching the least amount possible)):
    "                     '"'
    [^"]*                 any character except: '"' (0 or more times (matching the most amount possible))
    "                     '"'
   |                      OR
    [^{}"]*               any character except: '{', '}', '"' (0 or more times (matching the most amount possible))
   |                      OR
    \{                    '{'
    [^{}]*                any character except: '{', '}' (0 or more times (matching the most amount possible))
    }                     '}'
  )*?                     end of grouping
  "params":               '"params":'
  \{                      '{'
  (?:                     group, but do not capture (0 or more times (matching the least amount possible)):
    "                     '"'
    [^"]*                 any character except: '"' (0 or more times (matching the most amount possible))
    "                     '"'
   |                      OR
    [^{}"]*               any character except: '{', '}', '"' (0 or more times (matching the most amount possible))
   |                      OR
    \{                    '{'
    [^{}]*                any character except: '{', '}' (0 or more times (matching the most amount possible))
    }                     '}'
  )*?                     end of grouping
  "CustomerID":"          '"CustomerID":"'
  (                       group and capture to \1:
    [^"]*                 any character except: '"' (0 or more times (matching the most amount possible))
  )                       end of \1
  "                       '"'
)                         end of look-ahead
(?=                       look ahead to see if there is:
  (?:                     group, but do not capture (0 or more times (matching the least amount possible)):
    "                     '"'
    [^"]*                 any character except: '"' (0 or more times (matching the most amount possible))
    "                     '"'
   |                      OR
    [^{}"]*               any character except: '{', '}', '"' (0 or more times (matching the most amount possible))
   |                      OR
    \{                    '{'
    [^{}]*                any character except: '{', '}' (0 or more times (matching the most amount possible))
    }                     '}'
  )*?                     end of grouping
  "href":"                '"href":"'
  [^"]*                   any character except: '"' (0 or more times (matching the most amount possible))
  &CustomerID=            '&CustomerID='
  \1                      what was matched by capture \1
)                         end of look-ahead
(?:                       group, but do not capture (0 or more times (matching the most amount possible)):
  "                       '"'
  [^"]*                   any character except: '"' (0 or more times (matching the most amount possible))
  "                       '"'
 |                        OR
  [^{}"]*                 any character except: '{', '}', '"' (0 or more times (matching the most amount possible))
 |                        OR
  \{                      '{'
  [^{}]*                  any character except: '{', '}' (0 or more times (matching the most amount possible))
  }                       '}'
)*                        end of grouping
}                         '}'

Related

RegEx: find a specific string enclosed in double quotes

I have the following string:
<img src="/images/site_graphics/newsite/foo_com_logo.png" alt="foo.com" width="82" height="14"/>
What is the regex to match only the string within double quotes that start from src= ?

\ssrc\s*=\s*"([^"]*)"
The result will be in group 1.
Explanation:
\s : Whitespace
* : Zero or more of the preceding token
[^"] : Not double quote
( ) : Group
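As a quick check of this short pattern, here is a minimal Python sketch (Python's `re` module is an assumption; the question does not name a language). It applies the pattern to the tag from the question and reads the value from group 1.

```python
import re

# The tag from the question.
html = '<img src="/images/site_graphics/newsite/foo_com_logo.png" alt="foo.com" width="82" height="14"/>'

# \s before "src" avoids matching attribute names that merely end in "src".
match = re.search(r'\ssrc\s*=\s*"([^"]*)"', html)

# Group 1 is the text between the double quotes; None if nothing matched.
src = match.group(1) if match else None
print(src)
```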

Foreword
It's not advisable to use a regex to parse HTML because of all the obscure edge cases that can crop up, but it seems that you have some control over the HTML, so you should be able to avoid many of the edge cases the regex police cry about.
Description
This regular expression
<img\b(?=\s)(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\ssrc=['"]([^"]*)['"]?)(?:[^>=]|='[^']*'|="[^"]*"|=[^'"\s]*)*"\s?\/?>
will do the following:
Capture the entire img tag
Place the src attribute value into capture group 1, without the quotes if they exist
Allow attributes to have single, double or no quotes
Can be modified to validate any number of other attributes
Avoid difficult edge cases that tend to make parsing HTML difficult
Example
Live Demo
https://regex101.com/r/qW9nG8/16
Sample text
Note the difficult edge case in the first line where we are looking for a specific droid.
<img onmouseover=' if ( 6 > 3 { funSwap(" src="NotTheDroidYourLookingFor.jpg", 6 > 3 ) } ; ' src="http://website/ThisIsTheDroidYourLookingFor.jpeg" onload="img_onload(this);" onerror="img_onerror(this);" data-pid="jihgfedcba" data-imagesize="ppew" />
some text
<img src="http://website/someurl.jpeg" onload="img_onload(this);" />
more text
<img src="https://en.wikipedia.org/wiki/File:BH_LMC.png"/>
Sample Matches
Capture group 0 gets the entire IMG tag
Capture group 1 gets just the src attribute value
[0][0] = <img onmouseover=' if ( 6 > 3 { funSwap(" src="NotTheDroidYourLookingFor.jpg", 6 > 3 ) } ; ' src="http://website/ThisIsTheDroidYourLookingFor.jpeg" onload="img_onload(this);" onerror="img_onerror(this);" data-pid="jihgfedcba" data-imagesize="ppew" />
[0][1] = http://website/ThisIsTheDroidYourLookingFor.jpeg
[1][0] = <img src="http://website/someurl.jpeg" onload="img_onload(this);" />
[1][1] = http://website/someurl.jpeg
[2][0] = <img src="https://en.wikipedia.org/wiki/File:BH_LMC.png"/>
[2][1] = https://en.wikipedia.org/wiki/File:BH_LMC.png
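For readers who want to reproduce these matches locally, the following is a minimal Python sketch. The pattern and the sample text are copied verbatim from this answer; the use of Python's `re` module is an assumption, since the answer itself was demonstrated on regex101.

```python
import re

# The answer's pattern, copied verbatim. A triple-quoted raw string is used
# because the pattern contains both single and double quotes.
IMG_PATTERN = re.compile(
    r'''<img\b(?=\s)(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\ssrc=['"]([^"]*)['"]?)(?:[^>=]|='[^']*'|="[^"]*"|=[^'"\s]*)*"\s?\/?>'''
)

# Sample text from the answer, including the decoy src inside onmouseover.
html = """<img onmouseover=' if ( 6 > 3 { funSwap(" src="NotTheDroidYourLookingFor.jpg", 6 > 3 ) } ; ' src="http://website/ThisIsTheDroidYourLookingFor.jpeg" onload="img_onload(this);" onerror="img_onerror(this);" data-pid="jihgfedcba" data-imagesize="ppew" />
some text
<img src="http://website/someurl.jpeg" onload="img_onload(this);" />
more text
<img src="https://en.wikipedia.org/wiki/File:BH_LMC.png"/>
"""

# With exactly one capture group, findall returns the group 1 values.
sources = IMG_PATTERN.findall(html)
for src in sources:
    print(src)
```

The decoy `src` inside the single-quoted `onmouseover` value is skipped because the look-ahead consumes each quoted attribute value as a single unit before it looks for `src=`.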
Explanation
NODE                      EXPLANATION
----------------------------------------------------------------------
<img                      '<img'
\b                        the boundary between a word char (\w) and something that is not a word char
(?=                       look ahead to see if there is:
  \s                      whitespace (\n, \r, \t, \f, and " ")
)                         end of look-ahead
(?=                       look ahead to see if there is:
  (?:                     group, but do not capture (0 or more times (matching the least amount possible)):
    [^>=]                 any character except: '>', '='
   |                      OR
    ='                    '=\''
    [^']*                 any character except: ''' (0 or more times (matching the most amount possible))
    '                     '\''
   |                      OR
    ="                    '="'
    [^"]*                 any character except: '"' (0 or more times (matching the most amount possible))
    "                     '"'
   |                      OR
    =                     '='
    [^'"]                 any character except: ''', '"'
    [^\s>]*               any character except: whitespace (\n, \r, \t, \f, and " "), '>' (0 or more times (matching the most amount possible))
  )*?                     end of grouping
  \s                      whitespace (\n, \r, \t, \f, and " ")
  src=                    'src='
  ['"]                    any character of: ''', '"'
  (                       group and capture to \1:
    [^"]*                 any character except: '"' (0 or more times (matching the most amount possible))
  )                       end of \1
  ['"]?                   any character of: ''', '"' (optional (matching the most amount possible))
)                         end of look-ahead
(?:                       group, but do not capture (0 or more times (matching the most amount possible)):
  [^>=]                   any character except: '>', '='
 |                        OR
  ='                      '=\''
  [^']*                   any character except: ''' (0 or more times (matching the most amount possible))
  '                       '\''
 |                        OR
  ="                      '="'
  [^"]*                   any character except: '"' (0 or more times (matching the most amount possible))
  "                       '"'
 |                        OR
  =                       '='
  [^'"\s]*                any character except: ''', '"', whitespace (\n, \r, \t, \f, and " ") (0 or more times (matching the most amount possible))
)*                        end of grouping
"                         '"'
\s?                       whitespace (\n, \r, \t, \f, and " ") (optional (matching the most amount possible))
\/?                       '/' (optional (matching the most amount possible))
>                         '>'
----------------------------------------------------------------------

Capture words on the right side of | (OR) in regex expression that are not in the left

I am trying to capture words on the right side of this regex that are not captured on the left.
In the code below, the left side captures "17 inch" in this string: "this 235/45R17 is a 17 inch tyre"
(?<=([-.0-9]+(\s)(inches|inch)))|???????
However, anything I put on the right side, such as a simple +w, interferes with the left side.
How can I tell the regex to capture any word, unless it is a digit followed by inch, in which case it should capture both 17 and inch?

Description
((?:(?![0-9.-]+\s*inch(?:es)?).)+)|([0-9.-]+\s*inch(?:es)?)
Example
Live Demo
https://regex101.com/r/fY9jU5/2
Sample text
this 235/45R17 is a 17 inch tyre
Sample Matches
Capture group 1 gets the text that is not part of the "17 inch" measurement
Capture group 2 gets the number of inches
MATCH 1
1. [0-20] `this 235/45R17 is a `
MATCH 2
2. [20-27] `17 inch`
MATCH 3
1. [27-32] ` tyre`
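The three matches above can be reproduced with a few lines of Python. This is a minimal sketch that assumes Python's `re` module (the answer itself used regex101); the pattern and sample text are copied verbatim.

```python
import re

# The answer's pattern: group 1 is any run of text that does not start an
# inch measurement; group 2 is the inch measurement itself.
PATTERN = re.compile(r'((?:(?![0-9.-]+\s*inch(?:es)?).)+)|([0-9.-]+\s*inch(?:es)?)')

text = "this 235/45R17 is a 17 inch tyre"

# Each match fills exactly one of the two groups; the other one is None.
pairs = [(m.group(1), m.group(2)) for m in PATTERN.finditer(text)]
spans = [m.span() for m in PATTERN.finditer(text)]

for (other, inches), span in zip(pairs, spans):
    print(span, repr(other), repr(inches))
```

Checking which group is not `None` tells you whether a given match is ordinary text or the measurement.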
Explanation
NODE                      EXPLANATION
----------------------------------------------------------------------
(                         group and capture to \1:
  (?:                     group, but do not capture (1 or more times (matching the most amount possible)):
    (?!                   look ahead to see if there is not:
      [0-9.-]+            any character of: '0' to '9', '.', '-' (1 or more times (matching the most amount possible))
      \s*                 whitespace (\n, \r, \t, \f, and " ") (0 or more times (matching the most amount possible))
      inch                'inch'
      (?:                 group, but do not capture (optional (matching the most amount possible)):
        es                'es'
      )?                  end of grouping
    )                     end of look-ahead
    .                     any character except \n
  )+                      end of grouping
)                         end of \1
|                         OR
(                         group and capture to \2:
  [0-9.-]+                any character of: '0' to '9', '.', '-' (1 or more times (matching the most amount possible))
  \s*                     whitespace (\n, \r, \t, \f, and " ") (0 or more times (matching the most amount possible))
  inch                    'inch'
  (?:                     group, but do not capture (optional (matching the most amount possible)):
    es                    'es'
  )?                      end of grouping
)                         end of \2
----------------------------------------------------------------------

Use LoadRunner web_reg_save_param_regexp to save the number after tvNode_R- from a block of HTML code

The regular expression I attached works without the highlighted "ORG_40365"; once I add "ORG_40365", it stops working.
However, I need to tie the match to a specific node, so I have to add "ORG_40365" at the end. Otherwise it returns other, unexpected values.
Please click here to see the web_reg_save_param_regexp used
I cannot copy the code here because the post fails to save with an error.
I have the above string and want to save 9704 after tvNode_R-. However, the regular expression does not work.
Any help would be greatly appreciated!

Foreword
Pattern matching HTML can be rather difficult, so it's generally recommended to use an HTML parsing tool.
Also, I manually transcribed your photo of the text. I recommend either replacing the photo with real text or adding the real text alongside the photo.
Description
<tr(?=\s)(?=(?:[^>=]|=(?:'[^']*'|"[^"]*"|[^'"][^\s>]*))*?\sclass=['"]dxtlNode)(?=(?:[^>=]|=(?:'[^']*'|"[^"]*"|[^'"][^\s>]*))*?\sid=['"][^"]*tvNode_R-([0-9]{4}))(?:[^>=]|=(?:'[^']*'|"[^"]*"|[^'"\s]*))*\s?\/?>
This regular expression will do the following:
Find tr tags
Require the tr tag to have a class starting with dxtlNode
Require the tr tag to have an id containing tvNode_R- followed by 4 digits
Capture the 4 digits identified above into capture group 1
Capture the entire opening tr tag into capture group 0
Avoid some difficult edge cases that make pattern matching in HTML difficult
Example
Live Demo
https://regex101.com/r/gK5dM0/1
Sample text
<td class="dxtlHSEC"></td></tr><tr id="m_m_splitMaster_P_TC_p_n_tvNode_R-9704"
class="dxtlNode" oncontextmenup="return aspxTLMenu('m_m_splitMaster_P_TC_p_n_tvNode','Node',';9704',event)">
<span class="btxt">ORG_40365</span></td><td class="dstlHSEC"></td></tr>
Sample Matches
Capture Group 0. [31-211] `<tr id="m_m_splitMaster_P_TC_p_n_tvNode_R-9704" class="dxtlNode" oncontextmenup="return aspxTLMenu('m_m_splitMaster_P_TC_p_n_tvNode','Node',';9704',event)">`
Capture Group 1. [73-77] `9704`
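The pattern can be checked outside LoadRunner before it is pasted into web_reg_save_param_regexp. The following is a minimal Python sketch, not LoadRunner code: it assumes Python's `re` module behaves like the engine used in the live demo for this pattern, and it reuses the sample text transcribed above.

```python
import re

# The answer's pattern, copied verbatim. Group 1 is the 4-digit node number.
TR_PATTERN = re.compile(
    r'''<tr(?=\s)(?=(?:[^>=]|=(?:'[^']*'|"[^"]*"|[^'"][^\s>]*))*?\sclass=['"]dxtlNode)(?=(?:[^>=]|=(?:'[^']*'|"[^"]*"|[^'"][^\s>]*))*?\sid=['"][^"]*tvNode_R-([0-9]{4}))(?:[^>=]|=(?:'[^']*'|"[^"]*"|[^'"\s]*))*\s?\/?>'''
)

# Sample text from the answer (transcribed from the question's screenshot).
html = """<td class="dxtlHSEC"></td></tr><tr id="m_m_splitMaster_P_TC_p_n_tvNode_R-9704"
class="dxtlNode" oncontextmenup="return aspxTLMenu('m_m_splitMaster_P_TC_p_n_tvNode','Node',';9704',event)">
<span class="btxt">ORG_40365</span></td><td class="dstlHSEC"></td></tr>
"""

match = TR_PATTERN.search(html)
node_id = match.group(1) if match else None   # the digits after tvNode_R-
opening_tag = match.group(0) if match else None  # the whole opening tr tag
print(node_id)
```

Because both requirements sit in look-aheads, the class and id attributes may appear in either order inside the tag.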
Explanation
NODE                      EXPLANATION
----------------------------------------------------------------------
<tr                       '<tr'
(?=                       look ahead to see if there is:
  \s                      whitespace (\n, \r, \t, \f, and " ")
)                         end of look-ahead
(?=                       look ahead to see if there is:
  (?:                     group, but do not capture (0 or more times (matching the least amount possible)):
    [^>=]                 any character except: '>', '='
   |                      OR
    =                     '='
    (?:                   group, but do not capture:
      '                   '\''
      [^']*               any character except: ''' (0 or more times (matching the most amount possible))
      '                   '\''
     |                    OR
      "                   '"'
      [^"]*               any character except: '"' (0 or more times (matching the most amount possible))
      "                   '"'
     |                    OR
      [^'"]               any character except: ''', '"'
      [^\s>]*             any character except: whitespace (\n, \r, \t, \f, and " "), '>' (0 or more times (matching the most amount possible))
    )                     end of grouping
  )*?                     end of grouping
  \s                      whitespace (\n, \r, \t, \f, and " ")
  class=                  'class='
  ['"]                    any character of: ''', '"'
  dxtlNode                'dxtlNode'
)                         end of look-ahead
(?=                       look ahead to see if there is:
  (?:                     group, but do not capture (0 or more times (matching the least amount possible)):
    [^>=]                 any character except: '>', '='
   |                      OR
    =                     '='
    (?:                   group, but do not capture:
      '                   '\''
      [^']*               any character except: ''' (0 or more times (matching the most amount possible))
      '                   '\''
     |                    OR
      "                   '"'
      [^"]*               any character except: '"' (0 or more times (matching the most amount possible))
      "                   '"'
     |                    OR
      [^'"]               any character except: ''', '"'
      [^\s>]*             any character except: whitespace (\n, \r, \t, \f, and " "), '>' (0 or more times (matching the most amount possible))
    )                     end of grouping
  )*?                     end of grouping
  \s                      whitespace (\n, \r, \t, \f, and " ")
  id=                     'id='
  ['"]                    any character of: ''', '"'
  [^"]*                   any character except: '"' (0 or more times (matching the most amount possible))
  tvNode_R-               'tvNode_R-'
  (                       group and capture to \1:
    [0-9]{4}              any character of: '0' to '9' (4 times)
  )                       end of \1
)                         end of look-ahead
(?:                       group, but do not capture (0 or more times (matching the most amount possible)):
  [^>=]                   any character except: '>', '='
 |                        OR
  =                       '='
  (?:                     group, but do not capture:
    '                     '\''
    [^']*                 any character except: ''' (0 or more times (matching the most amount possible))
    '                     '\''
   |                      OR
    "                     '"'
    [^"]*                 any character except: '"' (0 or more times (matching the most amount possible))
    "                     '"'
   |                      OR
    [^'"\s]*              any character except: ''', '"', whitespace (\n, \r, \t, \f, and " ") (0 or more times (matching the most amount possible))
  )                       end of grouping
)*                        end of grouping
\s?                       whitespace (\n, \r, \t, \f, and " ") (optional (matching the most amount possible))
\/?                       '/' (optional (matching the most amount possible))
>                         '>'
----------------------------------------------------------------------

Printing in patterns in perl

I am having great trouble removing errors from a Unicode-encoded corpus.
The entries have the following form:
രണവര്‍ഗ്ഗത്തിനകത്തു=ഭരണവര്‍ഗ്ഗത്തിന്:stemഅകത്തു|:suffix
ഭസ്മമാക്കിക്കളയുകയും=ഭസ്മം:stemആക്കിക്കളയുകയും|:suffix
ഭസ്മമാക്കി=ഭസ്മം:stemആക്കി|:suffix
ഭാഗത്തുനിന്നുണ്ടാകണം=ഭാഗത്ത്:stemനിന്ന്:stemഉണ്ടാകണം|:suffix,:
ഭാഗമായ=ഭാഗം:stemആയ|:suffix
ഭാര്യമാരില്‍നിന്നും=ഭാര്യമാരില്‍:stemനിന്നും|:suffix:suffix
ഭാര്യമാരുണ്ടായിരുന്നവരില്‍നിന്നു=ഭാര്യമാര്‍:stemഉണ്ടായിരുന്നവരില്‍:stemനിന്നു|:suffix,:suffix:suffix
ഭാര്യയായി=ഭാര്യ:stemആയി|:suffix
ഭാ‌ഷ്യകര്‍ത്താവായ=ഭാ‌ഷ്യകര്‍ത്താവ്:stemആയ|:suffix:suffix
ഭിത്തികളൊക്കെ=ഭിത്തികള്‍:stemഒക്കെ|:suffix
ഭിന്നതയില്ലെന്നും=ഭിന്നത:stemഇല്ല:stemഎന്നും|:suffix,:suffix0
ഭൂപ്രഭുക്കളെന്ന്=ഭൂപ്രഭുക്കള്‍:stemഎന്ന്|:suffix0
ഭൂമിയില്‍നിന്ന്=ഭൂമിയില്‍:stemനിന്ന്|:suffix
ഭൂമിയിലുള്ള=ഭൂമിയില്‍:stemഉള്ള|:suffix
ഭൂമിയെപ്പോലൊരു=ഭൂമിയെ:stemപോലെ:stemഒരു|:suffix,:suffix0
ഭൂമുഖവീക്ഷണനായി=ഭൂമുഖവീക്ഷണന്‍:stemആയി|:suffix:suffix
ഭൂസഞ്ചാരംപോലെ=ഭൂസഞ്ചാരം:stemപോലെ|:suffix
ഭേദിക്കേണ്ടതായി=ഭേദിക്കേണ്ടതാ്:stemആയി|:suffix:suffix
ഭൗതികവാദികളാണ്=ഭൗതികവാദികള്‍:stemആണ്|:suffix0
മക്കളയച്ചു=മക്കള്‍:stemഅയച്ചു|:suffix
മക്കള്‍ക്കാണ്=മക്കള്‍ക്ക്:stemആണ്|:suffix
മഞ്ചേരിയിലേക്കാണ്=മഞ്ചേരിയിലേക്ക്:stemആണ്|:suffix:suffix
മഞ്ചേശ്വരത്താണ്=മഞ്ചേശ്വരത്ത്:stemആണ്|:suffix:suffix
മഞ്ഞുവെള്ളത്തിലാഴ്ത്തി=മഞ്ഞുവെള്ളത്തില്‍:stemആഴ്ത്തി|:suffix:suffix
മടങ്ങാണിതിന്=മടങ്ങ്:stemആണ്:stemഇതിന്|:suffix,:suffix
മടിയനായിരുന്നു=മടിയന്‍:stemആയിരുന്നു|:suffix
I need to fix entries that have two stems in a row or two suffixes in a row. When there are two stems, I need to keep the first stem and convert the second into a suffix. When there are two suffixes, such as :suffix:suffix or :suffix,:suffix0, I need to keep only one suffix.
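use strict;
use warnings qw/ all FATAL /;
use List::Util 'reduce';
while ( <> ) {
    # Take the contents of the first two parenthesised groups on the line
    my ($word, $ss) = / \( ( [^()]* ) \) /gx;
    # Split the second group into whitespace-separated items
    my @ss = split ' ', $ss;
    # Fold the items left to right into nested 'S (..) (..)' terms
    my $str = reduce { sprintf 'S (%s) (%s)', $a, $b } @ss;
    printf "%s (%s)\n", $word, $str;
}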
This is the Perl code I am trying to adapt, but it is not sufficient to handle these complexities. Is there any way to handle these kinds of errors?
Expected output
ഭാര്യമാരുണ്ടായിരുന്നവരില്‍നിന്നു=ഭാര്യമാര്‍:stemഉണ്ടായിരുന്നവരില്‍:stemനിന്നു|:suffix,:suffix:suffix to
ഭാര്യമാരുണ്ടായിരുന്നവരില്‍നിന്നു=ഭാര്യമാര്‍:stemഉണ്ടായിരുന്നവരില്‍:suffixനിന്നു|:suffix
ഭാ‌ഷ്യകര്‍ത്താവായ=ഭാ‌ഷ്യകര്‍ത്താവ്:stemആയ|:suffix:suffix to
ഭാ‌ഷ്യകര്‍ത്താവായ=ഭാ‌ഷ്യകര്‍ത്താവ്:stemആയ|:suffix
മഞ്ചേരിയിലേക്കാണ്=മഞ്ചേരിയിലേക്ക്:stemആണ്|:suffix:suffix to
മഞ്ചേരിയിലേക്കാണ്=മഞ്ചേരിയിലേക്ക്:stemആണ്|:suffix
Is anyone interested in helping me?

Description
^([^:]+:stem[^:]+)(?::stem(?=.*?(:suffix))|)([^:]+?\|:suffix[^:]*)(?::suffix[^:]*)*$
Replace with: \1\2\3
This regular expression will do the following:
Assumes that each line contains a :suffix string; the lookahead matches it and captures it into group 2
If there is a second :stem, it is replaced with :suffix
Removes all but the first :suffix entry
Example
Live Demo
https://regex101.com/r/rJ9gW3/2
Sample text
ഭാര്യമാരുണ്ടായിരുന്നവരില്‍നിന്നു=ഭാര്യമാര്‍:stemഉണ്ടായിരുന്നവരില്‍:stemനിന്നു|:suffix,:suffix:suffix
ഭാ‌ഷ്യകര്‍ത്താവായ=ഭാ‌ഷ്യകര്‍ത്താവ്:stemആയ|:suffix:suffix
മഞ്ചേരിയിലേക്കാണ്=മഞ്ചേരിയിലേക്ക്:stemആണ്|:suffix:suffix
Sample Matches
ഭാര്യമാരുണ്ടായിരുന്നവരില്‍നിന്നു=ഭാര്യമാര്‍:stemഉണ്ടായിരുന്നവരില്‍:suffixനിന്നു|:suffix,
ഭാ‌ഷ്യകര്‍ത്താവായ=ഭാ‌ഷ്യകര്‍ത്താവ്:stemആയ|:suffix
മഞ്ചേരിയിലേക്കാണ്=മഞ്ചേരിയിലേക്ക്:stemആണ്|:suffix
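The substitution can also be tried outside regex101. The following is a minimal sketch in Python rather than Perl (a choice made here for illustration only); the pattern and replacement are copied from the answer, while the function name `fix_tags` and the ASCII placeholder entries are hypothetical. It processes the text line by line, which is an assumption about how the corpus would be fed in, so that the `[^:]*` parts of the pattern cannot run across a line break.

```python
import re

# Pattern copied from the answer above; the replacement is \1\2\3.
STEM_SUFFIX = re.compile(
    r"^([^:]+:stem[^:]+)(?::stem(?=.*?(:suffix))|)"
    r"([^:]+?\|:suffix[^:]*)(?::suffix[^:]*)*$"
)


def fix_tags(text: str) -> str:
    """Turn a second :stem into :suffix and drop repeated :suffix tags."""
    # Group 2 is unset when a line has no second :stem; Python 3.5+ then
    # substitutes an empty string for \2.
    return "\n".join(
        STEM_SUFFIX.sub(r"\1\2\3", line) for line in text.splitlines()
    )


if __name__ == "__main__":
    # ASCII placeholders stand in for the Malayalam words (hypothetical data).
    sample = "w=a:stemb:stemc|:suffix,:suffix:suffix\nw=ab:stemcd|:suffix:suffix"
    print(fix_tags(sample))
```

Lines that the pattern does not match are returned unchanged, so the function can be run over a whole file without filtering it first.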
Explanation
NODE EXPLANATION
----------------------------------------------------------------------
^ the beginning of a "line"
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
[^:]+ any character except: ':' (1 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
:stem ':stem'
----------------------------------------------------------------------
[^:]+ any character except: ':' (1 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
(?: group, but do not capture:
----------------------------------------------------------------------
:stem ':stem'
----------------------------------------------------------------------
(?= look ahead to see if there is:
----------------------------------------------------------------------
.*? any character except \n (0 or more
times (matching the least amount
possible))
----------------------------------------------------------------------
( group and capture to \2:
----------------------------------------------------------------------
:suffix ':suffix'
----------------------------------------------------------------------
) end of \2
----------------------------------------------------------------------
) end of look-ahead
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
( group and capture to \3:
----------------------------------------------------------------------
[^:]+? any character except: ':' (1 or more
times (matching the least amount
possible))
----------------------------------------------------------------------
\| '|'
----------------------------------------------------------------------
:suffix ':suffix'
----------------------------------------------------------------------
[^:]* any character except: ':' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
) end of \3
----------------------------------------------------------------------
(?: group, but do not capture (0 or more times
(matching the most amount possible)):
----------------------------------------------------------------------
:suffix ':suffix'
----------------------------------------------------------------------
[^:]* any character except: ':' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
)* end of grouping
----------------------------------------------------------------------
$ before an optional \n, and the end of a
"line"
----------------------------------------------------------------------

Regex pspad to replace item in line based on inclusion of other pattern

I need to match an item in a line and, when it is found, replace another item in the same line. Ex:
Find all lines containing: code="ap"
Replace: quant="*****" with quant="0"
Note that the quant field I need to replace could contain any value between the quotes.
I tried from another thread:
Need: replace "BBB" with "CCC" but only in lines that contain the word "AAA"
Search: ((?=.*?AAA)[^\r\n]*)(BBB)
Replace: $1CCC
However, I'm not sure if it will work with the quotes included in my find or how to enter the variable data in the initial quant replace.
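For reference, the search and replace strings quoted from the other thread can be checked with a short Python sketch. The function name `replace_bbb` and the test lines are hypothetical, and `$1` is written as `\1` because that is Python's replacement syntax.

```python
import re

# Search pattern quoted from the other thread.
LINE_WITH_AAA = re.compile(r"((?=.*?AAA)[^\r\n]*)(BBB)")


def replace_bbb(text: str) -> str:
    """Replace BBB with CCC, but only on lines that also contain AAA."""
    return LINE_WITH_AAA.sub(r"\1CCC", text)


if __name__ == "__main__":
    print(replace_bbb("AAA x BBB\nno match BBB\nBBB then AAA"))
```

One thing this sketch makes visible: because `[^\r\n]*` is greedy, only the last BBB on a matching line is replaced.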

Description
This regex will do the following:
validate that the line contains code="ap"; if it does not, no replacement is made on that line
find the first quantity="somevalue" string and replace it with quantity="0"
allow the quantity value to be anything, including spaces and other special characters
handle difficult edge cases, such as a key name that appears inside another attribute's quoted value
allow the code and quantity key names to appear in any order on the line
allow the values to be surrounded by single quotes, double quotes, or no quotes
The Regular Expression
For this regex I used the case-insensitive and multiline flags.
^(?=(?:[^=\r\n]|='[^']*'|="[^"]*"|=[^'"][^\s]*)*?code=(['"]?)ap\1)((?:[^=\r\n]|='[^']*'|="[^"]*"|=[^'"][^\s]*)*?quantity=)(?:"[^"]*"|'[^']*'|[^\s\n\r]*)(.*?)$
Replace with:
$2"0"$3
Examples
Source Text
Note the difficult edge cases in the last two lines.
code="ap" quantity="SomeValue" other="values"
code="Not ap" quantity="SomeValue"
code="ap" quantity="SomeValue"
quantity="AlsoSomeValue2" code="ap"
code="ap" other=' quantity="Save this value" ' quantity="SomeValue"
code="Not ap" quantity="SomeValue" other=' Code="ap" '
After Replace
code="ap" quantity="0" other="values"
code="Not ap" quantity="SomeValue"
code="ap" quantity="0"
quantity="0" code="ap"
code="ap" other=' quantity="Save this value" ' quantity="0"
code="Not ap" quantity="SomeValue" other=' Code="ap" '
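The replacement can be reproduced with a short script. This is a minimal sketch in Python, not in PSPad itself: the pattern is the one given above, the replacement `$2"0"$3` is written as `\g<2>"0"\g<3>` in Python's syntax, and the names `CODE_AP`, `zero_quantity` and `SOURCE` are hypothetical.

```python
import re

# Pattern from the answer, compiled with the case-insensitive and multiline flags.
CODE_AP = re.compile(
    r'''^(?=(?:[^=\r\n]|='[^']*'|="[^"]*"|=[^'"][^\s]*)*?code=(['"]?)ap\1)'''
    r'''((?:[^=\r\n]|='[^']*'|="[^"]*"|=[^'"][^\s]*)*?quantity=)'''
    r'''(?:"[^"]*"|'[^']*'|[^\s\n\r]*)(.*?)$''',
    re.IGNORECASE | re.MULTILINE,
)

# The source text from the example above.
SOURCE = """\
code="ap" quantity="SomeValue" other="values"
code="Not ap" quantity="SomeValue"
code="ap" quantity="SomeValue"
quantity="AlsoSomeValue2" code="ap"
code="ap" other=' quantity="Save this value" ' quantity="SomeValue"
code="Not ap" quantity="SomeValue" other=' Code="ap" '
"""


def zero_quantity(text: str) -> str:
    """Set the quantity value to "0" on lines that contain code="ap"."""
    # Group 2 is everything up to and including 'quantity=', group 3 is the
    # rest of the line after the old value.
    return CODE_AP.sub(r'\g<2>"0"\g<3>', text)


if __name__ == "__main__":
    print(zero_quantity(SOURCE), end="")
```

The quoted-value alternatives are what keep the pattern from stopping at a `quantity=` or `Code=` that sits inside another attribute's value, which is why the last two lines come out as shown in the After Replace listing.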
Explanation
NODE EXPLANATION
----------------------------------------------------------------------
^ the beginning of a "line"
----------------------------------------------------------------------
(?= look ahead to see if there is:
----------------------------------------------------------------------
(?: group, but do not capture (0 or more
times (matching the least amount
possible)):
----------------------------------------------------------------------
[^=\r\n] any character except: '=', '\r'
(carriage return), '\n' (newline)
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
=' '=\''
----------------------------------------------------------------------
[^']* any character except: ''' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
' '\''
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
=" '="'
----------------------------------------------------------------------
[^"]* any character except: '"' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
= '='
----------------------------------------------------------------------
[^'"] any character except: ''', '"'
----------------------------------------------------------------------
[^\s]* any character except: whitespace (\n,
\r, \t, \f, and " ") (0 or more times
(matching the most amount possible))
----------------------------------------------------------------------
)*? end of grouping
----------------------------------------------------------------------
code= 'code='
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
['"]? any character of: ''', '"' (optional
(matching the most amount possible))
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
ap 'ap'
----------------------------------------------------------------------
\1 what was matched by capture \1
----------------------------------------------------------------------
) end of look-ahead
----------------------------------------------------------------------
( group and capture to \2:
----------------------------------------------------------------------
(?: group, but do not capture (0 or more
times (matching the least amount
possible)):
----------------------------------------------------------------------
[^=\r\n] any character except: '=', '\r'
(carriage return), '\n' (newline)
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
=' '=\''
----------------------------------------------------------------------
[^']* any character except: ''' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
' '\''
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
=" '="'
----------------------------------------------------------------------
[^"]* any character except: '"' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
= '='
----------------------------------------------------------------------
[^'"] any character except: ''', '"'
----------------------------------------------------------------------
[^\s]* any character except: whitespace (\n,
\r, \t, \f, and " ") (0 or more times
(matching the most amount possible))
----------------------------------------------------------------------
)*? end of grouping
----------------------------------------------------------------------
quantity= 'quantity='
----------------------------------------------------------------------
) end of \2
----------------------------------------------------------------------
(?: group, but do not capture:
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
[^"]* any character except: '"' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
' '\''
----------------------------------------------------------------------
[^']* any character except: ''' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
' '\''
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
[^\s\n\r]* any character except: whitespace (\n,
\r, \t, \f, and " "), '\n' (newline),
'\r' (carriage return) (0 or more times
(matching the most amount possible))
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
( group and capture to \3:
----------------------------------------------------------------------
.*? any character except \n (0 or more times
(matching the least amount possible))
----------------------------------------------------------------------
) end of \3
----------------------------------------------------------------------
$ before an optional \n, and the end of a
"line"
