How to exclude images from regex email extraction - regex

I am using some email extractor software to (surprise surprise) extract emails from websites. It uses the regex:
[A-Z0-9._%+-]+#[A-Z0-9.-]{3,65}\.[A-Z]{2,4}
But this churns out images as well as emails, e.g. _212000482_1#80xauto.jpg
I can change this regex, but I cannot figure out how to exclude matches ending in .png, .jpg etc.
There is a lot of information on validating emails - and how hard this is - but all I want to do is exclude images from the result list.

Description
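In your sample text the undesired substring resembles an email address but conveniently ends in jpg, so a negative lookahead placed in front of the pattern can reject any token that ends in an image filename extension. Because the character classes only list A-Z, the expression has to be run with the case-insensitive option.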
(?!\S*\.(?:jpg|png|gif|bmp)(?:[\s\n\r]|$))[A-Z0-9._%+-]+#[A-Z0-9.-]{3,65}\.[A-Z]{2,4}
Example
Live Demo
https://regex101.com/r/mU7bO3/2
Sample text
droids#gmail.com _212000482_1#80xauto.jpg More.Droids#deathstar.com
Sample Matches
droids#gmail.com
More.Droids#deathstar.com
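As a minimal sketch, the same expression can be exercised from Python. The function name `extract_addresses` is made up for illustration, and the `#` separator is kept exactly as it appears in the sample text; the extractor software itself may apply the pattern differently.

```python
import re

# Pattern from the answer above. The negative lookahead rejects any token
# that ends in one of the listed image extensions.
EMAIL_NOT_IMAGE = re.compile(
    r"(?!\S*\.(?:jpg|png|gif|bmp)(?:[\s\n\r]|$))"
    r"[A-Z0-9._%+-]+#[A-Z0-9.-]{3,65}\.[A-Z]{2,4}",
    re.IGNORECASE,  # the character classes only list A-Z
)

def extract_addresses(text):
    """Return address-like matches, skipping tokens that end in an image extension."""
    return EMAIL_NOT_IMAGE.findall(text)

sample = "droids#gmail.com _212000482_1#80xauto.jpg More.Droids#deathstar.com"
print(extract_addresses(sample))
```

To exclude further file types, add them to the alternation inside the lookahead.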
Explanation
NODE EXPLANATION
----------------------------------------------------------------------
(?! look ahead to see if there is not:
----------------------------------------------------------------------
\S* non-whitespace (all but \n, \r, \t, \f,
and " ") (0 or more times (matching the
most amount possible))
----------------------------------------------------------------------
\. '.'
----------------------------------------------------------------------
(?: group, but do not capture:
----------------------------------------------------------------------
jpg 'jpg'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
png 'png'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
gif 'gif'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
bmp 'bmp'
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
(?: group, but do not capture:
----------------------------------------------------------------------
[\s\n\r] any character of: whitespace (\n, \r,
\t, \f, and " "), '\n' (newline), '\r'
(carriage return)
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
$ before an optional \n, and the end of
a "line"
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
) end of look-ahead
----------------------------------------------------------------------
[A-Z0-9._%+-]+ any character of: 'A' to 'Z', '0' to '9',
'.', '_', '%', '+', '-' (1 or more times
(matching the most amount possible))
----------------------------------------------------------------------
# '#'
----------------------------------------------------------------------
[A-Z0-9.-]{3,65} any character of: 'A' to 'Z', '0' to '9',
'.', '-' (between 3 and 65 times (matching
the most amount possible))
----------------------------------------------------------------------
\. '.'
----------------------------------------------------------------------
[A-Z]{2,4} any character of: 'A' to 'Z' (between 2
and 4 times (matching the most amount
possible))
----------------------------------------------------------------------

Related

RegEx: find a specific string enclosed in double quotes

I have the following string:
<img src="/images/site_graphics/newsite/foo_com_logo.png" alt="foo.com" width="82" height="14"/>
What is the regex to match only the string within double quotes that start from src= ?
\ssrc\s*=\s*"([^"]*)"
The result will be in group 1.
Explain:
\s : Whitespace
* : Zero or more of the preceding token
[^"] : Not double quote
( ) : Group
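A short Python sketch of this pattern applied to the string from the question; the helper name `src_value` is an assumption for illustration only.

```python
import re

# Whitespace, "src", optional whitespace around "=", then a double-quoted value.
SRC_ATTR = re.compile(r'\ssrc\s*=\s*"([^"]*)"')

def src_value(tag):
    """Return the double-quoted src value (group 1), or None when there is none."""
    m = SRC_ATTR.search(tag)
    return m.group(1) if m else None

tag = ('<img src="/images/site_graphics/newsite/foo_com_logo.png" '
       'alt="foo.com" width="82" height="14"/>')
print(src_value(tag))
```

Note that this simple form only handles double-quoted values.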
Foreword
It's not advisable to use a regex to parse HTML due to all the possible obscure edge cases that can crop up, but it seems that you have some control over the HTML, so you should be able to avoid many of the edge cases the regex police cry about.
Description
This regular expression
<img\b(?=\s)(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\ssrc=['"]([^"]*)['"]?)(?:[^>=]|='[^']*'|="[^"]*"|=[^'"\s]*)*"\s?\/?>
Will do the following:
Captures the entire IMG tag
Places the source attribute value into capture group 1, without the quotes if they exist
Allows attributes to have single, double or no quotes
Can be modified to validate any number of other attributes
Avoids edge cases that tend to make parsing HTML difficult
Example
Live Demo
https://regex101.com/r/qW9nG8/16
Sample text
Note the difficult edge case in the first line where we are looking for a specific droid.
<img onmouseover=' if ( 6 > 3 { funSwap(" src="NotTheDroidYourLookingFor.jpg", 6 > 3 ) } ; ' src="http://website/ThisIsTheDroidYourLookingFor.jpeg" onload="img_onload(this);" onerror="img_onerror(this);" data-pid="jihgfedcba" data-imagesize="ppew" />
some text
<img src="http://website/someurl.jpeg" onload="img_onload(this);" />
more text
<img src="https://en.wikipedia.org/wiki/File:BH_LMC.png"/>
Sample Matches
Capture group 0 gets the entire IMG tag
Capture group 1 gets just the src attribute value
[0][0] = <img onmouseover=' funSwap(" src='NotTheDroidYourLookingFor.jpg", data-pid) ; ' src="http://website/ThisIsTheDroidYourLookingFor.jpeg" onload="img_onload(this);" onerror="img_onerror(this);" data-pid="jihgfedcba" data-imagesize="ppew" />
[0][1] = http://website/ThisIsTheDroidYourLookingFor.jpeg
[1][0] = <img src="http://website/someurl.jpeg" onload="img_onload(this);" />
[1][1] = http://website/someurl.jpeg
[2][0] = <img src="https://en.wikipedia.org/wiki/File:BH_LMC.png"/>
[2][1] = https://en.wikipedia.org/wiki/File:BH_LMC.png
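The following Python sketch runs the expression over the two simpler tags from the sample text (the edge-case tag on the first line is left out here to keep the example short). The variable names are illustrative assumptions.

```python
import re

# Expression from the answer above, unchanged. Group 1 is set inside the lookahead.
IMG_TAG = re.compile(
    r"""<img\b(?=\s)(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\ssrc=['"]([^"]*)['"]?)(?:[^>=]|='[^']*'|="[^"]*"|=[^'"\s]*)*"\s?\/?>"""
)

html = (
    '<img src="http://website/someurl.jpeg" onload="img_onload(this);" />\n'
    'more text\n'
    '<img src="https://en.wikipedia.org/wiki/File:BH_LMC.png"/>'
)

# Group 0 is the whole tag, group 1 the src value.
matches = [(m.group(0), m.group(1)) for m in IMG_TAG.finditer(html)]
for whole_tag, src in matches:
    print(src)
```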
Explanation
NODE EXPLANATION
----------------------------------------------------------------------
<img '<img'
----------------------------------------------------------------------
\b the boundary between a word char (\w) and
something that is not a word char
----------------------------------------------------------------------
(?= look ahead to see if there is:
----------------------------------------------------------------------
\s whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
) end of look-ahead
----------------------------------------------------------------------
(?= look ahead to see if there is:
----------------------------------------------------------------------
(?: group, but do not capture (0 or more
times (matching the least amount
possible)):
----------------------------------------------------------------------
[^>=] any character except: '>', '='
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
=' '=\''
----------------------------------------------------------------------
[^']* any character except: ''' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
' '\''
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
=" '="'
----------------------------------------------------------------------
[^"]* any character except: '"' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
= '='
----------------------------------------------------------------------
[^'"] any character except: ''', '"'
----------------------------------------------------------------------
[^\s>]* any character except: whitespace (\n,
\r, \t, \f, and " "), '>' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
)*? end of grouping
----------------------------------------------------------------------
\s whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
src= 'src='
----------------------------------------------------------------------
['"] any character of: ''', '"'
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
[^"]* any character except: '"' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
['"]? any character of: ''', '"' (optional
(matching the most amount possible))
----------------------------------------------------------------------
) end of look-ahead
----------------------------------------------------------------------
(?: group, but do not capture (0 or more times
(matching the most amount possible)):
----------------------------------------------------------------------
[^>=] any character except: '>', '='
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
=' '=\''
----------------------------------------------------------------------
[^']* any character except: ''' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
' '\''
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
=" '="'
----------------------------------------------------------------------
[^"]* any character except: '"' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
= '='
----------------------------------------------------------------------
[^'"\s]* any character except: ''', '"',
whitespace (\n, \r, \t, \f, and " ") (0
or more times (matching the most amount
possible))
----------------------------------------------------------------------
)* end of grouping
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
\s? whitespace (\n, \r, \t, \f, and " ")
(optional (matching the most amount
possible))
----------------------------------------------------------------------
\/? '/' (optional (matching the most amount
possible))
----------------------------------------------------------------------
> '>'
----------------------------------------------------------------------

Capture words on the right side of | (OR) in regex expression that are not in the left

I am trying to capture words on the right side of this regex expression that are not captured on the left.
In the code below, the left side captures "17 inch" in this string: "this 235/45R17 is a 17 inch tyre"
(?<=([-.0-9]+(\s)(inches|inch)))|???????
However, anything I put on the right side, such as a simple \w+, interferes with the left side.
How can I tell the RegEx to capture any word, unless it is a digit followed by inch - in which case capture both 17 and inch?
Description
((?:(?![0-9.-]+\s*inch(?:es)?).)+)|([0-9.-]+\s*inch(?:es)?)
Example
Live Demo
https://regex101.com/r/fY9jU5/2
Sample text
this 235/45R17 is a 17 inch tyre
Sample Matches
Capture Group 1 holds the text that is not part of an inch measurement
Capture Group 2 holds the inch measurement (the number together with inch or inches)
MATCH 1
1. [0-20] `this 235/45R17 is a `
MATCH 2
2. [20-27] `17 inch`
MATCH 3
1. [27-32] ` tyre`
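A minimal Python sketch of how the two capture groups could be told apart when iterating over the matches. The function name `tokenize` and the labels "other" and "inches" are assumptions made for this example.

```python
import re

# Group 1: a run of characters in which no position starts an inch measurement.
# Group 2: the inch measurement itself.
SPLIT_INCHES = re.compile(
    r"((?:(?![0-9.-]+\s*inch(?:es)?).)+)|([0-9.-]+\s*inch(?:es)?)"
)

def tokenize(text):
    """Split text into ("other", ...) and ("inches", ...) pieces, in order."""
    tokens = []
    for m in SPLIT_INCHES.finditer(text):
        if m.group(2) is not None:
            tokens.append(("inches", m.group(2)))
        else:
            tokens.append(("other", m.group(1)))
    return tokens

print(tokenize("this 235/45R17 is a 17 inch tyre"))
```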
Explanation
NODE EXPLANATION
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
(?: group, but do not capture (1 or more
times (matching the most amount
possible)):
----------------------------------------------------------------------
(?! look ahead to see if there is not:
----------------------------------------------------------------------
[0-9.-]+ any character of: '0' to '9', '.',
'-' (1 or more times (matching the
most amount possible))
----------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ")
(0 or more times (matching the most
amount possible))
----------------------------------------------------------------------
inch 'inch'
----------------------------------------------------------------------
(?: group, but do not capture (optional
(matching the most amount
possible)):
----------------------------------------------------------------------
es 'es'
----------------------------------------------------------------------
)? end of grouping
----------------------------------------------------------------------
) end of look-ahead
----------------------------------------------------------------------
. any character except \n
----------------------------------------------------------------------
)+ end of grouping
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
( group and capture to \2:
----------------------------------------------------------------------
[0-9.-]+ any character of: '0' to '9', '.', '-'
(1 or more times (matching the most
amount possible))
----------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0
or more times (matching the most amount
possible))
----------------------------------------------------------------------
inch 'inch'
----------------------------------------------------------------------
(?: group, but do not capture (optional
(matching the most amount possible)):
----------------------------------------------------------------------
es 'es'
----------------------------------------------------------------------
)? end of grouping
----------------------------------------------------------------------
) end of \2
----------------------------------------------------------------------

Use Load runner web_reg_save_param_regexp save the number after tvNode_R- from a block of html code

The attached regular expression works as long as I do not add the highlighted "ORG_40365"; once I add "ORG_40365", it stops matching.
However, I need to pin the occurrence to a specific node, so I need "ORG_40365" at the end. Otherwise the expression returns an unexpected value from another node.
Please click here to see the web_reg_save_param_regexp used
I cannot copy the code here as it does not allow me to save with error.
I have the above string and want to save 9704 after tvNode_R-. However, the regular expression does not work.
Any help would be greatly appreciated!
Foreword
Pattern Matching HTML can be rather difficult so it's generally recommended to use an HTML parsing tool.
Also I manually transcribed your photo of text. I recommend either replacing the photo with real text or inserting real text in addition to the photo.
Description
<tr(?=\s)(?=(?:[^>=]|=(?:'[^']*'|"[^"]*"|[^'"][^\s>]*))*?\sclass=['"]dxtlNode)(?=(?:[^>=]|=(?:'[^']*'|"[^"]*"|[^'"][^\s>]*))*?\sid=['"][^"]*tvNode_R-([0-9]{4}))(?:[^>=]|=(?:'[^']*'|"[^"]*"|[^'"\s]*))*\s?\/?>
This regular expression will do the following:
Finds tr tags
Requires the tr tag to have a class of dxtlNode
Requires the tr tag to have an id with tvNode_R- followed by 4 digits
Captures the 4 digits identified above into Capture Group 1
Captures the entire opening tr tag into Capture Group 0
Avoids some difficult edge cases which make pattern matching in HTML difficult
Example
Live Demo
https://regex101.com/r/gK5dM0/1
Sample text
<td class="dxtlHSEC"></td></tr><tr id="m_m_splitMaster_P_TC_p_n_tvNode_R-9704"
class="dxtlNode" oncontextmenup="return aspxTLMenu('m_m_splitMaster_P_TC_p_n_tvNode','Node',';9704',event)">
<span class="btxt">ORG_40365</span></td><td class="dstlHSEC"></td></tr>
Sample Matches
Capture Group 0. [31-211] `<tr id="m_m_splitMaster_P_TC_p_n_tvNode_R-9704" class="dxtlNode" oncontextmenup="return aspxTLMenu('m_m_splitMaster_P_TC_p_n_tvNode','Node',';9704',event)">`
Capture Group 1. [73-77] `9704`
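As a rough check outside LoadRunner, the expression can be tried in Python against the sample text. This is only a sketch: the variable names are assumptions, and the regex flavour used by web_reg_save_param_regexp may differ in details from Python's re module.

```python
import re

# Expression from the answer above, unchanged.
TR_NODE = re.compile(
    r"""<tr(?=\s)(?=(?:[^>=]|=(?:'[^']*'|"[^"]*"|[^'"][^\s>]*))*?\sclass=['"]dxtlNode)(?=(?:[^>=]|=(?:'[^']*'|"[^"]*"|[^'"][^\s>]*))*?\sid=['"][^"]*tvNode_R-([0-9]{4}))(?:[^>=]|=(?:'[^']*'|"[^"]*"|[^'"\s]*))*\s?\/?>"""
)

html = """<td class="dxtlHSEC"></td></tr><tr id="m_m_splitMaster_P_TC_p_n_tvNode_R-9704"
class="dxtlNode" oncontextmenup="return aspxTLMenu('m_m_splitMaster_P_TC_p_n_tvNode','Node',';9704',event)">
<span class="btxt">ORG_40365</span></td><td class="dstlHSEC"></td></tr>"""

m = TR_NODE.search(html)
node_id = m.group(1) if m else None  # the 4 digits after tvNode_R-
print(node_id)
```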
Explanation
NODE EXPLANATION
----------------------------------------------------------------------
<tr '<tr'
----------------------------------------------------------------------
(?= look ahead to see if there is:
----------------------------------------------------------------------
\s whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
) end of look-ahead
----------------------------------------------------------------------
(?= look ahead to see if there is:
----------------------------------------------------------------------
(?: group, but do not capture (0 or more
times (matching the least amount
possible)):
----------------------------------------------------------------------
[^>=] any character except: '>', '='
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
= '='
----------------------------------------------------------------------
(?: group, but do not capture:
----------------------------------------------------------------------
' '\''
----------------------------------------------------------------------
[^']* any character except: ''' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
' '\''
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
[^"]* any character except: '"' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
[^'"] any character except: ''', '"'
----------------------------------------------------------------------
[^\s>]* any character except: whitespace
(\n, \r, \t, \f, and " "), '>' (0 or
more times (matching the most amount
possible))
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
)*? end of grouping
----------------------------------------------------------------------
\s whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
class= 'class='
----------------------------------------------------------------------
['"] any character of: ''', '"'
----------------------------------------------------------------------
dxtlNode 'dxtlNode'
----------------------------------------------------------------------
) end of look-ahead
----------------------------------------------------------------------
(?= look ahead to see if there is:
----------------------------------------------------------------------
(?: group, but do not capture (0 or more
times (matching the least amount
possible)):
----------------------------------------------------------------------
[^>=] any character except: '>', '='
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
= '='
----------------------------------------------------------------------
(?: group, but do not capture:
----------------------------------------------------------------------
' '\''
----------------------------------------------------------------------
[^']* any character except: ''' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
' '\''
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
[^"]* any character except: '"' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
[^'"] any character except: ''', '"'
----------------------------------------------------------------------
[^\s>]* any character except: whitespace
(\n, \r, \t, \f, and " "), '>' (0 or
more times (matching the most amount
possible))
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
)*? end of grouping
----------------------------------------------------------------------
\s whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
id= 'id='
----------------------------------------------------------------------
['"] any character of: ''', '"'
----------------------------------------------------------------------
[^"]* any character except: '"' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
tvNode_R- 'tvNode_R-'
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
[0-9]{4} any character of: '0' to '9' (4 times)
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
) end of look-ahead
----------------------------------------------------------------------
(?: group, but do not capture (0 or more times
(matching the most amount possible)):
----------------------------------------------------------------------
[^>=] any character except: '>', '='
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
= '='
----------------------------------------------------------------------
(?: group, but do not capture:
----------------------------------------------------------------------
' '\''
----------------------------------------------------------------------
[^']* any character except: ''' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
' '\''
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
[^"]* any character except: '"' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
[^'"\s]* any character except: ''', '"',
whitespace (\n, \r, \t, \f, and " ")
(0 or more times (matching the most
amount possible))
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
)* end of grouping
----------------------------------------------------------------------
\s? whitespace (\n, \r, \t, \f, and " ")
(optional (matching the most amount
possible))
----------------------------------------------------------------------
\/? '/' (optional (matching the most amount
possible))
----------------------------------------------------------------------
> '>'
----------------------------------------------------------------------

Regular Expression: Currency Amount over 5,000,000

I have the following regular expression:
^((R|AED|AFN|ALL|AMD|ANG|AOA|ARS|AUD|AWG|AZN|BAM|BBD|BDT|BGN|BHD|BIF|BMD|BND|BOB|BRL|BSD|BTN|BWP|BYR|BZD|CAD|CDF|CHF|CLP|CNY|COP|CRC|CUC|CUP|CVE|CZK|DJF|DKK|DOP|DZD|EGP|ERN|ETB|EUR|FJD|FKP|GBP|GEL|GHS|GIP|GMD|GNF|GTQ|GYD|HKD|HNL|HRK|HTG|HUF|IDR|ILS|INR|IQD|IRR|ISK|JMD|JOD|JPY|KES|KGS|KHR|KMF|KPW|KRW|KWD|KYD|KZT|LAK|LBP|LKR|LRD|LSL|LYD|MAD|MDL|MGA|MKD|MMK|MNT|MOP|MRO|MUR|MVR|MWK|MXN|MYR|MZN|NAD|NGN|NIO|NOK|NPR|NZD|OMR|PAB|PEN|PGK|PHP|PKR|PLN|PYG|QAR|RON|RSD|RUB|RWF|SAR|SBD|SCR|SDG|SEK|SGD|SHP|SLL|SOS|SRD|SSP|STD|SYP|SZL|THB|TJS|TMT|TND|TOP|TRY|TTD|TWD|TZS|UAH|UGX|USD|UYU|UZS|VEF|VND|VUV|WST|XAF|XCD|XOF|XPF|YER|ZAR|ZMW) ?([1-9][0-9]{0,2}((,| )?[0-9]{3}){2,}|0)\.[0-9][0-9])?$
It will match on any Currency value 1,000,000 or greater. I need it to match on 5,000,000 or greater. That seems like it should be a simple change but I'm struggling with it.
Thanks for the help.
Description
^(R|AED|AFN|ALL|AMD|ANG|AOA|ARS|AUD|AWG|AZN|BAM|BBD|BDT|BGN|BHD|BIF|BMD|BND|BOB|BRL|BSD|BTN|BWP|BYR|BZD|CAD|CDF|CHF|CLP|CNY|COP|CRC|CUC|CUP|CVE|CZK|DJF|DKK|DOP|DZD|EGP|ERN|ETB|EUR|FJD|FKP|GBP|GEL|GHS|GIP|GMD|GNF|GTQ|GYD|HKD|HNL|HRK|HTG|HUF|IDR|ILS|INR|IQD|IRR|ISK|JMD|JOD|JPY|KES|KGS|KHR|KMF|KPW|KRW|KWD|KYD|KZT|LAK|LBP|LKR|LRD|LSL|LYD|MAD|MDL|MGA|MKD|MMK|MNT|MOP|MRO|MUR|MVR|MWK|MXN|MYR|MZN|NAD|NGN|NIO|NOK|NPR|NZD|OMR|PAB|PEN|PGK|PHP|PKR|PLN|PYG|QAR|RON|RSD|RUB|RWF|SAR|SBD|SCR|SDG|SEK|SGD|SHP|SLL|SOS|SRD|SSP|STD|SYP|SZL|THB|TJS|TMT|TND|TOP|TRY|TTD|TWD|TZS|UAH|UGX|USD|UYU|UZS|VEF|VND|VUV|WST|XAF|XCD|XOF|XPF|YER|ZAR|ZMW)\s?([5-9][,\s][0-9]{3}[,\s]|[0-9]{2,3}[,\s][0-9]{3}[,\s]|[0-9]{1,3}[,\s](?:[0-9]{3}[,\s]){2,})([0-9]{3})\.([0-9]{2})$
This regular expression will do the following:
match on 5,000,000 or greater
Example
Live Demo
https://regex101.com/r/cS7aA5/1
Sample text
usd 5,000,000.00
usd 15,000,000.00
usd 15,000,000,000.00
usd 15,000,000,000,000.00
usd 5,000.00
usd 4,999,999.99
usd
Sample Matches
usd 5,000,000.00
usd 15,000,000.00
usd 15,000,000,000.00
usd 15,000,000,000,000.00
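A small Python sketch of the amount part of this expression, checked line by line against the sample text. To keep it readable, the currency list is cut down to a handful of codes; that shortened list and the function name `is_five_million_or_more` are assumptions for illustration, not part of the original expression.

```python
import re

# Assumption: shortened currency list; the amount part is copied from the answer.
FIVE_MILLION_PLUS = re.compile(
    r"^(R|USD|EUR|GBP|ZAR)\s?"
    r"([5-9][,\s][0-9]{3}[,\s]"              # 5,000,000 to 9,999,999
    r"|[0-9]{2,3}[,\s][0-9]{3}[,\s]"         # two or three leading digits
    r"|[0-9]{1,3}[,\s](?:[0-9]{3}[,\s]){2,})"  # a billion and above
    r"([0-9]{3})\.([0-9]{2})$",
    re.IGNORECASE,  # the sample text uses lowercase "usd"
)

def is_five_million_or_more(line):
    """True when the line is a currency amount accepted by the pattern."""
    return FIVE_MILLION_PLUS.search(line) is not None

lines = [
    "usd 5,000,000.00",
    "usd 15,000,000.00",
    "usd 15,000,000,000.00",
    "usd 15,000,000,000,000.00",
    "usd 5,000.00",
    "usd 4,999,999.99",
    "usd",
]
accepted = [line for line in lines if is_five_million_or_more(line)]
print(accepted)
```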
Explanation
NODE EXPLANATION
^ the beginning of a "line"
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
R 'R'
----------------------------------------------------------------------
| OR
All the other money types
| OR
----------------------------------------------------------------------
ZMW 'ZMW'
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
\s? whitespace (\n, \r, \t, \f, and " ")
(optional (matching the most amount
possible))
----------------------------------------------------------------------
( group and capture to \2:
----------------------------------------------------------------------
[5-9] any character of: '5' to '9'
----------------------------------------------------------------------
[,\s] any character of: ',', whitespace (\n,
\r, \t, \f, and " ")
----------------------------------------------------------------------
[0-9]{3} any character of: '0' to '9' (3 times)
----------------------------------------------------------------------
[,\s] any character of: ',', whitespace (\n,
\r, \t, \f, and " ")
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
[0-9]{2,3} any character of: '0' to '9' (between 2
and 3 times (matching the most amount
possible))
----------------------------------------------------------------------
[,\s] any character of: ',', whitespace (\n,
\r, \t, \f, and " ")
----------------------------------------------------------------------
[0-9]{3} any character of: '0' to '9' (3 times)
----------------------------------------------------------------------
[,\s] any character of: ',', whitespace (\n,
\r, \t, \f, and " ")
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
[0-9]{1,3} any character of: '0' to '9' (between 1
and 3 times (matching the most amount
possible))
----------------------------------------------------------------------
[,\s] any character of: ',', whitespace (\n,
\r, \t, \f, and " ")
----------------------------------------------------------------------
(?: group, but do not capture (at least 2
times (matching the most amount
possible)):
----------------------------------------------------------------------
[0-9]{3} any character of: '0' to '9' (3 times)
----------------------------------------------------------------------
[,\s] any character of: ',', whitespace (\n,
\r, \t, \f, and " ")
----------------------------------------------------------------------
){2,} end of grouping
----------------------------------------------------------------------
) end of \2
----------------------------------------------------------------------
( group and capture to \3:
----------------------------------------------------------------------
[0-9]{3} any character of: '0' to '9' (3 times)
----------------------------------------------------------------------
) end of \3
----------------------------------------------------------------------
\. '.'
----------------------------------------------------------------------
( group and capture to \4:
----------------------------------------------------------------------
[0-9]{2} any character of: '0' to '9' (2 times)
----------------------------------------------------------------------
) end of \4
----------------------------------------------------------------------
$ before an optional \n, and the end of a
"line"
----------------------------------------------------------------------
Extra Credit
You can cut the number of matching steps by roughly two thirds by replacing the flat list of currency alternatives with a prefix-grouped alternation, in which codes that share a first letter are grouped under that letter. On the sample text the expression above takes 1060 steps to complete, whereas the following expression takes 324 steps.
^(A(?:ED|FN|LL|MD|NG|OA|RS|UD|WG|ZN)|B(?:AM|BD|DT|GN|HD|IF|MD|ND|OB|RL|SD|TN|WP|YR|ZD)|C(?:AD|DF|HF|LP|NY|OP|RC|UC|UP|VE|ZK)|D(?:JF|KK|OP|ZD)|E(?:GP|RN|TB|UR)|F(?:JD|KP)|G(?:BP|EL|HS|IP|MD|NF|TQ|YD)|H(?:KD|NL|RK|TG|UF)|I(?:DR|LS|NR|QD|RR|SK)|J(?:MD|OD|PY)|K(?:ES|GS|HR|MF|PW|RW|WD|YD|ZT)|L(?:AK|BP|KR|RD|SL|YD)|M(?:AD|DL|GA|KD|MK|NT|OP|RO|UR|VR|WK|XN|YR|ZN)|N(?:AD|GN|IO|OK|PR|ZD)|OMR|P(?:AB|EN|GK|HP|KR|LN|YG)|QAR|R(?:ON|SD|UB|WF)?|S(?:AR|BD|CR|DG|EK|GD|HP|LL|OS|RD|SP|TD|YP|ZL)|T(?:HB|JS|MT|ND|OP|RY|TD|WD|ZS)|U(?:AH|GX|SD|YU|ZS)|V(?:EF|ND|UV)|WST|X(?:AF|CD|OF|PF)|YER|Z(?:AR|MW))\s?([5-9][,\s][0-9]{3}[,\s]|[0-9]{2,3}[,\s][0-9]{3}[,\s]|[0-9]{1,3}[,\s](?:[0-9]{3}[,\s]){2,})([0-9]{3})\.([0-9]{2})$
Use this regex.
^((R|AED|AFN|ALL|AMD|ANG|AOA|ARS|AUD|AWG|AZN|BAM|BBD|BDT|BGN|BHD|BIF|BMD|BND|BOB|BRL|BSD|BTN|BWP|BYR|BZD|CAD|CDF|CHF|CLP|CNY|COP|CRC|CUC|CUP|CVE|CZK|DJF|DKK|DOP|DZD|EGP|ERN|ETB|EUR|FJD|FKP|GBP|GEL|GHS|GIP|GMD|GNF|GTQ|GYD|HKD|HNL|HRK|HTG|HUF|IDR|ILS|INR|IQD|IRR|ISK|JMD|JOD|JPY|KES|KGS|KHR|KMF|KPW|KRW|KWD|KYD|KZT|LAK|LBP|LKR|LRD|LSL|LYD|MAD|MDL|MGA|MKD|MMK|MNT|MOP|MRO|MUR|MVR|MWK|MXN|MYR|MZN|NAD|NGN|NIO|NOK|NPR|NZD|OMR|PAB|PEN|PGK|PHP|PKR|PLN|PYG|QAR|RON|RSD|RUB|RWF|SAR|SBD|SCR|SDG|SEK|SGD|SHP|SLL|SOS|SRD|SSP|STD|SYP|SZL|THB|TJS|TMT|TND|TOP|TRY|TTD|TWD|TZS|UAH|UGX|USD|UYU|UZS|VEF|VND|VUV|WST|XAF|XCD|XOF|XPF|YER|ZAR|ZMW) ?(([1-9][0-9][0-9]|[1-9][0-9]|[5-9])((,| )?[0-9]{3}){2,}|0)\.[0-9][0-9])?$
Use alternation with a quantifier
The segment of your regular expression that needs to be addressed is the pair of character classes that matches a range from 1 to 999:
[1-9][0-9]{0,2}
Use the alternation token | and raise the first argument of the quantifier {} from 0 to 1, so that a single digit must be 5 to 9 while two- and three-digit numbers still match, to achieve a range from 5 to 999:
([5-9]|[1-9][0-9]{1,2})
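A quick way to check that range is to test the sub-expression against every integer in a window around it. The sketch below uses Python's re module; the constant name RANGE_5_TO_999 is only illustrative.

```python
import re

RANGE_5_TO_999 = re.compile(r"([5-9]|[1-9][0-9]{1,2})")

# Collect every integer from 0 to 1199 whose decimal form matches in full.
matched = [n for n in range(0, 1200) if RANGE_5_TO_999.fullmatch(str(n))]

print(matched[0], matched[-1], len(matched))  # 5 999 995
```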
Optional enhancements:
For brevity, you may also like to replace the leading list of currency identifiers with a character class instead (note that this accepts any three uppercase letters, not only valid currency codes):
(R|[A-Z]{3})
Likewise, you can minimize the segment that matches the trailing two-digit decimal with a quantifier:
\.[0-9]{2}
Putting it all together:
((R|[A-Z]{3}) ?([5-9]|[1-9][0-9]{1,2})((,| )?[0-9]{3}){2,})\.[0-9]{2}
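The assembled expression can be tried against a few amounts, as in the following Python sketch. The sample strings are made up for illustration, and fullmatch is used here as an assumption to stand in for the ^ and $ anchors that the assembled expression leaves out.

```python
import re

AMOUNT = re.compile(
    r"((R|[A-Z]{3}) ?([5-9]|[1-9][0-9]{1,2})((,| )?[0-9]{3}){2,})\.[0-9]{2}"
)

# Hypothetical sample amounts, not taken from the original question.
samples = [
    "USD 5,000,000.00",
    "R 5 000 000.00",
    "EUR 12,345,678.90",
    "USD 4,999,999.99",
    "USD 1,000,000.00",
    "GBP 999,999.99",
]

results = {s: AMOUNT.fullmatch(s) is not None for s in samples}
for sample, ok in results.items():
    print(f"{sample!r}: {'match' if ok else 'no match'}")
```

The first three samples match and the last three do not: the leading group must be 5 to 999 and must be followed by at least two three-digit groups, so anything below 5,000,000 is rejected.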
Source: Regexper.com
Keep your regex as it is.
preg_match($pattern, $string, $out);
// Strip the thousands separators so the amount can be compared numerically
$amount = str_replace(",", "", $out[2]);
if ($amount >= 5000000) {
    // code here
}
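The same "match first, compare afterwards" idea can be sketched in Python. The pattern below is a simplified, hypothetical stand-in for the original expression (it only assumes that group 2 holds the whole-number part of the amount, as $out[2] does above), and the function name is_at_least is illustrative. Unlike the snippet above, it also handles the case where nothing matches.

```python
import re

# Hypothetical, simplified stand-in for the original pattern:
# group 1 = currency, group 2 = whole part with commas, group 3 = cents.
PATTERN = re.compile(r"^(R|[A-Z]{3}) ?([0-9]{1,3}(?:,[0-9]{3})*)\.([0-9]{2})$")


def is_at_least(text, threshold=5_000_000):
    """Return True if text is a well-formed amount of at least threshold."""
    match = PATTERN.match(text)
    if match is None:
        return False  # not a well-formed amount at all
    amount = int(match.group(2).replace(",", ""))
    return amount >= threshold


print(is_at_least("USD 5,000,000.00"))  # True
print(is_at_least("USD 4,999,999.99"))  # False
```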

Using ReplaceTextWithMapping with multiple columns in mapping file

I need some clarification on the usage of ReplaceTextWithMapping in NiFi in my specific case. My input file looks like this:
{"field1" : "A",
"field2" : "A",
"field3": "A"
}
The mapping file looks, instead, like this:
Header1;Header2;Header3
A;some text;2
My expected result would be as follows:
{"field1" : "some text",
"field2": "A",
"field3": "A2"
}
The Regular Expression set is simply as follows:
[A-Z0-9]+
and it matches the field key in the mapping file (we are expecting either a capital letter or a capital letter plus a digit), but I am not sure how you decide which value (from column 2 or from column 3) the input value is mapped to. Also, my field2 should not change: it needs to retain the value it gets from the input, with no mapping involved. At the moment, I am getting something like this:
{"field1" : "some text A2",
"field2": "some text A2",
"field3": "some text A2"
}
I guess my main question is: can you map the same value in your input file to different values coming from different columns of your mapping file?
Thank you
EDIT: I am using ReplaceTextWithMapping, an out-of-the-box processor in Apache NiFi (v. 0.5.1). Throughout my dataflow, I end up with a JSON file to which I need to apply some mappings coming from external files that I would like to load into memory (rather than parse using ExtractText, for example).
Foreword
It appears that you're working with a JSON string. Such a string is easier to handle with a JSON parsing engine, because the JSON structure allows edge cases that are difficult to parse with regular expressions. With that said, I'm sure you have your reasons, and I'm not the Regex Police.
Description
To do such a replacement it would be easier to capture the substrings you'll keep and the substrings you want to replace.
(\{"[a-z0-9]+"\s*:\s*")([a-z0-9]+)("[,\r\n]+"[a-z0-9]+"\s*:\s*")([a-z0-9]+)("[,\r\n]+"[a-z0-9]+"\s*:\s*")([a-z0-9]+)("[,\r\n]+\})
Replace with: $1SomeText$3$4$5A2$7
Note: I recommend using the following flags with this expression: Case Insensitive, and Dot matches all characters including new lines.
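For reference, the same replacement can be reproduced outside NiFi. The sketch below uses Python's re module, where back references in the replacement string are written \g<1> rather than $1; the variable names are only illustrative.

```python
import re

PATTERN = re.compile(
    r'(\{"[a-z0-9]+"\s*:\s*")([a-z0-9]+)'
    r'("[,\r\n]+"[a-z0-9]+"\s*:\s*")([a-z0-9]+)'
    r'("[,\r\n]+"[a-z0-9]+"\s*:\s*")([a-z0-9]+)'
    r'("[,\r\n]+\})',
    re.IGNORECASE | re.DOTALL,
)

source = '{"field1" : "A",\n"field2" : "A",\n"field3": "A"\n}'

# Groups 2 and 6 (the first and third values) are replaced;
# group 4 (the second value) is copied through unchanged.
result = PATTERN.sub(r"\g<1>SomeText\g<3>\g<4>\g<5>A2\g<7>", source)
print(result)
```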
Examples
Live Demo
This example shows how the regular expression matches against your source text:
https://regex101.com/r/vM1qE2/1
Source Text
{"field1" : "A",
"field2" : "A",
"field3": "A"
}
After Replacement
{"field1" : "SomeText",
"field2" : "A",
"field3": "A2"
}
Explanation
NODE EXPLANATION
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
\{ '{'
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
[a-z0-9]+ any character of: 'a' to 'z', '0' to '9'
(1 or more times (matching the most
amount possible))
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0
or more times (matching the most amount
possible))
----------------------------------------------------------------------
: ':'
----------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0
or more times (matching the most amount
possible))
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
( group and capture to \2:
----------------------------------------------------------------------
[a-z0-9]+ any character of: 'a' to 'z', '0' to '9'
(1 or more times (matching the most
amount possible))
----------------------------------------------------------------------
) end of \2
----------------------------------------------------------------------
( group and capture to \3:
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
[,\r\n]+ any character of: ',', '\r' (carriage
return), '\n' (newline) (1 or more times
(matching the most amount possible))
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
[a-z0-9]+ any character of: 'a' to 'z', '0' to '9'
(1 or more times (matching the most
amount possible))
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0
or more times (matching the most amount
possible))
----------------------------------------------------------------------
: ':'
----------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0
or more times (matching the most amount
possible))
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
) end of \3
----------------------------------------------------------------------
( group and capture to \4:
----------------------------------------------------------------------
[a-z0-9]+ any character of: 'a' to 'z', '0' to '9'
(1 or more times (matching the most
amount possible))
----------------------------------------------------------------------
) end of \4
----------------------------------------------------------------------
( group and capture to \5:
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
[,\r\n]+ any character of: ',', '\r' (carriage
return), '\n' (newline) (1 or more times
(matching the most amount possible))
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
[a-z0-9]+ any character of: 'a' to 'z', '0' to '9'
(1 or more times (matching the most
amount possible))
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0
or more times (matching the most amount
possible))
----------------------------------------------------------------------
: ':'
----------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0
or more times (matching the most amount
possible))
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
) end of \5
----------------------------------------------------------------------
( group and capture to \6:
----------------------------------------------------------------------
[a-z0-9]+ any character of: 'a' to 'z', '0' to '9'
(1 or more times (matching the most
amount possible))
----------------------------------------------------------------------
) end of \6
----------------------------------------------------------------------
( group and capture to \7:
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
[,\r\n]+ any character of: ',', '\r' (carriage
return), '\n' (newline) (1 or more times
(matching the most amount possible))
----------------------------------------------------------------------
\} '}'
----------------------------------------------------------------------
) end of \7
So I dove into ReplaceTextWithMapping to try and get it to solve your use-case, but I just don't think it is powerful enough to do what you want. Currently it is designed almost solely for one purpose: match a simple regex and map one group of non-whitespace characters to another group of characters (which can contain whitespace and back references).
When looking at your use-case as pure text, it is to change the value of one capture group based on the value of another capture group and a mapping file. Looking at it in terms of JSON, your use-case is much simpler: you want to change the value of a key/value pair based on what the key is and a mapping file. As a side note, if you didn't need the mapping file, I believe there is a new JSON-to-JSON processor coming in 0.7.0[1] that would work.
As for looking for a solution, both ways of looking at your problem are valid. ReplaceTextWithMapping certainly could use expanded functionality to allow for advanced use-cases, but that may make it too complicated (though it could be more confusing now due to the unclear scope of its functionality). A new processor, along the lines of "ReplaceJsonWithMapping", could certainly be added as well, but it would need to clearly define its scope and purpose.
Also, for a more immediate solution there is always the option to use the ExecuteScript processor. Here[2] is a link to a blog post (written by the creator of ExecuteScript) which outlines how to write a basic JSON-to-JSON processor. More logic would need to be added to read the mapping from a file.
[1] https://issues.apache.org/jira/browse/NIFI-361
[2] http://funnifi.blogspot.com/2016/02/executescript-json-to-json-conversion.html
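As a rough idea of the extra logic such a script would need, here is a plain Python sketch of the key-based mapping. It deliberately does not use any NiFi or ExecuteScript API, so wiring it into a processor is left out. The function names, and the rule that field1 takes the value from column 2 while field3 gets the value from column 3 appended, are assumptions inferred from the expected output in the question.

```python
import csv
import io
import json


def load_mapping(text):
    """Parse a ';'-separated mapping file into {Header1 value: row}."""
    reader = csv.DictReader(io.StringIO(text), delimiter=";")
    return {row["Header1"]: row for row in reader}


def apply_mapping(doc, mapping):
    """Return a copy of doc with field1 and field3 mapped; field2 is untouched."""
    out = dict(doc)
    row = mapping.get(doc.get("field1"))
    if row is not None:
        out["field1"] = row["Header2"]                   # assumed: replace with column 2
    row = mapping.get(doc.get("field3"))
    if row is not None:
        out["field3"] = doc["field3"] + row["Header3"]   # assumed: append column 3
    return out


mapping = load_mapping("Header1;Header2;Header3\nA;some text;2\n")
doc = json.loads('{"field1": "A", "field2": "A", "field3": "A"}')
mapped = apply_mapping(doc, mapping)
print(json.dumps(mapped))  # {"field1": "some text", "field2": "A", "field3": "A2"}
```

Because the decision is made per key rather than per matched value, the same input value "A" can be mapped to different columns, which is the part that a single value-based regex mapping cannot express.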