Regular Expression: Currency Amount over 5,000,000

I have the following regular expression:
^((R|AED|AFN|ALL|AMD|ANG|AOA|ARS|AUD|AWG|AZN|BAM|BBD|BDT|BGN|BHD|BIF|BMD|BND|BOB|BRL|BSD|BTN|BWP|BYR|BZD|CAD|CDF|CHF|CLP|CNY|COP|CRC|CUC|CUP|CVE|CZK|DJF|DKK|DOP|DZD|EGP|ERN|ETB|EUR|FJD|FKP|GBP|GEL|GHS|GIP|GMD|GNF|GTQ|GYD|HKD|HNL|HRK|HTG|HUF|IDR|ILS|INR|IQD|IRR|ISK|JMD|JOD|JPY|KES|KGS|KHR|KMF|KPW|KRW|KWD|KYD|KZT|LAK|LBP|LKR|LRD|LSL|LYD|MAD|MDL|MGA|MKD|MMK|MNT|MOP|MRO|MUR|MVR|MWK|MXN|MYR|MZN|NAD|NGN|NIO|NOK|NPR|NZD|OMR|PAB|PEN|PGK|PHP|PKR|PLN|PYG|QAR|RON|RSD|RUB|RWF|SAR|SBD|SCR|SDG|SEK|SGD|SHP|SLL|SOS|SRD|SSP|STD|SYP|SZL|THB|TJS|TMT|TND|TOP|TRY|TTD|TWD|TZS|UAH|UGX|USD|UYU|UZS|VEF|VND|VUV|WST|XAF|XCD|XOF|XPF|YER|ZAR|ZMW) ?([1-9][0-9]{0,2}((,| )?[0-9]{3}){2,}|0)\.[0-9][0-9])?$
It will match on any Currency value 1,000,000 or greater. I need it to match on 5,000,000 or greater. That seems like it should be a simple change but I'm struggling with it.
Thanks for the help.

Description
^(R|AED|AFN|ALL|AMD|ANG|AOA|ARS|AUD|AWG|AZN|BAM|BBD|BDT|BGN|BHD|BIF|BMD|BND|BOB|BRL|BSD|BTN|BWP|BYR|BZD|CAD|CDF|CHF|CLP|CNY|COP|CRC|CUC|CUP|CVE|CZK|DJF|DKK|DOP|DZD|EGP|ERN|ETB|EUR|FJD|FKP|GBP|GEL|GHS|GIP|GMD|GNF|GTQ|GYD|HKD|HNL|HRK|HTG|HUF|IDR|ILS|INR|IQD|IRR|ISK|JMD|JOD|JPY|KES|KGS|KHR|KMF|KPW|KRW|KWD|KYD|KZT|LAK|LBP|LKR|LRD|LSL|LYD|MAD|MDL|MGA|MKD|MMK|MNT|MOP|MRO|MUR|MVR|MWK|MXN|MYR|MZN|NAD|NGN|NIO|NOK|NPR|NZD|OMR|PAB|PEN|PGK|PHP|PKR|PLN|PYG|QAR|RON|RSD|RUB|RWF|SAR|SBD|SCR|SDG|SEK|SGD|SHP|SLL|SOS|SRD|SSP|STD|SYP|SZL|THB|TJS|TMT|TND|TOP|TRY|TTD|TWD|TZS|UAH|UGX|USD|UYU|UZS|VEF|VND|VUV|WST|XAF|XCD|XOF|XPF|YER|ZAR|ZMW)\s?([5-9][,\s][0-9]{3}[,\s]|[0-9]{2,3}[,\s][0-9]{3}[,\s]|[0-9]{1,3}[,\s](?:[0-9]{3}[,\s]){2,})([0-9]{3})\.([0-9]{2})$
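This regular expression matches currency amounts of 5,000,000 or greater.
The currency codes in the expression are uppercase, so the lowercase sample text below only matches with the case-insensitive flag enabled.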
Example
Live Demo
https://regex101.com/r/cS7aA5/1
Sample text
usd 5,000,000.00
usd 15,000,000.00
usd 15,000,000,000.00
usd 15,000,000,000,000.00
usd 5,000.00
usd 4,999,999.99
usd
Sample Matches
usd 5,000,000.00
usd 15,000,000.00
usd 15,000,000,000.00
usd 15,000,000,000,000.00
Explanation
NODE EXPLANATION
^ the beginning of a "line"
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
R 'R'
----------------------------------------------------------------------
| OR
All the other money types
| OR
----------------------------------------------------------------------
ZMW 'ZMW'
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
\s? whitespace (\n, \r, \t, \f, and " ")
(optional (matching the most amount
possible))
----------------------------------------------------------------------
( group and capture to \2:
----------------------------------------------------------------------
[5-9] any character of: '5' to '9'
----------------------------------------------------------------------
[,\s] any character of: ',', whitespace (\n,
\r, \t, \f, and " ")
----------------------------------------------------------------------
[0-9]{3} any character of: '0' to '9' (3 times)
----------------------------------------------------------------------
[,\s] any character of: ',', whitespace (\n,
\r, \t, \f, and " ")
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
[0-9]{2,3} any character of: '0' to '9' (between 2
and 3 times (matching the most amount
possible))
----------------------------------------------------------------------
[,\s] any character of: ',', whitespace (\n,
\r, \t, \f, and " ")
----------------------------------------------------------------------
[0-9]{3} any character of: '0' to '9' (3 times)
----------------------------------------------------------------------
[,\s] any character of: ',', whitespace (\n,
\r, \t, \f, and " ")
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
[0-9]{1,3} any character of: '0' to '9' (between 1
and 3 times (matching the most amount
possible))
----------------------------------------------------------------------
[,\s] any character of: ',', whitespace (\n,
\r, \t, \f, and " ")
----------------------------------------------------------------------
(?: group, but do not capture (at least 2
times (matching the most amount
possible)):
----------------------------------------------------------------------
[0-9]{3} any character of: '0' to '9' (3 times)
----------------------------------------------------------------------
[,\s] any character of: ',', whitespace (\n,
\r, \t, \f, and " ")
----------------------------------------------------------------------
){2,} end of grouping
----------------------------------------------------------------------
) end of \2
----------------------------------------------------------------------
( group and capture to \3:
----------------------------------------------------------------------
[0-9]{3} any character of: '0' to '9' (3 times)
----------------------------------------------------------------------
) end of \3
----------------------------------------------------------------------
\. '.'
----------------------------------------------------------------------
( group and capture to \4:
----------------------------------------------------------------------
[0-9]{2} any character of: '0' to '9' (2 times)
----------------------------------------------------------------------
) end of \4
----------------------------------------------------------------------
$ before an optional \n, and the end of a
"line"
----------------------------------------------------------------------
Extra Credit
You can further cut your expression's execution cost by roughly 69% by replacing the flat list of currency alternations with a simple prefix tree that groups the codes by their first letter. The above expression takes 1060 steps to complete on that sample text, whereas the following expression takes 324 steps to complete.
^(A(?:ED|FN|LL|MD|NG|OA|RS|UD|WG|ZN)|B(?:AM|BD|DT|GN|HD|IF|MD|ND|OB|RL|SD|TN|WP|YR|ZD)|C(?:AD|DF|HF|LP|NY|OP|RC|UC|UP|VE|ZK)|D(?:JF|KK|OP|ZD)|E(?:GP|RN|TB|UR)|F(?:JD|KP)|G(?:BP|EL|HS|IP|MD|NF|TQ|YD)|H(?:KD|NL|RK|TG|UF)|I(?:DR|LS|NR|QD|RR|SK)|J(?:MD|OD|PY)|K(?:ES|GS|HR|MF|PW|RW|WD|YD|ZT)|L(?:AK|BP|KR|RD|SL|YD)|M(?:AD|DL|GA|KD|MK|NT|OP|RO|UR|VR|WK|XN|YR|ZN)|N(?:AD|GN|IO|OK|PR|ZD)|OMR|P(?:AB|EN|GK|HP|KR|LN|YG)|QAR|R(?:ON|SD|UB|WF)?|S(?:AR|BD|CR|DG|EK|GD|HP|LL|OS|RD|SP|TD|YP|ZL)|T(?:HB|JS|MT|ND|OP|RY|TD|WD|ZS)|U(?:AH|GX|SD|YU|ZS)|V(?:EF|ND|UV)|WST|X(?:AF|CD|OF|PF)|YER|Z(?:AR|MW))\s?([5-9][,\s][0-9]{3}[,\s]|[0-9]{2,3}[,\s][0-9]{3}[,\s]|[0-9]{1,3}[,\s](?:[0-9]{3}[,\s]){2,})([0-9]{3})\.([0-9]{2})$
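The first-letter grouping used in the expression above does not have to be typed by hand. The following is a minimal Python sketch of one way to generate it, not the answerer's method: the helper name `group_by_first_letter` and the six-code list are assumptions made for illustration, and the sketch only branches on the first letter, as the expression above does.

```python
import re
from collections import defaultdict

def group_by_first_letter(codes):
    """Build one alternation that branches on the first letter of each code."""
    tails = defaultdict(list)
    for code in sorted(set(codes)):
        tails[code[0]].append(code[1:])

    parts = []
    for first, rest in sorted(tails.items()):
        bare = "" in rest              # a one-letter code such as "R"
        rest = [t for t in rest if t]
        if not rest:
            parts.append(first)
        elif len(rest) == 1 and not bare:
            parts.append(first + rest[0])
        else:
            # "?" makes the tail optional when the bare letter is itself a code
            parts.append(first + "(?:" + "|".join(rest) + ")" + ("?" if bare else ""))
    return "(?:" + "|".join(parts) + ")"

# Hypothetical short list; the real expression uses the full set of codes.
codes = ["R", "RON", "RSD", "USD", "UYU", "OMR"]
pattern = group_by_first_letter(codes)
print(pattern)  # (?:OMR|R(?:ON|SD)?|U(?:SD|YU))
```

The generated group can be dropped in wherever the flat alternation appeared. The step counts quoted above come from the answer's own measurement and are not reproduced by this sketch.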

Use this regex.
^((R|AED|AFN|ALL|AMD|ANG|AOA|ARS|AUD|AWG|AZN|BAM|BBD|BDT|BGN|BHD|BIF|BMD|BND|BOB|BRL|BSD|BTN|BWP|BYR|BZD|CAD|CDF|CHF|CLP|CNY|COP|CRC|CUC|CUP|CVE|CZK|DJF|DKK|DOP|DZD|EGP|ERN|ETB|EUR|FJD|FKP|GBP|GEL|GHS|GIP|GMD|GNF|GTQ|GYD|HKD|HNL|HRK|HTG|HUF|IDR|ILS|INR|IQD|IRR|ISK|JMD|JOD|JPY|KES|KGS|KHR|KMF|KPW|KRW|KWD|KYD|KZT|LAK|LBP|LKR|LRD|LSL|LYD|MAD|MDL|MGA|MKD|MMK|MNT|MOP|MRO|MUR|MVR|MWK|MXN|MYR|MZN|NAD|NGN|NIO|NOK|NPR|NZD|OMR|PAB|PEN|PGK|PHP|PKR|PLN|PYG|QAR|RON|RSD|RUB|RWF|SAR|SBD|SCR|SDG|SEK|SGD|SHP|SLL|SOS|SRD|SSP|STD|SYP|SZL|THB|TJS|TMT|TND|TOP|TRY|TTD|TWD|TZS|UAH|UGX|USD|UYU|UZS|VEF|VND|VUV|WST|XAF|XCD|XOF|XPF|YER|ZAR|ZMW) ?(([1-9][0-9][0-9]|[1-9][0-9]|[5-9])((,| )?[0-9]{3}){2,}|0)\.[0-9][0-9])?$

Use alternation with a quantifier
The segment of your regular expression that needs to change is the pair of character classes that matches a leading group from 1 to 999:
[1-9][0-9]{0,2}
Use the alternation token | together with a higher minimum in the {} quantifier, so that a one-digit leading group must be 5 to 9 while two- and three-digit leading groups are unrestricted. This gives a range from 5 to 999:
([5-9]|[1-9][0-9]{1,2})
Optional enhancements:
For brevity, you may also like to replace the leading list of currency identifiers with a character class, at the cost of accepting any three uppercase letters rather than only valid codes:
(R|[A-Z]{3})
Likewise, you can minimize the segment that matches the trailing two-digit decimal with a quantifier:
\.[0-9]{2}
Putting it all together:
((R|[A-Z]{3}) ?([5-9]|[1-9][0-9]{1,2})((,| )?[0-9]{3}){2,})\.[0-9]{2}
Source: Regexper.com
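As a quick check of the combined expression, here is a small Python sketch that runs it over amounts like those in the sample text earlier on this page. The `^` and `$` anchors and the case-insensitive flag are assumptions added here (the sample text uses lowercase `usd`); the rest of the pattern is copied unchanged.

```python
import re

# Combined expression from above; anchors and IGNORECASE are added assumptions.
PATTERN = re.compile(
    r"^((R|[A-Z]{3}) ?([5-9]|[1-9][0-9]{1,2})((,| )?[0-9]{3}){2,})\.[0-9]{2}$",
    re.IGNORECASE,
)

samples = [
    "usd 5,000,000.00",
    "usd 15,000,000.00",
    "usd 15,000,000,000.00",
    "usd 5,000.00",
    "usd 4,999,999.99",
    "usd 1,000,000.00",
    "usd",
]

# Keep only the lines whose amount is 5,000,000 or greater.
matches = [s for s in samples if PATTERN.match(s)]
for s in matches:
    print(s)
```

Only the first three samples are printed. A one-digit leading group below 5 (as in 4,999,999.99 or 1,000,000.00) is rejected because `[5-9]` fails and `[1-9][0-9]{1,2}` needs at least two digits.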

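Keep your regex as it is and compare the captured amount numerically.
// $pattern is your original regex wrapped in delimiters.
if (preg_match($pattern, $string, $out) && isset($out[3])) {
    // $out[2] is the currency code; $out[3] is the integer part of the amount.
    // Strip both comma and space separators before comparing.
    $amount = (int) str_replace([",", " "], "", $out[3]);
    if ($amount >= 5000000) {
        // code here
    }
}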
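The same idea (let the regex extract the amount, then compare it as a number) can be sketched in Python. This is an illustration under assumptions rather than a drop-in replacement: the pattern below is a simplified stand-in that accepts any three-letter code, and the names `AMOUNT`, `THRESHOLD` and `is_large_amount` are hypothetical.

```python
import re
from decimal import Decimal

# Simplified stand-in pattern (assumption): currency code, optional space,
# digit groups separated by an optional comma or space, then two decimals.
AMOUNT = re.compile(r"^(R|[A-Z]{3}) ?([0-9]{1,3}(?:[, ]?[0-9]{3})*)\.([0-9]{2})$")
THRESHOLD = Decimal("5000000")

def is_large_amount(text):
    """Return True when text is a currency amount of 5,000,000 or more."""
    m = AMOUNT.match(text)
    if m is None:
        return False
    # Remove the thousands separators before converting to a number.
    whole = m.group(2).replace(",", "").replace(" ", "")
    return Decimal(whole + "." + m.group(3)) >= THRESHOLD

print(is_large_amount("USD 5,000,000.00"))
print(is_large_amount("USD 4,999,999.99"))
```

Doing the comparison in code keeps the threshold in one obvious place, so changing it later does not require editing the regex.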

Related

How to exclude images from regex email extraction

I am using some email extractor software to (surprise surprise) extract emails from websites. It uses the regex:
[A-Z0-9._%+-]+#[A-Z0-9.-]{3,65}\.[A-Z]{2,4}
But this returns image filenames as well as emails, e.g. _212000482_1#80xauto.jpg
I can change this regex, but I cannot figure out how to exclude matches ending in .png, .jpg etc.
There is a lot of information on validating emails - and how hard this is - but all I want to do is exclude images from the result list.
Description
In your sample text the undesired substring resembles an email address, but conveniently ends in jpg. So with a negative lookahead we can just exclude the filename extensions.
(?!\S*\.(?:jpg|png|gif|bmp)(?:[\s\n\r]|$))[A-Z0-9._%+-]+#[A-Z0-9.-]{3,65}\.[A-Z]{2,4}
Example
Live Demo
https://regex101.com/r/mU7bO3/2
Sample text
droids#gmail.com _212000482_1#80xauto.jpg More.Droids#deathstar.com
Sample Matches
droids#gmail.com
More.Droids#deathstar.com
Explanation
NODE EXPLANATION
----------------------------------------------------------------------
(?! look ahead to see if there is not:
----------------------------------------------------------------------
\S* non-whitespace (all but \n, \r, \t, \f,
and " ") (0 or more times (matching the
most amount possible))
----------------------------------------------------------------------
\. '.'
----------------------------------------------------------------------
(?: group, but do not capture:
----------------------------------------------------------------------
jpg 'jpg'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
png 'png'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
gif 'gif'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
bmp 'bmp'
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
(?: group, but do not capture:
----------------------------------------------------------------------
[\s\n\r] any character of: whitespace (\n, \r,
\t, \f, and " "), '\n' (newline), '\r'
(carriage return)
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
$ before an optional \n, and the end of
a "line"
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
) end of look-ahead
----------------------------------------------------------------------
[A-Z0-9._%+-]+ any character of: 'A' to 'Z', '0' to '9',
'.', '_', '%', '+', '-' (1 or more times
(matching the most amount possible))
----------------------------------------------------------------------
# '#'
----------------------------------------------------------------------
[A-Z0-9.-]{3,65} any character of: 'A' to 'Z', '0' to '9',
'.', '-' (between 3 and 65 times (matching
the most amount possible))
----------------------------------------------------------------------
\. '.'
----------------------------------------------------------------------
[A-Z]{2,4} any character of: 'A' to 'Z' (between 2
and 4 times (matching the most amount
possible))
----------------------------------------------------------------------
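To try the lookahead outside of the live demo, the following Python sketch applies the expression exactly as printed above to the sample text. The case-insensitive flag is an assumption needed because the character classes only list uppercase letters while the sample text is mixed case.

```python
import re

# Expression copied from the answer above; IGNORECASE is an added assumption.
PATTERN = re.compile(
    r"(?!\S*\.(?:jpg|png|gif|bmp)(?:[\s\n\r]|$))"
    r"[A-Z0-9._%+-]+#[A-Z0-9.-]{3,65}\.[A-Z]{2,4}",
    re.IGNORECASE,
)

text = "droids#gmail.com _212000482_1#80xauto.jpg More.Droids#deathstar.com"

# There are no capturing groups, so findall returns the full matches.
found = PATTERN.findall(text)
print(found)  # ['droids#gmail.com', 'More.Droids#deathstar.com']
```

The lookahead is evaluated at every candidate start position, so a match cannot begin partway through the image filename either: from any position before `.jpg`, `\S*` can still reach the excluded extension.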

How to match this pattern (with emoji)?

I have a text file of a few thousand entries, each built like this:
11111111111: text text text text text :: word11111111: text text text text :: word111111111:
Where:
11111111 is a big number
text text text text can be anything including emoji
word is one of 8 words
the second 111111111 is another number, but different.
I tried, but just couldn't match it.
I don't know how to treat the emoji. Another problem is that the spacing is not consistent: sometimes it is a space, sometimes a tab, and so on.
Description
^([0-9]+):\s*((?:(?!\s::).)*)\s::\s*([^:]+)\s*:\s*((?:(?!\s::).)*)\s::\s*([^:]+):$
This regular expression will do the following:
Capture the leading 11111111
Match the :
Capture the text text text text text which may contain emojis.
Match the ::
Capture the word11111111
Match the :
Capture the text text text text which may contain emojis.
Match the ::
Capture the word111111111
Match the :
Allow the : or :: to be delimiters
Exclude the whitespace surrounding the delimiters from the captures.
Example
Live Demo
https://regex101.com/r/qG7uZ7/1
Sample text
11111111111: text text text text text :: word11111111: text text text text :: word111111111:
Capture Groups from match
0. 11111111111: text text text text text :: word11111111: text text text text :: word111111111:
1. `11111111111`
2. `text text text text text`
3. `word11111111`
4. `text text text text`
5. `word111111111`
Explanation
NODE EXPLANATION
----------------------------------------------------------------------
^ the beginning of a "line"
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
[0-9]+ any character of: '0' to '9' (1 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
: ':'
----------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0 or
more times (matching the most amount
possible))
----------------------------------------------------------------------
( group and capture to \2:
----------------------------------------------------------------------
(?: group, but do not capture (0 or more
times (matching the most amount
possible)):
----------------------------------------------------------------------
(?! look ahead to see if there is not:
----------------------------------------------------------------------
\s whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
:: '::'
----------------------------------------------------------------------
) end of look-ahead
----------------------------------------------------------------------
. any character except \n
----------------------------------------------------------------------
)* end of grouping
----------------------------------------------------------------------
) end of \2
----------------------------------------------------------------------
\s whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
:: '::'
----------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0 or
more times (matching the most amount
possible))
----------------------------------------------------------------------
( group and capture to \3:
----------------------------------------------------------------------
[^:]+ any character except: ':' (1 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
) end of \3
----------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0 or
more times (matching the most amount
possible))
----------------------------------------------------------------------
: ':'
----------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0 or
more times (matching the most amount
possible))
----------------------------------------------------------------------
( group and capture to \4:
----------------------------------------------------------------------
(?: group, but do not capture (0 or more
times (matching the most amount
possible)):
----------------------------------------------------------------------
(?! look ahead to see if there is not:
----------------------------------------------------------------------
\s whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
:: '::'
----------------------------------------------------------------------
) end of look-ahead
----------------------------------------------------------------------
. any character except \n
----------------------------------------------------------------------
)* end of grouping
----------------------------------------------------------------------
) end of \4
----------------------------------------------------------------------
\s whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
:: '::'
----------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0 or
more times (matching the most amount
possible))
----------------------------------------------------------------------
( group and capture to \5:
----------------------------------------------------------------------
[^:]+ any character except: ':' (1 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
) end of \5
----------------------------------------------------------------------
: ':'
----------------------------------------------------------------------
$ before an optional \n, and the end of a
"line"
----------------------------------------------------------------------
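The expression can be exercised with a line that contains an emoji and a tab, which were the two concerns in the question. This is a small Python sketch under assumptions: the test line is made up for illustration, and it relies on Python 3 treating an emoji as a single character that `.` can match.

```python
import re

# Expression copied from the answer above.
LINE = re.compile(
    r"^([0-9]+):\s*((?:(?!\s::).)*)\s::\s*([^:]+)\s*:\s*((?:(?!\s::).)*)\s::\s*([^:]+):$"
)

# Hypothetical entry: an emoji in the first text field and a tab before "::".
line = "11111111111: text \U0001F600 text\t:: word11111111: more text ::  word111111111:"

m = LINE.match(line)
groups = m.groups() if m else None
print(groups)
```

The five captures come back without the whitespace around the delimiters, because `\s::` and `\s*` consume it outside the capture groups.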

Using ReplaceTextWithMapping with multiple columns in mapping file

I would need to clarify the usage of ReplaceTextWithMapping in NiFi in my specific case. My input file looks like this:
{"field1" : "A",
"field2" : "A",
"field3": "A"
}
The mapping file looks, instead, like this:
Header1;Header2;Header3
A;some text;2
My expected result would be as follows:
{"field1" : "some text",
"field2": "A",
"field3": "A2"
}
The Regular Expression set is simply as follows:
[A-Z0-9]+
and it matches the field key in the mapping file (we are expecting either a capital letter or a capital letter plus a digit), but I am not sure how you decide which value (from column 2 or column 3) the input value is mapped to. Also, field2 should not change: it needs to retain the value it has in the input, with no mapping involved. At the moment, I am getting something like this:
{"field1" : "some text A2",
"field2": "some text A2",
"field3": "some text A2"
}
I guess my main question is: can you map the same value in your input file to different values coming from different columns of your mapping file?
Thank you
EDIT: I am using ReplaceTextWithMapping, an out-of-the-box processor in Apache NiFi (v. 0.5.1). Throughout my dataflow, I end up with a JSON file on which I need to apply some mappings coming from external files I would like to load in memory (rather than parse using ExtractText, for example).
Foreword
It appears that you're working with a JSON string. It would be easier to handle such a string with a JSON parsing engine, because the JSON structure allows edge cases that are difficult to parse with regular expressions. With that said, I'm sure you have your reasons, and I'm not the Regex Police.
Description
To do such a replacement it would be easier to capture the substrings you'll keep and the substrings you want to replace.
(\{"[a-z0-9]+"\s*:\s*")([a-z0-9]+)("[,\r\n]+"[a-z0-9]+"\s*:\s*")([a-z0-9]+)("[,\r\n]+"[a-z0-9]+"\s*:\s*")([a-z0-9]+)("[,\r\n]+\})
Replace with: $1SomeText$3$4$5A2$7
Note: I recommend using the following flags with this expression: Case Insensitive, and Dot matches all characters including new lines.
Examples
Live Demo
This example shows how the regular expression matches against your source text:
https://regex101.com/r/vM1qE2/1
Source Text
{"field1" : "A",
"field2" : "A",
"field3": "A"
}
After Replacement
{"field1" : "SomeText",
"field2" : "A",
"field3": "A2"
}
Explanation
NODE EXPLANATION
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
\{ '{'
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
[a-z0-9]+ any character of: 'a' to 'z', '0' to '9'
(1 or more times (matching the most
amount possible))
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0
or more times (matching the most amount
possible))
----------------------------------------------------------------------
: ':'
----------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0
or more times (matching the most amount
possible))
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
( group and capture to \2:
----------------------------------------------------------------------
[a-z0-9]+ any character of: 'a' to 'z', '0' to '9'
(1 or more times (matching the most
amount possible))
----------------------------------------------------------------------
) end of \2
----------------------------------------------------------------------
( group and capture to \3:
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
[,\r\n]+ any character of: ',', '\r' (carriage
return), '\n' (newline) (1 or more times
(matching the most amount possible))
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
[a-z0-9]+ any character of: 'a' to 'z', '0' to '9'
(1 or more times (matching the most
amount possible))
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0
or more times (matching the most amount
possible))
----------------------------------------------------------------------
: ':'
----------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0
or more times (matching the most amount
possible))
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
) end of \3
----------------------------------------------------------------------
( group and capture to \4:
----------------------------------------------------------------------
[a-z0-9]+ any character of: 'a' to 'z', '0' to '9'
(1 or more times (matching the most
amount possible))
----------------------------------------------------------------------
) end of \4
----------------------------------------------------------------------
( group and capture to \5:
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
[,\r\n]+ any character of: ',', '\r' (carriage
return), '\n' (newline) (1 or more times
(matching the most amount possible))
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
[a-z0-9]+ any character of: 'a' to 'z', '0' to '9'
(1 or more times (matching the most
amount possible))
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0
or more times (matching the most amount
possible))
----------------------------------------------------------------------
: ':'
----------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0
or more times (matching the most amount
possible))
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
) end of \5
----------------------------------------------------------------------
( group and capture to \6:
----------------------------------------------------------------------
[a-z0-9]+ any character of: 'a' to 'z', '0' to '9'
(1 or more times (matching the most
amount possible))
----------------------------------------------------------------------
) end of \6
----------------------------------------------------------------------
( group and capture to \7:
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
[,\r\n]+ any character of: ',', '\r' (carriage
return), '\n' (newline) (1 or more times
(matching the most amount possible))
----------------------------------------------------------------------
\} '}'
----------------------------------------------------------------------
) end of \7
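For readers who want to try the replacement outside of the live demo, here is a minimal Python sketch. It assumes Python's `re` module, where the answer's `$1` style references are written as `\g<1>`, and it applies the two recommended flags.

```python
import re

# Expression copied from the answer above, split across lines for readability.
PATTERN = re.compile(
    r'(\{"[a-z0-9]+"\s*:\s*")([a-z0-9]+)'
    r'("[,\r\n]+"[a-z0-9]+"\s*:\s*")([a-z0-9]+)'
    r'("[,\r\n]+"[a-z0-9]+"\s*:\s*")([a-z0-9]+)'
    r'("[,\r\n]+\})',
    re.IGNORECASE | re.DOTALL,
)

source = '{"field1" : "A",\n"field2" : "A",\n"field3": "A"\n}'

# Equivalent of "$1SomeText$3$4$5A2$7"; \g<5> avoids ambiguity before "A2".
result = PATTERN.sub(r"\g<1>SomeText\g<3>\g<4>\g<5>A2\g<7>", source)
print(result)
```

Groups 2 and 6 are the values being replaced, and group 4 is copied back unchanged, which is why field2 keeps its original value.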
So I dove into ReplaceTextWithMapping to try to get it to solve your use-case, but I just don't think it is powerful enough to do what you want. Currently it is designed almost solely for one purpose: match a simple regex, then map one group of non-whitespace characters to another group of characters (which can contain whitespace and back references).
When looking at your use-case as pure text, it is to change the value of one capture group based on the value of another capture group and a mapping file. Looking at it in terms of JSON, your use-case is much simpler, you want to change the value of a key/value pair based on what the key is and a mapping file. Side note, if you didn't need the mapping file, I believe there is a new JSON to JSON processor coming in 0.7.0[1] that would work.
As for looking for a solution, both ways of looking at your problem are valid. ReplaceTextWithMapping certainly could use expanded functionality to allow for advanced use-cases, but that may make it too complicated (though it could be more confusing now due to the unclear scope of its functionality). A new processor, along the lines of "ReplaceJsonWithMapping", could certainly be added as well, but it would need to clearly define its scope and purpose.
Also, for a more immediate solution there is always the option to use the ExecuteScript processor. Here[2] is a link to a blog post (written by the creator of ExecuteScript) which outlines how to write a basic JSON-to-JSON processor. More logic would need to be added to read the mapping from a file.
[1] https://issues.apache.org/jira/browse/NIFI-361
[2] http://funnifi.blogspot.com/2016/02/executescript-json-to-json-conversion.html
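To make the "key plus mapping file" view of the problem concrete, here is a standalone Python sketch of the transformation itself. It is not NiFi code and does not use the ExecuteScript API. The per-field rules (field1 takes the Header2 value, field3 has the Header3 value appended, field2 is left alone) are assumptions read off the expected result in the question, and the function names are hypothetical.

```python
import csv
import io
import json

# Mapping file content from the question, inlined to keep the sketch self-contained.
MAPPING_TEXT = "Header1;Header2;Header3\nA;some text;2\n"

def load_mapping(text):
    """Index the semicolon-separated mapping rows by their Header1 key."""
    reader = csv.DictReader(io.StringIO(text), delimiter=";")
    return {row["Header1"]: row for row in reader}

def apply_mapping(record, mapping):
    """Apply the assumed per-field rules; unmapped fields are left unchanged."""
    out = dict(record)
    row = mapping.get(record.get("field1"))
    if row is not None:
        out["field1"] = row["Header2"]                       # replace with column 2
    row = mapping.get(record.get("field3"))
    if row is not None:
        out["field3"] = record["field3"] + row["Header3"]   # append column 3
    return out

record = json.loads('{"field1": "A", "field2": "A", "field3": "A"}')
result = apply_mapping(record, load_mapping(MAPPING_TEXT))
print(json.dumps(result))
```

Working on the parsed JSON lets each key choose its own column, which is the part a single regex-to-value mapping cannot express.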

Match only the first occurrence of a phrase

I have the following Json:
{"field1": "someText",
"field2": "Text Again",
"field3": "Text Again"}
I would need to match the first occurrence of any phrase starting with a capital letter (such as "Text Again", for example)
I have written the following:
("[A-Za-z]+\s[A-Za-z]+")
It does work fine when testing with https://regex101.com/, for instance. However, it does not seem to correctly function as part of the usage of ReplaceTextWithMapping (Apache NiFi). Is the regex incorrect?
Thank you for your help
Description
:\s*"\s*(?=[A-Z])(?![^"]*?\s[a-z])([A-Za-z\s]+)"
This regular expression does the following:
finds the first title-case string on the value side of what appears to be a JSON-encoded string
ensures each word is capitalized
returns the value inside the quotes as capture group 1
Example
Live Demo
https://regex101.com/r/eO0xW6/1
Source String
{"field1": "someText",
"field2": "Text again",
"field3": "Text Again"}
First Match
Text Again
Explanation
Summary
:\s*" validates that we're only checking the value side of the JSON
\s* matches any spaces after the opening quote if they exist
(?=[A-Z]) ensures the first character in the string is uppercase
(?![^"]*?\s[a-z]) looks for any space that is followed by a lowercase character. If one is found, this isn't a match
([A-Za-z\s]+) captures all the characters inside the quote
" matches the quote
Detailed
NODE EXPLANATION
----------------------------------------------------------------------
: ':'
----------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0 or
more times (matching the most amount
possible))
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0 or
more times (matching the most amount
possible))
----------------------------------------------------------------------
(?= look ahead to see if there is:
----------------------------------------------------------------------
[A-Z] any character of: 'A' to 'Z'
----------------------------------------------------------------------
) end of look-ahead
----------------------------------------------------------------------
(?! look ahead to see if there is not:
----------------------------------------------------------------------
[^"]*? any character except: '"' (0 or more
times (matching the least amount
possible))
----------------------------------------------------------------------
\s whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
[a-z] any character of: 'a' to 'z'
----------------------------------------------------------------------
) end of look-ahead
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
[A-Za-z\s]+ any character of: 'A' to 'Z', 'a' to
'z', whitespace (\n, \r, \t, \f, and "
") (1 or more times (matching the most
amount possible))
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
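The behaviour described above can be reproduced with a short Python sketch. It uses `re.search`, which returns the first match only, on the source string from the example; nothing here is specific to NiFi.

```python
import re

# Expression copied from the answer above.
PATTERN = re.compile(r':\s*"\s*(?=[A-Z])(?![^"]*?\s[a-z])([A-Za-z\s]+)"')

source = '{"field1": "someText",\n"field2": "Text again",\n"field3": "Text Again"}'

# search() stops at the first value that passes both lookaheads.
m = PATTERN.search(source)
first = m.group(1) if m else None
print(first)  # Text Again
```

"someText" is skipped by `(?=[A-Z])` and "Text again" by the negative lookahead, so the first match is the value of field3.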
I have posted my findings on the issue to the Apache NiFi mailing list:
http://apache-nifi-developer-list.39713.n7.nabble.com/Issues-with-Regex-used-with-ReplaceTextWithMapping-where-am-I-going-wrong-tc10592.html
I have not received any confirmation from the community, but it seems to me that, although the regex [A-Z][A-Za-z]*\s[A-Z][A-Za-z]* is correct in this case, the processor (ReplaceTextWithMapping) does not handle whitespace (\s) well, and the string here contains a space between the two words.

Perl regular expressions

I'm reading some code that involves regular expression and having some trouble.
Can someone please explain it and give an example of text it would parse?
if(/\|\s*STUFF(\d+)\s*\|\s*STUFF(\d+)/)
{
$a = $1;
$b = $2;
}
One string it matches against is |STUFF1|STUFF2.
YAPE::Regex::Explain
(?-imsx:\|\s*STUFF(\d+)\s*\|\s*STUFF(\d+))
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?-imsx: group, but do not capture (case-sensitive)
(with ^ and $ matching normally) (with . not
matching \n) (matching whitespace and #
normally):
----------------------------------------------------------------------
\| '|'
----------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0 or
more times (matching the most amount
possible))
----------------------------------------------------------------------
STUFF 'STUFF'
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
\d+ digits (0-9) (1 or more times (matching
the most amount possible))
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0 or
more times (matching the most amount
possible))
----------------------------------------------------------------------
\| '|'
----------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0 or
more times (matching the most amount
possible))
----------------------------------------------------------------------
STUFF 'STUFF'
----------------------------------------------------------------------
( group and capture to \2:
----------------------------------------------------------------------
\d+ digits (0-9) (1 or more times (matching
the most amount possible))
----------------------------------------------------------------------
) end of \2
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
\|\s*STUFF(\d+)\s*\|\s*STUFF(\d+)
\| look for a literal pipe character |.
\s* look for any number (zero or more) of whitespace characters.
STUFF look for the string STUFF
(\d+) look for any number of digits (one or more), and save them to $1.
\s* look for any number of whitespace characters (zero or more)
then repeat once, and save the next digit sequence in $2.
If the regex matches, we know that $1 and $2 must be defined (i.e. they have a value).
In that case, we assign $1 to the variable $a and $2 to $b.
As no explicit string to match against is provided, the $_ variable is implicitly used.
Example text:
foo bar |STUFF123|STUFF456 baz bar foo
and
foo |
STUFF0
|STUFF1234567890bar
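The two example texts can be checked with the same expression in Python. This is a sketch for illustration only: the Perl code uses `$_`, `$1` and `$2`, while the hypothetical helper `extract` below returns the two captured digit strings as a tuple.

```python
import re

# Same expression as in the Perl snippet.
PATTERN = re.compile(r"\|\s*STUFF(\d+)\s*\|\s*STUFF(\d+)")

def extract(text):
    """Return the two captured digit strings, or None when there is no match."""
    m = PATTERN.search(text)
    return m.groups() if m else None

one_line = "foo bar |STUFF123|STUFF456 baz bar foo"
multi_line = "foo |\nSTUFF0\n|STUFF1234567890bar"

print(extract(one_line))    # ('123', '456')
print(extract(multi_line))  # ('0', '1234567890')
```

The second text matches because `\s*` also accepts newlines, so the pipe, `STUFF` and the digits may sit on separate lines.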