I have a CSV in the following format.
id,mobile
1,02146477474
2,08585377474
3,07646474637
4,02158789566
5,04578599525
I want to add a new column containing just the leading three digits of the number (for specific prefixes; all the others should get the string NOT_VALID). So the result should be:
id,number,provider
1,02146477474,021
2,08585377474,085
3,07646474637,NOT_VALID
4,02158789566,021
5,04578599525,NOT_VALID
I can use the following regex for that replacement, but I would like to handle all possible conversions in one step, using the UpdateRecord processor.
${field.value:replaceFirst('085[0-9]+','085')}
When I use something like this:
${field.value:replaceFirst('085[0-9]+','085'):or(${field.value:replaceFirst('086[0-9]+','086')})}
This replaces all with false.
NiFi uses Java regex.
Since you are using record processing, this should work for you:
${field.value:replaceFirst('^(021|085)?.*','$1')}
The group (), made optional by ?, captures 021 or 085 at the beginning of the string (^).
The replacement, $1, is the first group.
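If you want to sanity-check the pattern outside NiFi, here is a minimal Python sketch of the same (Java-compatible) regex applied to the sample numbers. Note that numbers without a 021/085 prefix come back as an empty string rather than NOT_VALID, so mapping them to NOT_VALID afterwards is an assumption I've added here:

import re

numbers = ["02146477474", "08585377474", "07646474637", "02158789566", "04578599525"]

for n in numbers:
    # Same idea as the UpdateRecord expression: the optional group captures a
    # leading 021 or 085 and the replacement keeps only $1 (Python 3.5+ turns
    # an unmatched group in the replacement into an empty string).
    provider = re.sub(r'^(021|085)?.*', r'\1', n, count=1)
    print(n, provider if provider else "NOT_VALID")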
PS: Sites like https://regex101.com/ help with understanding regex.
I am converting SQL views, exported as files, to a different syntax using a separate, specialized conversion tool. This tool can't handle certain commands and formatting, so I'm using Notepad++ with RegEx to alter the files ahead of time.
So far I am getting the results that I want, but it takes three separate Find/Replace actions. I'd like to reduce these three RegEx actions down to one if possible.
Find: (.*)(CREATE VIEW.*\nGO)(.*)
Replace: \2
Find: (CREATE VIEW )(.*)(\r\nAS)
Replace: \1"\2"\3
Find: (oldschema1\.|\[oldschema1\]\.|\[|\]|TOP \(100\) PERCENT|oldschema2\.)|(^GO$)|(\A^(.*?))
Replace: (?1)(?2\;)(?3SET SCHEMA schemaname\; \n\n\1)
I'm using Notepad++ 7.7.1 64-bit, Find/Replace with Regular expression search mode and the ". matches newline" option checked.
You'll see in my code that I'm already using capture groups with alternation. I thought I could fold the first two RegEx steps into Step 3 as additional capture groups, but it doesn't work out, possibly because they are nested.
I tried referencing the nested groups by incrementing the group numbers accordingly, but it doesn't work (it blanks out the result).
Here is an example SQL view file. It's not a working view because I added "oldschema2" so the RegEx would have something to find for one of the replacements, but it's representative as an example here.
garbage
text
beforehand
CREATE VIEW [oldschema1].[viewname]
AS
SELECT DISTINCT
TOP (100) PERCENT oldschema1.TABLENAME.FIELD1, oldschema1.TABLENAME.FIELD2
FROM oldschema1.TABLENAME
WHERE (oldschema1.TABLENAME.FIELD3 = N'Z003') AND oldschema2.TABLENAME.FIELD2 = 1
ORDER BY oldschema1.TABLENAME.FIELD1
GO
garbage
text
after
Here are some additional details about what I'm trying to achieve with each pass.
Notepad++ RegEx Step 1 - isolate view block from CREATE VIEW to GO
Find:
(.*)(CREATE VIEW.*\nGO)(.*)
Replace:
\2
Step 2 - put quotes around view name
Find:
(CREATE VIEW )(.*)(\r\nAS)
Replace:
\1"\2"\3
Step 3 - remove/replace various texts and insert a line at the beginning of the file
Find:
(oldschema1\.|\[oldschema1\]\.|\[|\]|TOP \(100\) PERCENT|oldschema2\.)|(^GO$)|(\A^(.*?))
Replace:
(?1)(?2\;)(?3SET SCHEMA schemaname\; \n\n\1)
The expected output from the above example would be:
SET SCHEMA schemaname;
CREATE VIEW "viewname"
AS
SELECT DISTINCT
TABLENAME.FIELD1, TABLENAME.FIELD2
FROM TABLENAME
WHERE (TABLENAME.FIELD3 = N'Z003') AND TABLENAME.FIELD2 = 1
ORDER BY TABLENAME.FIELD1
;
which I achieve with the above three steps, but I'd like to do it in one Find/Replace if possible.
I'm pretty new to RegEx, and StackOverflow for that matter. Your help is greatly appreciated.
Step 1
I'm not so sure about it, but I'm guessing that maybe we would want an expression similar to:
[\s\S]*?(CREATE VIEW[\s\S]*GO\s*)[\s\S]*
to be replaced with $1, where our desired data is in this capturing group:
(CREATE VIEW[\s\S]*GO\s*)
and we can even remove \s*:
(CREATE VIEW[\s\S]*GO)
and just try:
[\s\S]*?(CREATE VIEW[\s\S]*GO)[\s\S]*
with an m flag.
The expression is explained further in the right panel of this demo, if you're interested.
Step 2
We can likely try:
(CREATE VIEW)(.*)
and replace with:
SET SCHEMA schemaname;\n\n$1 "viewname"
Demo
Step 3
This step would probably be done with an expression similar to:
TOP \(100\) PERCENT |oldschema1\.
being replaced with an empty string.
Demo
Step 4
Replace \s*GO with \n; (or just ;) and we might have the desired output; not sure, though.
Demo
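If running the passes from a script is acceptable, here is a rough Python sketch that chains the steps above in one go. The file name is hypothetical, and I generalize the view-name quoting instead of hard-coding "viewname"; treat it as an approximation of the pipeline rather than the single Notepad++ expression asked for.

import re

text = open("view_export.sql").read()  # hypothetical input file

# Step 1: isolate the block from CREATE VIEW to GO
text = re.sub(r'[\s\S]*?(CREATE VIEW[\s\S]*GO)[\s\S]*', r'\1', text)

# Step 2: put quotes around whatever follows CREATE VIEW on that line
text = re.sub(r'(CREATE VIEW )(.*)', r'\1"\2"', text, count=1)

# Step 3: strip the old schema prefixes, square brackets and TOP (100) PERCENT
text = re.sub(r'\[oldschema1\]\.|oldschema1\.|oldschema2\.|\[|\]|TOP \(100\) PERCENT ', '', text)

# Step 4: turn the trailing GO into a semicolon and prepend the SET SCHEMA line
text = re.sub(r'^GO$', ';', text, flags=re.MULTILINE)
text = "SET SCHEMA schemaname;\n\n" + text

print(text)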
I'm using the following regex to find URLs in a text file:
/http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+/
It outputs the following:
http://rda.ucar.edu/datasets/ds117.0/.
http://rda.ucar.edu/datasets/ds111.1/.
http://www.discover-earth.org/index.html).
http://community.eosdis.nasa.gov/measures/).
Ideally they would print out this:
http://rda.ucar.edu/datasets/ds117.0/
http://rda.ucar.edu/datasets/ds111.1/
http://www.discover-earth.org/index.html
http://community.eosdis.nasa.gov/measures/
Any ideas on how I should tweak my regex?
Thank you in advance!
UPDATE - Example of the text would be:
this is a test http://rda.ucar.edu/datasets/ds117.0/. and I want this to be copied over http://rda.ucar.edu/datasets/ds111.1/. http://www.discover-earth.org/index.html). http://community.eosdis.nasa.gov/measures/).
This will trim the trailing characters ) and . from your output:
import re
regx= re.compile(r'(?m)[\.\)]+$')
print(regx.sub('', your_output))
And this regex seems workable for extracting the URLs from your original sample text:
https?:[\S]*\/(?:\w+(?:\.\w+)?)?
Demo (edited from https?:[\S]*\/)
The Python script may be something like this:
import re

ss = """ this is a test http://rda.ucar.edu/datasets/ds117.0/. and I want this to be copied over http://rda.ucar.edu/datasets/ds111.1/. http://www.discover-earth.org/index.html). http://community.eosdis.nasa.gov/measures/). """
regx = re.compile(r'https?:[\S]*\/(?:\w+(?:\.\w+)?)?')
for m in regx.findall(ss):
    print(m)
So for the URLs you have here:
https://regex101.com/r/uSlkcQ/4
Pattern explanation:
Protocols (e.g. https://)
^[A-Za-z]{3,9}:(?://)
Look for a recurring [-;:&=+\$,\w]+ class with an optional trailing dot (www.sub.domain.com)
(?:[\-;:&=\+\$,\w]+\.?)+
Look for recurring /[\-;:&=\+\$,\w\.]+ (/some.path/to/somewhere)
(?:\/[\-;:&=\+\$,\w\.]+)+
Now, for your special case: ensure that the last character is not a dot or a parenthesis, using negative lookahead
(?!\.|\)).
The full pattern is then
^[A-Za-z]{3,9}:(?://)(?:[\-;:&=\+\$,\w]+\.?)+(?:\/[\-;:&=\+\$,\w\.]+)+(?!\.|\)).
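If you want to check this outside regex101, here is a minimal Python sketch. I'm assuming the URLs sit one per line (as in the output listed above), since the pattern is anchored with ^ and needs the multi-line flag:

import re

text = """http://rda.ucar.edu/datasets/ds117.0/.
http://rda.ucar.edu/datasets/ds111.1/.
http://www.discover-earth.org/index.html).
http://community.eosdis.nasa.gov/measures/)."""

pattern = r'^[A-Za-z]{3,9}:(?://)(?:[\-;:&=\+\$,\w]+\.?)+(?:\/[\-;:&=\+\$,\w\.]+)+(?!\.|\)).'

# re.MULTILINE makes ^ match at the start of every line
for url in re.findall(pattern, text, flags=re.MULTILINE):
    print(url)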
There are a few things to improve or change in your existing regex to allow this to work:
http[s]? can be changed to https?. They're identical. No use putting s in its own character class
[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),] You can shorten this entire thing and combine character classes instead of using | between them. This not only improves performance, but also allows you to combine certain ranges into existing character class tokens. Simplifying this, we get [a-zA-Z0-9$-_#.&+!*\(\),]
We can go one step further: a-zA-Z0-9_ is the same as \w. So we can replace those in the character class to get [\w$-#.&+!*\(\),]
In the original regex we have $-_. This creates a range, so it actually includes everything between $ and _ in the ASCII table. This will cause unwanted characters to be matched: $%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_. There are a few options to fix this:
[-\w$#.&+!*\(\),] Place - at the start of the character class
[\w$#.&+!*\(\),-] Place - at the end of the character class
[\w$\-#.&+!*\(\),] Escape - such that you have \- instead
You don't need to escape ( and ) in the character class: [\w$#.&+!*(),-]
[0-9a-fA-F][0-9a-fA-F] You don't need to specify [0-9a-fA-F] twice. Just use a quantifier like so: [0-9a-fA-F]{2}
(?:%[0-9a-fA-F][0-9a-fA-F]) The non-capture group isn't actually needed here, so we can drop it (it adds another step that the regex engine needs to perform, which is unnecessary)
So the result of just simplifying your existing regex is the following:
https?://(?:[$\w#.&+!*(),-]|%[0-9a-fA-F]{2})+
Now you'll notice it doesn't match / so we need to add that to the character class. Your regex was matching this originally because it has an improper range $-_.
https?://(?:[$\w#.&+!*(),/-]|%[0-9a-fA-F]{2})+
Unfortunately, even with this change, it'll still match ). at the end. That's because your regex isn't told to stop matching after /. Even implementing this will now cause it to not match file names like index.html. So a better solution is needed. If you give me a couple of days, I'm working on a fully functional RFC-compliant regex that matches URLs. I figured, in the meantime, I would at least explain why your regex isn't working as you'd expect it to.
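In the meantime, here is a quick Python check of the simplified pattern against your sample text; as noted, the trailing "." and ")." still get swept up:

import re

sample = ("this is a test http://rda.ucar.edu/datasets/ds117.0/. and I want this to be copied over "
          "http://rda.ucar.edu/datasets/ds111.1/. http://www.discover-earth.org/index.html). "
          "http://community.eosdis.nasa.gov/measures/).")

simplified = r'https?://(?:[$\w#.&+!*(),/-]|%[0-9a-fA-F]{2})+'

# Each match still ends with "." or ")." because those characters are inside
# the class and nothing tells the engine to stop before them.
for url in re.findall(simplified, sample):
    print(url)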
Thanks all for the responses. A coworker ended up helping me with it. Here is the solution:
des_links = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', des)
for i in des_links:
    tmps = "/".join(i.split('/')[0:-1])
    print(tmps)
I have a file that looks like this...
"1234567123456","V","0","0","BLAH","BLAH","BLAH","BLAH"
"1234567123456","D","TEST1 "
"1234567123456","D","TEST 2~TEST3"
"1234567123456","R","TEST4~TEST5"
"1234567123457","V","0","0","BLAH","BLAH","BLAH","BLAH"
"1234567123457","D","TEST 6"
"1234567123457","D","TEST7"
"1234567123457","R","TEST 8~TEST9~TEST,10"
All I'm trying to do is parse the D and R lines. The ~ is used in this case as a separator. So the end results would be...
"1234567123456","V","0","0","BLAH","BLAH","BLAH","BLAH"
"1234567123456","D","TEST1 "
"1234567123456","D","TEST3"
"1234567123456","D","TEST3"
"1234567123456","R","TEST4"
"1234567123456","R","TEST5"
"1234567123457","V","0","0","BLAH","BLAH","BLAH","BLAH"
"1234567123457","D","TEST 6"
"1234567123457","D","TEST7"
"1234567123457","R","TEST 8"
"1234567123457","R","TEST9"
"1234567123457","R","TEST,10"
I'm using regex in applications like TextPad and Notepad++. I haven't figured out how to use a regex like /.+/g because these applications don't like the forward slashes, so I don't think I can use things like the global modifier. I currently have the following regex...
//In a program like Textpad/Notepad++
<FIND> "(.{13})","D","([^~]*)~(.*)
<REPLACE> "\1","D","\2"\n"\1","D","\3
Now if I run a find-and-replace with the above params a few times, it works fine (for the D lines only). The problem is that an unknown number of lines need to be created. For example...
"1234567123456","D","TEST1~TEST2~TEST3~TEST4~TEST5"
"1234567123457","D","TEST1~TEST2~TEST3"
"1234567123458","D","TEST1~TEST2"
"1234567123459","D","TEST1~TEST2~TEST3~TEST4"
I was hoping to be able to use a MULTI capture group to make this work. I found this PAGE talking about the common mistake of confusing repeating a capturing group with capturing a repeated group. I need to capture a repeated group. For some reason I just could not make mine work right, though. Anyone else have an idea?
Note: If I could also get rid of the leading and trailing spaces (e.g. "1234567123456","D","TEST1 " ending up as "1234567123456","D","TEST1"), that would be even better, but it's not necessary.
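For reference, here is a rough Python sketch of the transformation I'm after, purely to illustrate the intended output (I'd still prefer to do this directly in TextPad/Notepad++):

import re

lines = [
    '"1234567123456","V","0","0","BLAH","BLAH","BLAH","BLAH"',
    '"1234567123456","D","TEST 2~TEST3"',
    '"1234567123456","R","TEST4~TEST5"',
]

for line in lines:
    m = re.match(r'"(.{13})","([DR])","(.*)"\s*$', line)
    if m:
        rec_id, rec_type, values = m.groups()
        # one output line per ~-separated value, with surrounding spaces trimmed
        for value in values.split('~'):
            print('"{0}","{1}","{2}"'.format(rec_id, rec_type, value.strip()))
    else:
        print(line)  # V lines (and anything else) pass through unchanged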
RESOURCES:
http://www.regular-expressions.info/captureall.html
http://regex101.com/
I have this sample data (Current_balance is a numeric field and has some bad records which need to be replaced):
Accno,Cust_id,gender,DOB,Current_balance
0008647447654709299,87128110,M,29/02/1960,184126.23
0008650447626799299,143500723,F,4/18/1967,165198.85
0008651447674209299,479941323,M,5/5/1979,NULL
0008653447693589299,687746622,M,18-08-1981,#20
0008654447606469299,890134223,M,18-08-1983,0
0008655447659179299,684451923,F,10/9/1982,142.25
0008658447686789299,57470921,F,25-02-1978,458518.25
0008669447629759299,57470925,M,23-01-1981,xx
I need to validate the data in Pentaho and want output like below:
Accno,Cust_id,gender,DOB,Current_balance
0008647447654709299,87128110,M,29/02/1960,184126.23
0008650447626799299,143500723,F,4/18/1967,165198.85
0008651447674209299,479941323,M,5/5/1979,
0008653447693589299,687746622,M,18-08-1981,
0008654447606469299,890134223,M,18-08-1983,0
0008655447659179299,684451923,F,10/9/1982,142.25
0008658447686789299,57470921,F,25-02-1978,458518.25
0008669447629759299,57470925,M,23-01-1981,
That means the validator passes the good rows and replaces the bad data with a null value.
Can anyone suggest how I can do this?
I'm not sure about Pentaho, but to point you in the right direction, you can use the following regex:
,(?=[^,]+$)(?!\d+(\.\d{2})).*$
In Multi-line mode
If you replace all matches with ',' you should have the desired output.
Working on RegexPal
RegexPlanet translates this into the following Java regex (looks like you just need to escape the backslashes):
,(?=[^,]+$)(?!\\d+(\\.\\d{2})).*$
So in Java I guess you'd use something like:
str.replaceAll("(?m),(?=[^,]+$)(?!\\d+(\\.\\d{2})).*$", ",");
The (?m) at the start is the multi-line flag mentioned above.
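If it helps, here is a quick Python check of the same pattern against the sample rows (header omitted). One caveat: the whole-number balance 0 is also blanked, because the lookahead insists on a decimal point and two digits, so the pattern may need adjusting if integer balances should be kept.

import re

rows = """0008647447654709299,87128110,M,29/02/1960,184126.23
0008651447674209299,479941323,M,5/5/1979,NULL
0008653447693589299,687746622,M,18-08-1981,#20
0008654447606469299,890134223,M,18-08-1983,0
0008655447659179299,684451923,F,10/9/1982,142.25
0008669447629759299,57470925,M,23-01-1981,xx"""

# (?m) turns on multi-line mode so $ matches at the end of each line
print(re.sub(r'(?m),(?=[^,]+$)(?!\d+(\.\d{2})).*$', ',', rows))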
Can someone assist in creating a Regex for the following situation:
I have about 2000 records for which I need to do a search/replace, where I need to make a replacement for a known item in each record that looks like this:
<li><a href="FILEPATH/FILE">View Product Information</a></li>
The FILEPATH and FILE are variable, but the surrounding HTML is always the same. Can someone assist with what kind of Regex I would substitute for the "FILEPATH/FILE" part of the search?
You may match the constant parts and use grouping to put them back:
(<li><a href=")[^"]*(">View Product Information</a></li>)
Then you should replace the string with $1your_replacement$2, where $1 is the first matching group and $2 is the second (in Python, for instance, you would call match.group(1) and match.group(2)).
You would have to escape \ chars if you're using Java instead.
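As a concrete illustration, here is a minimal Python sketch of that idea; the exact markup, the placeholder path and the replacement path are assumptions, so adjust the pattern to your actual HTML:

import re

record = '<li><a href="old/path/old_file.pdf">View Product Information</a></li>'

# Groups 1 and 2 capture the constant HTML around the variable FILEPATH/FILE part
pattern = r'(<li><a href=")[^"]+(">View Product Information</a></li>)'

# "new/path/new_file.pdf" is only a placeholder replacement
print(re.sub(pattern, r'\g<1>new/path/new_file.pdf\g<2>', record))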