Data Validation in Pentaho using regular expression - regex

I have these sample data. (Current Balance is numeric field and has some bad records which need to be replaced)
Accno,Cust_id,gender,DOB,Current_balance
0008647447654709299,87128110,M,29/02/1960,184126.23
0008650447626799299,143500723,F,4/18/1967,165198.85
0008651447674209299,479941323,M,5/5/1979,NULL
0008653447693589299,687746622,M,18-08-1981,#20
0008654447606469299,890134223,M,18-08-1983,0
0008655447659179299,684451923,F,10/9/1982,142.25
0008658447686789299,57470921,F,25-02-1978,458518.25
0008669447629759299,57470925,M,23-01-1981,xx
I need to validate data in Pentaho and want the output like below :
Accno,Cust_id,gender,DOB,Current_balance
0008647447654709299,87128110,M,29/02/1960,184126.23
0008650447626799299,143500723,F,4/18/1967,165198.85
0008651447674209299,479941323,M,5/5/1979,
0008653447693589299,687746622,M,18-08-1981,
0008654447606469299,890134223,M,18-08-1983,0
0008655447659179299,684451923,F,10/9/1982,142.25
0008658447686789299,57470921,F,25-02-1978,458518.25
0008669447629759299,57470925,M,23-01-1981,
That means the validator pass the good row(s) and replace those bad data into null value.
Can anyone suggest how can I do this??

I'm not sure about Pentaho, but to point you in the right direction, you can use the following regex:
,(?=[^,]+$)(?!\d+(\.\d{2})).*$
In Multi-line mode
If you replace all matches with ',' you should have the desired output.
Working on RegexPal
RegexPlanet translates this into the following Java regex (looks like you just need to escape the backslashes):
,(?=[^,]+$)(?!\\d+(\\.\\d{2})).*$
So in Java I guess you'd use something like:
str.replaceAll("(?m),(?=[^,]+$)(?!\\d+(\\.\\d{2})).*$", ",");
The (?m) at the start is the multi-line flag mentioned above.

Related

Multiple replace regex in one Apache-NiFi statement

I have a csv in following format.
id,mobile
1,02146477474
2,08585377474
3,07646474637
4,02158789566
5,04578599525
I want to add a new column and add just leading 3 numbers to that column (for specific cases and all the others NOT_VALID string). So result should be:
id,number,provider
1,02146477474,021
2,08585377474,085
3,07646474637,NOT_VALID
4,02158789566,021
5,04578599525,NOT_VALID
I can use following regex for replacing that. But I would like to use all possible conversations in one step. Using UpdateRecord processor.
${field.value:replaceFirst('085[0-9]+','085')}
When I use something like this:
${field.value:replaceFirst('085[0-9]+','085'):or(${field.value:replaceFirst('086[0-9]+','086')}`)}
This replaces all with false.
Nifi uses Java regex
As soon, as you are using record processing, this should work for you:
${field.value:replaceFirst('^(021|085)?.*','$1')}
The group () optionally ? catches 021 or 085 at the beginning of string ^
The replacement - $1 - is the first group
PS: The sites like https://regex101.com/ helps to understand regex

Regex ignore first 12 characters from string

I'm trying to create a custom filter in Google Analytic to remove the query parts of the url which I don't want to see. The url has the following structure
[domain]/?p=899:2000:15018702722302::NO:::
I would like to create a regex which skips the first 12 characters (that is until:/?p=899:2000), and what ever is going to be after that replace it with nothing.
So I made this one: https://regex101.com/r/Xgbfqz/1 (which could be simplified to .{0,12}) , but I actually would like to skip those and only let the regex match whatever is going to be after that, so that I'll be able to tell in Google Analytics to replace it with "".
The part in the url that is always the same is
?p=[3numbers]:[0-4numbers]
Thank you
Your regular expression:
\/\?p=\d{3}\:\d{0,4}(.*)
Tested in Golang RegEx 2 and RegEx101
It search for /p=###:[optional:####] and capture the rest of the right side string.
(extra) JavaScript:
paragraf='[domain]/?p=899:2000:15018702722302::NO:::'
var regex= /\/\?p=\d{3}\:\d{0,4}(.*)/;
var match = regex.exec(paragraf);
alert('The rest of the right side of the string: ' + match[1]);
Easily use "[domain]/?p=899:2000:15018702722302::NO:::".substr(12)
You can try this:
/\?p\=\d{3}:\d{0,4}
Which matches just this: ?p=[3numbers]:[0-4numbers]
Not sure about replacing though.
https://regex101.com/r/Xgbfqz/1

How to extract FirstName and LastName from html tags with regex?

I have response body which contains
"<h3 class="panel-title">Welcome
First Last </h3>"
I want to fetch 'First Last' as a output
The regular expression I have tried are
"Welcome(\s*([A-Za-z]+))(\s*([A-Za-z]+))"
"Welcome \s*([A-Za-z]+)\s*([A-Za-z]+)"
But not able to get the result. If I remove the newline and take it as
"<h3 class="panel-title">Welcome First Last </h3>" it is detecting in online regex maker.
I suspect your problem is the carriage return between "Welcome" and the user name. If you use the "single-line mode" flag (?s) in your regex, it will ignore newlines. Try these:
(?s)Welcome(\s*([A-Za-z]+))(\s*([A-Za-z]+))
(?s)Welcome \s*([A-Za-z]+)\s*([A-Za-z]+)
(this works in jMeter and any other java or php based regex, but not in javascript. In the comments on the question you say you're using javascript and also jMeter - if it is a jMeter question, then this will help. if javaScript, try one of the other answers)
Well, usually I don't recommend regex for this kind of work. DOM manipulation plays at its best.
but you can use following regex to yank text:
/(?:<h3.*?>)([^<]+)(?:<\/h3>)/i
See demo at https://regex101.com/r/wA2sZ9/1
This will extract First and Last names including extra spacing. I'm sure you can easily deal with spaces.
In jmeter reg exp extractor you can use:
<h3 class="panel-title">Welcome(.*?)</h3>
Then take value using $1$.
In the data you shown welcome is followed by enter.If actually its part of response then you have to use \n.
<h3 class="panel-title">Welcome\n(.*?)</h3>
Otherwise above one is enough.
First verify this in jmeter using regular expression tester of response body.
Welcome([\s\S]+?)<
Try this, it will definitely work.
Regular expressions are greedy by default, try this
Welcome\s*([A-Za-z]+)\s*([A-Za-z]+)
Groups 1 and 2 contain your data
Check it here

Regex HTTP Response Body Message

I use a jmeter for REST testing.
I have made a HTTP Request, and this is the response data:
{"id":11,"name":"value","password":null,"status":"ACTIVE","lastIp":"0.0.0.0","lastLogin":null,"addedDate":1429090984000}
I need just the ID (which is 11) in
{"id":11,....
I use the REGEX below :
([0-9].+?)
It works perfectly but it will be a problem if my ID more than 2 digits. I need to change the REGEX to :
([0-9][0-9].+?)
Is there any dynamic REGEX for my problem. Thank you for your attention.
Regards,
Stefio
If you want any integer between {"id": and , use the following Regular Expression:
{"id":(\d+),
However the smarter way of dealing with JSON data could be JSON Path Extractor (available via JMeter Plugins), going forward this option can be much easier to use against complex JSON.
See Using the XPath Extractor in JMeter guide (scroll down to "Parsing JSON") to learn more on syntax and use cases.
I suggest using the following regular expression:
"id":([^,]*),
This will first find "id": and then look for anything that is not a comma until it finds a comma. Note the character grouping is only around the value of the ID.
This will work for ANY length ID.
Edit:
The same concept works for almost any JSON data, for example where the value is quoted:
"key":"([^"]*)"
That regular expression will extract the value from given key, as long as value is quoted and does not contain quotes. It first finds "key": and then matches anything that is not a quote until the next quote.
You can use the quantifier like this:
([0-9]{2,}.+?)
It will catch 2 or more digits, and then any symbol, 1 or more times. If you want to allow no other characters after the digits, use * instead of +:
([0-9]{2,}.*?)
Regex demo

How do I extract a postcode from one column in SSIS using regular expression

I'm trying to use a custom regex clean transformation (information found here ) to extract a post code from a mixed address column (Address3) and move it to a new column (Post Code)
Example of incoming data:
Address3: "London W12 9LZ"
Incoming data could be any combination of place names with a post code at the start, middle or end (or not at all).
Desired outcome:
Address3: "London"
Post Code: "W12 9LZ"
Essentially, in plain english, "move (not copy) any post code found from address3 into Post Code".
My regex skills aren't brilliant but I've managed to get as far as extracting the post code and getting it into its own column using the following regex, matching from Address3 and replacing into Post Code:
Match Expression:
(?<stringOUT>([A-PR-UWYZa-pr-uwyz]([0-9]{1,2}|([A-HK-Ya-hk-y][0-9]|[A-HK-Ya-hk-y][0-9] ([0-9]|[ABEHMNPRV-Yabehmnprv-y]))|[0-9][A-HJKS-UWa-hjks-uw])\ {0,1}[0-9][ABD-HJLNP-UW-Zabd-hjlnp-uw-z]{2}|([Gg][Ii][Rr]\ 0[Aa][Aa])|([Ss][Aa][Nn]\ {0,1}[Tt][Aa]1)|([Bb][Ff][Pp][Oo]\ {0,1}([Cc]\/[Oo]\ )?[0-9]{1,4})|(([Aa][Ss][Cc][Nn]|[Bb][Bb][Nn][Dd]|[BFSbfs][Ii][Qq][Qq]|[Pp][Cc][Rr][Nn]|[Ss][Tt][Hh][Ll]|[Tt][Dd][Cc][Uu]|[Tt][Kk][Cc][Aa])\ {0,1}1[Zz][Zz])))
Replace Expression:
${stringOUT}
So this leaves me with:
Address3: "London W12 9LZ"
Post Code: "W12 9LZ"
My next thought is to keep the above match/replace, then add another to match anything that doesn't match the above regex. I think it might be a negative lookahead but I can't seem to make it work.
I'm using SSIS 2008 R2 and I think the regex clean transformation uses .net regex implementation.
Thanks.
Just solved this. As usual, it was simpler logic than I thought it should be. Instead of trying to match the non-post code strings and replace them with themselves, I have added another line matching the postcode again and replacing it with "".
So in total, I have:
Match the post code using the above regex and move it to the Post Code column
Match the post code using the above regex and replace it with "" in the Address3 column