Replace column values using Apache NiFi - replace

I have a sample CSV that looks like this:
ID,FNAME,PROBLEM_COL
1,sachith,
2,nalaka,
3,john,
4,adams,
PROBLEM_COL is always empty. I want to replace the empty value with the string null.
To do that, I used an UpdateRecord processor with a CSVReader set to Use String Fields From Headers.
I added a custom property named /PROBLEM_COL with the value ${field.value:replaceFirst('','null')}
This runs without any error or warning, but PROBLEM_COL is not replaced. I had referred to this, but it does not solve my issue. My headers are in uppercase.

Use replaceEmpty('null') instead of replaceFirst
https://nifi.apache.org/docs/nifi-docs/html/expression-language-guide.html#replaceempty

Try changing your regex to something like
',$'
which means: a comma followed by the end of the line.

With the UpdateRecord processor, you can use
replace( /PROBLEM_COL, '', 'null' )
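
To make the intended result concrete, here is a minimal sketch of the same transformation outside NiFi, using Python's csv module. It only illustrates what "replace an empty PROBLEM_COL with the string null" should produce for the sample data. The helper name fill_empty is hypothetical and nothing here is NiFi API.

```python
import csv
import io

# Sample data from the question.
SAMPLE = "ID,FNAME,PROBLEM_COL\n1,sachith,\n2,nalaka,\n3,john,\n4,adams,\n"


def fill_empty(csv_text, column, replacement="null"):
    """Return csv_text with empty values in `column` replaced (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # An empty CSV field is read as an empty string here.
        if not row[column]:
            row[column] = replacement
        writer.writerow(row)
    return out.getvalue()


result = fill_empty(SAMPLE, "PROBLEM_COL")
print(result)
```

Each data row should end with null after the call, which is the output the question asks UpdateRecord to produce.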

RegEx to remove multiline property

I'm porting my system to another data access library. To do that, I'm using regex to replace or remove some code in my source (an example is below).
I need to remove everything between IBOQ_OrderingItems.Strings and ') with a regex, but I can't write one that expresses this condition. My attempts do not recognize content such as #180'asdf' or 'adsf (asdf) asdf' or ' adf '. When they do match, the regex deletes the entire content of the file.
object SQLCalcula_umaLinha: TFDQuery
IBOQ_OrderingItems.Strings = (
'sf')
end
object SQLCalcula_VariasLinhas: TFDQuery
IBOQ_OrderingItems.Strings = (
'sfdf'
'sdffs'
'sf')
end
object SQLCalcula_parentesesNoMeio: TFDQuery
IBOQ_OrderingItems.Strings = (
'sfdf'
'sdffs ('' asdf '')'
'sf')
end
I found a solution:
IBOQ_.*.Strings.=.\((\s.[\w|\s|('|')|#|!|$|#||&|*|<|>|=|*|~]*.)+'\)
I hope this helps :)
Or you could try something like
IBOQ_.*\.Strings\s*=\s*\((?:'[^']*'|[^)])*\)
which does it in 288 steps, whereas yours takes 48067 steps ;)
Check it out here at regex101.
Edit: Changed to handle parentheses inside quotes.
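
As a quick check of the second pattern, the sketch below runs it over the three sample objects from the question using Python's re module. The choice of Python and the helper name strip_property are assumptions for illustration. The original poster may be using an editor or another tool to do the replacement.

```python
import re

# The three sample objects from the question (indentation omitted, as above).
DFM = """object SQLCalcula_umaLinha: TFDQuery
IBOQ_OrderingItems.Strings = (
'sf')
end
object SQLCalcula_VariasLinhas: TFDQuery
IBOQ_OrderingItems.Strings = (
'sfdf'
'sdffs'
'sf')
end
object SQLCalcula_parentesesNoMeio: TFDQuery
IBOQ_OrderingItems.Strings = (
'sfdf'
'sdffs ('' asdf '')'
'sf')
end
"""

# Quoted strings are consumed whole, so a ')' inside quotes does not end the match.
PATTERN = re.compile(r"IBOQ_.*\.Strings\s*=\s*\((?:'[^']*'|[^)])*\)")


def strip_property(text):
    """Remove every IBOQ_*.Strings = (...) property (hypothetical helper)."""
    return PATTERN.sub("", text)


cleaned = strip_property(DFM)
print(cleaned)
```

The alternation tries a whole quoted string first and only then falls back to a single character that is not a closing parenthesis, which is why the third object with parentheses inside quotes is handled.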

Removing single and double quotes from BigQuery using regexp_extract

I'm a total noob with regexp. All I want to do is to remove the single and double quotes from a string in BigQuery. I can remove the single and double quotes at the beginning of the string, but not the end:
SELECT regexp_extract(foo, r'\"new_foo\":\"(.*?)\"') AS new_foo
FROM [mybq:Schema.table]
All I get is NULL, but without regexp_extract I get the expected results. Any help is appreciated.
Try something like below
SELECT REGEXP_REPLACE(foo, r'([\'\"])', '') AS new_foo
FROM [mybq:Schema.table]
Your regular expression should be a character class like ["']
You are also using a different function from the one that gives the expected result. Try REGEXP_REPLACE('orig_str', 'reg_exp', 'replace_str')
Something like this:
SELECT REGEXP_REPLACE(word, r'[\"\']', '') AS new_foo
FROM [mybq:Schema.table]
select replace(word,'"','') as word
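
The character-class idea from the answers above can be tried outside BigQuery. This is a minimal sketch in Python, assuming the goal is simply to drop every single and double quote. The input value is made up for illustration and strip_quotes is a hypothetical name.

```python
import re

# Character class matching either a double or a single quote.
QUOTES = re.compile(r"[\"']")


def strip_quotes(text):
    """Remove all single and double quotes from text (hypothetical helper)."""
    return QUOTES.sub("", text)


# Illustrative input only; not taken from the question's table.
print(strip_quotes('"new_foo":"bar\'s"'))
```

Unlike an extract-style function, a replace-style function does not need the quotes to appear in any particular position, so it works for quotes at both the start and the end of the string.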

Data Validation in Pentaho using regular expression

I have this sample data. (Current_balance is a numeric field and has some bad records which need to be replaced.)
Accno,Cust_id,gender,DOB,Current_balance
0008647447654709299,87128110,M,29/02/1960,184126.23
0008650447626799299,143500723,F,4/18/1967,165198.85
0008651447674209299,479941323,M,5/5/1979,NULL
0008653447693589299,687746622,M,18-08-1981,#20
0008654447606469299,890134223,M,18-08-1983,0
0008655447659179299,684451923,F,10/9/1982,142.25
0008658447686789299,57470921,F,25-02-1978,458518.25
0008669447629759299,57470925,M,23-01-1981,xx
I need to validate the data in Pentaho and want output like the following:
Accno,Cust_id,gender,DOB,Current_balance
0008647447654709299,87128110,M,29/02/1960,184126.23
0008650447626799299,143500723,F,4/18/1967,165198.85
0008651447674209299,479941323,M,5/5/1979,
0008653447693589299,687746622,M,18-08-1981,
0008654447606469299,890134223,M,18-08-1983,0
0008655447659179299,684451923,F,10/9/1982,142.25
0008658447686789299,57470921,F,25-02-1978,458518.25
0008669447629759299,57470925,M,23-01-1981,
That is, the validator passes the good rows through and replaces the bad values with null.
Can anyone suggest how I can do this?
I'm not sure about Pentaho, but to point you in the right direction, you can use the following regex:
,(?=[^,]+$)(?!\d+(\.\d+)?$).*$
in multi-line mode. It matches the last comma on a line and the field after it, unless that field is a plain number with an optional decimal part.
If you replace all matches with ',' you should have the desired output for the data rows. The header's last field is not numeric either, so leave the header line out of the replacement.
Working on RegexPal
RegexPlanet translates this into the following Java regex (looks like you just need to escape the backslashes):
,(?=[^,]+$)(?!\\d+(\\.\\d+)?$).*$
So in Java I guess you'd use something like:
str = str.replaceAll("(?m),(?=[^,]+$)(?!\\d+(\\.\\d+)?$).*$", ",");
The (?m) at the start is the multi-line flag mentioned above.
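
For a quick check against the sample rows, here is a sketch of the same lookahead approach in Python. Two choices in it are assumptions about the intended behavior rather than something stated in the question: the numeric test accepts an optional decimal part anchored at the end of the line (so a plain 0 is kept), and the header is held separately so its non-numeric last field is not blanked. The name blank_bad_balances is hypothetical.

```python
import re

# Header kept separate on purpose: its last field is not numeric.
HEADER = "Accno,Cust_id,gender,DOB,Current_balance"

# Data rows from the question.
ROWS = """0008647447654709299,87128110,M,29/02/1960,184126.23
0008650447626799299,143500723,F,4/18/1967,165198.85
0008651447674209299,479941323,M,5/5/1979,NULL
0008653447693589299,687746622,M,18-08-1981,#20
0008654447606469299,890134223,M,18-08-1983,0
0008655447659179299,684451923,F,10/9/1982,142.25
0008658447686789299,57470921,F,25-02-1978,458518.25
0008669447629759299,57470925,M,23-01-1981,xx"""

# Last comma on a line, followed by a final field that is NOT a plain number.
BAD_LAST_FIELD = re.compile(r",(?=[^,]+$)(?!\d+(?:\.\d+)?$).*$", re.MULTILINE)


def blank_bad_balances(rows):
    """Replace a non-numeric last field with an empty value (hypothetical helper)."""
    return BAD_LAST_FIELD.sub(",", rows)


cleaned = blank_bad_balances(ROWS)
print(HEADER)
print(cleaned)
```

In Pentaho itself the same pattern would presumably go into a regex-capable replace step on the row text or the field, but that wiring is not shown here.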

regular expression: how to ignore rest of the line

I have an input like this (a JSON format)
{"location":[{"id":"1BCDEFGHIJKLM","somename":"abcd","fewname":"xyzland","sid":"","sname":"","regionname":"Zee-Whole","type":"some","siteCode":"","someCode":"ROTXY","fewCode":"NL","pCode":"ROTXY","someid":"1BCDEFGHIJKLM","fewid":"GIC8"},{"id":"7823XYZHMOPRE","somename":"abcd Junction","fewname":"United States","sid":"","sname":"","regionname":"New York","type":"some","siteCode":"","someCode":"USRTJ","fewCode":"US","pCode":"USNWK","someid":"7823XYZHMOPRE","fewid":"7823XYZLMOPRE"},{"id":"799XYZHMOPRE","somename":"abcd-Maasvlakte","fewname":"xyzland","sid":"","sname":"","regionname":"Zee-Whole","type":"some","siteCode":"","someCode":"XYROT","fewCode":"NL","pCode":"","someid":"799XYZHMOPRE","fewid":"OIUOWER348534"}]}
Now, I want to pick up the first "id" value, which is 1BCDEFGHIJKLM, using regular expressions. I have got this far using
[^({"location":[?{"id":")].{0,12} but this is incomplete. Could someone help me ignore the rest of the line after the value 1BCDEFGHIJKLM?
Regex isn't the way to do this. Whatever platform you are using, it must have a JSON parser.
That will be your best error-free solution.
Assuming you must use regex, you can grab all the id's using "id":"(.*?)", and take the first match.
I found the following article, which might help you.
While messy, how is your regex incomplete?
It could be shortened to ("id":"([^"]+)"), which is more readable and doesn't limit the ID to twelve characters, if that is beneficial.
If your problem is getting more than one result, most languages have a "g" global switch.
In JavaScript, the following would return "1BCDEFGHIJKLM":
var firstID = str.match(/"id":"([^"]+)"/)[1]
This works because match() returns an array in which [0] is the entire matched string and [1] is the first parenthesized group.
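
The same capture-group approach can be sketched in Python with re.search, which stops at the first match and so never looks at the rest of the line. The input below is a shortened version of the question's JSON (fields dropped for brevity), and first_id is a hypothetical helper name.

```python
import re

# Shortened version of the JSON from the question.
SAMPLE = (
    '{"location":[{"id":"1BCDEFGHIJKLM","somename":"abcd",'
    '"someid":"1BCDEFGHIJKLM","fewid":"GIC8"},'
    '{"id":"7823XYZHMOPRE","somename":"abcd Junction"}]}'
)


def first_id(text):
    """Return the value of the first "id" key, or None if there is none."""
    match = re.search(r'"id":"([^"]+)"', text)
    return match.group(1) if match else None


print(first_id(SAMPLE))
```

Because the pattern starts with a quote directly before id, keys such as "someid" and "fewid" are not picked up.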
You don't have to use regex. In your favourite language, split on commas. Then go through each item, check for "id", and split on the colon (:). Get the last element. E.g. in Python:
>>> s
'{"location":[{"id":"1BCDEFGHIJKLM","somename":"abcd","fewname":"xyzland","sid":"","sname":"","regionname":"Zee-Whole","type":"some","siteCode":"","someCode":"ROTXY","fewCode":"NL","pCode":"ROTXY","someid":"1BCDEFGHIJKLM","fewid":"GIC8"},{"id":"7823XYZHMOPRE","somename":"abcd Junction","fewname":"United States","sid":"","sname":"","regionname":"New York","type":"some","siteCode":"","someCode":"USRTJ","fewCode":"US","pCode":"USNWK","someid":"7823XYZHMOPRE","fewid":"7823XYZLMOPRE"},{"id":"799XYZHMOPRE","somename":"abcd-Maasvlakte","fewname":"xyzland","sid":"","sname":"","regionname":"Zee-Whole","type":"some","siteCode":"","someCode":"XYROT","fewCode":"NL","pCode":"","someid":"799XYZHMOPRE","fewid":"OIUOWER348534"}]}'
>>> for i in s.split(","):
...     if '"id"' in i:
...         print(i.split(":")[-1])
...         break
...
"1BCDEFGHIJKLM"
Of course, ideally, you should use a dedicated JSON parser.
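
As a minimal sketch of the parser route, Python's standard json module can load the document and index straight into the first location. The input is a shortened version of the question's JSON, and first_location_id is a hypothetical helper name.

```python
import json

# Shortened version of the JSON from the question.
SAMPLE = (
    '{"location":[{"id":"1BCDEFGHIJKLM","somename":"abcd","fewid":"GIC8"},'
    '{"id":"7823XYZHMOPRE","somename":"abcd Junction"}]}'
)


def first_location_id(text):
    """Parse the JSON and return the id of the first entry under "location"."""
    data = json.loads(text)
    return data["location"][0]["id"]


print(first_location_id(SAMPLE))
```

This does not depend on key order or spacing, which the split-based and regex-based versions both do.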

Given the full path to a file, how do I get just the path without the filename?

Suppose I have a path in a string, '/home/user/directory/HelloWorld.txt'. I would like to remove the HelloWorld.txt and end up with '/home/user/directory'. What regex would I need?
Don't use a regex. Instead, use File::Basename, which can handle all the special cases.
use File::Basename;
print dirname("/foo/bar/baz/quux.txt");  # prints /foo/bar/baz
Split on "/", remove the last element, and join the rest back together.
$path = '/home/user/directory/HelloWorld.txt';
@s = split /\//, $path;
pop(@s);
print join("/", @s);
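
For comparison, the same "use a path library instead of a regex" advice looks like this in Python. This is only a sketch in a different language from the Perl answers above. posixpath and PurePosixPath are used so the result is the same on any platform, and parent_dir is a hypothetical helper name.

```python
import posixpath
from pathlib import PurePosixPath


def parent_dir(path):
    """Return the directory part of a POSIX-style path (hypothetical helper)."""
    return posixpath.dirname(path)


path = "/home/user/directory/HelloWorld.txt"
print(parent_dir(path))
# The pathlib equivalent of the same operation.
print(PurePosixPath(path).parent)
```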