RegEx works everywhere except in Pentaho RegEx Evaluation Step - regex

I have a couple of RegEx that work on the online regex websites but not in Pentaho. Could you please help?
Here's the string:
:6585d0f0ba88767ac3b590f719596d864d73e9c1:
harmonicbalance/src/harmonicbalance/HarmonicBalanceFlowModel.cpp
harmonicbalance/src/harmonicbalance/HbFlutterModel.cpp
:8302994b565553c83a048b8905ae597349d99627:
emp/src/emp/PhasePairSingleParticleReynoldsNumber.h
emp/src/emp/TomiyamaDragCoefficientMethod.cpp
:9da194f17ec08bb20ad1be8df68b78ca137ab18a:
combustion/src/combustion/ReactingSpeciesTransportBasedModel.cpp
combustion/src/complexchemistry/TurbulentFlameClosure.cpp
:6a59f0be1e347a65e525e58742bb304639ea9bc4:
meshing/src/meshing/SurfaceMeshManipulation.cpp
physics/src/discretization/FvIndirectRegionInterfaceManager.cpp
physics/src/discretization/FvIndirectRegionInterfaceManager.h
physics/src/discretization/FvRepresentation.cpp
physics/src/discretization/FvRepresentation.h
:64b7f6d36b11b6cd94c20cad53463b7deef8c85a:
resourceclient/src/resourceclient/ResourcePool.cpp
resourceclient/src/resourceclient/ResourcePool.h
resourceclient/src/resourceclient/RestClient.cpp
resourceclient/src/resourceclient/RestClient.h
resourceclient/src/resourceclient/test/ResourcePoolTest.cpp
I would like to capture two groups. First group will extract all commit SHA1 and the other group would extract file names.
Below are the expressions I tried:
(?:^:([A-Za-z0-9]+):|(?!^)\G)\n+([A-Za-z/.-]+)
https://regex101.com/r/3IBkPz/1
^:(\w+):\s+((?:\s*(?!:)[^\s]+)+)
https://regex101.com/r/oIoDvM/1
Thoughts?

AFAIK (as of PDI-8.0), the Regex Evaluation step does NOT support the regex 'g' modifier: your pattern must cover the entire text in order to match.
For example: the following pattern will not match anything in Regex Evaluation step:
:([0-9a-f]+):\s+([^:]+)
but if I prepend .* to this pattern and pick "Enable dotall mode":
.*:([0-9a-f]+):\s+([^:]+)
it will match the last commit (sha1 + filenames). You can also try moving .* to the end of
the original pattern, which will get you the first commit. So if you want to retrieve
the full list of commits (sha1 + filenames), as you would with the g modifier, this step is
probably not the solution for you.
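To illustrate the whole-input behaviour described above, here is a small sketch that uses Python's re.fullmatch as a stand-in for the step. This assumes the step requires the pattern to cover the entire field, as stated above. The sample text and variable names are made up for the example, and re.DOTALL plays the role of "Enable dotall mode".

```python
import re

# Shortened, hypothetical stand-in for the input field (two commits).
text = ":aaa111:\nsrc/a.cpp\nsrc/b.cpp\n:bbb222:\nsrc/c.h\n"

core = r":([0-9a-f]+):\s+([^:]+)"

# Whole-input matching: the bare pattern cannot cover the full text.
bare = re.fullmatch(core, text)

# Leading .* with dotall: the greedy .* swallows everything before the last commit.
last = re.fullmatch(r".*" + core, text, re.DOTALL)

# Trailing .* with dotall: the pattern starts at the first commit, .* eats the rest.
first = re.fullmatch(core + r".*", text, re.DOTALL)

# What a global ('g') search would give: one tuple per commit.
all_commits = re.findall(core, text)

print(bare)            # None
print(last.groups())   # ('bbb222', 'src/c.h\n')
print(first.groups())  # ('aaa111', 'src/a.cpp\nsrc/b.cpp\n')
print(len(all_commits))  # 2
```

Only the findall call returns every commit, which is the behaviour the step does not offer.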
As the fields are basically separated by colons ':' and newlines, you can try the following approach:
Use the Split field to rows step with Delimiter=':' and include rownum in the output; this rownum can be used to filter rows, since even numbers are sha1 and odd numbers are filenames.
Use the Analytic Query step to create a new field with LEAD = 1, so that sha1 and filenames end up in the same row.
Use the Calculator and Filter rows steps to calculate the remainder of rownum/2 and keep only the rows with an odd rownum.
Use Split field to rows again to split filenames into individual filename values using "\n" (with "Delimiter is a Regular Expression" checked). You may want to filter out EMPTY filenames, since the delimiter only supports one character.
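Outside Pentaho, the same split-and-pair idea can be sketched in a few lines of Python. This is only an illustration of the logic, not of the PDI steps themselves. The function name and the shortened sample are hypothetical, and it assumes file names never contain a colon.

```python
def split_commits(text):
    """Return (sha1, filename) rows by splitting on ':' and then on newlines."""
    parts = text.split(":")
    # parts[0] is the empty string before the first colon;
    # after that, sha1 values and filename blocks alternate.
    shas = parts[1::2]
    file_blocks = parts[2::2]
    rows = []
    for sha, block in zip(shas, file_blocks):
        for name in block.split("\n"):
            if name:  # drop the empty pieces around the newlines
                rows.append((sha, name))
    return rows


sample = ":aaa111:\nsrc/a.cpp\nsrc/b.cpp\n:bbb222:\nsrc/c.h\n"
for row in split_commits(sample):
    print(row)
```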

Related

RegExp set contains one or multiple words

Is there a way in regular expressions to match a subset of words against a set of words separated by a separator, without having to create a new pattern for every new word added to the set?
Right now I cannot think of anything else than creating a (?:{item1, item2, ...}) pattern for every extra item in the set (see example below).
Example matching a single word of the set:
Set: foo,bar,baz
Match: foo
RegExp: /^(foo|bar|baz)$/ <- MATCH
Example that will match a subset of words:
Set: foo,bar,baz
Match: foo,bar
RegExp: /^(foo|bar|baz)(?:,(foo|bar|baz)(?:,(foo|bar|baz))?)?$/ <- MATCH
The pattern grows rapidly when adding new items to the set. Is there some (magical) way to do this in a shorter version?
One general approach which looks slightly better than your current attempt would be to use lookaheads:
^(?=.*\bfoo\b)(?=.*\bbar\b).*$
You may add one lookahead assertion for each CSV term which needs to be matched in the input CSV list.
Edit: If you want OR behavior here, then we can use an alternation of lookaheads. To match either foo or bar as a CSV term we can try:
^(?:(?=.*\bfoo\b)|(?=.*\bbar\b)).*$
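A rough Python sketch of how both lookahead patterns could be generated from a word list, so the pattern does not have to be written out by hand. The helper names lookahead_all and lookahead_any are made up for this example.

```python
import re


def lookahead_all(words):
    """Build ^(?=.*\\bw1\\b)(?=.*\\bw2\\b)...: every word must be present."""
    checks = "".join(rf"(?=.*\b{re.escape(w)}\b)" for w in words)
    return re.compile("^" + checks + ".*$")


def lookahead_any(words):
    """Build ^(?:(?=.*\\bw1\\b)|(?=.*\\bw2\\b)|...): at least one word must be present."""
    alts = "|".join(rf"(?=.*\b{re.escape(w)}\b)" for w in words)
    return re.compile(rf"^(?:{alts}).*$")


both = lookahead_all(["foo", "bar"])
either = lookahead_any(["foo", "bar"])

print(bool(both.match("foo,bar,baz")))  # True
print(bool(both.match("foo,baz")))      # False
print(bool(either.match("baz,bar")))    # True
print(bool(either.match("baz,qux")))    # False
```

Note that these patterns only check that the listed terms are present; they do not reject other terms in the input.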

Extracting a numerical value from a paragraph based on preceding words

I'm working with some big text fields in columns. After some cleanup I have something like below:
truth_val: ["5"]
xerb Scale: ["2"]
perb Scale: ["1"]
I want to extract the number 2. I'm trying to match the string "xerb Scale" and then extract 2. I tried capturing the group including 2 as (?:xerb Scale:\s\[\")\d{1} and tried to exclude the matched group through a negative look ahead but had no luck.
This is going to be in a SQL query and I'm trying to extract the numerical value through a REGEXP_EXTRACT() function. This query is part of a pipeline that loads this information into the database.
Any help would be much appreciated!
You should match what you do not need to obtain in order to set the context for your match, and you need to match and capture what you need to extract:
xerb Scale:\s*\["(\d+)"]
                 ^^^^^
See the regex demo. In Presto, use REGEXP_EXTRACT to get the first match:
SELECT regexp_extract(col, 'xerb Scale:\s*\["(\d+)"]', 1); -- 2
                                                     ^^^
Note the 1 argument:
regexp_extract(string, pattern, group) → varchar
Finds the first occurrence of the regular expression pattern in string and returns the capturing group number group
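For comparison, the same match-the-context, capture-the-value idea in Python, in case you want to test the pattern outside the database. This is only a sketch: the variable col is a made-up stand-in for the column value, and Python's re module is not the engine Presto uses.

```python
import re

# Hypothetical column value, based on the sample in the question.
col = 'truth_val: ["5"]\nxerb Scale: ["2"]\nperb Scale: ["1"]'

# The literal prefix sets the context; only (\d+) is captured.
match = re.search(r'xerb Scale:\s*\["(\d+)"]', col)

# group(1) plays the role of the third argument of regexp_extract.
value = match.group(1) if match else None
print(value)  # 2
```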

How to duplicate regex search result within one line?

I have a csv table following the scheme:
"text1","text2",3
"text5","text?",5
"baa","foo",99
...
Which I need to transform to:
"text1","text2","-text2-",3
"text5","text?","-text?-",5
"baa","foo","-foo-",99
...
I'm sorry but I have no idea how to duplicate a part of a line using a regex.
I'm using VS Code find-replace engine.
How could I do this?
See regex101 demo.
Find: ^(\s*"[^"]*?","([^"]*?)",)
Replace: $1"-$2-",
Group 1: the first two values in each line, like "text1","text2",
Group 2: just the inner second value, like text2
Replace: Use Group 1 and then replicate Group 2 with surrounding "-Group2-"
Make sure you have this in your settings.json:
"search.usePCRE2": true,
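If you want to check the pattern outside VS Code, here is a sketch of the same find/replace in Python. Two things differ here: Python writes backreferences as \1 and \2 instead of $1 and $2, and re.MULTILINE is needed so that ^ matches at the start of every line. The variable names are made up.

```python
import re

csv_text = '"text1","text2",3\n"text5","text?",5\n"baa","foo",99'

# Group 1: the first two values including the trailing comma.
# Group 2: the bare text of the second value.
pattern = re.compile(r'^(\s*"[^"]*?","([^"]*?)",)', re.MULTILINE)

# Re-emit group 1, then the duplicated second value wrapped in "-...-".
result = pattern.sub(r'\1"-\2-",', csv_text)
print(result)
```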
"text1","text2",3
"text5","text?",5
Capture three groups: group 1, group 2 and group 3. The first two match word characters (A-Za-z0-9 and _) plus "?". I wasn't sure how long the trailing number can be, so I allowed 1 to 3 digits; adjust that to fit your data.
("[\w?]+"),"([\w?]+)",(\d{1,3})
Replace with the following:
$1,"$2","-$2-",$3
The result should be as follows:
"text1","text2","-text2-",3
"text5","text?","-text?-",5
Feel free to ask me if you have any questions.

Remove columns from CSV

I don't know anything about Notepad++ Regex.
This is the data I have in my CSV:
6454345|User1-2ds3|62562012032|324|148|9c1fe63ccd3ab234892beaf71f022be2e06b6cd1
3305611|User2-42g563dgsdbf|22023001345|0|0|c36dedfa12634e33ca8bc0ef4703c92b73d9c433
8749412|User3-9|xgs|f|98906504456|1534|51564|411b0fdf54fe29745897288c6ad699f7be30f389
How can I use a Regex to remove the 5th and 6th column? The numbers in the 5th and 6th column are variable in length.
Another problem is the User row can also contain a |, to make it even worse.
I can use a macro to fix this, but the file is a few millions lines long.
This is the final result I want to achieve:
6454345|User1-2ds3|62562012032|9c1fe63ccd3ab234892beaf71f022be2e06b6cd1
3305611|User2-42g563dgsdbf|22023001345|c36dedfa12634e33ca8bc0ef4703c92b73d9c433
8749412|User3-9|xgs|f|98906504456|411b0fdf54fe29745897288c6ad699f7be30f389
I am open for suggestions on how to do this with another program, command line utility, either Linux or Windows.
Match \|[^|]+\|[^|]+(\|[^|]+$)
Replace $1
Basically, anchor to the end of the line and remove the two columns just before the last one (columns [-3] and [-2]), keeping the last column. (I assume columns can't be empty. Replace + with * if they can.)
If you need finer control than that, I'd recommend writing a Java or Python script to manually parse and rewrite the file for you.
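A minimal Python sketch of the script route, reusing the pattern above. The function name drop_columns is made up, and the sample lines are the ones from the question.

```python
import re

# Two columns followed by the last column, anchored at the end of the line.
TAIL = re.compile(r"\|[^|]+\|[^|]+(\|[^|]+$)")


def drop_columns(line):
    """Remove the two columns just before the last one; keep everything else."""
    return TAIL.sub(r"\1", line.rstrip("\n"))


lines = [
    "6454345|User1-2ds3|62562012032|324|148|9c1fe63ccd3ab234892beaf71f022be2e06b6cd1",
    "3305611|User2-42g563dgsdbf|22023001345|0|0|c36dedfa12634e33ca8bc0ef4703c92b73d9c433",
    "8749412|User3-9|xgs|f|98906504456|1534|51564|411b0fdf54fe29745897288c6ad699f7be30f389",
]
for line in lines:
    print(drop_columns(line))
```

Because the pattern counts from the end of the line, extra | characters in the user field do not matter. For a file with millions of lines you would iterate over the open file object and write each converted line out, rather than loading everything into memory.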
I've captured three groups and given them names. If you use a replace utility like sed or vimregex, you can replace the remove group with nothing. Or you can use a programming language to concatenate keep_before and keep_after for the desired result.
^(?<keep_before>(?:[^|]+\|){3})(?<remove>(?:[^|]+\|){2})(?<keep_after>.*)$
You may have to remove the group namings and use \1 etc. instead, depending on what environment you use.
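As an example of such an environment difference, Python spells named groups (?P<name>...) rather than (?<name>...). Here is a sketch of the concatenation approach with that change. The function name drop_named is made up.

```python
import re

# Same pattern as above, with Python's (?P<name>...) named-group syntax.
NAMED = re.compile(
    r"^(?P<keep_before>(?:[^|]+\|){3})"
    r"(?P<remove>(?:[^|]+\|){2})"
    r"(?P<keep_after>.*)$"
)


def drop_named(line):
    """Concatenate keep_before and keep_after; return the line unchanged if it does not match."""
    m = NAMED.match(line)
    if m is None:
        return line
    return m.group("keep_before") + m.group("keep_after")


print(drop_named("6454345|User1-2ds3|62562012032|324|148|9c1fe63ccd3ab234892beaf71f022be2e06b6cd1"))
```

This pattern counts columns from the left, so on the third sample line, where the user field itself contains |, it removes f|98906504456| instead of the intended columns.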
In Notepad++, press Ctrl+H and enter the following in the dialog:
Find what: \|\d+\|\d+(\|[0-9a-z]+)$
Replace with: $1
Search mode: Regular Expression
Click replace and done.
Regex explanation:
\|\d+ : matches the first field, a | followed by digits
\|\d+ : matches the second field, a | followed by digits
(\|[0-9a-z]+) : matches and captures the field after the second number
$ : forces the match to end at the end of the line
Replacement:
$1 : replaces the whole match with the contents of the capturing group, i.e. whatever (\|[0-9a-z]+) matched

Regex - Combining positive and negative lookbehind

I am doing some replacements in some huge SSIS packages to reflect changes in table and column names.
Some of the tables have column names which are identical to the table names, and I need to match the column name without matching the table name.
So what I need is a way to match MyName in [MyName] but not in [dbo].[MyName]
(?<=\[)(MyName)(?=\]) matches both, and I thought that (?<!\[dbo\]\.)(?<=\[)(MyName)(?=\]) would do the trick, but it does not seem to work.
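Both lookbehinds are checked at the same position, immediately before MyName, and the text preceding that position ends in [dbo].[ rather than [dbo]., so you need to include the opening square bracket in the first lookbehind: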
(?<!\[dbo\]\.\[)(?<=\[)(MyName)(?=\])
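A quick way to see the difference between the two patterns is a Python sketch like the one below. The sample string is made up, and Python's re module stands in for whatever engine your replace tool uses. Both lookbehinds are fixed-width, which Python requires.

```python
import re

# Hypothetical snippet containing both the column and the table reference.
sql = "SELECT [MyName] FROM [dbo].[MyName]"

# Original attempt: the negative lookbehind looks for "[dbo]." right before
# MyName, but the character right before MyName is always "[".
attempt = re.compile(r"(?<!\[dbo\]\.)(?<=\[)(MyName)(?=\])")

# Fixed version: the opening bracket is part of the negative lookbehind.
fixed = re.compile(r"(?<!\[dbo\]\.\[)(?<=\[)(MyName)(?=\])")

attempt_starts = [m.start() for m in attempt.finditer(sql)]
fixed_starts = [m.start() for m in fixed.finditer(sql)]

print(attempt_starts)  # [8, 28]
print(fixed_starts)    # [8]
```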