Regex - Combining positive and negative lookbehind - regex

I am doing some replacements in some huge SSIS packages to reflect changes in table and column names.
Some of the tables have column names which are identical to the table names, and I need to match the column name without matching the table name.
So what I need is a way to match MyName in [MyName] but not in [dbo].[MyName].
(?<=\[)(MyName)(?=\]) matches both, and I thought that (?<!\[dbo\]\.)(?<=\[)(MyName)(?=\]) would do the trick, but it does not seem to work.

You need to include the opening square bracket in the first lookbehind:
(?<!\[dbo\]\.\[)(?<=\[)(MyName)(?=\])
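
The difference between the two patterns can be checked with a short script. This is only a sketch: it uses Python's re module rather than whatever engine the SSIS search-and-replace tool uses, and the sample string and the replacement name NewName are made up for illustration. The negative lookbehind is evaluated at the position right before MyName, where the preceding text ends in `[`, so a lookbehind that ends in `\.` can never match there.

```python
import re

# Pattern from the question: the negative lookbehind stops one character
# short, so it never sees "[dbo]." directly before "MyName".
BROKEN = re.compile(r"(?<!\[dbo\]\.)(?<=\[)(MyName)(?=\])")

# Pattern from the answer: the lookbehind also covers the opening bracket.
FIXED = re.compile(r"(?<!\[dbo\]\.\[)(?<=\[)(MyName)(?=\])")

# Hypothetical sample text, not taken from a real SSIS package.
sample = "SELECT [MyName] FROM [dbo].[MyName]"


def rename(pattern, text, new_name="NewName"):
    """Replace every match of `pattern` in `text` with `new_name`."""
    return pattern.sub(new_name, text)


broken_result = rename(BROKEN, sample)
fixed_result = rename(FIXED, sample)
```

With the question's pattern both occurrences are renamed; with the answer's pattern only the bare column reference is.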

Related

RegExp set contains one or multiple words

Is there a way in regular expressions to match a subset of words against a set of words separated by a separator, without creating a new pattern for every new word added to the set?
Right now I cannot think of anything else than creating a (?:{item1, item2, ...}) pattern for every extra item in the set (see example below).
Example matching a single word of the set:
Set: foo,bar,baz
Match: foo
RegExp: /^(foo|bar|baz)$/ <- MATCH
Example that will match a subset of words:
Set: foo,bar,baz
Match: foo,bar
RegExp: /^(foo|bar|baz)(?:,(foo|bar|baz)(?:,(foo|bar|baz))?)?$/ <- MATCH
The pattern grows rapidly when adding new items to the set. Is there some (magical) way to do this in a shorter version?
One general approach which looks slightly better than your current attempt would be to use lookaheads:
^(?=.*\bfoo\b)(?=.*\bbar\b).*$
You may add one lookahead assertion for each CSV term which needs to be matched in the input CSV list.
Edit: If you want OR behavior here, then we can use an alternation of lookaheads. To match either foo or bar as a CSV term we can try:
^(?:(?=.*\bfoo\b)|(?=.*\bbar\b)).*$
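
Since one lookahead is needed per term, the pattern can be generated from a list instead of being written by hand. The following is a minimal sketch in Python; the helper names all_terms_pattern and any_term_pattern are made up for this example.

```python
import re


def all_terms_pattern(terms):
    """Build ^(?=.*\\bT1\\b)(?=.*\\bT2\\b)...*$ : every term must be present."""
    lookaheads = "".join(rf"(?=.*\b{re.escape(t)}\b)" for t in terms)
    return re.compile(rf"^{lookaheads}.*$")


def any_term_pattern(terms):
    """Build ^(?:(?=.*\\bT1\\b)|(?=.*\\bT2\\b)).*$ : at least one term."""
    alternatives = "|".join(rf"(?=.*\b{re.escape(t)}\b)" for t in terms)
    return re.compile(rf"^(?:{alternatives}).*$")


both = all_terms_pattern(["foo", "bar"])
either = any_term_pattern(["foo", "bar"])
```

Note that these patterns only check that the listed terms occur somewhere in the input; they do not reject input that contains additional words outside the set.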

RegEx works everywhere except in Pentaho RegEx Evaluation Step

I have a couple of RegEx that work on the online regex websites but not in Pentaho. Could you please help?
Here's the string:
:6585d0f0ba88767ac3b590f719596d864d73e9c1:
harmonicbalance/src/harmonicbalance/HarmonicBalanceFlowModel.cpp
harmonicbalance/src/harmonicbalance/HbFlutterModel.cpp
:8302994b565553c83a048b8905ae597349d99627:
emp/src/emp/PhasePairSingleParticleReynoldsNumber.h
emp/src/emp/TomiyamaDragCoefficientMethod.cpp
:9da194f17ec08bb20ad1be8df68b78ca137ab18a:
combustion/src/combustion/ReactingSpeciesTransportBasedModel.cpp
combustion/src/complexchemistry/TurbulentFlameClosure.cpp
:6a59f0be1e347a65e525e58742bb304639ea9bc4:
meshing/src/meshing/SurfaceMeshManipulation.cpp
physics/src/discretization/FvIndirectRegionInterfaceManager.cpp
physics/src/discretization/FvIndirectRegionInterfaceManager.h
physics/src/discretization/FvRepresentation.cpp
physics/src/discretization/FvRepresentation.h
:64b7f6d36b11b6cd94c20cad53463b7deef8c85a:
resourceclient/src/resourceclient/ResourcePool.cpp
resourceclient/src/resourceclient/ResourcePool.h
resourceclient/src/resourceclient/RestClient.cpp
resourceclient/src/resourceclient/RestClient.h
resourceclient/src/resourceclient/test/ResourcePoolTest.cpp
I would like to capture two groups. The first group should extract each commit SHA1 and the second group should extract the file names.
Below are the expressions I tried:
(?:^:([A-Za-z0-9]+):|(?!^)\G)\n+([A-Za-z/.-]+)
https://regex101.com/r/3IBkPz/1
^:(\w+):\s+((?:\s*(?!:)[^\s]+)+)
https://regex101.com/r/oIoDvM/1
Thoughts?
AFAIK (as of PDI-8.0), the Regex Evaluation step does NOT support the regex 'g' modifier; your regex pattern must cover all of the text in order to match.
For example: the following pattern will not match anything in Regex Evaluation step:
:([0-9a-f]+):\s+([^:]+)
but if I prepend .* to this pattern and pick "Enable dotall mode":
.*:([0-9a-f]+):\s+([^:]+)
it will match the last commit (sha1 + filenames). You can try moving .* to the end of the original pattern, which will get you the first commit. So if you want to retrieve the full list of commits (sha1 + filenames) as you would with the g modifier, this step is probably not a solution for you.
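
The behaviour described above can be reproduced outside Pentaho. The sketch below is an analogy only: it assumes the step matches against the whole field, and uses Python's re.fullmatch as a stand-in for that, with re.DOTALL standing in for "Enable dotall mode". The shortened hashes and paths are made up for the example.

```python
import re

# Shortened, hypothetical input in the same shape as the question's data.
text = (
    ":6585d0f0:\n"
    "a/b/One.cpp\n"
    "a/b/Two.cpp\n"
    ":8302994b:\n"
    "c/d/Three.h\n"
)

core = r":([0-9a-f]+):\s+([^:]+)"

# Whole-string match with the bare pattern: fails, because the pattern
# covers only one commit and text remains after it.
bare = re.fullmatch(core, text)

# Leading .* (dotall): the greedy prefix swallows everything but the last commit.
last = re.fullmatch(r".*" + core, text, re.DOTALL)

# Trailing .* (dotall): the match starts at the first commit.
first = re.fullmatch(core + r".*", text, re.DOTALL)

# What a global ('g') search would return, for comparison.
every = re.findall(core, text)
```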
As the fields are basically split by colons ':' and new lines, you can probably try the following approach:
Use the Split field to rows step with Delimiter=':' and include rownum in the output. This rownum can be used to filter rows: even numbers are sha1 values and odd numbers are filenames.
Use the Analytic Query step to create a new field with LEAD = 1, so that you get sha1 and filenames in the same row.
Use the Calculator and Filter steps to calculate the remainder of rownum/2 and keep only rows with an odd rownum.
Use Split field to rows again to split filenames into individual filename values using "\n" (with "Delimiter is a Regular Expression" enabled). You might want to filter out the EMPTY filename, since the delimiter only supports one char.
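
The idea behind these steps (split on ':', pair each sha1 chunk with the chunk that follows it, then split the file list on newlines and drop empty entries) can be sketched in a few lines of Python. This is not Pentaho code, only an illustration of the data flow; the function name and the shortened sample data are made up.

```python
def commits_from_text(text):
    """Return (sha1, filename) pairs from ':sha1:' headers followed by file lists."""
    parts = text.split(":")
    # parts[0] is the empty string before the first colon; after that,
    # sha1 chunks and file-list chunks alternate.
    pairs = []
    for sha, files in zip(parts[1::2], parts[2::2]):
        for name in files.splitlines():
            if name:  # skip the empty entries produced by leading/trailing newlines
                pairs.append((sha, name))
    return pairs


# Shortened, hypothetical input in the same shape as the question's data.
sample = ":6585d0f0:\na/b/One.cpp\na/b/Two.cpp\n:8302994b:\nc/d/Three.h\n"
rows = commits_from_text(sample)
```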

Regular expression - remove last literal

I have the following table in PostgreSQL.
It is a test table with only one column, route, which holds route names like
I-95
US-95N
I-95 S
I want to remove the trailing direction literals from all the route names.
UPDATE <schema>.<table>
SET route= regexp_replace(route, '%[:digit:](S|N|E|W)', '%[:digit:]', 'ig');
No change happens in the records. Does anyone have any idea what I am doing wrong here?
To remove any single letter signifying a cardinal direction following immediately after a digit:
UPDATE tbl
SET route = regexp_replace(route, '(\d)[SNEW]', '\1', 'ig')
A positive lookbehind match would be even more elegant, but sadly only lookahead matches are implemented. So I use a back-reference to re-insert the first (captured) part from the match.
The bracket expression [SNEW] is simpler for the case than multiple branches (S|N|E|W), which would need non-capturing parentheses in this case: (?:S|N|E|W).
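
The substitution can be tried outside the database as well. The sketch below uses Python's re module as a stand-in for regexp_replace, which is an assumption made for illustration since the two regex engines differ in details: re.IGNORECASE plays the role of the 'i' flag, and re.sub replaces all matches like 'g'. The lookbehind variant is included only for comparison in an engine that supports it.

```python
import re

# Back-reference version, mirroring the pattern in the UPDATE statement.
DIRECTION_AFTER_DIGIT = re.compile(r"(\d)[SNEW]", re.IGNORECASE)

# Lookbehind version: the digit is asserted, not consumed, so nothing
# needs to be re-inserted.
LOOKBEHIND_VARIANT = re.compile(r"(?<=\d)[SNEW]", re.IGNORECASE)


def strip_direction(route):
    """Drop a direction letter that directly follows a digit."""
    return DIRECTION_AFTER_DIGIT.sub(r"\1", route)


def strip_direction_lookbehind(route):
    """Same result, using a lookbehind instead of a back-reference."""
    return LOOKBEHIND_VARIANT.sub("", route)


cleaned = [strip_direction(r) for r in ["I-95", "US-95N", "I-95 S"]]
```

Because the pattern requires the letter to follow the digit immediately, a value like "I-95 S" with a space before the direction is left unchanged.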

How can I match groups separated by other groups in regex?

I am writing a regex to match a list of items that follow a specific complex format, so the regex for an item is very long. The items in this list have to be separated by a comma, which can optionally be padded with either one space on the right or one space on both sides, so the regex for matching the delimiter is ( , )|(, ?). Also, I want the list to be between square brackets.
For example, it should match the following:
[]
[validItem]
[validItem,validItem, validItem]
But not the following:
[validItem,invalidItem]
[validItemvalidItem]
[validItem, validItem ]
The regex I currently have is: \[verylongregex(?:(?: , )|(?:, ?)verylongregex)*\], but I'd like to simplify this to include the regex pattern that matches the element format only once.
Does regex have a method to match X groups separated by another group?
Here is an answer. I don't know if it is what you are looking for, but here it is nonetheless.
1/ Assuming you want to capture the list in one group:
(\[(?:complexRegex(?: , |, ?|\]))+)
Demo: http://regex101.com/r/pW2oZ1/1
2/ Assuming you want all elements of the list matched separately, this is a much more complex thing (at least as far as I know...). Here is a working (complex) solution:
(?:\[|(?!\[)\G(?: , |, ?))(complexRegex)(?=(?:(?: , |, ?)complexRegex)*\])
Demo: http://regex101.com/r/iB3jD1/2
I don't have the time to write an explanation right now if it's needed. Ask for it in the comments if you want one, I'll write it later today. Sorry...
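
If the goal is mainly to avoid writing the long item pattern twice in the source, another option is to compose the regex from named pieces in the host language. This is a different technique from the two patterns above, shown here as a sketch in Python; the literal validItem stands in for the real, much longer item regex.

```python
import re

# Placeholder for the very long item regex; written only once.
ITEM = r"validItem"

# Delimiter from the question: " , " or "," with an optional space after it.
DELIM = r"(?: , |, ?)"

# Either an empty list, or one item followed by any number of
# delimiter + item pairs, all between square brackets.
LIST = re.compile(rf"\[(?:{ITEM}(?:{DELIM}{ITEM})*)?\]")


def is_valid_list(text):
    """True if the whole string is a well-formed bracketed list."""
    return LIST.fullmatch(text) is not None
```

The compiled pattern still contains the item regex twice, but it is maintained in a single place.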

Strategy advice for this regex (matching in the middle of lookahead and a lookbehind)

I am using positive lookbehind and lookahead to match a word between certain parts (FROM and TO strings).
.*(?<=FROM)\s+(.*?)\s+(?=TO).*
EDIT: That approach cannot be changed. Please take it as given; I am not looking for a workaround for the approach itself, thank you! It's more a theoretical question about how to deal with lookaheads and lookbehinds around the part being matched.
I'd like to input a string like
FROM table a, table2 b TO
and obtain table and table2 as \1. The a and b labels are optional.
My problem is that if I place something like (?:(\w+)\s*,?)+? for matching every table part, it seems like it's done backwards
http://regex101.com/r/mV4rD8
If I'm understanding what you want correctly, you don't need lookahead/behind. You can do:
FROM (?:(\w+)(?: \w)*(?:,)? )+TO
Of the three parts inside the outermost parentheses, the second and third need to be treated separately because they are optional for different reasons. The second is present if the a and b labels are present. The third is present if the table is not the last one in the list.
This will capture the table names as you described. So e.g.:
FROM table1 a, table2, table3 c TO
Will capture "table1", "table2" and "table3".
I used literal spaces, but you can replace them with \s if you prefer.
EDIT: With the lookahead and lookbehind still present, as per your requirement:
.*(?<=FROM)\s+(?:(\w+)(?:\s+\w)*(?:\s*,)?\s+)+(?=TO).*
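
One caveat when trying this out: in engines such as Python's re, a capture group inside a repeated group keeps only its last repetition, so \1 ends up holding the final table name rather than all of them. The sketch below shows that with the first pattern from this answer, followed by one possible way to collect every name in two steps. The helper table_names is made up for this example and is not part of the answer's regex.

```python
import re

CLAUSE = re.compile(r"FROM (?:(\w+)(?: \w)*(?:,)? )+TO")

sample = "FROM table1 a, table2, table3 c TO"
match = CLAUSE.search(sample)

# Group 1 was overwritten on every repetition; only the last value survives.
last_capture = match.group(1)


def table_names(text):
    """Collect all table names between FROM and TO in two steps."""
    segment = re.search(r"FROM\s+(.*?)\s+TO", text)
    if segment is None:
        return []
    # Each comma-separated entry is "name" or "name label"; keep the name.
    return [entry.split()[0] for entry in segment.group(1).split(",") if entry.strip()]


names = table_names(sample)
```

Engines that expose every repetition of a group (for example through a captures collection) can return all three names from the single pattern.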