RegEx to compare two substrings within a line - regex

I have consistently formatted lines that are essentially delimited field/value pairs, and I need to use a single regular expression to compare the values of two fields and match if they are the same. I don't need the values, just the match/no-match result. Here are two examples; the first should match, the second should not.
##field1##someValue##field2##someValue##otherField##otherValue
##field1##someValue##field2##someDIFFERENTValue##otherField##otherValue
I can match any field or value on a known pattern, and I can use lookarounds to do AND operations.
But I don't know how to extract and save the value of some pattern match, like field1##(.*?)##, and then use that "someValue" in an expression comparing it to the "contents" of field2##(.*?)##
Thanks in advance for any assistance.

You may use
##field1##([^#]*)##field2##(?=\1##)
##field1##((?:(?!##).)*?)##field2##(?=\1##)
Use the former if there can be no # in value contents. Else, use the latter regex.
See the regex demo
Details
##field1## - a literal string
([^#]*) - Capturing group 1 (later in the pattern, it can be referred to with \1 backreference): any 0 or more chars other than #
##field2## - a literal string
(?=\1##) - immediately to the right of the current location, there must be the same value as captured in Group 1 followed with ##. If end of string can occur, replace with (?=\1(?:##|$)).
The (?:(?!##).)*? pattern "roughly" means any text but ##. More precisely, it matches any single char (other than a line break char) that does not start a ## char sequence, repeated as few times as possible.
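A minimal Python sketch of the first pattern, assuming values never contain # and using the end-of-string variant of the lookahead mentioned above. The names SAME_VALUE and fields_match are made up for illustration.

```python
import re

# Assumption: values contain no '#', so the simpler [^#]* form is enough.
# (?:##|$) also lets field2 be the last field on the line.
SAME_VALUE = re.compile(r"##field1##([^#]*)##field2##(?=\1(?:##|$))")

def fields_match(line: str) -> bool:
    """Return True when field1 and field2 hold the same value."""
    return SAME_VALUE.search(line) is not None

print(fields_match("##field1##someValue##field2##someValue##otherField##otherValue"))
print(fields_match("##field1##someValue##field2##someDIFFERENTValue##otherField##otherValue"))
```

Only the boolean result is used, which fits the match/no-match requirement in the question.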

Related

MSBUILD RegexReplace get all text till 2nd last dot from end

I am working with ToolsVersion="3.5".
I want to match from the end of the string back to the second-to-last dot (.).
For example, for the given value 123.456.78.910.abcdefgh I want to get only 910.abcdefgh.
I tried with
<RegexReplace Input="$(big_number)" Expression="/(\w+\.\w+)$/gm" Replacement="$1" count="1">
<Output ItemName ="big_number_tail" TaskParameter="Output"/>
</RegexReplace>
But it returns the entire string.
Any idea what went wrong?
First of all, do not use a regex literal in a text attribute. When you define regex via strings, not code, regex literal notation (like /.../gm) is not usually used and in these cases / regex delimiters and g, m, etc. flags are treated as part of a pattern, and as a result, it never matches.
Besides, when you extract via replacing as here, you need to make sure you match the whole string with your pattern, and only capture the part you want to extract. Note you may have more than 1 capturing group, and then you could use $2, $3, etc. in the replacement.
You can use
<RegexReplace Input="$(big_number)" Expression=".*\.([^.]*\.[^.]*)$" Replacement="$1" count="1">
See the regex demo. Details:
.* - any zero or more chars other than line break chars, as many as possible
\. - a . char
([^.]*\.[^.]*) - Group 1 ($1 refers to this part): zero or more non-dot chars, a . char, and again zero or more chars other than dots
$ - end of string.
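The same "match everything, keep only the captured group" idea can be tried outside MSBuild. This is a sketch in Python, not the RegexReplace task itself; the function name is hypothetical, and Python writes the replacement backreference as \1 where the task uses $1.

```python
import re

def tail_after_second_last_dot(value: str) -> str:
    # The pattern consumes the whole string; only group 1 survives the replace.
    # If the input has fewer than two dots, nothing matches and it is returned unchanged.
    return re.sub(r".*\.([^.]*\.[^.]*)$", r"\1", value, count=1)

print(tail_after_second_last_dot("123.456.78.910.abcdefgh"))
```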

Regex for a set of different scenarios (first period with a complete word in front of it)

I have a set of strings that can come in different formats. I want to get the text between the first period and whichever delimiter character (a space, bracket, curly brace, etc.) precedes it.
For example:
if SCHEMA.COLUMN = 'XYZ' - should return SCHEMA
SUM(SCHEMA.COLUMN) - should return SCHEMA
[SCHEMA.COLUMN] - should return SCHEMA
select product_id decode (warehouse_id 'Apple','APPL', 'Microsoft', 'MSFT') from SCHEMA1.inventories a, SCHEMA2.quantity b where a.id = b.id - multiple periods in this but should return SCHEMA1
select product_id decode (warehouse_id '.','APPL', 'Microsoft', 'MSFT') from SCHEMA1.inventories a, SCHEMA2.quantity b where a.id = b.id - multiple periods in this but should return SCHEMA1
I am able to get the regex to return the string if there is one fixed begin char, but couldn't get it to work with multiple possible begin chars
\((.*?)\.
this is returning SCHEMA when the string is SUM(SCHEMA.column)
I was referring to some previous posts on this topic but couldn't succeed with those solutions
Previous Answers
Can someone suggest how this can be done ?
The OP has changed the problem: the dot must now be matched only outside quoted strings, and quotes inside those strings may be escaped.
Here is the regex that may be used:
^.*?\b(\w+)\.(?=(?:[^'\\]*'[^'\\]*(?:\\.[^'\\]*)*')*[^'\\]*$)
'[^'\\]*(?:\\.[^'\\]*)*' matches a quoted string ignoring escaped quotes in the string.
(?=...) makes sure the dot is matched outside a quoted string by asserting that only complete quoted strings (0 or more) and unquoted text lie between the current position and the end of the string.
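A small Python sketch of this quote-aware pattern, checked against the example from the question where a quoted '.' comes before the first real qualifier. OUTSIDE_QUOTES and first_qualifier are illustrative names, not part of the answer.

```python
import re

# Same pattern as above; the lookahead only succeeds when the rest of the
# string consists of complete quoted strings and unquoted text.
OUTSIDE_QUOTES = re.compile(
    r"^.*?\b(\w+)\.(?=(?:[^'\\]*'[^'\\]*(?:\\.[^'\\]*)*')*[^'\\]*$)"
)

def first_qualifier(sql: str):
    """Return the word before the first dot that is not inside quotes, or None."""
    m = OUTSIDE_QUOTES.search(sql)
    return m.group(1) if m else None

query = ("select product_id decode (warehouse_id '.','APPL', 'Microsoft', 'MSFT') "
         "from SCHEMA1.inventories a, SCHEMA2.quantity b where a.id = b.id")
print(first_qualifier(query))
```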
Original Solution:
You may use this regex and grab string from capture group #1:
^[^.]*\b(\w+)\.
RegEx Demo
RegEx Details:
^: Start
[^.]*: Match 0 or more non-dot characters
\b: word boundary
(\w+): Capture group #1 containing 1+ word characters
\.: Match a dot
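As a quick check, the original pattern can be run over the examples from the question. This is a sketch in Python with made-up names; it also shows the case the original pattern does not cover, where the first dot sits inside a quoted literal.

```python
import re

FIRST_DOT_WORD = re.compile(r"^[^.]*\b(\w+)\.")

def word_before_first_dot(text: str):
    """Return capture group 1 (the word right before the first dot), or None."""
    m = FIRST_DOT_WORD.search(text)
    return m.group(1) if m else None

for sample in ["if SCHEMA.COLUMN = 'XYZ'", "SUM(SCHEMA.COLUMN)", "[SCHEMA.COLUMN]"]:
    print(sample, "->", word_before_first_dot(sample))

# The first dot here is inside quotes and has no word in front of it,
# so the pattern finds nothing.
print(word_before_first_dot("decode (id '.','APPL') from SCHEMA1.inventories"))
```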
RegEx Demo 2
The following covers most (but not all) situations. It identifies the occurrence of an identifier followed by a . serving as a separator in a qualified name. The sought string is held in capture group #1.
\b(\w+)\.(?=[a-zA-Z_])
The problematic cases are preceding string literals that contain a . character. These should be skipped. Detecting and skipping string literals with a regex is complicated since in general you have to count matching delimiters and cater for escaped delimiters within a literal.
So this solution might suffice to serve your needs. It will fail if a part of a string literal matches \w\.[a-zA-Z_] but that usually does not happen: . in a punctuation role is usually followed by some non-letter (e.g. whitespace, delimiters).
This solution will also produce matches beyond the first one if global behavior cannot be turned off in the regex engine.
Demo (Regex 101)
Update
The following regex correctly skips over string literals that precede the first qualified name:
^[^']*?('[^\\']*((\\.)[^\\']*)*'[^']*?)*\b(\w+)\.(?=[a-zA-Z_])
The desired result is in capture group 4.
The pattern works by repeatedly matching an alternating sequence of literals and non-literals (the sequence may start with either kind) as a possibly empty prefix to the first qualified name. There is an obvious extension to 2 kinds of literal delimiters.
While the pattern works, I advise thoroughly considering alternative approaches before using it in production code, as it is hard to maintain.
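A short Python sketch for trying the literal-skipping pattern, reading the result from capture group 4 as described. SKIP_LITERALS and first_qualified_prefix are hypothetical names, and the inputs are short; behavior on long inputs with many quotes has not been measured here.

```python
import re

# Groups 1-3 belong to the literal-skipping prefix; group 4 is the identifier.
SKIP_LITERALS = re.compile(
    r"^[^']*?('[^\\']*((\\.)[^\\']*)*'[^']*?)*\b(\w+)\.(?=[a-zA-Z_])"
)

def first_qualified_prefix(text: str):
    """Return the first identifier followed by '.<letter or _>' outside literals."""
    m = SKIP_LITERALS.search(text)
    return m.group(4) if m else None

query = ("select product_id decode (warehouse_id '.','APPL', 'Microsoft', 'MSFT') "
         "from SCHEMA1.inventories a, SCHEMA2.quantity b where a.id = b.id")
print(first_qualified_prefix(query))
print(first_qualified_prefix("x = 'a.b' and S.c = 1"))
```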
Demo (Regex 101)

Repeated variable length regexp matching

I have an expression
AA-BB/CC/DD
I want to convert this to
<AA-BB> <AA-CC> <AA-DD>
All I can do is configure this as a regexp substitution. I can't figure it out.
AA should match at the beginning of a line. - and / are literal characters, BB,CC and DD are numbers, i.e \d+
So a first draft is ...
^(\w+)([\-/]\d+)+
but I want all matches, not just the greedy one.
(actually this one matches AA-BB-CC-DD too, but that's ok although it's not according to spec)
No, you can't do that with regex. Probably with .net, because there you can access all intermediate results of repeated capturing groups ...
Repeating a Capturing Group vs. Capturing a Repeated Group
That is the problem: if you use something like ^(\w+)([\-/]\w+)+, the value stored in group 2 is always only the last substring it matched. Your task is not possible with regex/replace.
I would do something like:
^(\w+)-([\w\/]+)
Then split the content of group 2 by "/" and combine group1 with each element of the array resulting from the split.
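A sketch of both points in Python, assuming you can run a few lines of code rather than a pure substitution. The function name expand is made up; the first print shows that a repeated group keeps only its last match, and the function applies the split-and-combine approach.

```python
import re

# A repeated capturing group only remembers its final iteration.
m = re.match(r"^(\w+)([\-/]\w+)+", "AA-BB/CC/DD")
print(m.group(2))  # only the last repetition is kept

def expand(expr: str) -> str:
    """Turn 'AA-BB/CC/DD' into '<AA-BB> <AA-CC> <AA-DD>'."""
    m = re.match(r"^(\w+)-([\w/]+)", expr)
    if m is None:
        raise ValueError(f"unexpected format: {expr!r}")
    prefix, rest = m.groups()
    # Split group 2 on '/' and pair every piece with group 1.
    return " ".join(f"<{prefix}-{part}>" for part in rest.split("/"))

print(expand("AA-BB/CC/DD"))
```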

Regexp: How to match a string that doesn't have any character repeated 3 times?

I'm trying to make a single pattern that will validate an input string. The validation rule does not allow any character to be repeated 3 or more times in a row.
For example:
Aabcddee - is valid.
Aabcddde - is not valid, because of 3 d characters.
The goal is to provide a RegExp pattern that could match one of above examples, but not both. I know I could use back-references such as ([a-z])\1{1,2} but this matches only sequential characters. My problem is that I cannot figure out how to make a single pattern for that. I tried this, but I don't quite get why it isn't working:
^(([a-z])\1{1,2})+$
Here I try to match any character that is repeated 1 or 2 times in the internal group, then I match that internal group if it's repeated multiple times. But it's not working that way.
Thanks.
To check that the string does not have a character (of any kind, even new line) repeated 3 times or more in a row:
/^(?!.*(.)\1{2})/s
You can also check that the input string does NOT have any match to this regex. In this case, you can also know the character being repeated 3 times or more in a row. Notice that this is exactly the same as above, except that the regex inside the negative look-ahead (?!pattern) is taken out.
/^.*(.)\1{2}/s
If you want to add validation that the string only contains characters from [a-z], and you consider aaA to be invalid:
/^(?!.*(.)\1{2})[a-z]+$/i
As you can see, the i flag (case-insensitive) affects how the captured text is compared against the current input.
Change + to * if you want to allow empty string to pass.
If you want to consider aaA to be valid, and you want to allow both upper and lower case:
/^(?!.*(.)\1{2})[A-Za-z]+$/
At first look, it might seem to be the same as the previous one, but since there is no i flag, the captured text is not subject to case-insensitive matching.
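The three variants above can be compared side by side. This is a sketch in Python, where the /s and /i flags become re.DOTALL and re.IGNORECASE; the constant names are made up for illustration.

```python
import re

# Any character, including newline, must not occur 3+ times in a row.
NO_TRIPLE = re.compile(r"^(?!.*(.)\1{2})", re.DOTALL)
# Letters only; the i flag also makes the backreference case-insensitive.
LETTERS_CI = re.compile(r"^(?!.*(.)\1{2})[a-z]+$", re.IGNORECASE)
# Letters only; without the i flag the backreference compares exact case.
LETTERS_CS = re.compile(r"^(?!.*(.)\1{2})[A-Za-z]+$")

for text in ["Aabcddee", "Aabcddde", "aaA"]:
    print(text,
          bool(NO_TRIPLE.search(text)),
          bool(LETTERS_CI.search(text)),
          bool(LETTERS_CS.search(text)))
```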
Below is a failed answer. You can ignore it, but you can read it for fun.
You can use this regex to check that the string does not have 3 repeated character (of any kind, even new line).
/^(?!.*(.)(?:.*\1){2})/s
You can also check that the input string does NOT have any match to this regex. In this case, you can also know the character being repeated more than or equal to 3 times. Notice that this is exactly the same as above, except that the regex inside the negative look-ahead (?!pattern) is taken out.
/^.*(.)(?:.*\1){2}/s
If you want to add validation that the string only contains characters from [a-z], and you consider aaA to be invalid:
/^(?!.*(.)(?:.*\1){2})[a-z]+$/i
As you can see, the i flag (case-insensitive) affects how the captured text is compared against the current input.
If you want to consider aaA to be valid, and you want to allow both upper and lower case:
/^(?!.*(.)(?:.*\1){2})[A-Za-z]+$/
At first look, it might seem to be the same as the previous one, but since there is no i flag, the captured text is not subject to case-insensitive matching.
From your question I get that you want to match
only strings consisting of chars from [A-Za-z] AND
only strings which have no sequence of the same character with a length of 3 or more
Then this regexp should work:
^(?:([A-Za-z])(?:(?!\1)|\1(?!\1)))+$
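The answer links a Perl example; as a rough equivalent, here is a Python sketch of the same pattern. AT_MOST_TWICE and is_valid are illustrative names only.

```python
import re

# Each letter is followed either by a different character (or the end),
# or by exactly one more copy of itself and then something different.
AT_MOST_TWICE = re.compile(r"^(?:([A-Za-z])(?:(?!\1)|\1(?!\1)))+$")

def is_valid(text: str) -> bool:
    return AT_MOST_TWICE.search(text) is not None

for text in ["Aabcddee", "Aabcddde", "aaA"]:
    print(text, is_valid(text))
```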
(Example in perl)

Using regex to find arbitrary length consecutive blocks

I have a string containing ones and zeroes. I want to determine if there are substrings of 1 or more characters that are repeated at least 3 consecutive times. For example, the string '000' has a length 1 substring consisting of a single zero character that is repeated 3 times. The string '010010010011' actually has 3 such substrings that each are repeated 3 times ('010', '001', and '100').
Is there a regex expression that can find these repeating patterns without knowing either the specific pattern or the pattern's length? I don't care what the pattern is nor what its length is, only that the string contains a 3-peat pattern.
Here's something that might work. However, it will only tell you whether there is a pattern repeated three times, and (I think) it can't be extended to tell you whether there are others:
/(.+).*?\1.*?\1/
Breaking that out:
(.+) matches any 1 or more characters, starting anywhere in the string
.*? allows any length of interposing other characters (0 or more)
\1 matches whatever was captured by the (.+) parentheses
.*? 0 or more of anything
\1 the original pattern, again
If you want the repetitions to occur immediately adjacent, then instead use
/(.+)\1\1/
… as suggested by @Buh Buh; the \1 vs. $1 notation may vary, depending on your regexp system.
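To see the difference between the two patterns, here is a small Python sketch (Python uses the \1 notation). ANYWHERE and ADJACENT are made-up names for the gapped and the back-to-back variant.

```python
import re

ANYWHERE = re.compile(r"(.+).*?\1.*?\1")  # three occurrences, gaps allowed
ADJACENT = re.compile(r"(.+)\1\1")        # three occurrences back to back

for text in ["000", "010010010011", "01010", "0110"]:
    print(text, bool(ANYWHERE.search(text)), bool(ADJACENT.search(text)))

# group(1) reports one repeated block that was found, not all of them.
print(ADJACENT.search("010010010011").group(1))
```

For 01010 only the gapped pattern matches (0 occurs three times, but never three times in a row), which is why the adjacent form is the one that fits the question.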
(.+)\1\1
The \ might be a different character depending on your language choice. This means match any string then try to match it again twice more.
The \1 means repeat the 1st match.
It looks weird, but this could be the solution:
/000000000|100100100|010010010|001001001|110110110|011011011|101101101|111111111/
This contains every possible 3-character block repeated three times. So your regular expression will match these numbers, for example:
10010010011
00010010011
10110110110
But not for these:
101010101010
001110111110
111000111000
And it doesn't matter where the sequence appears in the whole string.
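Rather than typing the alternation by hand, it can be generated. This is a sketch in Python; build_enumerated_pattern is a hypothetical helper, and the alternatives come out in a different order than in the list above, which does not change what is matched.

```python
import re
from itertools import product

def build_enumerated_pattern(block_len: int = 3, repeats: int = 3) -> re.Pattern:
    """Alternation of every 0/1 block of block_len chars, repeated `repeats` times."""
    blocks = ("".join(bits) for bits in product("01", repeat=block_len))
    return re.compile("|".join(block * repeats for block in blocks))

pattern = build_enumerated_pattern()
for text in ["10010010011", "00010010011", "10110110110",
             "101010101010", "001110111110", "111000111000"]:
    print(text, bool(pattern.search(text)))
```

Note that the list only covers blocks of exactly three characters, so a string such as 101010101010 (10 repeated) or 001110111110 (1 repeated) is not detected by it, while (.+)\1\1 would match both.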