I have a source string that looks like this: mID00231mID00008mID00231mID00054mID00013mID00008mID00065
The pattern I am trying to create, using this example, is: For the last occurrence of "mID00231" in the string, one or more occurrences of each of {mID00054, mID00013, mID00008, mID00065} must follow it (in any order).
Examples of matches:
mID00231mID00008mID00231mID00054mID00013mID00008mID00065
mID00231mID00013mID00054mID00008mID00065mID00008
Example of no match because of missing "mID00065":
mID00231mID00054mID00013mID00008
Example of no match because the last occurrence of "mID00231" is not followed by a "mID00054" and a "mID00008":
mID00231mID00013mID00065mID00054mID00008mID00231mID00013mID00065
I am fairly new to regex but usually arrive at something that works. This one has been very difficult. I tried this:
(?:mID00231)(?:(?=.*mID00054)(?=.*mID00013)(?=.*mID00008)(?=.*mID00065).*)
It works if there is only one occurrence of the first element (mID00231). If the element repeats, the pattern fails. Any help is appreciated.
You need to fail the match when the same value appears again later in the string, which a negative lookahead can do:
mID00231((?!.*mID00231)(?=.*mID00054)(?=.*mID00013)(?=.*mID00008)(?=.*mID00065).*)
         ^^^^^^^^^^^^^^
See the regex demo.
Details:
mID00231 - match a literal mID00231 text
( - start of the capturing group
(?!.*mID00231) - there cannot be mID00231 anywhere after 0+ any chars but a newline
(?=.*mID00054) - there must be mID00054 anywhere after 0+ any chars but a newline
(?=.*mID00013) - there must be mID00013 anywhere after 0+ any chars but a newline
(?=.*mID00008) - there must be mID00008 anywhere after 0+ any chars but a newline
(?=.*mID00065) - there must be mID00065 anywhere after 0+ any chars but a newline
.* - 0+ any chars but a newline
) - end of the capturing group.
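As a quick way to check the pattern against the examples from the question, here is a minimal Python sketch. The helper name tail_after_last_anchor is made up for illustration; the pattern is the one above, unchanged, since Python's re module supports both positive and negative lookaheads.

```python
import re

# Same pattern as in the answer; only the last mID00231 can satisfy the
# negative lookahead (?!.*mID00231).
LAST_ANCHOR_RE = re.compile(
    r"mID00231((?!.*mID00231)(?=.*mID00054)(?=.*mID00013)"
    r"(?=.*mID00008)(?=.*mID00065).*)"
)

def tail_after_last_anchor(s):
    """Return the text after the last mID00231 if all four IDs follow it,
    otherwise None. (Hypothetical helper, for illustration only.)"""
    m = LAST_ANCHOR_RE.search(s)
    return m.group(1) if m else None

if __name__ == "__main__":
    for sample in (
        "mID00231mID00008mID00231mID00054mID00013mID00008mID00065",
        "mID00231mID00054mID00013mID00008",
    ):
        print(sample, "->", tail_after_last_anchor(sample))
```

search is used rather than match so that the engine can skip earlier occurrences of mID00231 that fail the negative lookahead.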
I have fields which contain data in the following possible formats (each line is a different possibility):
AAA - Something Here
AAA - Something Here - D
Something Here
Note that the first group of letters (AAA) can be of varying lengths.
What I am trying to capture is the "Something Here" or "Something Here - D" (if it exists) using PCRE, but I can't get the Regex to work properly for all three cases. I have tried:
- (.*) which works fine for cases 1 and 2 but obviously not 3;
(?<= - )(.*) which also works fine for cases 1 and 2;
(?! - )(.+)| - (.+) works for cases 2 and 3 but not 1.
I feel like I'm on the verge of it but I can't seem to crack it.
Thanks in advance for your help.
Edit: I realized that I was unclear in my requirements. If there is a trailing " - D" (the letter in the data is arbitrary but should only be a single character), that needs to be captured as well.
About the patterns that you tried:
- (.*) This pattern matches the first occurrence of - and then the rest of the line. It will match too much for the second example, as the .* will also match the second occurrence of -
(?<= - )(.*) This pattern matches the same text as the first one, but without the leading - , because the lookbehind only asserts that it occurs directly to the left
(?! - )(.+)| - (.+) This pattern uses a negative lookahead (?! - ), which asserts that what is directly to the right is not - . As none of the examples start with - , the whole line is matched by .+ right after the negative lookahead, and the second part after the alternation | is never evaluated
If the first group of letters can be of varying length, you could make the match more specific by matching 1 or more uppercase characters with [A-Z]+, or 1+ word characters with \w+.
For a broader match, you could match 1 or more non-whitespace characters using \S+:
^(?:\S+\h-\h)?\K\S+(?:\h(?!-\h)\S+)*
Explanation
^ Start of string
(?:\S+\h-\h)? Optionally match the first group of non whitespace chars followed by - between horizontal whitespace chars
\K Clear the match buffer (Forget what is currently matched)
\S+ Match 1+ non whitespace characters
(?: Non capture group
\h(?!-\h) Match a horizontal whitespace char and assert what is directly to the right is not - followed by another horizontal whitespace char
\S+ Match 1+ non whitespace chars
)* Close non capture group and repeat 0+ times to match more "words" separated by spaces
Regex demo
Edit
To match an optional hyphen and trailing single character, you could add an optional non capturing group (?:-\h\S\h*)?$ and assert the end of the string if the pattern should match the whole string:
^(?:\S+\h-\h)?\K\S+(?:\h(?!-\h)\S+)*\h*(?:-\h\S\h*)?$
Regex demo
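\K and \h are PCRE features and are not available in every engine. Purely as an illustration, the same idea can be sketched in Python, whose built-in re module has neither: a capturing group stands in for \K, and the class [ \t] stands in for \h (an approximation, since \h also covers other horizontal whitespace). The function name extract_title is hypothetical.

```python
import re

# Python adaptation of the PCRE pattern above (an assumption, not the
# original answer): group 1 replaces \K, and [ \t] approximates \h.
TITLE_RE = re.compile(
    r"^(?:\S+[ \t]-[ \t])?"            # optional "AAA - " prefix
    r"(\S+(?:[ \t](?!-[ \t])\S+)*"     # words not separated by " - "
    r"[ \t]*(?:-[ \t]\S[ \t]*)?)$"     # optional trailing "- D"
)

def extract_title(field):
    """Return the wanted part of the field, or None if it does not match."""
    m = TITLE_RE.match(field)
    return m.group(1) if m else None

if __name__ == "__main__":
    for field in ("AAA - Something Here",
                  "AAA - Something Here - D",
                  "Something Here"):
        print(repr(field), "->", repr(extract_title(field)))
```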
You may use
^(?:.*? - )?\K.*?(?= - | *$)
^(?:.*?\h-\h)?\K.*?(?=\h-\h|\h*$)
See the regex demo
Details
^ - start of string
(?:.*? - )? - an optional non-capturing group matching any 0+ chars other than line break chars, as few as possible, up to and including the first space, hyphen, space sequence
\K - match reset operator
.*? - any 0+ chars other than line break chars as few as possible
(?= - | *$) - a positive lookahead requiring that either a space, hyphen, space sequence or 0+ spaces up to the end of the string follows immediately on the right.
Note that \h matches any horizontal whitespace chars.
^(?:[A-Z]+ - \K)?.*\S
demo
Since "Something Here" can be anything, there's no reason to specially describe the eventual last letter in the pattern. You don't need something more complicated.
With this pattern I assume that you are not interested in the trailing spaces, which is why I ended it with \S. If you want to keep them, remove the \S and change the previous quantifier to +.
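For engines without \K, the same pattern can be tried with a capturing group instead. The sketch below is a Python adaptation under that assumption (Python's re has no \K); field_value is a made-up helper name.

```python
import re

# Capturing-group adaptation of ^(?:[A-Z]+ - \K)?.*\S for Python's re,
# which has no \K: the wanted text is group 1 instead of the whole match.
FIELD_RE = re.compile(r"^(?:[A-Z]+ - )?(.*\S)")

def field_value(field):
    """Return the field without its optional 'AAA - ' prefix and without
    trailing whitespace, or None if the field is empty or all whitespace."""
    m = FIELD_RE.match(field)
    return m.group(1) if m else None

if __name__ == "__main__":
    print(field_value("AAA - Something Here - D"))
```

Because .* is greedy and the pattern must end on \S, any trailing " - D" is kept and only trailing whitespace is dropped.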
I have the following string:
this is a test string user:testuser,anotheruser hashtag:peach,phone,milk site:youtube.com,twitter.com flair:😂bobby😂
Currently the regex ([^:\s]+):([^:\s]+) matches all the filters with colon in between (user, hashtag, site, flair). How can I also grab the remaining "this is a test string" part as another match?
Demo:
https://regex101.com/r/L0T2GJ/11
You may add an alternative that matches any 0+ chars, as few as possible, from the start of the string up to the first key followed by a colon:
^.*?(?=\s+[^:\s]+:)|([^:\s]+):([^:\s]+)
^^^^^^^^^^^^^^^^^^^
See the regex demo
Details
^ - start of the string
.*? - any 0+ chars other than line break chars, as few as possible
(?=\s+[^:\s]+:) - the positive lookahead makes sure that, immediately to the right of the current position, there is
\s+ - 1+ whitespaces
[^:\s]+ - 1+ chars other than : and whitespace
: - a colon
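To see how the two alternatives can be told apart in code, here is a small Python sketch. It assumes the free text is the match in which group 1 did not participate; the function name split_query and the returned shape (a string plus a dict) are illustrative choices, not part of the original answer.

```python
import re

# Same pattern as above: free text at the start, or key:value filters.
FILTER_RE = re.compile(r"^.*?(?=\s+[^:\s]+:)|([^:\s]+):([^:\s]+)")

def split_query(text):
    """Return (free_text, filters) where filters maps key -> raw value."""
    free_text = None
    filters = {}
    for m in FILTER_RE.finditer(text):
        if m.group(1) is None:
            # First alternative matched: no capture groups took part.
            free_text = m.group(0)
        else:
            filters[m.group(1)] = m.group(2)
    return free_text, filters

if __name__ == "__main__":
    query = ("this is a test string user:testuser,anotheruser "
             "hashtag:peach,phone,milk site:youtube.com,twitter.com "
             "flair:😂bobby😂")
    print(split_query(query))
```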
I want to parse a nested structure like this one in MATLAB :
structure NAME_PART_1
Some content
block NAME_PART_2
Some other content
end NAME_PART_2
block NAME_PART_3
subblock NAME_PART_4
Some content++
end NAME_PART_4
end NAME_PART_3
end NAME_PART_1
structure
NAME_PART_5
end NAME_PART_5
First, I would like to extract the content of each structure. It's quite easy because a structure content is always between "structure NAME" and "end NAME".
So, I would like to use regex. But I don't know in advance what the structure name will be.
So, I wrote my regex like this :
\bstructure\s+([\w.-]*)((?:\s|.)*)\bend\b\s+XXXX
But I don't know what to put in place of "XXXX" in order to refer back to what the first capture group of this regex matched. Is that even possible?
Try this Regex:
structure\s+([\w.-]+)\s*((?:(?!end\s+\1)[\s\S])*)end\s+\1
Click for Demo
Explanation:
structure - matches structure
\s+ - matches 1+ occurrences of a white-space
([\w.-]+) - matches 1+ occurrences of either a word character or a . or a -. This sub-match which contains the structure name is captured in Group 1.
\s* - matches 0+ occurrences of a white-space
((?:(?!end\s+\1)[\s\S])*) - Tempered Greedy Token - matches 0+ occurrences of any character [\s\S] that does not start the sequence end followed by whitespace and the Group 1 contents \1, i.e. the structure name. This sub-match is captured in Group 2, which contains the contents of the structure
end\s+\1 - matches the word end followed by 1+ white-spaces followed by Structure Name contained in Group 1 \1.
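The explanation above can be tried out with a short script. The sketch below uses Python purely for illustration (the question is about MATLAB, whose regex syntax may differ in details); the sample text is a shortened version of the one in the question, and extract_structures is a hypothetical helper name.

```python
import re

# Same pattern as in the answer: \1 refers back to the structure name.
STRUCT_RE = re.compile(
    r"structure\s+([\w.-]+)\s*((?:(?!end\s+\1)[\s\S])*)end\s+\1"
)

# Shortened sample based on the question's input.
SAMPLE = """structure NAME_PART_1
Some content
block NAME_PART_2
Some other content
end NAME_PART_2
end NAME_PART_1
structure
NAME_PART_5
end NAME_PART_5
"""

def extract_structures(text):
    """Return (name, body) pairs for each structure found in text."""
    return [(m.group(1), m.group(2)) for m in STRUCT_RE.finditer(text)]

if __name__ == "__main__":
    for name, body in extract_structures(SAMPLE):
        print(name, repr(body))
```

The inner "end NAME_PART_2" does not stop the match, because the lookahead only rejects positions where end is followed by the captured name.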
Apart from using a backreference \1 to refer to what was captured, you might replace the alternation in the capturing group ((?:\s|.)*) with a newline followed by 0+ characters, repeated and captured as a whole: ((?:\n.*)+)
You might also omit the word boundary after end (end\b\s+), since 1+ whitespace characters must follow end anyway, and instead add a word boundary at the very end so that \1 is not part of a longer name.
\bstructure\s+([\w.-]+)((?:\n.*)+)\bend\s+\1\b
Regex demo
Explanation
\bstructure\s+ Match structure followed by 1+ whitespace chars
([\w.-]+) Capture in a group repeating 1+ times any of the listed chars
( Capturing group
(?:\n.*)+ Match newline followed by 0+ times any char except a newline
) Close capturing group
\bend Match end
\s+\1\b Match 1+ times a whitespace char followed by a backreference to group 1 and end with a word boundary.
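The effect of the trailing word boundary can be shown with a tiny experiment. This is a Python sketch under the assumption that the input uses \n line endings; the two input strings are made up for the demonstration, and LOOSE_RE is simply the same pattern with the final \b removed.

```python
import re

# Pattern from the answer, and a variant without the final \b for comparison.
STRICT_RE = re.compile(r"\bstructure\s+([\w.-]+)((?:\n.*)+)\bend\s+\1\b")
LOOSE_RE = re.compile(r"\bstructure\s+([\w.-]+)((?:\n.*)+)\bend\s+\1")

# Hypothetical inputs: in the first, the only closing line names a
# different, longer block (AB instead of A).
MISMATCHED = "structure A\nx\nend AB"
MATCHED = "structure A\nx\nend A"

def body_of(pattern, text):
    """Return the captured body (group 2), or None when nothing matches."""
    m = pattern.search(text)
    return m.group(2) if m else None

if __name__ == "__main__":
    print("strict, mismatched:", body_of(STRICT_RE, MISMATCHED))
    print("loose, mismatched: ", repr(body_of(LOOSE_RE, MISMATCHED)))
    print("strict, matched:   ", repr(body_of(STRICT_RE, MATCHED)))
```

Without \b, "end AB" is accepted as the end of structure A because \1 matches the first letter of AB; with \b it is rejected.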
I have a string which contains the rego number of the car like
1FX9JE - 2012 Audi A3 Ambition Sportback MY12 Stronic
I would like to match everything except the rego number, so anything after the dash.
The regex I came up with is (php)
\s.[^-]*$
My initial regex matches everything after the dash only if the string contains exactly one dash. For example: https://regex101.com/r/Jao8W0/1
However, if the string has more than one dash, the regex no longer works.
For example : https://regex101.com/r/Jao8W0/2
Is there any way to match everything after the first dash, even when the string contains additional dashes after it?
Thank you
Try this Regex:
^[^-\r\n]+-\s*\K.*$
Click for Demo
Explanation:
^ - asserts the start of the string
[^-\r\n]+ - matches 1+ occurrences of any character that is neither a - nor a newline
-\s* - matches the first - in the string followed by 0+ whitespaces
\K - forgets everything matched so far
.* - matches 0+ occurrences of any character
$ - asserts the end of the string
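\K is available in PHP (PCRE) but not in every engine. As a hedged illustration of the same logic elsewhere, the sketch below uses Python, whose re module lacks \K, and captures the wanted text in group 1 instead. The helper name after_first_dash and the second sample string are made up for the example.

```python
import re

# Adaptation of ^[^-\r\n]+-\s*\K.*$ with a capture group in place of \K.
AFTER_FIRST_DASH_RE = re.compile(r"^[^-\r\n]+-\s*(.*)$")

def after_first_dash(s):
    """Return everything after the first dash (and following whitespace),
    or None when the string has no dash preceded by other text."""
    m = AFTER_FIRST_DASH_RE.match(s)
    return m.group(1) if m else None

if __name__ == "__main__":
    print(after_first_dash(
        "1FX9JE - 2012 Audi A3 Ambition Sportback MY12 Stronic"))
    # Hypothetical input with a second dash, which is kept in the result.
    print(after_first_dash("1FX9JE - 2012 Audi A3 - Ambition"))
```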
If there is exactly one space after the dash, you can use this pattern:
(?<=\-\s)(.*)
If there may be more than one space, use this pattern instead and take group 1 from the match:
(?<=\-)\s*(.*)
(?<=...) Ensures that the given pattern will match, ending at the
current position in the expression. The pattern must have a fixed
width. Does not consume any characters.
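The fixed-width requirement is the reason the second pattern keeps \s* outside the lookbehind. The Python sketch below illustrates this with the built-in re module, which rejects a variable-width lookbehind at compile time; other engines may behave differently, and the function name is made up for the demonstration.

```python
import re

# The two patterns from the answer.
ONE_SPACE_RE = re.compile(r"(?<=\-\s)(.*)")
ANY_SPACES_RE = re.compile(r"(?<=\-)\s*(.*)")

def variable_width_lookbehind_is_rejected():
    """Return True if Python's re refuses \\s* inside a lookbehind."""
    try:
        re.compile(r"(?<=\-\s*).*")
    except re.error:
        return True
    return False

if __name__ == "__main__":
    print(ONE_SPACE_RE.search("1FX9JE - 2012 Audi A3").group(1))
    print(ANY_SPACES_RE.search("1FX9JE -   2012 Audi A3").group(1))
    print(variable_width_lookbehind_is_rejected())
```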
I think I have solved this, but I'm wondering if anyone sees a flaw or a better method:
Using a regular expression in Notepad++ I'm trying to remove all strings that contain static and variable info like this:
{start of line},1,NRAG-E21-PRDCT-DT-CRWLR-8416 Result Data,NRAG-E21-PRDCT-DT-CRWLR-8416 Result Data,1,http://www.url.com/product/10E026,
Note: both ,1, strings are variable as well (,1, ,2, ,3, etc.).
The advantage I have is that the pattern appears at the end of the string, just before the comma, and is always digits, uppercase letters, digits: [0-9] [A-Z] [0-9]
It therefore seems that this should work:
^.*?\/[0-9]+[A-Z]+[0-9]+,
That selects the start of the line ^ followed by everything before the pattern that looks like /10E026 and the comma at the end.
Does anybody see a flaw or a better way to find a string like that?
"That selects the start of the line ^ followed by everything before the pattern that looks like /10E026 and the comma at the end."
That is not so. Your ^.*?\/[0-9]+[A-Z]+[0-9]+, matches the start of a line (^), then any 0+ chars other than a newline, as few as possible, then a /, 1+ digits, 1+ uppercase ASCII letters, 1+ digits, and a comma. Because .*? is lazy, the match stops at the first place in the line where that sequence occurs, which is not necessarily the last one.
It seems you need to match up to the last occurrence of the /xxAAAxxx, pattern:
^.*/[0-9]+[A-Z]+[0-9]+,
See the regex demo
Pattern details:
^ - start of a line (in Notepad++, ^ matches line start by default)
.* - 0+ any chars but a newline, greedily, up to the last...
/ - forward slash (no need escaping here)
[0-9]+[A-Z]+[0-9]+ - 1+ digits, 1+ uppercase letters, 1+ digits
, - a comma.
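The difference between the lazy and the greedy version only shows up when the /digits-letters-digits, sequence occurs more than once in a line. Here is a small Python sketch of that case; the sample line is invented for the demonstration, and re.MULTILINE is used so that ^ matches at each line start, as the answer says Notepad++ does by default.

```python
import re

LAZY = r"^.*?/[0-9]+[A-Z]+[0-9]+,"    # pattern from the question
GREEDY = r"^.*/[0-9]+[A-Z]+[0-9]+,"   # pattern from the answer

def strip_prefix(pattern, text):
    """Remove whatever the pattern matches at the start of each line."""
    return re.sub(pattern, "", text, flags=re.MULTILINE)

# Hypothetical line with two /digits-letters-digits, sequences.
SAMPLE = "a/1B2,b/3C4,rest"

if __name__ == "__main__":
    print("lazy:  ", strip_prefix(LAZY, SAMPLE))
    print("greedy:", strip_prefix(GREEDY, SAMPLE))
```

On a line with a single occurrence, such as the one in the question, both patterns remove the same text.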