Regex is matching second occurrence. I need it to match first occurrence

This is my regex code:
.*(X.*)\s(.*?)\$
This is my data string:
1247.P1.06.Z01.0020N.X396X111.Y008 1247.P1.06.Z01.0020N$M234477$
This is properly grabbing the second item that ends with the first $ sign:
1247.P1.06.Z01.0020N
But for the first string, I want it to grab:
X396X111.Y008
Instead it is grabbing:
X111.Y008
So I want it to get the first X and everything up to the space. But the second X is triggering the match.
The string starting with "X" is always 13 characters, so I tried specifying the length, but the match still started at the second X.
I am fine with either pattern:
Start with the first X and end with the space.
Start with the first X and grab 13 characters.
Thank you.

Get rid of .* at the beginning of the regular expression. It's greedy, so it's skipping over the longest possible prefix that allows the rest of the regular expression to match. That forces the rest to get the last occurrence instead of the first.
In general, it's not necessary to put .* at the beginning or end of a regular expression. The engine looks for the pattern anywhere in the input, so text around the match is simply ignored.
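A minimal sketch of the difference, using the pattern and data from the question. Python's re module is an assumption here, since the question does not name a regex flavor, and the variable names are illustrative.

```python
import re

data = "1247.P1.06.Z01.0020N.X396X111.Y008 1247.P1.06.Z01.0020N$M234477$"

# Original pattern: the greedy leading .* swallows as much as it can,
# so the first capture group can only start at the last X.
greedy = re.search(r".*(X.*)\s(.*?)\$", data)

# Without the leading .*, the search begins the match at the first X.
fixed = re.search(r"(X.*)\s(.*?)\$", data)

print(greedy.group(1))  # X111.Y008
print(fixed.group(1))   # X396X111.Y008
print(fixed.group(2))   # 1247.P1.06.Z01.0020N
```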

Your match is too loose. A stricter regex could be:
X\S+\s
which matches an X, then every non-whitespace character up to and including the next whitespace character.
Demo: https://regex101.com/r/Jl2BJS/2/
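If the ID is always 13 characters (including the leading X), you can do:
X.{12}
Note that X.{13} matches X plus 13 more characters, so it would also include the trailing space.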
Demo: https://regex101.com/r/Jl2BJS/3/
Alternatively, removing the leading .*, or making it non-greedy with ? or the U modifier, would also work.
Demo: https://regex101.com/r/Jl2BJS/4/ or https://regex101.com/r/Jl2BJS/5/
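The alternatives above can be compared side by side. This is a sketch in Python (an assumed flavor; the demos use regex101), and it drops the trailing \s so the space is not part of the result. The ID is 13 characters including the X, so the fixed-length form is X followed by 12 more characters.

```python
import re

data = "1247.P1.06.Z01.0020N.X396X111.Y008 1247.P1.06.Z01.0020N$M234477$"

# X followed by non-whitespace characters; stops right before the space.
by_whitespace = re.search(r"X\S+", data).group()

# X plus 12 more characters = 13 characters in total.
by_length = re.search(r"X.{12}", data).group()

# Keeping the leading .* but making it lazy also lands on the first X.
lazy = re.search(r".*?(X.*)\s(.*?)\$", data).group(1)

print(by_whitespace, by_length, lazy)  # X396X111.Y008 three times
```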

Related

Removing last character from a line using regex

I just started learning regex and I'm trying to understand how it possible to do the following:
If I have:
helmut_rankl:20Suzuki12
helmut1195:wasserfall1974
helmut1951:roller11
Get:
helmut_rankl:20Suzuki1
helmut1195:wasserfall197
helmut1951:roller1
I tried using .$ which does match the last character of a string, but I can't get it to give me the rest of the line without that character.
How do I get these results from the input?
You could match the whole line except its last character by asserting, with a lookahead, that a single character remains to the right. This version requires the match itself to contain at least one character:
.+(?=.)
If you also want to match empty strings:
.*(?=.)
This will do what you want with regex's match function.
^(.*).$
Broken down:
^ matches the start of the string
( and ) denote a capturing group. The matches which fall within it are returned.
.* matches everything, as much as it can.
The final . matches any single character (i.e. the last character of the line)
$ matches the end of the line/input
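Both answers can be tried on the sample lines. The following is a rough sketch in Python (the question does not say which tool is used, so the re module and the variable names are assumptions), with a third variant that deletes the last character of every line in a multi-line string.

```python
import re

lines = ["helmut_rankl:20Suzuki12", "helmut1195:wasserfall1974", "helmut1951:roller11"]

# Lookahead version: match everything that still has one character after it.
with_lookahead = [re.match(r".+(?=.)", line).group() for line in lines]

# Capturing-group version: group 1 holds everything except the last character.
with_group = [re.match(r"^(.*).$", line).group(1) for line in lines]

# Replacement version: with re.MULTILINE, $ also matches at the end of every
# line, so .$ is the last character of each line and can simply be deleted.
text = "\n".join(lines)
trimmed = re.sub(r".$", "", text, flags=re.MULTILINE)

print(with_lookahead[0])  # helmut_rankl:20Suzuki1
print(trimmed)
```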

Trying to combine two Regex

I'm trying to combine two working regex patterns into one. Please let me know the correct syntax and if this can be better written.
Pattern 1: (?P<date>.*)\s+(?P<timezone>.*)\|.*\|.*\|(?P<ip>[\w*.:-]+)\|.*\|
Pattern 2: (?P<path>[^\/]+(?=\-[^\/-]*$))
Sample line:
06/Mar/2020:00:01:04 -0500|/TESTSTREAM|5766764|4.2.2.1|123290|path1/path2/x-fr-US.OPEN.1-Turtle-2020.30.04-64.mp3
The first expression matches the start of the string, the second matches the end, you can combine them by putting a non-greedy .*? between them, like this:
(?P<date>.*)\s+(?P<timezone>.*)\|.*\|.*\|(?P<ip>[\w*.:-]+)\|.*\|.*?(?P<path>[^\/]+(?=\-[^\/-]*$))
As you can see here this expression works, but it takes 1660 steps to match the string. This is because each .* between the | characters first captures the whole string up to the end, and then steps back character by character to find the match.
If you use the non-greedy modifier here (.*?), the regex engine will initially match an empty string and then move forward character by character until it finds the matching |. This reduces the number of steps to 1183.
However, if you want to remove this backtracking (or forward-tracking) altogether, you can skip as many non-| characters as possible in one go with [^|]*. Similarly, we can replace the other .* patterns in the regex. The resulting regex finds a match in just 47 steps, more than 30 times fewer than the original regex:
(?P<date>\S*)\s+(?P<timezone>[^|]*)\|[^|]*\|[^|]*\|(?P<ip>[\w*.:-]+)\|[^|]*\|(?:[^\/\n]*\/)*(?P<path>.*)-.*
Update 2020-03-09
If you want to keep the last slash you can use this regex:
(?P<date>\S*)\s+(?P<timezone>[^|]*)\|[^|]*\|[^|]*\|(?P<ip>[\w*.:-]+)\|[^|]*\|.*?(?P<path>\/[^\/]*)-[^\/]*
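As a quick check, the 47-step pattern can be run against the sample line. This sketch assumes Python's re module, which accepts the (?P<name>...) syntax used in the question; the slashes are left unescaped because Python does not need \/, and the variable names are illustrative.

```python
import re

line = ("06/Mar/2020:00:01:04 -0500|/TESTSTREAM|5766764|4.2.2.1|123290|"
        "path1/path2/x-fr-US.OPEN.1-Turtle-2020.30.04-64.mp3")

# Negated character classes ([^|]*) walk straight to the next delimiter
# instead of overshooting and backtracking the way .* does.
pattern = re.compile(
    r"(?P<date>\S*)\s+(?P<timezone>[^|]*)\|[^|]*\|[^|]*\|"
    r"(?P<ip>[\w*.:-]+)\|[^|]*\|(?:[^/\n]*/)*(?P<path>.*)-.*"
)

fields = pattern.match(line).groupdict()
print(fields["date"])      # 06/Mar/2020:00:01:04
print(fields["timezone"])  # -0500
print(fields["ip"])        # 4.2.2.1
print(fields["path"])      # x-fr-US.OPEN.1-Turtle-2020.30.04
```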

Regex for Removing Everything Before Certain Comma Position

I'm trying to remove and replace everything before the 13th comma in an array like so:
{1,1,0,0,0,4,0,0,0,0,20,4099,4241,706,706,714,714,817,824,824,824,2,2,2,2,1,1,1,1},
to where it becomes:
{706,706,714,714,817,824,824,824,2,2,2,2,1,1,1,1},
Reference: I'm using regex in Notepad++.
I found this regex string to match everything after a certain comma to the end of the line:
,[^,]*,[^,]*,[^,]*,[^,]*,[^,]*,[^,]*,[^,]*$
But how do I turn it around to start from the beginning?
I appreciate your time and help, thank you.
Whereas $ matches the end of the subject string, ^ matches the beginning. So if you want to match up to and including the 13th comma:
^[^,]*,[^,]*,[^,]*,[^,]*,[^,]*,[^,]*,[^,]*,[^,]*,[^,]*,[^,]*,[^,]*,[^,]*,[^,]*,
Replace with "{".
You may use
{(?:[^,}]*,){13}
Replace with a mere {. See the regex demo. This version will work correctly even if you have {...} substrings spanning across lines and having fewer than 13 items in between.
Details
{ - a {
(?:[^,}]*,){13} - 13 consecutive occurrences of
[^,}]* - 0+ chars other than , and } (the } is important to avoid overflowing from one {...} substring into another)
, - a comma
You may also use
{\K(?:[^,}]*,){13}
And replace with an empty string. See another regex demo. You do not need to replace with { because \K omits the first { from the match, and it is thus kept in the final text.
Try the following find and replacement:
Find:
\{(?:[^,]*,){13}(.*)
Replace:
{$1
The above pattern could be slightly adjusted depending on what your expectations are for where this bracketed string might appear, edge cases you want to cover/avoid, etc.
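Outside Notepad++, the same find-and-replace can be sketched in Python. This is an illustration under assumptions: Python's built-in re module does not support \K, so the sketch uses the variant that replaces the match with {, and the second input is a hypothetical short row added to show the effect of excluding } from the character class.

```python
import re

row = "{1,1,0,0,0,4,0,0,0,0,20,4099,4241,706,706,714,714,817,824,824,824,2,2,2,2,1,1,1,1},"

# Match the opening brace plus 13 "item," groups, then put the brace back.
# [^,}] keeps the match from running past the closing brace.
pattern = re.compile(r"\{(?:[^,}]*,){13}")

trimmed = pattern.sub("{", row)

# Hypothetical row with fewer than 13 items: nothing matches, nothing changes.
short = pattern.sub("{", "{1,2,3},")

print(trimmed)  # {706,706,714,714,817,824,824,824,2,2,2,2,1,1,1,1},
print(short)    # {1,2,3},
```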

Why there are two matches

I expected one match, but there are two. That's strange. I want to know why.
Why are you surprised? .* matches any number of characters, including 0.
So you get one match that contains the entire line, and a second match that contains the empty string between the first match and the end of the string.
Regular expressions don't just deal with characters, but also with positions between characters (the tokens that match such positions are known as anchors). For example ^ matches the position before the first character, $ matches the position after the last character in a string.
A regex engine "walks through" a string, starting from the position before the first character. It then steps forward one character at a time.
For example, when applying the regex .* to "Hello", the regex engine starts before the H. It then matches Hello - after that .* can't match any more characters, so the regex engine returns "Hello" as the first match. The regex engine is now positioned after the o. If you call it again and ask it to match, it will succeed in returning a match because you're asking it to match any string, even an empty one, from the current position - and that's possible.
Why doesn't the regex engine return an infinite number of empty strings, then? It checks whether the last match was started from the end of the string, and if it was, no further matches will be attempted.
Some languages don't even try a regex match once from the final position in a string (Ruby seems to be one example), but I'd say it's more correct to return two matches.
Since it appears more clarification is necessary: The regex engine steps through the string along the positions visualized by |s below:
"|H|e|l|l|o|"
 ^         ^
 |         Position after the last character
 Position before the first character
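The two matches are easy to reproduce. A small sketch in Python (an assumed language; the question does not show its code or pattern, so .* on "Hello" is taken from the explanation above):

```python
import re

# .* can match zero characters, so after consuming "Hello" the engine finds
# one more (empty) match at the position after the final "o".
star_matches = re.findall(r".*", "Hello")

# .+ needs at least one character, so the empty match at the end disappears.
plus_matches = re.findall(r".+", "Hello")

# Start and end offsets of each .* match.
spans = [m.span() for m in re.finditer(r".*", "Hello")]

print(star_matches)  # ['Hello', '']
print(plus_matches)  # ['Hello']
print(spans)         # [(0, 5), (5, 5)]
```

If only the non-empty match is wanted, using .+ instead of .* is the simplest change.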

Regular exp to match string from beginning until certain char is met

I have some long string where i'm trying to catch a substring until a certain character is met.
Lets suppose I have the following string, and I would like to get the text until the first ampersand.
abc.8965.aghtj&hgjkiyu5.8jfhsdj
I would like to extract what is present before the ampersand so: abc.8965.aghtj
I thought this would work:
grep '^.*&{1}'
I would translate it as
^ start of string
.* match whatever chars
&{1} until the first ampersand is matched
Any advice?
I'm afraid this will take me weeks
{1} does not match the first occurrence; instead it means "match exactly one of the preceding pattern/character", which is identical to just matching the character (&{3} would match &&&).
In order to match the first occurrence of &, you need to use .*?:
grep '^.*?&'
Normally, .* is greedy, meaning it matches as much as possible. This means your pattern would match the last ampersand rather than the first one. .*? is the non-greedy version, matching as little as possible while fulfilling the pattern.
Update: That syntax may not be supported by grep. Here is another option:
'^[^&]*&'
It matches anything that is not an ampersand, up to the first ampersand.
You also may have to enable extended regular expression in grep (-E).
Try this one:
^.*?(?=&)
It won't include the ampersand itself, just the text before it.
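The patterns from both answers can be compared in a short sketch. Python's re module is used here as an assumption (the question uses grep, whose support for .*? and lookaheads depends on the mode), and the string "a&b&c" is a hypothetical input with two ampersands, added to make the greedy and lazy results differ.

```python
import re

text = "abc.8965.aghtj&hgjkiyu5.8jfhsdj"
two = "a&b&c"  # hypothetical input with two ampersands

greedy = re.match(r"^.*&", two).group()      # runs to the last &
lazy = re.match(r"^.*?&", two).group()       # stops at the first &
negated = re.match(r"^[^&]*&", two).group()  # same result without a lazy quantifier

# Lookahead: the & must follow, but it is not part of the match.
before = re.match(r"^.*?(?=&)", text).group()

# &{3} means "three ampersands in a row", not "the third ampersand".
triple = re.search(r"&{3}", "x&&&y").group()

print(greedy, lazy, negated)  # a&b& a& a&
print(before)                 # abc.8965.aghtj
print(triple)                 # &&&
```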