Why are there two matches? - regex

I expected a single match, but there are two. That's strange. I want to know why.

Why are you surprised? .* matches any number of characters, including 0.
So you get one match that contains the entire line, and a second match that contains the empty string between the first match and the end of the string.
Regular expressions don't just deal with characters, but also with positions between characters. Tokens that match such positions are known as anchors: for example, ^ matches the position before the first character, and $ matches the position after the last character in a string.
A regex engine "walks through" a string, starting from the position before the first character. It then steps forward one character at a time.
For example, when applying the regex .* to "Hello", the regex engine starts before the H. It then matches Hello - after that .* can't match any more characters, so the regex engine returns "Hello" as the first match. The regex engine is now positioned after the o. If you call it again and ask it to match, it will succeed in returning a match because you're asking it to match any string, even an empty one, from the current position - and that's possible.
Why doesn't the regex engine return an infinite number of empty strings, then? It checks whether the last match was started from the end of the string, and if it was, no further matches will be attempted.
Some languages don't even try a regex match once from the final position in a string (Ruby seems to be one example), but I'd say it's more correct to return two matches.
Since it appears more clarification is necessary: The regex engine steps through the string along the positions visualized by |s below:
"|H|e|l|l|o|"
 ^ Position before the first character
           ^ Position after the last character
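The two matches are easy to reproduce. Here is a minimal sketch in Python, assuming its built-in re module behaves like the engine in the question (the question does not say which language was used; the variable names are only illustrative):

```python
import re

text = "Hello"

# finditer yields every non-overlapping match of .* in the text,
# together with the positions where each match starts and ends.
spans = [(m.group(0), m.span()) for m in re.finditer(r".*", text)]

for matched, (start, end) in spans:
    print(repr(matched), start, end)
# 'Hello' 0 5   <- the whole string
# '' 5 5        <- the empty match at the position after the o
```

The second match starts and ends at position 5, which is the position after the last character in the diagram above.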

Related

Removing last character from a line using regex

I just started learning regex and I'm trying to understand how it is possible to do the following:
If I have:
helmut_rankl:20Suzuki12
helmut1195:wasserfall1974
helmut1951:roller11
Get:
helmut_rankl:20Suzuki1
helmut1195:wasserfall197
helmut1951:roller1
I tried using .$, which does match the last character of a string, but it doesn't match letters and numbers.
How do I get these results from the input?
You could match the whole line and assert that a single character follows the match. Use .+ if you want to match at least one character:
.+(?=.)
If you also want to match empty strings:
.*(?=.)
This will do what you want with regex's match function.
^(.*).$
Broken down:
^ matches the start of the string
( and ) denote a capturing group. The text matched inside it is returned.
.* matches everything, as much as it can.
The final . matches any single character (i.e. the last character of the line)
$ matches the end of the line/input
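Both answers can be tried on the sample lines from the question. This is a rough sketch in Python, assuming the re module; the list names are made up for illustration:

```python
import re

lines = [
    "helmut_rankl:20Suzuki12",
    "helmut1195:wasserfall1974",
    "helmut1951:roller11",
]

# Lookahead approach: match as much as possible while one character
# still remains to the right of the match.
trimmed_lookahead = [re.match(r".+(?=.)", line).group(0) for line in lines]

# Capturing-group approach: group 1 holds everything except the
# final character, which is consumed by the last dot.
trimmed_group = [re.match(r"^(.*).$", line).group(1) for line in lines]

for result in trimmed_group:
    print(result)
```

Both lists contain the same strings. The lookahead version returns the trimmed text as the match itself, while the second version needs group 1 to be read out.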

Why is the character ^ required in a regex ^(?!.*?spam) to filter strings?

I am trying to filter for strings that don't contain the word "spam".
I use the regex from here!
But I can't understand why I need the symbol ^ at the start of the expression. I know that it marks the start, but I do not understand why the regex doesn't work without ^ in my case.
UPD. All the answers below are very useful.
It's completely clear now. Thank you!
The regex (?!.*?spam) matches a position in a string that is not followed by something matching .*?spam.
Every single string has such a position, because if nothing else, the very end of the string is certainly not followed by anything matching .*?spam.
So every single string contains a match for the regex (?!.*?spam).
The anchor ^ in ^(?!.*?spam) restricts the regex, so that it only matches strings where the very beginning of the string isn't followed by anything matching .*?spam — i.e., strings that don't contain spam at all (or anywhere in the first line, at least, depending on whether . matches newlines).
The lookahead is a zero-width assertion (that is, it checks a condition at a position in your string without consuming any characters). In your case it is a negative lookahead, making sure that "zero or more characters, followed by the word spam" do not follow. This is true for several positions in your string; see a demo on regex101.com without the anchor.
With the anchor, the lookahead is only tested at the very beginning, so the whole string is examined; see the altered demo on regex101.com as well.
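The difference can be checked with a small sketch. It assumes Python's re module and two made-up sample strings; other engines should behave the same way for this pattern:

```python
import re

unanchored = re.compile(r"(?!.*?spam)")
anchored = re.compile(r"^(?!.*?spam)")

samples = ["this is spam", "hello world"]

# Without ^, search() moves forward until it reaches a position that is
# no longer followed by "spam", so every string matches somewhere.
unanchored_hits = [unanchored.search(s) is not None for s in samples]

# With ^, only position 0 is tested, so a string containing "spam" fails.
anchored_hits = [anchored.search(s) is not None for s in samples]

# Where the unanchored pattern succeeds in the first sample:
# an empty match just past the "s" of "spam".
first_span = unanchored.search("this is spam").span()

print(unanchored_hits, anchored_hits, first_span)
```

The unanchored pattern reports a match for both samples, so it cannot be used as a filter. The anchored pattern matches only "hello world".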

Regex is matching the second occurrence. I need it to match the first occurrence

This is my regex code:
.*(X.*)\s(.*?)\$
This is my data string:
1247.P1.06.Z01.0020N.X396X111.Y008 1247.P1.06.Z01.0020N$M234477$
This is properly grabbing the second item that ends with the first $ sign:
1247.P1.06.Z01.0020N
But for the first string, I want it to grab:
X396X111.Y008
Instead it is grabbing:
X111.Y008
So I want it to get the first X and everything up to the space. But the second X is triggering the match.
The string starting with "X" is always 13 characters long, so I tried specifying the length, but the match still started at the second X.
I am fine with either pattern:
Start with the first X and end with the space.
Start with the first X and grab 13 characters.
Thank you.
Get rid of .* at the beginning of the regular expression. It's greedy, so it's skipping over the longest possible prefix that allows the rest of the regular expression to match. That forces the rest to get the last occurrence instead of the first.
In general, it's not necessary to put .* at the beginning or end of a regular expression. The engine looks for the pattern anywhere in the input, so the text around the match is simply ignored.
Your match is too loose. A stricter regex could be:
X\S+\s
which matches an X, then every non-whitespace character up to and including the next whitespace character.
Demo: https://regex101.com/r/Jl2BJS/2/
If the ID is always 13 characters including the leading X, you can do:
X.{12}
Demo: https://regex101.com/r/Jl2BJS/3/
Alternatively, removing the .*, or making it non-greedy with ? or the U modifier, would also work.
Demo: https://regex101.com/r/Jl2BJS/4/ or https://regex101.com/r/Jl2BJS/5/
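The effect of the greedy prefix can be seen side by side in a short sketch. It assumes Python's re module and uses the data string from the question; the variable names are illustrative only:

```python
import re

data = "1247.P1.06.Z01.0020N.X396X111.Y008 1247.P1.06.Z01.0020N$M234477$"

# Original pattern: the greedy .* swallows as much as it can,
# so group 1 starts at the last X that still allows a match.
greedy = re.search(r".*(X.*)\s(.*?)\$", data)

# Without the prefix, search() stops at the first X it finds.
no_prefix = re.search(r"(X.*)\s(.*?)\$", data)

# A lazy prefix gives up characters as early as possible.
lazy = re.search(r".*?(X.*)\s(.*?)\$", data)

# Stricter variant: an X followed by non-whitespace characters only.
strict = re.search(r"X\S+", data)

print(greedy.group(1))     # X111.Y008
print(no_prefix.group(1))  # X396X111.Y008
print(lazy.group(1))       # X396X111.Y008
print(strict.group(0))     # X396X111.Y008
```

In all three two-group variants the second group is still 1247.P1.06.Z01.0020N, so only the first group is affected by the prefix.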

How to only match a single instance of a character?

Not quite sure how to go about this, but basically what I want to do is match a character, say a for example. In this case all of the following would not contain matches (i.e. I don't want to match them):
aa
aaa
fooaaxyz
Whereas the following would:
a (obviously)
fooaxyz (this would only match the letter a part)
My knowledge of RegEx is not great, so I am not even sure if this is possible. Basically what I want to do is match any single a that has any other non a character around it (except for the start and end of the string).
^[^\sa]*\Ka(?=[^\sa]*$)
\K discards the previously matched characters from the reported match, and the lookahead asserts whether a match is possible at that point. So the above matches only the letter a that satisfies the conditions.
OR
a{2,}(*SKIP)(*F)|a
You may use a combination of a lookbehind and a lookahead:
(?<!a)a(?!a)
Details
(?<!a) - a negative lookbehind that fails the match if, immediately to the left of the current location, there is an a char
a - an a char
(?!a) - a negative lookahead that fails the match if, immediately to the right of the current location, there is an a char.
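The lookaround pattern can be run against the examples from the question. This is a small sketch assuming Python's re module, which supports both lookbehind and lookahead; the names lone_a and results are made up:

```python
import re

lone_a = re.compile(r"(?<!a)a(?!a)")

samples = ["aa", "aaa", "fooaaxyz", "a", "fooaxyz"]

# findall returns every a that has no a directly before or after it.
results = {s: lone_a.findall(s) for s in samples}

for sample, found in results.items():
    print(sample, found)
```

The first three samples produce an empty list, and the last two each produce a single a, as requested in the question.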
You need two things:
a negated character class: [^a] (all except "a")
anchors (^ and $) to ensure that the limits of the string are reached (in other words, that the pattern matches the whole string and not only a substring):
Result:
^[^a]*a[^a]*$
Once you know there is only one "a", you can extract, replace, or remove it in whatever way suits the language you use.
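As a sketch of this whole-string check, assuming Python's re module (the sample list, including "banana" and "xyz", is invented for illustration):

```python
import re

exactly_one_a = re.compile(r"^[^a]*a[^a]*$")

samples = ["a", "fooaxyz", "aa", "fooaaxyz", "banana", "xyz"]

# Keep only the strings that contain exactly one a in total.
accepted = [s for s in samples if exactly_one_a.search(s)]

print(accepted)
```

Only "a" and "fooaxyz" are accepted. "banana" is rejected because it contains more than one a, even though each a stands alone. A pattern that only inspects the neighbouring characters, such as (?<!a)a(?!a), would find three matches there instead, so the right choice depends on which reading of the question is intended.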

How to match the previous character in a string using regex

Let's say I have this string: §cHi there
I want to use regex to match §c so I can extract the message.
The way I want it to work is that it looks for § and then matches the next and the previous character.
So far I am able to match the next character with the expression [§].{1}.
My question is how to match the previous character, which is Â.
First match .[§]., then take the first and third character of the match.
Note: your example regex [§].{1} is the same as [§].
. matches any character (sometimes with the exception of the newline character)
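A minimal sketch of this in Python, assuming the re module. The sample text with a stray Â in front of § is an assumption based on the question's description, and the variable names are illustrative:

```python
import re

# Assumed sample: the question's string with "Â" directly before "§".
text = "Â§cHi there"

# Match one character, then §, then one more character.
m = re.search(r".§.", text)
previous_char = m.group(0)[0]
next_char = m.group(0)[2]

# Equivalent sketch using capturing groups instead of indexing.
g = re.search(r"(.)§(.)", text)

# Everything after the three matched characters is the message.
message = text[m.end():]

print(previous_char, next_char, message)
```

Indexing into the match and using capturing groups give the same two characters, so either style can be used.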