Regex: how to match a repeating pattern incrementally

Given the following string:
one.two.three.four
How do I match/capture so that I get the following results in one go:
one
one.two
one.two.three
(if it's possible at all)

You can use this:
(?=(^|(?<=[.]))([\w.]+))
This performs a zero-width lookahead, which means the string is scanned one character at a time and the pattern is tested at each position without consuming anything. Inside the lookahead it says:
Using an alternation of a start anchor and a zero-width lookbehind:
is this the beginning of the string?
is there a . right behind the cursor?
Using a capture group, it then grabs the rest of the string that has not been consumed yet.
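To see what the lookahead pattern above actually captures, here is a minimal sketch using Python's re module (an assumption, since the question names no language). The variable names are made up for illustration. The last two lines are a hypothetical follow-up, not part of the answer: they derive the cumulative prefixes from the question by cutting the string at each dot.

```python
import re

text = "one.two.three.four"

# The lookahead consumes nothing, so a match is attempted at every position.
# Group 1 is the (empty) start-of-string / after-a-dot check; group 2 is the capture.
pattern = re.compile(r"(?=(^|(?<=[.]))([\w.]+))")
suffixes = [m.group(2) for m in pattern.finditer(text)]
print(suffixes)  # ['one.two.three.four', 'two.three.four', 'three.four', 'four']

# Hypothetical follow-up: everything before each dot is one cumulative prefix.
prefixes = [text[:m.start()] for m in re.finditer(r"\.", text)]
print(prefixes)  # ['one', 'one.two', 'one.two.three']
```

Note that the pattern yields the remainder of the string from each segment start, so the prefixes in the question need the extra slicing step (or a different pattern).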

(\w+)\.?
(\w+) matches any word character (equal to [a-zA-Z0-9_])
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed
\.? matches the character . literally; the ? quantifier makes it optional (zero or one time)
If your characters are all lowercase letters, try this instead: ([a-z]+)\.?

Related

How to overcome multiple matches within same sentence (regex) [duplicate]

I am trying to implement a regex which includes all the strings which have any number of words but cannot be followed by a : and ignore the match if it does. I decided to use a negative look ahead for it.
/([a-zA-Z]+)(?!:)/gm
string: lame:joker
Since I am using a character range, it matches one character at a time and only excludes the last character before the : .
How do I ignore the entire match in this case?
Link to regex101: https://regex101.com/r/DlEmC9/1
The issue is related to backtracking: once your [a-zA-Z]+ reaches a :, the engine steps back from the failing position, re-checks the lookahead and finds a match whenever there are at least two letters before the colon, returning the part that is not immediately followed by :. See your regex demo: c in c:real is not matched as there is no position to backtrack to, and rea in real:c is matched because a is not immediately followed by :.
Adding implicit requirement to the negative lookahead
Since you only need to match a sequence of letters not followed with a colon, you can explicitly add one more condition that is implied: and not followed with another letter:
[A-Za-z]+(?![A-Za-z]|:)
[A-Za-z]+(?![A-Za-z:])
See the regex demo. Since both [A-Za-z] and : match a single character, it makes sense to put them into a single character class, so, [A-Za-z]+(?![A-Za-z:]) is better.
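The backtracking behaviour and the fix can be reproduced with a short sketch in Python's re module (an assumption; the question itself uses the /gm flags on regex101). The variable names are illustrative only.

```python
import re

text = "lame:joker"

# Original pattern: backtracking gives up the final "e" so the lookahead passes.
original = re.findall(r"[a-zA-Z]+(?!:)", text)

# Fixed pattern: the match may not be followed by another letter or a colon,
# so there is no shorter prefix of "lame" the engine can fall back to.
fixed = re.findall(r"[A-Za-z]+(?![A-Za-z:])", text)

print(original)  # ['lam', 'joker']
print(fixed)     # ['joker']
```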
Preventing backtracking into a word-like pattern by using a word boundary
As @scnerd suggests, word boundaries can also help in these situations, but there is always a catch: the meaning of a word boundary is context dependent (see the number of ifs in the word boundary explanation).
[A-Za-z]+\b(?!:)
is a valid solution here, because the input implies the words end with non-word chars (i.e. end of string, or chars other than letter, digits and underscore). See the regex demo.
When does a word boundary fail?
\b will not be the right choice when the main consuming pattern is supposed to match even if glued to other word chars. The most common example is matching numbers:
\d+\b(?!:) matches 12 in "12," but not in "12:", and it also fails on "12c" and "12_"
\d+(?![\d:]) matches 12 in "12," as well as in "12c" and "12_"; it fails on "12:" only.
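The difference between the two number patterns can be checked with a small sketch, again assuming Python's re module. The helper first_match is a hypothetical name introduced only for this illustration.

```python
import re

samples = ["12,", "12:", "12c", "12_"]

def first_match(pattern, s):
    """Return the first matched text, or None when nothing matches."""
    m = re.search(pattern, s)
    return m.group(0) if m else None

# Word boundary version: fails whenever the digits are glued to a word char.
with_boundary = {s: first_match(r"\d+\b(?!:)", s) for s in samples}

# Character class version: only rejects a following digit or colon.
with_class = {s: first_match(r"\d+(?![\d:])", s) for s in samples}

print(with_boundary)  # {'12,': '12', '12:': None, '12c': None, '12_': None}
print(with_class)     # {'12,': '12', '12:': None, '12c': '12', '12_': '12'}
```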
Do a word boundary check \b after the + to require it to get to the end of the word.
([a-zA-Z]+\b)(?!:)
Here's an example run.

Regular expression to remove syslog date in filebeat?

I would like to parse some syslog lines that look like
Oct 20 16:34:59 artguard TTN-xxxxxxxxxxxxxxxxxxxxxxxxxxxxx
I would like to turn them into
TTN-xxxxxxxxxxxxxxxxxxxxxxxxxxxxx
So I was wondering what the regular expression should look like to do this, since the first part changes every day because it is prepended by syslog.
EDIT: to avoid being marked as a duplicate, I am trying to use regex with filebeat, where not all regex features are supported, as explained here
Regex101
(TTN-.*$)
Explained
1st Capturing Group (TTN-.*$)
TTN- matches the characters TTN- literally (case sensitive)
.* matches any character (except for line terminators)
* Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
$ asserts position at the end of a line
Global pattern flags
g modifier: global. All matches (don't return after first match)
m modifier: multi line. Causes ^ and $ to match the begin/end of each line (not only begin/end of string)
The regular expression TTN-\S* is probably a way of doing what you're looking for; here it is in a JavaScript example.
var value = "Oct 20 16:34:59 artguard TTN-xxxxxxxxxxxxxxxxxxxxxxxxxxxxx";
var matches = value.match(
new RegExp("TTN-\\S*", "gi")
);
document.writeln(matches);
It works in two main parts:
The TTN- matches TTN- (obviously)
The \S* matches any character that is not a white-space, this is done as many times as possible.
Currently it always expects at least a '-' after the TTN, but if you replace the '-' with '-{0,1}' in the regex it will expect TTN, maybe a dash, followed by 0-n characters that are not whitespace. You could also replace \S* with \w* to get only letters, digits and underscores, or with .* to get all characters up to the end of the line (\n). Note that TTN-\S*[^\s{2}] does not end the match at two spaces: inside a character class {2} is not a quantifier, so [^\s{2}] matches a single character that is not whitespace, {, 2 or }. Hope this was helpful.
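For comparison, the same extraction can be sketched in Python's re module. This only illustrates the pattern itself; it is not a filebeat configuration, and filebeat's regex support may differ. The second input string, "host TTNabc", is a made-up example for the optional dash.

```python
import re

line = "Oct 20 16:34:59 artguard TTN-xxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

# TTN- followed by as many non-whitespace characters as possible.
m = re.search(r"TTN-\S*", line)
token = m.group(0) if m else None
print(token)

# With an optional dash, a (hypothetical) token without "-" is matched too.
m2 = re.search(r"TTN-{0,1}\S*", "host TTNabc")
no_dash = m2.group(0) if m2 else None
print(no_dash)  # TTNabc
```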

Regular expression - finding full words that contain a specific string

I am trying to match the 'words' that contain a specific string inside a provided string.
This reg_ex works great:
preg_match('/\b(\w*form\w*)\b/', $string, $matches);
So for example if my string contained: "Which person has reformed or performed" it returns reformed and performed.
However, I need to match codes inside codes so my definition of 'word' is based on splitting the string purely by a space.
For example, I have a string like:
Test MFC-123/Ben MFC/7474
And I need to match 'MFC' which should return 'MFC-123/Ben' and 'MFC/7474'.
How can I modify the above regex to match all characters and use a space as the boundary?
Thanks
Simply using this will do it for you:
(MFC\S+)
It matches MFC followed by one or more non-whitespace characters.
If the MFC comes in between text, or alone, then you can place \S* before and after the MFC. For example
(\S*MFC\S*)
This matches:
MFC-12312
1231-MFC
MFC
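A quick check of (\S*MFC\S*), sketched with Python's re module rather than the PHP preg_match from the question (an assumption made for illustration). The second input string is a made-up sample built from the three cases listed above plus one non-matching word.

```python
import re

text = "Test MFC-123/Ben MFC/7474"

# Every whitespace-delimited block that contains MFC anywhere inside it.
codes = re.findall(r"\S*MFC\S*", text)
print(codes)  # ['MFC-123/Ben', 'MFC/7474']

# MFC at the start, at the end, and on its own.
others = re.findall(r"\S*MFC\S*", "MFC-12312 1231-MFC MFC none")
print(others)  # ['MFC-12312', '1231-MFC', 'MFC']
```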
If you want the whole block of non-space text that contains your MFC as the match, you can use the following regex:
\b(\S*MFC\S+)\b
explanation:
\b assert position at a word boundary (^\w|\w$|\W\w|\w\W)
1st Capturing group (\S*MFC\S+)
\S* match any non-white space character [^\r\n\t\f ]
Quantifier: Between zero and unlimited times, as many times as possible, giving back as needed.
MFC matches the characters MFC literally (case sensitive)
\S+ match any non-white space character [^\r\n\t\f ]
Quantifier: Between one and unlimited times, as many times as possible, giving back as needed.
\b assert position at a word boundary (^\w|\w$|\W\w|\w\W)
example (the matched blocks, originally shown in bold, are MFC-123/Ben, lmasdlmasd;mwrsMFCkmasd and MFC/7474):
Test MFC-123/Ben jbas2/jda lmasdlmasd;mwrsMFCkmasd j2\13 MFC/7474
hope this helps.

Regular Expression that matches a word, or nothing

I'm really struggling to put a label on this, which is probably why I was unable to find what I need through a search.
I'm looking to match the following:
Auto Reply
Automatic Reply
AutomaticReply
The platform that I'm using doesn't allow for the specification of case-insensitive searches. I tried the following regular expression:
.*[aA]uto(?:matic)[ ]*[rR]eply.*
Thinking that (?:matic) would cause my expression to match Auto or Automatic. However, it is only matching Automatic.
What am I doing wrong?
What is the proper terminology here?
This is using Perl for the regular expression engine (I think that's PCRE but I'm not sure).
(?:...) is to regex patterns as (...) is to arithmetic: It simply overrides precedence.
ab|cd # Matches ab or cd
a(?:b|c)d # Matches abd or acd
A ? quantifier is what makes matching optional.
a? # Matches a or an empty string
abc?d # Matches abcd or abd
a(?:bc)?d # Matches abcd or ad
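The grouping and quantifier rules above can be verified with a small sketch in Python's re module (an assumption; the question uses a Perl-style engine, and these constructs behave the same way there). The helper name matches is hypothetical.

```python
import re

def matches(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

candidates = ["ab", "cd", "abd", "acd", "abcd", "ad"]

alternation = matches(r"ab|cd", candidates)         # whole-pattern alternation
grouped = matches(r"a(?:b|c)d", candidates)         # alternation limited by the group
optional_char = matches(r"abc?d", candidates)       # ? applies to c only
optional_group = matches(r"a(?:bc)?d", candidates)  # ? applies to the whole group

print(alternation)     # ['ab', 'cd']
print(grouped)         # ['abd', 'acd']
print(optional_char)   # ['abd', 'abcd']
print(optional_group)  # ['abcd', 'ad']
```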
You want
(?:matic)?
Without the needless leading and trailing .*, we get the following:
/[aA]uto(?:matic)?[ ]*[rR]eply/
As @adamdc78 points out, that also matches AutoReply. This can be avoided by using the following:
/[aA]uto(?:matic[ ]*|[ ]+)[rR]eply/
or
/[aA]uto(?:matic|[ ])[ ]*[rR]eply/
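The three variants can be compared side by side with a minimal sketch, assuming Python's re module in place of the Perl-style engine from the question. The names broken, optional, strict and hits are made up for this illustration.

```python
import re

subjects = ["Auto Reply", "Automatic Reply", "AutomaticReply", "AutoReply"]

broken = r"[aA]uto(?:matic)[ ]*[rR]eply"           # group without ?: "matic" is mandatory
optional = r"[aA]uto(?:matic)?[ ]*[rR]eply"        # "matic" optional, also allows AutoReply
strict = r"[aA]uto(?:matic[ ]*|[ ]+)[rR]eply"      # requires "matic" or at least one space

def hits(pattern):
    """Return the subjects in which the pattern is found."""
    return [s for s in subjects if re.search(pattern, s)]

print(hits(broken))    # ['Automatic Reply', 'AutomaticReply']
print(hits(optional))  # ['Auto Reply', 'Automatic Reply', 'AutomaticReply', 'AutoReply']
print(hits(strict))    # ['Auto Reply', 'Automatic Reply', 'AutomaticReply']
```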
This should work:
/.*[aA]uto(?:matic)? *[rR]eply/
you were simply missing the ? after (?:matic)
[Aa]uto(?:matic ?| )[Rr]eply
This assumes that you do not want AutoReply to be a valid hit.
You're just missing the optional ("?") in the regex. If you're looking to match the entire line after the reply, then including the .* at the end is fine, but your question didn't specify what you were looking for.
You can use this regex with line start/end anchors:
^[aA]uto(?:matic)? *[rR]eply$
Explanation:
^ assert position at start of the string
[aA] match a single character present in the list below
aA a single character in the list aA literally (case sensitive)
uto matches the characters uto literally (case sensitive)
(?:matic)? Non-capturing group
Quantifier: Between zero and one time, as many times as possible, giving back as needed [greedy]
matic matches the characters matic literally (case sensitive)
' ' (the space in " *") matches the space character literally
Quantifier: Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
[rR] match a single character present in the list below
rR a single character in the list rR literally (case sensitive)
eply matches the characters eply literally (case sensitive)
$ assert position at end of the string
Slightly different. Same result.
m/([aA]uto(matic)? ?[rR]eply)/
Tested Against:
Some other stuff....
Auto Reply
Automatic Reply
AutomaticReply
Now some similar stuff that shouldn't match (auto).

Regular expression doesn't match if a character participated in a previous match

I have this regex:
(?:\S)\++(?:\S)
Which is supposed to catch all the pluses in a query string like this:
?busca=tenis+nike+categoria:"Tenis+e+Squash"&pagina=4&operador=or
It should have been 4 matches, but there are only 3:
s+n
e+c
s+e
It is missing the last one:
e+S
And it seems to happen because the "e" character has participated in a previous match (s+e), because the "e" character is right in the middle of two pluses (Teni s+e+S quash).
If you test the regex with the following input, it matches the last "+":
?busca=tenis+nike+categoria:"Tenis_e+Squash"&pagina=4&operador=or
(changed "s+e" for "s_e" in order not to cause the "e" character to participate in the match).
Would someone please shed a light on that?
Thanks in advance!
In consecutive matching, the search for the next match starts at the position where the previous match ended. And since the non-whitespace character after the + is matched too, the search for the next match starts after that non-whitespace character. So in a sequence like s+e+S you will only find one match:
s+e+S
\_/
You can fix that by using look-around assertions, which check the surrounding characters without consuming them, like:
\S\++(?=\S)
This will match any non-whitespace character followed by one or more + only if it is followed by another non-whitespace character.
But since whitespace is not allowed in a URI query, you don’t need the surrounding \S at all as every character is non-whitespace. So the following will already match every sequence of one or more + characters:
\++
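The effect of consuming versus asserting the trailing character can be reproduced with a short sketch, assuming Python's re module. The variable names are illustrative.

```python
import re

query = '?busca=tenis+nike+categoria:"Tenis+e+Squash"&pagina=4&operador=or'

# Original pattern: the character after the + is consumed, so the "e" in
# s+e+S cannot start the next match.
consuming = re.findall(r"(?:\S)\++(?:\S)", query)

# Lookahead version: the following character is checked but not consumed.
lookahead = re.findall(r"\S\++(?=\S)", query)

# In a query string without whitespace, the bare pattern is enough.
plain = re.findall(r"\++", query)

print(consuming)  # ['s+n', 'e+c', 's+e']
print(lookahead)  # ['s+', 'e+', 's+', 'e+']
print(plain)      # ['+', '+', '+', '+']
```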
You are correct: The fourth match doesn't happen because the surrounding character has already participated in the previous match. The solution is to use lookaround (if your regex implementation supports it - older JavaScript engines don't support lookbehind, for example).
Try
(?<!\s)\++(?!\s)
This matches one or more + unless they are surrounded by whitespace. This also works if the plus is at the start or the end of the string.
Explanation:
(?<!\s) # assert that there is no space before the current position
# (but don't make that character a part of the match itself)
\++ # match one or more pluses
(?!\s) # assert that there is no space after the current position
If your regex implementation doesn't support lookbehind, you could also use
\S\++(?!\s)
That way, your match would contain the character before the plus, but not after it, and therefore there will be no overlapping matches (Thanks Gumbo!). This will fail to match a plus at the start of the string, though (because the \S does need to match a character). But this is probably not a problem.
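The behaviour at the start and end of the string can be sketched as follows, assuming Python's re module. The input "+a +b+" is a made-up test string with a leading plus, a plus preceded by a space, and a trailing plus.

```python
import re

text = "+a +b+"

# Lookbehind version: matches the leading and trailing plus, but skips the
# one that is preceded by a space.
with_lookbehind = [m.span() for m in re.finditer(r"(?<!\s)\++(?!\s)", text)]

# Version without lookbehind: \S must consume a character before the plus,
# so the plus at the very start of the string is never matched.
without_lookbehind = [m.group(0) for m in re.finditer(r"\S\++(?!\s)", text)]

print(with_lookbehind)     # [(0, 1), (5, 6)]
print(without_lookbehind)  # ['b+']
```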
You can use the regex:
(?<=\S)\++(?=\S)
To match only the +'s that are surrounded by non-whitespace.