How to select all whitespaces with certain interval - regex

I'd like to use a regex to find all whitespace that comes after at least a given number of characters (the interval).
For example, if the interval is 3, then for the following string the result should be:
string = "this is a test for regex"
string = "this(select)is a(select)test(select)for(select)regex"
The whitespace after "is" should not be selected, since the interval is 3 and the length of "is" is only 2.
I tried ^(?=.{3,})\s$ but had no luck. Thank you.

Depending on your regex engine, you can use either of the following:
\K Reset Method
If your engine supports \K.
See regex in use here
.{3,}?\K\s+
This method matches any character 3 or more times (but as few as possible), then resets the pattern's match, then matches one or more whitespace characters.
Capture Group Method
See regex in use here
(.{3,}?)\s+
Replace with $1
This method captures any character 3 or more times (but as few as possible), then matches one or more whitespace characters. You would then replace the matches with first capture group's match.
The ? that follows a quantifier (in the cases above {3,}) causes it to match in a lazy manner, meaning that once it satisfies at least 3 matches and finds a whitespace character, it'll stop (this prevents it from matching the whole line up to the last space).
The \K token resets the pattern's match. This means that nothing preceding this token will be included in the overall match (resulting in only the whitespace characters being matched).
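As a rough sketch of the capture group method in Python (the built-in re module has no \K, so only the second variant is shown here). The (select) marker is borrowed from the question and the variable names are only for illustration:

```python
import re

text = "this is a test for regex"

# Capture at least 3 characters (lazily), then consume the whitespace run.
# The replacement keeps the captured text and marks where the whitespace was.
marked = re.sub(r"(.{3,}?)\s+", r"\1(select)", text)

print(marked)  # this(select)is a(select)test(select)for(select)regex
```

The space after "is" survives because only two characters precede it since the previous match ended, so the lazy quantifier has to keep going until "is a".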

This would do it:
.{3,}?(\s)
.{3,}? - lazily match 3 or more chars
(\s) - capture the whitespace which follows
Your desired spaces would be in $1.
https://regex101.com/r/7iWjQO/1
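If you only need to know where the selected spaces are, something along these lines should work in Python. This is a minimal sketch; the sample string comes from the question and positions is just an illustrative name:

```python
import re

text = "this is a test for regex"

# Each match lazily consumes 3+ characters and captures one whitespace character.
# m.start(1) is the index of that captured whitespace in the original string.
positions = [m.start(1) for m in re.finditer(r".{3,}?(\s)", text)]

print(positions)  # [4, 9, 14, 18]
```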

Related

Regex Pattern to Match except when the clause enclosed by the tilde (~) on both sides

I want to extract matches of the clauses match-this that is enclosed with anything other than the tilde (~) in the string.
For example, in this string:
match-this~match-this~ match-this ~match-this#match-this~match-this~match-this
There should be 5 matches from above. The matches are explained below (enclosed by []):
Either match-this~ or match-this is correct for first match.
match-this is correct for 2nd match.
Either ~match-this# or ~match-this is correct for 3rd match.
Either #match-this~ or #match-this or match-this~ is correct for 4th match.
Either ~match-this or match-this is correct for 5th match.
I can use the pattern ~match-this~ to catch the ~match-this~ clauses, but when I tried the negation of it, (?!(~match-this)), it only produces empty matches.
When I tried the pattern [^~]match-this[^~], it catches only one match (the 2nd match from above). And when I tried to add asterisk wild card on any negation of tilde, either [^~]match-this[^~]* or [^~]*match-this[^~], I got only 2 matches. When I put the asterisk wild card on both, it catches all match-this including those which enclosed by tildes ~.
Is it possible to achieve this with only one regex test? Or Does it need more??
If you also want to match #match-this~ as a separate match, you would have to account for # while matching, as [^~] also matches #
You could match what you don't want, and capture in a group what you want to keep.
~[^~#]*~|((?:(?!match-this).)*match-this(?:(?!match-this)[^#~])*)
Explanation
~[^~#]*~ Match any char except ~ or # between ~
| Or
( Capture group 1
(?:(?!match-this).)* Match any char, as long as match-this does not start at that position
match-this Match literally
(?:(?!match-this)[^#~])* Match any char except ~ or #, as long as match-this does not start at that position
) Close group 1
See a regex demo and a Python demo.
Example
import re
pattern = r"~[^~#]*~|((?:(?!match-this).)*match-this(?:(?!match-this)[^#~])*)"
s = "match-this~match-this~ match-this ~match-this#match-this~match-this~match-this"
res = [m for m in re.findall(pattern, s) if m]
print (res)
Output
['match-this', ' match-this ', '~match-this', '#match-this', 'match-this']
If all five matches can be "match-this" (contradicting the requirement for the 3rd match) you can match the regular expression
~match-this~|(\bmatch-this\b)
and keep only matches that are captured (to capture group 1). The idea is to discard matches that are not captured and keep matches that are captured. When the regex engine matches "~match-this~" its internal string pointer is moved just past the closing "~", thereby skipping an unwanted substring.
Demo
The regular expression can be broken down as follows.
~match-this~ # match literal
| # or
( # begin capture group 1
\b # match a word boundary
match-this # match literal
\b # match a word boundary
) # end capture group 1
Being so simple, this regular expression would be supported by most regex engines.
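A quick way to try the "discard what is not captured" idea in Python, assuming the sample string from the question (kept is a made-up name):

```python
import re

s = "match-this~match-this~ match-this ~match-this#match-this~match-this~match-this"
pattern = r"~match-this~|(\bmatch-this\b)"

# Keep only the matches where group 1 participated. The ~match-this~
# alternative matches without capturing anything and is thrown away.
kept = [(m.start(1), m.group(1)) for m in re.finditer(pattern, s) if m.group(1)]

print(kept)
```

This gives five (position, text) pairs, starting at offsets 0, 23, 35, 46 and 68.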
For this you need both kinds of lookarounds. This will match the 5 spots you want, and there's a reason why it only works this way and not another and why the prefix and/or suffix can't be included:
(?<=~)match-this(?!~)|(?<!~)match-this(?=~)|(?<!~)match-this(?!~)
Explaining lookarounds:
(?=...) is a positive lookahead: what comes next must match
(?!...) is a negative lookahead: what comes next must not match
(?<=...) is a positive lookbehind: what comes before must match
(?<!...) is a negative lookbehind: what comes before must not match
Why other ways won't work:
[^~] is a class with negation, but it always needs one character to be there and also consumes that character for the match itself. The former is a problem for a starting text. The latter is a problem for having advanced too far, so a "don't match" character is gone already.
(^|[^~]) would solve the first problem: either the text starts or it must be a character not matching this. We could do the same for ending texts, but this is a dead end anyway.
Only lookarounds remain, and even then we have to code all 3 variants, hence the two |.
As per the nature of lookarounds the character in front or behind cannot be captured. Additionally if you want to also match either a leading or a trailing character then this collides with recognizing the next potential match.
It's the difference between telling the engine to "not match" a character and telling it to "look out" for something without actually consuming characters and advancing the current position in the text. Also, not every regex engine supports all lookarounds, so it matters where you actually want to use it. For me it works fine in TextPad 8 and should also work fine in PCRE (for example in PHP). As per regex101.com/r/CjcaWQ/1 it also works as I expect.
What irritates me: if the leading and/or trailing character of a found match is important to you, then just extract it from the input when processing all the matches, since they also come with starting positions and lengths: first match at position 0 for 10 characters means you look at input text position -1 and 10.
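For what it's worth, here is a small Python sketch of that idea: match with the three lookaround variants, then read the neighbouring characters from the input by position. Python's re handles these fixed-width lookbehinds. The names before and after are illustrative only:

```python
import re

s = "match-this~match-this~ match-this ~match-this#match-this~match-this~match-this"
pattern = r"(?<=~)match-this(?!~)|(?<!~)match-this(?=~)|(?<!~)match-this(?!~)"

for m in re.finditer(pattern, s):
    # Neighbouring characters come from the input, not from the match itself.
    before = s[m.start() - 1] if m.start() > 0 else ""
    after = s[m.end()] if m.end() < len(s) else ""
    print(m.start(), repr(before), m.group(), repr(after))

starts = [m.start() for m in re.finditer(pattern, s)]
```

The two occurrences that have a tilde on both sides (offsets 11 and 57) are skipped, the other five are reported.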

What is the way to combine two regexes? [duplicate]

I want to design an expression for not allowing whitespace at the beginning and at the end of a string, but allowing in the middle of the string.
The regex I've tried is this:
\^[^\s][a-z\sA-Z\s0-9\s-()][^\s$]\
This should work:
^[^\s]+(\s+[^\s]+)*$
If you want to include character restrictions:
^[-a-zA-Z0-9-()]+(\s+[-a-zA-Z0-9-()]+)*$
Explanation:
the starting ^ and ending $ denote the start and end of the string.
considering the first regex I gave, [^\s]+ means at least one not whitespace and \s+ means at least one white space. Note also that parentheses () groups together the second and third fragments and * at the end means zero or more of this group.
So, if you take a look, the expression is: begins with at least one non whitespace and ends with any number of groups of at least one whitespace followed by at least one non whitespace.
For example if the input is 'A' then it matches, because it matches with the begins with at least one non whitespace condition. The input 'AA' matches for the same reason. The input 'A A' matches also because the first A matches for the at least one not whitespace condition, then the ' A' matches for the any number of groups of at least one whitespace followed by at least one non whitespace.
' A' does not match because the begins with at least one non whitespace condition is not satisfied. 'A ' does not match because the ends with any number of groups of at least one whitespace followed by at least one non whitespace condition is not satisfied.
If you want to restrict which characters to accept at the beginning and end, see the second regex. I have allowed a-z, A-Z, 0-9 and () at beginning and end. Only these are allowed.
Regex playground: http://www.regexr.com/
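The examples above can be checked with a few lines of Python. This is only a sketch using the first regex and the five inputs discussed; results is a hypothetical name:

```python
import re

pattern = re.compile(r"^[^\s]+(\s+[^\s]+)*$")

samples = ["A", "AA", "A A", " A", "A "]
# True means the whole string is accepted by the pattern.
results = {text: bool(pattern.search(text)) for text in samples}

for text, ok in results.items():
    print(repr(text), ok)
```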
This RegEx will allow neither white-space at the beginning nor at the end of your string/word.
^[^\s].+[^\s]$
Any string of three or more characters (without line breaks) that doesn't begin or end with a white-space will be matched.
Explanation:
^ denotes the beginning of the string.
\s denotes white-spaces and so [^\s] denotes NOT white-space. You could alternatively use \S to denote the same.
. denotes any character except line break.
+ is a quantifier which denotes one or more times. That means the character which + follows can be repeated one or more times.
You can use this as RegEx cheat sheet.
In cases when you have a specific pattern, say, ^[a-zA-Z0-9\s()-]+$, that you want to adjust so that spaces at the start and end were not allowed, you may use lookaheads anchored at the pattern start:
^(?!\s)(?![\s\S]*\s$)[a-zA-Z0-9\s()-]+$
^^^^^^^^^^^^^^^^^^^^
Here,
(?!\s) - a negative lookahead that fails the match if (since it is after ^) immediately at the start of string there is a whitespace char
(?![\s\S]*\s$) - a negative lookahead that fails the match if, (since it is also executed after ^, the previous pattern is a lookaround that is not a consuming pattern) immediately at the start of string, there are any 0+ chars as many as possible ([\s\S]*, equal to [^]*) followed with a whitespace char at the end of string ($).
In JS, you may use the following equivalent regex declarations:
var regex = /^(?!\s)(?![\s\S]*\s$)[a-zA-Z0-9\s()-]+$/
var regex = /^(?!\s)(?![^]*\s$)[a-zA-Z0-9\s()-]+$/
var regex = new RegExp("^(?!\\s)(?![^]*\\s$)[a-zA-Z0-9\\s()-]+$")
var regex = new RegExp(String.raw`^(?!\s)(?![^]*\s$)[a-zA-Z0-9\s()-]+$`)
If you know there are no linebreaks, [\s\S] and [^] may be replaced with .:
var regex = /^(?!\s)(?!.*\s$)[a-zA-Z0-9\s()-]+$/
See the regex demo.
JS demo:
var strs = ['a b c', ' a b b', 'a b c '];
var regex = /^(?!\s)(?![\s\S]*\s$)[a-zA-Z0-9\s()-]+$/;
for (var i=0; i<strs.length; i++){
console.log('"',strs[i], '"=>', regex.test(strs[i]))
}
If the string must be at least 1 character long, if newlines are allowed in the middle together with any other characters, and the first and last character can really be anything except whitespace (including ##$!...), then you are looking for:
^\S$|^\S[\s\S]*\S$
explanation and unit tests: https://regex101.com/r/uT8zU0
This worked for me:
^[^\s].+[a-zA-Z]+[a-zA-Z]+$
Hope it helps.
How about:
^\S.+\S$
This will match any string that doesn't begin or end with any kind of space.
^[^\s].+[^\s]$
That's it! It allows any string that contains any character (apart from \n) without whitespace at the beginning or end. In case you want \n in the middle, either use the s (dotall) option or replace .+ with [\s\S]+
pattern="^[^\s]+[-a-zA-Z\s]+([-a-zA-Z]+)*$"
This will help you accept only characters and won't allow spaces or other whitespace at the start.
This is the regex for no white space at the beginning nor at the end but only one between. Also works without a 3 character limit:
\^([^\s]*[A-Za-z0-9]\s{0,1})[^\s]*$\ - just remove {0,1} and add * in order to have limitless space between.
As a modification of @Aprillion's answer, I prefer:
^\S$|^\S[ \S]*\S$
It will not match a space at the beginning, end, or both.
It matches any number of spaces between a non-whitespace character at the beginning and end of a string.
It also matches only a single non-whitespace character (unlike many of the answers here).
It will not match any newline (\n), \r, \t, \f, nor \v in the string (unlike Aprillion's answer). I realize this isn't explicit to the question, but it's a useful distinction.
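To see the difference from the three-character patterns posted elsewhere in this thread, here is a small Python comparison. It is just a sketch; the sample strings and variable names are my own:

```python
import re

three_plus = re.compile(r"^\S.+\S$")           # needs at least 3 characters
any_length = re.compile(r"^\S$|^\S[ \S]*\S$")  # also accepts 1 or 2 characters

samples = ["A", "AB", "A B", " A", "A ", "A\tB"]
# Each value is (accepted by three_plus, accepted by any_length).
comparison = {
    text: (bool(three_plus.search(text)), bool(any_length.search(text)))
    for text in samples
}

for text, flags in comparison.items():
    print(repr(text), flags)
```

"A" and "AB" are rejected by the first pattern but accepted by the second, while "A\tB" goes the other way because a tab is not allowed inside [ \S].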
Letters and numbers divided only by one space. Also, no spaces allowed at beginning and end.
/^[a-z0-9]+( [a-z0-9]+)*$/gi
I found a reliable way to do this is just to specify what you do want to allow for the first character and check the other characters as normal e.g. in JavaScript:
RegExp("^[a-zA-Z][a-zA-Z- ]*$")
So that expression accepts only a single letter at the start, and then any number of letters, hyphens or spaces thereafter.
Use /^[^\s].([A-Za-z]+\s)*[A-Za-z]+$/. It only accepts one space between words and no spaces at the beginning or end.
If we do not have to define a specific class of valid characters (we are going to accept characters from any language), and we just want to prevent spaces at the start and end, the simplest pattern can be this:
/^(?! ).*[^ ]$/
Try on HTML Input:
input:invalid {box-shadow:0 0 0 4px red}
/* Note: ^ and $ removed from pattern. Because HTML Input already use the pattern from First to End by itself. */
<input pattern="(?! ).*[^ ]">
Explanation
^ Start of string
(?!...) Negative lookahead: what follows must not match ...
The lookahead can contain just a space, or \s (spaces, tabs and line-break characters)
(?! ) Do not accept a space at the start of the next set (.*)
. Any character (except \n and \r line breaks)
* Zero or more (length of the set)
[^ ] Set/class of any character except space
$ End of string
Try it live: https://regexr.com/6e1o4
^[^0-9 ]{1}([a-zA-Z]+\s{1})+[a-zA-Z]+$
- for no more than one whitespace in between, and no spaces at the start or end.
^[^0-9 ]{1}([a-zA-Z ])+[a-zA-Z]+$
- for more than one whitespace in between, and no spaces at the start or end.
Other answers introduce a limit on the length of the match. This can be avoided using Negative lookaheads and lookbehinds:
^(?!\s)([a-zA-Z0-9\s])*?(?<!\s)$
This starts by checking that the first character is not whitespace ^(?!\s). It then captures the characters you want a-zA-Z0-9\s non greedily (*?), and ends by checking that the character before $ (end of string/line) is not \s.
Check that lookaheads/lookbehinds are supported in your platform/browser.
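Python's re module supports both assertions used here, so a quick check could look like the sketch below (the sample inputs are assumptions of mine, not from the question):

```python
import re

pattern = re.compile(r"^(?!\s)([a-zA-Z0-9\s])*?(?<!\s)$")

samples = ["a", "a b c", " a b", "a b ", ""]
# True means the string is accepted.
results = {text: bool(pattern.search(text)) for text in samples}

for text, ok in results.items():
    print(repr(text), ok)
```

Note that the empty string is accepted too, since the group is allowed to repeat zero times.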
Here you go,
\b^[^\s][a-zA-Z0-9]*\s+[a-zA-Z0-9]*\b
\b refers to word boundary
\s+ means allowing white-space one or more at the middle.
(^(\s)+|(\s)+$)
This expression will match the first and last spaces of the article..

Regular expression to remove syslog date in filebeat?

I would like to parse some syslog lines that look like
Oct 20 16:34:59 artguard TTN-xxxxxxxxxxxxxxxxxxxxxxxxxxxxx
I would like to turn them into
TTN-xxxxxxxxxxxxxxxxxxxxxxxxxxxxx
So I was wondering what the regular expression should look like to do this, since the first part will change every day, because it is prepended by syslog.
EDIT: to avoid duplicates, I am trying to use regex with filebeat, where not all regex features are supported, as explained here
Regex101
(TTN-.*$)
Debuggex Demo
Explained
1st Capturing Group (TTN-.*$)
TTN- matches the characters TTN- literally (case sensitive)
.* matches any character (except for line terminators)
* Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
$ asserts position at the end of a line
Global pattern flags
g modifier: global. All matches (don't return after first match)
m modifier: multi line. Causes ^ and $ to match the begin/end of each line (not only begin/end of string)
The regular expression TTN-\S* is probably a way of doing what you're looking for. Here it is in a JavaScript example.
var value = "Oct 20 16:34:59 artguard TTN-xxxxxxxxxxxxxxxxxxxxxxxxxxxxx";
var matches = value.match(
new RegExp("TTN-\\S*", "gi")
);
document.writeln(matches);
It works in two main parts:
The TTN- matches TTN- (obviously)
The \S* matches any character that is not a white-space, this is done as many times as possible.
Currently it always expects at least a '-' after the TTN, but if you replace the '-' with '-{0,1}' in the regex it will expect TTN, maybe a dash, followed by 0-n characters that are not white-space. You could also replace \S* with \w* to get all the letters, digits and underscores, or with .* to get all characters up to the end-of-line \n character, or use TTN-\S*[^\s{2}] to end the match with two spaces. Hope this was helpful.
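The same extraction as a minimal Python sketch, in case you want to test the pattern outside of filebeat. This only illustrates the regex, it is not filebeat configuration, and token is an illustrative name:

```python
import re

line = "Oct 20 16:34:59 artguard TTN-xxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

# TTN- literally, followed by as many non-whitespace characters as possible.
match = re.search(r"TTN-\S*", line)
token = match.group(0) if match else None

print(token)
```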

Finding a match one after another

How do I find multiple matches that are (and can only be) separated from each other by whitespaces?
I have this regular expression:
/([0-9]+)\s*([A-Za-z]+)/
And I want each of the matches (not groups) to be surrounded by a whitespace or another match. If the condition is not fulfilled, the match should not be returned.
This is valid: 1min 2hours 3days
This is not: 1min, 2hours 3days (1min and 2hours should not be returned)
Is there a simpler way of finding a continuous sequence of matches (in Java preferably) than repeating the whole regex before and after the main one, checking if there is a whitespace, start/end of the string or another match?
I believe this pattern will meet your requirements (provided that only a single space character separates your alphanumeric tokens):
(?<=^|[\w\d]\s)([\w\d]+)(?=\s|$)
    ^^^^^^^^^^  ^^^^^^^    ^^^^
       (2)        (1)      (3)
(1) A capture group that contains an alphanumeric string.
(2) A look-behind assertion: To the left of the capture group must be a) the beginning of the line or b) an alphanumeric character followed by a single space character.
(3) A look-ahead assertion: To the right of the capture group must be a) a space character or b) the end of the line.
See regex101.com demo.
Here is some sample data that I included in the demo. Each bolded alphanumeric string indicates a successful capture:
1min 2hours 3days
1min, 2hours 3days
42min 4hours 2days
String text = "1min 2hours 3days";
boolean match = text.matches("(?:\\s*[0-9]+\\s*[A-Za-z]+\\s*)*");
This is basically looking for a pattern on your example. Then using * after the pattern its looking for zero or more occurrence of the pattern in text. And ?: means doesn't capture the group.
This will also return true for an empty string. If you don't want the empty string to be true, then change * into +
I've managed to solve my problem by splitting the string using string.split("\\s+") and then matching the results to the pattern /([0-9]+)\s*([A-Za-z]+)/.
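The whole-string check is easy to try in Python as well. In this sketch re.fullmatch plays the role of Java's String.matches (both require the entire string to match), and the two compiled names are made up for the example:

```python
import re

sequence = re.compile(r"(?:\s*[0-9]+\s*[A-Za-z]+\s*)*")   # zero or more items
non_empty = re.compile(r"(?:\s*[0-9]+\s*[A-Za-z]+\s*)+")  # at least one item

samples = ["1min 2hours 3days", "1min, 2hours 3days", ""]
# Each value is (result with *, result with +).
results = {
    text: (bool(sequence.fullmatch(text)), bool(non_empty.fullmatch(text)))
    for text in samples
}

for text, flags in results.items():
    print(repr(text), flags)
```

This only answers yes or no for the whole input. It does not return the individual matches.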
There is an error here the '' will match all characters and ignore your rest
/([0-9]+)\s([A-Za-z]+)/
Change to
/(\d+)\s+(\w+)/g
This will return an array of matches, each made of digits followed by word characters. There is no need to always write '[0-9]': '\d' says the same thing (match any digit 0 to 9). '\w' is close to '[A-Za-z]' but also matches digits and the underscore. More can be found in this regular expression cheat sheet.

Regular expression doesn't match if a character participated in a previous match

I have this regex:
(?:\S)\++(?:\S)
Which is supposed to catch all the pluses in a query string like this:
?busca=tenis+nike+categoria:"Tenis+e+Squash"&pagina=4&operador=or
It should have been 4 matches, but there are only 3:
s+n
e+c
s+e
It is missing the last one:
e+S
And it seems to happen because the "e" character has participated in a previous match (s+e), because the "e" character is right in the middle of two pluses (Teni s+e+S quash).
If you test the regex with the following input, it matches the last "+":
?busca=tenis+nike+categoria:"Tenis_e+Squash"&pagina=4&operador=or
(changed "s+e" for "s_e" in order not to cause the "e" character to participate in the match).
Would someone please shed a light on that?
Thanks in advance!
In a consecutive match, the search for the next match starts at the position where the previous match ended. And since the non-whitespace character after the + is matched too, the search for the next match will start after that non-whitespace character. So in a sequence like s+e+S you will only find one match:
s+e+S
\_/
You can fix that by using look-around assertions, which check the surrounding characters without making them part of the match, like:
\S\++(?=\S)
This will match any non-whitespace character followed by one or more + only if it is followed by another non-whitespace character.
But since whitespace is not allowed in a URI query, you don’t need the surrounding \S at all, as every character is non-whitespace. So the following will already match every sequence of one or more + characters:
\++
You are correct: The fourth match doesn't happen because the surrounding character has already participated in the previous match. The solution is to use lookaround (if your regex implementation supports it; JavaScript engines older than ES2018 don't support lookbehind, for example).
Try
(?<!\s)\++(?!\s)
This matches one or more + unless they are surrounded by whitespace. This also works if the plus is at the start or the end of the string.
Explanation:
(?<!\s) # assert that there is no space before the current position
# (but don't make that character a part of the match itself)
\++ # match one or more pluses
(?!\s) # assert that there is no space after the current position
If your regex implementation doesn't support lookbehind, you could also use
\S\++(?!\s)
That way, your match would contain the character before the plus, but not after it, and therefore there will be no overlapping matches (Thanks Gumbo!). This will fail to match a plus at the start of the string, though (because the \S does need to match a character). But this is probably not a problem.
You can use the regex:
(?<=\S)\++(?=\S)
To match only the +'s that are surrounded by non-whitespace.
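The effect is easy to reproduce in Python. The sketch below runs the consuming pattern and the lookaround pattern over the query string from the question (consuming and lookaround are just illustrative names):

```python
import re

query = '?busca=tenis+nike+categoria:"Tenis+e+Squash"&pagina=4&operador=or'

# The neighbours are part of the match, so the "e" in s+e+S is used up.
consuming = re.findall(r"\S\++\S", query)
# The neighbours are only checked, never consumed.
lookaround = re.findall(r"(?<=\S)\++(?=\S)", query)

print(consuming)   # ['s+n', 'e+c', 's+e']
print(lookaround)  # ['+', '+', '+', '+']
```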