Regexp: find out if a value repeats several times

I have strings:
TH 8H 5C QS TC
9S 4S JS KS JS
I want the second one to be picked up by a regexp. Please help me construct the necessary expression.
What I have tried so far is S{5}, but of course it looks for five consecutive S characters.
Could I avoid specifying which character I am looking for? I need 5 repetitions of any character. Could it be something like .{5}?
Thanks in advance!

If you have standalone strings, use
^\wS(?: \wS){4}$
See the regex demo
If these strings appear inside a larger text, replace the ^ and $ anchors with word boundaries \b:
\b\wS(?: \wS){4}\b
See another demo
Note that \w matches any alphanumeric or underscore character. If there can be any non-whitespace character, use \S instead:
\b\SS(?: \SS){4}\b
One more demo
\SS will match a non-whitespace character followed by an S, and (?: \SS){4} will match 4 more such sequences, each preceded by a space (thus, there will be five 2-character sequences, each ending in S).
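As a quick check, the anchored pattern can be run against the two sample hands. This is a minimal Python sketch; the names HANDS, SPADES_ONLY and SAME_SUIT are made up for illustration. The second pattern is not from the answer: it assumes the asker does not want to name the repeated character, so it captures the first suit and reuses it through the backreference \1.

```python
import re

HANDS = ["TH 8H 5C QS TC", "9S 4S JS KS JS"]

# Pattern from the answer: five space-separated cards, each ending in S.
SPADES_ONLY = re.compile(r"^\wS(?: \wS){4}$")

# Assumed variant (not in the answer): capture the first card's second
# character and require the same one on the other four cards via \1.
SAME_SUIT = re.compile(r"^\w(\w)(?: \w\1){4}$")

spade_hands = [hand for hand in HANDS if SPADES_ONLY.search(hand)]
flush_hands = [hand for hand in HANDS if SAME_SUIT.search(hand)]

print(spade_hands)
print(flush_hands)
```

With the backreference variant, a hand such as 2H 3H 4H 5H 6H would also be accepted, while the S-only pattern would reject it.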

Related

RegExp: Match first 3 char words

/[\w|A-Z]{1,3}[a-z]/g
but I want to match only the first 3 char of words.
For example:
I WANt THE FIRst 3 CHAr OF WORds ONLy.
It's for a rapid reader: only uppercase the beginning of each word.
The best could be: (First 3 char)(Rest of the word or space)
https://regex101.com/r/PCi8Dn/2
Thank you !
Original answer
Use a positive lookahead, (?=pattern), to require what follows without including it in the match.
[A-Z]{1,3}(?=[a-z])
appears to do what you want (if I've understood your spec correctly).
You can see it in action here.
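To see what the lookahead version picks up, here is a small Python sketch that runs it over the example sentence from the question. The variable names are illustrative only.

```python
import re

sentence = "I WANt THE FIRst 3 CHAr OF WORds ONLy."

# Up to three uppercase letters, but only when a lowercase letter follows.
# The lookahead checks the lowercase letter without adding it to the match.
prefixes = re.findall(r"[A-Z]{1,3}(?=[a-z])", sentence)
print(prefixes)
```

Words written entirely in uppercase, such as THE or OF, are skipped because no lowercase letter follows them.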
New answer following clarification on spec
I think this does what you want:
(\S{1,3})(\S*[\s\.]+)
The breakdown is:
1st capturing group: (\S{1,3})
Matches a maximum of 3 non-space characters (\S is used instead of \w because I think you want to match characters with diacritics like à and punctuation in the middle of words like ').
2nd capturing group: (\S*[\s\.]+)
Matches zero or more non-space characters (the remaining characters in each word) followed by one or more delimiter characters (space or period). I included period as a delimiter to match the last word. You might want to adjust that part depending on your exact needs.
See it in action here.
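The two capturing groups make the replacement straightforward. The following Python sketch uppercases group 1 and leaves group 2 untouched; the function name emphasize is hypothetical, and the choice of re.sub with a callback is just one way to do it.

```python
import re

# Group 1: first 1-3 non-space characters of a word.
# Group 2: the rest of the word plus the delimiter(s) after it.
WORD = re.compile(r"(\S{1,3})(\S*[\s\.]+)")

def emphasize(text: str) -> str:
    """Uppercase the first three characters of each delimited word."""
    return WORD.sub(lambda m: m.group(1).upper() + m.group(2), text)

result = emphasize("i want the first 3 char of words only.")
print(result)
```

Because the pattern requires a trailing space or period, a final word with no delimiter after it is left unchanged, which is the part the answer suggests adjusting to your needs.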

Find certain colons in string using Regex

I'm trying to search for colons in a given string so as to split the string at the colon for preprocessing based on the following conditions
Preceded or followed by a word, e.g. A Book: Chapter 1 or A Book :Chapter 1
Do not match if it is part of an emoticon, e.g. :( or ): or :/ or :-) etc.
Do not match if it is part of a given time, e.g. 16:00 etc.
I've come up with a regex as such
(\:)(?=\w)|(?<=\w)(\:)
which satisfies conditions 1 & 2 but still fails on condition 3, as it matches the colon present in the string representation of a time. How do I fix this?
edit: it has to be in a single regex statement if possible
You can use
(:\b|\b:)(?!(?:(?<=\b\d:)|(?<=\b\d{2}:))\d{1,2}\b)
See the regex demo. Details:
(:\b|\b:) - Group 1: a : that is either preceded or followed with a word char
(?!(?:(?<=\b\d:)|(?<=\b\d{2}:))\d{1,2}\b) - there should be no one or two digits right after : (followed with a word boundary) if the : is preceded with a single or two digits (preceded with a word boundary).
Note :\b is equal to :(?=\w) and \b: is equal to (?<=\w):.
If you need to get the same capturing groups as in your original pattern, replace (:\b|\b:) with (?:(:)\b|\b(:)).
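As a rough check of the single-pattern approach, this Python sketch lists the colon positions it finds in a few sample strings. The samples and variable names are made up for illustration; the pattern is the one given above, which Python's re module accepts because each lookbehind has a fixed width.

```python
import re

COLON = re.compile(r"(:\b|\b:)(?!(?:(?<=\b\d:)|(?<=\b\d{2}:))\d{1,2}\b)")

samples = [
    "A Book: Chapter 1",
    "A Book :Chapter 1",
    "Meet at 16:00 :(",
    "so sad ): really",
]

# Map each sample to the indexes of the colons the pattern accepts.
hits = {s: [m.start() for m in COLON.finditer(s)] for s in samples}
for sample, positions in hits.items():
    print(repr(sample), positions)
```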
More flexible solution
Note that excluding matches can be done with a simpler pattern that matches and captures what you need and just matches what you do not need. This is called the "best regex trick ever". So, you may use a regex like
8:|:[PD]|\d+(?::\d+)+|(:\b|\b:)
that will match 8:, :P, :D, one or more digits and then one or more sequences of : and one or more digits, or will match and capture into Group 1 a : char that is either preceded or followed with a word char. All you need to do is to check if Group 1 matched, and implement required extraction/replacement logic in the code.
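The "check if Group 1 matched" step could look like the following Python sketch. The function name split_at_colons is hypothetical, and stripping whitespace around each piece is my own choice rather than something the answer prescribes.

```python
import re

# Alternatives without a group swallow what must be ignored (8:, :P, :D, times);
# only the last alternative captures a colon into Group 1.
TRICK = re.compile(r"8:|:[PD]|\d+(?::\d+)+|(:\b|\b:)")

def split_at_colons(text: str) -> list[str]:
    """Split text at colons captured in Group 1, ignoring all other matches."""
    parts, last = [], 0
    for m in TRICK.finditer(text):
        if m.group(1) is None:
            continue  # an emoticon or a time was matched; skip it
        parts.append(text[last:m.start(1)].strip())
        last = m.end(1)
    parts.append(text[last:].strip())
    return parts

print(split_at_colons("A Book: Chapter 1"))
print(split_at_colons("Starts 16:00 :P"))
```

New exclusions can be added as further alternatives on the left without touching the capturing part.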
Word characters (\w) include digits: \w is equivalent to [a-zA-Z0-9_].
So just use [a-zA-Z] instead:
(\:)(?=[a-zA-Z])|(?<=[a-zA-Z])(\:)
Test Here
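A short Python sketch of the letters-only variant follows; the helper name colon_positions is made up. Note that, as written, this pattern would still match the colon of a letter emoticon such as :P, since P is a letter.

```python
import re

# A colon followed by a letter, or a colon preceded by a letter.
LETTER_COLON = re.compile(r"(:)(?=[a-zA-Z])|(?<=[a-zA-Z])(:)")

def colon_positions(text: str) -> list[int]:
    """Return the indexes of colons adjacent to an ASCII letter."""
    return [m.start() for m in LETTER_COLON.finditer(text)]

print(colon_positions("A Book: Chapter 1"))
print(colon_positions("16:00"))
print(colon_positions(":P"))
```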

Regex - Can quantifier skip certain range?

I have a simple regex like this [0-9a-zA-Z]{32,45} that matches 0-9,a-z,A-Z 32 to 45 times. Is there a way I can have the regex skip a certain range? For example, I don't want to match if there are 40 characters.
One way to do that would be
\b[0-9a-zA-Z]{32,39}+(?:[0-9a-zA-Z]{2,6})?\b
See proof. You match 32 to 39 occurrences possessively, then an optional occurrence of 2 to 6 repetitions of the pattern.
Another way could be using an alternation | repeating the character class either 41-45 times or 32-39 times.
You could prepend and append a word boundary \b to the pattern.
\b(?:[0-9a-zA-Z]{41,45}|[0-9a-zA-Z]{32,39})\b
Regex demo
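The alternation version is easy to verify by feeding it runs of different lengths. This is a minimal Python sketch; the helper name accepted and the choice of test lengths are mine.

```python
import re

# Either 41-45 or 32-39 alphanumerics, delimited by word boundaries,
# so a run of exactly 40 (or fewer than 32, or more than 45) is rejected.
TOKEN = re.compile(r"\b(?:[0-9a-zA-Z]{41,45}|[0-9a-zA-Z]{32,39})\b")

def accepted(length: int) -> bool:
    """Check whether a run of `length` alphanumeric characters matches."""
    return TOKEN.search("a" * length) is not None

results = {n: accepted(n) for n in (31, 32, 39, 40, 41, 45, 46)}
print(results)
```

The word boundaries matter here: without them, a 40-character run would still match on its first 39 characters.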

.NET regex to look ahead and eliminate strings in advance that don't contain certain characters

I am using the .NET flavor of regex.
Suppose I have a string 123456789AB
and I want to match AB (could be any two capital letters) only if the numeric part of the string (123456789) has 5 and 8 in it.
So what i came up with was
(?=5)(?=8)([A-Z]{2})
But this is not working.
After some trial and error on RegexStorm
I got to
(?=(.*5))(?=(.*8))[A-Z]{2}
What I am expecting is that it will start matching from the start of the string, as a lookahead does not consume any characters.
But the part "[A-Z]{2}" does not move ahead to match AB in the input string.
My question is: why is that so?
I know replacing it with .*[A-Z]{2} will make it move ahead, but then the match contains the entire string.
What is the solution in this case, other than putting the letter part ([A-Z]{2}) in a separate group and then reading only that group?
Lookaheads check for the pattern match immediately to the right of the current position in the string. (?=(.*5))(?=(.*8)) matches a location that is immediately followed with any 0 or more chars other than line break chars, as many as possible, and then 5; then, at the same position, another similar check is performed, but requiring 8 after any zero or more chars, as many as possible. Since lookaheads consume no text, [A-Z]{2} must then match at that very same location: at the start of 123456789AB it finds digits instead of letters, and at AB the lookaheads fail because no 5 or 8 follows.
You may use as many lookbehinds as there are required substrings before the two letters:
(?s)(?<=5.*?)(?<=8.*?)[A-Z]{2}
See the regex demo
Details
(?s) - makes the . match newline characters, too
(?<=5.*?) - a location that is immediately preceded with 5 and then 0 or more chars as few as possible
(?<=8.*?) - a location that is immediately preceded with 8 and then 0 or more chars as few as possible
[A-Z]{2} - two ASCII uppercase letters.
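The behaviour of the original lookahead attempt can be reproduced outside .NET. The Python sketch below first shows that attempt failing, then emulates the lookbehind idea in plain code, because Python's built-in re module does not accept variable-width lookbehinds such as (?<=5.*?). The function name letters_after_5_and_8 is hypothetical, and the prefix check is a swapped-in technique, not the .NET pattern itself.

```python
import re

text = "123456789AB"

# Lookaheads and [A-Z]{2} must all succeed at the same position:
# at the digits the letters are missing, at AB no 5 or 8 follows.
anchored = re.search(r"(?=.*5)(?=.*8)[A-Z]{2}", text)

def letters_after_5_and_8(s: str):
    """Return the first two-uppercase-letter run preceded by both 5 and 8."""
    for m in re.finditer(r"[A-Z]{2}", s):
        prefix = s[:m.start()]  # everything to the left of the letters
        if "5" in prefix and "8" in prefix:
            return m.group()
    return None

print(anchored)
print(letters_after_5_and_8(text))
```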
An alternative would be to "unfold" what you expect to match using exclusionary character classes and alternation of match order. Not pretty, but pretty fast:
(?<=\b[^58]*?(?:5[^8]*8|8[^5]*5)[^A-Z]*?)[A-Z]{2}

How do I split a filename using Logstash Grok?

One of these days I'll learn regex.
I have the following filename
PE-run1000hbgmm3f1-job1000hbgmm3dt-Output-Workflow-1000hbgmm3fb-22.07.17.log
I'm able to get this to work so...
(?<logtype>[^-]+)-(?<run_id>[^-]+)-(?<job_id>[^-]+)-(?<capability>[^(0-9\.0-9\.0-9)]+)
logtype: PE
run_id: run1000hbgmm3f1
job_id: job1000hbgmm3dt
But I'm getting
capability: Output-Workflow-
...though I want it to be
capability: Output-Workflow-1000hbgmm3fb
...that is, all the text after the job_id up to the timestamp HH.mm.ss. Any help please? Thanks!
It is because you cannot negate a sequence of symbols with a negated character class. [^(0-9\.0-9\.0-9)] matches any single char other than (, digit, . and ).
You may replace your (?<capability>[^(0-9\.0-9\.0-9)]+) with (?<capability>.*?)-\d{2}\.\d{2}\.\d{2} to get the right value.
Now, (?<capability>.*?)-\d{2}\.\d{2}\.\d{2} will match any 0+ chars (and capture them into the "capability" group) as few as possible (since *? is a lazy quantifier) up to the first occurrence of -, followed with 2 digits, and then 2 sequences of a dot (\.) followed with 2 digits.
See the regex demo at regex101.com.
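The difference between the two capability groups can also be checked outside Logstash. The following is a Python sketch, not grok configuration: Python spells named groups as (?P<name>...) rather than (?<name>...), and the names BROKEN and FIXED are made up for illustration.

```python
import re

FILENAME = "PE-run1000hbgmm3f1-job1000hbgmm3dt-Output-Workflow-1000hbgmm3fb-22.07.17.log"

PREFIX = r"(?P<logtype>[^-]+)-(?P<run_id>[^-]+)-(?P<job_id>[^-]+)-"

# Original attempt: the negated class stops at the first digit, dot or parenthesis.
BROKEN = re.compile(PREFIX + r"(?P<capability>[^(0-9\.0-9\.0-9)]+)")

# Suggested fix: lazily capture everything up to "-NN.NN.NN".
FIXED = re.compile(PREFIX + r"(?P<capability>.*?)-\d{2}\.\d{2}\.\d{2}")

broken_fields = BROKEN.search(FILENAME).groupdict()
fixed_fields = FIXED.search(FILENAME).groupdict()

print(broken_fields["capability"])
print(fixed_fields["capability"])
```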