Testing a single sentence with an optional period - regex

I'm trying to write a regex that tests a single sentence. The sentence can contain any content and should either end in a period with nothing following that period, or have no period or other ending punctuation at all.
I started with this: .*?\.$ and it worked fine testing for a sentence ending in a period. But if I mark the period as optional .*?\.?$ then a sentence can have any ending including a period and text after that period.
To be clear, these should pass the test:
He jumped over the fence.
He jumped over the fence
And this should not pass the test: He jumped over the fence. She jumped over it too.

Try:
^(?:[^.]+\.|[^.]+)$
Regex demo.
^ - start of the string
(?:[^.]+\.|[^.]+) - match either [^.]+\. (one or more non-. characters and .) or [^.]+ (one or more non-. characters) in non-capturing group.
$ - end of the string
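A minimal sketch of how this pattern could be checked against the three sentences from the question, assuming Python's re module (the question does not name a language, and the helper name is_single_sentence is made up for illustration):

```python
import re

# Pattern from the answer above: a run of non-dot characters,
# optionally closed by a single dot, with nothing after it.
SINGLE_SENTENCE = re.compile(r"^(?:[^.]+\.|[^.]+)$")


def is_single_sentence(text: str) -> bool:
    """Return True if the pattern matches the whole text (hypothetical helper)."""
    return SINGLE_SENTENCE.search(text) is not None


if __name__ == "__main__":
    samples = (
        "He jumped over the fence.",
        "He jumped over the fence",
        "He jumped over the fence. She jumped over it too.",
    )
    for sample in samples:
        print(repr(sample), is_single_sentence(sample))
```

Note that [^.] also matches ? and !, so this version only polices dots.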

This pattern .*?\.$ can match the whole line He jumped over the fence. She jumped over it too. because the . can also match a literal dot.
If you don't want to cross newlines, and you want to allow a dot inside the sentence (for example 1.2m) when it has to end on a dot, or otherwise match only chars other than ending punctuation, and a lookahead assertion is supported:
^(?:[^\.\n]*(?:\.(?![^\S\n])[^\.\n]*)*\.|[^!?.\n]+)$
Explanation
^ Start of string
(?: Non capture group
[^\.\n]* Match optional chars other than a dot or a newline
(?:\.(?![^\S\n])[^\.\n]*)* Optionally repeat matching a dot not directly followed by a whitespace char other than a newline, then optional chars other than a dot or a newline
\. Match a dot
| Or
[^!?.\n]+ Match 1+ times any char except for ! ? . or a newline (Or add more ending punctuation chars)
) Close the non capture group
$ End of string
See a regex101 demo
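As a rough check of the lookahead version, here is a sketch assuming Python's re module. The helper name is_sentence_line and the extra samples with 1.2m and a question mark are invented for illustration:

```python
import re

# Lookahead pattern from the answer above, copied unchanged.
SENTENCE_LINE = re.compile(
    r"^(?:[^\.\n]*(?:\.(?![^\S\n])[^\.\n]*)*\.|[^!?.\n]+)$"
)


def is_sentence_line(text: str) -> bool:
    """Return True if text is accepted by the pattern (hypothetical helper)."""
    return SENTENCE_LINE.search(text) is not None


if __name__ == "__main__":
    samples = (
        "He jumped over the fence.",
        "He jumped over the fence",
        "He jumped over the fence. She jumped over it too.",
        "The fence is 1.2m high.",  # inner dot not followed by a space
        "Did he jump?",             # other ending punctuation
    )
    for sample in samples:
        print(repr(sample), is_sentence_line(sample))
```

The inner dot in 1.2m is allowed because it is not followed by a space, while the dot before " She" is rejected by the negative lookahead.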

You can use such regex:
.*?[^.]$
The optional quantifier (?) means the regex matches whether or not the symbol is present in the string.
[^.]$ means that the last character before the end of the string must not be a dot.

Related

Regular expression that matches at least 4 words starting with the same letter?

I've been trying to solve this problem for a few hours but with no luck. The task is to write a regular expression that matches at least four words starting with the same letter. But! These words do not have to be one after another.
This regex should be able to match a line like this:
cat color coral chat
but also one like this:
cat take boom candle creepy drum cheek
Thank you!
So far I have got this regex but it only matches words when they are in order.
(\w)\w+\s+\1\w+\s+\1\w+\s+\1
If you have only words in the line that can be matched with \w:
\b(\w)\w*(?:(?:\s+\w+)*?\s+\1\w*){3}
Explanation
\b A word boundary to prevent a partial word match
(\w)\w* Capture a single word character in group 1 followed by matching optional word characters
(?: Non capture group to repeat as a whole part
(?:\s+\w+)*? Lazily repeat 0+ times matching 1+ whitespace chars and 1+ word chars, to skip words in between that do not start with the character captured in group 1
\s+\1\w* Match 1+ whitespace chars, a backreference to the same captured character and optional word characters
){3} Close the non capture group and repeat 3 times
See a regex demo
Note that \s can also match a newline.
If the words that should start with the same character should be at least 2 characters long (as (\w)\w+ matches 2 or more characters):
\b(\w)\w+(?:(?:\s+\w+)*?\s+\1\w+){3}
See another regex demo.
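To see the first pattern in action on the two lines from the question, a small sketch assuming Python's re module. The helper name shared_initial and the third sample line are invented for illustration:

```python
import re
from typing import Optional

# First pattern from the answer above: a word, then three more words with
# the same first character, lazily skipping other words in between.
FOUR_SAME_INITIAL = re.compile(r"\b(\w)\w*(?:(?:\s+\w+)*?\s+\1\w*){3}")


def shared_initial(line: str) -> Optional[str]:
    """Return the shared first character, or None if there is no match."""
    match = FOUR_SAME_INITIAL.search(line)
    return match.group(1) if match else None


if __name__ == "__main__":
    for line in (
        "cat color coral chat",
        "cat take boom candle creepy drum cheek",
        "cat take boom candle drum",  # only two words start with c
    ):
        print(repr(line), shared_initial(line))
```

The match is case sensitive as written, so Cat and chat would not count as sharing a letter unless a case-insensitive flag is added.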
Another idea to match lines with at least 4 words starting with the same letter:
\b(\w)(?:.*?\b\1){3}
See this demo at regex101
This is not very accurate: after capturing the first character of a word with \b(\w), it only checks that a \b word boundary followed by \1 (the captured character) occurs three more times to the right, with .*? matching any characters in between.
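One way the looser pattern can over-match is sketched below, assuming Python's re module. The sample line with apostrophes is made up for illustration; a t after an apostrophe sits at a word boundary, so it is counted like the start of a word:

```python
import re

# Looser pattern: any word boundary followed by the captured character.
LOOSE = re.compile(r"\b(\w)(?:.*?\b\1){3}")
# Stricter whitespace-based pattern from the earlier answer, for comparison.
STRICT = re.compile(r"\b(\w)\w*(?:(?:\s+\w+)*?\s+\1\w*){3}")


def loose_match(line: str) -> bool:
    return LOOSE.search(line) is not None


def strict_match(line: str) -> bool:
    return STRICT.search(line) is not None


if __name__ == "__main__":
    line = "tan can't don't won't"  # hypothetical input with apostrophes
    print(loose_match(line), strict_match(line))
    print(loose_match("cat color coral chat"), strict_match("cat color coral chat"))
```

Both patterns agree on plain space-separated words; they differ once punctuation splits a word.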

Checking for whitespaces with RegEx

I have strings that look like some text - other text and I need to delete everything before and including the hyphen - and the space after it
But due to typos I might have:
some text -other text or some text- other text or some text-other text or double spaces instead of single spaces
I am using RegEx ^.*\s+\-\s+ and this works for some text - other text with single or multiple spaces before and after the -
But for the other possibilities where the whitespace is missing, I have used two ORs, so I have ^.*\s+\-\s+|.*\-\s|.*\-
Is there a more concise pattern that does not use multiple ORs for this?
Thank you for any help on this
https://regex101.com/r/TNU7i6/1
Instead of using an alternation with 3 patterns, you might use a pattern to match all except the -, then match the - and optional whitespace chars.
^[^-]*-\s*
Regex demo
If there should be a non whitespace char following, and a lookahead is supported:
^[^-]*-\s*(?=\S)
^ Start of string
[^-]*- Match 0+ times any char except -, then match -
\s* Match optional whitespace chars
(?=\S) Positive lookahead, assert a non whitespace char to the right
Regex demo
Note that \s and the negated character class [^-] can also match a newline.
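A short sketch of both patterns used for the actual deletion, assuming Python's re.sub (the helper names strip_prefix and strip_prefix_strict are invented for illustration):

```python
import re

# Everything up to and including the first hyphen, plus trailing whitespace.
PREFIX = re.compile(r"^[^-]*-\s*")
# Same, but only when a non-whitespace character follows.
PREFIX_STRICT = re.compile(r"^[^-]*-\s*(?=\S)")


def strip_prefix(text: str) -> str:
    """Delete the part before the first hyphen, the hyphen, and spaces after it."""
    return PREFIX.sub("", text, count=1)


def strip_prefix_strict(text: str) -> str:
    """Like strip_prefix, but leave the text alone if nothing follows the hyphen."""
    return PREFIX_STRICT.sub("", text, count=1)


if __name__ == "__main__":
    for sample in (
        "some text - other text",
        "some text -other text",
        "some text- other text",
        "some text-other text",
        "some text  -  other text",
    ):
        print(repr(strip_prefix(sample)))
```

With the lookahead variant, an input such as "some text - " is left unchanged, because no non-whitespace character follows the hyphen.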
1st solution: With your shown samples, please try the following.
^.*?\s+\S+\s?-\s*(.*)$
OR
^.*?\s+\S+\s*-\s*(.*)$
Online demo for above regex
2nd solution: You could also use \K to forget the part matched so far; in that case try:
^.*?\s+\S+\s?-\s*\K.*$
OR
^.*?\s+\S+\s*-\s*\K.*$
Online demo for above regex
1st solution explanation:
^.*?\s+ ##From the start of the value, match lazily up to the 1st run of whitespace.
\S+\s? ##Match 1 or more non-whitespace characters followed by an optional whitespace character.
-\s* ##Match - followed by optional whitespace.
(.*)$ ##Capture everything up to the end of the value in group 1.
2nd solution explanation:
^.*?\s+ ##From the start of the value, match lazily up to the 1st run of whitespace.
\S+\s? ##Match 1 or more non-whitespace characters followed by an optional whitespace character.
-\s*\K ##Match - followed by 0 or more whitespace characters; \K then discards everything matched so far (so that only the wanted part is returned).
.*$ ##Match everything after the part discarded by \K.
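Python's built-in re module does not support \K, so the sketch below uses the capture-group form of the 1st solution instead. Python is an assumption here (the thread does not name a language), and text_after_hyphen is a made-up helper name:

```python
import re
from typing import Optional

# Capture-group variant from the 1st solution (the \s* form).
AFTER_HYPHEN = re.compile(r"^.*?\s+\S+\s*-\s*(.*)$")


def text_after_hyphen(text: str) -> Optional[str]:
    """Return group 1 (the text after the hyphen), or None if nothing matches."""
    match = AFTER_HYPHEN.search(text)
    return match.group(1) if match else None


if __name__ == "__main__":
    for sample in (
        "some text - other text",
        "some text -other text",
        "some text- other text",
        "some text-other text",
    ):
        print(repr(text_after_hyphen(sample)))
```

Because the pattern starts with .*?\s+, it needs at least one whitespace character before the word that precedes the hyphen; an input like "a-b" gives None.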

How to capture everything until another capture group

I have the following template :
1251 Left Random Text I want to fill
It can go through multiple lines
As you can see
9841 Right Again we see a lot of random text with 3115 numbers
And this also goes
To multiple lines
0121 Right
5151 Right This one is just one line
I was wrong
9731 Left This one is just a line
5123 NA Instruction 5151 was wrong
4113 Right Instr 9841 was correct
We checked
I want to have 3 groups:
1251
Left
Random Text I want to fill
It can go through multiple lines
As you can see
I'm using
(\d+)\s(\w+)\s(.*)
but it stops at the current line only (so I get only Random Text I want to fill in group 3, although I want including As you can see)
If I'm using Single line flag I get only 1 match for each group, group 3 almost being all
Here is live : https://regex101.com/r/W3x0mH/4
You could use a repeating group that matches all the following lines while asserting that the next line does not start with a digit (the start of a new record, such as 1+ digits followed by Left or Right):
(\d+)\s(\w+)\s(.*(?:\r?\n(?!\d).*)*)
Explanation
(\d+)\s(\w+)\s Match the first 2 groups
( Third capturing group
.* Match 0+ times any char except a newline
(?: Non capturing group
\r?\n(?!\d).* Match newline, assert what is on the right is not a digit
)* Close non capturing group and repeat 0+ times
) Close capturing group
Regex demo
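A sketch of this pattern applied with Python's re.findall. Python is an assumption, and the text below is a shortened excerpt of the sample from the question:

```python
import re

# Pattern from the answer above: number, word, then the rest of the line
# plus any following lines that do not start with a digit.
RECORD = re.compile(r"(\d+)\s(\w+)\s(.*(?:\r?\n(?!\d).*)*)")

# Shortened excerpt of the question's sample text.
SAMPLE = "\n".join([
    "1251 Left Random Text I want to fill",
    "It can go through multiple lines",
    "As you can see",
    "9841 Right Again we see a lot of random text with 3115 numbers",
    "And this also goes",
    "To multiple lines",
    "9731 Left This one is just a line",
])


def parse_records(text: str) -> list:
    """Return a list of (number, side, body) tuples, one per record."""
    return RECORD.findall(text)


if __name__ == "__main__":
    for number, side, body in parse_records(SAMPLE):
        print(number, side, repr(body))
```

No flags are needed here, since . never has to cross a newline; the \r?\n in the repeated group does that explicitly.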
You may use this regex with a lookahead:
^(\d+)\s(\w+)\s(.*?)(?=\n\d|\z)
with DOTALL and MULTILINE modifiers.
Updated Regex Demo
RegEx Details:
^: Line start
(\d+): Match and capture 1+ digits in group #1
\s: match a whitespace
(\w+): Match and capture 1+ word characters in group #2
\s: match a whitespace
(.*?): Match 0 or more of any character (non-greedy) provided next lookahead assertion is satisfied
(?=\n\d|\z): Lookahead assertion to assert that we have a newline followed by a digit or there is end of input
Faster Regex:
If you are using this regex on a long string then you should also keep overall performance in mind as a regex with DOTALL modifier will tend to get slow on a large size text. For that I suggest using this regex that doesn't need DOTALL modifier:
^(\d+)\s(\w+)\s(.*(?:\n.*)*?)(?=\n\d|\z)
RegEx Demo 2
On regex101 demo this regex takes just 181 steps as compared to first one that takes 1300 steps.
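Both variants can be compared side by side in a sketch like the following. It assumes Python's re module, where the end-of-input anchor is written \Z, so \z from the answer is replaced with \Z; the sample is a shortened excerpt of the question's text without a trailing newline:

```python
import re

# Lazy version: needs DOTALL so that . can cross newlines.
LAZY_DOTALL = re.compile(
    r"^(\d+)\s(\w+)\s(.*?)(?=\n\d|\Z)", re.DOTALL | re.MULTILINE
)
# Line-by-line version: only MULTILINE is needed.
LINE_BY_LINE = re.compile(
    r"^(\d+)\s(\w+)\s(.*(?:\n.*)*?)(?=\n\d|\Z)", re.MULTILINE
)

# Shortened excerpt of the question's sample text.
SAMPLE = "\n".join([
    "1251 Left Random Text I want to fill",
    "It can go through multiple lines",
    "As you can see",
    "9841 Right Again we see a lot of random text with 3115 numbers",
    "And this also goes",
    "To multiple lines",
    "9731 Left This one is just a line",
])

if __name__ == "__main__":
    lazy = LAZY_DOTALL.findall(SAMPLE)
    by_line = LINE_BY_LINE.findall(SAMPLE)
    print(lazy == by_line)
    for number, side, body in by_line:
        print(number, side, repr(body))
```

The step counts quoted above come from regex101; this sketch only shows that the two patterns return the same groups on this input.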
For the third group, repeat any character while using negative lookahead for ^\d, which would indicate the start of a new match:
(\d+)\s(\w+)\s((?:(?!^\d)[\s\S])*)
https://regex101.com/r/W3x0mH/5
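A sketch of this character-by-character approach, assuming Python's re module with the MULTILINE flag so that ^ matches at each line start. The rstrip call is an addition of this sketch, since group 3 here keeps the newline before the next record:

```python
import re

# Consume any character as long as we are not at a line start followed by a digit.
TEMPERED = re.compile(r"(\d+)\s(\w+)\s((?:(?!^\d)[\s\S])*)", re.MULTILINE)

# Shortened excerpt of the question's sample text.
SAMPLE = "\n".join([
    "1251 Left Random Text I want to fill",
    "It can go through multiple lines",
    "As you can see",
    "9841 Right Again we see a lot of random text with 3115 numbers",
    "And this also goes",
    "To multiple lines",
    "9731 Left This one is just a line",
])


def parse_records(text: str) -> list:
    """Return (number, side, body) tuples with trailing whitespace trimmed."""
    return [
        (number, side, body.rstrip())
        for number, side, body in TEMPERED.findall(text)
    ]


if __name__ == "__main__":
    for record in parse_records(SAMPLE):
        print(record)
```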
You may try with this regex:
^(\d+)\s+(\w+)\s+(.*?)(?=^\d|\z)
^(\d+)\s+ The line begins with 1+ digits (captured in group 1) followed by one or more whitespace characters \s+
(\w+)\s+ where \w+ is one or more word characters (Left, Right, NA or something else) followed by one or more whitespace characters \s+
(.*?) matches everything until it finds a line beginning with a digit or \z, the end of the string (this needs the DOTALL and MULTILINE modifiers).
I think it fits your requirement.
Regex101

Regex : Match everything after first dash

I have a string which contains the rego number of the car like
1FX9JE - 2012 Audi A3 Ambition Sportback MY12 Stronic
I would like to match everything except the rego number, so anything after the dash.
The regex I came up with is (php)
\s.[^-]*$
The regex I came up with matches everything after the dash only if the string contains exactly 1 dash. For example https://regex101.com/r/Jao8W0/1
However, if the string has more than 1 dash, the regex is not usable.
For example: https://regex101.com/r/Jao8W0/2
Is there any way for me to match everything after the first dash even when the string contains additional dashes after it?
Thank you
Try this Regex:
^[^-\r\n]+-\s*\K.*$
Click for Demo
Explanation:
^ - asserts the start of the string
[^-\r\n]+ - matches 1+ occurrences of any character that is neither a - nor a newline
-\s* - matches the first - in the string followed by 0+ whitespaces
\K - forgets everything matched so far
.* - matches 0+ occurrences of any character
$ - asserts the end of the string
If there is only one space after the dash, you can use this pattern:
(?<=\-\s)(.*)
Otherwise, if there may be more than one space, take group 1 from the match of:
(?<=\-)\s*(.*)
(?<=...) Ensures that the given pattern will match, ending at the
current position in the expression. The pattern must have a fixed
width. Does not consume any characters.
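The question is about PHP, but the lookbehind patterns can also be tried in other engines. The sketch below assumes Python's re module, which has no \K, so the first answer is ported to a capture group. The second sample string (with an extra dash) and the helper name after_first_dash are invented for illustration:

```python
import re
from typing import Optional

# Capture-group port of the \K answer: group 1 is everything after the first dash.
CAPTURE = re.compile(r"^[^-\r\n]+-\s*(.*)$")
# Lookbehind answers, copied without the redundant backslash before the dash.
LOOKBEHIND_ONE_SPACE = re.compile(r"(?<=-\s)(.*)")
LOOKBEHIND_ANY_SPACES = re.compile(r"(?<=-)\s*(.*)")


def after_first_dash(text: str, pattern=CAPTURE) -> Optional[str]:
    """Return group 1 of the first match, or None if the pattern does not match."""
    match = pattern.search(text)
    return match.group(1) if match else None


if __name__ == "__main__":
    samples = (
        "1FX9JE - 2012 Audi A3 Ambition Sportback MY12 Stronic",
        "1FX9JE - 2012 Audi A3 - Ambition",  # hypothetical string with a second dash
    )
    for sample in samples:
        for pattern in (CAPTURE, LOOKBEHIND_ONE_SPACE, LOOKBEHIND_ANY_SPACES):
            print(repr(after_first_dash(sample, pattern)))
```

Since search returns the leftmost match, the lookbehind versions also stop at the first dash and keep any later dashes in the result.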

Tokenizing a string with a regular expression

Suppose I have a string like this: abc def ghi jkl (I put a space at the end for the sake of simplicity but it doesn't really matter for me) and I want to capture its "chunks" as follows:
abc
def
ghi
jkl
if and only if there are 1-4 "chunks" in the string. I have already tried the following regex:
^([^ ]+ ){1,4}$
at Regex101.com but it only captures the last occurrence. A warning about it is issued:
A repeated capturing group will only capture the last iteration. Put a capturing group around the repeated group to capture all iterations or use a non-capturing group instead if you're not interested in the data
How to correct the regular expression to achieve my goal?
Since you have no access to the code, the only solution you might use is a regex based on the \G operator that will only allow consecutive matches and a lookahead anchored at the start that will require 1 to 4 non-whitespace chunks in the string.
(?:^(?=\s*\S+(?:\s+\S+){0,3}\s*$)|\G(?!^))\s*\K\S+
See the regex demo
Details:
(?:^(?=\s*\S+(?:\s+\S+){0,3}\s*$)|\G(?!^)) - a custom boundary that checks if:
^(?=\s*\S+(?:\s+\S+){0,3}\s*$) - the string start position (^) that is followed with 1 to 4 non-whitespace chunks, separated with 1+ whitespaces, and trailing/leading whitespaces are allowed, too
| - or
\G(?!^) - the current position at the end of the previous successful match (\G also matches the start of a string, thus we have to use the negative lookahead to exclude that matching position, since there is a separate check performed)
\s* - zero or more whitespaces
\K - a match reset operator discarding all the text matched so far
\S+ - 1 or more characters other than whitespace
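When code can be used instead, the same result can be reached without \G and \K (neither is supported by Python's built-in re module). The sketch below is a different technique from the single regex above: it first validates the whole string with the lookahead's sub-pattern, then collects the chunks. Python and the helper name tokenize are assumptions for illustration:

```python
import re

# Same check as the lookahead in the answer: 1 to 4 non-whitespace chunks,
# with optional leading and trailing whitespace.
ONE_TO_FOUR_CHUNKS = re.compile(r"\s*\S+(?:\s+\S+){0,3}\s*")
CHUNK = re.compile(r"\S+")


def tokenize(text: str) -> list:
    """Return the chunks if there are 1 to 4 of them, else an empty list."""
    if ONE_TO_FOUR_CHUNKS.fullmatch(text) is None:
        return []
    return CHUNK.findall(text)


if __name__ == "__main__":
    print(tokenize("abc def ghi jkl "))
    print(tokenize("a b c d e"))
```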
It can be done on linux using tr:
tr -sc 'a-zA-Z' '\n' < text.txt > out_text.txt
where text.txt is the file containing your string to be normalized.
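Where tr is not available, roughly the same transformation can be sketched in Python: every run of characters outside a-zA-Z becomes a single newline. The helper name letters_per_line is made up for illustration, and like the tr command this splits on anything that is not an ASCII letter and does not enforce the 1-4 chunk limit:

```python
import re

# Any run of characters that are not ASCII letters.
NON_LETTERS = re.compile(r"[^a-zA-Z]+")


def letters_per_line(text: str) -> str:
    """Replace each run of non-letters with one newline (like tr -sc 'a-zA-Z' '\\n')."""
    return NON_LETTERS.sub("\n", text)


if __name__ == "__main__":
    print(letters_per_line("abc def ghi jkl "), end="")
```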