Using a regular expression, I'm trying to match a label, in this case "Business Unit:", followed by one or more spaces, then match everything in a submatch after that to the end of that line. The problem: when there are no characters after the label on the line, the match grabs the next line instead.
For example, here's some test data:
Business Unit:(space)(space)BU1(space)
This is Line 2
Business Unit:(space)(space)
This is Line 4
So I want to grab just "BU1" from the first line, and that works. It should match an empty string from the third line, but it matches the contents of the fourth line instead, in this case "This is Line 4".
Here is my expression:
Business Unit:\s+(.+)
I thought the dot character is not supposed to match a newline, but it seems like it does.
What's the correct regular expression in this case?
The real problem here is that \s matches any whitespace, including newlines, and \s+ is greedy. When nothing follows the label, \s+ runs on through the line break, and .+ then captures the next line.
This should meet your requirements.
The pattern is ^Business Unit: *([\S]*)
This is assuming of course your business unit won't contain any spaces. If it does, then I can modify the pattern.
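To see the difference concretely, here is a minimal Python sketch. The sample text mirrors the test data from the question; the use of Python's re module and of re.MULTILINE (so that ^ matches at the start of each line) are assumptions, since the question does not name a regex flavor.

```python
import re

# Sample data from the question: the label on line 3 is followed only by spaces.
text = (
    "Business Unit:  BU1 \n"
    "This is Line 2\n"
    "Business Unit:  \n"
    "This is Line 4\n"
)

# Original pattern: \s+ also consumes the newline, so .+ captures line 4.
original = re.findall(r"Business Unit:\s+(.+)", text)

# Suggested pattern: a literal space cannot cross the line break.
fixed = re.findall(r"^Business Unit: *([\S]*)", text, flags=re.MULTILINE)

print(original)  # ['BU1 ', 'This is Line 4']
print(fixed)     # ['BU1', '']
```

Note that the original pattern also keeps the trailing space after BU1, because the dot matches spaces.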
It depends a bit on the context in which you use the regex, because multi-line handling may vary, but here is a start:
/^Business Unit: +([^ ]*) *$/
^ Starting from the beginning of the line,
Match the literal, Business Unit:,
+ followed by 1 or more spaces,
([^ ]*) capture any possible non-blank stuff,
*$ followed by spaces till the end of the line.
Again, depending on your context, you may need to specify the line end as \n:
/^Business Unit: +([^ ]*) *\n/
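One way to sidestep the multi-line question entirely is to split the text into lines first and apply the pattern to each line on its own. The sketch below assumes Python; the helper name business_units is hypothetical and only serves the illustration.

```python
import re

# No multiline flag needed: each line is matched on its own.
LABEL = re.compile(r"^Business Unit: +([^ ]*) *$")


def business_units(text):
    """Return the captured value for every line that carries the label."""
    values = []
    for line in text.splitlines():
        match = LABEL.match(line)
        if match:
            values.append(match.group(1))
    return values


sample = "Business Unit:  BU1 \nThis is Line 2\nBusiness Unit:  \nThis is Line 4"
print(business_units(sample))  # ['BU1', '']
```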
The \n character is part of \s. That is why you get a match onto the following line.
You can do:
/^Business Unit:[ \t]*([^\n]*?)[ \t]*$/m
If you want to exclude the leading horizontal spaces and not match if blank:
/^Business Unit:[ \t]+(\S+)[ \t]*$/m
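As a quick check of both variants, here is a small Python sketch. It assumes Python's re module, where the /m flag corresponds to re.MULTILINE; the variable names are made up for the example.

```python
import re

text = "Business Unit:  BU1 \nThis is Line 2\nBusiness Unit:  \nThis is Line 4\n"

# Allows an empty value; trailing spaces and tabs stay outside the group.
allow_empty = re.compile(r"^Business Unit:[ \t]*([^\n]*?)[ \t]*$", re.MULTILINE)

# Requires a non-blank value, so the label-only line is skipped.
require_value = re.compile(r"^Business Unit:[ \t]+(\S+)[ \t]*$", re.MULTILINE)

print(allow_empty.findall(text))    # ['BU1', '']
print(require_value.findall(text))  # ['BU1']
```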
Use a character class subtraction for whitespace except newlines:
Business Unit:[\s&&[^\n]]*(\S*)
The expression [\s&&[^\n]] is the subtraction, then the capture is for 0 or more non-whitespace (your target).
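The && syntax inside a character class is not available in every flavor; Python's re module, for one, does not implement it. In such flavors the same set can be written as the double negative [^\S\n] ("not non-whitespace and not newline"). A minimal sketch, assuming Python:

```python
import re

text = "Business Unit:  BU1 \nThis is Line 2\nBusiness Unit:  \nThis is Line 4\n"

# [^\S\n] is whitespace minus the newline, so the match stays on one line.
pattern = re.compile(r"Business Unit:[^\S\n]*(\S*)")

print(pattern.findall(text))  # ['BU1', '']
```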
In your example you capture the last line because \s also matches a newline.
What you could do is replace \s+ with a literal space followed by +, and capture any character zero or more times in a group with (.*):
You might use a word boundary \b at the start.
\bBusiness Unit: +(.*)
Update
Based on the comments: to avoid capturing whitespace at the end of the line, match one or more non-whitespace characters with \S+, followed by a repeated group that matches a single space or tab [ \t] and then one or more non-whitespace characters. Making the capture group optional with ? lets the label still match when nothing follows it.
\bBusiness Unit: +(\S+(?:[ \t]\S+)*)?
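Here is a small Python sketch of how this behaves. The value "Sales East" is made-up sample data, chosen to show that an inner space is kept while the trailing space is dropped.

```python
import re

text = "Business Unit:  Sales East \nThis is Line 2\nBusiness Unit:  \nThis is Line 4\n"

pattern = re.compile(r"\bBusiness Unit: +(\S+(?:[ \t]\S+)*)?")

# A group that does not participate in the match is reported as None.
values = [m.group(1) for m in pattern.finditer(text)]
print(values)  # ['Sales East', None]
```

With re.findall the second entry would come back as an empty string instead of None.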
I cannot make a regex that captures only one or more spaces followed by a single letter s.
((\s)+(s){1,1})
It works, but breaks when you start to stress test it; for example, it also captures the start of words beginning with s.
word s word s
word s
word suffering
word spaces
word s some ss spaces
there's something wrong
words S s
If you want a single letter s to be captured, as opposed to an s at the beginning of a longer word, you need to specify a word break \b after s:
\s+s\b
Note that for a match only, you can omit all the capture groups. Also, (s){1,1} is the same as (s){1}, and that quantifier can itself be omitted, leaving just s.
If, for example, you do not want to match the s in s#, you can instead assert a whitespace boundary to the right:
\s+s(?!\S)
As \s can also match a newline, if you want to match spaces without newlines:
[^\S\n]+s(?!\S)
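To compare the word-boundary and whitespace-boundary variants, here is a short Python sketch run over some of the test lines from the question. The extra line "word s#" is an added sample to show where the two differ.

```python
import re

word_boundary = re.compile(r"\s+s\b")
whitespace_boundary = re.compile(r"[^\S\n]+s(?!\S)")

samples = [
    "word s word s",
    "word suffering",
    "word s some ss spaces",
    "there's something wrong",
    "words S s",
    "word s#",  # added sample: s followed by a non-word, non-space character
]

for line in samples:
    # Each row shows the line, then the matches of each pattern.
    print(repr(line), word_boundary.findall(line), whitespace_boundary.findall(line))
```

Only the last line separates the two patterns: \b accepts the position between s and #, while (?!\S) rejects it.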
I want to regex match the last word in a string where the string ends in ... The match should be the word preceding the ...
Example: "Do not match this. This sentence ends in the last word..."
The match would be word. This gets close: \b\s+([^.]*). However, I don't know how to make it work with only matching ... at the end.
This should NOT match: "Do not match this. This sentence ends in the last word."
Using \s+ requires at least one whitespace character before the word, so the pattern will not match when the string consists of word... alone.
If you want to use the negated character class, you could also use
([^\s.]+)\.{3}$
( Capture group 1
[^\s.]+ Match 1+ times any char except a whitespace char or dot
) Close group
\.{3} Match 3 dots
$ End of string
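A minimal Python sketch of this pattern applied to the two example sentences; the helper name word_before_ellipsis is hypothetical.

```python
import re

LAST_WORD = re.compile(r"([^\s.]+)\.{3}$")


def word_before_ellipsis(text):
    """Return the word directly before a trailing '...', or None."""
    match = LAST_WORD.search(text)
    return match.group(1) if match else None


print(word_before_ellipsis("Do not match this. This sentence ends in the last word..."))
print(word_before_ellipsis("Do not match this. This sentence ends in the last word."))
```

The first call yields the captured word and the second yields None, because a single trailing dot does not satisfy \.{3}.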
You can anchor your regex to the end with $. To match a literal period you will need to escape it as it otherwise is a meta-character:
(\S+)\.\.\.$
\S matches everything but space-like characters. What exactly that covers depends on your regex flavor, but it usually excludes spaces, tabs, newlines and a set of Unicode spaces.
You can play around with it here:
https://regex101.com/r/xKOYa4/1
I am looking to create a match for the following:
"Adam Lambert"
"Mr. Adam Lambert"
"adam#test.com"
But not match the following:
"Adam  Lambert"
"Adam Lambert "
Rules:
Any alphanumeric character should be matched
A single space at any point should be matched
Any number of single spaces can be matched
Double spaces are not matched
A single space at the end of a string is not matched
EDIT
I also need to match the following. Sorry I missed this.
name:((\w+(?:\S\w+)*|\s(?:\w+\S)*)\S)*
I need to match to:
name:
name:A
name:Adam Lambert
The above regex matches from "name:Ad..." but it will not match "name:A"
I would generalize the solution: match a run of non-space characters, followed by optional groups consisting of a single space and another run of non-space characters, since your only hard criterion seems to be the number of spaces. For example:
^\S+(?: \S+)*$
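Checked against the examples from the question, a Python sketch of this pattern could look as follows. The helper name is_single_spaced is hypothetical, and the double-space sample is written out explicitly.

```python
import re

SINGLE_SPACED = re.compile(r"^\S+(?: \S+)*$")


def is_single_spaced(text):
    """True if words are separated by exactly one space, with none at the ends."""
    return SINGLE_SPACED.match(text) is not None


for sample in ["Adam Lambert", "Mr. Adam Lambert", "adam#test.com",
               "Adam  Lambert", "Adam Lambert "]:
    print(repr(sample), is_single_spaced(sample))
```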
^(?:\S+(?:\s\S+)*|\s(?:\S+\s)*)\S$
Meaning:
^ start of the line
(?: non-capturing group
\S+ one or more non-whitespace characters
(?:\s\S+)* zero or more groups of a single whitespace character and one or more non-whitespace characters
| or
\s one whitespace character
(?:\S+\s)* zero or more groups of one or more non-whitespace characters and one whitespace character
) end of the non-capturing group
Finally, one non-whitespace character \S and the end of the line: $.
In your third example the # won't be matched by \w, but it will be if you change it to \S (any non-whitespace character).
See it in action here: regexr.com/50lp2
What is the easiest way to match all lines which follow these rules:
The line is not empty
The line does not only contain whitespace
I've found an expression that matches only lines that are empty or contain only whitespace, but I am not able to invert it. This is what I have found: ^\s*[\r\n].
Is it possible to simply invert a regular expression?
Thank you very much!
To match non-empty lines, you can use the following regex with multiline mode ON (thanks #Casimir for the character class correction):
^[^\S\r\n]*\S.*$
The rest of the line is consumed by .*, which matches any character except a newline.
To just check if the line is not whitespace (but not match it), use a simplified version:
^[^\S\r\n]*\S
The [^\S\r\n]* matches zero or more whitespace characters other than carriage return and line feed. The \S then matches one non-whitespace character.
And by the way, if you code in C#, you do not need a regex to check if a string is whitespace, as there is String.IsNullOrWhiteSpace. Just split the multiline string with str.Split(new[] {"\r\n"}, StringSplitOptions.None) and check each line.
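For comparison, the multiline pattern can be tried out in Python like this. The sample text is invented, and re.MULTILINE plays the role of "multiline mode ON".

```python
import re

# Invented sample: two blank-ish lines, one tab-only line, three real lines.
text = "first\n\n   \n  second  \n\t\nthird"

NON_BLANK = re.compile(r"^[^\S\r\n]*\S.*$", re.MULTILINE)

print(NON_BLANK.findall(text))  # ['first', '  second  ', 'third']
```

Leading and trailing spaces on a matched line are part of the match; only lines without any non-whitespace character are skipped.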
Just verify that there is at least one non-whitespace character:
^.*\S.*$
Explanation:
From start (^) until end ($):
.* - any amount of any characters
\S - one non-whitespace character
I'm trying to create a pattern that matches an opening bracket and everything between it and the next space it encounters.
I thought \[.*\s would achieve that, but it gets everything from the first opening bracket on. How can I tell it to break at the next space?
\[[^\s]*\s
The .* is greedy and will eat everything, including spaces, up to the last whitespace character. If you replace it with \S* or [^\s]*, it will match only a chunk of zero or more characters other than whitespace.
The opening bracket needs to be escaped, since an unescaped [ starts a character class. Negating \s as [^\s] makes the expression eat everything except whitespace and then one whitespace character, which means it stops at the first space.
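The difference between the greedy and the bounded version shows up clearly on a string with more than one bracket. A minimal Python sketch with an invented sample string:

```python
import re

text = "see [alpha beta] and [gamma delta] end"

# Greedy dot: runs to the end, then backtracks to the LAST whitespace.
greedy = re.findall(r"\[.*\s", text)

# Negated class: cannot cross whitespace, so it stops at the FIRST one.
bounded = re.findall(r"\[[^\s]*\s", text)

print(greedy)   # ['[alpha beta] and [gamma delta] ']
print(bounded)  # ['[alpha ', '[gamma ']
```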
You could use a reluctant quantifier:
\[.*?\s
Or instead match on all non-space characters:
\[\S*\s
Use this:
\[[^ ]*
This matches the opening bracket (\[) and then everything except space ([^ ]) zero or more times (*).
I suggest using \[\S*(?=\s).
\[: Match a [ character.
\S*: Match 0 or more non-space characters.
(?=\s): Require a following whitespace character, but don't include it in the match. This feature is called a zero-width positive look-ahead assertion and makes sure your pattern only matches if it is followed by whitespace, so it won't match at the end of the line.
You might get away with \[\S*\s if you don't care about groups and want to include the final space, but you would have to clarify exactly which patterns need matching and which should not.
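A short Python sketch of the two variants, using an invented string in which the second bracket sits at the very end with no whitespace after it:

```python
import re

text = "[alpha beta [gamma"

# Look-ahead: whitespace is required but stays out of the match.
with_lookahead = re.findall(r"\[\S*(?=\s)", text)

# Consuming version: the whitespace becomes part of the match.
with_space = re.findall(r"\[\S*\s", text)

print(with_lookahead)  # ['[alpha']
print(with_space)      # ['[alpha ']
```

Neither variant matches [gamma, since nothing follows it.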
You want to replace . with [^\s]; this matches "not whitespace" instead of the "anything" that . implies.