Regex not stopping at first space - regex

Trying to create a pattern that matches an opening bracket and gets everything between it and the next space it encounters.
I thought \[.*\s would achieve that, but it gets everything from the first opening bracket on. How can I tell it to break at the next space?

\[[^\s]*\s
The .* is greedy and will eat everything, including spaces, up to the last whitespace character. If you replace it with \S* or [^\s]*, it will match only a run of zero or more non-whitespace characters.
Escaping the opening bracket is needed in most flavours, because an unescaped [ normally starts a character class. With \s negated as [^\s], the expression eats everything except whitespace and then one whitespace character, which means it stops at the first space.
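To see the difference concretely, here is a minimal Python sketch. The sample string is made up for illustration, and it assumes Python's re module behaves like the asker's flavour for these constructs.

```python
import re

# Hypothetical sample input; not taken from the question.
text = "[foo bar baz qux"

# Greedy .* runs to the end of the line, then backtracks to the LAST whitespace.
greedy = re.search(r"\[.*\s", text).group()

# [^\s]* cannot cross whitespace, so the match ends at the FIRST space.
negated = re.search(r"\[[^\s]*\s", text).group()

print(repr(greedy))   # '[foo bar baz '
print(repr(negated))  # '[foo '
```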

You could use a reluctant quantifier:
\[.*?\s
Or instead match on all non-space characters:
\[\S*\s

Use this:
\[[^ ]*
This matches the opening bracket (\[) and then everything except space ([^ ]) zero or more times (*).

I suggest using \[\S*(?=\s).
\[: Match a [ character.
\S*: Match 0 or more non-space characters.
(?=\s): Require a whitespace character next, but don't include it in the match. This feature is called a zero-width positive look-ahead assertion and makes sure your pattern only matches if it is followed by whitespace, so it won't match at the very end of the input.
You might get away with \[\S*\s if you don't care about groups and want to include the final space, but you would have to clarify exactly which patterns need matching and which should not.
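A small Python sketch of the look-ahead variant next to the consuming one may help. The input strings are invented for illustration, and Python's re module is assumed to stand in for whatever flavour is actually in use.

```python
import re

lookahead = re.compile(r"\[\S*(?=\s)")  # whitespace is required but not consumed
consuming = re.compile(r"\[\S*\s")      # whitespace becomes part of the match

# Hypothetical inputs.
without_space = lookahead.search("[foo bar").group()
with_space = consuming.search("[foo bar").group()

# Nothing follows "[foo" here, so the look-ahead fails and there is no match.
at_end = lookahead.search("see [foo")

print(repr(without_space))  # '[foo'
print(repr(with_space))     # '[foo '
print(at_end)               # None
```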

You want to replace . with [^\s]; this matches "not whitespace" instead of the "anything" that . implies.

Related

Regular Expression to Match Past Label Including Empty String

Using a regular expression, I'm trying to match a label, in this case "Business Unit:", followed by one or more spaces, then match everything in a submatch after that to the end of that line. I'm having a problem when there are no characters after the label on the line, it grabs the next line.
For example, here's some test data:
Business Unit:(space)(space)BU1(space)
This is Line 2
Business Unit:(space)(space)
This is Line 4
So I want to grab just "BU1" from the first line, and that works. It should match an empty string from the third line, but it matches the contents of the fourth line instead, in this case "This is Line 4".
Here is my expression:
Business Unit:\s+(.+)
I thought the dot character is not supposed to match a newline, but it seems like it is.
What's the correct regular expression in this case?
The real problem here is that \s+ is greedy and \s matches all whitespace, including newlines, so it runs on past the end of the line and then .+ captures the next line.
This should meet your requirements.
The pattern is ^Business Unit: *([\S]*)
This is assuming of course your business unit won't contain any spaces. If it does, then I can modify the pattern.
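The following Python sketch reproduces the problem and the fix on the test data from the question. It assumes the "(space)" placeholders stand for single space characters and uses re.MULTILINE so that ^ matches at each line start; the asker's actual tool may handle lines differently.

```python
import re

# Test data from the question, with "(space)" written out as real spaces.
data = (
    "Business Unit:  BU1 \n"
    "This is Line 2\n"
    "Business Unit:  \n"
    "This is Line 4\n"
)

# Original pattern: \s+ swallows the newline on line 3, so .+ grabs line 4.
original = re.findall(r"Business Unit:\s+(.+)", data)

# Suggested pattern: literal spaces only, so the match stays on its own line.
fixed = re.findall(r"^Business Unit: *([\S]*)", data, re.MULTILINE)

print(original)  # ['BU1 ', 'This is Line 4']
print(fixed)     # ['BU1', '']
```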
It depends a bit on the context you are using the regex in, because multi-line handling may vary, but here is a start:
/^Business Unit: +([^ ]*) *$/
^ Starting from the beginning of the line,
Match the literal, Business Unit:,
+ followed by 1 or more spaces,
([^ ]*) capture any possible non-blank stuff,
*$ followed by spaces till the end of the line.
Again, depending on your context, you may need to specify the line end as \n:
/^Business Unit: +([^ ]*) *\n/
The \n character is part of \s. That is why you get a match onto the following line.
You can do:
/^Business Unit:[ \t]*([^\n]*?)[ \t]*$/m
If you want to exclude the leading horizontal spaces and not match if blank:
/^Business Unit:[ \t]+(\S+)[ \t]*$/m
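As a rough check of the two /m patterns, here is a Python sketch in which re.MULTILINE plays the role of the m flag. The test data is the question's, assuming "(space)" means one space character.

```python
import re

data = (
    "Business Unit:  BU1 \n"
    "This is Line 2\n"
    "Business Unit:  \n"
    "This is Line 4\n"
)

# Value is optional: a blank label line yields an empty capture.
optional = re.findall(
    r"^Business Unit:[ \t]*([^\n]*?)[ \t]*$", data, re.MULTILINE
)

# Value is required: a blank label line does not match at all.
required = re.findall(
    r"^Business Unit:[ \t]+(\S+)[ \t]*$", data, re.MULTILINE
)

print(optional)  # ['BU1', '']
print(required)  # ['BU1']
```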
Use a character class subtraction (in flavours that support it, such as Java) for whitespace except newlines:
Business Unit:[\s&&[^\n]]*(\S*)
The expression [\s&&[^\n]] is the subtraction, then the capture is for 0 or more non-whitespace (your target).
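Not every flavour supports the && syntax. Python's re module, for instance, does not, but a double negation gives the same "whitespace except newline" set. The sketch below uses [^\S\n] as a stand-in for the subtraction; that substitution is my assumption, not part of the answer above.

```python
import re

data = (
    "Business Unit:  BU1 \n"
    "This is Line 2\n"
    "Business Unit:  \n"
    "This is Line 4\n"
)

# [^\S\n] reads "not (non-whitespace or newline)",
# which is whitespace other than a newline.
horizontal_ws = r"[^\S\n]"
pattern = re.compile(r"Business Unit:" + horizontal_ws + r"*(\S*)")

values = pattern.findall(data)
print(values)  # ['BU1', '']
```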
In your example you capture the last line because \s also matches a newline.
What you could do is replace \s+ with a literal space followed by +, and capture any character zero or more times in a group with (.*)
You might use a word boundary \b at the start.
\bBusiness Unit: +(.*)
Update
Based on the comments, to avoid matching whitespace at the end of the line, you could match one or more non-whitespace characters (\S+), followed by a repeated group that matches a space or a tab ([ \t]) and then one or more non-whitespace characters, and make the capturing group optional by appending ?
\bBusiness Unit: +(\S+(?:[ \t]\S+)*)?
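A short Python sketch of the updated pattern follows. The multi-word value "Sales East" is a hypothetical example added to show that inner spaces are kept while trailing ones are dropped.

```python
import re

pattern = re.compile(r"\bBusiness Unit: +(\S+(?:[ \t]\S+)*)?")

# Hypothetical data: a two-word value with trailing spaces, and a blank value.
data = (
    "Business Unit:  Sales East  \n"
    "This is Line 2\n"
    "Business Unit:  \n"
    "This is Line 4\n"
)

# findall reports an unmatched optional group as an empty string.
values = pattern.findall(data)
print(values)  # ['Sales East', '']
```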

Unexpected working of Negated Shorthand Character Classes

The Regular expression
/[\D\S]/
should match characters that are not a digit or not whitespace.
But when I test this expression in regexpal,
it matches every character, including digits and whitespace.
What am I doing wrong?
\D = all characters except digits,
\S = all characters except whitespaces
[\D\S] = union (set theory) of the above character groups = all characters.
Why? Because \D contains \s and \S contains \d.
If you want to match characters that are neither digits nor whitespace, you can use [^\d\s].
Your regex is cancelling itself out. Putting items inside [] means a character has to match any one of them. These two items cover each other's gaps, so together they match everything: \D matches every character that is not a digit, including whitespace, and \S matches every character that is not whitespace, including digits.
You can try using [^\d\s] which says to negate the match of any digit or any space. Instead of having everything caught in the original regex, this negates the matching of both the \d and \s. You can see testing done with it here.
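The union-versus-negation point can be checked with a few lines of Python. The sample string is made up, and Python's re module is assumed to treat \D and \S inside a character class the same way regexpal does.

```python
import re

# Hypothetical sample containing a letter, a digit, a space and another letter.
sample = "a1 b"

# Union of "non-digit" and "non-whitespace": every character qualifies.
union = re.findall(r"[\D\S]", sample)

# Negated class: characters that are neither digits nor whitespace.
neither = re.findall(r"[^\d\s]", sample)

print(union)    # ['a', '1', ' ', 'b']
print(neither)  # ['a', 'b']
```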

What does this regular expression mean?

/\ATo\:\s+(.*)/
Also, how do you work it out, what's the approach?
In multi-line regular expressions, \A matches the start of the string (and \Z is end of string, while ^/$ matches the start/end of the string or the start/end of a line). In single line variants, you just use ^ and $ for start and end of string/line since there is no distinction.
To is literal, \: is an escaped :.
\s means whitespace and the + means one or more of the preceding "characters" (white space in this case).
() is a capturing group, meaning everything in here will be stored in a "register" that you can use. Hence, this is the meat that will be extracted.
.* simply means any non newline character ., zero or more times *.
So, what this regex will do is process a string like:
To: paxdiablo
Re: you are so cool!
and return the text paxdiablo.
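For instance, a minimal Python sketch of that extraction could look like this. The question does not state the regex implementation, so using Python's re module here is an assumption.

```python
import re

text = "To: paxdiablo\nRe: you are so cool!"

# \A pins the match to the start of the string; (.*) stops at the newline.
match = re.search(r"\ATo\:\s+(.*)", text)
recipient = match.group(1)

print(recipient)  # paxdiablo
```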
As to how to learn how to work this out yourself, the Perl regex tutorial(a) is a good start, and then practise, practise, practise :-)
(a) You haven't actually stated which regex implementation you're using but most modern ones are very similar to Perl. If you can find a specific tutorial for your particular flavour, that would obviously be better.
\A is a zero-width assertion and means "Match only at beginning of string".
The regex reads: at the very beginning of the string, match "To:" followed by one or more whitespace characters (\s+), then capture the remainder of the line ((.*)).
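The difference between \A and ^ only shows up in multi-line mode. Here is a small Python sketch with a made-up two-line input, where re.MULTILINE stands in for the multi-line flag of whichever flavour is in use.

```python
import re

# Hypothetical input where "To:" is on the second line.
text = "Cc: someone\nTo: paxdiablo"

# \A only ever matches at the start of the whole string.
string_start = re.search(r"\ATo:\s+(.*)", text, re.MULTILINE)

# In multi-line mode, ^ also matches right after each newline.
line_start = re.search(r"^To:\s+(.*)", text, re.MULTILINE)

print(string_start)         # None
print(line_start.group(1))  # paxdiablo
```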
First, you need to know what the different character classes and quantifiers are. Character classes are the backslash-prefixed characters, \A from your regex, for instance. Quantifiers are for instance the +. There are several references on the internet, for instance this one.
Using that, we can see what happens by going left to right:
\A matches a beginning of the string.
To matches the text "To" literally
\: escapes the ":", so it loses its special meaning and becomes "just a colon"
\s matches whitespace (space, tab, etc)
+ means to match the previous class one or more times, so \s+ means one or more whitespace characters
() is a capture group, anything matched within the parens is saved for later use
. means "any character" (by default, any character except a newline)
* is like the +, but zero or more times, so .* means any number of any characters
Taking that together, the regex will match a string beginning with "To:", then at least one whitespace character, and then anything, which it will save. So, with the string "To: JaneKealum", you'll be able to extract "JaneKealum".
You start from left and look for any escaped (ie \A) characters. The rest are normal characters. \A means the start of the input. So To: must be matched at the very beginning of the input. I think the : is escaped for nothing. \s is a character group for all spaces (tabs, spaces, possibly newlines) and the + that follows it means you must have one or more space characters. After that you capture all the rest of the line in a group (marked with ( )).
If the input was
To: progo#home
the capture group would contain "progo#home"
It matches To: at the beginning of the input, followed by at least one whitespace, followed by any number of characters as a group.
The initial and trailing / characters delimit the regular expression.
A \ inside the expression means to treat the following character specially or treat it as a literal if it normally has a special meaning.
The \A means match only at the beginning of a string.
To means match the literal "To"
\: means match a literal ':'. A colon is normally a literal and has no special meaning it can be given.
\s means match a whitespace character.
+ means match as many as possible but at least one of whatever it follows, so \s+ means match one or more whitespace characters.
The ( and ) define a group of characters that will be captured and returned by the expression evaluator.
And finally the . matches any character except a newline (by default) and the * means match as many as possible but can be zero. Therefore the (.*) will capture all characters to the end of the line.
So the pattern will match a string that starts with "To:" and capture everything from the first non-whitespace character after the colon to the end of that line.
The only way to really understand these things is to go through them one bit at a time and check the meaning of each component.

Regex anchor question

Would an anchor like "^" or "\A" at the beginning of this regex make any sense - any difference?
$string =~/(.*)([a-z])$/
Yes, either ^ or \A will cause the regex to not match if there is a newline anywhere before the letter, because .* (zero or more of any characters except newline) will no longer match up to the letter before the end.
Without the beginning anchor, the regex will match from after the last newline through the end of the string (or through the letter before the newline at the end, if there is a newline).
No, because of the greedy nature of regular expression matching that regex will pull everything before the final letter of the string, provided the last character is a letter.
It would make sense, just not any difference.
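The two answers can be reconciled with a quick experiment: the anchor changes nothing for a single-line string, but it does once a newline appears before the final letter. This is a Python sketch rather than Perl, on the assumption that . and $ behave the same way by default in both; the sample strings are made up.

```python
import re

unanchored = re.compile(r"(.*)([a-z])$")
anchored = re.compile(r"^(.*)([a-z])$")

# Hypothetical inputs: one without and one with an embedded newline.
single = "hello"
multi = "ab\ncd"

# No newline: both patterns find the same groups.
single_unanchored = unanchored.search(single).groups()
single_anchored = anchored.search(single).groups()

# Embedded newline: only the unanchored pattern can start after it.
multi_unanchored = unanchored.search(multi).groups()
multi_anchored = anchored.search(multi)

print(single_unanchored, single_anchored)  # ('hell', 'o') ('hell', 'o')
print(multi_unanchored)                    # ('c', 'd')
print(multi_anchored)                      # None
```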

Trim string using regex match

I have to use a crippled tool which doesn't provide any way to trim leading and trailing spaces from a string. It does have .NET style regex, but only Match is implemented, not replace. So, I came up (surprisingly by myself) with this regex that seems to work.. but I don't completely understand why it works :-)
$trimmed = regex/[^ ].*[^ ]/ ($original_string)
Why does this work, does it really work in all cases, and is there a better way if you only have regex Match ( even group matches can't be captured :( ) ?
It should work fine unless there's only a single character surrounded by space.
Your pattern searches for:
A non-space character [^ ]
Zero or more characters of any kind, as many as possible (greedy match) .*
A non-space character [^ ]
So, if there aren't at least two non-space characters (1 and 3), the pattern won't match at all.
You should use \b instead of [^ ], that will match any 'word boundary', but will be of zero length and won't require two non-space characters:
\b.*\b
It works like this: [^ ] will match the first non space character, .* will match anything, and [^ ] will again match a non space character. Since regex is greedy the longest possible match is returned, so in this case the longest possible string with two non spaces at the ends effectively trimming off whitespace at the beginning and end of $original_string.
A good tutorial on regex is here, it teaches you about greedy and lazy matching which are key to understanding and optimizing regexes. It also teaches you about matching between characters which is what you would want to do here (see the answer about \b by Martin).
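The edge cases discussed above can be tried out with a short Python sketch. It uses Python's re module rather than the .NET-style tool from the question, and the helper trim_match and the third pattern, \S(?:.*\S)?, are my own illustrative additions, not something proposed in the thread.

```python
import re

def trim_match(pattern, s):
    """Return the whole match of pattern in s, or None (hypothetical helper)."""
    m = re.search(pattern, s)
    return m.group() if m else None

original = r"[^ ].*[^ ]"         # needs at least two non-space characters
boundary = r"\b.*\b"             # trims to the outermost word characters
optional_tail = r"\S(?:.*\S)?"   # assumption: one non-space, optionally more

two_words = trim_match(original, "  hello world  ")
one_char_original = trim_match(original, "  x  ")
one_char_boundary = trim_match(boundary, "  x  ")

# \b also strips leading and trailing punctuation, which may not be wanted.
punct_boundary = trim_match(boundary, "  -a-  ")
punct_optional = trim_match(optional_tail, "  -a-  ")

print(repr(two_words))          # 'hello world'
print(one_char_original)        # None
print(repr(one_char_boundary))  # 'x'
print(repr(punct_boundary))     # 'a'
print(repr(punct_optional))     # '-a-'
```

Whether the punctuation behaviour of \b matters depends on the strings being trimmed, so it is worth testing against real input before choosing a pattern.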