I have to use a crippled tool which doesn't provide any way to trim leading and trailing spaces from a string. It does have .NET-style regex, but only Match is implemented, not Replace. So I came up (surprisingly, by myself) with this regex that seems to work, but I don't completely understand why it works :-)
$trimmed = regex/[^ ].*[^ ]/ ($original_string)
Why does this work, does it really work in all cases, and is there a better way if you only have regex Match ( even group matches can't be captured :( ) ?
It should work fine unless there's only a single non-space character surrounded by spaces.
Your pattern searches for:
A non-space character [^ ]
Zero or more characters of any kind, as many as possible (greedy match) .*
A non-space character [^ ]
So, if there aren't at least two non-space characters (1 and 3), the pattern won't match at all.
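The behaviour described above can be checked with a small sketch. Python's re module stands in for the tool's Match call here, which is an assumption, and the variant with the optional group is an illustrative alternative rather than part of the original pattern:

```python
import re

# The original pattern: a non-space, anything (greedy), a non-space.
TRIM = re.compile(r"[^ ].*[^ ]")
# Illustrative alternative: the tail is optional, so one character is enough.
TRIM_ONE_OK = re.compile(r"[^ ](?:.*[^ ])?")


def match_or_none(pattern, text):
    """Return the matched text, or None when the pattern does not match."""
    m = pattern.search(text)
    return m.group(0) if m else None


trimmed = match_or_none(TRIM, "   hello world  ")
single = match_or_none(TRIM, "  a  ")
single_fixed = match_or_none(TRIM_ONE_OK, "  a  ")
print(repr(trimmed), repr(single), repr(single_fixed))
```

The optional non-capturing group keeps the whole result in the overall match, so it should still be usable when group captures are not available.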
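You could use \b instead of [^ ]. It matches at any word boundary, is zero-length, and doesn't require two non-space characters. Note that \b only sits between a word character and a non-word character, so leading or trailing punctuation is trimmed as well: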
\b.*\b
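A quick sketch of how \b.*\b behaves, again assuming Python's re as a stand-in for the tool. The sample strings are made up for illustration:

```python
import re

WORD_TRIM = re.compile(r"\b.*\b")


def word_trim(text):
    """Return the span between the first and last word boundary, or None."""
    m = WORD_TRIM.search(text)
    return m.group(0) if m else None


plain = word_trim("  hello world  ")   # spaces on both sides are dropped
single = word_trim("  a  ")            # a single character is enough
punct = word_trim("  (hi!)  ")         # punctuation at the ends is dropped too
print(repr(plain), repr(single), repr(punct))
```

The last case is the one to watch: the parentheses and exclamation mark are not word characters, so they fall outside the match.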
It works like this: [^ ] matches the first non-space character, .* matches anything, and [^ ] again matches a non-space character. Since quantifiers are greedy by default, the longest possible match is returned. In this case that is the longest possible string with a non-space character at each end, which effectively trims the spaces at the beginning and end of $original_string.
A good tutorial on regex is here, it teaches you about greedy and lazy matching which are key to understanding and optimizing regexes. It also teaches you about matching between characters which is what you would want to do here (see the answer about \b by Martin).
I have this regex:
\[tag\](.*?)\[\/tag\]
It matches any characters between two tags. The problem is that it also matches empty content or just white space inside the tags, for example:
[tag][/tag]
[tag] [/tag]
How can I avoid this, so that it matches only when the content has at least one character that is not white space? Thanks!
Use
\[tag\](?!\s*\[\/tag\])(.*?)\[\/tag\]
       ^^^^^^^^^^^^^^^^
The (?!\s*\[\/tag\]) is a negative lookahead that fails the match if, immediately to the right of the current location, there are zero or more whitespace characters followed by [/tag].
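As a minimal sketch of the lookahead in action, here is the same pattern run with Python's re (an assumption, since the question does not name a flavor). The forward slash needs no escaping in Python, and the sample string is invented:

```python
import re

# Same pattern as above; "/" is written unescaped, which Python allows.
NON_BLANK_TAG = re.compile(r"\[tag\](?!\s*\[/tag\])(.*?)\[/tag\]")

sample = "[tag][/tag] [tag]   [/tag] [tag]hello[/tag] [tag] x [/tag]"

# findall returns the contents of group 1 for every match.
contents = NON_BLANK_TAG.findall(sample)
print(contents)
```

The empty and whitespace-only tags are skipped, while content with surrounding spaces is kept as-is.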
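You might change your expression to something similar to this, which requires at least one character between the tags (note that [\s\S] also matches whitespace, so whitespace-only content is still accepted):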
\[tag\]([\s\S]+)\[\/tag\]
and you might add a quantifier to it, and bound it with number of chars, similar to this expression:
\[tag\]([\s\S]{3,})\[\/tag\]
Or you could do the same with your original expression as this expression:
Try this regex:
\[(tag)\](?!\s*\[\/\1\])(.*?)\[\/\1\]
This regex matches tag only if it has at least one non-whitespace char.
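The backreference makes it easy to generalize beyond one tag name. The sketch below swaps the literal tag for \w+, which is a hypothetical extension and not part of the answer above, and uses Python's re for illustration:

```python
import re

# Hypothetical generalization: any word-like tag name, closed by the same name.
ANY_TAG = re.compile(r"\[(\w+)\](?!\s*\[/\1\])(.*?)\[/\1\]")

sample = "[b] [/b] [i]text[/i] [b]bold[/i]"

# Each result is a (tag name, content) pair.
pairs = ANY_TAG.findall(sample)
print(pairs)
```

The blank [b] pair is rejected by the lookahead, and the mismatched [b]...[/i] pair is rejected by the backreference.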
If this is PCRE (PHP), Notepad++, or Perl, use this:
(?s)(?:\[tag\]\s*\[/tag\](*SKIP)(?!)|\[tag\]\s*(.+?)\s*\[/tag\])
https://regex101.com/r/aCsOoQ/1
If not, you're stuck with using Stribnetz regex, which works because of an odd condition of your requirements.
Readable
 (?s)
 (?:
      \[tag\]
      \s*
      \[/tag\]
      (*SKIP)
      (?!)
   |
      \[tag\]
      \s*
      ( .+? )      # (1)
      \s*
      \[/tag\]
 )
There are a thousand regular expression questions on SO, so I apologize if this is already covered. I did look first.
I have string:
Name Subname 11X22 88X620 AB33(20) YA5619 77,66
I need to capture this string: YA5619
What I am doing is finding AB33(20) and then capturing everything up to the first white space. But AB33(20) can also be AB-33(20), AB33(-20), or AB33(-1).
My preg_match regex is: (?<=\bAB\d{2}\(\d{2}\)\s).+?(?=\s)
Why am I getting an error when I change \d{2} to \d+?
For the final result I thought this regex would work, but it doesn't:
(?<=\bAB-?\d+\(-?\d+\)\s).+?(?=\s)
Any ideas what I am doing wrong?
With most regex flavors, lookbehind needs to evaluate to a fixed-length sequence, so you can't use variable quantifiers like * or + or even {1,2}.
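The fixed-length restriction is easy to reproduce. Python's re module enforces the same rule as the flavor in the question, so it is used here purely as an illustration:

```python
import re

text = "Name Subname 11X22 88X620 AB33(20) YA5619 77,66"

# Fixed-width lookbehind: every quantifier has an exact count, so it compiles.
fixed = re.compile(r"(?<=\bAB\d{2}\(\d{2}\)\s).+?(?=\s)")
fixed_result = fixed.search(text).group(0)

# Variable-width lookbehind: -? and \d+ have no fixed length,
# so Python refuses to compile the pattern.
try:
    re.compile(r"(?<=\bAB-?\d+\(-?\d+\)\s).+?(?=\s)")
    variable_error = None
except re.error as exc:
    variable_error = str(exc)

print(fixed_result)
print(variable_error)
```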
Instead of using lookaround, you can simply match your marker pattern and then forget it with \K.
AB-?\d+(?:\(-?\d+\))? \K[^ ]+
demo: https://regex101.com/r/8XXngH/1
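Where \K is not available, a capturing group gives the same result. Python's built-in re module has neither \K nor variable-length lookbehind, so this sketch matches the marker and reads group 1 instead. The helper name is made up for illustration:

```python
import re

# Match the marker, one whitespace character, then capture the next token.
MARKER_THEN_VALUE = re.compile(r"\bAB-?\d+\(-?\d+\)\s(\S+)")


def value_after_marker(text):
    """Return the token that follows the AB...(...) marker, or None."""
    m = MARKER_THEN_VALUE.search(text)
    return m.group(1) if m else None


samples = [
    "Name Subname 11X22 88X620 AB33(20) YA5619 77,66",
    "Name Subname 11X22 88X620 AB-33(20) YA5619 77,66",
    "Name Subname 11X22 88X620 AB33(-20) YA5619 77,66",
    "Name Subname 11X22 88X620 AB33(-1) YA5619 77,66",
]
values = [value_after_marker(s) for s in samples]
print(values)
```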
It depends on the language. In .NET, for example, your pattern matches because variable-length lookbehind is supported.
Another solution might be to use a character class listing the characters you want to allow. Then match a whitespace character, reset the match with \K, and match \S+, which matches one or more non-whitespace characters.
\bAB[()\d-]+\s\K\S+
Explanation
\bAB Match AB literally, preceded by a word boundary to prevent AB being part of a larger word
[()\d-]+ Match 1+ times any of the characters listed in the character class
\s Match a whitespace char (or \s+ to match 1 or more)
\K Reset the starting point of the reported match (forget what was matched so far)
\S+ Match 1+ times a non-whitespace character
Digits are optional, and are only allowed in the end of a word
Spaces are optional, and are only allowed in the middle of a word.
I am pretty much just trying to match the possible months in a few languages, say English and Vietnamese
For example, the following are valid matches:
'June'
'tháng 6'
But the following are not because of space: 'June ' ' June'
These are my test cases: https://regex101.com/r/pZ0mN3/2.
As you can see, I came up with ^\S[\S ]+\S$ which is kind of working, but I wonder if there's a better way to do it.
To match a string with no leading and trailing spaces in the JavaScript regex flavor, you can use several options:
Require the first and the last non-whitespace character with \S (=[^\s]). This can be done with, say, ^\S[\S\s]*\S$. This regex requires at least 2 characters to be in the string. Your regex requires 3 chars in the input since you used +. It won't allow some Unicode whitespaces either.
You may use a combination of grouping with optional quantifiers (those allowing 0 length matches). See ^\S(?:\s*\S+)*$ (where \s is replaced with since it is a multiline demo). The \S at the beginning matches a non-whitespace char and then a non-capturing group follows, that is * quantified (matches zero or more occurrences) and matches 0+ sequences of 0+ whitespaces followed with 1+ non-whitespace characters. This is a good expression for those flavors like RE2 that do not support lookarounds, but support quantified groups.
You may use lookaheads to require the first and last character to be non-whitespace characters: ^(?=[\S\s]*\S$)\S[\S\s]*$ where (?=[\s\S]*\S$) requires the last char to be a non-whitespace and the \S after the lookahead will require the first char to be non-whitespace. [\s\S]* matches 0+ any characters. This will match 1 char strings, but won't match empty strings.
If your regex to match strings with no leading/trailing whitespace should also match an empty string, use 2 negative lookaheads: ^(?!\s)(?![\S\s]*\s$)[\S\s]*$. The (?!\s) lookahead will fail the match if there is a leading whitespace, (?![\S\s]*\s$) will do the same in case of trailing whitespace, and [\s\S]* will match 0+ any characters. If lookarounds are not supported, use ^(?:\S(?: *\S+)*)?$, which is much less efficient.
If you do not need to match any chars between the non-whitespace chars, you may revert [\s\S] to your [\S ]. In PCRE, a horizontal whitespace can be matched with \h, in .NET and others that support Unicode properties, you can use [\t\p{Zs}] to match any horizontal whitespace. In JS, [^\S\r\n\f\v\u2028\u2029] can be used for that purpose.
Note that some regex flavors do not support non-capturing groups, you may replace all (?: with ( in the above patterns.
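The options above differ mainly in how they treat one-character and empty strings. A small comparison, sketched with Python's re rather than JavaScript (the dictionary keys are made-up labels):

```python
import re

# Note: Python's $ also matches just before a final newline;
# the sample strings below avoid that case.
PATTERNS = {
    "two_plus": re.compile(r"^\S[\S\s]*\S$"),
    "grouped": re.compile(r"^\S(?:\s*\S+)*$"),
    "lookahead": re.compile(r"^(?=[\S\s]*\S$)\S[\S\s]*$"),
    "allow_empty": re.compile(r"^(?!\s)(?![\S\s]*\s$)[\S\s]*$"),
}


def verdicts(text):
    """Map each pattern label to whether it accepts the given string."""
    return {name: bool(p.search(text)) for name, p in PATTERNS.items()}


for candidate in ["June", "tháng 6", "June ", " June", "J", ""]:
    print(repr(candidate), verdicts(candidate))
```

All four accept 'June' and 'tháng 6' and reject the padded strings. Only the first rejects a single character, and only the last accepts the empty string.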
The Regular expression
/[\D\S]/
should match characters which are not a digit or not whitespace.
But when I test this expression in regexpal, it matches every character, including digits and whitespace.
What am I doing wrong?
\D = all characters except digits,
\S = all characters except whitespaces
[\D\S] = union (set theory) of the above character groups = all characters.
Why? Because \D contains \s and \S contains \d.
If you want to match characters which are neither digits nor whitespace, you can use [^\d\s].
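The union effect is easy to see on a short string. This sketch uses Python's re, which is an assumption since the question only mentions regexpal, and an invented sample:

```python
import re

sample = "a1 b\t2"

# Union of "not a digit" and "not whitespace": every character qualifies.
union_hits = re.findall(r"[\D\S]", sample)

# Negated class: only characters that are neither digit nor whitespace.
neither_hits = re.findall(r"[^\d\s]", sample)

print(union_hits)
print(neither_hits)
```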
Your regex cancels itself out. Putting items inside [] means a character has to match only one of them. Here the two items cover each other's gaps: anything that is not a digit includes every whitespace character, and anything that is not whitespace includes every digit, so together they match everything.
You can try using [^\d\s], which matches any character that is neither a digit nor whitespace. Instead of catching everything like the original regex, this excludes both \d and \s. You can see testing done with it here.
Trying to create a pattern that matches an opening bracket and gets everything between it and the next space it encounters.
I thought \[.*\s would achieve that, but it gets everything from the first opening bracket on. How can I tell it to break at the next space?
\[[^\s]*\s
The .* is greedy and will eat everything, including spaces, up to the last whitespace character. If you replace it with \S* or [^\s]*, it will match only a chunk of zero or more characters other than whitespace.
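A side-by-side run makes the difference concrete. Python's re and the sample text are assumptions for illustration:

```python
import re

sample = "see [alpha beta] and gamma delta"

# Greedy .* runs to the end and backs up to the LAST whitespace character.
greedy = re.search(r"\[.*\s", sample).group(0)

# \S* cannot cross whitespace, so the match stops at the FIRST one.
negated = re.search(r"\[\S*\s", sample).group(0)

print(repr(greedy))
print(repr(negated))
```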
Escaping the opening bracket might be needed. If you negate the \s with [^\s], the expression should eat everything except whitespace, and then one whitespace character, which means up to the first space.
You could use a reluctant quantifier:
\[.*?\s
Or instead match on all non-space characters:
\[\S*\s
Use this:
\[[^ ]*
This matches the opening bracket (\[) and then everything except space ([^ ]) zero or more times (*).
I suggest using \[\S*(?=\s).
\[: Match a [ character.
\S*: Match 0 or more non-space characters.
(?=\s): Require a space character next, but don't include it in the match. This feature is called a zero-width positive look-ahead assertion and makes sure your pattern only matches if it is followed by a space, so it won't match at the end of a line.
You might get away with \[\S*\s if you don't care about groups and want to include the final space, but you would have to clarify exactly which patterns need matching and which should not.
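The variants suggested in this thread differ in whether the space is included and in what happens when no space follows. A rough comparison with Python's re, on an invented sample where the second bracket sits at the very end:

```python
import re

sample = "see [alpha beta] and [omega"

# Lazy quantifier: stops at the first whitespace and includes it.
lazy = re.findall(r"\[.*?\s", sample)

# Lookahead: requires the whitespace but leaves it out of the match.
lookahead = re.findall(r"\[\S*(?=\s)", sample)

# Negated space class: no whitespace required, so it also matches at the end.
no_space_needed = re.findall(r"\[[^ ]*", sample)

print(lazy)
print(lookahead)
print(no_space_needed)
```

Only the last form picks up the trailing [omega, because nothing in it demands a following space.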
You want to replace . with [^\s]; this matches "not whitespace" instead of the "anything" that . implies.