Regex - Multiline Problem

I think I'm burnt out, and that's why I can't see an obvious mistake. Anyway, I want the following regex:
#BIZ[.\s]*#ENDBIZ
to grab me the #BIZ tag, #ENDBIZ tag and all the text in between the tags. For example, if given some text, I want the expression to match:
#BIZ
some text some test
more text
maybe some code
#ENDBIZ
At the moment, the regex matches nothing. What did I do wrong?
ADDITIONAL DETAILS
I'm doing the following in PHP
preg_replace('/#BIZ[.\s]*#ENDBIZ/', 'my new text', $strMultiplelines);

The dot loses its special meaning inside a character class — in other words, [.\s] means "match period or whitespace". I believe what you want is [\s\S], "match whitespace or non-whitespace".
preg_replace('/#BIZ[\s\S]*#ENDBIZ/', 'my new text', $strMultiplelines);
Edit: A bit about the dot and character classes:
By default, the dot does not match newlines. Most (all?) regex implementations have a way to specify that it match newlines as well, but it differs by implementation. The only way to match (really) any character in a compatible way is to pair a shorthand class with its negation — [\s\S], [\w\W], or [\d\D]. In my personal experience, the first seems to be most common, probably because this is used when you need to match newlines, and including \s makes it clear that you're doing so.
Also, the dot isn't the only special character which loses its meaning in character classes. In fact, the only characters which are special in character classes are ^, -, \, and ]. Check out the "Metacharacters Inside Character Classes" section of the character classes page on Regular-Expressions.info.
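To see the difference concretely, here is a small sketch in Python rather than PHP (Python's re module treats the dot inside a character class the same way). The sample text and variable names are made up for illustration.

```python
import re

text = "before\n#BIZ\nsome text some test\nmore text\nmaybe some code\n#ENDBIZ\nafter"

# [.\s] only allows literal dots and whitespace, so the letters between
# the tags prevent a match.
broken = re.search(r"#BIZ[.\s]*#ENDBIZ", text)

# [\s\S] allows any character at all, including newlines.
fixed = re.search(r"#BIZ[\s\S]*#ENDBIZ", text)

print(broken)                 # None
print(repr(fixed.group(0)))   # the whole block from #BIZ to #ENDBIZ

# Rough equivalent of the preg_replace call from the question.
replaced = re.sub(r"#BIZ[\s\S]*#ENDBIZ", "my new text", text)
print(repr(replaced))         # 'before\nmy new text\nafter'
```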

// Replaces all of your code with "my new text", but I do not think
// this is actually what you want based on your description.
preg_replace('/#BIZ(.+?)#ENDBIZ/s', 'my new text', $contents);
// Actually "gets" the text, which is what I think you might be looking for.
preg_match('/(#BIZ)(.+?)(#ENDBIZ)/s', $contents, $matches);
list($dummy, $startTag, $data, $endTag) = $matches;

This should work
#BIZ[\s\S]*#ENDBIZ
You can try this online Regular Expression Testing Tool

The mistake is the character class [.\s], which matches a literal dot (not any character) or whitespace. You probably wanted .* with the dot matching newline characters, too. You can achieve this by enabling the single-line option; in .NET regex, (?s:...) does this.
(?s:#BIZ.*?#ENDBIZ)

Depending on the environment you're using your regex in, it may need special care to properly parse multiline text, eg re.DOTALL in Python. So what environment is that?
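For the Python case mentioned above, a minimal sketch of both ways to switch the dot into "match newlines too" mode. The sample text is invented; re.DOTALL and the scoped inline form (?s:...) are standard re features.

```python
import re

text = "#BIZ\nline one\nline two\n#ENDBIZ"

# Without DOTALL the dot stops at the first newline, so nothing matches.
no_flag = re.search(r"#BIZ.*?#ENDBIZ", text)

# With the flag, the dot also matches newlines.
with_flag = re.search(r"#BIZ.*?#ENDBIZ", text, flags=re.DOTALL)

# The same switch written inline as a scoped flag group (Python 3.6+).
inline = re.search(r"(?s:#BIZ.*?#ENDBIZ)", text)

print(no_flag)                        # None
print(with_flag.group(0) == text)     # True
print(inline.group(0) == text)        # True
```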

You can use
preg_replace('/#BIZ.*?#ENDBIZ/s', 'my new text', $strMultiplelines);
The 's' modifier says "let the dot match anything, even the newline character". The '?' makes the quantifier non-greedy, which matters in a case such as:
foo
#BIZ
some text some test
more text
maybe some code
#ENDBIZ
bar
#BIZ
some text some test
more text
maybe some code
#ENDBIZ
hello world
Because the match is non-greedy, it stops at the first #ENDBIZ, so the "bar" in the middle is not swallowed by a single match.
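The same comparison sketched in Python, with re.DOTALL standing in for PHP's /s modifier. The block contents are shortened placeholders, not the original text.

```python
import re

text = (
    "foo\n"
    "#BIZ\nfirst block\n#ENDBIZ\n"
    "bar\n"
    "#BIZ\nsecond block\n#ENDBIZ\n"
    "hello world"
)

# Lazy: each #BIZ...#ENDBIZ pair is replaced separately, "bar" survives.
lazy = re.sub(r"#BIZ.*?#ENDBIZ", "my new text", text, flags=re.DOTALL)

# Greedy: one match runs from the first #BIZ to the last #ENDBIZ.
greedy = re.sub(r"#BIZ.*#ENDBIZ", "my new text", text, flags=re.DOTALL)

print(repr(lazy))    # 'foo\nmy new text\nbar\nmy new text\nhello world'
print(repr(greedy))  # 'foo\nmy new text\nhello world'
```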

Unless I am missing something, you handle this the same way that you would in Perl, with either the /m or /s modifier at the end? Oddly enough the other answers that rather correctly pointed this out got down voted?!

It looks like you're using a JavaScript regex; you'll need to enable multiline mode by specifying the m flag at the end of the expression:
var re = /^deal$/mg

Related

Trouble converting regex

This regex:
"REGION\\((.*?)\\)(.*?)END_REGION\\((.*?)\\)"
currently finds this info:
REGION(Test) my user typed this
END_REGION(Test)
I need it to instead find this info:
#region REGION my user typed this
#endregion END_REGION
I have tried:
"#region\\ (.*?)\\\n(.*?)#endregion\\ (.*?)\\\n"
It tells me that the pattern assignment has failed. Can someone please explain what I am doing wrong? I am new to Regex.
It seems the issue lies in the multiline \n. My recommendation is to use the modifier s to avoid multiline complexities like:
/#region\ \(.*?\)(.*?)\s#endregion\s\(.*?\)/s
Online Demo
The s modifier ("single line") makes the . match all characters, including line breaks.
Try this:
#region(.*)?\n(.*)?#endregion(.*)?
This works for me when testing here: http://regexpal.com/
When using your original text and regex, the only thing that threw it off is that I did not have a new line at the end because your sample text didn't have one.
Constructing this regex doesn't fail using boost, even if you use the expanded modifier.
Your string to the compiler:
"#region\\ (.*?)\\\n(.*?)#endregion\\ (.*?)\\\n"
After parsed by compiler:
#region\ (.*?)\\n(.*?)#endregion\ (.*?)\\n
It looks like you have one too many escapes on the newline.
If you present the regex to boost as expanded, an unescaped pound sign # is interpreted as the start of a comment.
In that case, you need to escape the pound sign.
\#region\ (.*?)\\n(.*?)\#endregion\ (.*?)\\n
If you don't use the expanded modifier, then you don't need to escape the space characters.
Taking that tack, you can remove the escapes on the spaces and fix up the newline escapes. It then looks like this raw (what gets passed to the regex engine):
#region (.*?)\n(.*?)#endregion (.*?)\n
And like this as a source code string:
"#region (.*?)\\n(.*?)#endregion (.*?)\\n"
Your regular expression has an extra backslash when escaping the newline sequence (\\\n); use \\s* instead. Also, for the last capturing group you can use a greedy quantifier instead and remove the newline sequence.
#region\\ (.*?)\\s*(.*?)#endregion\\ (.*)
Compiled Demo
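As a quick cross-check of the raw pattern `#region (.*?)\n(.*?)#endregion (.*?)\n`, here is a sketch in Python instead of boost. A Python raw string plays the role of "what gets passed to the regex engine". The trailing newline on the samples and the second, multi-line sample are assumptions, and re.DOTALL is added so that the middle group can span several lines.

```python
import re

pattern = re.compile(r"#region (.*?)\n(.*?)#endregion (.*?)\n", re.DOTALL)

# Sample from the question, with a trailing newline added (assumption).
sample = "#region REGION my user typed this\n#endregion END_REGION\n"

# Hypothetical sample with a body between the two marker lines.
multi = "#region A\nline 1\nline 2\n#endregion A\n"

groups_sample = pattern.search(sample).groups()
groups_multi = pattern.search(multi).groups()

print(groups_sample)  # ('REGION my user typed this', '', 'END_REGION')
print(groups_multi)   # ('A', 'line 1\nline 2\n', 'A')
```

Without the trailing newline the pattern finds nothing, which matches the observation about the test on regexpal above.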

Regex optimization: negative character class "[^#]" nullifies multiline flag "m"

I'm trying to parse a text line by line, catching everything EXCEPT what's after a specific marker, # for example. No escaping to take into account, pretty basic.
For instance, if the input text is:
Multiline input text
Mid-sentence# cut, this won't be matched
Hey there
I want to retrieve
['Multiline input text',
'Mid-sentence',
'Hey there']
This is working fine with /(.*?)(?:#.*$|$)/mg (even though there are a few empty matches). However, if I try to improve the regex (by avoiding backtracking and getting rid of empty matches) with /([^#]++)(?:#.*$|$)/mg, it returns
[
"Multiline input text
Mid-sentence",
"
Hey there"
]
It is as if [^#] were including line breaks, even with the multiline flag on. As far as I can tell, I can fix that by adding \n and \r to the character class ([^#\n\r]), but this makes the multiline option kind of useless, and I'm afraid it could break on some weird line breaks in some environments/encodings.
Would any of you know the reason for this behavior, and if there's another workaround? Thanks!
Edit
Originally, this happened in PCRE. But even in JavaScript, with /([^#]+)(?:#.*$|$)/mg, I get the same unwanted multiline behavior. I know I could probably use the language to parse the text line by line, but I'd like to do it with regex only.
It seems you got your definition of /m wrong. The only thing this flag does is to change what ^ and $ matches, so that they also match at the beginning and end of line respectively. It does not affect anything else. If you don't want to match line breaks you should do as you suggested and use [^#\n\r].
The regex that will work for you is:
^(.*?)(?:#.*|)$
Online Demo: http://regex101.com/r/aP8eV6
The difference is the use of .*? instead of [^#]+.
By definition, [^#]+ matches anything but #, and that includes newlines as well.
The multiline flag m only makes the line start/end anchors ^ and $ match at every line of a multiline input.
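A quick way to check this behavior, sketched in Python, whose re.MULTILINE behaves like /m here. The possessive ++ from the question is left out to keep the sketch portable across Python versions.

```python
import re

text = "Multiline input text\nMid-sentence# cut, this won't be matched\nHey there"

# re.MULTILINE only changes ^ and $; [^#] still happily matches "\n".
spanning = re.findall(r"([^#]+)(?:#.*$|$)", text, flags=re.MULTILINE)

# Excluding line breaks inside the class keeps each match on one line.
per_line = re.findall(r"([^#\r\n]+)(?:#.*$|$)", text, flags=re.MULTILINE)

print(spanning)  # ['Multiline input text\nMid-sentence', '\nHey there']
print(per_line)  # ['Multiline input text', 'Mid-sentence', 'Hey there']
```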

parsing url for specific param value

I'm looking to use a regular expression to parse a URL and get a specific section of it, or nothing if I cannot find the pattern.
A url example is
/te/file/value/jifle?uil=testing-cdas-feaw:jilk:&jklfe=https://value-value.jifels/temp.html/topic?id=e997aad4-92e0-j30e-a3c8-jfkaliejs5#c452fds-634d-f424fds-cdsa&bf_action=jildape
I wish to get the value of the id parameter in it (e997aad4-92e0-j30e-a3c8-jfkaliejs5).
Currently I'm using the regex "d=([^#]*)", but the problem is that I'm also running across URLs of this pattern:
/te/file/value/jifle?uil=testing-cdas-feaw:jilk:&jklfe=https://value-value.jifels/temp.html/topic?id=e997aad4-92e0-j30e-a3c8-jfkaliejs5&bf_action=jildape
and there the regex still matches, capturing everything after id=.
I would prefer to get no match for this URL, because it doesn't contain the #.
Regexes are not a magic tool that you should always use just because the problem involves a string. In this case, your language probably has a tool to break apart URLs for you. In PHP, this is parse_url(). In Perl, it's the URI::URL module.
You should almost always prefer an existing, well-tested solution to a common problem like this rather than writing your own.
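In Python the analogous tool would be urllib.parse. A rough sketch of that approach follows; the helper name id_before_fragment is made up, and it assumes the id parameter is the last thing before the '#', as in the example URL.

```python
import re
from urllib.parse import urlsplit

def id_before_fragment(url):
    """Return the id value only when a non-empty '#fragment' follows it."""
    parts = urlsplit(url)
    if not parts.fragment:          # no '#', or nothing after it
        return None
    # parts.query is everything between the first '?' and the '#'.
    match = re.search(r"id=([^&]*)$", parts.query)
    return match.group(1) if match else None

with_hash = "/te/file/value/jifle?uil=testing-cdas-feaw:jilk:&jklfe=https://value-value.jifels/temp.html/topic?id=e997aad4-92e0-j30e-a3c8-jfkaliejs5#c452fds-634d-f424fds-cdsa&bf_action=jildape"
without_hash = "/te/file/value/jifle?uil=testing-cdas-feaw:jilk:&jklfe=https://value-value.jifels/temp.html/topic?id=e997aad4-92e0-j30e-a3c8-jfkaliejs5&bf_action=jildape"

print(id_before_fragment(with_hash))     # e997aad4-92e0-j30e-a3c8-jfkaliejs5
print(id_before_fragment(without_hash))  # None
```

Letting the URL parser decide whether a fragment exists keeps the regex itself trivial.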
So you want to match the value of the id parameter, but only if it has a trailing section containing a '#' symbol (without matching the '#' or what's after it)?
Not knowing the specifics of what style of regexes you're using, how about something like:
id=([^#&]*)#
regex = "id=([\\w-]+?)#"
This will grab everything in the character class [a-zA-Z_0-9-] between 'id=' and '#', assuming everything between 'id=' and '#' is in that character class (i.e. if an '&' is in there, the regex will fail).
id=
-Self-explanatory: this looks for the exact text 'id='.
[\\w-]
-This defines a character class. The \\w is an escaped \w. '\w' is a predefined character class in Java that is equal to [a-zA-Z_0-9]. I added '-' to this class because of the assumed pattern from your examples.
+?
-This is a reluctant quantifier that looks for the shortest possible match. It sits inside the parentheses so that the group captures the whole value rather than only its last character.
#
-The end of the regex, the last character we are looking for to match the pattern.
If you are looking to grab every character between 'id=' and the first '#' following it, the following will work and it uses the same logic as above, but replaces the character class [\\w-] with ., which matches anything.
regex = "id=(.+?)#"
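The patterns suggested in this thread can be checked against both URLs from the question. This sketch uses Python's re module rather than Java, so the backslashes are not doubled.

```python
import re

with_hash = "/te/file/value/jifle?uil=testing-cdas-feaw:jilk:&jklfe=https://value-value.jifels/temp.html/topic?id=e997aad4-92e0-j30e-a3c8-jfkaliejs5#c452fds-634d-f424fds-cdsa&bf_action=jildape"
without_hash = "/te/file/value/jifle?uil=testing-cdas-feaw:jilk:&jklfe=https://value-value.jifels/temp.html/topic?id=e997aad4-92e0-j30e-a3c8-jfkaliejs5&bf_action=jildape"

patterns = [r"id=([^#&]*)#", r"id=([\w-]+?)#", r"id=(.+?)#"]

results = []
for pattern in patterns:
    hit = re.search(pattern, with_hash)      # URL that contains '#'
    miss = re.search(pattern, without_hash)  # URL without '#'
    results.append((hit.group(1), miss))
    print(pattern, "->", hit.group(1), "|", miss)
```

All three capture e997aad4-92e0-j30e-a3c8-jfkaliejs5 from the first URL and return no match for the second, because each pattern requires a literal '#' after the value.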

Very simple regex

I need a regex to match something like this:
<a space><any character/s>#<any character/s><a space>
Yes, it's a very very basic email parser.
Thanks!
Something like this? /^ [^#]+#[^ ]+ $/
The square brackets indicate a character class, that is, the set of characters that can be present at that position. So, your regex would match .#. or *#*. Instead, try "\ .*#.*\ " (the quotes are there to show the space at the end; don't include them in your regex).
For testing e-mail, you might use the regex described here:
\b[A-Z0-9._%+-]+#[A-Z0-9.-]+\.[A-Z]{2,4}\b
It still doesn't cover 100% of e-mails, but the comprehensive version is fairly involved.
^ .+#.+ $
This translates to "the start of the string is followed by a space, one or more characters, the # symbol, one or more characters, and the last character in the string is a space."
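A small sketch comparing the two anchored patterns suggested here, in Python. The test strings are invented. The difference shows up when the part after the # contains a space.

```python
import re

loose = re.compile(r"^ .+#.+ $")          # any characters on both sides
strict = re.compile(r"^ [^#]+#[^ ]+ $")   # no space allowed after the '#'

print(bool(loose.search(" user#example ")))   # True
print(bool(strict.search(" user#example ")))  # True
print(bool(loose.search("user#example")))     # False, surrounding spaces missing
print(bool(loose.search(" a#b c ")))          # True
print(bool(strict.search(" a#b c ")))         # False
```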

How to ignore whitespace in a regular expression subject string?

Is there a simple way to ignore the white space in a target string when searching for matches using a regular expression pattern? For example, if my search is for "cats", I would want "c ats" or "ca ts" to match. I can't strip out the whitespace beforehand because I need to find the begin and end index of the match (including any whitespace) in order to highlight that match and any whitespace needs to be there for formatting purposes.
You can stick optional whitespace characters \s* in between every other character in your regex. Although granted, it will get a bit lengthy.
/cats/ -> /c\s*a\s*t\s*s/
While the accepted answer is technically correct, a more practical approach, if possible, is to just strip whitespace out of both the regular expression and the search string.
If you want to search for "my cats", instead of:
myString.match(/m\s*y\s*c\s*a\s*t\s*s\s*/g)
Just do:
myString.replace(/\s*/g,"").match(/mycats/g)
Warning: You can't automate this on the regular expression by just replacing all spaces with empty strings because they may occur in a negation or otherwise make your regular expression invalid.
Addressing Steven's comment to Sam Dufel's answer
Thanks, sounds like that's the way to go. But I just realized that I only want the optional whitespace characters if they follow a newline. So for example, "c\n ats" or "ca\n ts" should match. But wouldn't want "c ats" to match if there is no newline. Any ideas on how that might be done?
This should do the trick:
/c(?:\n\s*)?a(?:\n\s*)?t(?:\n\s*)?s/
See this page for all the different variations of 'cats' that this matches.
You can also solve this using conditionals, but they are not supported in the javascript flavor of regex.
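A quick check of the newline-only pattern, sketched in Python (the same syntax works in JavaScript). The test strings are made up.

```python
import re

# Whitespace is tolerated between letters only when it starts with a newline.
pattern = re.compile(r"c(?:\n\s*)?a(?:\n\s*)?t(?:\n\s*)?s")

print(bool(pattern.search("cats")))       # True
print(bool(pattern.search("c\n ats")))    # True
print(bool(pattern.search("ca\n  ts")))   # True
print(bool(pattern.search("c ats")))      # False, no newline before the space
```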
You could put \s* in between every character in your search string, so if you were looking for cats you would use c\s*a\s*t\s*s
It's long but you could build the string dynamically of course.
You can see it working here: http://www.rubular.com/r/zzWwvppSpE
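Building the pattern dynamically could look like the sketch below, written in Python. The helper name whitespace_tolerant is hypothetical, and re.escape is used so that special characters in the search word stay literal.

```python
import re

def whitespace_tolerant(word):
    """Allow optional whitespace between the characters of `word`."""
    return re.compile(r"\s*".join(re.escape(ch) for ch in word))

pattern = whitespace_tolerant("cats")
match = pattern.search("two c ats here")

print(pattern.pattern)        # c\s*a\s*t\s*s
print(match.span())           # (4, 9)
print(repr(match.group(0)))   # 'c ats'
```

The span refers to the original string, so it can be used directly for highlighting.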
If you only want to allow spaces, then
\bc *a *t *s\b
should do it. To also allow tabs, use
\bc[ \t]*a[ \t]*t[ \t]*s\b
Remove the \b anchors if you also want to find cats within words like bobcats or catsup.
The following approach can be used to automate this (the example is in Python, although it can obviously be ported to any language).
You can strip the whitespace beforehand AND save the positions of the non-whitespace characters, so that you can use them later to find the boundaries of the matched string in the original string:
import re

def regex_search_ignore_space(regex, string):
    no_spaces = ''
    char_positions = []
    for pos, char in enumerate(string):
        if re.match(r'\S', char):  # upper \S matches non-whitespace chars
            no_spaces += char
            char_positions.append(pos)
    match = re.search(regex, no_spaces)
    if not match:
        return match
    # match.start() and match.end() are indices of start and end
    # of the found string in the spaceless string
    # (as we have searched in it).
    start = char_positions[match.start()]  # in the original string
    # match.end() is exclusive, so map the last matched character and add 1;
    # this avoids a trailing space and an IndexError at the end of the string
    # (assumes a non-empty match).
    end = char_positions[match.end() - 1] + 1  # in the original string
    matched_string = string[start:end]
    # the match WITH spaces is returned.
    return matched_string

with_spaces = 'a li on and a cat'
print(regex_search_ignore_space('lion', with_spaces))
# prints 'li on'
If you want to go further you can construct the match object and return it instead, so the use of this helper will be more handy.
And the performance of this function can of course also be optimized, this example is just to show the path to a solution.
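One possible step in that direction, as a sketch only: return the (start, end) indices in the original string, which is what the question needs for highlighting. The function name is made up, and empty matches are simply treated as "not found".

```python
import re

def search_ignoring_whitespace(regex, string):
    """Return (start, end) in the original string, or None (hypothetical helper)."""
    # Positions of all non-whitespace characters in the original string.
    positions = [i for i, ch in enumerate(string) if not ch.isspace()]
    stripped = "".join(string[i] for i in positions)
    match = re.search(regex, stripped)
    if match is None or match.start() == match.end():
        return None
    # Map the first and last matched characters back to the original string.
    return positions[match.start()], positions[match.end() - 1] + 1

text = "a li on and a cat"
span = search_ignoring_whitespace("lion", text)
print(span)                          # (2, 7)
print(repr(text[span[0]:span[1]]))   # 'li on'
```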
The accepted answer will not work if and when you are passing a dynamic value (such as "current value" in an array loop) as the regex test value. You would not be able to input the optional white spaces without getting some really ugly regex.
Konrad Hoffner's solution is therefore better in such cases, as it strips whitespace from both the regex and the test string. The test is then conducted as though neither had any whitespace.