Regex matching in ColdFusion OR condition

I am attempting to write a CF component that will parse wikiCreole text. I am having trouble getting the correct matches with some of my regular expressions, though. I feel like if I can just get my head around the first one, the rest will just click. Here is an example:
The following is sample input:
You can make things **bold** or //italic// or **//both//** or //**both**//.
Character formatting extends across line breaks: **bold,
this is still bold. This line deliberately does not end in star-star.
Not bold. Character formatting does not cross paragraph boundaries.
My first attempt was:
<cfset out = REreplace(out, "\*\*(.*?)\*\*", "<strong>\1</strong>", "all") />
Then I realized that it would not match where the ** is not given, and it should end where there are two carriage returns.
So I tried this:
<cfset out = REreplace(out, "\*\*(.*?)[(\*\*)|(\r\n\r\n)]", "<strong>\1</strong>", "all") />
and it is close but for some reason it gives you this:
You can make things <strong>bold</strong>* or //italic// or <strong>//both//</strong>* or //<strong>both</strong>*//.
Character formatting extends across line breaks: <strong>bold,</strong>
this is still bold. This line deliberately does not end in star-star.
Not bold. Character formatting does not cross paragraph boundaries.
Any ideas?
PS: If anyone has any suggestions for better tags, or a better title for this post I am all ears.

The [...] represents a character class, so this:
[(\*\*)|(\r\n\r\n)]
Is effectively the same as this:
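[()*|\r\n]
i.e. it matches a single character (a parenthesis, "*", "|", carriage return, or linefeed), and the "|" inside it is a literal character, not an alternation. That is why only one "*" of each closing pair is consumed and the other is left in the output.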
Another problem is that you replace the double linefeed. Even if your match succeeded you would end up merging paragraphs. You need to either restore it or not consume it in the first place. I'd use a positive lookahead to do the latter.
In Perl I'd write it this way:
$string =~ s/\*\*(.*?)(?:\*\*|(?=\n\n))/<strong>$1<\/strong>/sg;
Taking a wild guess, the ColdFusion probably looks like this:
REreplace(out, "\*\*(.*?)(?:\*\*|(?=\r\n\r\n))", "<strong>\1</strong>", "all")
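To make the lookahead idea concrete, here is a minimal sketch in Python's re module rather than ColdFusion (the function name creole_bold is made up for illustration). It assumes paragraphs are separated by a blank line and that "." should match newlines, which is what the /s flag does in the Perl version.

```python
import re

# "**" opens bold; it closes at the next "**" or just before a blank line.
# The lookahead (?=...) checks for the paragraph break without consuming it,
# so the two paragraphs are not merged by the replacement.
BOLD_RE = re.compile(r"\*\*(.*?)(?:\*\*|(?=\r?\n\r?\n))", re.DOTALL)


def creole_bold(text):
    """Wrap wikiCreole bold spans in <strong> tags (hypothetical helper)."""
    return BOLD_RE.sub(r"<strong>\1</strong>", text)


if __name__ == "__main__":
    sample = "Make things **bold**.\n\nAcross lines: **bold,\nstill bold.\n\nNot bold."
    print(creole_bold(sample))
```

An unterminated ** in the very last paragraph (no blank line after it) is left untouched by this sketch; adding an end-of-text alternative to the lookahead would cover that case.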

You really should change your
(.*?)
to something like
[^*]*?
to match any character except the *. I don't know if that is the problem, but it could be that the any-character . is eating one of your stars. It is also a generally accepted "best practice", when trying to balance matching characters like the double star or HTML start/end tags, to explicitly exclude them from your match set for the inner text.
*Disclaimer, I didn't test this in ColdFusion for the nuances of the regex engine - but the idea should hold true.

I know this is an older question, but in response to where Ryan Guill said "I tried the $1 but it put a literal $1 in there instead of the match": for ColdFusion you should use \1 instead of $1.

I always use a regex web page. It seems like I start from scratch every time I use regex.
Try using '$1' instead of \1 for this one - the replace is slightly different... but I think the pattern is what you need to get working.
Getting closer with this:
\*\*(.*?)\*\*|//(.*?)//
The tricky part is the //** or **//
OK, first check for **//both//**, then //**both**//, then **bold**, then //italic//:
\*\*//(.*?)//\*\*|//\*\*(.*?)\*\*//|\*\*(.*?)\*\*|//(.*?)//

I find this app immensely helpful when I'm doing anything with regex:
http://www.gskinner.com/RegExr/desktop/
Still doesn't help with your actual issue, but could be useful going forward.

Related

URL regex that skips ending periods

I'm trying to create a regex that matches url strings within normal text. I have this:
http[s]?://[^\s]+
This seems to work well with the exception that if the url is at the end of a sentence it will grab the period as well. For example for this string:
I am typing some text with the url http://something.com/some-thing?args=someargs. This is another sentence.
it matches:
http://something.com/some-thing?args=someargs.
I would like it to match:
http://something.com/some-thing?args=someargs
Obviously I can't exclude periods because they are in the url previously but I can't figure out how to tell it to exclude the last period if there is one. I could potentially use a negative lookahead for end of line or whitespace, but if it's in the middle of the line (without a period after it) that would leave off the last character of the url.
Most of the ones I have seen online have the same issue that they match the ending dot so maybe it's not possible? I know basic regex but certainly not a genius with it so if someone has a solution I would be very grateful :).
Also, I can do some post-process in this case to remove the dot if I need to, just seems like there should be a Regex solution...
Try this one
http[s]?://[^\s]+[^. ]
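Here is a quick way to check that idea, sketched in Python. The pattern is a small variation on the one above: it uses [^.\s] as the final character so that the URL cannot end in a period or in any whitespace. The function name find_urls is just for illustration.

```python
import re

# \S+ grabs the run of non-space characters greedily, then backtracks one
# step at a time until the final character is neither "." nor whitespace.
URL_RE = re.compile(r"https?://\S+[^.\s]")


def find_urls(text):
    """Return all URLs in text, without a trailing sentence period."""
    return URL_RE.findall(text)


if __name__ == "__main__":
    sentence = ("I am typing some text with the url "
                "http://something.com/some-thing?args=someargs. "
                "This is another sentence.")
    print(find_urls(sentence))
```

Periods inside the URL are still matched; only a trailing one is dropped. Other trailing punctuation (a comma or closing parenthesis, say) would need to be added to the final character class.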

Perl: regx pattern matching

Whenever I find the string ".abc.corp:" in a line of a file, I would like to exclude that line.
Example line:
kubernte-fileserver-NN.abc.corp:/srv/export/storage/nsp_updates 1231231 123112 123123 89% /devops
Can someone help me find the correct regex pattern?
I'm trying the pattern below but I'm unable to figure it out:
/^(.*(?!\.abc\.corp).*)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\%\s+([\/\w\d\.-]+)$/
I'm confused about how subpattern group $1 interacts with the negative lookaround!
By "exclude it", I assume you want to exclude the entire line.
Your try will not exclude anything, because here Perl can always find some point in the share path to split it, where the split point is not immediately followed by .abc.corp, like if it splits it:
kubernte-fileserver-N
N.abc.corp:/srv/export/storage/nsp_updates
or (as it's actually going to do) just consume everything by the first .*, with nothing left for the second one.
I'd instead first try to match the string you're trying to avoid, and failing to do so, proceed with the actual handling:
if (/^\S+\.abc\.corp:/) {
    # SKIP
}
elsif (/^(.*)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\%\s+([\/\w\d\.-]+)$/) {
    ...
}
Besides actually working, this makes the code much more readable.
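For comparison, the same "reject first, then parse" structure can be sketched in Python. The names parse_rows, SKIP_RE, and ROW_RE are made up for this example, and the second sample line in the demo is invented test data, not from the question.

```python
import re

# Lines whose first field is a host ending in .abc.corp are skipped outright.
SKIP_RE = re.compile(r"^\S+\.abc\.corp:")
# Everything else is parsed with the plain pattern, no lookaround needed.
ROW_RE = re.compile(r"^(.*)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)%\s+([/\w.-]+)$")


def parse_rows(lines):
    """Return the captured fields of every line that is not excluded."""
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if SKIP_RE.match(line):
            continue  # excluded share, do not parse it at all
        m = ROW_RE.match(line)
        if m:
            rows.append(m.groups())
    return rows


if __name__ == "__main__":
    demo = [
        "kubernte-fileserver-NN.abc.corp:/srv/export/storage/nsp_updates 1231231 123112 123123 89% /devops",
        "otherhost:/srv/export 100 40 60 40% /mnt",  # invented sample line
    ]
    print(parse_rows(demo))
```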

replacing all open tags with a string

Before somebody points me to that question, I know that one can't parse html with regex :) And this is not what I am trying to do.
What I need is:
Input: a string containing html.
Output: the same string with *** prepended to every opening tag, so that <tag> becomes
***<tag>
So if I get
<a><b><c></a></b></c>, I want
***<a>***<b>***<c></a></b></c>
as output.
I've tried something like:
(<[~/].+>)
and replace it with
***$1
But doesn't really seem to work the way I want it to. Any pointers?
Clarification: it's guaranteed that there are no self closing tags nor comments in the input.
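You just have two problems: ^ is the character to exclude items from a character class, not ~; and the .+ is greedy, so it will match as many characters as possible before the final >. Make the quantifier lazy, and use * rather than + so that single-letter tags like <a> still match (with .+? the match for <a> would have to run on to the next tag's >). Change it to:
(<[^/].*?>)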
You can also probably drop the parentheses and replace with $0 or $&, depending on the language.
Try using: (<[^/].*?>) and replace it with ***$1
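A small Python sketch to check this against the sample input. It uses a negated class [^>]* in place of the lazy .*?, which has the same effect here; the function name mark_open_tags is invented for the example, and it relies on the question's guarantee that there are no self-closing tags or comments.

```python
import re

# "<", then any character except "/", then everything up to the first ">".
OPEN_TAG_RE = re.compile(r"<[^/][^>]*>")


def mark_open_tags(html):
    """Prefix every opening tag with *** and leave closing tags alone."""
    return OPEN_TAG_RE.sub(lambda m: "***" + m.group(0), html)


if __name__ == "__main__":
    print(mark_open_tags("<a><b><c></a></b></c>"))
```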

Regex which ignores comments

Being a regex beginner, I need some help writing a regex. It should match a particular pattern, let's say "ABC". But the pattern shouldn't be matched when it is used in a comment (' being the comment sign). So XYZ ' ABC
shouldn't match. x("teststring ABC") also shouldn't match. But ABC("teststring ' xxx") has to match to the end, that is, xxx must not be cut off.
Also, does anybody know a free regex application that you can use to "debug" your regex? I often have problems recognizing what's wrong with my attempts. Thanks!
Some will swear by RegexBuddy. I've never used the debugger, but I advise you to steer away from the regex generator it provides. It's just a bad idea.
You may be able to pull this off with whatever regex flavor you're using, but in general I think you're going to find it easier and more maintainable to do this the "hard" way. Regular expressions are for regular languages, and nested anything usually means that regexes aren't a good idea. Modern extensions to regex syntax mean it may be doable, but it's not going to be pretty, and you sure won't remember what happened in the morning. One place where regular expressions fail quite spectacularly (even with modern non-regular extensions) is parsing nested structures: trying to parse any mixture of comments, quoted strings, and parentheses quickly devolves into an incomprehensible and unmaintainable mess. Don't get me wrong, I'm a fan of regular expressions in the right places. This isn't one of them.
On the topic of good regex tools, I really like RegexBuddy, but it's not free.
Other than that, a regex is the wrong tool for the job if you need to check inside string delimiters and all sorts too. You need a finite-state machine.
Odd that lots of people recommend their favorite tools, but nobody provides a solution for the problem at hand. (I'm the developer of RegexBuddy, so I'll refrain from recommending any tools.)
There's no good way to match Y except where it is part of XYZ using a single regular expression. What you can do is write a regex that matches both Y and XYZ: Y|XYZ. Then use a bit of extra code to process the matches for Y, and ignore those for XYZ. One way to do that is with a capturing group: (Y)|XYZ. Now you can process the matches of the first capturing group. When XYZ matches, the capturing group doesn't match anything.
To do this for your VB-style comments, you can use the regex:
'.*|(ABC)
This regex matches a single quote and everything up to the end of the line, or ABC. This regex will match all comments (whether those include ABC or not). The capturing group will match all occurrences of ABC, except those in comments.
If you want your regex to both skip comments and strings, you can add strings to your regex:
'.*|"[^"\r\n]*"|(ABC)
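The "bit of extra code" can be as small as a filter on the capturing group. Here is a minimal sketch in Python (the function name find_code_matches is made up), run against the three examples from the question.

```python
import re

# Alternatives are tried left to right at each position: a comment or a
# string is swallowed whole, so group 1 only ever captures ABC in plain code.
TOKEN_RE = re.compile(r"""'.*|"[^"\r\n]*"|(ABC)""")


def find_code_matches(line):
    """Return start offsets of ABC outside comments and string literals."""
    return [m.start(1) for m in TOKEN_RE.finditer(line)
            if m.group(1) is not None]


if __name__ == "__main__":
    for line in ["XYZ ' ABC",
                 'x("teststring ABC")',
                 'ABC("teststring \' xxx")']:
        print(repr(line), find_code_matches(line))
```

Matches where group 1 is empty are the comments and strings; they are simply ignored.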
I find the best 'debugger' for regexes is just messing around in an interactive environment, trying lots of small bits out. For Python, ipython is great; for Ruby, irb; for command-line type stuff, sed...
Just try out little pieces at a time, make sure you understand them, then add an extra little bit. Rinse and repeat.
For .NET development you might also try RegexDesigner. This tool can generate code (VB/C#) for you. It is a very good tool for us regex starters.
Here is my solution to this problem:
1. Find and store all your comments in a hash
2. Do your regexp replacement
3. Put the comments back into the file
Save your time :-)
using System.Collections.Generic;
using System.Text.RegularExpressions;
string fileTextWithComments = "Some text file contents";
Dictionary<string, string> comments = new Dictionary<string, string>();
// 1. Find and store all your comments in a hash
Regex rc = new Regex("(?:/\\*(?:[^*]|(?:\\*+[^*/]))*\\*+/)|(?://.*)");
MatchCollection matches = rc.Matches(fileTextWithComments);
int index = 0;
foreach (Match match in matches)
{
    // Swap each comment for a unique placeholder key
    string key = string.Format("/*Comment#{0}*/", index++);
    comments.Add(key, match.Value);
    fileTextWithComments = fileTextWithComments.Replace(match.Value, key);
}
// 2. Do your regexp replacement
Regex r = new Regex("YOUR REGEXP PATTERN");
fileTextWithComments = r.Replace(fileTextWithComments, "NEW STRING");
// 3. Put the comments back into the file :-)
foreach (string key in comments.Keys)
{
    string comment = comments[key];
    fileTextWithComments = fileTextWithComments.Replace(key, comment);
}
Could you clarify? I read it thrice, and I think you want to match a given pattern when it appears as a literal. As in not as part of a comment or a string.
What you're asking for is pretty tricky to do as a single regexp, because you want to skip strings. Multiple strings in one line would complicate matters.
I wouldn't even try to do it in one regexp. Instead, I'd pass each line through a filter first, to remove strings and then comments, in that order. And then try to match your pattern.
I'd do it in Perl because of its regexp processing power. Assume @lines is a list of lines you want to match, and $pattern is the pattern you want to match.
my @matches;
for (@lines) {
    my $line = $_;
    $line =~ s/".*?(?<!\\)"//g;
    $line =~ s/'.*//g;
    push @matches, $_ if $line =~ m/$pattern/;
}
The first substitution removes anything that starts with a double quotation mark and ends with the first unescaped double quote, using the standard escape character, a backslash.
The next strips comments. If the pattern still matches, it adds that line to the list of matches.
It's not perfect because it can't tell the difference between "a\\" and "a\". The first is usually a valid string, the latter is not. Either way these substitutions will continue to look for another ", and if one isn't found the string isn't thrown out. We could use another substitution to replace all double backslashes with something else, but this will cause problems if the pattern you're looking for contains a backslash.
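The same filter-then-match idea, sketched in Python for anyone who wants to try it quickly. The name matching_lines is made up, and the sketch shares the limitation just described: a string ending in an escaped backslash is not recognised correctly.

```python
import re

# A double-quoted string: from a quote to the first quote that is not
# preceded by a backslash.
STRING_RE = re.compile(r'".*?(?<!\\)"')
# A VB-style comment: from a single quote to the end of the line.
COMMENT_RE = re.compile(r"'.*")


def matching_lines(lines, pattern):
    """Return the lines where pattern occurs outside strings and comments."""
    hits = []
    for line in lines:
        stripped = STRING_RE.sub("", line)        # remove strings first,
        stripped = COMMENT_RE.sub("", stripped)   # then comments
        if re.search(pattern, stripped):
            hits.append(line)                     # keep the original line
    return hits


if __name__ == "__main__":
    sample = ["XYZ ' ABC", 'x("teststring ABC")', 'ABC("teststring \' xxx")']
    print(matching_lines(sample, "ABC"))
```

Strings have to go first; otherwise a single quote inside a string would be treated as the start of a comment.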
You can use a zero width look-behind assertion if you only have single line comments, but if you're using multi-line comments, it gets a little trickier.
Ultimately, you really need to solve this kind of issue with some sort of parser, given that the definition of a comment is really driven by a grammar.
This answer to a different but related question looks good too...
If you have Emacs, there is a built-in regex tool called "regexp-builder". I don't really understand the specifics of your regex question well enough to suggest an answer to that.
RegEx1: (-user ")(.*?)"
Subject: report -user "test user" -date 1/4/13 -day monday -daterange "1/4/13 1/20/13" -
Result: -user "test user"
Regex2: (-daterange ")(.*?)"
Subject: report -user "test user" -date 1/4/13 -day monday -daterange "1/4/13 1/20/13" -
Result: -daterange "1/4/13 1/20/13"
RegEx3: (-date )(.*?)( -)
Subject: report -user "test user" -date 1/4/13 -day monday -daterange "1/4/13 1/20/13" -
Result: -date 1/4/13 -
RegEx4: (-day )(.*?)( -)
Subject: report -user "test user" -date 1/4/13 -day monday -daterange "1/4/13 1/20/13" -
Result: -day monday -
Search for the quoted value first; if it is not found, search for the unquoted parameter. This expects only one occurrence of the parameter. It also expects the command to either use quotes to encapsulate a string with no quotes inside, or use any character other than a quote in the first position, have no occurrence of ' -' until the next parameter, and have a trailing ' -' (add it onto the string before running the regex).
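Put together, the two-step lookup might look like this in Python. This is only a sketch under the assumptions listed above; the function name get_param is invented, and the trailing ' -' is appended inside the function rather than by the caller.

```python
import re


def get_param(command, name):
    """Return the value of -name, trying the quoted form first."""
    flag = re.escape(name)
    # Step 1: -name "value with spaces"
    quoted = re.search(rf'-{flag} "(.*?)"', command)
    if quoted:
        return quoted.group(1)
    # Step 2: -name value, terminated by the next " -"; a sentinel " -" is
    # appended so that the last parameter on the line is terminated as well.
    bare = re.search(rf"-{flag} (.*?) -", command + " -")
    return bare.group(1) if bare else None


if __name__ == "__main__":
    cmd = ('report -user "test user" -date 1/4/13 -day monday '
           '-daterange "1/4/13 1/20/13"')
    for key in ("user", "date", "day", "daterange"):
        print(key, "=>", get_param(cmd, key))
```

Because the flag name must be followed by a space, looking up "date" does not accidentally hit "-daterange".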

Need regexp to find substring between two tokens

I suspect this has already been answered somewhere, but I can't find it, so...
I need to extract a string from between two tokens in a larger string, in which the second token will probably appear again meaning... (pseudo code...)
myString = "A=abc;B=def_3%^123+-;C=123;" ;
myB = getInnerString(myString, "B=", ";" ) ;
method getInnerString(inStr, startToken, endToken){
return inStr.replace( EXPRESSION, "$1");
}
so, when I run this using expression ".+B=(.+);.+"
I get "def_3%^123+-;C=123;" presumably because it just looks for the LAST instance of ';' in the string, rather than stopping at the first one it comes to.
I've tried using (?=) in search of that first ';' but it gives me the same result.
I can't seem to find a regExp reference that explains how one can specify the "NEXT" token rather than the one at the end.
any and all help greatly appreciated.
Similar question on SO:
Regex: To pull out a sub-string between two tags in a string
Regex to replace all \n in a String, but no those inside [code] [/code] tag
Replace patterns that are inside delimiters using a regular expression call
RegEx matching HTML tags and extracting text
You're using a greedy pattern by not specifying the ? in it. Try this:
".+B=(.+?);.+"
Try this:
B=([^;]+);
This matches everything between B= and ; unless it is a ;. So it matches everything between B= and the first ; thereafter.
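Since the question asks for a general getInnerString(inStr, startToken, endToken), here is a minimal sketch of that helper in Python, under the assumption that the tokens are literal text rather than patterns. It uses a lazy (.*?) so the match stops at the first end token, and re.escape so that tokens containing regex metacharacters are taken literally.

```python
import re


def get_inner_string(text, start_token, end_token):
    """Return the text between start_token and the NEXT end_token, or None."""
    pattern = re.escape(start_token) + "(.*?)" + re.escape(end_token)
    m = re.search(pattern, text)
    return m.group(1) if m else None


if __name__ == "__main__":
    my_string = "A=abc;B=def_3%^123+-;C=123;"
    print(get_inner_string(my_string, "B=", ";"))
```

Reading the capturing group directly avoids having to match (and then throw away) the rest of the string, which is what the replace-based version has to do.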
(This is a continuation of the conversation from the comments to Evan's answer.)
Here's what happens when your (corrected) regex is applied: First, the .+ matches the whole string. Then it backtracks, giving up most of the characters it just matched until it gets to the point where the B= can match. Then the (.+?) matches (and captures) everything it sees until the next part, the semicolon, can match. Then the final .+ gobbles up the remaining characters.
All you're really interested in is the "B=" and the ";" and whatever's between them, so why match the rest of the string? The only reason you have to do that is so you can replace the whole string with the contents of the capturing group. But why bother doing that if you can access contents of the group directly? Here's a demonstration (in Java, because I can't tell what language you're using):
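import java.util.regex.Matcher;
import java.util.regex.Pattern;
String s = "A=abc;B=def_3%^123+-;C=123;";
Pattern p = Pattern.compile("B=(.*?);");
Matcher m = p.matcher(s);
if (m.find())
{
    // group(1) is the text between "B=" and the first ";" after it
    System.out.println(m.group(1));
}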
Why do a 'replace' when a 'find' is so much more straightforward? Probably because your API makes it easier; that's why we do it in Java. Java has several regex-oriented convenience methods in its String class: replaceAll(), replaceFirst(), split(), and matches() (which returns true iff the regex matches the whole string), but not find(). And there's no convenience method for accessing capturing groups, either. We can't match the elegance of Perl one-liners like this:
print $1 if 'A=abc;B=def_3%^123+-;C=123;' =~ /B=(.*?);/;
...so we content ourselves with hacks like this:
System.out.println("A=abc;B=def_3%^123+-;C=123;"
.replaceFirst(".+B=(.*?);.+", "$1"));
Just to be clear, I'm not saying not to use these hacks, or that there's anything wrong with Evan's answer--there isn't. I just think we should understand why we use them, and what trade-offs we're making when we do.