Is it possible to match a nested pair with regex? - regex

I'm attempting to parse some BBCode with regex, but the nested structures are giving me a headache.
What I'm trying to parse is the following:
[COLOR="Red"]Red [COLOR="Green"]Green[/COLOR][/COLOR]
I've come up with the following pattern, which needs to deal with the optional quotation marks around the color attribute, but it only matches the first opening COLOR and the first closing COLOR. It's not matching in a properly nested arrangement.
\[COLOR=(\"?)(.*?)(\"?)]([\s\S]*?)\[\/COLOR\]
It's being done in Dart, as follows, but I believe the problem is with my regex pattern rather than the Dart implementation.
text = text.replaceAllMapped(RegExp(r'\[COLOR=(\"?)(.*?)(\"?)]([\s\S]*?)\[\/COLOR\]', caseSensitive: false, multiLine: true), (match) {
return '<font style="color: ${match.group(2)}">${match.group(4)}</font>';
});

Matching nested braces (of any kind) is not a regular problem. It is known to be context-free (it can be solved by a stack machine or specified by a context-free grammar), but not regular (it cannot be solved by a finite state machine or specified by a regular expression).
While the commonly implemented "regular expressions" can do some non-regular things (due to backreferences), this is not one of those things.
In general, I'd recommend using a RegExp to tokenize the input, then build the stack based machine yourself on top.
Here, because it's simple enough, I'd just match the start and end markers and replace them individually, and not try to match the text between.
var re = RegExp(r'\[COLOR="(\w+)"\]|\[/COLOR\]');
text = text.replaceAllMapped(re, (m) {
  var color = m[1]; // The color of a start tag, null if not start tag.
  return color == null ? "</span>" : "<span style='color:$color'>";
});
If you want to check that the tags are balanced, we're back to having a stack (in this case so simple it's just a counter):
var re = RegExp(r'\[COLOR="(\w+)"\]|\[/COLOR\]');
var nesting = 0;
text = text.replaceAllMapped(re, (m) {
  var color = m[1];
  if (color == null) {
    if (nesting == 0) {
      throw ArgumentError.value(text, "text", "Bad nesting");
    }
    nesting--; // Decrement on close tag.
    return "</span>";
  }
  nesting++; // Increment on open-tag.
  return "<span style='color:$color'>";
});
if (nesting != 0) {
  throw ArgumentError.value(text, "text", "Bad nesting");
}
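The "tokenize with a RegExp, then run a stack on top" approach recommended above can be sketched as follows. This is a minimal illustration in JavaScript rather than Dart; the function name bbColorToHtml is hypothetical, and it assumes the only tags of interest are [COLOR=...] and [/COLOR], with or without quotes around the color.

```javascript
// Hypothetical helper: converts nested [COLOR=...] BBCode into <span> tags.
function bbColorToHtml(text) {
  // Tokenizer: matches either an opening tag (capturing the color) or a closing tag.
  const token = /\[COLOR="?(\w+)"?\]|\[\/COLOR\]/gi;
  const stack = [];   // colors of the tags that are currently open
  let out = "";
  let last = 0;       // index just past the previous token

  for (const m of text.matchAll(token)) {
    out += text.slice(last, m.index);   // copy the plain text between tokens
    last = m.index + m[0].length;
    if (m[1] !== undefined) {
      stack.push(m[1]);                 // opening tag
      out += `<span style="color:${m[1]}">`;
    } else {
      if (stack.length === 0) throw new Error("Unbalanced [/COLOR]");
      stack.pop();                      // closing tag
      out += "</span>";
    }
  }
  if (stack.length !== 0) throw new Error("Unclosed [COLOR] tag");
  return out + text.slice(last);
}

console.log(bbColorToHtml('[COLOR="Red"]Red [COLOR="Green"]Green[/COLOR][/COLOR]'));
```

Keeping the colors on a real stack instead of a counter is only useful if you later need to know which tag is being closed, for example in error messages or when more tag types are added.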

Related

regx to check + in entire string using javascript or jquery

I am trying to validate an input string to check whether it contains a '+' symbol anywhere in the string. I used a for...of loop but didn't get what I expected.
const isMobileValidWithoutPlus = funcLib.isValidMobileWithoutPlus(mobileNumber);
isValidMobileWithoutPlus(mobileNumber) {
if (!mobileNumber) {
return false;
}
const checkRegex = new RegExp('\\+?\\d+');
return checkRegex.test(mobileNumber);
}
but I am not able to get the desired output.
The regex for this would be
const rgx = /\+/;
Your regular expression matches an optional + followed by one or more digits, so it succeeds for any string that contains a digit, whether or not a + is present. But you're saying you just want to check if there's a "+" anywhere in the number. For that you can use the regex above. (The g and m flags are not needed for a simple presence check, and g makes test() stateful across calls.)
Also, do you need to use a regex?
You can do this using indexOf on a string if using regex is not a must.
let number = "+001234";
function hasPlus(number) {
return number.indexOf('+') !== -1;
}
Regular expressions are generally useful when you don't have one specific string that you're looking for, or when you want to find all the occurrences of a pattern in a longer string. In your case, checking whether a string contains "+", they aren't necessary.
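As a small sketch of both options, the snippet below uses String.prototype.includes for the plain check and also shows why a g-flagged regex is a poor fit for repeated test() calls. The helper name containsPlus is made up for illustration.

```javascript
// Hypothetical helper name; returns true when the input contains a "+".
function containsPlus(value) {
  return typeof value === "string" && value.includes("+");
}

// A regex works too, but a g-flagged regex keeps state in lastIndex,
// so repeated test() calls on the same object can alternate results.
const plusGlobal = /\+/g;
const first = plusGlobal.test("+1");   // true, lastIndex is now 1
const second = plusGlobal.test("+1");  // false, the search resumed at index 1

console.log(containsPlus("+001234"), containsPlus("001234"), first, second);
```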

how to exclude a string if it's in a URL using regex?

I'm replacing a number of different strings, but only want them replaced in normal text, and not rewritten when they appear as part of a link in a document. The regex to find the strings is very straightforward: /word|anotherword|athirdword/gi, but that means that if a link contains anotherword, it gets found and replaced as well, breaking the link.
I think I just need a part in my regex that says "but just ignore anything that starts with http or https" but not sure how to write that.
thanks so much!
Edit: here's what I'm doing in JavaScript:
if (node.nodeType === 3) {
var text = node.nodeValue;
var replacedText = text.replace(/word|anotherword|athirdword/gi, 'replaced text');
if (replacedText !== text) {
element.replaceChild(document.createTextNode(replacedText), node);
}
}
the result replaces those three strings anywhere on a page, which is great. except it changes http://www.foo.com/the-whole-world into http://www.foo.com/the-whole-replaced text which obviously breaks the link.
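I would try something like a negative lookbehind.
Support for negative lookbehind differs greatly from flavor to flavor, so a pattern that relies on it may not work everywhere.
For JavaScript, you can try the following, which instead captures an optional leading URL and leaves the match untouched when one is present:
str.replace(/(https?:[\/.a-z0-9-]+)?(word|anotherword|athirdword)/gi, function($0, $1) {
    // $1 is set when the word is part of a URL: keep the whole match unchanged.
    return $1 ? $0 : 'replaced text';
});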
You can split the string first, then do a conditional replace:
function condReplace(str) {
    var sentences = [];
    // The capture group keeps the URLs in the result of split().
    var res = str.split(/(https?:\/\/[^\s]+)(?:\s+|$)/i);
    res.forEach(function(entry) {
        if (entry) {
            if (entry.match(/^https?:\/\//i)) {
                // A URL (http or https): keep it as-is.
                sentences.push(entry);
            } else {
                sentences.push(entry.replace(/word|anotherword|athirdword/gi, "REPLACED"));
            }
        }
    });
    document.write(sentences.join(" "));
}
var str = "http://sometext.com/word.doc and This is a word normal text anotherword containing a anotherword another link http://www.foo.com/the-whole-word. This is a single word.";
condReplace(str);
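Another way to get the same effect, sketched here under the assumption that a URL is anything starting with http:// or https:// and running to the next whitespace, is a single pass whose pattern matches URLs first and hands them back unchanged. The function name replaceOutsideUrls is hypothetical.

```javascript
// Hypothetical helper: replaces the target words everywhere except inside URLs.
function replaceOutsideUrls(text, replacement) {
  // The URL alternative comes first, so a URL is consumed whole before
  // the word alternatives get a chance to match inside it.
  const pattern = /https?:\/\/\S+|word|anotherword|athirdword/gi;
  return text.replace(pattern, (match) =>
    /^https?:\/\//i.test(match) ? match : replacement
  );
}

console.log(
  replaceOutsideUrls("a word at http://www.foo.com/the-whole-word here", "replaced text")
);
```

Unlike the split-and-join version, this keeps the original whitespace intact, because nothing outside the matches is touched.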

Capitalize every word in actionScript using a regular expression

I'm trying to do initial caps in ActionScript without loops, but I'm stuck. I wanted to select the first letter of every word and then apply uppercase to that letter. I got the selection part right, but I'm at a dead end right now. Any ideas? I was trying to do this without loops and without cutting up strings.
// replaces with x since I can't figure out how to replace with
// the found result as uppercase
public function initialcaps():void
{
var pattern:RegExp=/\b[a-z]/g;
var myString:String="yes that is my dog dancing on the stage";
var nuString:String=myString.replace(pattern,"x");
trace(nuString);
}
You can also use this to avoid the compiler warnings.
myString.replace(pattern, function():String
{
return String(arguments[0]).toUpperCase();
});
Try to use a function that returns the uppercase letter:
myString.replace(pattern, function($0){return $0.toUpperCase();})
This works at least in JavaScript.
Just thought I'd throw my two cents in for strings that may be all caps:
var pattern:RegExp = /\b[a-zA-Z]/g;
myString = myString.toLowerCase().replace(pattern, function($0){return $0.toUpperCase();});
This answer does not throw any kind of compiler errors under strict and I wanted it to be a little more robust, handling edge cases like hyphens (ignore them), underscores (treat them like spaces) and other special non-word characters such as slashes or dots.
It's really important to note the /g switch at the end of the regular expression. Without it, the rest of the function is pretty useless, because it will only address the first word, and not any subsequent ones.
for each ( var myText:String in ["this is your life", "Test-it", "this/that/the other thing", "welcome to the t.dot", "MC_special_button_04", "022s33FDs"] ){
var upperCaseEveryWord:String = myText.replace( /(\w)([-a-zA-Z0-9]*_?)/g, function( match:String, ... args ):String { return args[0].toUpperCase() + args[1] } );
trace( upperCaseEveryWord );
}
Output:
This Is Your Life
Test-it
This/That/The Other Thing
Welcome To The T.Dot
MC_Special_Button_04
022s33FDs
For the copy-and-paste artists, here's a ready-to-roll function:
public function upperCaseEveryWord( input:String ):String {
return input.replace( /(\w)([-a-zA-Z0-9]*_?)/g, function( match:String, ... args ):String { return args[0].toUpperCase() + args[1] } );
}
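For comparison, the callback-based replacement discussed in this thread looks like this in JavaScript. It is a minimal sketch that uses the simple \b[a-z] pattern from the question; the function name capitalizeWords is made up, and because \b also sees a boundary after an apostrophe, a word like "don't" would come out as "Don'T".

```javascript
// Hypothetical helper: lowercases the input, then uppercases the first letter of each word.
function capitalizeWords(input) {
  return input
    .toLowerCase()
    .replace(/\b[a-z]/g, (letter) => letter.toUpperCase());
}

// Without the g flag only the first match is replaced.
const onlyFirst = "yes that".replace(/\b[a-z]/, (letter) => letter.toUpperCase());

console.log(capitalizeWords("yes that is my dog dancing on the stage"));
console.log(onlyFirst);
```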

Code to parse capture groups in regular expressions into a tree

I need to identify (potentially nested) capture groups within regular expressions and create a tree. The particular target is Java-1.6 and I'd ideally like Java code. A simple example is:
"(a(b|c)d(e(f*g))h)"
which would be parsed to
"a(b|c)d(e(f*g))h"
... "b|c"
... "e(f*g)"
... "f*g"
The solution should ideally account for count expressions, quantifiers, etc and levels of escaping. However if this is not easy to find a simpler approach might suffice as we can limit the syntax used.
EDIT. To clarify. I want to parse the regular expression string itself. To do so I need to know the BNF or equivalent for Java 1.6 regexes. I am hoping someone has already done this.
A byproduct of a result would be that the process would test for validity of the regex.
Consider stepping up to an actual parser/lexer:
http://www.antlr.org/wiki/display/ANTLR3/FAQ+-+Getting+Started
It looks complicated, but if your language is fairly simple, it's fairly straightforward. And if it's not, doing it in regexes will probably make your life hell :)
I came up with a partial solution using an XML tool (XOM, http://www.xom.nu) to hold the tree. First the code, then an example parse. First the escaped characters (\, ( and )) are replaced with placeholders (here I use BS, LB and RB), then the remaining brackets are translated to XML tags, then the XML is parsed and the characters are re-escaped. What is still needed is a BNF for Java 1.6 regexes for constructs such as ?:, {d,d} and so on.
public static Element parseRegex(String regex) throws Exception {
    // Strings are immutable, so each replaceAll result must be assigned back.
    regex = regex.replaceAll("\\\\", "BS");
    regex = regex.replaceAll("BS\\(", "LB");
    regex = regex.replaceAll("BS\\)", "RB");
    regex = regex.replaceAll("\\(", "<bracket>");
    regex = regex.replaceAll("\\)", "</bracket>");
    Element regexX = new Builder().build(new StringReader(
            "<regex>" + regex + "</regex>")).getRootElement();
    extractCaptureGroupContent(regexX);
    return regexX;
}
private static String extractCaptureGroupContent(Element regexX) {
StringBuilder sb = new StringBuilder();
for (int i = 0; i < regexX.getChildCount(); i++) {
Node childNode = regexX.getChild(i);
if (childNode instanceof Text) {
Text t = (Text)childNode;
String s = t.getValue();
s = s.replaceAll("BS", "\\\\").replaceAll("LB",
"\\(").replaceAll("RB", "\\)");
t.setValue(s);
sb.append(s);
} else {
sb.append("("+extractCaptureGroupContent((Element)childNode)+")");
}
}
String capture = sb.toString();
regexX.addAttribute(new Attribute("capture", capture));
return capture;
}
example:
@Test
public void testParseRegex2() throws Exception {
    String regex = "(.*(\\(b\\))c(d(e)))";
    Element regexElement = ParserUtil.parseRegex(regex);
    CMLUtil.debug(regexElement, "x");
}
gives:
<regex capture="(.*((b))c(d(e)))">
<bracket capture=".*((b))c(d(e))">.*
<bracket capture="(b)">(b)</bracket>c
<bracket capture="d(e)">d
<bracket capture="e">e</bracket>
</bracket>
</bracket>
</regex>
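The same tree can be built without the XML detour by walking the pattern once with an explicit stack. The sketch below is in JavaScript rather than Java and is deliberately simplified: the function name parseGroups and the node shape are made up, it treats every unescaped parenthesis as a group (so it does not tell (?:...) or lookarounds apart from capturing groups), and it ignores parentheses inside character classes such as [(].

```javascript
// Hypothetical helper: returns a tree of { start, text, children } nodes,
// one node per parenthesised group, with the whole pattern as the root.
function parseGroups(regex) {
  const root = { start: 0, text: regex, children: [] };
  const stack = [root];
  for (let i = 0; i < regex.length; i++) {
    const ch = regex[i];
    if (ch === "\\") {
      i++;                                   // skip the escaped character
    } else if (ch === "(") {
      const node = { start: i + 1, text: "", children: [] };
      stack[stack.length - 1].children.push(node);
      stack.push(node);
    } else if (ch === ")") {
      if (stack.length === 1) throw new Error("Unmatched ) at index " + i);
      const node = stack.pop();
      node.text = regex.slice(node.start, i); // content between the brackets
    }
  }
  if (stack.length !== 1) throw new Error("Unclosed ( in pattern");
  return root;
}

// Print the tree with one level of indentation per nesting depth.
function printTree(node, depth = 0) {
  console.log("  ".repeat(depth) + node.text);
  node.children.forEach((child) => printTree(child, depth + 1));
}

printTree(parseGroups("(a(b|c)d(e(f*g))h)"));
```

Because unbalanced brackets throw, the walk also gives a rough validity check for bracket nesting, though not for the rest of the regex syntax.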

Want to Encode text during Regex.Replace call

I have a regex call that I need help with.
I haven't posted my regex, because it is not relevant here.
What I want to be able to do is, during the Replace, also modify the ${text} portion by doing an Html.Encode on the entire text that the regex matches.
Basically, wrap the entire text matched by the regex in the bold tag, but also Html.Encode the text in between the bold tags.
RegexOptions regexOptions = RegexOptions.Compiled | RegexOptions.IgnoreCase;
text = Regex.Replace(text, regexBold, @"<b>${text}</b>", regexOptions);
There is an incredibly easy way of doing this in .NET. It's called a MatchEvaluator, and it lets you do all sorts of cool find-and-replace operations. Essentially, you pass Regex.Replace the name of a method that takes a Match object as its only parameter and returns a string. Do whatever makes sense for your particular match (HTML-encode it), and the string you return replaces the entire text of the match in the input string.
Example: Let's say you want to find all the places where two numbers are being added (in text) and replace each expression with the actual result. You can't do that with a plain regex replacement, but with a MatchEvaluator it becomes easy.
public void Stuff()
{
    // The hyphen is escaped so the class matches only *, +, - and /.
    string pattern = @"(?<firstNumber>\d+)\s*(?<operator>[*+\-/])\s*(?<secondNumber>\d+)";
    string input = "something something 123 + 456 blah blah 100 - 55";
    string output = Regex.Replace(input, pattern, MatchMath);
    // output will be "something something 579 blah blah 45"
}
private static string MatchMath(Match match)
{
try
{
double first = double.Parse(match.Groups["firstNumber"].Value);
double second = double.Parse(match.Groups["secondNumber"].Value);
switch (match.Groups["operator"].Value)
{
case "*":
return (first * second).ToString();
case "+":
return (first + second).ToString();
case "-":
return (first - second).ToString();
case "/":
return (first / second).ToString();
}
}
catch { }
return "NaN";
}
Find out more at http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.matchevaluator.aspx
Don't use Regex.Replace in this case; iterate over the matches instead:
foreach (Match m in Regex.Matches(...))
{
    // do your stuff here
}
Here's an implementation of this that I've used to pick out special replacement strings from content and localize them.
protected string FindAndTranslateIn(string content)
{
    return Regex.Replace(content, @"\{\^(.+?);(.+?)?}", new MatchEvaluator(TranslateHandler), RegexOptions.IgnoreCase);
}
public string TranslateHandler(Match m)
{
if (m.Success)
{
string key = m.Groups[1].Value;
key = FindAndTranslateIn(key);
string def = string.Empty;
if (m.Groups.Count > 2)
{
def = m.Groups[2].Value;
if(def.Length > 1)
{
def = FindAndTranslateIn(def);
}
}
if (group == null)
{
return Translate(key, def);
}
else
{
return Translate(key, group, def);
}
}
return string.Empty;
}
From the MatchEvaluator delegate you return whatever should replace the match, so where I have return statements you would have the bold tags and an encode call. Mine also supports recursion, so it is a little overcomplicated for your needs, but you can pare the example down.
This is equivalent to iterating over the collection of matches and doing part of the Replace method's job yourself. It just saves you some code, and you get to use a fancy-shmancy delegate.
If you do a Regex.Match, the group at index 0 of the resulting Match object is the part of the input that matched the regex.
You can use this to stitch in the bold tags and encode the text there.
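The idea of encoding inside the replacement callback can be sketched as follows. This is JavaScript, not C#: the replace callback plays the role that the MatchEvaluator plays in .NET. The original regex was not posted, so the [b]...[/b] pattern below is purely an assumption, and escapeHtml and boldEncoded are hypothetical helper names.

```javascript
// Hypothetical helper: escapes the characters that are special in HTML.
function escapeHtml(s) {
  const map = { "&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;", "'": "&#39;" };
  return s.replace(/[&<>"']/g, (c) => map[c]);
}

// Hypothetical helper: wraps the first capture group of each match in <b> tags,
// encoding the captured text on the way. The pattern needs the g flag and
// exactly one capture group.
function boldEncoded(text, pattern) {
  return text.replace(pattern, (_whole, inner) => `<b>${escapeHtml(inner)}</b>`);
}

// Assumed pattern, for illustration only.
const assumedBoldPattern = /\[b\](.*?)\[\/b\]/gi;
console.log(boldEncoded("x [b]a < b & c[/b] y", assumedBoldPattern));
```

Only the matched portions are rebuilt by the callback; the rest of the input is left untouched.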
Can you fill in the code inside {} to add the bold tag, and encode the text?
I'm confused as to how to apply the changes to the entire text block AND replace the section in the text variable at the end.