regex vs substring - regex

I have a very short xml String passed to my app from another app and I'm only interested in extracting the content between the "level" tags. Which solution is better between these two:
String xmlString =
    "<type>"
    + "<perm>"
    + "<date>99999999</date>"
    + "<level>admin</level>"
    + "</perm>"
    + "</type>";
String level = xmlString.substring(xmlString.indexOf("<level>") + "<level>".length(),
xmlString.indexOf("</level>"));
or
Pattern p1 = Pattern.compile("<level>(\\S+)</level>");
Matcher m = p1.matcher(xmlString);
if (m.find()) {
    String level = m.group(1);
}

Have you tried benchmarking this yourself? From what I've read, the usual advice is to start with a regex and fall back to substring only if you can't make the regex fast enough. That said, I'm a little confused about why you aren't using something like XmlObject.Factory to handle your XML parsing: https://xmlbeans.apache.org/docs/2.0.0/reference/org/apache/xmlbeans/XmlObject.Factory.html
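For anyone comparing the two, here is a minimal sketch of both approaches side by side. The class and method names (LevelExtractor, bySubstring, byRegex) are made up for illustration, and the regex uses a lazy (.*?) group instead of the question's \\S+ so that it stops at the first closing tag. Unlike the snippets in the question, both versions return an empty Optional when the tag is missing rather than throwing or silently skipping.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LevelExtractor {
    // Compile once and reuse: recompiling the pattern on every call is avoidable overhead.
    private static final Pattern LEVEL =
            Pattern.compile("<level>(.*?)</level>", Pattern.DOTALL);

    /** Substring approach: returns empty instead of throwing when a tag is missing. */
    public static Optional<String> bySubstring(String xml) {
        String open = "<level>";
        String close = "</level>";
        int start = xml.indexOf(open);
        if (start < 0) {
            return Optional.empty();
        }
        start += open.length();
        int end = xml.indexOf(close, start);
        if (end < 0) {
            return Optional.empty();
        }
        return Optional.of(xml.substring(start, end));
    }

    /** Regex approach: the lazy group stops at the first closing tag. */
    public static Optional<String> byRegex(String xml) {
        Matcher m = LEVEL.matcher(xml);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String xml = "<type><perm><date>99999999</date><level>admin</level></perm></type>";
        System.out.println(bySubstring(xml).orElse("(no level)"));
        System.out.println(byRegex(xml).orElse("(no level)"));
    }
}
```

For a string this short, either version is likely to be fast enough; which one wins in practice is something only a benchmark on your own input can tell you.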

Related

How to build a Raw string for regex from string variable

How do I build a regex from a string variable and have it interpreted as a raw string?
std::regex re{R"(pattern)"};
For the above code, is there a way to replace the fixed string "pattern" with a std::string pattern variable that is built at either compile time or run time?
I tried this, but it didn't work:
std::string pattern = "key";
std::string pattern = std::string("R(\"") + pattern + ")\"";
std::regex re(pattern); // does not behave the way re(R"(key)") does
Specifically, with re(R"(key)") the result is found as expected, but with re(pattern), where pattern has exactly the same value ("key"), it is not found.
This is probably what I need, but it was for Java, not sure if there is anything similar in C++:
How do you use a variable in a regular expression?
std::string pattern = std::string("R(\"") + pattern + ")\"";
should be built from raw string literals as follows:
pattern = std::string(R"(\")") + pattern + std::string(R"(\")");
This results in a string value like
\"key\"
In case you want to have escaped parenthesis, you can write
pattern = std::string(R"(\(")") + pattern + std::string(R"("\))");
This results in a string value like
\("key"\)
Side note: You can't define the pattern variable twice. Omit the std::string type in follow-up uses.

How to replace parts of a string in lua "in a single pass"?

I have the following string of anchors (where I want to change the contents of the href) and a Lua table of replacements that maps each word to its replacement:
s1 = '<a href="word7">'
replacementTable = {}
replacementTable["word1"] = "potato1"
replacementTable["word2"] = "potato2"
replacementTable["word3"] = "potato3"
replacementTable["word4"] = "potato4"
replacementTable["word5"] = "potato5"
The expected result should be:
<a href="word7">
I know I could do this by iterating over each element in replacementTable and processing the string each time, but my gut feeling is that if the string or the replacement table becomes very big, this approach is going to perform poorly.
So I thought it would be best to do the following: apply the regular expression to find all the matches, get an iterator over the matches, and replace each match with its value in replacementTable.
Something like this would be great (written in JavaScript because I don't yet know how to write lambdas in Lua):
var newString = patternReplacement(s1, '<a[^>]* href="([^"]*)"', function(match) { return replacementTable[match] })
Where the first parameter is the string, the second one the regular expression and the third one a function that is executed for each match to get the replacement. This way I think s1 gets parsed once, being more efficient.
Is there any way to do this in Lua?
In your example, this simple code works:
print((s1:gsub("%w+",replacementTable)))
The point is that gsub already accepts a table of replacements.
In the end, the solution that worked for me was the following one:
local updatedBody = string.gsub(body, '(<a[^>]* href=")(/[^"%?]*)([^"]*")', function(leftSide, url, rightSide)
    local replacedUrl = url
    if (urlsToReplace[url]) then replacedUrl = urlsToReplace[url] end
    return leftSide .. replacedUrl .. rightSide
end)
It kept out any querystring parameter giving me just the URI. I know it's a bad idea to parse HTML bodies with regular expressions but for my case, where I required a lot of performance, this was performing a lot faster and just did the job.
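For readers coming from the Java side of this thread, the same single-pass idea (scan once, look each match up in a table, keep anything not in the table) can be sketched with Matcher.appendReplacement. This is only an illustration: HrefRewriter and rewrite are hypothetical names, and the pattern is a simplified version of the one above that does not split off the query string.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HrefRewriter {
    // Group 1: the tag up to and including href=", group 2: the href value, group 3: the closing quote.
    private static final Pattern HREF = Pattern.compile("(<a[^>]*\\shref=\")([^\"]*)(\")");

    /** Walks the input once and swaps every href value that has an entry in the map. */
    public static String rewrite(String html, Map<String, String> replacements) {
        Matcher m = HREF.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String url = m.group(2);
            // Values without a table entry are written back unchanged.
            String replaced = replacements.getOrDefault(url, url);
            // quoteReplacement keeps '$' and '\' in the URL from being read as group references.
            m.appendReplacement(out,
                    Matcher.quoteReplacement(m.group(1) + replaced + m.group(3)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> table = new HashMap<>();
        table.put("word1", "potato1");
        System.out.println(rewrite("<a href=\"word1\">", table));
    }
}
```

The lookup cost per match is a single hash access, so the size of the replacement table should not change how many times the string is scanned.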

Extract a substring between a specific pattern

I have to extract some substrings, this is like an XML markup in a plain text doc, like
lsdkfjsdklfj sdklfsdklfjsd <AAA>myString</AAA>sdfsdfsdfsdf
Can I extract this pattern in a single command?
For a case like this, I tried using a Matcher and its group method to extract the single match.
I don't want to do something like
String pattern = /<AAA>(.*)<\/AAA>/;
// Create a Pattern object
Pattern r = Pattern.compile(pattern);
// Now create matcher object.
Matcher m = r.matcher("lsdkfjsdklfj sdklfsdklfjsd <AAA>myString</AAA>sdfsdfsdfsdf");
if (m.find()) {
    System.out.println("Found value: " + m.group(0));
}
There must be a more elegant way.
Edit :
Thank you tim_yates, I was looking for something like that.
Could you explain a little why you use [0][1] on the result of
def extract = (input =~ '<AAA>(.+?)</AAA>')[0][1]
Answer by tim_yates :
=~ returns a Matcher, so [0] gets the first match, which consists of 2 groups: the first, [0], is the entire matched text (here <AAA>myString</AAA>), and the second, [1], is the group you defined in your expression.
Thank you so much for your help, and thanks to all the readers.
Power of a community !!!
Can't you just do:
def input = 'lsdkfjsdklfj sdklfsdklfjsd <AAA>myString</AAA>sdfsdfsdfsdf'
def extract = (input =~ '<AAA>(.+?)</AAA>')[0][1]
assert extract == 'myString'
This is the shortest (not the best) way I can think of without external libs:
String str = "lsdkfjsdklfj sdklfsdklfjsd <AAA>myString</AAA>sdfsdfsdfsdf";
System.out.println(str.substring(str.indexOf(">") + 1, str.lastIndexOf("<")));
Or using StringUtils (which is a million times better than my previous suggestion with substring):
StringUtils.substringBetween(str, "<AAA>", "</AAA>");
Still I'd go with matcher() like you proposed among all these.
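One detail worth spelling out is the difference between the (.*) in the question and the (.+?) in the Groovy answer. With a single tag pair they give the same result, but once the text contains several pairs the greedy version runs to the last closing tag. A small plain-Java sketch (BetweenTags and its method names are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BetweenTags {
    private static final Pattern LAZY = Pattern.compile("<AAA>(.+?)</AAA>");
    private static final Pattern GREEDY = Pattern.compile("<AAA>(.*)</AAA>");

    /** Collects the content of every <AAA>...</AAA> pair, stopping at the nearest closing tag. */
    public static List<String> extractAll(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = LAZY.matcher(input);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    /** Greedy variant: captures up to the LAST closing tag, or returns null if nothing matches. */
    public static String extractGreedy(String input) {
        Matcher m = GREEDY.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String two = "<AAA>a</AAA> x <AAA>b</AAA>";
        System.out.println(extractAll(two));     // [a, b]
        System.out.println(extractGreedy(two));  // a</AAA> x <AAA>b
    }
}
```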

Using a Variable in an AS3, Regexp

Using Actionscript 3.0 (Within Flash CS5)
A standard regex to match any digit is:
var myRegexPattern:RegExp = /\d/g;
What would the regex look like to incorporate a string variable to match?
(this example is an 'IDEAL' not a 'WORKING' snippet) ie:
var myString:String = "MatchThisText"
var myRegexPattern_WithString:RegExp = /\d[myString]/g;
I've seen some workarounds which involve creating multiple regex instances, then combine them by source, with the variable in question, which seems wrong. OR using the flash string to regex creator, but it's just plain sloppy with all the double and triple escape sequences required.
There must be some pain free way that I can't find in the live docs or on google. Does AS3 hold this functionality even? If not, it really should.
Or am I missing a much easier way of avoiding this task altogether, one that I'm naive to because I'm new to regex?
I've actually blogged about this, so I'll just point you there: http://tyleregeto.com/using-vars-in-regular-expressions-as3 It talks about the possible solutions, but there is no ideal one like you mention.
EDIT
Here is a copy of the important parts of that blog entry:
Here is a regex to strip the tags from a block of text.
/<("[^"]*"|'[^']*'|[^'">])*>/ig
This nifty expression works like a charm. But I wanted to update it so the developer could limit which tags it stripped to those specified in an array. Pretty straightforward stuff: to use a variable value in a regex you first need to build it as a string and then convert it. Something like the following:
var exp:String = 'start-exp' + someVar + 'more-exp';
var regex:RegExp = new RegExp(exp);
Pretty straightforward. So when approaching this small upgrade, that's what I did. Of course one big problem was pretty clear.
var exp:String = '/<' + tag + '("[^"]*"|'[^']*'|[^'">])*>/';
Guess what, invalid string! Better escape those quotes in the string. Whoops, that will break the regex! I was stumped. So I opened up the language reference to see what I could find. The "source" property (which I've never used before) caught my eye. It returns a String described as "the pattern portion of the regular expression." It did the trick perfectly. Here is the solution:
var start:RegExp = /</;
var end:RegExp = /("[^"]*"|'[^']*'|[^'">])*>/ig;
var complete:RegExp = new RegExp(start.source + tag + end.source);
You can reduce it down to this for convenience:
var complete:RegExp = new RegExp(/</.source + tag + /("[^"]*"|'[^']*'|[^'">])*>/.source, 'ig');
As Tyler correctly points out (and his answer works just fine), you can assemble your regex as a string and then pass this string to the RegExp constructor with the new RegExp("pattern", "flags") syntax.
function assembleRegex(myString) {
var re = new RegExp('\\d' + myString, "i");
return re;
}
Note that when using a string to store a regex pattern, you do need to add some extra backslashes to get it to work right (e.g. to get a \d in the regex, you need to specify \\d in the string). Note also that the string pattern does not use the forward slash delimiters. In other words, the following two statements are equivalent:
var re1 = /\d/ig;
var re2 = new RegExp("\\d", "ig");
Additional note: You may need to process the myString variable to escape any backslashes it might contain (if they are to be interpreted as literal). If this is the case the function becomes:
function assembleRegex(myString) {
    // The g flag is needed so that every backslash is escaped, not just the first one.
    myString = myString.replace(/\\/g, '\\\\');
    var re = new RegExp('\\d' + myString);
    return re;
}
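The escaping concern above is not specific to AS3. As a comparison only, here is how the same "digit followed by a variable string" pattern might look in Java, where Pattern.quote makes the whole variable literal so that dots, brackets and backslashes in it are not interpreted. DigitThenLiteral and assemble are hypothetical names chosen for this sketch.

```java
import java.util.regex.Pattern;

public class DigitThenLiteral {
    /** Builds a case-insensitive pattern: one digit followed by the given text, taken literally. */
    public static Pattern assemble(String literal) {
        // "\\d" in the Java string is \d in the regex; Pattern.quote wraps the variable in \Q...\E.
        return Pattern.compile("\\d" + Pattern.quote(literal), Pattern.CASE_INSENSITIVE);
    }

    public static void main(String[] args) {
        Pattern p = assemble("MatchThisText");
        System.out.println(p.matcher("id 7MatchThisText").find());
        // The dot is literal here, so "1axb" does not match.
        System.out.println(assemble("a.b").matcher("1axb").find());
    }
}
```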

GWT - 2.1 RegEx class to parse freetext

I'm struggling with the com.google.gwt.regexp.shared.RegExp class and simply want to parse the phone numbers from a string and get ALL occurrences of a number, but I only seem to be able to get the first occurrence. I know there is a subtle difference in the regex between Java (where it works) and GWT.
String freeText = "Theo Powell<5643321309>, Robert Roberts<9653768972>, Betty Wilson<6268281885>, Brandon Anderson<703203115>";
MatchResult matchResult = RegExp.compile("[\\+]?[0-9.\" \"-]{8,}").exec(freeText);
int groupCount = matchResult.getGroupCount(); // result = 1
String s = matchResult.getGroup(0); //result = 5643321309
Thanks in advance.
Ian..
You'll have to loop, applying the pattern again until it returns nothing. For that, you first have to use the "global" flag:
ArrayList<String> matches = new ArrayList<String>();
RegExp pattern = RegExp.compile("[\\+]?[0-9. -]{8,}", "g");
for (MatchResult result = pattern.exec(freeText); result != null; result = pattern.exec(freeText)) {
    matches.add(result.getGroup(0));
}
If you think it's a bit "magic" or "kludgy" (which it kind of is), I'd suggest reading docs about the JavaScript RegExp object, as the RegExp class in GWT is a direct mapping of this: https://developer.mozilla.org/en/JavaScript/Reference/Global_Objects/RegExp/exec (with sample code in JS very similar to the one above).
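For comparison with the plain Java side that the question says already works, the equivalent loop with java.util.regex looks like the sketch below. There is no "global" flag here because Matcher.find() keeps its own position between calls. PhoneNumbers and findAll are hypothetical names, and the character class follows the simplified pattern from the answer above rather than the one in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PhoneNumbers {
    // Optional leading '+', then at least 8 digits, dots, spaces or hyphens.
    private static final Pattern PHONE = Pattern.compile("[+]?[0-9. -]{8,}");

    /** Returns every match in order; each find() call resumes where the previous match ended. */
    public static List<String> findAll(String text) {
        List<String> matches = new ArrayList<>();
        Matcher m = PHONE.matcher(text);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String freeText = "Theo Powell<5643321309>, Robert Roberts<9653768972>, "
                + "Betty Wilson<6268281885>, Brandon Anderson<703203115>";
        System.out.println(findAll(freeText));
    }
}
```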
Change the regex from
[\+]?[0-9." "-]{8,}
to
([\+]?[0-9." "-]{8,})
See Capturing Groups for further details.