Reg Expression to scrape background:url but 'url(data:image' - regex

I am working on gradle script to go through large css file and scrap out the URLs for images. So far:
def temp = ".post-format background:url(image/goes/here.jpg); {background: .post-format {background: url(../img/post //formats.png);display:;display:.woocommerce-info:before {background: url(data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAIAAAAFCAYAAABvsz2cAAAAG0lEQVQIHWP8DwQMQMACxIwwBliECcQDATgDAMHrBQqJ6tMZAAAAAElFTkSuQmCC)center no-repeat #18919c }"
def list = temp.findAll(/background:[\s]?url\([^\)]*\)/){ match ->
match
}
This works but it also takes the 'data:image' file url that we don't need. So, here the temp variable contains both - the good 'image/goes/here.jpg' url and also the one we don't need 'data:image/png[..]'. How would we have to update the regular expression to make it work? If you could also share your rational behind of the correct regular expression to help us better learn regular expressions i would much appreciate. Thank You a lot

You can use the negative look ahead mechanism to accomplish what you want. Immediately following the escaped left parenthesis you insert (?!data:image) which means that you must not match that text at that point. So your regex becomes:
/background:[\s]?url\((?!data:image)[^\)]*\)/
You can see the approach illustrated in this rubular. See also How can I find everything BUT certain phrases with a regular expression?

You didn't specify what language you're using, but if the URL you want is always the first one, just don't do a global match (which is what findAll does, whatever language that is). Most likely, changing temp.findAll to temp.match and assigning the results to a scalar string variable will do it. But please tell us which language.

Related

Regular Expression to match parent and sub node

I want to development a regular expresion to match the tag :
<claim-text>aaaaaaa
<claim-text>bbbbbbb</claim-text>
<claim-text>ccccccc</claim-text>
</claim-text>
I tried
<claim-text>(.*)</claim-text>
But, only bbbbbbb and ccccccc can be matched. Can I get some help to cover aaaaaaa also?
Thanks
For a generic solution with any depth, you will at least need a stack, which not available for most regular expression implementation. However, if you know the structure will only have the depth you specified, you could use something like this:
<claim-text>([^<\r\n]*)
You can see a working example here: https://regex101.com/r/kbDbwF/1
It will search for your opening tag, and then find anything up to the next opening or closing tag [^<], or to the next line break [^\r\n]. I have combined both character classes to one definition [^<\r\n]. However, this is not a general solution!
Do not under any circumstances try to parse HTML with a regex unless you wish to invoke rite 666 Ph'nglui mglw'nafh Cthulhu R'lyeh wgah'nagl fhtagn.
Use an HTML parsing library see this page for some ways to do it.

Using a regular expression to insert text in a match

Regular Expressions are incredible. I'm in my regex infancy so help solving the following would be greatly appreciated.
I have to search through a string to match for a P character that's not surrounded by operators, power or negative signs. I then have to insert a multiplication sign. Case examples are:
33+16*55P would become 33+16*55*P
2P would become 2*P
P( 33*sin(45) ) would become P*(33*sin(45))
I have written some regex that I think handles this although I don't know how using regex I can insert a character:
The reg is I've written is:
[^\^\+\-\/\*]?P+[^\^\+\-\/\*]
The language where the RegEx will be used is ActionScript 3.
A live example of the regex can be seen at:
http://www.regexr.com/39pkv
I would be massively grateful if someone could show me how I insert a multiplication sign in middle of the match ie P2, becomes P*2, 22.5P becomes 22.5P
ActionScript 3 has search, match and replace functions that all utilise regular expressions. I'm unsure how I'd use string.replace( expression, replaceText ) in this context.
Many thanks in advance
Welcome to the wonder (and inevitable frustration that will lead to tearing your hair out) that is regular expressions. You should probably read over the documentation on using regular expressions in ActionScript, as well as this similar question.
You'll need to combine RegExp.test() with the String.replace() function. I don't know ActionScript, so I don't know if it will work as is, but based on the documentation linked above, the below should be a good start for testing and getting an idea of what the form of your solution might look like. I think #Vall3y is right. To get the replace right, you'd want to first check for anything leading up to a P, then for anything after a P. So two functions is probably easier to get right without getting too fancy with the Regex:
private function multiplyBeforeP(str:String):String {
var pattern:RegExp = new RegExp("([^\^\+\-\/\*]?)P", "i");
return str.replace(pattern, "$1*P");
}
private function multiplyAfterP(str:String):String {
var pattern:RegExp = new RegExp("P([^\^\+\-\/\*])", "i");
return str.replace(pattern, "P*$1");
}
Regex is used to find patterns in strings. It cannot be used to manipulate them. You will need to use action script for that.
Many programming languages have a string.replace method that accepts a regex pattern. Since you have two cases (inserting after and before the P), a simple solution would be to split your regex into two ([^\^\+\-\/\*]?P+ and P+[^\^\+\-\/\*] for example, this might need adjustment), and switch each pattern with the matching string ("*P" and "P*")

Regex challenge: Match phrase only if outside of an <a href> tag

I am working on improving our glossary functionality in a custom CMS that is running with classic ASP (ASP 3.0) on IIS with VBScript code. I am stumped on a regex challenge I cannot solve.
Here is the current code:
If InStr(ART_ArticleBody, "href") = False then
sql="SELECT URL, Term, RegX FROM GLOSSARYDB;"
Set rsGlossary = Server.CreateObject("ADODB.Recordset")
rsGlossary.open sql, strSQLConn
Set RegExObject = New RegExp
While Not rsGlossary.EOF
URL = rsGlossary("URL")
Phrase = rsGlossary("RegX")
With RegExObject
.Pattern = Phrase
.IgnoreCase = true
.Global = false
End With
set expressionmatch = RegExObject.Execute(ART_ArticleBody)
if expressionmatch.count > 0 then
For Each expressionmatched in expressionmatch
RegExObject.Pattern = Phrase
URL = ""& expressionmatched.Value & ""
ART_ArticleBody = RegExObject.Replace(ART_ArticleBody, URL)
next
end if
rsGlossary.movenext
wend
rsGlossary.movefirst
Set RegExObject = nothing
end if
Instead of skipping putting glossary links in any article that has an href in it, as the above code does, I would like to change the code to process every article but have the RegEx pattern avoid matching on a glossary entry if the match is inside of an a tag.
For example, in italics below is a test example for this regex entry in my DB: ROI|return on investment|investment return
Here is a link that uses the glossary term: Info on return on investment.
Now, here is the glossary term in plain text, not inside of a link: return on investment.
We want to find the third instance of a match but not find the first two because they are both inside of a HTML link.
In the above text, if I were processing the article for the glossary entry "ROI|return on investment|investment return" I do not want to match on the first or second occurance that match because they are in an a tag. I need the regex pattern to skip over those matches and just match on any that are not inside of an a tag.
Any help on this would be greatly appreciated.
Try this regex:
<a\b[^<>]*>[\s\S]*?</a>|(ROI|return on investment|investment return)
This matches an HTML anchor, or any of the terms you're looking for. The terms are captured into group number 1. So in your VBScript code, check if the first capturing group matched anything, and you've got one of your keywords outside an <a> tag.
This regex indeed won't work correctly if you have nested <a> tags. That shouldn't be a problem, as anchors are normally not nested inside each other. If it is a problem, you can't solve it with VBScript/JavaScript regular expressions. The regex also won't work correctly if you have <a> tags that are missing their closing tags. If you want to take that into account, try this regex:
<a\b[^<>]*>(?:(?:(?!<a\b)[\s\S])*?</a>)?|(ROI|return on investment|investment return)
This problem is, as they say, "non-trivial" in its current state. However, if you could modify your system to output more semantic markup, it would make things much easier:
undesired tag match
This is <span class="tag">a tag</span>
In this case, you can simply search:
(?<=<span class=\"tag\">)(phrase1|phrase2|phrase3)(?=</span>)
Or something a little more robust
(?<=<span class=\"tag\">).+?(?=</span>)
This way you can easily focus your searches to data within a specific <span>, and leave everything else aside.
You can't solve it because it can't be done, at least not with 100% reliability. HTML is not a "regular" language in the regular expression sense. Like the saying goes, when you have a hammer, everything starts to look like a nail. There are some things regular expressions aren't good at. This is one of them.
Most languages have some form of HTML parsing library as standard or easily obtained. Use those. That's what they were designed for.
In general, you can't use a regular expression to recognize arbitrarily nested constructs (such as bracket-delimited HTML tags). If you had solved this problem, there's be a lot of mathematicians lining up to hear about it. :)
Having said that, .NET does indeed offer an extension to regular expressions that permits what I just said was impossible, and--even better!--the sample chapter for the great "Mastering Regular Expressions" available here happens to cover that feature.
(accounts receivable|A/R)(?!((?!</?a\b).)*</a)
(phrase1|phrase2|phrase3)(?!((?!</?a\b).)*</a)
The above approach seems to work, at least in my RegexBuddy software. I didn't figure it out on my own. Had some help from a guru. Time to test it in my ASP code. Thanks to all who provided input. I'm sure I didn't describe what I needed well enough for you to come up with the above solution. Mea culpa.

Regex not returning 2 groups

I'm having a bit of trouble with my regex and was wondering if anyone could please shed some light on what to do.
Basically, I have this Regex:
\[(link='\d+') (type='\w+')](.*|)\[/link]
For example, when I pass it the string:
[link='8' type='gig']Blur[/link] are playing [link='19' type='venue']Hyde Park[/link]"
It only returns a single match from the opening [link] tag to the last [/link] tag.
I'm just wondering if anyone could please help me with what to put in my (.*|) section to only select one [link][/link] section at a time.
Thanks!
You need to make the wildcard selection ungreedy with the "?" operator. I make it:
/\[(link='\d+')\s+(type='\w+')\](.*?)\[\/link\]/
of course this all falls down for any kind of nesting, in which case the language is no longer regular and regexs aren't suitable - find a parser
Regular Expressions Info a is a fantastic site. This page gives an example of dealing with html tags. There's also an Eclipse plugin that lets you develop expressions and see the matching in realtime.
You need to make the .* in the middle of your regex non-greedy. Look up the syntax and/or flag for non-greedy mode in your flavor of regular expressions.

Returning a portion of a regular expression match

This question shows my ignorance of regular expressions. I've never understood it quite enough.
If I wanted to match, for instance, just the URL portion of an tag in HTML, what would I need to do?
My regular expression to get the entire tag is:
<A[^>]*?HREF\s*=\s*[""']?([^'"" >]+?)[ '""]?>
I have no idea what I would need to do to get the URL out of that and I have no clue where to look in regular expression documentation to figure this out.
If programming in Perl you could utilize the $1 operator within an if() statement. For ex.
if( $HREF =~ /<A[^>]*?HREF\s*=\s*[""']?([^'"" >]+?)[ '""]?>/ ) {
print $1;
}
the exactly HOW part depends on the regex library you're using, but the way is to use a grouped expression. You actually already have one in your example, as grouped expressions are parenthesized. The href attribute value is your first group (your zeroth group is the whole expression.)
You can use round brackets to group parts of the regular expression match. In this case you could use a round bracket around the URL part and then later use a number to refer to that group. See here to see how exactly you can do this.
I switched things up a bit - try something like this:
<a[^>]*href="([^"]*).*>