I've got a problem I need solved using regular expressions: it involves taking a CSS selector and compiling a regex that matches the string representation of the nodes inside an HTML document. The point is to avoid parsing the HTML as XML and then making either XPath or DOM queries to apply style attributes.
Does anyone know of a project that already implements something like this in any language? The target platform would be .NET 3.5.
Html Agility Pack
Regular expressions seem like an amazingly bad way of matching those nodes. I'm not sure I follow your problem - why not just use something like jQuery to pick out those nodes? E.g., given the CSS selector 'div>span.red:first-child',
$('div>span.red:first-child')
would return an array-like jQuery collection of the matching nodes.
EDIT: Oh, wait - are you trying to do this 'offline', as it were - not in a user's browser? Yeah, ignore my advice. (Even so, I'd still suggest that regular expressions aren't going to help you. Why are you against generating an xml-document representation of the page?)
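For the offline case, here is a minimal sketch of the parser-based route, written in Python with only the standard library rather than in .NET. It handles just one hard-coded selector shape, 'div > span.red', and the class and function names are made up for illustration; a real selector engine (or Html Agility Pack on .NET) covers far more.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not go on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class DivChildSpanRed(HTMLParser):
    """Collects attributes of <span class="red"> whose direct parent is <div>."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # names of currently open elements, outermost first
        self.matches = []    # attribute dicts of matching spans

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        # The child combinator '>' means the immediate parent must be a div.
        if tag == "span" and "red" in classes and self.open_tags[-1:] == ["div"]:
            self.matches.append(attrs)
        if tag not in VOID:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Tolerate unclosed children by popping back to the matching open tag.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass


def match_div_child_span_red(html):
    parser = DivChildSpanRed()
    parser.feed(html)
    parser.close()
    return parser.matches


sample = ('<div><span class="red big" id="a">x</span>'
          '<p><span class="red" id="b">y</span></p></div>'
          '<span class="red" id="c">z</span>')
print([m["id"] for m in match_div_child_span_red(sample)])
```

Only the span with id "a" qualifies: "b" sits inside a p, and "c" has no div parent. Tracking the stack of open elements is exactly the context a flat regex cannot see.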
Related
We usually write our search path in the findnodes() function as below:
//parentNode[subNode/text() = 'CPUUSAGE']/subNode
What if I want to match part of the text here and find all the nodes?
Something like:
//parentNode[subNode/text() =~ '/CPUUSAGE'/]/subNode
Obviously this is invalid XPath...
Any thoughts on how to achieve this?
I know I can first find the nodes and then try to match the textContent. But can we do that in one shot, directly in findnodes()?
XPath 1.0 (which libxml implements) doesn’t include any built-in support for regular expressions. In the example you give, which uses a fairly simple regular expression, you could use the contains function to achieve a similar result:
//parentNode[subNode[contains(text(), 'CPUUSAGE')]]/subNode
(As an aside that’s an odd expression – you’d probably really want something like //parentNode/subNode[contains(text(), 'CPUUSAGE')] but I realise it’s just an example.)
There are some other string functions that could be useful in creating other simple queries.
You could create your own custom XPath function to filter nodes based on a regular expression; in fact, the docs for the Perl XML::LibXML module include an example of doing just that.
XPath 2.0 does support regular expressions through a group of string functions, but unless you have an XPath 2.0 processor, that will not be too useful.
XML::Twig has support for regular expressions in its xpaths.
The following is an xpath that I used in an answer to this SO question: Updating xml attribute value based on other with Perl
project[string(path) =~ /\bopensource\b/]/revision
I also created a second answer so that I could experiment with how XML::LibXML could be used to solve the same problem, and in that case I just iterated over all projects and did the regex filtering manually.
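The "find the nodes first, then filter by regex" fallback looks roughly like this. It is a sketch in Python's standard xml.etree.ElementTree rather than in Perl's XML::LibXML, and the helper name and sample document are invented for illustration.

```python
import re
import xml.etree.ElementTree as ET


def find_by_text_regex(root, path, pattern):
    """Select nodes with a plain path, then keep those whose text matches."""
    rx = re.compile(pattern)
    # node.text is None for empty elements, so fall back to "".
    return [node for node in root.findall(path) if rx.search(node.text or "")]


doc = ET.fromstring(
    "<root>"
    "<parentNode><subNode>CPUUSAGE_AVG</subNode><subNode>MEMORY</subNode></parentNode>"
    "<parentNode><subNode>DISK</subNode></parentNode>"
    "</root>"
)

hits = find_by_text_regex(doc, ".//parentNode/subNode", r"CPUUSAGE")
print([node.text for node in hits])
```

The path does the structural selection and the regex does the text matching, so the XPath engine never needs regex support.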
I'm lousy at regular expressions but occasionally they're the only thing that's the right solution for a problem.
Is there something in the .NET framework that allows you to input an unencoded string and get a pattern from it? Which you could then modify as required?
e.g. I want to remove a CDATA section that contains a file from some XML but I can't work out what the right pattern is for <![CDATA[hugepileofrandombinarydataherethatalsoneedstogo]]> and I don't want to ask for help each time I'm stuck on a regex pattern.
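Such tools exist; search for "regex generator".
But, as suggested in the comments, it's better to learn regex. Simple patterns are easy. Something like <!\[.*?]]> in your case (with RegexOptions.Singleline if the data spans multiple lines, since . does not match a newline by default).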
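To see the idea in action, here is a small sketch in Python rather than .NET. It uses a slightly stricter variant of the pattern that spells out CDATA, and re.DOTALL plays the role of .NET's RegexOptions.Singleline; the function name is made up.

```python
import re

# Lazy .*? stops at the first "]]>"; DOTALL lets "." cross line breaks.
CDATA_RE = re.compile(r"<!\[CDATA\[.*?\]\]>", re.DOTALL)


def strip_cdata(xml_text):
    """Remove every CDATA section from the given XML text."""
    return CDATA_RE.sub("", xml_text)


print(strip_cdata("<file><![CDATA[abc\n]]def]]></file>"))
```

The lazy quantifier matters: a greedy .* would swallow everything between the first CDATA opener and the last closer in the document.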
There are regex design tools like Expresso:
http://www.ultrapico.com/expresso.htm
It's not perfect, but as there is no suitable .NET component, the text-to-regex page at txt2re.com is the best I've seen for people who occasionally need to build a regex to match a string but don't have the time to relearn regex each time they want to use one.
I'm pretty hopeless with regular expressions and I'm struggling with what is probably an incredibly simple one!
I have a string which contains many instances of something like this:
<li>STEAK</li>
I know the value of 'STEAK' and I'm looking for the value of the href attribute.
This value can be anything.
I'm using C# on .NET 4.0.
Thanks for any help
Use Html Agility Pack instead (don't regex HTML).
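The parser-based approach can be sketched like this, in Python's standard html.parser rather than Html Agility Pack. The class name, the sample markup, and the /menu/... URLs are all made up for illustration.

```python
from html.parser import HTMLParser


class HrefByLinkText(HTMLParser):
    """Collects the href of every <a> whose text equals wanted_text."""

    def __init__(self, wanted_text):
        super().__init__()
        self.wanted_text = wanted_text
        self.inside_a = False
        self.current_href = None  # href of the <a> currently open, if any
        self.text_parts = []      # text seen inside the current <a>
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.inside_a = True
            self.current_href = dict(attrs).get("href")
            self.text_parts = []

    def handle_data(self, data):
        if self.inside_a:
            self.text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.inside_a:
            text = "".join(self.text_parts).strip()
            if text == self.wanted_text and self.current_href is not None:
                self.hrefs.append(self.current_href)
            self.inside_a = False


def hrefs_for_text(html, wanted_text):
    parser = HrefByLinkText(wanted_text)
    parser.feed(html)
    parser.close()
    return parser.hrefs


page = ('<ul><li><a href="/menu/steak">STEAK</a></li>'
        '<li><a href="/menu/fish">FISH</a></li></ul>')
print(hrefs_for_text(page, "STEAK"))
```

Because the parser hands over attributes already split and unquoted, attribute order, quoting style, and extra whitespace stop being a concern.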
I'm trying to match anything between and including style=""
e.g. style="whatever:0; morestuff:1; otherstuff:3"
The pattern will be /style="([^"]*)"/, but may vary a bit depending on what language you're using.
Also, if you're trying to do this through JavaScript, jQuery would make this as easy as
$("#element-id").attr("style");
If you're trying to do this from another language, use an HTML parsing lib as HTML isn't regular. BeautifulSoup for Python is quite nice.
String under test
style="whatever:0; morestuff:1; otherstuff:3"
Regex
style\s*=\s*"([^"]*)"
Contents of group 1
whatever:0; morestuff:1; otherstuff:3
Notice!
It is very hard to write a regex-based HTML parser that is correct, secure, and maintainable. If you need to write a program that deals with HTML in a robust, reliable, and secure way, you should use a real HTML parsing library like jsoup (Java) or Html Agility Pack (C#). To find an HTML parser for your favorite language, Google: yourlanguage html parser.
If you need to remove all style attributes from HTML (i.e. clean out inline styles entirely), use this as the regexp:
style=\"[^\"]*\"
This works for me in Sublime Text 2 and 3.
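Outside an editor, the same find-and-replace might look like the following Python sketch. Note that it uses a slightly adjusted pattern (an assumption on my part, not the one above): it requires whitespace before style so that attributes such as data-style are left alone, and it consumes that whitespace so no stray space remains.

```python
import re

# \s+ before "style" avoids matching inside names like data-style.
STYLE_ATTR_RE = re.compile(r'\s+style\s*=\s*"[^"]*"')


def strip_inline_styles(html):
    """Remove double-quoted style="..." attributes from the markup."""
    return STYLE_ATTR_RE.sub("", html)


print(strip_inline_styles('<p style="color:red" class="x">hi</p>'))
```

This only handles double-quoted values; single-quoted or unquoted style attributes would need a parser or a broader pattern.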
/(style="([^"]*)")/
for the whole string (untested). Do you want the key-value pairs retrieved as well?
I have many links in my page.
For example Australia
Now I want only the href with its value, i.e. href="/promotions/download/schools/australia.aspx", using a VBScript regular expression.
My regex would be something like:
href="([^"]*)"
Might need escaping in your context but that (or something very much like it) should work.
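As a quick check of the pattern, here it is in Python rather than VBScript. The sample markup is an assumption built around the href quoted in the question, and the function name is made up.

```python
import re

# Group 1 captures everything between the double quotes after href=.
HREF_RE = re.compile(r'href="([^"]*)"')


def find_hrefs(html):
    """Return the value of every double-quoted href in the text."""
    return HREF_RE.findall(html)


links = ('<a href="/promotions/download/schools/australia.aspx">Australia</a> '
         '<a href="/x">X</a>')
print(find_hrefs(links))
```

With a single capture group, findall returns just the captured values, so the href=" prefix and closing quote are not part of the results.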
Regexes are fundamentally bad at parsing HTML (see "Can you provide some examples of why it is hard to parse XML and HTML with a regex?" for why). Luckily, you should have access to the best parser available: the web browser. Modern browsers create a Document Object Model, which is a tree structure that contains all of the information about the page. One of the collections the DOM exposes is links. I don't really know VBScript, but this code looks like it should work:
' links is zero-indexed, so the last valid index is length - 1
For i = 0 To document.links.length - 1
    document.write(document.links(i).href & "<BR>")
Next