Looking for regex to erase href text - regex

If I have a bunch of urls like this:
<li>Xyz 123</li>
<li>Xyz 345</li>
What would a regex look like to erase the urls inside the hrefs so that they become:
<li>Xyz 123</li>
<li>Xyz 345</li>

The following should do what you like:
/href=\"([^\"]*)\"/
Basically match href="<any text but a '"'>".

Search for <a href="[^"]*" and replace with <a href="".
If you add more details about which language you're using, I can be more specific. Be aware also that regular expressions are usually not the tool of choice when dealing with HTML.
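As a rough illustration of the search-and-replace above, here is a minimal Python sketch. Python is only an assumption, since the question does not name a language; the function name clear_hrefs and the example.com URL are made up for the example.

```python
import re


def clear_hrefs(html: str) -> str:
    """Blank out the value of every <a href="..."> attribute."""
    # Same search/replace as described above: find <a href="..." and
    # put back <a href="". Assumes href is the first attribute of the
    # tag and that its value is double-quoted.
    return re.sub(r'<a href="[^"]*"', '<a href=""', html)


print(clear_hrefs('<li><a href="http://example.com/123">Xyz 123</a></li>'))
```

The text between the tags is left alone because the pattern stops at the closing quote of the attribute.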

First of all, do not use regex to parse HTML.
Process the HTML using an XML reader / XML document processing engine. Then use XPath to find nodes matching your criteria and alter href attributes in the DOM.
Note: For HTML which is not well-formed XML a more-general HTML (SGML) parser is required.
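The parser-based approach could look roughly like the following sketch, which uses Python's standard xml.etree.ElementTree module and its limited XPath support. It assumes the input is well-formed XML; the function name clear_hrefs_xml and the sample URLs are invented for illustration.

```python
import xml.etree.ElementTree as ET


def clear_hrefs_xml(markup: str) -> str:
    """Parse well-formed markup and empty the href of every <a> element."""
    root = ET.fromstring(markup)
    # ElementTree understands a subset of XPath; this selects every
    # descendant <a> element that carries an href attribute.
    for anchor in root.findall(".//a[@href]"):
        anchor.set("href", "")
    return ET.tostring(root, encoding="unicode")


sample = (
    '<ul><li><a href="http://example.com/123">Xyz 123</a></li>'
    '<li><a href="http://example.com/345">Xyz 345</a></li></ul>'
)
print(clear_hrefs_xml(sample))
```

Note that ET.fromstring raises a ParseError on markup that is not well-formed, which is exactly the case where a more forgiving HTML parser is needed.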

I partially agree with the others but a more complete version would be
/(<a[^>]+href\s*=\s*\")(.*?)("[^>]*>)/$1$3/gi
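For comparison, the same substitution written for Python's re module might look like this. It is only a sketch: the g flag has no equivalent because re.sub already replaces every match, and the name strip_href_values and the sample tag are assumptions.

```python
import re

# Group 1: everything up to and including the opening quote of href.
# Group 2: the URL itself (dropped in the replacement).
# Group 3: the closing quote and the rest of the tag.
HREF_RE = re.compile(r'(<a[^>]+href\s*=\s*")(.*?)("[^>]*>)', re.IGNORECASE)


def strip_href_values(html: str) -> str:
    """Remove the href value while keeping the other attributes intact."""
    return HREF_RE.sub(r"\1\3", html)


tag = '<li><A class="x" HREF = "http://example.com/123" title="t">Xyz 123</A></li>'
print(strip_href_values(tag))
```

Compared with the simpler patterns, this one tolerates upper-case tags, whitespace around the equals sign, and attributes before or after href.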

Related

How to extract html attributes via regex

I am looking to see how a regex can be used to get attribute/values from an html tag. Yes I know that an xml/html parser can be used, but this is for testing my ability in regex. For example, in this html element:
<input name=dir value=">">
<input value=">" name=dir >
How would I extract out:
(?<name>...) and (?<value>...)
Is it possible once you have matched something to go "back" to the start of the match? For example:
<(?P<element>\w+).+(?:value="(?P<value>[^"])")####.+(?:name="(?P<name>[^"])")
Where #### basically means "go back to the start of the previous match/capture group" (so that I don't have to modify every possible ordering of the tags). How could this be done?
Yes, using a parser is the best way.
As stated in the comments, you cannot (easily) extract all information in one sweep.
You can achieve what you want with several regexes:
input.*?name=(?'name'[^ ]+)
input.*?value="(?'value'[^"]+)"
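A small Python sketch of the several-regexes idea: each attribute gets its own pattern, so the order in which the attributes appear no longer matters. Python spells named groups (?P<name>...) rather than (?'name'...), and the helper name extract_attributes is made up for this example.

```python
import re

# One pattern per attribute, mirroring the two expressions above.
NAME_RE = re.compile(r'input.*?name=(?P<name>[^ ]+)')
VALUE_RE = re.compile(r'input.*?value="(?P<value>[^"]+)"')


def extract_attributes(tag: str) -> dict:
    """Run each pattern separately and merge the named groups."""
    found = {}
    for pattern in (NAME_RE, VALUE_RE):
        match = pattern.search(tag)
        if match:
            found.update(match.groupdict())
    return found


print(extract_attributes('<input name=dir value=">">'))
print(extract_attributes('<input value=">" name=dir >'))
```

Both calls yield the same name and value, because every search starts again from the beginning of the string.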

regex to find div tags

I need a regex that will find either an opening div tag, or a closing div tag, or both in an html web page. Thanks :)
Just to be safe:
</? *div[^>]*>
You could start with:
</?div>
This won't correctly handle:
whitespace
attributes on the div
self-closing div tags
upper case tags
tags inside HTML comments that should be ignored
etc...
To handle HTML correctly you're better off using an HTML parser rather than regular expressions.
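To make the parser suggestion concrete, here is a minimal sketch using Python's built-in html.parser module. The class name DivTagCollector and the sample markup are assumptions for illustration; a third-party parser would work along similar lines.

```python
from html.parser import HTMLParser


class DivTagCollector(HTMLParser):
    """Record every opening and closing div tag the parser reports."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased, so <DIV> and <div> look the same.
        if tag == "div":
            self.events.append("open")

    def handle_endtag(self, tag):
        if tag == "div":
            self.events.append("close")


collector = DivTagCollector()
# The <div> inside the comment is reported as comment text, not as a tag.
collector.feed('<DIV id="a"><!-- <div> --><p>text</p></div>')
collector.close()
print(collector.events)
```

The parser takes care of attributes, letter case and comments, which are the cases the simple pattern misses.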
If you can use XPath, it would be //div. Look into using an XML parser that supports it instead of regex. If you MUST use regex, go with coding_hero's answer.
Just for show, in PHP:
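// $htmldoc is the path or URL of some XHTML document
$xhtml = simplexml_load_file($htmldoc);
$divs = $xhtml->xpath('//div'); // array of SimpleXMLElement objects, one per div in the document
// xpath() returns an array, so serialize each div (with its children) and join the results
return implode('', array_map(function ($div) { return $div->asXML(); }, $divs));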
HTML, XHTML, and XML cannot be parsed using regular expressions. There are parsers designed for this type of thing. If you specify the language(s) you are using, I'm sure someone can suggest the right tool(s) for the job, but I know for a fact that regular expressions will not be on that list.
To find opening and closing div tag I would use
</?\bdiv\b[^>]*>
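If a regex really is the chosen route, the pattern above could be applied in Python roughly as follows. This is a sketch, not a robust solution: the IGNORECASE flag and the helper name find_div_tags are additions, and div tags inside HTML comments would still be matched.

```python
import re

# \b keeps the pattern from matching longer tag names such as <divider>.
DIV_TAG = re.compile(r'</?\bdiv\b[^>]*>', re.IGNORECASE)


def find_div_tags(html: str) -> list:
    """Return every opening or closing div tag found in the text."""
    return DIV_TAG.findall(html)


print(find_div_tags('<div class="a"><p>x</p></DIV><divider></divider>'))
```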

Regular expression to match word instances not in html attrs or link text

I want to match a keyword that is not linked. In the following example, I want to match only the google keyword that is neither between <a></a> nor inside an attribute, that is, only the last google:
google is linked, google is not linked.
Do not parse HTML with regular expressions. HTML is not a regular language. Use an HTML parser.
This works for me (javascript):
var matches = str.match(/(?:<a[^>]*>[^<]*<\/a>[\s\S]*)*(google)/);
Provided you can be sure that your HTML is well behaved (and valid), especially does not contain comments or nested a tags, you can try
google(?!((?!<a[\s>]).)*</a>)
That matches any "google" that is not followed by a closing a tag before the next opening a tag. But you might be better off using an HTML parser instead.
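The lookahead pattern can be tried out with a short Python sketch like the one below. The sample string is a guess at what the question's markup looked like (the link target is invented), and the helper name unlinked_positions is hypothetical.

```python
import re

# "google" not followed by a closing </a> before the next opening <a ...>.
UNLINKED = re.compile(r'google(?!((?!<a[\s>]).)*</a>)')


def unlinked_positions(html: str) -> list:
    """Return the start offset of every keyword outside a link."""
    return [match.start() for match in UNLINKED.finditer(html)]


sample = '<a href="http://www.google.com">google</a> is linked, google is not linked.'
print(unlinked_positions(sample))
```

Only the last occurrence is reported: the one in the href value and the one in the link text are both followed by </a> before any new opening a tag.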

regex match question

I want to match any of these cases with a regex. I have the header text, but I need to match it with the (possible) corresponding HTML:
<h1>header title</h1>
<h2>site | header title</h2>
<h3 class="header">header title</h3>
<h2>header title 23 jan 2009</h2>
<h1>header title</h1>
I have this:
/(<(h1|h2|h3))(.+?)".$title."(.+?)(<\/\\2>)/i
But it does not always seem to work, and I don't see why.
Thanks
Don't use regexes to parse HTML! Use an HTML parser, instead.
Is $title regex-escaped (so characters like {, [ etc. are escaped)?
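Line endings may be a problem too: by default . does not match a newline, so a header that spans several lines will not match unless your regex implementation supports a dot-matches-newline mode (often called single-line or dotall mode).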
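Both points can be sketched in Python, used here only for illustration since the question appears to be PHP. The title is escaped with re.escape, and negated character classes replace the dot so that newlines inside the header are accepted. One likely cause of the intermittent failures is also visible in the original pattern: the (.+?) after the title demands at least one character, so a plain <h1>header title</h1> cannot match on its own. The function name find_header is made up.

```python
import re


def find_header(html: str, title: str):
    """Return the first h1-h3 element whose text contains the title, or None."""
    pattern = re.compile(
        r"<(h[1-3])\b[^>]*>"    # opening tag, attributes allowed
        r"[^<]*"                # optional text before the title (may be empty)
        + re.escape(title)      # the title, with regex metacharacters escaped
        + r"[^<]*"              # optional text after the title (may be empty)
        r"</\1>",               # closing tag matching the opening one
        re.IGNORECASE,
    )
    match = pattern.search(html)
    return match.group(0) if match else None


print(find_header("<h2>site | header title</h2>", "header title"))
print(find_header('<h3 class="header">Q&A (part 1)?</h3>', "Q&A (part 1)?"))
```

Because [^<] also matches newlines, no extra flag is needed for multi-line headers, but a header containing nested tags would not match this sketch.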
It is better to process structured data with appropriate tools - XML with XML parser, HTML with HTML parser. There are parsers like BeautifulSoup in Python, hpricot in Ruby, libxml2...
What you (logically) want for your example is something like:
<(group of anything not including ">")> (value to extract) <(group of anything not including ">")>
e.g.
<[^>]+>([^<]+)<[^>]+>
The specific regex syntax is a bit dependent on what environment you're working on.
You can get away with this if you're sure what you're parsing is no more complicated than your example. However, you really shouldn't be parsing html (or xml) with a regex (as someone has already noted here) because xml can be arbitrarily nested, and regex can't possibly deal with that.

How to Find Quotes within a Tag?

I have a string like this:
This <span class="highlight">is</span> a very "nice" day!
What should my RegEx-pattern in VB look like, to find the quotes within the tag? I want to replace it with something...
This <span class=^highlight^>is</span> a very "nice" day!
Something like <(")[^>]+> doesn't work :(
Thanks
It depends on your regex flavor, but this works for most of them:
"(?=[^<]*>)
EDIT: For anyone curious how this works. This translates into English as "Find a quote that is followed by a > before the next <".
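The question asks about VB, but the lookahead is easy to try in any flavor. Here is a minimal Python sketch, with the helper name replace_quotes_in_tags and the default ^ replacement taken as assumptions from the question's example.

```python
import re


def replace_quotes_in_tags(text: str, replacement: str = "^") -> str:
    """Replace each double quote that is followed by > before the next <."""
    # The lookahead checks the context without consuming it, so every
    # quote inside a tag is matched individually in a single pass.
    return re.sub(r'"(?=[^<]*>)', replacement, text)


print(replace_quotes_in_tags('This <span class="highlight">is</span> a very "nice" day!'))
```

The quotes around "nice" are left alone because no > follows them before the end of the string.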
Regexes are fundamentally bad at parsing HTML (see Can you provide some examples of why it is hard to parse XML and HTML with a regex? for why). What you need is an HTML parser. See Can you provide an example of parsing HTML with your favorite parser? for examples using a variety of parsers.
If you are using VB.net you should be able to use HTMLAgilityPack.
Try this: <span class="([^"]+?)?">
This should get you the first attribute value in a tag:
<[^">]+"(?<value>[^"]*)"[^>]*>
If your intention is to replace ALL quotation marks within tags, you could use the following regular expression:
(<[^>"]*)(")([^>]*>)
That will isolate the substrings before and after your quotation mark. Note that this does not attempt to match opening and closing quotation marks. It simply matches a quotation mark within a tag.
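One detail worth seeing in code: this expression consumes the whole tag and captures only the first quotation mark in it, so a single substitution pass changes one quote per tag. The Python sketch below repeats the substitution until nothing changes; the function name replace_all_tag_quotes and the ^ replacement are assumptions for illustration.

```python
import re

QUOTE_IN_TAG = re.compile(r'(<[^>"]*)(")([^>]*>)')


def replace_all_tag_quotes(text: str, replacement: str = "^") -> str:
    """Replace every double quote inside a tag, one quote per tag per pass."""
    # The replacement must not contain a double quote itself,
    # otherwise the loop would never terminate.
    count = 1
    while count:
        text, count = QUOTE_IN_TAG.subn(
            lambda m: m.group(1) + replacement + m.group(3), text
        )
    return text


print(replace_all_tag_quotes('This <span class="highlight">is</span> a very "nice" day!'))
```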