How to extract html attributes via regex


I am looking to see how a regex can be used to get attribute/values from an html tag. Yes I know that an xml/html parser can be used, but this is for testing my ability in regex. For example, in this html element:
<input name=dir value=">">
<input value=">" name=dir >
How would I extract out:
(?<name>...) and (?<value>...)
Is it possible once you have matched something to go "back" to the start of the match? For example:
<(?P<element>\w+).+(?:value="(?P<value>[^"])")####.+(?:name="(?P<name>[^"])")
Where #### basically means "go back to the start of the previous match/capture group" (so that I don't have to spell out every possible ordering of the attributes). How could this be done?

Yes, using a parser is the best way.
As stated in the comments, you cannot (easily) extract all information in one sweep.
You can achieve what you want with several regexes:
input.*?name=(?'name'[^ ]+)
input.*?value="(?'value'[^"]+)"
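A minimal sketch of the two-regex approach in Python, which writes named groups as (?P<name>...) rather than (?'name'...). The helper name extract_attrs is illustrative and not part of the original answer. COMBINED_RE is an additional assumption, not something the answer proposed: it wraps each attribute in a lookahead, and because lookaheads do not consume input, both attributes are searched for from the same position. That is one way to get the "go back to the start" effect the question asks about.

```python
import re

# Same patterns as in the answer, using Python's (?P<name>...) syntax.
NAME_RE = re.compile(r'input.*?name=(?P<name>[^ ]+)')
VALUE_RE = re.compile(r'input.*?value="(?P<value>[^"]+)"')

# Alternative (an assumption, not from the answer): each lookahead starts
# scanning right after the element name, so attribute order does not matter.
COMBINED_RE = re.compile(
    r'<(?P<element>\w+)'
    r'(?=.*?name=(?P<name>[^ ]+))'
    r'(?=.*?value="(?P<value>[^"]+)")'
)


def extract_attrs(tag):
    """Return (name, value) using the two separate patterns."""
    name = NAME_RE.search(tag)
    value = VALUE_RE.search(tag)
    return (name.group('name') if name else None,
            value.group('value') if value else None)


for tag in ('<input name=dir value=">">', '<input value=">" name=dir >'):
    print(extract_attrs(tag), COMBINED_RE.search(tag).groupdict())
```

Both variants are fragile in the usual ways: for example, an unquoted name directly followed by > (as in <input name=dir>) would be captured as dir>.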

Related

Ignore tags and javascript with regex

I'm trying to perform a regex replacement on the HTML below. I'm using an existing (I didn't write it and don't really understand it) regex pattern that ignores anything inside of an HTML tag, but I need it to also ignore anything between script tags. The pattern is (?<!<[^>]*)(diversity|and|inclusion). The problem is that the and in 'playerBrandingId' in the javascript is getting matched and ultimately replaced. In case it matters, I'm using C#. You can see what I get here.
<p>When it comes to building more diverse and inclusive workforces, the sports industry is already a leader, but it can do much more. One of the ways SBD/SBJ is focusing on diversity and inclusion is by talking to business leaders about what the industry can do better. In our first video in the “SBJ Diversity and Inclusion” series, we hear from execs working in leagues, technology, recruitment and academia.</p>
<div class="article-offset-block article-video article-offset-block--half">
<div class="u-vr2">
<div id='video-F17F523A70EB43ECAF54DF46144835B4'></div>
</div>
</div>
<script>
var playerParam = {
'pcode': 'poeXI63BtIsR_ugBoy3Z6X8KfiMo',
'playerBrandingId': 'video-F17F523A70EB43ECAF54DF46144835B4',
'autoplay': false,
'loop': false
};
OO.ready(function () { window.ppF17F523A70EB43ECAF54DF46144835B4 = OO.Player.create('video-F17F523A70EB43ECAF54DF46144835B4', 'w5cW9qZTE6qRRDqfBdi861XWJTXci9uE', playerParam); });
</script>
EDIT:
The pattern is generated by a user's query, so the pattern could include the word window or player which would be matched in the javascript when I change the pattern to include the \b like so: (?<!<[^>]*)\b(window|player|and)\b
Another example

Change your regex to (?<!<[^>]*)\b(diversity|and|inclusion)\b. The \b adds a test for a word boundary, forcing each word inside the ( and ) to match only as a whole word.
EDIT:
You are trying to parse the HTML to extract the text nodes and then check them.
You should not under any circumstances try to parse HTML with a regex unless you wish to invoke rite 666 Ph'nglui mglw'nafh Cthulhu R'lyeh wgah'nagl fhtagn.
Use an HTML parsing library instead: see this page for some ways to do it, or search for extracting text nodes from HTML with .NET and C#.
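To illustrate the parser-based approach, here is a rough sketch using Python's standard-library html.parser rather than a .NET library (the question uses C#, so treat this as an illustration of the idea only). The class name TextNodeReplacer and the helper replace_in_text are hypothetical. The parser re-emits the markup as it goes and applies the regex only to text nodes that are not inside script or style elements.

```python
import re
from html.parser import HTMLParser


class TextNodeReplacer(HTMLParser):
    """Re-emit HTML, applying a regex only to text outside <script>/<style>."""

    RAW_TEXT_TAGS = ("script", "style")

    def __init__(self, pattern, replacement):
        # convert_charrefs=False keeps entities such as &amp; as separate events.
        super().__init__(convert_charrefs=False)
        self.pattern = pattern
        self.replacement = replacement
        self.out = []
        self.in_raw_text = False

    def handle_starttag(self, tag, attrs):
        if tag in self.RAW_TEXT_TAGS:
            self.in_raw_text = True
        self.out.append(self.get_starttag_text())  # original tag text, attributes untouched

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.RAW_TEXT_TAGS:
            self.in_raw_text = False
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if self.in_raw_text:
            self.out.append(data)  # script/style content is passed through as-is
        else:
            self.out.append(self.pattern.sub(self.replacement, data))

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)


def replace_in_text(html, pattern, replacement):
    parser = TextNodeReplacer(pattern, replacement)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


html = ("<p class=\"and\">diversity and inclusion</p>"
        "<script>var playerBrandingId = 'and';</script>")
words = re.compile(r"\b(diversity|and|inclusion)\b")
print(replace_in_text(html, words, r"<b>\1</b>"))
```

Because attribute values and script bodies never reach the regex, a user-supplied word such as window or player can only match in visible text.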

The answer is that you cannot do what I'm trying to do with Regex according to this.

REGEX replace all " style='anything'" except within tables

I am parsing html. I know this shouldn't be done with regex but dom/xpath. In my case it should just be fast, simple and without tidy so I chose regex.
The task is replacing all the style='xxx' with an empty string, except within tables.
This regex for preg_replace works catching all style='xxx' no matter where:
'/ style="([^"]+)"/s'
The content can look like this
<!-- more html here -->
<span style='do:smtg'><table class=... > <span style="...">
<table> <div style=""></div></table></span></table>
<!-- more html here -->
or just simple non nested tables, meaning regex should exclude all style='...' also within nested tables.
Is there a simple syntax doing this?

Thou Shalt Not Parse HTML with Regular Expressions!
No, really, you shouldn't.
As evidenced by your example, you can expect nested tables. That means the regex should keep track of the level of nesting, to decide whether or not you're in a table. If you find a way to do this, it will certainly not be "fast and simple".

Email, resurrecting this question because it had a regex that wasn't mentioned. (Found your question while doing some research for a regex bounty quest.)
With all the disclaimers about using regex to parse html, here is a simple way to do it.
First we need a regex to match tables, nested or not. This does it with simple recursion:
<table(?:.*?(?R).*?|.*?)</table>
Next, we exclude these, and match what we do want. Here is the whole regex:
(?s)<table(?:.*?(?R).*?|.*?)<\/table>(*SKIP)(*F)|style=(['"])[^'"]*\1
The left side of the alternation matches complete tables, nested or not, then deliberately fails. The right side matches your style attributes; Group 1 captures the opening quote so that \1 requires the same quote to close the value, which allows for either quote style. We know these are the right styles because they were not matched by the expression on the left.
With this regex, you can do a simple preg_replace($regex, "", $yourstring);
Reference
How to match pattern except in situations s1, s2, s3
How to match a pattern unless...
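The pattern above relies on PCRE features ((?R) recursion and the (*SKIP)(*F) verbs) that Python's standard re module does not provide. As a rough sketch of the same "skip tables, replace elsewhere" idea under that constraint, the version below swaps in a different technique: a single tokenizing regex plus a replacement callback that counts table nesting depth. The function name strip_styles_outside_tables is hypothetical.

```python
import re

# One alternation per token of interest: table open, table close, style attribute.
TOKEN = re.compile(
    r"(?P<open><table\b)"
    r"|(?P<close></table\s*>)"
    r"|(?P<style>\sstyle=(?P<q>['\"])[^'\"]*(?P=q))",
    re.IGNORECASE,
)


def strip_styles_outside_tables(html):
    depth = 0  # current <table> nesting level

    def repl(match):
        nonlocal depth
        if match.group("open"):
            depth += 1
        elif match.group("close"):
            depth = max(depth - 1, 0)
        elif depth == 0:
            return ""  # style attribute outside any table: drop it
        return match.group(0)  # everything else is kept unchanged

    return TOKEN.sub(repl, html)


sample = ("<span style='do:smtg'><table class=\"x\"> <span style=\"a:b\">"
          "<table> <div style=\"\"></div></table></span></table>"
          "<p style=\"c:d\">x</p>")
print(strip_styles_outside_tables(sample))
```

Like the regex it mirrors, this assumes reasonably well-formed markup; an unclosed table would cause every later style attribute to be kept.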

Notepad++ Regex to remove styling

I need to remove some tags from a whole lot of html pages.
Lately I discovered the option of regex in Notepad++
But.. Even after hours of Googling I don't seem to get it right.
What do I need?
Example:
<p class=MsoNormal style='margin-left:19.85pt;text-indent:-19.85pt'><spanlang=NL style='font-size:11.0pt;font-family:Symbol'>·<span style='font:7.0pt "Times New Roman"'> </span></span><span lang=NL style='font-size:9.0pt;font-family:"Arial","sans-serif"'>zware uitvoering met doorzichtige vulruimte;</span></p>
I need to remove everything about styling, classes and id's. So I need to only have the clean tags without anything else.
Anyone able to help me on this one?
Kind regards
EDIT
Check an entire file via pastebin: http://pastebin.com/0tNwGUWP

I think this pattern will erase all styles in "p" and "span" tags:
((?<=<p)|(?<=<span))[^>]*(?=>)
How it works:
( (?<=<p) | (?<=<span) ): a lookbehind that makes sure the text we are looking for comes after <p OR <span
[^>]*: matches any run of characters that are not >
(?=>): a lookahead that makes sure the text we are looking for comes before a > character
PS: Tested on Notepad++

If the sample you provided is representative of what you need to process, then the following quick and dirty solution will work:
Find what: [a-z]+='[^']*'
Replace with:
Find what: [a-z]+=[a-zA-Z]*
Replace with:
You must run the first one first to pick up the quoted style='...' attributes, and then run the second to pick up the unquoted class=... and lang=... attributes.
There's good reason why other posters are saying don't attempt to parse HTML this way. You'll end up in all sorts of trouble, since regex in general cannot handle all the wonderful weirdness of HTML.

My advice is as follows.
As I see in your sample text you have only "p" and "span" tags that need to be handled. And you apparently want to remove all the styles inside them. In this case, you could consider removing everything inside those tags, leave them simple <p> or <span>.
I don't know about Notepad++ but a simple C# program can do this job quickly.

Assuming <spanlang=NL is a typo (it should be <span lang=NL), I'd do:
Find what: (<\w+)[^>]*>
Replace with: $1>
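For reference, the same find/replace can be sketched in Python, where the replacement back-reference is written \1 instead of $1. The function name strip_attributes is illustrative only.

```python
import re


def strip_attributes(html):
    # (<\w+) keeps "<" plus the tag name; [^>]*> consumes everything up to the
    # end of the opening tag. Closing tags such as </p> never match, because
    # "/" is not a word character.
    return re.sub(r'(<\w+)[^>]*>', r'\1>', html)


sample = ("<p class=MsoNormal style='margin-left:19.85pt'>"
          "<span lang=NL style='font:7.0pt \"Times New Roman\"'>zware uitvoering</span></p>")
print(strip_attributes(sample))
```

Note that this removes every attribute, not only styling, classes and ids, and it would cut a tag short if an attribute value itself contained a > character.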

If you don't mind doing a little bit of programming: HtmlAgilityPack can easily remove scripts/styles/whatever from your XML/HTML.
Example:
using System.Linq; // needed for Where/ToList

var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
// Remove every <script> and <style> element from the parsed document.
doc.DocumentNode.Descendants()
    .Where(n => n.Name == "script" || n.Name == "style")
    .ToList()
    .ForEach(n => n.Remove());

REGEX Pattern - How do I match upto a certain tag in html

I have some HTML which I want to grab between two tags. However, nested tags exist in the HTML, so looking for the closing tag wouldn't work, as it would stop at the first nested div.
Basically I want my regex to:
Match some text literally, followed by ANY character up to another literal text string. So my question is how do I get [^<]* to continue matching until it sees the next div.
such as
<div id="test"[^<]*<div id="test2"
Example html
<div id="test" class="whatever">
<div class="wrapper">
<fieldset>Test</fieldset><div class="testclass">some info</div>
</div>
<!-- end test div--></div>
</div>
<div id="test2" class="endFind">

In general, I suspect you want to look at "greedy" vs "lazy" in your regex, assuming that's supported by your platform/language.
For example, <div[^>]*>(.*?)</div> would make $1 match all the text inside a div, but would try to keep it as small as possible. Some people call *? a "lazy star".
But it seems you're looking to find the text within a div that is before the start of the first nested div. That would be something like <div[^>]*>(.*?)<div
Read about greedy vs lazy here and check to make sure that whatever language you're using supports it.
$ php -r '$text="<div>Test<div>foo</div></div>\n"; print preg_replace("/<div[^>]*>(.*?)<div.*/", "\$1", $text);'
Test
$
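The greedy/lazy difference can also be seen in a short Python sketch using the same test string as the PHP one-liner. The last search is an assumption about what the question is after: everything between the two literal markers, with re.DOTALL so that . also crosses newlines; the html string is a shortened stand-in for the question's example.

```python
import re

text = "<div>Test<div>foo</div></div>"

# Lazy: stop at the first </div> that allows a match.
lazy = re.search(r"<div[^>]*>(.*?)</div>", text).group(1)
# Greedy: run to the last </div> in the string.
greedy = re.search(r"<div[^>]*>(.*)</div>", text).group(1)
# Lazy up to the start of the first nested div.
before_nested = re.search(r"<div[^>]*>(.*?)<div", text).group(1)

print(lazy)           # Test<div>foo
print(greedy)         # Test<div>foo</div>
print(before_nested)  # Test

# Literal start marker, anything (including newlines), literal end marker.
html = ('<div id="test" class="whatever">\n'
        '<div class="wrapper">inner</div>\n'
        '</div>\n'
        '<div id="test2" class="endFind">')
between = re.search(r'<div id="test"(.*?)<div id="test2"', html, re.DOTALL).group(1)
print(between)
```

Neither quantifier understands nesting: the lazy version simply stops at the first closing tag it meets, which is why it returns Test<div>foo here.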

Regex is not capable of parsing HTML. If this is part of an application, you're doing something wrong. If you absolutely have to parse a document, use a html/xml parser.
If you're trying to screen scrape something and don't want to bother with a parser, look for identifying marks in the page you're scraping. For example, maybe the embedded div ends just before the one you want to match, so you could match </div></div> instead.
Alternatively, here's a regex that meets your requirements. However, it is very fragile: it will break if, for example, #test's children have children, or the html isn't valid, or I missed something, etc, etc ...
/<div id="test"[^<]*(<([^ >]+).+<\/\2>[^<]*)*<\/div>/

Looking for regex to erase href text

If I have a bunch of urls like this:
<li>Xyz 123</li>
<li>Xyz 345</li>
What would a regex look like to erase the urls inside the hrefs so that they become:
<li>Xyz 123</li>
<li>Xyz 345</li>

The following should do what you like:
/href=\"([^\"]*)\"/
Basically match href="<any text but a '"'>".

Search for <a href="[^"]*" and replace with <a href="".
If you add more details about which language you're using, I can be more specific. Be aware also that regular expressions are usually not the tool of choice when dealing with HTML.

First of all, do not use regex to parse HTML — why? Have a look here or here.
Process the HTML using an XML reader / XML document processing engine. Then use XPath to find nodes matching your criteria and alter href attributes in the DOM.
Note: For HTML which is not well-formed XML a more-general HTML (SGML) parser is required.

I partially agree with the others but a more complete version would be
/(<a[^>]+href\s*=\s*\")(.*?)("[^>]*>)/$1$3/gi
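As a rough Python equivalent of the substitutions suggested above, the sketch below keeps everything up to and including the opening quote of href, drops the URL, and keeps the closing quote. The links in the question were lost, so the sample input and the function name blank_hrefs are assumptions made for illustration.

```python
import re

# Group 1: "<a ... href=" plus the opening quote; group 2: the closing quote.
HREF_RE = re.compile(r'(<a\b[^>]*?href\s*=\s*")[^"]*(")', re.IGNORECASE)


def blank_hrefs(html):
    """Empty the href value of every <a> tag that uses double quotes."""
    return HREF_RE.sub(r'\1\2', html)


# Hypothetical input; example.com stands in for the original URLs.
sample = '<li><a href="http://example.com/123">Xyz 123</a></li>'
print(blank_hrefs(sample))
```

Single-quoted or unquoted href values are left untouched by this pattern, which is one more reason the parser-based answers are the safer route for real pages.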

We Keep Coding
