Quick regex help: grab text from html

I have the following HTML snippet:
<h1 class="header" itemprop="name">Some text here<span class="nobr">
I would like to get the text between the HTML tags. I have been struggling with this for hours now, please help me! What regex would solve my problem?

You should not use a regex for that; use an HTML parser instead. Since you didn't specify a language, it is hard to be more specific, but searching for an HTML parser for your language will turn one up.
If you need it just for this one case, you can use the regex />(.*?)</
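
As a quick check of that one-off pattern, here is a minimal Python sketch. Python is an assumption (the question names no language), and the outer slashes in />(.*?)</ are treated as regex delimiters and dropped.

```python
import re

# The snippet from the question (note that its tags are never closed).
snippet = '<h1 class="header" itemprop="name">Some text here<span class="nobr">'

# Lazy capture: everything between the first '>' and the next '<'.
match = re.search(r">(.*?)<", snippet)
text = match.group(1) if match else None
print(text)
```

This only works because the wanted text directly follows the first tag; on a larger page the first match would be whatever sits between the first two tags.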

In JavaScript you can access that info via:
document.getElementsByTagName("h1").item(0).textContent
or
document.getElementsByClassName("header").item(0).textContent

As others have said, you shouldn't be using regular expressions for parsing HTML. That aside, the following will grab that text for you:
(?<=\>).+(?=\<)
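
One thing worth knowing about the lookaround pattern is that .+ is greedy. The Python sketch below (Python and the second input string are assumptions made for illustration) compares it with a lazy variant; the backslashes before > and < are dropped because neither character needs escaping.

```python
import re

greedy = re.compile(r"(?<=>).+(?=<)")   # pattern from the answer
lazy = re.compile(r"(?<=>).+?(?=<)")    # same pattern with a lazy quantifier

# The snippet from the question, plus a hypothetical input with closed tags.
snippet = '<h1 class="header" itemprop="name">Some text here<span class="nobr">'
closed = "<h1>Some text here</h1><p>More</p>"

greedy_on_snippet = greedy.search(snippet).group(0)
greedy_on_closed = greedy.search(closed).group(0)
lazy_on_closed = lazy.search(closed).group(0)

print(greedy_on_snippet)  # the text only, since just one '<' follows it
print(greedy_on_closed)   # runs on to the last '<' in the string
print(lazy_on_closed)     # stops at the first '<'
```

On the question's snippet both variants give the same result; the difference only shows once more tags follow the text.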

Related

regex to obtain text between several html tags

In the O'Reilly book Mastering Regular Expressions, I found this example:
to take only the text from O’Reilly Media
the author proposes a regex like this: <a\b([^>]+)>(.+?)</a>
But I have had no luck capturing only the text.
Could anyone please explain how to build a regex that captures only the text?
I am trying to understand regex, so please be polite and don't tell me to use other methods to parse HTML.
Thank you.

To reveal all the text between tags, strip off the tags.
<(?:(?:(?:(script|style|object|embed|applet|noframes|noscript|noembed)(?:\s+(?>"[\S\s]*?"|'[\S\s]*?'|(?:(?!/>)[^>])?)+)?\s*>)[\S\s]*?</\1\s*(?=>))|(?:/?[\w:]+\s*/?)|(?:[\w:]+\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]?)+\s*/?)|\?[\S\s]*?\?|(?:!(?:(?:DOCTYPE[\S\s]*?)|(?:\[CDATA\[[\S\s]*?\]\])|(?:--[\S\s]*?--)|(?:ATTLIST[\S\s]*?)|(?:ENTITY[\S\s]*?)|(?:ELEMENT[\S\s]*?))))>
https://regex101.com/r/B4Dhkj/1

How can I match a tag ignoring its attributes

So say I have some HTML that looks like the following
This text
This Text
This Text
What regex would grab 'This text'? I'm really new to regex and have tried things like "(.*)" without much success. Can anyone offer advice? :D

You can use the following regular expression
<a\b[^>]*>(.*?)</a>
You can find more examples on this site
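
A minimal Python sketch of how this pattern behaves (Python is an assumption, and the attributes in the sample markup are made up, since the original markup of the question is not shown):

```python
import re

# Hypothetical markup: three anchors with different (or no) attributes.
html = (
    '<a href="one.html">This text</a>'
    '<a class="link" href="two.html">This Text</a>'
    '<a>This Text</a>'
)

# [^>]* skips over any attributes; (.*?) lazily captures the link text.
anchor_text = re.compile(r"<a\b[^>]*>(.*?)</a>", re.IGNORECASE | re.DOTALL)

texts = anchor_text.findall(html)
print(texts)

# \b keeps the pattern from matching other tags that start with "a".
abbr_hits = anchor_text.findall("<abbr>HTML</abbr>")
print(abbr_hits)
```

The re.DOTALL flag is an addition here so that link text spanning several lines is still captured.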

extract out-links of an html file in c++/g++

I want to list all the URLs in an HTML file using g++. I tried to use grep but wasn't successful. Can anyone help me do that?
Thanks a lot.

You could use boost::regex to find URLs in the HTML code.

See the related question about C++ libraries for HTML parsing; you could use one of them to parse your HTML and then look for <a> tags.
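
The "parse, then look for <a> tags" idea is independent of the language. As a rough illustration only, here it is sketched in Python with the standard-library html.parser rather than a C++ library; the class name LinkCollector and the sample page are hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical page content.
page = (
    '<p>See <a href="https://example.com/a">A</a> and '
    '<a class="x" href="/b">B</a>.</p>'
    '<a name="top">no link here</a>'
)

collector = LinkCollector()
collector.feed(page)
collector.close()
links = collector.links
print(links)
```

A C++ HTML parsing library would follow the same shape: walk the elements, keep the a elements, and read their href attribute.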

Which regular expression to use to extract some words from an HTML text?

I am having a hard time building a regular expression to grab some words from an HTML text.
Let's say I have the following:
<p style="padding-left :12px">SOME_TEXT_I_WANT</p><p>SOME_OTHER_TEXT</p>
*SOME_TEXT_I_WANT* and *SOME_OTHER_TEXT* can be either a bunch of words like "SOME RANDOM TEXT" or HTML text like "<strong>SOME BOLD TEXT</strong>"
My goal is to extract those texts with one regex.

Which language do you intend to use? Does an HTML parser exist for this language? If yes, consider using a parser.
However, if this is a "one-off", you may be able to get by with something along the lines of:
#<p[^>]*>(.*?)</p>#
The above has certain limitations, most notably it does not correctly handle <p data-something="a > b">...</p> (the opening tag is cut off at the first >) nor nested <p>s. (I am not able to tell whether the markup you're trying to parse actually allows nested <p>s; I am just informing you of possible pitfalls.)
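
To make these pitfalls concrete, here is a small Python sketch (Python is an assumption; the # delimiters are PHP-style and are dropped, and the last two inputs are made up to trigger the limitations):

```python
import re

p_body = re.compile(r"<p[^>]*>(.*?)</p>")

# The markup from the question: both paragraphs are captured.
ok = p_body.findall(
    '<p style="padding-left :12px">SOME_TEXT_I_WANT</p><p>SOME_OTHER_TEXT</p>'
)

# Inline HTML inside the paragraph is captured together with its tags.
bold = p_body.findall("<p><strong>SOME BOLD TEXT</strong></p>")

# Pitfall 1: a '>' inside an attribute value ends the opening tag too early,
# so part of the attribute leaks into the captured group.
attr_gt = p_body.findall('<p data-something="a > b">text</p>')

# Pitfall 2: with nested <p>s the lazy group stops at the first </p>.
nested = p_body.findall("<p>outer <p>inner</p> tail</p>")

for result in (ok, bold, attr_gt, nested):
    print(result)
```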

Assuming you are using PHP:
$html = "<p>some text here</p>";
// Strip every tag; preg_replace returns the result, it does not modify $html.
$text = preg_replace("/<.+?>/", "", $html);

Don't use regex. If you ask why, there is a very popular SO post that describes what can happen if you try to use regex for parsing HTML.
Use your language's HTML or XML parser and extract what you need using existing functionality.
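
As a sketch of what the parser route could look like, assuming Python and its standard-library html.parser (the class name ParagraphText is hypothetical, and the input is a variation on the question's markup):

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collects the plain text of each <p> element (nested <p>s not handled)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._current = None  # None while outside a <p>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._current = []

    def handle_endtag(self, tag):
        if tag == "p" and self._current is not None:
            self.paragraphs.append("".join(self._current))
            self._current = None

    def handle_data(self, data):
        # Text inside inline tags such as <strong> still arrives here.
        if self._current is not None:
            self._current.append(data)


html = (
    '<p style="padding-left :12px">SOME <strong>BOLD</strong> TEXT</p>'
    '<p data-something="a > b">SOME_OTHER_TEXT</p>'
)

parser = ParagraphText()
parser.feed(html)
parser.close()
paragraphs = parser.paragraphs
print(paragraphs)
```

Unlike the regex approaches, the parser handles the quoted > in the attribute value and returns the text without the inline tags.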

How to remove both html tags & content,values inside the tags using regular expressions

How to remove both html tags & content,values inside the tags using regular expressions

Have a look at some of these articles:
When is it wise to use regular expressions with HTML?
Regex HTML Extraction C#
But be aware, you are opening yourself up to a whole world of hurt:
Parsing Html The Cthulhu Way

As generic as your question is, you will probably only get generic answers:
Use an HTML DOM parser.

You should use an html parser for this, e.g. htmlcleaner
