Similar regular expression in same line [closed] - regex

I need a regular expression for the following requirement.
Sample string:
<html> a test of strength and <h1> valour </h1> for <<<NOT>>> faint hearted <b> BUT </b> protoganist having their characters <<<CARVED>>> out of gibralter <b> ROCK </b>
The above is a single string from which I want to strip every HTML tag while retaining the <<<xyz>>> markers.
My attempt:
(^|\n| )<[^>]*>(\n| |$)
Can someone please critically review this?

This is what I've come up with. It uses a lookbehind and a lookahead to identify HTML tags by what precedes and follows them, without including those characters in the match. The point is to match < and > only when they are not directly next to another < or >, which is what sets <<<xyz>>> apart from a real tag. Is this what you are after, or did I misread you?
(?<!<)<\/?[A-Za-z1-6]+>(?!>)
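To check the lookaround idea against the sample string, here is a minimal Python sketch. Python is an assumption, since the question names no language, and the pattern is only one way to express the idea. The helper name strip_tags is hypothetical, and tags that carry attributes are deliberately out of scope.

```python
import re

# Match <name> or </name> only when the "<" is not preceded by another "<"
# and the ">" is not followed by another ">", so <<<xyz>>> is left alone.
TAG = re.compile(r"(?<!<)</?[A-Za-z1-6]+>(?!>)")


def strip_tags(text: str) -> str:
    """Remove simple tags, then collapse the whitespace they leave behind."""
    return " ".join(TAG.sub("", text).split())


sample = (
    "<html> a test of strength and <h1> valour </h1> for <<<NOT>>> faint "
    "hearted <b> BUT </b> protoganist having their characters <<<CARVED>>> "
    "out of gibralter <b> ROCK </b>"
)

print(strip_tags(sample))
```

The whitespace collapse is a convenience for display only; drop it if the original spacing matters.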

Related

Perl regex remove whitespace before close tag [closed]

I need to remove whitespace before the closing tag using a Perl regex.
From
<span class="inf">cranium </span>
<span class="inf">craniums </span>
<span class="inf">crania </span>
to
<span class="inf">cranium</span>
<span class="inf">craniums</span>
<span class="inf">crania</span>
Using:
find . -type f -exec perl -pi -w -e 's/(\s)([\<\/span>])/$2/' \{\} \;
What am I doing wrong?
One can capture a pattern and then put back only that capture in the replacement, which effectively removes everything else that was matched. This is what the question attempts, and it can be written like so:
s{\s+(</span>)}{$1}g
We match whitespace† and the pattern </span> immediately following it, and replace all of that with what was captured in the first (left-most) set of parentheses ($1). That's it.‡
The /g modifier is there so that this is done throughout the whole string.
Or, using a lookahead:
s{\s+(?=</span>)}{}g
Now the </span> pattern isn't "consumed" out of the string but is only "asserted" to be there, following the whitespace; so we don't need to "put it back," and the empty replacement removes only the whitespace.
There is also no need to escape {} in the find command.
† \s matches all kinds of whitespace, not only spaces. See, for instance, perlrecharclass.
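For comparison, both substitutions can be sketched with Python's re.sub. This is only an illustration in a different language, not part of the Perl answer; the function names strip_capture and strip_lookahead are hypothetical.

```python
import re

lines = [
    '<span class="inf">cranium </span>',
    '<span class="inf">craniums </span>',
    '<span class="inf">crania </span>',
]


def strip_capture(line: str) -> str:
    # Consume the whitespace and the tag, then put the captured tag back.
    return re.sub(r"\s+(</span>)", r"\1", line)


def strip_lookahead(line: str) -> str:
    # Only assert that the tag follows; the match is the whitespace alone.
    return re.sub(r"\s+(?=</span>)", "", line)


captured = [strip_capture(line) for line in lines]
asserted = [strip_lookahead(line) for line in lines]
print(captured)
print(asserted)
```

Unlike Perl's s///, re.sub replaces every occurrence by default, so no equivalent of /g is needed.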
‡ A comment on [] used in the question: that's a "character class." It matches any one of the characters listed inside, with some restrictions and modifications. See linked docs.
So [\<\/span>] matches either of the characters: <, /, s, p, a, n, >. The \ is used to escape < and / so it isn't matched itself. (However, escaping those would be unneeded, except for / if that is the delimiter for the whole regex.)
See perlrecharclass, and the tutorial perlretut. The full, top-level, reference is perlre.
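The character-class point is easy to see in a small experiment. The sketch below uses Python rather than Perl (an assumption made for illustration) and an invented input string to show that the class from the question matches single characters such as p, not the literal text </span>.

```python
import re

# The bracketed part is a character class: one of < / s p a n >.
broken = re.compile(r"(\s)([\<\/span>])")

# count=1 mirrors a Perl substitution without the /g modifier.
mangled = broken.sub(r"\2", "big pan </span>", count=1)
print(mangled)

# Each listed character matches on its own.
singles = re.findall(r"[\<\/span>]", "spa")
print(singles)
```

The first space is removed because it is followed by p, which belongs to the class, while the space before </span> is never reached.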

How to use regex to extract links from HTML tag-attributes (src, data, href, or others) [closed]

I have an HTML document as a string and would like to extract all its links with one regex command/pattern (for better performance), instead of searching for every tag separately (which is the only way I know to solve it).
HTML example:
<img src="..." data-full-resolution="..." />
<object data="..."/>
Please consider also that the image tag has two attributes that should be extracted (src and data-full-resolution).
The programming language is intentionally left out, as I need a 'raw' solution without HTML libraries.
(?:data-full-resolution|src|href|data)=\"(.*?)\"
Regex Explanation
(?: Non-capturing group
data-full-resolution|src|href|data One of data-full-resolution, src, href or data
) Close non-capturing group
=\" Match =" after an attribute name
( Capturing group
.*? Non-greedy capturing till the next quote
) Close group
\" Match the close quote
Python Example
import re
html = """
<img src="<link-src>" data-full-resolution="<link-data-full-resolution>" />
<object data="<link-data>"/>"""
# The sample contains no href attribute, so three values are found.
print(re.findall(r"(?:data-full-resolution|src|href|data)=\"(.*?)\"", html))  # ['<link-src>', '<link-data-full-resolution>', '<link-data>']
Where re.findall returns the list of captured groups.

Python regex: text between <a> tag [duplicate]

This question already has answers here:
Why it's not possible to use regex to parse HTML/XML: a formal explanation in layman's terms
(10 answers)
Closed 3 years ago.
I know I could use BeautifulSoup, but I want to try my own approach.
The text I've been working on:
<br>This Text needed
<a>unwanted text</a>
<br/>
This text needed
<a >unwanted text</a>
This text needed
<a>unwanted text</a>
<br>this text needed
What I have come up with:
(</a>|(<br(/>|>)))(\s.*|\w.*)
I want to match every "This text needed" line, but one of them isn't matching.
How about this approach, with a negative lookbehind and a negative lookahead (use the case-insensitive flag, since the capitalization varies in the sample):
(?<!<a>)This Text needed(?!<\/a>)
DEMO: https://regex101.com/r/RT5LZu/1
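An alternative that does not depend on knowing the wanted text in advance is to delete what is unwanted and keep the rest. The sketch below is a swapped-in technique, not the answer's lookaround pattern, and it assumes the input is as simple as the sample: no nested <a> elements and no > inside attribute values.

```python
import re

html = """<br>This Text needed
<a>unwanted text</a>
<br/>
This text needed
<a >unwanted text</a>
This text needed
<a>unwanted text</a>
<br>this text needed"""

# Step 1: drop every <a ...>...</a> element together with its text.
without_links = re.sub(r"<a\b[^>]*>.*?</a>", "", html, flags=re.S | re.I)

# Step 2: drop the remaining tags (here only <br> and <br/>).
text_only = re.sub(r"<[^>]+>", "", without_links)

# Step 3: keep the non-empty lines.
wanted = [line.strip() for line in text_only.splitlines() if line.strip()]
print(wanted)
```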

Regular Expression select all between <table *****> html tag [closed]

I am looking for a way to find and replace multiple occurrences of the opening <table > tag using a regex.
The tags have different content between < and >, and I need to select only the characters within the opening tag, not the content between the opening and closing tags:
<table i need to select this> i dont want select this one </table>
No language is specified. I need to replace the whole opening table tag with <table class="table-responsive"> in an SQL file.
If this is a one-off replacement, a regex will be fine. As long as there is no > inside the attributes of the opening tag, you can use this to select opening tags:
<table[^>]*>
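As an illustration of the replacement itself, here is a minimal Python sketch (Python is an assumption; any regex-capable editor would do for a one-off job). It adds a \b after the tag name so that a tag whose name merely starts with "table" is not touched, which goes slightly beyond the pattern above. The sample string is invented.

```python
import re

OPENING_TABLE = re.compile(r"<table\b[^>]*>")
REPLACEMENT = '<table class="table-responsive">'

sql_text = (
    '<table border="1" width="100%"> i dont want select this one </table>'
    "<table>second</table>"
)

updated = OPENING_TABLE.sub(REPLACEMENT, sql_text)
print(updated)
```

The closing </table> tags are untouched because the pattern requires "<table", not "</table".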

Regex code exception gif [closed]

I have the following code, which returns the first image of the post:
$output = preg_match_all('/<img.+src=[\'"]([^\'"]+)[\'"].*>/i',
$post->post_content, $matches);
However, it returns any image. I need to ignore images in GIF format; how could I add this condition to the regular expression?
It is easier to loop through the results and apply a second regex.
$output = preg_match_all('/<img[^>]+?src=[\'"](.+?)[\'"].*?>/i', $post->post_content, $matches);
$noGif = array();
// $matches[0] holds the full <img> tags; $matches[1] holds the captured src values.
foreach ($matches[1] as $imgSrc)
{
    if (!preg_match('/\.gif$/i', $imgSrc))
    {
        $noGif[] = $imgSrc;
    }
}
It is easier to understand, and there won't be unexpected side effects like blocking valid pictures that happen to have the letters "gif" in the file name.
Note: be very careful when using .+ and .*. As it stands, your regex matches a lot more than you think.
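The same two-step idea, extract first and filter afterwards, can be sketched outside PHP as well. The Python version below is only an illustration; the function name non_gif_images and the sample markup are hypothetical.

```python
import re

IMG_SRC = re.compile(r"""<img[^>]+?src=['"](.+?)['"].*?>""", re.I)
GIF_SUFFIX = re.compile(r"\.gif$", re.I)


def non_gif_images(html: str) -> list:
    """Return every img src value that does not end in .gif."""
    return [src for src in IMG_SRC.findall(html) if not GIF_SUFFIX.search(src)]


post = (
    '<p><img src="anim.gif"> <img class="x" src="gifts.png"> '
    '<img src="photo.JPG"></p>'
)

print(non_gif_images(post))
```

Note that gifts.png survives the filter, since only the extension is tested and not the whole file name.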
Try it on this, for instance:
<img whatever> whatever <img src="mypic.png"> <some other tag>
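To see the over-matching concretely, the following Python sketch runs the question's pattern and a bounded variant against that line. Python and the variable names are assumptions made for illustration; the bounded variant simply swaps .+ and .* for [^>]+ and [^>]*.

```python
import re

html = '<img whatever> whatever <img src="mypic.png"> <some other tag>'

# Pattern from the question: .+ and .* may run across tag boundaries.
greedy = re.compile(r"""<img.+src=['"]([^'"]+)['"].*>""", re.I)

# Bounded variant: [^>] cannot leave the tag it started in.
bounded = re.compile(r"""<img[^>]+src=['"]([^'"]+)['"][^>]*>""", re.I)

greedy_match = greedy.search(html).group(0)
bounded_match = bounded.search(html).group(0)
print(greedy_match)
print(bounded_match)
```

Both patterns capture mypic.png here, but the greedy one reports the entire line as the matched tag.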
You should probably not be using regular expressions
HTML is not regular
Regexes may match today, but what about tomorrow?
Say you've got a file of HTML where you're trying to extract URLs from tags.
<img src="http://example.com/whatever.jpg">
So you write a regex like this (in Perl):
if ( $html =~ /<img src="(.+)"/ ) {
$url = $1;
}
In this case, $url will indeed contain http://example.com/whatever.jpg. But what happens when you start getting HTML like this:
<img src='http://example.com/whatever.jpg'>
or
<img src=http://example.com/whatever.jpg>
or
<img border=0 src="http://example.com/whatever.jpg">
or
<img
src="http://example.com/whatever.jpg">
or you start getting false positives from
<!-- <img src="http://example.com/outdated.png"> -->
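The failure modes listed above can be reproduced in a few lines. This sketch uses Python and its standard-library html.parser module as a stand-in for "a real parser", which is an assumption: the passage itself uses Perl and names no library. The class name ImgSrcCollector is hypothetical.

```python
import re
from html.parser import HTMLParser

URL = "http://example.com/whatever.jpg"
variants = [
    '<img src="http://example.com/whatever.jpg">',
    "<img src='http://example.com/whatever.jpg'>",
    "<img src=http://example.com/whatever.jpg>",
    '<img border=0 src="http://example.com/whatever.jpg">',
    '<img\nsrc="http://example.com/whatever.jpg">',
]
comment = '<!-- <img src="http://example.com/outdated.png"> -->'

# The naive regex from the text: which variants does it find?
naive = re.compile(r'<img src="(.+)"')
naive_hits = [bool(naive.search(v)) for v in variants]
false_positive = naive.search(comment) is not None


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every img start tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.sources.extend(v for name, v in attrs if name == "src")


collector = ImgSrcCollector()
collector.feed("\n".join(variants + [comment]))
collector.close()

print(naive_hits, false_positive)
print(collector.sources)
```

The parser reports the commented-out image through a separate comment callback, so it never reaches handle_starttag.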
<img[^>]+src=[\'"]((?:[^\'"](?!\.gif))+)[\'"][^>]*>
Updated to have only one capture.
Fixed to include dot. Now would only fail on strange things like a.gif.jpg
Also added safety matches as suggested in comment.