Regex to get linkable text

I've been trying for hours now.
I need to get the linkable text, meaning all text from a webpage's source that sits between <a href> and </a>, excluding any other tags nested inside the <a> element.
Example:
<a href="blabla.net">THIS TEXT
<img src="hhh.jpg" /> THIS TEXT TOO
<span> ALSO THIS TEXT. </span>AND ALSO THIS TEXT</a>

You could use a simple regular expression with a non-greedy group:
<[aA]\b[^\>]*>([\w\W]*?)<\/[aA]>
You can test it on this page by pressing F12 and typing the following into the console:
$(document.body).html().match(/<a\b[^\>]*>([\w\W]*?)<\/a>/ig)
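For anyone not working in the browser console, here is a minimal Python sketch of the same idea. It applies the non-greedy pattern to the example from the question and then strips the nested tags from each captured group. The names ANCHOR_RE, TAG_RE and link_texts are made up for this illustration, and the whitespace normalisation is an extra step that the answer above does not perform.

```python
import re

# Non-greedy capture of everything between <a ...> and </a>.
# DOTALL lets "." cross line breaks; IGNORECASE also covers <A>.
ANCHOR_RE = re.compile(r"<a\b[^>]*>(.*?)</a>", re.IGNORECASE | re.DOTALL)
# Any tag nested inside the anchor body.
TAG_RE = re.compile(r"<[^>]+>")


def link_texts(html):
    """Return the visible text of each anchor with nested tags removed."""
    texts = []
    for inner in ANCHOR_RE.findall(html):
        without_tags = TAG_RE.sub(" ", inner)
        # Collapse the whitespace left behind by removed tags and newlines.
        texts.append(" ".join(without_tags.split()))
    return texts


sample = '''<a href="blabla.net">THIS TEXT
<img src="hhh.jpg" /> THIS TEXT TOO
<span> ALSO THIS TEXT. </span>AND ALSO THIS TEXT</a>'''

print(link_texts(sample))
```

Like any regex approach, this assumes anchors are not nested and that no attribute value contains a ">" character.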

You can try the following regular expression, which returns the text between the tags as four separate matches:
(?<=>)[^<]+?(?=<)
It removes tags from the text.
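A quick way to check the lookaround pattern is to run it over the example from the question. The sketch below does that in Python; TEXT_BETWEEN_TAGS and pieces are names invented here. Note that the pattern matches text between any two tags, so on a full page it would also pick up text outside of links unless the anchor is isolated first.

```python
import re

# Text that is preceded by ">" and followed by "<", i.e. sits between two tags.
TEXT_BETWEEN_TAGS = re.compile(r"(?<=>)[^<]+?(?=<)")

sample = '''<a href="blabla.net">THIS TEXT
<img src="hhh.jpg" /> THIS TEXT TOO
<span> ALSO THIS TEXT. </span>AND ALSO THIS TEXT</a>'''

# Strip the surrounding whitespace and newlines from each match.
pieces = [m.strip() for m in TEXT_BETWEEN_TAGS.findall(sample)]

for piece in pieces:
    print(piece)
```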

Related

Using SED to remove specific anchor tags within html in database

I've got a table which contains hundreds of guides with screenshots. The screenshot images were surrounded by anchor tags because they used to be clickable, but now I need to remove the anchor tags. All the anchor tags to be removed have an href=#screenshot followed by a number, as in the example below. My plan is to dump the table using mysqldump and then use sed to find and replace the correct strings.
<p>Choose components to install and click next.</p>
<div class="screen">
<img src="/images/screens/install/step3.jpg" alt="Step 3">
</div>
Should be
<p>Choose components to install and click next.</p>
<div class="screen">
<img src="/images/screens/install/step3.jpg" alt="Step 3">
</div>
I can match the first tag using <a\shref\=\"#screenshot\d+\"\> but I also need to match its second closing tag so that both can be removed whilst not removing other anchor tags. Any help would be greatly appreciated!
You can try replacing
<a\shref\=\"#screenshot\d+\"\>(.*)<\/a>
with \1.
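The parentheses capture everything matched between them, so you can restore it using \1, \2 and so on. Two things to watch for: sed does not understand \d, so use [0-9] instead, and .* is greedy, so if a line contains several links the match runs to the last </a> on that line.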
Keep in mind though that regexes are not the right weapon to use when trying to modify HTML. Read this (and the comments around it) for an explanation.
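If sed turns out to be awkward for this, the same replacement can be sketched in Python. The anchor markup in the sample is an assumption based on the pattern quoted in the question (href="#screenshot" plus a number, with "3" picked arbitrarily), and SCREENSHOT_LINK and unwrap_screenshot_links are hypothetical names. The group is non-greedy so each match stops at the first closing tag.

```python
import re

# Only anchors whose href is "#screenshot" followed by digits are matched;
# the lazy (.*?) stops at the first </a> after the opening tag.
SCREENSHOT_LINK = re.compile(r'<a\s+href="#screenshot\d+"\s*>(.*?)</a>', re.DOTALL)


def unwrap_screenshot_links(html):
    """Replace each screenshot anchor with its own contents."""
    return SCREENSHOT_LINK.sub(r"\1", html)


# Hypothetical input: one screenshot anchor and one unrelated link.
before = (
    '<div class="screen">\n'
    '<a href="#screenshot3"><img src="/images/screens/install/step3.jpg" alt="Step 3"></a>\n'
    '</div>\n'
    '<a href="/guides">Other guides</a>'
)

after = unwrap_screenshot_links(before)
print(after)
```

The unrelated link is left alone because its href does not match the pattern.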

get text between html tags correctly

I want to grab the text between html tags using Dreamweaver's search and replace tool.
The link format is a standard a tag e.g.
Text
Or:
Text and Text 2
Or:
Text
I am using the following expression:
(.*)
This works fine for example 1, but it picks up everything between the first opening tag <a href and the last closing tag </a> in the case of example 2.
What can I do to target each individual link?
Also, what can I do in the case of example 3 where links also have a target="_blank" property?
If you just want the "Text" in the body of the tag,
<a[^>]*>([^<]*)</a>
would work.
If you also want the href:
<a[^>]*href="([^>"]*)"[^>]*>([^<]*)</a>
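To see how the second pattern behaves on several links in one line, including one with a target attribute, here is a small Python check. The URLs in the sample are placeholders and LINK_RE is a name chosen for this sketch. Because [^<]* cannot cross a tag, each match stays inside a single link, but links whose body contains nested tags are skipped.

```python
import re

# Group 1: the href value. Group 2: the link text (no nested tags allowed).
LINK_RE = re.compile(r'<a[^>]*href="([^>"]*)"[^>]*>([^<]*)</a>')

# Placeholder markup: two links on one line, the second with target="_blank".
sample = '<a href="one.html">Text</a> and <a href="two.html" target="_blank">Text 2</a>'

links = LINK_RE.findall(sample)
for href, text in links:
    print(href, "->", text)
```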

Autohotkey regular expression to strip html tags in multiple lines

I have the following tag in an HTML file, from which I need to grab only the text "XX(1119601.1)" using AutoHotkey and a regular expression. Since the closing tag appears only after a few line breaks, I couldn't get the text between the tags.
<dd class="call_number">
<!-- holdings allowed -->
XX(1119601.1)
</dd>
Any help on this would be much appreciated.
txt =
(Ltrim
<dd class="call_number">
<!-- holdings allowed -->
XX(1119601.1)
</dd>
)
RegexMatch(txt, "<dd .+?>(.*)</dd>", m)
msgbox % RegexReplace(m1, "<!.+>")
This code first matches everything within the tags (you can make the pattern more specific if needed), then removes HTML comments.
You can remove unwanted newlines with RegexReplace as well.
Edit:
Changed RegexMatch to not automatically remove newlines.
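For comparison, the same two steps (capture the element body, then drop the comment) look roughly like this in Python. This is only a sketch and not a translation of the AutoHotkey semantics: in Python the re.DOTALL flag is what allows "." to cross line breaks, and the variable names are invented here.

```python
import re

html = '''<dd class="call_number">
<!-- holdings allowed -->
XX(1119601.1)
</dd>'''

# Step 1: capture everything between <dd ...> and </dd>, across newlines.
match = re.search(r"<dd\b[^>]*>(.*?)</dd>", html, re.DOTALL)
inner = match.group(1) if match else ""

# Step 2: remove HTML comments, then trim the leftover newlines.
call_number = re.sub(r"<!--.*?-->", "", inner, flags=re.DOTALL).strip()

print(call_number)
```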

Regexp: remove all tags from string except one kind of tags

I have such string
<p>test <span class=\"match\">match</span> <span class=\"testtes\">dddddd</span></p>
I want to get string without tags. But I want to save highlighting by class "match":
test <span class=\"match\">match</span> dddddd
If I want to just remove all tags I substitute all substrings that satisfied regexp /<\/?[^>]*>/ by empty string. But what regexp should I use in my special case?
UPD: The algorithm is: if you see <span class=\"match\">, then some text without tags, and then </span>, you shouldn't remove that span; otherwise you should remove all tags.
You could do something like this:
<\/?(?![^>]*class=\\"match)[^>]*>
This would preserve the opening tag and result in this
test <span class=\"match\">match dddddd
See it here on Regexr
But how should I find the matching closing tag?
<p>test <span class=\"match\">match</span> <span class=\"testtes\">dddddd</span></p>
                                   ^^^^^^^ or the next one?              ^^^^^^^
Regex can't know which closing tag belongs to the opening <span> tag that contains that class, and it has no way to find matching closing tags. So it's not a good idea to do this using regex.
I am quite sure the language you are using has an html parser that can be used to do this task.
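As a rough illustration of the parser route, here is a sketch using Python's standard html.parser module. It keeps a stack with one entry per open <span>, so each closing tag is paired with the right opening tag. The class name KeepMatchSpans and the function strip_tags_except_match are invented for this example, the sample uses plain quotes rather than the escaped ones in the question, and the kept tag is re-emitted in a normalised form instead of being copied verbatim.

```python
from html.parser import HTMLParser


class KeepMatchSpans(HTMLParser):
    """Drop every tag except <span class="match"> and its closing tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.span_stack = []  # one bool per open <span>: True if it is kept

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        classes = (dict(attrs).get("class") or "").split()
        keep = "match" in classes
        self.span_stack.append(keep)
        if keep:
            self.out.append('<span class="match">')

    def handle_endtag(self, tag):
        # Pop the most recent <span>; emit </span> only if that one was kept.
        if tag == "span" and self.span_stack and self.span_stack.pop():
            self.out.append("</span>")

    def handle_data(self, data):
        self.out.append(data)


def strip_tags_except_match(html):
    parser = KeepMatchSpans()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


sample = '<p>test <span class="match">match</span> <span class="testtes">dddddd</span></p>'
result = strip_tags_except_match(sample)
print(result)
```

One side effect of this sketch is that character references such as &amp; come out decoded, so the output would need re-escaping before being placed back into HTML.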

Regular expression to remove <p> tags around elements wrapped in [...]'s

I'm a total regexp noob. I'm working with WordPress and I'm desperately trying to deal with WordPress's wpautop, which I hate and love (more hate!). Anyway, I'm trying to remove <p> tags around certain commands.
Here's what I get:
<p>
[hide]
<img.../>
[/hide]
</p>
or
<p>
[imagelist]
<img .../>
<img .../>
[/imagelist]
</p>
Here's what I'd like:
[hide]
<img.../>
[/hide]
or
[imagelist]
<img .../>
<img .../>
[/imagelist]
I've tried:
preg_replace('/<p[^>]*>(\[[^>]*\])<\/p[^>]*>/', '$1', $content); // No luck!
EDIT:
When I run the regexp, the content is still just a variable containing text. It is not parsed as HTML yet. I know it is possible because I already did it when getting rid of p tags around an image tag. So I just need a regexp to handle text that will be parsed as HTML at some point in the future.
Here's a similar question
Thanks!
Matt Mueller
You can't use regular expressions to parse HTML, because HTML is, by definition, a non-regular language. Period, end of discussion.
The language of matching HTML tags is context-free, not regular. This means regular expressions are probably not the right tool to use here. Context-free languages require parsers rather than regular expressions. So, you can either remove ALL <p> and </p> tags with a regular expression, or you can use an HTML parser to remove matching tags from certain parts of your document.
Try this regex:
'%<p[^>]*>\s*(\[([^\[\]]+)\].*?\[/\2\])\s*</p>%s'
Explanation:
\[([^\[\]]+)\] matches the opening bbcode tag and captures the tag name in group #2.
\[/\2\] matches the corresponding closing tag.
.*? matches anything, reluctantly. Thanks to the s flag at the end, it also matches newlines. The effect of the reluctant .*? is that it stops matching the first time it finds a closing bbcode tag with the right name. If tags are nested (within tags with the same name) or improperly balanced, it won't work correctly. I wouldn't expect that to be a problem, but I have no experience with WordPress, so YMMV.
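To try the pattern outside of PHP, here is a minimal Python sketch of the same replacement. The PHP s flag corresponds to re.DOTALL here, and the backreference \2 works the same way. The image src in the sample is a placeholder (the question elides it), and P_AROUND_SHORTCODE and unwrap_shortcodes are hypothetical names.

```python
import re

# Group 1: the whole [tag]...[/tag] block. Group 2: the tag name,
# reused via \2 so the closing tag must carry the same name.
P_AROUND_SHORTCODE = re.compile(
    r"<p[^>]*>\s*(\[([^\[\]]+)\].*?\[/\2\])\s*</p>",
    re.DOTALL,
)


def unwrap_shortcodes(content):
    """Remove the <p>...</p> wrapper around a bracketed block."""
    return P_AROUND_SHORTCODE.sub(r"\1", content)


# Placeholder content: one wrapped block and one ordinary paragraph.
content = '<p>\n[hide]\n<img src="a.jpg" />\n[/hide]\n</p>\n<p>Keep me</p>'

cleaned = unwrap_shortcodes(content)
print(cleaned)
```

The ordinary paragraph is left untouched because its body does not start with a bracketed tag.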