Regex confusion - title of certain length - regex

I have been going through Regex tutorials for hours now and I can't seem to grasp it very well. I would like a regex statement that extracts an html title only if the title is exceptionally long (1000+ characters). I've managed to work out the following to select the entire title:
<title>(.*?)</title>
I have no idea where to begin adding the length portion. Any assistance would be greatly appreciated!

<title>(.{1000,})</title>
would do that (unless the title contains newlines - in that case it depends on the regex engine how to handle that).
This also presupposes that there is only one <title> tag in the string you're looking at (which probably is the case in an HTML file, so you should be OK, given the general warning that regexes are a brittle tool when dealing with HTML).
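As a rough illustration of the pattern above, here is a minimal Python sketch. The function name extract_long_title is made up for this example, and the re.DOTALL flag is an assumption added so that titles containing newlines are still matched; other engines spell that option differently.

```python
import re

# DOTALL lets "." cross newlines, so multi-line titles are handled too.
# Assumes the document contains a single <title> element.
LONG_TITLE = re.compile(r"<title>(.{1000,})</title>", re.DOTALL | re.IGNORECASE)


def extract_long_title(html):
    """Return the title text if it has at least 1000 characters, else None."""
    match = LONG_TITLE.search(html)
    return match.group(1) if match else None
```

A short title simply produces no match, so the caller only ever sees titles of 1000 characters or more.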


Regex to match everything except a pattern

Regex noob here struggling with this, which I know will be easy for some of you regex gods out there!
Given the following:
title: Some title
date: 2022-08-15
tags: <value to extract>
identifier: 1234567
---------------------------
Some text
some more text
I would like a regex to match everything except the value of tags (ie the "<value to extract>" text).
For context, this is supposed to run on emacs (in case it matters).
EDIT: Just to clarify as per @phils' question, all I care about is extracting the tags value. However, this is done via a package setting that asks for a regex string, and I don't have much control over how it gets used. It seems to expect a regex that strips what I don't need from the string rather than one that matches what I do want, which is slightly annoying. Also, since it seems to match everything with \\(.\\), I'm guessing it's using the global flag?
Please let me know if any of this isn't clear.
Emacs regular expressions can't trivially express "not foo" for arbitrary values of foo. (The likes of PCRE have non-regular extensions for zero-width negative look-ahead/behind assertions, but in Emacs that sort of functionality is generally done with the support of lisp code [1].)
You can still do it purely with regexp matching, but it's simply very cumbersome. An Emacs regexp which matches any line which does not begin with tags: is:
^\(?:$\|[^t]\|t[^a]\|ta[^g]\|tag[^s]\|tags[^:]\).*
or if you need to enter it in the elisp double-quoted read syntax for strings:
"^\\(?:$\\|[^t]\\|t[^a]\\|ta[^g]\\|tag[^s]\\|tags[^:]\\).*"
[1] In lisp code you would instead simply check each line to see whether it does start with tags: and, if so, skip it (which is why Emacs generally gets away without the feature you're looking for, but of course that doesn't help you here).
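To see what the alternation does, here is a small Python sketch that spells the same pattern in Python syntax (groups and alternation lose their backslashes) and compares it, line by line, with the plain "does the line start with tags:" check mentioned in the footnote. The variable names are invented for this example, and applying the pattern one line at a time is an assumption of the sketch, not how the Emacs package necessarily uses it.

```python
import re

# Python spelling of the Emacs pattern above.
NOT_TAGS_LINE = re.compile(r"^(?:$|[^t]|t[^a]|ta[^g]|tag[^s]|tags[^:]).*")

sample = """title: Some title
date: 2022-08-15
tags: <value to extract>
identifier: 1234567
---------------------------
Some text
some more text"""

lines = sample.splitlines()

# Lines kept by the regex versus lines kept by a plain prefix check.
by_regex = [line for line in lines if NOT_TAGS_LINE.match(line)]
by_check = [line for line in lines if not line.startswith("tags:")]
```

On this sample both lists contain every line except the tags: one. The two approaches are not identical in general: a line consisting only of a short prefix such as "ta" is not matched by the alternation, even though it does not start with tags:.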
After playing around with it for a bit and taking inspiration from @phils' answer, I've come up with the following:
"^\\(?:\\(#\\+\\)?\\(?:filetags:\s+\\|tags:\s+\\|title:.*\\|identifier:.*\\|date:.*\\)\\|.*\\)"
I've also added an extra \\(#\\+\\)? to account for org meta keys which would usually have the format #+key: value.

How do I select src between <> if img exists?

I need to select src=" using a regular expression in the form: //, but only if it is within an image tag.
This should return true:
<img alt="Alt text" src="/directory/Images/my-image.jpg" />
This to return false:
<script type="text/javascript" async="" src="https://www.google-analytics.com/analytics.js"></script>
The end result will be replacing the src=", which the application I am using performs; I just need the regex for the find.
First, the standard disclaimer: if you are using regexes to parse an HTML DOM, you are DOING IT WRONG. As with all structured data (XML, JSON, and so forth), the right way to parse HTML is to use something built for that purpose, and query it using its querying system.
That said, it is often the case that what you want is a quick hack on the commandline or the search field of an editor or whatever, and you don't want or need to faff with writing an application that loads in DOM-parsing libraries.
In that case, if you're not actually writing a program, and you don't mind that there are edge-cases where any regex you try will break, then consider something like this:
/<img\b[^<>]+\bsrc\s*=\s*"([^"]+)"/i ... maybe replacing the leading / and trailing /i with whatever other thing your language uses to denote a case-insensitive regular expression.
Note that this makes assumptions, that the url is quoted with doublequotes, the tag is correctly formed, there are no extraneous <img strings in the document, there are no doublequotes in the URL, and countless others that I didn't think of, but a proper parser would. These assumptions are a large part of why using a parser is so important: it makes no such assumptions, and if fed garbage, will correctly let you know that you did so, rather than trying to digest it and giving you pain later on.
<img\b - an img tag. The word boundary ensures this isn't an imgur tag or whatever.
[^<>]+ - one or more characters, with no closing tag, and for safety, no opening tags either.
\bsrc\s*=\s* - 'src=', but with optional whitespace, and another word-boundary check.
"([^"]+)" - some URL consisting of non-quote characters, within quotes.
Now, be aware that since we're doing NO security checking on the URL, you could be grabbing anything, such as javascript:...something malicious..., or it could be 6GB long - you just don't know. You could add in checking for such things, but you'll always miss something, unless you control the input and know exactly what you're parsing.
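For a concrete feel of how the pattern behaves, here is a minimal Python sketch run against the two tags from the question. The helper name find_img_src is made up for this example, and re.IGNORECASE stands in for the trailing /i; all the assumptions listed above (double quotes, well-formed tags) still apply.

```python
import re

# Same pattern as above; re.IGNORECASE replaces the /i suffix.
IMG_SRC = re.compile(r'<img\b[^<>]+\bsrc\s*=\s*"([^"]+)"', re.IGNORECASE)


def find_img_src(html):
    """Return the src values of all <img> tags the pattern can see."""
    return IMG_SRC.findall(html)


img_tag = '<img alt="Alt text" src="/directory/Images/my-image.jpg" />'
script_tag = ('<script type="text/javascript" async="" '
              'src="https://www.google-analytics.com/analytics.js"></script>')
```

Calling find_img_src(img_tag) yields the image path, while find_img_src(script_tag) yields an empty list because the tag does not start with <img.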
Your mention of "my application" does mean that I must reiterate: the above is almost certainly the wrong way to do it if you are writing an application, and the question you should be asking is probably closer to "how do I get the value of the src attribute of an img tag from a HTML page, in my chosen programming language?" rather than "how do I use regexes to extract this token from this HTML tag?"
When I say this, I don't mean "ivory-tower computer scientists will look down their nose at you" - though I admit there can be a lot of that kind of snootiness in programming :D
I mean something more like... "you're setting yourself up for pain as you run into edge-case after edge-case, and spiral down into a deep rabbit-hole of infinitely refining your regex." And you can likely avoid the pain with a simple one-liner, infinitely nicer than regex, perhaps document.querySelector('img[src^="/directory/Images"]') as @LGSon suggests in a comment.
People will say this because they've had this pain, and they're wincing at the idea that you might suffer it too.
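If the work does end up in a program, a parser-based version might look like the following sketch. It uses Python's standard html.parser module purely as an illustration (the question does not say which language the application uses), and the names ImgSrcCollector and img_sources are invented for this example.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; self-closing tags
        # such as <img ... /> are routed here as well by default.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value is not None:
                    self.sources.append(value)


def img_sources(html):
    collector = ImgSrcCollector()
    collector.feed(html)
    collector.close()
    return collector.sources
```

Unlike the regex, this does not care about quoting style or attribute order, because the parser has already split the tag into name and attributes.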
There are several ways to match that. This RegEx is just an example, and it is certainly not the best expression:
(src=")(.+)(.jpg|.JPG|.PNG|.png|.JPEG)"
You can wrap your target image URLs with a capturing group (), maybe similar to this expression:
(src=")((.+)(.jpg|.JPG|.PNG|.png|.JPEG))"
and simply call it using $2 (group #2).
You can also simplify it as you wish by adding the ignore-case flag, as in this expression:
src="((.+)(\.[a-rt-z]+))"

Regex: How to parse images not only the <img> tags

For the last whole week I have been busting my head to find a Regular Expression which could parse all the images in a html source file. I know that there are many out there but mainly they parse the tags. The tricky part is sometimes the images are in javascript and sometimes they have weird long formats such as :
http://pinterest.com/pin/create/button/?url=http://www.designscene.net/2015/07/binx-walton-josephine-le-tutour-vera-wang.html&media=http://www.designscene.net/wp-content/uploads/2015/07/Vera-Wang-Fall-Winter-2015-Patrick-Demarchelier-03-620x806.jpg&description=Binx Walton and Josephine Le Tutour for Vera Wang FW15
I have tried negative lookaheads and booleans but could not find a good solution. Please give me a perspective.
Well, as you said, there are many ways to do that, and to be honest there is no regex solution which could parse all the html files out there. I have tried it in the past as well. For me the below worked the best:
/(?:.(?!http|\,))+(\.jpg|\.png)
A bit of an explanation:
/......(.jpg|.png) starts at the first slash it finds and continues until it finds an image extension
. any character between the slash and the extension
(?:.(?!http|\,))+ skips the match if there is http or , in it (works like a charm for the example link you have given)
Hope it helps; regex is a very complex world. You can write the same expression in so many different ways. Maybe there is a better solution than the one I suggest.
Would this help?
https://regex101.com/r/jP4tV7/4
(http[^&"']+(?:jpg|gif|jpeg|png))(?:\&|'|")
You should be able to search for any url that ends in an image-extension. This quick and dirty expression should do it
(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})[\/\w \.-]*(jpg|png|gif|jpeg|tif|tiff)
Available at: http://regexr.com/3bg8o
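As a quick check of the second pattern above, the one starting with (http[^&"']+, here is a small Python sketch. The sample markup and the example.com URLs are made up for illustration; they mimic the shape of the Pinterest-style link from the question, where the image URL sits in a query parameter, plus an image URL inside a JavaScript string.

```python
import re

# An URL that starts with http, contains no &, " or ', ends in an image
# extension, and is followed by one of those delimiters.
IMAGE_URL = re.compile(r"""(http[^&"']+(?:jpg|gif|jpeg|png))(?:\&|'|")""")

# Hypothetical sample: one image inside a query string, one inside a script.
sample = (
    '<a href="http://example.com/share?url=http://example.com/post.html'
    '&media=http://example.com/uploads/photo.jpg&description=x">share</a>'
    "<script>var img = 'http://example.com/js-image.png';</script>"
)

image_urls = IMAGE_URL.findall(sample)
```

Here image_urls holds the two image URLs and skips the share link and the .html page, since neither ends in an image extension before a delimiter.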

RegExp get string inside string

Let's presume we have something like this:
<div1>
<h1>text1</h1>
<h1>text2</h1>
</div1>
<div2>
<h1>text3</h1>
</div2>
Using RegExp we need to get text1 and text2 but not text3.
How to do this?
Thanks in advance.
EDIT:
This is just an example.
The text I'm parsing could be just plain text.
The main thing I want to accomplish is list all strings from a specific section of a document.
I gave this HTML code for example as it perfectly resembles the thing I need to get.
(?siU)<h1>(.*)</h1> would parse all three strings, but how to get only first two?
EDIT2:
Here is another rather dumb example. :)
Section1
This is a "very" nice sentence.
It has "just" a few words.
Section2
This is "only" an example.
The End
I need quoted words from first but not from second section.
Yet again, (?siU)"(.*)" returns quoted words from whole text,
and I need only those between words Section1 and Section2.
This is for the "Rainmeter" application, which apparently uses Perl regex syntax.
I'm sorry, but I can't explain it better. :)
For the general case of the two examples provided -- for use in Rainmeter regex -- you can use:
(?siU)<h1>(.*)</h1>(?=.+<div2>) for the first sample and
(?siU)"(.*)"(?=.+Section2) for the second.
Note that Rainmeter seems to escape things for you, but you might need to change " to \", above.
These both use positive lookahead, but beware: both solutions will fail in the case of nested tags/structures or if there are multiple Section1's and Section2's. Regex is not the best tool for this kind of parsing.
But maybe this is good enough for your current needs?
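For readers who want to try the lookahead idea outside Rainmeter, here is a Python sketch of both patterns. Python has no U (ungreedy) flag, so this version writes the lazy quantifier .*? explicitly and passes re.S | re.I for the s and i flags; the constant names are invented for the example.

```python
import re

# (?siU) in PCRE roughly corresponds to re.S | re.I plus a lazy ".*?".
FLAGS = re.S | re.I
H1_BEFORE_DIV2 = re.compile(r"<h1>(.*?)</h1>(?=.+<div2>)", FLAGS)
QUOTED_BEFORE_SECTION2 = re.compile(r'"(.*?)"(?=.+Section2)', FLAGS)

markup = (
    "<div1>\n<h1>text1</h1>\n<h1>text2</h1>\n</div1>\n"
    "<div2>\n<h1>text3</h1>\n</div2>"
)
prose = (
    'Section1\nThis is a "very" nice sentence.\nIt has "just" a few words.\n'
    'Section2\nThis is "only" an example.\nThe End'
)

headings = H1_BEFORE_DIV2.findall(markup)
quoted = QUOTED_BEFORE_SECTION2.findall(prose)
```

A match is kept only when the end marker (<div2> or Section2) still lies somewhere ahead of it, which is why text3 and "only" are left out.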
Use a DOM library and getElementsByTagName('div') and you'll get a nodeList back. You can reference the first item with ->item(0) and then getElementsByTagName('h1') using the div as a context node, grab the text with ->nodeValue property.

Regex challenge: Match phrase only if outside of an <a href> tag

I am working on improving our glossary functionality in a custom CMS that is running with classic ASP (ASP 3.0) on IIS with VBScript code. I am stumped on a regex challenge I cannot solve.
Here is the current code:
If InStr(ART_ArticleBody, "href") = False then
    sql="SELECT URL, Term, RegX FROM GLOSSARYDB;"
    Set rsGlossary = Server.CreateObject("ADODB.Recordset")
    rsGlossary.open sql, strSQLConn
    Set RegExObject = New RegExp
    While Not rsGlossary.EOF
        URL = rsGlossary("URL")
        Phrase = rsGlossary("RegX")
        With RegExObject
            .Pattern = Phrase
            .IgnoreCase = true
            .Global = false
        End With
        set expressionmatch = RegExObject.Execute(ART_ArticleBody)
        if expressionmatch.count > 0 then
            For Each expressionmatched in expressionmatch
                RegExObject.Pattern = Phrase
                URL = ""& expressionmatched.Value & ""
                ART_ArticleBody = RegExObject.Replace(ART_ArticleBody, URL)
            next
        end if
        rsGlossary.movenext
    wend
    rsGlossary.movefirst
    Set RegExObject = nothing
end if
Instead of skipping putting glossary links in any article that has an href in it, as the above code does, I would like to change the code to process every article but have the RegEx pattern avoid matching on a glossary entry if the match is inside of an a tag.
For example, in italics below is a test example for this regex entry in my DB: ROI|return on investment|investment return
Here is a link that uses the glossary term: Info on return on investment.
Now, here is the glossary term in plain text, not inside of a link: return on investment.
We want to find the third instance of a match but not find the first two because they are both inside of a HTML link.
In the above text, if I were processing the article for the glossary entry "ROI|return on investment|investment return", I do not want to match the first or second occurrence, because they are inside an a tag. I need the regex pattern to skip over those matches and only match occurrences that are not inside an a tag.
Any help on this would be greatly appreciated.
Try this regex:
<a\b[^<>]*>[\s\S]*?</a>|(ROI|return on investment|investment return)
This matches an HTML anchor, or any of the terms you're looking for. The terms are captured into group number 1. So in your VBScript code, check if the first capturing group matched anything, and you've got one of your keywords outside an <a> tag.
This regex indeed won't work correctly if you have nested <a> tags. That shouldn't be a problem, as anchors are normally not nested inside each other. If it is a problem, you can't solve it with VBScript/JavaScript regular expressions. The regex also won't work correctly if you have <a> tags that are missing their closing tags. If you want to take that into account, try this regex:
<a\b[^<>]*>(?:(?:(?!<a\b)[\s\S])*?</a>)?|(ROI|return on investment|investment return)
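The "match the anchor or capture the term" idea can be sketched in Python as follows. This is only an illustration of the technique, not a port of the VBScript above: the function name link_terms and the /glossary/roi URL are made up, and unlike the original code (which sets Global to false) this sketch links every occurrence outside an anchor.

```python
import re

# First alternative swallows whole anchors; the second captures a term
# in group 1, which is therefore only set outside of <a>...</a>.
PATTERN = re.compile(
    r"<a\b[^<>]*>[\s\S]*?</a>|(ROI|return on investment|investment return)",
    re.IGNORECASE,
)


def link_terms(body, url):
    """Wrap glossary terms in a link, leaving existing anchors untouched."""

    def replace(match):
        if match.group(1) is None:
            return match.group(0)  # an existing anchor: keep it as-is
        return '<a href="{}">{}</a>'.format(url, match.group(1))

    return PATTERN.sub(replace, body)


body = ('Here is a link: <a href="/page">Info on return on investment</a>. '
        'Plain text: return on investment.')
linked = link_terms(body, "/glossary/roi")
```

The existing link is copied through unchanged, and only the plain-text occurrence is wrapped in a new anchor.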
This problem is, as they say, "non-trivial" in its current state. However, if you could modify your system to output more semantic markup, it would make things much easier:
undesired tag match
This is <span class="tag">a tag</span>
In this case, you can simply search:
(?<=<span class=\"tag\">)(phrase1|phrase2|phrase3)(?=</span>)
Or something a little more robust
(?<=<span class=\"tag\">).+?(?=</span>)
This way you can easily focus your searches to data within a specific <span>, and leave everything else aside.
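A minimal Python sketch of the second, more general pattern is shown below. Python only allows fixed-width lookbehind, which works here because the opening tag is a fixed string; the sample markup and the name TAGGED are assumptions for the example.

```python
import re

# Text that is preceded by the exact opening tag and followed by </span>.
TAGGED = re.compile(r'(?<=<span class="tag">).+?(?=</span>)')

sample = ('This is <span class="tag">a tag</span>, this is '
          '<span>not one</span>, and <span class="tag">another</span>.')

tagged = TAGGED.findall(sample)
```

Only the contents of the spans carrying class="tag" are returned; the plain <span> is ignored because the lookbehind does not match its opening tag.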
You can't solve it because it can't be done, at least not with 100% reliability. HTML is not a "regular" language in the regular expression sense. Like the saying goes, when you have a hammer, everything starts to look like a nail. There are some things regular expressions aren't good at. This is one of them.
Most languages have some form of HTML parsing library as standard or easily obtained. Use those. That's what they were designed for.
In general, you can't use a regular expression to recognize arbitrarily nested constructs (such as bracket-delimited HTML tags). If you had solved this problem, there'd be a lot of mathematicians lining up to hear about it. :)
Having said that, .NET does indeed offer an extension to regular expressions that permits what I just said was impossible, and--even better!--the sample chapter for the great "Mastering Regular Expressions" available here happens to cover that feature.
(accounts receivable|A/R)(?!((?!</?a\b).)*</a)
(phrase1|phrase2|phrase3)(?!((?!</?a\b).)*</a)
The above approach seems to work, at least in my RegexBuddy software. I didn't figure it out on my own. Had some help from a guru. Time to test it in my ASP code. Thanks to all who provided input. I'm sure I didn't describe what I needed well enough for you to come up with the above solution. Mea culpa.
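The same negative-lookahead pattern can be tried in Python; the sketch below uses the glossary terms from the question and a made-up sample sentence. It assumes that an anchor and its closing tag sit on the same line, since . does not cross newlines without an extra flag.

```python
import re

# A term counts only if no </a follows before the next <a or </a tag,
# i.e. the term is not sitting inside an open anchor.
OUTSIDE_ANCHOR = re.compile(
    r"(ROI|return on investment|investment return)(?!((?!</?a\b).)*</a)",
    re.IGNORECASE,
)

body = ('Here is a link: <a href="/page">Info on return on investment</a>. '
        'Plain text: return on investment.')

hits = [(m.start(), m.group(1)) for m in OUTSIDE_ANCHOR.finditer(body)]
```

Only the plain-text occurrence near the end of the sample is reported; the one inside the link is rejected because a closing </a follows it directly.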