How to extract only first match with regexp? - regex

Here are example strings:
<option value="20150110.1932.mtsat_2.visir.bckgr.NW_Pacific_Overview.DAYNGT.jpg">20150110.1932.mtsat_2.visir.bckgr.NW_Pacific_Overview.DAYNGT.jpg</option>
<option value="20150110.1901.mtsat_2.visir.bckgr.NW_Pacific_Overview.NGT.jpg">20150110.1901.mtsat_2.visir.bckgr.NW_Pacific_Overview.NGT.jpg</option>
Expected output:
20150110.1932.mtsat_2.visir.bckgr.NW_Pacific_Overview.DAYNGT.jpg
20150110.1901.mtsat_2.visir.bckgr.NW_Pacific_Overview.NGT.jpg
I need to extract only the first match of the file name from each string, without the quote symbols. How can I do it?
The pattern starts with 8 digits and ends with jpg.
I am programming in D.
I have tried a lot of variants, and none of them work properly. One of them:
(="[0-9]{8}).+(\")

I know you don't want to use a html parser, but I want to show how simple that is for people in the future who find this question.
Regex kinda sorta works for HTML sometimes, but there are a lot of things it doesn't do: it would leave HTML entities (&amp; for example) undecoded, and extracting the right tag can be hard. An HTML parser makes it easy and correct (and IMO more readable):
My dom.d does a decent job on HTML, so I'll show how to use it.
Grab dom.d from my github:
https://github.com/adamdruppe/arsd/blob/master/dom.d
( and if you are parsing tag soup from random non UTF-8 websites, characterencodings.d too: https://github.com/adamdruppe/arsd/blob/master/characterencodings.d )
Then you can do it like this:
import arsd.dom;
import std.stdio;
void main() {
    auto document = new Document("your html string here");
    foreach(option; document.querySelectorAll("option"))
        writeln(option.value); // or option.innerText
}
Compile with dmd yourfile.d dom.d. (add characterencodings.d to the command line if you need to handle non utf-8 too)
querySelectorAll works like CSS selectors, similar to the same function in Javascript and in jQuery, so you can put in context too to extract the option tags from the rest of the html document.
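For readers who are not using D, the same parser-based idea can be sketched with Python's standard library html.parser module. This is only an illustration and has nothing to do with dom.d itself; the class name OptionValues and the sample markup are made up for the example.

```python
from html.parser import HTMLParser


class OptionValues(HTMLParser):
    """Collects the value attribute of every <option> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.values = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attribute values arrive entity-decoded.
        if tag == "option":
            for name, value in attrs:
                if name == "value":
                    self.values.append(value)


# Made-up sample input, including an entity to show that it gets decoded.
parser = OptionValues()
parser.feed('<select><option value="a&amp;b.jpg">a&amp;b.jpg</option></select>')
option_values = parser.values
print(option_values)
```

Note that the entity in the attribute is decoded by the parser, which is one of the things a plain regex would not do.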

You can try this regex:
(?<=>)([0-9]{8}.+)(?=<)
Online demo
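A quick way to check the lookaround idea is a small Python sketch (the question is about D, so treat this purely as an illustration). It assumes the file name always sits directly between > and <, and it swaps the greedy .+ for [^<]* so that a match cannot run across several tags on one line.

```python
import re

html = (
    '<option value="20150110.1932.mtsat_2.visir.bckgr.NW_Pacific_Overview.DAYNGT.jpg">'
    '20150110.1932.mtsat_2.visir.bckgr.NW_Pacific_Overview.DAYNGT.jpg</option>\n'
    '<option value="20150110.1901.mtsat_2.visir.bckgr.NW_Pacific_Overview.NGT.jpg">'
    '20150110.1901.mtsat_2.visir.bckgr.NW_Pacific_Overview.NGT.jpg</option>'
)

# Lookbehind for ">" and lookahead for "<": only the tag text matches,
# never the copy inside the value="..." attribute.
pattern = re.compile(r'(?<=>)[0-9]{8}[^<]*\.jpg(?=<)')
names = pattern.findall(html)

for name in names:
    print(name)
```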

\b\d{8}[^" ]*\.jpg(?![^"]*"(?:[^"]*"[^"]*")*[^"]*$)
Try this. See demo.
https://regex101.com/r/fA6wE2/20

Related

RegExp find wrong tags

I have some urls saved in DB like hello world
with break tags, so I need to delete them. The problem is that <br/> tags are in other places too, so I can't delete all of them.
I wrote the RegExp <*"*<br\/?>"> but it selects not only the <br> but the quotes too.
You really shouldn't be using regular expressions for parsing HTML or XML.
Having said that. As I understand it, you have br tags inside the href attribute of a tags.
try :
href\s*?=\s*?\"(.*?)(<br\/?\>)\"
If you are trying to find the affected lines in the database, then this is your regex extended to match the whole line:
<.*\".*<br\/>\">.*>
After this you can match the '<br/>' directly in those lines. Is there a language to edit your DB?
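If the cleanup can be done in a script rather than in the database itself, one possible approach is to touch only the text inside href="..." and leave every other <br/> alone. Below is a minimal Python sketch of that idea; the function name strip_br_in_href and the sample string are assumptions, since the original example markup did not survive.

```python
import re


def strip_br_in_href(html):
    """Remove <br>, <br/> or <br /> only when it appears inside href="..."."""

    def clean(match):
        prefix, value, quote = match.groups()
        return prefix + re.sub(r'<br\s*/?>', '', value) + quote

    # Group 2 is the attribute value; everything outside href is untouched.
    return re.sub(r'(href\s*=\s*")([^"]*)(")', clean, html)


# Hypothetical input: one break tag inside the href, one after the link.
sample = '<a href="http://example.com/<br/>">hello world</a><br/>'
cleaned = strip_br_in_href(sample)
print(cleaned)
```

The break tag after the link is kept, only the one inside the attribute is removed.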
Some of the other answers here are okay. I'll offer an alternative:
https://regex101.com/r/uG5PBA/2
This'll put the break tags in a capture group -- group 1, so that you can simply nix them.
Regex:
<a[\s\S]*?(\<br\/>)[\s\S]*?<\/a>
Test String:
hello worldhello world

Group Extraction with Regular Expressions [duplicate]

I want a regular expression to extract the title from a HTML page. Currently I have this:
title = re.search('<title>.*</title>', html, re.IGNORECASE).group()
if title:
title = title.replace('<title>', '').replace('</title>', '')
Is there a regular expression to extract just the contents of <title> so I don't have to remove the tags?
Use ( ) in regexp and group(1) in python to retrieve the captured string (re.search will return None if it doesn't find the result, so don't use group() directly):
title_search = re.search('<title>(.*)</title>', html, re.IGNORECASE)
if title_search:
title = title_search.group(1)
Note that starting in Python 3.8, and the introduction of assignment expressions (PEP 572) (:= operator), it's possible to improve a bit on Krzysztof Krasoń's solution by capturing the match result directly within the if condition as a variable and re-use it in the condition's body:
# pattern = '<title>(.*)</title>'
# text = '<title>hello</title>'
if match := re.search(pattern, text, re.IGNORECASE):
title = match.group(1)
# hello
Try using capturing groups:
title = re.search('<title>(.*)</title>', html, re.IGNORECASE).group(1)
May I recommend Beautiful Soup? It is a very good library for parsing a whole HTML document.
soup = BeautifulSoup(html_doc)
titleName = soup.title.string
Try:
title = re.search('<title>(.*)</title>', html, re.IGNORECASE).group(1)
re.search('<title>(.*)</title>', s, re.IGNORECASE).group(1)
The provided pieces of code do not cope with exceptions (re.search returns None when nothing matches).
May I suggest:
getattr(re.search(r"<title>(.*)</title>", s, re.IGNORECASE), 'groups', lambda:[u""])()[0]
This returns an empty string by default if the pattern has not been found, or the first match.
I'd think this should suffice:
#!python
import re
pattern = re.compile(r'<title>([^<]*)</title>', re.MULTILINE|re.IGNORECASE)
pattern.search(text)
... assuming that your text (HTML) is in a variable named "text."
This also assumes that there are no other HTML tags which can be legally embedded inside of an HTML TITLE tag and there exists no way to legally embed any other < character within such a container/block.
However ...
Don't use regular expressions for HTML parsing in Python. Use an HTML parser! (Unless you're going to write a full parser, which would be a lot of extra, redundant work when various HTML, SGML and XML parsers are already in the standard libraries.)
If you're handling "real world" tag soup HTML (which is frequently non-conforming to any SGML/XML validator) then use the BeautifulSoup package. It isn't in the standard libraries (yet) but is widely recommended for this purpose.
Another option is: lxml ... which is written for properly structured (standards conformant) HTML. But it has an option to fallback to using BeautifulSoup as a parser: ElementSoup.
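As a rough illustration of the "use a parser" advice with nothing but the standard library, here is a sketch built on html.parser. The names TitleParser and extract_title are made up for this example, and it only keeps the first title element it sees.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text of the first <title> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        # Text may arrive in several chunks, so collect and join later.
        if self.in_title:
            self.parts.append(data)


def extract_title(html):
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts) if parser.done else None


print(extract_title("<html><head><title>Tom &amp; Jerry</title></head></html>"))
```

Entities are decoded along the way, and a page without a title gives None instead of raising.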
The currently top-voted answer by Krzysztof Krasoń fails with <title>a</title><title>b</title>. Also, it ignores title tags crossing line boundaries, e.g., for line-length reasons. Finally, it fails with <title >a</title> (which is valid HTML: White space inside XML/HTML tags).
I therefore propose the following improvement:
import re
def search_title(html):
m = re.search(r"<title\s*>(.*?)</title\s*>", html, re.IGNORECASE | re.DOTALL)
return m.group(1) if m else None
Test cases:
print(search_title("<title >with spaces in tags</title >"))
print(search_title("<title\n>with newline in tags</title\n>"))
print(search_title("<title>first of two titles</title><title>second title</title>"))
print(search_title("<title>with newline\n in title</title\n>"))
Output:
with spaces in tags
with newline in tags
first of two titles
with newline
in title
Ultimately, I go along with others recommending an HTML parser - not only, but also to handle non-standard use of HTML tags.
I needed something to match package-0.0.1 (name, version) but want to reject an invalid version such as 0.0.010.
See regex101 example.
import re
RE_IDENTIFIER = re.compile(r'^([a-z]+)-((?:(?:0|[1-9](?:[0-9]+)?)\.){2}(?:0|[1-9](?:[0-9]+)?))$')
example = 'hello-0.0.1'
if match := RE_IDENTIFIER.search(example):
name, version = match.groups()
print(f'Name: {name}')
print(f'Version: {version}')
else:
raise ValueError(f'Invalid identifier {example}')
Output:
Name: hello
Version: 0.0.1
Is there a particular reason why no one suggested using lookahead and lookbehind? I got here trying to do the exact same thing and (?<=<title>).+(?=<\/title>) works great. It will only match what's between the two lookaround groups, so you don't have to do the whole group thing.
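In Python this could look roughly like the sketch below. Two things are assumptions on my side: the sample HTML, and the switch to a lazy .+? with re.DOTALL so that two title tags or a line break inside the title do not break the match.

```python
import re

html = "<html><head><title>Example page</title></head></html>"

# The lookbehind and lookahead are checked but not included in the match,
# so group(0) is already just the title text.
match = re.search(r"(?<=<title>).+?(?=</title>)", html, re.IGNORECASE | re.DOTALL)
title = match.group(0) if match else None
print(title)
```

Python only allows fixed-width lookbehind, so this form does not cover variants such as <title > with extra whitespace.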

Preg_match_all with nested matches

I'm developing a template system and running into some issues.
The plan is to create HTML documents with [#tags] in them.
I could just use str_replace (I can loop through all possible replacements), but I want to push this a little further ;-)
I want to allow nested tags, and allow parameters with each tag:
[#title|You are looking at article [#articlenumber] [#articlename]]
I would like to get the following results with preg_match_all:
[0] title|You are looking at article [#articlenumber] [#articlename]
[1] articlenumber
[2] articlename
My script will split the | for parameters.
The output from my script will be something like:
<div class='myTitle'>You are looking at article 001 MyProduct</div>
The problem I'm having is that I'm not experienced with regex. All my patterns give almost what I want, but have problems with the nested params.
\[#(.*?)\]
Will stop at the ] from articlenumber.
\[#(.*?)(((?R)|.)*?)\]
Is more like it, but it doesn't catch the articlenumber; https://regex101.com/r/UvH7zi/1
Hope someone can help me out! Thanks in advance!
You cannot do this using general Python regular expressions. You are looking for a feature similar to "balancing groups" available in the .NET RegEx's engine that allows nested matches.
Take a look at PyParsing that allows nested expression:
from pyparsing import nestedExpr
import pyparsing as pp
text = '{They {mean to {win}} Wimbledon}'
print(pp.nestedExpr(opener='{', closer='}').parseString(text))
The output is:
[['They', ['mean', 'to', ['win']], 'Wimbledon']]
Unfortunately, this does not work very well with your example. You need a better grammar, I think.
You can experiment with a QuotedString definition, but still.
import pyparsing as pp
single_value = pp.QuotedString(quoteChar="'", endQuoteChar="'")
parser = pp.nestedExpr(opener="[", closer="]",
content=single_value,
ignoreExpr=None)
example = "['#title|You are looking at article' ['#articlenumber'] ['#articlename']]"
print(parser.parseString(example, parseAll=True))
I'm typing this on my phone so there might be some mistakes, but what you want can be quite easily achieved by incorporating a lookahead into your expression:
(?=\\[(#(?:\\[(?1)\\]|.)*)\\])
Edit: Yup, it works, here you go: https://regex101.com/r/UvH7zi/4
Because (?=) consumes no characters, the pattern looks for and captures the contents of all "[#*]" substrings in the subject, recursively checking that the contents themselves contain balanced groups, if any.
here is my code:
#\w+\|[\w\s]+\[#(\w+)]\s+\[#(\w+)]
https://regex101.com/r/UvH7zi/3
For now I've created a parser:
- get all opening tags, and put their strpos in an array
- loop through all start positions of the opening tags
- look for the next closing tag; is it before the next opening tag? Then the tag is complete
- if the closing tag was after an opening tag, skip that one and look for the next (and keep checking for opening tags in between)
That way I could find all complete tags and replace them.
But that took about 50 lines of code and multiple loops, so one preg_match would be greater ;-)
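The hand-written parser described above can be shortened a lot with a stack. Here is a minimal sketch in Python rather than PHP, under the assumption that every tag opens with [# and closes with the next unmatched ]; the function name find_tags is made up.

```python
def find_tags(text):
    """Return the contents of every complete [#...] tag, outermost first."""
    open_starts = []  # start offsets of tags that are still open
    found = []        # (start offset, tag content) for each completed tag
    i = 0
    while i < len(text):
        if text.startswith("[#", i):
            open_starts.append(i + 2)  # content begins after "[#"
            i += 2
        elif text[i] == "]" and open_starts:
            start = open_starts.pop()  # closes the most recently opened tag
            found.append((start, text[start:i]))
            i += 1
        else:
            i += 1
    # Inner tags complete first, so sort by where each tag started.
    return [content for _, content in sorted(found)]


template = "[#title|You are looking at article [#articlenumber] [#articlename]]"
for index, tag in enumerate(find_tags(template)):
    print(f"[{index}] {tag}")
```

This is a single pass over the string, and unbalanced opening tags are simply never reported.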

finding text between <script></script> tags with RegEx for Coldfusion including linebreaks

I am trying to extract javascript code from HTML content that I receive via CFHTTP request.
I have this simple regex that catches everything as long as there is no linebreak in the code between the tags.
var result=REMatch("<script[^>]*>(.*?)</script>",html);
This will catch:
<script>testtesttest</script>
but not
<script>
testtest
</script>
I have tried to use (?m) for multiline, but it doesn't work like that.
I am using the reference to figure it out but I am just not getting it with regex.
Heads up, normally there would be javascript between the script tags, not simple text so also characters like {}();:-_ etc.
Can anyone help me out?
Cheers
[[UPDATE]]
Thanks guys, I will try the solutions. I favor regex, but I will look into the HTML parser too.
(?m) multiline mode is for making ^ and $ match on line breaks (not just start/end of string as is default), but what you're trying to do here is make . include newlines - for that you want (?s) (dot-all mode).
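The difference between the two modes is easy to see in a small experiment. The sketch below uses Python's re module rather than ColdFusion, where the flags are called re.MULTILINE and re.DOTALL; the sample script block is made up.

```python
import re

html = "<script>\nvar x = 1;\n</script>"
pattern = r"<script[^>]*>(.*?)</script>"

# Default mode: "." stops at newlines, so a multi-line block does not match.
no_flag = re.search(pattern, html)

# Multiline mode only changes ^ and $, so it does not help here either.
multiline = re.search(pattern, html, re.MULTILINE)

# Dot-all mode lets "." match newlines as well.
dotall = re.search(pattern, html, re.DOTALL)

print(no_flag, multiline, repr(dotall.group(1)))
```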
However, I probably wouldn't do this with regex - a HTML parser is a more robust solution. Here's how to do it with jSoup:
var result = jsoup.parse(html).select('script').text();
More details on using jSoup in CF are available here, or alternatively you can use the TagSoup parser, which ships with CF10 (so you don't need to worry about jars/etc).
If you really want regex, then you can use this:
var result = rematch('<script[^>]*>(?:[^<]+|<(?!/script>))+',html);
Unlike using (?s).*? this avoids matching empty blocks (but it will still fail in certain edge cases - if accuracy is required use a HTML parser).
To extract just the text from the first script block, you can strip the script tag with this:
result = ListRest( result[1] , '>' );
You can use dot matches all mode or replace . with [\s\S] to get the same effect.
<script[^>]*>[\s\S]*?</script> would match everything including newlines.

Regular Expression to extract src attribute from img tag

I am trying to write a pattern for extracting the path for files found in img tags in HTML.
String string = "<img src=\"file:/C:/Documents and Settings/elundqvist/My Documents/My Pictures/import dialog step 1.JPG\" border=\"0\" />";
My Pattern:
src\\s*=\\s*\"(.+)\"
Problem is that my pattern will also include the 'border="0" part of the img tag.
What pattern would match the URI path for this file without including the 'border="0"?
Your pattern should be (unescaped):
src\s*=\s*"(.+?)"
The important part is the added question mark that matches the group as few times as possible
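To see what the question mark changes, here is a small comparison in Python (the question is about Java, so this is only an illustration; the path is shortened from the original). It also includes a negated character class, which avoids backtracking past the closing quote altogether.

```python
import re

tag = '<img src="file:/C:/My Pictures/import dialog step 1.JPG" border="0" />'

# Greedy: runs to the LAST double quote in the string.
greedy = re.search(r'src\s*=\s*"(.+)"', tag).group(1)

# Lazy: stops at the first double quote after src=".
lazy = re.search(r'src\s*=\s*"(.+?)"', tag).group(1)

# Negated class: can never cross a double quote.
negated = re.search(r'src\s*=\s*"([^"]+)"', tag).group(1)

print(greedy)
print(lazy)
print(negated)
```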
This one grabs the src only if it's inside of an img tag and not when it is written anywhere else as plain text. It also checks if you've added other attributes before or after the src attribute.
Also, it determines whether you're using single (') or double (") quotes.
\<img.+src\=(?:\"|\')(.+?)(?:\"|\')(?:.+?)\>
So for PHP you would do:
preg_match("/\<img.+src\=(?:\"|\')(.+?)(?:\"|\')(?:.+?)\>/", $string, $matches);
echo "$matches[1]";
for JavaScript you would do:
var match = text.match(/\<img.+src\=(?:\"|\')(.+?)(?:\"|\')(?:.+?)\>/)
alert(match[1]);
Hopefully that helps.
Try this expression:
src\s*=\s*"([^"]+)"
I solved it by using this regex.
/<img.*?src="(.*?)"/g
Validated in https://regex101.com/r/aVBUOo/1
You want to play with the non-greedy form of group capture. Something like
src\\s*=\\s*\"(.+?)\"
By default the regex will try to match as much as possible.
I am trying to write a pattern for extracting the path for files found in img tags in HTML.
Can we have an autoresponder for "Don't use regex to parse [X]HTML"?
Problem is that my pattern will also include the 'border="0" part of the img tag.
Not to mention any time 'src="' appears in plain text!
If you know in advance the exact format of the HTML you're going to be parsing (eg. because you generated it yourself), you can get away with it. But otherwise, regex is entirely the wrong tool for the job.
I'd like to expand on this topic, as the src attribute often comes unquoted. A regex that takes both the quoted and the unquoted src attribute is:
src\s*=\s*"?(.+?)["\s]
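A slightly stricter variant is to spell out the three cases (double-quoted, single-quoted, unquoted) as alternatives, so that a quoted path with spaces is not cut short. This is a sketch in Python under that assumption; the names SRC_RE and find_src are hypothetical, and it is still no substitute for an HTML parser.

```python
import re

# One alternative per quoting style; exactly one group is set on a match.
SRC_RE = re.compile(
    r"""src\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))""",
    re.IGNORECASE,
)


def find_src(tag):
    """Return the src value of the first src attribute found, or None."""
    match = SRC_RE.search(tag)
    if not match:
        return None
    return next(group for group in match.groups() if group is not None)


print(find_src('<img src="a b.jpg" border="0">'))
print(find_src("<img src='c.png'>"))
print(find_src("<img src=d.gif border=0>"))
```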