Group Extraction with Regular Expressions [duplicate] - regex

I want a regular expression to extract the title from an HTML page. Currently I have this:
title = re.search('<title>.*</title>', html, re.IGNORECASE).group()
if title:
    title = title.replace('<title>', '').replace('</title>', '')
Is there a regular expression to extract just the contents of <title> so I don't have to remove the tags?

Use ( ) in regexp and group(1) in python to retrieve the captured string (re.search will return None if it doesn't find the result, so don't use group() directly):
title_search = re.search('<title>(.*)</title>', html, re.IGNORECASE)
if title_search:
    title = title_search.group(1)

Note that starting with Python 3.8 and the introduction of assignment expressions (PEP 572, the := operator), it's possible to improve a bit on Krzysztof Krasoń's solution by capturing the match result directly in the if condition and reusing it in the condition's body:
# pattern = '<title>(.*)</title>'
# text = '<title>hello</title>'
if match := re.search(pattern, text, re.IGNORECASE):
    title = match.group(1)
# hello

Try using capturing groups:
title = re.search('<title>(.*)</title>', html, re.IGNORECASE).group(1)

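May I recommend Beautiful Soup? It is a very good library for parsing a whole HTML document.
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
titleName = soup.title.string  # .string is the text inside <title>; .name would just be 'title'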

re.search('<title>(.*)</title>', s, re.IGNORECASE).group(1)

The code in the other answers does not cope with the case where nothing matches (calling .group() on None raises an exception).
May I suggest:
getattr(re.search(r"<title>(.*)</title>", s, re.IGNORECASE), 'groups', lambda:[u""])()[0]
This returns the first captured group, or an empty string by default if the pattern has not been found.

I'd think this should suffice:
import re
pattern = re.compile(r'<title>([^<]*)</title>', re.MULTILINE|re.IGNORECASE)
pattern.search(text)
... assuming that your text (HTML) is in a variable named "text."
This also assumes that there are no other HTML tags which can be legally embedded inside of an HTML TITLE tag and there exists no way to legally embed any other < character within such a container/block.
However ...
Don't use regular expressions for HTML parsing in Python. Use an HTML parser! (Unless you're going to write a full parser, which would be a lot of extra, redundant work when various HTML, SGML and XML parsers are already in the standard libraries.)
If you're handling "real world" tag soup HTML (which is frequently non-conforming to any SGML/XML validator) then use the BeautifulSoup package. It isn't in the standard libraries (yet) but is widely recommended for this purpose.
Another option is: lxml ... which is written for properly structured (standards conformant) HTML. But it has an option to fallback to using BeautifulSoup as a parser: ElementSoup.
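To make the "use an HTML parser" advice concrete, here is a minimal sketch using only the standard library's html.parser module. The names TitleParser and extract_title are made up for this illustration, and it only looks at the first <title> element; treat it as a starting point under those assumptions, not as a robust solution for tag soup.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased, so <TITLE> is handled as well.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        # Text may arrive in several chunks, so collect and join later.
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        return "".join(self._parts) if self._done else None


def extract_title(html):
    """Return the text of the first <title>, or None if there is none."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title
```

Calling extract_title(html) gives the title text, or None when the page has no complete <title> element, so there is no .group() call that can blow up on a missing match.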

The currently top-voted answer by Krzysztof Krasoń fails with <title>a</title><title>b</title>. Also, it ignores title tags crossing line boundaries, e.g., for line-length reasons. Finally, it fails with <title >a</title> (which is valid HTML: White space inside XML/HTML tags).
I therefore propose the following improvement:
import re
def search_title(html):
    m = re.search(r"<title\s*>(.*?)</title\s*>", html, re.IGNORECASE | re.DOTALL)
    return m.group(1) if m else None
Test cases:
print(search_title("<title >with spaces in tags</title >"))
print(search_title("<title\n>with newline in tags</title\n>"))
print(search_title("<title>first of two titles</title><title>second title</title>"))
print(search_title("<title>with newline\n in title</title\n>"))
Output:
with spaces in tags
with newline in tags
first of two titles
with newline
in title
Ultimately, I agree with the others who recommend an HTML parser, not least because a parser also copes with non-standard use of HTML tags.

I needed something to match package-0.0.1 (name, version) but want to reject an invalid version such as 0.0.010.
See regex101 example.
import re
RE_IDENTIFIER = re.compile(r'^([a-z]+)-((?:(?:0|[1-9](?:[0-9]+)?)\.){2}(?:0|[1-9](?:[0-9]+)?))$')
example = 'hello-0.0.1'
if match := RE_IDENTIFIER.search(example):
    name, version = match.groups()
    print(f'Name: {name}')
    print(f'Version: {version}')
else:
    raise ValueError(f'Invalid identifier {example}')
Output:
Name: hello
Version: 0.0.1

Is there a particular reason why no one suggested using lookahead and lookbehind? I got here trying to do the exact same thing and (?<=<title>).+(?=<\/title>) works great. It only matches what's between the two lookarounds, i.e. the text between the tags, so you don't have to do the whole group thing.
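For Python readers, the lookaround idea could look roughly like the sketch below. Two things differ from the pattern quoted above and are my own assumptions: the quantifier is made lazy (.+?) so that two titles in one document are not merged, and re.DOTALL is added so a title may span lines. The function name title_via_lookaround is hypothetical. Note that Python's re module only accepts fixed-width lookbehinds, so this form cannot tolerate optional whitespace such as <title >.

```python
import re

# The lookbehind and lookahead assert the tags without consuming them,
# so the whole match (group 0) is just the title text.
TITLE_RE = re.compile(r"(?<=<title>).+?(?=</title>)", re.IGNORECASE | re.DOTALL)


def title_via_lookaround(html):
    """Return the text between the first <title>...</title> pair, or None."""
    match = TITLE_RE.search(html)
    return match.group(0) if match else None
```

The trade-off is that there is no capture group to index, but the None check is still needed when nothing matches.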

Related

RegEx remove part of string and replace another part

I have a challenge getting the desired result with RegEx (using C#) and I hope that the community can help.
I have a URL in the following format:
https://somedomain.com/subfolder/category/?abc=text:value&ida=0&idb=1
I want to make two modifications, specifically:
1) Remove everything after 'value' e.g. '&ida=0&idb=1'
2) Replace 'category' with e.g. 'newcategory'
So the result is:
https://somedomain.com/subfolder/newcategory/?abc=text:value
I can remove the string from 1), e.g. with ^[^&]+ applied to the URL above, but I have been unable to figure out how to replace the 'category' substring.
Any help or guidance would be much appreciated.
Thank you in advance.
Use the following:
Find: /(category/.+?value)&.+
Replace: /new$1 or /new\1 depending on your regex flavor
Demo & explanation
Update according to comment.
If the new name is completely_different_name, use the following:
Find: /category(/.+?value)&.+
Replace: /completely_different_name$1
Demo & explanation
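As a quick way to check the two find/replace pairs above, here is a small Python sketch (the question is about C#, so this only illustrates the patterns, which use \1 for the backreference in Python's replacement syntax). The variable names are mine.

```python
import re

url = "https://somedomain.com/subfolder/category/?abc=text:value&ida=0&idb=1"

# Keep "category/...value" in group 1, drop everything from the first "&" on,
# and prepend "new" to the kept part.
prefixed = re.sub(r"/(category/.+?value)&.+", r"/new\1", url)

# Variant for an unrelated new name: capture only what follows "category".
renamed = re.sub(r"/category(/.+?value)&.+", r"/completely_different_name\1", url)

print(prefixed)
print(renamed)
```

Both patterns rely on the literal text "value" being present in the query string, exactly as in the example URL.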
You haven't specified a language here; I mainly work in Python, so the solution is in Python.
import re
url = re.sub('category', 'newcategory', re.search('^https.*value', value).group(0))
Explanation
re.sub replaces every occurrence of pattern a with b in string c.
re.search looks for a pattern in a string and stores the match in groups. In the code above, re.search stores everything from "https" through "value" in group 0.
Using Python and only built-in string methods (there is no need for regular expressions here):
url = r"https://somedomain.com/subfolder/category/?abc=text:value&ida=0&idb=1"
new_url = (url.split('value')[0] + "value").replace("category", 'newcategory')
print(new_url)
Outputs:
https://somedomain.com/subfolder/newcategory/?abc=text:value

How to extract only first match with regexp?

Here are some example strings:
<option value="20150110.1932.mtsat_2.visir.bckgr.NW_Pacific_Overview.DAYNGT.jpg">20150110.1932.mtsat_2.visir.bckgr.NW_Pacific_Overview.DAYNGT.jpg</option>
<option value="20150110.1901.mtsat_2.visir.bckgr.NW_Pacific_Overview.NGT.jpg">20150110.1901.mtsat_2.visir.bckgr.NW_Pacific_Overview.NGT.jpg</option>
Expected output:
20150110.1932.mtsat_2.visir.bckgr.NW_Pacific_Overview.DAYNGT.jpg
20150110.1901.mtsat_2.visir.bckgr.NW_Pacific_Overview.NGT.jpg
I need to extract only the first match of the file name from each string, without the quote symbols. How can I do it?
The pattern starts with 8 digits and ends with jpg.
I am programming in D.
I have tried a lot of variants, and all of them are broken. One of them:
(="[0-9]{8}).+(\")
I know you don't want to use a html parser, but I want to show how simple that is for people in the future who find this question.
Regex kinda sorta works for HTML sometimes, but there are a lot of things it doesn't do: it would leave HTML entities (&amp; for example) undecoded, and extracting the right tag can be hard. An HTML parser makes it easy and correct (and IMO more readable):
My dom.d does a decent job on HTML, so I'll show how to use it.
Grab dom.d from my github:
https://github.com/adamdruppe/arsd/blob/master/dom.d
( and if you are parsing tag soup from random non UTF-8 websites, characterencodings.d too: https://github.com/adamdruppe/arsd/blob/master/characterencodings.d )
Then you can do it like this:
import arsd.dom;
import std.stdio;
void main() {
    auto document = new Document("your html string here");
    foreach(option; document.querySelectorAll("option"))
        writeln(option.value); // or option.innerText
}
Compile with dmd yourfile.d dom.d. (add characterencodings.d to the command line if you need to handle non utf-8 too)
querySelectorAll works like CSS selectors, similar to the same function in Javascript and in jQuery, so you can put in context too to extract the option tags from the rest of the html document.
You can try this regex:
(?<=>)([0-9]{8}.+)(?=<)
Online demo
\b\d{8}[^" ]*\.jpg(?![^"]*"(?:[^"]*"[^"]*")*[^"]*$)
Try this. See the demo:
https://regex101.com/r/fA6wE2/20
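The question is about D, but the idea of anchoring on the value attribute so each file name is captured exactly once can be sketched in Python. This is only an illustration: the pattern below is my own simplification (8 digits, anything except a quote, then .jpg), not one of the patterns proposed above, and D's std.regex may need slightly different syntax.

```python
import re

html = (
    '<option value="20150110.1932.mtsat_2.visir.bckgr.NW_Pacific_Overview.DAYNGT.jpg">'
    '20150110.1932.mtsat_2.visir.bckgr.NW_Pacific_Overview.DAYNGT.jpg</option>\n'
    '<option value="20150110.1901.mtsat_2.visir.bckgr.NW_Pacific_Overview.NGT.jpg">'
    '20150110.1901.mtsat_2.visir.bckgr.NW_Pacific_Overview.NGT.jpg</option>'
)

# Matching inside value="..." means the copy in the element text is ignored,
# and the capture group leaves out the surrounding quotes.
FILE_RE = re.compile(r'value="(\d{8}[^"]*\.jpg)"')

file_names = FILE_RE.findall(html)
for name in file_names:
    print(name)
```

Because the pattern is tied to the attribute, each option contributes one result even though the file name appears twice in the markup.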

What is the regex required to find specific urls within content from a list of urls generated by a for loop?

As I write this I realise there are two parts to this question, however I think I am only really stuck on the first part and therefore the second is only provided for context:
Part A:
I need to search the contents of each value returned by a for loop (where each value is a url) for the following:
href="/dir/Sub_Dir/dir/163472311232-text-text-text-text/page-n"
where:
the numerals 163472311232 could be any length (ie it could be 5478)
-text-text-text-text could be any number of different words
where page-n could be from page-2 up until any number
where matches are not returned more than once, ie only unique matches are returned and therefore only one of the following would be returned:
href="/dir/Sub_Dir/dir/5422-la-la/page-4
href="/dir/Sub_Dir/dir/5422-la-la/page-4
Part B:
So the logic would be something like:
list_of_urls = original_list
for url in list_of_urls:
    headers = {'User-Agent' : 'Mozilla 5.0'}
    request = urllib2.Request(url, None, headers)
    url_for_re = urllib2.urlopen(request).read()
    another_url = re.findall(r'href="(/dir/Sub_dir\/dir/[^"/]*)"', url_for_re, re.I)
    file.write(url)
    file.write('\n')
    file.write(another_url)
    file.write('\n')
Which I am hoping will give me output similar to:
a.html
a/page-2.html
a/page-3.html
a/page-4.html
b.html
b/page-2.html
b/page-3.html
b/page-4.html
So my question is (assuming the logic in part B is ok):
What is the required regex pattern to use for part A?
I am a newbie to Python and regex, so I may struggle to follow relatively complicated regex suggestions.
Update:
After suggestions I tried to test the following regex, which did not produce the results I expected:
import re
content = 'href="/dir/Sub_Dir/dir/5648342378-text-texttttt-texty-text-text/page-2"'
matches = re.findall(r'href="/dir/Sub_Dir/dir/[0-9]+-[a-zA-Z]+-[a-zA-Z]+-[a-zA-Z]+-[a-zA-Z]+/page-([2-9]|[1-9][0-9]+)"', content, re.I)
prefix = 'http://www.test.com'
for match in matches:
    i = prefix + match + '\n'
    print i
Solution:
I think this is the regex that will work:
matches = re.findall(r'href="(/dir/Sub_Dir/dir/[^"/]*/page-[2-9])"', content, re.I)
You can have... most of what you want. Regexes don't really do the distinct thing, so I suggest you just use them to get all the URLs, and then remove duplicates yourself.
Off the top of my head it would be something like this:
href="/dir/Sub_Dir/dir/[0-9]+-[a-zA-Z]+-[a-zA-Z]+-[a-zA-Z]+-[a-zA-Z]+/page-([2-9]|[1-9][0-9]+)"
Plus or minus escaping rules, specifics on what words are allowed, etc. I'm a Windows guy, there's a great tool called Expresso which is helpful for learning regexes. I hope there's an equivalent for whatever platform you're using, it comes in handy.
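Putting the two steps together (match all the links with a regex, then remove duplicates yourself) could look like the following Python sketch. The pattern is an assumption on my part: it combines the [^"/] idea from the question's own solution with a page-number alternation that accepts page-2 and up, and the helper name unique_page_links is made up.

```python
import re

# Digits, a hyphen, any run of non-quote/non-slash characters, then /page-N
# where N is 2..9 or any number with two or more digits.
PAGE_LINK_RE = re.compile(
    r'href="(/dir/Sub_Dir/dir/[0-9]+-[^"/]+/page-(?:[2-9]|[1-9][0-9]+))"',
    re.IGNORECASE,
)


def unique_page_links(content):
    """Return matching paths in first-seen order with duplicates removed."""
    matches = PAGE_LINK_RE.findall(content)
    # dict preserves insertion order, so this de-duplicates without reordering.
    return list(dict.fromkeys(matches))
```

Since findall returns a list, each entry would need to be written out separately (for example one per line) rather than passed to write() as a whole.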

Regular Expression to extract src attribute from img tag

I am trying to write a pattern for extracting the path for files found in img tags in HTML.
String string = "<img src=\"file:/C:/Documents and Settings/elundqvist/My Documents/My Pictures/import dialog step 1.JPG\" border=\"0\" />";
My Pattern:
src\\s*=\\s*\"(.+)\"
Problem is that my pattern will also include the 'border="0" part of the img tag.
What pattern would match the URI path for this file without including the 'border="0"?
Your pattern should be (unescaped):
src\s*=\s*"(.+?)"
The important part is the added question mark, which makes the quantifier lazy so that the group matches as few characters as possible.
This one grabs the src only if it's inside of an <img> tag and not when it is written anywhere else as plain text. It also allows for other attributes before or after the src attribute.
Also, it determines whether you're using single (') or double (") quotes.
\<img.+src\=(?:\"|\')(.+?)(?:\"|\')(?:.+?)\>
So for PHP you would do:
preg_match("/\<img.+src\=(?:\"|\')(.+?)(?:\"|\')(?:.+?)\>/", $string, $matches);
echo "$matches[1]";
for JavaScript you would do:
var match = text.match(/\<img.+src\=(?:\"|\')(.+?)(?:\"|\')(?:.+?)\>/)
alert(match[1]);
Hopefully that helps.
Try this expression:
src\s*=\s*"([^"]+)"
I solved it by using this regex.
/<img.*?src="(.*?)"/g
Validated in https://regex101.com/r/aVBUOo/1
You want the non-greedy form of the capture group. Something like
src\\s*=\\s*\"(.+?)\"
By default the regex will try to match as much as possible; the ? after the + makes it stop at the first closing quote.
> I am trying to write a pattern for extracting the path for files found in img tags in HTML.
Can we have an autoresponder for "Don't use regex to parse [X]HTML"?
> Problem is that my pattern will also include the 'border="0" part of the img tag.
Not to mention any time 'src="' appears in plain text!
If you know in advance the exact format of the HTML you're going to be parsing (eg. because you generated it yourself), you can get away with it. But otherwise, regex is entirely the wrong tool for the job.
I'd like to expand on this topic: the src attribute sometimes comes unquoted, so a regex that accepts both a double-quoted and an unquoted src attribute is:
src\s*=\s*"?(.+?)["\s]
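For comparison with the regex attempts in this thread, here is what the parser route might look like in Python using the standard library's html.parser (the question itself is about Java, so this is only a sketch of the approach). The class and function names are hypothetical. A parser deals with double-quoted, single-quoted and unquoted attribute values, and it ignores src=" appearing in plain text.

```python
from html.parser import HTMLParser


class ImgSrcParser(HTMLParser):
    """Collects the src attribute of every <img> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Also invoked for self-closing tags such as <img ... />.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value is not None:
                    self.sources.append(value)


def img_sources(html):
    """Return the src values of all img tags, in document order."""
    parser = ImgSrcParser()
    parser.feed(html)
    parser.close()
    return parser.sources
```

With the string from the question, img_sources returns a one-element list holding the file: path, and the border attribute never gets in the way.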

What is the best way to find and replace urls on a giant text?

I have a huge backup of my blog posts. All posts have images like:
"http://www.mysite.com/nonono-nonono.jpg"
or
"http://www.mysite.com/nonono-nonono.gif"
or even
"http://www.mysite.com/nonono.jpg"
But I have other links for URLs on the same domain, like "http://www.mysite.com/category/post.html", and I just want to replace the URLs for the images (luckily all images are in the root of the website).
Do I need to learn RegExp to do that? Is there any powerful tool to find and replace text like this? Thanks
Regular expressions will be your best bet... maybe something like this (based on the one from strfriend)?
^((ht|f)tp(s?)\:\/\/|~/|/)?([\w]+:\w+#)?([a-zA-Z]{1}([\w\-]+\.)+([\w]{2,5}))(:[\d]{1,5})?((/?\w+/)+|/?)(\w+\.(jpg|gif|png))?
Regular expressions are certainly one way to do it, and probably the most flexible. But if all of your image urls start with "http://www.mysite.com/" and end with ".jpg", then you can use string manipulation functions. For example, if you have a string variable called s, that you want to test:
const string mysite = "http://www.mysite.com/";
const string jpg = ".jpg";
string newString = string.Empty;
if (s.StartsWith(mysite))
{
    if (s.EndsWith(jpg))
    {
        string textToReplace = s.Substring(mysite.Length, s.Length - mysite.Length - jpg.Length);
        newString = s.Replace(textToReplace, "whatever you want to replace it with.");
    }
}
It's a rather brute force method, but it'll work.
I'm using RegExp on EditPad Pro. I'll find a good tutorial for beginners also. Thanks for the tip @CalvinR
It's possible with regular expressions, but I'd probably write a Python script using Beautiful Soup:
# fix_imgs.py
import sys
from bs4 import BeautifulSoup
for filename in sys.argv[1:]:
    contents = open(filename).read()
    soup = BeautifulSoup(contents, "html.parser")
    # replace the host in the src attribute of each img tag
    for img in soup.find_all('img', src=True):
        img['src'] = img['src'].replace("http://www.mysite.com", "http://www.example.com")
    new_contents = str(soup)
    output_filename = "replaced." + filename
    open(output_filename, "w").write(new_contents)
Honestly, I think you should learn regular expressions regardless. They are an extremely powerful tool for string manipulation and great to have up your sleeve, especially in situations such as this. Perl is also a great language to learn at the same time, as it makes using regular expressions a breeze.
To replace all filenames by 'new_image_name_here' in image urls:
$ perl -pe's~(http://.*?/)[^/]+?\.(jpg|gif)\b~$1new_image_name_here.$2~g' huge_file.html > output.html
To replace a netloc part by 'www.othersite.org' in 'http://<netloc>/<image_path>':
$ perl -pe's~(?<=http://)[^/]+(?=/(?:[^/]+/)*[^/]+?\.(?:jpg|gif)\b)~www.othersite.org~g' huge_file.html > output.html
These regexes are simple, so they are easily fooled. Use more specific regexes for your input data.
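A more specific regex for the situation in the question (images sit directly in the root of http://www.mysite.com/, other links have a path) might look like this Python sketch. The function name move_root_images and the default new_base are placeholders chosen for illustration, and the extension list only covers the jpg and gif files mentioned in the question.

```python
import re

# After the host, allow exactly one path segment (no further "/") ending in
# .jpg or .gif, so links such as /category/post.html are left alone.
ROOT_IMAGE_RE = re.compile(r'http://www\.mysite\.com/([^/"\s]+\.(?:jpg|gif))\b')


def move_root_images(text, new_base="http://www.example.com/"):
    """Rewrite root-level image URLs to new_base, keeping the file name."""
    return ROOT_IMAGE_RE.sub(lambda m: new_base + m.group(1), text)
```

For a really large backup the same function can be applied file by file or line by line, since each URL is matched independently.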