Regex to replace domain within links that are not images - regex

I need to replace the domain name in all links on the page that do not point to images or PDF files.
This would be a full html page received through a proxy service.
Example:
test<img src="http://www.test.com" /><a href="http://www.test.com/test.pdf">pdf
test1
Result:
test<img src="http://www.test.com" /><a href="http://www.test.com/test.pdf">pdf
test1

If you are using .NET, I strongly suggest using the HTML Agility Pack.
Parsing HTML directly with regex is very error-prone. This question is also similar to the post below.
What regex should I use to remove links from HTML code in C#?

If the domain is http://www.example.com, the following should do the trick:
/http:\/\/www\.example\.com(?![^\s"'<>]*\.(?:pdf|jpg|png|gif)\b)/
This places a negative lookahead directly after the domain, so the domain is matched only if the rest of the URL does not contain .pdf, .png, .jpg or .gif. Because only the domain itself is matched, it can be replaced without touching the path.
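A minimal sketch of the lookahead approach applied to an actual replacement. It assumes the lookahead sits directly after the domain so that it scans the rest of the URL; the function name `replaceDomain`, the sample markup, and the new domain `http://new.example.org` are made up for illustration.

```javascript
// Matches the old domain only when the rest of the URL does not
// end a path segment with one of the listed extensions.
const OLD_DOMAIN = /http:\/\/www\.example\.com(?![^\s"'<>]*\.(?:pdf|jpg|png|gif)\b)/gi;

// Hypothetical helper: swaps the domain, leaving the path untouched.
function replaceDomain(html, newDomain) {
  return html.replace(OLD_DOMAIN, newDomain);
}

const sample =
  '<a href="http://www.example.com/page">page</a> ' +
  '<a href="http://www.example.com/test.pdf">pdf</a> ' +
  '<img src="http://www.example.com/logo.png" />';

console.log(replaceDomain(sample, "http://new.example.org"));
```

Only the first link is rewritten here. A regex still cannot tell an `img` tag from an `a` tag when the image URL has no extension, which is one reason an HTML parser is the safer choice.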

If none of your pdf urls have query parameters (like a.pdf?asd=12), the following code will work. It replaces only absolute and root-relative urls.
var links = document.getElementsByTagName("a");
var len = links.length;
var newDomain = "http://mydomain.com";
/**
 * Match absolute urls (starting with http)
 * and root relative urls (starting with a `/`)
 * Does not match relative urls like "subfolder/anotherpage.html"
 */
var regex = new RegExp("^(?:https?://[^/]+)?(/.*)$", "i");
//uncomment next line if you want to replace only absolute urls
//regex = new RegExp("^https?://[^/]+(/.*)$", "i");
for (var i = 0; i < len; i++) {
    var link = links.item(i);
    var href = link.getAttribute("href");
    if (!href) //in case of named anchors
        continue;
    if (href.match(/\.pdf$/i)) //if pdf
        continue;
    href = href.replace(regex, newDomain + "$1");
    link.setAttribute("href", href);
}
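If PDF or image URLs may carry query parameters, one option is to test the extension against the parsed path instead of the raw href. The sketch below uses the standard `URL` class; the helper name `swapOrigin`, the extension list, and the `pageUrl` argument are assumptions for illustration. Unlike the loop above, it also resolves relative hrefs against `pageUrl`.

```javascript
// Extensions to leave untouched; this list is an assumption.
const SKIP_EXT = /\.(?:pdf|jpe?g|png|gif)$/i;

// Hypothetical helper: returns the href pointed at newOrigin,
// or the original href when it should not be rewritten.
function swapOrigin(href, newOrigin, pageUrl) {
  let url;
  try {
    // Resolve relative hrefs against the page URL.
    url = new URL(href, pageUrl);
  } catch (e) {
    return href; // not parseable: leave as-is
  }
  // Leave mailto:, javascript:, etc. alone.
  if (url.protocol !== "http:" && url.protocol !== "https:") return href;
  // pathname excludes the query string and the fragment.
  if (SKIP_EXT.test(url.pathname)) return href;
  return newOrigin + url.pathname + url.search + url.hash;
}
```

In the loop above this could replace the two `match`/`replace` steps, for example `link.setAttribute("href", swapOrigin(href, newDomain, document.baseURI))`.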

Related

How to extract part of url - dart/flutter

I'm trying to extract part of a URL. More specifically, I'm trying to extract the value of the page_info parameter in the URL that is followed by rel="next".
String testUrl = "<https://demo.myshopify.com/admin/api/2022-01/products.json?limit=10&page_info=eyJkaXJlY3Rpb24iOiJwcmV2IiwibGFzdF9pZCI6NjczMDU4MDcyMTc1NCwibGFzdF92YWx1ZSI6IjE4SyBHb2xkIFBsYXRlZCBCcmFjZWxldCJ9>; rel='previous', <https://demo.myshopify.com/admin/api/2022-01/products.json?limit=10&page_info=eyJkaXJlY3Rpb24iOiJuZXh0IiwibGFzdF9pZCI6NjczMDIyNzcxMjA5MCwibGFzdF92YWx1ZSI6IjE4SyBHb2xkIFBsYXRlZCBIZWFydCBQZW5kYW50IE5lY2tsYWNlIn0>; rel='next'";
List<String> splitUrl = testUrl.split("=");
print(splitUrl[5]);
// this is what it prints out
eyJkaXJlY3Rpb24iOiJuZXh0IiwibGFzdF9pZCI6NjczMDIyNzcxMjA5MCwibGFzdF92YWx1ZSI6IjE4SyBHb2xkIFBsYXRlZCBIZWFydCBQZW5kYW50IE5lY2tsYWNlIn0>; rel
// this is what I'm trying to extract
eyJkaXJlY3Rpb24iOiJuZXh0IiwibGFzdF9pZCI6NjczMDIyNzcxMjA5MCwibGFzdF92YWx1ZSI6IjE4SyBHb2xkIFBsYXRlZCBIZWFydCBQZW5kYW50IE5lY2tsYWNlIn0
// value for rel="next"
I tried to split the URL with the split function on String, but that also brings the angle bracket along with it. I only want the value of the page_info parameter that belongs to rel="next".
I know this has something to do with regex, but I'm not really good at it. Any help would be really appreciated.
I grabbed that URL from a response header (paginated REST API). It contains two page_info parameters, one for the next page and one for the previous page, and I'm trying to extract the value for the next page. Splitting the URL didn't help me.
Thank you.
An alternative approach is to use Uri.parse to parse the URL:
void main() {
String testUrl = "<https://demo.myshopify.com/admin/api/2022-01/products.json?limit=10&page_info=eyJkaXJlY3Rpb24iOiJwcmV2IiwibGFzdF9pZCI6NjczMDU4MDcyMTc1NCwibGFzdF92YWx1ZSI6IjE4SyBHb2xkIFBsYXRlZCBCcmFjZWxldCJ9>; rel='previous', <https://demo.myshopify.com/admin/api/2022-01/products.json?limit=10&page_info=eyJkaXJlY3Rpb24iOiJuZXh0IiwibGFzdF9pZCI6NjczMDIyNzcxMjA5MCwibGFzdF92YWx1ZSI6IjE4SyBHb2xkIFBsYXRlZCBIZWFydCBQZW5kYW50IE5lY2tsYWNlIn0>; rel='next'";
// Extract just the URL that is followed by rel='next'.
var match = RegExp(r"<([^>]*)>;\s*rel='next'").firstMatch(testUrl);
if (match != null) {
var uri = Uri.parse(match.group(1)!);
print(uri.queryParameters['page_info']); // Prints the rel='next' value: eyJkaXJlY3Rpb24iOiJuZXh0...
}
}
Note that the above wouldn't need any of the RegExp code if testUrl were a proper URL without the angle brackets and rel='next' junk.
The regex pattern page_info=([\w]+)
captures both values (previous first, then next):
eyJkaXJlY3Rpb24iOiJwcmV2IiwibGFzdF9pZCI6NjczMDU4MDcyMTc1NCwibGFzdF92YWx1ZSI6IjE4SyBHb2xkIFBsYXRlZCBCcmFjZWxldCJ9
eyJkaXJlY3Rpb24iOiJuZXh0IiwibGFzdF9pZCI6NjczMDIyNzcxMjA5MCwibGFzdF92YWx1ZSI6IjE4SyBHb2xkIFBsYXRlZCBIZWFydCBQZW5kYW50IE5lY2tsYWNlIn0
https://regexr.com/6qj1h
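For comparison, the same two steps (pick the entry marked rel='next', then let a URL parser read the query string) can be sketched in JavaScript rather than Dart. The function name `nextPageInfo` is hypothetical, and the tokens in the sample header are shortened placeholders, not real page_info values.

```javascript
// Hypothetical helper: returns the page_info value of the rel='next'
// entry in a Link-style header, or null if there is no such entry.
function nextPageInfo(linkHeader) {
  // Each entry looks like: <url>; rel='next'
  // [^>]* cannot cross a '>', so the URL always belongs to that rel.
  const match = /<([^>]*)>;\s*rel=["']?next["']?/.exec(linkHeader);
  if (!match) return null;
  return new URL(match[1]).searchParams.get("page_info");
}

// Placeholder tokens stand in for the long page_info values.
const header =
  "<https://demo.myshopify.com/admin/api/2022-01/products.json?limit=10&page_info=PREV_TOKEN>; rel='previous', " +
  "<https://demo.myshopify.com/admin/api/2022-01/products.json?limit=10&page_info=NEXT_TOKEN>; rel='next'";

console.log(nextPageInfo(header));
```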

regex to pick folders from network path

I'm trying to get a regex for selecting part of a network path
\\server.env.com\Target\Test1\Test2\final1\final2\final3\final4\final5
I need to skip two folders after Target and get the rest of the path from the above. So regex should give me final1\final2\final3\final4\final5 in this case. The path can have more levels of folders after final5. So the regex should work for any number of folders.
When I use a lookbehind, the browser says it's not supported, so I cannot use it.
Using regex...
var str1="path \\server.env.com\\Target\\Test1\\Test2\\final1\\final2\\final3\\final4\\final5"
// "path \server.env.com\Target\Test1\Test2\final1\final2\final3\final4\final5"
str1.match( /([^\\]*\\){5}(.*)/ )[2]
// "final1\final2\final3\final4\final5"
This works by skipping the first five backslash-terminated segments that come before the 'finals'.
Or, using split
var arr = str1.split("\\")
arr.splice(0,5)
var result = arr.join("\\")
result // "final1\final2\final3\final4\final5"
Following this post: How do you use a variable in a regular expression?
Create a regex that removes everything up to Target plus the two subdirectories that follow it:
var path = "\\server.env.com\\Target\\Test1\\Test2\\final1\\final2\\final3\\final4\\final5"
console.log("Path is: " + path)
var target = "Target"
var regex = "^.*" + target + "\\\\(?:[^\\\\]+\\\\){2}"
console.log("Regex is: "+ regex)
var re = new RegExp(regex, "mg")
var extracted = path.replace(re, "")
console.log("Extraction is: " + extracted)
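The split-based idea can also be keyed on the folder name rather than on a fixed count from the start of the string. This is a minimal sketch: the helper name `pathAfter` and its `skip` parameter are made up for illustration, and it assumes the target folder name appears as a whole path segment.

```javascript
// Hypothetical helper: returns the part of a backslash-separated path
// that follows `target` and the next `skip` folders, or null if the
// target folder is not found.
function pathAfter(path, target, skip) {
  const parts = path.split("\\");
  const idx = parts.indexOf(target);
  if (idx === -1) return null;
  return parts.slice(idx + 1 + skip).join("\\");
}

const unc = "\\\\server.env.com\\Target\\Test1\\Test2\\final1\\final2\\final3";
console.log(pathAfter(unc, "Target", 2)); // final1\final2\final3
```

Because the target is compared as a plain string, no regex escaping is needed and no lookbehind is involved.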

Multiple regex to match file extension with version

My current regex is like so
/\.(jpe?g|png|gif|svg)$/i
I'm trying to modify it so that it also matches when the extension is followed by GET parameters, so that all of the formats below match:
../fonts/fontawesome-webfont.svg
../fonts/fontawesome-webfont.svg?v=4.3.0
../fonts/fontawesome-webfont.svg?v=4.3.0#fontawesomeregular'
How can I modify it to support these?
Assuming the URLs to be parsed follow proper formatting (where only one '?' delimiter can be used to signify the start of the query) you could do:
/\.(jpe?g|png|gif|svg)(?:\?.*|)$/i
var urls = [
    '../fonts/fontawesome-webfont.svg',
    '../fonts/fontawesome-webfont.svg?v=4.3.0',
    '../fonts/fontawesome-webfont.svg?v=4.3.0#fontawesomeregular'
];
var matches = urls.map(function(url) { return url.match(/\.(jpe?g|png|gif|svg)(?:\?.*|)$/i); });
document.write('<pre>' + JSON.stringify(matches, null, 2) + '</pre>');
Alternatively you could use Node's url.parse():
var url = require('url');
var urlObj = url.parse(URL_STRING);
var matches = urlObj.pathname.match(/\.(jpe?g|png|gif|svg)$/i);
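The same pathname-based check can be sketched with the standard `URL` class, which is available in both browsers and Node. Relative references such as `../fonts/...` need a base to resolve against, so the `BASE` value below is a placeholder assumption, and the helper name `imageExtension` is hypothetical.

```javascript
const IMAGE_EXT = /\.(jpe?g|png|gif|svg)$/i;

// Placeholder base so relative paths can be parsed; the value is an assumption.
const BASE = "http://localhost/assets/css/";

// Hypothetical helper: returns the lowercased extension if the path
// (ignoring query string and fragment) ends in one, otherwise null.
function imageExtension(ref) {
  const pathname = new URL(ref, BASE).pathname;
  const m = IMAGE_EXT.exec(pathname);
  return m ? m[1].toLowerCase() : null;
}

console.log(imageExtension("../fonts/fontawesome-webfont.svg?v=4.3.0#fontawesomeregular"));
```

Since the query string and fragment are stripped by the parser, the original `$`-anchored regex can be reused unchanged.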

Javascript regex replace

I have a language dropdown and a JavaScript function that switches the page to the selected language. I need help with my regex replace:
For example, I would like this URL to turn into this url:
http://localhost:7007/en/Product/Detail/1038
http://localhost:7007/fr/Product/Detail/1038
function languageChange(sender) {
    var lang = $(sender).val();
    var target = window.location.href;
    target = target.replace(/(http:\/\/.*?)([a-zA-Z]{2})(.*$)/gim, '$1' + lang + '$3');
    window.location = target;
}
Does your URL always have the same structure? If so, you may not need a regex at all. Split the URL at each "/", replace index 3, then join the array back together with "/".
Here is a code sample:
function changeLanguage(url, newLang) {
    var parts = url.split('/');
    parts[3] = newLang; // index 3 is the first path segment ("en")
    return parts.join('/');
}
changeLanguage('http://localhost:7007/en/Product/Detail/1038', 'fr');
Note: I originally wrote "splice" instead of "join" in my response. Join is the correct method.
Here is a function that processes any number of URLs within a string and replaces the language part (the first path segment), but only if it exists and is 2 to 4 letters long:
function changeLanguage(text, lang) {
    return text.replace(
        /\b(\w+:\/\/[^\/]+\/)[A-Z]{2,4}(?=[\/\s]|$)/gim,
        '$1' + lang);
}
Edit: Converted to function format.
Use this regex:
target = target.replace(/(https?:\/\/[^/]+)\/?([^/]*)(.*)/gi, '$1/' + lang + '$3');
For example, if lang is 'fr', target then holds http://localhost:7007/fr/Product/Detail/1038.
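A regex-free variant of the same replacement can be sketched with the standard `URL` class, assuming the language code is always the first path segment. The function name `setLanguage` is made up for illustration.

```javascript
// Hypothetical helper: replaces the first path segment with `lang`
// and keeps the host, query string and fragment as they are.
function setLanguage(href, lang) {
  const url = new URL(href);
  const segments = url.pathname.split("/"); // ["", "en", "Product", ...]
  segments[1] = lang;
  url.pathname = segments.join("/");
  return url.toString();
}

console.log(setLanguage("http://localhost:7007/en/Product/Detail/1038", "fr"));
// http://localhost:7007/fr/Product/Detail/1038
```

In the dropdown handler this would be used as `window.location = setLanguage(window.location.href, lang)`.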

Can a sizzle selector evaluate a regular expression?

I need to select links with a specific format of URLs. Can I use sizzle to evaluate a link's href attribute against a regular expression?
For example, can I do something like this:
var arrayOfLinks = Sizzle('a[HREF=[0-9]+$]');
to create an array of all links on the page whose URL ends in a number?
Give this a try. I've attempted to convert the jQuery regex selector that Kobi linked to into a Sizzle selector extension. Seems to work, but I haven't put it through a lot of testing.
Sizzle.selectors.filters.regex = function(elem, i, match){
    var matchParams = match[3].split(',', 2);
    var attr = matchParams[0];
    var pattern = matchParams[1];
    var regex = new RegExp(pattern.replace(/^\s+|\s+$/g,''), 'ig');
    return regex.test(elem.getAttribute(attr));
};
In this case, your example would be written as:
var arrayOfLinks = Sizzle('a:regex(href,[0-9]+$)');
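If extending the selector engine is not an option, the same filtering can be done after selection. This is a minimal sketch that works on any array-like collection of elements; the helper name `filterByAttr` is hypothetical and is not part of Sizzle.

```javascript
// Hypothetical helper: keeps the elements whose attribute value
// matches the given pattern (case-insensitive).
function filterByAttr(elements, attr, pattern) {
  const regex = new RegExp(pattern, "i");
  return Array.from(elements).filter(function (elem) {
    const value = elem.getAttribute(attr);
    return value !== null && regex.test(value);
  });
}

// In a browser, for example:
// var arrayOfLinks = filterByAttr(document.getElementsByTagName("a"), "href", "[0-9]+$");
```

The regex is created without the `g` flag, so repeated `test` calls do not depend on `lastIndex`.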