Finding a URL between two strings with regex

I have a long HTML file that contains the names of organizations and their URLs. Each organization's "section" in the code is demarcated by the word "organization" followed by a lot of code, with its URL located inside that code, and ends with the word "organization".
For example:
organization -- a lot of code (with the URL located somewhere inside) -- organization
I have tried to use regex to search and extract the URL, but to no avail.
organization(?<Protocol>\w+):\/\/(?<Domain>[\w#][\w.:#]+)\/?[\w\.?=%&=\ #/$,]*organization
I suspect my problem lies in trying to demarcate the search for URLs using just the word "organization", but I am not sure.

Try group 1 from this:
organization.*\b(\w+://[\w.?%&=#/$,-]+).*?organization

Your current regex is searching for something sandwiched immediately between two instances of "organization". If there's any chance of characters existing between "organization" and your URL, you'll need to introduce a non-greedy match for any instances of anything (.*?), and if there are newlines in the mix you'll need to use (?:.|\n)*?.
So your regex becomes:
organization(?:.|\n)*?(?<Protocol>\w+):\/\/(?<Domain>[\w#][\w.:#]+)\/?[\w\.?=%&=\ #/$,]*(?:.|\n)*?organization
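As a rough illustration of the lazy, newline-aware approach, here is a minimal Python sketch. The sample HTML, the trailing URL character class, and the variable names are assumptions made up for the example. Note that Python spells named groups (?P<name>...) rather than (?<name>...), and that the re.DOTALL flag lets . match newlines, so (?:.|\n) is not needed there.

```python
import re

# Hypothetical sample: one "organization ... organization" section spanning lines.
html = """organization <div class="card">
  <a href="https://example.org/about?id=7">Example Org</a>
</div> organization"""

# Lazy .*? skips the surrounding markup; DOTALL lets '.' cross newlines.
pattern = re.compile(
    r"organization.*?(?P<url>(?P<protocol>\w+)://(?P<domain>[\w#][\w.:#]+)"
    r"[\w.?=%&#/$,-]*).*?organization",
    re.DOTALL,
)

match = pattern.search(html)
url = match.group("url") if match else None
protocol = match.group("protocol") if match else None
domain = match.group("domain") if match else None
print(url, protocol, domain)
```

To pull the URL out of every section of a file, pattern.finditer(html) could be used in place of search.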

Related

Regex for multiple URLs without clear pattern

I'm quite new to using regex so I hope there's someone who can help me out. I want to set up an event in Google Tag Manager through RegEx that fires whenever someone views a page. I'm trying to do this using the Page URL as a parameter so that the event fires when that URL is visited. It's for around 1,400 URLs that are in the same sub-folder but have different page names. For example: https://www.example.com/products/product-name-1, https://www.example.com/products/product-name-2
What would be the best way to group these into one RegEx formula?
I've tried to separate all urls by using the '|' sign without any result. I've also tried this format, without any luck: (^/page-url-1/$|^/page-url-1/$|^/page-url-1/$|^/page-url-1/$)
A couple of things are happening with your attempt. First, you aren't escaping the '/'. Some regex tools treat it as a delimiter, so it is safest to precede it with a \ to tell the engine that you want that specific character; where the escape is not required it is harmless. It would look like this:
\/products\/page-url-1
I am assuming you are using a {{Page Path}} so the above would match for any paths that contain /products/page-url-1.
If you want the event to fire on all pages within the /products directory, there is an easier way of doing this.
\/products\/.*
What this will do is match any page within your /products directory. A landing page at /products itself (with no trailing slash) will not match, so it is omitted from the firing. The '.' matches any character after the / and the '*' lets it repeat any number of times.
EDIT:
Since you aren't looking for all the products pages, you can use a matching group and list them all. I suspect that all the product names will be different enough and not share any common path elements, so you will have to list out the ones you want.
\/products\/(product-url-1|product-url-2|product-url-3).*
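With around 1,400 pages, typing that group by hand is error-prone, so it may be easier to generate it. Below is a minimal Python sketch of that idea; the page names and the helper is_tracked are hypothetical, and whether the generated pattern can be pasted into GTM unchanged is an assumption worth testing there. Unlike the pattern above, this sketch anchors both ends, so product-name-1 does not also match product-name-10.

```python
import re

# Hypothetical page names; in practice these might be read from a file.
product_slugs = ["product-name-1", "product-name-2", "product-name-10"]

# Escape each name so characters like '.' or '+' are matched literally,
# then join them into a single alternation group.
alternation = "|".join(re.escape(slug) for slug in product_slugs)
pattern = r"^/products/(?:" + alternation + r")/?$"

def is_tracked(page_path):
    """Return True when the path is one of the listed product pages."""
    return re.search(pattern, page_path) is not None

print(pattern)
print(is_tracked("/products/product-name-2"))
```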

RegEx: need to automatically direct downloads from URL with changing section

I have no knowledge of RegEx code but I've just downloaded a Google Chrome extension that lets me automatically direct downloads to specific folders on my computer.
I want jpgs from a stock photo website to be downloaded to a specific folder, but part of the URL changes for every single file. How do I write out the File URL so it ignores the random section of the URL?
https://website.com/photos/IGNORE THIS PART with azAZ01 RANDOM CODE/download?force=true
Your question was a little confusing to me, but this should be the regex you need:
https:\/\/website.com\/photos\/(\w|\d)*
To help break it down, the basic text (e.g. https:, website.com, photos) just matches the raw text. The \/ thing is an escape character '\' followed by the slash we want. Then for the random part, I'm assuming it's made up of letters and numbers, so that last part translates roughly to: any word character '\w' (letters, digits, and underscore) or any digit '\d', and the * means any number of those.
Also, you can use Regex101.com as a helpful tool when making regex
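For a quick check outside the extension, here is a minimal Python sketch of a stricter variant. It assumes the random code consists only of letters and digits and that the URL always ends in /download?force=true, as in the question; the helper name is_photo_download is made up, and the extension's own regex dialect may differ.

```python
import re

# Escaped dots and a literal '?' so they are not treated as metacharacters.
# [A-Za-z0-9]+ stands in for the random code (assumed to be letters/digits).
pattern = re.compile(
    r"^https://website\.com/photos/[A-Za-z0-9]+/download\?force=true$"
)

def is_photo_download(url):
    """Return True for download URLs whatever the random code is."""
    return pattern.match(url) is not None

print(is_photo_download("https://website.com/photos/aZ09xY/download?force=true"))
```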

Find multiple '/' forward slashes in string of URLs for sitemap

We are trying to clean up our site map as our Magento store has created duplicate pages. I want to use a regular expression to select, or invert select, all of the pages which are linked to the top level URL.
For example, we want to find the first line:
/site/product<<
/site/category/product/
/site/category/product
Is there any way to find only two instances of a forward slash in the whole string, which are not next to each other?
Thank you for your help in advance.
I've tried something like this
(.*(?<!\/)$)
Your pattern (.*(?<!\/)$) matches any character except a newline up to the end of the string, and then asserts that what is directly to the left is not a forward slash. That gives you the first and the third lines.
Instead, you could match from the start of the string ^ a forward slash followed by one or more characters that are neither a forward slash nor a newline, [^/\n]+, twice over, and then assert the end of the string $:
^/[^/\n]+/[^/\n]+$
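To see the two-slash pattern applied to the sample lines from the question, here is a minimal Python sketch (the list name paths is just for the example, and the "<<" marker is left out):

```python
import re

# Sample paths from the question (without the "<<" marker).
paths = ["/site/product", "/site/category/product/", "/site/category/product"]

# Exactly two slashes, each followed by one or more non-slash characters.
top_level = re.compile(r"^/[^/\n]+/[^/\n]+$")

matches = [p for p in paths if top_level.match(p)]
print(matches)  # ['/site/product']
```

Swapping the condition to "if not top_level.match(p)" gives the inverted selection.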
I would like to provide a quick answer to this problem in case it helps anyone else in the future. Our sitemap had too many duplicate URLs due to an incorrect setup on our Magento store. Instead of submitting a sitemap with 20,000+ top-level URLs we decided to manually remove the top-level items ourselves.
Not ideal at all.
We tweaked the sitemap PHP generation code to pull top-level URLs as site/category/id/###. Then we used Notepad++ to bookmark and delete these lines accordingly.

RegEx filter links from a document

I am currently learning regex and I am trying to filter all links (eg: http://www.link.com/folder/file.html) from a document with notepad++. Actually I want to delete everything else so that in the end only the http links are listed.
So far I tried this : http\:\/\/www\.[a-zA-Z0-9\.\/\-]+
This gives me all links, which is fine, but how do I delete the remaining stuff so that in the end I have a neat list of all links?
If I try to replace it with nothing followed by \1, obviously the link will be deleted, but I want the exact opposite to have everything else deleted.
So it should be something like:
- find a string of numbers, letters and special signs until "http"
- delete what you found
- and keep searching for more numbers, letters and special signs after "html"
- and delete that again
Any ideas? Thanks so much.
In Notepad++, in the Replace menu (CTRL+H) you can do the following:
Find: .*?(http\:\/\/www\.[a-zA-Z0-9\.\/\-]+)
Replace: $1\n
Options: check the Regular expression and the . matches newline
This will leave you with a list of all your links. There are two issues though:
The regex you provided for matching URLs is far from being generic enough to match any URL. If it is working in your case, that's fine, else check this question.
It will leave the text after the last matched URL intact. You have to delete it manually.
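If a scripting language is an option, the same extraction avoids the leftover-text problem entirely, because only the matches are collected. A minimal Python sketch, assuming the same URL character class as above and a made-up sample text:

```python
import re

# Hypothetical document text containing two links among other content.
text = """Intro text http://www.link.com/folder/file.html more text
and another one: http://www.example.com/a-b/c.html trailing words"""

# Same character class as the pattern above; findall returns only the matches,
# so the surrounding text never needs to be deleted.
links = re.findall(r"http://www\.[a-zA-Z0-9./-]+", text)
print("\n".join(links))
```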
The earlier answer by @psxls was a great help when I wanted to perform a similar process.
However, that regex was written six years ago, so I had to update it to work properly with more recent links, because:
a lot of URLs now use HTTPS instead of HTTP
many websites no longer use www as the main subdomain
some links contain punctuation marks (which have to be preserved)
I finally changed the search rule to .*?(https?\:\/\/[a-zA-Z0-9[:punct:]]+) and it worked correctly with the file I had.
Unfortunately, this seemingly simple task is going to be almost impossible to do in notepad++. The regex you would have to construct would be...horrible. It might not even be possible, but if it is, it's not worth it. I pretty much guarantee that.
However, all is not lost. There are other tools more suitable to this problem.
Really what you want is a tool that can search through an input file and print out a list of regex matches. The UNIX utility "grep" will do just that. Don't be scared off because it's a UNIX utility: you can get it for Windows:
http://gnuwin32.sourceforge.net/packages/grep.htm
The grep command line you'll want to use is this:
grep -o 'http:\/\/www.[a-zA-Z0-9./-]\+\?' <filename(s)>
(Where <filename(s)> are the name(s) of the files you want to search for URLs in.)
You might want to shake up your regex a little bit, too. The problems I see with that regex are that it doesn't handle URLs without the 'www' subdomain, and it won't handle secure links (which start with https). Maybe that's what you want, but if not, I would modify it thusly:
grep -o 'https\?:\/\/[a-zA-Z0-9./-]\+\?' <filename(s)>
Here are some things to note about these expressions:
Inside a character group, there's no need to quote metacharacters except for [ and (sometimes) -. I say sometimes because if you put the dash at the end, as I have above, it's no longer interpreted as a range operator.
The grep utility's syntax, annoyingly, is different than most regex implementations in that most of the metacharacters we're familiar with (?, +, etc.) must be escaped to be used, not the other way around. Which is why you see backslashes before the ? and + characters above.
Lastly, the repetition metacharacter in this expression (+) is greedy by default, which could cause problems. I appended a ? to it, but note that grep's POSIX-style syntax has no lazy quantifiers (those need Perl-compatible mode, grep -P), so the match stays greedy. The way you have your URL match formulated, that is harmless, but if you change your match to something looser, say . instead of [a-zA-Z0-9./-], you would see URLs on the same line getting combined together.
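The difference between greedy and lazy repetition is easiest to see in an engine that supports both. The sketch below uses Python's re module rather than grep, which is an assumption of convenience, and the sample line is made up. It compares a permissive greedy match, the restricted character class, and a lazy quantifier placed at the very end of a pattern.

```python
import re

line = "see http://www.a.com/x and http://www.b.com/y end"

# A permissive greedy match swallows everything to the end of the line.
greedy_any = re.findall(r"https?://.+", line)

# A restricted character class stops at the first character a URL cannot contain.
restricted = re.findall(r"https?://[a-zA-Z0-9./-]+", line)

# A lazy quantifier at the very end of a pattern matches as little as possible.
lazy_any = re.findall(r"https?://.+?", line)

print(greedy_any)
print(restricted)
print(lazy_any)
```

In this engine the restricted class, not laziness, is what keeps the two URLs apart; a trailing lazy quantifier keeps only one character after the scheme.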
I did this a different way.
Find everything up to the first/next (https or http), then everything that follows up to (html or htm), and output just those three pieces, '(https or http)(everything in between)(html or htm)', with a line feed/carriage return after each.
So:
Find: .*?(https:|http:)(.*?)(html|htm)
Replace with: \1\2\3\r\n
Saves looking for all possible (incl non-generic) url matches.
You will need to manually remove any text after the last matched URL.
Can also be used to create url links:
Find: .*?(https:|http:)(.*?)(html|htm)
Replace: <a href="\1\2\3">\1\2\3</a>\r\n
or image links (jpg/jpeg/gif):
Find: .*?(https:|http:)(.*?)(jpeg|jpg|gif)
Replace: <img src="\1\2\3">\r\n
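The same scheme/middle/extension idea can be tried outside Notepad++. Here is a minimal Python sketch, with a made-up sample string; it collects the three captured pieces and joins them back into a link instead of deleting the surrounding text.

```python
import re

text = 'junk <a href="https://example.com/page.html">x</a> more http://example.org/a.htm tail'

# Lazy (.*?) grows until the first "html" or "htm" after the scheme.
pattern = re.compile(r"(https?:)(.*?)(html|htm)")

links = ["".join(groups) for groups in pattern.findall(text)]
print(links)
```

One caveat of this approach: a link that does not end in html or htm is not skipped. The lazy middle part keeps going until the next "htm" anywhere later in the text, so such links can drag unrelated text into the match.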
I know my answer isn't RegEx related, but here is another efficient way to get the lines containing URLs.
As Toto mentioned in the comments, this won't remove the text around the links.
It works as long as all the links share a common pattern, like https://.
CTRL+F => change tab to Mark
Insert https://
Tick Mark to bookmark.
Mark All.
Find => Bookmarks => Delete all lines without bookmark.
I hope someone who lands here in search of same problem will find my way more user-friendly.
You can still use RegEx to mark lines :)

Regular Expression to match a specific URL broken up by arbitrary characters

I run a Django-based forum (the framework is probably not important to the question, but still) and it has been increasingly getting spammed with posts that link to a specific website constantly (www.solidwoodkitchen.co.uk - these people are apparently the worst).
I've implemented a string blocking system that stops them posting to the forum if the URL of the website is included in the post, but as spam bots usually do, it has figured out a way around that by breaking up the URL with other characters (eg. w_w_w.s*olid_wood*kit_ch*en._*co.*uk .). So a couple of questions:
Is it even possible to build a regex capable of finding the specific URL within a block of text even when it has been modified like that?
If it is, would this cause a performance hit?
Description
You could break the url into a string of characters, then join them together with [^a-z0-9]*?. So in this case with www.solidwoodkitchen.co.uk the resulting regex would look like:
w[^a-z0-9]*?w[^a-z0-9]*?w[^a-z0-9]*?[.][^a-z0-9]*?s[^a-z0-9]*?o[^a-z0-9]*?l[^a-z0-9]*?i[^a-z0-9]*?d[^a-z0-9]*?w[^a-z0-9]*?o[^a-z0-9]*?o[^a-z0-9]*?d[^a-z0-9]*?k[^a-z0-9]*?i[^a-z0-9]*?t[^a-z0-9]*?c[^a-z0-9]*?h[^a-z0-9]*?e[^a-z0-9]*?n[^a-z0-9]*?[.][^a-z0-9]*?c[^a-z0-9]*?o[^a-z0-9]*?[.][^a-z0-9]*?u[^a-z0-9]*?k
This would basically search for the entire string of characters separated by zero or more non-alphanumeric characters.
Or you could take the input text and strip out all punctuation then simply search for wwwsolidwoodkitchencouk.
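Both ideas can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names are made up, and unlike the long regex above it drops the dots from the URL and lets the filler class absorb them, so that a missing or replaced dot still matches.

```python
import re

def obfuscation_pattern(url):
    """Build a regex matching url with junk characters between its letters."""
    # Keep only letters and digits; separators such as '.' are covered
    # by the filler class below.
    letters = [ch for ch in url.lower() if ch.isalnum()]
    filler = r"[^a-z0-9]*?"
    return re.compile(filler.join(re.escape(ch) for ch in letters))

def regex_contains(text, url):
    """Search the lower-cased text with the generated pattern."""
    return obfuscation_pattern(url).search(text.lower()) is not None

def normalized_contains(text, url):
    """Alternative: strip non-alphanumerics from both and compare plain strings."""
    def squash(s):
        return re.sub(r"[^a-z0-9]", "", s.lower())
    return squash(url) in squash(text)

blocked = "www.solidwoodkitchen.co.uk"
spam = "visit w_w_w.s*olid_wood*kit_ch*en._*co.*uk . today"
print(regex_contains(spam, blocked))
print(normalized_contains(spam, blocked))
```

In a real forum the pattern would be compiled once and reused rather than rebuilt per post; the strip-and-compare variant avoids regex matching on the blocked string altogether.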