getting regex with specific number of slashes in text - regex

I have the following text
http://www.faz.net/aktuell/politik/ausland/amerika/venezuela-das-ende-der-sozialistischen-epoche-13952597.html
http://www.faz.net/aktuell/politik/ausland/bundeswehr-einsatz-von-der-leyen-gesteht-fehler-in-afghanistan-ein-13952438.html
http://www.faz.net/aktuell/politik/inland/bayerns-ehrenamtliche-in-der-fluechtlingskrise-13948777.html
I would like to retrieve only those links that start with http://www.faz.net/aktuell/politik/ but end with .html with only one slash in between. Basically, avoiding the first link in the list above.
I tried the following
http://www.faz.net/aktuell/politik/.*/.*?\.html
However, all three links get selected. How do I avoid matching the extra slash in the first one? Please help.

You can use the following:
http://www\.faz\.net/aktuell/politik/[^/]*/[^/]*\.html
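As a quick check of the pattern above, here is a minimal Python sketch run against the three links from the question. The variable names are only illustrative; the pattern is the one from the answer.

```python
import re

# Pattern from the answer: [^/]* cannot cross a slash, so exactly one
# path segment is allowed between /politik/ and the file name.
PATTERN = re.compile(r"http://www\.faz\.net/aktuell/politik/[^/]*/[^/]*\.html")

urls = [
    "http://www.faz.net/aktuell/politik/ausland/amerika/venezuela-das-ende-der-sozialistischen-epoche-13952597.html",
    "http://www.faz.net/aktuell/politik/ausland/bundeswehr-einsatz-von-der-leyen-gesteht-fehler-in-afghanistan-ein-13952438.html",
    "http://www.faz.net/aktuell/politik/inland/bayerns-ehrenamtliche-in-der-fluechtlingskrise-13948777.html",
]

# fullmatch requires the whole URL to fit the pattern
matching = [u for u in urls if PATTERN.fullmatch(u)]

for u in matching:
    print(u)
```

Only the second and third links are printed, because the first one has an extra path segment (amerika).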

Related

What regex in Google Analytics to use for this case?

I'm trying to figure out which landing page regex to use to show only URLs that have exactly two sub-folders, e.g. see the image below: show just the green URLs but not the red ones, as they have 3+ sub-folders. Any advice on how to do this in GA with regex?
Cheers
If you want to match a path having only two components, e.g.
/component1/component2/
Then you may use the following regex:
/[^/]+/[^/]+/
If your regex tool requires anchors, then add them:
^/[^/]+/[^/]+/$
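To see how the anchored version behaves, here is a small Python sketch. The sample paths are made up for illustration and are not taken from the question's screenshot.

```python
import re

# Anchored pattern from the answer: exactly two non-empty path components
TWO_LEVELS = re.compile(r"^/[^/]+/[^/]+/$")

# Hypothetical landing page paths
paths = ["/blog/post/", "/blog/", "/blog/2020/post/", "/"]

kept = [p for p in paths if TWO_LEVELS.search(p)]
print(kept)
```

Only /blog/post/ is kept; paths with one component or with three or more are rejected.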
Is this what you are looking for?
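^\/[!#$&-.0-;=?-\[\]_a-z~]+\/[!#$&-.0-;=?-\[\]_a-z~]+\/$
The two character classes contain the valid URL characters except the slash (a single range &-; would include /, so it is split into &-. and 0-;). We're also forcing the regex to start with a slash, end with a slash, and have only one slash in between.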

Find multiple '/' forward slashes in string of URLs for sitemap

We are trying to clean up our site map as our Magento store has created duplicate pages. I want to use a regular expression to select, or invert select, all of the pages which are linked to the top level URL.
For example, we want to find the first line-
/site/product<<
/site/category/product/
/site/category/product
Is there any way to find only two instances of a forward slash in the whole string, which are not next to each other?
Thank you for your help in advance.
I've tried something like this
(.*(?<!\/)$)
Your pattern (.*(?<!\/)$) matches any characters except a newline up to the end of the string, and then asserts that the character to the left is not a forward slash. That matches every line that does not end in /, so it gives you the first and the third lines.
Instead, you could match from the start of the string ^, then twice a forward slash followed by one or more characters that are neither a forward slash nor a newline, [^/\n]+, and then assert the end of the string $:
^/[^/\n]+/[^/\n]+$
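A minimal Python sketch of this pattern applied to the three sample lines from the question. It assumes the sitemap is available as one multi-line string, which is why re.MULTILINE is used so that ^ and $ apply per line.

```python
import re

# Pattern from the answer; MULTILINE makes ^ and $ match at every line
TOP_LEVEL = re.compile(r"^/[^/\n]+/[^/\n]+$", re.MULTILINE)

sitemap = "/site/product\n/site/category/product/\n/site/category/product\n"

# "Select": lines with exactly two slashes that are not adjacent
top_level = TOP_LEVEL.findall(sitemap)

# "Invert select": every other line
others = [line for line in sitemap.splitlines() if not TOP_LEVEL.fullmatch(line)]

print(top_level)
print(others)
```

Here top_level holds only /site/product, and others holds the two category lines.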
I would like to provide a quick answer to this problem in case it helps anyone else in the future. Our sitemap had too many duplicate URLs due to an incorrect setup on our Magento store. Instead of submitting a sitemap with 20,000+ top-level URLs, we decided to manually remove the top-level items ourselves.
Not ideal at all.
We tweaked the sitemap PHP generation code to pull top-level URLs as site/category/id/###. Then we used Notepad++ to bookmark and delete these lines accordingly.

How to dismiss the end of the url parameters with regex?

I have a script that is supposed to trigger when a certain page path is open.
The issue: the page path contains multiple parameters including the parameter "returnUrl", returning the previous page visited.
Here is the url I want to check :
/cxsSearchApply?positionId=a0w0X000004IceYQAS&lang=en&returnUrl=https://example.com/cxsrec__cxsSearchDetail?id=a0w0X000004IceYQAS&lang=en&returnUrl=https://example.com/cxsrec__cxsSearch&lang=en
I initially used this regex code to get triggered on this page :
(cxsSearchApply.*)
But I have other regexes like:
(cxsSearchSearchDetail.*)
And they also trigger, because their page path is included in the URL as well...
What regex should I use to match the first part of the URL but nothing after "returnUrl"?
So you want to match cxsSearchApply on the text before &returnUrl. You could use a lookahead:
(cxsSearchApply.*)(?=returnUrl=)
However, what you really want is to match everything before the first &returnUrl. So you need a non-greedy operator:
(cxsSearchApply.*?)(?=returnUrl=)
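Likewise, your other search should use the same form, so that it only looks at text that comes before a returnUrl=:
(cxsSearchSearchDetail.*?)(?=returnUrl=)
Note that the lookahead only requires some returnUrl= to follow the match. In a URL with several returnUrl= parameters, like the one above, a page name inside the first returnUrl value is still followed by a later returnUrl=, so it can still match. If that is a problem, restrict the name to the path before the query string, for example ^[^?]*cxsSearchSearchDetail. Otherwise, I believe this will get you what you want.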
Nothing after "returnUrl"
If this is literally what you want, you can simply use (.*?)(&returnUrl=.*) and take the first capture group as your result. The first group is lazy so that it stops at the first &returnUrl= rather than the last one.
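To compare the greedy and lazy variants on the URL from the question, here is a hedged Python sketch. The last two checks are an assumption of mine rather than part of the answers: they look for the page name only in the path before the first ?, and they use cxsSearchDetail as the name because that is how it appears in the URL.

```python
import re

url = (
    "/cxsSearchApply?positionId=a0w0X000004IceYQAS&lang=en"
    "&returnUrl=https://example.com/cxsrec__cxsSearchDetail?id=a0w0X000004IceYQAS&lang=en"
    "&returnUrl=https://example.com/cxsrec__cxsSearch&lang=en"
)

# Greedy: .* runs on to the last "returnUrl=" in the string
greedy = re.search(r"(cxsSearchApply.*)(?=returnUrl=)", url).group(1)

# Lazy: .*? stops right before the first "returnUrl="
lazy = re.search(r"(cxsSearchApply.*?)(?=returnUrl=)", url).group(1)

# Assumption: only accept a page name that occurs before the query string
apply_in_path = re.search(r"^[^?]*cxsSearchApply", url) is not None
detail_in_path = re.search(r"^[^?]*cxsSearchDetail", url) is not None

print(lazy)
print(apply_in_path, detail_in_path)
```

The lazy capture ends before the first returnUrl=, while the greedy one still contains cxsSearchDetail. The path-only check reports True for cxsSearchApply and False for cxsSearchDetail.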

Is it possible to remove the slash in this matching?

I want to extend my regexp for filepaths matching and I don't know how to do it even if I see the problem.
Input example
"C://species/dinosaurs/trex.json"
Output example
["C://species/dinosaurs" "trex" "json"]
so that I have the folder path, the filename and the extension.
I also want the folder path to be optional
My regexp
I tried
"^(.*[\\\/])?(.*)\.(.*)$"
It outputs
["C://species/dinosaurs/" "trex" "json"]
Almost, but I have the / at the end of the first group.
So I tried
"^((.*)[\\\/])?(.*)\.(.*)$"
It outputs
["C://species/dinosaurs/" "C://species/dinosaurs" "trex" "json"]
That may be better, because I just have to discard the first group, whereas in the first case I have to post-process the string.
I see the problem: several / can exist in the path, which makes it harder.
Is it possible to say that the first capturing group can end with anything but a /?
I tried
^(.*(?!\/))[\\\/]?(.*)\.(.*)$
It does not work. I have just discovered negative assertions, but the output is
["C://species/dinosaurs/trex" "json"]
Any clue ?
This one should suit your needs:
^(?:(.*)/)?([^/]+)\.([^.]+)$
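Here is a small Python sketch of that pattern. The helper name split_path is hypothetical. Note that, as written, the pattern only treats / as a separator, not the backslash the question's attempts also allowed.

```python
import re

# Pattern from the answer: optional folder, file name, extension
PATH_RE = re.compile(r"^(?:(.*)/)?([^/]+)\.([^.]+)$")

def split_path(path):
    """Return (folder, name, extension); folder is None when absent."""
    match = PATH_RE.match(path)
    return match.groups() if match else None

print(split_path("C://species/dinosaurs/trex.json"))
print(split_path("trex.json"))
```

The first call returns the folder without its trailing slash; the second returns None for the folder, since that group is optional.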

RegEx filter links from a document

I am currently learning regex and I am trying to filter all links (eg: http://www.link.com/folder/file.html) from a document with notepad++. Actually I want to delete everything else so that in the end only the http links are listed.
So far I tried this : http\:\/\/www\.[a-zA-Z0-9\.\/\-]+
This gives me all links, which is fine, but how do I delete the remaining stuff so that in the end I have a neat list of all links?
If I try to replace it with nothing followed by \1, obviously the link will be deleted, but I want the exact opposite to have everything else deleted.
So it should be something like:
- find a string of numbers, letters and special signs until "http"
- delete what you found
- and keep searching for more numbers, letters and special signs after "html"
- and delete that again
Any ideas? Thanks so much.
In Notepad++, in the Replace menu (CTRL+H) you can do the following:
Find: .*?(http\:\/\/www\.[a-zA-Z0-9\.\/\-]+)
Replace: $1\n
Options: check the Regular expression and the . matches newline
This will leave you with a list of all your links. There are two issues, though:
- The regex you provided for matching URLs is far from generic enough to match any URL. If it works in your case, that's fine; otherwise, check this question.
- It will leave the text after the last matched URL intact. You have to delete it manually.
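For readers outside Notepad++, the same find/replace can be sketched in Python. This is only an illustration under my own assumptions: the sample text is made up, and re.DOTALL stands in for the ". matches newline" option. The re.findall call at the end is an alternative that avoids the leftover-text issue.

```python
import re

# URL pattern from the question, with escapes trimmed inside the class
URL = r"http://www\.[a-zA-Z0-9./-]+"

# Hypothetical document text
text = "foo http://www.link.com/folder/file.html bar\nbaz http://www.example.org/a.html end"

# Equivalent of Find: .*?(URL)  Replace: $1\n  with ". matches newline"
replaced = re.sub(r".*?(" + URL + r")", r"\1\n", text, flags=re.DOTALL)

# Alternative: collect the matches directly
links = re.findall(URL, text)

print(repr(replaced))
print(links)
```

As in the editor, the substitution leaves the text after the last link (" end") in place, while findall returns just the two links.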
The answer given previously by @psxls was a great help when I wanted to perform a similar process.
However, that regex was written six years ago, so I had to update it to work properly with more recent links, because:
- a lot of URLs now use HTTPS instead of HTTP
- many websites no longer use www as their main subdomain
- some links contain punctuation marks (which have to be preserved)
I finally changed the search rule to .*?(https?\:\/\/[a-zA-Z0-9[:punct:]]+) and it worked correctly with the file I had.
Unfortunately, this seemingly simple task is going to be almost impossible to do in notepad++. The regex you would have to construct would be...horrible. It might not even be possible, but if it is, it's not worth it. I pretty much guarantee that.
However, all is not lost. There are other tools more suitable to this problem.
Really what you want is a tool that can search through an input file and print out a list of regex matches. The UNIX utility "grep" will do just that. Don't be scared off because it's a UNIX utility: you can get it for Windows:
http://gnuwin32.sourceforge.net/packages/grep.htm
The grep command line you'll want to use is this:
grep -o 'http:\/\/www.[a-zA-Z0-9./-]\+\?' <filename(s)>
(Where <filename(s)> are the name(s) of the files you want to search for URLs in.)
You might want to shake up your regex a little bit, too. The problems I see with that regex are that it doesn't handle URLs without the 'www' subdomain, and it won't handle secure links (which start with https). Maybe that's what you want, but if not, I would modify it thusly:
grep -o 'https\?:\/\/[a-zA-Z0-9./-]\+\?' <filename(s)>
Here are some things to note about these expressions:
Inside a character group, there's no need to quote metacharacters except for [ and (sometimes) -. I say sometimes because if you put the dash at the end, as I have above, it's no longer interpreted as a range operator.
The grep utility's syntax, annoyingly, is different than most regex implementations in that most of the metacharacters we're familiar with (?, +, etc.) must be escaped to be used, not the other way around. Which is why you see backslashes before the ? and + characters above.
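Lastly, the repetition metacharacter in this expression (+) is greedy by default, which could cause problems. Note that the trailing \? above does not make it lazy: grep's basic and extended syntaxes have no lazy quantifiers, so \+\? just means "one or more, optionally", and the match is still the longest possible. With GNU grep you can get a genuinely lazy +? by switching to Perl-compatible syntax with -P, if your build supports it. The way you have your URL match formulated, greediness probably won't cause problems, but if you change your match to, say, [^ ] instead of [a-zA-Z0-9./-], you could see URLs that are not separated by a space getting combined together.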
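The greedy-versus-lazy difference is easiest to see in an engine that supports lazy quantifiers. The following Python sketch is an illustration only: the sample line and the deliberately broad .+ are my own assumptions, not the patterns from the grep commands.

```python
import re

# Hypothetical line containing two URLs
line = "first http://a.example/x.html then http://b.example/y.html done"

# Greedy: .+ runs to the last ".html" on the line, merging both URLs
greedy = re.findall(r"https?://.+\.html", line)

# Lazy: .+? stops at the first ".html" after each start
lazy = re.findall(r"https?://.+?\.html", line)

print(greedy)
print(lazy)
```

The greedy pattern returns one combined match, while the lazy pattern returns the two URLs separately.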
I did this a different way.
Find everything up to the first/next (https or http), then everything that follows up to (html or htm), and output just those three parts, with a carriage return/line feed after each.
So:
Find: .*?(https:|http:)(.*?)(html|htm)
Replace with: \1\2\3\r\n
Saves looking for all possible (incl non-generic) url matches.
You will need to manually remove any text after the last matched URL.
Can also be used to create url links:
Find: .*?(https:|http:)(.*?)(html|htm)
Replace: <a href="\1\2\3">\1\2\3</a>\r\n
or image links (jpg/jpeg/gif):
Find: .*?(https:|http:)(.*?)(jpeg|jpg|gif)
Replace: <img src="\1\2\3">\r\n
I know my answer won't be RegEx related, but here is another efficient way to get the lines containing URLs.
As Toto mentioned in the comments, this won't remove the text around the links.
It works at least if all the links share a clear pattern, like https://.
CTRL+F => change tab to Mark
Insert https://
Tick Mark to bookmark.
Mark All.
Find => Bookmarks => Delete all lines without bookmark.
I hope someone who lands here with the same problem will find my way more user-friendly.
You can still use RegEx to mark lines :)
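For comparison, the bookmark steps above amount to keeping only the lines that contain the marker. A minimal Python sketch, with made-up sample text:

```python
# Hypothetical document text
text = "intro line\nsee https://example.com/a.html here\nno link\nhttps://example.org/b\n"

# Keep only lines containing the marker, like "delete lines without bookmark"
kept = [line for line in text.splitlines() if "https://" in line]

print("\n".join(kept))
```

As with the bookmark approach, the text surrounding each link on a kept line is preserved.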