Extract url based on specific keyword - regex

I am crawling data from certain websites and want to extract data from specific URLs, for example URLs containing *devicehelp.optus.com.au/web/*. Here is my regex:
/[^]*devicehelp\.optus\.com\.au\/web\/[^.]*/
This regex doesn't give me the match I am looking for. Could someone please tell me what I am missing?
Test urls -
*devicehelp.optus.com.au/web/*
http://www.top.abc.something.optus.devicehelp.optus.com.au/web/web/web/
This regex works when I test it on http://regexr.com/ but doesn't on https://regex101.com/

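In most regex flavors, [^] is not a valid standalone construct: in PCRE (the regex101.com default) and several other flavors, a ] placed right after [^ is treated as a literal character, so the class runs on until the next ]. The site where your pattern worked (regexr.com) uses the JavaScript flavor, which parses [^] as "any character".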
To match any character but a newline zero or more times, you may use .*.
.*\bdevicehelp\.optus\.com\.au\/web\/.*
The \b is a word boundary, so as to match devicehelp as a whole word (if you do not intend to match it as a whole word, you may remove it). Dots should be escaped to match literal dots.
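As a rough illustration of the pattern above, here is a minimal Python sketch (Python is my choice here, not something the question specifies). The helper name matches_device_help and the example.com URL are made up for the demonstration. It also shows that Python's re module rejects a bare [^].

```python
import re

# Pattern from the answer above; "/" needs no escaping in Python.
PATTERN = re.compile(r".*\bdevicehelp\.optus\.com\.au/web/.*")


def matches_device_help(url: str) -> bool:
    """Return True if the URL contains devicehelp.optus.com.au/web/ as a whole word."""
    return PATTERN.search(url) is not None


# A bare [^] is not a valid character class in Python's re module.
try:
    re.compile(r"[^]")
    bare_negated_class_ok = True
except re.error:
    bare_negated_class_ok = False

print(matches_device_help(
    "http://www.top.abc.something.optus.devicehelp.optus.com.au/web/web/web/"))
print(matches_device_help("http://example.com/web/"))  # hypothetical non-matching URL
print(bare_negated_class_ok)
```

If the keyword may also appear inside a longer word, dropping the \b (as the answer mentions) relaxes the match.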

Related

Regex, how can I ignore partial matches? For instance ".co" would match ".com" when I regex for words between two spaces

I am trying to regex & extract some URLs from a large text file. Most URLs don't have an HTTP/HTTPS affixed to them so it is making this a lot more difficult.
To find URLs containing ".co", I wrote a regex that finds ".co" and selects from the first space before the occurrence to the first space after it:
(\S+\.co\S+)
But the problem with this comes when I have URLs with the .com TLD in the file too.
For example, this regex selects all URLs from below instead of only the ".co" URLs
pizza.com/test is good
pizza.co/test is great
Regex Extracts:
pizza.com/test
pizza.co/test
I only want it to extract:
pizza.co/test
Here is my regexr example: https://regexr.com/5hl2h
Does anyone know of a way I can achieve this with regex? Or should I look for an alternative solution?
Much appreciate the help here.
You could use
\S+\.co(?!m)\S*
Explanation
\S+ Match 1+ non whitespace chars
\.co(?!m) Match .co not directly followed by m
\S* Match 0+ non whitespace chars to also match ending with .co
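A small Python sketch of this pattern applied to the two sample lines from the question (the variable names are illustrative only):

```python
import re

# \S+      one or more non-whitespace chars
# \.co     a literal ".co"
# (?!m)    not directly followed by "m" (so ".com" is skipped)
# \S*      the rest of the token, if any
CO_URL = re.compile(r"\S+\.co(?!m)\S*")

text = "pizza.com/test is good\npizza.co/test is great"
co_urls = CO_URL.findall(text)
print(co_urls)
```

Note that (?!m) only rules out an m after .co, so a token such as x.coop would still match. If that matters, a stricter lookahead like (?![A-Za-z]) might be worth trying, though that is an assumption about the data rather than part of the answer.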

Regex lookahead. Find word without .min. in string

I'm trying to replace a link in a html file with regex and nodejs. I want to replace links without a .min.js extension.
For example, it should match "common.js" but not "common.min.js"
Here's what I've tried:
let htmlOutput = html.replace(/common\.(?!min)*js/g, common.name);
I think this negative lookahead should work but it doesn't match anything. Any help would be appreciated.
The (?!min)*js part is malformed: you should not quantify zero-width assertions such as lookaheads (they do not consume text, so a quantifier after them is either treated as an error or ignored). Since js does not start with min, the lookahead would be redundant even without the quantifier.
If you want to match a string that contains the whole word common, followed by any chars, and that ends with .js but not .min.js, you need
/\bcommon\b(?!.*\.min\.js$).*\.js$/
Details:
\b - word boundary
common - a substring
\b - word boundary
(?!.*\.min\.js$) - immediately to the right, there must not be zero or more chars followed by .min.js at the end of the string
.* - any 0 or more chars
\.js - a .js substring
$ - end of string.
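The breakdown above can be checked with a short Python sketch. The question uses Node.js, so this is only an illustration of the regex itself; the helper name is hypothetical. Because the pattern is anchored with $, it is applied here to individual path strings rather than to a whole HTML document.

```python
import re

# Same pattern as above: whole word "common", then anything ending in .js,
# unless the string ends in .min.js.
NON_MIN_COMMON = re.compile(r"\bcommon\b(?!.*\.min\.js$).*\.js$")


def is_non_minified_common(path: str) -> bool:
    """Return True for common*.js paths that do not end in .min.js."""
    return NON_MIN_COMMON.search(path) is not None


for path in ("~/dist/common.js",
             "~/dist/common.min.js",
             "~/dist/common.9cf5748e0e7fc2928a07.js"):
    print(path, is_non_minified_common(path))
```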
Here, a simpler expression may be enough: match common, optionally followed by any chars other than ., and then .js:
common([^\.]+)?\.js
The end regex I'm using is /\bcommon[^min]+js\b/g
This finds the word common followed by one or more characters, none of which is m, i or n, and ending in js. Note that [^min] is a character class that excludes those three individual characters, not the word min. It still lets me replace scripts on my HTML page like:
script src="~/dist/common.js"
OR
script src="~/dist/common.9cf5748e0e7fc2928a07.js"
Thanks to Wiktor Stribiżew for helping me.
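A quick Python sketch of how the [^min] character class behaves. The file name common.vendor.js is a made-up example, not from the thread. It is included because it contains an n without containing min.

```python
import re

# [^min] excludes the single characters m, i and n; it does not exclude the word "min".
CHAR_CLASS = re.compile(r"\bcommon[^min]+js\b")

names = [
    "common.js",
    "common.9cf5748e0e7fc2928a07.js",
    "common.min.js",
    "common.vendor.js",  # hypothetical name: contains "n" but not "min"
]
results = {name: bool(CHAR_CLASS.search(name)) for name in names}
print(results)
```

The hex hash happens to contain none of m, i or n, which is why the pattern works for the two script tags above. A name like common.vendor.js is rejected as well, so the lookahead version is probably the safer choice if other file names can occur.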

Regex - returning a match without a period

I'm using the below regex string to match the word "kohls" which is located in a group of other words.
\W*((?i)kohls(?-i))\W*
It works great when the word is alone, but if the word is in a url, the match includes a period on both sides.
See the below examples:
Thank you for shopping at Kohls - returns a match for kohls.
https://www.kohls.com - returns a match for .kohls.
Edit. https://www.KohlsAndMichaels.com - doesn't return any match for kohls.
I want it to only extract the exact match for kohls without periods or any other symbols/text in front or behind it. Can you tell me what I'm doing wrong?
In cases like this you can use a site such as regex101.com, which explains the regular expression and highlights the matches.
The problem with the dots is the \W*, which matches any run of non-word characters, including periods. To fix this, you can use the following regular expression:
\b((?i)kohls(?-i))\b
The \b (before and after the word you want to match) asserts the position at a word boundary.
If you still have questions, look at the explanation of the regular expression provided by that website. It is worth a look.
The \W metacharacter matches non-word characters, so adding a star matches zero or more of them (such as periods). Did you mean to use a word boundary instead?
\b(?i)kohls(?-i)\b
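For illustration, a minimal Python sketch of the word-boundary approach. As far as I know, Python's re module does not accept the (?i)...(?-i) toggle form used above, so this sketch swaps in the re.IGNORECASE flag instead. The helper name find_kohls is hypothetical.

```python
import re

# Word boundaries on both sides; case-insensitivity comes from the flag
# instead of the inline (?i)...(?-i) toggles (an adaptation for Python).
KOHLS = re.compile(r"\bkohls\b", re.IGNORECASE)


def find_kohls(text: str):
    """Return the matched word without surrounding punctuation, or None."""
    match = KOHLS.search(text)
    return match.group(0) if match else None


print(find_kohls("Thank you for shopping at Kohls"))
print(find_kohls("https://www.kohls.com"))
print(find_kohls("https://www.KohlsAndMichaels.com"))
```

The third URL still yields no match, because Kohls is directly followed by another word character. Dropping the trailing \b would be one way to match it, if that is what is wanted.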
Replace both \W* with [\W,\.\-]* etc.
Should be enough.

regex and string extraction from HTML

How do I modify the following string manipulation to look for "Text to Extract" in the HTML code below? I don't understand "(?<=')[^']+". I understand it is a regex pattern and I looked it up on a website, but I don't get the logic of it. Maybe if someone shows me how to do it for my case, I could understand better.
if let match = dataString?.range(of: "(?<=')[^']+", options: .regularExpression) {
    print(dataString?.substring(with: match) as Any)
}
HTML code:
<span class="phrase">Text to Extract</span></span></span></p>
First, https://regex101.com/ is a free online resource where you can test regex, and it will explain what each part of it is doing.
The regex (?<=')[^']+ can be broken down as follows
(?<=<token>) is a positive look-behind for a token. In this case, the char single-quote (')
[^<chars>] match anything not one of the following characters. In this case, the char single-quote (')
+ match the previous token 1 or more times. In this case, [^']
So the above regex matches a run of non-quote characters that directly follows a '. Note that this has no concept of opening and closing, so in a'b'c'd'e it would match b, c, d, and also the trailing e.
To match a literal phrase, you would just use that phrase in your regex (escaping any regex special characters with \).
If you need context aware (nest tracking) extraction, any regex will be inherently wrong, and you will need an HTML parser to extract it for you.
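To make the lookbehind logic concrete, here is a small Python sketch (the question is about Swift, so this only illustrates the regex). The second pattern is my own hypothetical adaptation for the HTML snippet in the question: it swaps the single quote for the literal opening tag and stops at the next <. Note that on a'b'c'd'e the lookbehind only checks the character before the match, so the trailing e is matched too.

```python
import re

# Original pattern: a run of non-quote chars that directly follows a single quote.
QUOTED = re.compile(r"(?<=')[^']+")
quoted_parts = QUOTED.findall("a'b'c'd'e")
print(quoted_parts)

# Hypothetical adaptation: a run of non-"<" chars right after the opening span tag.
PHRASE = re.compile(r'(?<=<span class="phrase">)[^<]+')
html = '<span class="phrase">Text to Extract</span></span></span></p>'
match = PHRASE.search(html)
phrase = match.group(0) if match else None
print(phrase)
```

This is fine for a single known tag, but for nested or varying markup an HTML parser remains the more robust option, as noted above.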

Mixing Lookahead and Lookbehind in 1 Regexp

I'm trying to match first occurrence of window.location.replace("http://stackoverflow.com") in some HTML string.
Especially I want to capture the URL of the first window.location.replace entry in whole HTML string.
To capture the URL, I formulated these 2 rules:
it should be after this string: window.location.redirect("
it should be before this string ")
To achieve it I think I need to use lookbehind (for 1st rule) and lookahead (for 2nd rule).
I end up with this Regex:
.+(?<=window\.location\.redirect\(\"?=\"\))
It doesn't work. I'm not even sure that it is legal to mix both rules like I did.
Can you please help me with translating my rules to Regex? Other ways of doing this (without lookahead(behind)) also appreciated.
The pattern you wrote is not the one you need, as it matches something very different from what you expect: the text up to and including window.location.redirect("=") in a string like window.location.redirect("=") something. It will only work in PCRE/Python if you remove the ? before \" (lookbehinds must be fixed-width there), while it works with the ? in .NET regex.
If it is JS, note that lookbehind is only available in engines that implement ES2018 or later; older engines do not support it at all.
Instead, use a capturing group around the unknown part you want to get:
/window\.location\.redirect\("([^"]*)"\)/
or
/window\.location\.redirect\("(.*?)"\)/
Leaving out the /g modifier makes the regex match only the first occurrence. Access the value you need in Group 1.
The ([^"]*) captures 0+ characters other than a double quote (URLs you need should not have it). If these URLs you have contain a ", you should use the second approach as (.*?) will match any 0+ characters other than a newline up to the first ").
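A minimal Python sketch of the capturing-group approach, given only to illustrate the regex. The helper name first_redirect_url and the sample HTML are made up. re.search stops at the first match, which mirrors leaving out /g in JavaScript.

```python
import re

# Group 1 captures everything between the quotes: zero or more non-quote chars.
REDIRECT = re.compile(r'window\.location\.redirect\("([^"]*)"\)')


def first_redirect_url(html: str):
    """Return the URL of the first window.location.redirect("...") call, or None."""
    match = REDIRECT.search(html)
    return match.group(1) if match else None


# Hypothetical sample input with two redirect calls.
sample = (
    '<script>window.location.redirect("http://stackoverflow.com");</script>'
    '<script>window.location.redirect("http://example.com");</script>'
)
print(first_redirect_url(sample))
```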