Regex not working with similar looking string - regex

I am trying to get the content between the start and end tags for the strings below:
Products
Services & Solution
Regex used:
<([a-z0-9]+)([^<]+)*(?:>(.*?)</\2>|\D+/>)
It works fine for the first string but not for the latter one.

Why so complex? Won't a simple />([^<]+)</ capture the content of an element?

Depending on the regex flavour, you can use lookbehind and lookahead to match just the text between > and <, i.e.
(?<=>)[^>]*(?=<)
(?<=>) - a lookbehind: asserts that a > comes immediately before the match
(?=<) - a lookahead: asserts that a < comes immediately after the match
[^>]* - matches the text in the link itself
Lookahead and lookbehind are zero-width assertions, so you will get just what you need.
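A minimal sketch of the lookaround approach in JavaScript (lookbehind needs an ES2018-capable engine). The sample markup is an assumption, since the original tags did not survive in the question, and innerText is a hypothetical helper name.

```javascript
// Hypothetical helper: returns the text between the first > and the next <.
function innerText(markup) {
  const match = markup.match(/(?<=>)[^>]*(?=<)/);
  return match === null ? null : match[0];
}

// Assumed sample inputs; the real tags from the question are unknown.
console.log(innerText('<a href="#">Products</a>'));
console.log(innerText('<a href="#">Services & Solution</a>'));
```

Because both assertions are zero-width, the surrounding > and < are never part of the returned text.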

Usually you don't want to parse HTML yourself with regex; parsers are better at that.
Assuming you are using PCRE, here's a guess at the expression you are looking for:
(?is)<([a-z]+)\b[^<>]*(?:>(.*?)</\1>|/>)
Note that this will not work with nested tags.
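The same expression can be tried in JavaScript as a rough sketch. This is a translation rather than the PCRE original: the leading (?is) is moved into the i and s flags, and elementContent and the sample markup are hypothetical.

```javascript
// The PCRE pattern above, with the leading (?is) expressed as regex flags.
const ELEMENT = /<([a-z]+)\b[^<>]*(?:>(.*?)<\/\1>|\/>)/is;

// Hypothetical helper: content of the first element found, or null.
function elementContent(markup) {
  const match = ELEMENT.exec(markup);
  if (match === null) return null;
  // match[2] is undefined when the self-closing alternative matched.
  return match[2] === undefined ? '' : match[2];
}

console.log(elementContent('<a href="#">Services & Solution</a>'));
```

The backreference \1 is what ties the closing tag to the opening tag name, which is also why nested tags of the same name defeat it.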

Just get rid of the tags.
var str = 'Products '
var str2 = 'Services & Solution '
var RE_findOpenAndCloseTag = /^<[^>]+>|<\/[^>]+>$/g;
str.replace( RE_findOpenAndCloseTag, '' ) == "Products ";
str2.replace( RE_findOpenAndCloseTag, '' ) == "Services & Solution ";
Note that RE_findOpenAndCloseTag assumes that tags always start with a < and do not contain a > except as the character that closes the tag.
Thus this will fail.
'>">This will fail
But an easier way would be to convert the tags into a node, then get the innerHTML.

Try this; it will resolve your issue (just add |</\1>):
<([a-z0-9]+)([^<]+)*(?:>(.*?)|\D+/>|</\1>)

Related

Regex ignore first 12 characters from string

I'm trying to create a custom filter in Google Analytics to remove the query parts of the URL that I don't want to see. The URL has the following structure:
[domain]/?p=899:2000:15018702722302::NO:::
I would like to create a regex which skips the first 12 characters (that is, up to /?p=899:2000) and replaces whatever comes after that with nothing.
So I made this one: https://regex101.com/r/Xgbfqz/1 (which could be simplified to .{0,12}), but I would actually like to skip those characters and have the regex match only what comes after them, so that I can tell Google Analytics to replace it with "".
The part in the url that is always the same is
?p=[3numbers]:[0-4numbers]
Thank you
Your regular expression:
\/\?p=\d{3}\:\d{0,4}(.*)
Tested in Golang RegEx 2 and RegEx101
It searches for /?p=###: followed by up to four digits, and captures the rest of the string to the right of that.
(extra) JavaScript:
var paragraf = '[domain]/?p=899:2000:15018702722302::NO:::';
var regex= /\/\?p=\d{3}\:\d{0,4}(.*)/;
var match = regex.exec(paragraf);
alert('The rest of the right side of the string: ' + match[1]);
Easily use "[domain]/?p=899:2000:15018702722302::NO:::".substr(12)
You can try this:
/\?p\=\d{3}:\d{0,4}
Which matches just this: ?p=[3numbers]:[0-4numbers]
Not sure about replacing though.
https://regex101.com/r/Xgbfqz/1
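On the replacing part, one possible approach is to capture the prefix you want to keep and substitute the whole match with that capture. The sketch below does this in plain JavaScript; stripAfterPrefix is a hypothetical name, and how this maps onto a Google Analytics filter field is not covered here.

```javascript
// Hypothetical helper: keep everything up to ?p=###:#### and drop the rest.
function stripAfterPrefix(url) {
  // Group 1 holds the part to keep; the trailing .* is discarded.
  return url.replace(/(\/\?p=\d{3}:\d{0,4}).*/, '$1');
}

console.log(stripAfterPrefix('[domain]/?p=899:2000:15018702722302::NO:::'));
```

A string that does not contain the ?p= prefix is returned unchanged, because replace only acts on a match.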

Negative lookbehind in a regex with an optional prefix

We are using the following regex to recognize URLs (derived from this gist by John Gruber). It is executed in Scala using scala.util.matching, which in turn uses java.util.regex:
(?i)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?!js)[a-z]{2,6}/)(?:[^\s()<>{}\[\]]+)(?:[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!#)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?!js)[a-z]{2,6}\b/?(?!#)))
This version has escaped forward slashes, for Rubular:
(?i)\b(((?:https?:(?:\/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?!js)[a-z]{2,6}\/)(?:[^\s()<>{}\[\]]+)(?:[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!#)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?!js)[a-z]{2,6}\b\/?(?!#))))
Previously the front end was only sending plaintext to the back end; now, however, users are allowed to create anchor tags for URLs. The back end therefore needs to recognize URLs except for those that are already in anchor tags. I initially tried to accomplish this with a negative lookbehind, ignoring URLs with an href=" prefix:
(?i)\b((?<!href=")((?:https?: ... etc
The problem is that our url regex is very liberal, recognizing http://www.google.com, www.google.com, and google.com - given
Google
the negative lookbehind will ignore http://www.google.com, but then the regex will still recognize www.google.com. I'm wondering if there's a succinct way to tell the regex "ignore www.google.com and google.com if they are substrings of an ignored http(s)://www.google.com"
At present I'm using a filter on the url regex matches (code is in Scala) - this also ignores urls in link text (www.google.com) by ignoring urls with a > prefix and </a> suffix. I'd rather stick with the filter if doing this in a regex would make an already complicated regex even more unreadable.
urlPattern.findAllMatchIn(text).toList.filter(m => {
  val start: Int = m.start(1)
  val end: Int = m.end(1)
  // The URL is preceded by href=" (6 characters)
  val isHref: Boolean = (start - 6 >= 0) &&
    text.substring(start - 6, start) == """href=""""
  // The URL is link text: preceded by > and followed by </a> (4 characters)
  val isAnchor: Boolean = (start - 1 >= 0 && end + 4 <= text.length &&
    text.substring(start - 1, start) == ">" &&
    text.substring(end, end + 4) == "</a>")
  !(isHref || isAnchor) && Option(m.group(1)).isDefined
})
<a href=\S+|\b((?:https?:(?:\/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?!js)[a-z]{2,6}\/)(?:[^\s()<>{}\[\]]+)(?:[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!#)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?!js)[a-z]{2,6}\b\/?(?!#)))
or
<a href=(?:(?!<\/a>).)*<\/a>|\b((?:https?:(?:\/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?!js)[a-z]{2,6}\/)(?:[^\s()<>{}\[\]]+)(?:[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!#)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?!js)[a-z]{2,6}\b\/?(?!#)))
Try this. What it essentially does is:
Consumes all href links so that they cannot be matched later.
Does not capture them, so they will not appear in the groups anyway.
Processes the rest as before.
See demo.
http://regex101.com/r/vR4fY4/17
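The consume-then-capture idea can be sketched in JavaScript as follows. The URL part is a deliberately simplified stand-in for the full pattern (an assumption for brevity), and findBareUrls is a hypothetical helper.

```javascript
// Simplified stand-in for the full URL pattern (an assumption for brevity).
const SIMPLE_URL = String.raw`\b((?:https?:\/\/)?[a-z0-9.-]+\.[a-z]{2,6})\b`;

// The first alternative swallows a whole anchor element without capturing it.
const ANCHOR_OR_URL = new RegExp(
  String.raw`<a href=(?:(?!<\/a>).)*<\/a>|` + SIMPLE_URL, 'gi');

// Hypothetical helper: URLs that are not part of an anchor element.
function findBareUrls(text) {
  return [...text.matchAll(ANCHOR_OR_URL)]
    .map(m => m[1])                   // undefined when the anchor branch matched
    .filter(url => url !== undefined);
}

console.log(findBareUrls('<a href="http://www.google.com">www.google.com</a> or example.org'));
```

Because the anchor is consumed as one match, neither the href value nor the link text can be matched again as a fragment.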
It seems that you want to ignore not only www.google.com and google.com when they are substrings of an ignored http(s)://www.google.com, but any substring fragment of a previously ignored section. In that case, you can use a bit of code to work around this. Please see the regex:
(a href=")?(?i)\b(((?:https?:(?:\/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?!js)[a-z]{2,6}\/)(?:[^\s()<>{}\[\]]+)(?:[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!#)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?!js)[a-z]{2,6}\b\/?(?!#))))
^^^^^^^^^^^
I'm not good at Scala, but you can probably do this:
val links = new Regex("""(a href=")?(?i)\b(((?:https?:... """, "unwanted")
val unwanted = for (o <- links findAllMatchIn text) yield o group "unwanted"
If the "unwanted" group is null for a match, then that match is useful.
You can work around the need for a lookbehind by matching the unwanted case as an alternative:
a href="(?i)\b(?:(?:https?:(?:\/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?!js)[a-z]{2,6}\/)(?:[^\s()<>{}\[\]]+)(?:[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!#)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?!js)[a-z]{2,6}\b\/?(?!#)))|((?i)\b(((?:https?:(?:\/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?!js)[a-z]{2,6}\/)(?:[^\s()<>{}\[\]]+)(?:[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!#)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?!js)[a-z]{2,6}\b\/?(?!#)))))
The second part of the regex, after the pipe |, is a capturing group. When replacing with this regex, you can refer to that first group as \1.
Similar question:
Regex Pattern to Match, Excluding when... / Except between
How about just adding the <a href= part as an optional group, then when checking your matching, you only return those matches in which that group is empty?
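A rough JavaScript sketch of that optional-group filter. The URL part is a simplified stand-in for the full pattern (an assumption), and urlsOutsideHref is a hypothetical helper.

```javascript
// Simplified stand-in for the full URL pattern (an assumption for brevity).
// Group 1 is the optional href=" prefix, group 2 is the URL itself.
const PREFIXED_URL = /(href=")?\b((?:https?:\/\/)?[a-z0-9.-]+\.[a-z]{2,6})\b/gi;

// Hypothetical helper: URLs that were matched without the href=" prefix.
function urlsOutsideHref(text) {
  return [...text.matchAll(PREFIXED_URL)]
    .filter(m => m[1] === undefined) // keep only matches where the prefix group is empty
    .map(m => m[2]);
}

console.log(urlsOutsideHref('See <a href="http://www.google.com">Google</a> and example.org'));
```

The prefixed URL is still matched and consumed, so its www.google.com and google.com fragments are not found again. This sketch does not handle URLs used as link text.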

Regex URI portion: Remove hyphens

I have to split URIs on the second portion:
/directory/this-part/blah
The issue I'm facing is that I have 2 URIs which logically need to be one
/directory/house-&-home/blah
/directory/house-%26-home/blah
This comes back as:
house-&-home and house-%26-home
So logically I need a regex to retrieve the second portion but also remove everything between the hyphens.
I have this, so far:
/[^(/;\?)]*/([^(/;\?)]*).*
(?<=directory\/)(.+?)(?=\/)
Does this solve your issue? This returns:
house-&-home and house-%26-home
Here is a demo
If you want to get the result:
house--home
then you should use a replace method. Because I am not sure what language you are using, I will give my example in Java:
// requires java.util.regex.Pattern and java.util.regex.Matcher
String regex = "(?<=directory/)(.+?)(?=/)";
String str = "/directory/house-&-home/blah";
Matcher m = Pattern.compile(regex).matcher(str);
String part = m.find() ? m.group(1).replace("&", "") : null; // "house--home"
The replace method lets you replace a given piece of text (here the & symbol) with nothing ("").
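To make both URIs from the question collapse to the same value, one option is to decode the segment before stripping the ampersand. The JavaScript sketch below assumes that decoding %26 to & is acceptable for your data; secondSegmentKey is a hypothetical name.

```javascript
// Hypothetical helper: second path segment with & (or its %26 form) removed.
function secondSegmentKey(uri) {
  const match = /^\/[^\/;?]*\/([^\/;?]*)/.exec(uri);
  if (match === null) return null;
  // decodeURIComponent turns %26 into &; it throws on malformed escapes.
  return decodeURIComponent(match[1]).replace(/&/g, '');
}

console.log(secondSegmentKey('/directory/house-&-home/blah'));
console.log(secondSegmentKey('/directory/house-%26-home/blah'));
```

Both calls yield the same key, so the two URIs can be grouped together.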

Regular expressions: Matching text up to last index of character

For example:
http://foobar.com/foo/bar/foobar.php
From this address, I need to extract the following:
http://foobar.com/foo/bar
I have tried with the following regex:
(?<namespace>.*)/.*?
but returned value is
http:
Can anyone help? Thanks.
Try this:
^(?<namespace>.*)/[^/]+$
A quick explanation:
^ # the start of input
(?<namespace>.*)/ # zero or more chars followed by a '/' (which is the last '/')
[^/]+ # one or more chars other than '/'
$ # the end of input
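As a quick check, the pattern can be run in JavaScript, which supports named groups from ES2018 onwards. This is only a sketch, and parentPath is a hypothetical helper name.

```javascript
// Hypothetical helper: everything before the last '/' of the input.
function parentPath(url) {
  const match = /^(?<namespace>.*)\/[^\/]+$/.exec(url);
  return match === null ? null : match.groups.namespace;
}

console.log(parentPath('http://foobar.com/foo/bar/foobar.php'));
```

The greedy .* runs to the end and backtracks to the last '/', which is what pins the split point.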
I think a regex is overkill here. What programming language are you using? This would be how it's done in JavaScript.
var url = 'http://foobar.com/foo/bar/foobar.php'
url.split('/').slice(0,-1).join('/')
You could even use substr for some performance!
var url = 'http://foobar.com/foo/bar/foobar.php'
url.substr(0, url.lastIndexOf('/'))
The only reason I offered the array way is because I'm not sure of cross browser compatibility on lastIndexOf.
Try with this expression:
^(?<namespace>.*)/.*$

Return a vbscript regex match on multilines

I am using vbscript regex to find self-defined tags within a file.
"\[\$[\s,\S]*\$\]"
Unfortunately, I am doing something wrong, so it will grab all of the text between two different tags. I know this is caused by not excluding "$]" between the pre and post tag, but I can't seem to find the right way to fix this. For example:
[$String1$]
useless text
[$String2$]
returns
[$String1$]
useless text
[$String2$]
as one match.
I want to get
[$String1$]
[$String2$]
as two different matches.
Any help is appreciated.
Wade
The RegEx is greedy and will try to match as much as it can in one go.
For this kind of matching where you have a specific format, instead of matching everything until the closing tag, try matching NOT CLOSING TAG until closing tag. This will prevent the match from jumping to the end.
"\[\$[^\$]*\$\]"
Make the * quantifier lazy by adding a ?:
"\[\$[\s\S]*?\$\]"
should work.
Or restrict what you allow to be matched between your delimiters:
"\[\$.*\$\]"
will work as long as there is only one [$String$] section per line, and sections never span multiple lines;
"\[\$(?:(?!\$\])[\s\S])*\$\]"
checks before matching each character after a [$ that no $] follows there.
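The difference between these variants can be seen side by side in the sketch below. It runs the patterns in JavaScript rather than VBScript, so treat it as an illustration of the regex behaviour only; the sample text is the one from the question.

```javascript
const sample = '[$String1$]\nuseless text\n[$String2$]';

// Greedy: runs to the last $] in the input, giving one big match.
const greedy = sample.match(/\[\$[\s\S]*\$\]/g);
// Lazy: stops at the first $] after each [$.
const lazy = sample.match(/\[\$[\s\S]*?\$\]/g);
// Negated class: never crosses a $ character.
const negated = sample.match(/\[\$[^$]*\$\]/g);
// Tempered: checks before each character that no $] starts there.
const tempered = sample.match(/\[\$(?:(?!\$\])[\s\S])*\$\]/g);

console.log(greedy.length, lazy, negated, tempered);
```

The negated-class version is the simplest, but it also rejects tags whose content contains a lone $; the lazy and tempered versions allow that.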
No need to use regex. Try this, if your tags are always delimited by [$...$]:
Set objFS = CreateObject("Scripting.FileSystemObject")
strFile = WScript.Arguments(0)
Set objFile = objFS.OpenTextFile(strFile)
strContent = objFile.ReadAll
strContent = Split(strContent, "$]")
For i = LBound(strContent) To UBound(strContent)
    m = InStr(strContent(i), "[$")
    If m > 0 Then
        WScript.Echo Mid(strContent(i), m) & "$]"
    End If
Next