regex in as3 to ignore match with specific start to string - regex

I have the following as3 function below which converts normal html with links so that the links have 'event:' prepended so that I can catch them with a TextEvent listener.
protected function convertLinks(str:String):String
{
var p1:RegExp = /href|HREF="(.[^"]*)"/gs;
str = str.replace(p1,'HREF="event:$1"');
return str;
}
For example
<a href="http://www.somedomain.com">
gets converted to
<a href="event:http://www.somedomain.com">
This works just fine, but I have a problem with links that have already been converted.
I need to exclude the situation where I have a string such as
<a href="event:http://www.somedomain.com">
put through the function, because at the moment this gets converted to
<a href="event:event:http://www.somedomain.com">
Which breaks the link.
How can I modify my function so that links with 'event:' at the start are NOT matched and are left unchanged?

First of all, trying to manipulate HTML with regex may not be a good idea.
That said, according to the flavor comparison chart on regular-expressions.info, ActionScript's regex is based on the ECMAScript engine, which supports lookaheads.
Thus you can write this:
/(?:href|HREF)="(?!event:)(.[^"]*)"/
(?=…) is positive lookahead; it asserts that a given pattern can be matched. (?!…) is negative lookahead; it asserts that a given pattern can NOT be matched.
Note that the . before [^"]* is very peculiar. It is probably not intended, since . can match a closing double quote (for example in href="").
Note also that I've fixed the alternation for the href/HREF by using a non-capturing group (?:…).
This is because:
this|that matches either "this" or "that"
this|that thing matches either "this" or "that thing"
(this|that) thing matches either "this thing" or "that thing"
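The precedence rule above is easy to check in code. The sketch below uses Python's re module rather than ActionScript (alternation binds the same way in both), and matches_whole is a hypothetical helper introduced only for this illustration.

```python
import re

def matches_whole(pattern, text):
    """Return True when the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

# Ungrouped: the alternatives are "this" and "that thing".
print(matches_whole(r'this|that thing', 'this'))             # True
print(matches_whole(r'this|that thing', 'this thing'))       # False

# Grouped: the alternation only covers "this" or "that".
print(matches_whole(r'(?:this|that) thing', 'this thing'))   # True
```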
Alternatively, you may want to just turn on the case-insensitivity flag /i, which would handle things like hReF or eVeNt:.
Thus, perhaps your pattern should just be
/href="(?!event:)([^"]*)"/gsi
If lookahead were not supported, you could use an optional pattern that matches event: if it's there, excluding it from group 1, so that it doesn't get included when you substitute in $1.
/href="(?:event:)?([^"]*)"/gsi
       \________/ \_____/
     non-capturing group 1
       optional
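As a rough illustration of the lookahead variant, here is a minimal sketch in Python's re module, not ActionScript. The name convert_links mirrors the question's function, but this port is an assumption made for the example; note that it always emits a lowercase href.

```python
import re

# Case-insensitive; the negative lookahead rejects values that
# already start with "event:".
HREF = re.compile(r'href="(?!event:)([^"]*)"', re.IGNORECASE)

def convert_links(html: str) -> str:
    """Prefix every not-yet-converted href value with 'event:'."""
    return HREF.sub(r'href="event:\1"', html)

once = convert_links('<a href="http://www.somedomain.com">')
twice = convert_links(once)
print(once)           # <a href="event:http://www.somedomain.com">
print(twice == once)  # True
```

Running the function on its own output leaves the string unchanged, which is the behavior the question asks for.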

Related

Regex to match text from multiple links

How to extract links which contain a certain word?
For e.g.:
https://www.test.com/text/1###https://www.test.com/text/word/2###https://www.test.com/text/text/word/3###https://www.test.com/3/text###https://www.test.com/word/3/text/text
How to search "word" from below regex?
((https:).*?(###))
The result should be like this
https://www.test.com/text/word/2
https://www.test.com/text/text/word/3
https://www.test.com/word/3/text/text
Let's try to build such regex. First we need to find the beginning of url:
/(https?:\/\//
We add ? after https for http urls.
Then we need to find any text except ###, so we need to add:
(?:(?!###).)*
which means - any amount of characters not starting a ### sequence.
Also we need to add word itself and previous sub-expression again, since word can be surrounded by any text:
word(?:(?!###).)*
That sub-expression already stops right before ###. To make the end of the match explicit, we can also require that the last matched character is directly followed by ### or the end of the string:
.(?=###|$)
which means - any character followed by ### or end of string. The final expression will look like:
/(https?:\/\/(?:(?!###).)*word(?:(?!###).)*.(?=###|$))/g
But I believe it's better to just split the text by ### and then check for the needed word with String.prototype.includes.
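Both approaches can be compared side by side. The following is a minimal sketch in Python rather than JavaScript; the function names urls_by_regex and urls_by_split are hypothetical, and the pattern is the final expression above without the JavaScript delimiters and slash escaping.

```python
import re

TEXT = ('https://www.test.com/text/1###https://www.test.com/text/word/2###'
        'https://www.test.com/text/text/word/3###https://www.test.com/3/text###'
        'https://www.test.com/word/3/text/text')

# (?:(?!###).)* consumes one character at a time and refuses to step
# onto a "###" separator, so a match cannot span two URLs.
URL_WITH_WORD = re.compile(r'https?://(?:(?!###).)*word(?:(?!###).)*.(?=###|$)')

def urls_by_regex(text):
    """Regex-only approach: return every URL that contains 'word'."""
    return URL_WITH_WORD.findall(text)

def urls_by_split(text):
    """Split approach: cut on the separator, then test each piece."""
    return [url for url in text.split('###') if 'word' in url]

print(urls_by_regex(TEXT))
print(urls_by_split(TEXT) == urls_by_regex(TEXT))  # True
```

On this sample both functions return the same three URLs; the split version is shorter and arguably easier to maintain.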
If the word has to be a part of the pathname, you might use filter in combination with URL and check if the parts of the pathname contain word.
let str = 'https://www.test.com/text/1###https://www.test.com/text/word/2###https://www.test.com/text/text/word/3###https://www.test.com/3/text###https://www.test.com/word/3/text/text';
let filteredUrls = str.split("###")
  .filter(s =>
    new URL(s).pathname
      .split('/')
      .includes('word')
  );
console.log(filteredUrls);
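For comparison, the same path-segment check can be sketched in Python with urllib.parse.urlsplit from the standard library. The helper name urls_with_path_segment and its sep parameter are assumptions made for this example.

```python
from urllib.parse import urlsplit

TEXT = ('https://www.test.com/text/1###https://www.test.com/text/word/2###'
        'https://www.test.com/text/text/word/3###https://www.test.com/3/text###'
        'https://www.test.com/word/3/text/text')

def urls_with_path_segment(text, segment, sep='###'):
    """Keep URLs whose path has a segment exactly equal to `segment`."""
    return [url for url in text.split(sep)
            if segment in urlsplit(url).path.split('/')]

print(urls_with_path_segment(TEXT, 'word'))
```

Because whole path segments are compared, a URL such as one containing /swords/ or a host name containing "word" is not selected, unlike with a plain substring test.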
If you want to use regex only and possessive quantifiers are supported (The javascript tag has been removed) you might use:
https?://[^#w]*(?:#(?!##)|w(?!ord)|[^#w]*)++word.*?(?=###|$)
Previous answer
You are most likely looking for this regular expression:
https://www\.test\.com/(text/)*word/\d+(/text)*
Here is how you can use it in a JavaScript context (every slash / is escaped with a backslash: \/):
var str = 'https://www.test.com/text/1###https://www.test.com/text/word/2###https://www.test.com/text/text/word/3###https://www.test.com/3/text###https://www.test.com/word/3/text/text';
var urls = str.match(/https:\/\/www\.test\.com\/(text\/)*word\/\d+(\/text)*/g);
console.log(urls);
In the array you get exactly the elements you wanted.
Update, following the edited question and the author's comment
If you need to take the words from your example string, then you have to use a slightly more complex regular expression:
var str = 'https://www.test.com/text/1###https://www.test.com/text/word/2###https://www.test.com/text/text/word/3###https://www.test.com/3/text###https://www.test.com/word/3/text/text';
var urls = str.match(/(?<=\/)\w+(?=\/\d+\/\w)|(?<=(\w\/\w+\/))\w+(?=\/\d)/g);
console.log(urls);
Explanation
Here the regular expression is /(?<=\/)\w+(?=\/\d+\/\w)|(?<=(\w\/\w+\/))\w+(?=\/\d)/g, delimited by /.../ and with the g flag so that every occurrence is found rather than only the first.
The regular expression has two alternatives ...|...
The first one (?<=\/)\w+(?=\/\d+\/\w) captures cases when the searched word is directly behind the slash (?<=\/) and before more words behind the number (?=\/\d+\/\w).
https://www.test.com/word/3/text/text
The second alternative (?<=(\w\/\w+\/))\w+(?=\/\d) captures cases where the word is preceded by other words following the domain (?<=(\w\/\w+\/)) (in fact two slashes separated by alphanumeric characters) and the searched word is immediately before the slash followed by the number (?=\/\d).
https://www.test.com/text/word/2
https://www.test.com/text/text/word/3
All slashes must be escaped: \/.
The construction (?<=...) means lookbehind in regular expressions and (?=...) means lookahead in regular expressions.
Note 1. The above example currently only works well in the Chrome browser, because:
(...) now lookbehind is part of the ECMAScript 2018 specification. As of this writing (late 2018), Google's Chrome browser is the only popular JavaScript implementation that supports lookbehind. So if cross-browser compatibility matters, you can't use lookbehind in JavaScript.
Note 2. Even in engines that support lookbehind, it usually must contain a fixed-length pattern. I do not respect that restriction in the example above, because the pattern is still valid in the regular expression engines used by Google Chrome's JavaScript engine, the JGsoft engine and the .NET framework RegEx classes.
Note 3. The lookbehind syntax or its poorer \K replacement is supported by many regular expression engines used in a large group of programming languages.
You can find more explanation of the regular expressions I used, for example, here.
You may first split by ### then check whether /word/ exists in each element:
var s = 'https://www.test.com/text/1###https://www.test.com/text/word/2###https://www.test.com/text/text/word/3###https://www.test.com/3/text###https://www.test.com/word/3/text/text';
var result = [];
s.split(/###/).forEach(function(el) {
    if (el.includes('/word/'))
        result.push(el);
});
// or else by using filter
// result = s.split(/###/).filter(el => el.includes('/word/'))
console.log(result);

Can regex alternatives (a|b|c) work together with the end of the pattern?

I have a regex pattern that I'm using to try and match anything wrapped in an <a>, <em>, or quote ".
(?:<a.*?>|<em>|")(.*?)(?:"|<\/em>|<\/a>)
However, what I'd like to do is force the <a>'s to work together, and the <em>'s and so on. What I don't want is for it to match a string that starts with an <a> but ends with a ".
For example:
<a href='google.com'>"Google"</a>
Should return Google (and probably also "Google", but that's not a big deal). However, at the moment, it's returning href='google.com'> as a match (and completely ignoring "Google") since it starts and ends with the "correct" patterns.
You can see all the ways this particular pattern breaks here on Regex101.
So is there a way to tell regex that if it starts a match with <a> that it must finish with </a> (and the same for the other patterns)?
You want a back reference:
<(a|em|")[^>]*>(.*?)(?:</\1>)
See live demo.
Your target is in group 2 (there's no avoiding capturing the tag as group 1 if you use a back reference).
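A minimal sketch of the back-reference idea in Python follows. It is an adaptation, not the answer's exact pattern: the " alternative is left out because a quote is not a tag, and \b is added so that only a and em tag names match; both changes, and the helper name wrapped_text, are assumptions made for this example.

```python
import re

# Group 1 captures the tag name; \1 forces the closing tag to
# repeat exactly that name. Group 2 is the wrapped content.
TAG_PAIR = re.compile(r'<(a|em)\b[^>]*>(.*?)</\1>', re.DOTALL)

def wrapped_text(html):
    """Return the content of every properly paired <a> or <em> element."""
    return [m.group(2) for m in TAG_PAIR.finditer(html)]

sample = """<a href='google.com'>"Google"</a> and <em>stress</em>"""
print(wrapped_text(sample))          # ['"Google"', 'stress']
print(wrapped_text('<a>oops</em>'))  # []
```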

Bounding Multiple Matches With Single Text

I'm trying to parse out the properties of a type (eg. the words 'Cusip', 'Issuer', and 'Coupon') shown here:
Public Type GetPricesResponse
Cusip As String
Issuer As String
Coupon As String
End Type
The regex ([a-zA-Z0-9]+).+As works great for this code snippet (see http://regexr.com?300fl), but may not work when mixed with a larger body of code. So, I've tried to "bound" this regex with the words Public Type on the front, and End Type at the end to specifically identify what I need as follows:
Public\sType\s([a-zA-Z0-9]+).+As.+End\sType
...but of course it then doesn't match anything.
I have the MultiLine option set as well.
You've presented two different problems.
The first is, roughly, "can I write a regex to match this thing", the answer is yes. For simplicity I've used \w instead of [a-zA-Z0-9]:
Public\s+Type\s+(\w+)\s+((\w+)\s+As\s+(\w+)\s*('.*\s*)?)+End\s+Type
The next is "how can I parse out the properties" and the answer to that is, as written in the comments: don't use a single regex. First, use a regex which captures only the definitions:
Public\s+Type\s+\w+\s+(.*?)End\s+Type
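This uses the reluctant quantifier *? so that the regex won't gobble up End Type, and the DOTALL flag so that . can match across several lines. From this match, take group 1 and repeatedly find the following, with the MULTILINE flag set so that ^ and $ match at every line break (the leading whitespace is optional because the \s+ before group 1 has already consumed the indentation of the first property line):
^\s*(\w+)\s+.*$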
Group 1 from this match will be your property name.
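The two-step approach can be sketched in Python as follows. The question does not name its regex engine, so the use of Python's re module is an assumption, as are the names TYPE_BODY, PROPERTY and property_names. The per-line pattern is also tightened slightly here to require the As keyword.

```python
import re

SOURCE = """Public Type GetPricesResponse
    Cusip As String
    Issuer As String
    Coupon As String
End Type
"""

# Step 1: capture only the body of the type definition.
# DOTALL lets . cross line breaks; *? stops at the first "End Type".
TYPE_BODY = re.compile(r'Public\s+Type\s+\w+\s+(.*?)End\s+Type', re.DOTALL)

# Step 2: one property per line inside that body.
PROPERTY = re.compile(r'^\s*(\w+)\s+As\s+\w+', re.MULTILINE)

def property_names(source):
    """Return the property names of every Public Type block in source."""
    names = []
    for body in TYPE_BODY.findall(source):
        names.extend(PROPERTY.findall(body))
    return names

print(property_names(SOURCE))  # ['Cusip', 'Issuer', 'Coupon']
```

Restricting step 2 to the captured body is what keeps unrelated "X As Y" declarations elsewhere in a larger file out of the result.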
Use the following regexp to match the whole thing:
Public\s+Type\s+(?<tname>[\w]+)\s+((?<pname>[\w]+)\s+As\s+(?<ptype>[\w]+)\s+)+End\s+Type
Note that it uses named groups for easier access to the matched content. After the whole content is matched, the group named tname holds the type name, the group named pname holds the property name, and the group named ptype holds the corresponding property's type.
Here's its live demo:
http://regexr.com?300l0

how to group in regex matching correctly?

Consider the following scenario:
input string = "WIPR.NS"
I have to replace this with "WIPR2.NS".
I am using the following logic:
match pattern = "(.*)\.NS$" \\ any string that ends with .NS
replace pattern = "$12.NS"
In the above case, since there is no group with index 12, I get the result $12.NS
But what I want is "WIPR2.NS".
If I don't have the digit 2 in the replacement, it works; it fails only when the digit directly follows $1.
How can I resolve this case?
Thanks in advance,
Alok
This usually depends entirely on your regex engine (I'm not familiar with those that use $1 to represent a capture group, I'm more used to \1, but you'd have the same problem with that).
Some will provide a delimiter that you can use, like:
replace pattern = "${1}2.NS"
which clearly indicates that you want capture group 1 followed by the literal 2.NS.
In fact, by looking at this page, it appears that's exactly the way to do it (assuming .NET):
To replace with the first backreference immediately followed by the digit 9, use ${1}9. If you type $19, and there are less than 19 backreferences, the $19 will be interpreted as literal text, and appear in the result string as such.
Also keep in mind that Jay provides an excellent answer for this specific use case that doesn't require capture groups at all (by just replacing .NS with 2.NS).
You may want to look into that as a possibility - I'll leave this answer here since:
it's the accepted answer; and
it's probably better for the more complex cases, like changing X([A-Z])4([A-Z]) with X${1}5${2}, where you have variable text on either side of the bit you wish to modify.
You don't need to do anything with what precedes the .NS, since only what is being matched is subject to replacement.
match pattern = "\.NS$" (any string that ends with .NS -- don't forget to escape the .)
replace pattern = "2.NS"
You can further refine this with lookaround zero-width assertions, but that depends on your regex engine, and you have not specified the environment/programming language in which you are working.
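The same ambiguity exists in other engines, each with its own delimiter syntax. As an illustration only, here is how both answers look in Python's re module, where the unambiguous form of a group reference is \g<1> (the Python counterpart of the ${1} syntax quoted above). The function names are hypothetical.

```python
import re

def bump_with_group(symbol):
    # \g<1> is group 1 followed by a literal 2; writing \12 instead
    # would be read as a reference to group 12.
    return re.sub(r'(.*)\.NS$', r'\g<1>2.NS', symbol)

def bump_without_group(symbol):
    # Only the matched suffix is replaced, so no group is needed.
    return re.sub(r'\.NS$', '2.NS', symbol)

print(bump_with_group('WIPR.NS'))     # WIPR2.NS
print(bump_without_group('WIPR.NS'))  # WIPR2.NS
```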

Regex with exception of particular words

I have problem with regex.
I need to make regex with an exception of a set of specified words, for example: apple, orange, juice.
and given these words, it will match everything except those words above.
applejuice (match)
yummyjuice (match)
yummy-apple-juice (match)
orangeapplejuice (match)
orange-apple-juice (match)
apple-orange-aple (match)
juice-juice-juice (match)
orange-juice (match)
apple (should not match)
orange (should not match)
juice (should not match)
If you really want to do this with a single regular expression, you might find lookaround helpful (especially negative lookahead in this example). Regex written for Ruby (some implementations have different syntax for lookarounds):
rx = /^(?!apple$|orange$|juice$)/
I noticed that apple-juice should match according to your parameters, but what about apple juice? I'm assuming that if you are validating apple juice you still want it to fail.
So - let's build a set of characters that count as a "boundary":
/[^-a-z0-9A-Z_]/ // Will match any character that is <NOT> - _ or
// between a-z 0-9 A-Z
/(?:^|[^-a-z0-9A-Z_])/ // Matches the beginning of the string, or one of those
// non-word characters.
/(?:[^-a-z0-9A-Z_]|$)/ // Matches a non-word or the end of string
/(?:^|[^-a-z0-9A-Z_])(apple|orange|juice)(?:[^-a-z0-9A-Z_]|$)/
// This should >match< apple/orange/juice ONLY when it is not preceded/followed by
// another 'word' character. Just negate the result of the test to obtain your
// desired result.
In most regexp flavors \b counts as a "word boundary" but the standard list of "word characters" doesn't include - so you need to create a custom one. It could match with /\b(apple|orange|juice)\b/ if you weren't trying to catch - as well...
If you are only testing 'single word' tests you can go with a much simpler:
/^(apple|orange|juice)$/ // and take the negation of this...
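A minimal sketch of the "match, then negate" idea in Python, using the custom boundary class from above. The names FORBIDDEN and is_allowed are assumptions made for this example, and the capturing group is written as non-capturing since only the yes/no result is used.

```python
import re

# A forbidden word counts only when it stands alone, where "-" and "_"
# are treated as word characters rather than as boundaries.
FORBIDDEN = re.compile(
    r'(?:^|[^-a-z0-9A-Z_])(?:apple|orange|juice)(?:[^-a-z0-9A-Z_]|$)')

def is_allowed(text):
    """True when no forbidden word stands alone in the text."""
    return FORBIDDEN.search(text) is None

for sample in ['applejuice', 'orange-juice', 'apple', 'apple juice']:
    print(sample, is_allowed(sample))
```

With this pattern applejuice and orange-juice are allowed, while apple and apple juice are rejected.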
This gets some of the way there:
((?:apple|orange|juice)\S)|(\S(?:apple|orange|juice))|(\S(?:apple|orange|juice)\S)
\A(?!apple\Z|juice\Z|orange\Z).*\Z
will match an entire string unless it only consists of one of the forbidden words.
Alternatively, if you're not using Ruby or you're sure that your strings contain no line breaks or you have set the option that ^ and $ do not match on beginnings/ends of lines
^(?!apple$|juice$|orange$).*$
will also work.
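To check the lookahead pattern against the examples from the question, here is a short sketch in Python, where \A and \Z anchor the start and the very end of the string. The names NOT_ONLY_FORBIDDEN and matches are assumptions made for this example.

```python
import re

# The lookahead fails only when the whole string is one forbidden word.
NOT_ONLY_FORBIDDEN = re.compile(r'\A(?!apple\Z|juice\Z|orange\Z).*\Z')

def matches(text):
    """True unless text consists solely of apple, orange or juice."""
    return NOT_ONLY_FORBIDDEN.match(text) is not None

should_match = ['applejuice', 'yummyjuice', 'yummy-apple-juice',
                'orangeapplejuice', 'orange-apple-juice',
                'apple-orange-aple', 'juice-juice-juice', 'orange-juice']
should_not_match = ['apple', 'orange', 'juice']

print(all(matches(s) for s in should_match))          # True
print(any(matches(s) for s in should_not_match))      # False
```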
Here's some easy copy-paste code that works for more than just exact-words exceptions.
Copy/Paste Code:
In the following regex, ONLY replace the all-caps sections with your regex.
Python regex (atomic groups (?>...) require Python 3.11 or later)
pattern = r"REGEX_BEFORE(?>(?P<exceptions_group_1>EXCEPTION_PATTERN)|YOUR_NORMAL_PATTERN)(?(exceptions_group_1)always(?<=fail)|)REGEX_AFTER"
Ruby regex
pattern = /REGEX_BEFORE(?>(?<exceptions_group_1>EXCEPTION_PATTERN)|YOUR_NORMAL_PATTERN)(?(<exceptions_group_1>)always(?<=fail)|)REGEX_AFTER/
PCRE regex
REGEX_BEFORE(?>(?<exceptions_group_1>EXCEPTION_PATTERN)|YOUR_NORMAL_PATTERN)(?(exceptions_group_1)always(?<=fail)|)REGEX_AFTER
JavaScript
Impossible as of 6/17/2020, and probably won't be possible in the near future.
Full Examples
REGEX_BEFORE = \b
YOUR_NORMAL_PATTERN = \w+
REGEX_AFTER =
EXCEPTION_PATTERN = (apple|orange|juice)
Python regex
pattern = r"\b(?>(?P<exceptions_group_1>(apple|orange|juice))|\w+)(?(exceptions_group_1)always(?<=fail)|)"
Ruby regex
pattern = /\b(?>(?<exceptions_group_1>(apple|orange|juice))|\w+)(?(<exceptions_group_1>)always(?<=fail)|)/
PCRE regex
\b(?>(?<exceptions_group_1>(apple|orange|juice))|\w+)(?(exceptions_group_1)always(?<=fail)|)
How does it work?
This uses decently complicated regex, namely Atomic Groups, Conditionals, Lookbehinds, and Named Groups.
The (?> is the start of an atomic group, which means it's not allowed to backtrack: if that group matches once but later gets invalidated because a lookbehind failed, then the whole group will fail to match. (We want this behavior in this case.)
The (?<exceptions_group_1> creates a named capture group. It's just easier than using numbers.
Note that the atomic pattern first tries to find the exception, and then falls back on the normal pattern if it couldn't find the exception.
The real magic is in the (?(exceptions_group_1). This is a conditional asking whether or not exceptions_group_1 was successfully matched. If it was, then it tries to find always(?<=fail). That pattern (as it says) will always fail, because it's looking for the word "always" and then checks 'does "ways"=="fail"', which it never will.
Because the conditional fails, the atomic group fails, and because it's atomic it's not allowed to backtrack (to try the normal pattern), since it already matched the exception.
This is definitely not how these tools were intended to be used, but it should work reliably and efficiently.
Exact answer to the original question in Ruby
/\b(?>(?<exceptions_group_1>(apple|orange|juice))|\w+)(?(<exceptions_group_1>)always(?<=fail)|)/
Unlike other methods, this one can be modified to reject any pattern, for example any word containing the sub-string "apple", "orange", or "juice".
/\b(?>(?<exceptions_group_1>\w*(apple|orange|juice))|\w+)(?(<exceptions_group_1>)always(?<=fail)|)/
Something like this (PHP):
$input = "The orange apple gave juice";
// "/your regex for validating/" is a placeholder for your own validation pattern
if (preg_match("/your regex for validating/", $input) && !preg_match("/apple|orange|juice/", $input))
{
    // it's ok;
}
else
{
    // throw validation error
}