Using PCRE2 regex with repeating groups to find email addresses - regex

I need to find all email addresses consisting of an arbitrary number of alphanumeric words separated by periods. To test the regex, I'm using the website https://regex101.com/.
The structure of a valid email address is word1.word2.wordN#word1.word2.wordN.word.
The regex /[a-zA-Z0-9.]+#[a-zA-Z0-9.]+.[a-zA-Z0-9]+/gm finds all email addresses included in the document string, but also includes invalid addresses like ........#....com, if present.
I tried to group the repeating parts by using round brackets and a Kleene star, but that causes the regex engine to collapse.
Invalid regex:
/([a-zA-Z0-9]+.?)*[a-zA-Z0-9]+#([a-zA-Z0-9]+.?)*[a-zA-Z0-9]+.[a-zA-Z0-9]+/gm
Although there are many posts about regex groups, I was unable to find an explanation of why the regex engine fails. It seems that the engine gets stuck while trying to find a match.
How can I avoid this problem, and what is the correct solution?

I think the main issue that caused you trouble is this:
. (outside of []) matches any character; you probably meant \. instead, which only matches a literal dot. With the unescaped, optional dot, the group ([a-zA-Z0-9]+.?)* can split the same run of characters in a huge number of ways, so the engine backtracks excessively whenever a match attempt fails.
There is also no need to make the dot optional with ?, because the non-dot part of your regex will match the alphanumeric characters anyway.
I also reduced the right-hand part (x*x is the same as x+), added the case-insensitive flag, and ended up with this:
/([a-z0-9]+\.)*[a-z0-9]+#([a-z0-9]+\.)+[a-z0-9]+/gmi
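As a quick sanity check, here is a minimal Python sketch of the corrected pattern. Python's re module is not PCRE2, but the constructs used here should behave the same way. The sample text and the find_addresses helper are made up for illustration, and I switched to non-capturing groups so that findall returns whole matches.

```python
import re

# Corrected pattern from above, with non-capturing groups (an illustrative change)
# so that findall() returns the full match rather than the captured groups.
ADDRESS = re.compile(
    r"(?:[a-z0-9]+\.)*[a-z0-9]+#(?:[a-z0-9]+\.)+[a-z0-9]+",
    re.IGNORECASE,
)

def find_addresses(text):
    """Return every address-like substring of text (hypothetical helper)."""
    return ADDRESS.findall(text)

sample = "write to john.doe#example.com or ........#....com or x1#mail.example.org"
print(find_addresses(sample))
```

The dots-only string is skipped because every dot now has to be preceded by at least one alphanumeric character.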

Excluding Canadian postal codes during shipping calculation

I'm currently designing a website on Shopify, and now I have to create shipping rules using Parcelify.
We've managed to get our account onto the legacy version, which allows us to use regex to restrict where we can and can't ship. The only thing is that I don't know anything about regex, so I followed a couple of tutorials online and came up with a few options, but none of them do what I want:
Allow shipping anywhere except for postal codes starting with:
g0c
g0e
g0g
g0j
g0t
g0w
g4r
g4t
g4w
g5j
g5l
g8p
j0m
So I've come up with this. I know it can probably be much simpler, but I'm just trying to get this rule to work. Maybe I'm totally off, which is why I'm reaching out for help here.
/(^(?!g0C|G0E|G0G|G0J|g0t|g0w|g4r|g4t|g4w|g4w|g4x|g5j|g5l|g8p|j0m)) ?([a-zA-Z0-9]*.{3}$)/gim
From what I understand, a negative lookahead is the key to excluding every postal code with one of the FSAs (the first three characters of a postal code) mentioned above.
When I try it in regex101, everything seems fine (unless I just don't know how to read the results), but when I put it into the Shopify app (Parcelify), customers with acceptable postal codes are not able to place an order because they get blocked at the shipping step...
Every Canadian postal code consists of 6 characters, if you don't count the space in the middle.
If the string can also begin with a space, you can add an optional space to the negative lookahead to rule that out as well.
The /i flag makes the pattern case-insensitive.
Also allowing spaces at the end:
^(?! ?(?:g0C|G0E|G0G|G0J|g0t|g0w|g4r|g4t|g4w|g4x|g5j|g5l|g8p|j0m)) ?(?:[a-zA-Z0-9] *){6}$
The pattern matches:
^ Start of string
(?! Negative lookahead; assert that what follows is not
 ?(?:g0C|G0E|G0G|G0J|g0t|g0w|g4r|g4t|g4w|g4x|g5j|g5l|g8p|j0m)) An optional space followed by any of the alternatives; the final ) closes the lookahead
 ? Match an optional space (or use * for multiple spaces)
(?:[a-zA-Z0-9] *){6} Repeat 6 times: a character from the character class followed by optional spaces
$ End of string
A slightly shorter version that uses character classes and does not accept spaces at the end:
^(?! ?(?:g0[Cw]|G0[EGJ]|g0t|g4[rtwx]|g5[jl]|g8p|j0m)) ?(?:[a-zA-Z0-9] *){5}[a-zA-Z0-9]$
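To see how the lookahead behaves, here is a small Python sketch using the shortened pattern. I can't say how Parcelify applies the expression internally, so this only checks the pattern itself; can_ship and the sample codes are hypothetical.

```python
import re

# Shortened pattern from above; re.IGNORECASE plays the role of the /i flag.
ALLOWED = re.compile(
    r"^(?! ?(?:g0[Cw]|G0[EGJ]|g0t|g4[rtwx]|g5[jl]|g8p|j0m))"
    r" ?(?:[a-zA-Z0-9] *){5}[a-zA-Z0-9]$",
    re.IGNORECASE,
)

def can_ship(postal_code):
    """Return True if the code is well formed and not in an excluded FSA (hypothetical helper)."""
    return ALLOWED.search(postal_code) is not None

for code in ["H2X 1Y4", "G0C 1A0", "g4x1a1", " J0M 1A0", "K1A0B"]:
    print(repr(code), can_ship(code))
```

Only the first sample passes: the next three start with an excluded FSA, and the last one has just five characters.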

RegEx Expression for Eclipse that searches for all items that have not been dealt with

To help stop SQL Injection attacks, I am going through about 2000 parameter requests in my code to validate them. I validate them by determining what type of value (e.g. integer, double) they should return and then applying a function to them to sanitize the value.
Any requests I have dealt with look like this
*SecurityIssues.*(request.getParameter
where * signifies any number of characters on the same line.
What RegExp expression can I use in the Eclipse search (CTRL+H) which will help me search for all the ones I have not yet dealt with, i.e. all the times that the text request.getParameter appears when it is not preceded by the word SecurityIssues?
Examples for matches
The regular expression should match each of the following:
int companyNo = StringFunctions.StringToInt(request.getParameter("COMPANY_NO"))
double percentage = StringFunctions.StringToDouble(request.getParameter("MARKETSHARE"))
int c = request.getParameter("DUMMY")
But should not match:
int companyNo = SecurityIssues.StringToIntCompany(request.getParameter("COMPANY_NO"))
With inspiration and the links provided by @michaeak (thank you), as well as testing on https://regex101.com/, I appear to have found the answer:
^((?!SecurityIssues).)*(request\.getParameter)
The advantage of this answer is that I can blacklist the word SecurityIssues, as opposed to having to whitelist the formats that I do want.
Note that it is relatively slow; it also slowed down my computer a lot while performing the search.
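For anyone who wants to try the pattern outside Eclipse, here is a rough Python sketch run against the example lines from the question. Eclipse uses Java's regex engine, so treat this as an illustration only; needs_review is a made-up helper name.

```python
import re

# Pattern from the answer above: no character before request.getParameter
# may be the start of the word "SecurityIssues".
UNHANDLED = re.compile(r"^((?!SecurityIssues).)*(request\.getParameter)")

lines = [
    'int companyNo = StringFunctions.StringToInt(request.getParameter("COMPANY_NO"))',
    'double percentage = StringFunctions.StringToDouble(request.getParameter("MARKETSHARE"))',
    'int c = request.getParameter("DUMMY")',
    'int companyNo = SecurityIssues.StringToIntCompany(request.getParameter("COMPANY_NO"))',
]

def needs_review(line):
    """Return True if the line calls request.getParameter without SecurityIssues before it."""
    return UNHANDLED.search(line) is not None

for line in lines:
    print(needs_review(line), line)
```

The lookahead is re-evaluated at every character, which is probably part of why the search feels slow on a large workspace.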
Try e.g.
=\s*?((?!SecurityIssues).)*?(request\.getParameter)\(
Notes
Parentheses ( and ) are special characters used for grouping. To match them literally, escape them with \.
.* is greedy: it matches as much as possible, including characters that you don't want it to consume. .*? is reluctant: it matches as few characters as possible, which helps when other items need to match after the wildcard.
There is a tutorial at https://docs.oracle.com/javase/tutorial/essential/regex/index.html , I think all of these should be available in eclipse. You can then deal with generic replacement also.
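The difference between the greedy and the reluctant quantifier from the notes can be seen in a tiny Python sketch (the sample string is made up; the same quantifiers exist in Java's regex engine):

```python
import re

text = 'f("a") + g("b")'

# Greedy: runs from the first "(" to the LAST ")".
greedy = re.search(r"\(.*\)", text).group()

# Reluctant: stops at the FIRST ")" that allows a match.
reluctant = re.search(r"\(.*?\)", text).group()

print(greedy)
print(reluctant)
```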
Problem
From reading Regular expression that doesn't contain certain string and Regular expression to match a line that doesn't contain a word? it seems quite difficult to create a regex matching anything but not to contain a certain word.

Mistaken Squid Proxy regex? → ^.*stackoverflow\.*

I have several proxy rule files for Squid, and all contain rules like:
acl blacklisted dstdom_regex ^.*facebook\.* ^.*youtube\.* ^.*games.yahoo.com\.*
The patterns match against the domain name: dstdom_regex means destination (server) regular expression pattern matching.
The objective is to block some websites, but I don't know by what method: domain name, keywords in the domain name, ...
Let's expand/describe the pattern:
^.*stackexchange\.* The whole pattern
^ String beginning
.* Match anything (greedy quantifier, I presume)
stackexchange Keyword to match
\.* Any number of dots (.)
Totally legitimate matches:
stackexchange.com: The Stack Exchange website.
stackoverflow.stackexchange: The imaginary Stack Exchange gTLD.
But these possible matches make it seem more like a keyword block:
stackexchange
stackexchanger
notstackexchange
not-stackexchange
some-website.stackexchange
some-website.stackexchange-tld
And the pattern seems to contain a bug, since it allows the following invalid cases to match, thanks to the \.* at the end, although they never naturally occur:
stackexchange.
stackexchange...
stackexchange..........
stackexchange.......com
stackexchange.com
stackexchangecom
You get the idea.
Anything containing stackexchange, even if separated by dots from everything else, is still a valid match.
So now, the question itself:
This all means that this is simply a match for stackexchange! (I'm assuming the original author didn't intend to match infinite dots.)
So why not just use the pattern stackexchange? Wouldn't it be faster and give the same results, except for the "bug" (\.*)?
I.e., isn't ^.*stackexchange equivalent to stackexchange?
Edit: Just to clarify, I didn't write those proxy rule files.
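The suspected equivalence can be checked with a short Python sketch. It assumes the pattern is applied as an unanchored search over the domain name; Squid uses its own regex library, so this is an illustration, not a statement about Squid internals. The host list and same_verdict are hypothetical.

```python
import re

WITH_PREFIX = re.compile(r"^.*stackexchange\.*")   # pattern as found in the rule files
KEYWORD_ONLY = re.compile(r"stackexchange")        # proposed simplification

hosts = [
    "stackexchange.com",
    "stackoverflow.stackexchange",
    "notstackexchange",
    "some-website.stackexchange-tld",
    "stackexchange.......com",
    "stackexchangecom",
    "example.com",
]

def same_verdict(host):
    """True if both patterns agree on whether the host is blocked."""
    return bool(WITH_PREFIX.search(host)) == bool(KEYWORD_ONLY.search(host))

for host in hosts:
    print(host, bool(KEYWORD_ONLY.search(host)), same_verdict(host))
```

Under that assumption both patterns give the same verdict for every host in the list, since the leading ^.* and the trailing \.* can both match nothing.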
I don't understand why \.* is used to match all the following dots.
However, to work around your problem, you can try this:
^[^\.]*\.stackexchange\.*
[^\.]* matches anything except a dot
\. then you match the dot
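Here is a hedged Python sketch of what this pattern accepts, again assuming an unanchored search over the host name (the hosts and the is_blocked helper are made up for illustration):

```python
import re

# Pattern suggested above: one dot-free label, a dot, then the keyword.
SUBDOMAIN_ONLY = re.compile(r"^[^\.]*\.stackexchange\.*")

def is_blocked(host):
    """Return True if the suggested pattern matches the host (hypothetical helper)."""
    return SUBDOMAIN_ONLY.search(host) is not None

for host in ["www.stackexchange.com", "stackexchange.com",
             "notstackexchange.com", "a.b.stackexchange.com"]:
    print(host, is_blocked(host))
```

Note that it requires a dot directly before stackexchange and at most one label in front of it, so in this sketch the bare stackexchange.com and deeper subdomains such as a.b.stackexchange.com are not matched.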

Using Flags of Regex within Google Forms

I'm trying to use flags within Google Forms. I've been googling for the last couple of hours hoping to find an answer, but haven't found one. Google Forms says that the regular expression is not valid, even when I use a simple regex such as (?i)t. I'm trying to use the regex inside a paragraph question.
How can I make it work?
Edit:
What I really need is to match [a-zA-Z" ]+( *),( *)[1-9]([0-9]??)\n repeatedly, so each line will look something like: Sam "The Man" McAdams , 9\n. Of course, the number of lines is unknown. Using the repetition modifiers * or + at the end of the regex does not satisfy my needs: if the first line is accepted as valid, the other lines can consist of anything and the input is still considered valid, which it is not.
You can use the following expression to validate an entire string that only consists of lines meeting your pattern:
^([a-zA-Z" ]+ *, *[1-9][0-9]?(\n|$))+$
The main point is to add an alternation group to match either a newline or the end of string ((\n|$)) and wrap the whole pattern into a +-quantified group ((...)+) anchored at both start (^) and end ($).
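A minimal Python sketch of that idea follows. It uses Python's re module rather than the Google Forms engine, so it only demonstrates the pattern's logic and does not address the flags problem; is_valid and the sample answers are hypothetical.

```python
import re

# Whole-string validator from above: one or more lines, each ending in \n or end of string.
LINES = re.compile(r'^([a-zA-Z" ]+ *, *[1-9][0-9]?(\n|$))+$')

def is_valid(answer):
    """Return True only if every line of the answer has the form 'name , number'."""
    return LINES.search(answer) is not None

good = 'Sam "The Man" McAdams , 9\nJane Doe, 42'
bad = 'Sam "The Man" McAdams , 9\nanything goes here'

print(is_valid(good))
print(is_valid(bad))
```

Because the repeated group is anchored at both ends, a valid first line no longer lets arbitrary text through on the following lines.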

Regex Extraction for Google Analytics Content Grouping

I'm attempting to set up Content Groupings using Extraction within Google Analytics.
I have URLs of the form http://www.ehattons.com/52674/Bachmann_Branchline_37_671_Pack_of_3_14_Ton_tank_wagons_in_Fina_livery_weathered/StockDetail.aspx
I wish to use regex so that, only when a URL contains /StockDetail.aspx, everything before the first underscore is extracted, excluding any digits, e.g. 'Bachmann'.
I've managed to source the following regex, which returns everything before the first underscore:
^[^_]+(?=_).
However, that's as far as I can get with my limited understanding. Anyone know what regex will do the trick here?
Many thanks,
Well, you're halfway there.
Think about it this way: you want to extract something that is followed by an underscore but not preceded by one, and only when the string contains /StockDetail.aspx. You know that this part of the string will always come after your first underscore.
So you start with a character that is not an underscore: [^_]
Then you create the group you want to capture with ([a-zA-Z]*) (you cannot use \w, since it includes the underscore and digits). The captured string has to be followed by an underscore, so you add _ after your group. Finally, somewhere later in the URL you have /StockDetail.aspx. Your regex should look like this:
[^_]([a-zA-Z]*)_.*(?:\/StockDetail\.aspx)
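As a rough check, here is the pattern applied to the example URL in Python. Google Analytics uses its own regex engine, so this only illustrates which text ends up in the capture group; the second URL and the extract_group helper are invented for the example.

```python
import re

# Pattern from above; group 1 holds the letters right before the first underscore.
EXTRACT = re.compile(r"[^_]([a-zA-Z]*)_.*(?:\/StockDetail\.aspx)")

def extract_group(url):
    """Return the captured name, or None if the URL is not a StockDetail page."""
    match = EXTRACT.search(url)
    return match.group(1) if match else None

stock_url = ("http://www.ehattons.com/52674/"
             "Bachmann_Branchline_37_671_Pack_of_3_14_Ton_tank_wagons_in_Fina_livery_weathered"
             "/StockDetail.aspx")
other_url = "http://www.ehattons.com/52674/Bachmann_Branchline/Other.aspx"

print(extract_group(stock_url))
print(extract_group(other_url))
```

The leading [^_] consumes the slash before the name, which is why the digits of the stock number are not part of the capture.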