Regexp_like vs regex validators online - different results - regex

I have a regular expression for email validation in PL/SQL that is giving me some headaches... :)
This is the condition I'm using for an email (rercear12345#gmail.com) validation:
IF NOT REGEXP_LIKE (user_email, '^([\w\-\.]+)#((\[([0-9]{1,3}\.){3}[0-9]{1,3}\])|(([\w\-]+\.)+)([a-zA-Z]{2,4}))$') THEN
control := FALSE;
dbms_output.put_line('EMAIL '||C.user_email||' not according to regex');
END IF;
If I run a select based on the expression I don't get any rows either:
Select * from TABLE_X where REGEXP_LIKE (user_email, '^([\w\-\.]+)#((\[([0-9]{1,3}\.){3}[0-9]{1,3}\])|(([\w\-]+\.)+)([a-zA-Z]{2,4}))$');
Using regex101.com I get full match with this email: rercear12345#gmail.com
Any idea?

The regular expression syntax that Oracle supports is in the documentation.
It seems Oracle doesn't understand the \w inside the []. You can expand that to:
with table_x (user_email) as (
select 'rercear12345#gmail.com' from dual
union all
select 'bad name#gmail.com' from dual
)
Select * from TABLE_X
where REGEXP_LIKE (user_email, '^[a-zA-Z_0-9.-]+#((\[([0-9]{1,3}\.){3}[0-9]{1,3}\])|([a-zA-Z_0-9-]+\.)+[a-zA-Z]{2,4})$');
USER_EMAIL
----------------------
rercear12345#gmail.com
You don't need to escape the . or - inside the square brackets; doing so would allow literal backslashes to be matched.
This sort of requirement has come up before - e.g. here - but you seem to be allowing IP address octets, enclosed in literal square brackets, instead of FQDNs, which is unusual.
As @BobJarvis said, you could also use [:alnum:], but you would still need to include the underscore. That could allow non-ASCII 'letter' characters you aren't expecting, though they may be valid, as are other symbols you exclude; you seem to be following the 'common advice' mentioned in that article, though.
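As a rough cross-check outside Oracle, the expanded pattern can be tried in Python's `re` module. This is only a sketch: Python is not Oracle's regex engine, the names `EMAIL_RE` and `is_valid` are made up for illustration, and the dot before the top-level domain is escaped here so that it matches only a literal dot. The `#` separator mirrors the sample data in this thread.

```python
import re

# Expanded form of the pattern: explicit ranges instead of \w inside brackets,
# which behaves the same way in most regex flavours.
EMAIL_RE = re.compile(
    r'^[a-zA-Z_0-9.-]+#'
    r'((\[([0-9]{1,3}\.){3}[0-9]{1,3}\])'   # bracketed IPv4-style literal
    r'|([a-zA-Z_0-9-]+\.)+[a-zA-Z]{2,4})$'  # or dotted host name plus 2-4 letter TLD
)


def is_valid(address: str) -> bool:
    """Return True when the whole address matches EMAIL_RE."""
    return EMAIL_RE.match(address) is not None


if __name__ == "__main__":
    for sample in ("rercear12345#gmail.com", "bad name#gmail.com", "user#[192.168.0.1]"):
        print(sample, is_valid(sample))
```

The second sample is rejected because the space is not in the local-part character class; the third exercises the bracketed-octets branch.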

Related

How to use Postgres Regex Replace with a capture group

As the title says, I am trying to reference capture groups in a regex replace in a Postgres query. I have read that regexp_replace does not support using regex capture groups. The regex I am using is
r"(?:[\s\(\)\=\)\,])(username)(?:[\s\(\)\=\)\,])?"gm
The above regex almost does what I need, but I need to find out how to only allow a match if the groups around it also capture something. There is no situation where a "username" should be matched if it just happens to be a substring of a word. By ensuring it's surrounded by one of the above characters I can much more confidently ensure it's a username.
An example application of the regex would be something like this in postgres (of course I would be doing an update vs a select):
select *, REGEXP_REPLACE(reqcontent,'(?:[\s\(\)\=\)\,])(username)(?:[\s\(\)\=\)\,])?' ,'NEW-VALUE', 'gm') from table where column like '%username%' limit 100;
If there is any more context that I can provide, please let me know. I have also found similar posts (postgresql regexp_replace: how to replace captured group with evaluated expression (adding an integer value to capture group)), but that one is more about splicing values back in and I don't think it quite answers my question.
More context and example values for the regex to work against: the text below may look familiar, as these are JQL filters in Jira. We are looking to update our usernames and all their occurrences in the table that contains the filters. Below are a few examples of filters. We were originally just doing a find and replace, but that doesn't work because some usernames are only two characters long and it matched on non-usernames (e.g. for the username je, it would place the new value inside the word project, which completely malforms the JQL string, resulting in something like proNEW-VALUEct = blah blah).
type = bug AND status not in (Closed, Executed) AND assignee in (test, username)
assignee=username
assignee = username
Definition of Answered:
Regex that will only match on a 'username' if it's surrounded by one of the specials
A way to regex/replace that username in a postgres query.
Capturing groups are used to keep the important bits of information matched with a regex.
Either use capturing groups around the parts of the string you want to keep and refer to them with placeholders in the replacement:
REGEXP_REPLACE(reqcontent,'([\s\(\)\=\)\,])username([\s\(\)\=\)\,])?' ,'\1NEW-VALUE\2', 'gm')
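Or use lookarounds (in PostgreSQL a lookaround constraint cannot be followed by a quantifier, so the optional trailing delimiter is written as an alternation with $):
REGEXP_REPLACE(reqcontent,'(?<=[\s\(\)\=\)\,])(username)(?=[\s\(\)\=\)\,]|$)' ,'NEW-VALUE', 'gm')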
Or, in this case, use word boundaries to ensure you only replace a word when inside special characters:
REGEXP_REPLACE(reqcontent,'\yusername\y' ,'NEW-VALUE', 'g')
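To see how the delimiter-based variants behave on the sample filters, here is a small sketch in Python's `re` module. It is an analog, not PostgreSQL itself: Python spells the word boundary `\b` where PostgreSQL uses `\y`, and the helper names (`replace_with_groups`, `replace_with_lookarounds`) and the `SPECIALS` constant are hypothetical. Both helpers require a delimiter before the name and either a delimiter or the end of the string after it.

```python
import re

# Delimiters that may surround a username in a JQL filter (assumed set,
# taken from the character class used in this thread).
SPECIALS = r'[\s()=,]'


def replace_with_groups(jql: str, old: str, new: str) -> str:
    """Capture the delimiters and put them back around the new value."""
    pattern = rf'({SPECIALS}){re.escape(old)}({SPECIALS}|$)'
    return re.sub(pattern, lambda m: m.group(1) + new + m.group(2), jql)


def replace_with_lookarounds(jql: str, old: str, new: str) -> str:
    """Assert the delimiters without consuming them."""
    pattern = rf'(?<={SPECIALS}){re.escape(old)}(?={SPECIALS}|$)'
    return re.sub(pattern, lambda m: new, jql)


if __name__ == "__main__":
    print(replace_with_lookarounds("project = je", "je", "NEW-VALUE"))
    print(replace_with_lookarounds("assignee in (test, username)", "username", "NEW-VALUE"))
    print(replace_with_groups("(je,je)", "je", "X"))
    print(replace_with_lookarounds("(je,je)", "je", "X"))
```

The last two calls show one design difference: the capturing version consumes the trailing delimiter, so a name that follows immediately after a single shared delimiter is skipped, while the lookaround version replaces both.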

How to simplify g-mail addresses using regular expressions in Hive

I would like to simplify a gmail address in Hive by removing anything unnecessary. I can already remove "." using "translate()", however gmail also allows anything placed between a "+" and the "#" to be ignored. The following regular expression works in Teradata:
select REGEXP_REPLACE('test+friends#gmail.com', '\+.+\\#' ,'\\#');
gives: 'test#gmail.com', but in Hive, I get:
FAILED: SemanticException [Error 10014]: Line 1:7 Wrong arguments
''\#'': org.apache.hadoop.hive.ql.metadata.HiveException: Unable to
execute method public org.apache.hadoop.io.Text
org.apache.hadoop.hive.ql.udf.UDFRegExpReplace.evaluate(org.apache.hadoop.io.Text,org.apache.hadoop.io.Text,org.apache.hadoop.io.Text)
on object org.apache.hadoop.hive.ql.udf.UDFRegExpReplace#131b58d4 of
class org.apache.hadoop.hive.ql.udf.UDFRegExpReplace with arguments
{test+friends#gmail.com:org.apache.hadoop.io.Text,
+.+#:org.apache.hadoop.io.Text, #:org.apache.hadoop.io.Text} of size 3
How do I get this regular expression to work in Hive?
You don't need to escape # in regular expressions. Try:
select REGEXP_REPLACE('test+friends#gmail.com', '\+[^#]+#' ,'#');
You should also use [^#]+ rather than .+ so the match stops at the first #. Otherwise if there are multiple addresses in the input, the match will span all of them.
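The difference between `.+` and `[^#]+` is easy to see on an input with two addresses. The sketch below uses Python's `re` module rather than Hive (whose regexes are Java-based), and the function names are illustrative only; `#` stands in for the separator exactly as in the examples above.

```python
import re


def strip_plus_tag(text: str) -> str:
    """Remove everything from '+' up to the next '#', keeping the '#'."""
    return re.sub(r'\+[^#]+#', '#', text)


def strip_plus_tag_greedy(text: str) -> str:
    """Same idea with a greedy '.+', which runs on to the LAST '#'."""
    return re.sub(r'\+.+#', '#', text)


if __name__ == "__main__":
    two = 'a+x#gmail.com, b+y#gmail.com'
    print(strip_plus_tag('test+friends#gmail.com'))
    print(strip_plus_tag(two))
    print(strip_plus_tag_greedy(two))
```

On the two-address input the greedy version swallows the whole span between the first `+` and the last `#`, so the second address is lost.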
I found the answer:
select REGEXP_REPLACE('test+friends#gmail.com', '[+].+#' ,'#');
or
select REGEXP_REPLACE('test+friends#gmail.com', '\+.+#' ,'#');
Either does the trick. Teradata and Hive seem to differ significantly in how they process regular expressions.

Regular expression to match question mark except repeated or commented(--)

I would like to build a regular expression in C# that matches a question mark unless it is repeated or inside a comment.
For example, if I have the string below
--???
??
asdlfkj --?
asldfjl -?
aslfldkf --?
aslfkvlv --??
?
-?
dklsafdlafjd = ?
I want to match as shown below (matches are marked between * characters).
--???
??
asdlfkj --?
asldfjl -*?*
aslfldkf --?
aslfkvlv --??
*?*
-*?*
dklsafdlafjd = *?*
I'm developing a SQL binding method that takes two parameters.
The first one is the SQL, for example
select * from atable where id = ?
The SQL can contain comments, so I want to ignore them.
The second one is an array of parameter values, matched to the placeholders in order.
Does anyone have a good idea for this?
If you can negate this regex it should work for you:
(\?{2,}|(?<=--)\?)
I don't know what language you're working in, but you should be able to filter by line. Apply this regex as a predicate and either negate it or use an exclude function.
I'll leave those implementation details up to you.
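One possible way to fill in those details, sketched in Python rather than C#. Note that this swaps in a slightly different technique from the negated regex above: it first cuts each line at the `--` comment marker, then looks for a `?` that has no `?` on either side. The names `LONE_QMARK` and `bindable_positions` are hypothetical, and the sketch assumes `--` never appears inside a quoted SQL string.

```python
import re
from typing import List

# A question mark with no question mark directly before or after it.
LONE_QMARK = re.compile(r'(?<!\?)\?(?!\?)')


def bindable_positions(line: str) -> List[int]:
    """Return the indexes of placeholder '?' characters in one line of SQL."""
    code = line.split('--', 1)[0]  # drop everything from the comment marker on
    return [m.start() for m in LONE_QMARK.finditer(code)]


if __name__ == "__main__":
    sample = [
        "--???", "??", "asdlfkj --?", "asldfjl -?", "aslfldkf --?",
        "aslfkvlv --??", "?", "-?", "dklsafdlafjd = ?",
    ]
    for line in sample:
        print(repr(line), bindable_positions(line))
```

Only the four lines marked with `*` in the question yield a position; the others return an empty list.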

Removing commas and empty tags from a string using regex

I am trying to filter out spam before it is posted, using a few routines and external services (Akismet), but they all seem to fail when a word is pushed in comma-delimited or broken up with empty tags. E.g.
b[u][/u]u[u][/u]y[i][/i]m[b][/b] e <-> buyme
b,u,y,m,e <-> buyme
Does anyone know of a good ColdFusion regex to strip out this sort of behavior before I post it to Akismet for processing?
Firstly: have you checked whether Akismet is not already doing this?
I would very much suspect it already does all this processing (and more), so you don't actually need to.
Anyway, assuming this is bbcode, and thus the relevant tags will be for bold/italic/underline, you can replace them with:
TextForAkismet = rereplace( TextForAkismet , '\[([biu])\]\[/\1\]' , '' , 'all' )
If there are other empty tags you want to remove, simply update the captured group (the bit in parentheses) as appropriate. To also cater for potential attributes (on a tag that is still empty), a quick and dirty way is to use [^\]]* after the tag name (outside the captured group).
'\[([biu]|img|url)[^\]]*\]\[/\1\]'
Depending on the dialect of bbcode you're working with, you may need to handle quoted brackets which would need a more complex expression.
To remove commas that appear between letters, use:
TextForAkismet = rereplace( TextForAkismet , '\b,\b' , '' , 'all' )
(Where \b matches any position between alphanumeric and non-alphanumeric.)
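For a quick check of both expressions on the two sample strings, here is a sketch in Python's `re` module. It is not ColdFusion, although the backreference and `\b` behave the same way for these inputs; the names `EMPTY_TAG`, `INNER_COMMA` and `normalize` are made up for illustration.

```python
import re

# An opening b/i/u tag immediately followed by its own closing tag.
EMPTY_TAG = re.compile(r'\[([biu])\]\[/\1\]')
# A comma with a word character on both sides.
INNER_COMMA = re.compile(r'\b,\b')


def normalize(text: str) -> str:
    """Strip empty bbcode tags, then commas wedged between word characters."""
    text = EMPTY_TAG.sub('', text)
    return INNER_COMMA.sub('', text)


if __name__ == "__main__":
    print(normalize('b[u][/u]u[u][/u]y[i][/i]m[b][/b] e'))
    print(normalize('b,u,y,m,e'))
    print(normalize('a, b'))
```

The space in the first sample survives, so any whitespace handling would be a separate step. A comma followed by a space is left alone, but a comma between digits (as in 1,000) is removed too, since digits count as word characters.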

Postgres regex issue

I need to find all records stored in Postgres that match the following regexp:
^((8|\+7)[\- ]?)?(\(?\d{3}\)?[\- ]?)?[\d\- ]{7,10}$
Something like this:
SELECT * FROM users WHERE users.phone ~ '^((8|\+7)[\- ]?)?(\(?\d{3}\)?[\- ]?)?[\d\- ]{7,10}$'
But this one fails with the error:
invalid regular expression: quantifier operand invalid
Why won't Postgres work with this regex?
Using the same one in plain Ruby works just fine.
UPDATE
The problem occurs only with WHERE. When I try:
SELECT '+79637434199' ~ '^((8|\+7)[\- ]?)(\(?\d{3}\)?[\- ]?)[\d\- ]{7,10}'
Postgres returns true. But when I try:
SELECT * FROM users WHERE users.phone ~ '^((8|\+7)[\- ]?)(\(?\d{3}\)?[\- ]?)[\d\- ]{7,10}'
Result: "invalid regular expression: quantifier operand invalid".
You don't need to escape - inside a character class when you put it at the first or last position, because it cannot be misread as a range that way:
[\- ] → [- ]
[\d\- ] → [\d -]
The way you have it, the upper bound 10 at the end is futile: nothing anchors the end of the pattern, so any characters after the tenth are simply ignored.
Add $ at the end to disallow trailing characters.
Or \D to disallow trailing digits (but require a non-digit).
Or ($|\D) to either end the string there or have a non-digit follow.
Put together:
SELECT '+79637434199' ~ '^(8|\+7)[ -]?(\(?\d{3}\)?[ -]?)[\d -]{7,10}($|\D)'
Otherwise your expression is just fine and it works for me on PostgreSQL 9.1.4. It should not make any difference whatsoever whether you use it in a WHERE clause or in a SELECT list - unless you are running into a bug with some old version (as @kgrittn commented).
If I prepend the string literal with E, I can provoke the error message that you get. This cannot explain your problem, because you stated that the expression works fine as SELECT item.
But, as Sherlock Holmes is quoted, "when you have excluded the impossible, whatever remains, however improbable, must be the truth."
Maybe you ran one test with standard_conforming_strings = on and the other with standard_conforming_strings = off, which was the default interpretation of string literals in versions before 9.1. Maybe you used two different clients (with different settings for it).
Read more in the chapter String Constants with C-style Escapes in the manual.
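The effect of losing the backslashes can be imitated outside the database. The sketch below uses Python's `re` module, which is not PostgreSQL's engine, and `strip_backslashes` is only a crude, hypothetical stand-in for what happens when the pattern is read as an escape string: every backslash is dropped, so `\+7` becomes `+7` and the `+` has nothing to quantify. The other helper names are illustrative too.

```python
import re

# The pattern from the answer, with the backslashes intact.
PHONE = r'^(8|\+7)[ -]?(\(?\d{3}\)?[ -]?)[\d -]{7,10}($|\D)'


def matches(pattern: str, value: str) -> bool:
    """True when the pattern is found in the value."""
    return re.search(pattern, value) is not None


def strip_backslashes(pattern: str) -> str:
    """Rough imitation of escape processing eating every backslash."""
    return pattern.replace('\\', '')


def compiles(pattern: str) -> bool:
    """True when the regex engine accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


if __name__ == "__main__":
    print(matches(PHONE, '+79637434199'))
    print(strip_backslashes(PHONE))
    print(compiles(strip_backslashes(PHONE)))
```

With the backslashes in place the sample number matches; once they are stripped, the pattern no longer compiles, which is the Python counterpart of the "quantifier operand invalid" error.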