How to simplify g-mail addresses using regular expressions in Hive

I would like to simplify a Gmail address in Hive by removing anything unnecessary. I can already remove "." using "translate()", but Gmail also ignores anything placed between a "+" and the "#". The following regular expression works in Teradata:
select REGEXP_REPLACE('test+friends#gmail.com', '\+.+\\#' ,'\\#');
This gives 'test#gmail.com', but in Hive I get:
FAILED: SemanticException [Error 10014]: Line 1:7 Wrong arguments
''\#'': org.apache.hadoop.hive.ql.metadata.HiveException: Unable to
execute method public org.apache.hadoop.io.Text
org.apache.hadoop.hive.ql.udf.UDFRegExpReplace.evaluate(org.apache.hadoop.io.Text,org.apache.hadoop.io.Text,org.apache.hadoop.io.Text)
on object org.apache.hadoop.hive.ql.udf.UDFRegExpReplace#131b58d4 of
class org.apache.hadoop.hive.ql.udf.UDFRegExpReplace with arguments
{test+friends#gmail.com:org.apache.hadoop.io.Text,
+.+#:org.apache.hadoop.io.Text, #:org.apache.hadoop.io.Text} of size 3
How do I get this regular expression to work in Hive?

You don't need to escape # in regular expressions. Try:
select REGEXP_REPLACE('test+friends#gmail.com', '\+[^#]+#' ,'#');
You should also use [^#]+ rather than .+ so the match stops at the first #. Otherwise if there are multiple addresses in the input, the match will span all of them.
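The difference between the two patterns is easy to see outside Hive. The following is a minimal Python sketch, not Hive code: Python's re module is a different engine from the Java regex engine Hive uses, and the helper name strip_tag and the multi-address sample string are made up for illustration. The separator character is kept exactly as it appears in the sample strings above.

```python
import re

# Negated class: the match stops at the first '#'.
TAG_STOP_AT_FIRST = re.compile(r'\+[^#]+#')
# Greedy dot: the match runs on to the last '#' in the input.
TAG_GREEDY = re.compile(r'\+.+#')


def strip_tag(text, pattern=TAG_STOP_AT_FIRST):
    """Replace each '+tag#' run with a bare '#' (hypothetical helper)."""
    return pattern.sub('#', text)


single = 'test+friends#gmail.com'
several = 'a+x#gmail.com, b+y#gmail.com'  # invented multi-address input

print(strip_tag(single))               # test#gmail.com
print(strip_tag(several))              # a#gmail.com, b#gmail.com
print(strip_tag(several, TAG_GREEDY))  # a#gmail.com
```

With the greedy pattern the second address disappears entirely, which is the spanning behaviour described above.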

I found the answer:
select REGEXP_REPLACE('test+friends#gmail.com', '[+].+#' ,'#');
or
select REGEXP_REPLACE('test+friends#gmail.com', '\+.+#' ,'#');
Either one does the trick. Teradata and Hive seem to differ significantly in how they process regular expressions.

Related

BQSQLException: Cannot parse regular expression: invalid perl operator: (see full error in post as title cannot contain certain characters)

Please find dummy data and my attempted solution at the end of this post.
I started learning REGEX in the last several days and am creating a REGEX to exclude any private IP addresses from my dataset. My dataset has a column url, which shows from which IP address a company performed an action. This column contains all kinds of IP addresses in the url format.
I have created a query that should output only non-local IP addresses (which are part of a URL). The query I have is as follows:
WITH table_1 AS(
SELECT 'http://localhost:9999' AS url UNION ALL
SELECT 'https://localhost:0000' AS url UNION ALL
SELECT 'http://stackoverflow.com/challenge' AS url UNION ALL
SELECT 'https://arseniyaskingquestion.ru/SO' AS url
)
SELECT url
FROM table_1
WHERE url NOT IN (SELECT DISTINCT url
FROM table_1
WHERE REGEXP_CONTAINS(url, r'((http(s)?):\/\/)(((25[0-5]|(2[0-4]|1[0-9]|[1-9]|)[0-9])(\.(?!$)|$)){4}$|(.*\.local)|(.*local\.)|(.*localhost)|(.*\.internal)|(.*csb.)|(.*codesandbox)|(.*lvh\.me)|(.*.ngrok.)|(.*nip\.io)|(.*.test)).*'))
ORDER BY url DESC
When I run this query, I get the following error message: BQSQLException: Cannot parse regular expression: invalid perl operator: (?!
I searched StackOverflow and found one suggested solution, but I could not implement it successfully using REGEXP_REPLACE: I kept getting other errors, even after reading the Google BigQuery documentation.
As you can see from my code snippet, I am trying to output only non-local IP addresses (which are a part of a full url link). Therefore, the expected output is:
url
----------------------------------
http://stackoverflow.com/challenge
https://arseniyaskingquestion.ru/SO

Is the LIKE clause not suitable?
WHERE LOWER(url) NOT LIKE '%localhost%'
The primary issue with your regex is the use of the negative lookahead (?!$).
Google BigQuery uses re2 and it omits support for lookarounds.
At regex101 you should develop your regexes using the Golang option since that is re2-based. See https://regex101.com/r/HAV5J1/1/ and it will explain why your regex is failing.
Additionally your subquery seems wildly inefficient:
WHERE url NOT IN (SELECT DISTINCT url
FROM table_1
WHERE REGEXP_CONTAINS(url, r'((http(s)?):\/\/)(((25[0-5]|(2[0-4]|1[0-9]|[1-9]|)[0-9])(\.(?!$)|$)){4}$|(.*\.local)|(.*local\.)|(.*localhost)|(.*\.internal)|(.*csb.)|(.*codesandbox)|(.*lvh\.me)|(.*.ngrok.)|(.*nip\.io)|(.*.test)).*'))
Could it not be condensed to:
WHERE NOT REGEXP_CONTAINS(url, r'MY_REGEX')
or:
WHERE REGEXP_CONTAINS(url, r'MY_REGEX') = false
I have no experience with BigQuery.

Your regex contains a lookahead and RE2 does not support lookaheads.
If you are satisfied with your pattern all in all, and you just need to fix that lookahead issue, you can unwrap the IP matching part and use
https?:\/\/((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)(\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)){3}$|.*\.local|.*local\.|.*localhost|.*\.internal|.*csb.|.*codesandbox|.*lvh\.me|.*.ngrok.|.*nip\.io|.*.test).*
See the RE2 regex demo.
See this IP matching regex page for details.
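The unwrapping idea (one octet followed by three dot-plus-octet repetitions, in place of a lookahead) can be tried out in Python. This is only a sketch under assumptions: the names OCTET, LOCAL_URL and looks_local are hypothetical, the optional port and path handling is an addition that is not in the original pattern, and only the IPv4 and localhost branches are covered. The pattern uses no lookarounds, so it should port to RE2, but it has not been run in BigQuery.

```python
import re

# One IPv4 octet, 0-255, written without any lookaround.
OCTET = r'(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)'

# Host is either four dot-separated octets or the literal "localhost".
# The optional port and the "/ or end of string" check are assumptions
# added for this sketch.
LOCAL_URL = re.compile(
    r'^https?://(?:' + OCTET + r'(?:\.' + OCTET + r'){3}|localhost)'
    r'(?::\d+)?(?:/|$)'
)


def looks_local(url):
    """Return True when the URL host is an IPv4 literal or localhost."""
    return LOCAL_URL.search(url) is not None


urls = [
    'http://localhost:9999',
    'https://localhost:0000',
    'http://stackoverflow.com/challenge',
    'https://arseniyaskingquestion.ru/SO',
]

# Keep only the URLs that do not look local.
for url in urls:
    if not looks_local(url):
        print(url)
```

Filtering with a single negated predicate mirrors the WHERE NOT REGEXP_CONTAINS(...) form suggested above, without the NOT IN subquery.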

Regexp_like vs regex validators online - different results

I have a regex expression for email validation using plsql that is giving me some headaches... :)
This is the condition I'm using for an email (rercear12345#gmail.com) validation:
IF NOT REGEXP_LIKE (user_email, '^([\w\-\.]+)#((\[([0-9]{1,3}\.){3}[0-9]{1,3}\])|(([\w\-]+\.)+)([a-zA-Z]{2,4}))$') THEN
control := FALSE;
dbms_output.put_line('EMAIL '||C.user_email||' not according to regex');
END IF;
If I make a select based on the expression I don't get any values either:
Select * from TABLE_X where REGEXP_LIKE (user_email, '^([\w\-\.]+)#((\[([0-9]{1,3}\.){3}[0-9]{1,3}\])|(([\w\-]+\.)+)([a-zA-Z]{2,4}))$');
Using regex101.com I get full match with this email: rercear12345#gmail.com
Any idea?

The regular expression syntax that Oracle supports is in the documentation.
It seems Oracle doesn't understand the \w inside the []. You can expand that to:
with table_x (user_email) as (
select 'rercear12345#gmail.com' from dual
union all
select 'bad name#gmail.com' from dual
)
Select * from TABLE_X
where REGEXP_LIKE (user_email, '^[a-zA-Z_0-9.-]+#((\[([0-9]{1,3}\.){3}[0-9]{1,3}\])|([a-zA-Z_0-9-]+\.)+[a-zA-Z]{2,4})$');
USER_EMAIL
----------------------
rercear12345#gmail.com
You don't need to escape the . or - inside the square brackets, by doing that you would allow literal backslashes to be matched.
This sort of requirement has come up before - e.g. here - but you seem to be allowing IP address octets instead of FQDNs, enclosed in literal square brackets, which is unusual.
As #BobJarvis said you could also use the [:alnum:] but would still need to include underscore. That could allow non-ASCII 'letter' characters you aren't expecting; though they may be valid, as are other symbols you exclude; you seem to be following the 'common advice' mentioned in that article though.
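A pattern built only from explicit character ranges behaves the same in most engines, which makes it easy to cross-check outside Oracle. The sketch below is Python, not PL/SQL, and the names EMAIL and is_valid are hypothetical. It uses the expanded pattern with the dot before the top-level domain escaped, and the separator character exactly as it appears in the sample rows.

```python
import re

# Explicit ranges instead of \w inside brackets, so the pattern does not
# depend on Perl-style shorthand being understood within a character class.
EMAIL = re.compile(
    r'^[a-zA-Z_0-9.-]+#'
    r'((\[([0-9]{1,3}\.){3}[0-9]{1,3}\])|([a-zA-Z_0-9-]+\.)+[a-zA-Z]{2,4})$'
)


def is_valid(address):
    """Return True when the address matches the expanded pattern."""
    return EMAIL.match(address) is not None


for row in ['rercear12345#gmail.com', 'bad name#gmail.com']:
    print(row, is_valid(row))
```

Only the first sample row matches, which agrees with the query result shown above.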

How to extract file name from URL?

I have file names in a URL and want to strip out the preceding URL and filepath as well as the version that appears after the ?
Sample URL
Trying to use RegEx to pull, CaptialForecasting_Datasheet.pdf
The REGEXP_EXTRACT in Google Data Studio seems unique. Tried the suggestion but kept getting "could not parse" error. I was able to strip out the first part of the url with the following. Event Label is where I store URL of downloaded PDF.
The URL:
https://www.dudesolutions.com/Portals/0/Documents/HC_Brochure_Digital.pdf?ver=2018-03-18-110927-033
REGEXP_EXTRACT( Event Label , 'Documents/([^&]+)' )
The result:
HC_Brochure_Digital.pdf?ver=2018-03-18-110927-033
Now I am trying to work out how to drop everything from the ? onward, where the version data is, so as to extract just the Filename.pdf.

You could try:
[^\/]+(?=\?[^\/]*$)
This will match CaptialForecasting_Datasheet.pdf even if there is a question mark in the path. For example, the regex will succeed in both of these cases:
https://www.dudesolutions.com/somepath/CaptialForecasting_Datasheet.pdf?ver
https://www.dudesolutions.com/somepath?/CaptialForecasting_Datasheet.pdf?ver

Assuming that the name appears right after the last / and ends with the ?, the regular expression below will leave the name in group 1 where you can get it with \1 or whatever the tool that you are using supports.
.*\/(.*)\?
It basically says: get everything in between the last / and the first ? after, and put it in group 1.
Another regular expression that only matches the file name that you want but is more complex is:
(?<=\/)[^\/]*(?=\?)
It matches all non-/ characters, [^\/], immediately preceded by /, (?<=\/), and immediately followed by ?, (?=\?). The first parenthesized expression is a positive lookbehind, and the second is a positive lookahead.
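Both expressions can be checked against the sample URL from the question. The sketch below is Python rather than Data Studio, so it says nothing about which syntax REGEXP_EXTRACT accepts; the function names are hypothetical. The third helper swaps the regex for standard-library URL parsing, which is a different technique from the two answers above.

```python
import re
from posixpath import basename
from urllib.parse import urlsplit

URL = ('https://www.dudesolutions.com/Portals/0/Documents/'
       'HC_Brochure_Digital.pdf?ver=2018-03-18-110927-033')


def name_by_group(url):
    """Capture the text between the last '/' and the '?' in group 1."""
    match = re.search(r'.*/(.*)\?', url)
    return match.group(1) if match else None


def name_by_lookaround(url):
    """Match non-'/' characters preceded by '/' and followed by '?'."""
    match = re.search(r'(?<=/)[^/]*(?=\?)', url)
    return match.group(0) if match else None


def name_by_parsing(url):
    """Alternative without a regex: split off the query, take the last path part."""
    return basename(urlsplit(url).path)


print(name_by_group(URL))       # HC_Brochure_Digital.pdf
print(name_by_lookaround(URL))  # HC_Brochure_Digital.pdf
print(name_by_parsing(URL))     # HC_Brochure_Digital.pdf
```

The parsing variant also works when the URL has no query string at all, where both regexes would find nothing.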

This REGEXP_EXTRACT formula captures the characters a-zA-Z0-9_. between / and ?
REGEXP_EXTRACT(Event Label, "/([\\w\\.]+)\\?")
Google Data Studio Report to demonstrate.

Please try the following regex
[A-Za-z\_]*.pdf
I have tried it online at https://regexr.com/.
Please note that this only works for .pdf files

Following regex will extract file name with .pdf extension
(?:[^\/][\d\w\.]+)(?<=(?:.pdf))
You can add more extensions like this,
(?:[^\/][\d\w\.]+)(?<=(?:.pdf)|(?:.jpg))
Demo

Regex Assistance for a url filepath

Can someone assist in creating a Regex for the following situation:
I have about 2000 records in which I need to do a search/replace on a known item in each record that looks like this:
<li>View Product Information</li>
The FILEPATH and FILE are variable, but the surrounding HTML is always the same. Can someone assist with what kind of Regex I would substitute for the "FILEPATH/FILE" part of the search?

You can match the constant parts and use grouping to put them back:
(<li>View Product Information</li>)
Then replace the string with $1your_replacement$2, where $1 is the first matching group and $2 the second (in Python, for instance, you would call Match.group(1) and Match.group(2)).
If you are using Java, you would also have to escape the \ characters.
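The grouping approach can be sketched in Python as follows. The exact markup is an assumption: the sample line above no longer shows where FILEPATH/FILE sits, so this sketch supposes it is the href of a link inside the list item. LINK, replace_path and the sample paths are all hypothetical names and values.

```python
import re

# ASSUMPTION: the variable FILEPATH/FILE part is the href value of an <a>
# element wrapped around the constant link text. Group 1 and group 2 hold
# the constant markup on either side of it.
LINK = re.compile(
    r'(<li><a href=")[^"]*(">View Product Information</a></li>)'
)


def replace_path(html, new_path):
    """Swap the variable path while putting both constant groups back."""
    return LINK.sub(lambda m: m.group(1) + new_path + m.group(2), html)


record = '<li><a href="docs/old/file.pdf">View Product Information</a></li>'
print(replace_path(record, 'docs/new/file.pdf'))
```

Building the replacement in a function avoids having to escape backslashes or group references that might occur in the new path.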

Regular Expression for some email rules

I was using a regular expression for email formats which I thought was ok but the customer is complaining that the expression is too strict. So they have come back with the following requirement:
The email must contain an "#" symbol and end with either .xx or .xxx, i.e. .nl or .com. They are happy for anything that meets this to pass validation. I have started the expression to see if the string contains an "#" symbol, as below:
^(?=.*[#])
this seems to work but how do I add the last requirement (must end with .xx or .xxx)?

A regex simply enforcing your two requirements is:
^.+#.+\.[a-zA-Z]{2,3}$
However, there are email validation libraries for most languages that will generally work better than a regex.

I always use this for emails
^([a-zA-Z0-9_\-\.]+)#((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)$
Try http://www.ultrapico.com/Expresso.htm as well!

It is not possible to validate every email address with a regex, but for your requirements this simple regex works. It is neither complete nor does it check for errors in any way, but it meets the spec exactly:
[^#]+#.+\.\w{2,3}$
Explanation:
[^#]+: Match one or more characters that are not #
#: Match the #
.+: Match one or more of any character
\.: Match a .
\w{2,3}: Match 2 or 3 word characters (a-zA-Z0-9_; use [a-zA-Z]{2,3} to allow letters only)
$: End of string
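The pieces explained above can be tried out with a few inputs. This is a minimal Python sketch with a hypothetical helper name, using the separator character exactly as it appears in the pattern; the last call uses an invented address to show that \w also accepts digits.

```python
import re

# At least one non-'#' character, the '#', anything, a literal dot,
# then 2 or 3 word characters at the end of the string.
LOOSE_EMAIL = re.compile(r'[^#]+#.+\.\w{2,3}$')


def passes_rules(address):
    """Return True when the address meets the two stated requirements."""
    return LOOSE_EMAIL.search(address) is not None


print(passes_rules('example#domain.com'))  # True
print(passes_rules('name#domain.nl'))      # True
print(passes_rules('example#domaincom'))   # False: no dot before the ending
print(passes_rules('exampledomain.com'))   # False: no '#'
print(passes_rules('a#b.c0m'))             # True: \w accepts the digit
```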

Try this :
([\w-\.]+)#((?:[\w]+\.)+)([a-zA-Z]{2,4})\be(\w*)s\b
A good tool to test our regular expression :
http://gskinner.com/RegExr/

You could use
[#].+\.[a-z0-9]{2,3}$

This should work:
^[^#\r\n\s]+[^.#]#[^.#][^#\r\n\s]+\.(\w){2,}$
I tested it against these invalid emails:
#exampleexample#domaincom.com
example#domaincom
exampledomain.com
exampledomain#.com
exampledomain.#com
example.domain#.#com
e.x+a.1m.5e#em.a.i.l.c.o
some-user#internal-email.company.c
some-user#internal-ema#il.company.co
some-user##internal-email.company.co
#test.com
test#asdaf
test#.com
test.#com.co
And these valid emails:
example#domain.com
e.x+a.1m.5e#em.a.i.l.c.om
some-user#internal-email.company.co
Edit:
This one appears to validate all of the addresses from the Wikipedia page on email addresses, though it probably allows some invalid emails as well. The parentheses split the match into everything before and after the #:
^([^\r\n]+)#([^\r\n]+\.?\w{2,})$
niceandsimple#example.com
very.common#example.com
a.little.lengthy.but.fine#dept.example.com
disposable.style.email.with+symbol#example.com
other.email-with-dash#example.com
user#[IPv6:2001:db8:1ff::a0b:dbd0]
"much.more unusual"#example.com
"very.unusual.#.unusual.com"#example.com
"very.(),:;<>[]\".VERY.\"very#\\ \"very\".unusual"#strange.example.com
postbox#com
admin#mailserver1
!#$%&'*+-/=?^_`{}|~#example.org
"()<>[]:,;#\\\"!#$%&'*+-/=?^_`{}| ~.a"#example.org
" "#example.org
üñîçøðé#example.com
üñîçøðé#üñîçøðé.com
