HIVE - Get All matched text from string column - regex

I am trying to extract all the URLs from a string field (metainfo.body) using this query:
select split(regexp_replace(metainfo.body,'.*?((http|ftp|https):\/\/([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,#?^=%&:\/~+#-]*[\w#?^=%&\/~+#-]))\\n','$1#'),'#')
It does not return only the URLs; it returns the complete field. What should I change in this Hive query to get the list of URLs?
eg:
select regexp_replace('hello hi i am arun http://a.com https://b.com','.*?((http|ftp|https):\/\/([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,#?^=%&:\/~+#-]*[\w#?^=%&\/~+#-]))','$1,') as output
output:
hello hi i am arun http://a.com https://b.com
Expected:
http://a.com,https://b.com,

You can try making the match case-insensitive.
Then add optional whitespace, \s* or [ \t\r\n]*, at the end.
Here is your regex rewritten with explicit ASCII classes instead of the word class \w:
.*?((?:https?|ftp)://[a-zA-Z0-9_-]+(?:\.[a-zA-Z0-9_-]+)+[#%&+-:=?-Z\^_a-z~]*[#%&+\-/-9=?-Z\^_a-z~])\s*
REGEXP_REPLACE should replace every occurrence of the pattern in the string.
I can't test it, but judging from online examples that use split the way you do, this should work:
select split(regexp_replace('hello hi i am arun http://a.com https://b.com',
'.*?((?:https?|ftp)://[a-zA-Z0-9_-]+(?:\.[a-zA-Z0-9_-]+)+[#%&+-:=?-Z\^_a-z~]*[#%&+\-/-9=?-Z\^_a-z~])\s*',
'$1,'), ',');
Here is a test of the regex and its replacement using PCRE:
https://regex101.com/r/lIEvCk/1
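The replace-then-split idea can also be checked outside Hive. Below is a minimal Python sketch, not the Hive query itself: it reuses the pattern above verbatim, while the helper name extract_urls is made up for illustration. Hive string literals treat backslashes differently from Python raw strings, so this only demonstrates the regex logic.

```python
import re

# The answer's pattern, copied verbatim; Python's re syntax accepts it unchanged.
URL_THEN_SPACE = re.compile(
    r'.*?((?:https?|ftp)://[a-zA-Z0-9_-]+(?:\.[a-zA-Z0-9_-]+)+'
    r'[#%&+-:=?-Z\^_a-z~]*[#%&+\-/-9=?-Z\^_a-z~])\s*'
)

def extract_urls(text):
    """Hypothetical helper: turn each 'leading text + URL' run into 'URL,' and split."""
    joined = URL_THEN_SPACE.sub(r'\1,', text)
    # The trailing comma produces one empty item, so empty strings are dropped.
    return [part for part in joined.split(',') if part]

urls = extract_urls('hello hi i am arun http://a.com https://b.com')
print(urls)
```

One caveat of this approach: text after the last URL is not consumed by the pattern, so it survives as an extra item in the split result.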

Related

How can I extract all characters between the first / and second / using REGEXP_EXTRACT in Google Data Studio?

I am trying to use REGEXP_EXTRACT in Google Data Studio for extracting a part of the URL.
Input:
URLs
/media/news/royals/meghan-markle-prince-harry-archie-new-photo
/marketplace/deals/best-selling-orthotic-friendly-sneakers/
Output:
URLs
media
marketplace
How can I draft an expression that will allow me to extract it?
You can use a regex with a capture group that matches the start of the string, one slash, anything that is not a slash, then another slash. In Python, the regex below works. Use regex101.com to test your regex.
import re

strings = ['/media/news/royals/meghan-markle-prince-harry-archie-new-photo', '/marketplace/deals/best-selling-orthotic-friendly-sneakers/']
for s in strings:
    # Keep only the first path segment (capture group 1).
    good_part = re.sub(r'\A/([^/]*)/.*', r'\1', s)
    print(good_part)
prints:
media
marketplace
You can achieve this with the following expression: ^/([^/]+).
It matches a string that starts (^) with /, and captures 1 or more characters that are not a / after that (([^/]+)).
Example:
WITH URLS AS (
SELECT '/media/news/royals/meghan-markle-prince-harry-archie-new-photo' url
UNION ALL
SELECT '/marketplace/deals/best-selling-orthotic-friendly-sneakers/' url
)
SELECT url, REGEXP_EXTRACT(url, '^/([^/]+)') path
FROM URLS
See https://support.google.com/datastudio/answer/7050487?hl=en
It can be achieved using the REGEXP_EXTRACT Calculated Field below which extracts all characters between the first / and the next / (if there is no second /, all characters will be captured till the end of the string):
REGEXP_EXTRACT(URLs, "^/([^/]+)")
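To see how ^/([^/]+) behaves on edge cases, here is a small Python sketch of the same expression. It is only an approximation of the Data Studio behaviour, since Data Studio uses its own regex engine, and the function name first_segment is an illustrative assumption.

```python
import re

# Same expression as in the answers: a leading slash, then one or more non-slash characters.
FIRST_SEGMENT = re.compile(r'^/([^/]+)')

def first_segment(url):
    """Hypothetical helper: return the text between the first and second '/', or None."""
    match = FIRST_SEGMENT.search(url)
    return match.group(1) if match else None

print(first_segment('/media/news/royals/meghan-markle-prince-harry-archie-new-photo'))
print(first_segment('/marketplace/deals/best-selling-orthotic-friendly-sneakers/'))
print(first_segment('/about'))  # no second slash: captures to the end of the string
```

A string that does not start with / produces no match, which this sketch reports as None.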

Sublime Regex Search and Replace

I'm trying to use Sublime to update the URLs on only some lines of a SQL table dump.
In this case, the lines I need to single out contain the string 'themo_showcase_\d_image', which is easy to match. On those lines, what I actually need to change is the URL column, from 'https://www.example.com/' to 'http://www.example.com/'.
Is anyone able to shed some light on this? I've got thousands of these insert records that I need to modify.
ex:
original string:
('8630', '1328', 'themo_showcase_1_image', 'https://www.example.com/'),
to:
('8630', '1328', 'themo_showcase_1_image', 'http://www.example.com/'),
Find: 'themo_showcase_\d_image', 'http\Ks
(You could use \d+ if there is more than one digit. \K drops everything matched before it from the match, so only the s is replaced.)
Replace: LEAVE EMPTY
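If the dump is processed with a script instead of the editor, note that Python's built-in re module does not support \K. The sketch below swaps in a capture group to get the same effect; the function name downgrade_showcase_url is an assumption made for illustration, and \d+ is used so multi-digit indexes also match.

```python
import re

# Capture everything up to and including 'http', then match the 's' to drop.
SHOWCASE_HTTPS = re.compile(r"('themo_showcase_\d+_image', 'http)s")

def downgrade_showcase_url(line):
    """Hypothetical helper: rewrite https to http only on themo_showcase_N_image rows."""
    return SHOWCASE_HTTPS.sub(r"\1", line)

row = "('8630', '1328', 'themo_showcase_1_image', 'https://www.example.com/'),"
print(downgrade_showcase_url(row))
```

Rows whose key does not match the themo_showcase pattern are returned unchanged.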

REGEX: URL with any query param OR URL with any query param and a specific hash

I need to make a regex to cover the following:
domain/path/?anything --> YES
domain/path/?anything#/specificHash --> YES
domain/path/?anything#/otherHash --> NO
My attempt: I wrote this regex, which excludes the other hashes:
domain/path/?(?!.*(otherHash1|otherHash2|otherHash3))
It works, but there are a lot of them. I would rather make it simpler by including only my specificHash.
Thanks in advance
You're trying to match URLs that have a query string followed by a specific hash, OR a query string without any hash. Here's how I would match that:
domain/path/?[^#]*(?:#/specificHash)?$
The [^#]* regex fragment should match any query string but no hash fragment. It can only be followed by an optional #/specificHash and the end of the string.
Try it here (forked from @Wiktor Stribiżew's test cases).
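The three cases from the question can be checked with a short Python sketch. The pattern is the one given above, unchanged; domain/path and specificHash are the placeholders from the question, and the function name is_allowed is made up for illustration.

```python
import re

# The answer's pattern: any query string without '#', an optional '#/specificHash', then end.
ALLOWED = re.compile(r'domain/path/?[^#]*(?:#/specificHash)?$')

def is_allowed(url):
    """Hypothetical helper: True when the URL has no hash or exactly the specific hash."""
    return ALLOWED.search(url) is not None

for url in ('domain/path/?anything',
            'domain/path/?anything#/specificHash',
            'domain/path/?anything#/otherHash'):
    print(url, is_allowed(url))
```

Because of the trailing $, a hash that merely starts with specificHash (for example #/specificHashExtra) is rejected as well.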

Talend tExtractRegexFields match after comma syntax

I'm trying to extract everything in a string after the first comma, using the tExtractRegexFields component.
I'm splitting strings in an address field (Address_1) to a second address field (Address_2).
On regexr.com, the following syntax works perfectly: ,[\s\S]*$
In order to comply with Talend's escape sequences, I changed that syntax to
,[\\s\\S]*$. That solved the error, but the code doesn't appear to match on anything, since nothing is split from Address_1 to Address_2.
What's wrong? Does this syntax not work in Talend? Are there alternate Regex solutions?
To split a string using tExtractRegexFields, use a regex with capture groups so that each group is delivered to its own column. I used this regex and it works fine: "^(.*)[,]([^,]*)$" (my input string: "123 North Drive,PO Box 1,Miami, FL 55555-5555").
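It is worth checking which comma the grouped regex splits on. The Python sketch below runs the regex from the answer on the same input; for this pattern Python's re behaves the same way as Java regex, which Talend uses. Because (.*) is greedy, the split happens at the last comma. The FIRST_COMMA variant is my own assumption, not part of the answer, and shows one way to split at the first comma as the question asks.

```python
import re

address = "123 North Drive,PO Box 1,Miami, FL 55555-5555"

# Regex from the answer: greedy (.*) pushes the split to the LAST comma.
LAST_COMMA = re.compile(r'^(.*)[,]([^,]*)$')
# Assumed variant: [^,]* cannot cross a comma, so the split is at the FIRST comma.
FIRST_COMMA = re.compile(r'^([^,]*),(.*)$')

last_split = LAST_COMMA.match(address).groups()
first_split = FIRST_COMMA.match(address).groups()

print(last_split)
print(first_split)
```

In Talend, the same backslash-free patterns can be pasted between double quotes without extra escaping.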

Regex Assistance for a url filepath

Can someone assist in creating a Regex for the following situation:
I have about 2000 records on which I need to do a search/replace. In each record I need to replace a known item that looks like this:
<li>View Product Information</li>
The FILEPATH and FILE are variable, but the surrounding HTML is always the same. Can someone assist with what kind of Regex I would substitute for the "FILEPATH/FILE" part of the search?
You can match the constant parts and use capture groups to put them back:
(<li>View Product Information</li>)
Then replace the match with $1your_replacement$2, where $1 is the first capture group and $2 the second (in Python, for instance, these are Match.group(1) and Match.group(2)).
You would have to escape \ chars if you're using Java instead.
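Here is a minimal Python sketch of the two-group approach. The exact markup is not visible in the post, so the href-based link below is purely an assumed shape for the record; the sample path and the function name replace_filepath are also hypothetical. Adjust the constant parts to whatever actually surrounds FILEPATH/FILE.

```python
import re

# ASSUMPTION: the record wraps the variable path in an <a href="..."> inside the <li>.
# Group 1 and group 2 are the constant parts; [^"]* is the variable FILEPATH/FILE.
LINK = re.compile(r'(<li><a href=")[^"]*(">View Product Information</a></li>)')

def replace_filepath(record, new_path):
    """Hypothetical helper: swap the variable path, keeping the surrounding HTML."""
    # A function replacement avoids backslash-escape surprises in new_path.
    return LINK.sub(lambda m: m.group(1) + new_path + m.group(2), record)

record = '<li><a href="docs/old/manual.pdf">View Product Information</a></li>'
print(replace_filepath(record, 'docs/new/manual-v2.pdf'))
```

Records that do not contain the constant HTML are returned unchanged, so the function can be applied to every record.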