Hive: String split based on backslash \ - regex

I have a table with a column named path that contains values with backslashes:
\ModuleCalData\ComputerName
\ModuleCalData\StartTime
\ModuleCalData\EndTime
\ModuleCalData\SummaryParameters\TextMeasured\Value
\ModuleCalDataSummaryParameters\TextMeasured\Name
I'm trying to split and access each element separately. The query is
select split(path,'\\')[0] from test_data_tag;
This query is erroring out
Failed with exception
java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException:
Error evaluating split(path, '\')[0]
Can anyone help with how to split the string on \ in Hive?

select path
,split(path,'\\\\') as split_path
from mytable
;
+-----------------------------+-------------------------------------+
| path | split_path |
+-----------------------------+-------------------------------------+
| \ModuleCalData\ComputerName | ["","ModuleCalData","ComputerName"] |
+-----------------------------+-------------------------------------+
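The four backslashes are needed because the pattern passes through two layers of escaping. The error message in the question shows that Hive had already reduced '\\' to a single \ before the regex engine saw it, and a lone backslash is not a valid regular expression. With '\\\\', the regex engine receives \\, which matches one literal backslash. The regex-level half of this can be checked outside Hive. The following is only a minimal sketch: Python's re module stands in for the Java regex engine that Hive uses, and split_path and is_valid_pattern are hypothetical helper names.

```python
import re

def split_path(path):
    # The regex must be the two characters \\ to match one literal backslash.
    # The raw string keeps Python from consuming a level of escaping itself.
    return re.split(r"\\", path)

def is_valid_pattern(pattern):
    # Report whether the regex engine accepts the pattern at all.
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

print(split_path(r"\ModuleCalData\ComputerName"))
print(is_valid_pattern("\\"))   # pattern is a single backslash
print(is_valid_pattern(r"\\"))  # pattern is an escaped backslash
```

The leading empty string in the result corresponds to the empty first element in the Hive output above, because the path starts with the delimiter.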

Related

How to use regexp_replace in spark.sql() to extract hashtags from string

I need to write a regexp_replace query in spark.sql() and I'm not sure how to handle it. For readability, I have to use SQL for it. I am trying to pull the hashtags out of the table. I know how to do this with the Python API, but most of my team are SQL users.
My dataframe example looks like so:
Insta_post
Today, Senate Dems vote to #SaveTheInternet. Proud to support similar #NetNeutrality legislation here in the House…
RT @NALCABPolicy: Meeting with @RepDarrenSoto . Thanks for taking the time to meet with @LatinoLeader ED Marucci Guzman. #NALCABPolicy2018.…
RT @Tharryry: I am delighted that @RepDarrenSoto will be voting for the CRA to overrule the FCC and save our #NetNeutrality rules. Find out…
My code:
I create a tempview:
post_df.createOrReplaceTempView("post_tempview")
post_df = spark.sql("""
select
regexp_replace(Insta_post, '.*?(.|'')(#)(\w+)', '$1') as a
from post_tempview
where Insta_post like '%#%'
""")
My end result:
+--------------------------------------------------------------------------------------------------------------------------------------------+
|a |
+--------------------------------------------------------------------------------------------------------------------------------------------+
|Today, Senate Dems vote to #SaveTheInternet. Proud to support similar #NetNeutrality legislation here in the House… |
|RT @NALCABPolicy: Meeting with @RepDarrenSoto . Thanks for taking the time to meet with @LatinoLeader ED Marucci Guzman. #NALCABPolicy2018.…|
|RT @Tharryry: I am delighted that @RepDarrenSoto will be voting for the CRA to overrule the FCC and save our #NetNeutrality rules. Find out…|
+--------------------------------------------------------------------------------------------------------------------------------------------+
desired result:
+---------------------------------+
|a |
+---------------------------------+
| #SaveTheInternet, #NetNeutrality|
| #NALCABPolicy2018 |
| #NetNeutrality |
+---------------------------------+
I haven't really used regexp_replace too much so this is new to me. Any help would be appreciated as well as an explanation of how to structure the subsets!
For Spark 3.1+, you can use the regexp_extract_all function to extract multiple matches:
post_df = spark.sql("""
select regexp_extract_all(Insta_post, '(#\\\\w+)', 1) as a
from post_tempview
where Insta_post like '%#%'
""")
post_df.show(truncate=False)
#+----------------------------------+
#|a |
#+----------------------------------+
#|[#SaveTheInternet, #NetNeutrality]|
#|[#NALCABPolicy2018] |
#|[#NetNeutrality] |
#+----------------------------------+
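The pattern in the query is written with four backslashes because, as far as I understand it, both the Python string and the Spark SQL string literal each consume one level of escaping, leaving #\w+ for the regex engine. As a rough illustration of what "extract all matches" means, here is a plain-Python sketch without Spark. The extract_hashtags name is made up for this example, and Python's re module is only an approximation of the Java regex engine Spark uses.

```python
import re

def extract_hashtags(post):
    # Return every '#' followed by word characters, in order of appearance.
    return re.findall(r"#\w+", post)

post = ("Today, Senate Dems vote to #SaveTheInternet. "
        "Proud to support similar #NetNeutrality legislation here in the House")
print(extract_hashtags(post))
```

Like regexp_extract_all, this returns a list per input string, so a post without hashtags yields an empty list rather than NULL.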
For Spark <3.1, you can use regexp_replace to remove everything that doesn't match the hashtag pattern:
post_df = spark.sql("""
select trim(trailing ',' from regexp_replace(Insta_post, '.*?(#\\\\w+)|.*', '$1,')) as a
from post_tempview
where Insta_post like '%#%'
""")
post_df.show(truncate=False)
#+-------------------------------+
#|a |
#+-------------------------------+
#|#SaveTheInternet,#NetNeutrality|
#|#NALCABPolicy2018 |
#|#NetNeutrality |
#+-------------------------------+
Note the use of trim to remove the unnecessary trailing commas left behind by the replacement string '$1,'.
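To see why the alternation '.*?(#\w+)|.*' works, it can help to replay it outside Spark. The sketch below uses Python's re.sub as a stand-in for regexp_replace (an assumption: the two engines are similar but not identical), and hashtags_csv is a hypothetical helper name. Each match is either "everything up to and including the next hashtag", which is replaced by the hashtag plus a comma, or "whatever is left", which is replaced by a bare comma.

```python
import re

def hashtags_csv(post):
    # Group 1 is set when a hashtag was found, and unset for the leftover text.
    replaced = re.sub(
        r".*?(#\w+)|.*",
        lambda m: (m.group(1) or "") + ",",
        post,
    )
    # Leftover matches only contribute commas at the end; strip them all,
    # which is the job trim(trailing ',' ...) does in the SQL version.
    return replaced.rstrip(",")

print(hashtags_csv("Dems vote to #SaveTheInternet. Proud to support #NetNeutrality here"))
```

The number of trailing commas depends on how the engine treats empty matches at the end of the string, which is why stripping all of them is safer than removing exactly one.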
Do you really need a view? The following DataFrame API code might do it:
from pyspark.sql import functions as F
df = df.filter(F.col('Insta_post').like('%#%'))
col_trimmed = F.trim(F.regexp_replace('Insta_post', r'.*?(#\w+)|.+', '$1 '))
df = df.select(F.regexp_replace(col_trimmed, r'\s', ', ').alias('a'))
df.show(truncate=False)
# +--------------------------------+
# |a |
# +--------------------------------+
# |#SaveTheInternet, #NetNeutrality|
# |#NALCABPolicy2018 |
# |#NetNeutrality |
# +--------------------------------+
I ended up using two regexp_replace calls, so there may well be a better alternative; I just couldn't think of one.

Extract domain from url using PostgreSQL

I need to extract the domain name for a list of urls using PostgreSQL. In the first version, I tried using REGEXP_REPLACE to replace unwanted characters like www., biz., sports., etc. to get the domain name.
SELECT REGEXP_REPLACE(url, '^((www|www2|www3|static1|biz|health|travel|property|edu|world|newmedia|digital|ent|staging|cpelection|dev|m-staging|m|maa|cdnnews|testing|cdnpuc|shipping|sports|life|static01|cdn|dev1|ad|backends|avm|displayvideo|tand|static03|subscriptionv3|mdev|beta)\.)?', '') AS "Domain",
COUNT(DISTINCT(user)) AS "Unique Users"
FROM db
GROUP BY 1
ORDER BY 2 DESC;
This is not ideal, as the query has to be updated constantly with new unwanted words.
I did try https://stackoverflow.com/a/21174423/10174021 to extract from the end of the line using PostgreSQL REGEXP_SUBSTR, but I'm getting blank rows in return. Is there a better way of doing this?
A dataset sample to try with:
CREATE TABLE sample (
url VARCHAR(100) NOT NULL);
INSERT INTO sample (url)
VALUES
('sample.co.uk'),
('www.sample.co.uk'),
('www3.sample.co.uk'),
('biz.sample.co.uk'),
('digital.testing.sam.co'),
('sam.co'),
('m.sam.co');
Desired output
+------------------------+--------------+
| url | domain |
+------------------------+--------------+
| sample.co.uk | sample.co.uk |
| www.sample.co.uk | sample.co.uk |
| www3.sample.co.uk | sample.co.uk |
| biz.sample.co.uk | sample.co.uk |
| digital.testing.sam.co | sam.co |
| sam.co | sam.co |
| m.sam.co | sam.co |
+------------------------+--------------+
So, I've found a solution using Jeremy's and Rémy Baron's answers.
Extract all the public suffixes from the public suffix list and store them in a table, which I labelled tlds.
Get the unique urls in the dataset and match each one to its TLD.
Extract the domain name using regexp_replace (used in this query) or, alternatively, regexp_substr(t1.url, '([a-z]+)(.)'||t1."tld").
The SQL query is as below:
WITH stored_tld AS(
SELECT
DISTINCT(s.url),
FIRST_VALUE(t.domain) over (PARTITION BY s.url ORDER BY length(t.domain) DESC
rows between unbounded preceding and unbounded following) AS "tld"
FROM sample s
JOIN tlds t
ON (s.url like '%%'||domain))
SELECT
t1.url,
CASE WHEN t1."tld" IS NULL THEN t1.url ELSE regexp_replace(t1.url,'(.*\.)((.[a-z]*).*'||replace(t1."tld",'.','\.')||')','\2')
END AS "extracted_domain"
FROM(
SELECT a.url,st."tld"
FROM sample a
LEFT JOIN stored_tld st
ON a.url = st.url
)t1
You can try this:
with tlds as (
select * from (values('.co.uk'),('.co'),('.uk')) a(tld)
) ,
sample as (
select * from (values ('sample.co.uk'),
('www.sample.co.uk'),
('www3.sample.co.uk'),
('biz.sample.co.uk'),
('digital.testing.sam.co'),
('sam.co'),
('m.sam.co')
) a(url)
)
select url,regexp_replace(url,'(.*\.)(.*'||replace(tld,'.','\.')||')','\2') "domain" from (
select distinct url,first_value(tld) over (PARTITION BY url order by length(tld) DESC) tld
from sample join tlds on (url like '%'||tld)
) a
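The core idea in the queries above is to pick the longest known suffix that a url ends with, then keep only the single label in front of it. A minimal Python sketch of that logic follows. The extract_domain name and the three-entry TLDS list are stand-ins invented for illustration; a real run would load the full public suffix table instead.

```python
TLDS = [".co.uk", ".co", ".uk"]  # tiny stand-in for the tlds table

def extract_domain(url, tlds=TLDS):
    # Candidate suffixes are those the url ends with; prefer the longest,
    # so ".co.uk" wins over ".uk" for "sample.co.uk".
    matches = [t for t in tlds if url.endswith(t)]
    if not matches:
        return url  # no known suffix: leave the value unchanged
    tld = max(matches, key=len)
    # Keep only the label immediately before the suffix.
    label = url[: -len(tld)].split(".")[-1]
    return label + tld

for url in ["www3.sample.co.uk", "digital.testing.sam.co", "sam.co"]:
    print(url, "->", extract_domain(url))
```

Ordering by suffix length plays the same role as ORDER BY length(tld) DESC in the SQL: without it, "sample.co.uk" could be matched against ".uk" and come out as "co.uk".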
I use split_part(url,'/',3) for this:
select split_part('https://stackoverflow.com/questions/56019744', '/', 3) ;
output
stackoverflow.com

Regex to extract two values from single string in Splunk

I have log statements appearing in Splunk as below.
info Request method=POST, time=100, id=12345
info Response statuscode=200, time=300, id=12345
I'm trying to write a Splunk query that would extract the time parameter from the lines starting with info Request and info Response and basically find the time difference. Is there a way I can do this in a query? I'm able to extract values separately from each statement but not the two values together.
I'm hoping for something like below, but I guess the piping won't work:
... | search log="info Request*" | rex field=log "time=(?<time1>[^\,]+)" | search log="info Response*" | rex field=log "time=(?<time2>[^\,]+)" | table time1, time2
Any help is highly appreciated.
General process:
Extract type into a field
Calculate response and request times
Group by id
Calculate the diff
You may want to use something other than latest() in stats, but it won't matter if there's only one request/response per id.
| rex field=_raw "info (?<type>\w+).*"
| eval requestTime = if(type="Request",time,null())
| eval responseTime = if(type="Response",time,null())
| stats latest(requestTime) as requestTime latest(responseTime) as responseTime by id
| eval diff = responseTime - requestTime
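The same four steps (extract the type, pick out request and response times, group by id, subtract) can be sketched in plain Python to make the data flow explicit. This is an illustration only, not Splunk code: LINE_RE and time_diffs are hypothetical names, and the regex assumes every line looks exactly like the two samples in the question.

```python
import re

# Captures the line type, the time value and the id from one log line.
LINE_RE = re.compile(r"info (?P<type>\w+) .*?time=(?P<time>\d+), id=(?P<id>\d+)")

def time_diffs(lines):
    times = {}  # id -> {"Request": time, "Response": time}
    for line in lines:
        m = LINE_RE.search(line)
        if m is None:
            continue  # skip lines that do not look like the samples
        # Later lines overwrite earlier ones, similar in spirit to latest().
        times.setdefault(m.group("id"), {})[m.group("type")] = int(m.group("time"))
    # Only ids that have both a request and a response get a diff.
    return {
        id_: t["Response"] - t["Request"]
        for id_, t in times.items()
        if "Request" in t and "Response" in t
    }

logs = [
    "info Request method=POST, time=100, id=12345",
    "info Response statuscode=200, time=300, id=12345",
]
print(time_diffs(logs))
```

Grouping by id before subtracting is what lets the two values, which live on different events, end up on the same row.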

Keep words starting with character/letter in Pandas | Python

I'm not sure how to do this in a dataframe context.
I have the table below with text information:
TEXT |
-------------------------------------------|
"Get some new #turbo #stacks today!" |
"Is it one or three? #phone" |
"Mayhaps it be three afterall..." |
"So many new issues with phone... #iphone" |
And I want to edit it down to where only the words with a '#' symbol are kept, like in the result below.
TEXT |
-----------------|
"#turbo #stacks" |
"#phone" |
"" |
"#iphone" |
In some cases, I'd also like to eliminate the rows that end up empty. Is that done by checking for NaN, or do you need a different kind of condition to get this result?
TEXT |
-----------------|
"#turbo #stacks" |
"#phone" |
"#iphone" |
Python 2.7 and pandas for this.
You could try using regex and extractall:
df.TEXT.str.extractall(r'(#\w+)').groupby(level=0)[0].apply(' '.join)
Output:
0 #turbo #stacks
1 #phone
3 #iphone
Name: 0, dtype: object
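Note that the extractall output above has no row 2: rows without any match simply disappear. If the empty rows should be kept first and filtered as a separate step, one possible variant is str.findall followed by str.join. This is a sketch that swaps in a different technique from the answer above, using the sample data from the question.

```python
import pandas as pd

df = pd.DataFrame({"TEXT": [
    "Get some new #turbo #stacks today!",
    "Is it one or three? #phone",
    "Mayhaps it be three afterall...",
    "So many new issues with phone... #iphone",
]})

# findall gives one (possibly empty) list per row; joining an empty list
# produces "", so every original row is still present afterwards.
df["TEXT"] = df["TEXT"].str.findall(r"#\w+").str.join(" ")

# Optional second step: drop the rows that ended up empty.
non_empty = df[df["TEXT"] != ""]

print(df)
print(non_empty)
```

Because the empty rows hold "" rather than NaN, the filter compares against the empty string instead of using isnull().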

How to capture only part of an id?

I'm trying to capture the id of an element that will be randomly generated. I can successfully capture the value of my element id like this...
| storeAttribute | //div[1]@id | variableName |
Now my variable will be something like...
divElement-12345
I want to remove 'divElement-' so that the variable I am left with is '12345' so that I can use it later to select the 'form-12345' element associated with it...something like this:
| type | //tr[@id='form-${variableName}']/td/form/fieldset/p[1]/input | Type this |
How might I be able to accomplish this?
You have two options in Selenium: XPath and CSS selectors. I have read that CSS selectors are better for running tests in both Firefox and IE.
Using the latest version of Selenium IDE (3/5/2009), I've had success with storeEval, which evaluates JavaScript expressions and so gives you access to JavaScript string functions.
XPath:
storeAttribute | //div[1]@id | divID
storeEval | '${divID}'.replace("divElement-", "") | number
type | //tr[@id='form-${number}']/td/form/fieldset/p[1]/input | Type this
CSS Selector:
storeAttribute | css=div[1]@id | divID
storeEval | '${divID}'.replace("divElement-", "") | number
type | css=tr[id='form-${number}'] > td > form > fieldset > p[1] > input | Type this
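The string handling that storeEval performs here is small enough to check on its own. The following Python sketch mirrors it outside Selenium; form_input_xpath is a hypothetical helper name, and the "divElement-" prefix and locator shape are taken from the question.

```python
def form_input_xpath(div_id, prefix="divElement-"):
    # Drop the fixed prefix, keeping only the randomly generated number.
    number = div_id.replace(prefix, "", 1)
    # Build the locator for the form row that shares the same number.
    return "//tr[@id='form-%s']/td/form/fieldset/p[1]/input" % number

print(form_input_xpath("divElement-12345"))
```

Replacing only the first occurrence keeps the result predictable even if the generated part of the id happened to contain the prefix text again.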
There are many functions in XPath that should solve your problem. Assuming "divElement-" is a constant that will not change and that you are using XPath 2.0, I would suggest:
substring-after(div[1]/@id,"divElement-")