Hive: First and last occurrence in a string - regex

I have an id column and a string column as follows:
id values
1 AD123~DF123~SQ345
2 CF234~DF234
3 BG123
I need the first occurrence and last occurrence of columns below in Hive
id first last
1 AD123 SQ345
2 CF234 DF234
3 BG123 BG123
I have already tried to solve it using the Hive split function:
select id, split(values, '\~') [0] as first, reverse(split(reverse(values), '\~')[0]) from demo;
I keep getting a syntax error in Hive saying that [ is unexpected.
Another alternative I found is regex, but I am new to Hive. Can someone please help me out here with regex or split? Thanks.

Using split:
with your_table as (
  select stack(3,
    1, 'AD123~DF123~SQ345',
    2, 'CF234~DF234',
    3, 'BG123'
  ) as (id, values)
) -- use your_table instead of this
select id, values[0] as first, values[size(values)-1] as last
from
(
  select id, split(values, '~') as values
  from your_table t
) s
;
Returns:
id first last
1 AD123 SQ345
2 CF234 DF234
3 BG123 BG123
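The first/last indexing idea in the split answer is dialect-independent; as an illustrative sanity check (not Hive code), the same logic in Python:

```python
# Same idea as the Hive answer: split on '~', then take the
# first and last elements of the resulting array.
def first_last(values):
    parts = values.split('~')
    return parts[0], parts[-1]  # parts[-1] plays the role of values[size(values)-1]
```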
Using regexp:
select id,
       regexp_extract(values, '^([^~]*)', 1) as first,
       regexp_extract(values, '([^~]*)$', 1) as last
from your_table t
;
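Both patterns are plain regex with no Hive-specific syntax, so they are easy to sanity-check in any engine — for example in Python's re module:

```python
import re

# '^([^~]*)'  -> everything before the first '~'
# '([^~]*)$'  -> everything after the last '~'
def first_last_re(values):
    first = re.search(r'^([^~]*)', values).group(1)
    last = re.search(r'([^~]*)$', values).group(1)
    return first, last
```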


Does AWS Athena support ORDER BY in ARRAY_AGG?

I'm working with AWS Athena to concat a few rows into a single row.
Example table:(name: unload)
xid pid sequence text
1 1 0 select * from
1 1 1 mytbl
1 1 2
2 1 0 update test
2 1 1 set mycol=
2 1 2 'a';
So I want to concat the text column.
Expected Output:
xid pid text
1 1 select * from mytbl
2 1 update test set mycol='a';
I ran the following query to partition it first with proper order and do the concat.
with cte as
(SELECT
xid,
pid,
sequence,
text,
row_number()
OVER (PARTITION BY xid,pid
ORDER BY sequence) AS rank
FROM unload
GROUP BY xid,pid,sequence,text
)
SELECT
xid,
pid,
array_join(array_agg(text),'') as text
FROM cte
GROUP BY xid,pid
But as the output below shows, the order got mixed up.
xid pid text
1 1 mytblselect * from
2 1 update test'a'; set mycol=
I checked the Presto documentation; the latest version supports ORDER BY in array_agg, but Athena uses Presto 0.172, so I'm not sure whether it is supported or not.
What is the workaround for this in Athena?
One approach:
create records with a sortable format of text
aggregate into an unsorted array
sort the array
transform each element back into the original value of text
convert the sorted array to a string output column
WITH cte AS (
SELECT
xid, pid, text
-- create a sortable 19-digit ranking string
, SUBSTR(
LPAD(
CAST(
ROW_NUMBER() OVER (PARTITION BY xid, pid ORDER BY sequence)
AS VARCHAR)
, 19
, '0')
, -19) AS SEQ_STR
FROM unload
)
SELECT
xid, pid
-- make sortable string, aggregate into array
-- then sort array, revert each element to original text
-- finally combine array elements into one string
, ARRAY_JOIN(
TRANSFORM(
ARRAY_SORT(
ARRAY_AGG(SEQ_STR || text))
, combined -> SUBSTR(combined, 1 + 19))
, ' '
, '') AS TEXT
FROM cte
GROUP BY xid, pid
ORDER BY xid, pid
This code assumes:
xid + pid + sequence is unique for all input records
There are not many combinations of xid + pid + sequence (e.g., not more than 20 million)
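The core trick — tag each fragment with a fixed-width sortable prefix, sort lexically, strip the prefix — is dialect-independent. A minimal Python sketch of the same pad-sort-strip idea (using the sequence number directly as the prefix, instead of the query's ROW_NUMBER, purely for illustration):

```python
WIDTH = 19  # same width as the LPAD in the query above

def ordered_concat(rows):
    """rows: iterable of (sequence, text) pairs, in arbitrary order."""
    # tag each fragment with a zero-padded, lexically sortable prefix
    tagged = [str(seq).zfill(WIDTH) + text for seq, text in rows]
    # sort by the prefix, then strip it off before joining
    return ' '.join(t[WIDTH:] for t in sorted(tagged))
```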

How to get all the words that start with a certain character in BigQuery

I have a text column in a bigquery table. Sample record of that column looks like -
with temp as
(
select 1 as id,"as we go forward into unchartered waters it's important to remember we are all in this together. #united #community" as input
union all
select 2 , "US cities close bars, restaurants and cinemas #Coronavirus"
)
select *
from temp
I want to extract all the words in this column that start with a #. Later on I would like to get the frequency of these terms. How do I do this in BigQuery?
My output would look like -
id, word
1, united
1, community
2, coronavirus
Below is for BigQuery Standard SQL
I want to extract all the words in this column that start with a #
#standardSQL
WITH temp AS (
SELECT 1 AS id,"as we go forward into unchartered waters it's important to remember we are all in this together. #united #community" AS input UNION ALL
SELECT 2 , "US cities close bars, restaurants and cinemas #Coronavirus"
)
SELECT id, word
FROM temp, UNNEST(REGEXP_EXTRACT_ALL(input, r'(?:^|\s)#([^#\s]*)')) word
with output
Row id word
1 1 united
2 1 community
3 2 Coronavirus
later on I would like to get the frequency of these terms
#standardSQL
SELECT word, COUNT(1) frequency
FROM temp, UNNEST(REGEXP_EXTRACT_ALL(input, r'(?:^|\s)#([^#\s]*)')) word
GROUP BY word
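The extraction pattern itself uses no BigQuery-specific syntax, and this particular pattern behaves identically in Python's re module, which makes it easy to sanity-check outside BigQuery:

```python
import re

# same pattern as in the query: a hashtag must follow the start of the
# string or whitespace; the captured word stops at '#' or whitespace
HASHTAG = re.compile(r'(?:^|\s)#([^#\s]*)')

def hashtags(text):
    return HASHTAG.findall(text)
```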
You can do this without regexes, by splitting words and then selecting ones that start the way you want. For example:
SELECT
id,
ARRAY(SELECT TRIM(x, "#") FROM UNNEST(SPLIT(input, ' ')) as x WHERE STARTS_WITH(x,'#')) str
FROM
temp
If you prefer the hashtags to be separate rows, you can be a bit tidier:
SELECT
id,
TRIM(x, "#") str
FROM
temp,
UNNEST(SPLIT(input, ' ')) x
WHERE
STARTS_WITH(x,'#')
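The non-regex variant translates just as directly; an illustrative Python sketch of the same split / starts-with / trim pipeline (note that, like TRIM(x, "#"), Python's strip('#') removes the character from both ends of the token):

```python
# split on spaces, keep tokens starting with '#', trim the '#' off
def hashtag_words(text):
    return [w.strip('#') for w in text.split(' ') if w.startswith('#')]
```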

Split multiple delimited string into unique rows - basically return unique words of sentences from table

There are a lot of different posts out there on this subject, but I really can't find one suitable for my project.
I have a table with 4 columns of varchar2, length 20, 60, 72 and 160, containing approximately 700,000 records with data of items/products.
Example of table:
Text Id SHNAM
LEVI,GRADY Whitley 1 007C
Levi Grady;Whitley 2 0001
BEVIS,GRADY Leblanc 3 007D
Aladdin Grady;Green 4 0002
ULLA,GRADY Holman 5 0003
From this table I would like to populate a new table or a materialized view of every unique word. Delimiters used are either space, comma or semicolon (', ;').
Expected output:
OUTPUT
Levi
GRADY
Whitley
BEVIS
Leblanc
Aladdin
Green
ULLA
Holman
Note that the check is not case sensitive.
E.g. this blog post applies to your question: Splitting a comma delimited string the RegExp way, Part Two. My answer is derived directly from the blog:
with data_(id_, str) as (
select 1, 'LEVI,GRADY Whitley' from dual union all
select 2, 'Levi Grady;Whitley' from dual union all
select 3, 'BEVIS,GRADY Leblanc' from dual union all
select 4, 'aladdin grady;green' from dual union all
select 5, 'ULLA,GRADY Holman' from dual union all
select 6, '1aar,1bar;1car 1dar,1ear' from dual
)
select distinct lower(regexp_substr(str, '[^,;[:space:]]+', 1, rownum_)) as splitted
from data_
cross join (select rownum as rownum_
from (select max(regexp_count(str, '[,;[:space:]]')) + 1 as max_
from data_
)
connect by level <= max_
)
where regexp_substr(str, '[^,;[:space:]]+', 1, rownum_) is not null
order by splitted
;
Note that this query doesn't produce exactly the output you listed in the question for ids 1 to 5. You expected Levi (initcap) and GRADY (all caps) even though both names appear in both variations - this is inconsistent, so I simply ignored it.
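The splitting logic itself is easy to model outside Oracle; a minimal Python sketch of the same split-on-delimiters, lower-case, distinct idea (mirroring the query's [^,;[:space:]]+ character class):

```python
import re

# split on comma, semicolon or whitespace, lower-case, deduplicate
def unique_words(rows):
    words = set()
    for text in rows:
        for w in re.split(r'[,;\s]+', text):
            if w:
                words.add(w.lower())
    return sorted(words)
```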

Replace a part of a varchar2 column in Oracle

I've a varchar2 column in a table which contains a few entries like the following
TEMPORARY-2 TIME ECS BOUND -04-Insuficient Balance
I want to update these entries and make it TEMPORARY-2 X. What's the way out?
To accomplish this, you can use character functions such as substr() and replace(), or a regular expression function such as regexp_replace().
with t1(col) as (
  select 'TEMPORARY-2 TIME ECS BOUND -04-Insuficient Balance'
  from dual
)
select concat(substr(col, 1, 11), ' X') as res_1
     , regexp_replace(col, '^(\w+-\d+)(.*)', '\1 X') as res_2
from t1
;
Result:
RES_1 RES_2
------------- -------------
TEMPORARY-2 X TEMPORARY-2 X
So your update statement may look like this:
update your_table t
set t.col_name = regexp_replace(col_name, '^(\w+-\d+)(.*)', '\1 X')
-- where clause if needed.
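The regexp_replace pattern is standard enough to verify in any regex engine before running the update — for example in Python:

```python
import re

# keep the leading WORD-DIGITS group, replace the rest with ' X'
def shorten(col):
    return re.sub(r'^(\w+-\d+)(.*)', r'\1 X', col)
```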

How do I extract a pattern from a table in Oracle 11g?

I want to extract text from a column using regular expressions in Oracle 11g. I have 2 queries that do the job but I'm looking for a (cleaner/nicer) way to do it. Maybe combining the queries into one or a new equivalent query. Here they are:
Query 1: identify rows that match a pattern:
select column1 from table1 where regexp_like(column1, pattern);
Query 2: extract all matched text from a matching row.
select regexp_substr(matching_row, pattern, 1, level)
from dual
connect by level < regexp_count(matching_row, pattern);
I use PL/SQL to glue these 2 queries together, but it's messy and clumsy. How can I combine them into one query? Thank you.
UPDATE: sample data for pattern 'BC':
row 1: ABCD
row 2: BCFBC
row 3: HIJ
row 4: GBC
Expected result is a table of 4 rows of 'BC'.
You can also do it in one query, functions/procedures/packages not required:
WITH t1 AS (
SELECT 'ABCD' c1 FROM dual
UNION
SELECT 'BCFBC' FROM dual
UNION
SELECT 'HIJ' FROM dual
UNION
SELECT 'GBC' FROM dual
)
SELECT c1, regexp_substr(c1, 'BC', 1, d.l, 'i') thePattern, d.l occurrence
FROM t1 CROSS JOIN (SELECT LEVEL l FROM dual CONNECT BY LEVEL < 200) d
WHERE regexp_like(c1,'BC','i')
AND d.l <= regexp_count(c1,'BC');
C1 THEPATTERN OCCURRENCE
----- -------------------- ----------
ABCD BC 1
BCFBC BC 1
BCFBC BC 2
GBC BC 1
SQL>
I've arbitrarily limited the number of occurrences to search for to 200; YMMV.
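The intended result — one output row per case-insensitive occurrence of the pattern in each matching row — can be sketched in Python as a sanity check (not a substitute for the SQL above):

```python
import re

# one (row, match) pair per occurrence, case-insensitively,
# mirroring the regexp_like filter plus regexp_substr per occurrence
def extract_all(rows, pattern):
    rx = re.compile(pattern, re.IGNORECASE)
    return [(row, m) for row in rows for m in rx.findall(row)]
```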
Actually there is an elegant way to do this in one query, if you do not mind going the extra mile. Please note that this is just a sketch; I have not run it, so you'll probably have to correct a few typos in it.
create or replace package yo_package is
type word_t is record (word varchar2(4000));
type words_t is table of word_t;
end;
/
create or replace package body yo_package is
function table_function(in_cur in sys_refcursor, pattern in varchar2)
return words_t
pipelined parallel_enable (partition in_cur by any)
is
next varchar2(4000);
match varchar2(4000);
word_rec word_t;
begin
word_rec.word := null;
loop
fetch in_cur into next;
exit when in_cur%notfound;
--this is your inner loop, where you loop through the matches within next
--you have to implement this
loop
--TODO get the next match from next
word_rec.word := match;
pipe row (word_rec);
end loop;
end loop;
return;
end table_function;
end;
/
select *
from table(
  yo_package.table_function(
    cursor(
      --this is your first select
      select column1 from table1 where regexp_like(column1, pattern)
    )
  )
);