regular expression in hive for a specific string - regex

I have an address column in a Hive table and I want to split it into two columns.
There are 2 scenarios to take care of.
Example:
Scenario 1:
Input column value:
ABC DEF123 AD
Output column values:
Column 1 should have ABC DEF
Column 2 should have 123 AD
Another example can be like below.
MICHAEL POSTON875 HYDERABAD
In this case the split should be made at the number embedded in the string: if the string contains a number, the text before it goes to column 1 and the number plus the rest goes to column 2.
Scenario 2:
Input value: ABC DEFPO BOX 5232
Output:
Column 1:- ABC DEF
Column 2:- PO BOX 5232
Another example can be like below.
Hyderabad jhillsPO BOX 522002
In this case the split should be made at PO BOX.
Both kinds of data are in the same column, and I would like to load the data into the target based on the string format, something like a CASE statement, but I am not sure about the approach.
NOTE: The string length can vary, as this is an address column.
Can someone please help me with a Hive query and a PySpark version for this?

Using a CASE expression you can check which template the string matches, insert a delimiter at the split point with regexp_replace, and then split on that delimiter.
Demo (Hive):
with mytable as (
  select stack(4,
    'ABC DEF123 AD',
    'MICHAEL POSTON875 HYDERABAD',
    'ABC DEFPO BOX 5232',
    'Hyderabad jhillsPO BOX 522002'
  ) as str
) --Use your table instead of this
select columns[0] as col1, columns[1] as col2
from
(
  select split(case when (str rlike 'PO BOX') then regexp_replace(str, 'PO BOX','|||PO BOX')
                    when (str rlike '[a-zA-Z ]+\\d+') then regexp_replace(str,'([a-zA-Z ]+)(\\d+.*)', '$1|||$2')
                    --add more cases and ELSE part
               end,'\\|{3}') columns
  from mytable
)s
Result:
col1              col2
ABC DEF           123 AD
MICHAEL POSTON    875 HYDERABAD
ABC DEF           PO BOX 5232
Hyderabad jhills  PO BOX 522002
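Since the question also asks for a PySpark version, here is a minimal plain-Python sketch of the same two rules. The function name split_address is hypothetical, and returning the whole string in column 1 when no rule matches is my own choice for the ELSE part (the Hive demo returns NULL there). Wrapping it in a PySpark UDF is assumed to be straightforward but is not shown.

```python
import re

# Scenario 2: everything from "PO BOX" onwards belongs to column 2.
PO_BOX = re.compile(r"PO BOX")
# Scenario 1: a run of letters/spaces followed by digits and the rest.
LETTERS_THEN_DIGITS = re.compile(r"([a-zA-Z ]+)(\d+.*)")


def split_address(value):
    """Return (col1, col2) for one address string.

    col2 is None when neither rule matches (assumed ELSE behaviour).
    """
    po_box = PO_BOX.search(value)
    if po_box:
        # Same check order as the CASE expression: PO BOX wins first.
        return value[:po_box.start()], value[po_box.start():]
    digits = LETTERS_THEN_DIGITS.search(value)
    if digits:
        # Split at the start of the digit group.
        return value[:digits.start(2)], value[digits.start(2):]
    return value, None


for address in [
    "ABC DEF123 AD",
    "MICHAEL POSTON875 HYDERABAD",
    "ABC DEFPO BOX 5232",
    "Hyderabad jhillsPO BOX 522002",
]:
    print(split_address(address))
```

The PO BOX rule is tested first because a PO BOX address also contains digits and would otherwise be split at the box number.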

Bigquery convert String with mode repeated to Record type with mode repeated

I have the following schema in BigQuery, with columns of type STRING and mode REPEATED. I would like to convert these string columns into a single RECORD column with mode REPEATED.
The input schema looks like -
and the input data preview looks like this -
query | user | keyword_db_tb | dataset | tablename
select a., b. from dsip.accounts a join qwe.sales b | sys | 123 | dsip | accounts
 | | | qwe | sales
select * from forkp.facts where id in (select id from hjp.classes) | sys | 456 | forkp | facts
 | | | hjp | classes
The output schema with the last 3 columns converted to record type should look like-
and the data preview should look like -
query | user | referenced.keyword_db_tb | referenced.dataset | referenced.tablename
select a., b. from dsip.accounts a join qwe.sales b | sys | 123 | dsip | accounts
 | | | qwe | sales
select * from forkp.facts where id in (select id from hjp.classes) | sys | 456 | forkp | facts
 | | | hjp | classes
Use the approach below:
select query, user, array(
    select as struct keyword_db_tb, dataset, tablename
    from t.keyword_db_tb keyword_db_tb with offset
    full outer join (select * from t.dataset dataset with offset) using (offset)
    full outer join (select * from t.tablename tablename with offset) using (offset)
  ) as referenced -- named to match the record column in the requested output schema
from your_table t
If applied to the sample data in your question, the output is: [output screenshot not preserved]
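For anyone who wants to check the pairing logic outside BigQuery: joining the three arrays on their offset with FULL OUTER JOIN pairs elements by position and fills the shorter arrays with NULL. Here is a rough Python sketch of that idea. The function name zip_repeated is hypothetical, and the sample values assume keyword_db_tb holds a single element while the other two fields hold two each.

```python
from itertools import zip_longest


def zip_repeated(keyword_db_tb, dataset, tablename):
    """Pair three repeated fields by position.

    Shorter lists are padded with None, mirroring the NULLs produced
    by a full outer join on the array offset.
    """
    return [
        {"keyword_db_tb": k, "dataset": d, "tablename": t}
        for k, d, t in zip_longest(keyword_db_tb, dataset, tablename)
    ]


# Assumed shape of the first sample row.
referenced = zip_repeated(["123"], ["dsip", "qwe"], ["accounts", "sales"])
for record in referenced:
    print(record)
```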

how to get all the words that start with a certain character in bigquery

I have a text column in a BigQuery table. A sample of that column looks like this:
with temp as
(
select 1 as id,"as we go forward into unchartered waters it's important to remember we are all in this together. #united #community" as input
union all
select 2 , "US cities close bars, restaurants and cinemas #Coronavirus"
)
select *
from temp
I want to extract all the words in this column that start with a #. Later on I would like to get the frequency of these terms. How do I do this in BigQuery?
My output would look like this:
id, word
1, united
1, community
2, coronavirus
Below is for BigQuery Standard SQL.
Regarding "I want to extract all the words in this column that start with a #":
#standardSQL
WITH temp AS (
SELECT 1 AS id,"as we go forward into unchartered waters it's important to remember we are all in this together. #united #community" AS input UNION ALL
SELECT 2 , "US cities close bars, restaurants and cinemas #Coronavirus"
)
SELECT id, word
FROM temp, UNNEST(REGEXP_EXTRACT_ALL(input, r'(?:^|\s)#([^#\s]*)')) word
with output
Row id word
1 1 united
2 1 community
3 2 Coronavirus
Regarding "later on I would like to get the frequency of these terms":
#standardSQL
SELECT word, COUNT(1) frequency
FROM temp, UNNEST(REGEXP_EXTRACT_ALL(input, r'(?:^|\s)#([^#\s]*)')) word
GROUP BY word
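To try the pattern outside BigQuery, here is a small Python sketch that applies the same regular expression and counts the extracted words. The helper names extract_hashtags and hashtag_frequency are hypothetical; the two rows are the sample data from the question.

```python
import re
from collections import Counter

# Same pattern as in the query: a '#' at the start of the string or after
# whitespace, capturing everything up to the next '#' or whitespace.
HASHTAG = re.compile(r"(?:^|\s)#([^#\s]*)")

rows = [
    (1, "as we go forward into unchartered waters it's important to remember "
        "we are all in this together. #united #community"),
    (2, "US cities close bars, restaurants and cinemas #Coronavirus"),
]


def extract_hashtags(rows):
    """Return one (id, word) pair per hashtag, like the UNNEST query."""
    return [(row_id, word) for row_id, text in rows for word in HASHTAG.findall(text)]


def hashtag_frequency(rows):
    """Count how often each extracted word occurs, like the GROUP BY query."""
    return Counter(word for _, word in extract_hashtags(rows))


print(extract_hashtags(rows))
print(hashtag_frequency(rows))
```

Note that the counts are case-sensitive, as in the SQL above; lowercasing the word before counting would merge variants such as Coronavirus and coronavirus.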
You can do this without regexes, by splitting words and then selecting ones that start the way you want. For example:
SELECT
id,
ARRAY(SELECT TRIM(x, "#") FROM UNNEST(SPLIT(input, ' ')) as x WHERE STARTS_WITH(x,'#')) str
FROM
temp
If you prefer the hashtags to be separate rows, you can be a bit tidier:
SELECT
id,
TRIM(x, "#") str
FROM
temp,
UNNEST(SPLIT(input, ' ')) x
WHERE
STARTS_WITH(x,'#')

IF + AND / OR logic inside of a query

Below is an example document I have shared:
https://docs.google.com/spreadsheets/d/1WuQIqn8DA12R0mNFGMdjJahQ0eNoxKODpSwopk7KoYU/edit#gid=0
My data is a simple table:
I want to do the following:
Starting in cell K7 on the patient tab, I want to query the call log tab for two main conditions.
Query select logic: return columns D, E, F, A when certain conditions are met:
if the text in col C equals the text in cell C7 on the patient tab AND col D says "No beds Available" AND col I shows a time left to calling greater than 0
OR, if not, then:
if col B equals cell H3 on the patient tab AND col C equals cell C7 on the patient tab
Thank you for your help.
My example could help you.
Suppose you have a small data set like this in columns A:D:
Then you can use a QUERY statement with two or more OR conditions, as long as you wrap each group of conditions in parentheses. Sample formula:
=QUERY({A:D},"select Col1, Col2, Col3, Col4 where (Col1 < 7 and Col3 = 'c') or (Col2 = 'a' and Col4 > 0)")
To use the Col1, Col2, Col3... notation inside the query, the data range must be wrapped in {}.
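If it helps to see the grouping outside of Sheets, here is a small Python sketch of the same where clause. The rows are made-up sample values (the original sample table is not shown here), and the function names matches and query are hypothetical.

```python
def matches(row):
    """Mirror of the where clause: two AND groups combined with OR."""
    col1, col2, col3, col4 = row
    return (col1 < 7 and col3 == "c") or (col2 == "a" and col4 > 0)


def query(rows):
    """Keep only the rows that satisfy either group of conditions."""
    return [row for row in rows if matches(row)]


# Hypothetical data: (Col1, Col2, Col3, Col4)
rows = [
    (1, "b", "c", 0),   # passes the first group
    (9, "a", "x", 5),   # passes the second group
    (9, "b", "c", 5),   # passes neither group
]
print(query(rows))
```

The parentheses make the intended grouping explicit, so each pair of AND conditions is evaluated as a unit before the OR is applied.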

UFT API TEST: Create SQL query based on values from previous step activity at run time

Steps to be performed in the UFT API test:
1. Get the JSON response from the REST activity added to the test flow
2. Add an Open DB Connection activity
3. Add a Select Data activity with the query string
SELECT Count(*) From Table1 Where COL1 = 'XXXX' and COL2 = '   1234'
(here the COL2 value is 7 characters long, including the spaces)
The values in the WHERE clause of the query above are received dynamically at run time from the JSON response.
When I try to link the query value using "link to data source" with a custom expression, e.g.:
SELECT COUNT(*) FROM Table1 Where COL1 =
'{Step.ResponseBody.RESTACTIVITYxx.OBJECT[1].COL1}' and COL2 =
'{Step.ResponseBody.RESTACTIVITYxx.OBJECT[1].COL2}'
then the query changes (the spaces in COL2 are dropped) to:
SELECT Count(*) From Table1 Where COL1 = 'XXXX' and COL2 = '1234'
I even tried the Concatenate and Replace String activities, but the same thing happens.
Please help.
You can use the StringConcatenation action to build the query string.
Then use the resulting string as the query in the database "Select data" activity.

How do I extract a pattern from a table in Oracle 11g?

I want to extract text from a column using regular expressions in Oracle 11g. I have 2 queries that do the job but I'm looking for a (cleaner/nicer) way to do it. Maybe combining the queries into one or a new equivalent query. Here they are:
Query 1: identify rows that match a pattern:
select column1 from table1 where regexp_like(column1, pattern);
Query 2: extract all matched text from a matching row.
select regexp_substr(matching_row, pattern, 1, level)
from dual
connect by level <= regexp_count(matching_row, pattern);
I use PL/SQL to glue these 2 queries together, but it's messy and clumsy. How can I combine them into 1 query? Thank you.
UPDATE: sample data for pattern 'BC':
row 1: ABCD
row 2: BCFBC
row 3: HIJ
row 4: GBC
Expected result is a table of 4 rows of 'BC'.
You can also do it in one query, functions/procedures/packages not required:
WITH t1 AS (
  SELECT 'ABCD' c1 FROM dual
  UNION
  SELECT 'BCFBC' FROM dual
  UNION
  SELECT 'HIJ' FROM dual
  UNION
  SELECT 'GBC' FROM dual
)
SELECT c1, regexp_substr(c1, 'BC', 1, d.l, 'i') thePattern, d.l occurrence
FROM t1 CROSS JOIN (SELECT LEVEL l FROM dual CONNECT BY LEVEL < 200) d
WHERE regexp_like(c1,'BC','i')
-- count with the same 'i' flag so case-insensitive matches are not dropped
AND d.l <= regexp_count(c1,'BC',1,'i');
C1 THEPATTERN OCCURRENCE
----- -------------------- ----------
ABCD BC 1
BCFBC BC 1
BCFBC BC 2
GBC BC 1
I've arbitrarily limited the number of occurrences to search for at 200, YMMV.
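As a quick way to sanity-check the expected rows outside the database, here is a Python sketch of the same idea: for every value, emit one row per case-insensitive match together with its occurrence number. The function name all_matches is hypothetical and this is only an illustration of the logic, not Oracle code.

```python
import re


def all_matches(values, pattern):
    """Return (value, matched_text, occurrence) for every match in every value.

    Values without a match produce no rows, like the regexp_like filter.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    return [
        (value, match.group(0), occurrence)
        for value in values
        for occurrence, match in enumerate(regex.finditer(value), start=1)
    ]


for row in all_matches(["ABCD", "BCFBC", "HIJ", "GBC"], "BC"):
    print(row)
```

Unlike the SQL version, there is no need for an arbitrary upper bound on the number of occurrences, because the iterator stops when the matches run out.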
Actually, there is an elegant way to do this in one query, if you do not mind going the extra mile. Please note that this is just a sketch; I have not run it, and you'll probably have to correct a few typos in it.
create or replace package yo_package is
  type word_t is record (word varchar2(4000));
  type words_t is table of word_t;
  -- the function must be declared in the spec to be callable from SQL
  function table_function(in_cur in sys_refcursor, pattern in varchar2)
    return words_t
    pipelined parallel_enable (partition in_cur by any);
end;
/
create or replace package body yo_package is
  function table_function(in_cur in sys_refcursor, pattern in varchar2)
    return words_t
    pipelined parallel_enable (partition in_cur by any)
  is
    next_val varchar2(4000);
    word_rec word_t;
  begin
    loop
      fetch in_cur into next_val;
      exit when in_cur%notfound;
      -- inner loop: pipe out every match of the pattern within next_val
      for i in 1 .. regexp_count(next_val, pattern) loop
        word_rec.word := regexp_substr(next_val, pattern, 1, i);
        pipe row (word_rec);
      end loop;
    end loop;
    close in_cur;
    return;
  end table_function;
end;
/
select *
from table(
yo_package.table_function(
cursor(
--this is your first select
select column1 from table1 where regexp_like(column1, pattern)
)
)