How to remove duplicates in hive string?

I have a comma-separated string column that contains duplicate values, and I want to remove the duplicates.
For example:
column_name
-----------------
gun,gun,man,gun,man
shuttle,enemy,enemy,run
hit,chase
I want a result like:
column_name
----------------
gun,man
shuttle,enemy,run
hit,chase
I am using Hive.

Option 1: keep last occurrence
This will keep the last occurrence of every word.
E.g. 'hello,world,hello,world,hello' will result in 'world,hello'
select regexp_replace
(
column_name
,'(?<=^|,)(?<word>.*?),(?=.*(?<=,)\\k<word>(?=,|$))'
,''
)
from mytable
;
+-------------------+
| gun,man |
| shuttle,enemy,run |
| hit,chase |
+-------------------+
Option 2: keep first occurrence
This will keep the first occurrence of every word.
E.g. 'hello,world,hello,world,hello' will result in 'hello,world'
select reverse
(
regexp_replace
(
reverse(column_name)
,'(?<=^|,)(?<word>.*?),(?=.*(?<=,)\\k<word>(?=,|$))'
,''
)
)
from mytable
;
Option 3: sorted
E.g. 'Cherry,Apple,Cherry,Cherry,Cherry,Banana,Apple' will result in 'Apple,Banana,Cherry'
select regexp_replace
(
concat_ws(',',sort_array(split(column_name,',')))
,'(?<=^|,)(?<word>.*?)(,\\k<word>(?=,|$))+'
,'${word}'
)
from mytable
;
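To sanity-check what each option is expected to return before running it in Hive, the three behaviours can be sketched in Python. This only illustrates the intended results and is not a translation of the regular expressions; the function names are made up for this example.

```python
def dedupe_keep_first(csv):
    """Keep the first occurrence of each word, preserving order."""
    # dict keys keep insertion order, so each word stays at its first position
    return ",".join(dict.fromkeys(csv.split(",")))


def dedupe_keep_last(csv):
    """Keep the last occurrence of each word, preserving order."""
    # Deduplicate the reversed list, then reverse back (the same idea as Option 2)
    words = csv.split(",")
    return ",".join(reversed(list(dict.fromkeys(reversed(words)))))


def dedupe_sorted(csv):
    """Return the distinct words in sorted order."""
    return ",".join(sorted(set(csv.split(","))))


print(dedupe_keep_last("hello,world,hello,world,hello"))   # world,hello
print(dedupe_keep_first("hello,world,hello,world,hello"))  # hello,world
print(dedupe_sorted("Cherry,Apple,Cherry,Cherry,Cherry,Banana,Apple"))  # Apple,Banana,Cherry
```

The printed values match the examples quoted for Options 1 to 3, which makes the sketch usable as a reference when testing the Hive queries on other inputs.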

If the order of the values is not a concern:
with mytable as (
select 'gun,gun,man,gun,man' as column_name union
select 'shuttle,enemy,enemy,run' as column_name union
select 'hit,chase' as column_name
) -- test data
SELECT column_name, concat_ws(',',collect_set(item)) from (
select distinct column_name, s.item from mytable
lateral view explode(split(column_name,',')) s as item
) t
group by column_name
;
+--------------------------+--------------------+--+
| column_name | _c1 |
+--------------------------+--------------------+--+
| gun,gun,man,gun,man | gun,man |
| hit,chase | chase,hit |
| shuttle,enemy,enemy,run | enemy,run,shuttle |
+--------------------------+--------------------+--+
If you want to keep the values in their original order:
with mytable as (
select 'gun,gun,man,gun,man' as column_name union
select 'shuttle,enemy,enemy,run' as column_name union
select 'hit,chase' as column_name
) -- test data
select column_name,concat_ws(',',collect_set(item)) as column_name_distincted
from (
select column_name,item, min(pos) as pos
from (
select column_name,pos,item
from mytable
lateral view posexplode(split(column_name,',')) s as pos,item
) t
group by column_name,item
order by column_name,pos
) t
group by column_name
;
+--------------------------+-------------------------+--+
| column_name | column_name_distincted |
+--------------------------+-------------------------+--+
| gun,gun,man,gun,man | gun,man |
| hit,chase | hit,chase |
| shuttle,enemy,enemy,run | shuttle,enemy,run |
+--------------------------+-------------------------+--+
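The order-preserving query above works in three steps: pair each item with its position (posexplode), keep the smallest position per item (min(pos) with group by), then concatenate in position order. A minimal Python sketch of those steps, with a hypothetical function name, may make the logic easier to follow:

```python
def dedupe_by_first_position(csv):
    # Step 1 (posexplode): pair every item with its position in the list
    exploded = list(enumerate(csv.split(",")))

    # Step 2 (group by item, min(pos)): remember the smallest position per item
    first_pos = {}
    for pos, item in exploded:
        first_pos[item] = min(pos, first_pos.get(item, pos))

    # Step 3 (order by pos, then concatenate): items ordered by first position
    ordered = sorted(first_pos, key=first_pos.get)
    return ",".join(ordered)


for value in ["gun,gun,man,gun,man", "shuttle,enemy,enemy,run", "hit,chase"]:
    print(value, "->", dedupe_by_first_position(value))
```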

Related

Usage of TSQL string_split in DAX

I have 2 tables in SSAS / Power BI:
Table1:
| ValueName| ValueKey |
|:---- |:------: |
| abc | 1,2,3 |
Table2:
| ID | ValueKey | Value |
|:---- |:------: |:------: |
| ID1 | 1 | 87,8 |
| ID2 | 85 | 14 |
| ID3 | 90 | 95,8 |
| ID4 | 3 | 13,4 |
I need to retrieve ID and Value (into a temporary table, over which I will later run calculations), keeping only the rows whose ValueKey is 1, 2, or 3.
I need to do this with DAX. In SQL there is the STRING_SPLIT function for this situation. Is there some way to achieve this with DAX? The ValueKey column in Table1 is comma-separated text, and the ValueKey column in Table2 is INT.
Thanks in advance

As @Jeroen Mostert suggests, you can do this by abusing the PATHCONTAINS function like this:
FilteredTable2 =
VAR CurrKey = SELECTEDVALUE ( Table1[ValueKey] )
VAR PathFromKey = SUBSTITUTE ( CurrKey, ",", "|" ) /* Paths use | as separator. */
RETURN
FILTER ( Table2, PATHCONTAINS ( PathFromKey, Table2[ValueKey] ) )
However, this is not best practice for relating tables. In general, you don't want multiple keys in a single field.
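To show which rows the filter is meant to keep, here is a small Python sketch of the same selection using the sample data from the question. It only illustrates the logic (split the key list, then test membership) and is not DAX; the function name is hypothetical.

```python
table1_value_key = "1,2,3"  # Table1[ValueKey] for ValueName "abc"

table2 = [
    {"ID": "ID1", "ValueKey": 1, "Value": "87,8"},
    {"ID": "ID2", "ValueKey": 85, "Value": "14"},
    {"ID": "ID3", "ValueKey": 90, "Value": "95,8"},
    {"ID": "ID4", "ValueKey": 3, "Value": "13,4"},
]


def filter_by_key_list(rows, key_list):
    # Split the comma-separated text into a set of integer keys
    wanted = {int(key) for key in key_list.split(",")}
    # Keep only the rows whose ValueKey is one of those keys
    return [row for row in rows if row["ValueKey"] in wanted]


for row in filter_by_key_list(table2, table1_value_key):
    print(row["ID"], row["Value"])
```

With the sample data, only ID1 and ID4 are kept.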

How to Retrieve Part of Text Using Regex Substring

Hi all, I have query text from the query history such as "Create OR Replace Procedure PROCEDURENAME()". What I want is the procedure name, in this case "PROCEDURENAME", extracted using the regex substring function. A developer can also create a procedure with the syntax "CREATE PROCEDURE PROCEDUREENAME()", so the regular expression should find the procedure name in that case too.

If you want to get the procedure name and parameters, you can use the following:
select regexp_substr( qtext, 'CREATE.*PROCEDURE\\s+([^)]*\\))', 1, 1, 'ei', 1 ) as res from
values('Create OR Replace Procedure PROCEDURENAME1() as xyz'),
('Create procedure PROCEDURENAME2( a varchar)'),
('Create procedure PROCEDURENAME3( a varchar, b number)')
tmp(qtext) ;
+--------------------------------------+
| RES |
+--------------------------------------+
| PROCEDURENAME1() |
| PROCEDURENAME2( a varchar) |
| PROCEDURENAME3( a varchar, b number) |
+--------------------------------------+
If you want only the procedure name, you can use this one:
select regexp_substr( qtext, 'CREATE.*PROCEDURE\\s+([^()]*)\\(.*\\)', 1, 1, 'ei', 1 ) as res from
values('Create OR Replace Procedure PROCEDURENAME1() as xyz'),
('Create procedure PROCEDURENAME2( a varchar)'),
('Create procedure PROCEDURENAME3( a varchar, b number)')
tmp(qtext) ;
+----------------+
| RES |
+----------------+
| PROCEDURENAME1 |
| PROCEDURENAME2 |
| PROCEDURENAME3 |
+----------------+
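The name-only pattern can be tried outside the database as well. The sketch below applies the same expression with Python's re module, on the assumption that the regex dialects agree for this simple pattern; the helper name is made up for the example.

```python
import re

# Same pattern as the name-only query: group 1 captures everything between
# "PROCEDURE" plus whitespace and the opening parenthesis.
PROC_NAME = re.compile(r"CREATE.*PROCEDURE\s+([^()]*)\(.*\)", re.IGNORECASE)


def procedure_name(query_text):
    """Return the procedure name, or None if the text is not a CREATE PROCEDURE."""
    match = PROC_NAME.search(query_text)
    return match.group(1) if match else None


samples = [
    "Create OR Replace Procedure PROCEDURENAME1() as xyz",
    "Create procedure PROCEDURENAME2( a varchar)",
    "Create procedure PROCEDURENAME3( a varchar, b number)",
]
for text in samples:
    print(procedure_name(text))
```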

Row number partition by to POWER BI DAX query

Can someone help me convert this SQL expression to DAX?
row_number() over (partition by date, customer, type order by day)
The row number is my desired output.

Assuming that your data looks like this table:
Sample
+------------+----------+---------+--------+
| Date | Customer | Product | Gender |
+------------+----------+---------+--------+
| 01/01/2018 | 1234 | P2 | F |
| 01/01/2018 | 1234 | P2 | M |
| 03/01/2018 | 1235 | P1 | F |
| 03/01/2018 | 1235 | P2 | F |
+------------+----------+---------+--------+
I have created a calculated column called Rank, using the RANKX and FILTER functions.
The first part of the calculation is to create variables outside the scope of the FILTER function. The second part uses RANKX that takes an expression value - in this case Gender - to order the values.
Rank =
VAR _currentdate = 'Sample'[Date]
VAR _customer = 'Sample'[Customer]
var _product = 'Sample'[Product]
return
RANKX(FILTER('Sample',
[Date]=_currentdate &&
[Customer] = _customer &&
[Product] = _product),[Gender],,ASC)
I compared the output with the SQL equivalent:
select
*,
row_number() over(partition by Date,Customer,Product order by Gender)
from (
select '2018-01-01' as Date,1234 as CUSTOMER,'P2' AS PRODUCT, 'M' Gender union
select '2018-01-01' as Date,1234,'P2','F' UNION
select '2018-01-03' as Date,1235,'P1','F' UNION
select '2018-01-03' as Date,1235,'P2','F'
)t1
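For reference, this is what row_number() over (partition by ... order by ...) computes on the sample table, sketched in Python with a hypothetical helper. One caveat, as far as I know: RANKX gives tied rows the same rank, whereas row_number() always numbers them distinctly, so the two only agree when the ordering column has no ties within a partition.

```python
from itertools import groupby

rows = [
    {"Date": "2018-01-01", "Customer": 1234, "Product": "P2", "Gender": "F"},
    {"Date": "2018-01-01", "Customer": 1234, "Product": "P2", "Gender": "M"},
    {"Date": "2018-01-03", "Customer": 1235, "Product": "P1", "Gender": "F"},
    {"Date": "2018-01-03", "Customer": 1235, "Product": "P2", "Gender": "F"},
]


def add_row_number(rows, partition_by, order_by):
    def partition_key(row):
        return tuple(row[col] for col in partition_by)

    def order_key(row):
        return tuple(row[col] for col in order_by)

    # Sort by partition first, then by the ordering columns inside each partition
    ordered = sorted(rows, key=lambda row: (partition_key(row), order_key(row)))

    result = []
    for _, group in groupby(ordered, key=partition_key):
        # Numbering restarts at 1 for every partition
        for number, row in enumerate(group, start=1):
            result.append({**row, "RowNumber": number})
    return result


for row in add_row_number(rows, ["Date", "Customer", "Product"], ["Gender"]):
    print(row)
```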

regex for specific delimiter string in Hive serde

I use a SerDe to read data in a specific format with the delimiter |
One line of my data may look like: key1=value2|key2=value2|key3="va , lues", and I create the Hive table as below:
CREATE EXTERNAL TABLE(
field1 STRING,
field2 STRING,
field3 STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([^\\|]*)\\|([^\\|]*)\\|([^\\|]*)",
"output.format.string" = "%1$s %2$s %3$s"
)
STORED AS TEXTFILE;
I need to extract all values, ignoring any quotes if they exist.
The result should look like:
value2 value2 va , lues
How can I change my current regexp to extract the values?

I can currently offer 2 options, neither of which is perfect.
BTW, "output.format.string" is obsolete and has no effect.
1
create external table mytable
(
q1 string
,field1 string
,q2 string
,field2 string
,q3 string
,field3 string
)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties ('input.regex' = '.*?=(?<q1>"?)(.*?)(?:\\k<q1>)\\|.*?=(?<q2>"?)(.*?)(?:\\k<q2>)\\|.*?=(?<q3>"?)(.*?)(?:\\k<q3>)')
stored as textfile
;
select * from mytable
;
+----+--------+----+--------+----+-----------+
| q1 | field1 | q2 | field2 | q3 | field3 |
+----+--------+----+--------+----+-----------+
| | value2 | | value2 | " | va , lues |
+----+--------+----+--------+----+-----------+
2
create external table mytable
(
field1 string
,field2 string
,field3 string
)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties ('input.regex' = '.*?=(".*?"|.*?)\\|.*?=(".*?"|.*?)\\|.*?=(".*?"|.*?)')
stored as textfile
;
select * from mytable
;
+--------+--------+-------------+
| field1 | field2 | field3 |
+--------+--------+-------------+
| value2 | value2 | "va , lues" |
+--------+--------+-------------+
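The first pattern can be checked against the sample line without creating a table. The sketch below ports it to Python, where named groups are written (?P<q1>...) and back-references (?P=q1) instead of Java's (?<q1>...) and \k<q1>. It uses fullmatch on the assumption that the SerDe has to match the whole line; the function name is hypothetical.

```python
import re

# Each field: skip the key up to "=", capture an optional opening quote,
# capture the value lazily, then require the same quote (or nothing) to close it.
LINE = re.compile(
    r'.*?=(?P<q1>"?)(.*?)(?P=q1)\|'
    r'.*?=(?P<q2>"?)(.*?)(?P=q2)\|'
    r'.*?=(?P<q3>"?)(.*?)(?P=q3)'
)


def extract_values(line):
    """Return the three values without surrounding quotes, or None if no match."""
    match = LINE.fullmatch(line)
    if match is None:
        return None
    # Groups 1, 3 and 5 are the quote captures; 2, 4 and 6 are the values
    return match.group(2), match.group(4), match.group(6)


print(extract_values('key1=value2|key2=value2|key3="va , lues"'))
```

Only the value groups are returned here, which corresponds to ignoring the q1, q2 and q3 columns of the table in the first option.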

Combining data from 2 rows into 1

I have an extremely complex query, and I want to combine the data from 2 rows into 1 row.
It gives me the following output:
PNAME RN LVN HA MSW SC
AA AG-1W SS-1M LO-2W PA-1W SK-1M
AA JL-1W TD -1M NULL NULL NULL
Is there any way I could get the results in 1 row, or combine the 2 rows into 1, as follows?
PNAME RN LVN HA MSW SC
AA AG-1W SS-1M LO-2W PA-1W SK-1M
JL-1W TD -1M NULL NULL NULL

It is not exactly clear what you are trying to achieve, but you can use row_number() to prevent the pname from showing in additional rows:
select case when rownum = 1 then pname else '' end pname,
[RN], [LVN], [HA], [MSW], [SC]
from
(
select pname, disc, value,
ROW_NUMBER() over(partition by disc order by disc) rownum
from temp
) src
pivot
(
max(value)
for disc in ([RN], [LVN], [HA], [MSW], [SC])
) piv
results:
| PNAME | RN | LVN | HA | MSW | SC |
----------------------------------------------------
| AA | AG-1W | SS-1M | LO-2W | PA-1W | SK-1M |
| | JL-1W | TD-1M | (null) | (null) | (null) |
This uses the value of row_number() to decide whether the pname should be displayed. It only shows the value when rownum = 1; otherwise it is blank.
If you want the data in a single row, you can use something similar to the following:
;with cte as
(
select pname, disc, value,
ROW_NUMBER() over(partition by disc order by disc) rownum
from temp
),
piv as
(
select *
from cte
pivot
(
max(value)
for disc in ([RN], [LVN], [HA], [MSW], [SC])
) piv
)
select pname,
-- STUFF(..., 1, 2, '') drops the leading ', ' separator (comma and space)
STUFF((SELECT distinct ', ' + [RN]
from piv p2
where p1.pname = p2.pname
FOR XML PATH(''), TYPE
).value('.', 'NVARCHAR(MAX)')
,1,2,'') RN,
STUFF((SELECT distinct ', ' + [LVN]
from piv p2
where p1.pname = p2.pname
FOR XML PATH(''), TYPE
).value('.', 'NVARCHAR(MAX)')
,1,2,'') LVN,
STUFF((SELECT distinct ', ' + [HA]
from piv p2
where p1.pname = p2.pname
FOR XML PATH(''), TYPE
).value('.', 'NVARCHAR(MAX)')
,1,2,'') HA,
STUFF((SELECT distinct ', ' + [MSW]
from piv p2
where p1.pname = p2.pname
FOR XML PATH(''), TYPE
).value('.', 'NVARCHAR(MAX)')
,1,2,'') MSW,
STUFF((SELECT distinct ', ' + [SC]
from piv p2
where p1.pname = p2.pname
FOR XML PATH(''), TYPE
).value('.', 'NVARCHAR(MAX)')
,1,2,'') SC
from piv p1
group by pname
The result is:
| PNAME | RN | LVN | HA | MSW | SC |
--------------------------------------------------------------------
| AA | AG-1W, JL-1W | SS-1M, TD-1M | LO-2W | PA-1W | SK-1M |
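The single-row result amounts to grouping the values by pname and disc and joining the distinct values with ', '. A minimal Python sketch of that grouping follows. The question does not show the contents of the temp table, so the input rows below are an assumption reconstructed from the result, and the function name is hypothetical.

```python
from collections import defaultdict

# Assumed (pname, disc, value) rows, reconstructed from the result table
rows = [
    ("AA", "RN", "AG-1W"), ("AA", "RN", "JL-1W"),
    ("AA", "LVN", "SS-1M"), ("AA", "LVN", "TD-1M"),
    ("AA", "HA", "LO-2W"), ("AA", "MSW", "PA-1W"), ("AA", "SC", "SK-1M"),
]


def combine_rows(rows):
    # Collect the distinct values for every (pname, disc) pair
    grouped = defaultdict(lambda: defaultdict(set))
    for pname, disc, value in rows:
        grouped[pname][disc].add(value)

    # Join each set into one comma-separated string, sorted for a stable result
    return {
        pname: {disc: ", ".join(sorted(values)) for disc, values in discs.items()}
        for pname, discs in grouped.items()
    }


print(combine_rows(rows)["AA"])
```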
