Hive REGEXP_EXTRACT: extract the second occurrence of a pattern [duplicate]

I am querying data in Hive and extracting a code from a column. I recently discovered that due to data entry/business process issues, users have been overloading the field and entering two separate job codes when there should only be one.
Sample data from the column:
NOV2 WAA UW FOO DISPLAY_W2100008/ SOMETHING DISPLAY W2100106
Using REGEXP_EXTRACT(column,'([A-Z]\\d{7})',1) as id correctly extracts the first code, W2100008, but I am unable to extract the second code, W2100106.
I want to use REGEXP_EXTRACT twice and alias id_1 and id_2 so we can analyze the second codes referenced. Is there a way to reference the second time the pattern is matched?
REGEXP_EXTRACT(column,'_([A-Z]\\d{7})',0) returns the first match
REGEXP_EXTRACT(column,'([A-Z]\\d{7})',1) returns the first match
REGEXP_EXTRACT(column,'([A-Z]\\d{7})',2) returns an error
The extracted value will be used to join to another column, so the result needs to return a single value, not an array.

Replace every match of '.*?([A-Z]\\d{7})' with a space delimiter followed by the captured code ($1). Then remove the leading space with trim and split on ' ' to get an array:
hive> select split(trim(regexp_replace('NOV2 WAA UW FOO DISPLAY_W2100008/SOMETHING DISPLAY W2100106','.*?([A-Z]\\d{7})',' $1')),' ');
OK
["W2100008","W2100106"]
Get first element:
hive> select split(trim(regexp_replace('NOV2 WAA UW FOO DISPLAY_W2100008/ SOMETHING DISPLAY W2100106','.*?([A-Z]\\d{7})',' $1')),' ')[0];
OK
W2100008
Time taken: 0.065 seconds, Fetched: 1 row(s)
And second element is
split(trim(regexp_replace('NOV2 WAA UW FOO DISPLAY_W2100008/ SOMETHING DISPLAY W2100106','.*?([A-Z]\\d{7})',' $1')),' ')[1]
It is better to use a subquery so the array is parsed only once:
select display_array[0] as id_1 , display_array[1] as id_2
from
(
select split(trim(regexp_replace('NOV2 WAA UW FOO DISPLAY_W2100008/ SOMETHING DISPLAY W2100106','.*?([A-Z]\\d{7})',' $1')),' ') as display_array
)s;
Use explode() if you want each element per row.
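The replace-then-split trick above is not specific to Hive. As a rough illustration only, here is a minimal Python sketch of the same idea, applied to the sample string from the question. The names extract_codes and find_codes are hypothetical, and the behaviour shown is that of Python's re module, not of Hive itself.

```python
import re

# Pattern for one job code: an uppercase letter followed by seven digits.
CODE = r"[A-Z]\d{7}"

def extract_codes(text):
    """Mimic the Hive expression: replace 'anything + code' with ' code', trim, split."""
    replaced = re.sub(rf".*?({CODE})", r" \1", text)
    return replaced.strip().split(" ")

def find_codes(text):
    """Direct alternative available in Python: collect every match."""
    return re.findall(CODE, text)

sample = "NOV2 WAA UW FOO DISPLAY_W2100008/ SOMETHING DISPLAY W2100106"
codes = extract_codes(sample)
id_1, id_2 = codes[0], codes[1]
```

One thing the sketch makes visible: text that follows the last code is not consumed by the replace, so it survives into the split result. The sample string ends with a code, so this does not show up here.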

Related

How do I conditionally remove text from a string in a column in a Scala dataframe?

I'm currently exploring Azure Databricks for a POC (Scala and Databricks are both completely new to me). I'm using the Cars (CORGIS) sample dataset to show off the data manipulation features of Databricks.
My problem is that I have a dataframe column called 'model' that contains data like '2009 Audi A3' and '2005 Mercedes E550'. What I would like to be able to do is alter that column so instead of the aforementioned, it reads as 'Audi A3' or 'Mercedes E550'. I have a separate model year column so trying to reduce the size of the columns where possible.
From what I have seen, replaceAllIn doesn't seem to work with strings with Scala.
This is my code so far:
//Use the dataframe from the previous cell and trim the model year from the model column so for example it reads as 'Audi A3' instead of '2009 Audi A3'
import scala.util.matching.Regex
val modelPrefixPatternMatch = "[0-9 ]".r
val newModel = modelPrefixPatternMatch.replaceAllIn((specificColumnsDf.select("model")),"")
However, when I run this code, I get the following error message:
command-1778339999318469:5: error: overloaded method value replaceAllIn with alternatives:
(target: CharSequence,replacer: scala.util.matching.Regex.Match => String)String <and>
(target: CharSequence,replacement: String)String
cannot be applied to (org.apache.spark.sql.DataFrame, String)
val newModel = modelPrefixPatternMatch.replaceAllIn((specificColumnsDf.select("model")),"")
I have also tried completing the SparkSQL but didn't have any luck there either.
Thanks!
In Spark you would normally add additional columns using withColumn and then select only the columns you want. In this simple example, I use regexp_replace function to trim out the years, something like this:
%scala
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Column
df
.withColumn("cleanColumn", regexp_replace($"`Identification.Model Year`", "20[0-2][0-9] ","") )
.select($"`Identification.Model Year`", $"cleanColumn").distinct
.show(false)
We could probably make the regular expression tighter, e.g. anchor it to the start of the column or widen it to cover years such as 1980 or 1990. This is just an example.
If the year is always at the start, you could instead use substring starting at position 6 (Spark's substring is 1-based, and the four-digit year plus its trailing space occupy positions 1 to 5). The regex approach at least protects against records where the year is missing.
HTH
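To make the "tighter regex" remark concrete, here is a small sketch in plain Python (not Spark) of a pattern anchored to the start of the string and widened to years 1900 to 2099. The helper name strip_model_year is hypothetical, and the year range is an assumption about the data.

```python
import re

# Anchored at the start; accepts 19xx or 20xx followed by whitespace.
LEADING_YEAR = re.compile(r"^(?:19|20)\d{2}\s+")

def strip_model_year(model):
    """Remove a leading four-digit year; leave the value untouched if there is none."""
    return LEADING_YEAR.sub("", model)

cleaned = [strip_model_year(m) for m in ["2009 Audi A3", "2005 Mercedes E550", "Audi A3"]]
```

Because the pattern is anchored with ^, digits that appear later in the model name (such as the 550 in E550) are never touched.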

How to build a regex in Hive to get string until Nth occurrence of a delimiter

I have some sample data in Hive as
select "abc:def:ghi:jkl" as data
union all
select "jkl:mno:23ar:stu:abc:def:ghi:7345" as data
I want to extract the strings until 3rd colon so that I get the output as
abc:def:ghi
jkl:mno:23ar
I want to keep N as variable so that I can shrink the output text as needed. How do I do this in Hive?
SELECT regexp_replace(`data`, '^([^:]+:[^:]+:[^:]+).*$', "$1")
FROM
( SELECT "abc:def:ghi:jkl" AS `data`
UNION ALL SELECT "jkl:mno:23ar:stu:abc:def:ghi:7345" AS `data`) AS tmp
Using the split and posexplode functions, you can filter the pieces by position and then concatenate them again:
select t.dataId, concat_ws(":", collect_list(t.cell)) as firstN from (
SELECT x.dataId, pos as pos, cell
FROM (
select 1 as dataId, "jkl:mno:23ar:stu:abc:def:ghi:7345" as data
union all
select 2 as dataId, "abc:def:ghi:7345" as data
) x
LATERAL VIEW posexplode(split(x.data,':')) dataTable AS pos, cell
) t
where t.pos<3
group by t.dataId
With variable:
set hivevar:n=3; --variable, you can pass it to the script
with your_table as(
select stack(2,"abc:def:ghi:jkl", "jkl:mno:23ar:stu:abc:def:ghi:7345")as data
)
select regexp_replace(regexp_extract(data,'([^:]*:){1,${hivevar:n}}',0),':$','') from your_table;
Result:
OK
abc:def:ghi
jkl:mno:23ar
Time taken: 0.105 seconds, Fetched: 2 row(s)
After variable substitution, the quantifier {1,${hivevar:n}} becomes {1,3}, which means 1 to 3 repetitions; this also extracts values that have fewer than 3 elements. If you do not want shorter values, use the quantifier {${hivevar:n}} instead; in that case, rows with fewer than N elements yield an empty string.
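The difference between the two quantifiers can be tried out with a short Python sketch. This is only an approximation of the Hive query using Python's re module; the function first_n_fields and its strict flag are hypothetical names introduced for illustration.

```python
import re

def first_n_fields(data, n, strict=False):
    """Extract up to n colon-terminated fields, then drop the trailing colon.

    strict=False uses the {1,n} quantifier; strict=True uses {n}.
    Returns '' when the pattern does not match at all.
    """
    quantifier = f"{{{n}}}" if strict else f"{{1,{n}}}"
    match = re.search(rf"(?:[^:]*:){quantifier}", data)
    if match is None:
        return ""
    return re.sub(r":$", "", match.group(0))

rows = ["abc:def:ghi:jkl", "jkl:mno:23ar:stu:abc:def:ghi:7345"]
result = [first_n_fields(r, 3) for r in rows]
```

Note that in this sketch the pattern only counts fields that are followed by a colon, so a short input such as 'abc:def' gives 'abc' with the {1,3} form and an empty string with the {3} form.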
Select substring_index('abc:def:ghi:jkl',':',3) as data
Union all
Select substring_index('jkl:mno:23ar:stu:abc:def:ghi:7345',':',3) as data;

unexpected character after line continuation character. Also to keep rows after floating point rows in pandas dataframe

I have a dataset in which I want to keep row just after a floating value row and remove other rows.
For eg, a column of the dataframe looks like this:
17.3
Hi Hello
Pranjal
17.1
[aasd]How are you
I am fine[:"]
Live Free
So in this I want to preserve:
Hi Hello
[aasd]How are you
and remove the rest. I tried it with the following code, but an error showed up saying "unexpected character after line continuation character". Also I don't know if this code will solve my purpose
# Dropping extra rows
for ind in data.index:
    if re.search((([1-9][0-9]*\.?[0-9]*)|(\.[0-9]+))([Ee][+-]?[0-9]+)?, ind):
        ind+=1
    else:
        data.drop(ind)
Your regex has to be a string; you can't write it as a bare expression like that.
re.search(r'(([1-9][0-9]*\.?[0-9]*)|(\.[0-9]+))([Ee][+-]?[0-9]+)?', ind)
Edit: actually, I think the rest of your code is wrong too.
What you really want is something more like this:
import pandas as pd
l = ['17.3',
'Hi Hello',
'Pranjal',
'17.1',
'[aasd]How are you',
'I am fine[:"]',
'Live Free']
data = pd.DataFrame(l, columns=['col'])
data[data.col.str.match(r'\d+\.\d*').shift(1) == True]
Logic:
If you have a dataframe with a column that is entirely string type (this won't work for a column mixing decimals and strings), you can find the decimal/int entries with the regex '\d+\.?\d*'. Shifting this mask by one gives you the entries after the matches. Use that to select the rows you want in your dataframe.
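The shifted-mask logic can also be written without pandas, which may make the idea easier to follow. The sketch below uses only the standard library; rows_after_numeric is a hypothetical helper, and it uses fullmatch (stricter than the str.match call above) as an assumption about what counts as a numeric row.

```python
import re

# A row counts as numeric if the whole string is digits with an optional decimal part.
NUMERIC = re.compile(r"\d+\.?\d*")

def rows_after_numeric(rows):
    """Keep each row whose immediate predecessor is a numeric row."""
    mask = [NUMERIC.fullmatch(row) is not None for row in rows]
    # Pair mask[i] with rows[i + 1]: the same effect as shifting the mask down by one.
    return [row for prev_is_numeric, row in zip(mask, rows[1:]) if prev_is_numeric]

col = ['17.3', 'Hi Hello', 'Pranjal', '17.1',
       '[aasd]How are you', 'I am fine[:"]', 'Live Free']
kept = rows_after_numeric(col)
```

As with the pandas version, if two numeric rows are adjacent, the second one is kept because it follows a numeric row.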

How to create new column that parses correct values from a row to a list

I am struggling to create a formula in Power BI that would split a single row's value into a list of the values I want.
So I have a column that is called ID and it has values such as:
"ID001122, ID223344" or "IRRELEVANT TEXT ID112233, MORE IRRELEVANT;ID223344 TEXT"
What is important is to save the ID and 6 numbers after it. The first example would turn into a list like this: {"ID001122","ID223344"}. The second example would look exactly the same but it would just parse all the irrelevant text from between.
I was looking for some kind of loop formula where you could use the text find function to locate each ID starting point and the middle function to extract 8 characters from there, but I made no progress in finding one. I tried building lists by splitting on the comma separator, but I noticed that not all rows use commas to separate IDs.
The end results would be that the original value is on one column next to the list of parsed values which then could be expanded to new rows.
ID Parsed ID
"Random ID123456, Text;ID23456" List {"ID123456","ID23456"}
Any of you have former experience?
Hey I found the answer by myself using a good article similar to my problem.
Here is my solution, without any further text parsing, which I can do later on.
each let
PosList = Text.PositionOf([ID],"ID",Occurrence.All),
List = List.Transform(PosList, (x) => Text.Middle([ID],x,8))
in List
For example, this would turn "(ID343137,ID352973) ID358388" into {ID343137,ID352973,ID358388}.
Ended up being easier than I thought. Suppose the solution relied again on the lists!
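For readers who do not use Power Query, the position-based approach above (find every "ID", then take 8 characters from each position) can be sketched in Python. The function names are hypothetical, and the regex variant is an alternative added for comparison rather than part of the original solution.

```python
import re

def ids_by_position(text, marker="ID", width=8):
    """Find every occurrence of marker and take `width` characters from each position."""
    positions = []
    start = text.find(marker)
    while start != -1:
        positions.append(start)
        start = text.find(marker, start + 1)
    return [text[p:p + width] for p in positions]

def ids_by_regex(text):
    """Stricter alternative: only accept 'ID' followed by exactly six digits."""
    return re.findall(r"ID\d{6}", text)

by_position = ids_by_position("(ID343137,ID352973) ID358388")
by_regex = ids_by_regex("IRRELEVANT TEXT ID112233, MORE IRRELEVANT;ID223344 TEXT")
```

The position-based version picks up any "ID" substring, including one inside an ordinary word, which is where the further text parsing mentioned above would come in. The regex version skips those cases.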

How to get more than one matched keyword using regexp_matches

How can I get more than one matched keyword from a given string?
Please find the below query.
SELECT regexp_matches(UPPER('bakerybaking'),'BAKERY|BAKING');
output: "{BAKERY}"
In the scenario above, the given string matches two keywords.
When I execute the query, I get only one keyword.
How can I get the other matched keywords?
g is the global search flag in regex. It is used to get all the matching strings:
select regexp_matches(UPPER('bakerybaking'),'BAKERY|BAKING','g')
regexp_matches
text[]
--------------
{BAKERY}
{BAKING}
To get the result as a single row:
SELECT ARRAY(select array_to_string(regexp_matches(UPPER('bakerybaking'),'BAKERY|BAKING','g'),''));
array
text[]
---------------
{BAKERY,BAKING}
Use unnest to convert the returned array into rows:
select unnest(regexp_matches(UPPER('bakerybaking'),'BAKERY|BAKING','g'))
unnest
text
------
BAKERY
BAKING
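The effect of the 'g' flag, first match only versus every match, has a close analogue in most regex libraries. As a hedged illustration in Python (not PostgreSQL), re.search stops at the first match while re.findall returns all of them:

```python
import re

text = "bakerybaking".upper()
pattern = r"BAKERY|BAKING"

# Without a "global" search, only the first match is reported.
first_only = re.search(pattern, text).group(0)

# Collecting all non-overlapping matches, comparable to the 'g' flag.
all_matches = re.findall(pattern, text)
```

Here first_only is the single keyword found first, while all_matches is a list holding both keywords in the order they occur in the string.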
According to: http://www.postgresql.org/docs/9.5/static/functions-string.html
SELECT regexp_matches(UPPER('bakerybaking'),'(BAKERY)(BAKING)');
Output:
regexp_matches
-----------------
{BAKERY,BAKING}
(1 row)
Oh the humanity. Please thank me.
--https://stackoverflow.com/questions/52178844/get-second-match-from-regexp-matches-results
--https://stackoverflow.com/questions/24274394/postgresql-8-2-how-to-get-a-string-representation-of-any-array
CREATE OR REPLACE FUNCTION aaa(anyarray,Integer, text)
RETURNS SETOF text
LANGUAGE plpgsql
AS $function$
DECLARE s $1%type;
BEGIN
FOREACH s SLICE 1 IN ARRAY $1[$2:$2] LOOP
RETURN NEXT array_to_string(s,$3);
END LOOP;
RETURN;
END;
$function$;
--SELECT aaa((ARRAY(SELECT unnest(regexp_matches('=If(If(E_Reports_# >=1, DMT(E_Date_R1_#, DateShift),0)', '(\w+_#)|([0-9]+)','g'))::TEXT)),1,',')
--select (array[1,2,3,4,5,6])[2:5];
SELECT aaa(array_remove(Array(SELECT unnest(regexp_matches('=If(If(E_Reports_# >=1, DMT(E_Date_R1_#, DateShift),0)', '(\w+_#)|([0-9]+)','g'))::TEXT), Null),3,',')