Hive parse string from log - regex

I'm having a problem parsing strings from a log file. The lines look like this:
"skey":"110","scp_id":"OC05","capedge":"3G"
"skey":"140","scp_id":"OC02","capedge":"3G"
"skey":"0","scp_id":"OC01","capedge":"3G"
This is the expected output for our table:
| skey | scp_id | capedge |
----------------------------
| 110  | OC05   | 3G      |
| 140  | OC02   | 3G      |
| 0    | OC01   | 3G      |
I've tried using the parse_url method from https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF, but unfortunately our string is not in URL format. Is there a better approach for this, or do I have to use regexp_extract?
Thank you,
Galih

You may use a combination of the SPLIT and REGEXP_EXTRACT functions:
select REGEXP_EXTRACT(skey,    ':"(\\w+)"', 1) as skey,
       REGEXP_EXTRACT(scp_id,  ':"(\\w+)"', 1) as scp_id,
       REGEXP_EXTRACT(capedge, ':"(\\w+)"', 1) as capedge
from (
    select SPLIT(log_record, ',')[0] as skey,
           SPLIT(log_record, ',')[1] as scp_id,
           SPLIT(log_record, ',')[2] as capedge
    from yourtable
) a;
You can try this out in the Hue demo (user id / password: demo / demo).
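If you want to sanity-check the pattern outside Hive, here is a minimal Python sketch of the same split-then-extract idea (purely illustrative; the sample lines and the regex are the ones from above):
import re

lines = [
    '"skey":"110","scp_id":"OC05","capedge":"3G"',
    '"skey":"140","scp_id":"OC02","capedge":"3G"',
    '"skey":"0","scp_id":"OC01","capedge":"3G"',
]

for line in lines:
    # split on commas, then pull the value between ':"' and '"' from each part
    parts = line.split(',')
    row = [re.search(r':"(\w+)"', p).group(1) for p in parts]
    print(row)  # e.g. ['110', 'OC05', '3G']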

Related

Power BI - M query to join records matching range

I have an import query (table a) and an imported Excel file (table b) with records I am trying to match up against.
I am looking for a method to replicate this type of SQL in M:
SELECT a.loc_id, a.other_data, b.stk
FROM a INNER JOIN b on a.loc_id BETWEEN b.from_loc AND b.to_loc
Table A
| loc_id | other data |
-------------------------
| 34A032B1 | ... |
| 34A3Z011 | ... |
| 3DD23A41 | ... |
Table B
| stk | from_loc | to_loc |
--------------------------------
| STKA01 | 34A01 | 34A30ZZZ |
| STKA02 | 34A31 | 34A50ZZZ |
| ... | ... | ... |
Goal
| loc_id | other data | stk |
----------------------------------
| 34A032B1 | ... | STKA01 |
| 34A3Z011 | ... | STKA02 |
| 3DD23A41 | ... | STKD01 |
All of the other queries I can find along these lines use numbers, dates, or times in the BETWEEN clause, and seem to work by exploding the (from, to) range into all possible values and then filtering out the extra rows. However, I need to use string comparisons, and exploding those into all possible values would be infeasible.
Among all the various solutions I could find, the closest I've come is adding a custom column on table a:
Table.SelectRows(
table_b,
(a) => Value.Compare([loc_id], table_b[from_loc]) = 1
and Value.Compare([loc_id], table_b[to_loc]) = -1
)
This does return all the columns from table_b, however, when expanding the column, the values are all null.
Your description ("after 34A01 could be any string...") is not very specific when it comes to figuring out how your series progresses.
But maybe you can just test for how a value "sorts" using the native sorting in PQ.
Add a custom column with Table.SelectRows:
= try Table.SelectRows(TableB, (t)=> t[from_loc]<=[loc_id] and t[to_loc] >= [loc_id])[stk]{0} otherwise null
To reproduce with your examples:
let
    TableB = Table.FromColumns(
        {{"STKA01","STKA02"},
         {"34A01","34A31"},
         {"34A30ZZZ","34A50ZZZ"}},
        type table[stk = text, from_loc = text, to_loc = text]),
    TableA = Table.FromColumns(
        {{"34A032B1","34A3Z011","3DD23A41"},
         {"...","...","..."}},
        type table[loc_id = text, #"other data" = text]),
    //determine where it sorts and return the stk
    #"Added Custom" = Table.AddColumn(#"TableA", "stk", each
        try Table.SelectRows(TableB, (t)=> t[from_loc] <= [loc_id] and t[to_loc] >= [loc_id])[stk]{0} otherwise null)
in
    #"Added Custom"
Note: if the above algorithm is too slow, there may be faster methods of obtaining these results
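For what it's worth, here is the same lookup logic in plain Python (not M), just to illustrate how a lexicographic string comparison resolves the ranges. The sample values are the ones from the question, so the third row finds no match because the STKD01 range is not part of the sample table b:
table_b = [
    ("STKA01", "34A01", "34A30ZZZ"),
    ("STKA02", "34A31", "34A50ZZZ"),
]

def find_stk(loc_id):
    # lexicographic string comparison, same idea as the <= / >= test in the M query above
    for stk, from_loc, to_loc in table_b:
        if from_loc <= loc_id <= to_loc:
            return stk
    return None  # plays the role of the "otherwise null" fallback

print(find_stk("34A032B1"))  # STKA01
print(find_stk("34A3Z011"))  # STKA02
print(find_stk("3DD23A41"))  # None - the STKD01 range is not in the sample data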

Django annotate StrIndex for empty fields

I am trying to use Django's StrIndex to find all rows whose value is a substring of a given string.
Eg:
my table contains:
+----------+------------------+
| user | domain |
+----------+------------------+
| spam1 | spam.com |
| badguy+ | |
| | protonmail.com |
| spammer | |
| | spamdomain.co.uk |
+----------+------------------+
but the query
SpamWord.objects.annotate(idx=StrIndex(models.Value('xxxx'), 'user')).filter(models.Q(idx__gt=0) | models.Q(domain='spamdomain.co.uk')).first()
matches <SpamWord: *#protonmail.com>
The generated query is SELECT `spamwords`.`id`, `spamwords`.`user`, `spamwords`.`domain`, INSTR('xxxx', `spamwords`.`user`) AS `idx` FROM `spamwords` WHERE (INSTR('xxxx', `spamwords`.`user`) > 0 OR `spamwords`.`domain` = 'spamdomain.co.uk')
It should be <SpamWord: *#spamdomain.co.uk>
This is happening because
INSTR('xxxx', '') => 1
(and also INSTR('xxxxasd', 'xxxx') => 1, which is correct)
How can I write this query in order to get entry #5 (spamdomain.co.uk)?
The order of the parameters of StrIndex [Django-doc] is swapped. The first parameter is the haystack, the string in which you search, and the second one is the needle, the substring you are looking for.
You thus can annotate with:
from django.db.models import Q, Value
from django.db.models.functions import StrIndex

SpamWord.objects.annotate(
    idx=StrIndex('user', Value('xxxx'))
).filter(
    Q(idx__gt=0) | Q(domain='spamdomain.co.uk')
).first()
Alternatively, just exclude the rows where user is empty:
(~models.Q(user='') & models.Q(idx__gt=0)) | models.Q(domain='spamdomain.co.uk')
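For completeness, a sketch of that filter applied to the original queryset (keeping the argument order from the question, i.e. searching the stored user value inside 'xxxx'):
from django.db.models import Q, Value
from django.db.models.functions import StrIndex

SpamWord.objects.annotate(
    idx=StrIndex(Value('xxxx'), 'user')   # original order: look for user inside 'xxxx'
).filter(
    (~Q(user='') & Q(idx__gt=0)) | Q(domain='spamdomain.co.uk')
).first()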

Using Pyspark, the regex between two characters gets text and numbers, but not date

Using PySpark's regexp_extract() I can extract the substring between two characters in the string. It is grabbing the text and numbers, but it is not grabbing the dates.
data = [('2345', '<Date>1999/12/12 10:00:05</Date>'),
('2398', '<Crew>crewIdXYZ</Crew>'),
('2328', '<Latitude>0.8252644369443788</Latitude>'),
('3983', '<Longitude>-2.1915840465066916<Longitude>')]
df = sc.parallelize(data).toDF(['ID', 'values'])
df.show(truncate=False)
+----+-----------------------------------------+
|ID |values |
+----+-----------------------------------------+
|2345|<Date>1999/12/12 10:00:05</Date> |
|2398|<Crew>crewIdXYZ</Crew> |
|2328|<Latitude>0.8252644369443788</Latitude> |
|3983|<Longitude>-2.1915840465066916<Longitude>|
+----+-----------------------------------------+
from pyspark.sql.functions import col, regexp_extract

df_2 = df.withColumn('vals', regexp_extract(col('values'), '(.)((?<=>)[^<:]+(?=:?<))', 2))
df_2.show(truncate=False)
+----+-----------------------------------------+-------------------+
|ID |values |vals |
+----+-----------------------------------------+-------------------+
|2345|<Date>1999/12/12 10:00:05</Date> | |
|2398|<Crew>crewIdXYZ</Crew> |crewIdXYZ |
|2328|<Latitude>0.8252644369443788</Latitude> |0.8252644369443788 |
|3983|<Longitude>-2.1915840465066916<Longitude>|-2.1915840465066916|
+----+-----------------------------------------+-------------------+
What can I add to the regex statement to get the date as well?
@jxc Thanks. Here is what made it work (the original class [^<:]+ could not cross the colons in the timestamp, so the date value never matched; [^>]+ lets the match run through them):
df_2 = df.withColumn('vals', regexp_extract(col('values'), '(.)((?<=>)[^>]+(?=:?<))', 2))
df_2.show(truncate=False)
+----+-----------------------------------------+-------------------+
|ID |values |vals |
+----+-----------------------------------------+-------------------+
|2345|<Date>1999/12/12 10:00:05</Date> |1999/12/12 10:00:05|
|2398|<Crew>crewIdXYZ</Crew> |crewIdXYZ |
|2328|<Latitude>0.8252644369443788</Latitude> |0.8252644369443788 |
|3983|<Longitude>-2.1915840465066916<Longitude>|-2.1915840465066916|
+----+-----------------------------------------+-------------------+
You may use
>([^<>]+)<
See the regex demo. The regex matches a >, then captures into Group 1 any one or more chars other than < and >, and then just matches <. The group index (third) argument should be set to 1 since the value you need is in Group 1:
df_2 = df.withColumn('vals', regexp_extract(col('values'), '>([^<>]+)<', 1))
df_2.show(truncate=False)
+----+-----------------------------------------+-------------------+
|ID |values |vals |
+----+-----------------------------------------+-------------------+
|2345|<Date>1999/12/12 10:00:05</Date> |1999/12/12 10:00:05|
|2398|<Crew>crewIdXYZ</Crew> |crewIdXYZ |
|2328|<Latitude>0.8252644369443788</Latitude> |0.8252644369443788 |
|3983|<Longitude>-2.1915840465066916<Longitude>|-2.1915840465066916|
+----+-----------------------------------------+-------------------+
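If it helps to see why the original pattern missed the date, here is a quick plain-Python check of both patterns on the Date row (re uses the same constructs as the patterns above):
import re

s = '<Date>1999/12/12 10:00:05</Date>'

old = r'(.)((?<=>)[^<:]+(?=:?<))'   # [^<:]+ cannot cross the colons in the time part
new = r'>([^<>]+)<'                 # anything between the tags except < and >

print(re.search(old, s))            # None - the timestamp never matches
print(re.search(new, s).group(1))   # 1999/12/12 10:00:05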

How to get rows where a field contains ( ) , [ ] % or + using the rlike SparkSQL function

Let's say you have a Spark dataframe with multiple columns and you want to return the rows where the columns contains specific characters. Specifically you want to return the rows where at least one of the fields contains ( ) , [ ] % or +.
What is the proper syntax in case you want to use Spark SQL rlike function?
import spark.implicits._
val dummyDf = Seq(("John[", "Ha", "Smith?"),
("Julie", "Hu", "Burol"),
("Ka%rl", "G", "Hu!"),
("(Harold)", "Ju", "Di+")
).toDF("FirstName", "MiddleName", "LastName")
dummyDf.show()
+---------+----------+--------+
|FirstName|MiddleName|LastName|
+---------+----------+--------+
| John[| Ha| Smith?|
| Julie| Hu| Burol|
| Ka%rl| G| Hu!|
| (Harold)| Ju| Di+|
+---------+----------+--------+
Expected Output
+---------+----------+--------+
|FirstName|MiddleName|LastName|
+---------+----------+--------+
| John[| Ha| Smith?|
| Ka%rl| G| Hu!|
| (Harold)| Ju| Di+|
+---------+----------+--------+
My few attempts return errors or unexpected results, even when I try to do it just for searching for (.
I know that I could use the simple like construct multiple times, but I am trying to figure out how to do it in a more concise way with regex and Spark SQL.
You can try this using the rlike method:
dummyDf.show()
+---------+----------+--------+
|FirstName|MiddleName|LastName|
+---------+----------+--------+
| John[| Ha| Smith?|
| Julie| Hu| Burol|
| Ka%rl| G| Hu!|
| (Harold)| Ju| Di+|
| +Tim| Dgfg| Ergf+|
+---------+----------+--------+
val df = dummyDf.withColumn("hasSpecial",lit(false))
val result = df.dtypes
.collect{ case (dn, dt) => dn }
.foldLeft(df)((accDF, c) => accDF.withColumn("hasSpecial", col(c).rlike(".*[\\(\\)\\[\\]%+]+.*") || col("hasSpecial")))
result.filter(col("hasSpecial")).show(false)
Output:
+---------+----------+--------+----------+
|FirstName|MiddleName|LastName|hasSpecial|
+---------+----------+--------+----------+
|John[ |Ha |Smith? |true |
|Ka%rl |G |Hu! |true |
|(Harold) |Ju |Di+ |true |
|+Tim |Dgfg |Ergf+ |true |
+---------+----------+--------+----------+
You can also drop the hasSpecial column if you want.
Try this: .*[()\[\]%\+,.]+.*
.* matches any character zero or more times
[()\[\]%\+,.]+ matches any of the characters inside the brackets one or more times
.* matches any character zero or more times
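If you are on PySpark rather than Scala, the same fold-over-columns idea could look like this (a sketch, assuming a dataframe named dummyDf with the columns shown above):
from functools import reduce
from pyspark.sql import functions as F

pattern = ".*[\\(\\)\\[\\]%+]+.*"

# OR together one rlike test per column, then keep rows where any column matched
has_special = reduce(
    lambda acc, c: acc | F.col(c).rlike(pattern),
    dummyDf.columns,
    F.lit(False),
)
dummyDf.filter(has_special).show(truncate=False)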

Using Oracle REGEXP_INSTR to find exact word

I want to return the following positions from the strings below using REGEXP_INSTR.
I am looking for an exact match of the word car in the following strings:
car,care,oscar - 1
care,car,oscar - 6
oscar,care,car - 12
Something like:
SELECT REGEXP_INSTR('car,care,oscar', 'car', 1, 1) "REGEXP_INSTR" FROM DUAL;
I am not sure what kind of escape operators to use.
A simpler solution is to surround the source string and search string with commas and find the position using INSTR.
SELECT INSTR(',' || 'car,care,oscar' || ',', ',car,') "INSTR" FROM DUAL;
Example:
SQL Fiddle
with x(y) as (
SELECT 'car,care,oscar' from dual union all
SELECT 'care,car,oscar' from dual union all
SELECT 'oscar,care,car' from dual union all
SELECT 'car' from dual union all
SELECT 'cart,care,oscar' from dual
)
select y, ',' || y || ',' , instr(',' || y || ',',',car,')
from x
| Y | ','||Y||',' | INSTR(','||Y||',',',CAR,') |
|-----------------|-------------------|----------------------------|
| car,care,oscar | ,car,care,oscar, | 1 |
| care,car,oscar | ,care,car,oscar, | 6 |
| oscar,care,car | ,oscar,care,car, | 12 |
| car | ,car, | 1 |
| cart,care,oscar | ,cart,care,oscar, | 0 |
The following query handles all scenarios. It returns the starting position if the string begins with car, or the whole string is just car. It returns the starting position + 1 if ,car, is found or if the string ends with ,car to account for the comma.
SELECT
CASE
WHEN REGEXP_LIKE('car,care,oscar', '^car,|^car$') THEN REGEXP_INSTR('car,care,oscar', '^car,|^car$', 1, 1)
WHEN REGEXP_LIKE('car,care,oscar', ',car,|,car$') THEN REGEXP_INSTR('car,care,oscar', ',car,|,car$', 1, 1)+1
ELSE 0
END "REGEXP_INSTR"
FROM DUAL;
SQL Fiddle demo with the various possibilities
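The comma-wrapping trick is easy to check outside SQL too; here is a tiny Python illustration of the same idea (1-based position like INSTR, 0 when the exact word is absent):
def exact_pos(csv_string, word):
    # wrap both strings in commas so 'car' cannot match inside 'care' or 'oscar'
    i = (',' + csv_string + ',').find(',' + word + ',')
    return i + 1 if i >= 0 else 0

for s in ['car,care,oscar', 'care,car,oscar', 'oscar,care,car', 'car', 'cart,care,oscar']:
    print(s, exact_pos(s, 'car'))   # 1, 6, 12, 1, 0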
I like Noel's answer as it gives very good performance! Another way around it is to create separate rows from a character-separated string, for example with pm.nodes = 'a;b;c;d;e;f;g':
(select regexp_substr(pm.nodes, '[^;]+', 1, level)
   from dual
connect by regexp_substr(pm.nodes, '[^;]+', 1, level) is not null)