Hive regex split a string in to two different fields

Hive regex split a string in to two different fields - regex

my record is like:
0x0000110PPPP111KZY0 H123456789 XYZ 000000000000000000607532030000607532000060753203002014101707199999
I am searching for a regex where i can split first 3 char 0x0 in to one field in a hive table and the rest 000110PPPP111KZY0 in to second field and so on fixed length file and no delimiter.

I have no experience with hadoop or hive, however the following regex will work with what I believe you're looking for.
/(\dx\d)(.*)/ This will capture/split 0x0 into the first capture group, and everything afterwards into the second capture group. If you only want the numbers/letters following the 0x0 number (so none of the H123456789 or trailing words and letters), use /(\dx\d)([^ ]*)/
If I misunderstood what you're looking for, can you just clarify the exact section of that code you provided that you'd like to select and/or capture? Thanks!

Select
regexp_extract(data, '^(\\dx\\d).*', 1),
regexp_extract(data, '^\\dx\\d(.*)', 1)
from (Select '0x0000110PPPP111KZY0 ' as data) a;
This code returns a Hive row with two fields:
0x0 000110PPPP111KZY0

Related

Can I make my Alteryx RegEx parse conditional?

I receive messages with the fields below. I want to group and extract the user inputs. Majority of submissions contain all fields and the regex works great. Problem comes in when someone removes additional lines if let's say they only need to fill in down to Amount 1
Name:
Number:
Amount:
Old Code:
Code 1:
Amount 1:
Code 2:
Amount 2:
Code 3:
Amount 3:
Code 4:
Amount 4:
I'm using Alteryx to parse the message contents and have success with my current regex but want to be ready for unavoidable user submission inconsistency
Name:(.+)\sNumber:(.+)\sAmount:(.+)\sOld Code:(.+)\sCode 1:(.+)\sAmount 1:(.+)\sCode 2:(.*?)\sAmount 2:(.*?)\sCode 3:(.*?)\sAmount 3:(.*?)\sCode 4:(.*?)\sAmount 4:(.*?[^-]*)
Is it possible to have Alteryx return parsed results from a message even if a listed field is deleted?
Alteryx issue with new cascading regex

Anyway, you can always do a cascading nested optional grouping around the
lines to just match what's valid up to a point.
This expects the form lines to be in order. If it's not, a different type
of regex is needed - an out-of-order regex ( see the bottom regex ) .
Both these regex are for Perl 5.10
(?-ms)Name:(.*)(?:\s+Number:(.*)(?:\s+Amount:(.*)(?:\s+Old[ ]+Code:(.*)(?:\s+Code[ ]+1:(.*)(?:\s+Amount[ ]+1:(.*)(?:\s+Code[ ]+2:(.*)(?:\s+Amount[ ]+2:(.*)(?:\s+Code[ ]+3:(.*)(?:\s+Amount[ ]+3:(.*)(?:\s+Code[ ]+4:(.*)(?:\s+Amount[ ]+4:(.*?[^-]*))?)?)?)?)?)?)?)?)?)?)?
https://regex101.com/r/9oKXEE/1
For out-of-order matching, use this
(?m-s)\A(?:[\S\s]*?(?:(?(1)(?!))^\h*Name\h*:\h*(.*)|(?(2)(?!))^\h*Number\h*:\h*(.*)|(?(3)(?!))^\h*Amount\h*:\h*(.*)|(?(4)(?!))^\h*Old\h*Code\h*:\h*(.*)|(?(5)(?!))^\h*Code\h*1\h*:\h*(.*)|(?(6)(?!))^\h*Amount\h*1\h*:\h*(.*)|(?(7)(?!))^\h*Code\h*2\h*:\h*(.*)|(?(8)(?!))^\h*Amount\h*2\h*:\h*(.*)|(?(9)(?!))^\h*Code\h*3\h*:\h*(.*)|(?(10)(?!))^\h*Amount\h*3\h*:\h*(.*)|(?(11)(?!))^\h*Code\h*4\h*:\h*(.*)|(?(12)(?!))^\h*Amount\h*4\h*:\h*(.*?))){1,12}
https://regex101.com/r/f2rG1v/1

In this situation, you don't need to use Regex straight off the bat and given the inconsistent data it could take a while to perfect one regex term...
You can do it this way instead:
- RecordID first,
- Then you can use a Text 2 Columns with a new-line (\n) delimiter. Configure this to "Split to Rows".
- You can then use a Text to Columns to split on the delimter ":".
That will handle additional rows entered etc. At that stage, you can figure out how to clean up the results (filter to remove null lines, multi-row to tag records, cross-tab to create a table etc...). If you want to flag any unknown rows, you can have a Text Input with the required rows and use Find/Replace or Join to separate the data.

How to split a string in db2?

I've some URL's in my cas_fnd_dwd_det table,
casi_imp_urls cas_code
----------------------------------- -----------
www.casiac.net/fnds/CASI/qnxp.pdf
www.casiac.net/fnds/casi/as.pdf
www.casiac.net/fnds/casi/vindq.pdf
www.casiac.net/fnds/CASI/mnip.pdf
how do i copy the letters between last '/' and '.pdf' to another column
expected outcome
casi_imp_urls cas_code
----------------------------------- -----------
www.casiac.net/fnds/CASI/qnxp.pdf qnxp
www.casiac.net/fnds/casi/as.pdf as
www.casiac.net/fnds/casi/vindq.pdf vindq
www.casiac.net/fnds/CASI/mnip.pdf mnip
the below URL's are static
www.casiac.net/fnds/CASI/
www.casiac.net/fnds/casi/
Advise, how do i select the codes between last '/' and '.pdf' ?

I would recommend to take a look at REGEXP_SUBSTR. It allows to apply a regular expression. Db2 has string processing functions, but the regex function may be the easiest solution. See SO question on regex and URI parts for different ways of writing the expression. The following would return the last slash, filename and the extension:
SELECT REGEXP_SUBSTR('http://fobar.com/one/two/abc.pdf','\/(\w)*.pdf' ,1,1)
FROM sysibm.sysdummy1
/abc.pdf
The following uses REPLACE and the pattern is from this SO question with the pdf file extension added. It splits the string in three groups: everything up to the last slash, then the file name, then the ".pdf". The '$1' returns the group 1 (groups start with 0). Group 2 would be the ".pdf".
SELECT REGEXP_REPLACE('http://fobar.com/one/two/abc.pdf','(?:.+\/)(.+)(.pdf)','$1' ,1,1)
FROM sysibm.sysdummy1
abc
You could apply LENGTH and SUBSTR to extract the relevant part or try to build that into the regex.

For older Db2 versions than 11.1. Not sure if it works for 9.5, but definitely should work since 9.7.
Try this as is.
with cas_fnd_dwd_det (casi_imp_urls) as (values
'www.casiac.net/fnds/CASI/qnxp.pdf'
, 'www.casiac.net/fnds/casi/as.pdf'
, 'www.casiac.net/fnds/casi/vindq.pdf'
, 'www.casiac.net/fnds/CASI/mnip.PDF'
)
select
casi_imp_urls
, xmlcast(xmlquery('fn:replace($s, ".*/(.*)\.pdf", "$1", "i")' passing casi_imp_urls as "s") as varchar(50)) cas_code
from cas_fnd_dwd_det

Using Regex to extract numeric values in Tableau

I am trying to pull out the numeric values (10004, 12245, 13456) from the following IDs:
10004a,
12v245, and
13456n
I can get the correct ID numbers with the exception of 12v245 ID, using the following regex code:
REGEXP_EXTRACT([ID], '([0-9]+)')
The 12v245 ID is only returning the the first two numbers. What am I missing in my code?

Your issue is that the function REGEXP_EXTRACT in Tableau requires exactly one capturing group.
The function [0-9]+ returns a capturing group per block of numbers and as the ID 12v245 has a letter in between the string of numbers it returns two capturing groups i.e. the 12 and then the 245.
The workaround for this is to use a nested replace as follows:
REGEXP_REPLACE(
REGEXP_REPLACE(
REGEXP_REPLACE([ID], '[\D]+',"")
, '[\D]+' , "")
, '[\D]+' , "")
Depending on the nature of your data you may want to add more replaces.
This issue is documented on the Tableau community so feel free to vote up for a better fix: https://community.tableau.com/ideas/4975#

How can I separate a string by underscore (_) in google spreadsheets using regex?

I need to create some columns from a cell that contains text separated by "_".
The input would be:
campaign1_attribute1_whatever_yes_123421
And the output has to be in different columns (one per field), with no "_" and excluding the final number, as it follows:
campaign1 attribute1 whatever yes
It must be done using a regex formula!
help!
Thanks in advance (and sorry for my english)

=REGEXEXTRACT("campaign1_attribute1_whatever_yes_123421","(("&REGEXREPLACE("campaign1_attribute1_whatever_yes_123421","((_)|(\d+$))",")$1(")&"))")
What this does is replace all the _ with parenthesis to create capture groups, while also excluding the digit string at the end, then surround the whole string with parenthesis.
We then use regex extract to actuall pull the pieces out, the groups automatically push them to their own cells/columns

To solve this you can use the SPLIT and REGEXREPLACE functions
Solution:
Text - A1 = "campaign1_attribute1_whatever_yes_123421"
Formula - A3 = =SPLIT(REGEXREPLACE(A1,"_+\d*$",""), "_", TRUE)
Explanation:
In cell A3 We use SPLIT(text, delimiter, [split_by_each]), the text in this case is formatted with regex =REGEXREPLACE(A1,"_+\d$","")* to remove 123421, witch will give you a column for each word delimited by ""
A1 = "campaign1_attribute1_whatever_yes_123421"
A2 = "=REGEXREPLACE(A1,"_+\d*$","")" //This gives you : *campaign1_attribute1_whatever_yes*
A3 = SPLIT(A2, "_", TRUE) //This gives you: campaign1 attribute1 whatever yes, each in a separate column.

I finally figured it out yesterday in stackoverflow (spanish): https://es.stackoverflow.com/questions/55362/c%C3%B3mo-separo-texto-por-guiones-bajos-de-una-celda-en...
It was simple enough after all...
The reason I asked to be only in regex and for google sheets was because I need to use it in Google data studio (same regex functions than spreadsheets)
To get each column just use this regex extract function:
1st column: REGEXP_EXTRACT(Campaña, '^(?:[^_]*_){0}([^_]*)_')
2nd column: REGEXP_EXTRACT(Campaña, '^(?:[^_]*_){1}([^_]*)_')
3rd column: REGEXP_EXTRACT(Campaña, '^(?:[^_]*_){2}([^_]*)_')
etc...
The only thing that has to be changed in the formula to switch columns is the numer inside {}, (column number - 1).
If you do not have the final number, just don't put the last "_".
Lastly, remember to do all the calculated fields again, because (for example) it gets an error with CPC, CTR and other Adwords metrics that are calculated automatically.
Hope it helps!

Use REGEXP_SUBSTR like a Split function

I need to extract a text value from data in a VARCHAR2 column. Sample:
EDKES^Visit: ^PRIMARY INSURANCE COMMENTS: ^SECONDARY INSURANCE COMMENTS: ^TERTIARY INSURANCE COMMENTS: ^NO PRIMARY INSURANCE^NONE^NONE^NONE^NONE^NONE^NONE^NONE^NONE^NONE^NO SECONDARY INSURANCE^NONE^NONE^NONE^NONE^NONE^NONE^NONE^NONE^NONE^NO TERTIARY INS*
I need to get the text that proceeds the 6th occurrence of the '^' (excluding the '^'). In this example, the text would be NO PRIMARY INSURANCE.
([\w\s\:\*]+(\^?)) mostly works, but doesn't exclude the '^'.
When I try to use this expression REGEXP_SUBSTR(VARCHAR_COL, '([\w\s\:\*]+(\^?))', 1, 6), I get a single character ('s'), rather than the expected match NO PRIMARY INSURANCE^.
What am I missing?

This should work pretty well:
REPLACE(REGEXP_SUBSTR(VARCHAR_COL, '[^^]+\^?', 1, 6), '^', '')

You might be able to account for blank columns as well. And if the engine only returns
the capture groups, it will trim the delimiter.
([^^]*).?
This of course means that the last column found is always invalid.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Hive regex split a string in to two different fields - regex

Select regexp_extract(data, '^(\\dx\\d).', 1), regexp_extract(data, '^\\dx\\d(.)', 1) from (Select '0x0000110PPPP111KZY0 ' as data) a; This code returns a Hive row with two fields: 0x0 000110PPPP111KZY0

Related

Can I make my Alteryx RegEx parse conditional?

How to split a string in db2?

Using Regex to extract numeric values in Tableau

How can I separate a string by underscore (_) in google spreadsheets using regex?

Use REGEXP_SUBSTR like a Split function

Categories

Resources

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Hive regex split a string in to two different fields - regex

Select regexp_extract(data, '^(\\dx\\d).*', 1), regexp_extract(data, '^\\dx\\d(.*)', 1) from (Select '0x0000110PPPP111KZY0 ' as data) a; This code returns a Hive row with two fields: 0x0 000110PPPP111KZY0

Related

Can I make my Alteryx RegEx parse conditional?

How to split a string in db2?

Using Regex to extract numeric values in Tableau

How can I separate a string by underscore (_) in google spreadsheets using regex?

Use REGEXP_SUBSTR like a Split function

Categories

Resources

Select regexp_extract(data, '^(\\dx\\d).', 1), regexp_extract(data, '^\\dx\\d(.)', 1) from (Select '0x0000110PPPP111KZY0 ' as data) a; This code returns a Hive row with two fields: 0x0 000110PPPP111KZY0