Extract book name from a string in Hive

My data is something like this -
1124 An Orphan's Journey
234 Red Dragon
35600 You'll Know When It's Time
It has two values, the first one is Book ID, and the second one is the book name.
I used the split function in Hive but that doesn't look proper.
SELECT split(books, '\\ ')[0] book_id,
split(books, '\\ ')[1] + ' ' +
split(books, '\\ ')[2] + ' ' +
split(books, '\\ ')[3] + ' ' +
split(books, '\\ ')[4] as book_name
FROM books;
So far values are good but I don't feel it is the right approach.
Please help.

You may use
REGEXP_EXTRACT(books, '^\\d+', 0)
to get the book ID and
REGEXP_EXTRACT(books, '\\s+(\\S.*)', 1)
to extract the book name. The second regex can be made stricter: to also require digits at the start of the string, use '^\\d+\\s+(\\S.*)'.
Here,
^\d+ - matches one or more (+) digits at the start of the string (^)
\s+(\S.*) - matches one or more whitespace chars (\s+), then captures into Group 1 a non-whitespace char (\S) followed by the rest of the string (.* matches zero or more chars other than line break chars, as many as possible). Note that the index argument is set to 1 in the second call to REGEXP_EXTRACT so that only the Group 1 value is returned, without the leading whitespace.
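To see what the two patterns return on the sample rows, here is a minimal Python sketch. It assumes Python's re module as a stand-in for Hive's REGEXP_EXTRACT (the constructs used here, ^, \d, \s, \S, . and +, behave the same way in both), and the helper name split_book is hypothetical. In Hive the backslashes are doubled inside the string literal; in a Python raw string they are not.

```python
import re

# Sample rows from the question.
ROWS = [
    "1124 An Orphan's Journey",
    "234 Red Dragon",
    "35600 You'll Know When It's Time",
]


def split_book(line):
    """Return (book_id, book_name); a part that does not match is None."""
    id_match = re.search(r"^\d+", line)          # whole match = the ID
    name_match = re.search(r"\s+(\S.*)", line)   # Group 1 = the name
    book_id = id_match.group(0) if id_match else None
    book_name = name_match.group(1) if name_match else None
    return book_id, book_name


for row in ROWS:
    print(split_book(row))
```

Because the first \s+ in the line is the gap after the ID, Group 1 starts at the first word of the title and keeps all later spaces.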

Related

Split records with complex delimiter

I have an incoming record with a complex column delimiter and need to tokenize the record.
One of the delimiter characters can be part of the data.
I am looking for a regular expression.
It has to work on Teradata 16.1 with the function "REGEXP_SUBSTR".
There can be a maximum of 5 columns to tokenize.
I am planning to use CASE statements in Teradata to tokenize the columns.
I guess a regular expression for one token will do the trick.
Case#1: Column delimiter is ' - '
Sample data: On-e - tw o - thr$ee
Required output : [On-e, tw o, thr$ee]
My attempt : ([\S]*)\s{1}\-{1}\s{1}
Case#2 : Column delimiter is '::'
Sample data : On:e::tw:o::thr$ee
Required output : [On:e, tw:o, thr$ee]
Case#3 : Column delimiter is ':;'
Sample data : On:e:;tw;o:;thr$ee
Required output : [On:e, tw;o, thr$ee]
The above 3 cases are independent and do not occur together, i.e., 3 different solutions are required.

If you absolutely must use RegEx for this, you could do it like in the examples shown below using capture groups.
Generic example:
/(?<data>.+?)($delimiter|$)/gm
(?<data>.+?) named capture group data, matching:
. any character
+? occurring between one and unlimited times, as few times as possible (lazy)
followed by
($delimiter|$) another capture group, matching:
$delimiter - replace this with regex matching your delimiter string
| or
$ end of string
Picking up your examples:
Case #1:
Column delimiter is ' - '
/(?<data>.+?)(\s-\s|$)/gm
(https://regex101.com/r/qMYxAY/1)
Case #2:
Column delimiter is '::'
/(?<data>.+?)(\:\:|$)/gm
https://regex101.com/r/IzaAoA/1
Case #3:
Column delimiter is ':;'
/(?<data>.+?)(\:\;|$)/gm
https://regex101.com/r/g1MUb6/1
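The generic pattern can be tried outside regex101 with a short Python sketch. This is only an illustration of the regex idea, not Teradata code: the function name tokenize is hypothetical, Python spells the named group (?P<data>...) rather than (?<data>...), and re.escape is used so that any delimiter string is matched literally.

```python
import re


def tokenize(record, delimiter):
    """Split record on a multi-character delimiter using the lazy-match pattern."""
    # (?P<data>.+?) lazily captures a token; it stops at the delimiter or at
    # the end of the string, whichever comes first.
    pattern = re.compile(r"(?P<data>.+?)(?:" + re.escape(delimiter) + r"|$)")
    return [m.group("data") for m in pattern.finditer(record)]


print(tokenize("On-e - tw o - thr$ee", " - "))
print(tokenize("On:e::tw:o::thr$ee", "::"))
print(tokenize("On:e:;tw;o:;thr$ee", ":;"))
```

A single delimiter character inside the data (the - in On-e, the : in On:e) is kept because the lazy token only ends where the full delimiter string follows.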

Normally you would use STRTOK to split a string on a delimiter. But strtok can't handle a multi-character delimiter. One moderately over-complicated approach is to replace the multiple characters of the delimiter with a single character and split on that. For example:
select
strtok(oreplace(<your column>,' - ', '|'),'|',1) as one,
strtok(oreplace(<your column>,' - ', '|'),'|',2) as two,
strtok(oreplace(<your column>,' - ', '|'),'|',3) as three,
strtok(oreplace(<your column>,' - ', '|'),'|',4) as four,
strtok(oreplace(<your column>,' - ', '|'),'|',5) as five
If there are only three occurrences, like in your samples, it just returns null for the other two.
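The replace-then-split idea can be sketched in Python to make the logic concrete. This mirrors the approach rather than Teradata's functions: the function name and parameters are hypothetical, None stands in for SQL NULL, and it assumes the placeholder character '|' never occurs in the data.

```python
def split_on_multichar(record, delimiter, max_tokens=5, placeholder="|"):
    """Swap the multi-character delimiter for one character, then split on it."""
    # Assumption: `placeholder` does not appear anywhere in the data itself.
    tokens = record.replace(delimiter, placeholder).split(placeholder)
    tokens = tokens[:max_tokens]
    # Pad with None so there are always `max_tokens` columns.
    return tokens + [None] * (max_tokens - len(tokens))


print(split_on_multichar("On-e - tw o - thr$ee", " - "))
```

If the placeholder could occur in real data, a character that is guaranteed to be absent would have to be chosen instead.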

Extract from string in BigQuery using regexp_extract

I have a long string in BigQuery from which I need to extract some data.
Part of the string looks like this:
... source: "agent" resolved_query: "hi" score: 0.61254 parameters ...
I want to extract out data such as agent, hi, and 0.61254.
I'm trying to use regexp_extract but I can't get the regexp to work correctly:
select
regexp_extract([col],r'score: [0-9]*\.[0-9]+') as score,
regexp_extract([col],r'source: [^"]*') as source
from [table]
What should the regexp be to just get agent or 0.61254 without the field name and no quotation marks?
Thank you in advance.

I love non-trivial approaches - below is one of them:
select * except(col) from (
select col, split(kv, ': ')[offset(0)] key,
trim(split(kv, ': ')[offset(1)], '"') value,
from your_table,
unnest(regexp_extract_all(col, r'\w+: "?[\w.]+"?')) kv
)
pivot (min(value) for key in ('source', 'resolved_query', 'score'))
if applied to sample data as in your question
with your_table as (
select '... source: "agent" resolved_query: "hi" score: 0.61254 parameters ... ' col union all
select '... source: "agent2" resolved_query: "hello" score: 0.12345 parameters ... ' col
)
the output has one row per input record, with the columns source, resolved_query and score.
As you might have noticed, the benefit of this approach is that if you have more fields/attributes to extract, you do not need to clone lines of code for each attribute. You just add another value to the list in the last line; the rest of the code stays the same.
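The same key/value idea can be checked quickly in Python. This is a sketch of the logic only, not BigQuery code: re.findall plays the role of regexp_extract_all, a dict plays the role of the pivot, and the function name extract_fields is hypothetical.

```python
import re


def extract_fields(col):
    """Collect every `key: value` pair, stripping optional double quotes."""
    fields = {}
    # Same pattern as in the query: a word, ': ', then an optionally quoted value.
    for kv in re.findall(r'\w+: "?[\w.]+"?', col):
        key, value = kv.split(": ", 1)
        fields[key] = value.strip('"')
    return fields


sample = '... source: "agent" resolved_query: "hi" score: 0.61254 parameters ... '
print(extract_fields(sample))
```

Values come back as strings, so a numeric field such as score would still need a cast.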

You can use
select
regexp_extract([col],r'score:\s*(\d*\.?\d+)') as score,
regexp_extract([col],r'resolved_query:\s*"([^"]*)"') as resolved_query,
regexp_extract([col],r'source:\s*"([^"]*)"') as source
from [table]
Here,
score:\s*(\d*\.?\d+) matches score: string, then any zero or more whitespaces, and then there is a capturing group with ID=1 that captures zero or more digits, an optional . and then one or more digits
resolved_query:\s*"([^"]*)" matches a resolved_query: string, zero or more whitespaces, ", then captures into Group 1 any zero or more chars other than " and then matches a " char
source:\s*"([^"]*)" matches a source: string, zero or more whitespaces, ", then captures into Group 1 any zero or more chars other than " and then matches a " char.
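The three patterns can be tried on the sample string with a small Python sketch. It assumes Python's re module in place of BigQuery's regexp_extract (which likewise returns the captured group when the pattern has one); the names PATTERNS and extract are hypothetical.

```python
import re

# One pattern per field; Group 1 holds the value without name or quotes.
PATTERNS = {
    "score": r"score:\s*(\d*\.?\d+)",
    "resolved_query": r'resolved_query:\s*"([^"]*)"',
    "source": r'source:\s*"([^"]*)"',
}


def extract(col):
    """Return Group 1 for each pattern, or None where a pattern does not match."""
    result = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, col)
        result[name] = match.group(1) if match else None
    return result


print(extract('... source: "agent" resolved_query: "hi" score: 0.61254 parameters ...'))
```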

Do not select if additional character is included

Suppose I have the following numbers:
3,000mt
300mt
44,000m
320m
And I want 44,000m and 320m to be selected.
What regex should I use to only select the numbers (comma separated) that have "m" in the end and not the ones that have "mt"?
This is what I have tried:
\d+[,]?\d+m.
I have no idea how to negate mt though.

You are very close to the solution; you only missed the check for a word boundary (regex token \b). Instead of matching any character with . at the end of your regular expression, require that the m is followed by a word boundary (e.g. a space, a newline, or the end of the string):
\d+(,\d+)?m\b
where
\d+ looks for any digits (at least one)
(,\d+)? looks for a comma followed by one digit or more (it's grouped by using parentheses and the whole group is completely optional using the ? sign)
m\b as explained above looks for a literal m at the end of a word
With this regex you can also match strings with one digit only followed by m like 9m or similar. This is a slight change in comparison to your regex (grouping comma followed by digits).
I verified the regex in Python and also added some more edge cases:
>>> import re
>>> text = "3,000mt 300mt 44,000m 1m 1mt 1,3mt 320m"
>>> re.findall(r"\d+(?:,\d+)?m\b", text) # (?:...) is a non-capturing group, so findall returns the whole matches
['44,000m', '1m', '320m']
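The reason for writing the group as (?:...) in the Python check is how re.findall treats groups: when the pattern contains a capturing group, it returns the group contents instead of the whole match. A short sketch of the difference, using the same test string:

```python
import re

text = "3,000mt 300mt 44,000m 1m 1mt 1,3mt 320m"

# Capturing group: findall returns only what (,\d+) captured per match.
with_capture = re.findall(r"\d+(,\d+)?m\b", text)

# Non-capturing group: findall returns the full matched text.
without_capture = re.findall(r"\d+(?:,\d+)?m\b", text)

# finditer + group(0) gives the full match with either form of the group.
via_finditer = [m.group(0) for m in re.finditer(r"\d+(,\d+)?m\b", text)]

print(with_capture)
print(without_capture)
print(via_finditer)
```

The regex itself matches the same strings in both forms; only the return value of findall differs.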

How about a Unix solution like the one below?
> echo "3,000mt 300mt 44,000m 320m" | tr ' ' '\n' | awk -F" " ' $0~/m$/ { print } '
44,000m
320m

How to determine if variable contains a specified string using RegEx

How can I write a condition which compares Recipient.AddressEntry with, for example, the string "I351" using RegEx?
Here is my If condition which works but is hardcoded to every known email address.
For Each recip In recips
If recip.AddressEntry = "Dov John, I351" Then
objMsg.To = "example#mail.domain"
objMsg.CC = recip.Address
objMsg.Subject = Msg.Subject
objMsg.Body = Msg.Body
objMsg.Send
End If
Next
The reason I need this condition is that an email may include one or several colleagues from my team and one or more from another team. The AddressEntry of my colleagues ends with I351, so I want to check whether the email contains one of my teammates.
For Each recip In recips
If (recip.AddressEntry = "Dov John, I351" _
Or recip.AddressEntry = "Vod Nohj, I351") Then
objMsg.To = "example#mail.domain"
objMsg.CC = recip.Address
objMsg.Subject = Msg.Subject
objMsg.Body = Msg.Body
objMsg.Send
End If
Next

You still didn't clarify exactly what the condition you want to use for matching is, so I'll do my best:
If you simply want to check if the string ends with "I351", you don't need regex, you can use something like the following:
If recip.AddressEntry Like "*I351" Then
' ...
End If
If you want to check if the string follows this format "LastName FirstName, I351", you can achieve that using Regex by using something like the following:
' Requires a reference to "Microsoft VBScript Regular Expressions 5.5" (Tools > References)
Dim regEx As New RegExp
regEx.Pattern = "^\w+\s\w+,\sI351$"
If regEx.Test(recip.AddressEntry) Then
' ...
End If
Explanation of the regex pattern:
' ^ Asserts position at the start of the string.
' \w Matches any word character.
' + Matches between one and unlimited times.
' \s Matches a whitespace character.
' \w+ Same as above.
' , Matches the character `,` literally.
' \s Matches a whitespace character.
' I351 Matches the string `I351` literally.
' $ Asserts position at the end of the string.
Hope that helps.
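To compare the two checks side by side, here is a minimal Python sketch of the same logic (not VBA). The function names are hypothetical, str.endswith stands in for the Like "*I351" test, and re.fullmatch stands in for the anchored ^...$ pattern.

```python
import re


def ends_with_team_code(address_entry, code="I351"):
    """Loose check: the entry simply ends with the team code."""
    return address_entry.endswith(code)


def matches_name_format(address_entry, code="I351"):
    """Strict check: exactly 'Word Word, <code>' and nothing else."""
    pattern = r"\w+\s\w+,\s" + re.escape(code)
    return re.fullmatch(pattern, address_entry) is not None


for entry in ["Dov John, I351", "Vod Nohj, I351", "Someone Else, I999"]:
    print(entry, ends_with_team_code(entry), matches_name_format(entry))
```

The strict check rejects entries with a different shape, such as a name with three parts, even when they end with the team code.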

How to better this regex?

I have a list of strings like this:
/soccer/poland/ekstraklasa-2008-2009/results/
/soccer/poland/orange-ekstraklasa-2007-2008/results/
/soccer/poland/orange-ekstraklasa-youth-2010-2011/results/
From each string I want to take a middle part resulting in respectively:
ekstraklasa
orange ekstraklasa
orange ekstraklasa youth
My code here does the job but it feels like it can be done in fewer steps and probably with regex alone.
name = re.search(r'/([-a-z\d]+)/results/', string).group(1) # take the middle part
name = re.search(r'[-a-z]+', name).group() # trim numbers
if name.endswith('-'):
    name = name[:-1] # trim trailing `-` if needed
name = name.replace('-', ' ')
Can anyone see how to make it better?

This regex should do the work:
/(?:\/\w+){2}\/([\w\-]+)(?:-\d+){2}/
Explanation:
(?:\/\w+){2} - eat the first two words delimited by /
\/ - eat the next /
([\w\-]+) - match the word characters or hyphens (this is what we're looking for)
(?:-\d+){2} - eat the hyphens and the numbers after the part we're looking for
The result is in the first match group
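Applied in Python, the pattern could be used as in the sketch below. The surrounding / delimiters of the regex literal are dropped because Python does not use them, the function name league_name is hypothetical, and the hyphens still have to be replaced with spaces in a separate step.

```python
import re

PATHS = [
    "/soccer/poland/ekstraklasa-2008-2009/results/",
    "/soccer/poland/orange-ekstraklasa-2007-2008/results/",
    "/soccer/poland/orange-ekstraklasa-youth-2010-2011/results/",
]


def league_name(path):
    """Capture the third path segment without its trailing '-YYYY-YYYY'."""
    match = re.search(r"(?:/\w+){2}/([\w-]+)(?:-\d+){2}", path)
    if match is None:
        return None
    return match.group(1).replace("-", " ")


for path in PATHS:
    print(league_name(path))
```

The greedy [\w-]+ first swallows the years as well and then backtracks until two -digits groups remain after it, which is why Group 1 ends just before the years.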

I can't test it because I am not using Python, but I would use an expression like
^(/soccer/poland/)([a-z\-]*)(.*)$
or
^(/[a-z]*/[a-z]*/)([a-z\-]*)(.*)$
This expression matches "/soccer/poland/" at the beginning, then everything consisting of lowercase a to z or -, and then the rest of the string.
Then take the 2nd group.
The groups should hold these strings:
/soccer/poland/
orange-ekstraklasa-youth-
2010-2011/results/
Then simply replace "-" with " " and trim the spaces.
PS: If you are using regex101.com, for example, you need to escape / and test just one line of input at a time.
Expression:
^(\/soccer\/poland\/)([a-z\-]*)(.*)$
One line of your input:
/soccer/poland/orange-ekstraklasa-youth-2010-2011/results/
If you prefer an expression that is not tied to soccer and poland, use
^(\/[a-z]*\/[a-z]*\/)([a-z\-]*)(.*)$
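Since the answer was not tested in Python, here is a minimal sketch of how the generic expression could be applied there. The function name league_name_from_groups is hypothetical; / does not need escaping in a Python pattern, and str.strip removes the space left by the trailing hyphen of Group 2.

```python
import re


def league_name_from_groups(path):
    """Take Group 2 (letters and hyphens), then turn hyphens into spaces."""
    match = re.match(r"^(/[a-z]*/[a-z]*/)([a-z\-]*)(.*)$", path)
    if match is None:
        return None
    # Group 2 ends with a hyphen, e.g. 'orange-ekstraklasa-youth-'.
    return match.group(2).replace("-", " ").strip()


path = "/soccer/poland/orange-ekstraklasa-youth-2010-2011/results/"
print(re.match(r"^(/[a-z]*/[a-z]*/)([a-z\-]*)(.*)$", path).groups())
print(league_name_from_groups(path))
```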
