Regexp_Replace in HIVE to keep only certain words - regex

I am trying to use regexp_replace in HIVE as a way to only keep certain words of a string.
I am trying to use the following:
select regexp_replace('I AM WANTING BLUE SHOES AND YELLOW STRINGS', '([^BLUE|SHOES|YELLOW|STRINGS])',' ')
But it is giving me "I W NTING BLUE SHOES N YELLOW STRINGS" instead of "BLUE SHOES YELLOW STRINGS"
I have tried using \b \s and any number of things to no avail. Any tips on getting this to work?
I also considered using regexp_extract but for my particular use case there are too many variables for it to be useful.

You can use
select regexp_replace('I AM WANTING BLUE SHOES AND YELLOW STRINGS', '\\s*\\b(?!(?:BLUE|SHOES|YELLOW|STRINGS)\\b)\\w+','')
Details
\s* - 0+ whitespaces
\b - a word boundary
(?!(?:BLUE|SHOES|YELLOW|STRINGS)\b) - a negative lookahead that fails the match if BLUE, SHOES, YELLOW or STRINGS appears as a whole word immediately to the right
\w+ - 1 or more word chars (letters, digits or _)
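To see why the original attempt fails and what the lookahead version returns, here is a minimal Python sketch. It is a port for illustration, not Hive code: it assumes Python's re module treats character classes, \b and lookaheads the same way Hive's Java-based engine does, and the names KEEP, broken and fixed are made up. Inside [...] the | and the letters are just individual characters, so [^BLUE|SHOES|YELLOW|STRINGS] means "any single character that is not one of those letters or |".

```python
import re

text = "I AM WANTING BLUE SHOES AND YELLOW STRINGS"

# Original attempt: a negated character class, not a list of words.
# Every character outside the set {B,L,U,E,S,H,O,Y,W,T,R,I,N,G,|} becomes a space.
broken = re.sub(r"([^BLUE|SHOES|YELLOW|STRINGS])", " ", text)

# Pattern from the answer: remove every word that is not one of the kept whole words,
# together with the whitespace in front of it.
KEEP = ["BLUE", "SHOES", "YELLOW", "STRINGS"]  # hypothetical keep-list
pattern = r"\s*\b(?!(?:" + "|".join(map(re.escape, KEEP)) + r")\b)\w+"
fixed = re.sub(pattern, "", text)

print(repr(broken))
print(repr(fixed))          # the space in front of BLUE is not consumed
print(repr(fixed.strip()))
```

The whitespace before the first kept word survives, so the result starts with a space; in Hive, wrapping the call in trim() should take care of that.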

Related

Regular Expression to search substring

Let's say I have a string like Michael is studying at the Faculty of Economics at the University
and I need to check whether a given string contains the following expression: Facul* of Econom*
where the star sign means that the word can have many different endings.
In general, my goal is to find similar expressions within tables from a ClickHouse database. If you can suggest other options for solving this problem, I will be grateful.
If you want to match any lowercase letters following your two words use this:
\bFacul[a-z]* of Econom[a-z]*\b
If you want to match any letters, uppercase or lowercase, following your two words use this:
\bFacul[A-Za-z]* of Econom[A-Za-z]*\b
Explanation:
\b - word boundary
Facul - literal text
[A-Za-z]* - 0 to multiple alpha chars
of - literal text
Econom - literal text
[A-Za-z]* - 0 to multiple alpha chars
\b - word boundary
If you want to be more forgiving with upper/lowercase and spaces use this:
\b[Ff]acul[A-Za-z]* +of +[Ee]conom[A-Za-z]*\b
Use any number of "word" chars for word tails and "word boundary" at the front:
\bFacul\w* of Econom\w*
consider case insensitivity too:
(?i)\bfacul\w* of econom\w*
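A quick way to compare the variants above is to run them against the sample sentence. This is a hedged Python sketch, not ClickHouse code: the dictionary keys are made-up labels, and it only shows how the patterns behave in Python's re module. In ClickHouse the same patterns would presumably be passed to a function such as match(), which uses RE2 syntax; that part is not tested here.

```python
import re

sentence = "Michael is studying at the Faculty of Economics at the University"

# Labels are illustrative; the patterns are the ones quoted in the answers.
patterns = {
    "lowercase tails": r"\bFacul[a-z]* of Econom[a-z]*\b",
    "any-case tails": r"\bFacul[A-Za-z]* of Econom[A-Za-z]*\b",
    "flexible case and spaces": r"\b[Ff]acul[A-Za-z]* +of +[Ee]conom[A-Za-z]*\b",
    "word-char tails": r"\bFacul\w* of Econom\w*",
    "case-insensitive": r"(?i)\bfacul\w* of econom\w*",
}

matches = {name: re.search(p, sentence) for name, p in patterns.items()}
for name, m in matches.items():
    print(f"{name}: {m.group(0) if m else None}")

# The case-sensitive \w* variant misses an all-lowercase phrase; the (?i) one finds it.
strict_miss = re.search(patterns["word-char tails"], "the faculty of economy")
lower_hit = re.search(patterns["case-insensitive"], "the faculty of economy")
```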

Handle initials in Postgresql

I'd like to keep together initials (max two letters) when there are punctuation or spaces in between.
I have the following snippet to tackle almost everything, but I am having issues keeping together initials that are separated by punctuation and spaces. For instance, this works in other regex flavors, but not in PostgreSQL:
SELECT regexp_replace('R Z ELEMENTARY SCHOOL',
'(\b[A-Za-z]{1,2}\b)\s+\W*(?=[a-zA-Z]{1,2}\b)',
'\1')
The outcome should be "RZ ELEMENTARY SCHOOL". Other examples will include:
A & D ALTERNATIVE EDUCATION
J. & H. KNOWLEDGE DEVELOPMENT
A. - Z. EVOLUTION IN EDUCATION
The transformation should be as follows:
AD ALTERNATIVE EDUCATION
JH KNOWLEDGE DEVELOPMENT
AZ EVOLUTION IN EDUCATION
How to achieve this in Postgresql?
Thanks
Building on your current regex, I can recommend
SELECT REGEXP_REPLACE(
REGEXP_REPLACE('J. & H. KNOWLEDGE DEVELOPMENT', '\m([[:alpha:]]{1,2})\M\s*\W*(?=[[:alpha:]]{1,2}\M)', '\1'),
'^([[:alpha:]]+)\W+',
'\1 '
)
See the online demo, which yields JH KNOWLEDGE DEVELOPMENT.
It is a two step solution. The first regex matches
\m([[:alpha:]]{1,2})\M - a whole one- or two-letter word captured into Group 1 (\m is a leading word boundary, and \M is a trailing word boundary)
\s* - zero or more whitespaces
\W* - zero or more non-word chars
(?=[[:alpha:]]{1,2}\M) - a positive lookahead that requires a whole one or two letter word immediately to the right of the current position.
The match is replaced with the contents of Group 1 (\1).
The second regex matches one or more letters at the start of the string (captured into Group 1) followed by one or more non-word chars, and replaces that run of non-word chars with a single space.
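The two steps can be tried outside the database with a rough Python port. This is a sketch under assumptions, not the PostgreSQL query itself: Python's re has no \m and \M, so \b stands in for both, [A-Za-z] stands in for [[:alpha:]] (ASCII only), and join_initials is a hypothetical helper name.

```python
import re

def join_initials(name: str) -> str:
    """Hypothetical helper mirroring the two nested REGEXP_REPLACE calls."""
    # Step 1: glue the first one- or two-letter word to the short word after it.
    # count=1 mimics REGEXP_REPLACE without the 'g' flag (first match only).
    step1 = re.sub(
        r"\b([A-Za-z]{1,2})\b\s*\W*(?=[A-Za-z]{1,2}\b)", r"\1", name, count=1
    )
    # Step 2: turn the punctuation/space run after the leading letters into one space.
    return re.sub(r"^([A-Za-z]+)\W+", r"\1 ", step1)

for sample in [
    "R Z ELEMENTARY SCHOOL",
    "A & D ALTERNATIVE EDUCATION",
    "J. & H. KNOWLEDGE DEVELOPMENT",
    "A. - Z. EVOLUTION IN EDUCATION",
]:
    print(join_initials(sample))
```

Names without leading initials pass through unchanged, because step 1 finds nothing to join and step 2 only rewrites the separator after the first word.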

Regex: include 3 word in front and 3 behind the selected text

I'm using this regex code in Excel to find the desired text in a paragraph:
=RegexExtract(B2,"(bot|vehicle|scrape)")
This successfully returns any of the 3 words if they are found in a paragraph. As an extra, I would like the regex to return the desired text in bold along with a few words in front of it and 3 words after it.
Example of text:
A car (or automobile) is a wheeled motor vehicle used for transportation.
Most definitions of car say they run primarily on roads, seat one to eight people,
have four tires, and mainly transport people rather than goods.
Example output:
a wheeled motor **vehicle** used for transportation
I want a portion of the text to appear in order for the receiver to be able to pinpoint easier the location of the text.
Any alternative approach is much appreciated.
You may use
=RegexExtract(B2,"(?:\w+\W+(?:\w+\W+){0,2})?(?:bot|vehicle|scrape)(?:\W+\w+(?:\W+\w+){0,2})?")
Details: every group in the pattern is non-capturing, so the formula works with the whole match rather than with a sub-group. The pattern consists of:
(?:\w+\W+(?:\w+\W+){0,2})? - an optional sequence of a word followed with non-word chars that is followed with zero, one or two repetitions of 1+ word chars and then 1+ non-word chars
(?:bot|vehicle|scrape) - a bot, vehicle or scrape word
(?:\W+\w+(?:\W+\w+){0,2})? - an optional sequence of 1+ non-word chars and then 1+ word chars followed with zero, one or two repetitions of 1+ non-word chars and then 1+ word chars.
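The pattern can be checked against the sample paragraph from the question with a short Python sketch. It assumes Python's re handles these constructs the same way as the spreadsheet's regex engine; KEYWORDS, CONTEXT and snippet are illustrative names, and the bold formatting asked for in the question is not covered.

```python
import re

KEYWORDS = ["bot", "vehicle", "scrape"]  # the words searched for in the question
alternation = "|".join(map(re.escape, KEYWORDS))

# Up to three words before the keyword, the keyword, up to three words after it.
CONTEXT = re.compile(
    r"(?:\w+\W+(?:\w+\W+){0,2})?(?:" + alternation + r")(?:\W+\w+(?:\W+\w+){0,2})?"
)

text = (
    "A car (or automobile) is a wheeled motor vehicle used for transportation.\n"
    "Most definitions of car say they run primarily on roads, seat one to eight people,\n"
    "have four tires, and mainly transport people rather than goods."
)

m = CONTEXT.search(text)
snippet = m.group(0) if m else ""
print(snippet)
```

Because the keywords are not wrapped in \b, a keyword such as bot would also be found inside a longer word like robot; adding word boundaries around the alternation would be one way to tighten that.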

Google Sheet: custom formula to split words by uppercase

I would like to write a script for a custom formula in Google Sheets. The idea is to split a string composed of multiple words. The formula should detect where a capital letter starts a new word and separate the words there. The result would be a string where the words are separated by ",".
To clarify this is an example of the string:
Nursing StudentStudentNurseNursing School
Desired Result:
Nursing Student,Student,Nurse,Nursing School
I have tried to use a formula in Google Sheet:
=split(regexreplace(A1,"[A-Z][^A-Z]*","$0"&char(9)),char(9))
However, it generates 6 cells with the below strings:
Nursing Student Student Nurse Nursing School
Can anybody help me or give me some hint?
=REGEXREPLACE(A1,"(\B)([A-Z])",",$2")
\B - a non-word boundary (a position that is not a word boundary).
[A-Z] - an uppercase letter.
If an uppercase letter is found at a \B position, a comma is inserted in front of it.
If you plan to insert a comma in between a lowercase letter and an uppercase letter, you may use either of:
=REGEXREPLACE(A1,"([a-z])([A-Z])","$1,$2")
=REGEXREPLACE(A1,"([[:lower:]])([[:upper:]])","$1,$2")
where
([a-z]) / ([[:lower:]]) - Capturing group 1 (later referred to with $1 from the replacement pattern): any lowercase ASCII letter
([A-Z]) / ([[:upper:]]) - Capturing group 2 (later referred to with $2 from the replacement pattern): any uppercase ASCII letter
Note that another suggestion, based on a non-word boundary \B, that can be written as =REGEXREPLACE(A1,"\B[A-Z]",",$0"), will also match an uppercase letter after _ and any digit, so it might overfire if you do not expect that behavior.
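The difference between the two approaches is easy to see in a small Python sketch. It is only an illustration of the regex behavior, not a Sheets formula: Python's re uses \1 and \g<0> in replacements where REGEXREPLACE uses $1 and $0, and the tricky sample string is made up.

```python
import re

s = "Nursing StudentStudentNurseNursing School"

# Comma between a lowercase letter and the uppercase letter that follows it.
by_case = re.sub(r"([a-z])([A-Z])", r"\1,\2", s)

# Comma before any uppercase letter that is not at a word boundary.
by_nonboundary = re.sub(r"\B[A-Z]", r",\g<0>", s)

print(by_case)
print(by_nonboundary)

# The two differ once digits or underscores precede the uppercase letter.
tricky = "abc_Def1X"  # made-up input
tricky_by_case = re.sub(r"([a-z])([A-Z])", r"\1,\2", tricky)
tricky_by_nonboundary = re.sub(r"\B[A-Z]", r",\g<0>", tricky)
print(tricky_by_case, tricky_by_nonboundary)
```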

Excel 2007 VBA RegEx Help Needed

I'm working on an Excel 2007 VBA project that my client wants done yesterday and I need to use RegEx to locate strings within some pretty challenging data. This is my first exposure to RegEx so I'm stuck doing something I think is simple (maybe not) and I'm clueless.
I've added the reference to the VBScript RegEx engine (5.5) and RegEx is working O.K. in Excel - I just don't know how to construct the pattern statement. I need to locate occurrences of the word "trust" in a range of cells on a worksheet. In some of my data this word has been abbreviated "Tr". I have constructed the following RegEx statement to locate the word "trust" and all words that start with a space and contain "tr".
"trust| tr"
Unfortunately, this matches any word that contains "tr", like "trail", "tree", and so on. What I want to match is " tr" - meaning it has a leading space, the "tr", and nothing else in the word. Can somebody tell me what I need to do to make this happen?
I'm also going to need RegEx patterns for street addresses, city, state, and zip plus last name and first name. If there's a resource someone can point me to for these expressions I'd appreciate the help. I'm sorry to ask the group this question without spending the proper amount of time educating myself, but this is a time-sensitive project for which I need your expertise.
Thanks In Advance -
PS - Here a sample of data that I'm working with. I have this type of data present in 5 columns over 4,000 rows.
Jones Family **Trust**
3420 E Ave of the Ftns
3420 E Avenue of the Fountain
320 E ARROWHEAD **TRAILHEAD**
501 S 29TH ST
PO BOX 13422
71343 W Paradise Dr
152035 S 29TH ST
124 Owl Grove Pl
Johnson **Tr**
1900 E Arrowhead **Trl**
1900 E ARROWHEAD **TRL**
This is a sample from a column that predominantly contains street addresses. Other columns contain client names without addresses. So not every cell contains data that starts with a number.
I would rewrite your expression so that it finds trust and tr only where they are not preceded or followed by other letters, using \b, the word boundary assertion. \b matches at a position that is aptly called a "word boundary".
There are three different positions that qualify as word boundaries:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
For more information on word boundaries, see regular-expressions.info. I'm not affiliated with that site.
\b(?:trust|tr)\b
After viewing the above, if you're still set on requiring the tr preceded by a space, then use this \b(?:trust|\str)\b
Examples
Live Demo
https://regex101.com/r/xM4fR9/1
Note: I am assuming you're using the case insensitive flag for this
Explanation
NODE                     EXPLANATION
----------------------------------------------------------------------
  \b                       the boundary between a word char (\w) and
                           something that is not a word char
----------------------------------------------------------------------
  (?:                      group, but do not capture:
----------------------------------------------------------------------
    trust                    'trust'
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    tr                       'tr'
----------------------------------------------------------------------
  )                        end of grouping
----------------------------------------------------------------------
  \b                       the boundary between a word char (\w) and
                           something that is not a word char
----------------------------------------------------------------------
Or
The \b(?:trust|tr)\b expression isn't the most efficient, but it is readable.
A functionally identical, but more efficient regular expression would be:
\btr(?:ust)?\b
Here we're still using the \b word boundary, but we've just made the ust part of the word trust optional with the (?: ... )? construct.
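Both expressions can be checked against a few rows from the sample data. The sketch below uses Python's re rather than the VBScript RegExp object, on the assumption that \b and (?:...) behave the same in both; re.IGNORECASE plays the role of the case-insensitive flag mentioned above, and the variable names are made up.

```python
import re

# The readable form and the compact form from the answers, both case-insensitive.
READABLE = re.compile(r"\b(?:trust|tr)\b", re.IGNORECASE)
COMPACT = re.compile(r"\btr(?:ust)?\b", re.IGNORECASE)

samples = [
    "Jones Family Trust",
    "320 E ARROWHEAD TRAILHEAD",
    "Johnson Tr",
    "1900 E Arrowhead Trl",
    "501 S 29TH ST",
]

hits_readable = [s for s in samples if READABLE.search(s)]
hits_compact = [s for s in samples if COMPACT.search(s)]
print(hits_compact)
```

On these rows the two patterns select the same cells: the ones ending in Trust or Tr, but not TRAILHEAD or Trl.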