Hive REGEXP_EXTRACT returning null results - regex

I am trying to extract R7080075 and X1234567 from the sample data below. The format is always a single upper case character followed by 7 digit number. This ID is also always preceded by an underscore. Since it's user generated data, sometimes it's the first underscore in the record and sometimes all preceding spaces have been replaced with underscores.
I'm querying HDP Hive with this in the select statement:
REGEXP_EXTRACT(column_name,'[(?:(^_A-Z))](\d{7})',0)
I've tried addressing positions 0-2 and none return an error or any data. I tested the code on regextester.com and it highlighted the data I want to extract. When I then run it in Zepplin, it returns NULLs.
My regex experience is limited so I have reviewed the articles here on regexp_extract (+hive) and talked with a colleague. Thanks in advance for your help.
Sample data:
Sept Wk 5 Sunny Sailing_R7080075_12345
Holiday_Wk2_Smiles_X1234567_ABC

The Hive manual says this:
Note that some care is necessary in using predefined character classes: using '\s' as the second argument will match the letter s; '\\s' is necessary to match whitespace, etc.
Also, your expression includes unnecessary characters in the character class.
Try this:
REGEXP_EXTRACT(column_name,'_[A-Z](\\d{7})',0)
Since you want only the part without underscore, use this:
REGEXP_EXTRACT(column_name,'_([A-Z]\\d{7})',1)
It matches the entire pattern, but extracts only the second group instead of the entire match.
Or alternatively:
REGEXP_EXTRACT(column_name,'(?<=_)[A-Z]\\d{7}', 0)
This uses a regexp technique called "positive lookbehind". It translates to : "find me an upper case alphabet followed by 7 digits, but only if they are preceded by an _". It uses the _ for matching but doesn't consider it part of the extracted match.

Related

Regex for last n characters of String in PostgreSQL query

Regex checks wouldn't be a strong point of mine. This is trivial but after playing around with it for 15 minutes already I think it would be quicker posting here. Ultimately I want to filter out any results of a table where a certain text column value ends with S(01 -99), i.e. the letter S followed by 2 digits. Consider the following test query
select x.* from (
select
unnest(array['kjkjkj','jhjs01','kjkj11','kjhkjh','uusus','iiosis99']::text[])
as tests ) x
where RIGHT(x.tests,3) !~ 'S[0-9]{1,2}$'
This returns everything in the unnested array, whereas I'm hoping to return everything excluding the second and last values. Any pointers in the right direction would be much appreciated. I'm using PostgreSQL v11.9
You may actually use SIMILAR TO here since your pattern is not that complex:
SELECT * FROM table
WHERE column_name NOT SIMILAR TO '%S[0-9]{2}'
SIMILAR TO patterns require a full string match, so here, % matches any text from the start of the string, then S matches S and [0-9]{2} matches two digits that must be at the end of the string.
If you were to use a regex, you could use
WHERE column_name !~ 'S[0-9]{2}$'
Or, 'S[0-9]{1,2}$' if there can be one or two digits. Since the regex search in PostgreSQL does not require a full string match, it just matches S, two (or one or two with {1,2}) digits at the end of string ($).

Regex with 2 semi colons in notepad++

I have data like this
Giftsbirth;;Basket7;CC
Giftswedding;;Cake4;COD
I am trying to find a regex that will only select the second data (Basket7, Cake4).
From past help I tried something like
^(\w+ [^\v;;]+;;[^\v;]+)?.*
But I know that is not right
Please assist with the regex if you can
You could use a positive lookbehind (?<= to assert what is before is ;; and a positive lookahead (?= to assert that what follows is ;
Use a negative character class [^;]+ to match not a ; to match your values.
(?<=;;)[^;]+(?=;)
You may use
(?:.*;)?([^;\n\r]+);[^;\n\r]+$
Or,
.*?;;([^;\r\n]+)(?:;.*)?
and replace with $1.
Details
(?:.*;)? - an optional substring having 0+ chars other than line break chars, as many as possible, up to the ;
([^;\n\r]+) - Group 1: any one or more chars other than CR, LF and ;
; - a semi-colon
[^;\n\r]+ - any one or more chars other than CR, LF and ;
$ - end of line.
The second regex matches
.*?;; - any 0+ chars as few as possible up to (and including) the first ;;
([^;\r\n]+) - Group 1: any one or more chars other than CR, LF and ;
(?:;.*)? - an optional group matching 1 or 0 occurrences of a ; and then any 0+ chars up to the end of line
The $1 in the replacement is the value you need to keep.
You need to specify more precisely what "the second data (Basket7, Cake4)" means. This looks like CSV data with the ; set as separator, but that would place Basket7 and Cake4 in the third column, since the second column is empty. In order to write a regex that solves this problem in the general case, you need to take into account the full domain of possible lines, and you've only given two examples and let everyone guess what the underlying format and total possible variations might be.
For example, is it always reasonable to assume that that which you're looking for is always preceded by ;; and ends with a ;, and that ;; never occurs in other places than immediately before that which you're looking for? In that case, (?<=;;)([^;]*) captures this. But what if you encounter one of the following lines?
Giftsbirth;;;CC # Here, the thing matched is empty
Giftsbirth;1600;Basket7;CC # Here, the second column isn't empty
;;Basket7;CC # Here, the first column is empty
;;;CC # Here, all but the last column are empty
;;; # Here, all columns are empty
You may experience that various suggestions will give you "the right text", but if you test this on a limited subset that does not account for all variations that can reasonably be expected in the input, you will inevitably have to revise your regex.
Assuming this is a CSV where the fields don't contain literal ;s, and that you don't know anything about the length of any of the fields (and consequently that the second column isn't always empty), but that there are at least three columns, you could consider the regex:
^[^;]*;[^;]*;([^;]*)
(See demo at https://regex101.com/r/vhPNEj/1)
These assumptions may not be correct, but my ability to guess are much worse than yours, since you're sitting with a larger sample size of data. In order to succeed at automating your tasks, it is critical that you learn to modify code to fit your assumptions.
For example, you may want to disregard the cases where the third column is empty:
^[^;]*;[^;]*;([^;]+)
Here the difference is [^;]* changed into [^;]+.
Or you may want to take into account that the first column could contain semicolons when they are wrapped in double quotes, e.g. like "Giftsbirth; Holiday";;Basket7;CC:
^(?:[^;"]*|"[^"]*");[^;]*;([^;]*)
Here the difference is [^;]* changed into (?:[^;"]*|"[^"]*") being either [^;"]* (being all but ; and ") or "[^"]*" (being " followed by anything but ", which includes ;, followed by ").

Using Gsub to get matched strings in R - regular expression

I am trying to extract words after the first space using
species<-gsub(".* ([A-Za-z]+)", "\1", x=genus)
This works fine for the other rows that have two words, however row [9] "Eulamprus tympanum marnieae" has 3 words and my code is only returning the last word in the string "marnieae". How can I extract the words after the first space so I can retrieve "tympanum marnieae" instead of "marnieae" but have the answers stored in one variable called >species.
genus
[9] "Eulamprus tympanum marnieae"
Your original pattern didn't work because the subpattern [A-Za-z]+ doesn't match spaces, and therefore will only match a single word.
You can use the following pattern to match any number of words (other than 0) after the first, within double quotes:
"[A-Za-z]+ ([A-Za-z ]+)" https://regex101.com/r/p6ET3I/1
https://regex101.com/r/p6ET3I/2
This is a relatively simple, but imperfect, solution. It will also match trailing spaces, or just one or more spaces after the first word even if a second word doesn't exist. "Eulamprus " for example will successfully match the pattern, and return 5 spaces. You should only use this pattern if you trust your data to be properly formatted.
A more reliable approach would be the following:
"[A-Za-z]+ ([A-Za-z]+(?: [A-Za-z]+)*)"
https://regex101.com/r/p6ET3I/3
This pattern will capture one word (following the first), followed by any number of addition words (including 0), separated by spaces.
However, from what I remember from biology class, species are only ever comprised of one or two names, and never capitalized. The following pattern will reflect this format:
"[A-Za-z]+ ([a-z]+(?: [a-z]+)?)"
https://regex101.com/r/p6ET3I/4

Match Regex Starting After "X" Number of Characters

I am using regex in a Google script to normalize company names, and while I am getting very close to perfect with a combination of replacing certain words, punctuation, and spaces, my last step was to replace any word with 3 or fewer letters.
But that gets rid of a few companies with acronyms at the start of their name, ie AB Holding Company. I don't want this to match AB, I want it to find the rare "the", or company code (particularly foreign ones like SPA and NV along with Co and Inc). These codes are not necessarily at the end of the string, but they seem to always be at least 4 characters after the beginning.
I am currently using
text = text.replace(/\b[a-z]{1,3}\b)/i," ");
Ignore the [a-z] as missing caps, I've dealt with that separately
What I think would work is to "skip over" the first few characters, probably 4 to be safe, and maybe learn how to include spaces and/or digits in there for the future. So I wrote this after seeing 1 other related question here.
text = text.replace(/((.{4})(.*)\b[a-z]{1,3}\b)/i," ");
Scipts does not seem to allow a lookbehind, and my version doesn't seem to work. I'm lost.
I appreciate your help.
Here is a solution:
text = text.replace("/^(.{4}.*)(\b[a-z]{1,3}\b)(.*)/gmi", "$1$3");
What I have changed is:
enclosed all groups in parenthesis - so that they can be captured and used in the replacement;
since you mentioned that the word-to-be-replaced might not be in the end of the string, I also added a third group - to match everything after.
included the part before and after the word in the replacement string (group 1 and group 3).
However, note that it might return false positives - i.e. if a company name is Company ABC, Inc., it will also capture ABC. Thus, if you know the words you want to replace, it might be better to just use an alteration:
text = text.replace("/^(.{4}.*)\b(Co|Inc|SPA|NV|the)\b(.*)/gmi", "$1$3");

RegEx Lookaround issue

I am using Powershell 2.0. I have file names like my_file_name_01012013_111546.xls. I am trying to get my_file_name.xls. I have tried:
.*(?=_.{8}_.{6})
which returns my_file_name. However, when I try
.*(?=_.{8}_.{6}).{3}
it returns my_file_name_01.
I can't figure out how to get the extension (which can be any 3 characters. The time/date part will always be _ 8 characters _ 6 characters.
I've looked at a ton of examples and tried a bunch of things, but no luck.
If you just want to find the name and extension, you probably want something like this: ^(.*)_[0-9]{8}_[0-9]{6}(\..{3})$
my_file_name will be in backreference 1 and .xls in backreference 2.
If you want to remove everything else and return the answer, you want to substitute the "numbers" with nothing: 'my_file_name_01012013_111546.xls' -replace '_[0-9]{8}_[0-9]{6}' ''. You can't simply pull two bits (name and extension) of the string out as one match - regex patterns match contiguous chunks only.
try this ( not tested), but it should works for any 'my_file_name' lenght , any lenght of digit and any kind of extension.
"my_file_name_01012013_111546.xls" -replace '(?<=[\D_]*)(_[\d_]*)(\..*)','$2'
non regex solution:
$a = "my_file_name_01012013_111546.xls"
$a.replace( ($a.substring( ($a.LastIndexOf('.') - 16 ) , 16 )),"")
The original regex you specified returns the maximum match that has 14 characters after it (you can change to (?=.{14}) who is the same).
Once you've changed it, it returns the maximum match that has 14 characters after it + the next 3 characters. This is why you're getting this result.
The approach described by Inductiveload is probably better in case you can use backreferences. I'd use the following regex: (.*)[_\d]{16}\.(.*) Otherwise, I'd do it in two separate stages
get the initial part
get the extension
The reason you get my_filename_01 when you add that is because lookaheads are zero-width. This means that they do not consume characters in the string.
As you stated, .*(?=_.{8}_.{6}) matches my_file_name because that string is is followed by something matching _.{8}_.{6}, however once that match is found, you've only consumed my_file_name, so the addition of .{3} will then consume the next 3 characters, namely _01.
As for a regex that would fit your needs, others have posted viable alternatives.