hive regexp_extract weirdness - regex

I am having some problems with regexp_extract.
I am querying a tab-delimited file, and the column I'm checking has strings that look like this:
abc.def.ghi
Now, if I do:
select distinct regexp_extract(name, '[^.]+', 0) from dummy;
MR job runs, it works, and I get "abc" from index 0.
But now, if I want to get "def" from index 1:
select distinct regexp_extract(name, '[^.]+', 1) from dummy;
Hive fails with:
2011-12-13 23:17:08,132 Stage-1 map = 0%, reduce = 0%
2011-12-13 23:17:28,265 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201112071152_0071 with errors
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
Log file says:
java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row
Am I doing something fundamentally wrong here?
Thanks,
Mario

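From the docs (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF), it appears that regexp_extract() is applied to each record/line and extracts the part of it you ask for.
It seems to stop at the first match rather than matching globally, so the index refers to a capture group, not to the n-th match. Your pattern '[^.]+' has no capture group, so index 1 points at a group that doesn't exist, which is presumably why the job fails.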
0 = the entire match
1 = capture group 1
2 = capture group 2, etc ...
Paraphrased from the manual:
regexp_extract('foothebar', 'foo(.*?)(bar)', 2)
                                ^    ^
                         groups 1    2
This returns 'bar'.
So, in your case, to get the text after the dot, something like this might work:
regexp_extract(name, '\.([^.]+)', 1)
or this
regexp_extract(name, '[.]([^.]+)', 1)
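The index semantics can be checked outside Hive. Here is a minimal Python sketch, not Hive itself: it assumes Python's `re` module treats group indexes like the Java regex engine does for these simple patterns, and `regexp_extract_like` is a hypothetical helper name.

```python
import re

def regexp_extract_like(subject, pattern, index):
    """Rough stand-in for regexp_extract: first match only, then pick a group."""
    match = re.search(pattern, subject)
    if match is None:
        return ""  # assumption: no match yields an empty string
    return match.group(index)  # raises IndexError if the group does not exist

name = "abc.def.ghi"
print(regexp_extract_like(name, r"[^.]+", 0))      # abc (index 0 = whole match)
print(regexp_extract_like(name, r"\.([^.]+)", 1))  # def (index 1 = capture group 1)
```

Calling the helper with '[^.]+' and index 1 raises IndexError, because that pattern has no capture group 1.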
Edit:
I got re-interested in this; just an FYI, there could be a shortcut/workaround for you.
It looks like you want a particular segment separated by a dot (.) character, which is almost like split.
It's more than likely that the regex engine overwrites a capture group each time a quantifier repeats it.
You can take advantage of that with something like this:
Returns the first segment (abc) of abc.def.ghi:
regexp_extract(name, '^(?:([^.]+)\.?){1}', 1)
Returns the second segment (def) of abc.def.ghi:
regexp_extract(name, '^(?:([^.]+)\.?){2}', 1)
Returns the third segment (ghi) of abc.def.ghi:
regexp_extract(name, '^(?:([^.]+)\.?){3}', 1)
The index doesn't change (it still refers to capture group 1); only the regex repetition count changes.
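To see the repeated-group trick in action, here is a small Python sketch. It assumes Python's `re` keeps the last iteration of a repeated capture group, as the Java engine is expected to; `segment` is a hypothetical helper name.

```python
import re

def segment(name, n):
    """Return the n-th dot-separated segment via a repeated capture group."""
    pattern = r"^(?:([^.]+)\.?){%d}" % n
    match = re.match(pattern, name)
    # Group 1 holds whatever the last of the n iterations captured.
    return match.group(1) if match else None

for n in (1, 2, 3):
    print(n, segment("abc.def.ghi", n))  # 1 abc / 2 def / 3 ghi
```

One caveat with this exact pattern: because `\.?` is optional, the engine may backtrack and split a segment to satisfy the repetition count, so `segment("abc", 2)` returns "c" instead of failing.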
Some notes:
The regex ^(?:([^.]+)\.?){n} has problems, though.
It requires some text between the dots in each segment, or the regex won't match.
It could be changed to ^(?:([^.]*)\.?){n}, but this will match even if there are fewer than n-1 dots,
including the empty string. This is probably not desirable.
There is a way to do it that doesn't require text between the dots but still requires at least n-1 dots.
It uses a lookahead assertion and capture buffer 2 as a flag:
^(?:(?!\2)([^.]*)(?:\.|$())){2}
Everything else is the same.
So, if Hive uses Java-style regex, then this should work:
regexp_extract(name, '^(?:(?!\2)([^.]*)(?:\.|$())){2}', 1)
Change {2} to whatever segment is needed (this one gets segment 2);
it still returns capture buffer 1 after the N'th iteration.
Here it is broken down:
^                 # Beginning of string
(?:               # Grouping
  (?!\2)          # Assertion: capture buffer 2 is UNDEFINED
  ([^.]*)         # Capture buffer 1: optional non-dot chars, many times
  (?:             # Grouping
    \.            # Dot character
    |             # or,
    $ ()          # End of string, set capture buffer 2 DEFINED (prevents recursion when end of string)
  )               # End grouping
){3}              # End grouping, repeat group exactly 3 (or N) times (overwrites capture buffer 1 each time)
If the engine doesn't support assertions, then this won't work!

I think you have to make 'groups' no?
select distinct regexp_extract(name, '([^.]+)', 1) from dummy;
(untested)
I think it behaves like the java library and this should work, let me know though.

Related

Replacing Everything Except specific pattern BigQuery

I would like to use regex to replace everything (except a specific pattern) with empty string in BigQuery. I have following values:
AX/88/8888888
AX/99/999999
AX/11/222222 - AX/22/33333 - AX/999/99999
BX/99/9999
1234455121
AX/00/888888 // BX/890/90890
NULL
[XYZ-ASA
BX/890/90890 + AX/10/1010101
AX/99/9999M
AX/111/111,AX-99
AX/11/222222 BX/99/99 AX/22/33333
The pattern always starts with "AX", followed by a slash (/), some digits, another slash (/) and more digits, i.e. it is always AX/\d+/\d+.
I would like to remove anything (any character, bracket, digit, etc.) that doesn't follow the pattern mentioned above.
The only values in the dataset above where the pattern doesn't match at all are BX/99/9999, 1234455121, NULL and [XYZ-ASA.
("Doesn't match at all" means the entire value contains nothing
that matches AX/\d+/\d+.) In those situations, I would like to return the original text as the final output.
Where the pattern does match, for example in AX/00/888888 // BX/890/90890 and AX/111/111,AX-99, the rest needs to be removed (i.e. [// BX/890/90890] and [,AX-99]), so that only AX/00/888888 and AX/111/111 are returned as the final output.
The expected output from the above example is following:
AX/88/8888888
AX/99/999999
AX/11/222222 AX/22/33333 AX/999/99999
BX/99/9999
1234455121
AX/00/888888
NULL
[XYZ-ASA
AX/10/1010101
AX/99/9999
AX/111/111
AX/11/222222 AX/22/33333
Later I would like to split all the values by space to get each AX/xx/xx on its own row wherever there are several of them; for example, case 3 from above would produce 3 rows.
AX/88/8888888
AX/99/999999
AX/11/222222
AX/22/33333
AX/999/99999
BX/99/9999
1234455121
AX/00/888888
NULL
[XYZ-ASA
AX/10/1010101
AX/99/9999
AX/111/111
AX/11/222222
AX/22/33333
Use the query below:
select coalesce(result, col) as col
from your_table
left join unnest(regexp_extract_all(col, r'AX/\d+/\d+')) result
Applied to the sample data in your question, this returns one row per AX/\d+/\d+ match and keeps values with no match unchanged.
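For anyone who wants to sanity-check the logic without BigQuery, here is a rough Python sketch of the same idea (extract all matches, fall back to the original value when there are none). `explode` is a hypothetical helper name, and NULL is modelled as Python's None.

```python
import re

AX_PATTERN = re.compile(r"AX/\d+/\d+")

def explode(col):
    """Mimic LEFT JOIN UNNEST + COALESCE: one row per match, else the original value."""
    if col is None:
        return [None]
    matches = AX_PATTERN.findall(col)
    return matches if matches else [col]

rows = [
    "AX/11/222222 - AX/22/33333 - AX/999/99999",
    "BX/99/9999",
    "AX/99/9999M",
    "BX/890/90890 + AX/10/1010101",
    None,
]
for value in rows:
    print(explode(value))
```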

Regular Expression to match groups that may not exist

I'm trying to capture some data from logs in an application. The logs look like so:
*junk* [{count=240.0, state=STATE1}, {count=1.0, state=STATE2}, {count=93.0, state=STATE3}, {count=1.0, state=STATE4}, {count=1147.0, state=STATE5}, etc. ] *junk*
If the count for a particular state is ever 0, it actually won't be in the log at all, so I can't guarantee the ordering of the objects in the log (The only ordering is that they are sorted alphabetically by state name)
So, this is also a potential log:
*junk* [{count=240.0, state=STATE1}, {count=1.0, state=STATE4}, {count=1147.0, state=STATE5}, etc. ] *junk*
I'm somewhat new to using regular expressions, and I think I'm overdoing it, but this is what I've tried.
^[^=\n]*=(?:(?P<STATE1>\d+)(?=\.0,\s+\w+=STATE1))*.*?=(?P<STATE2>\d+)(?=\.0,\s+\w+=STATE2)*.*?=(?P<STATE3>\d+)(?=\.0,\s+\w+=STATE3)
The idea is that I look for the '=' and then look ahead to see whether this is the state that I want, which may or may not be there. Then I skip all the junk after the count until the next state that I'm interested in (I believe this is the part I'm having issues with). Sometimes it matches too far and skips the state I'm interested in, giving me a bad value. If I use the lazy operator (as above), sometimes it doesn't go far enough and gets the count for a state that comes before the one I want in the log.
See if this approach works for you:
Regex: (?<=count=)\d+(?:\.\d+)?(?=, state=(STATE\d+))
Group 1 will be the state name and the full match will be the count value.
You might use 2 capturing groups to capture the count and the state.
To capture for example STATE1, STATE2, STATE3 and STATE5, you could specify the numbers using a character class with ranges and / or an alternation.
{count=(\d+(?:\.\d+)?), state=(STATE(?:[123]|5))}
Explanation
{count=             Match literally
(                   Capture group 1
  \d+(?:\.\d+)?     Match 1+ digits with an optional decimal part
)                   Close group
, state=            Match literally
(                   Capture group 2
  STATE(?:[123]|5)  Match STATE and specify the allowed numbers
)}                  Close group and match }
If you want to match all states and digits:
{count=(\d+(?:\.\d+)?), state=(STATE\d+)}
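As a quick illustration of the two-capture-group approach, here is a Python sketch that collects every count by state and treats a missing state as 0. The helper name `counts_by_state` and the shortened log line are made up for the example.

```python
import re

# The braces are escaped here to be safe; the two groups are the same as above.
STATE_PATTERN = re.compile(r"\{count=(\d+(?:\.\d+)?), state=(STATE\d+)\}")

def counts_by_state(log_line):
    """Map each state found in the line to its count."""
    return {state: float(count) for count, state in STATE_PATTERN.findall(log_line)}

log = "*junk* [{count=240.0, state=STATE1}, {count=1.0, state=STATE4}, {count=1147.0, state=STATE5}] *junk*"
counts = counts_by_state(log)
print(counts.get("STATE1"), counts.get("STATE2", 0.0))  # 240.0 0.0
```

Because the matches are collected into a dict, the order of the states in the log and any missing states no longer matter.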
After some experimentation, this is what I've come up with:
The answers provided here, although good answers, don't quite work if your state names don't end with a number (mine don't, I just changed them to make the question easier to read and to remove business information from the question).
Here's a completely tile-able regex where you can add on as many matches as needed
count=(?P<GROUP_NAME_HERE>\d+(?=\.0, state=STATE_NAME_HERE))?
This can be copied and appended with the new state name and group name.
Additionally, if any of the states do not appear in the string, it will still match the following states. For example:
count=(?P<G1>\d+(?=\.0, state=STATE_ONE))?(?P<G2>\d+(?=\.0, state=STATE_TWO))?(?P<G3>\d+(?=\.0, state=STATE_THREE))?
will match states STATE_ONE and STATE_THREE with named groups G1 & G3 in the following string even though STATE_TWO is missing:
[{count=55.0, state=STATE_ONE}, {count=10.0, state=STATE_THREE}]
I'm sure this could be improved, but it's fast enough for me, and with 11 groups, regex101 shows 803 steps with a time of ~1ms
Here's a regex101 playground to mess with: https://regex101.com/r/3a3iQf/1
Notice how groups 1,2,3,4,5,6,7,9, & 11 match. 8 & 10 are missing and the following groups still match.
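The tiled pattern only fills one named group per match, so it has to be applied globally (one match per "count=" occurrence) and the results merged. A minimal Python sketch of that, where `named_counts` is a hypothetical helper name:

```python
import re

TILED = re.compile(
    r"count="
    r"(?P<G1>\d+(?=\.0, state=STATE_ONE))?"
    r"(?P<G2>\d+(?=\.0, state=STATE_TWO))?"
    r"(?P<G3>\d+(?=\.0, state=STATE_THREE))?"
)

def named_counts(log_line):
    """Merge the named groups that matched across all 'count=' occurrences."""
    found = {}
    for match in TILED.finditer(log_line):
        for name, value in match.groupdict().items():
            if value is not None:
                found[name] = value
    return found

line = "[{count=55.0, state=STATE_ONE}, {count=10.0, state=STATE_THREE}]"
print(named_counts(line))  # {'G1': '55', 'G3': '10'}
```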

Regex for pattern matching last instance with a range of values

I need to split a string on a repeated character, but split only on the last 3 instances and keep any earlier instances (0 or more) that precede those 3.
ex1: "Hi####there" after regex (and splitting) "Hi#","there"
ex2: "Hi###there" after regex (and splitting) "Hi","there"
Splitting on #{3} does not give the result I want: "Hi","#there"
edit:
I simplified my question a bit; serialization and the character I am using are not relevant. I'm just interested in the regex that will get me the last x characters in a run of y repeated characters. (#{3})(\w+) only returns hi#, and the rest of the string is lost.
This is done in JavaScript, but it works:
"Hi####there".split(/###(?=[^#])/); // Output = > ["Hi#", "there"]
"Hi###there".split(/###(?=[^#])/); // Output = > ["Hi", "there"]
// Tested on win7 with Chrome 45+
(?=[^#]) ... Lookahead that checks the character right after the splitting ### is something other than #
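The same pattern should carry over to other engines with lookahead support. Here is a small Python sketch of it; `split_last_three` is a hypothetical helper name.

```python
import re

# Split on '###' only when the next character is not another '#'.
SPLITTER = re.compile(r"###(?=[^#])")

def split_last_three(text):
    return SPLITTER.split(text)

print(split_last_three("Hi####there"))  # ['Hi#', 'there']
print(split_last_three("Hi###there"))   # ['Hi', 'there']
```

Note that (?=[^#]) needs a character to follow, so a ### at the very end of the string is not treated as a splitter.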

pl/sql negative look behind regex

I'm trying to implement a regular expression in pl/sql which excludes any results that are preceded by a given string.
data:
exclude this: 3
include this: 3
3
cvxcvxcv3
34edfgdsfg3
Using this regexp:
(?<!exclude this: )3\d{0}(\s|$)
What I would expect to be returned is:
exclude this: 3 <-- nothing
include this: 3 <- 3
3 <- 3
cvxcvxcv3 <- 3
34edfgdsfg3 <- the second 3 only
34edfgdsfg33 <- the last 3 only
This works fine when tested in notepad++, but it isn't working when implemented in pl/sql. Looking at similar questions, it appears that pl/sql doesn't fully support negative lookbehind. Does anyone know of a similar construct or a way to work around this?
While I am not aware of any general technique to emulate negative lookbehind with pl/sql regexes, in your particular case a solution is possible:
([^e].{13}|[^x].{12}|[^c].{11}|[^l].{10}|[^u].{9}|[^d].{8}|[^e].{7}|[^ ].{6}|[^t].{5}|[^h].{4}|[^i].{3}|[^s].{2}|[^:].|[^ ]|^)3[^0-9]?(\s|$)
The negative lookbehind applies to a literal. Therefore every forbidden prefix of the first character that must match is known beforehand, as is its length. This allows for a compact (well ...) specification as a regex that must match.
Not that I would recommend that for best practice ... or any practice at all ...
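To check the emulation against the original lookbehind, here is a Python sketch that runs both patterns over the sample lines from the question. It only shows that they agree on these samples, not that they are equivalent in general, and it assumes Python's `re` as a stand-in for both engines.

```python
import re

LOOKBEHIND = re.compile(r"(?<!exclude this: )3\d{0}(\s|$)")
EMULATED = re.compile(
    r"([^e].{13}|[^x].{12}|[^c].{11}|[^l].{10}|[^u].{9}|[^d].{8}|[^e].{7}"
    r"|[^ ].{6}|[^t].{5}|[^h].{4}|[^i].{3}|[^s].{2}|[^:].|[^ ]|^)3[^0-9]?(\s|$)"
)

SAMPLES = [
    "exclude this: 3",
    "include this: 3",
    "3",
    "cvxcvxcv3",
    "34edfgdsfg3",
    "34edfgdsfg33",
]

def matches(pattern, line):
    return pattern.search(line) is not None

for line in SAMPLES:
    # Second column: lookbehind result; third column: emulated result.
    print(repr(line), matches(LOOKBEHIND, line), matches(EMULATED, line))
```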
Update (processing advice):
The regex as it stands identifies matches without providing any further information for postprocessing. However, you can identify the offset of the match and the length of the forbidden prefix with the following code:
DECLARE
  s_data     VARCHAR2(4000);  -- This will contain the line you match against
  s_matchpos BINARY_INTEGER;  -- Offset of the 'interesting' part (digit '3' under the various constraints) in s_data
  s_prefix   VARCHAR2(100);   -- The prefix part of the match (capture group 1)
  s_re       VARCHAR2(4000);  -- The regex
BEGIN
  s_re := '([^e].{13}|[^x].{12}|[^c].{11}|[^l].{10}|[^u].{9}|[^d].{8}|[^e].{7}|[^ ].{6}|[^t].{5}|[^h].{4}|[^i].{3}|[^s].{2}|[^:].|[^ ]|^)3[^0-9]?(\s|$)';
  -- Take capture group 1 of the first match, starting at offset 1 of the data
  -- (the subexpression argument needs Oracle 11g or later).
  s_prefix := regexp_substr( s_data, s_re, 1, 1, NULL, 1 );
  -- Match start plus prefix length; NVL covers the empty prefix from the ^ alternative.
  s_matchpos := regexp_instr( s_data, s_re ) + NVL( length(s_prefix), 0 );
END;
As mentioned above, not necessarily to recommend as best practice ...

Why would regex to separate filename from extension not work in ColdFusion?

I'm trying to retrieve a filename without the extension in ColdFusion. I am using the following function:
REMatchNoCase( "(.+?)(\.[^.]*$|$)" , "Doe, John 8.15.2012.docx" );
I would like this to return an array like: ["Doe, John 8.15.2012","docx"]
but instead I always get an array with one element - the entire filename:["Doe, John 8.15.2012.docx"]
I tried the regex string above on rexv.org and it works as expected, but not on ColdFusion. I got the string from this SO question: Regex: Get Filename Without Extension in One Shot?
Does ColdFusion use a different syntax? Or am I doing something wrong?
Thanks.
Why you're not getting expected results...
The reason you are getting a one-item array with the whole filename is because your pattern matches the entire filename, and matches once.
It is capturing the two groups, but rematch returns arrays of matches, not arrays of the captured groups, so you don't see those groups.
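The difference between "the match" and "the captured groups" is easy to see in another engine. Here is a Python sketch with the same pattern; Python's `re` is only an assumed stand-in for illustration, not what ColdFusion uses internally.

```python
import re

PATTERN = re.compile(r"(.+?)(\.[^.]*$|$)")
filename = "Doe, John 8.15.2012.docx"

match = PATTERN.search(filename)
print(match.group(0))  # the whole match: Doe, John 8.15.2012.docx
print(match.groups())  # the captured groups: ('Doe, John 8.15.2012', '.docx')
```

A function that returns only whole matches gives you the first line; you need group access to get the second. Note that the second group includes the leading dot.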
How to solve the problem...
If you are dealing with simple files (i.e. no .htaccess or similar), then the simplest solution is to just use...
ListLast( filename , '.' )
...to get only the file extension, and to get the name without the extension you can do...
rematch( '.+(?=\.[^.]+$)' , filename )
This uses a lookahead to ensure there is a . followed by at least one non-. at the end of the string, but (since it's a lookahead) it is excluded from the match (so you only get the pre-extension part in your match).
To deal with files that have no extension (e.g. .htaccess or README) you can modify the above regex to ^.+?(?=(?:\.[^.]+)?$), which makes the extension optional. The name part has to become anchored and lazy (^.+?), because a greedy .+ followed by an optional extension would always swallow the whole filename. However, there isn't a trivial way to update the ListLast method for these (guess you'd need to check len(extension) LT len(filename)-1 or similar).
(optional) Accessing captured groups...
If you want to get at the actual captured groups, the closest native way to do this in CF is using the refind function, with the fourth argument set to true - however, this only gives you positions and lengths - requiring that you use mid to extract them yourself.
For this reason (amongst many others), I've created an improved regex implementation for CF, called cfRegex, which lets you return the group text directly (i.e. no messing around with mid).
If you wanted to use cfRegex, you can do so with your original pattern like so:
RegexMatch( '(.+?)(\.[^.]*$|$)' , filename , 1 , 0 , 'groups' )
Or with named arguments:
RegexMatch( pattern='(.+?)(\.[^.]*$|$)' , text=filename , returntype='groups' )
This returns an array of matches, with each element being an array of the captured groups for that match.
If you're doing lots of regex work dealing with captured groups, cfRegex is definitely better than doing it with CF's re methods.
If all you care about is getting the extension and/or the filename without the extension, then the earlier examples are sufficient.
@Peter's response is great; however, the approach is perhaps a bit longer-winded than necessary. One can do this with reMatch() with a slight tweak to the regex.
<cfscript>
param name="URL.filename";
sRegex = "^.+?(?=(?:\.[^.]+?)?$)";
aMatch = reMatch(sRegex, URL.filename);
writeDump(aMatch);
</cfscript>
This works on the following filename patterns:
foo.bar
foo
.htaccess
John 8.15.2012.docx
Explanation of the regex:
^        From the beginning of the string
.+?      One or more (+) characters (.), but the fewest (?) that will work with the rest of the regex. This is the file name.
(?=)     Look ahead. Make sure the stuff in here appears in the string, but don't actually match it. This is the key bit to NOT return any file extension that might be present.
(?:      Group this stuff together, but don't remember it for a back reference.
\.       A dot. This is the separator between file name and file extension.
[^.]+?   One or more (+) single ([]) non-dot characters (^.), again matching the fewest possible (?) that will allow the regex as a whole to work.
?        (This is the one after the (?:) group.) Zero or one of those groups, i.e. zero or one file extensions.
$        To the end of the string
I've only tested with those four file name patterns, but it seems to work OK. Other people might be able to finetune it.
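For a quick cross-check of those four cases, here is a Python sketch using the same regex. Python's `re` is assumed to behave like CF's engine for this pattern, `split_filename` is a hypothetical helper name, and the filename is assumed to be non-empty.

```python
import re

NAME_ONLY = re.compile(r"^.+?(?=(?:\.[^.]+?)?$)")

def split_filename(filename):
    """Return (name, extension); extension is '' when there is none."""
    name = NAME_ONLY.search(filename).group(0)
    extension = filename[len(name) + 1:]  # skip the separating dot, if any
    return name, extension

for f in ["foo.bar", "foo", ".htaccess", "John 8.15.2012.docx"]:
    print(split_filename(f))
```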
A few more ways of achieving the same result. They all execute in roughly the same amount of time.
<cfscript>
    str = 'Doe, John 8.15.2012.docx';

    // sans regex
    arr1 = [
        reverse( listRest( reverse( str ), '.' ) ),
        listLast( str, '.' )
    ];

    // using Java String lastIndexOf()
    arr2 = [
        str.substring( 0, str.lastIndexOf( '.' ) ),
        str.substring( str.lastIndexOf( '.' ) + 1 )
    ];

    // using listToArray with non-filename safe character replace
    arr3 = listToArray( str.replaceAll( '\.([^\.]+)$', '|$1' ), '|' );
</cfscript>