input.regex in Hive - regex

An earlier question was asked about the following dataset:
03-24-2014 fm506 TOTAL-PROCESS OK;HARD;1;PROCS OK: 717 processes
03-24-2014 fm504 CHECK-LOAD OK;SOFT;2;OK - load average: 54.61, 56.95
The input regex provided in that thread does not work at all, so I wrote two input regexes of my own and tested the first one at "http://www.regexplanet.com/advanced/java/index.html". The groups are captured perfectly there, but when I try it in Hive, it loads only NULL values.
The first input regex I provided is:
([^ ]*)\t+([^ ]*)\t+([^ ]*)\t+([^ ]*)
My second input regex is
^(\\S+)\\t+(\\S+)\\t+(\\S+)\\t+(\\S+)$
I thought it would work, but it also loads only NULL values.
Could you please let me know what's wrong with these two input regexes?

Your first pattern does not match the entire string: its field-matching parts are [^ ]*, that is, zero or more chars other than a space, so the last field, which contains spaces, cannot be matched.
The second regex has the same problem: its \S+ parts match one or more non-whitespace chars, so the final \S+ cannot match the last field.
You may use either of the following:
^(\S+)\t+(\S+)\t+(\S+)\t+(.+)
^([^\t]*)\t+([^\t]*)\t+([^\t]*)\t+(.*)
The [^\t]* matches any field in a tab-delimited text since it matches zero or more chars other than a tab.
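As a rough check outside Hive, the patterns can be tried in Python. This is only a sketch: it assumes the sample fields are separated by single tabs (the post shows them as spaces), and it uses fullmatch to imitate the "must match the entire string" behaviour described above. The names line, failing, fixed_nonspace, fixed_nontab and fields are made up for the illustration.

```python
import re

# Assumption: fields in the sample row are separated by single tab characters.
line = "03-24-2014\tfm506\tTOTAL-PROCESS\tOK;HARD;1;PROCS OK: 717 processes"

# The asker's second pattern: the final \S+ cannot cross the spaces in the last field.
failing = re.compile(r"^(\S+)\t+(\S+)\t+(\S+)\t+(\S+)$")

# The two suggested patterns: the last group accepts any characters, spaces included.
fixed_nonspace = re.compile(r"^(\S+)\t+(\S+)\t+(\S+)\t+(.+)")
fixed_nontab = re.compile(r"^([^\t]*)\t+([^\t]*)\t+([^\t]*)\t+(.*)")


def fields(pattern, text):
    """Return the captured groups if the pattern matches the whole line, else None."""
    m = pattern.fullmatch(text)
    return m.groups() if m else None


print(fields(failing, line))
print(fields(fixed_nonspace, line))
print(fields(fixed_nontab, line))
```

The first call yields None, while the other two return the four fields with the last one kept whole.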

Related

notepad++ regex divide two lists

I have the list below:
21870172299%3Akvm6wcmcVYaoQ2J%3A2 340282366841710300949128111982633033733
21200717504%3AUhGubOhpHPtBKLk%3A6 340282366841710300949128111984034029824
21256096197%3AMGYmtB2uoj4er5i%3A1 340282366841710300949128111984541030820
11665946937%3AHBBkUBzcy3cvbtb%3A5 340282366841710300949128111986242038268
21719881031%3AH3t9c4b7re6cs5%3A24 340282366841710300949128111986284030213
21697692027%3A1S0fM2Jp6Ivsxo9%3A5 340282366841710300949128111986299030036
20424141770%3AFPiScGMuAVBPGvk%3A7 340282366841710300949128111987613032298
I would like to use a regular expression to split it into two lists, for example:
list1:
21870172299%3Akvm6wccVYaoQ2J%3A2
21200717504%3AUhGubOpHPtBKLk%3A6
21256096197%3AMGYmtBuoj4er5i%3A1
11665946937%3AHBBkUBcy3cvbtb%3A5
21719881031%3AH3t9c4b7re6cs5%3A24
21697692027%3A1S0fMJp6Ivsxo9%3A5
20424141770%3AFPiSGMuAVBPGvk%3A7
list2:
340282366841710300949128111982633033733
340282366841710300949128111984034029824
340282366841710300949128111984541030820
340282366841710300949128111986242038268
340282366841710300949128111986284030213
340282366841710300949128111986299030036
340282366841710300949128111987613032298
I have tried an online regex tester (regex101), but without success.
Kindly help me split this list.
Thank you.
Copy this text and paste it twice into your text file, one copy below the other.
Select the first block of data.
Check the "In selection" option, use the pattern (^\S+).+ and replace with \1, which stands for the first capturing group.
Pattern explanation: ^ matches the beginning of a line, \S+ matches one or more non-whitespace characters, .+ matches one or more of any character, and (...) stores the matched text in the first capturing group.
Similarly, select the second block of data and use the pattern ^\S+\s+(.+), again replacing with \1.
\s+ matches one or more whitespace characters. Again, check the "In selection" check box.
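If a script is an option, the same two replacements can be sketched in Python. This is just an illustration with two of the sample rows; text, list1 and list2 are made-up names, and re.MULTILINE is used so that ^ matches at every line start, as it does in the editor.

```python
import re

# Two rows from the question, used as sample input.
text = """21870172299%3Akvm6wcmcVYaoQ2J%3A2 340282366841710300949128111982633033733
21200717504%3AUhGubOhpHPtBKLk%3A6 340282366841710300949128111984034029824"""

# (^\S+).+ replaced with \1 keeps only the first column of each line.
list1 = re.sub(r"(^\S+).+", r"\1", text, flags=re.MULTILINE).splitlines()

# ^\S+\s+(.+) replaced with \1 keeps only the second column of each line.
list2 = re.sub(r"^\S+\s+(.+)", r"\1", text, flags=re.MULTILINE).splitlines()

print(list1)
print(list2)
```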

Extracting part of a string using regex

I am trying to extract part of each of the strings below.
I tried (.*)(?:table)?, but it fails in the last case. How can I make the expression capture the entire string when the text "table" is absent?
Text: "diningtable" Expected match: "dining"
Text: "cookingtable" Expected match: "cooking"
Text: "cooking" Expected match: "cooking"
Text: "table" Expected match: ""
Rather than try to match everything but table, you should do a replacement operation that removes the text table.
Depending on the language, this might not even need regex. For example, in Java you could use:
String output = input.replace("table", "");
If you want to use regex, you can use this one:
(^.*)(?=table)|(?!.*table.*)(^.+)
The idea is: match everything from the beginning of the line (^) up to the word table, or, if table does not occur in the string, match at least one character (to avoid matching empty lines). Thus, when the string is just table, it returns an empty string, because it matches from the beginning of the line up to the word table.
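A quick way to check this pattern against the four cases from the question is a small Python sketch. The helper name before_table is made up for the example, and the whole match (group 0) is used as the result.

```python
import re

# Either the text before "table", or the whole line when "table" is absent.
pattern = re.compile(r"(^.*)(?=table)|(?!.*table.*)(^.+)")


def before_table(text):
    """Return the whole match, or None when the pattern does not match."""
    m = pattern.search(text)
    return m.group(0) if m else None


for sample in ["diningtable", "cookingtable", "cooking", "table"]:
    print(repr(sample), "->", repr(before_table(sample)))
```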
The (.*)(?:table)? pattern fails with table (it matches it) because the first group, (.*), is a greedy dot pattern that grabs the whole string into Group 1. The regex engine then tries table in the optional non-capturing group, finds nothing left to match, and settles for an empty match at the end of the string.
The trick is to let the first group match only text in which no table sequence starts, and to put it before the optional group:
^((?:(?!table).)+)(?:table)?$
Now, Group 1 - ((?:(?!table).)+) - contains a tempered greedy token (?:(?!table).)+ that matches 1 or more chars other than a newline that do not start a table sequence. Thus, the first group will never match table.
The anchors make the regex match the whole line.
NOTE: Non-regex solutions might turn out more efficient though, as a tempered greedy token is rather resource consuming.
NOTE2: Unrolling the tempered greedy token usually improves performance considerably:
^([^t]*(?:t(?!able)[^t]*)*)(?:table)?$
But usually it looks "cryptic", "unreadable", and "unmaintainable".
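Both variants can be compared side by side in Python. This is only a sketch; tempered, unrolled and group1 are names chosen for the illustration. One thing it makes visible: because the tempered version uses +, Group 1 needs at least one character, so the bare string table is not matched at all, whereas the unrolled version allows an empty Group 1.

```python
import re

# Tempered greedy token: consume a char only if "table" does not start there.
tempered = re.compile(r"^((?:(?!table).)+)(?:table)?$")

# Unrolled equivalent: runs of non-"t" chars, plus any "t" not followed by "able".
unrolled = re.compile(r"^([^t]*(?:t(?!able)[^t]*)*)(?:table)?$")


def group1(pattern, text):
    """Return Group 1 of a match anchored at the start, or None if no match."""
    m = pattern.match(text)
    return m.group(1) if m else None


for sample in ["diningtable", "cookingtable", "cooking", "table"]:
    print(repr(sample), repr(group1(tempered, sample)), repr(group1(unrolled, sample)))
```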
In addition to the other great answers, you could also use alternation:
^(?|(.*)table$|(.*))$
This makes use of a branch reset, so your desired content is always stored in group 1. If your language/tool of choice doesn't support it, you would have to check which of groups 1 and 2 contains the string.
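Python's standard re module is one example without branch reset, so there the fallback of checking both groups applies. A minimal sketch, with strip_table as a made-up helper name:

```python
import re

# Plain alternation: group 1 is set by the first branch, group 2 by the second.
pattern = re.compile(r"^(?:(.*)table|(.*))$")


def strip_table(text):
    """Return the text without a trailing "table", or None if nothing matches."""
    m = pattern.match(text)
    if m is None:
        return None
    # Only one of the two groups participates in the match; the other is None.
    return m.group(1) if m.group(1) is not None else m.group(2)


for sample in ["diningtable", "cookingtable", "cooking", "table"]:
    print(repr(sample), "->", repr(strip_table(sample)))
```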

Capture number between two whitespaces (RegEx)

I have the following data:
SOMEDATA .test 01/45/12 2.50 THIS IS DATA
and I want to extract the number 2.50 out of this. I have managed to do this with the following RegEx:
(?<=\d{2}\/\d{2}\/\d{2} )\d+.\d+
However that doesn't work for input like this:
SOMEDATA .test 01/45/12 2500 THIS IS DATA
In this case, I want to extract the number 2500.
I can't seem to figure out a regex rule for that. Is there a way to extract something between two spaces, that is, the text or number after the date up to the next whitespace? All I know is that the date will always have the same format, and that there will always be a space after the date and another space after the number I want to extract.
Can someone help me out with this?
Capture number between two whitespaces
A whitespace is matched with \s, and non-whitespace with \S.
So, what you can use is:
\d{2}\/\d{2}\/\d{2} +(\S+)
                      ^^^
The 1+ non-whitespace symbols are captured into Group 1.
If - for some reason - you need to only get the value as a whole match, use your lookbehind approach:
(?<=\d{2}\/\d{2}\/\d{2} )\S+
Or - if you are using PCRE - you may leverage the match reset operator \K:
\d{2}\/\d{2}\/\d{2} +\K\S+
                     ^^
NOTE: the \K and capture group approaches allow 1 or more spaces after the date and are thus more flexible.
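For a quick test, the capture group and lookbehind variants can be run in Python. Treat this as a sketch: the \K variant is left out because the standard re module does not support it, and the / is written unescaped since Python does not need it escaped. The names samples, capture and lookbehind are made up for the example.

```python
import re

samples = [
    "SOMEDATA .test 01/45/12 2.50 THIS IS DATA",
    "SOMEDATA .test 01/45/12 2500 THIS IS DATA",
]

# Capture group variant: one or more spaces after the date, value in group 1.
capture = re.compile(r"\d{2}/\d{2}/\d{2} +(\S+)")

# Lookbehind variant: the whole match is the value; exactly one space after the date.
lookbehind = re.compile(r"(?<=\d{2}/\d{2}/\d{2} )\S+")

values_capture = [capture.search(s).group(1) for s in samples]
values_lookbehind = [lookbehind.search(s).group(0) for s in samples]

print(values_capture)
print(values_lookbehind)
```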
I see some people have helped you already, but if you want an alternative for some reason, here's one that works too :)
.+ \d+\/\d+\/\d+ (\d+[\.\d]*)
The .+ matches everything up to the space before the date,
then \d+\/\d+\/\d+ matches the date, followed by a space,
and the capturing group is the number. As you can see, I made the last part optional, so both floating-point and integer values can be matched. Hope this helps!
Proof: https://regex101.com/r/fY3nJ2/1
Just make the fractional part optional:
(?<=\d{2}\/\d{2}\/\d{2} )\d+(?:\.\d+)?
Demo: https://regex101.com/r/jH3pU7/1
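A small Python sketch of the optional fractional part, in case it helps to try it locally. The names number_after_date and extract are made up for the example, and the third input is an invented non-numeric case to show that only numbers are accepted.

```python
import re

# Digits, optionally followed by a dot and more digits, right after "dd/dd/dd ".
number_after_date = re.compile(r"(?<=\d{2}/\d{2}/\d{2} )\d+(?:\.\d+)?")


def extract(text):
    """Return the number that follows the date, or None if there is none."""
    m = number_after_date.search(text)
    return m.group(0) if m else None


print(extract("SOMEDATA .test 01/45/12 2.50 THIS IS DATA"))
print(extract("SOMEDATA .test 01/45/12 2500 THIS IS DATA"))
print(extract("SOMEDATA .test 01/45/12 abc THIS IS DATA"))
```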
Update following clarifications in comments:
To match anything (except spaces) that is surrounded by spaces and preceded by the date, use:
(?<=\d{2}\/\d{2}\/\d{2} )\S+
Demo: https://regex101.com/r/jH3pU7/3
Rather than capture, you can make your entire match be the target text by using a look behind:
(?<=\d\d(\/\d\d){2} )\S+
This matches the first series of non-whitespace that follows a "date like" part.
Note also the reduction in the length of the "date like" pattern. You may consider using this part of the regex in whatever solution you use.

How can I replace only the first 2 matches per line, using regex in Notepad++

I'm trying to parse a list of filenames to a CSV file by converting the first 2 - characters per line into a |. The problem is that the filenames themselves also contain the character I'm searching for.
My raw data looks something like this:
12055371-1-Florence - BW Letter of Intent HB Comments 9-4-14-2.DOCX
12057668-2-EB-DUE-M- SBuxbaum FHA Benefit Plans-2.DOCX
12058210-1-Redline Letter of Intent-2.PDF
12058029-3-Florence Hospital--Order Establishing Bid Procedures-HB 9-23-14-2.DOCX
12058020-10-Florence - BW Letter of Intent 10,10,14 Revisions-2.DOCX
I am using Notepad++ to replace on the fly, but I'm not sure what regex will identify and replace these items.
Don't match the - on its own; match from the beginning of the line up to the second -:
Find what: ^(.*?)-(.*?)-
Replace with: \1|\2|
Explanation:
^ matches the beginning of the line (0-width match).
(.*?) matches any characters in a non-greedy way: it consumes as little as possible, so as soon as the next token of the regex can match, it lets it do so. The result is grouped so it can be referenced later.
\1 and \2 are back-references that refer to the two (.*?) groups.
Note: for efficiency you could replace the non-greedy matches with the negated class [^\-], which means every character but -, the - being escaped because it's a special character in this context. The groups would then become ([^\-]*). Of course, it really does not matter if it's a one-time operation.
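The same replacement can be sketched in Python for anyone scripting the conversion instead of using the editor. Two of the sample filenames are used; lines and converted are made-up names, and re.MULTILINE makes ^ match at the start of every line, so at most one replacement happens per line.

```python
import re

lines = """12055371-1-Florence - BW Letter of Intent HB Comments 9-4-14-2.DOCX
12058210-1-Redline Letter of Intent-2.PDF"""

# Anchor at the line start and rewrite only the first two hyphens.
converted = re.sub(r"^(.*?)-(.*?)-", r"\1|\2|", lines, flags=re.MULTILINE)

print(converted)
```

The later hyphens in each filename are left untouched because the pattern can only match once per line start.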

Regex expressions to match text between first comma and the comma before the first number

I have a csv file with all UK areas (43000 rows).
However, even though the fields are separated with commas, they are not enclosed with anything, hence if the field has commas within its contents, import to a database fails.
Fortunately, there is only one field that has commas within its content.
I need a regular expression that I could use to select this field on all rows.
Here is an example of data:
Aberaman,Rhondda, Cynon, Taf (Rhondda, Cynon, Taff),51.69N,03.43W,SO0101
Aberangell,Powys,52.67N,03.71W,SH8410
This should look like:
Aberaman,"Rhondda, Cynon, Taf (Rhondda, Cynon, Taff)",51.69N,03.43W,SO0101
Aberangell,"Powys",52.67N,03.71W,SH8410
So I need to basically select the second field, which is between the first comma and the comma just before the first number.
I will use Sublime Text 2 to perform this regex search.
Sublime Text 2 supports \K, so you can use the following.
Regex:
^[^,]*,\K(.*?)(?=,\d)
Replacement string:
"\1"
Explanation:
^ Asserts that we are at the start of a line.
[^,]* Matches any character other than a comma, zero or more times.
, Matches a literal comma.
\K Discards the previously matched characters from the final match.
(.*?)(?=,\d) Matches any character zero or more times, up to a point that is followed by , and a digit. The ? after * makes the match reluctant.
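Where \K is not available (Python's standard re module, for instance), a similar effect can be had by capturing the prefix and putting it back in the replacement. A minimal sketch under that assumption, with rows and quoted as made-up names:

```python
import re

rows = """Aberaman,Rhondda, Cynon, Taf (Rhondda, Cynon, Taff),51.69N,03.43W,SO0101
Aberangell,Powys,52.67N,03.71W,SH8410"""

# Group 1 stands in for the part that \K would discard; group 2 is the field to quote.
quoted = re.sub(r'^([^,]*,)(.*?)(?=,\d)', r'\1"\2"', rows, flags=re.MULTILINE)

print(quoted)
```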
You can also try capturing groups. Simply replace the match with $1"$2"$3 or \1"\2"\3:
^(\w+,)([^\d]*)(,.*)$
You can do it in Notepad++ as well.
Find what: ^(\w+,)([^\d]*)(,.*)$
Replace with: $1"$2"$3
A regex which should be able to solve your problem is:
^.*?,(.*?),\d+
This matches:
anything (non-greedy) up to the first comma (which will not be included in the result),
then anything up to the comma before the first number (which will be in a group),
with the additional condition that there has to be a number after that comma.
So your group is in $1.
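To see what ends up in the group, here is a short Python sketch using the two sample rows; second_field is a made-up helper name.

```python
import re

# Group 1 is the text between the first comma and the comma before the first number.
pattern = re.compile(r"^.*?,(.*?),\d+")


def second_field(row):
    """Return the captured second field, or None when the row does not match."""
    m = pattern.match(row)
    return m.group(1) if m else None


print(second_field("Aberaman,Rhondda, Cynon, Taf (Rhondda, Cynon, Taff),51.69N,03.43W,SO0101"))
print(second_field("Aberangell,Powys,52.67N,03.71W,SH8410"))
```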