How to extract a value from a string column using a Hive regex

I need to extract a field from a string column using Hive.
Input: [{"name":"MANAGER"}]
Output: MANAGER
I was able to fetch the record using the regular expression below, but I am not able to remove the ] from the output.
Query built:
select split(regexp_replace('([{"name":"MANAGER"}])','^\\(|\\)$|[{"}]',''),': *')[1];
Output obtained:
MANAGER]
Could you please help me remove the ] from the output and get only MANAGER in this example using Hive?

You can actually parse this with the get_json_object function, since the string you shared is a JSON string:
select get_json_object(regexp_replace('[{"name":"MANAGER"}]', '[\\[\\]]', ''), '$.name')
See the documentation:
get_json_object
A limited version of JSONPath is supported:
$ : Root object
. : Child operator
[] : Subscript operator for array
* : Wildcard for []
Syntax not supported that's worth noticing:
'' : Zero length string as key
.. : Recursive descent
@ : Current object/element
() : Script expression
?() : Filter (script) expression.
[,] : Union operator
[start:end:step] : array slice operator
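Incidentally, the reason the original attempt leaves a trailing ] is that ^\\(|\\)$ only strips literal parentheses, and the character class [{"}] doesn't include the square brackets. Here is a minimal Python sketch of both routes, using the sample string from the question (re and json merely stand in for Hive's regexp_replace and get_json_object):
import re
import json
s = '[{"name":"MANAGER"}]'
# Regex route: extend the character class so [ and ] are stripped too
cleaned = re.sub(r'[\[\]{"}]', '', s)   # -> 'name:MANAGER'
print(cleaned.split(':')[1])            # -> MANAGER
# JSON route (what the answer suggests): drop the array brackets,
# then read the name key, mirroring get_json_object's '$.name'
obj = json.loads(re.sub(r'[\[\]]', '', s))
print(obj['name'])                      # -> MANAGER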

Related

Fluentd Parsing

Hi, I'm trying to parse a single-line log using Fluentd. Here is the log I'm trying to parse:
F2:4200000000000000,F3:000000,F4:000000060000,F6:000000000000,F7:000000000,F8..........etc
It should be parsed into something like this:
{ "F2" : "4200000000000000", "F3" : "000000", "F4" : "000000060000" ............etc }
I tried to use a regex, but it's confusing and forces me to write multiple regexes for different keys and values. Is there an easier way to achieve this?
EDIT1: Heya! I will make this more detailed. I'm currently tailing logs with Fluentd into Elasticsearch+Kibana. Here is an unparsed example log that Fluentd sends to Elasticsearch:
21/09/02 16:36:09.927238: 1 frSMS:0:13995:#HTF4J::141:141:msg0210,00000000000000000,000000,000000,007232,00,#,F2:00000000000000000,F3:002000,F4:000000820000,F6:Random message and strings,F7:.......etc
The message Elasticsearch received:
{"message":"frSMS:0:13995:#HTF4J::141:141:msg0210,00000000000000000,000000,000000,007232,00,#,F2:00000000000000000,F3:002000,F4:000000820000,F6:Random digits and chars,F7:.......etc"}
This log has only a message key, so I can't index it and build a dashboard using only the whole message field. What I am trying to achieve is to catch only the useful fields, add a key where one is missing, and make indexing easier.
Expected output:
{"logdate" : "21/09/02 16:36:09.927238",
"source" : "frSMS",
"UID" : "#HTF4J",
"statuscode" : "msg0210",
"F2": "00000000000000000",
"F3": "randomchar314516",.....}
I used the regex plugin to parse it like this, but it was too overwhelming. Here is what I did so far:
^(?<logDate>\d{2}.\d{2}.\d{2}\s\d{2}:\d{2}:\d{2}.\d{6}\b)....(?<source>fr[A-Z]{3,4}|to[A-Z]{3,4}\b).(?<status>\d\b).(?<dummyfield>\d{5}\b).(?<HUID>.[A-Z]{5}\b)..(?<d1>\d{3}\b).(?<d2>\d{3}\b).(?<msgcode>msg\d{4}\b).(?<dummyfield1>\d{16}\b).(?<dummyfield2>\d{6}\b).(?<dummyfield3>\d{6,7}\b).(?<dummyfield4>\d{6}\b).(?<dummyfield5>\d{2}\b)...
Which results in:
"logDate": "21/09/02 16:36:09.205706",
"source": "toSMS" ,
"status": "0",
"dummyfield": "13995" ,
"UID" : "#HTFAA" ,
"d1" : "156" ,
"d2" : "156" ,
"msgcode" : "msg0210",
"dummyfield1" :"0000000000000000" ,
"dummyfield2" :"002000",
"dummyfield3" :"2000000",
"dummyfield4" :"00",
"dummyfield5" :"2000000" ,
"dummyfield6" :"867202"
This only applies to the example log, and it has useless fields like field1, dummyfield, dummyfield1, etc.
Other logs have the useful values and keys (date, source, msgcode, UID, F1, F2 fields) as I showed in the expected output. The not-useful fields are not static (they can be absent, or have fewer or more digits and characters), so they trigger a "pattern not matched" error.
So the questions are:
How do I capture the useful fields I mentioned using regex?
How do I capture the F1, F2, F3... fields that have different value patterns, like mixed character strings?
PS: I wrapped the regex I wrote in an HTML snippet so the <> capturing groups don't get deleted.
Regex pattern to use:
(F[\d]+):([\d]+)
This pattern will catch all the 'F' keys with whatever digits come after them - yes, even F105 still works. The whole 'F105' will be stored as the first group of your regex match.
The right part of the pattern catches the digits that follow ':' up until any character that is not a digit (i.e. ',', 'F', etc.) and stores them as the second group of your match.
Use
Depending on your coding language, you will have to iterate over your regex matches and extract group 1 and group 2 respectively.
Python example:
import re
log = 'F2:4200000000000000,F3:000000,F4:000000060000,F6:000000000000,F7:000000000,F105:9726450'
pattern = r'(F[\d]+):([\d]+)'   # group 1: the F-key, group 2: its digits
matches = re.finditer(pattern, log)
log_dict = {}
for match in matches:
    log_dict[match.group(1)] = match.group(2)   # e.g. 'F2' -> '4200000000000000'
print(log_dict)
Output
{'F2': '4200000000000000', 'F3': '000000', 'F4': '000000060000', 'F6': '000000000000', 'F7': '000000000', 'F105': '9726450'}
Assuming the logdate is static (pattern-wise), you can skip the useless values with a ".+" regex and collect the useful values by their patterns. So the regex will be like this:
(?<logdate>\d{2}.\d{2}.\d{2}\s\d{2}:\d{2}:\d{2}.\d{6}\b).+(?<source>fr[A-Z]{3,4}|to[A-Z]{3,4}).+(?<UID>#[A-Z0-9]{5}).+(?<statuscode>msg\d{4})
And the output will be like:
{"logdate" : "21/09/02 16:36:09.927238", "source" : "frSMS",
"UID" : "#HTF4J","statuscode" : "msg0210"}
And I'm working on getting the F2, F3, ... FN keys and values.
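As a quick sanity check of that pattern outside Fluentd, here is the same regex in Python against the sample line from the question (note Python spells named groups (?P<name>...) rather than the (?<name>...) form used above):
import re
line = ('21/09/02 16:36:09.927238: 1 frSMS:0:13995:#HTF4J::141:141:msg0210,'
        '00000000000000000,000000,000000,007232,00,#,F2:00000000000000000,'
        'F3:002000,F4:000000820000,F6:Random message and strings,F7:.......etc')
pattern = (r'(?P<logdate>\d{2}.\d{2}.\d{2}\s\d{2}:\d{2}:\d{2}.\d{6}\b).+'
           r'(?P<source>fr[A-Z]{3,4}|to[A-Z]{3,4}).+'
           r'(?P<UID>#[A-Z0-9]{5}).+'
           r'(?P<statuscode>msg\d{4})')
m = re.search(pattern, line)
print(m.groupdict() if m else 'pattern not matched')
# {'logdate': '21/09/02 16:36:09.927238', 'source': 'frSMS',
#  'UID': '#HTF4J', 'statuscode': 'msg0210'}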

WSO2 DAS Substring

I am currently using Siddhi QL, and I have a simple requirement.
Input data is given in quotes, e.g.:
"apple"
and the output would be:
apple
I have tried using
select substr(inputDATA,1,4) as out insert into outputStream;
Then I get the error:
"substr is neither a function nor an aggregated attribute"
However, I have also tried using JavaScript inside Siddhi to do the substring:
define function splitFn[JavaScript] return string {}
but I got:
jdk.nashorn.internal.runtime.ParserException: <eval>:1:22 Missing space after numeric literal var data = [""tempID=1wef"",0]
Do you have any alternative solution, or am I doing something wrong?
You have to use the Siddhi function with its str namespace:
str:substr(inputDATA,1,4)
E.g.:
select str:substr(inputDATA,1,4) as out insert into outputStream;
Refer to the Siddhi documentation for further details.
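For intuition, here is a rough Python equivalent of that call, assuming str:substr takes a begin index and a length (the "apple" input is the one from the question; if the quoted strings can be any length, stripping the quotes with a regex is more robust than fixed indexes):
import re
input_data = '"apple"'
# str:substr(inputDATA, 1, 4) would then roughly correspond to:
print(input_data[1:1 + 4])                # -> 'appl'
# Stripping the surrounding quotes regardless of length:
print(re.sub(r'^"|"$', '', input_data))   # -> 'apple'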

MongoDB regular expression not working as expected

db.col.find({_id: {term: "garcia"}})
finds the document with term = "garcia". However,
db.col.find({_id: {term: /garcia/}})
doesn't find anything. What's the reason?
Document:
{ "_id" : { "term" : "garcia" }, "count" : 43512, "count_users" : 15388 }
Your current query using {_id: {term: /garcia/}} is asking for an exact match on _id itself, not just the term field within it. So it's trying to find a doc where _id is an object with a single term field with a value of that regular expression.
Use dot notation to match the regular expression against just the term field:
db.col.find({'_id.term': /garcia/})
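A quick pymongo sketch of the difference (the database and collection names are hypothetical; the document is the one from the question):
import re
from pymongo import MongoClient
col = MongoClient().test.col   # hypothetical connection/collection
col.insert_one({'_id': {'term': 'garcia'}, 'count': 43512, 'count_users': 15388})  # run once
# Equality on the whole embedded _id document: matches
print(col.find_one({'_id': {'term': 'garcia'}}))
# Regex nested inside the embedded-document equality: matches nothing
print(col.find_one({'_id': {'term': re.compile('garcia')}}))
# Dot notation applies the regex to just the nested field: matches
print(col.find_one({'_id.term': re.compile('garcia')}))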

Pig JsonLoader issue - not parsing custom JSON correctly

I am new to Pig and am trying to parse a JSON file with the following structure:
{"id1":197,"id2":[
{"id3":"109.11.11.0","id4":"","id5":1391233948301},
{"id3":"10.10.15.81","id4":"","id5":1313393100648},
...
]}
The above file is called jsonfile.txt
alias = load 'jsonfile.txt' using JsonLoader('id1:int,id2:[id3:chararray,id4:chararray,id5:chararray]');
This is the error I get:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: mismatched input 'id3' expecting RIGHT_BRACKET
Do you know what I could be doing wrong?
The schema you passed to JsonLoader is not well formed.
The formats for complex data types are shown here:
Tuple: enclosed by (), items separated by ","
Non-empty tuple: (item1,item2,item3)
Empty tuple is valid: ()
Bag: enclosed by {}, tuples separated by ","
Non-empty bag: {(tuple1),(tuple2),(tuple3)}
Empty bag is valid: {}
Map: enclosed by [], items separated by ",", key and value separated by "#"
Non-empty map: [key1#value1,key2#value2]
Empty map is valid: []
Source : http://pig.apache.org/docs/r0.10.0/func.html#jsonloadstore
In other words, [] isn't an array; it's an associative table (a map) where "#" separates keys from values. Try using tuples (parentheses) instead:
'id1:int,id2:(id3:chararray,id4:chararray,id5:chararray)'
OR
'id1:int,id2:{(id3:chararray,id4:chararray,id5:chararray)}'
I couldn't test it, and I have never tried Pig, but according to the documentation it should work just fine (based on the following example):
a = load 'a.json' using JsonLoader('a0:int,a1:{(a10:int,a11:chararray)},a2:(a20:double,a21:bytearray),a3:[chararray]');
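A quick way to see why the bag-of-tuples shape fits is to load one record in Python: id2 comes back as a list of objects, which maps to Pig's bag of tuples rather than to its [] map type (the record is abbreviated from the question):
import json
line = ('{"id1":197,"id2":['
        '{"id3":"109.11.11.0","id4":"","id5":1391233948301},'
        '{"id3":"10.10.15.81","id4":"","id5":1313393100648}]}')
record = json.loads(line)
print(type(record['id2']))   # <class 'list'>, i.e. an ordered collection
for item in record['id2']:   # each element is one (id3, id4, id5) tuple
    print(item['id3'], item['id4'], item['id5'])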

Using a regular expression to match everything after a colon

I've got strings like this:
192.168.114.182:SomethingFun-1083:EOL/Nothing Here : MySQL Database 4.12
192.168.15.82:SomethingElse-1325083:All Types : PHP Version Info : 23
I'm trying to select this item in an Oracle database (using REGEXP_SUBSTR) and get all remaining text after the second colon (:).
So in the end, I'd like to get these values:
EOL/Nothing Here : MySQL Database 4.12
All Types : PHP Version Info : 23
I've tried these, but haven't found something that works:
REGEXP_SUBSTR(title,'[^:]+',1,3) as Title -- doesn't work if last field has ":"
REGEXP_SUBSTR(title,'(?:[^:]*:)+([^:]*)') as Title
How about REGEXP_REPLACE?
REGEXP_REPLACE(title,'^[^:]+:[^:]+:(.*)$', '\1') as Title
Oracle's regular expression functions tend to be CPU-intensive compared with the alternatives. Tom Kyte's advice is: "however, if you can do it without regular expressions - do it without them. regular expressions consume CPU hugely."
An alternative that avoids regular expressions would be substr(test_values, instr(test_values, ':', 1, 2) + 1):
SQL> create table t (test_values varchar2(100));
Table created.
SQL> insert into t values ('192.168.114.182:SomethingFun-1083:EOL/Nothing Here : MySQL Database 4.12');
1 row created.
SQL> insert into t values ('192.168.15.82:SomethingElse-1325083:All Types : PHP Version Info : 23');
1 row created.
SQL> commit;
Commit complete.
SQL> select substr(test_values, instr(test_values, ':', 1, 2) + 1)
2 from t;
SUBSTR(TEST_VALUES,INSTR(TEST_VALUES,':',1,2)+1)
-----------------------------------------------------
EOL/Nothing Here : MySQL Database 4.12
All Types : PHP Version Info : 23
Benchmarking is left as an exercise for the reader.
This should work:
^[^:]+:[^:]+:(.*)$
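For a quick check outside the database, here are both ideas in Python against the sample strings from the question (re.sub plays the role of REGEXP_REPLACE, and str.split mirrors the substr/instr approach):
import re
lines = [
    '192.168.114.182:SomethingFun-1083:EOL/Nothing Here : MySQL Database 4.12',
    '192.168.15.82:SomethingElse-1325083:All Types : PHP Version Info : 23',
]
for line in lines:
    # Keep only what the capture group after the second colon grabbed
    print(re.sub(r'^[^:]+:[^:]+:(.*)$', r'\1', line))
    # Split on at most two colons and keep the remainder
    print(line.split(':', 2)[2])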