Pig JsonLoader is not parsing custom JSON correctly (MapReduce)

I am new to Pig and am trying to parse a JSON file with the following structure:
{"id1":197,"id2":[
{"id3":"109.11.11.0","id4":"","id5":1391233948301},
{"id3":"10.10.15.81","id4":"","id5":1313393100648},
...
]}
The above file is called jsonfile.txt, and I load it with:
alias = load 'jsonfile.txt' using JsonLoader('id1:int,id2:[id3:chararray,id4:chararray,id5:chararray]');
This is the error I get.
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: mismatched input 'id3' expecting RIGHT_BRACKET
Do you know what I could be doing wrong?

Your Pig schema string is not well formed.
The formats for complex data types are shown here:
Tuple: enclosed by (), items separated by ","
Non-empty tuple: (item1,item2,item3)
Empty tuple is valid: ()
Bag: enclosed by {}, tuples separated by ","
Non-empty bag: {(tuple1),(tuple2),(tuple3)}
Empty bag is valid: {}
Map: enclosed by [], items separated by ",", key and value separated by "#"
Non-empty map: [key1#value1,key2#value2]
Empty map is valid: []
Source : http://pig.apache.org/docs/r0.10.0/func.html#jsonloadstore
In other words, [] aren't arrays; they're maps (associative tables) in which "#" separates keys from values. Try using tuples (parentheses) instead:
'id1:int,id2:(id3:chararray,id4:chararray,id5:chararray)'
OR
'id1:int,id2:{(id3:chararray,id4:chararray,id5:chararray)}'
I couldn't test it and have never tried Pig, but according to the documentation it should work just fine (based on the following example):
a = load 'a.json' using JsonLoader('a0:int,a1:{(a10:int,a11:chararray)},a2:(a20:double,a21:bytearray),a3:[chararray]');


Python dictionary map to SQL query string

I have a python dictionary that maps column names from a source table to a destination table.
Dictionary:
tablemap_computer = {
'ComputerID' : 'computer_id',
'HostName' : 'host_name',
'Number' : 'number'
}
I need to dynamically produce the following query string, such that it updates properly when new column-name pairs are added to the dictionary:
ComputerID=%(computer_id)s, HostName=%(host_name)s, Number=%(number)s
What is a good way to do this?
I have a start at this, but it is very messy and leaves an extra comma after the last element.
queryStrInsert = ''
for tm_key, tm_val in tablemap_computer.items():
queryStrInsert += tm_key+'=%('+tm_val+')s,'
You might want to try join with a list comprehension:
query = ', '.join([key + '=%(' + value + ')s' for key, value in tablemap_computer.items()])
Using the dictionary you gave as an example, one would end up with the following string:
HostName=%(host_name)s, Number=%(number)s, ComputerID=%(computer_id)s
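To show where that clause ends up, here is a minimal sketch assuming Python 3.7+ (where dicts preserve insertion order); the UPDATE statement and the table name computer are made up for illustration:

```python
# Hypothetical mapping, mirroring the question's tablemap_computer
tablemap_computer = {
    'ComputerID': 'computer_id',
    'HostName': 'host_name',
    'Number': 'number',
}

# join avoids the trailing comma left by manual concatenation
set_clause = ', '.join(
    '{}=%({})s'.format(col, param)
    for col, param in tablemap_computer.items()
)

# "computer" is a placeholder table name for illustration
query = 'UPDATE computer SET ' + set_clause
print(query)
```

The `%(name)s` placeholders are left intact, so the resulting string can be passed to a DB-API `cursor.execute(query, params)` call together with a parameter dictionary.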

casting dictionary in kdb

I want to cast a dictionary and log it.
dict:(enlist`code)!(enlist`B10005)
when I do
type value dict / 11h
but the key looks like ,code
when I do
type string value dict / 0h
I am not sure why.
I want to concatenate it with strings and log it, so it will be something like:
"The constraint is ",string key dict
But it did not work: each letter was printed on its own line. How can I cast the dictionary so that I can concatenate and log it?
Have a look at http://code.kx.com/q/ref/dotq/#qs-plain-text for logging arbitrary kdb+ datatypes.
q)show dict:`key1`key2!`value1`value2
key1| value1
key2| value2
q).Q.s dict
"key1| value1\nkey2| value2\n"
There are several things going on here.
dict has only one key/value pair, but that doesn't affect how key and value behave: they return all keys and all values. This is why type value dict is 11h, i.e. a list of symbols. For exactly the same reason, key dict is ,`code, where the comma means enlist: key dict is a list of symbols which (in your particular example) happens to contain just one symbol, `code.
string applied to a list of symbols converts every element of that list to a string and returns a list of strings
a string in q is a simple list of characters (see http://code.kx.com/wiki/Tutorials/Lists for more on simple and mixed lists)
when you join a simple list of characters like "The constraint is " with a list of strings, i.e. a list of lists of characters, the result can no longer be represented as a simple list and becomes a generic list. This is why q converts "The constraint is " (a simple list) to ("T";"h";"e";...) (a generic list) before joining, and why each character then gets displayed on a separate line.
I hope you understand now what's happening. Depending on your needs you can fix your code like this:
"The constraint is ",first string key dict / displays the first key
"The constraint is ",", "sv string key dict / displays all keys separated by commas
Hope this helps.
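The same idea, sketched in Python rather than q purely for comparison (the key names below are made up): the ", " sv step corresponds to joining the list of strings into one string before concatenating.

```python
# Made-up key names, standing in for `string key dict` in q
keys = ['code', 'region']

# join the list of strings into a single string first,
# then plain concatenation works without changing list structure
message = 'The constraint is ' + ', '.join(keys)
print(message)
```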
If you are looking for something for nice logging, something like this should help you (and it is generic).
Iterate through the values and convert them to strings:
s:{$[all 10h=type each x;","sv x;0h~t:type x;.z.s each x;10h~t;x;"," sv $[t<0;enlist#;(::)]string x]}/
String manipulation:
fs:{((,)string[count x]," keys were passed")," " sv/:("Key:";"and the values for it were:"),'/:flip (string key#;value)#\:s each x}
Examples:
d:((,)`a)!(,)`a
d2:`a`b!("he";"lo")
d3:`a`b!((10 20);("he";"sss";"ssss"))
Results and execution:
fs each (d;d2;d3)
You can obviously tailor this to your exact needs; note that it is not tested for complex dictionary values.

PIG regex extract then filter the unnamed regex tuple

I have a string as:
[["structure\/","structure\/home_page\/","structure\/home_page\/headline_list\/","structure\/home_page\/latest\/","topic\/","topic\/location\/","topic\/location\/united_states\/","topic\/location\/united_states\/ohio\/","topic\/location\/united_states\/ohio\/franklin\/","topic\/news\/","topic\/news\/politics\/","topic\/news\/politics\/elections\/,topic\/news\/politics\/elections\/primary\/"]]
I want to use REGEX_EXTRACT_ALL to turn it into elements in a tuple, separated by ",". Then I need to filter out the ones that contain structure or location.
However, I got an error saying it can't filter on the regex type. Any ideas?
By the way, the ending goal is to parse out the longest hierarchy like (topic|news|politics|elections|primary)
Update, the script:
data = load '/web/visit_log/20160303'
USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') as json:map[];
a = foreach data generate json#section as sec_type;
b = foreach act_flt GENERATE ..host, REGEX_EXTRACT_ALL(act_type, 'topic..(?!location)(.*?)"') as extr;
store b into '/user/tad/sec_hir';
The syntax for filter … matches seems incorrect. The data doesn't seem to have () in it.
c = filter b by not extr matches '(structure|location)';
Try changing this to
c = filter b by not (extr matches 'structure|location');
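One way to sanity-check the extraction and filtering logic outside Pig is a quick Python prototype. The input literal below is a shortened, hypothetical version of the string from the question, and the naive comma split assumes no commas inside individual entries:

```python
import re

# Shortened, hypothetical sample of the input string
raw = ('[["structure\\/","topic\\/location\\/united_states\\/",'
       '"topic\\/news\\/politics\\/elections\\/primary\\/"]]')

# strip the outer brackets, split on commas, undo the escaped slashes
paths = [p.strip('"').replace('\\/', '/')
         for p in raw.strip('[]').split(',')]

# drop entries mentioning structure or location
kept = [p for p in paths if not re.search(r'structure|location', p)]

# pick the deepest remaining hierarchy and render it as (a|b|c)
longest = max(kept, key=lambda p: p.count('/'))
result = '(' + '|'.join(longest.strip('/').split('/')) + ')'
print(result)
```

This mirrors the intended Pig pipeline: extract the candidate paths, filter out structure/location entries, then keep the longest hierarchy.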

Pig - Convert tuple to a string

I have a Pig script which returns output as tuples as shown below:
(1,2,3)
(a,b,c)
I am storing this output into a file, and it gets stored with the parentheses ( and ) as above. I would like the records stored in the file as below:
1,2,3
a,b,c
How can I get rid of the parenthesis before using 'STORE INTO' in Pig?
DUMP displays the records as tuples; that is why you are seeing 1,2,3 enclosed in parentheses ().
Ref : http://chimera.labs.oreilly.com/books/1234000001811/ch05.html#pl_dump
Using STORE on the alias will save the field values alone, without the parentheses.
Ref : http://chimera.labs.oreilly.com/books/1234000001811/ch05.html#pl_store

Why is this freemarker code failing when there is a comma in the list?

I have a map that contains a list (all the values in the list are strings):
["diameter":["1", "2", "3"]]
["length":["2", "3", "4"]]
I iterate through it in freemarker:
<#list product.getSortedVariantMap.keySet() as variantCode>
<#list product.getSortedVariantMap[variantCode] as variantValue>
This works fine. However, if one of the strings contains a comma, like this:
def returnValue = ["diameter":["3,5"]]
I get the following error:
?size is unsupported for: freemarker.ext.beans.SimpleMethodModel
The problematic instruction:
----------
==> list product.getSortedVariantMap[variantCode] as variantValue [on line 200, column 41 in product.htm]
I have no idea what the error could be; a comma in a string shouldn't cause it.
It depends on the FreeMarker configuration, but product.getSortedVariantMap most probably returns the method itself, not its return value. You should write product.sortedVariantMap. (Although I don't understand why it doesn't stop earlier, on product.getSortedVariantMap.keySet(). Maybe your example is not exactly what you actually ran?)