Implementing UPPER,TRIM and REPLACE in Apache Pig - replace

I am quiet new to pig environment. I have tried to implement my pig script file in two ways.
I.
data = LOAD 'sample2.txt' USING PigStorage(',') as(campaign_id:chararray,date:chararray,time:chararray,display_site:chararray,placement:chararray,was_clicked:int,cpc:int,keyword:chararray);
distinct_data = DISTINCT data;
val = foreach distinct_data generate campaign_id,date,time,UPPER(keyword),display_site,placement,was_clicked,cpc;
val1 = foreach val generate campaign_id,date,time,TRIM(keyword),display_site,placement,was_clicked,cpc;
val2 = foreach val1 generate campaign_id,REPLACE(date, '-', '/'),time,keyword,display_site,placement,was_clicked,cpc;
dump val2;
i get error:
2016-09-29 02:45:40,826 INFO org.apache.pig.Main: Apache Pig version
0.10.0-cdh4.2.1 (rexported) compiled Apr 22 2013, 12:04:54 2016-09-29 02:45:40,827 INFO org.apache.pig.Main: Logging error messages to:
/home/training/training_materials/analyst/exercises/pig_etl/pig_1475131540824.log
2016-09-29 02:45:42,371 ERROR org.apache.pig.tools.grunt.Grunt: ERROR
1025: Invalid field
projection. Projected field [keyword] does not exist in schema:
campaign_id:chararray,date:chararray,time:chararray,org.apache.pig.builtin.upper_keyword_12:chararray,display_site:chararray,placement:chararray,was_clicked:int,cpc:int.
Details at logfile: /home/hduser/pig_etl/pig_1475131540824.log
But When i integrate the UPPER,TRIM and REPLACE in one statement then it works:
II.
data = LOAD 'sample2.txt' USING PigStorage(',') as(campaign_id:chararray,date:chararray,time:chararray,display_site:chararray,placement:chararray,was_clicked:int,cpc:int,keyword:chararray);
distinct_data = DISTINCT data;
val = foreach distinct_data generate campaign_id,REPLACE(date, '-', '/'),time,TRIM(UPPER(keyword)),display_site,placement,was_clicked,cpc;
dump val;
So, I just want someone to explain me that why I. method didn't work and what is the error message.

While you are applying TRIM in val1 there is nothing called "keyword" in val.
Note when you are applying any Function use alias so that error u can avoid..
or before creating a new relation it is always good to use describe so that the schema is clear to u..
Solution will be:
data = LOAD 'sample2.txt' USING PigStorage(',') as(campaign_id:chararray,date:chararray,time:chararray,display_site:chararray,placement:chararray,was_clicked:int,cpc:int,keyword:chararray);
distinct_data = DISTINCT data;
val = foreach distinct_data generate campaign_id,date,time,UPPER(keyword) as keyword,display_site,placement,was_clicked,cpc;
val1 = foreach val generate campaign_id,date,time,TRIM(keyword) as keyword,display_site,placement,was_clicked,cpc;
val2 = foreach val1 generate campaign_id,REPLACE(date, '-', '/') as date,time,keyword,display_site,placement,was_clicked,cpc;
dump val2;

Related

Dynamically referring to JSON node in Power Query M

I have a function that extracts a node from JSON document as follows:
...
Json = GetJson(Url),
Value = Json[#"values"]
values correspond to the actual node within the JSON document.
I would like to generalize this piece of code and provide the name of the node as a variable like:
let myFunc = (parentNodeName as text) =>
...
Json = GetJson(Url),
Value = Json[parentNodeName]
However getting this error:
An error occurred in the ‘myFunc’ query. Expression.Error: The field 'parentNodeName' of the record wasn't found.
How can I refer to the JSON node dynamically?
Try
(Json, parentNodeName ) =>
let
...
Value = Record.Field(Json,parentNodeName)
in Value
sample code:
let Json = Json.Document(Web.Contents("http://soundcloud.com/oembed?url=http%3A//soundcloud.com/forss/flickermood&format=json")),
Value=myFunc(Json,"title")
in Value
and myFunc:
(Json, parentNodeName ) =>
let
Value = Record.Field(Json,parentNodeName)
in Value

In Power Query, how can I remove duplicates either side of a delimiter?

I wish to turn : into :
For example amazon:amazon becomes amazon:
This is doable by hand using the replace values function but I need a way to do it programatically.
Thanks!
You can try this Transform but if it doesn't work, provide detail as to the
nature of the failure
examples of data on which it doesn't work
any error messages and the line which returns the error
remDups = Table.TransformColumns(#"Changed Type",{"Column1", each
let
sep = ":",
splitList = Text.Split(_, " "),
sepString = List.FindText(splitList,sep){0},
sepStringPosition = List.PositionOf(splitList,sepString),
//rem if the same remove last
splitSep = Text.Split(sepString, sep),
replString = if splitSep{0} = splitSep{1} then splitSep{0} & sep else sepString,
//put the string backtogether
replList = List.ReplaceRange(splitList,sepStringPosition,1,{replString})
in
Text.Combine(replList," ")
})

How to save string as json in scala spark

I have the raw of string in logs file . I do many filter and other operation after that . I have reached the following problem as below. I need to convert the string into json format . So that i can save it as a single object.
Suppose i have the following data
Val CDataTime = "20191012"
Val LocationId = "12345"
Val SetInstruc = "Comm=Qwe123,Elem=12345,Elem123=Test"
I am trying to create a data frame that contains datetime|location|jsonofinstruction
The Jsonofstring is the json of third Val; I try to split the string first by comma than by equal to sign and loop through by 2 and create a map of value of one and 2 as value. But json not created . Please help here.
You can use scala.util.parsing.json.JSONObject to convert a map to JSON and then to a string.
val df = spark.createDataset(Seq("Comm=Qwe123,Elem=12345,Elem123=Test")).toDF("col3")
val dfWithJson = df.map{ row =>
val insMap = row.getAs[String]("col3").split(",").map{kv =>
val kvArray = kv.split("=")
(kvArray(0),kvArray(1))
}.toMap
val insJson = JSONObject(insMap).toString()
(row.getAs[String]("col3"),insJson)
}.toDF("col3","col4").show()
Result -
+--------------------+--------------------+
| col3| col4|
+--------------------+--------------------+
|Comm=Qwe123,Elem=...|{"Comm" : "Qwe123...|
+--------------------+--------------------+

get Unique record among Duplicates Using mapReduce

File.txt
123,abc,4,Mony,Wa
123,abc,4, ,War
234,xyz,5, ,update
234,xyz,5,Rheka,sild
179,ijo,6,all,allSingle
179,ijo,6,ball,ballTwo
1) column1,column2,colum3 are primary Keys
2) column4,column5 are comparision Keys
I have a file with duplicate records like above In this duplicate record i need to get only one record among duplicates based on sorting order.
Expected Output:
123,abc,4, ,War
234,xyz,5, ,update
179,ijo,6,all,allSingle
Please help me. Thanks in advance.
You can try the below code:
data = LOAD 'path/to/file' using PigStorage(',') AS (col1:chararray,col2:chararray,col3:chararray,col4:chararray,col5:chararray);
B = group data by (col1,col2,col3);
C = foreach B {
sorted = order data by col4 desc;
first = limit sorted 1;
generate group, flatten(first);
};
In the above code, you can change the sorted variable to choose the column you would like to consider for sorting and the type of sorting. Also, in case you require more than one record, you can change the limit to greater than 1.
Hope this helps.
Questions isn't soo clear , but I understand this is what you need :
A = LOAD 'file.txt' using PigStorage(',') as (column1,column2,colum3,column4,column5);
B = GROUP A BY (column1,column2,colum3);
C = FOREACH B GENERATE FLATTERN(group) as (column1,column2,colum3);
DUMP C;
Or
A = LOAD 'file.txt' using PigStorage(',') as (column1,column2,colum3,column4,column5);
B = DISTINCT(FOREACH A GENERATE column1,column2,colum3);
DUMP B;

Can you use the ternary operator with STORE in pig

In apache pig if I want to conditionally store some data and I try to do it like so:
data1 = ....;
data2 = ....;
STORE (condition ? data1 : data2) INTO '$output' USING PigStorage(",");
--assuming pig is smart enough not to run the query for data1 or data2 depending on the condition
Then I get a syntax error:
SEVERE: exception during parsing: Error during parsing. <file test.pig, line 38, column 6> Syntax error, unexpected symbol at or near '('
Failed to parse: <file test.pig, line 38, column 6> Syntax error, unexpected symbol at or near '('
Am I using the ternary operator in pig incorrectly, and if this is not possible is there another way I can achieve conditional storage in pig, preferably without writing a UDF.
You can not use ternary operation in the STORE statement as you are trying to do in the question.
You can add the condition column to both data1 and data2, then take a UNION and then filter the UNION'd data based on condition value.
data1 = ....
data1a = CROSS data1, condition;
data2 = ....
data2a = CROSS data2, condition;
data12 = UNION data1a, data2a;
final = FILTER data12 BY condition == true;
STORE final INTO '$output' USING PigStorage(",");
Hope this helps.