In Apache Pig, if I want to conditionally store some data and I try to do it like so:
data1 = ....;
data2 = ....;
STORE (condition ? data1 : data2) INTO '$output' USING PigStorage(",");
--assuming pig is smart enough not to run the query for data1 or data2 depending on the condition
Then I get a syntax error:
SEVERE: exception during parsing: Error during parsing. <file test.pig, line 38, column 6> Syntax error, unexpected symbol at or near '('
Failed to parse: <file test.pig, line 38, column 6> Syntax error, unexpected symbol at or near '('
Am I using the ternary operator in Pig incorrectly? And if this is not possible, is there another way I can achieve conditional storage in Pig, preferably without writing a UDF?
You cannot use a ternary operation in the STORE statement the way you are trying to in the question.
You can attach the condition as a column to both data1 and data2, take a UNION of the two relations, and then filter the UNION'd data on the condition value:
data1 = ....;
data1a = CROSS data1, condition;   -- stamp the condition flag onto every row of data1
data2 = ....;
data2a = CROSS data2, condition;   -- likewise for data2 (see the note below)
data12 = UNION data1a, data2a;
final = FILTER data12 BY condition == true;
STORE final INTO '$output' USING PigStorage(",");
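Note that condition here must itself be a relation (CROSS takes relations, not scalar variables), and if the very same flag is stamped onto both inputs the FILTER keeps either all rows or none; for a true either/or store, data2 should carry the negated flag. A minimal, untested sketch of building such flags, assuming the condition reduces to a single boolean derived from some relation (all names below are hypothetical):
-- Hypothetical sketch: one-tuple relations holding the flag and its negation.
-- Here the condition is "some_data is non-empty".
grp   = GROUP some_data ALL;
cond  = FOREACH grp GENERATE (COUNT(some_data) > 0 ? true : false) AS flag;
ncond = FOREACH cond GENERATE (flag ? false : true) AS flag;
Then use CROSS data1, cond and CROSS data2, ncond above, and FILTER on flag == true, so that exactly one branch survives.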
Hope this helps.
File.txt
123,abc,4,Mony,Wa
123,abc,4, ,War
234,xyz,5, ,update
234,xyz,5,Rheka,sild
179,ijo,6,all,allSingle
179,ijo,6,ball,ballTwo
1) column1,column2,colum3 are the primary keys
2) column4,column5 are the comparison keys
I have a file with duplicate records like the ones above. From each set of duplicates I need to keep only one record, chosen by sort order.
Expected Output:
123,abc,4, ,War
234,xyz,5, ,update
179,ijo,6,all,allSingle
Please help me. Thanks in advance.
You can try the code below:
data = LOAD 'path/to/file' USING PigStorage(',') AS (col1:chararray,col2:chararray,col3:chararray,col4:chararray,col5:chararray);
B = GROUP data BY (col1,col2,col3);
C = FOREACH B {
    sorted = ORDER data BY col4 DESC;
    first = LIMIT sorted 1;
    GENERATE group, FLATTEN(first);
};
In the code above, you can change the ORDER clause in sorted to pick the column and direction used for ranking. Also, in case you require more than one record per group, you can raise the LIMIT above 1.
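Alternatively, the built-in TOP function collapses the nested ORDER/LIMIT into a single call; a sketch (TOP(n, column, bag) returns the n tuples with the largest values in the given 0-based column, so index 3 refers to col4):
-- Same idea using the TOP builtin: one tuple per group, ranked descending on col4.
B = GROUP data BY (col1,col2,col3);
C = FOREACH B GENERATE FLATTEN(TOP(1, 3, data));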
Hope this helps.
The question isn't very clear, but I understand this is what you need:
A = LOAD 'file.txt' using PigStorage(',') as (column1,column2,colum3,column4,column5);
B = GROUP A BY (column1,column2,colum3);
C = FOREACH B GENERATE FLATTEN(group) as (column1,column2,colum3);
DUMP C;
Or
A = LOAD 'file.txt' using PigStorage(',') as (column1,column2,colum3,column4,column5);
A1 = FOREACH A GENERATE column1,column2,colum3;
B = DISTINCT A1;
DUMP B;
I am quite new to the Pig environment. I have tried to implement my Pig script in two ways.
I.
data = LOAD 'sample2.txt' USING PigStorage(',') as(campaign_id:chararray,date:chararray,time:chararray,display_site:chararray,placement:chararray,was_clicked:int,cpc:int,keyword:chararray);
distinct_data = DISTINCT data;
val = foreach distinct_data generate campaign_id,date,time,UPPER(keyword),display_site,placement,was_clicked,cpc;
val1 = foreach val generate campaign_id,date,time,TRIM(keyword),display_site,placement,was_clicked,cpc;
val2 = foreach val1 generate campaign_id,REPLACE(date, '-', '/'),time,keyword,display_site,placement,was_clicked,cpc;
dump val2;
I get this error:
2016-09-29 02:45:40,826 INFO org.apache.pig.Main: Apache Pig version 0.10.0-cdh4.2.1 (rexported) compiled Apr 22 2013, 12:04:54
2016-09-29 02:45:40,827 INFO org.apache.pig.Main: Logging error messages to: /home/training/training_materials/analyst/exercises/pig_etl/pig_1475131540824.log
2016-09-29 02:45:42,371 ERROR org.apache.pig.tools.grunt.Grunt: ERROR 1025: Invalid field projection. Projected field [keyword] does not exist in schema: campaign_id:chararray,date:chararray,time:chararray,org.apache.pig.builtin.upper_keyword_12:chararray,display_site:chararray,placement:chararray,was_clicked:int,cpc:int.
Details at logfile: /home/hduser/pig_etl/pig_1475131540824.log
But when I combine the UPPER, TRIM and REPLACE calls in one statement, it works:
II.
data = LOAD 'sample2.txt' USING PigStorage(',') as(campaign_id:chararray,date:chararray,time:chararray,display_site:chararray,placement:chararray,was_clicked:int,cpc:int,keyword:chararray);
distinct_data = DISTINCT data;
val = foreach distinct_data generate campaign_id,REPLACE(date, '-', '/'),time,TRIM(UPPER(keyword)),display_site,placement,was_clicked,cpc;
dump val;
So, I just want someone to explain why method I didn't work and what the error message means.
When you apply TRIM in val1, there is nothing called "keyword" in val: UPPER(keyword) without an alias produced a field named org.apache.pig.builtin.upper_keyword_12, as the error message shows.
Note: whenever you apply a function, give the result an alias with AS so you can avoid this error.
Alternatively, before building on a new relation, it is always good to run DESCRIBE so that the schema is clear to you.
The solution is:
data = LOAD 'sample2.txt' USING PigStorage(',') as(campaign_id:chararray,date:chararray,time:chararray,display_site:chararray,placement:chararray,was_clicked:int,cpc:int,keyword:chararray);
distinct_data = DISTINCT data;
val = foreach distinct_data generate campaign_id,date,time,UPPER(keyword) as keyword,display_site,placement,was_clicked,cpc;
val1 = foreach val generate campaign_id,date,time,TRIM(keyword) as keyword,display_site,placement,was_clicked,cpc;
val2 = foreach val1 generate campaign_id,REPLACE(date, '-', '/') as date,time,keyword,display_site,placement,was_clicked,cpc;
dump val2;
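For instance, a quick DESCRIBE after the first FOREACH should confirm that the alias is preserved (the output below is a sketch from memory, not from a live run):
DESCRIBE val;
-- expected: val: {campaign_id: chararray,date: chararray,time: chararray,keyword: chararray,display_site: chararray,placement: chararray,was_clicked: int,cpc: int}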
Any ideas on how to perform addition of several columns in an Idiorm query?
I want the aliases 'Due' and 'TotalDue' to hold the sum of the specified columns, but I keep getting a fatal error.
Any assistance will be really appreciated.
This is my sample query:
$d = ORM::for_table('transaction_s')
->select_many(array('ACCT_TYPE'=>'transaction_s.TRANS_TYP',
'ADate'=>'transaction_s.ACCTDATE',
'Str'=>'accounts_master.PHYS_ADDRESS',
'Zone'=>'accounts_master.ZONE_ID',
'Name'=>'customers.CUSTOMER_NAME',
'Address'=>'customers.POST_ADDRESS',
'TelNo'=>'customers.TEL_NO',
'AcctNo'=>'transaction_s.ACCT_NO',
'smscost'=>'transaction_s.INT_AMOUNT',
'Due'=>'(transaction_s.water_DUE+transaction_s.METER_RENT+transaction_s.SEWER+
transaction_s.Conserve+transaction_s.INT_AMOUNT+transaction_s.BIN_HIRE)',
'TotalDue' =>'(transaction_s.water_DUE+transaction_s.METER_RENT+transaction_s.SEWER
+transaction_s.Conserve+transaction_s.PREVBAL
+transaction_s.INT_AMOUNT+transaction_s.BIN_HIRE)',
'OutStanding'=>'transaction_s.water_OUTSTANDING'
)
->inner_join('accounts_master', 'customers.CUSTOMER_NO = accounts_master.CUSTOMER_NO')
->inner_join('customers', 'customers.CUSTOMER_NO = accounts_master.CUSTOMER_NO')
->find_many();
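A likely cause of the fatal error is that select_many() treats each value as a plain column name and quotes it, which mangles arithmetic expressions. Idiorm provides select_expr() / select_many_expr() for raw SQL expressions; a minimal, untested sketch of just the two computed columns (table and column names reused from the query above, joins and plain columns as before):
$d = ORM::for_table('transaction_s')
    ->select('transaction_s.ACCT_NO', 'AcctNo')
    // Raw expressions are passed through unquoted by select_expr().
    ->select_expr('transaction_s.water_DUE + transaction_s.METER_RENT + transaction_s.SEWER'
        . ' + transaction_s.Conserve + transaction_s.INT_AMOUNT + transaction_s.BIN_HIRE', 'Due')
    ->select_expr('transaction_s.water_DUE + transaction_s.METER_RENT + transaction_s.SEWER'
        . ' + transaction_s.Conserve + transaction_s.PREVBAL'
        . ' + transaction_s.INT_AMOUNT + transaction_s.BIN_HIRE', 'TotalDue')
    ->find_many();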
I have two columns in a Pandas DataFrame that has a datetime index. The two columns contain data measuring the same parameter, but neither column is complete (some rows have no data at all, some rows have data in both columns, and others have data only in column 'a' or 'b').
I've written the following code to find gaps in the columns, generate a list of date indices where these gaps appear, and use this list to find and replace missing data. However, I get a KeyError: Not in index on line 3, which I don't understand because the keys I'm using to index came from the DataFrame itself. Could somebody explain why this is happening and what I can do to fix it? Here's the code:
def merge_func(df):
    null_index = df[(df['DOC_mg/L'].isnull() == False) & (df['TOC_mg/L'].isnull() == True)].index
    df['TOC_mg/L'][null_index] = df[null_index]['DOC_mg/L']
    notnull_index = df[(df['DOC_mg/L'].isnull() == True) & (df['TOC_mg/L'].isnull() == False)].index
    df['DOC_mg/L'][notnull_index] = df[notnull_index]['TOC_mg/L']
    df.insert(len(df.columns), 'Mean_mg/L', 0.0)
    df['Mean_mg/L'] = (df['DOC_mg/L'] + df['TOC_mg/L']) / 2
    return df

merge_func(sve)
Whenever you are performing assignment you should use .loc:
df.loc[null_index,'TOC_mg/L']=df['DOC_mg/L']
The error in your original code is the ordering of the subscript values for the index lookup:
df['TOC_mg/L'][null_index] = df[null_index]['DOC_mg/L']
will produce an index error; on a toy dataset I get: IndexError: indices are out-of-bounds
If you changed the order to this it would probably work:
df['TOC_mg/L'][null_index] = df['DOC_mg/L'][null_index]
However, this is chained assignment and should be avoided; see the online docs.
So you should use loc:
df.loc[null_index,'TOC_mg/L']=df['DOC_mg/L']
df.loc[notnull_index, 'DOC_mg/L'] = df['TOC_mg/L']
Note that it is not necessary to use the same index for the RHS, as it will align correctly.
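Putting this together, a corrected merge_func could look like the sketch below; fillna does the same index-aligned fill as the .loc assignments in a single call (column names as in the question, untested against the asker's data):
def merge_func(df):
    # Fill each column's gaps from the other; fillna aligns on the index,
    # so no explicit lists of index labels are needed.
    df['TOC_mg/L'] = df['TOC_mg/L'].fillna(df['DOC_mg/L'])
    df['DOC_mg/L'] = df['DOC_mg/L'].fillna(df['TOC_mg/L'])
    # Where at least one column had data, both now agree, so the mean equals that value.
    df['Mean_mg/L'] = (df['DOC_mg/L'] + df['TOC_mg/L']) / 2
    return df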
I encountered the following problem:
First my data is a string that looks like this:
decimals, decimals
example: 1.345, 3.456
I used the following pig script to put this column, say QQ, into two columns:
result = FOREACH old_table GENERATE FLATTEN(STRSPLIT(QQ, ',')) as (COL1: double, COL2: double);
Then, I want to order it by first field, then second field.
result_ordered = ORDER result BY COL1, COL2;
However, I got a result like the following:
> 59.619198977071434 -151.4586740547339
> 60.52611316847121 -150.8005347076273
> 64.8310014577408 -147.84786488835852
> 7.059652849999997 125.59985130999996
which implies that my data is still being ordered as strings and not as doubles. Has anyone encountered this issue and found a way to solve it? Thank you in advance!
I'm not sure why STRSPLIT is returning chararrays even though you explicitly state they are doubles. But if you look at http://pig.apache.org/docs/r0.10.0/basic.html#arithmetic, you'll notice that chararrays can't be promoted to doubles by multiplying by 1.0, while bytearrays can. Therefore you can do something like:
result = FOREACH old_table
GENERATE FLATTEN(STRSPLIT(QQ, ',')) AS (COL1: bytearray, COL2: bytearray);
B = FOREACH result GENERATE 1.0 * COL1 AS COL1, 1.0 * COL2 AS COL2 ;
result_ordered = ORDER B BY COL1, COL2;
Which gives me the correct output of:
result_ordered: {COL1: double,COL2: double}
(7.059652849999997,125.59985130999996)
(59.619198977071434,-151.4586740547339)
(60.52611316847121,-150.8005347076273)
(64.8310014577408,-147.84786488835852)
Instead of assigning the output of FLATTEN to a schema with two doubles, try actually casting the fields with (double). It may be that Pig only uses the :double syntax in the AS clause for schema checking, but requires an explicit cast to convert the types during execution.
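A sketch of that explicit-cast variant (untested; split into chararrays first, then cast each field to double):
-- Split into two chararray fields, then cast explicitly to double before ordering.
split_cols = FOREACH old_table GENERATE FLATTEN(STRSPLIT(QQ, ',')) AS (c1:chararray, c2:chararray);
casted = FOREACH split_cols GENERATE (double)c1 AS COL1, (double)c2 AS COL2;
result_ordered = ORDER casted BY COL1, COL2;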