I am new to the concept of Pig. I have a file stored on HDFS.
I am loading the file using
A = LOAD 'user/vishal/output/part-00000' USING PigStorage(' ') as (name, occourence);
This works properly, but when I use the FILTER command like
FLT = FILTER A by occourence > '20' and occourence < '35';
it gives the following warning:
2013-02-27 11:06:16,264 [main] WARN org.apache.pig.PigServer -
Encountered Warning IMPLICIT_CAST_TO_CHARARRAY 6 time(s)
What could be the issue?
Thanks
The default datatype for a column in Pig is bytearray.
occourence should be given an int datatype, like below.
A = LOAD 'user/vishal/output/part-00000' USING PigStorage(' ') as (name:chararray,occourence:int);
Now you can filter like below (without quotes).
FLT = FILTER A by occourence > 20 and occourence < 35;
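Alternatively, if you prefer to keep your original LOAD statement, an explicit cast inside the FILTER should also avoid the implicit-cast warning (a minimal sketch, not tested against your data):
FLT = FILTER A by (int)occourence > 20 and (int)occourence < 35;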
I want to transform unstructured data into structured form. The data is of the following form (showing one row of data):
Agra - Ahmedabad### Sat, 24 Jan### http://www.cleartrip.com/m/flights/results?from=AGR&to=AMD&depart_date=24/01/2015&adults=1&childs=0&infants=0&class=Economy&airline=&carrier=&intl=n&page=loaded Air India### 15:30 -
14:35### 47h 5m, 3 stops , AI 406### Rs. 30,336###
and I want to extract the data in the following format using Apache Pig:
(Agra - Ahmedabad,Sat, 24 Jan,http://www.cleartrip.com/m/flights/results?from=AGR&to=AMD&depart_date=24/01/2015&adults=1&childs=0&infants=0&class=Economy&airline=&carrier=&intl=n&page=loaded,Air India,15:30 - 14:35,47h 5m, 3 , AI 406 , 30,336)
I am using the following lines in Apache Pig:
A = LOAD '/prodqueue_cleartrip_23rdJan15.txt' using PigStorage as (value: chararray);
B = foreach A generate REGEX_EXTRACT_ALL('value', '([^#]+)#+\\s+([^#]+)#+\\s+([^\\s]+)\\s+([^#]+)#+\\s+([0-9]{1,2}:[0-9]{1,2}\\s-\\n[0-9]{1,2}:[0-9]{1,2})#+\\s+([^,]+),\\s([0-9]+)\\sstops\\s,\\s([^#]+)#+\\s+Rs.\\s([^#]+)#+');
C = LIMIT B 5;
The output I am getting is this:
()
()
()
()
()
What is the mistake?
This may just be a typo in your question, but
REGEX_EXTRACT_ALL('value', '([^#]+)#+\\s...
would just search the literal string "value". You probably want to take out the single quotes so that it references the field value.
REGEX_EXTRACT_ALL(value, '([^#]+)#+\\s...
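With that change applied, the full statement would look like the following; wrapping the call in FLATTEN is optional but turns the result tuple into separate fields (a sketch that keeps your regex as-is):
B = foreach A generate FLATTEN(REGEX_EXTRACT_ALL(value, '([^#]+)#+\\s+([^#]+)#+\\s+([^\\s]+)\\s+([^#]+)#+\\s+([0-9]{1,2}:[0-9]{1,2}\\s-\\n[0-9]{1,2}:[0-9]{1,2})#+\\s+([^,]+),\\s([0-9]+)\\sstops\\s,\\s([^#]+)#+\\s+Rs.\\s([^#]+)#+'));
C = LIMIT B 5;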
I want to get the following output from Pig Latin / Hadoop
((39,50,60,42,15,Bachelor,Male),5)
((40,35,HS-grad,Male),2)
((39,45,15,30,12,7,HS-grad,Female),6)
from the following data sample:
(image: data sample for adult data)
I have written the following Pig Latin script:
sensitive = LOAD '/mdsba/sample2.csv' using PigStorage(',') as (AGE,EDU,SEX,SALARY);
BV= group sensitive by (EDU,SEX) ;
BVA= foreach BV generate group as EDU, COUNT (sensitive) as dd:long;
Dump BVA ;
Unfortunately, the results come out like this:
((Bachelor,Male),5)
((HS-grad,Male),2)
Then try to project the AGE data too.
Something like this:
BVA= foreach BV generate
sensitive.AGE as AGE,
FLATTEN(group) as (EDU,SEX),
COUNT(sensitive) as dd:long;
Another suggestion is to specify the datatype when you load the data.
sensitive = LOAD '/mdsba/sample2.csv' using PigStorage(',') as (AGE:int,EDU:chararray,SEX:chararray,SALARY:chararray);
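If the ages should appear as one comma-separated field, as in your desired output, the built-in BagToString can be combined with the projection above (a sketch, assuming Pig 0.11 or later):
BVA = foreach BV generate BagToString(sensitive.AGE, ','), FLATTEN(group) as (EDU,SEX), COUNT(sensitive) as dd;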
from googlefinance import getQuotes
import json
import time as t
import re

List = ["A", "AA", "AAB"]
Time = t.localtime()  # sets variable Time to retrieve date/time info
Date2 = ('%d-%d-%d %dh:%dm:%dsec' % (Time[0], Time[1], Time[2], Time[3], Time[4], Time[5]))  # formats the timestamp

while True:
    for i in List:
        try:  # allows elements to be called; on an error, falls through to the except block
            Data = json.dumps(getQuotes(i.lower()), indent=1)  # retrieves data from Google Finance
            regex = '"LastTradePrice": "(.+?)",'  # sets the parse pattern
            pattern = re.compile(regex)  # compiles the pattern
            price = re.findall(pattern, Data)  # extracts the price
            print(i)
            print(price)
        except:  # error handling
            Error = (i + ' Failed to load on: ' + Date2)
            print(Error)
It will display the quote as: ['(number)'].
I would like it to only display the number, which means removing the brackets and quotes.
Any help would be great.
Changing:
print(price)
into:
print(price[0])
prints this:
A
42.14
AA
10.13
AAB
0.110
Try using the type() function to find out the datatype; in your case, type(price).
If the datatype is list, use print(price[0]).
That will print just the number; as for the brackets and quotes, you need to check the Google data and your regex.
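If the regex ever returns more than one match, or none at all, indexing with [0] can fail; joining the list is a safer way to print only the numbers (a small sketch, assumed to run inside your existing loop):
if price:
    print(', '.join(price))
else:
    print(i + ' returned no price')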
When I use * I receive the error
raise error, v # invalid expression
error: nothing to repeat
Other wildcard characters such as ^ work fine.
The line of code:
df.columns = df.columns.str.replace('*agriculture', 'agri')
I am using pandas and Python.
Edit:
When I try using / to escape, the wildcard does not work as I intend.
In [44]: df = pd.DataFrame(columns=['agriculture', 'dfad agriculture df'])
In [45]: df
Out[45]:
Empty DataFrame
Columns: [agriculture, dfad agriculture df]
Index: []
In [46]: df.columns.str.replace('/*agriculture*','agri')
Out[46]: Index([u'agri', u'dfad agri df'], dtype='object')
I thought the wildcard should output Index([u'agri', u'agri'], dtype='object').
Edit:
I am currently using hierarchical columns and would like to replace 'agriculture' with 'agri' only at a specific level (level = 2).
original:
df.columns[0] = ('grand total', '2005', 'agriculture')
df.columns[1] = ('grand total', '2005', 'other')
desired:
df.columns[0] = ('grand total', '2005', 'agri')
df.columns[1] = ('grand total', '2005', 'other')
I'm looking at this link right now: Changing columns names in Pandas with hierarchical columns
and that author says it will get easier in 0.15.0, so I am hoping there are more recent solutions.
You need to put the asterisk * at the end, after a character, so that it matches that character 0 or more times; a bare leading * has nothing to repeat. See the docs:
In [287]:
df = pd.DataFrame(columns=['agriculture'])
df
Out[287]:
Empty DataFrame
Columns: [agriculture]
Index: []
In [289]:
df.columns.str.replace('agriculture*', 'agri')
Out[289]:
Index(['agri'], dtype='object')
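If what you wanted was a glob-style wildcard, the regex equivalent is .* rather than a bare *, so a pattern like the following should give the Index([u'agri', u'agri']) you expected (a sketch using your original two-column example; on newer pandas versions you may need to pass regex=True):
df = pd.DataFrame(columns=['agriculture', 'dfad agriculture df'])
df.columns.str.replace('.*agriculture.*', 'agri')
# Index(['agri', 'agri'], dtype='object')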
EDIT
Based on your new and actual requirements, you can use str.contains to find the matching columns, build a dict mapping the old names to the new names, and then call rename:
In [307]:
matching_cols = df.columns[df.columns.str.contains('agriculture')]
df.rename(columns = dict(zip(matching_cols, ['agri'] * len(matching_cols))))
Out[307]:
Empty DataFrame
Columns: [agri, agri]
Index: []
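For the hierarchical-columns case in your edit, rename also accepts a level argument, so something along these lines should replace the label only at level 2 (a sketch, assuming the tuples shown in your edit and a reasonably recent pandas):
cols = pd.MultiIndex.from_tuples([('grand total', '2005', 'agriculture'),
                                  ('grand total', '2005', 'other')])
df = pd.DataFrame(columns=cols)
df = df.rename(columns={'agriculture': 'agri'}, level=2)
# df.columns[0] is now ('grand total', '2005', 'agri')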
I have sampledata.csv, which contains data as below:
2,4/1/2010,5.97
2,4/6/2010,12.71
2,4/7/2010,34.52
2,4/12/2010,7.89
2,4/14/2010,17.17
2,4/16/2010,9.25
2,4/19/2010,26.74
I want to filter the data in the Pig script so that only rows with a valid date are kept.
Say the date is like '4//2010' or '/9/2010'; then it has to be filtered out.
Below is the Pig script I have written and the output I am getting while dumping the data.
script:
data = load 'sampledata.csv' using PigStorage(',') as (custid:int, date:chararray,amount:float);
cleadata = FILTER data by REGEX_EXTRACT(date, '(([1-9])|(1[0-2]))/(([0-2][1-9])|([3][0-1]))/([1-9]{4})', 1) != null;
Output:
2014-09-14 18:21:30,587 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1003: Unable to find an operator for alias cleandata
I am a beginner in Pig scripting. If you have come across this kind of error, please let me know how to resolve it.
Here is the solution for your problem. I have also modified the regex; you can change it according to your needs.
input.txt
2,04/1/0000,5.97
2,04/1/2010,5.97
2,44/6/2010,12.71
2,4/07/2010,34.52
2,4/\12/2010,7.89
2,4/14/2010/,17.17
2,/16/2010,9.25
2,4/19//2010,26.74
2,4//19/2010,26.74
PigScript:
A = LOAD 'input.txt' USING PigStorage(',') AS (custid:int,date:chararray,amount:float);
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(date, '(0?[1-9]|1[0-2])/([1-2][0-9]|[3][0-1]|0?[1-9])/([1-2][0-9]{3})')) AS (month,day,year);
C = FOREACH B GENERATE CONCAT(month,'/',day,'/',year) AS extractedDate;
D = FILTER C BY extractedDate is not null;
DUMP D;
Output:
(04/1/2010)
(4/07/2010)
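If you also need custid and amount alongside the validated date, the same pattern can carry them through (a sketch along the same lines, not tested):
B = FOREACH A GENERATE custid, amount, FLATTEN(REGEX_EXTRACT_ALL(date, '(0?[1-9]|1[0-2])/([1-2][0-9]|[3][0-1]|0?[1-9])/([1-2][0-9]{3})')) AS (month,day,year);
C = FILTER B BY month is not null;
D = FOREACH C GENERATE custid, CONCAT(month,'/',day,'/',year) AS date, amount;
DUMP D;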