Get unique record among duplicates using MapReduce - mapreduce

File.txt
123,abc,4,Mony,Wa
123,abc,4, ,War
234,xyz,5, ,update
234,xyz,5,Rheka,sild
179,ijo,6,all,allSingle
179,ijo,6,ball,ballTwo
1) column1, column2, column3 are primary keys
2) column4, column5 are comparison keys
I have a file with duplicate records like the one above. Among each set of duplicates I need to keep only one record, chosen by sort order.
Expected Output:
123,abc,4, ,War
234,xyz,5, ,update
179,ijo,6,all,allSingle
Please help me. Thanks in advance.
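For clarity, the grouping logic the job needs can be sketched in plain Python (the data below is copied from File.txt; note that sorting column4 in ascending order reproduces the expected output, though the group order may differ):

```python
from itertools import groupby

rows = [
    "123,abc,4,Mony,Wa",
    "123,abc,4, ,War",
    "234,xyz,5, ,update",
    "234,xyz,5,Rheka,sild",
    "179,ijo,6,all,allSingle",
    "179,ijo,6,ball,ballTwo",
]

# sort so that records sharing the primary key (column1..column3) are adjacent
records = sorted((r.split(",") for r in rows), key=lambda r: r[:3])

result = []
for key, grp in groupby(records, key=lambda r: r[:3]):
    # among the duplicates, keep the record that sorts first on column4
    best = min(grp, key=lambda r: r[3])
    result.append(",".join(best))

print(result)
# ['123,abc,4, ,War', '179,ijo,6,all,allSingle', '234,xyz,5, ,update']
```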

You can try the below code:
data = LOAD 'path/to/file' using PigStorage(',') AS (col1:chararray,col2:chararray,col3:chararray,col4:chararray,col5:chararray);
B = group data by (col1,col2,col3);
C = foreach B {
    sorted = order data by col4 desc;
    first = limit sorted 1;
    generate group, flatten(first);
};
In the above code, you can change the sorted statement to choose the column you would like to sort on and the sort direction. Also, in case you need more than one record per group, you can raise the limit above 1.
Hope this helps.

The question isn't very clear, but I understand this is what you need:
A = LOAD 'file.txt' using PigStorage(',') as (column1,column2,colum3,column4,column5);
B = GROUP A BY (column1,column2,colum3);
C = FOREACH B GENERATE FLATTEN(group) as (column1,column2,colum3);
DUMP C;
Or
A = LOAD 'file.txt' using PigStorage(',') as (column1,column2,colum3,column4,column5);
B = FOREACH A GENERATE column1,column2,colum3;
C = DISTINCT B;
DUMP C;

Related

Merging Pandas Dataframe from one CSV file with identical columns

I hope my question isn't a duplicate of another, but I have searched for three days and I haven't found the answer.
Okay, so I have a CSV file containing two header rows. The file contains information about hotels (their name), how much they cost (price), their rating and where they are located (Area 1, 2 or 3):
The CSV file imported
As you can see, the first row describes the area, while the second row holds the hotel name, price and rating. What I want is to rearrange the file and save it to a new CSV file in a format that looks like this:
The hopeful output
So the area information for the hotels has been given its own column. The names in the second row are all identical.
Is there a way to create this? I am a bit new to these tree-like data structures when they have to be imported. Could it also be done if the tree had more levels (e.g. if we started by country, moved down to area and then down to hotel name, price and rating)? Can it be done with Pandas?
First, could you share the CSV files as text? That makes it much easier to try out a solution; it is not very productive to retype the data from a picture.
Second, have you tried to achieve this by scripting it yourself, or with some library? You added the pandas tag, but you do not mention it in the text. Is there any specific reason it should be pandas?
A solution that works for this one case seems simple to do just by slicing. I guess the format you have is rather specific and not standard, so the libraries might not help much. Pandas, for example, allows multiple header rows, but interprets them in a different way; see pandas dataframe with 2-rows header and export to csv.
A solution idea:
table = []
with open(my_csv_file) as f:
    for line in f:
        a1, p1, r1, a2, p2, r2, a3, p3, r3 = line[:-1].split(",")
        table.append([a1, p1, r1, "area1"])
        table.append([a2, p2, r2, "area2"])
        table.append([a3, p3, r3, "area3"])
# ... convert table into dataframe etc.
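To make the idea concrete, here is that same reshaping run on one invented data row (the hotel names and numbers are made up for illustration; the real column names would come from your file):

```python
# Invented sample data standing in for one data row of the CSV:
# three (hotel, price, rating) triples, one per area.
line = "Hilton,100,4,Plaza,80,3,Inn,60,2"

table = []
a1, p1, r1, a2, p2, r2, a3, p3, r3 = line.split(",")
table.append([a1, p1, r1, "area1"])
table.append([a2, p2, r2, "area2"])
table.append([a3, p3, r3, "area3"])

print(table)
# With pandas installed, the collected rows convert directly:
# import pandas as pd
# df = pd.DataFrame(table, columns=["Hotelname", "price", "rating", "area"])
```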
Okay, so I created a possible solution to the problem:
import csv
import pandas as pd

infile = csv.reader(infile, delimiter=';')  # infile: the already-opened input file
out = []
counter = 0
i = 0
k = 0
names = []
temp1 = 0
for line in infile:
    if counter == 0:
        names = line              # first header row: the area labels
        counter += 1
    elif counter == 1:
        k = len(list(set(line)))  # number of distinct column names
        ind = 0
        while i < len(line):
            # the original used an undefined `name` here; inserting the
            # area label, as in the branch below
            line.insert(i + k, names[ind * k])
            i += (k + 1)
            ind += 1
        counter += 1
        out.append(line)
    else:
        i = 0
        ind = 0
        while i < len(line):
            line.insert(i + k, names[ind * k])
            i += (k + 1)
            ind += 1
        out.append(line)
headers = out.pop(0)
n = len(set(headers))
table = pd.DataFrame(out, columns=headers)
for i in range(0, len(table.columns)):
    # .ix was removed from pandas; .iloc does the same positional slicing
    if i == 0:
        temp1 = table.iloc[:, n * i:n * (i + 1)]
    else:
        temp1 = pd.concat([temp1, table.iloc[:, n * i:n * (i + 1)]], ignore_index=True)
I would very much like some input and suggestions to make the solution more elegant or to add extra levels of headers to the file.

Performing addition of several columns of a table in Idiorm

Any ideas on how to perform addition of several columns in an Idiorm query?
I want to get the values of the aliases 'Due' and 'TotalDue' as the sum of the specified columns, but I keep on getting a fatal error.
Any assistance I shall really appreciate.
This is my sample query:
$d = ORM::for_table('transaction_s')
->select_many(array('ACCT_TYPE'=>'transaction_s.TRANS_TYP',
'ADate'=>'transaction_s.ACCTDATE',
'Str'=>'accounts_master.PHYS_ADDRESS',
'Zone'=>'accounts_master.ZONE_ID',
'Name'=>'customers.CUSTOMER_NAME',
'Address'=>'customers.POST_ADDRESS',
'TelNo'=>'customers.TEL_NO',
'AcctNo'=>'transaction_s.ACCT_NO',
'smscost'=>'transaction_s.INT_AMOUNT',
'Due'=>'(transaction_s.water_DUE+transaction_s.METER_RENT+transaction_s.SEWER+
transaction_s.Conserve+transaction_s.INT_AMOUNT+transaction_s.BIN_HIRE)',
'TotalDue' =>'(transaction_s.water_DUE+transaction_s.METER_RENT+transaction_s.SEWER
+transaction_s.Conserve+transaction_s.PREVBAL
+transaction_s.INT_AMOUNT+transaction_s.BIN_HIRE)',
'OutStanding'=>'transaction_s.water_OUTSTANDING'
)
->inner_join('accounts_master', 'customers.CUSTOMER_NO = accounts_master.CUSTOMER_NO')
->inner_join('customers', 'customers.CUSTOMER_NO = accounts_master.CUSTOMER_NO')
->find_many();

List element comparison by iteration over it

I have three lists a, b and c. If a[index] is in b, get the element of list c corresponding to that position in b. That is, if a[0] == b[1], get c[1]:
a = ['ASAP','WTHK']
b = ['ABCD','ASAP','EFGH','HIJK']
c = ['1','2','3','4','5']
I hope this is what you were looking for. You can add each b value and its corresponding c value to a dictionary in a loop, if the a array contains the b value. After that you can look up the c value using the a value as the key, like in the code below.
a = ['ASAP', 'WTHK']
#                     b       c
dictionary_trans = {'ASAP': '1',
                    'WTHK': '1337'}
# etc. put all b values existing in a into the dict
# with their corresponding c values.
key = a[0]
c_value = dictionary_trans.get(key)
print(c_value)
My Python skills are very limited, but I think I would try to solve the problem this way.
This lookup returns None for an a value which is not contained in the dictionary, so you need to implement some logic to handle missing relations between a and c, such as inserting dummy entries into the dictionary.
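Building the mapping in a loop, as described above, can look like this (using the lists from the question; get() returns None for values missing from b):

```python
a = ['ASAP', 'WTHK']
b = ['ABCD', 'ASAP', 'EFGH', 'HIJK']
c = ['1', '2', '3', '4', '5']

# map each b value to the c value at the same index
b_to_c = {b_val: c[i] for i, b_val in enumerate(b)}

# look up every a value; entries missing from b come back as None
result = [b_to_c.get(x) for x in a]
print(result)  # ['2', None]
```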

Why does Relation.size sometimes return a Hash in Rails 4

I can run a query in two different ways to return a Relation.
When I interrogate the size of the Relation, one query gives a Fixnum as expected; the other gives a Hash whose keys are the values in the Relation's GROUP BY column and whose values are the number of occurrences of each.
In Rails 3 I assume it always returned a Fixnum, as I never had a problem, whereas with Rails 4 it sometimes returns a Hash, and a statement like Rel.size.zero? gives the error:
undefined method `zero?' for {}:Hash
Am I best just using the .blank? method to check for zero records to be sure of avoiding unexpected errors?
Here is a snippet of code with logging statements for the two queries, and the resulting log.
CODE:
assessment_responses1=AssessmentResponse.select("process").where("client_id=? and final = ?",self.id,false).group("process")
logger.info("-----------------------------------------------------------")
logger.info("assessment_responses1.class = #{assessment_responses1.class}")
logger.info("assessment_responses1.size.class = #{assessment_responses1.size.class}")
logger.info("assessment_responses1.size value = #{assessment_responses1.size}")
logger.info("............................................................")
assessment_responses2=AssessmentResponse.select("distinct process").where("client_id=? and final = ?",self.id,false)
logger.info("assessment_responses2.class = #{assessment_responses2.class}")
logger.info("assessment_responses2.size.class = #{assessment_responses2.size.class}")
logger.info("assessment_responses2.size values = #{assessment_responses2.size}")
logger.info("-----------------------------------------------------------")
LOG
-----------------------------------------------------------
assessment_responses1.class = ActiveRecord::Relation::ActiveRecord_Relation_AssessmentResponse
(0.5ms) SELECT COUNT(`assessment_responses`.`process`) AS count_process, process AS process FROM `assessment_responses` WHERE `assessment_responses`.`organisation_id` = 17 AND (client_id=43932 and final = 0) GROUP BY process
assessment_responses1.size.class = Hash
CACHE (0.0ms) SELECT COUNT(`assessment_responses`.`process`) AS count_process, process AS process FROM `assessment_responses` WHERE `assessment_responses`.`organisation_id` = 17 AND (client_id=43932 and final = 0) GROUP BY process
assessment_responses1.size value = {"6 Month Review(1)"=>3, "Assessment(1)"=>28, "Assessment(2)"=>28}
............................................................
assessment_responses2.class = ActiveRecord::Relation::ActiveRecord_Relation_AssessmentResponse
(0.5ms) SELECT COUNT(distinct process) FROM `assessment_responses` WHERE `assessment_responses`.`organisation_id` = 17 AND (client_id=43932 and final = 0)
assessment_responses2.size.class = Fixnum
CACHE (0.0ms) SELECT COUNT(distinct process) FROM `assessment_responses` WHERE `assessment_responses`.`organisation_id` = 17 AND (client_id=43932 and final = 0)
assessment_responses2.size values = 3
-----------------------------------------------------------
size on an ActiveRecord::Relation object translates to count, because the former tries to get the count of the Relation. But when you call count on a grouped Relation object, you receive a hash.
The keys of this hash are the grouped column's values; the values of this hash are the respective counts.
AssessmentResponse.group(:client_id).count # this will return a Hash
AssessmentResponse.group(:client_id).size # this will also return a Hash
This is true for the following methods: count, sum, average, maximum, and minimum.
If you want to check for rows being present or not, simply use exists? i.e. do the following:
AssessmentResponse.group(:client_id).exists?
Instead of this:
AssessmentResponse.group(:client_id).count.zero?

Unnecessary lookups for related fields

I have a django model which looks like this:
class A(models.Model):
b = models.ForeignKey(B)
n = models.IntegerField()
My django-shell is configured to print all the sql queries as soon as it is fired.
So, writing
a = A.objects.get(id=10)
prints one query to retrieve an object from the table A.
What I do not understand is that writing print a.n fires a query to retrieve the related B element.
Can anyone throw some light on why Django retrieves the element B when I haven't even accessed it? (Nor do I plan to do so, so it is turning out to be pure overhead for me.)
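For context, stock Django loads foreign keys lazily through a descriptor, so reading a plain IntegerField should not touch B at all; the behaviour above suggests something else in the setup (a custom __unicode__, signal, middleware, etc.) is touching the relation. A toy sketch of the lazy pattern (not Django's actual code; all names here are made up):

```python
class LazyRelation:
    """Toy descriptor: fetch the related object only on first access."""
    def __init__(self, fetch):
        self.fetch = fetch

    def __set_name__(self, owner, name):
        self.cache = "_" + name

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        if not hasattr(obj, self.cache):
            # this is where the extra SELECT would happen
            setattr(obj, self.cache, self.fetch(obj))
        return getattr(obj, self.cache)

queries = []

def fetch_b(a):
    queries.append("SELECT * FROM b WHERE id = %d" % a.b_id)
    return {"id": a.b_id}

class A:
    b = LazyRelation(fetch_b)

    def __init__(self, b_id, n):
        self.b_id = b_id
        self.n = n

a = A(b_id=7, n=42)
print(a.n, len(queries))  # no related query yet: n is a local column
_ = a.b
print(len(queries))       # first access of b triggered the fetch
```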
Edit
This is the original code that @Daniel asked for:
>> x=PlayedGamesModel.objects.get(pkey='i01E25D45526E14EA0A490D36#AdobeID16688802f8024509a8e431b644108a82')
[2014-03-05 12:46:15,174] (0.326) SELECT `appConnexions_playedgamesmodel`.`pkey`, `appConnexions_playedgamesmodel`.`user_id`, `appConnexions_playedgamesmodel`.`game_id`, `appConnexions_playedgamesmodel`.`playedFromBrowser`, `appConnexions_playedgamesmodel`.`playedFromConnexions`, `appConnexions_playedgamesmodel`.`deleted`, `appConnexions_playedgamesmodel`.`timestamp` FROM `appConnexions_playedgamesmodel` WHERE `appConnexions_playedgamesmodel`.`pkey` = 'i01E25D45526E14EA0A490D36#AdobeID16688802f8024509a8e431b644108a82' ; args=('i01E25D45526E14EA0A490D36#AdobeID16688802f8024509a8e431b644108a82',)
>> print x.playedFromConnexions
1
[2014-03-05 12:46:17,993] (0.319) SELECT `appConnexions_gamemodel`.`gameID`, `appConnexions_gamemodel`.`gameURL`, `appConnexions_gamemodel`.`title`, `appConnexions_gamemodel`.`longTitle`, `appConnexions_gamemodel`.`category`, `appConnexions_gamemodel`.`description`, `appConnexions_gamemodel`.`iconURL`, `appConnexions_gamemodel`.`screenshotURL`, `appConnexions_gamemodel`.`genre`, `appConnexions_gamemodel`.`minOS`, `appConnexions_gamemodel`.`price`, `appConnexions_gamemodel`.`isBlacklisted`, `appConnexions_gamemodel`.`swfURL`, `appConnexions_gamemodel`.`androidURL`, `appConnexions_gamemodel`.`windowsURL`, `appConnexions_gamemodel`.`iosURL`, `appConnexions_gamemodel`.`publisher_id`, `appConnexions_gamemodel`.`lastUpdated` FROM `appConnexions_gamemodel` WHERE `appConnexions_gamemodel`.`gameID` = '16688802f8024509a8e431b644108a82' ; args=(u'16688802f8024509a8e431b644108a82',)
[2014-03-05 12:46:18,597] (0.292) SELECT `appConnexions_usermodel`.`userID`, `appConnexions_usermodel`.`fID`, `appConnexions_usermodel`.`gID`, `appConnexions_usermodel`.`name`, `appConnexions_usermodel`.`avatarURL`, `appConnexions_usermodel`.`age`, `appConnexions_usermodel`.`country`, `appConnexions_usermodel`.`email`, `appConnexions_usermodel`.`settingsBitset`, `appConnexions_usermodel`.`allowLogin`, `appConnexions_usermodel`.`allowAutomaticAdd`, `appConnexions_usermodel`.`allowActivityShare`, `appConnexions_usermodel`.`logoutFromBrowser` FROM `appConnexions_usermodel` WHERE `appConnexions_usermodel`.`userID` = 'i01E25D45526E14EA0A490D36#AdobeID' ; args=(u'i01E25D45526E14EA0A490D36#AdobeID',)
PlayedGamesModel is the table with relations to GameModel (appConnexions_gamemodel) and UserModel (appConnexions_usermodel). Printing an integer value (playedFromConnexions) from PlayedGamesModel fires queries against GameModel and UserModel unnecessarily.