Remove the duplicate row in apache pig - mapreduce

I want to remove the duplicate rows in Pig. There are a lot of ways to do it, but I am not sure which one is better.
Here is the data set, the schema is (f0,f1,id,f3,f4):
1,2,3,2015-02-21,2015-02-20
1,2,3,2015-02-22,2015-02-20
1,2,3,2015-02-23,2015-02-20
1,2,4,2015-02-24,2015-02-20
1,2,5,2015-02-25,2015-02-20
Rows whose f0, f1 and id are all equal are considered duplicates. For each set of duplicates I want to output only the row where f3 is minimum.
But I also want to output which ids have the duplicates.
That is, I will store or dump two relations.
One of the two relations is:
1,2,3,2015-02-21,2015-02-20
1,2,4,2015-02-24,2015-02-20
1,2,5,2015-02-25,2015-02-20
The other one is the id which has the duplicate rows, the schema is (id,f4)
3,2015-02-20
That is, id=3 has the duplicate data.
Here is my workaround
r1 = LOAD 'data' USING PigStorage(',');
r2 = group r1 by ($0,$1,$2);
r3 = FOREACH r2 GENERATE COUNT(r1) as c, r1;
SPLIT r3 into r4 if c > 1, r5 if c == 1;
r6 = FOREACH r5 GENERATE flatten(r1);
dups_id = FOREACH r4 {
GENERATE flatten(r1.$2),flatten(r1.$4);
};
r7 = distinct dups_id;
dump r7;
no_dups = FOREACH r4 {
sorted = ORDER r1 by $3 ASC;
lim = limit sorted 1;
GENERATE flatten(lim);
};
r8 = union no_dups, r6;
dump r8;
I think this is a little complicated, and I have doubts about its performance.
Is there a better way to implement this use case?

Here is how I would do it.
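r1 = LOAD 'data' USING PigStorage(',');
-- Group on the duplicate key (f0, f1, id).
r2 = group r1 by ($0,$1,$2);
r3 = FOREACH r2 GENERATE $0.., SIZE($1) AS size;
-- TOP('ASC') keeps the tuples with the smallest values in the chosen column.
DEFINE MYTOP TOP('ASC');
-- One row per group: the one with the minimum f3 (column index 3), flattened out of the bag that TOP returns.
r8 = FOREACH r2 {
GENERATE FLATTEN(MYTOP(1, 3, r1));
};
-- Groups with more than one row are the duplicates; keep their distinct (id, f4).
dups = FILTER r3 BY size > 1L;
dups2 = FOREACH dups GENERATE FLATTEN($1);
dups3 = FOREACH dups2 GENERATE $2, $4;
dups_id = DISTINCT dups3;
dump r8;
dump dups_id;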
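To sanity-check the expected output outside Pig, here is a minimal plain-Python sketch of the same logic on the sample rows. The function name dedupe is made up for illustration, and it assumes f3 is an ISO date string, so string comparison matches date order.

```python
from collections import defaultdict

# Sample rows from the question, shaped (f0, f1, id, f3, f4).
rows = [
    ("1", "2", "3", "2015-02-21", "2015-02-20"),
    ("1", "2", "3", "2015-02-22", "2015-02-20"),
    ("1", "2", "3", "2015-02-23", "2015-02-20"),
    ("1", "2", "4", "2015-02-24", "2015-02-20"),
    ("1", "2", "5", "2015-02-25", "2015-02-20"),
]


def dedupe(rows):
    """Return (kept_rows, duplicate_ids); 'dedupe' is a hypothetical helper name."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[:3]].append(row)  # group on (f0, f1, id)
    # Keep the row with the smallest f3 in every group.
    kept = [min(group, key=lambda r: r[3]) for group in groups.values()]
    # Distinct (id, f4) pairs taken from groups that hold more than one row.
    dup_ids = sorted(
        {(r[2], r[4]) for group in groups.values() if len(group) > 1 for r in group}
    )
    return kept, dup_ids


kept, dup_ids = dedupe(rows)
```

Here kept plays the role of r8 and dup_ids the role of dups_id.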


Reference a column by a variable

I want to reference a table column by a variable while creating another column but I can't get the syntax:
t0 = Table.FromRecords({[a = 1, b = 2]}),
c0 = "a", c1 = "b",
t1 = Table.AddColumn(t0, "c", each([c0] + [c1]))
I get the error that the record's field 'c0' was not found. M treats c0 as a literal field name, but I want the text value contained in c0. How can I do this?
Edit
I used this inspired by the accepted answer:
t0 = Table.FromRecords({[a = 1, b = 2]}),
c0 = "a", c1 = "b",
t1 = Table.AddColumn(t0, "c", each(Record.Field(_, c0) + Record.Field(_, c1)))
Another way:
let
t0 = Table.FromRecords({[a = 1, b = 2]}),
f = {"a","b"},
t1 = Table.AddColumn(t0, "sum", each List.Sum(Record.ToList(Record.SelectFields(_, f))))
in
t1
Try using an index, as below:
let t0 = Table.FromRecords({[a = 1, b = 2]}),
#"Added Index" = Table.AddIndexColumn(t0, "Index", 0, 1),
c0 = "a",
c1 = "b",
t1 = Table.AddColumn(#"Added Index", "c", each Table.Column(#"Added Index",c0){[Index]} + Table.Column(#"Added Index",c1){[Index]} )
in t1
Expression.Evaluate is another possibility:
= Table.AddColumn(t0, "c", each Expression.Evaluate("["&c0&"] + ["&c1&"]", [_=_]) )
Please refer to this article to understand the [_=_] context argument:
Expression.Evaluate() In Power Query/M
This article explains that argument specifically:
Inside a table, the underscore _ represents the current row, when working with line-by-line operations. The error can be fixed, by adding [_=_] to the environment of the Expression.Evaluate() function. This adds the current row of the table, in which this formula is evaluated, to the environment of the statement, which is evaluated inside the Expression.Evaluate() function.
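For readers more used to general-purpose languages, the two approaches have a rough Python analogue. This is only an analogy, not M: the dict row stands in for the current record, and the environment passed to eval plays the part of [_=_].

```python
# A stand-in for one table row; the names mirror the question.
row = {"a": 1, "b": 2}
c0, c1 = "a", "b"

# Direct lookup by a name held in a variable, comparable to Record.Field(_, c0).
direct = row[c0] + row[c1]

# Building the expression as text and evaluating it, comparable to Expression.Evaluate.
# The evaluated text only sees names that are passed in explicitly, so "_" must be
# supplied in the environment, much like the [_=_] argument.
expr = f"_[{c0!r}] + _[{c1!r}]"
evaluated = eval(expr, {"__builtins__": {}}, {"_": row})
```

Leaving "_" out of the environment makes the eval call fail with a NameError, which is the same kind of failure the quoted article describes.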

0 DF in regression in SAS Enterprise Guide

I created dummies in SAS (part of the code is below) and ran a regression, dropping M23. It was working fine. But then I tried to group them by age since we don't have enough members. I ran it the same way and dropped one age group (M20to24, since this group has the highest membership). Now some of my variables have 0 DF. Does anyone know what went wrong?
I got the message - Note: Model is not full rank. Least-squares solutions for the parameters are not unique. Some statistics will be misleading. A reported DF of 0 or B means that the estimate is biased. The following parameters have been set to 0, since the variables are a linear combination of other variables as shown.
data Table;
set Table;
M0=(AgeGender = '0M');
M1=(AgeGender = '1M');
M2=(AgeGender = '2M');
M3=(AgeGender = '3M');
M4=(AgeGender = '4M');
M5to9=(AgeGender = ' 5to9M');
M10to14=(AgeGender = '10to14M');
M15to19=(AgeGender = '15to19M');
M20to24=(AgeGender = '20to24M');
M25to29=(AgeGender = '25to29M');
M30to34=(AgeGender = '30to34M');
M35to39=(AgeGender = '35to39M');
M40to44=(AgeGender = '40to44M');
M45to49=(AgeGender = '45to49M');
M50to54=(AgeGender = '50to54M');
M55to59=(AgeGender = '55to59M');
M60to64=(AgeGender = '60to64M');
M65Plus=(AgeGender = '65+M');
F0=(AgeGender = '0F');
F1=(AgeGender = '1F');
F2=(AgeGender = '2F');
F3=(AgeGender = '3F');
F4=(AgeGender = '4F');
F5to9=(AgeGender = ' 5to9F');
F10to14=(AgeGender = '10to14F');
F15to19=(AgeGender = '15to19F');
F20to24=(AgeGender = '20to24F');
F25to29=(AgeGender = '25to29F');
F30to34=(AgeGender = '30to34F');
F35to39=(AgeGender = '35to39F');
F40to44=(AgeGender = '40to44F');
F45to49=(AgeGender = '45to49F');
F50to54=(AgeGender = '50to54F');
F55to59=(AgeGender = '55to59F');
F60to64=(AgeGender = '60to64F');
F65Plus=(AgeGender = '65+F');
Dep = (Relationship = 'Dep');
Mandatory = (Mand_Vo = 'Mandatory');
run;
ods output ParameterEstimates=Parameter_Estimates;
proc reg data= Table;
model logPMPM =
M0
M1
M2
M3
M4
M5to9
M10to14
M15to19
M25to29
M30to34
M35to39
M40to44
M45to49
M50to54
M55to59
M60to64
M65Plus
F0
F1
F2
F3
F4
F5to9
F10to14
F15to19
F20to24
F25to29
F30to34
F35to39
F40to44
F45to49
F50to54
F55to59
F60to64
F65Plus;
weight Membership;
run;
ods output close;
It doesn't look like you have overlapping or exactly complementary variables by definition. Your data is likely producing this by chance, which is harder to find. You can likely find it by crossing variables that you suspect may be related, or by doing a pairwise scatter plot (PROC SGSCATTER) and seeing which two overlap almost identically.
You're correct: you wouldn't get this behaviour with continuous values, because they are less likely to overlap exactly. In general, it's considered best practice NOT to categorize/bin variables when you can keep them continuous. The boundaries are artificial: does a 34-year-old really differ from a 36-year-old? What if everyone in the 30 to 34 group is 34 and everyone in the 35 to 39 group is 36? You may not find a difference, but if your distribution was everyone at 39 vs everyone at 31, you may find more of a difference. Keeping the data continuous avoids these manufactured issues.
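As a rough way to do that kind of crossing outside SAS, here is a small Python sketch that flags the two simplest causes of a non-full-rank design among 0/1 dummies: a column that is all zeros and two columns that are identical. The labels and category strings below are hypothetical, not your data, and the helper names are made up. One all-zero case worth ruling out is a category string that never matches exactly, for example because of a leading space.

```python
def make_dummies(labels, categories):
    """Build one 0/1 column per category by exact string comparison."""
    return {cat: [int(label == cat) for label in labels] for cat in categories}


def degenerate_columns(columns):
    """Return (all_zero_names, identical_pairs) for a dict of 0/1 columns."""
    names = list(columns)
    all_zero = [name for name in names if not any(columns[name])]
    identical = [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if columns[a] == columns[b] and any(columns[a])
    ]
    return all_zero, identical


# Hypothetical labels; note the leading space in one category string.
labels = ["0M", "5to9M", "5to9M", "65+F"]
categories = ["0M", " 5to9M", "65+F"]

all_zero, identical = degenerate_columns(make_dummies(labels, categories))
```

With these made-up values, the ' 5to9M' dummy never matches any label, so its column is all zeros and it would be reported with 0 DF.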

Vba - extract values and list once

I have a spreadsheet with two raw data sheets on separate Excel tabs that have been extracted from a finance system and contain values representing cost codes. The dataset on both tabs is quite large, and the codes that I want listed just once are repeated multiple times. I want a macro that will scan the two relevant columns (say column A on both sheets) and list each cost code once, in numerical order, on a third sheet.
I've searched this site but can't seem to find a code that does the above completely.
Thanks in advance
This may not be the fastest implementation possible, since everything except the final sort is done cell by cell in VBA. It has not been tested.
Sub AppendUnique(ByVal W1 As Worksheet, ByVal W2 As Worksheet, ByVal R1 As Long, ByVal R2 As Long, ByVal C1 As Long, ByVal C2 As Long)
' Append values from an unsorted column to a new unique but unsorted column
Dim V1 As Variant, V2 As Variant
Dim I As Long
V1 = W1.Cells(R1, C1).Value
Do While Not IsEmpty(V1)
' Scan the destination column for V1; I stops on the match or on the first empty cell
I = R2
V2 = W2.Cells(I, C2).Value
Do While Not IsEmpty(V2)
If V2 = V1 Then Exit Do
I = I + 1
V2 = W2.Cells(I, C2).Value
Loop
' Writes V1 into the first empty cell, or rewrites the same value if it was already listed
W2.Cells(I, C2).Value = V1
R1 = R1 + 1
V1 = W1.Cells(R1, C1).Value
Loop
End Sub
' Entry point; the name ListUniqueCodes is arbitrary
Sub ListUniqueCodes()
Dim W1 As Worksheet, W2 As Worksheet, W3 As Worksheet
Dim C1 As Long, C2 As Long, C3 As Long
Dim R1 As Long, R2 As Long, R3 As Long
Set W1 = Worksheets("Sheet1") ' Source 1
Set W2 = Worksheets("Sheet2") ' Source 2
Set W3 = Worksheets("Sheet3") ' Destination
C1 = 1 ' Column on Sheet1: Source 1
C2 = 1 ' Column on Sheet2: Source 2
C3 = 1 ' Column on Sheet3: Destination
R1 = 1 ' Starting Row on Sheet1: Source 1
R2 = 1 ' Starting Row on Sheet2: Source 2
R3 = 1 ' Starting Row on Sheet3: Destination
AppendUnique W1, W3, R1, R3, C1, C3
AppendUnique W2, W3, R2, R3, C2, C3
' Sort the combined list in ascending order (no header row)
W3.Range(W3.Cells(R3, C3), W3.Cells(R3, C3).End(xlDown)).Sort Key1:=W3.Cells(R3, C3), Order1:=xlAscending, Header:=xlNo
End Sub
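To make the intended result concrete, here is the same idea as a short Python sketch: collect the codes from both columns, drop repeats, and sort. The sample codes are invented, and unlike the macro, which stops at the first empty cell, this version simply skips blanks.

```python
def unique_sorted(*columns):
    """Return every distinct non-blank value from the given columns, sorted ascending."""
    seen = set()
    for column in columns:
        for value in column:
            if value is not None and value != "":
                seen.add(value)
    return sorted(seen)


# Made-up cost codes standing in for column A on each sheet.
sheet1 = [4010, 1200, 4010, 3300]
sheet2 = [1200, 5000, 3300, 5000]

codes = unique_sorted(sheet1, sheet2)
```

Using a set for the membership check avoids rescanning the destination list for every source value, which is where the cell-by-cell macro spends most of its time on large sheets.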

Find all possible paths from lists with adjacent elements in Prolog?

I apologize in advance for the awkward title as it's a bit hard to put clearly in just a few words.
The goal is to find all possible paths and the total energy used from one "room" to another based on the input rooms. So the list [r1,r2,3] would mean you can travel from room 1 to room 2, and from room 2 to room 1, and it would take 3 energy either way. You are not allowed to travel to a room previously traveled to.
Here is the list of lists that represents which rooms can be traveled to.
adjacent([[r1,r2,8],[r1,r3,2],[r1,r4,4],[r2,r3,7],[r3,r4,1],[r2,r5,2],[r4,r6,5],[r6,r3,9],[r3,r5,3]]).
And here is my code, which does correctly find a path. However, all further paths just repeat previous rooms, because I'm unsure how to rule that out. I figured I could simply use not member(PosPath, Paths), since Paths holds the list of all previously visited rooms, but it seems PosPath gets added to Paths sometime beforehand, so the check always fails.
trip(Start,End,[Start,End],Energy):- adjacent(List), member([Start,End,Energy],List).
trip(Start,End,[Start|Paths],TotalE) :-
adjacent(List),
member([Start,PosPath,E], List),
% not member(PosPath, Paths),
trip(PosPath,End,Paths,PathE).
% TotalE is E+PathE.
Output:
?- trip(r1, r6, Path, TotalE).
Path = [r1, r2, r3, r4, r6]
Total = Total
Yes (0.00s cpu, solution 1, maybe more)
Path = [r1, r2, r3, r4, r6, r3, r4, r6]
Total = Total
Yes (0.00s cpu, solution 2, maybe more)
Path = [r1, r2, r3, r4, r6, r3, r4, r6, r3, r4, r6]
TotalE = TotalE
Yes (0.00s cpu, solution 3, maybe more)
Since the rooms in [r1,r2,3] represent a bidirectional path I would suggest a predicate that describes this symmetry, let's call it from_to_cost/3:
from_to_cost(X,Y,C) :-
adjacent(L),
member([X,Y,C],L).
from_to_cost(X,Y,C) :-
adjacent(L),
member([Y,X,C],L).
For the calling predicate I would suggest a somewhat more descriptive name, say start_end_path_cost/4, that corresponds to your predicate trip/4. For the predicate that describes the actual relation two additional arguments are needed: an accumulator to sum up the cost of the path, which starts at 0, and a list of visited rooms, which starts with the first room as the single element [S]:
start_end_path_cost(S,E,P,C) :-
s_e_p_c_(S,E,P,C,0,[S]).
The actual relation has to describe two cases:
1) If the start-room and the end-room are equal the path is found. Then the cost and the accumulator are equal as well and the path is empty.
2) Otherwise there is an intermediary room that has not been visited yet and can be reached from S:
s_e_p_c_(E,E,[],C,C,_Visited).
s_e_p_c_(S,E,[X|Path],C,C0,Visited) :-
maplist(dif(X),Visited),
from_to_cost(S,X,SXC),
C1 is C0+SXC,
s_e_p_c_(X,E,Path,C,C1,[X|Visited]).
Now your example query finds all solutions and terminates:
?- start_end_path_cost(r1, r6, Path, TotalE).
Path = [r2, r3, r4, r6],
TotalE = 21 ;
Path = [r2, r3, r6],
TotalE = 24 ;
Path = [r2, r5, r3, r4, r6],
TotalE = 19 ;
Path = [r2, r5, r3, r6],
TotalE = 22 ;
Path = [r3, r4, r6],
TotalE = 8 ;
Path = [r3, r6],
TotalE = 11 ;
Path = [r4, r6],
TotalE = 9 ;
Path = [r4, r3, r6],
TotalE = 14 ;
false.
And the most general query finds all 137 solutions for your given connections and terminates as well:
?- start_end_path_cost(S, E, Path, TotalE).
S = E,
Path = [],
TotalE = 0 ;
S = r1,
E = r2,
Path = [r2],
TotalE = 8 ;
S = r1,
E = r3,
Path = [r2, r3],
TotalE = 15 ;
.
.
.
S = r5,
E = r1,
Path = [r3, r6, r4, r1],
TotalE = 21 ;
S = r5,
E = r2,
Path = [r3, r6, r4, r1, r2],
TotalE = 29 ;
false.
Edit:
Concerning your question in the comments: yes it is possible. You can define a predicate that describes the first argument to not be an element of the list that's the second argument, let's call it nonmember/2:
nonmember(_A,[]).
nonmember(A,[H|T]):-
dif(A,H),
nonmember(A,T).
Then you can replace the maplist goal in s_e_p_c_/6 by nonmember/2 like so:
s_e_p_c_(E,E,[],C,C,_Visited).
s_e_p_c_(S,E,[X|Path],C,C0,Visited) :-
nonmember(X,Visited), % <- here
from_to_cost(S,X,SXC),
C1 is C0+SXC,
s_e_p_c_(X,E,Path,C,C1,[X|Visited]).
With this change the queries yield the same results.
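For comparison, the same search can be sketched in Python as a depth-first traversal that carries the visited rooms and the accumulated cost along, just like the two extra arguments of s_e_p_c_/6. The function names are made up for this sketch, and unlike the Prolog version the returned paths include the start room.

```python
# The connections from the question; each one can be used in both directions.
ADJACENT = [
    ("r1", "r2", 8), ("r1", "r3", 2), ("r1", "r4", 4),
    ("r2", "r3", 7), ("r3", "r4", 1), ("r2", "r5", 2),
    ("r4", "r6", 5), ("r6", "r3", 9), ("r3", "r5", 3),
]


def build_graph(edges):
    """Map each room to its (neighbour, cost) pairs, adding both directions."""
    graph = {}
    for a, b, cost in edges:
        graph.setdefault(a, []).append((b, cost))
        graph.setdefault(b, []).append((a, cost))
    return graph


def simple_paths(graph, start, end, visited=None, cost=0):
    """Yield (path, total_cost) for every route that never revisits a room."""
    visited = visited or [start]
    if start == end:
        yield list(visited), cost
        return
    for nxt, step in graph.get(start, []):
        if nxt not in visited:  # same role as nonmember/2
            yield from simple_paths(graph, nxt, end, visited + [nxt], cost + step)


paths = list(simple_paths(build_graph(ADJACENT), "r1", "r6"))
cheapest = min(paths, key=lambda pc: pc[1])
```

For r1 to r6 this produces the same eight routes and costs as the Prolog query above, though not necessarily in the same order.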

Merge Two Pandas Dataframes when two columns are list

I have two Pandas data frames and they need to be merged. Example data frames are:
      c1  c2
pd1 = [[1, [1,2]]]
      c3  c4
pd2 = [[1, [1,3]],
       [2, [2,3]]]
result = [[1,1], [1,2]]
The join condition is that the lists in c2 and c4 have at least one common element.
I've tried:
result = pd.merge(pd1, pd2, left_on=list('c2'),right_on=list('c4'), how='inner')
However, this seems to only join them when the rows in each column are single values like a float, int or string.
I've attacked this problem using nested loops. This runs like a dog when the sets get large. Is there a faster way to perform this merge exploiting data frames or is there another way that's better?
import pandas as pd
pd1 = pd.DataFrame([[1, [1,2]]], columns=['c1', 'c2'])
pd1
pd2 = pd.DataFrame([[1, [1, 2]], [2, [2, 3]]], columns=['c3', 'c4'])
pd2
Setup for a merge
s2 = pd2.c4.apply(pd.Series).stack() \
.rename_axis(['idx2', 'lst2']).reset_index(name='val')
s2
s1 = pd1.c2.apply(pd.Series).stack() \
.rename_axis(['idx1', 'lst1']).reset_index(name='val')
s1
mrg = s1.merge(s2)[['idx1', 'idx2']].drop_duplicates()
mrg
a1 = pd1.c1.loc[mrg.idx1].values
a2 = pd2.c3.loc[mrg.idx2]
pd.DataFrame(dict(c1=a1, c3=a2))
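If the stack-and-merge route is still too slow, another option is to skip pandas for the matching step and build an inverted index from list element to row key. This is a minimal sketch of that alternative, not the approach above; the function name and the (key, list) row shape are assumptions made for the example.

```python
from collections import defaultdict


def join_on_common_element(left, right):
    """Return sorted (left_key, right_key) pairs whose lists share at least one value.

    left and right are iterables of (key, list_of_values) rows.
    """
    # Inverted index: value -> set of right-hand keys whose list contains it.
    index = defaultdict(set)
    for key, values in right:
        for value in values:
            index[value].add(key)

    pairs = set()  # a set, so a pair sharing several values is kept only once
    for key, values in left:
        for value in values:
            for right_key in index.get(value, ()):
                pairs.add((key, right_key))
    return sorted(pairs)


# The rows from the question, written as (c1, c2) and (c3, c4).
pd1_rows = [(1, [1, 2])]
pd2_rows = [(1, [1, 3]), (2, [2, 3])]

result = join_on_common_element(pd1_rows, pd2_rows)
```

Each list element is touched once while indexing and once while probing, so the nested loop over every pair of rows is avoided. Rows can be fed straight from the frames with something like pd1[['c1', 'c2']].itertuples(index=False).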