Group by X OR Y in Pig

I am processing a large amount of data with Pig and I need to group records by one field OR another. Note that this is not the classic GROUP BY X AND Y: two records belong in the same group if they have the same value for attribute X OR for attribute Y.
For example, given this dataset:
1, a, 'r1'
2, b, 'r2'
3, c, 'r3'
4, a, 'r4'
3, d, 'r5'
5, c, 'r6'
5, e, 'r7'
The result of grouping by first OR second field should be:
{(1, a, 'r1'), (4, a, 'r4')}
{(2, b, 'r2')}
{(3, c, 'r3'), (3, d, 'r5'), (5, c, 'r6'), (5, e, 'r7')}
(1) 'r1' and 'r4' have the same value for the second attribute.
(2) The record 'r2' matches no other record on either the first or the second field.
(3) Finally, 'r3' shares the value of its first attribute with 'r5', and the value of its second field with 'r6'. And 'r6' shares the value of its first attribute with 'r7'. Note that 'r3' and 'r7' do not have any field in common, but chaining through the other records puts them in the same group.
I have solved this problem in Java (outside Pig), and I know how to do it with MapReduce. But, in order to learn, I would like to know how to do it in Pig Latin, or with any library that could help me with this.
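To make the intended grouping concrete, here is a minimal sketch in Python (not Pig Latin) that treats the problem as finding connected components with a union-find structure. The function name group_by_x_or_y and the tuple layout (x, y, label) are assumptions made for illustration; this is not the Java solution mentioned above, and it runs in memory rather than as a distributed job.

```python
from collections import defaultdict


def group_by_x_or_y(records):
    """Group (x, y, label) records that share x OR y, following chains."""
    parent = list(range(len(records)))

    def find(i):
        # Walk up to the root, shortening the path as we go (path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            # Keep the smaller index as root so output order is stable.
            parent[max(ri, rj)] = min(ri, rj)

    first_seen = {}  # ("x", value) or ("y", value) -> index of first record
    for idx, (x, y, _label) in enumerate(records):
        for key in (("x", x), ("y", y)):
            if key in first_seen:
                union(idx, first_seen[key])
            else:
                first_seen[key] = idx

    groups = defaultdict(list)
    for idx, record in enumerate(records):
        groups[find(idx)].append(record)
    return list(groups.values())


data = [
    (1, "a", "r1"), (2, "b", "r2"), (3, "c", "r3"), (4, "a", "r4"),
    (3, "d", "r5"), (5, "c", "r6"), (5, "e", "r7"),
]
result = group_by_x_or_y(data)
```

For the sample dataset this yields the three groups listed in the question. The x and y values are kept in separate key spaces, so a value that happens to appear in both columns does not link records by accident.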

Related

Mathematica solution?

I am new to Mathematica and I have one more task to figure out, but I can't find the answer. I have two lists of numbers ("b" and "u"):
b = {8.734059001373602`, 8.330508824111284`, 5.620669156438947`,
1.4722145583571766`, 1.797504620275392`, 7.045821078656974`,
2.1437334927375247`, 2.295629405840401`, 9.749038328921163`,
5.9928406294151095`, 5.710839663259195`, 7.6983109942364365`,
1.02781847368645`, 4.909108426318685`, 2.5860897177525572`,
9.56334726886076`, 5.661774934433563`, 3.4927397824800384`,
0.4570000499566351`, 6.240122061193738`, 8.371962670138991`,
4.593105388706549`, 7.653068139076581`, 2.2715973346475877`,
7.6234743784167875`, 0.9177107503732636`, 3.182296027902268`,
6.196168580445633`, 0.1486794884986935`, 1.2920960388213274`,
7.478757220079665`, 9.610332785387424`, 0.05088141346751485`,
3.940557901075696`, 5.21881311050797`, 7.489624788199514`,
8.773397599406234`, 3.397275198258715`, 1.4847171141876618`,
0.06574278834161795`, 0.620801320529969`, 2.075457888143216`,
5.244608900551409`, 4.54384757203616`, 7.114276285060143`,
2.8878711430358344`, 5.70657733453041`, 8.759173986432632`,
1.9392596667256967`, 7.419234634325729`, 8.258205508179927`,
1.185315253730261`, 3.907753644335596`, 7.168561412289151`,
9.919881985898002`, 3.169835543867407`, 8.352858871046699`,
7.959492335118693`, 7.772764587074317`, 7.091413185764939`,
1.433673058797801`};
and
u={5.1929, 3.95756, 5.55276, 3.97068, 5.67986, 4.57951, 4.12308,
2.52284, 6.58678, 4.32735, 7.08465, 4.65308, 3.82025, 5.01325,
1.17007, 6.43412, 4.67273, 3.7701, 4.10398, 2.90585, 3.75596,
5.12365, 4.78612, 7.20375, 3.19926, 8.10662};
[Figure: line plot of "b" and "u"]
I need to compare the first 5 numbers of "b" with the 1st number of "u" and always keep the maximum (that is, replace any "b" < "u" with "u"). Then I need to shift by 2 numbers and compare the 3rd, 4th, 5th, 6th and 7th "b" with the 2nd "u", and so on (the shift is always 2 steps). But the overlapping numbers need to be "remembered" and compared again in the next step, so that the maximum is always picked (e.g. the 3rd, 4th and 5th "b" have to be at least as large as both the 1st and the 2nd "u").
Possibly the easiest way would be to cover the maxima shown in the image across the whole function, but I am new to this software and I don't have the experience to do that. Still, it would be awesome if someone could figure out how to do this with a function that does what I have described above.
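As a language-neutral illustration of the procedure just described, here is a minimal Python sketch (not Mathematica). The function name raise_to_window_floor and the parameters width and step are assumptions for illustration. Each u[k] acts as a lower bound for the window of "b" that starts 2*k positions in and is 5 elements long; windows running past the end of "b" are simply cut short.

```python
def raise_to_window_floor(b, u, width=5, step=2):
    """Return a copy of b where each element is at least every u[k]
    whose window (start = k * step, length = width) covers it."""
    result = list(b)
    for k, floor_value in enumerate(u):
        start = k * step
        stop = min(start + width, len(result))  # truncate at the end of b
        for i in range(start, stop):
            result[i] = max(result[i], floor_value)
    return result


# Small made-up numbers, not the data from the question.
b_demo = [8.7, 8.3, 5.6, 1.5, 1.8, 7.0, 2.1, 2.3, 9.7]
u_demo = [5.2, 4.0, 5.5, 4.0]
smoothed = raise_to_window_floor(b_demo, u_demo)
```

Because each element keeps the running maximum, the overlapping positions are automatically "remembered" from one window to the next.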

I believe this does what you want:
With[{n = Length @ u},
  Array[b[[#]] ~Max~ Take[u, ⌊{#-2, #+1}/2⌋ ~Clip~ {1, n}] &, 2 n + 3]
]
{8.73406, 8.33051, 5.62067, 5.1929, 5.55276, 7.04582, 5.55276, 5.55276, 9.74904,--
Or, if the lengths of u and b are appropriately matched:
With[{n = Length @ u},
  MapIndexed[# ~Max~ Take[u, ⌊(#2[[1]] + {-2, 1})/2⌋ ~Clip~ {1, n}] &, b]
]
These are quite a lot faster than Mark's solution. With the following data:
u = RandomReal[{1, 1000}, 1500];
b = RandomReal[{1, 1000}, 3004];
Mark's code takes 2.8 seconds, while mine take 0.014 and 0.015 seconds.
Please ask your future questions on the dedicated Mathematica StackExchange site.

I think there's a small problem with your data: u doesn't have as many elements as Partition[b,5,2]. Leaving that to one side, the best I could do was:
Max /@ Transpose[
  Table[Map[If[# > 0, Max[#, u[[i]]], 0] &,
    RotateRight[PadRight[Partition[b, 5, 2][[i]], Length[b]],
      2 (i - 1)]], {i, 1, Length[u]}]]
which starts producing the same numbers as in your comment.
As ever, pick this apart from the innermost expression and work outwards.

Combining Field Values in SAS

Suppose my variables are X and Y
The values of X are A, B, C
The values of Y are 1, 2, 3
There is also an ID field. The X and Y values are unique to each ID.
Using SAS, I want to create another variable that combines the values of X and Y with a period in between: A.1, B.2, C.3
How should I go about this? Thanks!

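The CATX function will do it for you: its first argument is the delimiter to place between the remaining arguments. (Any other CAT-family function works too, really, if you pass the period as an ordinary argument.)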
newfield = catx('.',x,y);

Getting all combinations of splitting an array into two equally sized groups in Julia

Given an array of 20 numbers, I would like to extract all possible ways of splitting it into two groups of ten numbers each; order is not important.
combinations([1, 2, 3], 2)
in Julia will give me all possible combinations of two numbers drawn from the array, but I also need the ones that were not drawn...

You can use setdiff to determine the items missing from any vector, e.g.,
y = setdiff(1:5, [2,4])
yields [1,3,5].

After playing around for a bit, I came up with this code, which seems to work. I'm sure it could be written much more elegantly, etc.
function removeall!(remove::Array, a::Array)
    for i in remove
        if in(i, a)
            splice!(a, indexin([i], a)[1])
        end
    end
end
function combinationgroups(a::Array, count::Integer)
    result = {}
    for i in combinations(a, count)
        all = copy(a)
        removeall!(i, all)
        push!(result, { i; all } )
    end
    result
end
combinationgroups([1,2,3,4],2)
6-element Array{Any,1}:
{[1,2],[3,4]}
{[1,3],[2,4]}
{[1,4],[2,3]}
{[2,3],[1,4]}
{[2,4],[1,3]}
{[3,4],[1,2]}

Based on @tholy's comment that I could use positions instead of the actual numbers (to avoid problems with numbers not being unique) and setdiff to get the "other group" (the non-selected numbers), I came up with the following. The first function grabs values out of an array based on indices (i.e. arraybyindex([11,12,13,14,15], [2,4]) => [12,14]). This seems like it could be part of the standard library (I did look for it, but might have missed it).
The second function does what combinationgroups was doing above, creating all groups of a certain size, and their complements. It can be called by itself, or through the third function, which extracts groups of all possible sizes. It's possible that this could all be written to run much faster, and more idiomatically.
function arraybyindex(a::Array, indx::Array)
    res = {}
    for e in indx
        push!(res, a[e])
    end
    res
end
function combinationsbypos(a::Array, n::Integer)
    res = {}
    positions = 1:length(a)
    for e in combinations(positions, n)
        push!(res, { arraybyindex(a, e) ; arraybyindex(a, setdiff(positions, e)) })
    end
    res
end
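function allcombinationgroups(a::Array)
    maxsplit = div(length(a), 2)   # integer division keeps the range below integer-valued
    res = {}
    for e in 1:maxsplit
        println("Calculating for $e, so far $(length(res)) groups calculated")
        append!(res, combinationsbypos(a, e))   # add each group, not the whole array as one element
    end
    res
end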
Running this in IJulia on a 3 year old MacBook pro gives
@time c=allcombinationgroups([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20])
println(length(c))
c
Calculating for 1, so far 0 groups calculated
Calculating for 2, so far 20 groups calculated
Calculating for 3, so far 210 groups calculated
Calculating for 4, so far 1350 groups calculated
Calculating for 5, so far 6195 groups calculated
Calculating for 6, so far 21699 groups calculated
Calculating for 7, so far 60459 groups calculated
Calculating for 8, so far 137979 groups calculated
Calculating for 9, so far 263949 groups calculated
Calculating for 10, so far 431909 groups calculated
elapsed time: 11.565218719 seconds (1894698956 bytes allocated)
Out[49]:
616665
616665-element Array{Any,1}:
{{1},{2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20}}
{{2},{1,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20}}
⋮
{{10,12,13,14,15,16,17,18,19,20},{1,2,3,4,5,6,7,8,9,11}}
{{11,12,13,14,15,16,17,18,19,20},{1,2,3,4,5,6,7,8,9,10}}
i.e. roughly 53,300 groups calculated per second.
By contrast, using the same outer allcombinationgroups function but replacing the call to combinationsbypos with a call to combinationgroups (see previous answer) is 10x slower.
I then rewrote the array-by-index step using true/false flags, as suggested by @tholy (I couldn't figure out how to get it to work using [], so I used setindex! explicitly), and moved it into one function. Another 10x speedup! 616,665 groups in 1 second!
Final code (so far):
function combinationsbypos(a::Array, n::Integer)
    res = {}
    positions = 1:length(a)
    emptyflags = falses(length(a))
    for e in combinations(positions, n)
        flag = copy(emptyflags)
        setindex!(flag, true, e)
        push!(res, {a[flag] ; a[!flag]} )
    end
    res
end
function allcombinationgroups(a::Array)
    maxsplit = div(length(a), 2)   # integer division keeps the range below integer-valued
    res = {}
    for e in 1:maxsplit
        res = vcat(res, combinationsbypos(a, e))
    end
    res
end
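For comparison, the position-based idea (pick index combinations, then take the complement) can be sketched in Python with itertools.combinations. This is only an illustration of the technique, not a port of the Julia code above; the function name split_into_groups is hypothetical.

```python
from itertools import combinations


def split_into_groups(items, n):
    """Yield (chosen, rest) pairs for every way of choosing n positions.

    Working on positions rather than values keeps duplicates in
    `items` from confusing the complement step.
    """
    positions = range(len(items))
    for chosen in combinations(positions, n):
        chosen_set = set(chosen)
        picked = [items[i] for i in chosen]
        rest = [items[i] for i in positions if i not in chosen_set]
        yield picked, rest


pairs = list(split_into_groups([1, 2, 3, 4], 2))
```

As in the Julia output above, each split shows up twice (once as (A, B) and once as (B, A)) when both groups have the same size; if order between the two groups does not matter, keeping only the pairs whose first group contains the first item removes the mirror images.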

C++: Design: should I use enum here?

What is the preferred way in C++ to do this: split the letters of the alphabet into 7 groups so that I can later ask whether a char is in group 1, 3 or 4, etc.? I can of course think of several ways of doing this myself, but I want to know the standard approach and stick with it when doing this kind of thing.
0: AEIOUHWY
1: BFPV
2: CGJKQSXZ
3: DT
4: MN
5: L
6: R

best way in C++ to do this: Split the letters of the alphabet into 7 groups so I can later ask if a char is in group 1, 3 or 4 etc... ?
The most efficient way to do the "split" itself is to have an array from letter/char to number.
// A B C D E F G H...
const char lookup[] = { 0, 1, 2, 3, 0, 1, 2, 0...
A switch/case statement's another reasonable choice - the compiler can decide itself whether to create an array implementation or some other approach.
It's unclear what use you plan to make of those 0-6 values, but an enum appears to be a reasonable encoding choice. That has the advantage of still supporting any use you might have for those specific numeric values (e.g. in < comparisons, streaming...) while being more human-readable and compiler-checked than "magic" numeric constants scattered throughout the code. Constant ints of any width are also likely to work fine, but won't have a unifying type.

Create a lookup table.
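#include <cctype>   // std::tolower

int lookup[26] = { 0, 1, 2, 3, 0, 1, 2, 0 .... whatever };
inline int getgroup(char c)
{
    // Cast first: passing a negative char to tolower is undefined behaviour.
    return lookup[std::tolower(static_cast<unsigned char>(c)) - 'a'];
}
call it this way
char myc = 'M';
int grp = getgroup(myc);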
Error checks omitted for brevity.
Of course, depending on what the 7 groups represent, you can use enums instead of 0, 1, 2, etc.

Given the small amount of data involved, I'd probably do it as a bit-wise lookup -- i.e., set up values:
cat1 = 1;
cat2 = 2;
cat3 = 4;
cat4 = 8;
cat5 = 16;
cat6 = 32;
cat7 = 64;
Then just create an array of 26 values, one for each letter in the alphabet, with each containing the value of the category for that letter. When you want to classify a letter, you just look up categories[ch-'A'].
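The table-plus-bitmask idea from the answers above can be sketched as follows, in Python rather than C++ for brevity. The group strings come from the question; the names GROUPS, CATEGORY, category_bit and in_groups are assumptions for illustration, and the code assumes plain ASCII letters with no error checking.

```python
GROUPS = ["AEIOUHWY", "BFPV", "CGJKQSXZ", "DT", "MN", "L", "R"]

# One bit per group: group 0 -> 1, group 1 -> 2, ..., group 6 -> 64.
CATEGORY = [0] * 26
for group_index, letters in enumerate(GROUPS):
    for letter in letters:
        CATEGORY[ord(letter) - ord("A")] = 1 << group_index


def category_bit(ch):
    """Return the single-bit category value for an ASCII letter."""
    return CATEGORY[ord(ch.upper()) - ord("A")]


def in_groups(ch, *group_indices):
    """True if ch belongs to any of the given group numbers (0-6)."""
    mask = 0
    for g in group_indices:
        mask |= 1 << g
    return bool(category_bit(ch) & mask)
```

With one bit per group, a question such as "is this char in group 1, 3 or 4?" becomes a single AND against a combined mask, e.g. in_groups('M', 1, 3, 4).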

Assigning list elements to variables

The output from Mathematica with the following operation FactorInteger[28851680048402838857] is as follows:
{{3897424303, 1}, {7402755719, 1}}
My question is: how could I go about extracting the two prime numbers (without the exponents) and assign them to an arbitrary variable?
I basically want to retrieve two primes, whatever they may be, and assign them some variables.
Ex: x0 = 3897424303 and x1 = 7402755719
Thanks!

The output is a list and you can use list manipulating functions like Part ([[ ]]) to pick the pieces you want, e.g.,
{x0, x1} = FactorInteger[28851680048402838857][[All, 1]]
or, without Part:
{{x0,dummy}, {x1,dummy}} = FactorInteger[28851680048402838857];
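The same two ideas (take the first element of every pair, or destructure the nested list directly) look like this in a Python sketch. The list literal is copied from the output shown in the question; the variable names are illustrative only.

```python
# Factorization result in the same shape as FactorInteger's output:
# a list of (prime, exponent) pairs.
factors = [(3897424303, 1), (7402755719, 1)]

# Analogue of [[All, 1]]: keep the first element of each pair.
primes = [prime for prime, _exponent in factors]
x0, x1 = primes

# Analogue of the nested-pattern assignment: unpack each pair in place.
(y0, _), (y1, _) = factors
```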

Implicit in your question is the issue of handling parts of the expression that is returned as output from functions such as FactorInteger. Allow me to suggest alternatives.
1. Keep all of the values in a {list} and access each element with Part:
x = First /@ FactorInteger[7813426]
{2, 31, 126023}
x[[1]]
x[[3]]
2
126023
2. Store factors as values of the function x, mimicking indexation of an array:
(This code uses MapIndexed, Function.)
Clear[x]
MapIndexed[
  (x[First@#2] = First@#1) &,
  FactorInteger[7813426]
];
x[1]
x[3]
2
126023
You can see all the values using ? or ?? (see Information):
?x
Global`x
x[1]=2
x[2]=31
x[3]=126023
