Pig: Joining without nested renaming - mapreduce

I have two datasets
A(af1, af2, af3)
B(bf1, bf2, bf3)
When I join them in Pig as
C = JOIN A BY af1, B BY bf1;
And subsequently store as a JSON (after removing the join-predicate column)
store C into 'output.son' using JsonStorage();
I see a JSON schema as
{"A::af1":val, "A::af2":val, ...., "B::bf2":val, ...}
Is there a way I can strip off the unnecessary (as I am taking care of the ambiguity already) nesting-like naming resulting from the join?
Thanks in advance

We have to iterate over the relation (alias) C, generate the required fields under new names, and then store the resulting alias. Let's call the new alias D.
D = FOREACH C GENERATE A::af1 AS af1, A::af2 AS af2, A::af3 AS af3, B::bf2 AS bf2, B::bf3 AS bf3;
STORE D INTO 'output.son' USING JsonStorage();
Update :
If there are 100 unique field names in alias A, and likewise in B, then after the join we can use the .. (range) operator to select the required columns. We can also access the required fields by position ($0..$99, $101..$199).
C = JOIN A BY af1, B BY bf1;
D = FOREACH C GENERATE af1..af100,bf2..bf100;
STORE D INTO 'output.son' USING JsonStorage();

Create a new Rcpp::DateVector using given dates as string

How can I create a new DateVector (a C++ class from the package Rcpp) with dates given during compile time?
Since a DateVector is a NumericVector I can do this:
DateVector d = DateVector::create(14974, 14975, 15123); // TODO how to use real dates instead?
But I would prefer to use a more intuitive and human-readable representation like
DateVector d = DateVector::create("2010-12-31", "2011-01-01", "2011-05-29");
but this causes a compiler error like
Rcpp/include/Rcpp/vector/converter.h:34:27: error: no matching
function for call to ‘caster::target>(const char [11])’
return caster(input) ;
Edit: I have found a working example (also showing some variations):
DateVector d = DateVector::create(Date("2010-12-31"), Date("01.01.2011", "%d.%m.%Y"), Date(2011, 05, 29));

Need Time efficient data structure to store computed value using values from different maps that have intersecting key type but same value type

I have a problem
I have 4 classes
Need to maintain 3 maps
class A
{
};
class B
{
};
class D
{
int d1;
int d2;
};
std::unordered_map<A, D> m1;
std::unordered_map<B, D> m2;
std::unordered_map<std::pair<A, B>, D> m3;
Note: These maps are not updated frequently.
Problem
1. Given an object a of A, an object b of B, and the pair (a, b),
2. I need to get an object d of D such that
d.d1 = std::min({ m1[a].d1, m2[b].d1, m3[std::make_pair(a, b)].d1 });
d.d2 = std::min({ m1[a].d2, m2[b].d2, m3[std::make_pair(a, b)].d2 });
Note: A value may not exist, in which case it is skipped. E.g., if m1[a] does not exist for a pair (a, b), then min( m2[b], m3[std::make_pair(a, b)] ) is used.
The simple solution is
1. find
m1[a], m2[b], m3[pair(a, b)]
three look-ups
2. perform minimum on d1 and d2
3. Get d
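The simple approach above could be sketched roughly as follows. This is only an illustration under stated assumptions: A and B are empty in the question, so a hypothetical int id member, the equality operators, the hash functors (HashA, HashB, HashAB) and the helper names merge_min and lookup_min are all invented here. It uses find() rather than operator[] so that missing keys are skipped and not inserted.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <unordered_map>
#include <utility>

// Hypothetical key types: an int id is assumed because A and B are empty in the question.
struct A { int id; bool operator==(const A& o) const { return id == o.id; } };
struct B { int id; bool operator==(const B& o) const { return id == o.id; } };
struct D { int d1; int d2; };

// std::unordered_map needs a hash for custom keys; std::pair has no std::hash.
struct HashA { std::size_t operator()(const A& a) const { return std::hash<int>()(a.id); } };
struct HashB { std::size_t operator()(const B& b) const { return std::hash<int>()(b.id); } };
struct HashAB {
    std::size_t operator()(const std::pair<A, B>& p) const {
        std::size_t h1 = std::hash<int>()(p.first.id);
        std::size_t h2 = std::hash<int>()(p.second.id);
        // Simple hash combination; any reasonable mixing scheme would do.
        return h1 ^ (h2 + 0x9e3779b9u + (h1 << 6) + (h1 >> 2));
    }
};

typedef std::unordered_map<A, D, HashA> MapA;
typedef std::unordered_map<B, D, HashB> MapB;
typedef std::unordered_map<std::pair<A, B>, D, HashAB> MapAB;

// Fold one candidate value into the running member-wise minimum.
inline void merge_min(bool& found, D& acc, const D& v) {
    if (!found) { acc = v; found = true; return; }
    acc.d1 = std::min(acc.d1, v.d1);
    acc.d2 = std::min(acc.d2, v.d2);
}

// Three look-ups; entries that do not exist are skipped.
// Returns false (and leaves out untouched) if none of the maps has an entry.
inline bool lookup_min(const MapA& m1, const MapB& m2, const MapAB& m3,
                       const A& a, const B& b, D& out) {
    bool found = false;
    D acc = {0, 0};
    MapA::const_iterator i1 = m1.find(a);
    if (i1 != m1.end()) merge_min(found, acc, i1->second);
    MapB::const_iterator i2 = m2.find(b);
    if (i2 != m2.end()) merge_min(found, acc, i2->second);
    MapAB::const_iterator i3 = m3.find(std::make_pair(a, b));
    if (i3 != m3.end()) merge_min(found, acc, i3->second);
    if (found) out = acc;
    return found;
}
```

This is the three-look-up baseline, not a structure that reduces the number of look-ups.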
I can not save every b for every a, as the number of possible keys of A and B is huge
Question
1. I need to store this information in a structure such that, for a pair of a and b, I can get d in fewer than three look-ups. I want to reduce look-up time to the minimum, even if that increases the space used.
2. Also, the time of updating that structure can be more than updating the three maps.
3. If anyone knows any such data structure, please provide the name or the link or some explanation. Thanks.
Regards

Defining two variables in one line, as opposed to 2 lines

I am new-ish to Python and was wondering whether putting code on one line (as opposed to many) is always the way to go.
For example, the two code snippets below do exactly the same thing, but the first one has cut out 1 line of code. Is this considered 'un-pythonic'?
mean1, var1 = np.mean(value), np.var(value)
Or..
mean1 = np.mean(value)
var1 = np.var(value)
That construct:
a,b = c
is particularly useful for unpacking c when it is known to be a collection/iterable of exactly 2 elements.
The usefulness of that:
mean1, var1 = np.mean(value), np.var(value)
is dubious: conceptually, you build a tuple on the right side just to unpack it on the left side. If the goal is a one-liner, you could just as well do this:
mean1 = np.mean(value); var1 = np.var(value)
so that no extra temporary object is created.

subset Armadillo field

If I understand correctly, a field in Armadillo is like a List for arbitrary objects. For instance a set of matrices of different sizes, or matrices and vectors. In the documentation I have seen the type cube which can be used with slices so you can subset using them. However, it seems there is no specific method to subset the fields.
A simplified version of my code is:
arma::mat A = arma::eye(2,2);
arma::mat B = arma::eye(3,3)*3;
arma::mat C = arma::eye(4,4)*4;
arma::field<arma::mat> F(3,1);
F(0,0) = A;
F(1,0) = B;
F(2,1) = C;
// to get matrices B and C
F.slices(1,2);
but I get the error
Error: field::slices(): indicies out of bounds or incorrectly used
Firstly, there is a small error in the code you presented:
F(2,1) = C;
I assume it should be:
F(2,0) = C;
Secondly, the function slices() is only valid for 3D fields. Your field F, however, is only a 2D field because you only specify rows and columns in the constructor. To access matrices B and C, you can instead use:
arma::field<arma::mat> G=F.subfield(1,0,2,0);
or:
arma::field<arma::mat> G=F.rows(1,2);
More info on the subfield views at this page.

Filter Many to Many relation in Django

I have this structure of model objects:
class A(models.Model):
    b = models.ManyToManyField("B")

class B(models.Model):
    c = models.ForeignKey("C")
    d = models.ForeignKey("D")

class C(models.Model):
    d = models.ForeignKey("D")
This is the query I'm trying to get:
I want to get all the B objects of an A object and then, for each B object, compare its d object with its c.d object.
I know I can simply iterate over the B collection with a for loop and make this comparison.
But I dug into the ManyToMany relation and noticed I can do the following:
bObjects = A.objects.all().b
q = bObjects.filter(c__d=None)
This works: it gives me all the objects whose c has its d field set to None. But when I try the following:
q = bObjects.filter(c__d=d)
It fails with "d not defined", although d is a field of B just like c.
What can be the problem? I'd be happy if you could suggest another way to do this task.
In general, I'm trying to write my query as a single operation over the many-to-many sub-objects, without using loops.
q = bObjects.filter(c__d=d)  # gives me "d not defined", but d is a field of B just like c
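Try this, using an F() expression so that the query refers to B's own d field instead of an undefined Python name d: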
from django.db.models import F
q = bObjects.filter(c__d=F('d'))
As for the question in your comment below, you can run 1 SQL query instead of 100 in either of these ways:
1) if you can express your selection of A objects in terms of a query (for example a.price<10 and a.weight>20) use this:
B.objects.filter(a__price__lt=10, a__weight__gt=20, c__d=F('d'))
or this:
B.objects.filter(a__in=A.objects.filter(price__lt=10, weight__gt=20), c__d=F('d'))
2) if you just have a python list of A objects, use this:
B.objects.filter(a__pk__in=[a.pk for a in your_a_list], c__d=F('d'))