How to self-join two bags? - mapreduce

I have a set of number pairs that describes connections between a first set of integers and a second set of integers. For example:
1,2
3,4
5,6
5,7
6,8
I then load my data as follows, and group it:
data = load 'data.csv' using PigStorage(',') as (integer_1:int, integer_2:int);
grouped = group data by integer_1;
grouped_numbers = foreach grouped generate group as node, data.integer_2 as connection;
This yields each first integer together with the bag of its first-degree connections:
(1,{(2)})
(3,{(4)})
(5,{(6),(7)})
(6,{(8)})
I would then like to self-join the grouped_numbers relation, so that each first integer ends up with all of its first- and second-degree connections. In this case, that would be:
(1,{(2)})
(3,{(4)})
(5,{(6),(7),(8)})
(6,{(8)})
because 5 is connected to 6, which is connected to 8, so 8 is a second-degree connection of 5. How would I implement this in Pig?

First join:
joined = join data1 by integer_2, data2 by integer_1;
where data1 and data2 are the same relation (copies of data in this example). Then group by the first field. The inner bag will have all the connections to the group, possibly more than once, so you might also need a distinct on the inner bags if you just want the unique elements.
(answered via a Pig mailing list)
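To make the join-then-group recipe concrete, here is a minimal sketch of the same computation in plain Python rather than Pig, using the sample pairs from the question; treat it as an illustration of the logic, not a drop-in script:
from collections import defaultdict

pairs = [(1, 2), (3, 4), (5, 6), (5, 7), (6, 8)]

# "join data1 by integer_2, data2 by integer_1": pairs (a, b) and (b, c)
# joined on the shared middle value yield the second-degree pairs (a, c).
second_degree = [(a, c) for (a, b) in pairs for (b2, c) in pairs if b == b2]

# Group all first- and second-degree pairs by the first field; the set
# plays the role of the distinct on the inner bags.
connections = defaultdict(set)
for a, b in pairs + second_degree:
    connections[a].add(b)

print(dict(connections))
# e.g. {1: {2}, 3: {4}, 5: {6, 7, 8}, 6: {8}} (set ordering may vary)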

Related

In a list of lists duplicate the array that contains a list inside per each element that is inside the third array

I am trying to classify and split some data in Python 2.7.
I have this 3D list/array:
data = [
[['FTIM#whitelist.es'],['TTT#hi.com','JORDI#jordilazo.com'],'a','b'],
['a',['PEP#hi.com','LAZO#jordilazo.com','ZORO#hi.com'],['GOKU#jordilazo.es'],'b'],
[['t3'],['y1','y2','y3','y4'],'a','b'],
['b',['r1#m.com','r2#m.com'],'a',['t4#m.com','t5#m.com']]
]
We can imagine that one of the elements inside each row is the SENDER and another the RECEIVER, and we must duplicate the row in order to produce all the possible combinations.
So the result is like:
result = [
['FTIM#whitelist.es','TTT#hi.com','a','b'],
['FTIM#whitelist.es','JORDI#jordilazo.com','a','b'],
['a','GOKU#jordilazo.es','PEP#hi.com','b'],
['a','GOKU#jordilazo.es','LAZO#jordilazo.com','b'],
['a','ZORO#hi.com','GOKU#jordilazo.es','b'],
['t3','y1','a','b'],
['t3','y2','a','b'],
['t3','y3','a','b'],
['t3','y4','a','b'],
['t4#m.com','r1#m.com','a','b'],
['t4#m.com','r2#m.com','a','b'],
['t5#m.com','r1#m.com','a','b'],
['t5#m.com','r2#m.com','a','b'],
]
We create n lists in the result for a row because that row's inner list has n elements.
CONSIDERATIONS:
1- The index may change. It will not always be position 1.
2- There will always be more than 1 inner list (1 for the sender and 1 for the receiver).
3- The order of how the data is displayed does not matter (the most important thing is that the combinations exist).
4- Finally, we have to flatten the list.
If, in the rows of the input, you wrap all non-list items ('a' and 'b') into lists, then this task translates to taking, for each row, the Cartesian product of its members. And for that you can use itertools.product:
import itertools

data = [
    [['FTIM#whitelist.es'],['TTT#hi.com','JORDI#jordilazo.com'],'a','b'],
    ['a',['PEP#hi.com','LAZO#jordilazo.com','ZORO#hi.com'],['GOKU#jordilazo.es'],'b'],
    [['t3'],['y1','y2','y3','y4'],'a','b'],
    ['b',['r1#m.com','r2#m.com'],'a',['t4#m.com','t5#m.com']]
]

result = [combi
          for row in data
          for combi in itertools.product(
              # wrap bare items in a one-element list, so product only sees lists
              *(item if isinstance(item, list) else [item] for item in row)
          )]
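For example, with the data above, the first row expands to the two tuples ('FTIM#whitelist.es', 'TTT#hi.com', 'a', 'b') and ('FTIM#whitelist.es', 'JORDI#jordilazo.com', 'a', 'b'). Note that itertools.product yields tuples; if the flattened rows need to be lists instead, wrap each combination, e.g. take list(combi) inside the same comprehension.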

How to merge these two records into one row, removing null values, in Informatica using a transformation

Input:
Code  value  Min   Max
A     abc    10    null
A     abc    null  20
Output:
Code  value  Min   Max
A     abc    10    20
You can use an Aggregator transformation to remove the nulls and get a single row. This solution is based on your sample data only.
Use an Aggregator with the ports below:
inout_Code (group by)
inout_value (group by)
in_Min
in_Max
out_Min = MAX(in_Min)
out_Max = MAX(in_Max)
Then attach out_Min, out_Max, Code, and value to the target.
You will get one record per combination of Code and value, and the null values will be gone.
Now, if you have many more code/value combinations, some with null Min or Max columns, and you want multiple output records, you will need more complex mapping logic. Let me know if this helps. :)
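If you want to sanity-check the logic outside Informatica, here is a minimal pandas sketch of what this Aggregator does with the sample data (an illustration of the aggregation only, not the Informatica mapping itself; the column names are taken from the question):
import pandas as pd

df = pd.DataFrame({'Code': ['A', 'A'],
                   'value': ['abc', 'abc'],
                   'Min': [10, None],
                   'Max': [None, 20]})

# MAX per (Code, value) group skips nulls, collapsing the two rows into one.
print(df.groupby(['Code', 'value'], as_index=False)[['Min', 'Max']].max())
#   Code value   Min   Max
# 0    A   abc  10.0  20.0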

How to do it in Informatica

New to Informatica.
For example: this is a flat file to flat file load.
I have an expression that has calculated the data into the sample given below.
Some CUST values have one entry with the N flag, and some have two, with N and Y.
I need only the 1-and-N or 2-and-Y occurrence to reach the target table, as stated below. Please let me know how to do it in Informatica.
Source
CUST-111|N|1
CUST-222|N|1
CUST-222|Y|2
CUST-333|N|1
CUST-444|N|1
CUST-555|N|1
CUST-555|Y|2
CUST-666|N|1
CUST-666|Y|2
Target:
CUST-111|N|1
CUST-222|Y|2
CUST-333|N|1
CUST-444|N|1
CUST-555|Y|2
CUST-666|Y|2
Thanks a lot guys
You can first calculate the count per customer. Then, if count = 1 and flag = N, pass the record to the target; else, if count > 1, pass only the record with flag = Y.
Steps below:
1. Sort the data by Cust ID (CID).
2. Use an Aggregator to calculate the count: use CUST_ID as the group-by port and create a new output port out_FLAG_CNT = COUNT(*).
3. Use a Joiner to join the output of step 2 with the sorted data from step 1. The join condition is Cust ID.
4. Use a Filter with the condition below:
IIF(out_FLAG_CNT > 1 AND FLAG = 'Y', TRUE, IIF(out_FLAG_CNT = 1 AND FLAG = 'N', TRUE, FALSE))
5. Finally, link this data to the target.
                             |--> AGG (count by CID) --|
SQ --> SRT (sort by CID) --> |                         |--> JNR (on CID) --> FIL (condition above) --> Target
                             |------------------------>|
Please note: if you have more than one N or more than one Y record per customer, the above will not work and you will need to attach another Aggregator at the end.
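Again as a cross-check outside Informatica, here is a minimal pandas sketch of the same count-then-filter logic on the sample data (an illustration only; the column names CUST, FLAG, OCC are assumptions for the three pipe-delimited fields):
import pandas as pd

rows = [('CUST-111', 'N', 1), ('CUST-222', 'N', 1), ('CUST-222', 'Y', 2),
        ('CUST-333', 'N', 1), ('CUST-444', 'N', 1), ('CUST-555', 'N', 1),
        ('CUST-555', 'Y', 2), ('CUST-666', 'N', 1), ('CUST-666', 'Y', 2)]
df = pd.DataFrame(rows, columns=['CUST', 'FLAG', 'OCC'])

# out_FLAG_CNT: number of records per customer
cnt = df.groupby('CUST')['FLAG'].transform('count')

# the Filter condition: (count = 1 and flag N) or (count > 1 and flag Y)
keep = ((cnt == 1) & (df['FLAG'] == 'N')) | ((cnt > 1) & (df['FLAG'] == 'Y'))
print(df[keep])  # one row per customer, matching the target above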

Pandas - identify unique triplets from a df

I have a dataframe which represents unique items. Each item is uniquely identified by a set of varA, varB, and varC values (so each item has 0 to n values for varA, varB, or varC). My df has multiple rows per unique item, with various combinations of varA, varB, and varC.
The df is like this (ID is unique in the column, but it doesn't represent the unique item):
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
                   'varA': ['a', 'd', 'a', 'm', 'Z'],
                   'varB': ['b', 'e', 'k', 'e', np.nan],
                   'varC': ['c', 'f', 'l', np.nan, 't']})
So in the df here, you can see that:
1 and 3 are the same item, with {varA: [a], varB: [b, k], varC: [c, l]}.
2 and 4 are also the same, with {varA: [d, m], varB: [e], varC: [f]}.
I would like to identify every unique item, give them a unique id, and store their information.
The code I have written is terribly inefficient:
Step 1: I walk through each row of the dataframe and make a list of each variable.
When all three variables are new, it's a new item and I give it an id.
When any of the variables is known, I store the new values in their respective lists and keep walking to the next row.
Step 2: Once I have walked the whole dataframe, I have two subsets:
one with a unique id,
the other without a unique id, but whose information can be found in the rows that have a unique id, via either varA, varB, or varC. So, quite uglily, I merge successively on each variable and find the unique id.
Result: I have the same df as at the start, but with a column of repeated unique identifiers.
This works well on 20,000 input rows with varA and varB, but it runs very slowly and dies before the end (between Step 1 and Step 2) on 100,000 rows, and I need to run it on 1,000,000 rows.
Any pandanique way of doing this?
You can use chained boolean indexing using duplicated (pd.Series.duplicated).
If you want to keep the first occurrence of a duplicate:
myfilter = ~df.varA.duplicated(keep='first') & \
~df.varB.duplicated(keep='first') & \
~df.varC.duplicated(keep='first')
If you don't want to keep any occurrence of a duplicate:
myfilter = ~df.varA.duplicated(keep=False) & \
~df.varB.duplicated(keep=False) & \
~df.varC.duplicated(keep=False)
Then you can for example give these an incremental uniqueID:
df.loc[myfilter, 'uniqueID'] = np.arange(myfilter.sum(), dtype='int')
df
ID varA varB varC uniqueID
0 1 a b c 0.0
1 2 d e f 1.0
2 3 a k l NaN
3 4 m e NaN NaN
4 5 Z NaN t 2.0

C++ splitting two dimensional vector into groups

I'm trying to achieve a group-by function (as in SQL), using a two-dimensional vector of strings, which represents the data source.
I'm allowing the user to select which field to group by, and I don't know the best way to achieve this.
I don't want to group if the selected field doesn't contain enough consistency. Example:
ID | name | type
1 | Sam | a
2 | Alex | b
3 | Tom | b
4 | Ryan | a
With the above example, grouping by name shouldn't pass because there is too much variability in the data, whereas type is a valid condition. How could I implement this kind of check? I was thinking of keeping track of how many instances of each group field there are.
Would it be unnecessary to store each group in its own vector?
Let's answer your first question:
How do you determine if an attribute is valid to group on?
You want low variability, so you need a metric that tells you whether you should be able to group by that attribute.
A very simple metric would be the number of unique elements in an attribute divided by the total number of elements in that attribute
(1 means all elements are different; 1/(number of elements) means all elements are the same).
So you can set a threshold on whether or not you group on an attribute by that number.
In your example:
name has 4 unique elements out of 4 elements; its score would be 1.
type has 2 unique elements out of 4 elements; its score would be 0.5.
Note this metric may perform poorly on small data-sets.
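Sketched in Python for brevity (the question is about C++, but the arithmetic carries over directly), the metric on the sample table looks like this:
rows = [(1, 'Sam', 'a'), (2, 'Alex', 'b'), (3, 'Tom', 'b'), (4, 'Ryan', 'a')]

def variability(values):
    # unique elements / total elements: 1.0 = all different, 1/n = all the same
    return len(set(values)) / float(len(values))

print(variability([r[1] for r in rows]))  # name: 1.0 -> too variable to group on
print(variability([r[2] for r in rows]))  # type: 0.5 -> groupable under, say, a 0.6 threshold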
No, it's not necessary to store each group in its own vector (but it will work).
Other solutions:
Create a struct/class to hold your data and store instances of that class in a vector:
vector[0] => {id: 1, name: Sam, type: a}
vector[1] => {id: 2, name: Alex, type: b}
vector[2] => {id: 3, name: Tom, type: b}
vector[3] => {id: 4, name: Ryan, type: a}
You could then group by sorting based on a specific key (i.e. based on type),
or
Create a hash or map keyed by the grouping attribute; each entry stores pointers to your objects:
type_hash["a"] => list of pointers to data objects with type a
type_hash["b"] => list of pointers to data objects with type b
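As a language-neutral sketch of that last idea (again in Python for brevity; in C++ this would be something like a std::unordered_map<std::string, std::vector<Row*>>), a single map keyed by the grouping attribute collects the rows of each group:
from collections import defaultdict

rows = [(1, 'Sam', 'a'), (2, 'Alex', 'b'), (3, 'Tom', 'b'), (4, 'Ryan', 'a')]

groups = defaultdict(list)      # grouping key -> rows in that group
for row in rows:
    groups[row[2]].append(row)  # group by the 'type' field

print(dict(groups))
# {'a': [(1, 'Sam', 'a'), (4, 'Ryan', 'a')], 'b': [(2, 'Alex', 'b'), (3, 'Tom', 'b')]}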