List relationships in pig - mapreduce

I have a list like this:
a, 2
b, 1
a, 5
c, 5
d, 3
a, 3
and I want to convert it to:
a, 2,3,5
b, 1
c, 5
d, 3
In other words, I need to find the numbers that are related to each letter.
My idea is to filter the data to get a list of unique letters (a, b, c, d) and then, for each one, find the numbers related to it.
How can I find all the numbers that are related to a letter? Do I need to do it one by one? If I have a very large data set, would that work? Or is there some facility in Pig that I can use to accomplish this?

Can you try this?
input:
a, 2
b, 1
a, 5
c, 5
d, 3
a, 3
PigScript:
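-- Load the comma-separated pairs; both fields are read as chararray.
A = LOAD 'input' USING PigStorage(',') AS (col1:chararray,col2:chararray);
-- Collect all rows that share the same letter.
B = GROUP A BY col1;
C = FOREACH B {
    -- Sort each group's values (string order, since col2 is a chararray).
    sortedRow = ORDER A BY col2 ASC;
    -- BagToString joins with '_' by default; replace '_' plus whitespace with ','.
    GENERATE group,FLATTEN(REPLACE(BagToString(sortedRow.$1.col2),'_\\s+',','));
}
STORE C INTO 'output' USING PigStorage(',');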
Output: (will be stored in output/part* file)
a, 2,3,5
b, 1
c, 5
d, 3
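For readers who want to see the group, sort, and join steps outside Pig, here is a minimal Python sketch of the same idea. It is only an illustration, not part of the Pig answer: the function name group_values is made up for this example, and it sorts the values numerically, whereas the script above sorts them as strings because col2 is loaded as a chararray.

```python
from collections import defaultdict


def group_values(pairs):
    """Group values by key, sort each group numerically, join with commas."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(int(value))
    # Sort the keys for stable output, and the values within each group.
    return {
        key: ",".join(str(v) for v in sorted(values))
        for key, values in sorted(groups.items())
    }


# The sample rows from the question.
rows = [("a", 2), ("b", 1), ("a", 5), ("c", 5), ("d", 3), ("a", 3)]
grouped = group_values(rows)
for key, joined in grouped.items():
    print(f"{key}, {joined}")
```

On a cluster, Pig does the grouping in the shuffle phase, so nothing has to be processed "one by one" per letter; the sketch just makes the logic explicit on a small in-memory list.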

ARRAYFORMULA for MAX value doesn't work with N/A input

I am using an ARRAYFORMULA in column F to output the maximum of three values, in columns C, D and E. When C, D, and E are all numbers, the formula works perfectly. However, when any of C, D, or E contains letters (e.g., N/A or NA), the formula breaks.
Here is the formula:
=ARRAYFORMULA(if(A2:A=0,,IFERROR(1*IF(C2:C>D2:D,IF(C2:C>E2:E,C2:C,E2:E),IF(D2:D>E2:E,D2:D,E2:E)),0)))
How can I get it to work even when letters are present?
My desired result in the example above is 763.
A reproduction of the problem.
=ARRAYFORMULA(QUERY(TRANSPOSE(QUERY(TRANSPOSE(IFERROR(B2:F*1, 0)),
"select "&REGEXREPLACE(JOIN( , IF(LEN(A2:A),
"max(Col"&ROW(A2:A)-ROW(A2)+1&"),", "")), ".\z", "")&"")),
"select Col2"))

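The key step in the formula above is IFERROR(B2:F*1, 0), which turns every non-numeric cell into 0 before the maximum is taken. A rough Python sketch of that idea follows; the helper names and the sample rows are invented for illustration (only the value 763 comes from the question), so treat it as a model of the logic rather than a description of how Sheets evaluates the formula.

```python
def to_number(cell):
    """Return the cell as a float, or 0.0 when it is not numeric (e.g. "N/A")."""
    try:
        return float(cell)
    except (TypeError, ValueError):
        return 0.0


def row_max(row):
    """Maximum of a row after coercing non-numeric cells to 0."""
    return max(to_number(cell) for cell in row)


# Hypothetical rows standing in for columns C, D and E.
rows = [
    [120, "N/A", 763],
    [5, 9, 2],
    ["NA", "N/A", "NA"],
]
maxima = [row_max(row) for row in rows]
print(maxima)
```

Note that a row with no numbers at all yields 0 under this scheme, which may or may not be the result you want.
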
Conditional calculation based on another column

I have a cross reference table and another table with the list of "Items".
I connect "PKG" to "Item", as "PKG" has distinct values.
Example:
**Cross table**      **Item table**
Bulk  PKG            Item  Value
A     D              A     2
A     E              B     1
B     F              C     4
C     G              D     5
                     E     8
                     F     3
                     G     1
After connecting the two tables above by PKG and Item, I get the following result:
Item  Value  Bulk  PKG
A     2
B     1
C     4
D     5      A     D
E     8      A     E
F     3      B     F
G     1      C     G
As you can see, nothing shows up for the first 3 items, since the tables are connected by PKG and those items are "Bulk" values.
I am trying to create a new column that uses the cross reference table.
I want to create the following, with a new column:
Item  Value  Bulk  PKG  NEW COLUMN
A     2                 5
B     1                 3
C     4                 1
D     5      A     D    5.75
E     8      A     E    9.2
F     3      B     F    3.45
G     1      C     G    1.15
The new column is what I am trying to create.
I want the original values to show up for bulk as they appear for pkg. I then want the Pkg items to be 15% higher than the original value.
How can I calculate this based on the setup?
Just write a conditional custom column in the query editor:
New Column = if [Bulk] = null then [Value] else 1.15 * [Value]
You can also do this as a DAX calculated column:
New Column = IF( ISBLANK( Table1[Bulk] ), Table1[Value], 1.15 * Table1[Value] )
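To make the rule in these two formulas concrete, here is a small Python sketch that applies the same condition to the merged rows from the question. The function name new_column and the markup parameter are assumptions made for this example. It mirrors the formulas exactly as written, so rows with a blank Bulk simply keep their own Value.

```python
# Merged rows from the question; None stands for a blank Bulk cell.
rows = [
    {"Item": "A", "Value": 2, "Bulk": None},
    {"Item": "B", "Value": 1, "Bulk": None},
    {"Item": "C", "Value": 4, "Bulk": None},
    {"Item": "D", "Value": 5, "Bulk": "A"},
    {"Item": "E", "Value": 8, "Bulk": "A"},
    {"Item": "F", "Value": 3, "Bulk": "B"},
    {"Item": "G", "Value": 1, "Bulk": "C"},
]


def new_column(row, markup=1.15):
    """Keep Value when Bulk is blank, otherwise raise it by 15%."""
    if row["Bulk"] is None:
        return row["Value"]
    # Round to two decimals to hide floating-point noise.
    return round(markup * row["Value"], 2)


results = {row["Item"]: new_column(row) for row in rows}
for item, value in results.items():
    print(item, value)
```

For the PKG items (D to G) this gives the 15% uplift asked for; for the Bulk items (A to C) it returns their own values, which is what the formulas above compute.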

Use regular expression to extract elements from a pandas data frame

From the following data frame:
d = {'col1':['a-1524112-124', 'b-1515', 'c-584854', 'a-15154']}
df = pd.DataFrame.from_dict(d)
My ultimate goal is to extract the letters a, b or c (as string) in a pandas series. For that I am using the .findall() method from the re module, as shown below:
# import the module
import re
# define the patterns
pat = 'a|b|c'
# extract the patterns from the elements in the specified column
df['col1'].str.findall(pat)
The problem is that the output i.e. the letters a, b or c, in each row, will be present in a list (of a single element), as shown below:
Out[301]:
0 [a]
1 [b]
2 [c]
3 [a]
While I would like to have the letters a, b or c as string, as shown below:
0 a
1 b
2 c
3 a
I know that if I combine re.search() with .group() I can get a string, but if I do:
df['col1'].str.search(pat).group()
I will get the following error message:
AttributeError: 'StringMethods' object has no attribute 'search'
Using .str.split() won't do the job because, in my original dataframe, I want to capture strings that might contain the delimiter (e.g. I might want to capture a-b)
Does anyone know a simple solution for that, perhaps avoiding iterative operations such as a for loop or list comprehension?
Use extract with capturing groups:
import pandas as pd
d = {'col1':['a-1524112-124', 'b-1515', 'c-584854', 'a-15154']}
df = pd.DataFrame.from_dict(d)
result = df['col1'].str.extract('(a|b|c)')
print(result)
Output
0
0 a
1 b
2 c
3 a
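If you want a Series rather than a one-column DataFrame, str.extract accepts expand=False. The sketch below also shows one possible way to capture a token that contains the delimiter, such as a-b, by listing the longer alternative first; the second input series is a made-up example, not data from the question.

```python
import pandas as pd

df = pd.DataFrame({"col1": ["a-1524112-124", "b-1515", "c-584854", "a-15154"]})

# With a single capturing group and expand=False the result is a Series.
letters = df["col1"].str.extract(r"(a|b|c)", expand=False)
print(letters.tolist())

# Hypothetical data where the wanted token itself contains the delimiter.
# Alternatives are tried left to right, so 'a-b' must come before 'a'.
mixed = pd.Series(["a-b-123", "c-584854"])
tokens = mixed.str.extract(r"^(a-b|a|b|c)", expand=False)
print(tokens.tolist())
```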
Fix your code
pat = 'a|b|c'
df['col1'].str.findall(pat).str[0]
Out[309]:
0 a
1 b
2 c
3 a
Name: col1, dtype: object
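Alternatively, try str.split() and keep only the first piece. With expand=True the result is a DataFrame with one column per piece, so select column 0 before assigning it back:
import pandas as pd
d = {'col1':['a-1524112-124', 'b-1515', 'c-584854', 'a-15154']}
df = pd.DataFrame.from_dict(d)
# Split once on the first '-' and keep the part before it.
df['col1'] = df["col1"].str.split("-", n=1, expand=True)[0]
print(df.head())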
Output:
col1
0 a
1 b
2 c
3 a

R: How to group and aggregate list elements using regex?

I want to aggregate (sum up) the following product list by groups (see below):
prods <- list("101.2000"=data.frame(1,2,3),
"102.2000"=data.frame(4,5,6),
"103.2000"=data.frame(7,8,9),
"104.2000"=data.frame(1,2,3),
"105.2000"=data.frame(4,5,6),
"106.2000"=data.frame(7,8,9),
"101.2001"=data.frame(1,2,3),
"102.2001"=data.frame(4,5,6),
"103.2001"=data.frame(7,8,9),
"104.2001"=data.frame(1,2,3),
"105.2001"=data.frame(4,5,6),
"106.2001"=data.frame(7,8,9))
test= list("100.2000"=data.frame(2,3,5),
"100.2001"=data.frame(4,5,6))
names <- c("A", "B", "C")
prods <- lapply(prods, function (x) {colnames(x) <- names; return(x)})
Each element of the product list (prods) has a name combination of the product number and the year (e.g. 101.2000 --> 101 = prod nr. and 2000 = year). And the groups only contain product numbers for the aggregation.
group1 <- c(101, 106)
group2 <- c(102, 104)
group3 <- c(105, 103)
My expected result, shows the aggregated product groups by year:
$group1.2000
A B C
1 8 10 12
$group2.2000
A B C
1 5 7 9
$group3.2000
A B C
1 11 13 15
$group1.2001
A B C
1 8 10 12
$group2.2001
A B C
1 5 7 9
$group3.2001
A B C
1 11 13 15
So far, I tried this way: First I decomposed the names of prods into product numbers:
prodnames <- names(prods)
prodnames_sub <- gsub("\\..*.","", prodnames)
And then I tried to aggregate using lapply:
lapply(prods, function(x) aggregate( ... , FUN = sum)
However, I didn't find how to implement the previous product numbers in the aggregation function. Ideas? Thanks
Here are two approaches. No packages are used in either one.
1) Using lists. Create a two-column data.frame S from the groups, whose columns are the products (values column) and the associated groups (ind column). Create the list to split by, By. In the code that produces By, sub("\\..*", "", names(prods)) extracts the product numbers, and match is then used to find the associated group; sub(".*\\.", "", names(prods)) extracts the year. Next, perform the split and lapply over it to run the summations. The two components of By (group and year) can be reversed to change the order of the output, if desired.
S <- stack(list(group1 = group1, group2 = group2, group3 = group3))
By <- list(group = S$ind[match(sub("\\..*", "", names(prods)), S$values)],
year = sub(".*\\.", "", names(prods)))
lapply(split(prods, By), function(x) colSums(do.call(rbind, x)))
2) Using data.frames. Convert the groups and prods each to a data frame, merge them, perform an aggregate, and split the result back into a list. The output is the same as requested except for order. (Reverse the two right-hand variables in the aggregate formula to get the order shown in the question, but that will also reverse the two parts of each component name in the output list.)
S <- stack(list(group1 = group1, group2 = group2, group3 = group3))
DF0 <- do.call(rbind, prods)
DF <- cbind(do.call(rbind, strsplit(rownames(DF0), ".", fixed = TRUE)), DF0)
M <- merge(DF, S, all.x = TRUE, by = 1)
Ag <- aggregate(cbind(A, B, C) ~ ind + `2`, M, sum)
lapply(split(Ag, paste(Ag[[1]], Ag[[2]], sep = ".")), "[", 3:5)
giving:
$group1.2000
A B C
1 8 10 12
$group1.2001
A B C
4 8 10 12
$group2.2000
A B C
2 5 7 9
$group2.2001
A B C
5 5 7 9
$group3.2000
A B C
3 11 13 15
$group3.2001
A B C
6 11 13 15
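For comparison, the same split-by-group-and-year idea can be sketched in Python. This is not a translation of the R code above, just an illustration of the logic on the question's data. The names base, prod_to_group and totals are invented for this example, and each one-row data frame is represented as a plain tuple of its A, B and C values.

```python
from collections import defaultdict

# A, B, C values per product number, as in the question.
base = {
    "101": (1, 2, 3), "102": (4, 5, 6), "103": (7, 8, 9),
    "104": (1, 2, 3), "105": (4, 5, 6), "106": (7, 8, 9),
}
# Names follow the "<product>.<year>" convention of the prods list.
prods = {f"{p}.{y}": v for y in ("2000", "2001") for p, v in base.items()}

groups = {"group1": [101, 106], "group2": [102, 104], "group3": [105, 103]}
# Invert the group definition: product number -> group name.
prod_to_group = {str(p): g for g, members in groups.items() for p in members}

totals = defaultdict(lambda: [0, 0, 0])
for name, values in prods.items():
    prod, year = name.split(".")
    key = f"{prod_to_group[prod]}.{year}"
    # Column-wise sum of A, B and C within each (group, year).
    totals[key] = [t + v for t, v in zip(totals[key], values)]

for key in sorted(totals):
    print(key, totals[key])
```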

Pandas Series - print columns and rows

For now I am not too worried about the most performant way to get at the data in a Series. Let's say that my Series is as follows:
A 1
B 2
C 3
D 4
If I use a for loop to iterate over it, for example:
for row in seriesObj:
    print(row)
The code above prints the values on the right-hand side. But let's say I want to get at the left-hand column (the index); how might I do that?
All help is greatly appreciated. I am very new to pandas and am having some teething problems.
Thanks.
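Try Series.items (called Series.iteritems in older pandas versions; that name was removed in pandas 2.0).
import pandas as pd
s = pd.Series([1, 2, 3, 4], index=list('ABCD'))
# Each iteration yields an (index label, value) pair.
for ind, val in s.items():
    print(ind, val)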
Prints:
A 1
B 2
C 3
D 4
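If you only need the labels in the left-hand column, you can also read them straight from the Series' index attribute. A short sketch, using the same sample Series rebuilt here so the snippet runs on its own:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=list("ABCD"))

# The index holds the labels shown in the left-hand column.
labels = s.index.tolist()
print(labels)

# Iterate over the labels and look each value up by its label.
for label in s.index:
    print(label, s.loc[label])
```

Looking values up with .loc inside a loop is fine for small data like this; for anything large, vectorized operations are usually a better fit than explicit iteration.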