Unable to avoid duplicate deletion in Apache Pig - mapreduce

I am new to Apache Pig. I want to split and flatten the following input into my required output, i.e. which users viewed each product.
My Input: (UserId, ProductId)
12345 123456,23456,987653
23456 23456,123456,234567
34567 234567,765678,987653
My Required Output: (ProductId, UserId)
123456 12345
123456 23456
23456 12345
23456 23456
987653 12345
987653 34567
234567 23456
234567 34567
765678 34567
My Pig Scripts:
a = load '/home/hadoopuser/ips' using PigStorage('\t') as (key:chararray, val:chararray);
b = foreach a generate key as ky1, FLATTEN(TOKENIZE(val)) as vl1;
c = group b by vl1;
d = foreach c generate group as vl2, $1 as ky2;
e = foreach d generate vl2, BagToString(ky2) as kyy;
f = foreach e generate vl2 as vl3,FLATTEN(STRSPLIT(kyy,'_')) as ky3;
g = foreach f generate vl3, FLATTEN(TOKENIZE(ky3)) as kk1;
dump g;
I got the following output, which eliminates the repeated (duplicate) values:
(23456,12345)
(123456,12345)
(234567,23456)
(765678,34567)
(987653,12345)
I don't know how to solve this problem. Can anyone help me solve it, and show how to do this in a simpler way?

Well, the second line of your code already does exactly what you want; it simply outputs the user first and the product second. Put the FLATTEN first and then the key:
a = load '/home/hadoopuser/ips' using PigStorage('\t') as (key:chararray, val:chararray);
b = foreach a generate FLATTEN(TOKENIZE(val)) as ProductId, key as UserId;
dump b;
(123456,12345)
(23456,12345)
(987653,12345)
(23456,23456)
(123456,23456)
(234567,23456)
(234567,34567)
(765678,34567)
(987653,34567)
As to why you are getting only one result per ProductId with your current code: you are grouping by ProductId, which gives you one row per distinct ProductId, with a bag that contains all of the customers who viewed that product. Then you convert that bag to one long string joined by _, only to convert it back into the same bag as before:
d = foreach c generate group as vl2, $1 as ky2;
e = foreach d generate vl2, BagToString(ky2) as kyy;
f = foreach e generate vl2 as vl3,FLATTEN(STRSPLIT(kyy,'_')) as ky3;
The BagToString UDF converts a bag to a string, joining the values in the bag with a custom delimiter, which defaults to _. In the next line, however, you split it by _, which yields the same bag as before. But then you FLATTEN that bag, so instead of a row with the ProductId and a bag, you now have a row with several fields: the first is the ProductId, and the remaining fields are all the customers that viewed the product:
Before FLATTEN:
(23456,{(23456,23456),(12345,23456)})
(123456,{(23456,123456),(12345,123456)})
(234567,{(34567,234567),(23456,234567)})
(765678,{(34567,765678)})
(987653,{(34567,987653),(12345,987653)})
After FLATTEN:
(23456,23456,23456,12345,23456)
(123456,23456,123456,12345,123456)
(234567,34567,234567,23456,234567)
(765678,34567,765678)
(987653,34567,987653,12345,987653)
And here lies the error: you have only one row for each product, and several fields in each row, one per customer. When applying the last foreach, you select the first field (the product) and the second (the first of the customers), discarding the rest of the customers in each row.
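If you did want to keep the GROUP route, you could flatten the bag of users directly instead of taking the BagToString/STRSPLIT detour. A minimal sketch, reusing the aliases from your script:
c = group b by vl1;
-- b is here a bag of (ky1, vl1) tuples; projecting ky1 and flattening it
-- emits one (ProductId, UserId) row per viewer, duplicates included
g = foreach c generate group as ProductId, FLATTEN(b.ky1) as UserId;
dump g;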

Related

Need to extract data which is in tabular format from a python list

Team A Team B
name xyz abc
addres 345,JH colony 43,JK colony
Phone 76576 87866
name pqr ijk
addres 345,ab colony 43,JKkk colony
Phone 7666666 873336
Above, I have 2 teams with the name, address and phone number of each player in a list. However, there are no tables as such; the data, when I tried to read it, is in a tabular format where Team A and Team B are the 2nd and 3rd columns, and the 1st column holds the tags name, address, phone.
My objective is to fetch only the names of the players, grouped by team name. In this example there are 2 players per team, but it can be between 1 and 2. Can someone share a solution using regular expressions? I tried a bit, but that is giving me random results, such as Team B players appearing in Team A. Can someone help?
This should work for you. In future I would give more detail on your input string; I have assumed spaces. If it uses tabs, try replacing them with four spaces. I have added an extra row which includes a more difficult case.
Warning: If Team B has more players than Team A, it will probably put the extra players in Team A. But it will depend on the exact formatting.
import re
pdf_string = '''          Team A            Team B
name      xyz               abc
addres    345,JH colony     43,JK colony
Phone     76576             87866
name      pqr               ijk
addres    345,ab colony     43,JKkk colony
Phone     7666666           873336
name      forename surname
addres    345,ab colony
Phone     7666666 '''
lines_untrimmed = pdf_string.split('\n')
lines = [line.strip() for line in lines_untrimmed]
space_string = ' ' * 3  # 3 spaces, to allow single spaces within names and teams
# This can be performed as the one-liner below, but I wrote it out for an explanation
lines_csv = []
for line in lines:
    line_comma_spaced = re.sub(space_string + '+', ',', line)
    line_item_list = line_comma_spaced.split(',')
    lines_csv.append(line_item_list)
# lines_csv = [re.sub(space_string + '+', ',', line).split(',') for line in lines]
teams = lines_csv[0]
team_dict = {team: [] for team in teams}
for line in lines_csv:
    if 'name' in line:
        line_abbv = line[1:]  # [1:] to remove the 'name' tag
        for i, team in enumerate(teams):
            if i < len(line_abbv):  # prevents an error if there are fewer names than teams
                team_dict[team].append(line_abbv[i])
print(team_dict)
This will give the output:
{'Team A': ['xyz', 'pqr', 'forename surname'], 'Team B': ['abc', 'ijk']}
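As for the tab caveat above, a minimal pre-processing step (assuming four spaces per tab keeps the columns separated by at least three spaces):
pdf_string = pdf_string.replace('\t', ' ' * 4)  # normalise tabs to spaces before splitting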

How to add multiple Sentences (which are stored in a list) into a pandas dataframe

I would like to perform an aspect analysis of user reviews. The reviews contain various aspects, and therefore they need to be separated into sentences. I save the data in a pandas dataframe and separate the sentences with the nltk library.
I put the separated sentences in a list that I want to format into a dataframe and join to the original dataframe. However, I get an error. Instead of one extra column, I get 19 new columns (the individual sentences are not stored in a single cell; I think every sentence gets its own column). I also tested itertools, but I also get a wrong result.
Can someone help me to get the right format?
I would like to have a new dataframe which looks like that:
U_REVIEW | SENTENCES
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Im a Sentence. Iam another Sentence in a Row. |[u'Im a Sentence.', u'Iam another Sentence in a Row.']
Here we go, next Sentence. Blub, more blubs. |[u'Here we go, next Sentence.', u'Blub, more blubs.']
Once again, more Sentence. And some other information. The Restaurant was ok, but not awesome.|[u'Once again, more Sentence.', u'And some other information.', u'The Restaurant was ok, but not awesome.']
This is what my code looks like:
ta = ta[['U_REVIEW']]
Output:
U_REVIEW
Im a Sentence. Iam another Sentence in a Row.
Here we go, next Sentence. Blub, more blubs.
Once again, more Sentence. And some other information. The Restaurant was ok, but not awesome.
import itertools
import pandas as pd
from nltk.tokenize import sent_tokenize

# the empty lists
sentences = []
ss = []
for sentence in ta['U_REVIEW']:
    # separates the review into sentences
    sentence = sent_tokenize(sentence)
    sentences.append(sentence)
test = itertools.chain(sentences)
# new dataframe to add the sentences
df2 = pd.DataFrame(sentences)
# create column
cols2 = ['REVIEW_SENTENCES']
# bring the two dataframes together
df2 = pd.DataFrame(sentences, columns=cols2)
Output of sentences:
[[u'Im a Sentence.', u'Iam another Sentence in a Row.'], [u'Here we go, next Sentence.', u'Blub, more blubs.'], [u'Once again, more Sentence.', u'And some other information.', u'The Restaurant was ok, but not awesome.']]
Output of test:
<itertools.chain object at 0x000000001316DC18>
Output and Information of the new Dataframe df2:
AssertionError: 1 columns passed, passed data had 19 columns
U_REVIEW | 0 | 1 | 2 ...
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Im a Sentence. Iam another Sentence in a Row. |Im a Sentence |Iam another Sentence in a Row. |
Here we go, next Sentence. Blub, more blubs. |Here we go, next Sentence.|Blub, more blubs. |
Once again, more Sentence. And some other information. The Restaurant was ok, but not awesome.|Once again, more Sentence.|And some other information. |The Restaurant was ok, but not awesome.
Here is a test dataset as a dataframe:
import pandas as pd
ta = pd.DataFrame( ['Im a Sentence. Iam another Sentence in a Row','Here we go, next Sentence. Blub, more blubs.','Once again, more Sentence. And some other information. The Restaurant was ok, but not awsome.'])
ta.columns =['U_REVIEW']
Try this. I have done it in Python 3.5; I think it should work for 2.5 also:
In [44]: import numpy as np
In [45]: df = pd.DataFrame(ta.U_REVIEW.str.split('.', expand=True).replace('', np.nan).fillna(np.nan).values.flatten()).dropna()
In [46]: df
Out[46]:
0
0 Im a Sentence
1 Iam another Sentence in a Row
4 Here we go, next Sentence
5 Blub, more blubs
8 Once again, more Sentence
9 And some other information
10 The Restaurant was ok, but not awsome
Or is this what you want:
ta.U_REVIEW.str.split('.',expand=True)
Out[50]:
0 1 \
0 Im a Sentence Iam another Sentence in a Row
1 Here we go, next Sentence Blub, more blubs
2 Once again, more Sentence And some other information
2 3
0 None None
1 None
2 The Restaurant was ok, but not awsome
or
In [52]: ta.U_REVIEW.str.split('.').apply(list)
Out[52]:
0 [Im a Sentence, Iam another Sentence in a Row]
1 [Here we go, next Sentence, Blub, more blubs, ]
2 [Once again, more Sentence, And some other in...
Name: U_REVIEW, dtype: object
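If the goal is exactly the layout from the question, one list of sentences per review stored in a single new column, here is a minimal sketch, assuming nltk and its punkt tokenizer data are installed:
import pandas as pd
from nltk.tokenize import sent_tokenize  # needs: nltk.download('punkt')

ta = pd.DataFrame({'U_REVIEW': ['Im a Sentence. Iam another Sentence in a Row.',
                                'Here we go, next Sentence. Blub, more blubs.']})
# apply() keeps each review's whole sentence list in one cell of the new column
ta['SENTENCES'] = ta['U_REVIEW'].apply(sent_tokenize)
print(ta)
Unlike str.split('.'), sent_tokenize is generally better at not breaking on abbreviations or numbers that contain dots.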

Match a keyword from a list data type in Cypher

I ran a Cypher query to delete all duplicate relationships with the same name from my graph. A relationship has properties (name, confidence, time). I kept the relationship with the highest confidence value and collected all time values, using the following query:
MATCH (e0:Entity)-[r:REL]-(e1:Entity)
WITH e0, r.name AS relation, COLLECT(r) AS rels, COLLECT(r.confidence) AS relConf,
     MAX(r.confidence) AS maxConfidence, COLLECT(r.time) AS relTime, e1
WHERE SIZE(rels) > 1
SET (rels[0]).confidence = maxConfidence, (rels[0]).time = relTime
FOREACH (rel IN tail(rels) | DELETE rel)
RETURN rels, relation, relConf, maxConfidence, relTime
Old Data:
name,confidence,time
likes, 0.87, 20111201010900
likes, 0.97, 20111201010600
New data:
name,confidence,time
likes, 0.97, [20111201010900,20111201010600]
Could anyone please suggest a MATCH query to find relationships whose new "time" property contains a timestamp from the year 2011? (I converted the time values using toInt while loading from a CSV.)
Your new data structure definitely does not make such searches easy, but it is possible on medium-sized graphs:
MATCH (n:Entity)-[r:REL]->(x)
WHERE ANY(
  t IN extract(ts IN r.time | toString(ts))
  WHERE t STARTS WITH "2011"
)
RETURN r
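Since the time values were stored with toInt, a numeric range test avoids the string conversion entirely. A minimal sketch, assuming every timestamp uses the 14-digit yyyyMMddHHmmss layout shown above:
MATCH (n:Entity)-[r:REL]->(x)
// any timestamp in 2011 falls inside this numeric interval
WHERE ANY(t IN r.time WHERE t >= 20110101000000 AND t < 20120101000000)
RETURN r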

How to know if a variation (e.g. an abbreviation) of a string in a list matches against another list when the original does not?

I am currently searching for a method in R which lets me match/merge two data frames. Alas, both of these data frames contain non-optimal data. They can have certain abbreviations or even typos in them. Therefore I would like to define a list for each abbreviation, and check whether a string contains one of its elements. If the original entries don't match, R should check whether any of the other variants of the abbreviation has a match. To illustrate: the name of a company could end with "Limited" but also with "Ltd." or "Ltd", etc.
EXAMPLE
Data
The Original "Address" file contains:
Company name Address
Deloitte Ltd. New York
Coca-Cola New York
Tesla ltd California
Microsoft Limited Washington
Would have to be merged with "EnterpriseNrList"
Company name EnterpriseNumber
Deloitte Ltd. 221
Coca-Cola 334
Tesla ltd 725
Microsoft Limited 127
So the abbreviations should work in "both directions". That's why I said, if R recognises any of the abbreviations, R should try to match all of them.
All of the matches should be reported as the return.
Therefore I would make up a list "Abbreviations" for each possible abbreviation
Limited.
limited
Ltd.
ltd.
Ltd
ltd
Questions
1) Would this be a good method, or would there be a more efficient way?
2) How can I check a list against a list of possible abbreviations (step 1, see below), sort of a "contains" check as in Excel?
3) How could I make up a list that, for the entries that do not match, replaces the abbreviation with all the other abbreviations (step 2, see below)?
Thoughts for solution
Step 1
As I am still very new to this kind of work, I was thinking the following: use a regex to determine whether a string contains any of the abbreviation options, and create a list which will then contain either -1 if no match could be found or >0 if a match is found. The entries with no pattern match can already be matched against the "Address" list. With the other entries I continue to step 2.
In this step I don't really know how to check against a list of options ("Abbreviations" list).
Step 2
Next I would create a list with the matches from step 1 and rbind together all options. In this step I don't really know how I could create a list that combines e.g. Coca-Cola with all its possible abbreviations.
Coca-Cola Limited
Coca-Cola Ltd.
Coca-Cola Ltd
etc.
Step 3
Lastly I would match/merge this more complete list of companies again with the original "Data" list. With the introduction of step 2 I thought it might be a bit easier on the required computing power, as the original list is about 8000 rows.
I would take a different approach, fixing the tables first, before the merge.
To fix the abbreviations, I would use a regex, case insensitive, with the final dot optional. I start with a list of 'normal word' = vector of abbreviations.
abbrevs <- list('Limited'=c('Limited','Ltd'),'Incorporated'=c('Incorporated','Inc'))
Then I build the corresponding regexes (alternations with an optional dot at the end; the case will be ignored by a parameter in gsub and agrep later):
regexes <- lapply(abbrevs,function(x) { paste0("(",paste0(x,collapse='|'),")[.]?") })
Which gives:
$Limited
[1] "(Limited|Ltd)[.]?"
$Incorporated
[1] "(Incorporated|Inc)[.]?"
Now we have to apply each regex to the company.name column of each df:
for (i in seq_along(regexes)) {
  Address$Company.name <- gsub(regexes[[i]], names(regexes[i]), Address$Company.name, ignore.case=TRUE)
  Enterprise$Company.name <- gsub(regexes[[i]], names(regexes[i]), Enterprise$Company.name, ignore.case=TRUE)
}
This does not take typos into account. For that you'll need to work with agrep or adist; see the sketch after the input data below.
Result for Address example data set:
> Address
Company.name Address
1 Deloitte Limited New York
2 Coca-Cola New York
3 Tesla Limited California
4 Microsoft Limited Washington
Input data used:
Address <- structure(list(Company.name = c("Deloitte Ltd.", "Coca-Cola",
"Tesla ltd", "Microsoft Limited"), Address = c("New York", "New York",
"California", "Washington")), .Names = c("Company.name", "Address"
), class = "data.frame", row.names = c(NA, -4L))
Enterprise <- structure(list(Company.name = c("Deloitte Ltd.", "Coca-Cola",
"Tesla ltd", "Microsoft Limited"), EnterpriseNumber = c(221L,
334L, 725L, 127L)), .Names = c("Company.name", "EnterpriseNumber"
), class = "data.frame", row.names = c(NA, -4L))
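Regarding the typo remark above, a minimal sketch with adist, matching each Address name to its closest Enterprise name; the threshold of 2 edits is an assumption for illustration, not a recommendation:
# edit-distance matrix between the two Company.name columns
d <- adist(Address$Company.name, Enterprise$Company.name, ignore.case = TRUE)
best <- apply(d, 1, which.min)               # index of the closest Enterprise name per Address row
ok <- d[cbind(seq_len(nrow(d)), best)] <= 2  # accept only near matches (assumed threshold)
Address$EnterpriseNumber <- ifelse(ok, Enterprise$EnterpriseNumber[best], NA)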
I would say that the answer depends on whether you have a list of abbreviations or not.
If you have one, you could just look at which elements of your list contain an abbreviation with the grep or grepl functions (grep returns all indexes that have a matching pattern, whereas grepl returns a logical vector).
Also, use the ignore.case = TRUE parameter of these functions, so you don't have to try all capitalized/lowercase possibilities.
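For instance, a minimal sketch using the Address data frame from above (the pattern "ltd" is only illustrative):
# logical vector: which company names contain 'ltd' in any casing
grepl("ltd", Address$Company.name, ignore.case = TRUE)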
If you don't have such a list, my first guess would be to extract the first "word" of each company name (I would guess that there is a single "Deloitte" company, and that it is "Deloitte Ltd"). You can do so with:
unlist(strsplit(CompanyNames,split = " "))
If you wanted to also correct for typos, this is more a question of string distance.
Hope that helps!

Grouping Similar words/phrases

I have a frequency table of words which looks like the one below:
> head(freqWords)
employees work bose people company
1879 1804 1405 971 959
employee
100
> tail(freqWords)
youll younggood yoyo ytd yuorself zeal
1 1 1 1 1 1
I want to create another frequency table which will combine similar words and add up their frequencies.
In the above example, my new table should contain employee and employees as one element with a frequency of 1979. For example:
> head(newTable)
employee,employees work bose people
1979 1804 1405 971
company
959
I know how to find similar words (using adist, stringdist), but I am unable to create the frequency table. For instance, I can use the following to get a list of similar words:
words <- names(freqWords)
lapply(words, function(x) words[stringdist(x, words) < 3])
and the following to get a list of similar two-word phrases:
lapply(words, function(x) words[stringdist2(x, words) < 3])
where stringdist2 is the following:
stringdist2 <- function(word1, word2){
  min(stringdist(word1, word2),
      stringdist(word1, gsub(word2,
                             pattern = "(.*) (.*)",
                             repl = "\\2,\\1")))
}
I do not have any punctuation/special symbols in my words/phrases. (I do not know a lot of R; I created stringdist2 by tweaking an implementation of adist2 I found here, but I do not understand everything about how pattern and repl work.)
So I need help creating the new frequency table.
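A minimal sketch of one way to build it, assuming freqWords is a named numeric vector like the one printed above, and using a greedy rule that assigns every word to the first word in the vector within stringdist 3 (order-dependent, illustrative only):
library(stringdist)

freqWords <- c(employees = 1879, work = 1804, bose = 1405,
               people = 971, company = 959, employee = 100)
words <- names(freqWords)

# assign each word to the first word in the vector within distance 3 (greedy)
groups <- sapply(words, function(w) words[which(stringdist(w, words) < 3)[1]])

# label every group with all of its member words, then sum the frequencies
labels <- tapply(words, groups, paste, collapse = ",")
newTable <- tapply(freqWords, groups, sum)
names(newTable) <- labels[names(newTable)]
newTable
# bose  company  employees,employee  people  work
# 1405      959                1979     971  1804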