Iterating through a list of lists with transitive property - list

I'm looking for a proper way to transform the data:
similarities = [["Parasite", "1917"],
                ["Parasite", "Jojo Rabbit"],
                ["Joker", "Ford v Ferrari"]]
Each sublist contains two films that are similar. I need to iterate through the list and count how many similar films each film has. Similarity is transitive: if movie 1 is similar to movie 2 and movie 2 is similar to movie 3, then movie 1 is similar to movie 3, and vice versa.
The expected outcome looks like this:
Joker - 1 (Ford v Ferrari)
1917 - 2 (Parasite, Jojo Rabbit)
Parasite - 2 (1917, Jojo Rabbit)
Jojo Rabbit - 2 (Parasite, 1917)
Ford v Ferrari - 1 (Joker)
I thought of using a dict or a graph traversal, but nothing has worked so far.
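One way to approach this, offered as a minimal sketch rather than a definitive answer, is to treat each pair as an edge and group the films into connected components with a small union-find structure. The function name similar_films and the choice to sort each list of similar films are assumptions made for illustration.

```python
from collections import defaultdict

similarities = [["Parasite", "1917"],
                ["Parasite", "Jojo Rabbit"],
                ["Joker", "Ford v Ferrari"]]


def similar_films(pairs):
    """Map each film to the sorted list of other films in its similarity group."""
    parent = {}

    def find(film):
        # Register unseen films as their own root, then walk up to the root.
        parent.setdefault(film, film)
        while parent[film] != film:
            parent[film] = parent[parent[film]]  # path halving
            film = parent[film]
        return film

    # Union the two films of every pair.
    for a, b in pairs:
        parent[find(a)] = find(b)

    # Collect the members of each group under their root.
    groups = defaultdict(list)
    for film in parent:
        groups[find(film)].append(film)

    return {film: sorted(other for other in group if other != film)
            for group in groups.values() for film in group}


result = similar_films(similarities)
for film, others in result.items():
    print(f"{film} - {len(others)} ({', '.join(others)})")
```

Because every film in a group is similar to every other film in it, the count for a film is simply the size of its group minus one.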

Related

Merging two Pandas Dataframes using Regular Expressions

I'm new to Python and Pandas, and I am trying to merge two Pandas DataFrames based on regular expressions.
I have one dataframe with some 2 million rows. This table contains data about cars, but the model name is often specified in, let's say, a creative way, e.g. 'Audi A100', 'Audi 100', 'Audit 100 Quadro', or just 'A 100'. The same goes for other brands. This is stored in a column called "Model". In a second column I have the manufacturer.
Index  Model        Manufacturer
0      A 100        Audi
1      A100 Quadro  Audi
2      Audi A 100   Audi
...    ...          ...
To clean up the data I created about 1000 regular expressions that search for certain keywords and stored them in a dataframe called 'regex'. In a second column of this table I store the manufacturer. This value is used in a second step to validate the result.
Index  RegEx        Manufacturer
0      .* A100 .*   Audi
1      .* A 100 .*  Audi
2      .* C240 .*   Mercedes
3      .* ID3 .*    Volkswagen
I hope you get the idea.
As far as I understood, the Pandas function "merge()" does not work with regular expressions. Therefore I use a loop to process the list of regular expressions, then use the "match" function to locate matching rows in the car DataFrame and assign the successfully used RegEx and the suggested manufacturer.
I added two additional columns to the cars table, 'RegEx' and 'Manufacturer'.
for index, row in regex.iterrows():
    # Rows of cars whose Model matches the current pattern
    mask = cars['Model'].str.match(row['RegEx'])
    cars.loc[mask, 'RegEx'] = row['RegEx']
    cars.loc[mask, 'Manufacturer'] = row['Manufacturer']
I learned that 'iterrows' should not be used for performance reasons. The loop takes 8 minutes to finish, which isn't too bad. However, is there a better way to get it done?
Kind regards
Jiriki
I have no idea whether it would be faster (I'd be glad if you tested it), but this doesn't use iterrows():
regex.groupby(["RegEx", "Manufacturer"])["RegEx"]\
     .apply(lambda x: cars.loc[cars['Model'].str.match(x.iloc[0])])
EDIT: Code for reproduction:
import pandas as pd

cars = pd.DataFrame({"Model": ["A 100", "A100 Quatro", "Audi A 100", "Passat V", "Passat Gruz"],
                     "Manufacturer": ["Audi", "Audi", "Audi", "VW", "VW"]})
regex = pd.DataFrame({"RegEx": [".*A100.*", ".*A 100.*", ".*Passat.*"],
                      "Manufacturer": ["Audi", "Audi", "VW"]})
# Output:
#                                 Model Manufacturer
# RegEx      Manufacturer
# .*A 100.*  Audi         0       A 100         Audi
#                         2  Audi A 100         Audi
# .*A100.*   Audi         1 A100 Quatro         Audi
# .*Passat.* VW           3    Passat V           VW
#                         4 Passat Gruz           VW
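If the goal is to keep the question's original result, with the matching pattern and suggested manufacturer written back onto each car row, a variant of the loop that iterates over plain column values with zip() instead of iterrows() might look like the sketch below. It reuses the small reproduction data from above without the pre-filled Manufacturer column. Whether it is noticeably faster on 2 million rows is an assumption that would need to be measured, since the per-pattern str.match call is likely the dominant cost.

```python
import pandas as pd

cars = pd.DataFrame({"Model": ["A 100", "A100 Quatro", "Audi A 100",
                               "Passat V", "Passat Gruz"]})
regex = pd.DataFrame({"RegEx": [".*A100.*", ".*A 100.*", ".*Passat.*"],
                      "Manufacturer": ["Audi", "Audi", "VW"]})

# Result columns start empty; rows that never match keep None.
cars["RegEx"] = None
cars["Manufacturer"] = None

# zip() over the two columns avoids building a Series for every regex row.
for pattern, maker in zip(regex["RegEx"], regex["Manufacturer"]):
    mask = cars["Model"].str.match(pattern)  # computed once per pattern
    cars.loc[mask, "RegEx"] = pattern
    cars.loc[mask, "Manufacturer"] = maker

print(cars)
```

As in the original loop, a row that matches several patterns ends up with the last matching one.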

Strange output with Python Lists

I have a dataframe, outlet_result, which looks something like this:
index Calendar year/Week Material Sellthru Qty
0 37.2013 ABC 2
1 38.2913 ABC 7
2 37.2913 BCG 22
3 39.2013 XYZ 5
Now, I wanted a separate list for the Materials and week for further coding.
I used this code for the material list
mat_outlet = list(set(outlet_result['Material']))
It works perfectly and gives me 3 values (ABC, BCG, XYZ)
However, the week list shows a faulty output even though the code is the same.
week_outlet_list = list(set(outlet_result['Calendar Year/Week']))
I am getting a list with 4 values
['38.2013', '37.2013', 'Calendar Year/Week', '39.2013']
Why is the string (header) included in the list? Please help me understand this concept.
I am using Python 2.7. Does that have something to do with it?

Pandas: Create New Dataframe that Counts Number of Times Keywords / Phrases From List Occur in One Column

I have the following word list:
list = ['clogged drain', 'right wing', 'horse', 'bird', 'collision light']
I have the following data frame (notice spacing can be weird):
ID TEXT
1 you have clogged drain
2 the dog has a right wing clogged drain
3 the bird flew into collision light
4 the horse is here to horse around
5 bird bird bird
I want to create a table that shows keywords and frequency counts of how often the keywords occurred in TEXT field. However, if a keyword appears more than once in the same row within the TEXT column, it is only counted once.
Desired output:
keywords count
clogged drain 2
right wing 1
horse 1
bird 2
collision light 1
I have searched all over stackoverflow but couldn't find my specific case.
I would start by reformatting the TEXT column to get rid of your funny spacing, using str.split() and str.join(). Then, use str.contains for each of your keywords, and get the sum of the boolean values that are outputted (It will return True if your keyword is found):
# Reformat text, splitting wherever you have one or more whitespace characters
df['formatted_text'] = df.TEXT.str.split(r'\s+').str.join(' ')
# Create your output dataframe
df2 = pd.DataFrame(my_list, columns=['keywords'])
# Count occurrences: each row contributes at most 1 per keyword
df2['count'] = df2['keywords'].apply(lambda x: df.formatted_text.str.contains(x).sum())
The result:
>>> df2
keywords count
0 clogged drain 2
1 right wing 1
2 horse 1
3 bird 2
4 collision light 1
Just to note, I changed the variable name of your list to my_list, so as not to mask the built-in Python data type.
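For reference, the approach above can be put together as a self-contained script. The sample rows with irregular spacing are made up to mimic the question's data, and passing regex=False to str.contains is an addition here, on the assumption that the keywords should be matched literally rather than as patterns.

```python
import pandas as pd

my_list = ['clogged drain', 'right wing', 'horse', 'bird', 'collision light']

# Made-up rows that imitate the irregular spacing mentioned in the question.
df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5],
    'TEXT': ['you have  clogged   drain',
             'the dog has a right  wing clogged drain',
             'the bird flew into collision   light',
             'the horse is here to horse around',
             'bird bird bird'],
})

# Collapse runs of whitespace into single spaces.
df['formatted_text'] = df['TEXT'].str.split(r'\s+').str.join(' ')

# str.contains yields one boolean per row, so repeated hits in a row count once.
df2 = pd.DataFrame(my_list, columns=['keywords'])
df2['count'] = df2['keywords'].apply(
    lambda kw: df['formatted_text'].str.contains(kw, regex=False).sum())

print(df2)
```

Note that this is substring matching, so a keyword such as 'bird' would also be counted in a row containing 'birds'.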
You can also use extractall:
df.TEXT.str.extractall(r'({})'.format('|'.join(list)))[0].str.get_dummies().groupby(level=0).sum().gt(0).astype(int).sum()
Out[225]:
bird 2
clogged drain 2
collision light 1
horse 1
right wing 1
dtype: int64

Grouping Similar words/phrases

I have a frequency table of words which looks like below
> head(freqWords)
employees work bose people company
1879 1804 1405 971 959
employee
100
> tail(freqWords)
youll younggood yoyo ytd yuorself zeal
1 1 1 1 1 1
I want to create another frequency table which will combine similar words and add their frequencies
In above example, my new table should contain both employee and employees as one element with a frequency of 1979. For example
> head(newTable)
employee,employees work bose people
1979 1804 1405 971
company
959
I know how to find similar words (using adist, stringdist), but I am unable to create the frequency table. For instance, I can use the following to get a list of similar words
words <- names(freqWords)
lapply(words, function(x) words[stringdist(x, words) < 3])
and the following to get a list of similar two-word phrases
lapply(words, function(x) words[stringdist2(x, words) < 3])
where stringdist2 is the following
stringdist2 <- function(word1, word2){
  min(stringdist(word1, word2),
      stringdist(word1, gsub(word2,
                             pattern = "(.*) (.*)",
                             repl="\\2,\\1")))
}
I do not have any punctuation/special symbols in my words/phrases. (I do not know a lot of R; I created stringdist2 by tweaking an implementation of adist2 I found here, but I do not understand everything about how pattern and repl work.)
So I need help creating the new frequency table.

Filter items with Django Query

I'm encountering this problem and would like to seek your help.
The context:
I have a bag of balls, each of which has an age attribute and a color attribute (red or blue).
What I want is to get the top 10 "youngest" balls with at most 3 blue balls among them (this means that if there are more than 3 blue balls in the list of 10 youngest balls, the "redundant" oldest blue balls are replaced with the youngest red balls).
To get top 10:
sel_balls = Ball.objects.all().order_by('age')[:10]
Now, to also satisfy the conditions "at most 3 blue balls", I need to process further:
Iterate through sel_balls and count the number of blue balls (= B)
If B <= 3: do nothing
Else: get an additional B - 3 red balls to replace the oldest (B - 3) blue balls (these red balls must not already be among the original 10 balls). I figure I can do this by getting the oldest age value among the selected red balls and running another query like:
add_reds = Ball.objects.filter(age__gte=oldest_sel_age)[:B - 3]
My question is:
Is there any way that I can satisfy the constraints in only one query?
If I have to do 2 queries, is there a faster way than the method I mentioned above?
Thanks all.
Use Q for complex queries to the database: https://docs.djangoproject.com/en/dev/topics/db/queries/#complex-lookups-with-q-objects
You should use annotate to do it.
See documentation.
.filter() before .annotate() gives 'WHERE'
.filter() after .annotate() gives 'HAVING' (this is what you need)
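To make the selection rule from the question concrete, here is a plain-Python sketch that works on an in-memory list rather than through the Django ORM. The Ball namedtuple, the function name youngest_with_blue_cap, and the sample data are hypothetical stand-ins. The sketch relies on the observation that "take the 10 youngest, then swap surplus blue balls for the next youngest red ones" gives the same result as walking the balls in age order and skipping any blue ball once 3 have been taken.

```python
from collections import namedtuple

# Hypothetical stand-in for the Django model; not an ORM object.
Ball = namedtuple("Ball", ["name", "age", "color"])


def youngest_with_blue_cap(balls, total=10, max_blue=3):
    """Pick the `total` youngest balls, allowing at most `max_blue` blue ones."""
    selected = []
    blue_count = 0
    for ball in sorted(balls, key=lambda b: b.age):
        if len(selected) == total:
            break
        if ball.color == "blue":
            if blue_count == max_blue:
                continue  # blue quota used up, skip to the next ball
            blue_count += 1
        selected.append(ball)
    return selected


# Six young blue balls followed by nine older red balls.
balls = ([Ball(f"b{i}", i, "blue") for i in range(1, 7)]
         + [Ball(f"r{i}", i + 10, "red") for i in range(1, 10)])

picked = youngest_with_blue_cap(balls)
print([b.name for b in picked])
```

With a database, the same idea suggests fetching the 3 youngest blue balls and the 10 youngest red balls, then merging them by age and keeping the first 10, though that translation to queries is an assumption and has not been tested against a real model.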