Pandas drop duplicates; values in reverse order - python-2.7

I'm trying to find a way to utilize pandas drop_duplicates() to recognize that rows are duplicates when the values are in reverse order.
An example is if I am trying to find transactions where customers purchase both apples and bananas, but the data collection order may have reversed the items. In other words, when combined as a full order the transaction is seen as a duplicate because it is made up of the same items.
I want the following to be recognized as duplicates:
Item1 Item2
Apple Banana
Banana Apple

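First sort the values within each row using apply with sorted, then call drop_duplicates:
# result_type='broadcast' (pandas 0.23+) keeps the original shape and column names
df = df.apply(sorted, axis=1, result_type='broadcast').drop_duplicates()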
print (df)
Item1 Item2
0 Apple Banana
# if you need to restrict the sort to specific columns
cols = ['Item1','Item2']
df[cols] = df[cols].apply(sorted, axis=1, result_type='broadcast')
df = df.drop_duplicates(subset=cols)
print (df)
Item1 Item2
0 Apple Banana
Another solution uses numpy.sort and the DataFrame constructor:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.sort(df.values, axis=1), index=df.index, columns=df.columns).drop_duplicates()
print (df)
Item1 Item2
0 Apple Banana
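The approaches above overwrite the rows with their sorted values. A minimal sketch of a variant, assuming you would rather keep the surviving rows exactly as recorded, uses the sorted copy only to detect repeats. The sample frame (including the Cherry row) and the names sorted_pairs, is_repeat and deduped are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: rows 0 and 1 hold the same pair in reverse order.
df = pd.DataFrame({
    "Item1": ["Apple", "Banana", "Apple"],
    "Item2": ["Banana", "Apple", "Cherry"],
})

cols = ["Item1", "Item2"]

# Sort the values within each row so that reversed pairs become identical.
sorted_pairs = pd.DataFrame(
    np.sort(df[cols].to_numpy(dtype=object), axis=1),
    index=df.index,
    columns=cols,
)

# True for every row whose sorted pair already appeared in an earlier row.
is_repeat = sorted_pairs.duplicated()

# Keep the first occurrence of each pair, with its original column order.
deduped = df[~is_repeat]
print(deduped)
```

Because the mask is built on a separate frame, df itself is never reordered, which matters if Item1 and Item2 carry meaning beyond membership in the order.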

Related

How do I create a comma-separated list of values in Google Sheets

How do I create a comma-separated list of values which pulls the top value from each list using a common value in Google Sheets? For example, if I have three lists and want to use a common value to pull the value of the top list into a comma-separated list (hope that makes sense):
Category 1
apples
oranges
pears
Category 2
apples
pears
grapes
Category 3
oranges
apples
celery
I'm trying to create lists that look like the following using the common value (oranges, apples, etc):
oranges: category 3, category 1
apples: category 1, category 2, category 3
celery: category 3
pears: category 1, category 2
grapes: category 2
So many thanks if someone could help me with this!
If ColA - ColC has the following:
Category 1 Category 2 Category 3
apples apples oranges
oranges pears apples
pears grapes celery
Put the following formula in, say, E1:
={"Values";UNIQUE(TRANSPOSE({TRANSPOSE($A$2:$A$4),TRANSPOSE($B$2:$B$4),TRANSPOSE($C$2:$C$4)}))}
This creates an array with a header and the unique values from the 3 category ranges (rows 2 to 4, change if needed). This way each value is extracted but there are no repeats.
Then, put header text in F1 ("Categories"). Put this formula in F2 and drag down to match each result in ColE:
=TEXTJOIN(", ",TRUE,{{IFERROR(IF(SEARCH($E2,join("",$A$2:$A)),$A$1,""))},{IFERROR(IF(SEARCH($E2,join("",$B$2:$B)),$B$1,""),)},{IFERROR(IF(SEARCH($E2,join("",$C$2:$C)),$C$1,""),)}})
This formula searches each category in ColA - ColC for the unique fruit in ColE. If there's a match, it returns the category header. The TEXTJOIN() function separates the results with a comma.

Creating an Index Column for a Descriptive Data Using "DAX" in Power BI

I have a table like this:
Table1
ColA ColB
Orange Apple
Mango Not Apple
Mango Not Apple
I want to create a column called RowNumber using DAX and not the Query Editor (M).
So the expected output is:
ColA ColB RowNumber
Orange Apple 1
Mango Not Apple 2
Mango Not Apple 3
This can be done in M - Power Query Side.
But I am looking for a solution using a DAX calculated column.
I expected functions like RowNumber (T-SQL) or Index to be present inside DAX.
If you need to create an Index in DAX you can use this formula:
Index = RANKX(ALL(Barges),Barges[Date],,ASC)
RANKX: generates the index values.
ALL: removes any filters on the table, so the index is computed over every row rather than a filtered subset.
The second parameter is the expression to sort by. In my example the index increases with Barges[Date] in ascending order; if I used Barges[name] instead, the index would follow an A-Z sort of the barge names.
Note that RANKX gives tied values the same rank, so the sort column needs unique values to produce a unique index.

Converting a specific column data in .csv to text using Python pandas

I have a .csv file like below where all the contents are text
col1 Col2
My name Arghya
The Big Apple Fruit
I am able to read this csv using pd.read_csv(index_col=False, header=None).
How do I combine all three rows in Col1 into a list separated by a full stop?
If you need to convert the column values to a list:
print (df.Col1.tolist())
#alternative solution
#print (list(df.Col1))
['This is Apple', 'I am in Mumbai', 'I like rainy day']
Then join the values in the list; the output is a string:
a = '.'.join(df.Col1.tolist())
print (a)
This is Apple.I am in Mumbai.I like rainy day
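If the file was read with header=None, the header row becomes the first data row and the columns are labelled 0 and 1:
print (df)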
0 1
0 Col1 Col2
1 This is Apple Fruit
2 I am in Mumbai Great
3 I like rainy day Flood
print (list(df.loc[:, 0]))
#alternative
#print (list(df[0]))
['Col1', 'This is Apple', 'I am in Mumbai', 'I like rainy day']
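The two cases above can be put side by side in a small, self-contained sketch. It assumes the file is comma-separated and inlines the sample rows through io.StringIO instead of reading a real file; csv_text, sentences and joined are illustrative names.

```python
import io
import pandas as pd

# Inline stand-in for the .csv file (assumed to be comma-separated).
csv_text = (
    "Col1,Col2\n"
    "This is Apple,Fruit\n"
    "I am in Mumbai,Great\n"
    "I like rainy day,Flood\n"
)

# Case 1: let pandas use the first line as the header.
df = pd.read_csv(io.StringIO(csv_text))
sentences = df["Col1"].tolist()

# Join with a full stop and a space between the sentences.
joined = ". ".join(sentences)
print(joined)

# Case 2: header=None keeps the header line as data, so skip the first row.
df_raw = pd.read_csv(io.StringIO(csv_text), header=None)
sentences_raw = df_raw[0].iloc[1:].tolist()
print(sentences_raw)
```

Reading with the header in place avoids having to slice off the first row afterwards.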

Keeping duplicates and deleting rest from pandas dataframe

I have 3 different pandas dataframes, which I have concatenated. Now I would like to keep only those rows which appear in three columns and delete the rest. For instance:
Column1 Column2 Column3
0 John a Sam
1 Sam b Rob
2 Daniel c John
3 Varys d Ella
I want to keep only those rows whose Column1 value also appears in Column3. In the above example these are rows 0 and 1.
Desired output
Column1 Column2
0 John a
1 Sam b
Filter the df by passing the Series 'Column3' as an argument to isin, which tests each Column1 value for membership:
In [42]:
df[df['Column1'].isin(df['Column3'])]
Out[42]:
Column1 Column2 Column3
0 John a Sam
1 Sam b Rob
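To get exactly the two columns shown in the desired output, the same isin mask can be combined with a column selection in .loc. A minimal sketch, rebuilding the sample frame from the question; mask and result are illustrative names.

```python
import pandas as pd

# Sample frame taken from the question.
df = pd.DataFrame({
    "Column1": ["John", "Sam", "Daniel", "Varys"],
    "Column2": ["a", "b", "c", "d"],
    "Column3": ["Sam", "Rob", "John", "Ella"],
})

# True where the Column1 value occurs anywhere in Column3.
mask = df["Column1"].isin(df["Column3"])

# Keep the matching rows and only the columns from the desired output.
result = df.loc[mask, ["Column1", "Column2"]]
print(result)
```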

Creating a pandas.DataFrame from a dict

I'm new to using pandas and I'm trying to make a dataframe with historical weather data.
The keys are the day of the year (ex. Jan 1) and the values are lists of temperatures from those days over several years.
I want to make a dataframe that is formatted like this:
... Jan1 Jan2 Jan3 etc
1 temp temp temp etc
2 temp temp temp etc
etc etc etc etc
I've managed to make a dataframe with my dictionary with
df = pandas.DataFrame(weather)
but I end up with 1 row and a ton of columns.
I've checked the documentation for DataFrame and DataFrame.from_dict, but neither was very extensive or provided many examples.
Given that "the keys are the day of the year... and the values are lists of temperatures", your method of construction should work. For example,
In [12]: weather = {'Jan 1':[1,2], 'Jan 2':[3,4]}
In [13]: df = pd.DataFrame(weather)
In [14]: df
Out[14]:
Jan 1 Jan 2
0 1 3
1 2 4
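If some days have fewer recorded years than others, passing the dict directly may fail because the lists differ in length. A minimal sketch of one way around that, assuming the temperatures are plain lists: wrap each list in a Series so that pandas aligns them on the index and pads the shorter ones with NaN. The temperature values below are invented for illustration.

```python
import pandas as pd

# Hypothetical data: "Jan 2" has one fewer year of readings than "Jan 1".
weather = {
    "Jan 1": [31.0, 28.5, 35.2],
    "Jan 2": [30.1, 27.0],
}

# Each list becomes a Series; missing positions are filled with NaN.
df = pd.DataFrame({day: pd.Series(temps) for day, temps in weather.items()})

# Number the rows from 1, as in the layout sketched in the question.
df.index = df.index + 1
print(df)
```

Each key becomes a column and each position in the lists becomes a row, which matches the layout asked for.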