I have two Pandas data frames and they need to be merged. Example data frames are:
pd1 = [[1, [1, 2]]]          # columns c1, c2
pd2 = [[1, [1, 3]],
       [2, [2, 3]]]          # columns c3, c4
result = [[1, 1], [1, 2]]    # columns c1, c3
The join condition is that the lists in c2 and c4 have at least one common element.
I've tried:
result = pd.merge(pd1, pd2, left_on=list('c2'),right_on=list('c4'), how='inner')
However, this seems to join them only when the rows in each column hold single values like a float, int or string.
I've attacked this problem using nested loops, but that runs like a dog once the sets get large. Is there a faster way to perform this merge that exploits data frames, or is there a better approach altogether?
pd1 = pd.DataFrame([[1, [1,2]]], columns=['c1', 'c2'])
pd1
pd2 = pd.DataFrame([[1, [1, 3]], [2, [2, 3]]], columns=['c3', 'c4'])
pd2
Setup for a merge
s2 = pd2.c4.apply(pd.Series).stack() \
.rename_axis(['idx2', 'lst2']).reset_index(name='val')
s2
s1 = pd1.c2.apply(pd.Series).stack() \
.rename_axis(['idx1', 'lst1']).reset_index(name='val')
s1
mrg = s1.merge(s2)[['idx1', 'idx2']].drop_duplicates()
mrg
a1 = pd1.c1.loc[mrg.idx1].values
a2 = pd2.c3.loc[mrg.idx2]
pd.DataFrame(dict(c1=a1, c3=a2))
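For reference, on pandas 0.25+ the same stack-and-merge idea can be written more directly with DataFrame.explode. This is a sketch of the same approach, not the answer's original code:

```python
import pandas as pd

# Flatten each list column to one row per element, merge on the element,
# then keep the unique (c1, c3) pairs.
pd1 = pd.DataFrame([[1, [1, 2]]], columns=['c1', 'c2'])
pd2 = pd.DataFrame([[1, [1, 3]], [2, [2, 3]]], columns=['c3', 'c4'])

s1 = pd1.explode('c2').rename(columns={'c2': 'val'})
s2 = pd2.explode('c4').rename(columns={'c4': 'val'})

result = (s1.merge(s2, on='val')[['c1', 'c3']]
            .drop_duplicates()
            .reset_index(drop=True))
```

This avoids the apply(pd.Series).stack() dance entirely while keeping the same merge-on-exploded-values idea.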
I have a couple of lists with progressively decreasing number of items like below:
list1 = [10, 20, 30]
list2 = [50, 60]
list3 = [80]
I want to print the lists such that the succeeding lists after the 1st are flushed right, i.e., the last items of all the lists (e.g. 30, 60, 80) are aligned under the last column of list1.
Here's a snippet of my list and the codes that I used to display the list as I wanted:
s1 = [4.98, 14.41, -3.16, 2.74, -12.32]
s2 = [-6.59, 14.14, 8.84, 5.68]
s3 = [-29.95, 18.95, 15.75]
s4 = [11.44, -8.22]
s5 = [30.96]
The lists print flushed-left, with all first items aligned in column 1. As I mentioned, I want to print them flushed-right, with all last items aligned in the last column of list s1.
I padded the "blanks" of lists whose lengths are less than list s1 with a dummy item ('zzzz') to see if I could print flushed-right.
Pad1 = ['zzzz']
Pad2 = ['zzzz', 'zzzz']
Pad3 = ['zzzz', 'zzzz', 'zzzz']
Pad4 = ['zzzz', 'zzzz', 'zzzz', 'zzzz']
df_join1 = Pad1 + s2
df_join2 = Pad2 + s3
df_join3 = Pad3 + s4
df_join4 = Pad4 + s5
Padding works; the last item of each list prints flushed right, but the result is ugly because the numbers in the columns are not properly aligned.
There must be a better way to do it. I would greatly appreciate a useful lead. I must admit my script isn't the most efficient; I can clean it up later. For now, I just want to see if there's a better way.
Much thanks.
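One common way to get the flushed-right layout without dummy items is fixed-width formatting. A minimal sketch (the field width of 8 is an arbitrary choice here, not from the original code):

```python
rows = [
    [4.98, 14.41, -3.16, 2.74, -12.32],
    [-6.59, 14.14, 8.84, 5.68],
    [-29.95, 18.95, 15.75],
    [11.44, -8.22],
    [30.96],
]

width = max(len(r) for r in rows)   # number of columns in the longest list
lines = []
for r in rows:
    pad = [' ' * 8] * (width - len(r))               # real blanks instead of 'zzzz'
    lines.append(''.join(pad + ['{:8.2f}'.format(v) for v in r]))
print('\n'.join(lines))
```

Every cell occupies exactly 8 characters, so the numbers line up in columns and the last items of all lists land in the final column.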
I'm trying to combine two queries in Power BI to get an output of all unique combinations.
for instance one list(or column): A, B, C
and another list(or column): 1, 2, 3, 4, 5, 6, 7
Output should be: A1, A2, A3, A4, A5, A6, A7, B1, B2, B3, B4, B5, B6, B7, C1, C2, C3, C4, C5, C6, C7
Is there a way to accomplish this? (Yes, my rows are not equal in count.)
I just don't know the best or right approach for this. I tried using Combine with a helper column and hit a dead end because duplicates get created, unless I did that wrong.
This is essentially a Cartesian product (a.k.a. cross product) of two lists.
If you just have two text lists, you can do a one-liner like this:
List.Combine(List.Transform(List1, (L1) => List.Transform(List2, (L2) => L1 & L2)))
This says for each item X in the first list, create a list that is the second list with X prepended to each element. This gives a list of lists that is flattened out to a single list using the combine function.
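For comparison only, since this answer is Power Query M: the same nested-transform logic, expressed in Python, where itertools.product is exactly the Cartesian product described here.

```python
from itertools import product

list1 = ['A', 'B', 'C']
list2 = ['1', '2', '3', '4', '5', '6', '7']

# Every pairing of the two lists, concatenated.
combos = [a + b for a, b in product(list1, list2)]
```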
It's not uncommon to want to do this with tables too. In that case, the analogous idea is to define a new column on one table where each row is the entire list/column of the other table, and then expand that new column.
Assuming we want the cross product of Table1[Column1] and Table2[Column2]:
let
    Table1 = <Table1 Source>,
    AddCustom = Table.AddColumn(Table1, "Custom", each Table2),
    Expand = Table.ExpandTableColumn(AddCustom, "Custom", {"Column2"}, {"Column2"}),
    Concatenate = Table.AddColumn(Expand, "Concatenate", each [Column1] & [Column2])
in
    Concatenate
Edit:
You can do the concatenation before the expand too:
let
    Table1 = <Table1 Source>,
    AddCustom = Table.AddColumn(Table1, "Custom",
        (T1) => List.Transform(Table2[Column2], each T1[Column1] & _)),
    Expanded = Table.ExpandListColumn(AddCustom, "Custom")
in
    Expanded
References with more detail:
Cartesian Product Joins
Cartesian Join of two queries...
I have an .xlsx file. The rows are good and the values are fine, but I need to change the column order according to a list of new column positions, e.g.:
old = [1, 2, 3, 4]
new = [2, 1, 4, 3]
I've checked the docs; there is no straightforward option for this.
I've tried to iterate over columns, so:
old = [cell.column for cell in ws[1]]  # [1, 2, 3, 4]
new = [2, 1, 4, 3]
for i, col in enumerate(ws.iter_cols(max_col=old[-1]), 1):
    if old[i-1] != new[i-1]:
        for one in ws[get_column_letter(i)]:
            old_cell = one
            new_cell = ws.cell(old_cell.row, old[new[i-1]]-1)
            new_cell.value, old_cell.value = old_cell.value, new_cell.value
        old[i] = new_cell.column
        old[new_cell.column] = old_cell.column
...but it works only in a few cases. I'm probably missing some general solution.
In the end, with old = [1, 2, 3, 4] and new = [2, 1, 4, 3], it should look like this, for example:
Input file:
x   A   B   C   D
1   A1  B1  C1  D1
2   A2  B2  C2  D2
3   A3  B3  C3  D3
4   A4  B4  C4  D4
Output file:
x   A   B   C   D
1   B1  A1  D1  C1
2   B2  A2  D2  C2
3   B3  A3  D3  C3
4   B4  A4  D4  C4
Your current approach risks overwriting cells. I'd be tempted to move the cells from existing columns to new columns in the correct order and then delete the old ones.
for c in ws['A']:
    new_cell = c.offset(column=5)   # column A's data goes to column F
    new_cell.value = c.value
for c in ws['B']:
    new_cell = c.offset(column=3)   # column B's data goes to column E
    new_cell.value = c.value
ws.delete_cols(1, 4)                # delete_cols(idx, amount)
This is just to give you the general idea; it could be optimised and parametrised so you could process each row at once.
Or you could use move_range:
ws.move_range("A1:A{}".format(ws.max_row), cols=5)
In either case be careful not to overwrite existing cells before you've moved them.
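The permutation itself can be sketched independently of openpyxl. A plain-Python illustration, assuming (as in the question's example) that new[i] is the 1-based source column for output position i:

```python
def reorder(row, new):
    # Output position i takes the value from 1-based source column new[i].
    return [row[j - 1] for j in new]

new = [2, 1, 4, 3]
rows = [['A1', 'B1', 'C1', 'D1'],
        ['A2', 'B2', 'C2', 'D2']]
reordered = [reorder(r, new) for r in rows]
```

Applying this row by row while writing into a fresh worksheet sidesteps the overwriting problem entirely.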
I have a basic question on dataframe merging. After I merge two dataframes, is there a way to keep only a few columns in the result?
Taking an example from documentation
https://pandas.pydata.org/pandas-docs/stable/merging.html#
left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})
result = pd.merge(left, right, on=['key1', 'key2'])
The result comes out as:
A B key1 key2 C D
0 A0 B0 K0 K0 C0 D0
1 A2 B2 K1 K0 C1 D1
2 A2 B2 K1 K0 C2 D2
Is there a way I can choose only column 'C' from the 'right' dataframe? For example, I would like my result to look like:
A B key1 key2 C
0 A0 B0 K0 K0 C0
1 A2 B2 K1 K0 C1
2 A2 B2 K1 K0 C2
result = pd.merge(left, right[['key1','key2','C']], on=['key1', 'key2'])
OR
right.merge(left, on=['key1','key2'])[['A','B','C','key1','key2']]
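Putting the first one-liner together with the documentation frames above, as a quick check:

```python
import pandas as pd

left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})

# Slicing right down to the keys plus 'C' before merging keeps 'D'
# out of the result entirely.
result = pd.merge(left, right[['key1', 'key2', 'C']], on=['key1', 'key2'])
```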
I have a list of bytes that represent raw samples read in from an audio interface. Depending on the use case and H/W, each sample can be anywhere from 1 to 4 bytes long, and the total number of channels in the "stream" can be more or less arbitrary. The amount of channels and bits per sample are both known at runtime.
I'll give an example of what I mean. There are four channels in the stream and each sample is two bytes.
List(A1, A2, B1, B2, C1, C2, D1, D2, A3, A4, B3, B4, C3, C4, D3, D4)
so A1 is the first byte of channel A's first sample, A2 is the second byte of the same sample and so on.
What I need to do is extract each channel's samples into their own lists, like this:
List(List(A1, A2, A3, A4), List(B1, B2, B3, B4), List(C1, C2, C3, C4), List(D1, D2, D3, D4))
How would I go about doing this in idiomatic Scala? I just started learning Scala a few hours ago, and the only non-imperative solution I've come up with is clearly suboptimal:
def uninterleave(samples: Array[Byte], numChannels: Int, bytesPerSample: Int) = {
  val dropAmount = numChannels * bytesPerSample
  def extractChannel(n: Int) = {
    def extrInner(in: Seq[Byte], acc: Seq[Byte]): Seq[Byte] = {
      if (in.isEmpty) acc
      else extrInner(in.drop(dropAmount), in.take(bytesPerSample) ++ acc)
    }
    extrInner(samples.drop(n * bytesPerSample), Nil)
  }
  for (i <- 0 until numChannels) yield extractChannel(i)
}
I would do
samples.grouped(bytesPerSample).grouped(numChannels).toList
  .transpose.map(_.flatten)
I would not vouch for its performance, though. I would rather avoid lists, but unfortunately grouped produces them.
Maybe
samples.grouped(bytesPerSample).map(_.toArray)
  .grouped(numChannels).map(_.toArray)
  .toArray
  .transpose
  .map(_.flatten)
Still, lots of lists.
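For readers who find it easier to trace the data flow outside Scala, the same group/transpose/flatten idea can be sketched in Python (an illustration only, with strings standing in for bytes):

```python
samples = ['A1', 'A2', 'B1', 'B2', 'C1', 'C2', 'D1', 'D2',
           'A3', 'A4', 'B3', 'B4', 'C3', 'C4', 'D3', 'D4']
bytes_per_sample, num_channels = 2, 4
frame = bytes_per_sample * num_channels

# Split into frames, split each frame into per-channel samples,
# then collect channel c's sample from every frame.
frames = [samples[i:i + frame] for i in range(0, len(samples), frame)]
grouped = [[f[i:i + bytes_per_sample] for i in range(0, frame, bytes_per_sample)]
           for f in frames]
channels = [[b for f in grouped for b in f[c]] for c in range(num_channels)]
```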
didierd's answer is just about perfect, but, alas, I think one can improve it a bit. He is concerned with all the list creation, and transpose is a rather heavy operation as well. If you can process all the data at the same time, it might well be good enough.
However, I'm going with Stream, and use a little trick to avoid transposing.
First of all, the grouping is the same, only I'm turning stuff into streams:
def getChannels[T](input: Iterator[T], elementsPerSample: Int, numOfChannels: Int) =
  input.toStream.grouped(elementsPerSample).toStream.grouped(numOfChannels).toStream
Next, I'll give you a function to extract one channel from that:
def streamN[T](s: Stream[Stream[Stream[T]]])(channel: Int) = s flatMap (_(channel))
With those, we can decode the streams like this:
// Sample input
val input = List('A1, 'A2, 'B1, 'B2, 'C1, 'C2, 'D1, 'D2, 'A3, 'A4, 'B3, 'B4, 'C3, 'C4, 'D3, 'D4)
// Save streams to val, to avoid recomputing the groups
val streams = getChannels(input.iterator, elementsPerSample = 2, numOfChannels = 4)
// Decode each one
def demuxer = streamN(streams) _
val aa = demuxer(0)
val bb = demuxer(1)
val cc = demuxer(2)
val dd = demuxer(3)
This will return separate streams for each channel without having the whole stream at hand. This might be useful if you need to process the input in real time. Here's some input source to test how far into the input it reads to get at a particular element:
def source(elementsPerSample: Int, numOfChannels: Int) = Iterator.from(0).map { x =>
  "" + ('A' + x / elementsPerSample % numOfChannels).toChar +
    (x % elementsPerSample
      + (x / (numOfChannels * elementsPerSample)) * elementsPerSample
      + 1)
}.map { x => println("Saw " + x); x }
You can then try stuff like this:
val streams = getChannels(source(2, 4), elementsPerSample = 2, numOfChannels = 4)
def demuxer = streamN(streams) _
val cc = demuxer(2)
println(cc take 20 toList)
val bb = demuxer(1)
println(bb take 30 toList)