Match index from PySpark DataFrame in pandas - list

I have the following PySpark DataFrame (testDF = ldamodel.describeTopics().select("termIndices").toPandas()):

+-----+---------------+--------------------+
|topic|    termIndices|         termWeights|
+-----+---------------+--------------------+
|    0|    [6, 118, 5]|[0.01205522104545...|
|    1|   [0, 55, 100]|[0.00125521761966...|
+-----+---------------+--------------------+
and I have the following word list:

['one',
 'peopl',
 'govern',
 'think',
 'econom',
 'rate',
 'tax',
 'polici',
 'year',
 'like',
 ...]
I am trying to match vocablist to termIndices and, in turn, to termWeights.
So far I have the following:

for i in testDF.items():
    for j in i[1]:
        for m in j:
            t = vocablist[m], m
            print(t)
which results in:
('tax', 6)
('insur', 118)
('rate', 5)
('peopl', 1)
('health', 84)
('incom', 38)
('think', 3)
('one', 0)
('social', 162)
.......
But I wanted something like:

('tax', 6, 0.012055221045453202)
('insur', 118, 0.001255217619666775)
('rate', 5, 0.0032220995010401187)
('peopl', 1, 0.008342115226031033)
('health', 84, 0.0008332053105123403)
('incom', 38, ......)
Any help will be appreciated.

I would recommend you spread the lists in the columns termIndices and termWeights downward, so each element gets its own row. Once you've done that, you can map indices to their term names while keeping the term weights aligned with each term index. The following is an illustration:
import pandas as pd

df = pd.DataFrame(data={'topic': [0, 1],
                        'termIndices': [[6, 118, 5],
                                        [0, 55, 100]],
                        'termWeights': [[0.012055221045453202, 0.012055221045453202, 0.012055221045453202],
                                        [0.00125521761966, 0.00125521761966, 0.00125521761966]]})

# Spread every list element onto its own row, dropping the inner level of the
# resulting MultiIndex so rows from the same topic share one label.
dff = df.apply(lambda s: s.apply(pd.Series).stack().reset_index(drop=True, level=1))

vocablist = ['one', 'peopl', 'govern', 'think', 'econom', 'rate', 'tax', 'polici', 'year', 'like'] * 50

# Look each term index up in the vocabulary.
dff['termNames'] = dff.termIndices.map(vocablist.__getitem__)
dff[['termNames', 'termIndices', 'termWeights']].values.tolist()
I hope this helps.
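As an alternative, here is a minimal sketch that pairs each term index with its weight by zipping the two list columns row by row, then looks the index up in the vocabulary. The frame and vocabulary below are hand-built toy stand-ins for the asker's testDF and vocablist, with indices kept small so they fall inside the example vocabulary:

import pandas as pd

testDF = pd.DataFrame({
    'termIndices': [[6, 1, 5], [0, 3, 2]],
    'termWeights': [[0.0121, 0.0083, 0.0032], [0.0013, 0.0008, 0.0007]],
})
vocablist = ['one', 'peopl', 'govern', 'think', 'econom', 'rate', 'tax']

# Walk indices and weights in lockstep for each topic row.
for indices, weights in zip(testDF['termIndices'], testDF['termWeights']):
    for idx, weight in zip(indices, weights):
        print((vocablist[idx], idx, weight))

This prints (term, index, weight) triples such as ('tax', 6, 0.0121), which is the shape of output the question asks for.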

Related

How to plot only the highest y-values for each x-value in a list of [x, y] values

I am trying to plot a graph/chromatogram of the x and y-values located in time_dependent_intensities, but only the ones with the largest y-values.
run = pymzml.run.Reader(in_path)
time_dependent_intensities = []
for spectrum in run:
    if spectrum.ms_level == 1:
        has_peak_matches = spectrum.reduce(mz_range=(150, 151))
        if has_peak_matches != []:
            for mz, I in has_peak_matches:
                time_dependent_intensities.append(
                    [spectrum.scan_time_in_minutes(), I]
                )
print("RT \ti")
for i in time_dependent_intensities:
    print(i)
return
When I print i, I end up with a huge list of entries like this, ranging from 0 to 15 with about 5 different y-values per x-value:
[14.9929171, 21.0]
[14.9929171, 21.0]
[14.9929171, 20.0]
[14.9929171, 31.0]
[14.9929171, 25.0]
[14.9929171, 21.0]
[14.9929171, 18.0]
[14.9967165, 22.0]
[14.9967165, 26.0]
How do I access the [x, y] lists within time_dependent_intensities, but only plot the ones that have the largest y-value?
If you have a list of lists like this and you want to extract the maximum value of each inner list, a list comprehension is probably the most convenient option:
list_ = [[15, 10, 1, 18, 2],
         [1, 2, 10, 15, 0],
         [15, 10, 1, 18, 2],
         [15, 10, 1, 18, 2]]
maxima = [max(x) for x in list_]
print(maxima)
# [18, 15, 18, 18]
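Applied to the asker's [x, y] pairs, here is a hedged sketch of the next step: group consecutive entries by their x value and keep the maximum y per group. itertools.groupby assumes the pairs are already ordered by x, which holds here since they come from a scan over retention times:

from itertools import groupby

# A few sample pairs in the shape the question prints.
time_dependent_intensities = [
    [14.9929171, 21.0], [14.9929171, 31.0], [14.9929171, 25.0],
    [14.9967165, 22.0], [14.9967165, 26.0],
]

# One [x, max_y] pair per distinct retention time.
peaks = [
    [x, max(y for _, y in group)]
    for x, group in groupby(time_dependent_intensities, key=lambda pair: pair[0])
]
print(peaks)  # [[14.9929171, 31.0], [14.9967165, 26.0]]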

Looping through a list of dictionaries in Python

Given a list of dictionaries in Python:

my_list = [{'id': 0, 'name': 'cube0_cluster0', 'member_ids': [429, 432, 435]},
           {'id': 1, 'name': 'cube0_cluster1', 'member_ids': [0, 4, 5]},
           {'id': 0, 'name': 'cube1_cluster1', 'member_ids': [4, 706, 800]}]

I want to print all member_ids for cube{ }_cluster1. My expected output is [0, 4, 5, 706, 800].
Any help would be highly appreciated. I have tried:

for k in my_list:
    for j in range(len(my_list)):
        if k['name'] == 'cube{}_cluster1'.format(j):
            print(k['member_ids'])
But I am getting two separate outputs, [0, 4, 5] and [4, 706, 800].
Try this one.
import re

member_ids = []
for di in my_list:
    if re.match(r'cube\d_cluster1', di['name']):
        member_ids += di['member_ids']
print(member_ids)
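As a design note, re.match only anchors at the start of the string, so a name like 'cube0_cluster12' would also match this pattern; if the names must match exactly, re.fullmatch (Python 3.4+) is the stricter choice.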
You can also use list comprehension.
my_list = [{'id': 0, 'name': 'cube0_cluster0', 'member_ids': [429, 432, 435]},
           {'id': 1, 'name': 'cube0_cluster1', 'member_ids': [0, 4, 5]},
           {'id': 0, 'name': 'cube1_cluster1', 'member_ids': [4, 706, 800]}]

res = [j for i in my_list for j in i['member_ids'] if "cluster1" in i["name"]]
print(res)       # [0, 4, 5, 4, 706, 800]
print(set(res))  # {0, 800, 706, 4, 5} (distinct values)
I hope this helps and counts!
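Note that the expected output [0, 4, 5, 706, 800] has the duplicate 4 removed while the order is kept; neither plain res nor set(res) guarantees both at once. A small sketch using dict.fromkeys, which preserves insertion order on Python 3.7+, reusing my_list and res from above:

res = [j for i in my_list for j in i['member_ids'] if 'cluster1' in i['name']]
print(list(dict.fromkeys(res)))  # [0, 4, 5, 706, 800]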

Python Dask - vertical concatenation of 2 DataFrames

I am trying to vertically concatenate two Dask DataFrames. I have the following Dask DataFrame:

import dask.dataframe as dd
import pandas as pd

d = [
    ['A', 'B', 'C', 'D', 'E', 'F'],
    [1, 4, 8, 1, 3, 5],
    [6, 6, 2, 2, 0, 0],
    [9, 4, 5, 0, 6, 35],
    [0, 1, 7, 10, 9, 4],
    [0, 7, 2, 6, 1, 2]
]
df = pd.DataFrame(d[1:], columns=d[0])
ddf = dd.from_pandas(df, npartitions=5)
Here is the data as a pandas DataFrame:

   A  B  C   D  E   F
0  1  4  8   1  3   5
1  6  6  2   2  0   0
2  9  4  5   0  6  35
3  0  1  7  10  9   4
4  0  7  2   6  1   2
Here is the Dask DataFrame:

Dask DataFrame Structure:
                   A      B      C      D      E      F
npartitions=4
0              int64  int64  int64  int64  int64  int64
1                ...    ...    ...    ...    ...    ...
2                ...    ...    ...    ...    ...    ...
3                ...    ...    ...    ...    ...    ...
4                ...    ...    ...    ...    ...    ...
Dask Name: from_pandas, 4 tasks
I am trying to concatenate the two Dask DataFrames vertically:

ddf_i = ddf + 11.5
dd.concat([ddf, ddf_i], axis=0)
but I get this error:
Traceback (most recent call last):
  ...
  File "...", line 572, in concat
    raise ValueError('All inputs have known divisions which cannot '
ValueError: All inputs have known divisions which cannot be concatenated
in order. Specify interleave_partitions=True to ignore order
However, if I try:

dd.concat([ddf, ddf_i], axis=0, interleave_partitions=True)

then it appears to work. Is there a problem with setting this to True (in terms of performance and speed)? Or is there another way to vertically concatenate two Dask DataFrames?
If you inspect the divisions of the dataframe, ddf.divisions, you will find, assuming one partition, that it has the edges of the index there: (0, 4). This is useful to Dask, as it knows, when you do some operation on the data, not to touch partitions that don't include the required index values. This is also why some Dask operations are much faster when the index is appropriate for the job.
When you concatenate, the second dataframe has the same index as the first. Concatenation would work without interleaving if the index values had different ranges in the two partitions.
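To make that concrete, a minimal sketch using the question's ddf, repartitioned to a single partition to match the (0, 4) example above (divisions and repartition are standard Dask DataFrame attributes/methods):

# Collapse to one partition so the divisions are just the index edges.
ddf_single = ddf.repartition(npartitions=1)
print(ddf_single.divisions)  # (0, 4)

# Element-wise arithmetic keeps the index, so the shifted frame spans the
# same index range; the two ranges overlap, hence the concat error.
ddf_i = ddf_single + 11.5
print(ddf_i.divisions)       # (0, 4)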
mdurant's answer is correct, and this answer elaborates with MCVE code snippets using Dask v2021.08.1. The examples make it easier to understand divisions and interleaving.
Vertically concatenating DataFrames
Create two DataFrames, concatenate them, and view the results.
df = pd.DataFrame(
    {"nums": [1, 2, 3, 4, 5, 6], "letters": ["a", "b", "c", "d", "e", "f"]}
)
ddf1 = dd.from_pandas(df, npartitions=2)

df = pd.DataFrame({"nums": [88, 99], "letters": ["xx", "yy"]})
ddf2 = dd.from_pandas(df, npartitions=1)

ddf3 = dd.concat([ddf1, ddf2])
print(ddf3.compute())
   nums letters
0     1       a
1     2       b
2     3       c
3     4       d
4     5       e
5     6       f
0    88      xx
1    99      yy
Divisions metadata when vertically concatenating
Create two DataFrames, concatenate them, and illustrate that sometimes this operation will cause divisions metadata to be lost.
def print_partitions(ddf):
    for i in range(ddf.npartitions):
        print(ddf.partitions[i].compute())

df = pd.DataFrame(
    {"nums": [1, 2, 3, 4, 5, 6], "letters": ["a", "b", "c", "d", "e", "f"]}
)
ddf1 = dd.from_pandas(df, npartitions=2)
ddf1.divisions  # (0, 3, 5)

df = pd.DataFrame({"nums": [88, 99], "letters": ["xx", "yy"]})
ddf2 = dd.from_pandas(df, npartitions=1)
ddf2.divisions  # (0, 1)

ddf3 = dd.concat([ddf1, ddf2])
ddf3.divisions  # (None, None, None, None)
Set interleave_partitions=True to avoid losing the divisions metadata.
ddf3_interleave = dd.concat([ddf1, ddf2], interleave_partitions=True)
ddf3_interleave.divisions # (0, 1, 3, 5)
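The print_partitions helper defined above is useful for seeing where the rows land after interleaving; a one-line usage sketch (the exact grouping of rows is decided by Dask, but rows whose index ranges overlap end up in the same partition):

print_partitions(ddf3_interleave)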
When interleaving isn't necessary
Create two DataFrames without overlapping divisions, concatenate them, and confirm that the divisions metadata is not lost:
df = pd.DataFrame(
    {"nums": [1, 2, 3, 4], "letters": ["a", "b", "c", "d"], "some_index": [4, 5, 6, 7]}
)
ddf1 = dd.from_pandas(df, npartitions=2)
ddf1 = ddf1.set_index("some_index")

df = pd.DataFrame({"nums": [88, 99], "letters": ["xx", "yy"], "some_index": [10, 20]})
ddf2 = dd.from_pandas(df, npartitions=1)
ddf2 = ddf2.set_index("some_index")

ddf3 = dd.concat([ddf1, ddf2])
ddf3.divisions  # (4, 6, 10, 20)
I wrote a blog post to explain this in more detail. Let me know if you'd like the link.

Find the index of the maximum element of the 2nd column of a 2-dimensional list

I am a newbie in Python.
I need to find the index of the maximum element of the 2nd column of a 2-dimensional list, as follows:
3 | 23
7 | 12
5 | 42
1 | 25
so the output should be the index of 42, i.e. 2.
I tried converting the list to a numpy array and then using argmax, but in vain.
Thanks!!
>>> import numpy as np
>>> a = np.asarray([[3, 7, 5, 1], [23, 12, 42, 25]])
>>> a
array([[ 3,  7,  5,  1],
       [23, 12, 42, 25]])
>>> a[1]
array([23, 12, 42, 25])
>>> np.argmax(a[1])
2
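The session above stores the data with each column as a row. If the data is instead stored row by row, as the question's table suggests (i.e. [[3, 23], [7, 12], [5, 42], [1, 25]]), a hedged variant is to slice out column 1 and take argmax along it:

import numpy as np

rows = np.asarray([[3, 23], [7, 12], [5, 42], [1, 25]])
print(np.argmax(rows[:, 1]))  # 2, the row index of 42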

How to count the number of zeros in Python?

My code is currently written as:

import random

convert = {0: 0, 1: 1, 2: 2, 3: 3, 4: 0, 5: 1, 6: 2, 7: 1}
rows = [[convert[random.randint(0, 7)] for _ in range(5)] for _ in range(5)]
numgood = 25 - rows.count(0)
print numgood
# 25
It always comes out as 25, so it's not just that rows contains no 0's.
Have you printed rows?
It's [[0, 1, 0, 0, 2], [1, 2, 0, 1, 2], [3, 1, 1, 1, 1], [1, 0, 0, 1, 0], [0, 3, 2, 0, 1]], so you have a nested list there.
If you want to count the number of 0's in those nested lists, you could try:
import random
convert = {0: 0, 1: 1, 2: 2, 3: 3, 4: 0, 5: 1, 6: 2, 7: 1}
rows = [[convert[random.randint(0, 7)] for _ in range(5)] for _ in range(5)]
numgood = 25 - sum(e.count(0) for e in rows)
print numgood
Output:
18
rows doesn't contain any zeroes; it contains lists, not integers.
>>> row = [1,2,3]
>>> type(row)
<type 'list'>
>>> row.count(2)
1
>>> rows = [[1,2,3],[4,5,6]]
>>> rows.count(2)
0
>>> rows.count([1,2,3])
1
To count the number of zeroes in any of the lists in rows, you could use a generator expression:
>>> rows = [[1,2,3],[4,5,6], [0,0,8]]
>>> sum(x == 0 for row in rows for x in row)
2
You could also use numpy. np.count_nonzero counts the non-zero entries, which is exactly 25 minus the number of zeros, i.e. numgood itself:

import numpy as np
import random

convert = {0: 0, 1: 1, 2: 2, 3: 3, 4: 0, 5: 1, 6: 2, 7: 1}
rows = [[convert[random.randint(0, 7)] for _ in range(5)] for _ in range(5)]
numgood = np.count_nonzero(rows)  # non-zero entries: 25 minus the zero count
print numgood

The printed value varies from run to run, since rows is randomly generated.
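For completeness, a hedged Python 3 rendering of the counting, since the snippets above use Python 2 print statements:

import random

convert = {0: 0, 1: 1, 2: 2, 3: 3, 4: 0, 5: 1, 6: 2, 7: 1}
rows = [[convert[random.randint(0, 7)] for _ in range(5)] for _ in range(5)]

zeros = sum(row.count(0) for row in rows)  # zeros across all nested lists
numgood = 25 - zeros
print(numgood)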