Is there a method to delete multiple rows of a matrix in SymPy (without using NumPy)?
I understand that .row_del() can only delete one row at a time.
from sympy import *
a_1_1, a_1_2, a_1_3 = symbols ('a_1_1 a_1_2 a_1_3')
a_2_1, a_2_2, a_2_3 = symbols ('a_2_1 a_2_2 a_2_3')
a_3_1, a_3_2, a_3_3 = symbols ('a_3_1 a_3_2 a_3_3')
A = Matrix ([
[a_1_1, a_1_2, a_1_3],
[a_2_1, a_2_2, a_2_3],
[a_3_1, a_3_2, a_3_3]
])
A.row_del (1 : 2)
does not (yet?) work:
A.row_del (1 : 2)
^
SyntaxError: invalid syntax
You can use slicing or outer indexing to select the rows that you want:
In [8]: A = MatrixSymbol('A', 3, 3).as_explicit()
In [9]: A
Out[9]:
⎡A₀₀ A₀₁ A₀₂⎤
⎢ ⎥
⎢A₁₀ A₁₁ A₁₂⎥
⎢ ⎥
⎣A₂₀ A₂₁ A₂₂⎦
In [10]: A[:1,:]
Out[10]: [A₀₀ A₀₁ A₀₂]
In [11]: A[[0,2],:]
Out[11]:
⎡A₀₀ A₀₁ A₀₂⎤
⎢ ⎥
⎣A₂₀ A₂₁ A₂₂⎦
The : syntax to select a slice of items is only supported inside of square brackets in Python. That's why the error is SyntaxError. So even if the row_del function did support deleting multiple rows at once, the syntax for it would not look like row_del(1:2), because that's not valid Python.
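As a minimal sketch of how the list indexing shown above can stand in for a multi-row delete: the helper below builds a new matrix from the rows that should survive. The name delete_rows is hypothetical (it is not part of SymPy), and the sketch assumes that returning a new matrix is acceptable instead of modifying A in place.

```python
from sympy import Matrix

def delete_rows(matrix, rows_to_delete):
    """Return a copy of `matrix` without the rows listed in `rows_to_delete`.

    Hypothetical helper, not a SymPy API: it only relies on selecting rows
    with a list of indices, as in A[[0, 2], :].
    """
    drop = set(rows_to_delete)
    keep = [i for i in range(matrix.rows) if i not in drop]
    return matrix[keep, :]

A = Matrix([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
B = delete_rows(A, [1, 2])  # drop the second and third rows
print(B)
```

Because the helper selects the rows to keep in one step, the indices in rows_to_delete always refer to the original matrix; calling row_del repeatedly would shift the remaining indices after each deletion.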
Regarding your follow-up question, part of the beauty of working in Python is that you can write your own utility functions to make things work the way you want. So you could make a function to create a matrix that automatically comes back with indices as desired. Note, too, that selecting rows or columns with a list will put them in the requested order:
>>> a1rc = lambda a,r,c: MatrixSymbol(a, r+1, c+1).as_explicit()[1:,1:]
>>> a1rc("A",3,4)
Matrix([
[A[1, 1], A[1, 2], A[1, 3], A[1, 4]],
[A[2, 1], A[2, 2], A[2, 3], A[2, 4]],
[A[3, 1], A[3, 2], A[3, 3], A[3, 4]]])
>>> _[[2,0],:] # row 2 will come first, then row 0
Matrix([
[A[3, 1], A[3, 2], A[3, 3], A[3, 4]],
[A[1, 1], A[1, 2], A[1, 3], A[1, 4]]])
I am doing some operations on two matrices in SymPy and I want to record how the result was obtained. For example, in an isympy session:
a = Matrix([[1, 0], [2, 1]])
b = Matrix([[1, 1], [0, 2]])
out = HadamardProduct(a,b).doit()
out = sum(out)
out
Output:
3
Instead I would like this output:
1 * 1 + 0 * 1 + 2 * 0 + 1 * 2 = 3
How do I keep track of the history?
This seems to be it:
a = Matrix([[1, 0], [2, 1]])
b = Matrix([[1, 1], [0, 2]])
with evaluate(False):
    out = a.multiply_elementwise(b)
    out = sum(out)
Eq(out, out.simplify())
Output:
2⋅0 + 0⋅1 + 0 + 1⋅1 + 1⋅2 = 3
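If the exact term order and formatting of the desired output matter, one alternative is to build the history as a string by hand. This is only a sketch and not a SymPy feature: it assumes the matrices are available as plain nested lists, and the names pairs, terms and history are made up for the illustration. A SymPy Matrix can be iterated element by element as well, so the same idea should carry over.

```python
a = [[1, 0], [2, 1]]
b = [[1, 1], [0, 2]]

# Pair up corresponding elements in row-major order.
pairs = [(x, y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b)]

# Record each product as text, then evaluate the sum separately.
terms = " + ".join(f"{x} * {y}" for x, y in pairs)
total = sum(x * y for x, y in pairs)

history = f"{terms} = {total}"
print(history)
```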
I am trying to vertically concatenate two Dask DataFrames
I have the following Dask DataFrame:
import pandas as pd
import dask.dataframe as dd

d = [
['A','B','C','D','E','F'],
[1, 4, 8, 1, 3, 5],
[6, 6, 2, 2, 0, 0],
[9, 4, 5, 0, 6, 35],
[0, 1, 7, 10, 9, 4],
[0, 7, 2, 6, 1, 2]
]
df = pd.DataFrame(d[1:], columns=d[0])
ddf = dd.from_pandas(df, npartitions=5)
Here is the data as a Pandas DataFrame
A B C D E F
0 1 4 8 1 3 5
1 6 6 2 2 0 0
2 9 4 5 0 6 35
3 0 1 7 10 9 4
4 0 7 2 6 1 2
Here is the Dask DataFrame
Dask DataFrame Structure:
A B C D E F
npartitions=4
0 int64 int64 int64 int64 int64 int64
1 ... ... ... ... ... ...
2 ... ... ... ... ... ...
3 ... ... ... ... ... ...
4 ... ... ... ... ... ...
Dask Name: from_pandas, 4 tasks
I am trying to concatenate 2 Dask DataFrames vertically:
ddf_i = ddf + 11.5
dd.concat([ddf,ddf_i],axis=0)
but I get this error:
Traceback (most recent call last):
...
File "...", line 572, in concat
raise ValueError('All inputs have known divisions which cannot '
ValueError: All inputs have known divisions which cannot be concatenated
in order. Specify interleave_partitions=True to ignore order
However, if I try:
dd.concat([ddf,ddf_i],axis=0,interleave_partitions=True)
then it appears to be working. Is there a problem with setting this to True (in terms of performance, i.e. speed)? Or is there another way to vertically concatenate 2 Dask DataFrames?
If you inspect the divisions of the dataframe (ddf.divisions), you will find, assuming one partition, that it holds the edges of the index: (0, 4). This is useful to Dask: when you perform an operation on the data, it knows which partitions cannot contain the required index values and can skip them. This is also why some Dask operations are much faster when the index is appropriate for the job.
When you concatenate, the second dataframe has the same index as the first. Concatenation would work without interleaving if the values of the index had different ranges in the two partitions.
mdurant's answer is correct; this answer elaborates on it with MCVE code snippets using Dask v2021.08.1. Examples make it easier to understand divisions and interleaving.
Vertically concatenating DataFrames
Create two DataFrames, concatenate them, and view the results.
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame(
{"nums": [1, 2, 3, 4, 5, 6], "letters": ["a", "b", "c", "d", "e", "f"]}
)
ddf1 = dd.from_pandas(df, npartitions=2)
df = pd.DataFrame({"nums": [88, 99], "letters": ["xx", "yy"]})
ddf2 = dd.from_pandas(df, npartitions=1)
ddf3 = dd.concat([ddf1, ddf2])
print(ddf3.compute())
nums letters
0 1 a
1 2 b
2 3 c
3 4 d
4 5 e
5 6 f
0 88 xx
1 99 yy
Divisions metadata when vertically concatenating
Create two DataFrames, concatenate them, and illustrate that sometimes this operation will cause divisions metadata to be lost.
def print_partitions(ddf):
    for i in range(ddf.npartitions):
        print(ddf.partitions[i].compute())
df = pd.DataFrame(
{"nums": [1, 2, 3, 4, 5, 6], "letters": ["a", "b", "c", "d", "e", "f"]}
)
ddf1 = dd.from_pandas(df, npartitions=2)
ddf1.divisions # (0, 3, 5)
df = pd.DataFrame({"nums": [88, 99], "letters": ["xx", "yy"]})
ddf2 = dd.from_pandas(df, npartitions=1)
ddf2.divisions # (0, 1)
ddf3 = dd.concat([ddf1, ddf2])
ddf3.divisions # (None, None, None, None)
Set interleave_partitions=True to avoid losing the divisions metadata.
ddf3_interleave = dd.concat([ddf1, ddf2], interleave_partitions=True)
ddf3_interleave.divisions # (0, 1, 3, 5)
When interleaving isn't necessary
Create two DataFrames without overlapping divisions, concatenate them, and confirm that the divisions metadata is not lost:
df = pd.DataFrame(
{"nums": [1, 2, 3, 4], "letters": ["a", "b", "c", "d"], "some_index": [4, 5, 6, 7]}
)
ddf1 = dd.from_pandas(df, npartitions=2)
ddf1 = ddf1.set_index("some_index")
df = pd.DataFrame({"nums": [88, 99], "letters": ["xx", "yy"], "some_index": [10, 20]})
ddf2 = dd.from_pandas(df, npartitions=1)
ddf2 = ddf2.set_index("some_index")
ddf3 = dd.concat([ddf1, ddf2])
ddf3.divisions # (4, 6, 10, 20)
I wrote a blog post to explain this in more detail. Let me know if you'd like the link.
I don't know why I can't figure this out, but I have a column of numbers that I would like to turn into a list of strings. I should have mentioned this when I initially posted: this isn't a DataFrame, nor did it come from a file; it is the result of some code. Sorry, I wasn't trying to waste anybody's time, I just didn't want to add a bunch of clutter. This is exactly how it prints out.
Here is my column of numbers.
3,1,3
3,1,3
3,1,3
3,3,3
3,1,1
And I would like them to look like this.
['3,1,3', '3,1,3', '3,1,3', '3,3,3', '3,1,1']
I'm trying to find a way that is not dependent on how many numbers are in each row or how many sets of numbers are in the column.
Thanks, really appreciate it.
Assume you start with a DataFrame
df = pd.DataFrame([[3, 1, 3], [3, 1, 3], [3, 1, 3], [3, 3, 3], [3, 1, 1]])
df.astype(str).apply(lambda x: ','.join(x.values), axis=1).values.tolist()
Looks like:
['3,1,3', '3,1,3', '3,1,3', '3,3,3', '3,1,1']
def foo():
    lines = []
    with open("file.asd", "r") as f:
        for line in f:
            # Drop the trailing newline so each entry is just '3,1,3'
            lines.append(line.rstrip("\n"))
    return lines
To turn your DataFrame into strings, use the astype function:
df = pd.DataFrame([[3, 1, 3], [3, 1, 3], [3, 1, 3], [3, 3, 3], [3, 1, 1]])
df = df.astype('str')
Then manipulating your columns becomes easy. You can, for instance, create a new column:
In [29]:
df['temp'] = df[0] + ',' + df[1] + ',' + df[2]
df
Out[29]:
0 1 2 temp
0 3 1 3 3,1,3
1 3 1 3 3,1,3
2 3 1 3 3,1,3
3 3 3 3 3,3,3
4 3 1 1 3,1,1
And then compact it into a list:
In [30]:
list(df['temp'])
Out[30]:
['3,1,3', '3,1,3', '3,1,3', '3,3,3', '3,1,1']
# Done in a Jupyter notebook.
# Put triple quotes around your column to make it one multi-line string.
# The advantage over a DataFrame is the minimal number of operations
# needed to reformat a column of numbers or a column of text strings
# once it is held in a single string.
a = """3,1,3
3,1,3
3,1,3
3,3,3
3,1,1"""
b = f'"{a}"'
print('String created with triple quotes:')
print(b)
c = a.split('\n')
print ("Use split() function on the string. Split on newline character:")
print(c)
print ("Use splitlines() function on the string:")
print(a.splitlines())
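If the values are still available in memory as rows of numbers instead of printed text (an assumption, since the question only shows the printed output), a list comprehension with str.join gives the same result without depending on how many numbers are in each row or how many rows there are. The variable names below are illustrative.

```python
# Rows of different lengths, to show that nothing depends on the row size.
rows = [[3, 1, 3], [3, 1, 3], [3, 3, 3, 7], [3, 1]]

# Convert every number to text and join each row with commas.
as_strings = [",".join(str(value) for value in row) for row in rows]
print(as_strings)
```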
Following this recipe, I tried to filter a DataFrame by the column names that contain the string '+'. Here's the example:
B = pd.DataFrame([[1, 5, 2], [2, 4, 4], [3, 3, 1], [4, 2, 2], [5, 1, 4]],
columns=['A', '+B', '+C'], index=[1, 2, 3, 4, 5])
So I want a dataframe C with only '+B' and '+C' columns in it.
C = B.filter(regex='+')
However I get the error:
File "c:\users\hernan\anaconda\lib\site-packages\pandas\core\generic.py", line 1888, in filter
matcher = re.compile(regex)
File "c:\users\hernan\anaconda\lib\re.py", line 190, in compile
return _compile(pattern, flags)
File "c:\users\hernan\anaconda\lib\re.py", line 244, in _compile
raise error, v # invalid expression
error: nothing to repeat
The recipe says it is for Python 3, and I use Python 2.7. However, I don't think that is the problem here.
Hernan
+ has a special meaning in regular expressions (see here). You can escape it with \:
>>> C = B.filter(regex=r'\+')
>>> C
+B +C
1 5 2
2 4 4
3 3 1
4 2 2
5 1 4
Or, since all you care about is the presence of +, you could use the like argument instead:
>>> C = B.filter(like="+")
>>> C
+B +C
1 5 2
2 4 4
3 3 1
4 2 2
5 1 4
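The traceback shows pandas passing the pattern to re.compile, so the error can be reproduced and avoided with the standard re module alone. As a small sketch (the variable names are illustrative), re.escape produces the escaped pattern for you, which helps when the text to match comes from a variable:

```python
import re

columns = ["A", "+B", "+C"]

# A bare "+" is a quantifier with nothing before it, so it cannot compile.
try:
    re.compile("+")
    compiled_bare_plus = True
except re.error as exc:
    compiled_bare_plus = False
    print("invalid pattern:", exc)

# re.escape backslash-escapes the special character, giving a literal "+".
pattern = re.compile(re.escape("+"))
matching = [name for name in columns if pattern.search(name)]
print(matching)
```

The escaped pattern can then be passed on, for example as B.filter(regex=re.escape("+")).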
My code is currently written as:
convert = {0:0,1:1,2:2,3:3,4:0,5:1,6:2,7:1}
rows = [[convert[random.randint(0,7)] for _ in range(5)] for _ in range(5)]
numgood = 25 - rows.count(0)
print numgood
>> 25
It always comes out as 25, so it's not just that rows contains no 0's.
Have you printed rows?
It's [[0, 1, 0, 0, 2], [1, 2, 0, 1, 2], [3, 1, 1, 1, 1], [1, 0, 0, 1, 0], [0, 3, 2, 0, 1]], so you have a nested list there.
If you want to count the number of 0's in those nested lists, you could try:
import random
convert = {0:0, 1:1, 2:2, 3:3, 4:0, 5:1, 6:2, 7:1}
rows = [[convert[random.randint(0, 7)] for _ in range(5)] for _ in range(5)]
numgood = 25 - sum(e.count(0) for e in rows)
print(numgood)
Output:
18
rows doesn't contain any zeroes; it contains lists, not integers.
>>> row = [1,2,3]
>>> type(row)
<type 'list'>
>>> row.count(2)
1
>>> rows = [[1,2,3],[4,5,6]]
>>> rows.count(2)
0
>>> rows.count([1,2,3])
1
To count the number of zeroes in any of the lists in rows, you could use a generator expression:
>>> rows = [[1,2,3],[4,5,6], [0,0,8]]
>>> sum(x == 0 for row in rows for x in row)
2
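To tie this back to the question, here is a short sketch using the rows value quoted earlier in this thread. It flattens the nested list with itertools.chain before counting; the names flat and zeros are chosen for illustration.

```python
from itertools import chain

rows = [[0, 1, 0, 0, 2], [1, 2, 0, 1, 2], [3, 1, 1, 1, 1],
        [1, 0, 0, 1, 0], [0, 3, 2, 0, 1]]

# rows.count(0) compares 0 against each inner list, so it never matches.
outer_count = rows.count(0)

# Flatten the nested list, then count the zeroes among the integers.
flat = list(chain.from_iterable(rows))
zeros = flat.count(0)
numgood = len(flat) - zeros

print(outer_count, zeros, numgood)
```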
You could also use numpy:
import numpy as np
import random
convert = {0:0,1:1,2:2,3:3,4:0,5:1,6:2,7:1}
rows = [[convert[random.randint(0,7)] for _ in range(5)] for _ in range(5)]
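# count_nonzero counts the non-zero entries, which is numgood directly
numgood = np.count_nonzero(rows)
print(numgood)
This prints the number of non-zero entries; the value varies between runs because rows is random.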