Python Dask - vertical concatenation of 2 DataFrames - python-2.7

I am trying to vertically concatenate two Dask DataFrames
I have the following Dask DataFrame:
import pandas as pd
import dask.dataframe as dd

d = [
['A','B','C','D','E','F'],
[1, 4, 8, 1, 3, 5],
[6, 6, 2, 2, 0, 0],
[9, 4, 5, 0, 6, 35],
[0, 1, 7, 10, 9, 4],
[0, 7, 2, 6, 1, 2]
]
df = pd.DataFrame(d[1:], columns=d[0])
ddf = dd.from_pandas(df, npartitions=5)
Here is the data as a Pandas DataFrame
A B C D E F
0 1 4 8 1 3 5
1 6 6 2 2 0 0
2 9 4 5 0 6 35
3 0 1 7 10 9 4
4 0 7 2 6 1 2
Here is the Dask DataFrame
Dask DataFrame Structure:
A B C D E F
npartitions=4
0 int64 int64 int64 int64 int64 int64
1 ... ... ... ... ... ...
2 ... ... ... ... ... ...
3 ... ... ... ... ... ...
4 ... ... ... ... ... ...
Dask Name: from_pandas, 4 tasks
I am trying to concatenate 2 Dask DataFrames vertically:
ddf_i = ddf + 11.5
dd.concat([ddf,ddf_i],axis=0)
but I get this error:
Traceback (most recent call last):
...
File "...", line 572, in concat
raise ValueError('All inputs have known divisions which cannot '
ValueError: All inputs have known divisions which cannot be concatenated
in order. Specify interleave_partitions=True to ignore order
However, if I try:
dd.concat([ddf,ddf_i],axis=0,interleave_partitions=True)
then it appears to work. Is there a problem with setting this to True (in terms of performance/speed)? Or is there another way to vertically concatenate two Dask DataFrames?

If you inspect the divisions of the dataframe with ddf.divisions, you will find (assuming one partition) that it holds the edges of the index: (0, 4). This is useful to Dask: when you perform an operation on the data, it knows not to touch partitions that cannot contain the required index values. This is also why some Dask operations are much faster when the index is appropriate for the job.
When you concatenate, the second dataframe has the same index as the first. Concatenation would work without interleaving if the index values had different, non-overlapping ranges in the two dataframes.
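To see this with the question's frames, you can inspect the divisions directly (a minimal sketch; it assumes the ddf and ddf_i defined in the question):
# Both frames carry the same index, so their divisions cover the same
# range, e.g. (0, 1, 2, 3, 4) -- they overlap rather than follow on
# from one another.
print(ddf.divisions)
print(ddf_i.divisions)

# interleave_partitions=True lets Dask merge the overlapping index
# ranges instead of raising the ValueError shown above.
result = dd.concat([ddf, ddf_i], axis=0, interleave_partitions=True)
print(result.compute())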

mdurant's answer is correct, and this answer elaborates with MCVE code snippets using Dask v2021.08.1. The examples make it easier to understand divisions and interleaving.
Vertically concatenating DataFrames
Create two DataFrames, concatenate them, and view the results.
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame(
{"nums": [1, 2, 3, 4, 5, 6], "letters": ["a", "b", "c", "d", "e", "f"]}
)
ddf1 = dd.from_pandas(df, npartitions=2)
df = pd.DataFrame({"nums": [88, 99], "letters": ["xx", "yy"]})
ddf2 = dd.from_pandas(df, npartitions=1)
ddf3 = dd.concat([ddf1, ddf2])
print(ddf3.compute())
nums letters
0 1 a
1 2 b
2 3 c
3 4 d
4 5 e
5 6 f
0 88 xx
1 99 yy
Divisions metadata when vertically concatenating
Create two DataFrames, concatenate them, and illustrate that sometimes this operation will cause divisions metadata to be lost.
def print_partitions(ddf):
    for i in range(ddf.npartitions):
        print(ddf.partitions[i].compute())
df = pd.DataFrame(
{"nums": [1, 2, 3, 4, 5, 6], "letters": ["a", "b", "c", "d", "e", "f"]}
)
ddf1 = dd.from_pandas(df, npartitions=2)
ddf1.divisions # (0, 3, 5)
df = pd.DataFrame({"nums": [88, 99], "letters": ["xx", "yy"]})
ddf2 = dd.from_pandas(df, npartitions=1)
ddf2.divisions # (0, 1)
ddf3 = dd.concat([ddf1, ddf2])
ddf3.divisions # (None, None, None, None)
Set interleave_partitions=True to avoid losing the divisions metadata.
ddf3_interleave = dd.concat([ddf1, ddf2], interleave_partitions=True)
ddf3_interleave.divisions # (0, 1, 3, 5)
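The print_partitions helper defined above can be used to inspect how the rows are regrouped by index range after interleaving (a usage sketch, based on the divisions (0, 1, 3, 5) shown above):
# Partition 0 holds the index-0 rows from both inputs, partition 1 the
# index 1-2 rows, and partition 2 the index 3-5 rows.
print_partitions(ddf3_interleave)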
When interleaving isn't necessary
Create two DataFrames without overlapping divisions, concatenate them, and confirm that the divisions metadata is not lost:
df = pd.DataFrame(
{"nums": [1, 2, 3, 4], "letters": ["a", "b", "c", "d"], "some_index": [4, 5, 6, 7]}
)
ddf1 = dd.from_pandas(df, npartitions=2)
ddf1 = ddf1.set_index("some_index")
df = pd.DataFrame({"nums": [88, 99], "letters": ["xx", "yy"], "some_index": [10, 20]})
ddf2 = dd.from_pandas(df, npartitions=1)
ddf2 = ddf2.set_index("some_index")
ddf3 = dd.concat([ddf1, ddf2])
ddf3.divisions # (4, 6, 10, 20)
I wrote a blog post to explain this in more detail. Let me know if you'd like the link.

Related

How to turn a column of numbers into a list of strings?

I don't know why I can't figure this out, but I have a column of numbers that I would like to turn into a list of strings. I should have mentioned this when I initially posted: this isn't a DataFrame, nor did it come from a file; it is the result of some code. Sorry, I wasn't trying to waste anybody's time, I just didn't want to add a bunch of clutter. This is exactly how it prints out.
Here is my column of numbers.
3,1,3
3,1,3
3,1,3
3,3,3
3,1,1
And I would like them to look like this.
['3,1,3', '3,1,3', '3,1,3', '3,3,3', '3,1,1']
I'm trying to find a way that is not dependent on how many numbers are in each row or how many sets of numbers are in the column.
Thanks, really appreciate it.
Assume you start with a DataFrame
df = pd.DataFrame([[3, 1, 3], [3, 1, 3], [3, 1, 3], [3, 3, 3], [3, 1, 1]])
df.astype(str).apply(lambda x: ','.join(x.values), axis=1).values.tolist()
Looks like:
['3,1,3', '3,1,3', '3,1,3', '3,3,3', '3,1,1']
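If the data is not a DataFrame but, as the question clarifies, just the result of some code (say, a list of lists), a plain comprehension gives the same result (a sketch; rows is a hypothetical name for that data):
rows = [[3, 1, 3], [3, 1, 3], [3, 1, 3], [3, 3, 3], [3, 1, 1]]
result = [','.join(str(n) for n in row) for row in rows]
print(result)  # ['3,1,3', '3,1,3', '3,1,3', '3,3,3', '3,1,1']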
# Read each line of a file into a list of strings:
def foo():
    l = []
    with open("file.asd", "r") as f:
        for line in f:
            l.append(line)
    return l
To turn your dataframe into strings, use the astype function:
df = pd.DataFrame([[3, 1, 3], [3, 1, 3], [3, 1, 3], [3, 3, 3], [3, 1, 1]])
df = df.astype('str')
Then manipulating your columns becomes easy; you can, for instance, create a new column:
In [29]:
df['temp'] = df[0] + ',' + df[1] + ',' + df[2]
df
Out[29]:
0 1 2 temp
0 3 1 3 3,1,3
1 3 1 3 3,1,3
2 3 1 3 3,1,3
3 3 3 3 3,3,3
4 3 1 1 3,1,1
And then compact it into a list:
In [30]:
list(df['temp'])
Out[30]:
['3,1,3', '3,1,3', '3,1,3', '3,3,3', '3,1,1']
# Done in Jupyter notebook
# Add triple quotes on each side of your column to turn it into a single string.
# The advantage over the dataframe approach is the minimal number of
# operations needed to reformat your column of numbers or column of
# text strings into a single string.
a = """3,1,3
3,1,3
3,1,3
3,3,3
3,1,1"""
b = f'"{a}"'
print('String created with triple quotes:')
print(b)
c = a.split('\n')
print ("Use split() function on the string. Split on newline character:")
print(c)
print ("Use splitlines() function on the string:")
print(a.splitlines())

Remove duplicates method for Python Pandas doesn't work

I am trying to remove duplicates based on unique values in column 'new'. I have tried two methods, but the output of df.shape suggests the before/after shapes are the same, meaning the duplicate removal fails.
import pandas
import numpy as np
import random
df = pandas.DataFrame(np.random.randn(10, 4), columns=list('ABCD'))
df['new'] = [1, 1, 3, 4, 5, 1, 7, 8, 1, 10]
df['new2'] = [1, 1, 2, 4, 5, 3, 7, 8, 9, 5]
print df.shape
df.drop_duplicates('new', take_last=False)
df.groupby('new').max()
print df.shape
# output
(10, 6)
(10, 6)
[Finished in 1.0s]
You need to assign the result of drop_duplicates. By default inplace=False, so it returns a copy of the modified df; since you don't pass inplace=True, your original df is unmodified:
In [106]:
df = df.drop_duplicates('new', take_last=False)
df.groupby('new').max()
Out[106]:
A B C D new2
new
1 -1.698741 -0.550839 -0.073692 0.618410 1
3 0.519596 1.686003 1.395585 1.298783 2
4 1.557550 1.249577 0.214546 -0.077569 4
5 -0.183454 -0.789351 -0.374092 -1.824240 5
7 -1.176468 0.546904 0.666383 -0.315945 7
8 -1.224640 -0.650131 -0.394125 0.765916 8
10 -1.045131 0.726485 -0.194906 -0.558927 5
if you passed inplace=True it would work:
In [108]:
df.drop_duplicates('new', take_last=False, inplace=True)
df.groupby('new').max()
Out[108]:
A B C D new2
new
1 0.334352 -0.355528 0.098418 -0.464126 1
3 -0.394350 0.662889 -1.012554 -0.004122 2
4 -0.288626 0.839906 1.335405 0.701339 4
5 0.973462 -0.818985 1.020348 -0.306149 5
7 -0.710495 0.580081 0.251572 -0.855066 7
8 -1.524862 -0.323492 -0.292751 1.395512 8
10 -1.164393 0.455825 -0.483537 1.357744 5
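Note that take_last was later deprecated in pandas and replaced by the keep argument; in recent versions the equivalent call looks like this (a sketch for newer pandas):
# take_last=False corresponds to keep='first' (keep the first occurrence)
df = df.drop_duplicates(subset='new', keep='first')
df.groupby('new').max()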

How to filter a dataframe in pandas by 'str' in column names?

Following this recipe, I tried to filter a dataframe by the column names that contain the string '+'. Here's the example:
B = pd.DataFrame([[1, 5, 2], [2, 4, 4], [3, 3, 1], [4, 2, 2], [5, 1, 4]],
columns=['A', '+B', '+C'], index=[1, 2, 3, 4, 5])
So I want a dataframe C with only '+B' and '+C' columns in it.
C = B.filter(regex='+')
However I get the error:
File "c:\users\hernan\anaconda\lib\site-packages\pandas\core\generic.py", line 1888, in filter
matcher = re.compile(regex)
File "c:\users\hernan\anaconda\lib\re.py", line 190, in compile
return _compile(pattern, flags)
File "c:\users\hernan\anaconda\lib\re.py", line 244, in _compile
raise error, v # invalid expression
error: nothing to repeat
The recipe says it is Python 3. I use python 2.7. However, I don't think that is the problem here.
Hernan
+ has a special meaning in regular expressions (see here). You can escape it with \:
>>> C = B.filter(regex='\+')
>>> C
+B +C
1 5 2
2 4 4
3 3 1
4 2 2
5 1 4
Or, since all you care about is the presence of +, you could use the like argument instead:
>>> C = B.filter(like="+")
>>> C
+B +C
1 5 2
2 4 4
3 3 1
4 2 2
5 1 4
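Another option (not from the original answer, just a sketch) is to build a boolean mask over the column names with str.contains, treating '+' as a literal substring rather than a regex:
# regex=False makes '+' a plain substring test on the column names
C = B.loc[:, B.columns.str.contains('+', regex=False)]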

Extract multi-digit numbers from a string in python 3

I am doing the algorithm challenges from HackerRank and one of the problems needs me to accept input in the form of strings of numbers formatted as follows:
3 4
12 14 16
1 2
3 4
5 6
Now, I know how to iterate through the lines and assign them where they need to go, but my issue is with the second line. The other lines each contain two single-digit numbers, so I've been extracting them by just referencing their index in the string. For example, the first line of numbers would be collected with string[0] and string[-1].
The second line, however, is of indeterminate length and may include numbers shorter or longer than three digits. How would I pull those out and assign them to variables? I'm sure there is probably a way to do it with RegEx, but I don't know how to assign multiple matches in one string to multiple variables.
import re
print(re.findall(r"(\d+)",x))
"x" being your line.This will return a list with all the number.
You mean this?
>>> import re
>>> s = """3 4
... 12 14 16
... 1 2
... 3 4
... 5 6"""
>>> m = re.findall(r'\b\d+\b', s, re.M)
>>> m
['3', '4', '12', '14', '16', '1', '2', '3', '4', '5', '6']
Just pick up each value in the final list and assign it to variables.
So if s is your string,
list(map(int, s.split()))
yields a list of integers:
[3, 4, 12, 14, 16, 1, 2, 3, 4, 5, 6]
That's basically what skamazin suggested.
Given:
>>> txt='''\
... 3 4
... 12 14 16
... 1 2
... 3 4
... 5 6'''
If the lines have meaning, you can do:
>>> [list(map(int, line.split())) for line in txt.splitlines()]
[[3, 4], [12, 14, 16], [1, 2], [3, 4], [5, 6]]
If the lines have no meaning, you just want all the digits, you can do:
>>> list(map(int, txt.split()))
[3, 4, 12, 14, 16, 1, 2, 3, 4, 5, 6]
If your source text has the possibility of strings that will not convert to integers:
>>> txt='''\
... 3 4
... 12 14 16
... 1 2
... 3 4
... 5 6
... text that won't be integers
... 99 100 101'''
You can use a conversion function:
>>> def conv(s):
...     try:
...         return int(s)
...     except ValueError:
...         return s
...
>>> [[conv(s) for s in line.split()] for line in txt.splitlines()]
[[3, 4], [12, 14, 16], [1, 2], [3, 4], [5, 6], ['text', 'that', "won't", 'be', 'integers'], [99, 100, 101]]
Or filter out the things that are not digits:
>>> list(map(int, filter(lambda s: s.isdigit(), txt.split())))
[3, 4, 12, 14, 16, 1, 2, 3, 4, 5, 6, 99, 100, 101]

Pandas: What is the best way to 'crop' as large dataframe to only the previous 1000 days?

I have a dataframe where the index is made up of datetimes. I also have an anchor date and I know that I only want the second dataframe to contain the 1000 days previous to the anchor date. What is the best way to do this?
I don't know if it's the best way, but it should work.
Create example DataFrame:
>>> dates = [pd.datetime(2012, 5, 4), pd.datetime(2012, 5, 5), pd.datetime(2012, 5, 6), pd.datetime(2012, 5, 1), pd.datetime(2012, 5, 2), pd.datetime(2012, 5, 3)]
>>> values = [1, 2, 3, 4, 5, 6]
>>> df = pd.DataFrame(values, dates)
>>> df
0
2012-05-04 1
2012-05-05 2
2012-05-06 3
2012-05-01 4
2012-05-02 5
2012-05-03 6
Suppose we want 2 days back from 2012-05-04:
>>> date_end = pd.datetime(2012, 5, 4)
>>> date_start = date_end - pd.DateOffset(days=2)
>>> date_start, date_end
(datetime.datetime(2012, 5, 2, 0, 0), datetime.datetime(2012, 5, 4, 0, 0))
Now let's try to get rows by label indexing:
>>> df.loc[date_start:date_end]
Empty DataFrame
Columns: [0]
Index: []
That's because our index is not sorted, so let's fix it:
>>> df.sort_index(inplace=True)
>>> df.loc[date_start:date_end]
0
2012-05-02 5
2012-05-03 6
2012-05-04 1
It's also possible to get rows by datetime indexing:
>>> df[date_start:date_end]
0
2012-05-02 5
2012-05-03 6
2012-05-04 1
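For the original question's 1000-day window, the same pattern applies once the index is sorted (a sketch; anchor_date is a hypothetical stand-in for your anchor date):
anchor_date = pd.Timestamp('2012-05-04')            # hypothetical anchor
start_date = anchor_date - pd.DateOffset(days=1000)
df_cropped = df.sort_index().loc[start_date:anchor_date]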
Keep in mind that I'm still not an expert in Pandas, but I like it for Data Analysis very much.
Hope it helps.