I am using the following code to create a data frame from a list:
test_list = ['a','b','c','d']
df_test = pd.DataFrame.from_records(test_list, columns=['my_letters'])
df_test
The above code works fine. Then I tried the same approach for another list:
import pandas as pd
q_list = ['112354401', '116115526', '114909312', '122425491', '131957025', '111373473']
df1 = pd.DataFrame.from_records(q_list, columns=['q_data'])
df1
But it gave me the following error this time:
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-24-99e7b8e32a52> in <module>()
1 import pandas as pd
2 q_list = ['112354401', '116115526', '114909312', '122425491', '131957025', '111373473']
----> 3 df1 = pd.DataFrame.from_records(q_list, columns=['q_data'])
4 df1
/usr/local/lib/python3.4/dist-packages/pandas/core/frame.py in from_records(cls, data, index, exclude, columns, coerce_float, nrows)
1021 else:
1022 arrays, arr_columns = _to_arrays(data, columns,
-> 1023 coerce_float=coerce_float)
1024
1025 arr_columns = _ensure_index(arr_columns)
/usr/local/lib/python3.4/dist-packages/pandas/core/frame.py in _to_arrays(data, columns, coerce_float, dtype)
5550 data = lmap(tuple, data)
5551 return _list_to_arrays(data, columns, coerce_float=coerce_float,
-> 5552 dtype=dtype)
5553
5554
/usr/local/lib/python3.4/dist-packages/pandas/core/frame.py in _list_to_arrays(data, columns, coerce_float, dtype)
5607 content = list(lib.to_object_array(data).T)
5608 return _convert_object_array(content, columns, dtype=dtype,
-> 5609 coerce_float=coerce_float)
5610
5611
/usr/local/lib/python3.4/dist-packages/pandas/core/frame.py in _convert_object_array(content, columns, coerce_float, dtype)
5666 # caller's responsibility to check for this...
5667 raise AssertionError('%d columns passed, passed data had %s '
-> 5668 'columns' % (len(columns), len(content)))
5669
5670 # provide soft conversion of object dtypes
AssertionError: 1 columns passed, passed data had 9 columns
Why would the same approach work for one list but not another? Any idea what might be wrong here? Thanks a lot!
DataFrame.from_records treats a string as an iterable of characters, so it needs as many columns as the length of the string.
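For illustration, here is a minimal sketch (my own, not from the original post) of what from_records does with a plain 9-character string, which is exactly why the traceback above complains about 9 columns:
import pandas as pd
# the string is iterated character by character, producing one column per character
pd.DataFrame.from_records(['112354401'], columns=list('abcdefghi'))
#    a  b  c  d  e  f  g  h  i
# 0  1  1  2  3  5  4  4  0  1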
You could simply use the DataFrame constructor.
In [3]: pd.DataFrame(q_list, columns=['q_data'])
Out[3]:
q_data
0 112354401
1 116115526
2 114909312
3 122425491
4 131957025
5 111373473
In[20]: test_list = [['a','b','c'], ['AA','BB','CC']]
In[21]: pd.DataFrame(test_list, columns=['col_A', 'col_B', 'col_C'])
Out[21]:
col_A col_B col_C
0 a b c
1 AA BB CC
In[22]: pd.DataFrame(test_list, index=['col_low', 'col_up']).T
Out[22]:
col_low col_up
0 a AA
1 b BB
2 c CC
If you want to create a DataFrame from multiple lists, you can simply zip the lists. zip returns a 'zip' object (an iterator), so convert it back to a list:
mydf = pd.DataFrame(list(zip(lstA, lstB)), columns = ['My List A', 'My List B'])
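A quick runnable example with two hypothetical sample lists:
import pandas as pd
lstA = ['a', 'b', 'c']   # hypothetical sample data
lstB = [1, 2, 3]
mydf = pd.DataFrame(list(zip(lstA, lstB)), columns=['My List A', 'My List B'])
print(mydf)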
You could also use the concat method, but note that pd.concat only accepts pandas objects (Series or DataFrames), not a plain list of strings, so wrap the list in a Series first:
test_list = ['a','b','c','d']
pd.concat([pd.Series(test_list, name='my_letters')], axis=1)
You could also take the help of numpy.
import numpy as np
df1 = pd.DataFrame(np.array(q_list),columns=['q_data'])
Related
I have a dataframe like the one below. I am trying to filter out words based on:
1) if the length of the string in the Root Word column is equal to 1
2) if the Similar Word corresponding to the Root Word column is blank
3) remove rows from the dataframe if the Text column contains only a number
Root Word Similar Word
kwun kwung, kwon, kuwan, ton, tong., jwun, stkwun, rd.kl, kuwn,
bay ba
1
chung chung., cont, kway, containe, kwai, terminal4
international
tin ti
floor floor.
central cental, central.
tsuen tusen, tsven
g gf g/f
My code
similar = [[item[0] for item in model.wv.most_similar(word) if item[1] > 0.7] for word in words]
similarity_matrix = pd.DataFrame({'Root_Word': words, 'Similar_Words': similar})
similarity_matrix = similarity_matrix[['Root_Word', 'Similar_Words']]
import numpy as np
conditions = [
similarity_matrix ['Similar_Words'].isnull(),
similarity_matrix ['Root_Word'].isnumeric(),
similarity_matrix ['Root_Word'].str.len() == 1
]
similarity_matrix = similarity_matrix [np.logical_or.reduce(conditions)]
But I am getting the error below:
AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
3079 if name in self._info_axis:
3080 return self[name]
-> 3081 return object.__getattribute__(self, name)
3082
3083 def __setattr__(self, name, value):
AttributeError: 'Series' object has no attribute 'isnumeric'
How can I fix this?
Data link
https://drive.google.com/file/d/1F6Z3QrOFtAChpPhaMKGu3iJ0epfqKE7F/view?usp=sharing
You need the str accessor - Series.str.isnumeric - and to invert the final boolean mask with ~:
similarity_matrix['Root_Word'].astype(str).str.isnumeric()
similarity_matrix['Similar_Words'] == "",
I also added a condition to remove empty strings:
import numpy as np
conditions = [
    similarity_matrix['Similar_Words'].isnull(),
    similarity_matrix['Similar_Words'] == "",
    similarity_matrix['Root_Word'].astype(str).str.isnumeric(),
    similarity_matrix['Root_Word'].str.len() == 1
]
similarity = similarity_matrix[~np.logical_or.reduce(conditions)]
print (similarity)
Root_Word Similar_Words
0 kwun kwung, kwon, kuwan, ton, tong., jwun, stkwun, ...
1 bay ba
3 chung chung., cont, kway, containe, kwai, terminal4
5 tin ti
6 floor floor.
7 central cental, central.
8 tsuen tusen, tsven
To check, here are all the filtered-out rows:
print (similarity_matrix[np.logical_or.reduce(conditions)])
Root_Word Similar_Words
2 1 None
4 international None
9 g gf g/f
How can I pass a list of columns (multiple columns) instead of a single column in PySpark using this command:
new_df = new_df.filter(new_df.color.isin(*filter_list) == False)
e.g.:
I used this code for removing garbage values (#, $) from a single column:
filter_list = ['##', '$']
new_df = new_df.filter(new_df.color.isin(*filter_list) == False)
In this example 'color' is the column.
But I want to remove garbage values (#, ##, $, $$$) with multiple occurrences across multiple columns.
Sample Input:-
id name Salary
# Yogita 3000
2 Bhavana 5000
$$ ### 7000
%$4# Neha $$$$
Sample Output:-
id name salary
2 Bhavana 5000
Can anybody help me?
Thanks in advance,
Yogita
Here is an answer using a user-defined function:
from pyspark.sql.types import *
from pyspark.sql import functions as f
from itertools import chain
from functools import reduce
filter_list = ['#','##', '$', '$$$']
def filterfn(*x):
    # keep the row only if none of the filter strings appears in any of its values
    booleans = list(chain(*[[filter not in elt for filter in filter_list] for elt in x]))
    return reduce(lambda a, b: a and b, booleans, True)
filter_udf=f.udf(filterfn, BooleanType())
new_df.filter(filter_udf(*[col for col in new_df.columns])).show(10)
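The same idea can also be written without a UDF; here is a minimal sketch using built-in column functions (assuming the same new_df and filter_list as above, and mirroring the substring check in filterfn):
from functools import reduce
import pyspark.sql.functions as F
# keep a row only if no garbage string appears in any of its columns
condition = reduce(lambda a, b: a & b,
                   [~F.col(c).contains(v) for c in new_df.columns for v in filter_list])
new_df.filter(condition).show(10)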
I am running a snippet of code that queries a database and then fills in a pandas dataframe with a value of 1 if a tuple is present in the query result. It does this by running the query and then iterating over the tuples to fill in the dataframe. However, the query returns almost 8 million rows of data.
My question is whether anyone knows how to speed up a process like this. Here is the code:
user_age = pd.read_sql_query(sql_age, datastore, index_col=['userid']).age.astype(np.int, copy=False)
x = pd.DataFrame(0, index=user_age.index, columns=range(366), dtype=np.int8)
for r in pd.read_sql_query(sql_active, datastore, chunksize=50000):
    for userid, day in r.itertuples(index=False):
        x.at[userid, day] = 1
Thank you in advance!
You could save some time by replacing the Python loop
for userid, day in r.itertuples(index=False):
    x.at[userid, day] = 1
with a NumPy array assignment using "advanced integer indexing":
x[npidx[r['userid']], r['day']] = 1
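Here npidx (built inside using_numpy below) maps each userid to its row position in x. A tiny illustration with a hypothetical 3-user index:
import numpy as np
index = np.array([2, 0, 1])            # userids, in the row order of x
npidx = np.empty_like(index)
npidx[index] = np.arange(len(index))   # npidx[userid] -> row position of that userid
# npidx is now [1, 2, 0]: userid 2 sits in row 0, userid 0 in row 1, userid 1 in row 2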
On an 80,000-row DataFrame, using_numpy (below) is about 6x faster:
In [7]: %timeit orig()
1 loop, best of 3: 984 ms per loop
In [8]: %timeit using_numpy()
10 loops, best of 3: 162 ms per loop
import numpy as np
import pandas as pd
def mock_read_sql_query():
    np.random.seed(2016)
    for arr in np.array_split(index, N//M):
        size = len(arr)
        df = pd.DataFrame({'userid':arr , 'day':np.random.randint(366, size=size)})
        df = df[['userid', 'day']]
        yield df
N, M = 8*10**4, 5*10**2
index = np.arange(N)
np.random.shuffle(index)
columns = range(366)
def using_numpy():
    npidx = np.empty_like(index)
    npidx[index] = np.arange(len(index))
    x = np.zeros((len(index), len(columns)), dtype=np.int8)
    for r in mock_read_sql_query():
        x[npidx[r['userid']], r['day']] = 1
    x = pd.DataFrame(x, columns=columns, index=index)
    return x
def orig():
    x = pd.DataFrame(0, index=index, columns=columns, dtype=np.int8)
    for r in mock_read_sql_query():
        for userid, day in r.itertuples(index=False):
            x.at[userid, day] = 1
    return x
expected = orig()
result = using_numpy()
expected_index, expected_col = np.where(expected)
result_index, result_col = np.where(result)
assert np.equal(expected_index, result_index).all()
assert np.equal(expected_col, result_col).all()
I've got a learner that returns a list of values corresponding to dates.
I need the function to return a dataframe for plotting purposes. I've got the dataframe created, but now I need to populate the dataframe with the values from the list. Here is my code:
learner.addEvidence(x,y_values.values)
y_prediction_list = learner.query(x) # this yields a plain old python list
y_prediction_df = pd.DataFrame(index=dates,columns="Y-Prediction")
y_prediction_df = ??
return y_prediction_df
You can simply create the dataframe with:
y_prediction_df=pd.DataFrame({"Y-Prediction":y_prediction_list},index=dates)
I think you can use the data parameter of DataFrame.
import pandas as pd
#test data
dates = ["2014-05-22 05:37:59", "2015-05-22 05:37:59","2016-05-22 05:37:59"]
y_prediction_list = [1,2,3]
y_prediction_df = pd.DataFrame(data=y_prediction_list, index=dates,columns=["Y-Prediction"])
print(y_prediction_df)
# Y-Prediction
#2014-05-22 05:37:59 1
#2015-05-22 05:37:59 2
#2016-05-22 05:37:59 3
print(y_prediction_df.info())
#<class 'pandas.core.frame.DataFrame'>
#Index: 3 entries, 2014-05-22 05:37:59 to 2016-05-22 05:37:59
#Data columns (total 1 columns):
#Y-Prediction 3 non-null int64
#dtypes: int64(1)
#memory usage: 48.0+ bytes
#None
In Python I have data that looks like this, with 500,000 rows:
TIME count
1-1-1900 10:41:00 1
3-1-1900 09:54:00 1
4-1-1900 15:45:00 1
5-1-1900 18:41:00 1
4-1-1900 15:45:00 1
and I want to make a new column with the timestamps binned into quarters of an hour, like this:
bins count
9:00-9:15 2
9:15-9:30 4
9:30-9:45 4
10:00-10:15 4
I know how to make bins, but the timestamp gives me trouble.
Can somebody help me with this?
Thank you in advance!
I know it's late, but better late than never. I also came across a similar requirement and solved it using the pandas library.
First, load the data into a pandas DataFrame.
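For example, a minimal sketch (the filename is hypothetical):
import pandas as pd
import numpy as np
df = pd.read_csv('data.csv')   # should yield the TIME and value columns used below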
Second, check that the TIME column is a datetime object and not object type (like a string). You can check this with
df.info()
For example, in my case the TIME column was initially of object type, i.e. string:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17640 entries, 0 to 17639
Data columns (total 3 columns):
TIME 17640 non-null object
value 17640 non-null int64
dtypes: int64(1), object(2)
memory usage: 413.5+ KB
If that is the case, convert it to a pandas datetime object with this command:
df['TIME'] = pd.to_datetime(df['TIME'])
Skip this step if the column is already in datetime format.
df.info() now shows the updated dtypes:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17640 entries, 0 to 17639
Data columns (total 3 columns):
TIME 17640 non-null datetime64[ns]
value 17640 non-null int64
dtypes: datetime64[ns](2), int64(1)
memory usage: 413.5 KB
Now our dataframe is ready for magic :)
counts = pd.Series(index=df.TIME, data=np.array(df['value'])).resample('15T').count()
print(counts[:3])
TIME
2017-07-01 00:00:00 3
2017-07-01 00:15:00 3
2017-07-01 00:30:00 3
Freq: 15T, dtype: int64
In the above command 15T means a 15-minute bucket; you can replace it with D for a day bucket, 2D for a 2-day bucket, M for a month bucket, 2M for a 2-month bucket, and so on. You can read the details of these offset aliases in the pandas documentation.
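For example, the same pattern with other bucket sizes (just a sketch, reusing the df above):
hourly = pd.Series(index=df.TIME, data=np.array(df['value'])).resample('H').count()
daily = pd.Series(index=df.TIME, data=np.array(df['value'])).resample('D').count()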
Now our bucketed data is done, as you can see above. For the time ranges use the commands below; use the same time range as your data. In my case my data spanned 3 months, so I am creating a 3-month time range.
r = pd.date_range('2017-07', '2017-09', freq='15T')
x = np.repeat(np.array(r), 2, axis=0)[1:-1]
# now reshape data to fit in Dataframe
x = np.array(x)[:].reshape(-1, 2)
# now fit in dataframe and print it
final_df = pd.DataFrame(x, columns=['start', 'end'])
print(final_df[:3])
start end
0 2017-07-01 00:00:00 2017-07-01 00:15:00
1 2017-07-01 00:15:00 2017-07-01 00:30:00
2 2017-07-01 00:30:00 2017-07-01 00:45:00
The date ranges are done as well.
Now attach the counts to the date ranges to get the final outcome:
final_df['count'] = np.array(counts)
print(final_df[:3])
start end count
0 2017-07-01 00:00:00 2017-07-01 00:15:00 3
1 2017-07-01 00:15:00 2017-07-01 00:30:00 3
2 2017-07-01 00:30:00 2017-07-01 00:45:00 3
Hope someone finds it useful.
Well, I'm not sure that this is what you asked for. If it's not, I would recommend improving your question, because it's very hard to understand your problem. In particular, it would be nice to see what you've already tried.
from __future__ import division, print_function
from collections import namedtuple
from itertools import product
from datetime import time
from StringIO import StringIO
MAX_HOURS = 23
MAX_MINUTES = 59
def process_data_file(data_file):
    """
    The data_file is supposed to be an opened file object
    """
    time_entry = namedtuple("time_entry", ["time", "count"])
    data_to_bin = []
    for line in data_file:
        t, count = line.rstrip().split("\t")
        t = map(int, t.split()[-1].split(":")[:2])
        data_to_bin.append(time_entry(time(*t), int(count)))
    return data_to_bin
def make_milestones(min_hour=0, max_hour=MAX_HOURS, interval=15):
    minutes = [minutes for minutes in xrange(MAX_MINUTES+1) if not minutes % interval]
    hours = range(min_hour, max_hour+1)
    return [time(*milestone) for milestone in list(product(hours, minutes))]
def bin_time(data_to_bin, milestones):
    time_entry = namedtuple("time_entry", ["time", "count"])
    data_to_bin = sorted(data_to_bin, key=lambda entry: entry.time, reverse=True)
    binned_data = []
    current_count = 0
    upper = milestones.pop()
    lower = milestones.pop()
    for entry in data_to_bin:
        while not lower <= entry.time <= upper:
            if current_count:
                binned_data.append(time_entry("{}-{}".format(str(lower)[:-3], str(upper)[:-3]), current_count))
                current_count = 0
            upper, lower = lower, milestones.pop()
        current_count += entry.count
    # flush the last bin, otherwise its count would be silently dropped
    if current_count:
        binned_data.append(time_entry("{}-{}".format(str(lower)[:-3], str(upper)[:-3]), current_count))
    return binned_data
data_file = StringIO("""1-1-1900 10:41:00\t1
3-1-1900 09:54:00\t1
4-1-1900 15:45:00\t1
5-1-1900 18:41:00\t1
4-1-1900 15:45:00\t1""")
binned_time = bin_time(process_data_file(data_file), make_milestones())
for entry in binned_time:
    print(entry.time, entry.count, sep="\t")
The output:
18:30-18:45 1
15:45-16:00 2
10:30-10:45 1
09:45-10:00 1
Just trying without pandas:
from collections import defaultdict
import datetime as dt
from itertools import groupby
def bin_ts(dtime, delta):
    modulo = dtime.timestamp() % delta.total_seconds()
    return dtime - dt.timedelta(seconds=modulo)
src_data = [
    ('1-1-1900 10:41:00', 1),
    ('3-1-1900 09:54:00', 1),
    ('4-1-1900 15:45:00', 1),
    ('5-1-1900 18:41:00', 1),
    ('4-1-1900 15:45:00', 1)
]
ts_data = [(dt.datetime.strptime(ts, '%d-%m-%Y %H:%M:%S'), count) for ts, count in src_data]
bin_size = dt.timedelta(minutes=15)
binned = [(bin_ts(ts, bin_size), count) for ts, count in ts_data]
def time_fmt(ts):
    res = "%s - %s" % (ts.strftime('%H:%M'), (ts + bin_size).strftime('%H:%M'))
    return res
binned_time = [(time_fmt(ts), count) for ts, count in binned]
cnts = defaultdict(int)
for ts, group in groupby(binned_time, lambda x: x[0]):
    for row in group:
        cnts[ts] += row[1]
output = list(cnts.items())
output.sort(key=lambda x: x[0])
from pprint import pprint
pprint(output)
resulting in:
[('09:45 - 10:00', 1),
('10:30 - 10:45', 1),
('15:45 - 16:00', 2),
('18:30 - 18:45', 1)]