Count unique dates from a multidimensional list - python-2.7

I am querying a database for Leads. Leads have a "lead generated date" and a possible "closed" date.
What I would like to do is get a month by month total for leads generated/leads closed per month in the format [MM/YYYY, leads generated, leads closed] for Google Visualization API.
I have my query logic set and currently have a a result similar to:
[
["09/2011","09/2011"],
["09/2011","10/2011"],
["10/2011","12/2011"],
...
]
I am stuck trying to come up with an efficient way parse this and get the result of:
[
["09/2011", 2, 1],
["10/2011", 1, 1],
["12/2011", 0, 1]
]
Any help would be appreciated!

It's not that beautiful, but this should work:
from collections import defaultdict
d1 = defaultdict(int)
d2 = defaultdict(int)
data = [["09/2011","09/2011"],["09/2011","10/2011"],["10/2011","12/2011"]]
for d in data:
d1[d[0]] += 1
d2[d[1]] += 1
out = []
for key in set(d1.keys()) | set(d2.keys()):
out.append([key, d1.get(key, 0), d2.get(key, 0)])

Related

python Find the most reported month

I am trying to find out October(mentioned 2 times), I had the idea to use dictionary to solve this problem. However I struggled a lot to figure out how to find/separate the months, I was not able to use my solution for the 1st str values where there are some spaces. Can someone please suggest how can I modify that split section to cover - , and white space?
import re
#str="May-29-1990, Oct-18-1980 ,Sept-1-1980, Oct-2-1990"
str="May-29-1990,Oct-18-1980,Sept-1-1980,Oct-2-1990"
val=re.split(',',str)
monthList=[]
myDictionary={}
#put the months in a list
def sep_month():
for item in val:
if not item.isdigit():
month,day,year=item.split("-")
monthList.append(month)
#process the month list from above
def count_month():
for item in monthList:
if item not in myDictionary.keys():
myDictionary[item]=1
else:
myDictionary[item]=myDictionary.get(item)+1
for k,v in myDictionary.items():
if v==2:
print(k)
sep_month()
count_month()
from datetime import datetime
import calendar
from collections import Counter
datesString = "May-29-1990,Oct-18-1980,Sep-1-1980,Oct-2-1990"
datesListString = datesString.split(",")
datesList = []
for dateStr in datesListString:
datesList.append(datetime.strptime(dateStr, '%b-%d-%Y'))
monthsOccurrencies = Counter((calendar.month_name[date.month] for date in datesList))
print(monthsOccurrencies)
# Counter({'October': 2, 'May': 1, 'September': 1})
Something to be aware in my solution with %b for the month is that Sept has changed to Sep to work (Month as locale’s abbreviated name). In this case you can either use fullname months (%B) or abbreviated name (%b). If you can not have the big string as with correct month name formatting, just replace the wrong ones ("Sept" for example with "Sep" and always work with date obj).
Not sure that regex is the best tool for this job, I would just use strip() along with split() to handle your whitespace issues and get a list of just the month abbreviations. Then you could create a dict with counts by month using the list method count(). For example:
dates = 'May-29-1990, Oct-18-1980 ,Sept-1-1980, Oct-2-1990'
months = [d.split('-')[0].strip() for d in dates.split(',')]
month_counts = {m: months.count(m) for m in set(months)}
print(month_counts)
# {'May': 1, 'Oct': 2, 'Sept': 1}
Or even better with collections.Counter:
from collections import Counter
dates = 'May-29-1990, Oct-18-1980 ,Sept-1-1980, Oct-2-1990'
months = [d.split('-')[0].strip() for d in dates.split(',')]
month_counts = Counter(months)
print(month_counts)
# Counter({'Oct': 2, 'May': 1, 'Sept': 1})

Dask: Groupby with nlargest automatically brings in index and doesn't allow reset_index()

I've been trying to get nlargest rows for a group by following method from this question. The solution to the question is correct up to a point.
In this example, I groupby column A and want to return the rows of C and D based on the top two values in B.
For some reason the index of grp_df is multilevel and includes both A and the original index of ddf.
I was hoping to simply reset_index() and drop the unwanted index and just keep A, but I get the following error:
ValueError: The columns in the computed data do not match the columns in the provided metadata
Here is a simple example reproducing the error:
import numpy as np
import dask.dataframe as dd
import pandas as pd
np.random.seed(42)
df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
ddf = dd.from_pandas(df, npartitions=3)
grp_df = ddf.groupby('A')[['B','C']].apply(lambda x: x.nlargest(2, columns=['B']), meta={
"B": 'f8', "C": 'f8'})
# Print is successful and results are correct
print(grp_df.head())
grp_df = grp_df.reset_index()
# Print is unsuccessful and shows error below
print(grp_df.head())
Found approach for solution here.
Following code now allows for reset_index() to work and gets rid of the original ddf index. Still not sure why the original ddf index came through the groupby in the first place, though
meta = pd.DataFrame(columns=['B', 'C'], dtype=int, index=pd.MultiIndex([[], []], [[], []], names=['A', None]))
grp_df = ddf.groupby('A')[['B','C']].apply(lambda x: x.nlargest(2, columns=['B']), meta=meta)
grp_df = grp_df.reset_index().drop('level_1', axis=1)

Rewriting some functions for xlsxwriter box borders from Python 2 to Python 3

I am having some problem getting xlsxwriter to create box borders around a number of cells when creating a Excel sheet. After some searching I found a thread here where there was a example on how to do this in Python 2.
The link to the thread is:
python XlsxWriter set border around multiple cells
The answer I am trying to use is the one given by aubaub.
I am using Python 3 and is trying to get this to work but I am having some problems with it.
The first thing I did was changing xrange to range in the
def box(workbook, sheet_name, row_start, col_start, row_stop, col_stop),
and then I changed dict.iteritems() to dict.items() in
def add_to_format(existing_format, dict_of_properties, workbook):
Since there have been some changes to this from Python 2 to 3.
But the next part I am struggling with, and kinda have no idea what to do, and this is the
return(workbook.add_format(dict(new_dict.items() + dict_of_properties.items())))
part. I tried to change this by adding the two dictionaries in another way, by adding this before the return part.
dest = dict(list(new_dict.items()) + list(dict_of_properties.items()))
return(workbook.add_format(dest))
But this is not working, I have not been using dictionaries a lot before, and am kinda blank on how to get this working, and if it there have been some other changes to xlsxwriter or other factors that prevent this from working. Does anyone have some good ideas for how to solve this?
Here I have added a working example of the code and problem.
import pandas as pd
import xlsxwriter
import numpy as np
from xlsxwriter.utility import xl_range
#Adding the functions from aubaub copied from question on Stackoverflow
# https://stackoverflow.com/questions/21599809/python-xlsxwriter-set-border-around-multiple-cells/37907013#37907013
#And added the changes I thought would make it work.
def add_to_format(existing_format, dict_of_properties, workbook):
"""Give a format you want to extend and a dict of the properties you want to
extend it with, and you get them returned in a single format"""
new_dict={}
for key, value in existing_format.__dict__.items():
if (value != 0) and (value != {}) and (value != None):
new_dict[key]=value
del new_dict['escapes']
dest = dict(list(new_dict.items()) + list(dict_of_properties.items()))
return(workbook.add_format(dest))
def box(workbook, sheet_name, row_start, col_start, row_stop, col_stop):
"""Makes an RxC box. Use integers, not the 'A1' format"""
rows = row_stop - row_start + 1
cols = col_stop - col_start + 1
for x in range((rows) * (cols)): # Total number of cells in the rectangle
box_form = workbook.add_format() # The format resets each loop
row = row_start + (x // cols)
column = col_start + (x % cols)
if x < (cols): # If it's on the top row
box_form = add_to_format(box_form, {'top':1}, workbook)
if x >= ((rows * cols) - cols): # If it's on the bottom row
box_form = add_to_format(box_form, {'bottom':1}, workbook)
if x % cols == 0: # If it's on the left column
box_form = add_to_format(box_form, {'left':1}, workbook)
if x % cols == (cols - 1): # If it's on the right column
box_form = add_to_format(box_form, {'right':1}, workbook)
sheet_name.write(row, column, "", box_form)
#Adds dataframe with some data
frame1 = pd.DataFrame(np.random.randint(0,100,size=(10, 4)), columns=list('ABCD'))
writer = pd.ExcelWriter('test.xlsx', engine='xlsxwriter')
#Add frame to Excel sheet
frame1.to_excel(writer, sheet_name='Sheet1', startcol= 1, startrow= 2)
# Get the xlsxwriter workbook and worksheet objects.
workbook = writer.book
worksheet = writer.sheets['Sheet1']
#Add some formating to the table
format00 = workbook.add_format()
format00.set_bold()
format00.set_font_size(14)
format00.set_bg_color('#F2F2F2')
format00.set_align('center')
worksheet.conditional_format(xl_range(2, 1, 2, 5),
{'type': 'no_blanks',
'format': format00})
box(workbook, 'Sheet1', 3, 1, 12, 5)
writer.save()
I stumbled on this when trying to see if anyone else had posted a better way to deal with formats. Don't use my old way; whether you could make it work with Python 3 or not, it's pretty crappy. Instead, grab what I just put here: https://github.com/Yoyoyoyoyoyoyo/XlsxFormatter.
If you use sheet.cell_writer() instead of sheet.write(), then it will keep a memory of the formats you ask for on a cell-by-cell basis, so writing something new in a cell (or adding a border around it) won't delete the cell's old format, but adds to it instead.
A simple example of your code:
from format_classes import Book
book = Book(where_to_save)
sheet = book.add_book_sheet('Sheet1')
sheet.box(3, 1, 12, 5)
# add data to the box with sheet.cell_writer(...)
book.close()
Look at the code & the README to see how to do other things, like format the box's borders or backgrounds, write data, apply a format to an entire worksheet, etc.

Removing features with low variance using scikit-learn

scikit-learn provides various methods to remove descriptors, a basic method for this purpose has been provided by the given tutorial below,
http://scikit-learn.org/stable/modules/feature_selection.html
but the tutorial does not provide any method or a way that can tell you the way to keep the list of features that either removed or kept.
The code below has been taken from the tutorial.
from sklearn.feature_selection import VarianceThreshold
X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]
sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
sel.fit_transform(X)
array([[0, 1],
[1, 0],
[0, 0],
[1, 1],
[1, 0],
[1, 1]])
The given example code above depicts only two descriptors "shape(6, 2)", but in my case, I have a huge data frames with a shape of (rows 51, columns 9000). After finding a suitable model I want to keep the track of useful and useless features because I can save computational time during the computation of the features of test data set by calculating only useful features.
For example, when you perform machine learning modeling with WEKA 6.0, it provided with remarkable flexibility over feature selection and after removing the useless feature you can get a list of a discarded features along with the useful features.
thanks
Then, what you can do, if I'm not wrong is:
In the case of the VarianceThreshold, you can call the method fit instead of fit_transform. This will fit data, and the resulting variances will be stored in vt.variances_ (assuming vt is your object).
Having a threhold, you can extract the features of the transformation as fit_transform would do:
X[:, vt.variances_ > threshold]
Or get the indexes as:
idx = np.where(vt.variances_ > threshold)[0]
Or as a mask
mask = vt.variances_ > threshold
PS: default threshold is 0
EDIT:
A more straight forward to do, is by using the method get_support of the class VarianceThreshold. From the documentation:
get_support([indices]) Get a mask, or integer index, of the features selected
You should call this method after fit or fit_transform.
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
# Just make a convenience function; this one wraps the VarianceThreshold
# transformer but you can pass it a pandas dataframe and get one in return
def get_low_variance_columns(dframe=None, columns=None,
skip_columns=None, thresh=0.0,
autoremove=False):
"""
Wrapper for sklearn VarianceThreshold for use on pandas dataframes.
"""
print("Finding low-variance features.")
try:
# get list of all the original df columns
all_columns = dframe.columns
# remove `skip_columns`
remaining_columns = all_columns.drop(skip_columns)
# get length of new index
max_index = len(remaining_columns) - 1
# get indices for `skip_columns`
skipped_idx = [all_columns.get_loc(column)
for column
in skip_columns]
# adjust insert location by the number of columns removed
# (for non-zero insertion locations) to keep relative
# locations intact
for idx, item in enumerate(skipped_idx):
if item > max_index:
diff = item - max_index
skipped_idx[idx] -= diff
if item == max_index:
diff = item - len(skip_columns)
skipped_idx[idx] -= diff
if idx == 0:
skipped_idx[idx] = item
# get values of `skip_columns`
skipped_values = dframe.iloc[:, skipped_idx].values
# get dataframe values
X = dframe.loc[:, remaining_columns].values
# instantiate VarianceThreshold object
vt = VarianceThreshold(threshold=thresh)
# fit vt to data
vt.fit(X)
# get the indices of the features that are being kept
feature_indices = vt.get_support(indices=True)
# remove low-variance columns from index
feature_names = [remaining_columns[idx]
for idx, _
in enumerate(remaining_columns)
if idx
in feature_indices]
# get the columns to be removed
removed_features = list(np.setdiff1d(remaining_columns,
feature_names))
print("Found {0} low-variance columns."
.format(len(removed_features)))
# remove the columns
if autoremove:
print("Removing low-variance features.")
# remove the low-variance columns
X_removed = vt.transform(X)
print("Reassembling the dataframe (with low-variance "
"features removed).")
# re-assemble the dataframe
dframe = pd.DataFrame(data=X_removed,
columns=feature_names)
# add back the `skip_columns`
for idx, index in enumerate(skipped_idx):
dframe.insert(loc=index,
column=skip_columns[idx],
value=skipped_values[:, idx])
print("Succesfully removed low-variance columns.")
# do not remove columns
else:
print("No changes have been made to the dataframe.")
except Exception as e:
print(e)
print("Could not remove low-variance features. Something "
"went wrong.")
pass
return dframe, removed_features
this worked for me if you want to see exactly which columns are remained after thresholding you may use this method:
from sklearn.feature_selection import VarianceThreshold
threshold_n=0.95
sel = VarianceThreshold(threshold=(threshold_n* (1 - threshold_n) ))
sel_var=sel.fit_transform(data)
data[data.columns[sel.get_support(indices=True)]]
When testing features I wrote this simple function that tells me which variables remained in the data frame after the VarianceThreshold is applied.
from sklearn.feature_selection import VarianceThreshold
from itertools import compress
def fs_variance(df, threshold:float=0.1):
"""
Return a list of selected variables based on the threshold.
"""
# The list of columns in the data frame
features = list(df.columns)
# Initialize and fit the method
vt = VarianceThreshold(threshold = threshold)
_ = vt.fit(df)
# Get which column names which pass the threshold
feat_select = list(compress(features, vt.get_support()))
return feat_select
which returns a list of column names which are selected. For example: ['col_2','col_14', 'col_17'].

Get objects created in last 30 days, for each past day

I am looking for fast method to count model's objects created within past 30 days, for each day separately. For example:
27.07.2013 (today) - 3 objects created
26.07.2013 - 0 objects created
25.07.2013 - 2 objects created
...
27.06.2013 - 1 objects created
I am going to use this data in google charts API. Have you any idea how to get this data efficiently?
items = Foo.objects.filter(createdate__lte=datetime.datetime.today(), createdate__gt=datetime.datetime.today()-datetime.timedelta(days=30)).\
values('createdate').annotate(count=Count('id'))
This will (1) filter results to contain the last 30 days, (2) select just the createdate field and (3) count the id's, grouping by all selected fields (i.e. createdate). This will return a list of dictionaries of the format:
[
{'createdate': <datetime.date object>, 'count': <int>},
{'createdate': <datetime.date object>, 'count': <int>},
...
]
EDIT:
I don't believe there's a way to get all dates, even those with count == 0, with just SQL. You'll have to insert each missing date through python code, e.g.:
import datetime
# needed to use .append() later on
items = list(items)
dates = [x.get('createdate') for x in items]
for d in (datetime.datetime.today() - datetime.timedelta(days=x) for x in range(0,30)):
if d not in dates:
items.append({'createdate': d, 'count': 0})
I think this can be somewhat more optimized solution with #knbk 's solution with python. This has fewer iterations and iterations inside SET is highly optimized in python (both in processing and in CPU-cycles).
from_date = datetime.date.today() - datetime.timedelta(days=7)
orders = Order.objects.filter(created_at=from_date, dealer__executive__branch__user=user)
orders = orders.annotate(count=Count('id')).values('created_at').order_by('created_at')
if len(orders) < 7:
orders_list = list(orders)
dates = set([(datetime.date.today() - datetime.timedelta(days=i)) for i in range(6)])
order_set = set([ord['created_at'] for ord in orders])
for dt in (order_set - dates):
orders_list.append({'created_at': dt, 'count': 0})
orders_list = sorted(orders_list, key=lambda item: item['created_at'])
else:
orders_list = orders