I am learning Python and pandas via Wes McKinney's Python for Data Analysis. One of the examples in Chapter 2 is a merge of MovieLens data on movie_id that is not working. I think the issue is that in ratings the movie_id is an int64 while in movies it is an object, so the merge returns an empty DataFrame.
I have read some of the previous posts on pandas and automatic data type assignment, and found the dtype argument in the pandas.io.parsers.read_table documentation, but I can't get the type to change.
The original code:
mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('ch02/movielens/movies.dat', sep='::', header=None, names=mnames)
And what my research indicated should work:
movies = pd.read_table('ch02/movielens/movies.dat', sep='::', header=None, names=mnames, dtype={'movie_id':np.int64})
Unfortunately, the type isn't changed and the merge still returns an empty DataFrame. I am running pandas 0.10.1.
(Note I haven't looked up the book code, just your post)
First confirm the dtypes:
print(ratings_df.dtypes)
print(movies_df.dtypes)
If you find they're different types, you could try this (let's assume ratings_df.movie_id is object instead of int):
ratings_df.movie_id = ratings_df.movie_id.astype(int)
See if your merge now works.
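For completeness, a minimal end-to-end sketch, assuming the frames are named ratings_df and movies_df and that the object column holds clean numeric strings:
import pandas as pd

# confirm the key column's dtype on both sides
print(ratings_df['movie_id'].dtype)  # e.g. int64
print(movies_df['movie_id'].dtype)   # e.g. object

# coerce the object column to integers, then merge
movies_df['movie_id'] = movies_df['movie_id'].astype(int)
merged = pd.merge(ratings_df, movies_df, on='movie_id')
print(len(merged))  # should no longer be 0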
Related
I am using
pandas==0.25.0
django-pandas==0.6.1
And I am using value_counts() to count unique value combinations in two columns:
charges_mean_provinces = whatever.objects.filter(whatever = whatever).values('origin_province','destination_province')
df_charges_mean = pd.DataFrame(charges_mean_provinces)
df_charges_mean = df_charges_mean.value_counts().to_frame('cantidad').reset_index()
Locally (in development) it works correctly, but in production (I use Heroku) it returns this error:
'DataFrame' object has no attribute 'value_counts'
Is there another way to count unique value combinations from two columns without using value_counts, considering that I cannot change my pandas version on Heroku?
Anyway, value_counts is in the pandas 0.25 documentation, so I do not understand the error.
What version of pandas is your production environment actually running? Originally, value_counts was a method for Series rather than DataFrames: you could call value_counts on a specific column, but not on the frame itself.
In pandas 1.1.0 that changed, and value_counts is now also a DataFrame method. I recall seeing posts on here about this exact error before pandas was updated.
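If production is stuck on a pandas older than 1.1.0, groupby gives the same counts on every version. A sketch, assuming charges_mean_provinces is the queryset from your question:
df_charges_mean = pd.DataFrame(charges_mean_provinces)
# groupby + size counts unique (origin, destination) pairs on any pandas version
df_charges_mean = (
    df_charges_mean
    .groupby(['origin_province', 'destination_province'])
    .size()
    .to_frame('cantidad')
    .reset_index()
)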
Context
There is a DataFrame of customer invoices and their due dates (identified by customer code).
Week(s) need to be added depending on customer code
Model is created to persist the list of customers and week(s) to be added
What is done so far:
models.py
class BpShift(models.Model):
    bp_name = models.CharField(max_length=50, default='')
    bp_code = models.CharField(max_length=15, primary_key=True, default='')
    weeks = models.IntegerField(default=0)
helper.py
import datetime

from .models import BpShift

# used in views later
def week_shift(self, df):
    df['DueDateRange'] = df['DueDate'] + datetime.timedelta(
        weeks=BpShift.objects.get(pk=df['BpCode']).weeks)
I realised my understanding of DataFrames is seriously flawed: df['A'] and df['B'] return Series, so of course timedelta wouldn't work like this (weeks=BpShift.objects.get(pk=df['BpCode']).weeks).
Dataframe
d = {'BpCode':['customer1','customer2'],'DueDate':['2020-05-30','2020-04-30']}
df = pd.DataFrame(data=d)
Customer List csv
BP Name,BP Code,Week(s)
Customer1,CA0023MY,1
Customer2,CA0064SG,1
Error
BpShift matching query does not exist.
Commentary
I used these methods in the hope that I would be able to change the DataFrame all at once, instead of using df.iterrows(). I have recently been avoiding for loops like the plague and am wondering if this is the "correct" mentality. Is there any recommended way of doing this? Thanks in advance for any guidance!
The question Python & Pandas: series to timedelta will help take you from Series to timedelta. And although
pandas.Series(
    BpShift.objects.filter(
        pk__in=df['BpCode'].tolist()
    ).values_list('weeks', flat=True)
)
will give you a Series of integers, I doubt the order is the same as in df['BpCode'], because that depends on the Django model and database backend.
So you might be better off explicitly creating not a Series but a DataFrame with pk and weeks columns, so you can use df.join. Something like this:
pandas.DataFrame(
    BpShift.objects.filter(
        pk__in=df['BpCode'].tolist()
    ).values_list('pk', 'weeks'),
    columns=['BpCode', 'weeks'],
)
should give you a DataFrame that you can join with.
So combined this should be the gist of your code:
django_response = [('customer1', 1), ('customer2', '2')]
d = {'BpCode':['customer1','customer2'],'DueDate':['2020-05-30','2020-04-30']}
df = pd.DataFrame(data=d).set_index('BpCode').join(
    pd.DataFrame(django_response, columns=['BpCode', 'weeks']).set_index('BpCode')
)
df['DueDate'] = pd.to_datetime(df['DueDate'])
df['weeks'] = pd.to_numeric(df['weeks'])
df['new_duedate'] = df['DueDate'] + df['weeks'] * pd.Timedelta('1W')
print(df)
DueDate weeks new_duedate
BpCode
customer1 2020-05-30 1 2020-06-06
customer2 2020-04-30 2 2020-05-14
You were right to want to avoid looping. This approach gets all the data in one SQL query from your Django model by using filter, then does a left join with the DataFrame you already have, casts the dates and weeks to the right types, and computes the new due date using whole columns instead of looping over them.
NB: the left join will give NaN and NaT for customers that don't exist in your Django database. You can either drop those rows by passing how='inner' to df.join, or handle them however you like.
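For example, to keep only customers that exist on both sides, a small variation of the join above:
# inner join drops BpCodes that are missing from the Django data
df = pd.DataFrame(data=d).set_index('BpCode').join(
    pd.DataFrame(django_response, columns=['BpCode', 'weeks']).set_index('BpCode'),
    how='inner',
)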
We all know that if we retrieve data from the database it comes back as a QuerySet, but the question is: how can I retrieve that data without the QuerySet name wrapped around it?
Maybe I am not explaining it clearly enough, so look at the next example to understand what I mean:
AnyObjects.objects.all().values()
This line will give back the data like so:
<QuerySet [{'key': 'value'}]>
Now you can see the name on the left side of the retrieved data: "QuerySet". I need to remove that name so the data looks as follows:
[{'key': 'value'}]
If you wonder why: the short answer is that I want to use a pandas DataFrame, so to pass the data to the DataFrame constructor I thought I should use that layout.
Any help please!
You don't have to change it from a QuerySet to anything else; pandas.DataFrame can take any iterable as data. So
df = pandas.DataFrame(djangoapp.models.Model.objects.all().values())
gives you the DataFrame you expect. (Though you may want to double-check df.dtypes: if there are Nones in your data, a column may end up with object dtype.)
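For instance, a small sketch of that check (the column name amount is hypothetical):
import pandas

df = pandas.DataFrame(djangoapp.models.Model.objects.all().values())
print(df.dtypes)

# if a numeric column came through as object because of Nones,
# coerce it explicitly; errors='coerce' turns bad values into NaN
df['amount'] = pandas.to_numeric(df['amount'], errors='coerce')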
You can use list(…) to convert it to a list of dictionaries:
list(AnyObjects.objects.values())
If you need a JSON blob, you will need to serialize it with the json package, since strings with single quotes are not valid JSON:
import json
json.dumps(list(AnyObjects.objects.values()))
I am very new to Django, but facing quite a daunting task already.
I need to create multiple forms like this on the webpage, where the user would provide input (only floating-point numbers allowed), and then convert these inputs to a pandas DataFrame for data analysis. I would highly appreciate it if you could advise how I should go about doing this.
Form needed:
This is a very broad question, and I am assuming you are familiar with pandas and Python. There might be a more efficient way, but this is how I would do it. It should not be that difficult: have the user submit the form, then import pandas in your view and create an initial DataFrame. You can then get the form data using something like this:
if form.is_valid():
    field1 = form.cleaned_data['field1']
    field2 = form.cleaned_data['field2']
    field3 = form.cleaned_data['field3']
    field4 = form.cleaned_data['field4']
You can then create a new DataFrame like so:
df2 = pd.DataFrame([[field1, field2], [field3, field4]], columns=list('AB'))
then append the second DataFrame to the first like so (note that append returns a new DataFrame rather than modifying df in place, so reassign the result):
df = df.append(df2)
Keep iterating over the data in this fashion until you have added all of it. After it has all been appended you can do your analysis and whatever else you like. Note that you can append more than 2-by-2 at a time; that is just an example. A combined sketch follows the links below.
Pandas append docs:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.append.html
Django forms docs:
https://docs.djangoproject.com/en/2.0/topics/forms/
The docs are your friend.
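Putting it together, a minimal sketch of such a view. The form class MyFloatForm and the field names are hypothetical, and on pandas 2.x DataFrame.append is gone, so pd.concat is the future-proof spelling:
import pandas as pd
from django.http import HttpResponse

def my_view(request):
    # MyFloatForm is a hypothetical form with four FloatFields
    form = MyFloatForm(request.POST)
    df = pd.DataFrame(columns=list('AB'))  # the initial, empty frame
    if form.is_valid():
        field1 = form.cleaned_data['field1']
        field2 = form.cleaned_data['field2']
        field3 = form.cleaned_data['field3']
        field4 = form.cleaned_data['field4']
        df2 = pd.DataFrame([[field1, field2], [field3, field4]],
                           columns=list('AB'))
        # concat and reassign instead of relying on append mutating df
        df = pd.concat([df, df2], ignore_index=True)
    # ... run your analysis on df here, then return a response
    return HttpResponse(df.to_html())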
I am pretty new to machine learning in general and scikit-learn in specific.
I am trying to use the example given on the site http://scikit-learn.org/stable/tutorial/basic/tutorial.html
For practicing on my own, I am using my own data-set. My data set is divided into two different CSV files:
Train_data.csv (Contains 32 columns, the last column is the output value).
Test_data.csv (Contains 31 columns the output column is missing - Which should be the case, no?)
So the test data has one column less than the training data.
I am using the following code to learn (using training data) and then predict (using test data).
The issue I am facing is the error:
ValueError: X.shape[1] = 31 should be equal to 29, the number of features at training time
Here is my code (sorry if it looks completely wrong :( )
import pandas as pd #import the library
from sklearn import svm
mydata = pd.read_csv("Train - Copy.csv") #I read my training data set
target = mydata["Desired"] #my csv has header row, and the output label column is named "Desired"
data = mydata.ix[:,:-3] #select all but the last column as data
clf = svm.SVC(gamma=0.001, C=100.) #Code from the URL above
clf.fit(data,target) #Code from the URL above
test_data = pd.read_csv("test.csv") #I read my test data set. Without the output column
clf.predict(test_data[-1:]) #Code from the URL above
The training data csv labels looks something like this:
Value1,Value2,Value3,Value4,Output
The test data csv labels looks something like this:
Value1,Value2,Value3,Value4
Thanks :)
Your problem is a supervised learning problem: you have data in the form of (input, output) pairs.
The input is the features describing your example, and the output is the prediction your model should give for that input.
In your training data you'll have one more attribute in your CSV file, because in order to train your model you need to give it the output.
The general workflow in sklearn with a Supervised Problem should look like this
X, Y = read_data(data)
n = len(X)
X_train, X_test = X[:int(n*0.8)], X[int(n*0.8):]
Y_train, Y_test = Y[:int(n*0.8)], Y[int(n*0.8):]
model.fit(X_train,Y_train)
model.score(X_test, Y_test)
To split your data, you can use train_test_split and you can use several metrics in order to judge your model's performance.
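For example, a sketch using those helpers (X, Y and model as in the snippet above; train_test_split lives in sklearn.model_selection on recent versions):
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# hold out 20% of the rows for evaluation, shuffled reproducibly
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=0)

model.fit(X_train, Y_train)
print(accuracy_score(Y_test, model.predict(X_test)))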
You should check the shape of your data:
data.shape
It seems you are dropping the last 3 columns instead of only the last one. Try instead:
data = mydata.ix[:,:-1]
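As a quick sanity check, a sketch continuing the question's snippet (.iloc is the modern replacement for the long-deprecated .ix):
data = mydata.iloc[:, :-1]          # every column except the last ('Desired')
print(data.shape, test_data.shape)  # the feature counts must now match (31 each)

clf.fit(data, target)
predictions = clf.predict(test_data)  # predict on all test rows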