Replace pandas dataframe column values with a unique id - python-2.7

I have a dataframe with millions of entries, and one of its columns is 'TYPE' (string). There are 400 distinct values for this column, and I want to replace them with integer ids from 1 to 400. I also want to export this 'TYPE' => id dictionary for future reference. I tried to_dict but it did not help. Any way to do this?

Option 1: you can use pd.factorize:
df['new'] = pd.factorize(df['str_col'])[0]+1
Option 2: using category dtype:
df['new'] = df['str_col'].astype('category').cat.codes+1
or even better just convert it to categorical dtype:
df['str_col'] = df['str_col'].astype('category')
and when you need to use numbers instead just use category codes:
df['str_col'].cat.codes
thanks to @jezrael for extending the answer - for creating the dictionary:
cats = df['str_col'].cat.categories
d = dict(zip(cats, range(1, len(cats) + 1)))
P.S. the category dtype is also very memory-efficient.
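For reference, a rough end-to-end sketch of the factorize route plus exporting the mapping (the 'TYPE' column name comes from the question; the toy data is illustrative):
import pandas as pd

df = pd.DataFrame({'TYPE': ['alpha', 'beta', 'alpha', 'gamma']})  # toy data

# factorize returns (codes, uniques); shift the codes so ids start at 1
codes, uniques = pd.factorize(df['TYPE'])
df['TYPE_id'] = codes + 1

# mapping TYPE -> id, ready to dump to JSON/CSV for future reference
type_to_id = {t: i for i, t in enumerate(uniques, start=1)}
print(type_to_id)  # {'alpha': 1, 'beta': 2, 'gamma': 3}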

Related

How to get a custom column from a MySQL table through Django

I would like to get a specific column from a DB table based on input from the user.
db :
animal weight height
cat 40 20
wolf 100 50
First I need to get which animal the user wants:
input1='cat'
and then the information about input1, like weight or height:
input2='weight'
animalwho=Wildlife.objects.get(animal=input1)
So if I use animalwho.weight it gives me 40.
But I want to get the column based on input2, as input2 might be height or any other column.
I tried animalwho.input2 but it does not work.
Is it possible to get a column based on input2?
Would appreciate any help.
The easiest solution would be to convert your object to a dict, or get a dict directly using values().
There are many options for converting an object to a dict; see here: link
Then you can easily do:
animalwho_dict=Wildlife.objects.filter(animal=input1).values()[0]
input2_value = animalwho_dict[input2]
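Putting it together, a minimal sketch (assuming the Wildlife model from the question is importable; the values are the sample data):
input1 = 'cat'      # which animal the user asked for
input2 = 'weight'   # which column the user asked for

# values() returns plain dicts instead of model instances,
# so the user-chosen column can be looked up by key
animalwho_dict = Wildlife.objects.filter(animal=input1).values()[0]
input2_value = animalwho_dict[input2]  # 40 for the sample data
Another option, if you already have the instance from objects.get, is getattr(animalwho, input2).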

pandas dataframe filter a column with a keyword based on the aggregation of another column

Imagine I have the following dataframe df:
Contract_Id, date, product, qty
1,2016-08-06,a,1
1,2016-08-06,b,2
1,2017-08-06,c,2
2,2016-08-06,a,1
3,2016-08-06,a,2
3,2017-08-06,a,2
4,2016-08-06,b,2
4,2017-09-06,a,2
I am trying to find out whether each contract id has product b or product a and return 2 columns.
Ideal output:
Contract_Id, date, product, qty, contract_id_has_a, contract_id_has_b
1,2016-08-06,a,1,True,True
1,2016-08-06,b,2,True,True
2,2016-08-06,a,1,True,False
3,2016-08-06,a,2,True,False
4,2016-08-06,b,2,False,True
This will only return whether this row has product a or not
df['product'].str.contains('a', flags=re.IGNORECASE, regex=True)
I tried:
import re
df['product'].groupby(['Contract_Id']).str.contains('a', flags=re.IGNORECASE, regex=True)
KeyError: 'Contract_Id'
Could anyone enlighten? Thanks!
In order to perform grouping but return values for all original rows (and not just one value per group), you should use transform on the groupby object. Then you can check whether any row in the group matches, and set that value for all of the group's rows.
This would work:
df['contract_id_has_a'] = df.groupby('Contract_Id')['product'].transform(lambda x: x.str.contains('a').any())
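A sketch extending the same idea to both flag columns (assuming df is the frame from the question; case=False is just a shorter way to ignore case than the re flags above):
g = df.groupby('Contract_Id')['product']

# each transform returns one boolean per group, broadcast back to every row of that group
df['contract_id_has_a'] = g.transform(lambda s: s.str.contains('a', case=False).any())
df['contract_id_has_b'] = g.transform(lambda s: s.str.contains('b', case=False).any())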

Include 0 in an annotated count if not found in Django QuerySet?

I don't know how to phrase my question correctly, but what I am doing is querying a table by a certain id and then getting the count of each unique value in a column; that count I pass to matplotlib to plot. The problem I am facing is that if a certain value is not found, the label list no longer matches the count list in size and the plot fails. Below is my query:
workshop_feedback = WorkshopFeedbackResponse.objects.filter(workshop_conducted=workshop_id)
count = workshop_feedback.values('q1').annotate(count=Count('q1'))
sizes = [item['count'] for item in count]  # convert to a list for the matplotlib chart
labels = [1, 2, 3]  # the possible choices for q1
fig, ax = plt.subplots()
ax.pie(sizes, labels=labels, autopct='%1.1f%%',
       shadow=True, startangle=90)
ax.axis('equal')  # equal aspect ratio ensures the pie is drawn as a circle
plt.savefig(os.path.join(settings.BASE_DIR, filename))  # filename is a random image filename for the saved graph
My column name is q1 and it can contain 1, 2 or 3 depending on what the user filled in, but if one of the options never occurs, the count doesn't reflect that.
Exception Value:
'label' must be of length 'x'
Above is the error I get when the count of a certain value is missing. So how do I get a count of 0 when a certain label value is missing?
Edit: if I could query by label value to get its count in a column, I could fix my error.
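One way to keep sizes and labels the same length regardless of missing values (a sketch on the Python side, reusing the count queryset from above):
labels = [1, 2, 3]  # the possible choices for q1

# map each q1 value that actually occurred to its count
count_by_q1 = {item['q1']: item['count'] for item in count}

# fall back to 0 for any label with no rows, so len(sizes) always equals len(labels)
sizes = [count_by_q1.get(label, 0) for label in labels]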

Pandas extracting substring

I have a column containing dates, which could look like 2017-10-12. I want to create a new column containing the day, which in my case would be the number between the two -'s. I've tried various .str.extract() queries, but I can't seem to get it right.
df['days'] = df['dates'].str.extract('(-*)')
Any hints?
Use split and select the second item with str[1]:
df['days'] = df['dates'].str.split('-').str[1]
Or to_datetime with format parameter + dt.day:
df['days'] = pd.to_datetime(df['dates'], format='%Y-%d-%m').dt.day
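For illustration, on a tiny frame (column name 'dates' as in the question; the sample dates assume the year-day-month order implied there):
import pandas as pd

df = pd.DataFrame({'dates': ['2017-10-12', '2016-03-07']})

df['days'] = df['dates'].str.split('-').str[1]                          # '10', '03' (strings)
df['days_dt'] = pd.to_datetime(df['dates'], format='%Y-%d-%m').dt.day   # 10, 3 (ints)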

How to create python dictionary from cassandra tables values?

I'm building an application in which I want to load a Cassandra table into a Python dictionary.
The table looks like this:
startup_id | startup_name
-----------+---------------------
1 | name
2 | gender
3 | address
4 | pincode
5 | phoneno
I want to load this table into a Python dictionary.
Expected Output-
{'name': '1' , 'gender': '2', 'address': '3', 'pincode': '4', 'phoneno': '5'}
NOTE - the startup_name column is unique, so we can use it as the key in the Python dictionary.
Currently I'm doing it this way:
main_dict = {}
rows = session.execute('SELECT startup_id,startup_name FROM startup;')
for (id, name) in rows:
    main_dict[name] = id
Is there any simpler and faster way to do this? Any answer that improves performance will be appreciated.
Thanks in Advance!
Here is a link to the Cassandra documentation that has a code example of returning rows as dicts: https://datastax.github.io/python-driver/api/cassandra/query.html#cassandra.query.dict_factory
Basically, you change the row factory for the session to dict_factory.
from cassandra.query import dict_factory

session = cluster.connect(...)
session.row_factory = dict_factory
session.execute(...)
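For the original goal (a startup_name -> startup_id dict), a rough sketch with dict_factory (connection details and keyspace name are illustrative):
from cassandra.cluster import Cluster
from cassandra.query import dict_factory

cluster = Cluster()                       # add contact points as needed
session = cluster.connect('mykeyspace')   # keyspace name is illustrative
session.row_factory = dict_factory

rows = session.execute('SELECT startup_id, startup_name FROM startup;')
# each row is now a dict, so building the mapping is a one-liner
main_dict = {row['startup_name']: row['startup_id'] for row in rows}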
By the way, by default, the returned rows are already named tuples so you can use column names as object members, like this:
rows = session.execute("select peer,release_version,rack,data_center from system.peers;")
for row in rows:
    print(row.release_version, row.rack, row.data_center)