Data format and pandas

Data format and pandas - python-2.7

I am using pandas to format things nicely in a tabular format:
import pandas as pd
data = []
for i in range(start, end_value):
    data.append([i, value])
    # modify value in some way
print pd.DataFrame(data)
gives me
    0             1
0  38  2.500000e+05
1  39  2.700000e+05
2  40  2.916000e+05
3  41  3.149280e+05
How can I modify this to remove the scientific notation and, for extra points, add a thousands separator?

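Apply a format string to the column once the data is in a DataFrame. The ',' in the format spec adds the thousands separator and '.2f' gives fixed-point output with two decimals:
df = pd.DataFrame(data)
df['column_name'] = df['column_name'].apply('{0:,.2f}'.format)
Note that this converts the column to strings. Thanks to John Galt's previous SO answer.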
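For illustration, here is a minimal sketch of that formatting applied to the four rows from the question. The column names step and value are made up for this example and are not from the original post.

```python
import pandas as pd

# The four rows shown in the question, with hypothetical column names.
data = [[38, 250000.0], [39, 270000.0], [40, 291600.0], [41, 314928.0]]
df = pd.DataFrame(data, columns=["step", "value"])

# ',' adds the thousands separator, '.2f' forces fixed-point with 2 decimals.
# The column now holds strings, so do this only for display.
df["value"] = df["value"].apply("{0:,.2f}".format)
print(df)
```

Because the formatted column is text, it is probably best to keep the numeric column around if further arithmetic is needed.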

Related

Reshaping Pandas data frame (a complex case!)

I want to reshape the following data frame:
index  id  numbers
 1111   5    58.99
 2222   5    75.65
 1000   4    66.54
   11   4    60.33
  143   4    62.31
  145  51    30.2
    1   7    61.28
The reshaped data frame should be like the following:
id      1      2      3
 5  58.99  75.65    nan
 4  66.54  60.33  62.31
51  30.2     nan    nan
 7  61.28    nan    nan
I use the following code to do this.
import pandas as pd
dtFrame = pd.read_csv("data.csv")
ids = dtFrame['id'].unique()
temp = dtFrame.groupby(['id'])
temp2 = {}
for i in ids:
    # one Series of 'numbers' per id, re-indexed from 0
    temp2[i] = temp.get_group(i).reset_index()['numbers']
dtFrame = pd.DataFrame.from_dict(temp2)
dtFrame = dtFrame.T
Although the above code solves my problem, is there a simpler way to achieve this? I tried a pivot table, but it does not solve the problem, perhaps because it requires the same number of elements in each group. Or maybe there is another way that I am not aware of; please share your thoughts.

In [69]: df.groupby(df['id'])['numbers'].apply(lambda x: pd.Series(x.values)).unstack()
Out[69]:
        0      1      2
id
4   66.54  60.33  62.31
5   58.99  75.65    NaN
7   61.28    NaN    NaN
51  30.20    NaN    NaN
This is really quite similar to what you are doing except that the loop is replaced by apply. The pd.Series(x.values) has an index which by default ranges over integers starting at 0. The index values become the column names (above). It doesn't matter that the various groups may have different lengths. The apply method aligns the various indices for you (and fills missing values with NaN). What a convenience!
I learned this trick here.
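A different route to the same shape, not the one used in the answer above, is to number the rows within each group with cumcount and then pivot on that position. This is only a sketch; the helper column name pos is invented for the example, and the data is typed in from the question rather than read from data.csv.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [5, 5, 4, 4, 4, 51, 7],
    "numbers": [58.99, 75.65, 66.54, 60.33, 62.31, 30.2, 61.28],
})

# Position of each row inside its id group: 0, 1, 2, ...
df["pos"] = df.groupby("id").cumcount()

# One row per id, one column per position; short groups are padded with NaN.
wide = df.pivot(index="id", columns="pos", values="numbers")
print(wide)
```

The pivot works here because each (id, pos) pair is unique by construction, which is what a plain pivot on the original columns was missing.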

csv parsing and manipulation using python

I have a CSV file which I need to parse using Python.
triggerid,timestamp,hw0,hw1,hw2,hw3
1,234,343,434,78,56
2,454,22,90,44,76
I need to read the file line by line and slice out the triggerid, timestamp and hw3 columns. But the column sequence may change from run to run, so I need to match the field name, count the column and then write the output file as:
triggerid,timestamp,hw3
1,234,56
2,454,76
Also, is there a way to generate a hash table (like we have in Perl) so that I can store the entire column for hw0 (hw0 as key and the values in the column as values) for other modifications?

I'm unsure what you mean by "count the column".
An easy way to read the data in would use pandas, which was designed for just this sort of manipulation. This creates a pandas DataFrame from your data using the first row as titles.
In [374]: import pandas as pd
In [375]: d = pd.read_csv("30735293.csv")
In [376]: d
Out[376]:
   triggerid  timestamp  hw0  hw1  hw2  hw3
0          1        234  343  434   78   56
1          2        454   22   90   44   76
You can select one of the columns using a single column name, and multiple columns using a list of names:
In [377]: d[["triggerid", "timestamp", "hw3"]]
Out[377]:
   triggerid  timestamp  hw3
0          1        234   56
1          2        454   76
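Since the question also asks for the selected columns to be written out, here is a small sketch of that step using to_csv. The input is built from an in-memory string so the example runs on its own, and the output file name mentioned in the comment is a placeholder.

```python
import io
import pandas as pd

raw = (
    u"triggerid,timestamp,hw0,hw1,hw2,hw3\n"
    u"1,234,343,434,78,56\n"
    u"2,454,22,90,44,76\n"
)
d = pd.read_csv(io.StringIO(raw))

# Columns are picked by name, so their order in the input does not matter.
wanted = ["triggerid", "timestamp", "hw3"]

# With no path argument, to_csv returns the CSV text; index=False drops 0, 1, ...
csv_text = d[wanted].to_csv(index=False)
print(csv_text)

# To write a file instead (placeholder name):
# d[wanted].to_csv("out.csv", index=False)
```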
You can also adjust the indexing so that one or more of the data columns are used as index values:
In [378]: d1 = d.set_index("hw0"); d1
Out[378]:
     triggerid  timestamp  hw1  hw2  hw3
hw0
343          1        234  434   78   56
22           2        454   90   44   76
Using the .loc attribute you can retrieve a series for any indexed row:
In [390]: d1.loc[343]
Out[390]:
triggerid 1
timestamp 234
hw1 434
hw2 78
hw3 56
Name: 343, dtype: int64
You can use the column names to retrieve the individual column values from that one-row series:
In [393]: d1.loc[343]["triggerid"]
Out[393]: 1
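For the hash-table part of the question, one possibility with pandas is to_dict with orient="list", which gives a plain dict mapping each column name to the list of its values. This is a sketch on the sample data, again read from an in-memory string so that it is self-contained.

```python
import io
import pandas as pd

raw = (
    u"triggerid,timestamp,hw0,hw1,hw2,hw3\n"
    u"1,234,343,434,78,56\n"
    u"2,454,22,90,44,76\n"
)
d = pd.read_csv(io.StringIO(raw))

# {column name: [values in that column]}, similar to a Perl hash of arrays.
columns = d.to_dict(orient="list")
hw0_values = columns["hw0"]
print(hw0_values)
```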

Since you already have a solution for the slices here's something for the hash table part of the question:
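import csv
with open('/path/to/file.csv', 'rb') as fin:  # 'rb' is the Python 2 mode for csv files
    ht = {}
    cr = csv.reader(fin)
    k = next(cr)[2]  # header of the third column, used as the key
    ht[k] = list()
    for line in cr:
        ht[k].append(line[2])  # collect that column's value from each row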
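The snippet above picks the column by position. Because the question says the column order may change between runs, a variant that keys on the header names may be safer. The sketch below uses csv.DictReader from the standard library and reads from an in-memory string instead of a real file; all values stay strings.

```python
import csv
import io
from collections import defaultdict

raw = (
    u"triggerid,timestamp,hw0,hw1,hw2,hw3\n"
    u"1,234,343,434,78,56\n"
    u"2,454,22,90,44,76\n"
)

# Map every header name to the list of values found in that column.
columns = defaultdict(list)
for row in csv.DictReader(io.StringIO(raw)):
    for name, value in row.items():
        columns[name].append(value)

print(columns["hw0"])
```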

I used a different approach (using the .index function):
bpt_mode = ["bpt_mode_64", "bpt_mode_128"]
with open('StripValues.csv') as file:
    # read the header row; stats is assumed to hold the field names
    stats = next(file).strip().split(",")
    draw_id = stats.index('trigger_id')
    for line in file:
        stat_values = line.strip().split(",")
        print stat_values[draw_id], ',',
        for j in range(len(bpt_mode)):
            print stat_values[stats.index('hw.gpu.s0.ss0.dg.' + bpt_mode[j])], ',',
# no explicit file.close() is needed: the with block closes the file
@holdenweb I am unable to figure out how to print the output to a file, though. Currently I am redirecting the output when running the script.
Can you provide a solution for writing to a file? There will be multiple writes to a single file.
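One way to write the selected fields to a file, rather than redirecting print output, is csv.writer with the output file kept open for the whole loop. This is a sketch in Python 3 syntax; the header, the two data rows and the output location are invented for illustration, and on Python 2 the file would be opened with mode "wb" instead of passing newline="".

```python
import csv
import os
import tempfile

# Hypothetical header and rows standing in for StripValues.csv.
header = ["trigger_id", "hw.gpu.s0.ss0.dg.bpt_mode_64",
          "hw.gpu.s0.ss0.dg.bpt_mode_128", "other"]
rows = [["1", "10", "20", "x"], ["2", "30", "40", "y"]]

wanted = ["trigger_id", "hw.gpu.s0.ss0.dg.bpt_mode_64",
          "hw.gpu.s0.ss0.dg.bpt_mode_128"]
positions = [header.index(name) for name in wanted]  # look up columns by name

out_path = os.path.join(tempfile.mkdtemp(), "out.csv")  # placeholder location
with open(out_path, "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(wanted)
    for row in rows:
        # each call appends one line to the same open file
        writer.writerow([row[i] for i in positions])
```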

python pandas replacing column values conditional on string patterns and using split()

Long-time lurker here. I'm finally stuck on a project involving pandas, and more than ever I need your help.
I have a dataframe like the following. Each row describes one retirement formula, which may have more than one criterion (hence e1):
index  e0      e1
1      62/10   NaN
2      age 55  NaN
3      67/10   age 70
I want to make a column age that describes the minimum age. I've defined patterns for how each criterion is described. For example,
pattern1=r'.*/.*'
pattern7=r'age.[0-9].*'
and I have pattern1-pattern7.
I used the following code to extract age portion of e0 to a new column age:
df['age']=df['e0'][(df['e0'].str.match(pattern1)==1)].apply(lambda x: str(x).split('/')[0])
which gives me
index  e0      e1      age
1      62/10   NaN     62
2      age 55  NaN     NaN
3      67/10   age 70  67
I want to address other formats such as "age 55" (to extract 55, in this case), but I'm not sure how to go about it. If I do
df['age']=df['e0'][(df['e0'].str.match(pattern7)==1)].apply(lambda x: str(x).split(' ')[1])
then it's clearly wrong because I'd overwrite what's already in age and I get
index  e0      e1      age
1      62/10   NaN     NaN
2      age 55  NaN     55
3      67/10   age 70  NaN
I've tried other variations as far as the syntax would allow me but to no avail.
I'm a Stata user and in Stata, I'd be using replace command conditional on regexm. I'm trying to learn Python and it's been a difficult journey! I'd appreciate any help on this.
I have another (hopefully) quick question in addition: I've used the following two lines to get rid of white space in both e0 and e1.
option['e0']=option['e0'].str.strip()
option['e1']=option['e1'].str.strip()
Is there a way to address them both in one line?
Thanks a lot in advance.

This is a response to your second question (you should stick to one question per post). Apply the strip to both columns at once and assign the result back:
df[['e0', 'e1']] = df[['e0', 'e1']].apply(lambda x: x.str.strip())
I'm not sure why you are calling the DataFrame 'option' when it was previously referred to as 'df', so I stuck with the latter.
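A small runnable sketch of stripping both columns in one statement follows. The padded values are made up to mimic the question's data, and missing entries are left as NaN by the string methods.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "e0": [" 62/10", "age 55 ", " 67/10 "],
    "e1": [np.nan, np.nan, " age 70 "],
})

# apply runs the lambda once per column; .str.strip() skips NaN entries.
df[["e0", "e1"]] = df[["e0", "e1"]].apply(lambda col: col.str.strip())
print(df)
```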

Interesting problem, here I pass a function that removes the NaN values and then calls sum which will concatenate the rows of data.
We can then call the vectorised str method findall with regex \d+ which returns all numbers as a list.
We then apply another function to this that converts the str numbers to ints, puts these in a list and returns the smallest value:
In [37]:
def func(x):
    return x.dropna().sum()

def lowest(x):
    return min(list(map(int, x)))

df['min'] = df[['e0','e1']].apply(lambda x: func(x), axis=1).str.findall(r'\d+').apply(lowest)
df
Out[37]:
   index      e0      e1  min
0      1   62/10     NaN   10
1      2  age 55     NaN   55
2      3   67/10  age 70   10
Breaking the above down so you can see what is happening:
In [38]:
df[['e0','e1']].apply(lambda x: func(x), axis=1)
Out[38]:
0 62/10
1 age 55
2 67/10age 70
dtype: object
In [39]:
df[['e0','e1']].apply(lambda x: func(x), axis=1).str.findall(r'\d+')
Out[39]:
0 [62, 10]
1 [55]
2 [67, 10, 70]
dtype: object
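Note that the approach above takes the minimum over every number in the row, so 62/10 yields 10 rather than 62. If the goal is instead to fill age one pattern at a time without overwriting earlier matches, as the question describes, boolean-mask assignment with .loc is one option, roughly the counterpart of Stata's replace ... if. The sketch below assumes only pattern1 and pattern7 from the question and handles just the e0 column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "e0": ["62/10", "age 55", "67/10"],
    "e1": [np.nan, np.nan, "age 70"],
})
pattern1 = r".*/.*"
pattern7 = r"age.[0-9].*"

df["age"] = np.nan  # start empty, then fill pattern by pattern

# Rows like "62/10": take the part before the slash.
mask1 = df["e0"].str.match(pattern1, na=False)
df.loc[mask1, "age"] = df.loc[mask1, "e0"].str.split("/").str[0].astype(float)

# Rows like "age 55": take the part after the space.
mask7 = df["e0"].str.match(pattern7, na=False)
df.loc[mask7, "age"] = df.loc[mask7, "e0"].str.split(" ").str[1].astype(float)

print(df)
```

Each assignment touches only the rows selected by its mask, so values set by an earlier pattern are kept.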

How to append a new column to my Pandas DataFrame based on a row-based calculation?

Let's say I have a Pandas DataFrame with two columns: 1) user_id, 2) steps (which contains the number of steps on the given date). Now I want to calculate the difference between the number of steps and the number of steps in the preceding measurement (measurements are guaranteed to be in order within my DataFrame).
So basically this comes down to appending an extra column to my DataFrame where the row values of this data frame match the value of the column 'steps' within this same row, minus the value of the 'steps' column in the row above (or 0 if this is the first row). To complicate things further, I want to calculate these differences per user_id, so I want to make sure that I do not subtract the steps values of two rows with different user_id's.
Does anyone have an idea how to get this done with Python 2.7 and pandas?
So an example to illustrate this.
Example input:
user_id steps
1015 48
1015 23
1015 79
1016 10
1016 20
Desired output:
user_id steps d_steps
1015 48 0
1015 23 -25
1015 79 56
2023 10 0
2023 20 10

Your output shows user ids that are not in your original data, but the following does what you want. You will have to fill the NaN values with 0:
In [16]:
df['d_steps'] = df.groupby('user_id')['steps'].transform('diff')
df.fillna(0, inplace=True)
df
Out[16]:
user_id steps d_steps
0 1015 48 0
1 1015 23 -25
2 1015 79 56
3 1016 10 0
4 1016 20 10
Here we generate the desired column by calling transform on the groupby object and passing a string that maps to the diff method, which subtracts the previous row's value from the current one. transform applies a function and returns a Series with an index aligned to the df.
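As a self-contained sketch of the same idea on the example input, the version below calls diff directly on the grouped steps column, fills the first row of each user with 0 and casts back to integers. Calling diff on the group is equivalent in effect to transform('diff') for this case.

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [1015, 1015, 1015, 1016, 1016],
    "steps": [48, 23, 79, 10, 20],
})

# diff() is computed within each user_id, so rows of different users are never
# subtracted from each other; the first row of each user is NaN and becomes 0.
df["d_steps"] = df.groupby("user_id")["steps"].diff().fillna(0).astype(int)
print(df)
```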

Pandas dataframe applying NA to part of the data

Let me preface this by saying I am new to pandas, so I'm sorry if this question is basic or has been answered before. I looked online and couldn't find what I needed.
I have a dataframe that consists of a baseball team's schedule. Some of the games have been played already, so their results are entered in the dataframe. However, for games that are yet to happen, there is only the time they are to be played (e.g. 1:35 pm).
So, I would like to convert all of the values for the games yet to happen into NaNs.
Thank you
As requested here is what the results dataframe for the Arizona Diamondbacks contains
print MLB['ARI']
0 0
1 0
2 0
3 1
4 0
5 0
6 0
7 0
8 1
9 0
10 1
...
151 3:40 pm
152 8:40 pm
153 8:10 pm
154 4:10 pm
155 4:10 pm
156 8:10 pm
157 8:10 pm
158 1:10 pm
159 9:40 pm
160 8:10 pm
161 4:10 pm
Name: ARI, Length: 162, dtype: object

I couldn't figure out any direct solution, only an iterative one:
for i in xrange(len(MLB)):
    if 'pm' in MLB['ARI'].iat[i] or 'am' in MLB['ARI'].iat[i]:
        MLB['ARI'].iat[i] = np.nan
This should work if your actual values (1s and 0s) are also strings. If they are numbers, try:
for i in xrange(len(MLB)):
    if type(MLB['ARI'].iat[i]) != type(1):
        MLB['ARI'].iat[i] = np.nan

The more idiomatic way to do this would be with the vectorised string methods.
http://pandas.pydata.org/pandas-docs/stable/basics.html#vectorized-string-methods
mask = MLB['ARI'].str.contains('pm')  # create boolean array
MLB['ARI'][mask] = np.nan  # the column name goes first
Create the boolean array from the string test and then use it to select the data you want.
Make sure that the column name goes before the masking array, otherwise you'll be acting on a copy of the data and your original dataframe won't get updated.
MLB['ARI'][mask]  # returns a view on the MLB dataframe, will be updated
MLB[mask]['ARI']  # returns a copy of MLB, won't be updated
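The same masking can also be written as a single .loc assignment, which avoids relying on view-versus-copy behaviour. A minimal sketch, assuming the results are stored as strings; the five-row frame is made up for illustration. Passing na=False makes non-string entries count as "no match" instead of producing NaN in the mask.

```python
import numpy as np
import pandas as pd

MLB = pd.DataFrame({"ARI": ["0", "1", "0", "3:40 pm", "8:10 pm"]})

# True where the entry looks like a start time ("am" or "pm" anywhere in it).
mask = MLB["ARI"].str.contains("am|pm", na=False)

# Row selector and column name go into one .loc call.
MLB.loc[mask, "ARI"] = np.nan
print(MLB)
```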
