CSV parsing and manipulation using Python - python-2.7

I have a CSV file which I need to parse using Python:
triggerid,timestamp,hw0,hw1,hw2,hw3
1,234,343,434,78,56
2,454,22,90,44,76
I need to read the file line by line and slice out the triggerid, timestamp and hw3 columns. But the column sequence may change from run to run, so I need to match the field name, count the column, and then write the output file as:
triggerid,timestamp,hw3
1,234,56
2,454,76
Also, is there a way to generate a hash table (like we have in Perl) so that I can store the entire hw0 column (hw0 as the key and the values in the column as the values) for other modifications?

I'm unsure what you mean by "count the column".
An easy way to read the data in would use pandas, which was designed for just this sort of manipulation. This creates a pandas DataFrame from your data using the first row as titles.
In [374]: import pandas as pd
In [375]: d = pd.read_csv("30735293.csv")
In [376]: d
Out[376]:
   triggerid  timestamp  hw0  hw1  hw2  hw3
0          1        234  343  434   78   56
1          2        454   22   90   44   76
You can select one of the columns using a single column name, and multiple columns using a list of names:
In [377]: d[["triggerid", "timestamp", "hw3"]]
Out[377]:
   triggerid  timestamp  hw3
0          1        234   56
1          2        454   76
You can also adjust the indexing so that one or more of the data columns are used as index values:
In [378]: d1 = d.set_index("hw0"); d1
Out[378]:
     triggerid  timestamp  hw1  hw2  hw3
hw0
343          1        234  434   78   56
22           2        454   90   44   76
Using the .loc attribute you can retrieve a series for any indexed row:
In [390]: d1.loc[343]
Out[390]:
triggerid 1
timestamp 234
hw1 434
hw2 78
hw3 56
Name: 343, dtype: int64
You can use the column names to retrieve the individual column values from that one-row series:
In [393]: d1.loc[343]["triggerid"]
Out[393]: 1

Since you already have a solution for the slices, here's something for the hash-table part of the question:
import csv

with open('/path/to/file.csv', 'rb') as fin:  # 'rb' for the Python 2 csv module
    cr = csv.reader(fin)
    k = cr.next()[2]           # third header field, e.g. 'hw0'
    ht = {k: list()}
    for line in cr:
        ht[k].append(line[2])  # collect that column's values under the key
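Because the question says the column order can change between runs, it may be worth noting that csv.DictReader matches fields by header name, so the position no longer matters. A minimal sketch on the sample data (held in memory here so the example is self-contained; in practice you would open the real file):

```python
import csv
import io

# In-memory copy of the sample file from the question.
data = io.StringIO(
    "triggerid,timestamp,hw0,hw1,hw2,hw3\n"
    "1,234,343,434,78,56\n"
    "2,454,22,90,44,76\n"
)
rows = list(csv.DictReader(data))  # each row is a dict keyed by header name

# Slice the wanted columns regardless of their position in the file:
sliced = [(r["triggerid"], r["timestamp"], r["hw3"]) for r in rows]

# Perl-style hash: hw0 as the key, the whole column as the value.
ht = {"hw0": [r["hw0"] for r in rows]}
```

On Python 2.7 the file would be opened in 'rb' mode; DictReader itself is used the same way.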

I used a different approach (using the list .index function):
bpt_mode = ["bpt_mode_64", "bpt_mode_128"]
with open('StripValues.csv') as f:
    stats = next(f).strip().split(",")  # header row, used to look up column positions
    for line in f:
        stat_values = line.split(",")
        print stat_values[stats.index('trigger_id')], ',',
        for j in range(len(bpt_mode)):
            print stat_values[stats.index('hw.gpu.s0.ss0.dg.' + bpt_mode[j])], ',',
# the with-block closes the file, so file.close() is not needed
@holdenweb Though I am unable to figure out how to print the output to a file; currently I am redirecting while running the script.
Can you provide a solution for writing to a file? There will be multiple writes to a single file.
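Writing instead of redirecting only needs a csv.writer on a second file handle. A sketch under the assumption that the input looks like the sample above (the file names are placeholders, and the input is created inline so the example runs anywhere):

```python
import csv

# Create a small input file so the sketch is self-contained.
with open("input.csv", "w") as f:
    f.write("triggerid,timestamp,hw0,hw1,hw2,hw3\n"
            "1,234,343,434,78,56\n"
            "2,454,22,90,44,76\n")

wanted = ["triggerid", "timestamp", "hw3"]
with open("input.csv") as fin, open("output.csv", "w") as fout:
    reader = csv.reader(fin)
    writer = csv.writer(fout)
    header = next(reader)
    cols = [header.index(name) for name in wanted]  # positions looked up by name
    writer.writerow(wanted)
    for row in reader:
        writer.writerow([row[i] for i in cols])  # repeated writes, one open file
```

Keeping the output file open for the whole loop handles the "multiple writes to a single file" requirement; each writerow call appends one line.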

Related

Data format and pandas

I am using Pandas to format things nicely in a tabular format:
import pandas as pd

data = []
for i in range(start, end_value):  # start, end_value and value defined elsewhere
    data.append([i, value])
    # modify value in some way
print pd.DataFrame(data)
gives me
    0             1
0  38  2.500000e+05
1  39  2.700000e+05
2  40  2.916000e+05
3  41  3.149280e+05
How can I modify this to remove scientific notation and for extra points add thousands separator?
data['column_name'] = data['column_name'].apply('{0:,.2f}'.format)
thanks to John Galt's previous SO answer
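A runnable sketch of that idiom on made-up numbers (the column name and values are stand-ins for the asker's data), converting the float column to fixed-point strings with a thousands separator:

```python
import pandas as pd

# Hypothetical frame standing in for the asker's data.
df = pd.DataFrame({"value": [2.5e5, 2.7e5, 2.916e5]})
# Format each float: no scientific notation, two decimals, comma separator.
df["value"] = df["value"].apply('{0:,.2f}'.format)
```

Note the column now holds strings, so this is best done as a final display step, after any numeric work.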

Plot sub-sections of Pandas Dataframes in Python - all legend entries in one column

I have the following Pandas DataFrame in Python 2.7.
 Name        Date  Val_Celsius  Rm_Log
 Lite  2012-07-17           77      12
 Lite  2012-12-01           76     -21
 Lite  2012-09-01           79      73
 Lite  2013-12-01           78     945
Staed  2012-07-17          105      36
Staed  2012-12-01          104      19
Staed  2012-09-01          102     107
Staed  2013-12-01          104      11
ArtYr  2012-07-17          -11     100
ArtYr  2012-12-01          -14      21
ArtYr  2012-09-01          -10      68
ArtYr  2013-12-01          -12      83
I need to plot the Rm_Log numbers as the y-variable and I need to plot the Date as the x-variable.
However, I need there to be 3 separate overlapping plots on the same figure - 1st plot for Lite, 2nd for Staed and 3rd for ArtYr. I need the legend for the figure to show 3 entries, Lite, Staed and ArtYr.
I have never done a plot like this before. Usually, I have separate columns but here the numbers are arranged differently.
If I create 3 separate DataFrames, one per Name, then it is possible to plot. However, the Name column typically has many more entries than the 3 shown here, so this method is very time consuming. Also, the number of entries is not known ahead of time: here I have shown 3 entries (Lite, Staed and ArtYr), but there may be 50 or 100. I cannot create 50-100 DataFrames each time I need to generate one figure.
How can I show overlapping plots of Rm_Log vs Date, one per Name value, on the same figure? Is it possible to show the dates vertically on the x-axis?
EDIT:
Error I get when using ax.set_xticks(df.index):
  File "C:\Python27\lib\site-packages\matplotlib\axes\_base.py", line 2602, in set_xticks
    return self.xaxis.set_ticks(ticks, minor=minor)
  File "C:\Python27\lib\site-packages\matplotlib\axis.py", line 1574, in set_ticks
    self.set_view_interval(min(ticks), max(ticks))
  File "C:\Python27\lib\site-packages\matplotlib\axis.py", line 1885, in set_view_interval
    max(vmin, vmax, Vmax))
  File "C:\Python27\lib\site-packages\matplotlib\transforms.py", line 973, in _set_intervalx
    self._points[:, 0] = interval
ValueError: invalid literal for float(): 2012-07-17
If you don't want to use anything besides native pandas, you can still do this pretty easily:
df.reset_index().set_index(["Name", "Date"]).unstack("Name")["Rm_Log"].plot(rot=90)
First, reset_index and set_index build a MultiIndex of Name and Date; unstack("Name") then turns each entry in the Name column into its own column. Selecting the Rm_Log column and calling plot draws one line per name, and the argument rot=90 rotates the xticks. You could also split this over several lines, but I kept it as one to show how it can be done without modifying the DataFrame.
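The reshaping step can be seen in isolation on a tiny made-up frame (two names, two dates); only the pivoted result is shown here, since the plot call itself just draws one line per column:

```python
import pandas as pd

# Minimal stand-in for the asker's data.
df = pd.DataFrame({
    "Name": ["Lite", "Lite", "Staed", "Staed"],
    "Date": ["2012-07-17", "2012-12-01", "2012-07-17", "2012-12-01"],
    "Rm_Log": [12, -21, 36, 19],
})
# Pivot Name values into columns; each column becomes one line in the plot.
wide = df.set_index(["Date", "Name"])["Rm_Log"].unstack("Name")
# wide.plot(rot=90) would now draw all names on one figure with one legend.
```

This works for any number of distinct names, so there is no need to build one DataFrame per name.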
This is where ggplot, inspired by R, is wonderfully simple: you do not have to reshape your dataframe at all.
from ggplot import *
ggplot(df, aes(x='Date', y='Rm_Log', color='Name')) + geom_line()

How to append a new column to my Pandas DataFrame based on a row-based calculation?

Let's say I have a Pandas DataFrame with two columns: 1) user_id, and 2) steps (the number of steps on a given date). Now I want to calculate the difference between each steps value and the one in the preceding measurement (measurements are guaranteed to be in order within my DataFrame).
So basically this comes down to appending an extra column to my DataFrame, where each row's value equals the 'steps' value in that row minus the 'steps' value in the row above (or 0 if it is the first row). To complicate things further, I want to calculate these differences per user_id, so I must make sure I never subtract the steps values of two rows with different user_ids.
Does anyone have an idea how to get this done with Python 2.7 and Pandas?
So an example to illustrate this.
Example input:
user_id  steps
   1015     48
   1015     23
   1015     79
   1016     10
   1016     20
Desired output:
user_id  steps  d_steps
   1015     48        0
   1015     23      -25
   1015     79       56
   2023     10        0
   2023     20       10
Your desired output shows user ids (2023) that are not in your original data, but the following does what you want; you will have to replace/fill the NaN values with 0:
In [16]:
df['d_steps'] = df.groupby('user_id').transform('diff')
df.fillna(0, inplace=True)
df
Out[16]:
   user_id  steps  d_steps
0     1015     48        0
1     1015     23      -25
2     1015     79       56
3     1016     10        0
4     1016     20       10
Here we generate the desired column by calling transform on the groupby object and passing the string 'diff', which maps to the diff method that subtracts the previous row's value within each group. transform applies a function and returns a result with an index aligned to the df.
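A slightly more explicit variant is to select the steps column from the groupby before diffing, so only one column comes back; a self-contained sketch on the question's input data:

```python
import pandas as pd

# The example input from the question.
df = pd.DataFrame({"user_id": [1015, 1015, 1015, 1016, 1016],
                   "steps": [48, 23, 79, 10, 20]})
# diff within each user_id; the first row of each group is NaN, then filled with 0.
df["d_steps"] = df.groupby("user_id")["steps"].diff().fillna(0)
```

Because diff runs per group, the first row of user 1016 gets 0 rather than 10 - 79, which is exactly the "never subtract across user_ids" requirement.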

Python Pandas read_csv issue

I have simple CSV file that looks like this:
inches,12,3,56,80,45
tempF,60,45,32,80,52
I read in the CSV using this command:
import pandas as pd
pd_obj = pd.read_csv('test_csv.csv', header=None, index_col=0)
Which results in this structure:
         1   2   3   4   5
0
inches  12   3  56  80  45
tempF   60  45  32  80  52
But I want this (unnamed index column):
         0   1   2   3   4
inches  12   3  56  80  45
tempF   60  45  32  80  52
EDIT: As @joris pointed out, additional methods can be run on the resulting DataFrame to achieve the wanted structure. My question is specifically about whether or not this structure can be achieved through read_csv arguments alone.
from the documentation of the function:
names : array-like
List of column names to use. If file contains no header row, then you
should explicitly pass header=None
so, apparently:
pd_obj = pd.read_csv('test_csv.csv', header=None, index_col=0, names=range(5))
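A self-contained check of that trick on in-memory data. One detail worth knowing: when names lists fewer entries than the rows have fields, pandas uses the leading field of each row as an (unnamed) index, so index_col can even be left out:

```python
import io
import pandas as pd

# Same data as the question's file, held in memory for the example.
csv_text = "inches,12,3,56,80,45\ntempF,60,45,32,80,52\n"
d = pd.read_csv(io.StringIO(csv_text), header=None, names=list(range(5)))
# The first field of each row becomes the unnamed index; columns are 0..4.
```

This gives exactly the wanted structure: an unnamed index of row labels and data columns numbered 0 through 4.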

Read multiple *.txt files into Pandas Dataframe with filename as column header

I am trying to import a set of *.txt files. I need to import the files into successive columns of a Pandas DataFrame in Python.
Requirements and Background information:
Each file has one column of numbers
No headers are present in the files
Positive and negative integers are possible
The size of all the *.txt files is the same
The columns of the DataFrame must have the name of the file (without extension) as the header
The number of files is not known ahead of time
Here is one sample *.txt file. All the others have the same format.
16
54
-314
1
15
4
153
86
4
64
373
3
434
31
93
53
873
43
11
533
46
Here is my attempt:
import pandas as pd
import os
import glob

# Step 1: get a list of all csv files in target directory
my_dir = "C:\\Python27\\Files\\"
filelist = []
filesList = []
os.chdir(my_dir)

# Step 2: Build up list of files:
for files in glob.glob("*.txt"):
    fileName, fileExtension = os.path.splitext(files)
    filelist.append(fileName)  # filename without extension
    filesList.append(files)    # filename with extension

# Step 3: Build up DataFrame:
df = pd.DataFrame()
for ijk in filelist:
    frame = pd.read_csv(filesList[ijk])
    df = df.append(frame)
print df
Steps 1 and 2 work. I am having problems with step 3. I get the following error message:
Traceback (most recent call last):
  File "C:\Python27\TextFile.py", line 26, in <module>
    frame = pd.read_csv(filesList[ijk])
TypeError: list indices must be integers, not str
Question:
Is there a better way to load these *.txt files into a Pandas dataframe? Why does read_csv not accept strings for file names?
You can read them into multiple dataframes and concat them together afterwards. Suppose you have two of those files, containing the data shown.
In [6]:
filelist = ['val1.txt', 'val2.txt']
print pd.concat([pd.read_csv(item, names=[item[:-4]]) for item in filelist], axis=1)
    val1  val2
0     16    16
1     54    54
2   -314  -314
3      1     1
4     15    15
5      4     4
6    153   153
7     86    86
8      4     4
9     64    64
10   373   373
11     3     3
12   434   434
13    31    31
14    93    93
15    53    53
16   873   873
17    43    43
18    11    11
19   533   533
20    46    46
You're very close. ijk is the filename already, you don't need to access the list:
# Step 3: Build up DataFrame:
df = pd.DataFrame()
for ijk in filelist:
    frame = pd.read_csv(ijk)
    df = df.append(frame)
print df
In the future, please provide working code exactly as is. You wrote "from pandas import *", yet then refer to pandas as pd, which implies "import pandas as pd".
You also want to be careful with variable names. files is actually a single file path, and filelist and filesList are not discernibly different as variable names. It also seems like a bad idea to keep personal documents in your Python directory.
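Putting both answers together, here is a self-contained sketch that discovers the files with glob and concatenates one single-column frame per file, using the stripped file name as the column header. The directory name and values are made up, and the files are created inline so the example runs anywhere:

```python
import glob
import os
import pandas as pd

# Create two throwaway *.txt files so the sketch is self-contained.
os.makedirs("txt_demo", exist_ok=True)
for name, nums in [("val1", [16, 54, -314]), ("val2", [1, 15, 4])]:
    with open(os.path.join("txt_demo", name + ".txt"), "w") as f:
        f.write("\n".join(str(n) for n in nums))

# The number of files need not be known ahead of time: glob finds them all.
files = sorted(glob.glob(os.path.join("txt_demo", "*.txt")))
df = pd.concat(
    [pd.read_csv(path, header=None,
                 names=[os.path.splitext(os.path.basename(path))[0]])
     for path in files],
    axis=1)
```

Since the files are guaranteed to be the same length, concat along axis=1 lines the columns up by row position with no NaN padding.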