Convert blank cells to NaN in Excel File - python-2.7

When parsing an excel file in Pandas,
xls = pd.ExcelFile('file.xlsx')
df = xls.parse(0, parse_dates=[0, 1])
Is there a way to convert all of the blank cells to NaN rather than to 0?

You can try with:
df = df.replace(r'^\s*$', np.nan, regex=True)
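As a rough sketch of how that replacement behaves, here is a small in-memory frame standing in for the parsed sheet (the column names and values are made up for illustration). The anchored pattern `^\s*$` only matches cells that are empty or contain nothing but whitespace, so other strings and numeric columns are left alone:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the parsed sheet: blanks appear as '' or spaces.
df = pd.DataFrame({"name": ["x", "", "  "], "n": [1, 2, 3]})

# Replace cells that are empty or whitespace-only with NaN.
cleaned = df.replace(r"^\s*$", np.nan, regex=True)
print(cleaned)
```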

Related

Is there a code for separating alphabets from integers from a string in an excel sheet using pandas? [duplicate]

I'm working on a company project where the data was collected into an Excel sheet. I've been asked to separate the letters from the digits in the Barcode_Number column using regex. Is there a way I can do that for all the values in the Barcode_Number column?
import numpy as np
import pandas as pd
import re
data = pd.read_excel(r'C:\Users\yanga\Gaussian\SEC - 6. Yanga Deliverables\Transmission\Raw\3000_2- processed.xlsx')
data.head()
# Extract the column you want to work with
df = pd.DataFrame(data, columns= ['Barcode_Number'])
# Identify the null values
df.isnull().sum()
# remove all the null values
df.dropna(how = 'all', inplace = True)
# Select cells that contain non-digit values
df1 = df[df['Barcode_Number'].str.contains(r'^\D', na = False)]
For example if I have list of values under column Barcode_Number
Barcode_Number
'VQA435'
'KSR436'
'LAR437'
'ARB438'
and I want an output to be like this:
'VQA', '435'
'KSR', '436'
'LAR', '437'
'ARB', '438'
import pandas as pd
df = pd.read_csv(filename)
df[["Code", "Number"]] = df["Barcode_Number"].str.extract(r"([A-Z]+)([0-9]+)")
print(df)
Output:
  Barcode_Number Code Number
0         VQA435  VQA    435
1         KSR436  KSR    436
2         LAR437  LAR    437
3         ARB438  ARB    438
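A self-contained variant of the same idea, using the sample values from the question instead of a file. This is only a sketch: the named groups and the `[A-Za-z]+` / `\d+` classes are my own assumptions about what the barcodes look like, and the extracted Number column stays as strings.

```python
import pandas as pd

df = pd.DataFrame({"Barcode_Number": ["VQA435", "KSR436", "LAR437", "ARB438"]})

# Named groups become the column names of the extracted frame.
parts = df["Barcode_Number"].str.extract(r"^(?P<Code>[A-Za-z]+)(?P<Number>\d+)$")

# Rows that do not match the pattern come back as NaN in both columns.
df = df.join(parts)
print(df)
```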

deleting semicolons in a column of csv in python

I have a column of times and I want to select the values that fall between two times, but I can't work out how. For example: 09:04:00 through 09:25:00, using only the values between those times.
I was going to delete the colons separating hours:minutes:seconds and do it that way, but I don't know how to do that. I do know how to find a value in a column, so I figured that way would be easier.
Here is the csv I'm working with.
DATE,TIME,OPEN,HIGH,LOW,CLOSE,VOLUME
02/03/1997,09:04:00,3046.00,3048.50,3046.00,3047.50,505
02/03/1997,09:05:00,3047.00,3048.00,3046.00,3047.00,162
02/03/1997,09:06:00,3047.50,3048.00,3047.00,3047.50,98
02/03/1997,09:07:00,3047.50,3047.50,3047.00,3047.50,228
02/03/1997,09:08:00,3048.00,3048.00,3047.50,3048.00,136
02/03/1997,09:09:00,3048.00,3048.00,3046.50,3046.50,174
02/03/1997,09:10:00,3046.50,3046.50,3045.00,3045.00,134
02/03/1997,09:11:00,3045.50,3046.00,3044.00,3045.00,43
02/03/1997,09:12:00,3045.00,3045.50,3045.00,3045.00,214
02/03/1997,09:13:00,3045.50,3045.50,3045.50,3045.50,8
02/03/1997,09:14:00,3045.50,3046.00,3044.50,3044.50,152
02/03/1997,09:15:00,3044.00,3044.00,3042.50,3042.50,126
02/03/1997,09:16:00,3043.50,3043.50,3043.00,3043.00,128
02/03/1997,09:17:00,3042.50,3043.50,3042.50,3043.50,23
02/03/1997,09:18:00,3043.50,3044.50,3043.00,3044.00,51
02/03/1997,09:19:00,3044.50,3044.50,3043.00,3043.00,18
02/03/1997,09:20:00,3043.00,3045.00,3043.00,3045.00,23
02/03/1997,09:21:00,3045.00,3045.00,3044.50,3045.00,51
02/03/1997,09:22:00,3045.00,3045.00,3045.00,3045.00,47
02/03/1997,09:23:00,3045.50,3046.00,3045.00,3045.00,77
02/03/1997,09:24:00,3045.00,3045.00,3045.00,3045.00,131
02/03/1997,09:25:00,3044.50,3044.50,3043.50,3043.50,138
02/03/1997,09:26:00,3043.50,3043.50,3043.50,3043.50,6
02/03/1997,09:27:00,3043.50,3043.50,3043.00,3043.00,56
02/03/1997,09:28:00,3043.00,3044.00,3043.00,3044.00,32
02/03/1997,09:29:00,3044.50,3044.50,3044.50,3044.50,63
02/03/1997,09:30:00,3045.00,3045.00,3045.00,3045.00,28
02/03/1997,09:31:00,3045.00,3045.50,3045.00,3045.50,75
02/03/1997,09:32:00,3045.50,3045.50,3044.00,3044.00,54
02/03/1997,09:33:00,3043.50,3044.50,3043.50,3044.00,96
02/03/1997,09:34:00,3044.00,3044.50,3044.00,3044.50,27
02/03/1997,09:35:00,3044.50,3044.50,3043.50,3044.50,44
02/03/1997,09:36:00,3044.00,3044.00,3043.00,3043.00,61
02/03/1997,09:37:00,3043.50,3043.50,3043.50,3043.50,18
Thanks for the time
If you just want to replace the colons with commas, you can use the built-in string replace method.
line = '02/03/1997,09:04:00,3046.00,3048.50,3046.00,3047.50,505'
line = line.replace(':',',')
print(line)
Output
02/03/1997,09,04,00,3046.00,3048.50,3046.00,3047.50,505
Then split on commas to separate the data.
line.split(',')
If you only want the numerical values you could also do the following (using a regular expression):
import re
line = '02/03/1997,09:04:00,3046.00,3048.50,3046.00,3047.50,505'
values = [float(x) for x in re.sub(r'[^\w.]+', ',', line).split(',')]
print(values)
Which gives you a list of numerical values that you can process.
[2.0, 3.0, 1997.0, 9.0, 4.0, 0.0, 3046.0, 3048.5, 3046.0, 3047.5, 505.0]
Use the csv module! :)
>>> import csv
>>> with open('myFile.csv', newline='') as csvfile:
...     myCsvreader = csv.reader(csvfile, delimiter=',', quotechar='|')
...     for row in myCsvreader:
...         for item in row:
...             item.split(':')  # splits a time such as 09:04:00 into ['09', '04', '00']
Once you have extracted the time stamps, you can use the datetime module, for example:
from datetime import datetime, date, time
x = time(hour=9, minute=30, second=30)
y = time(hour=9, minute=30, second=42)
diff = datetime.combine(date.today(), y) - datetime.combine(date.today(), x)
print(diff.total_seconds())
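If the actual goal is just to keep the rows between two times, removing the colons may not be needed at all. Here is a minimal sketch with pandas, assuming the TIME column is always zero-padded HH:MM:SS as in the sample; only three rows from the question are inlined, where the real script would read the file from disk:

```python
import io
import pandas as pd

# A few rows from the question's file; the real data would come from a path.
csv_text = """DATE,TIME,OPEN,HIGH,LOW,CLOSE,VOLUME
02/03/1997,09:04:00,3046.00,3048.50,3046.00,3047.50,505
02/03/1997,09:25:00,3044.50,3044.50,3043.50,3043.50,138
02/03/1997,09:26:00,3043.50,3043.50,3043.50,3043.50,6
"""

df = pd.read_csv(io.StringIO(csv_text))

# Zero-padded HH:MM:SS strings sort in chronological order,
# so a plain string comparison is enough; both ends are inclusive.
in_window = df[df["TIME"].between("09:04:00", "09:25:00")]
print(in_window)
```

This relies on the string ordering matching the time ordering, which would stop holding if some times were written without leading zeros.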

python/pandas:need help adding double quotes to columns

I need to add double quotes to specific columns in a csv file that my script generates.
Below is the goofy way I thought of doing this. For these two fixed-width fields, it works:
df['DATE'] = df['DATE'].str.ljust(9,'"')
df['DATE'] = df['DATE'].str.rjust(10,'"')
df['DEPT CODE'] = df['DEPT CODE'].str.ljust(15,'"')
df['DEPT CODE'] = df['DEPT CODE'].str.rjust(16,'"')
For the following field, it doesn't. It has a variable length. So, if the value is shorter than the standard 6-digits, I get extra double-quotes: "5673"""
df['ID'] = df['ID'].str.ljust(7,'"')
df['ID'] = df['ID'].str.rjust(8,'"')
I have tried zfill, but the data in the column is a series-- I get "pandas.core.series.Series" when I run
print(type(df['ID']))
and I have not been able to convert it to string using astype. I'm not sure why. I have not imported numpy.
I tried using len() to get the length of the ID number and pass it to str.ljust and str.rjust as its first argument, but I think it got hung up on the data not being a string.
Is there a simpler way to apply double-quotes as I need, or is the zfill going to be the way to go?
You can add a speech mark before / after:
In [11]: df = pd.DataFrame([["a"]], columns=["A"])
In [12]: df
Out[12]:
   A
0  a
In [13]: '"' + df['A'] + '"'
Out[13]:
0    "a"
Name: A, dtype: object
Assigning this back:
In [14]: df['A'] = '"' + df.A + '"'
In [15]: df
Out[15]:
     A
0  "a"
If it's for exporting to csv you can use the quoting kwarg:
In [21]: df = pd.DataFrame([["a"]], columns=["A"])
In [22]: df.to_csv()
Out[22]: ',A\n0,a\n'
In [23]: df.to_csv(quoting=1)
Out[23]: '"","A"\n"0","a"\n'
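For the variable-length ID column from the question, the same concatenation should work without any padding, as long as the values are cast to strings first. A small sketch, with made-up values and assuming the IDs are stored as integers:

```python
import pandas as pd

# Hypothetical data: IDs of different lengths, stored as integers.
df = pd.DataFrame({"ID": [5673, 123456], "DEPT CODE": ["ACCT", "HR"]})

# Cast to str first so the concatenation works whatever the column dtype is;
# the quotes wrap each value without any padding.
for col in ["ID", "DEPT CODE"]:
    df[col] = '"' + df[col].astype(str) + '"'

print(df)
```

If the frame is then written with to_csv, the default quoting will escape these embedded quotes, so for CSV export the quoting keyword is probably the better route.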
With numpy, not pandas, you can specify the formatting method when saving to a csv file. As very simple example:
In [209]: np.savetxt('test.txt',['string'],fmt='%r')
In [210]: cat test.txt
'string'
In [211]: np.savetxt('test.txt',['string'],fmt='"%s"')
In [212]: cat test.txt
"string"
I would expect the pandas csv writer to have a similar degree of control, if not more.

Print columns of Pandas dataframe to separate files + dataframe with datetime (min/sec)

I am trying to print a Pandas dataframe's columns to separate *.csv files in Python 2.7.
Using this code, I get a dataframe with 4 columns and an index of dates:
import datetime as dt
import pandas as pd
import numpy as np
rows = 10
col_headers = list('ABCD')
dates = pd.date_range(dt.datetime.today().strftime("%m/%d/%Y"),periods=rows)
df2 = pd.DataFrame(np.random.randn(rows, 4), index=dates, columns = col_headers)
df = df2.tz_localize('UTC') #this does not seem to be giving me hours/minutes/seconds
I then remove the index and set it to a separate column:
df['Date'] = df.index
col_headers.append('Date') #update the column keys
At this point, I just need to print all 5 columns of the dataframe to separate files. Here is what I have tried:
for ijk in range(0,len(col_headers)):
    df.to_csv('output' + str(ijk) + '.csv', columns = col_headers[ijk])
I get the following error message:
KeyError: "[['D', 'a', 't', 'e']] are not in ALL in the [columns]"
If I say:
for ijk in range(0,len(col_headers)-1):
then it works, but it does not print the 'Date' column. That is not what I want. I need to also print the date column.
Questions:
How do I get it to print the 'Date' column to a *.csv file?
How do I get the time with hours, minutes and seconds? If the number of rows is changed from 10 to 5000, then will the seconds change from one row of the dataframe to the next?
EDIT:
- Answer for Q2 ==> in the case of my particular code, see this:
dates = pd.date_range(dt.datetime.today().strftime("%m/%d/%Y %H:%M"),periods=rows)
I don't quite follow your logic, but the following is a simpler way to do it:
for col in df:
    df[col].to_csv('output' + col + '.csv')
Example:
In [41]:
for col in df:
    print('output' + col + '.csv')
outputA.csv
outputB.csv
outputC.csv
outputD.csv
outputDate.csv
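The KeyError in the question appears to come from passing a bare string to columns=: 'Date' gets treated as a sequence of characters, while the single-letter names happen to work. Wrapping the name in a list should avoid that. A sketch under a few assumptions of mine (a fixed start date, and a temporary directory as the output location):

```python
import os
import tempfile

import numpy as np
import pandas as pd

rows = 10
dates = pd.date_range("2024-01-01", periods=rows, tz="UTC")  # arbitrary start date
df = pd.DataFrame(np.random.randn(rows, 4), index=dates, columns=list("ABCD"))
df["Date"] = df.index

out_dir = tempfile.mkdtemp()  # hypothetical output location
written = []
for col in df.columns:
    path = os.path.join(out_dir, "output" + col + ".csv")
    # columns= expects a list, so the single name is wrapped in one.
    df.to_csv(path, columns=[col])
    written.append(path)

print(written)
```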

Pandas Read CSV with dates as DD-MMM-YY

I have a data set that looks as follows in a CSV file:
Date Sample
01-AUG-09 Sample 1
02-Aug-09 Sample 2
etc...
When I use Pandas, I read in the file with the following code:
in_file = pd.read_csv('File Name.csv', parse_dates = True)
However, it is not recognizing the date column properly. Does anybody know if the Pandas date parser can recognize dates that are in DD-MMM-YY format?
The following worked for me
I suspect yours is probably much simpler to parse because the columns are tab separated? (I did an exact-width parsing, which is not trivial.)
In [41]: df = pd.read_fwf(StringIO(data),widths=[9,13],parse_dates=True,index_col=0,names=['sample'],header=None,skiprows=1)
In [42]: df
Out[42]:
              sample
2009-08-01  Sample 1
2009-08-02  Sample 2
Tab separated is much simpler
In [43]: data2 = """Data\tSample\n01-AUG-09\tSample 1\n02-Aug-09\tSample 2\n"""
In [44]: pd.read_csv(StringIO(data2),sep='\t',parse_dates=True,index_col=0)
Out[44]:
              Sample
Data
2009-08-01  Sample 1
2009-08-02  Sample 2
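If the automatic parser does not pick the format up, another option is to read the column as text and convert it with an explicit format string. A minimal sketch, assuming tab-separated input and an English locale for the month abbreviations:

```python
import io
import pandas as pd

data = "Date\tSample\n01-AUG-09\tSample 1\n02-Aug-09\tSample 2\n"
df = pd.read_csv(io.StringIO(data), sep="\t")

# %d = day, %b = abbreviated month name, %y = two-digit year.
df["Date"] = pd.to_datetime(df["Date"], format="%d-%b-%y")
print(df)
```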