Parse textfile without fixed structure using python dictionary and Pandas - python-2.7

I have a .txt file without separators, and to parse it I need to count characters to know where each column starts and ends. To do so, I constructed a Python dictionary where the keys are the column names and the values are the number of characters each column takes:
headers = {first_col: 3, second_col: 5, third_col: 2, ... nth_col: n_chars}
With that in mind, I know the first three columns of the following line in the .txt file:
ABC123-3YN0000000001203ABC123*TESTINGLINE
first_col: ABC
second_col: 123-3
third_col: YN
I want to know if there is a pandas function that can parse this .txt file, taking these fixed column widths into account and (if possible) using my headers dictionary.

Using a plain dictionary is dangerous because key order is not guaranteed. Meaning, if third_col were iterated first, your entire scheme would be thrown off. You can fix this by using lists. From there, you can use pd.read_fwf to read a fixed-width formatted text file.
Solution
import pandas as pd

names = ['first_col', 'second_col', 'third_col']
widths = [3, 5, 2]
pd.read_fwf(
    'myfile.txt',
    widths=widths,
    names=names
)
first_col second_col third_col
0 ABC 123-3 YN
You can also use OrderedDict from the collections module and keep the order you want by constructing it from an iterable that produces tuples in the correct order:
from collections import OrderedDict
import pandas as pd

names = ['first_col', 'second_col', 'third_col']
widths = [3, 5, 2]
header = OrderedDict(zip(names, widths))
pd.read_fwf(
    'myfile.txt',
    widths=header.values(),
    names=header.keys()
)
first_col second_col third_col
0 ABC 123-3 YN
Demonstration
from collections import OrderedDict
from StringIO import StringIO  # Python 2; use io.StringIO on Python 3
import pandas as pd

txt = """ABC123-3YN0000000001203ABC123*TESTINGLINE"""
names = ['first_col', 'second_col', 'third_col']
widths = [3, 5, 2]
header = OrderedDict(zip(names, widths))
pd.read_fwf(
    StringIO(txt),  # parse the in-memory string rather than a file on disk
    widths=header.values(),
    names=header.keys()
)
first_col second_col third_col
0 ABC 123-3 YN

Related

use python to write to a specific column in a .csv file

I have a .csv file where I need to overwrite a certain column with new values from a list.
Let's say I have the list L1 = ['La', 'Lb', 'Lc'] that I want to write in column no. 5 of the .csv file.
If I run:
L1 = ['La', 'Lb', 'Lc']
import csv
with open(r'C:\LIST.csv', 'wb') as f:
    w = csv.writer(f)
    for i in L1:
        w.writerow(i)
This will write the L1 values to the first and second column.
First column will be 'L', 'L', 'L' and second column 'a', 'b', 'c'
I could not find the syntax to write each element of the list to a specific column. (This is in Python 2.7.) Thank you for your help!
(For this script I must use IronPython, and only the built-in libraries that come with IronPython.)
Although you could certainly use Python's built-in csv module to read the data, modify it, and write it out, I'd recommend the excellent tablib module:
from tablib import Dataset

csv = '''Col1,Col2,Col3,Col4,Col5,Col6,Col7
a1,b1,c1,d1,e1,f1,g1
a2,b2,c2,d2,e2,f2,g2
a3,b3,c3,d3,e3,f3,g3
'''
# Read a hard-coded string just for test purposes.
# In your code, you would use open('...', 'rt').read() to read from a file.
imported_data = Dataset().load(csv, format='csv')
L1 = ['La', 'Lb', 'Lc']
for i in range(len(L1)):
    # Each row is a tuple, and tuples don't support assignment.
    # Convert to a list first so we can modify it.
    row = list(imported_data[i])
    # Put our value in the 5th column (index 4).
    row[4] = L1[i]
    # Store the row back into the Dataset.
    imported_data[i] = row
# Export to CSV. (Of course, you could write this to a file instead.)
print imported_data.export('csv')
# Output:
# Col1,Col2,Col3,Col4,Col5,Col6,Col7
# a1,b1,c1,d1,La,f1,g1
# a2,b2,c2,d2,Lb,f2,g2
# a3,b3,c3,d3,Lc,f3,g3
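Since the question notes that only IronPython's built-in libraries are available, here is a minimal sketch using just the csv module (assuming, as in the sample above, a header row and at least five columns):
import csv

L1 = ['La', 'Lb', 'Lc']

# Read every row into memory ('rb' is the right mode for Python 2's csv module).
with open(r'C:\LIST.csv', 'rb') as f:
    rows = list(csv.reader(f))

# Overwrite column 5 (index 4) of the first len(L1) data rows, skipping the header.
for row, value in zip(rows[1:], L1):
    row[4] = value

with open(r'C:\LIST.csv', 'wb') as f:
    csv.writer(f).writerows(rows)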

How to get the index of the result of nltk.RegexpParser?

I want to get not only the result of RegexpParser, but also the indices of the result: for example, the start index and the end index of each word.
import nltk
from nltk import word_tokenize, pos_tag
text = word_tokenize("6 ACCESSKEY attribute can be used to specify many 6.0 shortcut key 6.0")
tag = pos_tag(text)
print tag
# grammar = "NP: {<DT>?<JJ>*<NN|NNS|NNP|NNPS>}"
grammar2 = """Triple: {<CD>*<DT>?<CD>*<JJ>*<CD>*<VBD|VBG>*<CD>*<NN.*>+<CD>*<MD>*<VB.*>+<JJ>?<RB>?<CD>*<DT>?<NN.*>*<IN*|TO*>?<DT>?<CD>*<JJ>*<CD>*<VBD|VBG>*<CD>*<NN.*>+<CD>*}
Triple: {<CD>*<DT>?<CD>*<JJ>*<CD>*<VBD|VBG>*<CD>*<NN.*>+<CD>*<MD>*<VB.*>+<JJ>?<RB>?<CD>*<DT>?<NN.*>*<TO>?<VB><DT>?<CD>*<JJ>*<CD>*<VBD|VBG>*<CD>*<NN.*>+<CD>*}
"""
grammar = """
NP: {<CD>*<DT>?<CD>*<JJ>*<CD>*<VBD|VBG>*<CD>*<NN.*>+<CD>*}
VP: {<VB.*>+<JJ>*<RB>*<JJ>*<VB.*>?<DT>?<NN|NP>?<IN*|TO*>?}
"""
cp = nltk.RegexpParser(grammar)
result = cp.parse(tag)
print(result)
result.draw()
Since you give the parser tokenised text, there is no way it can guess the original offsets (how could it know how much space was between the tokens).
But, fortunately, the parse() method accepts additional info, which is simply passed on to the output.
In your example, the input (you saved it in the badly named variable tag) looks like this:
[('6', 'CD'),
('ACCESSKEY', 'NNP'),
('attribute', 'NN'),
...
If you manage to change it to
[('6', 'CD', 0, 1),
('ACCESSKEY', 'NNP', 2, 11),
('attribute', 'NN', 12, 21),
...
and feed this to the parser, then the offsets will be included in the parse tree:
Tree('S',
[Tree('NP', [('6', 'CD', 0, 1),
('ACCESSKEY', 'NNP', 2, 11),
('attribute', 'NN', 12, 21)]),
...
How do you get the offsets into the tagged sequence?
Well, I will leave this as a programming exercise to you.
Hint: Look for the span_tokenize() method of the word tokenisers.
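For what it's worth, here is a minimal sketch of that exercise. It uses WhitespaceTokenizer (which implements span_tokenize) rather than word_tokenize, so it assumes the tokens are whitespace-separated, and the toy NP grammar is only for illustration:
import nltk
from nltk import pos_tag
from nltk.tokenize import WhitespaceTokenizer

text = "6 ACCESSKEY attribute can be used to specify many 6.0 shortcut key 6.0"
# span_tokenize yields (start, end) character offsets for each token.
spans = list(WhitespaceTokenizer().span_tokenize(text))
tokens = [text[s:e] for s, e in spans]
# Attach the offsets to each (word, tag) pair: (word, tag, start, end).
tagged = [(w, t, s, e) for (w, t), (s, e) in zip(pos_tag(tokens), spans)]
cp = nltk.RegexpParser("NP: {<CD>*<NN.*>+}")  # toy grammar for the sketch
print cp.parse(tagged)  # the extra fields ride along into the output tree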

deleting semicolons in a column of csv in python

I have a column of different times and I want to find the values between 2 different times, but I can't figure out how. For example: 09:04:00 through 09:25:00, and then just use the values between those times.
I was going to just delete the colons separating hours:minutes:seconds and do it that way, but I don't really know how to do that. I do know how to find a value in a column, so I figured that way would be easier.
Here is the csv I'm working with.
DATE,TIME,OPEN,HIGH,LOW,CLOSE,VOLUME
02/03/1997,09:04:00,3046.00,3048.50,3046.00,3047.50,505
02/03/1997,09:05:00,3047.00,3048.00,3046.00,3047.00,162
02/03/1997,09:06:00,3047.50,3048.00,3047.00,3047.50,98
02/03/1997,09:07:00,3047.50,3047.50,3047.00,3047.50,228
02/03/1997,09:08:00,3048.00,3048.00,3047.50,3048.00,136
02/03/1997,09:09:00,3048.00,3048.00,3046.50,3046.50,174
02/03/1997,09:10:00,3046.50,3046.50,3045.00,3045.00,134
02/03/1997,09:11:00,3045.50,3046.00,3044.00,3045.00,43
02/03/1997,09:12:00,3045.00,3045.50,3045.00,3045.00,214
02/03/1997,09:13:00,3045.50,3045.50,3045.50,3045.50,8
02/03/1997,09:14:00,3045.50,3046.00,3044.50,3044.50,152
02/03/1997,09:15:00,3044.00,3044.00,3042.50,3042.50,126
02/03/1997,09:16:00,3043.50,3043.50,3043.00,3043.00,128
02/03/1997,09:17:00,3042.50,3043.50,3042.50,3043.50,23
02/03/1997,09:18:00,3043.50,3044.50,3043.00,3044.00,51
02/03/1997,09:19:00,3044.50,3044.50,3043.00,3043.00,18
02/03/1997,09:20:00,3043.00,3045.00,3043.00,3045.00,23
02/03/1997,09:21:00,3045.00,3045.00,3044.50,3045.00,51
02/03/1997,09:22:00,3045.00,3045.00,3045.00,3045.00,47
02/03/1997,09:23:00,3045.50,3046.00,3045.00,3045.00,77
02/03/1997,09:24:00,3045.00,3045.00,3045.00,3045.00,131
02/03/1997,09:25:00,3044.50,3044.50,3043.50,3043.50,138
02/03/1997,09:26:00,3043.50,3043.50,3043.50,3043.50,6
02/03/1997,09:27:00,3043.50,3043.50,3043.00,3043.00,56
02/03/1997,09:28:00,3043.00,3044.00,3043.00,3044.00,32
02/03/1997,09:29:00,3044.50,3044.50,3044.50,3044.50,63
02/03/1997,09:30:00,3045.00,3045.00,3045.00,3045.00,28
02/03/1997,09:31:00,3045.00,3045.50,3045.00,3045.50,75
02/03/1997,09:32:00,3045.50,3045.50,3044.00,3044.00,54
02/03/1997,09:33:00,3043.50,3044.50,3043.50,3044.00,96
02/03/1997,09:34:00,3044.00,3044.50,3044.00,3044.50,27
02/03/1997,09:35:00,3044.50,3044.50,3043.50,3044.50,44
02/03/1997,09:36:00,3044.00,3044.00,3043.00,3043.00,61
02/03/1997,09:37:00,3043.50,3043.50,3043.50,3043.50,18
Thanks for your time.
If you just want to replace the colons with commas, you can use the built-in string replace function.
line = '02/03/1997,09:04:00,3046.00,3048.50,3046.00,3047.50,505'
line = line.replace(':', ',')
print(line)
Output
02/03/1997,09,04,00,3046.00,3048.50,3046.00,3047.50,505
Then split on commas to separate the data.
line.split(',')
If you only want the numerical values you could also do the following (using a regular expression):
import re
line = '02/03/1997,09:04:00,3046.00,3048.50,3046.00,3047.50,505'
values = [float(x) for x in re.sub(r'[^\w.]+', ',', line).split(',')]
print values
Which gives you a list of numerical values that you can process.
[2.0, 3.0, 1997.0, 9.0, 4.0, 0.0, 3046.0, 3048.5, 3046.0, 3047.5, 505.0]
Use the csv module! :)
>>> import csv
>>> with open('myFile.csv', 'rb') as csvfile:  # 'rb' for Python 2's csv module
...     myCsvreader = csv.reader(csvfile, delimiter=',', quotechar='|')
...     for row in myCsvreader:
...         for item in row:
...             item.split(':')  # Splits the time into parts, e.g. ['09', '04', '00']
Once you have extracted the timestamps, you can use the datetime module, for example:
from datetime import datetime, date, time
x = time(hour=9, minute=30, second=30)
y = time(hour=9, minute=30, second=42)
diff = datetime.combine(date.today(), y) - datetime.combine(date.today(), x)
print diff.total_seconds()
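Since you ultimately want the rows between two times, here is a minimal pandas sketch (assuming the data shown above is saved as myFile.csv). Zero-padded HH:MM:SS strings sort lexicographically in time order, so plain string comparisons are enough to select the window:
import pandas as pd

df = pd.read_csv('myFile.csv')
# Keep only the rows whose TIME falls in the requested window.
window = df[(df['TIME'] >= '09:04:00') & (df['TIME'] <= '09:25:00')]
print window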

Unit Test with Pandas Dataframe to read *.csv files

I am often vertically concatenating many *.csv files in Pandas. Every time I do this, I have to check that all the files I am concatenating have the same number of columns. This became quite cumbersome, since I had to figure out a way to ignore the files with more or fewer columns than I need, e.g. the first 10 files have 4 columns but then file #11 has 8 columns and file #54 has 7 columns. This means I have to load all the files, even those with the wrong number of columns. I want to avoid loading those files and then trying to concatenate them vertically; I want to skip them completely.
So, I am trying to write a Unit Test with Pandas that will:
a. check the size of all the *.csv files in some folder
b. ONLY read in the files that have a pre-determined number of columns
c. print a message indicating the names of the *.csv files that have the wrong number of columns
Here is what I have (I am working in the folder C:\Users\Downloads):
import unittest
import numpy as np
import pandas as pd
from os import listdir

# Create csv files:
df1 = pd.DataFrame(np.random.rand(10, 4), columns=['A', 'B', 'C', 'D'])
df2 = pd.DataFrame(np.random.rand(10, 3), columns=['A', 'B', 'C'])
df1.to_csv('test1.csv')
df2.to_csv('test2.csv')

class Conct(unittest.TestCase):
    """Tests for `primes.py`."""
    TEST_INP_DIR = 'C:\Users\Downloads'
    fns = listdir(TEST_INP_DIR)
    t_fn = [fn for fn in fns if fn.endswith(".csv")]
    print t_fn
    dfb = pd.DataFrame()
    def setUp(self):
        for elem in Conct.t_fn:
            print elem
            fle = pd.read_csv(elem)
            try:
                pd.concat([Conct.dfb, fle], axis=0, join='outer', join_axes=None, ignore_index=True, verify_integrity=False)
            except IOError:
                print 'Error: unable to concatenate a file with %s columns.' % fle.shape[1]
                self.err_file = fle
    def tearDown(self):
        del self.err_file

if __name__ == '__main__':
    unittest.main()
Problem:
I am getting this output:
['test1.csv', 'test2.csv']
----------------------------------------------------------------------
Ran 0 tests in 0.000s
OK
The first print statement works - it is printing a list of *.csv files, as expected. But, for some reason, the second and third print statements do not work.
Also, the concatenation should not have gone through: the second file has 3 columns but the first one has 4 columns. The IOError line does not seem to print.
How can I use a Python unittest to check each of the *.csv files to make sure that they have the same number of columns before concatenation? And how can I print the appropriate error message at the correct time?
Instead of reading the files in chunks (e.g. with chunksize), just read in the first row and count the number of columns, then read and append everything with the correct number of columns. In short:
for f in files:
    test = pd.read_csv(f, nrows=1)
    if len(test.columns) == 4:
        df = df.append(pd.read_csv(f))
Here's the full version:
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.rand(2, 4), columns=['A', 'B', 'C', 'D'])
df2 = pd.DataFrame(np.random.rand(2, 3), columns=['A', 'B', 'C'])
df3 = pd.DataFrame(np.random.rand(2, 4), columns=['A', 'B', 'C', 'D'])
df1.to_csv('test1.csv', index=False)
df2.to_csv('test2.csv', index=False)
df3.to_csv('test3.csv', index=False)
files = ['test1.csv', 'test2.csv', 'test3.csv']
df = pd.DataFrame()
for f in files:
    test = pd.read_csv(f, nrows=1)
    if len(test.columns) == 4:
        df = df.append(pd.read_csv(f))
In [54]: df
Out [54]:
A B C D
0 0.308734 0.242331 0.318724 0.121974
1 0.707766 0.791090 0.718285 0.209325
0 0.176465 0.299441 0.998842 0.077458
1 0.875115 0.204614 0.951591 0.154492
(Edit to add) Regarding the use of nrows for the test... line: The only point of the test line is to read in enough of the CSV so that on the next line we check if it has the right number of columns before reading in. In this test case, reading in the first row is sufficient to figure out if we have 3 or 4 columns, and it's inefficient to read in more than that, although there is no harm in leaving off the nrows=1 besides reduced efficiency.
In other cases (e.g. no header row and varying numbers of columns in the data), you might need to read in the whole CSV. In that case, you'd be better off doing it like this:
for f in files:
    test = pd.read_csv(f)
    if len(test.columns) == 4:
        df = df.append(test)
The only downside of that way is that you completely read in the datasets with 3 columns that you don't want to keep, but you also don't read in the good datasets twice. So that's definitely the better way if you don't want to use nrows at all. Ultimately, it depends on what your actual data looks like as to which way is best for you, of course.
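To tie this back to the unittest part of the question, here is a minimal sketch of a test that flags files with the wrong column count; EXPECTED_COLS and TEST_INP_DIR are assumptions to adapt:
import os
import unittest
import pandas as pd

EXPECTED_COLS = 4    # assumption: the required number of columns
TEST_INP_DIR = '.'   # assumption: the folder holding the *.csv files

class TestCsvColumns(unittest.TestCase):
    def test_column_counts(self):
        bad = []
        for fn in os.listdir(TEST_INP_DIR):
            if fn.endswith('.csv'):
                path = os.path.join(TEST_INP_DIR, fn)
                # Reading a single row is enough to count the columns.
                n_cols = pd.read_csv(path, nrows=1).shape[1]
                if n_cols != EXPECTED_COLS:
                    bad.append((fn, n_cols))
        self.assertEqual(bad, [], 'files with wrong column counts: %s' % bad)

if __name__ == '__main__':
    unittest.main()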

Appending input from csv file to a dictionary including duplicate values for a unique key

First, I want to say that I am new to programming. That said, using Python 2.7.6, I'm trying to take a text file, read it in with csv, and then create a dictionary with a key equal to the first column in the file. Here is an example of the type of file I want to use (sorry for the bad formatting; there are three columns, each with a value for visitid, date, or time):
visitid cdate ctime
OMHioJh8XEeq7152 6/15/2007 06:00
OMHioJh8XEeq7152 6/14/2007 07:10
OMHioJh8XEeq7152 6/11/2007 14:21
t2v0TjgroLTI6118 4/28/2006 14:18
t2v0TjgroLTI6118 5/1/2006 04:00
Specifically, given this kind of list, I want to make a key in the dictionary equal to the value of the first column, with the remaining columns as a list for the value. Finally, if there are duplicates of the value in column 1, I want to append another list to that value, forming a list of lists, so to speak. This is what I have so far, after doing some research on here and elsewhere:
def test_results(filename):
    import csv
    with open(filename, "rU") as f:
        reader = csv.reader(f, delimiter="\t")
        result = {}
        for row in reader:
            key = row[0]
            if key in result:
                result[row[0]].append(row[1])
            else:
                result[row[0]] = key
                result[key] = row[1:]
        print result
This works, but it does not append the values to make a list of lists, and only adds to the dictionary the last row for any unique visitID.
Thanks!
You should use defaultdict:
from collections import defaultdict
import csv

def test_results(filename):
    with open(filename, "rU") as f:
        reader = csv.reader(f, delimiter="\t")
        result = defaultdict(list)
        next(reader)  # Skip the header row
        for row in reader:
            result[row[0]].append(row[1:])
        return result
defaultdict(list) will assume an empty list if the key is not present in the dictionary. Given the input provided in the question, result will contain:
{'OMHioJh8XEeq7152': [['6/15/2007', '06:00'],
['6/14/2007', '07:10'],
['6/11/2007', '14:21']],
't2v0TjgroLTI6118': [['4/28/2006', '14:18'],
['5/1/2006', '04:00']]}
If you want a more flexible format, you should convert your date and time strings into a datetime object using dateutil.parser.parse:
import csv
from collections import defaultdict
from dateutil import parser

def test_results(filename):
    with open(filename, "rU") as f:
        reader = csv.reader(f, delimiter="\t")
        result = defaultdict(list)
        next(reader)  # Skip the header line
        for row in reader:
            result[row[0]].append(parser.parse(' '.join(row[1:])))
        return result
Which yields:
{'OMHioJh8XEeq7152': [datetime.datetime(2007, 6, 15, 6, 0),
datetime.datetime(2007, 6, 14, 7, 10),
datetime.datetime(2007, 6, 11, 14, 21)],
't2v0TjgroLTI6118': [datetime.datetime(2006, 4, 28, 14, 18),
datetime.datetime(2006, 5, 1, 4, 0)]}
Maybe something like this (note that the result[row[0]] = key assignment in your else branch was dead code, since the next line overwrote it; the real fix is to store and append lists):
if key in result:
    result[row[0]].append(row[1:])
else:
    result[key] = [row[1:]]