Convert a column into a list - python-2.7

I need to loop over a file containing the output of a command and convert one of its columns into a list. Please advise; thanks in advance.
The data looks like this:
TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG
asset 8 100 1009663 1
asset 7 200 523533 1
asset 9 319710 319710 0
asset 5 870935 870935 0
This is my code:
lag_list = []
with open(fname) as f:
    f.readline()
    lines = f.readlines()[1:]
    length = len(lines)
    print(length)
    for line in lines:
        print "Hello"
        print line
        print "hello 2"
        data = line.split(' ')
        lag_list.append(data[4])
        data = line.split("\t")
        lag_list.append(data[4])
    print lag_list
    return
But it returns this error:
lag_list.append(data[4])
IndexError: list index out of range

Your data either:
- does not have 4 tabs in a line,
- does not have 4 spaces in a line,
- or has a \n after the last line of your source data.
When you read those lines and split them, the resulting list does not have 5 elements, hence the IndexError when accessing data[4].
Splitting the same line first by spaces and then by tabs does not make much sense to me; I hope it does for your data and application.
Check your split list before indexing into it:
lag_list = []
with open(fname) as f:
    f.readline()
    lines = f.readlines()[1:]
    length = len(lines)
    print(length)
    for line in lines:
        print "Hello"
        print line
        print "hello 2"
        data = line.split(' ')
        if len(data) >= 5:  # check if it is safe to index into
            lag_list.append(data[4])
        else:
            print "Not enough elements - need at least 5:", data
        data = line.split("\t")
        if len(data) >= 5:  # check if it is safe to index into
            lag_list.append(data[4])
        else:
            print "Not enough elements - need at least 5:", data
    print lag_list
    return
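If the columns can be separated by any mix of spaces and tabs, it is often simpler to split on whitespace with split() (no argument), which handles both and drops empty strings. A minimal sketch along those lines, assuming fname is defined and LAG is always the fifth column:
lag_list = []
with open(fname) as f:
    next(f)                      # skip the header row
    for line in f:
        parts = line.split()     # splits on any run of spaces or tabs
        if len(parts) >= 5:      # TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG
            lag_list.append(parts[4])
        elif parts:              # non-empty line that does not match the expected layout
            print("Skipping malformed line: %r" % line)
print(lag_list)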

Related

Reading mailing addresses of varying length from a text file using regular expressions

I am trying to read a text file and collect addresses from it. Here's an example of one of the entries in the text file:
Electrical Vendor Contact: John Smith Phone #: 123-456-7890
Address: 1234 ADDRESS ROAD Ship To:
Suite 123 ,
Nowhere, CA United States 12345
Phone: 234-567-8901 E-Mail: john.smith@gmail.com
Fax: 345-678-9012 Web Address: www.electricalvendor.com
Acct. No: 123456 Monthly Due Date: Days Until Due
Tax ID: Fed 1099 Exempt Discount On Assets Only
G/L Liab. Override:
G/L Default Exp:
Comments:
APPROVED FOR ELECTRICAL THINGS
I cannot wrap my head around how to search for and store the address for each of these entries when the number of lines in the address varies. Currently, I have a generator that reads each line of the file. The get_addrs() method then attempts to capture markers such as the Address: and Ship keywords in the file to signify when an address needs to be stored, and I use a regular expression to search for zip codes in the line following a line with the Address: keyword. I think I've figured out how to successfully save the second line for all addresses using that method. However, in a few addresses there is a suite number or other piece of information that causes the address to become three lines instead of two. I'm not sure how to account for this, and I tried expanding my save_previous() method to three lines, but I can't get it quite right. Here's the code with which I was able to successfully save all of the two-line addresses:
import re

class GetAddress():
    def __init__(self):
        self.line1 = []
        self.line2 = []
        self.s_line1 = []
        self.addr_index = 0
        self.ship_index = 0
        self.no_ship = False
        self.addr_here = False
        self.prev_line = []
        self.us_zip = ''

    # Check if there is a shipping address.
    def set_no_ship(self, line):
        try:
            self.no_ship = line.index(',') == len(line) - 1
        except ValueError:
            pass

    # Save two lines at a time to see whether or not the previous
    # line contains 'Address:' and 'Ship'.
    def save_previous(self, line):
        self.prev_line += [line]
        if len(self.prev_line) > 2:
            del self.prev_line[0]

    def get_addrs(self, line):
        self.addr_here = 'Address:' in line and 'Ship' in line
        self.po_box = False
        self.no_ship = False
        self.addr_index = 0
        self.ship_index = 0
        self.zip1_index = 0
        self.set_no_ship(line)
        self.save_previous(line)
        # Check if 'Address:' and 'Ship' are in the previous line.
        self.prev_addr = (
            'Address:' in self.prev_line[0]
            and 'Ship' in self.prev_line[0])
        if self.addr_here:
            self.po_box = 'Box' in line or 'BOX' in line
            self.addr_index = line.index('Address:') + 1
            self.ship_index = line.index('Ship')
            # Get the contents of the line between 'Address:' and
            # 'Ship' if both words are present in this line.
            if self.addr_index is not self.ship_index:
                self.line1 += [' '.join(line[self.addr_index:self.ship_index])]
            elif self.addr_index is self.ship_index:
                self.line1 += ['']
        if len(self.prev_line) > 1 and self.prev_addr:
            self.po_box = 'Box' in line or 'BOX' in line
            self.us_zip = re.search(r'(\d{5}(\-\d{4})?)', ' '.join(line))
            if self.us_zip and not self.po_box:
                self.zip1_index = line.index(self.us_zip.group(1))
            if self.no_ship:
                self.line2 += [' '.join(line[:line.index(',')])]
            elif self.zip1_index and not self.no_ship:
                self.line2 += [' '.join(line[:self.zip1_index + 1])]
            elif len(self.line1) > 0 and not self.line1[-1]:
                self.line2 += ['']

# Create a generator to read each line of the file.
def read_gen(infile):
    with open(infile, 'r') as file:
        for line in file:
            yield line.split()

infile = 'Vendor List.txt'
info = GetAddress()

for i, line in enumerate(read_gen(infile)):
    info.get_addrs(line)
I am still a beginner in Python so I'm sure a lot of my code may be redundant or unnecessary. I'd love some feedback as to how I might make this simpler and shorter while capturing both two and three line addresses.
I also posted this question to Reddit, and u/Binary101010 pointed out that the text file is fixed-width, so it may be possible to slice each line in a way that only selects the necessary address information. Using this intuition I added some functionality to the generator function, and I was able to produce the desired effect with the following code:
infile = 'Vendor List.txt'

# Create a generator with differing modes to read the specified lines of the file.
def read_gen(infile, mode=0, start=0, end=0, rows=[]):
    lines = list()
    with open(infile, 'r') as file:
        for i, line in enumerate(file):
            # Set end to correct value if no argument is given.
            if end == 0:
                end = len(line)
            # Mode 0 gives all lines of the file
            if mode == 0:
                yield line[start:end]
            # Mode 1 gives specific lines from the file using the rows keyword
            # argument. Make sure rows is formatted as [start_row, end_row].
            # rows list should only ever be length 2.
            elif mode == 1:
                if rows:
                    # Create a list for indices between specified rows.
                    for element in range(rows[0], rows[1]):
                        lines += [element]
                    # Return the current line if the index falls between the
                    # specified rows.
                    if i in lines:
                        yield line[start:end]

class GetAddress:
    def __init__(self):
        # Allow access to infile for use in set_addresses().
        global infile
        self.address_indices = list()
        self.phone_indices = list()
        self.addresses = list()
        self.count = 0

    def get(self, i, line):
        # Search for appropriate substrings and set indices accordingly.
        if 'Address:' in line[18:26]:
            self.address_indices += [i]
        if 'Phone:' in line[18:24]:
            self.phone_indices += [i]
        # Add address to list if both necessary indices have been collected.
        if i in self.phone_indices:
            self.set_addresses()

    def set_addresses(self):
        self.address = list()
        start = self.address_indices[self.count]
        end = self.phone_indices[self.count]
        # Create a generator that only yields substrings for rows between given
        # indices.
        self.generator = read_gen(
            infile,
            mode=1,
            start=40,
            end=91,
            rows=[start, end])
        # Collect each line of the address from the generator and remove
        # unnecessary spaces.
        for element in range(start, end):
            self.address += [next(self.generator).strip()]
        # This document has a header on each page and a portion of that is
        # collected in the address substring. Search for the header substring
        # and remove the corresponding elements from self.address.
        if len(self.address) > 3 and not self.address[-1]:
            self.address = self.address[:self.address.index('header text')]
        self.addresses += [self.address]
        self.count += 1

info = GetAddress()

for i, line in enumerate(read_gen(infile)):
    info.get(i, line)
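For comparison, the same fixed-width idea could also be written as a single pass that collects the address column between the 'Address:' and 'Phone:' marker rows. This is only an illustrative sketch, not code from the original post: the helper name collect_addresses is made up, and it reuses the column slices (18:26, 18:24, 40:91) from the code above, assuming every address block is delimited exactly by those markers:
def collect_addresses(infile):
    addresses = []
    current = None
    with open(infile, 'r') as f:
        for line in f:
            if current is not None and 'Phone:' in line[18:24]:
                # The Phone: row closes the current address block.
                addresses.append([part for part in current if part])
                current = None
            if 'Address:' in line[18:26]:
                # The Address: row opens a new address block.
                current = []
            if current is not None:
                # Take the fixed-width address column from each block row.
                current.append(line[40:91].strip())
    return addresses

addresses = collect_addresses('Vendor List.txt')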

What is the best way to sum numbers in a big text file?

What is the best way to sum numbers in a big text file?
The text file will contain numbers separated by a comma (',').
The numbers can be of any type.
No line or row limits.
for example:
1 ,-2, -3.45-7.8j ,99.6,......
...
...
Input: path to the text file
Output: the sum of the numbers
I tried to write a solution myself and want to know if there are better solutions:
This is my try:
I work with chunks of data rather than reading line by line. Because the end of a chunk can contain only part of a number (just -2 instead of -2+3j), I only process the "safe piece" up to the last comma (',') and save the remaining part for the next chunk.
import re

CHUNK_SIZE = 1017

def calculate_sum(file_path):
    _sum = 0
    with open(file_path, 'r') as _f:
        chunk = _f.read(CHUNK_SIZE)
        while chunk:
            chunk = chunk.replace(' ', '')
            safe_piece = chunk.rfind(',')
            next_chunk = chunk[safe_piece:] if safe_piece != 0 else ''
            if safe_piece != 0:
                chunk = chunk[:safe_piece]
            _sum += sum(map(complex, re.findall(r"[+-]\d*\.?\d*[+-]?\d*\.?\d*j|[+-]?\d+(?:\.\d+)?", chunk)))
            chunk = next_chunk + _f.read(CHUNK_SIZE)
    return _sum
Thanks!
This will add up any amount of numbers in a text file. Example:
input.txt
1,-2,-3.45-7.8j,99.6
-1,1-2j
1.5,2.5,1+1j
example.py
import csv

with open('input.txt', 'rb') as f:
    r = csv.reader(f)
    total = 0
    for line in r:
        total += sum(complex(col) for col in line)

print total
Output
(100.15-8.8j)
If you have really long lines and insufficient memory to read it in one go, then you could use a buffering class to chunk the reads and split numbers out of the buffer using a generator function:
import re

class Buffer:
    def __init__(self, filename, chunksize=4096):
        self.filename = filename
        self.chunksize = chunksize
        self.buf = ''

    def __iter__(self):
        with open(self.filename) as f:
            while True:
                if ',' in self.buf or '\n' in self.buf:
                    # split off the text up to the first separator
                    data, self.buf = re.split(r',|\n', self.buf, 1)
                    yield complex(data)
                else:
                    chunk = f.read(self.chunksize)
                    if not chunk:
                        # no more data to read: return the remaining buffer and exit
                        if self.buf:
                            yield complex(self.buf)
                        return
                    self.buf += chunk

total = 0
for num in Buffer('input.txt'):
    total += num

print total
Output:
(100.15-8.8j)
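The same buffered idea can also be expressed as a plain generator function rather than a class. This is only a sketch (the name read_numbers is made up), assuming commas and newlines are the only separators in the file:
import re

def read_numbers(filename, chunksize=4096):
    buf = ''
    with open(filename) as f:
        while True:
            chunk = f.read(chunksize)
            if not chunk:
                # End of file: whatever is left in the buffer is the last number.
                buf = buf.strip()
                if buf:
                    yield complex(buf)
                return
            buf += chunk
            pieces = re.split(r'[,\n]', buf)
            buf = pieces.pop()          # keep the (possibly partial) last token for the next read
            for piece in pieces:
                piece = piece.strip()
                if piece:
                    yield complex(piece)

total = sum(read_numbers('input.txt'))
print(total)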

How can I correctly print out this dictionary so that each word is sorted by its frequency (number of occurrences) in the text?

How can I correctly print out this dictionary so that each word is sorted by its frequency (number of occurrences) in the text?
slova = dict()
for line in text:
    line = re.split('[^a-z]', text)
    line[i] = filter(None, line)
    i =+ 1

i = 0
for line in text:
    for word in line:
        if word not in slova:
            slova[word] = i
            i += 1
I'm not sure what your text looks like, and you also haven't provided example output, but here is my guess. If this doesn't help, please update your question and I'll try again. The code uses Counter from collections to do all the heavy lifting. First, all of the words in all of the lines of the text are flattened into a single list; then this list is simply passed to Counter. The keys of the Counter (the words) are then sorted by their counts and printed out.
CODE:
from collections import Counter
import re

text = ['hello hi hello yes hello',
        'hello hi hello yes hello']

all_words = [w for l in text for w in re.split('[^a-z]', l)]
word_counts = Counter(all_words)

sorted_words = sorted(word_counts.keys(),
                      key=lambda k: word_counts[k],
                      reverse=True)

# Print out the word and counts
for word in sorted_words:
    print word, word_counts[word]
OUTPUT:
hello 6
yes 2
hi 2
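Counter also has a most_common() method that returns (word, count) pairs already sorted from most to least common, so the explicit sorted() call can be dropped; the same output loop could be written as:
for word, count in word_counts.most_common():
    print word, count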

TypeError: unhashable type: 'list' - creating frequency function

I am taking a text file as input and creating a function that counts which word occurs most frequently. If two or more words are tied for the most frequent, I will print all of those words.
def wordOccurance(userFile):
    userFile.seek(0)
    line = userFile.readline()
    lines = []
    while line != "":
        if line != "\n":
            line = line.lower()       # making lower case
            line = line.rstrip("\n")  # cleaning
            line = line.rstrip("?")   # cleans the whole document by removing "?"
            line = line.rstrip("!")   # cleans the whole document by removing "!"
            line = line.rstrip(".")   # cleans the whole document by removing "."
            line = line.split(" ")    # splits the texts into space
            lines.append(line)
        line = userFile.readline()    # keep reading lines from document.

    words = lines
    wordDict = {}  # creates the clean word dict from above
    for word in words:
        if word in wordDict.keys():
            wordDict[word] = wordDict[word] + 1
        else:
            wordDict[word] = 1

    largest_value = max(wordDict.values())
    for k in wordDict.keys():
        if wordDict[k] == largest_value:
            print(k)
    return wordDict
Please help me with this function.
In this line you are creating a list of strings:
line = line.split(" ") #splits the texts into space
Then you append it to a list, so you have a list of lists:
lines.append(line)
Later you loop through that list of lists, and try to use a sublist as a key:
for word in words:
    if word in wordDict.keys():
        wordDict[word] = wordDict[word] + 1
    else:
        wordDict[word] = 1  # Here you try to use a list (`word`) as a key, which is not allowed
One easy fix would be to flatten the list of lists first:
words = [item for sublist in lines for item in sublist]

for word in words:
    if word in wordDict.keys():
        wordDict[word] = wordDict[word] + 1
    else:
        wordDict[word] = 1
The list comprehension [item for sublist in lines for item in sublist] will loop through lines, then loop through the sublists created by line.split(" ") and return a new list consisting of the items in each sublist. For you, lines probably looks something like this:
[['words', 'on', 'line', 'one'], ['words', 'on', 'line', 'two']]
The list comprehension will turn it into this:
['words', 'on', 'line', 'one', 'words', 'on', 'line', 'two']
If you would like to use something a little less complicated, you could just use nested loops:
# words = lines
# just use `lines` in your for loop instead of creating an identical list
wordDict = {} #creates the clean word Dic, from above
for line in lines:
for word in line:
if word in wordDict.keys():
wordDict[word] = wordDict[word] + 1
else:
wordDict[word] = 1
largest_value = max(wordDict.values())
This will probably be a little less efficient and/or "Pythonic", but it will probably be easier to wrap your head around.
Also, you may want to consider splitting each line into words before cleaning the data, because if you clean the lines first, you will only remove punctuation at the end of lines rather than at the end of words. However, this might not be necessary depending on the nature of your data.
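For instance, a sketch of stripping punctuation per word rather than per line might look like this (raw_lines here is an assumed name for the unprocessed lines of the file, and only the ?, ! and . characters handled by the original function are removed):
wordDict = {}
for raw_line in raw_lines:               # raw_lines: the unprocessed lines of the file (assumed)
    for word in raw_line.lower().split():
        word = word.rstrip("?!.")        # clean each word, not just the end of the line
        if word:
            wordDict[word] = wordDict.get(word, 0) + 1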

Printing out the element in the ith position in a list in Python

f = open('julyTemps.txt')
for li in f.readlines():
    data = li.strip().split(' ')
    print data[1]
This code gives me an out of range error, even though the list is of length 3.
Please help.
with open('julyTemps.txt', 'r') as f:
    for line in f:
        data = line.strip().split(' ')
        if len(data) > 1:
            print data[1]
        else:
            print 'this line does not split as it should:\n%s' % line
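If some lines actually contain repeated spaces or tabs, split(' ') will also produce empty strings; splitting on any whitespace with split() (no argument) avoids that. A small variation of the loop above:
with open('julyTemps.txt', 'r') as f:
    for line in f:
        data = line.split()   # split on any run of whitespace; no empty strings
        if len(data) > 1:
            print data[1]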