Get information from a TXT File - list

I have a txt file with 7 columns that I am trying to extract data from. Essentially there is a column of minimum values, a column consisting solely of dashes, a column of only maximum values, and a few others that I would like to break into their own lists (I think that's the way to go). Any help would be much appreciated. Thanks!
Edit: Sorry, I should have been clearer. I am using Python 3.5, reading straight from the txt file, and using split(). I guess I should ask where to go from there: I currently have it loading a file and calling split(). End game, I would like to be able to put each column into its own list so I can calculate averages, percentages, etc. Thanks again, and sorry about the bad initial post; it's my first time posting here.
file = open("year2000.txt")
for line in file:
    z = line.strip()
    z = line.find(" ")
    min_sal1 = line[:z]
    min_sal2 = min_sal1.replace(',', '')
    min_sal3 = min_sal2.find('.')
    min_sal4 = min_sal1[:min_sal3]
    min_sal = int(min_sal4)
    print(min_sal4)
    y = z.find(' ', 2)
    x = z.find(' ', 3)
    max_sal = line[y:x]
    print(max_sal)
After running this, I get a list of all the min salaries like it should, however for the max values I am getting just a bunch of blank lines. I also plan on putting each type of value into its own list. Thanks
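For the end goal of per-column lists, here is a minimal sketch, assuming the seven columns are whitespace-separated with the minimum salary in the first column and the maximum in the third (the exact indices depend on your file's layout):

min_sals = []
max_sals = []

with open("year2000.txt") as file:
    for line in file:
        columns = line.split()  # one entry per whitespace-separated column
        if len(columns) < 3:
            continue  # skip blank or malformed lines
        # drop commas and the decimal part before converting, as in the question
        min_sals.append(int(columns[0].replace(',', '').split('.')[0]))
        max_sals.append(int(columns[2].replace(',', '').split('.')[0]))

print(sum(min_sals) / len(min_sals))  # e.g. the average minimum salary

Because split() with no argument splits on any run of whitespace, this avoids the manual find(' ') arithmetic that breaks when columns are padded with multiple spaces.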

Related

Scoring multiple TRUEs in Python RE Search

Background
I have a list of "bad words" in a file called bad_words.conf, which reads as follows
(I've changed it so that it's clean for the sake of this post, but in real life they are expletives):
wrote (some )?rubbish
swore
I have a user input field which is cleaned and stripped of dangerous characters before being passed as data to the following script, score.py
(for the sake of this example I've just typed in the value for data)
import re
data = 'I wrote some rubbish and swore too'
# Get list of bad words
bad_words = open("bad_words.conf", 'r')
lines = bad_words.read().split('\n')
combine = "(" + ")|(".join(lines) + ")"
# set score in case no results
score = 0
# search for bad words
if re.search(combine, data):
    # add one for a hit
    score += 1
# show me the score
print(str(score))
bad_words.close()
Now this finds a result and adds a score of 1, as expected, without a loop.
Question
I need to adapt this script so that I can add 1 to the score every time a line of "bad_words.conf" is found within text.
So in the instance above, with data = 'I wrote some rubbish and swore too', I would like to score a total of 2: 1 for "wrote some rubbish" and 1 for "swore".
Thanks for the help!
Changing combine to just:
combine = "|".join(lines)
And using re.findall():
In [33]: re.findall(combine,data)
Out[33]: ['rubbish', 'swore']
The problem with the multiple capturing groups you originally had is that re.findall() returns the groups rather than the whole match, so every group that didn't participate in a match comes back as an empty string when one of the words is matched.
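To turn the matches into a score, here is a sketch of the counting step, assuming one pattern per line in bad_words.conf as above. re.finditer() yields one match object per hit and is unaffected by any capturing groups inside the patterns (such as the (some )? in the first line), so the count is always the number of whole-pattern matches:

import re

# assumes bad_words.conf holds one regex pattern per line
with open("bad_words.conf") as bad_words:
    lines = [line for line in bad_words.read().split('\n') if line]

combine = "|".join(lines)
data = 'I wrote some rubbish and swore too'
score = sum(1 for _ in re.finditer(combine, data))
print(str(score))  # 2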

Parsing Unbalanced Text into Tables in R

I am trying to pull data from some text files on the SEC's EDGAR webpage and I keep running into a similar problem where there are tables that visually look very simple in the text file, but I have trouble parsing them into something useful in R. In particular, I can't seem to figure out how to balance some of the tables when there are either values missing in a column, especially at the end.
The approach I've taken so far is to read in the text files with readLines and split the strings based on the tab delimiters, but this doesn't always work when there are missing values. Is there a better approach or some way to intelligently coerce each row into a data frame? I can't seem to get rbind.fill to work in this case.
Here is my most recent attempt:
raw.data = readLines("http://www.sec.gov/Archives/edgar/data/1349353/0001349353-13-000002.txt")
# parse basic document information
companyName = gsub("\t\tCOMPANY CONFORMED NAME:\t\t\t","",raw.data[grep("\t\tCOMPANY CONFORMED NAME:\t\t\t",raw.data)])
cik = gsub("\t\tCENTRAL INDEX KEY:\t\t\t","",raw.data[grep("\t\tCENTRAL INDEX KEY:\t\t\t",raw.data)])
secfilename = gsub("<FILENAME>","",raw.data[grep("<FILENAME>",raw.data)])
# trim down to table
table13f = raw.data[(grep("<TABLE>",raw.data)+1):(grep("</TABLE>",raw.data)-1)]
table13f = table13f[!grepl("INFORMATION TABLE",table13f, ignore.case=TRUE)]
table13f = table13f[!grepl("VOTING AUTHORITY",table13f, ignore.case=TRUE)]
table13f = table13f[!grepl("NAME OF ISSUER",table13f, ignore.case=TRUE)]
table13f = table13f[nchar(table13f)>0]
# extract data vectors
splittable = strsplit(table13f,"\t")
splittable2 = data.frame(splittable)
Thanks in advance for the help and/or advice!
You should be able to parse the last table13f string using the following line:
data <- read.csv(text=table13f, header = T, quote = "\"", sep = "\t", fill = T)
The fill = T argument pads rows that have fewer fields than the header with blanks/NA, which is what balances the rows with values missing at the end.

how to improve the speed of the python script

I'm very new to python. I'm working in the area of hydrology and I want to learn python to assist me with processing hydrological data.
At the moment I am writing a script to extract bits of information from a big data set. I have three csv files:
Complete_borelist.csv
Borelist_not_interested.csv
Elevation_info.csv
I want to create a file that has all the bores that are in Complete_borelist.csv but not in Borelist_not_interested.csv. I also want to grab some information from Complete_borelist.csv and Elevation_info.csv for the bores which satisfy the first criterion.
My script is as follow:
not_interested_list = []
outfile1 = open('output.csv', 'w')
outfile1.write('Station_ID,Name,Easting,Northing,Location_name,Elevation')
outfile1.write('\n')
with open('Borelist_not_interested.csv', 'r') as f1:
    for line in f1:
        if not line.startswith('Station'):  # ignore header
            line = line.rstrip()
            words = line.split(',')
            station = words[0]
            not_interested_list.append(station)
with open('Complete_borelist.csv', 'r') as f2:
    next(f2)  # ignore header
    for line in f2:
        line = line.rstrip()
        words = line.split(',')
        station = words[0]
        if not station in not_interested_list:
            loc_name = words[1]
            easting = words[4]
            northing = words[5]
            outfile1.write(station + ',' + easting + ',' + northing + ',' + loc_name + ',')
            with open('Elevation_info.csv', 'r') as f3:
                next(f3)  # ignore header
                for line in f3:
                    line = line.rstrip()
                    data = line.split(',')
                    bore_id = data[0]
                    if bore_id == station:
                        elevation = data[4]
                        outfile1.write(elevation)
                        outfile1.write('\n')
outfile1.close()
I have two issues with the script:
The first is that Elevation_info.csv doesn't have information for every bore in Complete_borelist.csv. When my loop gets to a station for which it can't find an elevation record, the script doesn't write "null" but continues writing the information for the next station on the same line. Can anyone help me fix this please?
The second is that my complete borelist is more than 200,000 rows and my script runs through them very slowly. Does anyone have any suggestions to make it run faster?
Very much appreciated and sorry if my question is too long.
Performance-wise, this has a couple of problems. The first one is that you are opening and re-reading the elevation info for every matching line of the complete file. Read the elevation info into a dictionary keyed on the bore_id before you open the complete file. Then you can test the dictionary very fast to see if a station is in it, instead of re-reading the file.
The second performance issue is that you don't stop searching the elevation file once you find a match. The dictionary idea solves that too, but otherwise a break once you have a match would help a little.
For the null-printing problem, you just need to outfile1.write("\n") if the bore_id is not in the dictionary. An else statement on the dictionary test does that. In the current code, an else clause closing the for loop would do it, or even changing the indentation of that last write("\n").
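Putting those suggestions together, here is a minimal sketch of the dictionary approach, assuming the same column positions as the original script:

# read the stations to skip into a set for fast membership tests
not_interested = set()
with open('Borelist_not_interested.csv') as f1:
    next(f1)  # skip header
    for line in f1:
        not_interested.add(line.rstrip().split(',')[0])

# read the elevations once, keyed on bore_id, instead of re-reading the file
elevations = {}
with open('Elevation_info.csv') as f3:
    next(f3)  # skip header
    for line in f3:
        data = line.rstrip().split(',')
        elevations[data[0]] = data[4]

with open('Complete_borelist.csv') as f2, open('output.csv', 'w') as outfile1:
    outfile1.write('Station_ID,Name,Easting,Northing,Location_name,Elevation\n')
    next(f2)  # skip header
    for line in f2:
        words = line.rstrip().split(',')
        station = words[0]
        if station not in not_interested:
            elevation = elevations.get(station, '')  # blank when no record exists
            outfile1.write(','.join([station, words[4], words[5], words[1], elevation]) + '\n')

Each lookup in the set and the dictionary is roughly constant time, so the whole run is a single pass over each file instead of one pass over Elevation_info.csv per station.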

Python 2.7 - Split comma separated text file into smaller text files

I was (unsuccessfully) trying to figure out how to create a list of compound letters using loops. I am a beginner programmer and have been learning Python for a few months. Fortunately, I later found a solution to this problem - Generate a list of strings compound of letters from other list in Python - see the first answer.
So I took that code and added a little to it for my needs: I randomized the list and turned it into a comma-separated file. This is the code:
from string import ascii_lowercase as al
from itertools import product
import random
list = ["".join(p) for i in xrange(1,6) for p in product(al, repeat = i)]
random.shuffle(list)
joined = ",".join(list)
f = open("double_letter_generator_output.txt", 'w')
print >> f, joined
f.close()
What I need to do now is split that massive file "double_letter_generator_output.txt" into smaller files. Each file needs to consist of 200 'words'. So it will need to split into many files. The files of course do not exist yet and will need to be created by the program also. How can I do that?
Here's how I would do it, but I'm not sure why you're splitting this into smaller files. I would normally do it all at once, but I'm assuming the file is too big to be stored in working memory, so I'm traversing one character at a time.
Let bigfile.txt contain
1,2,3,4,5,6,7,8,9,10,11,12,13,14
MAX_NUM_ELEMS = 2  # you'll want this to be 200
nameCounter = 1
numElemsCounter = 0
with open('bigfile.txt', 'r') as bigfile:
    outputFile = open('output' + str(nameCounter) + '.txt', 'a')
    for letter in bigfile.read():
        if letter == ',':
            numElemsCounter += 1
        if numElemsCounter == MAX_NUM_ELEMS:
            numElemsCounter = 0
            outputFile.close()
            nameCounter += 1
            outputFile = open('output' + str(nameCounter) + '.txt', 'a')
        else:
            outputFile.write(letter)
    outputFile.close()
Now output1.txt is 1,2, output2.txt is 3,4, output3.txt is 5,6, etc.
$ cat output7.txt
13,14
This is a little sloppy; you should write a nice function to do it and format it the way you like!
FYI, if you want to write to a bunch of different files, there's no reason to write to one big file first. Write to the little files right off the bat.
This way, the last file might have fewer than MAX_NUM_ELEMS elements.
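Following that advice, here is a Python 2 sketch that writes the chunks directly and skips the big intermediate file; words stands for the shuffled list from the question's code (renamed here to avoid shadowing the built-in list):

CHUNK_SIZE = 200  # number of 'words' per output file

# words is the shuffled list built in the question's code
for n, start in enumerate(xrange(0, len(words), CHUNK_SIZE), 1):
    with open('output' + str(n) + '.txt', 'w') as f:
        f.write(",".join(words[start:start + CHUNK_SIZE]))

The slice words[start:start + CHUNK_SIZE] is simply shorter on the final pass, which is where the last file can end up with fewer than CHUNK_SIZE elements.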

Select random group of items from txt file

I'm working on a simple Python game where the computer tries to guess a number you think of. Every time it guesses the right answer, it saves the answer to a txt file. When the program is run again, it will guess the old answers first (if they're in the range the user specifies).
import random

try:
    f = open("OldGuesses.txt", "a")
    r = open("OldGuesses.txt", "r")
except IOError as e:
    f = open("OldGuesses.txt", "w")
    r = open("OldGuesses.txt", "r")
data = r.read()
number5 = random.choice(data)
print number5
When I run that to pull the old answers, it grabs one item. Say I have the numbers 200, 1242, and 1343, with spaces to tell them apart; it will pick either a space or a single digit. Any idea how to grab the full number (like 200) and/or avoid picking spaces?
The r.read() call reads the entire contents of r and returns it as a single string. What you can do is use a list comprehension in combination with r.readlines(), like this:
data = [int(x) for x in r.readlines()]
which breaks up the file into lines and converts each line to an integer.
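Putting it together, here is a sketch that builds on the answer above; it assumes one number per line in OldGuesses.txt. If the numbers are separated by spaces instead, use r.read().split() in place of r.readlines():

import random

with open("OldGuesses.txt", "r") as r:
    # one int per line; int() tolerates the trailing newline on each line
    data = [int(x) for x in r.readlines()]

number5 = random.choice(data)  # picks a whole number, never a space or a digit
print number5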