Parsing Unbalanced Text into Tables in R - regex

I am trying to pull data from some text files on the SEC's EDGAR webpage, and I keep running into a similar problem: tables that look very simple in the text file are hard to parse into something useful in R. In particular, I can't figure out how to balance some of the tables when values are missing from a column, especially at the end of a row.
The approach I've taken so far is to read in the text files with readLines and split the strings on the tab delimiters, but this doesn't always work when there are missing values. Is there a better approach, or some way to intelligently coerce each row into a data frame? I can't get rbind.fill to work in this case.
Here is my most recent attempt:
raw.data = readLines("http://www.sec.gov/Archives/edgar/data/1349353/0001349353-13-000002.txt")
# parse basic document information
companyName = gsub("\t\tCOMPANY CONFORMED NAME:\t\t\t","",raw.data[grep("\t\tCOMPANY CONFORMED NAME:\t\t\t",raw.data)])
cik = gsub("\t\tCENTRAL INDEX KEY:\t\t\t","",raw.data[grep("\t\tCENTRAL INDEX KEY:\t\t\t",raw.data)])
secfilename = gsub("<FILENAME>","",raw.data[grep("<FILENAME>",raw.data)])
# trim down to table
table13f = raw.data[(grep("<TABLE>",raw.data)+1):(grep("</TABLE>",raw.data)-1)]
table13f = table13f[!grepl("INFORMATION TABLE",table13f, ignore.case=TRUE)]
table13f = table13f[!grepl("VOTING AUTHORITY",table13f, ignore.case=TRUE)]
table13f = table13f[!grepl("NAME OF ISSUER",table13f, ignore.case=TRUE)]
table13f = table13f[nchar(table13f)>0]
# extract data vectors
splittable = strsplit(table13f,"\t")
splittable2 = data.frame(splittable)
Thanks in advance for the help and/or advice!

You should be able to parse the table13f character vector with a single read.csv call. The fill = TRUE argument pads rows that have fewer fields than the longest row with blanks, which handles the missing trailing values:
data <- read.csv(text = table13f, header = TRUE, quote = "\"", sep = "\t", fill = TRUE)

Related

Python 2.7 - How to call individual columns from transposed csv file

I understand that the csv module exists; however, for my current project we are not allowed to use that module to read CSV files.
My code is as follows:
table = []
for line in open("data.csv"):
    data = line.split(",")
    table.append(data)

transposed = [[table[j][i] for j in range(len(table))] for i in range(len(table[0]))]
rows = transposed[1][1:]
rows = [float(i) for i in rows]
I'm really new to Python, so this is probably a massively basic question. I've been scouring the internet all day and have struggled to find a solution. All I need is to be able to pull the data from any individual column so I can analyse it. Thanks
Your data is organized in a list of lists, where each sub-list represents a row. To better illustrate this, I would avoid list comprehensions because they are more difficult to read. I would also avoid variable names like 'i' and 'j' and instead use more descriptive names like row or column. Here is a simple example of how I would accomplish this:
def read_csv():
    table = []
    with open("data.csv") as fileobj:
        for line in fileobj.readlines():
            data = line.strip().split(',')
            table.append(data)
    return table

def get_column_data(data, column_index):
    column_data = []
    for row in data:
        cell_data = row[column_index]
        column_data.append(cell_data)
    return column_data

data = read_csv()
get_column_data(data, column_index=2)  # example usage
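As an aside, the transposed-table trick from the question can be written more compactly with the built-in zip; this is a standard idiom rather than part of the code above:
columns = list(zip(*table))  # columns[2] is the third column as a tuple
zip(*table) pairs up the i-th element of every row, which amounts to a transpose as long as the table is rectangular.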

Get information from a TXT File

I have a txt file that has 7 columns and I am trying to extract data from it. Essentially there is a column of minimum values, a column that consists solely of dashes, a column of only maximum values, and a few others that I would like to break into their own lists (I think that's the way to go). Any help would be much appreciated. Thanks!
Edit: Sorry, I should have been clearer. I am using Python 3.5, reading straight from the txt file and using split. I guess I should ask where to go from there. I currently have it loading the file and using split(). End game, I would like to be able to put each column into its own list so I can calculate averages, percentages, etc. Thanks again, and sorry about the bad initial post; it's my first time posting here.
file = open("year2000.txt")
for line in file:
    z = line.strip()
    z = line.find(" ")
    min_sal1 = line[:z]
    min_sal2 = min_sal1.replace(',', '')
    min_sal3 = min_sal2.find('.')
    min_sal4 = min_sal1[:min_sal3]
    min_sal = int(min_sal4)
    print(min_sal4)
    y = z.find(' ', 2)
    x = z.find(' ', 3)
    max_sal = line[y:x]
    print(max_sal)
After running this, I get a list of all the min salaries like I should; however, for the max values I am getting just a bunch of blank lines. I also plan on putting each type of value into its own list. Thanks
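A minimal sketch of the columns-into-lists goal described above, assuming whitespace-separated columns with the minimum salary in the first column and the maximum in the third (the file layout and column positions are assumptions, not confirmed by the question):
min_salaries = []
max_salaries = []
with open("year2000.txt") as f:
    for line in f:
        fields = line.split()  # split on any run of whitespace
        if not fields:
            continue  # skip blank lines
        min_salaries.append(float(fields[0].replace(',', '')))
        max_salaries.append(float(fields[2].replace(',', '')))

average_min = sum(min_salaries) / len(min_salaries)
Once each column is its own list, averages and percentages fall out of sum() and len() directly.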

Python .splitlines() to segment text into separate variables

I've read the other threads on this site but haven't quite grasped how to accomplish what I want to do. I'd like to find a method like .splitlines() to assign the first two lines of text in a multiline string into two separate variables. Then group the rest of the text in the string together in another variable.
The purpose is to have consistent data-sets to write to a .csv using the three variables as data for separate columns.
Title of a string
Description of the string
There are multiple lines under the second line in the string!
There are multiple lines under the second line in the string!
There are multiple lines under the second line in the string!
Any guidance on the pythonic way to do this would be appreciated.
Using islice
In addition to normal list slicing, you can use islice(), which is more efficient when taking slices of larger lists.
The code would look like this:
from itertools import islice

with open('input.txt') as f:
    data = f.readlines()

first_line_list = list(islice(data, 0, 1))
second_line_list = list(islice(data, 1, 2))
other_lines_list = list(islice(data, 2, None))

first_line_string = "".join(first_line_list)
second_line_string = "".join(second_line_list)
other_lines_string = "".join(other_lines_list)
However, you should keep in mind that the data source you read from must be long enough. If it is not, the islice() calls above simply yield fewer (or no) items and the joined strings come back empty, while plain indexing such as data[0] would raise an IndexError.
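If silently producing empty strings would be a problem, a small guard makes the failure explicit (assuming data is the list of lines read above):
if len(data) < 2:
    raise ValueError("input.txt must contain at least two lines")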
Using regex
In the comments below, the OP additionally asked for a list-less approach.
Since reading data from a file yields a string, and string handling leads to lists later on (or directly to a list of read lines), I suggested using a regex instead.
I cannot say anything about the relative performance of list/string handling versus regex operations. However, this should do the job:
import re

regex = r'(?P<first>.+)(\n)(?P<second>.+)([\n]{2})(?P<rest>.+[\n])'
preg = re.compile(regex, re.MULTILINE | re.DOTALL)

with open('input.txt') as f:
    data = f.read()

match = preg.search(data)
first_line = match.group('first')
second_line = match.group('second')
rest_lines = match.group('rest')
If I understand correctly, you want to split a large string into lines
lines = input_string.splitlines()
After that, you want to assign the first and second line to variables and the rest to another variable
title = lines[0]
description = lines[1]
rest = lines[2:]
If you want 'rest' to be a string, you can achieve that by joining it with a newline character.
rest = '\n'.join(lines[2:])
A different, very fast option is:
lines = input_string.split('\n', maxsplit=2)  # this splits off only the first two lines; the rest stays one string
title = lines[0]
description = lines[1]
rest = lines[2]

Efficient means of mass converting thousands of different find and replaces within one file

I am trying to convert a map file for some SNP data I want to use from Affy ids to dbSNP rs ids.
I am trying to find an effective way to do this. I have the annotation file for the Axiom array from which the data comes, so I know the proper ids.
I was wondering if anyone could suggest a good bash/Python/Perl-based method to do this. It amounts to >100,000 different replacements. The idea I had in mind was the
sed -i 's/Affy#/rs#/g' filename
method, but I figure this would not be the most efficient. Any suggestions? Thanks!
Python code, assuming your substitutions are stored in subs.csv:
import csv

# build a lookup table mapping Affx ids to rs ids
subs = dict(csv.reader(open('subs.csv'), delimiter='\t'))
source = csv.reader(open('all_snp.map'), delimiter='\t')
dest = csv.writer(open('all_snp_out.map', 'wb'), delimiter='\t')  # on Python 3, use open(..., 'w', newline='')

for row in source:
    row[1] = subs.get(row[1], row[1])
    dest.writerow(row)
About the line row[1] = subs.get(row[1], row[1]): row[1] is the Affx column, and the dictionary lookup replaces it with the rs-number equivalent if there is one, or returns the original Affx id if there isn't.
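To make that fallback concrete, here is how subs.get behaves with a known and an unknown id (these ids are made up for illustration):
subs = {'AFFX-123': 'rs456'}
subs.get('AFFX-123', 'AFFX-123')   # -> 'rs456'    (known id is mapped)
subs.get('AFFX-999', 'AFFX-999')   # -> 'AFFX-999' (unknown id passes through unchanged)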

String from CSV to list - Python

I don't get it. I have CSV data with the following content:
wurst;ball;hoden;sack
1;2;3;4
4;3;2;1
I want to iterate over the CSV data and put the heads in one list and the content in another list. Here's my code so far:
data = [i.strip() for i in open('test.csv', 'r').readlines()]
for i_c, i in enumerate(data):
    if i_c == 0:
        heads = i
    else:
        content = i

heads.split(";")
content.split(";")
print heads
This always prints the following string, not a list:
wurst;ball;hoden;sack
Why does split not work on this string?
Greetings and merry Christmas,
Jan
The split method returns a new list; it does not modify the string in place. Try:
heads = heads.split(";")
content = content.split(";")
I've also noticed that your data seems to be all integers. You might consider the following for content instead:
content = [int(i) for i in content.split(";")]
The reason is that split returns a list of strings, and it seems like you may need to deal with them as numbers later in your code. Of course, disregard this if you expect non-numeric data to show up at some point.
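Putting it together, a fuller sketch; this assumes the test.csv shown in the question, and that you want every data row kept (the original loop overwrites content on each pass, so only the last row survives):
with open('test.csv', 'r') as f:
    data = [line.strip() for line in f]

heads = data[0].split(';')
content = [[int(x) for x in line.split(';')] for line in data[1:]]
print heads    # ['wurst', 'ball', 'hoden', 'sack']
print content  # [[1, 2, 3, 4], [4, 3, 2, 1]]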