Importing unfriedly formatted data in Excel and forcing messy values as column names - regex

I'm trying to import some publicly available life outcomes data using the code below:
require(gdata)
# Source SIMD12 data zone level data
simd.sg.xls <- read.xls(xls = "http://www.gov.scot/Resource/0044/00447385.xls",
sheet = "Quick Lookup", verbose = TRUE)
Naturally, the imported data frame doesn't look good:
I would like to amend my column names using the code below:
# Clean column names
names(simd.sg.xls) <- make.names(names = as.character(simd.sg.xls[1,]),
unique = TRUE,allow_ = TRUE)
But it produces rather unpleasant results:
> names(simd.sg.xls)
[1] "X1" "X1.1" "X771" "X354" "X229" "X74" "X67" "X33" "X19" "X1.2"
[11] "X6" "X1.3" "X8" "X7" "X7.1" "X6506" "X21" "X1.4" "X6158" "X6506.1"
[21] "X6506.2" "X6506.3" "X6263" "X6506.4" "X6468" "X1010" "X815" "X99" "X58" "X65"
[31] "X60" "X6506.5" "X21.1" "X1.5" "X6173" "X5842" "X6506.6" "X6506.7" "X6263.1" "X6506.8"
[41] "X6481" "X883" "X728" "X112" "X69" "X56" "X54" "X6506.9" "X21.2" "X1.6"
[51] "X6143" "X5651" "X6506.10" "X6506.11" "X6263.2" "X6506.12" "X6480" "X777" "X647" "X434"
[61] "X518" "X246" "X436" "X6506.13" "X21.3" "X1.7" "X6136" "X5677" "X6506.14" "X6506.15"
[71] "X6263.3" "X6506.16" "X660" "X567" "X480" "X557" "X261" "X456"
My question is if there is a way to neatly force the values from the first row to the column names? As I'm doing a lot of data I'm looking for solution that would be easily reproducible, I can accommodate a lot of violation to the actual strings to get syntactically correct names but ideally I would avoid faffing around with elaborate regular expressions as I'm often reading files like the one linked here and don't wan to be forced to adjust the rules for each single import.

It looks like the problem is that the header is on the second line, not the first. You could include a skip=1 argument but a more general way of dealing with this using read.xls seems to be to use the pattern and header arguments which force the first line which matches the pattern string to be treated as the header. Your code becomes:
require(gdata)
# Source SIMD12 data zone level data
simd.sg.xls <- read.xls(xls = "http://www.gov.scot/Resource/0044/00447385.xls",
sheet = "Quick Lookup", verbose = TRUE,
pattern="DATAZONE", header=TRUE)
UPDATE
I don't get the warning messages you do when I execute the code. The messages refer to an issue with locale. The locale settings on my system are:
Sys.getlocale()
[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
Yours are probably different. Locale data could be OS dependent. I'm using Windows 8.1. Also I'm using Strawberry Perl; you appear to be using something else. So some possible reasons for the discrepancy in warning messages but nothing more specific.
On the second question in your comment, to read the entire file, and convert a particular row ( in this case, row 2) to column names, you could use the following code:
simd.sg.xls <- read.xls(xls = "http://www.gov.scot/Resource/0044/00447385.xls",
sheet = "Quick Lookup", verbose = TRUE,
header=FALSE, stringsAsFactors=FALSE)
names(simd.sg.xls) <- make.names(names = simd.sg.xls[2,],
unique = TRUE,allow_ = TRUE)
simd.sg.xls <- simd.sg.xls[-(1:2),]
All data will be of character type so you'll need to convert to factor and numeric as necessary.

Related

Converting segments of large .cif files to smaller .pdb files

I'm trying to carve out some binding sites with ligands from cif-files of ribosome crystal structures, and have encountered an annoying problem involving a type error.
TypeError: %c requires int or char
Using the code below,
from Bio.PDB import *
from Bio import PDB
class save_res(Select):
def accept_residue(self, residue):
if residue in keep_res_list:
print(residue)
return 1
else:
return 0
keep_res_list = []
parser = MMCIFParser()
structure = parser.get_structure("1vvj.cif", "./1vvj.cif")
structure = structure[0]
atom_list = Selection.unfold_entities(structure, "A") # A for atoms
ns = NeighborSearch(atom_list)
for residue in structure.get_residues():
if residue.get_resname() == "PAR":
for atom in residue:
center = atom.get_coord()
neighbors = ns.search(center, 5.0)
neighbor_residue_list = Selection.unfold_entities(neighbors, "R")
for res in neighbor_residue_list:
if res not in keep_res_list:
keep_res_list.append(res)
io = PDBIO()
io.set_structure(structure)
io.save("1vvj_bs.pdb", save_res())
gives me the error:
File "/scratch/software/anaconda3/envs/my-devel-3.6/lib/python3.6/site-packages/Bio/PDB/PDBIO.py", line 112, in _get_atom_line
return _ATOM_FORMAT_STRING % args
TypeError: %c requires int or char
This code works well when changing the pdb-id to 1fyb, which also has the same ligand id.
I'm thinking the problem stems from the vast amounts of chains and their IDs in the original file. Am I completely wrong in this assumption or does anyone know how to fix this? I've been trying to find a way to rename the chain IDs, but haven't found a viable method to do this.
Your help is appreciated.
The chain name format in _ATOM_FORMAT_STRING is %c, while in this case you have chain named QA.
Chain names in PDB files were traditionally single characters.
But there are only so many letters and digits. For ribosome it's necessary to use longer names. The pdb format has space for a second letter -- empty column on the left from the 1-character chain name. Many programs support it, but not all, and this is not part of the official specification.
So you can either use PDB files with 2-character chains (if the rest of your workflow supports it) or rename chains in the output (your output is only a tiny part of the original structure).
Here is how to do it in gemmi:
import gemmi
structure = gemmi.read_structure('1vvj.cif')
model = structure[0]
ns = gemmi.NeighborSearch(model, structure.cell, 5.0).populate()
for chain in model:
for residue in chain:
if residue.name == 'PAR':
for atom in residue:
for nb in ns.find_neighbors(atom):
nb.to_cra(model).residue.flag = 'y'
sel = gemmi.Selection().set_residue_flags('y')
new_structure = sel.copy_structure_selection(structure)
#new_structure.remove_empty_chains()
#new_structure.shorten_chain_names()
new_structure.write_minimal_pdb('1vvj-par.pdb')
The two commented out lines are renaming the chains.
One difference comparing with your code is that the NeighborSearch in gemmi is symmetry-aware. It finds also nearby atoms from symmetry mates. In BioPython you search only in asymmetric unit (asu).
Both are different than the biological assembly --
PDB-101 covers it nicely.
If you'd like to search in asu only -- replace structure.cell with gemmi.UnitCell() above, i.e. don't pass the unit cell information.
(You can ask such questions on bioinformatics.SE -- it should get answer sooner there).

How to change 'eval' output format in a document. (remove the ##)

When R-Markdown output a chunk result, it insert a ## before the result.
I would like to know how could I change the default result format, specially how to change the double hashtag for another symbol (to not make beginners confused with the code comments).
You can change this within a setup chunk, e.g.
```{r setup, include = FALSE}
knitr::opts_chunk$set(comment = "#>")
```
will prefix #> instead of ##.

What causes "UserWarning: Discarded range with reserved name" - openpyxl

I have a simple EXCEL-sheet with names of cities in column A and I want to extract them and put them in a list:
def getCityfromEXCEL():
wb = load_workbook(filename='test.xlsx', read_only=True)
ws = wb['Sheet1']
cityList = []
for i in range(2, ws.get_highest_row()+1):
acell = "A"+str(i)
cityString = ws[acell].value
city = ftfy.fix_text_encoding(cityString)
cityList.append(city)
getCityfromEXCEL()
With a small file that worked perfectly (70 rows). Now I'm processing a big file (8300 rows) and it gives me this error:
/Library/Python/2.7/site-packages/openpyxl/workbook/names/named_range.py:121: UserWarning: Discarded range with reserved name
warnings.warn("Discarded range with reserved name")
but it does not abort. It just does not seem to continue anymore. Can someone tell me what might cause the error? Is it something in the .xlsx? Any special hints what I can look for?
It's supposed to be a friendly warning letting you know that some of the defined names are being lost when reading the file. Warnings in Python are not exceptions but informational notices.
Support for defined names is essentially limited to references to cell ranges in openpyxl at the moment. But they can refer to lots of other things like printing settings. However, if the objects/values they refer to are not preserved by openpyxl and the file is saved and later opened by Excel it might complain about the missing objects.
If you want to ignore it:
import warnings
warnings.simplefilter("ignore")
wb = load_workbook(path)
warnings.simplefilter("default")
In my case this warning shows up when filtering is on one of my worksheets. I wanted to suppress the warning so that it didn't bother my users and I just put this line in my code before the openpyxl.load_workbook call:
warnings.simplefilter("ignore")

Parsing Unbalanced Text into Tables in R

I am trying to pull data from some text files on the SEC's EDGAR webpage and I keep running into a similar problem where there are tables that visually look very simple in the text file, but I have trouble parsing them into something useful in R. In particular, I can't seem to figure out how to balance some of the tables when there are either values missing in a column, especially at the end.
The approach I've taken so far is to read in the text files with readLines and split the strings based on the tab delimiters, but this doesn't always work when there are missing values. Is there a better approach or some way to intelligently coerce each row into a data frame? I can't seem to get rbind.fill to work in this case.
Here is my most recent attempt:
raw.data = readLines("http://www.sec.gov/Archives/edgar/data/1349353/0001349353-13-000002.txt")
# parse basic document information
companyName = gsub("\t\tCOMPANY CONFORMED NAME:\t\t\t","",raw.data[grep("\t\tCOMPANY CONFORMED NAME:\t\t\t",raw.data)])
cik = gsub("\t\tCENTRAL INDEX KEY:\t\t\t","",raw.data[grep("\t\tCENTRAL INDEX KEY:\t\t\t",raw.data)])
secfilename = gsub("<FILENAME>","",raw.data[grep("<FILENAME>",raw.data)])
# trim down to table
table13f = raw.data[(grep("<TABLE>",raw.data)+1):(grep("</TABLE>",raw.data)-1)]
table13f = table13f[!grepl("INFORMATION TABLE",table13f, ignore.case=TRUE)]
table13f = table13f[!grepl("VOTING AUTHORITY",table13f, ignore.case=TRUE)]
table13f = table13f[!grepl("NAME OF ISSUER",table13f, ignore.case=TRUE)]
table13f = table13f[nchar(table13f)>0]
# extract data vectors
splittable = strsplit(table13f,"\t")
splittable2 = data.frame(splittable)
Thanks in advance for the help and/or advice!
You should be able to parse the last table13f string using the following line:
data <- read.csv(text=table13f,header = T,quote = "\"", sep = "\t", fill = T)

Parameterize pattern match as function argument in R

I've a directory with csv files, about 12k in number, with the naming format being
YYYY-MM-DD<TICK>.csv
. The <TICK> refers to ticker of a stock, e.g. MSFT, GS, QQQ etc. There are total 500 tickers, of various length.
My aim is to merge all the csv for a particular tick and save as a zoo object in individual RData file in a separate directory.
To automate this I've managed to do the csv manipulation, setup as a function which gets a ticker as input, does all the data modification. But I'm stuck in making the file listing stage, passing the pattern to match the ticker being processed. I'm unable to make the pattern to be matched dependent on the ticker.
Below is the function i've tried to make work, doesn't work:
csvlist2zoo <- function(symbol){
csvlist=list.files(path = "D:/dataset/",pattern=paste("'.*?",symbol,".csv'",sep=""),full.names=T)
}
This works, but can't make it work in function
csvlist2zoo <- function(symbol){
csvlist=list.files(path = "D:/dataset/",pattern='.*?"ibm.csv',sep=""),full.names=T)
}
Searched in SO, there are similar questions, not exactly meeting my requirement. But if I missed something please point out in the right direction. Still fighting with regex.
OS: Win8 64bit, R version-3.1.0 (if needed)
Try:
csvlist2zoo <- function(symbol){
list.files(pattern=paste0('\\d{4}-\\d{2}-\\d{2}',symbol, ".csv"))
}
csvlist2zoo("QQQ")
#[1] "2002-12-19QQQ.csv" "2008-01-25QQQ.csv"
csvlist2zoo("GS")
#[1] "2005-05-18GS.csv"
I created some files in the working directory (linux)
v1 <- c("2001-05-17MSFT.csv", "2005-05-18GS.csv", "2002-12-19QQQ.csv", "2008-01-25QQQ.csv")
lapply(v1, function(x) write.csv(1:3, file=x))
Update
Using paste
csvlist2zoo <- function(symbol){
list.files(pattern=paste('\\d{4}-\\d{2}-\\d{2}',symbol, ".csv", sep=""))
}
csvlist2zoo("QQQ")
#[1] "2002-12-19QQQ.csv" "2008-01-25QQQ.csv"