I have a simple EXCEL-sheet with names of cities in column A and I want to extract them and put them in a list:
from openpyxl import load_workbook
import ftfy

def getCityfromEXCEL():
    wb = load_workbook(filename='test.xlsx', read_only=True)
    ws = wb['Sheet1']
    cityList = []
    # Newer openpyxl releases expose ws.max_row instead of get_highest_row()
    for i in range(2, ws.get_highest_row()+1):
        acell = "A"+str(i)
        cityString = ws[acell].value
        city = ftfy.fix_text_encoding(cityString)
        cityList.append(city)
    return cityList

getCityfromEXCEL()
With a small file that worked perfectly (70 rows). Now I'm processing a big file (8300 rows) and it gives me this error:
/Library/Python/2.7/site-packages/openpyxl/workbook/names/named_range.py:121: UserWarning: Discarded range with reserved name
warnings.warn("Discarded range with reserved name")
but it does not abort. It just does not seem to continue anymore. Can someone tell me what might cause the error? Is it something in the .xlsx? Any special hints what I can look for?
It's supposed to be a friendly warning letting you know that some of the defined names are being lost when reading the file. Warnings in Python are not exceptions but informational notices.
At the moment, openpyxl's support for defined names is essentially limited to references to cell ranges, but defined names can refer to lots of other things, such as print settings. If the objects or values they refer to are not preserved by openpyxl and the file is saved and later opened in Excel, Excel might complain about the missing objects.
If you want to ignore it:
import warnings
from openpyxl import load_workbook

warnings.simplefilter("ignore")
wb = load_workbook(path)  # path is the location of your .xlsx file
warnings.simplefilter("default")
In my case this warning shows up when a filter is active on one of my worksheets. I wanted to suppress the warning so that it didn't bother my users, so I put this line in my code before the openpyxl.load_workbook call:
warnings.simplefilter("ignore")
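A narrower alternative to switching the filter off globally is to suppress warnings only around the load call with warnings.catch_warnings() from the standard library. This is a minimal sketch, not openpyxl code: noisy_load is a made-up stand-in for load_workbook so the example runs without a workbook, and load_quietly is a hypothetical wrapper name.

```python
import warnings


def noisy_load(path):
    """Hypothetical stand-in for load_workbook: emits a UserWarning, then returns."""
    warnings.warn("Discarded range with reserved name")
    return "workbook:" + path


def load_quietly(path):
    """Call the loader with UserWarning ignored only inside this block."""
    with warnings.catch_warnings():
        # The filter change is undone automatically when the block exits
        warnings.simplefilter("ignore", UserWarning)
        return noisy_load(path)


result = load_quietly("test.xlsx")
```

Because the previous filter state is restored on exit, warnings raised elsewhere in the program are still shown.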
The following code works fine until the filter gets its first match(es). After that, on the following iterations of the for loop, the query always returns 0. Even when the result would contain a match for the next row, the query doesn't see it, so I don't think this is a cache issue (which might have been far-fetched anyhow).
for row in self._ordr.OrderRows.SalesOrderRow:
    available = row.Row_amount - Slot.objects.filter(rows_ids=row.Row_id).count()
    if available > 0 or row.Row_id in self.instance.rows_ids:
        # some code
Any ideas what I am doing wrong here?
This is the model code for rows_ids.
from django.db import models
from django_mysql.models import ListCharField

class Slot(models.Model):
    rows_ids = ListCharField(base_field=models.IntegerField(), size=10, max_length=(10 * 21), null=True)
I went through the documentation once more, and apparently it's an issue specific to ListCharField. To match a value against any of the items stored in the list, you shouldn't filter on the field directly, but on field__contains. So the correct code is:
for row in self._ordr.OrderRows.SalesOrderRow:
    available = row.Row_amount - Slot.objects.filter(rows_ids__contains=row.Row_id).count()
    if available > 0 or row.Row_id in self.instance.rows_ids:
        # some code
Finally got it to work with this.
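Why the plain lookup misses rows can be pictured with a rough pure-Python model. It assumes that a ListCharField is stored as a single comma-separated string, so that filtering on the field directly compares against the whole string; that is my reading of the django-mysql documentation, not something verified against this project. The helpers exact_match and contains_match are invented for illustration and are not part of Django or django-mysql.

```python
def exact_match(stored, row_id):
    """Model of filter(rows_ids=row_id): the whole stored string must equal the value."""
    return stored == str(row_id)


def contains_match(stored, row_id):
    """Model of filter(rows_ids__contains=row_id): the value must be one of the items."""
    return str(row_id) in stored.split(",")


# Made-up stored values for three Slot rows
stored_slots = ["5", "3,5,7", "12,15"]

exact_hits = [s for s in stored_slots if exact_match(s, 5)]
contains_hits = [s for s in stored_slots if contains_match(s, 5)]
```

In this model the direct comparison only finds the slot whose list is exactly [5], while the membership test also finds the slot holding 3, 5 and 7, and neither is fooled by the 5 inside 15.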
I'm trying to carve out some binding sites with ligands from cif-files of ribosome crystal structures, and have encountered an annoying problem involving a type error.
TypeError: %c requires int or char
Using the code below,
from Bio.PDB import *
from Bio import PDB

class save_res(Select):
    def accept_residue(self, residue):
        if residue in keep_res_list:
            print(residue)
            return 1
        else:
            return 0

keep_res_list = []
parser = MMCIFParser()
structure = parser.get_structure("1vvj.cif", "./1vvj.cif")
structure = structure[0]
atom_list = Selection.unfold_entities(structure, "A") # A for atoms
ns = NeighborSearch(atom_list)
for residue in structure.get_residues():
    if residue.get_resname() == "PAR":
        for atom in residue:
            center = atom.get_coord()
            neighbors = ns.search(center, 5.0)
            neighbor_residue_list = Selection.unfold_entities(neighbors, "R")
            for res in neighbor_residue_list:
                if res not in keep_res_list:
                    keep_res_list.append(res)
io = PDBIO()
io.set_structure(structure)
io.save("1vvj_bs.pdb", save_res())
gives me the error:
  File "/scratch/software/anaconda3/envs/my-devel-3.6/lib/python3.6/site-packages/Bio/PDB/PDBIO.py", line 112, in _get_atom_line
    return _ATOM_FORMAT_STRING % args
TypeError: %c requires int or char
This code works well when changing the pdb-id to 1fyb, which also has the same ligand id.
I'm thinking the problem stems from the vast amounts of chains and their IDs in the original file. Am I completely wrong in this assumption or does anyone know how to fix this? I've been trying to find a way to rename the chain IDs, but haven't found a viable method to do this.
Your help is appreciated.
The chain name format in _ATOM_FORMAT_STRING is %c, while in this case you have a chain named QA.
Chain names in PDB files were traditionally single characters.
But there are only so many letters and digits, so for ribosomes it's necessary to use longer names. The PDB format has room for a second letter: the empty column to the left of the 1-character chain name. Many programs support this, but not all, and it is not part of the official specification.
So you can either use PDB files with 2-character chain names (if the rest of your workflow supports them) or rename the chains in the output (your output is only a tiny part of the original structure).
Here is how to do it in gemmi:
import gemmi

structure = gemmi.read_structure('1vvj.cif')
model = structure[0]
ns = gemmi.NeighborSearch(model, structure.cell, 5.0).populate()
for chain in model:
    for residue in chain:
        if residue.name == 'PAR':
            for atom in residue:
                for nb in ns.find_neighbors(atom):
                    nb.to_cra(model).residue.flag = 'y'
sel = gemmi.Selection().set_residue_flags('y')
new_structure = sel.copy_structure_selection(structure)
#new_structure.remove_empty_chains()
#new_structure.shorten_chain_names()
new_structure.write_minimal_pdb('1vvj-par.pdb')
The two commented-out lines rename the chains.
One difference compared with your code is that the NeighborSearch in gemmi is symmetry-aware: it also finds nearby atoms from symmetry mates. In BioPython you search only in the asymmetric unit (asu).
Both are different from the biological assembly; PDB-101 covers it nicely.
If you'd like to search in the asu only, replace structure.cell with gemmi.UnitCell() above, i.e. don't pass the unit cell information.
(You can ask such questions on bioinformatics.SE; they should get an answer sooner there.)
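To make the %c failure and the renaming idea from this answer concrete, here is a small pure-Python sketch. It does not use BioPython or gemmi: format_fails and short_chain_names are hypothetical helpers invented for illustration, and the pool of letters and digits is an assumption about which one-character names are acceptable downstream.

```python
import string


def format_fails(chain_id):
    """Return True if formatting chain_id with %c raises TypeError."""
    try:
        "%c" % chain_id
    except TypeError:
        return True
    return False


def short_chain_names(chain_ids):
    """Map each distinct chain name to one character, in order of first appearance."""
    pool = string.ascii_uppercase + string.ascii_lowercase + string.digits
    mapping = {}
    for cid in chain_ids:
        if cid not in mapping:
            if len(mapping) >= len(pool):
                raise ValueError("too many chains for one-character names")
            mapping[cid] = pool[len(mapping)]
    return mapping


two_char_fails = format_fails("QA")   # a two-character name cannot go through %c
one_char_fails = format_fails("A")    # a single character formats fine
mapping = short_chain_names(["QA", "RA", "QA"])
```

With such a mapping in hand, each chain in the small extracted structure could be given its new one-character name before writing the PDB file.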
I am looking for guidance on the FORMAT of a result returned from a csv file. The code I have to date partially achieves my objective, but despite significant effort researching this and many other sites/forums I cannot resolve the final step. I also posed this question on gis.stackexchange but was redirected to this forum with the comment "Questions relating to general Information Technology, with no clear GIS component, are off-topic here, but can be researched/asked at Stack Overflow".
My successful piece of Python code, which reads selected data from a csv and returns it in dict format, is below. (Yes, I know it returns type dict because of the way my code builds the result, and that is the crux of the problem.)
import arcpy, csv

Att_Dict = {}
with open("C:/Data/Code/Python/Library/Peter/123.csv") as f:
    reader = csv.DictReader(f)
    for row in reader:
        if row['Status'] == 'Keep':
            Att_Dict.update({row['book_id']: row['book_ref']})
print Att_Dict
Att_Dict = {'7643': '7625', '9644': '2289', '4406': '4443', '7588': '9681', '2252': '7947'}
For the next part of my code to run I need the result above, but in the following format (this is part of a very lengthy script, but the only show stopper is the returned format, so there is little value in posting the other 200 or so lines):
Att_Dict = [[7643, 7625], [9644, 2289], [4406, 4443], [7588, 9681], [2252, 7947]]
Although I have experimented endlessly and can achieve this by reverting to csv.reader rather than csv.DictReader, I then lose the ability to pick out only the rows where column 'Status' has the value 'Keep', and that is a requirement for the task at hand.
My sledgehammer approach to date has been to use 'search and replace' within IDLE to amend the returned set to meet the other requirement, but I'm sure it can be done programmatically rather than manually. Similar but not identical material: https://docs.python.org/2/library/index.html, my starting question "Returning values from multiple CSV columns to Python dictionary?", "Using Python's csv.dictreader to search for specific key to then print its value", and a multitude of csv-based questions at geonet.esri.
(Using Win 7, ArcGIS 10.2, Python 2.7.5)
Try this
Att_Dict = {'7643': '7625', '9644': '2289', '4406': '4443', '7588': '9681', '2252': '7947'}
Att_List = []
for key, value in Att_Dict.items():
    Att_List.append([int(key), int(value)])
print Att_List
Out: [[7643, 7625], [9644, 2289], [4406, 4443], [7588, 9681], [2252, 7947]]
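The intermediate dict can also be skipped by building the list of pairs while reading with csv.DictReader, which keeps the 'Status' filter. The sketch below is written for Python 3 (the question uses 2.7, where io.StringIO would need unicode input); kept_pairs is a hypothetical helper name, and the sample rows are made up to stand in for 123.csv.

```python
import csv
import io


def kept_pairs(csv_file):
    """Return [[book_id, book_ref], ...] as ints for rows whose Status is 'Keep'."""
    reader = csv.DictReader(csv_file)
    return [
        [int(row["book_id"]), int(row["book_ref"])]
        for row in reader
        if row["Status"] == "Keep"
    ]


# Made-up sample data standing in for the real csv file
sample = io.StringIO(
    "book_id,book_ref,Status\n"
    "7643,7625,Keep\n"
    "1111,2222,Drop\n"
    "9644,2289,Keep\n"
)
pairs = kept_pairs(sample)
```

With a real file, the same function can be called as kept_pairs(f) inside a with open(...) as f: block. The pairs come out in file order, which a plain dict does not guarantee in Python 2.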
I am trying to modify a script so that it will remove duplicate lines from a text file using only the title portion of that line.
To clarify the text file lines look something like this:
Title|Image Url|Description|Page Url
At the moment the script does remove duplicates, but it does so by reading the entire line, not just the first part. All the lines in the file are not going to be 100% the same, but a few will be very similar.
I want to remove all of the lines that contain the same "title", regardless of what the rest of the line contains.
This is the script I am working with:
import sys
from collections import OrderedDict
infile = "testfile.txt"
outfile = "outfile.txt"
inf = open(infile,"r")
lines = inf.readlines()
inf.close()
newset = list(OrderedDict.fromkeys(lines))
outf = open(outfile,"w")
lstline = len(newset)
for i in range(0,lstline):
    ln = newset[i]
    outf.write(ln)
outf.close()
So far I have tried using .split() to split the lines in the list. I have also tried .readline(lines[0:25]) in hopes of using a character limit to achieve the desired results, but no luck so far. I also can't seem to find any documentation on my exact problem so I'm stuck.
I am using Windows 8 and Python 2.7.9 for this project if that helps.
I made a few changes to the program you had set up. First, I changed your file interactions to use "with" statements, since those are very convenient and automatically handle closing the files, which you had to write out by hand. Second, I used a set instead of an OrderedDict, because you were basically just emulating set functionality (exclusivity of elements) by using the keys of an OrderedDict. For each line, the title is the text before the first '|'. If the title hasn't been used yet, the script adds it to the set so it can't be used again and writes the line to the output file. If it has been used, the script skips the line and keeps going. I hope this helps you!
with open("testfile.txt") as infile:
    with open("outfile.txt",'w') as outfile:
        titleset = set()
        for line in infile:
            title = line.split('|')[0]
            if title not in titleset:
                titleset.add(title)
                outfile.write(line)
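The same idea can be wrapped in a small generator so it can be tried on in-memory lines before pointing it at files. This is only a sketch of the technique in the answer: unique_by_title is a hypothetical name and the sample lines are made up.

```python
def unique_by_title(lines, sep="|"):
    """Yield each line whose title (text before the first sep) has not been seen yet."""
    seen = set()
    for line in lines:
        title = line.split(sep, 1)[0]
        if title not in seen:
            seen.add(title)
            yield line


# Made-up lines in the Title|Image Url|Description|Page Url layout
sample = [
    "Cats|a.png|About cats|/cats\n",
    "Dogs|b.png|About dogs|/dogs\n",
    "Cats|c.png|More cats|/cats2\n",
]
kept = list(unique_by_title(sample))
```

Since an open file is itself an iterable of lines, outfile.writelines(unique_by_title(infile)) would do the whole job inside the two with blocks. The first line seen for each title is the one that is kept.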
I'm trying to import some publicly available life outcomes data using the code below:
require(gdata)
# Source SIMD12 data zone level data
simd.sg.xls <- read.xls(xls = "http://www.gov.scot/Resource/0044/00447385.xls",
sheet = "Quick Lookup", verbose = TRUE)
Naturally, the imported data frame doesn't look good:
I would like to amend my column names using the code below:
# Clean column names
names(simd.sg.xls) <- make.names(names = as.character(simd.sg.xls[1,]),
unique = TRUE,allow_ = TRUE)
But it produces rather unpleasant results:
> names(simd.sg.xls)
[1] "X1" "X1.1" "X771" "X354" "X229" "X74" "X67" "X33" "X19" "X1.2"
[11] "X6" "X1.3" "X8" "X7" "X7.1" "X6506" "X21" "X1.4" "X6158" "X6506.1"
[21] "X6506.2" "X6506.3" "X6263" "X6506.4" "X6468" "X1010" "X815" "X99" "X58" "X65"
[31] "X60" "X6506.5" "X21.1" "X1.5" "X6173" "X5842" "X6506.6" "X6506.7" "X6263.1" "X6506.8"
[41] "X6481" "X883" "X728" "X112" "X69" "X56" "X54" "X6506.9" "X21.2" "X1.6"
[51] "X6143" "X5651" "X6506.10" "X6506.11" "X6263.2" "X6506.12" "X6480" "X777" "X647" "X434"
[61] "X518" "X246" "X436" "X6506.13" "X21.3" "X1.7" "X6136" "X5677" "X6506.14" "X6506.15"
[71] "X6263.3" "X6506.16" "X660" "X567" "X480" "X557" "X261" "X456"
My question is whether there is a way to neatly force the values from the first row into the column names. As I'm processing a lot of data, I'm looking for a solution that is easily reproducible. I can accommodate a lot of mangling of the actual strings to get syntactically correct names, but ideally I would avoid faffing around with elaborate regular expressions, as I'm often reading files like the one linked here and don't want to be forced to adjust the rules for every single import.
It looks like the problem is that the header is on the second line, not the first. You could include a skip=1 argument but a more general way of dealing with this using read.xls seems to be to use the pattern and header arguments which force the first line which matches the pattern string to be treated as the header. Your code becomes:
require(gdata)
# Source SIMD12 data zone level data
simd.sg.xls <- read.xls(xls = "http://www.gov.scot/Resource/0044/00447385.xls",
sheet = "Quick Lookup", verbose = TRUE,
pattern="DATAZONE", header=TRUE)
UPDATE
I don't get the warning messages you do when I execute the code. The messages refer to an issue with locale. The locale settings on my system are:
Sys.getlocale()
[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
Yours are probably different. Locale data could be OS dependent. I'm using Windows 8.1. Also, I'm using Strawberry Perl; you appear to be using something else. Those are some possible reasons for the discrepancy in warning messages, but I can't offer anything more specific.
On the second question in your comment: to read the entire file and convert a particular row (in this case, row 2) to column names, you could use the following code:
simd.sg.xls <- read.xls(xls = "http://www.gov.scot/Resource/0044/00447385.xls",
sheet = "Quick Lookup", verbose = TRUE,
header=FALSE, stringsAsFactors=FALSE)
names(simd.sg.xls) <- make.names(names = simd.sg.xls[2,],
unique = TRUE,allow_ = TRUE)
simd.sg.xls <- simd.sg.xls[-(1:2),]
All data will be of character type so you'll need to convert to factor and numeric as necessary.