import glob
from bs4 import BeautifulSoup
f = open('csvfile.csv','w')
for file in glob.glob('*.htm'):
print 'Processing', file
for y in range(0,3):
for x in range(0, 6):
soup = BeautifulSoup(open(file).read())
all_string=soup.find_all("h2")[x].get_text()
#stack=[]
#acct.write(", ".join(stack) + '\n')
f.write(all_string)
f.write('\n')
print(all_string)
x=0
f.close()
Output-
Processing Alkali-Controlled C–H Cleavage or N–C Bond Formation by N2-Derived Iron Nitrides and Imides - Journal of the American Chemical Society (ACS Publications).htm
Abstract
Supporting Information
Vanadium-catalyzed Reduction of Molecular Dinitrogen into Silylamine under Ambient Reaction Conditions
Traceback (most recent call last):
File "", line 1, in
runfile('/Users/ROXX/Desktop/project/csv1.py', wdir='/Users/ROXX/Desktop/project')
File
"/Users/ROXX/anaconda/lib/python2.7/site-packages/spyder/utils/site/sitecustomize.py",
line 880, in runfile
execfile(filename, namespace)
File
"/Users/ROXX/anaconda/lib/python2.7/site-packages/spyder/utils/site/sitecustomize.py",
line 94, in execfile
builtins.execfile(filename, *where)
File "/Users/ROXX/Desktop/project/csv1.py", line 17, in
all_string=soup.find_all("h2")[x].get_text()
IndexError: list index out of range
The reason for the error probably is that there are less than 7 occurrences of h2 in the file you are processing (in the second loop). Repeating on "all_string" instead of a fixed interval would probably solve the issue.
Related
I'm trying to evaluate a theano TensorValue expression:
import pymc3
import numpy as np
with pymc3.Model():
growth = pymc3.Normal('growth_%s' % 'some_name', 0, 10)
x = np.arange(4)
(x * growth).eval()
but get the error
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/danna/.virtualenvs/lib/python2.7/site-packages/theano/gof/graph.py", line 522, in eval
self._fn_cache[inputs] = theano.function(inputs, self)
File "/home/danna/.virtualenvs/lib/python2.7/site-packages/theano/compile/function.py", line 317, in function
output_keys=output_keys)
File "/home/danna/.virtualenvs/lib/python2.7/site-packages/theano/compile/pfunc.py", line 486, in pfunc
output_keys=output_keys)
File "/home/danna/.virtualenvs/lib/python2.7/site-packages/theano/compile/function_module.py", line 1839, in orig_function
name=name)
File "/home/danna/.virtualenvs/lib/python2.7/site-packages/theano/compile/function_module.py", line 1487, in __init__
accept_inplace)
File "/home/danna/.virtualenvs/lib/python2.7/site-packages/theano/compile/function_module.py", line 181, in std_fgraph
update_mapping=update_mapping)
File "/home/danna/.virtualenvs/lib/python2.7/site-packages/theano/gof/fg.py", line 175, in __init__
self.__import_r__(output, reason="init")
File "/home/danna/.virtualenvs/lib/python2.7/site-packages/theano/gof/fg.py", line 346, in __import_r__
self.__import__(variable.owner, reason=reason)
File "/home/danna/.virtualenvs/lib/python2.7/site-packages/theano/gof/fg.py", line 391, in __import__
raise MissingInputError(error_msg, variable=r)
theano.gof.fg.MissingInputError: Input 0 of the graph (indices start from 0), used to compute InplaceDimShuffle{x}(growth_some_name), was not provided and not given a value. Use the Theano flag exception_verbosity='high', for more information on this error.
I tried
Can someone please help me see what the theano variables actually output?
Thank you!
I'm using Python 2.7 and theano 1.0.3
While PyMC3 distributions are TensorVariable objects, they don't technical have any values to be evaluated outside of sampling. If you want values, you have to at least run sampling on the model:
with pymc3.Model():
growth = pymc3.Normal('growth', 0, 10)
trace = pymc3.sample(10)
x = np.arange(4)
x[:, np.newaxis]*trace['growth']
If you want to view node values during sampling, you'd need to use theano.tensor.printing.Print objects. For more info, see the PyMC3 debugging tips.
Here is the code:
name = input("Enter Molecule ID: ")
name_in = name+'.lac.dat'
print(name_in)
atm_chg = []
with open(name_in) as f:
# skip two lines
f.readline()
f.readline()
for line in f.readlines():
atm_chg.append(float( line.split()[-1] ))
This is to process input for a larger Python program.
The input is:
LOEWDIN ATOMIC CHARGES
----------------------
0 C : -0.780631
1 H : 0.114577
2 Br: 0.309802
3 Cl: 0.357316
4 F : -0.001065
Finally the runtime messages are:
runfile('/home/comp/Apps/Python/Testing/ReadFile_2.py', wdir='/home/comp/Apps/Python/Testing')
Enter Molecule ID: A
A.lac.dat
Traceback (most recent call last):
File "<ipython-input-1-8c665940b39f>", line 1, in <module>
runfile('/home/comp/Apps/Python/Testing/ReadFile_2.py', wdir='/home/comp/Apps/Python/Testing')
File "/home/comp/Apps/anaconda3/lib/python3.6/site-packages/spyder/utils/site/sitecustomize.py", line 866, in runfile
execfile(filename, namespace)
File "/home/comp/Apps/anaconda3/lib/python3.6/site-packages/spyder/utils/site/sitecustomize.py", line 102, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "/home/comp/Apps/Python/Testing/ReadFile_2.py", line 27, in <module>
atm_chg.append(float( line.split()[-1] ))
IndexError: list index out of range
In spite of the errors there is an entry in the Variable Explorer (Spyder IDE):
[-0.780631, 0.114577, 0.309802, 0.357316, -0.001065]
which is exactly what I require.
The program is probably crashing because it's trying to process an empty line beyond the last line of data. You can fix this by making sure the line is non-empty before processing it:
for line in f.readlines():
if line.strip(): # is the line non-empty, ignoring white space
atm_chg.append(float( line.split()[-1] ))
I am merging files together in 4 folders. Within those 4 folders I am merging 80 .dbf files together each of which is 35 megabytes. I am using the following code:
import os
import pandas as pd
from simpledbf import Dbf5
list1=[]
folders=r'F:\dbf_tables'
out=r'F:\merged'
if not os.path.isdir(out):
os.mkdir(out)
for folder in os.listdir(folders):
if not os.path.isdir(os.path.join(out,folder)):
os.mkdir(os.path.join(out,folder))
for f in os.listdir(os.path.join(folders,folder)):
if '.xml' not in f:
if '.cpg' not in f:
table=Dbf5(os.path.join(folders,folder,f))
df=table.to_dataframe()
list1.append(df)
dfs = reduce(lambda left,right: pd.merge(left,right,on=['POINTID'],how='outer',),list1)
dfs.to_csv(os.path.join(out,folder,'combined.csv'), index=False)
almost immediately after running the code I receive this error:
Traceback (most recent call last):
File "<ipython-input-1-77eb6fd0cda7>", line 1, in <module>
runfile('F:/python codes/prelim_codes/raster_to_point.py', wdir='F:/python codes/prelim_codes')
File "C:\Users\spotter\AppData\Local\Continuum\Anaconda_64\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 714, in runfile
execfile(filename, namespace)
File "C:\Users\spotter\AppData\Local\Continuum\Anaconda_64\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 74, in execfile
exec(compile(scripttext, filename, 'exec'), glob, loc)
File "F:/python codes/prelim_codes/raster_to_point.py", line 66, in <module>
dfs = reduce(lambda left,right: pd.merge(left,right,on=['POINTID'],how='outer',),list1)
File "F:/python codes/prelim_codes/raster_to_point.py", line 66, in <lambda>
dfs = reduce(lambda left,right: pd.merge(left,right,on=['POINTID'],how='outer',),list1)
File "C:\Users\spotter\AppData\Local\Continuum\Anaconda_64\lib\site-packages\pandas\tools\merge.py", line 39, in merge
return op.get_result()
File "C:\Users\spotter\AppData\Local\Continuum\Anaconda_64\lib\site-packages\pandas\tools\merge.py", line 217, in get_result
join_index, left_indexer, right_indexer = self._get_join_info()
File "C:\Users\spotter\AppData\Local\Continuum\Anaconda_64\lib\site-packages\pandas\tools\merge.py", line 353, in _get_join_info
sort=self.sort, how=self.how)
File "C:\Users\spotter\AppData\Local\Continuum\Anaconda_64\lib\site-packages\pandas\tools\merge.py", line 559, in _get_join_indexers
return join_func(lkey, rkey, count, **kwargs)
File "pandas\src\join.pyx", line 160, in pandas.algos.full_outer_join (pandas\algos.c:61256)
MemoryError
but only 30% of my memory is being used, which is pretty much the baseline.
EDIT:
I picked out only 2 files and tried the merge using:
merge=pd.merge(df1,df2, on=['POINTID'], how='outer')
and still get a memory error, something weird is going on.
When I run the same thing in 32-bit Anaconda I get ValueError: negative dimensions are not allowed
EDIT:
The entire problem stemmed from the solution give here:
Value Error: negative dimensions are not allowed when merging
EDITED based on comment:
Try this (it's enough to use only one if statement with logical and conditions):
import os
import pandas as pd
from simpledbf import Dbf5
folders = r'F:\dbf_tables'
out = r'F:\merged'
if not os.path.isdir(out):
os.mkdir(out)
for folder in os.listdir(folders):
if not os.path.isdir(os.path.join(out, folder)):
os.mkdir(os.path.join(out, folder))
# Initialize empty dataframe by folders
dfs = pd.DataFrame(columns=['POINTID'])
for f in os.listdir(os.path.join(folders, folder)):
if ('.xml' not in f) and ('.cpg' not in f):
table = Dbf5(os.path.join(folders, folder, f))
df = table.to_dataframe()
# Merge actual dataframe to result dataframe
dfs = dfs.merge(df, on=['POINTID'], how='outer')
# Save results by folder
dfs.to_csv(os.path.join(out, folder, 'combined.csv'), index=False)
I'm having problems reading an OpenStreetMap buildings (IMPOSM GEOJSON) file into a geopandas data frame object (Python 2.7). This is on MAC OS X 10.11.3. Here are the messages I'm getting:
>>> import geopandas as gpd
>>> df=gpd.read_file('san-francisco-bay_california_buildings.geojson')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/ewang/anaconda/lib/python2.7/site-packages/geopandas/io/file.py", line 28, in read_file
gdf = GeoDataFrame.from_features(f, crs=crs)
File "/Users/ewang/anaconda/lib/python2.7/site-packages/geopandas/geodataframe.py", line 193, in from_features
d = {'geometry': shape(f['geometry']) if f['geometry'] else None}
File "/Users/ewang/anaconda/lib/python2.7/site-packages/shapely/geometry/geo.py", line 34, in shape
return Polygon(ob["coordinates"][0], ob["coordinates"][1:])
File "/Users/ewang/anaconda/lib/python2.7/site-packages/shapely/geometry/polygon.py", line 229, in __init__
self._geom, self._ndim = geos_polygon_from_py(shell, holes)
File "/Users/ewang/anaconda/lib/python2.7/site-packages/shapely/geometry/polygon.py", line 508, in geos_polygon_from_py
geos_shell, ndim = geos_linearring_from_py(shell)
File "/Users/ewang/anaconda/lib/python2.7/site-packages/shapely/geometry/polygon.py", line 450, in geos_linearring_from_py
n = len(ob[0])
IndexError: list index out of range
The odd thing is that I can load OSM roads data IMPOSM GEOJSON files with geopandas. Am I missing something obvious here? Thanks very much.
EDIT - link to the data below:
OSM data from mapzen
I have follow code in python:
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(n_estimators = 100)
forest = forest.fit( train_data_features, train["sentiment"] )
but have key error for "sentiment", I don't know why,
train = pd.read_csv("labeledTrainData.tsv", header=0, delimiter="\t", quoting=3)
-Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site--packages/pandas/core/frame.py", line 1780, in __getitem__
return self._getitem_column(key)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/frame.py", line 1787, in _getitem_column
return self._get_item_cache(key)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/generic.py", line 1068, in _get_item_cache
values = self._data.get(item)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/internals.py", line 2849, in get
loc = self.items.get_loc(item)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/index.py", line 1402, in get_loc
return self._engine.get_loc(_values_from_object(key))
File "pandas/index.pyx", line 134, in pandas.index.IndexEngine.get_loc (pandas/index.c:3807)
File "pandas/index.pyx", line 154, in pandas.index.IndexEngine.get_loc (pandas/index.c:3687)
File "pandas/hashtable.pyx", line 696, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12310)
File "pandas/hashtable.pyx", line 704, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12261)
KeyError: 'sentiment'
Are you doing the Kaggle competition? https://www.kaggle.com/c/word2vec-nlp-tutorial/data
Are you sure you have downloaded and decompressed the file ok? The first part of the file reads:
id sentiment review
"5814_8" 1 "With all this stuff go
This works for me:
>>> train = pd.read_csv("labeledTrainData.tsv", delimiter="\t")
>>> train.columns
Index([u'id', u'sentiment', u'review'], dtype='object')
>>> train.head(3)
id sentiment review
0 5814_8 1 With all this stuff going down at the moment w...
1 2381_9 1 \The Classic War of the Worlds\" by Timothy Hi...
2 7759_3 0 The film starts with a manager (Nicholas Bell)...
You should check the columns are setup correctly in the train variable. You should have a sentiment column. That column seems to be missing in your dataframe.