First of all, here is my script:
import psycopg2
import sys
data = ((160000,),
        (40000,),
        (75000,),
        )

def main():
    try:
        connection = psycopg2.connect("""host='localhost' dbname='postgres'
                                      user='postgres'""")
        cursor = connection.cursor()
        query = "UPDATE Planes SET Price=%s"
        cursor.executemany(query, data)
        connection.commit()
    except psycopg2.Error, e:
        if connection:
            connection.rollback()
        print 'Error:{0}'.format(e)
    finally:
        if connection:
            connection.close()

if __name__ == '__main__':
    main()
This code works, of course, but not in the way I want. It updates the entire 'Price' column, which is good, but it updates every row with only the last value in 'data' (75000):
(1, 'Airbus', 75000, 'Public')
(2, 'Helicopter', 75000, 'Private')
(3, 'Falcon', 75000, 'Military')
My desired output would look like:
(1, 'Airbus', 160000, 'Public')
(2, 'Helicopter', 40000, 'Private')
(3, 'Falcon', 75000, 'Military')
Now, how can I fix it?
Without setting up your database on my machine to debug, I can't be sure, but it appears that the query is the issue. When you execute
UPDATE Planes SET Price=%s
I would think it is updating the entire column with the value being iterated on from your data tuple. Instead, you might need to change the tuple to a sequence of dictionaries
({'name':'Airbus', 'price':160000}, {'name':'Helicopter', 'price':40000}...)
and change the query to
"""UPDATE Planes SET Price=%(price)s WHERE Name=%(name)s""".
See the very bottom of this article for a similar formulation. To check that this is indeed the issue, you could just execute the query once (cursor.execute(query)) and I bet you will get the full price column filled with the first value in your data tuple.
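A minimal sketch of that approach (reusing the Name values from your sample rows; adjust the connection string as needed):
import psycopg2

# each record carries the new price plus the key to match on
data = ({'name': 'Airbus', 'price': 160000},
        {'name': 'Helicopter', 'price': 40000},
        {'name': 'Falcon', 'price': 75000},
        )

connection = psycopg2.connect("""host='localhost' dbname='postgres'
                              user='postgres'""")
cursor = connection.cursor()
# the WHERE clause restricts each UPDATE to the matching row
query = "UPDATE Planes SET Price=%(price)s WHERE Name=%(name)s"
cursor.executemany(query, data)
connection.commit()
connection.close()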
I've been trying to get the nlargest rows for a group by following the method from this question. The solution to the question is correct up to a point.
In this example, I groupby column A and want to return the rows of C and D based on the top two values in B.
For some reason the index of grp_df is multilevel and includes both A and the original index of ddf.
I was hoping to simply reset_index() and drop the unwanted index and just keep A, but I get the following error:
ValueError: The columns in the computed data do not match the columns in the provided metadata
Here is a simple example reproducing the error:
import numpy as np
import dask.dataframe as dd
import pandas as pd
np.random.seed(42)
df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
ddf = dd.from_pandas(df, npartitions=3)
grp_df = ddf.groupby('A')[['B','C']].apply(
    lambda x: x.nlargest(2, columns=['B']),
    meta={"B": 'f8', "C": 'f8'})
# Print is successful and results are correct
print(grp_df.head())
grp_df = grp_df.reset_index()
# Print is unsuccessful and shows error below
print(grp_df.head())
I found an approach for a solution here.
The following code now allows reset_index() to work and gets rid of the original ddf index. I'm still not sure why the original ddf index came through the groupby in the first place, though.
meta = pd.DataFrame(columns=['B', 'C'], dtype=int, index=pd.MultiIndex([[], []], [[], []], names=['A', None]))
grp_df = ddf.groupby('A')[['B','C']].apply(lambda x: x.nlargest(2, columns=['B']), meta=meta)
grp_df = grp_df.reset_index().drop('level_1', axis=1)
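As an aside, newer pandas versions renamed the positional MultiIndex labels argument to codes, so the positional call above may break there. A sketch of an equivalent empty meta index that avoids the positional arguments (an assumption on my part, not tested against every version):
meta = pd.DataFrame(columns=['B', 'C'], dtype=int,
                    index=pd.MultiIndex.from_arrays([[], []], names=['A', None]))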
I am trying to plot a line chart that includes a tooltip, but the code below displays all the values of the line in the tooltip instead of a single value for the hovered coordinates.
#Import the library
import pandas
import itertools
import bokeh
import MySQLdb
from bokeh.plotting import figure, output_file, show
from bokeh.models import HoverTool
TOOLS='hover'
wells=['F1','F2','F3','F4','F5','F6','F7','F8','F9','F10','F11','F12','G1','G2','G3','G4','G5','G6','G7','G8','G9','G10','G11','G12']
p = figure(plot_width=800, plot_height=640,x_axis_type="datetime", tools=TOOLS)
p.title.text = 'Click on legend entries to hide the corresponding lines'
# Open database connection
db = MySQLdb.connect("localhost","user","password","db" )
#pallete for the lines
my_palette=bokeh.palettes.inferno(len(wells))
#create a statement to get the data
# create a statement to get the data
for name, color in zip(wells, my_palette):
    stmnt = 'select date_time,col1,wells,test_value from db where wells="%s"' % (name)
    # creating dataframe
    df = pandas.read_sql(stmnt, con=db)
    p.scatter(df['date_time'], df['test_value'], line_width=2, color=color, alpha=0.8, legend=name)
    # Inserting tool tip
    hover = p.select(dict(type=HoverTool))
    hover.tooltips = [("Wells", "@wells"), ("Date", "@%s" % (df['date_time'])), ("Values", "@%s" % (df['test_value']))]
    hover.mode = 'mouse'
#Adding a legend
p.legend.location = "top_right"
output_file("interactive_legend.html", title="interactive_legend.py example")
show(p)
Given below is the resultant screenshot
I am trying to get only one Well, Date_time, and Test_value at a given mouseover instance.
This code:
hover.tooltips = [
    ("Wells","@wells"),
    ("Date","@%s"%(df['date_time'])),
    ("Values","@%s"%(df['test_value']))
]
Does not do what you think. Let's suppose df['date_time'] has the value [10, 20, 30, 40]. Then after your string substitution, your tooltip looks like:
("Date", "#[10, 20, 30, 40]")
Which exactly explains what you are seeing. The #[10 part looks for a column named "[10" in your ColumnDataSource (because of the # in front). There isn't a column with that name, so the tooltip prints ??? to indicate it can't find data to look up. The rest 20, 30, 40 is just plain text, so it gets printed as-is. In your code, you are actually passing a Pandas series and not a list, so the string substitution also prints the Name and dtype info in the tooltip text as well.
Since you are passing sequence literals to scatter, Bokeh creates a ColumnDataSource for you, and the default column names in that CDS are 'x' and 'y'. My best guess is that you actually want:
hover.tooltips = [
    ("Wells","@wells"),
    ("Date","@x"),
    ("Values","@y")
]
But note that you would want to do this outside the loop. As it is, you are simply modifying the same hover tool over and over.
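For reference, a minimal sketch of the loop with the hover configuration hoisted out of it (assuming the same wells, my_palette, p, and db objects as above; the "Wells" entry is dropped here because the auto-created data source only has 'x' and 'y' columns):
for name, color in zip(wells, my_palette):
    stmnt = 'select date_time,col1,wells,test_value from db where wells="%s"' % (name)
    df = pandas.read_sql(stmnt, con=db)
    p.scatter(df['date_time'], df['test_value'], line_width=2,
              color=color, alpha=0.8, legend=name)

# configure the single hover tool once, after all glyphs are added
hover = p.select(dict(type=HoverTool))
hover.tooltips = [("Date", "@x"), ("Values", "@y")]
hover.mode = 'mouse'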
I have a page with an HTML table with 16 rows and 5 columns.
I have a method to loop through the table and print out the cell values.
I get the following error:
raise exception_class(message, screen, stacktrace)
StaleElementReferenceException: Message: Element is no longer valid
The error happens on this line:
col_name = row.find_elements(By.TAG_NAME, "td")[1] # This is the Name column
My method code is:
def get_variables_col_values(self):
    try:
        table_id = self.driver.find_element(By.ID, 'data_configuration_variables_ct_fields_body1')
        #time.sleep(10)
        rows = table_id.find_elements(By.TAG_NAME, "tr")
        print "Rows length"
        print len(rows)
        for row in rows:
            # Get the columns
            print "cols length"
            print len(row.find_elements(By.TAG_NAME, "td"))
            col_name = row.find_elements(By.TAG_NAME, "td")[1]  # This is the Name column
            print "col_name.text = "
            print col_name.text
    except NoSuchElementException, e:
        return False
Am I getting "element is no longer valid" because the DOM has updated or changed?
Has the table not finished loading?
How can I solve this, please?
Do I need to wait for the page to be fully loaded?
Should I use the following WebDriverWait code to wait for the page load to complete?
WebDriverWait(self.driver, 10).until(lambda d: d.execute_script('return document.readyState') == 'complete')
Whereabouts in my code should I put this line, if it is required?
I ran my code again, and the second time it worked. The output was:
Rows length
16
cols length
6
col_name.text =
Name
cols length
6
col_name.text =
Address
cols length
6
col_name.text =
DOB
...
So I need to make my code better so it works every time I run my test case.
What is the best solution?
Thanks,
Riaz
StaleElementReferenceException: Message: Element is no longer valid can mean that the page wasn't completely loaded, or that a script that changes the page elements hadn't finished running, so the elements were still changing or not all present when you started interacting with them.
You're on the right track! Using explicit waits is good practice to avoid StaleElementReferenceException and NoSuchElementException, since your script will often execute commands much faster than a web page can load or JavaScript code can finish.
Use WebDriverWait before you use WebDriver commands.
Here's a list of different "Expected Conditions" you can use to detect that page is loaded completely or at least loaded enough: http://selenium-python.readthedocs.org/en/latest/waits.html
Here's an example of where to place the wait in your code, waiting up to 10 seconds for all 'td' elements to finish loading (you may need to use a different type of condition, amount of time, or element to wait for, depending on what the web page as a whole is like):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def get_variables_col_values(self):
    try:
        WebDriverWait(self.driver, 10).until(
            EC.presence_of_all_elements_located((By.TAG_NAME, 'td')))
        table_id = self.driver.find_element(By.ID, 'data_configuration_variables_ct_fields_body1')
        #time.sleep(10)
        rows = table_id.find_elements(By.TAG_NAME, "tr")
        print "Rows length"
        print len(rows)
        for row in rows:
            # Get the columns
            print "cols length"
            print len(row.find_elements(By.TAG_NAME, "td"))
            col_name = row.find_elements(By.TAG_NAME, "td")[1]  # This is the Name column
            print "col_name.text = "
            print col_name.text
    except NoSuchElementException, e:
        return False
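If the table is re-rendered by JavaScript while you iterate, a reference can still go stale even after the wait. A common fallback (a sketch with a made-up helper name, not specific to your page) is to re-locate the table and retry the read:
from selenium.common.exceptions import StaleElementReferenceException

def get_name_col_values_with_retry(self, attempts=3):
    # re-locate the table on each attempt, since stale references
    # cannot be reused once the DOM has been re-rendered
    for _ in range(attempts):
        try:
            table = self.driver.find_element(By.ID, 'data_configuration_variables_ct_fields_body1')
            rows = table.find_elements(By.TAG_NAME, "tr")
            return [row.find_elements(By.TAG_NAME, "td")[1].text for row in rows]
        except StaleElementReferenceException:
            pass  # the page changed mid-read; try again
    return []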
scikit-learn provides various methods to remove descriptors. A basic method for this purpose is given in the tutorial below:
http://scikit-learn.org/stable/modules/feature_selection.html
but the tutorial does not provide any way to keep track of which features were removed and which were kept.
The code below has been taken from the tutorial.
from sklearn.feature_selection import VarianceThreshold
X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]
sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
sel.fit_transform(X)
array([[0, 1],
       [1, 0],
       [0, 0],
       [1, 1],
       [1, 0],
       [1, 1]])
The example code above depicts only two descriptors, "shape(6, 2)", but in my case I have a huge data frame with a shape of (51 rows, 9000 columns). After finding a suitable model, I want to keep track of the useful and useless features, because I can save computational time when computing the features of the test data set by calculating only the useful ones.
For example, when you perform machine learning modeling with WEKA 6.0, it provides remarkable flexibility over feature selection, and after removing the useless features you can get a list of the discarded features along with the useful ones.
Thanks.
Then, what you can do, if I'm not wrong, is:
In the case of the VarianceThreshold, you can call the method fit instead of fit_transform. This will fit the data, and the resulting variances will be stored in vt.variances_ (assuming vt is your object).
Having a threshold, you can extract the features of the transformation just as fit_transform would do:
X[:, vt.variances_ > threshold]
Or get the indexes as:
idx = np.where(vt.variances_ > threshold)[0]
Or as a mask
mask = vt.variances_ > threshold
P.S.: the default threshold is 0.
EDIT:
A more straightforward way to do this is to use the get_support method of the VarianceThreshold class. From the documentation:
get_support([indices]) Get a mask, or integer index, of the features selected
You should call this method after fit or fit_transform.
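A minimal sketch using the data from the tutorial above:
from sklearn.feature_selection import VarianceThreshold

X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]
sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
sel.fit(X)

sel.get_support()              # boolean mask: array([False,  True,  True])
sel.get_support(indices=True)  # column indices kept: array([1, 2])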
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Just make a convenience function; this one wraps the VarianceThreshold
# transformer but you can pass it a pandas dataframe and get one in return
def get_low_variance_columns(dframe=None, columns=None,
                             skip_columns=None, thresh=0.0,
                             autoremove=False):
    """
    Wrapper for sklearn VarianceThreshold for use on pandas dataframes.
    """
    print("Finding low-variance features.")
    removed_features = []  # so the return below works even if an error occurs
    try:
        # get list of all the original df columns
        all_columns = dframe.columns

        # remove `skip_columns`
        remaining_columns = all_columns.drop(skip_columns)

        # get length of new index
        max_index = len(remaining_columns) - 1

        # get indices for `skip_columns`
        skipped_idx = [all_columns.get_loc(column)
                       for column in skip_columns]

        # adjust insert location by the number of columns removed
        # (for non-zero insertion locations) to keep relative
        # locations intact
        for idx, item in enumerate(skipped_idx):
            if item > max_index:
                diff = item - max_index
                skipped_idx[idx] -= diff
            if item == max_index:
                diff = item - len(skip_columns)
                skipped_idx[idx] -= diff
            if idx == 0:
                skipped_idx[idx] = item

        # get values of `skip_columns`
        skipped_values = dframe.iloc[:, skipped_idx].values

        # get dataframe values
        X = dframe.loc[:, remaining_columns].values

        # instantiate VarianceThreshold object
        vt = VarianceThreshold(threshold=thresh)

        # fit vt to data
        vt.fit(X)

        # get the indices of the features that are being kept
        feature_indices = vt.get_support(indices=True)

        # remove low-variance columns from index
        feature_names = [remaining_columns[idx]
                         for idx, _ in enumerate(remaining_columns)
                         if idx in feature_indices]

        # get the columns to be removed
        removed_features = list(np.setdiff1d(remaining_columns,
                                             feature_names))
        print("Found {0} low-variance columns."
              .format(len(removed_features)))

        # remove the columns
        if autoremove:
            print("Removing low-variance features.")
            # remove the low-variance columns
            X_removed = vt.transform(X)
            print("Reassembling the dataframe (with low-variance "
                  "features removed).")
            # re-assemble the dataframe
            dframe = pd.DataFrame(data=X_removed,
                                  columns=feature_names)

            # add back the `skip_columns`
            for idx, index in enumerate(skipped_idx):
                dframe.insert(loc=index,
                              column=skip_columns[idx],
                              value=skipped_values[:, idx])
            print("Successfully removed low-variance columns.")

        # do not remove columns
        else:
            print("No changes have been made to the dataframe.")

    except Exception as e:
        print(e)
        print("Could not remove low-variance features. Something "
              "went wrong.")

    return dframe, removed_features
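A hypothetical usage sketch (the column names and values here are made up for illustration):
df = pd.DataFrame({'id': [1, 2, 3, 4],
                   'constant': [7, 7, 7, 7],       # zero variance
                   'feature': [0.1, 0.9, 0.4, 0.7]})
cleaned, removed = get_low_variance_columns(dframe=df,
                                            skip_columns=['id'],
                                            thresh=0.0,
                                            autoremove=True)
# removed -> ['constant']; cleaned keeps 'id' and 'feature'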
This worked for me. If you want to see exactly which columns remain after thresholding, you may use this method:
from sklearn.feature_selection import VarianceThreshold
threshold_n=0.95
sel = VarianceThreshold(threshold=(threshold_n* (1 - threshold_n) ))
sel_var=sel.fit_transform(data)
data[data.columns[sel.get_support(indices=True)]]
When testing features I wrote this simple function that tells me which variables remained in the data frame after the VarianceThreshold is applied.
from sklearn.feature_selection import VarianceThreshold
from itertools import compress
def fs_variance(df, threshold: float = 0.1):
    """
    Return a list of selected variables based on the threshold.
    """
    # The list of columns in the data frame
    features = list(df.columns)

    # Initialize and fit the method
    vt = VarianceThreshold(threshold=threshold)
    _ = vt.fit(df)

    # Get the column names which pass the threshold
    feat_select = list(compress(features, vt.get_support()))

    return feat_select
which returns a list of column names which are selected. For example: ['col_2','col_14', 'col_17'].
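For instance, a quick check against a made-up frame:
import pandas as pd

df = pd.DataFrame({'col_1': [1, 1, 1, 1],           # variance 0, dropped
                   'col_2': [1.0, 2.0, 3.0, 4.0]})  # variance 1.25, kept
print(fs_variance(df, threshold=0.1))  # ['col_2']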
I am trying to use networkx to create a DiGraph. I want to use add_edges_from(), and I want the edges and their data to be generated from three tuples.
I am importing the data from a CSV file. I have three columns: one for ids (first set of nodes), one for a set of names (second set of nodes), and another for capacities (no headers in the file). So, I created a dictionary for the ids and capacities.
dictionary = dict(zip(id, capacity))
then I zipped the tuples containing the edges data:
List = zip(id, name, capacity)
but when I execute the next line, it gives me an assertion error.
G.add_edges_from(List, 'weight': 1)
Can someone help me with this problem? I have been trying for a week with no luck.
P.S. I'm a newbie in programming.
EDIT:
So, I found the following solution. I am honestly not sure how it works, but it did the job!
Here is the code:
import networkx as nx
import csv
G = nx.DiGraph()
capacity_dict = dict(zip(zip(id, name),capacity))
List = zip(id, name, capacity)
G.add_edges_from(capacity_dict, weight=1)
for u, v, d in List:
    G[u][v]['capacity'] = d
Now when I run:
G.edges(data=True)
The result will be:
[(2.0, 'First', {'capacity': 1.0, 'weight': 1}), (3.0, 'Second', {'capacity': 2.0, 'weight': 1})]
I am using the network simplex. Now I am trying to find a way to make the output of flowDict more understandable, because it only shows the ids of the flow. (Maybe I'll try to input them into a database and return the whole row of data instead of using only the ids.)
A few improvements on your version:
(1) NetworkX algorithms assume that weight is 1 unless you specifically set it differently, so there is no need to set it explicitly in your case.
(2) Using a generator allows the capacity attribute to be set explicitly, and other attributes can also be set once per record.
(3) Processing each record as it comes through saves you having to iterate through the whole list twice. The performance improvement is probably negligible on small datasets, but it still feels more elegant.
Having said that -- your method clearly works!
import networkx as nx
import csv
# simulate a csv file.
# This makes a multi-line string behave as a file.
from StringIO import StringIO
filehandle = StringIO('''a,b,30
b,c,40
d,a,20
''')
# process each row in the file
# and generate an edge from each
def edge_generator(fh):
    reader = csv.reader(fh)
    for row in reader:
        row[-1] = float(row[-1])  # convert capacity to float
        # add other attributes to the dict() below as needed...
        # e.g. you might add weights here as well.
        yield (row[0],
               row[1],
               dict(capacity=row[2]))
# create the graph
G = nx.DiGraph()
G.add_edges_from(edge_generator(filehandle))
print G.edges(data=True)
Returns this:
[('a', 'b', {'capacity': 30.0}),
('b', 'c', {'capacity': 40.0}),
('d', 'a', {'capacity': 20.0})]