Why does [1] < [1, 2] return True in Python?

I am trying to compare two Python list objects containing integers. I understand that Python uses lexicographical comparison, but while doing so I came across this scenario: comparing [1] < [1, 2] returns True. Shouldn't that be False?
[1] < [1] # returns False
[1, 2] < [1] # returns False as well
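This is the documented behavior of Python sequence comparison: elements are compared pairwise, and if one sequence is an initial prefix of the other, the shorter sequence is considered the smaller one. A few comparisons in the same style illustrate the rule:
[1] < [1, 2]  # True: [1] is a prefix of [1, 2], so the shorter list is smaller
[1] < [1]     # False: the lists are equal, so not strictly less
[1, 2] < [1]  # False: a list is never smaller than its own prefix
[2] < [1, 2]  # False: the first differing element (2 > 1) decides before length does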


Can somebody explain what 'nx.connected_components()' does?

I got some code from Git and I was trying to understand it. Here's a part of it; I didn't understand the second line of this code:
G = nx.Graph(network_map) # Graph for the whole network
components = list(nx.connected_components(G))
What does the function connected_components do? I went through the documentation and couldn't understand it properly.
nx.connected_components(G) will return "A generator of sets of nodes, one for each component of G". A generator in Python allows iterating over values in a lazy manner (i.e., it will generate the next item only when needed).
The documentation provides the following example:
>>> import networkx as nx
>>> G = nx.path_graph(4)
>>> nx.add_path(G, [10, 11, 12])
>>> [len(c) for c in sorted(nx.connected_components(G), key=len, reverse=True)]
[4, 3]
Let's go through it:
G = nx.path_graph(4) - create the undirected path graph 0 - 1 - 2 - 3 (note that path_graph builds an undirected graph, not a directed one)
nx.add_path(G, [10, 11, 12]) - add the path 10 - 11 - 12 to G
So, now G is a graph with 2 connected components.
[len(c) for c in sorted(nx.connected_components(G), key=len, reverse=True)] - list the sizes of all connected components in G from the largest to smallest. The result is [4, 3] since {0, 1, 2, 3} is of size 4 and {10, 11, 12} is of size 3.
So just to recap - the result is a generator (lazy iterator) over all connected components in G, where each connected component is simply a set of nodes.
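To make the lazy iteration concrete, here is a minimal sketch (built on the same example graph as above) that consumes the generator one component at a time:
import networkx as nx

G = nx.path_graph(4)          # 0 - 1 - 2 - 3
nx.add_path(G, [10, 11, 12])  # 10 - 11 - 12

# each item yielded by the generator is a set of nodes
for component in nx.connected_components(G):
    print(component)
# {0, 1, 2, 3}
# {10, 11, 12}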

Python: How to encode a solution for an optimization problem

I am working on an optimization problem and need to encode the solution to the problem. Below is the code I wrote for this task. Part one of the code extracts the cities assigned to each salesman. In part two, I want to insert the starting and ending depots (cities) for each salesman. I want this process to be dynamic, as the starting/ending depot lists will change as the "num_salesmen" variable changes. The "population_list" holds members of the population. I have given one example to aid you in assisting with this request.
Please let me know if you need further clarification of my logic in the inserting part.
#### Below code is being designed to encode a solution for a GA
populationSize = 1  # this will be varied
num_salesmen = 2
population_list = [[4, 2, 3], [0, 1, 0], [1, 0], [1, 0]]
## - where [4, 2, 3] is the list of cities to be visited by the salesmen,
## - [0, 1, 0] is the list of salesmen, and
## - [1, 0], [1, 0] are the lists of starting and ending depots of
##   salesman one (0) and salesman two (1) respectively.
for pop in population_list:
    ## Part ONE: determine the cities assigned to each salesman
    Assigned_cites = [[] for x in range(num_salesmen)]
    for i in range(len(pop[1])):
        for man in range(num_salesmen):
            if pop[1][i] == man:
                Assigned_cites[man].append(pop[0][i])
    ## Part TWO: insert the starting and ending depots
    for s_man in range(num_salesmen):
        for s_e_d in range(2, num_salesmen + 2):
            Assigned_cites[s_man].insert(0, pop[s_e_d][0])
            Assigned_cites[s_man].append(pop[s_e_d][1])
## The expected result from Part TWO should look like below, but I am not getting it:
## [[1, 4, 3, 0], [1, 2, 0]]
Thanks in advance for your help.
# your extraction logic needs a bit of tweaking
Assigned_cites = [[] for x in range(num_salesmen)]
for i in range(len(population_list[1])):
    for man in range(num_salesmen):
        if population_list[1][i] == man:
            Assigned_cites[man].append(population_list[0][i])
print Assigned_cites

s_man = 0  # no need for an outer for loop over salesmen
for s_e_d in range(2, num_salesmen + 2):
    Assigned_cites[s_man].insert(0, population_list[s_e_d][0])
    Assigned_cites[s_man].append(population_list[s_e_d][1])
    s_man = s_man + 1
print Assigned_cites
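For the sample population_list above, the first print gives [[4, 3], [2]] and the second gives [[1, 4, 3, 0], [1, 2, 0]], which matches your expected result.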

Removing features with low variance using scikit-learn

scikit-learn provides various methods for removing descriptors; a basic method for this purpose is given in the tutorial below:
http://scikit-learn.org/stable/modules/feature_selection.html
However, the tutorial does not provide any way of keeping track of which features were removed and which were kept.
The code below has been taken from the tutorial.
from sklearn.feature_selection import VarianceThreshold
X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]
sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
sel.fit_transform(X)
array([[0, 1],
       [1, 0],
       [0, 0],
       [1, 1],
       [1, 0],
       [1, 1]])
The example code above involves only two descriptors (shape (6, 2)), but in my case I have a huge data frame with a shape of (51 rows, 9000 columns). After finding a suitable model, I want to keep track of the useful and useless features, because I can save computational time when computing the features of the test data set by calculating only the useful ones.
For example, when you perform machine learning modeling with WEKA 6.0, it provides remarkable flexibility over feature selection, and after removing the useless features you get a list of the discarded features along with the useful ones.
Thanks.
Then, if I'm not wrong, what you can do is:
In the case of VarianceThreshold, you can call the method fit instead of fit_transform. This will fit the data, and the resulting variances will be stored in vt.variances_ (assuming vt is your object).
With a threshold, you can then extract the same features that fit_transform would select:
X[:, vt.variances_ > threshold]
Or get the indices as:
idx = np.where(vt.variances_ > threshold)[0]
Or as a mask
mask = vt.variances_ > threshold
PS: the default threshold is 0.
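Tying these pieces together, here is a minimal sketch on the tutorial's data (vt and threshold are the names assumed above); the mask reproduces the selection that fit_transform would make:
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[0, 0, 1], [0, 1, 0], [1, 0, 0],
              [0, 1, 1], [0, 1, 0], [0, 1, 1]])

threshold = .8 * (1 - .8)
vt = VarianceThreshold(threshold=threshold)
vt.fit(X)                         # fit only; nothing is transformed yet

print(vt.variances_)              # per-column variances
mask = vt.variances_ > threshold  # boolean mask of the kept columns
idx = np.where(mask)[0]           # or their integer indices
X_kept = X[:, mask]               # same columns as fit_transform(X) returns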
EDIT:
A more straightforward way is to use the method get_support of the class VarianceThreshold. From the documentation:
get_support([indices]) Get a mask, or integer index, of the features selected
You should call this method after fit or fit_transform.
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Just make a convenience function; this one wraps the VarianceThreshold
# transformer but you can pass it a pandas dataframe and get one in return
def get_low_variance_columns(dframe=None, columns=None,
                             skip_columns=None, thresh=0.0,
                             autoremove=False):
    """
    Wrapper for sklearn VarianceThreshold for use on pandas dataframes.
    """
    print("Finding low-variance features.")
    removed_features = []
    try:
        # get list of all the original df columns
        all_columns = dframe.columns

        # remove `skip_columns`
        remaining_columns = all_columns.drop(skip_columns)

        # get length of new index
        max_index = len(remaining_columns) - 1

        # get indices for `skip_columns`
        skipped_idx = [all_columns.get_loc(column)
                       for column
                       in skip_columns]

        # adjust insert location by the number of columns removed
        # (for non-zero insertion locations) to keep relative
        # locations intact
        for idx, item in enumerate(skipped_idx):
            if item > max_index:
                diff = item - max_index
                skipped_idx[idx] -= diff
            if item == max_index:
                diff = item - len(skip_columns)
                skipped_idx[idx] -= diff
            if idx == 0:
                skipped_idx[idx] = item

        # get values of `skip_columns`
        skipped_values = dframe.iloc[:, skipped_idx].values

        # get dataframe values
        X = dframe.loc[:, remaining_columns].values

        # instantiate VarianceThreshold object
        vt = VarianceThreshold(threshold=thresh)

        # fit vt to data
        vt.fit(X)

        # get the indices of the features that are being kept
        feature_indices = vt.get_support(indices=True)

        # remove low-variance columns from index
        feature_names = [remaining_columns[idx]
                         for idx, _
                         in enumerate(remaining_columns)
                         if idx
                         in feature_indices]

        # get the columns to be removed
        removed_features = list(np.setdiff1d(remaining_columns,
                                             feature_names))
        print("Found {0} low-variance columns."
              .format(len(removed_features)))

        # remove the columns
        if autoremove:
            print("Removing low-variance features.")
            # remove the low-variance columns
            X_removed = vt.transform(X)

            print("Reassembling the dataframe (with low-variance "
                  "features removed).")
            # re-assemble the dataframe
            dframe = pd.DataFrame(data=X_removed,
                                  columns=feature_names)

            # add back the `skip_columns`
            for idx, index in enumerate(skipped_idx):
                dframe.insert(loc=index,
                              column=skip_columns[idx],
                              value=skipped_values[:, idx])
            print("Successfully removed low-variance columns.")

        # do not remove columns
        else:
            print("No changes have been made to the dataframe.")

    except Exception as e:
        print(e)
        print("Could not remove low-variance features. Something "
              "went wrong.")

    return dframe, removed_features
This worked for me. If you want to see exactly which columns remain after thresholding, you can use this method:
from sklearn.feature_selection import VarianceThreshold
threshold_n = 0.95
sel = VarianceThreshold(threshold=(threshold_n * (1 - threshold_n)))
sel_var = sel.fit_transform(data)
data[data.columns[sel.get_support(indices=True)]]
When testing features, I wrote this simple function that tells me which variables remain in the data frame after the VarianceThreshold is applied.
from sklearn.feature_selection import VarianceThreshold
from itertools import compress

def fs_variance(df, threshold: float = 0.1):
    """
    Return a list of selected variables based on the threshold.
    """
    # The list of columns in the data frame
    features = list(df.columns)

    # Initialize and fit the method
    vt = VarianceThreshold(threshold=threshold)
    _ = vt.fit(df)

    # Get the column names which pass the threshold
    feat_select = list(compress(features, vt.get_support()))

    return feat_select
which returns a list of the column names that are selected, for example: ['col_2', 'col_14', 'col_17'].
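As a quick sanity check, here is a hypothetical usage (the column names and values are made up for illustration):
import pandas as pd

df = pd.DataFrame({"col_1": [0, 0, 0, 0],   # zero variance, dropped
                   "col_2": [1, 5, 2, 8],   # high variance, kept
                   "col_3": [2, 2, 2, 2]})  # zero variance, dropped
print(fs_variance(df, threshold=0.1))       # ['col_2']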

Python3 max function using value in defaultdict as key not working

Suppose model is a defaultdict and num is a set:
>>> model
defaultdict(<function <lambda> at 0x11076f758>, {1: 3, 2: 2, 4: 1})
>>> num
{1, 2, 3, 4, 5, 6}
I want to get the item from num that has the maximum value in model. The following code works fine in Python 2:
>>> # python 2.7.6
>>> max(num, key=model.get)
1
But it doesn't work in Python 3:
>>> # python 3.3.3
>>> max(num, key=model.get)
TypeError: unorderable types: NoneType() > int()
I can use max(num, key=lambda k: model[k]) to make it work in Python 3, but if an item in num is not in model, it will be added, which modifies model.
I am wondering why model.get doesn't work in Python 3, and how I can do this without modifying model.
Use key=lambda x: model.get(x, 0).
defaultdict.get by default returns None if the item is not found. Python 2 allows ordered comparisons (like less-than and greater-than) on different types, but Python 3 doesn't. When Python 3 tries to find the max, it tries to see if the value for one key is greater than another. If one of the values is None, it fails with the error you saw. The solution is to make your key function return zero instead of None for missing values.
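A small sketch of the fix (Python 3; int is used here as the default factory since the original lambda isn't shown):
from collections import defaultdict

model = defaultdict(int, {1: 3, 2: 2, 4: 1})
num = {1, 2, 3, 4, 5, 6}

# get() never triggers the default factory, so model is left untouched
best = max(num, key=lambda k: model.get(k, 0))
print(best)        # 1
print(3 in model)  # False -- nothing was added to model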

My webcrawler isn't looping to get all the links - using foo function (Python)

I am creating a webcrawler, and as a first step I need to crawl a website and extract all of its links. However, my code is not looping to extract them. I tried using append, but that results in a list of lists. I'm trying to use foo instead and I get an error. Any help would be appreciated. Thank you.
from urllib2 import urlopen
import re

def get_all_urls(url):
    get_content = urlopen(url).read()
    url_list = []
    find_url = re.compile(r'a\s?href="(.*)">')
    url_list_temp = find_url.findall(get_content)
    for i in url_list_temp:
        url_temp = url_list_temp.pop()
        source = 'http://blablabla/'
        url = source + url_temp
        url_list.append(url)
    #print url_list
    return url_list

def web_crawler(seed):
    tocrawl = [seed]
    crawled = []
    i = 0
    while i < len(tocrawl):
        page = tocrawl.pop()
        if page not in crawled:
            #tocrawl.append(get_all_urls(page))
            foo = (get_all_urls(page))
            tocrawl = foo
            crawled.append(page)
        if not tocrawl:
            break
    print crawled
    return crawled
First of all, it's a bad idea to parse HTML with regular expressions, for all the reasons listed:
here: Python regular expression for HTML parsing (BeautifulSoup)
here: Python regex to match HTML
here: regexp python with parsing html page
and so on..
You should use an HTML parser to do the job. Python provides one in its standard library: HTMLParser, but you could also use BeautifulSoup or even lxml. I tend to favor BeautifulSoup for its nice API.
Now, back to your problem: you're modifying the list you're iterating over:
for i in url_list_temp:
    url_temp = url_list_temp.pop()
    source = 'http://blablabla/'
    ...
This is bad, because it metaphorically amounts to sawing a branch you're sitting on.
Also, you do not seem to need this removal, as there is no condition under which a URL must or must not be removed.
Finally, you get an error after using append because, as you said, it creates a list of lists. You should use extend instead:
>>> l1 = [1, 2, 3]
>>> l2 = [4, 5, 6]
>>> l1.append(l2)
>>> l1
[1, 2, 3, [4, 5, 6]]
>>> l1 = [1, 2, 3]
>>> l1.extend(l2)
>>> l1
[1, 2, 3, 4, 5, 6]
NB: Take a look at http://www.pythonforbeginners.com/python-on-the-web/web-scraping-with-beautifulsoup/ for additional help with scraping with beautifulsoup
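To make the parser suggestion concrete, here is a minimal sketch of get_all_urls rewritten with BeautifulSoup (Python 2 to match your urllib2 import; 'http://blablabla/' is the placeholder base URL from your question):
from urllib2 import urlopen
from bs4 import BeautifulSoup

def get_all_urls(url):
    soup = BeautifulSoup(urlopen(url).read())
    source = 'http://blablabla/'
    # collect the href of every anchor tag, skipping anchors without one
    return [source + a['href']
            for a in soup.find_all('a')
            if a.has_attr('href')]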