python: Finding min values of subsets of a list - python-2.7

I have a list that looks something like this
(The columns would essentially be acct, subacct, value.):
1,1,3
1,2,-4
1,3,1
2,1,1
3,1,2
3,2,4
4,1,1
4,2,-1
I want to update the list to look like this:
(The columns are now acct, subacct, value, min of the value for each account)
1,1,3,-4
1,2,-4,-4
1,3,1,-4
2,1,1,1
3,1,2,2
3,2,4,2
4,1,1,-1
4,2,-1,-1
The fourth value is derived by taking the min(value) for each account. So, for account 1, the min is -4, so col4 would be -4 for the three records tied to account 1.
For account 2, there is only one value.
For account 3, the min of 2 and 4 is 2, so the value for col 4 is 2 where account = 3.
I need to preserve col3, as I will need to use the value in column 3 for other calculations later. I also need to create this additional column for output later.
I have tried the following:
with open(file_name, 'rU') as f:  # opens PW file
    data = zip(*csv.reader(f, delimiter='\t'))
    # data = list(list(rec) for rec in csv.reader(f, delimiter='\t'))
    # reads csv into a list of lists
    # print the first row
    uniqAcct = []
    data[0] not in used and (uniqAcct.append(data[0]) or True)
But short of looping through and matching on each unique account, and then going back through and adding a new column, I am stuck. I think there must be a pythonic way of doing this, but I cannot figure it out. Any help would be greatly appreciated!
I cannot use numpy, pandas, etc. as they cannot be installed on this server yet. I need to use just basic Python 2.

So the problem here is your data structure: it's not trivial to index.
Ideally you'd change it to something more readable and keep it in those containers. However, if you insist on changing it back into tuples, I'd go with this construction:
# dummy values
data = [
    (1, 1, 3),
    (1, 2, -4),
    (1, 3, 1),
    (2, 1, 1),
    (3, 1, 2),
    (3, 2, 4),
    (4, 1, 1),
    (4, 2, -1),
]

class Account:
    def __init__(self, acct):
        self.acct = acct
        self.subaccts = {}  # maps sub-account id to its value

    def as_tuples(self):
        min_value = min(self.subaccts.values())
        for subacct, val in self.subaccts.items():
            yield (self.acct, subacct, val, min_value)

def accounts_as_tuples(accounts):
    return [summary for acct_obj in accounts.values() for summary in acct_obj.as_tuples()]

accounts = {}
for acct, subacct, val in data:
    if acct not in accounts:
        accounts[acct] = Account(acct)
    accounts[acct].subaccts[subacct] = val

print(accounts_as_tuples(accounts))
But ideally, I'd keep it in the Account objects and just add a method that extracts the minimal value of the account when it's needed.
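For instance, a minimal sketch of that idea (the method name min_value is my own, hypothetical choice, not from the answer above):
class Account:
    def __init__(self, acct):
        self.acct = acct
        self.subaccts = {}  # maps sub-account id to its value

    def min_value(self):
        # hypothetical helper: compute the minimum on demand
        # instead of materialising it into every output tuple
        return min(self.subaccts.values())

# usage
a = Account(1)
a.subaccts.update({1: 3, 2: -4, 3: 1})
print(a.min_value())  # -4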

Here is another way, using your initial approach.
Modify the way you import your data, so you can easily handle it in Python.
import csv

mylist = []
with open(file_name, 'rU') as f:  # opens PW file
    data = csv.reader(f, delimiter='\t')
    for row in data:
        splitted = row[0].split(',')
        # this is in case you need integers
        splitted = [int(i) for i in splitted]
        mylist.append(splitted)
Then, add the fourth column:
updated = []
for acc in set(zip(*mylist)[0]):
    acclist = [x for x in mylist if x[0] == acc]
    m = min(x[2] for x in acclist)  # min over the value column only, not every field
    for l in acclist:
        l.append(m)
    updated += acclist
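A more compact alternative (just a sketch, reusing the parsed mylist from above): compute the per-account minima in one pass with a plain dict, then build the updated rows:
# sketch: per-account minima with a plain dict (no third-party libraries)
minima = {}
for acct, subacct, val in mylist:
    minima[acct] = min(val, minima.get(acct, val))

updated = [row + [minima[row[0]]] for row in mylist]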

Related

Counting matrix pairs using a threshold

I have a folder with hundreds of txt files I need to analyse for similarity. Below is an example of a script I use to run similarity analysis. In the end I get an array or a matrix I can plot etc.
I would like to see how many pairs there are with cos_similarity > 0.5 (or any other threshold I decide to use), removing cos_similarity == 1 when I compare the same files, of course.
Secondly, I need a list of these pairs based on file names.
So the output for the example below would look like:
1
and
["doc1", "doc4"]
I will really appreciate your help, as I feel a bit lost and don't know which direction to go.
This is an example of my script to get the matrix:
doc1 = "Amazon's promise of next-day deliveries could be investigated amid customer complaints that it is failing to meet that pledge."
doc2 = "The BBC has been inundated with comments from Amazon Prime customers. Most reported problems with deliveries."
doc3 = "An Amazon spokesman told the BBC the ASA had confirmed to it there was no investigation at this time."
doc4 = "Amazon's promise of next-day deliveries could be investigated amid customer complaints..."
documents = [doc1, doc2, doc3, doc4]

# In my real script I iterate through a folder (path) with txt files like this:
# def read_text(path):
#     documents = []
#     for filename in glob.iglob(path + '*.txt'):
#         _file = open(filename, 'r')
#         text = _file.read()
#         documents.append(text)
#     return documents

import nltk, string, numpy
nltk.download('punkt')  # first-time use only
stemmer = nltk.stem.porter.PorterStemmer()

def StemTokens(tokens):
    return [stemmer.stem(token) for token in tokens]

remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)

def StemNormalize(text):
    return StemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))

nltk.download('wordnet')  # first-time use only
lemmer = nltk.stem.WordNetLemmatizer()

def LemTokens(tokens):
    return [lemmer.lemmatize(token) for token in tokens]

remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)

def LemNormalize(text):
    return LemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))

from sklearn.feature_extraction.text import CountVectorizer
LemVectorizer = CountVectorizer(tokenizer=LemNormalize, stop_words='english')
LemVectorizer.fit_transform(documents)
tf_matrix = LemVectorizer.transform(documents).toarray()

from sklearn.feature_extraction.text import TfidfTransformer
tfidfTran = TfidfTransformer(norm="l2")
tfidfTran.fit(tf_matrix)
tfidf_matrix = tfidfTran.transform(tf_matrix)
cos_similarity_matrix = (tfidf_matrix * tfidf_matrix.T).toarray()

from sklearn.feature_extraction.text import TfidfVectorizer
TfidfVec = TfidfVectorizer(tokenizer=LemNormalize, stop_words='english')

def cos_similarity(textlist):
    tfidf = TfidfVec.fit_transform(textlist)
    return (tfidf * tfidf.T).toarray()

cos_similarity(documents)
Out:
array([[ 1.        ,  0.1459739 ,  0.03613371,  0.76357693],
       [ 0.1459739 ,  1.        ,  0.11459266,  0.19117117],
       [ 0.03613371,  0.11459266,  1.        ,  0.04732164],
       [ 0.76357693,  0.19117117,  0.04732164,  1.        ]])
As I understood your question, you want a function that takes the output numpy array and a certain value (a threshold) and returns two things:
how many doc pairs have a similarity greater than or equal to the given threshold
the names of these docs.
So, here I've made the following function, which takes three arguments:
the output numpy array from the cos_similarity() function
a list of document names
a certain number (the threshold).
And here it is:
def get_docs(arr, docs_names, threshold):
    output_tuples = []
    for row in range(len(arr)):
        lst = [row + 1 + idx for idx, num in
               enumerate(arr[row, row+1:]) if num >= threshold]
        for item in lst:
            output_tuples.append((docs_names[row], docs_names[item]))
    return len(output_tuples), output_tuples
Let's see it in action:
>>> docs_names = ["doc1", "doc2", "doc3", "doc4"]
>>> arr = cos_similarity(documents)
>>> arr
array([[ 1.        ,  0.1459739 ,  0.03613371,  0.76357693],
       [ 0.1459739 ,  1.        ,  0.11459266,  0.19117117],
       [ 0.03613371,  0.11459266,  1.        ,  0.04732164],
       [ 0.76357693,  0.19117117,  0.04732164,  1.        ]])
>>> threshold = 0.5
>>> get_docs(arr, docs_names, threshold)
(1, [('doc1', 'doc4')])
>>> get_docs(arr, docs_names, 1)
(0, [])
>>> get_docs(arr, docs_names, 0.13)
(3, [('doc1', 'doc2'), ('doc1', 'doc4'), ('doc2', 'doc4')])
Let's see how this function works:
First, I iterate over every row of the numpy array.
Second, I iterate over every item in the row whose index is bigger than the row's index, so we only walk the upper triangle of the matrix. That's because each pair of documents appears twice in the full array; we can see that the two values arr[0][1] and arr[1][0] are the same. The diagonal items aren't included either, because we know for sure they are 1, as every document is perfectly similar to itself :).
Finally, we get the items whose values are greater than or equal to the given threshold, and return their indices. These indices are used later to get the document names.
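The same upper-triangle scan can also be written with itertools.combinations; this is a sketch of an equivalent formulation, not part of the original answer:
from itertools import combinations

def get_docs_combinations(arr, docs_names, threshold):
    # visit each unordered pair (i, j) with i < j exactly once,
    # which skips the diagonal and the mirrored lower triangle
    pairs = [(docs_names[i], docs_names[j])
             for i, j in combinations(range(len(arr)), 2)
             if arr[i][j] >= threshold]
    return len(pairs), pairs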

Groupby on one column of pandas dataframe, and train feature and target (X, y) of each group with a common sklearn pipeline using GridsearchCv

I have a pandas dataframe with the following structure:
pd.DataFrame({"user_id": ['user_id1', 'user_id1', 'user_id1', 'user_id2', 'user_id2'],
'meeting': ['text1', 'text2', 'text3', 'text4', 'text5'], 'label': ['a,b', 'a', 'a,c', 'x', 'x,y' ]})
There a total of 12 user_id's. I have a pipeline as follows:
knn_tfidf = Pipeline([('tf_idf', TfidfVectorizer(stop_words='english')),
                      ('model', OneVsRestClassifier(KNeighborsClassifier()))])
a parameter grid as follows:
param_grid_1 = {'tf_idf__max_df': (0.25, 0.5, 0.75),
                'tf_idf__ngram_range': [(1, 1), (1, 2), (2, 2), (1, 3)],
                'model__estimator__n_neighbors': np.arange(1, 30)}
And finally GridSearchCV:
Grid_Search_tune = GridSearchCV(knn_tfidf, param_grid_1, cv=2)
I need to create a model for each user with the corresponding X and y values. For one user, I can do the following:
t = df[df.user_id == 'user_id1']
Extract X and y from t, pass y to a MultiLabelBinarizer(), and then, after instantiating the pipeline, param_grid and GridSearchCV, I can do:
Grid_Search_tune.fit(X, y)
Doing this 12 times, once per user, is repetitive, so I looped through the grouped pandas DataFrame. Here is what I have done:
g = df.groupby('user_id')

for names, groups in g:
    X = groups.meeting.as_matrix()
    labels = [x.split(',') for x in groups.label.tolist()]
    mlb = MultiLabelBinarizer()
    y = mlb.fit_transform(labels)
    knn_tfidf = Pipeline([('tf_idf', TfidfVectorizer(stop_words='english')),
                          ('model', OneVsRestClassifier(KNeighborsClassifier()))])
    param_grid_1 = {'tf_idf__max_df': (0.25, 0.5, 0.75),
                    'tf_idf__ngram_range': [(1, 2), (2, 2), (1, 3)],
                    'model__estimator__n_neighbors': np.arange(1, 4)}
    Grid_Search_tune = GridSearchCV(knn_tfidf, param_grid_1, cv=2)
    all_estimators = Grid_Search_tune.fit(X, y)
    best_of_all_estimators = Grid_Search_tune.best_estimator_
    print(names)
    print(best_of_all_estimators)
This gives me an output like:
user_id1
Pipeline(memory=None,
     steps=[('tf_idf', TfidfVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=0.25, max_features=None, min_df=1,
        ngram_range=(2, 2), norm=u'l2', preprocessor=None, smooth_idf=T...tric_params=None, n_jobs=1, n_neighbors=1, p=2,
        weights='uniform'),
        n_jobs=1))])
user_id2
Pipeline(memory=None,
     steps=[('tf_idf', TfidfVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=0.25, max_features=None, min_df=1,
        ngram_range=(1, 2), norm=u'l2', preprocessor=None, smooth_idf=T...tric_params=None, n_jobs=1, n_neighbors=1, p=2,
        weights='uniform'),
        n_jobs=1))])
And so on till user_id12 and the corresponding pipeline. I don't know if this is the correct way of doing it, and from here on I am lost. If I do:
best_of_all_estimators.predict(['some_text_string'])
I get a prediction for all the 12 models. How do I key or index my models with the for-loop variable 'names', so that when I do:
str(raw_input('Choose user_id from above list:'))
and choose, say, user_id3, and then
str(raw_input('Enter text string:'))
and enter 'some random string', the model trained on the X and y belonging to user_id3 is pulled up and the prediction is made with that model, and not with all the models? A very similar question is linked here: training an ML model on selected parts of a data frame. I am a beginner and I'm really struggling! Please, please help! Thanks a ton in advance.
Apparently Pipeline doesn't support changing the number of samples, such as in a groupby or other aggregation.
Here is a similar question and a possible workaround:
sklearn: Have an estimator that filters samples
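As for the question's actual ask, keying each fitted model by the groupby variable names, a minimal sketch (my own illustration, not from the linked workaround) would be to collect the best estimators in a dict:
# sketch: store one fitted model per user inside the existing loop
models = {}
for names, groups in g:
    # ... build X and y and set up Grid_Search_tune exactly as in the question ...
    Grid_Search_tune.fit(X, y)
    models[names] = Grid_Search_tune.best_estimator_

# later: predict with the chosen user's model only
user_id = str(raw_input('Choose user_id from above list:'))
text = str(raw_input('Enter text string:'))
print(models[user_id].predict([text]))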

code for split a list of items into list (split function) in python

I have a function to split the list; for example:
def split(*arg):
    row = len(arg[0])
    col = len(arg)
    new = [row * col]
    for i in row:
        for j in col:
            new[j][i] = arg[i][j]
    return new
# this is my method for splitting the list, but it includes errors
Desired output:
list_a = [(1,2,3),(8,9,10),(100,20,15)]
split(list_a)
[(1,8,100),(2,9,20),(3,10,15)]
This is very similar to Transpose nested list in python.
However, you want a list of tuples as the result, so we don't even need a list comprehension. Just
list_a = [(1,2,3),(8,9,10),(100,20,15)]
zip(*list_a) # Python 2
# or
list(zip(*list_a)) # Python 3
# [(1, 8, 100), (2, 9, 20), (3, 10, 15)]
This uses argument unpacking and the built-in zip function.
Based on the desired output, it seems you are trying to find the transpose, so you could do it with numpy like this:
import numpy

list_a = [(1,2,3),(8,9,10),(100,20,15)]
transpose_a = numpy.transpose(list_a)
print(transpose_a)
# or
# print(list(transpose_a))
But your split is malfunctioning for a few reasons:
you are using a *arg parameter but not unpacking the argument, so you need to call it like split(*list_a)
new = [row * col] creates a new list with one item instead of a two-dimensional list
you are iterating over integers instead of using range(row) and range(col)
row and col need to be swapped: row = len(arg) and col = len(arg[0]), since you use row as the first dimension and col as the second
Although it occurs to me that this is exactly what zip is designed to do, so maybe you should just use that instead.
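For completeness, here is a sketch of the split function with those four fixes applied (my own reconstruction, pure Python):
def split(*arg):
    row = len(arg)     # first dimension: number of input tuples
    col = len(arg[0])  # second dimension: length of each tuple
    # build a proper two-dimensional list instead of [row * col]
    new = [[None] * row for _ in range(col)]
    for i in range(row):
        for j in range(col):
            new[j][i] = arg[i][j]
    return [tuple(r) for r in new]

list_a = [(1,2,3),(8,9,10),(100,20,15)]
print(split(*list_a))  # [(1, 8, 100), (2, 9, 20), (3, 10, 15)]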

Adding 2 lists inside a dictionary

I've been trying to add the numbers from 2 lists inside a dictionary. The thing is, I need to verify whether the value in the selected row and column is already in the dictionary; if so, I want to add the double-entry list to the value (another double-entry list) already existing in the dictionary. I'm using an Excel spreadsheet plus xlrd so I can read it. I'm pretty new to this.
For example, the code checks the account (a number) in the specified row and columns; let's say the value is 10. If it's not in the dictionary, it adds the 2 values corresponding to this account, let's say [100, 0], as the value for this key. This is working as intended.
Now, the hard part is when the account number is already in the dictionary. Let's say the second entry for account number 10 is [50, 20]. I want the value associated with the key 10 to be [150, 20].
I've tried the zip method, but it seems to return random results; sometimes it adds up, sometimes it doesn't.
import xlrd

book = xlrd.open_workbook("Entry.xls")
print("The number of worksheets is", book.nsheets)
print("Worksheet name(s):", book.sheet_names())
sh = book.sheet_by_index(0)
print(sh.name, "Number of rows", sh.nrows, "Number of cols", sh.ncols)

liste_compte = {}
for rx in range(4, 10):
    if sh.cell_value(rowx=rx, colx=4) not in liste_compte:
        liste_compte[sh.cell_value(rowx=rx, colx=4)] = [sh.cell_value(rowx=rx, colx=6), sh.cell_value(rowx=rx, colx=7)]
    else:
        three = [x + y for x, y in zip(liste_compte[sh.cell_value(rowx=rx, colx=4)],
                                       [sh.cell_value(rowx=rx, colx=6), sh.cell_value(rowx=rx, colx=7)])]
        liste_compte[sh.cell_value(rowx=rx, colx=4)] = three

print(liste_compte)
I'm not going to directly untangle your code, but just help you with a general example that does what you want:
def update_balance(existing_balance, new_balance):
    for column in range(len(existing_balance)):
        existing_balance[column] += new_balance[column]

def update_account(accounts, account_number, new_balance):
    if account_number in accounts:
        update_balance(existing_balance=accounts[account_number], new_balance=new_balance)
    else:
        accounts[account_number] = new_balance
And finally you'd do something like this (assuming each row of your xls looks like [account_number, balance_1, balance_2]):
accounts = dict()
for row in xls:
    update_account(accounts=accounts,
                   account_number=row[0],
                   new_balance=row[1:3])  # note: row[1:2] would grab only the first balance
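To tie this back to the xlrd loop from the question, a sketch reusing the sheet sh and the column indices from the original snippet:
# sketch: sh and update_account as defined above
accounts = dict()
for rx in range(4, 10):
    update_account(accounts=accounts,
                   account_number=sh.cell_value(rowx=rx, colx=4),
                   new_balance=[sh.cell_value(rowx=rx, colx=6),
                                sh.cell_value(rowx=rx, colx=7)])
print(accounts)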

Pandas Dataframe ValueError: Shape of passed values is (X, ), indices imply (X, Y)

I am getting an error and I'm not sure how to fix it.
The following seems to work:
def random(row):
    return [1,2,3,4]

df = pandas.DataFrame(np.random.randn(5, 4), columns=list('ABCD'))
df.apply(func=random, axis=1)
and my output is:
[1,2,3,4]
[1,2,3,4]
[1,2,3,4]
[1,2,3,4]
However, when I change one of the columns to a value such as 1 or None:
def random(row):
    return [1,2,3,4]

df = pandas.DataFrame(np.random.randn(5, 4), columns=list('ABCD'))
df['E'] = 1
df.apply(func=random, axis=1)
I get the error:
ValueError: Shape of passed values is (5,), indices imply (5, 5)
I've been wrestling with this for a few days now and nothing seems to work. What is interesting is that when I change
def random(row):
    return [1,2,3,4]
to
def random(row):
    print [1,2,3,4]
everything seems to work normally.
This question is a clearer way of asking my earlier question, which I feel may have been confusing.
My goal is to compute a list for each row and then create a column out of that.
EDIT: I originally start with a dataframe that has one column. I add 4 columns in 4 different apply steps, and then when I try to add another column I get this error.
If your goal is to add a new column to the DataFrame, just write your function as a function returning a scalar value (not a list), something like this:
>>> def random(row):
...     return row.mean()
and then use apply:
>>> df['new'] = df.apply(func=random, axis=1)
>>> df
          A         B         C         D       new
0  0.201143 -2.345828 -2.186106 -0.784721 -1.278878
1 -0.198460  0.544879  0.554407 -0.161357  0.184867
2  0.269807  1.132344  0.120303 -0.116843  0.351403
3 -1.131396  1.278477  1.567599  0.483912  0.549648
4  0.288147  0.382764 -0.840972  0.838950  0.167222
I don't know if it is possible for your new column to contain lists, but it is definitely possible for it to contain tuples ((...) instead of [...]):
>>> def random(row):
...     return (1,2,3,4,5)
...
>>> df['new'] = df.apply(func=random, axis=1)
>>> df
          A         B         C         D              new
0  0.201143 -2.345828 -2.186106 -0.784721  (1, 2, 3, 4, 5)
1 -0.198460  0.544879  0.554407 -0.161357  (1, 2, 3, 4, 5)
2  0.269807  1.132344  0.120303 -0.116843  (1, 2, 3, 4, 5)
3 -1.131396  1.278477  1.567599  0.483912  (1, 2, 3, 4, 5)
4  0.288147  0.382764 -0.840972  0.838950  (1, 2, 3, 4, 5)
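Since the stated goal was to compute a list per row and turn it into columns, another common approach (my own addition, not part of the original answer) is to return a pd.Series, which apply expands into one column per element:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 4), columns=list('ABCD'))

def make_row(row):
    # returning a Series makes apply produce a DataFrame,
    # with one new column per Series element
    return pd.Series([1, 2, 3, 4])

new_cols = df.apply(make_row, axis=1)
new_cols.columns = ['e1', 'e2', 'e3', 'e4']  # hypothetical column names
df = df.join(new_cols)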
I use the code below and it works just fine:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array(your_data), columns=columns)