Update 1: the last line of code, sorted_xlist = sorted(xlist).extend(sorted(words_cp)), should be changed to the two lines below, because list.extend() modifies the list in place and returns None:
sorted_xlist.extend(sorted(xlist))
sorted_xlist.extend(sorted(words_cp))
Update 1: Code is updated to solve the problem of changing length of words list.
This exercise on list functions is from Google's Python Introduction course. I don't know why the code doesn't work in Python 2.7. The goal of the code is explained in the comments.
# B. front_x
# Given a list of strings, return a list with the strings
# in sorted order, except group all the strings that begin with 'x' first.
# e.g. ['mix', 'xyz', 'apple', 'xanadu', 'aardvark'] yields
# ['xanadu', 'xyz', 'aardvark', 'apple', 'mix']
# Hint: this can be done by making 2 lists and sorting each of them
# before combining them.
def front_x(words):
    words_cp = []
    words_cp.extend(words)
    xlist=[]
    sorted_xlist=[]
    for i in range(0, len(words)):
        if words[i][0] == 'x':
            xlist.append(words[i])
            words_cp.remove(words[i])
    print sorted(words_cp) # For debugging
    print sorted(xlist) # For debugging
    sorted_xlist = sorted(xlist).extend(sorted(words_cp))
    return sorted_xlist
Update 1: Now error message is gone.
front_x
['axx', 'bbb', 'ccc']
['xaa', 'xzz']
X got: None expected: ['xaa', 'xzz', 'axx', 'bbb', 'ccc']
['aaa', 'bbb', 'ccc']
['xaa', 'xcc']
X got: None expected: ['xaa', 'xcc', 'aaa', 'bbb', 'ccc']
['aardvark', 'apple', 'mix']
['xanadu', 'xyz']
X got: None expected: ['xanadu', 'xyz', 'aardvark', 'apple', 'mix']
The splitting of the original list works fine. But the merging doesn't work.
You're iterating over a sequence as you're changing its length.
Imagine if you start off with an array
arr = ['a','b','c','d','e']
When you remove the first two items from it, now you have:
arr = ['c','d','e']
But you're still iterating over the length of the original array. Eventually you get to i > 2, in my example above, which raises an IndexError.
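One way to sidestep both problems (removing items while looping, and the None that list.extend() returns) is to build two new lists and join the sorted results. This is only a sketch of the approach suggested by the exercise's hint, not the course's reference solution; the names x_words and other_words are made up for illustration.

```python
def front_x(words):
    """Return words sorted, with every word starting with 'x' grouped first."""
    # Build two new lists instead of removing items from a list during a loop.
    # startswith() also copes with empty strings, unlike word[0].
    x_words = [w for w in words if w.startswith('x')]
    other_words = [w for w in words if not w.startswith('x')]
    # sorted() returns a new list, so the two results can be joined with +.
    # list.extend() would change a list in place and return None instead.
    return sorted(x_words) + sorted(other_words)


print(front_x(['mix', 'xyz', 'apple', 'xanadu', 'aardvark']))
# ['xanadu', 'xyz', 'aardvark', 'apple', 'mix']
```

Because nothing is removed from the list being iterated, the loop indices can never run past the end of a shrinking list.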
I've been trying to get the nlargest rows per group by following the method from this question. The solution to the question is correct up to a point.
In this example, I group by column A and want to return the rows of C and D based on the top two values in B.
For some reason the index of grp_df is multilevel and includes both A and the original index of ddf.
I was hoping to simply reset_index() and drop the unwanted index and just keep A, but I get the following error:
ValueError: The columns in the computed data do not match the columns in the provided metadata
Here is a simple example reproducing the error:
import numpy as np
import dask.dataframe as dd
import pandas as pd
np.random.seed(42)
df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
ddf = dd.from_pandas(df, npartitions=3)
grp_df = ddf.groupby('A')[['B','C']].apply(lambda x: x.nlargest(2, columns=['B']), meta={
    "B": 'f8', "C": 'f8'})
# Print is successful and results are correct
print(grp_df.head())
grp_df = grp_df.reset_index()
# Print is unsuccessful and shows error below
print(grp_df.head())
I found an approach to a solution here.
The following code allows reset_index() to work and gets rid of the original ddf index. I'm still not sure why the original ddf index came through the groupby in the first place, though.
meta = pd.DataFrame(columns=['B', 'C'], dtype=int, index=pd.MultiIndex([[], []], [[], []], names=['A', None]))
grp_df = ddf.groupby('A')[['B','C']].apply(lambda x: x.nlargest(2, columns=['B']), meta=meta)
grp_df = grp_df.reset_index().drop('level_1', axis=1)
I am working in Python 3.6 with NLTK 3.2.
I am trying to write a program which takes raw text as input and outputs any (maximum) series of consecutive words beginning with the same letter (i.e. alliterative sequences).
When searching for sequences, I want to ignore certain words and punctuation (for instance, 'it', 'that', 'into', "'s", ',', and '.') but still include them in the output.
For example, inputting
"The door was ajar. So it seems that Sam snuck into Sally's subaru."
should yield
["so", "it", "seems", "that", "sam", "snuck", "into", "sally's", "subaru"]
I am new to programming and the best I could come up with is:
import nltk
from nltk import word_tokenize
raw = "The door was ajar. So it seems that Sam snuck into Sally's subaru."
tokened_text = word_tokenize(raw) #word tokenize the raw text with NLTK's word_tokenize() function
tokened_text = [w.lower() for w in tokened_text] #make it lowercase
for w in tokened_text: #for each word of the text
    letter = w[0] #consider its first letter
    allit_str = []
    allit_str.append(w) #add that word to a list
    pos = tokened_text.index(w) #let "pos" be the position of the word being considered
    for i in range(1,len(tokened_text)-pos): #consider the next word
        if tokened_text[pos+i] in {"the","a","an","that","in","on","into","it",".",",","'s"}: #if it's one of these
            allit_str.append(tokened_text[pos+i]) #add it to the list
            i=+1 #and move on to the next word
        elif tokened_text[pos+i][0] == letter: #or else, if the first letter is the same
            allit_str.append(tokened_text[pos+i]) #add the word to the list
            i=+1 #and move on to the next word
        else: #or else, if the letter is different
            break #break the for loop
    if len(allit_str)>=2: #if the list has two or more members
        print(allit_str) #print it
which outputs
['ajar', '.']
['so', 'it', 'seems', 'that', 'sam', 'snuck', 'into', 'sally', "'s", 'subaru', '.']
['seems', 'that', 'sam', 'snuck', 'into', 'sally', "'s", 'subaru', '.']
['sam', 'snuck', 'into', 'sally', "'s", 'subaru', '.']
['snuck', 'into', 'sally', "'s", 'subaru', '.']
['sally', "'s", 'subaru', '.']
['subaru', '.']
This is close to what I want, except that I don't know how to restrict the program to only print the maximum sequences.
So my questions are:
How can I modify this code to only print the maximum sequence
['so', 'it', 'seems', 'that', 'sam', 'snuck', 'into', 'sally', "'s", 'subaru', '.']?
Is there an easier way to do this in Python, maybe with regular expression or more elegant code?
Here are similar questions asked elsewhere, but which have not helped me modify my code:
How do you effectively use regular expressions to find alliterative expressions?
A reddit challenge asking for a similar program
4chan question regarding counting instances of alliteration
Blog about finding most common alliterative strings in a corpus
(I also think it would be nice to have this question answered on this site.)
Interesting task. Personally, I'd loop through without the use of indices, keeping track of the previous word to compare it with the current word.
Additionally, it's not enough to compare letters; you have to take into account that 's' and 'sh' etc don't alliterate. Here's my attempt:
import nltk
from nltk import word_tokenize
from nltk import sent_tokenize
from nltk.corpus import stopwords
import string
from collections import defaultdict, OrderedDict
import operator
raw = "The door was ajar. So it seems that Sam snuck into Sally's subaru. She seems shy sometimes. Someone save Simon."
# Get the English alphabet as a list of letters
letters = [letter for letter in string.ascii_lowercase]
# Here we add some extra phonemes that are distinguishable in text.
# ('sailboat' and 'shark' don't alliterate, for instance)
# Digraphs go first as we need to try matching these before the individual letters,
# and break out if found.
sounds = ["ch", "ph", "sh", "th"] + letters
# Use NLTK's built in stopwords and add "'s" to them
stopwords = stopwords.words('english') + ["'s"] # add extra stopwords here
stopwords = set(stopwords) # sets are MUCH faster to process
sents = sent_tokenize(raw)
alliterating_sents = defaultdict(list)
for sent in sents:
    tokenized_sent = word_tokenize(sent)
    # Create list of alliterating word sequences
    alliterating_words = []
    previous_initial_sound = ""
    for word in tokenized_sent:
        for sound in sounds:
            if word.lower().startswith(sound): # only lowercasing when comparing retains original case
                initial_sound = sound
                if initial_sound == previous_initial_sound:
                    if len(alliterating_words) > 0:
                        if previous_word == alliterating_words[-1]: # prevents duplication in chains of more than 2 alliterations (but assumes repetition is not alliteration)
                            alliterating_words.append(word)
                        else:
                            alliterating_words.append(previous_word)
                            alliterating_words.append(word)
                    else:
                        alliterating_words.append(previous_word)
                        alliterating_words.append(word)
                break # Allows us to treat sh/s distinctly
        # This needs to be at the end of the loop
        # It sets us up for the next iteration
        if word not in stopwords: # ignores stopwords for the purpose of determining alliteration
            previous_initial_sound = initial_sound
            previous_word = word
    alliterating_sents[len(alliterating_words)].append(sent)
sorted_alliterating_sents = OrderedDict(sorted(alliterating_sents.items(), key=operator.itemgetter(0), reverse=True))
# OUTPUT
print ("A sorted ordered dict of sentences by number of alliterations:")
print (sorted_alliterating_sents)
print ("-" * 15)
max_key = max([k for k in sorted_alliterating_sents]) # to get sent with max alliteration
print ("Sentence(s) with most alliteration:", sorted_alliterating_sents[max_key])
This produces a sorted ordered dictionary of sentences with their alliteration counts as its keys. The max_key variable contains the count for the highest alliterating sentence or sentences, and can be used to access the sentences themselves.
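For the narrower question of printing only the maximal sequences, a single pass that keeps one running sequence may be enough. The sketch below is an illustration under several assumptions, not part of the answer above: it compares first letters only (so it does not separate 's' from 'sh'), it uses a small regular expression as a stand-in for NLTK's word_tokenize so that it runs without NLTK, and the names simple_tokenize, maximal_alliterations, IGNORED and min_words are hypothetical.

```python
import re

# Tokens that neither start nor break a sequence (taken from the question's code).
IGNORED = {"the", "a", "an", "that", "in", "on", "into", "it", ".", ",", "'s"}


def simple_tokenize(text):
    """Rough stand-in for word_tokenize: lowercase words (apostrophes kept) plus . and ,"""
    return re.findall(r"[a-z']+|[.,]", text.lower())


def maximal_alliterations(tokens, ignored=IGNORED, min_words=2):
    """Return the maximal runs in which every non-ignored word shares a first letter."""
    runs = []
    run, pending, letter, count = [], [], None, 0
    for tok in tokens + [None]:  # None is a sentinel that flushes the last run
        if tok is not None and tok in ignored:
            if run:
                pending.append(tok)  # kept only if the run continues afterwards
            continue
        if tok is not None and run and tok[0] == letter:
            run.extend(pending)  # ignored tokens inside the run are included
            run.append(tok)
            pending = []
            count += 1
            continue
        # The run is broken here (different first letter, or end of input).
        if count >= min_words:
            runs.append(run)
        if tok is None:
            break
        run, pending, letter, count = [tok], [], tok[0], 1
    return runs


raw = "The door was ajar. So it seems that Sam snuck into Sally's subaru."
print(maximal_alliterations(simple_tokenize(raw)))
# [['so', 'it', 'seems', 'that', 'sam', 'snuck', 'into', "sally's", 'subaru']]
```

Since a new run starts only when the previous one is broken, shorter sub-sequences such as ['sam', 'snuck', ...] are never reported separately. Trailing ignored tokens (the final '.') are dropped here; keeping them would only require extending the run with pending before it is stored.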
The accepted answer is very comprehensive, but I would suggest using Carnegie Mellon's pronouncing dictionary. This is partly because it makes life easier, and partly because identical sounding syllables that are not necessarily identical letter-to-letter are also considered alliterations. An example I found online (https://examples.yourdictionary.com/alliteration-examples.html) is "Finn fell for Phoebe".
import re
import nltk
# nltk.download('cmudict') ## download CMUdict for phoneme set
# The phoneme dictionary consists of ARPABET which encode
# vowels, consonants, and a representative stress-level (wiki/ARPABET)
phoneme_dictionary = nltk.corpus.cmudict.dict()
stress_symbols = ['0', '1', '2', '3...', '-', '!', '+', '/',
                  '#', ':', ':1', '.', ':2', '?', ':3']
# nltk.download('stopwords') ## download stopwords (the, a, of, ...)
# Get stopwords that will be discarded in comparison
stopwords = nltk.corpus.stopwords.words("english")
# Function for removing all punctuation marks (. , ! * etc.)
no_punct = lambda x: re.sub(r'[^\w\s]', '', x)
def get_phonemes(word):
    if word in phoneme_dictionary:
        return phoneme_dictionary[word][0] # return first entry by convention
    else:
        return ["NONE"] # no entries found for input word
def get_alliteration_level(text): # alliteration based on sound, not only letter!
    count, total_words = 0, 0
    proximity = 2 # max phonemes to compare to for consideration of alliteration
    i = 0 # index for placing phonemes into current_phonemes
    lines = text.split(sep="\n")
    for line in lines:
        current_phonemes = [None] * proximity
        for word in line.split(sep=" "):
            # remove punctuation marks and lowercase for correct identification
            # (CMUdict keys and the stopword list are lowercase)
            word = no_punct(word).lower()
            total_words += 1
            if word not in stopwords:
                if (get_phonemes(word)[0] in current_phonemes): # alliteration occurred
                    count += 1
                current_phonemes[i] = get_phonemes(word)[0] # update new comparison phoneme
                i = (i + 1) % proximity # update storage index
    alliteration_score = count / total_words
    return alliteration_score
Above is the proposed script. The variable proximity is introduced so that we also count alliterating sounds that are separated by other words. The stress_symbols variable lists the stress levels indicated in the CMU dictionary, and it could easily be incorporated into the function.
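To see why comparing sounds rather than letters matters for "Finn fell for Phoebe", here is a tiny self-contained sketch. The TOY_PHONEMES table is a hand-written stand-in in CMUdict's ARPABET style, included only so that the example runs without downloading the corpus; in real use the entries would come from nltk.corpus.cmudict.dict(). The helper name initial_phoneme is hypothetical.

```python
# Hand-written stand-in for a few CMUdict-style entries (an assumption for
# illustration; real entries would come from nltk.corpus.cmudict.dict()).
TOY_PHONEMES = {
    "finn": ["F", "IH1", "N"],
    "fell": ["F", "EH1", "L"],
    "for": ["F", "AO1", "R"],
    "phoebe": ["F", "IY1", "B", "IY0"],
}


def initial_phoneme(word, phonemes=TOY_PHONEMES):
    """Return the first phoneme of word, or None if the word is not in the table."""
    entry = phonemes.get(word.lower())
    return entry[0] if entry else None


words = "Finn fell for Phoebe".split()
first_letters = sorted({w[0].lower() for w in words})
first_sounds = sorted({initial_phoneme(w) for w in words})
print(first_letters)  # ['f', 'p']
print(first_sounds)   # ['F']
```

A letter-based comparison sees two different initials, while the phoneme-based comparison sees a single shared sound.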
I have been trying to solve this for days. Although I have found a similar problem here (How can i vectorize list using sklearn DictVectorizer), the solution is overly simplified.
I would like to fit some features into a logistic regression model to predict 'chinese' or 'non-chinese'. I have a raw_name from which I extract two features: 1) the last name, and 2) a list of two-letter substrings of the last name; for example, 'Chan' gives ['ch', 'ha', 'an']. But it seems DictVectorizer doesn't accept a list as a dictionary value. Following the link above, I created a function, list_to_dict, which successfully returns some dict elements,
{'substring=co': True, 'substring=or': True, 'substring=rn': True, 'substring=ns': True}
but I have no idea how to incorporate that into my_dict = ... before applying the DictVectorizer.
# coding=utf-8
import pandas as pd
from pandas import DataFrame, Series
import numpy as np
import nltk
import re
import random
from random import randint
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction import DictVectorizer
lr = LogisticRegression()
dv = DictVectorizer()
# Get csv file into data frame
data = pd.read_csv("V2-1_2000Records_Processed_SEP2015.csv", header=0, encoding="utf-8")
df = DataFrame(data)
# Pandas data frame shuffling
df_shuffled = df.iloc[np.random.permutation(len(df))]
df_shuffled.reset_index(drop=True)
# Assign X and y variables
X = df.raw_name.values
y = df.chineseScan.values
# Feature extraction functions
def feature_full_last_name(nameString):
    try:
        last_name = nameString.rsplit(None, 1)[-1]
        if len(last_name) > 1: # not accept name with only 1 character
            return last_name
        else: return None
    except: return None
def feature_twoLetters(nameString):
    placeHolder = []
    try:
        for i in range(0, len(nameString)):
            x = nameString[i:i+2]
            if len(x) == 2:
                placeHolder.append(x)
        return placeHolder
    except: return []
def list_to_dict(substring_list):
    try:
        substring_dict = {}
        for i in substring_list:
            substring_dict['substring='+str(i)] = True
        return substring_dict
    except: return None
list_example = ['co', 'or', 'rn', 'ns']
print list_to_dict(list_example)
# Transform format of X variables, and spit out a numpy array for all features
my_dict = [{'two-letter-substrings': feature_twoLetters(feature_full_last_name(i)),
'last-name': feature_full_last_name(i), 'dummy': 1} for i in X]
print my_dict[3]
Output:
{'substring=co': True, 'substring=or': True, 'substring=rn': True, 'substring=ns': True}
{'dummy': 1, 'two-letter-substrings': [u'co', u'or', u'rn', u'ns'], 'last-name': u'corns'}
Sample data:
Raw_name         chineseScan
Jack Anderson    non-chinese
Po Lee           chinese
If I have understood correctly you want a way to encode list values in order to have a feature dictionary that DictVectorizer could use. (One year too late but) something like this can be used depending on the case:
my_dict_list = []
for i in X:
    # create a new feature dictionary
    feat_dict = {}
    # add the features that are straight forward
    feat_dict['last-name'] = feature_full_last_name(i)
    feat_dict['dummy'] = 1
    # for the features that have a list of values iterate over the values and
    # create a custom feature for each value
    for two_letters in feature_twoLetters(feature_full_last_name(i)):
        # make sure the naming is unique enough so that no other feature
        # unrelated to this will have the same name/ key
        feat_dict['two-letter-substrings-' + two_letters] = True
    # save it to the feature dictionary list that will be used in Dict vectorizer
    my_dict_list.append(feat_dict)
print my_dict_list
from sklearn.feature_extraction import DictVectorizer
dict_vect = DictVectorizer(sparse=False)
transformed_x = dict_vect.fit_transform(my_dict_list)
print transformed_x
Output:
[{'dummy': 1, u'two-letter-substrings-er': True, 'last-name': u'Anderson', u'two-letter-substrings-on': True, u'two-letter-substrings-de': True, u'two-letter-substrings-An': True, u'two-letter-substrings-rs': True, u'two-letter-substrings-nd': True, u'two-letter-substrings-so': True}, {'dummy': 1, u'two-letter-substrings-ee': True, u'two-letter-substrings-Le': True, 'last-name': u'Lee'}]
[[ 1. 1. 0. 1. 0. 1. 0. 1. 1. 1. 1. 1.]
[ 1. 0. 1. 0. 1. 0. 1. 0. 0. 0. 0. 0.]]
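The same flattening can also reuse the list_to_dict helper from the question by merging its result into each feature dictionary with dict.update(). The sketch below is written for Python 3 and uses only the standard library, so it stops at the list of dictionaries that would be handed to DictVectorizer. The names two_letter_substrings and build_features are illustrative, and it assumes every raw name contains at least one word.

```python
def two_letter_substrings(name):
    """Return every two-character substring, e.g. 'Chan' -> ['Ch', 'ha', 'an']."""
    return [name[i:i + 2] for i in range(len(name) - 1)]


def list_to_dict(substrings, prefix='substring='):
    """Turn a list of substrings into one boolean feature per substring."""
    return {prefix + s: True for s in substrings}


def build_features(raw_name):
    """Build one flat feature dictionary for a raw name (assumed non-empty)."""
    last_name = raw_name.rsplit(None, 1)[-1]
    features = {'last-name': last_name, 'dummy': 1}
    # Merge the per-substring features into the same flat dictionary.
    features.update(list_to_dict(two_letter_substrings(last_name)))
    return features


print(build_features('Po Lee'))
# {'last-name': 'Lee', 'dummy': 1, 'substring=Le': True, 'substring=ee': True}
```

Every value in the resulting dictionary is a string, a number or a boolean, which is the flat shape the loop above produces as well.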
Another thing you could do (but I don't recommend) if you don't want to create as many features as the values in your lists is something like this:
# sorting the values would be a good idea
feat_dict[frozenset(feature_twoLetters(feature_full_last_name(i)))] = True
# or
feat_dict[" ".join(feature_twoLetters(feature_full_last_name(i)))] = True
but the first one means that you can't have any duplicate values and probably both don't make good features, especially if you need fine-tuned and detailed ones. Also, they reduce the possibility of two rows having the same combination of two letter combinations, thus the classification probably won't do well.
Output:
[{'dummy': 1, 'last-name': u'Anderson', frozenset([u'on', u'rs', u'de', u'nd', u'An', u'so', u'er']): True}, {'dummy': 1, 'last-name': u'Lee', frozenset([u'ee', u'Le']): True}]
[{'dummy': 1, 'last-name': u'Anderson', u'An nd de er rs so on': True}, {'dummy': 1, u'Le ee': True, 'last-name': u'Lee'}]
[[ 1. 0. 1. 1. 0.]
[ 0. 1. 1. 0. 1.]]
Hello awesome coding folks!
Does anyone have a good idea to add a random number to a list? I am trying to get a list to log the random numbers that are generated inside a loop. Here is an example of the code inside the loop:
stuff = {'name': 'Jack', 'age': 30, 'height': '6 foot 9 inches'}
tester = [0]
print(tester)
tester.append[random.randint(1, len(stuff))]
print(tester)
Apparently the output of random.randint is not subscriptable, but I'm not sure how else to write this.
Thank you in advance for the help!
tester.append[random.randint(1, len(stuff))]
#      wrong ^                       wrong ^
# should be
tester.append(random.randint(1, len(stuff)))
Methods, such as append, are called with parentheses rather than brackets.
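A short sketch of what happens in each case, assuming the loop simply needs to log a handful of random numbers (the count of 5 is arbitrary). Note that the error comes from subscripting the append method itself, not from random.randint.

```python
import random

stuff = {'name': 'Jack', 'age': 30, 'height': '6 foot 9 inches'}
tester = [0]

try:
    # Square brackets try to index the method object itself,
    # so Python raises TypeError before anything is appended.
    tester.append[random.randint(1, len(stuff))]
except TypeError as err:
    print("TypeError:", err)

# Parentheses call the method with the random number as its argument.
for _ in range(5):
    tester.append(random.randint(1, len(stuff)))

print(tester)  # the initial 0 followed by five numbers between 1 and 3
```

random.randint(a, b) includes both endpoints, so with three keys in stuff the logged values are 1, 2 or 3.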
It's simple. Try this:
from random import randint # import randint from random
listone = [] # create a list called listone
for i in range(10): # loop 10 times so the numbers are added one by one
    ic = randint(1,10) # generate a random number from 1 to 10 (both ends inclusive)
    listone.append(ic) # append that number to listone
print(listone) # print the list
# for fun you can sort this out ;)
print(sorted(listone))
Make these modifications in your code:
import random
stuff = {'name': 'Jack', 'age': 30, 'height': '6 foot 9 inches'}
tester = [0]
print(tester)
tester.append(random.randint(1, len(stuff)))
print(tester)
I am doing this Python program where I have to access:
This is what I am trying to achieve with my code: return a dict mapping doc_id to length, computed as sqrt(sum(w_i**2)), where w_i is the tf-idf weight for each term in the document.
E.g., in the sample index below, document 0 has two terms 'a' (with
tf-idf weight 3) and 'b' (with tf-idf weight 4). Its length is
therefore 5 = sqrt(9 + 16).
>>> lengths = Index().compute_doc_lengths({'a': [[0, 3]], 'b': [[0,4]]})
>>> lengths[0]
5.0
The code I have is this:
templist=[]
for iter in index.values():
    templist.append(iter)
d = defaultdict(list)
for i,l in templist[1]:
    d[i].append(l)
lent = defaultdict()
for m in d:
    lo= math.sqrt(sum(lent[m]**2))
return lo
So, if I'm understanding you correctly, we have to transform the input dictionary:
ind = {'a':[ [1,3] ], 'b': [ [1,4 ] ] }
To the output dictionary:
{1:5}
Where the 5 is calculated as the Euclidean norm of the value portion of the input dictionary (the vector [3,4] in this case), correct?
Given that information, the answer becomes a bit more straightforward:
from math import sqrt
def calculate_length(ind):
    # First, let's transform the dictionary into a list of doc_id, tl_idf pairs; [[doc_id_1,tl_idf_1],...]
    data = [entry[0] for entry in ind.itervalues()] # use just ind.values() in python 3.X
    # Next, let's split that list into two, one for doc_id's, one for tl_idfs
    doc_ids, tl_idfs = zip(*data)
    # We can just assume that all the doc_id's are the same. you could check that here if you wanted
    doc_id = doc_ids[0]
    # Next, we calculate the length as per our formula
    length = sqrt(sum(tl_idf**2 for tl_idf in tl_idfs))
    # Finally, we return the output dictionary
    return {doc_id: length}
Example:
>>> calculate_length({'a':[ [1,3] ], 'b': [ [1,4 ] ] })
{1: 5.0}
There are a couple of places in here where you could optimize this to remove the intermediary lists (this method can be two lines of operation and a return), but I'll leave that to you to find out since this is a homework assignment. I also hope you take the time to actually understand what this code does, rather than just copying it wholesale.
Also note that this answer makes the very large assumption that all doc_id values are the same, and that there will only ever be a single [doc_id, tl_idf] list at each key in the dictionary! If that's not true, then your transform becomes more complicated. But you did not provide sample input nor a textual explanation indicating that's the case (though, based on the data structure, I'd think it quite likely).
Update
In fact, it's really bothering me because I definitely think that's the case. Here is a version that solves the more complex case:
from itertools import chain
from collections import defaultdict
from math import sqrt
def calculate_length(ind):
    # We want to transform this first into a dict of {doc_id:[tl_idf_a,...]}
    # First we transform it into a generator of ([doc_id,tl_idf],...)
    tf_gen = chain.from_iterable(ind.itervalues()) # use ind.values() in python 3.X
    # which we then use to generate our transformed dictionary
    tf_dict = defaultdict(list)
    for doc_id, tl_idf in tf_gen:
        tf_dict[doc_id].append(tl_idf)
    # Now we proceed mostly as before, but we can just do it in one line
    return dict((doc_id, sqrt(sum(tl_idf**2 for tl_idf in tl_idfs))) for doc_id, tl_idfs in tf_dict.iteritems()) # use tf_dict.items() in python 3.X
Example use:
>>> calculate_length({'a':[ [1,3] ], 'b': [ [1,4 ] ] })
{1: 5.0}
>>> calculate_length({'a':[ [1,3],[2,3] ], 'b': [ [1,4 ], [2,1] ] })
{1: 5.0, 2: 3.1622776601683795}
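For readers on Python 3, where itervalues() and iteritems() no longer exist, the same idea can be sketched as follows. It is written as a standalone function named after the method in the question rather than as a method of the Index class (whose definition is not shown), and it assumes the index maps each term to a list of [doc_id, weight] pairs.

```python
from collections import defaultdict
from math import sqrt


def compute_doc_lengths(index):
    """Map each doc_id to sqrt(sum(w_i**2)) over its tf-idf weights.

    index maps a term to a list of [doc_id, tf-idf weight] pairs.
    """
    squared_sums = defaultdict(float)
    for postings in index.values():
        for doc_id, weight in postings:
            # Accumulate the squared weight of this term for the document.
            squared_sums[doc_id] += weight ** 2
    return {doc_id: sqrt(total) for doc_id, total in squared_sums.items()}


lengths = compute_doc_lengths({'a': [[0, 3]], 'b': [[0, 4]]})
print(lengths[0])  # 5.0
```

Accumulating the squared weights directly avoids building the intermediate list of weights per document.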