Can't merge two lists into a dictionary - python-2.7

I can't merge two lists into a dictionary. I tried the following:
Map two lists into a dictionary in Python
I tried all solutions and I still get an empty dictionary
from sklearn.feature_extraction import DictVectorizer
from itertools import izip
import itertools
text_file = open("/home/vesko_/evnt_classification/bag_of_words", "r")
text_fiel2 = open("/home/vesko_/evnt_classification/sdas", "r")
lines = text_file.read().split('\n')
words = text_fiel2.read().split('\n')
diction = dict(itertools.izip(words,lines))
new_dict = {k: v for k, v in zip(words, lines)}
print new_dict
I get the following :
{'word': ''}
['word=']
The two lists are not empty.
I'm using python2.7
EDIT :
Output from the two lists (I'm only showing a few because it's a vector with 11k features)
//lines
['change', 'I/O', 'fcnet2', 'ifconfig',....
//words
['word', 'word', 'word', .....
EDIT :
Now at least I have some output @DamianLattenero
{'word\n': 'XXAMSDB35:XXAMSDB35_NGCEAC_DAT_L_Drivei\n'}
['word\n=XXAMSDB35:XXAMSDB35_NGCEAC_DAT_L_Drivei\n']

I think the root of a lot of confusion is code in the example that is not relevant.
Try this:
text_file = open("/home/vesko_/evnt_classification/bag_of_words", "r")
text_fiel2 = open("/home/vesko_/evnt_classification/sdas", "r")
lines = text_file.read().split('\n')
words = text_fiel2.read().split('\n')
# remove any extra newline or whitespace from what was read in;
# the stripped lists must be assigned back, otherwise the result is discarded
lines = [line.rstrip() for line in lines]
words = [word.rstrip() for word in words]
new_dict = dict(zip(words,lines))
print new_dict
Python's builtin zip() returns a list of tuples (an iterator of tuples in Python 3), pairing up the items of its arguments by position. Passing these tuples to the dict() constructor creates a dictionary where each item in words is a key and the item at the same position in lines is its value.
Also note that zip() stops at the shorter of its arguments, so if words and lines have different lengths the extra items are silently dropped. If the same key appears more than once in words, only the last value paired with it is kept.
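Two behaviours of zip() and dict() are worth checking against the data in the question: zip() stops at the shorter input, and dict() keeps only the last value for a repeated key. A minimal sketch with made-up lists (not the contents of the asker's files):

```python
# Hypothetical sample data, chosen to show both effects.
words = ["word", "word", "other", "extra"]
lines = ["change", "I/O", "fcnet2"]

pairs = list(zip(words, lines))   # zip stops at the shorter list, so "extra" is dropped
merged = dict(pairs)              # the repeated key "word" keeps only its last value

print(pairs)
print(merged)
```

If every entry in words is the same string, as the sample output in the question suggests, the resulting dictionary can only ever hold one key.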

I tried this and it worked for me. I created two files, one with the numbers 1 to 4 and one with the letters a to d, and the code creates the dictionary fine. I didn't need to import itertools, and there is actually an extra line that is not needed:
lines = [1,2,3,4]
words = ["a","b","c","d"]
diction = dict(zip(words,lines))
# new_dict = {k: v for k, v in zip(words, lines)}
print(diction)
{'a': 1, 'b': 2, 'c': 3, 'd': 4}
If that works but your version does not, the problem must be in how the lists are loaded. Try loading them like this:
def create_list_from_file(file):
    with open(file, "r") as ins:
        my_list = []
        for line in ins:
            my_list.append(line)
    return my_list
lines = create_list_from_file("/home/vesko_/evnt_classification/bag_of_words")
words = create_list_from_file("/home/vesko_/evnt_classification/sdas")
diction = dict(zip(words,lines))
# new_dict = {k: v for k, v in zip(words, lines)}
print(diction)
Observation:
If your files look like this:
1
2
3
4
and
a
b
c
d
the dictionary will have one key per line, with the newlines still attached:
{'a\n': '1\n', 'b\n': '2\n', 'c\n': '3\n', 'd': '4'}
But if your files look like this:
1 2 3 4
and
a b c d
the result will be {'a b c d': '1 2 3 4'}, with only one key and one value.
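If the trailing newlines in the keys and values are unwanted, they can be stripped while reading. A minimal sketch, where read_stripped_lines is a hypothetical helper and io.StringIO stands in for the real files:

```python
import io

def read_stripped_lines(handle):
    """Return the non-blank lines of an open text file, without line endings."""
    return [line.rstrip("\r\n") for line in handle if line.strip()]

# In-memory stand-ins for the two one-item-per-line files.
values = read_stripped_lines(io.StringIO(u"1\n2\n3\n4\n"))
keys = read_stripped_lines(io.StringIO(u"a\nb\nc\nd\n"))

result = dict(zip(keys, values))
print(result)
```

Skipping blank lines also avoids the empty trailing item that read().split('\n') produces when a file ends with a newline.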

Related

Convert text list into Dictionary?

I have a list as the given one:
l = ['1,a','2,b','3,c']
I want to convert this list into a Dictionary, like this:
l_dict = {1:'a',2:'b',3:'c'}
How can I solve it?
You can use a generator expression to split each string on ',' and pass the resulting pairs to the dict constructor:
dict(e.split(',') for e in l)
output:
{'1': 'a', '2': 'b', '3': 'c'}
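Note that the keys in this output are strings, while the question asks for integer keys. A small sketch that converts them, assuming every item has a numeric part before its first comma:

```python
l = ['1,a', '2,b', '3,c']

# split(',', 1) splits on the first comma only; int() converts the key
l_dict = {int(key): value for key, value in (e.split(',', 1) for e in l)}
print(l_dict)
```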
You need to split each string first and then store the parts in a dict. There are two options: if you just want the mapping, a plain dict is enough; if you also need to keep the insertion order, use an OrderedDict.
from collections import OrderedDict
l = ['1,a','2,b','3,c']
plain = {}  # not named "list", which would shadow the builtin
od = OrderedDict()
for text in l:
    key, value = text.split(",")
    plain[key] = value
    od[key] = value
print(plain)
print(od)

Extract elements from tuple to encode in python

I have a list of lists of tuples, with Unicode problems.
I have been struggling to encode this into the equivalent characters and have been unsuccessful.
Here is a sample of my code:
import spaghetti as sgt
import codecs
f = codecs.open('output-data-pos', encoding='utf-8')
raw = f.read()
reviews = [raw.split()]
output_tagged = (sgt.pos_tag_sents(reviews))
Here is a sample of what output_tagged produces:
[[(u'cerramos', None), (u'igual', u'aq0cs0'), (u'arrancado', None), (u'estanter\xeda', None), (u'\xe9xito', u'ncms000'), (u'an\xe9cdotas', u'ncfp000')]]
My overall objective is to extract each value from the tuple and encode it in utf-8 for a final result such as
cerramos None
igual aq0cs0
arrancado None
estantería None
éxito ncms000
anécdotas ncfp000
Some of the strategies I have tried so far, starting with the simple ones:
where I try to output the list and encode it directly
d = codecs.open('output-data-tagged', 'w', encoding='utf-8')
d.write(output_tagged)
or this approach
f = open('output-data-tagged', 'w')
for output in output_tagged:
    output.encode('utf-8')
    f.write(output)
f.close
where I first try to map the list and then encode it:
list_of_lists = map(list, output_tagged)
print list_of_lists
where I try functions to encode the data
def reprunicode(u):
    return reprunicode(u).decode('raw_unicode_escape')
print u'[%s]' % u', '.join([u'(%s,)' % reprunicode(ti[0]) for ti in output_tagged])
this one too:
def utf8data(list):
    return [item.decode('utf8') for item in list]
print utf8data(output_tagged)
Considering my many trials, how can I extract the elements from the tuple in the list of list in order to arrive at my desired final encoding results?
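One possible approach, sketched under the assumption that output_tagged always has the shape shown above (a list of sentences, each a list of (word, tag) tuples, with None for a missing tag): build the text as unicode first and let the file object do the encoding. The helper name format_tagged and the temporary output path are made up for this sketch.

```python
import io
import os
import tempfile

# Sample in the shape shown in the question.
output_tagged = [[(u'cerramos', None), (u'igual', u'aq0cs0'),
                  (u'estanter\xeda', None), (u'\xe9xito', u'ncms000')]]

def format_tagged(tagged_sents):
    """Return one u'word tag' line per tuple; a missing tag is written as 'None'."""
    rows = []
    for sent in tagged_sents:          # outer list: sentences
        for word, tag in sent:         # inner list: (word, tag) tuples
            rows.append(u'%s %s' % (word, tag))
    return u'\n'.join(rows) + u'\n'

text = format_tagged(output_tagged)

# io.open encodes on write, so no manual .encode('utf-8') call is needed.
path = os.path.join(tempfile.gettempdir(), 'output-data-tagged')
with io.open(path, 'w', encoding='utf-8') as out:
    out.write(text)
```

The earlier attempts fail mainly because they call encode or write on a whole list rather than on the individual strings inside the tuples.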

Create dict try comprehension

This:
index ={}
for item in args:
    for array in item:
        for k,v in json.loads(array).iteritems():
            for value in v:
                index.setdefault(k,[]).append({'values':value['id']})
Works
But, when I try this:
index ={}
filt = {index.setdefault(k,[]).append(value['id']) for item in args for array in item for (k,v) in json.loads(array).iteritems() for value in v}
print filt
Output:
result set([None])
What's wrong?
dict.setdefault returns the list stored under the key, but list.append is an in-place method that returns None, so you are creating a set of Nones. Since sets cannot contain duplicates, that leaves you with set([None]):
In [27]: d = {}
In [28]: print(d.setdefault(1,[]).append(1)) # returns None
None
In [35]: d = {}
In [36]: {d.setdefault(k,[]).append(1) for k in range(2)} # a set comprehension
Out[36]: {None}
In [37]: d
Out[37]: {0: [1], 1: [1]}
The index dict, like d above, does get updated, but using any comprehension for its side effects is not a good approach. You also cannot replicate the for loops/setdefault logic with a dict comprehension.
What you could do is use a defaultdict with list.extend:
from collections import defaultdict
index = defaultdict(list)
for item in args:
    for array in item:
        for k,v in json.loads(array).iteritems():
            index[k].extend({'values':value['id']} for value in v)
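For a version that can be run on its own, here is the same defaultdict approach with made-up input. The shape of args (a list of lists of JSON strings, each mapping a key to a list of objects with an id) is an assumption inferred from the loops in the question.

```python
import json
from collections import defaultdict

# Hypothetical input in the shape the loops expect.
args = [
    ['{"a": [{"id": 1}, {"id": 2}]}', '{"b": [{"id": 3}]}'],
    ['{"a": [{"id": 4}]}'],
]

index = defaultdict(list)
for item in args:
    for array in item:
        # .items() works on both Python 2 and 3; .iteritems() is Python 2 only
        for k, v in json.loads(array).items():
            index[k].extend({'values': value['id']} for value in v)

print(dict(index))
```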

Find top 5 word lengths in a text

I'm trying to write a program that consists of two functions:
count_word_lengths, which takes the argument text, a string of text, and returns a default dictionary that records the count for each word length.
top5_lengths, which takes the same argument text and returns a list of the top 5 word lengths.
Note that in the event that two lengths have the same frequency, they should be sorted in descending order. Also, if there are fewer than 5 word lengths, it should return a shorter list of the sorted word lengths.
Example calls to count_word_lengths:
count_word_lengths("one one was a racehorse two two was one too"):
defaultdict(<class 'int'>, {1: 1, 3: 8, 9: 1})
Example calls to top5_lengths:
top5_lengths("one one was a racehorse two two was one too")
[3, 9, 1]
top5_lengths("feather feather feather chicken feather")
[7]
top5_lengths("the swift green fox jumped over a cool cat")
[3, 5, 4, 6, 1]
My current code is this, and seems to output all these calls, however it is failing a hidden test. What type of input am I not considering? Is my code actually behaving correctly? If not, how could I fix this?
from collections import defaultdict
length_tally = defaultdict(int)
final_list = []
def count_word_lengths(text):
    words = text.split(' ')
    for word in words:
        length_tally[len(word)] += 1
    return length_tally
def top5_word_lengths(text):
    frequencies = count_word_lengths(text)
    list_of_frequencies = frequencies.items()
    flipped = [(t[1], t[0]) for t in list_of_frequencies]
    sorted_flipped = sorted(flipped)
    reversed_sorted_flipped = sorted_flipped[::-1]
    for item in reversed_sorted_flipped:
        final_list.append(item[1])
    return final_list
One thing to note is that you do not account for an empty string: ''.split(' ') returns [''], so a word length of 0 gets counted. Also, you can use iteritems() in a list comprehension to get the key and value from a dict, as in for k, v in d.iteritems().
I'm not a Python guy, but I can see a few things that might cause issues.
You keep referring to top5_lengths, but your code has a function called top5_word_lengths.
You use a function called count_lengths that isn't defined anywhere.
Fix these and see what happens!
Edit:
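It's also not great practice for your functions to update variables outside their scope. Because length_tally and final_list are created once at module level, their contents carry over from one call to the next, so a second call returns results that include the first call's data. You probably want to move the variable assignments at the top into the functions where they're used.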
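To illustrate the point about state, here is a minimal sketch that keeps the tally inside the functions. It also assumes that words are separated by any whitespace, that ties go to the longer length, and that the result is capped at five entries, as the task statement describes. It has not been run against the hidden test.

```python
from collections import defaultdict

def count_word_lengths(text):
    """Return a defaultdict mapping word length -> number of words of that length."""
    length_tally = defaultdict(int)      # fresh tally on every call
    for word in text.split():            # split() with no argument ignores extra whitespace
        length_tally[len(word)] += 1
    return length_tally

def top5_lengths(text):
    """Return up to five word lengths, most frequent first; ties go to the longer length."""
    tally = count_word_lengths(text)
    ranked = sorted(tally.items(), key=lambda kv: (kv[1], kv[0]), reverse=True)
    return [length for length, _count in ranked[:5]]

print(top5_lengths("one one was a racehorse two two was one too"))
print(top5_lengths("the swift green fox jumped over a cool cat"))
print(top5_lengths(""))
```

Sorting on the (count, length) pair in reverse gives the same ordering as the flip-and-reverse steps in the question, and the [:5] slice enforces the limit that the original code never applies.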
Not really an answer, but an alternative way of tracking words instead of just lengths:
from collections import defaultdict
def count_words_by_length(text):
    words = [(len(word),word) for word in text.split(" ")]
    d = defaultdict(list)
    for k, v in words:
        d[k].append(v)
    return d
# named to match the call below; the parameter is d so it does not shadow the builtin dict
def num_top_words_by_length(d, how_many):
    return [{"word_length": length, "num_words": len(words)} for length, words in d.items()[-how_many:]]
Use as follows:
my_dict = count_words_by_length('hello sir this is a beautiful day right')
my_top_words = num_top_words_by_length(my_dict, 5)
print(my_top_words)
print(my_dict)
Output:
[{'word_length': 9, 'num_words': 1}]
defaultdict(<type 'list'>, {1: ['a'], 2: ['is'], 3: ['sir', 'day'], 4: ['this'], 5: ['hello', 'right'], 9: ['beautiful']})

PYTHON 2.7 - Modifying List of Lists and Re-Assembling Without Mutating

I currently have a list of lists that looks like this:
My_List = [[This, Is, A, Sample, Text, Sentence], [This, too, is, a, sample, text], [finally, so, is, this, one]]
Now what I need to do is "tag" each of these words with one of 3, in this case arbitrary, tags such as "EE", "FF", or "GG" based on which list the word is in and then reassemble them into the same order they came in. My final code would need to look like:
GG_List = [This, Sentence]
FF_List = [Is, A, Text]
EE_List = [Sample]
My_List = [[(This, GG), (Is, FF), (A, FF), (Sample, EE), (Text, FF), (Sentence, GG)], [*same with this sentence*], [*and this one*]]
I tried this by using for loops to turn each item into a dict but the dicts then got rearranged by their tags which sadly can't happen because of the nature of this thing... the experiment needs everything to stay in the same order because eventually I need to measure the proximity of tags relative to others but only in the same sentence (list).
I thought about doing this with NLTK (which I have little experience with), but it looks like that is much more sophisticated than what I need, and the tags aren't easily customized by a novice like myself.
I think this could be done by iterating through each of these items, using an if statement as I have to determine what tag they should have, and then making a tuple out of the word and its associated tag so it doesn't shift around within its list.
I've devised this.. but I can't figure out how to rebuild my list-of-lists and keep them in order :(.
for i in My_List: #For each list in the list of lists
    for h in i: #For each item in each list
        if h in GG_List: # Check for the tag
            MyDicts = {"GG":h for h in i} #Make Dict from tag + word
Thank you so much for your help!
Putting the tags in a dictionary would work:
My_List = [['This', 'Is', 'A', 'Sample', 'Text', 'Sentence'],
['This', 'too', 'is', 'a', 'sample', 'text'],
['finally', 'so', 'is', 'this', 'one']]
GG_List = ['This', 'Sentence']
FF_List = ['Is', 'A', 'Text']
EE_List = ['Sample']
zipped = zip((GG_List, FF_List, EE_List), ('GG', 'FF', 'EE'))
tags = {item: tag for tag_list, tag in zipped for item in tag_list}
res = [[(word, tags[word]) for word in entry if word in tags] for entry in My_List]
Now:
>>> res
[[('This', 'GG'),
('Is', 'FF'),
('A', 'FF'),
('Sample', 'EE'),
('Text', 'FF'),
('Sentence', 'GG')],
[('This', 'GG')],
[]]
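Note that the comprehension above drops every word that has no tag, which is why the second and third lists come out short or empty. If each word should stay in its original position, one option is dict.get with a placeholder. A sketch, where the 'UNTAGGED' placeholder and the tag_lists name are arbitrary choices made for this example:

```python
My_List = [['This', 'Is', 'A', 'Sample', 'Text', 'Sentence'],
           ['This', 'too', 'is', 'a', 'sample', 'text']]
tag_lists = {'GG': ['This', 'Sentence'], 'FF': ['Is', 'A', 'Text'], 'EE': ['Sample']}

# word -> tag lookup built from the tag lists
tags = {word: tag for tag, words in tag_lists.items() for word in words}

# Lists keep their order, so each (word, tag) tuple stays at the word's position.
res = [[(word, tags.get(word, 'UNTAGGED')) for word in entry] for entry in My_List]
print(res)
```

The lookup is case-sensitive, so 'is' and 'Is' are treated as different words here.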
A dictionary works by key-value pairs: each key is assigned a value. To look something up, you index the dictionary by its key, e.g.
>>> d = {1:'a', 2:'b', 3:'c'}
>>> d[1]
'a'
In the above case, we always search the dictionary by its keys, i.e. the integers.
In the case that you want to assign the tag/label to each word, you are searching by the key word and finding the "value", i.e. the tag/label, so your dictionary would have to look something like this (assuming that the strings are words and numbers as tag/label):
>>> d = {'a':1, 'b':1, 'c':3}
>>> d['a']
1
>>> sent = 'a b c a b'.split()
>>> sent
['a', 'b', 'c', 'a', 'b']
>>> [d[word] for word in sent]
[1, 1, 3, 1, 1]
This way the order of the tags follows the order of the words when you use a list comprehension to iterate through the words and find the appropriate tags.
So the problem comes when the initial dictionary is indexed the wrong way round, i.e. key -> labels, value -> words, e.g.:
>>> d = {1:['a', 'd'], 2:['b', 'h'], 3:['c', 'x']}
>>> [d[word] for word in sent]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
KeyError: 'a'
Then you would have to invert your dictionary. Assuming that all elements in your value lists are unique, you can do this:
>>> from collections import ChainMap
>>> d = {1:['a', 'd'], 2:['b', 'h'], 3:['c', 'x']}
>>> d_inv = dict(ChainMap(*[{value:key for value in values} for key, values in d.items()]))
>>> d_inv
{'h': 2, 'c': 3, 'a': 1, 'x': 3, 'b': 2, 'd': 1}
But the caveat is that ChainMap is only available in Python 3.3 and later (yet another reason to upgrade your Python ;P). For solutions on older versions, see How do I merge a list of dicts into a single dict?.
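The same inversion can also be written without ChainMap, as a nested dict comprehension, which runs on Python 2.7 as well. A small sketch, again assuming each word appears under exactly one label:

```python
d = {1: ['a', 'd'], 2: ['b', 'h'], 3: ['c', 'x']}

# Flatten label -> words into word -> label.
d_inv = {word: label for label, words in d.items() for word in words}

sent = 'a b c a b'.split()
labels = [d_inv[word] for word in sent]
print(labels)
```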
So going back to the problem of assigning labels/tags to words, let's say we have these input:
>>> d = {1:['a', 'd'], 2:['b', 'h'], 3:['c', 'x']}
>>> sent = 'a b c a b'.split()
First, we invert the dictionary (assuming that every word maps to exactly one tag/label):
>>> d_inv = dict(ChainMap(*[{value:key for value in values} for key, values in d.items()]))
Then, we apply the tags to the words through a list comprehension:
>>> [d_inv[word] for word in sent]
[1, 2, 3, 1, 2]
And for multiple sentences:
>>> sentences = ['a b c'.split(), 'h a x'.split()]
>>> [[d_inv[word] for word in sent] for sent in sentences]
[[1, 2, 3], [2, 1, 3]]