Mapping specific letters onto a list of words

Can someone please suggest an approach for writing code that automatically maps the letters in letters_str onto dic_key (a dict key of string type made up of dashes, with the same length as the words in word_lst)?
The mapping only occurs if a letter appears in every word in the list at the same position, no matter how many words are in the list.
If no position holds the same letter across all the words in the word list, then new_dic_key would be '----'. Please see the examples below.
Thanks
word_lst = ['ague', 'bute', 'byre', 'came', 'case', 'doze']
dic_key = '----'
letters_str ='abcdefghijklmnopqrstuvwxyz'
new_dic_key = '---e'
if
word_list = ['bute', 'byre']
new_dic_key = 'b--e'
or
word_list = ['drek', 'drew', 'dyes']
new_dic_key = 'd-e-'

If all the words in word_list are the same length, this code will give you what you want:
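word_list = ['drek', 'drew', 'dyes']
# One empty column per character position.
cols = []
for i in range(len(word_list[0])):
    cols.append([])
# Collect the letter at each position from every word.
for word in word_list:
    for i, ch in enumerate(word):
        cols[i].append(ch)
# Keep a letter only where its whole column agrees.
pattern = [item[0] if len(set(item)) == 1 else '-' for item in cols]
print(''.join(pattern))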
d-e-
Explanation:
We initialize cols as a list of lists. It holds a two-dimensional representation of the letters in the words of word_list. After populating cols, this is what it looks like:
[['d', 'd', 'd'], ['r', 'r', 'y'], ['e', 'e', 'e'], ['k', 'w', 's']]
So the final result, new_dic_key, contains a letter only if all elements of the corresponding sub-list above are that same letter; otherwise it contains a -. This is achieved by the list comprehension for pattern.
Hope it helps.
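The column-building loops above can also be expressed with zip(*words), which yields one tuple of letters per position. This is only a sketch under the same assumption that all words have equal length; the function name common_pattern and its blank parameter are made up for illustration.

```python
def common_pattern(words, blank='-'):
    """Return the letters shared by every word at each position."""
    # zip(*words) groups the i-th letters of all words together.
    # (hypothetical helper; assumes all words have the same length)
    return ''.join(
        col[0] if len(set(col)) == 1 else blank
        for col in zip(*words)
    )


print(common_pattern(['drek', 'drew', 'dyes']))
```

Note that zip stops at the shortest word, so words of unequal length would silently produce a shorter pattern.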

Related

Substitute single space with multiple spaces in variable

I have a variable text:
let text="hello world"
and would like to put multiple spaces between the two words. How could I achieve this programmatically? This is my current solution:
let text=substitute(text," "," ","")
but how could I put multiple spaces without typing each of them? Is there any function to put n number of spaces?
You can use the repeat() function. From :h repeat():
repeat({expr}, {count}) *repeat()*
Repeat {expr} {count} times and return the concatenated
result. Example: >
:let separator = repeat('-', 80)
< When {count} is zero or negative the result is empty.
When {expr} is a |List| the result is {expr} concatenated
{count} times. Example: >
:let longlist = repeat(['a', 'b'], 3)
< Results in ['a', 'b', 'a', 'b', 'a', 'b'].
For example:
let text = substitute(text, " ", repeat(" ", n), "")

Convert a set of numbers into a word

I need to convert a given string of numbers to the word those numbers correspond to. For example:
>>> number_to_word('222 2 333 33')
'CAFE'
The numbers work like they do on a cell phone: you press the second button once and you get an 'A', you press it twice and you get a 'B', etc. Let's say I want the letter 'E': I'd have to press the third button twice.
I would like some help understanding the easiest way to write this function. I have thought of creating a dictionary with the key being the letter and the value being the number, like this:
dic={'A':'2', 'B':'22', 'C':'222', 'D':'3', 'E':'33',etc...}
And then using a 'for' loop to read all the numbers in the string, but I do not know how to start.
You need to reverse your dictionary:
def number_to_word(number):
    # Map each key-press sequence to its letter (only A-F shown here).
    dic = {'2': 'A', '22': 'B', '222': 'C', '3': 'D', '33': 'E', '333': 'F'}
    return ''.join(dic[n] for n in number.split())
>>> number_to_word('222 2 333 33')
'CAFE'
Let's work from the inside out. number.split() splits the string of numbers at whitespace characters:
>>> number = '222 2 333 33'
>>> number.split()
['222', '2', '333', '33']
We use a generator expression, (dic[n] for n in number.split()), to find the letter for each number. Here is a list comprehension that does nearly the same thing but also shows the result as a list:
>>> [dic[n] for n in number.split()]
['C', 'A', 'F', 'E']
This lets n run through all elements in the list with the numbers and uses n as the key in the dictionary dic to get the corresponding letter.
Finally, we use the method join() with an empty string as separator to turn the list into a string:
>>> ''.join([dic[n] for n in number.split()])
'CAFE'
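To avoid typing out every key-press sequence by hand, the lookup can be derived from the letters printed on each key. The sketch below assumes the usual phone keypad layout and that each space-separated group repeats a single digit; KEYPAD and number_to_word_keypad are illustrative names, not part of the answer above.

```python
# Letters on each key of a standard phone keypad (assumed layout).
KEYPAD = {
    '2': 'ABC', '3': 'DEF', '4': 'GHI', '5': 'JKL',
    '6': 'MNO', '7': 'PQRS', '8': 'TUV', '9': 'WXYZ',
}


def number_to_word_keypad(number):
    """Decode groups such as '222' (third letter on key 2)."""
    letters = []
    for group in number.split():
        digit = group[0]
        # The number of presses selects the letter on that key.
        letters.append(KEYPAD[digit][len(group) - 1])
    return ''.join(letters)


print(number_to_word_keypad('222 2 333 33'))
```

A group that is longer than the number of letters on its key raises an IndexError, which may or may not be the behaviour you want.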

Python 2.7 : Remove elements from a multidimensional list

Basically, I have a 3-dimensional list (it is a list of tokens, where the first dimension is for the text, the second for the sentence, and the third for the word).
Addressing an element in the list (let's call it mat) can be done like this, for example:
mat[2][3][4]. That would give us the fifth word of the fourth sentence in the third text.
But, some of the words are just symbols like '.' or ',' or '?'. I need to remove all of them. I thought to do that with a procedure:
def removePunc(mat):
    newMat = []
    newText = []
    newSentence = []
    for text in mat:
        for sentence in text:
            for word in sentence:
                if word not in " !##$%^&*()-_+={}[]|\\:;'<>?,./\"":
                    newSentence.append(word)
            newText.append(newSentence)
        newMat.append(newText)
    return newMat
Now, when I try to use that:
finalMat = removePunc(mat)
it is giving me the same list (mat is a 3 dimensional list). My idea was to iterate over the list and remove only the 'words' which are actually punctuation symbols.
I don't know what I am doing wrong but surely there is a simple logical mistake.
Edit: I need to keep the structure of the array. So, words of the same sentence should still be in the same sentence (just without the 'punctuation symbol' words). Example:
a = [[['as', '.'], ['w', '?', '?']], [['asas', '23', '!'], ['h', ',', ',']]]
after the changes should be:
a = [[['as'], ['w']], [['asas', '23'], ['h']]]
Thanks for reading and/or giving me a reply.
I would suspect that your data are not organized as you think they are. And although I am usually not the one to propose regular expressions, I think in your case they may be among the best solutions.
I would also suggest that instead of eliminating non-alphabetic characters from words, you process whole sentences:
>>> import re
>>> non_word = re.compile(r'\W+') # If your sentences may
>>> sentence = '''The formatting sucks, but the only change that I've made to your code was shortening the "symbols" string to one character. The only issue that I can identify is either with the "symbols" string (though it looks like all chars in it are properly escaped) that you used, or the punctuation is not actually separate words'''
>>> words = re.split(non_word, sentence)
>>> words
['The', 'formatting', 'sucks', 'but', 'the', 'only', 'change', 'that', 'I', 've', 'made', 'to', 'your', 'code', 'was', 'shortening', 'the', 'symbols', 'string', 'to', 'one', 'character', 'The', 'only', 'issue', 'that', 'I', 'can', 'identify', 'is', 'either', 'with', 'the', 'symbols', 'string', 'though', 'it', 'looks', 'like', 'all', 'chars', 'in', 'it', 'are', 'properly', 'escaped', 'that', 'you', 'used', 'or', 'the', 'punctuation', 'is', 'not', 'actually', 'separate', 'words']
>>>
The code you wrote seems solid and it looks like "it should work", but only if this:
But, some of the words are just symbols like '.' or ',' or '?'
is actually fulfilled.
I would actually expect the symbols to not be separate from words, so instead of:
["Are", "you", "sure", "?"] #example sentence
you would rather have:
["Are", "you", "sure?"] #example sentence
If this is the case, you would need to go along the lines of:
def removePunc(mat):
    newMat = []
    for text in mat:
        newText = []
        for sentence in text:
            newSentence = []
            for word in sentence:
                # Rebuild the word from its non-punctuation characters.
                newWord = ""
                for char in word:
                    if char not in " !##$%^&*()-_+={}[]|\\:;'<>?,./\"":
                        newWord += char
                if newWord:  # skip words that were punctuation only
                    newSentence.append(newWord)
            newText.append(newSentence)
        newMat.append(newText)
    return newMat
Finally found it. As expected, it was a very small logical mistake that was there all along, but I couldn't see it. Here is the working solution:
def removePunc(mat):
    newMat = []
    for text in mat:
        newText = []
        for sentence in text:
            newSentence = []
            for word in sentence:
                if word not in " !##$%^&*()-_+={}[]|\\:;'<>?,./\"":
                    newSentence.append(word)
            newText.append(newSentence)
        newMat.append(newText)
    return newMat
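The same structure-preserving filter can be written as a nested list comprehension, which creates a fresh list at every level and so avoids the misplaced accumulators. This is a sketch rather than a drop-in replacement: it assumes every punctuation token is a single character, and it uses string.punctuation instead of the hand-written symbol string; the name remove_punc and the symbols parameter are made up here.

```python
import string


def remove_punc(mat, symbols=frozenset(string.punctuation)):
    """Drop tokens that are single punctuation characters."""
    # One comprehension per level keeps texts and sentences intact.
    return [
        [[word for word in sentence if word not in symbols]
         for sentence in text]
        for text in mat
    ]


a = [[['as', '.'], ['w', '?', '?']], [['asas', '23', '!'], ['h', ',', ',']]]
print(remove_punc(a))
```

Unlike the substring test in removePunc, the set lookup only matches whole tokens, so a token such as '?!' would be kept.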

Combine several list comprehension codes

I have three list comprehensions that do some trimming on a given string. Together they remove words that contain '/', remove certain words listed in a set called 'remove_set', and combine consecutive single letters into one word.
import itertools
import re
regex = re.compile(r'.*/.*')
parent = ' '.join([p for p in parent.split() if not regex.match(p)])
remove_set = {'hello', 'corp', 'world'}
parent = ' '.join([i for i in parent.split() if i not in remove_set])
parent = ' '.join((' ' if x else '').join(y) for x, y in itertools.groupby(parent.split(), lambda x: len(x) > 1))
For example:
string = "hello C S people in some corp/llc"
changes to
string = "CS people in some"
Can these commands be written as one beautiful command?
Thanks in advance!
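One possible way to fold the three steps together is to filter the words once and feed the result straight into itertools.groupby. This is only a sketch: the function name trim_words is hypothetical, and the regex is replaced by a plain '/' in word test, which should match the same words because each word comes from split() and contains no newline.

```python
import itertools


def trim_words(parent, remove_set):
    """Drop unwanted words and merge runs of single letters."""
    # Steps 1 and 2: skip words containing '/' or listed in remove_set.
    words = (w for w in parent.split()
             if '/' not in w and w not in remove_set)
    # Step 3: group runs of single letters vs. longer words.
    groups = itertools.groupby(words, key=lambda w: len(w) > 1)
    # Longer words are joined with spaces, single letters without.
    return ' '.join((' ' if is_long else '').join(group)
                    for is_long, group in groups)


print(trim_words("hello C S people in some corp/llc",
                 {'hello', 'corp', 'world'}))
```

Whether this counts as more beautiful is a matter of taste; the main gain is that the string is split only once.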

Split a word with regexp in matlab; startIndex for 'split'?

My aim is to generate the phonetic transcription for any word according to a set of rules.
First, I want to split words into their syllables. For example, I want an algorithm to find 'ch' in a word and then separate it like shown below:
Input: 'aachbutcher'
Output: 'a' 'a' 'ch' 'b' 'u' 't' 'ch' 'e' 'r'
This is how far I have come:
check=regexp('aachbutcher','ch');
if (isempty(check{1,1})==0) % Returns 0, when 'ch' was found.
    [match split startIndex endIndex] = regexp('aachbutcher','ch','match','split')
    %Now I split the 'aa', 'but' and 'er' into single characters:
    for i = 1:length(split)
        SingleLetters{i} = regexp(split{1,i},'.','match');
    end
end
My problem is: how do I put the cells together so that they are formatted like the desired output? I only have the starting indices for the match parts ('ch') but not for the split parts ('aa', 'but', 'er').
Any ideas?
You don't need to work with the indices or lengths. Simple logic: process the first element from split, then the first from match, then the second from split, etc.
[match,split,startIndex,endIndex] = regexp('aachbutcher','ch','match','split');
%Now I split the 'aa', 'but' and 'er' into single characters:
SingleLetters=regexp(split{1,1},'.','match');
for i = 2:length(split)
    SingleLetters=[SingleLetters,match{i-1},regexp(split{1,i},'.','match')];
end
You know the length of 'ch': it's 2. You know where it was found from regexp, as those indices are stored in startIndex. I'm assuming (please correct me if I'm wrong) that you want to split all other letters of the word into single-letter cells, as in your output above. So you can just use the startIndex data to construct your output, using conditionals, like this:
word = 'aachbutcher';
startIndex = regexp(word,'ch'); % start positions of every 'ch'
output = {};
i = 1;
% A while loop is used because changing i inside a for loop
% does not affect the next iteration in MATLAB.
while i <= length(word)
    if any(i == startIndex)
        output{end+1} = 'ch';
        i = i + 2; % skip both characters of 'ch'
    else
        output{end+1} = word(i);
        i = i + 1;
    end
end
I don't have MATLAB right now, so I can't test it. I hope it works for you! If not, let me know and I'll take another shot at it.
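For comparison, the whole split can be done with a single alternation pattern that tries 'ch' first and falls back to any single character. The sketch below is in Python rather than MATLAB, and the function name split_keeping_digraph is made up; the same pattern should carry over to MATLAB's regexp with the 'match' option, though that is untested here.

```python
import re


def split_keeping_digraph(word, digraph='ch'):
    """Split into single characters, keeping the digraph whole."""
    # The alternation tries the digraph first, then any one character.
    pattern = re.escape(digraph) + '|.'
    return re.findall(pattern, word)


print(split_keeping_digraph('aachbutcher'))
```

Because the regex engine scans left to right and prefers the first alternative, no index bookkeeping is needed.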