How to parse parentheses to sum word frequencies in Python 3 - regex

I have an input with words and their frequency for each line, but I would like a total count for each word. I know there are many solutions for calculating word frequency from a file as a whole, but my input has brackets around each line and parentheses around each word/count pair. I have not been able to extract the word and count because each line has a different number of words. Any help would be greatly appreciated!
A sample input:
[('Company', 1)]
[('Tax', 1), ('Service', 1)]
[('"Birchwood', 1), ('LLC"', 1), ('Enterprise,', 1)]
[("Wendy's", 1), ('Salon', 1)]
Code I have been trying:
from collections import defaultdict
def wordCountTotals (fh):
    d = defaultdict(int)
    for line in fh:
        word, count = line.split()
        d[word] += count
    return d[word], count
I have also tried using :
re.search("\((\w+)\, [0-9]+)", s)
but still no results
Because there are brackets and parentheses, this code does not work: there are too many values to unpack. If anyone could help with this, I would be so grateful!

Each line of your input is a list of tuples written in exactly Python's own literal syntax, so we can use ast.literal_eval to parse it.
>>> import ast
>>> ast.literal_eval(" [('Company', 1)]".strip())
[('Company', 1)]
So, something along the lines of:
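from collections import defaultdict
def wordCountTotals(fh):
    d = defaultdict(int)  # defaultdict needs a callable such as int, not 0
    for line in fh:
        val = ast.literal_eval(line.strip())  # one list of (word, count) tuples
        for s, c in val:
            d[s] += c
    return d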
would be enough. I have not tried this, might need some fixes.
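For Python 3, the same idea can be sketched with collections.Counter. This is a minimal sketch, not a tested drop-in: the function name sum_word_counts and the sample lines (including the extra 'Tax' line) are made up for illustration, and it assumes every non-blank line is a valid Python literal list of (word, count) tuples.

```python
import ast
from collections import Counter

def sum_word_counts(lines):
    """Total the counts from lines such as "[('Tax', 1), ('Service', 1)]"."""
    totals = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines; literal_eval would raise on ""
        # Each line parses to a list of (word, count) tuples
        for word, count in ast.literal_eval(line):
            totals[word] += count
    return totals

# Hypothetical sample; the last line is added to show counts accumulating
sample = [
    "[('Company', 1)]",
    "[('Tax', 1), ('Service', 1)]",
    "[(\"Wendy's\", 1), ('Salon', 1)]",
    "[('Tax', 2)]",
]
totals = sum_word_counts(sample)
print(totals["Tax"])  # 3
```

Because the function only needs an iterable of strings, an open file handle can be passed in place of the list.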

Related

Extracting data using regular expressions: Python

The basic outline of this problem is to read the file, find the integers with re.findall() using the regular expression [0-9]+, convert the extracted strings to integers, and sum them.
I am having trouble appending to the list. With my code below, only the first (index 0) match of a line gets appended. Please help me. Thank you.
import re
hand = open ('a.txt')
lst = list()
for line in hand:
    line = line.rstrip()
    stuff = re.findall('[0-9]+', line)
    if len(stuff)!= 1 : continue
    num = int (stuff[0])
    lst.append(num)
print sum(lst)
import re
ls = []
text = open('C:/Users/pvkpu/Desktop/py4e/file1.txt')
for line in text:
    line = line.rstrip()
    l = re.findall('[0-9]+', line)
    if len(l) == 0:
        continue
    ls += l
for i in range(len(ls)):
    ls[i] = int(ls[i])
print(sum(ls))
Great, thank you for including the whole txt file! Your main problem was the if len(stuff)... line, which skipped a line both when stuff was empty and when it held 2, 3, or more matches. You were only keeping stuff lists of length 1. I put comments in the code, but please ask if something is unclear.
import re
hand = open ('a.txt')
str_num_lst = list()
for line in hand:
    line = line.rstrip()
    stuff = re.findall('[0-9]+', line)
    #If we didn't find anything on this line then continue
    if len(stuff) == 0: continue
    #if len(stuff)!= 1: continue #<-- This line was wrong as it skips lists with more than 1 element
    #If we did find something, stuff will be a list of strings:
    #(i.e. stuff = ['9607', '4292', '4498'] or stuff = ['4563'])
    #For now let's just add this list onto our str_num_lst
    #without worrying about converting to int.
    #We use '+=' instead of 'append' since both stuff and str_num_lst are lists
    str_num_lst += stuff
#Print out the str_num_lst to check if everything's ok
print str_num_lst
#Get an overall sum by looping over the string numbers in the str_num_lst
#Can convert to int inside the loop
overall_sum = 0
for str_num in str_num_lst:
    overall_sum += int(str_num)
#Print sum
print 'Overall sum is:'
print overall_sum
EDIT:
You are right, reading in the entire file as one line is a good solution, and it's not difficult to do. Check out this post. Here is what the code could look like.
import re
hand = open('a.txt')
all_lines = hand.read() #Reads in all lines as one long string
all_str_nums_as_one_line = re.findall('[0-9]+',all_lines)
hand.close() #<-- can close the file now since we've read it in
#Go through all the matches to get a total
tot = 0
for str_num in all_str_nums_as_one_line:
    tot += int(str_num)
print('Overall sum is:',tot) #editing to add ()
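The read-everything-at-once approach can also be condensed into a single expression. The following is a minimal Python 3 sketch; the function name sum_integers and the sample text are made up for illustration, and it works on a string so it can be tried without a file.

```python
import re

def sum_integers(text):
    """Sum every run of digits found anywhere in text."""
    return sum(int(match) for match in re.findall(r"[0-9]+", text))

# Hypothetical sample standing in for the contents of a.txt
sample_text = "Why 9607 and 4292\nno numbers here\n4498 then 4563"
print(sum_integers(sample_text))  # 22960
```

With a real file, the string would come from something like open('a.txt').read() inside a with block, so the file is closed automatically.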

What is the error in my python code

You are given an integer N on one line. The next line contains N space-separated integers. Create a tuple of those N integers. Let's call it T.
Compute hash(T) and print it.
Note: Here, hash() is one of the functions in the __builtins__ module.
Input Format
The first line contains N. The next line contains N space-separated integers.
Output Format
Print the computed value.
Sample Input
2
1 2
Sample Output
3713081631934410656
My code
a=int(raw_input())
b=()
i=0
for i in range (0,a):
x=int(raw_input())
c = b + (x,)
i=i+1
hash(b)
Error:
invalid literal for int() with base 10: '1 2'
There are three errors that I can spot:
First, your for-loop is not indented.
Second, you should not be adding 1 to i - the for-loop does this automatically.
Third - and this is where the error is thrown - raw_input reads the entire line. If you are reading the line '1 2', you cannot convert this to an int.
To fix this problem, I suggest doing:
line = tuple(map(int,raw_input().split(' ')))
This takes the raw input, splits it into a list of strings, converts each string to an int, then turns the result into a tuple.
In fact, you can scrap the entire for loop. You could answer this problem in two lines of code:
raw_input()#To get rid of the first line, which we do not need
print hash(tuple(map(int,raw_input().split(' '))))
The input format says the next line contains N space-separated integers, e.g. 1 2 3. That whole line is not an integer (because of the spaces), which is why your code throws an error when you try int(raw_input()). You should use split(' '), as the other answer has suggested, to separate the integers. This will remove the error.
Also, there is no need to use i=i+1, as the loop will take care of it.
Try the below code:
if __name__ == '__main__':
    n = int(input())
    integer_list = map(int, input().split())
    t = tuple(integer_list)
    print(hash(t))
Try this code for Python 3:
if __name__ == '__main__':
    n = int(input())
    integer_list = map(int, input().split())
    input_list = [int(x) for x in integer_list]
    t = tuple(input_list)
    print(hash(t))

Python - Check If string Is In bigger String

I'm working with Python v2.7, and I'm trying to find out whether a word is in a string.
For example, if I have a string and the word I want to find:
str = "ask and asked, ask are different ask. ask"
word = "ask"
How should I code this so that the result doesn't include words that are part of other words? In the example above I want all the "ask" matches except the one inside "asked".
I have tried with the following code but it doesn't work:
def exact_Match(str1, word):
    match = re.findall(r"\\b" + word + "\\b",str1, re.I)
    if len(match) > 0:
        return True
    return False
Can someone please explain how I can do it?
You can use the following function :
>>> import re
>>> test_str = "ask and asked, ask are different ask. ask"
>>> word = "ask"
>>> def finder(s,w):
...     return re.findall(r'\b{}\b'.format(w),s,re.U)
...
>>> finder(test_str,word)
['ask', 'ask', 'ask', 'ask']
Note that you need \b, the regex word-boundary anchor!
Or you can use the following function to return the indices of the word in the split string:
>>> def finder(s,w):
...     return [i for i,j in enumerate(re.findall(r'\b\w+\b',s,re.U)) if j==w]
...
>>> finder(test_str,word)
[0, 3, 6, 7]
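The function in the question fails because r"\\b" is a raw string, so the regex receives an escaped backslash followed by the letter b rather than a word boundary. The variant below is a minimal sketch that also passes the word through re.escape, so regex metacharacters in it are treated literally; the function name find_whole_word is made up for illustration.

```python
import re

def find_whole_word(text, word):
    """Return every whole-word occurrence of word in text, ignoring case."""
    # re.escape guards against regex metacharacters inside word
    pattern = r"\b{}\b".format(re.escape(word))
    return re.findall(pattern, text, re.IGNORECASE)

test_str = "ask and asked, ask are different ask. ask"
print(find_whole_word(test_str, "ask"))  # ['ask', 'ask', 'ask', 'ask']
print(bool(find_whole_word(test_str, "aske")))  # False
```

Wrapping the result in bool() gives the True/False answer the original exact_Match function was aiming for.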

finding anagrams considering all the words in english

words_ = load_words("C:\Users\Abdullah\Downloads\EOWL-v1.1.2\EOWL-v1.1.2\LF Delimited Format")
def find_all_anagrams(words, word):
    import itertools
    permuted_chars = []
    for i in range(2, len(word)+1):
        permuted_chars += itertools.permutations(word, i)
    permutations_list = ["".join(i) for i in permuted_chars]
    anagrams_list = [i for i in permutations_list if i in words]
    return anagrams_list
To find the anagrams of a given word, I figured out this solution.
I have a word list of 128,000 words. Can anybody suggest a better way?
For loading the words:
from io import *
import string
def load_words(base_dir):
    words = []
    for i in string.uppercase:
        location = base_dir+"\\"+i+" Words.txt"
        with open(location, "rb+") as f:
            words += [x.rstrip() for x in f.readlines()]
    return words
Yes, there's a better way. Anagrams contain the same letters, so if you sort the characters of each word, all anagrams of one another give the same result (e.g. mary -> amry, army -> amry).
Using this trick, you can simply build a dictionary, where the sorted version is the key, and the list of anagrams is the value.
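A minimal sketch of that dictionary is shown here. The function names build_anagram_index and find_anagrams and the tiny word list are made up for illustration. Note one difference from the code in the question: this finds only anagrams that use all of the letters, not shorter words built from a subset of them.

```python
from collections import defaultdict

def build_anagram_index(words):
    """Map each sorted-letter key to the list of words that share it."""
    index = defaultdict(list)
    for w in words:
        index["".join(sorted(w))].append(w)
    return index

def find_anagrams(index, word):
    """Look up the words made of exactly the same letters as word."""
    key = "".join(sorted(word))
    # .get avoids creating an empty entry for unknown keys
    return [w for w in index.get(key, []) if w != word]

# Hypothetical word list standing in for the 128,000-word file
words = ["army", "mary", "myra", "arm", "ram", "mar", "tree"]
index = build_anagram_index(words)
print(find_anagrams(index, "army"))  # ['mary', 'myra']
print(find_anagrams(index, "tree"))  # []
```

The index is built once in a single pass over the word list, after which each lookup costs one sort of the query word plus a dictionary access, instead of generating every permutation.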

removing punctuation then counting the no of every word occurance using python

Hello everybody, I am new to Python and need to write a program that removes punctuation and then counts how often each word occurs in a string. So I have this:
import sys
import string
def removepun(txt):
    for punct in string.punctuation:
        txt = txt.replace(punct,"")
    print txt
    mywords = {}
    for i in range(len(txt)):
        item = txt[i]
        count = txt.count(item)
        mywords[item] = count
    return sorted(mywords.items(), key = lambda item: item[1], reverse=True)
The problem is it returns back letters and counts them and not words as I hoped. Can you help me in this matter?
How about this?
>>> import string
>>> from collections import Counter
>>> s = 'One, two; three! four: five. six##$,.!'
>>> occurrence = Counter(s.translate(None, string.punctuation).split())
>>> print occurrence
Counter({'six': 1, 'three': 1, 'two': 1, 'four': 1, 'five': 1, 'One': 1})
After removing the punctuation:
numberOfWords = len(txt.split(" "))
This assumes exactly one space between words.
EDIT:
a={}
for w in txt.split(" "):
    if w in a:
        a[w] += 1
    else:
        a[w] = 1
How it works:
a is set to be a dict
the words in txt are iterated
if there is already an entry a[w] in the dict, add one to it
if there is no entry, set one up, initialized to 1
the output is the same as Haidro's excellent answer: a dict with the words as keys and the count of each word as values
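The two-argument s.translate(None, string.punctuation) call used earlier only works on Python 2 str objects. A rough Python 3 equivalent, as a minimal sketch, builds a deletion table with str.maketrans instead; the function name count_words and the sample string are made up for illustration, and only ASCII punctuation is removed.

```python
import string
from collections import Counter

def count_words(text):
    """Strip ASCII punctuation, then count whitespace-separated words."""
    # Third argument of str.maketrans lists the characters to delete
    table = str.maketrans("", "", string.punctuation)
    return Counter(text.translate(table).split())

s = "One, two; three! four: five. six##$,.! two"
counts = count_words(s)
print(counts["two"])  # 2
print(counts.most_common(1))  # [('two', 2)]
```

Calling split() with no argument splits on any run of whitespace, so it does not depend on words being separated by exactly one space.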