Python 2.7: trying to make this book text lowercase

Here is my code so far. The idea is to make all words lowercase, count the unique words (ones that are not repeated), and count the number of times "uncle" appears in the book.
word_cnt = 0
book = open("shunned_house.txt")
lower = book.lower()
for line in lower:
    words = line.split()
    for word in words:
        word_cnt += 1
print word_cnt
Any help would be greatly appreciated. I have tried many different variations of this problem and keep getting stuck right here. The word count for this document is around 10,700 or so. I especially have trouble setting up the Python code to tackle the problem.

Pretty sure this is what you want:
with open('shunned_house.txt') as f:
    book = f.read().lower()
words = book.split()
print len(set(words))
print book.count('uncle')
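Two details of the snippet above may matter depending on what is wanted: len(set(words)) counts distinct words rather than words that occur exactly once, and book.count('uncle') counts substring matches, so "uncles" is included. A minimal sketch of the stricter reading, assuming words are runs of letters and apostrophes; the function name word_stats and the sample sentence are made up for illustration, and the print(...) form runs on both Python 2.7 and Python 3:

```python
import re
from collections import Counter

def word_stats(text, target="uncle"):
    """Return (distinct words, words seen exactly once, whole-word count of target)."""
    # Assumption: a "word" is a run of lowercase letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    distinct = len(counts)
    once_only = sum(1 for n in counts.values() if n == 1)
    return distinct, once_only, counts[target]

# Hypothetical sample text standing in for the contents of the book file.
sample = "My uncle met Uncle Bob. The uncles left; my uncle stayed."
stats = word_stats(sample)
print(stats)
# Substring counting also matches the "uncle" inside "uncles".
print(sample.lower().count("uncle"))
```

For the real book, the same function could be called on the string returned by f.read().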

Related

How do I shorten this code to comply with the DRY (Don't Repeat Yourself) principle?

I want to take words a user provides, store them in a list, and then modify those words so that every other letter is capitalized. I have working code, but it is repetitive. I cannot for the life of me figure out how to run all the words through one function without it outputting one long string with the spaces removed. Any help is appreciated.
This is my current code:
def sarcastic_caps(lis1):
    list=[]
    index=0
    for ltr in lis1[0]:
        if index % 2 == 0:
            list.append(ltr.upper())
        else:
            list.append(ltr.lower())
        index=index+1
    return ''.join(list)
final_list.append(sarcastic_caps(lis1))
Imagine 4 more iterations of this ^. I would like to do it all in one function if possible?
I have tried expanding the list index but that returns all of the letters smashed together, not individual words. That is because of the .join but I need that to get all of the letters back together after running them through the .upper/.lower.
I am trying to go from ['hat', 'cat', 'fat'] to ['HaT', 'CaT', 'FaT'].
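One possible way to avoid the repetition is to make the function handle a single word and then apply it to every word in the list, so the join happens per word instead of across the whole list. A minimal sketch; the names sarcastic_word and sarcastic_caps_all are hypothetical and not from the original code:

```python
def sarcastic_word(word):
    """Upper-case the letters at even indexes and lower-case the rest."""
    return ''.join(
        ch.upper() if i % 2 == 0 else ch.lower()
        for i, ch in enumerate(word)
    )

def sarcastic_caps_all(words):
    """Apply sarcastic_word to each word, keeping the words separate."""
    return [sarcastic_word(w) for w in words]

final_list = sarcastic_caps_all(['hat', 'cat', 'fat'])
print(final_list)
```

Because ''.join is called once per word, each result stays a separate list element.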

Python question on Longest Common Substring (LCS) algorithm

I'm pretty new to Python (it's my first programming language), and I've wanted to practice some manual data structure manipulation and play around.
I've recently been learning the basic algorithm for solving the LCS problem, and I understand how it works except for one line of code that, for some weird reason, I can't convince myself I grasp entirely.
This is the code I've been learning from, after I couldn't quite get it right myself.
EDIT 2: Is there any way to make this work with two lists of integers as input? I figured out that I was understanding my original question correctly, but would anyone know how I could make this work with a list of integers? I tried converting S and T to strings of comma-separated values, which matched some of the characters, but even then it rarely worked in my test cases. I'm not sure why it wouldn't, as it is still just two strings being compared, only with commas.
def lcs(S,T):
    m = len(S)
    n = len(T)
    counter = [[0]*(n+1) for x in range(m+1)]
    longest = 0
    lcs_set = set()
    for i in range(m):
        for j in range(n):
            if S[i] == T[j]:
                c = counter[i][j] + 1
                counter[i+1][j+1] = c
                if c > longest:
                    lcs_set = set()
                    longest = c
                    lcs_set.add(S[i-c+1:i+1])
                elif c == longest:
                    lcs_set.add(S[i-c+1:i+1])
    return lcs_set
Now, my issue is understanding this line: lcs_set.add(S[i-c+1:i+1])
I understand that the counter is incremented when a match is found, to give longest the length of the substring. So, to make it easy, if S = Crow and T = Crown, when you reach w, the last match, the counter is incremented to 4, and i is at index 3 of S.
Does this mean I am to read this as: i (index3 on S, the W) - c (4), so 3-4 = -1, so 3-4+1 = 0 (at C) and for the right side of the slice: i(3) + 1 = 4(N, but will not be included, obviously), meaning we end with S[0:4], Crow, to LCS_Set?
If that is the case, I guess I am confused as to why we are adding the whole substring to the set, and not just the newest matched character?
If I understand right, it is updating LCS_set with the entire slice of the current matched substring, so if it were on the second match, R, the counter would be at 2, i would be at 1, and it would be saying S[1-2+1:i(1)+1], so 1-2 = -1, -1 + 1 = 0(C) up to i(1)+1 = 2 (leaving us with S[0:2], or CR), so each time around, the set is updated with the entire substring, and not just the current index.
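The slice arithmetic described above can be checked directly. A minimal sketch using the Crow/Crown example from the question; the helper name matched_slice is hypothetical and not part of the original code:

```python
def matched_slice(S, i, c):
    """Return the length-c piece of S that ends at index i (inclusive)."""
    return S[i - c + 1:i + 1]

S = "Crow"
# After matching 'r' (i = 1), the current run length is c = 2.
second = matched_slice(S, 1, 2)   # same as S[0:2]
# After matching 'w' (i = 3), the current run length is c = 4.
last = matched_slice(S, 3, 4)     # same as S[0:4]
print(second)
print(last)
```

Since c is the length of the current run of matches and i is its last index, the slice always covers the whole run, not just the newest character.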
It's not really a problem, I just want to make sure I'm understanding this correctly.
I would really appreciate any input, or any tips anyone might see with my current logic!!
EDIT:
I just realized I was totally forgetting that c is the current counter value, that is, the length of the current match. So it obviously wouldn't make sense to update lcs_set with the current maximum match length, and it can't be updated with just the newly matched letter either; it has to take the slice covering the whole substring in order to update lcs_set.
Thanks in advance!
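On the EDIT 2 question about lists of integers: one likely obstacle is that a slice of a list is itself a list, which is unhashable and so cannot be added to a set. Joining the numbers into comma-separated strings has a different problem, since characters can then match across number boundaries (the "1" in "12" matches a standalone "1"). A sketch of the same algorithm for arbitrary sequences, assuming it is acceptable to return the matches as tuples; the name lcs_seq is hypothetical:

```python
def lcs_seq(S, T):
    """Longest common contiguous runs of two sequences, returned as a set of tuples."""
    m, n = len(S), len(T)
    # counter[i+1][j+1] holds the length of the common run ending at S[i] and T[j].
    counter = [[0] * (n + 1) for _ in range(m + 1)]
    longest = 0
    found = set()
    for i in range(m):
        for j in range(n):
            if S[i] == T[j]:
                c = counter[i][j] + 1
                counter[i + 1][j + 1] = c
                if c > longest:
                    # A strictly longer run replaces everything found so far.
                    longest = c
                    found = set()
                if c == longest:
                    # tuple(...) makes the slice hashable, even when S is a list.
                    found.add(tuple(S[i - c + 1:i + 1]))
    return found

result = lcs_seq([1, 2, 3, 4], [0, 2, 3, 4, 5])
print(result)
```

The same function also accepts strings, in which case each match comes back as a tuple of characters.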

Python key sorting

I'm taking an online beginner course on Python 2 through Google, and I cannot figure out the answer to one of the questions. Here it is, and thanks in advance for your help!
# A. match_ends
# Given a list of strings, return the count of the number of
# strings where the string length is 2 or more and the first
# and last chars of the string are the same.
# Note: python does not have a ++ operator, but += works.
def match_ends(words):
    a = []
    for b in words:
        return
I tried a few different things. This is just where I left off on my last attempt before deciding to ask for help. I have spent more time thinking about this than I care to mention.
You should go through the course materials carefully. This can be solved easily if you have a beginner level understanding of Python. See the following code snippet:
def match_ends(words):
    count = 0
    for word in words:
        if len(word) >= 2 and word[0] == word[-1]:
            count += 1
    return count
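The same check can also be written as a single generator expression, which may be easier to test against a few inputs. A minimal sketch; the name match_ends_compact and the sample lists are illustrative only:

```python
def match_ends_compact(words):
    """Count strings of length >= 2 whose first and last characters match."""
    return sum(1 for w in words if len(w) >= 2 and w[0] == w[-1])

# Illustrative inputs, including empty and one-character strings.
first = match_ends_compact(['aba', 'xyz', 'aa', 'x', 'bbb'])
second = match_ends_compact(['', 'x', 'xy', 'xyx', 'xx'])
third = match_ends_compact(['aaa', 'be', 'abc', 'hello'])
print(first)
print(second)
print(third)
```

Checking the length first matters: it keeps w[0] from being evaluated on an empty string, which would raise an IndexError.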

findall function grabbing the wrong info

I am trying to write a piece of Python to read my files. The code is below:
import re, os
captureLevel = [] # capture read scale.
captureQID = [] # capture questionID.
captureDesc = [] # capture description.
file=open(r'E:\Grad\LIS\LIS590 Text mining\Final_Project\finalproject_data.csv','rt')
newfile=open('finalwordlist.csv','w')
mytext=file.read()
for row in mytext.split('\n'):
    grabLevel=re.findall(r'(\d{1})+\n',row)
    captureLevel.append(grabLevel)
    grabQID=re.findall(r'(\w{1}\d{5})',row)
    captureQID.append(grabQID) #ERROR LINE.
    grabDesc=re.findall(r'\,+\s+(\w.+)',row)
    captureDesc.append(grabDesc)
    lineCount = 0
    wordCount = 0
    lines = ''.join(grabDesc).split('.')
    for line in lines:
        lineCount +=1
        for word in line.split(' '):
            wordCount +=1
            newfile.write(''.join(grabLevel) + '|' + ''.join(grabQID) + '|' + str(lineCount) + '|' + str(wordCount) + '|' + word + '\n')
newfile.close()
Here are three lines of my data:
a00004," another oakstr eetrequest, helped student request item",2
a00005, asked retiree if he used journal on circ list,2
a00006, asked scientist about owner of some archival notes,2
Here is the result:
22|a00002|1|1|a00002,
22|a00002|1|2|
22|a00002|1|3|scientist
22|a00002|1|4|looking
22|a00002|1|5|for
The first column of the result should be just one number, but why is it printing out a two digit number?
Any idea what is the problem here? Thanks.
It is the tab versus space difference again. You need to be careful about this, especially in Python: spaces are not treated as equivalent to tabs. Here is a helpful link discussing the difference: http://legacy.python.org/dev/peps/pep-0008/. In brief, that document recommends spaces for indentation. However, I find tabs work fine for indentation too. What matters is keeping the indentation consistent, so if you use tabs, make sure you use them all the way through.
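Mixed indentation is hard to spot by eye, so it can help to check a file programmatically. A rough sketch, assuming the source has already been read into a string; the helper name indentation_styles and the sample snippet are made up for illustration:

```python
def indentation_styles(source):
    """Return which indentation styles ('tabs', 'spaces', 'mixed') appear in source."""
    styles = set()
    for line in source.splitlines():
        stripped = line.lstrip(' \t')
        indent = line[:len(line) - len(stripped)]
        # Skip unindented lines and lines that contain only whitespace.
        if not indent or not stripped:
            continue
        if '\t' in indent and ' ' in indent:
            styles.add('mixed')
        elif '\t' in indent:
            styles.add('tabs')
        else:
            styles.add('spaces')
    return styles

# One line is indented with a tab, the next with eight spaces.
sample = "for row in rows:\n\tx = 1\n        y = 2\n"
styles = indentation_styles(sample)
print(sorted(styles))
```

If the result contains more than one style, the file mixes tabs and spaces and is worth normalizing before debugging anything else.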

Remove fullstop, commas, quotation from list in Python

I have Python code that counts word frequencies in a text file. The problem is that it treats full stops as part of the words, which alters the counts. To count the words I use a sorted list of unique words. I tried to remove the full stops using
words = open(f, 'r').read().lower().split()
uniqueword = sorted(set(words))
uniqueword = uniqueword.replace(".","")
but I get this error:
AttributeError: 'list' object has no attribute 'replace'
Any help would be appreciated :)
You can process the words before you make the set, using a list comprehension:
words = [word.replace(".", "") for word in words]
You could also remove them after (uniquewords = [word.replace...]), but then you will reintroduce duplicates.
Note that if you want to count these words, a Counter may be more useful:
from collections import Counter
counts = Counter(words)
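A small sketch of the approach above on a made-up sentence, showing why the full stops should be stripped before building the set or the Counter; the sample text is an assumption used only for illustration:

```python
from collections import Counter

# Hypothetical sample standing in for the contents of the text file.
text = "The cat sat. The dog sat."
raw_words = text.lower().split()

# Without cleaning, "sat." is the only spelling seen, so "sat" itself is never counted.
raw_counts = Counter(raw_words)

# Strip full stops first, then count and de-duplicate.
words = [word.replace(".", "") for word in raw_words]
counts = Counter(words)
uniqueword = sorted(set(words))

print(counts["sat"])
print(uniqueword)
```

Counter behaves like a dictionary that returns 0 for missing keys, so raw_counts["sat"] is 0 here rather than an error.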
You might be better off with
import re
words = re.findall(r'\w+', open(f, 'r').read().lower())
which will grab all the strings composed of one or more “word characters” and will ignore punctuation and whitespace.
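To see what the \w+ pattern keeps and drops, here is a minimal sketch on a made-up sentence containing commas, quotation marks and full stops; the sample strings are assumptions for illustration:

```python
import re

# Hypothetical sample with commas, quotes and full stops.
text = 'He said, "Stop." Then he stopped.'
words = re.findall(r'\w+', text.lower())
print(words)

# Caveat: an apostrophe is not a word character, so contractions are split.
contraction = re.findall(r'\w+', "don't")
print(contraction)
```

If contractions should stay together, the pattern would need to allow apostrophes as well.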