Why does ConditionalFreqDist not work in NLTK? - python-2.7

cfd = nltk.ConditionalFreqDist(
    (target, fileid[:4])
    for target in ['america']
    for fileid in inaugural.fileids())
It runs fine, but I don't understand why every sample gets a count of 1 for every file.

This has nothing to do with the FreqDist; it's about what you feed it. You need to understand how generator expressions work.
In your case it's a two-way nested generator. Look at it like a for loop:
for target in ['america']:
    for fileid in inaugural.fileids():
        # do something with target and fileid
In this case, the "do something" part is simply adding a pair of strings to the FreqDist. The string pairs look like this:
('america', <prefix_of_file_1>)
('america', <prefix_of_file_2>)
('america', <prefix_of_file_3>)
...
The first element is always the same, because you have just one item in the target list. The second element is made of the first 4 characters of the file ID. You get exactly one entry per file, regardless of whether 'america' is in the file or not, because you don't look at the content of the file, you just iterate over file IDs.
The way to do it is like the first example in your original post, before you deleted it:
cfd = nltk.ConditionalFreqDist(
    (target, fileid[:4])
    for target in ['america']
    for fileid in inaugural.fileids()
    for w in inaugural.words(fileid)
    if w.lower().startswith(target))
Let's have a look at this three-way nested generator expression, written as for loops:
for target in ['america']:
    for fileid in inaugural.fileids():
        for w in inaugural.words(fileid):
            if w.lower().startswith(target):
                # add target and fileid[:4] to the FreqDist
So here you iterate over all words (innermost loop) in every file (middle loop), and you do this for every target (outer loop; here there is just one target, so there's not much looping). Words that do not start with "america" are skipped.
For example, let's say file 1 has two occurrences of "America" (or "American"), the second file has no mention of the target, and the third file has 3 occurrences. Then the pairs added to the FreqDist will look like this:
('america', <prefix_of_file_1>)
('america', <prefix_of_file_1>)
('america', <prefix_of_file_3>)
('america', <prefix_of_file_3>)
('america', <prefix_of_file_3>)
...
So for every occurrence of the target, you give the FreqDist an entry to count. Files without an occurrence are not counted, and multiple occurrences are counted multiple times.
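The counting logic can be reproduced without NLTK: each matching word contributes one (condition, sample) pair, and the pairs are tallied. This is a minimal sketch with a made-up toy corpus standing in for the real inaugural corpus:

```python
from collections import Counter

# Toy stand-in for the corpus: file IDs mapped to their words.
# (Illustrative data, not the real inaugural corpus.)
corpus = {
    '1789-Washington.txt': ['fellow', 'citizens', 'america', 'american'],
    '1793-Washington.txt': ['fellow', 'citizens'],
    '1861-Lincoln.txt': ['america', 'americans', 'america'],
}

target = 'america'
# One (condition, sample) pair per matching word, just like the
# three-way generator expression feeding ConditionalFreqDist.
pairs = [(target, fileid[:4])
         for fileid in sorted(corpus)
         for w in corpus[fileid]
         if w.lower().startswith(target)]

counts = Counter(pairs)
print(counts[(target, '1789')])        # 2 matches in the 1789 file
print(counts[(target, '1861')])        # 3 matches in the 1861 file
print((target, '1793') in counts)      # False: no matches, no entry
```

Files with no occurrence simply never produce a pair, so they get no entry at all, exactly as described above.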

Related

Attempting to splice a recurring item out of a list

I have extracted files from an online database that consist of roughly 100 titles. Associated with each of these titles is a DOI number; however, the DOI number is different for each title. To program this, I converted the contents of the website to a list. I then created a for loop to iterate through each item of the list. What I want the program to do is iterate through the entire list, find where it says "DOI:", and take the number which follows it. However, with the for loop I created, all it seems to do is print out the first DOI number, then terminate. How do I make the loop keep going once I have found the first one?
Here is the code:
resulttext = resulttext.split()
print(resulttext)
for item in resulttext:
    if item == "DOI:":
        DOI = resulttext[resulttext.index("DOI:")+1]  # This parses out the DOI, then takes the item which follows it
        print(DOI)
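The problem is that list.index("DOI:") always returns the position of the *first* occurrence, so every iteration prints the same first DOI. A hedged sketch of a fix, using enumerate so each match uses its own position (the sample text and DOI values here are made up):

```python
# Made-up sample input; the real resulttext comes from the website.
resulttext = ("Title one DOI: 10.1000/1 "
              "Title two DOI: 10.1000/2 "
              "Title three DOI: 10.1000/3").split()

# enumerate gives each "DOI:" token its *own* index, so we can take
# the element right after it, instead of list.index(), which always
# finds the first occurrence.
dois = [resulttext[i + 1] for i, item in enumerate(resulttext) if item == "DOI:"]
print(dois)  # ['10.1000/1', '10.1000/2', '10.1000/3']
```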

Why does random.sample() add square brackets and single quotes to the item sampled?

I'm trying to sample an item (which is one of the keys in a dictionary) from a list and later use the index of that item to find its corresponding value (in the same dictionary).
questions = list(capitals.keys())
answers = list(capitals.values())
for q in range(10):
    queswrite = random.sample(questions, 1)
    number = questions.index(queswrite)
    crtans = answers[number]
Here, capitals is the original dictionary from which the states (keys) and capitals (values) are being sampled.
But apparently the random.sample() method adds square brackets and single quotes to the sampled item, which prevents it from being used to look up the list containing the corresponding values.
Traceback (most recent call last):
File "F:\test.py", line 30, in
number = questions.index(queswrite)
ValueError: ['Delaware'] is not in list
How can I prevent this?
random.sample() returns a list containing the number of elements you requested. See the documentation:
Return a k length list of unique elements chosen from the population sequence or set. Used for random sampling without replacement.
If you want to pick just one element, you don't want a sample; you want a single choice. For that you'd use the random.choice() function instead:
question = random.choice(questions)
However, given that you are using a loop, you probably really wanted to get 10 unique questions. Don't use a loop over range(10), instead pick a sample of 10 random questions. That's exactly what random.sample() would do for you:
for question in random.sample(questions, 10):
    # pick the answer for this question
Next, putting both keys and values into two separate lists, then using the index of one to find the other is... inefficient and unnecessary; the keys you pick can be used directly to find the answers:
questions = list(capitals)
for question in random.sample(questions, 10):
    crtans = capitals[question]
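A small runnable sketch of the difference (the capitals dictionary here is hypothetical, since the original one isn't shown):

```python
import random

# Hypothetical data; the question's real `capitals` dict is not shown.
capitals = {'Delaware': 'Dover', 'Ohio': 'Columbus', 'Texas': 'Austin',
            'Oregon': 'Salem', 'Maine': 'Augusta'}

# random.sample() returns a *list*, even for k=1:
picked = random.sample(list(capitals), 1)
print(type(picked))  # <class 'list'>

# random.choice() returns a single element, usable directly as a dict key:
state = random.choice(list(capitals))
answer = capitals[state]
print(state in capitals)  # True
```

So the "square brackets and single quotes" in the traceback are just the repr of a one-element list; the element itself was never bracketed.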

I want to search a file for three strings and print 'defect' only if all three strings are present

I have a txt file with three debug signatures in it.
x = 'task cocaLc Requested reboot'
y = 'memPartFree'
z = 'memPartAlloc'
import re
f = open('testfile.txt','r')
searchstrings = ('task cocaLc Requested reboot', 'memPartFree', 'memPartAlloc')
for line in f():
    for word in searchstrings:
        if any(s in line for s in searchstrings):
            print 'defect'
I want to create a short script to scan through the file and print 'defect' only if all these three strings are present.
I have tried several different approaches, but have been unable to meet the requirement.
First, there is a small error on line 4 of the example code: f is not callable, so you shouldn't follow it with parentheses.
If you have a file with the following in it:
task cocaLc Requested reboot
memPartFree
memPartAlloc
It will print out "defect" 9 times because you're checking once for each line, and once for each search string. So three lines, times three search strings is 9.
The any() function will return True any time the file contains at least one of the defined search strings. Thus, this code will print out "defect" once for each line, multiplied by the number of search strings you've defined.
To resolve this, the program will need to know if/when any of the particular search strings have been detected. You might do something like this:
f = open('testfile.txt','r')
searchstrings = ['task cocaLc Requested reboot', 'memPartFree', 'memPartAlloc']
detections = [False, False, False]
for line in f:
    for i in range(0, len(searchstrings)):
        if searchstrings[i] in line:  # loop through searchstrings using index numbers
            detections[i] = True
            break  # break out of the loop since the word has been detected
if all(detections):  # if every search string was detected, every value in detections should be True
    print "defect"
In this code, we loop through the lines and the search strings, but the detections list tells us which search strings have been seen in the file. If all elements in that list are true, all of the search strings were detected somewhere in the file.
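If the file is small enough to read whole, the per-line bookkeeping can be collapsed into a single all() check over the full text. This is a minimal sketch under that assumption; the log contents below are made up to stand in for testfile.txt:

```python
def has_all(text, needles):
    """Return True only if every needle occurs somewhere in the text."""
    return all(s in text for s in needles)

searchstrings = ('task cocaLc Requested reboot', 'memPartFree', 'memPartAlloc')

# Hypothetical file contents standing in for testfile.txt:
log = """task cocaLc Requested reboot
some other line
memPartFree
memPartAlloc
"""

if has_all(log, searchstrings):
    print('defect')  # printed exactly once
```

With a real file you would call has_all(open('testfile.txt').read(), searchstrings); for very large files, the line-by-line version above is the better choice.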

Hello, I have code that prints what I need in Python, but I'd like it to write that result to a new file

The file look like a series of lines with IDs:
aaaa
aass
asdd
adfg
aaaa
I'd like to get, in a new file, each ID and its number of occurrences in the old file, in the form:
aaaa 2
asdd 1
aass 1
adfg 1
with the two elements separated by a tab.
The code I have prints what I want but doesn't write to a new file:
with open("Only1ID.txt", "r") as file:
file = [item.lower().replace("\n", "") for item in file.readlines()]
for item in sorted(set(file)):
print item.title(), file.count(item)
As you use Python 2, the simplest approach to convert your console output to file output is by using the print chevron (>>) syntax which redirects the output to any file-like object:
with open("filename", "w") as f: # open a file in write mode
print >> f, "some data" # print 'into the file'
Your code could look like this after simply adding another open to open the output file and adding the chevron to your print statement:
with open("Only1ID.txt", "r") as file, open("output.txt", "w") as out_file:
file = [item.lower().replace("\n", "") for item in file.readlines()]
for item in sorted(set(file)):
print >> out_file item.title(), file.count(item)
However, your code has a few other weaknesses which you should fix or could improve:
Do not use the same variable name file for both the file object returned by open and your processed list of strings. This is confusing, just use two different names.
You can directly iterate over the file object, which works like a generator that yields the file's lines as strings. A generator produces each element just in time: instead of loading the whole file into memory like file.readlines() and processing it afterwards, it reads and stores only one line at a time, whenever the next line is needed. That improves the code's performance and resource efficiency.
If you write a list comprehension, but you don't need its result necessarily as list because you simply want to iterate over it using a for loop, it's more efficient to use a generator expression (same effect as the file object's line generator described above). The only syntactical difference between a list comprehension and a generator expression are the brackets. Replace [...] with (...) and you have a generator. The only downside of a generator is that you neither can find out its length, nor can you access items directly using an index. As you don't need any of these features, the generator is fine here.
There is a simpler way to remove trailing newline characters from a line: line.rstrip() removes all trailing whitespaces. If you want to keep e.g. spaces, but only want the newline to be removed, pass that character as argument: line.rstrip("\n").
However, it could possibly be even easier and faster to just not add another implicit line break during the print call instead of removing it first to have it re-added later. You would suppress the line break of print in Python 2 by simply adding a comma at the end of the statement:
print >> out_file, item.title(), file.count(item),
There is a type Counter to count occurrences of elements in a collection, which is faster and easier than writing it yourself, because you don't need the additional count() call for every element. The Counter behaves mostly like a dictionary with your items as keys and their count as values. Simply import it from the collections module and use it like this:
from collections import Counter
c = Counter(lines)
for item in c:
    print item, c[item]
With all those suggestions (except the one not to remove the line breaks) applied and the variables renamed to something more clear, the optimized code looks like this:
from collections import Counter
with open("Only1ID.txt") as in_file, open("output.txt", "w") as out_file:
    counter = Counter(line.lower().rstrip("\n") for line in in_file)
    for item in sorted(counter):
        print >> out_file, item.title(), counter[item]
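For comparison, the same logic in Python 3 syntax, where the print() function replaces the chevron and sep="\t" produces the tab-separated output the question asked for. io.StringIO objects stand in for the real input and output files here, and the sample data is taken from the question:

```python
import io
from collections import Counter

# In-memory stand-ins for Only1ID.txt and output.txt.
in_file = io.StringIO("aaaa\naass\nasdd\nadfg\naaaa\n")
out_file = io.StringIO()

counter = Counter(line.lower().rstrip("\n") for line in in_file)
for item in sorted(counter):
    # Tab-separated, written 'into the file' via the file= argument:
    print(item.title(), counter[item], sep="\t", file=out_file)

print(out_file.getvalue())
```

With real files, replace the StringIO objects with the two open() calls from the code above.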

For loop using enumerate through a list with an if statement to search lines for a particular string

I am trying to compile a list of recurring strings (transaction IDs).
I am flummoxed. I've researched the correct method and feel like this code should work.
However, I'm doing something wrong in the second block.
This first block correctly compiles a list of the strings that I want.
I can't get this second block to work. If I simplify, I can print each value in the list
by using
for idx, val in enumerate(tidarray): print val
It seems like I should now be able to use each value to search the lines for that string
and then print the line (actually I'll be using it in conjunction with another search term
to reduce the number of line reads, but this is my basic test before honing in further).
def main():
    pass
samlfile = "2013-08-18 06:24:27,410 tid:5af193fdc DEBUG org.sourceid.saml20.domain.AttributeMapping] Source attributes:{SAML_AUTHN_CTX=urn:oasis:names:tc:SAML:2.0:ac:classes"
tidarray = []
for line in samlfile:
    if "tid:" in line:
        str = line
        tid = re.search(r'(tid:.*?)(?= )', str)
        if tid.group() not in tidarray:
            tidarray.append(tid.group())
for line in samlfile:
    for idx, val in enumerate(tidarray):
        if val in line:
            print line
Can someone suggest a correction for the second block of code? I recognize that reading the file twice isn't the most elegant solution... My main goal here is to learn how to enumerate through the list and use each value in the subsequent code.
Iterating over a file twice
Basically what you do is:
for line in somefile: pass # first run
for line in somefile: pass # second run
The first run will complete just fine, the second run will not run at all.
This is because the file was read until the end and there's no more data to read lines from.
Call somefile.seek(0) to go to the beginning of the file:
for line in somefile: pass  # first run
somefile.seek(0)
for line in somefile: pass  # second run
Storing things uniquely
Basically, what you seem to want is a way to store the IDs from the file in a
data structure in which every ID appears only once.
If you want to store elements uniquely, you can use, for example, dictionaries (help(dict))
or sets (help(set)). Example with sets:
myset = set()
myset.add(2) # set([2])
myset.add(3) # set([2,3])
myset.add(2) # set([2,3])
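Putting both ideas together, this is a sketch of the two-pass approach: a set collects each transaction ID once, and seek(0) rewinds the file before the second pass. An io.StringIO with made-up log lines (adapted from the question) stands in for the real file, and the regex is a slightly simplified r'tid:\S+' rather than the lookahead version in the question:

```python
import io
import re

# io.StringIO stands in for the log file; these lines are illustrative.
samlfile = io.StringIO(
    "2013-08-18 06:24:27,410 tid:5af193fdc DEBUG org.sourceid.saml20.domain.AttributeMapping] Source attributes\n"
    "2013-08-18 06:24:28,110 tid:5af193fdc DEBUG another line\n"
    "2013-08-18 06:25:01,003 tid:7bc401aa1 DEBUG yet another line\n"
)

# First pass: collect each transaction ID exactly once, using a set.
tids = set()
for line in samlfile:
    match = re.search(r'tid:\S+', line)
    if match:
        tids.add(match.group())

# Rewind before the second pass; the file has been read to the end.
samlfile.seek(0)
matching = [line for line in samlfile if any(tid in line for tid in tids)]

print(sorted(tids))   # ['tid:5af193fdc', 'tid:7bc401aa1']
print(len(matching))  # 3
```

With a real log you would open the file once and reuse the same file object for both passes, exactly as the seek(0) example above shows.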