Python: why does this code only loop one time? - python-2.7

I want to read two CSV files and print specific columns by key name.
First, I have a list of my key names, like key = ['a', 'b', 'c'],
and I have the following code:
with open('./file/report.csv', 'rb') as csvfile, open('./file/all.csv', 'rb') as csvfile2:
    reader2 = csv.DictReader(csvfile)
    reader3 = csv.DictReader(csvfile2)
    for i in key:
        for row in reader2:
            for row2 in reader3:
                if row['Case Id'] == i and row2['name'] == i:
                    a = row['Status']
                    b = row2['result']
                    print a, b
The two CSV files:
report.csv:           all.csv:
Case Id   Status      name   result
a         111         a      1111
b         222         b      2222
c         333         c      3333
I expect it to loop three times, because there are three elements in the key list. The expected result should look like:
111 1111
222 2222
333 3333
But the actual result is:
111 1111
It only loops one time. I am new to coding and need some help. Thanks!

Readers are one-time iterators and are exhausted after one pass.
This means that the second time around there is nothing left in reader3, since you've already consumed it. The same applies to reader2, which is iterated again for every element of key.
Try this:
reader2 = list(csv.DictReader(csvfile))   # needed: iterated once per key
reader3 = list(csv.DictReader(csvfile2))  # needed: iterated once per row of reader2
If you're working with big files, use more sophisticated matching or just re-open the file each time.
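To make the exhaustion visible, here is a minimal sketch. It is written in Python 3 syntax and uses an in-memory io.StringIO as a hypothetical stand-in for all.csv, so the file contents and variable names are assumptions for illustration only.

```python
import csv
import io

# Hypothetical in-memory stand-in for all.csv
data = io.StringIO("name,result\na,1111\nb,2222\nc,3333\n")
reader = csv.DictReader(data)

first_pass = [row["name"] for row in reader]
second_pass = [row["name"] for row in reader]  # the reader is already exhausted here

print(first_pass)
print(second_pass)

# Materialising the rows once in a list makes them reusable
data.seek(0)
rows = list(csv.DictReader(data))
passes = [[row["result"] for row in rows] for _ in range(2)]
print(passes)
```

The second pass over the reader yields nothing, while the list can be traversed as often as needed.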

Think of a csv reader as a one-time iterator over the file. Once you read a line, you cannot go back, and once the reader is exhausted, you cannot read any more data from the file without re-creating it. A good practice is to read both readers into memory and then go over the resulting lists. E.g.:
list2 = list(reader2)
list3 = list(reader3)
for i in key:
    for row in list2:
        for row2 in list3:
            if row['Case Id'] == i and row2['name'] == i:
                a = row['Status']
                b = row2['result']
                print a, b

Related

How to add multiple Sentences (which are stored in a list) into a pandas dataframe

I would like to create an aspect analysis from user reviews. The reviews contain various aspects and therefore the reviews need to be separated into sentences. I save the data in a pandas dataframe and separate the sentences with the nltk library.
I put the separate sentences in a list that I want to format into a dataframe and connect to the original dataframe. However, I get an error. Instead of an extra column, I get 19 new columns. (the individual sentences are not stored in a cell, I think every single sentence gets their own column) I also tested itertools but I also get a wrong record.
Can someone help me to get the right format?
I would like to have a new dataframe which looks like that:
U_REVIEW | SENTENCES
-------- | ---------
Im a Sentence. Iam another Sentence in a Row. | [u'Im a Sentence', u'Iam another Sentence in a Row.']
Here we go, next Sentence. Blub, more blubs. | [u'Here we go, next Sentence.', u'Blub, more blubs.']
Once again, more Sentence. And some other information. The Restaurant was ok, but not awesome. | [u'Once again, more Sentence.', u'And some other information.', u'The Restaurant was ok, but not awesome.']
This is what my code looks like:
ta = ta[['U_REVIEW']]
Output:
U_REVIEW
Im a Sentence. Iam another Sentence in a Row.
Here we go, next Sentence. Blub, more blubs.
Once again, more Sentence. And some other information. The Restaurant was ok, but not awesome.
# the empty lists
sentences = []
ss = []
for sentence in ta['U_REVIEW']:
    # separates the review into sentences
    sentence = sent_tokenize(sentence)
    sentences.append(sentence)
test = itertools.chain(sentences)
# new dataframe to add the sentences
df2 = pd.DataFrame(sentences)
# create column
cols2 = ['REVIEW_SENTENCES']
# bring the two dataframes together
df2 = pd.DataFrame(sentences, columns=cols2)
Output of sentences:
[[u'Im a Sentence', u'Iam another Sentence in a Row.'], [u'Here we go, next Sentence.', u'Blub, more blubs.'], [u'Once again, more Sentence.', u'And some other information.', u'The Restaurant was ok, but not awesome.']]
Output of test:
<itertools.chain object at 0x000000001316DC18>
Output and Information of the new Dataframe df2:
AssertionError: 1 columns passed, passed data had 19 columns
U_REVIEW | 0 | 1 | 2 ...
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Im a Sentence. Iam another Sentence in a Row. |Im a Sentence |Iam another Sentence in a Row. |
Here we go, next Sentence. Blub, more blubs. |Here we go, next Sentence.|Blub, more blubs. |
Once again, more Sentence. And some other information. The Restaurant was ok, but not awesome.|Once again, more Sentence.|And some other information. |The Restaurant was ok, but not awesome.
Here is a Testset of a Dataframe:
import pandas as pd
ta = pd.DataFrame( ['Im a Sentence. Iam another Sentence in a Row','Here we go, next Sentence. Blub, more blubs.','Once again, more Sentence. And some other information. The Restaurant was ok, but not awsome.'])
ta.columns =['U_REVIEW']
Try this. I did it in Python 3.5; I think it should work in 2.5 as well:
In [45]: df = pd.DataFrame(ta.U_REVIEW.str.split('.',expand=True).replace('',np.nan).fillna(np.nan).values.flatten()).dropna()
In [46]: df
Out[46]:
0
0 Im a Sentence
1 Iam another Sentence in a Row
4 Here we go, next Sentence
5 Blub, more blubs
8 Once again, more Sentence
9 And some other information
10 The Restaurant was ok, but not awsome
Is this what you want?
ta.U_REVIEW.str.split('.',expand=True)
Out[50]:
0 1 \
0 Im a Sentence Iam another Sentence in a Row
1 Here we go, next Sentence Blub, more blubs
2 Once again, more Sentence And some other information
2 3
0 None None
1 None
2 The Restaurant was ok, but not awsome
or
In [52]: ta.U_REVIEW.str.split('.').apply(list)
Out[52]:
0 [Im a Sentence, Iam another Sentence in a Row]
1 [Here we go, next Sentence, Blub, more blubs, ]
2 [Once again, more Sentence, And some other in...
Name: U_REVIEW, dtype: object

Python - creating a dictionary from large text file where the key matches regex pattern

My question: how do I create a dictionary from a list by assigning dictionary keys based on a regex pattern match ('^--L-[0-9]{8}'), and assigning the values by using all lines between each key.
Example excerpt from the raw file:
SQL> --L-93752133
SQL> --SELECT table_name, tablespace_name from dba_tables where upper(table_name) like &tablename_from_developer;
SQL>
SQL> --L-52852243
SQL>
SQL> SELECT log_mode FROM v$database;
LOG_MODE
------------
NOARCHIVELOG
SQL>
SQL> archive log list
Database log mode No Archive Mode
Automatic archival Disabled
Archive destination USE_DB_RECOVERY_FILE_DEST
Oldest online log sequence 3
Current log sequence 5
SQL>
SQL> --L-42127143
SQL>
SQL> SELECT t.name "TSName", e.encryptionalg "Algorithm", d.file_name "File Name"
2 FROM v$tablespace t
3 , v$encrypted_tablespaces e
4 , dba_data_files d
5 WHERE t.ts# = e.ts#
6 AND t.name = d.tablespace_name;
no rows selected
Some additional detail: The raw file can be large (at least 80K+ lines, but often much larger) and I need to preserve the original spacing so the output is still easy to read. Here's how I'm reading the file in and removing "SQL>" from the beginning of each line:
with open(rawFile, 'r') as inFile:
    content = inFile.read()
rawList = content.splitlines()
for line in rawList:
    cleanLine = re.sub('^SQL> ', '', line)
Finding the dictionary keys I'm looking for is easy:
pattern = re.compile(r'^--L-[0-9]{8}')
if pattern.search(cleanLine) is not None:
    itemID = pattern.search(cleanLine)
    print(itemID.group(0))
But how do I assign all lines between each key as the value belonging to the most recent key preceding them? I've been playing around with new lists, tuples, and dictionaries but everything I do is returning garbage. The goal is to have the data and keys linked to each other so that I can return them as needed later in my script.
I spent a while searching for a similar question, but in most other cases the source file was already in a dictionary-like format so creating the new dictionary was a less complicated problem. Maybe a dictionary or tuple isn't the right answer, but any help would be appreciated! Thanks!
In general, you should question why you would read the entire file, split the lines into a list, and then iterate over the list. This is a Python anti-pattern.
For line oriented text files, just do:
with open(fn) as f:
    for line in f:
        pass  # process a line
It sounds, however, like you have multi-line, block-oriented patterns. If so, with smaller files, read the entire file into a single string and use a regex on that. Then you would use group 1 and group 2 as the key and value in your dict:
pat = re.compile(pattern, flags)
with open(file_name) as f:
    di = {m.group(1): m.group(2) for m in pat.finditer(f.read())}
With a larger file, use mmap:
import re, mmap
pat = re.compile(pattern, flags)
with open(file_name, 'r+') as f:
    mm = mmap.mmap(f.fileno(), 0)
    for i, m in enumerate(pat.finditer(mm)):
        pass  # process each block accordingly...
As for the regex, I am a little unclear on what you are trying to capture. I think this regex matches what I understand you to want:
^SQL> (--L-[0-9]{8})(.*?)(?=SQL> --L-[0-9]{8}|\Z)
In either case, running that regex with the example string yields:
>>> pat=re.compile(r'^SQL> (--L-[0-9]{8})\s*(.*?)\s*(?=SQL> --L-[0-9]{8}|\Z)', re.S | re.M)
>>> with open(file_name) as f:
...     di={m.group(1):m.group(2) for m in pat.finditer(f.read())}
...
>>> di
{'--L-52852243': 'SQL> \nSQL> SELECT log_mode FROM v;\n\n LOG_MODE\n ------------\n NOARCHIVELOG\n\nSQL> \nSQL> archive log list\n Database log mode No Archive Mode\n Automatic archival Disabled\n Archive destination USE_DB_RECOVERY_FILE_DEST\n Oldest online log sequence 3\n Current log sequence 5\nSQL>',
'--L-93752133': 'SQL> --SELECT table_name, tablespace_name from dba_tables where upper(table_name) like &tablename_from_developer;\nSQL>',
'--L-42127143': 'SQL> \nSQL> SELECT t.name TSName, e.encryptionalg Algorithm, d.file_name File Name\n 2 FROM v t\n 3 , v e\n 4 , dba_data_files d\n 5 WHERE t.ts# = e.ts#\n 6 AND t.name = d.tablespace_name;\n\n no rows selected'}
Something like this?
with open(rawFile, 'r') as inFile:
    content = inFile.read()
rawList = content.splitlines()
pattern = re.compile(r'^--L-[0-9]{8}')
keyed_dict = {}
in_between_lines = ""
last_key = None
for line in rawList:
    cleanLine = re.sub('^SQL> ', '', line)
    itemID = pattern.search(cleanLine)
    if itemID is not None:
        # a new key starts: store the block collected for the previous key
        if last_key is not None:
            keyed_dict[last_key] = in_between_lines
        last_key = itemID.group(0)
        in_between_lines = ""
    else:
        # keep the line break so the original layout is preserved
        in_between_lines += cleanLine + "\n"
# store the block that follows the last key
if last_key is not None:
    keyed_dict[last_key] = in_between_lines
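For very large files, the same grouping can be done while streaming over the lines instead of reading everything first. The following is a minimal sketch in Python 3 syntax, not a definitive implementation. The function name blocks_by_key and the sample lines are hypothetical, and the prompt pattern '^SQL> ?' is an assumption that also strips a bare "SQL>" with no trailing space.

```python
import re

KEY_PATTERN = re.compile(r'^--L-[0-9]{8}')
PROMPT = re.compile(r'^SQL> ?')  # assumption: also strip a bare "SQL>" prompt


def blocks_by_key(lines):
    """Group lines under the most recent --L-######## key (hypothetical helper)."""
    blocks = {}
    current = None
    for line in lines:
        clean = PROMPT.sub('', line.rstrip('\n'))
        match = KEY_PATTERN.match(clean)
        if match:
            # a key line opens a new, initially empty block
            current = match.group(0)
            blocks[current] = []
        elif current is not None:
            # any other line belongs to the most recent key
            blocks[current].append(clean)
    # join with newlines so the original line layout is preserved
    return {key: '\n'.join(body) for key, body in blocks.items()}


sample = [
    "SQL> --L-93752133\n",
    "SQL> --SELECT table_name from dba_tables;\n",
    "SQL>\n",
    "SQL> --L-52852243\n",
    "SQL> SELECT log_mode FROM v$database;\n",
    "LOG_MODE\n",
    "------------\n",
    "NOARCHIVELOG\n",
]
result = blocks_by_key(sample)
print(result)
```

Because the function accepts any iterable of lines, an open file object can be passed directly, for example blocks_by_key(inFile) inside the with block, which avoids holding a second copy of the file in a list.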

Python:How can you recursively search a .txt file, find matches and print results

I have been searching for an answer to this, but can not seem to get what I need. I would like a python script that reads my text file and starting from the top working its way through each line of the file and then prints out all the matches in another txt file. Content of the text file is just 4 digit numbers like 1234.
example
1234
3214
4567
8963
1532
1234
...and so on.
I would like the output to be something like:
1234 : matches found = 2
I know that there are matches in the file because it has almost 10000 lines. I would appreciate any help. If someone could just point me in the right direction, that would be great. Thank you.
import re
file = open("filename", 'r')
fileContent=file.read()
pattern="1234"
print len(re.findall(pattern,fileContent))
If I were you, I would open the file, use the split method to create a list of all the numbers, and use the Counter class from collections to count how many times each number occurs in the list.
from collections import Counter
filepath = 'original_file'
new_filepath = 'new_file'
with open(filepath, 'r') as file:
    text = file.read()
# split on any whitespace so blank lines are not counted as numbers
numbers_list = text.split()
dupes = [[item, ': matches found =', str(count)] for item, count in Counter(numbers_list).items() if count > 1]
dupes = [' '.join(i) for i in dupes]
with open(new_filepath, 'w') as new_file:
    for i in dupes:
        # write one result per line
        new_file.write(i + '\n')
Thanks to everyone who helped me on this. Thank you to @csabinho for the code he provided and to @IanAuld for asking me "Why do you think you need recursion here?" It got me thinking that the solution was a simple one. I just wanted to know which 4-digit numbers had duplicates and how many, and also which 4-digit combos were unique. So this is what I came up with, and it worked beautifully!
import re
a = 999
while a < 9999:
    a = a + 1
    file = open("4digits.txt", 'r')
    fileContent = file.read()
    pattern = str(a)
    result = len(re.findall(pattern, fileContent))
    if result >= 1:
        print(a, "matches", result)
    else:
        print(a, "This number is unique!")
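The Counter suggestion from the earlier answer can report duplicates and unique numbers in a single pass over the file, rather than re-reading it once for every candidate number. A minimal sketch, assuming one number per line; the function name count_numbers and the sample list are hypothetical and stand in for the lines of the real file.

```python
from collections import Counter


def count_numbers(lines):
    """Count how often each non-empty line occurs (hypothetical helper)."""
    return Counter(line.strip() for line in lines if line.strip())


# Sample data taken from the question; a real run would pass an open file instead
sample = ["1234\n", "3214\n", "4567\n", "8963\n", "1532\n", "1234\n"]
counts = count_numbers(sample)

report = []
for number, count in sorted(counts.items()):
    if count > 1:
        report.append("{} : matches found = {}".format(number, count))
    else:
        report.append("{} : unique".format(number))
print("\n".join(report))
```

Here a number is called unique when it occurs exactly once in the file, and whole lines are compared, so a number never matches inside a longer one.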

Codeeval Challenge 230: Football, Answer Only Partially Correct

I am working on a relatively new challenge in CodeEval called 'Football.' The description is listed in the following link:
https://www.codeeval.com/open_challenges/230/
Inputs are lines of a file read by Python, and within each line there are lists separated by '|', with each list representing a country: the first being country "1", second being country "2", and so on.
1 2 3 4 | 3 1 | 4 1
19 11 | 19 21 23 | 31 39 29
Outputs are also lines in response to each line read from the file.
1:1,2,3; 2:1; 3:1,2; 4:1,3;
11:1; 19:1,2; 21:2; 23:2; 29:3; 31:3; 39:3;
so country 1 supports team 1, 2, and 3 as shown in the first line of output: 1:1,2,3.
Below is my solution. Since I have no clue why it only works for the two sample cases listed in the description link, I'd like to ask for comments and hints on how to correct my code. Thank you very much in advance for your time and assistance.
import sys
def football(string):
    countries = map(str.split, string.split('|'))
    teams = sorted(list(set([i[j] for i in countries for j in range(len(i))])))
    results = []
    for i in range(len(teams)):
        results.append([teams[i]+':'])
        for j in range(len(countries)):
            if teams[i] in countries[j]:
                results[i].append(str(j+1))
    for i in range(len(results)):
        results[i] = results[i][0]+','.join(results[i][1:])
    return '; '.join(results) + '; '
if __name__ == '__main__':
    lines = [line.rstrip() for line in open(sys.argv[1])]
    for line in lines:
        print football(line)
After deliberately failing an attempt in order to check the complete test input against my output, I found the problem. The line:
teams = sorted(list(set([i[j] for i in countries for j in range(len(i))])))
will make the output problematic in terms of sorting. For example here's a sample input:
10 20 | 43 23 | 27 | 25 | 11 1 12 43 | 33 18 3 43 41 | 31 3 45 4 36 | 25 29 | 1 19 39 | 39 12 16 28 30 37 | 32 | 11 10 7
and it produces the output:
1:5,9; 10:1,12; 11:5,12; 12:5,10; 16:10; 18:6; 19:9; 20:1; 23:2; 25:4,8; 27:3; 28:10; 29:8; 3:6,7; 30:10; 31:7; 32:11; 33:6; 36:7; 37:10; 39:9,10; 4:7; 41:6; 43:2,5,6; 45:7; 7:12;
But the challenge expects the output teams to be sorted by numbers in ascending order, which is not achieved by the above-mentioned code as the numbers are in string format, not integer format. Therefore the solution is simply adding a key to sort the teams list by ascending order for integer:
teams = sorted(list(set([i[j] for i in countries for j in range(len(i))])), key=lambda x:int(x))
With a small change in this line, the code passes through the tests. A sample output looks like:
1:5,9; 3:6,7; 4:7; 7:12; 10:1,12; 11:5,12; 12:5,10; 16:10; 18:6; 19:9; 20:1; 23:2; 25:4,8; 27:3; 28:10; 29:8; 30:10; 31:7; 32:11; 33:6; 36:7; 37:10; 39:9,10; 41:6; 43:2,5,6; 45:7;
Please let me know if you have a better and more efficient solution to the challenge. I'd love to read better code or suggestions on improving my programming skills.
Here's how I solved it:
import sys
with open(sys.argv[1]) as test_cases:
    for test in test_cases:
        if test:
            team_supporters = {}
            for nation, nation_teams in enumerate(test.strip().split("|"), start=1):
                for team in map(int, nation_teams.split()):
                    team_supporters.setdefault(team, []).append(nation)
            print(*("{}:{};".format(team, ",".join(map(str, sorted(nations))))
                    for team, nations in sorted(team_supporters.items())))
The problem is not very complicated. We're given a mapping from nation (implicitly numbered by their order in the input) to a list of teams. We need to reverse that to create an output that maps from a team to a list of nations.
It seems natural to use a dictionary that maps in the same way as the desired output. We can use enumerate to give numbers to the nations as we iterate over them. The setdefault method of the dict adds empty lists to the dictionary as they are needed (using a collections.defaultdict instead of a regular dictionary would be another way to deal with this). We don't need to care about the order of the input, nor the order things are stored in the dictionary's inner lists.
We build the output using str.format calls and the default space separator of the print function. If the final semicolon weren't desired, I'd have used print("; ".join("{}:{}".format(...))) instead. Since the output needs to be sorted by team at the top level, and by nation in the inner lists, we make some sorted calls where necessary.
Sorting the inner lists is probably not even necessary, since the nations were processed in order, with their numbers derived from the order they had in the input line. Fortunately, Python's Timsort algorithm is very fast on already-sorted input, so even with a bit of unnecessary sorting, our code is still fast enough.
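The collections.defaultdict alternative mentioned above could look roughly like this. It is a sketch, not the answer's original code: the function is restructured to return the output line so it can be checked against the sample cases, and it assumes no team is listed twice within one nation.

```python
from collections import defaultdict


def football(line):
    """Map each team to the nations supporting it, formatted as in the challenge."""
    supporters = defaultdict(list)
    for nation, teams in enumerate(line.strip().split("|"), start=1):
        for team in map(int, teams.split()):
            # nations are visited in ascending order, so each list stays sorted
            supporters[team].append(nation)
    return " ".join(
        "{}:{};".format(team, ",".join(map(str, nations)))
        for team, nations in sorted(supporters.items())
    )


print(football("1 2 3 4 | 3 1 | 4 1"))
print(football("19 11 | 19 21 23 | 31 39 29"))
```

Converting the team numbers to int before sorting is what gives the numeric order the challenge expects.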

Compare two CSV files in Python

I have two CSV files as follows:
CSV1:
ID   Name   Address   Ph
1    Mr.C   dsf       142
2    Ms.N   asd       251
4    Mr.V   fgg       014
12   Ms.S   trw       547
CSV2:
ID   Name   Service   Day
1    Mr.C   AAA       Mon
2    Ms.N   AAA       Mon
2    Ms.N   BBB       Tue
2    Ms.N   AAA       Sat
As you can see very quickly CSV1 file is unique in having only 1 instance of every ID whilst CSV2 has repeats.
I am trying to match two CSV files based on ID and then wherever they match adding to CSV2 file the Address and Ph fields from CSV1. This is then saved as a new output file leaving the two original CSV files intact.
I have written some code, but here's what's happening:
Either all the entries from CSV1 get added against the last row of CSV2
Or all the entries from CSV2 get the same address details appended against them
Here's what I have done so far.
import csv
csv1 = open('C:\csv1file.csv')
csv2 = open('C:\csv2file.csv')
csv1reader = csv.reader(csv1)
csv2reader = csv.reader(csv2)
outputfile = open('C:\mapped.csv', 'wb')
csvwriter = csv.writer(outputfile)
counter = 0
header1 = csv1reader.next()
header2 = csv2reader.next()
csvwriter.writerow(header2+header1[2:4])
for row1 in csv1reader:
    for row2 in csv2reader:
        if row1[0] == row2[0]:
            counter = counter+1
            csvwriter.writerow(row2+row1[2:4])
I am running this code in Python 2.7. As you might have guessed the two different results that I am getting are based on the indentation of the csvwriter statement in the above code. I feel I am quite close to the answer and understand the logic but somehow the loop doesn't loop very well.
Can any one of you please assist?
Thanks.
The problem arises because the inner loop only runs once. The reason is that csv2reader is exhausted after the first pass through it.
One way to fix this is to make a copy of the rows in the second file and use that copy in the loop:
csvwriter.writerow(header2+header1[2:4])
csv2copy = []
for row2 in csv2reader:
    csv2copy.append(row2)
for row1 in csv1reader:
    for row2 in csv2copy:
        print row1, row2, counter
        if row1[0] == row2[0]:
            counter = counter+1
            csvwriter.writerow(row2+row1[2:4])
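Since every ID occurs only once in CSV1, another option is to load CSV1 into a dictionary keyed by ID and then make a single pass over CSV2, which avoids the nested loop entirely. A minimal sketch in Python 3 syntax; the in-memory strings are hypothetical stand-ins for the two files, assumed to be comma-separated.

```python
import csv
import io

# Hypothetical in-memory stand-ins for the two files (assumed comma-separated)
csv1_text = "ID,Name,Address,Ph\n1,Mr.C,dsf,142\n2,Ms.N,asd,251\n4,Mr.V,fgg,014\n12,Ms.S,trw,547\n"
csv2_text = "ID,Name,Service,Day\n1,Mr.C,AAA,Mon\n2,Ms.N,AAA,Mon\n2,Ms.N,BBB,Tue\n2,Ms.N,AAA,Sat\n"

reader1 = csv.reader(io.StringIO(csv1_text))
header1 = next(reader1)
# ID -> [Address, Ph]; works because each ID appears only once in CSV1
details_by_id = {row[0]: row[2:4] for row in reader1}

reader2 = csv.reader(io.StringIO(csv2_text))
header2 = next(reader2)

output = io.StringIO()
writer = csv.writer(output, lineterminator="\n")
writer.writerow(header2 + header1[2:4])
for row in reader2:
    # rows whose ID has no counterpart in CSV1 are skipped, as in the original loop
    if row[0] in details_by_id:
        writer.writerow(row + details_by_id[row[0]])

merged = output.getvalue().splitlines()
print("\n".join(merged))
```

With real files, the io.StringIO objects would be replaced by the opened input and output files. Each CSV2 row then costs one dictionary lookup instead of a scan over all of CSV1.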