Extracting data using regular expressions: Python - regex

The basic outline of this problem is to read the file, find the integers on each line with re.findall() and the regular expression [0-9]+, convert the extracted strings to integers, and sum them.
I am having trouble building the list. With my code below, only the first match (index 0) of a line gets appended. Please help me. Thank you.
import re
hand = open ('a.txt')
lst = list()
for line in hand:
    line = line.rstrip()
    stuff = re.findall('[0-9]+', line)
    if len(stuff)!= 1 : continue
    num = int (stuff[0])
    lst.append(num)
print sum(lst)

import re
ls = []
text = open('C:/Users/pvkpu/Desktop/py4e/file1.txt')
for line in text:
    line = line.rstrip()
    l = re.findall('[0-9]+', line)
    if len(l) == 0:
        continue
    ls += l
for i in range(len(ls)):
    ls[i] = int(ls[i])
print(sum(ls))

Great, thank you for including the whole txt file! Your main problem was the if len(stuff)... line, which skipped a line not only when stuff was empty but also when it held 2, 3, or more matches. You were only keeping stuff lists of length 1. I put comments in the code, but please ask if anything is unclear.
import re
hand = open ('a.txt')
str_num_lst = list()
for line in hand:
    line = line.rstrip()
    stuff = re.findall('[0-9]+', line)
    #If we didn't find anything on this line then continue
    if len(stuff) == 0: continue
    #if len(stuff)!= 1: continue #<-- This line was wrong as it skips lists with more than 1 element
    #If we did find something, stuff will be a list of strings:
    #(i.e. stuff = ['9607', '4292', '4498'] or stuff = ['4563'])
    #For now let's just add this list onto our str_num_lst
    #without worrying about converting to int.
    #We use '+=' instead of 'append' since both stuff and str_num_lst are lists
    str_num_lst += stuff
#Print out the str_num_lst to check if everything's ok
print str_num_lst
#Get an overall sum by looping over the string numbers in the str_num_lst
#Can convert to int inside the loop
overall_sum = 0
for str_num in str_num_lst:
    overall_sum += int(str_num)
#Print sum
print 'Overall sum is:'
print overall_sum
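To see the difference between the two conditions concretely, here is a minimal Python 3 sketch. The sample lines and the names kept_by_original and kept_by_fix are made up for illustration; only the two length checks come from the code above.

```python
import re

# Hypothetical sample lines: zero, one, and three numbers.
lines = ["no digits here", "one number 42", "three numbers 9607 4292 4498"]

kept_by_original = []  # mimics: if len(stuff) != 1: continue
kept_by_fix = []       # mimics: if len(stuff) == 0: continue

for line in lines:
    stuff = re.findall('[0-9]+', line)
    # The original check keeps a line only when it has exactly one match.
    if len(stuff) == 1:
        kept_by_original.append(int(stuff[0]))
    # The fixed check keeps every match; extending with an empty list is a no-op.
    kept_by_fix += [int(s) for s in stuff]

print(kept_by_original)
print(kept_by_fix)
```

The line with three numbers contributes nothing under the original check, which is why the sum came out too small.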
EDIT:
You are right, reading in the entire file as one line is a good solution, and it's not difficult to do. Check out this post. Here is what the code could look like.
import re
hand = open('a.txt')
all_lines = hand.read() #Reads in all lines as one long string
all_str_nums_as_one_line = re.findall('[0-9]+',all_lines)
hand.close() #<-- can close the file now since we've read it in
#Go through all the matches to get a total
tot = 0
for str_num in all_str_nums_as_one_line:
    tot += int(str_num)
print('Overall sum is:',tot) #editing to add ()
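The same whole-file idea can be written more compactly in Python 3. This is only a sketch: sum_integers is a hypothetical helper name, and the sample string stands in for the contents of a.txt.

```python
import re

def sum_integers(text):
    """Return the sum of every run of digits found in text."""
    return sum(int(match) for match in re.findall(r'[0-9]+', text))

# Stand-in for the file contents. With a real file one could write:
#     with open('a.txt') as hand:
#         total = sum_integers(hand.read())
sample = "Why 3 of them?\nno numbers here\n12 apples and 7 pears"
total = sum_integers(sample)
print('Overall sum is:', total)
```

Using a with block closes the file automatically, so no explicit close() call is needed.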

Related

How can I print this dictionary with each word sorted by its frequency in the text?

I want to print this dictionary so that each word is sorted by the number of times (frequency) it occurs in the text.
slova = dict()
for line in text:
    line = re.split('[^a-z]',text)
    line[i] = filter(None,line)
    i =+ 1
i = 0
for line in text:
    for word in line:
        if word not in slova:
            slova[word] = i
            i += 1
I'm not sure what your text looks like, and you haven't provided example output, but here is my guess. If this doesn't help, please update your question and I'll try again. The code uses Counter from collections to do all the heavy lifting. First, all of the words in all of the lines of the text are flattened into a single list, then this list is passed to Counter. The keys of the Counter (the words) are then sorted by their counts and printed out.
CODE:
from collections import Counter
import re
text = ['hello hi hello yes hello',
'hello hi hello yes hello']
all_words = [w for l in text for w in re.split('[^a-z]',l)]
word_counts = Counter(all_words)
sorted_words = sorted(word_counts.keys(),
                      key=lambda k: word_counts[k],
                      reverse = True)
#Print out the word and counts
for word in sorted_words:
    print word,word_counts[word]
OUTPUT:
hello 6
yes 2
hi 2
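For reference, a Python 3 variant of the same idea might look like the sketch below. It is not the answer's code: word_frequencies is a hypothetical helper, and it swaps re.split for re.findall because re.split('[^a-z]', ...) yields empty strings whenever two separators are adjacent (for example a comma followed by a space). Counter.most_common() does the sorting by count.

```python
from collections import Counter
import re

def word_frequencies(lines):
    """Return (word, count) pairs, most frequent first."""
    # findall keeps only runs of letters, so no empty strings are counted.
    words = [w for line in lines for w in re.findall(r'[a-z]+', line.lower())]
    return Counter(words).most_common()

text = ['hello hi hello yes hello',
        'Hello, hi  hello yes hello']
frequencies = word_frequencies(text)
for word, count in frequencies:
    print(word, count)
```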

How to get a list of strings to print out vertically in a text file?

I have some data that I've pulled from a website. This is the code I used to grab it (my actual code is much longer but I think this about sums it up).
lid_restrict_save = []
for t in range(10000,10020):
    address = 'http://www.tspc.oregon.gov/lookup_application/' + lines2[t]
    page = requests.get(address)
    tree = html.fromstring(page.text)
    #District Restriction
    dist_restrict = tree.xpath('//tr[11]//text()')
    if u"District Restriction" in dist_restrict:
        lid_restrict_save.append(id2)
I'm trying to export this list:
print lid_restrict_save
[['5656966VP65', '5656966RR68', '56569659965', '56569658964']]
to a text file.
f = open('dis_restrict_no_uniqDOB2.txt', 'r+')
for j in range(0,len(lid_restrict_save)):
    s = ( (unicode(lid_restrict_save[j]).encode('utf-8') + ' \n' ))
    f.write(s)
f.close()
I want the text to come out looking like this:
5656966VP65
5656966RR68
56569659965
56569658964
This code worked but only when I started the range from 0.
f = open('dis_restrict.txt', 'r+')
for j in range(0,len(ldob_restrict)):
    f.write( ldob_restrict[j].encode("utf-8") + ' \n' )
f.close()
When I've tried changing the code I keep getting this error:
"AttributeError: 'list' object has no attribute 'encode'."
I've tried the suggestions from here, here, and here but to no avail.
If anyone has any hints it would be greatly appreciated.
lid_restrict_save is a nested list (a list whose only element is another list), so lid_restrict_save[0] is itself a list, not a string, and a list has no encode method.
You could write to the txt file using this:
lid_restrict_save = [['5656966VP65', '5656966RR68', '56569659965', '56569658964']]
lid_restrict_save = lid_restrict_save[0] # remove the outer list
# 'w' creates the file if it is missing and truncates it otherwise;
# 'r+' fails when the file does not exist yet
with open('dis_restrict.txt', 'w') as f:
    for i in lid_restrict_save:
        f.write(str(i) + '\n')
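If the outer list can hold more than one inner list, taking element [0] would drop the rest. A minimal Python 3 sketch that flattens one level of nesting is shown below; format_ids and write_ids are hypothetical helper names, and the two-group sample data is invented for illustration.

```python
from itertools import chain

def format_ids(nested):
    """Flatten one level of nesting and return one ID per line."""
    flat = chain.from_iterable(nested)
    return ''.join(str(item) + '\n' for item in flat)

def write_ids(nested, path):
    """Write the flattened IDs to path, one per line."""
    # 'w' creates the file if needed and overwrites any previous content.
    with open(path, 'w', encoding='utf-8') as f:
        f.write(format_ids(nested))

# Invented sample: two inner lists instead of one.
sample = [['5656966VP65', '5656966RR68'], ['56569659965', '56569658964']]
print(format_ids(sample), end='')
# write_ids(sample, 'dis_restrict.txt')  # uncomment to write the file
```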

What is the error in my python code

You are given an integer N on one line. The next line contains N space separated integers. Create a tuple of those N integers. Let's call it T.
Compute hash(T) and print it.
Note: Here, hash() is one of the functions in the __builtins__ module.
Input Format
The first line contains N. The next line contains N space separated integers.
Output Format
Print the computed value.
Sample Input
2
1 2
Sample Output
3713081631934410656
My code
a=int(raw_input())
b=()
i=0
for i in range (0,a):
x=int(raw_input())
c = b + (x,)
i=i+1
hash(b)
Error:
invalid literal for int() with base 10: '1 2'
There are three errors that I can spot:
First, your for-loop is not indented.
Second, you should not be adding 1 to i - the for-loop does this automatically.
Third - and this is where the error is thrown - raw_input reads the entire line. If you are reading the line '1 2', you cannot convert this to an int.
To fix this problem, I suggest doing:
line = tuple(map(int,raw_input().split(' ')))
This takes the raw input, splits it into a list, converts each item to an int, then turns the result into a tuple.
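Broken into separate steps, the conversion might look like this Python 3 sketch (where input() replaces raw_input()). parse_ints is a hypothetical helper name, and the literal '1 2' stands in for the line read from standard input. It calls split() without an argument, which also tolerates repeated spaces; split(' ') would produce empty strings in that case.

```python
def parse_ints(line):
    """Turn a line of space separated integers into a tuple of ints."""
    parts = line.split()        # '1 2' -> ['1', '2']
    numbers = map(int, parts)   # converts each piece to an int on demand
    return tuple(numbers)       # materializes the result as a tuple

t = parse_ints('1 2')  # in the real program: parse_ints(input())
print(t)
print(hash(t))
```

The printed hash value depends on the Python version, so it may not match the sample output of the exercise.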
In fact, you can scrap the entire for loop. You could answer this problem in two lines of code:
raw_input()#To get rid of the first line, which we do not need
print hash(tuple(map(int,raw_input().split(' '))))
The input format says:
next line contains N space separated integers
A line such as 1 2 3 is not an integer (because of the spaces), which is why your code throws an error when you try int(raw_input()). You should use split(' '), as the other answer has suggested, to separate the integers. This will remove the error.
Also, there is no need for i=i+1, as the loop takes care of it.
Try the below code:
if __name__ == '__main__':
    n = int(input())
    integer_list = map(int, input().split())
    t = tuple(integer_list)
    print(hash(t))
Try this code for Python 3:
if __name__ == '__main__':
    n = int(input())
    integer_list = map(int, input().split())
    input_list = [int(x) for x in integer_list]
    t = tuple(input_list)
    print(hash(t))

Why does this code only read the first line rather than the whole .txt file?

I have some Python 2.7 code that is supposed to tell me the frequency of a letter or word within a single text file.
def frequency_a_in_text(textfile, a):
    """Counts how many "a" letters are in the text file.
    """
    try:
        f = open(textfile,'r')
        lines = f.readlines()
        f.close()
    except IOError:
        return -1
    tot = 0
    for line in lines:
        split = str(line.split())
        k = split.count(a)
        tot = tot + k
        return tot
print frequency_a_in_text("RandomTextFile.txt", "a")
There's a little bit of extra coding in there - the "try" and "except", but that's just telling me that if I can't open the text file, then it'll return a "-1" to me.
Whenever I run it, it seems to just read the first line and tell me how many "a" letters there are.
You are returning out of the function after the first iteration of your loop.
The return statement should be outside of the loop.
for line in lines:
    split = str(line.split())
    k = split.count(a)
    tot = tot + k
return tot
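As a rough Python 3 sketch of the corrected function: count_in_file is a hypothetical name, and unlike the original it counts directly on the file's text rather than on the string form of each split line, which gives the same result for single letters.

```python
def count_in_file(textfile, target):
    """Return how many times target occurs in the file, or -1 if it cannot be read."""
    try:
        # The with block closes the file even if reading fails.
        with open(textfile, 'r') as f:
            text = f.read()
    except OSError:
        return -1
    # A single return after all the text has been read and counted.
    return text.count(target)

print(count_in_file("RandomTextFile.txt", "a"))
```

If RandomTextFile.txt is missing, the call prints -1, mirroring the behavior of the try/except in the question.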

putting text,csv,excel file in pattern

I am a beginner at real programming and have the following problem.
I want to read many instances stored in a file (csv/txt/excel)
like the following:
find<S>ing<G>s<p>
Then, when I read this file, it should go through each character, starting from the sixth position and continuing until the eleventh position (the maximum size of a single row is 12):
-,-,-,-,-,f,i,n,d,i,n,0
-,-,-,-,f,i,n,d,i,n,g,0
-,-,-,f,i,n,d,i,n,g,s,0
-,-,f,i,n,d,i,n,g,s,-,S // there is an S value next to the letter d
-,f,i,n,d,i,n,g,s,-,-,0
f,i,n,d,i,n,g,s,-,-,-,0
i,n,d,i,n,g,s,-,-,-,-,G // there is a G value here at the end of g
n,d,i,n,g,s,-,-,-,-,-,P // there is a P value here at the end of s
Here is the code that I tried in Python, but a solution in C++, Java, or .NET would also be fine.
import sys
import os
f = open('/home/mm/exprimentdata/sample3.csv')// can be txt file
string = f.read()
a = []
b = []
i = 0
while (i < len(string)):
if (string[i] != '\n '):
n = string[i]
if (string[i] == ""):
print ' = '
if (string[i] = upper | numeric)
print rep(char).rjust(12),delimiter=','
a.append(n)
i = (i+1)
print (len(a))
print a
My questions are:
How can I compare each string and assign a single character at the rightmost position (position 12, like G, P, S above)?
How can I shift one step back after aligning the first row?
How can I fix the length?
Could anyone please look at the fragment and adjust it to solve the above case?
I don't understand your question.
But some advice:
Firstly, you should be closing the file after you open it.
f = open('/home/mm/exprimentdata/sample3.csv') # can be txt file
string = f.read()
f.close()
Secondly, your indentation is problematic. Whitespace matters in Python. (Maybe your real code is indented properly and it's just a StackOverflow thing.)
Thirdly, instead of using a while loop and incrementing, you should be writing:
for i in range(len(string)):
    # loop code
Fourthly, this line will never evaluate to True:
if (string[i] == ""):
string[i] will always be some character (or cause an out of bounds error).
I advise you read a Python tutorial before you try and write this program.
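For what it's worth, one possible reading of the example rows in the question is a sliding window: each letter of the word in turn sits at the sixth position of an 11-character window padded with '-', and the twelfth column holds the tag that follows that letter (upper-cased), or 0 when there is none. The Python 3 sketch below implements only that guess; windowed_rows, half_width and pad are hypothetical names, and the question itself may intend something different.

```python
import re

def windowed_rows(tagged, half_width=5, pad='-'):
    """Build one comma separated row per letter of a tagged word (assumed format)."""
    # Each match is (letter, tag); tag is '' when no <X> follows the letter.
    pairs = re.findall(r'([a-z])(?:<([A-Za-z])>)?', tagged)
    letters = ''.join(letter for letter, _ in pairs)
    padded = pad * half_width + letters + pad * half_width
    rows = []
    for i, (_, tag) in enumerate(pairs):
        # Letter i sits in the middle of this slice of the padded word.
        window = padded[i:i + 2 * half_width + 1]
        label = tag.upper() if tag else '0'
        rows.append(','.join(window) + ',' + label)
    return rows

rows = windowed_rows('find<S>ing<G>s<p>')
for row in rows:
    print(row)
```

Under this assumption the eight printed rows match the eight example rows in the question, without the trailing comments.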