Why is the Python ElementTree taking up so much memory? - python-2.7

I need to extract some specific information from a huge (4 GB) XML file. To avoid using too much memory, I use the iterparse method of Python's ElementTree library. This seems to work well, and Python's memory usage stays at about 3.5 MB for most of the run. But towards the end of the program the memory usage grows to several gigabytes, and it seems as if the program will never finish.
Judging by the output CSV file, the program has been through all the elements of interest but has trouble finishing. Can anybody see what is wrong with my program and tell me why it behaves this way?
The program is shown here:
import xml.etree.ElementTree as ET
output_file = 'output.csv'
input_file = 'raw_data/denmark-latest.xml'
parser = ET.iterparse(input_file, events=("start", "end"))
parser = iter(parser)
event, root = parser.next()
with open(output_file, 'a', 1) as f:
    for event, element in parser:
        if event == "start" and element.tag == "node":
            for node in element.findall(".//tag/[@k='addr:housenumber']/..[@lat]"):
                f = open(output_file, "a")
                lat = node.get('lat')
                lon = node.get('lon')
                for tag in node.findall("./tag/[@k='addr:city']"):
                    city = tag.get('v')
                for tag in node.findall("./tag/[@k='addr:postcode']"):
                    postcode = tag.get('v')
                for tag in node.findall("./tag/[@k='addr:street']"):
                    street = tag.get('v')
                for tag in node.findall("./tag/[@k='addr:housenumber']"):
                    houseno = tag.get('v')
                string = str(lat) + ', ' + str(lon) + ', ' + str(postcode) + ', ' + str(city) + ', ' + str(street) + ', ' + str(houseno) +'\n'
                f.write(string)
            root.clear()
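This is not an answer from the original thread, only a hedged sketch. One possible cause is that root.clear() runs only while node elements are being handled, so whatever follows the nodes in the file (in OSM extracts, typically way and relation elements) piles up in memory. The listing above does not let us confirm that. Assuming this is the cause, the version below (Python 3; SAMPLE and extract_addresses are made-up names for illustration) handles elements on their "end" event, when their children are complete, and clears the root after every top-level element, not just after nodes.

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for the real 4 GB file (assumed structure, OSM-like).
SAMPLE = b"""<osm>
  <node id="1" lat="55.1" lon="12.1">
    <tag k="addr:city" v="Roskilde"/>
    <tag k="addr:postcode" v="4000"/>
    <tag k="addr:street" v="Algade"/>
    <tag k="addr:housenumber" v="7"/>
  </node>
  <node id="2" lat="55.2" lon="12.2"/>
  <way id="3"><nd ref="1"/></way>
</osm>"""


def extract_addresses(source):
    """Yield (lat, lon, postcode, city, street, houseno) per addressed node."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root
    depth = 0  # nesting depth below the root element
    for event, element in context:
        if event == "start":
            depth += 1
            continue
        depth -= 1
        if depth != 0:
            continue  # not a direct child of the root
        if element.tag == "node":
            tags = {t.get("k"): t.get("v") for t in element.findall("tag")}
            if "addr:housenumber" in tags:
                yield (element.get("lat"), element.get("lon"),
                       tags.get("addr:postcode"), tags.get("addr:city"),
                       tags.get("addr:street"), tags["addr:housenumber"])
        # Drop every finished top-level element, whatever its tag.
        root.clear()


rows = list(extract_addresses(io.BytesIO(SAMPLE)))
```

For the real file you would pass the path instead of io.BytesIO(SAMPLE) and write each row to the CSV file, which should be opened once, outside the loop.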

Related

Rearranging elements in Python

I am new to Python and I can't get this to work. I have a list, and I want to take its items as input and write them to files.
p = ['Eth1/1', 'Eth1/5','Eth2/1', 'Eth2/4','Eth101/1/1', 'Eth101/1/2', 'Eth101/1/3','Eth102/1/1', 'Eth102/1/2', 'Eth102/1/3','Eth103/1/1', 'Eth103/1/2', 'Eth103/1/3','Eth103/1/4','Eth104/1/1', 'Eth104/1/2', 'Eth104/1/3','Eth104/1/4']
What I am trying:
with open("abc1.txt", "w+") as fw1, open("abc2.txt", "w+") as fw2:
    for i in p:
        if len(i.partition("/")[0]) == 4:
            fw1.write('int ' + i + '\n mode\n')
        else:
            i = 0
            while i < len(p):
                start = p[i].split('/')
                if (start[0] == 'Eth101'):
                    i += 3
                key = start[0]
                i += 1
                while i < len(p) and p[i].split('/')[0] == key:
                    i += 1
                end = p[i-1].split('/')
                fw2.write('confi ' + start[0] + '/' + start[1] + '-' + end[1] + '\n mode\n')
What I am looking for:
abc1.txt should have:
int Eth1/1
mode
int Eth1/5
mode
int Eth2/1
mode
int Eth2/4
mode
abc2.txt should have:
int Eth101/1/1-3
mode
int Eth102/1/1-3
mode
int Eth103/1/1-4
mode
int Eth104/1/1-4
mode
So any Eth with 1 digit before the first "/" (e.g. Eth1/1 or Eth2/2) should go into one file, abc1.txt.
Any Eth with 3 digits before the first "/" (e.g. Eth101/1/1 or Eth102/1/1) should go into another file, abc2.txt. As these come in ranges, they need to be written like Eth101/1/1-3, Eth102/1/1-3, etc.
Any idea?
I don't think you need a regex here at all. All your items begin with 'Eth' followed by one or more digits. So you can check the length of the part of each item before the first / and then write the item to the matching file.
p = ['Eth1/1', 'Eth1/5','Eth2/1', 'Eth2/4','Eth101/1/1', 'Eth101/1/2', 'Eth101/1/3','Eth102/1/1', 'Eth102/1/2', 'Eth102/1/3','Eth103/1/1', 'Eth103/1/2', 'Eth103/1/3','Eth103/1/4','Eth104/1/1', 'Eth104/1/2', 'Eth104/1/3','Eth104/1/4']
with open("abc1.txt", "w+") as fw1, open("abc2.txt", "w+") as fw2:
    for i in p:
        if len(i.partition("/")[0]) == 4:
            fw1.write('int ' + i + '\n mode\n')
        else:
            fw2.write('int ' + i + '\n mode\n')
I refactored your code a little to use a with statement, which closes the files correctly at the end. It is also unnecessary to iterate over the sequence twice, so everything is done in a single pass.
If the data is not as clean as in your example, you may want to use regexes. Independent of the regex itself, by writing if re.match(r'((Eth\d{1}\/\d{1,2})', "p" ) you test whether the regex matches the literal string "p", not the value of the variable p. This is because you put quotation marks around p.
So this should work for your example. If you really need a regex, the problem is reduced to finding a regex that matches your needs.
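To make the "p" versus p point concrete, here is a small sketch. The pattern is a hypothetical cleaned-up stand-in, not the one from your code: 'Eth', one digit, a slash, then one or two digits.

```python
import re

# Hypothetical pattern for illustration only.
PATTERN = re.compile(r'Eth\d/\d{1,2}$')

p = ['Eth1/1', 'Eth101/1/1']

# Quoted: this tests the one-character string "p", which never matches.
literal_match = PATTERN.match("p")

# Unquoted: loop over the list and test each item.
item_matches = [bool(PATTERN.match(item)) for item in p]
```

Here literal_match is None, while item_matches tells you for each item whether it has the short one-digit form.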
"As these are in ranges , need to write it like Eth101/1/1-3, Eth102/1/1-3 etc"
You can achieve this by first computing the string and then writing it to the file. But that is more of a separate question.
UPDATE
It's not that trivial to compute the right network ranges. Here is one approach that leaves my code unchanged and only adds some functionality. The trick is to find groups of networks whose last numbers are consecutive, with no gaps. For that I've copied consecutive_groups; you can of course also run pip install more-itertools to get that function. I also transformed the list into a dict to do the grouping and then turned the dict back into a list. There are definitely better ways of doing it, but this worked for your input data, at least.
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from itertools import groupby
from operator import itemgetter
p = ['Eth1/1', 'Eth1/5', 'Eth2/1', 'Eth2/4', 'Eth101/1/1', 'Eth101/1/2',
     'Eth101/1/3', 'Eth102/1/1', 'Eth102/1/2', 'Eth102/1/3', 'Eth103/1/1',
     'Eth103/1/2', 'Eth103/1/3', 'Eth103/1/4', 'Eth104/1/1', 'Eth104/1/2',
     'Eth104/1/3', 'Eth104/1/4']
def get_network_ranges(networks):
    network_ranges = {}
    result = []
    for network in networks:
        parts = network.rpartition("/")
        network_ranges.setdefault(parts[0], []).append(int(parts[2]))
    for network, ranges in network_ranges.items():
        ranges.sort()
        for group in consecutive_groups(ranges):
            group = list(group)
            if len(group) == 1:
                result.append(network + "/" + str(group[0]))
            else:
                result.append(network + "/" + str(group[0]) + "-" +
                              str(group[-1]))
    result.sort()  # to get ordered results
    return result
def consecutive_groups(iterable, ordering=lambda x: x):
    """taken from more-itertools (latest)"""
    for k, g in groupby(
        enumerate(iterable), key=lambda x: x[0] - ordering(x[1])
    ):
        yield map(itemgetter(1), g)
# only one line added to do the magic
with open("abc1.txt", "w+") as fw1, open("abc2.txt", "w+") as fw2:
    p = get_network_ranges(p)
    for i in p:
        if len(i.partition("/")[0]) == 4:
            fw1.write('int ' + i + '\n mode\n')
        else:
            fw2.write('int ' + i + '\n mode\n')

Can Python add the " character to a string

I have to paste 3000 URLs a day that are unformatted.
Can I set up code to convert the raw pasted data to a string?
(Example raw data) - 13 Michael Way Cottees NSW 2017
(Example changed data) - "13 Michael Way Cottees NSW 2017"
I have tried
RAW_URL = 13 Michael Way Cottees NSW 2017 + " "
RAW_URL = str(13 HOADLEY ST MAWSON ACT 2607)
RAW_DATA = ' " ' + (13 HOADLEY ST MAWSON ACT 2607) + ' " '
I keep getting an "invalid syntax" error and am not having much luck with Google.
Once it's done, it will be folded into the code below, replacing the single input in PASTED_CRM_DATA with a list.
import requests
import csv
from lxml import html
import time
import sys
text2search = '''RECENTLY SOLD'''
PASTED_CRM_DATA = "13 HOADLEY ST MAWSON ACT 2607"
URL_LIST = 'https://www.realestate.com.au/property/' + str(PASTED_CRM_DATA.replace(' ', '-').lower()),
with open('REA.csv', 'wb') as csv_file:
    writer = csv.writer(csv_file)
    for index, url in enumerate(URL_LIST):
        page = requests.get(url)
        print '\r' 'Scraping URL ' + str(index+1) + ' of ' + str(len(URL_LIST))+ ' ' + url,
        if text2search in page.text:
            tree = html.fromstring(page.content)
            (title,) = (x.text_content() for x in tree.xpath('//title'))
            (price,) = (x.text_content() for x in tree.xpath('//div[@class="property-value__price"]'))
            (sold,) = (x.text_content().strip() for x in tree.xpath('//p[@class="property-value__agent"]'))
            writer.writerow([title, price, sold])
Any input is appreciated
First of all, you should understand what strings are in Python.
In the examples that you have tried:
RAW_URL = 13 Michael Way Cottees NSW 2017 + " "
RAW_URL = str(13 HOADLEY ST MAWSON ACT 2607)
RAW_DATA = ' " ' + (13 HOADLEY ST MAWSON ACT 2607) + ' " '
Here the characters you want to use as a string are interpreted as actual code. To make your intention clear to the interpreter, put single quotes ' (or double quotes) around them.
RAW_URL = '13 Michael Way Cottees NSW 2017'
RAW_DATA = '13 HOADLEY ST MAWSON ACT 2607'
To add the quote characters themselves, use string concatenation:
RAW_URL = '"' + '13 Michael Way Cottees NSW 2017' + '"'
Though I'm not sure what you mean by raw paste data. Where is the data copied from? Is it pasted by hand or read by the program?
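If the data is pasted by hand, one possible approach (a sketch, assuming one address per line; RAW_PASTE, BASE_URL, quoted and url_list are names made up for this example) is to paste everything into a triple-quoted string and let Python split it into a list:

```python
# Paste the raw lines between the triple quotes (assumed: one address per line).
RAW_PASTE = """13 Michael Way Cottees NSW 2017
13 HOADLEY ST MAWSON ACT 2607
"""

BASE_URL = 'https://www.realestate.com.au/property/'

# One clean string per non-empty line.
addresses = [line.strip() for line in RAW_PASTE.splitlines() if line.strip()]

# The same strings with literal double quotes added around them.
quoted = ['"' + address + '"' for address in addresses]

# A list of URLs built the same way as URL_LIST in the question.
url_list = [BASE_URL + address.replace(' ', '-').lower() for address in addresses]
```

url_list could then take the place of the one-element URL_LIST in the question's code.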

How to get a list of strings to print out vertically in a text file?

I have some data that I've pulled from a website. This is the code I used to grab it (my actual code is much longer but I think this about sums it up).
lid_restrict_save = []
for t in range(10000,10020):
    address = 'http://www.tspc.oregon.gov/lookup_application/' + lines2[t]
    page = requests.get(address)
    tree = html.fromstring(page.text)
    #District Restriction
    dist_restrict = tree.xpath('//tr[11]//text()')
    if u"District Restriction" in dist_restrict:
        lid_restrict_save.append(id2)
I'm trying to export this list:
print lid_restrict_save
[['5656966VP65', '5656966RR68', '56569659965', '56569658964']]
to a text file.
f = open('dis_restrict_no_uniqDOB2.txt', 'r+')
for j in range(0,len(lid_restrict_save)):
    s = ( (unicode(lid_restrict_save[j]).encode('utf-8') + ' \n' ))
    f.write(s)
f.close()
I want the text to come out looking like this:
5656966VP65
5656966RR68
56569659965
56569658964
This code worked but only when I started the range from 0.
f = open('dis_restrict.txt', 'r+')
for j in range(0,len(ldob_restrict)):
    f.write( ldob_restrict[j].encode("utf-8") + ' \n' )
f.close()
When I've tried changing the code I keep getting this error:
"AttributeError: 'list' object has no attribute 'encode'."
I've tried the suggestions from here, here, and here but to no avail.
If anyone has any hints it would be greatly appreciated.
lid_restrict_save is a nested list, so you can't encode its first element: that element is itself a list, not a string.
You could write to the txt file using this:
lid_restrict_save = [['5656966VP65', '5656966RR68', '56569659965', '56569658964']]
lid_restrict_save = lid_restrict_save[0] # remove the outer list
with open('dis_restrict.txt', 'r+') as f:
    for i in lid_restrict_save:
        f.write(str(i) + '\n')
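If the outer list can hold more than one inner list, taking [0] would drop the rest. A possible variant (a sketch only; the temporary output path is an assumption for the example) flattens one level of nesting first and opens the file with 'w', which creates it if it does not exist, whereas 'r+' needs an existing file:

```python
import os
import tempfile
from itertools import chain

lid_restrict_save = [['5656966VP65', '5656966RR68', '56569659965', '56569658964']]

# Flatten one level of nesting so every item is a plain string.
flat_ids = list(chain.from_iterable(lid_restrict_save))

# Hypothetical output location, chosen so the example runs anywhere.
out_path = os.path.join(tempfile.mkdtemp(), 'dis_restrict.txt')
with open(out_path, 'w') as f:
    f.write('\n'.join(flat_ids) + '\n')

# Read the file back to check that there is one ID per line.
with open(out_path) as f:
    written = f.read()
```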

regex for detecting subtitle errors

I'm having some issues with subtitles and need a way to detect specific errors. I think regular expressions would help, but I need help figuring this one out. In this example of an SRT-formatted subtitle, line #13 ends at 00:01:10,130 and line #14 begins at 00:01:10,129.
13
00:01:05,549 --> 00:01:10,130
some text here.
14
00:01:10,129 --> 00:01:14,109
some other text here.
The problem is that the next line can't begin before the current one is over; the embedding algorithm doesn't work when that happens. I need to check my SRT files and correct this, but looking for such errors by hand in about 20 videos, each an hour long, just isn't an option. Especially since I need it 'yesterday' (:
Format for SRT subtitles is very specific:
XX
START --> END
TEXT
EMPTY LINE
[line number (digits)][new line character]
[start and end times in 00:00:00,000 format, separated by _space__minusSign__minusSign__greaterThanSign__space_][new line character]
[text - can be any character - letter, digit, punctuation sign.. pretty much anything][new line character]
[new line character]
I need to check whether the END time is greater than the START time of the following subtitle. Help would be appreciated.
PS. I can work with Notepad++, Eclipse (Aptana), python or javascript...
Regular expressions can help with what you want, but they can't do it on their own. Regular expressions match patterns; they do not compare numerical ranges.
If I were you, I would do the following:
1. Parse the file and place the start and end times in one data structure (call it DS_A) and the text in another (call it DS_B).
2. Sort DS_A in ascending order. This should guarantee that you will not have overlapping ranges. (This previous SO post should point you in the right direction).
3. Iterate over both structures and write the following to your file: DS_A[i] --> DS_A[i + 1] <newline> DS_B[j], where i is a loop counter for DS_A and j is a loop counter for DS_B.
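If detection alone is enough, the check the question asks for can be sketched as below. This is not the approach from the steps above but a simpler detection-only variant (Python 3). It uses a regex only to pull the numbers out of the time line and does the comparison in plain arithmetic; find_overlaps, to_ms and SAMPLE are names made up for the example.

```python
import re

# Matches "HH:MM:SS,mmm --> HH:MM:SS,mmm" and captures the eight numbers.
TIME_LINE = re.compile(
    r'(\d{2}):(\d{2}):(\d{2}),(\d{3}) --> (\d{2}):(\d{2}):(\d{2}),(\d{3})'
)


def to_ms(hours, minutes, seconds, millis):
    """Convert the four captured fields to milliseconds."""
    return ((int(hours) * 60 + int(minutes)) * 60 + int(seconds)) * 1000 + int(millis)


def find_overlaps(srt_text):
    """Return (number, start_ms, previous_end_ms) for each subtitle that
    starts before the previous subtitle has ended."""
    overlaps = []
    previous_end = 0
    blocks = re.split(r'\n\s*\n', srt_text.replace('\r\n', '\n').strip())
    for block in blocks:
        lines = block.split('\n')
        if len(lines) < 2:
            continue
        match = TIME_LINE.match(lines[1].strip())
        if not match:
            continue
        fields = match.groups()
        start, end = to_ms(*fields[:4]), to_ms(*fields[4:])
        if start < previous_end:
            overlaps.append((lines[0].strip(), start, previous_end))
        previous_end = end
    return overlaps


SAMPLE = """13
00:01:05,549 --> 00:01:10,130
some text here.

14
00:01:10,129 --> 00:01:14,109
some other text here.
"""

problems = find_overlaps(SAMPLE)
```

For the sample from the question, problems contains a single entry for subtitle 14, which starts 1 ms before subtitle 13 ends.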
I ended up writing a short script to fix this. Here it is:
# -*- coding: utf-8 -*-
from datetime import datetime
import getopt, re, sys
count = 0  # number of overlapping subtitles found so far
def fix_srt(inputfile):
    global count
    parsed_file, errors_file = '', ''
    try:
        with open( inputfile , 'r') as f:
            srt_file = f.read()
            parsed_file, errors_file = parse_srt(srt_file)
    except:
        pass
    finally:
        outputfile1 = ''.join( inputfile.split('.')[:-1] ) + '_fixed.srt'
        outputfile2 = ''.join( inputfile.split('.')[:-1] ) + '_error.srt'
        with open( outputfile1 , 'w') as f:
            f.write(parsed_file)
        with open( outputfile2 , 'w') as f:
            f.write(errors_file)
        print 'Detected %s errors in "%s". Fixed file saved as "%s" (Errors only as "%s").' % ( count, inputfile, outputfile1, outputfile2 )
previous_end_time = datetime.strptime("00:00:00,000", "%H:%M:%S,%f")
def parse_times(times):
    global previous_end_time
    global count
    _error = False
    _times = []
    for time_code in times:
        t = datetime.strptime(time_code, "%H:%M:%S,%f")
        _times.append(t)
    # a subtitle must not start before the previous one has ended
    if _times[0] < previous_end_time:
        _times[0] = previous_end_time
        count += 1
        _error = True
    previous_end_time = _times[1]
    # %f prints microseconds; cut the string back to milliseconds
    _times[0] = _times[0].strftime("%H:%M:%S,%f")[:12]
    _times[1] = _times[1].strftime("%H:%M:%S,%f")[:12]
    return _times, _error
def parse_srt(srt_file):
    parsed_srt = []
    parsed_err = []
    for srt_group in re.sub('\r\n', '\n', srt_file).split('\n\n'):
        lines = srt_group.split('\n')
        if len(lines) >= 3:
            times = lines[1].split(' --> ')
            correct_times, error = parse_times(times)
            if error:
                clean_text = map( lambda x: x.strip(' '), lines[2:] )
                srt_group = lines[0].strip(' ') + '\n' + ' --> '.join( correct_times ) + '\n' + '\n'.join( clean_text )
                parsed_err.append( srt_group )
        parsed_srt.append( srt_group )
    return '\r\n'.join( parsed_srt ), '\r\n'.join( parsed_err )
def main(argv):
    inputfile = None
    try:
        options, arguments = getopt.getopt(argv, "hi:", ["input="])
    except getopt.GetoptError:
        print 'Usage: test.py -i <input file>'
        sys.exit(2)  # options is undefined here, so stop instead of going on
    for o, a in options:
        if o == '-h':
            print 'Usage: test.py -i <input file>'
            sys.exit()
        elif o in ['-i', '--input']:
            inputfile = a
    fix_srt(inputfile)
if __name__ == '__main__':
    main( sys.argv[1:] )
If someone needs it, save the code as srtfix.py, for example, and use it from the command line:
python srtfix.py -i "my srt subtitle.srt"
I was lazy and used the datetime module to process timecodes, so I'm not sure the script will work for subtitles longer than 24 h (: I'm also not sure when milliseconds were added to Python's datetime module. I'm using version 2.7.5; the script may not work on earlier versions because of this...

Selecting values from list in dictionary python

I've been working on a small contact importer, and now I'm trying to implement a block that automatically selects the output file format based on the number of contacts to be imported.
However, every time it results in the error:
KeyError: 'q'
I can't figure out for the life of me why this is happening, and I would love any help offered.
My idea of scalability is that the dictionary personDict would be of the format personDict = {nameid:[name,email]}, but nothing works.
Any help is good help,
Thanks
def autoFormat():
    while True:
        name = input("Enter the person's name \n")
        if name == "q":
            break
        email = input("Enter the person's email \n")
        personDict[name] = [name, email]
    if len(personDict) <= 10:
        keyValue = personDict[name]
        for keyValue in personDict:
            for key, value in personDict.iteritems():
                combined = "BEGIN:VCARD\nVERSION:4.0\n" + "FN:" + name + "\n" + "EMAIL:" + email + "\n" + "END:VCARD"
                fileName = name + ".vcl"
                people = open(fileName, 'a')
                people.write(combined)
                people.close()
                print("Created file for " + name)
autoFormat()
The main problem is that when the user types "q", your code leaves the while loop with name still holding "q" as its value. There is no element with key "q" in your dictionary, so you should remove this useless line:
keyValue = personDict[name]
Also, in the export part, the values you write to the file are not the ones you loop over.
Your code becomes:
if len(personDict) <= 10:
    for name, email in personDict.values():
        combined = "BEGIN:VCARD\nVERSION:4.0\n" + "FN:" + name + "\n" + "EMAIL:" + email + "\n" + "END:VCARD"
        fileName = name + ".vcl"
        people = open(fileName, 'a')
        people.write(combined)
        people.close()
        print("Created file for " + name)
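For reference, the export part can also be pulled out into a small function that is easy to test on its own. This is only a sketch (Python 3): write_vcards, person_dict and the temporary output directory are names and choices made up for the example, and it uses a with statement so each file is closed even if a write fails.

```python
import os
import tempfile


def write_vcards(person_dict, out_dir):
    """Write one vCard file per contact and return the created paths."""
    paths = []
    # Each value is a [name, email] list, as in the question.
    for name, email in person_dict.values():
        card = ("BEGIN:VCARD\nVERSION:4.0\n"
                "FN:" + name + "\n"
                "EMAIL:" + email + "\n"
                "END:VCARD\n")
        path = os.path.join(out_dir, name + ".vcl")
        with open(path, "w") as handle:
            handle.write(card)
        paths.append(path)
    return paths


# Example data; the address is a placeholder.
person_dict = {"Ada": ["Ada", "ada@example.com"]}
out_dir = tempfile.mkdtemp()
created = write_vcards(person_dict, out_dir)
```

Because the loop unpacks name and email from the dictionary values, it never touches the "q" that ended the input loop.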