I'm trying to truncate several element.text values in an XML file. I managed to build two lists: the first one (long_name) holds the element.text strings that are too long, and the second one (short_name) holds the same strings after truncation.
Now I want to replace the element.text values in my XML. I tried a few scripts but ended up falling back on readlines(); I would like to find an lxml solution equivalent to this code:
txt = open('IF_Generic.arxml','r')
Lines = txt.readlines()
txt.close()
txt = open('IF_Genericnew.arxml','w')
for e in range(len(long_name)) :
    for i in range(len(Lines)) :
        if (long_name[e] in Lines[i]) == True :
            Lines[i] = Lines[i].replace(long_name[e],short_name[e])
for i in Lines :
    txt.write(i)
txt.close()
I tried this, but it doesn't work :
f = open('IF_Generic.arxml')
arxml = f.read()
f.close()
tree = etree.parse(StringIO(arxml))
for e,b in enumerate(long_name) :
    context = etree.iterparse(StringIO(arxml))
    for a,i in context:
        if not i.text:
            pass
        else:
            if (b in i.text) == True :
                i.text = short_name[e]
obj_arxml = etree.tostring(tree,pretty_print=True)
f = open('IF_Genericnew.arxml','w')
f.write(obj_arxml)
f.close()
Let's say the first element of the list long_name is RoutineServices_EngMGslLim_NVMID03
<BALISE_A>
    <BALISE_B>
        <SHORT-NAME>RoutineServices_EngMGslLim_NVMID03</SHORT-NAME>
    </BALISE_B>
</BALISE_A>
<BALISE_C>
    <POSSIBLE-ERROR-REF DEST="APPLICATION-ERROR">/Interfaces/RoutineServices_EngMGslLim_NVMID03/E_NOT_OK</POSSIBLE-ERROR-REF>
    <SHORT-NAME>Blah_Bleh_Bluh</SHORT-NAME>
</BALISE_C>
The first element of the list short_name is RoutineServices_EngMGslLim_NV
<BALISE_A>
    <BALISE_B>
        <SHORT-NAME>RoutineServices_EngMGslLim_NV</SHORT-NAME>
    </BALISE_B>
</BALISE_A>
<BALISE_C>
    <POSSIBLE-ERROR-REF DEST="APPLICATION-ERROR">/Interfaces/RoutineServices_EngMGslLim_NV/E_NOT_OK</POSSIBLE-ERROR-REF>
    <SHORT-NAME>Blah_Bleh_Bluh</SHORT-NAME>
</BALISE_C>
The second listing is the output I want.
P.S.: I use Python 2.7.9.
Thanks in advance, everyone!
Don't open XML files like text files. I have explained in this answer why this is a bad idea.
Simply let etree read and write the file. It's also less code to write.
from lxml import etree

# read the file and load it into a DOM tree
tree = etree.parse('IF_Generic.arxml')

for elem in tree.iterfind("//*"):
    # find elements that contain only text
    if len(elem) == 0 and elem.text and elem.text.strip():
        # do your replacements ...
        elem.text = "new text"

# serialize the DOM tree and write it to file
tree.write('IF_Genericnew.arxml', pretty_print=True)
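Instead of going over all elements, which is what "//*" does, you can use a more specific XPath expression to narrow down the elements you want to work on.
For example, something like "//SHORT-NAME | //POSSIBLE-ERROR-REF" would help to reduce the overall workload. Note that a union expression with "|" has to go through tree.xpath(), because iterfind() only understands the simpler ElementPath syntax.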
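To tie this back to the question, here is a minimal sketch of the replacement step itself, assuming long_name and short_name are parallel lists as described in the question. It uses the standard library's xml.etree.ElementTree so that it runs without extra packages; lxml.etree exposes the same iter() and .text API, so the function should carry over unchanged. The function name shorten_names and the ROOT wrapper around the sample are illustrative, not part of the original file.

```python
import xml.etree.ElementTree as ET  # lxml.etree offers the same iter()/.text API


def shorten_names(root, long_name, short_name):
    """Replace each long name found in leaf-element text; return how many elements changed."""
    changed = 0
    for elem in root.iter():
        # only look at leaf elements that carry real text
        if len(elem) == 0 and elem.text and elem.text.strip():
            new_text = elem.text
            for old, new in zip(long_name, short_name):
                new_text = new_text.replace(old, new)
            if new_text != elem.text:
                elem.text = new_text
                changed += 1
    return changed


# hypothetical sample, wrapped in a ROOT element so that it is well-formed
SAMPLE = """<ROOT>
<BALISE_A><BALISE_B>
<SHORT-NAME>RoutineServices_EngMGslLim_NVMID03</SHORT-NAME>
</BALISE_B></BALISE_A>
<BALISE_C>
<POSSIBLE-ERROR-REF DEST="APPLICATION-ERROR">/Interfaces/RoutineServices_EngMGslLim_NVMID03/E_NOT_OK</POSSIBLE-ERROR-REF>
<SHORT-NAME>Blah_Bleh_Bluh</SHORT-NAME>
</BALISE_C>
</ROOT>"""

root = ET.fromstring(SAMPLE)
changed = shorten_names(
    root,
    ["RoutineServices_EngMGslLim_NVMID03"],
    ["RoutineServices_EngMGslLim_NV"],
)
```

With a real file you would get the root from the parsed tree and call tree.write() afterwards, as in the snippet above. Because this uses plain substring replacement, the order of the lists matters if one long name happens to be a prefix of another.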
Related
I'm currently using a not-very-Scala-like approach to parse large Unix mailbox files. I'm still learning the language and would like to challenge myself to find a better way, however, I do not believe I have a solid grasp on just what can be done with an Iterator and how to effectively use it.
I'm currently using org.apache.james.mime4j, and I use the org.apache.james.mime4j.mboxiterator.MboxIterator to get a java.util.Iterator from a file, as so:
// registers an implementation of a ContentHandler that
// allows me to construct an object representing an email
// using callbacks
val handler: ContentHandler = new MyHandler();
// creates a parser that parses a SINGLE email from a given InputStream
val parser: MimeStreamParser = new MimeStreamParser(configBuilder.build());
// register my handler
parser.setContentHandler(handler);
// Get a java.util.Iterator
val iterator = MboxIterator.fromFile(fileName).build();
// For each email, process it using above Handler
iterator.forEach(p => parser.parse(p.asInputStream(Charsets.UTF_8)))
From my understanding, the Scala Iterator is much more robust, and probably a lot more capable of handling something like this, especially because I won't always be able to fit the full file in memory.
I need to construct my own version of the MboxIterator. I dug through the source for MboxIterator and was able to find a good RegEx pattern to use to determine the beginning of individual email messages with, however, I'm drawing a blank from now on.
I created the RegEx like so:
val MESSAGE_START = Pattern.compile(FromLinePatterns.DEFAULT, Pattern.MULTILINE);
What I want to do (based on what I know so far):
Build a FileInputStream from an MBOX file.
Use Iterator.continually(stream.read()) to read through the stream
Use .takeWhile() to continue to read until the end of the stream
Chunk the Stream using something like MESSAGE_START.matcher(someString).find(), or use it to find the indexes that separate the messages
Read the chunks created, or read the bits in between the indexes created
I feel like I should be able to use map(), find(), filter() and collect() to accomplish this, but I'm getting thrown off by the fact that they only give me Ints to work with.
How would I accomplish this?
EDIT:
After doing some more thinking on the subject, I thought of another way to describe what I think I need to do:
I need to keep reading from the stream until I get a string that matches my RegEx
Maybe group the previously read bytes?
Send it off to be processed somewhere
Remove it from the scope somehow so it doesn't get grouped the next time I run into a match
Continue to read the stream until I find the next match.
Profit???
EDIT 2:
I think I'm getting closer. Using a method like this gets me an iterator of iterators. However, there are two issues: 1. Is this a waste of memory? Does this mean everything gets read into memory? 2. I still need to figure out a way to split by the match, but still include it in the iterator returned.
def split[T](iter: Iterator[T])(breakOn: T => Boolean):
    Iterator[Iterator[T]] =
  new Iterator[Iterator[T]] {
    def hasNext = iter.hasNext

    def next = {
      val cur = iter.takeWhile(!breakOn(_))
      iter.dropWhile(breakOn)
      cur
    }
  }.withFilter(l => l.nonEmpty)
If I understand correctly, you want to lazily chunk a large file delimited by a regex recognizable pattern.
You could try to return an Iterator for each request but the correct iterator management would not be trivial.
I'd be inclined to hide all file and iterator management from the client.
class MBox(filePath :String) {
  private val file   = io.Source.fromFile(filePath)
  private val itr    = file.getLines().buffered
  private val header = "From .+ \\d{4}".r  //adjust to taste

  def next() :Option[String] =
    if (itr.hasNext) {
      val sb = new StringBuilder()
      sb.append(itr.next() + "\n")
      while (itr.hasNext && !header.matches(itr.head))
        sb.append(itr.next() + "\n")
      Some(sb.mkString)
    } else {
      file.close()
      None
    }
}
testing:
val mbox = new MBox("so.txt")
mbox.next()
//res0: Option[String] =
//Some(From MAILER-DAEMON Fri Jul 8 12:08:34 2011
//some text AAA
//some text BBB
//)
mbox.next()
//res1: Option[String] =
//Some(From MAILER-DAEMON Mon Jun 8 12:18:34 2012
//small text
//)
mbox.next()
//res2: Option[String] =
//Some(From MAILER-DAEMON Tue Jan 8 11:18:14 2013
//some text CCC
//some text DDD
//)
mbox.next() //res3: Option[String] = None
There is only one Iterator per open file and only the safe methods are invoked on it. The file text is realized (loaded) only on request and the client gets just what's requested, if available. Instead of all lines in one long String you could return each line as part of a collection, Seq[String], if that's more applicable.
UPDATE: This can be modified for easy iteration.
class MBox(filePath :String) extends Iterator[String] {
  private val file   = io.Source.fromFile(filePath)
  private val itr    = file.getLines().buffered
  private val header = "From .+ \\d{4}".r  //adjust to taste

  def next() :String = {
    val sb = new StringBuilder()
    sb.append(itr.next() + "\n")
    while (itr.hasNext && !header.matches(itr.head))
      sb.append(itr.next() + "\n")
    sb.mkString
  }

  def hasNext: Boolean =
    if (itr.hasNext) true else {file.close(); false}
}
Now you can .foreach(), .map(), .flatMap(), etc. But you can also do dangerous things like .toList which will load the entire file.
I have a script which pulls XML hosted online and saves it locally. The script then goes through the local file and replaces/adds certain text. However, for some reason, when I use the "&" symbol, there is an extra space inserted along with it within the element text. Here is a sample of the XML elements I am parsing:
<TrackingEvents>
<Tracking event="rewind">
http://www.example.com/rewind_1.png?test=rewind_test
</Tracking>
<Tracking event="pause">
http://www.example.com/pause_1.png?test=rewind_test
</Tracking>
However, after running my script to add the additional text to my elements, the text is added with an extra space, like this:
<TrackingEvents>
<Tracking event="rewind">
http://www.example.com/rewind_1.png?test=rewind_test &cb={CACHEBUSTER}
</Tracking>
<Tracking event="pause">
http://www.example.com/pause_1.png?test=rewind_test &cb={CACHEBUSTER}
</Tracking>
I have tried everything, but I don't know why this is occurring or what I can do to prevent this space from being added. I have even tried to strip the whitespace as well. When I look at the XML that is saved locally before uploading it, everything looks fine (&amp; stands for the "&" symbol), as seen here in the source:
<Tracking event="rewind">
http://www.example.com/rewind_1.png?test=rewind_test
&amp;cb={CACHEBUSTER}</Tracking>
<Tracking event="pause">
http://www.example.com/pause_1.png?test=rewind_test
&amp;cb={CACHEBUSTER}</Tracking>
Here is what the code from my script looks like:
for URL, xml_name, original_server in tqdm(XML_tags):
    response = requests.get(URL)
    with open(xml_name, 'wb') as file:
        file.write(response.content)
    with open(xml_name) as saved_file:
        tree = ET.parse(saved_file)
        root = tree.getroot()
        for element in root.iter(tag=ET.Element):
            if element.text != None:
                if ".png" in element.text:
                    if "?" in element.text:
                        element.text = element.text + "&cb={CACHEBUSTER}"
                        element.text = element.text.strip()
                    else:
                        element.text = element.text + "?cb={CACHEBUSTER}"
                        element.text = element.text.strip()
                else:
                    pass
        server = "example.server: ../sample/sample/" + original_server
        tree.write(xml_name, xml_declaration=True, method='xml',
                   encoding='utf8')
        server_upload = subprocess.Popen(["scp", xml_name, server])
        upload_wait = os.waitpid(server_upload.pid, 0)
I can definitely use some help with this. Thanks.
Update: Actually, it appears that this has nothing to do with using the "&". Here is a sample when I just add different text:
<TrackingEvents>
<Tracking event="rewind">
http://www.example.com/rewind_1.png?test=rewind_test test123
</Tracking>
<Tracking event="pause">
http://www.example.com/pause_1.png?test=rewind_test test123
</Tracking>
</TrackingEvents>
The whitespace was already in the original XML before you added anything to element.text: it is the newline between the last character of the .text and the closing tag. So you should strip the whitespace before appending the new text, not after appending it as your code above does:
....
if "?" in element.text:
    element.text = element.text.strip() + "&cb={CACHEBUSTER}"
else:
    element.text = element.text.strip() + "?cb={CACHEBUSTER}"
....
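The effect is easy to reproduce in isolation. The following is a small sketch, assuming the standard library's xml.etree.ElementTree and a shortened inline sample (the second URL deliberately has no query string so that both branches are exercised); the helper name add_cachebuster is made up for illustration.

```python
import xml.etree.ElementTree as ET

# shortened, hypothetical sample: the element text ends with a newline before </Tracking>
SAMPLE = """<TrackingEvents>
<Tracking event="rewind">
http://www.example.com/rewind_1.png?test=rewind_test
</Tracking>
<Tracking event="pause">
http://www.example.com/pause_1.png
</Tracking>
</TrackingEvents>"""


def add_cachebuster(root):
    """Append the cache-buster parameter to every .png URL, stripping whitespace first."""
    for element in root.iter():
        text = (element.text or "").strip()  # strip BEFORE appending
        if ".png" in text:
            separator = "&" if "?" in text else "?"
            element.text = text + separator + "cb={CACHEBUSTER}"


root = ET.fromstring(SAMPLE)
add_cachebuster(root)
urls = [tracking.text for tracking in root.iter("Tracking")]
```

Stripping first removes the trailing newline, so the appended parameter sits directly after the URL instead of on the next line.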
I'm trying to parse information from an XML file using ElementTree in Python 2.7.
The XML file has the following structure:
<EXPERIMENT_PACKAGE_SET>
    <EXPERIMENT_PACKAGE>
        <SAMPLE>
            <IDENTIFIERS>
                <PRIMARY_ID>ERS1486582</PRIMARY_ID>
                <EXTERNAL_ID>SAMEA36350668</EXTERNAL_ID>
            </IDENTIFIERS>
            <SAMPLE_ATTRIBUTES>
                <SAMPLE_ATTRIBUTE>
                    <TAG>collection date</TAG>
                    <VALUE>2011</VALUE>
                </SAMPLE_ATTRIBUTE>
                <SAMPLE_ATTRIBUTE>
                    <TAG>geographic location</TAG>
                    <VALUE>USA</VALUE>
                </SAMPLE_ATTRIBUTE>
                ...
            </SAMPLE_ATTRIBUTES>
        </SAMPLE>
        ...
    </EXPERIMENT_PACKAGE>
</EXPERIMENT_PACKAGE_SET>
It contains many "Sample" modules that I want to loop over, parse some information from the "Identifiers" module and then loop again within their list of "Sample attributes" to store them in a file, including for each of them both their "Tag" and "Value". These data are not standardized (ie the attributes' tags don't always match) so I want to output all of them for each Sample. I wrote this piece of code that worked with other XML files with the same structure:
import xml.etree.ElementTree

# import the XML file
e = xml.etree.ElementTree.parse('file.xml').getroot()
# create output file
file = open('output.txt','w')
# loop over all "Sample" modules
for i in range(0,len(e.findall('EXPERIMENT_PACKAGE/SAMPLE'))):
    node = e.findall('EXPERIMENT_PACKAGE/SAMPLE')[i]
    accession = node.find('IDENTIFIERS/PRIMARY_ID').text
    sample = node.find('IDENTIFIERS/EXTERNAL_ID').text
    # loop over the list of attributes and print 'accession', 'sample', and for each attribute, 'tag' and 'value'
    for attribute in node.findall('SAMPLE_ATTRIBUTES/SAMPLE_ATTRIBUTE'):
        file.write(accession + '\t' + sample + '\t' + str(attribute.find('TAG').text) + '\t' +
                   str(attribute.find('VALUE').text) + '\n')
file.close()
This code worked in the past, and it should give an output such as:
> ERS1486582 SAMEA36350668 collection date 2011
> ERS1486582 SAMEA36350668 geographic location USA
> ...
But this particular XML file gives me an error:
AttributeError: 'NoneType' object has no attribute 'text'
I guess some element has None value? How could I try to check if the elements are not NoneType before writing out their values? I've tried adding an 'if' condition like:
if attribute.find('VALUE').text is not None:
within the second 'for' loop (before writing in the output file) but it's not working either. Thanks very much for your help and sorry if I didn't express myself clearly.
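One possible way to guard against this, sketched under the assumption that the error comes from find() returning None for a child element that is missing altogether: in that case attribute.find('VALUE') is itself None, so asking for its .text fails before any check on the text can run. The helper name text_or_default, the "NA" placeholder and the inline sample are illustrative choices, not something taken from the original script.

```python
import xml.etree.ElementTree as ET


def text_or_default(parent, path, default="NA"):
    """Return the text of parent.find(path), or default if the element or its text is missing."""
    child = parent.find(path)
    if child is None or child.text is None:
        return default
    return child.text.strip()


def sample_rows(root):
    """Build one tab-separated row per sample attribute."""
    rows = []
    for node in root.findall("EXPERIMENT_PACKAGE/SAMPLE"):
        accession = text_or_default(node, "IDENTIFIERS/PRIMARY_ID")
        sample = text_or_default(node, "IDENTIFIERS/EXTERNAL_ID")
        for attribute in node.findall("SAMPLE_ATTRIBUTES/SAMPLE_ATTRIBUTE"):
            tag = text_or_default(attribute, "TAG")
            value = text_or_default(attribute, "VALUE")
            rows.append("\t".join([accession, sample, tag, value]))
    return rows


# hypothetical sample: the second attribute has no VALUE element at all
DEMO = """<EXPERIMENT_PACKAGE_SET><EXPERIMENT_PACKAGE><SAMPLE>
<IDENTIFIERS><PRIMARY_ID>ERS1486582</PRIMARY_ID><EXTERNAL_ID>SAMEA36350668</EXTERNAL_ID></IDENTIFIERS>
<SAMPLE_ATTRIBUTES>
<SAMPLE_ATTRIBUTE><TAG>collection date</TAG><VALUE>2011</VALUE></SAMPLE_ATTRIBUTE>
<SAMPLE_ATTRIBUTE><TAG>geographic location</TAG></SAMPLE_ATTRIBUTE>
</SAMPLE_ATTRIBUTES>
</SAMPLE></EXPERIMENT_PACKAGE></EXPERIMENT_PACKAGE_SET>"""

rows = sample_rows(ET.fromstring(DEMO))
```

The same guard applies to PRIMARY_ID and EXTERNAL_ID, which can be absent as well; writing the rows to the output file works as in the original loop.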
I have a text file with these lines.
1.inputlist
D:\Dolby_Harmanious_kit\DRY_run_kits\Dolby_Digital_Plus_Decoder_Imp\Test_Materials\Test_Signals\ITAF_Tests\seamless_switch\acmod21_I0D0CRC.ec3 -#tD:\Dolby_Harmanious_kit\DRY_run_kits\Dolby\m1_m28_switch.cfg
2.inputlist
D:\Dolby_Harmanious_kit\DRY_run_kits\Dolby_Digital_Plus_Decoder_Imp\Test_Materials\Test_Signals\ITAF_Tests\seamless_switch\acmod_2_252_ddp.ec3 -#tD:\Dolby_Harmanious_kit\DRY_run_kits\Digital\m1_m7_switch_every_3frames.cfg
Here I need to remove the path names, such as
"D:\Dolby_Harmanious_kit\DRY_run_kits\Dolby_Digital_Plus_Decoder_Imp\Test_Materials\Test_Signals\ITAF_Tests\seamless_switch\"
and "D:\Dolby_Harmanious_kit\DRY_run_kits\Digital\ "
Note that every line has different path names. I have some example code that removes a path name.
Code:
import re
b = open ('Filter_Lines.txt','w')
with open('Lines.txt') as f:
    for trim in f:
        repl = (re.sub('D:.*\\\\','',trim).rstrip('\n'))
        b.write(repl + '\n')
b.close()
But this removes the whole text from "
D:\Dolby_Harmanious_kit\DRY_run_kits\Dolby_Digital_Plus_Decoder_Imp\Test_Materials\Test_Signals\ITAF_Tests\seamless_switch\acmod21_I0D0CRC.ec3 -#tD:\Dolby_Harmanious_kit\DRY_run_kits\Dolby\"
I need to remove only the path names, keeping "acmod21_I0D0CRC.ec3" in that line.
Can you please guide me on this?
Here is my answer, based on what I understood from your question.
You say the paths are not all the same, so I assume
a line might look like
a) D://a/b/c/file_name.cfg -#tD://a/b/c/d/e/file_name.cfg
Is my understanding correct?
There are two paths on a single line, but what every line has in common is the -#t between them,
so you can simply use the split method to separate the two paths (the code below splits on 'D:').
Here is what I did, based on my understanding of your post:
import re
b = open ('file_sample.txt','w')
with open ('file_sam.txt') as f:
    for line in f:
        # split the line at each drive prefix, then drop everything up to the last backslash of each piece
        a = [re.sub('.*\\\\','',part).rstrip('\n') for part in line.split('D:')]
        b.write(''.join(a) + '\n')
b.close()
Here are my inputs:
'D:\Dolby_Harmanious_kit\DRY_run_kits\Dolby_Digital_Plus_Decoder_Imp\Test_Materials\Test_Signals\ITAF_Tests\seamless_switch\acmod21_I0D0CRC.ec3 -#tD:\Dolby_Harmanious_kit\DRY_run_kits\Dolby\m1_m28_switch.cfg'
'D:\Dolby_Harmanious_kit\DRY_run_kits\Dolby_Digital_Plus_Decoder_Imp\Test_Materials\Test_Signals\ITAF_Tests\seamless_switch\acmod_2_252_ddp.ec3 -#tD:\Dolby_Harmanious_kit\DRY_run_kits\Digital\m1_m7_switch_every_3frames.cfg'
It gives me:
'acmod21_I0D0CRC.ec3 -#tm1_m28_switch.cfg'
'acmod_2_252_ddp.ec3 -#tm1_m7_switch_every_3frames.cfg'
Is this what you want?
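As an alternative to splitting, the same result can be sketched with a single regular expression that removes only the directory part of each path. This assumes that directory names contain no whitespace (true for the sample lines) and that every path starts with D:; the names PATH_PREFIX and strip_paths, and the shortened sample line, are made up for illustration.

```python
import re

# A drive prefix followed by any run of directory components, each ending in a backslash.
# Excluding whitespace keeps the match from running past the file name into the next path.
PATH_PREFIX = re.compile(r'D:\\(?:[^\\\s]+\\)*')


def strip_paths(line):
    """Remove the directory part of every path in the line, keeping the file names."""
    return PATH_PREFIX.sub('', line)


# shortened, hypothetical version of the sample line
line = r'D:\kit\Test_Signals\seamless_switch\acmod21_I0D0CRC.ec3 -#tD:\kit\Dolby\m1_m28_switch.cfg'
result = strip_paths(line)
```

The original attempt with 'D:.*\\\\' removed too much because .* is greedy and runs up to the last backslash on the line; restricting each component to non-backslash, non-whitespace characters stops the match at the file name.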
I didn't find a way to do this - only to edit the translations to the existing fields.
If there is no way to achieve this - how should this be done (somehow automatically, because right now I was manually adding
<message>
<source>x</source>
<translation>xx</translation>
</message>
blocks to my .ts file, and I assume that's not the correct way)?
No, that's not the correct way :) Use tr() in the code to mark strings for translation.
For example
label->setText( tr("Error") );
Then you run lupdate for your project to extract them to a .ts file. See here for more details.
Or do you need to translate strings that are not in the source code?
I just wrote a Python script to insert new entries
into the .ts file for a homegrown parser using ElementTree. It doesn't produce pretty output
when it adds them, but I believe it works just fine (so far):
from xml.etree import ElementTree as ET

tree = ET.parse(infile)
# visit every <context> element in the .ts file
for context in tree.iter("context"):
    name = context.find("name")
    # only add the message to the context whose <name> matches the target
    if name is not None and name.text == target:
        elem = ET.SubElement(context, "message")
        src = ET.SubElement(elem, "source")
        src.text = newtext
        trans = ET.SubElement(elem, "translation")
        trans.text = "THE_TRANSLATION"
tree.write(outfile)
Where infile is the .ts file, outfile may be the same as infile or different.
target is the context you are looking for to add a new message into,
and newtext is of course the new source text.