Multi-line find (grabbing) and replace text in an XML file using Perl regex

I am trying to find and replace text content in an XML file using a Perl regular expression.
Sample XML Code:
<root>
<add>
<st>xxxx</st>
<pin>xxx</pin>
</add>
</root>
Now I want to find / grep the text from <add> to </add>, shown here, and replace it with <xyz>xxx</xyz>:
<add>
<st>xxxx</st>
<pin>xxx</pin>
</add>
Note:
If the above content is on a single line, i.e. without line breaks between <add> and </add>, as in <add><st>xxxx</st><pin>xxx</pin></add>, I can use <add>(.*)<\/add> to find / grep it.
Thanking You
Thirusanguraja V

Using XML::XSH2, a wrapper around XML::LibXML:
open input.xml ;
$add = /root/add ;
delete $add/* ;
insert element xyz into $add ;
insert text 'xxx' into $add/xyz ;
save :b ;
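For comparison, the same edit can be sketched in Python with the standard library's xml.etree.ElementTree. This is only an illustration, not the XSH2 answer itself: the function name replace_add_children is hypothetical, and the sketch assumes the goal is to replace the children of <add> with a single <xyz>xxx</xyz>, as the XSH2 script does.

```python
import xml.etree.ElementTree as ET

# Sample document taken from the question.
SAMPLE = """<root>
<add>
<st>xxxx</st>
<pin>xxx</pin>
</add>
</root>"""


def replace_add_children(xml_text: str) -> str:
    """Replace everything inside each <add> with a single <xyz>xxx</xyz>.

    Hypothetical helper for illustration; line breaks inside <add> do not
    matter because the document is parsed, not pattern-matched.
    """
    root = ET.fromstring(xml_text)
    # findall returns a list, so modifying the tree while looping is safe.
    for add in root.findall(".//add"):
        for child in list(add):
            add.remove(child)
        add.text = None  # drop the whitespace that preceded the old children
        xyz = ET.SubElement(add, "xyz")
        xyz.text = "xxx"
    return ET.tostring(root, encoding="unicode")


result = replace_add_children(SAMPLE)
print(result)
```

Because the parser works on the element tree, the multi-line versus single-line distinction from the question disappears.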

Related

regex to exclude string and delete line

I have the following lines in an XML file
<User id="10338" directoryId="1" sometext txt text test/>
<User id="10359" directoryId="100" some more text text text/>
<User id="103599" directoryId="100" some more text text text/>
<User id="10438" directoryId="1" sometext txt text test/>
I am trying to remove all lines that start with <User id=" but keep the ones that have directoryId="1".
My current sed command is:
sed -i '' '/<User id="/d' file.xml
I have looked at A regular expression to exclude a word/string and a few other stack overflow posts but not able to get this to work. Please can someone help me write the regex. I essentially need to delete any lines that start with <User id= but excluding the ones where directoryId="1"
You can use
sed -i '' -e '/directoryId="1"/b' -e '/<User id="/d' file.xml
With this sed command,
/directoryId="1"/b skips the lines containing directoryId="1" and
/<User id="/d deletes the other lines that contain <User id=".
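The skip-then-delete logic of that sed command can be sketched in Python as follows. The function name drop_other_users is hypothetical, and the sample lines are the ones from the question; this is an illustration of the logic, not a replacement for the sed one-liner.

```python
# Sample lines from the question.
LINES = [
    '<User id="10338" directoryId="1" sometext txt text test/>',
    '<User id="10359" directoryId="100" some more text text text/>',
    '<User id="103599" directoryId="100" some more text text text/>',
    '<User id="10438" directoryId="1" sometext txt text test/>',
]


def drop_other_users(lines):
    """Keep every line except <User id="..."> lines lacking directoryId="1"."""
    kept = []
    for line in lines:
        if 'directoryId="1"' in line:
            # Same role as sed's /directoryId="1"/b: leave the line alone.
            kept.append(line)
        elif '<User id="' in line:
            # Same role as sed's /<User id="/d: drop the line.
            continue
        else:
            # Lines that are not User entries pass through unchanged.
            kept.append(line)
    return kept


kept = drop_other_users(LINES)
for line in kept:
    print(line)
```

Note that the closing quote in directoryId="1" is what keeps directoryId="100" from matching.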

Extract specific XMLs from log file

I have large log files (around 50 MB each), which contain Java debug information plus all kinds of XML responses.
Here's an example of something I'm trying to extract from the log
<envelope>
<response>
<ATTR name="uniqueid" value="XYZ_00000-00-00_12345_1"/>
<ATTR name="status" value="Activated"/>
<ATTR name="datecreated" value="2018/10/04 09:39:05"/>
</response>
</envelope>
I need only the XML envelopes in which the uniqueid attribute contains "12345" and the status attribute is set to "Activated".
Using sed I am able to extract all the envelopes, and I currently loop over them and use a regex to check whether each one meets the above conditions.
sed -n '/<envelope>/,/<\/envelope>/p' logfile
What would be a proper solution to extract what I need from the file?
Thanks!
Assuming your XML is formatted as shown, this should work:
$ awk '/<envelope>/ {line=$0; p=0; s=0; next}
line {line=line ORS $0}
/uniqueid/ && $3~/12345/ {p=1}
/status/ && $3~/Activated/ {s=1}
/<\/envelope>/ && p && s {print line}' file
At the opening tag, start accumulating lines; when the uniqueid line or the status line matches, set the corresponding flag; at the closing tag, print the record if both flags are set.
With gawk you can do this instead:
$ awk -F'\n' -v RS='</envelope>\n' \
'$3~/uniqueid.*12345/ && $4~/status.*Activated/{print $0, RT}' file
There will be an extra newline in the output, though.
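A Python sketch of the same filter, for comparison. It assumes the attribute order name="..." value="..." shown in the question and reads the whole log into memory; the function name matching_envelopes and the extra log lines in the sample are made up for illustration.

```python
import re

# Non-greedy match with DOTALL so one envelope can span several lines.
ENVELOPE = re.compile(r"<envelope>.*?</envelope>", re.DOTALL)
UNIQUEID = re.compile(r'<ATTR name="uniqueid" value="[^"]*12345[^"]*"')
STATUS = re.compile(r'<ATTR name="status" value="Activated"')

# Illustrative log: the first envelope is from the question, the rest is invented.
LOG = """2018-10-04 09:39:05 DEBUG something happened
<envelope>
<response>
<ATTR name="uniqueid" value="XYZ_00000-00-00_12345_1"/>
<ATTR name="status" value="Activated"/>
<ATTR name="datecreated" value="2018/10/04 09:39:05"/>
</response>
</envelope>
2018-10-04 09:39:06 DEBUG something else
<envelope>
<response>
<ATTR name="uniqueid" value="XYZ_00000-00-00_12345_2"/>
<ATTR name="status" value="Deactivated"/>
</response>
</envelope>
"""


def matching_envelopes(log_text):
    """Return envelopes whose uniqueid contains 12345 and whose status is Activated."""
    found = []
    for match in ENVELOPE.finditer(log_text):
        block = match.group(0)
        if UNIQUEID.search(block) and STATUS.search(block):
            found.append(block)
    return found


found = matching_envelopes(LOG)
for block in found:
    print(block)
```

Checking both conditions inside each extracted block avoids depending on which line of the envelope an attribute appears on.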

Regex to extract http links from an XML file

I have an xml file with many lines like:
<xhtml:link vip="true" href="http://store.vcenter.com/stores/en/product/tigers-midi/100" />
How do I extract just the link - http://store.vcenter.com/stores/en/product/tigers-midi/100?
I tried http://www\.\.com[^<]+ but that captures everything until the end of the line, including quotes and closing XML tags.
I'm using this expression with egrep.
Don't parse HTML with regex, use a proper XML/HTML parser.
Check: Using regular expressions with HTML tags
You can use one of the following :
xmllint
xmlstarlet
saxon-lint
File:
<root>
<xhtml:link vip="true" href="http://store.vcenter.com/stores/en/product/tigers-midi/100" />
</root>
Example with xmllint:
xmllint --xpath '//*[@vip="true"]/@href' file.xml 2>/dev/null
Output:
href="http://store.vcenter.com/stores/en/product/tigers-midi/100"
If you need a quick & dirty one time command, you can do:
egrep -o 'https?://[^"]+' file
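The quick-and-dirty approach translates to Python roughly like this. It is a sketch under the same assumption as the egrep command, namely that every link sits in a double-quoted href attribute; the function name extract_links is hypothetical, and a real XML parser remains the more robust choice.

```python
import re

# Capture only the URL between the quotes of an href attribute.
HREF = re.compile(r'href="(https?://[^"]+)"')

# Sample line from the question.
TEXT = '<xhtml:link vip="true" href="http://store.vcenter.com/stores/en/product/tigers-midi/100" />'


def extract_links(text):
    """Return every http(s) URL found in a double-quoted href attribute."""
    return HREF.findall(text)


links = extract_links(TEXT)
print(links)
```

The character class [^"]+ stops at the closing quote, which is what the [^<]+ attempt in the question was missing.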

REGEX Multi-line Search between 2 characters- Powershell

I am unable to apply many of the other PowerShell regex solutions to my problem. The answer may very well already be on Stack Overflow, but my lack of experience with PowerShell keeps me from working out how to adapt those solutions to my question.
I have a text file containing an XML document tree (I bring the document tree into PowerShell as one large string) that uses tags to mark where certain content is. I need to extract the file name from between the fileName tags. Sometimes both tags and the file name are all on one line, and other times the tags and the file name are each on a separate line. An example of the input data is below:
<files>
<file>
<fileName>
ThisTextFileINeedReturned.txt
</fileName>
<lastModifiedTime>1511883780000</lastModifiedTime>
<size>852192</size>
<isDirectory>false</isDirectory>
<isRegularFile>true</isRegularFile>
<isSymbolicLink>false</isSymbolicLink>
<isOther>false</isOther>
<group>group</group>
<transferStatus>Done</transferStatus>
</file>
<file>
<fileName>AnotherTextFileINeedReturned.txt</fileName>
<lastModifiedTime>1511883780000</lastModifiedTime>
<size>852192</size>
<isDirectory>false</isDirectory>
<isRegularFile>true</isRegularFile>
<isSymbolicLink>false</isSymbolicLink>
<isOther>false</isOther>
<group>group</group>
<transferStatus>Done</transferStatus>
</file>
I have created the following code to find the content within the tags so far. It works if the fileName tags and the file name are on the same line. The problem occurs when they are all on separate lines (as in the example above). I have already managed to load the large string above into $xmldata.
$xmldata -match '<fileName>(.*?)(</fileName>)'
$matches
Using the example text I displayed above, the output I receive is as follows:
<fileName>AnotherTextFileINeedReturned.txt</fileName>
I'm ok with receiving the tags, but I also need the name of the file that is on multiple lines. Like this...
<fileName>
ThisTextFileINeedReturned.txt
</fileName>
<fileName>AnotherTextFileINeedReturned.txt</fileName>
Or any variation that would give me both of the names of the text files. I have seen the (?m) part used before, but I haven't been able to successfully implement it. Thanks in advance for the help!! Let me know if you need any other information!
You should be able to get around that without using any regex. Powershell supports XML pretty well. Extracting the filename would be as easy as:
$Xml = @"
<files>
<file>
<fileName>
ThisTextFileINeedReturned.txt
</fileName>
<lastModifiedTime>1511883780000</lastModifiedTime>
<size>852192</size>
<isDirectory>false</isDirectory>
<isRegularFile>true</isRegularFile>
<isSymbolicLink>false</isSymbolicLink>
<isOther>false</isOther>
<group>group</group>
<transferStatus>Done</transferStatus>
</file>
<file>
<fileName>AnotherTextFileINeedReturned.txt</fileName>
<lastModifiedTime>1511883780000</lastModifiedTime>
<size>852192</size>
<isDirectory>false</isDirectory>
<isRegularFile>true</isRegularFile>
<isSymbolicLink>false</isSymbolicLink>
<isOther>false</isOther>
<group>group</group>
<transferStatus>Done</transferStatus>
</file>
</files>
"@
Select-Xml -Content $Xml -XPath "//files/file/fileName" | foreach {$_.node.InnerXML.Trim()}
You have not explained how you get your data, but I guess you are using Get-Content to retrieve your source file. Get-Content reads the content one line at a time and returns a collection of objects, each of which represents a line of content. In other words, you are probably doing a match on each separate line and therefore do not find the matches that are spread over multiple lines.
If this is indeed the case, the solution would be to simply join the lines first:
($xmldata -Join "") -match '<fileName>(.*?)(</fileName>)'
And check your matches, e.g.:
$Matches[0]
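If the data really is one large string, the other thing to check is that . does not match a newline by default. The sketch below illustrates this in Python with re.DOTALL; in .NET regular expressions, which PowerShell uses, the corresponding inline option is (?s), whereas (?m) only changes how ^ and $ behave. The shortened sample data and the variable names are assumptions made for illustration.

```python
import re

# Shortened version of the sample data from the question.
XML_DATA = """<files>
<file>
<fileName>
ThisTextFileINeedReturned.txt
</fileName>
<size>852192</size>
</file>
<file>
<fileName>AnotherTextFileINeedReturned.txt</fileName>
<size>852192</size>
</file>
</files>"""

# Without DOTALL, "." stops at line breaks and the first fileName is missed.
single_line_only = re.findall(r"<fileName>(.*?)</fileName>", XML_DATA)

# With DOTALL, "." also matches newlines, so both forms are found.
pattern = re.compile(r"<fileName>(.*?)</fileName>", re.DOTALL)
names = [raw.strip() for raw in pattern.findall(XML_DATA)]

print(single_line_only)
print(names)
```

The non-greedy .*? matters here: with DOTALL enabled, a greedy .* would run from the first opening tag to the last closing tag.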

Replacing particular occurrence of a string with comment

I am trying to replace a particular XML statement with a commented-out version of itself. I have tried Linux tools such as awk and sed and various regular expressions, but I am completely stuck. Is there any way to achieve this? Below is the scenario I am looking for.
For example:
I have n XML files. I want to find every statement that contains the word "Distribution_Facilities_carrying_Item" and replace it with a comment.
Suppose the statement is:
<Parameter name="RelationshipName1" direction="in" eval="constant" type="string">Distribution_Facilities_carrying_Item</Parameter>
Since this statement contains the word "Distribution_Facilities_carrying_Item", I want to turn it into a comment, so it should become:
<!--Parameter name="RelationshipName1" direction="in" eval="constant" type="string">Distribution_Facilities_carrying_Item</Parameter-->
Furthermore, every such statement in all the XML files should be replaced with its commented-out form. Below is the pattern in which they might occur. How should I go about it? I assume a regular expression is the only way to achieve this.
This statement can appear in any number of XML files.
File:a.xml
<Parameter name="RelationshipName1" direction="in" eval="constant" type="string">Distribution_Facilities_carrying_Item</Parameter>
<Parameter direction="in" eval="constant" type="string" name="RelationshipName3">Distribution_Facilities_carrying_Item</Parameter>
<Parameter name="RelationshipName" direction="in" eval="constant" type="string">Distribution_Facilities_carrying_Item</Parameter>
<Parameter direction="in" name="RelationshipName10" type="string" eval="constant">Distribution_Facilities_carrying_Item</Parameter>
<Parameter direction="in" name="RelationshipName11" type="string" eval="constant">Distribution_Facilities_carrying_Item</Parameter>
<Parameter direction="in" eval="constant" type="string" name="RelationshipName5">Distribution_Facilities_carrying_Item</Parameter>
Thanks in advance!!
Using sed:
sed '/Distribution_Facilities_carrying_Item/ s/<\(.*\)>/<!--\1-->/' inputfile
would comment all lines containing the string Distribution_Facilities_carrying_Item.
If you want to modify the file in-place, add the -i option:
sed -i '/Distribution_Facilities_carrying_Item/ s/<\(.*\)>/<!--\1-->/' inputfile
If this is to be performed for all .xml files in a directory, use find and -exec:
find /some/dir -maxdepth 1 -type f -name "*.xml" -exec sed -i '/Distribution_Facilities_carrying_Item/ s/<\(.*\)>/<!--\1-->/' {} \;
(Remove -maxdepth 1 from the find command if you want to do it recursively.)
Try the sed expression below; it wraps each matching line in a comment:
sed -i 's/\(<.*Distribution_Facilities_carrying_Item.*>\)/<!--\1-->/' filename.xml
Do not use regular expressions to parse XML. Use a proper parser. For example, using xsh:
my $search = "Distribution_Facilities_carrying_Item" ;
for my $file in { @ARGV } {
open $file ;
for my $p in //Parameter[text() = $search]
xinsert comment { $p->toString } replace $p ;
save :b ;
}
If you want to delete the text, too, you can change the inner loop to
for my $p in //Parameter[text() = $search] {
delete $p/text() ;
xinsert comment { $p->toString } replace $p ;
}
An awk version:
awk '/Distribution_Facilities_carrying_Item/ {sub(/^</,"<!--");sub(/>$/,"-->")}1' a.xml
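For completeness, a regex-based Python sketch that produces the commented form requested in the question. It assumes each Parameter element sits on one line with the search word as its entire text; the function name comment_out is hypothetical.

```python
import re

TARGET = "Distribution_Facilities_carrying_Item"

# Group 1 holds everything between the outermost "<" and ">" of the element.
PARAM = re.compile(
    r"<(Parameter\b[^>]*>" + re.escape(TARGET) + r"</Parameter)>"
)

# First line is from the question; the second is an invented non-matching line.
SAMPLE = (
    '<Parameter name="RelationshipName1" direction="in" eval="constant" '
    'type="string">Distribution_Facilities_carrying_Item</Parameter>\n'
    '<Parameter name="Other" type="string">Something_Else</Parameter>\n'
)


def comment_out(text):
    """Turn <Parameter ...>TARGET</Parameter> into <!--Parameter ...-->."""
    return PARAM.sub(r"<!--\1-->", text)


commented = comment_out(SAMPLE)
print(commented)
```

Unlike the line-based sed and awk commands, the pattern is anchored to the element itself, so other content on the same line is left alone.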