perl non-greedy replace not working at the start - regex

I have an XML file similar to this:
<Level1Node>
.
.
<Level2Node val="Retain"/>
.
.
</Level1Node>
<Level1Node>
.
.
<Level2Node val="Replace"/>
.
.
</Level1Node>
<Level1Node>
.
.
<Level2Node val="Retain"/>
.
.
</Level1Node>
I need to remove only the node below:
<Level1Node>
.
.
<Level2Node val="Replace"/>
.
.
</Level1Node>
To replace it in a non-greedy manner, I used the regex below:
perl -0 -pe "s|<Level1Node>.*?<Level2Node val="Retain"/>.*?</Level1Node>||gs" myxmlfile
But non-greedy matching only shortens the match at the end of the pattern, not at the start. How can I make the match start at the nearest preceding <Level1Node>?

You will need to use a negative lookahead to make sure you do not match closing Level1Node tags where you don't want to:
perl -0 -pe 's|<Level1Node>(?:(?!<\/Level1Node>).)*<Level2Node val="Retain"\/>(?:(?!<\/Level1Node>).)*<\/Level1Node>||gs' tmp.txt
Details:
<Level1Node>
(?:(?!<\/Level1Node>).)* # Everything except </Level1Node>
<Level2Node val="Retain"\/>
(?:(?!<\/Level1Node>).)* # Everything except </Level1Node>
<\/Level1Node>
?: is only there so that the parentheses are not interpreted as a capturing group.
If you plan to run this on a large file, you should probably check the cost of the negative lookahead; it might be high.

Use a proper parser! It's way simpler.
perl -MXML::LibXML -e'
my $doc = XML::LibXML->new->parse_file($ARGV[0]);
$_->unbindNode() for $doc->findnodes(q{//Level1Node[Level2Node[@val!="Retain"]]});
$doc->toFH(\*STDOUT);
' tmp.txt

Related

Replace text between 2 particular lines in a text file using sed

Similar questions have been asked but they are for Powershell.
I have a Markdown file like:
.
.
.
## See also
- [a](./A.md)
- [A Child](./AChild.md)
.
.
.
- [b](./B.md)
.
.
.
## Introduction
.
.
.
I wish to replace all occurrences of .md) with .html) between ## See also and ## Introduction :
.
.
.
## See also
- [a](./A.html)
- [A Child](./AChild.html)
.
.
.
- [b](./B.html)
.
.
.
## Introduction
.
.
.
I tried like this in Bash
orig="\.md)"; new="\.html)"; sed "s~$orig~$new~" t.md -i
But this replaces everywhere in the file. I want the replacement to happen only between ## See also and ## Introduction.
Could you please suggest changes? I am using awk and sed as I am a little familiar with them. I also know a little Python; is it recommended to do such scripting in Python (if it is too complicated for sed or awk)?
$ sed '/## See also/,/## Introduction/s/\.md)/.html)/g' file
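A small sketch of how the address range limits the substitution, using a made-up Markdown snippet (the link after ## Introduction is a hypothetical addition to show what stays untouched). The two patterns are anchored with ^ here as an extra precaution, which the answer above does not do:

```shell
# Hypothetical input: two links inside the range, one after it.
md='## See also
- [a](./A.md)
- [b](./B.md)
## Introduction
See [c](./C.md)'

# The s command only runs on lines from "## See also" through
# "## Introduction" (both boundary lines included).
out=$(printf '%s\n' "$md" | sed '/^## See also/,/^## Introduction/s/\.md)/.html)/g')

printf '%s\n' "$out"
```

The link to C.md comes after the closing line of the range, so it is left alone.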

How can I match this pattern of file name in a directory, and output the matched?

There are many files in this directory:
[ichen@ui01 data]$ ls
data.list
data.root
ntuple.data15_13TeV.00276262.DAOD_FTAG2.root
ntuple.data15_13TeV.00276329.DAOD_FTAG2.root
ntuple.data15_13TeV.00276336.DAOD_FTAG2.root
ntuple.data15_13TeV.00276416.DAOD_FTAG2.root
ntuple.data15_13TeV.00276511.DAOD_FTAG2.root
I want to make a list that contains only the files whose names follow this pattern:
[many chars].[many chars].[many numbers].[many chars].root
so that it matches file names such as:
ntuple.data15_13TeV.00276262.DAOD_FTAG1.root
ntuple.data15_13TeV.00276329.DAOD_FTAG2.root
ntuple.data15_13TeV.00276336.DAOD_FTAG3.root
etc...
How can I use a regexp to achieve this?
Maybe we can use this syntax:
for f in `ls`;do if [....];then echo $f;fi;done > log.list
In regexp land, many roads lead to Rome. :)
ls | egrep '^\w*\.\w*\.[0-9]*\.\w*\.root$'
^ marks the beginning of a line
$ marks the end of a line
\w is a word character
\w* is zero or more word characters
\. is a literal '.' character; an unescaped '.' in the regular expression stands for "any character"
[0-9] is any of the numbers between 0 and 9
And for your specific example:
for f in `ls`;do echo $f | egrep '^\w*\.\w*\.[0-9]*\.\w*\.root$';done
And now including the if statement:
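re='^[[:alnum:]_]*\.[[:alnum:]_]*\.[0-9]*\.[[:alnum:]_]*\.root$'; for f in *; do if [[ $f =~ $re ]]; then echo "$f"; fi; done
Here the pattern is kept in a variable and expanded unquoted: since bash 3.2, a quoted right-hand side of =~ is matched as a literal string, not as a regular expression. [[:alnum:]_] stands in for \w, which is not part of POSIX extended regular expressions. In general, =~ checks for regular expressions.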
ls | grep '..*[.]..*[.][0-9][0-9]*[.]..*[.]root' > log.list
should do the job
It doesn't have to be complicated like that. You just need to list out the files which match a certain pattern - wildcards are basically enough, there's no need for regexes.
ls -1 ntuple.data*.*.*.root > log.list
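To check a pattern like the ones above without touching the real data directory, a scratch directory can help. This is only a sketch: the file names are copied from the question, and the stricter [^.]+ pieces (one or more non-dot characters per field) are my own variation, not taken from any of the answers:

```shell
# Build a throwaway directory with a few of the names from the question.
tmp=$(mktemp -d)
touch "$tmp/data.list" "$tmp/data.root" \
      "$tmp/ntuple.data15_13TeV.00276262.DAOD_FTAG2.root" \
      "$tmp/ntuple.data15_13TeV.00276329.DAOD_FTAG2.root"

# Four dot-separated fields followed by ".root"; the third field must be digits.
out=$(cd "$tmp" && ls | grep -E '^[^.]+\.[^.]+\.[0-9]+\.[^.]+\.root$')
rm -rf "$tmp"

printf '%s\n' "$out"
```

data.list and data.root have too few fields, so only the two ntuple files should be printed.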

How can I match a string not starting with a sequence with grep

I'm running the following command:
grep -REin "Example::" .
I'm trying to filter my results more. I want to match anything with Example:: except when it starts with return like in the case return Example::.
MATCH
if (Example::test())
DO NOT MATCH
if (something()) return Example::another()
You could execute the following, using -v (invert-match) option.
grep -REin 'Example::' . | grep -vi 'return Example::'
Or use option -P, which makes grep interpret the pattern as a Perl regular expression.
grep -RPin '(?<!return )Example::' .
This uses Negative Lookbehind to assert that what precedes is not the word return.
(?<! # look behind to see if there is not:
return # 'return '
) # end of look-behind
Example:: # 'Example::'
You could use awk to solve this:
awk '/Example/ && !/return Example/' file
if (Example::test())
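The two-step grep variant can be tried on a few sample lines. The sketch below uses the two lines from the question plus one invented line, and avoids -P because not every grep build supports it:

```shell
# Sample lines: the third one is a made-up extra case.
src='if (Example::test())
if (something()) return Example::another()
x = Example::value;'

# First keep every line mentioning Example::, then drop the ones
# where it is directly preceded by "return ".
out=$(printf '%s\n' "$src" | grep 'Example::' | grep -v 'return Example::')

printf '%s\n' "$out"
```

One difference from the lookbehind version: -v drops the whole line, so a line containing both return Example:: and another Example:: call would be discarded here but matched by the -P pattern.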

How to retain the first instance of a match with sed

I have a set of tokens in data and wish to strip off the trailing ".[0-9]"; however, I cannot figure out how to quote the regexp properly. The first match should be everything up to the ., and the second the . and a number. I intend the first match to be retained.
data="thing thing__aaa.0 thing__bbb.3 thing__ccc.5 other_aaa other_bbb other_ccc.5"
data=`echo $data | sed s/\([a-zA-Z0-9_]+\)\(\.[0-9]\)/\1/g`
echo $data
Actual output:
thing thing__aaa.0 thing__bbb.3 thing__ccc.5 other_aaa other_bbb other_ccc.5
Desired output:
thing thing__aaa thing__bbb thing__ccc other_aaa other_bbb other_ccc
The idea is that the unquoted ([a-zA-Z0-9_]+) is the first matching group, and the (\.[0-9]) matches the .number. the \1 should replace both groups with the first group.
How about just
echo $data | sed 's/\.[0-9]//g'
or if number may contain more digits, then
echo $data | sed 's/\.[0-9]\+//g'
It looks like you just want to delete all strings of the form \.[0-9]. So why not just do:
sed 's/\.[0-9]\+\b//g'
(This relies on GNU sed's \b and \+ extensions. For other sed you can do:
sed 's/\.[0-9][0-9]*\( \|$\)/\1/g'
I normally don't encourage the use of shell specific extensions, but if you are using bash you might be happy using an array:
bash$ data=(thing thing__aaa.0 thing__bbb.3)
bash$ echo "${data[@]%.[0-9]*}"
Note that this will also delete extensions that are not all digits (i.e. foo.34bb), but perhaps is adequate for your needs.)
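For reference, a sketch of the question's original idea (keep group 1, drop the .number) with the quoting fixed. As far as I can tell, the unquoted version fails because the shell strips the backslashes before sed sees them, so sed receives literal parentheses and a plain 1 as the replacement. The -E flag (extended regular expressions) is an assumption about the sed in use; GNU and BSD sed both accept it:

```shell
data="thing thing__aaa.0 thing__bbb.3 thing__ccc.5 other_aaa other_bbb other_ccc.5"

# Single quotes keep the shell away from the pattern; with -E the
# parentheses group and + means "one or more" without backslashes.
out=$(printf '%s\n' "$data" | sed -E 's/([a-zA-Z0-9_]+)\.[0-9]+/\1/g')

printf '%s\n' "$out"
```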

Pull value for HostName for IPconfig command

I have a text file containing the output of the IPCONFIG command, and I would like to extract the value of Host Name (i.e. S4333AAB45) using a regex.
Windows IP Configuration
Host Name . . . . . . . . . . . . : S4333AAB45
Primary Dns Suffix . . . . . . . :
Node Type . . . . . . . . . . . . : Hybrid
IP Routing Enabled. . . . . . . . : No
I tried the following, and it didn't work:
/\bHost Name\s+(\d+)/
Here is what I would use:
/\s+Host Name.*: (\w+)$/
Use Field Splitting with AWK
You don't say what regular expression engine you're using, or why you need to use a regular expression to match the host name portion. If you have access to AWK, you can treat this as a field-splitting issue instead. For example:
awk '/\<Host Name\>/ { print $NF }' /tmp/foo
Use Known Line Positions
Assuming you've got Cygwin or similar installed, you can use the position of the interesting record to get the data you want without a regular expression at all. For example:
cat /tmp/foo | head -n3 | tail -n1 | cut -d: -f2 | tr -d ' '
Just replace the cat command with your call to ipconfig instead, and you should get the results you want.
Use sed Instead
You can also use sed to find the line you're interested in, and print out just the trailing word on the line. For example:
sed -n '/\<Host Name\>/ s/.*[[:space:]]\([[:alnum:]]\+\)$/\1/p' /tmp/foo
Your host had a letter "S" as the first character of the host name, so "(\d+)" wouldn't be correct for matching your host name. You also failed to account for the dots and colon on the host name line. So the answer from weexpectedTHIS should do the trick. But for your information, here's how you could get the host name without first creating an intermediate file.
$ipconfig = `ipconfig /all`;
($host) = $ipconfig =~ /^\s*Host Name.*:\s*(\w+)/m;
You would need the "/m" in there so that the "^" will match the start of any line in the multi-line contents of $ipconfig. I tend to use "\s*" instead of "\s+" as a sort of insurance against future changes in the output format (where white space is often removed or expanded in newer versions of a command).
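The same extraction can be sketched in shell by splitting on the colon instead of matching the dots. The sample text is the excerpt from the question; the tr step is an assumption that real ipconfig output may carry Windows carriage returns:

```shell
# Excerpt from the question, stored inline so the sketch is self-contained.
ipconfig_text='Windows IP Configuration
Host Name . . . . . . . . . . . . : S4333AAB45
Primary Dns Suffix . . . . . . . :
Node Type . . . . . . . . . . . . : Hybrid'

# Split each line on ":" plus any following spaces; on the Host Name
# line the second field is the value. Stop after the first hit.
host=$(printf '%s\n' "$ipconfig_text" | tr -d '\r' |
       awk -F': *' '/Host Name/ { print $2; exit }')

printf '%s\n' "$host"
```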