search and replace multi-line text with white space - regex

I am trying to search for some text in an XML file. The text is:
</p_dpopis>
<IMGURL>
And replace it with:
</p_dpopis>
<p_vyrobce>NONAME</p_vyrobce>
<IMGURL>
Here is what I tried with perl, without any luck:
perl -0pe 's|</p_dpopis>.*\n.*<IMGURL>|replacement|' myxml.xml
What is wrong here?

Your syntax works:
$ cat file
</p_dpopis>
<IMGURL>
$ perl -0pe 's|</p_dpopis>.*\n.*<IMGURL>|replacement|g' file
replacement
Here is a sed example with the same example file:
$ sed -r '/<\/p_dpopis>/{ N; s%</p_dpopis>.*\n.*<IMGURL>%replaced\ntest%g }' file
replaced
test
See this reference for more info.

You're missing a 'global' modifier for your regex, and using \s+ to match any amount of whitespace is much easier than specifying .*\n.*. It's also nicer to send the output to another file, rather than having to deal with it in the terminal window.
perl -0pe 's|</p_dpopis>\s+<IMGURL>|</p_dpopis>\n<p_vyrobce>NONAME</p_vyrobce>\n<IMGURL>|g' myxml.xml > my_new_xml.xml
If you're manipulating XML, it is really better to use a dedicated XML parser -- you can get into all sorts of mischief by manipulating an irregular language such as XML with regular expressions.
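For comparison, the parser-based route could look roughly like the Python sketch below. It is only an illustration: the `<SHOP>`/`<SHOPITEM>` wrapper elements, the sample values, and the helper name `add_vyrobce` are assumptions, since the question shows only the three tags involved.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: the <SHOP>/<SHOPITEM> wrappers and the values
# are assumptions; only <p_dpopis>, <p_vyrobce> and <IMGURL> come from the question.
SAMPLE = """<SHOP>
<SHOPITEM>
<p_dpopis>Some description</p_dpopis>
<IMGURL>http://example.com/a.png</IMGURL>
</SHOPITEM>
</SHOP>"""


def add_vyrobce(xml_text, value="NONAME"):
    """Insert <p_vyrobce>value</p_vyrobce> between <p_dpopis> and <IMGURL>."""
    root = ET.fromstring(xml_text)
    # Snapshot the elements first so inserting does not disturb the iteration.
    for parent in list(root.iter()):
        children = list(parent)
        # Walk backwards so an insertion never shifts an index still to visit.
        for idx in range(len(children) - 2, -1, -1):
            if children[idx].tag == "p_dpopis" and children[idx + 1].tag == "IMGURL":
                new = ET.Element("p_vyrobce")
                new.text = value
                new.tail = children[idx].tail  # reuse the line break that followed <p_dpopis>
                parent.insert(idx + 1, new)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(add_vyrobce(SAMPLE))
```

Because the match is on element order rather than on text, the amount of whitespace between the two tags no longer matters.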


Can grep show only the result I want?

I have data like this:
tatusx2.atc?beginnum=0;8pctgRB Mwdf fgEio"text1"text4"text
tatqsx3.atc?beginnum=1;8pctgRBwsaNezxio"text2
tatssx4.atc?beginnum=2;8pctgsvMALNejkio"data2
tatksx4.atc?beginnum=1;8pctgxdfALNebfio"text3
tatzsx5.atc?beginnum=3;8pwerRBMALNetior"datac
How can I get only the data between ; and "?
I have tried grep -oP ';.*?"' file and got output :
;8pctgRB Mwdf fgEio"
;8pctgRBwsaNezxio"
;8pctgsvMALNejkio"
;8pctgxdfALNebfio"
;8pwerRBMALNetior"
But my desired output is:
8pctgRB Mwdf fgEio
8pctgRBwsaNezxio
8pctgsvMALNejkio
8pctgxdfALNebfio
8pwerRBMALNetior
You need to use lookahead and lookbehind assertions:
grep -oP '(?<=;)[^"]*(?=")' file
I suggest you play around with RegExr to learn more about regular expressions. Check out their cheatsheet.
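A more readable way to write the expression you need is:
grep -oP '(?<=;).*?(?=")' file
This gives the desired result. The non-greedy .*? matters here: the first sample line contains several " characters, and a greedy .* would run to the last one. Some versions of the GNU grep manual describe -P as experimental, but patterns like this one work without issues.
The following options are being used:
-o --only-matching prints only the matched parts of a matching line
-P --perl-regexp interprets the pattern as a Perl-compatible regular expression
(?<=;) is a lookbehind: it requires a ; immediately before the match without including it, so the match starts at the character after the ;. Similarly, (?=") is a lookahead: it requires a " right after the match without including it.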
Here is suggested additional reading.
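The same lookbehind/lookahead idea can be tried outside grep. Here is a small Python sketch using the sample lines from the question; the names `BETWEEN` and `extract` are just for illustration.

```python
import re

# (?<=;) and (?=") assert the surrounding characters without consuming them,
# so only the text between ; and the first " ends up in the match.
BETWEEN = re.compile(r'(?<=;)[^"]*(?=")')


def extract(line):
    """Return the text between ';' and the next '"', or None if absent."""
    match = BETWEEN.search(line)
    return match.group(0) if match else None


SAMPLE_LINES = [
    'tatusx2.atc?beginnum=0;8pctgRB Mwdf fgEio"text1"text4"text',
    'tatqsx3.atc?beginnum=1;8pctgRBwsaNezxio"text2',
]

if __name__ == "__main__":
    for line in SAMPLE_LINES:
        print(extract(line))
```

Using `[^"]*` instead of `.*` keeps the match from running past the first closing quote on lines that contain several.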

Using ampersand in sed

I have a csv file full of lines like the following:
Aity Chel Jenni,Hendaland 229,2591 TE Amsterdam
I want to create a sed pattern for use in an automated batch script that changes this kind of line into the following format:
Aity Chel Jenni,Hendaland 229,2591 TE, Amsterdam
With a bit of research, I found out that I need to write a regex and then use an ampersand (&) in the replacement to refer to the text the regex matched.
I have tried the following:
sed 's/([1-9] [A-Z]{2}/&,/' file1 >file2
I have been trying variants of that to get the regex right, but it doesn't seem to change anything.
Am I making a mistake in the usage of the ampersand or is my regex wrong?
After reading around online I still can't wrap my head around this feature. Can someone give me some examples or explain how to do this properly?
You are saying
sed 's/([1-9] [A-Z]{2}/&,/' file1 >file2
       ^
But you don't have to capture with () to use &. Instead, just say:
sed 's/[1-9] [A-Z]\{2\}/&,/' file
Note that you need to escape the braces of the { } quantifier, unless you use -r:
sed -r 's/[1-9] [A-Z]{2}/&,/' file
Try the following:
sed -r 's:[0-9] [A-Z]{2}\b:&,:' file > out
About your own pattern: it is missing the closing parenthesis. Also, in sed's default basic regular expressions an unescaped ( matches a literal parenthesis; you have to write \( and \) to group.
The -r option enables extended regular expressions in sed, so the {2} quantifier can be written without backslashes.
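If it helps to see the idea outside sed: Python's `re.sub` has no `&`, but `\g<0>` in the replacement plays the same role (the whole matched text). A minimal sketch using the sample line from the question, with `add_comma` as a made-up helper name:

```python
import re


def add_comma(line):
    """Append a comma after the '<digit> <two capitals>' postcode ending."""
    # \g<0> stands for the entire match, like & in a sed replacement.
    return re.sub(r"[0-9] [A-Z]{2}\b", r"\g<0>,", line, count=1)


if __name__ == "__main__":
    print(add_comma("Aity Chel Jenni,Hendaland 229,2591 TE Amsterdam"))
```

As in the sed versions, no capture group is needed because the replacement refers to the whole match.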

Regular expression to extract text from XML-ish data using GNU sed

I have a file full of lines extracted from an XML file using "gsed regexp -i FILENAME". Each line has one of these two formats:
<field number='1' name='Account' type='STRING'W/>
<field number='2' name='AdvId' type='STRING'W>
I've inserted a 'W' at the end to represent optional whitespace. The order and number of attributes are not necessarily the same in all lines throughout the file, although "number" always comes before "type".
What I'm searching for is a regular expression "regexp" that I can give to gnu sed so that this command:
gsed regexp -i FILENAME
gives me a file with lines looking like this:
1 STRING
2 STRING
I don't care about the amount of whitespace in the result as long as there is some after the number and a newline at the end of each line.
I'm sure it is possible, but I just can't figure out how in a reasonable amount of time. Can anyone help?
Thanks a lot,
jules
Using xsh, a Perl wrapper around XML::LibXML:
open file.xml ;
for //field echo @number @type ;
I'm sure this can be optimized, but it works for me and answers your question:
sed "s/^.*number='\([0-9]*\)'.*type='\([^']*\)'.*$/\1 \2/" FILENAME
Saying that, I think the others are right, if you have an XML-file you should use an XML-parser.
I think you're much better off using a command line XML tool such as XMLStarlet. That will integrate well with the shell and let you perform XPath searches. It's XML-aware so it'll handle character encodings, whitespace correctly etc.
Simple cut should work for you:
cut -f2,6 -d"'" --output-delimiter=" "
If you really want sed:
sed -r "s/.*number='([^']*)'.*type='([^']*)'.*/\1 \2/"
You can use this:
sed -r "s/<field [^>]*number='([0-9]+)'[^>]*type='([^']+)'[^>]*>/\1 \2/"
You would be better off using an XML parser, but if you had to use sed:
sed -r "s/<field number='([^']*)'.*type='([^']*)'.*/\1 \2/"
sed -ni "/<field .*>/s#^.*[[:space:]]number='\\([^']\\+\\).*[[:space:]]type='\\([^']\\+\\).*#\1 \2#p" FILENAME
Or if you don't mind contents of number and type to be optional:
sed -ni "/<field .*>/s#^.*[[:space:]]number='\\([^']*\\).*[[:space:]]type='\\([^']*\\).*#\1 \2#p" FILENAME
Just change from [^']\\+ to [^']* at your preference.
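For anyone who wants to check the matching logic away from shell quoting, here is a rough Python equivalent. It relies only on what the question states (single-quoted attributes, "number" before "type"); the names `FIELD` and `number_and_type` are illustrative.

```python
import re

# number must appear before type; any other attributes may sit in between.
FIELD = re.compile(r"<field\b[^>]*?\snumber='([^']*)'[^>]*?\stype='([^']*)'")


def number_and_type(line):
    """Return 'NUMBER TYPE' for a <field ...> line, or None if it does not match."""
    match = FIELD.search(line)
    return f"{match.group(1)} {match.group(2)}" if match else None


if __name__ == "__main__":
    print(number_and_type("<field number='1' name='Account' type='STRING' />"))
    print(number_and_type("<field number='2' name='AdvId' type='STRING'>"))
```

The `[^']*` groups stop at the closing quote, which avoids the over-matching a greedy `.*` can cause when more attributes follow.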

Is there a truly universal wildcard in Grep? [duplicate]

Really basic question here. So I'm told that a dot . matches any character EXCEPT a line break. I'm looking for something that matches any character, including line breaks.
All I want to do is to capture all the text in a website page between two specific strings, stripping the header and the footer. Something like HEADER TEXT(.+)FOOTER TEXT and then extract what's in the parentheses, but I can't find a way to include all text AND line breaks between header and footer, does this make sense? Thanks in advance!
When I need to match several characters, including line breaks, I do:
[\s\S]*?
Note I'm using a non-greedy pattern
You could do it with Perl:
$ perl -ne 'print if /HEADER TEXT/ .. /FOOTER TEXT/' file.html
To print only the text between the delimiters, use
$ perl -0777 -ne 'print "$1\n" while /HEADER TEXT(.+?)FOOTER TEXT/sg' file.html
The -0777 switch makes Perl read the whole file as one string, so a match can span blank lines. The /s switch makes the regular expression matcher treat the entire string as a single line, which means dot matches newlines, and /g means match as many times as possible.
The examples above assume you're cranking on HTML files on the local disk. If you need to fetch them first, use get from LWP::Simple:
$ perl -MLWP::Simple -le '$_ = get "http://stackoverflow.com";
print $1 while m!<head>(.+?)</head>!sg'
Please note that parsing HTML with regular expressions as above does not work in the general case! If you're working on a quick-and-dirty scanner, fine, but for an application that needs to be more robust, use a real parser.
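The "dot matches newlines" behaviour of /s has a direct counterpart in Python, `re.DOTALL`. A quick sketch, with the same quick-and-dirty caveat as above (`between` is a made-up helper name):

```python
import re


def between(text, start="HEADER TEXT", end="FOOTER TEXT"):
    """Return every chunk of text found between start and end, line breaks included."""
    pattern = re.escape(start) + r"(.+?)" + re.escape(end)
    # re.DOTALL lets '.' match newlines; '+?' keeps each match as short as possible.
    return re.findall(pattern, text, flags=re.DOTALL)


if __name__ == "__main__":
    page = "nav\nHEADER TEXT\nline 1\nline 2\nFOOTER TEXT\ncopyright"
    print(between(page))
```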
By definition, grep looks for lines which match; it reads a line, sees whether it matches, and prints the line.
One possible way to do what you want is with sed:
sed -n '/HEADER TEXT/,/FOOTER TEXT/p' "$@"
This prints from the first line that matches 'HEADER TEXT' to the first line that matches 'FOOTER TEXT', and then iterates; the '-n' stops the default 'print each line' operation. This won't work well if the header and footer text appear on the same line.
To do what you want, I'd probably use perl (but you could use Python if you prefer). I'd consider slurping the whole file, and then use a suitably qualified regex to find the matching portions of the file. However, the Perl one-liner given by '@gbacon' is an almost exact transliteration into Perl of the 'sed' script above and is neater than slurping.
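Since Python was mentioned, the line-range behaviour of the sed script can be sketched like this. It is an approximation that uses plain substring tests instead of regexes, and `lines_in_range` is a hypothetical helper name.

```python
def lines_in_range(lines, start="HEADER TEXT", end="FOOTER TEXT"):
    """Mimic sed -n '/start/,/end/p': keep lines from a start line through an end line."""
    inside = False
    kept = []
    for line in lines:
        if not inside:
            if start in line:
                inside = True
                kept.append(line)  # the end marker is only looked for on later lines
        else:
            kept.append(line)
            if end in line:
                inside = False  # range closed; a later start line opens a new one
    return kept


if __name__ == "__main__":
    sample = ["nav", "HEADER TEXT", "body", "FOOTER TEXT", "copyright"]
    print(lines_in_range(sample))
```

Like the sed version, it includes the header and footer lines themselves and does not handle both markers on one line.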
The man page of grep says:
grep, egrep, fgrep, rgrep - print lines matching a pattern
grep is not made for matching more than a single line. You should try to solve this task with perl or awk.
As this is tagged with 'bbedit' and BBEdit supports Perl-style pattern modifiers, you can allow the dot to match line breaks with the (?s) switch:
(?s).
will match ANY character. And yes,
(?s).+
will match the whole text.
As pointed out elsewhere, grep only works for single-line matching.
For multiple lines (in Ruby with Regexp::MULTILINE, or in Python or another engine with Perl-style syntax), "\s" also matches line breaks, so
HEADER TEXT((?:.|\s)*?)FOOTER TEXT
might work ...
Here's one way to do it with gawk, if you have it:
awk -vRS="FOOTER" '/HEADER/{gsub(/.*HEADER/,"");print}' file

Regular expression to find a line containing certain characters and remove that line

I have a text file that has a lot of entries, one line after another.
I want to find all lines that start with :: and delete them.
What is the regular expression to do this?
-AD
Regular expressions don't "do" anything. They only match text.
What you want is a tool that uses regular expressions to identify lines and then applies a command to those lines.
One such tool is sed (there's also awk and many others). You'd use it like this:
sed -e "/^::/d" < input.txt > output.txt
The part "/^::/" tells sed to apply the following command to all lines that start with "::" and "d" simply means "delete that line".
Or the simplest solution (which my brain didn't produce for some strange reason):
grep -v "^::" input.txt > output.txt
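If neither tool is at hand, the same filter is only a few lines of Python. A minimal sketch, with `drop_comment_lines` as an illustrative name:

```python
def drop_comment_lines(text):
    """Remove every line that starts with '::', like grep -v '^::'."""
    kept = [
        line
        for line in text.splitlines(keepends=True)  # keepends preserves line breaks
        if not line.startswith("::")
    ]
    return "".join(kept)


if __name__ == "__main__":
    print(drop_comment_lines("keep 1\n:: drop me\nkeep :: 2\n"), end="")
```

Only lines that begin with :: are removed; a :: in the middle of a line is left alone, matching the ^ anchor in the commands above.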
sed -i -e '/^::/d' yourfile.txt
^::.*[\r\n]*
If you're reading the file line-by-line you won't need the [\r\n]* part.
Simple as:
^::
If you don't have sed or grep, find this and replace with empty string:
^::.*[\r\n]
Thanks for the pointers:
The following worked for me. Any character could follow "::" in the text file, so I used:
^::[a-zA-Z0-9 I put all punctuation symbols here]*$
-AD
Here's my contribution in C#:
Text stream:
string stream = ":: This is a comment line";
Syntax:
Regex commentsExp = new Regex("^::.*", RegexOptions.Multiline);
Usage:
Console.WriteLine(commentsExp.Replace(stream, string.Empty));
Alternatively, if I wanted to simply take a text file that included comments and produce an exact duplicate without the comment lines I could use a simple but effective combination of the type and findstr commandline tools:
type commented.txt | findstr /v /R "^::" > uncommented.txt