text-manipulation: how to exclude specific lines with sed - replace

Currently I replace every &lt; in the content with the following sed command:
sed -e 's/\&lt;/</g'
But now I have to exclude the lines that contain <title>.
To be exact, I have to exclude the text between <title> and </title>.
E.g. the following line matches my command, but it should be excluded:
<title>BEWEGUNGSBOX der ÖDG ab sofort &lt; erhältlich </title>
How can I solve this with sed?
I'm using sed in Cygwin.

To make the substitution only in the document body, you can use an address range in sed:
sed -e '/<body/,/<\/body/ s/\&lt;/</g' input.htm

I don't like the idea of using sed to handle HTML data. That said, try this:
sed -ne '/<title>.*<\/title>/ { p; b }; /<title>/,/<\/title>/ { p; b }; s/\&lt;/</g; p' infile
It first looks for a <title>...</title> pair with both tags on the same line and prints that line unchanged. Otherwise, it looks for the two tags on different lines using a range and prints those lines unchanged too. Only lines that match neither condition reach the substitution of &lt;.
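A minimal sketch of the same idea on a small sample, assuming the entity to decode is &lt; and that the input arrives on standard input. The function name decode_lt_outside_title and the sample markup are made up for illustration. It relies on sed's default printing together with b (branch to the end of the script) instead of -n with explicit p, which should behave the same way:

```shell
# Hypothetical helper: decode &lt; everywhere except inside <title>...</title>.
decode_lt_outside_title() {
  # 1st rule: title opened and closed on one line -> leave the line alone.
  # 2nd rule: title spanning several lines -> leave the whole range alone.
  # 3rd rule: everything else gets the substitution.
  sed -e '/<title>.*<\/title>/b' \
      -e '/<title>/,/<\/title>/b' \
      -e 's/&lt;/</g'
}

sample='<title>A &lt; B</title>
<p>1 &lt; 2</p>
<title>
C &lt; D
</title>
<p>3 &lt; 4</p>'

result=$(printf '%s\n' "$sample" | decode_lt_outside_title)
printf '%s\n' "$result"
```

The one-line check has to come first: a range whose start and end patterns both sit on the same line would otherwise stay open until the next </title>.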

Related

SED - remove attribute from HTML tag

I want to remove a specific attribute(name in my example) from the HTML tag, that might be in different positions for each line in my file
Example Input:
<img name="something_random_for_each_tag" src="https://websiteurl.com/286.jpg" alt="img">
Expected output:
<img src="https://websiteurl.com/286.jpg" alt="img">
My code:
sed 's/name=".*"//g' <<< '<img name="something_random_for_each_tag" src="https://websiteurl.com/286.jpg" alt="img">'
but it only shows <img >, I am losing src attribute as well
Notes:
name attribute might be in any position in a tag (not necessarily at the beginning)
you can use sed, awk, Perl, or anything you like, it should work on the command line
Your sed expression matches everything up to the last " on the line, because .* is greedy. It should have been:
sed 's/ name="[^"]*"//g'
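To see the difference between the two patterns side by side, here is a small sketch that runs both on the tag from the question. The variable names are only for illustration:

```shell
tag='<img name="something_random_for_each_tag" src="https://websiteurl.com/286.jpg" alt="img">'

# Greedy .* runs on to the last double quote on the line.
greedy=$(printf '%s\n' "$tag" | sed 's/name=".*"//g')

# [^"]* stops at the first closing quote, so only the name attribute is removed.
fixed=$(printf '%s\n' "$tag" | sed 's/ name="[^"]*"//g')

printf '%s\n' "$greedy" "$fixed"
```

The leading space in the second pattern removes the separator in front of the attribute, so the tag does not end up with a double space wherever name happened to be.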
With your shown samples, please try the following. Written and tested in GNU awk.
awk '/^<img/ && match($0,/src.*/){print substr($0,1,4),substr($0,RSTART,RLENGTH)}' Input_file
2nd solution: Using sub(substitute function) of awk.
awk '/^<img/{sub(/name="[^"]*" /,"")} 1' Input_file
Explanation:
1st solution: uses awk's match function to match from src to the end of the line, then prints the first 4 characters of the line, a space, and the matched text.
2nd solution: if the line starts with <img, substitutes the text from name=" up to the next " (plus the following space) with an empty string, then prints the current line.

process a delimited text file with sed

I have a ";" delimited file:
aa;;;;aa
rgg;;;;fdg
aff;sfg;;;fasg
sfaf;sdfas;;;
ASFGF;;;;fasg
QFA;DSGS;;DSFAG;fagf
I'd like to process it replacing the missing value with a \N .
The result should be:
aa;\N;\N;\N;aa
rgg;\N;\N;\N;fdg
aff;sfg;\N;\N;fasg
sfaf;sdfas;\N;\N;\N
ASFGF;\N;\N;\N;fasg
QFA;DSGS;\N;DSFAG;fagf
I'm trying to do it with a sed script:
sed "s/;\(;\)/;\\N\1/g" file1.txt >file2.txt
But what I get is
aa;\N;;\N;aa
rgg;\N;;\N;fdg
aff;sfg;\N;;fasg
sfaf;sdfas;\N;;
ASFGF;\N;;\N;fasg
QFA;DSGS;\N;DSFAG;fagf
You don't need to enclose the second semicolon in parentheses just to use it as \1 in the replacement string. You can use ; in the replacement string:
sed 's/;;/;\\N;/g'
As you noticed, when it finds a pair of semicolons it replaces it with the desired string then skips over it, not reading the second semicolon again and this makes it insert \N after every two semicolons.
A solution is to use positive lookaheads; the regex is /;(?=;)/ but sed doesn't support them.
But it's possible to solve the problem using sed in a simple manner: duplicate the search command; the first command replaces the odd appearances of ;; with ;\N, the second one takes care of the even appearances. The final result is the one you need.
The command is as simple as:
sed 's/;;/;\\N;/g;s/;;/;\\N;/g'
It duplicates the previous command and uses the ; between g and s to separate them. Alternatively, you can use the -e command line option once for each expression:
sed -e 's/;;/;\\N;/g' -e 's/;;/;\\N;/g'
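A short sketch comparing one pass with two passes on the first line of the sample file. The variable names are only for illustration:

```shell
line='aa;;;;aa'

# One pass: the ; that closes one match cannot also open the next one.
once=$(printf '%s\n' "$line" | sed 's/;;/;\\N;/g')

# Two passes: the second pass fills the gaps the first one skipped.
twice=$(printf '%s\n' "$line" | sed -e 's/;;/;\\N;/g' -e 's/;;/;\\N;/g')

printf '%s\n' "$once" "$twice"
```

The first result still contains a ;; in the middle, which is exactly what the second pass picks up.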
Update:
The OP asks in a comment "What if my file have 100 columns?"
Let's try and see if it works:
$ echo "0;1;;2;;;3;;;;4;;;;;5;;;;;;6;;;;;;;" | sed 's/;;/;\\N;/g;s/;;/;\\N;/g'
0;1;\N;2;\N;\N;3;\N;\N;\N;4;\N;\N;\N;\N;5;\N;\N;\N;\N;\N;6;\N;\N;\N;\N;\N;\N;
Look, ma! It works!
:-)
Update #2
I ignored the fact that the question doesn't ask to replace ;; with something else but to replace the empty/missing values in a file that uses ; to separate the columns. Accordingly, my expression doesn't fix the missing value when it occurs at the beginning or at the end of the line.
As the OP kindly added in a comment, the complete sed command is:
sed 's/;;/;\\N;/g;s/;;/;\\N;/g;s/^;/\\N;/g;s/;$/;\\N/g'
or (for readability):
sed -e 's/;;/;\\N;/g;' -e 's/;;/;\\N;/g;' -e 's/^;/\\N;/g' -e 's/;$/;\\N/g'
The two additional steps handle a ; found at the beginning or at the end of a line.
You can use this sed command with 2 s (substitute) commands:
sed 's/;;/;\\N;/g; s/;;/;\\N;/g;' file
aa;\N;\N;\N;aa
rgg;\N;\N;\N;fdg
aff;sfg;\N;\N;fasg
sfaf;sdfas;\N;\N;
ASFGF;\N;\N;\N;fasg
QFA;DSGS;\N;DSFAG;fagf
Or using lookarounds regex in a perl command:
perl -pe 's/(?<=;)(?=;)/\\N/g' file
aa;\N;\N;\N;aa
rgg;\N;\N;\N;fdg
aff;sfg;\N;\N;fasg
sfaf;sdfas;\N;\N;
ASFGF;\N;\N;\N;fasg
QFA;DSGS;\N;DSFAG;fagf
The main problem is that the same character cannot take part in two matches during a single pass:
s/;;/..../g: the second ; cannot be reused for the next match in a string like ;;;
If you want to do it with sed without a Perl-like regex mode, you can use a loop with the conditional command t:
sed ':a;s/;;/;\\N;/g;ta;' file
:a defines a label "a"; ta jumps to this label only if something has been replaced.
For the ; at the end of the line (and to deal with possible trailing whitespace):
sed ':a;s/;;/;\\N;/g;ta; s/;[ \t\r]*$/;\\N/1' file
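A sketch of the loop written with one -e per command, which tends to be the safer spelling on sed implementations that do not accept a ; after a label. The function name fill_empty is hypothetical, and the two extra substitutions for a leading and a trailing ; are borrowed from the other answers in this thread:

```shell
# Hypothetical helper: put \N into every empty ;-separated field.
fill_empty() {
  # Repeat the ;; substitution until it no longer changes anything,
  # then handle an empty first field and an empty last field.
  sed -e ':a' -e 's/;;/;\\N;/g' -e 'ta' \
      -e 's/^;/\\N;/' -e 's/;$/;\\N/'
}

result=$(printf '%s\n' ';a;;;b;' | fill_empty)
printf '%s\n' "$result"
```

Unlike the doubled command, the loop keeps going for as long as substitutions happen, so it does not depend on how many passes a particular line needs.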
This awk one-liner will give you what you want:
awk -F';' -v OFS=';' '{for(i=1;i<=NF;i++)if($i=="")$i="\\N"}7' file
If you really want the line sfaf;sdfas;\N;\N;\N, this one works for you:
awk -F';' -v OFS=';' '{for(i=1;i<=NF;i++)if($i=="")$i="\\N";sub(/;$/,";\\N")}7' file
sed 's/;/;\\N/g;s/;\\N\([^;]\)/;\1/g;s/;[[:blank:]]*$/;\\N/' YourFile
Non-recursive, one-liner, POSIX compliant.
Concept:
change every ;
put back the ones that were followed by a value
handle the special case of a last ; with possible whitespace before the end of the line
This might work for you (GNU sed):
sed -r ':;s/^(;)|(;);|(;)$/\2\3\\N\1\2/g;t' file
There are 4 scenarios in which an empty field may occur: at the start of a record, between 2 field delimiters, directly after another empty field, and at the end of a record. Alternation can be employed to cater for scenarios 1, 2 and 4, and scenario 3 can be catered for by a second pass using a loop (:;...;t). Multiple occurrences can be replaced in both passes using the g flag.

use sed to replace "file={{bla-bla}}" with "file={bla-bla}"

my bibtex file is corrupted in a sense that I need to change
file = {{name:/path/to/file.pdf:application/pdf}},
with file = {name:/path/to/file.pdf:application/pdf}, that is, remove the first pair of curly brackets.
All the strings I am interested start with file = {{.
My first attempt is
echo "file = {{name:/path/to/file.pdf:application/pdf}}," | sed 's/file = {{/file = {/g;s/}}/}/g'
The problem with this one is that it also alters lines like
title = {{ blablabla }} which I don't want.
How does one write a REGEX with something like s/file = {{EVERYTHING-IN-BETWEEN/file = {KEEP-WHAT-WAS-THERE}/g ?
p.s. if it's not possible with sed, any other unix commands are welcome.
p.p.s. I am on OS-X, sed here is apparently different to GNU, so some answers below do not work for me, unfortunately.
You can do the following:
sed 's/\(file = \){\({[^}]*}\)}/\1\2/g'
This is probably wrong for your situation, but this will change any double brace to a single brace:
sed 's/\([{}]\)\1/\1/g' <<END
{{
}}
file={{bar-blah}}
{}
}{
END
{
}
file={bar-blah}
{}
}{
The search part \([{}]\)\1 finds a single open or close brace followed by what was just captured. The replacement part is the single captured character.
Assuming both opening and closing braces are on the same line and no other pairs of braces exist on that same line then this should do what you want:
sed '/file *= *{{/{s/{{/{/; s/}}/}/}' file
That's:
/file *= *{{/ - match lines that contain file = {{ (with optional spaces around the =)
{ - start a group of commands
s/{{/{/ - replace {{ with { once
s/}}/}/ - replace }} with } once
} - end a group of commands
If OS X sed cannot handle that command, and this version without the command grouping does not work either:
sed '/file *= *{{/s/}}/}/; /file *= *{{/s/{{/{/'
then this, hopefully, should:
sed -e '/file *= *{{/s/}}/}/' -e '/file *= *{{/s/{{/{/'
or, to steal from glenn jackman's answer a bit:
sed -e '/file *= *{{/s/\([{}]\)\1/\1/g'
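A small sketch of the two -e variant on a sample with both a title and a file entry, to check that only the file line changes. The function name fix_file_braces and the sample lines are illustrative only:

```shell
# Hypothetical helper: drop the outer braces on "file = {{...}}" lines only.
fix_file_braces() {
  # The }} substitution has to run first: once {{ has been reduced to {,
  # the address /file *= *{{/ no longer matches the line.
  sed -e '/file *= *{{/s/}}/}/' -e '/file *= *{{/s/{{/{/'
}

bib='title = {{ blablabla }},
file = {{name:/path/to/file.pdf:application/pdf}},'

fixed_bib=$(printf '%s\n' "$bib" | fix_file_braces)
printf '%s\n' "$fixed_bib"
```

The order of the two expressions matters here, which is presumably why the answer lists the }} substitution before the {{ one.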
With GNU sed:
echo "file={{bla-bla}}" | sed 's/\(file\s*=\s*\){\s*{\s*\([^}]*\)}\s*}\s*/\1{\2}/'
file={bla-bla}
\s matches any whitespace characters that may be present.
This assumes that } does not occur inside the inner string.

How can I use perl/awk/sed to search for all occurrences of text wrapped in quotes within a file and then delete them?

How can I use perl, awk, or sed to search for all occurrences of text wrapped in quotes within a file, and print the result of deleting those occurrences from the file? I do not want to actually alter the file, but simply print the result of altering the file like sed does.
For example, say the file contains the following :
data|more data|"not important"|"more unimportant stuff"
I need it to print out:
data|more data||
But I want to leave the file intact. I tried using sed but I could not get it to accept regexs.
I have tried something like this:
sed -e 's/\<["]+[^"]*["]+\>//g' file.txt
but it does nothing and prints the original file.
Any Thoughts?
Using a perl one-liner:
perl -pe 's/".*?"//g' file
Explanation:
Switches:
-p: Creates a while(<>){...; print} loop for each line in your input file.
-e: Tells perl to execute the code on command line.
You seem to have a few extra characters in your sed command.
sed -e 's/"[^"]*"//g' file.txt
Input:
"quoted text is here" but not quoted there
never more
"hello world" foo bar
data|more data|"not important"|"more unimportant stuff"
Output:
but not quoted there
never more
foo bar
data|more data||
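As far as I can tell, the original attempt fails because sed uses basic regular expressions by default: there + is a literal plus sign, and in GNU sed \< and \> are word-boundary anchors, so the pattern never matches a quoted string. A quick sketch of the corrected command on the line from the question (variable names are only for illustration):

```shell
line='data|more data|"not important"|"more unimportant stuff"'

# Delete every "..." run; [^"]* keeps each match inside one pair of quotes.
stripped=$(printf '%s\n' "$line" | sed 's/"[^"]*"//g')

printf '%s\n' "$stripped"
```

Since sed writes to standard output and -i is not used, the input file itself stays untouched.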
echo 'data|more data|"not important"|"more unimportant stuff"' | sed -E 's/"[^"]*"//g'
You don't need to declare a character class (brackets) for only one character...
my $cnt = qq(data|more data|"not important"|"more unimportant stuff");
# Collect the |-separated fields that neither start nor end with a double quote.
my @arr = $cnt =~ m{(?:^|\|)([^"][^\|]*[^"])(?=\||$)}ig;
print "@arr";
This code might help you.

sed - Include newline in pattern

I am still a noob to shell scripts but am trying hard. Below, is a partially working shell script which is supposed to remove all JS from *.htm documents by matching tags and deleting their enclosed content. E.g. <script src="">, <script></script> and <script type="text/javascript">
find $1 -name "*.htm" > ./patterns
for p in $(cat ./patterns)
do
sed -e "s/<script.*[.>]//g" $p #> tmp.htm ; mv tmp.htm $p
done
The problem with this script is that, because sed reads its input line by line, it does not work as expected when a tag spans several lines. Running it on:
<script>
//Foo
</script>
will remove the opening script tag but leave the "//Foo" line and the closing tag behind, which I don't want.
Is there a way to match new-line characters in my regular expression? Or if sed is not appropriate, is there anything else I can use?
Assuming that you have <script> tags on different lines, e.g. something like:
foo
bar
<script type="text/javascript">
some JS
</script>
foo
the following should work:
sed '/<script/,/<\/script>/d' inputfile
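A quick sketch of that range delete on the sample above, reading from standard input instead of a file. The variable names are only for illustration:

```shell
page='foo
bar
<script type="text/javascript">
some JS
</script>
foo'

# Delete every line from one containing <script through one containing </script>.
cleaned=$(printf '%s\n' "$page" | sed '/<script/,/<\/script>/d')

printf '%s\n' "$cleaned"
```

One caveat: sed only looks for the end of a range on the lines after the one that opened it, so a <script ...></script> pair sitting on a single line would keep deleting until the next </script> or the end of the file.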
This awk script looks for the <script*> tag, sets the inside variable and moves on to the next line. When the closing </script*> tag is found, the variable is reset to zero. The final rule prints a line only if inside is zero. (The variable cannot be called in, because in is a reserved word in awk.)
awk '/<script.*>/ { inside=1; next }
/<\/script.*>/ { if (inside) inside=0; next }
{ if (!inside) print; } ' "$1"
As you mentioned, the issue is that sed processes input line by line.
The simplest workaround is therefore to make the input a single line, e.g. replacing newlines with a character which you are confident doesn't exist in your input.
One would be tempted to use tr :
… |tr '\n' '_'|sed 's~<script>.*</script>~~g'|tr '_' '\n'
However "currently tr fully supports only single-byte characters", and to be safe you probably want to use some improbable character like ˇ, for which tr is of no help.
Fortunately, the same thing can be achieved with sed, using branching.
Back to our <script>…</script> example, this works and should be (according to the previous link) cross-platform:
… |sed -e ':a' -e 'N' -e '$!ba' -e 's/\n/ˇ/g' -e 's~<script>.*</script>~~g' -e 's/ˇ/\n/g'
Or in a more condensed form if you use GNU sed and don't need cross-platform compatibility :
… |sed ':a;N;$!ba;s/\n/ˇ/g;s~<script>.*</script>~~g;s/ˇ/\n/g'
Please refer to the linked answer under "using branching" for details about the branching part (:a;N;$!ba;). The remaining part is straightforward :
s/\n/ˇ/g replaces all newlines with ˇ ;
s~<script>.*</script>~~g removes what needs to be removed (beware that it requires some securing for actual use : as is it will delete everything between the first <script> and the last </script> ; also, note that I used ~ instead of / to avoid escaping of the slash in </script> : I could have used just about any single-byte character except a few reserved ones like \) ;
s/ˇ/\n/g readds newlines.
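A reduced sketch of just the branching part, without the placeholder character. It assumes a sed in which . also matches a newline embedded in the pattern space (GNU sed documents this; other implementations may differ), and the sample document is made up:

```shell
doc='before
<script>
//Foo
</script>
after'

# :a / N / $!ba appends every remaining line to the pattern space, so the
# final substitution sees the whole input as a single string.
slurped=$(printf '%s\n' "$doc" |
  sed -e ':a' -e 'N' -e '$!ba' -e 's/<script>.*<\/script>\n//')

printf '%s\n' "$slurped"
```

The same greediness caveat applies: with several script blocks in the input, .* would swallow everything from the first <script> to the last </script>.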