SED - remove attribute from HTML tag - regex

I want to remove a specific attribute(name in my example) from the HTML tag, that might be in different positions for each line in my file
Example Input:
<img name="something_random_for_each_tag" src="https://websiteurl.com/286.jpg" alt="img">
Expected output:
<img src="https://websiteurl.com/286.jpg" alt="img">
My code:
sed 's/name=".*"//g' <<< '<img name="something_random_for_each_tag" src="https://websiteurl.com/286.jpg" alt="img">'
but it only shows <img >, I am losing src attribute as well
Notes:
name attribute might be in any position in a tag (not necessarily at the beginning)
you can use sed, awk, Perl, or anything you like, it should work on the command line

Your sed expression is greedy: .* matches everything up to the last " in the line. It should have been
sed 's/ name="[^"]*"//g'
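The corrected expression can be checked against tags where name sits in different positions. This is only a minimal sketch: strip_name is a made-up helper name, and it assumes the attribute is preceded by a single space and its value is double-quoted.

```shell
# strip_name is a hypothetical helper name, used here only for illustration.
# It removes a name="..." attribute together with the space in front of it.
# [^"]* cannot run past the closing quote, so other attributes survive.
strip_name() {
  sed 's/ name="[^"]*"//g'
}

# name as the first attribute
printf '%s\n' '<img name="abc" src="https://websiteurl.com/286.jpg" alt="img">' | strip_name
# name as the last attribute
printf '%s\n' '<img src="https://websiteurl.com/286.jpg" alt="img" name="abc">' | strip_name
```

Both calls should print the tag with only src and alt left. The leading space in the pattern also keeps attributes such as data-name from being matched.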

With your shown samples, please try the following. Written and tested in GNU awk.
awk '/^<img/ && match($0,/src.*/){print substr($0,1,4),substr($0,RSTART,RLENGTH)}' Input_file
2nd solution: Using sub(substitute function) of awk.
awk '/^<img/{sub(/name="[^"]*" /,"")} 1' Input_file
Explanation:
1st solution: uses awk's match function to match from src to the end of the line, then prints the first 4 characters of the line, a space, and the matched text.
2nd solution: if the line starts with <img, substitutes name=" up to the next " (plus the space after it) with an empty string, then prints the current line.

Related

How do I remove a particular pattern with a number sequence sed

I'm very new to sed bash command, so trying to learn.
I'm currently faced with a few thousand markdown files i need to clean up and I'm trying to create a command that deletes part of the following
# null 864: Headline
body text
I need everything that comes before the headline deleted, which is '# null 864: '
It's always '# null ', then some digits, then ': '
I'm using gnu-sed (gsed) because I'm on a Mac
The best I've come up with so far is
gsed -i '/#\snull\s([1-9]|[1-9][0-9]|[1-9][0-9][0-9]|[1-9][0-9][0-9][0-9]):\s/d' *.md
The above does not seem to work.
However, if I do
gsed -i '/#\snull/d' *.md
it does what I want, but it also does some unintended things to the body text.
How do I make sure that only the headline and the body text remain?

If you want to print only the part before Headline and no other lines, try the following.
sed -E -n 's/^(#\s+null\s+[0-9]+:\s+)Headline/\1/p' Input_file
If you want to print the part before Headline and, where there is no match, print the complete line, try the following:
sed -E 's/^(#\s+null\s+[0-9]+:\s+)Headline/\1/' Input_file
Explanation: the -E option of sed enables ERE (extended regular expressions), and the s command performs the substitution. The regex matches # followed by whitespace, null, whitespace, digits, a colon and whitespace, and keeps all of that in the 1st capturing group; the substitution then replaces the whole match (including Headline) with that 1st capturing group.
NOTE: The commands above print to the terminal. To save the changes in place, add the -i option once you are satisfied with the output.
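If the aim is the one stated in the question, dropping the '# null 864: ' prefix and keeping the headline text, the same -E substitution can be turned around. The following is a minimal sketch rather than the answerer's command; strip_prefix is a made-up helper name, and [[:space:]] is used in place of \s so the pattern does not rely on GNU extensions.

```shell
# strip_prefix is a hypothetical helper name, used here only for illustration.
# It deletes "# null <digits>: " at the start of a line and leaves
# every other line untouched.
strip_prefix() {
  sed -E 's/^#[[:space:]]+null[[:space:]]+[0-9]+:[[:space:]]+//'
}

printf '%s\n' '# null 864: Headline' 'body text' | strip_prefix
```

This should print Headline followed by body text. Because the pattern is anchored with ^ and requires the digits and colon, a body line that merely mentions "# null" is not affected.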

If I'm understanding correctly, you have files like this:
This should get deleted
This should too.
# null 864: Headline
body text
this should get kept
You want to keep the headline, and everything after, right? You can do this in awk:
awk '/# null [0-9]+:/,eof {print}' foo.md
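Note that eof is not an awk keyword here. It is an uninitialised variable, which is never true as the end of the range, so the range stays open until the input ends. A sketch that spells this out with a literal 0 (from_headline is a made-up helper name, and the ^ anchor is an added assumption that the marker starts the line):

```shell
# from_headline is a hypothetical helper name, used here only for illustration.
# The range starts at the first "# null <digits>:" line. The end pattern 0 is
# never true, so every line from there to the end of input is printed.
from_headline() {
  awk '/^# null [0-9]+:/,0'
}

printf '%s\n' 'This should get deleted' '# null 864: Headline' 'body text' | from_headline
```

This should print the headline line and body text, and drop the line before them.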

You might use awk, and replace the # null 864: part with an empty string using sub.
See this page to either create a new file, or to overwrite the same file.
The }1 prints the whole line as 1 evaluates to true.
awk '{sub(/^# null [0-9]+:[[:blank:]]+/,"")}1' file
The pattern matches
^# null Match literally from the start of the string
[0-9]+:[[:blank:]]+ match 1+ digits, then : and 1+ spaces
Output
Headline
body text

On a Mac, ed should be installed by default, so you could use that.
The content of script.ed
g/^# null [[:digit:]]\{1,\}: Headline$/s/^.\{1,\}: //
,p
Q
for file in *.md; do ed -s "$file" < ./script.ed; done
If the output is ok, remove the ,p and change the Q to w so it can edit the file in-place
g/^# null [[:digit:]]\{1,\}: Headline$/s/^.\{1,\}: //
w
Run the loop again.

I'd use a range in sed same as Andy Lester's awk solution.
Borrowing his infile,
$: cat tst.md
This should get deleted
This should too.
# null 864: Headline
body text
this should get kept
$: sed -E -i '/^# null [0-9]+:/,${p;d};d' tst.md
$: cat tst.md
# null 864: Headline
body text
this should get kept

replacing column field separator using sed

I have a text file 1.txt:
cam:45c62741b9c99e1dcf3c140e8e3df635::dv:johnybold#yahoo.com:83.228.32.24
gamer:3dabd5bd7984b0286eba52d4a7db2dea:$Wm?1Z3MPErXl7%yk^Pc#%iu\9LFc{:octopus#vida.tv:93.182.154.63
:adc0a54f8d21694848200ae043fa99f2:GqJ:LOLPELIC#trash-mail.com:84.176.127.30
! Aa:da99417e29ab0aa67f97db64f091836b:k_P:prus_da#yahoo.com:82.179.236.154
I want to change the column separator (currently it is ':') to '||o||'.
I want to change only the 1st, 3rd and 4th column separator as 2nd column contains something like hash:salt.
The script I am trying is:
sed 's/:/||o||/1;s/:/||o||/2;s/:/||o||/2' 1.txt
The only problem is in the results where ':' is included in the salt.
The output I am getting is:
cam||o||45c62741b9c99e1dcf3c140e8e3df635:||o||dv||o||johnybold#yahoo.com:83.228.32.24
gamer||o||3dabd5bd7984b0286eba52d4a7db2dea:$Wm?1Z3MPErXl7%yk^Pc#%iu\9LFc{||o||octopus#vida.tv||o||93.182.154.63
||o||adc0a54f8d21694848200ae043fa99f2:GqJ||o||LOLPELIC#trash-mail.com||o||84.176.127.30
! Aa||o||da99417e29ab0aa67f97db64f091836b:k_P||o||prus_da#yahoo.com||o||82.179.236.154
The first line of the output is wrong.
Expected output :
cam||o||45c62741b9c99e1dcf3c140e8e3df635::dv||o||johnybold#yahoo.com||o||83.228.32.24
Rest of the output is correct.
What I am expecting is replace first ':' from forward and second and third time the replacement should be from backwards, so that ':' in the salt gets ignored.

Try this:
(?:^[^:]*\K:)|(:(?=[^:]+:?[^:]+$))
Basic idea:
Either get the first : that occurs in the line
Or : that is followed by at most one other :
How to run it with perl:
perl -p -e 's/(?:^[^:]*\K:)|(:(?=[^:]+:?[^:]+$))/||o||/g' input.txt

Short sed solution:
sed -E 's/:+/||o||/3g; s/:/||o||/' file
The output:
cam||o||45c62741b9c99e1dcf3c140e8e3df635::dv||o||johnybold#yahoo.com||o||83.228.32.24
gamer||o||3dabd5bd7984b0286eba52d4a7db2dea:$Wm?1Z3MPErXl7%yk^Pc#%iu\9LFc{||o||octopus#vida.tv||o||93.182.154.63
||o||adc0a54f8d21694848200ae043fa99f2:GqJ||o||LOLPELIC#trash-mail.com||o||84.176.127.30
! Aa||o||da99417e29ab0aa67f97db64f091836b:k_P||o||prus_da#yahoo.com||o||82.179.236.154
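The "replace from the back" idea in the question can also be written without combining a number with the g flag, which is a GNU sed feature. A minimal sketch, assuming the last two fields (e-mail and IP address) never contain a colon; resep is a made-up helper name.

```shell
# resep is a hypothetical helper name, used here only for illustration.
# 1st command: rewrite the last two colons of the line (anchored with $).
# 2nd command: rewrite the first colon that is still left.
# Any colons in between (the hash:salt part) are not touched.
resep() {
  sed 's/:\([^:]*\):\([^:]*\)$/||o||\1||o||\2/; s/:/||o||/'
}

printf '%s\n' 'cam:45c62741b9c99e1dcf3c140e8e3df635::dv:johnybold#yahoo.com:83.228.32.24' | resep
```

For this first sample line the result should be the expected output given in the question.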

Find a regexp in awk

I have a file with a line like this:
<div class="cell contentCell bbActiveRow" tabindex="-1" style="width: 150px; left: 77px; display: block;" cellposition="15,2"><div class="cell contentCell bbActiveRow last-child" tabindex="-1" style="width: 150px; left: 697px; display: block;" cellposition="15,6">159</div></div><div class="contentRow bb_row" rowindex="16" style="display: block; top: 429px;"><div class="cell first-child " title="Go to box" tabindex="-1" role="linkAction" cellposition="16,0"><span class="pre-child" style="background-color:#16A765;"> </span><span class="link" role="link"> </span></div>
The important bit I want to catch is the 159 in:
,6">159</div>
I can catch it fine with grep:
cat c |grep ',6\">[0-9]\+<'
Now, what I want to do, is actually catch the number itself (159) and print it out.
Note that the actual file I have has several of those lines. Ideally, only the numbers will print out.
I thought I could do it with awk:
cat c | awk ' /,6\">([0-9]\+)/ { print $1 } '
But nope, nothing gets printed out.
Having the regexp ready, and knowing that there are several lines in the file with entries that match the expression (with different numbers), how would you squeeze those numbers out?

This oneliner is an alternate way to do that (using an xpath expression which matches div elements containing a cellposition attribute value ending with ',6'):
# xmllint --html test.html --xpath '//div[substring(@cellposition, string-length(@cellposition) - 1)=",6"]/text()'
159

A pragmatic approach:
cat c | grep -o ',6\">[0-9]\+<' | awk -F'<|>' '{ print $2 }'
-o causes grep to only report the matching part of each line.
awk -F'<|>' '{ print $2 }' then extracts the token between > and <.
As for why your awk command didn't work:
awk uses extended regular expressions, in which + must NOT be escaped as \+ to be recognized as a quantifier.
Even with that fixed, the command wouldn't work, because, by default, awk splits by whitespace, so print $1 will simply report the 1st whitespace-separated token on each matching line, irrespective of the regular expression that caused the match.
The solution at the top even finds multiple matches on a line, but if we assume that there's at most 1, it is relatively straightforward to do it all in awk, if you have GNU awk:
cat c | gawk '{ m=gensub(/^.*,6\">([0-9]+)<.*$/, "\\1", "1"); if (m != $0) print m }'
The non-POSIX gensub() replaces regex matches and returns the replacement, while crucially also supporting backreferences, which the POSIX sub() and gsub() functions do not.
The above matches the entire line, then replaces it with the captured number only (via (escaped) backreference \1), and stores the result in a variable. If the variable doesn't equal the input line, a match was captured, and it is printed.
While a solution with POSIX awk features only is possible (using match(), RSTART, RLENGTH, split()), it would be cumbersome.
Finally, if you have xmllint (OS X does, and some Linux distros), consider guido's answer for a solution that performs actual HTML parsing and applies an XPath query, and is therefore more robust.
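For comparison, the POSIX-only route mentioned above (match(), RSTART, RLENGTH) might look roughly like the following. It is a sketch, not the answerer's code; extract_cell6 is a made-up helper name and the input line is a shortened stand-in for the real file.

```shell
# extract_cell6 is a hypothetical helper name, used here only for illustration.
extract_cell6() {
  awk '{
    s = $0
    # Repeatedly look for ,6">DIGITS< in what is left of the line.
    while (match(s, /,6">[0-9]+</)) {
      # Skip the 4 characters of ,6"> and drop the trailing <.
      print substr(s, RSTART + 4, RLENGTH - 5)
      # Continue searching after the current match.
      s = substr(s, RSTART + RLENGTH)
    }
  }'
}

printf '%s\n' '<div cellposition="15,6">159</div><div cellposition="16,6">42</div>' | extract_cell6
```

The loop is what lets this variant report more than one match per line; each number is printed on its own line.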

Find string in a file after a string pattern using shell script

I have an output file with 4 lines:
storefront/storefront.war/location/header-info.jsp:30:<input type="hidden" id="welcomeConfigValue" value="${welcomeConfig}"/>
storefront/storefront.war/location/header-info.jsp:31:<span id="selected-location" class="top-txt top-nav-fix">
storefront/storefront.war/location/header-info.jsp:33:<span id="headRestName"></span><span class="header-spacing"> | </span><span id="headRestPhone"></span><span class="header-spacing"> | </span>
storefront/storefront.war/location/header-info.jsp:35:<a href="#" class="capitalize link-wht" id="location-show"><fmt:message
I'd like to extract the string after id= with the UNIX shell.
I.e., output should be like this:
welcomeConfigValue
selected-location
headRestName
headRestPhone
location-show

You can try with grep:
grep -Po '\sid="\K[^"]*' file

Command:
sed -r 's/(^.*id=")([^"]+)(.*$)/\2/g' < file.txt
Output:
sdlcb#Goofy-Gen:~/AMD$ sed -r 's/(^.*id=")([^"]+)(.*$)/\2/g' < ff.txt
welcomeConfigValue
selected-location
headRestPhone
location-show
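Here, we group the pattern into 3 sets using "(" and ")". The first set contains all characters from the beginning of the line up to and including 'id="'. The second set contains the characters between the quotes (i.e. between 'id="' and the closing '"'). The third set contains the remaining characters up to the end of the line. The replacement keeps only the 2nd set and discards the 1st and 3rd. Because the leading .* is greedy, only the last id on each line is reported, which is why headRestName is missing from the output above.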
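To list every id, including several on one line, one option that avoids grep -P is to let grep -o emit each match on its own line and trim the wrapper with sed. A minimal sketch, assuming double-quoted values and whitespace before id; list_ids is a made-up helper name.

```shell
# list_ids is a hypothetical helper name, used here only for illustration.
list_ids() {
  # grep -o prints each match on its own line; sed then strips the
  # leading whitespace plus id=" and the closing quote.
  grep -o '[[:space:]]id="[^"]*"' | sed 's/^[[:space:]]id="//; s/"$//'
}

printf '%s\n' '<span id="headRestName"></span><span id="headRestPhone"></span>' | list_ids
```

This should print headRestName and headRestPhone on separate lines.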

Search until next character with sed and regex

I got an image with an URL like:
<img alt="" src="http://www.example-site.com/folder_with_underscore/folder-with-dash/3635/0/235/NumBerS_and_Uc/image.png" />
I'm using sed "s///g".
I'm trying to replace the src value, but that value is different most of the time.
Is there a way to use sed "s/src=\" (until first " ) / new url /g"?
Extra info:
I'm using Cygwin on Windows
and PATH=C:\cygwin\bin in my .bat file

[^"] will match any character apart from ", so you can use:
sed 's/src="[^"]*"/src="NEWURL"/g'
Example:
[me#home]$ echo '<img alt="" src="http://www.example-site.com/folder_with_underscore/folder-with-dash/3635/0/235/NumBerS_and_Uc/image.png" />' | sed 's/src="[^"]*"/src="http:\/\/stackoverflow.com"/g'
<img alt="" src="http://stackoverflow.com" />
Note that this will match up to the first occurrence of ", which is probably what you want. If you really want to match up to the last occurrence of ", you could simply do:
sed 's/src=".*"/src="NEWURL"/g'
The regex is greedy and so will take up as many characters as possible, thus matching up to the last occurrence of ". While this will also work in the example above, it will not behave as expected if other parts of your input also contain ".
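Since the replacement is a URL, it can be convenient to pick a different delimiter for the s command so the slashes need no escaping. A minimal sketch, assuming the new URL contains no |, & or backslash; set_src is a made-up helper name.

```shell
# set_src is a hypothetical helper name; the new URL is passed as $1.
# Using | as the s-command delimiter means slashes in the URL need no escaping.
# Assumption: the URL itself contains no |, & or backslash.
set_src() {
  sed 's|src="[^"]*"|src="'"$1"'"|g'
}

printf '%s\n' '<img alt="" src="http://www.example-site.com/a/b/image.png" />' | set_src 'http://stackoverflow.com'
```

The [^"]* part stops at the first closing quote, so attributes after src are left alone.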

Shawn's solution is mostly correct, but it does not deal with the case in which a newline appears in the src url. sed is really not very good at dealing with such cases, but you can hack a solution:
sed '/src/{
/src="[^"]*"/{ s//src="NEWURL"/; n; }
s/src=".*$/src="NEWURL"/
p
:a
s/.*//;
N
/"/!ba
s/[^"]*"//
}
' input
Note that many of the newlines above are superfluous in some versions of sed, but necessary in others. (In particular, the newline after :a and after the branch command, as some versions of sed will terminate the label only at the newline. I believe that versions of sed which allow a label to terminate with a semi-colon are not strictly compliant with the standard, but it is a common practice.) This script does the simple replacement where appropriate, but if a quote is not found following src=", it enters a loop deleting lines until a terminating " is seen. This is an ugly solution, and I recommend against using sed for parsing xml.
