Convert html tag using sed

I have a tag such as the following:
<div style="position:absolute;opacity:0.5" class="header">Home</div>
(there may or may not be a style or other attribute) and using sed I need to convert it to a span where the id of the span is the class of the div:
<span style="position:absolute;opacity:0.5" id="header">Home</span>
I know how to do this in PHP but unfortunately my Linux is lacking :).
The regex to find the eligible DIVs is something along:
#<div .* class=(.*)>.*</div>#
but I don't know how to write the replacement part, mainly because I need to keep the content between the div tag name and the id. It's 4:45 am so that may have something to do with it as well :p.
I'd appreciate any help on this, thank you.

Using sed, and if you want more specific handling:
sed '/<div/{s/<div /<span /;s/ class *= */ id=/;s!</div!</span!}' input
Still, this assumes the start and close tags are on the same line, and that there is a single div tag on that line. It also assumes that the class attribute is the only one on that line.
A more strict command is:
sed 's!<div\([^>]*\) class *= *\([^>]*\)>\([^<]*\)</div>!<span\1 id=\2>\3</span>!g' input
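For comparison, here is a rough Python sketch of the same substitution as the stricter sed command. It carries the same assumptions (the whole tag on one line, class as the last attribute, no nested tags), and the name div_to_span is made up for this example.

```python
import re

# Same idea as the stricter sed command: capture the attributes before
# "class", the class value, and the text content, then rebuild as a span.
DIV_RE = re.compile(r'<div([^>]*) class *= *([^>]*)>([^<]*)</div>')


def div_to_span(line: str) -> str:
    """Hypothetical helper: <div ... class=X>text</div> -> <span ... id=X>text</span>."""
    return DIV_RE.sub(r'<span\1 id=\2>\3</span>', line)


if __name__ == "__main__":
    sample = '<div style="position:absolute;opacity:0.5" class="header">Home</div>'
    print(div_to_span(sample))
```

Lines that do not match the pattern are returned unchanged, so it is safe to run over a whole file line by line.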

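sed 's/div/span/g;s/class/id/' foo.html
Will output
<span style="position:absolute;opacity:0.5" id="header">Home</span>
Where foo.html is your document
PAY ATTENTION
The "g" flag at the end of the first substitution is needed so that both the opening and the closing div are replaced; without it, only the first occurrence on each line is changed. The second substitution replaces only the first occurrence of class on each line; add "g" there too (s/class/id/g) if you want to replace all of them. Note that these patterns match the strings div and class anywhere on the line, not just inside tags.
And, no less important, if you want to overwrite your document (that is, replace the occurrences "in place"), you have to proceed in the following way: sed -i -e 's/div/span/g;s/class/id/' foo.html (keep -i and -e separate; written as -ie, sed takes the "e" as a backup-file suffix).
Last thing: as Basile Starynkevitch correctly says in the comments, maybe sed isn't the best choice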

Related

Use sed to delete all occurrences of <a name="foo">

I have multiple html documents and each one has many occurrences of
<a name="pIDsomestring">
where 'somestring' varies with each occurrence.
I want to delete the entire tag, as well as the
</a>
closing HTML tag that immediately follows it, but importantly, not the text inside the anchor tag.
Is there an easy way to do this with sed?

HTML is much more complicated than what can be parsed with sed. Two pieces of HTML can be absolutely equivalent, and yet look completely different as far as a sed command is concerned. For example, you can't really write a sed command that will recognize that these two are equivalent:
<a name="foo">bar</a>
<A
NAME = "foo"
><!-- </A> --bar</>-- -->
(The </>, if you're wondering, means </a> in this case. And heh, even Stack Overflow's syntax highlighter gets confused by the <!-- comment -- not-a-comment -- comment --> notation.)
The above is a pathological example, of course, but even perfectly-ordinary real-world HTML often has line-breaks and other whitespace in random places that have no effect on the HTML but a great deal of effect on a sed command.
But if you're just doing a one-off task where you can manually verify the results afterward, you can try something like this:
sed 's#<a name="[^"]*">\(\([^<]\|<[^/]\|</[^a]\|</a[^>]\)*\)</a>#\1#g'
which will usually work as long as the whole thing is on one line.
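If Python is available, a similar one-off cleanup can be sketched with re.sub. This is still a regex, not an HTML parser, so the caveats above apply; it only adds tolerance for upper-case tags and for anchor text that spans lines. The function name strip_named_anchors is hypothetical.

```python
import re

# Match <a name="...">, then lazily capture everything up to the first </a>.
# IGNORECASE tolerates <A NAME=...>; DOTALL lets the anchor text span lines.
ANCHOR_RE = re.compile(r'<a\s+name="[^"]*"\s*>(.*?)</a>', re.IGNORECASE | re.DOTALL)


def strip_named_anchors(html: str) -> str:
    """Remove <a name="..."> and its closing </a>, keeping the text in between."""
    return ANCHOR_RE.sub(r'\1', html)


if __name__ == "__main__":
    page = 'See <a name="pIDabc">chapter 1</a> and <a href="x.html">this link</a>.'
    print(strip_named_anchors(page))
```

Anchors that carry an href instead of a name are left alone, because the pattern requires name= directly after the tag name.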

Notepad++ Regex to remove styling

I need to remove some tags from a whole lot of html pages.
Lately I discovered the option of regex in Notepad++
But.. Even after hours of Googling I don't seem to get it right.
What do I need?
Example:
<p class=MsoNormal style='margin-left:19.85pt;text-indent:-19.85pt'><spanlang=NL style='font-size:11.0pt;font-family:Symbol'>·<span style='font:7.0pt "Times New Roman"'> </span></span><span lang=NL style='font-size:9.0pt;font-family:"Arial","sans-serif"'>zware uitvoering met doorzichtige vulruimte;</span></p>
I need to remove everything about styling, classes and IDs, so that only the clean tags remain, without anything else.
Anyone able to help me on this one?
Kind regards
EDIT
Check an entire file via pastebin: http://pastebin.com/0tNwGUWP

I think this pattern will erase all styles in "p" and "span" tags :
((?<=<p)|(?<=<span))[^>]*(?=>)
=> how it works:
((?<=<p)|(?<=<span)) : This is a lookbehind block to make sure that the string we are looking for comes after <p OR <span
[^>]* : Matches any run of characters that are not a > character
(?=>) : This is a lookahead block to make sure that the string we are looking for comes before a > character
PS: Tested on Notepad++
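To see what the pattern does outside Notepad++, here is a small Python sketch. Python's re module accepts the same lookbehind/lookahead syntax for this pattern, but it is a different regex engine, so treat this as an illustration only; strip_attrs is a made-up name.

```python
import re

# The pattern from the answer: two fixed-width lookbehinds, a run of
# non-'>' characters, and a lookahead for the closing '>'.
ATTRS_RE = re.compile(r"((?<=<p)|(?<=<span))[^>]*(?=>)")


def strip_attrs(html: str) -> str:
    """Delete everything between '<p' or '<span' and the next '>'."""
    return ATTRS_RE.sub("", html)


if __name__ == "__main__":
    print(strip_attrs("<p class=MsoNormal style='margin-left:19.85pt'>text</p>"))
    print(strip_attrs("<pre class=x>"))
```

The second call shows a side effect worth knowing about: the lookbehind only checks for the two characters <p, so other tags starting with p (such as <pre>) are also cut down to <p>.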

If the sample you provided is representative of what you need to process, then the following quick and dirty solution will work:
Find what: [a-z]+='[^']*'
Replace with:
Find what: [a-z]+=[a-zA-Z]*
Replace with:
You must run the first one first to pick up the style='...' attributes, and then run the second to pick up the unquoted attributes such as class=MsoNormal and lang=NL.
There's good reason why other posters are saying don't attempt to parse HTML this way. You'll end up in all sorts of trouble, since regex in general cannot handle all the wonderful weirdness of HTML.

My advice is as follows.
As I see in your sample text, you have only "p" and "span" tags that need to be handled, and you apparently want to remove all the styles inside them. In this case, you could consider removing everything inside those tags, leaving them as simple <p> or <span>.
I don't know about Notepad++ but a simple C# program can do this job quickly.

Assuming <spanlang=NL is a typo (should be <span lang=NL), I'd do:
Find what: (<\w+)[^>]*>
Replace with: $1>
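The same find/replace can be tried in Python, where the replacement is written \1> instead of $1>. This is only a sketch to show the behaviour on the sample; bare_tags is a hypothetical name.

```python
import re

# Capture '<' plus the tag name, then swallow everything up to the next '>'.
TAG_RE = re.compile(r"(<\w+)[^>]*>")


def bare_tags(html: str) -> str:
    """Reduce every opening tag to just its name, dropping all attributes."""
    return TAG_RE.sub(r"\1>", html)


if __name__ == "__main__":
    print(bare_tags("<span lang=NL style='font-size:9.0pt'>zware uitvoering</span></p>"))
    print(bare_tags("<spanlang=NL style='font-size:11.0pt'>"))
```

Closing tags are untouched because \w does not match the slash. The second call shows why the typo matters: \w+ takes spanlang as the tag name, so the result is <spanlang> rather than <span>.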

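If you don't mind doing a little bit of programming: HtmlAgilityPack can easily remove scripts/styles/whatever from your XML/HTML.
Example:
// Requires "using System.Linq;"; html is a string holding the page source.
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
// Remove every <script> and <style> element; ToList() takes a snapshot
// so nodes can be removed while iterating.
doc.DocumentNode.Descendants()
    .Where(n => n.Name == "script" || n.Name == "style")
    .ToList()
    .ForEach(n => n.Remove());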

Delete text outside of tags

Using vim, I am attempting to remove all text outside of <text> blocks. This needs to span across newlines and other (unrelated) tags.
I have attempted to use regex to substitute text for newlines, but failed for a couple of reasons, one of which was my attempts did not span multiple lines, and I need to have my matches be non-greedy. (Is that accomplished using {-} somehow?)
The regex that should match the content I would like to delete would look like: <//text>.*<text.*> but if I make this match non-greedy, I may have other issues. (I also realize I'll have one partial tag section to clean up at the beginning doing this.)
Is there another approach that I should be taking, or can someone guide me to remove all content not between such tags using vim?
EDIT: Including sample text
<contributor>
<username>MalafayaBot</username>
<id>628</id>
</contributor>
<minor />
<comment>Robô: A modificar Categoria:Vocábulo de étimo latino (Português) para Categoria:Entrada de étimo latino (Português)</comment>
<text xml:space="preserve">={{-pt-}}=
==Substantivo==
{{flex.pt|ms=excerto|mp=excertos}}
{{paroxítona|ex|cer|to}} {{m}}
# [[extrato]] de um [[texto]], [[fragmento]]
#: ''A seguir, um '''excerto''' do texto original.''
===Tradução===
{{tradini}}
* {{trad|es|extracto}}
* {{trad|fr|extrait}}
{{tradmeio}}
* {{trad|en|excerpt}}
{{tradfim}}
=={{etimologia|pt}}==
:Do latim ''[[excerptu]]'' (colhido de).
=={{pronúncia|pt}}==
===Brasil===
* [[SAMPA]]: /e."sEx.tu/
* [[AFI]]: /esˈertu/
[[zh:excerto]]</text>
<sha1>8i1zywj37s74ah4wnai11ohorfjn8j5</sha1>
<model>wikitext</model>

Your struggles with regular expressions indicate that you're using the wrong tool for the job.
For text extraction from XML, you can use XSLT, which will handle all special cases far better than a regular expression. Or use special-purpose tools like xidel, a kind of grep for XML. With it, the extraction is as easy as:
xidel --extract "//text" input.xml
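If neither XSLT nor xidel is at hand, the same parser-based idea can be sketched with Python's standard xml.etree.ElementTree. Assumptions: the input is well-formed XML with a single root element (the excerpt in the question is only a fragment, so the <revision> wrapper below is invented for the demo) and the elements carry no namespace. The function name extract_text_blocks is hypothetical.

```python
import xml.etree.ElementTree as ET


def extract_text_blocks(xml_source: str) -> list:
    """Return the content of every <text> element, at any depth."""
    root = ET.fromstring(xml_source)
    # iter("text") walks the whole tree, much like the XPath //text
    return [el.text or "" for el in root.iter("text")]


if __name__ == "__main__":
    # Invented wrapper element; a real file would supply its own root.
    sample = """<revision>
  <comment>not wanted</comment>
  <text xml:space="preserve">={{-pt-}}=
==Substantivo==</text>
  <sha1>abc</sha1>
</revision>"""
    for block in extract_text_blocks(sample):
        print(block)
```

If the real file declares a default namespace on its root element, the tag name passed to iter() would need that namespace in front of it.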

If you don't NEED to use vim, you can try this sed command; just replace "test" with the name of your file. I would test this on a COPY of your file first, since the -i option tells sed to modify the actual file you pass in.
sed -i 's/<\/text>[^<]*/<\/text>/g' test
EDIT: After seeing the sample, I'm going to take a different approach. Instead of getting rid of all the text not within the tags, I'm going to select all the <text> blocks and output them to a new file. Hopefully your version of grep supports the -P option. Try this:
grep -Pzo "(?s)<text.*?<\/text>" sample.txt > out.txt
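In case grep on your system lacks -P, roughly the same selection can be sketched in Python. re.DOTALL plays the role of (?s), and .*? is the non-greedy match; the function name text_blocks is made up for this example.

```python
import re

# Non-greedy match from '<text' to the first '</text>'; DOTALL lets '.'
# cross line breaks, like (?s) in the grep command.
TEXT_BLOCK_RE = re.compile(r"<text.*?</text>", re.DOTALL)


def text_blocks(source: str) -> list:
    """Return every <text ...>...</text> block, tags included."""
    return TEXT_BLOCK_RE.findall(source)


if __name__ == "__main__":
    sample = '<comment>skip</comment>\n<text xml:space="preserve">line 1\nline 2</text>\n<sha1>skip</sha1>'
    print("\n".join(text_blocks(sample)))
```

Like the grep command, this keeps the surrounding <text> tags in the output.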

I assume that there is only one <text> block in your file. In vim this line works for your sample text:
%s#\_.*\(<text.\{-}>\_.*</text>\)\_.*#\1#

regular expression to copy an html tag to another place

In html I need to copy the text that is in the title tag, to a new line as a h1 tag in a certain place..
Example:
<TITLE>any text here, including special characters like .,:- or whataever</TITLE>
The output should leave that title as is, and then, under this line:
<font face="Arial"><span style="font-size: 14pt">
it adds
<h1 align="center">any text here, including special characters like .,:- or whataever</h1>
I use Notepad++ to search and replace, but I don't think it would work in this case, so please suggest any other simple and effective Windows program to use, with steps if any. I need to edit this in bulk for many files.
Thanks in advance,

As this question is still open, and you have a pretty bad accept rate, I'll try to offer you a solution:
Search for:
<TITLE>(.*)</TITLE>(.*?)<font face="Arial"><span style="font-size: 14pt">
And make sure to check the ". matches Newline" option in the search box. Then use the Following replace:
<TITLE>\1</TITLE>\2<font face="Arial"><span style="font-size: 14pt">\n<h1 align="center">\1</h1>
Worked in my notepad++.
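For the bulk case, the same replacement could be scripted. Below is a minimal Python sketch of the per-file step, assuming each page has one TITLE and the marker line appears exactly as quoted in the question; add_heading and MARKER are names invented here.

```python
import re

TITLE_RE = re.compile(r"<TITLE>(.*?)</TITLE>", re.DOTALL)
# The line after which the heading should be inserted (taken from the question).
MARKER = '<font face="Arial"><span style="font-size: 14pt">'


def add_heading(html: str) -> str:
    """Insert an <h1> holding the title text on a new line after MARKER."""
    match = TITLE_RE.search(html)
    if match is None or MARKER not in html:
        return html  # nothing to do; leave the page untouched
    heading = '<h1 align="center">' + match.group(1) + "</h1>"
    # Replace only the first occurrence of the marker line.
    return html.replace(MARKER, MARKER + "\n" + heading, 1)


if __name__ == "__main__":
    page = '<TITLE>My page: a-b</TITLE>\n<body>\n' + MARKER + '\ntext'
    print(add_heading(page))
```

Using str.replace for the insertion avoids having to escape special characters in the title text. To process many files, this function would be applied to the contents of each file in turn.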

The regex you need to use should be like:
/^<TITLE>(.*)<\/TITLE>/
Now I'll give you a small Perl snippet that takes in a line, matches against the above regex and spits out the text in the title tags wrapped in <h1>...</h1>. Remember that in regexes, the parentheses are used to "group" matches - or in layman terms, to store them for future references. The script:
use strict;
use warnings;
my $line = <STDIN>;
chomp $line;
# Print only on a successful match; otherwise $1 would be undefined.
print "<h1>$1</h1>\n" if $line =~ /^<TITLE>(.*)<\/TITLE>/;
Here, all text between TITLE tags is grouped and hence stored in the variable $1.
When I do:
echo "<TITLE>lama</TITLE>" | perl script.pl
I get:
<h1>lama</h1>
Hope that helps.

regex for all characters on yahoo pipes

I have an apparently simple regex query for pipes - I need to truncate each item from its (<img>) tag onwards. I thought a loop with string regex of <img[.]* replaced by blank field would have taken care of it but to no avail.
Obviously I'm missing something basic here - can someone point it out?
The item as it stands goes along something like this:
sample text title
<a rel="nofollow" target="_blank" href="http://example.com"><img border="0" src="http://example.com/image.png" alt="Yes" width="20" height="23"/></a>
<a.... (a bunch of irrelevant hyperlinks I don't need)...
Essentially I only want the title text and hyperlink that's why I'm chopping the rest off
Going one better because all I'm really doing here is making the item string more manageable by cutting it down before further manipulation - anyone know if it's possible to extract a href from a certain link in the page (in this case the 1st one) using Regex in Yahoo Pipes? I've seen the regex answer to this SO q but I'm not sure how to use it to map a url to an item attribute in a Pipes module?

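You need to remove the line breaks with a RegEx Pipe: replace the pattern [\r\n] with null text on the content or description field to make it a single line of text. Then you can use the .* wildcard, which will run to the end of the line. Note that inside square brackets the dot loses its special meaning, so <img[.]* matches only <img followed by literal periods; use <img.* instead.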
http://www.yemkay.com/2008/06/30/common-problems-faced-in-yahoo-pipes/
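The two steps can be tried outside Pipes to check the patterns. This Python sketch only illustrates the regex logic; the Pipes Regex module has its own options and may differ in details, and truncate_at_img is a made-up name.

```python
import re


def truncate_at_img(item: str) -> str:
    """Drop everything from the first <img tag to the end of the item."""
    # Step 1: remove line breaks so the item becomes a single line.
    single_line = re.sub(r"[\r\n]", "", item)
    # Step 2: '.' outside brackets matches any character, so '.*' runs to the end.
    return re.sub(r"<img.*", "", single_line)


if __name__ == "__main__":
    item = (
        'sample text title\n'
        '<a rel="nofollow" href="http://example.com"><img border="0" src="http://example.com/image.png"/></a>\n'
        '<a href="http://example.com/other">more</a>'
    )
    print(truncate_at_img(item))
    # For comparison, [.] only matches literal periods, so little is removed:
    print(re.sub(r"<img[.]*", "", '<img src="x.png">'))
```

The first call keeps the title text and the opening <a ...> tag that precedes the image, which is where the href lives.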
