Finding the right regex

I am working on a web project that supports several languages: every piece of HTML text inside a tag must be wrapped in <?php echo _("..."); ?>, so:
<div>My text</div> becomes <div><?php echo _("My text"); ?></div>
There is a huge number of these text occurrences, and I want to find them all so that I can wrap each one in a 'php echo'. Is there a regex that matches these occurrences?
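
One possible approach, sketched in Python: match the text that sits directly between a '>' and the next '<', and wrap it. The function name wrap_text_nodes is made up for this illustration, and the pattern assumes simple markup; it would also wrap the contents of <script> or <style> elements, so treat it as a starting point rather than a finished tool.

```python
import re

# Text that sits directly between a '>' and the next '<'. Leading and trailing
# whitespace is captured separately so it stays outside the PHP call.
TEXT_NODE = re.compile(r'>(\s*)([^<>\s][^<>]*?)(\s*)<')


def wrap_text_nodes(html):
    """Wrap each text node in <?php echo _("..."); ?> (hypothetical helper)."""
    def repl(match):
        lead, text, trail = match.groups()
        # Escape characters that are special inside a double-quoted PHP string.
        escaped = (text.replace('\\', '\\\\')
                       .replace('"', '\\"')
                       .replace('$', '\\$'))
        return '>' + lead + '<?php echo _("' + escaped + '"); ?>' + trail + '<'
    return TEXT_NODE.sub(repl, html)


print(wrap_text_nodes('<div>My text</div>'))
# prints: <div><?php echo _("My text"); ?></div>
```

Tags that contain only other tags are left alone, because the pattern requires at least one non-whitespace character before the next '<'.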

Related

How to use sed to safely find and replace every instance of a regex match?

Let's say I have an HTML file that contains the following scenarios:
1. <p style="1">test</p>
2. <p style="2"><p style="3">test</p></p>
3. <p style="4">test</p><p style="5">test</p>
4. <td style="6"><p style="7">test</p></td>
5. <td style="8"><p style="9">test</p><p style="10">test</p></td>
I want to develop a way to find each instance of <p style="test"> and replace it with <p>. I already know that to find each one I would use a regex like <p .+?> or something similar such as <p .+?(?=>)>, which matches anything that starts with <p, contains any characters after that, and ends in >.
Here's what I've tried so far:
sed -r 's/<p .+?>\b/<p>/'
While this works for scenario one and four just fine, it starts to get very questionable on every other scenario that would contain more than one <p ...>.
sed -r 's/\b<p .+?>\b/<p>/' This doesn't work at all.
I won't list every possible thing I've tried here as I don't think it would bring any meaningful data to someone versed in sed. I know very little about how to use it and what its capabilities are.
What's the best way to go about this? Thanks!

As mentioned in a comment, a tool that actually understands HTML is a better choice than trying to hack something together with regular expressions.
Example perl script using HTML::TreeBuilder module that strips style attributes from p tags:
#!/usr/bin/env perl
use warnings;
use strict;
use HTML::TreeBuilder;
use Data::Dumper;
# Takes the HTML file to process as a command line argument; outputs on
# standard output.
my $tree = HTML::TreeBuilder->new_from_file($ARGV[0]);
die "Unable to parse '$ARGV[0]': $!\n" unless defined $tree;
# Remove the style attribute from every p tag that has one
foreach my $tag ($tree->look_down('style', qr//)) {
    $tag->attr('style', undef) if $tag->tag eq 'p';
}
print $tree->as_HTML(undef, ' ');
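
For input as simple as the five scenarios in the question, the attempts fail for two reasons: sed's POSIX regexes have no lazy quantifier, so .+? does not stop at the first >, and without the g flag only the first match on each line is replaced. A negated character class avoids the first problem. Here is a minimal Python sketch of that idea (the function name strip_p_attributes is invented for this example); it assumes no attribute value contains a literal '>' and it drops every attribute on the p tag, not only style.

```python
import re

# [^>]* cannot run past the closing '>' of the tag, so no lazy quantifier
# is needed. The \s after 'p' keeps tags such as <pre ...> from matching.
P_WITH_ATTRS = re.compile(r'<p\s[^>]*>')


def strip_p_attributes(html):
    """Replace every <p ...> opening tag with a bare <p> (hypothetical helper)."""
    return P_WITH_ATTRS.sub('<p>', html)


print(strip_p_attributes(
    '<td style="8"><p style="9">test</p><p style="10">test</p></td>'))
# prints: <td style="8"><p>test</p><p>test</p></td>
```

The same pattern should carry over to sed as s/<p [^>]*>/<p>/g, although a real HTML parser remains the safer choice for anything less regular.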

find and everything including other tags with regex

I am trying to match everything from one div to the start of another element, including everything in between, even if there are line breaks and even if there are other tags and PHP functions in between.
I want the match to start at <div id="constant_strip"> and end at <?php include....
I want to delete that div and everything that comes after it until the <?php include is reached,
even though there are other <?php and <div> tags between those.
I have searched everywhere, but in all the regex and wildcard searches I can find, people want to stop at the end of the div, exclude divs that are nested inside it, or match only text.
I have tried ([^<]), ([\s\S])+ and similar patterns, but none of them work.
Basically I want to change this:
<div id="constant_strip" class="clearfix">
<div class="clearfix"><a href="<?php echo $division ?>_brands.php">
<img src="images/people.png" width="22" height="21" style="<?php echo $stripColour ?>" />View our suppliers</a></div>
<div id="call">Call us: 021 323 4088</div>
</div>
</div>
<?php include('/footer.php'); ?>
and turn it into just this: <?php include('/footer.php'); ?>
The problem is that the block does not contain exactly the same information on every page.

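The following regex matches the middle part that you want to replace. Because of the lookbehind, the opening <div id="constant_strip" class="clearfix"> tag itself is not part of the match: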
(?<=<div id="constant_strip" class="clearfix">)[\s\S]*?(?=<\?php include)
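
If the opening div should be deleted as well, the tag can be matched directly instead of through a lookbehind. The following Python sketch shows one way to do that; the function name remove_strip is invented here, and the sample page is shortened from the one in the question.

```python
import re

# Match from the opening constant_strip div up to, but not including, the
# first '<?php include'. re.DOTALL lets '.' match newlines as well.
STRIP_BLOCK = re.compile(
    r'<div id="constant_strip"[^>]*>.*?(?=<\?php include)', re.DOTALL)


def remove_strip(page):
    """Delete the constant_strip block (hypothetical helper)."""
    return STRIP_BLOCK.sub('', page)


page = '''<div id="constant_strip" class="clearfix">
<div class="clearfix"><a href="<?php echo $division ?>_brands.php">View our suppliers</a></div>
<div id="call">Call us: 021 323 4088</div>
</div>
</div>
<?php include('/footer.php'); ?>'''

print(remove_strip(page))
# prints: <?php include('/footer.php'); ?>
```

The lazy .*? stops at the first <?php include, so the <?php echo inside the block does not end the match early.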

Regex in perl/sed replacement not matching whitespace/characters

Given this file, I'm trying to do a super primitive sed or perl replacement of a footer.
Typically I use DOM to parse HTML files but so far I've had no issues due to the primitive HTML files I'm dealing with ( time matters ) using sed/perl.
All I need is to replace the <div id="footer"> which contains whitespace, an element that has another element, and the closing </div> with <?php include 'footer.php';?>.
For some reason I can't even get this pattern to match up to the <div id="stupid">. I know there are whitespace characters, so I used \s*:
perl -pe 's|<div id="footer">.*\s*.*\s*|<?php include INC_PATH . 'includes/footer.php'; ?>|' file.html | less
But that only matches the first line. The replacement looks like this:
<?php include INC_PATH . includes/footer.php; ?>
<div id="stupid"><img src="file.gif" width="206" height="252"></div>
</div>
Am I forgetting something simple, or should I specify some sort of flag to deal with a multiline match?
perl -v is 5.14.2 and I'm only using the pe flags.

You probably want -0777, which forces perl to read the entire file at once. Keep -p (rather than -n) so that the modified text is printed:
perl -0777 -p -e 's|something|else|g' file
Also, your strategy of doing .*\s*.*\s* is pretty fragile. It'll match e.g. <div id="foo", which is just a fragment...

Are you forgetting that almost all regex parsing works on a line-by-line basis?
I've always had to use tr to convert the newlines into some other character, and then back again after the regex.
Just found this: http://www.perlmonks.org/?node_id=17947
If you want . to match newlines, you need the /s option (/m only changes where ^ and $ match). Either way, the regex can only match across newlines if the string it is applied to contains more than one line.

perl -p processes the file line by line (see perl.com).
That means your regex never sees all the lines at once: it only matches on the line that starts with "<div id="footer">", and it no longer matches on the following lines.
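
The difference between line-by-line and whole-file matching can be reproduced in a few lines of Python. This is only a sketch: the sample markup is reconstructed from the output shown in the question (the original file is not available), the helper name replace_footer is invented, and the pattern assumes the footer contains exactly one nested div.

```python
import re

# Assumes the footer holds exactly one nested <div>...</div> before it closes.
FOOTER = re.compile(r'<div id="footer">.*?</div>\s*</div>', re.DOTALL)
REPLACEMENT = "<?php include INC_PATH . 'includes/footer.php'; ?>"


def replace_footer(text):
    """Replace the footer div with the PHP include (hypothetical helper)."""
    return FOOTER.sub(lambda m: REPLACEMENT, text)


html = (
    '<p>content</p>\n'
    '<div id="footer">\n'
    '  <div id="stupid"><img src="file.gif" width="206" height="252"></div>\n'
    '</div>\n'
)

# Line by line, as perl -p does: no single line contains the whole footer.
per_line = ''.join(replace_footer(line) for line in html.splitlines(keepends=True))
print(per_line == html)   # prints: True (nothing was replaced)

# Whole text at once, as perl -0777 does: the footer is replaced.
print(replace_footer(html))
```

Applied to the whole text, the pattern spans the three footer lines and leaves the rest of the page untouched.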

Sed program - deleted strings reappearing?

I'm stumped. I have an HTML file that I'm trying to convert to plain text and I'm using sed to clean it up. I understand that sed works on the 'stream' and works one line at a time, but there are ways to match multiline patterns.
Here is the relevant section of my source file:
<h1 class="fn" id="myname">My Name</h1>
<span class="street-address">123 street</span>
<span class="locality">City</span>
<span class="region">Region</span>
<span class="postal-code">1A1 A1A</span>
<span class="email">my#email.ca</span>
<span class="tel">000-000-0000</span>
I would like this to be made into the following plaintext format:
My Name
123 street
City Region 1A1 A1A
my#email.ca
000-000-0000
The key is that City, Region, and Post code are all on one line now.
I use sed -f commands.sed file.html > output.txt and I believe that the following sed program (commands.sed) should put it in that format:
#using the '#' symbol as delimiter instead of '/'
#remove tags
s#<.*>\(.*\)</.*>#\1#g
#remove the nbsp
s#\(&nbsp;\)*##g
#add a newline before the address (actually typing a newline in the file)
s#\(123 street\)#\
\1#g
#and now the command that matches multiline patterns
#find 'City',read in the next two lines, and separate them with spaces
/City/ {
N
N
s#\(.*\)\n\(.*\)\n\(.*\)#\1 \2 \3#g
}
Seems to make sense. Tags are all stripped and then three lines are put into one.
Buuuuut it doesn't work that way. Here is the result I get:
My Name
123 street
City <span class="region">Region</span> <span class="postal-code">1A1 A1A</span>
my#email.ca
000-000-0000
To my (relatively inexperienced) eyes, it looks like sed is 'forgetting' the changes it made (stripping off the tags). How would I solve this? Is the solution to write the file after three commands and re-run sed for the fourth? Am I misusing sed? Am I misunderstanding the 'stream' part?
I'm running Mac OS X 10.4.11 with the bash shell and using the version of sed that comes with it.

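I think you're confused. Sed operates line by line and runs all commands on the current line before moving to the next. You seem to be assuming it strips the tags on all lines, then goes back and runs the rest of the commands on the stripped lines. That's simply not the case: when the /City/ block runs, the two N commands append the next two raw input lines, which have not been through the tag-stripping command, so their tags remain.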

See RegEx match open tags except XHTML self-contained tags ... and stop using sed for this.
Sed is a wonderful tool, but not for processing HTML. I suggest using Python and BeautifulSoup, which is basically built just for this sort of task.
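
As a rough illustration of the parser-based route, here is a sketch that uses html.parser from the Python standard library instead of BeautifulSoup, so that it runs without extra packages. The class and function names (FieldCollector, to_plain_text) and the rule that locality, region and postal-code share one line are assumptions taken from the desired output in the question.

```python
from html.parser import HTMLParser

# Classes whose text should be joined onto a single line (assumption taken
# from the desired output in the question).
ONE_LINE = ("locality", "region", "postal-code")


class FieldCollector(HTMLParser):
    """Collect (class, text) pairs for elements that contain text."""

    def __init__(self):
        super().__init__()
        self.fields = []
        self._cls = None

    def handle_starttag(self, tag, attrs):
        self._cls = dict(attrs).get("class")

    def handle_endtag(self, tag):
        self._cls = None

    def handle_data(self, data):
        text = data.strip()
        if text and self._cls:
            self.fields.append((self._cls, text))


def to_plain_text(html):
    parser = FieldCollector()
    parser.feed(html)
    parser.close()
    lines, group = [], []
    for cls, text in parser.fields:
        if cls in ONE_LINE:
            group.append(text)      # hold until the group ends
            continue
        if group:
            lines.append(" ".join(group))
            group = []
        lines.append(text)
    if group:
        lines.append(" ".join(group))
    return "\n".join(lines)


sample = """<h1 class="fn" id="myname">My Name</h1>
<span class="street-address">123 street</span>
<span class="locality">City</span>
<span class="region">Region</span>
<span class="postal-code">1A1 A1A</span>
<span class="email">my#email.ca</span>
<span class="tel">000-000-0000</span>"""

print(to_plain_text(sample))
```

Because the parser sees whole elements rather than lines, the result does not depend on how the markup is split across lines.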

If you have only one data block per php file, try the following (using sed)
kent$ cat t
<h1 class="fn" id="myname">My Name</h1>
<span class="street-address">123 street</span>
<span class="locality">City</span>
<span class="region">Region</span>
<span class="postal-code">1A1 A1A</span>
<span class="email">my#email.ca</span>
<span class="tel">000-000-0000</span>
kent$ sed 's/<[^>]*>//g; s/&nbsp;//g' t |sed '1G;3{N;N; s/\n/ /g}'
My Name
123 street
City Region 1A1 A1A
my#email.ca
000-000-0000

How can I search and then replace a code snippet that includes some variables in TextMate?

I have a project that includes 49 folders, each one has a file called index.php
All index.php files are almost the same except for one part that changes depending on the folder it is in.
<?php include_once("/home/bgarch/public_html/galleryheader.html"); ?>
<?php include_once("/home/bgarch/public_html/culture/loadscripts.html"); ?>
</head>
<body>
<div class="header">
<?php include_once("/home/bgarch/public_html/header.html"); ?>
</div>
<div class="clear"></div>
<div class="displaywrapper">
<?php include_once("content.html"); ?>
</div></div>
<div class="clear"></div>
<?php include_once("/home/bgarch/public_html/footer.html"); ?>
In the second line where the above reads: "../culture/.." the word culture is the variable and is different based on the folder it is in.
What I need to do now is a "Find/Replace all in project" that automatically replaces all the text inside each 'index.php' file with the following:
<?php include_once("http://www.bgarchitect.co.nz/subPage/index.php"); ?>
I have spent the past 2 hours trying to figure out regular expressions to accomplish this but have been unsuccessful so far.
Maybe it is not possible to do so?
Anyway, I thought I'd ask a question here in hopes it is in fact much easier than I anticipated. So any help/pointer/hints or tricks are much appreciated.
Thanks for reading,
Jannis

If I understood right, you want to replace the word culture (which is the current directory) with the word subPage?
You can do this with the help of TextMate bundles.
Bundles > Bundle editor > Edit commands, then add a New command.
Add this as the command. I think it does what you want. You have to set the command's input to Entire document and its output to Replace document.
#!/bin/bash
sed "s/${TM_DIRECTORY##*/}/theWordYouWantToReplaceTheDirWith/g"
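
If the goal is what the question literally describes, replacing the whole content of every index.php with a single include line, no regex is needed at all. Below is a minimal Python sketch of that reading; the function name replace_index_files is invented, and it assumes each index.php sits exactly one folder below the project root. It overwrites files, so it is best tried on a copy of the project first.

```python
from pathlib import Path

# The replacement line from the question.
NEW_CONTENT = (
    '<?php include_once("http://www.bgarchitect.co.nz/subPage/index.php"); ?>\n'
)


def replace_index_files(project_root, new_content=NEW_CONTENT):
    """Overwrite every <folder>/index.php under project_root (hypothetical helper).

    Returns the list of files that were rewritten.
    """
    changed = []
    for path in sorted(Path(project_root).glob("*/index.php")):
        path.write_text(new_content, encoding="utf-8")
        changed.append(path)
    return changed
```

Calling replace_index_files with the path of the project folder rewrites the files and returns their paths, which makes it easy to check that all 49 were touched.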
