How to Search Replace when filename URLs are different - replace

I want to search-replace a line that contains different addresses but I just want to search-replace a specific portion. Like
Search: < img border="0" src="*http://2.bp.blogspot.com/-OKZyRpnFLgw/UtqngoauEwI/AAAAAAADOlM/aXCJiiTRkaM/*s1600/005.JPG"/ >
Replace: < img border="0" src="*http://2.bp.blogspot.com/-OKZyRpnFLgw/UtqngoauEwI/AAAAAAADOlM/aXCJiiTRkaM/*s420/005.JPG"/ >
I just want to change the bold text. The Italic text changes in every file address.
I can't simply search-replace s1600 with s420 because that changes many other entries; I only want to make this change when it appears inside an < img address.
Help!

Using a Find what of (< img[^>]*?)s1600 and a Replace with of \1s420 will work on the line you give; you should have Regular expression selected. It will replace a single occurrence of s1600 between an < img and the next >. If you expect two or more s1600 between the strings then run the replace multiple times.
The Find what may not be strict enough. If text such as pqrs1600 or s160000 should not be altered then you could try (< img[^>]*?)\bs1600\b.
If the s1600 always occurs between a * and a / as in the example then the Find what may be better as (< img[^>]*?\*)s1600/ and then the Replace with should be \1s420/.
The basic idea in each case is to match the text that identifies where the item is and to capture that text using the parentheses. The [^>]*? matches zero or more characters that are not a >; the *? part indicates a non-greedy match, so it matches the smallest sequence possible before the next part of the regular expression.
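The same capture-and-put-back idea can be sketched outside the editor. The Python below is only an illustration: the function name shrink_images and the example.invalid URLs are made up for the demo, and <\s*img is used so that both "<img" and "< img" are accepted.

```python
import re

# Group 1 captures everything from the start of the img tag up to the token;
# the replacement puts it back unchanged (\1) followed by the new size.
IMG_SIZE = re.compile(r'(<\s*img[^>]*?)\bs1600\b')

def shrink_images(html: str) -> str:
    """Replace the first s1600 inside each img tag with s420."""
    return IMG_SIZE.sub(r'\1s420', html)

# Hypothetical sample: the link keeps s1600, the image gets s420.
sample = ('<a href="http://example.invalid/s1600/005.JPG">'
          '<img border="0" src="http://example.invalid/abc/s1600/005.JPG"/></a>')
print(shrink_images(sample))
```

As in the editor, only one s1600 per img tag is replaced per pass; text such as pqrs1600 is left alone because of the \b word boundaries.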

Related

perl regex substitution if NOT this string NOR that character

I'm using Perl to highlight errors through my browser as I scan through pages of text. At this point, I want to ensure the text Seq is preceded by a maltese cross and space ✠ , otherwise highlight it. I also want to ignore n>Seq.
PS. If it's easier, I want to ignore > but it will always be n>. In fact, it would always be </span> - whichever is easiest to check for.
Example phrase: ✠ Seq. S. Evangélii sec. Joánnem. — In illo témpore
I'm trying to replace xySeq if xy is NOT a Maltese cross and a space ✠ , AND if xy is NOT the letter n and a greater than symbol n>.
In other words, I don’t want to substitute
✠ Seq
n>Seq
>Seq
</span>Seq
but I do want to replace things like
✠Seq
* Seq
a✠Seq
>aSeq
The following would work if I was just checking for single characters like ✠ or >
my $span_beg = q(<span class='bcy'>); # HTML markup for highlighting
my $span_end = q(</span>);
$phr =~ s/([^✠>]Seq)/$span_beg$1$span_end/g;
but [^✠ >]Seq naturally treats the ✠ and the space as two separate alternatives, not as a two-character sequence.
I even tried [^(✠\s)>]Seq and a variable [^$var>] but these didn't work.
I played with (?<!✠\s)Seq but didn't know how to incorporate > or if it was even the right way to go.
I hope this is possible, thanks for all.
Guy
If you always want to tag Seq and exactly two characters before it, a couple of look-behinds might be enough:
s{..(?<!✠\s)(?<!n>)Seq}{$span_beg$&$span_end}g;
Or, with look-ahead:
s{(?!✠\s)(?!n>)..Seq}{$span_beg$&$span_end}g;
This should be more efficient than performing lookaround at every position:
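# Doesn't include preceding characters in the span.
s{(✠ |>)?Seq}{ $1 ? $& : "$span_beg$&$span_end" }eg;
# Includes two preceding characters in the span. The second of them must
# not be ">", otherwise ".." would swallow "n>" before the ">" branch is tried.
s{(?:(✠ |>)|.[^>])Seq}{ $1 ? $& : "$span_beg$&$span_end" }seg;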
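For comparison, the optional-prefix approach (the variant that leaves the preceding characters outside the span) can be sketched in Python with a replacement callback. This is an illustration rather than a drop-in for the Perl script; the names SPAN_BEG, SPAN_END and highlight are chosen for this sketch.

```python
import re

SPAN_BEG = "<span class='bcy'>"   # HTML markup for highlighting
SPAN_END = "</span>"

# The optional group records whether an allowed prefix precedes "Seq".
SEQ = re.compile(r'(✠ |>)?Seq')

def highlight(text: str) -> str:
    def repl(m: re.Match) -> str:
        if m.group(1):            # allowed prefix found: leave untouched
            return m.group(0)
        return f"{SPAN_BEG}{m.group(0)}{SPAN_END}"
    return SEQ.sub(repl, text)

for phrase in ["✠ Seq. S. Evangélii", "n>Seq", "✠Seq", "* Seq", ">aSeq"]:
    print(highlight(phrase))
```

Each occurrence of Seq is examined once, and the callback decides whether to wrap it, which avoids running a look-behind at every position.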

RegEx for transforming the next text using PhpStorm's search and replace dialog

I need to transform text using regex
TPI +2573<br>
NM$ +719<br>
Молоко +801<br>
Прод. жизнь +6.5<br>
Оплод-сть +3.6<br>
Л. отела 6.3/3.9<br>
Вымя +1.48<br>
Ноги +1.61<br>
to this one
<strong>TPI</strong> +2573<br>
<strong>NM$</strong> +719<br>
<strong>Молоко</strong> +801<br>
<strong>Прод. жизнь</strong> +6.5<br>
<strong>Оплод-сть</strong> +3.6<br>
<strong>Л. отела</strong> 6.3/3.9<br>
<strong>Вымя</strong> +1.48<br>
<strong>Ноги</strong> +1.61<br>
Is it possible with regex in PhpStorm's search and replace dialog?
Given your text, you can use this regex,
(.*\S) +
and replace it with <strong>$1</strong> (notice there is a space after </strong>).
The greedy (.*\S) captures the label, that is, everything up to the last non-space character before the final run of spaces on the line. The trailing " +" consumes those spaces, because that is the point after which the text should stay intact. The replacement wraps the captured group $1 in <strong> tags and puts a single space back. Using $0 (the whole match) instead would pull the trailing space inside the <strong> element.
Just in case, if this doesn't work for any of the samples you haven't included in your post, then please list the rules of replacement and I will give you a more robust solution, that will work flawlessly for your given rules.
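One way to check that the separating space stays outside the tags is to run the replacement on a few sample lines. The sketch below uses Python's re module, which writes the group reference as \1 where PhpStorm uses $1; the function name bold_labels is hypothetical, and the pattern assumes each line has at least one space between label and value.

```python
import re

# Group 1 is the label (ending in a non-space character); the run of
# spaces after it is consumed and re-inserted as a single space.
LABEL = re.compile(r'(.*\S) +')

def bold_labels(text: str) -> str:
    return LABEL.sub(r'<strong>\1</strong> ', text)

sample = "TPI +2573<br>\nПрод. жизнь +6.5<br>\nЛ. отела 6.3/3.9<br>"
print(bold_labels(sample))
```

Because the quantifier is greedy, the label extends to the last space on the line, so labels that themselves contain spaces (such as "Прод. жизнь") are wrapped as a whole.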

Regular Expression in Notepad++ to replace < and > inside CDATA

I'm using Notepad++ to fix a huge XML export file, and one of the challenges is to replace all < and > characters with &lt; and &gt;. The thing is, I can't simply use the Replace All action, since the XML file is full of < and > characters that must not be changed.
Luckily, all the < and > that I need to change are wrapped in CDATA sections, like this:
<![CDATA[Text here... <span class="vSpecial"><p>Special Offer - more text here!</p></span>]]>
I was wondering if there'd be a Regular Expression to identify < and > wrapped in CDATA content, so I could easily use the Replace All to change only them.
UPDATE
The content of CDATA can contain line breaks.
Code
See regex in use here
<!\[CDATA\[(?:(?!\]\]>).)*?\K(?:(<)|(>))
Replacement: (?{1}&lt;)(?{2}&gt;)
Note: For display purposes the link above uses \G(?!\A). This is not supported in Notepad++, thus it's been dropped in the actual answer. I added it to the link to show what it basically does.
See the Notepad++ documentation for more information. It mentions the following:
For those readers familiar with Perl, \G is not supported.
Explanation
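Select Regular expression as the search mode and, because the CDATA content can contain line breaks, tick ". matches newline". Then click Replace All repeatedly until the message at the bottom shows Replace All: 0 occurrences were replaced. Each pass replaces only the first remaining < or > in each CDATA section, so the first pass handles the first occurrence, the second pass the second, and so on until there are no more matches.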
Pattern
<!\[CDATA\[ Matches <![CDATA[ literally
(?:(?!\]\]>).)*? Tempered lazy token matching any character any number of times, but as few as possible ensuring what follows doesn't match ]]>
\K Resets the starting point of the reported match. Any previously consumed characters are no longer included in the final match
(?:(<)|(>)) Match either of the following
(<) Capture < literally into capture group 1
(>) Capture > literally into capture group 2
Replacement
Notepad++ allows conditional replacements: (?{1}&lt;) inserts &lt; when capture group 1 took part in the match, and (?{2}&gt;) inserts &gt; when capture group 2 did.
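If scripting is an option, the same result can be reached in a single pass. The sketch below is a different technique from the Notepad++ pattern: Python's re module has no \K, so it matches each whole CDATA section and escapes the brackets in a callback. The function name escape_cdata and the sample string are made up for the illustration, and it assumes CDATA sections contain no literal ]]> and that other characters such as & need no treatment.

```python
import re

# DOTALL lets "." cross line breaks inside a CDATA section.
CDATA = re.compile(r'(<!\[CDATA\[)(.*?)(\]\]>)', re.DOTALL)

def escape_cdata(xml: str) -> str:
    def repl(m: re.Match) -> str:
        # Only the section body (group 2) is rewritten; the CDATA
        # delimiters themselves are put back unchanged.
        body = m.group(2).replace('<', '&lt;').replace('>', '&gt;')
        return m.group(1) + body + m.group(3)
    return CDATA.sub(repl, xml)

sample = '<item><![CDATA[Text <span class="v"><p>Offer</p>\n</span>]]></item>'
print(escape_cdata(sample))
```

Markup outside the CDATA sections, such as the surrounding item element here, is left untouched.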

How to get the string that start after the last > by regular expression?

I am writing C# code that reads a webpage and extracts content from it.
I spent a lot of time working out how to get at the content, and now I am stuck on this:
<i class="icon"></i><a href="https://www.nytimes.com/2017/09/12/us/irma-storm-updates.html">Latest Updates: 90 Percent of Houses in Florida Keys Are Damaged
I want to get only "Latest Updates: 90 Percent of Houses in Florida Keys Are Damaged".
I used "(?<=\">)(.*)" to extract some content successfully, but it does not work for all of it.
So how can I write a regex that matches the text that starts after the last '>'?
Thank you.
If the substring that you want to match appears after the last > then the main thing you know about it is that it does not contain a >. This is matched with [^>]. If the string must contain at least one character then you'll want to use + as the quantifier; if it's allowed to be empty then you'll want to use * to allow for zero matches. Finally, you need to match the full remainder of the text, up to the end of the line, which you do with a $.
So the full expression is [^>]*$ (or [^>]+$ if it can't be zero length).
If you also want to require that the preceding text does contain a >, you can add a positive look-behind, (?<=\>). This asserts that a > comes immediately before the match without including it in the match. (The > does not actually need escaping in a .NET regex, so (?<=>) works too; the backslash is harmless.) The final expression is then (?<=\>)[^>]*$. C# string literals also use \ for escaping, so the backslash must be doubled, or a verbatim string used, before passing the pattern to the Regex constructor: new Regex("(?<=\\>)[^>]*$") or new Regex(@"(?<=\>)[^>]*$").
The simpler version, [^>]*$, is probably sufficient for your needs.
Finally, I would add that parsing XML or HTML with regular expressions is usually the wrong thing to do because there are lots of edge cases, and you will have to make assumptions about the formatting. For example, based on your example text, I assumed you are searching up to the end of the input text. It's usually better to parse XML with an XML parser, which won't have these problems.
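To see the look-behind version at work on the sample from the question, here is a small Python sketch. Python is used only for illustration; the pattern itself is the same as in the C# discussion, and the helper name text_after_last_gt is hypothetical.

```python
import re

html = ('<i class="icon"></i><a href="https://www.nytimes.com/2017/09/12/us/'
        'irma-storm-updates.html">Latest Updates: 90 Percent of Houses in '
        'Florida Keys Are Damaged')

def text_after_last_gt(s: str) -> str:
    # (?<=>) requires a ">" just before the match; [^>]*$ then takes every
    # remaining character up to the end, none of which may be a ">".
    m = re.search(r'(?<=>)[^>]*$', s)
    return m.group(0) if m else ''

print(text_after_last_gt(html))
```

When the input contains no > at all, the look-behind can never succeed and the helper returns an empty string.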
Here is the regex you need (it can be tried out on RegexStorm.net):
>([^<>]+)
This says: find a closing angle bracket followed by text that contains no angle brackets. The [^<>] matches any character (letters, digits, whitespace, punctuation) that is NOT an opening or closing angle bracket. The + requires one or more such characters, and the parentheses around [^<>]+ capture that text as a separate group.
Here is a C# example that uses it. The text you want is in capture group 1 (Groups[1]); Groups[0] is the whole match, including the leading >.
using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        string text = "<i class=\"icon\"></i><a href=\"https://www.nytimes.com/2017/09/12/us/irma-storm-updates.html\">Latest Updates: 90 Percent of Houses in Florida Keys Are Damaged";
        Regex regex = new Regex(">([^<>]+)");
        // Matches never returns null, so no null check is needed.
        foreach (Match m in regex.Matches(text))
        {
            // Groups[0] is the whole match; Groups[1] is the captured text.
            Console.WriteLine(m.Groups[1].Value);
        }
    }
}
RegexStorm.net is a good .Net test site. Regex101.com is a good site to learn different Regex tools.

How do I extract text in between only the first two occurrences of a string?

I am pretty new to computer programming and am trying to write a script that takes all of the text in between the first and second > symbol in a large fasta file and outputs it into a different file. The question that I really need answered is if there is a regex command that allows me to only take the text located in between the first and the second > symbols in the file.
I have found a lot online about taking text in between two strings, but I haven't found anything anywhere on taking text between only the first and second occurrences of those strings when they appear multiple times in a file. I am running perl version 5.010.
Seems easy enough: />([^>]*)>/
Explanation:
A regex always finds the first (leftmost) match, so the first > is easy. "Find all text up to the next >" is equivalent to "find all following non-> characters", which is where we get [^>]* from.
The parens ( ) serve to capture the matched text in $1.
By default, regex quantifiers are greedy: the engine tries to match as much text as possible. To keep the match from running past the second >, you can exclude > from the characters allowed between the two delimiters, as already proposed:
>([^>]*)>
Or you can make the quantifier lazy by adding ? to it:
>(.*?)>
or
>(.+?)>
The pattern is simpler and the result is the same, provided . is allowed to match newlines (the /s modifier), since the text between the two > symbols in a FASTA file spans several lines.
So:
if ($content =~ m/>(.*?)>/s) {
    print "Captured: $1\n";
}
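For readers who want to try the negated-character-class pattern outside Perl, here is a small Python sketch. The function name between_first_two_gt and the FASTA snippet are invented for the example.

```python
import re
from typing import Optional

def between_first_two_gt(content: str) -> Optional[str]:
    # The leftmost match starts at the first ">"; [^>]* cannot run past
    # the second ">", so no lazy quantifier or DOTALL flag is needed.
    m = re.search(r'>([^>]*)>', content)
    return m.group(1) if m else None

fasta = ">seq1 first record\nACGT\nTTGA\n>seq2 second record\nGGCC\n"
print(between_first_two_gt(fasta))
```

The captured text includes the header line of the first record and its sequence lines, ending with the newline that precedes the second >. If the file contains fewer than two > symbols, the function returns None.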