adding regex pattern in string in perl - regex

I want to compare a string from one file against a string from another file. The second file may contain some extra elements (such as <?..?> in the example below), and such an element can occur anywhere and any number of times.
Note: these tags need to be retained in the final output.
For e.g.:
I want to match the word 'scripting'. The <match> tag marks the word to be matched in $str2.
$str1 = "perl is an <match>scripting</match> language";
$str2 = "perl is an s<?..?>cr<?..?>ipti<?..?>ng langu<?..?>age";
Output required :
perl is an <match>s<?..?>cr<?..?>ipti<?..?>ng</match> langu<?..?>age
I am adding a pattern after each character:
$str1 =~ s{(.)}
{
'$&(?:(?:<?...?>|\n)+)?'
}esgi;
This works in a few cases, but in others it keeps running without finishing. Please suggest a fix.

(?:(?:<?...?>|\n)+)? is the same as (?:<?...?>|\n)*. Also, you don't want to add the pattern after each character, only in between the characters of the matched part of $str1: no pattern before the first character and none after the last. Otherwise the match will extend around those tags, whereas you want the <match> tags placed directly around the front and back of the word. My guess is that if you are running that first replace command over all of $str1, you may end up with quite a large string. Also see my answer to a related question.
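A minimal sketch of the "pattern only between the characters" idea, written in Python rather than Perl purely for illustration. The helper names build_pattern and wrap_match are hypothetical, and the filler pattern assumes the interleaved elements look like <?..?> as in the question.

```python
import re

# Assumed shape of the interleaved elements: <? ... ?> or a newline, repeated.
FILLER = r"(?:<\?.*?\?>|\n)*"

def build_pattern(word, filler=FILLER):
    # Join the escaped characters with the filler, so the filler sits only
    # BETWEEN characters: never before the first one or after the last one.
    return filler.join(re.escape(ch) for ch in word)

def wrap_match(word, text):
    # Wrap the (possibly tag-interrupted) word in <match>...</match>,
    # keeping the interleaved tags exactly as they were.
    pattern = re.compile(build_pattern(word), re.IGNORECASE)
    return pattern.sub(lambda m: "<match>" + m.group(0) + "</match>", text)

str2 = "perl is an s<?..?>cr<?..?>ipti<?..?>ng langu<?..?>age"
print(wrap_match("scripting", str2))
# perl is an <match>s<?..?>cr<?..?>ipti<?..?>ng</match> langu<?..?>age
```

Because the pattern is built only for the word being searched, it stays small no matter how long the surrounding string is.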

Related

Powershell Regex - Replace between string A and string B only if contains string C

I have a file which looks like this
ABC01|01
Random data here 2131233154542542542
More random data
STRING-C
A bit more random stuff
&(%+
ABC02|01
Random data here 88888888
More random data 22222
STRING-D
A bit more random stuff
&(%+
I'm trying to make a script to Find everything between ABC01 and &(%+ ONLY if it contains STRING-C
I came up with this for regex ABC([\s\S]*?)STRING-C(?s)(.*?)&\(%\+
I'm getting this content from a text file with get-content.
$bad_content = gc $bad_file -raw
I want to do something like $bad_content.replace($pattern,"") to remove the regex match.
How can I replace my matches in the file with nothing? I'm not even sure if my regex is correct but on regex101 it seems to find the strings I'm needing.
Your regex works with the sample input given, but not robustly, because if the order of blocks were reversed, it would mistakenly match across the blocks and remove both.
Tim Biegeleisen's helpful answer shows a regex that fixes the problem, via a negative lookahead assertion ((?!...)).
Let me show how to make it work from PowerShell:
You need to use the regex-based -replace operator, not the literal-substring-based .Replace() method,[1] to apply it.
To read the input string from a file, use Get-Content's -Raw switch to ensure that the file is read as a single, multi-line string; by default, Get-Content returns an array (stream) of lines, which would cause the -replace operation to be applied to each line individually.
(Get-Content -Raw file.txt) -replace '(?s)ABC01(?:(?!&\(%\+).)*?STRING-C.*?&\(%\+'
Not specifying replacement text (as the optional 2nd RHS operand to -replace) replaces the match with the empty string and therefore effectively removes what was matched.
The regex borrowed from Tim's answer is simplified a bit, by using the inline method of specifying matching options to turn on the single-line option ((?s)) at the start of the expression, which makes subsequent . instances match newlines too (a shorter and more efficient alternative to [\s\S]).
[1] See this answer for the juxtaposition of the two, including guidance on when to use which.
We can use a tempered dot trick when matching between the two markers to ensure that we don't cross the ending marker before matching STRING-C:
ABC01(?:(?!&\(%\+)[\s\S])*?STRING-C[\s\S]*?&\(%\+
Here is an explanation of the regex pattern:
ABC01                     match the starting marker
(?:(?!&\(%\+)[\s\S])*?    without crossing the ending marker
STRING-C                  match the nearest STRING-C marker
[\s\S]*?                  then match all content, across lines, until reaching
&\(%\+                    the ending marker
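To see why the tempered dot matters, here is a small Python sketch (Python is used only for illustration; the thread itself is about PowerShell). It puts the STRING-D block first, which is the case where the original regex matches across blocks. As an assumption for the demo, the starting marker is generalized to ABC\d+ so that either block could start a match; the variable names are hypothetical.

```python
import re

d_block = "ABC02|01\nRandom data here 88888888\nSTRING-D\nA bit more random stuff\n&(%+\n"
c_block = "ABC01|01\nRandom data here 2131\nSTRING-C\nA bit more random stuff\n&(%+\n"
text = d_block + c_block  # the block WITHOUT STRING-C comes first

# Naive version: lazy matching alone can still run past the first &(%+.
naive = re.compile(r"ABC\d+[\s\S]*?STRING-C[\s\S]*?&\(%\+")

# Tempered version: each character is consumed only if it does not
# start the ending marker, so the search for STRING-C stays in one block.
tempered = re.compile(r"ABC\d+(?:(?!&\(%\+)[\s\S])*?STRING-C[\s\S]*?&\(%\+")

naive_result = naive.sub("", text)        # both blocks are removed
tempered_result = tempered.sub("", text)  # only the STRING-C block is removed

print(repr(naive_result))
print(repr(tempered_result))
```

The newline that followed the removed block is left behind in both cases; add \n? to the end of the pattern if that matters for your file.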

Get first instance and Get last instance of string

I am trying to match the first instance of the value of Timestamp in one expression and the last instance of the value of Timestamp in another expression:
{'Latitude': 50.00001,'Longitude': 2.00002,'Timestamp': '00:10:00'},{'Latitude': 50.0,'Longitude': 2.0,'Timestamp': '00:20:00'},{'Latitude': 50.0,'Longitude': 2.0,'Timestamp': '00:25:00'},{'Latitude': 50.0,'Longitude': 2.0,'Timestamp': '00:37:00'}
Does anyone know how to do that?
Take advantage of regexp's greediness: the * operator will take as many matches as it can find. So the approach here is to match the explicit pattern at the beginning and end of the regexp with a .* in the middle. The .* will slurp up as many characters as it can subject to the rest of the regexp also matching.
/(${pattern}).*(${pattern})/
Here, ${} represents interpolation; the syntax will vary with your language. In Ruby it would be #{}. I have chosen to capture the entire pattern; you can instead put the () capture around the timestamp value, but I find this easier to read and maintain. This regexp will match two instances of $pattern with as much stuff in between as it can fit, thus guaranteeing that you have the first and last.
If you want to be more strict, you could enforce the pattern in the middle as well, *'ing the full pattern rather than just .:
/${pattern},\s*(?:${pattern},\s*)*${pattern}/
Ask in the comments if you don't understand any piece of this regexp.
One pattern we can use is /\{[^}]+\'Timestamp\'[^}]+\}/. Note that this pattern requires at least one character between the opening brace and 'Timestamp', so it assumes that Timestamp is not the FIRST key; if this is not always true you need to add a bit more to this pattern.
So the total pattern for the first example will be:
str =~ /(${pattern}).*(${pattern})/
Or, without interpolation:
str =~ /({[^}]+'Timestamp'[^}]+}).*({[^}]+'Timestamp'[^}]+})/
Then, $1 and $2 are the first and last hashes that match the Timestamp key. Again, this matches the entire pattern rather than only the timestamp value itself, but it should be straightforward from there to extract the actual timestamp value.
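The same greedy approach, sketched in Python for illustration (the answer above uses Ruby/Perl-style syntax). The variable names are hypothetical; the data is the sample from the question.

```python
import re

data = ("{'Latitude': 50.00001,'Longitude': 2.00002,'Timestamp': '00:10:00'},"
        "{'Latitude': 50.0,'Longitude': 2.0,'Timestamp': '00:20:00'},"
        "{'Latitude': 50.0,'Longitude': 2.0,'Timestamp': '00:25:00'},"
        "{'Latitude': 50.0,'Longitude': 2.0,'Timestamp': '00:37:00'}")

# One {...} hash that contains a Timestamp key (no capturing groups inside).
item = r"\{[^}]+'Timestamp'[^}]+\}"

# Greedy .* in the middle pushes the second group to the LAST matching hash.
m = re.search(f"({item}).*({item})", data)
first_hash, last_hash = m.group(1), m.group(2)

# Pull the actual timestamp value out of each captured hash.
ts = re.compile(r"'Timestamp':\s*'([^']*)'")
first_ts = ts.search(first_hash).group(1)
last_ts = ts.search(last_hash).group(1)

print(first_ts, last_ts)  # 00:10:00 00:37:00
```

If the data can span several lines, pass re.DOTALL so that . also matches newlines.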
For the second, more strict example, and the reason I did not want to capture the timestamp value inside the pattern itself, we have:
str =~ /(${pattern}),\s*(?:${pattern},\s*)*(${pattern})/
Or, without interpolation:
str =~ /({[^}]+'Timestamp'[^}]+}), *(?:{[^}]+'Timestamp'[^}]+}, *)*({[^}]+'Timestamp'[^}]+})/
We still have the correct results in $1 and $2 because we explicitly chose NOT to put a capturing group inside the pattern.
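A Python sketch of the stricter variant, again only as an illustration with hypothetical variable names. Because the per-hash pattern contains no capturing group of its own, the first and last hashes still land in groups 1 and 2 even though the pattern is repeated in the middle.

```python
import re

data = ("{'Latitude': 50.00001,'Longitude': 2.00002,'Timestamp': '00:10:00'},"
        "{'Latitude': 50.0,'Longitude': 2.0,'Timestamp': '00:20:00'},"
        "{'Latitude': 50.0,'Longitude': 2.0,'Timestamp': '00:25:00'},"
        "{'Latitude': 50.0,'Longitude': 2.0,'Timestamp': '00:37:00'}")

# One {...} hash containing a Timestamp key; deliberately no capture inside.
item = r"\{[^}]+'Timestamp'[^}]+\}"

# first hash, comma, any number of middle hashes (non-capturing), last hash
strict = re.compile(rf"({item}),\s*(?:{item},\s*)*({item})")

m = strict.search(data)
first_hash, last_hash = m.group(1), m.group(2)
print(first_hash)
print(last_hash)
```

Unlike the .* version, this one only matches when everything between the first and last hash is itself a sequence of well-formed hashes.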

Lua pattern similar to regex positive lookahead?

I have a string which can contain any number of the delimiter §\n. I would like to remove all delimiters from a string, except the last occurrence which should be left as-is. The last delimiter can be in three states: \n, §\n or §§\n. There will never be any characters after the last variable delimiter.
Here are 3 examples with the different state delimiters:
abc§\ndef§\nghi\n
abc§\ndef§\nghi§\n
abc§\ndef§\nghi§§\n
I would like to remove all delimiters except the last occurrence.
So the result of gsub for the three examples above should be:
abcdefghi\n
abcdefghi§\n
abcdefghi§§\n
Using regular expressions, one could use §\\n(?=.), which matches properly for all three cases using positive lookahead, as there will never be any characters after the last variable delimiter.
I know I could check if the string has the delimiter at the end, and then after a substitution using the Lua pattern §\n I could add the delimiter back onto the string. That is however a very inelegant solution to a problem which should be possible to solve using a Lua pattern alone.
So how could this be done using a Lua pattern?
str:gsub( '§\\n(.)', '%1' ) should do what you want. This deletes the delimiter provided that it is followed by another character, and puts that character back into the string.
Test code
local str = {
'abc§\\ndef§\\nghi\\n',
'abc§\\ndef§\\nghi§\\n',
'abc§\\ndef§\\nghi§§\\n',
}
for i = 1, #str do
print( ( str[ i ]:gsub( '§\\n(.)', '%1' ) ) )
end
yields
abcdefghi\n
abcdefghi§\n
abcdefghi§§\n
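For comparison, here is the same "capture the next character and put it back" trick next to the lookahead version from the question, sketched in Python (not Lua) with hypothetical function names. As in the test code above, the delimiter is the literal text § followed by a backslash and an n. One difference worth knowing: when two delimiters are directly adjacent, the capture version consumes the § of the second delimiter as its "next character", so the two approaches are only equivalent when delimiters never touch.

```python
import re

def strip_with_capture(s):
    # Emulates lookahead: match delimiter + one more char, put the char back.
    return re.sub(r"§\\n(.)", r"\1", s)

def strip_with_lookahead(s):
    # True lookahead: the following character is checked but not consumed.
    return re.sub(r"§\\n(?=.)", "", s)

samples = [r"abc§\ndef§\nghi\n", r"abc§\ndef§\nghi§\n", r"abc§\ndef§\nghi§§\n"]
for s in samples:
    print(strip_with_capture(s), strip_with_lookahead(s))

# Adjacent delimiters: the two approaches differ.
adjacent = r"a§\n§\nb"
print(strip_with_capture(adjacent))    # a§\nb
print(strip_with_lookahead(adjacent))  # ab
```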
EDIT: This answer doesn't work in Lua specifically, but if you have a similar problem and are not constrained to Lua, you might be able to use it.
So if I understand correctly, you want a regex replace to make the first example look like the second. This:
/(.*?)§\\n(?=.*\\n)/g
will eliminate the non-last delimiters when replaced with
$1
in PCRE, at least. I'm not sure what flavor Lua follows, but you can see the example in action here.
REGEX:
/(.*?)§\\n(?=.*\\n)/g
TEST STRING:
abc§\ndef§\nghi\n
abc§\ndef§\nghi§\n
abc§\ndef§\nghi§§\n
SUBSTITUTION:
$1
RESULT:
abcdefghi\n
abcdefghi§\n
abcdefghi§§\n

captures all stuff between the first " and the last "

I am looking for a regex which captures all stuff between the first " and the last " of a string that may contain further ".
$a='"xyz"kljhkljh"lkjhlkj"';
@b=$a=~ m/^"(.*)"$/m;
seems not to work?
There is no \n at the line end.
The reason that yours is not working is that you are trying to restrict the first quotation mark to occurring at the beginning of the string or immediately after a newline anywhere therein and the last quotation mark to occurring either at the end of the string or immediately before a newline anywhere therein.
That is not what your data contains. Don’t make this harder than it need be.
If you want everything between the first double quote and the last one, including others, then you want
($content) = $string =~ /"(.*)"/sx;
If you want lots of them, and no double quotes inside, you want:
(@contents) = $string =~ /"([^"]*)"/gx;
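The difference between the two patterns, sketched in Python for illustration using the sample string from the question (variable names are hypothetical):

```python
import re

s = '"xyz"kljhkljh"lkjhlkj"'

# Greedy .* runs from the first quote to the LAST quote, inner quotes included.
# re.DOTALL plays the role of Perl's /s so that . also matches newlines.
outer = re.search(r'"(.*)"', s, re.DOTALL).group(1)

# [^"]* cannot cross a quote, so each quoted piece is returned separately.
pieces = re.findall(r'"([^"]*)"', s)

print(outer)   # xyz"kljhkljh"lkjhlkj
print(pieces)  # ['xyz', 'lkjhlkj']
```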
In your second comment to tchrist's answer you say that the first and last quotes should be at the beginning and end of the string? If that's the case, you don't even need a regular expression at all, just take the entire string minus the first and last characters:
substr($a, 1, -1)
For some reason I can't add a comment, so I'm creating an answer to answer bootware's comment on tchrist's answer. The difference between ($content)=$string=~/"(.*)"/sx and $content=$string=~/"(.*)"/sx is that the former matches in list context and the latter in scalar context. In scalar context the result is simply a 1 or 0 indicating whether the string matched the regex. In list context, a list is returned for the substring that matched each parenthesized portion of the regex, in order from left to right. In this case there was one set of parentheses in the regex, so the list returned had one element, the portion of the string that was inside the quotes.
Bonus: You can refer to the substrings matched in each set of parentheses using $1, $2, ...

matching the closest strings to a search term (perl regex)

Basically, what I'm trying to do is search through a rather large PHP file, and replace any block of PHP code that includes the string "search_term" somewhere in it with some other code. I.e.
<?php
//some stuff
?>
<?php
// some more stuff
$str = "search_term";
// yes...
?>
<?php
// last stuff
?>
should become
<?php
//some stuff
?>
HELLO
<?php
// last stuff
?>
What I've got so far is
$string =~ s/<\?php(.*?)search_term(.*?)\?>/HELLO/ims;
This correctly matches the closest closing ?>, but begins the match at the very first <?php, instead of the one closest to the string search_term.
What am I doing wrong?
Generally, I don't like to use non-greedy matching, because it usually leads to problems like this. Perl looks at your file, finds the first '<?php', then starts looking for the rest of the regexp. It passes over the first '?>' and the second '<?php' because they match .*, then finds search_term and the next '?>', and it's done.
Non-greedy matching means that you have a regular expression that matches more things than you really want, and it leaves it up to perl to decide which match to return. It's better to use a regular expression that matches exactly what you want to match. In this case, you can get what you want by using ((?!\?>).)* instead of .*? ((?!\?>) is a negative look-ahead assertion)
s/<\?php((?!\?>).)*search_term((?!\?>).)*\?>/HELLO/is;
If you expect multiple matches, you might want to use /isg rather than /is.
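A Python sketch of the negative look-ahead approach, for illustration only (the thread is about Perl; the variable names are hypothetical). It uses the sample input from the question, and re.IGNORECASE | re.DOTALL stand in for Perl's /is.

```python
import re

php = """<?php
//some stuff
?>
<?php
// some more stuff
$str = "search_term";
// yes...
?>
<?php
// last stuff
?>
"""

# (?:(?!\?>).)* consumes characters one at a time, but never one that
# starts a closing ?>, so a match cannot spill over into another block.
pattern = re.compile(
    r"<\?php(?:(?!\?>).)*search_term(?:(?!\?>).)*\?>",
    re.IGNORECASE | re.DOTALL,
)

result = pattern.sub("HELLO", php)  # sub() replaces every match, like /g
print(result)
```

Only the middle block is replaced; the first and last blocks are left exactly as they were.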
Alternatively, just split the file into blocks:
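@blocks = split /(\?>)/, $string;   # the capture group keeps each ?> separator in the list
while (@blocks) {
    $block = shift @blocks;
    $sep   = shift(@blocks) // '';  # the final chunk may have no trailing ?>
    if ($block =~ /search_term/) {
        print "HELLO";
    } else {
        print $block, $sep;
    }
}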
You just need to put your first capture group back into your replacement. Something like this:
s/<\?php(.*)<\?php(.*?)search_term(.*?)\?>/<\?php$1HELLO/ims
$string =~ s/<\?php(?:(?!\?>|search_term).)*search_term.*?\?>/HELLO/isg;
(?:(?!\?>|search_term).)* matches one character at a time, after making sure the character isn't the beginning of ?> or search_term. When that stops matching, if the next thing in the string is search_term it consumes that and everything after it until the next ?>. Otherwise, that attempt fails and it starts over at the next <?php.
The crucial point is that, like @RobertYoung's solution, it's not allowed to match ?> as it searches for search_term. By not matching search_term either, it eliminates backtracking, which makes the search more efficient. Depending on the size of the source string that may not matter, but it won't noticeably hurt performance either.
@Benj's solution (as currently posted) does not work. It yields the desired output with the sample string you provided, but that's only by accident. It only replaces the last code block with search_term in it, and (as @mob commented) it completely ignores the contents of the very first code block.
s/(.*)<\?php.*?search_term.*?\?>/${1}HELLO/ims;
In your regular expression, the regex engine is trying to find the earliest occurrence of a substring that matches your target expression, and it finds it between the first <?php and the second ?>.
By putting (.*) at the start of the regex, you trick the regex engine into going to the end of the string (since .* matches the whole string), and then backtracking to spots where it can find the string "<?php". That way the resulting match won't include any more <?php tokens than necessary.
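A small Python sketch of the leading (.*) trick, for illustration with hypothetical names and made-up one-line blocks. It also shows a limitation that follows from how the trick works: because (.*) swallows everything up to the last usable <?php, only the last block containing search_term is replaced, even though sub() would otherwise replace every match.

```python
import re

def replace_block(text):
    # Greedy (.*) runs to the end and backtracks to the LAST <?php from
    # which the rest of the pattern can still match.
    return re.sub(
        r"(.*)<\?php.*?search_term.*?\?>",
        r"\1HELLO",
        text,
        flags=re.IGNORECASE | re.DOTALL,
    )

one_hit = "<?php a ?>\n<?php b search_term ?>\n<?php c ?>\n"
two_hits = "<?php a search_term ?>\n<?php b ?>\n<?php c search_term ?>\n"

print(replace_block(one_hit))   # the middle block becomes HELLO
print(replace_block(two_hits))  # only the LAST search_term block is replaced
```

If several blocks may contain search_term, a pattern that forbids crossing ?> handles each block independently and does not have this limitation.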
You are using stingy (non-greedy) matching, but that can still match too much.
Matching repetitions in perlretut describes it well.
I sometimes use negated matches to help but I don't think it will help. For example:
s/^[^A]*A/A/
to make sure my characters aren't matched.
But I'm not usually trying to cross multiple lines and I'm not using perl unless I have to.