Regex Match Paragraph Pattern - regex

I am trying to match a paragraph pattern and I am having trouble.
The pattern is:
[image.gif]
some words, usually a few lines
name
emailaddress<mailto:theemailaddress#mail.com>
I tried matching everything between the gif image and the <mailto: but this happens multiple times in the file meaning I get a bad result.
I tried it with this
(?<=\[image.gif\].*?(\[image.gif\])).*?(?=<mailto:)
Is there a way to use Regex to match the general layout of a paragraph?

"the general layout of a paragraph" needs a better definition. Given the lack of an input plus expected output, I'm having to guess what you want here. I'm also guessing that you will accept any language. Here's perl, almost certainly not a language you're familiar with.
Assumed input:
do not match this line
[image.gif]
some words, usually a few lines
Bobert McBobson
emailaddress<mailto:bobertmb#example.com>
don't match this line either
[image.gif]
another few words
on another few lines
Bobina Robertsdaughter
emailaddress<mailto:bobinard#example.info>
this line is also not for matching
Expected output:
[image.gif]
some words, usually a few lines
Bobert McBobson
emailaddress<mailto:bobertmb#example.com>
---
[image.gif]
another few words
on another few lines
Bobina Robertsdaughter
emailaddress<mailto:bobinard#example.info>
Solution using perl:
#!/usr/bin/perl -n007
my $sep = "";
while (/(\[image\.gif\].*?<mailto:[^>]*>(\r)?\n)/gms) {
print $sep . $1;
$sep = "---$2\n";
}
perl is the king of regex languages; many would say that's all it is good for. Here, we use the -n007 option to tell it to read the entire contents of each file and run the code on it as the default variable.
$sep starts blank because there's nothing to separate until the second match.
Then we loop over each block of text that matches the regex:
matches a literal [image.gif]
then matches as little content following that as possible
then matches a literal <mailto: and continues until the next >
then captures the line break (including optional support for DOS line endings)
(see full regex explanation and example at regex101)
We then print the match and finally set the separator to three dashes and a line break (DOS line endings added when needed).
Now you can run it:
$ perl answer.pl input.txt
[image.gif]
some words, usually a few lines
Bobert McBobson
emailaddress<mailto:bobertmb#example.com>
---
[image.gif]
another few words
on another few lines
Bobina Robertsdaughter
emailaddress<mailto:bobinard#example.info>
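For readers who don't use Perl, the same idea can be sketched in Python. This is only an illustration, not part of the original answer: the name extract_blocks and the inline sample text are assumptions, and the pattern is the one from the Perl script with the capture group around \r dropped.

```python
import re

# Same pattern as the Perl script: a literal [image.gif], as little text as
# possible, then <mailto:...> and the line break (optionally a DOS one).
BLOCK = re.compile(r"\[image\.gif\].*?<mailto:[^>]*>\r?\n", re.DOTALL)


def extract_blocks(text):
    """Return every [image.gif] ... <mailto:...> block found in text."""
    return BLOCK.findall(text)


# Hypothetical sample mirroring the assumed input above.
sample = (
    "do not match this line\n"
    "[image.gif]\n"
    "some words, usually a few lines\n"
    "Bobert McBobson\n"
    "emailaddress<mailto:bobertmb#example.com>\n"
    "don't match this line either\n"
    "[image.gif]\n"
    "another few words\n"
    "on another few lines\n"
    "Bobina Robertsdaughter\n"
    "emailaddress<mailto:bobinard#example.info>\n"
    "this line is also not for matching\n"
)

blocks = extract_blocks(sample)
# Join with the same separator the Perl script prints between matches.
print("---\n".join(blocks), end="")
```

The lazy .*? is what keeps each match from running on to a later mailto: line; re.DOTALL plays the role of Perl's /s flag.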

Related

Substitute any other character except for a specific pattern in Perl

I have text files with lines like this:
U_town/u_LN0_pk_LN3_bnb_LN155/DD0 U_DESIGN/u_LNxx_pk_LN99_bnb_LN151_LN11_/DD5
U_master/u_LN999_pk_LN767888_bnb_LN9772/Dnn111 u_LN999_pk_LN767888_bnb_LN9772_LN9999_LN11/DD
...
I am trying to substitute any other character except for / to nothing and keep a word with pattern _LN\d+_ with Perl one-liner.
So the edited version would look like:
/_LN0__LN3__LN155/ /_LN99__LN151_LN11_/
/_LN999__LN767888_/ _LN999__LN767888__LN9772_LN9999_/
I tried below which returned empty lines
perl -pe 's/(?! _LN\d+_)[^\/].+//g' file
Below returned only '/'.
perl -pe 's/(?! _LN\d+_)\w+//g' file
Is it perhaps not possible with a one-liner and I should consider writing a code to parse character by character and see if a matching word _LN\d+_ or a character / is there?
To merely remove everything other than these patterns, you can simply match the patterns and join the matches back together
perl -wnE'say join "", m{/ | _LN[0-9]+_ }gx' file
or perhaps, depending on details of the requirements
perl -wnE'say join "", m{/ | _LN[0-9]+(?=_) }gx' file
(See explanation in the last bullet below.)
Prints, for the first line (of the two) of the shown sample input
/_LN0__LN3_//_LN99__LN151_/
...
or, in the second version
/_LN0_LN3//_LN99_LN151_LN11/
...
The _LN155 is not there because it is not followed by _. See below.
Questions:
Why are there spaces after some / in the "edited version" shown in the question?
The pattern to keep is shown as _LN\d+_ but _LN155 is shown to be kept even though it is not followed by a _ in the input (but by a /) ...?
Are underscores optional by any chance? If so, append ? to them in the pattern
perl -wnE'say join "", m{/ | _?LN[0-9]+_? }gx' file
with output
/_LN0__LN3__LN155//_LN99__LN151_LN11_/
(It's been clarified that the extra space in the shown desired output is a mistake.)
If the underscores "overlap," like in _LN151_LN11_, they won't both be matched by the _LN\d+_ pattern, since the first match "takes" the shared underscore.
But if such overlapping instances need to be kept, then replace the trailing _ with a lookahead for it, which doesn't consume it, so it's still there for the leading _ of the next pattern
perl -wnE'say join "", m{/ | _LN[0-9]+(?=_) }gx' file
(if the underscores are optional and you use _?LN\d+_? pattern then this isn't needed)
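To make the difference between the consuming and the lookahead variants concrete, here is a rough Python equivalent of the match-and-join approach. It is a sketch only; the names keep_consuming and keep_lookahead are made up for illustration, and the input is the first sample line from the question.

```python
import re

# Trailing underscore is consumed, so an adjacent _LN<digits>_ loses its leading "_".
CONSUMING = re.compile(r"/|_LN[0-9]+_")
# Trailing underscore is only looked at, so it remains available to the next match.
LOOKAHEAD = re.compile(r"/|_LN[0-9]+(?=_)")


def keep_consuming(line):
    return "".join(CONSUMING.findall(line))


def keep_lookahead(line):
    return "".join(LOOKAHEAD.findall(line))


line = "U_town/u_LN0_pk_LN3_bnb_LN155/DD0 U_DESIGN/u_LNxx_pk_LN99_bnb_LN151_LN11_/DD5"
print(keep_consuming(line))  # /_LN0__LN3_//_LN99__LN151_/
print(keep_lookahead(line))  # /_LN0_LN3//_LN99_LN151_LN11/
```

In the first output _LN11_ is missing because _LN151_ already took the underscore between them; in the second it is kept, at the cost of the trailing underscores not being part of the matches.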

Powershell Regex - Replace between string A and string B only if contains string C

I have a file which looks like this
ABC01|01
Random data here 2131233154542542542
More random data
STRING-C
A bit more random stuff
&(%+
ABC02|01
Random data here 88888888
More random data 22222
STRING-D
A bit more random stuff
&(%+
I'm trying to make a script to find everything between ABC01 and &(%+ ONLY if it contains STRING-C
I came up with this regex: ABC([\s\S]*?)STRING-C(?s)(.*?)&\(%\+
I'm getting this content from a text file with get-content.
$bad_content = gc $bad_file -raw
I want to do something like $bad_content.replace($pattern,"") to remove the regex match.
How can I replace my matches in the file with nothing? I'm not even sure if my regex is correct but on regex101 it seems to find the strings I'm needing.
Your regex works with the sample input given, but not robustly, because if the order of blocks were reversed, it would mistakenly match across the blocks and remove both.
Tim Biegeleisen's helpful answer shows a regex that fixes the problem, via a negative lookahead assertion ((?!...)).
Let me show how to make it work from PowerShell:
To apply it, you need to use the regex-based -replace operator, not the literal-substring-based .Replace() method.[1]
To read the input string from a file, use Get-Content's -Raw switch to ensure that the file is read as a single, multi-line string; by default, Get-Content returns an array (stream) of lines, which would cause the -replace operation to be applied to each line individually.
(Get-Content -Raw file.txt) -replace '(?s)ABC01(?:(?!&\(%\+).)*?STRING-C.*?&\(%\+'
Not specifying replacement text (as the optional 2nd RHS operand to -replace) replaces the match with the empty string and therefore effectively removes what was matched.
The regex borrowed from Tim's answer is simplified a bit, by using the inline method of specifying matching options to turn on the single-line option ((?s)) at the start of the expression, which makes subsequent . instances match newlines too (a shorter and more efficient alternative to [\s\S]).
[1] See this answer for the juxtaposition of the two, including guidance on when to use which.
We can use a tempered dot trick when matching between the two markers to ensure that we don't cross the ending marker before matching STRING-C:
ABC01(?:(?!&\(%\+)[\s\S])*?STRING-C[\s\S]*?&\(%\+
Demo
Here is an explanation of the regex pattern:
ABC01 match the starting marker
(?:(?!&\(%\+)[\s\S])*? without crossing the ending marker
STRING-C match the nearest STRING-C marker
[\s\S]*? then match all content, across lines, until reaching
&\(%\+ the ending marker
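The effect of the tempered dot is easiest to see when the blocks appear in the opposite order. The following Python sketch is an illustration under assumptions: it starts the match at a generic ABC rather than ABC01, the function name remove_string_c_blocks is invented, and the sample text is a shortened version of the file from the question with the two blocks swapped.

```python
import re

# Naive version from the question: nothing stops it from crossing "&(%+".
NAIVE = re.compile(r"ABC.*?STRING-C.*?&\(%\+", re.DOTALL)
# Tempered version: each character before STRING-C must not begin "&(%+".
TEMPERED = re.compile(r"ABC(?:(?!&\(%\+).)*?STRING-C.*?&\(%\+", re.DOTALL)


def remove_string_c_blocks(text, pattern=TEMPERED):
    """Delete every ABC ... &(%+ block that contains STRING-C."""
    return pattern.sub("", text)


# Blocks in reversed order: the STRING-D block comes first.
text = (
    "ABC02|01\nRandom data here 88888888\nSTRING-D\nA bit more random stuff\n&(%+\n"
    "ABC01|01\nRandom data here 2131233154542542542\nSTRING-C\nA bit more random stuff\n&(%+\n"
)

print(repr(remove_string_c_blocks(text, NAIVE)))     # both blocks are gone
print(repr(remove_string_c_blocks(text, TEMPERED)))  # only the STRING-C block is gone
```

Note that the line break following the removed block is left behind, just as it would be with the -replace call above.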

Regex for string matching ****${****}***

I am trying to write a regex that matches and excludes all strings in a file that contain ${ followed by } with any characters between or around it. In between could be any characters/numbers/underscores/dashes/etc (there won't be another parenthesis inside).
Example matches:
hello ${VAR}
${HELLO_VAR} world
https://${WEB_VAR}
I came up with this: egrep -v '^\${[a-zA-Z?]', though it seems to work only partially and I am not sure whether it's right. How can I do this?
The input file has strings separated by a newline, very similar to simple java properties.
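You can try using the sed command. Note the -E option: without it, sed uses basic regular expressions, in which \{ starts a repetition count instead of matching a literal brace.
sed -E 's/\$\{[^}]*\}//g' <input_file> > <output_file>
Sed here removes every ${...} sequence, including the characters between '{' and '}', and writes the new content to a new output file.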
You can give this one a shot:
\$\{[^}]*\}
Match ${ literally, followed by everything except }, followed by }
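If the goal is to drop whole lines that contain such a placeholder (what egrep -v does), the pattern can be used as a filter. A small Python sketch, with the function name and the sample lines being assumptions for illustration only:

```python
import re

# "${", then anything except "}", then "}".
PLACEHOLDER = re.compile(r"\$\{[^}]*\}")


def lines_without_placeholders(lines):
    """Keep only the lines that contain no ${...} sequence."""
    return [line for line in lines if not PLACEHOLDER.search(line)]


sample = [
    "hello ${VAR}",
    "${HELLO_VAR} world",
    "https://${WEB_VAR}",
    "plain=value",
    "cost is $5 {approx}",  # "$" and "{" are not adjacent, so this line is kept
]

print(lines_without_placeholders(sample))
```

Because search() looks anywhere in the line, no leading or trailing .* is needed.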
You say you're trying to exclude all strings in a file, so it sounds like you need something a bit more advanced than just a regex with grep. I'd do this with an awk script:
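awk '{while(match($0,/\$\{[^}]*\}/)){$0=substr($0,1,RSTART-1) substr($0,RSTART+RLENGTH)}} 1' input.txt
Or, split for easier reading and commenting:
{
  # keep going while the line still contains a ${...} sequence
  while (match($0,/\$\{[^}]*\}/)) {
    # rebuild the line from the text before and after the match
    $0=substr($0,1,RSTART-1) substr($0,RSTART+RLENGTH)
  }
}
# a bare true pattern prints the (possibly modified) line
1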
The idea here is that for each line, we'll check to see whether the regex matches anything on the line. If it does, we'll replace the line with the parts around the matched regex. (We could alternatively use sub(/RE/,""), but that would require applying the regex twice per match rather than once.)
The final 1 is shorthand that says "print the current line". It runs whether or not the loop processed any matches.
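For comparison, the same "cut the match out of the line" behaviour can be sketched in Python, where a single substitution call handles every match on the line. The function name strip_placeholders and the sample strings are hypothetical.

```python
import re

PLACEHOLDER = re.compile(r"\$\{[^}]*\}")


def strip_placeholders(line):
    """Remove every ${...} sequence, keeping the text around it."""
    return PLACEHOLDER.sub("", line)


for line in ["hello ${VAR}", "${HELLO_VAR} world", "https://${WEB_VAR}/a${B}", "no vars here"]:
    # repr() makes leftover leading or trailing spaces visible
    print(repr(strip_placeholders(line)))
```

As with the awk loop, lines without a match pass through unchanged.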
Just use the global wildcard .* around the two sequences, as in:
.*\$\{.*\}.*
As you want to match entire lines, you have to use the wildcard on both sides, to extend the regexp to both ends (it doesn't matter whether you anchor it with ^ and $, as the greedy algorithm will try to extend the match as much as possible). Note that the $, { and } must be escaped, as they are reserved by the regexp language.
This can be seen in action here.
Note
The title of this question doesn't specify that the substring between the two curly braces must not contain a }, and since you only want to match the whole line, it is not necessary to check for anything other than a }; the only requirement is that a } comes after the ${ in the line. Anyway, this has no drawback in efficiency, as the NFA that parses this regexp has the same number of states as the other.

How to use an RE to match a line of ===== and the line above

I want to match two lines like the following using a Regular Expression:-
abcmnoxyz
=========
The first line is essentially random, the second line will be all the same character of a limited number of possibles (=, - and maybe a couple more). The lines can probably be required to be the same length but it would be nice if they didn't have to be. It would be OK to have multiple REs, one for each possible 'underline' character.
Can anyone come up with a way to do this?
This regex should do what you're trying to do :
regex = "(.*)\n(.)\2{2,}$"
group 1 will give you the line before the repeated line
Live demo here
EXPLANATION
(.*)\n: match anything followed by a new line
(.)\2{2,} : capture a character, then check that it is followed by the same character two or more times. You don't need to worry about which character is repeated.
In case you've a set of characters that can be repeated you can put a character set like this : [=-] instead of dot (.)
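As a rough illustration of the pattern with the [=-] character set, here is a Python sketch. The ^ anchor, the MULTILINE flag, the function name find_underlined and the sample text are additions made for this example and are not part of the answer above.

```python
import re

# Group 1: the text line; group 2: the underline character, repeated via \2.
HEADING = re.compile(r"^(.*)\n([=-])\2{2,}$", re.MULTILINE)


def find_underlined(text):
    """Return (line, underline_char) for each line followed by an underline."""
    return [(m.group(1), m.group(2)) for m in HEADING.finditer(text)]


sample = "intro\nabcmnoxyz\n=========\nbody\nTitle two\n-----\n"
print(find_underlined(sample))
```

With MULTILINE set, ^ and $ match at every line boundary, so the pair can sit anywhere in the text rather than only at its end.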
Use Grep's -B Flag
Matching with Alternation
Given your example, you can use extended regular expressions with alternations and a range operator. The -B flag tells grep how many lines before the match to include in the output.
$ grep -E -B1 '^(={5,}|-{5,})$' sample.txt
abcmnoxyz
=========
You can add alternations for additional characters if you want, although boundary markers ought to be as consistent as you can make them. You can also adjust the minimum number of sequential characters required for a match to suit your needs. I used a five-character range in the example because that's what was posted as the criterion in your original topic sentence, and because a shorter boundary marker is more likely to accidentally match truly random text.
Matching with a Character Class
Also, note that the following does the same job, but is a bit more concise. It uses a character class and a backreference to avoid alternations, which can get messy if you add many more boundary characters. Both versions are equally effective at matching your example.
$ grep -E -B1 '^([=-])\1{4,}$' sample.txt
abcmnoxyz
=========
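The "print the line before the match" behaviour of -B1 can also be imitated by walking the lines and remembering the previous one. This is a hedged Python sketch, not a reimplementation of grep; the function name underlined_titles and the sample list are assumptions.

```python
import re

# Same expression as the character-class version: one of = or -, five or more times.
UNDERLINE = re.compile(r"([=-])\1{4,}")


def underlined_titles(lines):
    """Return (title, underline) pairs where underline is a full boundary line."""
    pairs = []
    previous = None
    for line in lines:
        # fullmatch() plays the role of the ^...$ anchors in the grep pattern
        if previous is not None and UNDERLINE.fullmatch(line):
            pairs.append((previous, line))
        previous = line
    return pairs


sample = ["intro text", "abcmnoxyz", "=========", "body", "short", "===="]
print(underlined_titles(sample))
```

The four-character underline at the end is ignored because it is shorter than the five-character minimum.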
A regex like this
^([^=\v]+)\v=+$
will do. Check it out at example 1
Explanation:
^([^=\v]+) # 1 or more matches of anything that is not a '=' or vertical space \v
\v=+$ # match a vertical space followed by 1 or more '='
If you want to extend this to more characters like '-' you could do this:
^([^=\-\v]+)\v(-|=)\2+$
Look at example 2
And, thanks to Ashish Ranjan, suppose you wanted to have = and/or - on the first line, use something like this:
^(.+)\v(-|=)\2+$
which would even allow you to have a first line like "=====". Having my doubts if OP had this in mind, though. Look at example 3
Hope this works
^([a-z]{1,})\n([=-]{1,})
You may have to try both \n and \r\n, depending on the file format (Unix or DOS).
\1 will give you the first line
\2 will give you the second line
If the same pattern occurs repeatedly in the text, you may get many matches.
This works regardless of the number of characters in each line.
Ex: Tester

Regex optimization: negative character class "[^#]" nullifies multiline flag "m"

I'm trying to parse a text line by line, catching everything EXCEPT what's after a specific marker, # for example. No escaping to take into account, pretty basic.
For instance, if the input text is:
Multiline input text
Mid-sentence# cut, this won't be matched
Hey there
I want to retrieve
['Multiline input text',
'Mid-sentence',
'Hey there']
This is working fine with /(.*?)(?:#.*$|$)/mg (even though there are a few empty matches). However, if I try to improve the regex (by avoiding backtracking and getting rid of empty matches) with /([^#]++)(?:#.*$|$)/mg, it returns
[
"Multiline input text
Mid-sentence",
"
Hey there"
]
As if [^#] was including linebreaks, even with the multiline flag on. As far as I can tell I can fix that by adding \n and \r to the character class ([^#\n\r]), but this makes the multiline option kind of useless and I'm afraid it could break on some weird linebreaks in some environments/encodings.
Would any of you know the reason for this behavior, and if there's another workaround? Thanks!
Edit
Originally, it happens in PCRE. But even in Javascript with /([^#]+)(?:#.*$|$)/mg, same unwanted multiline behavior. I know I could probably use the language to parse the text line by line, but I'd like to do it with regex only.
It seems you got your definition of /m wrong. The only thing this flag does is change what ^ and $ match, so that they also match at the beginning and end of each line respectively. It does not affect anything else. If you don't want to match line breaks you should do as you suggested and use [^#\n\r].
The regex that will work for you is:
^(.*?)(?:#.*|)$
Online Demo: http://regex101.com/r/aP8eV6
The difference is the use of .*? instead of [^#]+.
[^#]+ by definition matches anything but #, and that includes newlines as well.
The multiline flag m only lets you use the line start/end anchors ^ and $ in multiline inputs.
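The behaviour is not specific to PCRE or JavaScript; a short Python sketch shows the same three cases. The variable names are made up, the input is the text from the question, and Python's re has no possessive ++ before 3.11, so a plain + is used here.

```python
import re

text = "Multiline input text\nMid-sentence# cut, this won't be matched\nHey there"

# [^#] happily matches "\n": the m flag only changes where ^ and $ can match.
crossing = re.findall(r"([^#]+)(?:#.*$|$)", text, re.MULTILINE)

# Excluding line breaks from the class keeps each match on one line.
per_line = re.findall(r"([^#\n\r]+)(?:#.*$|$)", text, re.MULTILINE)

# The anchored lazy version from the answer above.
anchored = re.findall(r"^(.*?)(?:#.*|)$", text, re.MULTILINE)

print(crossing)  # ['Multiline input text\nMid-sentence', '\nHey there']
print(per_line)  # ['Multiline input text', 'Mid-sentence', 'Hey there']
print(anchored)  # ['Multiline input text', 'Mid-sentence', 'Hey there']
```

The . in the anchored version stops at newlines by default, which is why it needs no explicit \n in the pattern, whereas a negated class excludes only what is listed in it.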