Regex that matches contents of () with nested () in it - regex

I have a messy text file (due to the nature of the contents I cannot paste it).
In the file I want to match everything inside the outermost (unnested) parentheses.
Here is sample that includes the problem:
a(b()c((d)e)f()g)h(i)
The output that I need is:
(b()c((d)e)f()g)(i)
(basically everything in the largest parenthesis, less 'a' and 'h')
Again I cannot post the actual contents but above example illustrates the problem I have in original file.
I am working on this from bash, I am familiar with sed, grep, but not awk unfortunately.
Thanks

Since regex will find the longest possible match, you can just use
\(.*\)
If you care about nesting and want to find the outermost, e.g. for ((a)) and (b)))) you want to find ((a)) and (b), then that's a typical example of a grammar that you technically can't match with regular expressions.
However, since you tagged your post PCRE:
grep -P -o '(?xs)(?(DEFINE) (?<c>([^()]|(?&p))) (?<p>\((?&c)*\)))((?&p))'
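As a quick check, the command above can be run against the sample string from the question. This is only a minimal sketch: it assumes GNU grep built with PCRE support (-P), and the variable names are made up for illustration.

```shell
# Sample string from the question.
sample='a(b()c((d)e)f()g)h(i)'

# -o prints each outermost balanced group on its own line.
result=$(printf '%s\n' "$sample" |
  grep -P -o '(?xs)(?(DEFINE) (?<c>([^()]|(?&p))) (?<p>\((?&c)*\)))((?&p))')

printf '%s\n' "$result"
# (b()c((d)e)f()g)
# (i)
```

To get the single-line form asked for in the question, the lines can be joined afterwards, for example with tr -d '\n'.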

Ha, I know there is already a good answer, but I'm posting the one I came up with to broaden the range of ideas.
(?x)
(?(DEFINE)(?<nest>(\((?:[^()]*(?1)?[^()]*)\))))
\((?:[^()]*(?&nest)?[^()]*)*\)
Of course it needs to be flattened onto the grep line.

Related

Unable to make the mentioned regular expression to work in sed command

I am trying to make the following regular expressions to work in sed command in bash.
^[^<]?(https?:\/\/(www\.)?[-a-zA-Z0-9#:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()#:%_\+.~#?&\/\/=]*))[^>]?$
I know the regular expression is correct and it is working as I expected. So; there is no help needed with that. I tested it on online regular expressions tester and it is working as per my expectations.
Please find the demo of the above regex in here.
My requirement:
I want to enclose every url inside <>. If the url is already enclosed; then append it to the result as can be seen in the above regex link.
Sample Input: (in a file named websites.txt)
// List of all legal urls
https://www.google.com/
https://www.fakesite.co.in
https://www.fakesite.co.uk
<https://www.fakesite.co.uk>
<https://www.google.com/>
Expected Output:(in the file named output.txt)
<https://www.google.com/> // Please notice every url is enclosed in the <>.
<https://www.fakesite.co.in>
<https://www.fakesite.co.uk>
<https://www.fakesite.co.uk> // Please notice if the url is already enclosed in <> then it is appended as it is.
<https://www.google.com/>
What I tried in sed:
Since I'm not well-versed in bash commands, I was previously unable to capture the group properly in sed, but after reading this answer I figured out that the parentheses need to be escaped in order to capture it.
I read somewhere that look-arounds are not supported in sed (GNU), so I removed the look-arounds too, but that didn't work either. Since sed doesn't support look-arounds, I used this regex instead and it served my purpose.
Then; this is my latest try with sed command:
sed 's#^[^<]?(https?://(?:www\.)?[-a-zA-Z0-9#:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b(?:[-a-zA-Z0-9()#:%_\+.~#?&/=]*))[^>]?$#<\1>#gm;t;d' websites.txt > output.txt
My exact problem:
How can I make the above command to work properly. If you'll run the command sample I attached above in point-3; you'd see it is not replacing the contents properly. It is just dumping the contents of websites.txt to output.txt. But in regex demo; attached above it is working properly i.e. enclosing all the unenclosed websites inside <>. Any suggestions would be helpful. I preferably want it in sed but if it is possible can I convert the above command in awk also? If you can please help me with that too; I'll be highly obliged. Thanks
After working for long, I made my sed command to work. Below is the command which worked.
sed -E 's#^[^<]?(https?://(www\.)?[-a-zA-Z0-9#:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()#:%_\+.~#?&=]*))[^>]?$#<\1>#gm;t' websites.txt > output.txt
You can find the sample implementation of the command in here.
Since the regex already fulfils the requirement of the person I'm writing this for, I only needed help with the command syntax (although any improvements are heartily welcomed); I want the command to work with the same regular expression pattern.
Things which I was unaware previously and learnt now:
I didn't know anything about the -E flag. Now I know that -E uses POSIX "extended" syntax ("ERE"). Thanks to @GordonDavisson and @Sundeep. Further reading.
I didn't know with clarity that sed doesn't support look-around. Now I know it doesn't. Thanks to @dmitri-chubarov. Further reading.
I didn't know sed doesn't support non-capturing groups either. Thanks to @Sundeep for solving this part. Further reading.
I didn't know about GNU sed as a specific command line tool. Thanks to @oguzismail for this. Further reading.
With respect to the command in your answer:
sed -E 's#^[^<]?(https?://(www\.)?[-a-zA-Z0-9#:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()#:%_\+.~#?&=]*))[^>]?$#<\1>#gm;t'
Here's a few notes:
Your posted sample input has 1 URL per line so AFAIK the gm;t at the end of your sed command is doing nothing useful so either your input is inadequate or your script is wrong.
The hard-coded ranges a-z, A-Z, and 0-9 include different characters in different locales. If you meant to include all (and only) lower case letters, upper case letters, and digits then you should replace a-zA-Z0-9 with the POSIX character class [:alnum:]. So either change to use a locale-independent character class or specify the locale you need on your command line, depending on your requirements for which characters to match in your regexp.
Like most characters, the character + is literal inside a bracket expression so it shouldn't be escaped - change \+ to just +.
The bracket expression [^<]? means "1 or 0 occurrences of any character that is not a <" and similarly for [^>]? so if your "url" contained random characters at the start/end it'd be accepted, e.g.:
echo 'xhttp://foo.bar%' | sed -E 's#^[^<]?(https?://(www\.)?[-a-zA-Z0-9#:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()#:%_\+.~#?&=]*))[^>]?$#<\1>#gm;t'
<http://foo.bar%>
I think you meant to use <? and >? instead of [^<]? and [^>]?.
Your regexp would allow a "url" that has no letters:
echo 'http://=.9' | gsed -E 's#^[^<]?(https?://(www\.)?[-a-zA-Z0-9#:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()#:%_\+.~#?&=]*))[^>]?$#<\1>#gm;t'
<http://=.9>
If you edit your question to provide more truly representative sample input and expected output (including cases you do not want to match) then we can help you BUT based on a quick google of what a valid URL is it looks like there are several valid URLs that'd be disallowed by your regexp and several invalid ones that'd be allowed so you might want to ask about that in a question tagged with url or similar (with the tags you currently have we can help you implement your regexp but there may be better people to help with defining your regexp).
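To illustrate the <? and >? suggestion from the notes above, here is a minimal sketch. The URL pattern in it is a deliberately simplified stand-in (scheme followed by anything that is not whitespace or an angle bracket), not the regex from the question, and wrap_urls is a hypothetical helper name.

```shell
# Hypothetical simplified URL pattern; NOT the regex from the question.
# <? and >? accept an optional enclosing bracket and nothing else.
wrap_urls() {
  sed -E 's#^<?(https?://[^<>[:space:]]+)>?$#<\1>#'
}

wrapped=$(printf '%s\n' \
  'https://www.google.com/' \
  '<https://www.fakesite.co.uk>' \
  'xhttp://foo.bar%' | wrap_urls)

printf '%s\n' "$wrapped"
# <https://www.google.com/>
# <https://www.fakesite.co.uk>
# xhttp://foo.bar%
```

The third line is left untouched because the stray leading x no longer satisfies the anchored pattern.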
If the input file is just a comment followed by a list of URLs, try:
sed '1d;s/^[^<]/<&/;s/[^>]$/&>/' websites.txt
Output:
<https://www.google.com/>
<https://www.fakesite.co.in>
<https://www.fakesite.co.uk>
<https://www.fakesite.co.uk>
<https://www.google.com/>

shell: Replace all occurrences of particular text contained between arbitrary "start" and "finish"

I want a one-liner run from a *nix shell to replace all occurrences of particular text contained between arbitrary "start" and "finish", e.g. in
nfw987__qrh fwef_start_hf9
832j fsjdlkfa;jd(&6^)lf dfs
ahlkj;fd__sajhfds
dsfahs__lkjfdsaf jlkfdsa_finish_jfoi__edwp
replace all __ which are between _start_ and _finish_ with ().
I've tried web search but all I find is "simple" replace. I'm writing a code to do that, but maybe that IMHO common task has been solved already with sed, perl, awk etc.
As per al76 link, sed can be easily used for such cases (text is also replaced on start and end lines regardless of where on that line text is, that does not answer my question exactly, but for my current task it is sufficient):
to address the lines between two regular expressions, RE1 and RE2, one
would do this: '/RE1/,/RE2/{commands;}'
sed '/_start_/,/_finish_/{s/__/\(\)/g}' tst2.txt

Regex ignoring linebreaks and "page layout"

I have an assortment of searchable PDF files and I often search particular patterns in all of them simultaneously, using the pdfgrep command. My regex knowledge is somewhat limited and I'm not sure how to work around linebreaks and page layout.
For example, I would like to find the pattern "ignor.{0,10}layout" in each example below:
This is a rather difficult You see, I would like to ignore
task that I am trying to page layout and still find the
achieve. pattern I am looking for.
This is a rather difficult This is because I would like to ig-
task that I am trying to nore page layout and still find the
achieve. pattern I am looking for.
In both examples, I would like the first two lines to be reported by
pdfgrep -n "ignor.{0,10}layout" *
but it fails to do so because:
there is a linebreak in the middle.
in the first example, there are more than 10 characters between ignor and layout.
in the second example, ignor is cut in half.
Is there a regex that would solve this problem entirely?
pdfgrep does not have grep's -z flag, which treats input as NUL-terminated rather than newline-terminated and so lets a pattern match across line breaks. As a workaround, pdftotext can convert the PDF to text and write it to STDOUT, which you can then pipe into a regular grep call:
pdftotext SPECIFIC-FILE.pdf - | grep -Pzo "(?s)YOUR\s+QUERY"
This makes it impossible to use globbing efficiently, but you can at least iterate the glob:
for pdf in *.pdf; do echo -n "$pdf:"; pdftotext "$pdf" - | grep -Pzo "(?s)YOUR\s+QUERY"; done
Please note that if you want to match whitespace, you will almost always want to use \s+, which also matches newlines when -z is enabled. See this other answer for an explanation of the flags.
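A rough sketch of how such a query could tolerate a hyphenated line break, applied to the second example from the question. The text variable is a hypothetical stand-in for what pdftotext might print for one column, and the pattern is only one possible way to write it; it assumes GNU grep with -P support.

```shell
# Hypothetical stand-in for the output of `pdftotext file.pdf -`.
text='This is because I would like to ig-
nore page layout and still find the
pattern I am looking for.'

# ig-?\s*nor tolerates a hyphenated line break inside "ignor";
# (?s) lets .{0,10} cross a newline; -z makes grep read the text as one record.
match=$(printf '%s\n' "$text" |
  grep -Pzo '(?s)ig-?\s*nor.{0,10}layout' | tr '\0' '\n')

printf '%s\n' "$match"
# ig-
# nore page layout
```

This does not help with text from two side-by-side columns being interleaved on one line; that depends on the order in which the text is extracted, not on the regex.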

How to create regex search-and-replace with comments?

I have a bit of a strange problem: I have a code (it's LaTeX but that does not matter here) that contains long lines with period (sentences).
For better version control I wanted to split these sentences on a new line each.
This can be achieved via sed 's/\. /.\n/g'.
Now the problem arises if there are comments with potential periods as well.
These comments must not be altered, otherwise they will be parsed as LaTeX code and this might result in errors etc.
As a pseudo example you can use
Foo. Bar. Baz. % A. comment. with periods.
The result should be
Foo.
Bar.
Baz. % ...
Alternatively the comment might go on the next line without any problems.
It would be fine to use perl if that works out better. I tried a few ideas with different programs (sed and perl), but none did what I expected. Either the comment was also altered or only the first period was altered (perl -pe 's/^([^%]*?)\. /\1.\n/g').
Can you point me in the right direction?
This is tricky as you're essentially trying to match all occurrences of ". " that don't follow a "%". A negative look-behind would be useful here, but Perl doesn't support variable-width negative look-behind. (Though there are hideous ways of faking it in certain situations.) We can get by without it here using backtracking control verbs:
s/(?:%(*COMMIT)(*FAIL))|\.\K (?!%)/\n/g;
The (?:%(*COMMIT)(*FAIL)) forces replacement to stop the first time it sees a "%" by committing to a match and then unconditionally failing, which prevents back-tracking. The "real" match follows the alternation: \.\K (?!%) looks for a space that follows a period and isn't followed by a "%". The \K causes the period to not be included in the match so we don't have to include it in the replacement. We only match and replace the space.
Putting the comment by itself on a following line can be done with sed pretty easily, using the hold space:
sed '/^[^.]*%/b;/%/!{s/\. /.\n/g;b};h;s/[^%]*%/%/;x;s/ *%.*//;s/\. /.\n/g;G'
Or if you want the comment by itself before the rest:
sed '/^[^.]*%/b;/%/!{s/\. /.\n/g;b};h;s/ *%.*//;s/\. /.\n/g;x;s/[^%]*%/%/;G'
Or finally, it is possible to combine the comment with the last line also:
sed '/^[^.]*%/b;/%/!{s/\. /.\n/g;b};h;s/[^%]*%/%/;x;s/ *%.*//;s/\. /.\n/g;G;s/\n\([^\n]*\)$/ \1/'
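As a sketch of what the first of these commands does, here it is run on the sample line from the question. It assumes GNU sed (for \n in the replacement and for commands joined with ; after b); the variable names are illustrative.

```shell
line='Foo. Bar. Baz. % A. comment. with periods.'

# h saves the line, the comment is isolated in the hold space via x,
# the code part is split on ". ", and G appends the comment again.
held=$(printf '%s\n' "$line" |
  sed '/^[^.]*%/b;/%/!{s/\. /.\n/g;b};h;s/[^%]*%/%/;x;s/ *%.*//;s/\. /.\n/g;G')

printf '%s\n' "$held"
# Foo.
# Bar.
# Baz.
# % A. comment. with periods.
```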

Vim - sed like labels or replacing only within pattern

While doing some html editing I've come to need help from some VIM master out there.
I want to achieve a simple task - I have an html file with mangled urls.
Just description Just description
...
Just description
Unfortunately it's not "one url per line".
I am aware of three approaches:
I would like to be able to replace only within the '"http://[^"]*"' regex (similar to replacing only in matching lines, but this time only the matching pattern should be involved, not whole lines)
Or use sed-like labels - I can do this task with sed -e :a -e 's#\("http://[^" ]*\) \([^"]*"\)#\1_\2#g;ta'
Also I know that there is something like "\@<=" but I am a non-native speaker and the vim manual on this is beyond my comprehension.
All help is greatly appreciated.
If possible I would like to know answer on all three problems (as those are pretty interesting and would be helpful in other tasks) but either of those will do.
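For reference, the sed label loop from the second approach behaves as sketched below. The HTML line is a made-up example (the original markup is not shown above), and example.com is a placeholder.

```shell
# Hypothetical input: a URL with spaces, followed by link text with a space.
html='<a href="http://example.com/a b c.html">Just description</a>'

# Each pass replaces one space inside the quoted URL; "ta" jumps back to
# label "a" for as long as the last substitution changed something.
fixed=$(printf '%s\n' "$html" |
  sed -e :a -e 's#\("http://[^" ]*\) \([^"]*"\)#\1_\2#g;ta')

printf '%s\n' "$fixed"
# <a href="http://example.com/a_b_c.html">Just description</a>
```

The space in the link text is untouched because the pattern only matches inside a quoted string starting with "http://.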
Re: 1. You can replace recursively by combining vim's "evaluate replacement as an expression" feature (:h :s\=) with the substitute function (:h substitute()):
:%s!"http://[^"]*"!\=substitute(submatch(0), ' ', '_', 'g')!g
Re: 2. I don't know sed so I can't help you with that.
Re: 3. I don't see how \@<= would help here. As for what it does: It's equivalent to Perl's (?<=...) feature, also known as "positive look-behind". You can read it as "if preceded by":
:%s/\%(foo\)\@<=bar/X/g
"Replace bar by X if it's preceded by foo", i.e. turn every foobar into fooX (the \%( ... \) are just for grouping here). In Perl you'd write this as:
s/(?<=foo)bar/X/g;
More examples and explanation can be found in perldoc perlretut.
I think what you want to do is replace all spaces in your http:// url with _.
To achieve the goal, @melpomene's solution is straightforward. You could try it in your vim.
On the other hand, if you want to simulate your sed line, you could try the following.
:let @s=':%s#\("http://[^" ]*\)\@<= #_#g^M'
^M means Ctrl-V then Enter
then
200@s
This works in the same way as your sed line (label, do the replacement, back to the label...), and \@<= is used as well.
One problem is that, done this way, vim cannot detect when all matches have been replaced. Therefore a relatively big number (200 in my example) is given, and in the end an error message "E486: Pattern not found..." is shown.
A script is needed to avoid the message.