Delete all characters/words that don't match a pattern - regex

I have a text with no line breaks, and I want to delete all the characters that don't match a pattern.
The pattern runs from the word parameter until the next }}. For example, if I have this input:
KHJLMNNamespaceparameter:{{"Hello i am here"}}NamespaceHSKFSAFSLLLJparameter:{{H}}...
I would like to delete everything else and leave this in the file: parameter:{{"Hello i am here"}} parameter:{{H}}.
All I have found so far are ways to delete lines that don't contain a pattern, but I can't find anything that works on a huge file without \n (end-of-line characters). Is it possible to do this with sed, awk or Vi?
Thanks!

$ awk 'BEGIN{RS=ORS="}}"} sub(/.*parameter/,"parameter")' file
parameter:{{"Hello i am here"}}parameter:{{H}}
Note that this is gawk-specific due to the multi-char RS.

You can use this grep with -P (PCRE) regex:
grep -oP '.*?\Kparameter:\{\{.*?\}\}' file
parameter:{{"Hello i am here"}}
parameter:{{H}}

If perl is an option, you can do this:
perl -ne 'my @wo = ($_ =~ /parameter:\{\{.*?\}\}/g); print join(" ", @wo);' your_text_file
In perl, *? is a non-greedy quantifier, so the match stops at the first }} it encounters.
I think a perl expert can do this in one instruction, without a temporary array ...
EDIT: this command only writes the wanted text to stdout. To change the file itself, use the -i switch when calling perl:
perl -i.bak -ne 'my @wo = ($_ =~ /parameter:\{\{.*?\}\}/g); print join(" ", @wo);' your_text_file
A backup file is created with the extension .bak appended, and the result is written to a file with the same name as the input file. Note that you can skip the backup file by using the switch -i alone, but some platforms don't allow this. See the perlrun documentation for more information.
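For comparison, the same extraction can be sketched in Python. This is only an illustrative alternative to the awk, grep and perl answers above: the function name keep_parameters is made up for the example, and the sketch assumes the text fits in memory.

```python
import re

# Non-greedy match from the literal "parameter:{{" to the first "}}".
# re.DOTALL lets "." cross newlines in case the input does contain some.
PATTERN = re.compile(r"parameter:\{\{.*?\}\}", re.DOTALL)


def keep_parameters(text):
    """Return only the parameter:{{...}} fragments, joined by single spaces."""
    return " ".join(PATTERN.findall(text))


sample = 'KHJLMNNamespaceparameter:{{"Hello i am here"}}NamespaceHSKFSAFSLLLJparameter:{{H}}...'
print(keep_parameters(sample))
# prints: parameter:{{"Hello i am here"}} parameter:{{H}}
```

To rewrite a file, read it with open(...).read(), pass the contents through keep_parameters, and write the result back.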

Related

Printing filename for a multi-line pattern found in multiple files

I am using perl -pe together with grep to do a multi-line grep. When "\" is used as a line continuation character, I need to join the continued line with the next one before matching.
So my file is
record -field X \
-field Y
I am doing
perl -pe 's/\\\n/ /' a/b/c/*/records/*.rec | grep "\-field.*X.*\-field.*Y"
The problem with this is that it only gives me the grep result, without telling me which file it came from. Is there a way around this? I need to know which files contain the match too.
I could write a foreach loop in a shell script, but I was wondering whether a one-liner version of the same is possible.
Once you are inside a Perl program why go to system's grep? Perl's tools are far more abundant, rounded, and usable than the shell's. One way
perl -0777 -nE'say "$ARGV: $_" for
grep { /\-field.*X.*\-field.*Y/ } split /\n/, s{\\\n}{ }gr' file-list
(broken into lines for readability)
We read the whole file into $_ ("slurp" it) using the -0777 switch, so that those particular lines can be merged. Each \\\n (a backslash followed by a newline) is then replaced with a space, the resulting string is returned (by virtue of the /r modifier), and it is split on \n to regenerate the lines.
Then that list of lines is fed to grep with your desired pattern, and the ones that match the pattern are passed through. So then they are printed, prepended with the name of the currently processed file, available in the $ARGV variable.
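A shorter fix is to print $ARGV, which holds the name of the file currently being read ($ARGV[0] is the next entry in the argument list, not the current file). Print it only at the start of each logical line:
perl -pe 'print "$ARGV: " unless $joined; $joined = s/\\\n/ /' a/b/c/*/records/*.rec | grep "\-field.*X.*\-field.*Y"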
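The slurp, join and filter steps described above can also be sketched in Python. This is a hedged illustration, not what either answer uses: matching_records and report are hypothetical names, and the sketch assumes each file is small enough to read in one go.

```python
import re

# Same pattern the question passes to grep.
FIELDS = re.compile(r"-field.*X.*-field.*Y")


def matching_records(text):
    """Join backslash-continued lines, then return the lines that match."""
    joined = text.replace("\\\n", " ")  # backslash + newline becomes a space
    return [line for line in joined.split("\n") if FIELDS.search(line)]


def report(paths):
    """Print 'filename: line' for every matching logical line in each file."""
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            for line in matching_records(handle.read()):
                print(f"{path}: {line}")


# In-memory demonstration with the record from the question.
demo = "record -field X \\\n-field Y\nrecord -field Z\n"
for line in matching_records(demo):
    print(f"demo.rec: {line}")
```

Calling report(sys.argv[1:]) from a script (after import sys) would let the shell expand a/b/c/*/records/*.rec into the list of paths.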

Find multi-line text & replace it, using regex, in shell script

I am trying to find a pattern of two consecutive lines, where the first line is a fixed string and the second contains a substring I'd like to replace.
This is to be done in sh or bash on macOS.
If I had a regex tool at hand that would operate on the entire text, this would be easy for me. However, all I find is bash's simple text replacement - which doesn't work with regex, and sed, which is line oriented.
I suspect that I can use sed in a way where it first finds a matching first line, and only then looks to replace the following line if its pattern also matches, but I cannot figure this out.
Or are there other tools present on macOS that would let me do a regex-based search-and-replace over an entire file or a string? Maybe with Python (v2.7 and v3 is installed)?
Here's a sample text and how I like it modified:
keyA
value:474
keyB
value:474 <-- only this shall be replaced (follows "keyB")
keyC
value:474
keyB
value:474
Now, I want to find all occurrences where the first line is "keyB" and the following one is "value:474", and then replace that second line with another value, e.g. "value:888".
As a regex that ignores line separators, I'd write this:
Search: (\bkeyB\n\s*value):474
Replace: $1:888
So, basically, I find the pattern before the 474, and then replace it with the same pattern plus the new number 888, thereby preserving the original indentation (which is variable).
You can use
sed -e '/keyB$/{n' -e 's/\(.*\):[0-9]*/\1:888/' -e '}' file
# Or, to replace the contents of the file inline in FreeBSD sed:
sed -i '' -e '/keyB$/{n' -e 's/\(.*\):[0-9]*/\1:888/' -e '}' file
Details:
/keyB$/ - finds all lines that end with keyB
n - prints the current pattern space (unless -n is given) and reads the next line into it
s/\(.*\):[0-9]*/\1:888/ - finds any text up to the last : followed by zero or more digits, capturing that text into Group 1, and replaces the match with the contents of the group and :888.
The {...} creates a block that is executed only when the /keyB$/ condition is met.
See an online sed demo.
Use a perl one-liner with -0777 to scan over multiple lines:
$ # inline edit:
$ perl -0777 -i -pe 's/(\bkeyB\s*value):\d*/$1:888/g' file.txt
$ # to stdout:
$ cat file.txt | perl -0777 -pe 's/(\bkeyB\s*value):\d*/$1:888/g'
In plain bash:
#!/bin/bash
keypattern='^[[:blank:]]*keyB$'
valpattern='(.*):'
replacement=888
while read -r; do
    printf '%s\n' "$REPLY"
    if [[ $REPLY =~ $keypattern ]]; then
        read -r
        if [[ $REPLY =~ $valpattern ]]; then
            printf '%s%s\n' "${BASH_REMATCH[0]}" "$replacement"
        else
            printf '%s\n' "$REPLY"
        fi
    fi
done < file
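Since the question also asks about Python, here is a minimal sketch of the whole-text regex replacement using the standard re module. The function name replace_value and the anchored pattern are illustrative choices rather than part of any answer above; the pattern assumes keyB stands alone on its line apart from indentation.

```python
import re

# "keyB" alone on a line (optionally indented), followed by an indented
# "value" line whose number is exactly 474. Group 1 keeps the indentation.
PATTERN = re.compile(r"^([ \t]*keyB\n[ \t]*value):474$", re.MULTILINE)


def replace_value(text, new_value="888"):
    """Replace value:474 with value:<new_value> only when it follows keyB."""
    return PATTERN.sub(lambda m: f"{m.group(1)}:{new_value}", text)


sample = (
    "keyA\n"
    "  value:474\n"
    "keyB\n"
    "  value:474\n"
    "keyC\n"
    "  value:474\n"
    "keyB\n"
    "    value:474\n"
)
print(replace_value(sample), end="")
```

A replacement function is used instead of a replacement string so that the digits in the new value cannot be misread as part of a group reference.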

Delete blank line before a pattern. What's wrong? Currently using Perl but open to sed/AWK

In a long document, I want to selectively delete the particular newlines that precede the exact string \begin{enumerate*}, ideally with a one-liner in bash or zsh.
That is, I want to convert test.tex:
Text in paragraphs.
More text
\begin{enumerate*} \item thing
to
Text in paragraphs.
More text \begin{enumerate*} \item thing
with a one-liner like
cat test.tex | perl -p -e 's/\n(?=(\\begin\{enumerate\*\}))/ /'
or
cat test.tex | perl -p -e 's/\n\\begin\{enumerate\*\}/\\begin{enumerate*}/'
but I must be missing something because it doesn't make any change.
I also clearly don't need a regular expression here. If there's a way to do this with exact string matching instead of regex, I'd rather use that way. For instance, in R I could do this with sub("\n\\begin{enumerate*}", "\\begin{enumerate*}", fixed = TRUE).
You can use the -0 (digit zero) switch with Perl to specify the line separator. Traditionally, -0777 is used to read the entire file at once.
You also need to be careful about regex metacharacters in your search string. Characters like *, {, } and \ mean something special within a regex pattern, and you should escape them, usually with a \Q ... \E construct.
Taking these points into account, this should work for you
perl -0777 -pe' s/\n+(?=\Q\begin{enumerate*}\E)/ / ' myfile
perl -p processes a file line by line, so $_ never contains a newline followed by \begin, and you can't expect this regex to match.
I would recommend something like
perl -e '$text = join "", <>; $text =~ s/your_regex_here//; print $text' test.txt
Mind that it loads the whole file to memory.
Also, if you want to modify file immediately, you can't just say > test.txt, see this question.
I found a solution with sed (number 25 on this page) that doesn't read the entire file into memory:
sed -i bak -n '/^\\begin{enumerate\*}/{x;d;};1h;1!{x;p;};${x;p;}' test.tex
The downside is that this doesn't actually join the two lines; instead it produces
Text in paragraphs.
More text
\begin{enumerate*} \item thing
which is good enough for what I need (latex treats single newlines the same as regular spaces)
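The question also asks for exact string matching instead of a regex. A minimal Python sketch of that idea uses str.replace, which never interprets metacharacters. The function name join_before is hypothetical, and the sketch reads the whole file into memory.

```python
# Raw string: the backslash, braces and asterisk are all taken literally.
MARKER = r"\begin{enumerate*}"


def join_before(text, marker=MARKER):
    """Replace the single newline directly before `marker` with a space."""
    return text.replace("\n" + marker, " " + marker)


sample = (
    "Text in paragraphs.\n"
    "More text\n"
    "\\begin{enumerate*} \\item thing\n"
)
print(join_before(sample), end="")
```

This handles one newline before each occurrence of the marker; a run of several blank lines would need the call repeated or a loop.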

Copy matched regex to new file

I want to copy regex matched text to a new file.
<SHOPITEM>([\s\S]*?)<YEAR>2015<\/YEAR>([\s\S]*?)<\/SHOPITEM>
([\s\S]*?) = any text, any line
This works (I am able to find matches) in the Sublime editor, but what would this regex look like for sed/grep (or any other Unix tool)?
sed and grep usually work line by line rather than in multiline mode, although multiline matching is still possible under certain conditions.
I would advise using Perl, which should be installed on your computer:
perl -0777 -ne 'print $& if /<SHOPITEM>([\s\S]*?)<YEAR>2015<\/YEAR>([\s\S]*?)<\/SHOPITEM>/i' myfile.xml
Be aware that this regex won't work if you have nested <shopitem> tags or multiple occurrences. Use an XML parser instead.
You can also write a program that scans your XML file and, this time, captures all the matches.
myparser.pl:
#!/usr/bin/env perl
undef $/;
$_ = <>;
print "$&\n" while /<(shopitem)>[\s\S]*?<(year)>2015<\/\2>[\s\S]*?<\/\1>/ig;
That you can execute:
$ chmod u+x myparser.pl
$ ./myparser.pl myfile.xml
I'm not the best scripter, but I think this should work:
grep "<SHOPITEM>" infile | grep "<YEAR>2015" | sed -e "s/<[^>]*>//g" | sed "s/2015/ /g" > outfile
Edit: I didn't match the regex, instead I got SHOPITEMs with YEAR 2015 tag and removed all the unwanted parts.
Edit: I'd do it this way, but I'm not sure it's the most elegant solution.
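The XML parser route suggested above could look roughly like this in Python with the standard xml.etree.ElementTree module. This is a sketch under assumptions: the question does not show the file, so the SHOP root element and the NAME child in the sample are invented for illustration, and items_from_year is a hypothetical name.

```python
import xml.etree.ElementTree as ET


def items_from_year(xml_text, year="2015"):
    """Return each SHOPITEM element (serialized) that contains <YEAR>year</YEAR>."""
    root = ET.fromstring(xml_text)
    return [
        ET.tostring(item, encoding="unicode").strip()
        for item in root.iter("SHOPITEM")
        if any((node.text or "").strip() == year for node in item.iter("YEAR"))
    ]


# Assumed document shape; only SHOPITEM and YEAR come from the question.
sample = (
    "<SHOP>"
    "<SHOPITEM><NAME>a</NAME><YEAR>2014</YEAR></SHOPITEM>"
    "<SHOPITEM><NAME>b</NAME><YEAR>2015</YEAR></SHOPITEM>"
    "</SHOP>"
)
for item in items_from_year(sample):
    print(item)
```

Writing "\n".join(items_from_year(text)) to a new file would give the copy the question asks for. Unlike the regex, this selects each item independently, so an item from another year that precedes a 2015 item is not swept into the match.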

regex command line linux - select all lines between two strings

I have a text file with contents like this:
here is some super text:
this is text that should
be selected with a cool match
And this is how it all ends
blah blah...
I am trying to get the two lines (but could be more or less lines) between:
some super text:
and
And this is how
I am using grep on an ubuntu machine and a lot of the patterns I've found seem to be specific to different kinds of regex engines.
So I should end up with something like this:
grep "my regex goes here" myFileNameHere
Not sure if egrep is needed, but I could use that just as easily.
You can use addresses in sed:
sed -e '/some super text/,/And this is how/!d' file
!d means "don't output if not in the range".
To exclude the border lines, you must be more clever:
sed -n -e '/some super text/ {n;b c}; d;:c {/And this is how/ {d};p;n;b c}' file
Or, similarly, in Perl:
perl -ne 'print if /some super text/ .. /And this is how/' file
To exclude the border lines again, change it to
perl -ne '$in = /some super text/ .. /And this is how/; print if $in > 1 and $in !~ /E/' file
I don't see how it could be done in grep. Using awk:
awk '/^And this is how/ {p=0}; p; /some super text:$/ {p=1}' file
Try pcregrep instead of plain grep, because plain grep cannot match across multiple lines.
$ pcregrep -M -o '(?s)some super text:[^\n]*\n\K.*?(?=\n[^\n]*And this is how)' file
this is text that should
be selected with a cool match
(?s) The dotall modifier allows the dot to match newline characters as well.
\K Discards the previously matched characters.
From pcregrep --help
-M, --multiline run in multiline mode
-o, --only-matching=n show only the part of the line that matched
TL;DR
With your corpus, another way to solve the problem is by matching lines with leading whitespace, rather than using a flip-flop operator of some sort to match start and end lines. The following solutions work with your posted example.
GNU Grep with PCRE Compiled In
$ grep -Po '^\s+\K.*' /tmp/corpus
this is text that should
be selected with a cool match
Alternative: Use pcregrep Instead
$ pcregrep -o '^\s+\K.*' /tmp/corpus
this is text that should
be selected with a cool match
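For readers who prefer a script, the start/end flag technique from the awk answer in this thread can be sketched in Python. The function name lines_between is hypothetical, and the sketch uses plain substring tests on the two marker strings from the question rather than regexes.

```python
def lines_between(lines, start="some super text:", end="And this is how"):
    """Yield the lines strictly between a `start` line and the next `end` line."""
    inside = False
    for line in lines:
        if inside and end in line:
            inside = False      # closing border line: stop, do not emit it
        elif inside:
            yield line
        elif start in line:
            inside = True       # opening border line: start after it


sample = [
    "here is some super text:",
    "    this is text that should",
    "    be selected with a cool match",
    "And this is how it all ends",
    "blah blah...",
]
for line in lines_between(sample):
    print(line)
```

Because the function only iterates, passing it an open file object processes the file line by line without loading it all into memory (each line then keeps its trailing newline).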