Copy matched regex to new file - regex

I want to copy regex matched text to a new file.
<SHOPITEM>([\s\S]*?)<YEAR>2015<\/YEAR>([\s\S]*?)<\/SHOPITEM>
([\s\S]*?) = any text, any line
This works (I am able to find matches) in the Sublime editor, but what would this regex look like for sed/grep (or any other Unix tool)?

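sed and grep normally work line by line rather than in multiline mode, although multiline matching is still possible under certain conditions.
I would advise using Perl, which should already be installed on your computer:
perl -0777 -ne 'print "$&\n" if /<SHOPITEM>([\s\S]*?)<YEAR>2015<\/YEAR>([\s\S]*?)<\/SHOPITEM>/i' myfile.xml > newfile.xml
Here -0777 reads the whole file at once and -n suppresses the automatic printing, so only the match is written (newfile.xml is just an example name).
Be aware that this regex won't work reliably if you have nested <shopitem> tags or multiple occurrences: it prints only the first match, and that match can start at an earlier <SHOPITEM> whose year is not 2015. Use an XML parser instead.
You can also write a program that parses your XML file and, this time, captures all the matches.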
myparser.pl:
#!/usr/bin/env perl
undef $/;
$_ = <>;
print "$&\n" while /<(shopitem)>[\s\S]*?<(year)>2015<\/\2>[\s\S]*?<\/\1>/ig;
You can run it like this:
$ chmod u+x myparser.pl
$ ./myparser.pl myfile.xml
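Because a lazy match can still begin at an earlier item, one way to keep every match inside a single item is to split the input into items first and only then test each item for the year. The following is a minimal sketch of that idea, not a replacement for a real XML parser. It assumes the items are not nested, and the sample data and the file names infile.xml and outfile.xml are made up for illustration.

```shell
# Sample input (made up): two items, only the second one is from 2015.
cat > infile.xml <<'EOF'
<SHOPITEM>
<NAME>A</NAME>
<YEAR>2014</YEAR>
</SHOPITEM>
<SHOPITEM>
<NAME>B</NAME>
<YEAR>2015</YEAR>
</SHOPITEM>
EOF

# -0777 slurps the whole file; the list-context match collects every
# <SHOPITEM>...</SHOPITEM> block, and only blocks containing the year are printed.
perl -0777 -ne '
  for my $item (/<SHOPITEM>.*?<\/SHOPITEM>/sg) {
    print "$item\n" if $item =~ m{<YEAR>2015</YEAR>};
  }
' infile.xml > outfile.xml

cat outfile.xml
```

With this sample, outfile.xml should contain only the item named B.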

I'm not the best scripter, but I think this should work:
grep "<SHOPITEM>" infile | grep "<YEAR>2015" | sed -e "s/<[^>]*>//g" | sed "s/2015/ /g" > outfile
Edit: I didn't match the regex, instead I got SHOPITEMs with YEAR 2015 tag and removed all the unwanted parts.
Edit: I'd do it this way, but I'm not sure it's the most elegant solution.

Related

bash - print regex captured groups

I have a file.xml so composed:
...some xml text here...
<Version>1.0.13-alpha</Version>
...some xml text here...
I need to extract the following information:
major_and_minor_release_number --> 1.0
patch_number --> 13
suffix --> -alpha
I thought the cleanest way to achieve that would be a regex with the grep command:
<Version>(\d+\.\d+)\.(\d+)([\w-]+)?<\/Version>
I've checked with regex101 the correctness of this regex and actually it seems to properly capture the 3 fields I'm looking for. But here comes the problem, since I have no idea how to print those fields.
cat file.xml | grep "<Version>(\d+\.\d+)\.(\d+)([\w-]+)?<\/Version>" -oP
This command prints the entire line so it's quite useless.
Several posts on this site have been written about this topic, so I've also tried to use the bash native regex support, with poor results:
regex="<Version>(\d+\.\d+)\.(\d+)([\w-]+)?<\/Version>"
txt=$(cat file.xml)
[[ "$txt" =~ $regex ]] --> it fails!
echo "${BASH_REMATCH[*]}"
I'm sorry, but I cannot figure out how to overcome this issue. The desired output should be:
1.0
13
-alpha
You may use this read + sed solution with a regex similar to yours:
read -r major minor suffix < <(
sed -nE 's~.*<Version>([0-9]+\.[0-9]+)\.([0-9]+)(-[^<]*)</Version>.*~\1 \2 \3~p' file.xml
)
Check variable contents:
declare -p major minor suffix
declare -- major="1.0"
declare -- minor="13"
declare -- suffix="-alpha"
Few points:
You cannot use \d without using -P (perl) mode in grep
grep command doesn't return capture groups
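The bash attempt in the question fails for a related reason: the =~ operator uses POSIX extended regular expressions, which do not define Perl-style classes such as \d. A minimal sketch using bracket expressions instead, assuming the version tag sits in a single string; the variable names line, regex, major, patch and suffix are illustrative only:

```shell
#!/usr/bin/env bash
# Sample line taken from the question.
line='<Version>1.0.13-alpha</Version>'

# ERE version of the pattern: [0-9] replaces \d, a bracket expression replaces [\w-].
regex='<Version>([0-9]+\.[0-9]+)\.([0-9]+)(-[[:alnum:]_-]+)?</Version>'

if [[ $line =~ $regex ]]; then
  major=${BASH_REMATCH[1]}
  patch=${BASH_REMATCH[2]}
  suffix=${BASH_REMATCH[3]}
  printf '%s\n' "$major" "$patch" "$suffix"
fi
```

The regex is kept in a variable and used unquoted on the right-hand side of =~, because quoting it there would make bash treat it as a literal string.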
Use this Perl one-liner:
perl -lne 'print for m{<Version>(\d+\.\d+)\.(\d+)([\w-]+)?<\/Version>};' file.xml
Example:
echo '<Version>1.0.13-alpha</Version>' | perl -lne 'print for m{<Version>(\d+\.\d+)\.(\d+)([\w-]+)?<\/Version>};'
Output:
1.0
13
-alpha
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches

Perl script versus one-liner - differences in functionality with regex

I have a Perl program that takes STDIN (piped from another bash command). The output from the bash command is quite large, about 200 lines. I want to take the entire input (multiple lines) and feed that to a one-liner Perl script, but so far nothing I've tried has worked. Conversely, if I use the following Perl (.pl file):
#!/usr/bin/perl
use strict;
my $regex = qr/{(?:\n|.)*}(?:\n)/p;
if ( <> =~ /$regex/g ) {
print "${^MATCH}\n";
}
And execute my bash command like this:
<bash command> | perl -0777 try_m_1.pl
It works. But as a one-liner, it doesn't work with the same regex/bash command. The result of the print command is nothing. I've tried it like this:
<bash command> | perl -0777 -e '/{(?:\n|.)*}(?:\n)/pg && print "$^MATCH";'
and this:
<bash command> | perl -0777 -e '/{(?:\n|.)*}(?:\n)/g; print "$1\n";'
And a bunch of other things, too many to list them all. I'm new to perl and only want to use it to get regex output from the text. If there's something better than perl to do this (I understand from reading around that sed wouldn't work for this?) feel free to suggest.
Update: based on @zdim's answer, I tried the following, which worked:
<bash command> | perl -0777 -ne '/(\{(?:\n|.)*\}(?:\n))/s and print "$1\n"'
I guess my regex needed to be wrapped in () and the { curly braces needed to be escaped.
A one-liner needs -n (or -p) to process input, so that files are opened, streams attached, and a loop set up. It still needs that even as the -0777 unsets the input record separator, so the file is read at once; see Why use the -p|-n in slurp mode in perl one liner?
That regex matches either a newline or any character other than a newline, and there is a modifier for that, /s, with which . matches a newline as well. That pattern then needs to sit inside literal curly braces, which you need to escape in newer Perls. The newline that follows doesn't need grouping.
So altogether you'd have
<bash command> | perl -0777 -ne'/(\{(.*)\}\n)/s and print "$1\n"'

Delete all characters/words that doesn't match a pattern

I have a text with no line breaks, and I want to delete all the characters that don't match a pattern.
The pattern would run from the word parameter until it finds }}. For example, if I have this entry:
KHJLMNNamespaceparameter:{{"Hello i am here"}}NamespaceHSKFSAFSLLLJparameter:{{H}}...
I would like to delete everything and leave this in the file: parameter:{{"Hello i am here"}} parameter:{{H}}.
All I have found so far is how to delete a line that doesn't contain a pattern, but I am not able to find anything that applies to a huge file without \n (end of line) characters. Would it be possible to do that using sed, awk or Vi?
Thanks!
$ awk 'BEGIN{RS=ORS="}}"} sub(/.*parameter/,"parameter")' file
parameter:{{"Hello i am here"}}parameter:{{H}}
Note that this is gawk-specific due to the multi-char RS.
You can use this grep with -P (PCRE) regex:
grep -oP '.*?\Kparameter:\{\{.*?\}\}' file
parameter:{{"Hello i am here"}}
parameter:{{H}}
If perl is an option, you can do this:
perl -ne 'my @wo = ($_ =~ /parameter:\{\{.*?\}\}/g); print join(" ", @wo);' your_text_file
In perl, the modifier *? is a non-greedy quantifier, such that it stops at the first encountered }}.
I think a perl expert can do this in one instruction, without a temporary array ...
EDIT: this command only outputs the wanted text on stdout. To change the file itself, use the switch -i when calling perl:
perl -i.bak -ne 'my @wo = ($_ =~ /parameter:\{\{.*?\}\}/g); print join(" ", @wo);' your_text_file
A backup file is created with the extension .bak appended to its name, and the result is written to a file with the same name as the input file. Note that you can skip the backup file by using the switch -i alone, but some platforms don't allow this. See perldoc perlrun for more information.

regex command line linux - select all lines between two strings

I have a text file with contents like this:
here is some super text:
this is text that should
be selected with a cool match
And this is how it all ends
blah blah...
I am trying to get the two lines (but could be more or less lines) between:
some super text:
and
And this is how
I am using grep on an ubuntu machine and a lot of the patterns I've found seem to be specific to different kinds of regex engines.
So I should end up with something like this:
grep "my regex goes here" myFileNameHere
Not sure if egrep is needed, but could use that just as easy.
You can use addresses in sed:
sed -e '/some super text/,/And this is how/!d' file
!d means "don't output if not in the range".
To exclude the border lines, you must be more clever:
sed -n -e '/some super text/ {n;b c}; d;:c {/And this is how/ {d};p;n;b c}' file
Or, similarly, in Perl:
perl -ne 'print if /some super text/ .. /And this is how/' file
To exclude the border lines again, change it to
perl -ne '$in = /some super text/ .. /And this is how/; print if $in > 1 and $in !~ /E/' file
I don't see how it could be done in grep. Using awk:
awk '/^And this is how/ {p=0}; p; /some super text:$/ {p=1}' file
Try pcregrep instead of plain grep, because plain grep matches one line at a time and won't fetch several lines in a row.
$ pcregrep -M -o '(?s)some super text:[^\n]*\n\K.*?(?=\n[^\n]*And this is how)' file
this is text that should
be selected with a cool match
(?s) The dotall modifier lets the dot match newline characters as well.
\K Discards the previously matched characters from the reported match.
From pcregrep --help
-M, --multiline run in multiline mode
-o, --only-matching=n show only the part of the line that matched
TL;DR
With your corpus, another way to solve the problem is by matching lines with leading whitespace, rather than using a flip-flop operator of some sort to match start and end lines. The following solutions work with your posted example.
GNU Grep with PCRE Compiled In
$ grep -Po '^\s+\K.*' /tmp/corpus
this is text that should
be selected with a cool match
Alternative: Use pcregrep Instead
$ pcregrep -o '^\s+\K.*' /tmp/corpus
this is text that should
be selected with a cool match

Grep on Linux - How do I replace text with blankspace and newlines

I'm not used to using grep on Linux via the terminal. I'm used to using dnGREP on Windows, but there is no comparable GUI tool on Ubuntu from what I've found.
How do I match the regular expression "^(.*?)[" against all files in a folder and replace it with a blank space?
I assume this one would follow the same methodology "](?=[^.]*$)"
Also, how do I replace the text below to add new lines
{"dev_is_looking_week"
with the same text and 4 blank lines underneath. Ignore the "." at the end. StackOverflow won't show blank newlines without a character at the end.
{"dev_is_looking_week"
.
You are using the wrong tool. grep is for selecting data. You may want to use awk, perl or sed instead.
Some examples:
awk '/example/ {print; printf "\n\n\n\n"; }'
awk '{print;} /example/ {printf "\n\n\n\n"; }'
perl -ne 'print $_; /example/ && print "\n\n\n\n"'
Note that perl also has the neat -i option, for inplace modification of files, which comes in handy when you have to do this change on a lot of files.
Or you might opt for regexxer, redet, or kregexpeditor from KDE.
You can use sed like this:
sed 's/{"dev_is_looking_week"/&\n\n\n\n/' file
OR using awk:
awk '/{"dev_is_looking_week"/{$0=sprintf("%s\n\n\n\n", $0)} 1'