Can grep show only result i want - regex

I have data like this:
tatusx2.atc?beginnum=0;8pctgRB Mwdf fgEio"text1"text4"text
tatqsx3.atc?beginnum=1;8pctgRBwsaNezxio"text2
tatssx4.atc?beginnum=2;8pctgsvMALNejkio"data2
tatksx4.atc?beginnum=1;8pctgxdfALNebfio"text3
tatzsx5.atc?beginnum=3;8pwerRBMALNetior"datac
How do I get only the data between ; and "?
I have tried grep -oP ';.*?"' file and got this output:
;8pctgRBMwdffgEio"
;8pctgRBwsaNezxio"
;8pctgsvMALNejkio"
;8pctgxdfALNebfio"
;8pwerRBMALNetior"
But my desired output is:
8pctgRB Mwdf fgEio
8pctgRBwsaNezxio
8pctgsvMALNejkio
8pctgxdfALNebfio
8pwerRBMALNetior

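You need to use lookahead and lookbehind assertions:
grep -oP '(?<=;)[^"]*(?=")' file
Here [^"]* matches everything up to the first double quote, including the spaces in the first line.
I suggest you play around with regexr to learn more about regular expressions. Check out their cheatsheet.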

A much more readable way to write the expression you need is:
grep -oP '(?<=;).*?(?=")' file
which gets you the desired result; the non-greedy .*? stops at the first " instead of the last one on the line. PERL regexes are apparently experimental in grep, but certain patterns work without issues.
The following options are being used:
-o --only-matching to print only the matched parts of a matching line
-P --perl-regexp
(?<=;) is a lookbehind: the match must be preceded by ;, but the ; itself is not part of the output, so the result starts at the index after it. Similarly, (?=") is a lookahead that marks where the match ends without including the ".
Here is suggested additional reading.
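As a quick check of the lookaround approach, here is a minimal sketch, assuming GNU grep built with -P support. The temporary file and its two lines are only a stand-in for the data in the question, and [^"]* is used so the match cannot run past the first double quote.

```shell
# Hypothetical sample file reproducing two lines from the question.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
tatusx2.atc?beginnum=0;8pctgRB Mwdf fgEio"text1"text4"text
tatqsx3.atc?beginnum=1;8pctgRBwsaNezxio"text2
EOF

# (?<=;) requires a preceding ; and (?=") a following ", neither is printed.
out=$(grep -oP '(?<=;)[^"]*(?=")' "$tmp")
printf '%s\n' "$out"
rm -f "$tmp"
```

The first line keeps its embedded spaces because the character class only excludes the double quote.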

Related

Regular Expression between strings (multiple results?)

I am using a regular expression to filter a link from a HTML page like so:
(?<=data-ng-non-bindable data-src=\")(.*?)(?=\" data-caption)
How do I change it so that I get multiple results, not only the first one?
With sed, you replace strings rather than extract them. There are options you may set to output only the replaced substrings, but there is always a big problem with multiple matches on the same line.
Due to this, the easiest will be using grep with -oP options:
grep -oP '(?<=data-ng-non-bindable data-src=").*?(?=" data-caption)' file > outfile
Double quotation marks are not special in a regex, so they do not need to be escaped.
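To see how -o together with the non-greedy .*? yields one result per link, here is a small sketch, assuming GNU grep with -P. The html string is a made-up line holding two links; it is not taken from the original page.

```shell
# Hypothetical single line containing two matching attributes.
html='<img data-ng-non-bindable data-src="a.jpg" data-caption="x"><img data-ng-non-bindable data-src="b.jpg" data-caption="y">'

# -o prints every match on its own line, even when both sit on one input line.
links=$(printf '%s\n' "$html" |
  grep -oP '(?<=data-ng-non-bindable data-src=").*?(?=" data-caption)')
printf '%s\n' "$links"
```

With a greedy .* the two links would be swallowed into a single match that runs up to the last data-caption.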

extract pattern using powershell script

My bad, I have updated the question; it's using PowerShell.
my file contains 1000s of lines like below:
<dependency org="${abcd}" name="some-random-name" rev="100.100" conf="compile;runtime"/>
I would like to get only the output like:
name="some-random-name"
How can I achieve this? Please help.
This probably will solve your issue:
grep -oP 'name="[\w-]*"' file
Explaining:
grep is the tool that prints lines matching a pattern
-o option will print only the matching parts
-P option will use Perl-style regex in order to allow the \w metacharacter.
[\w-]* will match any string containing only 'word' characters or dash with size >= 0
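The explanation above can be tried out on the sample line from the question. This is only a sketch, assuming GNU grep with -P is available; the dep variable is introduced here for illustration.

```shell
# Sample line from the question; single quotes keep ${abcd} literal.
dep='<dependency org="${abcd}" name="some-random-name" rev="100.100" conf="compile;runtime"/>'

# -o prints just the matched attribute, not the whole line.
attr=$(printf '%s\n' "$dep" | grep -oP 'name="[\w-]*"')
printf '%s\n' "$attr"
```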

Grep/Sed between two tags with multiline

I have many files from which I need to get information.
Example of my files:
first file content:
"test This info i need grep</singleline>"
and
second file content (with two lines):
"test This info=
i need grep too</singleline>"
In the results I need this text: from the first file "This info i need grep" and from the second file "This info= i need grep too".
For the first file I use:
grep -o 'test .*</singleline>' * | sed -e 's/test \(.*\)<\/singleline>/\1/'
and successfully get "This info i need grep", but I cannot get the information from the second file with the same command.
Please help me rewrite the command, or suggest another one.
Or, if you insist on using grep, you can use:
grep -Pzo 'test(\n|.)*(?=</singleline>)' test.txt
To understand the meaning of each flag, use grep --help:
-P, --perl-regexp
PATTERN is a Perl regular expression
-o, --only-matching
show only the part of a line matching PATTERN
-z, --null-data
a data line ends in 0 byte, not newline
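To illustrate what -z changes, here is a rough sketch, assuming GNU grep with -P. It feeds the two-line sample from the question through a pipe and uses the non-greedy *? so that a match would stop at the nearest closing tag. With -z the matches are terminated by a 0 byte, so tr turns that back into a newline for display.

```shell
# Two-line sample from the question, passed in as one NUL-delimited record.
out=$(printf '%s\n' '"test This info=' 'i need grep too</singleline>"' |
  grep -Pzo 'test(\n|.)*?(?=</singleline>)' |
  tr '\0' '\n')
printf '%s\n' "$out"
```

The match still spans two lines, so it needs further cleanup if a single-line result is wanted.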
I'd use pcregrep, which can match multiline regexes:
pcregrep -Mo 'test \K((?s).)*?(?=</singleline>)' filename
The tricks are:
-M allows pcregrep to match on more than one line,
-o makes it print only the match,
\K throws away the part of the match that comes before it,
(?=</singleline>) is a lookahead term that matches an empty string if (and only if) it is followed by </singleline>, and
((?s).)*? to match any characters non-greedily, which is to say that if you have several occurrences of </singleline> in the file, it will match until the closest rather than the furthest. If this is not desired, remove the ?. (?s) enables the s option locally for the term to make . match newlines in it; it wouldn't do that by default.
Thanks to @CasimiretHippolyte for pointing out the ((?s).) alternative to (.|\n).
It looks like you're parsing quoted-printable encoded text, where a "soft" line break (one that is an artifact from fixed-line-width formatting) is indicated with a line-terminating = (directly before the \n).
Since in a later comment you also expressed the desire to print each match as a single line, I suggest the following 2-pass approach:
use awk to remove the soft line breaks
then use grep on the result
awk '/=$/ { printf "%s", substr($0, 1, length($0)-1); next } 1' file |
grep -Po 'test .*?(?=</singleline>)'
Tip of the hat to Wintermute's helpful answer for the non-greedy quantifier, *?, and both Wintermute's and Maroun Maroun's helpful answer for the positive look-ahead assertion, (?=...).
Note that the awk command removes the line-ending = (along with the newline); replace the substr call with just $0 to retain it.
Since strings of interest are first converted back to their original single-line representations:
The matches are printed in their original form.
You can use regular (GNU) grep with line-by-line matching; contrast this with
needing to read the entire file at once, as in Maroun Maroun's helpful answer.
Note that, as of this writing, * must be replaced with *? in his answer for it to work correctly in files with multiple matches.
needing to install another utility, pcregrep, as in Wintermute's helpful answer.
additionally, the matches would have to be cleaned up to be single-line (something you didn't originally state as a requirement).
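The two-pass idea can be sketched on the second file from the question. This assumes GNU grep with -P; the temporary file is a stand-in, and length($0)-1 is used so that only the trailing = is dropped.

```shell
# Hypothetical copy of the two-line file from the question.
tmp=$(mktemp)
printf '%s\n' '"test This info=' 'i need grep too</singleline>"' > "$tmp"

# Pass 1: awk joins lines ending in = (soft line breaks), dropping the =.
# Pass 2: grep extracts the text up to, but not including, the closing tag.
joined=$(awk '/=$/ { printf "%s", substr($0, 1, length($0)-1); next } 1' "$tmp" |
  grep -Po 'test .*?(?=</singleline>)')
printf '%s\n' "$joined"
rm -f "$tmp"
```

Because a soft line break carries no whitespace of its own, the two halves are joined directly, giving "test This infoi need grep too" for this input.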

Bash/PHP extract URL from HTML via regex

Is there any easy way to extract this URL in bash/or PHP?
http://shop.image-site.com/images/2/format2013/fullies/kju_product.png
From this HTML code?
<a href="javascript: open_window_zoom('http://shop.image-site.com/image.php?image=http://shop.image-site.com/images/2/format2013/fullies/kju_product.png&pID=31777&download=kju.png&name=13011 KELLYS Kju: 490mm (19.5")',550,366);">
With perl you could do a match and a capture
perl -n -e 'print "$1\n" if (m/image=(.*?)\&/);'
This captures everything between image= and the next & and prints the captured group, $1.
For more on regular expressions, see perlre or http://www.regular-expressions.info/
In bash, you can try the following:
sed 's/.*image=\(http:\/\/[^&]*\).*/\1/g'
Update:
The solution above performs substitution rather than extraction. The line containing the pattern (required url) is replaced by the pattern itself. However, the substitution isn't in-place.
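Since plain substitution prints every line, matching or not, one way to make sed behave more like an extractor is -n together with the p flag. The following is a sketch of that variant run against the HTML line from the question; the line and url variables are introduced here for illustration.

```shell
# HTML line from the question, read via a quoted here-document so that
# the embedded single and double quotes need no escaping.
line=$(cat <<'EOF'
<a href="javascript: open_window_zoom('http://shop.image-site.com/image.php?image=http://shop.image-site.com/images/2/format2013/fullies/kju_product.png&pID=31777&download=kju.png&name=13011 KELLYS Kju: 490mm (19.5")',550,366);">
EOF
)

# -n suppresses default output; the p flag prints only lines where the
# substitution succeeded, so non-matching lines are dropped.
url=$(printf '%s\n' "$line" | sed -n 's/.*image=\(http:\/\/[^&]*\).*/\1/p')
printf '%s\n' "$url"
```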
Whichever way you decide to dress it up, you could simply split with the delimiter equal to ?image= and then split the second token you receive (i.e. result[1]) with a simple & delimiter. The first result from that split is your answer.
However, a pure regex match would look something like: m#image=([\w:/.-]+)&#i. You can take that regex and put it wherever you want to get your result stored in $1. Despite what a lot of people think, you do not have to match the beginning of a line and the end of a line to match a result.
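The split-on-delimiter idea can be sketched in the shell without any regex, using parameter expansion. The href variable below is a shortened, hypothetical version of the link from the question.

```shell
# Shortened stand-in for the link in the question.
href='http://shop.image-site.com/image.php?image=http://shop.image-site.com/images/2/format2013/fullies/kju_product.png&pID=31777&download=kju.png'

# First split: drop everything up to and including the first "image=".
rest=${href#*image=}
# Second split: drop everything from the first "&" onwards.
url=${rest%%&*}
printf '%s\n' "$url"
```

This relies on the image URL itself containing no & character, which holds for the example.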
Try doing this :
xmllint --html --xpath '//a/@href' file://file.html |
grep -oP 'image=\Khttp://.*?\.png'
You can use a URL instead of a local file:
http://domain.tld/path
Or if you had already extracted the line to parse in the $string variable :
grep -oP 'image=\Khttp://.*?\.png' <<< "$string"

grep - search for "<?\n" at start of a file

I have a hunch that I should probably be using ack or egrep instead, but what should I use to basically look for
<?
at the start of a file? I'm trying to find all files that contain the php short open tag since I migrated a bunch of legacy scripts to a relatively new server with the latest php 5.
I know the regex would probably be '/^<\?\n/'
I RTFM and ended up using:
grep -RlIP '^<\?\n' *
The -P option enables full Perl-compatible regexes.
If you're looking for all php short tags, use a negative lookahead
/<\?(?!php)/
will match <? but will not match <?php
[meder ~/project]$ grep -rP '<\?(?!php)' .
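To see what the negative lookahead does and does not catch, here is a small sketch, assuming GNU grep with -P. The three input lines are made up for illustration.

```shell
# Three hypothetical lines: a full tag, a short tag, and a short echo tag.
tags=$(printf '%s\n' '<?php echo 1; ?>' '<? echo 2; ?>' '<?= $x ?>' |
  grep -nP '<\?(?!php)')
printf '%s\n' "$tags"
```

Lines 2 and 3 are reported while the <?php line is skipped. Note that <?= and an XML declaration such as <?xml would also match this pattern.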
find . -name "*.php" | xargs grep -nHo "<?[^p^x]"
^x to exclude xml start tag
If you are worried about Windows line endings, just add \r?.
grep '^<?$' filename
Don't know if that is showing up correctly. Should be
grep ' ^ < ? $ ' filename
Do you mean a literal "backslash n" or do you mean a newline?
For the former:
grep '^<?\\n' [files]
For the latter:
grep '^<?$' [files]
Note that grep will search all lines, so if you want to find matches just at the beginning of the file, you'll need to either filter each file down to its first line, or ask grep to print out line numbers and then only look for line-1 matches.
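Filtering each file down to its first line could look roughly like the sketch below. The temporary directory and the three file names are invented for the demonstration; in basic regular expressions the ? is a literal character, so it needs no backslash.

```shell
# Hypothetical test files: only short.php starts with a bare "<?" line.
dir=$(mktemp -d)
printf '<?\necho 1;\n' > "$dir/short.php"
printf '<?php\necho 1;\n' > "$dir/long.php"
printf 'text\n<?\n' > "$dir/later.php"

# Check only the first line of each file and print the names that match.
found=$(for f in "$dir"/*.php; do
  if head -n 1 "$f" | grep -q '^<?$'; then
    basename "$f"
  fi
done)
printf '%s\n' "$found"
rm -rf "$dir"
```

The file later.php is not reported even though it contains a <? line, because that line is not the first one.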