grep - regular expression - match till a specific word

grep - regular expression - match till a specific word - regex

Lets say I have a file with lines like this
abcefghijklxyz
abcefghijkl
I want to get only the string between abc and the end of the line. End of the line can be defined as the normal end of line or the string xyz.
My question is
How can I get only the matched string using grep and regular expressions? For example, the expected output for the two lines shown above would be
efghijkl
efghijkl
I don't want the starting and ending markers.
What I have tried till now
grep -oh "abc.*xyz"
I use Ubuntu 13.04 and Bash shell.

this line chops leading abc and ending xyz (if there was) away, and gives you the part you need:
grep -oP '^abc\K.*?(?=xyz$|$)'
with your example:
kent$ echo "abcefghijklxyz
abcefghijkl"|grep -oP '^abc\K.*?(?=xyz$|$)'
efghijkl
efghijkl
another example with xyz in the middle of the text:
kent$ echo "abcefghijklxyz
abcefghijkl
abcfffffxyzbbbxyz
abcffffxyzbbb"|grep -oP '^abc\K.*?(?=xyz$|$)'
efghijkl
efghijkl
fffffxyzbbb
ffffxyzbbb

Using sed:
sed -n '/abc/{s/.*abc\(.*\)/\1/;s/xyz.*//;p}' input
Produces:
efghijkl
efghijkl

Use a look-behind like this:
$ grep -Po '(?<=abc)[^x]*' file
efghijkl
efghijkl
It fetches everything after abc and until it finds a x.
Based on Kent's answer (not to copy, but for completeness) you can grep all within abc and xyz (or end of line):
$ grep -Po '(?<=abc).*(?=xyz|$)' file
efghijklxyz
efghijkl

Or you can just remove what you do not like:
awk '/^abc/{sub(/^abc/,x);sub(/xyz.*$/,x)}1' file
efghijkl
efghijkl
xyz.*$ represent everything from xyz to end of line.

Related

How to get the release value?

I've a file with the below name formats:
rzp-QAQ_SA2-5.12.0.38-quality.zip
rzp-TEST-5.12.0.38-quality.zip
rzp-ASQ_TFC-5.12.0.38-quality.zip
I want the value as: 5.12.0.38-quality.zip from the above file names.
I'm trying as below, but not getting the correct value though:
echo "$fl_name" | sed 's#^[-[:alpha:]_[:digit:]]*##'
fl_name is the variable containing the file name.
Thanks a lot in advance!

You are matching too much with all the alpha, digit - and _ in the same character class.
You can match alpha and - and optionally _ and alphanumerics
sed -E 's#^[-[:alpha:]]+(_[[:alnum:]]*-)?##' file
Or you can shorten the first character class, and match a - at the end:
sed -E 's#^[-[:alnum:]_]*-##' file
Output of both examples
5.12.0.38-quality.zip
5.12.0.38-quality.zip
5.12.0.38-quality.zip

With GNU grep you could try following code. Written and tested with shown samples.
grep -oP '(.*?-){2}\K.*' Input_file
OR as an alternative use(with a non-capturing group solution, as per the fourth bird's nice suggestion):
grep -oP '(?:[^-]*-){2}\K.*' Input_file
Explanation: using GNU grep here. in grep program using -oP option which is for matching exact matched values and to enable PCRE flavor respectively in program. Then in main program, using regex (.*?-){2} means, using lazy match till - 2 times here(to get first 2 matches of - here) then using \K option which is to make sure that till now matched value is forgotten and only next mentioned regex matched value will be printed, which will print rest of the values here.

It is much easier to use cut here:
cut -d- -f3- file
5.12.0.38-quality.zip
5.12.0.38-quality.zip
5.12.0.38-quality.zip
If you want sed then use:
sed -E 's/^([^-]*-){2}//' file
5.12.0.38-quality.zip
5.12.0.38-quality.zip
5.12.0.38-quality.zip

Assumptions:
all filenames contain 3 hyphens (-)
the desired result always consists of stripping off the 1st two hyphen-delimited strings
OP wants to perform this operation on a variable
We can eliminate the overhead of sub-process calls (eg, grep, cut and sed) by using parameter substitution:
$ f1_name='rzp-ASQ_TFC-5.12.0.38-quality.zip'
$ new_f1_name="${f1_name#*-}" # strip off first hyphen-delimited string
$ echo "${new_f1_name}"
ASQ_TFC-5.12.0.38-quality.zip
$ new_f1_name="${new_f1_name#*-}" # strip off next hyphen-delimited string
$ echo "${new_f1_name}"
5.12.0.38-quality.zip
On the other hand if OP is feeding a list of file names to a looping construct, and the original file names are not needed, it may be easier to perform a bulk operation on the list of file names before processing by the loop, eg:
while read -r new_f1_name
do
... process "${new_f1_name)"
done < <( command-that-generates-list-of-file-names | cut -d- -f3-)

In plain bash:
echo "${fl_name#*-*-}"

You can do a reverse of each line, and get the two last elements separated by "-" and then reverse again:
cat "$fl_name"| rev | cut -f1,2 -d'-' | rev

A Perl solution capturing digits and characters trailing a '-'
cat f_name | perl -lne 'chomp; /.*?-(\d+.*?)\z/g;print $1'

Deleting everything between two string matches in a file

I got this text in file.txt:
Osmun.Prez#mail.com:c7lB2m6b#3.a.a:tt_webid_v2=6990226111024612869; tt_webid=6990226111024612869; tt_csrf_token=VD5Nb_TQFH4RKhoJeSe2nzLB; R6kq3TV7=AHkh4PB6AQAA3LIS90nWf2ss0Q7ZTCQjUat4axctvhQY68DdUEz92RwpmVSX|1|0|e9d6917c2fe555827dcf5ee916ba9778079ab2a9; ttwid=1%7CAFodeNF0iZM2fyy-ZeiZ6HTpZoG_MSx6SmXHgGVQ-V4%7C1627538859%7C59ca1e4a56f9f537b55e655a6dabff88e44eb48502b164ed6b4199f5a5263cb0; passport_csrf_token_default=6f7653c3ce946a6ce5444723fb0c509b; passport_csrf_token=6f7653c3ce946a6ce5444723fb0c509b; sid_guard=0483b7d37f4e4bd20ab3046e29724798%7C1627538893%7C5184000%7CMon%2C+27-Sep-2021+06%3A08%3A13+GMT; uid_tt=27b52febe6222486b9f6b6a90ef4ffeace5ea25c09d29a1583be5a1ecf760996; uid_tt_ss=27b52febe6222486b9f6b6a90ef4ffeace5ea25c09d29a1583be5a1ecf760996; sid_tt=0483b7d37f4e4bd20ab3046e29724798; sessionid=0483b7d37f4e4bd20ab3046e29724798; sessionid_ss=0483b7d37f4e4bd20ab3046e29724798; store-idc=maliva; store-country-code=us; odin_tt=294845c8f7711db177f7c549a9f44edb1555031b27a2a485df809cd92c4e544ac0772bf462df5b7a100f6e488c45303cd62df3b6b950f0842520cd887850137b035d990f29cc8b752765e594560c977f; cmpl_token=AgQQAPNSF-RMpbE89z5HYF0_-2PcrxjXf4fZYP5_ZA
How can I delete everything from the string inside ( first & only instance ) from :tt_ to _ZA in file.txt keeping only Osmun.Prez#mail.com:c7lB2m6b#3.a.a using bash linux?
Thank you

Something like:
sed -i "s/:tt_.*//" file.txt
if you want to edit the file in place. If not, remove the -i switch.
The sed command means: replace (s), in each line of file.txt, all the chars (.*) starting by the pattern :tt_ with an empty string (//).
Or the command:
sed -i "s/:tt_.*_ZA//" file.txt
which is more adherent to what you ask for, but returns the same output.

Use pattern substitution:
i=$(cat file.txt)
echo "${i/:tt*_ZA}"

Assuming the general requirement is to remove everything after the 2nd : ...
Sample data:
$ cat file.txt
Osmun.Prez#mail.com:c7lB2m6b#3.a.a:tt_webid_v ... to end of line
some.one#home.com:B52_m6b#9_az.more.stuff:delete from here ... to end of line
One sed idea:
$ sed -En 's/^([^:]*:[^:]*).*$/\1/p' file.txt
Osmun.Prez#mail.com:c7lB2m6b#3.a.a
some.one#home.com:B52_m6b#9_az.more.stuff

Using awk
awk 'BEGIN{FS=OFS=":"}{print $1,$2}'
Using : as the delimiter, it is easy to extract the columns before :tt

This deletes all chars from ":tt_" to the last "_ZA", inclusive, in file.txt
Mac_3.2.57$cat file.txt | sed 's/\(\)[:]tt.*_ZA\(.*\)/\1\2/'
Osmun.Prez#mail.com:c7lB2m6b#3.a.a
Mac_3.2.57$

Or if it is always the first 2 values which are separated by colon (as per you example)
cat file.txt | cut -f1,2 -d’:’

unix - pattern matching in file

so I have a file with the following:
username=jsmith
api=3434kjklj23j4l3kj4l34j3l4j
I would like to return using regular expression "jsmith" and "3434kjklj23j4l3kj4l34j3l4j"
I know the regular expression for it is:
(username=)(.*) > \2
(api=)(.*) > \2
however using grep or sed or awk. I can't seem to figure out the way to use them without return the entire line.
How would you go about doing that with a commandline command?

awk is made for this task:
awk -F= '{print$2}' file
If the file has other entries, you can limit the output with a condition:
awk -F= '$1=="username"||$1=="api"{print$2}' file

Here is one using bash, PCRE and positive lookbehind (where supported):
$ grep -Po "((?<=^username=)|(?<=^api=)).*" file
jsmith
3434kjklj23j4l3kj4l34j3l4j
ie. output everything that is preceeded by username= or api= that start the lines.
And one in awk:
$ awk 'sub(/^(username|api)=/,""){print}' file
jsmith
3434kjklj23j4l3kj4l34j3l4j
ie. print lines where preceeding ^username= or ^api= are removed first.

Since you want to see chess with the input game=chess, here some solutions without matching username= or api=
cut -d"=" -f2- file
# or
sed -n 's/[^=]*=//p' file

here's the answer that worked on the macos and RHEl7.
awk -F= '$1=="username"{print$2}' testfile.txt
awk -F= '$1=="api"{print$2}' testfile.txt
testfile.txt
username=user1
api=pass1
username=user2
api =pass2

Match multiple patterns in same line using sed [duplicate]

Given a file, for example:
potato: 1234
apple: 5678
potato: 5432
grape: 4567
banana: 5432
sushi: 56789
I'd like to grep for all lines that start with potato: but only pipe the numbers that follow potato:. So in the above example, the output would be:
1234
5432
How can I do that?

grep 'potato:' file.txt | sed 's/^.*: //'
grep looks for any line that contains the string potato:, then, for each of these lines, sed replaces (s/// - substitute) any character (.*) from the beginning of the line (^) until the last occurrence of the sequence : (colon followed by space) with the empty string (s/...// - substitute the first part with the second part, which is empty).
or
grep 'potato:' file.txt | cut -d\ -f2
For each line that contains potato:, cut will split the line into multiple fields delimited by space (-d\ - d = delimiter, \ = escaped space character, something like -d" " would have also worked) and print the second field of each such line (-f2).
or
grep 'potato:' file.txt | awk '{print $2}'
For each line that contains potato:, awk will print the second field (print $2) which is delimited by default by spaces.
or
grep 'potato:' file.txt | perl -e 'for(<>){s/^.*: //;print}'
All lines that contain potato: are sent to an inline (-e) Perl script that takes all lines from stdin, then, for each of these lines, does the same substitution as in the first example above, then prints it.
or
awk '{if(/potato:/) print $2}' < file.txt
The file is sent via stdin (< file.txt sends the contents of the file via stdin to the command on the left) to an awk script that, for each line that contains potato: (if(/potato:/) returns true if the regular expression /potato:/ matches the current line), prints the second field, as described above.
or
perl -e 'for(<>){/potato:/ && s/^.*: // && print}' < file.txt
The file is sent via stdin (< file.txt, see above) to a Perl script that works similarly to the one above, but this time it also makes sure each line contains the string potato: (/potato:/ is a regular expression that matches if the current line contains potato:, and, if it does (&&), then proceeds to apply the regular expression described above and prints the result).

Or use regex assertions: grep -oP '(?<=potato: ).*' file.txt

grep -Po 'potato:\s\K.*' file
-P to use Perl regular expression
-o to output only the match
\s to match the space after potato:
\K to omit the match
.* to match rest of the string(s)

sed -n 's/^potato:[[:space:]]*//p' file.txt
One can think of Grep as a restricted Sed, or of Sed as a generalized Grep. In this case, Sed is one good, lightweight tool that does what you want -- though, of course, there exist several other reasonable ways to do it, too.

This will print everything after each match, on that same line only:
perl -lne 'print $1 if /^potato:\s*(.*)/' file.txt
This will do the same, except it will also print all subsequent lines:
perl -lne 'if ($found){print} elsif (/^potato:\s*(.*)/){print $1; $found++}' file.txt
These command-line options are used:
-n loop around each line of the input file
-l removes newlines before processing, and adds them back in afterwards
-e execute the perl code

You can use grep, as the other answers state. But you don't need grep, awk, sed, perl, cut, or any external tool. You can do it with pure bash.
Try this (semicolons are there to allow you to put it all on one line):
$ while read line;
do
if [[ "${line%%:\ *}" == "potato" ]];
then
echo ${line##*:\ };
fi;
done< file.txt
## tells bash to delete the longest match of ": " in $line from the front.
$ while read line; do echo ${line##*:\ }; done< file.txt
1234
5678
5432
4567
5432
56789
or if you wanted the key rather than the value, %% tells bash to delete the longest match of ": " in $line from the end.
$ while read line; do echo ${line%%:\ *}; done< file.txt
potato
apple
potato
grape
banana
sushi
The substring to split on is ":\ " because the space character must be escaped with the backslash.
You can find more like these at the linux documentation project.

Modern BASH has support for regular expressions:
while read -r line; do
if [[ $line =~ ^potato:\ ([0-9]+) ]]; then
echo "${BASH_REMATCH[1]}"
fi
done

grep potato file | grep -o "[0-9].*"

How to match and keep the first number in a line using sed?

Question
Let's say I have one line of text with a number placed somewhere (it could be at the beginning, in the middle or at the end of the line).
How to match and keep the first number found in a line using sed?
Minimal example
Here is my attempt (following this page of a tutorial on regular expressions) and the output for different positions of the number:
$echo "SomeText 123SomeText" | sed 's:.*\([0-9][0-9]*\).*:\1:'
3
$echo "123SomeText" | sed 's:.*\([0-9][0-9]*\).*:\1:'
3
$echo "SomeText 123" | sed 's:.*\([0-9][0-9]*\).*:\1:'
3
As you can only the last digit is kept in the process whereas the desired output should be 123...

Using sed:
echo "SomeText 123SomeText 456" | sed -r 's/^[^0-9]*([0-9]+).*$/\1/'
123
You can also do this in gnu awk:
echo "SomeText 123SomeText 456" | awk '{print gensub(/^[^0-9]*([0-9]+).*$/, "\\1", $0)}'
123

To complement the sed solutions, here's an awk alternative (assuming that the goal is to extract the 1st number on each line, if any (i.e., ignore lines without any numbers)):
awk -F'[^0-9]*' '/[0-9]/ { print ($1 != "" ? $1 : $2) }'
-F'[^0-9]*' defines any sequence of non-digit chars. (including the empty string) as the field separator; awk automatically breaks each input line into fields based on that separator, with $1 representing the first field, $2 the second, and so on.
/[0-9]/ is a pattern (condition) that ensures that output is only produced for lines that contain at least one digit, via its associated action (the {...} block) - in other words: lines containing NO number at all are ignored.
{ print ($1!="" ? $1 : $2) } prints the 1st field, if nonempty, otherwise the 2nd one; rationale: if the line starts with a number, the 1st field will contain the 1st number on the line (because the line starts with a field rather than a separator; otherwise, it is the 2nd field that contains the 1st number (because the line starts with a separator).

You can also use grep, which is ideally suited to this task. sed is a Stream EDitor, which is only going to indirectly give you what you want. With grep, you only have to specify the part of the line you want.
$ cat file.txt
SomeText 123SomeText
123SomeText
SomeText 123
$ grep -o '[0-9]\+' file.txt
123
123
123
grep -o prints only the matching parts of a line, each on a separate line. The pattern is simple: one or more digits.
If your version of grep is compatible with the -P switch, you can use Perl-style regular expressions and make the command even shorter:
$ grep -Po '\d+' file.txt
123
123
123
Again, this matches one or more digits.
Using grep is a lot simpler and has the advantage that if the line doesn't match, nothing is printed:
$ echo "no number" | grep -Po '\d+' # no output
$ echo "yes 123number" | grep -Po '\d+'
123
edit
As pointed out in the comments, one possible problem is that this won't only print the first matching number on the line. If the line contains more than one number, they will all be printed. As far as I'm aware, this can't be done using grep -o.
In that case, I'd go with perl:
perl -lne 'print $1 if /.*?(\d+).*/'
This uses lazy matching (the question mark) so only non-digit characters are consumed by the .* at the start of the pattern. The $1 is a back reference, like \1 in sed. If there are more than one number on the line, this only prints the first. If there aren't any at all, it doesn't print anything:
$ echo "no number" | perl -ne 'print "$1\n" if /.*?(\d+).*/'
$ echo "yes123number456" | perl -lne 'print $1 if /.*?(\d+).*/'
123
If for some reason you still really want to use sed, you can do this:
sed -n 's/^[^0-9]*$[0-9]\{1,\}$.*$/\1/p'
unlike the other answers, this is compatible with all version of sed and will only print lines that contain a match.

Try this sed command,
$echo "SomeText 123SomeText" | sed -r '/[^0-9]*([0-9][0-9]*)[^0-9]*/ s//\1 /g'
123
Another example,
$ echo "SomeText 123SomeText 456" | sed -r '/[^0-9]*([0-9][0-9]*)[^0-9]*/ s//\1 /g'
123 456
It prints all the numbers in a file and the captured numbers are separated by spaces while printing.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js