Extract sub-string from strings based on condition with shell command line - regex

I have lines in myfile like this:
mount -t cifs //hostname/path/ /mount/path/ -o username='xxxx',password='xxxxx'
I need to extract sub-strings from this based on the condition "start with // and continue until the next white-space, including the //".
I can't parse with the position as it won't be the same in all matched lines.
So far I have extracted the sub-string using grep's perl assertion, but the result does not return the //.
The piece of code I've used is
cat myfile | grep " cifs " | grep -oP "(?<=/)[^\s]*" | grep -v ^/
Output:
hostname/path/
Expected Output:
//hostname/path/
Is there a way to get the desired output by modifying the Perl regex, or perhaps by some other method?

Simple bash one line solution
grep " cifs " myfile | sed -e "s/ /\n/g" | grep '^\/\/'

You may consider using some non-PCRE based solutions like
sed -En '/ cifs /{s,.*(//[^[:space:]]+).*,\1,p}' file
grep -oE '//[^[:space:]]+' file
The grep solution simply extracts all occurrences of // and 1+ non-whitespace chars after from the file.
The sed solution finds lines containing cifs and then extracts the last occurrence of // and 1+ non-whitespace chars after on those lines.
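As a rough sketch of how the two ideas above can be combined, the function below first keeps only the lines that mention cifs and then extracts every // token from them. The name extract_cifs_shares and the second sample line are made up for illustration; they are not from the original post.

```shell
# Print every //host/share token found on lines that mention " cifs ".
# extract_cifs_shares is a hypothetical helper name used only for this sketch.
extract_cifs_shares() {
    grep ' cifs ' | grep -oE '//[^[:space:]]+'
}

printf '%s\n' \
    "mount -t cifs //hostname/path/ /mount/path/ -o username='xxxx',password='xxxxx'" \
    "mount -t ext4 /dev/sda1 /data" |
    extract_cifs_shares
```

It reads from standard input, so a real file can be passed with extract_cifs_shares < myfile.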

Following command should do what you ask for
grep cifs myfile | cut -d ' ' -f 4
or
grep cifs myfile | nawk '{print $4}'
or
awk '/cifs/ { print $4 }' myfile
or
perl -lne 'print $1 if /cifs\s+(\S+)/' myfile

Related

Filtering matched content

I want to filter the content that follows a match and keep it up to and including the first value after the ".".
I have an output something like this:
Output:
product: 13.6.0.35_0
More specifically, I need only the first two digits and the first digit after the dot. The solution should not depend on the specific values shown here, but on a general method of filtering the content.
Expected:
13.6
I tried something like:
echo "product: 13.6.0.35_0" | grep -ow '\w*13\w*'
If you need to use grep with the current logic, you can use
echo "product: 13.6.0.35_0" | grep -ow '13\.[0-9]*' | head -1
where 13\.[0-9]* matches 13, . and zero or more digits (as whole word due to w option) and head -1 gets the first match.
You may also use sed or awk:
sed -En 's/.* ([0-9]+\.[0-9]+).*/\1/p' <<< "product: 13.6.0.35_0"
awk -F'[[:space:].]' '{print $2"."$3}' <<< "product: 13.6.0.35_0"
The sed command matches any text up to space, then matches the space and captures the two subsequent dot-separated numbers into Group 1 (\1) and then the rest of the line is matched and replaced with Group 1 value that is printed (as the default line output is suppressed with -n).
In the awk command, the field separator is set to whitespace and . with -F'[[:space:].]', and the {print $2"."$3} part prints the second and third field values joined with a dot.
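To check that the approach does not depend on the value 13, here is a small sketch of a variation on the sed command. It anchors on the first digit in the line instead of on a space, which is an assumption about the input; major_minor is a hypothetical name and the second input line is invented for the demo.

```shell
# Print the first "digits.digits" pair found on each input line.
# major_minor is a hypothetical helper; lines without such a pair print nothing.
major_minor() {
    sed -En 's/^[^0-9]*([0-9]+\.[0-9]+).*/\1/p'
}

printf '%s\n' 'product: 13.6.0.35_0' 'product: 2.10.7' | major_minor
```

For the two sample lines this prints 13.6 and 2.10.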
A pure shell solution using the builtin read, parameter expansion, and curly braces for command grouping:
echo "product: 13.6.0.35_0" | { read -r _ value; echo "${value%.*.*}" ; }
You can also use cut:
echo 'product: 13.6.0.35_0' | cut -d ' ' -f2 | cut -d '.' -f1-2
13.6
I reached the expected output, it's simple but it works:
var=$(echo "product: 13.6.0.35_0" | grep -Eo '[[:digit:]]+' | sed -n 1,2p)
echo ${var} | sed 's/ /./g'

Validating specific column in grep

Ok this is driving me crazy. I have a text file with the following content:
"1","2","3","4","text","2020-01-01","2020-12-13","4"
"1","2","3","4","text","2020-12-07","2020-12-03","22"
"1","2","3","4","text","2020-12-12","2020-04-11","21"
"1","2","3","4","text","2020-05-21","2020-03-23","453"
etc.
I want to filter lines on which the second date is in december, I tried things like:
grep '.*(\d{4}-\d{2}-\d{2}).*(2020-12-).*' > output.txt
grep '.*\d{4}-\d{2}-\d{2}.*2020-12-.*' > output.txt
grep -P '.*\d{4}-\d{2}-\d{2}.*2020-12-.*' > output.txt
But nothing seems to work. Is there any way to accomplish this with either grep, egrep, sed or awk?
You need to use the -P option of grep to enable Perl-compatible regular expressions. Please try the following, which skips the first date column so that the second date is the one tested:
grep -P '("\d+",){4}"[a-zA-Z]+","\d{4}-\d{2}-\d{2}","2020-12-\d{2}"' Input_file
Explanation: the command is broken into pieces below for explanation purposes only.
grep ##Starting grep command from here.
-P ##Mentioning -P option for enabling PCRE regex with grep.
'("\d+",){4} ##Looking for " digits " comma, this combination 4 times.
"[a-zA-Z]+", ##Then looking for " alphabets " comma.
"\d{4}-\d{2}-\d{2}", ##Then skipping the first date, whatever its month.
"2020-12-\d{2}" ##Then looking for a second date such as "2020-12-13", which OP needs.
' Input_file ##Mentioning Input_file name here.
I suggest an alternate solution awk due to input data structured in rows and columns using a common delimiter:
awk -F, '$7 ~ /-12-/' file
"1","2","3","4","text","2020-01-01","2020-12-13","4"
"1","2","3","4","text","2020-12-07","2020-12-03","22"
Use either grep -P or egrep for short:
$ cat test.txt
"1","2","3","4","text","2020-01-01","2020-12-13","4"
"1","2","3","4","text","2020-12-07","2020-12-03","22"
"1","2","3","4","text","2020-12-12","2020-04-11","21"
"1","2","3","4","text","2020-05-21","2020-03-23","453"
$
$ grep -P '^"([^"]*","){6}2020-12-' test.txt
"1","2","3","4","text","2020-01-01","2020-12-13","4"
"1","2","3","4","text","2020-12-07","2020-12-03","22"
$
$ egrep '^"([^"]*","){6}2020-12-' test.txt
"1","2","3","4","text","2020-01-01","2020-12-13","4"
"1","2","3","4","text","2020-12-07","2020-12-03","22"
Explanation:
^" - expect a " to start
([^"]*","){6} - scan over all chars other than ", followed by ","; repeat that 6 times
2020-12- - expect 2020-12-
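The same "seventh column starts with a December date" test can also be sketched with awk fields instead of a counted regex. This is only an illustration: december_rows is a hypothetical name, and it assumes every field is quoted and that no field contains the sequence "," itself.

```shell
# Keep rows whose 7th field (the second date) is in December of any year.
# Splitting on "," (quote, comma, quote) leaves field 7 as the bare date.
# december_rows is a hypothetical helper; it reads stdin or the files given.
december_rows() {
    awk -F'","' '$7 ~ /^[0-9][0-9][0-9][0-9]-12-/' "$@"
}

printf '%s\n' \
    '"1","2","3","4","text","2020-01-01","2020-12-13","4"' \
    '"1","2","3","4","text","2020-12-07","2020-12-03","22"' \
    '"1","2","3","4","text","2020-12-12","2020-04-11","21"' \
    '"1","2","3","4","text","2020-05-21","2020-03-23","453"' |
    december_rows
```

With the four sample rows this keeps the first two, the same rows the grep commands above select.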
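The problem is in:
egrep '.*\d{4}-\d{2}-\d{2}.2020-12-.' > output.txt
specifically the single . between the first date and 2020-12-.
The . just matches a single character, but you want to skip the three characters ",", so the . becomes a .+. Note also that egrep (ERE) does not understand \d, so use [0-9] instead, and give it a file to read:
egrep '.*[0-9]{4}-[0-9]{2}-[0-9]{2}.+2020-12-.' file > output.txt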

Match multiple patterns in same line using sed [duplicate]

Given a file, for example:
potato: 1234
apple: 5678
potato: 5432
grape: 4567
banana: 5432
sushi: 56789
I'd like to grep for all lines that start with potato: but only pipe the numbers that follow potato:. So in the above example, the output would be:
1234
5432
How can I do that?
grep 'potato:' file.txt | sed 's/^.*: //'
grep looks for any line that contains the string potato:, then, for each of these lines, sed replaces (s/// - substitute) any character (.*) from the beginning of the line (^) until the last occurrence of the sequence : (colon followed by space) with the empty string (s/...// - substitute the first part with the second part, which is empty).
or
grep 'potato:' file.txt | cut -d ' ' -f2
For each line that contains potato:, cut will split the line into multiple fields delimited by a space (-d ' ' - d = delimiter, ' ' = a quoted space character; a backslash-escaped space would also have worked) and print the second field of each such line (-f2).
or
grep 'potato:' file.txt | awk '{print $2}'
For each line that contains potato:, awk will print the second field (print $2) which is delimited by default by spaces.
or
grep 'potato:' file.txt | perl -e 'for(<>){s/^.*: //;print}'
All lines that contain potato: are sent to an inline (-e) Perl script that takes all lines from stdin, then, for each of these lines, does the same substitution as in the first example above, then prints it.
or
awk '{if(/potato:/) print $2}' < file.txt
The file is sent via stdin (< file.txt sends the contents of the file via stdin to the command on the left) to an awk script that, for each line that contains potato: (if(/potato:/) returns true if the regular expression /potato:/ matches the current line), prints the second field, as described above.
or
perl -e 'for(<>){/potato:/ && s/^.*: // && print}' < file.txt
The file is sent via stdin (< file.txt, see above) to a Perl script that works similarly to the one above, but this time it also makes sure each line contains the string potato: (/potato:/ is a regular expression that matches if the current line contains potato:, and, if it does (&&), then proceeds to apply the regular expression described above and prints the result).
Or use regex assertions: grep -oP '(?<=potato: ).*' file.txt
grep -Po 'potato:\s\K.*' file
-P to use Perl regular expression
-o to output only the match
\s to match the space after potato:
\K to discard everything matched so far, so that it is left out of the output
.* to match rest of the string(s)
sed -n 's/^potato:[[:space:]]*//p' file.txt
One can think of Grep as a restricted Sed, or of Sed as a generalized Grep. In this case, Sed is one good, lightweight tool that does what you want -- though, of course, there exist several other reasonable ways to do it, too.
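If the key must match exactly rather than as a substring, a field comparison is one option. The sketch below assumes "key: value" lines with a single ": " separator; values_for is a hypothetical name, and the "sweet potato" line is invented to show the difference from a plain grep for potato:.

```shell
# Print the value of every "key: value" line on stdin whose key equals $1.
# values_for is a hypothetical helper; the key is compared as a whole string.
values_for() {
    awk -v key="$1" -F': ' '$1 == key { print $2 }'
}

printf '%s\n' 'potato: 1234' 'apple: 5678' 'potato: 5432' 'sweet potato: 1' |
    values_for potato
```

This prints 1234 and 5432 and skips the "sweet potato" line, which grep 'potato:' would have matched.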
This will print everything after each match, on that same line only:
perl -lne 'print $1 if /^potato:\s*(.*)/' file.txt
This will do the same, except it will also print all subsequent lines:
perl -lne 'if ($found){print} elsif (/^potato:\s*(.*)/){print $1; $found++}' file.txt
These command-line options are used:
-n loop around each line of the input file
-l removes newlines before processing, and adds them back in afterwards
-e execute the perl code
You can use grep, as the other answers state. But you don't need grep, awk, sed, perl, cut, or any external tool. You can do it with pure bash.
Try this (semicolons are there to allow you to put it all on one line):
$ while read line;
do
if [[ "${line%%:\ *}" == "potato" ]];
then
echo ${line##*:\ };
fi;
done< file.txt
## tells bash to delete the longest match of the pattern "*: " (anything up to and including the last ": ") from the front of $line.
$ while read line; do echo ${line##*:\ }; done< file.txt
1234
5678
5432
4567
5432
56789
or if you wanted the key rather than the value, %% tells bash to delete the longest match of the pattern ": *" (the first ": " and everything after it) from the end of $line.
$ while read line; do echo ${line%%:\ *}; done< file.txt
potato
apple
potato
grape
banana
sushi
The substring to split on is written as ":\ " here, with the space escaped by a backslash; that is harmless, although inside ${...} an unescaped space works as well.
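The difference between the shortest-match and longest-match forms only shows up when the separator occurs more than once, so here is a small sketch with a made-up line containing ": " twice. The variable names are arbitrary, and the space is left unescaped, which also works inside the braces.

```shell
line='potato: 12: 34'

# Prefix removal: # strips the shortest match of "*: ", ## the longest.
short_prefix=${line#*: }    # 12: 34
long_prefix=${line##*: }    # 34

# Suffix removal: % strips the shortest match of ": *", %% the longest.
short_suffix=${line%: *}    # potato: 12
long_suffix=${line%%: *}    # potato

printf '%s\n' "$short_prefix" "$long_prefix" "$short_suffix" "$long_suffix"
```

For the single-separator lines in the question, both forms of each operator give the same result.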
You can find more like these at the linux documentation project.
Modern BASH has support for regular expressions:
while read -r line; do
if [[ $line =~ ^potato:\ ([0-9]+) ]]; then
echo "${BASH_REMATCH[1]}"
fi
done < file.txt
grep potato file | grep -o "[0-9].*"

grep matching but not printing if line end in dos ^M

I need to search in multiple files for a PATTERN, if found display the file, line and PATTERN surrounded by a few extra chars. My problem is that if the line matching the PATTERN ends with ^M (CRLF) grep prints an empty line instead.
Create a file like this: first line "a^M", second line "a", third line empty, fourth line "a" (not followed by a new line).
a^M
a
a
Without trying to match a few chars after the PATTERN all occurrences are found and displayed:
# grep -srnoEiI ".{0,2}a" *
1:a
2:a
4:a
If I try to match any chars at the end of the PATTERN, it prints an empty line instead of line one, the one ending in CRLF:
# grep -srnoEiI ".{0,2}a.{0,2}" *
2:a
4:a
How can I change this to act as expected ?
P.S. I would like to fix this with grep, but I will accept other solutions, for example in awk.
EDIT:
Based on the answers below I chose to strip the \r and force grep to pipe the colors to tr:
grep --color=always -srnoEiI ".{0,2}a.{0,2}" * | tr -d '\r'
Here's a simpler case that reproduces your problem:
# Output
echo $'a\r' | grep -o "a"
# No output
echo $'a\r' | grep -o "a."
This is because the ^M matches like a regular character, and makes your terminal overwrite its output (this is purely cosmetic).
How you want to fix this depends on what you want to do.
# Show the output in hex format to ensure it's correct
$ echo $'a\r' | grep -o "a." | od -t x1 -c
0000000 61 0d 0a
a \r \n
# Show the output in visually less ambiguous format
$ echo $'a\r' | grep -o "a." | cat -v
a^M
# Strip the carriage return
$ echo $'a\r' | grep -o "a." | tr -d '\r'
a
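Before deciding which fix to apply, it can help to check whether the input has any CRLF line endings at all. The following is a minimal sketch, with count_crlf_lines as a hypothetical helper name; it builds the carriage return with printf so it does not rely on the $'...' quoting used above.

```shell
# Count the lines on stdin that end in a carriage return (CRLF line endings).
# count_crlf_lines is a hypothetical helper name used only for this sketch.
count_crlf_lines() {
    cr=$(printf '\r')      # a literal carriage return, portable across shells
    grep -c "${cr}\$"
}

# Same four-line sample as in the question: "a\r", "a", empty, "a".
printf 'a\r\na\n\na\n' | count_crlf_lines                 # before cleanup
printf 'a\r\na\n\na\n' | tr -d '\r' | count_crlf_lines    # after tr -d '\r'
```

The first call reports 1 and the second 0, so the tr step removed the only carriage return.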
awk -v pattern="a" '$0 ~ pattern && !/\r$/ {print NR ": " $0}' file
or
sed -n '/a/{/\r$/!{=;p}}' ~/tmp/srcfile | paste -d: - -
Both of these do: find the pattern, see if the line does not end in a carriage return, print the line number and the line. For the sed, the line number is on its own line, so we have to join two consecutive lines with a colon.
You could use pcregrep:
pcregrep -n '.{0,2}a.{0,2}' inputfile
For your sample input:
$ printf $'a\r\na\n\na\n' | pcregrep -n '.{0,2}a.{0,2}'
1:a
2:a
4:a
A couple more ways:
Use the dos2unix utility to convert the dos-style line endings to unix-style:
dos2unix myfile.txt
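Or preprocess the file using tr to remove the CR characters, then pipe to grep (without -r, since a recursive GNU grep with no file operand searches the working directory instead of reading standard input):
$ tr -d '\r' < myfile.txt | grep -snoEiI ".{0,2}a.{0,2}"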
1:a
2:a
4:a
$
Note dos2unix may need to be installed on whatever OS you are using. More than likely tr will be available on any POSIX-compliant OS.
You can use awk with a custom field separator:
awk -F '[[:blank:]\r]' '/.{0,2}a.{0,2}/{print FILENAME, NR, $1}' OFS=':' file
TESTING:
Your grep command:
grep -srnoEiI ".{0,2}a.{0,2}" file|cat -vte
file:1:a^M$
file:2:a$
file:4:a$
Suggested awk command:
awk -F '[[:blank:]\r]' '/.{0,2}a.{0,2}/{print FILENAME, NR, $1}' OFS=':' file|cat -vte
file:1:a$
file:2:a$
file:4:a$

Extract all numbers from a text file and store them in another file

I have a text file which has lots of lines. I want to extract all the numbers from that file.
The file contains text and numbers, and each line contains only one number.
How can I do it using sed or awk in a bash script?
I tried
#! /bin/bash
sed 's/\([0-9.0-9]*\).*/\1/' <myfile.txt >output.txt
but this didn't work.
grep can handle this:
grep -Eo '[0-9\.]+' myfile.txt
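-o tells grep to print only the matches, and [0-9\.]+ is a regular expression that matches runs of digits and periods (the backslash is not needed, since a period is already literal inside brackets, so [0-9.]+ is enough).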
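A character class of digits and periods also matches a lone period or something like 1.2.3. If that matters, a stricter pattern is one option. The sketch below assumes a "number" means an integer with an optional decimal part, which is my reading and not something stated in the question; extract_numbers is a hypothetical name and the sample lines are invented.

```shell
# Print each integer or decimal found on stdin, one per line.
# extract_numbers is a hypothetical helper; it needs a grep that supports -o
# (GNU and BSD grep both do).
extract_numbers() {
    grep -Eo '[0-9]+(\.[0-9]+)?'
}

printf '%s\n' 'width 12 px' 'price: 3.50 USD.' 'no digits here.' | extract_numbers
```

This prints 12 and 3.50, and the sentence-ending periods are ignored. Redirect the output with > output.txt to store the result.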
To put all numbers on one line and save them in output.txt:
echo $(grep -Eo '[0-9\.]+' myfile.txt) >output.txt
Text files should normally end with a newline character. The use of echo above ensures that this happens.
Non-GNU grep:
If your grep does not support the -o flag, try:
echo $(tr ' ' '\n' <myfile.txt | grep -E '[0-9\.]+') >output.txt
This uses tr to replace all spaces with newlines (so each number appears separately on a line) and then uses grep to search for numbers.
tr -sc '0-9.' ' ' < "$file"
This transforms every run of non-digit-or-period characters (newlines included) into a single space. Note that tr reads standard input only, hence the redirection.
You can also use Bash:
while read line; do
if [[ $line =~ [0-9\.]+ ]]; then
echo $BASH_REMATCH
fi
done <myfile.txt >output.txt