How to extract filenames and check if exists using regular expressions? - regex

I have a file myfile.log that looks like this:
RS | hello.txt| OK| INFO| [CATLG]
==============================================
A4 | byebye.txt| OK| INFO| [DELETE]
==============================================
Most common:
----------------------------------------------
AS | stackoverflow.txt| OK| INFO| [CATLG]
I'm trying to write a script that extracts the filenames from the lines matching this regular expression:
\s(.+)\|\s+OK\|\s+INFO\|\s+\[CATLG
Finally, it should check whether each file exists in the /myfiles/record/ directory. If not, a D should be printed before the filename.
Here is an example of the output, supposing that stackoverflow.txt exists and hello.txt does not:
hello.txt
D stackoverflow.txt
I tried to use grep, but if I run:
grep -oh '\s.+\|\s+OK\|\s+INFO\|\s+\[CATLG' myfile.log | uniq -i
it returns nothing. What am I doing wrong? Do you have any idea how to do this?

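Plain grep reads the pattern as a basic regular expression (BRE), where + is a literal character and \| means alternation in GNU grep, so nothing matches; \s is also a GNU extension rather than POSIX. You can use the grep -P (PCRE) flavor: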
grep -oPh '\s.+\|\s+OK\|\s+INFO\|\s+\[CATLG' myfile.log
Or translate your regex into ERE:
grep -Eoh '[[:blank:]].+\|[[:blank:]]+OK\|[[:blank:]]+INFO\|[[:blank:]]+\[CATLG' myfile.log
To just print file names use:
grep -oPh '[^|]+\|\s+\K[^|]+(?=\|\s+OK.*?\[CATLG)' file
hello.txt
stackoverflow.txt
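The ERE translation above depends on + being a quantifier and \| being a literal pipe. As a rough sketch of how the flavors differ, the same pattern can also be spelled in portable BRE with \{1,\}. The variable names below are only illustrative, and the sample line is copied from the question.

```shell
#!/bin/sh
# Sample line copied from the question's myfile.log.
line='RS | hello.txt| OK| INFO| [CATLG]'

# POSIX BRE: "|" is an ordinary character and "one or more" is spelled \{1,\}.
bre='[[:blank:]].\{1,\}|[[:blank:]]\{1,\}OK|[[:blank:]]\{1,\}INFO|[[:blank:]]\{1,\}\[CATLG'
# POSIX ERE: "+" is a quantifier, so the literal pipe has to be escaped.
ere='[[:blank:]].+\|[[:blank:]]+OK\|[[:blank:]]+INFO\|[[:blank:]]+\[CATLG'

# grep -c prints the number of matching input lines.
bre_hits=$(printf '%s\n' "$line" | grep -c "$bre")
ere_hits=$(printf '%s\n' "$line" | grep -Ec "$ere")
printf 'BRE matches: %s, ERE matches: %s\n' "$bre_hits" "$ere_hits"
```

Both forms should select the same line; which one to use is mostly a matter of which grep options are available.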

awk -F '|' '/\|/ {fname=gensub(" ","","g",$2)
if ( system( "[ -f " fname " ] " ) ) {
print "D " fname }
else {
print " " fname }
}' INPUTFILE
Might work for you (gensub requires GNU awk).
sets the input field separator to |
works only on lines that contain a |
sets the fname variable to the second field (the filename) with its spaces stripped
calls out to the shell to run the test command ([); a non-zero exit status means the file does not exist

grep -oP '\|\s*\K\S+(?=\|\s+OK.*CATLG)' myfile.log |
while IFS= read -r file; do
[[ -f /myfiles/record/"$file" ]] && flag="" || flag=D
printf "%-2s%s\n" "$flag" "$file"
done
Explanation:
The grep command uses (-P) perl regex syntax, and only outputs the matched text (-o), each match on its own line.
the \K directive means "forget about what just matched" -- it's a way to get a variable-length look-behind.
I'm finding non-space characters that are followed by: a pipe, whitespace, "OK", some chars, and "CATLG"
The grep output is piped into a while loop
I read the filename into a variable named file
I use the conditional command [[ and the -f operator to see that the file exists.
If it does exist, the command after the && operator is executed, otherwise if the file does not exist, the command after the || operator is executed.
Finally, I print the output in the OP's desired format.
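On systems without grep -P, the same extract-then-test idea can be sketched with POSIX awk and test. This is only an illustration: check_catlg is a hypothetical helper name, the log file and directory are passed as arguments, and it follows the answers' convention of printing D before names that are missing from the directory.

```shell
#!/bin/sh
# check_catlg LOGFILE DIR  (hypothetical helper, not part of the original answers)
check_catlg() {
    # Select the OK / INFO / [CATLG rows and print the trimmed second |-separated field.
    awk -F'|' '$3 ~ /OK/ && $4 ~ /INFO/ && $5 ~ /\[CATLG/ {
        gsub(/^[ \t]+|[ \t]+$/, "", $2)
        print $2
    }' "$1" |
    while IFS= read -r name; do
        if [ -f "$2/$name" ]; then
            printf '  %s\n' "$name"    # file exists in DIR
        else
            printf 'D %s\n' "$name"    # file is missing from DIR
        fi
    done
}

# Usage with the paths from the question:
# check_catlg myfile.log /myfiles/record
```

Splitting on | and testing each field separately avoids one long regular expression, at the cost of assuming the columns always appear in this order.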

Related

Extract sub-string from strings based on condition with shell command line

I have lines in myfile like this:
mount -t cifs //hostname/path/ /mount/path/ -o username='xxxx',password='xxxxx'
I need to extract sub-strings from this based on condition "start with // till next white-space including //".
I can't parse with the position as it won't be the same in all matched lines.
So far I have extracted the sub-string using grep's perl assertion, but the result does not return the //.
The piece of code I've used is
cat myfile | grep " cifs " | grep -oP "(?<=/)[^\s]*" | grep -v ^/
Output:
hostname/path/
Expected Output:
//hostname/path/
Is there a way to get the desired output by modifying the perl regex, perhaps some other method?
Simple bash one line solution
grep " cifs " myfile | sed -e "s/ /\n/g" | grep '^\/\/'
You may consider using some non-PCRE based solutions like
sed -En '/ cifs /{s,.*(//[^[:space:]]+).*,\1,p}' file
grep -oE '//[^[:space:]]+' file
The grep solution simply extracts all occurrences of // and 1+ non-whitespace chars after from the file.
The sed solution finds lines containing cifs and then extracts the last occurrence of // and 1+ non-whitespace chars after on those lines.
Following command should do what you ask for
grep cifs myfile | cut -d ' ' -f 4
or
grep cifs myfile | nawk '{print $4}'
or
awk '/cifs/ { print $4 }' myfile
or
perl -lne 'print $1 if /cifs\s+(\S+)/' myfile
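Because the question says the position of the share is not the same on every line, a variant that scans all fields may be more robust than printing a fixed field. A minimal sketch, assuming whitespace-separated words; extract_shares is an illustrative name, not something from the answers.

```shell
#!/bin/sh
# extract_shares: read lines on stdin and print every word that starts with "//"
# on lines that mention " cifs ". (Hypothetical helper name.)
extract_shares() {
    awk '/ cifs / {
        for (i = 1; i <= NF; i++)
            if (index($i, "//") == 1) print $i   # word begins with two slashes
    }'
}

# Sample line from the question.
printf '%s\n' "mount -t cifs //hostname/path/ /mount/path/ -o username='xxxx',password='xxxxx'" |
    extract_shares
```

Using index() instead of a regex sidesteps the question of how to escape slashes inside an awk pattern.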

Match multiple patterns in same line using sed [duplicate]

Given a file, for example:
potato: 1234
apple: 5678
potato: 5432
grape: 4567
banana: 5432
sushi: 56789
I'd like to grep for all lines that start with potato: but only pipe the numbers that follow potato:. So in the above example, the output would be:
1234
5432
How can I do that?
grep 'potato:' file.txt | sed 's/^.*: //'
grep looks for any line that contains the string potato:, then, for each of these lines, sed replaces (s/// - substitute) any character (.*) from the beginning of the line (^) until the last occurrence of the sequence : (colon followed by space) with the empty string (s/...// - substitute the first part with the second part, which is empty).
or
grep 'potato:' file.txt | cut -d' ' -f2
For each line that contains potato:, cut will split the line into multiple fields delimited by a space (-d' ' sets the delimiter; a backslash-escaped space after -d would also work) and print the second field of each such line (-f2).
or
grep 'potato:' file.txt | awk '{print $2}'
For each line that contains potato:, awk will print the second field (print $2) which is delimited by default by spaces.
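One practical difference between the cut and awk variants: cut treats every single space as a delimiter, while awk collapses runs of blanks. A small sketch with a made-up input line that has extra padding after the colon:

```shell
#!/bin/sh
# Hypothetical input with three spaces after the colon.
line='potato:   1234'

# cut sees empty fields between consecutive spaces, so field 2 is empty.
cut_out=$(printf '%s\n' "$line" | cut -d' ' -f2)
# awk's default field splitting skips the whole run of spaces.
awk_out=$(printf '%s\n' "$line" | awk '{print $2}')

printf 'cut: [%s]  awk: [%s]\n' "$cut_out" "$awk_out"
```

If the input may be padded for alignment, the awk form is the safer of the two.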
or
grep 'potato:' file.txt | perl -e 'for(<>){s/^.*: //;print}'
All lines that contain potato: are sent to an inline (-e) Perl script that takes all lines from stdin, then, for each of these lines, does the same substitution as in the first example above, then prints it.
or
awk '{if(/potato:/) print $2}' < file.txt
The file is sent via stdin (< file.txt sends the contents of the file via stdin to the command on the left) to an awk script that, for each line that contains potato: (if(/potato:/) returns true if the regular expression /potato:/ matches the current line), prints the second field, as described above.
or
perl -e 'for(<>){/potato:/ && s/^.*: // && print}' < file.txt
The file is sent via stdin (< file.txt, see above) to a Perl script that works similarly to the one above, but this time it also makes sure each line contains the string potato: (/potato:/ is a regular expression that matches if the current line contains potato:, and, if it does (&&), then proceeds to apply the regular expression described above and prints the result).
Or use regex assertions: grep -oP '(?<=potato: ).*' file.txt
grep -Po 'potato:\s\K.*' file
-P to use Perl regular expression
-o to output only the match
\s to match the space after potato:
\K to discard everything matched so far (potato: and the space) from the output
.* to match rest of the string(s)
sed -n 's/^potato:[[:space:]]*//p' file.txt
One can think of Grep as a restricted Sed, or of Sed as a generalized Grep. In this case, Sed is one good, lightweight tool that does what you want -- though, of course, there exist several other reasonable ways to do it, too.
This will print everything after each match, on that same line only:
perl -lne 'print $1 if /^potato:\s*(.*)/' file.txt
This will do the same, except it will also print all subsequent lines:
perl -lne 'if ($found){print} elsif (/^potato:\s*(.*)/){print $1; $found++}' file.txt
These command-line options are used:
-n loop around each line of the input file
-l removes newlines before processing, and adds them back in afterwards
-e execute the perl code
You can use grep, as the other answers state. But you don't need grep, awk, sed, perl, cut, or any external tool. You can do it with pure bash.
Try this (semicolons are there to allow you to put it all on one line):
$ while read line;
do
if [[ "${line%%:\ *}" == "potato" ]];
then
echo ${line##*:\ };
fi;
done< file.txt
## tells bash to delete the longest match of the pattern "*: " (anything up to and including ": ") from the front of $line.
$ while read line; do echo ${line##*:\ }; done< file.txt
1234
5678
5432
4567
5432
56789
or if you wanted the key rather than the value, %% tells bash to delete the longest match of the pattern ": *" (": " and everything after it) from the end of $line.
$ while read line; do echo ${line%%:\ *}; done< file.txt
potato
apple
potato
grape
banana
sushi
The substring to split on is written ":\ " here; the backslash keeps the space as part of the pattern, although inside ${...} an unescaped space works as well.
You can find more like these at the Linux Documentation Project.
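The difference between the single and doubled operators only shows up when the separator occurs more than once. A short sketch with a made-up line containing two ": " separators (the variable names are illustrative):

```shell
#!/bin/sh
# Made-up line with two ": " separators to contrast shortest and longest matching.
line='potato: 12: 34'

short_front=${line#*: }    # shortest prefix matching "*: " removed
long_front=${line##*: }    # longest prefix matching "*: " removed
short_back=${line%: *}     # shortest suffix matching ": *" removed
long_back=${line%%: *}     # longest suffix matching ": *" removed

printf '%s\n' "$short_front" "$long_front" "$short_back" "$long_back"
```

These four expansions are plain POSIX shell, so they are not limited to bash.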
Modern BASH has support for regular expressions:
while read -r line; do
if [[ $line =~ ^potato:\ ([0-9]+) ]]; then
echo "${BASH_REMATCH[1]}"
fi
done < file.txt
grep potato file | grep -o "[0-9].*"
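Several of the commands above match potato anywhere in the line, so a key such as sweetpotato: would also be picked up. A minimal sketch that compares the key exactly, assuming every line has the form "key: value"; values_for is a hypothetical helper name.

```shell
#!/bin/sh
# values_for KEY FILE  (hypothetical helper): print the value of each "KEY: value" line.
values_for() {
    # Split on ": " and compare the first field to the key as a string, not a regex.
    awk -F': ' -v key="$1" '$1 == key { print $2 }' "$2"
}

# Usage with the file from the question:
# values_for potato file.txt
```

Passing the key through -v keeps it out of the awk program text, so it needs no regex escaping.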

Parsing Karma Coverage Output in Bash for a Jenkins Job (Scripting)

I'm working with the following output:
=============================== Coverage summary ===============================
Statements : 26.16% ( 1681/6425 )
Branches : 6.89% ( 119/1727 )
Functions : 23.82% ( 390/1637 )
Lines : 26.17% ( 1680/6420 )
================================================================================
I would like to parse the 4 coverage percentage numbers without the percent via REGEX, into a comma separated list.
Any suggestions for a good regex expression for this? Or another good option?
The sed command:
sed -n '/ .*% /{s/.* \(.*\)% .*/\1/;p;}' input.txt | sed ':a;N;$!ba;s/\n/,/g'
gives the output:
26.16,6.89,23.82,26.17
Edit: A better answer, with only a single sed, would be:
sed -n '/ .*% /{s/.* \(.*\)% .*/\1/;H;};${g;s/\n/,/g;s/,//;p;}' input.txt
Explanation:
/ .*% / search for lines with a percentage value (note spaces)
s/.* \(.*\)% .*/\1/ and delete everything except the percentage value
H and then append it to the hold space, prefixed with a newline
$ then for the last line
g get the hold space
s/\n/,/g replace all the newlines with commas
s/,// and delete the initial comma
p and then finally output the result
To harden the regex, you could replace the search for the percentage value .*% with for example [0-9.]*%.
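As a sketch of that hardening, the capture below only accepts digits and dots directly before the % sign, and paste is swapped in for the hold-space logic to join the lines. coverage_csv is an illustrative name, and the input file is whatever holds the coverage summary.

```shell
#!/bin/sh
# coverage_csv FILE (hypothetical helper): print the percentages as one comma-separated line.
coverage_csv() {
    # Keep only the run of digits and dots that sits directly before a "%" ...
    sed -n 's/.*[^0-9.]\([0-9][0-9.]*\)%.*/\1/p' "$1" |
    # ... then join the resulting lines with commas.
    paste -s -d, -
}

# Usage:
# coverage_csv input.txt
```

Lines without a percentage, such as the ==== rulers, are dropped by sed -n because the substitution does not apply to them.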
I think this is a grep job. This should help:
$ grep -oE "[0-9]{1,2}\.[0-9]{2}" input.txt | xargs | tr " " ","
Output:
26.16,6.89,23.82,26.17
The input file just contains what you have shown above. Obviously, there are other ways like cat to feed the input to the command.
Explanation:
grep -oE: only show matches using extended regex
xargs: put all results onto a single line
tr " " ",": translate the spaces into commas
This is actually a nice shell tool belt example, I would say.
Taking Joseph Quinsey's suggestion into account, the regex can be made more robust with a lookahead that asserts a % sign after the numeric value, using a Perl-compatible RE pattern:
grep -oP "[0-9]{1,2}\.[0-9]{2}(?=%)" input.txt | xargs | tr " " ","
Would you consider to use awk? Here's the command you may try,
$ awk 'match($0,/[0-9.]*%/){s=(s=="")?"":s",";s=s substr($0,RSTART,RLENGTH-1)}END{print s}' file
26.16,6.89,23.82,26.17
Brief explanation,
match($0,/[0-9.]*%/): select the records that match the regex [0-9.]*%
s=(s=="")?"":s",": because a comma-separated list is required, append a comma to s before every match except the first
s=s substr($0,RSTART,RLENGTH-1): append the matched text, minus the trailing %, to s
Assuming the item names (Statements, Branches, ...) do not contain whitespaces, how about:
#!/bin/bash
declare -a keys
declare -a values
while read -r line; do
if [[ "$line" =~ ^([^\ ]+)\ *:\ *([0-9.]+)% ]]; then
keys+=("${BASH_REMATCH[1]}")
values+=("${BASH_REMATCH[2]}")
fi
done < output.txt
ifsback=$IFS # backup IFS
IFS=,
echo "${keys[*]}"
echo "${values[*]}"
IFS=$ifsback # restore IFS
which yields:
Statements,Branches,Functions,Lines
26.16,6.89,23.82,26.17
Yet another option, with perl:
cat the_file | perl -e 'while(<>){/(\d+\.\d+)%/ and $x.="$1,"}chop $x; print $x;'
The code, unrolled and explained:
while(<>){ # Read line by line. Put lines into $_
/(\d+\.\d+)%/ and $x.="$1,"
# Equivalent to:
# if ($_ =~ /(\d+\.\d+)%/) {$x.="$1,"}
# The regex matches "numbers", "dot", "numbers" and "%",
# stores just numbers on $1 (first capturing group)
}
chop $x; # Remove extra ',' and print result
print $x;
Somewhat shorter with an extra sed
cat the_file | perl -ne '/(\d+\.\d+)%/ and print "$1,"'|sed 's/.$//'
Uses "n" parameter which implies while(<>){}. For removing the last ',' we use sed.

Regex with fswatch - Exclude files not ending with ".txt"

For a list of files, I'd like to match the ones not ending with .txt. I am currently using this expression:
.*(txt$)|(html\.txt$)
This expression will match everything ending in .txt, but I'd like it to do the opposite.
Should match:
happiness.html
joy.png
fear.src
Should not match:
madness.html.txt
excitement.txt
I'd like to get this so I can use it in pair with fswatch:
fswatch -0 -e 'regex here' . | xargs -0 -n 1 -I {} echo "{} has been changed"
The problem is it doesn't seem to work.
PS: I use the tag bash instead of fswatch because I don't have enough reputation points to create it. Sorry!
Try using a lookbehind, like this:
.*$(?<!\.txt)
Basically, this matches any line of text so long as the last 4 characters are not ".txt".
You can use Negative Lookahead for this purpose.
^(?!.*\.txt$).+$
You can use this expression with grep using the -P option:
grep -Po '^(?!.*\.txt$).+$' file
Since the question is tagged bash, lookaheads may not be supported (except with grep -P); here is a grep solution that doesn't need lookaheads:
grep -v '\.txt$' file
happiness.html
joy.png
fear.src
EDIT: You can use this xargs command to skip the *.txt files:
xargs -0 -n 1 -I {} bash -c '[[ "{}" != *".txt" ]] && echo "{} has been changed"'
It really depends what regular expression tool you are using. Many tools provide a way to invert the sense of a regex. For example:
bash
# succeeds if filename ends with .txt
[[ $filename =~ "."txt$ ]]
# succeeds if filename does not end with .txt
! [[ $filename =~ "."txt$ ]]
# another way of writing the negative
[[ ! $filename =~ "."txt$ ]]
grep
# succeeds if filename ends with .txt
egrep -q "\.txt$" <<<"$filename"
# succeeds if filename does not end with .txt
egrep -qv "\.txt$" <<<"$filename"
awk
/\.txt$/ { print "line ends with .txt" }
! /\.txt$/ { print "line doesn't end with .txt" }
$1 ~ /\.txt$/ { print "first field ends with .txt" }
$1 !~ /\.txt$/ { print "first field doesn't end with .txt" }
For the adventurous, a POSIX ERE that works in any POSIX-compatible regex engine:
/[^t]$|[^x]t$|[^t]xt$|[^.]txt$/
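To compare the approaches on the sample names from the question, here is a small sketch with two filters: one using a shell glob and one using the lookaround-free ERE above (without the / delimiters). Both function names are hypothetical, and whether fswatch itself accepts the expression is not verified here.

```shell
#!/bin/sh
# Two hypothetical filters that read names on stdin and keep those NOT ending in ".txt".

# Glob-based: no regex engine involved.
keep_non_txt_case() {
    while IFS= read -r name; do
        case $name in
            *.txt) ;;                       # ends with .txt: skip it
            *) printf '%s\n' "$name" ;;
        esac
    done
}

# ERE-based: one alternative per way the last four characters can differ from ".txt".
keep_non_txt_ere() {
    grep -E '[^t]$|[^x]t$|[^t]xt$|[^.]txt$'
}

names='happiness.html
joy.png
fear.src
madness.html.txt
excitement.txt'

printf '%s\n' "$names" | keep_non_txt_case
printf '%s\n' "$names" | keep_non_txt_ere
```

Note that each ERE alternative needs a character before the mismatching suffix, so very short names such as a bare "t" or "txt" are rejected by the ERE even though they do not end in .txt; the glob version does not have that gap.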

Extract all numbers from a text file and store them in another file

I have a text file that has lots of lines. I want to extract all the numbers from that file.
The file contains text and numbers, and each line contains only one number.
How can I do it using sed or awk in a bash script?
I tried
#! /bin/bash
sed 's/\([0-9.0-9]*\).*/\1/' <myfile.txt >output.txt
but this didn't work.
grep can handle this:
grep -Eo '[0-9.]+' myfile.txt
-o tells grep to print only the matches, and [0-9.]+ is a regular expression that matches runs of digits and periods. (Inside a bracket expression a backslash is literal, so the period needs no escape.)
To put all numbers on one line and save them in output.txt:
echo $(grep -Eo '[0-9.]+' myfile.txt) >output.txt
Text files should normally end with a newline character. The use of echo above ensures that this happens.
Non-GNU grep:
If your grep does not support the -o flag, try:
echo $(tr ' ' '\n' <myfile.txt | grep -E '[0-9.]+') >output.txt
This uses tr to replace all spaces with newlines (so each number appears separately on a line) and then uses grep to search for numbers.
tr -sc '0-9.' ' ' < "$file"
Will transform every string of non-digit-or-period characters into a single space.
You can also use Bash:
while read line; do
if [[ $line =~ [0-9\.]+ ]]; then
echo $BASH_REMATCH
fi
done <myfile.txt >output.txt
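The [0-9.]+ style patterns used above also match a lone period, such as the full stop at the end of a sentence. A slightly stricter sketch, assuming numbers are plain integers or decimals and that grep supports -o (GNU and BSD grep do); extract_numbers is an illustrative name.

```shell
#!/bin/sh
# extract_numbers FILE (hypothetical helper): print integers and decimals, one per line.
extract_numbers() {
    # Digits, optionally followed by a dot and more digits; a lone "." never matches.
    grep -Eo '[0-9]+(\.[0-9]+)?' "$1"
}

# Usage with the file names from the question:
# extract_numbers myfile.txt > output.txt
```

Signs, exponents and thousands separators are deliberately left out; the pattern would need extending if the data contains them.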