Regex to extract/output quoted strings from a file - regex

I wrote a simple regular expression to output quoted strings from a file:
cat mobydick.txt | while read line; do echo -n "$line "; done | grep -oP '[^"]*"\K[^"]*'
This is what I have so far.
The problem is that when I run this one-liner on the file mobydick.txt, I get all the output on a single line instead of as newline-separated strings.
Could someone help me fix my script?
Expected output when the above script is run on mobydick.txt:
"From my twenty-fifth year I date my life."
"Call me Ishmael."
The input file can be downloaded from this URL.

Using GNU grep(1) (other incarnations of grep(1) don't have -P):
tr '\n' ' ' <mobydick.txt | grep -P -o '(?<=\s)"[^"]+"(?=\s)'
More accurately, using pcregrep(1):
pcregrep -M -o '(?<=^|\s)"[^"]+"(?=$|\s)' mobydick.txt
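For comparison, a POSIX awk sketch is also possible: treat the double quote itself as the record separator, so every even-numbered record is the inside of a quoted string. This assumes the quotes in the file are balanced; the sample text below is made up.

```shell
# Treat " as the record separator: even-numbered records are quoted strings.
# Multi-line strings are joined onto one line before printing.
printf 'He said "Call me Ishmael." and then\n"From my\ntwenty-fifth year" too.\n' |
  awk 'BEGIN { RS = "\"" } NR % 2 == 0 { gsub(/\n/, " "); print "\"" $0 "\"" }'
```

Unlike the lookbehind approach, this needs no PCRE support, only a single-character RS, which every POSIX awk handles.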

Extract sub-string from strings based on condition with shell command line

I have lines in myfile like this:
mount -t cifs //hostname/path/ /mount/path/ -o username='xxxx',password='xxxxx'
I need to extract a sub-string from each such line based on the condition "starts with // and runs until the next whitespace, including the //".
I can't parse by position, as it won't be the same in all matched lines.
So far I have extracted the sub-string using grep's perl assertion, but the result does not return the //.
The piece of code I've used is
cat myfile | grep " cifs " | grep -oP "(?<=/)[^\s]*" | grep -v ^/
Output:
hostname/path/
Expected Output:
//hostname/path/
Is there a way to get the desired output by modifying the Perl regex, or perhaps by some other method?
Simple bash one-line solution:
grep " cifs " myfile | sed -e "s/ /\n/g" | grep '^\/\/'
You may consider using some non-PCRE based solutions like
sed -En '/ cifs /{s,.*(//[^[:space:]]+).*,\1,p}' file
grep -oE '//[^[:space:]]+' file
The grep solution simply extracts all occurrences of // and 1+ non-whitespace chars after from the file.
The sed solution finds lines containing cifs and then extracts the last occurrence of // and 1+ non-whitespace chars after on those lines.
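As a quick sanity check, here is the grep form run on a made-up sample line (the mount options are placeholders):

```shell
printf 'mount -t cifs //hostname/path/ /mount/path/ -o username=x\n' |
  grep -oE '//[^[:space:]]+'
# prints: //hostname/path/
```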
The following command should do what you ask for:
grep cifs myfile | cut -d ' ' -f 4
or
grep cifs myfile | nawk '{print $4}'
or
awk '/cifs/ { print $4 }' myfile
or
perl -ne 'print "$1\n" if /cifs\s+(\S+)/' myfile
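If avoiding external tools entirely is attractive, a pure-bash sketch using [[ =~ ]] and BASH_REMATCH would look roughly like this (assuming the line format shown in the question):

```shell
# Pure bash: match lines mentioning cifs and capture //... up to whitespace.
while IFS= read -r line; do
  if [[ $line == *" cifs "* && $line =~ (//[^[:space:]]+) ]]; then
    printf '%s\n' "${BASH_REMATCH[1]}"
  fi
done < myfile
```

This spawns no subprocesses per line, which matters only if myfile is large or the loop runs often.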

Selectively extract number from file name

I have a list of files in the format as: AA13_11BB, CC290_23DD, EE92_34RR. I need to extract only the numbers after the _ character, not the ones before. For those three file names, I would like to get 11, 23, 34 as output and after each extraction, store the number into a variable.
I'm very new to bash and regex. Currently, from AA13_11BB, I am able to either obtain 13_11:
for imgs in $DIR; do
LEVEL=$(echo "$imgs" | egrep -o '[_0-9]+');
done
or two separate numbers 13 and 11:
LEVEL=$(echo "$imgs" | egrep -o '[0-9]+')
May I please have some advice how to obtain my desired output? Thank you!
Use egrep with sed:
LEVEL=$(echo $imgs | egrep -o '_[0-9]+' | sed 's/_//' )
To complement the existing helpful answers, using the core of hjpotter92's answer:
The following processes all filenames in $DIR in a single command and reads all extracted tokens into an array:
IFS=$'\n' read -d '' -ra levels < \
<(printf '%s\n' "$DIR"/* | egrep -o '_[0-9]+' | sed 's/_//')
IFS=$'\n' read -d '' -ra levels splits the input into lines and stores them as elements of array ${levels[@]}.
<(...) is a process substitution that allows the output from a command to act as an (ephemeral) input file.
printf '%s\n' "$DIR"/* uses pathname expansion to output each filename on its own line.
egrep -o '_[0-9]+' | sed 's/_//' is the same as in hjpotter92's answer - it works equally on multiple input lines, as is the case here.
To process the extracted tokens later, use:
for level in "${levels[@]}"; do
echo "$level" # work with $level
done
You can do it in one sed using the regex .*_([0-9]+).* (escape it properly for sed):
sed "s/.*_\([0-9]\+\).*/\1/" <<< "AA13_11BB"
It replaces the line with the first captured group (the sub-regex inside the ()), outputting:
11
In your script:
LEVEL=$(sed "s/.*_\([0-9]\+\).*/\1/" <<< $imgs)
Update: as suggested by @mklement0, in both BSD sed and GNU sed you can shorten the command using the -E parameter:
LEVEL=$(sed -E "s/.*_([0-9]+).*/\1/" <<< $imgs)
Using grep with -P flag
for imgs in $DIR
do
LEVEL=$(echo $imgs | grep -Po '(?<=_)[0-9]{2}')
echo $LEVEL
done
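For completeness, plain parameter expansion can also peel the number out without spawning any external processes; a sketch assuming the names follow the letters-digits-underscore-digits-letters shape from the question:

```shell
name="AA13_11BB"
num=${name#*_}        # drop everything through the first "_"  -> "11BB"
num=${num%%[!0-9]*}   # drop from the first non-digit onward   -> "11"
echo "$num"           # prints: 11
```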

print line that matches first field (bash)

I'm trying to read userinput, have that match the first field of a csv file, and print out the entire line. Here's what i've come up with:
#!/bin/bash
echo "enter number: "
read USERINPUT
LINENUMBER=$(awk -v FS=',' '{print $1}' < test.csv | grep -n "$USERINPUT")
FULLLINE=$(sed -n $LINENUMBER\p test.csv)
echo $FULLLINE
The problem I'm running into is: say I set USERINPUT=4, but my csv file has several lines like 4, 421, 444, etc.; I match all of them. How do I make
grep -n "$USERINPUT"
only match exactly what it is set to and nothing else?
Instead of printing the first column of every line, then using grep, you should just do the whole thing in awk:
line_number=$(awk -F, -v s="$number" '$1==s{print NR}' test.csv)
If you just want to print the line, that's simple:
awk -F, -v s="$number" '$1==s' test.csv
By the way, instead of using an echo followed by a read, you can use read -p which allows you to specify a prompt:
read -p "enter number: " number
#!/bin/bash
read -p "enter number: " num
grep "^$num," test.csv
The -o grep option prints only what matches the regular expression.
E.g.
grep -o ".*$USERINPUT.*"
or
grep -o "^$USERINPUT.*"
etc.
#!/bin/bash
echo "enter number: "
read USERINPUT
# for a var assignation and print content
FULLLINE=$(egrep "^${USERINPUT%% *}," test.csv )
echo $FULLLINE
# for only a print
egrep "^${USERINPUT%% *}," test.csv
egrep is used so the pattern can include delimiters (a start-of-line anchor and the comma after the input).
Anything after the first space in the input is trimmed via ${VarName%% *}.
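To see why exact field comparison matters, here is a throwaway demonstration contrasting awk's $1==s test with a plain substring grep (the CSV contents are invented):

```shell
tmp=$(mktemp)
printf '4,apple\n421,banana\n444,cherry\n' > "$tmp"
awk -F, -v s=4 '$1 == s' "$tmp"   # exact field match: prints only 4,apple
grep -c 4 "$tmp"                  # substring match: counts all 3 lines
rm -f "$tmp"
```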

How to extract filenames and check if exists using regular expressions?

I have a file myfile.log that looks like this:
RS | hello.txt| OK| INFO| [CATLG]
==============================================
A4 | byebye.txt| OK| INFO| [DELETE]
==============================================
Most common:
----------------------------------------------
AS | stackoverflow.txt| OK| INFO| [CATLG]
Then I'm trying to create a script which extract the files which match with the regular expression:
\s(.+)\|\s+OK\|\s+INFO\|\s+\[CATLG
And finally, check whether each file exists in the /myfiles/record/ directory. If not, a D would be printed before the filename.
Here is an example of the output, supposing that stackoverflow.txt exists and hello.txt does not:
hello.txt
D stackoverflow.txt
I tried to use grep, but if I do:
grep -oh '\s.+\|\s+OK\|\s+INFO\|\s+\[CATLG' myfile.log | uniq -i
it returns nothing. What am I doing wrong? Do you have any idea how to do this?
grep's default regex syntax doesn't support \s. You can use the grep -P (PCRE) flavor:
grep -oPh '\s.+\|\s+OK\|\s+INFO\|\s+\[CATLG' myfile.log
OR else translate your regex into ERE:
egrep -oh '[[:blank:]].+\|[[:blank:]]+OK\|[[:blank:]]+INFO\|[[:blank:]]+\[CATLG' myfile.log
To just print file names use:
grep -oPh '[^|]+\|\s+\K[^|]+(?=\|\s+OK.*?\[CATLG)' file
hello.txt
stackoverflow.txt
awk -F '|' '/\|/ { fname = gensub(" ", "", "g", $2)
    if (system("[ -f " fname " ]")) {
        print "D " fname
    } else {
        print "  " fname
    }
}' INPUTFILE
Might work for you.
sets the input field separator to |
works only on lines containing a |
sets the fname variable to the whitespace-stripped version of the second field, the filename (gensub is GNU awk)
calls out from awk to the shell's test command ([) to check whether the file exists
grep -oP '\|\s*\K\S+(?=\|\s+OK.*CATLG)' myfile.log |
while read file; do
[[ -f /myfiles/record/"$file" ]] && flag="" || flag=D
printf "%-2s%s\n" "$flag" "$file"
done
Explanation:
The grep command uses (-P) perl regex syntax, and only outputs the matched text (-o), each match on its own line.
the \K directive means "forget about what just matched" -- it's a way to get a variable-length look-behind.
I'm finding non-space characters that are followed by: a pipe, whitespace, "OK", some chars, and "CATLG"
The grep output is piped into a while loop
I read the filename into a variable named file
I use the conditional command [[ and the -f operator to see that the file exists.
If it does exist, the command after the && operator is executed, otherwise if the file does not exist, the command after the || operator is executed.
Finally, I print the output in the OP's desired format.
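Running the grep stage of that pipeline on the sample lines from the question (fed inline rather than from myfile.log) shows which filenames survive the CATLG filter:

```shell
printf 'RS | hello.txt| OK| INFO| [CATLG]\nA4 | byebye.txt| OK| INFO| [DELETE]\n' |
  grep -oP '\|\s*\K\S+(?=\|\s+OK.*CATLG)'
# prints: hello.txt   (byebye.txt is dropped because its line ends in DELETE)
```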

Extract all numbers from a text file and store them in another file

I have a text file which has lots of lines. I want to extract all the numbers from that file.
The file contains text and numbers, and each line contains only one number.
How can I do it using sed or awk in a bash script?
I tried
#! /bin/bash
sed 's/\([0-9.0-9]*\).*/\1/' <myfile.txt >output.txt
but this didn't work.
grep can handle this:
grep -Eo '[0-9\.]+' myfile.txt
-o tells grep to print only the matching parts, and [0-9\.]+ is a regular expression that matches numbers.
To put all numbers on one line and save them in output.txt:
echo $(grep -Eo '[0-9\.]+' myfile.txt) >output.txt
Text files should normally end with a newline character. The use of echo above ensures that this happens.
Non-GNU grep:
If your grep does not support the -o flag, try:
echo $(tr ' ' '\n' <myfile.txt | grep -E '[0-9\.]+') >output.txt
This uses tr to replace all spaces with newlines (so each number appears separately on a line) and then uses grep to search for numbers.
tr -sc '0-9.' ' ' < "$file"
will transform every run of non-digit-or-period characters into a single space (note that tr reads only standard input, so the file must be redirected in).
You can also use Bash:
while read line; do
if [[ $line =~ [0-9\.]+ ]]; then
echo $BASH_REMATCH
fi
done <myfile.txt >output.txt
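Since the question asks for sed or awk specifically, here is an awk sketch that scans every whitespace-separated field and prints the ones that look numeric (the sample input is made up):

```shell
printf 'weight 3.14 kg\nanswer is 42\n' |
  awk '{ for (i = 1; i <= NF; i++) if ($i ~ /^[0-9.]+$/) print $i }'
# prints:
# 3.14
# 42
```

Unlike the sed substitution in the question, this copes with the number appearing anywhere in the line, not just at the start.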