In bash, given a path such as:
mypath='my/path/to/version/5e/is/7/here'
I would like to extract the first part that contains a number. For the example I would want to extract: 5e
Is there a better way than looping over the parts using while and checking each part for a number?
while IFS=/ read part
do
if [[ $part =~ *[0-9]* ]]; then
echo "$part"
fi
done <<< "$mypath"
Using Bash's regex:
[[ "$mypath" =~ [^/]*[0-9]+[^/]* ]] && echo "${BASH_REMATCH[0]}"
5e
Method using 'grep -o'.
echo $mypath | grep -o -E '\b[^/]*[0-9][^/]*\b' | head -1
Replace / by a newline
Filter the first match with a number
mypath='my/path/to/version/5e/is/7/here'
<<<"${mypath//\//$'\n'}" grep -m1 '[0-9]'
and a safer alternative that uses zero separated stream with GNU tools in case there are newlines in the path:
<<<"${mypath}" tr '/' '\0' | grep -z -m1 '[0-9]'
Is there a better way than looping over the parts using while and checking each part for a number?
No, either way or another you have to loop through all the parts until the first part with numbers is discovered. The loop may be hidden behind other tools, but it's still going to loop through the parts. You solution seems to be pretty good by itself, just break after you've found the first part if you want only the first.
Could you please try following, written and tested with shown samples. This should print if we have more than 1 values in the lines too. If you talk about better way, awk could be fast compare to pure bash loop + regex solutions IMHO, so adding it here.
awk -F'/' '
{
val=""
for(i=1;i<=NF;i++){
if($i~/[0-9][a-zA-Z]/ || $i~/[a-zA-Z][0-9]/){
val=(val?val OFS:"")$i
}
}
print val
}' Input_file
Explanation: Adding detailed explanation for above.
awk -F'/' ' ##Starting awk program from here and setting field separator as / here.
{
val="" ##Nullifying val here.
for(i=1;i<=NF;i++){ ##Running for loop till value of NF.
if($i~/[0-9][a-zA-Z]/ || $i~/[a-zA-Z][0-9]/){ ##Checking condition if field value is matching regex of digit alphabet then do following.
val=(val?val OFS:"")$i ##Creating variable val where keep on adding current field value in it.
}
}
print val ##Printing val here.
}' Input_file ##Mentioning Input_file name here.
Using Perl:
mypath='my/path/to/version/5e/is/7/here'
# Method 1 (using for loop):
echo "${mypath}" | perl -F'/' -lane 'for my $dir ( #F ) { next unless $dir =~ /\d/; print $dir; last; }'
# Method 2 (using grep):
echo "${mypath}" | perl -F'/' -lane 'my $dir = ( grep { /\d/ } #F )[0]; print $dir if defined $dir;'
# Prints:
# 5e
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
-a : Split $_ into array #F on whitespace or on the regex specified in -F option.
-F'/' : Split into #F on /, rather than on whitespace.
next unless $dir =~ /\d/; : skip the rest of the loop if the current part of the path does not* contain a digit (\d).
last; : exit the loop (here, it also exits the script), so that it prints only the first occurrence of the matching directory.
grep { ... } LIST : for the LIST argument, returns the list of elements for which the expression ... is true, here returns the list of all path elements that have a digit.
(LIST)[0] : returns the first element of the LIST, here, the first path element with a digit.
SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches
perldoc perlre: Perl regular expressions (regexes)
perldoc perlre: Perl regular expressions (regexes): Quantifiers; Character Classes and other Special Escapes; Assertions; Capture groups
With awk, set RS to / and print the first record containing a number.
awk -v RS=/ '/[0-9]/{print;exit}' <<< "$mypath"
5e
Another bash variant
mypath='my/path/to/app version/5e/is/7/here'
until [[ ${mypath:0:1} =~ [0-9] ]]; do
mypath=${mypath#*/}
done
echo ${mypath%%/*}
Related
I'm working with the following output:
=============================== Coverage summary ===============================
Statements : 26.16% ( 1681/6425 )
Branches : 6.89% ( 119/1727 )
Functions : 23.82% ( 390/1637 )
Lines : 26.17% ( 1680/6420 )
================================================================================
I would like to parse the 4 coverage percentage numbers without the percent via REGEX, into a comma separated list.
Any suggestions for a good regex expression for this? Or another good option?
The sed command:
sed -n '/ .*% /{s/.* \(.*\)% .*/\1/;p;}' input.txt | sed ':a;N;$!ba;s/\n/,/g'
gives the output:
26.16,6.89,23.82,26.17
Edit: A better answer, with only a single sed, would be:
sed -n '/ .*% /{s/.* \(.*\)% .*/\1/;H;};${g;s/\n/,/g;s/,//;p;}' input.txt
Explanation:
/ .*% / search for lines with a percentage value (note spaces)
s/.* \(.*\)% .*/\1/ and delete everything except the percentage value
H and then append it to the hold space, prefixed with a newline
$ then for the last line
g get the hold space
s/\n/,/g replace all the newlines with commas
s/,// and delete the initial comma
p and then finally output the result
To harden the regex, you could replace the search for the percentage value .*% with for example [0-9.]*%.
I think this is a grep job. This should help:
$ grep -oE "[0-9]{1,2}\.[0-9]{2}" input.txt | xargs | tr " " ","
Output:
26.16,6.89,23.82,26.17
The input file just contains what you have shown above. Obviously, there are other ways like cat to feed the input to the command.
Explanation:
grep -oE: only show matches using extended regex
xargs: put all results onto a single line
tr " " ",": translate the spaces into commas:
This is actually a nice shell tool belt example, I would say.
Including the consideration of Joseph Quinsey, the regex can be made more robust with a lookahead to assert a % sign after then numeric value using a Perl-compatible RE pattern:
grep -oP "[0-9]{1,2}\.[0-9]{2}(?=%)" input.txt | xargs | tr " " ","
Would you consider to use awk? Here's the command you may try,
$ awk 'match($0,/[0-9.]*%/){s=(s=="")?"":s",";s=s substr($0,RSTART,RLENGTH-1)}END{print s}' file
26.16,6.89,23.82,26.17
Brief explanation,
match($0,/[0-9.]*%/): find the record matched with regex [0-9.]*%
s=(s=="")?"":s",": since comma separated is required, we just need print commas before each matched except the first one.
s=s substr($0,RSTART,RLENGTH-1): print the matched part appended to s
Assuming the item names (Statements, Branches, ...) do not contain whitespaces, how about:
#!/bin/bash
declare -a keys
declare -a vaues
while read -r line; do
if [[ "$line" =~ ^([^\ ]+)\ *:\ *([0-9.]+)% ]]; then
keys+=(${BASH_REMATCH[1]})
values+=(${BASH_REMATCH[2]})
fi
done < output.txt
ifsback=$IFS # backup IFS
IFS=,
echo "${keys[*]}"
echo "${values[*]}"
IFS=$ifsback # restore IFS
which yields:
Statements,Branches,Functions,Lines
26.16,6.89,23.82,26.17
Yet another option, with perl:
cat the_file | perl -e 'while(<>){/(\d+\.\d+)%/ and $x.="$1,"}chop $x; print $x;'
The code, unrolled and explained:
while(<>){ # Read line by line. Put lines into $_
/(\d+\.\d+)%/ and $x.="$1,"
# Equivalent to:
# if ($_ =~ /(\d+\.\d+)%/) {$x.="$1,"}
# The regex matches "numbers", "dot", "numbers" and "%",
# stores just numbers on $1 (first capturing group)
}
chop $x; # Remove extra ',' and print result
print $x;
Somewhat shorter with an extra sed
cat the_file | perl -ne '/(\d+\.\d+)%/ and print "$1,"'|sed 's/.$//'
Uses "n" parameter which implies while(<>){}. For removing the last ',' we use sed.
I'm trying to count the words with at least two vowels in all the .txt files in the directory. Here's my code so far:
#!/bin/bash
wordcount=0
for i in $HOME/*.txt
do
cat $i |
while read line
do
for w in $line
do
if [[ $w == .*[aeiouAEIOU].*[AEIOUaeiou].* ]]
then
wordcount=`expr $wordcount + 1`
echo $w ':' $wordcount
else
echo "In else"
fi
done
done
echo $i ':' $wordcount
wordcount=0
done
Here is my sample from a txt file
Last modified: Sun Aug 20 18:18:27 IST 2017
To remove PPAs
sudo apt-get install ppa-purge
sudo ppa-purge ppa:
The problem is it doesn't match the pattern in the if statement for all the words in the text file. It goes directly to the else statement. And secondly, the wordcount in echo $i ':' $wordcount is equal to 0 which should be some value.
Immediate Issue: Glob vs Regex
[[ $string = $pattern ]] doesn't perform regex matching; instead, it's a glob-style pattern match. While . means "any character" in regex, it matches only itself in glob.
You have a few options here:
Use =~ instead to perform regular expression matching:
[[ $w =~ .*[aeiouAEIOU].*[AEIOUaeiou].* ]]
Use a glob-style expression instead of a regex:
[[ $w = *[aeiouAEIOU]*[aeiouAEIOU]* ]]
Note the use of = rather than == here; while either is technically valid, the former avoids building finger memory that would lead to bugs when writing code for a POSIX implementation of test / [, as = is the only valid string comparison operator there.
Larger Issue: Properly Reading Word-By-Word
Using for w in $line is innately unsafe. Use read -a to read a line into an array of words:
#!/usr/bin/env bash
wordcount=0
for i in "$HOME"/*.txt; do
while read -r -a words; do
for word in "${words[#]}"; do
if [[ $word = *[aeiouAEIOU]*[aeiouAEIOU]* ]]; then
(( ++wordcount ))
fi
done
done <"$i"
printf '%s: %s\n' "$i" "$wordcount"
wordcount=0
done
Try:
awk '/[aeiouAEIOU].*[AEIOUaeiou]/{n++} ENDFILE{print FILENAME":"n; n=0}' RS='[[:space:]]' *.txt
Sample output looks like:
$ awk '/[aeiouAEIOU].*[AEIOUaeiou]/{n++} ENDFILE{print FILENAME":"n; n=0}' RS='[[:space:]]' *.txt
one.txt:1
sample.txt:9
How it works:
/[aeiouAEIOU].*[AEIOUaeiou]/{n++}
Every time we find a word with two vowels, we increment variable n.
ENDFILE{print FILENAME":"n; n=0}
At the end of each file, we print the name of the file and the 2-vowel word count n. We then reset n to zero.
RS='[[:space:]]'
This tells awk to use any whitespace as a word separator. This makes each word into a record. Awk reads the input one record at a time.
Shell issues
The use of awk avoids a multitude of shell issues. For example, consider the line for w in $line. This will not work the way you hope. Consider a directory with these files:
$ ls
one.txt sample.txt
Now, let's take line='* Item One' and see what happens:
$ line='* Item One'
$ for w in $line; do echo "w=$w"; done
w=one.txt
w=sample.txt
w=Item
w=One
The shell treats the * in line as a wildcard and expands it into a list of files. Odds are you didn't want this. The awk solution avoids a variety of issues like this.
Using grep - this is pretty simple to do.
#!/bin/bash
wordcount=0
for file in ./*.txt
do
count=`cat $file | xargs -n1 | grep -ie "[aeiou].*[aeiou]" | wc -l`
wordcount=`expr $wordcount + $count`
done
echo $wordcount
I have created this basic script:
#!/bin/bash
file="/usr/share/dict/words"
var=2
sed -n "/^$var$/p" /usr/share/dict/words
However, it's not working as required to be (or still need some more logic to put in it).
Here, it should print only 2 letter words but with this it is giving different output
Can anyone suggest ideas on how to achieve this with sed or with awk?
it should print only 2 letter words
Your sed command is just searching for lines with 2 in text.
You can use awk for this:
awk 'length() == 2' file
Or using a shell variable:
awk -v n=$var 'length() == n' file
What you are executing is:
sed -n "/^2$/p" /usr/share/dict/words
This means: all lines consisting in exactly the number 2, nothing else. Of course this does not return anything, since /usr/share/dict/words has words and not numbers (as far as I know).
If you want to print those lines consisting in two characters, you need to use something like .. (since . matches any character):
sed -n "/^..$/p" /usr/share/dict/words
To make the number of characters variable, use a quantifier {} like (note the usage of \ to have sed's BRE understand properly):
sed -n "/^.\{2\}$/p" /usr/share/dict/words
Or, with a variable:
sed -n '/^.\{'"$var"'\}$/p' /usr/share/dict/words
Note that we are putting the variable outside the quotes for safety (thanks Ed Morton in comments for the reminder).
Pure bash... :)
file="/usr/share/dict/words"
var=2
#building a regex
str=$(printf "%${var}s")
re="^${str// /.}$"
while read -r word
do
[[ "$word" =~ $re ]] && echo "$word"
done < "$file"
It builds a regex in a form ^..$ (the number of dots is variable). So doing it in 2 steps:
create a string of the desired length e.g: %2s. without args the printf prints only the filler spaces for the desired length e.g.: 2
but we have a variable var, therefore %${var}s
replace all spaces in the string with .
but don't use this solution. It is too slow, and here are better utilities for this, best is imho grep.
file="/usr/share/dict/words"
var=5
grep -P "^\w{$var}$" "$file"
Try awk-
awk -v var=2 '{if (length($0) == var) print $0}' /usr/share/dict/words
This can be shortened to
awk -v var=2 'length($0) == var' /usr/share/dict/words
which has the same effect.
To output only lines matching 2 alphabetic characters with grep:
grep '^[[:alpha:]]\{2\}$' /usr/share/dict/words
GNU awk and mawk at least (due to empty FS):
$ awk -F '' 'NF==2' /usr/share/dict/words #| head -5
aa
Ab
ad
ae
Ah
Empty FS separates each character on its own field so NF tells the record length.
For a list of files, I'd like to match the ones not ending with .txt. I am currently using this expression:
.*(txt$)|(html\.txt$)
This expression will match everything ending in .txt, but I'd like it to do the opposite.
Should match:
happiness.html
joy.png
fear.src
Should not match:
madness.html.txt
excitement.txt
I'd like to get this so I can use it in pair with fswatch:
fswatch -0 -e 'regex here' . | xargs -0 -n 1 -I {} echo "{} has been changed"
The problem is it doesn't seem to work.
PS: I use the tag bash instead of fswatch because I don't have enough reputation points to create it. Sorry!
Try using a lookbehind, like this:
.*$(?<!\.txt)
Demonstration
Basically, this matches any line of text so long as the last 4 characters are not ".txt".
You can use Negative Lookahead for this purpose.
^(?!.*\.txt).+$
Live Demo
You can use this expression with grep using option -P:
grep -Po '^(?!.*\.txt).+$' file
Since question has been tagged as bash, lookaheads may not be supported (except grep -P), here is one grep solution that doesn't need lookaheads:
grep -v '\.txt$' file
happiness.html
joy.png
fear.src
EDIT: You can use this xargs command to avoid matching *.txt files:
xargs -0 -n 1 -I {} bash -c '[[ "{}" == *".txt" ]] && echo "{} has been changed"'
It really depends what regular expression tool you are using. Many tools provide a way to invert the sense of a regex. For example:
bash
# succeeds if filename ends with .txt
[[ $filename =~ "."txt$ ]]
# succeeds if filename does not end with .txt
! [[ $filename =~ "."txt$ ]]
# another way of writing the negative
[[ ! $filename =~ "."txt$ ]]
grep
# succeeds if filename ends with .txt
egrep -q "\.txt$" <<<"$filename"
# succeeds if filename does not end with .txt
egrep -qv "\.txt$" <<<"$filename"
awk
/\.txt$/ { print "line ends with .txt" }
! /\.txt$/ { print "line doesn't end with .txt" }
$1 ~ /\.txt$/ { print "first field ends with .txt" }
$1 !~ /\.txt$/ { print "first field doesn't end with .txt" }
For the adventurous, a posix ERE which will work in any posix compatible regex engine
/[^t]$|[^x]t$|[^t]xt$|[^.]txt$/
How do I display data from the beginning of a file until the first occurrence of a regular expression?
For example, if I have a file that contains:
One
Two
Three
Bravo
Four
Five
I want to start displaying the contents of the file starting at line 1 and stopping when I find the string "B*". So the output should look like this:
One
Two
Three
perl -pe 'last if /^B/' source.txt
An explanation: the -p switch adds a loop around the code, turning it into this:
while ( <> ) {
last if /^B.*/; # The bit we provide
print;
}
The last keyword exits the surrounding loop immediately if the condition holds - in this case, /^B/, which indicates that the line begins with a B.
if its from the start of the file
awk '/^B/{exit}1' file
if you want to start from specific line number
awk '/^B/{exit}NR>=10' file # start from line 10
sed -n '1,/^B/p'
Print from line 1 to /^B/ (inclusive). -n suppresses default echo.
Update: Opps.... didn't want "Bravo", so instead the reverse action is needed ;-)
sed -n '/^B/,$!p'
/I3az/
sed '/^B/,$d'
Read that as follows: Delete (d) all lines beginning with the first line that starts with a "B" (/^B/), up and until the last line ($).
Some of the sed commands given by others will continue to unnecessarily process the input after the regex is found which could be quite slow for large input. This quits when the regex is found:
sed -n '/^Bravo/q;p'
in Perl:
perl -nle '/B.*/ && last; print; ' source.txt
Just sharing some answers I've received:
Print data starting at the first line, and continue until we find a match to the regex, then stop:
<command> | perl -n -e 'print "$_" if 1 ... /<regex>/;'
Print data starting at the first line, and continue until we find a match to the regex, BUT don't display the line that matches the regular expression:
<command> | perl -pe '/<regex>/ && exit;'
Doing it in sed:
<command> | sed -n '1,/<regex>/p'
Your problem is a variation on an answer in perlfaq6: How can I pull out lines between two patterns that are themselves on different lines?.
You can use Perl's somewhat exotic .. operator (documented in perlop):
perl -ne 'print if /START/ .. /END/' file1 file2 ...
If you wanted text and not lines, you would use
perl -0777 -ne 'print "$1\n" while /START(.*?)END/gs' file1 file2 ...
But if you want nested occurrences of START through END, you'll run up against the problem described in the question in this section on matching balanced text.
Here's another example of using ..:
while (<>) {
$in_header = 1 .. /^$/;
$in_body = /^$/ .. eof;
# now choose between them
} continue {
$. = 0 if eof; # fix $.
}
Here is a perl one-liner:
perl -pe 'last if /B/' file
If Perl is a possibilty, you could do something like this:
% perl -0ne 'if (/B.*/) { print $`; last }' INPUT_FILE
one liner with basic shell commands:
head -`grep -n B file|head -1|cut -f1 -d":"` file