shell regex: Extract prices

Given the following list of prices, I am trying to figure out how to normalize/extract only the digits.
INPUT DESIRED_OUTPUT
CA$1399.00 1399.00
$1399.11 1399.11
$1,399.22< 1399.22
Z$1 399.33 1399.33
$1399.44# 1399.44
C$ 1399.55 1399.55
1,399.66 1399.66
1399.77 1399.77
,1399.88 1399.88
25 1399.88 1399.88
399.99 399.99
88.88 99.99 99.99 (if more than one match on a line, only the last one matters)
.1399.88 DO NOT MATCH (not a price; too many ".")
666.000 DO NOT MATCH (not a price: too many 0's)
I suppose a good idea is to begin with what they all have in common:
Prices always contain .NN, but never contain .NNN
Upon further inspection, other rules become apparent:
.NN must be preceded by one or more digits.
NNN.NN can be preceded by either a comma, a space, or a digit, but nothing else.
Anything following .NN and preceding *N.NN marks the end of the match.
Finally, the regex needs to consider commas in things like 1,399.66 (1399.66) to determine whether it is a price, but then strip them. 1, 399.66, for instance does not equal 1399.66: it should be 399.66.
I am looking at sed, grep, and awk for a portable and efficient solution. How should I go about approaching this problem?
I found a similar question, but I have no idea how to try the following regex with sed:
^\d+(,\d{1,2})?$
EDIT: Yes, my input format can be a little weird, because it is the result of concatenating scraped pages.

You can use the following shell script:
#!/bin/sh
grep -v '\.\d\+\.' | # get rid of lines with multiple dots within the same number
grep -v '\.\d\d\d\+' | # get rid of lines with more than 2 digits after .
sed -e 's/\(.*\.[0-9][0-9]\).*$/\1/' | # remove anything after last .NN
sed -e 's/^.* \([0-9][0-9][0-9][0-9]\)\./\1./' | # "* NNNN." => "NNNN."
sed -e 's/^.* \([0-9][0-9]\)\./\1./' | # "* NN." => "NN."
sed -e 's/^.* \([0-9]\)\./\1./' | # "* N." => "N."
sed -e 's/^\(.*\)[ ,]\(\([0-9]\)\{3,\}\)\./\1\2./g' | # "*,NNN." or "* NNN." => "*NNN."
sed -e 's/^\(.*\)[ ,]\(\([0-9]\)\{6,\}\)\./\1\2./g' | # "*,NNNNNN." or "* NNNNNN." => "*NNNNNN."
sed -e 's/^\(.*\)[ ,]\(\([0-9]\)\{9,\}\)\./\1\2./g' | # "*,NNNNNNNNN." or "* NNNNNNNNN." => "*NNNNNNNNN."
grep -o '\d\+\.\d\d' # print only the price
In case of numbers that are separated by space or , in groups of 3 digits, this solution works up to 9 digits before the .. If you need to extract bigger prices, just add more lines, increasing the number in the regex by 3. ;-)
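For example, a fourth substitution following the same pattern (a sketch, not part of the tested script) would extend it to 12 digits:
sed -e 's/^\(.*\)[ ,]\(\([0-9]\)\{12,\}\)\./\1\2./g' | # "*,NNNNNNNNNNNN." or "* NNNNNNNNNNNN." => "*NNNNNNNNNNNN."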
Put it in a file called extract_prices, make it executable (chmod +x extract_prices) and run it: ./extract_prices < my_list.txt
Tested on OS X using the following input:
CA$1399.00
$1399.11
$1,399.22<
Z$1 399.33
Z$12 777 666.34 # <-- additional monster price
$1399.44#
C$ 1399.55
1,399.66
1399.77
,1399.88
25 1399.88
399.99
88.88 99.99
.1399.88
666.000
Which generates the following output:
1399.00
1399.11
1399.22
1399.33
12777666.34
1399.44
1399.55
1399.66
1399.77
1399.88
1399.88
399.99
99.99

A solution with awk that splits on all characters that are not numbers or decimal point and prints the last field that matches a price. The leading sed script handles the exception case #3 where we have a space instead of a comma marking the thousands spot.
sed -e 's/ / x /g; :a; s/\(\$[1-9][0-9]*\) /\1/; ta' | awk -F '[^0-9.]' -v p='[0-9]+\\.[0-9][0-9]' '$0 ~ p { gsub(/,/, ""); for (i=NF; i>0; i--) if ($i ~ "^" p "$") { print $i; next } }'
Notes:
1) The sed script uses a test to iterate; therefore, it can handle millions, billions, etc.
2) The sed script also handles the multiple-space condition, so that $1[ ][ ]1000.00 does not end up as $11000.00 (see the example after these notes).
3) Commas are simply stripped/ignored... if there is an issue with comma separation of numbers, it can be resolved by getting rid of the gsub in the awk script and fixing the filter in the leading sed script.
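For example, a quick check of the multiple-space case from note 2, using the one-liner above unchanged (input made up to match the note):
$ echo '$1  1000.00' | sed -e 's/ / x /g; :a; s/\(\$[1-9][0-9]*\) /\1/; ta' | awk -F '[^0-9.]' -v p='[0-9]+\\.[0-9][0-9]' '$0 ~ p { gsub(/,/, ""); for (i=NF; i>0; i--) if ($i ~ "^" p "$") { print $i; next } }'
1000.00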
Here is a more complicated version that builds on the idea in note #3 to make commas and spaces part of the number only if the space or comma is at a thousands separator.
sed -e ':a; s/\(\$[1-9][0-9]*\) \([0-9][0-9][0-9][ .]\)/\1\2/; ta; :b; s/\([1-9][0-9]*\),\([0-9][0-9][0-9][,.]\)/\1\2/; tb;' | awk -F '[^0-9.]' -v p='[0-9]+\\.[0-9][0-9]' '$0 ~ p { for (i=NF; i>0; i--) if ($i ~ "^" p "$") { print $i; next } }'
If the chance of success is high on each line, then getting rid of the "p" pre-check makes for a slightly more efficient script.
sed -e ':a; s/\(\$[1-9][0-9]*\) \([0-9][0-9][0-9][ .]\)/\1\2/; ta; :b; s/\([1-9][0-9]*\),\([0-9][0-9][0-9][,.]\)/\1\2/; tb;' | awk -F '[^0-9.]' '{ for (i=NF; i>0; i--) if ($i ~ /^[0-9]+\.[0-9][0-9]$/) { print $i; next } }'
Finally, for safety, we can check in the sed filter to make sure we have a valid space or comma delimited number before we do either substitution.
sed -e ':a; /\$[1-9][0-9]\?[0-9]\?\( [0-9][0-9][0-9]\)\+\.[0-9][0-9]/ s/\(\$[1-9][0-9]*\) \([0-9][0-9][0-9][ .]\)/\1\2/; ta; :b; /[1-9][0-9]\?[0-9]\?\(,[0-9][0-9][0-9]\)\+\.[0-9][0-9]/ s/\([1-9][0-9]*\),\([0-9][0-9][0-9][,.]\)/\1\2/; tb;' | awk -F '[^0-9.]' '{ for (i=NF; i>0; i--) if ($i ~ /^[0-9]+\.[0-9][0-9]$/) { print $i; next } }'

This might work for you (GNU sed):
sed -r '/\n/!s/([^0-9]*\b(([0-9])[ ,]([0-9]{3})|([0-9]+))(\.[0-9]{2})\b)+/\n\3\4\5\6\n/;/^[0-9]+\.[0-9]{2}\b/P;D' file
This works with the data provided but some of the specification is a bit sketchy.

How to match a regex 1 to 3 times in a sed command?

Problem
I want to get any text that consists of 1 to three digits followed by a % but without the % using sed.
What I tried
So I guess the following regex should match the right pattern: [0-9]{1,3}%.
Then I can use this sed command to catch the three digits and only print them:
sed -nE 's/.*([0-9]{1,3})%.*/\1/p'
Example
However, when I run it, it shows:
$ echo "100%" | sed -nE 's/.*([0-9]{1,3})%.*/\1/p'
0
instead of
100
Obviously, there's something wrong with my sed command, and I think the problem comes from here:
[0-9]{1,3}
which apparently doesn't do what i want it to do.
Edit:
Solution
The greedy .* at the start of sed -nE 's/.*([0-9]{1,3})%.*/\1/p' "ate" the first two digits, leaving just the last of the three digits in 100%.
The right way to write it, according to Wiktor's answer, is:
sed -nE 's/(.*[^0-9])?([0-9]{1,3})%.*/\2/p'
Use
sed -nE 's/(.*[^0-9])?([0-9]{1,3})%.*/\2/p'
Details
(.*[^0-9])? - (Group 1) an optional sequence of any 0 or more chars up to the non-digit char including it
([0-9]{1,3}) - (Group 2) one to three digits
% - a % char
.* - the rest of the string.
The match is replaced with Group 2 contents, and that is the only value printed since -n suppresses the default line output.
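For example, re-running the failing case from the question with the corrected command:
$ echo "100%" | sed -nE 's/(.*[^0-9])?([0-9]{1,3})%.*/\2/p'
100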
It will be easier to use a cut + grep option:
echo "abc 100%" | cut -d% -f1 | grep -oE '[0-9]{1,3}'
100
echo "100%" | cut -d% -f1 | grep -oE '[0-9]{1,3}'
100
Or else you may use this awk:
echo "100%" | awk 'match($0, /[0-9]{1,3}%/){print substr($0, RSTART, RLENGTH-1)}'
100
Or else if you have gnu grep then use -P (PCRE) option:
echo "abc 100%" | ggrep -oP '[0-9]{1,3}(?=%)'
100
This might work for you (GNU sed):
sed -En 's/.*\<([0-9]{1,3})%.*/\1/p' file
This is a filtering exercise, so use the -n option.
Use a back reference to capture 1 to 3 digits, followed by % and print the result if successful.
N.B. The \< ensures the digits start on a word boundary; \b could also be used. The -E option is employed to reduce the number of backslashes that would otherwise be needed to quote the (, ), { and } metacharacters.
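For example (GNU sed assumed), the word boundary makes an over-long number fail to match instead of being truncated:
$ echo "100%" | sed -En 's/.*\<([0-9]{1,3})%.*/\1/p'
100
$ echo "1234%" | sed -En 's/.*\<([0-9]{1,3})%.*/\1/p'   # no output: 234 does not start on a word boundary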

Parsing Karma Coverage Output in Bash for a Jenkins Job (Scripting)

I'm working with the following output:
=============================== Coverage summary ===============================
Statements : 26.16% ( 1681/6425 )
Branches : 6.89% ( 119/1727 )
Functions : 23.82% ( 390/1637 )
Lines : 26.17% ( 1680/6420 )
================================================================================
I would like to parse the 4 coverage percentage numbers without the percent via REGEX, into a comma separated list.
Any suggestions for a good regex expression for this? Or another good option?
The sed command:
sed -n '/ .*% /{s/.* \(.*\)% .*/\1/;p;}' input.txt | sed ':a;N;$!ba;s/\n/,/g'
gives the output:
26.16,6.89,23.82,26.17
Edit: A better answer, with only a single sed, would be:
sed -n '/ .*% /{s/.* \(.*\)% .*/\1/;H;};${g;s/\n/,/g;s/,//;p;}' input.txt
Explanation:
/ .*% / search for lines with a percentage value (note spaces)
s/.* \(.*\)% .*/\1/ and delete everything except the percentage value
H and then append it to the hold space, prefixed with a newline
$ then for the last line
g get the hold space
s/\n/,/g replace all the newlines with commas
s/,// and delete the initial comma
p and then finally output the result
To harden the regex, you could replace the search for the percentage value .*% with for example [0-9.]*%.
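Applied to the single-sed command above, that hardening might look like this (a sketch, only checked against the sample input):
sed -n '/ [0-9.]*% /{s/.* \([0-9.]*\)% .*/\1/;H;};${g;s/\n/,/g;s/,//;p;}' input.txt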
I think this is a grep job. This should help:
$ grep -oE "[0-9]{1,2}\.[0-9]{2}" input.txt | xargs | tr " " ","
Output:
26.16,6.89,23.82,26.17
The input file just contains what you have shown above. Obviously, there are other ways like cat to feed the input to the command.
Explanation:
grep -oE: only show matches using extended regex
xargs: put all results onto a single line
tr " " ",": translate the spaces into commas:
This is actually a nice shell tool belt example, I would say.
Including the consideration of Joseph Quinsey, the regex can be made more robust with a lookahead to assert a % sign after the numeric value, using a Perl-compatible RE pattern:
grep -oP "[0-9]{1,2}\.[0-9]{2}(?=%)" input.txt | xargs | tr " " ","
Would you consider to use awk? Here's the command you may try,
$ awk 'match($0,/[0-9.]*%/){s=(s=="")?"":s",";s=s substr($0,RSTART,RLENGTH-1)}END{print s}' file
26.16,6.89,23.82,26.17
Brief explanation,
match($0,/[0-9.]*%/): find the record matched with regex [0-9.]*%
s=(s=="")?"":s",": since comma-separated output is required, we just need to print a comma before each match except the first one.
s=s substr($0,RSTART,RLENGTH-1): append the matched part (without the % sign) to s
Assuming the item names (Statements, Branches, ...) do not contain whitespaces, how about:
#!/bin/bash
declare -a keys
declare -a values
while read -r line; do
    if [[ "$line" =~ ^([^\ ]+)\ *:\ *([0-9.]+)% ]]; then
        keys+=(${BASH_REMATCH[1]})
        values+=(${BASH_REMATCH[2]})
    fi
done < output.txt
ifsback=$IFS # backup IFS
IFS=,
echo "${keys[*]}"
echo "${values[*]}"
IFS=$ifsback # restore IFS
which yields:
Statements,Branches,Functions,Lines
26.16,6.89,23.82,26.17
Yet another option, with perl:
cat the_file | perl -e 'while(<>){/(\d+\.\d+)%/ and $x.="$1,"}chop $x; print $x;'
The code, unrolled and explained:
while(<>){ # Read line by line. Put lines into $_
/(\d+\.\d+)%/ and $x.="$1,"
# Equivalent to:
# if ($_ =~ /(\d+\.\d+)%/) {$x.="$1,"}
# The regex matches "numbers", "dot", "numbers" and "%",
# stores just numbers on $1 (first capturing group)
}
chop $x; # Remove extra ',' and print result
print $x;
Somewhat shorter with an extra sed
cat the_file | perl -ne '/(\d+\.\d+)%/ and print "$1,"'|sed 's/.$//'
Uses "n" parameter which implies while(<>){}. For removing the last ',' we use sed.

Regex pattern for quoted numbers and commas

I'm trying to find the correct regex to search a file for double-quoted numbers separated by commas. For example, I'm trying to find "27,422,734" and then replace it in a text editor so that the comma falls every 4 digits, giving "2742,2734" as the end result.
I've tried a few examples I found on SO but none are helping me with this scenario like
"[^"]+"
'\d+'
While the above do find matches, I don't know how to deal with the commas or what to replace them with.
Thanks for any help!
I found an even shorter solution (works with gnu-sed):
colonmv () {
echo $1 | sed 's/,//g' | sed -r ':a;s/\B[0-9]{4}\>/,&/;ta'
}
But beware: the first sed command eats every comma, not just those between digits, so improve it or filter your input beforehand.
The second command uses the :a trick.
Read 4 digits that end at a word boundary (\>) but do not start at one (\B), and insert a comma in front of them; whenever a replacement took place, ta jumps back to :a and repeats.
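To see just that loop in isolation (GNU sed, sample number made up):
$ echo 27422734 | sed -r ':a;s/\B[0-9]{4}\>/,&/;ta'
2742,2734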
Now, let's see colonmv in the wild:
colonmv '"A 3-grouped, pretty long number: 5,127,422,734 and an ungrouped one 5678905567789065778"'
"A 3-grouped pretty long number: 51,2742,2734 and an ungrouped one 567,8905,5677,8906,5778"
There might be a better way of doing this, but I propose the following approach:
INPUT:
$ cat to_transform.txt
abc "27,422,734" def"27,422,734" def
ltu "123,734" abc "345,678,123,734" vtu
xtz "345,678,123,734" vtu "345,678,123,734"
u "1" a
"123"
iu"abc"a "123,734"
CMD:
$ paste -d' ' <(grep -oP '(?<=")(:?\d+,\d+)+(?=")' to_transform.txt) <(grep -oP '(?<=")(:?\d+,\d+)+(?=")' to_transform.txt | sed -e 's/,//g;:loop s/\([0-9]\{4\}\)\($\|,\)/\2,\1/g; s/,,/,/g; /\([0-9]\{5\}\)/b loop') | awk '{cmd="sed -i 0,/"$1"/s/" $1 "/" $2 "/ to_transform.txt"; system(cmd)}'
OUTPUT:
$ cat to_transform.txt
abc "2742,2734" def"2742,2734" def
ltu "12,3734" abc "3456,7812,3734" vtu
xtz "3456,7812,3734" vtu "3456,7812,3734"
u "1" a
"123"
iu"abc"a "12,3734"
CODE DETAILS AND EXPLANATIONS:
<(grep -oP '(?<=")(:?\d+,\d+)+(?=")' to_transform.txt) will extract each number to be processed from the input file; the regex used here uses lookbehind/lookahead to enforce the surrounded-by-quotes condition, and (:?\d+,\d+)+ is used to extract numbers like 27,422,734.
The sed command, taking the output of that grep command, will then do the following operations:
SED DETAILS:
s/,//g #remove all commas in the number
:loop #create a label to loop on
s/\([0-9]\{4\}\)\($\|,\)/\2,\1/g #insert a comma before each group of 4 digits that is followed by the end of the string or an existing comma
s/,,/,/g #remove duplicate commas added by the previous step, if any
/\([0-9]\{5\}\)/b loop #if there are at least 5 consecutive digits left in the string, loop and continue processing
Temporary output after the paste operation:
27,422,734 2742,2734
27,422,734 2742,2734
123,734 12,3734
345,678,123,734 3456,7812,3734
345,678,123,734 3456,7812,3734
345,678,123,734 3456,7812,3734
123,734 12,3734
Last but not least, the awk command reads this output and runs a sed command for each line, replacing every element of the first column by the corresponding value in the second column: awk '{cmd="sed -i 0,/"$1"/s/" $1 "/" $2 "/ to_transform.txt"; system(cmd)}'.
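To make that last step concrete, for the first extracted pair the awk script builds and runs a command roughly like this (derived from the code above, not run here):
sed -i 0,/27,422,734/s/27,422,734/2742,2734/ to_transform.txt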
Precondition: Your input conforms to "[0-9,]*" and is a "#,###"-format correct number.
#!/bin/bash
colonmv () {
echo $1 | sed -r 's/,([0-9]{3})+/\1/g;' | \
rev | sed -r 's/[^0-9]?([0-9]{4})/\1,/g;s/,"$/"/;s/.*/"&/' | rev
}
colonmv '"734"'
colonmv '"2,734"'
colonmv '"22,734"'
colonmv '"422,734"'
colonmv '"7,422,734"'
colonmv '"27,422,734"'
colonmv '"127,422,734"'
colonmv '"5,127,422,734"'
Test:
colonmv.sh
"734""
"2734"
"2,2734"
"42,2734"
"742,2734"
"2742,2734"
"1,2742,2734"
"51,2742,2734"

How to display words as per given number of letters?

I have created this basic script:
#!/bin/bash
file="/usr/share/dict/words"
var=2
sed -n "/^$var$/p" /usr/share/dict/words
However, it's not working as required (or it still needs some more logic).
Here, it should print only 2-letter words, but with this it is giving different output.
Can anyone suggest ideas on how to achieve this with sed or with awk?
it should print only 2 letter words
Your sed command is just searching for lines that consist of the text 2.
You can use awk for this:
awk 'length() == 2' file
Or using a shell variable:
awk -v n=$var 'length() == n' file
What you are executing is:
sed -n "/^2$/p" /usr/share/dict/words
This means: all lines consisting of exactly the number 2, nothing else. Of course this does not return anything, since /usr/share/dict/words has words and not numbers (as far as I know).
If you want to print those lines consisting of two characters, you need to use something like .. (since . matches any character):
sed -n "/^..$/p" /usr/share/dict/words
To make the number of characters variable, use an interval quantifier \{n\} (note the backslashes needed for sed's BRE to interpret it properly):
sed -n "/^.\{2\}$/p" /usr/share/dict/words
Or, with a variable:
sed -n '/^.\{'"$var"'\}$/p' /usr/share/dict/words
Note that we are putting the variable outside the quotes for safety (thanks Ed Morton in comments for the reminder).
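A quick check with a three-word sample instead of the dictionary (var set to 3 here just for illustration):
$ var=3; printf '%s\n' ab abc abcd | sed -n '/^.\{'"$var"'\}$/p'
abc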
Pure bash... :)
file="/usr/share/dict/words"
var=2
#building a regex
str=$(printf "%${var}s")
re="^${str// /.}$"
while read -r word
do
[[ "$word" =~ $re ]] && echo "$word"
done < "$file"
It builds a regex in a form ^..$ (the number of dots is variable). So doing it in 2 steps:
create a string of the desired length, e.g. %2s; without an argument, printf prints only the filler spaces for the desired length (here 2)
but we have a variable var, therefore %${var}s
replace all spaces in the string with .
But don't use this solution. It is too slow; there are better utilities for this, and the best is IMHO grep.
file="/usr/share/dict/words"
var=5
grep -P "^\w{$var}$" "$file"
Try awk:
awk -v var=2 '{if (length($0) == var) print $0}' /usr/share/dict/words
This can be shortened to
awk -v var=2 'length($0) == var' /usr/share/dict/words
which has the same effect.
To output only lines matching 2 alphabetic characters with grep:
grep '^[[:alpha:]]\{2\}$' /usr/share/dict/words
GNU awk and mawk at least (due to empty FS):
$ awk -F '' 'NF==2' /usr/share/dict/words #| head -5
aa
Ab
ad
ae
Ah
An empty FS puts each character into its own field, so NF gives the record length.
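A quick illustration of the empty-FS splitting (GNU awk):
$ echo abcde | awk -F '' '{print NF}'
5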

Find words with exact number of characters

I have hundreds of lines like
1234 dfsdfdsfa INIUUININI112123424124 12321 JH7897IUHIH879KJ
and from each line, I want to get only words with exactly 9 characters (dfsdfdsfa in the example). How could I do it?
I tried many regexs/sed/grep/awk but without success.
With grep:
$ grep -oE '\b.{9}\b' infile
dfsdfdsfa
-o returns only matches and not the complete lines; -E is because I'm lazy and don't want to escape the {} (as in \{\}).
The regex itself is "any 9 characters between word boundaries". This is not exactly foolproof and would also match abcd efgh, which can be avoided by indicating that we want non-blank characters only:
grep -oE '\b[^[:blank:]]{9}\b' infile
Instead of using \b...\b, we could use the -w option to grep, which ensures the same.
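For example, with a made-up two-word line:
$ echo 'abcd efgh' | grep -oE '\b.{9}\b'
abcd efgh
$ echo 'abcd efgh' | grep -oE '\b[^[:blank:]]{9}\b'   # no output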
grep with -w (--word-regexp) option:
grep -wo '.\{9\}' file.txt
Note that word-constituent characters are:
[[:alnum:]_]
Example:
% grep -wo '.\{9\}' <<<'1234 dfsdfdsfa INIUUININI112123424124 12321 JH7897IUHIH879KJ'
dfsdfdsfa
Here is a pure bash solution:
filename="test.txt"
declare -a record
while read -ra record
do
for field in ${record[#]}
do
if (( ${#field} == 9 ))
then
echo $field
fi
done
done < "$filename"
and here is an awk solution embedded in bash:
filename='test.txt'
awk -f - "$filename" << '_END_'
{
    for (i=1; i <= NF; i++) {
        if (length($i) == 9) print $i
    }
}
_END_
cat foo.txt | sed -e 's/[\t ]/\n/g' | awk '/^.{9}$/'
should do the trick too.