Regex pattern for quoted numbers and commas

I'm trying to find the correct regex to search a file for double-quoted numbers whose digits are separated by commas. For example, I want to find "27,422,734" and replace it in a text editor so that the comma falls after every 4 digits (counting from the right), giving "2742,2734".
I've tried a few examples I found on SO, but none of them handle this scenario, for example:
"[^"]+"
'\d+'
While these do find matches, I don't know how to deal with the commas or what to use as the replacement.
Thanks for any help!

I found an even shorter solution (works with gnu-sed):
colonmv () {
echo "$1" | sed 's/,//g' | sed -r ':a;s/\B[0-9]{4}\>/,&/;ta'
}
Be careful: the first sed command removes every comma, not just those between digits, so refine it or filter your input first.
The second command uses the :a loop trick.
It matches 4 digits at the end of a word (\>) that are preceded by another word character (\B) and puts a comma in front of them; whenever a replacement took place, ta jumps back to :a and repeats.
Now, let's see colonmv in the wild:
colonmv '"A 3-grouped, pretty long number: 5,127,422,734 and an ungrouped one 5678905567789065778"'
"A 3-grouped pretty long number: 51,2742,2734 and an ungrouped one 567,8905,5677,8906,5778"

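The caveat about the first sed eating every comma can be addressed with a small change. The following is only a sketch, assuming GNU sed as in the answer above; the function name regroup4 is made up for illustration. It removes a comma only when it sits directly between two digits and then regroups as before.

```shell
# regroup4: hypothetical helper, assumes GNU sed (\B and \> are GNU extensions).
regroup4() {
  printf '%s\n' "$1" |
    # Drop a comma only when it has a digit on both sides; loop until none is left.
    sed -E -e ':s' -e 's/([0-9]),([0-9])/\1\2/' -e 'ts' |
    # Put a comma in front of each run of 4 digits that ends a word; loop until done.
    sed -E -e ':a' -e 's/\B[0-9]{4}\>/,&/' -e 'ta'
}

regroup4 '"A 3-grouped, pretty long number: 5,127,422,734"'
```

With this version the comma after "3-grouped" survives, while 5,127,422,734 still becomes 51,2742,2734. Note that a run of 4 digits glued to the end of a word (for example abc1234) would still get a comma inserted, so the input should be filtered if that can occur.
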
There might be a better way of doing this, but I propose the following approach:
INPUT:
$ cat to_transform.txt
abc "27,422,734" def"27,422,734" def
ltu "123,734" abc "345,678,123,734" vtu
xtz "345,678,123,734" vtu "345,678,123,734"
u "1" a
"123"
iu"abc"a "123,734"
CMD:
$ paste -d' ' <(grep -oP '(?<=")(:?\d+,\d+)+(?=")' to_transform.txt) <(grep -oP '(?<=")(:?\d+,\d+)+(?=")' to_transform.txt | sed -e 's/,//g;:loop s/\([0-9]\{4\}\)\($\|,\)/\2,\1/g; s/,,/,/g; /\([0-9]\{5\}\)/b loop') | awk '{cmd="sed -i 0,/"$1"/s/" $1 "/" $2 "/ to_transform.txt"; system(cmd)}'
OUTPUT:
$ cat to_transform.txt
abc "2742,2734" def"2742,2734" def
ltu "12,3734" abc "3456,7812,3734" vtu
xtz "3456,7812,3734" vtu "3456,7812,3734"
u "1" a
"123"
iu"abc"a "12,3734"
CODE DETAILS AND EXPLANATIONS:
<(grep -oP '(?<=")(:?\d+,\d+)+(?=")' to_transform.txt) extracts each number to be processed from the input file. The regex uses lookbehind/lookahead to enforce the surrounded-by-quotes condition, and (:?\d+,\d+)+ matches numbers like 27,422,734. (The group was probably meant to be the non-capturing (?:\d+,\d+)+; as written, :? matches an optional colon, which is harmless here.)
The sed command takes the output of the grep command and performs the following operations:
SED DETAILS:
s/,//g #remove all , in the number
:loop #create a label to loop
s/\([0-9]\{4\}\)\($\|,\)/\2,\1/g #add a comma before every run of 4 digits that is followed by the end of the string or by a comma added earlier
s/,,/,/g #remove duplicate commas added by the previous step, if any
/\([0-9]\{5\}\)/b loop #if at least 5 consecutive digits remain in the string, loop and continue processing
Temporary output after the paste operation:
27,422,734 2742,2734
27,422,734 2742,2734
123,734 12,3734
345,678,123,734 3456,7812,3734
345,678,123,734 3456,7812,3734
345,678,123,734 3456,7812,3734
123,734 12,3734
Last but not least, the awk command reads this paste output and runs one sed command per line, replacing the first remaining occurrence of the value in the first column with the corresponding value in the second column: awk '{cmd="sed -i 0,/"$1"/s/" $1 "/" $2 "/ to_transform.txt"; system(cmd)}'.
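For comparison, the same transformation can be attempted in a single sed pass, without spawning one sed -i per number. This is a sketch under the assumption that every target is a double-quoted string made only of digits and commas; the name regroup_quoted is hypothetical.

```shell
# regroup_quoted: hypothetical filter; reads lines on stdin, writes them to stdout.
# Loop a: remove one comma at a time from a quoted digits-and-commas string.
# Loop b: insert one comma at a time, 4 digits from the right, inside the quotes.
regroup_quoted() {
  sed -E \
    -e ':a' -e 's/"([0-9,]*[0-9]),([0-9][0-9,]*)"/"\1\2"/' -e 'ta' \
    -e ':b' -e 's/"([0-9]+)([0-9]{4})((,[0-9]{4})*)"/"\1,\2\3"/' -e 'tb'
}

printf '%s\n' 'ltu "123,734" abc "345,678,123,734" vtu' | regroup_quoted
```

To update the file itself, the output could be written to a temporary file and moved back, for example regroup_quoted < to_transform.txt > tmp && mv tmp to_transform.txt. Quoted text that is not purely digits and commas, such as "abc", is left alone.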

Precondition: Your input conforms to "[0-9,]*" and is a correctly formatted "#,###" number.
#!/bin/bash
colonmv () {
echo $1 | sed -r 's/,([0-9]{3})+/\1/g;' | \
rev | sed -r 's/[^0-9]?([0-9]{4})/\1,/g;s/,"$/"/;s/.*/"&/' | rev
}
colonmv '"734"'
colonmv '"2,734"'
colonmv '"22,734"'
colonmv '"422,734"'
colonmv '"7,422,734"'
colonmv '"27,422,734"'
colonmv '"127,422,734"'
colonmv '"5,127,422,734"'
Test:
colonmv.sh
"734""
"2734"
"2,2734"
"42,2734"
"742,2734"
"2742,2734"
"1,2742,2734"
"51,2742,2734"

Related

How to extract text between first 2 dashes in the string using sed or grep in shell

I have the string like this feature/test-111-test-test.
I need to extract string till the second dash and change forward slash to dash as well.
I have to do this in a Makefile using shell syntax, and some regular expressions that might help in this case don't work for me there.
In the end I need something like this:
input - feature/test-111-test-test
output - feature-test-111- or at least feature-test-111
feature/test-111-test-test | grep -oP '\A(?:[^-]++-??){2}' | sed -e 's/\//-/g')
But grep -oP doesn't work in my case. This regexp doesn't work as well - (.*?-.*?)-.*.

Another sed solution using a capture group and regex/pattern iteration (same thing Socowi used):
$ s='feature/test-111-test-test'
$ sed -E 's/\//-/;s/^(([^-]*-){3}).*$/\1/' <<< "${s}"
feature-test-111-
Where:
-E - enable extended regex support
s/\//-/ - replace / with -
s/^....*$/ - match start and end of input line
(([^-]*-){3}) - capture group #1 that consists of 3 sets of anything not - followed by -
\1 - print just the capture group #1 (this will discard everything else on the line that's not part of the capture group)
To store the result in a variable:
$ url=$(sed -E 's/\//-/;s/^(([^-]*-){3}).*$/\1/' <<< "${s}")
$ echo $url
feature-test-111-

You can use awk keeping in mind that in Makefile the $ char in awk command must be doubled:
url=$(shell echo 'feature/test-111-test-test' | awk -F'-' '{gsub(/\//, "-", $$1);print $$1"-"$$2"-"}')
echo "$url"
# => feature-test-111-
Here, -F'-' sets the field delimiter to -, gsub(/\//, "-", $1) replaces / with - in field 1, and print $1"-"$2"-" prints fields 1 and 2, each followed by a hyphen.
Or, with a regex as a field delimiter:
url=$(shell echo 'feature/test-111-test-test' | awk -F'[-/]' '{print $$1"-"$$2"-"$$3"-"}')
echo "$url"
# => feature-test-111-
The -F'[-/]' option sets the field separator to - and /.
The '{print $1"-"$2"-"$3"-"}' part prints the first, second and third value with a separating hyphen.

To get the nth occurrence of a character C you don't need fancy perl regexes. Instead, build a regex of the form "(anything that isn't C, then C) for n times":
grep -Eo '([^-]*-){2}' | tr / -
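The "n times" idea can be wrapped in a small function. This is a sketch only: the name upto_nth and its argument order are made up, and it assumes the delimiter is a single character that is not special in a regex (so not ., ^, ] and the like). The ^ anchor makes sure only the leading part of the string is returned.

```shell
# upto_nth: hypothetical helper. $1 = text, $2 = delimiter character, $3 = n.
# Prints everything up to and including the n-th delimiter.
upto_nth() {
  printf '%s\n' "$1" | grep -Eo "^([^$2]*$2){$3}"
}

upto_nth 'feature/test-111-test-test' - 2 | tr / -
```

The call above prints feature-test-111-. If the text has fewer than n delimiters, grep finds no match and nothing is printed.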

With sed and cut
echo feature/test-111-test-test| cut -d'-' -f-2 |sed 's/\//-/'
Output
feature-test-111
echo feature/test-111-test-test| cut -d'-' -f-2 |sed 's/\//-/;s/$/-/'
Output
feature-test-111-

You can use the simple regex form of "not something, then that something", which here is [^-]*-, to get all characters other than - up to a -.
This works:
echo 'feature/test-111-test-test' | sed -nE 's/^([^/]*)\/([^-]*-[^-]*-).*/\1-\2/p'
feature-test-111-

Another idea using parameter expansions/substitutions:
s='feature/test-111-test-test'
tail="${s//\//-}" # replace '/' with '-'
# split first field from rest of fields ('-' delimited); do this 3x times
head="${tail%%-*}" # pull first field
tail="${tail#*-}" # drop first field
head="${head}-${tail%%-*}" # pull first field; append to previous field
tail="${tail#*-}" # drop first field
head="${head}-${tail%%-*}-" # pull first field; append to previous fields; add trailing '-'
$ echo "${head}"
feature-test-111-
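The repeated pull/drop steps above can be folded into a loop. The sketch below is an assumption-laden variant: first_fields is a made-up name, it uses tr instead of the bash-only ${s//\//-} so that it also runs in a plain POSIX shell, and it assumes the text really contains at least the requested number of fields.

```shell
# first_fields: hypothetical helper. $1 = text, $2 = number of '-' fields to keep.
# The body runs in a subshell so its variables do not leak into the caller.
first_fields() (
  tail=$(printf '%s' "$1" | tr / -)   # replace every '/' with '-'
  head=
  i=0
  while [ "$i" -lt "$2" ]; do
    head="${head}${tail%%-*}-"        # append the first remaining field plus '-'
    tail="${tail#*-}"                 # drop that field
    i=$((i + 1))
  done
  printf '%s\n' "$head"
)

first_fields 'feature/test-111-test-test' 3
```

This prints feature-test-111-, the same result as the unrolled version.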

A short sed solution, without extended regular expressions:
sed 's|\(.*\)/\([^-]*-[^-]*\).*|\1-\2|'

Parsing Karma Coverage Output in Bash for a Jenkins Job (Scripting)

I'm working with the following output:
=============================== Coverage summary ===============================
Statements : 26.16% ( 1681/6425 )
Branches : 6.89% ( 119/1727 )
Functions : 23.82% ( 390/1637 )
Lines : 26.17% ( 1680/6420 )
================================================================================
I would like to parse the 4 coverage percentage numbers without the percent via REGEX, into a comma separated list.
Any suggestions for a good regex expression for this? Or another good option?

The sed command:
sed -n '/ .*% /{s/.* \(.*\)% .*/\1/;p;}' input.txt | sed ':a;N;$!ba;s/\n/,/g'
gives the output:
26.16,6.89,23.82,26.17
Edit: A better answer, with only a single sed, would be:
sed -n '/ .*% /{s/.* \(.*\)% .*/\1/;H;};${g;s/\n/,/g;s/,//;p;}' input.txt
Explanation:
/ .*% / search for lines with a percentage value (note spaces)
s/.* \(.*\)% .*/\1/ and delete everything except the percentage value
H and then append it to the hold space, prefixed with a newline
$ then for the last line
g get the hold space
s/\n/,/g replace all the newlines with commas
s/,// and delete the initial comma
p and then finally output the result
To harden the regex, you could replace the search for the percentage value .*% with for example [0-9.]*%.
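Applied to the single-sed command, that hardening might look like the sketch below. The wrapper name coverage_csv is made up, and requiring at least one digit or dot before the % is an extra assumption on top of the suggestion.

```shell
# coverage_csv: hypothetical wrapper; reads the Karma summary on stdin.
# Only lines containing " <digits/dots>% " are collected in the hold space;
# on the last line the hold space is fetched, joined with commas and printed.
coverage_csv() {
  sed -n '/ [0-9.][0-9.]*% /{s/.* \([0-9.][0-9.]*\)% .*/\1/;H;};${g;s/\n/,/g;s/,//;p;}'
}

printf '%s\n' \
  '==== Coverage summary ====' \
  'Statements : 26.16% ( 1681/6425 )' \
  'Branches : 6.89% ( 119/1727 )' \
  '====' | coverage_csv
```

For the shortened sample above this prints 26.16,6.89.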

I think this is a grep job. This should help:
$ grep -oE "[0-9]{1,2}\.[0-9]{2}" input.txt | xargs | tr " " ","
Output:
26.16,6.89,23.82,26.17
The input file just contains what you have shown above. Obviously, there are other ways like cat to feed the input to the command.
Explanation:
grep -oE: only show matches using extended regex
xargs: put all results onto a single line
tr " " ",": translate the spaces into commas
This is actually a nice shell tool belt example, I would say.
Taking Joseph Quinsey's consideration into account, the regex can be made more robust with a lookahead that asserts a % sign after the numeric value, using a Perl-compatible RE pattern:
grep -oP "[0-9]{1,2}\.[0-9]{2}(?=%)" input.txt | xargs | tr " " ","

Would you consider using awk? Here's a command you may try:
$ awk 'match($0,/[0-9.]*%/){s=(s=="")?"":s",";s=s substr($0,RSTART,RLENGTH-1)}END{print s}' file
26.16,6.89,23.82,26.17
Brief explanation:
match($0,/[0-9.]*%/): find records that contain a match for the regex [0-9.]*%
s=(s=="")?"":s",": since a comma-separated list is required, append a comma to s before each match except the first one
s=s substr($0,RSTART,RLENGTH-1): append the matched part, without the trailing %, to s

Assuming the item names (Statements, Branches, ...) do not contain whitespaces, how about:
#!/bin/bash
declare -a keys
declare -a values
while read -r line; do
if [[ "$line" =~ ^([^\ ]+)\ *:\ *([0-9.]+)% ]]; then
keys+=(${BASH_REMATCH[1]})
values+=(${BASH_REMATCH[2]})
fi
done < output.txt
ifsback=$IFS # backup IFS
IFS=,
echo "${keys[*]}"
echo "${values[*]}"
IFS=$ifsback # restore IFS
which yields:
Statements,Branches,Functions,Lines
26.16,6.89,23.82,26.17

Yet another option, with perl:
cat the_file | perl -e 'while(<>){/(\d+\.\d+)%/ and $x.="$1,"}chop $x; print $x;'
The code, unrolled and explained:
while(<>){ # Read line by line. Put lines into $_
/(\d+\.\d+)%/ and $x.="$1,"
# Equivalent to:
# if ($_ =~ /(\d+\.\d+)%/) {$x.="$1,"}
# The regex matches "numbers", "dot", "numbers" and "%",
# stores just numbers on $1 (first capturing group)
}
chop $x; # Remove extra ',' and print result
print $x;
Somewhat shorter with an extra sed
cat the_file | perl -ne '/(\d+\.\d+)%/ and print "$1,"'|sed 's/.$//'
Uses "n" parameter which implies while(<>){}. For removing the last ',' we use sed.

Using sed for extracting substring from string

I just started using sed for regex work. I want to extract XXXXXX from *****/XXXXXX>, so I tried the following:
sed -n "/^/*/(\S*\).>$/p"
If I do so I get following error
sed: 1: "/^//(\S).>$/p": invalid command code *
I am not sure what I am missing here.

Try:
$ echo '*****/XXXXXX>' | sed 's|.*/||; s|>.*||'
XXXXXX
The substitute command s|.*/|| removes everything up to the last / in the string. The substitute command s|>.*|| removes everything from the first > in the string that remains to the end of the line.
Or:
$ echo '*****/XXXXXX>' | sed -E 's|.*/(.*)>|\1|'
XXXXXX
The substitute command s|.*/(.*)>|\1| captures whatever is between the last / and the last > and saves it in group 1. That is then replaced with group 1, \1.
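The same two steps (drop everything through the last /, then drop everything from the first >) can also be done without sed. The following is a sketch using only POSIX parameter expansion; the function name between_slash_gt is made up for illustration.

```shell
# between_slash_gt: hypothetical helper using only POSIX parameter expansion.
between_slash_gt() {
  s=${1##*/}    # drop everything up to and including the last '/'
  s=${s%%>*}    # drop everything from the first remaining '>' to the end
  printf '%s\n' "$s"
}

between_slash_gt '*****/XXXXXX>'
```

This prints XXXXXX and avoids starting an external process, which may matter inside loops.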

In my opinion, awk handles this task better. With -F you can specify multiple delimiters such as "/" and ">":
echo "*****/XXXXXX>" | awk -F'/|>' '{print $2}'
Of course you could use sed, but it's harder to read. First I remove the first part (up to the "/") and then the second one (the ">"):
echo "*****/XXXXXX>" | sed -e 's/.*[/]//' -e 's/>//'
Both produce the expected result: XXXXXX.

with grep if you have pcre option
$ echo '*****/XXXXXX>' | grep -oP '/\K[^>]+'
XXXXXX
/\K matches the / but \K drops it from the reported match (similar in effect to a positive lookbehind)
[^>]+ characters other than >

echo '*****/XXXXXX>' |sed 's/^.*\/\|>$//g'
XXXXXX
Match either everything from the start of the line up to the last /, or a > at the end of the line, and replace whichever is found with nothing. (The \| alternation in a basic regex is a GNU sed extension.)

shell regex: Extract prices

Given the following list of prices, I am trying to figure out how to normalize/extract only the digits.
INPUT            DESIRED_OUTPUT
CA$1399.00       1399.00
$1399.11         1399.11
$1,399.22<       1399.22
Z$1 399.33       1399.33
$1399.44#        1399.44
C$ 1399.55       1399.55
1,399.66         1399.66
1399.77          1399.77
,1399.88         1399.88
25 1399.88       1399.88
399.99           399.99
88.88 99.99      99.99 (if there are 2 or more matches on one line, only the last one matters)
.1399.88         DO NOT MATCH (not a price; too many ".")
666.000          DO NOT MATCH (not a price: too many 0's)
I suppose a good place to begin is with what they all have in common:
Prices always contain .NN, but never contain .NNN
Upon further inspection, other rules become apparent:
.NN must be preceded by one or more digits.
NNN.NN can be preceded by a comma, a space, or a digit, but nothing else.
Anything following .NN and preceding *N.NN marks the end of the match.
Finally, the regex needs to consider commas in things like 1,399.66 (1399.66) to determine whether it is a price, but then strip them. 1, 399.66, for instance does not equal 1399.66: it should be 399.66.
I am looking at sed, grep, and awk for a portable and efficient solution. How should I go about approaching this problem?
I found a similar question, but I have no idea how to try the following regex with sed:
^\d+(,\d{1,2})?$
EDIT: Yes, my input format can be a little weird, because it is the result of concatenating scraped pages.

You can use the following shell script:
#!/bin/sh
# [0-9] is used instead of \d so the script also works with greps that lack \d in basic regexes
grep -v '\.[0-9]\+\.' | # get rid of lines with multiple dots within the same number
grep -v '\.[0-9][0-9][0-9]\+' | # get rid of lines with more than 2 digits after .
sed -e 's/\(.*\.[0-9][0-9]\).*$/\1/' | # remove anything after last .NN
sed -e 's/^.* \([0-9][0-9][0-9][0-9]\)\./\1./' | # "* NNNN." => "NNNN."
sed -e 's/^.* \([0-9][0-9]\)\./\1./' | # "* NN." => "NN."
sed -e 's/^.* \([0-9]\)\./\1./' | # "* N." => "N."
sed -e 's/^\(.*\)[ ,]\(\([0-9]\)\{3,\}\)\./\1\2./g' | # "*,NNN." or "* NNN." => "*NNN."
sed -e 's/^\(.*\)[ ,]\(\([0-9]\)\{6,\}\)\./\1\2./g' | # "*,NNNNNN." or "* NNNNNN." => "*NNNNNN."
sed -e 's/^\(.*\)[ ,]\(\([0-9]\)\{9,\}\)\./\1\2./g' | # "*,NNNNNNNNN." or "* NNNNNNNNN." => "*NNNNNNNNN."
grep -o '[0-9]\+\.[0-9][0-9]' # print only the price
In case of numbers that are separated by space or , in groups of 3 digits, this solution works up to 9 digits before the .. If you need to extract bigger prices, just add more lines, increasing the number in the regex by 3. ;-)
Put it in a file called extract_prices, make it executable (chmod +x extract_prices) and run it: ./extract_prices < my_list.txt
Tested on OS X using the following input:
CA$1399.00
$1399.11
$1,399.22<
Z$1 399.33
Z$12 777 666.34 # <-- additional monster price
$1399.44#
C$ 1399.55
1,399.66
1399.77
,1399.88
25 1399.88
399.99
88.88 99.99
.1399.88
666.000
Which generates the following output:
1399.00
1399.11
1399.22
1399.33
12777666.34
1399.44
1399.55
1399.66
1399.77
1399.88
1399.88
399.99
99.99
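The two rejection rules from the question (no extra "." glued to the number, no third digit after the decimal point) can also be checked with one extended regex before any extraction is done. This is only a sketch of a pre-filter, not a full price extractor; the name looks_like_price is hypothetical, and treating any adjacent "." as disqualifying is an assumption borrowed from the first grep -v in the script above.

```shell
# looks_like_price: hypothetical filter; keeps lines that contain a candidate price.
# A candidate is digits + "." + exactly two digits, with no digit or "." directly
# before it and no digit or "." directly after it.
looks_like_price() {
  grep -E '(^|[^0-9.])[0-9]+\.[0-9]{2}([^0-9.]|$)'
}

printf '%s\n' 'CA$1399.00' '.1399.88' '666.000' '88.88 99.99' | looks_like_price
```

Of the four sample lines, only CA$1399.00 and 88.88 99.99 pass the filter.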

A solution with awk that splits on all characters that are not numbers or decimal point and prints the last field that matches a price. The leading sed script handles the exception case #3 where we have a space instead of a comma marking the thousands spot.
sed -e 's/ / x /g; :a; s/\(\$[1-9][0-9]*\) /\1/; ta' | awk -F '[^0-9.]' -v p='[0-9]+\\.[0-9][0-9]' '$0 ~ p { gsub(/,/, ""); for (i=NF; i>0; i--) if ($i ~ "^" p "$") { print $i; next } }'
Notes:
1) The sed script uses a test to iterate; therefore, it can handle millions, billions, etc.
2) The sed script also handles the multiple space condition such that $1[ ][ ]1000.00 does not become $11000.00 in the end.
3) Commas are simply stripped/ignored... if there is an issue with comma separation of numbers, the issue can be resolved by getting rid of the gsub in the awk script and fixing the filter in the leading sed script
Here is a more complicated version that builds on the idea in note #3 to make commas and spaces part of the number only if the space or comma is at a thousands separator.
sed -e ':a; s/\(\$[1-9][0-9]*\) \([0-9][0-9][0-9][ .]\)/\1\2/; ta; :b; s/\([1-9][0-9]*\),\([0-9][0-9][0-9][,.]\)/\1\2/; tb;' | awk -F '[^0-9.]' -v p='[0-9]+\\.[0-9][0-9]' '$0 ~ p { for (i=NF; i>0; i--) if ($i ~ "^" p "$") { print $i; next } }'
If chance of success is high on each line, then getting rid of "p" would make for a more efficient script.
sed -e ':a; s/\(\$[1-9][0-9]*\) \([0-9][0-9][0-9][ .]\)/\1\2/; ta; :b; s/\([1-9][0-9]*\),\([0-9][0-9][0-9][,.]\)/\1\2/; tb;' | awk -F '[^0-9.]' '{ for (i=NF; i>0; i--) if ($i ~ /^[0-9]+\.[0-9][0-9]$/) { print $i; next } }'
Finally, for safety, we can check in the sed filter to make sure we have a valid space or comma delimited number before we do either substitution.
sed -e ':a; /\$[1-9][0-9]\?[0-9]\?\( [0-9][0-9][0-9]\)\+\.[0-9][0-9]/ s/\(\$[1-9][0-9]*\) \([0-9][0-9][0-9][ .]\)/\1\2/; ta; :b; /[1-9][0-9]\?[0-9]\?\(,[0-9][0-9][0-9]\)\+\.[0-9][0-9]/ s/\([1-9][0-9]*\),\([0-9][0-9][0-9][,.]\)/\1\2/; tb;' | awk -F '[^0-9.]' '{ for (i=NF; i>0; i--) if ($i ~ /^[0-9]+\.[0-9][0-9]$/) { print $i; next } }'

This might work for you (GNU sed):
sed -r '/\n/!s/([^0-9]*\b(([0-9])[ ,]([0-9]{3})|([0-9]+))(\.[0-9]{2})\b)+/\n\3\4\5\6\n/;/^[0-9]+\.[0-9]{2}\b/P;D' file
This works with the data provided but some of the specification is a bit sketchy.

How to match and keep the first number in a line using sed?

Question
Let's say I have one line of text with a number placed somewhere (it could be at the beginning, in the middle or at the end of the line).
How to match and keep the first number found in a line using sed?
Minimal example
Here is my attempt (following this page of a tutorial on regular expressions) and the output for different positions of the number:
$echo "SomeText 123SomeText" | sed 's:.*\([0-9][0-9]*\).*:\1:'
3
$echo "123SomeText" | sed 's:.*\([0-9][0-9]*\).*:\1:'
3
$echo "SomeText 123" | sed 's:.*\([0-9][0-9]*\).*:\1:'
3
As you can see, only the last digit is kept in the process, whereas the desired output is 123...

Using sed:
echo "SomeText 123SomeText 456" | sed -r 's/^[^0-9]*([0-9]+).*$/\1/'
123
You can also do this in gnu awk (the third argument of gensub selects which match to replace, here the first):
echo "SomeText 123SomeText 456" | awk '{print gensub(/^[^0-9]*([0-9]+).*$/, "\\1", 1)}'
123
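The reason the original attempt keeps only the last digit is that the leading .* is greedy: it swallows as much as it can, leaving just one digit for the capture group. A small side-by-side sketch (both function names are made up for illustration):

```shell
# Both helpers are hypothetical names; each takes one line of text as "$1".
# Greedy version from the question: .* consumes everything up to the last digit.
first_num_greedy() { printf '%s\n' "$1" | sed 's:.*\([0-9][0-9]*\).*:\1:'; }
# Fixed version: [^0-9]* can only skip non-digits, so the group gets the whole first number.
first_num_fixed()  { printf '%s\n' "$1" | sed 's:^[^0-9]*\([0-9][0-9]*\).*:\1:'; }

first_num_greedy 'SomeText 123SomeText'
first_num_fixed 'SomeText 123SomeText'
```

The first call prints 3 and the second prints 123. A line without any digit is passed through unchanged by both versions; add -n and the p flag if such lines should be dropped.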

To complement the sed solutions, here's an awk alternative (assuming that the goal is to extract the 1st number on each line, if any (i.e., ignore lines without any numbers)):
awk -F'[^0-9]*' '/[0-9]/ { print ($1 != "" ? $1 : $2) }'
-F'[^0-9]*' defines any sequence of non-digit chars. (including the empty string) as the field separator; awk automatically breaks each input line into fields based on that separator, with $1 representing the first field, $2 the second, and so on.
/[0-9]/ is a pattern (condition) that ensures that output is only produced for lines that contain at least one digit, via its associated action (the {...} block) - in other words: lines containing NO number at all are ignored.
{ print ($1!="" ? $1 : $2) } prints the 1st field if it is nonempty, otherwise the 2nd one. Rationale: if the line starts with a number, the 1st field contains the 1st number on the line (because the line starts with a field rather than a separator); otherwise, it is the 2nd field that contains the 1st number (because the line starts with a separator).

You can also use grep, which is ideally suited to this task. sed is a Stream EDitor, which is only going to indirectly give you what you want. With grep, you only have to specify the part of the line you want.
$ cat file.txt
SomeText 123SomeText
123SomeText
SomeText 123
$ grep -o '[0-9]\+' file.txt
123
123
123
grep -o prints only the matching parts of a line, each on a separate line. The pattern is simple: one or more digits.
If your version of grep is compatible with the -P switch, you can use Perl-style regular expressions and make the command even shorter:
$ grep -Po '\d+' file.txt
123
123
123
Again, this matches one or more digits.
Using grep is a lot simpler and has the advantage that if the line doesn't match, nothing is printed:
$ echo "no number" | grep -Po '\d+' # no output
$ echo "yes 123number" | grep -Po '\d+'
123
edit
As pointed out in the comments, one possible problem is that this won't only print the first matching number on the line. If the line contains more than one number, they will all be printed. As far as I'm aware, this can't be done using grep -o.
In that case, I'd go with perl:
perl -lne 'print $1 if /.*?(\d+).*/'
This uses lazy matching (the question mark) so only non-digit characters are consumed by the .* at the start of the pattern. The $1 is a back reference, like \1 in sed. If there are more than one number on the line, this only prints the first. If there aren't any at all, it doesn't print anything:
$ echo "no number" | perl -ne 'print "$1\n" if /.*?(\d+).*/'
$ echo "yes123number456" | perl -lne 'print $1 if /.*?(\d+).*/'
123
If for some reason you still really want to use sed, you can do this:
sed -n 's/^[^0-9]*\([0-9]\{1,\}\).*$/\1/p'
Unlike the other answers, this is compatible with all versions of sed and only prints lines that contain a match.

Try this sed command,
$echo "SomeText 123SomeText" | sed -r '/[^0-9]*([0-9][0-9]*)[^0-9]*/ s//\1 /g'
123
Another example,
$ echo "SomeText 123SomeText 456" | sed -r '/[^0-9]*([0-9][0-9]*)[^0-9]*/ s//\1 /g'
123 456
It prints all the numbers found on each line, separated by spaces.
