Unix pattern datetime match - regex

I want to edit this line:
1987,4,12,31,4,1987-12-31 00:00:00.0000000,UA,19977,UA,,631,12197,1219701,31703,HPN,White Plains, NY,NY,36,New York,22,13930,1393001,30977,ORD,Chicago\, IL,IL,17,Illinois,41,756,802,483.2,6,6,0,0,0700-0759,,,,,914,938,600.8,24,24,1,1,0900-0959,0,,0,138,156,,1,738,3,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,US1NJBG0005,US1ILCK0027,,,,,,,,,,,,,1987-12-31 08:09:12.0000000,519494350
and i want the output to be :
1987,4,12,31,4,1987-12-31 00:00:00.000,UA,19977,UA,,631,12197,1219701,31703,HPN,White
Plains, NY,NY,36,New York,22,13930,1393001,30977,ORD,Chicago\, IL,IL,17,Illinois,41,756,802,483.2,6,6,0,0,0700-0759,,,,,914,938,600.8,24,24,1,1,0900-0959,0,,0,138,156,,1,738,3,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,US1NJBG0005,US1ILCK0027,,,,,,,,,,,,,1987-12-31 08:09:12.000,519494350
I want to find each pattern of: ****-**-** **:**:**.0000000
and erase the last 4 digits ( 0000 ) so I get ****-**-** **:**:**.000.
If its helpful this date format is in the 6th columns and the n-1 columns.

To get the value of the 6th column and erase the last four digits you can use:
awk -F, '{print substr($6, 0, length($6)-4) }'
Similarly, the N-1 column can be reached by:
awk -F, '{print substr( $(NF-1), 0, length($(NF-1))-4) }'
Edit:
To only replace the values in the columns, but still print everything use:
awk 'BEGIN{ FS=","; OFS=","}
{ $6=substr($6, 0, length($6)-4);
$(NF-1)=substr( $(NF-1), 0,length($(NF-1))-4);
print $0}'

Awk based solution
Nicely formatted, portable script:
#!/usr/bin/awk -f
BEGIN {
FS = "," # input: fields are separated by ,
OFS = "," # output: fields are separated by ,
}
{
sub(/[0-9][0-9][0-9][0-9]$/, "", $6) # remove last 4 digits from the 6th column
sub(/[0-9][0-9][0-9][0-9]$/, "", $(NF-1)) # remove last 4 digits from the n-1 column
print
}
One-line, less portable version using gawk:
gawk --re-interval -F , -v OFS=, '{sub("[0-9]{4}$", "", $6); sub("[0-9]{4}$", "", $(NF-1)); print}'
N.B. The regular expression engine of the traditional awk doesn't support the {n} repetition operator, so gawk version 3 or older needs to be run with --re-interval. For other flavors of awk e.g. nawk, you need to explicitly repeat the regular expression as in the portable longer script from above.
sed based solution
sed -r 's/^(([^,]*,){5})([^,]+)[0-9]{4},(([^,]*,)*)([^,]+)[0-9]{4}(,[^,]*)$/\1\3\4\6\7/'
(tested with GNU sed-4.2.2-6)

You could try this GNU sed command also,
$ sed -r 's/^.*,([^,]*)....,.*$/\1/g' file
1987-12-31 08:09:12.000
If you want just replacing then try this,
$ sed -r 's/^(.*,)([^,]*)....(,.*)$/\1\2\3/g' file
1987,4,12,31,4,1987-12-31 00:00:00.0000000,UA,19977,UA,,631,12197,1219701,31703,HPN,White Plains, NY,NY,36,New York,22,13930,1393001,30977,ORD,Chicago\, IL,IL,17,Illinois,41,756,802,483.2,6,6,0,0,0700-0759,,,,,914,938,600.8,24,24,1,1,0900-0959,0,,0,138,156,,1,738,3,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,US1NJBG0005,US1ILCK0027,,,,,,,,,,,,,1987-12-31 08:09:12.000,519494350
I think you want the the output to be like this,
$ grep -oP '[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}\....' file
1987-12-31 00:00:00.000
1987-12-31 08:09:12.000
Update:
$ echo '1987,4,12,31,4,1987-12-31 00:00:00.0000000,UA,19977,UA,,631,12197,1219701,31703,HPN,White Plains, NY,NY,36,New York,22,13930,1393001,30977,ORD,Chicago\, IL,IL,17,Illinois,41,756,802,483.2,6,6,0,0,0700-0759,,,,,914,938,600.8,24,24,1,1,0900-0959,0,,0,138,156,,1,738,3,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,US1NJBG0005,US1ILCK0027,,,,,,,,,,,,,1987-12-31 08:09:12.0000000,519494350' | sed -r 's/([0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}\....)..../\1/g'
1987,4,12,31,4,1987-12-31 00:00:00.000,UA,19977,UA,,631,12197,1219701,31703,HPN,White Plains, NY,NY,36,New York,22,13930,1393001,30977,ORD,Chicago\, IL,IL,17,Illinois,41,756,802,483.2,6,6,0,0,0700-0759,,,,,914,938,600.8,24,24,1,1,0900-0959,0,,0,138,156,,1,738,3,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,US1NJBG0005,US1ILCK0027,,,,,,,,,,,,,1987-12-31 08:09:12.000,519494350

Here's a solution in Perl.
Update - Edited to output the full CSV line with the timestamp replaced with the truncated one
Update 2 - Update both timestamp columns, not just the first one
#!/usr/bin/env perl
use strict;
use warnings;
use feature 'say';
use Text::CSV;
my $CSV = Text::CSV->new();
while (my $line = readline(STDIN)) {
$CSV->parse($line) or die "Unable to parse line '$line'";
my #fields = $CSV->fields();
for my $f (#fields) {
$f =~ s/
^ # start of string
( # start capture to $1
\d{4} - # year
\d{2} - # month
\d{2} \s+ # day
\d{2} : # hour
\d{2} : # minute
\d{2} [.] # second
\d{3} # milisecond
) # end capture to $1
\d{4} # unwanted sub-second precision
$ # end of string
/$1/gmsx;
}
$CSV->combine(#fields);
say $CSV->string();
}
For example:
alex#yuzu:~$ cat input.txt
1987,4,12,31,4,1987-12-31 00:00:00.0000000,UA,19977,UA,,631,12197,1219701,31703,HPN,White Plains, NY,NY,36,New York,22,13930,1393001,30977,ORD,Chicago\, IL,IL,17,Illinois,41,756,802,483.2,6,6,0,0,0700-0759,,,,,914,938,600.8,24,24,1,1,0900-0959,0,,0,138,156,,1,738,3,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,US1NJBG0005,US1ILCK0027,,,,,,,,,,,,,1987-12-31 08:09:12.0000000,519494350
alex#yuzu:~$ ./csv.pl < input.txt
1987,4,12,31,4,"1987-12-31 00:00:00.000",UA,19977,UA,,631,12197,1219701,31703,HPN,"White Plains"," NY",NY,36,"New York",22,13930,1393001,30977,ORD,Chicago\," IL",IL,17,Illinois,41,756,802,483.2,6,6,0,0,0700-0759,,,,,914,938,600.8,24,24,1,1,0900-0959,0,,0,138,156,,1,738,3,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,US1NJBG0005,US1ILCK0027,,,,,,,,,,,,,"1987-12-31 08:09:12.000",519494350
On a Debian-like system such as Ubuntu you should already have Perl, and you can install Text::CSV with:
$ sudo apt-get install libtext-csv-perl

Related

sed - get only text without extension

How do I remove the extension in this SED statement?
Through
sed 's/.* - //'
File content
2021-04-21_#fluffyban_6953588770591509765.mp4 - Filename.mp4
Actual
Filename.mp4
Desired
Filename
With your shown samples only. This could be done with simple codes in awk,sed and perl as follows.
1st solution: Using sed, perform simple substitutions and you will get desired output.
sed 's/.*- //;s/\.mp4$//' Input_file
2nd solution: Using awk its more simpler, creating different field separator and just print appropriate 2nd last column.
awk -F'- |.mp4' '{print $(NF-1)}' Input_file
3rd solution: Using substitution method in awk to get the required value as per OP's requirement.
awk '{gsub(/.*- |\.mp4$/,"")} 1' Input_file
4th solution: With perl one liner we could grab the appropriate needed value by setting field separators as dash spaces and .mp4 as follows:
perl -a -F'-\s+|\.mp4' -ne 'print "$F[$#F-1]\n";' Input_file
The Bash way (which works in most similar shells such us zsh,sh,ksh) is:
fn="2021-04-21_#fluffyban_6953588770591509765.mp4 - Filename.mp4"
base=${fn%.*}
ext=${fn#$base.}
echo "$base"
echo "$ext"
Prints:
2021-04-21_#fluffyban_6953588770591509765.mp4 - Filename
mp4
You can use
#!/bin/bash
s='2021-04-21_#fluffyban_6953588770591509765.mp4 - Filename.mp4'
sed -n 's/.* - \([^.]*\).*/\1/p' <<< "$s"
# => Filename
See the online demo.
Details:
-n - suppress default line output
s/ - substitute found pattern
.* - \([^.]*\).* - any text, space, -, space, then any zero or more chars other than a dot captured into Group 1, and then any text
/\1/ - replace found matches with Group 1 value
p - print the result of the substitution.
Using gnu awk you can also use a capture group to get the filename
match($0, /.* - ([^.]+)\.mp4$/, a) {print a[1]}' file
Regex explanation
.* - Match the last occurrence of -
( Capture group 1 (Referred to by a[1] in the awk example)
[^.]+ Match 1+ times any char except a dot
) Close group 1
\.mp4$ Match .mp4 at the end of the string
Awk explanation
awk '
match($0, /.* - ([^.]+)\.mp4$/, a) { # Test if the line using $0 matches the pattern
print a[1] # Print the value of group 1
}
' file
Yet another awk:
awk '{sub(/\.[^.]+$/, ""); print $NF}' file
Filename
gawk/mawk/mawk2 'BEGIN { FS = "( \- |[.][^. ]+$)"
} NF > 2 { print $(NF-1) }'
no substr(), index(), match(), or sub() needed. If you're VERY certain " - " can only occur once, then
awk 'BEGIN { FS = "(^.* \- |[.][^. ]+$)"; OFS = "" } —-NF'

How to extract text between first 2 dashes in the string using sed or grep in shell

I have the string like this feature/test-111-test-test.
I need to extract string till the second dash and change forward slash to dash as well.
I have to do it in Makefile using shell syntax and there for me doesn't work some regular expression which can help or this case
Finally I have to get smth like this:
input - feature/test-111-test-test
output - feature-test-111- or at least feature-test-111
feature/test-111-test-test | grep -oP '\A(?:[^-]++-??){2}' | sed -e 's/\//-/g')
But grep -oP doesn't work in my case. This regexp doesn't work as well - (.*?-.*?)-.*.
Another sed solution using a capture group and regex/pattern iteration (same thing Socowi used):
$ s='feature/test-111-test-test'
$ sed -E 's/\//-/;s/^(([^-]*-){3}).*$/\1/' <<< "${s}"
feature-test-111-
Where:
-E - enable extended regex support
s/\//-/ - replace / with -
s/^....*$/ - match start and end of input line
(([^-]-){3}) - capture group #1 that consists of 3 sets of anything not - followed by -
\1 - print just the capture group #1 (this will discard everything else on the line that's not part of the capture group)
To store the result in a variable:
$ url=$(sed -E 's/\//-/;s/^(([^-]*-){3}).*$/\1/' <<< "${s}")
$ echo $url
feature-test-111-
You can use awk keeping in mind that in Makefile the $ char in awk command must be doubled:
url=$(shell echo 'feature/test-111-test-test' | awk -F'-' '{gsub(/\//, "-", $$1);print $$1"-"$$2"-"}')
echo "$url"
# => feature-test-111-
See the online demo. Here, -F'-' sets the field delimiter as -, gsub(/\//, "-", $1) replaces / with - in Field 1 and print $1"-"$2"-" prints the value of --separated Field 1 and 2.
Or, with a regex as a field delimiter:
url=$(shell echo 'feature/test-111-test-test' | awk -F'[-/]' '{print $$1"-"$$2"-"$$3"-"}')
echo "$url"
# => feature-test-111-
The -F'[-/]' option sets the field separator to - and /.
The '{print $1"-"$2"-"$3"-"}' part prints the first, second and third value with a separating hyphen.
See the online demo.
To get the nth occurrence of a character C you don't need fancy perl regexes. Instead, build a regex of the form "(anything that isn't C, then C) for n times":
grep -Eo '([^-]*-){2}' | tr / -
With sed and cut
echo feature/test-111-test-test| cut -d'-' -f-2 |sed 's/\//-/'
Output
feature-test-111
echo feature/test-111-test-test| cut -d'-' -f-2 |sed 's/\//-/;s/$/-/'
Output
feature-test-111-
You can use the simple BRE regex form of not something then that something which is [^-]*- to get all characters other than - up to a -.
This works:
echo 'feature/test-111-test-test' | sed -nE 's/^([^/]*)\/([^-]*-[^-]*-).*/\1-\2/p'
feature-test-111-
Another idea using parameter expansions/substitutions:
s='feature/test-111-test-test'
tail="${s//\//-}" # replace '/' with '-'
# split first field from rest of fields ('-' delimited); do this 3x times
head="${tail%%-*}" # pull first field
tail="${tail#*-}" # drop first field
head="${head}-${tail%%-*}" # pull first field; append to previous field
tail="${tail#*-}" # drop first field
head="${head}-${tail%%-*}-" # pull first field; append to previous fields; add trailing '-'
$ echo "${head}"
feature-test-111-
A short sed solution, without extended regular expressions:
sed 's|\(.*\)/\([^-]*-[^-]*\).*|\1-\2|'

Parsing Karma Coverage Output in Bash for a Jenkins Job (Scripting)

I'm working with the following output:
=============================== Coverage summary ===============================
Statements : 26.16% ( 1681/6425 )
Branches : 6.89% ( 119/1727 )
Functions : 23.82% ( 390/1637 )
Lines : 26.17% ( 1680/6420 )
================================================================================
I would like to parse the 4 coverage percentage numbers without the percent via REGEX, into a comma separated list.
Any suggestions for a good regex expression for this? Or another good option?
The sed command:
sed -n '/ .*% /{s/.* \(.*\)% .*/\1/;p;}' input.txt | sed ':a;N;$!ba;s/\n/,/g'
gives the output:
26.16,6.89,23.82,26.17
Edit: A better answer, with only a single sed, would be:
sed -n '/ .*% /{s/.* \(.*\)% .*/\1/;H;};${g;s/\n/,/g;s/,//;p;}' input.txt
Explanation:
/ .*% / search for lines with a percentage value (note spaces)
s/.* \(.*\)% .*/\1/ and delete everything except the percentage value
H and then append it to the hold space, prefixed with a newline
$ then for the last line
g get the hold space
s/\n/,/g replace all the newlines with commas
s/,// and delete the initial comma
p and then finally output the result
To harden the regex, you could replace the search for the percentage value .*% with for example [0-9.]*%.
I think this is a grep job. This should help:
$ grep -oE "[0-9]{1,2}\.[0-9]{2}" input.txt | xargs | tr " " ","
Output:
26.16,6.89,23.82,26.17
The input file just contains what you have shown above. Obviously, there are other ways like cat to feed the input to the command.
Explanation:
grep -oE: only show matches using extended regex
xargs: put all results onto a single line
tr " " ",": translate the spaces into commas:
This is actually a nice shell tool belt example, I would say.
Including the consideration of Joseph Quinsey, the regex can be made more robust with a lookahead to assert a % sign after then numeric value using a Perl-compatible RE pattern:
grep -oP "[0-9]{1,2}\.[0-9]{2}(?=%)" input.txt | xargs | tr " " ","
Would you consider to use awk? Here's the command you may try,
$ awk 'match($0,/[0-9.]*%/){s=(s=="")?"":s",";s=s substr($0,RSTART,RLENGTH-1)}END{print s}' file
26.16,6.89,23.82,26.17
Brief explanation,
match($0,/[0-9.]*%/): find the record matched with regex [0-9.]*%
s=(s=="")?"":s",": since comma separated is required, we just need print commas before each matched except the first one.
s=s substr($0,RSTART,RLENGTH-1): print the matched part appended to s
Assuming the item names (Statements, Branches, ...) do not contain whitespaces, how about:
#!/bin/bash
declare -a keys
declare -a vaues
while read -r line; do
if [[ "$line" =~ ^([^\ ]+)\ *:\ *([0-9.]+)% ]]; then
keys+=(${BASH_REMATCH[1]})
values+=(${BASH_REMATCH[2]})
fi
done < output.txt
ifsback=$IFS # backup IFS
IFS=,
echo "${keys[*]}"
echo "${values[*]}"
IFS=$ifsback # restore IFS
which yields:
Statements,Branches,Functions,Lines
26.16,6.89,23.82,26.17
Yet another option, with perl:
cat the_file | perl -e 'while(<>){/(\d+\.\d+)%/ and $x.="$1,"}chop $x; print $x;'
The code, unrolled and explained:
while(<>){ # Read line by line. Put lines into $_
/(\d+\.\d+)%/ and $x.="$1,"
# Equivalent to:
# if ($_ =~ /(\d+\.\d+)%/) {$x.="$1,"}
# The regex matches "numbers", "dot", "numbers" and "%",
# stores just numbers on $1 (first capturing group)
}
chop $x; # Remove extra ',' and print result
print $x;
Somewhat shorter with an extra sed
cat the_file | perl -ne '/(\d+\.\d+)%/ and print "$1,"'|sed 's/.$//'
Uses "n" parameter which implies while(<>){}. For removing the last ',' we use sed.

shell regex: Extract prices

Given the following list of prices, I am trying to figure out how to normalize/extract only the digits.
INPUT DESIRED_OUTPUT
CA$1399.00 1399.00
$1399.11 1399.11
$1,399.22< 1399.22
Z$1 399.33 1399.33
$1399.44# 1399.44
C$ 1399.55 1399.55
1,399.66 1399.66
1399.77 1399.77
,1399.88 1399.88
25 1399.88 1399.88
399.99 399.99
88.88 99.99 99.99 (if >2 matches on one line, only the last one matters)
.1399.88 DO NOT MATCH (not a price; too many ".")
666.000 DO NOT MATCH (not a price: too many 0's)
I suppose it is a good idea to begin is with what they all have in common:
Prices always contain .NN, but never contain .NNN
Upon further inspection, other rules become apparent:
.NN must be preceded by one or more digits.
NNN.NN can be preceded by either ,, , or a simple digit, but nothing else.
Anything following .NN and preceding *N.NN marks the end of the match.
Finally, the regex needs to consider commas in things like 1,399.66 (1399.66) to determine whether it is a price, but then strip them. 1, 399.66, for instance does not equal 1399.66: it should be 399.66.
I am looking at sed, grep, and awk for a portable and efficient solution. How should I go about approaching this problem?
I found a similar question, but I have no idea how to try the following regex with sed:
^\d+(,\d{1,2})?$
EDIT: Yes, my input format is can be a little weird, because it is the result of the concatenation of scraped pages.
You can use the following shell script:
#/bin/sh
grep -v '\.\d\+\.' | # get rid of lines with multiple dots within the same number
grep -v '\.\d\d\d\+' | # get rid of lines with more than 2 digits after .
sed -e 's/\(.*\.[0-9][0-9]\).*$/\1/' | # remove anything after last .NN
sed -e 's/^.* \([0-9][0-9][0-9][0-9]\)\./\1./' | # "* NNNN." => "NNNN."
sed -e 's/^.* \([0-9][0-9]\)\./\1./' | # "* NN." => "NN."
sed -e 's/^.* \([0-9]\)\./\1./' | # "* N." => "N."
sed -e 's/^\(.*\)[ ,]\(\([0-9]\)\{3,\}\)\./\1\2./g' | # "*,NNN." or "* NNN." => "*NNN."
sed -e 's/^\(.*\)[ ,]\(\([0-9]\)\{6,\}\)\./\1\2./g' | # "*,NNNNNN." or "* NNNNNN." => "*NNNNNN."
sed -e 's/^\(.*\)[ ,]\(\([0-9]\)\{9,\}\)\./\1\2./g' | # "*,NNNNNNNNN." or "* NNNNNNNNN." => "*NNNNNNNNN."
grep -o '\d\+\.\d\d' # print only the price
In case of numbers that are separated by space or , in groups of 3 digits, this solution works up to 9 digits before the .. If you need to extract bigger prices, just add more lines, increasing the number in the regex by 3. ;-)
Put it in a file called extract_prices, make it executable (chmod +x extract_prices) and run it: ./extract_prices < my_list.txt
Tested on OS X using the following input:
CA$1399.00
$1399.11
$1,399.22<
Z$1 399.33
Z$12 777 666.34 # <-- additonal monster price
$1399.44#
C$ 1399.55
1,399.66
1399.77
,1399.88
25 1399.88
399.99
88.88 99.99
.1399.88
666.000
Which generates the following output:
1399.00
1399.11
1399.22
1399.33
12777666.34
1399.44
1399.55
1399.66
1399.77
1399.88
1399.88
399.99
99.99
A solution with awk that splits on all characters that are not numbers or decimal point and prints the last field that matches a price. The leading sed script handles the exception case #3 where we have a space instead of a comma marking the thousands spot.
sed -e 's/ / x /g; :a; s/\(\$[1-9][0-9]*\) /\1/; ta' | awk -F '[^0-9.]' -v p='[0-9]+\\.[0-9][0-9]' '$0 ~ p { gsub(/,/, ""); for (i=NF; i>0; i--) if ($i ~ "^" p "$") { print $i; next } }'
Notes:
1) The sed script uses a test to iterate; therefore, it can handle millions, billions, etc.
2) The sed script also handles the multiple space condition such that $1[ ][ ]1000.00 does not become $11000.00 in the end.
3) Commas are simply stripped/ignored... if there is an issue with comma separation of numbers, the issue can be resolved by getting rid of the gsub in the awk script and fixing the filter in the leading sed script
Here is a more complicated version that builds on the idea in note #3 to make commas and spaces part of the number only if the space or comma is at a thousands separator.
sed -e ':a; s/\(\$[1-9][0-9]*\) \([0-9][0-9][0-9][ .]\)/\1\2/; ta; :b; s/\([1-9][0-9]*\),\([0-9][0-9][0-9][,.]\)/\1\2/; tb;' | awk -F '[^0-9.]' -v p='[0-9]+\\.[0-9][0-9]' '$0 ~ p { for (i=NF; i>0; i--) if ($i ~ "^" p "$") { print $i; next } }'
If chance of success is high on each line, then getting rid of "p" would make for a more efficient script.
sed -e ':a; s/\(\$[1-9][0-9]*\) \([0-9][0-9][0-9][ .]\)/\1\2/; ta; :b; s/\([1-9][0-9]*\),\([0-9][0-9][0-9][,.]\)/\1\2/; tb;' | awk -F '[^0-9.]' '{ for (i=NF; i>0; i--) if ($i ~ /^[0-9]+\.[0-9][0-9]$/) { print $i; next } }'
Finally, for safety, we can check in the sed filter to make sure we have a valid space or comma delimited number before we do either substitution.
sed -e ':a; /\$[1-9][0-9]\?[0-9]\?\( [0-9][0-9][0-9]\)\+\.[0-9][0-9]/ s/\(\$[1-9][0-9]*\) \([0-9][0-9][0-9][ .]\)/\1\2/; ta; :b; /[1-9][0-9]\?[0-9]\?\(,[0-9][0-9][0-9]\)\+\.[0-9][0-9]/ s/\([1-9][0-9]*\),\([0-9][0-9][0-9][,.]\)/\1\2/; tb;' | awk -F '[^0-9.]' '{ for (i=NF; i>0; i--) if ($i ~ /^[0-9]+\.[0-9][0-9]$/) { print $i; next } }'
This might work for you (GNU sed):
sed -r '/\n/!s/([^0-9]*\b(([0-9])[ ,]([0-9]{3})|([0-9]+))(\.[0-9]{2})\b)+/\n\3\4\5\6\n/;/^[0-9]+\.[0-9]{2}\b/P;D' file
This works with the data provided but some of the specification is a bit sketchy.

Replace a string in bash script using regex for both linux and AIX

I have a bash script that I copy and run on both linux and AIX servers.
This script gets a "name" parameter which represents a file name, and I need to manipulate this name via regex (the purpose is irrelevant and very hard to explain).
From the name parameter I need to take the beginning until the first "-" character that is followed by a digit, and then concat it with the last "." character until the end of the string.
For example:
name: abcd-efg-1.23.4567-8.jar will become: abcd-efg.jar
name: abc123-abc3.jar will remain: abc123-abc3.jar
name: abc-890.jar will become: abc.jar
I've tried several variations of:
name=$1
regExpr="^(.*?)-\d.*\.(.*?)$/g"
echo $name
echo $(printf ${name} | sed -e $regExpr)
Also I cant use sed -r (seen on some examples) because AIX sed does not support the -r flag.
The last line is the problem of course; I think I need to somehow use $1 + $2 placeholders, but I can't seem to get it right.
How can I change my regex so that it does what I want?
Given the file:
abcd-efg-1.23.4567-8.jar
abc123-abc3.jar
abc-890.jar
This is a way to change the names you give:
$ sed 's/\(.\?\)-[0-9].*\(\.[a-z]*\)$/\1\2/' file
abcd-efg.jar
abc123-abc3.jar
abc.jar
Which is equivalent to (if you could use -r):
$ sed -r 's/(.?)-[0-9].*(\.[a-z]*)$/\1\2/' file
abcd-efg.jar
abc123-abc3.jar
abc.jar
It gets everything up to - + digit and "stores" in \1.
It gets from last . + letters and "stores" in \2.
Finally it prints those blocks back.
Note the extension could also be fetched with the basename builtin or with something like `"${line##*.}".
You could use this:
perl -F'(-(?:\d)|\.)' -ane 'print "$F[0].$F[$#F]"'
It splits the input on any - followed by a digit, or any .. Then it prints the first field, followed by a dot, followed by the last field.
Testing it out:
$ cat file
abcd-efg-1.23.4567-8.jar
abc123-abc3.jar
abc-890.jar
$ perl -F'(-(?:\d)|\.)' -ane 'print "$F[0].$F[$#F]"' file
abcd-efg.jar
abc123-abc3.jar
abc.jar
In sed, you could simply use the following.
#!/bin/sh
STRING=$( cat <<EOF
abcd-efg-1.23.4567-8.jar
abc123-abc3.jar
abc-890.jar
EOF
)
echo "$STRING" | sed 's/-[0-9].*\(\.[^.]\+\)$/\1/'
# abcd-efg.jar
# abc123-abc3.jar
# abc.jar
This matches a hyphen followed by a number and everything after and replaces with the file extension.
Or you may consider using a Perl one-liner:
echo "$STRING" | perl -pe 's/-\d.*(?=\.[^.]+$)//'
# abcd-efg.jar
# abc123-abc3.jar
# abc.jar
when a successful regex match is made, perl will capture whatever is matched in parenthesis ( .. ) as $1, $2, etc.
$ perl -e 'my $arg = $ARGV[0]; $arg =~ /^(.*?)-\d.*\.(.*?)$/; print "$1.$2\n"; ' abc-890.jar
abc.jar