Awk/Perl regular expression to match space with hyphen - regex
I have the following 2 lines in sample.txt
AIA - 1000
AIA Integrations for E-Business Suite - 5544
Now I want to see the following output:
Column1 | Column2
AIA 1000
AIA Integrations for E-Business Suite 5544
I tried:
awk -F "-" sample.txt
It splits on the hyphen "-" inside "E-Business Suite".
How can I make it split on the last hyphen instead of the intermediate ones?
You can use:
awk -F ' - ' -v OFS=';' 'BEGIN{print "Column1", "Column2"} {print $1, $2}' file |
column -s ';' -t
Column1 Column2
AIA 1000
AIA Integrations for E-Business Suite 5544
-F ' - ' uses ' - ' as the input field separator
-v OFS=';' uses ; as the output field separator
column -s ';' -t formats the data as a table, using ; as the delimiter
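If column isn't available, the same split can be formatted with awk's printf alone (a sketch; the 40-character field width is an arbitrary choice, not from the answer above):

```shell
# Split on ' - ' and left-justify column 1 to a fixed width with printf,
# so no external column(1) call is needed.
awk -F' - ' 'BEGIN{printf "%-40s%s\n","Column1","Column2"}
             {printf "%-40s%s\n",$1,$2}' sample.txt
```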
Another example, using split and join:
perl -F- -lane 'print join "\t", reverse pop @F, join "-", @F' sample.txt
I would use perl to guarantee that we are truly catching the last - as the separator, not some other instance of it in the middle of the first field:
perl -wnle '/^(.+) - (.+)$/ or die; print "$1\t$2"' sample.txt
If you want the output to be in fixed width columns, you can use column:
perl -wnle '/^(.+) - (.+)$/ or die; print "$1\t$2"' sample.txt | column -s $'\t' -t
Explanation: The first (.+) in the regex fills the first capture group. Because + is greedy, ^(.+) matches the largest possible substring that still lets the rest of the pattern match, so if there are multiple instances of " - ", all but the last one end up in the first capture group. The final (.+) then captures all the remaining characters into the second capture group.
Related
sed - get only text without extension
How do I remove the extension in this sed statement? I use sed 's/.* - //'.
File content: 2021-04-21_#fluffyban_6953588770591509765.mp4 - Filename.mp4
Actual output: Filename.mp4
Desired output: Filename
With your shown samples only, this can be done with simple code in awk, sed or perl as follows.
1st solution: using sed, perform simple substitutions and you will get the desired output:
sed 's/.*- //;s/\.mp4$//' Input_file
2nd solution: with awk it's even simpler; create a compound field separator and just print the appropriate 2nd-last column:
awk -F'- |.mp4' '{print $(NF-1)}' Input_file
3rd solution: using the substitution method in awk to get the required value as per OP's requirement:
awk '{gsub(/.*- |\.mp4$/,"")} 1' Input_file
4th solution: with a perl one-liner we can grab the needed value by setting the field separators to dash-plus-spaces and .mp4:
perl -a -F'-\s+|\.mp4' -ne 'print "$F[$#F-1]\n";' Input_file
The Bash way (which works in most similar shells such as zsh, sh, ksh) is:
fn="2021-04-21_#fluffyban_6953588770591509765.mp4 - Filename.mp4"
base=${fn%.*}
ext=${fn#$base.}
echo "$base"
echo "$ext"
Prints:
2021-04-21_#fluffyban_6953588770591509765.mp4 - Filename
mp4
You can use:
#!/bin/bash
s='2021-04-21_#fluffyban_6953588770591509765.mp4 - Filename.mp4'
sed -n 's/.* - \([^.]*\).*/\1/p' <<< "$s"
# => Filename
Details:
-n - suppress the default line output
s/ - substitute the found pattern
.* - \([^.]*\).* - any text, space, -, space, then zero or more chars other than a dot captured into Group 1, and then any text
/\1/ - replace found matches with the Group 1 value
p - print the result of the substitution.
Using gnu awk you can also use a capture group to get the filename:
awk 'match($0, /.* - ([^.]+)\.mp4$/, a) {print a[1]}' file
Regex explanation:
.* -    Match up to the last occurrence of " - "
(       Capture group 1 (referred to by a[1] in the awk example)
[^.]+   Match 1+ times any char except a dot
)       Close group 1
\.mp4$  Match .mp4 at the end of the string
Awk explanation:
awk '
match($0, /.* - ([^.]+)\.mp4$/, a) {   # Test if the line ($0) matches the pattern
  print a[1]                           # Print the value of group 1
}
' file
Yet another awk:
awk '{sub(/\.[^.]+$/, ""); print $NF}' file
Output:
Filename
gawk/mawk/mawk2 'BEGIN { FS = "( \- |[.][^. ]+$)" } NF > 2 { print $(NF-1) }'
No substr(), index(), match(), or sub() needed. If you're VERY certain " - " can only occur once, then:
awk 'BEGIN { FS = "(^.* \- |[.][^. ]+$)"; OFS = "" } --NF'
How to extract text between first 2 dashes in the string using sed or grep in shell
I have a string like feature/test-111-test-test. I need to extract the string up to the second dash and change the forward slash to a dash as well. I have to do it in a Makefile using shell syntax, and the regular expressions that might help don't work for me there. Finally I have to get something like this:
input - feature/test-111-test-test
output - feature-test-111- or at least feature-test-111
I tried:
echo feature/test-111-test-test | grep -oP '\A(?:[^-]++-??){2}' | sed -e 's/\//-/g'
But grep -oP doesn't work in my case. This regexp doesn't work either - (.*?-.*?)-.*.
Another sed solution using a capture group and regex/pattern iteration (same approach Socowi used):
$ s='feature/test-111-test-test'
$ sed -E 's/\//-/;s/^(([^-]*-){3}).*$/\1/' <<< "${s}"
feature-test-111-
Where:
-E - enable extended regex support
s/\//-/ - replace / with -
s/^....*$/ - match the start and end of the input line
(([^-]*-){3}) - capture group #1 consisting of 3 sets of anything-that-is-not-a-dash followed by -
\1 - keep just capture group #1 (this discards everything else on the line that's not part of the capture group)
To store the result in a variable:
$ url=$(sed -E 's/\//-/;s/^(([^-]*-){3}).*$/\1/' <<< "${s}")
$ echo $url
feature-test-111-
You can use awk, keeping in mind that in a Makefile the $ char in the awk command must be doubled:
url=$(shell echo 'feature/test-111-test-test' | awk -F'-' '{gsub(/\//, "-", $$1);print $$1"-"$$2"-"}')
echo "$url"
# => feature-test-111-
Here, -F'-' sets the field delimiter to -, gsub(/\//, "-", $1) replaces / with - in Field 1, and print $1"-"$2"-" prints the --separated Field 1 and 2 values.
Or, with a regex as the field delimiter:
url=$(shell echo 'feature/test-111-test-test' | awk -F'[-/]' '{print $$1"-"$$2"-"$$3"-"}')
echo "$url"
# => feature-test-111-
The -F'[-/]' option sets the field separator to - and /. The '{print $1"-"$2"-"$3"-"}' part prints the first, second and third values with a separating hyphen.
To get the nth occurrence of a character C you don't need fancy perl regexes. Instead, build a regex of the form "(anything that isn't C, then C) for n times": grep -Eo '([^-]*-){2}' | tr / -
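Applied to the sample string, with tr handling the slash-to-dash conversion afterwards:

```shell
# Two repetitions of "non-dashes followed by a dash", then translate / to -
echo 'feature/test-111-test-test' | grep -Eo '([^-]*-){2}' | tr / -
# Output: feature-test-111-
```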
With sed and cut:
echo feature/test-111-test-test | cut -d'-' -f-2 | sed 's/\//-/'
Output: feature-test-111
echo feature/test-111-test-test | cut -d'-' -f-2 | sed 's/\//-/;s/$/-/'
Output: feature-test-111-
You can use the simple regex idiom of "not something, then that something", which here is [^-]*-, to match all characters other than - up to a -. This works:
echo 'feature/test-111-test-test' | sed -nE 's/^([^/]*)\/([^-]*-[^-]*-).*/\1-\2/p'
feature-test-111-
Another idea using parameter expansions/substitutions:
s='feature/test-111-test-test'
tail="${s//\//-}"            # replace '/' with '-'
# split first field from rest of fields ('-' delimited); do this 3x
head="${tail%%-*}"           # pull first field
tail="${tail#*-}"            # drop first field
head="${head}-${tail%%-*}"   # pull first field; append to previous field
tail="${tail#*-}"            # drop first field
head="${head}-${tail%%-*}-"  # pull first field; append to previous fields; add trailing '-'
$ echo "${head}"
feature-test-111-
A short sed solution, without extended regular expressions: sed 's|\(.*\)/\([^-]*-[^-]*\).*|\1-\2|'
How do I grep and filter logs for date and a particular field
My logs will have some lines in the below format:
test/blah.log.32:30141:2019-08-12 16:40:09,839 com.test.service.testService [P1-7XX8] INFO testMethod(): userId: 12345XX, someOtherId: 12345XXXCCCDDD, blah, blah...., _someType=V, blah, blah, blah....
How do I grep for lines that contain the text _someType=V and then extract the date and userId out of them? My final result should be:
2019-08-12 16:40:09,839-12345XX
I could grep with grep -Hn '_someType=V' but am failing to filter out the data.
You can pipe the output of your grep command into sed to transform the whole line into the two relevant pieces of data:
grep '_someType=V' | sed -E 's/^([^ ]* [^ ]*).*userId: ([^ ]*).*/\1-\2/'
The sed substitution command captures the first two "words" of the line, corresponding to the date, into a first capturing group, and the word that follows userId: into a second one, matching the whole line and replacing it with the content of the two capturing groups separated by a dash.
If the order between _someType=V and userId is always the same, you can do without the grep; for instance, if _someType=V always appears after the userId:
sed -nE 's/^([^ ]* [^ ]*).*userId: ([^ ]*).*_someType=V.*/\1-\2/p'
You may use awk: awk -v s='userId: ' '/_someType=V/ && match($0, s "[^, ]+") { print $1, $2 "-" substr($0, RSTART+length(s), RLENGTH-length(s)) }' file 2019-08-12 16:40:09,839-12345XX
Parsing Karma Coverage Output in Bash for a Jenkins Job (Scripting)
I'm working with the following output:
=============================== Coverage summary ===============================
Statements : 26.16% ( 1681/6425 )
Branches : 6.89% ( 119/1727 )
Functions : 23.82% ( 390/1637 )
Lines : 26.17% ( 1680/6420 )
================================================================================
I would like to parse the 4 coverage percentage numbers, without the percent sign, via regex into a comma-separated list. Any suggestions for a good regex expression for this? Or another good option?
The sed command:
sed -n '/ .*% /{s/.* \(.*\)% .*/\1/;p;}' input.txt | sed ':a;N;$!ba;s/\n/,/g'
gives the output:
26.16,6.89,23.82,26.17
Edit: a better answer, with only a single sed, would be:
sed -n '/ .*% /{s/.* \(.*\)% .*/\1/;H;};${g;s/\n/,/g;s/,//;p;}' input.txt
Explanation:
/ .*% / - search for lines with a percentage value (note the spaces)
s/.* \(.*\)% .*/\1/ - and delete everything except the percentage value
H - then append it to the hold space, prefixed with a newline
$ - then, for the last line:
g - get the hold space
s/\n/,/g - replace all the newlines with commas
s/,// - and delete the initial comma
p - and then finally output the result
To harden the regex, you could replace the search for the percentage value, .*%, with, for example, [0-9.]*%.
I think this is a grep job. This should help:
$ grep -oE "[0-9]{1,2}\.[0-9]{2}" input.txt | xargs | tr " " ","
Output:
26.16,6.89,23.82,26.17
The input file just contains what you have shown above. Obviously, there are other ways, like cat, to feed the input to the command.
Explanation:
grep -oE: only show matches, using extended regex
xargs: put all results onto a single line
tr " " ",": translate the spaces into commas
This is actually a nice shell tool-belt example, I would say. Including the consideration of Joseph Quinsey, the regex can be made more robust with a lookahead to assert a % sign after the numeric value, using a Perl-compatible RE pattern:
grep -oP "[0-9]{1,2}\.[0-9]{2}(?=%)" input.txt | xargs | tr " " ","
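Where paste is available (it is part of POSIX), the xargs/tr pair can be replaced by a single paste call; a sketch on the same input.txt:

```shell
# Grab the percentages (anchored on the % sign), strip the sign,
# then join all lines with commas in one go.
grep -oE '[0-9]+\.[0-9]+%' input.txt | tr -d '%' | paste -s -d, -
# Output: 26.16,6.89,23.82,26.17
```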
Would you consider using awk? Here's a command you may try:
$ awk 'match($0,/[0-9.]*%/){s=(s=="")?"":s",";s=s substr($0,RSTART,RLENGTH-1)}END{print s}' file
26.16,6.89,23.82,26.17
Brief explanation:
match($0,/[0-9.]*%/): find records matching the regex [0-9.]*%
s=(s=="")?"":s",": since comma-separated output is required, we just need a comma before each match except the first one
s=s substr($0,RSTART,RLENGTH-1): append the matched part (without the trailing %) to s
Assuming the item names (Statements, Branches, ...) do not contain whitespace, how about:
#!/bin/bash
declare -a keys
declare -a values
while read -r line; do
    if [[ "$line" =~ ^([^\ ]+)\ *:\ *([0-9.]+)% ]]; then
        keys+=(${BASH_REMATCH[1]})
        values+=(${BASH_REMATCH[2]})
    fi
done < output.txt
ifsback=$IFS   # back up IFS
IFS=,
echo "${keys[*]}"
echo "${values[*]}"
IFS=$ifsback   # restore IFS
which yields:
Statements,Branches,Functions,Lines
26.16,6.89,23.82,26.17
Yet another option, with perl:
perl -e 'while(<>){/(\d+\.\d+)%/ and $x.="$1,"}chop $x; print $x;' the_file
The code, unrolled and explained:
while(<>){                      # Read line by line. Put lines into $_
  /(\d+\.\d+)%/ and $x.="$1,"   # Equivalent to:
                                # if ($_ =~ /(\d+\.\d+)%/) {$x.="$1,"}
                                # The regex matches "numbers", "dot", "numbers" and "%",
                                # storing just the numbers in $1 (first capturing group)
}
chop $x;                        # Remove the extra trailing ','
print $x;                       # and print the result
Somewhat shorter, with an extra sed:
perl -ne '/(\d+\.\d+)%/ and print "$1,"' the_file | sed 's/.$//'
The -n switch implies while(<>){}. The sed removes the trailing ','.
Unix pattern datetime match
I want to edit this line:
1987,4,12,31,4,1987-12-31 00:00:00.0000000,UA,19977,UA,,631,12197,1219701,31703,HPN,White Plains, NY,NY,36,New York,22,13930,1393001,30977,ORD,Chicago\, IL,IL,17,Illinois,41,756,802,483.2,6,6,0,0,0700-0759,,,,,914,938,600.8,24,24,1,1,0900-0959,0,,0,138,156,,1,738,3,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,US1NJBG0005,US1ILCK0027,,,,,,,,,,,,,1987-12-31 08:09:12.0000000,519494350
and I want the output to be:
1987,4,12,31,4,1987-12-31 00:00:00.000,UA,19977,UA,,631,12197,1219701,31703,HPN,White Plains, NY,NY,36,New York,22,13930,1393001,30977,ORD,Chicago\, IL,IL,17,Illinois,41,756,802,483.2,6,6,0,0,0700-0759,,,,,914,938,600.8,24,24,1,1,0900-0959,0,,0,138,156,,1,738,3,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,US1NJBG0005,US1ILCK0027,,,,,,,,,,,,,1987-12-31 08:09:12.000,519494350
I want to find each pattern of ****-**-** **:**:**.0000000 and erase the last 4 digits (0000) so that I get ****-**-** **:**:**.000. If it's helpful, this date format is in the 6th column and the (n-1)th column.
To get the value of the 6th column and erase the last four digits you can use:
awk -F, '{print substr($6, 1, length($6)-4) }'
Similarly, the N-1 column can be reached by:
awk -F, '{print substr( $(NF-1), 1, length($(NF-1))-4) }'
(Note that substr() positions are 1-based in awk; starting at 0 silently drops an extra character in some implementations.)
Edit: to only replace the values in those columns, but still print everything, use:
awk 'BEGIN{ FS=","; OFS=","} { $6=substr($6, 1, length($6)-4); $(NF-1)=substr( $(NF-1), 1, length($(NF-1))-4); print $0}'
Awk based solution Nicely formatted, portable script: #!/usr/bin/awk -f BEGIN { FS = "," # input: fields are separated by , OFS = "," # output: fields are separated by , } { sub(/[0-9][0-9][0-9][0-9]$/, "", $6) # remove last 4 digits from the 6th column sub(/[0-9][0-9][0-9][0-9]$/, "", $(NF-1)) # remove last 4 digits from the n-1 column print } One-line, less portable version using gawk: gawk --re-interval -F , -v OFS=, '{sub("[0-9]{4}$", "", $6); sub("[0-9]{4}$", "", $(NF-1)); print}' N.B. The regular expression engine of the traditional awk doesn't support the {n} repetition operator, so gawk version 3 or older needs to be run with --re-interval. For other flavors of awk e.g. nawk, you need to explicitly repeat the regular expression as in the portable longer script from above. sed based solution sed -r 's/^(([^,]*,){5})([^,]+)[0-9]{4},(([^,]*,)*)([^,]+)[0-9]{4}(,[^,]*)$/\1\3\4\6\7/' (tested with GNU sed-4.2.2-6)
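A quick sanity check of the portable rule on a shortened, made-up 9-field line (timestamps in columns 6 and NF-1, as in the question):

```shell
# Truncate the sub-second part of columns 6 and NF-1 to milliseconds.
echo '1,2,3,4,5,1987-12-31 00:00:00.0000000,foo,1987-12-31 08:09:12.0000000,519494350' |
awk 'BEGIN{FS=OFS=","}
     {sub(/[0-9][0-9][0-9][0-9]$/,"",$6)        # drop last 4 digits of column 6
      sub(/[0-9][0-9][0-9][0-9]$/,"",$(NF-1))   # and of column NF-1
      print}'
# Output: 1,2,3,4,5,1987-12-31 00:00:00.000,foo,1987-12-31 08:09:12.000,519494350
```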
You could also try this GNU sed command:
$ sed -r 's/^.*,([^,]*)....,.*$/\1/g' file
1987-12-31 08:09:12.000
If you want just the replacement, then try this:
$ sed -r 's/^(.*,)([^,]*)....(,.*)$/\1\2\3/g' file
1987,4,12,31,4,1987-12-31 00:00:00.0000000,UA,19977,UA,,631,12197,1219701,31703,HPN,White Plains, NY,NY,36,New York,22,13930,1393001,30977,ORD,Chicago\, IL,IL,17,Illinois,41,756,802,483.2,6,6,0,0,0700-0759,,,,,914,938,600.8,24,24,1,1,0900-0959,0,,0,138,156,,1,738,3,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,US1NJBG0005,US1ILCK0027,,,,,,,,,,,,,1987-12-31 08:09:12.000,519494350
I think you want the output to be like this:
$ grep -oP '[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}\....' file
1987-12-31 00:00:00.000
1987-12-31 08:09:12.000
Update:
$ echo '1987,4,12,31,4,1987-12-31 00:00:00.0000000,UA,19977,UA,,631,12197,1219701,31703,HPN,White Plains, NY,NY,36,New York,22,13930,1393001,30977,ORD,Chicago\, IL,IL,17,Illinois,41,756,802,483.2,6,6,0,0,0700-0759,,,,,914,938,600.8,24,24,1,1,0900-0959,0,,0,138,156,,1,738,3,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,US1NJBG0005,US1ILCK0027,,,,,,,,,,,,,1987-12-31 08:09:12.0000000,519494350' | sed -r 's/([0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}\....)..../\1/g'
1987,4,12,31,4,1987-12-31 00:00:00.000,UA,19977,UA,,631,12197,1219701,31703,HPN,White Plains, NY,NY,36,New York,22,13930,1393001,30977,ORD,Chicago\, IL,IL,17,Illinois,41,756,802,483.2,6,6,0,0,0700-0759,,,,,914,938,600.8,24,24,1,1,0900-0959,0,,0,138,156,,1,738,3,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,US1NJBG0005,US1ILCK0027,,,,,,,,,,,,,1987-12-31 08:09:12.000,519494350
Here's a solution in Perl.
Update - edited to output the full CSV line with the timestamp replaced by the truncated one.
Update 2 - update both timestamp columns, not just the first one.
#!/usr/bin/env perl
use strict;
use warnings;
use feature 'say';
use Text::CSV;

my $CSV = Text::CSV->new();

while (my $line = readline(STDIN)) {
    $CSV->parse($line) or die "Unable to parse line '$line'";
    my @fields = $CSV->fields();
    for my $f (@fields) {
        $f =~ s/
            ^            # start of string
            (            # start capture to $1
              \d{4} -    # year
              \d{2} -    # month
              \d{2} \s+  # day
              \d{2} :    # hour
              \d{2} :    # minute
              \d{2} [.]  # second
              \d{3}      # millisecond
            )            # end capture to $1
            \d{4}        # unwanted sub-second precision
            $            # end of string
        /$1/gmsx;
    }
    $CSV->combine(@fields);
    say $CSV->string();
}
For example:
alex@yuzu:~$ cat input.txt
1987,4,12,31,4,1987-12-31 00:00:00.0000000,UA,19977,UA,,631,12197,1219701,31703,HPN,White Plains, NY,NY,36,New York,22,13930,1393001,30977,ORD,Chicago\, IL,IL,17,Illinois,41,756,802,483.2,6,6,0,0,0700-0759,,,,,914,938,600.8,24,24,1,1,0900-0959,0,,0,138,156,,1,738,3,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,US1NJBG0005,US1ILCK0027,,,,,,,,,,,,,1987-12-31 08:09:12.0000000,519494350
alex@yuzu:~$ ./csv.pl < input.txt
1987,4,12,31,4,"1987-12-31 00:00:00.000",UA,19977,UA,,631,12197,1219701,31703,HPN,"White Plains"," NY",NY,36,"New York",22,13930,1393001,30977,ORD,Chicago\," IL",IL,17,Illinois,41,756,802,483.2,6,6,0,0,0700-0759,,,,,914,938,600.8,24,24,1,1,0900-0959,0,,0,138,156,,1,738,3,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,US1NJBG0005,US1ILCK0027,,,,,,,,,,,,,"1987-12-31 08:09:12.000",519494350
On a Debian-like system such as Ubuntu you should already have Perl, and you can install Text::CSV with:
$ sudo apt-get install libtext-csv-perl