How to match and cut the string with different conditions using sed? - regex

I want to grep the string that comes after WORK= and ignore it if parentheses come after that string.
The text looks like this:
//INALL TYPE=GH,WORK=HU.ET.ET(IO)
//INA2 WORK=HU.TY.TY(OP),TYPE=KK
//OOPE2 TYPE=KO,WORK=TEXT.LO1.LO2,TEXT
//OOP2 TYPE=KO,WORK=TEST1.TEST2
//H1 WORK=OP.TEE.GHU,TYPE=IU
So, the desired output should print only:
TEXT.LO1.LO2
TEST1.TEST2
OP.TEE.GHU
So far, I could match and cut before WORK= but could not remove WORK= itself:
sed -E 's/(.*)(WORK=.*)/\2/'
I am not sure how to continue. Can anyone help, please?

You can use
sed -n '/WORK=.*([^()]*)/!s/.*WORK=\([^,]*\).*/\1/p' file > newfile
Details:
-n - suppresses the default line output
/WORK=.*([^()]*)/! - if a line contains WORK= followed by any text and then a (...) substring, skip it
s/.*WORK=\([^,]*\).*/\1/p - otherwise, remove everything up to and including WORK=, capture into Group 1 any zero or more chars other than a comma, and remove the rest of the line; p prints the result.
See the sed demo:
s='//INALL TYPE=GH,WORK=HU.ET.ET(IO)
//INA2 WORK=HU.TY.TY(OP),TYPE=KK
//OOPE2 TYPE=KO,WORK=TEXT.LO1.LO2,TEXT
//OOP2 TYPE=KO,WORK=TEST1.TEST2
//H1 WORK=OP.TEE.GHU,TYPE=IU'
sed -n '/WORK=.*([^()]*)/!s/.*WORK=\([^,]*\).*/\1/p' <<< "$s"
Output:
TEXT.LO1.LO2
TEST1.TEST2
OP.TEE.GHU
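If your grep supports PCRE (GNU grep -P), a similar extraction can be sketched with \K and a lookahead; this is only checked against the shown samples and skips a value only when the parentheses appear before the next comma:
grep -oP 'WORK=\K[^,(]+(?=,|$)' file
Here \K drops WORK= from the match, [^,(]+ takes the value up to a comma or opening parenthesis, and the lookahead requires the value to be immediately followed by a comma or the end of the line, so values containing (...) produce no match.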

Could you please try the following awk program, written and tested with your shown samples in GNU awk.
awk '
match($0,/WORK=[^,]*/){
val=substr($0,RSTART+5,RLENGTH-5)
if(val!~/\([a-zA-Z]+\)/){ print val }
}
' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
match($0,/WORK=[^,]*/){ ##Using the match function to match from WORK= up to (but not including) a comma.
val=substr($0,RSTART+5,RLENGTH-5) ##Creating val as the substring of the matched text, skipping the WORK= prefix.
if(val!~/\([a-zA-Z]+\)/){ print val } ##Checking that val does not contain ( alphabets ); if it doesn't, print val here.
}
' Input_file ##Mentioning Input_file name here.

This might work for you (GNU sed):
sed -n '/.*WORK=\([^,]\+\).*/{s//\1/;/(.*)/!p}' file
Extract the string following WORK= and if that string does not contain (...) print it.
This will work if there is zero or one occurrence of WORK= and the exclusion depends only on the (...) occurring within that string, not in other following fields.
For a global solution with the same stipulations for parens:
sed -n '/WORK=\([^,]\+\)/{s//\n\1\n/;s/[^\n]*\n//;/(.*).*\n/!P;D}' file
N.B. This prints each such string on a separate line and excludes empty strings.

How to check last 3 chars of a string are alphabets or not using awk?

I want to check if the last 3 letters in column 1 are alphabets and print those rows. What am I doing wrong?
My code :-
awk -F '|' ' {print str=substr( $1 , length($1) - 2) } END{if ($str ~ /^[A-Za-z]/ ) print}' file
cat file
12300USD|0392
abc56eur|97834
238aed|23911
aabccde|38731
73716yen|19287
.*/|982376
0NRT0|928731
expected output:
12300USD|0392
abc56eur|97834
238aed|23911
aabccde|38731
73716yen|19287
$ awk -F'|' '$1 ~ /[[:alpha:]]{3}$/' file
12300USD|0392
abc56eur|97834
238aed|23911
aabccde|38731
73716yen|19287
Regarding what's wrong with your script:
You're doing the test for alphabetic characters in the END section for the final line read instead of once per input line.
You're trying to use shell variable syntax $str instead of awk str.
You're testing for literal character ranges in the bracket expression instead of using a character class so YMMV on which characters that includes depending on your locale.
You're testing for a string that starts with a letter instead of a string that ends with 3 letters.
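Putting those fixes together, a corrected version of your own substr-based approach might look like this (an untested sketch along the same lines, written for GNU awk):
awk -F'|' '{str = substr($1, length($1)-2)} str ~ /^[[:alpha:]]{3}$/' file
The first block computes the last 3 characters of column 1 for every line; the second pattern prints the line when those 3 characters are all alphabetic.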
Use grep:
grep -P '^[^|]*[A-Za-z]{3}[|]' in_file > out_file
Here, GNU grep uses the following option:
-P : Use Perl regexes.
The regex means this:
^ : Start of the string.
[^|]* : Any non-pipe character, repeated 0 or more times.
[A-Za-z]{3} : 3 letters.
[|] : Literal pipe.
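If -P is not available, the same idea can be sketched with a POSIX character class in an ERE (equivalent for the shown samples):
grep -E '^[^|]*[[:alpha:]]{3}[|]' in_file > out_file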
sed -n '/^[^|]*[A-Za-z][A-Za-z][A-Za-z]|/p' file
grep '^[^|]*[A-Za-z][A-Za-z][A-Za-z]|' file
{m,g}awk '!+FS<NF' FS='^[^|]*[A-Za-z][A-Za-z][A-Za-z][|]'
{m,g}awk '$!_!~"[|]"' FS='[A-Za-z][A-Za-z][A-Za-z][|]'
{m,g}awk '($!_~"[|]")<NF' FS='[A-Za-z][A-Za-z][A-Za-z][|]' # to play it safe
12300USD|0392
abc56eur|97834
238aed|23911
aabccde|38731
73716yen|19287

Replace newline in quoted strings in huge files

I have a few huge files with values separated by a pipe (|) sign.
The strings are quoted, but sometimes there is a newline inside a quoted string.
I need to read these files with an external table from Oracle, but it gives errors on the newlines. So I need to replace them with a space.
I run some other Perl commands on these files for other errors, so I would like a solution as a one-line Perl command.
I've found some similar questions on Stack Overflow, but they don't quite do the same thing, and the solutions mentioned there don't solve my problem.
The statement I tried but that isn't working:
perl -pi -e 's/"(^|)*\n(^|)*"/ /g' test.txt
Sample text:
4454|"test string"|20-05-1999|"test 2nd string"
4455|"test newline
in string"||"test another 2nd string"
4456|"another string"|19-03-2021|"here also a newline
"
4457|.....
Should become:
4454|"test string"|20-05-1999|"test 2nd string"
4455|"test newline in string"||"test another 2nd string"
4456|"another string"|19-03-2021|"here also a newline "
4457|.....
Sounds like you want a CSV parser like Text::CSV_XS (Install through your OS's package manager or favorite CPAN client):
$ perl -MText::CSV_XS -e '
my $csv = Text::CSV_XS->new({sep => "|", binary => 1});
while (my $row = $csv->getline(*ARGV)) {
$csv->say(*STDOUT, [ map { tr/\n/ /r } @$row ])
}' test.txt
4454|"test string"|20-05-1999|"test 2nd string"
4455|"test newline in string"||"test another 2nd string"
4456|"another string"|19-03-2021|"here also a newline "
This one-liner reads each record using | as the field separator instead of the normal comma, and for each field, replaces newlines with spaces, and then prints out the transformed record.
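Note that the one-liner writes to standard output rather than editing in place; if the cleaned data has to end up back in the original file, one way (a sketch, the temporary file name is arbitrary) is to redirect and then move:
$ perl -MText::CSV_XS -e '
my $csv = Text::CSV_XS->new({sep => "|", binary => 1});
while (my $row = $csv->getline(*ARGV)) {
$csv->say(*STDOUT, [ map { tr/\n/ /r } @$row ])
}' test.txt > test_fixed.txt && mv test_fixed.txt test.txt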
In your specific case, you can also consider a workaround using GNU sed or awk.
An awk command will look like
awk 'NR==1 {print;next;} /^[0-9]{4,}\|/{print "\n" $0;next;}1' ORS="" file > newfile
The ORS (output record separator) is set to an empty string, which means that \n is only added before lines starting with four or more digits followed by a | char (matched with a ^[0-9]{4,}\| POSIX ERE pattern).
A GNU sed command will look like
sed -i ':a;$!{N;/\n[0-9]\{4,\}|/!{s/\n/ /;ba}};P;D' file
This reads two consecutive lines into the pattern space, and when the second line does not start with four or more digits followed by a | char (see the [0-9]\{4,\}| POSIX BRE pattern), the line break between the two is replaced with a space. The search and replace repeats until there is no match or the end of the file is reached.
With perl, if the file is huge but it can still fit into memory, you can use a short
perl -0777 -pi -e 's/\R++(?!\d{4,}\|)/ /g' file
With -0777, you slurp the whole file, and the \R++(?!\d{4,}\|) pattern matches one or more line breaks (\R++) not followed by four or more digits and a | char. The ++ possessive quantifier is required so that the (?!...) negative lookahead cannot be bypassed by backtracking into the line-break pattern.
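To try it on the sample text without touching a file, you can drop -i and feed a here-string, in the same style as the demo further up (a quick sketch):
s='4454|"test string"|20-05-1999|"test 2nd string"
4455|"test newline
in string"||"test another 2nd string"
4456|"another string"|19-03-2021|"here also a newline
"
4457|.....'
perl -0777 -pe 's/\R++(?!\d{4,}\|)/ /g' <<< "$s"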
With your shown samples, this could be done simply in an awk program. Written and tested in GNU awk; it should work in any awk. This should be fast even on huge files (better than slurping the whole file into memory, since the OP mentions the files are huge).
awk 'gsub(/"/,"&")%2!=0{if(val==""){val=$0} else{print val $0;val=""};next} 1' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
gsub(/"/,"&")%2!=0{ ##Checking condition if number of " are EVEN or not, because if they are NOT even then it means they are NOT closed properly.
if(val==""){ val=$0 } ##Checking condition if val is NULL then set val to current line.
else {print val $0;val=""} ##Else(if val NOT NULL) then print val current line and nullify val here.
next ##next will skip further statements from here.
}
1 ##In case the number of " in a line is EVEN, it will skip the above condition (the gsub one) and simply print the line.
' Input_file ##Mentioning Input_file name here.

grep regex how to get only results with one preceeding word?

My string is :
www.abc.texas.com
mail.texas.com
subdomain.xyz.cc.texas.com
www2.texas.com
I am trying to get results with only "one" word before texas.com. Expected output when I do a regex grep:
mail.texas.com
www2.texas.com
So mail & www2 are the "one" word that I'm talking about. I tried:
grep "*.texas.com", but I get all of them in the results. Can someone please help?
You can use
grep '^[^.]*\.texas\.com'
Details:
^ - start of string
[^.]* - zero or more chars other than a . char
\.texas\.com - .texas.com string (literal . char must be escaped in the regex pattern).
See the online demo:
#!/bin/bash
s='www.abc.texas.com
mail.texas.com
subdomain.xyz.cc.texas.com
www2.texas.com'
grep '^[^.]*\.texas\.com' <<< "$s"
Output:
mail.texas.com
www2.texas.com
With awk:
awk 'BEGIN{FS=OFS="."} /texas.com$/ && NF==3' file
Output:
mail.texas.com
www2.texas.com
Set one dot as input and output field separator, check for texas.com at the end ($) of your line and check for three fields.
See: 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR
With your shown samples, please try the following awk code.
awk -F'.' 'NF==3 && $2=="texas" && $3=="com"' Input_file
Explanation: Simply setting the field separator to . for all lines in the awk program. Then, in the main program, checking whether NF (the number of fields in the current line) is 3 AND the 2nd field is texas AND the 3rd field is com; if all 3 conditions are met, print the line.
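If you would rather use sed, an equivalent check can be sketched by anchoring both ends of the line (again only verified against the shown samples):
sed -n '/^[^.]*\.texas\.com$/p' file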

Add number for each matching pattern in a txt file in bash

Hello I have a file such as
File.txt
>LO_D
AHAHAHAHHAHAH
>LEIO_DS
DHHDHDHDHDH
>LODJ_jdjd
DJDJHDHDHD
>LO_D
AAAAAAA
>LO_D
HHAHAHAHAHAH
And I would like to add a number just after each >LO_D element.
I should then get:
>LO_D_1
AHAHAHAHHAHAH
>LEIO_DS
DHHDHDHDHDH
>LODJ_jdjd
DJDJHDHDHD
>LO_D_2
AAAAAAA
>LO_D_3
HHAHAHAHAHAH
You may use this awk:
awk '/^>LO_D$/ {$0 = $0 "_" (++n)} 1' file
>LO_D_1
AHAHAHAHHAHAH
>LEIO_DS
DHHDHDHDHDH
>LODJ_jdjd
DJDJHDHDHD
>LO_D_2
AAAAAAA
>LO_D_3
HHAHAHAHAHAH
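If the numbering has to be written back into the file instead of printed to stdout, GNU awk's inplace extension can do that (a sketch; assumes gawk 4.1 or later):
gawk -i inplace '/^>LO_D$/ {$0 = $0 "_" (++n)} 1' file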
Could you please try the following, written and tested with your shown samples.
awk '$0==">LO_D"{print $0"_"++count;next} 1' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
$0==">LO_D"{ ##Checking condition if line is >LO_D then do following.
print $0"_"++count ##Printing the current line, then _ and the count variable with an increasing value.
next ##next will skip all further statements from here.
}
1 ##1 will print current line.
' Input_file ##mentioning Input_file name here.
Use this Perl one-liner:
perl -pe 's/^(>LO_D)$/"${1}_" . ++$i/e;' in.fasta
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-p : Loop over the input one line at a time, assigning it to $_ by default. Add print $_ after each loop iteration.
^(>LO_D)$ : Match the literal string >LO_D that starts at the beginning of the line (^) and ends at the end of the line ($). Capture the string using parentheses into capture variable $1.
"${1}_" . ++$i : Replacement string that consists of the captured variable $1, followed by an underscore, followed by the counter. Note that $1 is written as ${1} to avoid being interpreted as non-existent variable $1_. The counter is incremented by 1 before the expression is evaluated, so that the counter is 1, 2, 3, etc on subsequent matches.
s/PATTERN/REPLACEMENT/e : The /e flag tells Perl to evaluate REPLACEMENT as an expression first, then do the substitution.
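To change in.fasta in place instead of printing to standard output, the same one-liner can be run with -i (shown here with a .bak backup; the backup extension is optional):
perl -i.bak -pe 's/^(>LO_D)$/"${1}_" . ++$i/e;' in.fasta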
SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches
perldoc perlrequick: Perl regular expressions quick start

How to find/extract a pattern from a file?

Here are the contents of my text file named 'temp.txt'
---start of file ---
HEROKU_POSTGRESQL_AQUA_URL (DATABASE_URL) ----backup---> b687
Capturing... done
Storing... done
---end of file ----
I want to write a bash script in which I need to capture the string 'b687' in a variable. This is really a pattern (the letter 'b' followed by 'n' digits). I can do it the hard way by looping through the file and extracting the desired string (b687 in the example above). Is there an easy way to do so? Perhaps by using awk or sed?
Try using grep
v=$(grep -oE '\bb[0-9]{3}\b' file)
This will search for a word starting with b followed by 3 digits.
regex101 demo
Using sed
v=$(sed -nr 's/.*\b(b[0-9]{3})\b.*/\1/p' file)
varname=$(awk '/HEROKU_POSTGRESQL_AQUA_URL/{print $4}' filename)
What this does is read the file and, when it matches the pattern HEROKU_POSTGRESQL_AQUA_URL, print the 4th token, in this case b687.
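If the token is not always the 4th field, a pattern-based variant may be more robust (a sketch using awk's match(); it assumes the letter b immediately followed by digits only occurs in the backup id):
varname=$(awk 'match($0, /b[0-9]+/) {print substr($0, RSTART, RLENGTH); exit}' filename)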
Your other option is to use sed:
varname=$(sed -n 's/.* \(b[0-9][0-9]*\)/\1/p' filename)
In this case we are looking for the pattern you mentioned, b####..., and only print that pattern. The -n tells sed not to print lines that do not have that pattern. The rest of the sed command is a substitution: .* is any string at the beginning, followed by \(...\), which forms a group in which we put the regex that will match your b#####. The second part says, out of everything that matched, print only group 1, and the p at the end tells sed to print the result (since by default we told sed not to print, because of -n).