Add number for each matching pattern in a txt file in bash - regex

Hello I have a file such as
File.txt
>LO_D
AHAHAHAHHAHAH
>LEIO_DS
DHHDHDHDHDH
>LODJ_jdjd
DJDJHDHDHD
>LO_D
AAAAAAA
>LO_D
HHAHAHAHAHAH
And I would like to add a number just after each >LO_D element.
I should then get:
>LO_D_1
AHAHAHAHHAHAH
>LEIO_DS
DHHDHDHDHDH
>LODJ_jdjd
DJDJHDHDHD
>LO_D_2
AAAAAAA
>LO_D_3
HHAHAHAHAHAH

You may use this awk:
awk '/^>LO_D$/ {$0 = $0 "_" (++n)} 1' file
>LO_D_1
AHAHAHAHHAHAH
>LEIO_DS
DHHDHDHDHDH
>LODJ_jdjd
DJDJHDHDHD
>LO_D_2
AAAAAAA
>LO_D_3
HHAHAHAHAHAH

Could you please try following, written and tested with shown samples.
awk '$0==">LO_D"{print $0"_"++count;next} 1' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
$0==">LO_D"{ ##Checking condition if line is >LO_D then do following.
print $0"_"++count ##Printing current line, an underscore, and the count variable with increasing value.
next ##next will skip all further statements from here.
}
1 ##1 will print current line.
' Input_file ##mentioning Input_file name here.
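The header does not have to be hard-coded. As a minimal sketch (the function name number_headers and its argument order are made up for illustration and are not part of either answer), the same awk logic can be wrapped in a shell function that takes the header to number as its first argument:

```shell
# number_headers HEADER [FILE...]
# Appends _1, _2, ... to every line that is exactly HEADER.
# (Hypothetical helper name; the awk body follows the same idea as the answers above.)
number_headers() {
    header=$1
    shift
    # awk -v interprets backslash escapes in the value, so this sketch
    # assumes HEADER contains no backslashes.
    awk -v header="$header" '$0 == header { $0 = $0 "_" (++n) } 1' "$@"
}

# Example on a shortened copy of the sample data:
printf '%s\n' '>LO_D' 'AHAH' '>LEIO_DS' 'DHHD' '>LO_D' 'AAAA' | number_headers '>LO_D'
```

Because the comparison is a plain string equality rather than a regex, characters such as . or * in the header need no escaping.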

Use this Perl one-liner:
perl -pe 's/^(>LO_D)$/"${1}_" . ++$i/e;' in.fasta
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-p : Loop over the input one line at a time, assigning it to $_ by default. Add print $_ after each loop iteration.
^(>LO_D)$ : Match the literal string >LO_D that starts at the beginning of the line (^) and ends at the end of the line ($). Capture the string using parentheses into capture variable $1.
"${1}_" . ++$i : Replacement string that consists of the captured variable $1, followed by an underscore, followed by the counter. Note that $1 is written as ${1} to avoid being interpreted as non-existent variable $1_. The counter is incremented by 1 before the expression is evaluated, so that the counter is 1, 2, 3, etc on subsequent matches.
s/PATTERN/REPLACEMENT/e : The /e flag tells Perl to evaluate REPLACEMENT as an expression first, then do the substitution.
SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches
perldoc perlrequick: Perl regular expressions quick start


Replace newline in quoted strings in huge files

I have a few huge files with values separated by a pipe (|) sign.
The strings are quoted, but sometimes there is a newline inside a quoted string.
I need to read these files with an external table from Oracle, but the newlines cause errors, so I need to replace them with a space.
I run some other perl commands on these files for other errors, so I would like a solution as a one-line perl command.
I've found some similar questions on Stack Overflow, but they don't quite do the same thing and I couldn't adapt their solutions to my problem.
The statement I tried but that isn't working:
perl -pi -e 's/"(^|)*\n(^|)*"/ /g' test.txt
Sample text:
4454|"test string"|20-05-1999|"test 2nd string"
4455|"test newline
in string"||"test another 2nd string"
4456|"another string"|19-03-2021|"here also a newline
"
4457|.....
Should become:
4454|"test string"|20-05-1999|"test 2nd string"
4455|"test newline in string"||"test another 2nd string"
4456|"another string"|19-03-2021|"here also a newline "
4457|.....
Sounds like you want a CSV parser like Text::CSV_XS (Install through your OS's package manager or favorite CPAN client):
$ perl -MText::CSV_XS -e '
my $csv = Text::CSV_XS->new({sep => "|", binary => 1});
while (my $row = $csv->getline(*ARGV)) {
$csv->say(*STDOUT, [ map { tr/\n/ /r } @$row ])
}' test.txt
4454|"test string"|20-05-1999|"test 2nd string"
4455|"test newline in string"||"test another 2nd string"
4456|"another string"|19-03-2021|"here also a newline "
This one-liner reads each record using | as the field separator instead of the normal comma, and for each field, replaces newlines with spaces, and then prints out the transformed record.
In your specific case, you can also consider a workaround using GNU sed or awk.
An awk command will look like
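awk 'NR==1 {print;next} /^[0-9]{4,}\|/ {print "\n" $0;next} {print " " $0} END {print "\n"}' ORS="" file > newfile
The ORS (output record separator) is set to an empty string, which means that \n is only added before lines starting with four or more digits followed with a | char (matched with a ^[0-9]{4,}\| POSIX ERE pattern). Any other line is treated as a continuation and is appended after a single space, and the END block writes the final newline.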
A GNU sed command will look like
sed -i ':a;$!{N;/\n[0-9]\{4,\}|/!{s/\n/ /;ba}};P;D' file
This appends the next line to the pattern space, and if that second line doesn't start with four or more digits followed with a | char (see the [0-9]\{4,\}| POSIX BRE regex pattern), the line break between the two is replaced with a space. The search and replace repeats until there is no match or the end of file is reached.
With perl, if the file is huge but it can still fit into memory, you can use a short
perl -0777 -pi -e 's/\R++(?!\d{4,}\||\z)/ /g' file
With -0777, you slurp the file, and the \R++(?!\d{4,}\||\z) pattern matches any one or more line breaks (\R++) that are not followed with four or more digits and a | char and are not at the very end of the file (\z), so the final newline is kept. The ++ possessive quantifier is required so that the (?!...) negative lookahead cannot succeed by backtracking into the run of line breaks.
With your shown samples, this can be done simply in awk. Written and tested in GNU awk; it should work in any awk. It should be fast even on huge files, since it does not slurp the whole file into memory (the OP mentioned the files may be huge).
awk 'gsub(/"/,"&")%2!=0{if(val==""){val=$0} else{print val " " $0;val=""};next} 1' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
gsub(/"/,"&")%2!=0{ ##Checking if the number of " on the line is odd, which means a quoted string is opened or closed on this line.
if(val==""){ val=$0 } ##If val is empty, this line opens a quoted string, so save it in val.
else {print val " " $0;val=""} ##Else this line closes the string: print val, a space and the current line, then reset val.
next ##next will skip further statements from here.
}
1 ##If the number of " is even, the gsub condition above is false and the line is simply printed.
' Input_file ##Mentioning Input_file name here.
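If a quoted string can span more than two lines, the quote-parity idea can be extended by remembering whether a quote is currently open. A rough sketch, assuming double quotes are never escaped or doubled inside a field (the function name join_quoted_newlines and the variable names are only illustrative):

```shell
# join_quoted_newlines [FILE...]
# Joins physical lines with a space while a double-quoted string is still open.
# (Hypothetical helper; assumes quotes are never escaped inside a field.)
join_quoted_newlines() {
    awk '
    {
        n = gsub(/"/, "&")              # count the double quotes on this line
        if (inq) buf = buf " " $0       # inside a quoted string: append after a space
        else     buf = $0               # otherwise start a new record
        if (n % 2) inq = !inq           # an odd count opens or closes a quoted string
        if (!inq) print buf             # record is complete, print it
    }
    END { if (inq) print buf }          # unterminated quote at end of input
    ' "$@"
}

# Example: the first record is spread over three physical lines.
printf '%s\n' '1|"a' 'b' 'c"|2' '3|"x"' | join_quoted_newlines
```

Like the one-liner above, this reads the input line by line, so memory use stays small unless a single quoted string is itself very long.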

How to find and replace a pattern string using sed/perl/awk?

I have a file foo.properties with contents like
foo=bar
# another property
test=true
allNames=alpha:.02,beta:0.25,ph:0.03,delta:1.0,gamma:.5
In my script, I need to replace whatever value is set for ph (the current value is unknown to the bash script) and change it to 0.5. So the file should look like
foo=bar
# another property
test=true
allNames=alpha:.02,beta:0.25,ph:0.5,delta:1.0,gamma:.5
I know it can be easily done if the current value is known by using
sed "s/\,ph\:0.03\,/\,ph\:0.5\,/" foo.properties
But in my case, I have to actually read the contents against allNames and search for the value and then replace within a for loop. Rest all is taken care of but I can't figure out the sed/perl command for this.
I tried using sed "s/\,ph\:.*\,/\,ph\:0.5\,/" foo.properties and some variations but it didn't work.
A simpler sed solution:
sed -E 's/([=,]ph:)[0-9.]+/\10.5/g' file
foo=bar
# another property
test=true
allNames=alpha:.02,beta:0.25,ph:0.5,delta:1.0,gamma:.5
Here we match ([=,]ph:) (i.e. , or = followed by ph:) and capture it in group #1. This should be followed by 1+ of the [0-9.] characters to match any number. In the replacement we put \1 back, followed by 0.5.
With your shown samples, please try following awk code.
awk -v new_val="0.5" '
match($0,/,ph:[0-9]+(\.[0-9]+)?/){
val=substr($0,RSTART+1,RLENGTH-1)
sub(/:.*/,":",val)
print substr($0,1,RSTART) val new_val substr($0,RSTART+RLENGTH)
next
}
1
' Input_file
Detailed explanation: The awk variable new_val holds the new value. The main program uses the match function to look for the regex ,ph:[0-9]+(\.[0-9]+)? in each line. If it matches, the matched text without its leading comma (for example ph:0.03) is stored in val, and sub then removes everything after the colon, leaving ph:. The line is printed as the part up to and including the comma, then val, then new_val, then the rest of the line after the match; next skips further processing of that line. The final 1 prints all lines that did not match.
2nd solution: Using sub function of awk.
awk -v newVal="0.5" '/^allNames=/{sub(/,ph:[^,]*/,",ph:"newVal)} 1' Input_file
Would you please try a perl solution:
perl -pe '
s/(?<=\bph:)[\d.]+(?=,|$)/0.5/;
' foo.properties
The -p and -e options make perl read the input line by line, perform
the operation, then print it as sed does.
The regex (?<=\bph:) is a zero-length lookbehind which matches
the string ph: preceded by a word boundary.
The regex [\d.]+ will match a decimal number.
The regex (?=,|$) is a zero-length lookahead which matches
a comma or the end of the string.
As the lookbehind and the lookahead have zero length, they are not
substituted by the s/../../ operator.
[Edit]
As Dave Cross comments, the lookahead (?=,|$) is unnecessary as long as the input file is correctly formatted.
Works with or without decimal places, or with no value, anywhere in the line.
sed -E 's/(^|[^-_[:alnum:]])ph:[0-9]*(\.[0-9]+)?/\1ph:0.5/g'
Or possibly:
sed -E 's/(^|[=,[:space:]])ph:[0-9]+(\.[0-9]+)?/\1ph:0.5/g'
The top one uses "not other naming characters" to describe the character immediately before a name, the bottom one uses delimiter characters (you could add more characters to either). The purpose is to avoid clashing with other_ph or autograph. In both, \1 puts the character matched before ph: back into the result, and the dot is escaped so that it only matches a literal decimal point.
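To see the effect of the leading group, here is a small sketch run on a made-up line that contains autograph and other_ph next to the real ph key. The function name replace_ph and the test line are only for illustration; the hyphen is written last in the bracket expression, which is equivalent:

```shell
# replace_ph [FILE...]
# Sets ph:<number> to ph:0.5 only where ph is a whole key name.
# (Hypothetical wrapper; \1 keeps the delimiter that precedes ph.)
replace_ph() {
    sed -E 's/(^|[^_[:alnum:]-])ph:[0-9]*(\.[0-9]+)?/\1ph:0.5/g' "$@"
}

# autograph:1 and other_ph:2 are left alone, only ,ph:3 is rewritten.
printf '%s\n' 'autograph:1,other_ph:2,ph:3' | replace_ph
```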
Here you go
#!/usr/bin/perl
use strict;
use warnings;
print "\nPerl Starting ... \n\n";
while (my $recordLine =<DATA>)
{
chomp($recordLine);
if (index($recordLine, "ph:") != -1)
{
$recordLine =~ s/ph:.*?,/ph:0.5,/g;
print "recordLine: $recordLine ...\n";
}
}
print "\nPerl End ... \n\n";
__DATA__
foo=bar
# another property
test=true
allNames=alpha:.02,beta:0.25,ph:0.03,delta:1.0,gamma:.5
output:
Perl Starting ...
recordLine: allNames=alpha:.02,beta:0.25,ph:0.5,delta:1.0,gamma:.5 ...
Perl End ...
Using any sed in any shell on every Unix box (the other sed solutions posted that use sed -E require GNU or BSD seds):
a) if ph: is never the first tag in the allNames list (as shown in your sample input):
$ sed 's/\(,ph:\)[^,]*/\10.5/' foo.properties
foo=bar
# another property
test=true
allNames=alpha:.02,beta:0.25,ph:0.5,delta:1.0,gamma:.5
b) or if it can be first:
$ sed 's/\([,=]ph:\)[^,]*/\10.5/' foo.properties
foo=bar
# another property
test=true
allNames=alpha:.02,beta:0.25,ph:0.5,delta:1.0,gamma:.5
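Since the question mentions doing this inside a loop in a script, the key and the new value can come from shell variables. A minimal sketch built on variant b) above, assuming neither the key nor the value contains characters that are special to sed (/, &, \, or regex metacharacters); the function name set_list_value is hypothetical:

```shell
# set_list_value KEY VALUE [FILE...]
# Replaces the value of KEY inside a comma-separated key:value list.
# (Hypothetical helper; KEY and VALUE are assumed to be sed-safe strings.)
set_list_value() {
    key=$1
    value=$2
    shift 2
    # Double quotes let the shell expand the variables; \\ becomes \ for sed.
    sed "s/\\([,=]${key}:\\)[^,]*/\\1${value}/" "$@"
}

printf '%s\n' 'foo=bar' 'allNames=alpha:.02,beta:0.25,ph:0.03,delta:1.0,gamma:.5' |
    set_list_value ph 0.5
```

In a loop this could be called once per key, for example set_list_value "$k" "$v" foo.properties, with the output redirected to a temporary file.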

How to match and cut the string with different conditions using sed?

I want to grep the string which comes after WORK= and ignore it if a parenthesis follows that string.
The text looks like this :
//INALL TYPE=GH,WORK=HU.ET.ET(IO)
//INA2 WORK=HU.TY.TY(OP),TYPE=KK
//OOPE2 TYPE=KO,WORK=TEXT.LO1.LO2,TEXT
//OOP2 TYPE=KO,WORK=TEST1.TEST2
//H1 WORK=OP.TEE.GHU,TYPE=IU
So, desirable output should print only :
TEXT.LO1.LO2
TEST1.TEST2
OP.TEE.GHU
So far, I could only match and cut everything before WORK=, but could not remove WORK= itself:
sed -E 's/(.*)(WORK=.*)/\2/'
I am not sure how to continue. Can anyone help, please?
You can use
sed -n '/WORK=.*([^()]*)/!s/.*WORK=\([^,]*\).*/\1/p' file > newfile
Details:
-n - suppresses the default line output
/WORK=.*([^()]*)/! - if a line contains WORK= followed with any text and then a (...) substring, the line is skipped
s/.*WORK=\([^,]*\).*/\1/p - else, takes the line and removes everything up to and including WORK=, then captures into Group 1 any zero or more chars other than a comma, and then removes the rest of the line; p prints the result.
See the sed demo:
s='//INALL TYPE=GH,WORK=HU.ET.ET(IO)
//INA2 WORK=HU.TY.TY(OP),TYPE=KK
//OOPE2 TYPE=KO,WORK=TEXT.LO1.LO2,TEXT
//OOP2 TYPE=KO,WORK=TEST1.TEST2
//H1 WORK=OP.TEE.GHU,TYPE=IU'
sed -n '/WORK=.*([^()]*)/!s/.*WORK=\([^,]*\).*/\1/p' <<< "$s"
Output:
TEXT.LO1.LO2
TEST1.TEST2
OP.TEE.GHU
Could you please try following awk, written and tested with shown samples in GNU awk.
awk '
match($0,/WORK=[^,]*/){
val=substr($0,RSTART+5,RLENGTH-5)
if(val!~/\([a-zA-Z]+\)/){ print val }
}
' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
match($0,/WORK=[^,]*/){ ##Using match function to match WORK= till comma comes.
val=substr($0,RSTART+5,RLENGTH-5) ##Creating val with sub string of match regex here.
if(val!~/\([a-zA-Z]+\)/){ print val } ##Checking if val does not contain ( alphabets ), then print val here.
}
' Input_file ##Mentioning Input_file name here.
This might work for you (GNU sed):
sed -n '/.*WORK=\([^,]\+\).*/{s//\1/;/(.*)/!p}' file
Extract the string following WORK= and if that string does not contain (...) print it.
This will work if there is at most one occurrence of WORK= per line and the exclusion depends only on (...) occurring within that string, not in other following fields.
For a global solution with the same stipulations for parens:
sed -n '/WORK=\([^,]\+\)/{s//\n\1\n/;s/[^\n]*\n//;/(.*).*\n/!P;D}' file
N.B. This prints each such string on a separate line and excludes empty strings.
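Another way to look at it is token by token: split each line on spaces and commas, then keep only the WORK= tokens that contain no parentheses. A sketch in plain awk, assuming values never contain spaces or commas themselves (the function name extract_work is made up for illustration):

```shell
# extract_work [FILE...]
# Prints the value of every WORK= token that has no parentheses in it.
# (Hypothetical helper; assumes values contain no spaces or commas.)
extract_work() {
    awk '
    {
        n = split($0, tok, /[ ,]+/)             # tokens are separated by spaces or commas
        for (i = 1; i <= n; i++)
            if (tok[i] ~ /^WORK=/ && tok[i] !~ /[()]/)
                print substr(tok[i], 6)         # drop the 5-character WORK= prefix
    }' "$@"
}

printf '%s\n' '//INALL TYPE=GH,WORK=HU.ET.ET(IO)' \
              '//OOPE2 TYPE=KO,WORK=TEXT.LO1.LO2,TEXT' \
              '//H1 WORK=OP.TEE.GHU,TYPE=IU' | extract_work
```

Because the parenthesis check is applied to the WORK= token only, a (...) in some other field on the same line does not suppress the output.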

Remove everything after a changing string [duplicate]

I'm having some trouble with the following problem;
As input, I get a few lines of paths to files as follows:
root/child/abc/somefile.txt
root/child/def/123/somefile.txt
root/child/ghijklm/somefile.txt
The root/child piece is always in the path, everything after can differ.
I would like to remove everything after the grandchild folder. So the output would be:
root/child/abc/
root/child/def/
root/child/ghijklm/
I've tried the following:
sed 's/\/child\/.*/\/child\/.*/'
But of course that would just give the following output:
root/child/.*
root/child/.*
root/child/.*
Any help would be appreciated!
with cut:
cut -d\/ -f1,2,3 file
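Note that cut drops the trailing slash shown in the desired output. As a small sketch, the depth can be made a parameter and the slash added back with sed (the function name first_parts is hypothetical):

```shell
# first_parts N [FILE...]
# Keeps the first N slash-separated parts of each path and appends a trailing slash.
# (Hypothetical helper; lines without any slash are passed through by cut unchanged
# and still get a slash appended.)
first_parts() {
    n=$1
    shift
    cut -d/ -f"1-$n" "$@" | sed 's%$%/%'
}

printf '%s\n' 'root/child/abc/somefile.txt' 'root/child/def/123/somefile.txt' | first_parts 3
```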
With awk: Could you please try following, written and tested with shown samples in GNU awk.
awk 'match($0,/root\/child\/[^/]*/){print substr($0,RSTART,RLENGTH)}' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
match($0,/root\/child\/[^/]*/){ ##Using match function to match root/child/... till next / in current line.
print substr($0,RSTART,RLENGTH) ##Printing the substring that starts at RSTART and is RLENGTH characters long.
}
' Input_file ##Mentioning Input_file name here.
With sed:
sed 's/.*\(root\/child\/[^/]*\).*/\1/' Input_file
Explanation: Using sed's substitution to match root/child/ up to the next occurrence of /, saving it in a capture group (back reference), and replacing the whole line with only that captured value.
This might work for you (GNU sed):
sed -E 's/^(([^/]*[/]){3}).*/\1/' file
Delete everything after the third group of non-forward-slashes/slash.
You were close.
sed 's%\(/child/[^/]*\)/.*%\1%'
The regex [^/]* matches as many characters as possible which are not a slash; then we replace the entire match with just the part we captured in parentheses, effectively trimming off the rest.
With Perl:
perl -pe 's{ ^ ( ( [^/]+ / ){3} ) .* $ }{$1}x' in_file > out_file
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-p : Loop over the input one line at a time, assigning it to $_ by default. Add print $_ after each loop iteration.
The regex uses this modifier:
x : Disregard whitespace and comments, for readability.
The substitution statement, explained:
^ : beginning of the line.
$ : end of the line.
[^/]+ / : one or more characters that are not slashes (/), followed by a slash.
( [^/]+ / ){3} : one or more non-slash characters, followed by a slash, repeated exactly 3 times.
( ( [^/]+ / ){3} ) : the above, with parentheses to capture the matched part into the first capture variable, $1, to be used later in the substitution. Capture groups are counted left to right.
.* : zero or more occurrences of any character.
s{THIS}{THAT} : replace THIS with THAT.
SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches
perldoc perlre: Perl regular expressions (regexes)
perldoc perlre: Perl regular expressions (regexes): Quantifiers; Character Classes and other Special Escapes; Assertions; Capture groups
perldoc perlrequick: Perl regular expressions quick start

Filter (or 'cut') out column that begins with 'OS=abc'

My .fasta file consists of this repeating pattern.
>sp|P20855|HBB_CTEGU Hemoglobin subunit beta OS=Ctenodactylus gundi OX=10166 GN=HBB PE=1 SV=1
asdfaasdfaasdfasdfa
>sp|Q00812|TRHBN_NOSCO Group 1 truncated hemoglobin GlbN OS=Nostoc commune OX=1178 GN=glbN PE=3 SV=1
asdfadfasdfaasdfasdfasdfasd
>sp|P02197|MYG_CHICK Myoglobin OS=Gallus gallus OX=9031 GN=MB PE=1 SV=4
aafdsdfasdfasdfa
I want to keep only the lines that contain '>' and then extract the string after 'OS=' and before 'OX=' (for example, line 1 gives Ctenodactylus gundi).
The first part('>') is easy enough:
grep '>' my.fasta | cut -d " " -f 3 >> species.txt
The problem is that the number of fields is not constant BEFORE 'OS='.
But the number of column/fields between 'OS=' and 'OX=' is 2.
You can use the -P option to enable PCRE-based regex matching, and use lookaround patterns to ensure that the match is enclosed between OS= and OX=:
grep '>' my.fasta | grep -oP '(?<=OS=).*(?= OX=)'
The space inside the lookahead keeps the trailing space out of the match. Note that the -P option is available only in GNU grep, which may not be available by default in some environments.
IMHO awk is more practical here (since it can handle the regex and the conditional printing together); could you please try the following.
awk '/^>/ && match($0,/OS=.* OX=/){print substr($0,RSTART+3,RLENGTH-7)}' Input_file
Output will be as follows.
Ctenodactylus gundi
Nostoc commune
Gallus gallus
Explanation: Adding detailed explanation for above code.
awk ' ##Starting awk program from here.
/^>/ && match($0,/OS=.* OX=/){ ##Checking if the line starts with > AND matches the regex OS=.* OX= (from OS= through the space and OX=); if both conditions are TRUE.
print substr($0,RSTART+3,RLENGTH-7) ##Then print the substring of the current line that starts at RSTART+3 (just after OS=) and is RLENGTH-7 characters long (dropping OS= and the trailing " OX=").
}
' Input_file ##Mentioning Input_file name here.
Using any awk in any shell on every UNIX box:
$ awk -F' O[SX]=' '/^>/{print $2}' file
Ctenodactylus gundi
Nostoc commune
Gallus gallus
sed solution:
$ sed -nE '/>/ s/^.*OS=(.*) OX=.*$/\1/p' .fasta
Ctenodactylus gundi
Nostoc commune
Gallus gallus
-n so that the pattern space is not printed unless requested; -E (extended regular expressions) so that the capture group and its \1 backreference can be written with plain parentheses instead of \( \). The p flag to the s command means "print the pattern space".
The regular expression is supposed to match the entire line, singling out in a subexpression the fragment we must extract. I assumed OX is preceded by exactly one space, which must not appear in the output; that can be adjusted if/as needed.
This assumes that all lines that begin with > will have an OS= ... fragment immediately followed by an OX= ... fragment; if not, that can be added to the />/ filter before the s command. (By the way - can there be some OT= ... fragment between OS=... and OX= ...?)
Question though - wouldn't you rather include some identifier (perhaps part of the "label" at the beginning of each line) for each line of output? You have the fragments you requested - but do you know where each one of them comes from?
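If an identifier is wanted, one possible sketch is to print the second |-separated part of the header next to the species, separated by a tab. This assumes every header looks like >db|ACCESSION|NAME ... OS=... OX=..., as in the sample; the function name species_with_id is made up:

```shell
# species_with_id [FILE...]
# Prints "ACCESSION<TAB>species" for every header line.
# (Hypothetical helper; assumes headers of the form >db|ACCESSION|NAME ... OS=... OX=...)
species_with_id() {
    awk -F' O[SX]=' '
    /^>/ {
        split($1, hdr, /[| ]/)      # hdr[1] is ">sp", hdr[2] is the accession
        print hdr[2] "\t" $2        # $2 is the text between " OS=" and " OX="
    }' "$@"
}

printf '%s\n' '>sp|P02197|MYG_CHICK Myoglobin OS=Gallus gallus OX=9031 GN=MB PE=1 SV=4' \
              'aafdsdfasdfasdfa' | species_with_id
```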