Add text to the end if not already added - regex

I have the following lines:
source = "git::ssh://git@github.abc.com/test//bar"
source = "git::ssh://git@github.abc.com/test//foo?ref=tf12"
resource = "bar"
I want to update every line that contains both source and git by adding ?ref=tf12 to the end of the line, but inside the closing ". If the line already contains ?ref=tf12, it should be skipped. Expected output:
source = "git::ssh://git@github.abc.com/test//bar?ref=tf12"
source = "git::ssh://git@github.abc.com/test//foo?ref=tf12"
resource = "bar"
I have the following sed expression, but its output is wrong:
sed 's#source.*git.*//.*#&?ref=tf12#' file.tf
source = "git::ssh://git@github.abc.com/test//bar"?ref=tf12
source = "git::ssh://git@github.abc.com/test//foo?ref=tf12"?ref=tf12
resource = "bar"
resource = "bar"

Using simple regular expressions for this is rather brittle; if at all possible, using a more robust configuration file parser would probably be a better idea. If that's not possible, you might want to tighten up the regular expressions to make sure you don't modify unrelated lines. But here is a really simple solution, at least as a starting point.
sed -e '/^ *source *= *"git/!b' -e '/?ref=tf12" *$/b' -e 's/" *$/?ref=tf12"/' file.tf
This consists of three commands. Remember that sed examines one line at a time.
/^ *source *= *"git/!b - if this line does not begin with source = "git (with optional spaces between the tokens), leave it alone. (! means "does not match" and b means "branch (to the end of this script)", i.e. skip this line and fetch the next one.)
/?ref=tf12" *$/b similarly says to leave alone lines which match this regex. In other words, if the line ends with ?ref=tf12" (with optional spaces after), don't modify it.
s/" *$/?ref=tf12"/ says to change the last double quote to include ?ref=tf12 before it. This will only happen on lines which were not skipped by the two previous commands.
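For readers who want to check the three-step logic outside sed, here is a minimal Python sketch of the same decisions. The function name add_ref and the constant REF are made up for illustration; the regular expressions mirror the three sed commands, and each line is assumed to arrive without its trailing newline.

```python
import re

REF = "?ref=tf12"  # suffix to add; value taken from the question


def add_ref(line: str) -> str:
    """Mirror the three sed commands on one line (no trailing newline)."""
    # 1. Not a `source = "git...` line: leave it alone.
    if not re.search(r'^ *source *= *"git', line):
        return line
    # 2. Already ends with ?ref=tf12" (optional trailing spaces): leave it alone.
    if re.search(re.escape(REF) + r'" *$', line):
        return line
    # 3. Put the suffix just before the closing double quote.
    return re.sub(r'" *$', REF + '"', line, count=1)
```

Calling add_ref on each of the three sample lines should change only the first one.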

sed '/?ref=tf12"/!s#\(source.*git.*//.*\)"#\1?ref=tf12"#' file.tf
/?ref=tf12"/! Only run the substitute command if this pattern (?ref=tf12") doesn't match.
\(...\)" and \1: Instead of appending to the entire match using &, only match the line up to the last ". The parentheses capture everything before that " into a group, which the replacement refers to as \1. (The " is re-added in the replacement so that it doesn't get lost.)
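The difference between & and a capture group may be easier to see side by side. The following Python sketch is only an analogy (Python writes sed's & as \g<0>), applied to one sample line from the question:

```python
import re

line = 'source = "git::ssh://git@github.abc.com/test//bar"'

# Equivalent of sed's `&`: the whole match, closing quote included,
# so the suffix lands after the quote.
whole_match = re.sub(r'source.*git.*//.*', r'\g<0>?ref=tf12', line)

# Equivalent of `\(...\)"` with `\1`: the quote stays outside the group
# and is re-added after the suffix.
grouped = re.sub(r'(source.*git.*//.*)"', r'\1?ref=tf12"', line)

print(whole_match)
print(grouped)
```

The first result reproduces the misplaced suffix from the question; the second keeps it inside the quotes.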

Related

Perl search and replace until positive lookahead over several lines - not working as expected?

The overall goal here is to remove a block of text starting with a particular string and ending with a positive lookahead. From the testing I've done, it seems that newlines are causing the problem, but I'm not sure what exactly is going on or the best way to fix it.
More context: I want to remove taxa from a .fasta file, including the taxon name and header information and the associated sequence. (fasta format begins with a header >locusname-locusnumber-species_name |locusname-locusnumber \n). Missing data in the sequence is coded as "-". Eventually I would like to do this for several species_names and do so for each of several thousand files in a directory.
I presumed this would be a simple task to do as a perl one-liner in bash (Ubuntu 18.04.2).
As an example, from the excerpt below I would like to remove the entire sequence of Pseudomyrmex seminole D1367, i.e. the string that starts with >uce-483_Pseudomyrmex_seminole_D1367 |uce-483 and ends with the newline before >uce-483_Pseudomyrmex_seminole_D1435 ...
For this, I have: perl -pe 's/>(.)+(Pseudomyrmex_seminole_D1367)[\s\S]+(?=>)//' infile.fasta > outfile.fasta
or equivalently perl -pe 's/>(.)+(Pseudomyrmex_seminole_D1367(.)+(?=>)//s' infile.fasta > outfile.fasta
Both of these seem to have no effect at all (i.e. diff infile.fasta outfile.fasta is empty.) If I remove the positive lookahead, it works correctly but only up to the first newline.
Here's an excerpt from the .fasta for context and testing:
>uce-483_Pseudomyrmex_seminole_D1366 |uce-483
------------------------------------------------------------
---------------------------------------------------tgtaaacgt
tataatacatgcgtatgaaaaaaaaaagtgaacacccggtacgtacccgtgctgaaacgt
tcagatttacatccatttgtagtagcattttcgctagttttttcaagagcaaaaaggaca
cattcaaaactgaatatacatgtcacagatgtttgtttgtgtgcaggtacctgtaatttt
gcaaacatatacctatatatgtgtgtcgcatatatatcatgtagtagatttccatgttat
gcaacatcttctcacaatgacaatcggtcgtttccttcactccgaaatgttcatgcgaac
agttaatctatatcccaagcagcgatgtaatgttatgcggcgcgcaagtctcattagact
tgtaaaccgtccgagtttcgacttaccata----tgtgtgtgtgtgcgcgcgtatgtgca
cgtac------acacgtttgtttatacatttgtctatacatttgcgtgtgaacgcgggat
gaacagagatttgcgcacacatagacatgagaaacgtcacttgtcgatgtagatactaat
tgtggaaaatacatattcctcttcagatacacgggaatgttgaattattttcactcgctc
cacgcgcgagtgttcgctccttttacgcacaacgagtccttctgctgcagc--gagatag
aaaatatttttgcgcggtaatcgtaaacgtatgagtgcctttcgacgtgaattctcttat
ggcagttctcacggtgtaaattataatcgaattaacattgcgagtgtgatctcaatataa
ttatagcgtctaagaacaaacacgtaacatgcacacacacacacacacac----------
---
>uce-483_Pseudomyrmex_seminole_D1367 |uce-483
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
--ttcaaaactgaatatacatgtcacagatgtttgtttgtgtgcaggtacctgtaatttt
gcaaacatatg---atatatatgtgtcgcatatatatcatgtagtagatttccatgttat
gcaacatcttctcacaatgacaatcggtcgtttccttcactctgaaatgttcatgcgaac
agttaatctatatcccaagcagcgatgtaatgttatgcggcgcgcaagtctcattagact
tgtaaaccgtccgagtttcgacttaccata--tgtgtgtgtgtgtgtgcgcgtatgtgca
cgtacgcgcgcacacgtttgtttatacatttgtctatacatttgcgtgtgaacgcgggat
gaacagagatttgcgcacacatagacatgagaaacgtcacttgtcgatg-----------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
---
>uce-483_Pseudomyrmex_seminole_D1435 |uce-483
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
-------tacatccatttgtagtagcattttcgctagttttttcaagagcaaaaaggaca
cattcaaaactgaatatacatgtcacagatgtttgtttgtgtgcaggtacctgtaatttt
gcaaacatatacctatatatgtgtgtcgcatatatatcatgtagtagatttccatgttat
gcaacatcttctcacaatgacaatcggtcgtttccttcactccgaaatgttcatgcgaac
agttaatctatatcccaagcagcgatgtaatgttatgcggcgcgcaagtctcattagact
tgtaaaccgtccgagtttcgacttaccata--tgtgtgtgtgtgtgtgcgcgtatgtgca
cgtac------acacgtttgtttatacatttgtctatacatttgcgtgtgaacgcgggat
gaacagagatttgcgcacacatagacatgagaaacgtcacttgtcgatgtagatactaat
tgtggaaaatacatattcctcttcagatacacgggaa-----------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
---
With -p (or -n) the one-liner reads one line at a time, so it simply can't match multiline patterns. One solution is to "slurp" the whole file in, if it isn't too large (see the end for a line-by-line solution):
perl -0777 -pe'...' in > out
See Command Switches in perlrun.
Then, the code shown in the question has an unbalanced parenthesis and doesn't compile. Further, there is no reason to capture those .s, so drop the parentheses around them. Next, the pattern
s/>.+Pseudomyrmex_seminole_D1367...//;
matches everything from the very first > to the name of interest, so all preceding sequences are matched and removed as well. Instead, match >[^>]+...D1367 for example, so everything that isn't > after a >, to that phrase.
Finally, the last .+(?=>) will match everything to the very last > and thus the regex will remove all following sequences, not what you want according to the description. Instead, limit it to match to the first following >, either by making it "non-greedy" with .+?(?=>) or, more simply, with [^>]+.
All corrected
perl -0777 -pe's/>[^>]+?Pseudomyrmex_seminole_D1367[^>]+//' in > out
Note that there is no need for /s modifier now, since its purpose is to make . match a newline and here we don't need that since the [^>] does match newlines as well (anything other than >). The quantifier is +? to (hopefully) prevent backtracking each whole sequence that doesn't match.
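To see both greediness problems at once, here is a small Python sketch using the same patterns on shortened records. The sequences are abbreviated, and the fourth record (D1436) is made up purely so that the over-matching is visible on both sides of the target; re.DOTALL plays the role of Perl's /s, and count=1 mimics a substitution without /g.

```python
import re

# Shortened stand-ins for the records in the excerpt. D1436 is a made-up
# fourth record, not part of the original data.
records = [
    ">uce-483_Pseudomyrmex_seminole_D1366 |uce-483\nacgt\n---\n",
    ">uce-483_Pseudomyrmex_seminole_D1367 |uce-483\nttca\n---\n",
    ">uce-483_Pseudomyrmex_seminole_D1435 |uce-483\ntaca\n---\n",
    ">uce-483_Pseudomyrmex_seminole_D1436 |uce-483\ngggg\n---\n",
]
fasta = "".join(records)
target = "Pseudomyrmex_seminole_D1367"

# Greedy `.+` with DOTALL: matches from the first `>` up to the last `>`.
greedy = re.sub(">.+" + target + ".+(?=>)", "", fasta, count=1, flags=re.DOTALL)

# `[^>]+` cannot cross a `>`, so only the targeted record is removed.
bounded = re.sub(">[^>]+?" + target + "[^>]+", "", fasta, count=1)

print(greedy)
print(bounded)
```

With the greedy pattern only the last record survives; with the negated character class only the D1367 record is dropped.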
Or, with your original use of lookahead
perl -0777 -pe's/>[^>]+?Pseudomyrmex_seminole_D1367.+?(?=>)//s' in > out
These work as expected with your sample, as well as with an extended example I made up with further sequences (>...) added.
For reference, and since a fasta file can be too big to slurp into a string, here it is line by line.
Once you see the >... line of interest set a flag; print a line if that flag isn't set (and if we aren't on that very line). Once you reach the next > clear the flag (print that line, too).
perl -ne'
if (/^>.+?Pseudomyrmex_seminole_D1367/) { $f = 1 }
elsif (not $f) { print }
elsif (/^>/) { $f = 0; print }
' in > out
I suspect that this may also perform considerably better on very large files.
The regex in the first solution has to scan each sequence whole in order to find that it is not the one of interest; it is only once it hits the next > that it can decide that the sequence doesn't match (and with no backtracking, hopefully, since +? would've stopped it had the right phrase been encountered).
Here the code mostly checks the first character and a flag.
So it's an incomparably lighter workload here, but the regex engine is started up on every line, and that is expensive. I can't tell with confidence how the two approaches stack up against each other without trying.
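The flag idea carries over to other languages almost unchanged. Below is a minimal Python sketch, not a translation of the Perl code: it is a slight variation in which every header line sets or clears the flag. The function name drop_record is invented for this example, and lines are assumed to keep their trailing newlines, as they do when iterating over an open file.

```python
from typing import Iterable, Iterator


def drop_record(lines: Iterable[str], name: str) -> Iterator[str]:
    """Yield every line except the record whose header mentions `name`."""
    skipping = False
    for line in lines:
        if line.startswith(">"):
            # Each header decides whether the lines that follow are kept.
            skipping = name in line
        if not skipping:
            yield line
```

Because it is a generator, it can be fed an open file object and its output passed to writelines on another, so the whole file never has to be held in memory.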
You can also use > as the input record separator. This way you avoid slurping the whole file, and since the main loop loads your file block by block, you only have to test whether a block is the target in order not to print it (without describing the whole block in a pattern):
perl -ln076e's/\n$//;print ">$_" if $_ && !/Pseudomyrmex_seminole_D1367/' file
The l switch sets the output record separator to the input record separator (a newline by default).
The 0 switch sets the input record separator to > (76 in octal).
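The record-separator idea can be illustrated in Python as well. This is only a sketch: for brevity it takes the whole text as one string, unlike the Perl one-liner, which reads one block at a time. The function name drop_by_record is hypothetical.

```python
def drop_by_record(text: str, name: str) -> str:
    """Split on '>' (as perl -0076 does) and rebuild without the matching record."""
    # The piece before the first '>' is empty and is skipped, like `$_ &&` in Perl.
    kept = [rec for rec in text.split(">") if rec and name not in rec]
    return "".join(">" + rec for rec in kept)
```

Each piece is tested as a whole, so the target name only has to appear somewhere in the block; no pattern describing the full record is needed.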

Regex: keep same pattern found multiple times in same line and replace line by appending single pattern in front

Is it possible with Notepad++ (or maybe from a Linux bash shell) to create multiple lines from a pattern found, as many times as the pattern is found, and also prepend a single found pattern to each newly created line?
The multi pattern is val=[0-9]+
The single pattern is id=[a-zA-Z0-9]+
Example:
Input lines:
id=af2477,val=333,val=777
id=af3456,val=222,val=444,val=678
id=af3327,val=3234,val=123,val=701
Output lines:
id=af2477,val=333
id=af2477,val=777
id=af3456,val=222
id=af3456,val=444
id=af3456,val=678
id=af3327,val=3234
id=af3327,val=123
id=af3327,val=701
I have tried with 2 subgroups but it won't work. It only replaces the second group once:
find what:(id=[a-zA-Z0-9]+,)(val=[0-9]+,)*
replace:\n\1,\2
UPDATE: Both answers from Toto and Wiktor Stribiżew seem to do the job. Haven't tested them yet. I would still like to see how this can work with the use of Notepad++ (even if multiple steps are needed)
Since you also consider using Linux tools for this, an awk solution looks much more viable:
awk 'BEGIN{FS=OFS=","} /^id=[a-zA-Z0-9]+(,val=[0-9]+)*$/{
for(i=2; i<=NF; i++) {
print $1,$i
}; next;
}{print $0}' file > outfile
Here, any line that matches ^id=[a-zA-Z0-9]+(,val=[0-9]+)*$ (i.e. matches the format of the lines you need to expand) is split the way you need with for(i=2; i<=NF; i++) {print $1,$i}; next;. Else, the line is written as is (print $0).
The BEGIN{FS=OFS=","} part sets the input and output field separator to a comma.
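As a cross-check of that logic, here is a minimal Python sketch of the same split-and-repeat step. The names LINE_RE and expand are invented for this example; the regular expression is the one from the awk condition, and fullmatch stands in for the ^ and $ anchors.

```python
import re
from typing import List

LINE_RE = re.compile(r"id=[a-zA-Z0-9]+(,val=[0-9]+)*")


def expand(line: str) -> List[str]:
    """Return one `id,val` line per val; pass other lines through unchanged."""
    if not LINE_RE.fullmatch(line):
        return [line]
    # Field 1 is the id; every remaining field is one val.
    first, *vals = line.split(",")
    return [f"{first},{val}" for val in vals]
```

As in the awk version, a matching line with an id but no val fields produces no output at all.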
This perl one-liner does the job (output on STDOUT):
perl -anE '($id,$vals)=/(id=\w+),(.+)$/;say "$id,$_" for split/,/,$vals' file
id=af2477,val=333
id=af2477,val=777
id=af3456,val=222
id=af3456,val=444
id=af3456,val=678
id=af3327,val=3234
id=af3327,val=123
id=af3327,val=701
Explanation:
($id,$vals)=/(id=\w+),(.+)$/; # explode id and values for each line in input file
say "$id,$_" for split/,/,$vals # print id and each value
You can redirect the output to another file:
perl -anE '($id,$vals)=/(id=\w+),(.+)$/;say "$id,$_" for split/,/,$vals' file > outputfile
Or do the change in-place:
perl -i -anE '($id,$vals)=/(id=\w+),(.+)$/;say "$id,$_" for split/,/,$vals' file
It is possible, yet very complex, to do that with one regular expression, for which you would have to use (?R) and conditional statements.
With multiple steps it would be pretty simple. You can, for instance, do a find and replace using the maximum number of val fields found in the longest lines. For example, if 4 is the largest number of val fields, then we'll have four of (,val=[^\r\n,]*) in our initial expression:
^(id=[^\r\n,]*)(,val=[^\r\n,]*)(,val=[^\r\n,]*)(,val=[^\r\n,]*)(,val=[^\r\n,]*)$
and replace that with four lines,
$1$2\n$1$3\n$1$4\n$1$5
For any additional step, we can simply remove one val and one line from the end of initial expression and replacement. For example, our expression would look like
^(id=[^\r\n,]*)(,val=[^\r\n,]*)(,val=[^\r\n,]*)(,val=[^\r\n,]*)$
in the second step, for which we'd replace it with:
$1$2\n$1$3\n$1$4
In the third and final step, our expression has two vals,
^(id=[^\r\n,]*)(,val=[^\r\n,]*)(,val=[^\r\n,]*)$
and our replacement will have two lines:
$1$2\n$1$3
For the example in the question, only two steps are required, and the second and third expressions would likely work just fine.
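Since each step differs only in the number of val groups, the find and replace pairs can be generated rather than typed. The sketch below does this in Python as an illustration of the stepwise idea; it is not something Notepad++ runs. The helper names build_step and expand_all are made up, and Python writes group references as \g<n> where Notepad++ uses $n.

```python
import re


def build_step(k: int):
    """Return (pattern, replacement) for lines with exactly k val fields."""
    pattern = r"^(id=[^\r\n,]*)" + r"(,val=[^\r\n,]*)" * k + "$"
    # One output line per val: group 1 (the id) followed by group i.
    replacement = "\n".join(rf"\g<1>\g<{i}>" for i in range(2, k + 2))
    return pattern, replacement


def expand_all(text: str, max_vals: int) -> str:
    """Apply the steps from the longest line shape down to two vals."""
    for k in range(max_vals, 1, -1):
        pattern, replacement = build_step(k)
        text = re.sub(pattern, replacement, text, flags=re.MULTILINE)
    return text
```

Lines produced by an earlier step carry a single val, so the later, shorter patterns leave them alone.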

How do I delete the first 2 lines which match a text given by me (using sed)?

How do I delete the first 2 lines which match a text given by me (using sed!)?
E.g.:
# file.txt contains the following lines:
abc
def
def
abc
abc
def
And I want to delete the first 2 "abc" lines, using sed.
While @EdMorton has pointed out that sed is not the best tool for this job (if you wonder why exactly, see my answer below and compare it to the awk code), my research showed that the solution to the generalized problem
Delete occurrences "N" through "M" of a line matching a given pattern using sed
is indeed a very tricky one, in my opinion. There seem to be many suggestions for how to replace the "N"th occurrence of a matching pattern with sed, but I found that deleting a specific matching line (or a range of lines) is a much more complex undertaking.
While the generalized problem with arbitrary values for N, M, and the pattern would probably be solved best by writing a "sed script generator" on the basis of a Finite State Machine, the solution to the special case asked by the OP is still simple enough to be coded by hand. I must admit that I wasn't very familiar with the obfuscated intricacies of the sed command syntax before, but I found this challenge to be quite useful for gaining more experience with non-trivial sed usage.
Anyway, here's my solution for deleting the first two occurrences of a line containing "abc" in a file. If there's a simpler approach, I'm eager to learn about it, as this has taken me some time now.
A final caveat: this assumes GNU sed, as I was unable to find a solution with POSIX sed:
sed -n ':1;/abc/{n;b2;};p;$b4;n;b1;:2;/abc/{n;b3;};p;$b4;n;b2;:3;p;$b4;n;b3;:4;q' file
or, in more verbose syntax:
sed -n '
# BEGIN - look for first match
:first;
/abc/ {
# First match found. Skip line and jump to second section
n; bsecond;
};
# Line does not match. Print it and quit if end-of-file reached
p; $bend;
# Advance to next line and start over
n; bfirst;
# END - look for first match
# BEGIN - look for second match
:second;
/abc/ {
# Second match found. Skip line and jump to final section
n; bfinal;
}
# Line does not match. Print it and quit if end-of-file reached
p; $bend;
# Advance to next line and start over
n; bsecond;
# END - look for second match
# BEGIN - both matches found; print remaining lines
:final;
# Print line and quit if end-of-file reached
p; $bend;
# Advance to next line and start over
n; bfinal;
# END - print remaining lines
# QUIT
:end;
q;
' file
sed is for simple substitutions on individual lines, that is all. For anything else you should be using awk:
$ awk '!(/abc/ && ++c<3)' file
def
def
abc
def
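For comparison, the generalized problem mentioned earlier in this thread (delete occurrences N through M of a matching line) needs only a counter once a general-purpose language is allowed. This is a minimal Python sketch, with the function name drop_matches and its parameters chosen for illustration; it does a plain substring test rather than a regex match.

```python
from typing import Iterable, Iterator


def drop_matches(lines: Iterable[str], needle: str, first: int, last: int) -> Iterator[str]:
    """Skip the first-th through last-th lines containing `needle` (1-based)."""
    seen = 0
    for line in lines:
        if needle in line:
            seen += 1
            if first <= seen <= last:
                continue  # this occurrence falls inside the range to delete
        yield line
```

With first=1 and last=2 on the sample file this should leave the same four lines as the awk command.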

regex Pattern Matching over two lines - search and replace

I have a text document that I need help with. The example below is an extract of a tab-delimited text document in which the first line of each 3-line pattern always starts with a number. The document will always be in this format, with the same tabbed layout on each of the three lines.
nnnn **variable** V -------
* FROM CLIP NAME - **variable**
* LOC: variable variable **variable**
I want to replace the second field on the first line with the fourth field on the third line. And then replace the field after the colon on the second line with the original second field on the first line. Is this possible with regex? I am used to single line search replace function but not multiline patterns.
000003 A009C001_151210_R6XO V C 11:21:12:17 11:21:57:14 01:00:18:22 01:01:03:19
*FROM CLIP NAME: 5-1A
*LOC: 01:00:42:15 WHITE 005_NST_010_E02
000004 B008C001_151210_R55E V C 11:21:18:09 11:21:53:07 01:01:03:19 01:01:38:17
*FROM CLIP NAME: 5-1B
*LOC: 01:01:20:14 WHITE 005_NST_010_E03
The Result would look like :
000003 005_NST_010_E02 V C 11:21:12:17 11:21:57:14 01:00:18:22 01:01:03:19
*FROM CLIP NAME: A009C001_151210_R6XO
*LOC: 01:00:42:15 WHITE 005_NST_010_E02
000004 005_NST_010_E03 V C 11:21:18:09 11:21:53:07 01:01:03:19 01:01:38:17
*FROM CLIP NAME: B008C001_151210_R55E
*LOC: 01:01:20:14 WHITE 005_NST_010_E03
Many Thanks in advance.
A regular expression defines a regular language. Alone, this only expresses a structure of some input. Performing operations on this input requires some kind of processing tool. You didn't specify which tool you were using, so I get to pick.
Multiline sed
You wrote that you are "used to single line search replace function but not multiline patterns." Perhaps you are referring to substitution with sed. See "How can I use sed to replace a multi-line string?". It is more complicated than with a single line, but it is possible.
An AWK script
AWK is known for its powerful one-liners, but you can also write scripts. Here is a script that identifies the beginning of a new record/pattern using a regular expression to match the first number. (I hesitate to call it a "record" because this has a specific meaning in AWK.) It stores the fields of the first two lines until it encounters the third line. At the third line, it has all the information needed to make the desired replacements. It then prints the modified first two lines and continues. The third line is printed unchanged (you specified no replacements for the third line). If there are additional lines before the start of the next record/pattern, they will also be printed unchanged.
It's unclear exactly where the tab characters are in your sample input because the submission system has replaced them with spaces. I am assuming there is a tab between FROM CLIP NAME: and the following field and that the "variables" on the first and third line are also tab-separated. If the first number of each record/pattern is hexadecimal instead of decimal, replace the [[:digit:]] with [[:xdigit:]].
fixit.awk
#!/usr/bin/awk -f
BEGIN { FS="\t"; n=0 }
{n++}
/^[[:digit:]]+\t/ { n=1 }
# Split and save first two lines
n==1 { line1_NF = split($0, line1, FS); next }
n==2 { line2_NF = split($0, line2, FS); next }
n==3 {
# At the third line, make replacements
line1_2 = line1[2]
line1[2] = $4
line2[2] = line1_2
# Print modified first two lines
printf "%s", line1[1]
for ( i=2; i<=line1_NF; ++i )
printf "\t%s", line1[i]
print ""
printf "%s", line2[1]
for ( i=2; i<=line2_NF; ++i )
printf "\t%s", line2[i]
print ""
}
1 # Print lines after the second unchanged
You can use it like
$ awk -f fixit.awk infile.txt
or to pipe it in
$ cat infile.txt | awk -f fixit.awk
This is not the most regular expression inspired solution, but it should make the replacements that you want. For a more complex structure of input, an ideal solution would be to write a scanner and parser that correctly interprets the full input language. Using tools like string substitution might work for simple specific cases, but there could be nuances and assumptions you've made that don't apply in general. A parser can also be more powerful and implement grammars that can express languages which can't be recognized with regular expressions.
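The same buffering approach can be sketched in Python for readers who want to test the field swap on a few records. It makes the same assumption as the answer, namely that fields are separated by tabs, and it expects lines without trailing newlines. The function name fix_records is invented for this example.

```python
import re
from typing import Iterable, Iterator


def fix_records(lines: Iterable[str]) -> Iterator[str]:
    """Swap fields across each 3-line record; fields are tab-separated."""
    pending = []  # first two lines of the current record, split into fields
    n = 0         # position within the current record, as in fixit.awk
    for line in lines:
        if re.match(r"[0-9]+\t", line):
            n, pending = 1, []  # a leading number starts a new record
        else:
            n += 1
        if n <= 2:
            pending.append(line.split("\t"))
            continue
        if n == 3:
            first, second = pending
            # Field 4 of line 3 goes to line 1; the old field 2 of line 1
            # goes to line 2.
            first[1], second[1] = line.split("\t")[3], first[1]
            yield "\t".join(first)
            yield "\t".join(second)
            pending = []
        yield line  # third and later lines pass through unchanged
```

Like the awk script, it holds back the first two lines of a record until the third one arrives, so a record truncated at the end of the input is not printed.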

Regular Expression over multiple lines

I'm stuck with this for several hours now and cycled through a wealth of different tools to get the job done. Without success. It would be fantastic, if someone could help me out with this.
Here is the problem:
I have a very large CSV file (400mb+) that is not formatted correctly. Right now it looks something like this:
This is a long abstract describing something. What follows is the tile for this sentence."
,Title1
This is another sentence that is running on one line. On the next line you can find the title.
,Title2
As you can probably see the titles ",Title1" and ",Title2" should actually be on the same line as the foregoing sentence. Then it would look something like this:
This is a long abstract describing something. What follows is the tile for this sentence.",Title1
This is another sentence that is running on one line. On the next line you can find the title.,Title2
Please note that the end of the sentence can contain quotes or not. In the end they should be replaced too.
Here is what I came up with so far:
sed -n '1h;1!H;${;g;s/\."?.*,//g;p;}' out.csv > out1.csv
This should actually get the job done of matching the expression over multiple lines. Unfortunately it doesn't :)
The expression is looking for the dot at the end of the sentence and the optional quotes plus a newline character that I'm trying to match with .*.
Help much appreciated. And it doesn't really matter what tool gets the job done (awk, perl, sed, tr, etc.).
Multiline in sed isn't necessarily tricky per se; it's just that it uses commands most people aren't familiar with, and those commands have certain side effects, like delimiting the current line from the next line with a '\n' when you use 'N' to append the next line to the pattern space.
Anyway, it's much easier if you match on a line that starts with a comma to decide whether or not to remove the newline, so that's what I did here:
sed 'N;/\n,/s/"\? *\n//;P;D' title_csv
Input
$ cat title_csv
don't touch this line
don't touch this line either
This is a long abstract describing something. What follows is the tile for this sentence."
,Title1
seriously, don't touch this line
This is another sentence that is running on one line. On the next line you can find the title.
,Title2
also, don't touch this line
Output
$ sed 'N;/\n,/s/"\? *\n//;P;D' title_csv
don't touch this line
don't touch this line either
This is a long abstract describing something. What follows is the tile for this sentence.,Title1
seriously, don't touch this line
This is another sentence that is running on one line. On the next line you can find the title.,Title2
also, don't touch this line
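The look-at-the-next-line idea can also be written as a small streaming filter, which may matter for a file of this size. The following Python sketch holds back one line at a time. The function name join_titles is hypothetical, lines are assumed to arrive without trailing newlines, and unlike the sed command it glues several consecutive comma lines onto the same sentence.

```python
import re
from typing import Iterable, Iterator


def join_titles(lines: Iterable[str]) -> Iterator[str]:
    """Glue each line starting with ',' onto the line before it."""
    previous = None
    for line in lines:
        if previous is not None and line.startswith(","):
            # Drop an optional closing quote and trailing spaces, then join.
            previous = re.sub(r'"? *$', "", previous, count=1) + line
            continue
        if previous is not None:
            yield previous
        previous = line
    if previous is not None:
        yield previous
```

Only one line is buffered at any time, so memory use does not depend on the size of the input.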
Yours works with a couple of small changes:
sed -n '1h;1!H;${;g;s/\."\?\n,/.,/g;p;}' inputfile
The ? needs to be escaped and . doesn't match newlines. The replacement also has to put back the period and the comma, since both are part of the match.
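The slurp-and-substitute idea looks like this in Python. It is only a sketch under two assumptions: the whole file fits in memory, and the goal is the joined output from the question, so the period and the comma are written back in the replacement. The function name join_titles_slurped is made up.

```python
import re


def join_titles_slurped(text: str) -> str:
    """Join ',Title' lines onto the sentence above them in one substitution."""
    # Match a period, an optional quote, and the newline in front of a comma;
    # keep the period and the comma, drop the quote and the newline.
    return re.sub(r'\."?\n,', ".,", text)
```

For a 400 MB file this holds the entire text in memory at once, which is the same trade-off the hold-space version makes.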
Here's another way to do it which doesn't require using the hold space:
sed -n '${p;q};N;/\n,/{s/"\?\n//p;b};P;D' inputfile
Here is a commented version:
sed -n '
# on the last input line: print and quit
${
p
q
}
# otherwise, append the next line
N
# if the appended line starts with a comma
/\n,/{
# delete an optional double quote and the newline, then print the result
s/"\?\n//p
# branch to the end of the script to read the next line
b
}
# the appended line does not start with a comma, so print the first line
P
# delete the first line of the pair (it has just been printed) and loop to the top
D
' inputfile