rename specific lines in a text file with sed - regex

I have a file that looks like this:
>alks|keep1|aoiuor|lskdjf
ldkfj
alksj
asdflkj
>jhoj_kl|keep2|kjghoij|adfjl
aldskj
alskj
alsdkj
I would like to edit just the lines starting with >, ideally in-place, to get a file:
>keep1
ldkfj
alksj
asdflkj
>keep2
aldskj
alskj
alsdkj
I know that in principle this is achievable with various combinations of sed/awk/cut, but I haven't been able to figure out the right combination. Ideally it should be fast - the file has many millions of lines, and many of the lines are also very long.
Key things about the lines I want to edit:
Always start with >
The bit I want to keep is always between the first and second pipe symbol | (hence thinking cut is going to help)
The bit I want to keep has alphanumeric symbols and sometimes underscores. The rest of the string on the same line can have any symbols
What I've tried that seems helpful
(Most of my sed attempts are pure garbage)
cut -d '|' -f 2 test.txt
Gets me the bit of the string that I want, and it keeps the other lines too. So it's close, but (of course) it doesn't preserve the initial > on the lines where cut applies, so it's missing a crucial part of the solution.

With sed:
sed -E '/^>/ s/^[^|]+\|([^|]+).*/>\1/'
/^>/ to select lines starting with >, not strictly necessary for given sample but sometimes this provides faster result than using s alone
^[^|]+\| this will match the non-| characters from the start of the line, up to and including the first |
([^|]+) capture the second field
.* rest of the line
>\1 replacement string where \1 will have the contents of ([^|]+)
If your input has only ASCII characters, this would give you much faster results:
LC_ALL=C sed -E '/^>/ s/^[^|]+\|([^|]+).*/>\1/'
Timing
Timing on a huge file built by repeating the given input sample, awk is much faster than sed, and mawk is faster still
However, OP reports that the sed solution is faster for the actual data
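Since the question asks for an in-place edit, here is a minimal sketch of the same substitution with -i. It assumes a sed that accepts -i with an attached backup suffix (GNU and BSD sed both do); the scratch file created with mktemp is only a hypothetical stand-in for the real file.

```shell
# Hypothetical scratch file standing in for the real (much larger) input.
file=$(mktemp)
printf '%s\n' '>alks|keep1|aoiuor|lskdjf' 'ldkfj' \
              '>jhoj_kl|keep2|kjghoij|adfjl' 'aldskj' > "$file"

# -i.bak rewrites the file in place and keeps the original as "$file.bak".
LC_ALL=C sed -i.bak -E '/^>/ s/^[^|]+\|([^|]+).*/>\1/' "$file"

cat "$file"
```

Once the result has been checked, the .bak copy can be deleted.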

With your shown samples, you could simply try the following. The field separator is set to | for all lines of Input_file; the main program then checks whether the line starts with >, and if so prints > followed by the 2nd field, otherwise it prints the complete line.
awk -F'|' '/^>/{print ">"$2;next} 1' Input_file
Explanation: a detailed breakdown of the above.
awk -F'|' ' ##Starting awk program from here and setting field separator as | here.
/^>/{ ##Checking condition if line starts from > then do following.
print ">"$2 ##Printing > followed by the 2nd field of current line here.
next ##next will skip all further statements from here.
}
1 ##Will print current line.
' Input_file ##mentioning Input_file name here.

You can also use the following awk command:
awk -F\| '/^>/{print ">"$2} !/^>/{print}' file
# Inplace replacement with gawk (GNU awk)
gawk -i inplace -F\| '/^>/{print ">"$2} !/^>/{print}' file
# "Inline-like" replacement with any awk
awk -F\| '/^>/{print ">"$2} !/^>/{print}' file > tmp && mv tmp file
Here,
-F\| - sets the field separator to a | char
/^>/ is the condition: if line starts with > (and !/^>/ means the opposite)
{print ">"$2} prints the Field 2 value with a > char prepended to it
{print} simply prints the full line.
Note that !/^>/{print} can be reduced to !/^>/, as print is the default action.
See an online demo:
s='>alks|keep1|aoiuor|lskdjf
ldkfj
alksj
asdflkj
>jhoj_kl|keep2|kjghoij|adfjl
aldskj
alskj
alsdkj'
awk -F\| '/^>/{print ">"$2} !/^>/{print}' <<< "$s"
Output:
>keep1
ldkfj
alksj
asdflkj
>keep2
aldskj
alskj
alsdkj

Related

Skipping a part of a line using sed

I have a file with content like so - #1: 00001109
Each line is of the same format. I want the final output to be #1: 00 00 11 09.
I used command in sed to introduce a space every 2 characters - sed 's/.\{2\}/& /g'. But that will give me spaces in the part before the colon too which I want to avoid. Can anyone advise how to proceed?
Could you please try following, written and tested with shown samples.
awk '{gsub(/../,"& ",$2);sub(/ +$/,"")} 1' Input_file
Explanation: gsub globally substitutes each pair of characters in the 2nd field with itself followed by a space. Once this is done, a single sub removes the trailing space that this leaves at the end of the line.
With sed:
sed -E 's/[0-9]{2}/& /g;s/ +$//' Input_file
Explanation: Globally substituting each pair of digits with itself followed by a space, then removing the trailing space at the end of the line (added by the previous substitution).
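One difference between the two commands above: the sed version pairs up digits anywhere on the line, while the awk version only touches the 2nd field. The sketch below shows this with a hypothetical label that has two digits (#12:), which is not in the question's sample.

```shell
# Hypothetical input: the label before the colon contains two consecutive digits.
line='#12: 00001109'

sed_out=$(printf '%s\n' "$line" | sed -E 's/[0-9]{2}/& /g;s/ +$//')
awk_out=$(printf '%s\n' "$line" | awk '{gsub(/../,"& ",$2);sub(/ +$/,"")} 1')

printf 'sed: %s\n' "$sed_out"   # sed: #12 : 00 00 11 09
printf 'awk: %s\n' "$awk_out"   # awk: #12: 00 00 11 09
```

If the labels never go beyond a single digit, both commands give the same result.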
This might work for you (GNU sed):
sed 's/[0-9][0-9]\B/& /g' file
After a pair of digits within a word, insert a space.
If perl happens to be your option, how about:
perl -pe '1 while s/(\d+)(\d\d)/$1 $2/g' file
You can use pure bash:
while read -r first last; do
  out=$first
  # Append the digits two at a time, each chunk preceded by a space.
  for ((i = 0; i < ${#last}; i += 2)); do
    out+=" ${last:i:2}"
  done
  printf '%s\n' "$out"
done < your_file.txt

Print everything before relevant symbol and keep 1 character after relevant symbol

I'm trying to find a one-liner to print everything before the relevant symbol and keep just 1 character after it:
Input:
thisis#atest
thisisjust#anothertest
just#testing
Desired output:
thisis#a
thisisjust#a
just#t
awk -F"#" '{print $1 "#" }' will almost give me what I want but I need to find a way to print the second character as well. Any ideas?
You can substitute what's after the first character after # with nothing with sed:
sed 's/\(#.\).*/\1/'
You could use grep:
$ grep -o '[^#]*#.' infile
thisis#a
thisisjust#a
just#t
This matches a sequence of characters other than #, followed by # and any character. The -o option retains only the match itself.
With the special RT variable in GNU's awk, you can do:
awk 'BEGIN{RS="#.|\n"}RT!="\n"{print $0 RT}'
Get the index of the '#', then pull out the substring.
$ awk '{print substr($0,1,index($0,"#")+1);}' in.txt
thisis#a
thisisjust#a
just#t
1st Solution: Could you please try following.
awk 'match($0,/[^#]*#./){print substr($0,RSTART,RLENGTH)}' Input_file
The above prints only the lines that contain # and drops the lines that do not. If you want those lines printed unchanged as well, use the following instead.
awk 'match($0,/[^#]*#./){print substr($0,RSTART,RLENGTH);next} 1' Input_file
2nd solution:
awk 'BEGIN{FS=OFS="#"} {print $1,substr($2,1,1)}' Input_file
Some small variations of Ravinder's 2nd solution:
awk -F# '{print $1"#"substr($2,1,1)}' file
awk -F# '{print $1FS substr($2,1,1)}' file
Another grep variation (shortest posted so far):
grep -oP '.+?#.' file
-o print only the matching part
-P Perl regex (needed for +?)
. any character
+ one or more times
? but as few as possible, stopping at the first:
#
. plus one more character
If we do not add the ?, the line test#one#two becomes test#one#t instead of test#o, due to the greedy +
If you want to use awk, the cleanest way to do this is with index, which finds the position of a character:
awk 'n=index($0,"#") { print substr($0,1,n+1) }' file
There are, however, shorter and more dedicated tools for this. See the other answers.
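The approaches in this thread differ once a line contains more than one #. A small sketch using the test#one#two line mentioned above (not part of the question's sample): sed and awk index stop at the first #, while grep -o reports every non-overlapping match on its own line.

```shell
line='test#one#two'

sed_out=$(printf '%s\n' "$line" | sed 's/\(#.\).*/\1/')
awk_out=$(printf '%s\n' "$line" | awk '(n = index($0, "#")) { print substr($0, 1, n + 1) }')
grep_out=$(printf '%s\n' "$line" | grep -o '[^#]*#.')

printf 'sed:  %s\n' "$sed_out"   # sed:  test#o
printf 'awk:  %s\n' "$awk_out"   # awk:  test#o
printf 'grep:\n%s\n' "$grep_out" # two lines: test#o and ne#t
```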

awk strings for git

I'm trying to do an awk to retrieve the directory for certain git repos.
Current
git#ssh.gitlab.org:repo1/dir/dir/file1.git
git#ssh.gitlab.org:repo1/dir/dir/file2.git
git#ssh.gitlab.org:repo1/dir/dir/file3.git
git#ssh.gitlab.org:repo1/dir/dir/file4.git
I have this below using a field separator, but I'm unsure how to remove .git
awk -F':' '{print $2}' file
repo1/dir/dir/file1.git
repo1/dir/dir/file2.git
repo1/dir/dir/file3.git
repo1/dir/dir/file4.git
Desired result
repo1/dir/dir/file1
repo1/dir/dir/file2
repo1/dir/dir/file3
repo1/dir/dir/file4
You may use
awk -F':' '{sub(/\.[^.\/]*$/, "", $2); print $2;}' file
Using -F':' you will split all records (lines) into colon-separated fields. You access the second item only using $2, but before printing it, you need to remove the final . and any 0 or more chars other than . and / up to the end of the field value, which is done with sub(/\.[^.\/]*$/, "", $2).
See the online demo
With this solution, you may handle files and folders that may have any amount of dots in their names.
Could you please try following.
awk -F'[:.]' '{print $(NF-1)}' input_file
2nd solution: In case you don't want to hard code field number then try following.
awk 'match($0,/:[^.]*/){print substr($0,RSTART+1,RLENGTH-1)}' Input_file
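The awk commands so far behave differently when the path itself contains a dot. A sketch with a hypothetical repository name (my.file.git, not from the question): the sub() approach only strips the final extension, whereas splitting or matching on dots cuts the name at its first dot.

```shell
# Hypothetical URL whose last path component contains an extra dot.
url='git#ssh.gitlab.org:repo1/dir/my.file.git'

sub_out=$(printf '%s\n' "$url" | awk -F':' '{sub(/\.[^.\/]*$/, "", $2); print $2}')
split_out=$(printf '%s\n' "$url" | awk -F'[:.]' '{print $(NF-1)}')
match_out=$(printf '%s\n' "$url" | awk 'match($0,/:[^.]*/){print substr($0,RSTART+1,RLENGTH-1)}')

printf 'sub():    %s\n' "$sub_out"     # sub():    repo1/dir/my.file
printf '-F[:.]:   %s\n' "$split_out"   # -F[:.]:   file
printf 'match():  %s\n' "$match_out"   # match():  repo1/dir/my
```

For the sample in the question, where no path contains a dot, all three print the same result.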
With sed
$ sed 's/^[^:]*://; s/\.git$//' file
repo1/dir/dir/file1
repo1/dir/dir/file2
repo1/dir/dir/file3
repo1/dir/dir/file4
s/^[^:]*:// remove up to first : from start of line
s/\.git$// remove .git from end of line
you can also use sed -E 's/^[^:]*:|\.git$//g' to do it with single substitution
With regex, you can use:
(?<=:)[a-z0-9\/]*
Match anything composed of letters, numbers and slashes after the colon, so it will stop at the dot.
Or directly match everything between : and . with
(?<=:).*(?=\.)

How can I print 2 lines if the second line contains the same match as the first line?

Let's say I have a file with several million lines, organized like this:
#1:N:0:ABC
XYZ
#1:N:0:ABC
ABC
I am trying to write a one-line grep/sed/awk matching function that returns both lines if the string after the last colon of the first line (ABC above) is found in the second line.
When I try to use grep -A1 -P and pipe the matches with a match like '(?<=:)[A-Z]{3}', I get stuck. I think my creativity is failing me here.
With awk
$ awk -F: 'NF==1 && $0 ~ s{print p ORS $0} {s=$NF; p=$0}' ip.txt
#1:N:0:ABC
ABC
-F: use : as delimiter, makes it easy to get last column
s=$NF; p=$0 save last column value and entire line for printing later
NF==1 if line doesn't contain :
$0 ~ s if line contains the last column data saved previously
if search data can contain regex meta characters, use index($0,s) instead to search literally
note that this code assumes input file having line containing : followed by line which doesn't have :
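The index($0,s) variant mentioned in the notes can be sketched like this. The input is hypothetical: the search string A+C contains a regex metacharacter, so the regex match and the literal match disagree.

```shell
# Hypothetical input: "A+C" is a regex (one or more A, then C) as well as a literal string.
input=$(mktemp)
printf '%s\n' '#1:N:0:A+C' 'XXA+CXX' '#2:N:0:A+C' 'AAC' > "$input"

# Regex match: treats + as a quantifier, so it reports the wrong pair.
regex_out=$(awk -F: 'NF==1 && $0 ~ s{print p ORS $0} {s=$NF; p=$0}' "$input")

# Literal match: index() returns a non-zero position only for the exact substring.
index_out=$(awk -F: 'NF==1 && index($0, s){print p ORS $0} {s=$NF; p=$0}' "$input")

printf 'regex:\n%s\n' "$regex_out"   # the #2 header followed by AAC
printf 'index:\n%s\n' "$index_out"   # the #1 header followed by XXA+CXX
```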
With GNU sed (might work with other versions too, syntax might differ though)
$ sed -nE '/:/{N; /.*:(.*)\n.*\1/p}' ip.txt
#1:N:0:ABC
ABC
/:/ if line contains :
N add next line to pattern space
/.*:(.*)\n.*\1/ capture string after last : and check if it is present in next line
Again, this assumes input like that shown in the question; it won't work for cases like
#1:N:0:ABC
#1:N:0:XYZ
XYZ
This might work for you (GNU sed):
sed -n 'N;/.*:\(.*\)\n.*\1/p;D' file
Use grep-like option -n to explicitly print lines. Read two lines into the pattern space and print both if they meet the requirements. Always delete the first and repeat.
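A sketch of how this sliding two-line window differs from the fixed-pair sed command shown earlier, assuming GNU sed and using the three-line case that the earlier answer says it cannot handle:

```shell
# Two header lines in a row, followed by a line matching the second header.
input=$(mktemp)
printf '%s\n' '#1:N:0:ABC' '#1:N:0:XYZ' 'XYZ' > "$input"

# Fixed pairs: lines 1 and 2 are consumed together, so the real pair (2 and 3) is missed.
pair_out=$(sed -nE '/:/{N; /.*:(.*)\n.*\1/p}' "$input")

# Sliding window: D drops only the first line, so lines 2 and 3 get tested next.
slide_out=$(sed -n 'N;/.*:\(.*\)\n.*\1/p;D' "$input")

printf 'fixed pairs:    [%s]\n' "$pair_out"    # empty
printf 'sliding window:\n%s\n' "$slide_out"    # the XYZ header followed by XYZ
```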
If your actual Input_file is the same as the shown example, then the following may help too.
awk -v FS="[: \n]" -v RS="" '$(NF-1)==$NF' Input_file
EDIT: Adding one more solution, as per Sundeep's suggestion.
awk -v FS='[:\n]' -v RS= 'index($NF, $(NF-1))' Input_file

sed - get last value in file

I am making a script to collect a value from an external file. In the middle of this, I ran into trouble with the following sed command when trying to limit the result to a single line.
The following command finds every line containing "value=" and prints the text after it, ignoring lines that contain "#":
NUM=$(sed -n -e '/#/!s/^.*value=//p' $LOGFILE)
I found other variations of this command, but none of them let me exclude lines by keyword the way this one does.
Can anyone show how to capture only the last matching line while still ignoring lines that contain "#"?
Optional: can this command be adapted to capture only the numbers, ignoring the rest of the words on the line?
Here's 3 ways:
if you need just sed:
sed -n '/value=/h; ${g; s/value=//p}' file
if you can use other tools:
tac file | sed -n '/value=/{s///p;q}'
or, this is quite readable:
awk -F= '$1 == "value" {value = $2} END {print value}' file
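The three commands above do not skip lines containing "#", which the question asks for. A minimal awk sketch that adds the exclusion and the optional numbers-only step; the function name last_value and the sample log are hypothetical, and it assumes the number directly follows value=.

```shell
# last_value FILE: print the number after the last "value=" on a line without "#".
last_value() {
  awk '
    /#/ { next }                     # skip any line containing "#"
    /value=/ {
      v = $0
      sub(/^.*value=/, "", v)        # drop everything up to and including "value="
      sub(/[^0-9].*$/, "", v)        # optional: keep only the leading digits
    }
    END { print v }
  ' "$1"
}

# Hypothetical log file standing in for $LOGFILE.
log=$(mktemp)
printf '%s\n' 'value=10 first' '# value=99 disabled' \
              'name=x value=42 units' '# value=7' > "$log"

NUM=$(last_value "$log")
echo "$NUM"   # 42
```

Dropping the second sub() keeps the whole remainder of the line, as in the original command.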