Sed: Search and replace on a 4GB one-line file - regex

OS: 14.04
sed: 4.2.4
I have multiple large files (2-4gb) that I want to perform some simple manipulations on. The entire file is in one line, which makes me wonder how to perform sed operations on it.
There are three things I want to do with each file:
1) Remove all [ characters
2) Remove all ] characters
3) Replace all occurrences of },{ with }{.
So far I have tried sed -e 's/},{/}{/g' file.json > file_new.json with and without the g option, without any luck. I have also tried sed -e 's/\[//g' file.json > file_new.json without any luck. I only get a duplicate file.
Any ideas?

With gnu awk:
awk 'BEGIN{FS="},{";OFS="}{";RS="[][]";ORS=""}$1=$1' file
Perhaps faster with perl (must be tested):
perl -0135 -pe 's/},{/}{/g;y/][//d' file
Here 135 is the octal code of the character ]. The -0 option sets the input record separator, so instead of being read line by line, the file is read in chunks, each one ending at the next ].
The goal of both scripts is to avoid loading the whole file into memory.
To store the result in a file:
You can redirect the output.
awk 'BEGIN{FS="},{";OFS="}{";RS="[][]";ORS=""}$1=$1' file > result
or
perl -0135 -pe 's/},{/}{/g;y/][//d' file > result
You can use command line options:
awk -i inplace -v INPLACE_SUFFIX=.bak 'BEGIN{FS="},{";OFS="}{";RS="[][]";ORS=""}$1=$1' file
or
perl -0135 -pi'*.bak' -e 's/},{/}{/g;y/][//d' file
(Both commands keep a backup of the original file with the extension .bak. To change the source file in place without keeping a backup, remove -v INPLACE_SUFFIX=.bak for gawk, or '*.bak' for perl.)

When I've got huge single-line files like that, for which the usual line-based tools won't work, I usually turn to: tr!
1) Remove all [ characters
2) Remove all ] characters
That's easy:
tr -d '[]' < file > strippedfile
(This might not work with a really, really old SysV version of tr, but it should be fine with any modern version.)
3) Replace all occurrences of },{ with }{.
That's trickier, because you care about context, so it's really a job for sed. One kludge I've used is to use tr to temporarily change some other character to a newline -- that is, to temporarily change the huge single-line file into a multi-line file -- then run sed, and finally change it back to a single-line file. Something like
tr '{' '\n' < file | sed 's/},$/}/' | tr '\n' '{' > newfile
This last pipeline works only if the original file contains no newlines. You could run the file through tr -d '\n' first to be sure.
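The steps above can be chained into one pipeline and tried on a small sample first. This is only a sketch: it assumes GNU sed (which does not append a newline when its input has none) and data in which [, ] and { never occur inside JSON strings. The function name flatten_json is made up for illustration.

```shell
# Hypothetical helper: strips [ and ], then turns every },{ into }{.
# Reads the one-line file on stdin and writes the result to stdout.
flatten_json() {
    tr -d '[]' |          # steps 1 and 2: delete both bracket characters
    tr -d '\n' |          # make sure no real newline is left in the data
    tr '{' '\n' |         # temporarily put every chunk on its own line
    sed 's/},$/}/' |      # a chunk ending in }, was followed by { : drop the comma
    tr '\n' '{'           # turn the line breaks back into {
}

# Try it on a tiny sample before running it on the 4 GB file.
printf '%s' '[{"a":1},{"b":2},{"c":3}]' | flatten_json
```

Because every stage is a streaming filter, none of them has to hold the whole file in memory, apart from the longest chunk between two { characters.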

Try this to place a newline at the end of the file:
echo "" >> file
sed 'whatever' file
Many UNIX tools will simply not recognize a file with no ending newlines as a text file and so will not operate on them so that MAY be your problem. If that doesn't work then edit your question to include a concise, testable example of your file.

Related

Modifying a pattern-matched line as well as next line in a file

I'm trying to write a script that, among other things, automatically enables multilib. This means that in my /etc/pacman.conf file, I have to turn this
#[multilib]
#Include = /etc/pacman.d/mirrorlist
into this
[multilib]
Include = /etc/pacman.d/mirrorlist
without accidentally removing # from lines like these
#[community-testing]
#Include = /etc/pacman.d/mirrorlist
I already accomplished this by using this code
linenum=$(rg -n '\[multilib\]' /etc/pacman.conf | cut -f1 -d:)
sed -i "$((linenum))s/#//" /etc/pacman.conf
sed -i "$((linenum+1))s/#//" /etc/pacman.conf
but I'm wondering whether this can be solved in a single line of code without any math expressions.
With GNU sed. Find row starting with #[multilib], append next line (N) to pattern space and then remove all # from pattern space (s/#//g).
sed -i '/^#\[multilib\]/{N;s/#//g}' /etc/pacman.conf
If the two lines contain further #, then these are also removed.
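If the two lines may contain further # characters that should survive (a trailing comment, for instance), the substitutions can be anchored instead. A sketch assuming GNU sed, run on piped sample text rather than on the real /etc/pacman.conf. The function name and the trailing "# main mirrors" comment are invented for the demonstration.

```shell
# Hypothetical helper: uncomments #[multilib] and the line after it,
# removing only the leading # of each of the two lines.
uncomment_multilib() {
    sed '/^#\[multilib\]/{
        N
        s/^#//
        s/\n#/\n/
    }' "$@"
}

printf '%s\n' \
    '#[community-testing]' \
    '#Include = /etc/pacman.d/mirrorlist' \
    '#[multilib]' \
    '#Include = /etc/pacman.d/mirrorlist  # main mirrors' |
uncomment_multilib
```

The first substitution removes the # at the start of the pattern space, the second one the # that directly follows the embedded newline, so the trailing comment is left alone.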
Could you please try the following, written and tested with the shown samples only. It assumes you only want to deal with the multilib line and the line right after it.
awk '
/multilib/ || found{
found=$0~/multilib/?1:""
sub(/^#+/,"")
}
1
' Input_file
Explanation:
First, check whether the line contains multilib or the variable found is set; if so, run the instructions inside the block.
Inside the block, set found to 1 if the line contains multilib, otherwise clear it, so that only the line right after multilib gets processed.
Use awk's sub function to replace one or more leading hashes with an empty string.
Finally, the 1 outside the block prints every line, changed or not, so the rest of the file is kept.
This will work using any awk in any shell on every UNIX box:
$ awk '$0 == "#[multilib]"{c=2} c&&c--{sub(/^#/,"")} 1' file
[multilib]
Include = /etc/pacman.d/mirrorlist
and if you had to uncomment 500 lines instead of 2 lines then you'd just change c=2 to c=500 (as opposed to typing N 500 times as with the currently accepted solution). Note that you also don't have to escape any characters in the string you're matching on. So in addition to being robust and portable this is a much more generally useful idiom to remember than the other solutions you have so far. See printing-with-sed-or-awk-a-line-following-a-matching-pattern/17914105#17914105 for more.
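The same idiom can be wrapped so that the header and the line count are passed in as parameters. A minimal sketch: the function name uncomment_block and its two arguments are made up for illustration, and the header is compared as a plain string, as in the one-liner above.

```shell
# Hypothetical helper.
# $1 = exact text of the header line, $2 = how many lines to uncomment
#      (the header line itself included). Reads stdin or the files in $3...
uncomment_block() {
    header=$1
    count=$2
    shift 2
    awk -v header="$header" -v n="$count" '
        $0 == header { c = n + 0 }      # start the countdown at the header
        c && c-- { sub(/^#/, "") }      # strip one leading # while c > 0
        1                               # print every line
    ' "$@"
}

printf '%s\n' \
    '#[community-testing]' \
    '#Include = /etc/pacman.d/mirrorlist' \
    '#[multilib]' \
    '#Include = /etc/pacman.d/mirrorlist' |
uncomment_block '#[multilib]' 2
```

Note that awk -v interprets backslash escapes in the assigned value, so a header containing backslashes would need extra care.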
A perl one-liner:
perl -0777 -api.back -e 's/#(\[multilib]\R)#/$1/' /etc/pacman.conf
This modifies the file in place and keeps a backup of the original in /etc/pacman.conf.back.
If there is only one [multilib] entry, with ed and the shell's printf
printf '/^#\[multilib\]$/;+1s/^#//\n,p\nQ\n' | ed -s /etc/pacman.conf
Change Q to w to edit pacman.conf
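/^#\[multilib\]$/ matches the #[multilib] line
; makes that line the current line before the next address is evaluated
+1 is the next line (one line below)
s/^#// removes the leading #
,p prints everything to stdout
Q quits ed unconditionally, without an error message about unsaved changes.
-s means do not print any diagnostic messages.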
Ed can do this.
cat > edjoin.txt << 'EOF'
/multilib/;+j
s/#//
s/#/\
/
wq
EOF
ed -s pacman.conf < edjoin.txt
rm -v ./edjoin.txt
This will only work on the first match. If you have more matches, repeat as necessary.
This might work for you (GNU sed):
sed '/^#\[multilib\]/,+1 s/^#//' file
Focus on a range of lines (in this case, two) where the first line begins #[multilib] and remove the first character in those lines if it is a #.
N.B. The [ and ] must be escaped in the regexp otherwise they will match a single character that is m,u,l,t,i or b. The range can be extended by changing the integer +1 to +n if you were to want to uncomment n lines plus the matching line.
To remove all comments in a [multilib] section, perhaps:
sed '/^#\?\[[^]]*\]$/h;G;/^#\[multilib\]/M s/^#//;P;d' file

SED: How to search for word "tokens" on consecutive lines (Windows)?

I have EDI files, and I need to use sed to find the ones that contain a particular anomaly.
The anomaly is when I search for a "token" called SGP, and where they are on multiple consecutive lines — so one SGP on one line and another SGP on another line — regardless of what's after the token:
SGP+SEGU1037087'
SGP+DFSU1143210'
SGP+SEGU1166926'
SGP+TGHU1203545'
But I don't want to find files where there are other segment lines between each SGP line:
SGP+TGHU1643436'
GID+2+3:BAG'
FTX+AAA+++sdfjkhsdfjkhsdfjkh'
MEA+AAE+AAB+KGM:20000.0000'
MEA+AAE+AAW+MTQ:.0000'
SGP+HCIU2090577'
So I've tried this:
sed 'SGP.*\n.*SGP' < *.txt
And as probably expected, I get nothing.
Any ideas on how to feed into SED a list of files in DOS, and get a list of files that meet the above criteria?
UPDATE
I think I have the "feed the files" bit here. But I am still stuck on how to use SED properly.
for i in *.txt; do
sed -i '<<WHAT DO I PLACE HERE?>>' $i
done
UPDATE 2
Please no Unix/Bash/etc solutions.. I am in Windows only! Thank you
UPDATE 3
Tried a DOS equivalent of @tshiono's answer but I get nothing..
for %%f in (*.txt) do (
sed -ne ':l;N;$!b l;/SGP[^\n]\+\nSGP/p' %%f
)
UPDATE 4
@tshiono - I want the script to find files that have this pattern...
SGP+SEGU1037087'
SGP+DFSU1143210'
SGP+SEGU1166926'
SGP+TGHU1203545'
Not this pattern ...
SGP+SEGU1037087'
FTT+asdjkfhsdkf hsdjkfh sdfjkh sdf
FTX+f sdfjsdfkljsdkfljsdklfj
GID+sdfjkhsdjkfhsdjkfsdf
SGP+DFSU1143210'
FTT+asdjkfhsdkf hsdjkfh sdfjkh sdf
FTX+f sdfjsdfkljsdkfljsdklfj
GID+sdfjkhsdjkfhsdjkfsdf
SGP+SEGU1166926'
FTT+asdjkfhsdkf hsdjkfh sdfjkh sdf
FTX+f sdfjsdfkljsdkfljsdklfj
GID+sdfjkhsdjkfhsdjkfsdf
SGP+TGHU1203545'
Again - only lines with SGP as a token on every NEWLINE
Could you please try following.
awk '
FNR==1{
if(count){
if(fnr==count){
print prev_file " has all lines of SGP."
}
}
prev_file=FILENAME
count=fnr=""
}
/^SGP/{
++count
}
{
fnr++
}
END{
if(fnr==count){
print prev_file " has all lines of SGP."
}
}
' *.txt
The requirement is to detect which files contain consecutive lines that both start with SGP.
Using standard (POSIX) sed, there's no way to get sed to print the file name. You can use this combination of shell script and sed, though, to detect which files contain consecutive lines starting with SGP:
for file in *.txt;
do
if [ -n "$(sed -n -e '/^SGP/{N;/^SGP.*\nSGP/{p;q;}}' "$file")" ]
then echo "$file"
fi
done
The shell test [ … ] checks whether the output of $(sed …) is a non-empty string, and reports the name of the file if it is. Note that the script is more flexible if, instead of using the glob *.txt, it uses "$@" (the list of arguments, preserving spaces etc.). You can then write:
sh find-consecutive-SGP.sh *.txt
or use other more fanciful ways of specifying the file names as arguments.
The sed command doesn't print by default (-n). It looks for a line starting SGP and appends the next line into the 'pattern space'. It then looks to see if the result has two lots of SGP in it; one at the start (we know that will be there) and one after a newline. If that's found, it prints both lines (the pattern space) and quits because its job is done; it has found two consecutive lines both starting SGP. If the pattern space doesn't match, it is not printed (because of the -n) and more data is read. Any lines that don't start SGP are ignored and not printed.
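Put together as a function that iterates over "$@", the test can be tried on two throwaway files. This is a sketch only: the function name and the temporary file names are made up, and the sample lines are taken from the question.

```shell
# Hypothetical wrapper around the sed test: prints the name of every file
# given as an argument that has two consecutive lines starting with SGP.
find_consecutive_sgp() {
    for file in "$@"; do
        if [ -n "$(sed -n -e '/^SGP/{N;/^SGP.*\nSGP/{p;q;}}' "$file")" ]; then
            printf '%s\n' "$file"
        fi
    done
}

# Demo on two small files built from the samples in the question.
tmpdir=$(mktemp -d)
printf '%s\n' "SGP+SEGU1037087'" "SGP+DFSU1143210'" > "$tmpdir/consecutive.txt"
printf '%s\n' "SGP+TGHU1643436'" "GID+2+3:BAG'" "SGP+HCIU2090577'" > "$tmpdir/separated.txt"
find_consecutive_sgp "$tmpdir"/*.txt    # prints only the path of consecutive.txt
rm -rf "$tmpdir"
```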
With GNU sed, the F command prints the file name and a newline, so you could use:
for file in *.txt;
do
sed -n -e '/^SGP/{N;/^SGP.*\nSGP/{F;q;}}' "$file"
done
AFAICT from the GNU sed manual, there's no way to 'skip to the start of the next file' so you have to test each file separately as shown, rather than trying sed -n -e '…' *.txt — that will only report the first file that breaches the condition, not all the files.
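As an alternative that is not part of the answer above: awk exposes FILENAME and FNR, so a single process can scan all the files and report each matching one. A sketch, assuming a POSIX awk is available (on Windows that would be, for example, a gawk build). The function name is made up.

```shell
# Hypothetical helper: prints each file name once if the file contains
# two consecutive lines that both start with SGP.
list_consecutive_sgp() {
    awk '
        FNR == 1 { prev = 0; reported = 0 }   # new file: reset the state
        /^SGP/ {
            if (prev && !reported) {          # the previous line was SGP too
                print FILENAME
                reported = 1                  # report each file only once
            }
            prev = 1
            next
        }
        { prev = 0 }                          # any other segment breaks the run
    ' "$@"
}

# Usage: list_consecutive_sgp *.txt
```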
If your objective is to get the list of filenames which meet the criteria,
how about:
for i in *.txt; do
[[ -n $(sed -ne ':l;N;$!b l;/SGP[^\n]\+\nSGP/p' "$i") ]] && echo "$i"
done
The sed commands :l;N;$!b l form a loop that slurps all lines
into the pattern space, including the "\n" characters.
Then it matches the lines with the pattern of two consecutive lines
which both contain SGP.
If the sed output is non-empty, it prints the current filename.
[Update]
If your requirement is DOS platform, please try instead:
setlocal EnableDelayedExpansion
for %%f in (text*.txt) do (
set result=
for /f "usebackq tokens=*" %%a in (`sed.exe -ne ":l;N;$!b l;/SGP.\+\nSGP.\+/p" %%f`) do set result=!result!%%a
if "!result!" neq "" (
echo %%f
)
)
I've tested with Windows10 and sed-4.2.1.

Need help using sed to stop interpreting \n as new line in linux bash scripts

I am new to linux and any bash scripts and have the following problem:
I have this kryptokey:
-----BEGIN CERTIFICATE-----\n
MIICSTCCAfCgAwIBAgIRAMsLZqD4PavC7NJz7+5ld+EwCgYIKoZIzj0EAwIwdjEL\n
MAkGA1UEBhMCVVMxEzARBgNVBAgTCkNhbGlmb3JuaWExFjAUBgNVBAcTDVNhbiBG\n
cmFuY2lzY28xGTAXBgNVBAoTEG9yZzEuZXhhbXBsZS5jb20xHzAdBgNVBAMTFnRs\n
c2NhLm9yZzEuZXhhbXBsZS5jb20wHhcNMTgxMjMxMTA1ODA5WhcNMjgxMjI4MTA1\n
ODA5WjB2MQswCQYDVQQGEwJVUzETMBEGA1UECBMKQ2FsaWZvcm5pYTEWMBQGA1UE\n
BxMNU2FuIEZyYW5jaXNjbzEZMBcGA1UEChMQb3JnMS5leGFtcGxlLmNvbTEfMB0G\n
A1UEAxMWdGxzY2Eub3JnMS5leGFtcGxlLmNvbTBZMBMGByqGSM49AgEGCCqGSM49\n
AwEHA0IABEbH7l3CiqLA4N4wgfilYgyEuxDrMAqDX6BrFOfWhymNosjh5FlJDHtN\n
GPDKhjtrI6e1q0NC0l6wh9h9TrBn7N2jXzBdMA4GA1UdDwEB/wQEAwIBpjAPBgNV\n
HSUECDAGBgRVHSUAMA8GA1UdEwEB/wQFMAMBAf8wKQYDVR0OBCIEIH7OaekSLJda\n
S0yuV9PCsuasGTt/+/35aVBXTVbII2rCMAoGCCqGSM49BAMCA0cAMEQCIEd+YP/6\n
tCzG/LueYTEio8ApQSyz94ju07pmc3LZJDKBAiALu66LKhOpKhogY9XEFg4TScOt\n
el4dC6OnMMTmRsEtoA==\n-----END CERTIFICATE-----\n
saved in a file ($replacementOrg1 is the path to that file).
Now I want to replace "INSERT_ORG1_CA_CERT" in a template $file with this certificate and save the result in $org1. But I need to keep the "\n" characters.
The result should keep the \n and be written on one line.
I already tried:
sed -e "s#INSERT_ORG1_CA_CERT#$(cat $replacementOrg1)#g" $file > $org1
but it interprets the "\n" as new line.
So the final Output should look like this, 1 String in 1 Line:
"-----BEGIN CERTIFICATE-----\nMIICSTCCAfCgAwIBAgIRAMsLZqD4PavC7NJz7+5ld+EwCgYIKoZIzj0EAwIw djEL\nMAkGA1UEBhMCVVMxEzARBgNVBAgTCkNhbGlmb3JuaWExFjAUBgNVBAcTDVNhbiBG\ncmFuY2lzY28xGTAXBgNVBAoTEG9yZzEuZXhhbXBsZS5jb20xHzAdBgNVBAMTFnRs\nc2NhLm9yZzEuZXhhbXBsZS5jb20wHhcNMTgxMjMxMTA1ODA5WhcNMjgxMjI4MTA1\nODA5WjB2MQswCQYDVQQGEwJVUzETMBEGA1UECBMKQ2FsaWZvcm5pYTEWMBQGA1UE\nBxMNU2FuIEZyYW5jaXNjbzEZMBcGA1UEChMQb3JnMS5leGFtcGxlLmNvbTEfMB0G\nA1UEAxMWdGxzY2Eub3JnMS5leGFtcGxlLmNvbTBZMBMGByqGSM49AgEGCCqGSM49\nAwEHA0IABEbH7l3CiqLA4N4wgfilYgyEuxDrMAqDX6BrFOfWhymNosjh5FlJDHtN\nGPDKhjtrI6e1q0NC0l6wh9h9TrBn7N2jXzBdMA4GA1UdDwEB/wQEAwIBpjAPBgNV\nHSUECDAGBgRVHSUAMA8GA1UdEwEB/wQFMAMBAf8wKQYDVR0OBCIEIH7OaekSLJda\nS0yuV9PCsuasGTt/+/35aVBXTVbII2rCMAoGCCqGSM49BAMCA0cAMEQCIEd+YP/6\ntCzG/LueYTEio8ApQSyz94ju07pmc3LZJDKBAiALu66LKhOpKhogY9XEFg4TScOt\n
el4dC6OnMMTmRsEtoA==\n-----END CERTIFICATE-----\n"
Anybody can help?
Thank you
That is not a valid key. What someone has done is "half-encoding" (I don't know a better term) the newlines - they have added the literal string "\n" before every newline. What you very likely want is either the original key with no "\n" strings or a single line string where every newline has been replaced with "\n".
With the original value you can use replace instead - it supports newlines in the replacement value:
$ replace foo $'foo\nbar' <<< $'x\nfoo\ny'
x
foo
bar
y
Your case should be simply replace 'INSERT_ORG1_CA_CERT' "$(< $replacementOrg1)" "$file" > "$org1".
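For the other option, a single line in which the "\n" strings stay literal, the real newlines can be deleted with tr and the special characters escaped before the text is handed to sed's s command, since GNU sed reads \n in the replacement as a newline and & as the matched text. A sketch, assuming the certificate file has the half-encoded form shown in the question. The function name fill_template is made up; the variable names in the usage comment follow the question.

```shell
# Hypothetical helper. $1 = placeholder, $2 = certificate file, $3 = template.
fill_template() {
    # Delete the real newlines, then put a backslash in front of every
    # \, & and # so that sed inserts those characters literally
    # (# is escaped because it is used as the s delimiter below).
    cert=$(tr -d '\n' < "$2" | sed 's/[\\&#]/\\&/g')
    sed "s#$1#$cert#g" "$3"
}

# Usage with the variable names from the question:
# fill_template INSERT_ORG1_CA_CERT "$replacementOrg1" "$file" > "$org1"
```

The placeholder is used as a regular expression here, which is harmless for INSERT_ORG1_CA_CERT but would matter for a placeholder containing characters such as . or *.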
The substitute command isn't very good with multi-line replacement strings. But we can use GNU sed's read command to work around that:
cat "${replacementOrg1}" |
sed -e '/INSERT_ORG1_CA_CERT/{r /dev/stdin' -e ';d}' "${file}" > "${org1}"
How it works:
cat sends the contents of the certificate file (whose path is in ${replacementOrg1}, as in the question) to sed's standard input, /dev/stdin.
When sed finds the target "INSERT_ORG1_CA_CERT" it reads /dev/stdin and outputs the contents,
then deletes the search string line (which is presumed to contain no other text).
The tricky part is the inadequately documented r command -- sed assumes everything after the r is part of the filename. If we tried '/INSERT_ORG1_CA_CERT/{r /dev/stdin;d}' it would bomb with the error:
unmatched '{'
Because sed would think the filename was literally "/dev/stdin;d}". But the error message doesn't complain about the missing file, because sed never complains about a missing r filename. Instead sed complains that there's no } closing brace, because sed thinks the } is part of the filename.
To avoid that error we stick an ' -e ' in there.
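Since $replacementOrg1 in the question is a path, r can also read that file directly instead of /dev/stdin. A sketch with a made-up function name, assuming GNU sed, a placeholder that sits on a line of its own and contains no regex metacharacters or slashes, and a file name without newlines.

```shell
# Hypothetical helper. $1 = placeholder, $2 = file to insert, $3 = template.
replace_line_with_file() {
    # The first -e ends right after the file name, so nothing else can be
    # mistaken for part of it; the second -e deletes the placeholder line
    # and closes the block.
    sed -e "/$1/{r $2" -e 'd;}' "$3"
}

# Usage with the variable names from the question:
# replace_line_with_file INSERT_ORG1_CA_CERT "$replacementOrg1" "$file" > "$org1"
```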

Finding strings across lines and replace with nothing

I have some 'fastq' format DNA sequence files (basically just text files) like this:
#Sample_1
ACTGACTGACTGACTGACTGACTGACTG
ACTGACTGACTGACTGACTGACTGACTG
+
BBBBBBBBBBBBEEEEEEEEEEEEEEEE
EHHHHKKKKKKKKKKKKKKNQQTTTTTT
#
+
#
+
#Sample_4
ACTGACTGACTGACTGACTGACTGACTG
ACTGACTGACTGACTGACTGACTGACTG
+
BBBBBBBBBBBBEEEEEEEEEEEEEEEE
EHHHHKKKKKKKKKKKKKKNQQTTTTTT
My ultimate goal is to turn these into 'fasta' format files, but to do that I need to get rid of the two empty sequences in the middle.
EDIT
The desired output would look like this:
#Sample_1
ACTGACTGACTGACTGACTGACTGACTG
ACTGACTGACTGACTGACTGACTGACTG
+
BBBBBBBBBBBBEEEEEEEEEEEEEEEE
EHHHHKKKKKKKKKKKKKKNQQTTTTTT
#Sample_4
ACTGACTGACTGACTGACTGACTGACTG
ACTGACTGACTGACTGACTGACTGACTG
+
BBBBBBBBBBBBEEEEEEEEEEEEEEEE
EHHHHKKKKKKKKKKKKKKNQQTTTTTT
All of the dedicated software I tried (Biopython, stand alone programs, perl scripts posted by others) crash at the empty sequences. This is really just a problem of searching for the string #\n+ and replacing it with nothing. I googled this and read several posts and tried about a million options with sed and couldn't figure it out. Here are some things that didn't work:
sed s/'#'/,/'+'// test.fastq > test.fasta
sed s/'#,+'// test.fastq > test.fasta
Any insights would be greatly appreciated.
PS. I've got a Mac.
Try:
sed "/^[#+]*$/d" test.fastq > test.fasta
The d command tells sed to "delete" the matching line (i.e. not print it).
^ and $ mean "start of line" and "end of line" respectively, i.e. the whole line must match.
So, the above command basically says:
Print all lines except those that consist only of # and + characters (empty lines are dropped as well), and write the result to test.fasta.
Edit: I misunderstood the question slightly, sorry. If you want to only remove pairs of consecutive lines like
#
+
then you need to perform a multi-line search and replace.
Although this can be done with sed, it's perhaps easier to use something like a perl script instead:
perl -0pe 's/^#\n\+\n//gm' test.fastq > test.fasta
The -0 option sets the input record separator to the NUL character. A text file normally contains no NUL, so Perl reads the entire input as a single record instead of line by line, which enables multi-line search and replace (-0777 is the explicit "slurp" switch).
The -pe options run the given Perl code (pattern matching and replacement in this case) on the input and print the result.
^#\n\+\n is the pattern to match, which we are replacing with nothing (i.e. deleting).
The g flag makes the substitution global, and the m flag lets ^ match at the start of every line rather than only at the start of the input.
You could also pass -i as the first option to perl to edit the file in place.
This may not be the most elegant solution in the world, but you can use tr to replace the \n with a null character and back.
cat test.fastq | tr '\n' '\0' | sed 's/#\x0+\x0//g' | tr '\0' '\n' > test.fasta
Try this:
sed '/^#$/{N;/\n+$/d}' file
When # is found, next line is appended to the pattern space with N.
If that next line is a lone +, the d command deletes both lines.
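Run on a shortened version of the sample from the question, this looks as follows. A sketch only: the function name is made up, and the record markers are written exactly as they appear in the sample above.

```shell
# Hypothetical helper: removes every empty record, i.e. a line that is
# exactly "#" followed by a line that is exactly "+".
drop_empty_records() {
    # On a "#" line, append the next line (N); if the appended line is
    # just "+", delete the pair (d). Everything else is printed as is.
    sed '/^#$/{N;/\n+$/d;}' "$@"
}

printf '%s\n' \
    '#Sample_1' 'ACTGACTG' '+' 'BBBBEEEE' \
    '#' '+' \
    '#' '+' \
    '#Sample_4' 'ACTGACTG' '+' 'EHHHHKKK' |
drop_empty_records
```

The + separator lines of the real records are kept, because they never follow a line that consists of the marker alone.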

how to rejoin words that are split across lines with a hyphen in a text file

OCR texts often have words that flow from one line to another with a hyphen at the end of the first line. (ie: the word has '-\n' inserted in it).
I would like to rejoin all such split words in a text file (in a linux environment).
I believe this should be possible with sed or awk, but the syntax for these is dark magic to me! I knew a text editor in windows that did regex search/replace with newlines in the search expression, but am unaware of such in linux.
Make sure to back up ocr_file before running as this command will modify the contents of ocr_file:
perl -i~ -e 'BEGIN{$/=undef} ($f=<>) =~ s#-\s*\n\s*(\S+)#$1\n#mg; print $f' ocr_file
This answer is relevant, because I want the words joined together... not just a removal of the dash character.
cat file| perl -CS -pe's/-\n//'|fmt -w52
is the short answer, but uses fmt to reform paragraphs after the paragraphs were mangled by perl.
without fmt, you can do
#!/usr/bin/perl
use open qw(:std :utf8);
undef $/; $_=<>;
s/-\n(\w+\W+)\s*/$1\n/sg;
print;
Also, if you're doing OCR, you can use this perl one-liner to convert Unicode hyphens and dashes to the ASCII hyphen. Note the -CS option, which tells perl that the standard streams are UTF-8.
# U+2010 - U+2015 (hyphens, figure dash, en dash, em dash, horizontal bar) to ASCII hyphen
perl -CS -pe 'tr/\x{2010}-\x{2015}/-/'
cat file | perl -p -e 's/-\n//'
If the file has windows line endings, you'll need to catch the cr-lf with something like:
cat file | perl -p -e 's/-\s\n//'
Hey this is my first answer post, here goes:
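I suspect the '-\n' sequences are a hyphen followed by a line feed. You can use sed to remove these. You could try the following as a test (note that, in a shell such as bash whose echo does not interpret \n, the test file contains a literal backslash followed by n rather than a real line break, and the commands below handle that literal text):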
1) create a test file:
echo "hello this is a test -\n" > testfile
2) check the file has the expected contents:
cat testfile
3) test the sed command, this sends the edited text stream to standard out (ie your active console window) without overwriting anything:
sed 's/-\\n//g' testfile
(you should just see 'hello this is a test ' printed to the console, without the '-\n')
If I build up the command:
a) First off you have the sed command itself:
sed
b) Secondly the expression and sed specific controls need to be in quotations:
sed 'sedcontrols+regex' (the text in quotations isn't what you'll actually enter, we'll fill this in as we go along)
c) Specify the file you are reading from:
sed 'sedcontrols+regex' testfile
d) To delete the string in question, sed needs to be told to substitute the unwanted characters with nothing, so you use 's' to substitute, a forward slash, then the unwanted string (more on that in a sec), another forward slash, then nothing (what it's being substituted with), a final forward slash, and then any flags. In this case I will use 'g', which stands for global: every occurrence on a line is replaced rather than only the first one. So now we have:
sed 's/regex//g' testfile
e) We need to add in the unwanted string, but it gets confusing because if there is a backslash in your string, it needs to be escaped with another backslash. So, the unwanted string
-\n ends up looking like -\\n
We can output the edited text stream to stdout as follows:
sed 's/-\\n//g' testfile
To save the results without overwriting anything (assuming testfile2 doesn't exist) we can redirect the output to a file:
sed 's/-\\n//g' testfile >testfile2
sed -z 's/-\n//g' file_with_hyphens
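sed -z is a GNU extension and treats the whole file as a single record. The same join can be sketched line by line in awk, which should work with any POSIX awk and does not need the file in memory. The function name is made up. Like the one-liners above, it glues the complete next line onto the hyphenated one, so the resulting lines vary in length, and it assumes Unix line endings.

```shell
# Hypothetical helper: joins a line ending in a hyphen with the next line.
rejoin_hyphens() {
    awk '
        /-$/ { sub(/-$/, ""); printf "%s", $0; next }   # drop the hyphen, emit no newline
        { print }                                        # any other line is printed unchanged
    ' "$@"
}

printf '%s\n' \
    'OCR texts often have hy-' \
    'phenated words that con-' \
    'tinue on the next line.' |
rejoin_hyphens
```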