How to rejoin words that are split across lines with a hyphen in a text file - regex

OCR texts often have words that flow from one line to the next, with a hyphen at the end of the first line (i.e. the word has '-\n' inserted in it).
I would like to rejoin all such split words in a text file (in a Linux environment).
I believe this should be possible with sed or awk, but the syntax for these is dark magic to me! I knew a text editor on Windows that did regex search/replace with newlines in the search expression, but I am unaware of an equivalent on Linux.

Make sure to back up ocr_file before running this, as the command modifies ocr_file in place (the -i~ switch also leaves a copy of the original in ocr_file~):
perl -i~ -e 'BEGIN{$/=undef} ($f=<>) =~ s#-\s*\n\s*(\S+)#$1\n#mg; print $f' ocr_file
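For readers who find the Perl hard to follow, here is a rough Python sketch of the same idea: find a hyphen at the end of a line, pull the rest of the word up from the next line, and start a new line after it. The function name rejoin_hyphenated is made up for illustration, and the pattern is an approximation of the Perl one rather than an exact port.

```python
import re

# A hyphen at the end of a line (optionally followed by blanks), then the
# remainder of the word at the start of the next line.
_SPLIT_WORD = re.compile(r"-[ \t]*\n[ \t]*(\S+)[ \t]*\n?")


def rejoin_hyphenated(text: str) -> str:
    """Glue a word split as 'exam-<newline>ple' back together.

    The line break is moved to just after the rejoined word, so the
    following text starts on its own line.
    """
    return _SPLIT_WORD.sub(r"\1\n", text)
```

Like the one-liners in this thread, it cannot tell a line-break hyphen from a real one, so a compound such as "well-known" that happens to break at its hyphen comes out as "wellknown".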

This answer is relevant, because I want the words joined together... not just a removal of the dash character.
cat file| perl -CS -pe's/-\n//'|fmt -w52
is the short answer, but uses fmt to reform paragraphs after the paragraphs were mangled by perl.
without fmt, you can do
#!/usr/bin/perl
use open qw(:std :utf8);
undef $/; $_=<>;
s/-\n(\w+\W+)\s*/$1\n/sg;
print;
Also, if you're doing OCR, you can use this Perl one-liner to convert Unicode dashes to the ASCII hyphen. Note the -CS option, which tells perl that the standard streams are UTF-8.
# U+2010 - U+2015: Unicode hyphens and dashes to ASCII hyphen (U+2009 is a thin space, so it is left out)
perl -CS -pe 'tr/\x{2010}\x{2011}\x{2012}\x{2013}\x{2014}\x{2015}/-/'
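The same normalisation can be sketched in Python with str.translate. The table below maps U+2010 through U+2015 (hyphen, non-breaking hyphen, figure dash, en dash, em dash, horizontal bar) to the ASCII hyphen; U+2009, a thin space, is deliberately not mapped. The names DASHES and ascii_dashes are illustrative only.

```python
# U+2010..U+2015: hyphen, non-breaking hyphen, figure dash, en dash,
# em dash, horizontal bar. range() excludes its upper bound, hence 0x2016.
DASHES = {code: "-" for code in range(0x2010, 0x2016)}


def ascii_dashes(text: str) -> str:
    """Replace Unicode hyphens and dashes with the ASCII hyphen-minus."""
    return text.translate(DASHES)
```

Running this before the rejoin step means a line that ends in an en dash or a Unicode hyphen is treated the same way as one that ends in a plain "-".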

cat file | perl -p -e 's/-\n//'
If the file has Windows line endings, you'll need to catch the CR-LF with something like:
cat file | perl -p -e 's/-\r?\n//'

Hey this is my first answer post, here goes:
'-\n' I suspect are the line-feed characters. You can use sed to remove these. You could try the following as a test:
1) create a test file:
echo "hello this is a test -\n" > testfile
2) check the file has the expected contents:
cat testfile
3) test the sed command, this sends the edited text stream to standard out (ie your active console window) without overwriting anything:
sed 's/-\\n//g' testfile
(you should just see 'hello this is a test ' printed to the console, without the '-\n')
If I build up the command:
a) First off you have the sed command itself:
sed
b) Secondly the expression and sed specific controls need to be in quotations:
sed 'sedcontrols+regex' (the text in quotations isn't what you'll actually enter, we'll fill this in as we go along)
c) Specify the file you are reading from:
sed 'sedcontrols+regex' testfile
d) To delete the string in question, sed needs to be told to substitute the unwanted characters with nothing. You use 's' to substitute, a forward slash, then the unwanted string (more on that in a sec), another forward slash, then nothing (what it's being substituted with), a final forward slash, and then any flags. Here I use 'g' (global), which replaces every match on each line rather than only the first one. So now we have:
sed 's/regex//g' testfile
e) We need to add in the unwanted string, but it gets confusing because a backslash in your string needs to be escaped with another backslash. So, the unwanted string
-\n ends up looking like -\\n
We can output the edited text stream to stdout as follows:
sed 's/-\\n//g' testfile
To save the results without overwriting anything (assuming testfile2 doesn't exist) we can redirect the output to a file:
sed 's/-\\n//g' testfile >testfile2
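One caveat worth checking: in bash, echo without -e writes the two characters backslash and n, and s/-\\n//g matches exactly those two characters, not a real line feed. A small Python sketch of the difference (the function and variable names are made up for illustration):

```python
def strip_literal(text: str) -> str:
    """Remove a hyphen followed by the two characters backslash and n."""
    return text.replace("-\\n", "")


def strip_real(text: str) -> str:
    """Remove a hyphen followed by an actual line-feed character."""
    return text.replace("-\n", "")


# What the echo test above writes in bash: a literal backslash and 'n'.
literal_sample = "hello this is a test -\\n"
# What an OCR file contains: a real line feed after the hyphen.
real_sample = "exam-\nple"
```

strip_literal leaves real_sample untouched, which is why a pattern that works on the test file may do nothing on actual OCR output.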

sed -z 's/-\n//g' file_with_hyphens

Related

BASH - Replacement of regex match within a file

Given the following files:
input_file:
My inputfile, contains multiple line
and also special characters {}[]ä/
template_file:
Contains multiple lines,
also special characters {}[]ä/
##regex_match## <= must be replaced by input_file
Content goes on
abc
output_file:
Contains multiple lines,
also special characters {}[]ä/
My inputfile, contains multiple line
and also special characters {}[]ä/
Content goes on
abc
I thought about sed but that would be very cumbersome because of escaping and newlines. Is there any other solution in BASH?
perl solution just for variety's sake.
perl -0777 -lpe'
BEGIN {
open $fh, "<", "input_file";
$input = $fh->getline
}
s/##regex_match##/$input/
' < template_file > output_file
sed -n -e '/##regex_match##/{r input_file' -e 'b' -e '}; p' template_file
If the regex is matched, read and output the input file and branch (end processing of the line and don't print it). Otherwise print the line.
The use of -e delimits parts of the sed commands so that the r command which reads the input file knows where the name of the file ends. Otherwise it would greedily consume the following sed commands as if they were part of the file name.
The curly braces delimit a block in the program that's like an if statement.
I tested this on MacOS, but it should be pretty similar for GNU. MacOS sed is much pickier about -e (among other differences which don't come into play here).
A very slight variation on the technique Dennis Williamson already posted, merely for discussion purposes -
sed '/##regex_match##/ {
r input_file
d
}' template_file
Contains multiple lines,
also special characters {}[]ä/
My inputfile, contains multiple line
and also special characters {}[]ä/
Content goes on
abc
c.f. the manual.
He used -e options to pass commands, where I separated them with newlines. Usually a semicolon is enough, but r treats everything up to the end of the line as the file name, so other commands on the same line get swallowed.
The d prevents the tag pattern from being printed.
With any awk in any shell on every UNIX box and with any characters:
$ awk 'NR==FNR{rec=rec sep $0; sep=ORS; next} /##regex_match##/{$0=rec} 1' input_file template_file
Contains multiple lines,
also special characters {}[]ä/
My inputfile, contains multiple line
and also special characters {}[]ä/
Content goes on
abc
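If neither sed nor awk is a hard requirement, the same line-for-file swap can be sketched in Python, where no escaping of the inserted text is needed at all. The function name fill_template and its parameters are assumptions for illustration; like the sed and awk answers, it replaces the whole line that contains the tag.

```python
def fill_template(template: str, replacement: str,
                  tag: str = "##regex_match##") -> str:
    """Replace every line of `template` that contains `tag` with `replacement`."""
    out = []
    for line in template.splitlines(keepends=True):
        if tag in line:
            # Make sure the inserted block ends with a newline so the
            # following template line starts on its own line.
            out.append(replacement if replacement.endswith("\n")
                       else replacement + "\n")
        else:
            out.append(line)
    return "".join(out)
```

To use it on the files from the question, read template_file and input_file into strings, call fill_template, and write the result to output_file.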

Need help using sed to stop interpreting \n as new line in linux bash scripts

I am new to Linux and bash scripting and have the following problem:
I have this crypto key:
-----BEGIN CERTIFICATE-----\n
MIICSTCCAfCgAwIBAgIRAMsLZqD4PavC7NJz7+5ld+EwCgYIKoZIzj0EAwIwdjEL\n
MAkGA1UEBhMCVVMxEzARBgNVBAgTCkNhbGlmb3JuaWExFjAUBgNVBAcTDVNhbiBG\n
cmFuY2lzY28xGTAXBgNVBAoTEG9yZzEuZXhhbXBsZS5jb20xHzAdBgNVBAMTFnRs\n
c2NhLm9yZzEuZXhhbXBsZS5jb20wHhcNMTgxMjMxMTA1ODA5WhcNMjgxMjI4MTA1\n
ODA5WjB2MQswCQYDVQQGEwJVUzETMBEGA1UECBMKQ2FsaWZvcm5pYTEWMBQGA1UE\n
BxMNU2FuIEZyYW5jaXNjbzEZMBcGA1UEChMQb3JnMS5leGFtcGxlLmNvbTEfMB0G\n
A1UEAxMWdGxzY2Eub3JnMS5leGFtcGxlLmNvbTBZMBMGByqGSM49AgEGCCqGSM49\n
AwEHA0IABEbH7l3CiqLA4N4wgfilYgyEuxDrMAqDX6BrFOfWhymNosjh5FlJDHtN\n
GPDKhjtrI6e1q0NC0l6wh9h9TrBn7N2jXzBdMA4GA1UdDwEB/wQEAwIBpjAPBgNV\n
HSUECDAGBgRVHSUAMA8GA1UdEwEB/wQFMAMBAf8wKQYDVR0OBCIEIH7OaekSLJda\n
S0yuV9PCsuasGTt/+/35aVBXTVbII2rCMAoGCCqGSM49BAMCA0cAMEQCIEd+YP/6\n
tCzG/LueYTEio8ApQSyz94ju07pmc3LZJDKBAiALu66LKhOpKhogY9XEFg4TScOt\n
el4dC6OnMMTmRsEtoA==\n-----END CERTIFICATE-----\n
saved in a file ($replacementOrg1 is the path to that file).
Now I want to replace "INSERT_ORG1_CA_CERT" in a template $file with this certificate and save the result in $org1. But I need to keep the "\n" characters.
The result should keep the \n and be written on one line.
I already tried:
sed -e "s#INSERT_ORG1_CA_CERT#$(cat $replacementOrg1)#g" $file > $org1
but it interprets the "\n" as new line.
So the final output should look like this, one string on one line:
"-----BEGIN CERTIFICATE-----\nMIICSTCCAfCgAwIBAgIRAMsLZqD4PavC7NJz7+5ld+EwCgYIKoZIzj0EAwIw djEL\nMAkGA1UEBhMCVVMxEzARBgNVBAgTCkNhbGlmb3JuaWExFjAUBgNVBAcTDVNhbiBG\ncmFuY2lzY28xGTAXBgNVBAoTEG9yZzEuZXhhbXBsZS5jb20xHzAdBgNVBAMTFnRs\nc2NhLm9yZzEuZXhhbXBsZS5jb20wHhcNMTgxMjMxMTA1ODA5WhcNMjgxMjI4MTA1\nODA5WjB2MQswCQYDVQQGEwJVUzETMBEGA1UECBMKQ2FsaWZvcm5pYTEWMBQGA1UE\nBxMNU2FuIEZyYW5jaXNjbzEZMBcGA1UEChMQb3JnMS5leGFtcGxlLmNvbTEfMB0G\nA1UEAxMWdGxzY2Eub3JnMS5leGFtcGxlLmNvbTBZMBMGByqGSM49AgEGCCqGSM49\nAwEHA0IABEbH7l3CiqLA4N4wgfilYgyEuxDrMAqDX6BrFOfWhymNosjh5FlJDHtN\nGPDKhjtrI6e1q0NC0l6wh9h9TrBn7N2jXzBdMA4GA1UdDwEB/wQEAwIBpjAPBgNV\nHSUECDAGBgRVHSUAMA8GA1UdEwEB/wQFMAMBAf8wKQYDVR0OBCIEIH7OaekSLJda\nS0yuV9PCsuasGTt/+/35aVBXTVbII2rCMAoGCCqGSM49BAMCA0cAMEQCIEd+YP/6\ntCzG/LueYTEio8ApQSyz94ju07pmc3LZJDKBAiALu66LKhOpKhogY9XEFg4TScOt\n
el4dC6OnMMTmRsEtoA==\n-----END CERTIFICATE-----\n"
Can anybody help?
Thank you
That is not a valid key. What someone has done is "half-encoding" (I don't know a better term) the newlines - they have added the literal string "\n" before every newline. What you very likely want is either the original key with no "\n" strings or a single line string where every newline has been replaced with "\n".
With the original value you can use replace instead - it supports newlines in the replacement value:
$ replace foo $'foo\nbar' <<< $'x\nfoo\ny'
x
foo
bar
y
Your case should be simply replace 'INSERT_ORG1_CA_CERT' "$(< $replacementOrg1)" "$file" > "$org1".
The substitute command isn't very good with multi-line replacement strings. But we can use GNU sed's read command to work around that:
echo "${replacementOrg1}" |
sed -e '/INSERT_ORG1_CA_CERT/{r /dev/stdin' -e ';d}' ${file} > ${org1}
How it works:
echo the multi-line string, piping it to /dev/stdin.
When sed finds the target "INSERT_ORG1_CA_CERT" it reads /dev/stdin and outputs the contents
then deletes the search string line, (which is presumed to contain no other text).
The tricky part is the inadequately documented r command -- sed assumes everything after the r is part of the filename. If we tried '/INSERT_ORG1_CA_CERT/{r /dev/stdin;d}' it would bomb with the error:
unmatched '{'
Because sed would think the filename was literally "/dev/stdin;d}". But the error message doesn't complain about the missing file, because sed never complains about a missing r filename. Instead sed complains that there's no } closing brace, because sed thinks the } is part of the filename.
To avoid that error we stick an ' -e ' in there.
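For comparison, a Python sketch of the "single line with literal \n" variant the question asks for. It assumes the template and the certificate have already been read into strings; to_single_line and insert_cert are illustrative names. Python's str.replace treats the replacement text literally, so the backslash sequences survive untouched.

```python
def to_single_line(cert: str) -> str:
    """Return the certificate as one line with literal backslash-n separators."""
    cert = cert.replace("\r", "")
    if "\\n" in cert:
        # "Half-encoded" input: the literal \n markers are already there,
        # so only the real line breaks need to go.
        return cert.replace("\n", "")
    # Plain PEM input: turn each real line break into backslash + n.
    return cert.replace("\n", "\\n")


def insert_cert(template: str, cert: str,
                tag: str = "INSERT_ORG1_CA_CERT") -> str:
    """Substitute `tag` with the single-line certificate."""
    # No regex and no escape processing is involved here.
    return template.replace(tag, to_single_line(cert))
```

The first branch handles the file exactly as shown in the question; the second handles an ordinary PEM file with real line breaks only.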

Adding a line using sed

Can't seem to find the right way to do this, despite checking my regex in a reg checker.
Given a text file containing, amongst others, this entry:
zone "example.net" {
type master;
file "/etc/bind/zones/db.example.net";
allow-transfer { x.x.x.x;y.y.y.y; };
also-notify { x.x.x.x;y.y.y.y; };
};
I want to add lines after the also-notify line, for that domain specifically.
So using this sed command string:
sed '/"example\.net".*?also-notify.*?};/a\nxxxxxxx/s' named.conf.local
I thought should work to add 'xxxxxxx' after the line. But nope. What am I doing wrong?
With POSIX sed, you can use the a for append command with an escaped literal new line:
$ sed '/^[[:blank:]]*also-notify/ a\
NEW LINE' file
With GNU sed, a is slightly more natural since the new line is assumed:
$ gsed '/^[[:blank:]]*also-notify/ a NEW LINE' file
The issue with the sed in your example is two fold.
The first is any sed regex cannot be for a multi-line match as in example\.net".*?also-notify.*?. That is more of a perl type match. You would need to use a range operator for the start as in:
$ sed '/"example\.net/,/also-notify/{
/^[[:blank:]]*also-notify/ a\
NEW LINE
}' file
The second issue is the \n in the appended text. With POSIX sed, the \n is not supported in any context. With GNU sed, the new line is assumed and the \n is out of context (if immediately after the a) and interpreted as an escaped literal n. You can use \n with GNU sed after 1 character but not immediately after. In POSIX sed, leading spaces of the appended line will always be stripped.
The following awk may help with this:
awk -v new_lines="new_line here" '{print} /also-notify/{print new_lines}' Input_file
If you want to edit Input_file itself, append > temp_file && mv temp_file Input_file to the command above. Here new_lines is an awk variable; you could also print the new lines directly in the print statement.
You're pretty close already. Just use a range (/pattern/,/pattern/{ #commands }) to select the text you want to operate on and then use /pattern/a\ ... to add the line you want.
/"example\.net"/,/also-notify/{
/also-notify/a\
\ this is the text I want to add.
}
sed trims leading space on text to be appended. Adding a backslash \ at the start of the line prevents this.
In Bash, this would look something like:
sed -e '/"example\.net"/,/also-notify/{
/also-notify/a\
\ this is the text I want to add.
}' named.conf.local
Also note that sed uses an older dialect of regular expressions that doesn't support non-greedy quantifiers like *?.
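The range logic (start at the zone line, act on the first also-notify inside it, stop at the closing brace) can also be written out as a small state machine. This is a Python sketch under the assumption that each zone ends with a line starting with "};", as in the question; add_after and its parameters are hypothetical names.

```python
def add_after(text: str, zone: str, marker: str, new_line: str) -> str:
    """Insert `new_line` after the first `marker` line inside the given zone."""
    out = []
    in_zone = False
    for line in text.splitlines():
        out.append(line)
        if f'zone "{zone}"' in line:
            in_zone = True                 # entered the block we care about
        elif in_zone and marker in line:
            out.append(new_line)           # append right after the marker line
            in_zone = False                # only once per zone
        elif in_zone and line.startswith("};"):
            in_zone = False                # zone closed without a marker line
    return "\n".join(out) + "\n"
```

Because the zone name is matched as a plain substring, the dot in "example.net" needs no escaping here.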

Finding strings across lines and replace with nothing

I have some 'fastq' format DNA sequence files (basically just text files) like this:
#Sample_1
ACTGACTGACTGACTGACTGACTGACTG
ACTGACTGACTGACTGACTGACTGACTG
+
BBBBBBBBBBBBEEEEEEEEEEEEEEEE
EHHHHKKKKKKKKKKKKKKNQQTTTTTT
#
+
#
+
#Sample_4
ACTGACTGACTGACTGACTGACTGACTG
ACTGACTGACTGACTGACTGACTGACTG
+
BBBBBBBBBBBBEEEEEEEEEEEEEEEE
EHHHHKKKKKKKKKKKKKKNQQTTTTTT
My ultimate goal is to turn these into 'fasta' format files, but to do that I need to get rid of the two empty sequences in the middle.
EDIT
The desired output would look like this:
#Sample_1
ACTGACTGACTGACTGACTGACTGACTG
ACTGACTGACTGACTGACTGACTGACTG
+
BBBBBBBBBBBBEEEEEEEEEEEEEEEE
EHHHHKKKKKKKKKKKKKKNQQTTTTTT
#Sample_4
ACTGACTGACTGACTGACTGACTGACTG
ACTGACTGACTGACTGACTGACTGACTG
+
BBBBBBBBBBBBEEEEEEEEEEEEEEEE
EHHHHKKKKKKKKKKKKKKNQQTTTTTT
All of the dedicated software I tried (Biopython, stand alone programs, perl scripts posted by others) crash at the empty sequences. This is really just a problem of searching for the string #\n+ and replacing it with nothing. I googled this and read several posts and tried about a million options with sed and couldn't figure it out. Here are some things that didn't work:
sed s/'#'/,/'+'// test.fastq > test.fasta
sed s/'#,+'// test.fastq > test.fasta
Any insights would be greatly appreciated.
PS. I've got a Mac.
Try:
sed "/^[#+]*$/d" test.fastq > test.fasta
The d command tells sed to delete the matching line (i.e. not print it).
^ and $ mean "start of line" and "end of line" respectively, i.e. the whole line must match.
So, the above command basically says:
Print every line except those that consist only of # and + characters (or are empty), and write the result to test.fasta.
Edit: I misunderstood the question slightly, sorry. If you want to only remove pairs of consecutive lines like
#
+
then you need to perform a multi-line search and replace.
Although this can be done with sed, it's perhaps easier to use something like a perl script instead:
perl -0pe 's/^#\n\+\n//gm' test.fastq > test.fasta
The -0 option sets Perl's input record separator to the NUL character; since a text file contains no NUL bytes, Perl reads the entire input file in one shot instead of line by line (-0777 is the explicit "file slurp" form). This enables multi-line search and replace.
The -p option makes Perl loop over the input and print the result, and -e supplies the code to run (pattern matching and replacement in this case).
^#\n\+\n is the pattern to match, which we are replacing with nothing (i.e. deleting).
In /gm, g makes the substitution global and m lets ^ match at the start of every line.
You could also pass -i as the first parameter to perl to edit the file in place.
This may not be the most elegant solution in the world, but you can use tr to replace the \n with a null character and back.
cat test.fastq | tr '\n' '\0' | sed 's/#\x0+\x0//g' | tr '\0' '\n' > test.fasta
Try this:
sed '/^#$/{N;/\n+$/d}' file
When # is found, next line is appended to the pattern space with N.
If the appended line consists of just +, the d command deletes both lines.
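Since the poster already tried Biopython, a plain Python sketch of the same multi-line deletion may be convenient. It assumes the whole file fits in memory and that an empty record is exactly a "#" line followed by a "+" line; drop_empty_records is an illustrative name.

```python
import re

# A line that is exactly "#" followed by a line that is exactly "+".
# re.MULTILINE lets ^ match at the start of every line, not just the file.
_EMPTY_RECORD = re.compile(r"^#\n\+\n", re.MULTILINE)


def drop_empty_records(text: str) -> str:
    """Remove every empty '#' / '+' record pair from FASTQ-like text."""
    return _EMPTY_RECORD.sub("", text)
```

Header lines such as "#Sample_1" are left alone because the pattern requires the newline directly after the "#".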

Grep invert on string matched, not line matched

I'll keep this explanation of why I need help to a minimum. One of my file directories got hacked through XSS, and a long string was placed at the beginning of all PHP files. I've tried to use sed to replace the string with nothing, but it won't work because the pattern to match includes many, many characters that would need to be escaped.
I found out that I can use fgrep to match a fixed string saved in a pattern file, but I'd like to replace the matched string (NOT THE LINE) in each file, but grep's -v inverts the result on the line, rather than the end of the matched string.
This is the command I'm using on an example file that contains the hacked string:
fgrep -v -f ~/hacked-string.txt example.php
I need the output to contain the <?php that's at the end of the line (sometimes it's a <style> tag), but the -v option inverts at the end of that line, so the output doesn't contain the <?php at the beginning.
NOTE
I've tried to use the -o or --only-matching which outputs nothing instead:
fgrep -f ~/hacked-string.txt example.php --only-matching -v
Is there another option in grep that I can use to invert on the end of the matched pattern, rather than the line where the pattern was matched? Or alternatively, is there an easier option to replace the hacked string in all .php files?
Here is a small snippet of what's in hacked-string.txt (line breaks added for readability):
]55Ld]55#*<%x5c%x7825bG9}:}.}-}!#*<%x55c%x7825)
dfyfR%x5c%x7827tfs%x5c%x7c%x785c%x5c%x7825j:^<!
%x5c%x7825w%x5c%x7860%x5c%x785c^>Ew:25tww**WYsb
oepn)%x5c%x7825bss-%x5c%x7825r%x5c%x7878B%x5c%x
7825h>#]y3860msvd},;uqpuft%x5c%x7860msvd}+;!>!}
%x5c%x7827;!%x5c%x7825V%x5c%x7827{ftmfV%x5e56+9
9386c6f+9f5d816:+946:ce44#)zbssb!>!ssbnpe_GMFT%
x5c5c%x782f#00#W~!%x5c%x7825t2w)##Qtjw)#]82#-#!
#-%x5c%x7825tmw)%x5c%x78w6*%x5c%x787f_*#fubfsdX
k5%x5c%xf2!>!bssbz)%x5c%x7824]25%x5c%x7824-8257
-K)fujs%x5c%x7878X6<#o]o]Y%x5c%x78257;utpI#7>-1
-bubE{h%x5c%x7825)sutcvt)!gj!|!*bubEpqsut>j%x5c
%x7825!*72!%x5c%x7827!hmg%x5c%x78225>2q%x5c%x7
Thanks in advance!
I think what you are asking is this:
"Is it possible to use the grep utility to remove all instances of a fixed string (which might contain lots of regex metacharacters) from a file?"
In that case, the answer is "No".
What I think you wanted to ask was:
"What is the easiest way to remove all instances of a fixed string (which might contain lots of regex metacharacters) from a file?"
Here's one reasonably simple solution:
delete_string() {
awk -v s="$1" '{while(i=index($0,s))$0=substr($0,1,i-1)substr($0,i+length(s))}1'
}
delete_string 'some_hideous_string_with*!"_inside' < original_file > new_file
The shell syntax is slightly fragile; it will break if the string contains an apostrophe ('). However, you can read a raw string from stdin into a variable with:
$ IFS= read -r the_string
absolutely anything here
which will work with any string which doesn't contain a newline or a NUL character. Once you have the string in a variable, you can use the above function:
delete_string "$the_string" < original_file > new_file
Here's another possible one liner, using python:
delete_string() {
python -c 'import sys;[sys.stdout.write(l.replace(r"""'"$1"'""","")) for l in sys.stdin]'
}
This won't handle strings which have three consecutive quotes (""").
Is the hacked string the same in every file?
If the hacked string is 1234 bytes long, then you can use
tail -c +1235 file.php > fixed-file.php
for each infected file.
Note that tail -c +1235 starts output at the 1235th byte of the input file (-c counts bytes, not characters).
With perl:
perl -i.hacked -pe "s/\Q$(<hacked-string.txt)\E//g" example.php
Notes:
The $(<file) bit is a bash shortcut to read the contents of a file.
The \Q and \E bits are from perl, they treat the stuff in between as plain characters, ignoring regex metachars.
The -i.hacked option will edit the file in-place, creating a backup "example.php.hacked"
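The quoting problems in the shell-embedded versions go away if the string never passes through a shell or a regex engine. Here is a Python sketch that reads the fixed string from a file and removes it byte for byte. The function names are made up, and the ".hacked" backup suffix simply mirrors the Perl example; it works on bytes so that the encoding of the PHP files does not matter.

```python
from pathlib import Path


def remove_fixed_string(data: bytes, needle: bytes) -> bytes:
    """Delete every occurrence of `needle` from `data` (no regex involved)."""
    if not needle:
        return data
    return data.replace(needle, b"")


def clean_file(path: str, needle_path: str) -> bool:
    """Strip the string stored in `needle_path` from the file at `path`.

    The original is kept as `path` + ".hacked". Returns True if the file
    was changed.
    """
    # Drop the trailing newline most editors add to the pattern file.
    needle = Path(needle_path).read_bytes().rstrip(b"\r\n")
    original = Path(path).read_bytes()
    cleaned = remove_fixed_string(original, needle)
    if cleaned == original:
        return False
    Path(path + ".hacked").write_bytes(original)
    Path(path).write_bytes(cleaned)
    return True
```

Calling clean_file for each .php file, with hacked-string.txt as the second argument, would cover the whole directory; files that do not contain the string are left untouched.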