Using sed to convert text file to a C string - c++

I would like to use sed to replace newlines, tabs, quotes and backslashes in a text file to use it as char constant in C, but I'm lost at the start. It would be nice to maintain the newlines also in the output, adding a '\n', then a double quote to close the text line, a crlf, and another double quote to reopen the line, for example:
line1
line2
would become
"line1\n"
"line2\n"
Can anybody at least point me in the right direction?
Thanks

Try this as a sed command file:
s/\\/\\\\/g
s/"/\\"/g
s/ /\\t/g
s/^/"/
s/$/\\n"/
NB: the third line contains a literal tab character between the first two slashes (it may display as a space here); if using vi, insert it by pressing ^V <Tab>.
s/\\/\\\\/g - escape back slashes
s/"/\\"/g - escape quotes
s/ /\\t/g - convert tabs
s/^/"/ - prepend quote
s/$/\\n"/ - append \n and quote
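For comparison, the same per-line escaping can be sketched outside sed. This is only an illustration in Python, not part of the sed answer; the function name to_c_literal_lines is made up for the example, and it escapes in the same order as the command file above (backslashes, then quotes, then tabs).

```python
def to_c_literal_lines(text: str) -> str:
    """Turn text into one quoted C string literal per input line."""
    out = []
    # Note: splitlines() also splits on \r\n and a few other line boundaries.
    for line in text.splitlines():
        escaped = (
            line.replace("\\", "\\\\")  # backslashes first, so later escapes are not doubled
                .replace('"', '\\"')    # then double quotes
                .replace("\t", "\\t")   # then literal tabs
        )
        # Prepend a quote, append an escaped \n and the closing quote.
        out.append('"' + escaped + '\\n"')
    return "\n".join(out) + ("\n" if out else "")


print(to_c_literal_lines("line1\nline2\n"), end="")
```

Escaping backslashes first matters: doing it last would double the backslashes that the quote and tab steps just inserted.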

Better still:
'"'`printf %s "$your_text" | sed -n -e 'H;$!b' -e 'x;s/\\/\\\\/g;s/"/\\"/g;s/ /\\t/g;s/^\n//;s/\n/\\n" \n "/g;p'`'"'
Unlike the one above, this handles newlines correctly as well. I'd just about call it a one-liner.

In Perl, file input from stdin, out to stdout. Hexifies, so no worries about escaping stuff. Doesn't remove tabs, etc. Output is a static C string.
use strict;
use warnings;
my @c = <STDIN>;
print "static char* c_str = {\n";
foreach (@c) {
my @s = split('');
print qq(\t");
printf("\\x%02x", ord($_)) foreach @s;
print qq("\n);
}
print "};\n";
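The same hexifying idea, sketched in Python for readers who do not use Perl. The function name to_hex_c_string is hypothetical; unlike the Perl script, this sketch encodes the text as UTF-8 bytes, declares the pointer const, and omits the braces, all of which are my own choices rather than anything the answer prescribes.

```python
def to_hex_c_string(text: str, name: str = "c_str") -> str:
    """Emit a C declaration whose string literals are entirely \\xNN escapes."""
    lines = [f"static const char *{name} ="]
    for line in text.splitlines(keepends=True):
        # Every byte is escaped, so no hex escape is ever followed by a bare
        # hex digit that C would absorb into the preceding escape.
        escaped = "".join(f"\\x{b:02x}" for b in line.encode("utf-8"))
        lines.append(f'\t"{escaped}"')
    if len(lines) == 1:
        lines.append('\t""')  # empty input still needs an initializer
    return "\n".join(lines) + ";\n"


print(to_hex_c_string("ab\n"), end="")
```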

perl regex negative-lookbehind detect file lacking final linefeed

The following code uses tail to test whether the last line of a file fails to culminate in a newline (linefeed, LF).
> printf 'aaa\nbbb\n' | test -n "$(tail -c1)" && echo pathological last line
> printf 'aaa\nbbb' | test -n "$(tail -c1)" && echo pathological last line
pathological last line
>
One can test for the same condition by using perl, a positive lookbehind regex, and unless, as follows. This is based on the notion that, if a file ends with newline, the character immediately preceding end-of-file will be \n by definition.
(Recall that the -n0 flag causes perl to "slurp" the entire file as a single record. Thus, there is only one $, the end of the file.)
> printf 'aaa\nbbb\n' | perl -n0 -e 'print "pathological last line\n" unless m/(?<=\n)$/;'
> printf 'aaa\nbbb' | perl -n0 -e 'print "pathological last line\n" unless m/(?<=\n)$/;'
pathological last line
>
Is there a way to accomplish this using if rather than unless, and negative lookbehind? The following fails, in that the regex seems to always match:
> printf 'aaa\nbbb\n' | perl -n0 -e 'print "pathological last line\n" if m/(?<!\n)$/;'
pathological last line
> printf 'aaa\nbbb' | perl -n0 -e 'print "pathological last line\n" if m/(?<!\n)$/;'
pathological last line
>
Why does my regex always match, even when the end-of-file is preceded by newline? I am trying to test for an end-of-file that is not preceded by newline.
/(?<=\n)$/ is a weird and expensive way of doing /\n$/.
/\n$/ means /\n(?=\n?\z)/, so it's a weird and expensive way of doing /\n\z/.
A few approaches:
perl -n0777e'print "pathological last line\n" if !/\n\z/'
perl -n0777e'print "pathological last line\n" if /(?<!\n)\z/'
perl -n0777e'print "pathological last line\n" if substr($_, -1) ne "\n"'
perl -ne'$ll=$_; END { print "pathological last line\n" if $ll !~ /\n\z/ }'
The last solution avoids slurping the entire file.
Why does my regex always match, even when the end-of-file is preceded by newline?
Because you mistakenly think that $ only matches at the end of the string. Use \z for that.
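The same trap exists in other regex engines, which makes it easy to reproduce. A small Python sketch (not Perl): Python's $ may also match just before a trailing newline, while Python's \Z matches only at the very end of the string, so it plays the role of Perl's \z here.

```python
import re

with_newline = "aaa\nbbb\n"
without_newline = "aaa\nbbb"

# "$" can match just before the final newline; at that position the
# lookbehind sees "b", so the negative lookbehind succeeds.
loose = re.compile(r"(?<!\n)$")
# "\Z" matches only at the true end of the string.
strict = re.compile(r"(?<!\n)\Z")

for text in (with_newline, without_newline):
    print(repr(text), bool(loose.search(text)), bool(strict.search(text)))
```

The loose pattern reports a match for both inputs; the strict one matches only the input that lacks the final newline.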
Do you have a strong reason for using a regular expression for this job? Practicing regular expressions, for example? If not, I think a simpler approach is to just use a while loop that tests for eof and remembers the latest character read. Something like this might do the job.
perl -le'while (!eof()) { $previous = getc(\*ARGV) }
if ($previous ne "\n") { print "pathological last line!" }'
PS: ikegami's comment about my solution being slow is well-taken. (Thanks for the helpful edit, too!) So I wondered if there's a way to read the file backwards. As it turns out, CPAN has a module for just that. After installing it, I came up with this:
perl -le 'use File::ReadBackwards;
my $bw = File::ReadBackwards->new(shift @ARGV);
print "pathological last line" if substr($bw->readline, -1) ne "\n"'
That should work efficiently, even on very large files. And when I come back to read it a year later, I will more likely understand it than I would with the regular-expression approach.
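If only the final byte matters, seeking to the end of the file avoids reading anything else. A rough Python sketch of that idea (the function name lacks_final_newline is made up, and treating an empty file as fine is my assumption, not something stated above):

```python
import os
import tempfile


def lacks_final_newline(path: str) -> bool:
    """Return True if the file is non-empty and its last byte is not LF."""
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        if f.tell() == 0:
            return False  # assumption: an empty file has no pathological last line
        f.seek(-1, os.SEEK_END)  # jump straight to the last byte
        return f.read(1) != b"\n"


# Small demonstration on throwaway files.
for content in (b"aaa\nbbb\n", b"aaa\nbbb"):
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(content)
    print(content, lacks_final_newline(tmp.name))
    os.unlink(tmp.name)
```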
The hidden context of my request was a perl script to "clean" a text file used in the TeX/LaTeX environment. This is why I wanted to slurp.
(I mistakenly thought that "laser focus" on a problem, recommended by stackoverflow, meant editing out the context.)
Thanks to the responses, here is an improved draft of the script:
#!/usr/bin/perl
use strict; use warnings; use 5.18.2;
# Loop slurp:
$/ = undef; # input record separator: entire file is a single record.
# a "trivial line" looks blank, consists exclusively of whitespace, but is not necessarily a pure newline=linefeed=LF.
while (<>) {
s/^\s*$/\n/mg; # convert any trivial line to a pure LF. Unlike \z, $ works with /m multiline.
s/[\n][\n]+/\n\n/g; # exactly 2 blank lines (newlines) separate paragraphs. Like cat -s
s/^[\n]+//; # first line is visible or "nontrivial."
s/[\n]+\z/\n/; # last line is visible or "nontrivial."
print STDOUT;
print "\n" unless m/\n\z/; # IF detect pathological last line, i.e., not ending in LF, THEN append LF.
}
And here is how it works, when named zz.pl. First a messy file, then how it looks after zz.pl gets through with it:
bash: printf ' \n \r \naaa\n \t \n \n \nbb\n\n\n\n \t'
aaa
bb
bash:
bash:
bash: printf ' \n \r \naaa\n \t \n \n \nbb\n\n\n\n \t' | zz.pl
aaa
bb
bash:
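For anyone who wants the same clean-up without slurping regexes, here is a line-based sketch in Python. It is not a translation of zz.pl, only the same intent as I read it: whitespace-only lines become blank, runs of blank lines collapse to one, leading and trailing blank lines are dropped, and the result ends in exactly one LF. The name tidy_blank_lines is hypothetical.

```python
def tidy_blank_lines(text: str) -> str:
    """Normalize blank lines and guarantee a single trailing newline."""
    kept = []
    for line in text.split("\n"):
        if line.strip() == "":
            # Whitespace-only line: keep at most one in a row, none at the start.
            if kept and kept[-1] != "":
                kept.append("")
        else:
            kept.append(line)
    # Drop a trailing blank separator, if any.
    while kept and kept[-1] == "":
        kept.pop()
    return "\n".join(kept) + "\n" if kept else ""


messy = " \n \r \naaa\n \t \n \n \nbb\n\n\n\n \t"
print(repr(tidy_blank_lines(messy)))
```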

Pattern to swap backslashes in particular quoted strings

I'm using sed (actually ssed with extended "-r", so I have access to most any regex functionality I might need) to process files, one change being to convert doubled backslash characters to single forward slashes inside quoted strings. The problem is that some quoted strings that contain the doubled backslashes should not be converted. All quoted strings I want to target have a certain word "myPhrase" inside the quotes.
So for a file with these two lines:
"\\\\server\\dir\\myPhrase\\subdir"
"Don't change \\something me!"
the output would be:
"//server/dir/myPhrase/subdir"
"Don't change \\something me!"
I've tried various combinations of lookahead like (?=myPhrase) within a search pattern that finds the desired quoted chunks and replaces a capture group (\\) with / as the replacement, but all my attempts either replace just the first occurrence of the doubled backslashes, or those to the left of myPhrase, etc.
I'm sure there is some combination of lookahead/noncapture/recursion that should do this, but I'm blanking out completely right now.
With GNU awk for the 3rd arg to match():
$ cat file
"dont change \\this" "\\\\server\\dir\\myPhrase\\subdir" "nor\\this"
"Don't change \\something me!"
$ awk 'match($0,/(.*)("[^"]*myPhrase[^"]*")(.*)/,a){gsub(/[\\][\\]/,"/",a[2]); $0=a[1] a[2] a[3]} 1' file
"dont change \\this" "//server/dir/myPhrase/subdir" "nor\\this"
"Don't change \\something me!"
Try this Perl solution
perl -pe ' s{"(.+)?"}{$x=$1; if($x=~m/myPhrase/ ) {$x=~s!\\\\!/!g};sprintf("\x22%s\x22",$x)}ge ' file
with the below inputs
$ cat ykdvd.txt
"\\\\server\\dir\\myPhrase\\subdir"
"Don't change \\something me!"
Another line
$ perl -pe ' s{"(.+)?"}{$x=$1; if($x=~m/myPhrase/ ) {$x=~s!\\\\!/!g};sprintf("\x22%s\x22",$x)}ge ' ykdvd.txt
"//server/dir/myPhrase/subdir"
"Don't change \\something me!"
Another line
$
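Both answers follow the same two-step idea: first isolate each quoted string, then rewrite it only if it contains the marker word. A minimal Python sketch of that idea, assuming quoted strings never contain an escaped double quote (fix_slashes and the marker parameter are names invented for this example):

```python
import re

# One double-quoted chunk with no embedded double quotes (an assumption).
QUOTED = re.compile(r'"[^"]*"')


def fix_slashes(line: str, marker: str = "myPhrase") -> str:
    """Replace doubled backslashes with / inside quoted strings containing marker."""
    def repl(match: re.Match) -> str:
        chunk = match.group(0)
        # "\\\\" in Python source is two literal backslash characters.
        return chunk.replace("\\\\", "/") if marker in chunk else chunk

    return QUOTED.sub(repl, line)


print(fix_slashes(r'"\\\\server\\dir\\myPhrase\\subdir"'))
print(fix_slashes(r'''"Don't change \\something me!"'''))
```

Doing the test inside the replacement callback sidesteps the lookahead problem from the question: the decision is made once per quoted string, and the inner replace handles every doubled backslash in it.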

Replacing spaces with underscores within quotes

I need to replace within a large text file all occurrences such as 'yw234DV w-23-sDf wef23s-d-f' with the same strings but with underscores instead of spaces for all spaces within quotes, without replacing any spaces outside quotes with underscores.
I'm trying to find a solution for substitution within vim, but a sed solution would also be much appreciated. The number of tokens in each quote-delimited string may vary.
I've been playing with some regexes in vim, but they're pretty elementary and seem to be missing what I need.
My current attempt:
%s/'{[:alnum:] }*/'\0\_/g
And I'm experimenting with variations on that.
This is most similar to my question, though it is Java:
Replacing spaces within quotes
Sample Input:
'wiUEF7-gvouw ow wo24-RTeih we', 'yt23IT iug-76'
Sample Output:
'wiUEF7-gvouw_ow_wo24-RTeih_we', 'yt23IT_iug-76'
You may try this with VIM, tried this on Macvim:
%s/\%('[^']*'\)*\('[^']*'\)/\=substitute(submatch(1), ' ', '_', 'g')/g
Much simpler solution, thanks to @SergioAraujo:
%s/\v%(('[^']*'))/\=substitute(submatch(1),' ', '_', 'g')/g
Not sure however, if below is the outcome you have expected
Output:
'wiUEF7-gvouw_ow_wo24-RTeih_we', 'yt23IT_iug-76'
In perl:
perl -i -pe's{(\x27.*?\x27)}{ (my $subst = $1) =~ tr/ /_/; $subst }ge' yourfile
or with perl5.14 or above:
perl -i -pe's{(\x27.*?\x27)}{ $1 =~ tr/ /_/r }ge' yourfile
With this as the input file:
$ cat file
'wiUEF7-gvouw ow wo24-RTeih we', 'yt23IT iug-76'
We can convert all spaces inside of single-quotes into underscores with:
$ sed -E ":a; s/^(([^']*'[^']*')*[^']*'[^']*)[[:space:]]/\1_/; ta" file
'wiUEF7-gvouw_ow_wo24-RTeih_we', 'yt23IT_iug-76'
How it works
:a
This creates a label a.
s/^(([^']*'[^']*')*[^']*'[^']*)[[:space:]]/\1_/
This inserts the underscores where we want them.
^(([^']*'[^']*')*[^']*'[^']*)[[:space:]]
This looks for any odd number of single quotes followed by any number of non-quote characters followed by a space. Everything before that space is saved in group 1.
\1_
This replaces the matched text with group 1 followed by an underscore.
ta
If the previous command put any new underscores in the string, then jump back to label a and try again.
Using FPAT variable in gnu awk you can do this:
awk -v OFS=', ' -v FPAT="'[^']*'" '{for (h=1; h<=NF; h++)
{gsub(/[[:blank:]]/, "_", $h); printf "%s%s", $h, (h < NF ? OFS : ORS)}}' file
'wiUEF7-gvouw_ow_wo24-RTeih_we', 'yt23IT_iug-76'
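The answers above share one pattern: match each quoted chunk, then replace spaces only inside the match. A short Python sketch of that pattern, assuming quotes are balanced and never nested or escaped (underscore_quoted is a made-up name):

```python
import re


def underscore_quoted(line: str) -> str:
    """Replace spaces with underscores inside single-quoted chunks only."""
    return re.sub(
        r"'[^']*'",                                # one quoted chunk
        lambda m: m.group(0).replace(" ", "_"),    # rewrite only inside it
        line,
    )


print(underscore_quoted("'wiUEF7-gvouw ow wo24-RTeih we', 'yt23IT iug-76'"))
```

The space after the comma sits between two matches, so it is left alone.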

Sed substitution for text

I have this kind of text:
0
4.496
4.496
Plain text.
7.186
Plain text
10.949
Plain text
12.988
Plain text
16.11
Plain text
17.569
Plain text
ESP:
Plain text
I am trying to make a sed substitution because I need all aligned after the numbers, something like:
0
4.496
4.496 Plain text.
7.186 Plain text
10.949 Plain text
12.988 Plain text
16.11 Plain text
17.569 Plain text ESP:Plain text
I have tried different combinations of sed commands, but I can't keep part of the matched pattern:
sed -r 's/\([0-9]+\.[0-9]*\)\s*/\1/g'
I am trying to remove all \n after the numbers and align the text, but it does not work. I need to align text with text too.
I tried this as well:
sed -r 's/\n*//g'
But without results.
Thank you
This is a bit tricky. Your approach does not work because sed operates in a line-based manner (it reads a line, runs the code, reads another line, runs the code, and so forth), so it doesn't see the newlines unless you do special things. We'll have to completely override sed's normal control flow.
With GNU sed:
sed ':a $!{ N; /\n[0-9]/ { P; s/.*\n//; }; s/\n/ /; ba }' filename
This works as follows:
:a # Jump label for looping (we'll build our own loop, with
# black jack and...nevermind)
$! { # Unless the end of the input was reached:
N # fetch another line, append it to what we already have
/\n[0-9]/ { # if the new line begins with a number (if the pattern space
# contains a newline followed by a number)
P # print everything up to the newline
s/.*\n// # remove everything before the newline
}
s/\n/ / # remove the newline if it is still there
ba # go to a (loop)
}
# After the end of the input we drop off here and implicitly
# print the last line.
The code can be adapted to work with BSD sed (as found on *BSD and Mac OS X), but BSD sed is a bit picky about labels and jump instructions. I believe
sed -e ':a' -e '$!{ N; /\n[0-9]/ { P; s/.*\n//; }; s/\n/ /; ba' -e '}' filename
should work.
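The control flow of that sed loop may be easier to follow as ordinary code. A sketch in Python of the same rule, under the same assumption that a line starting with a digit begins a new record and any other line is appended to the previous record with a space (join_after_numbers is a hypothetical name):

```python
def join_after_numbers(lines):
    """Glue each non-numeric line onto the preceding record."""
    records = []
    for line in lines:
        if records and not line[:1].isdigit():
            records[-1] += " " + line   # continuation: replace the newline with a space
        else:
            records.append(line)        # line starts with a digit: begin a new record
    return records


sample = ["0", "4.496", "4.496", "Plain text.", "7.186", "Plain text",
          "17.569", "Plain text", "ESP:", "Plain text"]
print("\n".join(join_after_numbers(sample)))
```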
This gnu-awk command can also handle this:
awk -v RS= 'BEGIN{ FPAT="(^|\n)[0-9]+(\\.[0-9]+)?(\n[^0-9][^\n]*)*" }
{for (i=1; i<=NF; i++) {f=$i; sub(/^\n/, "", f); gsub(/\n/, " ", f); print f}}' file
0
4.496
4.496 Plain text.
7.186 Plain text
10.949 Plain text
12.988 Plain text
16.11 Plain text
17.569 Plain text ESP: Plain text
sed -n '1h;1!H;$!b
x;s/\n\([^0-9]\)/ \1/gp' YourFile
-n: print only on demand
1h;1!H;$!b;x: accumulate every line in the hold buffer; at the last line, swap it back so the whole file is in the working buffer
s/\n\([^0-9]\)/ \1/gp: replace every newline that is followed by a non-digit character with a space, then print the result.
This might work for you (GNU sed):
sed ':a;N;s/\n\([[:alpha:]]\)/ \1/;ta;P;D' file
Process two lines at a time. If the second line begins with an alphabetic character, remove the preceding newline, append another line, and repeat. If the second line does not begin with an alphabetic character, print and then delete the first line and its newline. The second line now becomes the first line and the process repeats.

how to rejoin words that are split across lines with a hyphen in a text file

OCR texts often have words that flow from one line to another with a hyphen at the end of the first line. (ie: the word has '-\n' inserted in it).
I would like to rejoin all such split words in a text file (in a linux environment).
I believe this should be possible with sed or awk, but the syntax for these is dark magic to me! I knew a text editor in windows that did regex search/replace with newlines in the search expression, but am unaware of such in linux.
Make sure to back up ocr_file before running as this command will modify the contents of ocr_file:
perl -i~ -e 'BEGIN{$/=undef} ($f=<>) =~ s#-\s*\n\s*(\S+)#$1\n#mg; print $f' ocr_file
This answer is relevant, because I want the words joined together... not just a removal of the dash character.
cat file| perl -CS -pe's/-\n//'|fmt -w52
is the short answer, but uses fmt to reform paragraphs after the paragraphs were mangled by perl.
without fmt, you can do
#!/usr/bin/perl
use open qw(:std :utf8);
undef $/; $_=<>;
s/-\n(\w+\W+)\s*/$1\n/sg;
print;
also, if you're doing OCR, you can use this perl one-liner to convert unicode utf-8 dashes to ascii dash characters. note the -CS option to tell perl about utf-8.
# U+2010 - U+2015 (unicode hyphens and dashes) to ascii dash
perl -CS -pe 'tr/\x{2010}\x{2011}\x{2012}\x{2013}\x{2014}\x{2015}/-/'
cat file | perl -p -e 's/-\n//'
If the file has windows line endings, you'll need to catch the cr-lf with something like:
cat file | perl -p -e 's/-\s\n//'
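The same substitution as a Python sketch, handling both LF and CRLF endings in one pattern. The function name rejoin_hyphenated is made up, and allowing spaces or tabs between the hyphen and the line break is my assumption. Like the one-liners above, it pulls the following line up onto the current one, and it cannot tell a line-break hyphen from a genuine compound such as "well-" / "known".

```python
import re


def rejoin_hyphenated(text: str) -> str:
    """Remove a line-ending hyphen together with the line break after it."""
    # hyphen, optional trailing spaces/tabs, then LF or CRLF
    return re.sub(r"-[ \t]*\r?\n", "", text)


print(rejoin_hyphenated("an exam-\nple of OCR out-\r\nput\n"), end="")
```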
Hey this is my first answer post, here goes:
'-\n' I suspect are the line-feed characters. You can use sed to remove these. You could try the following as a test:
1) create a test file:
echo "hello this is a test -\n" > testfile
2) check the file has the expected contents:
cat testfile
3) test the sed command, this sends the edited text stream to standard out (ie your active console window) without overwriting anything:
sed 's/-\\n//g' testfile
(you should just see 'hello this is a test ' printed to the console, without the '-\n')
If I build up the command:
a) First off you have the sed command itself:
sed
b) Secondly the expression and sed specific controls need to be in quotations:
sed 'sedcontrols+regex' (the text in quotations isn't what you'll actually enter, we'll fill this in as we go along)
c) Specify the file you are reading from:
sed 'sedcontrols+regex' testfile
d) To delete the string in question, sed needs to be told to substitute the unwanted characters with nothing (null, zero), so you use 's' to substitute, forward-slash, then the unwanted string (more on that in a sec), then forward-slash again, then nothing (what it's being substituted with), then forward-slash, and then the flags (as in do you want to replace only the first match on each line or every match). In this case I will select 'g', which represents global, as in every match on every line. So now we have:
sed 's/regex//g' testfile
e) We need to add in the unwanted string, but it gets confusing because if there is a backslash in your string, it needs to be escaped with another backslash. So, the unwanted string
-\n ends up looking like -\\n
We can output the edited text stream to stdout as follows:
sed 's/-\\n//g' testfile
To save the results without overwriting anything (assuming testfile2 doesn't exist) we can redirect the output to a file:
sed 's/-\\n//g' testfile >testfile2
sed -z 's/-\n//g' file_with_hyphens