Find and replace first double-quote before special character via grep - regex

I have a whole bunch of files that all have blocks of text in them that look like this:
"Each file has different text between the opening double-quote
and the closing right-quote (or whatever it's called)”
Perhaps not relevant, but in the past I have used grep to do a search and replace like this:
grep -Rl 'search' ./path/to/files/ | xargs sed -i 's/search/replace/g'
Is there any way to do something similar, but use a regex to replace the opening plain old double-quote with a left-quote (“)? The only reliable way to replace the correct double-quote characters is to search on the right-quote, then backwards to the previous double-quote. I think. I'm just not sure if that's possible or how to do it.
I could just do it with a PHP script, but then I wouldn't get to see if it's possible from the command line.

You can use sed:
sed -i.bak 's/"\([^”]*”\)/“\1/' file
cat file
“Each file has different text between the opening double-quote and the closing right-quote (or whatever it's called)”
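For what it's worth, here is a minimal sketch of that substitution on two hypothetical sample lines, so the result can be checked without touching real files. It assumes both quotes sit on the same line; sed reads one line at a time, so a quote pair split across two lines would not match.

```shell
# Hypothetical sample lines: one ends with a curly closing quote, one has plain quotes only
fixed=$(printf '%s\n' '"Each file has different text”' | sed 's/"\([^”]*”\)/“\1/')
untouched=$(printf '%s\n' 'say "hello" twice' | sed 's/"\([^”]*”\)/“\1/')

# The first line gets a curly opening quote; the second has no ” and is left alone
printf '%s\n' "$fixed" "$untouched"
```

Only a plain double-quote that is followed somewhere on the line by ” is changed, which is the "search backwards from the right-quote" behaviour the question asks for.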

Related

Replace Windows filepath in text file by using a Linux sed regular expression

I have a huge number of text files with tag-like syntax. Those files contain patterns like this:
<TAG1=foo><TAG-2=\\10.0.0.1\directory\filename.pdf><TAG3> ...
<TAG4=bar><TAG-6=\\10.0.0.1\directory\filename.tif,other content><TAG5>
I need to replace the first part of those UNC paths with new ones, meaning:
<TAG1=foo><TAG-2=D:\localdirectory\filename.pdf><TAG3> ...
<TAG4=bar><TAG-6=D:\localdirectory\filename.tif,other content><TAG5>
There is a huge number of files to process and so I need to automate this path replacement. So far I tried multiple regex with sed (on Linux) but did not get close to a solution.
#!/bin/bash
# New directory (escaped)
newpath='D:\\localdirectory\\'
# Actual replacement (doesn't work)
sed -i "s#\(<TAG-2=\)\([^\\]+\.pdf\)#\1${newpath}\2#g" filetoprocess.txt
sed -i "s#\(<TAG-6=\)\([^\\]+\.tif\)#\1${newpath}\2#g" filetoprocess.txt
Any suggestions are welcome
This shell script using sed might work:
#!/bin/bash
oldpath='\\\\10\.0\.0\.1\\directory\\'
newpath='D:\\localdirectory\\'
#sed -i "s#${oldpath}#${newpath}#g" filetoprocess.txt
sed -r -i "s#(<TAG-2=)${oldpath}([^>]+pdf)#\1${newpath}\2#g;
s#(<TAG-6=)${oldpath}([^>]+tif)#\1${newpath}\2#g;
" filetoprocess.txt
In the first line the shell shebang is #! (notice the exclamation mark). And I believe that the second line in your input example should have the TAG-6.
In the paths some care is necessary for the characters that have a special meaning in regular expressions:
you have to escape the . and \ with a backslash
this leads to the funny looking \\\\ (two escaped backslashes)
In the last line the -r option saves a bit of escaping in the argument. Note that I used [^>]+ instead of [^\\]+ to get the path part until the extension.
The [^\\]+ in your sed command would match everything after the = which is not a \ and that is only the D: part.
So your replacement would only match a literal D:.pdf.
But I would suggest trying the other (commented) sed command, which just replaces the paths no matter what the TAG and file extension are.
(Back up your files first, since you use -i in-place replacement.)
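To see the escaping at work without risking real files, here is a small sketch that runs the simpler (commented) substitution over the two sample lines from the question. The variable names input and result are made up for the illustration, and it writes to standard output instead of using -i.

```shell
# Sample lines taken from the question, held in a variable instead of a file
input=$(cat <<'EOF'
<TAG1=foo><TAG-2=\\10.0.0.1\directory\filename.pdf><TAG3>
<TAG4=bar><TAG-6=\\10.0.0.1\directory\filename.tif,other content><TAG5>
EOF
)

# Single quotes keep every backslash; sed then reads \\ as one literal backslash
oldpath='\\\\10\.0\.0\.1\\directory\\'
newpath='D:\\localdirectory\\'

# Variable expansion inside double quotes does not strip backslashes again
result=$(printf '%s\n' "$input" | sed "s#${oldpath}#${newpath}#g")
printf '%s\n' "$result"
```

Once the output looks right, the same expression can be used with -i on the real files.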
Finally I came up with the following regular expression. This solution can also handle "/" Unix paths, dollar signs ($) and hyphens (-):
sed -i -r 's#(<TAG-2=|TAG-6=)([\/]{2})([0-9.a-zA-Z_$ -]+[\/])+([0-9.a-zA-Z_$ -]+\.[pPtT][dDiI][fF])#\1'"${newpath}"'\\\4#g'

Incorrect escaping for regex expression in sed; I don't want a literal TAB, I want \t

I'm trying to run a script to modify .bashrc to insert the time in front of the command prompt. But I keep ending up with a literal TAB instead.
I have tried
sed -i "s/PS1='\${deb/PS1=\'\\t \${deb/" {filename}
sed -i "s/PS1='\${deb/PS1='\\t \${deb/" {filename}
sed -i "s/PS1='\${deb/PS1='\t \${deb/" {filename}
sed -i 's/PS1=\'\${deb/PS1=\'\\t \${deb/' {filename}
sed -i $'s/PS1=\'\${deb/PS1=\'\\t \${deb/' {filename}
and many variations of the above. I usually get either nothing inserted or an actual tab.
I have found if I do:
sed $'s/PS1=\'\${deb/PS1=\'\\t \${deb/' {filename} | grep PS1=
then I get output that appears to be what I'm after, but just not in the file.
How do I insert an actual \t and NOT a TAB inline to a file I want to edit?
In case it wasn't obvious, the last example above and the test on the command line use the same regex. Or at least, I think they do.
I was trying to change two lines in .bashrc. These lines pertained to the command prompt and are reproduced in full here:
~⟫ grep PS1 .bashrc
PS1='${debian_chroot:+($debian_chroot)}\[\033[01;32m\]\u#\h\[\033[00m\]:\[\033[01;34m\]\w\[\033[00m\]\$ '
PS1='${debian_chroot:+($debian_chroot)}\u#\h:\w\$ '
PS1="\[\e]0;${debian_chroot:+($debian_chroot)}\u#\h: \w\a\]$PS1"
I want it to be converted to:
PS1='\t ${debian_chroot:+($debian_chroot)}\[\033[01;32m\]\u#\h\[\033[00m\]:\[\033[01;34m\]\w\[\033[00m\]\$ '
PS1='\t ${debian_chroot:+($debian_chroot)}\u#\h:\w\$ '
PS1="\[\e]0;${debian_chroot:+($debian_chroot)}\u#\h: \w\a\]$PS1"
I think it is obvious, but, I don't want to insert a TAB into the line. I want to insert \t literally, that is, "backslash and t" together and not have them translate as a TAB.
As you're using double quotes around the sed command, you will need to escape the \ as well, that is use \\\t in the replacement string. This is because the shell is transforming \\ into \ before sed sees it, so if you start with \\t then you end up with \t and sed inserts a tab.
If you were using single quotes around the command, then you wouldn't have to double-escape as sed would see \\t. However, it looks like your pattern contains single quotes.
At a guess based on your attempts, you need to use something like this:
sed -E -i.bak "s/(PS1=')(\\$\\{deb)/\\1\\\t\\2/" .bashrc
I've used capture groups to reduce repetition. The references to the captured groups need escaping too. I also added a suffix to the -i switch so that a backup of your original file is made.
If you show us the current line in your file and the desired output, then we can provide you with the exact answer.
You need to double-escape the \t:
sed -i "s/PS1='\${deb/PS1='\\\t \${deb/" .bashrc
The shell removes one level of escaping and sed removes another, so a literal \t ends up in your file.
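The two levels of escaping can be checked on a throwaway line before editing .bashrc. This is only a sketch: the shortened PS1 line is a hypothetical stand-in for the real ones, and the TAB in the second command assumes GNU sed, which treats \t in the replacement as a tab.

```shell
# Hypothetical one-line stand-in for .bashrc (quoted heredoc keeps it verbatim)
input=$(cat <<'EOF'
PS1='${debian_chroot:+($debian_chroot)}\w\$ '
EOF
)

# Shell turns \\\t into \\t; sed turns \\t into a literal backslash followed by t
result=$(printf '%s\n' "$input" | sed "s/PS1='\${deb/PS1='\\\t \${deb/")

# Shell turns \\t into \t; GNU sed then inserts a real TAB (the unwanted outcome)
tabbed=$(printf '%s\n' "$input" | sed "s/PS1='\${deb/PS1='\\t \${deb/")

printf '%s\n' "$result"
```

Piping the output through cat -A (GNU) or od -c makes the difference between the two results visible.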

Delete blank line before a pattern. What's wrong? Currently using Perl but open to sed/AWK

In a long document, I want to selectively delete the particular newlines that precede the exact string \begin{enumerate*}, ideally with a one-liner in bash or zsh.
That is, I want to convert test.tex:
Text in paragraphs.
More text
\begin{enumerate*} \item thing
to
Text in paragraphs.
More text \begin{enumerate*} \item thing
with a one-liner like
cat test.tex | perl -p -e 's/\n(?=(\\begin\{enumerate\*\}))/ /'
or
cat test.tex | perl -p -e 's/\n\\begin\{enumerate\*\}/\\begin{enumerate*}/'
but I must be missing something because it doesn't make any change.
I also clearly don't need a regular expression here. If there's a way to do this with exact string matching instead of regex, I'd rather use that way. For instance, in R I could do this with sub("\n\\begin{enumerate*}", "\\begin{enumerate*}", fixed = TRUE).
You can use the -0 (digit zero) switch with Perl to specify the line separator. Traditionally -0777 is used to read the entire file at once.
You also need to be careful about regex metacharacters in your search string. Characters like *, {, } and \ mean something special within a regex pattern, and you should escape them, usually with a \Q ... \E construct.
Taking these points into account, this should work for you
perl -0777 -pe' s/\n+(?=\Q\begin{enumerate*}\E)/ / ' myfile
perl -p processes a file line by line, so a regex that spans a newline and the start of the next line can never match.
I would recommend something like
perl -e '$text = join "", <>; $text =~ s/your_regex_here//; print $text' test.txt
Mind that it loads the whole file to memory.
Also, if you want to modify the file in place, you can't just redirect the output with > test.txt; see this question.
I found a solution with sed (number 25 on this page) that doesn't read the entire file into memory:
sed -i bak -n '/^\\begin{enumerate\*}/{x;d;};1h;1!{x;p;};${x;p;}' test.tex
The downside is that this doesn't actually join the two lines; instead it produces
Text in paragraphs.
More text
\begin{enumerate*} \item thing
which is good enough for what I need (latex treats single newlines the same as regular spaces)

Using Sed to delete lines which contain non alphabets

The following Regex works as expected in Notepad++:
^.*[^a-z\r\n].*$
However, when I try to use it with sed, it won't work.
sed -r 's/\(^.*[^a-z\r\n].*$\)//g' wordlist.txt
You could use:
sed -i '/[^a-z]/d' wordlist.txt
This will delete each line that has a non-alphabet character (no need to specify linefeeds)
EDIT:
Your regex doesn't work because you are trying to match
( bracket
^ beginning of line
...
$ end of line
) bracket
As you won't have a bracket and then the beginning of the line, your regex simply doesn't match anything.
Note, also an expression of
s/\(^.*[^a-z\r\n].*$\)//g
wouldn't delete a line but replace it with a blank line
EDIT2:
Note that in sed the -r flag changes the meaning of \( and \): without -r they delimit a group, but with -r they match literal brackets.
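A small sketch of that difference, using made-up input strings. It uses -E, which GNU sed accepts as a synonym for -r.

```shell
# Basic regex (no flag): \( \) form a group, so \1 refers to "abc"
bre=$(printf '%s\n' 'abc' | sed 's/\(abc\)/[\1]/')

# Extended regex: \( \) are literal brackets and only match "(abc)"
ere_literal=$(printf '%s\n' '(abc)' | sed -E 's/\(abc\)/X/')

# Extended regex: bare ( ) form the group instead
ere_group=$(printf '%s\n' 'abc' | sed -E 's/(abc)/[\1]/')

printf '%s\n' "$bre" "$ere_literal" "$ere_group"
```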
Two things:
Sed is a stream editor. It processes one line of the input at a time. That means the search and replace commands, etc, can only see the current line. By contrast, Notepad++ has the whole file in memory and so its search expressions can span two or more lines.
Your command sed -r 's/\(^.*[^a-z\r\n].*$\)//g' wordlist.txt includes \( and \). These mean real (ie non-escaped) round brackets. So the command says find a line that starts with a ( and ends with a ) with some other characters between and replace it with nothing. Rewriting the command as sed -r 's/^.*[^a-z\r\n].*$//g' wordlist.txt should have the desired effect. You could also remove the \r\n to give sed -r 's/^.*[^a-z].*$//g' wordlist.txt. But neither of these will be exactly the same as the Notepad++ command as they will leave empty lines. So you may find the command sed -r '/^.*[^a-z].*$/d' wordlist.txt is closer to what you really want.
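To make the difference between deleting and blanking concrete, here is a sketch on a hypothetical four-word list. LC_ALL=C is my own addition, so that [a-z] means exactly the 26 lowercase ASCII letters regardless of locale.

```shell
# Hypothetical word list: two clean words, one with a digit, one with a capital
words=$(printf '%s\n' apple b4nana Cherry pear)

# d removes the offending lines entirely
deleted=$(printf '%s\n' "$words" | LC_ALL=C sed '/[^a-z]/d')

# s///  only empties them, leaving blank lines behind
blanked=$(printf '%s\n' "$words" | LC_ALL=C sed 's/^.*[^a-z].*$//')

printf '%s\n' "$deleted"
```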

Replace string that contains CRLF?

I'm reformatting a file, and I want to perform the following steps:
Replace double CRLF's with a temporary character sequence ($CRLF$ or something)
Remove all CRLF's in the whole file
Go back and replace the double CRLF's.
So input like this:
This is a paragraph
of text that has
been manually fitted
into a certain colum
width.
This is another
paragraph of text
that is the same.
Will become
This is a paragraph of text that has been manually fitted into a certain colum width.
This is another paragraph of text that is the same.
It seems this should be possible by piping the input through a few simple sed programs, but I'm not sure how to refer to CRLF in sed (to use in sed 's/<CRLF><CRLF>/$CRLF$/'). Or maybe there's a better way of doing this?
You can use sed to decorate every line with a <CRLF> marker at the end:
sed 's/$/<CRLF>/'
then remove all \r\n with tr
| tr -d "\r\n"
and then replace double CRLF's with \n
| sed 's/<CRLF><CRLF>/\n/g'
and remove leftover CRLF's.
There was a one-liner sed which did all this in a single cycle, but I can't seem to find it now.
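Put together, the steps above might look like the sketch below. The function name reflow and the sample text are made up for the illustration, and it makes a few assumptions: paragraphs are separated by exactly one blank line, the text never contains the marker <CRLF> itself, and sed accepts \n in the replacement (GNU sed does). The output uses plain LF line endings.

```shell
# reflow: join the lines of each paragraph, keeping one line per paragraph
reflow() {
  sed 's/$/<CRLF>/' |      # 1. mark the end of every line
  tr -d '\r\n' |           # 2. drop all CR and LF characters
  sed 's/<CRLF><CRLF>/\n/g; s/<CRLF>$//; s/<CRLF>/ /g'   # 3. doubled marker -> newline, rest -> space
}

# Hypothetical CRLF input: two short paragraphs separated by an empty line
result=$(printf 'This is a paragraph\r\nof text that has\r\n\r\nThis is another\r\nparagraph of text\r\n' | reflow)
printf '%s\n' "$result"
```

If the file has to go back to Windows line endings afterwards, a tool such as unix2dos can convert the result.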
Try the below:
cat file.txt | sed 's/$/ /;s/^ *$/CRLF/' | tr -d '\r\n' | sed 's/CRLF/\r\n/g'
That's not quite the method you've given; what this does is the below:
Add a space to the end of each line.
Replace any line that contains only whitespace (ie blank lines) with "CRLF".
Deletes any line-breaking characters (both CR and LF).
Replaces any occurrences of the string "CRLF" with a Windows-style line break.
This works on Cygwin bash for me.
Redefine the Problem
It looks like what you're really trying to do is reflow your paragraphs and single-space your lines. There are a number of ways you can do this.
A Non-Sed Solution
If you don't mind using some packages outside coreutils, you could use some additional shell utilities to make this as easy as:
dos2unix /tmp/foo
fmt -w0 /tmp/foo | cat --squeeze-blank | sponge /tmp/foo
unix2dos /tmp/foo
Sponge is from the moreutils package, and will allow you to write to the same file you're reading. The dos2unix (or alternatively the tofrodos) package will allow you to convert your line endings back and forth for easier integration with tools that expect Unix-style line endings.
This might work for you (GNU sed):
sed ':a;$!{N;/\n$/{p;d};s/\r\?\n/ /;ba}' file
Am I missing why this is not easier?
Add CRLF (GNU sed, which understands \r):
sed -e 's/\r*$/\r/' < index.html > index_CRLF.html
Remove CRLF and go back to Unix line endings:
sed -e 's/\r$//' < index_CRLF.html > index.html