Remove Word smart quotes from a text file using vim - regex

I have a large text file, originally generated in Microsoft Word, that contains these four character sequences, alongside regular text:
?~#~\
?~#~]
?~#~X
?~#~Y
From the content of what is written in the file, it appears that the sequences respectively correspond to open double quotes, close double quotes, open single quote, and close single quote. When displayed in Vim, everything in the sequences other than the question mark appears in blue.
I cannot remove them with a command such as
:.,$s/?~#~Y//
This command results in the following error from vim:
E33: No previous substitute regular expression
E476: Invalid command
Press ENTER or type command to continue
These commands also produce errors:
:.,$s/\?~#~Y//
:.,$s/\?\~\#\~Y//
Specifically,
E866: (NFA regexp) Misplaced ?
E476: Invalid command
Press ENTER or type command to continue
What would be the correct way to automatically remove or replace the sequences? Ideally, I'd like to remove the double quotes, and replace the open/close single quotes with a traditional single quote or apostrophe.

Since "everything in the sequences other than the question mark appears in blue", all characters except the question mark are probably binary characters. I'd suggest this approach:
go to the first sequence and yank it: press v to start marking, extend the mark to the end of the sequence, then press y
paste the sequence as the search pattern from the unnamed register: type :%s/, press Ctrl-r followed by ", then type //g and press Enter
repeat for the remaining sequences.

If you’re using a unicode-compatible encoding (such as utf-8) and your font supports it, the smart quotes will show properly.
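Additionally, the digraphs for them are 6', 6", 9', and 9" (Vim's :digraphs table lists them as '6, "6, '9, and "9, and Ctrl-k accepts the two characters in either order). This makes it pretty easy to chain a couple of substitutes to swap them for straight variants: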
%s/<C-k>6'\|<C-k>9'/'/g
Etc. Wrap it in a function or command to make it easier for later.
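If the file can be processed outside Vim, the same swap can be sketched in Python. This is only an illustration of the idea, assuming the text has already been decoded correctly (for example as UTF-8); the names SMART_QUOTE_MAP and fix_smart_quotes are made up for this example. It follows what the question asked for: double smart quotes are removed and single ones become a plain apostrophe.

```python
# Map each Word "smart quote" code point to its replacement.
# A value of None deletes the character; a string replaces it.
SMART_QUOTE_MAP = str.maketrans({
    "\u201c": None,  # left double quotation mark: removed
    "\u201d": None,  # right double quotation mark: removed
    "\u2018": "'",   # left single quotation mark -> apostrophe
    "\u2019": "'",   # right single quotation mark -> apostrophe
})


def fix_smart_quotes(text: str) -> str:
    """Return text with smart quotes removed or straightened."""
    return text.translate(SMART_QUOTE_MAP)


sample = "\u201cIt\u2019s a \u2018test\u2019,\u201d she said."
print(fix_smart_quotes(sample))
```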

Sorry to bump an old thread but I stumbled upon this late at night while trying to figure out how to remove the exact same characters from a bind9 configuration file that I had pasted in from a website. The aberrant characters were "~#~X", "~#~Y", " | ", and I believe another but I can't remember it at the moment. Anyway, regular expressions couldn't seem to find and replace using the above methods, but I was able to find a solution.
If you can set VIM to show the special characters in their binary representation, then you can use regex to find that. Here's how I did it:
Steps to fix
Open the file with the problem characters in VIM
(a) original method - :set encoding=latin1|set isprint=|set display+=uhex
(b) easier method - :set encoding=utf-8
NOTE: either of these should display the digraph characters in their binary form, i.e. as hex codes between angle brackets (e.g. <80>, <99>, ... )
Then search and replace with VIM regex like so
:%s:\%xNN:':g
(replace NN with the byte code, i.e. 80, 99, etc.)
Let's break that command down, shall we:
%s: - the substitute command: the % range applies it to every line in the file, and the 's' stands for substitute. The ':' (colon) has been used as the delimiter in this case, but you can use other symbols to delimit the command.
\%x - the \%x atom matches a character by its hexadecimal code, i.e. the two hex digits that Vim shows between angle brackets
NN - replace with the two chars inside of the <> that you're looking to replace in your file. In my case, the byte codes were <e2>, <80>, <99>, which I had to search for separately.
:' - then, the colon delimiting the replacement, where I'm specifying a single quote to replace the byte code; you could put whatever text you want here.
:g - finally, the last colon delimiter and the 'g' flag, which replaces every match on a line rather than only the first one.
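For what it's worth, the three byte codes <e2>, <80>, <99> together are the UTF-8 encoding of the right single quotation mark (U+2019). A rough Python sketch of the same byte-level replacement follows. It assumes the file is UTF-8 encoded; the other three byte sequences and the names REPLACEMENTS and replace_quote_bytes are my additions, not something taken from the answer above.

```python
# UTF-8 byte sequences of the four smart quotes and what to put in their place.
# Assumption: the file is UTF-8; the answer above only confirms e2 80 99.
REPLACEMENTS = {
    b"\xe2\x80\x9c": b"",   # U+201C left double quote: removed
    b"\xe2\x80\x9d": b"",   # U+201D right double quote: removed
    b"\xe2\x80\x98": b"'",  # U+2018 left single quote -> apostrophe
    b"\xe2\x80\x99": b"'",  # U+2019 right single quote -> apostrophe
}


def replace_quote_bytes(data: bytes) -> bytes:
    """Replace each smart-quote byte sequence in raw file contents."""
    for old, new in REPLACEMENTS.items():
        data = data.replace(old, new)
    return data


raw = "\u201cdon\u2019t\u201d".encode("utf-8")
print(raw)                       # the bytes as they sit in the file
print(replace_quote_bytes(raw))  # the cleaned bytes
```

Working on whole three-byte sequences avoids having to search for e2, 80 and 99 separately.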
You can do more research in VIM's help with:
:help isprint
Anyway, I hope this helps someone else in the future.
References:
https://blog-en.openalfa.com/how-to-edit-non-printing-and-unicode-characters-in-vim-editor
https://unix.stackexchange.com/questions/108020/can-vim-display-ascii-characters-only-and-treat-other-bytes-as-binary-data
VIM How do I search for a <XX> single byte representation

Related

Regex NotePad++ or batch script to find and replace double bracketed text with CR LF -- would prefer NP++

I managed to do most of my conversion in VBA Macro (Word > txt) but some changes were made also that I could not forego or get around. Unfortunately, I had not been in the habit of using styles and precise formatting in my docs... (Which is why a PanDoc conversion did not "pan" out well, if you'll excuse the pun.)
In my docs, I was using bold text/lines for in-text titles (not Heading 2 alas) but as I was converting mid-sentence one or two-word bold phrases into phrases to go between double square brackets, the makeshift titles/headings were also changed to [[some title]] format in the process.
With Find and Replace (a batch script that goes through all files in a folder would also do), I would like to search for each and any number of instances of CRLF [[some title CRLF]]CRLF and replace the brackets with ** (to make the title bold), or perhaps ## to make the headings I was missing back in MS Word (I would of course need the line breaks as well).
For better understanding, please see attached picture here:
I am fairly sure that all instances are similarly syntaxed. If not, I may be able to tailor your regex code to differing instances later on.
As you can see, I was trying to do it in two steps but that's not good, because the second step (which I couldn't even get right) would probably have altered other text I need intact (there must be sentences that start with double brackets after CRLF).
I would need the two steps in one so that only the targeted double bracketed text would be changed to bold or Heading 2.
Basically what I could not do is: find the proper regex solution for matching double CRLF-ed and square-bracketed text for any number of words that may occupy more than one line and start with a capital letter. I would need an empty line above and below the title as indicated in the image (the VBA macro somehow made two instances of CRLF and carried the brackets to a new line, which I do not like, either).
EDIT.
In the meantime I managed to cook something up but now I couldn't insert the CRLF in front of the match string. At this point this is not enough as other instances are also changed, even lowercase in-line items, for some reason...
Regex:
\[\[([A-Z][\S\s]+?)\]\]
Substitution:
## $1\r\n
https://regex101.com/r/mH6B9N/1
Since then, I have made improvements towards what I wanted (I had to test in Notepad++ and not Regex101, as the results differ), but now in multiple documents I have found matches that spill over across lines, as described here:
Single line regex search in Notepad++
Is it possible that I cannot do what I want? The problem is having non-title text strings having line-break, double brackets and capitalized letters.
What it looks like in other documents:
See here.
I circled around with red in image for clarification. See also:
https://regex101.com/r/8XsIGx/1
Is it possible to match a certain word like "címnél" and not execute on that match if that word is present in a line?
Thanks very much in advance,
F.
You can use
(?s)\R\K\[\[((?:(?!\[\[|]]).)*)\R*]](?=\R)
Replace with ## $1. See the regex demo.
Details:
(?s) - equivalent of the . matches newline option
\R - a line break sequence
\K - omit the text matched so far (the newlines)
\[\[ - a [[ text
((?:(?!\[\[|]]).)*) - Group 1: any char, as many as possible occurrences, that does not start a [[ or ]] char sequence
\R* - zero or more line breaks
]] - a ]] text
(?=\R) - immediately to the right, there must be a line break.
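For a batch run over many files, a similar substitution can be sketched in Python. Python's re module has neither \R nor \K, so this version spells out the line break explicitly and puts it back through a capture group. It also uses a lazy quantifier so the captured title does not include the line break before ]]. Both are deviations from the pattern above, and the name brackets_to_heading is hypothetical.

```python
import re

# Stand-in for \R: CRLF, LF, or a lone CR.
NL = r"(?:\r\n|\n|\r)"

# Group 1: the line break before [[ (re-inserted, since there is no \K).
# Group 2: the title text, which may not contain [[ or ]].
TITLE = re.compile(
    rf"({NL})\[\[((?:(?!\[\[|\]\]).)*?){NL}*\]\](?={NL})",
    re.DOTALL,
)


def brackets_to_heading(text: str) -> str:
    """Turn a line-initial [[title]] followed by a line break into '## title'."""
    return TITLE.sub(r"\1## \2", text)


doc = "intro\r\n[[Some Title\r\n]]\r\nbody [[inline]] text\r\n"
print(repr(brackets_to_heading(doc)))
```

The inline [[inline]] item is left alone because it is not preceded by a line break.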

Removing comments from a group of php files via command line in Windows

I have a bunch of files and I would like to send them to deployment sans any comments but with whitespace intact (so that I can make any quick changes in production in emergency cases).
The comments can be either single line comments (#, //) or multi line syntax /**/ and at any indentation level.
I want to create a batch file that when executed from any directory reads all php files and strips their comments.
I am not even sure what to try. I know I can fetch all the files with .php extension easily and loop through them. Replacing their content is easy enough as well. What I am stuck on is how to remove the comments.
There are multiple ways to do it, but I think the most efficient will be to use regular expressions.
I'm not a RegExp guru, but I know that you can use grep (or any powerful text editor such as Notepad++, Sublime Text, etc.) to replace the matched expressions with an empty string.
For example, in Sublime Text, I've tested this regular expression, which finds multi-line comments as well as single-line ones:
\/\*([\s\S]*?)\*/|//.*|#.*
Quick Explanation :
the regexp is made of 3 expressions, joined by the OR symbol (the pipe symbol |)
the first expression means "something starting with /*, followed by any characters, including whitespace and line breaks, and finishing with */"
the second means "anything starting with // and any characters following it, up to the end of the line"
the third means "anything starting with # and any characters following it, up to the end of the line"
Once you've set this search expression in Sublime Text, you can replace what it matches with an empty replacement.
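To try the expression outside an editor, here is a small Python sketch (the sample PHP snippet and the variable names are invented for the illustration). It also shows a limitation worth knowing before running this on production code: the pattern has no notion of string literals, so a // or # inside a quoted string is treated as the start of a comment.

```python
import re

# The expression from above: block comments, // comments, # comments.
COMMENT = re.compile(r"/\*[\s\S]*?\*/|//.*|#.*")

php = '''<?php
/* block
   comment */
$a = 1; // trailing
# hash comment
$url = "http://example.com";
'''

stripped = COMMENT.sub("", php)
print(stripped)
```

In the output, the $url line is cut off right after "http:" because the // inside the string matched the second alternative. A tokenizer-based approach would avoid that; the regex alone cannot.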

Use REGEX to find line breaks within a wrapped content

The direct question: How can I use REGEX lookarounds to find instances of \r\n that occur between a set of characters (stand in open and closing tags), "[ and ]" with arbitrary characters and line breaks inside as well?
The situation:
I have a large database exported to tab- or comma-delimited text files that I'm trying to import into Excel. The problem is that some of the cells come from text areas that contain line breaks and are qualified by double quotes. When importing into Excel, these line breaks are treated as new rows. I cannot adjust how the file is exported. The data needs to be preserved, but the exact format doesn't, so I was planning on using some placeholder for the returns, such as <br> or ~
Here's a generic illustration of the format of my data:
column1rowA column2rowA column3rowA column4rowA
column1rowB column2rowB "column3rowB
3Bcont
3Bcont
3Bcont
" column4rowB
column1rowC column2rowC column4rowC
column1rowD column2rowD "column3rowD
3Dcont" column4rowD
My thought has been to try to select and replace line breaks within the quotes using REGEX search and replace in Notepad++. To try to make it simpler, I have tried adding a character to the double quotes to help indicate whether it is an opening or closing quote:
"[column3rowB
3Bcont
3Bcont
3Bcont
]"
I am new to REGEX. The progress I've made (which isn't much) is:
(?<="[) missing some sort of wildcard \r\n(?=.*]")
Every iteration I've tried has also included every line break between the first "[ and last ]"
I would also appreciate any other approaches that solve the underlying problem
If you can use some tool other than Notepad++, you can use this regex (see my working example on regex101):
(?!\n(([^"]*"){2})*[^"]*$)\n
It uses a negative lookahead to find line breaks only when not followed by an even number of quotes. You could replace them with <br>, spaces, or whatever is appropriate.
Breakdown:
(?! ... ) This is the negative lookahead, necessary because it's zero-width. Anything matched by it will still be available to match again.
(([^"]*"){2})* This is the other key piece. It ensures even-numbered pairs of non-quote characters followed by a quote.
[^"]*$ This is ensuring that there are no more quotes from there until the end of the string.
Caveat:
I couldn't get it to work in Notepad++ because it always recognizes $ as the end of a line, not the end of the entire string.
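Great answer from Brian. I added an option that also considers carriage returns (i.e. line breaks written as \r\n), which worked for my CSV file. Note that the alternation has to be grouped; without the grouping, the pattern simply matches every \r:
(?!(?:\r\n|\n|\r)(([^"]*"){2})*[^"]*$)(?:\r\n|\n|\r)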
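As another approach to the underlying problem, a CSV parser already knows that a quoted cell can contain line breaks, so no lookarounds are needed. Below is a minimal Python sketch using the standard csv module. It assumes tab-delimited input with double-quote qualifiers, as in the illustration in the question; the function name flatten_newlines and the ~ placeholder are just choices made for this example.

```python
import csv
import io
import re


def flatten_newlines(text: str, placeholder: str = "~") -> str:
    """Re-emit tab-delimited text with line breaks inside cells replaced."""
    # newline="" keeps \r\n intact so the csv module can interpret it itself.
    reader = csv.reader(io.StringIO(text, newline=""), delimiter="\t")
    out = io.StringIO(newline="")
    writer = csv.writer(out, delimiter="\t", lineterminator="\r\n")
    for row in reader:
        writer.writerow([re.sub(r"\r\n|\r|\n", placeholder, cell) for cell in row])
    return out.getvalue()


raw = (
    'column1rowA\tcolumn2rowA\tcolumn3rowA\tcolumn4rowA\r\n'
    'column1rowB\tcolumn2rowB\t"column3rowB\r\n3Bcont\r\n3Bcont\r\n"\tcolumn4rowB\r\n'
)
print(flatten_newlines(raw))
```

For real files, the same reader and writer can be pointed at files opened with newline="" instead of the in-memory strings used here.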

Replace line of text Notepad++ or UltraEdit

Real quick question here that I can't work out.
I have a bunch of text files across many directories. Within these dirs are text files named init.txt
In these many text files, are lots of lines starting with
Effective =
What i need to do is replace any line that contains that string with another string,
preferably in Notepad++, or UltraEdit if need be.
In Notepad++, I've found Search -> Replace in Files... which lets me specify a starting directory, but I can't get it to replace the entire line with my new line.
I have never used regular expressions before (if that's the best way to do this) as I've never needed to, so any help would be very much appreciated.
Thank you for helping me out.
For your problem, a little regular expression may help a lot. I use regex search in Notepad++ nearly every day, and it is really useful.
I do not want to intimidate you with complicated regex grammar. Instead, I hope that after reading my answer you will see that the basics of regular expressions are not so exotic, and that they are for regular people's everyday use.
Follow these instructions:
In Notepad++ press Ctrl-F and switch to the Find in Files tab. In the Search Mode section (at the bottom of the dialog), select Regular expression
In the Find what field, what you need to input here may vary according to the specific pattern of the text you want to replace.
If the text fragment you want to substitute always
shows up at the beginning of a line,
has NO LEADING WHITESPACE before the text, and
contains EXACTLY ONE SPACE CHARACTER before the = character,
then ^Effective = should be used as the pattern in the Find what field.
The ^ symbol in ^Effective = matches the beginning of a line (so if Effective = appears in the middle of a line, it will be ignored), and the rest is the exact text to be matched.
However, if the above conditions are not all satisfied, e.g.
the text segment may contain leading whitespace,
the number of whitespace characters between the word Effective and the = symbol may vary, from one to unlimited,
then you may need to use ^[ \t]*Effective\s+=.
The [ \t]* part allows any leading spaces or tabs, and the \s+ part matches one to unlimited whitespace characters (including spaces \x20, tabs \t, carriage returns \r, and newlines \n).
If you want to match zero to unlimited whitespace characters between Effective and =, you can replace \s+ with \s*.
Because the goal is to replace the entire line rather than just this prefix, append .* to whichever pattern you use (for example ^Effective =.*), with the ". matches newline" option unchecked.
In the Replace with field, input changeLine
In the Filters field, select the file type you want to search
Check In all sub-folders
Click Replace in Files button
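If you would rather script this than click through the dialog, the steps above can be sketched in Python. This is only a sketch under some assumptions: the files are all named init.txt as described in the question, changeLine stands for whatever the new line should be, and the function names are made up for the example. Try it on a copy of the directory tree first.

```python
import re
from pathlib import Path

# A whole line: optional leading spaces/tabs, "Effective", optional
# spaces/tabs, "=", then the rest of the line.
EFFECTIVE_LINE = re.compile(r"^[ \t]*Effective[ \t]*=.*$", re.MULTILINE)


def replace_effective_lines(text: str, new_line: str) -> str:
    """Replace every line that starts with 'Effective =' by new_line."""
    # A function replacement keeps backslashes in new_line literal.
    return EFFECTIVE_LINE.sub(lambda match: new_line, text)


def process_tree(root: str, new_line: str) -> None:
    """Rewrite every init.txt below root in place."""
    for path in Path(root).rglob("init.txt"):
        original = path.read_text()
        updated = replace_effective_lines(original, new_line)
        if updated != original:
            path.write_text(updated)


sample = "Name = x\nEffective = 2010\n  Effective=old\nNotEffective = 1\n"
print(replace_effective_lines(sample, "changeLine"))
```

Calling process_tree on the starting directory would then play the role of the Replace in Files button.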
Set the search mode in Notepad++
Find: Effective =
Replace with: changeLine
Search Mode: Extended (\n, \t, etc)
From: https://superuser.com/questions/34451/notepad-find-and-replace-string-with-a-new-line

Regex-based matching and substitution with nano?

I am aware of nano's search and replace functionality, but is it capable of using regular expressions for matching and substitution (particularly substitutions that use a part of the match)? If so, can you provide some examples of the syntax used (both for matching and replacing)?
I cut my teeth on Perl-style regular expressions, but I've found that text editors will sometimes come up with their own syntax.
My version of nano has an option to switch to regex search with Meta+R. In Cygwin on Windows, the meta key is Alt, so I hit Ctrl+\ to get into search-and-replace mode, and then Alt+R to switch to regex search.
You need to add, or un-comment, the following entry in your global nanorc file (on my machine, it was /etc/nanorc):
set regexp
Then fire up a new terminal, press Ctrl+\ and do your replacements, which should now be regex-aware.
EDIT
Search for conf->(\S+):
Replace with \1_conf
Press a to replace all occurrences:
End result:
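As a rough illustration of what that search and replace does, here is the same substitution in Python. Python's re module is not nano's regex engine, so this only demonstrates the capture-group idea; the sample line is invented.

```python
import re

line = "conf->timeout + conf->retries"

# Group 1 captures the run of non-whitespace characters after "conf->",
# and \1 puts it back in front of "_conf".
result = re.sub(r"conf->(\S+)", r"\1_conf", line)
print(result)
```

Note that \S+ keeps going until the next whitespace character, so any punctuation directly after the name would be captured too.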
The regular expression notation nano uses is "Extended Regular Expression", i.e. POSIX Extended Regular Expression, which is also used by egrep and sed -r. This includes the metacharacters ., [ and ], ^, $, (, ), \1 to \9, *, { and }, ?, +, |, and character classes like [:alnum:], [:alpha:], [:cntrl:], [:digit:], [:graph:], [:lower:], [:print:], [:punct:], [:space:], [:upper:], and [:xdigit:].
For more complete documentation you can see the manual page, man 7 regex on Linux or man 7 re_format on OS X. This page may give you the same information as well: https://en.wikipedia.org/wiki/Regular_expression#POSIX_basic_and_extended
Unfortunately, in nano there seems to be no way to match anything that spans multiple lines.
This is a bit old, just updating the search index.
Nano 5.5 uses the ASCII column from this same table.
Thanks to #S P Arif Sahari Wibowo ,
I found the answer here anyway (same wiki link):
https://en.wikipedia.org/wiki/Regular_expression#POSIX_basic_and_extended
I was recently faced with the problem of inserting text at the beginning of every line that started with a numerical digit. The only way to distinguish these lines from text I didn't want to change was the preceding newline.
Playing around with the information provided in this answer I was able to do it and decided to add it to the answer in case somebody else faces the same situation.
To search for the beginning of the line followed by a number and then insert "Text String" at the beginning of each line that starts with a number:
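Press Ctrl+\, type "(^[0-9])" and press Enter, then type "Text String \1" and press Enter. Select yes for the first match and, if it does what you want, press a for all. Omit the " quotation marks. The \1 in the replacement puts the matched digit back after the inserted text.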
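The same idea can be checked quickly in Python before trying it in nano. This is only a sketch with invented sample text, and Python's regex flavour differs from nano's, but the anchor and the back-reference behave the same way here.

```python
import re

text = "1 apple\nbanana\n22 cherries"

# ^ anchors at the start of each line (MULTILINE); group 1 is the first
# digit, which \1 re-inserts after the added text.
result = re.sub(r"^([0-9])", r"Text String \1", text, flags=re.MULTILINE)
print(result)
```

Only the lines that begin with a digit are changed; "banana" is left as it is.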