Find and Replace with Regex in Microsoft Word 2013 - regex

I am editing an e-book document with a lot of unnecessary markup. I have a number of sections in the text with code similar to this:
<i>Some text here</i>
I am trying to run a regex find and replace that will find any phrase between the two i-tags, remove the i-tags, and apply a style to the text.
Here is what I'm using to search:
Find: (<i>)(*)(</i>)
Replace: \2
I'm also selecting Styles > i (for italic). This tells our conversion software to apply italics to the text. If I leave the i-tags, what ends up happening is ScribeNet's conversion process converts them to hex-values so that they show up as literal text in the e-book. Messy.
When I run this search, I get no results. I have "use wildcards" checked. What am I missing? According to Microsoft's help website, * is used to represent any number or type of characters, and individual strings are supposed to be enclosed in parentheses.

To search for a character that is itself defined as a wildcard, place a backslash (\) before it; here that applies to < and >. The * matches any string of characters, and the {1,} range quantifier requires one or more occurrences:
Find: \<i\>(*{1,})\</i\>
Replace: \1

Search for \<i\>(*{1,})\</i\> and replace with \1. Don't forget to check Use wildcards.
There is a reference table for Word's "regular expressions" here: http://office.microsoft.com/en-ca/word-help/find-and-replace-text-by-using-regular-expressions-advanced-HA102350661.aspx
< and > are special characters that need to be escaped
* means any character
{1,} means one or more times
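For comparison, the same transformation in a standard regex engine can be sketched in Python. This is not Word's wildcard syntax; strip_i_tags is a hypothetical helper name, and the sketch assumes the i-tags are never nested.

```python
import re

# Standard-regex counterpart of the Word wildcard search above.
# Unlike Word wildcards, < and > need no escaping here.
# The lazy .*? stops at the first closing tag; DOTALL lets it span line breaks.
I_TAG = re.compile(r"<i>(.*?)</i>", re.DOTALL)


def strip_i_tags(text: str) -> str:
    """Remove <i>...</i> wrappers, keeping the enclosed text (hypothetical helper)."""
    return I_TAG.sub(r"\1", text)


if __name__ == "__main__":
    print(strip_i_tags("Plain <i>Some text here</i> plain"))
```

Applying a style to the matched text, as the Word replace dialog does, has no equivalent in this sketch; it only removes the markup.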

There is a special tool for Microsoft Word called Multiple Find & Replace (see http://www.translatortools.net/products/transtoolsplus/word-multiplefindreplace) which lets you work around Word's wildcard limitations. This tool can use standard regular expression syntax to search and replace any text within a Word document. For example, to search for any HTML tags, you can just use <[^>]+>, which will find opening, closing and standalone HTML tags. You can add any number of expressions to a list and then search the document for all of them, replace everything, see all matches for all the search expressions entered, replace only selected matches, and a few more things.
I created it for translators and editors, but it is great for any advanced search/replace operations in Word, and I am sure you will find it very useful.
Stanislav

Related

Sigil editor: Regex string to look for a (hyphen) character in text, but not html attributes

My problem:
I use Sigil to edit xhtml files of an ebook.
When exporting from InDesign to ePub, I tick the option to remove forced line breaks.
This removes all the hyphen (-) characters that InDesign generates automatically, but the hyphens I added manually while fine-tuning word breaks remain in the text.
Current behaviour of Sigil's search: searching for - matches everything, including CSS class names.
TODO: How do I construct a regex query that finds - within the text, but not in the HTML code?
Thank you!
What I have already tried: https://www.mobileread.com/forums/showpost.php?p=4099971&postcount=169:
Here is a simple example that finds the word "title" when it is not inside a tag itself. It is the simplest regex search I could think of off the top of my head. It assumes there is no bare text in the body tag and that the xhtml is well formed.
I tried it and it appears to work. There are probably better, more exhaustive regexes that can handle even broken xhtml.
Code:
title(?=[^>]*<)
This basically says search for "title" but lookahead to make sure there are no closing tag chars ">" before you find the next opening tag char "<".
There are probably lookbehind versions that could work with reverse logic. And there are ways to use regex to find two strings while ignoring any intervening tags.
Give it a try. You could add a saved search easily to do that. But again it will not handle find and replacement of text that crosses over elements (over nodes in the tree). That is the hard part unless you have one to one corresponding matching of matching substrings to replacement substrings which in general need not be the case.
And of course if you use < and > inside strings to show a "tag" or code snippet, these would be found by mistake so reviewing each find before the replace would be needed.
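The lookahead idea from the quoted post can be tried outside Sigil with a small Python sketch. The helper name find_in_text and the sample markup are made up for illustration, and the same assumptions apply (well-formed xhtml, no bare text after the last tag).

```python
import re


def find_in_text(word: str, xhtml: str) -> list:
    """Return start offsets of `word` where it is not inside a tag (hypothetical helper).

    The lookahead requires that no ">" occurs before the next "<",
    which is only true when the match sits in text content.
    """
    pattern = re.compile(re.escape(word) + r"(?=[^>]*<)")
    return [m.start() for m in pattern.finditer(xhtml)]


if __name__ == "__main__":
    sample = '<p title="x">The title page</p>'
    # Only the occurrence in the paragraph text is reported, not the attribute name.
    print(find_in_text("title", sample))
```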
In Sigil, PCRE regex engine is used.
Thus, you can use
<[^<>]*>(*SKIP)(*F)|-
See the regex demo.
Details:
<[^<>]*>(*SKIP)(*F) - matches <, zero or more chars other than < and > and then a >, and then skips the match and goes on to search for the next match from the position where the failure occurred
| - or
- - a hyphen.
NOTE: you might want to match any dashes with [\p{Pd}\x{00AD}] (to replace with -).
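Python's built-in re module does not support the (*SKIP)(*F) verbs, but the same "match tags first, then discard them" idea can be sketched with an alternation and a replacement function. The function name remove_text_hyphens is hypothetical, and deleting the hyphen (rather than replacing it with something else) is an assumption about what the asker wants.

```python
import re

# Alternation: either a whole tag (captured in group 1) or a hyphen in text.
# Because the tag branch comes first, hyphens inside tags are consumed with the tag.
TAG_OR_HYPHEN = re.compile(r"(<[^<>]*>)|-")


def remove_text_hyphens(xhtml: str) -> str:
    """Delete hyphens in text content and leave tags untouched (hypothetical helper)."""
    # A tag match is put back unchanged; a hyphen match has no group 1 and is dropped.
    return TAG_OR_HYPHEN.sub(lambda m: m.group(1) or "", xhtml)


if __name__ == "__main__":
    print(remove_text_hyphens('<p class="body-text">hy-phen</p>'))
```

As with the PCRE version, a literal < or > inside text content would confuse the tag branch, so reviewing matches before replacing is still advisable.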

Find / Replace functionality that allows for boundary replacements instead of expressions

Apologies in advance for the confusing title. My issue is as follows, I have the following text in about 600 files:
$_REQUEST['FOO']
I would like to replace it with the following:
$this->input->post('FOO')
To clarify, I am matching against the following:
$_REQUEST any number of A-Za-z\d followed by a ]
and replacing it with:
$this->input->post( the alphanumeric word from above followed by a )
Or in general:
Anchor token TEXT TO KEEP end anchor token
This differs from standard find/replace as I want to retain text inside of two word boundaries.
Is this functionality present in any text editors (Eclipse,np++,etc). Or am I going to need to write some type of program to parse these 600 files to make the replacement?
s/\$__REQUEST\[(.*?)]/$this->input->post(\1)/
The .*? will match everything from [ to the first ] rather than the last although it's unlikely that it will matter in this case.
By the way the PHP superglobal is $_REQUEST rather than $__REQUEST
You can do this in Notepad++ using regular expressions. Replace
\$_REQUEST\['([^']*)'\]
with
$this->input->post('$1')
If you ever have double quotes too, you can use a more complex expression to handle both cases, though I'm not sure Notepad++ supports backreferences; replace
\$_REQUEST\[(['"])(.*?)\1\]
with
$this->input->post($1$2$1)
Note that I've reverted to using #ExplosionPills' suggested (.*?) here; it may be better, actually.
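If the editor route does not work out, the backreference version above can be sketched as a short Python function. The name to_input_post is hypothetical; running it over the 600 files would still need a loop that reads and writes each file.

```python
import re

# Group 1 captures the quote character, group 2 the key;
# the \1 backreference requires the same quote to close the string.
REQUEST_ACCESS = re.compile(r"""\$_REQUEST\[(['"])(.*?)\1\]""")


def to_input_post(source: str) -> str:
    """Rewrite $_REQUEST['KEY'] as $this->input->post('KEY') (hypothetical helper)."""
    # In a Python replacement template "$" is literal; groups are written \1, \2.
    return REQUEST_ACCESS.sub(r"$this->input->post(\1\2\1)", source)


if __name__ == "__main__":
    print(to_input_post("$foo = $_REQUEST['FOO'];"))
```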

Replace line of text Notepad++ or UltraEdit

Real quick question here that I can't work out.
I have a bunch of text files across many directories. Within these directories are text files named init.txt.
In these many text files are lots of lines starting with
Effective =
What I need to do is replace any line that contains that string with another string,
preferably in Notepad++, or UltraEdit if need be.
In Notepad++, I've found Search -> Replace in Files..., which lets me specify a starting directory, but I can't get it to replace the entire line with my new line.
I have never used regular expressions before (if that's the best way to do this), as I've never needed to, so any help would be very much appreciated.
Thank you for helping me out.
For your problem, a little regular expression may help a lot. I use regex search in Notepad++ nearly every day, and it is really useful.
I do not want to intimidate you with complicated regex grammar. Instead, I hope that after reading my answer you will see that the basics of regular expressions are not so exotic and are meant for regular people's everyday use.
Follow these instructions:
In Notepad++, press Ctrl-F and switch to the Find in Files tab. In the Search Mode section (at the bottom of the dialog), select Regular expression.
In the Find what field, what you need to enter depends on the specific pattern of the text you want to replace.
If the text fragment you want to substitute always
shows up at the beginning of a line,
has NO LEADING WHITESPACE before it, and
contains EXACTLY ONE SPACE CHARACTER before the = character,
then ^Effective = should be used as the pattern in the Find what field. To replace the whole line rather than only this prefix, append .* to the pattern (^Effective =.*) and leave the ". matches newline" option unchecked.
The ^ symbol in ^Effective = matches the beginning of a line (so if Effective = appears in the middle of a line, it will be ignored), and the rest is the exact text to be matched.
However, if the above conditions are not all satisfied, e.g.
the text segment may contain leading whitespace, or
the number of whitespace characters between the word Effective and the = symbol may vary, from one to unlimited,
then you may need to use ^Effective\s+=. To also allow leading spaces or tabs, prefix it with [ \t]*, giving ^[ \t]*Effective\s+=.
The \s+ part in ^Effective\s+= matches one or more whitespace characters (including spaces \x20, tabs \t, carriage returns \r, and newlines \n).
If you want to match zero or more whitespace characters between Effective and =, change \s+ to \s*.
In the Replace with field, enter changeLine.
In the Filters field, select the file type you want to search.
Check In all sub-folders.
Click the Replace in Files button.
Set the search mode in Notepad++
Find: Effective =
Replace with: changeLine
Search Mode: Extended (\n, \t, etc)
From: https://superuser.com/questions/34451/notepad-find-and-replace-string-with-a-new-line
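If neither editor is at hand, the same whole-line replacement across every init.txt can be sketched in Python. The function names, the UTF-8 encoding and the choice to allow optional spaces or tabs around Effective are assumptions for illustration; changeLine stands for whatever replacement line is wanted.

```python
import re
from pathlib import Path

# Optional leading spaces/tabs, "Effective", optional spacing, "=", then the rest of the line.
# MULTILINE makes ^ and $ anchor at every line, so each matching line is replaced whole.
EFFECTIVE_LINE = re.compile(r"^[ \t]*Effective[ \t]*=.*$", re.MULTILINE)


def replace_effective_lines(text: str, new_line: str) -> str:
    """Replace every 'Effective = ...' line with new_line (hypothetical helper)."""
    # A function replacement avoids backslash interpretation inside new_line.
    return EFFECTIVE_LINE.sub(lambda _m: new_line, text)


def rewrite_tree(root: Path, new_line: str, name: str = "init.txt") -> int:
    """Rewrite every file called `name` below root; return how many files changed."""
    changed = 0
    for path in root.rglob(name):
        original = path.read_text(encoding="utf-8")  # encoding is an assumption
        updated = replace_effective_lines(original, new_line)
        if updated != original:
            path.write_text(updated, encoding="utf-8")
            changed += 1
    return changed
```

A call such as rewrite_tree(Path("some/start/dir"), "changeLine") would then play the role of Replace in Files with In all sub-folders checked; it edits files in place, so working on a copy first is prudent.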

Remove everything before and after variable=int

I'm terrible at regex and need to remove everything from a large portion of text except for a certain variable declaration that occurs numerous times. I'd like to remove everything except for instances of mc_gross=anyint.
Generally we'd need to use "negative lookarounds" to find everything but a specified string. But these are fairly inefficient (although that's probably of little concern to you in this instance), and lookaround is not supported by all regex engines (not sure about notepad++, and even then probably depends on the version you're using).
If you're interested in learning about that approach, refer to How to negate specific word in regex?
But regardless, since you are using notepad++, I'd recommend selecting your target, then inverting the selection.
This will select each instance, allowing for optional white space either side of the '=' sign.
mc_gross\s*=\s*\d+
The following answer over on super user explains how to use bookmarks in notepad++ to achieve the "inverse selection":
https://superuser.com/questions/290247/how-to-delete-all-line-except-lines-containing-a-word-i-need
Substitute the regex they're using over there, with the one above.
You could do a regular expression replace of ^.*\b(mc_gross\s*=\s*\d+)\b.*$ with \1. That will remove everything other than the wanted text on each line. Note that on lines where the wanted text occurs two or more times, only one occurrence will be retained. In the search the ^.*\b matches from start-of-line to a word boundary before the wanted text; the \b.*$ matches everything from a word boundary after the wanted text until end of line; the round brackets capture the wanted text for the replacement text. If text such as abcmc_gross=13def should be matched and retained as mc_gross=13 then delete the \bs from the search.
To remove unwanted lines do a regular expression search for ^mc_gross\s*=\s*\d+$ from the Mark tab, tick Bookmark line and click Mark all. Then use Menu => Search => Bookmark => Remove unmarked lines.
Find what: [\s\S]*?(mc_gross=\d+|\Z)
Replace with: \1
Position the cursor at the start of the text then Replace All.
Add word boundaries \b around mc_gross=\d+ if you think it's necessary.
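Outside Notepad++, "remove everything except the matches" is often easier to express as "collect the matches". A minimal Python sketch, assuming the text is already in a string and using a hypothetical helper name:

```python
import re

# Leading word boundary so that text such as "abcmc_gross=13" is not picked up;
# optional whitespace is allowed on either side of "=".
MC_GROSS = re.compile(r"\bmc_gross\s*=\s*\d+")


def extract_mc_gross(text: str) -> list:
    """Return every mc_gross=<int> occurrence in order of appearance (hypothetical helper)."""
    return MC_GROSS.findall(text)


if __name__ == "__main__":
    sample = "foo mc_gross=12 bar\nbaz mc_gross = 7 qux"
    print("\n".join(extract_mc_gross(sample)))
```

Unlike the line-based replace, this keeps every occurrence even when a line contains the wanted text more than once.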

Regular expression question

I have some text like this:
dagGeneralCodes$_ctl1$_ctl0
Some text
dagGeneralCodes$_ctl2$_ctl0
Some text
dagGeneralCodes$_ctl3$_ctl0
Some text
dagGeneralCodes$_ctl4$_ctl0
Some text
I want to create a regular expression that extracts the last occurrence of dagGeneralCodes$_ctl[number]$_ctl0 from the text above.
the result should be: dagGeneralCodes$_ctl4$_ctl0
Thanks in advance
Wael
This should do it:
.*(dagGeneralCodes\$_ctl\d+\$_ctl0)
The .* at the front is greedy so initially it will grab the entire input string. It will then backtrack until it finds the last occurrence of the text you want.
Alternatively you can just find all the matches and keep the last one, which is what I'd suggest.
Also, specific advice will probably need to be given depending on what language you're doing this in. In Java, for example, you will need to use DOTALL mode so that . matches newlines, because ordinarily it doesn't. Some other languages call this multiline mode. JavaScript has a slightly different workaround for this, and so on.
You can use:
[\d\D]*(dagGeneralCodes\$_ctl\d+\$_ctl0)
I'm using [\d\D] instead of . to make it match new-line as well. The * is used in a greedy way so that it will consume all but the last occurrence of dagGeneralCodes$_ctl[number]$_ctl0.
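Both approaches from the answers, keeping the last of all matches and letting a greedy prefix backtrack to the last occurrence, can be compared in a short Python sketch. The function names are hypothetical, and Python is only one possible host language for the regex.

```python
import re

# \d+ allows control numbers with more than one digit.
CONTROL_ID = re.compile(r"dagGeneralCodes\$_ctl\d+\$_ctl0")
# DOTALL lets the greedy .* run across newlines before backtracking.
GREEDY_LAST = re.compile(r".*(dagGeneralCodes\$_ctl\d+\$_ctl0)", re.DOTALL)


def last_by_findall(text: str):
    """Find all matches and keep the last one (None if there is no match)."""
    matches = CONTROL_ID.findall(text)
    return matches[-1] if matches else None


def last_by_greedy(text: str):
    """Let the greedy prefix swallow the input and backtrack to the last occurrence."""
    match = GREEDY_LAST.search(text)
    return match.group(1) if match else None


if __name__ == "__main__":
    sample = (
        "dagGeneralCodes$_ctl1$_ctl0\nSome text\n"
        "dagGeneralCodes$_ctl4$_ctl0\nSome text\n"
    )
    print(last_by_findall(sample))
    print(last_by_greedy(sample))
```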
I really like using this Regular Expression Cheatsheet; it's free, a single page, and printed, fits on my cube wall.