Sigil editor: Regex to find a hyphen (-) character in text, but not in HTML attributes

My problem:
I use Sigil to edit the XHTML files of an ebook.
When exporting from InDesign to ePub, I tick the option to remove forced line breaks.
This removes all the hyphen (-) characters that InDesign generated automatically, but the hyphens I added manually while fine-tuning word breaks remain in the text.
Sigil's plain search for - matches every occurrence, including those in CSS class names.
How do I construct a regex query that finds the - within the text, but not in the HTML code?
Thank you!
What I have already tried: https://www.mobileread.com/forums/showpost.php?p=4099971&postcount=169:
Here is a simple example that finds the word "title" when it is not inside a tag itself. It is the simplest regex search I could think of off the top of my head. It assumes there is no bare text in the body tag and that the xhtml is well formed.
I tried it and it appears to work. There are probably better, more exhaustive regexes that can handle even broken xhtml.
Code:
title(?=[^>]*<)
This basically says search for "title" but lookahead to make sure there are no closing tag chars ">" before you find the next opening tag char "<".
There are probably lookbehind versions that could work with reverse logic. And there are ways to use regex to find two strings while ignoring any intervening tags.
Give it a try. You could add a saved search easily to do that. But again it will not handle find and replacement of text that crosses over elements (over nodes in the tree). That is the hard part unless you have one to one corresponding matching of matching substrings to replacement substrings which in general need not be the case.
And of course if you use < and > inside strings to show a "tag" or code snippet, these would be found by mistake so reviewing each find before the replace would be needed.
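As a minimal sketch of the lookahead idea applied to the hyphen from the question, here is the same pattern run through Python's `re` module rather than Sigil. The sample string and the name `TEXT_HYPHEN` are made up for illustration, and the same assumptions apply (well-formed xhtml, no bare text after the last tag).

```python
import re

# A hyphen that is followed, without crossing a ">", by the next "<".
# A hyphen inside a tag hits the tag's closing ">" first, so the lookahead fails.
TEXT_HYPHEN = re.compile(r"-(?=[^>]*<)")

# Hypothetical sample: one hyphen in a class name, two in the text.
sample = '<p class="body-text">hand-made, well-known</p>'

cleaned = TEXT_HYPHEN.sub("", sample)
print(cleaned)
```

Only the two hyphens in the paragraph text are removed; the one in the class name is left alone.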

Sigil uses the PCRE regex engine.
Thus, you can use
<[^<>]*>(*SKIP)(*F)|-
See the regex demo.
Details:
<[^<>]*>(*SKIP)(*F) - matches <, zero or more chars other than < and >, and then a >; (*SKIP)(*F) then discards that match and resumes the search from the position where the failure occurred, i.e. just after the tag
| - or
- - a hyphen.
NOTE: you might want to match any dashes with [\p{Pd}\x{00AD}] (to replace with -).
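Outside PCRE, the (*SKIP)(*F) verbs are often unavailable; Python's built-in `re` module, for instance, does not support them. A rough equivalent, sketched here as a different technique rather than the pattern above, is to match whole tags in one alternative, capture the hyphen in the other, and only act on the captured group. The function name and sample string are hypothetical.

```python
import re

# Alternative 1 consumes a whole tag; alternative 2 captures a hyphen in text.
TAG_OR_HYPHEN = re.compile(r"<[^<>]*>|(-)")

def strip_text_hyphens(xhtml: str) -> str:
    """Remove hyphens that occur outside tags, leaving tags untouched."""
    def repl(match: re.Match) -> str:
        # Group 1 is None when a tag matched: return the tag unchanged.
        return "" if match.group(1) is not None else match.group(0)
    return TAG_OR_HYPHEN.sub(repl, xhtml)

# Hypothetical sample: a hyphen in a class name and two in the text.
sample = '<span class="no-break">co-operate</span> re-enter'
result = strip_text_hyphens(sample)
print(result)
```

Because the tag alternative comes first and consumes the entire tag, a hyphen inside an attribute is never reached by the second alternative.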

Related

My regex expression is both lazy and greedy. Why?

Suppose I'm searching for anchor links in a web page. A regex that works is:
"\<a\s+.*?\>"
However, let's add a complication. Let's suppose that I only want links which surround specific text, for instance, the word 'next'. Normally, I would think all I had to do is:
"\<a\s+.*?\>next"
But I find that now, if there are 3 anchor tags in a page, and the third one has 'next' after it, that the regex search finds a huge string extending from the first anchor tag, and extending to the third anchor tag. This makes sense if the period-asterisk-questionmark is finding all characters until it comes across ">next". But that is not what I want. I want to find all characters until it comes across ">", and then an additional constraint should be that right after the ">" there should be "next".
How do I get this to work?
You can fix your regex by prohibiting it from matching > inside the tag, i.e. by replacing . with [^>]:
"\<a\s+[^>]*?\>next"
.*? matches any number of characters. The fact that you made it reluctant does not make it stop at >: it continues matching past it, until it finds >next at the end. This is not greedy, because the expression matched as little as possible to obtain a match. It's just that no shorter matches were available.
Demo.
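The difference is easy to reproduce. The following sketch uses Python's `re` module and a made-up three-link string; the backslashes before < and > are dropped because they are not needed there.

```python
import re

# Hypothetical page fragment with three anchors; only the last one wraps "next".
html = ('<a href="1.html">first</a> '
        '<a href="2.html">second</a> '
        '<a href="3.html">next</a>')

# Reluctant dot: still free to run past ">" until ">next" finally appears.
lazy = re.search(r"<a\s+.*?>next", html)

# Negated class: cannot leave the tag it started in.
bounded = re.search(r"<a\s+[^>]*>next", html)

print(lazy.group(0))
print(bounded.group(0))
```

The first search starts at the first anchor and stretches to the third; the second returns only the third anchor's opening tag plus "next".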

Search for entire word containing specific keyword in Notepad++ using regular expressions

I use Notepad++.
I need to search for and replace an entire word that contains a specific keyword.
Ex: something HELP.blablabla.blabla something
I would like to search the entire text for words that contain the keyword "HELP", up to the first space OR the first comma.
In this case: HELP.blablabla.blabla
Thanks a lot
Go to the search panel, check the regex checkbox on the bottom and try: (HELP)([^ ,]*)
Note: There is a space character after the ^
This regex means: search for the keyword HELP (HELP) followed by any run of characters that are not a space or a comma ([^ ,]*). The ^ inside the brackets is a negation.
Edit:
You can use just HELP[^ ,]*; the parentheses only create capturing groups, in case you need to refer to specific groups in the replacement later. As pointed out by @alphabravo
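For a quick check outside Notepad++, the same pattern can be tried with Python's `re` module. This is only a sketch with a made-up sample string.

```python
import re

# HELP followed by any run of characters that are neither a space nor a comma.
KEYWORD = re.compile(r"HELP[^ ,]*")

# Hypothetical sample text with two words containing the keyword.
text = "something HELP.blablabla.blabla something, HELPER,next"

found = KEYWORD.findall(text)
print(found)
```

Each match stops just before the first space or comma, so the delimiter itself is not part of the match.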
You say search and replace an entire word but if it were that simple then I wonder why a regular search and replace isn't sufficient. So I'm reading between the lines and assuming you want to match on full lines of text.
I think I've used npp enough to get the syntax right. I don't remember any eccentricities that would apply. Is the comma/space optional?
^[^, ]*HELP[^, ]*[, ]
I'm kinda thinking this one might be good enough:
^[^, ]*HELP

Find and Replace with Regex in Microsoft Word 2013

I am editing an e-book document with a lot of unnecessary markup. I have a number of sections in the text with code similar to this:
<i>Some text here</i>
I am trying to run a regex find and replace that will find any phrase between the two i-tags, remove the i-tags, and apply a style to the text.
Here is what I'm using to search:
Find: (<i>)(*)(</i>)
Replace: \2
I'm also selecting Styles > i (for italic). This tells our conversion software to apply italics to the text. If I leave the i-tags, what ends up happening is ScribeNet's conversion process converts them to hex-values so that they show up as literal text in the e-book. Messy.
When I run this search, I get no results. I have "use wildcards" checked. What am I missing? According to Microsoft's help website, * is used to represent any number or type of characters, and individual strings are supposed to be enclosed in parentheses.
To search for a character that's defined as a wildcard, place a backslash (\) before that character. The * itself matches any string of characters, so add the range quantifier {1,} to require one or more characters:
Find: \<i\>(*{1,})\</i\>
Replace: \1
Search for \<i\>(*{1,})\</i\> and replace with \1. Don't forget to check "Use wildcards".
There is a reference table for Word's "regular expressions" here: http://office.microsoft.com/en-ca/word-help/find-and-replace-text-by-using-regular-expressions-advanced-HA102350661.aspx
< and > are special characters that need to be escaped
* means any character
{1,} means one or more times
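For comparison only, here is roughly what the same find and replace looks like in standard regex syntax, sketched with Python's `re` module. Word's wildcard syntax differs from this, and the sketch only strips the tags; it cannot apply a Word style. The sample sentence is made up.

```python
import re

# In standard regex syntax < and > need no escaping; .+? is "one or more, reluctant".
ITALIC = re.compile(r"<i>(.+?)</i>")

# Hypothetical sample with two italic phrases.
text = "He said <i>never</i> and <i>meant it</i>."

stripped = ITALIC.sub(r"\1", text)
print(stripped)
```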
There is a special tool for Microsoft Word called Multiple Find & Replace (see http://www.translatortools.net/products/transtoolsplus/word-multiplefindreplace) which lets you work around Word's wildcard limitations. This tool can use the standard regular expressions syntax to search and replace any text within a Word document. For example, to search for any HTML tags, you can just use <[^>]+> which will find opening, closing and standalone HTML tags. You can add any number of expressions to a list and then search the document for all of them, replace everything, see all matches for all the search expressions entered, replace only selected matches, and a few more things.
I created it for translators and editors, but it is great for any advanced search/replace operations in Word, and I am sure you will find it very useful.
Stanislav

Find / Replace functionality that allows for boundary replacements instead of expressions

Apologies in advance for the confusing title. My issue is as follows, I have the following text in about 600 files:
$_REQUEST['FOO']
I would like to replace it with the following:
$this->input->post('FOO')
To clarify, I am matching against the following:
$_REQUEST any number of A-Za-z\d followed by a ]
and replacing it with:
$this->input->post( the alphanumeric word from above followed by a )
Or in general:
Anchor token TEXT TO KEEP end anchor token
This differs from standard find/replace as I want to retain text inside of two word boundaries.
Is this functionality present in any text editors (Eclipse,np++,etc). Or am I going to need to write some type of program to parse these 600 files to make the replacement?
s/\$__REQUEST\[(.*?)]/$this->input->post(\1)/
The .*? will match everything from [ to the first ] rather than the last although it's unlikely that it will matter in this case.
By the way the PHP superglobal is $_REQUEST rather than $__REQUEST
You can do this in Notepad++ using regular expressions. Replace
\$_REQUEST\['([^']*)'\]
with
$this->input->post('$1')
If you ever have double quotes too, you can use a more complex expression to handle both cases, though I'm not sure Notepad++ supports backreferences; replace
\$_REQUEST\[(['"])(.*?)\1\]
with
$this->input->post($1$2$1)
Note that I've reverted to using @ExplosionPills' suggested (.*?) here; it may be better, actually.
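If writing a small program turns out to be easier than an editor, the quote-aware pattern carries over directly. The sketch below uses Python's `re` module on a made-up line of PHP; the `convert` helper is hypothetical, and looping over the 600 files is left out.

```python
import re

# Group 1 is the opening quote, group 2 the key; \1 demands the same closing quote.
REQUEST = re.compile(r"""\$_REQUEST\[(['"])(.*?)\1\]""")

def convert(source: str) -> str:
    """Rewrite $_REQUEST['KEY'] as $this->input->post('KEY'), keeping the quote style."""
    # In a Python replacement string "$" is literal; \1 and \2 are the groups.
    return REQUEST.sub(r"$this->input->post(\1\2\1)", source)

# Hypothetical PHP line using both quote styles.
php = """$name = $_REQUEST['FOO']; $id = $_REQUEST["BAR"];"""
converted = convert(php)
print(converted)
```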

Remove everything before and after variable=int

I'm terrible at regex and need to remove everything from a large portion of text except for a certain variable declaration that occurs numerous times. I'd like to remove everything except for instances of mc_gross=anyint.
Generally we'd need to use "negative lookarounds" to find everything but a specified string. But these are fairly inefficient (although that's probably of little concern to you in this instance), and lookaround is not supported by all regex engines (not sure about notepad++, and even then probably depends on the version you're using).
If you're interested in learning about that approach, refer to How to negate specific word in regex?
But regardless, since you are using notepad++, I'd recommend selecting your target, then inverting the selection.
The following regex selects each instance, allowing for optional white space on either side of the '=' sign.
mc_gross\s*=\s*\d+
The following answer over on super user explains how to use bookmarks in notepad++ to achieve the "inverse selection":
https://superuser.com/questions/290247/how-to-delete-all-line-except-lines-containing-a-word-i-need
Substitute the regex they're using over there, with the one above.
You could do a regular expression replace of ^.*\b(mc_gross\s*=\s*\d+)\b.*$ with \1. That will remove everything other than the wanted text on each line. Note that on lines where the wanted text occurs two or more times, only one occurrence will be retained. In the search the ^.*\b matches from start-of-line to a word boundary before the wanted text; the \b.*$ matches everything from a word boundary after the wanted text until end of line; the round brackets capture the wanted text for the replacement text. If text such as abcmc_gross=13def should be matched and retained as mc_gross=13 then delete the \bs from the search.
To remove unwanted lines do a regular expression search for ^mc_gross\s*=\s*\d+$ from the Mark tab, tick Bookmark line and click Mark all. Then use Menu => Search => Bookmark => Remove unmarked lines.
Find what: [\s\S]*?(mc_gross=\d+|\Z)
Replace with: \1
Position the cursor at the start of the text then Replace All.
Add word boundaries \b around mc_gross=\d+ if you think it's necessary.
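Outside Notepad++, the "keep only the matches" idea can also be approached from the other side: collect the matches instead of deleting everything else. A minimal sketch with Python's `re` module follows; the sample text is made up.

```python
import re

# The target from the answers above, with optional whitespace around "=".
MC_GROSS = re.compile(r"\bmc_gross\s*=\s*\d+\b")

# Hypothetical input: two wanted declarations surrounded by unrelated text.
text = "txn=9&mc_gross=120&fee=3\nnoise line\npayer=x mc_gross = 45 end"

matches = MC_GROSS.findall(text)
kept = "\n".join(matches)
print(kept)
```

Unlike the line-based replace, this keeps every occurrence, even when several appear on one line.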