GREP code for InDesign Search - replace

I have a GREP command:
(?<=say).+.".+?"
It finds all lines that have say (but doesn't include it), quoted words and characters between it.
It would like to make a change only to the quoted words (not the characters in between) but
(?<=say)(?<=.+.)".+?"
this doesn't work.
Here an example of the text I need to change:
I spoke to the Captain, saying: "Sir, I would like to home".
...to say, "This is the way".
Which should be changed into this by GREP command:
I spoke to the Captain, saying: "Sir, I would like to home".
...to say, "This is the way".
Could someone help me?

I don't think you can have a variable length lookbehind; that's not possible in most flavors of REGEX, and I'd bet that's the case for ID. You could possibly do two Greps, one to apply the style to everything after 'say' and another to remove it from the text between say and the quotation mark (this would of course assume no other rendering in between). Or you could use your initial regex to insert some characters that you could then search for and use as a basis to apply the italics. Or, you could script it.
This page talks about the limitations of lookbehind: http://www.regular-expressions.info/lookaround.html
An example using positive lookbehind and lookahead to find the content between 'say' and the quotation mark:
(?<=say).+?(?=")

Related

Vim - sed like labels or replacing only within pattern

On the basis of some html editing I've came up with need for help from some VIM master out there.
I wan't to achieve simple task - I have html file with mangled urls.
Just description Just description
...
Just description
Unfortunately it's not "one url per line".
I am aware of three approaches:
I would like to be able to replace only within '"http://[^"]*"' regex (similar like replace only in matching lines - but this time not whole lines but only matching pattern should be involved)
Or use sed-like labels - I can do this task with sed -e :a -e 's#\("http://[^" ]*\) \([^"]*"\)#\1_\2#g;ta'
Also I know that there is something like "\#<=" but I am non native speaker and vim manual on this is beyond my comprehension.
All help is greatly appreciated.
If possible I would like to know answer on all three problems (as those are pretty interesting and would be helpful in other tasks) but either of those will do.
Re: 1. You can replace recursively by combining vim's "evaluate replacement as an expression" feature (:h :s\=) with the substitute function (:h substitute()):
:%s!"http://[^"]*"!\=substitute(submatch(0), ' ', '_', 'g')!g
Re: 2. I don't know sed so I can't help you with that.
Re: 3. I don't see how \#<= would help here. As for what it does: It's equivalent to Perl's (?<=...) feature, also known as "positive look-behind". You can read it as "if preceded by":
:%s/\%(foo\)\#<=bar/X/g
"Replace bar by X if it's preceded by foo", i.e. turn every foobar into fooX (the \%( ... \) are just for grouping here). In Perl you'd write this as:
s/(?<=foo)bar/X/g;
More examples and explanation can be found in perldoc perlretut.
I think what you want to do is to replace all spaces in your http:// url into _.
To achieve the goal, #melpomene's solution is straightforward. You could try it in your vim.
On the other hand, if you want to simulate your sed line, you could try followings.
:let #s=':%s#\("http://[^" ]*\)\#<= #_#g^M'
^M means Ctrl-V then Enter
then
200#s
this works in same way as your sed line (label, do replace, back to label...) and #<= was used as well.
one problem is, in this way, vim cannot detect when all match-patterns were replaced. Therefore a relative big number (200 in my example) was given. And in the end an error msg "E486: Pattern not found..." shows.
A script is needed to avoid the message.

Regex to match when a string is present twice

I am horrible at RegEx expressions and I just don't use them often enough for me to remember the syntax between uses.
I am using grepWin to search my files. I need to do a search that will return the files that have a given string twice.
So, for example, if I was searching on the word "how", then file one would not match:
Hello
how are you today?
but file two would:
Hello
how are you today?
I am fine, how are you?
Any one know how to make a RegEx that will match that?
something like this (depends on language and your specific task)
\(how.*){2}\
Edit:
according to #CodeJockey
\^(([^h]|h[^o]|ho[^w])*how([^h]|h[^o]|ho[^w])*){2,2}$\
(it become more complicated)
#CodeJockey: Thanks for comments
I don't know what grepWin supports, but here's what I came up with to make something match exactly twice.
/^((?!how).)*how((?!how).)*how((?!how).)*$/
Explanation:
/^ # start of subject
((?!how).)* # any text that does not contain "how"
how # the word "how"
((?!how).)* # any text that does not contain "how"
how # the word "how"
((?!how).)* # any text that does not contain "how"
$/ # end of subject
This ensures that you find two "how"s, but the texts between the "how"s and to either side of them do not contain "how".
Of course, you can substitute any string for "how" in the expression.
If you want to "simplify" by only writing the search expression twice, you can use backreferences thus:
/^(?:(?!how).)*(how)(?:(?!\1).)*\1(?:(?!\1).)*$/
Refiddle with this expression
Explanation:
I added ?: to make the negative lookaheads' text non-capturing. Then I added parentheses around the regular how to make that a capturing subpattern (the first and only one).
I had to include "how" again in the first lookahead because it's a negative lookahead (meaning any capture would not contain "how") and the captured "how" is not captured yet at that point.
This is significantly harder than I originally thought it would be, and requires variable-length lookbehind, which grepWin does not support...
this expression:
(?<!blah.{0,99999})blah(?=.*?blah)(?!.*blah.*blah)
was successfully used in Eclipse, using the "Search > File" dialog to exclude files with one and three instances of blah and to include files with exactly two instances of blah.
Eclipse does not permit a .* in lookbehind, so I used .{0,99999} instead.
It is possible, with the right tool, but It isn't pretty to get it to work with grepWin (see answer above). Can you use other tools (such as Eclipse) and what did you want to do with the files afterwards?
This works for grep || python, it will return a match only if "how" exists twice in a your_file:
grep "how.*how" your_file
in python (re imported):
re.search(r"how.*how","your_text")
It will return everything in between,(the dot means any character and the star means any number of characters), and you can customize your own script.

Replacing char in a String with Regular Expression

I got a string like this:
PREFIX-('STRING WITH SPACES TO REPLACE')
and i need this:
PREFIX-('STRING_WITH_SPACES_TO_REPLACE')
I'm using Notepad++ for the Regex Search and Replace, but i'm shure every other Editor capable of regex replacements can do it to.
I'm using:
PREFIX-\('(.*)(\s)(.*)'\)
for search and
PREFIX-('\1_\3')
for replace
but that replaces only one space from the string.
The regex search feature in Notepad++ is very, very weak. The only way I can see to do this in NPP is to manually select the part of the text you want to work on, then do a standard find/replace with the In selection box checked.
Alternatively, you can run the document through an external script, or you can get a better editor. EditPad Pro has the best regex support I've ever seen in an editor. It's not free, but it's worth paying for. In EPP all I had to do was this:
search: ((?:PREFIX-\('|\G)[^\s']+)\s+
replace: $1_
EDIT: \G matches the position where the previous match ended, or the beginning of the input if there was no previous match. In other words, the first time you apply the regex, \G acts like \A. You can prevent that by adding a negative lookahead, like so:
((?:PREFIX-\('|(?!\A)\G)[^\s']+)\s+
If you want to prevent a match at the very beginning of the text no matter what it starts with, you can move the lookahead outside the group:
(?!\A)((?:PREFIX-\('|\G)[^\s']+)\s+
And, just in case you were wondering, a lookbehind will work just as well as a lookahead:
((?:PREFIX-\('|(?<!\A)\G)[^\s']+)\s+
You have to keep matching from the beggining of the string untill you can match no more.
find /(PREFIX-\('[^\s']*)\s([^']*'\))/
replace $1_$2
like: while (/(PREFIX-\('[^\s']*)\s([^']*'\))/$1_$2/) {}
How about using Replace all for about 20 times? Or until you're sure no string contains more spaces
Due to nature of regex, it's not possible to do this in one step by normal regular expression.
But if I be in your place, I do such replaces in several steps:
find such patterns and mark them with special character
(Like replacing STRING WITH SPACES TO REPLACE with #STRING WITH SPACES TO REPLACE#
Replace #([^#\s]*)\s to #\1_ server times.
Remove markers!
I studied a little the regex tool in Notepad++ because I didn't know their possibilities.
I conclude that they aren't powerful enough to do what you want.
Your are obliged to learn and use a programming language having a real regex capability. There are a number of them. Personnaly, I use Python. It would take 1 mn to do what you want with it
You'd have to run the replace several times for each space but this regex will work
/(?<=PREFIX-\(')([^\s]+)\s+/g
Replace with
\1_ or $1_
See it working at http://refiddle.com/10z

How to search (using regex) for a regex literal in text?

I just stumbled on a case where I had to remove quotes surrounding a specific regex pattern in a file, and the immediate conclusion I came to was to use vim's search and replace util and just escape each special character in the original and replacement patterns.
This worked (after a little tinkering), but it left me wondering if there is a better way to do these sorts of things.
The original regex (quoted): '/^\//' to be replaced with /^\//
And the search/replace pattern I used:
s/'\/\^\\\/\/'/\/\^\\\/\//g
Thanks!
You can use almost any character as the regex delimiter. This will save you from having to escape forward slashes. You can also use groups to extract the regex and avoid re-typing it. For example, try this:
:s#'\(\\^\\//\)'#\1#
I do not know if this will work for your case, because the example you listed and the regex you gave do not match up. (The regex you listed will match '/^\//', not '\^\//'. Mine will match the latter. Adjust as necessary.)
Could you avoid using regex entirely by using a nice simple string search and replace?
Please check whether this works for you - define the line number before this substitute-expression or place the cursor onto it:
:s:'\(.*\)':\1:
I used vim 7.1 for this. Of course, you can visually mark an area before (onto which this expression shall be executed (use "v" or "V" and move the cursor accordingly)).

Remove stuff, retrieve numbers, retrieve text with spaces in place of dots, remove the rest

This is my first question, so I hope I didn't mess too much with the title and the formatting.
I have a bunch of file a client of mine sent me in this form:
Name.Of.Chapter.021x212.The.Actual.Title.Of.the.Chapter.DOC.NAME-Some.stuff.Here.ext
What I need is a regex to output just:
212 The Actual Title Of the Chapter
I'm not gonna use it with any script language in particular; it's a batch renaming of files through an app supporting regex (which already "preserves" the extension).
So far, all I was able to do was this:
/.*x(\d+)\.(.*?)\.[A-Z]{3}.*/ -->REPLACE: $1 $2
(Capture everything before a number preceded by an "x", group numbers after the "x", group everything following until a 3 digit Uppercase word is met, then capture everything that follows)
which gives me back:
212 The.Actual.Title.Of.the.Chapter
Having seen the result I thought that something like:
/.*x(\d+)\.([^.]*?)\.[A-Z]{3}.*/ -->REPLACE: $1 $2
(Changed second group to "Capture everything which is not a dot...") would have worked as expected.
Instead, the whole regex fails to match completely.
What am I missing?
TIA
ciĆ 
ale
.*x(\d+)\. matches Name.Of.Chapter.021x212.
\.[A-Z]{3}.* matches .DOC.NAME-Some.stuff.Here.ext
But ([^.]*?) does not match The.Actual.Title.Of.the.Chapter because this regex does not allow for any periods at all.
since you are on Mac, you could use the shell
$ s="Name.Of.Chapter.021x212.The.Actual.Title.Of.the.Chapter.DOC.NAME-Some.stuff.Here.ext"
$ echo ${s#*x}
212.The.Actual.Title.Of.the.Chapter.DOC.NAME-Some.stuff.Here.ext
$ t=${s#*x}
$ echo ${t%.[A-Z][A-Z][A-Z].*}
212.The.Actual.Title.Of.the.Chapter
Or if you prefer sed, eg
echo $filename | sed 's|.[^x]*x||;s/\.[A-Z][A-Z][A-Z].*//'
For processing multiple files
for file in *.ext
do
newfile=${file#*x}
newfile=${newfile%.[A-Z][A-Z][A-Z].*}
# or
# newfile=$(echo $file | sed 's|.[^x]*x||;s/\.[A-Z][A-Z][A-Z].*//')
mv "$file" "$newfile"
done
To your question "How can I remove the dots in the process of matching?" the answer is "You can't." The only way to do that is by processing the result of the match in a second step, as others have said. But I think there's a more basic question that needs to be addressed, which is "What does it mean for a regex to match a given input?"
A regex is usually said to match a string when it describes any substring of that string. If you want to be sure the regex describes the whole string, you need to add the start (^) and end ($) anchors:
/^.*x(\d+)\.(.*?)\.[A-Z]{3}.*$/
But in your case, you don't need to describe the whole string; if you get rid of the .* at either end, it will serve your just as well:
/x(\d+)\.(.*?)\.[A-Z]{3}/
I recommend you not get in the habit of "padding" regexes with .* at beginning and end. The leading .* in particular can change the behavior of the regex in unexpected ways. For example, it there were two places in the input string where x(\d+)\. could match, your "real" match would have started at the second one. Also, if it's not anchored with ^ or \A, a leading .* can make the whole regex much less efficient.
I said "usually" above because some tools do automatically "anchor" the match at the beginning (Python's match()) or at both ends (Java's matches()), but that's pretty rare. Most of the shells and command-line tools available on *nix systems define a regex match in the traditional way, but it's a good idea to say what tool(s) you're using, just in case.
Finally, a word or two about vocabulary. The parentheses in (\d+) cause the matched characters to be captured, not grouped. Many regex flavors also support non-capturing parentheses in the form (?:\d+), which are used for grouping only. Any text that is included in the overall match, whether it's captured or not, is said to have been consumed (not captured). The way you used the words "capture" and "group" in your question is guaranteed to cause maximum confusion in anyone who assumes you know what you're talking about. :D
If you haven't read it yet, check out this excellent tutorial.