How to deal with matches pairwise in regex? - regex

I have a basic understanding of regex (I think) - but lookahead and all that just escapes me :(
So I have a sense that this should be doable with rx, but I have no idea how to do it.
I want to copy Markdown from Notion into MediaWiki. Both support MD - but different flavours. So I thought I could do some search & replace in a text editor to deal with the differences.
But: strikeout text is "~~text~~" in Notion and "<s>text</s>" in MediaWiki.

Interesting how one can overcomplicate things. It suddenly hit me that I don't have to worry about counting or doing advanced stuff with individual occurrences of ~~; I can match the paired case instead, using the search string "~~(.*?)~~" and the replacement "<s>$1</s>".
Nice & easy 😉
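
For anyone who would rather script this than use an editor, here is a minimal Python sketch of the same paired replacement. The function name is made up for illustration; note that Python's re module writes the backreference as \1 rather than $1.

```python
import re

def notion_to_mediawiki_strike(text):
    """Replace each ~~...~~ pair with <s>...</s> (hypothetical helper name)."""
    # The lazy (.*?) stops at the nearest closing ~~, so pairs are handled one at a time.
    return re.sub(r"~~(.*?)~~", r"<s>\1</s>", text)

print(notion_to_mediawiki_strike("keep ~~old~~ and ~~older~~ text"))
```

Because . does not match a newline by default, a strikeout that spans several lines would be left alone by this sketch.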

Related

Using wildcards in MS Word to replace special characters and reformat text

I'm trying to work out how to replace the following sentence:
What’s Viktoria made / makes / making in the kitchen at the moment?
with this:
What’s Viktoria [**made**makes**making] in the kitchen at the moment?
using Find & Replace with wildcards in MS Word.
Unfortunately, I'm having trouble working out a way of doing this. I've made various attempts using things like (*)(\/)(*)(\/)(*) and ([! ]*\/*\/*[! ]) but I'm getting the entire string before; i.e. it highlights everything, not just made / makes / making. I guess there might be an easier way searching for the formatting (as the only part I want to target is in italics), but any way of doing it would be greatly appreciated!
Thanks in advance. Hopefully the explanation is clear enough!
Jamie
This may be what you're after. With wildcard searching on, for the search pattern use:
(<*>) [/] (<*>) [/] (<*>)
and for the replace pattern use:
[**\1**\2**\3]
If you want to remove the italics, choose italics font format for the search pattern and regular font format for the replace text.
Hope that helps.
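
For comparison only, a rough regex analogue of that wildcard search, sketched in Python. This is not Word syntax: \w+ stands in for Word's <*> (a whole word), and the function name is an assumption for illustration.

```python
import re

# Three words separated by " / ", captured as groups 1 to 3.
SLASH_TRIPLE = re.compile(r"(\w+) / (\w+) / (\w+)")

def bracket_options(sentence):
    """Rewrite 'a / b / c' as '[**a**b**c]' (hypothetical helper name)."""
    return SLASH_TRIPLE.sub(r"[**\1**\2**\3]", sentence)

print(bracket_options("Viktoria made / makes / making in the kitchen"))
```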

REGEX in MS Word 2016: Exclude a simple String from Search

So I read a lot about Negation in Regex but can't solve my problem in MS Word 2016.
How do I exclude a String, Word, Number(s) from being found?
Example:
<[A-Z]{2}[A-Z0-9]{9;11}> to search a String like XY123BBT22223
But how do I exclude a specific one, for example SEDWS12WW04?
Well it depends on what you need to achieve or is this a matter of curiosity... RegEx is not the same as the built-in Advanced Find with Wildcards; for that you need VBA.
Depending on your need, without using VBA, you could make use of space and return characters - something like this will work for the strings provided: [ ^13][A-Z]{2}[0-9]{1,}[A-Z]{1,}[0-9]{1,}[ ^13] (assuming you use normal carriage returns and spaces in your document)
Anyway, this is a good article on wildcard searches in MS Word: https://wordmvp.com/FAQs/General/UsingWildcards.htm
EDIT:
In light of your further comments you will probably want to look at section 8 of the linked article which explains grouping. For my proposed search you can use this to your advantage by creating 3 groups in your 'find' and only modifying the middle group, if indeed you do intend to modify. Using groups the search would look something like:
([ ^13])([A-Z]{2}[0-9]{1,}[A-Z]{1,}[0-9]{1,})([ ^13])
and the replace might look like this:
\1 SOMETHING \3
Note also: compared to a RegEx solution my suggestion is kinda lame, mainly because MS Word's find and replace (good as it is, and really it is) is kinda lame next to RegEx... it's hacky but it might work for you (although you might need to do a few searches).
BUT... if it really is REGEX that you want, well you can get access to this via VBA: How to Use/Enable (RegExp object) Regular Expression using VBA (MACRO) in word
And... then you will be able to use proper RegEx for find and replace, well almost - I'm under the impression that the VBA RegEx still has some quirks...
As already noted by others, this is not possible in Microsoft Word's flavor of regular expressions.
Instead, you should use standard regular expressions. It is actually possible to use standard regular expressions in MS Word if you use a special tool that integrates into Microsoft Word called Multiple Find & Replace (see http://www.translatortools.net/products/transtoolsplus/word-multiplefindreplace). This tool opens as a pane to the right of the document window and works just like the Advanced Find & Replace dialog. However, in addition to Word's existing search functionality, it can use the standard regular expressions syntax to search and replace any text within a Word document.
In your particular case, I would use this:
\b[A-Z]{2}[A-Z0-9]{9,11}\b(?<!\bSEDWS12WW04)
To explain, this searches for a word boundary + ID + word boundary, and then it looks back to make sure that the preceding string does not match [word boundary + excluded ID]. In a similar vein, you can do something like
(?<!\bSEDWS12WW04|\bSEDWS12WW05|\bSEDWS12WW06)
to exclude several IDs.
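
To see the lookbehind idea outside any particular tool, here is a small Python sketch. The helper name and sample text are assumptions for illustration. Python's re module only accepts fixed-width lookbehinds, so this version assumes all excluded IDs have the same length.

```python
import re

def id_pattern(excluded):
    """Build the ID regex with a negative lookbehind for the excluded IDs.

    Assumption: every excluded ID has the same length, because Python
    requires all lookbehind alternatives to be the same width.
    """
    lookbehind = "|".join(r"\b" + re.escape(x) for x in excluded)
    return re.compile(r"\b[A-Z]{2}[A-Z0-9]{9,11}\b(?<!" + lookbehind + ")")

pattern = id_pattern(["SEDWS12WW04", "SEDWS12WW05"])
sample = "XY123BBT22223 SEDWS12WW04 SEDWS12WW05 AB123456789"
print(pattern.findall(sample))
```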
Multiple Find & Replace is quite powerful: you can add any number of expressions (either using regular expressions or using Word's standard search syntax) to a list and then search the document for all of them, replace everything, display all matches in a list and replace only specific matches, and a few more things.
I created this tool for translators and editors, but it is great for any advanced search/replace operations in Word, and I am sure you will find it very useful.
Best regards, Stanislav

Futile attempt to run regular expression find/replace in MS Word using groups on Mac

According to the received wisdom MS Word (more or less) supports find/replace with use of regular expressions. I have a simple regular expression:
^(C[[:alpha:]]*)(\d*)(.*)$
That I'm running on the data:
indSIMDdecile
CSdeccrim12006
CSdeccrim12006
CSdeccrim12009
CSdeccrim12009
CSdeccrim12012
CSdeccrim12012
CSdeceduc12004
CSdeceduc12004
CSdeceduc12006
CSdeceduc12006
CSdeceduc12009
CSdeceduc12009
CSdeceduc12012
CSdeceduc12012
CSdecemp12004.x
I'm interested in returning the first word prior to the digit 1, which works as demonstrated on regex101 here.
Problem
I would like to do the same in MS Word (v. 15.18 on Mac). After getting error messages about unsuitable syntax, I learned that MS Word does not support the full regex syntax. I simplified my expression to something along the lines of:
but the search does not find any strings and nothing gets replaced. Hence my questions, is it possible to use MS Word on Mac with regex?
The linked help website hints that something like that should be possible, but so far no luck.
The simple answer is "no", if you mean "Does Mac Word have a UI feature that lets you use one of the modern dialects of regex?" Word's Find/Replace only supports its own Regular Expression syntax.
In this case, I think the following will give you what you need:
Find with wildcards:
(C)([!1]@)(1)
and a replace by
\1
(If you also had to find "C1", then that doesn't work, and unfortunately nor does
(C)([!1]{0,})(1)
because Word does not allow 0 in the {,} pattern)
But there is a problem with "@". If the text the "@" is looking for is long, the find/replace may fail. There is supposed to be a 255 limit, but it seems rather more arbitrary than that. (I have long suspected a buffer overrun type error in the Word code, but perhaps there is a simpler explanation).
If you mean, "is there any way to use modern regex with Word?", then the answer is "Yes, but you only get to operate on a copy of the text in the document." You will need to create your own code to do the 'replace' part of the find/replace, and that means that you would have to deal with any of the issues, such as preserving formatting, that Word's built-in find/replace might get right for you.
On the Windows side, people who want a better regex than Word's often use VBScript's regexp object because it is easily used from VBA. VBA itself only really has the "like" operator, which also only has fairly crude pattern matching abilities. I think there are examples of VBScript regexp use on StackOverflow. On the Mac side, you would either have to use VBA and "shell out" to one of the built-in Mac/Unix utilities to do your finding (and perhaps replacing), or perhaps use AppleScript or JavaScript application scripting to do it. As far as I can remember AppleScript does not have a 'modern' regex built-in either.
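As a rough illustration of working on a copy of the text outside Word, here is a minimal Python sketch. It assumes the lines have already been exported to plain text, and the function name is made up. Python's re module does not support the POSIX class [[:alpha:]], so [A-Za-z] is used instead.

```python
import re

# Same shape as the question's regex: a word starting with C, then digits, then the rest.
LINE_RE = re.compile(r"^(C[A-Za-z]*)(\d*)(.*)$")

def leading_word(line):
    """Return the letters before the first digit, or None if the line does not start with C."""
    m = LINE_RE.match(line)
    return m.group(1) if m else None

for line in ["indSIMDdecile", "CSdeccrim12006", "CSdecemp12004.x"]:
    print(line, "->", leading_word(line))
```

Writing the result back into the document, with formatting intact, is the part this sketch leaves out.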
[As a bit of history, Word's "regular expressions" were I think introduced in Word 6, around 1993, at a time when most dialects of regex were much more crude than they are today. I don't think Word's version has moved along much at all - it probably added some Unicode support at some point, but that's probably about it. I assume that people using modern regex don't regard it as regex at all, and I personally prefer not to call Word's Regular Expressions 'regex' precisely for that reason.]

Regex exclude value from repetition

I'm working with load files and trying to write a regex that will check for any rows in the file that do not have the correct number of delimiters. Let's pretend the delimiter is % (I'm not sure this text field supports the delimiters that are in the load file). The regex I wrote that finds all rows that are correct is:
^%([^%]*%){20}$
Sometimes it's more beneficial to find any rows that do not have the correct number of delimiters, so to accomplish that, I wrote this:
(^%([^%]*%){0,19}$)|(^%([^%]*%){21,}$)
I'm concerned about the efficiency of this (and any regex I write in general), so I'm wondering if there's a better way to write it, or if the way I wrote it is fine. I thought maybe there would be some way to use alternation with the repetition tokens, such as:
{0,19}|{21,}
but that doesn't seem to work.
If it's helpful to know, I'm just searching through the files in Sublime Text, which I believe uses PCRE. I'm also open to suggestions for making the first regex better in general, although I've found it to work pretty well even in exceptionally large load files.
If your regex engine supports negative lookaheads, you can slightly modify your original regex.
^(?!%([^%]*%){20}$)
The regex above only tests whether a row is malformed: the lookahead is zero-width, so the match itself is empty. If you want to capture the row, you need to add the .* part.
^(?!%([^%]*%){20}$).*$
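
A minimal Python sketch of that negative-lookahead check, applied line by line. The sample rows are generated purely for illustration. Checking each line separately also avoids [^%]* running across a newline, which it could do in a whole-file search.

```python
import re

# A row is malformed when it is NOT "%" followed by exactly 20 "field%" groups.
BAD_ROW = re.compile(r"^(?!%([^%]*%){20}$).*$")

def bad_rows(lines):
    """Return the lines that do not have the expected number of delimiters."""
    return [line for line in lines if BAD_ROW.match(line)]

rows = [
    "%" + "a%" * 20,  # correct: 20 fields
    "%" + "a%" * 19,  # too few
    "%" + "a%" * 21,  # too many
]
print(len(bad_rows(rows)))
```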

How to author and manage very long regex patterns and reuse pattern blocks?

Is there any proven way to overcome the difficulty of authoring and managing large regex patterns in your code? Preferably in a visual tool? Is there any way to build up a pattern from smaller reusable pieces? I could not find any web-based regex visualizers that supported multiline regex, for instance.
We are currently using a technique to split patterns and store the pieces in variables, but this mixes languages - an architectural no-no for us - and also hinders the ability to paste the pattern into a visualizer.
I am using .NET/Powershell/JavaScript - but I am interested in the flavor agnostic perspective as well.
At my old job we used regex for everything. The best tools I found were the ones below:
Best regex editor in my opinion (it explains each segment and has a reference sheet): http://regex101.com
Best web multi-line regex editor: http://regexpal.com/
Best regex editor overall (a download for the price of $40): http://www.regexbuddy.com/
As far as managing regexes goes, we used to keep them all in a properties file separate from the code, and the code loaded each property (regex) at run time. We also shared RegexBuddy files for exchanging regex patterns. One file, kept in source control, had lines and lines of simple patterns for matching certain things. It helped to build larger patterns out of the smaller pieces. However, what I have learned is that basically all regexes need to be tweaked for your specific purposes. It is not as simple as piecing small ones together. The small ones just help you get started in the right direction.
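A minimal sketch of the "patterns in a separate file" idea, in Python. The section name, keys, and patterns are invented for illustration, and the file content is inlined as a string so the example is self-contained; in practice it would be read from a file on disk.

```python
import configparser
import re

# Hypothetical properties-style content; normally this lives in its own file.
PATTERN_FILE = r"""
[patterns]
zip_code = \b\d{5}(?:-\d{4})?\b
iso_date = \b\d{4}-\d{2}-\d{2}\b
"""

def load_patterns(text):
    """Parse the [patterns] section and compile each value as a regex."""
    # interpolation=None keeps any % in a pattern from being treated as config syntax.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return {name: re.compile(rx) for name, rx in parser["patterns"].items()}

patterns = load_patterns(PATTERN_FILE)
print(patterns["zip_code"].search("Mail to 12345-6789 now").group(0))
```

Keeping the patterns as plain text like this also means they can be pasted straight into a visualizer, with no host-language escaping mixed in.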
Over here in flavor agnostic land, I sometimes do something like this (actual working code I just happened to be revisiting):
# `names` and `StreetType.regexp` are assumed to be defined elsewhere (not shown here).
street = "(#{names}[A-Za-z0-9']+)((?:\\s+(?:#{StreetType.regexp}))?)"
space = '[\s.,]+'
# Phrases that introduce a cross street, followed by the street itself.
at_a_street =
  '(?:and|&|&amp;|at|@|by|just\s+\w+\s+of|just\s+past|looking(?:\s+\w+)?\s+(?:at|to|towards?)|near)' +
  "#{space}#{street}"
between_streets =
  "(?:between|(?:betw?|btwn)\\.?)#{space}#{street}#{space}(?:and|&|&amp;)#{space}#{street}"
address = '(\b\d+)(?:\s*-\s*\d+|[a-z])?\s+' + street
# The final patterns are just combinations of the named pieces above.
@regexps = [
  /#{street}#{space}#{at_a_street}/i,
  /#{street}#{space}#{between_streets}/i,
  /#{address}/i,
  /#{address}#{space}#{at_a_street}/i,
  /#{address}#{space}#{between_streets}/i
]
Namely break the regexp up into meaningful bits, give them comprehensible names, and concatenate them as necessary. (You need to think a little extra about whether each bit can be safely concatenated with others, e.g. watch out for greedy expressions at the end.)
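
The same break-it-up approach can be sketched in Python with raw f-strings. Everything here is a simplified, hypothetical stand-in (the street types and connecting words are invented for the example), not a port of the code above.

```python
import re

# Hypothetical building blocks, each small enough to test on its own.
STREET_TYPE = r"(?:St|Ave|Blvd|Rd)"
STREET = rf"([A-Za-z0-9']+(?:\s+{STREET_TYPE})?)"
SPACE = r"[\s.,]+"
AT_A_STREET = rf"(?:at|near|by){SPACE}{STREET}"
ADDRESS = rf"(\b\d+)\s+{STREET}"

# Larger patterns are concatenations of the named pieces.
ADDRESS_AT_STREET = re.compile(ADDRESS + SPACE + AT_A_STREET, re.IGNORECASE)

match = ADDRESS_AT_STREET.search("12 Main St near Oak Ave")
print(match.groups() if match else None)
```

One thing to watch with raw f-strings: a literal quantifier such as {2,3} has to be written {{2,3}} so it is not read as a placeholder.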