Using regex to store text before and after two different characters - regex

I have a series of files that have this format:
01x05e - Some text (Some more text)
01x05f - Some text (Some more text)
01x05g - Some text (Some more text)
What I'd like to do is strip them to produce this:
01x05e - Some more text
01x05f - Some more text
01x05g - Some more text
To do this, I am using the Bulk Rename Utility. I thought I had some regex I could use to do this:
-.*?\(
This successfully matches everything between the "-" and the "(" above. I was hoping I could use it to remove all of that text, before asking the Bulk Rename Utility to do a very trivial removal of the final character in every filename (i.e. the ")").
However, in Bulk Rename Utility I can't just match the text and replace it with no content to remove it. Instead, I have to specify a replace criteria and nothing I'm entering seems to be working. Any time I enter any data in the "replace" section it simply erases all of my filename.
This leads me to think I need to approach the regex differently and find some way of grabbing the data before the "-" and grabbing the data between both "(" and ")" and placing them together.
However, I've no idea how to even start doing this. Is this the correct approach, or am I missing something obvious with regards to the match and replace of the regex I currently have?

The Bulk Rename Utility has a support website, including a forum with a section that provides regex rename support.
It seems like you want to keep everything up to the first hyphen plus the text inside the parentheses, and delete what lies between the hyphen and the open paren. Capturing groups are your friend here:
^([^-]* -) - Group 1: this matches from the start of the name through the first hyphen, including the space before it.
[^(]* - This matches everything up to (but not including) the open paren.
\((.*)\)$ - Group 2: this captures the text between the open paren and the final close paren at the end of the name.
You can put it all together:
^([^-]* -)[^(]*\((.*)\)$
And replace it with just the first group and the last group:
\1 \2
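As a quick sanity check outside Bulk Rename Utility, the same pattern and replacement can be tried in Python. This is only a sketch: the rename helper is a made-up name, and Bulk Rename Utility's own regex engine may behave slightly differently.

```python
import re

# Group 1: start of the name through the first " -".
# Group 2: the text inside the parentheses.
PATTERN = re.compile(r"^([^-]* -)[^(]*\((.*)\)$")

def rename(name):
    """Drop the text between the hyphen and the parens, keep the rest."""
    return PATTERN.sub(r"\1 \2", name)

for old in ["01x05e - Some text (Some more text)",
            "01x05f - Some text (Some more text)"]:
    print(old, "->", rename(old))
```

Names that do not fit the pattern are returned unchanged, because the substitution only fires on a full match.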

Regex NotePad++ or batch script to find and replace double bracketed text with CR LF -- would prefer NP++

I managed to do most of my conversion in VBA Macro (Word > txt) but some changes were made also that I could not forego or get around. Unfortunately, I had not been in the habit of using styles and precise formatting in my docs... (Which is why a PanDoc conversion did not "pan" out well, if you'll excuse the pun.)
In my docs, I was using bold text/lines for in-text titles (not Heading 2 alas) but as I was converting mid-sentence one or two-word bold phrases into phrases to go between double square brackets, the makeshift titles/headings were also changed to [[some title]] format in the process.
With Find and Replace (a batch script that goes through all files in a folder would also do), I would like to search for each and any number of instances of CRLF [[some title CRLF]]CRLF and replace the brackets with ** (to make the title bold), or perhaps ## to make the headings I was missing back in MS Word (I would of course need the line breaks as well).
For better understanding, please see attached picture here:
I am fairly sure that all instances are similarly syntaxed. If not, I may be able to tailor your regex code to differing instances later on.
As you can see, I was trying to do it in two steps, but that's no good, because the second step (which I couldn't even get right) would probably have altered other text I need intact (there must be sentences that start with double brackets after CRLF).
I would need the two steps in one so that only the targeted double bracketed text would be changed to bold or Heading 2.
Basically, what I could not do is find the proper regex for matching double-CRLF-ed, square-bracketed text of any number of words that may occupy more than one line and starts with a capital letter. I would need an empty line above and below the title as indicated in the image (the VBA macro somehow made two instances of CRLF and carried the brackets to a new line, which I do not like, either).
EDIT.
In the meantime I managed to cook something up but now I couldn't insert the CRLF in front of the match string. At this point this is not enough as other instances are also changed, even lowercase in-line items, for some reason...
Regex:
\[\[([A-Z][\S\s]+?)\]\]
Substitution:
## $1\r\n
https://regex101.com/r/mH6B9N/1
Since then, I have made improvements towards what I wanted (I had to test in Notepad++ rather than Regex101, as the results differ), but now in multiple documents I have found matches that spill over across lines, as described here:
Single line regex search in Notepad++
Is it possible that I cannot do what I want? The problem is having non-title text strings having line-break, double brackets and capitalized letters.
What it looks like in other documents:
See here.
I circled around with red in image for clarification. See also:
https://regex101.com/r/8XsIGx/1
Is it possible to match a certain word like "címnél" and not execute on that match if that word is present in a line?
Thanks very much in advance,
F.
You can use
(?s)\R\K\[\[((?:(?!\[\[|]]).)*)\R*]](?=\R)
Replace with ## $1. See the regex demo.
Details:
(?s) - equivalent of the . matches newline option
\R - a line break sequence
\K - omit the text matched so far (the newlines)
\[\[ - a [[ text
((?:(?!\[\[|]]).)*) - Group 1: any char (including line breaks), zero or more occurrences, as many as possible, that does not start a [[ or ]] char sequence
\R* - zero or more line breaks
]] - a ]] text
(?=\R) - immediately to the right, there must be a line break.
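To experiment with this pattern in a script, here is a rough Python adaptation. Python's built-in re module supports neither \R nor \K, so both are emulated; the NL helper and the function name are assumptions made for this sketch, not part of the original answer.

```python
import re

# \R is emulated with an explicit line-break alternation, and \K with a
# captured group that the replacement puts back.
NL = r"(?:\r\n|\n|\r)"
PATTERN = re.compile(
    rf"({NL})\[\[((?:(?!\[\[|\]\]).)*){NL}*\]\](?={NL})",
    re.DOTALL,  # same effect as the inline (?s)
)

def brackets_to_headings(text):
    """Turn [[title]] blocks that start and end at a line break into '## title'."""
    return PATTERN.sub(r"\1## \2", text)

sample = "intro\n[[Some Title\n]]\nbody with [[inline]] brackets\n"
print(brackets_to_headings(sample))
```

The inline [[inline]] is left alone because it is not preceded by a line break. As in the original pattern, the greedy group keeps the line break that precedes ]], which is what leaves an empty line under the heading.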

Match all spaces after a particular string

I'm doing a find and replace in VS Code across a large number of files. I'm looking to replace all spaces inside a set of quotes, but only on a specific line.
I can very easily find all spaces using \s+, but I don't understand how to capture only the spaces after a specific string (one specific line). I've tried positive lookbehinds, but I can only get them to match the first space, and I need to match all spaces on that line.
Example code:
variable = "01 - Testing this thing"
I need to find and replace all the spaces between the quotation marks with underscores, but I can't get any regex to match all the spaces between the quotes. I might want to replace the dash (-) as well, but the spaces are more important and I'm struggling to figure it out.
Here is a pretty good workflow.
Open a Search Editor (from the Command Palette or set a keybinding to it).
Use this regex (?<=variable = ")[^"]*.
That will find all matches in all files in your workspace or whatever folders you designate in the file to include filter. I suggest setting the context lines option to 0.
Ctrl+Shift+L to select all your matches. The matches are the 01 - Testing this thing part.
Now do a regular find in that search editor tab - with the Find in Selection option enabled.
Simply doing a find of a single space and Replace All with _ will make all those changes (in the Search Editor only).
To apply those changes to all the files with your initial search results, use the Apply Search Editor Changes... command from the search-editor-apply-changes extension.
Then you can check to see if the changes were as you expected and save all. It will open all affected files so you can inspect them.
Seems like a few steps, but notice the first regex can be very simple. And then you are doing a simple find/replace in just those selections.
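For comparison, the two-stage idea above (first isolate the quoted value, then replace spaces only inside it) can be sketched in a script. This is an illustration in Python rather than anything VS Code does internally, and the underscore_value name is made up.

```python
import re

# Stage 1: find the text after 'variable = "' up to the closing quote.
VALUE = re.compile(r'(?<=variable = ")[^"]*')

def underscore_value(text):
    """Replace spaces with underscores, but only inside the matched value."""
    # Stage 2: a plain string replace applied to each match.
    return VALUE.sub(lambda m: m.group(0).replace(" ", "_"), text)

line = 'variable = "01 - Testing this thing"'
print(underscore_value(line))
```

Lines that do not start the value with variable = " are never matched, so their spaces stay as they are.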
Alternatively, search for a string that matches the line and has a space between the quotes, and replace it with what is before and after the space, with the space itself now an underscore. You have to apply this as many times as the maximum number of spaces in a string; it can't be done in one regex search-replace.
In the Search Bar
Find Regex:
(variable = "[^" ]*) ([^"]*")
Replace:
$1_$2
Then apply Replace All (button) and Refresh (button) until no more matches are found.
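The "repeat until nothing matches" loop can be sketched in Python to show why several passes are needed: each pass replaces only one space per line. The function name is hypothetical.

```python
import re

# One space inside the quotes per pass; [^" ]* keeps group 1 free of spaces,
# so the match always lands on the first remaining space.
STEP = re.compile(r'(variable = "[^" ]*) ([^"]*")')

def replace_until_stable(text):
    """Apply the single-space replacement until a pass changes nothing."""
    while True:
        text, count = STEP.subn(r"\1_\2", text)
        if count == 0:
            return text

print(replace_until_stable('variable = "01 - Testing this thing"'))
```

For the sample line this takes four replacing passes, one per space, plus a final pass that finds nothing.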

Regex specific word after equals sign

The problem is to find and replace a word throughout a file. I'm trying to rename several fields without editing the system names for those fields. In the end, I need to find and replace Issue, Issues, issue, and issues. I'm using NetBeans (find and replace with regex) to open the .properties file that contains this text, but I'm open to using something else.
Only text to the right of the equals sign needs to be replaced. Sometimes there are periods next to the word on the right, but most of the time there aren't. I'm trying to use regex because there are around 10000 lines in the file (the goal is to change the display text of system fields without changing the field references themselves).
Example of text to be searched:
issue.columns.admin.title=Issue Navigator Default Columns
browseproject.issues.by.status.more=View these issues in the Issue Navigator sorted by Status
issue.operations.voting.resolved=You cannot vote or change your vote on resolved issues.
I'm using the following pattern in NetBeans, which gets around 90% correct:
(?<!([=.[a-z]]))issue(?!([[a-z]]))
However, it also matches the 'issue.' in fields like 'issue.operations.voting.resolved', which means that finding and replacing would cause problems by changing reference to the system fields. Is there a way to add to what I have already done to make it match words with periods appearing after the equals sign but not before?
You may use
(?i)(\G(?!^)|^[^=\n\r]*=)(.*?)\bissues?\b
Replace with $1$2<SOME_REPLACEMENT>.
See the regex demo.
It matches:
(?i) - an inline case insensitive modifier making the pattern case insensitive
(\G(?!^)|^[^=\n\r]*=) - Group 1 (referred to with $1 from the replacement pattern): either the location after the previous match (\G(?!^)) or the start of a line that is followed with 0+ chars other than = and line break chars and then a =
(.*?) - Group 2 (referred to with $2 from the replacement pattern): any 0+ chars, as few as possible, up to the first occurrence of...
\bissues?\b - a whole word issue or issues
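Python's built-in re module does not support \G, so a script-based version needs a different technique. The sketch below swaps in a two-step approach: split each line at the first =, then replace the word only in the right-hand part. The function name and the REQUEST replacement are placeholders, and unlike a hand-edited rename it does not preserve case or plural forms.

```python
import re

# Group 1: everything up to and including the first "=" on a line.
# Group 2: the display text to the right of it.
LINE = re.compile(r"^([^=\n\r]*=)(.*)$", re.MULTILINE)
WORD = re.compile(r"\bissues?\b", re.IGNORECASE)

def replace_display_text(text, replacement):
    """Replace issue/issues only to the right of the first '=' on each line."""
    def fix(m):
        return m.group(1) + WORD.sub(replacement, m.group(2))
    return LINE.sub(fix, text)

sample = (
    "issue.columns.admin.title=Issue Navigator Default Columns\n"
    "browseproject.issues.by.status.more=View these issues in the Issue Navigator\n"
)
print(replace_display_text(sample, "REQUEST"))
```

The keys issue.columns.admin.title and browseproject.issues.by.status.more are untouched because they sit in group 1, which is copied through as-is.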

Use REGEX to find line breaks within a wrapped content

The direct question: How can I use REGEX lookarounds to find instances of \r\n that occur between a set of characters (stand in open and closing tags), "[ and ]" with arbitrary characters and line breaks inside as well?
The situation:
I have a large database exported to tab- or comma-delimited text files that I'm trying to import into Excel. The problem is that some of the cells come from text areas that contain line breaks and are qualified by double quotes. When importing into Excel, these line breaks are treated as new rows. I cannot adjust how the file is exported. The data needs to be preserved, but the exact format doesn't, so I was planning on using some placeholder for the returns or ~
Here's a generic illustration of the format of my data:
column1rowA column2rowA column3rowA column4rowA
column1rowB column2rowB "column3rowB
3Bcont
3Bcont
3Bcont
" column4rowB
column1rowC column2rowC column4rowC
column1rowD column2rowD "column3rowD
3Dcont" column4rowD
My thought has been to try to select and replace line breaks within the quotes using REGEX search and replace in Notepad++. To try to make it simpler, I have tried adding a character to the double quotes to help indicate whether it is an opening or closing quote:
"[column3rowB
3Bcont
3Bcont
3Bcont
]"
I am new to REGEX. The progress I've made (which isn't much) is:
(?<="[) missing some sort of wildcard \r\n(?=.*]")
Every iteration I've tried has also included every line break between the first "[ and last ]"
I would also appreciate any other approaches that solve the underlying problem
If you can use some tool other than Notepad++, you can use this regex (see my working example on regex101):
(?!\n(([^"]*"){2})*[^"]*$)\n
It uses a negative lookahead to find line breaks only when not followed by an even number of quotes. You could replace them with <br>, spaces, or whatever is appropriate.
Breakdown:
(?! ... ) This is the negative lookahead, necessary because it's zero-width. Anything matched by it will still be available to match again.
(([^"]*"){2})* This is the other key piece. It matches quotes in pairs, each quote preceded by any run of non-quote characters, so the total number of quotes consumed is always even.
[^"]*$ This is ensuring that there are no more quotes from there until the end of the string.
Caveat:
I couldn't get it to work in Notepad++ because it always recognizes $ as the end of a line, not the end of the entire string.
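Since the caveat concerns how Notepad++ treats $, one way to try the pattern is in a scripting language. A minimal sketch in Python, assuming the whole file fits in memory and the quotes are balanced; \Z is used instead of $ so the anchor always means the end of the entire string, and the ~ placeholder and function name are illustrative.

```python
import re

# A newline is matched only when an odd number of quotes follows it,
# i.e. when it sits inside a quoted cell.
INNER_NEWLINE = re.compile(r'(?!\n(([^"]*"){2})*[^"]*\Z)\n')

def flatten_quoted_newlines(text, placeholder="~"):
    """Replace line breaks inside double-quoted cells with a placeholder."""
    return INNER_NEWLINE.sub(placeholder, text)

sample = 'a\tb\t"c1\nc2\nc3"\td\ne\tf\t"g\nh"\ti\n'
print(flatten_quoted_newlines(sample))
```

The row-ending newlines survive because an even number of quotes (possibly zero) follows each of them.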
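Great answer from Brian. I added an option that considers both kinds of line-break character (i.e. \n and \r), which worked for my CSV file. Note that the alternation has to be grouped so that it applies only to the line-break character:
(?!(?:\n|\r)(([^"]*"){2})*[^"]*$)(?:\n|\r)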

Remove everything before and after variable=int

I'm terrible at regex and need to remove everything from a large portion of text except for a certain variable declaration that occurs numerous times. I'd like to remove everything except for instances of mc_gross=anyint.
Generally, we'd need to use "negative lookarounds" to find everything but a specified string. But these are fairly inefficient (although that's probably of little concern to you in this instance), and lookaround is not supported by all regex engines (I'm not sure about Notepad++, and even then it probably depends on the version you're using).
If you're interested in learning about that approach, refer to How to negate specific word in regex?
But regardless, since you are using Notepad++, I'd recommend selecting your target, then inverting the selection.
This will select each instance, allowing for optional white space either side of the '=' sign.
mc_gross\s*=\s*\d+
The following answer over on Super User explains how to use bookmarks in Notepad++ to achieve the "inverse selection":
https://superuser.com/questions/290247/how-to-delete-all-line-except-lines-containing-a-word-i-need
Substitute the regex they're using over there, with the one above.
You could do a regular expression replace of ^.*\b(mc_gross\s*=\s*\d+)\b.*$ with \1. That will remove everything other than the wanted text on each line. Note that on lines where the wanted text occurs two or more times, only one occurrence will be retained. In the search the ^.*\b matches from start-of-line to a word boundary before the wanted text; the \b.*$ matches everything from a word boundary after the wanted text until end of line; the round brackets capture the wanted text for the replacement text. If text such as abcmc_gross=13def should be matched and retained as mc_gross=13 then delete the \bs from the search.
To remove unwanted lines do a regular expression search for ^mc_gross\s*=\s*\d+$ from the Mark tab, tick Bookmark line and click Mark all. Then use Menu => Search => Bookmark => Remove unmarked lines.
Find what: [\s\S]*?(mc_gross=\d+|\Z)
Replace with: \1
Position the cursor at the start of the text then Replace All.
Add word boundaries \b around mc_gross=\d+ if you think it's necessary.
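If a script is an option, the "keep only the matches" goal can also be approached from the other direction: collect the matches instead of deleting everything else. A small Python sketch, with a made-up function name and sample text:

```python
import re

# Optional whitespace around "=" is allowed, with word boundaries on both sides.
TARGET = re.compile(r"\bmc_gross\s*=\s*\d+\b")

def extract_mc_gross(text):
    """Return every mc_gross=<int> occurrence, in order of appearance."""
    return TARGET.findall(text)

sample = "txn=1&mc_gross=12&fee=3\nfoo mc_gross = 7 bar mc_gross=99\nnothing here"
print("\n".join(extract_mc_gross(sample)))
```

Unlike the per-line replace, this keeps every occurrence even when a line contains more than one.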