Non-greedy regexp matching too much in pandoc-generated markdown file [duplicate] - regex

This question already has answers here:
Regular expression to get text between square brackets including disparity?
(4 answers)
Closed 3 years ago.
The Problem
I'm trying to write a simple intermediary step in a Pandoc workflow. I have an original document in .docx which I'm converting to .md using the --track-changes switch (see Pandoc reader options for more information) to produce a markdown file which has MS word insertions/deletions/comments wrapped in span tags, e.g.
[Insertion text]{.insertion id="1" author="Jamie Bowman" date="2019-04-01T11:05:00Z"}
[Deletion text]{.deletion id="1" author="Jamie Bowman" date="2019-04-01T11:05:00Z"}
[Comment body]{.comment-start id="1" author="Jamie Bowman" date="2019-04-01T11:05:00Z"}[]{.comment-end id="1"}
I want to run a regexp find and replace on the markdown file which effectively 'accepts' insertions and deletions but leaves the comment spans. (This is so when I convert back to .docx, I have a clean .docx file with comments only.)
What I've tried
I have been able to accept all insertion spans and delete all deletion spans, but only when the body text does not carry across more than one line. My attempt at matching across new lines matches too much and I can't work out how to match the exact text only.
The following regexp matches almost all deletions which I can then replace with nothing:
Find: \[(.*?)\]{.deletion(.|\n)*?}
Replace:
Same for insertions which I can then use a backreference to retain the text but remove the span:
Find: \[(.*?)\]{.insertion(.|\n)*?}
Replace: $1
The patterns are matching too much, though, as you can see here.
Please let me know if anything is unclear. I've been working on this quite a bit today and it's difficult to explain the problem plainly! Thanks in advance.

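The following regex should match the deletion fragments:
\[[^[]*?\]{\.deletion.*?}
Excluding [ from the bracketed text stops a match from starting at an earlier, unrelated opening bracket, which is why the original patterns matched too much.
The regex for the insertions is mostly the same, except that you need a capturing group ([^[]*?) around the text:
\[([^[]*?)\]{\.insertion.*?}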
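The same idea can be sketched outside an editor. The snippet below is a minimal Python sketch, not part of the original answer: the function name `accept_changes` and the sample string are made up for illustration, and the attribute part uses `[^}]*` instead of `.*?` on the assumption that an attribute block may wrap onto a second line. It does not handle span text that itself contains square brackets (for example a link).

```python
import re

# Span text: any run of characters that are not square brackets (may span lines).
# Attributes: everything up to the first closing brace (may span lines).
DELETION = re.compile(r"\[[^\[\]]*\]\{\.deletion[^}]*\}")
INSERTION = re.compile(r"\[([^\[\]]*)\]\{\.insertion[^}]*\}")


def accept_changes(markdown):
    """Drop deletion spans, unwrap insertion spans, leave comment spans alone."""
    without_deletions = DELETION.sub("", markdown)
    return INSERTION.sub(r"\1", without_deletions)


# Hypothetical sample in the shape pandoc produces with --track-changes.
sample = (
    'Keep [new\ntext]{.insertion id="1" author="A"\ndate="2019-04-01T11:05:00Z"} and '
    '[old text]{.deletion id="2" author="A" date="2019-04-01T11:05:00Z"}'
    '[note]{.comment-start id="3" author="A"}[]{.comment-end id="3"}'
)

print(accept_changes(sample))
```

Deletions are removed first so that an insertion pattern never has to step over one; the comment spans are untouched because neither pattern mentions their classes.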

Related

Notepad++ Return ONLY content within an XML Tag [duplicate]

This question already has answers here:
Notepad++ and regex with removal of unmatching sections
(2 answers)
Closed 3 years ago.
I have a HUGE set of XML documents that have very specific tags. I'm looking to remove everything [EXCEPT] the content within a tag called <Contents>:
<DisplayContents>
<ID>8</ID>
<Type>102</Type>
<Contents>A whole bunch of stuff in this tag</Contents>
</DisplayContents>
In this example I would simply want to see the text A whole bunch of stuff in this tag
I've tried to use:
<(Contents).*?>|</.(Contents)>
as a Regex and Mark the lines... then remove the unmarked. But that seems to remove everything :( So - I'm doing something wrong and it is likely because I'm not much of a Regex guru.
EDIT: The stuff within Contents is very long and spans many lines with line feeds, in case that is what is tripping this up.
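My guess is that you wish to remove everything except
A whole bunch of stuff in this tag
for which an expression similar to
<DisplayContents>[\s\S]*?<Contents>([\s\S]*?)<\/Contents>[\s\S]*?<\/DisplayContents>
replaced with $1 might work. The capture group uses [\s\S]*? rather than .*? because . does not match line feeds unless ". matches newline" is ticked, and the contents span many lines.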
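If the files can be processed outside Notepad++, the extraction might look like the following Python sketch. The helper name `extract_contents` and the sample are illustrative assumptions; `re.DOTALL` is what lets `.` cross line feeds.

```python
import re

# re.DOTALL lets "." match line feeds, so multi-line contents are captured.
CONTENTS = re.compile(r"<Contents>(.*?)</Contents>", re.DOTALL)


def extract_contents(xml_text):
    """Return the text of every <Contents> element found in xml_text."""
    return CONTENTS.findall(xml_text)


# Sample modelled on the question, with the contents split over two lines.
sample = """<DisplayContents>
<ID>8</ID>
<Type>102</Type>
<Contents>A whole bunch
of stuff in this tag</Contents>
</DisplayContents>"""

print(extract_contents(sample))
```

This assumes the `<Contents>` tag carries no attributes and is never nested inside itself.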

Regular Expression Replace on Notepad++ [duplicate]

This question already has answers here:
Notepad++ v4.2.2. regular expressions to match and replace all text between two tags
(2 answers)
Closed 3 years ago.
I need a regular expression to replace the value in XML tags. I need to find * and replace it with XXXXX. I made an attempt to do this, but it's giving me "invalid regex".
<TAG>\('(.*?'\)</TAG>
// replace with:
<TAG>XXXXX</TAG>
I suspect that your actual starting content is something like this:
<TAG>some content here</TAG>
If you want to mask the content of such tags, you may try the following find and replace, in regex mode:
Find: <TAG>(.*?)</TAG>
Replace: <TAG>XXXXX</TAG>
Note that in general it is not desirable to manipulate nested content like XML/HTML using regex. But sometimes, e.g. when using tools like NPP, we are forced to do this. My answer should work fine assuming you are only targeting <TAG> elements which have no other children tags inside of them.
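For reference, the same find and replace can be sketched in Python. This is only an illustration under the same assumption as above (no child tags inside `<TAG>`); the function name `mask_tags` and the sample are hypothetical.

```python
import re

# Lazy match between the tags; re.DOTALL so a value may span several lines.
TAG_BODY = re.compile(r"<TAG>.*?</TAG>", re.DOTALL)


def mask_tags(xml_text, mask="XXXXX"):
    """Replace the content of every <TAG> element with the mask string."""
    # A function is used as the replacement so the mask is inserted literally.
    return TAG_BODY.sub(lambda _match: "<TAG>" + mask + "</TAG>", xml_text)


sample = "<a><TAG>some content here</TAG><TAG>two\nlines</TAG></a>"
print(mask_tags(sample))
```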

Capture everything after one word [duplicate]

This question already has an answer here:
Learning Regular Expressions [closed]
(1 answer)
Closed 6 years ago.
I am trying to make a regular expression capture any words in the specific line after the word Attachment:
This question is for work, so it is not a homework or test question. I took the paragraph below as an example from www.regular-expressions.info. I did not major in computers but Psychology so this is completely foreign to me. I've read the manuals for the last two days, and because this is going over my head, I don't know how to begin.
I have a task which involves me linking the attachments to a specific file with the same name saved in a folder (at least 500 attachments) on Adobe PDF. What I did before was to manually select the words and link it to a specific file in a folder, but it is tedious to do when they can go up to 500 attachments.
I was aware of an application plug-in called EVERMAP that you can download for Adobe to automatically link specific words to a specific file in a folder. However, it requires me to use regular expressions which again, I don't know how to use.
I will bold the words I want to capture in the paragraph below.
The repetition operator manual expand the match as far as they, and only come back if they must to satisfy the remainder.
Attachment: The repetition operator manual
The asterisk or star tells the engine to attempt to match the preceding token zero or more times. The plus tells the engine to attempt to match the preceding token once or more.
Attachment: Asterisk and stars engine
Attachment: (.+) should work in your case unless there are other exceptions to this rule. The regex simply tells the parser to capture one or more characters after the word Attachment:. See here for the sample
Like @Kevin said, the regex is simple. Use Attachment: (.+).
Maybe you are unsure how to apply the regex. I don't know the Evermap plugin, but you can copy all the text from the PDF into Sublime Text (a text editor with many extra features) and do the regex part there. Since you only need the attachment names, you should also remove the other, irrelevant text. So the regex will be:
`^\s*Attachment:\s*(.+)$|^(?!Attachment:).+$`
And replace it with:
`\1`
\1 refers to the text captured by the first group, i.e. the part in ()
In Sublime Text, open Find and Replace and apply the regex there. Don't forget to turn on regex mode.
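As a rough illustration of what the pattern captures, here is a small Python sketch run over the sample paragraph from the question. The function name `attachment_names` is made up, and `[ \t]*` is used in place of `\s*` so that a match cannot run over a line break.

```python
import re

# One match per line that starts with "Attachment:"; group 1 is the name.
ATTACHMENT = re.compile(r"^[ \t]*Attachment:[ \t]*(.+)$", re.MULTILINE)


def attachment_names(text):
    """Return the text following 'Attachment:' on each such line."""
    return [name.strip() for name in ATTACHMENT.findall(text)]


sample = (
    "The repetition operator manual expand the match as far as they, "
    "and only come back if they must to satisfy the remainder.\n"
    "Attachment: The repetition operator manual\n"
    "The asterisk or star tells the engine to attempt to match the "
    "preceding token zero or more times.\n"
    "Attachment: Asterisk and stars engine\n"
)

print(attachment_names(sample))
```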

Problems with finding and replacing

Hey Stack Overflow community. I need help with a huge data file. Is it possible with a regular expression to find this tag:
<category_name><![CDATA[Prekiniai ženklai&gt;Adler|Kita buitinė technika&gt;Buičiai naudingi prietaisai|Kita buitinė technika&gt;Lygintuvai]]></category_name>
and somehow replace all the other data, leaving only 'Adler' or 'Lygintuvai'? I'm using Altova to edit XML files, so I can't find any way other than find-replace. I'm also new to regex, so I thought maybe you could help me.
#\<category_name\>.+?gt\;([\w]+?)\|.+?gt;([\w]+?)\]\]\>\<\/category_name\>#i
\1 - Adler
\2 - Lygintuvai
PHP
regex101.com
Fields may contain alphanumeric characters without spaces.
If you want to change which characters are accepted, replace [\w] with something else:
[a-z] - only letters
[0-9] - only digits
etc.
It's possible, but use of regular expressions to process XML will never be 100% correct (you can prove that using computer science theory), and it may also be very inefficient. For example, the solution given by Luk is incorrect because it doesn't allow whitespace in places where XML allows it. Much better to use XQuery or XSLT, both of which are designed for the job (and both work in Altova). You can then use XPath expressions to locate the element or attribute nodes you are interested in, and you can still use regular expressions (e.g. in the XPath replace() function) to process the content of text or attribute nodes.
Incidentally, your input is rather strange because it uses escape sequences like &gt; within a CDATA section; but XML escape sequences are not recognized in a CDATA section.
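The "parse first, then use a regex only on the text" approach recommended above can be sketched in Python as a stand-in for XQuery/XSLT. This is an illustration, not what Altova does: the function name `leaf_categories` and the `<product>` wrapper element are assumptions, and the separator is taken to be either the literal text `&gt;` or a plain `>`.

```python
import re
import xml.etree.ElementTree as ET


def leaf_categories(xml_text):
    """Return the last segment of every category path in <category_name>."""
    root = ET.fromstring(xml_text)
    leaves = []
    for node in root.iter("category_name"):
        # The parser hands back the CDATA text as-is; paths are "|"-separated.
        for path in (node.text or "").split("|"):
            if path.strip():
                # Keep only what follows the last separator in the path.
                leaves.append(re.split(r"&gt;|>", path)[-1].strip())
    return leaves


# Hypothetical wrapper element around the tag from the question.
sample = (
    "<product><category_name><![CDATA[Prekiniai ženklai&gt;Adler|"
    "Kita buitinė technika&gt;Buičiai naudingi prietaisai|"
    "Kita buitinė technika&gt;Lygintuvai]]></category_name></product>"
)

names = leaf_categories(sample)
print(names[0], names[-1])
```

Unlike the pure regex, this tolerates whitespace around the tags and category names that contain spaces.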

Regex to strip single comments and muti-line comments in Notepad++ [duplicate]

This question already has answers here:
How to match c-style block comments in Notepad++ with a regex?
(2 answers)
Closed 9 years ago.
Given the following:
// comments
/******
comments
*******/
is it possible to have a regex for them?
As the comments say, it's not possible to strip comments fully correctly with regexes. But maybe the following regular expressions are still good enough for you:
^\s*//.*$
/\*.*?\*/
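To show how the two patterns behave when the dot is allowed to cross line breaks, here is a minimal Python sketch. The function name `strip_comments` and the sample are illustrative; as noted above, this is not a correct comment stripper, since it will also remove comment markers that appear inside string literals.

```python
import re

# Whole-line // comments; re.MULTILINE makes ^ and $ work per line.
LINE_COMMENT = re.compile(r"^[ \t]*//.*$", re.MULTILINE)
# /* ... */ blocks; re.DOTALL lets "." cross line breaks, "*?" stops at the first */.
BLOCK_COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)


def strip_comments(source):
    """Remove block comments first, then whole-line // comments."""
    without_blocks = BLOCK_COMMENT.sub("", source)
    return LINE_COMMENT.sub("", without_blocks)


sample = "int a = 1;\n// comments\n/******\ncomments\n*******/\nint b = 2;\n"
print(strip_comments(sample))
```

The emptied lines are left in place; removing them would take one more substitution.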
You can do this with a simple hack. Select Extended mode and replace every \r\n with a character or character sequence that does not occur in your file and that .* will match. Then switch back to Regular Expression mode and apply the regular expression (given by morja) to do your replace. Finally, replace the special character or character sequence with \r\n again.
@Mohammad Currently you cannot do this (match multiline) in Notepad++.
This is because matching newlines is only possible in Extended search mode, while regular expressions are only available in Regexp search mode.
You could, however, combine different steps to do what you want, as pointed out by other answers.
The easiest solution is not to use regex in Notepad++ at all: export the file as RTF (Plugins --> NppExport --> Export to RTF), then open it with Microsoft Word or another program that supports searching by format, and use that feature to search and replace only the green text.
I hope it helps.