Removing everything between a tag (including the tag itself) using Regex / Eclipse - regex

I'm fairly new to figuring out how Regex works, but this one is just frustrating.
I have a massive XML document with a lot of <description>blahblahblah</description> tags. I want to basically remove any and all instances of <description></description>.
I'm using Eclipse and have tried a few examples of Regex I've found online, but nothing works.
<description>(.*?)</description>
Shouldn't that work?
EDIT:
Here is the actual code.
<description><![CDATA[<center><table><tr><th colspan='2' align='center'><em>Attributes</em></th></tr><tr bgcolor="#E3E3F3"><th>ID</th><td>308</td></tr></table></center>]]></description>

I'm not familiar with Eclipse, but I would expect its regex search facility to use Java's built-in regex flavor. You probably just need to check a box labeled "DOTALL" or "single-line" or something similar, or you can add the corresponding inline modifier to the regex:
(?s)<description>(.*?)</description>
That will allow the . to match newlines, which it doesn't by default.
EDIT: This is assuming there are newlines within the <description> element, which is the only reason I can think of why your regex wouldn't work. I'm also assuming you really are doing a regex search; is that automatic in Eclipse, or do you have to choose between regex and literal searching?

Related

Regex to match everything except a pattern

Regex noob here struggling with this, which I know it will be easy for some of you regex gods out there!
Given the following:
title: Some title
date: 2022-08-15
tags: <value to extract>
identifier: 1234567
---------------------------
Some text
some more text
I would like a regex to match everything except the value of tags (ie the "<value to extract>" text).
For context, this is supposed to run on emacs (in case it matters).
EDIT: Just to clarify as per #phils question, all I care about extracting the tags value. However, this is via a package setting that asks for a regex string and I don't have much control over how it gets use. It seems to expect a regex to strip what I don't need from the string rather than matching what I do want, which is slightly annoying.. Also, the since it seems to match everything with \\(.\\), I'm guessing it's using the global flag?
Please let me know if any of this isn't clear.
Emacs regular expressions can't trivially express "not foo" for arbitrary values of foo. (The likes of PCRE have non-regular extensions for zero-width negative look-ahead/behind assertions, but in Emacs that sort of functionality is generally done with the support of lisp code1.)
You can still do it purely with regexp matching, but it's simply very cumbersome. An Emacs regexp which matches any line which does not begin with tags: is:
^\(?:$\|[^t]\|t[^a]\|ta[^g]\|tag[^s]\|tags[^:]\).*
or if you need to enter it in the elisp double-quoted read syntax for strings:
"^\\(?:$\\|[^t]\\|t[^a]\\|ta[^g]\\|tag[^s]\\|tags[^:]\\).*"
1 In lisp code you would instead simply check each line to see whether it does start with tags: and, if so, skip it (which is why Emacs generally gets away without the feature you're looking for, but of course that doesn't help you here).
After playing around with it for a bit and taken inspiration from #phils' answer, I've come up with the following:
"^\\(?:\\(#\\+\\)?\\(?:filetags:\s+\\|tags:\s+\\|title:.*\\|identifier:.*\\|date:.*\\)\\|.*\\)"
I've also added an extra \\(#\\+\\)? to account for org meta keys which would usually have the format #+key: value.

Writing valid RegEx for use in file/folder exclusion

I'm trying to write two expressions to use in the files/folder Exclusion List for Code42 CrashPlan backup. Their support won't help with RegEx expressions, they just point me to their KB article.
In their "File Exclusions" section, I'd like to:
exclude this folder specifically: S:\Google Drive\Temp
any file or folder containing the string Backup_Excluded anywhere in its name.
This is what I've got so far - but I have no way of knowing if they're correct:
(?i).*Google Drive\\Temp ...but since I really want to exclude a specific folder, not a pattern - do I need to escape the slashes and colon in the path of S:\Google Drive\Temp
(?i).*Backup_Excluded
Research disclaimer: I know there are RegEx resources out there, but am unsure which flavor/syntax to use, as I'd imagine there are many. I was hoping those with more RegEx familiarity could advise.
The link you posted says:
The Code42 app treats all file separators as forward slashes /.
So it seems you'd want to use / instead of \\ in your regular expressions.
Colon doesn't need escaping.
\ needs escaping because it's the escaping character itself.
/ normally needs escaping because it is the default separators for regular expression sections. However, the examples in your link don't escape it, so only the matching section is implied, so no escaping.
Then you could probably use:
S:/Google Drive/Temp
or [A-Z]:/Google Drive/Temp (to allow any drive)
.*Backup_Excluded.*
I probably wouldn't use (?i), as the capitals in those strings are usually there, but that's your call.
Check out e.g. https://regex101.com/ to test your regular expressions (also in different flavours).

REGEX in MS Word 2016: Exclude a simple String from Search

So I read a lot about Negation in Regex but can't solve my problem in MS Word 2016.
How do I exclude a String, Word, Number(s) from being found?
Example:
<[A-Z]{2}[A-Z0-9]{9;11}> to search a String like XY123BBT22223
But how to exclude for example a specefic one like SEDWS12WW04?
Well it depends on what you need to achieve or is this a matter of curiosity... RegEx is not the same as the built-in Advanced Find with Wildcards; for that you need VBA.
Depending on your need, without using VBA, you could make use of space and return characters - something like this will work for the strings provided: [ ^13][A-Z]{2}[0-9]{1,}[A-Z]{1,}[0-9]{1,}[ ^13] (assuming you use normal carriage returns and spaces in your document)
Anyway, this is a good article on wildcard searches in MS Word: https://wordmvp.com/FAQs/General/UsingWildcards.htm
EDIT:
In light of your further comments you will probably want to look at section 8 of the linked article which explains grouping. For my proposed search you can use this to your advantage by creating 3 groups in your 'find' and only modifying the middle group, if indeed you do intend to modify. Using groups the search would look something like:
([ ^13])([A-Z]{2}[0-9]{1,}[A-Z]{1,}[0-9]{1,})([ ^13])
and the replace might look like this:
\1 SOMETHING \3
Note also: compared to a RegEx solution my suggestion is kinda lame, mainly because compared to RegEx, MS-Words find and replace (good as it is, and really it is) is kinda lame... it's hacky but it might work for you (although you might need to do a few searches).
BUT... if it really is REGEX that you want, well you can get access to this via VBA: How to Use/Enable (RegExp object) Regular Expression using VBA (MACRO) in word
And... then you will be able to use proper RegEx for find and replace, well almost - I'm under the impression that the VBA RegEx still has some quirks...
As already noted by others, this is not possible in Microsoft Word's flavor of regular expressions.
Instead, you should use standard regular expressions. It is actually possible to use standard regular expressions in MS Word if you use a special tool that integrates into Microsoft Word called Multiple Find & Replace (see http://www.translatortools.net/products/transtoolsplus/word-multiplefindreplace). This tool opens as a pane to the right of the document window and works just like the Advanced Find & Replace dialog. However, in addition to Word's existing search functionality, it can use the standard regular expressions syntax to search and replace any text within a Word document.
In your particular case, I would use this:
\b[A-Z]{2}[A-Z0-9]{9,11}\b(?<!\bSEDWS12WW04)
To explain, this searches for a word boundary + ID + word boundary, and then it looks back to make sure that the preceding string does not match [word boundary + excluded ID]. In a similar vein, you can do something like
(?<!\bSEDWS12WW04|\bSEDWS12WW05|\bSEDWS12WW05)
to exlude several IDs.
Multiple Find & Replace is quite powerful: you can add any number of expressions (either using regular expressions or using Word's standard search syntax) to a list and then search the document for all of them, replace everything, display all matches in a list and replace only specific matches, and a few more things.
I created this tool for translators and editors, but it is great for any advanced search/replace operations in Word, and I am sure you will find it very useful.
Best regards, Stanislav

How to match and replace n number of times with RegEx

I'm using TextWrangler, the free version of BBEdit on the Mac, which I understand uses the PCRE engine.
What I want to do is match a specific number of lines and replace as well.
After a lot of searching I came up with this:
(^(.*\r)){25}
This lets me match up to 25 lines. It works great, but the problem comes when I want to actually replace something. I can't figure out how to do it.
For example, I would like to replace all of the returns "\r" with tabs "\t".
Hopefully this is actually possible. I'd appreciate any help. Thanks!
Regexp domain is searching. You cannot replace using regexp; a programming language or editor can use regexp as the search part of its search-and-replace function. Thus, the way to do 25 replacements is purely in the domain of said programming language or editor. If it does not provide such capability, either directly in search-and-replace or as a macro/loop/other, then you cannot do it automatically.

Problems with finding and replacing

Hey stackoverflow community. Ive need help with huge information file. Is it possible with regular expression to find in this tag:
<category_name><![CDATA[Prekiniai ženklai>Adler|Kita buitinė technika>Buičiai naudingi prietaisai|Kita buitinė technika>Lygintuvai]]></category_name>
Somehow replace all the other data and leave only 'Adler' or 'Lygintuvai'. Im using Altova to edit xml files, so i cant find other way then find-replace. And im new in the regex stuff. So i thought maby you can help me.
#\<category_name\>.+?gt\;([\w]+?)\|.+?gt;([\w]+?)\]\]\>\<\/category_name\>#i
\1 - Adler
\2 - Lygintuvai
PHP
regex101.com
Fields may contain alphanumeric characters without spaces.
If you want to modify the scope of acceptable characters change [\w] to something other:
[a-z] - only letters
[0-9] - only digits
etc.
It's possible, but use of regular expressions to process XML will never be 100% correct (you can prove that using computer science theory), and it may also be very inefficient. For example, the solution given by Luk is incorrect because it doesn't allow whitespace in places where XML allows it. Much better to use XQuery or XSLT, both of which are designed for the job (and both work in Altova). You can then use XPath expressions to locate the element or attribute nodes you are interested in, and you can still use regular expressions (e.g. in the XPath replace() function) to process the content of text or attribute nodes.
Incidentally, your input is rather strange because it uses escape sequences like > within a CDATA section; but XML escape sequences are not recognized in a CDATA section.