Regex exclude value from repetition - regex

I'm working with load files and trying to write a regex that will check for any rows in the file that do not have the correct number of delimiters. Let's pretend the delimiter is % (I'm not sure this text field supports the delimiters that are in the load file). The regex I wrote that finds all rows that are correct is:
^%([^%]*%){20}$
Sometimes it's more beneficial to find any rows that do not have the correct number of delimiters, so to accomplish that, I wrote this:
(^%([^%]*%){0,19}$)|(^%([^%]*%){21,}$)
I'm concerned about the efficiency of this (and any regex I write in general), so I'm wondering if there's a better way to write it, or if the way I wrote it is fine. I thought maybe there would be some way to use alternation with the repetition tokens, such as:
{0,19}|{21,}
but that doesn't seem to work.
If it's helpful to know, I'm just searching through the files in Sublime Text, which I believe uses PCRE. I'm also open to suggestions for making the first regex better in general, although I've found it to work pretty well even in exceptionally large load files.

If your regex engine supports negative lookaheads, you can slightly modify your original regex.
^(?!%([^%]*%){20}$)
The regex above is useful for testing only, because it matches a position rather than any text. If you want to capture the offending row, you need to add a .* part:
^(?!%([^%]*%){20}$).*$
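As a rough illustration outside Sublime Text, the same negative-lookahead idea can be sketched in Python's re module. The function name bad_rows and the fields parameter are made up for this example, and % stands in for the real delimiter as in the question.

```python
import re

def bad_rows(text, fields=20):
    """Return the rows that do NOT have exactly `fields` delimited cells.

    Hypothetical helper: '%' stands in for the real delimiter.
    """
    # Same shape as ^(?!%([^%]*%){20}$).*$ with the count made configurable.
    bad = re.compile(rf"^(?!%(?:[^%]*%){{{fields}}}$).*$")
    # Checking line by line avoids needing multiline mode.
    return [line for line in text.splitlines() if bad.match(line)]

sample = "%a%b%\n%a%\n%a%b%c%\n"
print(bad_rows(sample, fields=2))
```

With fields=2, only the first sample row has the expected shape, so the other two rows are reported.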

Related

Search and replace with particular phrase

I need help with a mass search and replace using regex.
I have longer strings in which I need to look for any number followed by a particular string, e.g. 321BS, and replace just the text string I was looking for. So I need to look for BS in "gf test test2 321BS test" (the pattern is always the same, only the position differs) and change just BS.
Can you please help me find a regex for this?
Update: I need to keep the number and change just the text string. I will be doing this in Notepad++. However, I need a general function for this if possible. I am a rookie in regex. Moreover, is it possible to do it in Trados SDL Studio? Or how can I do it in an Excel file in bulk?
Thank you very much!
Your question is a bit vague. However, as I understand it, you want to match any digits followed by BS, i.e. 123BS, and you want to keep 123 but replace BS?
Regex: (\d+)BS matches 123BS
In notepad++ you can:
match (\d+)BS
replace \1NEWTEXT
This will replace 123BS with 123NEWTEXT.
\1 will substitute the capture group (\d+), which matches 1 or more digits.
You could do this in Trados Studio using an app. The SDLXLIFF Toolkit may be the most appropriate for you. The advantage over Notepad++ is that it's controlled and will only affect the translatable text and not anything that might break the integrity of the file if you make a mistake. You can also handle multiple files, or even multiple Trados Studio projects in one go.
The syntax would be very similar to the suggestion above... you would:
match (\d+)BS
replace $1NEWTEXT
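For bulk processing outside Notepad++ or Trados Studio, the same capture-group replacement could be sketched in Python. This is only an illustration: the function name replace_suffix is invented here, and NEWTEXT is the placeholder used in the answers above.

```python
import re

def replace_suffix(text, old="BS", new="NEWTEXT"):
    """Replace `old` only where it directly follows digits, keeping the digits.

    Hypothetical helper mirroring: match (\\d+)BS, replace \\1NEWTEXT.
    """
    pattern = re.compile(rf"(\d+){re.escape(old)}")
    # \g<1> is the unambiguous form of \1 in Python replacement strings.
    # A function is used so that `new` is inserted literally.
    return pattern.sub(lambda m: m.group(1) + new, text)

print(replace_suffix("gf test test2 321BS test"))
```

A BS that is not preceded by a number is left untouched, which matches the requirement of changing only the text after the digits.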

find string that is missing substring in xml files regular expression

This is my regular expression that finds it:
(<instance_material symbol="material_)([0-9]+)(part)(.*?)(")(/)(>)
I need to find a string that does not contain the word "part"
and the xml lines are
<instance_material symbol="material_677part01_h502_w5" target="#material_677part01_h502_w5"/>
<instance_material symbol="material_677" target="#material_677"/>
You can use negative lookahead
^(?!.*part).*?$
^ - start of string.
(?!.*part) - condition to avoid part.
.*? - Match anything except new line.
$ - End of string
Many regex beginners run into the problem of matching a string that does not contain certain words; Regular-Expressions.info has more useful tips. One common pattern is:
^((?!part).)*$
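Both lookahead patterns from the answers above can be tried against the two sample lines from the question. The sketch below uses Python's re module purely for illustration; the variable and function names are not from the original posts.

```python
import re

lines = [
    '<instance_material symbol="material_677part01_h502_w5" target="#material_677part01_h502_w5"/>',
    '<instance_material symbol="material_677" target="#material_677"/>',
]

# One lookahead at the start of the line that scans ahead for "part".
whole_line = re.compile(r"^(?!.*part).*?$")
# A lookahead checked before every single character ("tempered" form).
per_char = re.compile(r"^((?!part).)*$")

def lines_without_part(pattern):
    return [line for line in lines if pattern.match(line)]

print(lines_without_part(whole_line))
print(lines_without_part(per_char))
```

Both patterns keep only the second line. The first form checks the lookahead once per line, while the second re-checks it at every character, so the first is usually the cheaper choice.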
You need to be aware that all attempts to process XML using regular expressions are wrong, in the sense that (a) there will be some legitimate ways of writing the XML document that the regex doesn't match, and (b) there will be some ways of getting false matches, e.g. by putting nasty stuff in XML comments. Sometimes being right 99% of the time is OK of course, but don't do this in production because soon we'll have people writing on SO "I need to generate XML with the attributes in a particular order because that's what the receiving application requires."
Your regex, for example, requires the attribute to be in double rather than single quotes, and it doesn't allow whitespace around the "=" sign, or in several other places where XML allows whitespace. If there's any risk of people deliberately trying to defeat your regex, you need to consider tricks like people writing a character reference such as &#112; in place of p.
Even if this is a one-off with no risk of malicious subversion, you're much better off doing this with XPath. It then becomes a simple query like //instance_material[@symbol[not(contains(., 'part'))]]
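A parser-based version can be sketched with Python's standard library. Note that xml.etree.ElementTree implements only a limited XPath subset without contains(), so this sketch does the "does not contain part" test in Python instead of in the query. The <root> wrapper and the function name are assumptions made for the example, and it assumes the elements are not in an XML namespace.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper element around the two sample lines from the question.
SAMPLE = """<root>
  <instance_material symbol="material_677part01_h502_w5" target="#material_677part01_h502_w5"/>
  <instance_material symbol="material_677" target="#material_677"/>
</root>"""

def symbols_without(xml_text, needle="part"):
    """Return symbol attributes of instance_material elements lacking `needle`."""
    root = ET.fromstring(xml_text)
    result = []
    for element in root.iter("instance_material"):
        symbol = element.get("symbol")
        # Skip elements without a symbol attribute, as the XPath query would.
        if symbol is not None and needle not in symbol:
            result.append(symbol)
    return result

print(symbols_without(SAMPLE))
```

Because the parser handles quoting, whitespace, and character references, none of the regex pitfalls listed above apply here.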

REGEX in MS Word 2016: Exclude a simple String from Search

So I read a lot about Negation in Regex but can't solve my problem in MS Word 2016.
How do I exclude a String, Word, Number(s) from being found?
Example:
<[A-Z]{2}[A-Z0-9]{9;11}> to search a String like XY123BBT22223
But how do I exclude, for example, a specific one like SEDWS12WW04?
Well, it depends on what you need to achieve, or whether this is just a matter of curiosity... RegEx is not the same as the built-in Advanced Find with Wildcards; for real RegEx you need VBA.
Depending on your need, without using VBA, you could make use of space and return characters - something like this will work for the strings provided: [ ^13][A-Z]{2}[0-9]{1,}[A-Z]{1,}[0-9]{1,}[ ^13] (assuming you use normal carriage returns and spaces in your document)
Anyway, this is a good article on wildcard searches in MS Word: https://wordmvp.com/FAQs/General/UsingWildcards.htm
EDIT:
In light of your further comments you will probably want to look at section 8 of the linked article which explains grouping. For my proposed search you can use this to your advantage by creating 3 groups in your 'find' and only modifying the middle group, if indeed you do intend to modify. Using groups the search would look something like:
([ ^13])([A-Z]{2}[0-9]{1,}[A-Z]{1,}[0-9]{1,})([ ^13])
and the replace might look like this:
\1 SOMETHING \3
Note also: compared to a RegEx solution my suggestion is kinda lame, mainly because compared to RegEx, MS Word's find and replace (good as it is, and really it is) is kinda lame... it's hacky but it might work for you (although you might need to do a few searches).
BUT... if it really is REGEX that you want, well you can get access to this via VBA: How to Use/Enable (RegExp object) Regular Expression using VBA (MACRO) in word
And... then you will be able to use proper RegEx for find and replace, well almost - I'm under the impression that the VBA RegEx still has some quirks...
As already noted by others, this is not possible in Microsoft Word's flavor of regular expressions.
Instead, you should use standard regular expressions. It is actually possible to use standard regular expressions in MS Word if you use a special tool that integrates into Microsoft Word called Multiple Find & Replace (see http://www.translatortools.net/products/transtoolsplus/word-multiplefindreplace). This tool opens as a pane to the right of the document window and works just like the Advanced Find & Replace dialog. However, in addition to Word's existing search functionality, it can use the standard regular expressions syntax to search and replace any text within a Word document.
In your particular case, I would use this:
\b[A-Z]{2}[A-Z0-9]{9,11}\b(?<!\bSEDWS12WW04)
To explain, this searches for a word boundary + ID + word boundary, and then it looks back to make sure that the preceding string does not match [word boundary + excluded ID]. In a similar vein, you can do something like
(?<!\bSEDWS12WW04|\bSEDWS12WW05|\bSEDWS12WW06)
to exclude several IDs.
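The trailing negative lookbehind can be tried outside Word as well. The sketch below uses Python's re module, which accepts this pattern because the lookbehind has a fixed width; it is only an illustration of the technique, not of how the Word add-in behaves, and the sample text and function name are made up.

```python
import re

# Pattern from the answer: an ID, then a look back to reject one specific ID.
ID_PATTERN = re.compile(r"\b[A-Z]{2}[A-Z0-9]{9,11}\b(?<!\bSEDWS12WW04)")

def find_ids(text):
    """Return all IDs in `text` except the excluded one."""
    return ID_PATTERN.findall(text)

# Hypothetical sample text containing two valid IDs and the excluded ID.
sample = "XY123BBT22223 SEDWS12WW04 AB123456789"
print(find_ids(sample))
```

The lookbehind runs only after a full ID has been matched, so the exclusion costs one extra comparison per candidate.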
Multiple Find & Replace is quite powerful: you can add any number of expressions (either using regular expressions or using Word's standard search syntax) to a list and then search the document for all of them, replace everything, display all matches in a list and replace only specific matches, and a few more things.
I created this tool for translators and editors, but it is great for any advanced search/replace operations in Word, and I am sure you will find it very useful.
Best regards, Stanislav

Too Many Characters Included in Attempt to Parse a CSV File

Background
I am attempting to parse a CSV file using PCRE regular expressions. That is, making out (or extracting) the various different "cells" available in the CSV, to then put them in a somewhat nicely organized array containing all the parts that the process of parsing managed to make out.
The following regular expression is what I have come up with so far:
/(?:;|^)(?:(?:"(?:(?!"(;|$)).)*)|(?:([^;]*)))/g
I would highly recommend that you put this in a tester for regular expressions. Here is a slight bit of test data, that should match to a great extent.
"There; \"be";"but; someone spoke";hence the young man;hence the son;"test;"
The Problem
The regular expression manages to extract the correct number of parts. It is meant for the regular expression to retrieve the text from inside each and every "cell" available in the CSV (use the CSV provided above for reference). It does to some extent.
Here is the result of the groups in the regular expression above:
"There; \"be
;"but; someone spoke
hence the young man
hence the son
;"test;
As we can clearly see, for the cells that are "escaped" using double quotation marks, the match includes the opening " and sometimes even the preceding semicolon. From my understanding, the group for the negative lookahead should not include those.
I have probably missed something very essential here. Perhaps someone can point me in the right direction towards a fix.
Edit and Potential Solution
It appears as though I might have managed to solve it. Contrary to what I said above, the negative lookahead does not actually create a capture group, as I initially thought it did. As such, adding yet another group to the equation seems to parse out the segments I am after.
/(?:;|^)(?:(?:"((?:(?!"(;|$)).)*))|(?:([^;]*)))/g
I will, however, leave the question open for now, and will answer it myself if no other answer comes in. So as not to make it opinion-based, I would further ask whether there is a more efficient way, in terms of speed, than the one I am using above.
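For comparison, one alternative to a hand-written regex is a dedicated CSV parser. The sketch below swaps in Python's csv module rather than PCRE, so it is not a drop-in replacement for the expression above. It assumes a semicolon delimiter, double-quote quoting, and backslash as the escape character, which is how the test data appears to be written; the function name is invented.

```python
import csv
import io

def parse_cells(line):
    """Split one semicolon-delimited line into cells.

    Assumes backslash-escaped quotes inside quoted cells, as in the test data.
    """
    reader = csv.reader(
        io.StringIO(line),
        delimiter=";",
        quotechar='"',
        escapechar="\\",
        doublequote=False,
    )
    return next(reader)

# The test data from the question, as a raw string to keep the backslash.
data = r'"There; \"be";"but; someone spoke";hence the young man;hence the son;"test;"'
for cell in parse_cells(data):
    print(cell)
```

The parser strips the surrounding quotes and the separating semicolons, which are exactly the characters the regex groups were accidentally including.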

Exclude a certain String from variable in regex

Hi, I have a stylesheet where I use xsl:analyze-string with the following regex:
(&journal_abbrevs;)[\s ]*([0-9]{{4}})[,][\s ][S]?[\.]?[\s ]?([0-9]{{1,4}})([\s ][(][0-9]{{1,4}}[)])?
You don't need to look at the whole thing :)
&journal_abbrevs; looks like this:
"example-String1|example-String2|example-String3|..."
What I need to do now is exclude one of the strings in &journal_abbrevs; from this regex. E.g. I don't want example-String1 to be matched.
Any ideas on how to do that?
It seems XSLT regex does not support look-around. So I don't think you'll be able to get a solution for this that does not involve writing out all strings from journal_abbrevs in your regex. Related question.
To minimize the amount of writing out, you could split journal_abbrevs into, say, journal_abbrevs1, journal_abbrevs2 and journal_abbrevs3 (or however many you decide to use) and only write out whichever one contains the string you wish to exclude. If journal_abbrevs1 contains the string, you'd then end up with something like:
((&journal_abbrevs2;)|(&journal_abbrevs3;)|example-String2|example-String3|...)...
If it supported look-around, you could've used a very simple:
(?!example-String1)(&journal_abbrevs;)...
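The idea of writing out the alternation without the unwanted entry can be illustrated in Python. This is not XSLT; it only shows the filtering step that would have to happen wherever the entity is generated. The list contents are the placeholder names from the question, and alternation_without is a hypothetical helper.

```python
import re

def alternation_without(abbrevs, excluded):
    """Build 'a|b|c' from `abbrevs`, leaving out `excluded`."""
    kept = [abbrev for abbrev in abbrevs if abbrev != excluded]
    # Escape each entry so characters like '.' are matched literally.
    return "|".join(re.escape(abbrev) for abbrev in kept)

# Placeholder entries standing in for the contents of &journal_abbrevs;.
abbrevs = ["example-String1", "example-String2", "example-String3"]
pattern = re.compile(rf"({alternation_without(abbrevs, 'example-String1')})")

print(bool(pattern.fullmatch("example-String2")))
print(bool(pattern.fullmatch("example-String1")))
```

Filtering the list avoids look-around entirely, so the resulting alternation stays usable in engines that lack it.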