Match each group separately followed by a dash - regex

I have some markdown files where I've been using two spaces as tabs instead of four. For correct markdown syntax it should be four spaces, so I am trying to do a find and replace in each of the files.
I can match the double spaces, but when I attempt to only match the groups preceding a dash, it doesn't match what I need.
# List in Markdown
- List Item
  - List Item
    - List Item
      - List Item
  - List Item
    - List Item
- List Item
If I use / {2}/mg, it matches each group of 2 spaces separately.
https://regex101.com/r/v1qF15/1
But I only want to replace the spaces that precede a dash. The problem is that when I add grouping and the dash, all the double spaces before a dash become one match instead of individual matches.
/(?: {2})+(?=-)/mg
https://regex101.com/r/VqvrP8/1
Match 1: The set of 2 spaces on line 1
Match 2: The set of 4 spaces on line 2
Match 3: The set of 6 spaces on line 3
But what I really want is:
Match 1: The set of 2 spaces on line 1
Match 2: The first set of 2 spaces on line 2
Match 3: The second set of 2 spaces on line 2
Match 4: The first set of 2 spaces on line 3
I think I'm doing something wrong with my positive lookahead or my grouping, but I can't figure out how to fix it.

Assuming that your (multi-line) input string is stored in variable $str:
$str -replace '(?: {2})+(?=-)', '$&$&'
In your simple case, it's sufficient to know the combined length of the two-space multiples that matched, allowing you to in effect double them simply by referring to what was matched ($&) twice in the substitution expression.
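For readers not using PowerShell, the same idea can be sketched in JavaScript, where `$&` in a replacement string also stands for the whole match. The variable names and the sample list are illustrative assumptions. The second pattern, which matches each two-space group separately as the question asked, is an alternative that does not come from the answer above.

```javascript
// Sample input (assumed): a nested list indented with two spaces per level.
const markdown = [
  "# List in Markdown",
  "- List Item",
  "  - List Item",
  "    - List Item",
  "      - List Item",
].join("\n");

// Approach from the answer: match the whole run of two-space groups
// before a dash and emit it twice ("$&" is the matched text).
const doubled = markdown.replace(/(?: {2})+(?=-)/g, "$&$&");

// Alternative (assumption): match each two-space group on its own by
// allowing further two-space groups inside the lookahead.
const perGroup = markdown.replace(/ {2}(?=(?: {2})*-)/g, "    ");

console.log(doubled);
console.log(doubled === perGroup);
```

Both replacements turn every two-space indent level into a four-space one; the second only differs in how many matches it reports.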

Related

How to match strings that are entirely composed of a predefined set of substrings with regex

How to match strings that are entirely composed of a predefined set of substrings. For example, I want to see if a string is composed of only the following allowed substrings:
,
034
140
201
In the case when my string is as follows:
034,201
The string is fully composed of the 'allowed' substrings, so I want to positively match it.
However, in the following string:
034,055,201
There is an additional 055, which is not in my 'allowed' substrings set. So I want to not match that string.
What regex would be capable of doing this?
Try this one:
^(034|201|140|,)+$
Here is a demo
Step by step:
^ beginning of a line
(034|201|140|,) captures group with alternative possible matches
+ captured group appears one or more times
$ end of a line
This regex will match only your values and ensure that the line doesn't start or end with a comma. A valid string matches as a whole (group 0); the groups themselves are non-capturing.
^(?:034|140|201)(?:,(?:034|140|201))*$
^: start
(?:034|140|201): non-capturing group for your set of items (no comma)
(?:,(?:034|140|201))*: non-capturing group of a comma followed by a non-capturing group of values, 0 or more times
$: end
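The difference between the two patterns can be checked with a short JavaScript sketch. The names `loose` and `strict` and the extra test strings (`,034,` and `034201`) are illustrative assumptions, not part of the original question.

```javascript
// First answer: any sequence of allowed tokens, commas included.
const loose = /^(034|201|140|,)+$/;

// Second answer: allowed values separated by single commas only.
const strict = /^(?:034|140|201)(?:,(?:034|140|201))*$/;

const samples = ["034,201", "034,055,201", ",034,", "034201"];

for (const s of samples) {
  console.log(JSON.stringify(s), "loose:", loose.test(s), "strict:", strict.test(s));
}
```

Both patterns reject `034,055,201`, but only the stricter one also rejects leading or trailing commas and values that run together without a separator.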

Find a match but not if followed by a specific character

I want to match all repeating spaces in each line of the text, excluding spaces around a specified character.
For example, if I want to find repeating spaces before # in this text
first second third
first second third
first second #third
first second #third
first second # third
# first second third
I expect to match the multiple spaces in the first 3 lines but to get no matches in the last 3.
I have tried this: /(\S\s{2,})(?!\s*#)/g but that is also matching repeating spaces after #
How to write a regex for this case?
One possible solution with lookarounds:
(?<![#\s])\s\s+(?![\s#])
The pattern will match any 2+ spaces \s\s+, that are:
not preceded by either space or # ((?<![#\s]))
not followed by either space or # ((?![\s#]))
Check the demo here.
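A minimal JavaScript sketch of the lookaround pattern follows. It assumes an engine with lookbehind support (ES2018 or later), and the sample lines are made up for illustration because the spacing in the question's text is not preserved here.

```javascript
// Two or more whitespace characters that are neither preceded nor
// followed by whitespace or "#".
const pattern = /(?<![#\s])\s\s+(?![\s#])/g;

// Assumed sample lines, processed one at a time so that \s never
// has a chance to span a newline.
const lines = [
  "first  second   third", // repeated spaces, no "#"
  "first second  #third",  // repeated spaces directly before "#"
  "first second #  third", // repeated spaces directly after "#"
];

// Collapse every match to a single space.
const collapsed = lines.map((line) => line.replace(pattern, " "));

console.log(collapsed);
```

Only the first line changes; the runs of spaces touching `#` in the other two lines are left alone.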
You could match what you want to get out of the way, and keep in a group what you are looking for.
(^[^\S\n]+|[^\S\n]*#.*)|[^\S\n]{2,}
Explanation
( Capture group 1 (to keep unmodified)
^[^\S\n]+ Match 1+ spaces without newlines from the start of the string
| Or
[^\S\n]*#.* Match optional spaces, # and the rest of the line
) Close group 1
| Or
[^\S\n]{2,} Match 2 or more spaces without newlines
See a regex demo.
There is no language tagged, but if for example you want to replace the repeating spaces in the first 3 lines with a single space, this is a JavaScript example where you check for group 1.
If it is present, then return it to keep that part unmodified; otherwise replace the match with your replacement string.
Note that \s can also match newlines, therefore I have added dashes between the 2 parts for clarity and used [^\S\n]+ to match spaces without newlines.
const regex = /(^[^\S\n]+|[^\S\n]*#.*)|[^\S\n]{2,}/g;
const str = `first second third
first second third
first second #third
----------------------------------------
keep the indenting second #third #fourth b.
first second #third
first second # third these spaces should not match
# first second third`;
console.log(str.replace(regex, (_, g1) => g1 || " "));

Notepad++ Search and Replace: delete 3 to 4 numbers after N in each row

I have a text file where almost all the lines start with the letter N followed by 3 or 4 numbers as below
N970 G2 X-1.0591 Y-1.7454 I0. J-.04
N980 G1 Y-1.7554
N990 X-1.0594 Y-1.7666
N1000 Z-.2187
N1010 Y-1.7566
How can I remove the N followed by the 3 or 4 numbers in Notepad++ so that it looks like this? If I need to search twice (once for N### and then again for N####), that is fine too.
G2 X-1.0591 Y-1.7454 I0. J-.04
G1 Y-1.7554
X-1.0594 Y-1.7666
Z-.2187
Y-1.7566
The numbers go from 100 to 9990 in increments of 10, if that helps.
You can use the following regex that should work for your case:
^N[0-9]+\s*(.*)
It will match every line that starts with a capital letter N immediately followed by one or more digits. Matched results will include a single group which will contain the text you are looking for.
Note that the whitespace between the N tag and the actual text is matched by \s* but is not part of the group, so it is removed along with the tag.
Try it out in this DEMO
Breakdown
^ # Assert position at the start of the line
N # Matches capital letter 'N' literally
[0-9]+ # Matches any digit between 1 and unlimited times
\s* # Matches whitespace between 0 and unlimited times
(.*) # The rest of the text you are looking for
Find/Replace
The regex will match each individual line so you can either select Find Next and then Replace and process your file one line at a time or you can choose Replace All to process the whole file at once.
The substitution (Replace with:) field should contain just the first group ($1), which represents the rest of your text with the N-prefix tags trimmed.
Make sure that the Search Mode is set to Regular expression.
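Outside Notepad++, the same find/replace can be tried in JavaScript as a quick check of the pattern. This is only a sketch: the variable names are assumptions, and the `m` flag is needed here so that `^` matches at the start of every line.

```javascript
// Sample lines taken from the question.
const gcode = [
  "N970 G2 X-1.0591 Y-1.7454 I0. J-.04",
  "N980 G1 Y-1.7554",
  "N990 X-1.0594 Y-1.7666",
  "N1000 Z-.2187",
  "N1010 Y-1.7566",
].join("\n");

// Same pattern as above; "$1" keeps only the captured rest of the line.
const stripped = gcode.replace(/^N[0-9]+\s*(.*)/gm, "$1");

console.log(stripped);
```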

Trying to match zero outside the word boundaries

I have patterns like
FQC19515_TCELL001_20190319_165944.pdf
FQC19515_TBNK001_20190319_165944.pdf
I can match word TCELL and TBNK with this RegEX
^(\D+)-(\d+)-(\d+)([A-Z1-9]+)?.*
But if I have patterns like
FLW194640_T20NK022_20190323_131348.pdf
FLW194228_C1920_SOME_DEBRIS_REMOVED.pdf
the above regex returns
T2 and C192 instead of T20NK and C1920 respectively
Is there a general regex that matches N zeros outside of these word boundaries?
Let's consider all 4 examples of your input:
FQC19515_TCELL001_20190319_165944.pdf
FQC19515_TBNK001_20190319_165944.pdf
FLW194640_T20NK022_20190323_131348.pdf
FLW194228_C1920_SOME_DEBRIS_REMOVED.pdf
The first group, between start of line and the first "_" (e.g. FQC19515 in row 1)
consists of:
a non-empty sequence of letters,
a non-empty sequence of digits.
So the regex matching it, including the start of line anchor and a capturing group is:
^([A-Z]+\d+)
You used \D instead of [A-Z], but I think that [A-Z] is
more specific, as it matches only letters and not e.g. "_".
The next source char is _, so the regex can also include _.
And now the more difficult part: the second group to be captured
actually has 2 variants:
a sequence of letters and a sequence of digits (after that there is
a "_"),
a sequence of letters, a sequence of digits and another sequence of
letters (after that there are digits that you want to omit).
So the most intuitive way is to define 2 alternatives, each with
a respective positive lookahead:
alternative 1: [A-Z]+\d+(?=_),
alternative 2: [A-Z]+\d+[A-Z]+(?=\d).
But there is a slightly shorter way. Notice that both alternatives start
with [A-Z]+\d+.
So we can put this fragment first and include only the rest
as a non-capturing group ((?:...)) with 2 alternatives.
All of the above should be wrapped in a capturing group:
([A-Z]+\d+(?:(?=_)|[A-Z]+(?=\d)))
So the whole regex can be:
^([A-Z]+\d+)_([A-Z]+\d+(?:(?=_)|[A-Z]+(?=\d)))
with m option ("^" matches also the start of each line).
For a working example see https://regex101.com/r/GDdt10/1
Your regex: ^(\D+)-(\d+) is wrong as after a sequence of non-digits
(\D+) you specified a minus which doesn't occur in your source.
Also the second minus does not correspond to your input.
Edit
To match all your strings, I modified slightly the previous regex.
The changes are limited to the matching group No 2 (after _):
Alternative No 1: [A-Z]{2,}+(?=\d) - two or more letters, after them
there is a digit, to be omitted. It will match TCELL and TBNK.
Alternative No 2: [A-Z]+\d+(?:(?=_)|[A-Z]+(?=\d)) - the previous
content of this group. It will match two remaining cases.
So the whole regex is:
^([A-Z]+\d+)_([A-Z]{2,}+(?=\d)|[A-Z]+\d+(?:(?=_)|[A-Z]+(?=\d)))
For a working example see https://regex101.com/r/GDdt10/2
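The final pattern can be exercised against the four file names with a small JavaScript sketch. One assumption to flag: JavaScript has no possessive quantifiers, so `{2,}+` is written here as the plain greedy `{2,}`, which gives the same result for these inputs. Each name is tested on its own, so the `m` option is not needed.

```javascript
// Greedy variant of the final regex ({2,}+ replaced by {2,} for JavaScript).
const pattern =
  /^([A-Z]+\d+)_([A-Z]{2,}(?=\d)|[A-Z]+\d+(?:(?=_)|[A-Z]+(?=\d)))/;

const names = [
  "FQC19515_TCELL001_20190319_165944.pdf",
  "FQC19515_TBNK001_20190319_165944.pdf",
  "FLW194640_T20NK022_20190323_131348.pdf",
  "FLW194228_C1920_SOME_DEBRIS_REMOVED.pdf",
];

// Group 2 holds the code after the first underscore.
const codes = names.map((name) => {
  const m = pattern.exec(name);
  return m ? m[2] : null;
});

console.log(codes);
```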
As far as I understand, you could use:
^[A-Z]+\d+_\K[A-Z0-9]{5}
Explanation:
^ # beginning of line
[A-Z]+ # 1 or more capitals
\d+_ # 1 or more digits and 1 underscore
\K # forget all we have seen until this position
[A-Z0-9]{5} # 5 capitals or digits
Demo

Regex capture group that excludes optional substring?

I'm trying to construct a regex to extract Swedish organization numbers from data. These numbers can be of the following formats:
999999999999 // 12 digits, first two should be ignored.
9999999999 // 10 digits, all should be included.
99999999-9999 // 12 digits with a dash, first two digits and the dash should be ignored
999999-9999 // 10 digits with a dash, dash should be ignored.
For the 12 digit cases, the first two digits are always 16, 19 or 20. My current attempt is:
(?:16|19|20)?(\d{6}\-?\d{4})
This will return a ten digit organization number in $1, but it will contain the dash if it's present. I want the dash to be stripped (or possibly added if it's missing), so that $1 has the same format regardless of dash or no dash in the input.
The regex is in a config and will be used in code that simply extracts $1, so I can't solve this in code - I need the regex to do it "by itself".
As a last resort, I could modify the code to allow config to specify a "replace string" in addition to the search regex, and have the code use the result of the replace as the end result of the extraction. In that case I could use this:
Regex: (?:16|19|20)?(\d{6})\-?(\d{4})
Replace string: $1$2
But this causes other problems, because for other config items, the regex will return multiple "data fields", one for each capture group. To get this to work I would need, in that case, to provide a sequence of replace strings, e.g. for a tab separated format with organization number in the middle:
Regex: ([^\t]*)\t(?:16|19|20)?(\d{6})\-?(\d{4})\t([\d]*)
Replace string 1: $1 (free text field)
Replace string 2: $2-$3 (the organization number with dash "enforced")
Replace string 3: $4 (numeric field)
Workable, but rather awkward... So, any way to solve it within the search regex?
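For reference, the workaround with a sequence of replace strings could look roughly like this in JavaScript. Everything beyond the regex and the three replace strings is hypothetical: the function name `extractFields`, the `$n` expansion logic and the sample lines are assumptions made for illustration, not the actual extraction code.

```javascript
// Regex and replace strings as given in the question.
const regex = /([^\t]*)\t(?:16|19|20)?(\d{6})\-?(\d{4})\t([\d]*)/;
const templates = ["$1", "$2-$3", "$4"];

// Hypothetical helper: expand each "$n" in a template with capture group n.
function extractFields(line) {
  const m = regex.exec(line);
  if (!m) return null;
  return templates.map((t) =>
    t.replace(/\$(\d)/g, (_, n) => m[Number(n)] ?? "")
  );
}

// Assumed sample lines: free text, organization number, numeric field.
console.log(extractFields("Acme AB\t19556677-8899\t42"));
console.log(extractFields("Acme AB\t5566778899\t42"));
```

Both sample lines yield the organization number in the same dashed form, which is the normalization the question is after.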