I want to match all repeating spaces in a each line of the text excluding spaces around a specified character.
For example, if I want to find repeating spaces before # in this text
first second third
first second third
first second #third
first second #third
first second # third
# first second third
I am expecting to match multiple spaces in first 3 lines but not to have matches in last 3.
I have tried this: /(\S\s{2,})(?!\s*#)/g but that is also matching repeating spaces after #
How to write a regex for this case?
One possible solution with lookarounds:
(?<![#\s])\s\s+(?![\s#])
The pattern will match any 2+ spaces \s\s+, that are:
not preceeded by either space or # ((?<![#\s]))
not followed by either space or # ((?![\s#]))
Check the demo here.
You could match what you want to get out of the way, and keep in a group what you are looking for.
(^[^\S\n]+|[^\S\n]*#.*)|[^\S\n]{2,}
Explanation
( Capture group 1 (to keep unmodified)
^[^\S\n]+ Match 1+ spaces without newlines from the start of the string
| Or
[^\S\n]*#.* Match optional spaces, # and the rest of the line
) Close group 1
| Or
[^\S\n]{2,} Match 2 or more spaces without newlines
See a regex demo.
There is no language tagged, but if for example you want to replace the repeating spaces in the first 3 lines with a single space, this is a JavaScript example where you check for group 1.
If it is present, the return it to keep that part unmodified, else replace the match with your replacement string.
Note that \s can also match newlines, therefore I have added dashes between the 2 parts for clarity and used [^\S\n]+ to match spaces without newlines.
const regex = /(^[^\S\n]+|[^\S\n]*#.*)|[^\S\n]{2,}/g;
const str = `first second third
first second third
first second #third
----------------------------------------
keep the indenting second #third #fourth b.
first second #third
first second # third these spaces should not match
# first second third`;
console.log(str.replace(regex, (_, g1) => g1 || " "));
Related
I'm new to regex, and would appreciate some guidance/help.
Currently, I'm looking to write an expression, that derives a certain part of text from the 2nd line of the provided text.
Here is the text:
123 anywhere Avenue
Winnipeg, Manitoba R3E 0L7
Canada
Pharmacy Manager: person person
Pharmacy Licence Holder/Owner: 123456 Manitoba Ltd.
see correct formatting with code here
My goal is to derive the 'Manitoba' string from the second line, however I'd like to make it dynamic rather than writing an expression to always fetch Manitoba as a static. I used the below code to target the second line:
(.*)(?=(\n.*){3}$)
(It matches 3 lines up from the last line, thus targeting the desired line)
I noticed, that within the dataset, that the Province (Manitoba) is always in between two spaces.
Is there any addition I can make to the code, so that the expression only targets the second line, then matches the first string in-between spaces?
Perhaps using a lazy expression with a positive lookaround?
If I target all matches in between spaces, it would take both 'Manitoba' and 'R3E 0L7' which I dont want.
I want it to only match the first piece of text in between spaces on the second line.
Any help is much appreciated :-)
Thanks.
One option could be to match the first line, then capture the second word in the second lines in capturing group 1.
Then match the rest of the second line and assert what follows is 3 times a line.
^.*\r?\n\S+[^\S\r\n]+(\S+).*(?=(?:\r?\n.*){3}$)
In parts:
^ Start of string
.*\r?\n Match the whole lines and a newline
\S+ Match 1+ non whitespace char (the first "word")
[^\S\r\n]+ Match 1+ times a whitespace char except newlines
(\S+) Capture group 1 Match 1+ times a non whitespace char (the second "word')
.* Match the rest of the line
(?= Positive lookahead, assert what follows on the right is
(?:\r?\n.*){3}$ Match 3 times a newline followed by 0+ times any except a newline and assert the end of the string
) Close lookahead
Regex demo
You could also turn the lookahead in to a match instead
^.*\r?\n\S+[^\S\r\n]+(\S+).*(?:\r?\n.*){3}$
Regex demo
I have patterns like
FQC19515_TCELL001_20190319_165944.pdf
FQC19515_TBNK001_20190319_165944.pdf
I can match word TCELL and TBNK with this RegEX
^(\D+)-(\d+)-(\d+)([A-Z1-9]+)?.*
But if I have patterns like
FLW194640_T20NK022_20190323_131348.pdf
FLW194228_C1920_SOME_DEBRIS_REMOVED.pdf
the above regex returns
T2 and C192 instead of T20NK and C1920 respectively
Is there a general regex that matches Nzeros out side of these word boundaries?
Let's consider all 4 examples of your input:
FQC19515_TCELL001_20190319_165944.pdf
FQC19515_TBNK001_20190319_165944.pdf
FLW194640_T20NK022_20190323_131348.pdf
FLW194228_C1920_SOME_DEBRIS_REMOVED.pdf
The first group, between start of line and the first "_" (e.g. FQC19515 in row 1)
consists of:
a non-empty sequence of letters,
a non-empty sequence of digits.
So the regex matching it, including the start of line anchor and a capturing group is:
^([A-Z]+\d+)
You used \D instead of [A-Z] but I think that [A-Z] is
more specific, as it matches only letters an not e.g. "_".
The next source char is _, so the regex can also include _.
A now the more diificult part: The second group to be captured has
actually 2 variants:
a sequence of letters and a sequence of digits (after that there is
a "_"),
a sequence of letters, a sequence of digits and another sequence of
letters (after that there are digits that you want to omit).
So the most intuitive way is to define 2 alternatives, each with
a respective positive lookahead:
alternative 1: [A-Z]+\d+(?=_),
alternative 2: [A-Z]+\d+[A-Z]+(?=\d).
But there is a bit shorter way. Notice that both alternatives start
from [A-Z]+\d+.
So we can put this fragment at the first place and only the rest
include as a non-capturing group ((?:...)), with 2 alternatives.
All the above should be surrounded with a capturing group:
([A-Z]+\d+(?:(?=_)|[A-Z]+(?=\d)))
So the whole regex can be:
^([A-Z]+\d+)_([A-Z]+\d+(?:(?=_)|[A-Z]+(?=\d)))
with m option ("^" matches also the start of each line).
For a working example see https://regex101.com/r/GDdt10/1
Your regex: ^(\D+)-(\d+) is wrong as after a sequence of non-digits
(\D+) you specified a minus which doesn't occur in your source.
Also the second minus does not correspond to your input.
Edit
To match all your strings, I modified slightly the previous regex.
The changes are limited to the matching group No 2 (after _):
Alternative No 1: [A-Z]{2,}+(?=\d) - two or more letters, after them
there is a digit, to be omitted. It will match TCELL and TBNK.
Alternative No 2: [A-Z]+\d+(?:(?=_)|[A-Z]+(?=\d)) - the previous
content of this group. It will match two remaining cases.
So the whole regex is:
^([A-Z]+\d+)_([A-Z]{2,}+(?=\d)|[A-Z]+\d+(?:(?=_)|[A-Z]+(?=\d)))
For a working example see https://regex101.com/r/GDdt10/2
As far as I understand, you could use:
^[A-Z]+\d+_\K[A-Z0-9]{5}
Explanation:
^ # beginning of line
[A-Z]+ # 1 or more capitals
\d+_ # 1 or more digit and 1 underscore
\K # forget all we have seen until this position
[A-Z0-9]{5} # 5 capitals or digits
Demo
I have a text file with the following text:
andal-4.1.0.jar
besc_2.1.0-beta
prov-3.0.jar
add4lib-1.0.jar
com_lab_2.0.jar
astrix
lis-2_0_1.jar
Is there any way i can split the name and the version using regex. I want to use the results to make two columns 'Name' and 'Version' in excel.
So i want the results from regex to look like
andal 4.1.0.jar
besc 2.1.0-beta
prov 3.0.jar
add4lib 1.0.jar
com_lab 2.0.jar
astrix
lis 2_0_1.jar
So far I have used ^(?:.*-(?=\d)|\D+) to get the Version and -\d.*$ to get the Name separately. The problem with this is that when i do it for a large text file, the results from the two regex are not in the same order. So is there any way to get the results in the way I have mentioned above?
Ctrl+H
Find what: ^(.+?)[-_](\d.*)$
Replace with: $1\t$2
check Wrap around
check Regular expression
UNCHECK . matches newline
Replace all
Explanation:
^ # beginning of line
(.+?) # group 1, 1 or more any character but newline, not greedy
[-_] # a dash or underscore
(\d.*) # group 2, a digit then 0 or more any character but newline
$ # end of line
Replacement:
$1 # content of group 1
\t # a tabulation, you may replace with what you want
$2 # content of group 2
Result for given example:
andal 4.1.0.jar
besc 2.1.0-beta
prov 3.0.jar
add4lib 1.0.jar
com_lab 2.0.jar
astrix
lis 2_0_1.jar
Not quite sure what you meant for the problem in large file, and I believe the two regex you showed are doing opposite as what you said: first one should get you the name and second one should give you version.
Anyway, here is the assumption I have to guess what may make sense to you:
"Name" may follow by - or _, followed by version string.
"Version" string is something preceded by - or _, with some digit, followed by a dot or underscore, followed by some digit, and then any string.
If these assumption make sense, you may use
^(.+?)(?:[-_](\d+[._]\d+.*))?$
as your regex. Group 1 is will be the name, Group 2 will be the Version.
Demo in regex101: https://regex101.com/r/RnwMaw/3
Explanation of regex
^ start of line
(.+?) "Name" part, using reluctant match of
at least 1 character
(?: )? Optional group of "Version String", which
consists of:
[-_] - or _
( ) Followed by the "Version" , which is
\d+ at least 1 digit,
[._] then 1 dot or underscore,
\d+ then at least 1 digit,
.* then any string
$ end of line
In this example I try to validate for a city name. It works if I enter San Louis Obispo but not if I enter Boulder Creek or Boulder. I thought ? was supposed to make a block optional.
if (!/^[a-zA-Z'-]+\s[a-zA-Z'-]*\s([a-zA-Z']*)?$/.test(field)){
return "Enter City only a-z A-Z .\' allowed and not over 20 characters.\n";
}
I think spaces are the problem (\s). You made second and third words optional (by using * instead of +), but not the spaces. Question mark is only being applied to the third word because of parentheses.
The issue with your regex is that, in english, it says to match a word that's required to be followed by a space that's optionally followed by another word but then is required to have another space and then optionally another word. So, a single-word would not match - however, a word followed by two spaces would. Additionally two words that have a space at the end would also match - but neither without the trailing spaces would match.
To fix your exact regex you should add another grouping (non-matching group with (?: instead of just () around the second word to the end of the sentence) and have this group as optional with ?. Also, move the \s's inside the optional groups as well.
Try this:
^[a-zA-Z'-]+(?:\s[a-zA-Z'-]+(?:\s[a-zA-Z']+)?)?$
Regex explaind:
^ # beginning of line
[a-zA-Z'-]+ # first matching word
(?: # start of second-matching word
\s[a-zA-Z'-]+ # space followed by matching word
(?: # start of third-matching word
\s[a-zA-Z']+ # space followed by matching word
)? # third-matching word is optional
)? # second-matching word is optional
$ # end of line
Alternatively, you can try the following regex:
^([a-zA-Z'-]+(?:\s[a-zA-Z'-]+){0,2})$
This will match 1 through 3 words, or "cities", in a given line with the ability to adjust the range of words without having to further-duplicate the matching set for each new word.
Regex explained:
^( # start of line & matching group
[a-zA-Z'-]+ # required first matching word
(?: # start a non-matching group (required to "match", but not returned as an individual group)
\s # sub-group required to start with a space
[a-zA-Z'-]+ # sub-group matching word
){0,2} # sub-group can match 0 -> 2 times
)$ # end of matching group & line
So, if you want to add the ability to match more than 3 words, you can change the 2 in the {0,2} range above to be the number of words you want to match minus 1 (i.e. if you want to match 4 words, you'll set it to {0,3}).
I have written code to pull some data into a data table and do some data re-formatting. I need some help splitting some text into appropriate columns.
CASE 1
I have data formated like this that I need to split into 2 columns.
ABCDEFGS 0298 MSD
SDFKLJSDDSFWW 0298 RFD
I need the text before the numbers in column 1 and the numbers and text after the spaces in column 2. The number of spaces between the text and the numbers and will vary.
CASE 2 Data I have data like this that I need split into 3 columns.
00006011731 TAB FC 10MG 30UOU
00006011754 TAB FC 10MG 90UOU
00006027531 TAB CHEW 5MG 30UOU
00006071131 TAB CHEW 4MG 30UOU
00006027554 TAB CHEW 5MG 90UO
00006384130 GRAN PKT 4MG 30UOU
column is the first 11 characters That is easy
column 2 should contain all the text after the first 11 characters up to but not including the first number.
The last column is all the text after column 2
I would do it with these expressions:
(?-s)(\S+) +(.+)
and
(?-s)(.{11})(\D+)(.+)
And broken down in regex comment mode, those are:
(?x-s) # Flags: x enables comment mode, -s disables dotall mode.
( # start first capturing group
\S+ # any non-space character, greedily matched at least once.
) # end first capturing group
[ ]+ # a space character, greedily matched at least once. (brackets required in comment mode)
( # start second capturing group
.+ # any character (excluding newlines), greedily matched at least once.
) # end second capturing group
and
(?x-s) # Flags: x enables comment mode, -s disables dotall mode.
( # start first capturing group
.{11} # any character (excluding newlines), exactly 11 times.
) # end first capturing group
( # start second capturing group
\D+ # any non-digit character, greedily matched at least once.
) # end second capturing group
( # start third capturing group
.+ # any character (excluding newlines), greedily matched at least once.
) # end third capturing group
(The 'dotall' mode (flag s) means that . matches all characters, including newlines, so we have to disable it to prevent too much matching in the last group.)
Supposing you know how to handle the VB.NET code to get the groupings (matches) and that you are willing to strip the extra spaces from the groupings yourself
The Regex for case 1 is
(.*?\s+)(\d+.*)
.*? => grabs everything non greedily, so it will stop at the first space
\s+ => one or more whitespace characters
These two form the first group.
\d+ => one or more digits
.* => rest of the line
These two form the second group.
The Regex for case 2 is
(.{11})(.*?)(\d.*)
.{11} => matches 11 characters (you could restrict it to be just letters
and numbers with [a-zA-Z] or \d instead of .)
That's the first group.
.*? => Match everything non greedily, stop before the first
digit found (because that's the next regex)
That's the second group.
\d.* => a digit (used to stop the previous .*?) and the rest of the line
That's the third group.
I would use Peter Boughton's regexes, but ensure you have . matches newline turned off. If that is on, ensure you add a $ on the end :)
The greedy regexes will perform better.
The simplest way for the kind of data you are presenting is to split the line into fields at the spaces, then reunite what you want to have together. Regex.Split(line, "\\s+") should return an array of strings. This is also more robust against changing strings in the fields, for example if in the second case a line reads "00006011731 TAB 3FC 10MG 30UOU".