Regular Expression to parse whitespace-delimited data

I have written code to pull some data into a data table and do some data re-formatting. I need some help splitting some text into appropriate columns.
CASE 1
I have data formatted like this that I need to split into 2 columns.
ABCDEFGS 0298 MSD
SDFKLJSDDSFWW 0298 RFD
I need the text before the numbers in column 1, and the numbers plus the text after the spaces in column 2. The number of spaces between the text and the numbers will vary.
CASE 2
I have data like this that I need to split into 3 columns.
00006011731 TAB FC 10MG 30UOU
00006011754 TAB FC 10MG 90UOU
00006027531 TAB CHEW 5MG 30UOU
00006071131 TAB CHEW 4MG 30UOU
00006027554 TAB CHEW 5MG 90UO
00006384130 GRAN PKT 4MG 30UOU
Column 1 is the first 11 characters. That is easy.
Column 2 should contain all the text after the first 11 characters up to but not including the first number.
The last column is all the text after column 2.

I would do it with these expressions:
(?-s)(\S+) +(.+)
and
(?-s)(.{11})(\D+)(.+)
And broken down in regex comment mode, those are:
(?x-s) # Flags: x enables comment mode, -s disables dotall mode.
( # start first capturing group
\S+ # any non-space character, greedily matched at least once.
) # end first capturing group
[ ]+ # a space character, greedily matched at least once. (brackets required in comment mode)
( # start second capturing group
.+ # any character (excluding newlines), greedily matched at least once.
) # end second capturing group
and
(?x-s) # Flags: x enables comment mode, -s disables dotall mode.
( # start first capturing group
.{11} # any character (excluding newlines), exactly 11 times.
) # end first capturing group
( # start second capturing group
\D+ # any non-digit character, greedily matched at least once.
) # end second capturing group
( # start third capturing group
.+ # any character (excluding newlines), greedily matched at least once.
) # end third capturing group
(The 'dotall' mode (flag s) means that . matches all characters, including newlines, so we have to disable it to prevent too much matching in the last group.)
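For anyone who wants to try the two expressions outside of .NET, here is a minimal Python sketch. The function names are made up for illustration, and the (?-s) prefix is left out because dotall is already off by default in Python's re module. The groups are stripped afterwards, since the second pattern keeps the surrounding spaces in its middle group.

```python
import re

# Same patterns as above, minus the (?-s) prefix (dotall is off by default here).
CASE1 = re.compile(r"(\S+) +(.+)")
CASE2 = re.compile(r"(.{11})(\D+)(.+)")


def split_case1(line):
    """Return [text, rest] for a case 1 line, or None if it does not match."""
    m = CASE1.match(line)
    return [g.strip() for g in m.groups()] if m else None


def split_case2(line):
    """Return [code, description, rest] for a case 2 line, or None."""
    m = CASE2.match(line)
    return [g.strip() for g in m.groups()] if m else None


print(split_case1("ABCDEFGS    0298 MSD"))
print(split_case2("00006011731 TAB FC 10MG 30UOU"))
```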

This assumes you know how to handle the VB.NET code to get the groupings (matches) and that you are willing to strip the extra spaces from the groupings yourself.
The Regex for case 1 is
(.*?\s+)(\d+.*)
.*? => grabs everything non-greedily, so it stops at the first run of whitespace that is followed by a digit
\s+ => one or more whitespace characters
These two form the first group.
\d+ => one or more digits
.* => rest of the line
These two form the second group.
The Regex for case 2 is
(.{11})(.*?)(\d.*)
.{11} => matches 11 characters (you could restrict it to just letters
and digits with [a-zA-Z\d] instead of .)
That's the first group.
.*? => Match everything non greedily, stop before the first
digit found (because that's the next regex)
That's the second group.
\d.* => a digit (used to stop the previous .*?) and the rest of the line
That's the third group.
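To see why the groups still need stripping, here is a small Python sketch of the same two lazy patterns (Python stands in for VB.NET here, and the helper name is hypothetical). Group 1 of the first pattern carries the trailing whitespace, and group 2 of the second pattern carries the spaces on both sides.

```python
import re

CASE1_LAZY = re.compile(r"(.*?\s+)(\d+.*)")
CASE2_LAZY = re.compile(r"(.{11})(.*?)(\d.*)")


def raw_and_stripped(pattern, line):
    """Return (raw groups, stripped groups) so the difference is visible."""
    m = pattern.match(line)
    if m is None:
        return None
    raw = list(m.groups())
    return raw, [g.strip() for g in raw]


print(raw_and_stripped(CASE1_LAZY, "SDFKLJSDDSFWW 0298 RFD"))
print(raw_and_stripped(CASE2_LAZY, "00006027531 TAB CHEW 5MG 30UOU"))
```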

I would use Peter Boughton's regexes, but ensure you have . matches newline turned off. If that is on, ensure you add a $ on the end :)
The greedy regexes will perform better.

The simplest way for the kind of data you are presenting is to split the line into fields at the spaces, then reunite what you want to have together. Regex.Split(line, "\s+") should return an array of strings (VB.NET string literals do not use backslash escapes; in C# you would write "\\s+" or @"\s+"). This is also more robust against changing strings in the fields, for example if in the second case a line reads "00006011731 TAB 3FC 10MG 30UOU".
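A rough sketch of the split-and-rejoin idea, written in Python rather than VB.NET. The function names are invented, and for case 2 it assumes that column 2 is always exactly the two fields after the 11-character code, which holds for the sample rows but is only an assumption about the real data.

```python
import re


def split_fields(line):
    """Split a line on runs of whitespace."""
    return re.split(r"\s+", line.strip())


def case1_columns(line):
    # Column 1 is the first field; everything else goes back together.
    fields = split_fields(line)
    return [fields[0], " ".join(fields[1:])]


def case2_columns(line):
    # Assumption: column 2 is always the two fields after the code.
    fields = split_fields(line)
    return [fields[0], " ".join(fields[1:3]), " ".join(fields[3:])]


print(case1_columns("ABCDEFGS    0298 MSD"))
print(case2_columns("00006011731 TAB 3FC 10MG 30UOU"))
```

Because the columns are chosen by field position, the "3FC" line ends up in the right place even though it contains a digit before the dose.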

Related

Regular expressions in notepad++ (Search and Replace)

I have a list of thousands of records within a .txt document.
some of them look like these records
201910031044 "00059" "11.31AG" "Senior Champion"
201910031044 "00060" "GBA146" "Junior Champion"
201910031044 "00999" "10.12G" "ProAM"
201910031044 "00362" "113.1LI" "Abcd"
Whenever a record similar to this occurs, I'd like to get rid of the last words/numbers/etc. in the last quotation marks (like "Senior Champion", "Junior Champion", etc. There are many possibilities here).
e.g. (before)
201910031044 "00059" "11.31AG" "Senior Champion"
after
201910031044 "00059" "11.31AG"
I tried the following regex but it wouldn't work.
Search: ^([0-9]{17,17} + "[0-9]{8,8}" + "[a-zA-Z0-9]").*$
Replace: \1 (replace string)
OK, I forgot the . (dot) character; however, even when a record has no . (dot) in it, the regex does not work. I am not sure whether it has anything to do with using the + sign more than once.
I'd like to get rid of the last words/numbers/etc in the last quotation marks
This does the job:
Ctrl+H
Find what: ^.+\K\h+".*?"$
Replace with: LEAVE EMPTY
CHECK Wrap around
CHECK Regular expression
UNCHECK . matches newline
Replace all
Explanation:
^ # beginning of line
.+ # 1 or more any character but newline
\K # forget all we have seen until this position
\h+ # 1 or more horizontal spaces
".*?" # something inside quotes
$ # end of line
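If you end up doing this in a script instead of Notepad++, note that Python's standard re module supports neither \K nor \h. A minimal equivalent sketch captures the part to keep in a group and writes it back; the names below are made up for illustration.

```python
import re

# Group 1 plays the role of "everything before \K"; [ \t]+ replaces \h+.
TRAILING_QUOTED = re.compile(r'^(.+)[ \t]+".*?"$', re.MULTILINE)


def drop_last_quoted(text):
    """Remove the last quoted field (and the spaces before it) on each line."""
    return TRAILING_QUOTED.sub(r"\1", text)


records = (
    '201910031044 "00059" "11.31AG" "Senior Champion"\n'
    '201910031044 "00362" "113.1LI" "Abcd"'
)
print(drop_last_quoted(records))
```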
The RegEx looks for the 4th double quote:
^(?:[^"]*\"){4}([^|]*)
You can see this demo: https://regex101.com/r/wJ9yS6/163
You will still need to parse the lines, so it is probably easier to open the file in Excel or to parse it as CSV in code.
You have a problem with the count of your characters:
you specify that the line should start with exactly 17 digits ([0-9]{17,17}). However, there are only 12 digits in the data 201910031044.
you can specify exactly 12 digits by using {12} or if it could be 12-17, then {12,17}. I'll assume exactly 12 based on the current data.
similarly, for the second column you specify that it's exactly 8 digits surrounded by quotes ("[0-9]{8,8}") but it only has 5 digits surrounded by quotes.
again, you can specify exactly 5 with {5} or 5-8 with {5,8}. I will assume exactly 5.
finally, there is no quantifier for the final field, so the regex tries to match exactly one character that is a letter or a number surrounded by quotes "[a-zA-Z0-9]".
I'm not sure if there is any limit on the number of characters, so I would go with one or more using + as quantifier "[a-zA-Z0-9]+" - if you can have zero or more, then you can use *, or if it's any other count from m to n, then you can use {m,n} as before.
This is not a character count problem, but the final column can also contain dots, which the regex doesn't account for. You can just add . inside the square brackets, where it matches only a literal dot. It's usually used as a wildcard but it loses its special meaning inside a character class ([]), so you get "[a-zA-Z0-9.]+"
Putting it all together, you get
Search: ^([0-9]{12} + "[0-9]{5}" + "[a-zA-Z0-9.]+").*$
Replace: \1
Which will get rid of anything after the third field in Notepad++.
This can be shortened a bit by using \d instead of [0-9] for digits and \s+ for whitespace instead of " +" (a space followed by +). As a benefit, \s will also match other whitespace like tabs, so you don't have to manually account for those. This leads to
Search: ^(\d{12}\s+"\d{5}"\s+"[a-zA-Z0-9.]+").*$
Replace: \1
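The counting problem is easy to confirm in a quick script. This is just a sketch in Python (not Notepad++), with variable names chosen for illustration: the original pattern finds nothing on a sample record, while the corrected one keeps the first three fields.

```python
import re

ORIGINAL = re.compile(r'^([0-9]{17,17} + "[0-9]{8,8}" + "[a-zA-Z0-9]").*$')
FIXED = re.compile(r'^(\d{12}\s+"\d{5}"\s+"[a-zA-Z0-9.]+").*$')

record = '201910031044 "00059" "11.31AG" "Senior Champion"'

# The original asks for 17 and 8 digits, so it cannot match this record.
print(ORIGINAL.match(record))

# The corrected pattern matches; replacing with group 1 drops the last field.
trimmed = FIXED.sub(r"\1", record)
print(trimmed)
```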
If you want to get rid of the last words/numbers/etc in the last quotation marks you could capture in a group what is before that and match the last quotation marks and everything between it to remove it using a negated character class.
If what is between the values can be spaces or tabs, you could use [ \t]+ to match those (using \s could also match a newline)
Note that {17,17} and {8,8} may also be written as {17} and {8} which in this case should be {12} and {5}
^([0-9]{12}[ \t]+"[0-9]{5}"[ \t]+"[a-zA-Z0-9.]+")[ \t]{2,}"[^"\r\n]+"
In parts
^ Start of string
( Capture group 1
[0-9]{12}[ \t]+ Match 12 digits and 1+ spaces or tabs
"[0-9]{5}"[ \t]+ Match 5 digits between " and 1+ spaces or tabs
"[a-zA-Z0-9.]+" Match 1+ times any of the listed between "
) Close group
[ \t]{2,} Match 2 or more spaces or tabs
"[^"\r\n]+" Match the last quoted field: 1+ characters other than " or a newline, between "
In the replacement, use group 1: $1

Is there a regex for adding the first 4 characters to end of string and the last 4 characters to start of string?

I have some lines which I need to alter. They are protein sequences. How would I copy the first 4 characters of the line to the end of the line, and also copy the last 4 characters to the beginning of the line?
The strings are variable which complicates it, for example:
>X
LTGLGIGTGMAATIINAISVGLSAATILSLISGVASGGAWVLAGAKQALKEGGKKAGIAF
>Y
LVATGMAAGVAKTIVNAVSAGMDIATALSLFSGAFTAAGGIMALIKKYAQKKLWKQLIAA
Moreover, how could I exclude lines with a '>' at the beginning (these are names of the corresponding sequence)?
Does anyone know a regex which will allow this to work?
I've already tried some regex solutions but I'm not very experienced with this sort of thing and I can find the end string but can't get it to replace:
Find:
(...)$
Replace:
^$2$1"
An example of what I want to achieve is:
>1
ABCDEFGHIJKLMNOPQRSTUVWXYZ
becomes:
>1
WXYZABCDEFGHIJKLMNOPQRSTUVWXYZABCD
Thanks
Try doing a find, in regex mode, on the following pattern:
^([A-Z]{4}).*([A-Z]{4})$
Then replace with the last four characters, followed by the whole match ($0), followed by the first four characters:
$2$0$1
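If you would rather script this than use an editor, here is a minimal Python sketch of the same find and replace. The function name is hypothetical; \g<0> is Python's spelling of the whole match. Lines starting with '>' are left alone simply because '>' is not in [A-Z].

```python
import re

FLANK = re.compile(r"^([A-Z]{4}).*([A-Z]{4})$", re.MULTILINE)


def add_flanks(text):
    """Prepend the last 4 letters and append the first 4 letters of each sequence line."""
    return FLANK.sub(r"\2\g<0>\1", text)


fasta = ">1\nABCDEFGHIJKLMNOPQRSTUVWXYZ"
print(add_flanks(fasta))
```

Note that a sequence line shorter than 8 letters does not match this pattern and is left unchanged.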
You can use the regex below.
^(([A-Z]{4})([A-Z]*)([A-Z]{4}))$
^ asserts the position at the start of the line, so nothing can come before it.
( is the start of a capture group, this is group 1.
( is the start of a capture group, this is group 2. This group is inside group 1.
[A-Z]{4} means exactly 4 capital characters from A to Z.
) is the end of capture group 2.
( is the start of a capture group, this is group 3.
[A-Z]* matches capital characters from A to Z between zero and infinite times.
) is the end of capture group 3.
( is the start of a capture group, this is group 4.
[A-Z]{4} means exactly 4 capital characters from A to Z.
) is the end of capture group 4.
$ asserts the position at the end of the line, so nothing can come after it.
See how it works with a replace here: https://regex101.com/r/W786uL/3. The replacement string is:
$4$1$2
$4 means put capture group 4 here. Which is the last 4 characters.
$1 means put capture group 1 here. Which is everything in the entire string.
$2 means put capture group 2 here. Which is the first 4 characters.
You can use
^(.{4})(.*?)(.{4})$
^ - start of string
(.{4}) - Match any four characters except newline (used for both the first and the last group)
(.*?) - Match any character zero or more times (lazy mode)
$ - End of string
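One thing to watch with the dot-based pattern: since . also matches '>', a header line of 8 or more characters would be rewritten too. A possible workaround, sketched here in Python, is a negative lookahead in front of the pattern. The lookahead and the replacement string are my additions for illustration and are not part of the pattern above.

```python
import re

# (?!>) skips header lines; groups 1-3 are first four, middle, last four.
FLANK_ANY = re.compile(r"^(?!>)(.{4})(.*?)(.{4})$", re.MULTILINE)


def add_flanks_any(text):
    """Rebuild each sequence line as last4 + first4 + middle + last4 + first4."""
    return FLANK_ANY.sub(r"\3\1\2\3\1", text)


print(add_flanks_any(">long_header_name\nABCDEFGHIJ"))
```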

Regex for text file

I have a text file with the following text:
andal-4.1.0.jar
besc_2.1.0-beta
prov-3.0.jar
add4lib-1.0.jar
com_lab_2.0.jar
astrix
lis-2_0_1.jar
Is there any way I can split the name and the version using regex? I want to use the results to make two columns, 'Name' and 'Version', in Excel.
So I want the results from the regex to look like
andal 4.1.0.jar
besc 2.1.0-beta
prov 3.0.jar
add4lib 1.0.jar
com_lab 2.0.jar
astrix
lis 2_0_1.jar
So far I have used ^(?:.*-(?=\d)|\D+) to get the Version and -\d.*$ to get the Name separately. The problem with this is that when I do it for a large text file, the results from the two regexes are not in the same order. So is there any way to get the results in the way I have mentioned above?
Ctrl+H
Find what: ^(.+?)[-_](\d.*)$
Replace with: $1\t$2
check Wrap around
check Regular expression
UNCHECK . matches newline
Replace all
Explanation:
^ # beginning of line
(.+?) # group 1, 1 or more any character but newline, not greedy
[-_] # a dash or underscore
(\d.*) # group 2, a digit then 0 or more any character but newline
$ # end of line
Replacement:
$1 # content of group 1
\t # a tabulation, you may replace with what you want
$2 # content of group 2
Result for given example:
andal 4.1.0.jar
besc 2.1.0-beta
prov 3.0.jar
add4lib 1.0.jar
com_lab 2.0.jar
astrix
lis 2_0_1.jar
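The same replacement can be scripted, which may be handy for a large file. A minimal Python sketch follows (the function name is made up); the output is tab-separated, so it can be pasted into Excel as two columns.

```python
import re

NAME_VERSION = re.compile(r"^(.+?)[-_](\d.*)$", re.MULTILINE)


def to_tab_separated(text):
    """Replace the - or _ before the version with a tab on every matching line."""
    return NAME_VERSION.sub(r"\1\t\2", text)


sample = "andal-4.1.0.jar\ncom_lab_2.0.jar\nastrix\nlis-2_0_1.jar"
print(to_tab_separated(sample))
```

Lines without a version, such as astrix, do not match and stay as they are, so the line order is preserved.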
I'm not quite sure what you meant by the problem with a large file, and I believe the two regexes you showed do the opposite of what you said: the first one should get you the name and the second one should give you the version.
Anyway, here are the assumptions I had to make about what may make sense to you:
"Name" may be followed by - or _, followed by the version string.
"Version" string is something preceded by - or _, with some digits, followed by a dot or underscore, followed by some digits, and then any string.
If these assumptions make sense, you may use
^(.+?)(?:[-_](\d+[._]\d+.*))?$
as your regex. Group 1 will be the name, Group 2 will be the version.
Demo in regex101: https://regex101.com/r/RnwMaw/3
Explanation of regex
^        start of line
(.+?)    "Name" part, using a reluctant match of at least 1 character
(?: )?   optional group for the "Version String", which consists of:
[-_]       - or _
( )        followed by the "Version", which is
\d+          at least 1 digit,
[._]         then 1 dot or underscore,
\d+          then at least 1 digit,
.*           then any string
$        end of line
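Because every line matches this regex (the version part is optional), name and version always come out as a pair and stay in the original order. A small Python sketch, with a hypothetical helper name, that returns an empty version when there is none:

```python
import re

NAME_OPT_VERSION = re.compile(r"^(.+?)(?:[-_](\d+[._]\d+.*))?$")


def name_and_version(line):
    """Return (name, version); version is '' when the line has none."""
    m = NAME_OPT_VERSION.match(line.strip())
    if m is None:  # only happens for an empty line
        return None
    return m.group(1), m.group(2) or ""


for entry in ["andal-4.1.0.jar", "besc_2.1.0-beta", "astrix", "lis-2_0_1.jar"]:
    print(name_and_version(entry))
```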

Regex with optional, lazy, greedy group

Let's take this source string from a word document:
A;SDLFJA;SDJFA;KSDJF;ALKSJDF SOURCE: 3 55 ASDKLFJA;KDSJF
sa;ldkjfa SOURCE: HYPERLINK "ASDLFA;SDFA;SKD" "MATCH9" 3 HYPERLINK
"ASDLFA;SDFA;SKD" "MATCH10" 55 a;sdkfja;ksdfj;aklsdjf;lk
I'm looking for a pattern that is composed of the literal text "SOURCE: " followed by a 1 digit number a space and a 2 digit number.
For example, in the first line of the source string, I want to find "SOURCE: 3 55".
Now, some clever boffin has decided to embed a hyperlink for the 1 digit number and another hyperlink for the 2 digit number. Lines 2 and 3 show the two embedded hyperlinks. MATCH1 refers to the first embedded hyperlink, MATCH2 is the second, and so on. I have no way of knowing how many hyperlinks will be placed before these, so one can't assume MATCH9 and MATCH10.
The text I want to extract is the "3 55" portion. I want to put it into a named group I'll call "KeepMe".
I don't mind using two different patterns, one for the hyperlink and one without.
Here's a pattern that works for the non-hyperlinked text:
SOURCE:\s+(?<KeepMe>\d*\s+\d*)
I get "3 55" in the KeepMe group just like I want.
I haven't been able to keep the hyperlink match pattern from being greedy.
Here's a failed regex pattern, (one of many):
SOURCE:\s+(?<Hyperlink>HYPERLINK.*MATCH\d*\u0022\s+)??(?<KeepMe1>\d*)\s+
(?<Hyperlink>HYPERLINK.*MATCH\d*\u0022\s+)??(?<KeepMe2>\d*)
In the above pattern, I'm trying to say:
Look for the literal SOURCE: followed by one or more spaces.
Then, optionally look for the literal text HYPERLINK followed by some characters, followed by the literal text MATCH, followed by some digits and a double quote character, all matched in a lazy, non-greedy manner, followed by one or more spaces, followed by some digits I want to keep. Then, do another HYPERLINK pattern match like the one we just did and keep the digits after that, too.
Remember, in both cases, I want to extract "3 55". It can be extracted in one or two pieces though one would be best.
Any ideas???
This should do the trick:
\bSOURCE:\s+(?:HYPERLINK\s+"[^"]*"\s+"MATCH\d+"\s+)?(?<KeepMe1>\d+)\s+(?:HYPERLINK\s+"[^"]*"\s+"MATCH\d+"\s+)?(?<KeepMe2>\d+)\b
Main difference is that I replaced the .* between HYPERLINK and MATCH with something less greedy.
Fiddle: https://regex101.com/r/yE3fP4/1
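For a quick check outside of regex101, here is the same pattern in a Python sketch. Python spells named groups (?P<name>...) rather than (?<name>...), and the helper name is made up. Both the plain and the hyperlinked sample text should give "3 55".

```python
import re

SOURCE = re.compile(
    r'\bSOURCE:\s+'
    r'(?:HYPERLINK\s+"[^"]*"\s+"MATCH\d+"\s+)?(?P<KeepMe1>\d+)\s+'
    r'(?:HYPERLINK\s+"[^"]*"\s+"MATCH\d+"\s+)?(?P<KeepMe2>\d+)\b'
)


def keep_me(text):
    """Return the two numbers after SOURCE: joined by a space, or None."""
    m = SOURCE.search(text)
    if m is None:
        return None
    return m.group("KeepMe1") + " " + m.group("KeepMe2")


plain = "A;SDLFJA;SDJFA;KSDJF;ALKSJDF SOURCE: 3 55 ASDKLFJA;KDSJF"
linked = (
    'sa;ldkjfa SOURCE: HYPERLINK "ASDLFA;SDFA;SKD" "MATCH9" 3 HYPERLINK\n'
    '"ASDLFA;SDFA;SKD" "MATCH10" 55 a;sdkfja;ksdfj;aklsdjf;lk'
)
print(keep_me(plain))
print(keep_me(linked))
```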
A Regex that works for just the hyperlinked case is:
/(?<SourceToken>SOURCE:) # Start with a source tag
\s+ # Followed by whitespace
(?<HyperlinkMatchGroup> # Save the hyperlink & match combo.
(?<Hyperlink> # Save the hyperlink (to be discarded)
(?<HyperlinkToken>HYPERLINK\s+) # Hyperlinks start with the literal tag "HYPERLINK"
(?<HyperlinkText>".*?") # Hyperlink text contained in quotes, non-greedy
\s*) # Followed by whitespace
* # Repeating any number of times
(?<MatchToken>"MATCH\d*") # Followed by a literal tag "MATCH" and a digit string
\s* # Followed by whitespace
(?<KeepMe>\d+) # Finally, the match, which is just a series of digits
\s* # Followed by whitespace
)+ # The whole hyperlink & match pair must occur at least once
/x
It may or may not cover all your cases; I haven't spent much time digging into it.

I need a regex to validate a name that can be 1, 2, or 3 words

In this example I try to validate a city name. It works if I enter San Luis Obispo but not if I enter Boulder Creek or Boulder. I thought ? was supposed to make a block optional.
if (!/^[a-zA-Z'-]+\s[a-zA-Z'-]*\s([a-zA-Z']*)?$/.test(field)){
return "Enter City only a-z A-Z .\' allowed and not over 20 characters.\n";
}
I think spaces are the problem (\s). You made second and third words optional (by using * instead of +), but not the spaces. Question mark is only being applied to the third word because of parentheses.
The issue with your regex is that, in English, it says to match a word that must be followed by a space, optionally followed by another word, then another required space, and then optionally another word. So a single word would not match; however, a word followed by two spaces would. Two words with a space at the end would also match, but neither case matches without the trailing spaces.
To fix your exact regex, wrap everything from the second word to the end in another group (a non-capturing group, written (?: instead of just ( ) and make that group optional with ?. Also, move each \s inside the optional groups.
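You can check that reading of the pattern with a few test strings. The question uses JavaScript; this sketch uses Python only as a convenient test bench, and the pattern behaves the same way for these inputs.

```python
import re

# The pattern from the question, unchanged.
ORIGINAL_CITY = re.compile(r"^[a-zA-Z'-]+\s[a-zA-Z'-]*\s([a-zA-Z']*)?$")


def matches_original(name):
    return ORIGINAL_CITY.match(name) is not None


for candidate in ["San Luis Obispo", "Boulder Creek", "Boulder",
                  "Boulder  ", "Boulder Creek "]:
    print(repr(candidate), matches_original(candidate))
```

The two spaces are always required, so the last two inputs (with trailing spaces) match while "Boulder Creek" and "Boulder" do not.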
Try this:
^[a-zA-Z'-]+(?:\s[a-zA-Z'-]+(?:\s[a-zA-Z']+)?)?$
Regex explained:
^ # beginning of line
[a-zA-Z'-]+ # first matching word
(?: # start of second-matching word
\s[a-zA-Z'-]+ # space followed by matching word
(?: # start of third-matching word
\s[a-zA-Z']+ # space followed by matching word
)? # third-matching word is optional
)? # second-matching word is optional
$ # end of line
Alternatively, you can try the following regex:
^([a-zA-Z'-]+(?:\s[a-zA-Z'-]+){0,2})$
This will match 1 through 3 words, or "cities", in a given line with the ability to adjust the range of words without having to further-duplicate the matching set for each new word.
Regex explained:
^( # start of line & matching group
[a-zA-Z'-]+ # required first matching word
(?: # start a non-capturing group (required to "match", but not returned as an individual group)
\s # sub-group required to start with a space
[a-zA-Z'-]+ # sub-group matching word
){0,2} # sub-group can match 0 -> 2 times
)$ # end of matching group & line
So, if you want to add the ability to match more than 3 words, you can change the 2 in the {0,2} range above to be the number of words you want to match minus 1 (i.e. if you want to match 4 words, you'll set it to {0,3}).
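Here is a small sketch of that adjustable version, written in Python for testing (the function name and the max_words parameter are invented for illustration). The upper bound of the range is simply the word limit minus 1.

```python
import re


def city_pattern(max_words=3):
    """Build the 1..max_words pattern; the repeat count is max_words - 1."""
    word = r"[a-zA-Z'-]+"
    return re.compile(r"^%s(?:\s%s){0,%d}$" % (word, word, max_words - 1))


three_words = city_pattern(3)
for candidate in ["Boulder", "Boulder Creek", "San Luis Obispo",
                  "One Two Three Four", "Boulder "]:
    print(repr(candidate), three_words.match(candidate) is not None)
```

With city_pattern(4), "One Two Three Four" is accepted as well.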