Regex ignore multiple wrong placed quotes - regex

From this input:
""" "01-01-2000""" " ",""" "Bank123""" "", "" ""Example text" " "",
I want to extract:
01-01-2000
Bank123
Example text
I managed this:
(["'])(?:(?=(\\?))\2.)*?\1
But it fails when it has to deal with many wrongly placed quotes. Any ideas?

As I see, you are interested in strings which:
start with either a digit or a letter,
followed by a (maybe empty) sequence of chars other than ".
So the intuitive solution is [a-z\d][^"]* with gi options
(global, case insensitive).
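As a quick check, here is a minimal Python sketch of that pattern. Python is an assumption here, since the question does not name a language; `re.findall` stands in for the global option and `re.IGNORECASE` for the case-insensitive one.

```python
import re

# Input string from the question
text = '""" "01-01-2000""" " ",""" "Bank123""" "", "" ""Example text" " "",'

# A letter or digit, then any run of characters other than a double quote
pattern = re.compile(r'[a-z\d][^"]*', re.IGNORECASE)

values = pattern.findall(text)
print(values)
```

Note that this relies on every wanted value starting with a letter or digit; a value starting with another character would be skipped or truncated.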

For your given example, one option is to match a whitespace or a double quote zero or more times with [ "]*, which consumes what comes before the value between the inner double quotes.
Then match that double quote and capture in a group one or more characters that are not a double quote or a newline, ([^"\r\n]+), using a negated character class.
At the end, match the closing double quote followed by zero or more whitespaces or double quotes. This consumes what comes after the value, so the group never matches a lone whitespace between double quotes.
[ "]*"([^"\r\n]+)"[ "]*
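A minimal Python sketch of this pattern, assuming Python's `re` module (the question does not name a language). Because the pattern has exactly one capturing group, `re.findall` returns only the group contents.

```python
import re

# Input string from the question
text = '""" "01-01-2000""" " ",""" "Bank123""" "", "" ""Example text" " "",'

# Surrounding quotes and spaces are consumed outside the group;
# group 1 holds only the value between the inner quotes
pattern = re.compile(r'[ "]*"([^"\r\n]+)"[ "]*')

values = pattern.findall(text)
print(values)
```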

There are various options to do so:
1) ([\d-\w\s][\d-\w\s]+)
2) ([\d-\w\s]{2,})
3) "\b(.+?)\b"
4) \b([^"]{2,})\b
Demo : https://regex101.com/r/jPXqKv/1
Test:
""" "01-01-2000""" " ",""" "Bank123""" "", "" ""Example text" " ""
Match:
Match 1
Full match 5-15 `01-01-2000`
Group 1. 5-15 `01-01-2000`
Match 2
Full match 28-35 `Bank123`
Group 1. 28-35 `Bank123`
Match 3
Full match 48-60 `Example text`
Group 1. 48-60 `Example text`

Related

Regular expressions in notepad++ (Search and Replace)

I have a list of thousands of records within a .txt document.
Some of them look like these records:
201910031044 "00059" "11.31AG" "Senior Champion"
201910031044 "00060" "GBA146" "Junior Champion"
201910031044 "00999" "10.12G" "ProAM"
201910031044 "00362" "113.1LI" "Abcd"
Whenever a record similar to this occurs, I'd like to get rid of the last words/numbers/etc. in the last quotation marks (like "Senior Champion", "Junior Champion", etc.; there are many possibilities here).
e.g. (before)
201910031044 "00059" "11.31AG" "Senior Champion"
after
201910031044 "00059" "11.31AG"
I tried the following regex but it wouldn't work.
Search: ^([0-9]{17,17} + "[0-9]{8,8}" + "[a-zA-Z0-9]").*$
Replace: \1 (replace string)
OK, I forgot the . (dot) sign; however, even when there is no . (dot) sign it does not work. I am not sure whether it has anything to do with using the + sign more than once.
I'd like to get rid of the last words/numbers/etc in the last quotation marks
This does the job:
Ctrl+H
Find what: ^.+\K\h+".*?"$
Replace with: LEAVE EMPTY
CHECK Wrap around
CHECK Regular expression
UNCHECK . matches newline
Replace all
Explanation:
^ # beginning of line
.+ # 1 or more any character but newline
\K # forget all we have seen until this position
\h+ # 1 or more horizontal spaces
".*?" # something inside quotes
$ # end of line
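Outside Notepad++, the same edit can be sketched in Python. This is a translation under assumptions, not the pattern above verbatim: Python's built-in `re` module supports neither \K nor \h, so the sketch captures the part to keep in group 1 and uses [ \t] for horizontal whitespace.

```python
import re

# Records from the question
records = '''201910031044 "00059" "11.31AG" "Senior Champion"
201910031044 "00060" "GBA146" "Junior Champion"
201910031044 "00999" "10.12G" "ProAM"
201910031044 "00362" "113.1LI" "Abcd"'''

# Group 1 plays the role of everything before \K; the trailing
# whitespace plus the last quoted field is matched and dropped
pattern = re.compile(r'^(.+)[ \t]+".*?"$', re.MULTILINE)

cleaned = pattern.sub(r'\1', records)
print(cleaned)
```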
The RegEx looks for the 4th double quote:
^(?:[^"]*\"){4}([^|]*)
You can see this demo: https://regex101.com/r/wJ9yS6/163
You will still need to parse the lines, so it is probably easier to open the file in Excel or to parse it as CSV in code.
You have a problem with the count of your characters:
you specify that the line should start with exactly 17 digits ([0-9]{17,17}). However, there are only 12 digits in the data 201910031044.
you can specify exactly 12 digits by using {12} or if it could be 12-17, then {12,17}. I'll assume exactly 12 based on the current data.
similarly, for the second column you specify that it's exactly 8 digits surrounded by quotes ("[0-9]{8,8}") but it only has 5 digits surrounded by quotes.
again, you can specify exactly 5 with {5} or 5-8 with {5,8}. I will assume exactly 5.
finally, there is no quantifier for the final field, so the regex tries to match exactly one character that is a letter or a number surrounded by quotes "[a-zA-Z0-9]".
I'm not sure if there is any limit on the number of characters, so I would go with one or more using + as quantifier "[a-zA-Z0-9]+" - if you can have zero or more, then you can use *, or if it's any other count from m to n, then you can use {m,n} as before.
Not a character-count problem, but the final column can also contain dots, which the regex doesn't account for. You can just add . inside the square brackets, where it matches only a literal dot: it is usually a wildcard, but it loses its special meaning inside a character class ([]), so you get "[a-zA-Z0-9.]+"
Putting it all together, you get
Search: ^([0-9]{12} + "[0-9]{5}" + "[a-zA-Z0-9.]+").*$
Replace: \1
Which will get rid of anything after the third field in Notepad++.
This can be shortened a bit by using \d instead of [0-9] for digits and \s+ for whitespace instead of a literal space followed by +. As a benefit, \s will also match other whitespace like tabs, so you don't have to manually account for those. This leads to
Search: ^(\d{12}\s+"\d{5}"\s+"[a-zA-Z0-9.]+").*$
Replace: \1
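To try the shortened pattern outside Notepad++, here is a minimal Python sketch (Python and `re.MULTILINE` are assumptions; Python writes the group reference as \1 as well). The last input line is hypothetical, added only to show that a line not matching the strict field layout is left untouched.

```python
import re

records = '''201910031044 "00059" "11.31AG" "Senior Champion"
201910031044 "00999" "10.12G" "ProAM"
header line "keep me"'''  # last line is hypothetical, not from the question

# Group 1: 12 digits, a quoted 5-digit field, a quoted code; the rest is dropped
strict = re.compile(r'^(\d{12}\s+"\d{5}"\s+"[a-zA-Z0-9.]+").*$', re.MULTILINE)

cleaned = strict.sub(r'\1', records)
print(cleaned)
```

Spelling out each field makes the replacement conservative: only lines that look like records are shortened.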
If you want to get rid of the last words/numbers/etc in the last quotation marks, you could capture in a group everything before them, then match the last pair of quotation marks and what is between them using a negated character class, so that it is removed.
If what is between the values can be spaces or tabs, you could use [ \t]+ to match those (using \s could also match a newline).
Note that {17,17} and {8,8} may also be written as {17} and {8}, which in this case should be {12} and {5}.
^([0-9]{12}[ \t]+"[0-9]{5}"[ \t]+"[a-zA-Z0-9.]+")[ \t]{2,}"[^"\r\n]+"
In parts
^ Start of string
( Capture group 1
[0-9]{12}[ \t]+ Match 12 digits and 1+ spaces or tabs
"[0-9]{5}"[ \t]+ Match 5 digits between " and 1+ spaces or tabs
"[a-zA-Z0-9.]+" Match 1+ times any of the listed between "
) Close group
[ \t]{2,} Match 2 or more spaces or tabs
"[^"\r\n]+" Match ", then 1+ chars other than " or a newline, then "
In the replacement use group 1 $1

Regex for text file

I have a text file with the following text:
andal-4.1.0.jar
besc_2.1.0-beta
prov-3.0.jar
add4lib-1.0.jar
com_lab_2.0.jar
astrix
lis-2_0_1.jar
Is there any way I can split the name and the version using regex? I want to use the results to make two columns, 'Name' and 'Version', in Excel.
So i want the results from regex to look like
andal 4.1.0.jar
besc 2.1.0-beta
prov 3.0.jar
add4lib 1.0.jar
com_lab 2.0.jar
astrix
lis 2_0_1.jar
So far I have used ^(?:.*-(?=\d)|\D+) to get the Version and -\d.*$ to get the Name separately. The problem with this is that when I do it for a large text file, the results from the two regexes are not in the same order. So is there any way to get the results in the way I have mentioned above?
Ctrl+H
Find what: ^(.+?)[-_](\d.*)$
Replace with: $1\t$2
check Wrap around
check Regular expression
UNCHECK . matches newline
Replace all
Explanation:
^ # beginning of line
(.+?) # group 1, 1 or more any character but newline, not greedy
[-_] # a dash or underscore
(\d.*) # group 2, a digit then 0 or more any character but newline
$ # end of line
Replacement:
$1 # content of group 1
\t # a tabulation, you may replace with what you want
$2 # content of group 2
Result for given example:
andal 4.1.0.jar
besc 2.1.0-beta
prov 3.0.jar
add4lib 1.0.jar
com_lab 2.0.jar
astrix
lis 2_0_1.jar
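The same find-and-replace can be sketched in Python, assuming the file content has been read into a string (Python is an assumption; the answer above targets Notepad++). Python writes the group references as \1 and \2 where Notepad++ uses $1 and $2.

```python
import re

# File content from the question
names = """andal-4.1.0.jar
besc_2.1.0-beta
prov-3.0.jar
add4lib-1.0.jar
com_lab_2.0.jar
astrix
lis-2_0_1.jar"""

# Lazy name, one dash or underscore, then a version starting with a digit
pattern = re.compile(r'^(.+?)[-_](\d.*)$', re.MULTILINE)

# Lines without a version (such as "astrix") do not match and stay as they are
tabbed = pattern.sub(r'\1\t\2', names)
print(tabbed)
```

Tab-separated output like this pastes directly into two Excel columns.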
I am not quite sure what you meant by the problem with large files, and I believe the two regexes you showed do the opposite of what you said: the first one should get you the name and the second one the version.
Anyway, here are the assumptions I have made about what may make sense to you:
"Name" may be followed by - or _, followed by the version string.
The "Version" string is preceded by - or _, and consists of some digits, followed by a dot or underscore, followed by some digits, and then any string.
If these assumptions make sense, you may use
^(.+?)(?:[-_](\d+[._]\d+.*))?$
as your regex. Group 1 will be the name, Group 2 will be the version.
Demo in regex101: https://regex101.com/r/RnwMaw/3
Explanation of the regex:
^ start of line
(.+?) "Name" part, using a reluctant match of at least 1 character
(?: )? optional group for the "Version String", which consists of:
[-_] - or _
( ) followed by the "Version", which is
\d+ at least 1 digit,
[._] then 1 dot or underscore,
\d+ then at least 1 digit,
.* then any string
$ end of line
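To see the two groups in action, here is a minimal Python sketch. Python and the helper name `split_name_version` are assumptions made for illustration; the entries come from the question.

```python
import re

pattern = re.compile(r'^(.+?)(?:[-_](\d+[._]\d+.*))?$')

def split_name_version(entry):
    """Return (name, version); version is None when the entry has none."""
    match = pattern.match(entry)
    if match is None:  # only happens for an empty entry
        return entry, None
    return match.group(1), match.group(2)

entries = ["andal-4.1.0.jar", "astrix", "lis-2_0_1.jar", "com_lab_2.0.jar"]
rows = [split_name_version(e) for e in entries]
print(rows)
```

Because both parts come out of a single match, the name and the version always stay paired on the same row.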

Regular Expression to parse group of strings with quotes separated by space

Given a line of string that does not have any linebreak, I want to get groups of strings which may consist of quotes and separated by space. Space is allowed only if it's within quotes. E.g.
a="1234" gg b=5678 c="1 2 3"
The result should have 4 groups:
a="1234"
gg
b=5678
c="1 2 3"
So far I have this
/[^\s]+(=".*?"|=".*?[^s]+|=[^\s]+|=)/g
but this cannot capture the second group "gg". I can't check if there is space before and after the text, as this will include the string that has space within quotes.
Any help will be greatly appreciated! Thanks.
Edited
This is for javascript
In JavaScript, you may use the following regex:
/\w+(?:=(?:"[^"]*"|\S+)?)?/g
See the regex demo.
Details
\w+ - 1+ letters, digits or/and _
(?:=(?:"[^"]*"|\S+)?)? - an optional sequence of:
= - an equal sign
(?:"[^"]*"|\S+)? - an optional sequence of:
"[^"]*" - a ", then 0+ chars other than " and then "
| - or
\S+ - 1+ non-whitespace chars
JS demo:
const rx = /\w+(?:=(?:"[^"]*"|\S+)?)?/g;
const s = 'a="1234" gg b=5678 c="1 2 3" d=abcd e=';
// Logs an array with one item per token: bare key, key=value or key="quoted value"
console.log(s.match(rx));
If I did not misunderstand what you are saying, this is what you are looking for:
\w+=(?|"([^"]*)"|(\d+))|(?|[a-z]+)
Think of the alternation (|) as a fallback option: put the more complex alternatives in front of the more generic ones. Note that the branch reset group (?|...) is a PCRE feature that JavaScript's RegExp does not support.
Alternatively, you can remove the second ?| so that the bare word is captured in a different group, which you can then check (group 2).

How can I check it with regular Expression?

I have a long input string that contains certain field names embedded in it. For instance:
SELECT some-name, some-name FROM [some-table] WHERE [some-column] = 'some-value'
The actual field name may change, but it is always in the form of word-word. I need to perform a regex replace on the string so that the output will look like this:
SELECT some - name, some - name FROM [some-table] WHERE [some-column] = 'some - value'
In other words, when the field name is enclosed in square-brackets, it should be left untouched, but when it is not, spaces should be inserted on either side of the dash. There are no nested square brackets and the reserved word could be one or more in the string.
You can do this:
Regex.Replace(input, "(?<!\[[^-\]]*)(\w+)-(\w+)(?![^-\]]*\])", "$1 - $2")
Here's an explanation of the pattern:
(?<!\[[^-\]]*) - This is a negative look-behind. It asserts that matches cannot be immediately preceded by text that matches the sub-pattern \[[^-\]]*. In other words, the matches we are looking for cannot be preceded by a [ character followed by any number of characters that are not a - or a ].
(\w+)-(\w+) - Matches one or more word-characters, then a dash, and then one or more word characters following the dash. By enclosing the sub-patterns on either side of the dash in capturing groups, we can then refer to their values as $1 and $2 in the replacement pattern.
(?![^-\]]*\]) - This is a negative look-ahead. Similar to the negative look-behind, it asserts that matches cannot be immediately followed by text which matches the sub pattern [^-\]]*\]. In other words, a match cannot be followed by any number of characters that are not a - or a ] and then a closing ].
See a demo.
At first glance, you might assume that you could simply assert that it must not be immediately preceded by a [ character and that it must not be immediately followed by a ] character. In other words, (?<!\[)(\w+)-(\w+)(?!\]). However, that pattern would still match the text ome-nam in the input [some-name] because the text ome-nam is not immediately preceded or followed by the brackets.
Dim regex As Regex = New Regex("\[[^-]*-[^-]*\]")
Dim match As Match = regex.Match("A long string containing square brackets [some-name]")
If match.Success Then
    Console.WriteLine(match.Value)
End If
Or you could use Regex.IsMatch:
Return Regex.IsMatch("A long string containing square brackets [some-name]",
                     "\[[^-]*-[^-]*\]")
You may match and capture the [...] substrings so that they are kept as they are, and replace only the hyphens matched outside of them:
Dim nStr As String = "SELECT 'some-name' FROM [some-name]"
Dim nResult = Regex.Replace(nStr, "(\[.+?])|\s*-\s*", New MatchEvaluator(Function(m As Match)
        If m.Groups(1).Success Then
            Return m.Groups(1).Value
        Else
            Return " - "
        End If
    End Function))
So, what is happening is:
(\[[^]]+]) - matches and stores the value of [...] substring inside the Group(1) buffer (or \[.+?] can be used here to match a [, then 1 or more any characters and then ] - with RegexOptions.Singleline flag so that . could match a newline, too)
(?<!\s)-(?!\s) - matches any hyphen not preceded ((?<!\s)) or followed ((?!\s)) with whitespace (\s). Actually, we may even use \s*-\s* (where \s* stands for zero or more whitespaces as many as possible since * is a greedy quantifier matching zero or more occurrences of the quantified subpattern) here to remove any whitespace there is to make sure we just insert 1 space before and after -.
If Group 1 matches, then we just re-insert it (Return m.Groups(1).Value), else we insert the space-enclosed hyphen Return " - ".
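The capture-and-skip idea described above carries over to other regex flavours. Here is a minimal Python sketch of it, with a replacement function in place of the MatchEvaluator; Python and the function name `space_out_hyphen` are assumptions made for illustration.

```python
import re

# Input string from the question
query = "SELECT some-name, some-name FROM [some-table] WHERE [some-column] = 'some-value'"

def space_out_hyphen(match):
    # Group 1 is set only when a whole [...] block matched: return it unchanged
    if match.group(1) is not None:
        return match.group(1)
    # Otherwise a hyphen (with any surrounding whitespace) matched outside brackets
    return " - "

result = re.sub(r"(\[[^\]]+\])|\s*-\s*", space_out_hyphen, query)
print(result)
```

Since the bracketed alternative comes first and consumes the whole [...] block, hyphens inside brackets are never reached by the second alternative.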
Just to check if it exists, you could try
\[[^\]]+-[^\]]+\]
It matches a literal [ and then any characters, except ], up to (including) a hyphen. Then again any characters, except ], up to a literal ].
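A minimal Python sketch of such an existence check (Python and the helper name `has_bracketed_hyphen` are assumptions; the pattern is the one above):

```python
import re

bracketed_hyphen = re.compile(r"\[[^\]]+-[^\]]+\]")

def has_bracketed_hyphen(text):
    """True when text contains a [...] block with a hyphen inside it."""
    return bracketed_hyphen.search(text) is not None

print(has_bracketed_hyphen("A long string containing square brackets [some-name]"))
```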
See it here at regex101.
I don't actually know the VB.NET syntax, but you can use a regex such as
/[\s\'](\w+)\-(\w+)/g
This finds (\w+)-(\w+) preceded by whitespace or ', and you can then replace the match using capture groups 1 and 2 with " - " between them.
See the sample here

Matching two single quotes or double quote

I have the following strings. It is LatLongs in degrees, minutes and seconds format,
and can be entered as follows:
Option1: 25º 23" 40.6' or
Option2: 25º 23'' 40.6' or
Option3: 25 23 40.6
With one regex I would like to match both strings; the problem for me is matching the " (double quote) AND the '' (two single quotes).
I have the following so far.
^[+|-]?[0-9]{1,2}[\º| ][ ]?[0-9]{1,2}[\"|'{2}| ]
I am building and testing the regex in the terminal on Linux (Ubuntu). From the output I get in the terminal, it matches the " (double quote) but only ONE of the '' (two single quotes).
How can I change the regex to match the " (double quote) and the '' (two single quotes) in one expression?
Thanks in advance.
Check out this pattern:
([+-]?\d{1,2}(?:\.\d{1,2})?.)\s*(\d{1,2}(?:\.\d{1,2})?[\S]*)\s*(\d{1,2}(?:\.\d{1,2})?'?)
It does not depend on any particular special character, supports values of up to 2 digits (with an optional fractional part), and resolves your issue.
Your regex has problems. For example, [\"|'{2}| ] matches a single ", |, ', {, 2, } or space. Try the following:
^([+-]?\d+)º? ?\b(\d+)\b(?:''|\")? ?([\d.]+)'?$
Explanation:
^ # Start of string
([+-]?\d+) # Match an integer
º?[ ]? # Match a degree and/or a space (both optional)
\b(\d+)\b # Match a positive integer (entire number)
(?:''|\")?[ ]? # Match quotes and/or space (all optional)
([\d.]+) # Match a floating point number
'? # Match an optional single quote
$ # End of string
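As a check against the three input options from the question, here is a minimal Python sketch of this pattern (Python is an assumption; the question only mentions a Linux terminal). A raw triple-quoted string is used so that both quote characters can appear in the pattern unchanged.

```python
import re

# Degrees, minutes and seconds end up in groups 1, 2 and 3
dms = re.compile(r"""^([+-]?\d+)º? ?\b(\d+)\b(?:''|\")? ?([\d.]+)'?$""")

samples = [
    "25º 23\" 40.6'",   # Option 1: double quote after the minutes
    "25º 23'' 40.6'",   # Option 2: two single quotes after the minutes
    "25 23 40.6",       # Option 3: no symbols at all
]

parsed = [dms.match(s).groups() for s in samples]
print(parsed)
```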
I think what you really want to have with the Regex above is
^[+|-]?[0-9]{1,2}º? ?[0-9]{1,2}(\"|'{2})? ?[0-9]{1,2}\.[0-9]'?
Although this also matches weird things like
25 23'' 40.6
Your regex uses custom character classes (the sections between [ and ]), each of which can only match one single character. You can group multiple characters together with ( and ) and make these groups optional with a ?.