I have the following strings. They are lat/longs in degrees, minutes and seconds format,
and can be entered as follows:
Option1: 25º 23" 40.6' or
Option2: 25º 23'' 40.6' or
Option3: 25 23 40.6
With one regex I would like to match all of these strings. The problem for me is matching the " (double quote) AND the '' (two single quotes).
I have the following so far.
^[+|-]?[0-9]{1,2}[\º| ][ ]?[0-9]{1,2}[\"|'{2}| ]
I am building and testing the regex in the terminal on Linux (Ubuntu). From the output I get in the terminal, it matches the " (double quote) but only ONE of the '' (two single quotes).
How can I change the regex to match both the " (double quote) and the '' (two single quotes) in one expression?
Thanks in advance.
Check out this pattern:
([+-]?\d{1,2}(?:\.\d{1,2})?.)\s*(\d{1,2}(?:\.\d{1,2})?[\S]*)\s*(\d{1,2}(?:\.\d{1,2})?'?)
It does not depend on any particular separator character, accepts numbers of up to 2 digits (with an optional fraction of up to 2 digits), and resolves your issue.
Your regex has problems. For example, [\"|'{2}| ] is a character class, so it matches exactly one character: ", |, ', {, 2, } or a space. Try the following:
^([+-]?\d+)º? ?\b(\d+)\b(?:''|\")? ?([\d.]+)'?$
Explanation:
^ # Start of string
([+-]?\d+) # Match an integer
º?[ ]? # Match a degree and/or a space (both optional)
\b(\d+)\b # Match a positive integer (entire number)
(?:''|\")?[ ]? # Match quotes and/or space (all optional)
([\d.]+) # Match a floating point number
'? # Match an optional single quote
$ # End of string
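To try the pattern outside the terminal, here is a minimal JavaScript sketch that runs it against the three input variants from the question. The helper name parseDms is made up for illustration, and the sketch assumes the degree sign is typed as º, exactly as in the question.

```javascript
// Pattern from the answer above; the escaped \" is written as a plain " here.
const DMS = /^([+-]?\d+)º? ?\b(\d+)\b(?:''|")? ?([\d.]+)'?$/;

// parseDms is a hypothetical helper name, not part of any library.
// It returns the three captured fields as strings, or null if there is no match.
function parseDms(input) {
  const m = DMS.exec(input);
  if (m === null) return null;
  return { degrees: m[1], minutes: m[2], seconds: m[3] };
}

for (const s of ["25º 23\" 40.6'", "25º 23'' 40.6'", "25 23 40.6"]) {
  console.log(s, "=>", parseDms(s));
}
```

All three variants yield the same three fields; converting them to numbers is left to the caller.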
I think what you really want to have with the Regex above is
^[+|-]?[0-9]{1,2}º? ?[0-9]{1,2}(\"|'{2})? ?[0-9]{1,2}\.[0-9]'?
Although this also matches weird things like
25 23'' 40.6
Your regex uses character classes (the parts between [ and ]), each of which can match only one single character. You can group multiple characters together with ( and ) and make such a group optional with a ?.
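The difference between a character class and a group is easy to check. The sketch below (JavaScript, chosen only for illustration) applies both forms to a string made of two single quotes.

```javascript
// A character class matches ONE character out of the listed set,
// so the set here is: " | ' { 2 } and space.
const asClass = /["|'{2}| ]/;

// A group with alternation matches whole alternatives: " or ''.
const asGroup = /(?:"|'{2})/;

const twoQuotes = "''";
console.log(asClass.exec(twoQuotes)[0]); // only the first single quote
console.log(asGroup.exec(twoQuotes)[0]); // both single quotes
console.log(asClass.test("{"));          // the class also accepts a literal {
```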
I have a list of thousands of records within a .txt document.
some of them look like these records
201910031044 "00059" "11.31AG" "Senior Champion"
201910031044 "00060" "GBA146" "Junior Champion"
201910031044 "00999" "10.12G" "ProAM"
201910031044 "00362" "113.1LI" "Abcd"
Whenever a record similar to this occurs, I'd like to get rid of the last words/numbers/etc. in the last quotation marks (like "Senior Champion", "Junior Champion", etc.; there are many possibilities here).
e.g. (before)
201910031044 "00059" "11.31AG" "Senior Champion"
after
201910031044 "00059" "11.31AG"
I tried the following regex but it wouldn't work.
Search: ^([0-9]{17,17} + "[0-9]{8,8}" + "[a-zA-Z0-9]").*$
Replace: \1 (replace string)
OK, I forgot the . (dot) sign; however, even for records without a . (dot) it would not work. I'm not sure whether it has anything to do with using the + sign more than once.
I'd like to get rid of the last words/numbers/etc in the last quotation marks
This does the job:
Ctrl+H
Find what: ^.+\K\h+".*?"$
Replace with: LEAVE EMPTY
CHECK Wrap around
CHECK Regular expression
UNCHECK . matches newline
Replace all
Explanation:
^ # beginning of line
.+ # 1 or more any character but newline
\K # forget all we have seen until this position
\h+ # 1 or more horizontal spaces
".*?" # something inside quotes
$ # end of line
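JavaScript's regex engine has neither \K nor \h, so the sketch below is not the Notepad++ pattern itself but an assumed equivalent: the part before \K goes into a capture group that is written back, and \h becomes [ \t]. The function name dropLastQuotedField is made up.

```javascript
// Assumed equivalent of ^.+\K\h+".*?"$ for engines without \K and \h:
// group 1 holds everything that \K would have "forgotten".
const LAST_FIELD = /^(.+)[ \t]+".*?"$/gm;

// Hypothetical helper: removes the last quoted field on every line.
function dropLastQuotedField(text) {
  return text.replace(LAST_FIELD, "$1");
}

const sample = [
  '201910031044 "00059" "11.31AG" "Senior Champion"',
  '201910031044 "00362" "113.1LI" "Abcd"',
].join("\n");

console.log(dropLastQuotedField(sample));
```

Lines that do not end in a quoted field are left unchanged, because the pattern does not match them.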
The RegEx looks for the 4th double quote:
^(?:[^"]*\"){4}([^|]*)
You can see this demo: https://regex101.com/r/wJ9yS6/163
You will still need to parse the lines, so it is probably easier to open the file in Excel or to parse it as a CSV using code.
You have a problem with the count of your characters:
you specify that the line should start with exactly 17 digits ([0-9]{17,17}). However, there are only 12 digits in the data 201910031044.
you can specify exactly 12 digits by using {12} or if it could be 12-17, then {12,17}. I'll assume exactly 12 based on the current data.
similarly, for the second column you specify that it's exactly 8 digits surrounded by quotes ("[0-9]{8,8}") but it only has 5 digits surrounded by quotes.
again, you can specify exactly 5 with {5} or 5-8 with {5,8}. I will assume exactly 5.
finally, there is no quantifier for the final field, so the regex tries to match exactly one character that is a letter or a number surrounded by quotes "[a-zA-Z0-9]".
I'm not sure if there is any limit on the number of characters, so I would go with one or more using + as quantifier "[a-zA-Z0-9]+" - if you can have zero or more, then you can use *, or if it's any other count from m to n, then you can use {m,n} as before.
Not a character count problem, but the final column can also contain dots, which the regex doesn't account for. You can just add . inside the square brackets and it will match only literal dot characters. It's usually a wildcard, but it loses its special meaning inside a character class ([]), so you get "[a-zA-Z0-9.]+"
Putting it all together, you get
Search: ^([0-9]{12} + "[0-9]{5}" + "[a-zA-Z0-9.]+").*$
Replace: \1
Which will get rid of anything after the third field in Notepad++.
This can be shortened a bit by using \d instead of [0-9] for digits and \s+ instead of " +" (a space followed by +) for whitespace. As a benefit, \s will also match other whitespace like tabs, so you don't have to manually account for those. This leads to
Search: ^(\d{12}\s+"\d{5}"\s+"[a-zA-Z0-9.]+").*$
Replace: \1
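The same search and replace can be tried outside Notepad++. Below is a minimal JavaScript sketch, assuming the records are held in one string with \n line endings (the function name stripAfterThirdField is made up). The m flag makes ^ and $ work per line, and $1 plays the role of \1.

```javascript
// Same pattern as the shortened search above, with g (all lines) and m (per-line anchors).
const RECORD = /^(\d{12}\s+"\d{5}"\s+"[a-zA-Z0-9.]+").*$/gm;

// Hypothetical helper: keeps only the first three fields of matching records.
function stripAfterThirdField(text) {
  return text.replace(RECORD, "$1");
}

const records = [
  '201910031044 "00059" "11.31AG" "Senior Champion"',
  '201910031044 "00060" "GBA146" "Junior Champion"',
].join("\n");

console.log(stripAfterThirdField(records));
```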
If you want to get rid of the last words/numbers/etc. in the last quotation marks, you could capture everything before them in a group, then match the last pair of quotation marks and everything between them using a negated character class, so that this part is removed.
If what is between the values can be spaces or tabs, you could use [ \t]+ to match those (using \s could also match a newline)
Note that {17,17} and {8,8} may also be written as {17} and {8} which in this case should be {12} and {5}
^([0-9]{12}[ \t]+"[0-9]{5}"[ \t]+"[a-zA-Z0-9.]+")[ \t]{2,}"[^"\r\n]+"
In parts
^ Start of string
( Capture group 1
[0-9]{12}[ \t]+ Match 12 digits and 1+ spaces or tabs
"[0-9]{5}"[ \t]+ Match 5 digits between " and 1+ spaces or tabs
"[a-zA-Z0-9.]+" Match 1+ times any of the listed between "
) Close group
[ \t]{2,} Match 2+ spaces or tabs
"[^"\r\n]+" Match ", then 1+ characters other than " or a newline, then "
In the replacement use group 1 $1
So I need to match the following:
1.2.
3.4.5.
5.6.7.10
((\d+)\.(\d+)\.((\d+)\.)*) will do fine for the very first line, but the problem is: there could be many lines: could be one or more than one.
\n will only appear if there is more than one line.
As a string, I get it like this: "1.2.\n3.4.5.\n1.2."
So my issue is: if there is only one line, \n need not be at the end, but if there is more than one line, \n needs to be at the end of each line except the very last.
Here is the pattern I suggest:
^\d+(?:\.\d+)*\.?(?:\n\d+(?:\.\d+)*\.?)*$
Here is a brief explanation of the pattern:
^ from the start of the string
\d+ match a number
(?:\.\d+)* followed by dot, and another number, zero or more times
\.? followed by an optional trailing dot
(?:\n followed by a newline
\d+(?:\.\d+)*\.?)* and another path sequence, zero or more times
$ end of the string
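A quick way to check the pattern is to run it against single-line and multi-line inputs. Here is a minimal JavaScript sketch (isValidBlock is a made-up name); note that no m flag is used, so ^ and $ anchor the whole string rather than each line.

```javascript
// Pattern from above: one dotted number sequence, then zero or more
// further sequences, each introduced by a newline.
const BLOCK = /^\d+(?:\.\d+)*\.?(?:\n\d+(?:\.\d+)*\.?)*$/;

// Hypothetical helper: true if the whole string is a valid block of lines.
function isValidBlock(s) {
  return BLOCK.test(s);
}

console.log(isValidBlock("1.2."));                  // single line, no \n needed
console.log(isValidBlock("1.2.\n3.4.5.\n5.6.7.10")); // several lines, \n between them
console.log(isValidBlock("1.2.\nabc"));             // second line is not a number sequence
```

In JavaScript a trailing "\n" after the last line is rejected, because $ without the m flag matches only at the very end of the input.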
You might check if there is a newline at the end using a positive lookahead (?=.*\n):
(?=.*\n)(\d+)\.(\d+)\.((\d+)\.)*
Edit
You could use an alternation to either match when on the next line there is the same pattern following, or match the pattern when not followed by a newline.
^(?:\d+\.\d+\.(?:\d+\.)*(?=.*\n\d+\.\d+\.)|\d+\.\d+\.(?:\d+\.)*(?!.*\n))
^ Start of string
(?: Non capturing group
\d+\.\d+\. Match 1+ digits and a dot, 2 times
(?:\d+\.)* Repeat 0+ times matching 1+ digits and a dot
(?=.*\n\d+\.\d+\.) Positive lookahead, assert that what follows is a newline and then the start of the pattern
| Or
\d+\.\d+\. Match 1+ digits and a dot, 2 times
(?:\d+\.)* Repeat 0+ times matching 1+ digits and a dot
(?!.*\n) Negative lookahead, assert that what follows is not a newline
) Close non capturing group
(\d+\.*)+\n* will match the text you provided. If you need to make sure the final line also ends with a . then (\d+\.)+\n* will work.
Most programming languages offer the m flag, which is the multiline modifier. Enabling it lets $ match at the end of each line as well as at the end of the string.
The solution below only appends the $ to your current regex and sets the m flag. This may vary depending on your programming language.
var text = "1.2.\n3.4.5.\n1.2.\n12.34.56.78.123.\nthis 1.2. shouldn't hit",
    // g finds every match, m lets $ match at the end of each line
    regex = /((\d+)\.(\d+)\.((\d+)\.)*)$/gm,
    match;

while (match = regex.exec(text)) {
  console.log(match);
}
You could simplify the regex to /(\d+\.){2,}$/gm, then split the full match based on the dot character to get all the different numbers. I've given a JavaScript example below, but getting a substring and splitting a string are pretty basic operations in most languages.
var text = "1.2.\n3.4.5.\n1.2.\n12.34.56.78.123.\nthis 1.2. shouldn't hit",
    regex = /(\d+\.){2,}$/gm;

/* Slice is used to drop the dot at the end, otherwise resulting in
 * an empty string on split.
 *
 * "1.2.3.".split(".")    //=> ["1", "2", "3", ""]
 * "1.2.3.".slice(0, -1)  //=> "1.2.3"
 * "1.2.3".split(".")     //=> ["1", "2", "3"]
 */
console.log(
  text.match(regex)
      .map(match => match.slice(0, -1).split("."))
);
For more info about regex flags/modifiers have a look at: Regular Expression Reference: Mode Modifiers
I have a file that contains phone numbers of the following formats:
(xxx) xxx.xxxx
(xxx).xxx.xxxx
(xxx) xxx-xxxx
(xxx)-xxx-xxxx
xxx.xxx.xxxx
xxx-xxx-xxxx
xxx xxx-xxxx
xxx xxx.xxxx
I must parse the file for phone numbers of those and ONLY those formats, and output them to a separate file. I'm using Perl, and so far I have what I think is a valid regex for two of these formats:
my $phone_regex = qr/^(\d{3}\-)?(\(\d{3}\))?\d{3}\-\d{4}$/;
But I'm not sure if this is correct, or how to do the rest all in one regex. Thank you!
Here you go
\(?\d{3}\)?[-. ]\d{3}[-. ]\d{4}
See a demo on regex101.com.
Broken down this is
\(? # "(", optional
\d{3} # three digits
\)? # ")", optional
[-. ] # one of "-", "." or " "
\d{3} # three digits
[-. ] # same as above
\d{4} # four digits
If you want, you can add a word boundary (\b) on the right side; some potential matches may then be filtered out.
You haven't escaped the parentheses properly and have needlessly escaped the hyphen. The regex you are trying to create is this:
^\(?\d{3}\)?[ .-]\d{3}[ .-]\d{4}$
Explanation:
^ - Start of string
\(? - Optional starting parenthesis (
\d{3} - Followed by three digits
\)? - Optional closing parenthesis )
[ .-] - A single character either a space or . or -
\d{3} - Followed by three digits
[ .-] - Again a single character either a space or . or -
\d{4} - Followed by four digits
$ - End of string
Your current regex allows too much, as it will allow xxx-(xxx) at the beginning. It also doesn't handle any of the . or space separated cases. You want to have only three sets of digits, and then allow optional parentheses around the first set which you can use an alternation for, and then you can make use of character classes to indicate the set of separators you want to allow.
Additionally, don't use \d as it will match any unicode digit. Since you likely only want to allow ASCII digits, use the character class [0-9] (there are other options, but this is the simplest).
Finally, $ allows a newline at the end of the string, so use \z instead which does not. Make sure if you are reading these from a file that you chomp them so they do not contain trailing newlines.
This leaves us with:
qr/^(?:[0-9]{3}|\([0-9]{3}\))[-. ][0-9]{3}[-.][0-9]{4}\z/
If you want to ensure that the two separators are the same if the first is a . or -, it is easiest to do this in multiple regex checks (these can be more lenient since we already validated the general format):
if ($str =~ m/^[0-9()]+ /
    or $str =~ m/^[0-9()]+\.[0-9]{3}\./
    or $str =~ m/^[0-9()]+-[0-9]{3}-/) {
  # allowed
}
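For comparison, here is how the two checks could look in JavaScript. This is a sketch, not a translation that the answer itself provides. JavaScript has no \z, but without the m flag its $ matches only at the very end of the input, so it is used in its place. The names isPhoneFormat, hasConsistentSeparators and isAllowed are made up.

```javascript
// General format: optional balanced parentheses around the first group,
// first separator "-", "." or space, second separator "-" or ".".
const PHONE = /^(?:[0-9]{3}|\([0-9]{3}\))[-. ][0-9]{3}[-.][0-9]{4}$/;

function isPhoneFormat(s) {
  return PHONE.test(s);
}

// Lenient follow-up checks: a space as first separator,
// or the same "." or "-" used for both separators.
function hasConsistentSeparators(s) {
  return /^[0-9()]+ /.test(s)
    || /^[0-9()]+\.[0-9]{3}\./.test(s)
    || /^[0-9()]+-[0-9]{3}-/.test(s);
}

function isAllowed(s) {
  return isPhoneFormat(s) && hasConsistentSeparators(s);
}

console.log(isAllowed("(123) 456-7890")); // one of the listed formats
console.log(isAllowed("123.456-7890"));   // mixed "." and "-" separators
console.log(isAllowed("(123 456-7890"));  // unbalanced parenthesis
```

The alternation is what rejects an unbalanced parenthesis, which a pattern with two independent optional parentheses would accept.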
I have a text file with the following text:
andal-4.1.0.jar
besc_2.1.0-beta
prov-3.0.jar
add4lib-1.0.jar
com_lab_2.0.jar
astrix
lis-2_0_1.jar
Is there any way I can split the name and the version using regex? I want to use the results to make two columns, 'Name' and 'Version', in Excel.
So I want the results from the regex to look like
andal 4.1.0.jar
besc 2.1.0-beta
prov 3.0.jar
add4lib 1.0.jar
com_lab 2.0.jar
astrix
lis 2_0_1.jar
So far I have used ^(?:.*-(?=\d)|\D+) to get the Version and -\d.*$ to get the Name separately. The problem with this is that when I do it for a large text file, the results from the two regexes are not in the same order. So is there any way to get the results in the way I have mentioned above?
Ctrl+H
Find what: ^(.+?)[-_](\d.*)$
Replace with: $1\t$2
check Wrap around
check Regular expression
UNCHECK . matches newline
Replace all
Explanation:
^ # beginning of line
(.+?) # group 1, 1 or more any character but newline, not greedy
[-_] # a dash or underscore
(\d.*) # group 2, a digit then 0 or more any character but newline
$ # end of line
Replacement:
$1 # content of group 1
\t # a tabulation, you may replace with what you want
$2 # content of group 2
Result for given example:
andal 4.1.0.jar
besc 2.1.0-beta
prov 3.0.jar
add4lib 1.0.jar
com_lab 2.0.jar
astrix
lis 2_0_1.jar
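The replacement can also be scripted. Here is a minimal JavaScript sketch, assuming the list is held in a single string with \n line endings (splitNameVersion is a made-up name). With the m flag, ^ and $ apply per line, as they do in Notepad++.

```javascript
// Same find pattern as above; g replaces on every line, m makes ^ and $ per-line.
const NAME_VERSION = /^(.+?)[-_](\d.*)$/gm;

// Hypothetical helper: puts a tab between name and version on each matching line.
function splitNameVersion(text) {
  return text.replace(NAME_VERSION, "$1\t$2");
}

const listing = ["andal-4.1.0.jar", "com_lab_2.0.jar", "astrix"].join("\n");
console.log(splitNameVersion(listing));
```

A line without a version, such as astrix, does not match and is therefore left as it is.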
I'm not quite sure what you meant by the problem with large files, and I believe the two regexes you showed do the opposite of what you said: the first one should get you the name and the second one should give you the version.
Anyway, here are the assumptions I had to make to guess what may make sense for you:
"Name" may be followed by - or _, followed by the version string.
"Version" string is something preceded by - or _, starting with some digits, followed by a dot or underscore, followed by some digits, and then any string.
If these assumptions make sense, you may use
^(.+?)(?:[-_](\d+[._]\d+.*))?$
as your regex. Group 1 will be the name, Group 2 will be the version.
Demo in regex101: https://regex101.com/r/RnwMaw/3
Explanation of regex
^ start of line
(.+?) "Name" part, using reluctant match of
at least 1 character
(?: )? Optional group of "Version String", which
consists of:
[-_] - or _
( ) Followed by the "Version" , which is
\d+ at least 1 digit,
[._] then 1 dot or underscore,
\d+ then at least 1 digit,
.* then any string
$ end of line
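To see the two groups in action, here is a minimal JavaScript sketch that applies the regex to one line at a time. The function name parseArtifact and the choice of null for a missing version are assumptions made for this illustration.

```javascript
// Regex from above: group 1 is the name, optional group 2 is the version.
const PARTS = /^(.+?)(?:[-_](\d+[._]\d+.*))?$/;

// Hypothetical helper: returns { name, version }, with version null
// when the line carries no version string.
function parseArtifact(line) {
  const m = PARTS.exec(line);
  if (m === null) return null;
  return { name: m[1], version: m[2] === undefined ? null : m[2] };
}

for (const line of ["andal-4.1.0.jar", "com_lab_2.0.jar", "astrix", "lis-2_0_1.jar"]) {
  console.log(parseArtifact(line));
}
```

Because the version group is optional, every non-empty line matches, which keeps names and versions in the same order as the input.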
From this input:
""" "01-01-2000""" " ",""" "Bank123""" "", "" ""Example text" " "",
I want to extract:
01-01-2000
Bank123
Example text
I managed this:
(["'])(?:(?=(\\?))\2.)*?\1
But it fails when it has to deal with many misplaced quotes. Any ideas?
As I see, you are interested in strings which:
start with either a digit or a letter,
followed by a (maybe empty) sequence of chars other than ".
So the intuitive solution is [a-z\d][^"]* with gi options
(global, case insensitive).
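A minimal JavaScript sketch of this approach, run on the input from the question (extractValues is a made-up name):

```javascript
// A letter or digit, followed by any run of characters other than ".
const VALUE = /[a-z\d][^"]*/gi;

// Hypothetical helper: returns all matches, or an empty array if there are none.
function extractValues(s) {
  return s.match(VALUE) || [];
}

const messy = '""" "01-01-2000""" " ",""" "Bank123""" "", "" ""Example text" " "",';
console.log(extractValues(messy));
```

This relies on every wanted value starting with a letter or digit; a value that begins with any other character would be cut or missed.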
For your given example, perhaps it could be an option to match a whitespace or a double quote zero or more times [ "]* to match what comes before the value between the inner double quotes.
Then match that double quote and capture in a group not a double quote or a newline ([^"\r\n]+) using a negated character class.
At the end match the closing double quote followed by zero or more times a whitespace or a double quote which will match what comes after so the group does not match a whitespace between double quotes.
[ "]*"([^"\r\n]+)"[ "]*
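The capture-group variant can be tried the same way. Here is a minimal JavaScript sketch (extractQuoted is a made-up name) that collects group 1 of every match:

```javascript
// Surrounding quotes and spaces are consumed by [ "]*, the value itself is group 1.
const QUOTED = /[ "]*"([^"\r\n]+)"[ "]*/g;

// Hypothetical helper: returns the captured value of every match.
function extractQuoted(s) {
  return Array.from(s.matchAll(QUOTED), (m) => m[1]);
}

const line = '""" "01-01-2000""" " ",""" "Bank123""" "", "" ""Example text" " "",';
console.log(extractQuoted(line));
```

The trailing [ "]* is what keeps a lone space between two quotes from being captured as a value.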
There are various options to do so:
1) ([\d-\w\s][\d-\w\s]+)
2) ([\d-\w\s]{2,})
3) "\b(.+?)\b"
4) \b([^"]{2,})\b
Demo : https://regex101.com/r/jPXqKv/1
Test:
""" "01-01-2000""" " ",""" "Bank123""" "", "" ""Example text" " ""
Match:
Match 1
Full match 5-15 `01-01-2000`
Group 1. 5-15 `01-01-2000`
Match 2
Full match 28-35 `Bank123`
Group 1. 28-35 `Bank123`
Match 3
Full match 48-60 `Example text`
Group 1. 48-60 `Example text`