Splitting string having special characters, words, numbers and URL - regex

I have a .txt file which contains:
"'the url address i checked is: https://www.google.com/ for 2times and it's awesome!."
After parsing, the expected output should be:
['"',"'",'the','url','address','i','checked','is',':','https://www.google.com/','for','2','times','and',"it's",'awesome','!','.','"']
How do I split this string to get that output using the re module?
I came up with this pattern:
pattern = re.compile(r"\d+|[a-zA-Z]+[a-zA-Z']*|[^\w\s]")
but this also splits my URL.
Can anyone please help?

Just pick a URL regex from somewhere and make it the first alternative, so the URL is tried before the other token patterns.
An example only:
# (?!mailto:)(?:(?:https?|ftp)://)?(?:\S+(?::\S*)?#)?(?:(?:(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})))|localhost)(?::\d{2,5})?(?:/[^\s]*)?|\d+|[a-zA-Z]+[a-zA-Z']*|[^\w\s]
(?! mailto: )
(?:
(?: https? | ftp )
://
)?
(?:
\S+
(?: : \S* )?
#
)?
(?:
(?:
(?:
[1-9] \d?
| 1 \d\d
| 2 [01] \d
| 22 [0-3]
)
(?:
\.
(?: 1? \d{1,2} | 2 [0-4] \d | 25 [0-5] )
){2}
(?:
\.
(?:
[1-9] \d?
| 1 \d\d
| 2 [0-4] \d
| 25 [0-4]
)
)
| (?:
(?: [a-z\u00a1-\uffff0-9]+ -? )*
[a-z\u00a1-\uffff0-9]+
)
(?:
\.
(?: [a-z\u00a1-\uffff0-9]+ -? )*
[a-z\u00a1-\uffff0-9]+
)*
(?:
\.
(?: [a-z\u00a1-\uffff]{2,} )
)
)
| localhost
)
(?: : \d{2,5} )?
(?: / [^\s]* )?
| \d+
| [a-zA-Z]+ [a-zA-Z']*
| [^\w\s]
Outputs:
['"',"'",'the','url','address','i','checked','is',':','https://www.google.com/','for','2','times','and',"it's",'awesome','!','.','"']
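A minimal Python sketch of the same idea, assuming a much simpler URL alternative (`https?://\S+`) than the full validator above. The name `TOKEN` is hypothetical, and the simplified URL pattern is a stand-in that would also swallow punctuation glued to the end of a URL.

```python
import re

# URL alternative goes first so it wins over the word/punctuation alternatives.
# Assumption: a simplified URL pattern, not the full validator shown above.
TOKEN = re.compile(
    r"""
    https?://\S+            # simplified URL stand-in, tried first
  | \d+                     # runs of digits
  | [a-zA-Z]+[a-zA-Z']*     # words, optionally containing apostrophes
  | [^\w\s]                 # any single punctuation character
    """,
    re.VERBOSE,
)

text = "\"'the url address i checked is: https://www.google.com/ for 2times and it's awesome!.\""
tokens = TOKEN.findall(text)
print(tokens)
```

Because the regex engine tries alternatives left to right at each position, the URL is consumed whole before the punctuation alternative ever sees its `:` and `/` characters.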

Related

RegEx for matching various dates

I am trying to put together a regex statement to match on each of the below date formats.
* Mar 7, 2017
Mar. 7, 2017
* March 7, 2017
3-7-2017
03-07-2017
3-7-17
03-07-17
* 03/7/2017
* 03/07/17
* 3/7/17
Mar-07-2017
Mar-7-2017
March-07-2017
The below regex matches on the date formats that are indicated by an asterisk above. I have tried in vain to add to what I already have but have been unsuccessful.
([0-9]+)/([0-9]+)/([0-9]+)|([12]\d{3}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01]))|\w+\s\d{2},\s\d{4}|(?i)\b(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec](?:ember)?)\b
(?:0?[1-9]|[1-2][0-9]|3[01]),? \d{4}
Any help is always appreciated!
* Bonus question *
On some occasions, there may be multiple date matches and I need it to find a match following a certain word. In the past I've used the below syntax by enclosing the regex statement between the parentheses after the period.
(?<=Word).(StatementHere)
Try this then ...
([0-9]+)/([0-9]+)/([0-9]+)|((0?[1-9]|1[0-2])-(0?[1-9]|[12]\d|3[01])-(\d{4}|\d{2}))|\w+\s\d{2},\s\d{4}|(?i)\b(Jan(?:uary|\.)?|Feb(?:ruary|\.)?|Mar(?:ch|\.)?|Apr(?:il|\.)?|May|Jun(?:e|\.)?|Jul(?:y|\.)?|Aug(?:ust|\.)?|Sep(?:tember|\.)?|Oct(?:ober|\.)?|Nov(?:ember|\.)?|Dec(?:ember|\.)?)([ ](?:0?[1-9]|[1-2][0-9]|3[01]),?[ ]|-(?:0?[1-9]|[1-2][0-9]|3[01])-)(\d{4})
https://regex101.com/r/k1vaVN/1
Readable version
( [0-9]+ ) # (1)
/
( [0-9]+ ) # (2)
/
( [0-9]+ ) # (3)
|
( # (4 start)
( 0? [1-9] | 1 [0-2] ) # (5)
-
( 0? [1-9] | [12] \d | 3 [01] ) # (6)
-
( \d{4} | \d{2} ) # (7)
) # (4 end)
|
\w+ \s \d{2} , \s \d{4}
|
(?i)
\b
( # (8 start)
Jan
(?: uary | \. )?
| Feb
(?: ruary | \. )?
| Mar
(?: ch | \. )?
| Apr
(?: il | \. )?
| May
| Jun
(?: e | \. )?
| Jul
(?: y | \. )?
| Aug
(?: ust | \. )?
| Sep
(?: tember | \. )?
| Oct
(?: ober | \. )?
| Nov
(?: ember | \. )?
| Dec
(?: ember | \. )?
) # (8 end)
( # (9 start)
[ ]
(?: 0? [1-9] | [1-2] [0-9] | 3 [01] )
,? [ ]
| -
(?: 0? [1-9] | [1-2] [0-9] | 3 [01] )
-
) # (9 end)
( \d{4} ) # (10)
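A hedged Python sketch of the slash, dash, and month-name alternatives above, checked against the formats listed in the question. The names `MONTH`, `DAY`, `DATE`, and `samples` are hypothetical. Python's `re` expects global flags at the start of the pattern, so case-insensitivity is passed as `re.IGNORECASE` rather than an inline `(?i)` in the middle.

```python
import re

# Building blocks (hypothetical names) mirroring the month and day parts above.
MONTH = (r"(?:Jan(?:uary|\.)?|Feb(?:ruary|\.)?|Mar(?:ch|\.)?|Apr(?:il|\.)?|May"
         r"|Jun(?:e|\.)?|Jul(?:y|\.)?|Aug(?:ust|\.)?|Sep(?:tember|\.)?"
         r"|Oct(?:ober|\.)?|Nov(?:ember|\.)?|Dec(?:ember|\.)?)")
DAY = r"(?:0?[1-9]|[12][0-9]|3[01])"

DATE = re.compile(
    rf"""
    \d+/\d+/\d+                                    # slash-separated numeric dates
  | (?:0?[1-9]|1[0-2])-{DAY}-(?:\d{{4}}|\d{{2}})   # dash-separated numeric dates
  | \b{MONTH}(?:[ ]{DAY},?[ ]|-{DAY}-)\d{{4}}      # month name, then day and year
    """,
    re.IGNORECASE | re.VERBOSE,
)

samples = [
    "Mar 7, 2017", "Mar. 7, 2017", "March 7, 2017",
    "3-7-2017", "03-07-2017", "3-7-17", "03-07-17",
    "03/7/2017", "03/07/17", "3/7/17",
    "Mar-07-2017", "Mar-7-2017", "March-07-2017",
]
unmatched = [s for s in samples if DATE.fullmatch(s) is None]
print(unmatched)
```

An empty `unmatched` list means every format from the question is accepted by this sketch.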
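Update
Just wrap the date alternatives in a (?: ) group, then add whatever qualifier you need in front of it. Here \K (supported by PCRE-style engines) drops the qualifier from the reported match, so only the date is returned.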
word[ ]or[ ]phrase[ ]+\K(?:([0-9]+)/([0-9]+)/([0-9]+)|((0?[1-9]|1[0-2])-(0?[1-9]|[12]\d|3[01])-(\d{4}|\d{2}))|\w+\s\d{2},\s\d{4}|(?i)\b(Jan(?:uary|\.)?|Feb(?:ruary|\.)?|Mar(?:ch|\.)?|Apr(?:il|\.)?|May|Jun(?:e|\.)?|Jul(?:y|\.)?|Aug(?:ust|\.)?|Sep(?:tember|\.)?|Oct(?:ober|\.)?|Nov(?:ember|\.)?|Dec(?:ember|\.)?)([ ](?:0?[1-9]|[1-2][0-9]|3[01]),?[ ]|-(?:0?[1-9]|[1-2][0-9]|3[01])-)(\d{4}))
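Python's built-in `re` module does not support `\K`, so a rough equivalent there is to match the qualifier and read the date from a capture group. The sketch below assumes a deliberately simplified numeric date pattern as a stand-in for the full alternation; `DATE`, `date_after`, and the sample text are hypothetical.

```python
import re

# Assumption: simplified numeric date (slashes or dashes), not the full pattern.
DATE = r"\d{1,2}[/-]\d{1,2}[/-](?:\d{4}|\d{2})"

def date_after(keyword, text):
    """Return the first date that directly follows `keyword`, or None."""
    pattern = re.compile(rf"{re.escape(keyword)}[ ]+({DATE})")
    match = pattern.search(text)
    return match.group(1) if match else None

text = "Issued 01/02/2017, Due date 3/7/17, paid 03-09-2017"
print(date_after("Due date", text))
print(date_after("paid", text))
print(date_after("Shipped", text))
```

The capture group plays the role of `\K`: the keyword takes part in the match but only group 1 is returned.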

RegEx for validating a URL with optional ports

I am trying to get this regex dialed-in to validate whether a URL begins with https and if a port is supplied the only valid values are 443 or 5443. This regex is pretty close but not quite there.
^(https:\/\/)([a-zA-Z\d\.]{2,})\.([a-zA-Z]{2,})(:5{0,1}443)?(.)*
How do I solve this problem?
This is a mainstream URL validator that checks that the URL sits between whitespace boundaries.
It only allows the https scheme and the port numbers 5443 or 443.
(?<!\S)https://(?:\S+(?::\S*)?#)?(?:(?:(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})))|localhost)(?::5?443)?(?:/[^\s]*)?(?!\S)
Readable version
(?<! \S )
https ://
(?:
\S+
(?: : \S* )?
#
)?
(?:
(?:
(?:
[1-9] \d?
| 1 \d\d
| 2 [01] \d
| 22 [0-3]
)
(?:
\.
(?: 1? \d{1,2} | 2 [0-4] \d | 25 [0-5] )
){2}
(?:
\.
(?:
[1-9] \d?
| 1 \d\d
| 2 [0-4] \d
| 25 [0-4]
)
)
| (?:
(?: [a-z\u00a1-\uffff0-9]+ -? )*
[a-z\u00a1-\uffff0-9]+
)
(?:
\.
(?: [a-z\u00a1-\uffff0-9]+ -? )*
[a-z\u00a1-\uffff0-9]+
)*
(?:
\.
(?: [a-z\u00a1-\uffff]{2,} )
)
)
| localhost
)
(?: : 5? 443 )?
(?: / [^\s]* )?
(?! \S )
You should append a / after this optional port group so it doesn't allow any digits before a /. Try using this regex,
^(https:\/\/)([a-zA-Z\d\.]{2,})\.([a-zA-Z]{2,})(:5?443)?\/\S*
Notice that I've also changed (:5{0,1}443)? to (:5?443)? and changed the last .* to \S* so the match doesn't capture spaces, since spaces are not valid in a URL. Besides that, you can also get rid of many of the groups in your regex, unless you need them.
Edit:
As you said in the comments, you also want to match the following URLs:
https://example.com
https:example.com
https:example.com:443
To do that, make the \/\S* part optional by wrapping it in a group and placing a ? after it. The modified regex becomes this, which will match the URLs above:
^https:\/\/([a-zA-Z\d\.]{2,})\.([a-zA-Z]{2,})(:5?443)?(\/\S*)?
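A small Python sketch of how the optional-path version might be used for validation. It assumes `re.fullmatch` instead of a leading `^` alone: with the path optional and no end anchor, a prefix match would otherwise let a URL with an unlisted port through. The names `HTTPS_URL` and `is_allowed` are hypothetical.

```python
import re

# Same shape as the regex above, without the unneeded capture groups and escapes.
HTTPS_URL = re.compile(r"https://[a-zA-Z\d.]{2,}\.[a-zA-Z]{2,}(:5?443)?(/\S*)?")

def is_allowed(url):
    """True when the whole string is an https URL with no port, 443, or 5443."""
    return HTTPS_URL.fullmatch(url) is not None

for url in [
    "https://example.com",
    "https://example.com:443",
    "https://example.com:5443/path?x=1",
    "https://example.com:8080",
    "http://example.com",
]:
    print(url, is_allowed(url))
```

Using `fullmatch` is one way to anchor the right-hand side; appending `$` to the pattern and calling `match` would be another.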
Your regex seems to work okay. If you wish, you can add stricter boundaries, just for safety:
^(https:\/\/)([a-zA-Z\d\.]{2,})\.([a-zA-Z]{2,})(:(5443|443))?$
I added a $ end anchor to bound your original expression on the right, with the colon inside the optional group so the port stays optional. If you have more port numbers to allow, simply add them to this capturing group:
(5443|443)
You can also remove unnecessary boundaries, if you wish.

Powershell Regex to match between vertical bar ( | )

Below are just two lines of the string that I am matching against:
6 |UDP |ENABLED | |15006 |010.247.060.120 | UDP/IP Communications | UDP/IP Communications GH1870
10 |Gway |ONLINE | |41794 |127.000.000.001 | DM-MD64x64 | DM-MD64x64
Below is the regex I have so far, but it only matches the bottom line:
(?i)(?<cipid>([\w\.]+))\s*\|\s*(?<ty>\w+)?\s*\|\s*(?<stat>[\w ]+)\s*\|\s*(?<devid>\w+)?\s*\|\s*(?<prt>\d+)\s*\|\s*(?<ip>([\d\.]+))\s*\|\s*(?<mdl>[\w-]+)\s*\|\s*(?<desc>.+)
I was wondering if I could have a regular expression that just matches every character between the vertical bars, instead of having to explicitly say what is between them.
Thanks, all.
This usually works. (?:^|(?<=\|))[^|]*?(?=\||$)
https://regex101.com/r/KMNc47/1
Formatted
(?: ^ | (?<= \| ) ) # BOS or Pipe behind
[^|]*? # Optional non-pipe chars
(?= \| | $ ) # Pipe ahead or EOS
Here it is with whitespace trim and includes a capture group.
(?:^|(?<=\|))\s*([^|]*?)\s*(?=\||$)
https://regex101.com/r/KMNc47/2
Formatted
(?: ^ | (?<= \| ) ) # BOS or Pipe behind
\s*
( [^|]*? ) # (1), Optional non-pipe chars
\s*
(?= \| | $ ) # Pipe ahead or EOS
Here it is in a Capture Collection configuration.
(?:(?:^|\|)\s*([^|]*?)\s*(?=\||$))+
https://regex101.com/r/KMNc47/3
Formatted
(?:
(?: ^ | \| ) # BOS or Pipe
\s*
( [^|]*? ) # (1), Optional non-pipe chars
\s*
(?= \| | $ ) # Pipe ahead or EOS
)+
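For illustration, here is a Python sketch of the trimmed, single-capture-group variant applied to the first sample line (the question itself is about PowerShell, so this is only a stand-in to show what the pattern yields). `FIELD`, `line`, and `fields` are hypothetical names.

```python
import re

# Trimmed variant from above: group 1 holds the field without surrounding spaces.
FIELD = re.compile(r"(?:^|(?<=\|))\s*([^|]*?)\s*(?=\||$)")

line = ("6 |UDP |ENABLED | |15006 |010.247.060.120 "
        "| UDP/IP Communications | UDP/IP Communications GH1870")

fields = FIELD.findall(line)
print(fields)

# Cross-check against a plain split-and-strip, which gives the same fields here.
print(fields == [part.strip() for part in line.split("|")])
```

The empty fourth field comes back as an empty string, so column positions are preserved.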

Regular expression for separate lat and long

I'm trying to come up with two regular expressions, one for latitude value, -85.05112878 < lat < 85.05112878, and one for longitude value, -180.0 < long < 180.0
Help is much appreciated.
It's not very pretty, but you can try this one for the latitude:
-85.05112878 < lat < 85.05112878
^(?:-?85\.0(?:000000\d*|0{1,5}(?:[1-9]\d*)?|[1-4]\d*|5(?:0\d*)?|5(?:1(?:0\d*)?)?|511(?:[0-1]\d*)?|5112(?:[0-7]\d*)?|51128(?:[0-6]\d*)?|511287[0-8]?)?0*|(?:-[1-9]|-?[1-7]\d|-?8[0-4]|\d)\.\d+)$
Expanded
^
(?:
-? 85
\.0
(?:
000000 \d*
| 0{1,5} (?: [1-9] \d* )?
| [1-4] \d*
| 5 (?: 0 \d* )?
| 5 (?: 1 (?: 0 \d* )? )?
| 511 (?: [0-1] \d* )?
| 5112 (?: [0-7] \d* )?
| 51128 (?: [0-6] \d* )?
| 511287 [0-8]?
)?
0*
|
(?:
- [1-9]
| -? [1-7] \d
| -? 8 [0-4]
| \d
)
\. \d+
)
$
And this for the longitude
-180.0 < long < 180.0
^(?:-?180\.0+|(?:-[1-9]|-?[1-9]\d|-?1[0-7]\d|\d)\.\d+)$
Expanded
^
(?:
-? 180 \. 0+
|
(?:
- [1-9]
| -? [1-9] \d
| -? 1 [0-7] \d
| \d
)
\. \d+
)
$
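A quick Python sketch that exercises the two strict patterns above on a few sample values. `LATITUDE`, `LONGITUDE`, and the helper functions are hypothetical names, and the sample values are chosen only for illustration.

```python
import re

# The two strict patterns from above, copied unchanged.
LATITUDE = re.compile(
    r"^(?:-?85\.0(?:000000\d*|0{1,5}(?:[1-9]\d*)?|[1-4]\d*|5(?:0\d*)?"
    r"|5(?:1(?:0\d*)?)?|511(?:[0-1]\d*)?|5112(?:[0-7]\d*)?"
    r"|51128(?:[0-6]\d*)?|511287[0-8]?)?0*"
    r"|(?:-[1-9]|-?[1-7]\d|-?8[0-4]|\d)\.\d+)$"
)
LONGITUDE = re.compile(
    r"^(?:-?180\.0+|(?:-[1-9]|-?[1-9]\d|-?1[0-7]\d|\d)\.\d+)$"
)

def is_latitude(text):
    return LATITUDE.match(text) is not None

def is_longitude(text):
    return LONGITUDE.match(text) is not None

for value in ["0.0", "-84.9", "85.05112877", "85.06", "90.0"]:
    print("lat ", value, is_latitude(value))
for value in ["0.0", "-45.5", "179.999", "180.0", "180.1", "54", "-0.5"]:
    print("long", value, is_longitude(value))
```

Two things worth noticing when running it: the strict forms require a decimal part, so `54` is rejected, and a negative value whose integer part is 0 (such as `-0.5`) is rejected because the single-digit negative branch is `-[1-9]`.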
edit
This is the same as above except it matches partial (valid) forms like
54
54.
54.1
etc ...
lat
^(?:-?85(?:\.(?:0(?:000000\d*|0{1,5}(?:[1-9]\d*)?|[1-4]\d*|5(?:0\d*)?|5(?:1(?:0\d*)?)?|511(?:[0-1]\d*)?|5112(?:[0-7]\d*)?|51128(?:[0-6]\d*)?|511287[0-8]?)?)?0*)?|(?:-[1-9]|-?[1-7]\d|-?8[0-4]|\d)(?:\.\d*)?)$
Expanded
^
(?:
-? 85
(?:
\.
(?:
0
(?:
000000 \d*
| 0{1,5} (?: [1-9] \d* )?
| [1-4] \d*
| 5 (?: 0 \d* )?
| 5 (?: 1 (?: 0 \d* )? )?
| 511 (?: [0-1] \d* )?
| 5112 (?: [0-7] \d* )?
| 51128 (?: [0-6] \d* )?
| 511287 [0-8]?
)?
)?
0*
)?
|
(?:
- [1-9]
| -? [1-7] \d
| -? 8 [0-4]
| \d
)
(?: \. \d* )?
)
$
long
^(?:-?180(?:\.0*)?|(?:-[1-9]|-?[1-9]\d|-?1[0-7]\d|\d)(?:\.\d*)?)$
Expanded
^
(?:
-? 180
(?: \. 0* )?
|
(?:
- [1-9]
| -? [1-9] \d
| -? 1 [0-7] \d
| \d
)
(?: \. \d* )?
)
$

Regular Expression for a single occurrence within a String

I am new to regular expressions and can't seem to get the syntax right for what I need to do. I need a regular expression for an alphanumeric string that can be 1-8 characters long and can contain at most 1 dash, but can't be a single dash alone.
Valid:
A-
-A
1234-678
ABC76-
Invalid:
-
F-1-
ABCD1234-
---
Thanks in advance!
One way. (Sorry if this is already posted)
# ^(?=[a-zA-Z0-9-]{1,8}$)(?=[^-]*-?[^-]*$)(?!-$).*$
^ # BOL
(?= [a-zA-Z0-9-]{1,8} $ ) # 1 - 8 alpha-num or dash
(?= [^-]* -? [^-]* $ ) # at most 1 dash
(?! - $ ) # not just a dash
.* $
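A short Python check of this pattern against the valid and invalid examples from the question. `CODE`, `valid`, and `invalid` are hypothetical names.

```python
import re

# The three lookaheads each enforce one rule; .* then consumes the string.
CODE = re.compile(r"^(?=[a-zA-Z0-9-]{1,8}$)(?=[^-]*-?[^-]*$)(?!-$).*$")

valid = ["A-", "-A", "1234-678", "ABC76-"]
invalid = ["-", "F-1-", "ABCD1234-", "---"]

for text in valid + invalid:
    print(repr(text), CODE.match(text) is not None)
```

Keeping each rule in its own lookahead makes it easy to add or drop a constraint without touching the others.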
Edit: Just extend it for segments separated by commas
# ^(?!,)(?:(?=(?:^|,)[a-zA-Z0-9-]{1,8}(?:$|,))(?=(?:^|,)[^-]*-?[^-]*(?:$|,))(?!(?:^|,)-(?:$|,)),?[^,]*)+(?<!,)$
^ # BOL
(?! , ) # does not start with comma
(?: # Grouping
(?=
(?: ^ | , )
[a-zA-Z0-9-]{1,8} # 1 - 8 alpha-num or dash
(?: $ | , )
)
(?=
(?: ^ | , )
[^-]* -? [^-]* # at most 1 dash
(?: $ | , )
)
(?!
(?: ^ | , )
- # not just a dash
(?: $ | , )
)
,? [^,]* # consume the segment
)+ # Grouping, do many times
(?<! , ) # does not end with comma
$ # EOL
Edit2: If your engine doesn't support lookbehinds, this is the same thing without them
# ^(?!,)(?:(?=(?:^|,)[a-zA-Z0-9-]{1,8}(?:$|,))(?=(?:^|,)[^-]*-?[^-]*(?:$|,))(?!(?:^|,)-(?:$|,))(?!,$),?[^,]*)+$
^ # BOL
(?! , ) # does not start with comma
(?: # Grouping
(?=
(?: ^ | , )
[a-zA-Z0-9-]{1,8} # 1 - 8 alpha-num or dash
(?: $ | , )
)
(?=
(?: ^ | , )
[^-]* -? [^-]* # at most 1 dash
(?: $ | , )
)
(?!
(?: ^ | , )
- # not just a dash
(?: $ | , )
)
(?! , $ ) # does not end with comma
,? [^,]* # consume the segment
)+ # End Grouping, do many times
$ # EOL
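A Python sketch that runs the lookbehind-free, comma-separated version on a few lists. `CODE_LIST`, `is_valid_list`, and the sample inputs are hypothetical and not from the question.

```python
import re

# Lookbehind-free version from above, split across lines for readability.
CODE_LIST = re.compile(
    r"^(?!,)(?:"
    r"(?=(?:^|,)[a-zA-Z0-9-]{1,8}(?:$|,))"   # 1 - 8 alpha-num or dash
    r"(?=(?:^|,)[^-]*-?[^-]*(?:$|,))"        # at most 1 dash
    r"(?!(?:^|,)-(?:$|,))"                   # not just a dash
    r"(?!,$)"                                # does not end with comma
    r",?[^,]*"                               # consume the segment
    r")+$"
)

def is_valid_list(text):
    return CODE_LIST.match(text) is not None

for text in ["A-,-A,1234-678", "ABC76-", "A-,-", "A-,", ",A", "F-1-,A"]:
    print(repr(text), is_valid_list(text))
```

Each pass through the group validates one segment with the lookaheads and then consumes it, so a single bad segment fails the whole list.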
Try this regex:
/^(?!-$)(?!([^-]*-){2})[a-zA-Z0-9-]{1,8}$/
^ and $ match the start and end of the string.
(?!-$) is a negative lookahead that rejects a string consisting of a single hyphen.
(?!([^-]*-){2}) is a negative lookahead that makes sure the string contains at most one hyphen.
[a-zA-Z0-9-]{1,8} matches 1 to 8 alphanumeric characters or hyphens.
Reference: http://regular-expressions.info