Regex to split an address string - regex

So I'm a real rookie with regex. I usually get by with a back reference to a static word in the string and then some basic functions to find what I need, but this one has me stuck.
So I have this address string "MITCHAM SA 5062", and to get it through this parser I need to split out the suburb, state and postcode.
I can get "MITCHAM" using /\w+/
And postcode "5062" using /\d+/
The state is what I'm struggling with, though. I think I'm close: I'm currently using (?!\w+) (\w+). The issue is that it still picks up the whitespace before the state, which won't be allowed in the database.
Halp pls!
Edit - A few people asked whether the state will ever be more than two letters. Correct, it could be, and it won't always be SA.
Edit 2 - Another person asked if one whole regex can capture it all. No: the way our SaaS product works, I need to map each bit of data to the correct place separately (using a GUI).

If MITCHAM SA 5062 is the full string, and you want to capture each group in one regex, then this will work:
^(\w+)\s*?(\w+)\s*(\d+)
If you are trying to capture the middle section only you can try:
\s(\w+)\s
Or if for some reason you cannot use capturing groups, this will work for the middle portion.
(?<=\s)(\w+)(?=\s+)
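As a rough check, the three patterns above can be tried with Python's re module. This is only a sketch: the sample string is the one from the question, the variable names are made up for illustration, and the target parser may use a different regex flavour.

```python
import re

address = "MITCHAM SA 5062"  # sample string from the question

# One regex, three capture groups: suburb, state, postcode.
suburb, state, postcode = re.match(r"^(\w+)\s*?(\w+)\s*(\d+)", address).groups()

# Middle section only, via a capture group between two whitespace characters.
state_group = re.search(r"\s(\w+)\s", address).group(1)

# Middle section only, via lookarounds, so the match itself contains no whitespace.
state_lookaround = re.search(r"(?<=\s)(\w+)(?=\s+)", address).group(0)

print(suburb, state, postcode)        # MITCHAM SA 5062
print(state_group, state_lookaround)  # SA SA
```

The lookaround version is the one to reach for when only the whole match (not a group) can be mapped, since the surrounding whitespace is checked but never consumed.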

Related

Regex, Grafana Loki, Promtail: Parsing a timestamp from logs using regex

I want to parse a timestamp from logs to be used by loki as the timestamp.
I'm a total noob when it comes to regex.
The log file is from "endlessh", which is essentially a tarpit/honeypot for SSH attackers.
It looks like this:
2022-04-03 14:37:25.101991388 2022-04-03T12:37:25.101Z CLOSE host=::ffff:218.92.0.192 port=21590 fd=4 time=20.015 bytes=26
2022-04-03 14:38:07.723962122 2022-04-03T12:38:07.723Z ACCEPT host=::ffff:218.92.0.192 port=64475 fd=4 n=1/4096
What I want to match, using regex, is the second timestamp on each line, since it's a UTC timestamp and should be parseable by Promtail.
I've tried different approaches, but just couldn't get it right at all.
So first of all I need a regex that matches the timestamp I want.
But secondly, I somehow need to turn it into a regex that exposes the value in some way?
The docs offer this example:
.*level=(?P<level>[a-zA-Z]+).*ts=(?P<timestamp>[T\d-:.Z]*).*component=(?P<component>[a-zA-Z]+)
AFAIK those are named groups, and that is all it takes to expose the value so I can use it in the config?
Would be nice if someone can provide a solution for the regex, and an explanation of what it does :)
You could for example create a specific pattern to match the first part, and capture the second part:
^\d{4}-\d{2}-\d{2} \d\d:\d\d:\d\d\.\d+\s+(?P<timestamp>\d{4}-\d{2}-\d{2}T\d\d:\d\d:\d\d\.\d+Z)\b
Or, if the format is always the same, use a very broad pattern that skips an exact number of non-whitespace parts and captures the part that you want to keep.
^(?:\S+\s+){2}(?P<timestamp>\S+)
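To see what each pattern captures, here is a small Python sketch run against the first log line from the question. It only checks the regexes and is not a Promtail config. Both named groups are written as (?P<timestamp>...) because that is the spelling Python's re module accepts; the variable names are illustrative.

```python
import re

# First sample line from the question.
line = ("2022-04-03 14:37:25.101991388 2022-04-03T12:37:25.101Z CLOSE "
        "host=::ffff:218.92.0.192 port=21590 fd=4 time=20.015 bytes=26")

# Specific pattern: spell out the local timestamp, then capture the UTC one.
specific = re.compile(
    r"^\d{4}-\d{2}-\d{2} \d\d:\d\d:\d\d\.\d+\s+"
    r"(?P<timestamp>\d{4}-\d{2}-\d{2}T\d\d:\d\d:\d\d\.\d+Z)\b"
)

# Broad pattern: skip two whitespace-separated fields, capture the third.
broad = re.compile(r"^(?:\S+\s+){2}(?P<timestamp>\S+)")

ts_specific = specific.match(line).group("timestamp")
ts_broad = broad.match(line).group("timestamp")

print(ts_specific)  # 2022-04-03T12:37:25.101Z
print(ts_broad)     # 2022-04-03T12:37:25.101Z
```

The specific pattern fails loudly (no match) if the line layout changes, while the broad one will capture whatever happens to be in the third field.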

What is the correct regex pattern to use to clean up Google links in Vim?

As you know, Google links can be pretty unwieldy:
https://www.google.com/search?q=some+search+here&source=hp&newwindow=1&ei=A_23ssOllsUx&oq=some+se....
I have MANY Google links saved that I would like to clean up to make them look like so:
https://www.google.com/search?q=some+search+here
The only issue is that I cannot figure out the correct regex pattern for Vim to do this.
I figure it must be something like this:
:%s/&source=[^&].*//
:%s/&source=[^&].*[^&]//
:%s/&source=.*[^&]//
But none of these are working; they start at &source, and replace until the end of the line.
Also, the search?q=some+search+here can appear anywhere after the .com/, so I cannot rely on it being in the same place every time.
So, what is the correct Vim regex pattern to use in order to clean up these links?
Your example can easily be dealt with by using a very simple pattern:
:%s/&.*
because you want to keep everything that comes before the second parameter, which is marked by the first & in the string.
But, if the q parameter can be anywhere in the query string, as in:
https://www.google.com/search?source=hp&newwindow=1&q=some+search+here&ei=A_23ssOllsUx&oq=some+se....
then no amount of capturing or whatnot will be enough to cover every possible case with a single pattern, let alone a readable one. At this point, scripting is really the only reasonable approach, preferably with a language that understands URLs.
--- EDIT ---
Hmm, scratch that. The following seems to work across the board:
:%s#^\(https://www.google.com/search?\)\(.*\)\(q=.\{-}\)&.*#\1\3
We use # as separator because of the many / in a typical URL.
We capture a first group, up to and including the ? that marks the beginning of the query string.
We match whatever comes between the ? and q= in a second group, which we don't reuse.
We capture a third group, the q parameter, up to and excluding the next &.
We replace the whole thing with the first capture group followed by the third (\1\3).
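For comparison, the scripting route mentioned earlier ("a language that understands URLs") might look roughly like this in Python, using only the standard urllib.parse module. The function name keep_only_q is made up for this sketch, and the sample URLs are shortened versions of the ones in the question, with the truncated tail left out.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def keep_only_q(url: str) -> str:
    """Return the URL with every query parameter except q removed."""
    parts = urlsplit(url)
    # parse_qsl decodes the query string into (name, value) pairs.
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k == "q"]
    # urlencode re-encodes the pairs, turning spaces back into "+".
    return urlunsplit(parts._replace(query=urlencode(kept)))


print(keep_only_q(
    "https://www.google.com/search?source=hp&newwindow=1&q=some+search+here&ei=A_23ssOllsUx"
))
# https://www.google.com/search?q=some+search+here
```

Because the query string is parsed rather than pattern-matched, the position of q does not matter, and a parameter such as oq cannot be mistaken for q.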

Find multiple occurrences of the same character using REGEXP_LIKE in Oracle

I have the following situation: I have an email database of people who want to receive promotional emails about the company, such as flash sales, new product advertisements, etc. But for some time now, people have been registering bogus email addresses like aaa#aaa.aa. I'm currently working on a way to cleanse this table, and my main issue so far has been finding the correct REGEXP_LIKE pattern to help me.
I've tried WHERE REGEXP_LIKE (email_address, '(\w){3,}'), but that's no good: it also found emails like john#doe.com. I've tried searching for a way to do this in Oracle, but no luck so far.
Can anyone assist me?
You can try one of the following patterns:
'(\w)\1{2,}'
'((\w)+)\1+'
The first pattern will detect sequences of 3 or more of the same character, for example aaa or bbb. The second pattern will detect a sequence of characters that is immediately repeated at least once, such as aa, bbb, abab, or 123123.
This works by using \1, which is a back reference to the 1st pattern surrounded by parentheses. In the first pattern the back reference refers to a pattern of exactly one character. In the second pattern, the back reference refers to a pattern of 1 or more characters.
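A quick way to see how the two patterns behave is to run them in Python, whose re module treats \w and \1 the same way for these cases. This is only a sketch and does not exercise Oracle's engine; the first two addresses come from the question and abab#mail.com is a made-up sample.

```python
import re

triple = re.compile(r"(\w)\1{2,}")    # the same character 3 or more times in a row
repeated = re.compile(r"((\w)+)\1+")  # any chunk of characters immediately repeated

samples = ["aaa#aaa.aa", "john#doe.com", "abab#mail.com"]
results = {s: (bool(triple.search(s)), bool(repeated.search(s))) for s in samples}

for address, (has_triple, has_repeat) in results.items():
    print(address, has_triple, has_repeat)
# aaa#aaa.aa True True
# john#doe.com False False
# abab#mail.com False True
```

Note that the second pattern also fires on any ordinary doubled letter (for example the "aa" in "aaron"), so it will probably flag legitimate addresses too.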

Regex to capture between hyphens

Hi, I'm trying to capture the following data to export out to another part of the program.
Ideally I would use regular expressions, as TOKEN could be problematic (it's for names, so the string will vary, especially for users abroad; I've seen people with 4+ different names).
Sample data which I want to capture from would be in this format
New Starter - First Last - test
I'd want to capture everything between the hyphens rather than the entire thing
So far I have the following regex: -([^-]+)-
Which just captures
- First Last -
(?<=-\s).+(?=\s-)
If you don't want something to appear in the match, but need to check it's there, you can use a lookahead/lookbehind.
This is assuming the same format will appear on all other inputs.
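Here is a small Python sketch of the lookaround pattern on the sample line from the question, next to a capture-group variant for engines without lookbehind support. The second pattern and the variable names are my own additions for illustration, not part of the original answer.

```python
import re

text = "New Starter - First Last - test"  # sample line from the question

# Lookbehind requires "- " before the match and lookahead " -" after it,
# but neither is included in the matched text.
match = re.search(r"(?<=-\s).+(?=\s-)", text)
name = match.group(0) if match else None

# Alternative (assumption): consume the hyphens and spaces, keep only group 1.
alt = re.search(r"-\s*([^-]+?)\s*-", text)
alt_name = alt.group(1) if alt else None

print(name)      # First Last
print(alt_name)  # First Last
```

Both variants rely on the name itself not containing " - ", which matches the assumption that every input follows the same format.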

RegEx: Match Mr. Ms. etc in a "Title" Database field

I need to build a RegEx expression which gets its text strings from the Title field of my Database. I.e. the complete strings being searched are: Mr. or Ms. or Dr. or Sr. etc.
Unfortunately this field was a free field and anything could be written into it. e.g.: M. ; A ; CFO etc.
The expression needs to match on everything except: Mr. ; Ms. ; Dr. ; Sr. (NOTE: The list is a bit longer but for simplicity I keep it short.)
WHAT I HAVE TRIED SO FAR:
This is what I am using successfully on another field:
^(?!(VIP)$).* (This will match every string except "VIP")
I rewrote that expression to look like this:
^(?!(Mr.|Ms.|Dr.|Sr.)$).*
Unfortunately this did not work. I assume this is because the "." (dot) is a reserved symbol in RegEx and needs special handling.
I also tried:
^(?!(Mr\.|Ms\.|Dr\.|Sr\.)$).*
But no luck there either.
I looked around in the forum and tested some other solutions but could not find any which works for me.
I would like to know how I can build my formula to search the complete (short) string and match everything except "Mr." etc. Any help is appreciated!
Note: My Question might seem unusual and seems to have many open ends and possible errors. However the rest of my application is handling those open ends. Please trust me with this.
If you want your string simply to not start with one of those prefixes, then do this:
^(?!([MDS]r|Ms)\.).*$
The above simply ensures that the beginning of the string (^) is not followed by one of your listed prefixes. (You shouldn't even need the .*$ but this is in case you're using some engine that requires a complete match.)
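As a quick sanity check, the start-of-string pattern can be run in Python against a few values. This is only a sketch: "M.", "A" and "CFO" come from the question, while "Dr. Smith" and "Mrs." are made-up samples.

```python
import re

# Negative lookahead at the start: reject strings that begin with Mr. / Dr. / Sr. / Ms.
not_prefixed = re.compile(r"^(?!([MDS]r|Ms)\.).*$")

titles = ["Mr.", "Dr. Smith", "M.", "CFO", "A", "Mrs."]
accepted = [t for t in titles if not_prefixed.match(t)]

print(accepted)  # ['M.', 'CFO', 'A', 'Mrs.']
```

"Mrs." is accepted because the lookahead only lists Mr, Dr, Sr and Ms followed directly by a dot; any extra prefixes have to be added to the alternation.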
If you want your string to not have those prefixes anywhere, then do:
^((?!([MDS]r|Ms)\.).)*$
The above ensures that no position in the string, including the very start, begins one of your listed prefixes, all the way to the end (so the $ is necessary in this one).
I just read that your list of prefixes may be longer, so here is the same pattern with the alternatives spelled out, which makes it easier to add more:
^((?!(Mr|Ms|Dr|Sr)\.).)*$
If you mean strings that consist entirely of one of the prefixes, then just do this:
^(?!(Mr|Ms|Dr|Sr)\.$).*$
And if you want to make the dot optional:
^(?!(Mr|Ms|Dr|Sr)\.?$).*$
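Using | (alternation), we can list any number of prefixes to match against the string.
// Dots are escaped so they match a literal "."; an optional separator character
// may follow the prefix, and the name is one or more letters.
var pattern = /^(Mrs\.|Mr\.|Ms\.|Dr\.|Er\.).?[A-Za-z]+$/;
var str = "Mrs.Panchal";
console.log(str.match(pattern));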
this may do it
/(?!.*?(?:^|\W)(?:(?:Dr|Mr|Mrs|Ms|Sr|Jr)\.?|Miss|Phd|\+|&)(?:\W|$))^.*$/i
from that page I mentioned
Rather than trying to construct a regex that matches anything except Mr., Ms., etc., it would be easier (if your application allows it) to write a regex that matches only those strings:
/^(Mr|Ms|Dr|Sr)\.$/
and just swap the logic for handling matching vs non-matching strings.
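A minimal Python sketch of that "match the titles, then invert" idea follows. The helper name is_free_text is hypothetical, and the sample values are taken from the question.

```python
import re

# Matches only the listed titles, written exactly with a trailing dot.
title_re = re.compile(r"^(Mr|Ms|Dr|Sr)\.$")


def is_free_text(title: str) -> bool:
    """True when the field is anything other than one of the listed titles."""
    return title_re.match(title) is None


titles = ["Mr.", "Ms.", "M.", "A", "CFO", "Dr."]
free_text = [t for t in titles if is_free_text(t)]

print(free_text)  # ['M.', 'A', 'CFO']
```

Keeping the regex positive means the list of titles stays easy to read and extend, and the negation lives in ordinary code rather than in a lookahead.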
re.sub(r'^([MmDdSs][RSrs]{1,2}|[Mm]iss)\.{0,1} ','',name)