Regex exclude trailing text from company names - regex

CURRENTLY
I am trying to match valid company names from strings with 4 conditions:
the name can ONLY contain alphanumeric characters + spaces + hyphens
the name can contain a hyphen (inside the name)
there are company suffixes that should be excluded from the company name i.e. Pty Ltd, Pty. Ltd., Limited, and Ltd.
If there are additional matches on the same line, these are to be excluded
What I am trying to achieve:
My regex so far:
(?:\s|^)([a-zA-Z0-9]+[a-zA-Z0-9\s-]*?[a-zA-Z0-9]+)(?: Pty Ltd| Ltd(\.){0,1}| Limited){0,1}(?:\s|$)
ISSUES
https://regex101.com/r/Gpbdln/4
It seems I am struggling with:
Excluding the suffixes to be ignored
Making the capture include spaces for the company name (while at the same time excluded suffixes)
I have been stuck on this for over an hour and would appreciate some help.

You may use
^[a-zA-Z0-9]+(?:[\s-]+[a-zA-Z0-9]+)*?(?=(?:\s+(?:(?:Pty\.?\s+)?Ltd\.?|Limited|[a-zA-Z0-9]*[^a-zA-Z0-9\s]).*)?$)
See the regex demo
If you only need to get matches that do not span across lines, replace \s with \h or [\p{Zs}\t] if supported, or [^\S\r\n], to only match horizontal whitespaces.
Details
^ - start of string
[a-zA-Z0-9]+ - 1+ ASCII alphanumeric chars
(?:[\s-]+[a-zA-Z0-9]+)*? - 0 or more (but as few as possible) occurrences of
[\s-]+ - 1+ whitespaces or hyphens
[a-zA-Z0-9]+ - 1+ ASCII alphanumeric chars
(?=(?:\s+(?:(?:Pty\.?\s+)?Ltd\.?|Limited|[a-zA-Z0-9]*[^a-zA-Z0-9\s]).*)?$) - immediately to the right, there must be
(?:\s+(?:(?:Pty\.?\s+)?Ltd\.?|Limited|[a-zA-Z0-9]*[^a-zA-Z0-9\s]).*)? - an optional occurrence of a sequence of patterns:
\s+ - 1+ whitespaces
(?:(?:Pty\.?\s+)?Ltd\.?|Limited|[a-zA-Z0-9]*[^a-zA-Z0-9\s]) - any of
(?:Pty\.?\s+)?Ltd\.?| - an optional sequence of Pty, an optional dot and then 1+ whitespaces and then Ltd string and an optional . char, or
Limited| - Limited string, or
[a-zA-Z0-9]*[^a-zA-Z0-9\s] - any 0 or more ASCII alphanumeric chars followed with a char other than whitespace and alphanumeric char
.* - the rest of the string
$ - end of string.
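As a quick check of the pattern above, here is a minimal Python sketch. The helper name company_name and the sample strings are assumptions made for illustration; only the regex itself comes from the answer.

```python
import re

# The regex from the answer, split over two lines for readability.
COMPANY_NAME = re.compile(
    r"^[a-zA-Z0-9]+(?:[\s-]+[a-zA-Z0-9]+)*?"
    r"(?=(?:\s+(?:(?:Pty\.?\s+)?Ltd\.?|Limited|[a-zA-Z0-9]*[^a-zA-Z0-9\s]).*)?$)"
)


def company_name(line):
    """Return the company name without its suffix, or None if nothing matches.

    Hypothetical helper, not part of the original answer.
    """
    m = COMPANY_NAME.search(line)
    return m.group(0) if m else None


if __name__ == "__main__":
    # Sample inputs are made up for this sketch.
    for sample in ("Acme Widgets Pty Ltd", "Foo-Bar Limited", "Plain Name"):
        print(repr(sample), "->", repr(company_name(sample)))
```

Because the repeated group is lazy, the match grows one word at a time and stops as soon as the lookahead sees a suffix (or other trailing text) followed by the end of the string.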

Related

regex for this "001/Cnt.A/2021/EX.Dng" pattern

I have some string like this below:
0015/Cnt.A/2021/EX. Mmj tech
021/Cnt.B/2021/EX.Mm logs
31/ Cgt.A / 2020 / PK Jap
453/ Nnt.A / 2020 / WK Jap pom sc
13/Wnt.A/2021/ LO.Mm pom
1911/Cno.A/2021/PQ Mm ris dMn
and I want to select for output like this below:
0015/Cnt.A/2021/EX. Mmj
021/Cnt.B/2021/EX.Mm
31/ Cgt.A / 2020 / PK Jap
453/ Nnt.A / 2020 / WK Jap
13/Wnt.A/2021/ LO.Mm
1911/Cno.A/2021/PQ Mm
I have tried this pattern [0-9]{1,}\/[a-zA-Z.\s-]{1,}\/[0-9\s]{1,}\/[a-zA-Z\s]+[\.\s]+[a-zA-Z]{1,} but it can't handle the 4th and 6th strings. Can anyone fix that pattern, and maybe make it more efficient?
edited:
There is a rule like this pattern -> number/letter with dot or space/year/letter with dot or space
The pattern to get all text up to the last slash and then only two words separated with whitespaces or dots is either of the following:
.*\/\s*[a-zA-Z]+[\s.]+[a-zA-Z]+
.*\/\s*\w+[\s.]+\w+
If you need to keep the initial regex part for stricter validation, use
[0-9]+\/[a-zA-Z.\s-]+\/[0-9\s]+\/\s*\w+[\s.]+\w+
See this demo (or this demo). Details:
.*\/ - any zero or more chars other than line break chars, as many as possible, and then a / char
\s* - zero or more whitespaces
[a-zA-Z]+ - one or more ASCII letters
[\s.]+ - one or more whitespaces/dots
[a-zA-Z]+ - one or more ASCII letters.
\w+ would match one or more letters, digits, or underscores.
Now, accommodating for the number/letter with dot or space/year/letter with dot or space rule:
\d+\/\s*[a-zA-Z]+(?:\.[a-zA-Z]+)*\s*\/\s*[0-9]{4}\s*\/\s*\w+[\s.]+\w+
See this regex demo. Details:
\d+ - one or more digits
\/ - a / char
\s* - zero or more whitespaces
[a-zA-Z]+(?:\.[a-zA-Z]+)* - one or more ASCII letters, then zero or more sequences of a . and one or more ASCII letters
\s*\/\s* - 0+ whitespaces, /, 0+ whitespaces
[0-9]{4} - four digits
\s*\/\s* - 0+ whitespaces, /, 0+ whitespaces
\w+[\s.]+\w+ - one or more word chars, 1+ whitespaces/dots, 1+ word chars.
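The stricter pattern can be checked against the six sample strings from the question with a short Python sketch. The function name extract_reference is an assumption made for illustration; the regex and the inputs are taken from the thread.

```python
import re

# The stricter pattern from the answer: number / letters with dots / year / two words.
REFERENCE = re.compile(
    r"\d+\/\s*[a-zA-Z]+(?:\.[a-zA-Z]+)*\s*\/\s*[0-9]{4}\s*\/\s*\w+[\s.]+\w+"
)


def extract_reference(text):
    """Return the first matching reference in text, or None (hypothetical helper)."""
    m = REFERENCE.search(text)
    return m.group(0) if m else None


if __name__ == "__main__":
    samples = [
        "0015/Cnt.A/2021/EX. Mmj tech",
        "021/Cnt.B/2021/EX.Mm logs",
        "31/ Cgt.A / 2020 / PK Jap",
        "453/ Nnt.A / 2020 / WK Jap pom sc",
        "13/Wnt.A/2021/ LO.Mm pom",
        "1911/Cno.A/2021/PQ Mm ris dMn",
    ]
    for s in samples:
        print(extract_reference(s))
```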

Match string that contains punctuations, emojis, special characters, some Chinese characters and alpha numeric

I have a string which has the following format:
Foo/FooVersion some info
Foo can contain:
punctuations
special characters
emojis
alpha numeric
Chinese characters
I have this regex to capture the following pattern:
^[\+$-¨™®é!?_ó–:—🔥😘兼职,.&\w\s]+\/\d+[\+\w.-]*
This seems like quite an exhaustive character set, and I am not sure it covers all the characters. What I am looking for is a simplified regex that takes these characters into account and returns true if there is a match. I am using SQL.
FooVersion can consist of:
a leading digit followed by word characters, dots, or hyphens
You could use a pattern such as ([^\/]+)\/\1Version.+
Pattern explanation:
([^\/]+) - [^\/]+ matches one or more characters other than / (this is a negated character class), () means a capturing group, so the matched text is put into the first capturing group
\/ - match / literally
\1 - back reference to match the same text as was matched by first capturing group
Version - match Version literally
.+ - match one or more of any characters (to match rest of a string - this is optional and can be removed)
Regex demo
Update
To match updated requirements, you should use ([^\/]+)\/\d[a-zA-Z\d.-]+
What's new is:
[a-zA-Z\d.-]+ - match one or more characters from the set a-z (lowercase letters), A-Z (uppercase letters), \d (digits), and . or - (dot or hyphen)
Updated demo
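The question targets SQL, whose regex functions differ between databases, so the following is only a Python sketch of how the updated pattern behaves. The function name has_foo_version and the sample strings are assumptions made for illustration.

```python
import re

# Updated pattern from the answer: anything but "/", then "/", a digit,
# and one or more letters, digits, dots, or hyphens.
FOO_VERSION = re.compile(r"([^\/]+)\/\d[a-zA-Z\d.-]+")


def has_foo_version(text):
    """Return True if text contains a Foo/FooVersion pair (hypothetical helper)."""
    return FOO_VERSION.search(text) is not None


if __name__ == "__main__":
    # Made-up inputs; the first mixes Chinese characters, an emoji, and punctuation.
    for sample in ("兼职🔥 Foo!/12.0-rc1 some info", "Foo/1.2.3-beta some info", "no slash here"):
        print(repr(sample), "->", has_foo_version(sample))
```

The negated class [^\/]+ accepts emojis, punctuation, and Chinese characters without listing them one by one, which is what makes the pattern shorter than the original character set. Note that \d[a-zA-Z\d.-]+ needs at least two characters after the slash, so a one-character version such as Foo/1 is not matched.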

Regex Pattern for a Java Log

I am trying to use the regex Parser Plugin in fluentd to index the logs of my application.
Here's a snippet of it.
2020-05-06T22:34:50.860-0700 - WARN [main] o.s.b.GenericTypeAwarePropertyDescriptor: Invalid JavaBean property 'pipeline' being accessed! Ambiguous write methods found next to actually used [public void com.theoaal.module.pipeline.mbean.DynamicPhaseExecutionConfigurationMBeanBuilder.setPipeline(com.theplatform.module.pipeline.DynamicPipeline)]: [public void com.theplatform.module.pipeline.mbean.PhaseExecutionConfigurationMBeanBuilder.setPipeline(com.theoaal.module.pipeline.Pipeline)]
I have used the regex101.com to match the regex pattern and I am not able to get a match.
^(?<date>\d{4}\-\d{2}\-\d{2})(?<timestamp>[A-Z][a-z]{1}\d{2}:\d{2}:\d{2}.\d{3}\-\d{4})\s\-\s(?<loglevel>\[\w\]{6})\s+(?<class>\[[A-Z][a-z]+\])\s(?<message>.*)$
Kindly help.
Thanks
You may use
^(?<date>\d{4}-\d{2}-\d{2})[A-Z](?<timestamp>\d{2}:\d{2}:\d{2}\.\d{3}-\d{4})\s+-\s+(?<loglevel>\w+)\s+(?<class>\[\w+\])\s+(?<message>.*)
See the regex demo
Note, in your pattern, \[\w\]{6} only matches [, a single word char and six ] chars. In the timestamp pattern, [A-Z][a-z]{1} requires two letters, but there is a single T. Your "class" pattern requires a capitalized word with [A-Z][a-z]+, but main is all lowercase. You escape - outside of character classes unnecessarily, and you failed to escape a literal dot in the pattern.
Details
^ - start of string
(?<date>\d{4}-\d{2}-\d{2}) - date: 4 digits, -, 2 digits, -, 2 digits
[A-Z] - an uppercase ASCII letter
(?<timestamp>\d{2}:\d{2}:\d{2}\.\d{3}-\d{4}) - 2 digits, :, 2 digits, :, 2 digits, ., 3 digits, - and 4 digits
\s+-\s+ - - enclosed with 1+ whitespaces
(?<loglevel>\w+) - 1+ word chars
\s+ - 1+ whitespaces
(?<class>\[\w+\]) - [, 1+ word chars, ]
\s+ - 1+ whitespaces
(?<message>.*) - the rest of the line.
Copy and paste to fluent.conf or td-agent.conf:
<source>
type tail
path /var/log/foo/bar.log
pos_file /var/log/td-agent/foo-bar.log.pos
tag foo.bar
format /^(?<date>\d{4}-\d{2}-\d{2})[A-Z](?<timestamp>\d{2}:\d{2}:\d{2}\.\d{3}-\d{4})\s+-\s+(?<loglevel>\w+)\s+(?<class>\[\w+\])\s+(?<message>.*)/
</source>
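To try the pattern outside fluentd, a minimal Python sketch follows. Python's re module spells named groups as (?P<name>...) rather than (?<name>...), so the pattern is adapted accordingly; the function name parse_log_line is an assumption, and the sample line is a shortened version of the log entry from the question.

```python
import re

# Same pattern as in the answer, with Python's (?P<name>...) named-group syntax.
LOG_LINE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2})[A-Z]"
    r"(?P<timestamp>\d{2}:\d{2}:\d{2}\.\d{3}-\d{4})"
    r"\s+-\s+(?P<loglevel>\w+)"
    r"\s+(?P<class>\[\w+\])"
    r"\s+(?P<message>.*)"
)


def parse_log_line(line):
    """Return a dict of the named groups, or None if the line does not match.

    Hypothetical helper, not part of the original answer.
    """
    m = LOG_LINE.match(line)
    return m.groupdict() if m else None


if __name__ == "__main__":
    sample = (
        "2020-05-06T22:34:50.860-0700 - WARN [main] "
        "o.s.b.GenericTypeAwarePropertyDescriptor: Invalid JavaBean property "
        "'pipeline' being accessed!"
    )
    for key, value in parse_log_line(sample).items():
        print(f"{key}: {value}")
```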

Regex forward slash separator

I am using the below regex expression to ensure string is max 50 characters in length and that each word starts with uppercase letter:
reMatch("Jet Black","^(?=.{0,50}$)(^|^([A-Z][a-z]* +)*([A-Z][a-z]* *)$)")
This works, but I would also like to allow for option to separate words with / character. Example: Jet/Black and Jet / Black with a space in between.
Your suggestions are highly appreciated! Mike.
If you do not care whether there may be several spaces, several slashes, or intermingled spaces and slashes, you may use
^(?=.{0,50}$)(?:[A-Z][a-z]*(?:[ /]+[A-Z][a-z]*)*)?$
See the regex demo.
To only allow whitespace and an optional single slash, with optional whitespace after it, between words, use
^(?=.{0,50}$)(?:[A-Z][a-z]*(?:\s*(?:/\s*)?[A-Z][a-z]*)*)?$
See this regex demo
Details
^ - start of string
(?=.{0,50}$) - string should contain only 0 to 50 chars other than linebreak chars (same as (?!.{51}))
(?:[A-Z][a-z]*(?:\s*(?:/\s*)?[A-Z][a-z]*)*)? - an optional sequence of
[A-Z][a-z]* - an uppercase ASCII letter and 0+ lowercase ASCII letters
(?:\s*(?:/\s*)?[A-Z][a-z]*)* - 0 or more sequences of
\s* - 0+ whitespaces
(?:/\s*)? - an optional / and 0+ whitespaces
[A-Z][a-z]* - an uppercase ASCII letter and 0+ lowercase ASCII letters
$ - end of string.
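The question uses reMatch, but the difference between the two patterns can be seen with a short Python sketch. The names LOOSE, STRICT, and is_valid_title are assumptions made for illustration.

```python
import re

# First pattern: any run of spaces and slashes may separate words.
LOOSE = re.compile(r"^(?=.{0,50}$)(?:[A-Z][a-z]*(?:[ /]+[A-Z][a-z]*)*)?$")

# Second pattern: whitespace plus at most one slash between words.
STRICT = re.compile(r"^(?=.{0,50}$)(?:[A-Z][a-z]*(?:\s*(?:/\s*)?[A-Z][a-z]*)*)?$")


def is_valid_title(text, pattern=STRICT):
    """Return True if text satisfies the given pattern (hypothetical helper)."""
    return pattern.search(text) is not None


if __name__ == "__main__":
    for sample in ("Jet Black", "Jet/Black", "Jet / Black", "Jet//Black", "jet black"):
        print(repr(sample), is_valid_title(sample, LOOSE), is_valid_title(sample, STRICT))
```

Both patterns also accept the empty string, because the whole word sequence is wrapped in an optional group, which mirrors the (^|...) alternative in the original expression.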

Greedy regex quantifier not matching password criteria

/(^[a-zA-Z]+-?[a-zA-Z0-9]+){5,15}$/g
regex criteria
match length must be between 6 and 16 characters inclusive
must start with a letter only
must contain letters, numbers and one optional hyphen
must not end with a hyphen
The above regular expression doesn't satisfy all 4 conditions. I tried moving the ^ before the group and omitting the + quantifiers, but that doesn't work.
You are setting the limiting quantifier on a group that already has quantified subpatterns, thus, the length restriction won't work.
To set the length restriction, add the (?=.{6,16}$) lookahead after ^ and then feel free to set your consuming pattern.
You may use
/^(?=.{6,16}$)[a-zA-Z][a-zA-Z0-9]*(?:-[a-zA-Z0-9]+)?$/
See the regex demo. Note you should not use g modifier when validating the whole input string against a regex.
Details
^ - start of string
(?=.{6,16}$) - 6 to 16 chars in the string input allowed/required
[a-zA-Z] - a letter as the first char
[a-zA-Z0-9]* - 0+ alphanumeric chars
(?:-[a-zA-Z0-9]+)? - an optional sequence of - and then 1+ alphanumeric chars
$ - end of string.
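The question uses a JavaScript regex literal, but the same pattern works unchanged in Python, so here is a minimal sketch for checking it against the four criteria. The function name is_valid_password and the sample inputs are assumptions made for illustration.

```python
import re

# The lookahead enforces the total length; the rest enforces the structure.
PASSWORD = re.compile(r"^(?=.{6,16}$)[a-zA-Z][a-zA-Z0-9]*(?:-[a-zA-Z0-9]+)?$")


def is_valid_password(text):
    """Return True if text meets the length, start, hyphen, and ending rules.

    Hypothetical helper, not part of the original answer.
    """
    return PASSWORD.search(text) is not None


if __name__ == "__main__":
    # Made-up inputs covering each rule.
    for sample in ("abc-123", "abcdef", "abcde", "1abcdef", "abcdef-", "ab-cd-ef"):
        print(repr(sample), "->", is_valid_password(sample))
```

Keeping the length check in a lookahead means the consuming part of the pattern can use its own quantifiers freely, which is the point made at the start of the answer.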
All you need
^(?i)(?=.{6,16}$)(?!.*-.*-)[a-z][a-z\d-]*\d[a-z\d-]*(?<!-)$
Readable
^
(?i)
(?= .{6,16} $ ) # 6 - 16 chars
(?! .* - .* - ) # Not 2 dashes
[a-z] # Start letter
[a-z\d-]* # Optional letters, digits, dashes
\d # Must be digit
[a-z\d-]* # Optional letters, digits, dashes
(?<! - ) # Not end in dash
$
Well, at least my regex forces a digit to be present.
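The readable form above maps onto Python's verbose mode. Python only accepts a global inline flag such as (?i) at the very start of the expression, so this sketch drops it and passes re.IGNORECASE instead; that change, the function name requires_digit_password, and the sample inputs are assumptions made for illustration.

```python
import re

# Verbose version of the second answer; whitespace and "#" comments are ignored.
PASSWORD_WITH_DIGIT = re.compile(
    r"""
    ^
    (?= .{6,16} $ )    # 6 to 16 characters in total
    (?! .* - .* - )    # not two hyphens
    [a-z]              # starts with a letter
    [a-z\d-]*          # optional letters, digits, hyphens
    \d                 # at least one digit is required
    [a-z\d-]*          # optional letters, digits, hyphens
    (?<! - )           # does not end with a hyphen
    $
    """,
    re.IGNORECASE | re.VERBOSE,
)


def requires_digit_password(text):
    """Return True if text passes the stricter, digit-requiring pattern."""
    return PASSWORD_WITH_DIGIT.search(text) is not None


if __name__ == "__main__":
    for sample in ("abc-123", "ABC123", "abcdef", "abcde1-", "a-b-c1"):
        print(repr(sample), "->", requires_digit_password(sample))
```

Unlike the first answer, this pattern rejects all-letter input such as abcdef, since it demands at least one digit.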