Regex to capture alpha numeric before pipe separated - regex

I've been trying to create a regex that matches spaces and alphanumeric values.
Below are the sample strings.
Manchester United 8547|12345678910
|12345678910
Manchester |12345678910
124587933 |12345678910
8457 Manchester United|12345678910
Manchester United|12345678910
I want to capture everything before the pipe (|) separator. Sometimes there is only whitespace and no alphanumeric value before the pipe, as shown in the 2nd example. The regex should not capture the pipe or the numeric value after it (12345678910).
I've tried the regexes below, but none of them works for me.
^.*$
^[\s\w\d]+$
[a-zA-Z0-9\s]+
[a-zA-Z0-9\s\W]+
^[\sa-z|A-Z|0-9]+$
[^\s]*$
([^\"]*)
^[a-zA-Z0-9]$
^([^?]*)$
.+?(?=\w)
\s[a-zA-Z0-9]+
^[\sa-zA-Z0-9]+
I need a full match, not a group match.
For example, for
Manchester 8457 the regex would be Manchester \d+. This gives me a full match, not a group match.

You can try this.
input.substring(0,input.indexOf("|"))
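One caveat with this one-liner: in Java, indexOf returns -1 when the line has no pipe, and substring(0, -1) then throws StringIndexOutOfBoundsException, so it assumes every line contains a pipe. Below is a minimal sketch of the same idea in Python (used here only for illustration; before_pipe is a hypothetical helper name). Unlike the Java version, str.partition returns the whole line when no pipe is present.

```python
def before_pipe(line: str) -> str:
    """Return the text to the left of the first pipe (the whole line if there is none)."""
    head, _sep, _tail = line.partition("|")
    return head


samples = [
    "Manchester United 8547|12345678910",
    " |12345678910",
    "Manchester |12345678910",
    "no pipe here",  # hypothetical line without a pipe
]
results = [before_pipe(s) for s in samples]
print(results)
```

Note that whitespace before the pipe is kept as-is; strip it afterwards if it is not wanted.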

If you want to match alphanumeric before the pipe and not get a group match, but a match only, you can use a character class with a positive lookahead (?=\|) (if that is supported) to assert the pipe at the right.
^[A-Za-z0-9 ]+(?=\|)
Regex demo
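As a rough check of the lookahead pattern against the sample lines, here is a small Python sketch (the question looks like Java, but the lookahead syntax is the same there). The helper name prefix and the last sample line, which has nothing before the pipe, are assumptions added for illustration.

```python
import re

# Letters, digits and spaces, but only when a pipe follows immediately.
PREFIX = re.compile(r"^[A-Za-z0-9 ]+(?=\|)")


def prefix(line):
    """Return the full match before the pipe, or None if nothing matches."""
    m = PREFIX.match(line)
    return m.group(0) if m else None


samples = [
    "Manchester United 8547|12345678910",
    " |12345678910",
    "Manchester |12345678910",
    "124587933 |12345678910",
    "8457 Manchester United|12345678910",
    "|12345678910",  # hypothetical: nothing at all before the pipe
]
results = [prefix(s) for s in samples]
print(results)
```

The match keeps any spaces before the pipe, and the pipe itself is never part of the match. Because of the + quantifier, a line with nothing before the pipe gives no match at all; changing + to * would give an empty match instead.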

Assuming that every line would have a pipe, you could split the input string on CRLF, and then extract the portion to the left of the pipe:
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

String input = "Manchester United 8547|12345678910\n |12345678910\nManchester |12345678910\n124587933 |12345678910\n8457 Manchester United|12345678910\n Manchester United|12345678910\n";
// Split into lines, then keep the trimmed text to the left of the first pipe.
String[] parts = input.split("\r?\n");
List<String> contents = Arrays.stream(parts)
    .map(x -> x.split("\\|")[0].trim())
    .collect(Collectors.toList());
System.out.println(contents);
This prints:
[Manchester United 8547, , Manchester, 124587933, 8457 Manchester United, Manchester United]

To get the alphanumeric part, use the following:
^\s*\w(.+?)\|
This should answer your question, I guess.
^(.+?)\|
Try this one; it is anchored only at the beginning of the string, and the \| at the end is for the pipe.
Try it here

Related

Regex - Match a string up to a digit or a specific string

I am working in Python and I have a list of countries that I would like to clean. Most countries are already written the way I want them to be. However, some country names have a one- or two-digit number attached, or there is text in brackets appended. Here's a sample of that list:
Argentina
Australia1
Bolivia (Plurinational State of)
China, Hong Kong Special Administrative Region
Côte d'Ivoire
Curaçao
Guinea-Bissau
Indonesia8
The part that I want to capture would look like this:
Argentina
Australia
Bolivia
China, Hong Kong Special Administrative Region
Côte d'Ivoire
Curaçao
Guinea-Bissau
Indonesia
The best solution that I was able to come up with is ^[a-zA-Z\s,ô'ç-]+. However, this leaves country names that are followed by a text in parentheses with a trailing white space.
This means I would like to match the entire country name unless there is a digit or a white space followed by an open bracket; in that case I would like it to stop before the digit or the (.
I know that I could probably solve this in two steps but I am also reasonably sure that it should be possible to define a pattern that can do it in one step. Since I am anyway in the process of getting familiar with regex, I thought this would be a nice thing to know.
The pattern can be written as matching any character except digits, parentheses, or whitespace characters. That part can then be optionally repeated, each time preceded by a space.
^[^\d\s()]+(?: [^\d\s()]+)*
^ Start of string
[^\d\s()]+ Match 1+ times any char except a digit, whitespace char or parenthesis using a negated character class
(?: Non capture group to repeat as a whole part
[^\d\s()]+ Same match as above
)* Close the non capture group and optionally repeat it
Regex demo
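A quick way to try this pattern in Python against the list from the question (a minimal sketch; the variable names are only for illustration):

```python
import re

# Runs of characters that are not digits, whitespace or parentheses,
# joined by single spaces.
NAME = re.compile(r"^[^\d\s()]+(?: [^\d\s()]+)*")

countries = [
    "Argentina",
    "Australia1",
    "Bolivia (Plurinational State of)",
    "China, Hong Kong Special Administrative Region",
    "Côte d'Ivoire",
    "Curaçao",
    "Guinea-Bissau",
    "Indonesia8",
]
cleaned = [NAME.match(c).group(0) for c in countries]
print(cleaned)
```

Because every repetition must start with a space followed by an allowed character, the match stops before " (" and therefore leaves no trailing whitespace.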
I suggest you simply convert the strings you don't want to empty strings, using the regular expression
\d+$| +\(.*\)
with the multiline flag set, causing ^ and $ to respectively match the beginning and end of a line, rather than the beginning and end of the string.
Demo
The expression matches one or more digits at the end of a line or one or more spaces followed by a string that is enclosed in matching parentheses.
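In Python this replacement approach could look roughly like the sketch below, assuming the countries arrive as one newline-separated string (the variable names are illustrative):

```python
import re

text = "\n".join([
    "Argentina",
    "Australia1",
    "Bolivia (Plurinational State of)",
    "China, Hong Kong Special Administrative Region",
    "Côte d'Ivoire",
    "Curaçao",
    "Guinea-Bissau",
    "Indonesia8",
])

# Remove trailing digits, or spaces followed by a parenthesised suffix.
# re.MULTILINE lets $ match at the end of every line, not only at the end of the string.
cleaned = re.sub(r"\d+$| +\(.*\)", "", text, flags=re.MULTILINE).splitlines()
print(cleaned)
```

This removes the unwanted parts rather than matching the wanted part, so names that need no cleaning pass through unchanged.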
I think you can try ^([^\d \n]| +[^\d (\n])+ or, if you can guarantee your input doesn't contain double-spaces, the slightly simpler ^([^\d \n]| [^\d(\n])+
(The ^ character inside [] excludes the following characters, see https://regexone.com/lesson/excluding_characters)
Technically, the regex I've given omits trailing spaces, but for your application it doesn't sound like that would be a bad thing.
You can test the regex here https://regex101.com/r/dupn18/1
This should do the trick
In [1]: import re
In [2]: pattern = re.compile(r'(.+(?=\d| \()|.+)')
In [3]: data = """Argentina
...: Australia1
...: Bolivia (Plurinational State of)
...: China, Hong Kong Special Administrative Region
...: Côte d'Ivoire
...: Curaçao
...: Guinea-Bissau
...: Indonesia8""".splitlines()
In [4]: [pattern.search(country).group() for country in data]
Out[4]:
['Argentina',
'Australia',
'Bolivia',
'China, Hong Kong Special Administrative Region',
"Côte d'Ivoire",
'Curaçao',
'Guinea-Bissau',
'Indonesia']

Regex for putting comma before city name in address

Addresses generally come comma-separated and can be split using a simple regex, e.g.
123 Main St, Los Angeles, CA, 90210
We can apply a regex here and split on the commas. But in my database, addresses are stored without commas, e.g.
A Better Property Management<br/> 6621 E PACIFIC COAST HWY<br/> STE 255<br/> LONG BEACH CA 90803-4241
I want to put a comma before the city. Something like this:
A Better Property Management<br/> 6621 E PACIFIC COAST HWY<br/> STE 255<br/> LONG BEACH ,CA 90803-4241
I was thinking about finding the last two-letter word from the end and putting a comma before it using regex. But I also need to account for situations where the address is incomplete or the city and PIN code are missing. Is there a way this can be done? I only found solutions that split on commas, not the reverse.
I was thinking if we could select the last 2 words before numbers with something like [A-Za-z]{2} (don't know if this is correct). And at the same time if we can check to do this only if the string ends with numbers.
I tried
(\b(AL|AK|AS|AZ|AR|CA|CO|CT|DE|DC|FM|FL|GA|GU|HI|ID|IL|IN|IA|KS|KY|LA|ME|MH|MD|MA|MI|MN|MS|MO|MT|NE|NV|NH|NJ|NM|NY|NC|ND|MP|OH|OK|OR|PW|PA|PR|RI|SC|SD|TN|TX|UT|VT|VI|VA|WA|WV|WI|WY|Alabama|Alaska|Arizona|Arkansas|California|Colorado|Connecticut|Delaware|District of Columbia|Florida|Georgia|Hawaii|Idaho|Illinois|Indiana|Iowa|Kansas|Kentucky|Louisiana|Maine|Maryland|Massachusetts|Michigan|Minnesota|Mississippi|Missouri|Montana|Nebraska|Nevada|New Hampshire|New Jersey|New Mexico|New York|North Carolina|North Dakota|Ohio|Oklahoma|Oregon|Pennsylvania|Rhode Island|South Carolina|South Dakota|Tennessee|Texas|Utah|Vermont|Virginia|Washington|West Virginia|Wisconsin|Wyoming)\b)
https://regex101.com/r/75fqO6/1
You can use
[a-zA-Z]+\s+\d(?:[\d-]*\d)?$
Replace with ,$0.
See the regex demo. Details:
[a-zA-Z]+ - one or more letters
\s+ - one or more whitespaces
\d - a digit
(?:[\d-]*\d)? - an optional substring of zero or more digits/hyphens and then a digit
$ - end of string.
The $0 in the replacement is a backreference to the whole match value, all text matched by the regex is put back where it was found with a prepended comma.
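For reference, a minimal Python sketch of the same replacement. Python's re module writes the whole-match backreference as \g<0> rather than $0; add_comma is a hypothetical helper name, and the second address is an assumed incomplete record.

```python
import re

# A word followed by whitespace and a digit/hyphen run at the very end of the string.
STATE_ZIP = re.compile(r"[a-zA-Z]+\s+\d(?:[\d-]*\d)?$")


def add_comma(address: str) -> str:
    """Prepend a comma to the trailing 'word + number' part, if there is one."""
    return STATE_ZIP.sub(r",\g<0>", address)


full = "A Better Property Management<br/> 6621 E PACIFIC COAST HWY<br/> STE 255<br/> LONG BEACH CA 90803-4241"
partial = "A Better Property Management<br/> 6621 E PACIFIC COAST HWY"  # no state or ZIP at the end

print(add_comma(full))
print(add_comma(partial))
```

Addresses that do not end in a number are left untouched. The pattern does not check that the word is actually a state, so a string ending in something like "Apt 4" would also get a comma; combining it with the state list from the question would tighten that.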

Using REGEX to remove duplicates when entire line is not a duplicate

^(.*)(\r?\n\1)+$
replace with \1
The above is a great way to remove duplicate lines using regex,
but it requires the entire line to be a duplicate.
However, what would I use if I want to detect and remove dups when the line as a whole is not a dup, but just the first X characters are?
Example:
Original File
12345 Dennis Yancey University of Miami
12345 Dennis Yancey University of Milan
12345 Dennis Yancey University of Rome
12344 Ryan Gardner University of Spain
12347 Smith John University of Canada
Dups Removed
12345 Dennis Yancey University of Miami
12344 Ryan Gardner University of Spain
12347 Smith John University of Canada
How about using a second group for checking eg the first 10 characters:
^((.{10}).*)(?:\r?\n\2.*)+
Here {n} specifies the number of characters from the line start that should be checked for duplicates.
The whole line is captured to $1, which is also used as the replacement.
The second group is used to check whether the following lines start with the same characters.
See this demo at regex101
Another idea would be the use of a lookahead and replace with empty string:
^(.{10}).*\r?\n(?=\1)
This one will just drop the current line, if captured $1 is ahead in the next line.
Here is the demo at regex101
To also remove duplicate lines that contain fewer than 10 characters, here is a PCRE idea using conditionals: ^(?:(.{10})|(.{0,9}$)).*+\r?\n(?(1)(?=\1)|(?=\2$)) and replace with an empty string.
If your regex flavor supports possessive quantifiers, use of .*+ will improve performance.
Be aware, that all these patterns (and your current regex) just target consecutive duplicate lines.
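To compare the first two patterns, here is a small Python sketch run on the sample from the question (the variable names keep_first and keep_last are only for illustration). One difference worth noting: the two-group pattern keeps the first line of each run, while the lookahead pattern drops each line whose successor shares the prefix, so it keeps the last line of the run.

```python
import re

text = "\n".join([
    "12345 Dennis Yancey University of Miami",
    "12345 Dennis Yancey University of Milan",
    "12345 Dennis Yancey University of Rome",
    "12344 Ryan Gardner University of Spain",
    "12347 Smith John University of Canada",
])

# Group 1 is the whole first line, group 2 its first 10 characters;
# following lines that start with group 2 are swallowed by the match.
keep_first = re.sub(r"^((.{10}).*)(?:\r?\n\2.*)+", r"\1", text, flags=re.MULTILINE)

# Drop a line (and its line break) when the next line starts with the same 10 characters.
keep_last = re.sub(r"^(.{10}).*\r?\n(?=\1)", "", text, flags=re.MULTILINE)

print(keep_first)
print(keep_last)
```

Only the first variant reproduces the "Dups Removed" output from the question exactly, since that output keeps the Miami line.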

regex for excluding text at end of string

I have a regular expression (built in Adobe JavaScript) which finds a string that can be of varying length.
The part I need help with is that, when the string is found, I need to exclude the extra characters at the end, which will always begin with 1 1.
This is the expression:
var re = new RegExp(/WASH\sHANDLING\sPLANT\s[-A-z0-9 ]{2,90}/);
This is the result:
WASH HANDLING PLANT SIZING STATION SERVICES SHEET 1 1 75 MOR03 MUP POS SU W ST1205 DWG 0001
I need to modify the regex to exclude the part of the string beginning with 1 1 (that is, 1 1 75 MOR03 MUP POS SU W ST1205 DWG 0001).
Keep in mind the string searched for can be of varying length, hence the {2,90}.
Can anyone please advise how to modify the regex to exclude everything from 1 1 onward?
Thank you
You may use a positive lookahead and keep the same functionality:
/WASH\sHANDLING\sPLANT\s[-A-Za-z0-9 ]{2,90}(?=\b1 1\b)/
                                           ^^^^^^^^^^^
The (?=\b1 1\b) lookahead requires 1 1 as whole "word" after your match.
See the regex demo
Also, note that [A-z] matches more than just letters.
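Both points can be checked quickly. The sketch below uses Python rather than Adobe JavaScript, purely for illustration; the pattern itself is unchanged, and the variable names are assumptions.

```python
import re

text = ("WASH HANDLING PLANT SIZING STATION SERVICES SHEET "
        "1 1 75 MOR03 MUP POS SU W ST1205 DWG 0001")

# The lookahead requires "1 1" right after the match without consuming it.
pattern = re.compile(r"WASH\sHANDLING\sPLANT\s[-A-Za-z0-9 ]{2,90}(?=\b1 1\b)")
title = pattern.search(text).group(0)
print(repr(title))

# [A-z] spans ASCII 65..122, which also covers the six characters between Z and a.
stray = re.findall(r"[A-z]", "[\\]^_`")
print(stray)
```

The match ends right before 1 1, so it still includes the space after SHEET; trim it afterwards if that matters.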

Regex parse with alteryx

One of the columns has the data as below and I only need the suburb name, not the state or postcode.
I'm using Alteryx and tried the regex (\<\w+\>)\s\<\w+\>, but only a few records make it into the new column.
Input:
CABRAMATTA
CANLEY HEIGHTS
ST JOHNS PARK
Parramatta NSW 2150
Claymore 2559
CASULA
Output
CABRAMATTA
CANLEY HEIGHTS
ST JOHNS PARK
Parramatta
Claymore
CASULA
This regex matches all letter-words up to but not including an Australian state abbreviation (since the addresses are clearly Australian):
( ?(?!(VIC|NSW|QLD|TAS|SA|WA|ACT|NT)\b)\b[a-zA-Z]+)+
See demo
The negative look ahead includes a word boundary to allow suburbs that start with a state abbreviation (see demo).
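To see how the pattern behaves on the sample column, here is a rough Python sketch (Alteryx itself is not involved; the variable names are only for illustration):

```python
import re

# Letter-only words, each optionally preceded by a space, stopping before
# a standalone Australian state abbreviation.
SUBURB = re.compile(r"( ?(?!(VIC|NSW|QLD|TAS|SA|WA|ACT|NT)\b)\b[a-zA-Z]+)+")

rows = [
    "CABRAMATTA",
    "CANLEY HEIGHTS",
    "ST JOHNS PARK",
    "Parramatta NSW 2150",
    "Claymore 2559",
    "CASULA",
]
suburbs = [SUBURB.match(r).group(0) for r in rows]
print(suburbs)
```

The postcode is excluded simply because [a-zA-Z]+ cannot match digits, so no separate rule is needed for rows such as Claymore 2559 that have no state.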
Expanding on Bohemian's answer, you can use groupings to do a REGEX_Replace in Alteryx. So:
REGEX_Replace([Field1], "(.*)(VIC|NSW|QLD|TAS|SA|WA|ACT|NT)+(\s*\d+)" , "\1")
This will grab anything that matches in the first group (so just the suburb). The second and third groups match the state and the zip. Not a perfect regex, but should get you most of the way there.
I think this workflow will help you :