The goal is to use a regex to remove the text between the nth comma and the next one in Rust.
For example, outside of Rust I would use
^((?:.*?,){4})[^,]*,(.*)$
on London, City of Westminster, Greater London, England, SW1A 2DX, United Kingdom
to get a desired result like:
London, City of Westminster, Greater London, England, United Kingdom
Unfortunately, I don't have a strong understanding of regex in general, so I would like to learn more about the mechanics and be able to use them in the program I'm writing to learn Rust.
Just copy-pasting it, à la
let string = "London, City of Westminster, Greater London, England, United Kingdom"
let re = Regex::new(r"^((?:.*?,){4})[^,]*,(.*)$").unwrap();
re.replace(string, "");
obviously does not work.
The value you want to remove is the fifth comma-delimited value, not the fourth, and you need to replace the match with two backreferences, $1 and $2, which refer to the Group 1 and Group 2 values.
Note that it is more precise to use a [^,] negated character class rather than a lazy dot (.*?) in the quantified part, since you are running the pattern against a comma-delimited string.
See the Rust demo:
use regex::Regex;
fn main() {
    let string = "London, City of Westminster, Greater London, England, SW1A 2DX, United Kingdom";
    let re = Regex::new(r"^((?:[^,]*,){4})[^,]*,(.*)").unwrap();
    // Keep the first four fields ($1) and everything after the fifth field ($2).
    println!("{}", re.replace(string, "$1$2"));
    // => London, City of Westminster, Greater London, England, United Kingdom
}
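The same idea extends to any field position. As a minimal sketch, written in Python purely for illustration (the helper name remove_field and its zero-based index parameter are assumptions, not part of the answer above), the pattern can be built from the index:

```python
import re

def remove_field(text: str, index: int) -> str:
    """Remove the comma-delimited field at zero-based `index` (hypothetical helper)."""
    # Group 1: the first `index` fields, each with its trailing comma.
    # Then the field to drop plus its comma. Group 2: the rest of the string.
    pattern = r"^((?:[^,]*,){%d})[^,]*,(.*)$" % index
    return re.sub(pattern, r"\1\2", text)

address = "London, City of Westminster, Greater London, England, SW1A 2DX, United Kingdom"
print(remove_field(address, 4))
```

Because the pattern requires a comma after the removed field, this sketch leaves the string unchanged when asked to remove the last field.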
Related
I have a data like below
135 stjosephhrsecschool  london  DunAve
175865 stbele_higher_secondary sch  New York
11 st marys high school for women  Paris  Louis Avenue
I want to extract id schoolname city area.
The pattern is: an id (digits), followed by a single space, then the school name. The name can contain multiple words separated by single spaces, and may contain special characters. Then come two or more spaces, then the city, which may also contain multiple words separated by single spaces, or special characters. Then come two or more spaces, then the area, which follows the same rules as the school name and city. The area may or may not be present in the line; if it is missing, I want a null value for it.
Here is regex I have tried.
([\d]+) ([\w\s\S]+)\s\s+([\w\s\S]+)\s\s+([\w\s\S]*)
But this regex does not stop when it sees two or more spaces. I'm not sure how to modify it to fit my data.
Any help is appreciated.
Thanks
If I understand your issue correctly, the problem is that the resulting groups contain trailing spaces (e.g. "Louis Avenue "). If so, you can fix this by using non-greedy modifiers such as +? and *?:
([\d]+) ([\w\s\S]+?)\s\s+([\w\s\S]+?)\s\s+([\w\s\S]*?)?\s*
Which results in what seems to be the desired output:
val s1 = "135 stjosephhrsecschool  london  DunAve"
val s2 = "175865 stbele_higher_secondary sch  New York  "
val s3 = "11 st marys high school for women  Paris  Louis Avenue "
val r = """([\d]+) ([\w\s\S]+?)\s\s+([\w\s\S]+?)\s\s+([\w\s\S]*?)?\s*""".r
def matching(s: String) = s match {
case r(a,b,c,d) => println((a,b,c,d))
case _ => println("no match")
}
matching(s1) // (135,stjosephhrsecschool,london,DunAve)
matching(s2) // (175865,stbele_higher_secondary sch,New York,)
matching(s3) // (11,st marys high school for women,Paris,Louis Avenue)
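For comparison, here is a minimal sketch of the same lazy-quantifier idea in Python, assuming the fields are separated by two or more spaces as the question describes. The name parse_line is hypothetical, and the area group is wrapped in an optional non-capturing group so that a line without an area (and without trailing spaces) still matches and yields None for that field:

```python
import re
from typing import Optional, Tuple

# id, one space, name, 2+ spaces, city, then optionally 2+ spaces and an area.
LINE = re.compile(r"^(\d+) (.+?)\s{2,}(.+?)(?:\s{2,}(.+?))?\s*$")

def parse_line(line: str) -> Optional[Tuple[Optional[str], ...]]:
    """Return (id, name, city, area); area is None when absent (hypothetical helper)."""
    m = LINE.match(line)
    return m.groups() if m else None

for row in [
    "135 stjosephhrsecschool  london  DunAve",
    "175865 stbele_higher_secondary sch  New York",
    "11 st marys high school for women  Paris  Louis Avenue ",
]:
    print(parse_line(row))
```

The lazy .+? groups stop at the first run of two or more spaces, and the final \s*$ absorbs any trailing whitespace.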
I have to match only the first Country name in the pattern below. The country names are given in all upper case letters. I used the following code to get the matches but it matches all the countries.
'\\b[A-Z]{2,}.\\b'
Eg: In the pattern below, I just want UNITED KINGDOM
x = "~ London, Greater London ~ UNITED KINGDOM;~ Ottawa, Ontario ~ CANADA;~,~ AUSTRALIA;~,~ POLAND;~,~ USA"
This seems to work:
regmatches(x, regexpr('\\b[A-Z ]{2,}\\b', x))
# [1] "UNITED KINGDOM"
I just added a space to make the character set [A-Z ]. Note that regexpr gets the first match while gregexpr gets all of them (similar to sub vs gsub).
For more info, I recommend the official docs at ?regexpr.
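The first-match versus all-matches distinction exists in most regex libraries. As a rough Python sketch for illustration (not R), re.search plays the role of regexpr and re.findall the role of gregexpr:

```python
import re

x = "~ London, Greater London ~ UNITED KINGDOM;~ Ottawa, Ontario ~ CANADA;~,~ AUSTRALIA;~,~ POLAND;~,~ USA"

# Two or more upper-case letters or spaces, delimited by word boundaries.
pattern = re.compile(r"\b[A-Z ]{2,}\b")

first = pattern.search(x).group()   # first match only
all_matches = pattern.findall(x)    # every non-overlapping match

print(first)
print(all_matches)
```

Mixed-case words such as "London" do not match because a single capital followed by a lower-case letter cannot satisfy the {2,} quantifier and the closing word boundary together.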
I have an address string and I need to extract the street name from it. Examples:
Unit 1, Silicon Way -> Silicon Way
66 Yellow Brick Road -> Yellow Brick Road
77 - 5 Sesame Street -> Sesame Street
High Street -> High Street
What would a regular expression look like in this case? If the language matters, I'm using Scala.
The regex below will not work if the street name itself contains a comma or a number. If the street name is always the text at the end of the string, then try this regex:
\s*([a-zA-Z ]+?)\s*$
$ anchors the match at the end of the string, so the pattern always matches from the right-hand side of the string.
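A quick way to check the pattern against the question's examples is a small sketch like the following, written in Python for illustration (the helper name street_name is hypothetical):

```python
import re
from typing import Optional

# Lazy run of letters and spaces, anchored at the end of the string.
STREET = re.compile(r"\s*([a-zA-Z ]+?)\s*$")

def street_name(address: str) -> Optional[str]:
    """Return the trailing letters-and-spaces part of `address`, or None."""
    m = STREET.search(address)
    return m.group(1) if m else None

for a in ["Unit 1, Silicon Way", "66 Yellow Brick Road", "77 - 5 Sesame Street", "High Street"]:
    print(a, "->", street_name(a))
```

An address that ends in a digit, such as "Unit 1, Route 66", yields None here, which is the limitation mentioned above.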
I need to extract name, street1, street2, city, state, zip
I have data in this form
JOHN m SMITH [1111 WEST OAK ROAD, SUITE 101, CITY, ST 55555]
GEORGE m JONES [222 MAIN STREET, CITY, ST 55555]
My results for JOHN should be
name="JOHN m SMITH"
street1="1111 WEST OAK ROAD"
street2="SUITE 101"
city = "CITY"
state = "ST"
zip = "55555"
This works with GEORGE's data
Regex r = new Regex(@"^(?<name>.*)\[(?<street>.*)[,]\s(?<city>.*)[,]\s(?<state>.*)\s(?<zip>\d{5})\]$");
var match = r.Match(fullNameAndAddress);
name = match.Groups["name"].Value;
street = match.Groups["street"].Value;
city = match.Groups["city"].Value;
state = match.Groups["state"].Value;
zip = match.Groups["zip"].Value;
How do I add the optional street2?
I want 1 and only 1 "street" group. I thought it should have this: (....){1}?
street2 is optional zero or 1 times. I thought it should have this (...)?
but it doesn't work with JOHN's data, both street1 & street2 are going into the street group:
^(?<name>.*)\[((?<street>.*)[,]\s){1}?((?<street2>.*)[,]\s)?(?<city>.*)[,]\s(?<state>.*)\s(?<zip>\d{5})\]$
Could you clarify what you want stored in street?
Do you want John's to look like '1111 WEST OAK ROAD, SUITE 101'?
Or do you want to stuff it into some variable you won't be using, so that street looks like '1111 WEST OAK ROAD'?
Edit: With clarification, check out this link
http://rubular.com/r/S4HaTMVFZl
What happens here, I believe, is that the * is greedy, grabbing as much as it can before finding the final occurrence of [,]\s
Adding a ? after the .* makes it lazy, grabbing as little as possible.
The amended regex looks like this
^(?<name>.*)\[((?<street>.*?)[,]\s)((?<street2>.*)[,]\s)?(?<city>.*)[,]\s(?<state>.{2})\s(?<zip>\d{5})\]$
You'll notice I changed the Regex for state from .* to .{2}, forcing a 2-character state. Feel free to revert that if you don't want it :)
I made a couple of changes to your regex in rubular.com, and it seemed to be working on both the example strings:
^(?<name>.+)\s\[(?<street>[^,]+),\s((?<street2>[^,]+),\s+)?(?<city>[^,]+),\s(?<state>.+)\s(?<zip>\d{5})\]$
street2 = match.Groups["street2"].Value;
One trick I've learned with regexes is to use the negation of the delimiter (e.g. [^,]* for anything but a comma) instead of .*, so that a single expression cannot capture multiple fields. Also, the + operator, which requires at least one match, is useful in most of the groups.
Also, the additional comma is only there if the address has a street2 component, which indicates that the comma should be in the same group as the street2 part. I added an extra capture group around the street2 capture group to account for this. You can make groups non-capturing in most languages, but it didn't seem necessary.
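To illustrate the optional-group approach outside C#, here is a minimal Python sketch of the same pattern. Python spells named groups (?P<name>...), and the wrapper around street2 is made non-capturing here; the function name parse_address is an assumption for this example:

```python
import re
from typing import Dict, Optional

ADDRESS = re.compile(
    r"^(?P<name>.+)\s\[(?P<street>[^,]+),\s"
    r"(?:(?P<street2>[^,]+),\s+)?"          # optional second street line
    r"(?P<city>[^,]+),\s(?P<state>.+)\s(?P<zip>\d{5})\]$"
)

def parse_address(line: str) -> Optional[Dict[str, Optional[str]]]:
    """Return the named fields, with street2 set to None when it is absent."""
    m = ADDRESS.match(line)
    return m.groupdict() if m else None

print(parse_address("JOHN m SMITH [1111 WEST OAK ROAD, SUITE 101, CITY, ST 55555]"))
print(parse_address("GEORGE m JONES [222 MAIN STREET, CITY, ST 55555]"))
```

Because every field except name and state uses [^,]+, no group can swallow a comma, so street2 only matches when a fourth comma-separated part is actually present.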
Suppose I have this string:
Address XXXXX city XXXXX
And this regEX:
Address (.*?) city (.*?)
What will happen if the Address is "The city of London" ?
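It depends on whether the quantifiers are matched greedily or lazily.
In most engines the ? after .* makes the quantifier lazy, so as written the first group stops at the first " city " and captures only "The", and the trailing (.*?), with nothing after it, matches the empty string. With greedy quantifiers, (.*) city (.*), it works as expected, since the first group extends to the last " city ".
Whether your particular regex engine is greedy by default, or whether it supports lazy quantifiers at all, is not something we can tell you based on the information provided in the question.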
If you're using .NET, this page has a description on greedy versus lazy matching.
Basically, given the string XYZZY, the regex X.*Y will match XYZZY (greedy) while X.*?Y will match XY (lazy).
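The difference is easy to check. The following is a small sketch using Python's re module, where a ? after a quantifier makes it lazy; the sample text fills in the question's XXXXX placeholders with assumed values:

```python
import re

# Greedy versus lazy on the XYZZY example.
greedy = re.search(r"X.*Y", "XYZZY").group()
lazy = re.search(r"X.*?Y", "XYZZY").group()
print(greedy, lazy)

# Assumed sample: the address itself contains the word "city".
text = "Address The city of London city London"
lazy_groups = re.match(r"Address (.*?) city (.*?)", text).groups()
greedy_groups = re.match(r"Address (.*) city (.*)", text).groups()
print(lazy_groups)
print(greedy_groups)
```

In this sketch the lazy pattern splits at the first " city " and leaves its second group empty, while the greedy pattern splits at the last " city ".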
What you need is a way to ensure you can differentiate between the delimiters and the elements of your string, otherwise you'll be in trouble no matter what, such as with:
Address The city baths city Manchester city, England
Perhaps you could look into something like:
Address "put address here" city "put city here"
and try to make sure you never get a city name with quotes in it. However, be careful. I once worked on a project where we managed to get some decent compression on city names (it was embedded so every byte counted) by only having to store alpha characters.
Shortly thereafter, we rolled out nationally and the residents of A1 mining settlement were rather miffed at our short-sightedness :-) One town in the whole of Oz with a digit in the name, who'd have thought?
Alternatively, put the address and city on separate lines thus:
Address: The city baths
City: Manchester city, England
Then you can look for things like:
^Address:\s*(.*)$
^City:\s*(.*)$
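As a minimal sketch of the line-based approach, assuming Python and a record stored as one multi-line string, the MULTILINE flag lets ^ and $ match at each line boundary:

```python
import re

record = "Address: The city baths\nCity: Manchester city, England"

# With re.MULTILINE, ^ and $ anchor at the start and end of each line.
address = re.search(r"^Address:\s*(.*)$", record, re.MULTILINE).group(1)
city = re.search(r"^City:\s*(.*)$", record, re.MULTILINE).group(1)

print(address)
print(city)
```

Since each value sits on its own line, the word "city" inside either value no longer matters.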