Using Regex in SOLR Query

I have a data set of street names and numbers which I need to search, e.g.:
12 HILL STREET
12A HILL STREET
12B HILL STREET
123 HILL STREET
12 HILARY STREET
If I search as follows q=(street_name:12\ HILL*), I get
12 HILL STREET
I want to obtain the following results:
12 HILL STREET
12A HILL STREET
12B HILL STREET
Is there a way to query in SOLR to return the results as the above example shows?
I have tried querying as:
q=(street_name:/12[A-Z]\ HILL*/)
but don't get anything back.

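You can use
q=(street_name:/12[A-Z]* HILL.*/)
Your attempt most likely returns nothing because a Solr regex has to match the whole indexed value: [A-Z] requires exactly one letter after 12, and L* in HILL* only means "zero or more L characters", so nothing covers the rest of the street name.
Here, the pattern means
12 - string starts with 12
[A-Z]* - zero or more ASCII uppercase letters
(space) - a space
HILL - HILL char sequence
.* - any zero or more chars other than line break chars, as many as possible (so, the rest of the line).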
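The behaviour of the pattern can be checked outside Solr. The following is a minimal sketch in Python, assuming that `re.fullmatch` is an acceptable stand-in for Solr's whole-value matching on an untokenized field; the variable names are made up for illustration.

```python
import re

# Sample values from the question.
streets = [
    "12 HILL STREET",
    "12A HILL STREET",
    "12B HILL STREET",
    "123 HILL STREET",
    "12 HILARY STREET",
]

# Pattern from the answer, without the surrounding slashes of the Solr syntax.
suggested = re.compile(r"12[A-Z]* HILL.*")
# Pattern from the question's attempt (the escaped space written as a plain space).
attempt = re.compile(r"12[A-Z] HILL*")

# fullmatch() requires the whole string to match (assumed to mirror Solr here).
matches = [s for s in streets if suggested.fullmatch(s)]
attempt_matches = [s for s in streets if attempt.fullmatch(s)]

print(matches)
print(attempt_matches)
```

Under this assumption the suggested pattern keeps the three wanted addresses, while the attempted pattern keeps none.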

Related

Regex for city and street name

Hi, I am looking for two regexes which describe:
1) a valid name of a street
2) a valid name of a city
Valid street names are:
Mainstreet.
Mainstreet
Main Street
Big New mainstreet
Mainstreet-New
Mains Str.
St. Alexander Street
abcÜüßäÄöÖàâäèéêëîï ôœùûüÿçÀÂ-ÄÈÉÊËÎÏÔŒÙÛÜŸÇ.
John Kennedy Street
Not valid street names are:
Mainstreet #+;:_*´`?=)(/&%$§!
Mainstreet#+;:_*´`?=)(/&%$§!
Mainstreet 2
Mainstreet..
Mainstreet§
Valid cities are:
Edinôœùûüÿ
Berlin.
St. Petersburg
New-Berlin
Aue-Bad Schlema
Frankfurt am Main
Nürnberg
Ab
New York CityßäÄöÖàâäèéêëîïôœùûüÿçÀÂ-ÄÈÉÊËÎÏÔŒÙÛÜŸ
Not valid cities are:
Edingburgh 123
Edingburg123
St. Andrews 12
Berlin,#+;:_*´`?=)(/&%$§!
Berlin__
The solution that I have at the moment comes very close but is not perfect:
For city and street name:
^[^\W\d_]+(?:[-\s][^\W\d_]+)*[.]?$
Unfortunately, it does not match these examples (the rest work fine):
St. Alexander Street
St. Petersburg
If you have simpler solutions, I am happy to learn something new! :-)

To make it match St. Alexander Street and St. Petersburg, you just need to add an optional dot after each of the letter-matching patterns:
^[^\W\d_]+\.?(?:[-\s][^\W\d_]+\.?)*$
#         ^^^                 ^^^
Also, it might make sense to allow an apostrophe as a separator in the regex:
^[^\W\d_]+\.?(?:[-\s'’][^\W\d_]+\.?)*$
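As a quick check of the last pattern, here is a small Python sketch run against a subset of the valid and invalid names from the question. Python's `re` is only one possible engine (the question does not name one), and `is_valid_name` is a hypothetical helper.

```python
import re

# Final pattern from the answer: optional dot after each word, with
# hyphen, whitespace or apostrophe as separators between words.
NAME = re.compile(r"^[^\W\d_]+\.?(?:[-\s'’][^\W\d_]+\.?)*$")

# Subsets of the examples given in the question.
valid = ["Mainstreet.", "Main Street", "Mainstreet-New", "Mains Str.",
         "St. Alexander Street", "St. Petersburg", "Frankfurt am Main", "Nürnberg"]
invalid = ["Mainstreet 2", "Mainstreet..", "Mainstreet§",
           "Edingburg123", "St. Andrews 12", "Berlin__"]


def is_valid_name(text: str) -> bool:
    """Return True if the whole string looks like a street or city name."""
    return NAME.match(text) is not None


for name in valid + invalid:
    print(f"{name!r}: {is_valid_name(name)}")
```

In Python, `[^\W\d_]` matches any Unicode letter, which is why names such as Nürnberg pass without listing the accented characters explicitly.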

Pandas str match for German addresses

I have a quite annoying problem in designing a regex to prepare addresses for geocoding with Nominatim. I am working with German addresses which look like this:
Von-der-Leyen-Platz 1 47506 Neukirchen-Vluyn
Schildstraße 52531 Übach-Palenberg
Finkenratherstraße Straße 4a 52134 Herzogenrath
Format: Street Number Postal code City
What I want to achieve is that the letters directly after street numbers (such as the a in 4a) are dropped. For this I am using the following regex:
(\d+).*?\s+(.+)
It reduces the third address to 4 52134 Herzogenrath, but not to Finkenratherstraße 4 52134 Herzogenrath. Another problem is the second address, as it does not have a street number. That is why I want to create a regex which can filter for the following structure:
Street name {number if available} Postal code (5 digits) City name
The postal code always has 5 digits and the structure is always the same just that sometimes the street number is missing.
Is there any way to design this as a regex?

For your data, this could work:
import pandas as pd

# sample data
s = pd.Series(['Von-der-Leyen-Platz 1 47506 Neukirchen-Vluyn',
               'Schildstraße 52531 Übach-Palenberg',
               'Finkenratherstraße Straße 4a 52134 Herzogenrath'])
# extract the named groups into columns; the Number group is optional
s.str.extract(r'(?P<Street>\D+)\s?(?P<Number>\d+\S*)?\s(?P<Postal>\d{5})\s(?P<City>\D+)$')
Output:
                      Street Number Postal              City
0        Von-der-Leyen-Platz      1  47506  Neukirchen-Vluyn
1               Schildstraße    NaN  52531   Übach-Palenberg
2  Finkenratherstraße Straße     4a  52134      Herzogenrath
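The question also asks for the letter after the house number to be dropped (4a becoming 4). A possible variation using only the standard `re` module is sketched below; the pattern and the `split_address` helper are illustrative assumptions, not the answer's code. It captures only the digits of the house number and matches, but does not keep, any suffix.

```python
import re

ADDRESS = re.compile(
    r"^(?P<street>\D+?)\s*"        # street name: shortest run of non-digits
    r"(?:(?P<number>\d+)\S*\s+)?"  # optional house number; a suffix such as "a" is not captured
    r"(?P<postal>\d{5})\s+"        # postal code: exactly five digits
    r"(?P<city>\D+)$"              # city: the rest of the string
)


def split_address(address):
    """Return a dict with street, number, postal and city, or None if there is no match."""
    m = ADDRESS.match(address)
    return m.groupdict() if m else None


for line in ["Von-der-Leyen-Platz 1 47506 Neukirchen-Vluyn",
             "Schildstraße 52531 Übach-Palenberg",
             "Finkenratherstraße Straße 4a 52134 Herzogenrath"]:
    print(split_address(line))
```

The same pattern should also work with `Series.str.extract`, since the named groups would become the column names. The sketch assumes a house number is never itself a five-digit value directly in front of the postal code.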

Regex (Posix) to get first word only, not including numbers

New to Regex (which was recently added to SQL in DB2 for i). I don't know anything about the different engines but research indicates that it is "based on POSIX extended regular expressions".
I would like to get the street name (first non-numeric word) from an address.
e.g.
101 Main Street = Main
2/b Pleasant Ave = Pleasant
5H Unpleasant Crescent = Unpleasant
I'm sorry I don't have a string that isn't working, as suggested by the forum software. I don't even know where to start. I tried a few things I found in search but they either yielded nothing or the first "word" - i.e. the number (101, 2/b, 5H).
Thanks
Edit: Although it's looking as if IBM's implementation of regex on the DB2 family of databases may be too alien for many of the resident experts, I'll press ahead with some more detail in case it helps.
A plain English statement of the requirement would be:
Basic/acceptable: Find the first word/unbroken string that contains no numbers or special characters
Advanced/ideal: Find the first word that contains three or more characters, being only letters and zero or one embedded dash/hyphen, but no numbers or other characters.
Additional examples (original ones at top are still valid)
190 - 192 Tweety-bird avenue = Tweety-bird
190-192 Tweety-bird avenue = Tweety-bird
Charles Bronson Place = Charles
190H Charles-Bronson Place = Charles-Bronson
190 to 192 Charles Bronson Place = Charles
Second Edit:
Mooching around on the internet and trying every vaguely connected expression that I could find, I stumbled on this one:
[a-zA-Z]+(?:[\s-][a-zA-Z]+)*
which actually works pretty well - it gives the street name and street type, which on reflection would actually suit my purpose as well as the street name alone (I can easily expand common abbreviations - e.g. RD to ROAD - on the fly).
Sample SQL:
select HAD1,
regexp_substr(HAD1, '[a-zA-Z]+(?:[\s-][a-zA-Z]+)*')
from ECH
where HEDTE > 20190601
Sample output:
Ship To Address Line 1          REGEXP_SUBSTR
32 CHRISTOPHER STREET           CHRISTOPHER STREET
250 - 270 FEATHERSTON STREET    FEATHERSTON STREET
118 MONTREAL STREET             MONTREAL STREET
7 BIRMINGHAM STREET             BIRMINGHAM STREET
59 MORRISON DRIVE               MORRISON DRIVE
118 MONTREAL STREET             MONTREAL STREET
MASON ROAD                      MASON ROAD
I know this wasn't exactly the question I asked, so apologies to anyone who could have done this but was following the original request faithfully.
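The expression from the second edit can be tried against some of the earlier examples as well. The sketch below uses Python's `re` as a stand-in, which is an assumption: the DB2 for i engine may differ in details. `street_part` is a hypothetical helper that mimics `regexp_substr` by returning the first match.

```python
import re

# Expression found in the second edit of the question.
STREET = re.compile(r"[a-zA-Z]+(?:[\s-][a-zA-Z]+)*")


def street_part(address):
    """Return the first match of the expression, or None if nothing matches."""
    m = STREET.search(address)
    return m.group(0) if m else None


for address in ["32 CHRISTOPHER STREET",
                "250 - 270 FEATHERSTON STREET",
                "MASON ROAD",
                "5H Unpleasant Crescent",
                "190 to 192 Charles Bronson Place"]:
    print(repr(address), "->", repr(street_part(address)))
```

In Python at least, the first match starts at the first letter in the string, so a letter attached to the number (5H) or a short word between numbers (to) ends up in the result. Data containing such addresses may need an extra clean-up step.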

Not sure if this is POSIX compliant, but something like this could work: ^[\w\/]+?\s((\w+\s)+?)\s*\w+?$
The pattern assumes that the first chunk is the number of the building, the second chunk is the name of the street, and the last chunk is Road/Ave/Blvd/etc.
This should also cater for street names which contain white space.
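To see how this pattern behaves on the question's examples, here is a short Python sketch. Python's `re` is an assumption, since the target engine is DB2, and `street_name` is a hypothetical helper that strips the trailing space from the first capture group.

```python
import re

# Pattern from the answer above, unchanged.
PATTERN = re.compile(r"^[\w\/]+?\s((\w+\s)+?)\s*\w+?$")


def street_name(address):
    """Return capture group 1 without surrounding whitespace, or None if there is no match."""
    m = PATTERN.match(address)
    return m.group(1).strip() if m else None


for address in ["101 Main Street",
                "2/b Pleasant Ave",
                "5H Unpleasant Crescent",
                "190 - 192 Tweety-bird avenue",
                "Charles Bronson Place"]:
    print(repr(address), "->", repr(street_name(address)))
```

The three original examples come out as expected. The hyphenated range does not match at all, and an address without a leading number loses its first word, which follows from the stated assumption that the first chunk is always the building number.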

The following regex matches your examples, provided your engine supports lookbehind (POSIX extended regular expressions do not define it):
(?<=[^ ]+ )[^ ]*[ ]
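For the "advanced" requirement in the question (the first word of three or more characters made of letters with at most one embedded hyphen), an alternative that avoids lookbehind is to split on whitespace and test each word separately. This is a sketch of that different technique in Python, not a pure-regex or DB2 solution; `first_street_word` and `min_length` are names invented here.

```python
import re

# Letters only, with at most one embedded hyphen.
WORD = re.compile(r"[A-Za-z]+(?:-[A-Za-z]+)?")


def first_street_word(address, min_length=3):
    """Return the first whitespace-separated word that satisfies the rule, or None."""
    for token in address.split():
        if len(token) >= min_length and WORD.fullmatch(token):
            return token
    return None


examples = {
    "101 Main Street": "Main",
    "2/b Pleasant Ave": "Pleasant",
    "5H Unpleasant Crescent": "Unpleasant",
    "190 - 192 Tweety-bird avenue": "Tweety-bird",
    "190-192 Tweety-bird avenue": "Tweety-bird",
    "Charles Bronson Place": "Charles",
    "190H Charles-Bronson Place": "Charles-Bronson",
    "190 to 192 Charles Bronson Place": "Charles",
}

for address in examples:
    print(repr(address), "->", repr(first_street_word(address)))
```

The dictionary pairs each example from the question with the result the question asks for. The length check is what skips short words such as "to".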

Regex: Match only street name within address

I have a list of addresses and I would like to have a regular expression that is able to capture just the name of the street without the street type, address number, or cardinal direction. There are some errors in formatting but all characters are in capital letters. So,
2038 W MAIN AVE
2038QWEW S JEFFERSON AVENUE
33 NORTH CALIFORNIA STREET
53371 SOUTH WASHINGTON
53371 S WASHINGTON AVENUE
1600 E PENNSYLVANIA AVE
WEST9 67ST ST
E171 N 23RD STREET
G171 N121ST STREET
ought to return
MAIN
JEFFERSON
CALIFORNIA
WASHINGTON
WASHINGTON
PENNSYLVANIA
67ST
23RD
121ST
So far I've got
([^ W ]|[^ E ]|[^ S ]|[^ N ])([0-9])*([A-Z]+)[^ ]
But I can't seem to only capture the first match that occurs after the street number. I feel like I need the standard greedy operators (i.e. ?, *, or +) but I can't figure out how to incorporate them.
These two links have taken me close:
Matching on every second occurence
Simple regex for street address

For the output you want from the given address input, this regex should help: [\pL\pN]+(?=\h+[\pL\pN]+$)
It matches the second-to-last word in your line, where a word is "one or more letters or digits in any language".
For reference, see https://superuser.com/questions/1361759/matching-second-last-word-in-sentence-through-regular-expression
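Python's `re` does not support `\pL`, `\pN` or `\h`, so the sketch below substitutes `[^\W_]` (letter or digit) and `[ \t]` (horizontal whitespace) as approximations. With that assumption it shows what "second-to-last word" gives for a few of the addresses in the question; `second_last_word` is a hypothetical helper.

```python
import re

# Approximation of [\pL\pN]+(?=\h+[\pL\pN]+$) for Python's re module.
SECOND_LAST = re.compile(r"[^\W_]+(?=[ \t]+[^\W_]+$)")


def second_last_word(address):
    """Return the word that is followed by exactly one more word, or None."""
    m = SECOND_LAST.search(address)
    return m.group(0) if m else None


for address in ["2038 W MAIN AVE",
                "33 NORTH CALIFORNIA STREET",
                "WEST9 67ST ST",
                "53371 SOUTH WASHINGTON",
                "G171 N121ST STREET"]:
    print(repr(address), "->", repr(second_last_word(address)))
```

Addresses that end in a street type work as intended. The address without a street type yields SOUTH, and N121ST keeps its leading N, so those two cases from the question would need separate handling.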

Logic: we are looking for the second-to-last word (a run of characters), allowing for a leading N fused to it (as in N121ST):
^.*?\s[N]{0,1}([-a-zA-Z0-9]+)\s*\w*$
Result:
Match 1
Full match 0-15 `2038 W MAIN AVE`
Group 1. 7-11 `MAIN`
Match 2
Full match 16-43 `2038QWEW S JEFFERSON AVENUE`
Group 1. 27-36 `JEFFERSON`
Match 3
Full match 44-70 `33 NORTH CALIFORNIA STREET`
Group 1. 53-63 `CALIFORNIA`
Match 4
Full match 71-93 `53371 SOUTH WASHINGTON`
Group 1. 83-93 `WASHINGTON`
Match 5
Full match 94-119 `53371 S WASHINGTON AVENUE`
Group 1. 102-112 `WASHINGTON`
Match 6
Full match 120-143 `1600 E PENNSYLVANIA AVE`
Group 1. 127-139 `PENNSYLVANIA`
Match 7
Full match 144-157 `WEST9 67ST ST`
Group 1. 150-154 `67ST`
Match 8
Full match 158-176 `E171 N 23RD STREET`
Group 1. 165-169 `23RD`
Match 9
Full match 177-195 `G171 N121ST STREET`
Group 1. 183-188 `121ST`
https://regex101.com/r/m2rmUQ/4
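The posted pattern can also be run locally. The following Python sketch applies it line by line and returns capture group 1; `street_name` is a hypothetical helper, and Python's `re` is assumed to behave like the engine used in the linked demo for this pattern.

```python
import re

# Pattern exactly as posted in the answer above.
PATTERN = re.compile(r"^.*?\s[N]{0,1}([-a-zA-Z0-9]+)\s*\w*$")


def street_name(address):
    """Return capture group 1 for a single address line, or None if there is no match."""
    m = PATTERN.match(address)
    return m.group(1) if m else None


addresses = [
    "2038 W MAIN AVE",
    "2038QWEW S JEFFERSON AVENUE",
    "33 NORTH CALIFORNIA STREET",
    "53371 SOUTH WASHINGTON",
    "53371 S WASHINGTON AVENUE",
    "1600 E PENNSYLVANIA AVE",
    "WEST9 67ST ST",
    "E171 N 23RD STREET",
    "G171 N121ST STREET",
]

for address in addresses:
    print(repr(address), "->", repr(street_name(address)))
```

Traced by hand, eight of the nine addresses give the group listed above. For 53371 SOUTH WASHINGTON, which has no trailing street type, the pattern as posted appears to capture SOUTH in Python, so the linked demo may have used a slightly different version of the expression.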

I was able to figure this out in a slightly different way:
[0-9A-Z]* [0-9A-Z]*$
and then I simply split the resulting string on the space. Maybe one or two steps too many, but it's transparent.

Regex: extract last occurence of pattern

I have an address string and I need to extract the street name from it. Examples:
Unit 1, Silicon Way -> Silicon Way
66 Yellow Brick Road -> Yellow Brick Road
77 - 5 Sesame Street -> Sesame Street
High Street -> High Street
What would a regular expression look like in this case? If the language matters, I'm using Scala.

If the street name is always the text at the end of the string, then try this regex:
\s*([a-zA-Z ]+?)\s*$
$ anchors the match at the end of the string, so the pattern always matches from the right side of the string.
Note that this regex will not work if the street name itself contains a comma or a number.
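A quick way to try the pattern on the four examples is sketched below in Python. The question mentions Scala, where the regex itself should carry over but the API differs. `street_name` is a hypothetical helper that returns capture group 1 of the first match.

```python
import re

# Pattern from the answer: a trailing run of letters and spaces.
TRAILING_NAME = re.compile(r"\s*([a-zA-Z ]+?)\s*$")


def street_name(address):
    """Return the trailing letters-and-spaces part of the address, or None."""
    m = TRAILING_NAME.search(address)
    return m.group(1) if m else None


for address in ["Unit 1, Silicon Way",
                "66 Yellow Brick Road",
                "77 - 5 Sesame Street",
                "High Street"]:
    print(repr(address), "->", repr(street_name(address)))
```

The leading `\s*` consumes the space in front of the name, so the captured group starts at the first letter of the street name.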
