Regex Between Strings and position

I have a confusing string that normally contains some form of address. When it is a corner address this is easy, because the address has CNR at the start, so I can use the following regex (I am working in VB.NET):
Case 1 Instr CNR: Regex = New Regex("( CNR )(.*?)(?=\SVSE| M | SVC | SVSW | SVNE |SVNW )", RegexOptions.RightToLeft)
The end of the string is normally a map reference, which is what the lookahead at the end of the pattern searches for; that lets me extract the address. Once I have the address, I plan to geocode it to determine the latitude and longitude.
However, in some cases there is no clearly marked address, and the string may instead contain phrases that suggest the address comes after them, for example FIRE NOW OUT JOHN ST SUBURB M 215 G2. In that case I use the regex below:
Case 2 No CNR: Regex = New Regex("( ([\d]+) | ([\d]+-[\d]+) | ([\d]+ - [\d]+) | CAR SMOULDERING | INPUT | OFF | OPPOSITE | CNR | SPARKING | INCIC1 | INCIC3 | STRUC1 | STRUC3 | G&SC1 | G&SC3 | ALARC1 | ALARC3 | NOSTC1| NOSTC3 | RESCC1 | RESCC3 | HIARC1 | HIARC3 | CAR ACCIDENT - POSS PERSON TRAPPED | EXPLOSIONS HEARD | WASHAWAY AS A RESULT OF ACCIDENT | ENTRANCE | ENT | LHS | RHS | POWER LINES ARCING AND SPARKING | SMOKE ISSUING FROM FAN | CAR FIRE | FIRE ALARM OPERATING | GAS LEAK | GAS PIPE | NOW OUT | ACCIDENT | SMOKING | ROOF | GAS | REQUIRED | FIRE | LOCKED IN CAR | SMOKE RISING | SINGLE CAR ACCIDENT | ACCIDENT | FIRE)(.*?)(?=\SVSE| M | SVC | SVSW | SVNE | SVNW )", RegexOptions.RightToLeft)
In all cases I work from right to left, looking for where the front part of the pattern is found, and then I want to take everything from there until just before the map reference. My question is: how can I use the Case 2 regex above to look for a phrase but not include it in the match, when I may want to include others? For example, if the string has a street number I want to include the number in the extracted string, but if it has REQUIRED I don't want to extract that. Here are two examples:
A: SPECIAL APPLIANCE TYPE-A REQUIRED EXAMPLE ST SUBURB M 215 G5
B: HOUSE FIRE 123 EXAMPLE ST SUBURB M 215 G5
In case A, REQUIRED is not part of the address, so I don't want the regex to include that in the extracted address, and it would output as a string EXAMPLE ST SUBURB.
In case B, since a street address exists I don't want to exclude that, so the extracted address would be 123 EXAMPLE ST SUBURB.
So the question is in the above regex, how can I extract the string between phrases, and include the phrase in some cases, and exclude it in others?
Sorry for such a big question, I wanted to make sure I provided enough information.
The final question is: does regex let you work out where the first part matches (i.e. its position in the string)? For example, in the REQUIRED example above, the extracted text starts at character 35, right after REQUIRED, and the regex extracts the string EXAMPLE ST SUBURB. Can I have the regex return the position of the match, so that I can extract additional information from the string (e.g. everything from the start of the string up to that position: SPECIAL APPLIANCE TYPE-A REQUIRED)?
Thanks for your help!

I solved one part of the question: Match.Index and Match.Length let me work out where the match starts and ends in the string, and from there I can do the bits and pieces I wanted to.
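As a minimal sketch of the position idea, here is the same approach in Python rather than VB.NET; Python's m.start() and m.end() play the role of Match.Index and Match.Index + Match.Length. The pattern is a deliberately shortened, hypothetical version of the Case 2 regex (one trigger word, only the M map marker), and split_around_address is a name made up for this example.

```python
import re

# Hypothetical, shortened pattern: one trigger word, the address (lazy),
# then a lookahead for the " M " map-reference marker.
PATTERN = re.compile(r"\bREQUIRED\s+(?P<address>.*?)(?=\s+M\s)")


def split_around_address(text):
    """Return (before, address, after), or None if the pattern does not match."""
    m = PATTERN.search(text)
    if m is None:
        return None
    start, end = m.start("address"), m.end("address")
    # Slice the original string using the offsets of the named group.
    return text[:start].rstrip(), text[start:end], text[end:].strip()


print(split_around_address(
    "SPECIAL APPLIANCE TYPE-A REQUIRED EXAMPLE ST SUBURB M 215 G5"
))
```

The three parts are the text before the address (including the trigger word), the address itself, and the map reference that follows it.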
The only bit I could not work out is how to use a regex like the one below so that CNR is included in the returned match if it is found, but STREET1 or ROAD1 is not:
Regex = New Regex("( CNR ||)(.*?)(?=\SVSE| M | SVC | SVSW | SVNE |SVNW )", RegexOptions.RightToLeft)
For example, if my string is EXAMPLE TEXT CNR 123 STREET A SUBURB M 215 G2, it would return CNR 123 STREET A SUBURB, but if my string is EXAMPLE TEXT STREET1 STREET A SUBURB M 215 G2, it would return STREET A SUBURB.
I should point out that STREET1 in the above example is the point at which the regex starts, but it is not included in the match. Since STREET A may be a different phrase each time, I can't just look for STREET A.
Thanks!
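One possible way to get this behaviour is to put the phrases that belong to the address inside a capturing group, leave the phrases that only mark the start outside it, and rebuild the result from the groups. The sketch below shows the idea in Python, not VB.NET (.NET spells named groups (?<name>...) instead of (?P<name>...)). The KEEP and DROP lists are shortened assumptions, extract_address is a hypothetical helper name, and the leading greedy .* stands in for RegexOptions.RightToLeft, which Python's re module does not have.

```python
import re

KEEP = r"CNR"                     # triggers that are part of the address
DROP = r"STREET1|ROAD1|REQUIRED"  # triggers that only mark where it starts
MAP_REF = r"(?=\s+(?:M|SVC|SVSE|SVSW|SVNE|SVNW)\s)"  # lookahead, not consumed

# The greedy ".*" makes the engine settle on the right-most trigger.
PATTERN = re.compile(
    rf".*\b(?:(?P<keep>{KEEP})|{DROP})\s+(?P<rest>.*?){MAP_REF}"
)


def extract_address(text):
    """Return the address, with a KEEP trigger prepended when one matched."""
    m = PATTERN.match(text)
    if m is None:
        return None
    # group("keep") is None when a DROP trigger matched instead.
    parts = [m.group("keep"), m.group("rest")]
    return " ".join(p for p in parts if p)


print(extract_address("EXAMPLE TEXT CNR 123 STREET A SUBURB M 215 G2"))
print(extract_address("EXAMPLE TEXT STREET1 STREET A SUBURB M 215 G2"))
```

The first call keeps CNR in the result and the second drops STREET1, because only the KEEP alternative sits inside a capturing group.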

Related

Regex to extract (german) street number

I have the following street constellations:
| Street name | extracted value |
| --------------------------------------- | --------------- |
| Lilienstr. 12a | 12a |
| Hagentorwall 3 | 3 |
| Seilerstr. 14 (Eingang Birkenstr.) | 14 |
| Guentherstr. 43 B | 43 B |
| Eberhard-Leibnitz Str. 1 WH 5B 241 | 1 |
| 1019-1781 Borderlinx C/O SEKO Logistics | - |
My Regex is partially working (https://regex101.com/r/KumamP/2):
\d+(?:[a-zA-Z]$|\s[a-zA-Z]$)?
Does anyone have a better solution? Eberhard-Leibnitz Str. should give me only one result or none. 1019-1781 Borderlinx C/O SEKO Logistics should give me no result.

The following regex works for your examples:
^[ \-a-zA-Z.]+\s+(\d+(\s?\w$)?)
https://regex101.com/r/KumamP/4
The basic assumption (as your samples suggest) is that valid "street constellations" always start with a street name followed by the street/house number.
The next regex also works for an entry like Straße des 17. Juni 1:
^[ \-0-9a-zA-ZäöüÄÖÜß.]+?\s+(\d+(\s?[a-zA-Z])?)\s*(?:$|\(|[A-Z]{2})
https://regex101.com/r/KumamP/5
But as the commenters already wrote, it is difficult for a regular expression to distinguish between numeric parts of a street name and the street number, even more so if you allow "unspecified" suffixes like (Eingang Birkenstr.) or WH 5B 241 as in your examples.
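To check the second regex against the samples, a small Python harness such as the following can help. The pattern is copied unchanged from the answer; street_number and the sample list are names introduced only for this sketch.

```python
import re

# Second regex from the answer above, unchanged.
STREET_NO = re.compile(
    r"^[ \-0-9a-zA-ZäöüÄÖÜß.]+?\s+(\d+(\s?[a-zA-Z])?)\s*(?:$|\(|[A-Z]{2})"
)


def street_number(line):
    """Return the captured street number, or None when nothing matches."""
    m = STREET_NO.search(line)
    return m.group(1) if m else None


samples = [
    "Lilienstr. 12a",
    "Hagentorwall 3",
    "Seilerstr. 14 (Eingang Birkenstr.)",
    "Guentherstr. 43 B",
    "Eberhard-Leibnitz Str. 1 WH 5B 241",
    "1019-1781 Borderlinx C/O SEKO Logistics",
    "Straße des 17. Juni 1",
]
for s in samples:
    print(f"{s!r} -> {street_number(s)!r}")
```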

Parsing address lines is not trivial. Many countries have their own special rules and Germany and Austria are really tricky.
To better understand the examples you provided, one in particular illustrates the point:
"Eberhard-Leibnitz Str. 1 WH 5B 241"
The "WH" here stands for "Wohnung", but they usually use just "W" (and use some separator like "//"). So it would be more like:
"Eberhard-Leibnitz Str. 1 // W 5B 241"
It's also common to find "co" or "c/o" or "z. H" (abbreviation for "zu Händen von"). Anything that follows it is just the name on the mailbox.
And last but not least, the address line could also contain the zip code and city name. It depends on the API you're interacting with, or on whether it's user input (it can get very wild then!).
So, to properly parse address lines, you should first normalize them, by removing that extra information. Then you can use a regex. Take a look at this gem: https://github.com/matiasalbarello/address_line_divider
Some good reads about the topic:
https://www.german-way.com/germans-we-dont-need-apartment-numbers/
https://allaboutberlin.com/guides/addressing-a-letter-in-germany
http://interactive.zeit.de/strassennamen/

How to remove all of the contents of a string except the first character?

I have a data set with first name, middle name, and last name. I'm going to merge it with another data set matching on the same variables.
In one data set the variable mi looks like:
Lowell
Ann
Carl
A
Fran
Allen
And I want it to look like:
L
A
C
A
F
A
I tried this:
gen mi2 = substr(mi, 2, length(mi))
but this does the opposite of what I want, although it's the closest I've been able to get. I know this is probably a really easy problem, but I'm stumped at the moment.

You are on the right track with substr. See the example below:
clear
input str10 mi
Lowell
Ann
Carl
A
Fran
Allen
end
gen mi2 = substr(mi,1,1)
list, sep(0)
     +--------------+
     |     mi   mi2 |
     |--------------|
  1. | Lowell     L |
  2. |    Ann     A |
  3. |   Carl     C |
  4. |      A     A |
  5. |   Fran     F |
  6. |  Allen     A |
     +--------------+
The second and third arguments to substr are the starting position and number of characters respectively. In this case, you want to start at the first character, and take one character, so substr(mi, 1, 1) is what you need.
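For readers more used to slicing than to Stata's substr, the start-and-length convention can be sketched in Python. stata_substr is a hypothetical helper written for this illustration; it only mimics the positive, 1-based start case described above.

```python
def stata_substr(s, start, length):
    """Mimic Stata's substr(s, start, length) for a positive, 1-based start."""
    begin = start - 1  # Stata counts from 1, Python from 0
    return s[begin:begin + length]


names = ["Lowell", "Ann", "Carl", "A", "Fran", "Allen"]

# substr(mi, 1, 1): start at the first character, take one character.
print([stata_substr(n, 1, 1) for n in names])

# substr(mi, 2, length(mi)): the question's attempt, which drops the first character.
print([stata_substr(n, 2, len(n)) for n in names])
```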

How do I select a substring using a regexp in robot framework

In the Robot Framework library called String, there are several keywords that allow us to use a regexp to manipulate a string, but these manipulations don't seem to include selecting a substring from a string.
To clarify, what I intend is to have a price, i.e. € 1234,00 from which I would like to select only the 4 primary digits, meaning I am left with 1234 (which I will convert to an int for use in validation calculations). I have a regexp which will allow me to do that, which is as follows:
(\d+)[\.\,]
If I use Remove String Using Regexp with this regexp I will be left with exactly what I tried to remove. If I use Get Lines Matching Regexp, I will get the entire line rather than just the result I wanted, and if I use Get Regexp Matches I will get the right result except it will be in a list, which I will then have to manipulate again so that doesn't seem optimal.
Did I simply miss the keyword that will allow me to do this or am I forced to write my own custom keyword that will let me do this? I am slightly amazed that this functionality doesn't seem to be available, as this is the first use case I would think of when I think of using a regexp with a string...

You can use the Evaluate keyword to run some Python code.
For example:
| Using 'Evaluate' to find a pattern in a string
| | ${string}= | set variable | € 1234,00
| | ${result}= | evaluate | re.search(r'\\d+', '''${string}''').group(0) | re
| | should be equal as strings | ${result} | 1234
Starting with Robot Framework 2.9 there is a keyword named Get Regexp Matches, which returns a list of all matches.
For example:
| Using 'Get regexp matches' to find a pattern in a string
| | ${string}= | set variable | € 1234,00
| | ${matches}= | get regexp matches | ${string} | \\d+
| | should be equal as strings | ${matches[0]} | 1234
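The Python that Evaluate runs can also be tried on its own. The sketch below uses the regex from the question, (\d+)[\.\,], and converts the captured digits to an int. leading_amount is a hypothetical helper name, and the sketch assumes the price has no thousands separator (for "1.234,00" it would return 1).

```python
import re


def leading_amount(price):
    """Return the digits before the first '.' or ',' as an int, or None."""
    m = re.search(r"(\d+)[.,]", price)
    return int(m.group(1)) if m else None


print(leading_amount("€ 1234,00"))  # 1234
```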

Regex - Contains pattern but not starting with xyz

I'm trying to match a number pattern in a text file.
The file can contain values such as
12345 567890
90123 string word word 54616
98765
The pattern should match on any line that contains a 5 digit number that does not start with 1234
I have tried using ((?!1234).*)[[:digit:]]{5} but it does not give the desired results.
Edit: The pattern can occur anywhere in the line and should still match
Any suggestions?

This regex should work for matching a line containing a number at least 5 digits long iff the line does not start with '12345':
^((?!12345).*\d{5}.*)$
Short explanation:
^          match the start of a line
(?!12345)  look ahead and make sure the line does not begin with "12345"
.*         match any character sequence
\d{5}      match exactly 5 digits
.*         match any character sequence
$          match the end of the line
EDIT:
It seems that the question has been edited, so this solution no longer reflects the OP's requirements. Still I'll leave it here in case someone looking for something similar lands on this page.

The following would work, using \b to match word boundaries such as start of string or space:
\b(?!12345)\d{5}.*

Try this. It matches at least 5 decimal digits where the match does not end in 12345, using a negative lookbehind:
\d{5,}(?<!12345)
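Under one reading of the edited question (a standalone number of exactly five digits, anywhere in the line, whose first four digits are not 1234), a word-boundary variant can be sketched in Python as follows. The interpretation and the names FIVE_DIGITS and matching_lines are assumptions made for this example, not something the question confirms.

```python
import re

# \b ... \b keeps the number to exactly five digits; the lookahead
# rejects candidates that begin with 1234.
FIVE_DIGITS = re.compile(r"\b(?!1234)\d{5}\b")


def matching_lines(lines):
    """Return the lines that contain at least one qualifying number."""
    return [line for line in lines if FIVE_DIGITS.search(line)]


lines = [
    "12345 567890",
    "90123 string word word 54616",
    "98765",
]
print(matching_lines(lines))
```

With this reading, the first line is skipped because 12345 starts with 1234 and 567890 has six digits.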

notepad++: keep regex (multi occurrence per line) and line structure, remove other characters

I have a 130k line text file with patent information and I just want to keep the dates (regex "[0-9]{4}-[0-9]{2}-[0-9]{2} ") for subsequent work in Excel. For this purpose I need to keep the line structure intact (also blank lines). My main problem is that I can't seem to find a way to identify and keep multiple occurrences of date information in the same line while deleting all other information.
Original file structure:
US20110228428A1 | US | | 7 | 2010-03-19 | SEAGATE TECHNOLOGY LLC
US20120026629A1 | US | | 7 | 2010-07-28 | TDK CORP | US20120127612A1 | US | | EXAMINER | 2010-11-24 | | US20120147501A1 | US | | 2 | 2010-12-09 | SAE MAGNETICS HK LTD,HEADWAY TECHNOLOGIES INC
Desired file structure:
2010-03-19
2010-07-28 2010-11-24 2010-12-09
Thank you for your help!

Search for
.*?(?:([0-9]{4}-[0-9]{2}-[0-9]{2})|$)
And replace with
" $1"
Leave out the quotes; they are only there to show that there is a space before the $1. This will also put a space before the first match in a row.
This regex matches as little as possible (.*?) before it finds either a date or the end of the row (the $). If a date is found, it is stored in $1 because of the parentheses around it. The replacement then just puts a space, to separate the dates, followed by the date from $1.
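If the file can also be processed outside Notepad++, the same result can be sketched in Python: collect every date on a line and join them with spaces, leaving lines without dates (including blank ones) empty so the line structure is preserved. keep_dates is a hypothetical function name, and the sample text is abbreviated from the question.

```python
import re

DATE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def keep_dates(text):
    """Keep only the dates on each line; the number of lines is unchanged."""
    return "\n".join(" ".join(DATE.findall(line)) for line in text.split("\n"))


sample = (
    "US20110228428A1 | US | | 7 | 2010-03-19 | SEAGATE TECHNOLOGY LLC\n"
    "\n"
    "US20120026629A1 | US | | 7 | 2010-07-28 | TDK CORP | US20120127612A1 "
    "| US | | EXAMINER | 2010-11-24 | | US20120147501A1 | US | | 2 | 2010-12-09"
)
print(keep_dates(sample))
```

Unlike the search-and-replace above, this does not put a space before the first date on each line.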

We Keep Coding
