How can I extract the first digits from string - regex

I am trying to use PowerShell to extract the leading digits from a long string. How can I use a regex to get only the first run of digits from a string?
String 1:
000660007501S W RUSSELL DLC NO 41 SLY 2.5 FT OF ELY 313.82 FT OF FOLLOWING DESCRIBED PORTION OF SAMUEL W RUSSELL DONATION CLAIM` 1000
string 2:
010454040006ALDERBROOK DIV NO 05 62000 14000040
string 3:
012000012000ALEXANDER ACRE TRS S 1/2 OF LOT 38 TGW LOT 39 TGW N 45.96 FT OF E 109.23 FT LOT 40 LESS ANY POR PLTD DEVON LANE 13000 38-39-40

I was able to do it like this:
$accountnumber = $p.Substring(0,16) -replace '\D+',''
$Parcelnumber = $accountnumber.Substring(0,10)
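A regex anchored at the start of the string can grab the leading digits directly, without relying on a fixed 16-character window. The following is only a minimal Python sketch of that idea; the function name leading_digits is made up for illustration, and the 10-digit parcel slice simply mirrors the Substring(0,10) call above.

```python
import re


def leading_digits(text: str) -> str:
    """Return the run of digits at the very start of text, or '' if there is none."""
    match = re.match(r"\d+", text)  # re.match only tries to match at position 0
    return match.group(0) if match else ""


sample = "010454040006ALDERBROOK DIV NO 05 62000 14000040"
account_number = leading_digits(sample)
parcel_number = account_number[:10]  # first 10 digits, as in the PowerShell snippet
print(account_number, parcel_number)
```

The same pattern written as ^\d+ should behave the same way with PowerShell's -match operator, since the anchor does the work that re.match does here.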

Related

Golang regex : Ignore multiple occurrences

I've got a simple need.
Given this input (string): 10 20 30 40 65 45 44 67 100 200 65 40 66 88 65
I need to get all numbers between 65 and 66.
The problem is when there are multiple occurrences of each limit.
With a regex like (65).+(66), I capture 65 45 44 67 100 200 65 40 66, but I would like to get only 40.
How could I achieve this ?
https://regex101.com/r/9HoKxr/1
It sounds like you want the numbers between a '65' and the first '66' that follows it, with no other '65' in between. It's a bit verbose, but what about:
\b65((?:\s(?:\d|[1-57-9]\d|6[0-47-9]|\d{3,}))+?)\s66\b
\b65 - Match '65' preceded by a word boundary;
( - Open capture group;
(?:\s - Non-capture group that starts with a whitespace char;
(?:\d|[1-57-9]\d|6[0-47-9]|\d{3,}) - Nested non-capture group to match any integer but '65' or '66';
)+?) - Close non-capture group and match it at least once but as few times as possible. Then close the capture group;
\s66\b - Match another space followed by '66' and word-boundary.
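As a quick sanity check of the nested alternation, the sketch below (in Python rather than Go, purely for illustration) tests it against every integer from 0 to 999 and lists the ones it refuses. The constant name NOT_65_OR_66 is an invented label, not something from the answer.

```python
import re

# The inner alternation from the pattern above, tested on its own.
NOT_65_OR_66 = re.compile(r"\d|[1-57-9]\d|6[0-47-9]|\d{3,}")

# Collect every integer in 0..999 whose decimal form is NOT fully matched.
rejected = [n for n in range(1000) if not NOT_65_OR_66.fullmatch(str(n))]
print(rejected)  # [65, 66]
```

Note that a class such as 6[0-46-9] would let '66' through, so the second range has to start at 7.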
Note:
The captured group starts with a space, so the code strips it with the Trim() function from the strings package;
In my example I have used '10 20 30 40 65 45 44 40 66 200 65 40 66 88 65', which returns multiple matches. In that case I assume OP is looking for the 'shortest' matching substring;
By 'shortest' I mean the fewest elements when the substring is split on spaces (using the Fields function from the same strings package). Therefore '123456' is preferred over '1 2 3' despite being the longer substring in terms of characters;
Try:
package main

import (
	"fmt"
	"regexp"
	"strings"
)

func main() {
	s := `10 20 30 40 65 45 44 40 66 200 65 40 66 88 65`
	re := regexp.MustCompile(`\b65((?:\s(?:\d|[1-57-9]\d|6[0-47-9]|\d{3,}))+?)\s66\b`)
	matches := re.FindAllStringSubmatch(s, -1) // Retrieve all matches
	shortest := ``
	for _, m := range matches { // Loop over the matches; m[1] is the capture group
		if shortest == `` || len(strings.Fields(m[1])) < len(strings.Fields(shortest)) {
			shortest = strings.Trim(m[1], ` `)
		}
	}
	fmt.Println(shortest)
}

Extract number from a text after symbol "X"

I have the following text in a column, and I need to extract the number after the second "X" or "x".
In the text below, it is 54.
40sHT + 2/20sCMD X 30sHT + 2/20sCMD 56 X 54 54" AWM/C129-DOBY
Some other sample texts:
21sOE X 12sFL 56 X 36 63" PLAIN # Result must be : 36
40sC X 40sC_100 X 91_63" PLAIN # Result: 91
16sOE x 12sLY 84 x 48 71" 3/1 DRILL # Result: 48
Given:
40sHT + 2/20sCMD X 30sHT + 2/20sCMD 56 X 54 54" AWM/C129-DOBY # Result: 54
21sOE X 12sFL 56 X 36 63" PLAIN # Result: 36
40sC X 40sC_100 X 91_63" PLAIN # Result: 91
16sOE x 12sLY 84 x 48 71" 3/1 DRILL # Result: 48
Use:
[Xx]\s?(\d+)(?:.(?![Xx]))*$
Demo and Explanation:
https://regex101.com/r/KshMUE/1
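To see the pattern in action outside regex101, here is a small Python sketch run over the four sample rows. The helper name number_after_last_x is hypothetical. Strictly speaking the pattern picks the number after the last X that has no further X behind it, which coincides with the second X in all of the samples.

```python
import re
from typing import Optional

# X or x, an optional space, the digits to capture, then no further X/x up to the end.
AFTER_LAST_X = re.compile(r"[Xx]\s?(\d+)(?:.(?![Xx]))*$")


def number_after_last_x(text: str) -> Optional[str]:
    """Return the digits following the last X/x in text, or None if nothing matches."""
    match = AFTER_LAST_X.search(text)
    return match.group(1) if match else None


samples = [
    '40sHT + 2/20sCMD X 30sHT + 2/20sCMD 56 X 54 54" AWM/C129-DOBY',
    '21sOE X 12sFL 56 X 36 63" PLAIN',
    '40sC X 40sC_100 X 91_63" PLAIN',
    '16sOE x 12sLY 84 x 48 71" 3/1 DRILL',
]
results = [number_after_last_x(s) for s in samples]
print(results)  # ['54', '36', '91', '48']
```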
You didn't state which tool/language this is using, so it's hard to know for sure what to suggest.
However, if possible, I would consider splitting the string on the letter "x" (or "X") as this makes the regex part much easier to follow. For example, something like:
input = '40sHT + 2/20sCMD X 30sHT + 2/20sCMD 56 X 54 54" AWM/C129-DOBY'
input.split(/x/i)[2][/\d+/]
By doing this split, we first extract only the desired section of the string (in this case, ' 54 54" AWM/C129-DOBY'), so the regex (/\d+/) becomes trivial.
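The same split-then-search idea can be sketched in Python as follows. The function name number_after_second_x is invented for this example, and it assumes, as in the samples, that the letter x does not occur anywhere else in the text before the number of interest.

```python
import re
from typing import Optional


def number_after_second_x(text: str) -> Optional[str]:
    """Split on x/X and return the first number in the piece after the second x."""
    parts = re.split(r"x", text, flags=re.IGNORECASE)
    if len(parts) < 3:  # fewer than two x's in the text
        return None
    match = re.search(r"\d+", parts[2])
    return match.group(0) if match else None


print(number_after_second_x('16sOE x 12sLY 84 x 48 71" 3/1 DRILL'))  # 48
```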
Try this:
(?i)(?<=x[^x]{1,100}x.)\d+
(?i): case-insensitive
(?<=: start of positive look-behind
x[^x]{1,100}x.: an x followed by 1 to 100 characters other than x, followed by another x and any single character
): end of look-behind
\d+: one or more digits

Regex extract string based on String match

I have this data with some messy addresses that contain a province, district, and ward, sometimes not in that order:
Name ADDRESS
Store1 453, Duy Tan, Phuong Nguyen Nghiem, Thanh pho Quang Ngai
Store2 13 DUNG SY THANH KHE, P. THANH KHE TAY
Store3 98 Phan Xich Long- P. 2
Store4 306 B4, NGUYENVAN LINH, Ward - 5
Store5 22, Ngo 421/16, Tran Duy Hung, To 42, Phuong Trung Hoa, Quan Cau Giay
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    //Replace each \ with \\ so that C# doesn't treat \ as escape character
    //Pattern: Start of string, any integers, 0 or 1 letter, end of word
    string sPattern = "^[0-9]+([A-Za-z]\\b)?";
    string sString = Row.ADDRESS ?? ""; //Coalesce to empty string if NULL
    //Find any matches of the pattern in the string
    Match match = Regex.Match(sString, sPattern, RegexOptions.IgnoreCase);
    //If a match is found
    if (match.Success)
        //Return the first match into the new
        //HouseNumber field
        Row.ward = match.Groups[0].Value;
    else
        //If not found, leave the HouseNumber blank
        Row.ward = "";
}
}
I would like to modify my regex to return the data like this in the column ward (you can see the synonyms in my addresses: Phuong, P., Ward, etc.).
Name ADDRESS ward
Store1 453, Duy Tan, Phuong Nguyen Nghiem, Quang Ngai Phuong Nguyen Nghiem
Store2 13 DUNG SY THANH KHE, P. THANH KHE TAY Phuong THANH KHE TAY
Store3 98 Phan Xich Long- P. 2 Phuong 2
Store4 306 B4, NGUYENVAN LINH, Ward - 5 Phuong 5
Store5 22, Ngo 421/16,--. To 42, Phuong Trung Hoa, Quan Cau Giay Phuong Trung Hoa
I use that regex to extract the civic number, but is there a way to modify it so that it returns the data in my ward column as in the example above?
The groups in this regex, as tested in https://regex101.com/, match the data in your ward column, as in your example. You may need to define more precisely the patterns in which each synonym appears, since this regex only matches them as they appear in your example data, but it may be enough for you to extrapolate to the regex that you really need.
(Phuong.*),|P\.(.*$)|Ward - (.*$)
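The group in option 1 matches from Phuong (inclusive) up to a following comma. Because .* is greedy, this is the last comma on the line; use Phuong[^,]* instead if the match should stop at the first comma.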
The group in option 2 matches anything that comes after P. until the end of the string.
The group in option 3 matches anything that comes after Ward - until the end of the string.
This one is a bit more advanced, but it only matches what you mentioned in your examples, no groups:
Phuong.*(?=,)|(?<=P\.).*$|(?<=Ward - ).*$
Test it in https://regex101.com to see how it works and to see what each part means.
Finally, you may want to exclude Phuong from the match in option 1 so that your program can always print Phuong followed by the match.
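Putting that last suggestion together, one possible variant captures only the ward name after any of the three synonyms and then always prefixes it with Phuong. This is a Python sketch rather than the C# script component from the question; the WARD pattern and the extract_ward helper are assumptions made for illustration and have only been checked against the five sample addresses.

```python
import re

# Any of the synonyms (Phuong, "P.", "Ward -"), then the ward name up to the next comma.
WARD = re.compile(r"\b(?:Phuong|P\.|Ward\s*-)\s*([^,]*)")


def extract_ward(address: str) -> str:
    """Return the ward normalised as 'Phuong <name>', or '' if no synonym is found."""
    match = WARD.search(address)
    if not match:
        return ""
    return "Phuong " + match.group(1).strip()


addresses = [
    "453, Duy Tan, Phuong Nguyen Nghiem, Thanh pho Quang Ngai",
    "13 DUNG SY THANH KHE, P. THANH KHE TAY",
    "98 Phan Xich Long- P. 2",
    "306 B4, NGUYENVAN LINH, Ward - 5",
    "22, Ngo 421/16, Tran Duy Hung, To 42, Phuong Trung Hoa, Quan Cau Giay",
]
wards = [extract_ward(a) for a in addresses]
for ward in wards:
    print(ward)
```

The same pattern should carry over to Regex.Match in C# with match.Groups[1].Value, though real data will likely contain spellings that the samples do not show.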

Retrieve the words after the last numeric occurrence using regex

I receive streetname + doorno in a string variable and have to split them. My current regex is /[0-9].*$/. This works fine for normal addresses, but I have addresses where the streetname also contains a numeric value. In this case, part of the streetname is returned as doorno too.
For example:
[Correct] Street = Example Street 15B returns doorno = 15B
[Correct] Street = Example Street 15 B returns doorno = 15 B
[Correct] Street = Example Street returns doorno = null
[Correct] Street = Example Street15 returns doorno = 15
[Incorrect] Street = Example Street 158 7 returns doorno = 158 7. However I am expecting, the streetname = Example Street 158 & doorno = 7
[Incorrect] Street = Example Street 158 7 B returns doorno = 158 7 B. However I am expecting, the streetname = Example Street 158 & doorno = 7 B
[Incorrect] Street = Example Street 158 7B returns doorno = 158 7B. However I am expecting, the streetname = Example Street 158 & doorno = 7B
[Incorrect] Street = Example Street158 7 B returns doorno = 158 7B. However I am expecting, the streetname = Example Street158 & doorno = 7B
Can someone please help me to fix the regex for the above incorrect cases?
You may use
/^(.*\D)(\d.*)$/
It matches:
^ - start of a string
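(.*\D) - Group 1: any 0+ chars (other than line break chars), as many as possible, ending with a non-digit char; because .* is greedy, Group 1 extends up to the last position where the subsequent subpatterns (i.e. \d.*$) can still match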
(\d.*) - Group 2: a digit and then any 0+ chars (other than line break chars)
$ - end of string.
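Here is a short Python sketch of how the two groups could be used on the addresses from the question. The split_address helper is a made-up name, and stripping the trailing space from Group 1 is an extra step that the regex itself does not perform.

```python
import re
from typing import Optional, Tuple

STREET_DOOR = re.compile(r"^(.*\D)(\d.*)$")


def split_address(address: str) -> Tuple[str, Optional[str]]:
    """Split into (streetname, doorno); doorno is None when the pattern does not match."""
    match = STREET_DOOR.match(address)
    if not match:
        return address.strip(), None
    # Group 1 keeps the separator before the door number, so trim it here.
    return match.group(1).strip(), match.group(2)


print(split_address("Example Street 158 7 B"))  # ('Example Street 158', '7 B')
print(split_address("Example Street"))          # ('Example Street', None)
```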

Regex extract only specific character and EOL

I am trying to extract some text using regex.
I want to extract only those lines that contain "pour 1e" or "Pour 1€" and nothing more.
The regex must be case-insensitive.
Here is my regex, which doesn't work the way I want:
/Pour ([0-9.,])(€|e)/im
and this is my text:
Tesseract Open Source OCR Engine v3.01 with Leptonica
CARDEURS
Horaire dejour de flhllll 5 19h00
pour 1€
pour 1€ supplémentaire
pour 1€ supplémentaire
pour 1€ supplémentaire
pour 1€ supplémentaire
par€ supplémentaire
Horaire de nuit de 19h00 5 flhllll
pour 1,50€
pour 1€ supplémentaire + 300 minutes
pour 1€ supplémentaire + 420 minutes
La joumée de 24 heures
35 minutes
+ 30 minutes
+ 35 minutes
+ 40 minutes
+ 45 minutes
+ 50 minutes
60 minutes
15€
Tesseract Open Source OCR Engine v3.01 with Leptonica
TARIFS
PARKING CARNOT
Homim de juur de 8:00 3 19:00 H01-aim de null de 19:00 5 8:00
mains d‘ ggg heme : G1-atuit moins d‘ ggg heure : Gmtuil
Pour 1e
Pour 1e supplémenlaire
Pour 1e suppléulentaire
Pour 1e supplémmmm
Pour 1e supplémmmm
Par e supplémenlaiI€
40 minutes
+ 40 minutes
+ 45 minutes
+ 50 minutes
+ 55 minutes
+ 55 minules
Pour 1e so nzinules
Pour 1e supplémenlaiI€ + 300 minllles
Pour 1e 5upplémenlai1Q + 420 minules
La journée a
e 24 heums 15€
You need to anchor the expression with ^ and $ which match beginning/end of line when /m is active. For example:
/^pour [0-9]+[0-9,.]*[e€]$/im
Use square brackets [] to specify a set of characters to match, a caret ^ to match the beginning of the line, and a dollar sign $ to match the end of the line. Depending on which regex implementation you are using, you may be able to pass the i flag to make it case-insensitive (in engines where ^ and $ match only at the start and end of the whole string by default, also add the m flag):
/^Pour 1[€e]$/i
Or handle case explicitly with character classes:
/^[Pp][Oo][Uu][Rr] 1[€e]$/
For matching repetitions, use * to match 0 or more of the previous element, + to match 1 or more, and ? to match 0 or 1.
In place of the 1 in the previous, you could use
[0-9.]+ to match any 1 or more digits or decimal points
[0-9]+\.?[0-9]* to match at least 1 digit followed by an optional decimal point and more digits
[0-9]+[0-9,]*\.?[0-9]* to match at least 1 digit, optionally more digits and commas, followed by an optional decimal point and more digits
You can also use curly braces {} to explicitly specify a number of repetitions (these must be escaped with a backslash \ in some regex engines)
[0-9]{1,3} would match 1, 2, or 3 digits
[0-9]{3} would match exactly 3 digits
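To get a feel for how strict each of these number patterns is, they can be tried against whole strings. The Python sketch below is only an illustration, and the fully_matches helper is an invented name.

```python
import re


def fully_matches(pattern: str, text: str) -> bool:
    """True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None


# [0-9.]+ is loose: any mix of digits and dots is accepted.
print(fully_matches(r"[0-9.]+", "1.2.3"))                      # True
# The stricter form allows at most one decimal point, after at least one digit.
print(fully_matches(r"[0-9]+\.?[0-9]*", "1.2.3"))              # False
print(fully_matches(r"[0-9]+\.?[0-9]*", "1.50"))               # True
# Commas are allowed among the integer digits.
print(fully_matches(r"[0-9]+[0-9,]*\.?[0-9]*", "1,234.56"))    # True
# Explicit repetition counts.
print(fully_matches(r"[0-9]{1,3}", "1234"))                    # False
print(fully_matches(r"[0-9]{3}", "123"))                       # True
```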
You can use parentheses () to group a part of a regex pattern for backreference or repetition.
So to match a line that starts with "Pour " followed by 1 or more digits, then an optional comma or decimal point with 2 digits, then the euro symbol or letter e, and any number of trailing spaces, but no other characters until end of line, and be case-insensitive:
/^Pour [0-9]+([,.][0-9][0-9])?[€e][ ]*$/i
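As a rough check, the sketch below applies that last pattern to a handful of lines taken from the OCR text in the question, using Python's re module with both the case-insensitive and multi-line flags. The variable names are made up for this example, and the stricter ONLY_ONE pattern is included for comparison because the question asks only for the "1€" / "1e" lines.

```python
import re

# The final pattern from above; MULTILINE makes ^ and $ work per line.
PRICE_LINE = re.compile(
    r"^Pour [0-9]+([,.][0-9][0-9])?[€e][ ]*$", re.IGNORECASE | re.MULTILINE
)
# Stricter variant that accepts only "Pour 1€" / "Pour 1e".
ONLY_ONE = re.compile(r"^Pour 1[€e]$", re.IGNORECASE | re.MULTILINE)

ocr_text = "\n".join([
    "pour 1€",
    "pour 1€ supplémentaire",
    "par€ supplémentaire",
    "pour 1,50€",
    "Pour 1e",
    "Pour 1e supplémenlaire",
    "15€",
])

price_lines = [m.group(0) for m in PRICE_LINE.finditer(ocr_text)]
only_one_lines = [m.group(0) for m in ONLY_ONE.finditer(ocr_text)]
print(price_lines)     # ['pour 1€', 'pour 1,50€', 'Pour 1e']
print(only_one_lines)  # ['pour 1€', 'Pour 1e']
```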