I have some text containing various measurements that I'm trying to extract with a regex.
The text can look something like this:
Ipsum Lorem 3. 100x210 cm
Ipsum Lorem Lorem, 100x210 cm
I have got as far as extracting the measurements, but when there is an integer in the middle of the text (as in the first example) my regex fails.
([0-9x]+)(?:\^(-?\d+))?
Gets me
Match 1 : 100x210
Match 2 : 3
Match 3 : 100X210
Any suggestions on how I can skip match 2 and match only INTxINT?
Thanks in advance
Using a character class [0-9x]+ could also match only xxx or, as in this case, only 3.
The optional group in your pattern could also match 100x210^-2; I'm not sure that is intended, as \^ matches a literal caret.
To match both the lower and uppercase variant of x, you could use a character class [xX] or make the regex case insensitive.
Using word boundaries \b on the left and right:
\b\d+[xX]\d+\b
Or a more specific pattern using a capturing group that also matches the cm part afterwards:
\b(\d+[xX]\d+) cm\b
See a regex demo
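A minimal Python sketch of the capturing-group pattern above, run against the two sample lines from the question. The names MEASURE and extract_measures are only illustrative, and Python's re module is assumed since the question does not name a language.

```python
import re

# Pattern from the answer above: digits, x or X, digits, then " cm".
# The capturing group keeps only the INTxINT part.
MEASURE = re.compile(r"\b(\d+[xX]\d+) cm\b")

samples = [
    "Ipsum Lorem 3. 100x210 cm",
    "Ipsum Lorem Lorem, 100x210 cm",
]


def extract_measures(text):
    """Return every INTxINT measurement that is followed by ' cm'."""
    return MEASURE.findall(text)


results = [extract_measures(s) for s in samples]
print(results)  # [['100x210'], ['100x210']]
```

The lone 3 in the first line is skipped because the pattern requires digits on both sides of the x.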
You may use a regex like
\d+x\d+
See proof. It will match two substrings containing one or more digits separated with x character.
Find the first letter and sign of a sentence with Regex.
At the beginning of the sentence can sometimes be letters and sometimes numbers.
15. Lorem ipsum is placeholder text
B. Lorem ipsum is placeholder text
C.Lorem ipsum is placeholder text
D . Lorem ipsum is placeholder text
E,Lorem ipsum is placeholder text
I wrote something like this:
[\dga-zA-Z.]{1\s}
Demo with regex101
But it doesn't work for every sentence. Moreover, it does not detect a space between the first letter/digit and the punctuation mark.
Where am I making a mistake?
Also, in terms of performance, does it make more sense to use a regex or plain PHP for such scenarios?
Hello, this matched all of your provided examples:
([A-Za-z\d ]+)(\.|,)
What this does is the following:
It matches lowercase and uppercase letters, digits, or spaces. It should find one or more of those (the + sign).
It should end with a dot or a comma (\.|,). Note: in regex, the dot must be escaped outside a character class.
If that doesn't do the trick, comment below.
Edit: demo here: click
The following regex matches a single letter or multiple digits at the beginning of a sentence, followed by optional whitespace and then a single period or comma:
^(([a-zA-Z]{1}|[0-9]+)\s*[.,]{1})(.*)$
This is the breakdown:
^ # Asserts position at start of the line
[a-zA-Z]{1}|[0-9]+ # Match a single alphabetic character or one or more digits
\s* # Matches whitespace characters between 0 and unlimited times
[.,]{1} # Matches a single period or comma character literal
.* # Matches the rest of the text
$ # Asserts position at end of the line
Group 1 - returns the letter/number together with the period/comma (including any spaces between them), in case you need both.
Group 2 - returns only the letter or number at the start of the sentence, which I assume is what you'll actually be looking for most of the time.
Group 3 - returns the rest of the text.
The regex will need to be modified depending on what you want, for example if you don't want a match when there are spaces after the letter/digits at the start of the sentence, or if you want to allow more separator characters. Let me know if you have any additional constraints you'd like this regex to conform to.
See the DEMO
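To make the three groups concrete, here is a small Python sketch that applies the pattern above to the five sample sentences. The question is about PHP, so Python is used here only for illustration, and LABEL and split_label are hypothetical names.

```python
import re

# Pattern from the answer above, unchanged.
LABEL = re.compile(r"^(([a-zA-Z]{1}|[0-9]+)\s*[.,]{1})(.*)$")

lines = [
    "15. Lorem ipsum is placeholder text",
    "B. Lorem ipsum is placeholder text",
    "C.Lorem ipsum is placeholder text",
    "D . Lorem ipsum is placeholder text",
    "E,Lorem ipsum is placeholder text",
]


def split_label(line):
    """Return (label, rest) using groups 2 and 3, or None if there is no label."""
    m = LABEL.match(line)
    if m is None:
        return None
    return m.group(2), m.group(3).strip()


labels = [split_label(line)[0] for line in lines]
print(labels)  # ['15', 'B', 'C', 'D', 'E']
```

A sentence without a leading label, such as "Lorem ipsum", yields None because no period or comma follows the first letter.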
Use: ^[\da-zA-Z]+\h*[.,]
Demo
Explanation:
^ # beginning of line
[\da-zA-Z]+ # 1 or more letter or digit
\h* # 0 or more horizontal spaces
[.,] # a dot or a comma
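Note that \h is a PCRE escape; Python's re module does not support it. As a rough sketch in Python, [ \t] can stand in for horizontal whitespace (an approximation, since \h also covers other Unicode spaces). The variable names below are illustrative.

```python
import re

# Same idea as ^[\da-zA-Z]+\h*[.,], with [ \t] substituted for \h
# because Python's re module has no \h escape.
PREFIX = re.compile(r"^[\da-zA-Z]+[ \t]*[.,]", re.MULTILINE)

text = "\n".join([
    "15. Lorem ipsum is placeholder text",
    "B. Lorem ipsum is placeholder text",
    "C.Lorem ipsum is placeholder text",
    "D . Lorem ipsum is placeholder text",
    "E,Lorem ipsum is placeholder text",
])

# One match per line: the leading label plus its punctuation mark.
prefixes = [m.group(0) for m in PREFIX.finditer(text)]
print(prefixes)  # ['15.', 'B.', 'C.', 'D .', 'E,']
```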
I have the following RegEx to extract US address from a string.
(\d+)[ \n]+((\w+[ ,])+[\$\n, ]+){2}([a-zA-Z]){2}[$\n, ]+(\d){5}
This is not working when the address is in the below format.
2933 Glen Crow Court
San Jose
CA 95148
and is working for the below data.
2933 Glen Crow Court,
San Jose, CA 95148
.
2933 Glen Crow Court, San Jose, CA 95148
Any help on this would be much appreciated.
You can simplify your pattern to something like this to match the address, whether it is on one line or spread over multiple lines.
\b\d+(?:\s+[\w,]+)+?\s+[a-zA-Z]{2}\s+\d{5}\b
Regex Explanation:
\b\d+ - Starts at a word boundary and matches one or more digits
(?:\s+[\w,]+)+? - A non-capturing group that matches one or more whitespace characters followed by one or more word characters or commas; the group repeats one or more times, lazily
\s+[a-zA-Z]{2} - Matches one or more whitespace characters, then two letters, to expect text like CA or NY
\s+\d{5}\b - Matches one or more whitespace characters, then five digits and a word boundary, to avoid matching part of a longer number
Demo
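A quick Python check of the simplified pattern against the multi-line and single-line layouts from the question. The third sample, with surrounding text, is made up for illustration, as are the names ADDRESS and find_address.

```python
import re

# Simplified pattern from the answer above; \s+ also matches newlines,
# which is what lets it span the multi-line layout.
ADDRESS = re.compile(r"\b\d+(?:\s+[\w,]+)+?\s+[a-zA-Z]{2}\s+\d{5}\b")

samples = [
    "2933 Glen Crow Court\nSan Jose\nCA 95148",
    "2933 Glen Crow Court,\nSan Jose, CA 95148",
    # Hypothetical sample: the address embedded in a longer sentence.
    "Ship to: 2933 Glen Crow Court, San Jose, CA 95148. Thanks",
]


def find_address(text):
    """Return the first address-like substring, or None."""
    m = ADDRESS.search(text)
    return m.group(0) if m else None


found = [find_address(s) for s in samples]
for address in found:
    print(repr(address))
```

The pattern only checks the shape (number, words, two letters, five digits), so it does not validate that the two letters are a real state code.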
Add ? to the [ ,] check:
(\d+)[ \n]+((\w+[ ,]?)+[\$\n, ]+){2}([a-zA-Z]){2}[$\n, ]+(\d){5}
Try this pattern: \d+\s+[\w ]+[\s,]+[\w ]+[\s,]+\w+ \d+
Explanation:
\d+\s+ - match one or more digits, then one or more whitespace characters
[\w ]+[\s,]+ - match one or more word characters or spaces, then one or more whitespace characters or commas
\w+ \d+ - match one or more word characters, a space, and one or more digits
Demo
Not drake but you can thank me later...
r"(?:(\d+ [A-Za-z][A-Za-z ]+)[\s,]*([A-Za-z#0-9][A-Za-z#0-9 ]+)?[\s,]*)?(?:([A-Za-z][A-Za-z ]+)[\s,]+)?((?=AL|AK|AS|AZ|AR|CA|CO|CT|DE|DC|FM|FL|GA|GU|HI|ID|IL|IN|IA|KS|KY|LA|ME|MH|MD|MA|MI|MN|MS|MO|MT|NE|NV|NH|NJ|NM|NY|NC|ND|MP|OH|OK|OR|PW|PA|PR|RI|SC|SD|TN|TX|UT|VT|VI|VA|WA|WV|WI|WY)[A-Z]{2})(?:[,\s]+(\d{5}(?:-\d{4})?))?"
You can test it out here... demo
Note: this only works for US addresses.
Sample text
5950 S Willow Dr Ste 304
Greenwood Village, CO 80111
P (123) 456-7890
F (123) 456-7890
Get Directions
Tried the following but it grabbed the first line of the address as well
(.*)(?=(\n.*){2}$)
Also tried
P\s(\(\d{3})\)\s\d+-\d+
but it doesn't work in WebHarvy even though it works on RegexStorm
Looking for an expression to match the phone and fax numbers from it. I would be using the expression in WebHarvy
https://www.webharvy.com/articles/regex.html
Thanks
Your second pattern is almost what you need. With P\s(\(\d{3})\)\s\d+-\d+, you captured only the (\(\d{3}) part into Group 1, while you need to capture the whole number.
I also suggest restricting the context: either match P as a whole word, or as the first word on a line:
\bP\s*(\(\d{3}\)\s*\d+-\d+)
or
(?m)^\s*P\s*(\(\d{3}\)\s*\d+-\d+)
See the regex demo, and here is what you need to pay attention to there:
In the first pattern, \b matches a word boundary. In the second, (?m)^ matches the start of a line ((?m) makes ^ match at the start of every line) and \s* then matches zero or more whitespace characters. You may restrict it to horizontal whitespace only by replacing \s* with [\p{Zs}\t]*.
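As a sketch of the line-anchored variant, the Python snippet below runs it over the sample text. Two things here are assumptions and not part of the answer: the P is widened to [PF] so the fax line is captured too, and Python's re is used only for illustration (WebHarvy uses its own regex engine, so behaviour there may differ).

```python
import re

# Line-anchored pattern from the answer, with P widened to [PF]
# (an adaptation) so both the phone and the fax line are captured.
NUMBER = re.compile(r"(?m)^\s*([PF])\s*(\(\d{3}\)\s*\d+-\d+)")

text = (
    "5950 S Willow Dr Ste 304\n"
    "Greenwood Village, CO 80111\n"
    "P (123) 456-7890\n"
    "F (123) 456-7890\n"
    "Get Directions"
)

# findall returns (label, number) pairs; dict() maps label -> number.
numbers = dict(NUMBER.findall(text))
print(numbers)  # {'P': '(123) 456-7890', 'F': '(123) 456-7890'}
```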
I'm trying to match digits with at least 5 characters (for the whole string) connected by a hyphen or space (like a bank account number).
e.g
"12345-62436-223434"
"12345 6789 123232"
I should also be able to match
"123-4567-890"
The current pattern I'm using is
(\d[\s-]*){5,}[\W]
But I'm getting these problems.
When I do this, I also match all the whitespace that follows the digits, as long as there are at least 5 digit characters.
I'm going to replace the match, so I only want to match the digits, not the whitespace and hyphens.
When I get the match what I want to do is to mask it like the one below.
from "12345-67890-11121" to "*****-*****-*****"
or
from "12345 67890 11121" to "***** ***** *****"
My only problem is that I don't get to match it like what I want to.
Thanks!
This one might work for you (probably some false-positives, though):
\d[ \d-]{3,}\d
See a demo on regex101.com.
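One way to get from that match to the masked output the question asks for is a two-step substitution: find each candidate with the pattern above, then replace only the digits inside it. This is a sketch in Python (the question does not name a language), and ACCOUNT and mask are illustrative names.

```python
import re

# Pattern from the answer above: a digit, then at least three digits,
# spaces or hyphens, then a closing digit.
ACCOUNT = re.compile(r"\d[ \d-]{3,}\d")


def mask(text):
    """Replace every digit inside an account-like match with '*'."""
    return ACCOUNT.sub(lambda m: re.sub(r"\d", "*", m.group(0)), text)


print(mask("12345-67890-11121"))  # *****-*****-*****
print(mask("12345 67890 11121"))  # ***** ***** *****
print(mask("123-4567-890"))       # ***-****-***
```

Because the inner substitution touches only digits, separators are preserved and trailing whitespace is never part of the replacement. The false positives mentioned above still apply, since the pattern counts characters, not digits.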
Maybe you want something like this:
(\d{5,})(?:-|\s)(\d{5,})(?:-|\s)(\d{5,})
Demo
EDIT:
(\d+)(?:-|\s)(\d+)(?:-|\s)(\d+)
Demo
One option here is to take your existing pattern, and then add a positive lookahead which asserts that there are seven or more characters in the pattern. Assuming that there are two spaces or dashes in the account number, this will guarantee that there are five or more digits.
You can try using the following regex:
^(?=.{7,}$)((\\d+ \\d+ \\d+)|(\\d+-\\d+-\\d+))$
Test code:
String input = "123-4567-890";
boolean match = input.matches("^(?=.{7,}$)((\\d+ \\d+ \\d+)|(\\d+-\\d+-\\d+))$");
if (match) {
System.out.println("Match!");
}
If you need to first fish out the account numbers from a larger document/source, then do so and afterwards you can apply the regex logic above.
I am trying to match only the street name from a series of addresses. The addresses might look like:
23 Barrel Rd.
14 Old Mill Dr.
65-345 Howard's Bluff
I want to use a regex to match "Barrel", "Old Mill", and "Howard's". I need to figure out how to exclude the last word. So far I have a lookbehind to exclude the digits, and I can include the words and spaces and "'" by using this:
(?<=\d\s)(\w|\s|\')+
How can I exclude the final word (which may or may not end in a period)? I figure I should be using a lookahead, but I can't figure out how to formulate it.
You don't need a look-behind for this:
/^[-\d]+ ([\w ']+) \w+\.?$/
Match one or more digits and hyphens
space
match letters, digits, spaces, apostrophes into capture group 1
space
match a final word and an optional period
An example Ruby implementation:
regex = /^[-\d]+ ([\w ']+) \w+\.?$/
tests = [ "23 Barrel Rd.", "14 Old Mill Dr.", "65-345 Howard's Bluff" ]
tests.each do |test|
p test.match(regex)[1]
end
Output:
"Barrel"
"Old Mill"
"Howard's"
I believe the lookahead you want is (?=\s\w+\.?$).
\s: you don't want to include the last space
\w: at least one word-character (A-Z, a-z, 0-9, or '_')
\.?: optional period (for abbreviations such as "St.")
$: make sure this is the last word
If there's a possibility that there might be additional whitespace before the newline, just change this to (?=\s\w+\.?\s*$).
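Putting the lookbehind from the question together with this lookahead might look like the Python sketch below. The asker's (\w|\s|\')+ is written as the equivalent character class [\w\s']+, and STREET and street_name are illustrative names.

```python
import re

# Lookbehind from the question plus the lookahead suggested above.
# [\w\s']+ is the character-class form of the asker's (\w|\s|\')+.
STREET = re.compile(r"(?<=\d\s)[\w\s']+(?=\s\w+\.?$)")

addresses = ["23 Barrel Rd.", "14 Old Mill Dr.", "65-345 Howard's Bluff"]


def street_name(address):
    """Return the words between the house number and the final word."""
    m = STREET.search(address)
    return m.group(0) if m else None


names = [street_name(a) for a in addresses]
print(names)  # ['Barrel', 'Old Mill', "Howard's"]
```

The greedy middle part first swallows the last word as well, then backtracks until the lookahead can match exactly one final word before the end of the string.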
Why not just match what you want? If I have understood well you need to get all the words after the numbers excluding the last word. Words are separated by space so just get everything between numbers and the last space.
Example
\d+(?:-\d+)? ((?:.)+) Note: there's a space at the end.
That will end up with what you want in \1 for each match.
If you just want to match the exact text you may use \K (not supported by every regex engine) but: Example
With the regex \d+(?:-\d+)? \K.+(?= )
Another option is to use the split() function provided in most scripting languages. Here's the Python version of what you want:
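stname = ' '.join(address.split()[1:-1])
(Here address is the original address line, and stname is the name of the street, i.e., what you're trying to extract. The slice drops the house number and the final word, and join puts the remaining words back into a single string.)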