Regular expression to match six spaces followed by up to 31 alphanumerics

It's getting towards the end of the day and this is annoying me - one day I'll find the time to learn regex properly as I know it can save a lot of time when extracting info from text.
I need to match strings that match the following signature:
6 spaces followed by up to 31 alphanumerics (or spaces) and then no more alphanumeric text on that line.
E.g.
' sampleheading ' - is fine
' sampleheading 10^21/1 ' - should not match
' sampleheading sample ' - should not match
I've got ^(\s{6}[\w\s]{1,31}) matching the first bit correctly I think but I can't seem to get it to only select lines that don't have any text following the initial match.
Any help appreciated!
Edit:
I've updated the text as a number of you noted my hastily entered original samples would actually all have tested fine.

Use $ to match end of line:
^(\s{6}[\w\s]{1,31})$
Or, if you may still have spaces afterwards that you want to ignore:
^(\s{6}[\w\s]{1,31})\s*$
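A minimal Python sketch of the second pattern, in case it helps to see it run. The sample lines are assumptions built with exactly six leading spaces, and is_heading is a hypothetical helper name, not something from the question.

```python
import re

# Pattern from the answer above: six whitespace characters, then 1-31 word
# or whitespace characters, then optional trailing whitespace up to the end.
HEADING = re.compile(r"^(\s{6}[\w\s]{1,31})\s*$")


def is_heading(line: str) -> bool:
    """Return True if the whole line fits the heading signature."""
    return HEADING.match(line) is not None


# Hypothetical sample lines, not copied from the question.
samples = [
    " " * 6 + "sampleheading",          # plain heading
    " " * 6 + "sampleheading   ",       # trailing spaces are tolerated
    " " * 6 + "sampleheading 10^21/1",  # ^ and / are not word characters
    " " * 6 + "x" * 40,                 # more than 31 characters after the indent
    " " * 4 + "sampleheading",          # only four leading spaces
]

for line in samples:
    print(repr(line), is_heading(line))
```

Only the first two samples should be reported as headings. A line such as six spaces followed by "sampleheading sample" would also pass, because it stays within the 31-character limit.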

You can use a $ to indicate the end of a line, using \s* to allow optional whitespace at the end.
^\s{6}[\w\s]{1,31}\s*$
Your samples don't match what you say you want, however. They only start with four spaces, rather than six, and in the last sample, "sampleheading sample" is within the 31-character limit, so it matches too. (The middle sample is within the length too, but it contains non-word characters, so it doesn't match.) Is that what you want?

Add a $ to match the end of the line, e.g.
^(\s{6}[\w\s]{1,31})$

Aren't you simply saying 'match 6 spaces followed by 31 alphanumerics'? There's no concept there of 'and no more alphanumerics'.
I think what you have is good so far (!), but you need to follow it with (say) [^\w] - i.e. 'not an alphanumeric'.

Try this one out:
^\s{6}[\w\s]{1,31}\W.*$

Related

RegExp checking for sign only if there is text afterwards

I have some cases, which I need to filter with a regex. The values which need to be filtered are listed below:
// These should be catched
123456_Test.pdf
123456 Test.pdf
123456.pdf
// These shouldn't be catched
123456Abcasd.pdf
123456-Abcasd.pdf
123456_.pdf
The current regEx looks like this:
(\d{6,7})((\_| ){0,1})(.*)\..*
The problem here is that the last three are also matched. To give you a short overview of what's wrong with the wrongly matched strings:
The first capture group has to consist of 6-7 digits (and the capture group is needed in the end). If there are letters after these digits, there has to be a whitespace character or underscore in between. The first of the entries that should not match shows this: it is invalid, since there are letters after 123456 without the required separator.
The last entry isn't really important; it's just there for convenience.
What am I missing? How do I adjust my regex so that it checks for the separator only if letters follow the digits?
You may use
^(\d{6,7})([_ ][A-Za-z].*)?\..*$
Details
^ - start of a string
(\d{6,7}) - Group 1: 6 or 7 digits
([_ ][A-Za-z].*)? - an optional capturing group #2: a _ or space followed by a letter and then any 0+ chars, as many as possible, up to the last
\. - literal . on the line
.* - the rest of the line
$ - end of string.
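To check the pattern against the six file names from the question, a short Python sketch could look like this. The variable names are illustrative only.

```python
import re

# Pattern from the answer above: 6-7 digits, an optional "_" or space that
# must be followed by a letter, then a dot and the rest of the name.
PATTERN = re.compile(r"^(\d{6,7})([_ ][A-Za-z].*)?\..*$")

names = [
    "123456_Test.pdf",
    "123456 Test.pdf",
    "123456.pdf",
    "123456Abcasd.pdf",
    "123456-Abcasd.pdf",
    "123456_.pdf",
]

matched = []
for name in names:
    m = PATTERN.match(name)
    if m:
        matched.append(name)
    # Group 1 holds the leading digits when the name is accepted.
    print(name, "->", m.group(1) if m else None)
```

Only the first three names should end up in matched, each with 123456 in group 1.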
Check if this perl solution works for you.
> cat regex_catch.dat
123456_Test.pdf
123456 Test.pdf
123456.pdf
123456Abcasd.pdf
123456-Abcasd.pdf
123456_.pdf
> perl -ne ' print if m/\d+(([ _])[a-zA-Z]+| [a-zA-Z]*)?\.pdf/ ' regex_catch.dat
123456_Test.pdf
123456 Test.pdf
123456.pdf
>

How can I make regex find a string a certain distance away?

For example, this is what I came up with so far
lasts{0,1}.*?(\d).*?doggs
The beginning part could be either last or lasts with an s.
Now, I want to look a maximum of 10 characters ahead of wherever it finds lasts{0,1}. If it finds a digit within those 10 characters, look again to see if the string doggs occurs within a maximum of 10 further characters.
Is this even possible?
This is an example
So I figure if I use them about 7-8 hours a day they should last about 5.8 doggs. That works out
I want to only get the 5
You can use some more limiting quantifiers:
lasts?.{0,10}?(\d).{0,10}doggs
      ^^^^^^^^    ^^^^^^^
Pattern explanation:
lasts? - match either last or lasts
.{0,10}? - match 0 to 10 characters as few as possible other than a newline (use DOTALL modifier to also match a newline)
\d - a digit
.{0,10} - see above
doggs - match a literal character sequence doggs.
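A small Python sketch of this pattern applied to the example sentence from the question. The too_far string is a made-up counterexample in which the digit sits 11 characters after "last".

```python
import re

# Pattern from the answer above: "last" or "lasts", at most 10 characters,
# a captured digit, at most 10 more characters, then "doggs".
PATTERN = re.compile(r"lasts?.{0,10}?(\d).{0,10}doggs")

text = ("So I figure if I use them about 7-8 hours a day "
        "they should last about 5.8 doggs. That works out")

match = PATTERN.search(text)
found = match.group(1) if match else None
print(found)  # the captured digit, here "5"

# Hypothetical counterexample: 11 non-digit characters before the digit,
# so the first .{0,10}? cannot reach it.
too_far = "last" + "abcdefghijk" + "5 doggs"
print(PATTERN.search(too_far))  # None
```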
`lasts{0,1}.{0,10}\d.{0,10}doggs`
The lasts{0,1} can be replaced by lasts?.

Regular expression with "not character" does not match as expected

I am trying to satisfy the following restrictions:
line has from 3 to 256 chars that are a-z, 0-9, dash - or dot .
this line cannot start or end with -
I want to get output like the following:
aaa -> good
aaaa -> good
-aaa -> bad
aaa- -> bad
---a -> bad
I have some regexes that don't give the right answer:
1) ^[^-][a-z0-9\-.]{3,256}[^-]$ gives all test lines as bad;
2) ^[^-]+[a-z0-9\-.]{3,256}[^-]+$ treats first three lines as one matching string since [^-] matches new line I guess.
3) ^[^-]?[a-z0-9\-.]{3,256}[^-]?$ (? for one or zero matching dash) gives all test lines as good
Where is the truth? I'm sensing it's either close to mine or much more complicated.
P.S. I use python 3 re module.
This one is almost correct: ^[^-][a-z0-9\-.]{3,256}[^-]$
The [^-] at the start and end represent one character already, so you will need to change {3,256} into {1,254}
Also, you probably only want a-z, 0-9 and . at the start and end (not just anything except -), so the full regex becomes:
^[a-z0-9.][a-z0-9\-.]{1,254}[a-z0-9.]$
Use a lookahead to confirm that the line matches your basic requirement ((?=^[0-9a-z.-]{3,256}$)) and then apply further restrictions.:
^((?=^[0-9a-z.-]{3,256}$)[^-].*[^-])$
You can use this:
^(?!-)[a-z0-9.-]{3,256}(?<!-)$
Where (?!-) is a negative lookahead assertion (not followed by a dash) and (?<!-) is a negative lookbehind (not preceded by a dash).
You don't want {3,256}; you want {1,254}, because each [^-] also matches one character at the beginning and end of your string, so you have to subtract them from the total number of characters that you want.
^[a-z0-9.][a-z0-9.-]{1,254}[a-z0-9.]$
Or, if you want to keep your values you can also use lookahead/behinds:
^(?=[a-z0-9.])[a-z0-9.-]{3,256}(?<=[a-z0-9.])$
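Since the question uses the Python 3 re module, here is a short sketch that runs two of the suggested patterns (the character-class version and the negative-lookaround version) over the five test lines. The names CLASS_BASED, LOOKAROUND and results are illustrative only.

```python
import re

# First and last character restricted by character classes, so the middle
# part covers 1-254 characters (3-256 in total).
CLASS_BASED = re.compile(r"^[a-z0-9.][a-z0-9\-.]{1,254}[a-z0-9.]$")

# Same length limits, with lookarounds rejecting a leading or trailing dash.
LOOKAROUND = re.compile(r"^(?!-)[a-z0-9.-]{3,256}(?<!-)$")

lines = ["aaa", "aaaa", "-aaa", "aaa-", "---a"]

results = {}
for line in lines:
    results[line] = (
        CLASS_BASED.match(line) is not None,
        LOOKAROUND.match(line) is not None,
    )
    print(line, "->", "good" if all(results[line]) else "bad")
```

Both patterns should agree on these inputs: the first two lines are good and the last three are bad.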

How to extract internal words using regex

I am trying to match only the street name from a series of addresses. The addresses might look like:
23 Barrel Rd.
14 Old Mill Dr.
65-345 Howard's Bluff
I want to use a regex to match "Barrel", "Old Mill", and "Howard's". I need to figure out how to exclude the last word. So far I have a lookbehind to exclude the digits, and I can include the words and spaces and "'" by using this:
(?<=\d\s)(\w|\s|\')+
How can I exclude the final word (which may or may not end in a period)? I figure I should be using a lookahead, but I can't figure out how to formulate it.
You don't need a look-behind for this:
/^[-\d]+ ([\w ']+) \w+\.?$/
Match one or more digits and hyphens
space
match letters, digits, spaces, apostrophes into capture group 1
space
match a final word and an optional period
An example Ruby implementation:
regex = /^[-\d]+ ([\w ']+) \w+\.?$/
tests = [ "23 Barrel Rd.", "14 Old Mill Dr.", "65-345 Howard's Bluff" ]
tests.each do |test|
p test.match(regex)[1]
end
Output:
"Barrel"
"Old Mill"
"Howard's"
I believe the lookahead you want is (?=\s\w+\.?$).
\s: you don't want to include the last space
\w: at least one word-character (A-Z, a-z, 0-9, or '_')
\.?: optional period (for abbreviations such as "St.")
$: make sure this is the last word
If there's a possibility that there might be additional whitespace before the newline, just change this to (?=\s\w+\.?\s*$).
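A rough Python sketch of how this lookahead combines with the lookbehind from the question. It assumes the character class [\w\s'] as a stand-in for the question's (\w|\s|\') alternation, and the names STREET and streets are illustrative.

```python
import re

# Lookbehind: a digit and a whitespace character come right before the match.
# Lookahead: a space, one final word and an optional period must follow.
STREET = re.compile(r"(?<=\d\s)[\w\s']+(?=\s\w+\.?$)")

addresses = ["23 Barrel Rd.", "14 Old Mill Dr.", "65-345 Howard's Bluff"]

streets = []
for address in addresses:
    m = STREET.search(address)
    # group(0) is the text between the house number and the last word.
    streets.append(m.group(0) if m else None)

print(streets)
```

The greedy class first swallows the final word as well, then backtracks until the lookahead is satisfied, which is what leaves the last word out of the match.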
Why not just match what you want? If I have understood well you need to get all the words after the numbers excluding the last word. Words are separated by space so just get everything between numbers and the last space.
Example
\d+(?:-\d+)? ((?:.)+)  Note: there's a space at the end.
That will end up with what you want in \1.
If you want the match itself to be exactly that text, you may use \K (not supported by every regex engine). Example:
With the regex \d+(?:-\d+)? \K.+(?= )
Another option is to use the split() function provided in most scripting languages. Here's the Python version of what you want:
stname = " ".join(address.split()[1:-1])
(Here address is the original address line, and stname is the name of the street, i.e., what you're trying to extract.)

Word count regex that only allows alphanumeric and maximum length

I've spent the whole morning on gSkinner trying to change this regex. It correctly allows only 15 words, but how do I further limit input to alphanumeric only, and no valid word to be more than 25 characters in length?
I understand [a-z0-9], but presumably the use of word boundaries seems to confuse me because whatever I do I'm breaking it.
^\W*(?:\w+\b\W*){1,15}$
It's for use in javascript/php.
Try this regex: ^((\w{1,25}))((\W\w{1,25}){1,14}|)
The first word will not be preceded by a space, so (\w{1,25}) checks for that. Next I want a blank space followed by a word, (\W\w{1,25}), but I want this from 1 to 14 times, so (\W\w{1,25}){1,14}. If the input has only one word, the second part of the pattern will not match, so instead of a blank space followed by a word there can be nothing; that is why I added the |: ((\W\w{1,25}){1,14}|)
EDIT
The pattern had a glitch with characters such as -, so I updated it to this: ^([^ ]{1,25})(([ ]{1,}([^ ]{1,25}|)){1,14}|)
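For comparison, here is a sketch of a stricter variant, written in Python rather than JavaScript or PHP. It is not the pattern from the answer above: it assumes words are separated by whitespace, that "alphanumeric" means ASCII letters and digits only, and that surrounding whitespace may be stripped first. The names WORDS and is_valid are hypothetical.

```python
import re

# One word of 1-25 ASCII letters/digits, followed by up to 14 more words,
# each preceded by at least one whitespace character (15 words at most).
WORDS = re.compile(r"[A-Za-z0-9]{1,25}(?:\s+[A-Za-z0-9]{1,25}){0,14}")


def is_valid(text: str) -> bool:
    """Return True if text has 1-15 alphanumeric words of at most 25 chars."""
    return WORDS.fullmatch(text.strip()) is not None


print(is_valid("one two three"))          # within both limits
print(is_valid(" ".join(["word"] * 16)))  # one word too many
print(is_valid("a" * 26))                 # single word that is too long
print(is_valid("hello-world"))            # contains a non-alphanumeric character
```

Using fullmatch instead of ^...$ avoids having to think about anchors; the same pattern body with ^ and $ added should carry over to JavaScript and PHP, though that has not been tested here.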