Matching x chars from the beginning using regex

I'm trying to write a rename program in Delphi and need to know whether it's possible to match a specified number of characters from the beginning of a string using regex.
For example, if the string is FileName.txt and the specified number is 6,
it should match FileNa.
I also need a pattern to match a string from a specific position to the end.
I would be glad if answers included explanations, because I would like to learn regex.

^.{6}
Will match the first 6 characters, but will not match if there are fewer than 6.
^.{1,6}
Will match the first 6 characters (as many as it can up to 6), but will not match if the string is empty.
. means to match any character except a line break (so it includes path delimiters, in your case). You can replace . with \w if you only want letters, numbers, and underscore.
^\w{1,6}
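To see how the three patterns differ, it can help to run them against a few inputs. The sketch below uses Python's re module rather than Delphi, purely for illustration; the helper name first_chars is made up for this example.

```python
import re

def first_chars(pattern, text):
    """Return the text matched by pattern at the start of text, or None."""
    m = re.match(pattern, text)
    return m.group(0) if m else None

print(first_chars(r"^.{6}", "FileName.txt"))    # FileNa
print(first_chars(r"^.{6}", "File"))            # None: fewer than 6 characters
print(first_chars(r"^.{1,6}", "File"))          # File
print(first_chars(r"^.{1,6}", ""))              # None: empty string
print(first_chars(r"^\w{1,6}", "My.File.txt"))  # My: \w stops at the dot
```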

If you use Delphi XE, regular expression functionality is built in with the TRegEx class. If you use an earlier version of Delphi, you can find a library here, along with more about the Delphi XE support: http://www.regular-expressions.info/delphi.html
This regular expression matches up to 6 characters until the . separating the extension from the rest of the file name.
^([^\.]{1,6})[^\.]*(?:\..*)?$
Given the input: FileName.txt
Group 1 would be: FileNa
Given the input: File.txt
Group 1 would be: File
The expression uses grouping to capture the first 6 characters. The code in Delphi XE would look something like:
var
  Regex: TPerlRegEx;
  ResultString: string;
Regex := TPerlRegEx.Create;
try
  Regex.RegEx := '^([^\.]{1,6})[^\.]*(?:\..*)?$';
  Regex.Options := [];
  Regex.Subject := SubjectString;
  if Regex.Match then begin
    if Regex.GroupCount >= 1 then begin
      ResultString := Regex.Groups[1];
    end
    else begin
      ResultString := '';
    end;
  end
  else begin
    ResultString := '';
  end;
finally
  Regex.Free;
end;
For instance the filename: FileName.txt will be matched with: FileNa (group 1)
I'll try to explain the regular expression I have used, although there are probably better expressions out there:
^ # Match beginning of line
( # Begin a group (enables us to capture the contents alone)
[^\.] # Capture any character that is not a '.'
{1,6} # Capture anything from 1 to 6 of these characters (6 if possible)
) # Close the group
[^\.] # Match any character that is not '.' (again)
* # Match this 0 or more times
(?: # Begin a group that we do not wish to capture
\. # Match the character '.' (the extension separator)
.* # Match any character 0 or more times
) # Close the group
? # Match this group 0 or 1 time (it is either there or not)
$ # Match the end of line
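The breakdown above can be checked with a short script. This is a minimal sketch in Python rather than Delphi, using the same pattern; the function name name_prefix and the extra sample names are assumptions made for illustration.

```python
import re

# Same pattern as in the answer: up to 6 non-dot characters are captured,
# the rest of the name and an optional extension are matched but not captured.
PATTERN = re.compile(r"^([^\.]{1,6})[^\.]*(?:\..*)?$")

def name_prefix(filename):
    """Return up to the first 6 characters before the first dot, or ''."""
    m = PATTERN.match(filename)
    return m.group(1) if m else ""

for name in ("FileName.txt", "File.txt", "README", ".hidden"):
    print(repr(name), "->", repr(name_prefix(name)))
```

A name that starts with a dot, such as .hidden, gives no match, because the group requires at least one non-dot character at the start.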
To the next part of your question, creating a pattern to match a string from a specific number to the end:
^(?:.{6})?(.*)$
Given the input: This is a test
Group 1 would be: s a test
In this example the specific number is 6; change it to whatever number you are looking for. Again I've used groups to get the contents of the matched text. The first group is a non-capturing group, meaning we are not interested in its content, only in skipping over it when it is there. If we are still talking about filenames, you can use the following regular expression:
^(?:[^\.]{6})([^\.]*)(?:\..*)?$
Given the input: FileName.txt
Group 1 would be: me
This is a modification of the first regular expression, where I've made the first group non-capturing and told it to be exactly 6 characters long (again, change this to whatever number suits you), and excluded the extension from the captured text.
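Both "from position N to the end" patterns can be tried out with the count passed in as a parameter. The following is a rough Python sketch, not Delphi; the function names tail_after and stem_after are hypothetical.

```python
import re

def tail_after(text, n):
    """Everything after the first n characters; the whole text if it is shorter."""
    m = re.match(r"^(?:.{%d})?(.*)$" % n, text)
    return m.group(1) if m else None

def stem_after(filename, n):
    """The part of the name after the first n characters, without the extension."""
    m = re.match(r"^(?:[^\.]{%d})([^\.]*)(?:\..*)?$" % n, filename)
    return m.group(1) if m else None

print(tail_after("This is a test", 6))  # s a test
print(tail_after("abc", 6))             # abc (the optional group is skipped)
print(stem_after("FileName.txt", 6))    # me
print(stem_after("File.txt", 6))        # None (fewer than 6 characters before the dot)
```

Note the difference between the two: because the skipping group is optional in the first pattern, a string shorter than n is returned whole, while the second pattern simply fails to match.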
Remember that regular expressions are easier to compose than to read. I have always found http://www.regular-expressions.info/ to be a good source of information, and the book Mastering Regular Expressions has helped me a great deal.

Related

Regular Expression: Find a specific group within other groups in VB.Net

I need to write a regular expression that has to replace everything except for a single group.
E.g.
IN                                    OUT
OK THT PHP This is it 06222021        This is it
NO MTM PYT Get this content 111111    Get this content
I wrote the following Regular Expression: (\w{0,2}\s\w{0,3}\s\w{0,3}\s)(.*?)(\s\d{6}(\s|))
This RegEx creates 4 groups, using the first entry as an example the groups are:
OK THT PHP
This is it
06222021
Space character
I need a way to:
Replace Group 1,2,4 with String.Empty
OR
Get Group 3, ONLY

You don't need 4 groups. A single capture group, referenced in the replacement, is enough, and you can match 6-8 digits for the last part instead of only 6.
Note that \w{0,2} will also match an empty string; use \w{1,2} if there has to be at least one word character.
^\w{0,2}\s\w{0,3}\s\w{0,3}\s(.*?)\s\d{6,8}\s?$
^ Start of string
\w{0,2}\s\w{0,3}\s\w{0,3}\s Match three runs of word characters (with the given quantifiers), each followed by a whitespace character
(.*?) Capture group 1, match any character as few times as possible
\s\d{6,8} Match a whitespace char and 6-8 digits
\s? Match an optional whitespace char
$ End of string
Example code
Dim s As String = "OK THT PHP This is it 06222021"
Dim result As String = Regex.Replace(s, "^\w{0,2}\s\w{0,3}\s\w{0,3}\s(.*?)\s\d{6,8}\s?$", "$1")
Console.WriteLine(result)
Output
This is it
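For readers who want to try the same replacement outside .NET, here is a small sketch in Python using the identical pattern; the function name extract_middle is made up for this example, and \1 is Python's spelling of $1.

```python
import re

PATTERN = re.compile(r"^\w{0,2}\s\w{0,3}\s\w{0,3}\s(.*?)\s\d{6,8}\s?$")

def extract_middle(line):
    """Replace the whole line with capture group 1; non-matching lines are unchanged."""
    return PATTERN.sub(r"\1", line)

print(extract_middle("OK THT PHP This is it 06222021"))
print(extract_middle("NO MTM PYT Get this content 111111"))
```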

My approach does not work with groups and does not use a Replace operation. The match itself yields the desired result.
It uses look-around expressions. To find a pattern between two other patterns, you can use the general form
(?<=prefix)find(?=suffix)
This will only return find as match, excluding prefix and suffix.
If we insert your expressions, we get
(?<=\w{0,2}\s\w{0,3}\s\w{0,3}\s).*?(?=\s\d{6}\s?)
where I simplified (\s|) as \s?. We can also drop it completely, since we don't care about trailing spaces.
(?<=\w{0,2}\s\w{0,3}\s\w{0,3}\s).*?(?=\s\d{6})
Note that this works also if we have more than 6 digits because regex stops searching after it has found 6 digits and doesn't care about what follows.
This also gives a match if other things precede our pattern like in 123 OK THT PHP This is it 06222021. We can exclude such results by specifying that the search must start at the beginning of the string with ^.
If the exact length of the words and numbers does not matter, we simply write
(?<=^\w+\s\w+\s\w+\s).*?(?=\s\d+)
If the find part can contain numbers, we must specify that we want to match until the end of the line with $ (and include a possible space again).
(?<=^\w+\s\w+\s\w+\s).*?(?=\s\d+\s?$)
Finally, we use a quantifier for the 3 occurrences of word-space:
(?<=^(\w+\s){3}).*?(?=\s\d+\s?$)
This is compact and will only return This is it or Get this content.
string result = Regex.Match(input, @"(?<=^(\w+\s){3}).*?(?=\s\d+\s?$)").Value; // input holds the line to search
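One caveat when porting this: the variable-length look-behind relies on the .NET regex engine. As a rough illustration, Python's built-in re module only accepts fixed-width look-behinds, so the sketch below hard-codes the prefix "PHP " for one sample line and checks that the compact pattern is rejected there. The helper name lookbehind_is_supported is hypothetical.

```python
import re

# Fixed-width look-behind: the prefix "PHP " is hard-coded for this sample line.
FIXED = re.compile(r"(?<=PHP ).*?(?=\s\d+\s?$)")

def lookbehind_is_supported(pattern):
    """Return True if Python's re module can compile the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

match = FIXED.search("OK THT PHP This is it 06222021")
print(match.group(0) if match else None)
print(lookbehind_is_supported(r"(?<=^(\w+\s){3}).*?(?=\s\d+\s?$)"))
```

In engines with this restriction, the capture-group approach from the other answer is the usual fallback.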

PCRE Regex: Is it possible to check within only the first X characters of a string for a match

PCRE Regex: Is it possible for Regex to check for a pattern match within only the first X characters of a string, ignoring other parts of the string beyond that point?
My Regex:
I have a Regex:
/\S+V\s*/
This checks the string for non-whitespace characters which have a trailing 'V' and then a whitespace character or the end of the string.
This works. For example:
Example A:
SEBSTI FMDE OPORV AWEN STEM students into STEM
// Match found in 'OPORV' (correct)
Example B:
ARKFE SSETE BLMI EDSF BRNT CARFR (name removed) Academy Networking Event
//Match not found (correct).
Re the capitalised text: each letter and its placement has a meaning in the source data. This is followed by generic info for humans to read ("Academy Networking Event", etc.)
My Issue:
It can theoretically occur that sometimes there are names that involve roman numerals such as:
Example C:
ARKFE SSETE BLME CARFR Academy IV Networking Event
//Match found (incorrect).
I would like my Regex above to only check the first X characters of the string.
Can this be done in PCRE Regex itself? I can't find any reference to length counting in Regex and I suspect this can't easily be achieved. String lengths are completely arbitrary. (We have no control over the source data).
Intention:
/\S+V\s*/{check within first 25 characters only}
ARKFE SSETE BLME CARFR Academy IV Networking Event
^
\- Cut off point. Not found so far so stop.
//Match not found (correct).
Workaround:
The Regex is in PHP and my current solution is to cut the string in PHP, to only check the first X characters, typically the first 20 characters, but I was curious if there was a way of doing this within the Regex without needing to manipulate the string directly in PHP?
$valueSubstring = substr($coreRow['value'],0,20); /* first 20 characters only */
$virtualCount = preg_match_all('/\S+V\s*/',$valueSubstring);

The trick is to capture the end of the line after the first 25 characters in a lookahead and to check if it follows the eventual match of your subpattern:
$pattern = '~^(?=.{0,25}(.*)).*?\K\S+V\b(?=.*\1)~m';
details:
^ # start of the line
(?= # open a lookahead assertion
.{0,25} # the first 25 characters (or fewer)
(.*) # capture the end of the line
) # close the lookahead
.*? # consume lazily the characters
\K # the match result starts here
\S+V # your pattern
\b # a word boundary (that matches between a letter and a white-space
# or the end of the string)
(?=.*\1) # check that the end of the line follows with a reference to
# the capture group 1 content.
Note that you can also write the pattern in a more readable way like this:
$pattern = '~^
(*positive_lookahead: .{0,25} (?<line_end> .* ) )
.*? \K \S+ V \b
(*positive_lookahead: .*? \g{line_end} ) ~xm';
(The alternative syntax (*positive_lookahead: ...) is available since PHP 7.3)
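The same idea can be tried in other engines that allow a backreference inside a lookahead. Below is a minimal sketch in Python, which has no \K, so a second capture group stands in for it; the function name find_in_prefix and its limit parameter are assumptions for illustration.

```python
import re

def find_in_prefix(line, limit=25):
    """Return the first word ending in V that ends within the first `limit` chars."""
    # Group 1 remembers everything after the first `limit` characters; the final
    # lookahead only succeeds if that remembered tail still fits after the match.
    pattern = re.compile(r"^(?=.{0,%d}(.*)).*?(\S+V\b)(?=.*\1)" % limit)
    m = pattern.search(line)
    return m.group(2) if m else None

print(find_in_prefix("SEBSTI FMDE OPORV AWEN STEM students into STEM"))
print(find_in_prefix("ARKFE SSETE BLME CARFR Academy IV Networking Event"))
print(find_in_prefix("ARKFE SSETE BLME CARFR Academy IV Networking Event", limit=40))
```

With the default limit the second line gives no result, because "IV" ends after the cut-off; raising the limit to 40 lets it through.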

You can find your pattern after X chars and skip the whole string, else, match your pattern. So, if X=25:
^.{25,}\S+V.*(*SKIP)(*F)|\S+V\s*
Details:
^.{25,}\S+V.*(*SKIP)(*F) - start of string, 25 or more chars other than line break chars, as many as possible, then one or more non-whitespaces and V, and then the rest of the string, the match is failed and skipped
| - or
\S+V\s* - match one or more non-whitespaces, V and zero or more whitespace chars.

Any V ending in the first 25 positions
^.{1,24}V\s
Any word ending in V in the first 25 positions
^.{1,23}[A-Z]V\s
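Some regex APIs can also limit the searched range directly, which avoids both the substring call and the counting inside the pattern. As an illustration only (this is Python, not PHP), a compiled pattern's search method accepts pos and endpos arguments; the function name has_v_word is made up for this sketch, which also checks the counting pattern above against the question's examples.

```python
import re

PATTERN = re.compile(r"\S+V\s*")
COUNTED = re.compile(r"^.{1,23}[A-Z]V\s")

def has_v_word(text, limit=25):
    """Search only the first `limit` characters for a word ending in V."""
    # endpos makes the search behave as if the string ended at index `limit`.
    return PATTERN.search(text, 0, limit) is not None

example_a = "SEBSTI FMDE OPORV AWEN STEM students into STEM"
example_c = "ARKFE SSETE BLME CARFR Academy IV Networking Event"

print(has_v_word(example_a), bool(COUNTED.search(example_a)))
print(has_v_word(example_c), bool(COUNTED.search(example_c)))
```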

Regex for text file

I have a text file with the following text:
andal-4.1.0.jar
besc_2.1.0-beta
prov-3.0.jar
add4lib-1.0.jar
com_lab_2.0.jar
astrix
lis-2_0_1.jar
Is there any way I can split the name and the version using regex? I want to use the results to make two columns, 'Name' and 'Version', in Excel.
So I want the results from the regex to look like
andal 4.1.0.jar
besc 2.1.0-beta
prov 3.0.jar
add4lib 1.0.jar
com_lab 2.0.jar
astrix
lis 2_0_1.jar
So far I have used ^(?:.*-(?=\d)|\D+) to get the Version and -\d.*$ to get the Name separately. The problem with this is that when I do it for a large text file, the results from the two regexes are not in the same order. So is there any way to get the results in the way I have mentioned above?

Ctrl+H
Find what: ^(.+?)[-_](\d.*)$
Replace with: $1\t$2
check Wrap around
check Regular expression
UNCHECK . matches newline
Replace all
Explanation:
^ # beginning of line
(.+?) # group 1, 1 or more any character but newline, not greedy
[-_] # a dash or underscore
(\d.*) # group 2, a digit then 0 or more any character but newline
$ # end of line
Replacement:
$1 # content of group 1
\t # a tab character, you may replace it with whatever separator you want
$2 # content of group 2
Result for given example:
andal 4.1.0.jar
besc 2.1.0-beta
prov 3.0.jar
add4lib 1.0.jar
com_lab 2.0.jar
astrix
lis 2_0_1.jar
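The same find-and-replace can be scripted when the file is too large to handle comfortably in an editor. This is a minimal sketch in Python, assuming the file contents are already in a string; re.M makes ^ and $ work per line, as they do in the editor.

```python
import re

text = """andal-4.1.0.jar
besc_2.1.0-beta
prov-3.0.jar
add4lib-1.0.jar
com_lab_2.0.jar
astrix
lis-2_0_1.jar"""

# "\\1\t\\2" is group 1, a real tab character, then group 2.
result = re.sub(r"^(.+?)[-_](\d.*)$", "\\1\t\\2", text, flags=re.M)
print(result)
```

Lines without a version, such as astrix, do not match and are left as they are.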

I'm not quite sure what you mean by the problem with large files, and I believe the two regexes you showed do the opposite of what you said: the first one should give you the name and the second one the version.
Anyway, here are the assumptions I had to make about what may make sense to you:
"Name" may be followed by - or _, followed by the version string.
The "Version" string is something preceded by - or _, with some digits, followed by a dot or underscore, followed by some digits, and then any string.
If these assumptions make sense, you may use
^(.+?)(?:[-_](\d+[._]\d+.*))?$
as your regex. Group 1 will be the name, Group 2 will be the version.
Demo in regex101: https://regex101.com/r/RnwMaw/3
Explanation of regex
^ start of line
(.+?) "Name" part, using reluctant match of
at least 1 character
(?: )? Optional group of "Version String", which
consists of:
[-_] - or _
( ) Followed by the "Version" , which is
\d+ at least 1 digit,
[._] then 1 dot or underscore,
\d+ then at least 1 digit,
.* then any string
$ end of line
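Because a single pattern yields both parts from the same line, name and version cannot get out of order. A rough Python sketch of that idea follows; the function name split_name_version is hypothetical, and a missing version is returned as an empty string.

```python
import re

PATTERN = re.compile(r"^(.+?)(?:[-_](\d+[._]\d+.*))?$")

def split_name_version(line):
    """Return (name, version) for one line; version is '' when there is none."""
    m = PATTERN.match(line)
    if m is None:
        return line, ""
    return m.group(1), m.group(2) or ""

lines = ["andal-4.1.0.jar", "besc_2.1.0-beta", "com_lab_2.0.jar", "astrix", "lis-2_0_1.jar"]
for line in lines:
    name, version = split_name_version(line)
    print(name + "\t" + version)
```

The tab-separated output can be pasted into a spreadsheet as two columns.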

AUTOHOTKEY: RegExMatch() a series of numbers and letters

I've tested my regular expression in http://www.regextester.com/
([0-9]{4,4})([A-Z]{2})([0-9]{1,3})
It's matching perfect with the following strings just as I want it.
1234AB123
2000AZ20
1000XY753
But when I try it in Autohotkey I get 0 result
test := RegExMatch("2000SY155","([0-9]{4,4})([A-Z]{2})([0-9]{1,3})")
MsgBox %test%
testing for:
first 4 characters must be a number
next 2 characters must be caps letters
next 1 to 3 characters must be numbers

You had too many ( )
This is the correct implementation:
test := RegExMatch("1234AB123","[0-9]{4,4}([A-Z]{2})[0-9]{1,3}")
Edit:
So what I noticed is you want this pattern to match, but you aren't really telling it much.
Here's what I was able to come up with that matches what you asked for, it's probably not the best solution but it works:
test := RegExMatch("1234AB567","^[0-9]{4,4}[A-Z]{2}(?![0-9]{4,})[0-9$]{1,3}")
Breaking it down:
RegExMatch(Haystack, NeedleRegEx [, UnquotedOutputVar = "", StartingPosition = 1])
Circumflex (^) and dollar sign ($) are called anchors because
they don't consume any characters; instead, they tie the pattern to
the beginning or end of the string being searched.
^ may appear at the beginning of a pattern to require the match to occur at
the very beginning of a line. For example, ^abc matches abc123 but not 123abc.
$ may appear at the end of a pattern to require the match to occur at the very end of a line. For example, abc$ matches 123abc but not abc123.
So by adding Circumflex we are requiring that our Pattern [0-9]{4,4} be at the beginning of our Haystack.
Look-ahead and look-behind assertions: The groups (?=...), (?!...) are
called assertions because they demand a condition to be met but don't
consume any characters.
(?!...) is a negative look-ahead because it requires that the specified pattern not exist.
Our next Pattern is looking for two Uppercase Alpha Characters [A-Z]{2}(?![0-9]{4,}) that does not have four or more Numeric characters after it.
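And finally our last Pattern, [0-9$]{1,3}, matches one to three characters that are digits or a literal dollar sign. Inside a character class $ is not an anchor, so this does not tie the match to the end of our Haystack; to require that, put $ after the class instead: [0-9]{1,3}$.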
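The behaviour of the anchors and the look-ahead can be checked against a few inputs. The sketch below is in Python rather than AutoHotkey, on the assumption that both engines treat these constructs the same way; STRICT is a hypothetical alternative that relies on a whole-string match instead of a look-ahead.

```python
import re

# The pattern from this answer, unchanged.
ANSWER = re.compile(r"^[0-9]{4,4}[A-Z]{2}(?![0-9]{4,})[0-9$]{1,3}")
# Hypothetical stricter variant: the whole string must fit the pattern.
STRICT = re.compile(r"[0-9]{4}[A-Z]{2}[0-9]{1,3}")

for s in ("1234AB123", "2000AZ20", "1234AB5678", "1234AB5$", "1234AB567X"):
    print(s, bool(ANSWER.match(s)), bool(STRICT.fullmatch(s)))
```

The last two inputs show the difference: the answer's pattern accepts a literal $ and trailing text, while the whole-string match rejects both.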

test := RegExMatch("2000SY155","([0-9]{4,4})([A-Z]{2})([0-9]{1,3})")
MsgBox %test%
But when I try it in Autohotkey I get 0 result
The message box correctly returns 1 for me, meaning your initial script works fine with my version. Usually, parentheses are no problem in regexes; you can use as many as you like. Maybe your AutoHotkey version is outdated?

Matching characters reversely using regex

I need a regex that matches a string backwards, from a specified position to the first character. The strings are file names.
I'm using Delphi 2010.
My example string is New Document.extension
If the specified position is 4, it should match:
New Docu
You can get from "New Document.extension" to "New Docu" by following these steps:
First strip the extension. You end up with "New Document"
Remove the last 4 characters. You get "New Docu".
For the "This Is My Longest Document.ext1.ext2" example:
Strip the extension, you end up with: "This Is My Longest Document.ext1"
Strip the last 4 characters. You get: "This Is My Longest Document."

So you want the entire string up to the fourth-to-last position before the final dot? No problem:
Delphi .NET:
ResultString := Regex.Match(SubjectString, '^.*(?=.{4}\.[^.]*$)').Value;
Explanation:
^ # Start of string
.* # Match any number of characters
(?= # Assert that it's possible to match, starting at the current position:
.{4} # four characters
\. # a dot (the last dot in the string!) because...
[^.]* # from here on, only non-dots are allowed until...
$ # the end of the string.
) # End of lookahead.
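To check the pattern against both examples from the question, here is a small sketch in Python (not Delphi) with the count passed in as a parameter; the function name strip_ext_and_tail is made up for illustration.

```python
import re

def strip_ext_and_tail(filename, n=4):
    """Return the name without its last extension and without the n chars before it."""
    m = re.match(r"^.*(?=.{%d}\.[^.]*$)" % n, filename)
    return m.group(0) if m else None

print(strip_ext_and_tail("New Document.extension"))
print(strip_ext_and_tail("This Is My Longest Document.ext1.ext2"))
print(strip_ext_and_tail("README"))
```

A name without a dot, or with fewer than n characters before the last dot, gives no match at all, so the caller has to decide what to do in that case.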

Since I can't post the regex because I came up with the exact same Regex as Tim, I'm going to post a piece of procedural code that does the exact same thing.
function FileNameWithoutExtension(const FileName: string; const StripExtraNumChars: Integer): string;
var
  i: Integer;
begin
  i := LastDelimiter('.', FileName); // The extension starts at the last dot
  if i = 0 then
    i := Length(FileName) + 1; // Make up the extension position if the file has no extension
  Dec(i, StripExtraNumChars + 1); // Strip the requested number of chars; plus one for the dot itself
  Result := Copy(FileName, 1, i); // This is the result!
end;

You accepted the answer giving a regex for
The entire string up to the fourth-to-last position before the final dot.
If that's what you want then you do it best without a regex:
procedure RemoveExtensionAndFinalNcharacters(var s: string; N: Integer);
begin
  s := ChangeFileExt(s, ''); // remove extension
  s := Copy(s, 1, Length(s) - N); // remove final N characters
end;
This is more efficient than a regex and, much more importantly, it is much clearer and more intelligible.
Regexes are not the only fruit.
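The same two-step approach carries over to other languages. As a sketch only, here it is in Python, where os.path.splitext plays the role of ChangeFileExt; the function name is hypothetical, and note that splitext treats a leading-dot name such as .bashrc as having no extension, which may differ from Delphi.

```python
import os

def remove_extension_and_final_chars(filename, n):
    """Strip the last extension, then drop the final n characters of what is left."""
    stem, _ext = os.path.splitext(filename)
    # max() keeps the slice bound from going negative when n exceeds the length.
    return stem[:max(0, len(stem) - n)]

print(remove_extension_and_final_chars("New Document.extension", 4))
print(remove_extension_and_final_chars("This Is My Longest Document.ext1.ext2", 4))
```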

Edit based on comments
I'm not sure how Delphi does regex, but this works in most systems.
^.*(?=.{4}\.\w+$)
^ #the start of the string
.* #Any characters.
(?= #A lookahead meaning followed by...
.{4} #Any 4 chars.
\. #A literal .
\w+ #an actual extension.
$ #the end of the string
) #closing the lookahead
You could also use \w{3}$ instead of \w+$ at the end if you wanted to make sure that the extension was three characters long.
