Matching characters reversely using regex - regex

I need a regex that matches a file name from a specified position, counted backwards from the end of the name (before the extension), back to the first character.
I'm using Delphi 2010.
My example string is New Document.extension
If the specified position is 4, it should match:
New Docu
You can get from "New Document.extension" to "New Docu" by following these steps:
First strip the extension. You end up with "New Document"
Remove the last 4 characters. You get "New Docu".
For the "This Is My Longest Document.ext1.ext2" example:
Strip the extension, you end up with: "This Is My Longest Document.ext1"
Strip the last 4 characters. You get: "This Is My Longest Document."

So you want the entire string up to the fourth-to-last position before the final dot? No problem:
Delphi .NET:
ResultString := Regex.Match(SubjectString, '^.*(?=.{4}\.[^.]*$)').Value;
Explanation:
^ # Start of string
.* # Match any number of characters
(?= # Assert that it's possible to match, starting at the current position:
.{4} # four characters
\. # a dot (the last dot in the string!) because...
[^.]* # from here on, only non-dots are allowed until...
$ # the end of the string.
) # End of lookahead.
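For readers who want to try the pattern outside Delphi, here is a minimal sketch in Python, assuming its `re` module handles this lookahead the same way the .NET engine does. The helper name `strip_ext_and_tail` and the configurable count `n` are made up for illustration.

```python
import re

def strip_ext_and_tail(file_name, n=4):
    """Return the text up to the n-th-to-last position before the final dot.

    Hypothetical helper: returns '' when there is no dot or when fewer than
    n characters precede the final dot.
    """
    # Same pattern as above, with the hard-coded 4 replaced by n.
    pattern = r'^.*(?=.{%d}\.[^.]*$)' % n
    match = re.match(pattern, file_name)
    return match.group(0) if match else ''

print(strip_ext_and_tail('New Document.extension'))
print(strip_ext_and_tail('This Is My Longest Document.ext1.ext2'))
```

The greedy `.*` backtracks until exactly `n` characters separate the current position from the last dot, which is why only one position can satisfy the lookahead.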

Since I can't post the regex because I came up with the exact same Regex as Tim, I'm going to post a piece of procedural code that does the exact same thing.
function FileNameWithoutExtension(const FileName:string; const StripExtraNumChars: Integer): string;
var i: Integer;
begin
i := LastDelimiter('.', FileName); // The extension starts at the last dot
if i = 0 then i := Length(FileName) + 1; // Make up the extension position if the file has no extension
Dec(i, StripExtraNumChars + 1); // Strip the requested number of chars; Plus one for the dot itself
Result := Copy(FileName, 1, i); // This is the result!
end;

You accepted the answer giving a regex for
The entire string up to the fourth-to-last position before the final dot.
If that's what you want then you do it best without a regex:
procedure RemoveExtensionAndFinalNcharacters(var s: string; N: Integer);
begin
s := ChangeFileExt(s, '');//remove extension
s := Copy(s, 1, Length(s)-N);//remove final N characters
end;
This is more efficient than a regex and, much more importantly, it is much clearer and more intelligible.
Regexes are not the only fruit.
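The same regex-free idea can be sketched in Python, for comparison. This assumes `os.path.splitext` is an acceptable stand-in for `ChangeFileExt`; the two differ on edge cases (for example, `splitext` treats a leading-dot name such as `.bashrc` as having no extension). The function name is hypothetical.

```python
import os.path

def remove_extension_and_final_chars(file_name, n):
    """Strip the last extension, then drop the final n characters."""
    stem, _extension = os.path.splitext(file_name)  # remove extension
    keep = max(len(stem) - n, 0)                    # never go below zero
    return stem[:keep]                              # remove final n characters

print(remove_extension_and_final_chars('New Document.extension', 4))
```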

Edit based on comments
I'm not sure how Delphi does regex, but this works in most systems.
^.*(?=.{4}\.\w+$)
^ #the start of the string
.* #Any characters.
(?= #A lookahead meaning followed by...
.{4} #Any 4 chars.
\. #A literal .
\w+ #an actual extension.
$ #the end of the string
) #closing the lookahead
You could also use \w{3}$ instead of \w+$ at the end if you wanted to make sure that the extension was three characters long.

Related

PCRE Regex: Is it possible to check within only the first X characters of a string for a match

PCRE Regex: Is it possible for Regex to check for a pattern match within only the first X characters of a string, ignoring other parts of the string beyond that point?
My Regex:
I have a Regex:
/\S+V\s*/
This checks the string for non-whitespace characters which have a trailing 'V' and then a whitespace character or the end of the string.
This works. For example:
Example A:
SEBSTI FMDE OPORV AWEN STEM students into STEM
// Match found in 'OPORV' (correct)
Example B:
ARKFE SSETE BLMI EDSF BRNT CARFR (name removed) Academy Networking Event
//Match not found (correct).
Re: the capitalised text: each letter, and each letter's placement, has a meaning in the source data. This is followed by generic info for humans to read ("Academy Networking Event", etc.)
My Issue:
It can theoretically occur that sometimes there are names that involve roman numerals such as:
Example C:
ARKFE SSETE BLME CARFR Academy IV Networking Event
//Match found (incorrect).
I would like my Regex above to only check the first X characters of the string.
Can this be done in PCRE Regex itself? I can't find any reference to length counting in Regex and I suspect this can't easily be achieved. String lengths are completely arbitrary. (We have no control over the source data).
Intention:
/\S+V\s*/{check within first 25 characters only}
ARKFE SSETE BLME CARFR Academy IV Networking Event
^
\- Cut off point. Not found so far so stop.
//Match not found (correct).
Workaround:
The Regex is in PHP and my current solution is to cut the string in PHP, to only check the first X characters, typically the first 20 characters, but I was curious if there was a way of doing this within the Regex without needing to manipulate the string directly in PHP?
$valueSubstring = substr($coreRow['value'],0,20); /* first 20 characters only */
$virtualCount = preg_match_all('/\S+V\s*/',$valueSubstring);
The trick is to capture the end of the line after the first 25 characters in a lookahead and to check if it follows the eventual match of your subpattern:
$pattern = '~^(?=.{0,25}(.*)).*?\K\S+V\b(?=.*\1)~m';
demo
details:
^ # start of the line
(?= # open a lookahead assertion
.{0,25} # the first 25 characters
(.*) # capture the end of the line
) # close the lookahead
.*? # consume lazily the characters
\K # the match result starts here
\S+V # your pattern
\b # a word boundary (that matches between a letter and a white-space
# or the end of the string)
(?=.*\1) # check that the end of the line follows with a reference to
# the capture group 1 content.
Note that you can also write the pattern in a more readable way like this:
$pattern = '~^
(*positive_lookahead: .{0,25} (?<line_end> .* ) )
.*? \K \S+ V \b
(*positive_lookahead: .*? \g{line_end} ) ~xm';
(The alternative syntax (*positive_lookahead: ...) is available since PHP 7.3)
You can find your pattern after X chars and skip the whole string, else, match your pattern. So, if X=25:
^.{25,}\S+V.*(*SKIP)(*F)|\S+V\s*
See the regex demo. Details:
^.{25,}\S+V.*(*SKIP)(*F) - start of string, 25 or more chars other than line break chars, as many as possible, then one or more non-whitespaces and V, and then the rest of the string, the match is failed and skipped
| - or
\S+V\s* - match one or more non-whitespaces, V and zero or more whitespace chars.
Any V ending in the first 25 positions
^.{1,24}V\s
See regex
Any word ending in V in the first 25 positions
^.{1,23}[A-Z]V\s
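Some regex libraries can bound the search window without touching the pattern or copying the string. As a point of comparison only (the question is about PCRE in PHP), here is a minimal Python sketch that uses the `endpos` argument of a compiled pattern's `search` method, which makes the search behave as if the string ended at that offset. The function name and the default limit of 25 are assumptions for illustration.

```python
import re

V_WORD = re.compile(r'\S+V\s*')

def find_v_in_prefix(text, limit=25):
    """Search for the pattern only within the first `limit` characters."""
    # pos=0, endpos=limit: characters at index >= limit are never examined.
    return V_WORD.search(text, 0, limit)

example_a = 'SEBSTI FMDE OPORV AWEN STEM students into STEM'
example_c = 'ARKFE SSETE BLME CARFR Academy IV Networking Event'

print(find_v_in_prefix(example_a))  # a match object for 'OPORV '
print(find_v_in_prefix(example_c))  # None: 'IV' lies beyond the cut-off
```

This is conceptually the same as the `substr` workaround in the question, just expressed through the matching call rather than by cutting the string first.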

A regular expression for matching a group followed by a specific character

So I need to match the following:
1.2.
3.4.5.
5.6.7.10
((\d+)\.(\d+)\.((\d+)\.)*) will do fine for the very first line, but the problem is that there may be one line or several.
\n will only appear if there is more than one line.
In string version, I get it like this: "1.2.\n3.4.5.\n1.2."
So my issue is: a single line has no \n at the end, but when there is more than one line, every line except the very last ends with \n.
Here is the pattern I suggest:
^\d+(?:\.\d+)*\.?(?:\n\d+(?:\.\d+)*\.?)*$
Demo
Here is a brief explanation of the pattern:
^ from the start of the string
\d+ match a number
(?:\.\d+)* followed by dot, and another number, zero or more times
\.? followed by an optional trailing dot
(?:\n followed by a newline
\d+(?:\.\d+)*\.?)* and another path sequence, zero or more times
$ end of the string
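A minimal Python sketch of this pattern, for experimenting with the sample data. One assumption worth flagging: in Python, `$` also matches just before a trailing newline, so the sketch uses `fullmatch` in place of the `^...$` anchors to reject input that ends in `\n`. The names `LINE`, `DOTTED_LINES` and `is_valid` are made up.

```python
import re

# One line: a number, more dot-separated numbers, an optional trailing dot.
LINE = r'\d+(?:\.\d+)*\.?'
# A first line, then zero or more further lines, each preceded by a newline.
DOTTED_LINES = re.compile(rf'{LINE}(?:\n{LINE})*')

def is_valid(text):
    """True when every line fits the pattern and no newline trails the last."""
    return DOTTED_LINES.fullmatch(text) is not None

print(is_valid('1.2.\n3.4.5.\n1.2.'))  # True
print(is_valid('1.2.\n'))              # False: newline after the last line
```

Note that the pattern as suggested also accepts a lone number such as `7`, since the dotted parts are optional.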
You might check if there is a newline at the end using a positive lookahead (?=.*\n):
(?=.*\n)(\d+)\.(\d+)\.((\d+)\.)*
See a regex demo
Edit
You could use an alternation to either match when on the next line there is the same pattern following, or match the pattern when not followed by a newline.
^(?:\d+\.\d+\.(?:\d+\.)*(?=.*\n\d+\.\d+\.)|\d+\.\d+\.(?:\d+\.)*(?!.*\n))
Regex demo
^ Start of string
(?: Non capturing group
\d+\.\d+\. Match one or more digits followed by a dot, twice
(?:\d+\.)* Repeat 0+ times matching 1+ digits and a dot
(?=.*\n\d+\.\d+\.) Positive lookahead, assert that the rest of the line is followed by a newline and a line starting with the pattern
| Or
\d+\.\d+\. Match one or more digits followed by a dot, twice
(?:\d+\.)* Repeat 0+ times matching 1+ digits and a dot
(?!.*\n) Negative lookahead, assert that the rest of the line is not followed by a newline
) Close non capturing group
(\d+\.*)+\n* will match the text you provided. If you need to make sure the final line also ends with a . then (\d+\.)+\n* will work.
Most programming languages offer the m flag, which is the multiline modifier. Enabling it lets $ match at the end of each line as well as at the end of the string.
The solution below only appends the $ to your current regex and sets the m flag. This may vary depending on your programming language.
var text = "1.2.\n3.4.5.\n1.2.\n12.34.56.78.123.\nthis 1.2. shouldn't hit",
regex = /((\d+)\.(\d+)\.((\d+)\.)*)$/gm,
match;
while (match = regex.exec(text)) {
console.log(match);
}
You could simplify the regex to /(\d+\.){2,}$/gm, then split the full match based on the dot character to get all the different numbers. I've given a JavaScript example below, but getting a substring and splitting a string are pretty basic operations in most languages.
var text = "1.2.\n3.4.5.\n1.2.\n12.34.56.78.123.\nthis 1.2. shouldn't hit",
regex = /(\d+\.){2,}$/gm;
/* Slice is used to drop the dot at the end, otherwise resulting in
* an empty string on split.
*
* "1.2.3.".split(".") //=> ["1", "2", "3", ""]
* "1.2.3.".slice(0, -1) //=> "1.2.3"
* "1.2.3".split(".") //=> ["1", "2", "3"]
*/
console.log(
text.match(regex)
.map(match => match.slice(0, -1).split("."))
);
For more info about regex flags/modifiers have a look at: Regular Expression Reference: Mode Modifiers

Find a string with regex of unknown length beginning with a specific string

I'm looking to find a string of unknown length that begins with abc. The string's end is defined by a space, the end of a line, the end of the file, etc.
The string may contain . characters in the middle.
Examples of what I'm trying to find include:
abc.hello.1.test.a
abc.1test.hello.b.maybe
abc.myTest.1.test.maybe
Characters after the first dot must be present, so the following would not match.
abc.
abc
Use this Pattern (abc\.\S+) Demo
( # Capturing Group (1)
abc # "abc"
\. # "."
\S # <not a whitespace character>
+ # (one or more)(greedy)
) # End of Capturing Group (1)
If you really just want abc.{any non-empty string}, it's trivial to do ^abc\..+$ which just finds abc. at the beginning, and then matches 1 or more of anything after that.
If you want abc.{any string without a space}, it's similar: ^abc\.[^ ]+$
The ^ and $ are called anchors, and make sure your regex is matching the whole string, instead of, say, efg.abc.hij
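To pull every such token out of a longer text rather than validate a single string, the unanchored pattern can be combined with a check that the token starts at a whitespace boundary. The following Python sketch assumes that is the desired behaviour; the `(?<!\S)` lookbehind is an addition not found in the answers above, and the names are hypothetical.

```python
import re

# (?<!\S) : not preceded by a non-whitespace character (start of a token)
# abc\.   : the literal prefix "abc."
# \S+     : at least one more non-whitespace character
ABC_TOKEN = re.compile(r'(?<!\S)abc\.\S+')

def find_abc_tokens(text):
    """Return all whitespace-delimited tokens that start with 'abc.'."""
    return ABC_TOKEN.findall(text)

sample = 'abc.hello.1.test.a abc. abc\nabc.1test.hello.b.maybe efg.abc.hij'
print(find_abc_tokens(sample))
# ['abc.hello.1.test.a', 'abc.1test.hello.b.maybe']
```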

regular expressions: find every word that appears exactly one time in my document

Trying to learn regular expressions. As a practice, I'm trying to find every word that appears exactly one time in my document -- in linguistics this is a hapax legomenon (http://en.wikipedia.org/wiki/Hapax_legomenon)
So I thought the following expression give me the desired result:
\w{1}
But this doesn't work. The \w returns a character, not a whole word. Also, it does not appear to be giving me characters that appear only once (it actually returns 25873 matches -- which I assume are all alphanumeric characters). Can someone give me an example of how to find a "hapax legomenon" with a regular expression?
If you're trying to do this as a learning exercise, you picked a very hard problem :)
First of all, here is the solution:
\b(\w+)\b(?<!\b\1\b.*\b\1\b)(?!.*\b\1\b)
Now, here is the explanation:
We want to match a word. This is \b\w+\b - a run of one or more (+) word characters (\w), with a 'word break' (\b) on either side. A word break happens between a word character and a non-word character, so this will match between (e.g.) a word character and a space, or at the beginning and the end of the string. We also capture the word into a backreference by using parentheses ((...)). This means we can refer to the match itself later on.
Next, we want to exclude the possibility that this word has already appeared in the string. This is done by using a negative lookbehind - (?<! ... ). A negative lookbehind doesn't match if its contents match the string up to this point. So we want to not match if the word we have matched has already appeared. We do this by using a backreference (\1) to the already captured word. The final match here is \b\1\b.*\b\1\b - two copies of the current match, separated by any amount of string (.*).
Finally, we don't want to match if there is another copy of this word anywhere in the rest of the string. We do this by using negative lookahead - (?! ... ). Negative lookaheads don't match if their contents match at this point in the string. We want to match the current word after any amount of string, so we use (.*\b\1\b).
Here is an example (using C#):
var s = "goat goat leopard bird leopard horse";
foreach (Match m in Regex.Matches(s, @"\b(\w+)\b(?<!\b\1\b.*\b\1\b)(?!.*\b\1\b)"))
Console.WriteLine(m.Value);
Output:
bird
horse
It can be done in a single regex if your regex engine supports infinite repetition inside lookbehind assertions (e. g. .NET):
Regex regexObj = new Regex(
@"( # Match and capture into backreference no. 1:
\b # (from the start of the word)
\p{L}+ # a succession of letters
\b # (to the end of a word).
) # End of capturing group.
(?<= # Now assert that the preceding text contains:
^ # (from the start of the string)
(?: # (Start of non-capturing group)
(?! # Assert that we can't match...
\b\1\b # the word we've just matched.
) # (End of lookahead assertion)
. # Then match any character.
)* # Repeat until...
\1 # we reach the word we've just matched.
) # End of lookbehind assertion.
# We now know that we have just matched the first instance of that word.
(?= # Now look ahead to assert that we can match the following:
(?: # (Start of non-capturing group)
(?! # Assert that we can't match again...
\b\1\b # the word we've just matched.
) # (End of lookahead assertion)
. # Then match any character.
)* # Repeat until...
$ # the end of the string.
) # End of lookahead assertion.",
RegexOptions.Singleline | RegexOptions.IgnorePatternWhitespace);
Match matchResults = regexObj.Match(subjectString);
while (matchResults.Success) {
// matched text: matchResults.Value
// match start: matchResults.Index
// match length: matchResults.Length
matchResults = matchResults.NextMatch();
}
If you are trying to match an English word, the best form is:
[a-zA-Z]+
The problem with \w is that it also includes _ and numeric digits 0-9.
If you need to include other characters, you can append them after the Z but before the ]. Or, you might need to normalize the input text first.
Now, if you want a count of all words, or just to see words that don't appear more than once, you can't do that with a single regex. You'll need to invest some time in programming more complex logic. It may very well need to be backed by a database or some sort of memory structure to keep track of the count. After you parse and count the whole text, you can search for words that have a count of 1.
(\w+){1} will match each word.
After that you could always perform the count on the matches....
Higher level solution:
Create an array of your matches:
preg_match_all("/([a-zA-Z]+)/", $text, $matches, PREG_PATTERN_ORDER);
Let PHP count your array elements:
$tmp_array = array_count_values($matches[1]);
Iterate over the tmp array and check the word count:
foreach ($tmp_array as $word => $count) {
echo $word . ' ' . $count;
}
Low level but does what you want:
Split your text into an array using preg_split (split() was removed in PHP 7):
$array = preg_split('/\s+/', $text);
Iterate over that array:
foreach ($array as $word) { ... }
Check each word if it is a word:
if (preg_match('/[^a-zA-Z]/', $word)) continue; // skip tokens containing non-letters
Add the word to a temporary array as key:
if (!isset($tmp_array[$word])) $tmp_array[$word] = 0;
$tmp_array[$word]++;
After the loop, iterate over the tmp array and check the word count:
foreach ($tmp_array as $word => $count) {
echo $word . ' ' . $count;
}
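The count-then-filter approach described in these answers can be sketched compactly in Python. This assumes that a "word" is a run of ASCII letters and that comparison is case-sensitive (so `Goat` and `goat` count as different words); the function name is made up for illustration.

```python
import re
from collections import Counter

def hapax_legomena(text):
    """Return the words that occur exactly once, in order of first appearance."""
    words = re.findall(r'[a-zA-Z]+', text)  # extract every run of letters
    counts = Counter(words)                 # word -> number of occurrences
    return [word for word, count in counts.items() if count == 1]

print(hapax_legomena('goat goat leopard bird leopard horse'))
# ['bird', 'horse']
```

Lower-casing the text before counting would be one way to treat differently capitalised forms as the same word.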

Matching x chars from the beginning using regex

I'm trying to write a rename program in Delphi and need to know whether it's possible to match a specified number of characters from the beginning of a string using regex.
For example, if the string is FileName.txt and the specified number is 6,
it should match FileNa
I also need a pattern to match the string from a specific position to the end.
I would be glad if answers included explanations, because I would like to learn regex.
^.{6}
Will match the first 6 characters, but will not match if there are fewer than 6.
^.{1,6}
Will match the first 6 characters (as many as it can up to 6), but will not match if the string is empty.
. means to match any character (including path delimiters, in your case). You can replace . with \w if you only want letters, numbers, and underscore.
^\w{1,6}
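The difference between these quantifiers is easy to check interactively. A minimal sketch in Python, assuming its `re` module behaves like the Delphi engine for these simple patterns; the helper `first_chars` is hypothetical.

```python
import re

def first_chars(text, pattern):
    """Return what `pattern` matches at the start of `text`, or None."""
    match = re.match(pattern, text)
    return match.group(0) if match else None

print(first_chars('FileName.txt', r'^.{6}'))    # 'FileNa'
print(first_chars('File', r'^.{6}'))            # None: fewer than 6 characters
print(first_chars('File', r'^.{1,6}'))          # 'File'
print(first_chars('my file.txt', r'^\w{1,6}'))  # 'my': \w stops at the space
```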
If you use Delphi XE, regular expression functionality is built in with the TRegEx class. If you use an earlier version of Delphi you can find a library here, where you can also find more about the Delphi XE support: http://www.regular-expressions.info/delphi.html
This regular expression matches up to 6 characters until the . separating the extension from the rest of the file name.
^([^\.]{1,6})[^\.]*(?:\..*)?$
Given the input: FileName.txt
Group 1 would be: FileNa
Given the input: File.txt
Group 1 would be: File
The expression uses grouping to capture the first 6 characters. The code in Delphi XE would look something like:
var
Regex: TPerlRegEx;
ResultString: string;
Regex := TPerlRegEx.Create;
try
Regex.RegEx := '^([^\.]{1,6})[^\.]*(?:\..*)?$';
Regex.Options := [];
Regex.Subject := SubjectString;
if Regex.Match then begin
if Regex.GroupCount >= 1 then begin
ResultString := Regex.Groups[1];
end
else begin
ResultString := '';
end;
end
else begin
ResultString := '';
end;
finally
Regex.Free;
end;
For instance the filename: FileName.txt will be matched with: FileNa (group 1)
I'll try to explain the regular expression I have used, although there are probably better expressions out there:
^ # Match beginning of line
( # Begin a group (enables us to capture the contents alone)
[^\.] # Capture any character that is not a '.'
{1,6} # Capture anything from 1 to 6 of these characters (6 if possible)
) # Close the group
[^\.] # Match any character that is not '.' (again)
* # Match this 0 or more times
(?: # Begin a group that we do not wish to capture
\. # Match the character '.' (the extension separator)
.* # Match any character 0 or more times
) # Close the group
? # Match this group 0 or 1 time (it is either there or not)
$ # Match the end of line
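The breakdown above can be exercised with a short Python sketch. It assumes Python's `re` module treats this pattern the same way as TPerlRegEx; the function name and the configurable count `n` are made up for illustration.

```python
import re

def stem_prefix(file_name, n=6):
    """Return up to n leading characters of the name, stopping at the first dot."""
    pattern = r'^([^.]{1,%d})[^.]*(?:\..*)?$' % n
    match = re.match(pattern, file_name)
    return match.group(1) if match else None

print(stem_prefix('FileName.txt'))  # 'FileNa'
print(stem_prefix('File.txt'))      # 'File'
print(stem_prefix('.txt'))          # None: no characters before the dot
```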
To the next part of your question, creating a pattern to match a string from a specific number to the end:
^(?:.{6})?(.*)$
Given the input: This is a test
Group 1 would be: s a test
In this example the specific number is 6; change it to whatever number you are looking for. Again I've used groups to get the contents of the matched text. The first group is a non-capturing group, meaning we are not interested in its content, only that we need it to be there. If we are still talking about filenames you can use the following regular expression:
^(?:[^\.]{6})([^\.]*)(?:\..*)?$
Given the input: FileName.txt
Group 1 would be: me
This is a modification of the first regular expression, where I've made the first group non-capturing, told it to be exactly 6 characters long (again, change to whatever number suits you), and excluded the extension from the captured text.
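Both "from position N to the end" patterns can be tried with the following Python sketch, assuming Python's `re` module behaves like the Delphi engine here. The function names are hypothetical. One behaviour worth noticing: because the skipped prefix in the first pattern is optional, a string shorter than N characters is returned whole rather than rejected.

```python
import re

def tail_from(text, n=6):
    """Return everything after the first n characters."""
    match = re.match(r'^(?:.{%d})?(.*)$' % n, text)
    return match.group(1) if match else None

def stem_tail_from(file_name, n=6):
    """Return the part of the name after n characters, without the extension."""
    match = re.match(r'^(?:[^.]{%d})([^.]*)(?:\..*)?$' % n, file_name)
    return match.group(1) if match else None

print(tail_from('This is a test'))     # 's a test'
print(tail_from('abc'))                # 'abc': the optional prefix is skipped
print(stem_tail_from('FileName.txt'))  # 'me'
print(stem_tail_from('File.txt'))      # None: fewer than 6 characters before the dot
```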
Remember that regular expressions are easier to compose than to read. I have always found http://www.regular-expressions.info/ to be a good source of information, and the book Mastering Regular Expressions has helped me a great deal.