Extract specific string using regular expression

Extract specific string using regular expression - regex

I want to extract only a specific string if its match
example as an input string:
13.10.0/
13.10.1/
13.10.2/
13.10.3/
13.10.4.2/
13.10.4.4/
13.10.4.5/
I'm using this regex [0-9]+.[0-9]+.[0-9] to extract only digit.digit.digit from a string if its match
but in that case, this is the wrong output related to my regex :
13.10.0
13.10.1
13.10.2
13.10.3
13.10.4.2 (no need to match this string 13.10.4 )
13.10.4.4 (no need to match this string13.10.4 )
13.10.4.5(no need to match this string 13.10.4 )
the correct output that I need :
13.10.0
13.10.1
13.10.2
13.10.3

It's hard to say without knowing how you're passing these strings in -- are they lines in a file? An array of strings in a programming language?
If you're searching a file using grep or a similar tool, it will give you all lines that match anywhere, even if only part of the line matches.
Normally, you'd deal with this using anchors to specify the regex must start on the first character of the line, and end on the last (e.g. ^[0-9]+.[0-9]+.[0-9]$). ^ matches the start of the line, and $ matches at the end.
In your case, you've got slashes at the end of all the lines, so the easiest fix is to match that final slash, with ^[0-9]+.[0-9]+.[0-9]/.
You could also use lookahead or groups to match the slash without returning it -- but that depends a bit more on what tool you're running this regex in and how you're processing it.
If your strings are separated by whitespace (other than newlines), replacing ^ with (^|\s) (either the beginning of the string, or some whitespace character) may work -- but it will add a leading space to some of your results.
You may also need to set your regex tool to match multiple times in a line (e.g. the -o flag in grep). Again, it's hard to give useful advice about this without knowing what regular-expression tool you're using, or how you're processing the results.

I think you want:
^\d+\.\d+\.\d+$
Which is exactly 3 groups of digit(s) separates by (literal) dots.

Some tools (like grep) match all lines that contain your regex, and may have additional characters before/after.
Use $ character to match end of line after your regex. (Also note, that . matches any character, not literal dot)
[0-9]+\.[0-9]+\.[0-9]$

Related

Ant regex expression

Quite a simple one in theory but can't quite get it!
I want a regex in ant which matches anything as long as it has a slash on the end.
Below is what I expect to work
<regexp id="slash.end.pattern" pattern="*/"/>
However this throws back
java.util.regex.PatternSyntaxException: Dangling meta character '*' near index 0
*/
^
I have also tried escaping this to \*, but that matches a literal *.
Any help appreciated!

Your original regex pattern didn't work because * is a special character in regex that is only used to quantify other characters.
The pattern (.)*/$, which you mentioned in your comment, will match any string of characters not containing newlines, however it uses a possibly unnecessary capturing group. .*/$ should work just as well.
If you need to match newline characters, the dot . won't be enough. You could try something like [\s\S]*/$
On that note, it should be mentioned that you might not want to use $ in this pattern. Suppose you have the following string:
abc/def/
Should this be evaluated as two matches, abc/ and def/? Or is it a single match containing the whole thing? Your current approach creates a single match. If instead you would like to search for strings of characters and then stop the match as soon as a / is found, you could use something like this: [\s\S]*?/.

Regular expression using negative lookbehind not working in Notepad++

I have a source file with literally hundreds of occurrences of strings flecha.jpg and flecha1.jpg, but I need to find occurrences of any other .jpg image (i.e. casa.jpg, moto.jpg, whatever)
I have tried using a regular expression with negative lookbehind, like this:
(?<!flecha|flecha1).jpg
but it doesn't work! Notepad++ simply says that it is an invalid regular expression.
I have tried the regex elsewhere and it works, here is an example so I guess it is a problem with NPP's handling of regexes or with the syntax of lookbehinds/lookaheads.
So how could I achieve the same regex result in NPP?
If useful, I am using Notepad++ version 6.3 Unicode
As an extra, if you are so kind, what would be the syntax to achieve the same thing but with optional numbers (in this case only '1') as a suffix of my string? (even if it doesn't work in NPP, just to know)...
I tried (?<!flecha[1]?).jpg but it doesn't work. It should work the same as the other regex, see here (RegExr)

Notepad++ seems to not have implemented variable-length look-behinds (this happens with some tools). A workaround is to use more than one fixed-length look-behind:
(?<!flecha)(?<!flecha1)\.jpg
As you can check, the matches are the same. But this works with npp.
Notice I escaped the ., since you are trying to match extensions, what you want is the literal .. The way you had, it was a wildcard - could be any character.
About the extra question, unfortunately, as we can't have variable-length look-behinds, it is not possible to have optional suffixes (numbers) without having multiple look-behinds.

Solving the problem of the variable-length-negative-lookbehind limitation in Notepad++
Given here are several strategies for working around this limitation in Notepad++ (or any regex engine with the same limitation)
Defining the problem
Notepad++ does not support the use of variable-length negative lookbehind assertions, and it would be nice to have some workarounds. Let's consider the example in the original question, but assume we want to avoid occurrences of files named flecha with any number of digits after flecha, and with any characters before flecha. In that case, a regex utilizing a variable-length negative lookbehind would look like (?<!flecha[0-9]*)\.jpg.
Strings we don't want to match in this example
flecha.jpg
flecha1.jpg
flecha00501275696.jpg
aflecha.jpg
img_flecha9.jpg
abcflecha556677.jpg
The Strategies
Inserting Temporary Markers
Begin by performing a find-and-replace on the instances that you want to avoid working with - in our case, instances of flecha[0-9]*\.jpg. Insert a special marker to form a pattern that doesn't appear anywhere else. For this example, we will insert an extra . before .jpg, assuming that ..jpg doesn't appear elsewhere. So we do:
Find: (flecha[0-9]*)(\.jpg)
Replace with: $1.$2
Now you can search your document for all the other .jpg filenames with a simple regex like \w+\.jpg or (?<!\.)\.jpg and do what you want with them. When you're done, do a final find-and-replace operation where you replace all instances of ..jpg with .jpg, to remove the temporary marker.
Using a negative lookahead assertion
A negative lookahead assertion can be used to make sure that you're not matching the undesired file names:
(?<!\S)(?!\S*flecha\d*\.jpg)\S+\.jpg
Breaking it down:
(?<!\S) ensures that your match begins at the start of a file name, and not in the middle, by asserting that your match is not preceded by a non-whitespace character.
(?!\S*flecha\d*\.jpg) ensures that whatever is matched does not contain the pattern we want to avoid
\S+\.jpg is what actually gets matched -- a string of non-whitespace characters followed by .jpg.
Using multiple fixed-length negative lookbehinds
This is a quick (but not-so-elegant) solution for situations where the pattern you don't want to match has a small number of possible lengths.
For example, if we know that flecha is only followed by up to three digits, our regex could be:
(?<!flecha)(?<!flecha[0-9])(?<!flecha[0-9][0-9])(?<!flecha[0-9][0-9][0-9])\.jpg

Are you aware that you're only matching (in the sense of consuming) the extension (.jpg)? I would think you wanted to match the whole filename, no? And that's much easier to do with a lookahead:
\b(?!flecha1?\b)\w+\.jpg
The first \b anchors the match to the beginning of the name (assuming it's really a filename we're looking at). Then (?!flecha1?\b) asserts that the name is not flecha or flecha1. Once that's done, the \w+ goes ahead and consumes the name. Then \.jpg grabs the extension to finish off the match.

How to extract file location using Regular Expressions(VB.NET)

I am facing a problem whereby I am given a string that contains a path to a file and the file's name and I only want to extract the path (without the file's name)
For example, I will receive something like
C:\Users\OopsD\Projects\test.acdbd
and from that string I want to extract only
C:\Users\OopsD\Projects
I was trying to create a RegEx to match a backslash followed by a word, followed by a dot followed by another word - this is to match the
\test.acdbd
part and replace it with empty string so that the final result is
C:\Users\OopsD\Projects
Can anyone, familiar with RegEx, help me on this one? Also, I will be using regular expressions quite a lot in the future. Is there a (free) program I can download to create regular expressions?

Are you really sure you need to be using Regex for such as simple task? How about this:
Dim file As New IO.FileInfo(" C:\Users\OopsD\Projects\test.acdbd")
MsgBox(file.Directory.FullName)
Regarding the free program on Regex, I would definitely recommend http://www.gskinner.com/RegExr/ - using it all the time. But you always have to consider alternatives, before going the Regex way.

The regex that you are looking for is as below:
[^/]+$
where,
^ (caret):Matches at the start of the string the regex pattern is applied to. Matches a position rather than a character. Most regex flavors have an option to make the caret match after line breaks (i.e. at the start of a line in a file) as well.
$ (dollar):Matches at the end of the string the regex pattern is applied to. Matches a position rather than a character. Most regex flavors have an option to make the dollar match before line breaks (i.e. at the end of a line in a file) as well. Also matches before the very last line break if the string ends with a line break.
+ (plus):Repeats the previous item once or more. Greedy, so as many items as possible will be matched before trying permutations with less matches of the preceding item, up to the point where the preceding item is matched only once.
More reference can be found out at this link.
Many Regex softwares and tools are out there. Some of them are:
www.gskinner.com/RegExr/
www.txt2re.com
Rubular- It is not just for Ruby.

regex to match strings not ending with a pattern?

I am trying to form a regular expression that will match strings that do NOT end a with a DOT FOLLOWED BY NUMBER.
eg.
abcd1
abcdf12
abcdf124
abcd1.0
abcd1.134
abcdf12.13
abcdf124.2
abcdf124.21
I want to match first three.
I tried modifying this post but it didn't work for me as the number may have variable length.
Can someone help?

You can use something like this:
^((?!\.[\d]+)[\w.])+$
It anchors at the start and end of a line. It basically says:
Anchor at the start of the line
DO NOT match the pattern .NUMBERS
Take every letter, digit, etc, unless we hit the pattern above
Anchor at the end of the line
So, this pattern matches this (no dot then number):
This.Is.Your.Pattern or This.Is.Your.Pattern2012
However it won't match this (dot before the number):
This.Is.Your.Pattern.2012
EDIT: In response to Wiseguy's comment, you can use this:
^((?!\.[\d]+$)[\w.])+$ - which provides an anchor after the number. Therefore, it must be a dot, then only a number at the end... not that you specified that in your question..

If you can relax your restrictions a bit, you may try using this (extended) regular expression:
^[^.]*.?[^0-9]*$
You may omit anchoring metasymbols ^ and $ if you're using function/tool that matches against whole string.
Explanation: This regex allows any symbols except dot until (optional) dot is found, after which all non-numerical symbols are allowed. It won't work for numbers in improper format, like in string: abcd1...3 or abcd1.fdfd2. It also won't work correctly for some string with multiple dots, like abcd.ab123cd.a (the problem description is a bit ambigous).
Philosophical explanation: When using regular expressions, often you don't need to do exactly what your task seems to be, etc. So even simple regex will do the job. An abstract example: you have a file with lines are either numbers, or some complicated names(without digits), and say, you want to filter out all numbers, then simple filtering by [^0-9] - grep '^[0-9]' will do the job.
But if your task is more complex and requires validation of format and doing other fancy stuff on data, why not use a simple script(say, in awk, python, perl or other language)? Or a short "hand-written" function, if you're implementing stand-alone application. Regexes are cool, but they are often not the right tool to use.

I would just use a simple negative look-behind anchored at the end:
.*(?<!\\.\\d+)$

concatenate multiple regexes into one regex

For a text file, I want to match to the string that starts with "BEAM" and "FILE PATH". I would have used
^BEAM.*$
^FILE PATH.*$
if I were to match them separately. But now I have to concatenate those two matching patterns into one pattern.
Any idea on how to do this?

A pipe/bar character generally represents "or" with regexps. You could try:
^(BEAM|FILE PATH).*$

The accepted answer is right but you may have redundancy in your Regular Expression.
^ means match the start of a line
(BEAM|FILE PATH) - means the string "BEAM" or the string "FILE PATH"
.* means anything at all
$ means match the end of the line
So in effect, all you are saying is match my strings at the beginning of the line since you don't care what's at the end. You could do this with:
^(BEAM|FILE PATH)
There are two cases where this reduction wouldn't be valid:
If you doing some with the matched string, so you want to match the whole line to pass the data to something else.
You're using a Regular Expression function that wants to match a whole string rather than part of it. You can sometimes solve this by picking the a different Regular Expression function or method. For example in Python use search instead of match.

If the above post doesn't work, try escaping the () and | in different ways until you find one that works. Some regex engines treat these characters differently (special vs. non-special characters), especially if you are running the match in a shell (shell will look for special characters too):
^\(BEAM|FILE PATH\).*$
%\(BEAM\|FILE PATH\).*$
etc.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Extract specific string using regular expression - regex

I think you want: ^\d+\.\d+\.\d+$ Which is exactly 3 groups of digit(s) separates by (literal) dots.

Some tools (like grep) match all lines that contain your regex, and may have additional characters before/after. Use $ character to match end of line after your regex. (Also note, that . matches any character, not literal dot) [0-9]+\.[0-9]+\.[0-9]$

Related

Ant regex expression

Regular expression using negative lookbehind not working in Notepad++

How to extract file location using Regular Expressions(VB.NET)

regex to match strings not ending with a pattern?

concatenate multiple regexes into one regex

Categories

Resources