Regex to pull set of numbers from a string - regex

This is my string:
a:1:{i:0;a:4:{s:1:"i";s:5:"19807";s:1:"c";s:19:"7025462932180014917";s:1:"a";
d:88.719999999999998863131622783839702606201171875;s:2:"ba";
d:88.719999999999998863131622783839702606201171875;}}
Im just looking to pull out the portion that starts with 702546 and ends before that double quote after the 7. The number may be a varied length but will always start with 702546 and always end at the quote mark.
So my final selection would be 7025462932180014917.

use this regex (?<=")(702546\d+)(?=")
if youe regex engine doesn't support lookbehind use this regex "(702546\d+)" match will be in group 1

This could vary a bit depending on the technology/language you are using, but basically you can use something like this
(?<=")702546\d+(?=")
The first (?<=") may be overkill if there is no risk of other numbers in your string starting with 702546.
Breaking this down:
(?<=") // a zero-width assertion (lookbehind), so we will only match if preceded by double-quotes
702546 // literal match
\d+ // one or more additional digits
(?=") // another zero-width assertion (lookahead), so we will only match if followed by double-quotes
Basically, a zero-width assertion just means that the value must be there in order for a match to succeed, but the value does not become part of the match. We use them here because we want to be sure that the value is enclosed in quotes, without making the quotes part of our match.

The basic regex is 702546[^"]* to match any string like that, but you probably want to use a programming language of some sort to actually pull those out.
vnix$ grep -o '702546[^"]*' file
Or in a scripting language, something like this;
perl -nle 'print $1 while m/(702546[^"]*)/g' file
Your problem description leads me to believe that this is not actually what you should be doing, though. What problem are you really trying to solve?

Related

grep regex lookahead or start of string (or lookbehind or end of string)

I want to match a string which may contain a type of character before the match, or the match may begin at the beginning of the string (same for end of string).
For a minimal example, consider the text n.b., which I'd like to match either at the beginning of a line and end of a line or between two non-word characters, or some combination. The easiest way to do this would be to use word boundaries (\bn\.b\.\b), but that doesn't match; similar cases happen for other desired matches with non-word characters in them.
I'm currently using (^|[^\w])n\.b\.([^\w]|$), which works satisfactorily, but will also match the non-word characters (such as dashes) which appear immediately before and after the word, if available. I'm doing this in grep, so while I could easily pipe the output into sed, I'm using grep's --color option, which is disabled when piping into another command (for obvious reasons).
EDIT: The \K option (i.e. (\K^|[^\w])n\.b\.(\K[^\w]|$) seems to work, but it also does discard the color on the match within the output. While I could, again, invoke auxiliary tools, I'd love it if there was a quick and simple solution.
EDIT: I have misunderstood the \K operator; it simply removes all the text from the match preceding its use. No wonder it was failing to color the output.
If you're using grep, you must be using the -P option, or lookarounds and \K would throw errors. That means you also have negative lookarounds at your disposal. Here's a simpler version of your regex:
(?<!\w)n\.b\.(?!\w)
Also, be aware that (?<=...) and (?<!...) are lookbehinds, and (?=...) and (?!...) are lookaheads. The wording of your title suggests you may have gotten those mixed up, a common beginner's mistake.
Apparently matching beginning of string is possible inside lookahead/lookbehinds; the obvious solution is then (?<=^|[^\w])n\.b\.(?=[^\w]|$).

Regular expression using negative lookbehind not working in Notepad++

I have a source file with literally hundreds of occurrences of strings flecha.jpg and flecha1.jpg, but I need to find occurrences of any other .jpg image (i.e. casa.jpg, moto.jpg, whatever)
I have tried using a regular expression with negative lookbehind, like this:
(?<!flecha|flecha1).jpg
but it doesn't work! Notepad++ simply says that it is an invalid regular expression.
I have tried the regex elsewhere and it works, here is an example so I guess it is a problem with NPP's handling of regexes or with the syntax of lookbehinds/lookaheads.
So how could I achieve the same regex result in NPP?
If useful, I am using Notepad++ version 6.3 Unicode
As an extra, if you are so kind, what would be the syntax to achieve the same thing but with optional numbers (in this case only '1') as a suffix of my string? (even if it doesn't work in NPP, just to know)...
I tried (?<!flecha[1]?).jpg but it doesn't work. It should work the same as the other regex, see here (RegExr)
Notepad++ seems to not have implemented variable-length look-behinds (this happens with some tools). A workaround is to use more than one fixed-length look-behind:
(?<!flecha)(?<!flecha1)\.jpg
As you can check, the matches are the same. But this works with npp.
Notice I escaped the ., since you are trying to match extensions, what you want is the literal .. The way you had, it was a wildcard - could be any character.
About the extra question, unfortunately, as we can't have variable-length look-behinds, it is not possible to have optional suffixes (numbers) without having multiple look-behinds.
Solving the problem of the variable-length-negative-lookbehind limitation in Notepad++
Given here are several strategies for working around this limitation in Notepad++ (or any regex engine with the same limitation)
Defining the problem
Notepad++ does not support the use of variable-length negative lookbehind assertions, and it would be nice to have some workarounds. Let's consider the example in the original question, but assume we want to avoid occurrences of files named flecha with any number of digits after flecha, and with any characters before flecha. In that case, a regex utilizing a variable-length negative lookbehind would look like (?<!flecha[0-9]*)\.jpg.
Strings we don't want to match in this example
flecha.jpg
flecha1.jpg
flecha00501275696.jpg
aflecha.jpg
img_flecha9.jpg
abcflecha556677.jpg
The Strategies
Inserting Temporary Markers
Begin by performing a find-and-replace on the instances that you want to avoid working with - in our case, instances of flecha[0-9]*\.jpg. Insert a special marker to form a pattern that doesn't appear anywhere else. For this example, we will insert an extra . before .jpg, assuming that ..jpg doesn't appear elsewhere. So we do:
Find: (flecha[0-9]*)(\.jpg)
Replace with: $1.$2
Now you can search your document for all the other .jpg filenames with a simple regex like \w+\.jpg or (?<!\.)\.jpg and do what you want with them. When you're done, do a final find-and-replace operation where you replace all instances of ..jpg with .jpg, to remove the temporary marker.
Using a negative lookahead assertion
A negative lookahead assertion can be used to make sure that you're not matching the undesired file names:
(?<!\S)(?!\S*flecha\d*\.jpg)\S+\.jpg
Breaking it down:
(?<!\S) ensures that your match begins at the start of a file name, and not in the middle, by asserting that your match is not preceded by a non-whitespace character.
(?!\S*flecha\d*\.jpg) ensures that whatever is matched does not contain the pattern we want to avoid
\S+\.jpg is what actually gets matched -- a string of non-whitespace characters followed by .jpg.
Using multiple fixed-length negative lookbehinds
This is a quick (but not-so-elegant) solution for situations where the pattern you don't want to match has a small number of possible lengths.
For example, if we know that flecha is only followed by up to three digits, our regex could be:
(?<!flecha)(?<!flecha[0-9])(?<!flecha[0-9][0-9])(?<!flecha[0-9][0-9][0-9])\.jpg
Are you aware that you're only matching (in the sense of consuming) the extension (.jpg)? I would think you wanted to match the whole filename, no? And that's much easier to do with a lookahead:
\b(?!flecha1?\b)\w+\.jpg
The first \b anchors the match to the beginning of the name (assuming it's really a filename we're looking at). Then (?!flecha1?\b) asserts that the name is not flecha or flecha1. Once that's done, the \w+ goes ahead and consumes the name. Then \.jpg grabs the extension to finish off the match.

regex to match strings not ending with a pattern?

I am trying to form a regular expression that will match strings that do NOT end a with a DOT FOLLOWED BY NUMBER.
eg.
abcd1
abcdf12
abcdf124
abcd1.0
abcd1.134
abcdf12.13
abcdf124.2
abcdf124.21
I want to match first three.
I tried modifying this post but it didn't work for me as the number may have variable length.
Can someone help?
You can use something like this:
^((?!\.[\d]+)[\w.])+$
It anchors at the start and end of a line. It basically says:
Anchor at the start of the line
DO NOT match the pattern .NUMBERS
Take every letter, digit, etc, unless we hit the pattern above
Anchor at the end of the line
So, this pattern matches this (no dot then number):
This.Is.Your.Pattern or This.Is.Your.Pattern2012
However it won't match this (dot before the number):
This.Is.Your.Pattern.2012
EDIT: In response to Wiseguy's comment, you can use this:
^((?!\.[\d]+$)[\w.])+$ - which provides an anchor after the number. Therefore, it must be a dot, then only a number at the end... not that you specified that in your question..
If you can relax your restrictions a bit, you may try using this (extended) regular expression:
^[^.]*.?[^0-9]*$
You may omit anchoring metasymbols ^ and $ if you're using function/tool that matches against whole string.
Explanation: This regex allows any symbols except dot until (optional) dot is found, after which all non-numerical symbols are allowed. It won't work for numbers in improper format, like in string: abcd1...3 or abcd1.fdfd2. It also won't work correctly for some string with multiple dots, like abcd.ab123cd.a (the problem description is a bit ambigous).
Philosophical explanation: When using regular expressions, often you don't need to do exactly what your task seems to be, etc. So even simple regex will do the job. An abstract example: you have a file with lines are either numbers, or some complicated names(without digits), and say, you want to filter out all numbers, then simple filtering by [^0-9] - grep '^[0-9]' will do the job.
But if your task is more complex and requires validation of format and doing other fancy stuff on data, why not use a simple script(say, in awk, python, perl or other language)? Or a short "hand-written" function, if you're implementing stand-alone application. Regexes are cool, but they are often not the right tool to use.
I would just use a simple negative look-behind anchored at the end:
.*(?<!\\.\\d+)$

Regex lookahead

I am using a regex to find:
test:?
Followed by any character until it hits the next:
test:?
Now when I run this regex I made:
((?:test:\?)(.*)(?!test:\?))
On this text:
test:?foo2=bar2&baz2=foo2test:?foo=bar&baz=footest:?foo2=bar2&baz2=foo2
I expected to get:
test:?foo2=bar2&baz2=foo2
test:?foo=bar&baz=foo
test:?foo2=bar2&baz2=foo2
But instead it matches everything. Does anyone with more regex experience know where I have gone wrong? I've used regexes for pattern matching before but this is my first experience of lookarounds/aheads.
Thanks in advance for any help/tips/pointers :-)
I guess you could explore a greedy version.
(expanded)
(test:\? (?: (?!test:\?)[\s\S])* )
The Perl program below
#! /usr/bin/env perl
use strict;
use warnings;
$_ = "test:?foo2=bar2&baz2=foo2test:?foo=bar&baz=footest:?foo2=bar2&baz2=foo2";
while (/(test:\? .*?) (?= test:\? | $)/gx) {
print "[$1]\n";
}
produces the desired output from your question, plus brackets for emphasis.
[test:?foo2=bar2&baz2=foo2]
[test:?foo=bar&baz=foo]
[test:?foo2=bar2&baz2=foo2]
Remember that regex quantifiers are greedy and want to gobble up as much as they can without breaking the match. Each subsegment to terminate as soon as possible, which means .*? semantics.
Each subsegment terminates with either another test:? or end-of-string, which we look for with (?=...) zero-width lookahead wrapped around | for alternatives.
The pattern in the code above uses Perl’s /x regex switch for readability. Depending on the language and libraries you’re using, you may need to remove the extra whitespace.
Three issues:
(?!) is a negative lookahead assertion. You want (?=) instead, requiring that what comes next is test:?.
The .* is greedy; you want it non-greedy so that you grab just the first chunk.
You're wanting the last chunk also, so you want to match $ as well at the end.
End result:
(?:test:\?)(.*?)(?=test:\?|$)
I've also removed the outer group, seeing no point in it. All RE engines that I know of let you access group 0 as the full match, or some other such way (though perhaps not when finding all matches). You can put it back if you need to.
(This works in PCRE; not sure if it would work with POSIX regular expressions, as I'm not in the habit of working with them.)
If you're just wanting to split on test:?, though, regular expressions are the wrong tool. Split the strings using your language's inbuilt support for such things.
Python:
>>> re.findall('(?:test:\?)(.*?)(?=test:\?|$)',
... 'test:?foo2=bar2&baz2=foo2test:?foo=bar&baz=footest:?foo2=bar2&baz2=foo2')
['foo2=bar2&baz2=foo2', 'foo=bar&baz=foo', 'foo2=bar2&baz2=foo2']
You probably want ((?:test:\?)(.*?)(?=test:\?)), although you haven't told us what language you're using to drive the regexes.
The .*? matches as few characters as possible without preventing the whole string from matching, where .* matches as many as possible (is greedy).
Depending, again, on what language you're using to do this, you'll probably need to match, then chop the string, then match again, or call some language-specific match_all type function.
By the way, you don't need to anchor a regex using a lookahead (you can just match the pattern to search for, instead), so this will (most likely) do in your case:
test:[?](.*?)test:[?]

Remove stuff, retrieve numbers, retrieve text with spaces in place of dots, remove the rest

This is my first question, so I hope I didn't mess too much with the title and the formatting.
I have a bunch of file a client of mine sent me in this form:
Name.Of.Chapter.021x212.The.Actual.Title.Of.the.Chapter.DOC.NAME-Some.stuff.Here.ext
What I need is a regex to output just:
212 The Actual Title Of the Chapter
I'm not gonna use it with any script language in particular; it's a batch renaming of files through an app supporting regex (which already "preserves" the extension).
So far, all I was able to do was this:
/.*x(\d+)\.(.*?)\.[A-Z]{3}.*/ -->REPLACE: $1 $2
(Capture everything before a number preceded by an "x", group numbers after the "x", group everything following until a 3 digit Uppercase word is met, then capture everything that follows)
which gives me back:
212 The.Actual.Title.Of.the.Chapter
Having seen the result I thought that something like:
/.*x(\d+)\.([^.]*?)\.[A-Z]{3}.*/ -->REPLACE: $1 $2
(Changed second group to "Capture everything which is not a dot...") would have worked as expected.
Instead, the whole regex fails to match completely.
What am I missing?
TIA
cià
ale
.*x(\d+)\. matches Name.Of.Chapter.021x212.
\.[A-Z]{3}.* matches .DOC.NAME-Some.stuff.Here.ext
But ([^.]*?) does not match The.Actual.Title.Of.the.Chapter because this regex does not allow for any periods at all.
since you are on Mac, you could use the shell
$ s="Name.Of.Chapter.021x212.The.Actual.Title.Of.the.Chapter.DOC.NAME-Some.stuff.Here.ext"
$ echo ${s#*x}
212.The.Actual.Title.Of.the.Chapter.DOC.NAME-Some.stuff.Here.ext
$ t=${s#*x}
$ echo ${t%.[A-Z][A-Z][A-Z].*}
212.The.Actual.Title.Of.the.Chapter
Or if you prefer sed, eg
echo $filename | sed 's|.[^x]*x||;s/\.[A-Z][A-Z][A-Z].*//'
For processing multiple files
for file in *.ext
do
newfile=${file#*x}
newfile=${newfile%.[A-Z][A-Z][A-Z].*}
# or
# newfile=$(echo $file | sed 's|.[^x]*x||;s/\.[A-Z][A-Z][A-Z].*//')
mv "$file" "$newfile"
done
To your question "How can I remove the dots in the process of matching?" the answer is "You can't." The only way to do that is by processing the result of the match in a second step, as others have said. But I think there's a more basic question that needs to be addressed, which is "What does it mean for a regex to match a given input?"
A regex is usually said to match a string when it describes any substring of that string. If you want to be sure the regex describes the whole string, you need to add the start (^) and end ($) anchors:
/^.*x(\d+)\.(.*?)\.[A-Z]{3}.*$/
But in your case, you don't need to describe the whole string; if you get rid of the .* at either end, it will serve your just as well:
/x(\d+)\.(.*?)\.[A-Z]{3}/
I recommend you not get in the habit of "padding" regexes with .* at beginning and end. The leading .* in particular can change the behavior of the regex in unexpected ways. For example, it there were two places in the input string where x(\d+)\. could match, your "real" match would have started at the second one. Also, if it's not anchored with ^ or \A, a leading .* can make the whole regex much less efficient.
I said "usually" above because some tools do automatically "anchor" the match at the beginning (Python's match()) or at both ends (Java's matches()), but that's pretty rare. Most of the shells and command-line tools available on *nix systems define a regex match in the traditional way, but it's a good idea to say what tool(s) you're using, just in case.
Finally, a word or two about vocabulary. The parentheses in (\d+) cause the matched characters to be captured, not grouped. Many regex flavors also support non-capturing parentheses in the form (?:\d+), which are used for grouping only. Any text that is included in the overall match, whether it's captured or not, is said to have been consumed (not captured). The way you used the words "capture" and "group" in your question is guaranteed to cause maximum confusion in anyone who assumes you know what you're talking about. :D
If you haven't read it yet, check out this excellent tutorial.