I'm trying to get a grasp on regular expressions and I came across the one passed to the str.extract method:
movies['year']=movies['title'].str.extract('.*\((.*)\).*',expand=True)
It is supposed to detect and extract whatever is in parentheses. So, given the string foobar (1995), it should return 1995. However, if I open a terminal and type the following
echo 'foobar (1995)' | grep '.*\((.*)\).*'
it matches the whole string instead of only the content between parentheses. I assume the method is working with the BRE flavor because of the parentheses escaping, and so is grep (default behavior). Also, a regex tester highlights the whole string in blue and the year (capturing group) in green. Am I missing something here? The regex works perfectly inside Python.
First of all, the behavior of Pandas .str.extract() is quite expected: it returns only the capturing group contents. The pattern used with extract requires at least 1 capturing group:
pat : string
Regular expression pattern with capturing groups
If you use a named capturing group, the new column will be named after the named group.
The grep command you provided can be reduced to
grep '\((.*)\)'
as grep can match a line partially (it does not require a full-line match) and works on a per-line basis: once a match is found, the whole line is returned. To print only the matched text instead of the whole line, you may use the -o switch.
With grep, you cannot return the capturing group contents directly. This can be worked around with a PCRE pattern enabled by the -P option, but -P is not available in the grep shipped with macOS, for example. sed or awk may help in those situations, too.
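To see the difference between the whole match and the capturing group outside pandas, here is a minimal sketch using Python's standard re module (not pandas itself; the variable names are only illustrative). Group 0 corresponds to the full match, group 1 to what .str.extract() returns:

```python
import re

title = "foobar (1995)"

# Same pattern as in the question, written as a raw string.
m = re.search(r'.*\((.*)\).*', title)

whole_match = m.group(0)  # the entire matched text
year = m.group(1)         # only the first capturing group

print(whole_match)  # foobar (1995)
print(year)         # 1995
```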
Try using this:
movies['year']= movies['title'].str.extract('.*\((\d{4})\).*',expand=False)
Set expand=True if you want it to return a DataFrame or when applying multiple capturing groups.
A year is assumed here to consist of exactly 4 digits, so the regex \((\d{4})\) matches any four-digit year between parentheses.
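As a rough illustration of the stricter \((\d{4})\) pattern, here is a sketch with the standard re module instead of pandas. The helper name extract_year and the sample titles are made up for the example:

```python
import re

# Four digits enclosed in literal parentheses; the digits are group 1.
YEAR_IN_PARENS = re.compile(r'\((\d{4})\)')

def extract_year(title):
    """Return the first four-digit value in parentheses, or None."""
    m = YEAR_IN_PARENS.search(title)
    return m.group(1) if m else None

# Hypothetical sample titles, not taken from the original data set.
titles = ["foobar (1995)", "two parts (remastered) (2001)", "no year here"]
years = [extract_year(t) for t in titles]
print(years)  # ['1995', '2001', None]
```

Unlike (.*), the \d{4} group skips parenthesized text that is not a four-digit number.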
I am new to regex, and am trying to capture a certain pattern. There are two words (name1 and host), that I want to capture everything in between, the problem is, sometimes "everything" in between might contain 'name1'. And if it does contain 'name1', it includes everything from the previous name1, to the next 'host' word. So I basically have two 'strings' from two different 'name1' being captured.
This is the example I have:
name1{want-this-string}host,name1{want-this-string}host,name1{dont-want-this-string},name1{dont-want-this-either}name1{want-this-string}host
and this is the regex I'm using right now..
(?<=\bname1\b).*?(?=\bhost\b)
My expected output is that it matches the 3 {want-this-string}, and not the {dont-want-this} stuff. so basically:
{want-this-string}{want-this-string}{want-this-string}
But right now it's grabbing the first two {want-this-string} and then this whole section
{dont-want-this-string},name1{dont-want-this-either}name1{want-this-string}
If you have a GNU grep, you may use
grep -oP '\bname1\{\K[^{}]*(?=}host\b)' file
With pcregrep (you may install it on MacOS if you are using that OS), you may use it like
pcregrep -oM '\bname1\{\K[^{}]*(?=}host\b)' file
Details
\bname1\{ - whole word name1 and a { after
\K - match reset operator that discards the text matched so far from the returned match
[^{}]* - 0 or more chars other than { and }
(?=}host\b) - there must be a }host as a whole word immediately to the right of the current location.
See the online grep demo:
s="name1{want-this-string}host,name1{want-this-string}host,name1{dont-want-this-string},name1{dont-want-this-either}name1{want-this-string}host"
grep -oP '\bname1\{\K[^{}]*(?=}host\b)' <<< "$s"
Output:
want-this-string
want-this-string
want-this-string
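If neither grep -P nor pcregrep is available, the same idea can be sketched in Python. Python's re module has no \K, so this version assumes a capturing group around the wanted part instead:

```python
import re

s = ("name1{want-this-string}host,name1{want-this-string}host,"
     "name1{dont-want-this-string},name1{dont-want-this-either}"
     "name1{want-this-string}host")

# [^{}]* cannot cross a brace, so a match can never span two name1 blocks;
# the closing brace must be followed directly by the whole word "host".
matches = re.findall(r'\bname1\{([^{}]*)\}host\b', s)
print(matches)  # ['want-this-string', 'want-this-string', 'want-this-string']
```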
I'm not sure this pattern would handle every input you might encounter, but one way to approach it is to build an expression from your cases with a left boundary and, if necessary, a right boundary, such as:
(^name1|}name1)({.+?})?|(host,name1)({.+?})(host,name1)
where this part can be simplified considerably:
(host,name1)({.+?})(host,name1)
We include it here only to illustrate adding a right boundary so that only the first instance of the (host,name1) value is captured.
If this expression wasn't desired and you wish to modify it, please visit this link at regex101.com.
I have code containing object.attribute, where attribute can be an array element,
for example object.SIZE_OF_IMAGE[0], or a simple name. I want to find all occurrences of "object.attribute" and replace each with self.lowercase(attribute). I want a regular expression in Vim to do that.
I can use :%s/object.*/self./gc and fix each replacement manually, but that is very slow.
Here are some examples:
object.SIZE to self.size
object.SIZE_OF_IMAGE[0] to self.size_of_image[0]
You basically just need two things:
Capture groups :help /\( let you store what's matched in between \(...\) and then reference it (via \1, \2, etc.) in the replacement (or even afterwards in the pattern itself).
The :help s/\L special replacement action that makes everything following lowercase.
This gives you the following command:
:%substitute/\<object\.\(\w\+\)/self.\L\1/g
Notes:
I've established a keyword start assertion (\<) at the beginning to avoid matching schlobject as well.
\w\+ matches letters, digits, and underscores (so it fulfills your example); various alternatives are possible here.
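For comparison outside Vim, here is a small Python sketch of the same substitution. Python's re has no \L in replacements, so it assumes a replacement function that lowercases the captured group; the function name to_self is made up for the example:

```python
import re

def to_self(source):
    """Rewrite object.ATTR as self.attr, lowercasing only the attribute name."""
    # \b plays the role of Vim's \< here, so "schlobject.X" is left alone.
    return re.sub(r'\bobject\.(\w+)',
                  lambda m: 'self.' + m.group(1).lower(),
                  source)

print(to_self("object.SIZE"))              # self.size
print(to_self("object.SIZE_OF_IMAGE[0]"))  # self.size_of_image[0]
```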
sed -E 's/object\.([^ \(]*)(.*)/self.lowercase(\1)\2/g' file_name.txt
The above command assumes that your attribute is followed by a space or "(".
You can tweak this command based on your needs.
Based on your comment above that the attribute part "finishes by space or [ or (", you could match it with:
/object\.[^ [(]*
So, to replace it with self.attribute, use a capturing group and \L to make everything lowercase:
:%s/\vobject\.([^ [(]*)/self.\L\1/g
In command mode, try this:
:1,$ s/object.attribute/self.lowercase(attribute)/g
I'm struggling with regex. Here's the command I'm using (running Cygwin on Windows) upon hwnd’s suggestion (which had solved my previous issue):
grep -Po '(?<="id":)[^,]+' regex_test.txt
How can I change the regular expression so that the match created starts with ,{"id": OR :[{"id": ? Sadly, the current expression is also capturing unwanted ID’s that are prepended with :{"id":
Input Text File named "regex_test.txt":
reason":{"id":25549177,"pattern":null},"iphone":[{"id":2411977008,"version":null},{"id":2430057923,
Output:
25549177
2411977008
2430057923
Desired Output:
2411977008
2430057923
Please let me know your thoughts on these issues.
You can extend the positive lookbehind assertion so that it also requires a , or [ immediately before {"id": (the lookbehind stays fixed-length, so PCRE accepts it):
grep -Po '(?<=[,\[]\{"id":)[^,]+' regex_test.txt
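One way to check a lookbehind that only accepts IDs preceded by ,{"id": or [{"id": is the following Python sketch. The pattern is an assumption about what the question intends, and it relies on the lookbehind being fixed-width, which Python's re requires:

```python
import re

text = ('reason":{"id":25549177,"pattern":null},'
        '"iphone":[{"id":2411977008,"version":null},{"id":2430057923,')

# The character class admits only "," or "[" before {"id": ,
# so the ID that follows :{"id": is skipped.
ids = re.findall(r'(?<=[,\[]\{"id":)[^,]+', text)
print(ids)  # ['2411977008', '2430057923']
```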
How would you write a regular expression to find the file extension of the following files, keeping in mind that what I am looking for is the ".pdf" or ".xls" portion of the string?
REPORTPDF.20130810.pdf.pgp
REPORTXLS.20130810.xls.pgp
EDIT:
The resulting filenames I want to end up with are the following:
REPORT20130810.PDF
REPORT20130810.XLS
I am on a Windows platform. I've played around with this a bit at http://regexpal.com/ but so far I can only figure out how to match the date:
([0-9]{4}[0-9]{2}[0-9]{2})
Using sed:
sed 's/^\(.*[^.]*\)\.[^.]*$/\1/' <<< "REPORTPDF.20130810.pdf.pgp"
REPORTPDF.20130810.pdf
Using grep -P (PCRE regex):
grep -oP '^.+[^.]+(?=\.[^.]+$)' <<< "REPORTPDF.20130810.pdf.pgp"
REPORTPDF.20130810.pdf
.+\.(\w+)\.\w+$ would deliver the last-but-one extension as group 1; how you access that group depends on the host language for the regex.
If you don't need the file extension to be capitalized, this should work
([a-zA-Z]+)\.([0-9]{4}[0-9]{2}[0-9]{2})\.(xls|pdf)\.pgp
Matches:
REPORTXLS.20130810.xls.pgp
And then the groups you'd use are two and three
REPORT\2.\3
Matches:
REPORT20130810.xls
Problem is that you don't provide much context for how you're going about changing these file names.
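Since the question does not name a language, here is one possible sketch in Python of the whole rename, including the uppercase extension. The function name new_name is hypothetical, and the pattern assumes an 8-digit date and only the xls/pdf extensions:

```python
import re

# letters, dot, 8-digit date, dot, inner extension, then the .pgp suffix
PATTERN = re.compile(r'^[A-Za-z]+\.(\d{8})\.(xls|pdf)\.pgp$', re.IGNORECASE)

def new_name(filename):
    """Return REPORT<date>.<EXT>, or the input unchanged if it does not match."""
    m = PATTERN.match(filename)
    if m is None:
        return filename
    date, ext = m.groups()
    return 'REPORT' + date + '.' + ext.upper()

print(new_name("REPORTPDF.20130810.pdf.pgp"))  # REPORT20130810.PDF
print(new_name("REPORTXLS.20130810.xls.pgp"))  # REPORT20130810.XLS
```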
You don't say what language/library you're using, but this Perl one-liner does the trick:
perl -lpe "s/^([^.]*)(...)\.(\d+)(\.\2)\.pgp/\1\3\4/i; $_=uc"
I think this will work for you :)
^(([A-Z a-z]*)(?:XLS.|PDF.)(\d{8})(.pdf|.xls))
^ starts at the beginning of the string
([A-Z a-z]*) any letters or spaces before
\d any digit 0-9
{8} exactly 8 repetitions of the preceding item (in this case 8 digits)
?: makes a group non-capturing
I wrapped the capture groups into one large one so the thing that you want will be in the first capture group :).
This can probably be replaced
([A-Z a-z]*)
with
(REPORT)
This (.*?(?:\..*)?)(\..*) will hold things like:
'hello.1a.2bb.3' ---> group(1) == 'hello.1a.2bb', group(2) == '.3'
'yep.1' ---> group(1) == 'yep', group(2) == '.1'
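The two cases above can be checked with a short Python sketch, using re.match on the expression exactly as written:

```python
import re

SPLIT_LAST_EXT = re.compile(r'(.*?(?:\..*)?)(\..*)')

results = {}
for name in ('hello.1a.2bb.3', 'yep.1'):
    m = SPLIT_LAST_EXT.match(name)
    # group(1) is everything before the last extension, group(2) the extension
    results[name] = (m.group(1), m.group(2))
    print(name, '->', results[name])
```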
If the format is pretty much fixed you could use
(REPORT)([^.]++)[.]([^.]++)[.]([^.]++)[.](pgp)
and cherry pick replacement based on what you want
I used Java here, but the regex would be the same elsewhere, provided the engine supports possessive quantifiers such as ++:
String a = "REPORTPDF.20130810.pdf.pgp".replaceAll(
"(REPORT)([^.]++)[.]([^.]++)[.]([^.]++)[.](pgp)",
"$1--$2--$3--$4--$5");
String b = "REPORTXLS.20130810.xls.pgp".replaceAll(
"(REPORT)([^.]++)[.]([^.]++)[.]([^.]++)[.](pgp)",
"$1--$2--$3--$4--$5");
System.out.println(a);
System.out.println(b);
REPORT--PDF--20130810--pdf--pgp
REPORT--XLS--20130810--xls--pgp
In your case, use "$1$3.$2" as the replacement:
String b = "REPORTXLS.20130810.xls.pgp".replaceAll(
"(REPORT)([^.]++)[.]([^.]++)[.]([^.]++)[.](pgp)",
"$1$3.$2");
which produces the intended result
REPORT20130810.XLS
The regular expression which I have provided will select the string 72719.
Regular expression:
(?<=bdfg34f;\d{4};)\d{0,9}
Text sample:
vfhnsirf;5234;72159;2;668912;28032009;4;
bdfg34f;8467;72719;7;6637912;05072009;7;
b5g342sirf;234;72119;4;774582;20102009;3;
How can I rewrite the expression to select that string even when the number 8467; is changed to 84677; or 846777; ? Is it possible?
First, when asking a regex question, you should always specify which language you are using.
Assuming that the language you are using does not support variable length lookbehind (and most don't), here is a solution which will work. Your original expression uses a fixed-length lookbehind to match the pattern preceding the value you want. But now this preceding text may be of variable length so you can't use a look behind. This is no problem. Simply match the preceding text normally and capture the portion that you want to keep in a capture group. Here is a tested PHP code snippet which grabs all values from a string, capturing each value into capture group $1:
$re = '/^bdfg34f;\d{4,};(\d{0,9})/m';
if (preg_match_all($re, $text, $matches)) {
$values = $matches[1];
}
The changes are:
Removed the lookbehind group.
Added a start of line anchor and set multi-line mode.
Changed the \d{4} "exactly four" to \d{4,} "four or more".
Added a capture group for the desired value.
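The same approach carries over to other languages. As a sketch in Python (chosen only for illustration, since the question does not name a language), with the sample line changed to 84677 to show the variable-length case:

```python
import re

text = """vfhnsirf;5234;72159;2;668912;28032009;4;
bdfg34f;84677;72719;7;6637912;05072009;7;
b5g342sirf;234;72119;4;774582;20102009;3;"""

# ^ with re.MULTILINE anchors at each line start; \d{4,} allows four or
# more digits in the second field; the capture group holds the wanted value.
values = re.findall(r'^bdfg34f;\d{4,};(\d{0,9})', text, flags=re.MULTILINE)
print(values)  # ['72719']
```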
Here's how I usually describe "fields" in a regex:
[^;]+;[^;]+;([^;]+);
This means "stuff that isn't semi-colon, followed by a semicolon", which describes each field. Do that twice. Then the third time, select it.
You may have to tweak the syntax for whatever language you are doing this regex in.
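As an illustration of the field-by-field pattern, here is a sketch in Python applied to the sample lines from the question. Note that it selects the third field of every line, not only the bdfg34f one:

```python
import re

# two "non-semicolons then semicolon" fields, then capture the third field
FIELD3 = re.compile(r'[^;]+;[^;]+;([^;]+);')

lines = [
    "vfhnsirf;5234;72159;2;668912;28032009;4;",
    "bdfg34f;8467;72719;7;6637912;05072009;7;",
    "b5g342sirf;234;72119;4;774582;20102009;3;",
]

third_fields = [FIELD3.match(line).group(1) for line in lines]
print(third_fields)  # ['72159', '72719', '72119']
```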
Also, if this is just a data file on disk and you are using GNU tools, there's a much easier way to do this:
cat file | cut -d";" -f 3
To match the value when the preceding number has a minimum of 4 digits (this needs an engine that supports variable-length lookbehind, such as .NET):
(?<=bdfg34f;\d{4,};)\d{0,9}
To match it when the preceding number has 1 or more digits:
(?<=bdfg34f;\d+;)\d{0,9}
Or to match it only if the preceding number has between 4 and 6 digits:
(?<=bdfg34f;\d{4,6};)\d{0,9}
This is a simple text parsing problem that probably doesn't mandate the use of regular expressions.
You could take the input line by line and split on ';', e.g. (in PHP; I have no idea what language you're using)
foreach (explode("\n", $string) as $line) {
$bits = explode(";", $line);
echo $bits[2]; // third column (arrays are zero-indexed)
}
If this is indeed in a file and you happen to be using PHP, using fgetcsv would be much better though.
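For readers not using PHP, a comparable sketch with Python's standard csv module is shown below. It is analogous in spirit to fgetcsv, and it reads from an in-memory string here only to keep the example self-contained:

```python
import csv
import io

data = """vfhnsirf;5234;72159;2;668912;28032009;4;
bdfg34f;8467;72719;7;6637912;05072009;7;
b5g342sirf;234;72119;4;774582;20102009;3;"""

# Treat each line as a semicolon-delimited record; index 2 is the third column.
reader = csv.reader(io.StringIO(data), delimiter=';')
third_column = [row[2] for row in reader if len(row) > 2]
print(third_column)  # ['72159', '72719', '72119']
```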
Anyway, context is missing, but the bottom line is I don't think you should be using regular expressions for this.