Multiline regular expressions in R - regex

I need to find instances of a LaTeX \index command in a whole bunch of knitr documents (.Rnw) which have commas in them. These may occur over multiple lines e.g.
\index{prior distribution,choosing beta prior for
$\pi$,vague prior knowledge}
I'm reasonably happy with my R code to find things:
line = paste(readLines(input), collapse = "\n")
r = gregexpr(pattern, line)
if(length(r) > 0){
lapply(regmatches(line, r), function(e){cat(paste(substr(e, 0, 50), "\n"))})
}
However, I can't seem to get the regular expression right. I've tried
pattern = "(\\s)\\\\index\\{.*[,][^}]*\\}"
which gets some but not everything
pattern = "\\\\index\\{[A-Za-z \\s][^}]*\\}"
which gets more, but a lot I don't want. For example it finds
\index{posterior variance!beta distribution}
Any help appreciated.

Often it is easier to chain several simple regexes than to write one regex that gets exactly what you want. In your case:
library(stringr)
t = "\\index{prior distribution,choosing beta prior for
$\\pi$,vague prior knowledge} bleh
\\index{posterior variance!beta distribution}"
cat(t)
tier_1 = str_match_all(t, "(?s)\\\\index\\{.*?\\}")[[1]]
tier_2 = tier_1[str_detect(tier_1, ",")]
The first regex finds all the \index{} stuff, across lines. The second keeps only those that have a comma.
On the sample text this keeps the first \index entry and drops the second. You can add more tiers like this to filter out other things you don't want.
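For comparison, the same two-tier idea can be sketched in Python (this is not the R/stringr code above). The sample text and the variable names are assumptions made for illustration, and the [^}]* pattern assumes there are no nested braces inside \index{...}.

```python
import re

# Sample text modelled on the question; "bleh" is filler between entries.
text = (
    "\\index{prior distribution,choosing beta prior for\n"
    "$\\pi$,vague prior knowledge} bleh\n"
    "\\index{posterior variance!beta distribution}"
)

# Tier 1: every \index{...} entry. [^}] also matches newlines,
# so entries that span several lines are found without any flag.
entries = re.findall(r"\\index\{[^}]*\}", text)

# Tier 2: keep only the entries that contain a comma.
with_commas = [entry for entry in entries if "," in entry]

for entry in with_commas:
    print(entry[:50])
```

Keeping the comma test out of the regex means each tier stays easy to read and to change.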

Related

Cannot explain what I thought was a mildly simple regex... (positive-lookahead - positive-lookbehind)

Having a bit of trouble as to the result of the following regex, matching the following text. I usually use regex101.com to give me a hand in quickly seeing output of regular expressions, but this time around Regex101 and Python are not agreeing on the same output.... and I cannot figure out why.
Sample text I'm parsing:
82% (37)\n31% (14)\n(missing = 8)\n76% (34)\n33% (15)\n(missing = 13)\n84% (38)\n53% (24)\n(missing = 7)\n18% (8)\n13% (6)\n(missing = 37)\n16% (7)\n13% (6)\n(missing = 39)
I'm just doing a split at (missing = \d{1,2}), while keeping the pattern in the result. So far I've tried a couple of different look-behinds, along with look-aheads.
Patterns tried:
.+?(?=\= ..)
.+?(?<=\= ..)
(?=\d{2}\%).+?(?<=\= ..)
As requested, link to what I'm currently trying: https://regex101.com/r/JgaZ1n/1
https://regex101.com/r/JgaZ1n/2
reg=re.compile(r'.*?(missing = \d+)')
reg.findall('82% (37)\n31% (14)\n(missing = 8)\n76% (34)\n33%...etc...')
I expect re.findall() to return 82% (37)\n31% (14)\n(missing = 8) first, and so on for the other chunks, but I typically get either nothing or just what is contained within (missing = \d). Any insight as to why?
Thanks for reading.
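One likely cause, sketched here under the assumption that the text contains real newline characters: when a pattern has exactly one capture group, re.findall() returns only what that group matched, and . does not match a newline unless re.DOTALL is set. The shortened sample text and the variable names below are made up for illustration.

```python
import re

text = (
    "82% (37)\n31% (14)\n(missing = 8)\n"
    "76% (34)\n33% (15)\n(missing = 13)\n"
    "84% (38)\n53% (24)\n(missing = 7)"
)

# With one capture group, findall returns only the group's text.
only_groups = re.findall(r".*?(missing = \d+)", text)

# No capture group, so findall returns whole matches; DOTALL lets
# the lazy .*? run across the newlines inside each chunk.
chunk_re = re.compile(r".*?\(missing = \d{1,2}\)", re.DOTALL)
chunks = [chunk.strip() for chunk in chunk_re.findall(text)]

print(only_groups)
print(chunks)
```

The strip() call removes the newline that separates one chunk from the next.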

Getting a specific tag and combining if multiple same tags are found together

I want to keep the words with the tag NA. If more than one such words come together, I want to combine them into a one word.
Example:
%if i have
a='[The/D, handle/NA, of/NS, the/NaAq, hair/NA, brush/NA, is/NaAZ broken/A]'
% the output I want:
output={'handle', 'hair brush'}
I tried searching for /NA, but that gives false positives: "the" and "is" also match, because their tags (/NaAq and /NaAZ) contain "NA" when case is ignored.
Currently my code is:
g=split(a(2:end-1));
b= strfind(g,'/NA');
g(~cellfun(@isempty, b))
Any ideas how to proceed? Any one-line regular expression will be very helpful if possible.
Looks like a nice NLP problem. Maybe this gets you started:
a='[The/D, handle/NA, of/NS, the/NaAq, hair/NA, brush/NA, is/NaAZ broken/A]';
output={'handle', 'hair brush'};
expr = '(\S+/NA, )+'; % look for words followed by '/NA, '
match = regexp(a,expr,'match');
output = strtrim(strrep(match,'/NA,','')) % strrep: get rid of tag - strtrim: get rid of trailing blank
Note that this approach will fail if the last word is tagged with /NA. You can catch that case independently though.
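As an alternative that does not depend on a trailing comma, the grouping can be done on tokens instead of with a single regex. This is a Python sketch rather than MATLAB, and the variable names are assumptions; it compares the tag exactly with 'NA', so /NaAq and /NaAZ are not picked up.

```python
from itertools import groupby

a = '[The/D, handle/NA, of/NS, the/NaAq, hair/NA, brush/NA, is/NaAZ broken/A]'

# Drop the brackets and commas, then split into word/TAG tokens.
tokens = a.strip('[]').replace(',', ' ').split()
pairs = [token.rsplit('/', 1) for token in tokens]

# Group consecutive tokens by whether their tag is exactly 'NA'
# and join each run of NA-tagged words into one entry.
output = [
    ' '.join(word for word, _ in run)
    for is_na, run in groupby(pairs, key=lambda pair: pair[1] == 'NA')
    if is_na
]

print(output)
```

Because the runs are built from tokens, a final word tagged /NA is handled the same way as any other.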

Regex to ignore commented lines C++

I'm trying to use regex to find all variable initializations or assignments in code.
Currently I have
(\w+|\w[_])\s*=\s*(\d+\.\d+|.*)
which works but also finds commented out code like
// a = 100;
which I don't want it to do. I've tried
([^/]\w+|\w[_])\s*=\s*(\d+\.\d+|.*)
which I thought should ignore strings that start with / but that doesn't work.
Edit:
For example I'd like it to find lines like
b = 200;
but not // c = 3;
Try this, and adapt it if necessary.
^(?:(?!\/\/).)*[a-z][a-z0-9\_]*\s*=\s*[0-9]+;
SEE DEMO: http://regex101.com/r/jE4vM0/3
Use this regex and check whether the first sub-match is "//"; if it is, the assignment is inside a comment.
(//)*\s*(\w+|\w[_])\s*=\s*(\d+\.\d+|.*)
For example "var=5;" will get three sub-matches: blank, var, and 5 while "//var=5;" will get //, var, and 5.
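As a rough illustration of the negative-lookahead idea, here is a Python sketch (Python's re, not a C++ regex library). The pattern, the sample code and the names are assumptions; it only knows about // comments and simple "name = value;" statements, so block comments, string literals and == comparisons are not handled.

```python
import re

# (?:(?!//).)*? consumes characters only while no "//" starts at the
# current position, so the assignment must appear before any comment.
ASSIGNMENT = re.compile(
    r"^(?:(?!//).)*?\b([A-Za-z_]\w*)\s*=\s*([^;]+);",
    re.MULTILINE,
)

code = "b = 200;\n// c = 3;\n  d = 1.5; // e = 2;\n"

found = [(m.group(1), m.group(2)) for m in ASSIGNMENT.finditer(code)]
print(found)
```

Each match is anchored at the start of a line, so at most one assignment per line is reported.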

Notepad++ RegEx group capture syntax

I have a list of label names in a text file I'd like to manipulate using Find and Replace in Notepad++, they are listed as follows:
MyLabel_01
MyLabel_02
MyLabel_03
MyLabel_04
MyLabel_05
MyLabel_06
I want to rename them in Notepad++ to the following:
Label_A_One
Label_A_Two
Label_A_Three
Label_B_One
Label_B_Two
Label_B_Three
The Regex I'm using in the Notepad++'s replace dialog to capture the label name is the following:
((MyLabel_0)((1)|(2)|(3)|(4)|(5)|(6)))
I want to replace each capture group as follows:
\1 = Label_
\2 = A_One
\3 = A_Two
\4 = A_Three
\5 = B_One
\6 = B_Two
\7 = B_Three
My problem is that Notepad++ doesn't register the syntax of the regex above. When I hit Count in the Replace dialog, it returns 0 occurrences. I'm not sure what's missing in the syntax. And yes, I made sure the Regular Expression radio button is selected. Help is appreciated.
UPDATE:
Tried escaping the parentheses, still didn't work:
\(\(MyLabel_0\)\((1\)|\(2\)|\(3\)|\(4\)|\(5\)|\(6\)\)\)
Ed's response shows a working pattern, since alternation isn't supported in Notepad++. However, the rest of your problem can't be handled by regex alone: what you're trying to do isn't possible with a regex find/replace approach. Your desired result involves logical conditions which can't be expressed in regex. All you can do with the replace method is rearrange items and refer to the captured items; you can't tell it to use "A" for values 1-3 and "B" for 4-6. Furthermore, you can't assign placeholders like that: \1, \2 and so on are capture groups that you are backreferencing.
To reach the results you've shown you would need to write a small program that would allow you to check the captured values and perform the appropriate replacements.
EDIT: here's an example of how to achieve this in C#
var numToWordMap = new Dictionary<int, string>();
numToWordMap[1] = "A_One";
numToWordMap[2] = "A_Two";
numToWordMap[3] = "A_Three";
numToWordMap[4] = "B_One";
numToWordMap[5] = "B_Two";
numToWordMap[6] = "B_Three";
string pattern = @"\bMyLabel_(\d+)\b";
string filePath = @"C:\temp.txt";
string[] contents = File.ReadAllLines(filePath);
for (int i = 0; i < contents.Length; i++)
{
contents[i] = Regex.Replace(contents[i], pattern,
m =>
{
int num = int.Parse(m.Groups[1].Value);
if (numToWordMap.ContainsKey(num))
{
return "Label_" + numToWordMap[num];
}
// key not found, use original value
return m.Value;
});
}
File.WriteAllLines(filePath, contents);
You should be able to use this easily. Perhaps you can download LINQPad or Visual C# Express to do so.
If your files are too large this might be an inefficient approach, in which case you could use a StreamReader and StreamWriter to read from the original file and write it to another, respectively.
Also be aware that my sample code writes back to the original file. For testing purposes you can change that path to another file so it isn't overwritten.
Bar bar bar - Notepad++ thinks you're a barbarian.
(obsolete - see update below.) No vertical bars in Notepad++ regex - sorry. I forget every few months, too!
Use [123456] instead.
Update: Sorry, I didn't read carefully enough; on top of the barhopping problem, @Ahmad's spot-on - you can't do a mapping replacement like that.
Update: Version 6 of Notepad++ changed the regular expression engine to a Perl-compatible one, which supports "|". AFAICT, if you have a 5.x version, auto-update won't move you to 6.x; you have to explicitly download it.
A regular expression search and replace for
MyLabel_((01)|(02)|(03)|(04)|(05)|(06))
with
Label_(?2A_One)(?3A_Two)(?4A_Three)(?5B_One)(?6B_Two)(?7B_Three)
works on Notepad++ 6.3.2
The outermost pair of brackets is for grouping, they limit the scope of the first alternation; not sure whether they could be omitted but including them makes the scope clear. The pattern searches for a fixed string followed by one of the two-digit pairs. (The leading zero could be factored out and placed in the fixed string.) Each digit pair is wrapped in round brackets so it is captured.
In the replacement expression, the clause (?4A_Three) says that if capture group 4 matched something then insert the text A_Three, otherwise insert nothing. Similarly for the other clauses. As the 6 alternatives are mutually exclusive only one will match. Thus only one of the (?...) clauses will have matched and so only one will insert text.
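Outside Notepad++, the same one-of-six mapping can be written with a replacement callback. This is a Python sketch; the dictionary and the rename helper are hypothetical names, and labels whose number is not in the mapping are left unchanged.

```python
import re

# Hypothetical mapping from label number to the new suffix.
NUM_TO_WORD = {
    1: "A_One", 2: "A_Two", 3: "A_Three",
    4: "B_One", 5: "B_Two", 6: "B_Three",
}

def rename(match):
    """Return the new label, or the original text if the number is unknown."""
    suffix = NUM_TO_WORD.get(int(match.group(1)))
    return "Label_" + suffix if suffix else match.group(0)

text = "MyLabel_01\nMyLabel_04\nMyLabel_09\n"
renamed = re.sub(r"\bMyLabel_(\d+)\b", rename, text)
print(renamed)
```

The callback plays the role of the (?N...) conditionals: exactly one suffix is chosen per match.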
The easiest way to do this that I would recommend is to use AWK. If you're on Windows, look for the mingw32 precompiled binaries out there for free download (it'll be called gawk).
BEGIN {
FS = "_0";
a[1]="A_One";
a[2]="A_Two";
a[3]="A_Three";
a[4]="B_One";
a[5]="B_Two";
a[6]="B_Three";
}
{
printf("Label_%s\n", a[$2]);
}
Execute on Windows as follows:
C:\Users\Mydir>gawk -f test.awk awk.in
Label_A_One
Label_A_Two
Label_A_Three
Label_B_One
Label_B_Two
Label_B_Three

Match overlapping patterns with capture using a MATLAB regular expression

I'm trying to parse a log file that looks like this:
%%%% 09-May-2009 04:10:29
% Starting foo
this is stuff
to ignore
%%%% 09-May-2009 04:10:50
% Starting bar
more stuff
to ignore
%%%% 09-May-2009 04:11:29
...
This excerpt contains two time periods I'd like to extract, from the first delimiter to the second, and from the second to the third. I'd like to use a regular expression to extract the start and stop times for each of these intervals. This mostly works:
p = '%{4} (?<start>.*?)\n% Starting (?<name>.*?)\n.*?%{4} (?<stop>.*?)\n';
times = regexp(c,p,'names');
Returning:
times =
1x16 struct array with fields:
start
name
stop
The problem is that this only captures every other period, since the second delimiter is consumed as part of the first match.
In other languages, you can use lookaround operators (lookahead, lookbehind) to solve this problem. The documentation on regular expressions explains how these work in MATLAB, but I haven't been able to get these to work while still capturing the matches. That is, I not only need to be able to match every delimiter, but also I need to extract part of that match (the timestamp).
Is this possible?
P.S. I realize I can solve this problem by writing a simple state machine or by matching on the delimiters and post-processing, if there's no way to get this to work.
Update: Thanks for the workaround ideas, everyone. I heard from the developer and there's currently no way to do this with the regular expression engine in MATLAB.
MATLAB seems unable to capture characters as a token without removing them from the string (or, I should say, I was unable to do so using MATLAB REGEXP). However, by noting that the stop time for one block of text is equal to the start time of the next, I was able to capture just the start times and the names using REGEXP, then do some simple processing to get the stop times from the start times. I used the following sample text:
c =
%%%% 09-May-2009 04:10:29
% Starting foo
this is stuff
to ignore
%%%% 09-May-2009 04:10:50
% Starting bar
more stuff
to ignore
%%%% 09-May-2009 04:11:29
some more junk
...and applied the following expression:
p = '%{4} (?<start>[^\n]*)\n% Starting (?<name>[^\n]*)[^%]*|%{4} (?<start>[^\n]*).*';
The processing can then be done with the following code:
names = regexp(c,p,'names');
[names.stop] = deal(names(2:end).start,[]);
names = names(1:end-1);
...which gives us these results for the above sample text:
>> names(1)
ans =
start: '09-May-2009 04:10:29'
name: 'foo'
stop: '09-May-2009 04:10:50'
>> names(2)
ans =
start: '09-May-2009 04:10:50'
name: 'bar'
stop: '09-May-2009 04:11:29'
If you are doing a lot of parsing and such work, you might consider using Perl from within Matlab. It gives you access to the powerful regex engine of Perl and might also make many other problems easier to solve.
All you should have to do is to wrap a lookahead around the part of the regex that matches the second timestamp:
'%{4} (?<start>.*?)\n% Starting (?<name>.*?)\n.*?(?=%{4} (?<stop>.*?)\n)'
EDIT: Here it is without named groups:
'%{4} (.*?)\n% Starting (.*?)\n.*?(?=%{4} (.*?)\n)'
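To show what the lookahead version is meant to do, here is the same pattern run in Python's re module, which does keep captures made inside a lookahead. This says nothing about MATLAB's engine; the log text is a shortened copy of the sample and the variable names are assumptions.

```python
import re

log = (
    "%%%% 09-May-2009 04:10:29\n"
    "% Starting foo\n"
    "this is stuff\nto ignore\n"
    "%%%% 09-May-2009 04:10:50\n"
    "% Starting bar\n"
    "more stuff\nto ignore\n"
    "%%%% 09-May-2009 04:11:29\n"
)

# The stop timestamp is captured inside a lookahead, so its delimiter
# is not consumed and can start the next match.
interval = re.compile(
    r"%{4} (?P<start>.*?)\n% Starting (?P<name>.*?)\n.*?"
    r"(?=%{4} (?P<stop>.*?)\n)",
    re.DOTALL,
)

intervals = [m.groupdict() for m in interval.finditer(log)]
for item in intervals:
    print(item["start"], item["name"], item["stop"])
```

Each delimiter line is read twice: once as the stop of one interval and once as the start of the next.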