Regex to ignore commented lines in C++

I'm trying to use regex to find all variable initializations or assignments in code.
Currently I have
(\w+|\w[_])\s*=\s*(\d+\.\d+|.*)
which works but also finds commented out code like
// a = 100; which I don't want it to do. I've tried
([^/]\w+|\w[_])\s*=\s*(\d+\.\d+|.*)
which I thought should ignore strings that start with / but that doesn't work.
Edit:
For example I'd like it to find lines like
b = 200;
but not // c = 3;

Try this and adapt it if necessary:
^(?:(?!\/\/).)*[a-z][a-z0-9\_]*\s*=\s*[0-9]+;
SEE DEMO: http://regex101.com/r/jE4vM0/3

Use this regex and check if the first sub-match is "//", if yes, it is after a comment.
(//)*\s*(\w+|\w[_])\s*=\s*(\d+\.\d+|.*)
For example "var=5;" will get three sub-matches: blank, var, and 5 while "//var=5;" will get //, var, and 5.
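For anyone who wants to experiment outside C++, here is a minimal Python sketch of the same idea as the tempered pattern above: the name must appear before any "//" on its line. The pattern and the helper name find_assignments are my own assumptions, not taken from either answer, and it will not cope with block comments, string literals or == comparisons.

```python
import re

# Hypothetical pattern: from the start of a line, consume characters only while
# no "//" begins at the current position, then capture name and value.
ASSIGNMENT = re.compile(
    r"^(?:(?!//).)*?\b(\w+)\s*=\s*([^;]+);",
    re.MULTILINE,
)

def find_assignments(source: str) -> list[tuple[str, str]]:
    """Return (name, value) for the first uncommented assignment on each line."""
    return [(m.group(1), m.group(2).strip()) for m in ASSIGNMENT.finditer(source)]

sample = "b = 200;\n// c = 3;\n  x = 1.5;\nint n = 7; // m = 8;\n"
print(find_assignments(sample))
```

Because the match is anchored at the start of the line, an assignment that only appears after "//" (like m = 8 in the sample) is never reported.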

Related

Stripping function calls from lines using regex

Let's say I have a line,
$bagel = parser(1);
$potato = parser(3+(other var));
$avocado = parser(3-(var1+var2+var3));
$untouchedtoast = donotremove(4);
I want to print, instead of parser(1), just 1. So I want to strip function calls (matching parser(.) I guess?), but leave the innards untouched. The output would, ideally, be
$bagel = 1;
$potato = 3+(other var);
$avocado = 3-(var1+var2+var3);
$untouchedtoast = donotremove(4);
I tried %s/parser(.)//g, but it only replaced everything to the left of the innards. Tried a few other wildcards, but I think I have to somehow pass a variable from the input regex to the output regex, and I'm not sure if that's possible. If it matters, I'm doing this in vim.
Thoughts?
%s/parser(\(.\+\));/\1;/
This searches for parser(...);, captures everything inside the parentheses with the \(.\+\) group, replaces the entire expression with that group (\1), and adds the semicolon back (because it was consumed by the search expression).
Try this:
search: \w\(|\)(?=;)
replace: blank
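The capture-and-backreference idea is not specific to vim. As a rough sketch, the same substitution in Python could look like this; strip_parser_calls is a name I made up, and it assumes one parser(...) call per line ending in a semicolon, as in the question.

```python
import re

def strip_parser_calls(line: str) -> str:
    # The greedy (.+) runs up to the last ")" that is followed by ";",
    # so nested parentheses inside the call are kept intact.
    return re.sub(r"\bparser\((.+)\);", r"\1;", line)

lines = [
    "$bagel = parser(1);",
    "$potato = parser(3+(other var));",
    "$avocado = parser(3-(var1+var2+var3));",
    "$untouchedtoast = donotremove(4);",
]
for line in lines:
    print(strip_parser_calls(line))
```

The \1 in the replacement plays the same role as \1 in the vim command: it re-inserts whatever the group captured.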

Getting a specific tag and combining if multiple same tags are found together

I want to keep the words with the tag NA. If several such words appear consecutively, I want to combine them into a single entry.
Example:
% if I have
a='[The/D, handle/NA, of/NS, the/NaAq, hair/NA, brush/NA, is/NaAZ broken/A]'
% the output I want:
output={'handle', 'hair brush'}
I tried searching for /NA, but the problem is that I get false positives: "the" and "is".
Currently my code is:
g=split(a(2:end-1));
b= strfind(g,'/NA');
g(~cellfun(@isempty, b))
Any ideas how to proceed? Any one-line regular expression will be very helpful if possible.
Looks like a nice NLP problem. Maybe this gets you started:
a='[The/D, handle/NA, of/NS, the/NaAq, hair/NA, brush/NA, is/NaAZ broken/A]';
output={'handle', 'hair brush'};
expr = '(\S+/NA, )+'; % look for words followed by '/NA, '
match = regexp(a,expr,'match');
output = strtrim(strrep(match,'/NA,','')) % strrep: get rid of tag - strtrim: get rid of trailing blank
Note that this approach will fail if the last word is tagged with /NA. You can catch that case independently though.
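If a single regex is not a hard requirement, one way to sidestep both the trailing-word case and the NaAq/NaAZ look-alikes is to split into (word, tag) pairs first and compare the tag exactly. Below is a sketch in Python rather than MATLAB; the function name na_phrases is hypothetical.

```python
import re
from itertools import groupby

def na_phrases(tagged: str) -> list[str]:
    # Extract (word, tag) pairs; brackets, commas and whitespace are separators.
    pairs = re.findall(r"([^\s,\[\]/]+)/(\w+)", tagged)
    phrases = []
    # Group consecutive pairs by whether the tag is exactly "NA".
    for is_na, run in groupby(pairs, key=lambda pair: pair[1] == "NA"):
        if is_na:
            phrases.append(" ".join(word for word, _tag in run))
    return phrases

a = "[The/D, handle/NA, of/NS, the/NaAq, hair/NA, brush/NA, is/NaAZ broken/A]"
print(na_phrases(a))
```

Since the tag is compared as a whole string, the/NaAq and is/NaAZ are never mistaken for NA, and a final word tagged /NA needs no special handling.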

Multiline regular expressions in R

I need to find instances of a LaTeX \index command in a whole bunch of knitr documents (.Rnw) which have commas in them. These may occur over multiple lines e.g.
\index{prior distribution,choosing beta prior for
$\pi$,vague prior knowledge}
I'm reasonably happy with my R code to find things:
line = paste(readLines(input), collapse = "\n")
r = gregexpr(pattern, line)
if(length(r) > 0){
lapply(regmatches(line, r), function(e){cat(paste(substr(e, 0, 50), "\n"))})
}
However, I can't seem to get the regular expression right. I've tried
pattern = "(\\s)\\\\index\\{.*[,][^}]*\\}"
which gets some but not everything
pattern = "\\\\index\\{[A-Za-z \\s][^}]*\\}"
which gets more, but a lot I don't want. For example it finds
\index{posterior variance!beta distribution}
Any help appreciated.
Often it is easier to use multiple regexes in a row than one regex that gets exactly what you want. In your case:
library(stringr)
t = "\\index{prior distribution,choosing beta prior for
\\$\\pi\\$,vague prior knowledge} bleh
\\index{posterior variance!beta distribution}"
cat(t)
tier_1 = str_match_all(t, "(?s)\\\\index\\{.*?\\}")[[1]] # four backslashes in the R string match one literal backslash
tier_2 = tier_1[str_detect(tier_1, ",")]
The first regex finds all the \index{} stuff, across lines. The second keeps only those that have a comma.
This gets the first, and not the second. You can add more tiers to sort away stuff you don't want like this.
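The two-tier approach carries over to other languages unchanged. Here is a small Python sketch of it (not R); index_entries_with_commas is a name invented for illustration, and it assumes \index arguments contain no nested braces.

```python
import re

def index_entries_with_commas(text: str) -> list[str]:
    # Tier 1: find every \index{...}; [^}]* also matches newlines,
    # so entries that span several lines are found.
    entries = re.findall(r"\\index\{[^}]*\}", text)
    # Tier 2: keep only the entries that contain a comma.
    return [entry for entry in entries if "," in entry]

sample = (
    "\\index{prior distribution,choosing beta prior for\n"
    "$\\pi$,vague prior knowledge} bleh\n"
    "\\index{posterior variance!beta distribution}\n"
)
for entry in index_entries_with_commas(sample):
    print(entry)
```

Using a negated character class instead of a lazy dot avoids needing a "dot matches newline" flag at all.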

Regex Split: Split column into Name, percentage and solvent

Looking for a regex that can split expressions like:
A-6-b 10/%XYZ
into:
A-6-b
10%
/XYZ
Note that the first group can also contain spaces and numbers:
AQDF 100 56%/ABC
and percentage can be a float:
SFSDF 0.1%/ABC
I've come up with (^[A-Z\s\d-]*)(?!%)(\d+%)(.*$) but this does not match any percentages that are floats and, more importantly, even simple examples like ABC 10%/XYZ fail because the first digit of the percentage is assigned to the first capturing group.
Any idea how I can achieve what I want? I'm not a regex expert...
EDIT: fixed errors in example
EDIT2:
The examples are not complete. Here one more:
ABC Dwsd 0.01%/XYZ QST
First part can contain spaces
Last Part can contain spaces
number can be a float
Super simple:
/^(.*) ([0-9]+(?:\.[0-9]+)?%)(.*)$/
The most easily identifiable item is your percentage, so the ([0-9]+(?:\.[0-9]+)?%) part deals with finding that. (It allows a leading zero so that values like 0.1% match.)
Then it's simply a case of getting everything before (excluding the final space) to get the name, and everything after to get the solvent.
Done.
Don't overcomplicate this by using one unreadable regex.
Based on what you've said, your separators are well defined (the last space and the last %). In JavaScript, for example, you could use:
var str = "A-6-b 10/%XYZ";
var firstSeparator = str.lastIndexOf(' ');
var secondSeparator = str.lastIndexOf('%');
var name = str.substring(0, firstSeparator);
var percentage = str.substring(firstSeparator + 1, secondSeparator + 1); // we want to include the % separator in this one
var solvent = str.substring(secondSeparator + 1);
console.log(name, percentage, solvent);
Working JSFiddle: http://jsfiddle.net/rL5uymhm/
(There may be a typo in your question, as your examples differ on where the / symbol appears. So the code may need tweaking. My point still stands – don't use a regex for the sake of it when there is a more readable alternative.)
If you really want to use a regex, /^(.+ )([^%]+%)(.*)$/ should work.
Try this, and let me know in the comments if you have any problems.
((?:(?!\s*[0-9]*\/%).)*)\s*([\d\/%]*)\s*(.*)
SEE DEMO : http://regex101.com/r/lL8oN4/1
This one works for me (using PCRE):
/^(.+) ([0-9.]+)[\/%]+([^\/]+)$/
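To compare the candidates against the sample strings from the question, a quick harness helps. The Python sketch below uses one "anchor on the percentage" pattern of my own choosing (an assumption, not any answerer's exact regex) and a hypothetical helper split_entry; it expects the number%/solvent layout used in most of the examples.

```python
import re

# Greedy name, then the last space, then an integer or float followed by "%".
SPLIT = re.compile(r"^(.*) (\d+(?:\.\d+)?%)(.*)$")

def split_entry(entry: str):
    """Return (name, percentage, solvent), or None if the layout is different."""
    match = SPLIT.match(entry)
    return match.groups() if match else None

for entry in ["AQDF 100 56%/ABC", "SFSDF 0.1%/ABC", "ABC Dwsd 0.01%/XYZ QST"]:
    print(split_entry(entry))
```

The string A-6-b 10/%XYZ, where the slash comes before the percent sign, returns None with this pattern, which is consistent with the suspicion above that it contains a typo.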

Match overlapping patterns with capture using a MATLAB regular expression

I'm trying to parse a log file that looks like this:
%%%% 09-May-2009 04:10:29
% Starting foo
this is stuff
to ignore
%%%% 09-May-2009 04:10:50
% Starting bar
more stuff
to ignore
%%%% 09-May-2009 04:11:29
...
This excerpt contains two time periods I'd like to extract, from the first delimiter to the second, and from the second to the third. I'd like to use a regular expression to extract the start and stop times for each of these intervals. This mostly works:
p = '%{4} (?<start>.*?)\n% Starting (?<name>.*?)\n.*?%{4} (?<stop>.*?)\n';
times = regexp(c,p,'names');
Returning:
times =
1x16 struct array with fields:
start
name
stop
The problem is that this only captures every other period, since the second delimiter is consumed as part of the first match.
In other languages, you can use lookaround operators (lookahead, lookbehind) to solve this problem. The documentation on regular expressions explains how these work in MATLAB, but I haven't been able to get these to work while still capturing the matches. That is, I not only need to be able to match every delimiter, but also I need to extract part of that match (the timestamp).
Is this possible?
P.S. I realize I can solve this problem by writing a simple state machine or by matching on the delimiters and post-processing, if there's no way to get this to work.
Update: Thanks for the workaround ideas, everyone. I heard from the developer and there's currently no way to do this with the regular expression engine in MATLAB.
MATLAB seems unable to capture characters as a token without removing them from the string (or, I should say, I was unable to do so using MATLAB REGEXP). However, by noting that the stop time for one block of text is equal to the start time of the next, I was able to capture just the start times and the names using REGEXP, then do some simple processing to get the stop times from the start times. I used the following sample text:
c =
%%%% 09-May-2009 04:10:29
% Starting foo
this is stuff
to ignore
%%%% 09-May-2009 04:10:50
% Starting bar
more stuff
to ignore
%%%% 09-May-2009 04:11:29
some more junk
...and applied the following expression:
p = '%{4} (?<start>[^\n]*)\n% Starting (?<name>[^\n]*)[^%]*|%{4} (?<start>[^\n]*).*';
The processing can then be done with the following code:
names = regexp(c,p,'names');
[names.stop] = deal(names(2:end).start,[]);
names = names(1:end-1);
...which gives us these results for the above sample text:
>> names(1)
ans =
start: '09-May-2009 04:10:29'
name: 'foo'
stop: '09-May-2009 04:10:50'
>> names(2)
ans =
start: '09-May-2009 04:10:50'
name: 'bar'
stop: '09-May-2009 04:11:29'
If you are doing a lot of parsing and such work, you might consider using Perl from within Matlab. It gives you access to the powerful regex engine of Perl and might also make many other problems easier to solve.
All you should have to do is to wrap a lookahead around the part of the regex that matches the second timestamp:
'%{4} (?<start>.*?)\n% Starting (?<name>.*?)\n.*?(?=%{4} (?<stop>.*?)\n)'
EDIT: Here it is without named groups:
'%{4} (.*?)\n% Starting (.*?)\n.*?(?=%{4} (.*?)\n)'
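For comparison, here is how the lookahead-with-capture idea behaves in an engine that does keep groups captured inside a lookahead. This is a Python sketch only; it says nothing about MATLAB's engine, which according to the update in the question cannot do this. The names LOG, PERIOD and periods are mine.

```python
import re

LOG = (
    "%%%% 09-May-2009 04:10:29\n"
    "% Starting foo\n"
    "this is stuff\n"
    "to ignore\n"
    "%%%% 09-May-2009 04:10:50\n"
    "% Starting bar\n"
    "more stuff\n"
    "to ignore\n"
    "%%%% 09-May-2009 04:11:29\n"
    "some more junk\n"
)

PERIOD = re.compile(
    r"%{4} (?P<start>[^\n]*)\n"        # opening delimiter with the start time
    r"% Starting (?P<name>[^\n]*)\n"   # name of the period
    r"[\s\S]*?"                        # body to ignore, as short as possible
    r"(?=%{4} (?P<stop>[^\n]*)\n)"     # next delimiter: captured but not consumed
)

periods = [m.groupdict() for m in PERIOD.finditer(LOG)]
for period in periods:
    print(period)
```

Because the lookahead consumes nothing, the next search begins at the delimiter that supplied the previous stop time, so every period is found rather than every other one.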