Lua pattern matching for extracting hard coded strings in code base - c++

I'm working with a C++ code base. Right now I'm using C++ code that calls a Lua script to look through the entire code base and, hopefully, return a list of all of the strings used in the program.
The strings in question are always wrapped in a JUCE macro called TRANS. Here are some examples from which a string should be extracted:
TRANS("Normal")
TRANS ( "With spaces" )
TRANS("")
TRANS("multiple"" ""quotations")
TRANS(")")
TRANS("spans \
multiple \
lines")
And I'm sure you can imagine some other possible string variants that could occur in a large code base. I'm making a tool that generates JUCE translation-formatted files, to automate the process as much as possible.
I've gotten this far with pattern matching to find these strings. I've read the source code into a Lua string
path = ...
--Open file and read source into string
file = io.open(path, "r")
str = file:read("*all")
and called
for word in string.gmatch(str, 'TRANS%s*%b()') do print(word) end
which finds a pattern that starts with TRANS and is followed by balanced parentheses. This gets me the full macro, including the brackets, but from there I figured it would be pretty easy to split off the fat I don't need and just keep the actual string value.
However this doesn't work for strings which cause a parenthesis imbalance.
e.g. TRANS(")") will return TRANS(") instead of TRANS(")")
I revised my pattern to
for word in string.gmatch(str, 'TRANS%s*(%s*%b""%s*') do print(word) end
where the pattern should start with TRANS, then zero or more spaces. Then it should have a ( character followed by zero or more spaces. Now that we are inside the brackets, we should have a balanced pair of "" marks, followed by another zero or more spaces, and finally a closing ). Unfortunately, this does not return a single value when used. But I think that even if it worked as I expected, there can be a \" inside the string, which would break the quote balance.
Any advice on extracting these strings? Should I continue to try and find a pattern matching sequence? or should I try a direct algorithm... Do you know why my second pattern returned no strings? Any other advice! I'm not looking to cover 100% of all possibilities, but being close to 100% would be awesome. Thanks! :D

I love Lua patterns as much as anyone, but you're bringing a knife to a gun fight. This is one of those problems where you really don't want to code the solution as regular expressions. To deal correctly with doublequote marks and backslash escapes, you want a real parser, and LPEG will manage your needs nicely.
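As a rough illustration of what "deal correctly with doublequote marks and backslash escapes" involves, here is a minimal sketch in Python rather than Lua or LPeg. The names extract_trans_strings, LITERAL and TRANS_CALL are made up for this example. It assumes that TRANS takes only adjacent string literals as its argument, and it does not skip comments or TRANS calls that themselves sit inside other string literals.

```python
import re

# One C string literal: an opening quote, then any run of ordinary
# characters or backslash escapes (including backslash-newline), then
# the closing quote.
LITERAL = r'"(?:[^"\\]|\\.)*"'

# TRANS, optional whitespace, "(", one or more adjacent literals, ")".
TRANS_CALL = re.compile(r'\bTRANS\s*\(\s*((?:' + LITERAL + r'\s*)+)\)', re.DOTALL)
LITERAL_RE = re.compile(LITERAL, re.DOTALL)


def extract_trans_strings(source):
    """Return the text of every TRANS("...") argument found in source."""
    results = []
    for call in TRANS_CALL.finditer(source):
        parts = []
        for literal in LITERAL_RE.finditer(call.group(1)):
            body = literal.group(0)[1:-1]      # strip the surrounding quotes
            body = body.replace("\\\n", "")    # drop backslash-newline continuations
            parts.append(body)
        # Adjacent literals are concatenated, as the C++ compiler would do.
        results.append("".join(parts))
    return results


if __name__ == "__main__":
    sample = 'TRANS("Normal") TRANS ( "With spaces" ) TRANS(")")'
    for text in extract_trans_strings(sample):
        print(text)
```

Other escape sequences such as \" are left exactly as written in the source; decoding them would be a separate step.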

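In the second case, you forgot to escape the parentheses. In a Lua pattern an unescaped ( opens a capture, so a literal parenthesis has to be written as %( or %). Try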
for word in string.gmatch(str, 'TRANS%s*%(%s*(%b"")%s*%)') do print(word) end

Related

Regular Expression misses matches in string

I'm trying to write a regular expression that captures desired strings between strings
("f38 ","f38 ","f1 ", "..") and ("\par","\hich","{","}","","..") from a decompiled DOC file and append each match to an array to eventually be printed out into a new file.
I'm having an issue with catching certain strings between "f38 " and "\hich" (usually when the string spans multiple lines but there is at least 1 exception to this I've found in the example string snippet of the DOC file I'm using on regex101.com)
Here is the regular expression as I have it now
(?<=f38 |f38 | |f1 |\.\.)\w.+(?=\\par|\\cell |\\hich|{|}|\\|\.\.)
The troublesome matches come out including "\hich". Like "e\hich" and "d\hich" and I want to match "e" and "d" respectively in these examples not the \hich portion. I'm thinking the problem is with handling the newline/line-breaks somehow.
Here is a smaller snippet of the input string, I have bolded what is matched and bolded + capitalized the problematic match. From this I want the "e" not the \hich. Note that above there are 2 examples of things going right and "\hich" is not included in the match.
l\hich\af38\dbch\af31505\loch\f38 ..ikely to involve asbestos exposure: removal, encapsulation, alteration, repair, maintenance, insulation, spill/emergency clean-up, transportation, disposal and storage of ACM. The general industry standards cover all other operations where exposure to asb..\hich\af38\dbch\af31505\loch\f38 E\HICH\af38\dbch\af31505\loch\f38 stos is possible
Here is an example with a longer portion of the input string at regex101.com
Any help would be appreciated. Thanks!
The problem is in the part of the pattern meant to match those single-character samples. \w.+ requires at least two characters to match. So when you get "e\hich", the first backslash is matched by the dot, and the match runs on until the next backslash (which is one of the "terminators" listed in the positive lookahead portion of the regex).
You might want to use * instead of +.
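A minimal Python sketch of this point, on a shortened, made-up sample and with a reduced set of terminators. It shows that \w.+ cannot stop after a single character. One assumption goes beyond the answer: the sketch uses the lazy form .*? rather than a plain *, because on this sample a greedy quantifier would still run on to the later backslash.

```python
import re

# Shortened stand-ins for the lookbehind and lookahead of the original regex.
LOOKBEHIND = r"(?<=f38 )"
LOOKAHEAD = r"(?=\\par|\\hich|\{|\}|\\)"

# Made-up sample: a single character "e" between "f38 " and "\hich".
sample = r"\loch\f38 e\hich\af38"

# \w.+ needs at least two characters, so it swallows "\hich" as well.
plus_match = re.search(LOOKBEHIND + r"\w.+" + LOOKAHEAD, sample).group()

# \w.*? may stop right after the first character.
lazy_match = re.search(LOOKBEHIND + r"\w.*?" + LOOKAHEAD, sample).group()

print(plus_match)
print(lazy_match)
```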

Extracting substring in linux using expr and regex

So I have just begun learning regular expressions. I have to extract a substring within a large string.
My string is basically one huge line containing a lot of stuff. I have identified the pattern based on which I need to extract. I need the number in a line like this: A lot of stuff<li>65,435 views</li>a lot of stuff. The number is just an example.
This entire string is in fact one big line and my file views.txt contains a lot of such lines.
So I tried this,
while read p
do
y=`expr "$p": ".*<li>\(.*\) views "`
echo $y
done < views.txt
I wished to iterate over all such lines within this views.txt file and print out the numbers.
And I get a syntax error. I really have no idea what is going wrong here. I believe that I have correctly flanked the number by <li> and views including the spaces.
My (limited) interpretation of the above regex leads me to believe that it would output the number.
Any help is appreciated.
The syntax error is because the ":" is not separated from "$p" by a space (or tab). With that fixed, the regex has a trailing blank which will prevent it matching. Fixing those two problems, your sample script works as intended.
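The effect of the trailing blank can be seen in a small sketch. This is Python rather than expr, so it is an analogy under stated assumptions, not the shell fix itself: re.match is used because, like the ":" operator of expr, it anchors the pattern at the start of the string, and the helper name views is made up for the example.

```python
import re

line = "A lot of stuff<li>65,435 views</li>a lot of stuff"


def views(pattern, text):
    # re.match anchors at the start of the string, as expr's ":" operator does.
    m = re.match(pattern, text)
    return m.group(1) if m else None


# Trailing blank after "views": the text continues with "</li>", so no match.
with_blank = views(r".*<li>(.*) views ", line)

# Without the trailing blank the number is captured.
without_blank = views(r".*<li>(.*) views", line)

print(with_blank)
print(without_blank)
```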

Regular expression trouble

Hey guys - I'm tearing my hair out trying to create a regular expression to match something like:
{TextOrNumber{MoreTextOrNumber}}
Note the matching number of open/close {}. Is this even possible?
Many thanks.
Note the matching number of open/close {}. Is this even possible?
Historically, no. However, modern regular expressions aren’t actually regular and some allow such constructs:
\{TextOrNumber(?R)?\}
(?R) recursively inserts the pattern again. Notice that not many regex engines support that (yet).
If you need to handle an arbitrary number of braces, you can use a parser generator, or apply a regex inside a recursive function. The following Ruby example takes the second approach.
def parse(s)
  if s =~ /^\{([A-Za-z0-9]*)({.*})?\}$/ then
    puts $1
    parse($2)
  end
end
parse("{foo{bar{baz}}}")
This is not possible with 1 regex if you don't have a recursive extension available. You'll have to match a regex like the following one multiple times
/\{[a-z0-9]+([a-z0-9\{\}]+)?\}/i
capture the "MoreTextOrNumber" and let it match again until you are through or it fails.
Not easy but possible
Officially, regular expressions are not designed for parsing nested paired brackets, and if you try to do this, you run into all sorts of problems. There are other tools (like parser generators, e.g. yacc or bison) that are designed for such structures and can handle them well. But it can be done, and if you do it right it may even be simpler than a yacc grammar with all the support code needed to work around the problems of yacc.
Here are some hints:
First of all, my suggestions work best if you have some characters that will never appear in the input. Often, characters like \01 and \02 should never appear, so you can do
s/[\01\02]/ /g;
to make sure they are not there. Otherwise, you may want to escape them (e.g. convert them to text like %0 and %1) with an expression like
s/([\01\02%])/"%".ord($1)/ge;
Notice, that I also escaped the escape character "%".
Now, I suggest parsing brackets from the inside out: replace any substring "{ text }" where "text" does not contain any brackets with a placeholder "\01$number\02" and store the enclosed text in $array[$number]:
$number=1;
while (s/\{([^{}]*)\}/"\01$number\02"/e) { $array[$number]=$1; $number++; }
$array[0]=$_; # $array[0] corresponds to your input
As a final step, you may want to process each element in @array to pull out and process the "\01$number\02" markers. This is easy because they are no longer nested.
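The inside-out idea can be sketched in Python as well. This is a loose translation of the Perl loop above, not the author's own code: the function name parse_inside_out is made up, and it takes the simpler route of blanking out any \x01 and \x02 bytes in the input rather than escaping them.

```python
import re


def parse_inside_out(text):
    """Replace innermost {...} groups by \\x01N\\x02 markers, innermost first.

    Returns a list where element 0 is the remaining outer text and
    element N is the content of the group that marker N stands for.
    """
    # Make sure the marker bytes cannot already occur in the input.
    text = re.sub(r"[\x01\x02]", " ", text)
    innermost = re.compile(r"\{([^{}]*)\}")
    groups = [None]  # slot 0 is reserved for the outermost text
    while True:
        m = innermost.search(text)
        if m is None:
            break
        groups.append(m.group(1))
        marker = "\x01%d\x02" % (len(groups) - 1)
        text = text[:m.start()] + marker + text[m.end():]
    groups[0] = text
    return groups


if __name__ == "__main__":
    for index, content in enumerate(parse_inside_out("{foo{bar{baz}}}")):
        print(index, repr(content))
```

Each stored element contains at most flat markers pointing at earlier elements, so a second pass can resolve them without dealing with nesting.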
I happily use this idea in a few parsers (including separating matching bracket types like "(){}[]" etc).
But before you go down this road, make sure to have used regular expressions in simpler applications: You will run into many small problems and you need experience to resolve them (rather than turning one small problem into two small problems etc.).

Using an asterisk in a RegExp to extract data that is enclosed by a certain pattern

I have a text that consists of information enclosed by a certain pattern.
The only thing I know is the pattern: "${template.start}" and "${template.end}"
To keep it simple I will substitute ${template.start} and ${template.end} with "a" in the example.
So one entry in the text would be:
aINFORMATIONHEREa
I do not know how many of these entries are concatenated in the text. So the following is correct too:
aFOOOOOOaaASDADaaASDSDADa
I want to write a regular expression to extract the information enclosed by the "a"s.
My first attempt was to do:
a(.*)a
which works as long as there is only one entry in the text. As soon as there is more than one entry it fails, because the .* matches everything. So using a(.*)a on aFOOOOOOaaASDADaaASDSDADa results in only one capturing group, containing everything between the first and the last character of the text, which are both "a":
FOOOOOOaaASDADaaASDSDAD
What I want to get is something like
captureGroup(0): aFOOOOOOaaASDADaaASDSDADa
captureGroup(1): FOOOOOO
captureGroup(2): ASDAD
captureGroup(3): ASDSDAD
It would be great to be able to extract each entry out of the text and, from each entry, the information that is enclosed between the "a"s. By the way, I am using the QRegExp class of Qt4.
Any hints? Thanks!
Markus
Multiple variations of this question have been seen before. Various related discussions:
Regex to replace all \n in a String, but no those inside [code] [/code] tag
Using regular expressions how do I find a pattern surrounded by two other patterns without including the surrounding strings?
Use RegExp to match a parenthetical number then increment it
Regex for splitting a string using space when not surrounded by single or double quotes
What regex will match text excluding what lies within HTML tags?
and probably others...
Simply use non-greedy expressions, namely:
a(.*?)a
You need to match something like:
a[^a]*a
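A short Python sketch comparing the greedy pattern from the question with the two suggestions, using the sample text from the question. Python's re module stands in for QRegExp here, which is an assumption of the example: as far as I know QRegExp has no *? syntax and switches to minimal matching with setMinimal(true), while the negated character class works in both.

```python
import re

text = "aFOOOOOOaaASDADaaASDSDADa"

# Greedy: .* runs to the last "a", so everything ends up in one group.
greedy = re.findall(r"a(.*)a", text)

# Non-greedy: .*? stops at the first "a" that can close the entry.
lazy = re.findall(r"a(.*?)a", text)

# Negated class: the content may not contain the delimiter at all.
negated = re.findall(r"a([^a]*)a", text)

print(greedy)
print(lazy)
print(negated)
```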
You have a couple of working answers already, but I'll add a little gratuitous advice:
Using regular expressions for parsing is a road fraught with danger
Edit: To be less cryptic: for all their power, flexibility and elegance, regular expressions are not sufficiently expressive to describe any but the simplest grammars. They are adequate for the problem asked here, but are not a suitable replacement for state machines or recursive descent parsers if the input language becomes more complicated.
So, choosing to use REs for parsing input streams is a decision that should be made with care and with an eye towards the future.

Regex Partial String CSV Matching

Let me preface this by saying I'm a complete amateur when it comes to RegEx and only started a few days ago. I'm trying to solve a problem formatting a file and have hit a hitch with a particular type of data. The input file is structured like this:
Two words,Word,Word,Word,"Number, number"
What I need to do is format it like this...
"Two words","Word","Word","Word","Number, number"
I have had a RegEx pattern of
s/,/","/g
working, except it also replaces the comma in the already quoted Number, number section, which causes the field to separate and breaks the file. Essentially, I need to modify my pattern to replace a comma with "," [quote comma quote], but only when that comma isn't followed by a space. Note that the other fields will never have a space following the comma, only the delimited number list.
I managed to write up
s/,[A-Za-z0-9]/","/g
which, while matching the appropriate strings, would replace the comma AND the following letter. I have heard of backreferences and think that might be what I need to use? My understanding was that
s/(,)[A-Za-z0-9]\b
should work, but it doesn't.
Anyone have an idea?
My experience has been that this is not a great use of regexes. As already said, CSV files are better handled by real CSV parsers. You didn't tag a language, so it's hard to tell, but in perl, I use Text::CSV_XS or DBD::CSV (allowing me SQL to access a CSV file as if it were a table, which, of course, uses Text::CSV_XS under the covers). Far simpler than rolling my own, and far more robust than using regexes.
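For comparison, a minimal sketch of the "real CSV parser" route using Python's standard csv module instead of Perl's Text::CSV_XS. The helper name quote_all_fields is made up, and the sketch assumes each input line is one complete CSV record.

```python
import csv
import io


def quote_all_fields(line):
    """Parse one CSV line and write it back with every field quoted."""
    # csv.reader already understands that "Number, number" is one field.
    row = next(csv.reader([line]))
    out = io.StringIO()
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerow(row)
    return out.getvalue().rstrip("\n")


if __name__ == "__main__":
    print(quote_all_fields('Two words,Word,Word,Word,"Number, number"'))
```

The parser handles the quoted comma, so no lookahead or backreference tricks are needed.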
s/,([^ ])/","$1/ will match a "," followed by a "not-a-space", capturing the not-a-space, then replacing the whole thing with "," followed by the captured character.
Depending on which regex engine you're using, you might be writing \1 or other things instead of $1.
If you're using Perl or otherwise have access to a regex engine with negative lookahead, s/,(?! )/","/ (a "," not followed by a space) works.
Your input looks like CSV, though, and if it actually is, you'd be better off parsing it with a real CSV parser rather than with regexes. There are lots of other odd corner cases to worry about.
This question is similar to: Replace patterns that are inside delimiters using a regular expression call.
This could work:
s/"([^"]*)"|([^",]+)/"$1$2"/g
Looks like you're using Sed.
While your pattern seems to be a little inconsistent, I'm assuming you'd like every item separated by commas to have quotations around it. Otherwise, you're looking at areas of computational complexity regular expressions are not meant to handle.
Through sed, your command would be:
sed 's/[ \"]*,[ \"]*/\", \"/g'
Note that you'll still have to put doublequotes at the beginning and end of the string.