Parsing comments and strings in source code - regex

I'm designing and implementing a scripting language, and for the "reading" stage I'm taking the time-tested and straightforward approach of splitting the code up into tokens (lexical analysis) followed by using a stack-based AST generator to squeeze syntactic structure out of the token stream (parsing). However, I'm facing an issue with strings, comments, and how they interact.
(for reference, code in my language uses ~ to start comments)
This might be the wrong approach, but I'm performing the lexical analysis step using regular expressions. For n kinds of tokens, I run n tokenization passes over the code, with each pass finding substrings that match the given regex and "tagging" them, until eventually every character is tagged. Each regex ignores matches that lie within already-tagged sections of source, tagging only unclaimed land. This is useful because you wouldn't want, for example, a number token infiltrating an identifier like translate3d.
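To make the setup concrete, here is a minimal sketch of that multi-pass tagging scheme (Python rather than my actual implementation language, and the token kinds and pass order are illustrative):

```python
import re

# Token kinds are tried in a fixed order; later passes skip any span
# that an earlier pass has already claimed.
PASSES = [
    ("string",  re.compile(r'"[^"]*"')),
    ("comment", re.compile(r'~[^\n]*')),
    ("number",  re.compile(r'\d+(?:\.\d+)?')),
    ("word",    re.compile(r'\w+')),
]

def tag(source):
    claimed = [False] * len(source)   # which characters are already tagged
    tokens = []
    for kind, regex in PASSES:
        for m in regex.finditer(source):
            span = range(m.start(), m.end())
            if any(claimed[i] for i in span):   # only tag unclaimed land
                continue
            for i in span:
                claimed[i] = True
            tokens.append((m.start(), kind, m.group()))
    return [t[1:] for t in sorted(tokens)]
```

With strings tagged before comments, a ~ inside a string is handled correctly, but a comment containing a quoted word is silently lost because the string pass claims part of it first; this is exactly the ordering conflict described below.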
The issue I'm running into is with comments embedded in strings and strings embedded in comments. I don't know how to simultaneously make this
"The ~ is my favorite character! It's so happy-looking!"
be tagged as a string, and have this
~ "handles" the Exception (just logs it to a file nobody ever reads and moves on)
be tagged as a comment. It seems that either way, you have to impose some ordering on the passes of lexical analysis, and that either the comment or the string pass is going to "win" and tag a substring it has no business tokenizing. For example, either the string is tagged like so: (I'm using XML notation because it's a good way to represent tagged regions of text. XML is not actually used in my program at any point)
"The <comment>~ is my favorite character! It's so happy-looking!"</comment>
or the comment is tagged like this:
<comment>~ </comment><string>"handles"</string> the Exception (just logs it to a file and moves on)
Either it's assumed a string starts in the middle of a comment or a comment starts in the middle of a string.
What's odd is that this system of regex passes tagging substrings seems to be exactly what a text editor's syntax highlighting does, and comments and strings work fine there. I've already developed the TextMate/Sublime Text 2 syntax definition for my language, and all I had to do was (in a simplified version of the actual format used):
<syntax>
<color>
string_color
</color>
<pattern>
"[^"]*"
</pattern>
</syntax>
<syntax>
<color>
comment_color
</color>
<pattern>
~.*
</pattern>
</syntax>
Everything works fine when I'm writing sample code in the editor. When I tried to emulate what I imagine the editor's behavior to be, however, I ran into the problems mentioned above. How can this be fixed, preferably in the most elegant way possible? Obviously, special handling could be added: strip all the comments off the source code before any lexical analysis is done, except for comments inside strings (which requires the reader (the reader in this case being the machine, not the human) to detect which sections of code are strings twice). But I'm sure there must be a better way, simply because Sublime Text only has knowledge of the regexes used to specify the two kinds of regions of code, and with only that information it behaves exactly as expected.

Rather than tagging the source code in several passes and then tokenizing it, I recommend that you abandon the tagging and tokenize the code in one pass, using one regular expression.
If you construct an all-encompassing regex that contains sub-patterns to match and capture each token, you can then match globally and determine the token type by examining the capture group contents.
As a simple example, if you had a regex such as
"([^"]*)"|~([^\n]*)|(\d+(?:\.\d+)?)
to match strings, comments, or numbers, then if a string was matched, the first capture group would contain its contents, and all the other capture groups would be empty.
So, in your foreach loop (D Language Regular expressions) you would use conditional statements and the match object's capture group contents to determine the next token to be added.
And you wouldn't necessarily have to use just one large regex: you could match several token types in one capture group and then, within the foreach block, apply a further regex (or indexOf, etc.) to the capture group contents to determine the token.
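A minimal sketch of this single-pass approach (Python here for illustration; a D version would use std.regex and foreach over matches, and the token set is illustrative):

```python
import re

# One alternation with named groups. Order matters: because the string
# and comment alternatives come first, a ~ inside a string literal is
# consumed by the string alternative and never starts a comment, and a
# quote inside a comment never starts a string.
TOKEN = re.compile(r'''
    (?P<string>"[^"]*")
  | (?P<comment>~[^\n]*)
  | (?P<number>\d+(?:\.\d+)?)
  | (?P<word>\w+)
  | (?P<ws>\s+)
''', re.VERBOSE)

def tokenize(source):
    tokens = []
    for m in TOKEN.finditer(source):
        kind = m.lastgroup          # name of the alternative that matched
        if kind != "ws":            # skip whitespace tokens
            tokens.append((kind, m.group()))
    return tokens
```

This left-to-right, first-alternative-wins behavior is effectively what the editor's highlighter was doing all along, which is why strings and comments interact correctly there.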

Related

Regex match hyphenated word with hyphen-less query

I have an Azure Storage Table set up that possesses lots of values containing hyphens, apostrophes, and other bits of punctuation that the Azure Indexers don't like. Hyphenated-Word gets broken into two tokens — Hyphenated and Word — upon indexing. Accordingly, this means that searching for HyphenatedWord will not yield any results, regardless of any wildcard or fuzzy matching characters. That said, Azure Cognitive Search possesses support for Regex Lucene queries...
As such, I'm trying to find out if there's a Regex pattern I can use to match words with or without hyphens to a given query. As an example, the query homework should match the results homework and home-work.
I know that if I were trying to do the opposite — match unhyphenated words even when a hyphen is provided in the query — I would use something like /home(-)?work/. However, I'm not sure what the inverse looks like — if such a thing exists.
Is there a raw Regex pattern that will perform the kind of matching I'm proposing? Or am I SOL?
Edit: I should point out that the example I provided is unrealistic because I won't always know where a hyphen should be. Optimally, the pattern that performs this matching would be agnostic to the precise placement of a hyphen.
Edit 2: A solution I've discovered that works but isn't exactly optimal (and, though I have no way to prove this, probably isn't performant) is to just break down the query, remove all of the special characters that cause token breaks, and then dynamically build a regex query that has an optional match in between every character in the query. Using the homework example, the pattern would look something like [-'\.! ]?h[-'\.! ]?o[-'\.! ]?m[-'\.! ]?e[-'\.! ]?w[-'\.! ]?o[-'\.! ]?r[-'\.! ]?k[-'\.! ]?...which is perhaps the ugliest thing I've ever seen. Nevertheless, it gets the job done.
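That dynamic construction is mechanical enough to generate in code. A sketch (Python's re is used here only to exercise the idea locally; Lucene's regex dialect differs in details, so the generated string would still need to respect it):

```python
import re

# Interleave an optional special-character class between every character
# of the query, after stripping the token-breaking characters themselves.
SPECIALS = r"[-'\.! ]?"

def fuzzy_hyphen_pattern(query):
    letters = re.sub(r"[-'.! ]", "", query)   # drop the break characters
    return SPECIALS + SPECIALS.join(letters) + SPECIALS
```

Because the break characters are stripped first, "homework" and "home-work" produce the same pattern, so the matching is agnostic to where the user put the hyphen.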
My solution to scenarios like this is always to introduce content- and query-processing.
Content processing is easier when you use the push model via the SDK, but you could achieve the same thing by creating a shadow copy of your table where the content is manipulated for indexing purposes. You leave your original table intact and maintain a duplicate table where your text is processed.
Query processing is something you should use regardless. In its simplest form you want to clean the input from the end users before you use it in a query. Additional steps can be to handle special characters like a hyphen. Either escape it, strip it, or whatever depending on what your requirements are.
EXAMPLE
I have to support searches for ordering codes that may contain hyphens or other special characters. The maintainers of our ordering codes may define ordering codes in an inconsistent format. Customers visiting our sites are just as inconsistent.
The requirement is that ABC-123-DE_F-4.56G should match any of
ABC-123-DE_F-4.56G
ABC123-DE_F-4.56G
ABC_123_DE_F_4_56G
ABC.123.DE.F.4.56G
ABC 123 DEF 56 G
ABC123DEF56G
I solve this using my suggested approach above. I use content processing to generate a version of the ordering code without any special characters (using a simple regex). Then, I use query processing to transform the end user's input into an OR-query, like:
<verbatim-user-input-cleaned> OR OrderingCodeVariation:<verbatim-user-input-without-special-chars>
So, if the user entered ABC.123.DE.F.4.56G I would effectively search for
ABC.123.DE.F.4.56G OR OrderingCodeVariation:ABC123DEF456G
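In code, the query-processing step might look like this sketch (Python for illustration; the field name OrderingCodeVariation comes from the example above, and the cleaning rules are placeholders for your own requirements):

```python
import re

def build_query(user_input):
    # Clean the raw end-user input, derive a variation with every
    # special character stripped, and OR the two together.
    cleaned = user_input.strip()
    variation = re.sub(r"[^A-Za-z0-9]", "", cleaned)
    return f"{cleaned} OR OrderingCodeVariation:{variation}"
```

The content-processing side would apply the same stripping regex when populating the OrderingCodeVariation field, so the two variations meet in the middle.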
It sounds like you want to define your own tokenization. Would using a custom tokenizer help? https://learn.microsoft.com/azure/search/index-add-custom-analyzers
To add onto Jennifer's answer, you could consider using a custom analyzer consisting of either of these token filters:
pattern_replace: A token filter which applies a pattern to each token in the stream, replacing match occurrences with the specified replacement string.
pattern_capture: Uses Java regexes to emit multiple tokens, one for each capture group in one or more patterns.
You could use the pattern_replace token filter to replace hyphens with the desired replacement, such as an empty string.
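Put together, a custom analyzer using pattern_replace might look roughly like this fragment of an index definition (a sketch only: the analyzer and filter names are illustrative, and the exact schema should be checked against the Azure Cognitive Search custom-analyzer documentation):

```json
{
  "analyzers": [
    {
      "name": "dehyphenating_analyzer",
      "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
      "tokenizer": "keyword_v2",
      "tokenFilters": [ "lowercase", "strip_hyphens" ]
    }
  ],
  "tokenFilters": [
    {
      "name": "strip_hyphens",
      "@odata.type": "#Microsoft.Azure.Search.PatternReplaceTokenFilter",
      "pattern": "-",
      "replacement": ""
    }
  ]
}
```

A keyword tokenizer is sketched here because a standard tokenizer would already have split Hyphenated-Word into two tokens before the filter ran.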

find string that is missing substring in xml files regular expression

This is my reg expression that find it
(<instance_material symbol="material_)([0-9]+)(part)(.*?)(")(/)(>)
I need to find a string that does not contain the word "part"
and the xml lines are
<instance_material symbol="material_677part01_h502_w5" target="#material_677part01_h502_w5"/>
<instance_material symbol="material_677" target="#material_677"/>
You can use negative lookahead
^(?!.*part).*?$
^ - start of string.
(?!.*part) - condition to avoid part.
.*? - Lazily match anything except a newline.
$ - End of string
Many regex beginners encounter the problem of finding a string not containing certain words. You can find more useful tips at Regular-Expressions.info.
^((?!part).)*$
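Applied to the two sample lines, both forms behave the same (Python's re used here for illustration; any PCRE-style engine gives the same result):

```python
import re

lines = [
    '<instance_material symbol="material_677part01_h502_w5" target="#material_677part01_h502_w5"/>',
    '<instance_material symbol="material_677" target="#material_677"/>',
]

no_part = re.compile(r'^(?!.*part).*?$')      # lookahead once, at the start
no_part_alt = re.compile(r'^((?!part).)*$')   # lookahead at every position

# Keep only the lines that do not contain "part" anywhere.
kept = [line for line in lines if no_part.match(line)]
```

Only the second line, the one without "part" in its attributes, survives the filter.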
You need to be aware that all attempts to process XML using regular expressions are wrong, in the sense that (a) there will be some legitimate ways of writing the XML document that the regex doesn't match, and (b) there will be some ways of getting false matches, e.g. by putting nasty stuff in XML comments. Sometimes being right 99% of the time is OK of course, but don't do this in production because soon we'll have people writing on SO "I need to generate XML with the attributes in a particular order because that's what the receiving application requires."
Your regex, for example, requires the attribute to be in double rather than single quotes, and it doesn't allow whitespace around the "=" sign, or in several other places where XML allows whitespace. If there's any risk of people deliberately trying to defeat your regex, you also need to consider tricks like people writing &#112; in place of p.
Even if this is a one-off with no risk of malicious subversion, you're much better off doing this with XPath. It then becomes a simple query like //instance_material[@symbol[not(contains(., 'part'))]]

Too Many Characters Included in Attempt to Parse a CSV File

Background
I am attempting to parse a CSV file using PCRE regular expressions. That is, making out (or extracting) the various "cells" available in the CSV, to then put them in a somewhat nicely organized array containing all the parts that the parsing process managed to make out.
The following regular expression is what I have come up with so far:
/(?:;|^)(?:(?:"(?:(?!"(;|$)).)*)|(?:([^;]*)))/g
I would highly recommend that you put this in a tester for regular expressions. Here is a slight bit of test data, that should match to a great extent.
"There; \"be";"but; someone spoke";hence the young man;hence the son;"test;"
The Problem
The regular expression manages to extract the correct number of parts. It is meant to retrieve the text from inside each and every "cell" available in the CSV (use the CSV provided above for reference). It does, to some extent.
Here is the result of the groups in the regular expression above:
"There; \"be
;"but; someone spoke
hence the young man
hence the son
;"test;
As we can clearly see, the cells that are "escaped" using double-quotation marks include the " character inside the group for the match, and sometimes even the leading semicolon. From my understanding, the group with the negative lookahead should not include those.
I have probably missed something very essential here. Perhaps someone can point me in the right direction towards a fix.
Edit and Potential Solution
It appears as though I might have managed to solve it. Contrary to what I said above, the negative lookahead does not actually create a capture group, as I initially thought it did. As such, adding yet another group to the equation seems to parse out the segments I am after.
/(?:;|^)(?:(?:"((?:(?!"(;|$)).)*))|(?:([^;]*)))/g
I will, however, leave the question open for now, and will answer it myself if no other answer comes tumbling in. So as not to make it opinion-based, I would further inquire whether there might be a way that is more efficient in terms of speed than the one I am using above.
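For reference, here is the revised pattern ported to Python's re (an illustration only; the original is PCRE, and the delimiters and /g flag are replaced by finditer):

```python
import re

# Group 1 captures a quoted cell's contents, group 3 an unquoted cell.
# Group 2 belongs to the lookahead's alternation and never captures
# anything on a successful match.
CELL = re.compile(r'(?:;|^)(?:(?:"((?:(?!"(;|$)).)*))|(?:([^;]*)))')

row = r'"There; \"be";"but; someone spoke";hence the young man;hence the son;"test;"'

cells = [m.group(1) if m.group(1) is not None else m.group(3)
         for m in CELL.finditer(row)]
```

Reading group 1 (or group 3 for unquoted cells) rather than the whole match is what drops the stray quotes and semicolons that the question complains about.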

Problems with finding and replacing

Hey Stack Overflow community. I need help with a huge data file. Is it possible, with a regular expression, to find this tag:
<category_name><![CDATA[Prekiniai ženklai&gt;Adler|Kita buitinė technika&gt;Buičiai naudingi prietaisai|Kita buitinė technika&gt;Lygintuvai]]></category_name>
and somehow replace all the other data, leaving only 'Adler' or 'Lygintuvai'? I'm using Altova to edit the XML files, so I can't find any other way than find-and-replace. And I'm new to the regex stuff, so I thought maybe you can help me.
#\<category_name\>.+?gt\;([\w]+?)\|.+?gt;([\w]+?)\]\]\>\<\/category_name\>#i
\1 - Adler
\2 - Lygintuvai
Demo (PHP flavor) on regex101.com
Fields may contain alphanumeric characters without spaces.
If you want to modify the scope of acceptable characters, change [\w] to something else:
[a-z] - only letters
[0-9] - only digits
etc.
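For illustration, here is the same pattern applied outside Altova with Python's re (the PHP #...#i delimiters are dropped and unnecessary escapes removed; the pattern body is otherwise unchanged, and it relies on the &gt; entities stored inside the CDATA section):

```python
import re

line = ('<category_name><![CDATA[Prekiniai ženklai&gt;Adler|'
        'Kita buitinė technika&gt;Buičiai naudingi prietaisai|'
        'Kita buitinė technika&gt;Lygintuvai]]></category_name>')

# \1 is the word right after the first &gt; and before a |;
# \2 is the word right before the closing ]]>.
m = re.search(r'<category_name>.+?gt;(\w+?)\|.+?gt;(\w+?)\]\]></category_name>',
              line, re.IGNORECASE)
first, second = m.group(1), m.group(2)
```

The lazy quantifiers skip over the middle category ("Buičiai naudingi prietaisai") because it contains spaces, which \w cannot match.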
It's possible, but use of regular expressions to process XML will never be 100% correct (you can prove that using computer science theory), and it may also be very inefficient. For example, the solution given by Luk is incorrect because it doesn't allow whitespace in places where XML allows it. Much better to use XQuery or XSLT, both of which are designed for the job (and both work in Altova). You can then use XPath expressions to locate the element or attribute nodes you are interested in, and you can still use regular expressions (e.g. in the XPath replace() function) to process the content of text or attribute nodes.
Incidentally, your input is rather strange because it uses escape sequences like &gt; within a CDATA section; but XML escape sequences are not recognized in a CDATA section.

Regex Pattern Matching Concatenation

Is it possible to concatenate the results of Regex Pattern Matching using only Regex syntax?
The specific instance is a program that allows regex syntax to pull info from a file, but I would like it to pull from several portions and concatenate the results.
For instance:
Input string: 1234567890
Desired result string: 2389
Regex Pattern match: (?<=1).+(?=4)%%(?<=7).+(?=0)
Where %% represents some form of concatenation syntax. Using starts-with and ends-with syntax is important, since I know the field names but not the values of the fields.
Does a keyword that functions like %% exist? Is there a more clever way to do this? Must the code be changed to allow multiple regex inputs, automatically concatenating?
Again, the pieces to be concatenated may be far apart with unknown characters in between. All that is known is the information surrounding the substrings.
2011-08-08 edit: The program is written in C#, but changing the code is a major undertaking compared to finding a regex-based solution.
Without knowing exactly what you want to match and what language you're using, it's impossible to give you an exact answer. However, the usual way to approach something like this is to use grouping.
In C#:
string pattern = @"(?<=1)(.+)(?=4).+(?<=7)(.+)(?=0)";
Match m = Regex.Match(input, pattern);
string result = m.Groups[1].Value + m.Groups[2].Value;  // Groups[0] is the whole match
The same approach can be applied to many other languages as well.
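For comparison, the same grouping idea in Python; the point holds in any language: the captures come back separately, and the concatenation happens in host-language code, not inside the pattern:

```python
import re

# Capture the segment between "1" and "4" and the segment between
# "7" and "0", then join them outside the regex.
m = re.search(r'(?<=1)(.+)(?=4).+(?<=7)(.+)(?=0)', "1234567890")
result = m.group(1) + m.group(2)
```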
Edit
If you are not able to change the code, then there's no way to accomplish what you want. The reason is that in C#, the regex string itself doesn't have any power over the output. To change the result, you'd have to either change the called method of the Regex class or do some additional work afterwards. As it is, the method called most likely just returns either a Match object or a list of matching objects, neither of which will do what you want, regardless of the input regex string.