Regex Pattern Matching Concatenation

Is it possible to concatenate the results of Regex Pattern Matching using only Regex syntax?
Specifically, a program accepts regex syntax to pull information from a file, but I would like it to pull from several portions and concatenate the results.
For instance:
Input string: 1234567890
Desired result string: 2389
Regex Pattern match: (?<=1).+(?=4)%%(?<=7).+(?=0)
Where %% represents some form of concatenation syntax. Matching on the text before and after each piece (lookbehind and lookahead) is important, since I know the field names but not the values of the fields.
Does a keyword that functions like %% exist? Is there a more clever way to do this? Must the code be changed to allow multiple regex inputs, automatically concatenating?
Again, the pieces to be concatenated may be far apart with unknown characters in between. All that is known is the information surrounding the substrings.
2011-08-08 edit: The program is written in C#, but changing the code is a major undertaking compared to finding a regex-based solution.

Without knowing exactly what you want to match and what language you're using, it's impossible to give you an exact answer. However, the usual way to approach something like this is to use grouping.
In C#:
string pattern = @"(?<=1)(.+)(?=4).+(?<=7)(.+)(?=0)";
Match m = Regex.Match(input, pattern);
// Groups[0] is the whole match; the captured pieces are Groups[1] and Groups[2].
string result = m.Groups[1].Value + m.Groups[2].Value;
The same approach can be applied to many other languages as well.
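For comparison, here is a minimal Python sketch of the same grouping idea, run against the sample input from the question. The name concat_groups is made up for illustration. As in C#, group 0 is the whole match, so the captured pieces start at group 1.

```python
import re

# Same pattern as in the answer: two capture groups, each delimited by
# the known text around it (lookbehind before, lookahead after).
PATTERN = re.compile(r"(?<=1)(.+)(?=4).+(?<=7)(.+)(?=0)")

def concat_groups(text):
    """Return the captured pieces joined together, or None if nothing matches."""
    m = PATTERN.search(text)
    if m is None:
        return None
    # groups() skips group 0 (the whole match) and returns groups 1..n.
    return "".join(m.groups())

print(concat_groups("1234567890"))  # 2389
```

The concatenation still happens in code, not in the pattern, which is the point the answer makes.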
Edit
If you are not able to change the code, then there's no way to accomplish what you want. The reason is that in C#, the regex string itself doesn't have any power over the output. To change the result, you'd have to either change the called method of the Regex class or do some additional work afterwards. As it is, the method called most likely just returns either a Match object or a list of matching objects, neither of which will do what you want, regardless of the input regex string.

Related

Regex match hyphenated word with hyphen-less query

I have an Azure Storage Table set up that possesses lots of values containing hyphens, apostrophes, and other bits of punctuation that the Azure Indexers don't like. Hyphenated-Word gets broken into two tokens — Hyphenated and Word — upon indexing. Accordingly, this means that searching for HyphenatedWord will not yield any results, regardless of any wildcard or fuzzy matching characters. That said, Azure Cognitive Search possesses support for Regex Lucene queries...
As such, I'm trying to find out if there's a Regex pattern I can use to match words with or without hyphens to a given query. As an example, the query homework should match the results homework and home-work.
I know that if I were trying to do the opposite — match unhyphenated words even when a hyphen is provided in the query — I would use something like /home(-)?work/. However, I'm not sure what the inverse looks like — if such a thing exists.
Is there a raw Regex pattern that will perform the kind of matching I'm proposing? Or am I SOL?
Edit: I should point out that the example I provided is unrealistic because I won't always know where a hyphen should be. Optimally, the pattern that performs this matching would be agnostic to the precise placement of a hyphen.
Edit 2: A solution I've discovered that works but isn't exactly optimal (and, though I have no way to prove this, probably isn't performant) is to just break down the query, remove all of the special characters that cause token breaks, and then dynamically build a regex query that has an optional match in between every character in the query. Using the homework example, the pattern would look something like [-'\.! ]?h[-'\.! ]?o[-'\.! ]?m[-'\.! ]?e[-'\.! ]?w[-'\.! ]?o[-'\.! ]?r[-'\.! ]?k[-'\.! ]?...which is perhaps the ugliest thing I've ever seen. Nevertheless, it gets the job done.
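The construction described in Edit 2 can be sketched roughly as follows. This uses Python's re module only to illustrate how the pattern is assembled; Lucene's regex dialect differs, so the generated string may need adjusting before it is sent to the search service. The helper name loose_pattern is hypothetical.

```python
import re

# Optional separator, copied from the character class in the question.
SEP = r"[-'\.! ]?"

def loose_pattern(query):
    """Build a pattern that allows one optional separator around every character."""
    # Drop the token-breaking characters from the query, keep letters and digits.
    chars = [re.escape(c) for c in query if c.isalnum()]
    return SEP + SEP.join(chars) + SEP

pattern = loose_pattern("homework")
for candidate in ("homework", "home-work", "homewerk"):
    print(candidate, bool(re.fullmatch(pattern, candidate)))
```

Because the separators are stripped from the query first, "home-work" and "homework" produce the same pattern.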
My solution to scenarios like this is always to introduce content- and query-processing.
Content processing is easier when you use the push model via the SDK, but you could achieve the same by creating a shadow/copy of your table where the content is manipulated for indexing purposes. You let your original table stay intact. And then you maintain a duplicate table where your text is processed.
Query processing is something you should use regardless. In its simplest form you want to clean the input from the end users before you use it in a query. Additional steps can be to handle special characters like a hyphen. Either escape it, strip it, or whatever depending on what your requirements are.
EXAMPLE
I have to support searches for ordering codes that may contain hyphens or other special characters. The maintainers of our ordering codes may define ordering codes in an inconsistent format. Customers visiting our sites are just as inconsistent.
The requirement is that ABC-123-DE_F-4.56G should match any of
ABC-123-DE_F-4.56G
ABC123-DE_F-4.56G
ABC_123_DE_F_4_56G
ABC.123.DE.F.4.56G
ABC 123 DEF 456 G
ABC123DEF456G
I solve this using my suggested approach above. I use content processing to generate a version of the ordering code without any special characters (using a simple regex). Then, I use query processing to transform the end user's input into an OR-query, like:
<verbatim-user-input-cleaned> OR OrderingCodeVariation:<verbatim-user-input-without-special-chars>
So, if the user entered ABC.123.DE.F.4.56G I would effectively search for
ABC.123.DE.F.4.56G OR OrderingCodeVariation:ABC123DEF456G
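A rough Python sketch of that query-processing step is below. The helper names strip_special and build_query are invented for illustration, and treating "special characters" as "anything other than ASCII letters and digits" is an assumption. Escaping of query-syntax characters in the verbatim part is left out to keep the sketch short.

```python
import re

def strip_special(text):
    """Drop everything except ASCII letters and digits (assumed definition)."""
    return re.sub(r"[^A-Za-z0-9]", "", text)

def build_query(user_input, field="OrderingCodeVariation"):
    """Combine the cleaned verbatim input with the stripped variation in an OR-query."""
    cleaned = user_input.strip()
    # NOTE: a real implementation would also escape query-syntax characters here.
    return f"{cleaned} OR {field}:{strip_special(cleaned)}"

print(build_query("ABC-123-DE_F-4.56G"))
```

The same strip_special function would be applied on the content side, so that the indexed variation and the query variation are normalized identically.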
It sounds like you want to define your own tokenization. Would using a custom tokenizer help? https://learn.microsoft.com/azure/search/index-add-custom-analyzers
To add onto Jennifer's answer, you could consider using a custom analyzer consisting of either of these token filters:
pattern_replace: A token filter which applies a pattern to each token in the stream, replacing match occurrences with the specified replacement string.
pattern_capture: Uses Java regexes to emit multiple tokens, one for each capture group in one or more patterns.
You could use the pattern_replace token filter to replace hyphens with the desired character, maybe an empty character.

Using regex in SPARQL to bind variables?

I'm working on an OWL knowledge graph for info about patients in the Covid pandemic. I've been using SPARQL to transform strings from spreadsheets into the appropriate objects and values of properties.
I have strings like Infected by P231 and P456 and P39393. What I want is something that can bind variables to the patient IDs. I thought this shouldn't be too hard because the strings only follow a few patterns. E.g., strings will have one, two, or three Patient IDs and no more, so I could write a query that matches each separate case.
I thought I could use regex to do this, but now that I look at it more carefully, I think all it can do is tell me that such Patient IDs exist. Functions such as SUBSTR actually return the part of the string that I want, so I can bind it to a variable; regex just returns true or false depending on whether the string matches the pattern. Is that correct?
If that is correct are there other ways to do pattern matching in SPARQL where I can actually bind variables to a substring that matches part of the pattern? Or do I need to resort to a full programming language like Python to do this?
REPLACE is the function to apply a regular expression, with () match groups, and calculate a return string based on the match using $1 to get the group actually matched. It is based on fn:replace from "XPath and XQuery Functions and Operators" as are many of the SPARQL functions.
BIND (REPLACE("123", "(.)..", "$1") AS ?str)
will set ?str to "1".
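If falling back to Python does turn out to be necessary, as the question considers, the equivalent operations are short. This is only a sketch, and it assumes that a patient ID is always the letter P followed by digits, which is inferred from the sample string.

```python
import re

text = "Infected by P231 and P456 and P39393"

# Assumption: a patient ID is "P" followed by one or more digits.
patient_ids = re.findall(r"P\d+", text)
print(patient_ids)  # ['P231', 'P456', 'P39393']

# Python counterpart of the REPLACE example: keep only the first capture group.
first_char = re.sub(r"(.)..", r"\1", "123")
print(first_char)  # 1
```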

Matching within matches by extending an existing Regex

I'm trying to see if it's possible to extend an existing arbitrary regex by prepending or appending another regex to match within matches.
Take the following example:
The original regex is cat|car|bat so matching output is
cat
car
bat
I want to add to this regex and output only matches that start with 'ca',
cat
car
I specifically don't want to parse the whole regex (which could be quite a long operation) and then rewrite its internals to produce the output, as in:
^ca[tr]
or run the original regex and then the second one over the results. I'm taking the original regex as an argument in python but want to 'prefilter' the matches by adding the additional code.
This is probably a slight abuse of regex, but I'm still interested if it's possible. I have tried what I know of subgroups and the following examples but they're not giving me what I need.
Things I've tried:
^ca(cat|car|bat)
(?<=ca(cat|car|bat))
(?<=^ca(cat|car|bat))
It may not be possible but I'm interested in what any regex gurus think. I'm also interested if there is some way of doing this positionally if the length of the initial output is known.
A slightly more realistic example of the initial query might be [a-z]{4}, but if I create (?<=^ca([a-z]{4})) it matches against 6-letter strings starting with ca, not 4-letter ones.
Thanks for any solutions and/or opinions on it.
EDIT: See solution including @Nick's contribution below. The tool I was testing this with (exrex) seems to have a slight bug that, following the examples given, would create matches 6 characters long.
You were not far off with what you tried, only you don't need a lookbehind, but rather a lookahead assertion, and a parenthesis was misplaced. The right thing is: Put the original pattern in parentheses, and prepend (?=ca):
(?=ca)(cat|car|bat)
(?=ca)([a-z]{4})
In the second example (without | alternative), the parentheses around the original pattern wouldn't be required.
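Since the original regex arrives as an argument in Python, the lookahead approach can be wrapped in a small helper. This is a sketch only; the name prefilter is made up, and it wraps the original pattern in a non-capturing group (instead of the capturing parentheses shown above) so that any group numbers inside the original pattern stay unchanged.

```python
import re

def prefilter(original, prefix):
    """Compile `original` so that it only matches where `prefix` starts the match."""
    # The lookahead checks the prefix without consuming it, so the original
    # pattern still has to match the same characters itself.
    return re.compile(f"(?={re.escape(prefix)})(?:{original})")

ca_words = prefilter("cat|car|bat", "ca")
print([m.group(0) for m in ca_words.finditer("cat car bat")])  # ['cat', 'car']

four_letters = prefilter("[a-z]{4}", "ca")
print(bool(four_letters.fullmatch("cart")))    # True
print(bool(four_letters.fullmatch("cattle")))  # False
```

The match length is still governed by the original pattern, so [a-z]{4} keeps matching exactly four letters.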
Ok, thanks to @Armali I've come to the conclusion that (?=ca)(^[a-z]{4}$) works (see https://regexr.com/3f4vo). However, I'm trying this with the great exrex tool to attempt to produce matching strings, and it's producing matches that are 6 characters long rather than 4. This may be a limitation of exrex rather than the regex, which seems to work in other cases.
See @Nick's comment.
I've also raised an issue on the exrex GitHub for this.

How Can I Regex Match The Common Name of This DN?

Given the string CN=Smith\, John,OU=Users,OU=IT,DC=contoso,DC=com, I am seeking to match the complete common name, including the comma after it. I'm really trying to remove this part, so matching the other part would work too.
The result after removal or filter ought to be OU=Users,OU=IT,DC=contoso,DC=com.
I tried ^.+(?=,OU=) with no flags but this captures CN=Some\, Guy,OU=Users
The language/regex standard is PowerShell, but I can figure out the conversion from another standard, so any solution is likely better than where I am.
You can try this: ^.+?(?=,OU=). The lazy quantifier makes it match everything up to (but not including) the first ,OU= part.
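The lazy pattern can be checked quickly outside PowerShell. The sketch below uses Python's re module, which handles this construct the same way. The second pattern is a small variation, not part of the answer above, that also consumes the comma after the common name so the remainder starts at OU=.

```python
import re

dn = r"CN=Smith\, John,OU=Users,OU=IT,DC=contoso,DC=com"

# Pattern from the answer: lazily match up to the first ",OU=".
common_name = re.match(r"^.+?(?=,OU=)", dn).group(0)
print(common_name)

# Variation (assumption): move the comma out of the lookahead so it is removed too.
remainder = re.sub(r"^.+?,(?=OU=)", "", dn)
print(remainder)  # OU=Users,OU=IT,DC=contoso,DC=com
```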

Regular expression to parse xaml binding-esque syntax

As usual, regular expressions are causing my head to hurt.
I have the following strings (as examples) which I would like to parse:
Client: {Path=ClientName}, Balance: {Path=Balance, StringFormat='{0:0.00}'}
Client: {Path=ClientName}, Balance: {Path=Balance, StringFormat='Your balance is {0:0.00}.'}
I am looking for a regular expression (or any other method) which could split the strings as follows and then get the individual key/value values of each. (The idea is to resolve each one of these to a XAML binding)
String 1: {Path=ClientName}
Path = ClientName
String 2: {Path=Balance, StringFormat='{0:0.00}'}
Path = Balance
StringFormat = {0:0.00}
At the moment I have the following regular expression to split the strings but this gets confused by the value of StringFormat due to the '}' in the value.
(?<!'){(.+?)}(?!')
Any idea how I can achieve this?
Thanks!
It gets really tiring solving this same problem over and over, but here you go:
Technically, you're doing it wrong: you should use a parser, regular expressions aren't built to deal with nested matching parentheses, blah blah blah. We can hack this one together, though, so why not?
/(?<!'){([^'}]|'[^']+')+}(?!')/
The meat of that - {([^'}]|'[^']+')+} - looks for two things, repeated: a) anything that's not a } or a ' character ([^'}]), and b) anything that looks like a string ('[^']+'). It assumes a string is a quote, a bunch of non-quote text, and another quote. Given your examples, this should work.
It will, however, fail to match 'This is a string with \'quotes\' in it', because it isn't designed for escaped quotation marks. Adding this is simple, and involves applying the principles we just applied, so I'll leave that to you to figure out if you can. You seem to be pretty good with regular expressions, and you at least made a start on this before you asked it, so I think you can figure out how to make it match \' in a string.
EDIT: You're using 's instead of "s. Sorry about that.
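To go from the matched bindings to individual key/value pairs, something along these lines might work. It is a Python sketch, not the answerer's code: the braces are escaped for Python's re module, the group is rearranged to capture the whole binding body, and the second pattern PAIR and the function name parse_bindings are assumptions added for illustration. Like the pattern above, it does not handle escaped quotes inside a string.

```python
import re

# The answer's pattern, with the binding body captured as group 1.
BINDING = re.compile(r"(?<!')\{((?:[^'}]|'[^']+')+)\}(?!')")

# Assumed key/value shape: Name=value, where value is either a quoted
# string or a run of characters up to the next comma.
PAIR = re.compile(r"(\w+)=('[^']+'|[^,]+)")

def parse_bindings(text):
    """Return one dict of key/value pairs per {...} binding found in text."""
    bindings = []
    for body in BINDING.findall(text):
        pairs = {key: value.strip().strip("'") for key, value in PAIR.findall(body)}
        bindings.append(pairs)
    return bindings

sample = "Client: {Path=ClientName}, Balance: {Path=Balance, StringFormat='{0:0.00}'}"
for binding in parse_bindings(sample):
    print(binding)
```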