Regular Expression starting and ending with special characters

I need to extract all matches from a huge text that start with [" and end with "]. These special characters delimit each record exported from a database, and I need to extract all records.
Each record contains letters, numbers and special characters such as -, ., &, (), /, space and so on.
I'm writing this in Office VBA.
The pattern I have come up with so far looks like this: .Pattern = "[[][""][a-z|A-Z|w|W]*".
With this pattern I can extract the first word of each record, together with the starting characters [". The number of matches found is correct.
Example of one record:
["blabla","blabla","blabla","\u00e1no","nie","\u00e1no","\u00e1no","\u00e1no","\u003Ca class=\u0022btn btn-default\u0022 href=\u0022\u0026#x2F;siea\u0026#x2F;suppliers\u0026#x2F;42\u0022\u003E\u003Ci class=\u0022fa fa-pencil\u0022\u003E\u003C\/i\u003E Upravi\u0165\u003C\/a\u003E \u003Ca class=\u0022btn btn-default\u0022 href=\u0022\u0026#x2F;siea\u0026#x2F;suppliers\u0026#x2F;form\u0026#x2F;42\u0022\u003E\u003Ci class=\u0022fa fa-file-pdf-o\u0022\u003E\u003C\/i\u003E Zmluva\u003C\/a\u003E \u003Ca class=\u0022btn btn-default\u0022 href=\u0022\u0026#x2F;siea\u0026#x2F;suppliers\u0026#x2F;crz-form\u0026#x2F;42\u0022\u003E\u003Ci class=\u0022fa fa-file-pdf-o\u0022\u003E\u003C\/i\u003E Zmluva CRZ\u003C\/a\u003E"]
The question is: how can I extract all records starting with [" and ending with "]?
I don't necessarily need the starting and ending characters, but I can clean that up later.
Thanks for the help.

The easiest way is to get rid of the initial and trailing [" and "] with either Replace or Left/Right/Mid functions, and then Split with "," (in VBA, """,""").
E.g.
' "Input" is a reserved word in VBA, so the variable is named s here
s = "YOUR_STRING"
s = Replace(Replace(s, """]", ""), "[""", "")
result = Split(s, """,""")
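For readers outside VBA, the same strip-and-split idea can be sketched in Python. The function name split_record is made up for illustration, and the sketch assumes that no field contains the three-character sequence "," itself.

```python
def split_record(record: str) -> list[str]:
    """Strip the leading [" and trailing "] and split the rest on ","."""
    body = record.strip()
    # Only strip the delimiters when both are actually present
    if body.startswith('["') and body.endswith('"]'):
        body = body[2:-2]
    # Fields are separated by the three characters ","
    return body.split('","')


print(split_record('["blabla","nie","Zmluva CRZ"]'))
```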
If you plan to use a regex, you can use the \["[\s\S]*?"] pattern, but the lazy [\s\S]*? is not that efficient with long inputs and may even freeze the macro if a timeout issue occurs. You can unroll it as
\["[^"]*(?:"(?!])[^"]*)*"]
See the regex demo. In VBA, Pattern = "\[""[^""]*(?:""(?!])[^""]*)*""]"
Note that with this unrolled pattern, you do not even need to use the workarounds for dot matching newline issue (negated character class [^"] matches any char but ", including a newline).
Pattern details:
\[" - [" literally
[^"]* - zero or more characters other than "
(?:"(?!])[^"]*)* - zero or more sequences of
"(?!]) - " not followed with ]
[^"]* - zero or more characters other than "
"] - literal character sequence "]
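As a rough cross-check of the unrolled pattern outside VBA, here is a minimal Python sketch. The names RECORD and extract_records and the sample text are assumptions for illustration; the closing ] is escaped, which Python's re module accepts either way.

```python
import re

# Unrolled pattern: [" , then runs of non-quote characters, with any quote
# that is NOT followed by ] allowed in between, and finally "]
RECORD = re.compile(r'\["[^"]*(?:"(?!\])[^"]*)*"\]')


def extract_records(text: str) -> list[str]:
    """Return every substring that starts with [" and ends with "]."""
    return RECORD.findall(text)


sample = 'junk ["a","b\nc"] more ["x","y"] tail'
print(extract_records(sample))
```

Because [^"] also matches line breaks, the record that spans two lines in the sample is still returned as one match.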

Related

Can't fit regexp for a substring

I have to find and remove a substring from the text using regexp in PostgreSQL. The substring corresponds to the condition: <any text between double-quotes containing for|while inside>
Example
Text:
PERFORM xxkkcsort.f_write_log("INFO", "xxkkcsort.f_load_tables__outof_order_system", " Script for data loading: ", false, v_sql, 0);
So, my purpose is to find and remove the substring "Script for data loading: ".
When I tried to use the script below:
SELECT regexp_replace(
'PERFORM xxkkcsort.f_write_log("INFO", "xxkkcsort.f_load_tables__outof_order_system", "> Table for loading: "||cc.source_table_name , false, null::text, 0);'
, '(\")(.*(for|while)(\s).*)(\")'
, '');
I have all the texts inside double-quotes replaced. The result looks like:
PERFORM xxkkcsort.f_write_log(||cc.source_table_name , false, null::text, 0);
What's a proper regular expression to solve the issue?

You can use
SELECT regexp_replace(
'PERFORM xxkkcsort.f_write_log("INFO", "xxkkcsort.f_load_tables__outof_order_system", "> Table for loading: "||cc.source_table_name , false, null::text, 0);',
'"[^"]*(for|while)\s[^"]*"',
'') AS Result;
Output:
PERFORM xxkkcsort.f_write_log("INFO", "xxkkcsort.f_load_tables__outof_order_system", ||cc.source_table_name , false, null::text, 0);
See the regex demo and the DB fiddle. Details:
" - a " char
[^"]* - zero or more chars other than "
(for|while) - for or while
\s - a whitespace
[^"]*" - zero or more chars other than " and then a " char.

any text between double-quotes containing for|while inside
SELECT regexp_replace(string, '"[^"]*\m(?:for|while)\M[^"]*"', '');
" ... literal " (no special meaning here, so no need to escape it)
[^"]* ... character class including all characters except ", 0-n times
\m ... beginning of a word
(?:for|while) ... two branches in non-capturing parentheses
(regexp_replace() works with simple capturing parentheses, too, but it's cheaper this way since you don't use the captured substring. But try either with the replacement '\1', where it makes a difference ...)
\M ... end of a word
[^"]* ... like above
" ... like above
I dropped \s from your expression, as the task description does not strictly require a white-space character (end of string or punctuation delimiting the word ...).
Related:
Escape function for regular expression or LIKE patterns
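The difference between the \s variant and the word-boundary variant can be tried out in Python. This is only a sketch: Python's re module has no \m and \M, so \b is used here as an approximation of both, and the sample line is invented.

```python
import re

# Variant 1: the keyword must be followed by a whitespace character
WITH_SPACE = re.compile(r'"[^"]*(?:for|while)\s[^"]*"')
# Variant 2: the keyword must be a whole word (\b stands in for \m and \M)
WHOLE_WORD = re.compile(r'"[^"]*\b(?:for|while)\b[^"]*"')

line = 'f_write_log("INFO", "meanwhile loading", "what for?", v_sql);'

# Variant 1 removes "meanwhile loading" (it contains "while ")
# but keeps "what for?" (no whitespace after "for")
print(WITH_SPACE.sub("", line))
# Variant 2 keeps "meanwhile loading" and removes "what for?"
print(WHOLE_WORD.sub("", line))
```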

How to extract words entirely written in uppercase with accents (Diacritics) with a Google Sheet REGEXEXTRACT formula?

OK, it looks simple, but when the words start or end with an accented letter it becomes a mess.
I've looked on Stack Overflow and elsewhere and haven't really found a way to solve this problem.
I would like to extract from a cell, with a Google Sheets formula, only the words built entirely from the following characters:
A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S,T,U,V,W,X,Y,Z,À,Á,Â,Ã,Ä,Å,Æ,Ç,È,É,Ê,Ë,Ì,Í,Î,Ï,Ð,Ñ,Ò,Ó,Ô,Õ,Ö,Ø,Ù,Ú,Û,Ü,Ý
For example with "Éléonorä-Camilliâ ÀLËMMNIÖ DE SANTORINÕ" or "ÀLËMMNIÖ DE SANTORINÕ Éléonorä Camilliâ" the result has to be the same "ÀLËMMNIÖ DE SANTORINÕ"
This formula works when there are no accents at all:
=REGEXEXTRACT(A2;"\b[A-Z]+(?:\s+[A-Z]+)*\b")
These formulas work sometimes, when the names are easy:
=REGEXEXTRACT(A2;"\b[A-Ý]+(?:\s+[A-Ý]+)*\b")
=REGEXEXTRACT(A2;"\B[A-Ý]+(?:\S+[A-Ý]+)*\B")
Can anybody help me or give me some hint?

It seems your expected matches are simply between whitespace or start/end of string. If you add a space before and after the cell value, you may simply extract all the chunks of whitespace-separated uppercase letter words between whitespaces, and the formula will boil down to
=REGEXEXTRACT(" " & A2 & " "; "\s([A-ZÀ-ÖØ-Ý]+(?:\s+[A-ZÀ-ÖØ-Ý]+)*)\s")
See the Google sheets demo:
Regex details:
\s - a whitespace
([A-ZÀ-ÖØ-Ý]+(?:\s+[A-ZÀ-ÖØ-Ý]+)*) - Group 1 (the actual value returned by REGEXEXTRACT): one or more uppercase letters from the specified ranges followed with zero or more repetitions of one or more whitespace and then one or more uppercase letters
\s - a whitespace.
You may use an ARRAYFORMULA, as well:
=ARRAYFORMULA(IFERROR(REGEXEXTRACT(" " & A:A & " ", "\s([A-ZÀ-ÖØ-Ý]+(?:\s+[A-ZÀ-ÖØ-Ý]+)*)\s"),""))
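The padding trick is not specific to Google Sheets. A minimal Python sketch of the same idea follows; the function name extract_upper_words is hypothetical, and the NFC normalization step is an added assumption so that accented letters arrive as single code points.

```python
import re
import unicodedata

# A-Z plus the ranges À-Ö (U+00C0..U+00D6) and Ø-Ý (U+00D8..U+00DD);
# the multiplication sign U+00D7 between them is deliberately skipped
UPPER = "A-Z\u00C0-\u00D6\u00D8-\u00DD"
PATTERN = re.compile(rf"\s([{UPPER}]+(?:\s+[{UPPER}]+)*)\s")


def extract_upper_words(cell: str) -> str:
    """Return the first run of all-uppercase words, or "" if there is none."""
    # Pad with spaces so a run at the very start or end is still delimited
    padded = " " + unicodedata.normalize("NFC", cell) + " "
    match = PATTERN.search(padded)
    return match.group(1) if match else ""


print(extract_upper_words("Éléonorä-Camilliâ ÀLËMMNIÖ DE SANTORINÕ"))
print(extract_upper_words("ÀLËMMNIÖ DE SANTORINÕ Éléonorä Camilliâ"))
```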

Supposing your sample name were in A2, this should work:
=TRIM(REGEXEXTRACT(A2&" ","([A-ZÀ-Ý ]+)\s"))
By appending a space to the end of the string first, we can then look for any number of characters from the [uppercase letter set or space], ending with a space. This rules out strings like "Éléonorä" and "Camilliâ" because their uppercase letters are not followed by a space.
Put a different way, the rule here says, "Grab as many uppercase letters or spaces in this set as possible, as long as you still have a space left over at the end." And since we appended a space to the end of the entire string, we can catch such groupings anywhere in the modified string.

Try this- backslashing the non A-Z characters.
[A-Z\À\Á\Â\Ã\Ä\Å\Æ\Ç\È\É\Ê\Ë\Ì\Í\Î\Ï\Ð\Ñ\Ò\Ó\Ô\Õ\Ö\Ø\Ù\Ú\Û\Ü\Ý]
If that fails you can encode each one of those letters like below:
Look up for characters: https://www.w3schools.com/charsets/ref_utf_latin1_supplement.asp
[A-Z\u00C0\u00C1... and so on...]

use:
=ARRAYFORMULA(TRIM(TRANSPOSE(QUERY(TRANSPOSE(IF(""<>
IFERROR(REGEXEXTRACT(SPLIT(A1:A, " "), "["&TEXTJOIN("", 1,
UNIQUE(QUERY({UPPER(CHAR(ROW(65:1500))), LOWER(CHAR(ROW(65:1500)))},
"select Col2 where Col1<>Col2")))&"]+")),,IFERROR(SPLIT(A1:A, " ")))),,9^9))))
or 10 characters shorter:
=INDEX(TRIM(TRANSPOSE(QUERY(TRANSPOSE(IF(""<>
IFERROR(REGEXEXTRACT(SPLIT(A:A; " "); "["&JOIN(;
UNIQUE(LOWER(QUERY(CHAR(ROUNDUP(SEQUENCE(1500; 2; 65)/2));
"select Col1 where lower(Col1)<>upper(Col2)"))))&"]+"));;
IFERROR(SPLIT(A:A; " "))));;9^9))))
works with all Europe-based alphabets and captures all diacritics out there. it can differentiate between:
LOWER
and
UPPER

Tricky substring problems

I'm having a problem with substrings. I have a string in the format below, which I'm currently reading with getline.
Richard[12345/678910111213141516] was murdered
What I have been using is find_last_of and find_first_of to get the positions between the brackets and forward slashes and retrieve each field. I have this working, but I have run into a problem. The name field can be 32 characters in length and can contain / and [], so when I finally ran into a user with a URL for his name, my code did not like that. The numbers are also random on a per-user basis. I'm retrieving each field from the string: the name and the two identifying numbers.
Another string can look like this, so I would be grabbing 6 total substrings.
Richard[12345/678910111213141516] was murdered by Ralph[54321/161514131211109876]
Which is just another huge mess. What I was thinking about doing was starting from the back and moving to the front, but if the second name field (Ralph) contains any / or [], it's going to ruin the count for retrieving the first part. Any insight would be helpful. Thank you.
In a nutshell: how do I account for these?
Names can also contain any alpha / numerical and special character.
Richard///[][][12345/678910111213141516] was murdered by Ralph[/[54321/161514131211109876]
The end result would be 6 substrings containing this.
Richard///[][]
12345
678910111213141516
Ralph[/
54321
161514131211109876
Regex has been mentioned to me, but I don't know if it would be better suited for the task or not, I included the tag so someone more experienced with it might answer/comment.

Here is a regex way to obtain all the values:
#include <iostream>
#include <regex>
#include <string>
using namespace std;

int main() {
    string str = "Richard///[][][12345/678910111213141516] was murdered by Ralph[/[54321/161514131211109876]";
    // Group 1: name, group 2: first ID, group 3: second ID
    regex rgx1(R"(([A-Z]\w*\s*\S*)\[(\d+)?(?:\/(\d+))?\])");
    smatch smtch;
    while (regex_search(str, smtch, rgx1)) {
        std::cout << "Name: " << smtch[1] << std::endl;
        std::cout << "ID1: " << smtch[2] << std::endl;
        std::cout << "ID2: " << smtch[3] << std::endl;
        str = smtch.suffix().str(); // continue after the current match
    }
}
See IDEONE demo
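The core of the regex, (\S*)\[(\d+)?(?:/(\d+))?\], matches as follows (the code above additionally requires the name in Group 1 to start with an uppercase letter):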
(\S*) - (Group 1) 0 or more non-whitespace symbols, as many as possible.
\[ - an opening square bracket (must be escaped as it is a special character in regex reserved for character classes)
(\d+)? - (Group 2) 1 or more digits (optional group, can be empty)
(?:/(\d+))? - non-capturing optional group matching
/ - literal /
(\d+) - (Group 3) 1 or more digits.
\] - closing square bracket.

A possible regex solution would be to use a pattern like follows:
(\S+)\[(\d+)/(\d+)\](?:\s|$)
which will match and store the names (with their meta attributes). I am currently thinking of ways when it could break.
You can test it on regex101.
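To see how the greedy (\S+) backtracks to the last [digits/digits] block of each name, here is a small Python sketch of that pattern. The names ENTRY and parse_entries are made up for illustration, and the sketch assumes that names never contain whitespace.

```python
import re

# (\S+) is greedy, so it first swallows the whole token and then backtracks
# until the remainder looks like [digits/digits] followed by space or end
ENTRY = re.compile(r"(\S+)\[(\d+)/(\d+)\](?:\s|$)")


def parse_entries(line: str) -> list[tuple[str, ...]]:
    """Return one (name, id1, id2) tuple per entry found in the line."""
    return [m.groups() for m in ENTRY.finditer(line)]


log_line = ("Richard///[][][12345/678910111213141516] was murdered by "
            "Ralph[/[54321/161514131211109876]")
for name, id1, id2 in parse_entries(log_line):
    print(name, id1, id2)
```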

Replace group with spaces

I need to hide part of a string: everything before some ending part.
It is easy to implement with a regexp like this:
replace("123-134-04", ".(?=.*-)", " ")
This replaces any character that is followed by a "-" somewhere later in the string.
So the result is: "       -04"
It is important to keep the spaces.
But I can't use lookahead or lookbehind.
I can capture the group before the ending part, but how do I replace it with the right number of spaces?
Or are there other ways to solve this with a regex?
Thanks in advance!

If the number of characters to be replaced does not vary too much, and you have a means to match the part to be preserved, you could run through a series of search-and-replace calls, one per prefix length:
replace("12-14-04", "^.{5}(-[^-]+)$", "     \1")
replace("123-134-04", "^.{7}(-[^-]+)$", "       \1")
replace("adfasd-adf-da7474-04", "^.{17}(-[^-]+)$", "                 \1")
Or you do:
split the string at the position, where the to be preserved part begins,
run the replace("ALL OF THIS SHOULD BECOME BLANKS", ".", " ") on the first part, and
join them up again.
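The split, blank and join steps can be sketched in Python without any lookahead. The function name mask_prefix is hypothetical, and the sketch assumes the part to keep is the last "-" plus everything after it.

```python
import re


def mask_prefix(value: str) -> str:
    """Blank out everything before the last '-' group, keeping the length."""
    # Locate the part to preserve: the final "-" and what follows it
    tail_match = re.search(r"-[^-]+$", value)
    if tail_match is None:
        return value  # nothing to preserve, leave the string untouched
    head, tail = value[:tail_match.start()], value[tail_match.start():]
    # Replace every character of the first part with a single space
    blanked = re.sub(r".", " ", head, flags=re.DOTALL)
    return blanked + tail


print(repr(mask_prefix("123-134-04")))
```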

Regex for quoted string with escaping quotes

How do I get the substring " It's big \"problem " using a regular expression?
s = ' function(){ return " It\'s big \"problem "; }';

/"(?:[^"\\]|\\.)*"/
Works in The Regex Coach and PCRE Workbench.
Example of test in JavaScript:
var s = ' function(){ return " Is big \\"problem\\", \\no? "; }';
var m = s.match(/"(?:[^"\\]|\\.)*"/);
if (m != null)
alert(m);
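The same pattern can be checked against the string from the question in Python. This is a minimal sketch; the names QUOTED and source are chosen here, and a raw string is used so that the backslashes reach the regex engine unchanged.

```python
import re

# A double quote, then any mix of ordinary characters (neither quote nor
# backslash) and backslash-escaped characters, then the closing quote
QUOTED = re.compile(r'"(?:[^"\\]|\\.)*"')

source = r' function(){ return " It\'s big \"problem "; }'
match = QUOTED.search(source)
print(match.group(0) if match else None)
```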

This one comes from nanorc.sample available in many linux distros. It is used for syntax highlighting of C style strings
\"(\\.|[^\"])*\"

As provided by ePharaoh, the answer is
/"([^"\\]*(\\.[^"\\]*)*)"/
To have the above apply to either single quoted or double quoted strings, use
/"([^"\\]*(\\.[^"\\]*)*)"|\'([^\'\\]*(\\.[^\'\\]*)*)\'/
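A Python sketch of the unrolled form for both quote styles is shown below. The capturing groups are swapped for non-capturing ones, which is a simplification made here and not part of the answer above; the names DQ, SQ and find_quoted are illustrative.

```python
import re

# Unrolled loop: a run of ordinary characters, then zero or more
# (escaped character + run of ordinary characters) sequences
DQ = r'"[^"\\]*(?:\\.[^"\\]*)*"'
SQ = r"'[^'\\]*(?:\\.[^'\\]*)*'"
QUOTED = re.compile(DQ + "|" + SQ)


def find_quoted(text: str) -> list[str]:
    """Return all single- or double-quoted strings, quotes included."""
    return QUOTED.findall(text)


text = r'''say "a \"b\" c" and 'it\'s' done'''
print(find_quoted(text))
```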

Most of the solutions provided here use alternative repetition paths i.e. (A|B)*.
You may encounter stack overflows on large inputs, since some regex engines implement this kind of repetition using recursion.
Java for instance: http://bugs.java.com/bugdatabase/view_bug.do?bug_id=6337993
Something like this:
"(?:[^"\\]*(?:\\.)?)*", or the one provided by Guy Bedford will reduce the amount of parsing steps avoiding most stack overflows.

/(["\']).*?(?<!\\)(\\\\)*\1/is
should work with any quoted string

"(?:\\"|.)*?"
Alternating the \" and the . passes over escaped quotes while the lazy quantifier *? ensures that you don't go past the end of the quoted string. Works with .NET Framework RE classes

/"(?:[^"\\]++|\\.)*+"/
Taken straight from man perlre on a Linux system with Perl 5.22.0 installed.
As an optimization, this regex uses the 'possessive' form of both + and * to prevent backtracking, for it is known beforehand that a string without a closing quote wouldn't match in any case.

This one works perfectly on PCRE and does not fail with a stack overflow.
"(.*?[^\\])??((\\\\)+)?+"
Explanation:
Every quoted string starts with Char: " ;
It may contain any number of any characters: .*? {Lazy match}; ending with non escape character [^\\];
Statement (2) is Lazy(!) optional because string can be empty(""). So: (.*?[^\\])??
Finally, every quoted string ends with Char("), but it can be preceded by an even number of escape signs, i.e. pairs (\\\\)+; and this part is Greedy(!) optional: ((\\\\)+)?+ {Greedy matching}, because the string can be empty or have no ending pairs!

An option that has not been touched on before is:
Reverse the string.
Perform the matching on the reversed string.
Re-reverse the matched strings.
This has the added bonus of being able to correctly match escaped open tags.
Lets say you had the following string; String \"this "should" NOT match\" and "this \"should\" match"
Here, \"this "should" NOT match\" should not be matched and "should" should be.
On top of that this \"should\" match should be matched and \"should\" should not.
First an example.
// The input string.
const myString = 'String \\"this "should" NOT match\\" and "this \\"should\\" match"';
// The RegExp.
const regExp = new RegExp(
// Match close
'([\'"])(?!(?:[\\\\]{2})*[\\\\](?![\\\\]))' +
'((?:' +
// Match escaped close quote
'(?:\\1(?=(?:[\\\\]{2})*[\\\\](?![\\\\])))|' +
// Match everything thats not the close quote
'(?:(?!\\1).)' +
'){0,})' +
// Match open
'(\\1)(?!(?:[\\\\]{2})*[\\\\](?![\\\\]))',
'g'
);
// Reverse the matched strings.
matches = myString
// Reverse the string.
.split('').reverse().join('')
// '"hctam "\dluohs"\ siht" dna "\hctam TON "dluohs" siht"\ gnirtS'
// Match the quoted
.match(regExp)
// ['"hctam "\dluohs"\ siht"', '"dluohs"']
// Reverse the matches
.map(x => x.split('').reverse().join(''))
// ['"this \"should\" match"', '"should"']
// Re order the matches
.reverse();
// ['"should"', '"this \"should\" match"']
Okay, now to explain the RegExp.
The regexp can easily be broken into three pieces, as follows:
# Part 1
(['"]) # Match a closing quotation mark " or '
(?! # As long as it's not followed by
(?:[\\]{2})* # A pair of escape characters
[\\] # and a single escape
(?![\\]) # As long as that's not followed by an escape
)
# Part 2
((?: # Match inside the quotes
(?: # Match option 1:
\1 # Match the closing quote
(?= # As long as it's followed by
(?:\\\\)* # A pair of escape characters
\\ # and a single escape
(?![\\]) # As long as that's not followed by an escape
)
)| # OR
(?: # Match option 2:
(?!\1). # Any character that isn't the closing quote
)
)*) # Match the group 0 or more times
# Part 3
(\1) # Match an open quotation mark that is the same as the closing one
(?! # As long as it's not followed by
(?:[\\]{2})* # A pair of escape characters
[\\] # and a single escape
(?![\\]) # As long as that's not followed by an escape
)
This is probably a lot clearer in image form: generated using Jex's Regulex
Image on github (JavaScript Regular Expression Visualizer.)
Sorry, I don't have a high enough reputation to include images, so, it's just a link for now.
Here is a gist of an example function using this concept that's a little more advanced: https://gist.github.com/scagood/bd99371c072d49a4fee29d193252f5fc#file-matchquotes-js

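Here is one that works with both " and ', and you can easily add other quote characters to the first group.
("|')(?:\\\1|[^\1])*?\1
It uses the backreference (\1) to match exactly what is in the first group (" or '). Note that in most regex flavors a backreference is not interpreted inside a character class, so [^\1] does not mean "anything but the opening quote"; the pattern relies on the lazy *? to stop at the closing quote.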
http://www.regular-expressions.info/backref.html
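One way to express "any character except the opening quote" with a backreference is a negative lookahead instead of a character class. The Python sketch below is a variation on the pattern above, not the answer's own regex; the names ANY_QUOTED and find_any_quoted are made up.

```python
import re

# Group 1 captures the opening quote. Inside, accept either an escaped
# character, or a non-backslash character that is not the opening quote.
ANY_QUOTED = re.compile(r'''(["'])(?:\\.|(?!\1)[^\\])*\1''')


def find_any_quoted(text: str) -> list[str]:
    """Return every quoted string, including its surrounding quotes."""
    return [m.group(0) for m in ANY_QUOTED.finditer(text)]


code = r'''x = 'it\'s' + "say \"hi\""'''
print(find_any_quoted(code))
```

Further quote characters, such as a backtick, could be added to the first group's character class.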

One has to remember that regexps aren't a silver bullet for everything string-y. Some things are simpler to do with a cursor and a linear, manual scan. A CFL (context-free language) parser would do the trick pretty trivially, but there aren't many CFL implementations (afaik).

A more extensive version of https://stackoverflow.com/a/10786066/1794894
/"([^"\\]{50,}(\\.[^"\\]*)*)"|\'[^\'\\]{50,}(\\.[^\'\\]*)*\'|“[^”\\]{50,}(\\.[^“\\]*)*”/
This version also contains
Minimum quote length of 50
Extra type of quotes (open “ and close ”)

If it is searched from the beginning, maybe this can work?
\"((\\\")|[^\\])*\"

I faced a similar problem trying to remove quoted strings that may interfere with parsing of some files.
I ended up with a two-step solution that beats any convoluted regex you can come up with:
line = line.replace("\\\"","\'"); // Replace escaped quotes with something easier to handle
line = line.replaceAll("\"([^\"]*)\"","\"x\""); // Simple is beautiful
Easier to read and probably more efficient.
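The two-step idea translates directly to other languages. A hedged Python sketch follows; the function name blank_out_strings is hypothetical, and like the Java version it assumes the input never contains an escaped backslash directly before a closing quote.

```python
import re


def blank_out_strings(line: str) -> str:
    """Replace the content of every double-quoted string with x."""
    # Step 1: turn escaped double quotes into something easier to handle
    line = line.replace('\\"', "'")
    # Step 2: with no escaped quotes left, a simple pattern is enough
    return re.sub(r'"[^"]*"', '"x"', line)


print(blank_out_strings(r'a = "he said \"hi\"" + b'))
```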

If your IDE is IntelliJ Idea, you can forget all these headaches and store your regex into a String variable and as you copy-paste it inside the double-quote it will automatically change to a regex acceptable format.
example in Java:
String s = "\"en_usa\":[^\\,\\}]+";
now you can use this variable in your regexp or anywhere.

(?<="|')(?:[^"\\]|\\.)*(?="|')
" It\'s big \"problem "
match result:
It\'s big \"problem
("|')(?:[^"\\]|\\.)*("|')
" It\'s big \"problem "
match result:
" It\'s big \"problem "

Messed around at regexpal and ended up with this regex: (Don't ask me how it works, I barely understand even tho I wrote it lol)
"(([^"\\]?(\\\\)?)|(\\")+)+"
