Why isn't the following regexp working? - regex

I'm trying to replace both \\u0061 and \u0061 with %u0061 using QRegExp, so I did this:
QString a = "\\u0061";
qDebug() << a.replace(QRegExp("\\?\\u"), "%u");
Since the backslash can appear either once or twice, I used a ? to make the first backslash optional, but it isn't working. What's wrong with it?
EDIT
Thanks to Denomales, it should be \\\\u that represents \\u, and I'm using \\\\+u right now.

Description
Per the Qt QRegExp documentation, see the section on Characters and Abbreviations for Sets of Characters:
Note: The C++ compiler transforms backslashes in strings. To include a \ in a regexp, enter it twice, i.e. \\. To match the backslash character itself, enter it four times, i.e. \\\\.
Care to give this a try:
[\\\\]{1,2}(u)
I've entered 4 backslashes so the various language layers can escape the backslash correctly. Then nested it inside square brackets and required it to appear 1 to 2 times. Essentially this should find the single and double backslashes before the letter u. You could then just replace with %u as in your example.
In my example the u character is captured and should be returned as group 1 to be used later in your replacement.
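The same idea can be checked outside Qt. The following is a minimal sketch using Python's re module rather than QRegExp, so only the regex itself carries over; the helper name to_percent_u is made up for illustration. A raw string removes the compiler-level escaping layer, so two backslashes are enough to match one literal backslash.

```python
import re

# One or two literal backslashes followed by "u".
# Inside a raw string, the two characters \\ are the regex for one backslash.
PATTERN = re.compile(r"\\{1,2}u")


def to_percent_u(text: str) -> str:
    # Replace each backslash-u or double-backslash-u prefix with %u.
    return PATTERN.sub("%u", text)


single = "\\u0061"      # the 6 characters: \u0061
double = "\\\\u0061"    # the 7 characters: \\u0061
print(to_percent_u(single))
print(to_percent_u(double))
```

Both inputs should come out as %u0061. In C++ without raw string literals, each backslash in the pattern has to be doubled again, which is where the four backslashes come from.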

Related

Trying to make more specific regex code

I understand the use of ^ for making a string specific regex, but I'm not sure why my code isn't picking it up. I've messed with changing [] and (), but with no luck :/
/.*^([\/php|.html|.css])$/
Tested with
wss://worker.com/sdfsd.css
wss://worker.com/php
html://worker.com/sdfsd/bob.html
My current non-working example; https://regex101.com/r/qkupPT/1
I don't think you understand what your current regex is doing. The ^ anchors the start of the string. The [] creates a character class, which matches one of the characters listed in it (or in a range, if - is used, e.g. a-z, 0-9). To allow multiple characters from the character class, add a quantifier after the closing ]: either + for one or more characters, or * for zero or more (meaning none of the characters are required). Something like:
[.\/](php|css|html)$
Should allow for the three examples you've listed. That allows the string to end with a . or a / followed by css, php, or html. Also note that a . outside a character class needs to be escaped; otherwise it matches any single character except a newline.
Demo: https://regex101.com/r/qkupPT/2
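For a quick check without regex101, here is a small sketch in Python (the question does not name a language, so Python is an assumption, and the name ENDING is made up). It runs the suggested pattern against the three test strings from the question.

```python
import re

# A "." or "/" followed by one of the three endings, anchored at the end.
# Inside a character class the dot is literal and needs no escaping.
ENDING = re.compile(r"[./](php|css|html)$")

urls = [
    "wss://worker.com/sdfsd.css",
    "wss://worker.com/php",
    "html://worker.com/sdfsd/bob.html",
]

for url in urls:
    match = ENDING.search(url)
    # group(1) is the ending that matched, or None when nothing matched.
    print(url, "->", match.group(1) if match else None)
```

search is used instead of match because the pattern is only anchored at the end of the string.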

regex needed for parsing string

I am working with government measures and am required to parse a string that contains variable information based on delimiters that come from issuing bodies associated with the FDA.
I am trying to retrieve the delimiter and the value after the delimiter. I have searched for hours to find a regex solution that retrieves both the delimiter and the value that follows it and, though there seem to be posts that handle this, the code found in those posts hasn't worked.
One of the major issues in this task is that the delimiters often have repeated characters. For instance: delimiters are used such as "=", "=,", "/=". In this case I would need to tell the difference between "=" and "=,".
Is there a regex that would handle all of this?
Here is an example of the string :
=/A9999XYZ=>100T0479&,1Blah
Notice the delimiters are:
"=/"
"=>"
"&,1"
Any help would be appreciated.
You can use a regex like this
(=/|=>|&,1)|(\w+)
The idea is that the first group contains the delimiters and the second group the content. I assume the content consists of word characters (letters, digits, and underscore). You then have to grab the content of every capturing group.
You need to capture both the delimiter and the value as group 1 and 2 respectively.
If your values are all alphanumeric, use this:
(&,1|\W+)(\w+)
If your values can contain non-alphanumeric characters, it gets complicated:
(=/|=>|=,|=|&,1)((?:.(?!=/|=>|=,|=|&,1))+.)
Code the delimiters longest first, eg "=," before "=", otherwise the alternation, which matches left to right, will match "=" and the comma will become part of the value.
This uses a negative look ahead to stop matching past the next delimiter.
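To see the longest-first ordering and the lookahead working together, here is a rough sketch in Python (the question does not say which language is in use). The delimiter list, the names PAIR and split_pairs, and the exact shape of the value group are assumptions for illustration; the value group is a slight variant of the one above that checks the lookahead before each character instead of after it.

```python
import re

# Delimiters mentioned in the question and answers; the list is an assumption.
DELIMITERS = ["=/", "=>", "=,", "=", "&,1"]

# Longest first, so "=," is tried before "=".
_alternation = "|".join(
    re.escape(d) for d in sorted(DELIMITERS, key=len, reverse=True)
)

# Group 1: a delimiter. Group 2: every following character that does not
# begin another delimiter.
PAIR = re.compile(rf"({_alternation})((?:(?!{_alternation}).)+)")


def split_pairs(text):
    # Returns a list of (delimiter, value) tuples in order of appearance.
    return PAIR.findall(text)


print(split_pairs("=/A9999XYZ=>100T0479&,1Blah"))
```

Building the alternation from a sorted list means a new delimiter can be added without having to remember where it belongs in the ordering.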

regex remove all numbers from a paragraph except from some words

I want to remove all numbers from a paragraph except from some words.
My attempt is using a negative look-ahead:
gsub('(?!ami.12.0|allo.12)[[:digit:]]+','',
c('0.12','1245','ami.12.0 00','allo.12 1'),perl=TRUE)
But this doesn't work. I get this:
"." "" "ami.. " "allo."
But my expected output is:
"." "" "ami.12.0" "allo.12"
You can't really use a negative lookahead here, since it will still replace when the cursor is at some point after ami.
What you can do is put back some matches:
(ami.12.0|allo.12)|[[:digit:]]+
gsub('(ami.12.0|allo.12)|[[:digit:]]+',"\\1",
c('0.12','1245','ami.12.0 00','allo.12 1'),perl=TRUE)
I kept the . since I'm not 100% sure what you have, but keep in mind that . is a wildcard and will match any character (except newlines) unless you escape it.
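The put-back trick is not specific to R. Below is a minimal sketch of the same idea in Python, with the dots escaped; the names KEEP_OR_DIGITS and strip_numbers are made up for illustration. The protected words are listed first in the alternation, so they win over the bare digit branch, and the replacement re-emits them unchanged.

```python
import re

# Branch 1 (captured): words to protect. Branch 2: any other run of digits.
KEEP_OR_DIGITS = re.compile(r"(ami\.12\.0|allo\.12)|\d+")


def strip_numbers(text):
    # If group 1 matched, put the protected word back; otherwise drop the digits.
    return KEEP_OR_DIGITS.sub(lambda m: m.group(1) or "", text)


samples = ["0.12", "1245", "ami.12.0 00", "allo.12 1"]
print([strip_numbers(s) for s in samples])
```

Note that the space before the removed digits stays, so the last two results keep a trailing space.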
Your regex is actually finding every digit sequence that is not the start of "ami.12.0" or "allo.12". For example, in your third string, it gets to the 12 in ami.12.0 and looks ahead to see if that 12 is the start of either of the two ignored strings. It is not, so it goes ahead and replaces it. It would be best to generalize this, but in your specific case you can probably achieve it with a negative lookbehind for every prefix of the words (that can be followed by a digit sequence) that you want to skip. A second lookbehind, (?<![[:digit:]]), keeps a match from starting in the middle of a protected number, for example at the 2 in ami.12. So, you would use something like this:
gsub('(?<!ami\\.|ami\\.12\\.|allo\\.)(?<![[:digit:]])[[:digit:]]+','',
c('0.12','1245','ami.12.0 00','allo.12 1'),perl=TRUE)

Regex to parse a config file, where the # sign represents a comment

With the strings
Test=Hello World #Some more text
Test=Hello World
I need both to capture the "Test" group and the "Hello World" group. If the string starts with a "#" it should not be captured at all.
The below expressions work for the first and second strings, respectively:
^((?!#).+)(?:=)(.+[\S])(?:[\s]*[#])
^((?!#).+)(?:=)(.+[\S])
How do I express a logical OR (alternation) between two non-capturing regex groups?
I tried doing something like
^((?!#).+)(?:=)(.+[\S])(?:[\s]*[#])|(?:.*)
but can't get it to work out correctly.
More Details
Background: This is being done in C# (.NET Framework 4.0). A file is being read line by line. The text to the left of the equals sign is a variable name and the text to the right of the equals sign is the variable's value. This file is being used as a config file.
General cases:
Note: All trailing whitespace - any whitespace after the end of the last non-whitespace character should not be captured. This also includes any space between the end of the second group and the pound sign.
1) Any characters other than whitespace, followed immediately by an equals sign, followed immediately by any set of characters, followed by a space and a pound sign, e.g.
this=is valid #text
s0_is=this #text
and=th.is #text
the=characters after the # Pound sign are irrelevant
2) The exact same situation as case 1 except that there is no trailing space between the second capture group and the pound sign. e.g.
this=is valid#text
s0_is=this#text
and=th.is#text
the=characters after the# Pound sign are irrelevant
3) The same situation as in cases one and two; however, where there is no # sign at all (see the above note about trailing whitespace). e.g.
this=is valid
s0_is=this
and=th.is
the=characters after the
For all three of these cases the capture groups should be as shown below, respectively (the | symbol is used to distinguish between capture groups):
this|is valid
s0_is|this
and|th.is
the|characters after the
Special cases:
1) The first character of the line is a # sign. This should result in nothing being captured.
2) The # sign occurs immediately after the = sign. This should result in the second capture group being null.
3) The # sign occurs anywhere else not otherwise explicitly stated above. This should result in nothing being captured.
4) There should be no whitespace preceding the first character of the new line; however, this case is unlikely to actually occur.
5) A space immediately after the equals sign is invalid.
Invalid cases (where nothing should be captured):
th is=is not valid#text
nor =this#text
or_this=something
also= this
I suspect you're making this more difficult than it needs to be. Try this regex:
^(\w+)=([^\s#]+(?:[ \t]+[^\s#]+)*)
I used [ \t]+ instead of \s+ to prevent it from matching the newline and spilling over onto the next line--assuming the input really is multiline, of course. You can still apply it to standalone strings if that's what you prefer.
EDIT: In answer to your comment, try this regex:
^(\w+)=(\w+(?:[ \t]+\w+)*)
With the first regex I was trying to avoid making confining assumptions and I got a little carried away. If you can use \w+ for all words it becomes much easier, as you can see.
^((?!#).+)(?:=)(.+[\S])(?:[\s]*[#])|(?:.*)
means match
^((?!#).+)(?:=)(.+[\S])(?:[\s]*[#])
OR
(?:.*)
To make the alternation apply only to the tail, group it. Try this:
^((?!#).+)(?:=)(.+[\S])(?:(?:[\s]*[#])|(?:.*))
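That said, (?:.*) seems kind of pointless. Why don't you try something like this instead:
^((?!#).+)(?:=)(.+?\S)(?:\s*#.*)?\s*$
That makes the trailing comment group optional, which is what I think you're trying to do, and it would be the better option in this case. The $ anchor is needed so that the lazy .+? keeps expanding until it reaches the comment or the end of the line, instead of stopping after the first two characters.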
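For comparison, here is one way the rules from the question could be folded into a single anchored pattern. This is a sketch in Python rather than C#, and the pattern is not taken from either answer: it assumes the name may not contain whitespace, = or #, and that the value may not contain #. The names LINE and parse_line are made up for illustration.

```python
import re

LINE = re.compile(
    r"^([^\s=#]+)"                # group 1: name, no whitespace, '=' or '#'
    r"="
    r"([^\s#](?:[^#]*[^\s#])?)?"  # group 2: value without outer spaces, may be absent
    r"(?:\s*#.*)?"                # optional comment
    r"\s*$"                       # nothing but whitespace may remain
)


def parse_line(line):
    # Returns (name, value) for a valid line, or None when nothing is captured.
    # value is None when the '#' comes immediately after the '='.
    m = LINE.match(line)
    return (m.group(1), m.group(2)) if m else None


for text in ["this=is valid #text", "s0_is=this#text", "#skipped=1", "also= this"]:
    print(repr(text), "->", parse_line(text))
```

Because the value must start and end with a non-space character, trailing whitespace before the # never ends up in the second group, and a space right after the = makes the whole line fail.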

RegEx for String.Format

Hiho everyone! :)
I have an application, in which the user can insert a string into a textbox, which will be used for a String.Format output later. So the user's input must have a certain format:
I would like to replace exactly one placeholder, so the string should be of a form like this: "Text{0}Text". So it has to contain at least one '{0}', but no other statement between curly braces, for example no {1}.
For the text before and after the '{0}', I would allow any characters.
So I think, I have to respect the following restrictions: { must be written as {{, } must be written as }}, " must be written as \" and \ must be written as \.
Can somebody tell me, how I can write such a RegEx? In particular, can I do something like 'any character WITHOUT' to exclude the four characters ( {, }, " and \ ) above instead of listing every allowed character?
Many thanks!!
Nikki:)
I hate to be the guy who doesn't answer the question, but really it's poor usability to ask your user to format input to work with String.Format. Provide them with two input fields, so they enter the part before the {0} and the part after the {0} separately. Then you can just concatenate the strings instead of using String.Format; using String.Format on user-supplied text is just a bad idea.
[^(){}\r\n]+\{0}[^(){}\r\n]+
will match any text except (, ), {, } and linebreaks, then match {0}, then the same as before. There needs to be at least one character before and after the {0}; if you don't want that, replace + with *.
You might also want to anchor the regex to beginning and end of your input string:
^[^(){}\r\n]+\{0}[^(){}\r\n]+$
(Similar to Tim's answer)
Something like:
^[^{}()]*(\{0})[^{}()]*$
Tested at http://www.regular-expressions.info/javascriptexample.html
It sounds like you're looking for the [^CHARS_GO_HERE] construct. The exact regex you'd need depends on your regex engine, but it would resemble [^({})].
Check out the "Negated Character Classes" section of the Character Class page at Regular-Expressions.info.
I think your question can be answered by the regexp:
^((\{\{|\}\}|\\"|\\\\|[^\{\}\"\\])*(\{0\}))+(\{\{|\}\}|\\"|\\\\|[^\{\}\"\\])*$
Explanation:
The expression is built up as follows:
^(allowed chars {0})+(allowed chars)*$
one or more sequences of allowed chars followed by a {0} with optional allowed chars at the end.
allowed chars is built from the 4 escape sequences you mentioned (I assumed the \ escape is \\ instead of \.) plus any character that is not one of the four special characters:
(\{\{|\}\}|\\"|\\\\|[^\{\}\"\\])
combined they make up the regexp I started with.
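The structure described above can be tried out quickly. Here is a minimal sketch in Python rather than .NET (the regex syntax used is common to both); the names _ALLOWED, TEMPLATE and is_valid_template are made up for illustration, and non-capturing groups are used because only a yes/no answer is needed.

```python
import re

# One "allowed" unit: an escaped brace pair, an escaped quote, an escaped
# backslash, or any character other than { } " and backslash.
_ALLOWED = r'(?:\{\{|\}\}|\\"|\\\\|[^{}"\\])'

# (allowed chars {0})+ (allowed chars)*
TEMPLATE = re.compile("(?:" + _ALLOWED + r"*\{0\})+" + _ALLOWED + "*")


def is_valid_template(text):
    # fullmatch anchors at both ends, so ^ and $ are not needed in the pattern.
    return TEMPLATE.fullmatch(text) is not None


for candidate in ["Text{0}Text", "Text{1}Text", "{{literal}} {0}", "no placeholder"]:
    print(repr(candidate), "->", is_valid_template(candidate))
```

As in the regexp above, this accepts more than one {0}; to insist on exactly one, the + after the first group could be dropped.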