Regular Expression replace special characters, only if not part of word - regex

I have the following string:
'United Breaks Guitars': Did It Really Cost The Airline $180 Million? http://ow.ly/htPVk
Currently, my regex pattern looks like this: [^A-Za-z-0-9- - / -$]
I'm not an expert on regex and I've been playing around with this tool to figure things out, but I am stuck.
I'd like to remove characters such as ', ", :, etc. So far with the above pattern the highlighted characters are being removed from my example string:
'United Breaks Guitars' : Did It Really Cost The Airline $180 Million? http://ow.ly/htPVk
The issue above is that I don't want to remove the : and . from the URL. But if the string ends with a period I would like to remove it. Also, the apostrophe ' character should be kept in case it's used to omit characters or as a possession.
Thanks in advance.

It depends on how you define "part of a word"; a URL isn't much of a word.
If you define "part of a word" as surrounded by non-space characters, then you could use something like:
(?<!\S)[^\w $-]+|[^\w $-]+(?!\S)
(?!\S) is a shorter way of saying (?=\s|$), and the same applies for the lookbehind.
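As a rough illustration, here is the pattern above applied with Python's re module. The function name strip_edge_punctuation is made up for this sketch, and the second sample string in the tests is my own, not from the question.

```python
import re

# Pattern from the answer: runs of "special" characters that touch whitespace
# (or the start/end of the string) on at least one side.
EDGE_PUNCTUATION = re.compile(r"(?<!\S)[^\w $-]+|[^\w $-]+(?!\S)")


def strip_edge_punctuation(text):
    """Remove punctuation that is not surrounded by non-space characters."""
    return EDGE_PUNCTUATION.sub("", text)


title = ("'United Breaks Guitars': Did It Really Cost The Airline "
         "$180 Million? http://ow.ly/htPVk")
print(strip_edge_punctuation(title))
```

The :, / and . inside the URL survive because they have non-space characters on both sides, and an apostrophe inside a word such as It's survives for the same reason. An apostrophe at the very end of a word (a plural possessive) would still be removed.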

Related

Regex to match(extract) string between dot(.)

I want to select dot-separated combinations (such as a.b.c) from a very long string (SQL). The full string may be a single line or several lines separated by newlines, and the combination may appear on the first line, on a later line, or in both places.
I need help in writing a regex for it.
Examples:
String s = I am testing something like test.test.test in sentence.
Expected output: test.test.test
Example2 (real usecase):
UPDATE test.table
SET access = 01
WHERE access IN (
SELECT name FROM project.dataset.tablename WHERE name = 'test' GROUP BY 1 )
Expected output: test.table and project.dataset.tablename
Also, can I add prefix or suffix words (or a space) that must be present wherever this logic is checked? In the case above, if the keyword is UPDATE the regex should pick up test.table, but if the statement is something like SELECT test.table the regex should not pick up the combination. The same applies to suffixes.
Example3: This is to illustrate the above theory.
INS INTO test.table
SEL 'abcscsc', wu_id.Item_Nbr ,1
FROM test.table as_t
WHERE as_t.old <> 0 AND as_t.date = 11
AND (as_t.numb IN ('11') )
Expected Output: test.table, test.table (Key words are INTO and FROM)
Things Not Needed in selection:as_t.numb, as_t.old, as_t.date
If I get the regex I can use in program to extract this word.
Note: Before and after string words to the combination could be anything like update, select { or(, so we have to find the occurrence of words which are joined together with .(dot) and all the number of such occurrence.
I tried something like this:
(?<=.)(.?)(?=.)(.?) -: This only selected the word between two .dot and not all.
.(?<=.)(.?)(?=.)(.?). - This everything before and after.
To solve your initial problem, we can just use some negation. Here's the pattern I came up with:
[^\s]+\.[^\s]+
[^ ... ] Means to make a character class including everything except for what's between the brackets. In this case, I put \s in there, which matches any whitespace. So [^\s] matches anything that isn't whitespace.
+ Is a quantifier. It means to find as many of the preceding construct as you can without breaking the match. This would happily match everything that's not whitespace, but I follow it with a \., which matches a literal .. The \ is necessary because . means to match any character in regex, so we need to escape it so it only has its literal meaning. This means there has to be a . in this group of non-whitespace characters.
I end the pattern with another [^\s]+, which matches everything after the . until the next whitespace.
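Here is a small sketch of that pattern in Python, run against the two examples from the question. The function name find_dotted_tokens is only illustrative.

```python
import re

# One or more non-whitespace characters, a literal dot, then one or more
# non-whitespace characters.
DOTTED_TOKEN = re.compile(r"[^\s]+\.[^\s]+")


def find_dotted_tokens(text):
    """Return every whitespace-delimited token that contains an inner dot."""
    return DOTTED_TOKEN.findall(text)


sql = (
    "UPDATE test.table\n"
    "SET access = 01\n"
    "WHERE access IN (\n"
    "SELECT name FROM project.dataset.tablename WHERE name = 'test' GROUP BY 1 )"
)
print(find_dotted_tokens(sql))
```

Note that a sentence-final period is not matched, because at least one non-whitespace character must follow the dot. On the other hand, punctuation glued to a name, as in (test.table), would be returned as part of the match.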
Now, to solve your secondary problem, you want to make this match only work if it is preceded by a given keyword. Luckily, regex has a construct almost specifically for this case. It's called a lookbehind. The syntax is (?<= ... ) where the ... is the pattern you want to look for. Using your example, this will only match after the keywords INTO and FROM:
(?<=(?:INTO|FROM)\s)[^\s]+\.[^\s]+
Here (?:INTO|FROM) means to match either the text INTO or the text FROM. I then specify that it should be followed by a whitespace character with \s. One possible problem here is that it will only match if the keywords are written in all upper case. You can change this behavior by specifying the case insensitive flag i to your regex parser. If your regex parser doesn't have a way to specify flags, you can usually still specify it inline by putting (?i) in front of the pattern, like so:
(?i)(?<=(?:INTO|FROM)\s)[^\s]+\.[^\s]+
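A quick sketch of the keyword version in Python, using the third example from the question. The function name tables_after_keywords is hypothetical, and I pass the case-insensitive flag as re.IGNORECASE rather than inline.

```python
import re

# Only match a dotted token that directly follows INTO or FROM plus one
# whitespace character. The lookbehind checks the keyword without consuming it.
TABLE_AFTER_KEYWORD = re.compile(
    r"(?<=(?:INTO|FROM)\s)[^\s]+\.[^\s]+",
    re.IGNORECASE,
)


def tables_after_keywords(sql):
    """Return dotted names that come right after INTO or FROM."""
    return TABLE_AFTER_KEYWORD.findall(sql)


statement = (
    "INS INTO test.table\n"
    "SEL 'abcscsc', wu_id.Item_Nbr ,1\n"
    "FROM test.table as_t\n"
    "WHERE as_t.old <> 0 AND as_t.date = 11\n"
    "AND (as_t.numb IN ('11') )"
)
print(tables_after_keywords(statement))
```

One caveat if you use Python: its built-in re module only accepts fixed-width lookbehinds. INTO and FROM happen to have the same length, so this works, but adding a keyword of a different length (UPDATE, say) to the same alternation raises an error. In that case you can combine several lookbehinds, for example (?:(?<=INTO\s)|(?<=UPDATE\s)).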
If you are new to regex, I highly recommend using the www.regex101.com website to generate regex and learn how it works. Don't forget to check out the code generator part for getting the regex code based on the programming language you are using, that's a cool feature.
For your question, you need a regex that understands any word character \w that matches between 0 and unlimited times, followed by a dot, followed by another series of word character that repeats between 0 and unlimited times.
So here is my solution to your question:
Your regex in JavaScript:
const regex = /([\w]*[.][\w]*)+/gm;
in Java:
final String regex = "([\\w]*[.][\\w]*)+";
in Python:
regex = r"([\w]*[.][\w]*)+"
in PHP:
$re = '/([\w]*[.][\w]*)+/m';
Note that this solution is written for your use case (SQL strings): it will also catch something like '.word' or 'word..word', which I assume does not occur in your input.
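To make the description concrete, here is a Python sketch of a pattern that follows the wording above: zero or more word characters, a dot, zero or more word characters, repeated. The function name find_dotted_words is made up, and I use a non-capturing group with finditer so that the whole match is returned rather than only the last repetition of the group.

```python
import re

# Zero or more word characters, a dot, zero or more word characters,
# repeated one or more times.
DOTTED_WORDS = re.compile(r"(?:\w*[.]\w*)+")


def find_dotted_words(text):
    """Return each full match of the dotted-word pattern."""
    return [m.group(0) for m in DOTTED_WORDS.finditer(text)]


print(find_dotted_words("SELECT name FROM project.dataset.tablename WHERE x = 1"))
print(find_dotted_words(".word word..word"))
```

Because the word characters on either side of the dot are optional, a word followed by a sentence-final period (like "sentence.") is matched as well, so this is only a reasonable fit when the input is SQL rather than prose.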

Regex multiline to match string followed by text with arbitrary spaces in between

Was wondering if any regex gurus can help figure out how to create a regular expression to solve this. I am stumped.
I need to match "CUST-X" on variations of this multiline text..
"CUST-1 Some message\nLock-Id: Id74248cd199\n"
Requirement:
The "CUST-1" and "Some message" can be separated by a colon (:). The colon is optional.
There can be none, one or multiple spaces between the two strings.
Any number of spaces can be in front of "CUST-1".
There needs to be a message after "CUST-1". The message is arbitrary, there's no pattern to the message.
Ignore any other CUST-XX after the first one. Only match on the 1st occurrence.
java regex is preferable.
Examples:
Test strings that should match for "CUST-1"
"CUST-1 Some message\nLock-Id: Id14248cd199"
" CUST-1 another message\nLock-Id: Id14258cd199"
"CUST-1:I like apples\nLock-Id: Id84248cd199"
"CUST-1: peaches are sweet\nLock-Id: Id78248cd199"
"CUST-1: pies are great\nLock-Id: Id71248cd199"
Should match for "CUST-1" but not "CUST-2"
"CUST-1: Nice message about CUST-2\nLock-Id: Id74248cd199\n"
Test strings that should not match "CUST-1"
"CUST-1\nLock-Id: Id78248cd199"
"CUST-1 \nLock-Id: Id74248cd199"
"CUST-1:\nLock-Id: Id84248cd199"
"CUST-1: \nLock-Id: Id94248cd199"
The closest I've come up with is:
^\\s*([A-Z]+-[0-9]+):?\\s+\\S+
But this will also match the cases where I do not want the match to happen.
I think this is what you are looking for:
^\\s*([A-Z]+-[0-9]+):?\\p{Z}+\\S+
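\s matches any whitespace character (described by some tools as [\t\n\f\r\p{Z}]), which includes \n. That is why \s+ can run past the end of the first line and let \S+ match text on the Lock-Id line.
\p{Z} matches only Unicode separator characters such as the space itself; it does not match \n.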
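For anyone testing this outside Java, here is a hedged Python sketch of the same idea. It is a variation, not the regex above: Python's built-in re module has no \p{Z}, so I substitute [^\S\r\n] ("whitespace that is not a line break"). I also split the separator into two cases, because \p{Z}+ requires at least one space and so would reject "CUST-1:I like apples" from the question. The function name leading_ticket is made up.

```python
import re

# [^\S\r\n] stands in for \p{Z}: whitespace, but never a line break.
# Separator is either a colon followed by optional spaces, or one or more
# spaces without a colon. A non-space message character must follow.
TICKET = re.compile(r"^\s*([A-Z]+-[0-9]+)(?::[^\S\r\n]*|[^\S\r\n]+)\S")


def leading_ticket(message):
    """Return the first CUST-style id if a message follows it on the same line."""
    m = TICKET.match(message)
    return m.group(1) if m else None


print(leading_ticket("CUST-1: Nice message about CUST-2\nLock-Id: Id74248cd199\n"))
print(leading_ticket("CUST-1: \nLock-Id: Id94248cd199"))
```

Since match anchors at the start of the string, a later CUST-2 in the message is ignored, as the question requires.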

Matlab regexp is there an or statement?

Hey guys, I am trying to find a way to display the letter I by itself, but I keep having trouble. This is what I have so far.
This is the text file that I open, tolls.txt:
Join Microsoft employees supporting I Inspire Youth Project and other youth causes #GivingHero: http://msft.it/6013jboz
Waze for #WindowsPhone is here: http://msft.it/6016jbp2 I
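fid=fopen('tolls.txt');
counter=0; % initialise the running total before the loop
getLine=fgetl(fid);
while ischar(getLine)
    ct='I\s'; % matches an I followed by whitespace
    How=regexp(getLine,ct,'match');
    counter=counter+length(How);
    getLine=fgetl(fid);
end
fclose(fid);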
My problem is that I have to count every stand-alone capital I, including ones with no space after them (at the end of a sentence) or no space before them (at the start of a sentence). My pattern is currently ct='I\s', but I don't know if there is an or statement I can use to also incorporate \sI.
Hope I was clear about the question thank you for the help in advance.
What you'd need is something like:
ct = '(?<!\w)(I)(?!\w)';
Here (?<!\w) and (?!\w) denote a negative look-behind and a negative look-ahead respectively for a character from the word character class.
More information about the same may be found here.
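The same lookaround idea carries over to other engines. Below is a minimal Python sketch that counts stand-alone capital I's in the two lines from tolls.txt; the function name count_standalone_i is hypothetical.

```python
import re

# An I that has no word character directly before or after it.
STANDALONE_I = re.compile(r"(?<!\w)I(?!\w)")


def count_standalone_i(lines):
    """Count stand-alone capital I's across an iterable of lines."""
    return sum(len(STANDALONE_I.findall(line)) for line in lines)


lines = [
    "Join Microsoft employees supporting I Inspire Youth Project and other "
    "youth causes #GivingHero: http://msft.it/6013jboz",
    "Waze for #WindowsPhone is here: http://msft.it/6016jbp2 I",
]
print(count_standalone_i(lines))
```

Because the lookarounds do not consume any characters, an I at the start or end of a line is counted without special cases. Be aware that the I in "I'm" also counts, since the apostrophe is not a word character.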
@RoneyMichael's solution is fine (though possibly overkill), but there is an or statement: the | operator. Here is how you could look for three distinct patterns, ' I ' or 'I ' or ' I':
ct='(^I[\W]*\s)|(\sI[\W]*\s)|(\sI[\W]*$)';
How=regexp(getLine,ct,'match')
which returns:
How =
' I ' ' I'
The first and last patterns specifically match the letter 'I' if it occurs at the beginning or the end of the string, respectively. The '[\W]*' matches zero or more occurrences of non-word characters, i.e., punctuation. It's zero or more because of things like '...', '?!', etc. Alternatively, you could explicitly list allowed punctuation by using something like '[\.\?\!]*' instead (just remember that things such as quotes, parentheses, brackets, etc. can also come at the end of a line). Also, you may want to match '"I' or ''I'. In that case you can simply use
ct='(^[\W]*I[\W]*\s)|(\s[\W]*I[\W]*\s)|(\s[\W]*I[\W]*$)';
There are other logical and conditional operators that you can use in regular expressions.
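For comparison, here is the first alternation pattern tried in Python, where | works the same way. The function name find_i_tokens and the sample sentences in the tests are my own. I use finditer and group(0) because, with several capturing groups, findall would return tuples instead of the matched text.

```python
import re

# Three alternatives joined with |: I at the start of the line, I between
# whitespace, and I at the end of the line, each with optional punctuation.
I_WITH_SPACES = re.compile(r"(^I\W*\s)|(\sI\W*\s)|(\sI\W*$)")


def find_i_tokens(line):
    """Return the text of every match, including the surrounding whitespace."""
    return [m.group(0) for m in I_WITH_SPACES.finditer(line)]


print(find_i_tokens("Waze for #WindowsPhone is here: http://msft.it/6016jbp2 I"))
```

One design difference from the lookaround approach: these alternatives consume the whitespace around each I. In a string like "I I x" the single space between the two I's is used up by the first match, so the second I is not found.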

Extracting some data items in a string using regular expression

<![Apple]!>some garbage text may be here<![Banana]!>some garbage text may be here<![Orange]!><![Pear]!><![Pineapple]!>
In the above string, I would like a regex that matches every <![FruitName]!>. Between these tags there may be some garbage text. My first attempt is this:
<!\[[^\]!>]+\]!>
It works, but as you can see I've used this part:
[^\]!>]+
This kills some innocents. If the fruit name contains any one of these characters: ] ! > It'd be discarded and we love eating fruit so much that this should not happen.
How do we construct a regex that disallows exactly this string ]!> in the FruitName while all these can still be obtained?
The above example is just made up by me, I just want to know what the regex would look like if it has to be done in regex.
The simplest way would be <!\[.+?]!> - just don't care about what is matched between the two delimiters at all. Only make sure that it always matches the closing delimiter at the earliest possible opportunity - therefore the ? to make the quantifier lazy.
(Also, no need to escape the ])
About the specification that the sequence ]!> should be "disallowed" within the fruit name - well that's implicit since it is the closing delimiter.
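A short Python sketch of the lazy quantifier at work, with a capturing group added so only the name is returned. The second pattern is a greedy variant included purely for contrast. All names here are illustrative, and the second test string is invented.

```python
import re

# Lazy: stop at the first closing delimiter ]!> that follows.
LAZY_TAG = re.compile(r"<!\[(.+?)]!>")
# Greedy variant for contrast: runs on to the last ]!> in the string.
GREEDY_TAG = re.compile(r"<!\[(.+)]!>")


def fruit_names(text):
    """Return the text between each <![ and the nearest following ]!>."""
    return LAZY_TAG.findall(text)


data = ("<![Apple]!>some garbage text may be here<![Banana]!>"
        "some garbage text may be here<![Orange]!><![Pear]!><![Pineapple]!>")
print(fruit_names(data))
print(len(GREEDY_TAG.findall(data)))
```

Single ], ! or > characters inside a name are kept, since only the full sequence ]!> ends a match.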
To match a fruit name, you could use:
<!\[(.*?)]!>
After the opening <![, this matches the least amount of text that's followed by ]!>. By using .*? instead of .*, the least possible amount of text is matched.
Here's a full regex to match each fruit with the following text:
<!\[(.*?)]!>(.*?)(?=(<!\[)|$)
This uses positive lookahead (?=xxx) to match the beginning of the next tag or end-of-string. Positive lookahead matches but does not consume, so the next fruit can be matched by another application of the same regex.
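Here is how that full regex might be used in Python to pair each fruit with the text that follows it. The function name fruits_with_trailing_text is made up for this sketch.

```python
import re

# Group 1: fruit name. Group 2: text up to (not including) the next <![ tag
# or the end of the string, thanks to the lookahead.
FRUIT_WITH_TEXT = re.compile(r"<!\[(.*?)]!>(.*?)(?=(<!\[)|$)")


def fruits_with_trailing_text(text):
    """Return (fruit, following_text) pairs in order of appearance."""
    return [(m.group(1), m.group(2)) for m in FRUIT_WITH_TEXT.finditer(text)]


data = ("<![Apple]!>some garbage text may be here<![Banana]!>"
        "some garbage text may be here<![Orange]!><![Pear]!><![Pineapple]!>")
for fruit, trailing in fruits_with_trailing_text(data):
    print(fruit, repr(trailing))
```

Since the lookahead does not consume the next <![, each new match can start exactly at the following tag.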
Depending on the language you are using, you can use its string methods to do simple splitting (plus a simple regex that is easier to understand). Split your string using "!>" as the separator. Go through each field and check for <!. If it is found, remove everything from the start of the field up to and including <!. This will give you all the fruits. I use gawk to demonstrate, but the algorithm can be implemented in your language.
eg gawk
# set field separator as !>
awk -F'!>' '
{
    # for each field
    for(i=1;i<=NF;i++){
        # check if there is <!
        if($i ~ /<!/){
            # if <! is found, substitute from front till <!
            gsub(/.*<!/,"",$i)
        }
        # print result
        print $i
    }
}
' file
output
# ./run.sh
[Apple]
[Banana]
[Orange]
[Pear]
[Pineapple]
No complicated regex needed.
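The same splitting algorithm, sketched in Python with plain string methods. The function name fruits_by_splitting is hypothetical. Unlike the awk script, which prints every field, this version keeps only the fields that actually contain <!.

```python
def fruits_by_splitting(text):
    """Split on "!>" and keep what follows the last "<!" in each field."""
    fruits = []
    for field in text.split("!>"):
        # rfind mirrors the greedy /.*<!/ in the awk gsub call.
        marker = field.rfind("<!")
        if marker != -1:
            fruits.append(field[marker + 2:])
    return fruits


data = ("<![Apple]!>some garbage text may be here<![Banana]!>"
        "some garbage text may be here<![Orange]!><![Pear]!><![Pineapple]!>")
print(fruits_by_splitting(data))
```

As with the awk version, the square brackets stay in the output, and a fruit name that itself contains !> would be cut in two by the split.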

RegEx for String.Format

Hiho everyone! :)
I have an application, in which the user can insert a string into a textbox, which will be used for a String.Format output later. So the user's input must have a certain format:
I would like to replace exactly one placeholder, so the string should be of a form like this: "Text{0}Text". So it has to contain at least one '{0}', but no other statement between curly braces, for example no {1}.
For the text before and after the '{0}', I would allow any characters.
So I think, I have to respect the following restrictions: { must be written as {{, } must be written as }}, " must be written as \" and \ must be written as \.
Can somebody tell me, how I can write such a RegEx? In particular, can I do something like 'any character WITHOUT' to exclude the four characters ( {, }, " and \ ) above instead of listing every allowed character?
Many thanks!!
Nikki:)
I hate to be the guy who doesn't answer the question, but really it's poor usability to ask your user to format input to work with String.Format. Provide them with two input fields, so they enter the part before the {0} and the part after the {0}. Then you can just concatenate the strings instead of using String.Format; using String.Format on user-supplied text is just a bad idea.
[^(){}\r\n]+\{0}[^(){}\r\n]+
will match any text except (, ), {, } and linebreaks, then match {0}, then the same as before. There needs to be at least one character before and after the {0}; if you don't want that, replace + with *.
You might also want to anchor the regex to beginning and end of your input string:
^[^(){}\r\n]+\{0}[^(){}\r\n]+$
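A minimal sketch of the anchored pattern as a validation check, written in Python for illustration only (the question is about .NET's String.Format). The function name is_valid_template is made up.

```python
import re

# Text without ( ) { } or line breaks, then {0}, then more such text.
SINGLE_PLACEHOLDER = re.compile(r"^[^(){}\r\n]+\{0}[^(){}\r\n]+$")


def is_valid_template(text):
    """True if text has exactly one {0} with plain text on both sides."""
    return SINGLE_PLACEHOLDER.match(text) is not None


for candidate in ["Text{0}Text", "Text{1}Text", "{0}Text", "a{0}b{0}c"]:
    print(candidate, is_valid_template(candidate))
```

Because braces are excluded from the surrounding text, this rejects a second {0} as well as any other placeholder, but it also rejects escaped braces such as {{.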
(Similar to Tim's answer)
Something like:
^[^{}()]*(\{0})[^{}()]*$
Tested at http://www.regular-expressions.info/javascriptexample.html
It sounds like you're looking for the [^CHARS_GO_HERE] construct. The exact regex you'd need depends on your regex engine, but it would resemble [^({})].
Check out the "Negated Character Classes" section of the Character Class page at Regular-Expressions.info.
I think your question can be answered by the regexp:
^((\{\{|\}\}|\\"|\\\\|[^\{\}\"\\])*(\{0\}))+(\{\{|\}\}|\\"|\\\\|[^\{\}\"\\])*$
Explanation:
The expression is built up as follows:
^(allowed chars {0})+(allowed chars)*$
one or more sequences of allowed chars followed by a {0} with optional allowed chars at the end.
allowed chars is built from the 4 escape sequences you mentioned (I assumed the \ escape is \\ instead of \.) plus any single character that is not one of the escape characters:
(\{\{|\}\}|\\"|\\\\|[^\{\}\"\\])
combined they make up the regexp I started with.
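To show how the pieces fit together, here is a sketch that assembles the expression from the "allowed chars" part in Python. The names ALLOWED, TEMPLATE and is_escaped_template are mine, and I use non-capturing groups, which does not change what is matched.

```python
import re

# One "allowed" unit: an escaped brace pair, an escaped quote, an escaped
# backslash, or any single character that is not { } " or \.
ALLOWED = r'(?:\{\{|\}\}|\\"|\\\\|[^{}"\\])'

# ^(allowed chars {0})+(allowed chars)*$
TEMPLATE = re.compile("^(?:" + ALLOWED + r"*\{0\})+" + ALLOWED + "*$")


def is_escaped_template(text):
    """True if text contains at least one {0} and only properly escaped specials."""
    return TEMPLATE.match(text) is not None


for candidate in ["Text{0}Text", "{{literal}} {0}", "Text{1}Text", r"C:\dir {0}"]:
    print(candidate, is_escaped_template(candidate))
```

Note that the + after the first group allows {0} to appear more than once. If exactly one placeholder is required, dropping that + should be enough.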