I need to write a function to search for single quotes (') while skipping escaped quotes (\'). I know I can do a pattern search using a function like this:
let contains string pattern =
  begin
    let re = Str.regexp_string pattern
    in
    try ignore (Str.search_forward re string 0); true
    with Not_found -> false
  end
But how do I search only for non-escaped quotes?
I'd say a non-escaped quote is one that's at the beginning of the input or is not preceded by a backslash. Unfortunately, special characters in OCaml regular expressions are marked by backslashes, and backslashes need to be doubled in an OCaml string. So you get something like the following:
let neq = "\\(^\\|[^\\]\\)'"
It just says "(the beginning of the input or a non-backslash) followed by a quote".
Don't use Str.regexp_string. Its purpose is to produce a regular expression that matches a given string exactly. You want to use a "real" regular expression. So use Str.regexp.
As a side comment, if you really just want to find unescaped quote characters (rather than learning about regular expressions), it would be much easier just to look for quote characters and then test the previous character to see if it's a backslash.
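The "look at the previous character" idea can be sketched without any regular expression. The following is a minimal Python sketch, not OCaml; the function name `find_unescaped_quote` is hypothetical, and it assumes that a backslash escapes whichever character follows it (so an escaped backslash followed by a quote counts as an unescaped quote).

```python
def find_unescaped_quote(text, start=0):
    """Return the index of the first unescaped single quote, or -1 if none."""
    i = start
    while i < len(text):
        ch = text[i]
        if ch == "\\":
            # Assumption: a backslash escapes the next character, so skip both.
            i += 2
            continue
        if ch == "'":
            return i
        i += 1
    return -1


print(find_unescaped_quote("a\\'sdfde"))  # the quote is escaped
print(find_unescaped_quote("a'sdfde"))    # the quote is not escaped
```

Skipping two characters at a time is what keeps a doubled backslash from being mistaken for an escape of the quote that follows it.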
The String.Escaping module in Core (make sure to install Core, and do an open Core.Std first) lets you do just what you want here.
utop[9]> String.Escaping.index ~escape_char:'\\' "a\\'sdfde" '\'';;
- : int option = None
utop[10]> String.Escaping.index ~escape_char:'\\' "a'sdfde" '\'';;
- : int option = Some 1
utop[11]> String.Escaping.index ~escape_char:'\\' "asdfde" '\'';;
- : int option = None
I have some regex values that I need to use as variables for a new regex.
I want to write something like:
val lowers: Regex = "[a-z]".r
val uppers: Regex = "[A-Z]".r
val letters: Regex = "(lowers | uppers)*".r
But I don’t know the right syntax for it.
If it’s possible, how can it be done in Scala?
Edit
As suggested in the comments, this question is also related to this one when the regex variable is to be added outside the regex, which does not seem to be solving the problem here.
You can use string interpolation to construct a regex from two other strings:
val lowers: Regex = "[a-z]".r
val uppers: Regex = "[A-Z]".r
val letters: Regex = s"($lowers|$uppers)*".r
Please mind the s prefix before the initial " of the string literal.
Output:
([a-z]|[A-Z])*
Note that you should be careful with spaces in the pattern: whitespace is meaningful by default.
If you want to use spaces just for formatting, for ease of reading the regex, you can use the (?x) COMMENTS ("verbose", "freespacing") modifier:
val letters: Regex = s"(?x)($lowers | $uppers)*".r
Keep in mind that in this mode you need to escape any literal whitespace character (tab, space, newline) that the pattern should actually match. You will also have to escape a # character, as it becomes special (it starts a comment) in this mode.
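For comparison, here is the same composition idea as a minimal Python sketch (an analogue, not Scala). It assumes the sub-patterns are available as compiled regexes and uses Python's `re.VERBOSE` flag, which plays the role of the `(?x)` modifier.

```python
import re

lowers = re.compile(r"[a-z]")
uppers = re.compile(r"[A-Z]")

# Interpolate the source text of each sub-pattern into the larger one.
# re.VERBOSE makes the spaces around "|" formatting only.
letters = re.compile(rf"({lowers.pattern} | {uppers.pattern})*", re.VERBOSE)

print(letters.pattern)
print(bool(letters.fullmatch("abcXYZ")))
```

As with the Scala version, it is the pattern text that gets reused, not the compiled object, so any flags set on the sub-patterns are not carried over.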
I want to select dot-separated combinations (such as test.test.test) from a very long string (SQL). The full string could be a single line or multiple lines separated by newlines, and the combination could appear on the first line, on a later line, or in both places.
I need help in writing a regex for it.
Examples:
String s = I am testing something like test.test.test in sentence.
Expected output: test.test.test
Example2 (real usecase):
UPDATE test.table
SET access = 01
WHERE access IN (
SELECT name FROM project.dataset.tablename WHERE name = 'test' GROUP BY 1 )
Expected output: test.table and project.dataset.tablename
Also, can I add a prefix or suffix word (or a space) that must be present wherever this logic is checked? In the case above, if the statement is an update, the regex should pick up test.table, but if the statement is something like select test.table, the regex should not pick up the combination. The same applies to suffixes.
Example3: This is to illustrate the above theory.
INS INTO test.table
SEL 'abcscsc', wu_id.Item_Nbr ,1
FROM test.table as_t
WHERE as_t.old <> 0 AND as_t.date = 11
AND (as_t.numb IN ('11') )
Expected Output: test.table, test.table (Key words are INTO and FROM)
Things Not Needed in selection:as_t.numb, as_t.old, as_t.date
If I get the regex I can use in program to extract this word.
Note: Before and after string words to the combination could be anything like update, select { or(, so we have to find the occurrence of words which are joined together with .(dot) and all the number of such occurrence.
I tried something like this:
(?<=.)(.?)(?=.)(.?) -: This only selected the word between two .dot and not all.
.(?<=.)(.?)(?=.)(.?). - This everything before and after.
To solve your initial problem, we can just use some negation. Here's the pattern I came up with:
[^\s]+\.[^\s]+
[^ ... ] Means to make a character class including everything except for what's between the brackets. In this case, I put \s in there, which matches any whitespace. So [^\s] matches anything that isn't whitespace.
+ Is a quantifier. It means to find as many of the preceding construct as you can without breaking the match. This would happily match everything that's not whitespace, but I follow it with a \., which matches a literal .. The \ is necessary because . means to match any character in regex, so we need to escape it so it only has its literal meaning. This means there has to be a . in this group of non-whitespace characters.
I end the pattern with another [^\s]+, which matches everything after the . until the next whitespace.
Now, to solve your secondary problem, you want to make this match only work if it is preceded by a given keyword. Luckily, regex has a construct almost specifically for this case. It's called a lookbehind. The syntax is (?<= ... ) where the ... is the pattern you want to look for. Using your example, this will only match after the keywords INTO and FROM:
(?<=(?:INTO|FROM)\s)[^\s]+\.[^\s]+
Here (?:INTO|FROM) means to match either the text INTO or the text FROM. I then specify that it should be followed by a whitespace character with \s. One possible problem here is that it will only match if the keywords are written in all upper case. You can change this behavior by specifying the case insensitive flag i to your regex parser. If your regex parser doesn't have a way to specify flags, you can usually still specify it inline by putting (?i) in front of the pattern, like so:
(?i)(?<=(?:INTO|FROM)\s)[^\s]+\.[^\s]+
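As a quick check, both patterns can be run against the examples from the question. This is a minimal Python sketch; the names `DOTTED` and `AFTER_KEYWORD` are just illustrative. The lookbehind works in Python's `re` module here only because INTO and FROM have the same length, since `re` requires fixed-width lookbehinds.

```python
import re

# Any run of non-whitespace characters containing a dot.
DOTTED = re.compile(r"[^\s]+\.[^\s]+")

# The same, but only directly after INTO or FROM (case-insensitive).
AFTER_KEYWORD = re.compile(r"(?<=(?:INTO|FROM)\s)[^\s]+\.[^\s]+", re.IGNORECASE)

sql_update = """UPDATE test.table
SET access = 01
WHERE access IN (
SELECT name FROM project.dataset.tablename WHERE name = 'test' GROUP BY 1 )"""

sql_insert = """INS INTO test.table
SEL 'abcscsc', wu_id.Item_Nbr ,1
FROM test.table as_t
WHERE as_t.old <> 0 AND as_t.date = 11
AND (as_t.numb IN ('11') )"""

all_names = DOTTED.findall(sql_update)
keyword_names = AFTER_KEYWORD.findall(sql_insert)

print(all_names)      # ['test.table', 'project.dataset.tablename']
print(keyword_names)  # ['test.table', 'test.table']
```

Keywords of different lengths (for example UPDATE next to FROM) would need separate lookbehinds, or a capturing group in place of the lookbehind.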
If you are new to regex, I highly recommend using the www.regex101.com website to generate regex and learn how it works. Don't forget to check out the code generator part for getting the regex code based on the programming language you are using, that's a cool feature.
For your question, you need a regex that understands any word character \w that matches between 0 and unlimited times, followed by a dot, followed by another series of word character that repeats between 0 and unlimited times.
So here is my solution to your question:
Your regex in JavaScript:
const regex = /([\w]*[.][\w]*)+/gm;
in Java:
final String regex = "([\\w]*[.][\\w]*)+";
in Python:
regex = r"([\w]*[.][\w]*)+"
in PHP:
$re = '/([\w]*[.][\w]*)+/m';
Note that: this solution is written for your use case (to be used for SQL strings), because now if you have something like '.word' or 'word..word', it will still catch it which I assume you don't have a string like that.
I am parsing values from xml and saving them to variables. I was able to strip all but the braces and double quotes from the string. The value displays like this on the page: ["MPEG Video"].
Here is an example of the parse saving it to a variable:
@video_format = REXML::XPath.each(media_parse_doc, "//track[@type='Video']/Format/text()") { |element| element }
I tried using .ts like this:
@video_format = (REXML::XPath.each(media_parse_doc, "//track[@type='Video']/Format/text()") { |element| element } ).ts('[]"','')
but it did not work. I saw some examples telling you to use gsub, and I looked at the API docs for gsub, but I do not understand the logic in the examples well enough to apply it correctly to my own case. Here is one of the examples:
"foobar".gsub(/^./, "") # => "oobar"
I understand it is removing the first character, but I don't know how to set it up to remove " and [.
Why the /^? Is that ASCII for something? Can someone please show me the correct syntax to remove the double quotes and braces from my variables and explain the logic so I can better understand how to use it on my own in the future?
Thank you for the help!
If you want to understand regular expressions, check out http://rubular.com/.
"foobar".gsub(/^./, "") # => "oobar": that particular example substitutes the first character of the string with "" (i.e., nothing). The reason is that the ^ says "pin the match to the beginning of the string", and the . says "match any character", so it matches any character at the beginning of the string. The enclosing / characters are just the standard delimiters for a regular expression, so it's only the ^. that you need to figure out.
To replace double quotes: 'fo"o"bar'.gsub(/"/, "") # => "foobar"
To replace left square brackets: 'fo[o[bar'.gsub(/\[/, "") # => "foobar" (because square brackets are special characters in regex, you have to prefix them with a \ when you want to use them as 'normal' characters).
To replace all quotes and square brackets in one go: 'fo[o"[b]"ar'.gsub(/("|\[|\])/, "") # => "foobar"
(The parentheses indicate a group, and the pipes | indicate 'or'. So ("|\[|\]) means "match any of the things in this group: a quote, or a left square bracket, or a right square bracket".)
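A character class is a more compact way to say "any one of these characters" than a group with pipes. The following is a minimal Python sketch of that idea (an analogue, not Ruby), applied to the value from the question; the variable names are illustrative.

```python
import re

raw = '["MPEG Video"]'

# ["\[\]] is a character class: it matches one character that is
# a double quote, a left square bracket, or a right square bracket.
cleaned = re.sub(r'["\[\]]', '', raw)

print(cleaned)  # MPEG Video
```

Inside the class, the two brackets still need a backslash so they are not read as the start or end of the class itself.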
But really what you should do is do a good intro tutorial to regular expressions and start from the basics. Once you understand that, it shouldn't be too hard to start composing simple regular expressions of your own.
If you're on a mac, this app is very useful for writing your own regex's: http://krillapps.com/patterns/
I'm trying to replace slashes in a string, but not all of them - only the ones before first comma. To do that, I probably have to find a way to match only slashes being followed by string containing a comma.
Is it possible to do this using one regexp, i.e. without first splitting the string by commas?
Example input string:
Abc1/Def2/Ghi3,/Dore1/Mifa2/Solla3,Sido4
What I want to get:
Abc1.Def2.Ghi3,/Dore1/Mifa2/Solla3,Sido4
I've tried some lookahead and lookbehind techniques with no effect, so currently to do this in e.g. Python I first split the data:
test = 'Abc1/Def2/Ghi3,/Dore1/Mifa2/Solla3,Sido4'
strlist = re.split(r',', test)
result = ','.join([re.sub(r'\/', r'.', strlist[0])] + strlist[1:])
What I would prefer is to use a specific regexp pattern instead of Python-oriented solution though, so essentially I could have a pattern and replacement such that the following code would give me the same result:
result = re.sub(pattern, replacement, test)
Thanks for all regex-avoiding answers - I was wondering if I could do this using only regex (so e.g. I could use sed instead of Python).
item = 'Abc1/Def2/Ghi3,/Dore1/Mifa2/Solla3,Sido4'
print(item.replace("/", ".", item.count("/", 0, item.index(","))))
This prints the result you need. For a simple substitution like this, plain string methods are usually faster and easier to read than a regex.
You could do this with lookbehind expressions that look for both the beginning of the string and no comma. Or don't use re entirely.
s = 'Abc1/Def2/Ghi3,/Dore1/Mifa2/Solla3,Sido4'
left,sep,right = s.partition(',')
sep.join((left.replace('/','.'),right))
Out[24]: 'Abc1.Def2.Ghi3,/Dore1/Mifa2/Solla3,Sido4'
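If a single `re.sub` call is the goal, one option in Python is to match the whole part before the first comma and pass a function as the replacement. This is a minimal sketch, not a portable pattern: Python's `re` module does not support variable-width lookbehind, and the callable replacement has no direct counterpart in sed. The name `dots_before_first_comma` is hypothetical.

```python
import re

test = 'Abc1/Def2/Ghi3,/Dore1/Mifa2/Solla3,Sido4'


def dots_before_first_comma(text):
    # ^[^,]*(?=,) matches everything from the start up to (not including)
    # the first comma; it does not match at all if there is no comma.
    return re.sub(
        r'^[^,]*(?=,)',
        lambda m: m.group(0).replace('/', '.'),
        text,
        count=1,
    )


result = dots_before_first_comma(test)
print(result)  # Abc1.Def2.Ghi3,/Dore1/Mifa2/Solla3,Sido4
```

A string without any comma is returned unchanged, which differs from the `item.index(",")` version, since that one raises `ValueError` in that case.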
Is there a simple way to ignore the white space in a target string when searching for matches using a regular expression pattern? For example, if my search is for "cats", I would want "c ats" or "ca ts" to match. I can't strip out the whitespace beforehand because I need to find the begin and end index of the match (including any whitespace) in order to highlight that match and any whitespace needs to be there for formatting purposes.
You can stick optional whitespace characters \s* between each pair of adjacent characters in your regex. Granted, it will get a bit lengthy.
/cats/ -> /c\s*a\s*t\s*s/
While the accepted answer is technically correct, a more practical approach, if possible, is to just strip whitespace out of both the regular expression and the search string.
If you want to search for "my cats", instead of:
myString.match(/m\s*y\s*c\s*a\s*t\s*s\s*/g)
Just do:
myString.replace(/\s*/g,"").match(/mycats/g)
Warning: You can't automate this on the regular expression by just replacing all spaces with empty strings because they may occur in a negation or otherwise make your regular expression invalid.
Addressing Steven's comment to Sam Dufel's answer
Thanks, sounds like that's the way to go. But I just realized that I only want the optional whitespace characters if they follow a newline. So for example, "c\n ats" or "ca\n ts" should match. But wouldn't want "c ats" to match if there is no newline. Any ideas on how that might be done?
This should do the trick:
/c(?:\n\s*)?a(?:\n\s*)?t(?:\n\s*)?s/
See this page for all the different variations of 'cats' that this matches.
You can also solve this using conditionals, but they are not supported in the javascript flavor of regex.
You could put \s* in between every character in your search string, so if you were looking for cats you would use c\s*a\s*t\s*s
It's long but you could build the string dynamically of course.
You can see it working here: http://www.rubular.com/r/zzWwvppSpE
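Building the pattern dynamically can be sketched in a few lines of Python. The helper name `spaced_pattern` is hypothetical, and the sketch assumes the search term is a literal word, not a regex. Because the search runs on the original string, the match span already includes the whitespace, which is what the highlighting use case needs.

```python
import re


def spaced_pattern(word):
    # Escape each character so it is matched literally, then allow
    # optional whitespace between every pair of adjacent characters.
    return re.compile(r"\s*".join(re.escape(ch) for ch in word))


match = spaced_pattern("cats").search("I like c ats a lot")
print(match.span())   # (7, 12)
print(match.group())  # c ats
```

Whitespace inside the search term itself (as in "my cats") would be matched literally by this sketch; stripping it from the term first avoids that.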
If you only want to allow spaces, then
\bc *a *t *s\b
should do it. To also allow tabs, use
\bc[ \t]*a[ \t]*t[ \t]*s\b
Remove the \b anchors if you also want to find cats within words like bobcats or catsup.
This approach can be used to automate the search
(the following example is in Python, although it can be ported to any language):
you can strip the whitespace beforehand AND save the positions of the non-whitespace characters, so you can use them later to map the boundaries of the match back to positions in the original string, like this:
import re

def regex_search_ignore_space(regex, string):
    no_spaces = ''
    char_positions = []
    for pos, char in enumerate(string):
        if re.match(r'\S', char):  # uppercase \S matches non-whitespace chars
            no_spaces += char
            char_positions.append(pos)
    match = re.search(regex, no_spaces)
    if not match:
        return match
    if match.start() == match.end():
        return ''  # empty match: nothing to map back
    # match.start() and match.end() are indices of start and end
    # of the found string in the spaceless string
    # (as we have searched in it); match.end() is exclusive.
    start = char_positions[match.start()]  # in the original string
    end = char_positions[match.end() - 1] + 1  # one past the last matched char
    matched_string = string[start:end]
    # the match WITH spaces is returned.
    return matched_string

with_spaces = 'a li on and a cat'
print(regex_search_ignore_space('lion', with_spaces))
# prints 'li on'
If you want to go further you can construct the match object and return it instead, so the use of this helper will be more handy.
And the performance of this function can of course also be optimized, this example is just to show the path to a solution.
The accepted answer will not work if and when you are passing a dynamic value (such as "current value" in an array loop) as the regex test value. You would not be able to input the optional white spaces without getting some really ugly regex.
Konrad Hoffner's solution is therefore better in such cases, as it strips whitespace from both the regex and the test string. The test is then conducted as though neither contains any whitespace.