How to access the results of .match as string value in Crystal lang - regex

In many other programming languages, there is a function that takes a regular expression as a parameter and returns an array of string values. This is true of JavaScript and Ruby. The .match in Crystal, however, 1) does not seem to accept the global flag and 2) does not return an array but rather a struct of type Regex::MatchData. (https://crystal-lang.org/api/0.25.1/Regex/MatchData.html)
As an example the following code:
str = "Happy days"
re = /[a-z]+/i
matches = str.match(re)
puts matches
returns Regex::MatchData("Happy")
I am unsure how to convert this result into a string or why this is not the default as it is in the inspiration language (Ruby). I understand this question probably results from my inexperience dealing with structs and compiled languages but I would appreciate an answer in hopes that it might also help someone else coming from a JS/Ruby background.

What if I want to convert to a string merely the first match?
puts "Happy days"[/[a-z]+/i]?
puts "Happy days".match(/[a-z]+/i).try &.[0]
It will try to match a string against /[a-z]+/i regex and if there is a match, Group 0, i.e. the whole match, will be output. Note that the ? after [...] will make it fail gracefully if there is no match found. If you just use puts "??!!"[/[a-z]+/i], an exception will be thrown.
See this online demo.
If you want functionality similar to Ruby's String#scan, which returns all matches found in the input, you may use (only the shortened version is left, as per @Amadan's remark):
matches = str.scan(re).map(&.[0])
Output of the code above:
["Happy", "days"]
Note that:
String#scan returns an array of Regex::MatchData, one for each match.
You can index a match with [0] to get the actual matched text. (Calling .string on it returns the whole original input instead.)

Actually, the posted example returns a #<MatchData "Happy"> in Ruby, which also has no "global" flag; that's what String#scan(Regex) is for, as mentioned by others.
If you want only a single match without going through Regex::MatchData, you can use String#[](Regex):
str = "Happy days"
p str[/[a-z]+/i] # => "Happy"

Related

Extracting key-value pairs from a string using ruby & regex

I want to accomplish the following with ruby and if possible a regex:
Input: "something {\"key\":\"value\",\"key2\":3}"
Output: [["\"key\"", "\"value\""], ["\"key2\"", "3"]]
My attempt so far:
s = "something {key:\"value\",key2:3}"
s.scan(/.* {(?:([^:]+):([^,}]+),?)+}$/)
# Output: [["\"key2\"", "3"]]
For some reason the regex above only matches the last key value pair. Does someone know how to retrieve all the pairs?
Just to be clear, "something" can be any kind of string. For this reason, solutions such as (1) splitting the text directly on the equal or (2) a regex as used in s.scan(/(?:([^:]+):([^,}]+),?)/) don't work for me.
I know there are similar questions on SO. Still, from what I saw, they mostly tend towards the solutions 1 & 2 or focus on a single key value pair.
Your string looks like a JSON data structure encoded as a string. You can use JSON.parse for this, as long as you first remove the word "something " from the string:
require 'json'
string = "something {\"key\":\"value\",\"key2\":3}"
# the following line removes the word something
string = string[string.index("{")..-1]
x = JSON.parse(string)
puts x["key"]
puts x["key2"]
You can then convert that to an array if required.
Alternatively, if you want to use regular expressions, try:
string.scan(/(?:"(\w+)":"?(\w+)"?)/)

error: multiple repeat for regex in robot [duplicate]

I'm trying to determine whether a term appears in a string.
Before and after the term must appear a space, and a standard suffix is also allowed.
Example:
term: google
string: "I love google!!! "
result: found
term: dog
string: "I love dogs "
result: found
I'm trying the following code:
regexPart1 = "\s"
regexPart2 = "(?:s|'s|!+|,|.|;|:|\(|\)|\"|\?+)?\s"
p = re.compile(regexPart1 + term + regexPart2 , re.IGNORECASE)
and get the error:
raise error("multiple repeat")
sre_constants.error: multiple repeat
Update
Real code that fails:
term = 'lg incite" OR author:"http++www.dealitem.com" OR "for sale'
regexPart1 = r"\s"
regexPart2 = r"(?:s|'s|!+|,|.|;|:|\(|\)|\"|\?+)?\s"
p = re.compile(regexPart1 + term + regexPart2 , re.IGNORECASE)
On the other hand, the following term passes smoothly (+ instead of ++)
term = 'lg incite" OR author:"http+www.dealitem.com" OR "for sale'
The problem is that, in a non-raw string, \" is ".
You get lucky with all of your other unescaped backslashes—\s is the same as \\s, not s; \( is the same as \\(, not (, and so on. But you should never rely on getting lucky, or assuming that you know the whole list of Python escape sequences by heart.
Either print out your string and escape the backslashes that get lost (bad), escape all of your backslashes (OK), or just use raw strings in the first place (best).
That being said, your regexp as posted won't match some expressions that it should, but it will never raise that "multiple repeat" error. Clearly, your actual code is different from the code you've shown us, and it's impossible to debug code we can't see.
Now that you've shown a real reproducible test case, that's a separate problem.
You're searching for terms that may have special regexp characters in them, like this:
term = 'lg incite" OR author:"http++www.dealitem.com" OR "for sale'
That p++ in the middle of a regexp means "1 or more of 1 or more of the letter p" (in other words, the same as "1 or more of the letter p") in some regexp languages, "always fail" in others, and "raise an exception" in others. Python's re falls into the last group (at least before Python 3.11, which added possessive quantifiers such as ++). In fact, you can test this in isolation:
>>> re.compile('p++')
error: multiple repeat
If you want to put random strings into a regexp, you need to call re.escape on them.
One more problem (thanks to Ωmega):
. in a regexp means "any character". So, ,|.|;|: (I've just extracted a short fragment of your longer alternation chain) means "a comma, or any character, or a semicolon, or a colon"… which is the same as "any character". You probably wanted to escape the dot, as \.
Putting all three fixes together:
term = 'lg incite" OR author:"http++www.dealitem.com" OR "for sale'
regexPart1 = r"\s"
regexPart2 = r"(?:s|'s|!+|,|\.|;|:|\(|\)|\"|\?+)?\s"
p = re.compile(regexPart1 + re.escape(term) + regexPart2 , re.IGNORECASE)
As Ωmega also pointed out in a comment, you don't need to use a chain of alternations if they're all one character long; a character class will do just as well, more concisely and more readably.
And I'm sure there are other ways this could be improved.
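The character-class suggestion above can be sketched roughly as follows. This is only one possible shape, not the asker's final code: the helper name build_term_pattern and the exact suffix list are assumptions made for illustration. The single-character alternatives are folded into one character class, and re.escape keeps the term literal.

```python
import re

def build_term_pattern(term):
    """Compile a pattern for `term` between whitespace, with an optional suffix.

    Hypothetical helper, named here for illustration only.
    """
    # Multi-character suffixes stay as alternatives; the single-character
    # ones (, . ; : ( ) ") are folded into one character class.
    suffix = r"(?:'s|s|!+|\?+|[,.;:()\"])?"
    # re.escape makes regex metacharacters in the term (such as "++") literal.
    return re.compile(r"\s" + re.escape(term) + suffix + r"\s", re.IGNORECASE)

term = 'lg incite" OR author:"http++www.dealitem.com" OR "for sale'
found_literal = build_term_pattern(term).search("x " + term + " y") is not None
found_google = build_term_pattern("google").search("I love google!!! ") is not None
found_dogs = build_term_pattern("dog").search("I love dogs ") is not None
print(found_literal, found_google, found_dogs)
```

Inside a character class the dot needs no escaping, which removes one of the pitfalls mentioned above.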
The other answer is great, but I would like to point out that using regular expressions to find strings in other strings is not the best way to go about it. In Python, simply write:
if term in string:
    pass  # do whatever
I had an example_str = "i love you c++" and got the "multiple repeat" error when using it in a regex. The error occurs because the string contains "++", and + is a special character in regular expressions. My fix was to call re.escape on the search term (note there is no trailing \b, since there is no word boundary after a +). Here is my code:
import re
example_str = "i love you c++"
word_filter = "c++"  # the term to search for
regex_word = re.search(rf'\b{re.escape(word_filter)}', example_str)
Also make sure that your arguments are in the correct order!
I was trying to run a regular expression on some html code. I kept getting the multiple repeat error, even with very simple patterns of just a few letters.
Turns out I had the pattern and the html mixed up. I tried re.findall(html, pattern) instead of re.findall(pattern, html).
A general solution to "multiple repeat" is using re.escape to match the literal pattern.
Example:
>>> re.compile(re.escape("c++"))
re.compile('c\\+\\+')
However, if you want to match a literal word with a space before and after, try this example:
>>> re.findall(rf"\s{re.escape('c++')}\s", "i love c++ you c++")
[' c++ ']

Regex to extract long hexadecimal string

I want to extract with a regex the value after ajaxBrowserNavigationCheck('&x and before the = from the following javascript code:
if (ajaxBrowserNavigationCheck('&x909ef93d-61ac-4311-ac56-20c2ae9770f5=7ebdc2a4-df58-4c1c-9b50-96964c93e927', '', 'servletcontroller', '')){
processBrowserNavigationButton();
Basically, the values I want to extract are &x909ef93d-61ac-4311-ac56-20c2ae9770f5 (the value before the =, including the &x)
and 7ebdc2a4-df58-4c1c-9b50-96964c93e927 (the value after the =)
Note that the value is there twice (it's after MODE=BROWSER_NAV)
Note that both values have 36 characters, not counting the &x
The &x is always there for the first string
My regex is a bit rusty; here is what I've got so far:
(&x([0-9a-fA-F]|-)+) gets me the first part
(&x([0-9a-fA-F]|-)+)|(=([0-9a-fA-F]|-)+) gets me both, but with the =, which we don't want...
Edit: Sorry that I forgot the language; it's for a JMeter script, which uses Jakarta ORO.
Edit2: I realize I can split those into two variables, or even three, in JMeter, which makes it a bit easier.
Edit3: I removed the window location part because it was misleading, since it was the same as in the ajax part.
in ajaxBrowserNavigationCheck('&x909ef93d-61ac-4311-ac56-20c2ae9770f5=7ebdc2a4-df58-4c1c-9b50-96964c93e927', '', 'servletcontroller', ''))
we want &x909ef93d-61ac-4311-ac56-20c2ae9770f5 and 7ebdc2a4-df58-4c1c-9b50-96964c93e927
You haven't said what language you are using, so it's hard to give a solid answer.
This matches just your targets:
&x[a-fA-F0-9-]*(?==)
The last term is a look ahead, which asserts, but does not capture, an equals sign.
This regex matches all the input and captures each target twice as groups 1 and 2:
(?m).*?(&x[a-fA-F0-9-]*)=.*(&x[a-fA-F0-9-]*)=.*
See a live demo on rubular
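For comparison, here is a minimal Python sketch that pulls out both values with two capture groups around the =. It assumes each id is exactly 36 characters of hex digits and hyphens, as stated in the question. The pattern uses only character classes, counted repetition and groups, but it has not been checked against Jakarta ORO itself.

```python
import re

line = ("if (ajaxBrowserNavigationCheck('&x909ef93d-61ac-4311-ac56-20c2ae9770f5"
        "=7ebdc2a4-df58-4c1c-9b50-96964c93e927', '', 'servletcontroller', '')){")

# Group 1: "&x" plus the 36-character id before "=".
# Group 2: the 36-character id after "=".
pair = re.compile(r"(&x[0-9a-fA-F-]{36})=([0-9a-fA-F-]{36})")

m = pair.search(line)
key, value = m.group(1), m.group(2)
print(key)
print(value)
```

Because the = sits between the two groups, it is matched but never ends up in either extracted value.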

Regexp: Keyword followed by value to extract

I had this question a couple of times before, and I still couldn't find a good answer..
In my current problem, I have a console program output (string) that looks like this:
Number of assemblies processed = 1200
Number of assemblies uninstalled = 1197
Number of failures = 3
Now I want to extract those numbers and to check if there were failures. (That's a gacutil.exe output, btw.) In other words, I want to match any number [0-9]+ in the string that is preceded by 'failures = '.
How would I do that? I want to get the number only. Of course I can match the whole thing like /failures = [0-9]+/ .. and then trim the first characters with length("failures = ") or something like that. The point is, I don't want to do that, it's a lame workaround.
Because it's odd; if my pattern-to-match-but-not-into-output ("failures = ") comes after the thing i want to extract ([0-9]+), there is a way to do it:
pattern(?=expression)
To show the absurdity of this, if the whole file was processed backwards, I could use:
[0-9]+(?= = seruliaf)
... so, is there no forward-way? :T
pattern(?=expression) is a regex positive lookahead. What you are looking for is a regex positive lookbehind, which goes like this: (?<=expression)pattern. This feature is not supported by all flavors of regex, though; it depends on which language you are using.
More info is at regular-expressions.info; for a comparison of lookaround support, scroll down about two-thirds of the page.
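As an illustration, here is how the lookbehind looks in Python, whose re module supports fixed-width lookbehind. The question does not name a language, so the choice of Python is an assumption. A plain capture group is shown next to it as a fallback for flavors without lookbehind.

```python
import re

output = """Number of assemblies processed = 1200
Number of assemblies uninstalled = 1197
Number of failures = 3"""

# Lookbehind: assert that "failures = " precedes the digits
# without making it part of the match.
via_lookbehind = re.search(r"(?<=failures = )[0-9]+", output).group(0)

# Capture group: match the prefix as well, but read only group 1.
via_group = re.search(r"failures = ([0-9]+)", output).group(1)

print(via_lookbehind, via_group)
```

Both approaches return only the number, with no trimming of the prefix afterwards.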
If your console output does actually look like that throughout, try splitting the string on "=" when the word "failure" is found, then get the last element (or the 2nd element). You did not say what your language is, but any decent language with string splitting capability would do the job. For example
gacutil.exe.... | ruby -F"=" -ane "print $F[-1] if /failure/"

Regex: How to match a string that is not only numbers

Is it possible to write a regular expression that matches all strings that do not contain only digits? If we have these strings:
abc
a4c
4bc
ab4
123
It should match the four first, but not the last one. I have tried fiddling around in RegexBuddy with lookaheads and stuff, but I can't seem to figure it out.
(?!^\d+$)^.+$
This says lookahead for lines that do not contain all digits and match the entire line.
Unless I am missing something, I think the most concise regex is...
/\D/
...or in other words, is there a not-digit in the string?
jjnguy had it correct (if slightly redundant) in an earlier revision.
.*?[^0-9].*
@Chad, your regex,
\b.*[a-zA-Z]+.*\b
should probably allow for non letters (eg, punctuation) even though Svish's examples didn't include one. Svish's primary requirement was: not all be digits.
\b.*[^0-9]+.*\b
Then, you don't need the + in there since all you need is to guarantee 1 non-digit is in there (more might be in there as covered by the .* on the ends).
\b.*[^0-9].*\b
Next, you can do away with the \b on either end since these are unnecessary constraints (invoking reference to alphanum and _).
.*[^0-9].*
Finally, note that this last regex shows that the problem can be solved with just the basics, those basics which have existed for decades (eg, no need for the look-ahead feature). In English, the question was logically equivalent to simply asking that 1 counter-example character be found within a string.
We can test this regex in a browser by copying the following into the location bar, replacing the string "6576576i7567" with whatever you want to test.
javascript:alert(new String("6576576i7567").match(".*[^0-9].*"));
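The same check can be run outside a browser. The sketch below uses Python purely as a test harness, which is an assumption since the question names no language. It compares the "find one non-digit" test with the lookahead form on the five strings from the question.

```python
import re

samples = ["abc", "a4c", "4bc", "ab4", "123"]

# "Is there at least one non-digit character anywhere in the string?"
has_non_digit = re.compile(r"\D")
# Lookahead form: reject all-digit strings, then match the entire line.
lookahead_form = re.compile(r"(?!^\d+$)^.+$")

by_non_digit = [s for s in samples if has_non_digit.search(s)]
by_lookahead = [s for s in samples if lookahead_form.search(s)]

print(by_non_digit)
print(by_lookahead)
```

Both lists contain the first four strings and leave out "123".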
/^\d*[a-z][a-z\d]*$/
Or, case insensitive version:
/^\d*[a-z][a-z\d]*$/i
That is: optional digits at the beginning, then at least one letter, then any letters or digits.
Try this:
/^.*\D+.*$/
It returns true if there is any symbol that is not a digit. Works fine with all languages.
Since you said "match", not just validate, the following regex will match correctly
\b.*[a-zA-Z]+.*\b
Passing Tests:
abc
a4c
4bc
ab4
1b1
11b
b11
Failing Tests:
123
If you are trying to match words that contain at least one letter and are formed from digits and letters (or just letters), this is what I have used:
(\d*[a-zA-Z]+\d*)+
If we want to restrict valid characters so that string can be made from a limited set of characters, try this:
(?!^\d+$)^[a-zA-Z0-9_-]{3,}$
or
(?!^\d+$)^[\w-]{3,}$
/\w+/:
Matches any letter, number or underscore. any word character
.*[^0-9]{1,}.*
Works fine for us.
We wanted to use the accepted answer, but it does not work within a YANG model.
The one I provided here is easy to understand:
the start and end can be any characters, but there must be at least one non-numeric character.
I am using /^[0-9]*$/gm in my JavaScript code to see if string is only numbers. If yes then it should fail otherwise it will return the string.
Below is working code snippet with test cases:
function isValidURL(string) {
  var res = string.match(/^[0-9]*$/gm);
  if (res == null)
    return string;
  else
    return "fail";
}
var testCase1 = "abc";
console.log(isValidURL(testCase1)); // abc
var testCase2 = "a4c";
console.log(isValidURL(testCase2)); // a4c
var testCase3 = "4bc";
console.log(isValidURL(testCase3)); // 4bc
var testCase4 = "ab4";
console.log(isValidURL(testCase4)); // ab4
var testCase5 = "123"; // fail here
console.log(isValidURL(testCase5));
I had to do something similar in MySQL and the following whilst over simplified seems to have worked for me:
WHERE fieldname REGEXP '^[a-zA-Z0-9]+$'
AND fieldname NOT REGEXP '^[0-9]+$'
This shows all fields that are alphabetical and alphanumeric but any fields that are just numeric are hidden. This seems to work.
example:
name1 - Displayed
name - Displayed
name2 - Displayed
name3 - Displayed
name4 - Displayed
n4ame - Displayed
324234234 - Not Displayed