Regex for value.contains() in Google Refine - regex

I have a column of strings, and I want to use a regex to find commas or pipes in every cell, and then take an action. I tried this, but it doesn't work (no syntax error, it just matches neither commas nor pipes).
if(value.contains(/(,|\|)/), ...
The funny thing is that the same regex works with the same data in SublimeText. (Yes, I can work on it there and then reimport, but I would like to understand what the difference is or what my mistake is.)
I'm using Google Refine 2.5.

Since value.match should return captured texts, you need to define a regex with a capture group and check if the result is not null.
Also, pay attention to the regex itself: the string should be matched in its entirety:
Attempts to match the string s in its entirety against the regex pattern p and returns an array of capture groups.
So, add .* before and after the pattern you are looking for inside a larger string:
if(value.match(/.*([,|]).*/) != null)
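
To illustrate why the padding matters, here is a minimal Python sketch. It is not GREL: it assumes that Python's re.fullmatch is a fair stand-in for the whole-string matching described in the quoted documentation, and the sample cell value is made up.

```python
import re

# Made-up cell value containing both separators.
cell = "apples,pears|plums"

# Whole-string matching: the bare pattern only accepts a cell that is
# exactly one comma or one pipe, so it fails on a longer string.
bare = re.fullmatch(r"([,|])", cell)

# Padding with .* on both sides lets the pattern cover the entire cell.
padded = re.fullmatch(r".*([,|]).*", cell)

print(bare is None)        # the bare pattern does not match
print(padded is not None)  # the padded pattern does
```

Because the leading .* is greedy, the capture group ends up holding the last separator in the cell rather than the first.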

You can use a combination of if and isNonBlank like:
if(isNonBlank(value.match(/your regex/)), ...

Related

Trying to extract repeating pattern from string in php/javascript

The following is in PHP but the regex will also be used in javascript.
I am trying to extract repeating patterns from a string.
The string can be any of the following:
"something arbitrary"
"D123"
"D111|something"
"D197|what.org|when.net"
"D297|who.197d234.whatever|when.net|some other arbitrary string"
I'm currently using the following regex: /^D([0-9]{3})(?:\|([^\|]+))*/
This correctly does not match the first string and matches the second and third correctly. The problem is that for the fourth and fifth it only captures the Dxxx and the last string. I need each of the strings between the '|' to be captured.
I'm hoping to use a regex as it makes it a single step. I realize I could just detect the leading Dxxx then use explode or split as appropriate to break the strings out. I've just gotten stuck on wanting a single regular expression match step.
This same regex may be used in Python as well so just want a generic regex solution.
There is no way to have a dynamic number of capture groups in a regular expression, but if you know some upper limit to how many parts you would have in one string, you can just repeat the pattern that many times:
/^D([0-9]{3})(?:$|\|)(.*?)(?:$|\|)(.*?)(?:$|\|)(.*?)(?:$|\|)(.*?)(?:$|\|)/
So after the initial ^D([0-9]{3})(?:$|\|) you just repeat (.*?)(?:$|\|) as many times as you need it.
When the string has fewer elements, those remaining capture groups will match the empty string.
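
As a rough illustration, the pattern above can be tried in Python like this. The helper name parse is hypothetical, and dropping the empty trailing groups is an extra step that is not part of the regex itself.

```python
import re

# Pattern from the answer, with room for up to four optional parts.
PATTERN = re.compile(
    r"^D([0-9]{3})(?:$|\|)(.*?)(?:$|\|)(.*?)(?:$|\|)(.*?)(?:$|\|)(.*?)(?:$|\|)"
)

def parse(line):
    """Return (number, parts), or None if the line does not start with Dxxx."""
    m = PATTERN.match(line)
    if m is None:
        return None
    number, *parts = m.groups()
    # Unused capture groups match the empty string; filter them out.
    return number, [p for p in parts if p]

for sample in ["something arbitrary", "D123", "D197|what.org|when.net"]:
    print(sample, "->", parse(sample))
```

A string with more parts than the pattern has groups would lose the extra parts, so the upper limit has to be chosen generously.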
See regex tester.
Is something like preg_match_all() (the PHP variant of a global match) also acceptable for you?
Then you could use:
^(?|D([0-9]{3})|^.+$|(?!^)\|([^|\n]*)(?=\||$))
This will match everything in a string in different matches, e.g. take your string:
D197|what.org|when.net
It will then give you three matches:
D197
what.org
when.net
Running live: https://regex101.com/r/jL2oX6/4 (Everything in green are your group matches. Ignore what's in blue.)
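
Branch reset groups, written (?|...), are a PCRE feature and are not available in Python's re module, so the pattern above cannot be reused there unchanged. One possible alternative, sketched here under that assumption, is the two-step approach the question already mentions: validate the Dxxx prefix, then split on the pipe. The helper name split_parts is hypothetical.

```python
import re

# Checks the leading Dxxx, which must be followed by "|" or the end of the string.
PREFIX = re.compile(r"^D([0-9]{3})(?:\||$)")

def split_parts(line):
    """Return [number, part1, part2, ...], or None if the prefix is missing."""
    m = PREFIX.match(line)
    if m is None:
        return None
    # Everything after the first "|" (if any) is a list of parts.
    return [m.group(1)] + line.split("|")[1:]

print(split_parts("D197|what.org|when.net"))
```

This gives up the single-regex goal, but it handles any number of parts.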

Find Regex mismatch part in a string using vb.net

I have a regular expression
^\d{9}_[a-zA-Z]{1}_(0[1-9]|1[0-2]).(0[1-9]|[1-2][0-9]|3[0-1]).[0-9]{4}_\d*_[0-9a-zA-Z]*_[0-9a-zA-Z]*
and a string that matches it
000066874_A_12.31.2014_001_2Q_ICAN14
If the user mistakenly enters a string in a different format, like
000066874_12.31.14_001_2Q_ICAN14
I need to find out which part of my regex failed. I tried using Regex.Matches and Regex.Match, but with these I couldn't find which part of my string mismatched the regular expression. I am using vb.net.
This is very complicated to do with regex. I managed to make this regex, but you still have to check the capture groups after that.
^(?:(?:(\d{9})|.*?)_)?(?:(?:([a-zA-Z]{1})|.*?)_)?(?:(?:((?:0[1-9]|1[0-2]).(?:0[1-9]|[1-2][0-9]|3[0-1]).[0-9]{4})|.*?)_)?(?:(?:(\d*)|.*?)_)?(?:(?:([0-9a-zA-Z]*)|.*?)_)?(?:([0-9a-zA-Z]*)|.*?)$
This will work, as seen in the demo: https://regex101.com/r/aJ1wG1/2
Each part before an underscore is a capture group; if a capture group is missing, there is an error in that part. As you can see in the demo, $3 is not present in the first example, hence there is a mistake in the date. In the second example, $2 is not present, hence $2 onward are not there. The third example is correct and all 6 capture groups are there.
When regexes get this massive, it's a sign that probably a different method should be used to solve the problem, but this might work for you with some additional code for group result checks.
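
One such different method, sketched in Python rather than vb.net, is to split the input on underscores and check each field against its own small pattern. The field names are hypothetical labels. The dots in the date are escaped and each field must match in full, which is slightly stricter than the original expression.

```python
import re

# (label, pattern) pairs; the patterns are the pieces of the original regex.
FIELDS = [
    ("id", r"\d{9}"),
    ("letter", r"[a-zA-Z]"),
    ("date", r"(0[1-9]|1[0-2])\.(0[1-9]|[12][0-9]|3[01])\.[0-9]{4}"),
    ("sequence", r"\d*"),
    ("code1", r"[0-9a-zA-Z]*"),
    ("code2", r"[0-9a-zA-Z]*"),
]

def first_mismatch(text):
    """Return the label of the first field that fails, or None if all pass."""
    parts = text.split("_")
    for (name, pattern), part in zip(FIELDS, parts):
        if re.fullmatch(pattern, part) is None:
            return name
    # All compared fields passed, but there may be too few or too many.
    if len(parts) != len(FIELDS):
        return "field count"
    return None

print(first_mismatch("000066874_A_12.31.2014_001_2Q_ICAN14"))
print(first_mismatch("000066874_12.31.14_001_2Q_ICAN14"))
```

For the malformed sample the check stops at the second field, because the date sits where the single letter is expected.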

Capture part or whole using regex with same capturing name

Given the following two strings:
\06086-afde-4e46-8886-#xxx.com\0xxx7ccd-6293-4343-8e50-xxx
\0name.surname#xxx.com\0xxx6293-4343-8e50-e1d5-xxx
I am trying to extract 6086-afde-4e46-8886- (if it is a GUID) or name.surname#xxx.com (if it is not a GUID). The difficulty here is that the captured groups must have the same name.
So far, I have
(?<name>(?:\w{4}-){4}|[a-zA-Z.]{1,}#xxx\.com), but this also captures 7ccd-6293-4343-8e50- or 6293-4343-8e50-e1d5- which I don't want.
I was also thinking about something like \\\0(?<name>(?:\w{4}-){4}|[a-zA-Z.]{1,}#xxx\.com)(?:(?:#xxx\.com)?\\\0),
but then is there a way to avoid repeating the xxx.com part (because it is more complicated than that)? Also, this relies on finding \\0, which I'd rather not do, as I don't really know whether it will be found somewhere else in the string.
Thanks..
The following regular expression matches the ID 6086-afde-4e46-8886- and the email name.surname#xxx.com into the same named group, without relying on the start sequence \0:
(?<_name_>[A-Za-z]+\.[A-Za-z]+#xxx\.com|(?:[\w]{4}-){4}(?=#xxx\.com))
This regular expression uses a positive lookahead (?=#xxx\.com) to match the ID without consuming #xxx.com.
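
A quick way to check the lookahead idea is the Python sketch below. Python's re module spells named groups (?P<name>...), so the syntax differs from the pattern above, and the group is simply called name as in the question.

```python
import re

# Same two alternatives as above, using Python's named-group syntax.
PATTERN = re.compile(
    r"(?P<name>[A-Za-z]+\.[A-Za-z]+#xxx\.com|(?:\w{4}-){4}(?=#xxx\.com))"
)

# The two sample strings from the question (raw strings keep the backslashes).
guid_line = r"\06086-afde-4e46-8886-#xxx.com\0xxx7ccd-6293-4343-8e50-xxx"
mail_line = r"\0name.surname#xxx.com\0xxx6293-4343-8e50-e1d5-xxx"

guid_hit = PATTERN.search(guid_line).group("name")
mail_hit = PATTERN.search(mail_line).group("name")

print(guid_hit)
print(mail_hit)
```

The second ID-like run in each string is skipped because it is not followed by #xxx.com.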
Try this:
\\0(?<_name_>(?:[\w\-\.]+))#xxx\.com
And add all allowed characters inside the square brackets.
demo: http://regexhero.net/tester/?id=be0fed5e-1d24-43cc-9db9-812311c17d61
Seems like you're trying to get the first match. If yes then try the below regex.
^.*?(?<name>(?:\w{4}-){4}|[a-zA-Z.]{1,}#xxx.com)
http://regex101.com/r/jC3uR4/5

Greedy and non-greedy regex

I currently have this regex: this\.(.*)?\s[=,]\s, however I have come across a pickle I cannot fix.
I tried the following Regex, which works, but it captures the space as well which I don't want: this\.(.*)?(?<=\s)=|(?<!\s),. What I'm trying to do is match identifier names. An example of what I want and the result is this:
this.""W = blah; which would match ""W. The second regex above does this almost perfectly, however it also captures the space before the = in the first group. Can someone point me in the correct direction to fix this?
EDIT: The reason for not simply using [^\s] in the wildcard group is that sometimes I can get lines like this: this. "$ = blah;
EDIT2: Now I have another issue. It's not matching lines like param1.readBytes(this.=!3,0,param1.readInt()); properly. Instead of matching =!3 it's matching =!3,0. Is there a way to fix this? Again, I cannot simply use [^,] because there could be a name like param1.readBytes(this.,3$,0,param1.readInt()); which should match ,3$.
(.*) will match any character including whitespace.
To force it not to end in whitespace change it to (.*[^\s])
Eg:
this\.(.*[^\s])?\s?[=,]\s
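
As a small check of that pattern, here is a Python sketch run against the first example line from the question. It only covers the spaced assignment form, not the lines from the second edit.

```python
import re

# Pattern from the answer: the group must end in a non-whitespace character.
PATTERN = re.compile(r"this\.(.*[^\s])?\s?[=,]\s")

line = 'this.""W = blah;'
match = PATTERN.search(line)

# Group 1 holds the identifier without the space before "=".
identifier = match.group(1) if match else None
print(repr(identifier))
```

Only the end of the group is constrained, so an identifier that starts with a space, as in the first edit, keeps that leading space.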
For your second edit, it seems like you are writing a language parser. Even though regular expressions are powerful, they do have limits. You need a grammar parser for that.
Maybe you can tell your first group to capture non-space characters instead of any character.
this\.(\S*)?(?<=\s)=|(?<!\s),

Regular expressions middle of string

How can I get part of a SIP URI?
For example, I have the URI sip:username#sip.somedomain.com and I need to get just username. I use the expression [^sip:](.*)[$#]+, but the result is username#. How can I exclude # from the match?
This should do the job:
(?<=^sip:)(.*)(?=[$#])
Use a lookahead instead of actually matching #:
^sip:(.*?)(?=#|\$)
Either you are using a very strange regex flavor, or your starting character class is a mistake. [^sip:] matches a single character that isn't any of s,i,p or :. I am also not certain what the $ character is for, since that isn't a part of SIP syntax.
If lookaheads are not available in your regex flavour (for instance POSIX regexes lack them), you can still match parts of the string in your regex you don't eventually want to return, if you use capture groups and only grab the contents of some of them.
For example:
^sip:(.*?)[$#]+
Then only return the contents of the first capture group.
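
A minimal Python sketch of both variants, using the sample URI from the question:

```python
import re

uri = "sip:username#sip.somedomain.com"

# Capture-group variant: the delimiter is consumed but not returned.
m = re.match(r"^sip:(.*?)[$#]+", uri)
user_from_group = m.group(1) if m else None

# Lookahead variant: the delimiter is checked but not consumed.
m2 = re.match(r"^sip:(.*?)(?=#|\$)", uri)
user_from_lookahead = m2.group(1) if m2 else None

print(user_from_group, user_from_lookahead)
```

In both cases the user name is read from group 1; the full match still includes the sip: prefix.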