match string beginning with phrase but not ending with another phrase - regex

I have a character vector as below :
"sit", "situation", "situat", "lettuce", "situationabcd"
I'd like to subset "sit", "situation" and "situat". In fact, I'd like to subset all strings that begins with "sit" but not the ones ending with "abcd".
I tried "^(?!.*abcd$).*$"
but this one subsets "lettuce" as well.

You can approach it by using negative lookbehind against the end-of-string object ($).
Here an example in Javascript:
var str = [ "sit", "situation", "situat", "lettuce", "situationabcd" ];
var expr = /^sit.*?$(?<!abcd)/;
console.log (str.filter(x=>x.match(expr)));
// Outputs: [ 'sit', 'situation', 'situat' ]
Edit:
Here pre-ES2018 javascript solution:
var str = [ "sit", "situation", "situat", "lettuce", "situationabcd" ];
console.log (
str
.filter(x=>x.match(/^sit/))
.filter(x=>!x.match(/abcd$/))
);
// Outputs: [ 'sit', 'situation', 'situat' ]
In fact this is the original solution I thought to provide but I declined primarily because that the original question asks for single regular expression and don't say if can be approached by more than one and/or in any programming language even javascript.
But, in fact, this is a better solution if you are able to apply two regular expressons for two reasons:
Firstly, lookahead and lookbehind are expensive in all regular expression implementations (ones much more than others but expensive in all cases).
...and because approaching this behaviour avoiding negative lookbehind will be hard and I figure out the solution will be expensive anyway (because "abcd" character position can overlap or not whith the initial "sit" initial substring).

Related

Regex- to get part of String

I have got below string and I need to Get all the values Between Pizzahut: and |.
ABC:2fg45rdvsg|Pizzahut:j34532jdhgj|Dominos:3424232|Pizzahut:3242237|Wendys:3462783|Pizzahut:67688873rg|
I have got RegExpression .scan(/(?<=Pizzahut:)([.*\s\S]+)(?=\|)/) but it fetches
"j34532jdhgj|Dominos:3424232|Pizzahut:3242237|Wendys:3462783|Pizzahut:67688873rg|"
Result should be: 34532jdhgj,3242237,67688873rg
You can use
s='ABC:2fg45rdvsg|Pizzahut:j34532jdhgj|Dominos:3424232|Pizzahut:3242237|Wendys:3462783|Pizzahut:67688873rg|'
p s.scan(/Pizzahut:([^|]+)/).flatten
# => ["j34532jdhgj", "3242237", "67688873rg"]
See this Ruby demo and the Rubular demo.
It does not look possible that you have Pizzahut as a part of another word, but it is possible, use a version with a word boundary, /\bPizzahut:([^|]+)/.
The Pizzahut:([^|]+) matches Pizzahut: and then captures into Group 1 any one or more chars other than a pipe (with ([^|]+)).
Note that String#scan returns the captures only if a pattern contains a capturing group, so you do not need to use lookarounds.
I'm not sure why you're jumping to a regex solution here; that input string clearly looks structured to me, and you would probably do better by splitting it on the delimiters to convert it into a more convenient data structure.
Something like this:
input = "ABC:2fg45rdvsg|Pizzahut:j34532jdhgj|Dominos:3424232|Pizzahut:3242237|Wendys:3462783|Pizzahut:67688873rg"
converted_input = input
.split('|') #=> ["ABC:2fg45rdvsg", "Pizzahut:j34532jdhgj", ... ]
.map { |pair| pair.split(':') } #=> [["ABC", "2fg45rdvsg"], ["Pizzahut", "j34532jdhgj"], ... ]
.group_by(&:first) #=> {"ABC"=>[["ABC", "2fg45rdvsg"]], "Pizzahut"=>[["Pizzahut", "j34532jdhgj"], ... ], "Dominos"=>[["Dominos", "3424232"]], ... ]
.transform_values { |v| v.flat_map(&:last) }
(The above series of transformations is just one possible way; you could probably come up with a dozen similar alternative steps to convert this input into the same hash shown below! For example, by using reduce or even the CSV library.)
Which gives you the final result:
converted_input = {
"ABC" => ["2fg45rdvsg"],
"Pizzahut" => ["j34532jdhgj", "3242237", "67688873rg"],
"Dominos" => ["3424232"],
"Wendys" => ["3462783"]
}
Now that the data is formatted conveniently, obtaining data like your original request becomes trivial:
converted_input["Pizzahut"].join(',') #=> "j34532jdhgj,3242237,67688873rg"
(Although quite likely it would be more suitable to leave it as an Array, not a comma-separated String!!)

eval certain regex from file to replace chars in string

I'm new to ruby so please excuse my ignorance :)
I just learned about eval and I read about its dark sides.
what I've read so far:
When is eval in Ruby justified?
Is 'eval' supposed to be nasty?
Ruby Eval and the Execution of Ruby Code
so what I have to do is to read a file in which there are some text such as /e/ 3 which will replace each e with a 3 after evaluation.
so here what i did so far:(working but..)
def evaluate_lines
result="elt"
IO.foreach("test.txt") do |reg|
reg=reg.chomp.delete(' ')
puts reg
result=result.gsub(eval(reg[0..2]),"#{reg[3..reg.length]}" )
p result
end
end
contents of the test.txt file
/e/ 3
/l/ 1
/t/ 7
/$/ !
/$/ !!
this only works because I know the length of the lines in the file.
so assuming my file has the following /a-z/ 3 my program would be not able to do what is expected from it.
Note
I tried using Regexp.new reg and this resulted with the following /\/e\/3/ which isn't very helpful in this case.
simple example to the `Regexp
str="/e/3"
result="elt"
result=result.gsub(Regexp.new str)
p result #outputs: #<Enumerator: "elt":gsub(/\/e\/3/)>
i already tried stripping off the slashes but even though this wont deliver the desired result thus the gsub() takes two parameters, such as this gsub(/e/, "3").
for the usage of the Regexp, I have already read Convert a string to regular expression ruby
While you can write something to parse that file, it rapidly gets complicated because you have to parse regular expressions. Consider /\/foo\\/.
There are a number of incomplete solutions. You can split on whitespace, but this will fail on /foo bar/.
re, replace = line.split(/\s+/, 2)
You can use a regex. Here's a first stab.
match = "/3/ 4".match(%r{^/(.*)/\s+(.+)})
This fails on escaped /, we need something more complex.
match = '/3\// 4'.match(%r{\A / ((?:[^/]|\\/)*) / \s+ (.+)}x)
I'm going to guess it was not your teacher's intent to have you parsing regexes. For the purposes of the assignment, splitting on whitespace is probably fine. You should clarify with your teacher.
This is a poor data format. It is non-standard, difficult to parse, and has limitations on the replacement. Even a tab-delimited file would be better.
There's little reason to use a non-standard format these days. The simplest thing is to use a standard data format for the file. YAML or JSON are the most obvious choices. For such simple data, I'd suggest JSON.
[
{ "re": "e", "replace": "3" },
{ "re": "l", "replace": "1" }
]
Parsing the file is trivial, use the built-in JSON library.
require 'json'
specs = JSON.load("test.json")
And then you can use them as a list of hashes.
specs.each do |spec|
# No eval necessary.
re = Regexp.new(spec["re"])
# `gsub!` replaces in place
result.gsub!(re, spec["replace"])
end
The data file is extensible. For example, if later you want to add regex options.
[
{ "re": "e", "replace": "3" },
{ "re": "l", "replace": "1", "options": ['IGNORECASE'] }
]
While the teacher may have specified a poor format, pushing back on bad requirements is good practice for being a developer.
Here's a really simple example that uses vi notation like s/.../.../ and s/.../.../g:
def rsub(text, spec)
_, mode, repl, with, flags = spec.match(%r[\A(.)\/((?:[^/]|\\/)*)/((?:[^/]|\\/)*)/(\w*)\z]).to_a
case (mode)
when 's'
if (flags.include?('g'))
text.gsub(Regexp.new(repl), with)
else
text.sub(Regexp.new(repl), with)
end
end
end
Note the matcher looks for non-slash characters ([^/]) or a literal-slash combination (\\/) and splits out the two parts accordingly.
Where you can get results like this:
rsub('sandwich', 's/and/or/')
# => "sorwich"
rsub('and/or', 's/\//,/')
# => "and,or"
rsub('stack overflow', 's/o/O/')
# => "stack Overflow"
rsub('stack overflow', 's/o/O/g')
# => "stack OverflOw"
The principle here is you can use a very simple regular expression to parse out your input regular expression and feed that cleaned up data into Regexp.new. There is absolutely no need for eval here, and if anything that severely limits what you can do.
With a little work you could alter that regular expression to parse what's in your existing file and make it do what you want.

Regex to select text outside of underscores

I am looking for a regex to select the text which falls outside of underscore characters.
Sample text:
PartIWant_partINeedIgnored_morePartsINeedIgnored_PartIwant
Basically I need to be able to select the first keyword which is always before the first underscore and the last keyword which is always after the last underscore. As an additional complexity, there case also be texts which have no underscore at all, these need to be selected completely as well.
The best I got yet was this expression:
^((?! *\_[^)]*\_ *).)*
which is only yielding me the first part, not the second and it has no support for the non-underscore yet at all.
This regex is used in a tool which monitors our http traffic, which means I can only 'select' the part I need but can't invoke functions or replace logic.
Thanks!
Use JavaScript string function split(). Check below example.
var t = "PartIWant_partINeedIgnored_morePartsINeedIgnored_PartIwant";
var arr = t.split('_');
console.log(arr);
//Access the required parts like this
console.log(arr[0] + ' ' + arr[arr.length - 1]);
Perhaps something like this:
/(^[^_]+)|([^_]+$)/g
That is, match either:
^[^_]+ the beginning of the string followed by non-underscores, or
[^_]+$ non-underscores followed by the end of the string.
var regex = /(^[^_]+)|([^_]+$)/g
console.log("A_b_c_D".match(regex)) // ["A", "D"]
console.log("A_b_D".match(regex)) // ["A", "D"]
console.log("A_D".match(regex)) // ["A", "D"]
console.log("AD".match(regex)) // ["AD"]
I'm not sure if you should use a regex here. I think splitting the string at underscore, and using the first and last element of the resulting array might be faster, and less complicated.
Trivial with .replace:
str.replace(/_.*_/, '')
// "PartIWantPartIwant"
With matching, you'd need to be selecting and concatenating groups:
parts = str.match(/^([^_]*).*?([^_]*)$/)
parts[1] + parts[2]
// "PartIWantPartIwant"
EDIT
This regex is used in a tool which monitors our http traffic, which means I can only 'select' the part I need but can't invoke functions or replace logic.
This is not possible: a regular expression cannot match a discontinuous span.

Regular expression any character with dynamic size

I want to use a regular expression that would do the following thing ( i extracted the part where i'm in trouble in order to simplify ):
any character for 1 to 5 first characters, then an "underscore", then some digits, then an "underscore", then some digits or dot.
With a restriction on "underscore" it should give something like that:
^([^_]{1,5})_([\\d]{2,3})_([\\d\\.]*)$
But i want to allow the "_" in the 1-5 first characters in case it still match the end of the regular expression, for example if i had somethink like:
to_to_123_12.56
I think this is linked to an eager problem in the regex engine, nevertheless, i tried to do some lazy stuff like explained here but without sucess.
Any idea ?
I used the following regex and it appeared to work fine for your task. I've simply replaced your initial [^_] with ..
^.{1,5}_\d{2,3}_[\d\.]*$
It's probably best to replace your final * with + too, unless you allow nothing after the final '_'. And note your final part allows multiple '.' (I don't know if that's what you want or not).
For the record, here's a quick Python script I used to verify the regex:
import re
strs = [ "a_12_1",
"abc_12_134",
"abcd_123_1.",
"abcde_12_1",
"a_123_123.456.7890.",
"a_12_1",
"ab_de_12_1",
]
myre = r"^.{1,5}_\d{2,3}_[\d\.]+$"
for str in strs:
m = re.match(myre, str)
if m:
print "Yes:",
if m.group(0) == str:
print "ALL",
else:
print "No:",
print str
Output is:
Yes: ALL a_12_1
Yes: ALL abc_12_134
Yes: ALL abcd_134_1.
Yes: ALL abcde_12_1
Yes: ALL a_123_123.456.7890.
Yes: ALL a_12_1
Yes: ALL ab_de_12_1
^(.{1,5})_(\d{2,3})_([\d.]*)$
works for your example. The result doesn't change whether you use a lazy quantifier or not.
While answering the comment ( writing the lazy expression ), i saw that i did a mistake... if i simply use the folowing classical regex, it works:
^(.{1,5})_([\\d]{2,3})_([\\d\\.]*)$
Thank you.

Is there a RegEx that can parse out the longest list of digits from a string?

I have to parse various strings and determine a prefix, number, and suffix. The problem is the strings can come in a wide variety of formats. The best way for me to think about how to parse it is to find the longest number in the string, then take everything before that as a prefix and everything after that as a suffix.
Some examples:
0001 - No prefix, Number = 0001, No suffix
1-0001 - Prefix = 1-, Number = 0001, No suffix
AAA001 - Prefix = AAA, Number = 001, No suffix
AAA 001.01 - Prefix = AAA , Number = 001, Suffix = .01
1_00001-01 - Prefix = 1_, Number = 00001, Suffix = -01
123AAA 001_01 - Prefix = 123AAA , Number = 001, Suffix = _01
The strings can come with any mixture of prefixes and suffixes, but the key point is the Number portion is always the longest sequential list of digits.
I've tried a variety of RegEx's that work with most but not all of these examples. I might be missing something, or perhaps a RegEx isn't the right way to go in this case?
(The RegEx should be .NET compatible)
UPDATE: For those that are interested, here's the C# code I came up with:
var regex = new System.Text.RegularExpressions.Regex(#"(\d+)");
if (regex.IsMatch(m_Key)) {
string value = "";
int length;
var matches = regex.Matches(m_Key);
foreach (var match in matches) {
if (match.Length >= length) {
value = match.Value;
length = match.Length;
}
}
var split = m_Key.Split(new String[] {value}, System.StringSplitOptions.RemoveEmptyEntries);
m_KeyCounter = value;
if (split.Length >= 1) m_KeyPrefix = split(0);
if (split.Length >= 2) m_KeySuffix = split(1);
}
You're right, this problem can't be solved purely by regular expressions. You can use regexes to "tokenize" (lexically analyze) the input but after that you'll need further processing (parsing).
So in this case I would tokenize the input with (for example) a simple regular expression search (\d+) and then process the tokens (parse). That would involve seeing if the current token is longer than the tokens seen before it.
To gain more understanding of the class of problems regular expressions "solve" and when parsing is needed, you might want to check out general compiler theory, specifically when regexes are used in the construction of a compiler (e.g. http://en.wikipedia.org/wiki/Book:Compiler_construction).
You're input isn't regular so, a regex won't do. I would iterate over the all groups of digits via (\d+) and find the longest and then build a new regex in the form of (.*)<number>(.*) to find your prefix/suffix.
Or if you're comfortable with string operations you can probably just find the start and end of the target group and use substr to find the pre/suf fix.
I don't think you can do this with one regex. I would find all digit sequences within the string (probably with a regex) and then I would select the longest with .NET code, and call Split().
This depends entirely on your Regexp engine. Check your Regexp environment for capturing, there might be something in it like the automatic variables in Perl.
OK, let's talk about your question:
Keep in mind, that both, NFA and DFA, of almost every Regexp engine are greedy, this means, that a (\d+) will always find the longest match, when it "stumbles" over it.
Now, what I can get from your example, is you always need middle portion of a number, try this:
/^(.*\D)?(\d+)(\D.*)?$/ig
The now look at variables $1, $2, $3. Not all of them will exist: if there are all three of them, $2 will hold your number in question, the other vars, parts of the prefix. when one of the prefixes is missing, only variable $1 and $2 will be set, you have to see for yourself, which one is the integer. If both prefix and suffix are missing, $1 will hold the number.
The idea is to make the engine "stumble" over the first few characters and start matching a long number in the middle.
Since the modifier /gis present, you can loop through all available combinations, that the machine finds, you can then simply take the one you like most or something.
This example is in PCRE, but I'm sure .NET has a compatible mode.