Use regex to find a phrase with symbols in an URL - regex

I have several pages with the current url:
onclick="location.href='https://www.mydomain.com/shop/bags
at the end of each url there's something like this:
?cid=Black'"
or
?cid=Beige'"
or
?cid=Green'"
What I need is a regex to find ?cid= in each url and then replace everything from ?cid= to the ending '
CUrrently I have this:
.?cid=.*?'
This finds occurences of ?cid= in EVERY line of code. I only want it to find occurrences in onclick="location.href='https://www.mydomain.com/shop/bags
Any one got any solutions for this?
UPDATE
Sorry for the initial confusion. I'm using this program http://www.araxis.com/replace-in-files/index-eur.html which allows the use of regex's to find elements. I think it says it allows PERL style regex.
Thanks

You can use lookaround syntax to match ?cid=something preceded by the URL and followed by a '
This pattern should work:
(?<=\Qhttps://www.mydomain.com/shop/bags\E)\?cid=[^']++(?=')
If you replace that pattern with your replacement then the entire bit from ?cid until ' will be replaced.
Here is an example in Java (ignore the slightly different syntax):
public static void main(String[] args) {
final String[] in = {
"onclick=\"location.href='https://www.mydomain.com/shop/bags?cid=Black'",
"onclick=\"location.href='https://www.mydomain.com/shop/bags?cid=Beige'",
"onclick=\"location.href='https://www.mydomain.com/shop/bags?cid=Green'"
};
final Pattern pattern = Pattern.compile("(?<=\\Qhttps://www.mydomain.com/shop/bags\\E)\\?cid=[^']++(?=')");
for(final String string : in) {
final Matcher m = pattern.matcher(string);
final String replaced = m.replaceAll("SOMETHING_ELSE");
System.out.println(replaced);
}
}
Output
onclick="location.href='https://www.mydomain.com/shop/bagsSOMETHING_ELSE'
onclick="location.href='https://www.mydomain.com/shop/bagsSOMETHING_ELSE'
onclick="location.href='https://www.mydomain.com/shop/bagsSOMETHING_ELSE'
This assumes, obviously, that your tools supports lookaround.
This should certainly work if you just use Perl directly rather than via your magic tool
perl -pi -e '/s/(?<=\Qhttps://www.mydomain.com/shop/bags\E)\?cid=[^\']++(?=\')/SOMETHING_ELSE/g' *some_?glob*.pattern
EDIT
Another idea is to use a capturing group and a backreference, replace
(\Qhttps://www.mydomain.com/shop/bags\E)\?cid=[^']++
With
$1SOMETHING_ELSE
Another test case in Java:
public static void main(String[] args) {
final String[] in = {
"onclick=\"location.href='https://www.mydomain.com/shop/bags?cid=Black'",
"onclick=\"location.href='https://www.mydomain.com/shop/bags?cid=Beige'",
"onclick=\"location.href='https://www.mydomain.com/shop/bags?cid=Green'"
};
final Pattern pattern = Pattern.compile("(\\Qhttps://www.mydomain.com/shop/bags\\E)\\?cid=[^']++");
for(final String string : in) {
final Matcher m = pattern.matcher(string);
final String replaced = m.replaceAll("$1SOMETHING_ELSE");
System.out.println(replaced);
}
}
Output:
onclick="location.href='https://www.mydomain.com/shop/bagsSOMETHING_ELSE'
onclick="location.href='https://www.mydomain.com/shop/bagsSOMETHING_ELSE'
onclick="location.href='https://www.mydomain.com/shop/bagsSOMETHING_ELSE'

Find
(onclick="location.href='https://www.mydomain.com/shop/bags.*?)\?cid=.*?'
Replace
$1something'

you can use this pattern
\?cid=[^']*
The idea is to use a character class that exclude the final simple quote, then you avoid to use a lazy quantifier.
Note: you can use a possessive quantifier if supported to give the regex engine less work:
\?cid=[^']*+

Related

Parse string using regex

I need to come up with a regular expression to parse my input string. My input string is of the format:
[alphanumeric].[alpha][numeric].[alpha][alpha][alpha].[julian date: yyyyddd]
eg:
A.A2.ABC.2014071
3.M1.MMB.2014071
I need to substring it from the 3rd position and was wondering what would be the easiest way to do it.
Desired result:
A2.ABC.2014071
M1.MMB.2014071
(?i) will be considered as case insensitive.
(?i)^[a-z\d]\.[a-z]\d\.[a-z]{3}\.\d{7}$
Here a-z means any alphabet from a to z, and \d means any digit from 0 to 9.
Now, if you want to remove the first section before dot, then use this regex and replace it with $1 (or may be \1)
(?i)^[a-z\d]\.([a-z]\d\.[a-z]{3}\.\d{7})$
Another option is replace below with empty:
(?i)^[a-z\d]\.
If the input string is just the long form, then you want everything except the first two characters. You could arrange to substitute them with nothing:
s/^..//
Or you could arrange to capture everything except the first two characters:
/^..(.*)/
If the expression is part of a larger string, then the breakdown of the alphanumeric components becomes more important.
The details vary depending on the language that is hosting the regex. The notations written above could be Perl or PCRE (Perl Compatible Regular Expressions). Many other languages would accept these regexes too, but other languages would require tweaks.
Use this regex:
\w.[A-Z]\d.[A-Z]{3}.\d{7}
Use the above regex like this:
String[] in = {
"A.A2.ABC.2014071", "3.M1.MMB.2014071"
};
Pattern p = Pattern.compile("\\w.[A-Z]\\d.[A-Z]{3}.\\d{7}");
for (String s: in ) {
Matcher m = p.matcher(s);
while (m.find()) {
System.out.println("Result: " + m.group().substring(2));
}
}
Live demo: http://ideone.com/tns9iY

Using regex to match certain text

I try to look for this answer for a while but no luck (sorry if I could describe it well). I am still newbie with regex. I am trying to match a string with only number and a certain delimiter. For example: the patter would be 8/16/32/64/.... the number will be split by '/' with arbitrary amount of number, I could find a way to match them.
My attempt is \d+/\d+? but couldn't get it to work.
You could remove the '/' delimiter and then test for the existence of a number
Here is some C# as an example:
static void Main(string[] args)
{
string text = "8/16/32/64/";
Console.WriteLine(text);
TestForNum(text);
text = "8/16/32/64/b";
Console.WriteLine(text);
TestForNum(text);
Console.ReadKey();
}
private static void TestForNum(string text)
{
string tmp = Regex.Replace(text, #"/", "");
Match m = Regex.Match(tmp, #"^\d+$");
if(m.Success)
{
Console.WriteLine("\t" + m.Groups[0]);
}
else Console.WriteLine("\tno match");
}
A naive approach would be
[\d/]+
However, this does match //// as well as just 12345. To match only "proper" strings:
\d+(/\d+)+
Reads digits followed by delimiter+digits repeated at least once. If trailing/leading delimiters are allowed, then
/?(\d+/)+\d*
If you're using a flavor that uses slashes to quote the regex (like javascript), you'll need to escape them:
/\d+(\/\d+)+/
You can do:
(\d+)(\D|$)
See this work That will split a list of digits delimited by any non digit, so 1?2!3.4 would match
If you want a specific delimiter, such as /:
(\d+)(?:/|$)
As simple as possible:
(\d+\/?)+
Every digit followed by [a] slash, as many as possible. You may use g flag for all matches.

Dart: RegExp by example

I'm trying to get my Dart web app to: (1) determine if a particular string matches a given regex, and (2) if it does, extract a group/segment out of the string.
Specifically, I want to make sure that a given string is of the following form:
http://myapp.example.com/#<string-of-1-or-more-chars>[?param1=1&param2=2]
Where <string-of-1-or-more-chars> is just that: any string of 1+ chars, and where the query string ([?param1=1&param2=2]) is optional.
So:
Decide if the string matches the regex; and if so
Extract the <string-of-1-or-more-chars> group/segment out of the string
Here's my best attempt:
String testURL = "http://myapp.example.com/#fizz?a=1";
String regex = "^http://myapp.example.com/#.+(\?)+\$";
RegExp regexp= new RegExp(regex);
Iterable<Match> matches = regexp.allMatches(regex);
String viewName = null;
if(matches.length == 0) {
// testURL didn't match regex; throw error.
} else {
// It matched, now extract "fizz" from testURL...
viewName = ??? // (ex: matches.group(2)), etc.
}
In the above code, I know I'm using the RegExp API incorrectly (I'm not even using testURL anywhere), and on top of that, I have no clue how to use the RegExp API to extract (in this case) the "fizz" segment/group out of the URL.
The RegExp class comes with a convenience method for a single match:
RegExp regExp = new RegExp(r"^http://myapp.example.com/#([^?]+)");
var match = regExp.firstMatch("http://myapp.example.com/#fizz?a=1");
print(match[1]);
Note: I used anubhava's regular expression (yours was not escaping the ? correctly).
Note2: even though it's not necessary here, it is usually a good idea to use raw-strings for regular expressions since you don't need to escape $ and \ in them. Sometimes using triple-quote raw-strings are convenient too: new RegExp(r"""some'weird"regexp\$""").
Try this regex:
String regex = "^http://myapp.example.com/#([^?]+)";
And then grab: matches.group(1)
String regex = "^http://myapp.example.com/#([^?]+)";
Then:
var match = matches.elementAt(0);
print("${match.group(1)}"); // output : fizz

Select a certain part of this string

I need to select a small portion of a string.
Here's an example string: http://itunes.apple.com/app/eyelashes/id564783832?uo=5
I need: 564783832
A couple of things to keep in mind:
The number will always be preceded by id (ie. id564783832)
There may or may not be a ?uo=5 following the number (and it could be other parameters besides uo)
The string I need can be different lengths (won't always be 9 digits)
The text preceding id will have similar formatting (same # of slashes, but text will be different)
This will ultimately be implemented with Ruby.
without knowing your language/tool, just assume look behind was supported.
'(?<=id)\d+'
With awk
awk '{print $2}' FS='(id|?)'
You can match some sequence of digits preceded by "id" - this assumes that those are the only sequence of digits preceded by "id":
(?<=id)\d++
A test case in Java:
public static void main(String[] args) {
String input = "http://itunes.apple.com/app/eyelashes/id564783832?uo=5";
Pattern pattern = Pattern.compile("(?<=id)\\d++");
Matcher matcher = pattern.matcher(input);
while (matcher.find()) {
System.out.println(matcher.group());
}
}
Output
564783832
Here's mine:
[\w\/]+id(\d+)(\?|$)

Multiline Regular Expression replace

Ok, there's lots of regular expressions, but as always, none of them seem to match what I'm trying to do.
I have a text file:
F00220034277909272011
H001500020003000009272011
D001500031034970000400500020000000025000000515000000000
D001500001261770008003200010000000025000000132500000000
H004200020001014209272011
D004200005355800007702200005142000013420000000000000000
D004200031137360000779000005000000012000000000000000000
H050100180030263709272011
D050100001876700006000300019500000025000000250000001500
D050100001247060000071500030000000025000000280000000000
D050100002075670000430400020000000025000000515000000000
D050100008342500007702600005700000010000000000000000700
D050100009460270000702100015205000025000000000000006205
D050100008135120000702400015000000010000000000000001000
D050100006938430000702200026700000010000000000000001000
D050100006423710008000200025700000000000000000000001000
D050100009488040008000600007175000000000000000000001000
D050100001299190000800100016300000000000000000000003950
D050100001244850000800400005407000000000000000000001607
D050100001216280000840200020000000000000001000000006200
D050100001216840000479000008175000000000000100000001000
D050100001265880000410200014350000000000000100000001000
D050100007402650002000300026700000000000000100000001000
D050100001305150002000200016175000000000001000000000000
D050100005435430000899700022350000000000001000000000000
D050100031113850000500200008200000000250000100000001000
and, with a multiline regex (.NET flavored), I want to do a replace so that I get:
H050100180030263709272011
D050100001876700006000300019500000025000000250000001500
D050100001247060000071500030000000025000000280000000000
D050100002075670000430400020000000025000000515000000000
D050100008342500007702600005700000010000000000000000700
D050100009460270000702100015205000025000000000000006205
D050100008135120000702400015000000010000000000000001000
D050100006938430000702200026700000010000000000000001000
D050100006423710008000200025700000000000000000000001000
D050100009488040008000600007175000000000000000000001000
D050100001299190000800100016300000000000000000000003950
D050100001244850000800400005407000000000000000000001607
D050100001216280000840200020000000000000001000000006200
D050100001216840000479000008175000000000000100000001000
D050100001265880000410200014350000000000000100000001000
D050100007402650002000300026700000000000000100000001000
D050100001305150002000200016175000000000001000000000000
D050100005435430000899700022350000000000001000000000000
D050100031113850000500200008200000000250000100000001000
so that, basically, I grab everything that starts with [HD]0501 and nothing else.
I know this seems more suited to a match that a replace, but I'm going through a pre-built engine that accepts a Regex pattern string and a regex replace string only.
What can I supply for a pattern and a replace string to get my desired result? Multiline Regex is a hardcoded configuration?
I originally thought something like this would work:
search:
(?<Match>^[HD]0501\d+$), but this matched nothing.
search:
(?!^[HD]0501\d+$), but this matched a bunch of empty strings, and I couldn't figure out what to put for the replace string.
search:
(?!(?<Omit>^[HD]0501\d+$)), "Group 'Omit' not found."
It seems this should be simple, but as always, Regex manages to make me feel dumb. Help would be greatly appreciated.
Try matching the following pattern:
(?m)^(?![HD]0501).+(\r?\n)?
and replace it with an empty string.
The following demo:
using System;
using System.Text.RegularExpressions;
namespace Test
{
class MainClass
{
public static void Main (string[] args)
{
string input = #"F00220034277909272011
H001500020003000009272011
D001500031034970000400500020000000025000000515000000000
D001500001261770008003200010000000025000000132500000000
H004200020001014209272011
D004200005355800007702200005142000013420000000000000000
D004200031137360000779000005000000012000000000000000000
H050100180030263709272011
D050100001876700006000300019500000025000000250000001500
D050100001247060000071500030000000025000000280000000000
D050100002075670000430400020000000025000000515000000000
D050100008342500007702600005700000010000000000000000700
D050100009460270000702100015205000025000000000000006205
D050100008135120000702400015000000010000000000000001000
D050100006938430000702200026700000010000000000000001000
D050100006423710008000200025700000000000000000000001000
D050100009488040008000600007175000000000000000000001000
D050100001299190000800100016300000000000000000000003950
D050100001244850000800400005407000000000000000000001607
D050100001216280000840200020000000000000001000000006200
D050100001216840000479000008175000000000000100000001000
D050100001265880000410200014350000000000000100000001000
D050100007402650002000300026700000000000000100000001000
D050100001305150002000200016175000000000001000000000000
D050100005435430000899700022350000000000001000000000000
D050100031113850000500200008200000000250000100000001000";
string regex = #"(?m)^(?![HD]0501).+(\r?\n)?";
Console.WriteLine(Regex.Replace(input, regex, ""));
}
}
}
prints:
H050100180030263709272011
D050100001876700006000300019500000025000000250000001500
D050100001247060000071500030000000025000000280000000000
D050100002075670000430400020000000025000000515000000000
D050100008342500007702600005700000010000000000000000700
D050100009460270000702100015205000025000000000000006205
D050100008135120000702400015000000010000000000000001000
D050100006938430000702200026700000010000000000000001000
D050100006423710008000200025700000000000000000000001000
D050100009488040008000600007175000000000000000000001000
D050100001299190000800100016300000000000000000000003950
D050100001244850000800400005407000000000000000000001607
D050100001216280000840200020000000000000001000000006200
D050100001216840000479000008175000000000000100000001000
D050100001265880000410200014350000000000000100000001000
D050100007402650002000300026700000000000000100000001000
D050100001305150002000200016175000000000001000000000000
D050100005435430000899700022350000000000001000000000000
D050100031113850000500200008200000000250000100000001000
A quick explanation:
(?m)
enable multi-line mode so that ^ matches the start of a new line;
^
match the start of a new line;
(?![HD]0501)
look ahead to see if there's no "H0501" or "D0501";
.+
match one or more chars other than line break-chars;
(\r?\n)?
match an optional line break.