Select a certain part of this string - regex

I need to select a small portion of a string.
Here's an example string: http://itunes.apple.com/app/eyelashes/id564783832?uo=5
I need: 564783832
A couple of things to keep in mind:
The number will always be preceded by id (i.e. id564783832)
There may or may not be a ?uo=5 following the number (and it could be other parameters besides uo)
The string I need can be different lengths (won't always be 9 digits)
The text preceding id will have similar formatting (same # of slashes, but text will be different)
This will ultimately be implemented with Ruby.

Without knowing your language or tool, I'll just assume lookbehind is supported:
'(?<=id)\d+'

With awk
awk -F'id|[?]' '{print $2}'

You can match a sequence of digits preceded by "id". This assumes that it is the only sequence of digits preceded by "id" in the string:
(?<=id)\d++
A test case in Java:
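import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        String input = "http://itunes.apple.com/app/eyelashes/id564783832?uo=5";
        // Lookbehind requires "id" just before the digits without including it in the match.
        Pattern pattern = Pattern.compile("(?<=id)\\d++");
        Matcher matcher = pattern.matcher(input);
        while (matcher.find()) {
            System.out.println(matcher.group());
        }
    }
}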
Output
564783832

Here's mine:
[\w\/]+id(\d+)(\?|$)
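The capturing-group idea can be sketched in Java, the language of the test case above. This is only an illustration: the class and method names are made up, and the leading [\w\/]+ is dropped because find() searches anywhere in the string. The digits end up in group 1.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AppIdExtractor {
    // "id", then the digits captured in group 1, then either "?" or end of input.
    private static final Pattern APP_ID = Pattern.compile("id(\\d+)(?:\\?|$)");

    // Returns the digits that follow "id", or an empty Optional if nothing matches.
    static Optional<String> extractId(String url) {
        Matcher m = APP_ID.matcher(url);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(extractId("http://itunes.apple.com/app/eyelashes/id564783832?uo=5").orElse("no match"));
        System.out.println(extractId("http://itunes.apple.com/app/eyelashes/id564783832").orElse("no match"));
    }
}
```

Unlike the lookbehind version, this does not need lookaround support, so the same pattern should carry over to most regex flavors, Ruby included.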

Related

Parse string using regex

I need to come up with a regular expression to parse my input string. My input string is of the format:
[alphanumeric].[alpha][numeric].[alpha][alpha][alpha].[julian date: yyyyddd]
eg:
A.A2.ABC.2014071
3.M1.MMB.2014071
I need to substring it from the 3rd position and was wondering what would be the easiest way to do it.
Desired result:
A2.ABC.2014071
M1.MMB.2014071
(?i) makes the match case-insensitive.
(?i)^[a-z\d]\.[a-z]\d\.[a-z]{3}\.\d{7}$
Here a-z means any letter from a to z, and \d means any digit from 0 to 9.
Now, if you want to remove the first section before the dot, use this regex and replace the match with $1 (or maybe \1, depending on the tool):
(?i)^[a-z\d]\.([a-z]\d\.[a-z]{3}\.\d{7})$
Another option is to replace the match of the following with an empty string:
(?i)^[a-z\d]\.
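Both options can be sketched in Java. The class and method names below are hypothetical; the patterns are the ones given above. When the input does not match, each method returns it unchanged.

```java
import java.util.regex.Pattern;

public class PrefixStripper {
    // Validates the whole string and captures everything after the first section.
    private static final Pattern FULL =
            Pattern.compile("(?i)^[a-z\\d]\\.([a-z]\\d\\.[a-z]{3}\\.\\d{7})$");
    // Matches only the leading "x." section.
    private static final Pattern PREFIX = Pattern.compile("(?i)^[a-z\\d]\\.");

    // Option 1: replace the full match with capture group 1.
    static String stripValidated(String input) {
        return FULL.matcher(input).replaceFirst("$1");
    }

    // Option 2: replace the leading section with an empty string.
    static String stripPrefix(String input) {
        return PREFIX.matcher(input).replaceFirst("");
    }

    public static void main(String[] args) {
        System.out.println(stripValidated("A.A2.ABC.2014071"));
        System.out.println(stripPrefix("3.M1.MMB.2014071"));
    }
}
```

The first option only touches strings that fit the entire format, while the second strips the leading section from any string that starts with one alphanumeric character and a dot.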
If the input string is just the long form, then you want everything except the first two characters. You could arrange to substitute them with nothing:
s/^..//
Or you could arrange to capture everything except the first two characters:
/^..(.*)/
If the expression is part of a larger string, then the breakdown of the alphanumeric components becomes more important.
The details vary depending on the language that is hosting the regex. The notations written above could be Perl or PCRE (Perl Compatible Regular Expressions). Many other languages would accept these regexes too, but other languages would require tweaks.
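Use this regex (the dots are escaped so that they match only a literal dot):
\w\.[A-Z]\d\.[A-Z]{3}\.\d{7}
Use the above regex like this:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        String[] in = {
            "A.A2.ABC.2014071", "3.M1.MMB.2014071"
        };
        Pattern p = Pattern.compile("\\w\\.[A-Z]\\d\\.[A-Z]{3}\\.\\d{7}");
        for (String s : in) {
            Matcher m = p.matcher(s);
            while (m.find()) {
                // Drop the first two characters (the leading section and its dot).
                System.out.println("Result: " + m.group().substring(2));
            }
        }
    }
}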
Live demo: http://ideone.com/tns9iY

Using regex to match certain text

I have been looking for this answer for a while with no luck (sorry if I can't describe it well). I am still a newbie with regex. I am trying to match a string made up only of numbers and a certain delimiter. For example, the pattern would be 8/16/32/64/.... The numbers are separated by '/', and there can be an arbitrary number of them. I couldn't find a way to match them.
My attempt is \d+/\d+? but couldn't get it to work.
You could remove the '/' delimiter and then test for the existence of a number
Here is some C# as an example:
using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main(string[] args)
    {
        string text = "8/16/32/64/";
        Console.WriteLine(text);
        TestForNum(text);
        text = "8/16/32/64/b";
        Console.WriteLine(text);
        TestForNum(text);
        Console.ReadKey();
    }

    private static void TestForNum(string text)
    {
        // Strip the '/' delimiters, then require that only digits remain.
        string tmp = Regex.Replace(text, @"/", "");
        Match m = Regex.Match(tmp, @"^\d+$");
        if (m.Success)
        {
            Console.WriteLine("\t" + m.Groups[0]);
        }
        else Console.WriteLine("\tno match");
    }
}
A naive approach would be
[\d/]+
However, this does match //// as well as just 12345. To match only "proper" strings:
\d+(/\d+)+
This reads: digits, followed by a delimiter plus digits, repeated at least once. If trailing or leading delimiters are allowed, then
/?(\d+/)+\d*
If you're using a flavor that uses slashes to quote the regex (like JavaScript), you'll need to escape them:
/\d+(\/\d+)+/
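To see how the strict and lenient patterns differ, here is a small Java sketch. The class and method names are invented for illustration, and matches() is used so that the whole string has to fit the pattern.

```java
import java.util.regex.Pattern;

public class SlashListValidator {
    // Digits, then one or more "/digits" groups: no leading or trailing slash.
    private static final Pattern STRICT = Pattern.compile("\\d+(/\\d+)+");
    // Optional leading slash, one or more "digits/" groups, optional trailing digits.
    private static final Pattern LENIENT = Pattern.compile("/?(\\d+/)+\\d*");

    static boolean isStrict(String s) {
        // matches() succeeds only if the entire string fits the pattern.
        return STRICT.matcher(s).matches();
    }

    static boolean isLenient(String s) {
        return LENIENT.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"8/16/32/64", "8/16/32/64/", "////", "12345"}) {
            System.out.println(s + " strict=" + isStrict(s) + " lenient=" + isLenient(s));
        }
    }
}
```

Both patterns reject //// and a bare 12345; only the lenient one accepts the trailing slash in 8/16/32/64/.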
You can do:
(\d+)(\D|$)
That will match each run of digits delimited by any non-digit, so 1?2!3.4 would match.
If you want a specific delimiter, such as /:
(\d+)(?:/|$)
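A possible way to collect the numbers with that pattern in Java is to call find() in a loop and read group 1 each time. The names below are hypothetical, and the values are kept as strings so that arbitrarily long digit runs are not a problem.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SlashNumberExtractor {
    // A run of digits (group 1) that is followed by "/" or by the end of input.
    private static final Pattern NUMBER = Pattern.compile("(\\d+)(?:/|$)");

    // Collects every digit run that is terminated by a slash or by the end of the text.
    static List<String> numbers(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = NUMBER.matcher(text);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(numbers("8/16/32/64/"));
    }
}
```

Note that this extracts numbers rather than validating the whole string: stray characters such as a trailing b are simply skipped.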
As simple as possible:
(\d+\/?)+
One or more digits optionally followed by a slash, repeated as many times as possible. You may use the g flag to get all matches.

How can I use regex to ignore strings if they contain a certain string

I am trying to use regex to scan through some log files. In particular, I am looking to pick out lines that meet this format:
IP address or random number "banned.", so for example, "111.111.111.111 banned." or "0320932 banned.", etc.
There should only be 2 groups of characters: the number/IP address and "banned." (there may be more than one space between the words or before them). The string should also not contain "client", "[private]", or "request". For the most part I am just confused about how to go about detecting the groups of characters and avoiding strings that contain those words.
Thanks for any help that you may have to offer
egrep -v '^ *[0-9]+((\.[0-9]+){3})? +banned\.$'
Allows optional leading spaces at the beginning of the line.
Must be followed by an all-digit sequence OR an IP-like address.
Must be followed by at least one space.
Line must end in 'banned.'
Finally, the -v option ensures that only lines NOT matching the regex are returned.
With these constraints you needn't worry about ruling out additional words such as 'client'.
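For comparison, here is a rough Java sketch of the same check used the other way round, keeping the lines that do fit the format (the equivalent of dropping -v). The class and method names are assumptions made for this example. Because matches() has to cover the whole line, a line containing an extra word such as client cannot pass.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class BannedLineFilter {
    // Same expression as in the egrep command; matches() supplies the ^ and $ anchoring.
    private static final Pattern BANNED =
            Pattern.compile(" *[0-9]+((\\.[0-9]+){3})? +banned\\.");

    // True only if the entire line is "<number or IP> banned." with optional leading spaces.
    static boolean isBannedLine(String line) {
        return BANNED.matcher(line).matches();
    }

    // Keeps the lines that fit the format and drops everything else.
    static List<String> bannedLines(List<String> lines) {
        return lines.stream()
                .filter(BannedLineFilter::isBannedLine)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> log = List.of(
                "111.111.111.111 banned.",
                "client 111.111.111.111 banned.",
                "0320932   banned.",
                "2.2.2.2 wibble");
        bannedLines(log).forEach(System.out::println);
    }
}
```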
I'm assuming in the following input data lines 1 and 3 should be dropped:
111.111.111.111 banned.
2.2.2.2 wibble
0320932 banned
1434324 wobble
You can drop them with this grep expression:
$ grep -E -v "[0-9.]+ +banned" logfile.log
2.2.2.2 wibble
1434324 wobble
$
This regular expression matches 1 or more numbers and periods followed by 1 or more spaces followed by the word "banned". Passing -v to grep will cause it to display all lines that do not match the regular expression. Add -i to the grep command to make it case-insensitive.
You want a negating match, which looks like:
/^((?!([\d.\s]+banned\.)).)*$/
See it in action: http://regex101.com/r/bY7pK4
Note your example shows a period after banned. If you don't want it, remove \. from the expression.
Try this RegExp
String regex = "\\d+.\\d+.\\d+.\\d+ banned.";
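This lets you filter both kinds of strings: the unescaped dots match any character, so the pattern also matches a plain number of seven or more digits.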
Example:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        System.out.println("start");
        String src = "657 hi tis is 111.111.111.111 banned. 57 happy i9";
        //String src = "87 working is 0320932 banned. Its ending str 08";
        String regex = "\\d+.\\d+.\\d+.\\d+ banned.";
        Pattern pattern = Pattern.compile(regex);
        Matcher matcher = pattern.matcher(src);
        while (matcher.find()) {
            // Print the start offset and the text of each match.
            System.out.println(matcher.start() + " : " + matcher.group());
        }
    }
}
Let me know if it is not working for you.
trying to match IP address or random number "banned."
This egrep should work for you:
egrep '(([0-9]{1,3}\.){3}[0-9]{1,3}|[0-9]+) +banned' logfile
The following will work when every group of the IP address has exactly three digits, as in the example:
\s*\d\d\d\.\d\d\d\.\d\d\d\.\d\d\d\s*banned\s*

Use regex to find a phrase with symbols in an URL

I have several pages with the current url:
onclick="location.href='https://www.mydomain.com/shop/bags
at the end of each url there's something like this:
?cid=Black'"
or
?cid=Beige'"
or
?cid=Green'"
What I need is a regex to find ?cid= in each url and then replace everything from ?cid= to the ending '
Currently I have this:
.?cid=.*?'
This finds occurrences of ?cid= in EVERY line of code. I only want it to find occurrences in onclick="location.href='https://www.mydomain.com/shop/bags
Has anyone got any solutions for this?
UPDATE
Sorry for the initial confusion. I'm using this program http://www.araxis.com/replace-in-files/index-eur.html which allows the use of regexes to find elements. I think it says it allows Perl-style regex.
Thanks
You can use lookaround syntax to match ?cid=something preceded by the URL and followed by a '.
This pattern should work:
(?<=\Qhttps://www.mydomain.com/shop/bags\E)\?cid=[^']++(?=')
If you replace that pattern with your replacement then the entire bit from ?cid until ' will be replaced.
Here is an example in Java (ignore the slightly different syntax):
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        final String[] in = {
            "onclick=\"location.href='https://www.mydomain.com/shop/bags?cid=Black'",
            "onclick=\"location.href='https://www.mydomain.com/shop/bags?cid=Beige'",
            "onclick=\"location.href='https://www.mydomain.com/shop/bags?cid=Green'"
        };
        // The lookbehind pins the match to the URL; the lookahead leaves the closing quote in place.
        final Pattern pattern = Pattern.compile("(?<=\\Qhttps://www.mydomain.com/shop/bags\\E)\\?cid=[^']++(?=')");
        for (final String string : in) {
            final Matcher m = pattern.matcher(string);
            final String replaced = m.replaceAll("SOMETHING_ELSE");
            System.out.println(replaced);
        }
    }
}
Output
onclick="location.href='https://www.mydomain.com/shop/bagsSOMETHING_ELSE'
onclick="location.href='https://www.mydomain.com/shop/bagsSOMETHING_ELSE'
onclick="location.href='https://www.mydomain.com/shop/bagsSOMETHING_ELSE'
This assumes, obviously, that your tool supports lookaround.
This should certainly work if you just use Perl directly rather than via your magic tool:
perl -pi -e "s{(?<=\Qhttps://www.mydomain.com/shop/bags\E)\?cid=[^']++(?=')}{SOMETHING_ELSE}g" *some_?glob*.pattern
EDIT
Another idea is to use a capturing group and a backreference. Replace
(\Qhttps://www.mydomain.com/shop/bags\E)\?cid=[^']++
With
$1SOMETHING_ELSE
Another test case in Java:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        final String[] in = {
            "onclick=\"location.href='https://www.mydomain.com/shop/bags?cid=Black'",
            "onclick=\"location.href='https://www.mydomain.com/shop/bags?cid=Beige'",
            "onclick=\"location.href='https://www.mydomain.com/shop/bags?cid=Green'"
        };
        // Group 1 captures the URL so that $1 can put it back in the replacement.
        final Pattern pattern = Pattern.compile("(\\Qhttps://www.mydomain.com/shop/bags\\E)\\?cid=[^']++");
        for (final String string : in) {
            final Matcher m = pattern.matcher(string);
            final String replaced = m.replaceAll("$1SOMETHING_ELSE");
            System.out.println(replaced);
        }
    }
}
Output:
onclick="location.href='https://www.mydomain.com/shop/bagsSOMETHING_ELSE'
onclick="location.href='https://www.mydomain.com/shop/bagsSOMETHING_ELSE'
onclick="location.href='https://www.mydomain.com/shop/bagsSOMETHING_ELSE'
Find
(onclick="location.href='https://www.mydomain.com/shop/bags.*?)\?cid=.*?'
Replace
$1something'
You can use this pattern:
\?cid=[^']*
The idea is to use a negated character class that excludes the closing single quote, which avoids the need for a lazy quantifier.
Note: you can use a possessive quantifier if supported to give the regex engine less work:
\?cid=[^']*+

Using Regex is there a way to match outside characters in a string and exclude the inside characters?

I know I can exclude outside characters in a string using look-ahead and look-behind, but I'm not sure about characters in the center.
What I want is to get a match of ABCDEF from the string ABC 123 DEF.
Is this possible with a Regex string? If not, can it be accomplished another way?
EDIT
For more clarification, in the example above I can use the regex string /ABC.*?DEF/ to sort of get what I want, but this includes everything matched by .*?. What I want is to match with something like ABC(match whatever, but then throw it out)DEF resulting in one single match of ABCDEF.
As another example, I can do the following (in pseudocode and regex):
string myStr = "ABC 123 DEF";
string tempMatch = RegexMatch(myStr, "(?<=ABC).*?(?=DEF)"); //Returns " 123 "
string FinalString = myStr.Replace(tempMatch, ""); //Returns "ABCDEF". This is what I want
Again, is there a way to do this with a single regex string?
Since the regex replace feature in most languages does not change the string it operates on (but produces a new one), you can do it as a one-liner in most languages. Firstly, you match everything, capturing the desired parts:
^.*(ABC).*(DEF).*$
(Make sure to use the single-line/"dotall" option if your input contains line breaks!)
And then you replace this with:
$1$2
That will give you ABCDEF in one assignment.
Still, as outlined in the comments and in Mark's answer, the engine does match the stuff in between ABC and DEF. It's only the replacement convenience function that throws it out. But that is supported in pretty much every language, I would say.
Important: this approach will of course only work if your input string contains the desired pattern only once (assuming ABC and DEF are actually variable).
Example implementation in PHP:
$output = preg_replace('/^.*(ABC).*(DEF).*$/s', '$1$2', $input);
Or JavaScript (which had no single-line mode before ES2018 added the s flag, hence [\s\S] instead of the dot):
var output = input.replace(/^[\s\S]*(ABC)[\s\S]*(DEF)[\s\S]*$/, '$1$2');
Or C#:
string output = Regex.Replace(input, @"^.*(ABC).*(DEF).*$", "$1$2", RegexOptions.Singleline);
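The same replacement could look roughly like this in Java, where Pattern.DOTALL plays the role of the single-line option. The class and method names are made up for this sketch. As with the other versions, it assumes the pattern occurs only once in the input.

```java
import java.util.regex.Pattern;

public class OuterKeeper {
    // DOTALL lets "." match line breaks too, so the pattern can span several lines.
    private static final Pattern OUTER =
            Pattern.compile("^.*(ABC).*(DEF).*$", Pattern.DOTALL);

    // Replaces the whole input with the two captured parts; unmatched input is returned as-is.
    static String keepOuter(String input) {
        return OUTER.matcher(input).replaceAll("$1$2");
    }

    public static void main(String[] args) {
        System.out.println(keepOuter("ABC 123 DEF"));
    }
}
```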
A regular expression can contain multiple capturing groups. Each group must consist of consecutive characters so it's not possible to have a single group that captures what you want, but the groups themselves do not have to be contiguous so you can combine multiple groups to get your desired result.
Regular expression
(ABC).*(DEF)
Captures
ABC
DEF
See it online: rubular
Example C# code
string myStr = "ABC 123 DEF";
Match m = Regex.Match(myStr, "(ABC).*(DEF)");
if (m.Success)
{
    string result = m.Groups[1].Value + m.Groups[2].Value; // Gives "ABCDEF"
    // ...
}