Regex to find segment of string searching from end

I'm in Java and have a string that will always be in this format:
;<b>gerg(1314)</b><br> (KC)<br>
The number 461610734 will change and may be any length. I'd like to pick that number out and use it. As you can see, the number sits between a hash # and a ' (in each case the first one found when working backwards from the end of the string).
I can find the numbers after the hash by using ([^\#]+$) and I can find up to the last ' by using ([^\']+$) (but this would be on the wrong side of the '...)
I'm lost... Anyone know how to join these two together and nudge the ' along one to the left to just get the numbers?

Actually, I believe that you could simply extract "the digits that immediately follow a #".
You could then use the following regex: (?<=#)\d+.
On the other hand, if you really want to specify that your digits follow a # and are followed by a ', you could (should?) make use of lookarounds.
The following regex should be what you're looking for:
(?<=#)\d+(?=')
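A quick way to check the lookaround pattern is a small Python sketch. The input string here is hypothetical: the question's full markup is not shown, so the shape below is only assumed from the onClick pattern and the number quoted elsewhere in this thread. The same regex should behave the same way in Java.

```python
import re

# Hypothetical input: an assumed shape with the number between '# and '.
sample = "<a onClick=\"return CCL(this,'#461610734');\">gerg(1314)</a>"


def digits_after_hash(text):
    """Return the digits that sit between a # and a ', or None if there are none."""
    # (?<=#) requires a # just before the digits; (?=') requires a ' just after.
    # Neither lookaround consumes characters, so the match is the digits alone.
    match = re.search(r"(?<=#)\d+(?=')", text)
    return match.group(0) if match else None


print(digits_after_hash(sample))
```

Because the lookarounds are not part of the match, no capturing group is needed.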

Try this:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String str = ";<b>gerg(1314)</b><br> (KC)<br>";
Pattern pattern = Pattern.compile("onClick=\"return CCL\\(this,'#([0-9]+)'");
Matcher matcher = pattern.matcher(str);
if (matcher.find()) {
    System.out.println(matcher.group(1)); // Prints 461610734 for the question's full input
}


RegEx for matching a string after a string up to a comma

Here is a sample string.
"BLAH, blah, going to the store &^5, light Version 12.7(2)L6, anyway
plus other stuff Version 3.3.4.6. Then goes on an on for several lines..."
I want to capture only the first version number, without the word "Version" if possible, and without the periods and parentheses. The match should stop when it encounters a comma. The result would be:
"1272L6"
I don't want it to include other instances of version in the text. Can this be done?
I've tried (?<=version)[^,]* but I know it does not remove the periods and parentheses, and it does not exclude the subsequent versions.
This exact regex may not be the best solution, but it might help you get 1272L6:
([0-9]{2})\.([0-9]{1})\(([0-9]{1})\)([A-Z]{1}[0-9]{1})
It creates four groups (where $1$2$3$4 is your target 1272L6) and skips over the ., ( and ) characters.
You might change {1} to other repetition counts, such as {1,2}.
Assuming the version number has a fixed format but the specific digits and letters vary, you could do this:
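import java.util.regex.Matcher;
import java.util.regex.Pattern;

String s = "this is a test 12.7(2)L6, 13.7(2)L6, 14.7(2)L6";
// Capture a version such as 12.7(2)L6 that is followed by a comma.
String reg = "(\\d\\d\\.\\d\\(\\d\\)[A-Z]\\d),";
Matcher m = Pattern.compile(reg).matcher(s);
if (m.find()) { // find() stops at the first match
    // Strip the periods and parentheses from the captured version.
    System.out.println(m.group(1).replaceAll("[.()]", ""));
}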
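If the version format is not fixed, another option is to anchor on the word "Version", capture everything up to the next comma, and strip the punctuation afterwards. Here is a minimal Python sketch of that idea. The function name first_version is made up for illustration, and it assumes the keyword is spelled "Version" with a capital V, as in the sample text.

```python
import re


def first_version(text):
    """Return the first version after 'Version', minus periods and parentheses."""
    # re.search stops at the first occurrence, so later versions are ignored.
    # [^,]* captures everything up to (not including) the next comma.
    match = re.search(r"Version\s+([^,]*)", text)
    if match is None:
        return None
    # Remove the periods and parentheses from the captured text.
    return re.sub(r"[.()]", "", match.group(1))


sample = (
    "BLAH, blah, going to the store &^5, light Version 12.7(2)L6, anyway\n"
    "plus other stuff Version 3.3.4.6. Then goes on an on for several lines..."
)
print(first_version(sample))
```

Doing the cleanup as a second step keeps the search pattern simple. A single regex cannot delete characters from the middle of its own match.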

Swift 3: extract regex matches with non matching parts

I want to analyze a string by many different patterns for numbers, dates and other strings. So I have an array of patterns I want to check in that order.
let patterns = [... "\\d{6}", "\\d{4}", "\\d" ] // to be extended :-)
let s = "IMG_123456_2006.10.03-13.52.59 Testfile_2009_5"
Starting with the first item in patterns, I need to search the string s. If a match is found, the string should be split into the matched parts, e.g. "2006" and "2009", and the non-matching parts. The remaining parts are then searched with the next pattern, and so on. Assuming I had already defined a pattern for the date/time in the middle and placed it first, the split string should look like:
"IMG_", "123456", "_", "2006.10.03-13.52.59", " Testfile_", "2009", "_", "5"
Can I use built-in functionality of regex.matches, or do I have to write everything on my own?
I have already been able to find a match. But then I have to use the ranges to split the string and repeat this for the remaining parts until no further matches are found. This needs a lot more calculation than I would expect when using the results in match.numberOfRanges. Are there any compact solutions?
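The splitting procedure described in the question can be sketched as a loop that keeps a list of (text, already_matched) pieces and refines only the unmatched ones with each pattern in turn. This is a Python illustration of the algorithm, not a Swift answer. The function name and the date/time pattern are assumptions made for the example.

```python
import re


def split_by_patterns(text, patterns):
    """Split text into matched and unmatched parts, trying patterns in order."""
    pieces = [(text, False)]  # (substring, was it matched by an earlier pattern?)
    for pattern in patterns:
        regex = re.compile(pattern)
        refined = []
        for piece, matched in pieces:
            if matched:
                # Parts claimed by an earlier pattern are never searched again.
                refined.append((piece, True))
                continue
            pos = 0
            for m in regex.finditer(piece):
                if m.start() == m.end():
                    continue  # ignore empty matches
                if m.start() > pos:
                    refined.append((piece[pos:m.start()], False))
                refined.append((m.group(0), True))
                pos = m.end()
            if pos < len(piece):
                refined.append((piece[pos:], False))
        pieces = refined
    return [piece for piece, _ in pieces]


# Assumed date/time pattern placed first, followed by the question's patterns.
patterns = [r"\d{4}\.\d{2}\.\d{2}-\d{2}\.\d{2}\.\d{2}", r"\d{6}", r"\d{4}", r"\d"]
s = "IMG_123456_2006.10.03-13.52.59 Testfile_2009_5"
print(split_by_patterns(s, patterns))
```

Each pattern scans every unmatched piece once, so the work grows roughly with the number of patterns times the length of the string.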

Extract string of numbers from URL using regex PIG

I'm using PIG to generate a list of URLs that have been recently visited. In each of the URLs, there is a string of numbers that represents the product page visited. I'm trying to use a regex_extract_all() function to extract just the string of numbers, which vary in length from 6-8. The string of digits can be found directly after jobs2/view/ and usually ends with +&cd but sometimes they may end with ).
Here are a few example URLs:
(http://a.com/search?q=cache:QD7vZRHkPQoJ:ca.xyz.com/jobs2/view/17069404+&cd=1&hl=en&ct=clnk&gl=ca)
(http://a.com/search?q=cache:G9323j2oNbAJ:ca.xyz.com/jobs2/view/5977065+&cd=1&hl=en&ct=clnk&gl=ca)
(http://a.com/search?q=cache:aNspmG11qAJ:hk.xyz.com/jobs2/view/16988928+&cd=2&hl=zh-TW&ct=clnk&gl=hk)
(http://a.com/search?q=cache:aNspmG11AJ:hk.xyz.com/jobs2/view/16988928+&cd=2&hl=zh-TW&ct=clnk&gl=hk)
(http://a.com/search?q=cache:aNspmG11qAJ:hk.xyz.com/jobs2/view/16988928+&cd=2&hl=zh-TW&ct=cl k&gl=hk)
Here is the current regex I am using:
J = FOREACH jpage GENERATE FLATTEN(REGEX_EXTRACT_ALL(TEXTCOLUMN, '\/view\/(\d+)\+\&')) as (output:chararray)
I have also tried other forms such as:
'[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]', 'view.([0-9]+)', 'view\/([\d]+)\+',
'[0-9][0-9][0-9]+', and
'[0-9][0-9][0-9]*'; none of which work.
Can anybody assist here or have another way of going about it?
Much appreciated,
MM
The reason for "Unexpected character 'D'" is that you need a double backslash instead of a single backslash, e.g. replace [\d+] with [\\d+].
Here is a solution; please validate it against all of your input strings.
input.txt
http://a.com/search?q=cache:QD7vZRHkPQoJ:ca.xyz.com/jobs2/view/17069404+&cd=1&hl=en&ct=clnk&gl=ca
http://a.com/search?q=cache:G9323j2oNbAJ:ca.xyz.com/jobs2/view/5977065+&cd=1&hl=en&ct=clnk&gl=ca
http://a.com/search?q=cache:aNspmG11qAJ:hk.xyz.com/jobs2/view/16988928+&cd=2&hl=zh-TW&ct=clnk&gl=hk
http://a.com/search?q=cache:aNspmG11AJ:hk.xyz.com/jobs2/view/16988928+&cd=2&hl=zh-TW&ct=clnk&gl=hk
http://a.com/search?q=cache:aNspmG11qAJ:hk.xyz.com/jobs2/view/16988928+&cd=2&hl=zh-TW&ct=clk&gl=hk
http://a.com/search?q=cache:aNspmG11qAJ:hk.xyz.com/jobs2/view/16988928)=2&hl=zh-TW&ct=clk&gl=hk
http://webcache.googleusercontent.com/search?q=cache:http://my.linkedin.com/jobs2/view/9919248
Updated Pigscript:
A = LOAD 'input.txt' as line;
B = FOREACH A GENERATE REGEX_EXTRACT(line,'.*/view/(\\d+)([+|&|cd|)?]+)?',1);
dump B;
(17069404)
(5977065)
(16988928)
(16988928)
(16988928)
(16988928)
I'm not familiar with PIG, but this regex will match your target:
(?<=/jobs2/view/)\d+
By using a (non-consuming) lookbehind, the entire match (not just a group of the match) is your number.
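To sanity-check the lookbehind pattern outside Pig, here is a small Python sketch run against a few of the sample URLs from this thread. The helper name job_id is made up for illustration, and whether Pig's REGEX_EXTRACT accepts a lookbehind is not verified here.

```python
import re


def job_id(url):
    """Return the digits that directly follow /jobs2/view/, or None."""
    # The lookbehind has a fixed width, which Python's re module requires.
    match = re.search(r"(?<=/jobs2/view/)\d+", url)
    return match.group(0) if match else None


urls = [
    "http://a.com/search?q=cache:QD7vZRHkPQoJ:ca.xyz.com/jobs2/view/17069404+&cd=1&hl=en&ct=clnk&gl=ca",
    "http://a.com/search?q=cache:aNspmG11qAJ:hk.xyz.com/jobs2/view/16988928)=2&hl=zh-TW&ct=clk&gl=hk",
    "http://webcache.googleusercontent.com/search?q=cache:http://my.linkedin.com/jobs2/view/9919248",
]
for url in urls:
    print(job_id(url))
```

Since \d+ stops at the first non-digit, the pattern does not care whether the number ends with +&cd, with ), or at the end of the string.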

Regexp: Keyword followed by value to extract

I've had this question a couple of times before, and I still couldn't find a good answer.
In my current problem, I have a console program output (string) that looks like this:
Number of assemblies processed = 1200
Number of assemblies uninstalled = 1197
Number of failures = 3
Now I want to extract those numbers and to check if there were failures. (That's a gacutil.exe output, btw.) In other words, I want to match any number [0-9]+ in the string that is preceded by 'failures = '.
How would I do that? I want to get the number only. Of course I can match the whole thing like /failures = [0-9]+/ .. and then trim the first characters with length("failures = ") or something like that. The point is, I don't want to do that, it's a lame workaround.
It's odd, because if my pattern-to-match-but-not-into-output ("failures = ") came after the thing I want to extract ([0-9]+), there would be a way to do it:
pattern(?=expression)
To show the absurdity of this, if the whole file was processed backwards, I could use:
[0-9]+(?= = seruliaf)
... so, is there no forward-way? :T
pattern(?=expression) is a positive lookahead. What you are looking for is a positive lookbehind, written (?<=expression)pattern, but this feature is not supported by all regex flavors; it depends on which language you are using.
More information is available at regular-expressions.info; for a comparison of lookaround support, scroll about two-thirds of the way down the page.
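As an illustration in a flavor that does support lookbehind, here is a minimal Python sketch using the console output from the question. The function names are made up. The second function shows a capturing group, which is the usual fallback where lookbehind is unavailable.

```python
import re

output = """Number of assemblies processed = 1200
Number of assemblies uninstalled = 1197
Number of failures = 3"""


def failures_lookbehind(text):
    """Match only the digits preceded by 'failures = ' (positive lookbehind)."""
    match = re.search(r"(?<=failures = )[0-9]+", text)
    return int(match.group(0)) if match else None


def failures_group(text):
    """Same result with a capturing group, for flavors without lookbehind."""
    match = re.search(r"failures = ([0-9]+)", text)
    return int(match.group(1)) if match else None


print(failures_lookbehind(output), failures_group(output))
```

With the capturing group the full match still includes "failures = ", but group 1 holds only the number, so no manual trimming is needed.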
If your console output does actually look like that throughout, try splitting the string on "=" when the word "failure" is found, then get the last element (or the 2nd element). You did not say what your language is, but any decent language with string splitting capability would do the job. For example
gacutil.exe.... | ruby -F"=" -ane "print $F[-1] if /failure/"

Regex to replace string with another string in MS Word?

Can anyone help me with a regex to turn:
filename_author
to
author_filename
I am using MS Word 2003 and am trying to do this with Word's Find-and-Replace. I've tried the use wildcards feature but haven't had any luck.
Am I only going to be able to do it programmatically?
Here is the regex:
([^_]*)_(.*)
And here is a C# example:
using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        string test = "filename_author";
        // The @ prefix marks a C# verbatim string literal.
        string result = Regex.Replace(test, @"([^_]*)_(.*)", "$2_$1");
        Console.WriteLine(result); // author_filename
    }
}
Here is a Python example:
from re import sub
test = "filename_author"
result = sub('([^_]*)_(.*)', r'\2_\1', test)
Edit: In order to do this in Microsoft Word using wildcards use this as a search string:
(<*>)_(<*>)
and replace with this:
\2_\1
Also, please see Add power to Word searches with regular expressions for an explanation of the syntax I have used above:
The asterisk (*) returns all the text in the word.
The less than and greater than symbols (< >) mark the start and end of each word, respectively. They ensure that the search returns a single word.
The parentheses and the space between them divide the words into distinct groups: (first word) (second word). The parentheses also indicate the order in which you want search to evaluate each expression.
Here you go:
s/^([a-zA-Z]+)_([a-zA-Z]+)$/\2_\1/
Depending on the context, that might be a little greedy.
Search pattern:
([^_]+)_(.+)
Replacement pattern:
$2_$1
In .NET you could use ([^_]+)_([^_]+) as the regex and then $2_$1 as the substitution pattern, for this very specific type of case. If you need more than 2 parts it gets a lot more complicated.
Since you're in MS Word, you might try a non-programming approach. Highlight all of the text, select Table -> Convert -> Text to Table. Set the number of columns at 2. Choose Separate Text At, select the Other radio, and enter an _. That will give you a table. Switch the two columns. Then convert the table back to text using the _ again.
Or you could copy the whole thing to Excel, construct a formula to split and rejoin the text and then copy and paste that back to Word. Either would work.
In C# you could also do something like this.
string[] parts = "filename_author".Split('_');
return parts[1] + "_" + parts[0];
You asked about regex of course, but this might be a good alternative.
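For comparison, the same non-regex idea can be sketched in Python. The function name is made up for illustration. It assumes the split should happen at the first underscore only, which matches what the ([^_]*)_(.*) pattern does when there are several underscores.

```python
def swap_around_underscore(name):
    """Swap the text before and after the first underscore."""
    head, sep, tail = name.partition("_")
    if not sep:
        # No underscore found: leave the name unchanged.
        return name
    return tail + "_" + head


print(swap_around_underscore("filename_author"))
```

Using partition instead of split avoids an index error on names without an underscore and makes the first-underscore choice explicit.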