Splitting a string based on positions with regex - regex

I need to convert this (date) String "12112014" to "12.11.2014"
What I would like to do is:
Split off the first 2 characters "12", add ".",
then take characters 3-4 to get "11", add ".",
and finally take the last 4 characters (5-8) to get "2014".
I already found out how to get the first 2 characters ( "^\d{2}" ), but I failed to get characters based on a position.

Whatever the programming language, you should extract the digit groups from the string and then join them with a ".".
In Perl, it can be done as:
$_ = '12112014';
s/(\d{2})(\d{2})(\d{4})/$1.$2.$3/;
print "$_";

Without you specifying the language you're after, I've picked javascript:
var s = '12012011';
var s2 = s.replace(/(\d{2})(\d{2})(\d{4})/,'$1.$2.$3');
console.log(s2); // prints "12.01.2011"
The gist of it is that you use () to specify groups inside your regular expression and then can use the groups in your replace expression.
Same in Java:
String s = "12012011";
String s2 = s.replaceAll("(\\d{2})(\\d{2})(\\d{4})", "$1.$2.$3");
System.out.println(s2);
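For comparison, the same capture-group replacement can be sketched in Python with re.sub. This is only an illustration; the function name add_dots is made up for the example, and the anchors ^ and $ are an added assumption that the input is exactly eight digits.

```python
import re

def add_dots(digits: str) -> str:
    """Insert dots after the 2nd and 4th character of an 8-digit string."""
    # Three groups: 2 digits, 2 digits, 4 digits; \1, \2, \3 refer back to them.
    return re.sub(r"^(\d{2})(\d{2})(\d{4})$", r"\1.\2.\3", digits)

print(add_dots("12112014"))  # 12.11.2014
```

If the input does not match the pattern, re.sub returns it unchanged, so a caller who needs validation would have to check that separately.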

I don't think you can do that with split alone.
You could expand your expression to:
"(^(\d{2})(\d{2})(\d{4}))"
Then access the groups with the Regex language of your choice and build the string you want.
Note that, regex learning aside, you could always parse the original string into a strongly typed Date or DateTime value instead and output it using the appropriate locale.
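A minimal Python sketch of the date-parsing alternative, assuming the digits are in day-month-year order (the question does not say so explicitly). The helper name reformat_date is hypothetical.

```python
from datetime import datetime

def reformat_date(digits: str) -> str:
    # Assumption: input is ddmmyyyy. strptime raises ValueError for
    # anything that is not a valid date in that layout.
    parsed = datetime.strptime(digits, "%d%m%Y")
    return parsed.strftime("%d.%m.%Y")

print(reformat_date("12112014"))  # 12.11.2014
```

Compared with a pure regex, this also rejects strings that are not real dates, at the cost of committing to one field order.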

Related

Find group of strings starting and ending by a character using regular expression

I have a string, and I want to extract, using regular expressions, groups of characters that are between the character : and the other character /.
typically, here is a string example I'm getting:
'abcd:45.72643,4.91203/Rou:hereanotherdata/defgh'
So I want to retrieve 45.72643,4.91203 and also hereanotherdata,
as they are both between the characters : and /.
I tried this syntax on a simpler string in which the pattern occurs only once:
[tt]=regexp(str,':(\w.*)/','match')
tt = ':45.72643,4.91203/'
but it works only if the pattern occurs once. If I use it on a string containing the pattern multiple times, I get everything between the first : and the last /.
How can I specify that the pattern will occur multiple times, and how can I retrieve each match?
Use lookaround and a lazy quantifier:
regexp(str, '(?<=:).+?(?=/)', 'match')
Example (Matlab R2016b):
>> str = 'abcd:45.72643,4.91203/Rou:hereanotherdata/defgh';
>> result = regexp(str, '(?<=:).+?(?=/)', 'match')
result =
1×2 cell array
'45.72643,4.91203' 'hereanotherdata'
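The same lookaround-plus-lazy-quantifier pattern carries over to other regex engines. A rough Python equivalent, given only as a sketch (the function name between_colon_and_slash is made up for illustration):

```python
import re

def between_colon_and_slash(text: str) -> list:
    # (?<=:) requires a ':' just before the match, (?=/) a '/' just after;
    # the lazy .+? stops at the first '/' instead of the last one.
    return re.findall(r"(?<=:).+?(?=/)", text)

s = "abcd:45.72643,4.91203/Rou:hereanotherdata/defgh"
print(between_colon_and_slash(s))  # ['45.72643,4.91203', 'hereanotherdata']
```

The lazy quantifier is what addresses the original problem: a greedy .* runs to the last / in the string, while .+? ends each match at the nearest one.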
In most languages this is hard to do with a single regexp. Ultimately you'll only ever get back the one string, and you want to get back multiple strings.
I've never used Matlab, so it may be possible in that language, but based on other languages, this is how I'd approach it...
I can't give you the exact code, but a search indicates that in Matlab there is a function called strsplit, example...
C = strsplit(data,':')
That should break your original string up into an array of strings, using the ":" as the break point. You can then ignore the first array element (as it contains the text before the first ":"), loop over the rest of the array, and use a regexp to extract everything that comes before a "/".
So for instance...
'abcd:45.72643,4.91203/Rou:hereanotherdata/defgh'
Breaks down into an array with parts...
1 - 'abcd'
2 - '45.72643,4.91203/Rou'
3 - 'hereanotherdata/defgh'
Then ignore 1, and extract everything before the "/" in 2 and 3.
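The split-then-trim steps above can be sketched as follows. This is written in Python rather than Matlab, purely to illustrate the logic, and the function name extract_fields is hypothetical.

```python
def extract_fields(text: str) -> list:
    # Split on ':' and drop the first piece (it precedes the first ':').
    pieces = text.split(":")[1:]
    # Keep only pieces that contain a '/', and take what comes before it.
    return [piece.split("/", 1)[0] for piece in pieces if "/" in piece]

s = "abcd:45.72643,4.91203/Rou:hereanotherdata/defgh"
print(extract_fields(s))  # ['45.72643,4.91203', 'hereanotherdata']
```

The `if "/" in piece` check is an added assumption: it skips a trailing piece that has a ":" but no closing "/".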
As John Mawer and Adriaan mentioned, strsplit is a good place to start. You can use it for both ':' and '/', but then you will not be able to determine where each of them started. If you apply strsplit twice, you can tell where each ':' starts:
A='abcd:45.72643,4.91203/Rou:hereanotherdata/defgh';
B=cellfun(@(x) strsplit(x,'/'),strsplit(A,':'),'uniformoutput',0);
Now every cell of B after the first holds a piece that followed a ':', itself split on '/'. Pieces that contained a '/' have more than one cell, so you can select those and take the first cell of each:
C=cellfun(@(x) x{1},B(cellfun('length',B)>1),'uniformoutput',0)
C =
1×2 cell array
'45.72643,4.91203' 'hereanotherdata'
Starting in R2016b you can use extractBetween:
>> str = 'abcd:45.72643,4.91203/Rou:hereanotherdata/defgh';
>> result = extractBetween(str,':','/')
result =
2×1 cell array
{'45.72643,4.91203'}
{'hereanotherdata' }
If all your text elements have the same number of delimiters this can be vectorized too.

Swift 3: extract regex matches with non matching parts

I want to analyze a string by many different patterns for numbers, dates and other strings. So I have an array of patterns I want to check in that order.
let patterns = [... "\\d{6}", "\\d{4}", "\\d" ] // to be extended :-)
let s = "IMG_123456_2006.10.03-13.52.59 Testfile_2009_5"
Starting with the first item in patterns, I need to search the string s. If matches are found, the string should be split into the matched parts, e.g. "2006" and "2009", and the non-matching parts. The remaining parts are then searched with the next pattern, and so on. Assuming I already had a pattern for the date/time in the middle, placed as the first item, the split string should look like:
"IMG_", "123456", "_", "2006.10.03-13.52.59", " Testfile_", "2009", "_", "5"
Can I use built-in functionality of regex.matches, or do I have to write everything on my own?
I have already been able to find a match. But then I have to use the ranges to split the string, and do it again and again for the remaining parts until no further matches are found. This needs a lot more calculation than I would expect from using the results in match.numberOfRanges. Is there a compact solution?
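One way to organise the repeated splitting is to keep a list of pieces, each flagged as matched or not, and re-split only the unmatched ones with each successive pattern. The sketch below shows that idea in Python, not Swift. The date/time pattern in first position is an assumption (the question only says such a pattern would exist), and split_by_patterns is a made-up name.

```python
import re

def split_by_patterns(text, patterns):
    """Split text into matched and unmatched pieces, trying patterns in order."""
    # Each piece is (substring, already_matched); matched pieces are never re-split.
    pieces = [(text, False)]
    for pattern in patterns:
        next_pieces = []
        for chunk, matched in pieces:
            if matched:
                next_pieces.append((chunk, True))
                continue
            pos = 0
            for m in re.finditer(pattern, chunk):
                if m.start() > pos:
                    next_pieces.append((chunk[pos:m.start()], False))
                next_pieces.append((m.group(0), True))
                pos = m.end()
            if pos < len(chunk):
                next_pieces.append((chunk[pos:], False))
        pieces = next_pieces
    return [chunk for chunk, _ in pieces]

# The first pattern is a hypothetical date/time pattern, not taken from the question.
patterns = [r"\d{4}\.\d{2}\.\d{2}-\d{2}\.\d{2}\.\d{2}", r"\d{6}", r"\d{4}", r"\d"]
s = "IMG_123456_2006.10.03-13.52.59 Testfile_2009_5"
print(split_by_patterns(s, patterns))
```

The same bookkeeping should translate to NSRegularExpression match ranges, since each match's range plays the role of m.start() and m.end() here.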

Extract string of numbers from URL using regex PIG

I'm using PIG to generate a list of URLs that have been recently visited. In each of the URLs, there is a string of numbers that represents the product page visited. I'm trying to use a regex_extract_all() function to extract just the string of numbers, which vary in length from 6-8. The string of digits can be found directly after jobs2/view/ and usually ends with +&cd but sometimes they may end with ).
Here are a few example URLs:
(http://a.com/search?q=cache:QD7vZRHkPQoJ:ca.xyz.com/jobs2/view/17069404+&cd=1&hl=en&ct=clnk&gl=ca)
(http://a.com/search?q=cache:G9323j2oNbAJ:ca.xyz.com/jobs2/view/5977065+&cd=1&hl=en&ct=clnk&gl=ca)
(http://a.com/search?q=cache:aNspmG11qAJ:hk.xyz.com/jobs2/view/16988928+&cd=2&hl=zh-TW&ct=clnk&gl=hk)
(http://a.com/search?q=cache:aNspmG11AJ:hk.xyz.com/jobs2/view/16988928+&cd=2&hl=zh-TW&ct=clnk&gl=hk)
(http://a.com/search?q=cache:aNspmG11qAJ:hk.xyz.com/jobs2/view/16988928+&cd=2&hl=zh-TW&ct=cl k&gl=hk)
Here is the current regex I am using:
J = FOREACH jpage GENERATE FLATTEN(REGEX_EXTRACT_ALL(TEXTCOLUMN, '\/view\/(\d+)\+\&')) as (output:chararray)
I have also tried other forms such as:
'[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]', 'view.([0-9]+)', 'view\/([\d]+)\+',
'[0-9][0-9][0-9]+', and
'[0-9][0-9][0-9]*'; none of which work.
Can anybody assist here or have another way of going about it?
Much appreciated,
MM
The reason for "Unexpected character 'D'" is that you need a double backslash instead of a single backslash, e.g. just replace \d+ with \\d+.
Here is your solution; please validate it against all your input strings.
input.txt
http://a.com/search?q=cache:QD7vZRHkPQoJ:ca.xyz.com/jobs2/view/17069404+&cd=1&hl=en&ct=clnk&gl=ca
http://a.com/search?q=cache:G9323j2oNbAJ:ca.xyz.com/jobs2/view/5977065+&cd=1&hl=en&ct=clnk&gl=ca
http://a.com/search?q=cache:aNspmG11qAJ:hk.xyz.com/jobs2/view/16988928+&cd=2&hl=zh-TW&ct=clnk&gl=hk
http://a.com/search?q=cache:aNspmG11AJ:hk.xyz.com/jobs2/view/16988928+&cd=2&hl=zh-TW&ct=clnk&gl=hk
http://a.com/search?q=cache:aNspmG11qAJ:hk.xyz.com/jobs2/view/16988928+&cd=2&hl=zh-TW&ct=clk&gl=hk
http://a.com/search?q=cache:aNspmG11qAJ:hk.xyz.com/jobs2/view/16988928)=2&hl=zh-TW&ct=clk&gl=hk
http://webcache.googleusercontent.com/search?q=cache:http://my.linkedin.com/jobs2/view/9919248
Updated Pigscript:
A = LOAD 'input.txt' as line;
B = FOREACH A GENERATE REGEX_EXTRACT(line,'.*/view/(\\d+)([+|&|cd|)?]+)?',1);
dump B;
(17069404)
(5977065)
(16988928)
(16988928)
(16988928)
(16988928)
I'm not familiar with PIG, but this regex will match your target:
(?<=/jobs2/view/)\d+
By using a (non-consuming) look behind, the entire match (not just a group of the match) is your number.
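To check the look-behind pattern outside of Pig, here is a small Python sketch. Whether Pig's regex functions accept the pattern unchanged is not verified here, and extract_job_id is a hypothetical helper name.

```python
import re

# Look-behind: the digits must be directly preceded by "/jobs2/view/".
JOB_ID = re.compile(r"(?<=/jobs2/view/)\d+")

def extract_job_id(url: str):
    match = JOB_ID.search(url)
    return match.group(0) if match else None

urls = [
    "http://a.com/search?q=cache:QD7vZRHkPQoJ:ca.xyz.com/jobs2/view/17069404+&cd=1&hl=en&ct=clnk&gl=ca",
    "http://a.com/search?q=cache:aNspmG11qAJ:hk.xyz.com/jobs2/view/16988928)=2&hl=zh-TW&ct=clk&gl=hk",
    "http://webcache.googleusercontent.com/search?q=cache:http://my.linkedin.com/jobs2/view/9919248",
]
print([extract_job_id(u) for u in urls])  # ['17069404', '16988928', '9919248']
```

Because \d+ stops at the first non-digit, it makes no difference whether the ID is followed by +&cd, by ), or by the end of the string.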

Regex to select semicolons that are not enclosed in double quotes

I have string like
a;b;"aaa;;;bccc";deef
I want to split string based on delimiter ; only if ; is not inside double quotes. So after the split, it will be
a
b
"aaa;;;bccc"
deef
I tried using look-behind, but I'm not able to find a correct regular expression for splitting.
Regular expressions are probably not the right tool for this. If possible you should use a CSV library, specify ; as the delimiter and " as the quote character, this should give you the exact fields you are looking for.
That being said here is one approach that works by ensuring that there are an even number of quotation marks between the ; we are considering the split at and the end of the string.
;(?=(([^"]*"){2})*[^"]*$)
Example: http://www.rubular.com/r/RyLQyR8F19
This will break down if you can have escaped quotation marks within a string, for example a;"foo\"bar";c.
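As a quick check of the look-ahead approach, here is a Python sketch. One detail is an adaptation rather than part of the answer: the groups are written as non-capturing (?:...), because Python's re.split would otherwise insert the captured groups into the result list.

```python
import re

# Split on ';' only when an even number of '"' follows up to the end of the string.
SEMICOLON_OUTSIDE_QUOTES = re.compile(r';(?=(?:(?:[^"]*"){2})*[^"]*$)')

def split_fields(line: str) -> list:
    return SEMICOLON_OUTSIDE_QUOTES.split(line)

print(split_fields('a;b;"aaa;;;bccc";deef'))  # ['a', 'b', '"aaa;;;bccc"', 'deef']
```

The quotes stay attached to the field, and, as noted above, escaped quotes inside a field would throw the even/odd counting off.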
Here is a much cleaner example using Python's csv module:
import csv
import io

# Parse the line as semicolon-delimited CSV with double-quote quoting
reader = csv.reader(io.StringIO('a;b;"aaa;;;bccc";deef'),
                    delimiter=';', quotechar='"')
for row in reader:
    print('\n'.join(row))
Regular expressions will only get messier and break on even minor changes. You are better off using a CSV parser in whatever scripting language you use. Perl has a built-in module called Text::ParseWords (so you don't need to download anything from CPAN if there are restrictions) that lets you specify the delimiter, so you are not limited to ,. Here is a sample snippet:
#!/usr/local/bin/perl
use strict;
use warnings;
use Text::ParseWords;
my $string = 'a;b;"aaa;;;bccc";deef';
my @ary = parse_line(q{;}, 0, $string);
print "$_\n" for @ary;
Output
a
b
aaa;;;bccc
deef
This is kind of ugly, but if you don't have \" inside your quoted strings (meaning you don't have strings that look like "foo bar \"badoo\" goo"), you can split on the " first and then assume that all your even-numbered array elements (counting from one) are, in fact, strings, and split the odd-numbered elements into their component parts on the ; token.
If you do have \" in your strings, then you'll want to first convert those into some other temporary token that you convert back after you've performed your operation.
Here's a fiddle...
http://jsfiddle.net/VW9an/
var str = 'abc;def;ghi"some other dogs say \\"bow; wow; wow\\". yes they do!"and another; and a fifth';
var strCp = str.replace(/\\"/g, "--##--");
var parts = strCp.split(/"/);
var allPieces = new Array();
for (var i in parts) {
    if (i % 2 == 0) {
        var innerParts = parts[i].split(/\;/);
        for (var j in innerParts)
            allPieces.push(innerParts[j]);
    }
    else {
        allPieces.push('"' + parts[i] + '"');
    }
}
for (var a in allPieces) {
    allPieces[a] = allPieces[a].replace(/--##--/g, '\\"');
}
console.log(allPieces);
Match All instead of Splitting
Answering long after the battle because no one used the way that seems the simplest to me.
Once you understand that Match All and Split are Two Sides of the Same Coin, you can use this simple regex:
"[^"]*"|[^";]+
See the matches in the Regex Demo.
The left side of the alternation | matches full quoted strings
The right side matches any chars that are neither ; nor "
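A short Python sketch of the match-all variant, for illustration only (match_fields is a made-up name):

```python
import re

def match_fields(line: str) -> list:
    # Left alternative: a complete quoted string.
    # Right alternative: a run of characters that are neither ';' nor '"'.
    return re.findall(r'"[^"]*"|[^";]+', line)

print(match_fields('a;b;"aaa;;;bccc";deef'))  # ['a', 'b', '"aaa;;;bccc"', 'deef']
```

One difference from a true split is worth keeping in mind: empty fields, as in a;;b, produce no match and are silently dropped.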

Is there a RegEx that can parse out the longest list of digits from a string?

I have to parse various strings and determine a prefix, number, and suffix. The problem is the strings can come in a wide variety of formats. The best way for me to think about how to parse it is to find the longest number in the string, then take everything before that as a prefix and everything after that as a suffix.
Some examples:
0001 - No prefix, Number = 0001, No suffix
1-0001 - Prefix = 1-, Number = 0001, No suffix
AAA001 - Prefix = AAA, Number = 001, No suffix
AAA 001.01 - Prefix = AAA , Number = 001, Suffix = .01
1_00001-01 - Prefix = 1_, Number = 00001, Suffix = -01
123AAA 001_01 - Prefix = 123AAA , Number = 001, Suffix = _01
The strings can come with any mixture of prefixes and suffixes, but the key point is the Number portion is always the longest sequential list of digits.
I've tried a variety of RegEx's that work with most but not all of these examples. I might be missing something, or perhaps a RegEx isn't the right way to go in this case?
(The RegEx should be .NET compatible)
UPDATE: For those that are interested, here's the C# code I came up with:
var regex = new System.Text.RegularExpressions.Regex(@"(\d+)");
if (regex.IsMatch(m_Key)) {
    string value = "";
    int length = 0;
    var matches = regex.Matches(m_Key);
    foreach (System.Text.RegularExpressions.Match match in matches) {
        // ">=" keeps the last of several equally long digit runs
        if (match.Length >= length) {
            value = match.Value;
            length = match.Length;
        }
    }
    var split = m_Key.Split(new String[] {value}, System.StringSplitOptions.RemoveEmptyEntries);
    m_KeyCounter = value;
    if (split.Length >= 1) m_KeyPrefix = split[0];
    if (split.Length >= 2) m_KeySuffix = split[1];
}
You're right, this problem can't be solved purely by regular expressions. You can use regexes to "tokenize" (lexically analyze) the input but after that you'll need further processing (parsing).
So in this case I would tokenize the input with (for example) a simple regular expression search (\d+) and then process the tokens (parse). That would involve seeing if the current token is longer than the tokens seen before it.
To gain more understanding of the class of problems regular expressions "solve" and when parsing is needed, you might want to check out general compiler theory, specifically when regexes are used in the construction of a compiler (e.g. http://en.wikipedia.org/wiki/Book:Compiler_construction).
Your input isn't regular, so a regex alone won't do. I would iterate over all the groups of digits via (\d+), find the longest, and then build a new regex of the form (.*)<number>(.*) to find your prefix and suffix.
Or, if you're comfortable with string operations, you can probably just find the start and end of the target group and use substr to get the prefix and suffix.
I don't think you can do this with one regex. I would find all digit sequences within the string (probably with a regex) and then I would select the longest with .NET code, and call Split().
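The find-all-then-pick-the-longest approach can be sketched like this, in Python rather than .NET, just to show the logic. The tie-breaking rule (the later run wins when two runs are equally long) is an assumption inferred from the example 123AAA 001_01, and split_key is a hypothetical name.

```python
import re

def split_key(key: str):
    """Return (prefix, number, suffix) around the longest run of digits."""
    best = None
    for m in re.finditer(r"\d+", key):
        # ">=" lets a later run win a tie, so "123AAA 001_01" yields "001".
        if best is None or len(m.group()) >= len(best.group()):
            best = m
    if best is None:
        return key, "", ""  # no digits at all
    # Slice by match position instead of splitting on the matched text.
    return key[:best.start()], best.group(), key[best.end():]

for example in ["0001", "1-0001", "AAA001", "AAA 001.01", "1_00001-01", "123AAA 001_01"]:
    print(example, "->", split_key(example))
```

Slicing by position avoids a pitfall of splitting on the matched text: if the same digits occur elsewhere in the key, or there is a suffix but no prefix, a plain split can put the pieces in the wrong slots.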
This depends entirely on your Regexp engine. Check your Regexp environment for capturing, there might be something in it like the automatic variables in Perl.
OK, let's talk about your question:
Keep in mind that both the NFA and DFA implementations in almost every regexp engine are greedy, which means that (\d+) will always take the longest match when it "stumbles" over it.
Now, from what I can gather from your examples, you always need the number in the middle portion. Try this:
/^(.*\D)?(\d+)(\D.*)?$/ig
Then look at the variables $1, $2 and $3. Not all of them will exist: if all three are set, $2 holds the number in question and the other variables hold the surrounding parts. When one of those parts is missing, only $1 and $2 will be set, and you have to see for yourself which one is the number. If both prefix and suffix are missing, $1 will hold the number.
The idea is to make the engine "stumble" over the first few characters and start matching a long number in the middle.
Since the /g modifier is present, you can loop through all the combinations the engine finds and simply pick the one you like best.
This example is in PCRE, but I'm sure .NET has a compatible mode.