Hadoop Pig: Extract all substrings matching a given regular expression - regex

I am parsing some data of the form:
(['L123', 'L234', 'L1', 'L253764'])
(['L23', 'L2'])
(['L5'])
...
where the phrases inside the parens, including the brackets, are encoded as a single chararray.
I want to extract just the L+(digits) tags to obtain tuples of the form:
((L123, L234, L1, L253764))
((L23, L2))
((L5))
I have tried using REGEX_EXTRACT_ALL using the regular expression '(L\d+)', but it only seems to extract a single tag per line, which is useless to me. Is there a way to create tuples in the way I have described above?

If order does not matter, then this will work:
-- foo is the tuple, and bar is the name of the chararray
B = FOREACH A GENERATE TOKENIZE(foo.bar, ',') AS values: {T: (value: chararray)} ;
C = FOREACH B {
clean_values = FOREACH values GENERATE
REGEX_EXTRACT(value, '(L[0-9]+)', 1) AS clean_value: chararray ;
GENERATE clean_values ;
}
The schema and output are:
C: {clean_values: {T: (clean_value: chararray)}}
({(L123),(L234),(L1),(L253764)})
({(L23),(L2)})
({(L5)})
Generally, if you don't know how many elements the array will have then a bag will be better.

Related

How can I extract a file name based on number string?

I have a list of filenames in a struct array, example:
4x1 struct array with fields:
name
date
bytes
isdir
datenum
where files.name
ans =
ts.01094000.crest.csv
ans =
ts.01100600.crest.csv
etc.
I have another list of numbers (say, 1094000). And I want to find the corresponding file name from the struct.
Please note, that 1094000 doesn't have preceding 0. Often there might be other numbers. So I want to search for '1094000' and find that name.
I know I can do it using Regex. But I have never used that before. And finding it difficult to write for numbers instead of text using strfind. Any suggestion or another method is welcome.
What I have tried:
regexp(files.name,'ts.(\d*)1094000.crest.csv','match');
I think the regular expression you'd want is more like
filenames = {'ts.01100600.crest.csv','ts.01094000.crest.csv'};
matches = regexp(filenames, ['ts\.0*' num2str(1094000) '\.crest\.csv']);
matches = ~cellfun('isempty', matches);
filenames(matches)
For a solution with strfind...
Pre-16b:
match = ~cellfun('isempty', strfind({files.name}, num2str(1094000)),'UniformOutput',true)
files(match)
16b+:
match = contains({files.name}, string(1094000))
files(match)
However, the strfind way might have issues if the number you are looking for exists in unexpected places such as looking for 10 in ["01000" "00101"].
If your filenames match the pattern ts.NUMBER.crest.csv, then in 16b+ you could do:
str = {files.name};
str = extractBetween(str,4,'.');
str = strip(str,'left','0');
matches = str == string(1094000);
files(matches)

using multiple characters as delimiters in Coldfusion list

I am trying to use multiple characters as the delimeter in ColdFusion list like ,( comma and blank) but it ignores the blank.
I then tried to use:
<cfset title = listappend( title, a[ idx ].title, "#Chr(44)##Chr(32)#" ) />
But it also ignores the blank and without blanks the list items to diffucult to read.
Any ideas?
With ListAppend you can only use one delimiter. As the docs say for the delimiters parameter:
If this parameter contains more than one character, ColdFusion uses only the first character.
I'm not sure what a[ idx ].title contains or exactly what the expected result is (would be better if you gave a complete example), but I think something like this will do what you want or at least get you started:
<cfscript>
a = [
{"title"="One"},
{"title"="Two"},
{"title"="Three"}
];
result = "";
for (el in a) {
result &= el.title & ", ";
}
writeDump(result);
</cfscript>
I think there's a fundamental flaw in your approach here. The list delimiter is part of the structure of the data, whereas you are also trying to use it for "decoration" when you come to output the data from the list. Whilst often conveniently this'll work, it's kinda conflating two ideas.
What you should do is eschew the use of lists as a data structure completely, as they're a bit crap. Use an array for storing the data, and then deal with rendering it as a separate issue: write a render function which puts whatever separator you want in your display between each element.
function displayArrayAsList(array, separator){
var list = "";
for (var element in array){
list &= (len(list) ? separator : "");
list &= element;
}
return list;
}
writeOutput(displayAsList(["tahi", "rua", "toru", "wha"], ", "));
tahi, rua, toru, wha
Use a two step process. Step 1 - create your comma delimited list. Step 2
yourList = replace(yourList, ",", ", ", "all");

Dynamic regexprep in MATLAB

I have the following strings in a long string:
a=b=c=d;
a=b;
a=b=c=d=e=f;
I want to first search for above mentioned pattern (X=Y=...=Z) and then output like the following for each of the above mentioned strings:
a=d;
b=d;
c=d;
a=b;
a=f;
b=f;
c=f;
d=f;
e=f;
In general, I want all the variables to have an equal sign with the last variable on the extreme right of the string. Is there a way I can do it using regexprep in MATLAB. I am able to do it for a fixed length string, but for variable length, I have no idea how to achieve this. Any help is appreciated.
My attempt for the case of two equal signs is as follows:
funstr = regexprep(funstr, '([^;])+\s*=\s*+(\w+)+\s*=\s*([^;])+;', '$1 = $3; \n $2 = $3;\n');
Not a regexp but if you stick to Matlab you can make use of the cellfun function to avoid loop:
str = 'a=b=c=d=e=f;' ; %// input string
list = strsplit(str,'=') ;
strout = cellfun( #(a) [a,'=',list{end}] , list(1:end-1), 'uni', 0).' %'// Horchler simplification of the previous solution below
%// this does the same than above but more convoluted
%// strout = cellfun( #(a,b) cat(2,a,'=',b) , list(1:end-1) , repmat(list(end),1,length(list)-1) , 'uni',0 ).'
Will give you:
strout =
'a=f;'
'b=f;'
'c=f;'
'd=f;'
'e=f;'
Note: As Horchler rightly pointed out in comment, although the cellfun instruction allows to compact your code, it is just a disguised loop. Moreover, since it runs on cell, it is notoriously slow. You won't see the difference on such simple inputs, but keep this use when super performances are not a major concern.
Now if you like regex you must like black magic code. If all your strings are in a cell array from the start, there is a way to (over)abuse of the cellfun capabilities to obscure your code do it all in one line.
Consider:
strlist = {
'a=b=c=d;'
'a=b;'
'a=b=c=d=e=f;'
};
Then you can have all your substring with:
strout = cellfun( #(s)cellfun(#(a,b)cat(2,a,'=',b),s(1:end-1),repmat(s(end),1,length(s)-1),'uni',0).' , cellfun(#(s) strsplit(s,'=') , strlist , 'uni',0 ) ,'uni',0)
>> strout{:}
ans =
'a=d;'
'b=d;'
'c=d;'
ans =
'a=b;'
ans =
'a=f;'
'b=f;'
'c=f;'
'd=f;'
'e=f;'
This gives you a 3x1 cell array. One cell for each group of substring. If you want to concatenate them all then simply: strall = cat(2,strout{:});
I haven't had much experience w/ Matlab; but your problem can be solved by a simple string split function.
[parts, m] = strsplit( funstr, {' ', '='}, 'CollapseDelimiters', true )
Now, store the last part of parts; and iterate over parts until that:
len = length( parts )
for i = 1:len-1
print( strcat(parts(i), ' = ', parts(len)) )
end
I do not know what exactly is the print function in matlab. You can update that accordingly.
There isn't a single Regex that you can write that will cover all the cases. As posted on this answer:
https://stackoverflow.com/a/5019658/3393095
However, you have a few alternatives to achieve your final result:
You can get all the values in the line with regexp, pick the last value, then use a for loop iterating throughout the other values to generate the output. The regex to get the values would be this:
matchStr = regexp(str,'([^=;\s]*)','match')
If you want to use regexprep at any means, you should write a pattern generator and a replace expression generator, based on number of '=' in the input string, and pass these as parameters of your regexprep func.
You can forget about Regex and Split the input to generate the output looping throughout the values (similarly to alternative #1) .

Convert list of one number to int

I have a regular expression that parses a line# string from a log. That line# is then subjected to another regular expression to just extract the line#.
For example:
Part of this regex:
m = re.match(r"^(\d{4}-\d{2}-\d{2}\s*\d{2}:\d{2}:\d{2}),?(\d{3}),?(?:\s+\[(?:[^\]]+)\])+(?<=])(\s+?[A-Z]+\s+?)+(\s?[a-zA-Z0-9\.])+\s?(\((?:\s?\w)+\))\s?(\s?.)+", line)
Will match this:
(line 206)
Then this regex:
re.findall(r'\b\d+\b', linestr)
Gives me
['206']
In order to further process my information I need to have the line number as an integer and am lost for a solution as to how to do that.
You may try:
line_int = int(re.findall(r'\b\d+\b', linestr)[0])
or if you have more than one element in the list:
lines_int = [int(i) for i in re.findall(r'\b\d+\b', linestr)]
or even
lines_int = map(int, re.findall(r'(\b\d+\b)+', linestr))
I hope it helps -^.^-
Use int() to convert your list of one "string number" to an int:
myl = ['206']
int(myl[0])
206
if you have a list of these, you can conver them all to ints using list comprehension:
[int(i) for i in myl]
resulting in a list of ints.
You can hook this into your code as best fits, e.g.,
int(re.findall(r'\b\d+\b', linestr)[0])
int(re.findall(r'\b\d+\b', linestr)[0])
?

awk split by sequence of whitespaces

Can split(string, array, separator) in awk use sequence of whitespaces as the separator (or more generally any regexp as the separator)?
Obviously, one could use the internal autosplit (that runs on each line of the input with value of FS variable as the separator) and with simple for and $0 magic do the trick. However, I was just wondering if there's a more straightforward way using the splititself.
The GNU Awk User's Guide states:
split(string, array, fieldsep)
This divides string into pieces separated by fieldsep, and stores the
pieces in array. The first piece is stored in array[1], the second
piece in array[2], and so forth. The string value of the third
argument, fieldsep, is a regexp describing where to split string (much
as FS can be a regexp describing where to split input records). If
the fieldsep is omitted, the value of FS is used. split returns the
number of elements created. The split function, then, splits strings
into pieces in a manner similar to the way input lines are split into
fields
Here is a short (somewhat silly) example that uses a simple regular expression ".s " that will match any single character followed by a lower-case s and a space. The result of the split is put into array a. Note that the parts that match are not placed into the array.
BEGIN {
s = "this isn't a string yes isodore?"
count = split(s, a, ".s ")
printf("number of splits: %d\n", count)
print "Contents of array:"
for (i = 1; i <= count; i++)
printf "a[%d]: %s\n", i, a[i]
}
The output:
$ awk -f so.awk
number of splits: 3
Contents of array:
a[1]: th
a[2]: isn't a string y
a[3]: isodore?
The article Advanced Awk for Sysadmins show an example of parsing a line using split(). This page contains an example of using a regular expression to split data into
an array.
From the GNU awk(1) manual page:
split(s, a [, r])
Splits the string s into the array a on the regular expression r, and returns the number of fields. If r is omitted, FS is used instead.
The point here is that you can use any regular expression to perform field splitting--at least you can with gawk. If you're using something else, you'll need to check your documentation.