Replace multiple words in pig - replace

I am new to Pig. In the script that I am writing I want to perform an operation similar to this:
foreach X GENERATE REPLACE(word,'.*abc.*','abc') OR REPLACE(word,'.*def.*','def').
If the first pattern matches then abc is replaced else if second pattern is matched then def is replaced. But I suppose the syntax is incorrect. Can someone help me with the syntax?

There are a few ways to do this, but since if the regex doesn't match the string, you'll just get your string back, this is pretty compact:
Y = FOREACH X GENERATE REPLACE(REPLACE(word, '.*abc.*', 'abc'), '.*def.*', 'def');

Related

regular expression search backwards, How to deal with words with and without?

I tested by https://regexr.com/
There two sample words.
BOND_aa_SB1_66-1.pdf
BOND_bb_SB2.pdf
I want to extract SB1, SB2 from each sample.
but my regular expression is not perfect.
It is working
(?<=BOND_.*_).*
But It is difficult to write the following.
I try
(?<=BOND_.*_).*(?=(_|\.))
But first sample result is 'SB1_66-1'
I just want to extract SB1
sb1 The following may or may not exist. if there is content, it can be separated by starting with _.
How should I fix it?
To extract the third underscore-separated term, we can use re.search as follows:
inp = ["BOND_aa_SB1_66-1.pdf", "BOND_bb_SB2.pdf"]
output = [re.search(r'^BOND_[^_]+_([^_.]+)', x).group(1) for x in inp]
print(output) # ['SB1', 'SB2']
s = "BOND_aa_SB1_66-1.pdf BOND_bb_SB2.pdf"
(re.findall(r'(SB\d+)', s))
['SB1', 'SB2']

The regex in string.format of LUA

I use string.format(str, regex) of LUA to fetch some key word.
local RICH_TAGS = {
"texture",
"img",
}
--\[((img)|(texture))=
local START_OF_PATTER = "\\[("
for index = 1, #RICH_TAGS - 1 do
START_OF_PATTER = START_OF_PATTER .. "(" .. RICH_TAGS[index]..")|"
end
START_OF_PATTER = START_OF_PATTER .. "("..RICH_TAGS[#RICH_TAGS].."))"
function RichTextDecoder.decodeRich(str)
local result = {}
print(str, START_OF_PATTER)
dump({string.find(str, START_OF_PATTER)})
end
output
hello[img=123] \[((texture)|(img))
dump from: [string "utils/RichTextDecoder.lua"]:21: in function 'decodeRich'
"<var>" = {
}
The output means:
str = hello[img=123]
START_OF_PATTER = \[((texture)|(img))
This regex works well with some online regex tools. But it find nothing in LUA.
Is there any wrong using in my code?
You cannot use regular expressions in Lua. Use Lua's string patterns to match strings.
See How to write this regular expression in Lua?
Try dump({str:find("\\%[%("))})
Also note that this loop:
for index = 1, #RICH_TAGS - 1 do
START_OF_PATTER = START_OF_PATTER .. "(" .. RICH_TAGS[index]..")|"
end
will leave out the last element of RICH_TAGS, I assume that was not your intention.
Edit:
But what I want is to fetch several specific word. For example, the
pattern can fetch "[img=" "[texture=" "[font=" any one of them. With
the regex string I wrote in my question, regex can do the work. But
with Lua, the way to do the job is write code like string.find(str,
"[img=") and string.find(str, "[texture=") and string.find(str,
"[font="). I wonder there should be a way to do the job with a single
pattern string. I tryed pattern string like "%[%a*=", but obviously it
will fetch a lot more string I need.
You cannot match several specific words with a single pattern unless they are in that string in a specific order. The only thing you could do is to put all the characters that make up those words into a class, but then you risk to find any word you can build from those letters.
Usually you would match each word with a separate pattern or you match any word and check if the match is one of your words using a look up table for example.
So basically you do what a regex library would do in a few lines of Lua.

Match return substring between two substrings using regexp

I have a list of records that are character vectors. Here's an example:
'1mil_0,1_1_1_lb200_ks_drivers_sorted.csv'
'1mil_0_1_lb100_ks_drivers_sorted.csv'
'1mil_1_1_lb2_100_100_ks_drivers_sorted.csv'
'1mil_1_1_lb100_ks_drivers_sorted.csv'
From these names I would like to extract whatever's between the two substrings 1mil_ and _ks_drivers_sorted.csv.
So in this case the output would be:
0,1_1_1_lb200
0_1_lb100
1_1_lb2_100_100
1_1_lb100
I'm using MATLAB so I thought to use regexp to do this, but I can't understand what kind of regular expression would be correct.
Or are there some other ways to do this without using regexp?
Let the data be:
x = {'1mil_0,1_1_1_lb200_ks_drivers_sorted.csv'
'1mil_0_1_lb100_ks_drivers_sorted.csv'
'1mil_1_1_lb2_100_100_ks_drivers_sorted.csv'
'1mil_1_1_lb100_ks_drivers_sorted.csv'};
You can use lookbehind and lookahead to find the two limiting substrings, and match everything in between:
result = cellfun(#(c) regexp(c, '(?<=1mil_).*(?=_ks_drivers_sorted\.csv)', 'match'), x);
Or, since the regular expression only produces one match, the following simpler alternative can be used (thanks #excaza for noticing):
result = regexp(x, '(?<=1mil_).*(?=_ks_drivers_sorted\.csv)', 'match', 'once');
In your example, either of the above gives
result =
4×1 cell array
'0,1_1_1_lb200'
'0_1_lb100'
'1_1_lb2_100_100'
'1_1_lb100'
For me the easy way to do this is just use espace or nothing to replace what you don't need in your string, and the rest is what you need.
If is a list, you can use a loop to do this.
Exemple to replace "1mil_" with "" and "_ks_drivers_sorted.csv" with ""
newChr = strrep(chr,'1mil_','')
newChr = strrep(chr,'_ks_drivers_sorted.csv','')

How to replace parts of a string in lua "in a single pass"?

I have the following string of anchors (where I want to change the contents of the href) and a lua table of replacements, which tells which word should be replaced for:
s1 = '<a href="word7">'
replacementTable = {}
replacementTable["word1"] = "potato1"
replacementTable["word2"] = "potato2"
replacementTable["word3"] = "potato3"
replacementTable["word4"] = "potato4"
replacementTable["word5"] = "potato5"
The expected result should be:
<a href="word7">
I know I could do this iterating for each element in the replacementTable and process the string each time, but my gut feeling tells me that if by any chance the string is very big and/or the replacement table becomes big, this apporach is going to perform poorly.
So I though it could be best if I could do the following: apply the regular expression for finding all the matches, get an iterator for each match and replace each match for its value in the replacementTable.
Something like this would be great (writing it in Javascript because I don't know yet how to write lambdas in Lua):
var newString = patternReplacement(s1, '<a[^>]* href="([^"]*)"', function(match) { return replacementTable[match] })
Where the first parameter is the string, the second one the regular expression and the third one a function that is executed for each match to get the replacement. This way I think s1 gets parsed once, being more efficient.
Is there any way to do this in Lua?
In your example, this simple code works:
print((s1:gsub("%w+",replacementTable)))
The point is that gsub already accepts a table of replacements.
In the end, the solution that worked for me was the following one:
local updatedBody = string.gsub(body, '(<a[^>]* href=")(/[^"%?]*)([^"]*")', function(leftSide, url, rightSide)
local replacedUrl = url
if (urlsToReplace[url]) then replacedUrl = urlsToReplace[url] end
return leftSide .. replacedUrl .. rightSide
end)
It kept out any querystring parameter giving me just the URI. I know it's a bad idea to parse HTML bodies with regular expressions but for my case, where I required a lot of performance, this was performing a lot faster and just did the job.

How do I assign many values to a particular Perl variable?

I am writing a script in Perl which searches for a motif(substring) in protein sequence(string). The motif sequence to be searched (or substring) is hhhDDDssEExD, where:
h is any hydrophobic amino acid
s is any small amino acid
x is any amino acid
h,s,x can have more than one value separately
Can more than one value be assigned to one variable? If yes, how should I do that? I want to assign a list of multiple values to a variable.
It seems like you want some kind of pattern matching. This can be done with strings using regular expressions.
You can use character classes in your regular expression. The classes you mentioned would be:
h -> [VLIM]
s -> [AG]
x -> [A-IK-NP-TV-Z]
The last one means "A to I, K to N, P to T, V to Z".
The regular expression for your example would be:
/[VLIM]{3}D{3}[AG]{2}E{2}[A-IK-NP-TV-Z]D/
I am no great expert in perl, so there is quite possibly a quicker way to this, but it seems like the match operator "//" in list context is what you need. When you assign the result of a match operation to a list, the match operator takes on list context and returns a list with each of the parenthesis delimited sub-expressions. If you specify global matches with the "g" flag, it will return a list of all the matches of each sub-expression. Example:
# print a list of each match for "x" in "xxx"
#aList = ("xxx" =~ /(x)/g);
print(join(".", #aList));
Will print out
x.x.x
I'm assuming you have a regular expression for each of those 5 types h, D, s, E, and x. You didn't say whether each of these parts is a single character or multiple, so I'm going to assume they can be multiple characters. If so, your solution might be something like this:
$h = ""; # Insert regex to match "h"
$D = ""; # Insert regex to match "D"
$s = ""; # Insert regex to match "s"
$E = ""; # Insert regex to match "E"
$x = ""; # Insert regex to match "x"
$sequenceRE = "($h){3}($D){3}($s){2}($E){2}($x)($D)"
if ($line =~ /$sequenceRE/) {
$hPart = $1;
$sPart = $3;
$xPart = $5;
#hValues = ($hPart =~ /($h)/g);
#sValues = ($sPart =~ /($s)/g);
#xValues = ($xPart =~ /($x)/g);
}
I'm sure there is something I've missed, and there are some subtleties of perl that I have overlooked, but this should get you most of the way there. For more information, read up on perl's match operator, and regular expressions.
I could be way off, but it sounds like you want an object with a built in method to output as a string.
If you start with a string, like the one you mentioned, you could pass the string to the class as a new object, use regular expressions like everyone has already suggested to parse out the chunks that you would then assign as variables to that object. Finally, you could have it output a string based on the variables of that object, for instance:
$string = "COHOCOHOCOHOCOHOCOHOC";
$sugar = new Organic($string);
Class Organic {
$chem;
function __construct($chem) {
$hydro_find = "OHO";
$carb_find = "C";
$this-> hydro = preg_find ($hydro_find, $chem);
$this -> carb = preg_find ($carb_find, $chem);
function __TO_STRING() {
return $this->carb."="$this->hydro;
}
}
echo $sugar;
Okay, that kind of fell apart in the end, and it was pseudo-php, not perl. But if I understand your question correctly, you are looking for a way to get all of the info from the string but keep it tied to that string. That would be objects and classes.
You probably want an array (or arrayref) or a pattern (qr//).
Or maybe Quantum::Superpositions.