I'm trying to replace every string in a list that contains a given substring with a modified version of itself.
I've tried it by using the map function:
cleanUpChars = map(\w -> if isInfixOf "**" w then map(\c -> if c == "*" then ""; else c); else w)
To me this reads as: map elements in a list, such that if a character of a word contains * replace it with nothing
To Haskell: "Couldn't match expected type [[Char]] -> [[Char]] with actual type [Char] in the expression: w" (and the last w is underlined)
Any help is appreciated.
To answer the revised question (when isInfixOf has been imported correctly):
cleanUpChars = map(\w -> if isInfixOf "**" w then map(\c -> if c == "*" then ""; else c); else w)
The most obvious thing wrong here is that c in the inner parentheses will be a Char (since it's the input to a function which is mapped over a String) - and characters use single quotes, not double quotes. This isn't just a case of a typo or wrong syntax, however - "" works fine as an empty string (and is equivalent to [] since Strings are just lists), but there is no such thing as an "empty character".
If, as it seems, your aim is to remove all *s from each string in the list that contains **, then the right tool is filter rather than map:
Prelude Data.List> cleanUpChars = map(\w -> if isInfixOf "**" w then filter (/= '*') w; else w)
Prelude Data.List> cleanUpChars ["th**is", "is", "a*", "t**es*t"]
["this","is","a*","test"]
(Note that in the example I made up, it removes all asterisks from t**es*t, even the single one. This may not be what you actually wanted, but it's what your logic in the faulty version implied - you'll have to be a little more sophisticated to only remove pairs of consecutive *'s.)
PS I would certainly never write the function like that, with the semicolon - it really doesn't gain you anything. I would also use the infix form of isInfixOf, which makes it much clearer which string you are looking for inside the other:
cleanUpChars :: [String] -> [String]
cleanUpChars = map (\w -> if "**" `isInfixOf` w then filter (/= '*') w else w)
I'm still not particularly happy with that for readability - there's probably some nice way to tidy it up that I'm overlooking for now. But even if not, it helps readability imo to give the function a local name (hopefully you can come up with a more concise name than my version!):
cleanUpChars :: [String] -> [String]
cleanUpChars = map possiblyRemoveAsterisks
  where possiblyRemoveAsterisks w = if "**" `isInfixOf` w then filter (/= '*') w else w
Using the stringr package, it is easy to perform regex replacement in a vectorized manner.
Question: How can I do the following:
Replace every word in
hello,world??your,make|[]world,hello,pos
with a different replacement each, e.g. increasing numbers
1,2??3,4|[]5,6,7
Note that simple separators cannot be assumed, the practical use case is more complicated.
stringr::str_replace_all does not seem to work here. Either it is given a vector of replacements, as in
str_replace_all(x, "(\\w+)", 1:7)
which produces a vector with each replacement applied to all words, or it needs a named vector of patterns, and the input has uncertain and/or duplicate entries, so that
str_replace_all(x, c("hello" = "1", "world" = "2", ...))
will not work for the purpose.
Here's another idea using gsubfn. The pre function is run before the substitutions and the fun function is run for each substitution:
library(gsubfn)
x <- "hello,world??your,make|[]world,hello,pos"
p <- proto(pre = function(t) t$v <- 0, # runs once before matching: reset the counter v to 0
           fun = function(t, x) t$v <- v + 1) # runs for each match: increment v and return it
gsubfn("\\w+", p, x)
Which gives:
[1] "1,2??3,4|[]5,6,7"
This variation would give the same answer since gsubfn maintains a count variable for use in proto functions:
pp <- proto(fun = function(...) count)
gsubfn("\\w+", pp, x)
See the gsubfn vignette for examples of using count.
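For readers outside R, the same idea (a callback that is invoked once per match and keeps a running counter) can be sketched in Python with the standard library's re.sub, which accepts a function as the replacement. This is only an illustration of the technique, not a port of gsubfn; the helper name number_words is made up for this sketch.

```python
import itertools
import re


def number_words(text: str) -> str:
    """Replace each run of word characters with 1, 2, 3, ... in order of appearance."""
    # A fresh counter per call plays the role of `pre` resetting v to 0.
    counter = itertools.count(1)
    # The lambda plays the role of `fun`: it is called once for every match.
    return re.sub(r"\w+", lambda match: str(next(counter)), text)


print(number_words("hello,world??your,make|[]world,hello,pos"))
# prints: 1,2??3,4|[]5,6,7
```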
I would suggest the "ore" package for something like this. Of particular note would be ore.search and ore.subst, the latter of which can accept a function as the replacement value.
Examples:
library(ore)
x <- "hello,world??your,make|[]world,hello,pos"
## Match all and replace with the sequence in which they are found
ore.subst("(\\w+)", function(i) seq_along(i), x, all = TRUE)
# [1] "1,2??3,4|[]5,6,7"
## Create a cool ore object with details about what was extracted
ore.search("(\\w+)", x, all = TRUE)
# match: hello world your make world hello pos
# context: , ?? , |[] , ,
# number: 1==== 2==== 3=== 4=== 5==== 6==== 7==
Here is a base R solution. It should also be vectorized.
x="hello,world??your,make|[]world,hello,pos"
#split x into single chars
x_split=strsplit(x,"")[[1]]
#find all char positions and replace them with "a"
x_split[gregexpr("\\w", x)[[1]]]="a"
#find all runs of "a"
rle_res=rle(x_split)
#replace run lengths by 1
rle_res$lengths[rle_res$values=="a"]=1
#replace run values by increasing number
rle_res$values[rle_res$values=="a"]=1:sum(rle_res$values=="a")
#use inverse.rle on the modified rle object and collapse string
paste0(inverse.rle(rle_res),collapse="")
#[1] "1,2??3,4|[]5,6,7"
I'm trying to resolve ambiguity in a lowercase chemical formula. Since some element names are substrings of other element names, and they're all run together, there can be multiple global matches for the same pattern.
Consider the regex /^((h)|(s)|(hg)|(ga)|(as))+$/ against the string hgas. There are two possible matches: hg, as and h, s, ga (out of order compared to the input, but that is not an issue). Obviously a regex for all possible symbols would be longer; this example is kept short for simplicity.
Regex's powerful lookahead and lookbehind allow it to conclusively determine whether even a very long string does match this pattern or there are no possible permutations of letters. It will diligently try all possible permutations of matches, and, for example, if it hits the end of the string with a leftover g, go back through and retry a different combination.
I'm looking for a regular expression, or a language with some kind of extension, that adds on the ability to keep looking for matches after one is found, in this case, finding h, s, ga as well as hg, as.
Rebuilding the complex lookahead and lookbehind functionality of regex for this problem does not seem like a reasonable solution, especially considering the final regex also includes a \d* after each symbol.
I thought about reversing the order of the regexp, /^((as)|(ga)|(hg)|(s)|(h))+$/, to find additional mappings, but at most this will only find one additional match, and I don't have the theoretical background in regex to know if it's even reasonable to try.
I've created a sample page using my existing regex which finds 1 or 0 matches for a given lowercase string and returns it properly capitalized (and out of order). It uses the first 100 chemical symbols in its matching.
http://www.ptable.com/Script/lowercase_formula.php?formula=hgas
tl;dr: I have a regex to match 0 or 1 possible chemical formula permutations in a string. How do I find more than 1 match?
I'm well-aware this answer might be off-topic (as in the approach), but I think it is quite interesting, and it solves the OP's problem.
If you don't mind learning a new language (Prolog), then it might help you generate all possible combinations:
name(X) :- member(X, ['h', 's', 'hg', 'ga', 'as']).
parse_([], []).
parse_(InList, [HeadAtom | OutTail]) :-
    atom_chars(InAtom, InList),
    name(HeadAtom),
    atom_concat(HeadAtom, TailAtom, InAtom),
    atom_chars(TailAtom, TailList),
    parse_(TailList, OutTail).
parse(In, Out) :- atom_chars(In, List), parse_(List, Out).
Sample run:
?- parse('hgas', Out).
Out = [h, ga, s] ;
Out = [hg, as] ;
false.
The improved version, which also handles numbers, is a bit longer:
isName(X) :- member(X, ['h', 's', 'hg', 'ga', 'as', 'o', 'c']).
% Collect the leading run of digits, since digits are never part of an element name.
collect([],[],[]).
collect([H|T], [], [H|T]) :-
    \+ char_type(H, digit), !.
collect([H|T], [H|OT], L) :-
    char_type(H, digit), !, collect(T, OT, L).
parse_([], []).
parse_(InputChars, [Token | RestTokens]) :-
    atom_chars(InputAtom, InputChars),
    isName(Token),
    atom_concat(Token, TailAtom, InputAtom),
    atom_chars(TailAtom, TailChars),
    parse_(TailChars, RestTokens).
parse_(InputChars, [Token | RestTokens]) :-
    InputChars = [H|_], char_type(H, digit),
    collect(InputChars, NumberChars, TailChars),
    atom_chars(Token, NumberChars),
    parse_(TailChars, RestTokens).
parse(In, Out) :- atom_chars(In, List), parse_(List, Out).
Sample run:
?- parse('hgassc20h245o', X).
X = [h, ga, s, s, c, '20', h, '245', o] ;
X = [hg, as, s, c, '20', h, '245', o] ;
false.
?- parse('h2so4', X).
X = [h, '2', s, o, '4'] ;
false.
?- parse('hgas', X).
X = [h, ga, s] ;
X = [hg, as] ;
false.
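If Prolog is not an option, the same backtracking search can be sketched in Python with a recursive generator. This is a rough equivalent under the same assumptions as the Prolog version (a small illustrative symbol list, and a run of digits treated as a single token); the names SYMBOLS and parse are chosen for this sketch only.

```python
from typing import Iterator, List

# Illustrative subset of symbols, in the same order as the isName/1 list above.
SYMBOLS = ("h", "s", "hg", "ga", "as", "o", "c")


def parse(text: str) -> Iterator[List[str]]:
    """Yield every way to split text into known symbols and digit runs."""
    if not text:
        yield []  # everything was consumed: one complete split
        return
    if text[0].isdigit():
        # Consume the whole run of digits as a single token.
        end = 1
        while end < len(text) and text[end].isdigit():
            end += 1
        for rest in parse(text[end:]):
            yield [text[:end]] + rest
        return
    for symbol in SYMBOLS:
        if text.startswith(symbol):
            # Try this symbol, then every split of what follows it.
            for rest in parse(text[len(symbol):]):
                yield [symbol] + rest


for tokens in parse("hgassc20h245o"):
    print(tokens)
```

Because parse is a generator, alternatives are produced lazily, much like asking Prolog for the next solution with ; at the prompt.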
The reason you haven't found a general regex library that does this is that it isn't possible for all regular expressions: for some of them, enumerating every match would never terminate.
Imagine, with your example, that you just added the empty string to the list of terms; then
'hgas' could be:
['hg', 'as']
['hg', '', 'as']
['hg', '', '', 'as']
and so on, without end.
You'll probably just have to roll your own.
In pseudocode:
def findall(term, possible):
    out = []
    # for all the terms
    for pos in possible:
        # if it is a candidate
        if term.startswith(pos):
            rest = term[len(pos):]
            # combine it with all possible endings (or with nothing, if no term follows)
            for combo in findall(rest, possible) or [[]]:
                out.append([pos] + combo)
    return out
findall('hgas', ['h', 'as', ...])
The above can take exponential time, because the same suffixes are explored over and over; dynamic programming is the way to avoid that repeated work. Memoization for the win.
The last thing worth noting is that the above code doesn't check that it fully matches,
i.e. 'hga' might return ['hg'] as a possibility.
I'll leave the actual coding, memoization, and this last hiccup as, as my profs lovingly say, 'an exercise to the reader'.
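One possible take on that exercise, as a minimal Python sketch: it memoizes on the start position with functools.lru_cache and only reports splits that consume the whole string. The name find_all and the decision to silently drop empty candidates are assumptions of this sketch, not part of the answer above.

```python
from functools import lru_cache
from typing import List, Tuple


def find_all(term: str, possible: Tuple[str, ...]) -> List[Tuple[str, ...]]:
    """Return every split of term into items of possible (full matches only)."""
    # An empty candidate would allow infinitely many splits, so drop it.
    candidates = tuple(p for p in possible if p)

    @lru_cache(maxsize=None)
    def splits_from(start: int) -> Tuple[Tuple[str, ...], ...]:
        if start == len(term):
            return ((),)  # exactly one way to split nothing: the empty split
        found = []
        for cand in candidates:
            if term.startswith(cand, start):
                # Each suffix is solved once and then served from the cache.
                for tail in splits_from(start + len(cand)):
                    found.append((cand,) + tail)
        return tuple(found)

    return list(splits_from(0))


symbols = ("h", "s", "hg", "ga", "as")
print(find_all("hgas", symbols))  # [('h', 'ga', 's'), ('hg', 'as')]
print(find_all("hga", symbols))   # [('h', 'ga')], the partial ('hg',) is not reported
```

Memoization removes the repeated work on shared suffixes, but the list of results itself can still be large when a string is highly ambiguous.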
This is not a job for regexp, you need something more like a state machine.
You'd need to parse the string popping out all known symbols, stopping if there is none, and continuing. If the whole string gets consumed on one branch, you have found a possibility.
In PHP, something like:
<?php
$Elements = array('h','he','li','s','ga','hg','as',
    // "...and there may be many others, but they haven't been discovered".
);
function check($string, $found = array(), $path = '')
{
    global $Elements, $calls;
    $calls++;
    if ('' == $string)
    {
        if (!empty($path))
            $found[] = $path;
        return $found;
    }
    $es = array_unique(array(
        substr($string, 0, 1), // Element of 1 letter ("H")
        substr($string, 0, 2), // Element of 2 letter ("He")
    ));
    foreach($es as $e)
        if (in_array($e, $Elements))
            $found = check(substr($string, strlen($e)), $found, $path.$e.',');
    return $found;
}
print_r(check('hgas'));
print "in $calls calls.\n";
?>
Don't use regex. As you say, a regex finds only one match; what you need instead is every possible "meaning" of your string. Given that each element's symbol is 1-2 characters long, I'd go with this algorithm (forgive the pseudocode):
string[][] results;
void formulas(string[] elements, string formula){
    // arrays are assumed to be copied on assignment and when passed to a call
    string[] elements2=elements;
    if(checkSymbol(formula.substring(0,1))){
        elements.append(formula.substring(0,1));
        if(formula.length-1 ==0){
            results.append(elements);
        } else {
            formulas(elements,formula.substring(1,formula.length));
        }
    }
    // only try a two-letter symbol if two letters are left
    if(formula.length >= 2 && checkSymbol(formula.substring(0,2))){
        elements2.append(formula.substring(0,2));
        if(formula.length-2 ==0){
            results.append(elements2);
        } else {
            formulas(elements2,formula.substring(2,formula.length));
        }
    }
}
bool checkSymbol(string symbol){
    // verifies if symbol is the name of an element
}
Input "hgas" (let's go depth first).
First step:
checkSymbol(formula.substring(0,1)) is true for "h"
elements = [h]
Second call, on "gas": checkSymbol(formula.substring(0,1)) is false for "g"
then it tests "ga" => true
elements2 = [h, ga]
Third call, on "s":
the test for "s" succeeds, so elements is then [h, ga, s]. Nothing is left of the string, so the first array, [h, ga, s], is appended to results.
--
Now let's go back to the second "branch" of the first step.
The test checkSymbol(formula.substring(0,2)) finds that "hg" is an element as well
elements2 = [hg]
Then we call formulas([hg],"as")
The test for "a" fails (it is not an element) and the test for "as" succeeds; the string is fully consumed, so the result [hg,as] is appended to results[]
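This algorithm branches at most twice per character, so it stays cheap when few prefixes are ambiguous. The worst case is not O(n^2), though: if every one- and two-letter prefix were a valid symbol, the number of splits (and of calls) would grow like the Fibonacci numbers in n, n being the length of the string.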
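A runnable sketch of the same one-or-two-letter branching, in Python. The tiny ELEMENTS set stands in for checkSymbol and is purely illustrative; a real version would hold the full list of symbols. Building a new list for each branch (elements + [prefix]) sidesteps the question of whether the two branches share one array.

```python
from typing import List, Set

# Hypothetical, deliberately tiny symbol table standing in for checkSymbol.
ELEMENTS: Set[str] = {"h", "s", "hg", "ga", "as"}


def formulas(formula: str) -> List[List[str]]:
    """Return every split of formula into one- or two-letter symbols."""
    results: List[List[str]] = []

    def walk(elements: List[str], rest: str) -> None:
        if not rest:
            results.append(elements)  # the whole string was consumed
            return
        for size in (1, 2):  # a symbol is one or two letters long
            prefix = rest[:size]
            if len(prefix) == size and prefix in ELEMENTS:
                # elements + [prefix] is a new list, so branches never share state
                walk(elements + [prefix], rest[size:])

    walk([], formula)
    return results


print(formulas("hgas"))  # [['h', 'ga', 's'], ['hg', 'as']]
```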