Split string and get last element - regex

Let's say I have a column which has values like:
foo/bar
chunky/bacon/flavor
/baz/quz/qux/bax
I.e. a variable number of strings separated by /.
In another column I want to get the last element from each of these strings, after they have been split on /. So, that column would have:
bar
flavor
bax
I can't figure this out. I can split on / and get an array, and I can see the function INDEX to get a specific numbered indexed element from the array, but can't find a way to say "the last element" in this function.

Edit:
This one is simpler:
=REGEXEXTRACT(A1,"[^/]+$")
You could use this formula:
=REGEXEXTRACT(A1,"(?:.*/)(.*)$")
It can also be used inside an ARRAYFORMULA:
=ARRAYFORMULA(REGEXEXTRACT(A1:A3,"(?:.*/)(.*)$"))
Here's some more info:
the RegExExtract function
Some good examples of syntax
my personal list of Regex Tricks
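This formula takes a similar route, using the number of slashes as the index into the split result:
=INDEX(SPLIT(A1,"/"),LEN(A1)-LEN(SUBSTITUTE(A1,"/","")))
But it references A1 three times, which is not preferable. Also note that the slash count only equals the number of elements when the value starts with a slash, as in /baz/quz/qux/bax; for a value like foo/bar it points at the second-to-last element.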

You could do this too:
=INDEX(SPLIT(A1, "/"), 1, COLUMNS(SPLIT(A1, "/")))
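
For comparison outside of a spreadsheet, here is a minimal Python sketch of the two ideas above: split on / and take the last piece, or match the trailing run of non-slash characters with [^/]+$. The function names are made up for illustration, and dropping empty pieces is meant to mirror what SPLIT does by default.

```python
import re


def last_segment_split(path: str) -> str:
    """Split on '/' and return the last non-empty piece."""
    # Drop empty pieces so a leading slash does not shift the result.
    parts = [p for p in path.split("/") if p]
    return parts[-1] if parts else ""


def last_segment_regex(path: str) -> str:
    """Return the trailing run of non-slash characters, like [^/]+$."""
    match = re.search(r"[^/]+$", path)
    return match.group(0) if match else ""


values = ["foo/bar", "chunky/bacon/flavor", "/baz/quz/qux/bax"]
print([last_segment_split(v) for v in values])  # ['bar', 'flavor', 'bax']
print([last_segment_regex(v) for v in values])  # ['bar', 'flavor', 'bax']
```

Taking the last piece by position (parts[-1]) avoids having to count separators at all, which is where the index-based formulas tend to go off by one.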

Another option, perhaps best done on a copy of the column, is Find and replace. Search for:
.+/
Replace with nothing, with "Search using regular expressions" ticked.

You can try this.
Once you have the array of Strings, you can access the last element through its length:
String message = "chunky/bacon/flavor";
String[] outSplited = message.split("/");
System.out.println(outSplited[outSplited.length -1]);

Related

perform substring extraction on data frame column

I have a dataframe with 1 column called 'full_url'. Each element of the column is just a url. How do I write a function to remove the 'http://' from all of the elements at once? I need to use some kind of regex because some don't have it at all, some have https, etc. The closest I've gotten is gsub(".*//","",unlist(full_url))
but that also returns 'full_url1' 'full_url2' 'full_url3' ... as the row names for some reason
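Without a reproducible example I'm not sure, but would something like this work?
df$full_url <- ifelse(substr(df$full_url, 1, 7) == "http://", substr(df$full_url, 8, nchar(df$full_url)), df$full_url)
This uses substr to check whether the first 7 characters are "http://". If they are, the value is replaced with everything after the prefix; if they're not, it is left unchanged. Both substr and ifelse are vectorised, so no apply is needed (apply expects a matrix or data frame, not a single column vector), and nchar rather than length gives the number of characters in each string.
To cover https as well, sub("^https?://", "", df$full_url) removes either prefix.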
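
The same prefix stripping can be sketched in Python with an anchored regular expression. This is only an illustration: the sample URLs are invented, and a plain list stands in for the data frame column.

```python
import re

# Match an optional "s" so both http:// and https:// are removed,
# but only at the very start of the string.
SCHEME = re.compile(r"^https?://", re.IGNORECASE)


def strip_scheme(url: str) -> str:
    """Remove a leading http:// or https://, leaving other strings untouched."""
    return SCHEME.sub("", url, count=1)


full_url = ["http://example.com/a", "https://example.org", "example.net/path"]
stripped = [strip_scheme(u) for u in full_url]
print(stripped)  # ['example.com/a', 'example.org', 'example.net/path']
```

Anchoring with ^ is the main difference from the .*// pattern in the question, which would also consume a // that appears later in the URL.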

Simplest way to find out if at least one cell in a cell array matches a regular expression

I need to search a cell array and return a single boolean value indicating whether any cell matches a regular expression.
For example, suppose I want to find out if the cell array strs contains foo or -foo (case-insensitive). The regular expression I need to pass to regexpi is ^-?foo$.
Sample inputs:
strs={'a','b'} % result is 0
strs={'a','foo'} % result is 1
strs={'a','-FOO'} % result is 1
strs={'a','food'} % result is 0
I came up with the following solution based on How can I implement wildcard at ismember function of matlab? and Searching cell array with regex, but it seems like I should be able to simplify it:
~isempty(find(~cellfun('isempty', regexpi(strs, '^-?foo$'))))
The problem I have is that it looks rather cryptic for such a simple operation. Is there a simpler, more human-readable expression I can use to achieve the same result?
NOTE: The answer refers to the original regexp in the question: '-?foo'
You can avoid the find:
any(~cellfun('isempty', regexpi(strs, '-?foo')))
Another possibility: concatenate first all cells into a single string:
~isempty(regexpi([strs{:}], '-?foo'))
Note that you can remove the "-" sign in any of the above:
any(~cellfun('isempty', regexpi(strs, 'foo')))
~isempty(regexpi([strs{:}], 'foo'))
And that allows using strfind (with lower) instead of regexpi:
~isempty(strfind(lower([strs{:}]),'foo'))
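
As a cross-check of the logic, here is a small Python sketch of the "does any cell match" test, using a full match to play the role of the anchored pattern ^-?foo$ from the question. The names are illustrative only. It also shows one thing to watch with the concatenation variant: joining the cells first can produce a match that spans two cells.

```python
import re

PATTERN = re.compile(r"-?foo", re.IGNORECASE)


def any_full_match(strs):
    """True if at least one string matches the whole pattern (like ^-?foo$)."""
    return any(PATTERN.fullmatch(s) is not None for s in strs)


print(any_full_match(["a", "b"]))     # False
print(any_full_match(["a", "foo"]))   # True
print(any_full_match(["a", "-FOO"]))  # True
print(any_full_match(["a", "food"]))  # False

# Joining the cells before searching can match across a cell boundary:
# no single cell is "foo", yet the joined string contains it.
joined_hit = PATTERN.search("".join(["f", "oo"])) is not None
print(joined_hit)  # True
```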

How can I match all strings unless it contains a certain string?

I want to match every string in this list except the ones that contain a product SKU, such as /s7892632 (an "s" followed by a random string of digits). I've been trying to do this for quite some time and have been unsuccessful. Any insight would be greatly appreciated.
/account/login?returnurl=/account/forgotpassword
/account/login?returnurl=/account/orders
/account/orders
/account/updateaddress
/account/updateemail
/account/updaterewardscard
/brands/havaianas
/careers
/Category List
/checkout
/checkout/addresses
/checkout/addresses/delivery
/checkout/addresses/deliverymethod
/checkout/affilinetbasket
/checkout/anonymous
/checkout/confirmation
/checkout/express
/checkout/login
/checkout/login?returnurl=/checkout/addresses
/checkout/null
/checkout/payment
/checkout/paypal
/checkout/quickshop/
/checkout/verify
/click-and-collect
/click-and-collect/click-and-collect-overview
/corporate/about-matalan
/corporate/careers
/corporate/cookies
/corporate/history
/customer-services/accessibility
/customer-services/contact
/customer-services/customer-services-home
/customer-services/delivery
/customer-services/faq
/customer-services/fitting-room
/customer-services/here-to-help
/customer-services/size-guides
/delivery
/events/mothers-day
/events/mothers-day/s2516241/tassle-detail-slouch-bag
/events/mothers-day/s2518752/waxed-jacket
/events/mothers-day/s2519237/fabric-buckle-tote-bag
/events/mothers-day/s2521182/heart-print-nightie
/events/mothers-day/s2521184/heart-print-dressing-gown
/events/mothers-day/s2521185/heart-print-pyjama-set
/events/mothers-day/s2521679/structured-tote-bag
/events/mothers-day/s2522143/chiffon-print-dress
/events/mothers-day/s2522347/butterfly-enamel-bowl-32cm-x-8cm
/events/mothers-day/s2526013/animal-print-jersey-blazer
/events/mothers-day/s2527624/croc-tote-bag
/events/mothers-day/s2529731/shift-dress
/events/mothers-day?page=1&size=120&cols=4&sort=&id=/events/mothers-day&priceRange[min]=2&priceRange[max]=59
/events/mothers-day?page=2&size=120&cols=4&sort=&id=/events/mothers-day&priceRange[min]=2&priceRange[max]=59
/events/mothers-day?page=2&size=36&cols=4&sort=&id=/events/mothers-day&priceRange[min]=2&priceRange[max]=59
/events/mothers-day?page=3&size=36&cols=4&sort=&id=/events/mothers-day&priceRange[min]=2&priceRange[max]=59
The following should work:
^(?!.*/s\d{7}/).*
Example: http://regexr.com?343nf
This assumes you have each string as a separate element in a list. If this is actually matching one big string with multiple lines you can use the same regex, but you may need to enable global and multiline options depending on the tool you are using (and make sure dotall/singleline is disabled).
Try this:
boolean noSku = !line.matches(".*/s\\d{5,}.*");
This uses {5,}, which matches a SKU of five or more digits (giving you flexibility with matching). You can change the number to whatever suits.
This matches lines that don't contain the SKU code:
^((?!s\d{7}).)*$
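
To see the negative lookahead in action, here is a minimal Python sketch that filters a handful of the paths from the question with the first pattern above. The shortened query string in the last sample is an abbreviation made for this example.

```python
import re

# Negative lookahead: succeed only if "/s" + 7 digits + "/" appears nowhere in the line.
NO_SKU = re.compile(r"^(?!.*/s\d{7}/).*$")

paths = [
    "/checkout/payment",
    "/events/mothers-day",
    "/events/mothers-day/s2516241/tassle-detail-slouch-bag",
    "/events/mothers-day?page=1&size=120",
]

kept = [p for p in paths if NO_SKU.match(p)]
print(kept)
# ['/checkout/payment', '/events/mothers-day', '/events/mothers-day?page=1&size=120']
```

Because the lookahead is checked once at the start of the string, this form usually does less work than the tempered version that re-checks the lookahead before every character.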

Character count of regular expression in cells in MATLAB

Earlier I got some help as to how to make a script that will extract hashtags from a list of tweets and put them into an array of cells.
I used this as my code, inside a for loop
hashtagCell{i} = regexp(textRead{i}, '#[A-z]*', 'match');
This works for what it is supposed to do, but now I'm trying to find the average character length of the hashtags, so I need to be able to add the character length of each hashtag pulled out by the above function and add them together. However, when I try to use the size() function, it just gives me the size of the cell instead of the size of the strings, which is what I want. I can't figure out how to do this.
For a single string it would be like this:
%# example string with hashtags.
MyText = 'this is a #text with #hashtag and also #another hashtag';
%# create the hashtagCell.
hashtagCell = regexp(MyText, '#[A-z]*', 'match');
%# compute the mean.
AverageLength = mean(cellfun(@(x) size(x,2), hashtagCell));
This should help (and it gets rid of any loops, other than, perhaps, the one used to create CellOfText):
%# Example cell array of tweets
CellOfText = {'Bah #humbug says #Mr scrooge'; 'No #presents for you'};
%# Get all hash tags
HTC = regexp(CellOfText, '#[A-z]*', 'match');
%# Get the average hash tag length, being careful to unnest HTC
AvgLength1 = mean(cellfun('length', [HTC{:}]));
DISCLAIMER: The inspiration for this method came from this excellent answer to a similar question. Thanks to @Andrey for that.
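
The same calculation can be sketched in Python: collect every hashtag from every tweet into one flat list, then average the lengths. This is an illustration rather than a translation of the MATLAB code. It uses [A-Za-z] instead of [A-z], since the range A-z also covers the characters [ \ ] ^ _ and the backtick that sit between Z and a. As in the MATLAB version, the length includes the leading #.

```python
import re
from statistics import mean

tweets = ["Bah #humbug says #Mr scrooge", "No #presents for you"]

# Flatten: one list of hashtags across all tweets.
hashtags = [tag for tweet in tweets for tag in re.findall(r"#[A-Za-z]*", tweet)]
print(hashtags)  # ['#humbug', '#Mr', '#presents']

# Average number of characters per hashtag, counting the '#'.
average_length = mean(len(tag) for tag in hashtags)
print(average_length)  # (7 + 3 + 9) / 3
```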

Get regular expression matched position?

Here is a string :
str1="ha,hihi,aaaaa,ok"
I want to get the positions of "," in str1, which should be 3, 8, 14.
How can I get this in R?
You get the desired vector using this expression:
as.integer(gregexpr(",", str1)[[1]])
The [[1]] will choose the first element of the resulting list. If str1 were a vector of a length other than 1, then gregexpr would return a list with that many items, one for each element of str1.
The as.integer will strip additional attributes, like the length of the matched text. In many situations you will be able to omit this, as other code will likely simply ignore those attributes. For output to the console it might be less confusing, though, so I included it in my answer.
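
For readers comparing across languages, a minimal Python sketch of the same lookup follows. Python counts from 0, so 1 is added to each match offset to reproduce the 1-based positions that R reports.

```python
import re

str1 = "ha,hihi,aaaaa,ok"

# finditer yields one match object per comma; start() is the 0-based offset.
positions = [m.start() + 1 for m in re.finditer(",", str1)]
print(positions)  # [3, 8, 14]
```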