Create a cell array in matlab - regex

I have a file of tweets that I have read into MATLAB using dataread, storing each line in a 30x1 cell array. I was wondering if there is a way to take each hashtag out, store them in their own cell array, and then find the average length of a hashtag. Any help would be greatly appreciated.

You have the right idea, I think, with your regexp call. I will just clarify a few things. If you want the text in every hashtag in the tweet, you would want to use regexp to search for the pound sign (#) and include every character after that, until you reach the end of the word, e.g.
text = '#this #is a #test';
regexpi(text,'\<#[a-z0-9_]*\>','match')
ans =
'#this' '#is' '#test'
where regexpi is a case-insensitive regexp, and the regex searches for '#' followed by any number of letters, digits, or underscores (which are, I believe, the valid hashtag characters). The 'match' flag makes the regexp function return the actual matches.
If you don't want the actual hashtag in the final text, you could use regex look-behinds to return only the text. For instance:
regexpi(text,'\<(?<=#)[a-z0-9_]*\>','match')
ans =
'this' 'is' 'test'
I think, technically, a hashtag must start with a letter, so this regex would return potentially invalid hashtags. It's not difficult to sort that out though.
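The question also asks for the average hashtag length, which the answer above leaves open. Here is a minimal sketch of the whole pipeline, written in Python rather than MATLAB and assuming the letter-first rule mentioned above. The names HASHTAG, hashtags, average_length and the sample tweets are hypothetical, chosen only for illustration.

```python
import re

# Assumption: a hashtag is '#' followed by a letter, then letters, digits or underscores.
HASHTAG = re.compile(r'#([a-z][a-z0-9_]*)', re.IGNORECASE)

def hashtags(lines):
    """Collect the text of every hashtag (without the '#') from a list of tweets."""
    tags = []
    for line in lines:
        tags.extend(HASHTAG.findall(line))
    return tags

def average_length(tags):
    """Mean hashtag length, or 0.0 when no hashtag was found."""
    return sum(len(tag) for tag in tags) / len(tags) if tags else 0.0

tweets = ['#this #is a #test', 'no tags here', '#2fast #ok_1']  # made-up sample data
tags = hashtags(tweets)
print(tags)                  # ['this', 'is', 'test', 'ok_1']
print(average_length(tags))  # 3.5
```

Because the pattern requires a letter right after '#', '#2fast' is skipped instead of being returned as an invalid hashtag.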

Related

How to apply correct regex?

I have a special task which requires lots of regex and javascript parsing.
My head is almost exploding, so maybe I'm just tired and have overlooked something small. I'm not new to regex, so perhaps someone can point me in the right direction and show me where I made a mistake.
So I have this regex code:
((?<=\ffmpg=).+(?=////u0026cs=nt))
to get the substring between two strings. The match should start right after the first string, ffmpg=, and end just before the second string, //u0026cs=nt.
The problem is that it only works when the HTML page contains a single parameter with that name; the source HTML contains dozens of ffmpg parameters, each followed by the same end string, cs=nt.
I cannot rely on counting characters either, because the number of characters differs every time you visit the HTML page, sometimes by +3, sometimes by +10. So the only way is to get this string from the start of param1 to the end of param2.
This is the string I need to get: 1714248%2C23851735%2C23804281%2C23839597%2C23357901%2C3313341%2C3316343%2C23848795%2C3300132%2C26853996%2C3300114%2C3315790%2C23857451%2C23856472%2C23851936%2C3300161%2C3314786%2C23856652%2C23859863%2C23837993%2C23833479%2C23861502%2C23842630%2C23842986%2C23861012
This is the source html example:
\u0026doc=IcuU5Oy8\u0026pen=V9PXaHoOp1gKD25rgAg\u0026ffmpg=1714248%2C23851735%2C23804281%2C23839597%2C23357901%2C3313341%2C3316343%2C23848795%2C3300132%2C26853996%2C3300114%2C3315790%2C23857451%2C23856472%2C23851936%2C3300161%2C3314786%2C23856652%2C23859863%2C23837993%2C23833479%2C23861502%2C23842630%2C23842986%2C23861012\u0026cs=nt\u0026token=gHgig8eLY3qsQ0bXa\\u0026doc=IcuU5Oy8\u0026pen=V9PXaHoOp1gKD25rgAg\u0026ffmpg=1714248%2C23851735%2C23804281%2C23839597%2C23357901%2C3313341%2C3316343%2C23848795%2C3300132%2C26853996%2C3300114%2C3315790%2C23857451%2C23856472%2C23851936%2C3300161%2C3314786%2C23856652%2C23859863%2C23837993%2C23833479%2C23861502%2C23842630%2C23842986%2C23861012\u0026cs=nt\u0026token=gHgig8eLY3qsQ0bXa\\u0026doc=IcuU5Oy8\u0026pen=V9PXaHoOp1gKD25rgAg\u0026ffmpg=1714248%2C23851735%2C23804281%2C23839597%2C23357901%2C3313341%2C3316343%2C23848795%2C3300132%2C26853996%2C3300114%2C3315790%2C23857451%2C23856472%2C23851936%2C3300161%2C3314786%2C23856652%2C23859863%2C23837993%2C23833479%2C23861502%2C23842630%2C23842986%2C23861012\u0026cs=nt\u0026token=gHgig8eLY3qsQ0bXa\
I have copied 3 times the same just for this purpose because it is very big html source and I doubt I can upload it here.
Thanks for your help.
In your question, you use (?<=\ffmpg=), where \f matches a form feed character, which is not present in the example data. If you meant to use \\f, it would match a literal \f, which is also not present in the example data.
You could get the match using a capturing group instead of lookarounds, as lookbehinds are not supported by all browsers.
If you just want to get a single match, you can omit the /g global flag.
If you use .+ you will match too much, because .+ first matches to the end of the string and then backtracks until the first position where \\u0026cs=nt can match, which is the last occurrence in the string.
What you could do instead is be specific in what you would allow to match which for the current string is a character class with the following characters [AC0-9%]+
You could broaden the character class with a range to match chars A-Z instead of AC for example and add more chars or ranges as required.
ffmpg=([AC0-9%]+)\\\\u0026cs=nt
For example
const regex = /ffmpg=([AC0-9%]+)\\\\u0026cs=nt/;
const str = `\\\\u0026doc=IcuU5Oy8\\\\u0026pen=V9PXaHoOp1gKD25rgAg\\\\u0026ffmpg=1714248%2C23851735%2C23804281%2C23839597%2C23357901%2C3313341%2C3316343%2C23848795%2C3300132%2C26853996%2C3300114%2C3315790%2C23857451%2C23856472%2C23851936%2C3300161%2C3314786%2C23856652%2C23859863%2C23837993%2C23833479%2C23861502%2C23842630%2C23842986%2C23861012\\\\u0026cs=nt\\\\u0026token=gHgig8eLY3qsQ0bXa\\\\\\\\u0026doc=IcuU5Oy8\\\\u0026pen=V9PXaHoOp1gKD25rgAg\\\\u0026ffmpg=1714248%2C23851735%2C23804281%2C23839597%2C23357901%2C3313341%2C3316343%2C23848795%2C3300132%2C26853996%2C3300114%2C3315790%2C23857451%2C23856472%2C23851936%2C3300161%2C3314786%2C23856652%2C23859863%2C23837993%2C23833479%2C23861502%2C23842630%2C23842986%2C23861012\\\\u0026cs=nt\\\\u0026token=gHgig8eLY3qsQ0bXa\\\\\\\\u0026doc=IcuU5Oy8\\\\u0026pen=V9PXaHoOp1gKD25rgAg\\\\u0026ffmpg=1714248%2C23851735%2C23804281%2C23839597%2C23357901%2C3313341%2C3316343%2C23848795%2C3300132%2C26853996%2C3300114%2C3315790%2C23857451%2C23856472%2C23851936%2C3300161%2C3314786%2C23856652%2C23859863%2C23837993%2C23833479%2C23861502%2C23842630%2C23842986%2C23861012\\\\u0026cs=nt\\\\u0026token=gHgig8eLY3qsQ0bXa\\\\`;
console.log(str.match(regex)[1]);
Try this:
(?<=ffmpg=)([A-F0-9%]+)
Explanation
Since your string consists only of URL-encoded characters, you can use the [A-F0-9%]+ character class to capture it. It stops where the next string starts, because there is a backslash there.
See online demo here.
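Both answers boil down to replacing the greedy .+ with a restrictive character class. Here is a minimal sketch of that idea, written in Python rather than JavaScript, which collects every occurrence instead of only the first. The sample string is a shortened, made-up stand-in for the real page, and it assumes a single literal backslash before u0026.

```python
import re

# Hypothetical, shortened sample; raw strings keep the backslashes literal.
sample = (
    r'\u0026pen=abc\u0026ffmpg=1714248%2C23851735\u0026cs=nt\u0026token=x'
    r'\u0026pen=abc\u0026ffmpg=3300132%2C26853996\u0026cs=nt\u0026token=y'
)

# Capture only hex digits and '%', so the match cannot run past the next backslash.
# In the pattern, '\\' stands for one literal backslash.
PATTERN = re.compile(r'ffmpg=([A-F0-9%]+)\\u0026cs=nt')

values = PATTERN.findall(sample)
print(values[0])    # 1714248%2C23851735
print(len(values))  # 2
```

If the page really contains two backslashes before u0026, the pattern needs `\\\\` in that position instead.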

Regex for any number of alphanumeric phrases between two '.'

I'm having a hard time trying to phrase this question correctly when researching solutions, so I thought I would ask here. I'm trying to validate a field in my UI where a user enters a string in "Java-package" format. So a correct example would be "com.my.app.class1". However, it needs to be the full package path, so I don't want to accept '*' in the string. I'm trying to find a way to represent this in regex to validate it. My first thought is to split the string into pieces using a . as the delimiter (var splitArray : any[] = packageInput.split('.')), then iterate over the array and check each piece against the correct regex. However, I wanted to know if I could do it all in one regex.
Something as simple as ^\w+(\.\w+)*$ will validate strings of the type you've described, as long as they contain only letters, digits, or _.
It matches all of:
class1
com.my.class1
com.my.app.class1
com.my.app.sub.class1
and doesn't match:
com.my.app.*
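To check the pattern against the examples above, here is a small sketch in Python (the question itself is about a TypeScript UI). The helper name is_valid_package is hypothetical. It assumes ASCII-only names are wanted, which is why re.ASCII is passed; without it, Python's \w also accepts non-ASCII letters.

```python
import re

# fullmatch() anchors at both ends, so ^ and $ are not needed here.
PACKAGE = re.compile(r'\w+(\.\w+)*', re.ASCII)

def is_valid_package(name):
    """True when name is one or more dot-separated runs of letters, digits or '_'."""
    return PACKAGE.fullmatch(name) is not None

for candidate in ['class1', 'com.my.app.class1', 'com.my.app.*', 'com..my', '.com']:
    print(candidate, is_valid_package(candidate))
```

Only the first two candidates are accepted; a wildcard, an empty segment, or a leading dot makes the whole string fail.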

Word removal using re results in wrong words being removed

Given a text "article_utf8", I want to remove a list of words:
remove = "el|la|de|que|y|a|en|un|ser|se|no|haber|..."
regex = re.compile(r'\b('+remove+r')\b', flags=re.IGNORECASE)
article_out = regex.sub("", article_utf8)
however this is incorrectly removing some words and parts of words for example:
1- aseguro becomes seguro
2- sería becomes í
3- coma becomes com
4- miercoles becomes 'ercoles'
Technically parts of a word can match a regexp. To solve this you would have to make sure that whatever sequence of letters your regexp matches is a single word and not part of it.
One way would be to make the regexp contain leading and trailing spaces, but words could also be separated with periods or commas so you would have to take those into account too if you want to catch all instances.
Alternatively, you can try splitting the list first into words using the built-in split method (https://docs.python.org/2/library/stdtypes.html#str.split). Then I would check each word in the resulting list, remove the ones I don't want and rejoin the strings. This method, however doesn't even need regexps so it's probably not what you intended despite being simple and practical.
After much testing, the following will remove the small words in a natural language string, without removing them from parts of other words:
regex = re.compile(r'[\s]?\b('+remove+')[\b\s\.\,]', flags=re.IGNORECASE)
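Two caveats about the expressions in this thread. Inside a character class, \b means a backspace character, not a word boundary, so the trailing class above effectively requires whitespace, a period, or a comma after the word. Also, a result like "sería" turning into "í" is what \b produces when accented letters are not treated as word characters. The sketch below assumes that is the cause (the question does not confirm it) and uses Python 3, where str patterns are Unicode-aware by default. The shortened word list and the helper strip_words are for illustration only.

```python
import re

remove = "el|la|de|que|y|a|en|un|ser|se|no|haber"  # shortened list for the example

# Python 3 str patterns: \b treats accented letters as part of the word.
regex = re.compile(r'\b(?:' + remove + r')\b', flags=re.IGNORECASE)

# ASCII-only matching, shown only to reproduce the symptom from the question.
ascii_regex = re.compile(r'\b(?:' + remove + r')\b', flags=re.IGNORECASE | re.ASCII)

def strip_words(text):
    """Remove the listed words, then collapse the leftover runs of whitespace."""
    return ' '.join(regex.sub('', text).split())

print(strip_words('sería la coma de miercoles'))  # sería coma miercoles
print(ascii_regex.sub('', 'sería'))               # í
```

Under this assumption the original \b-based pattern is fine, and the fix is to match against decoded Unicode text rather than a byte string.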

How to ignore whitespace in a regular expression subject string?

Is there a simple way to ignore the white space in a target string when searching for matches using a regular expression pattern? For example, if my search is for "cats", I would want "c ats" or "ca ts" to match. I can't strip out the whitespace beforehand because I need to find the begin and end index of the match (including any whitespace) in order to highlight that match and any whitespace needs to be there for formatting purposes.
You can stick optional whitespace characters \s* in between every other character in your regex. Although granted, it will get a bit lengthy.
/cats/ -> /c\s*a\s*t\s*s/
While the accepted answer is technically correct, a more practical approach, if possible, is to just strip whitespace out of both the regular expression and the search string.
If you want to search for "my cats", instead of:
myString.match(/m\s*y\s*c\s*a\s*t\s*s\s*/g)
Just do:
myString.replace(/\s*/g,"").match(/mycats/g)
Warning: You can't automate this on the regular expression by just replacing all spaces with empty strings because they may occur in a negation or otherwise make your regular expression invalid.
Addressing Steven's comment to Sam Dufel's answer
Thanks, sounds like that's the way to go. But I just realized that I only want the optional whitespace characters if they follow a newline. So for example, "c\n ats" or "ca\n ts" should match. But wouldn't want "c ats" to match if there is no newline. Any ideas on how that might be done?
This should do the trick:
/c(?:\n\s*)?a(?:\n\s*)?t(?:\n\s*)?s/
See this page for all the different variations of 'cats' that this matches.
You can also solve this using conditionals, but they are not supported in the JavaScript flavor of regex.
You could put \s* in between every character in your search string, so if you were looking for cats you would use c\s*a\s*t\s*s
It's long but you could build the string dynamically of course.
You can see it working here: http://www.rubular.com/r/zzWwvppSpE
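Building that pattern dynamically can be sketched as follows, here in Python. The helper name spaced_pattern is hypothetical. Each character is escaped so that search terms containing regex metacharacters stay literal, and whitespace in the search term itself is dropped because the \s* separators already cover it.

```python
import re

def spaced_pattern(word):
    """Compile a regex matching word with optional whitespace between its characters."""
    chars = [re.escape(ch) for ch in word if not ch.isspace()]
    return re.compile(r'\s*'.join(chars))

text = 'my c ats are here'
match = spaced_pattern('cats').search(text)
print(match.group())  # c ats
print(match.span())   # (3, 8)
```

The span refers to the original string, whitespace included, which is what the question needs for highlighting.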
If you only want to allow spaces, then
\bc *a *t *s\b
should do it. To also allow tabs, use
\bc[ \t]*a[ \t]*t[ \t]*s\b
Remove the \b anchors if you also want to find cats within words like bobcats or catsup.
The following approach can be used to automate this (the example is in Python, although it can be ported to any language).
You can strip the whitespace beforehand AND save the positions of the non-whitespace characters, so that you can later map the match boundaries back to positions in the original string:
import re

def regex_search_ignore_space(regex, string):
    no_spaces = ''
    char_positions = []
    for pos, char in enumerate(string):
        if re.match(r'\S', char):  # uppercase \S matches non-whitespace chars
            no_spaces += char
            char_positions.append(pos)
    match = re.search(regex, no_spaces)
    if not match:
        return match
    # match.start() and match.end() are indices of start and end
    # of the found string in the spaceless string
    # (as we have searched in it); match.end() is exclusive.
    start = char_positions[match.start()]  # in the original string
    # Map the last matched character back and add 1, so trailing whitespace
    # is excluded and a match at the very end does not raise an IndexError.
    # (Assumes the regex does not produce an empty match.)
    end = char_positions[match.end() - 1] + 1  # in the original string
    matched_string = string[start:end]
    # the match WITH spaces is returned.
    return matched_string

with_spaces = 'a li on and a cat'
print(regex_search_ignore_space('lion', with_spaces))
# prints 'li on'
If you want to go further you can construct the match object and return it instead, so the use of this helper will be more handy.
And the performance of this function can of course also be optimized, this example is just to show the path to a solution.
The accepted answer will not work if you are passing a dynamic value (such as the "current value" in an array loop) as the regex test value. You would not be able to insert the optional whitespace without ending up with some really ugly regex.
Konrad Hoffner's solution is therefore better in such cases, as it strips whitespace from both the regex and the test string. The test is then conducted as though neither contains whitespace.

Is this regex correct to denote only strings with min length of 3 and max length of 6?

Rules for the regex in English:
min length = 3
max length = 6
only letters from ASCII table, non-numeric
My initial attempt:
[A-Za-z]{3-6}
A second attempt
\w{3-6}
This regex will be used to validate input strings from a HTML form (i.e. validating an input field).
A modification to your first one would be more appropriate
\b[A-Za-z]{3,6}\b
The \b mark the word boundaries and avoid matching for example 'abcdef' from 'abcdefgh'. Also note the comma between '3' and '6' instead of '-'.
The problem with your second attempt is that it would include numeric characters as well, again has no word boundaries, and the hyphen between '3' and '6' is incorrect.
Edit: The regex I suggested is helpful if you are trying to match the words from some text. For validation etc if you want to decide if a string matches your criteria you will have to use
^[A-Za-z]{3,6}$
I don't know which regex engine you are using (this would be useful information in your question), but even with the quantifier written as {3,6}, your initial attempt will match inside any alphabetic string of three or more characters. You'll want to include word-boundary markers such as \<[A-Za-z]{3,6}\>.
The markers vary from engine to engine, so consult the documentation for your particular engine (or update your question).
First one should be modified as below
([A-Za-z]{3,6})
The second one will allow numbers, which I think you don't want.
The first one should work once the hyphen in the quantifier is replaced with a comma; the second one will include digits as well, but you want to check non-numeric strings.
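For the validation case, the anchored version can be checked with a short sketch, here in Python. The helper name is_valid is hypothetical, and an HTML form would apply the same pattern through its own regex engine. The last two lines show what goes wrong with the hyphen: in Python, {3-6} is not a quantifier and is matched as literal text.

```python
import re

VALID = re.compile(r'[A-Za-z]{3,6}')

def is_valid(value):
    """True when value is 3 to 6 ASCII letters and nothing else."""
    return VALID.fullmatch(value) is not None

for candidate in ['ab', 'abc', 'abcdef', 'abcdefg', 'abc1']:
    print(candidate, is_valid(candidate))

# With a hyphen, the braces are taken literally rather than as a repeat count.
print(re.fullmatch(r'[A-Za-z]{3-6}', 'abc'))     # None
print(re.fullmatch(r'[A-Za-z]{3-6}', 'a{3-6}'))  # a match object
```

Here fullmatch plays the role of the ^ and $ anchors, so only 'abc' and 'abcdef' are accepted.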