Extracting numbers using Regex in Matlab

Extracting numbers using Regex in Matlab - regex

I would like to extract integers from strings from a cell array in Matlab. Each string contains 1 or 2 integers formatted as shown below. Each number can be one or two digits. I would like to convert each string to a 1x2 array. If there is only one number in the string, the second column should be -1. If there are two numbers then the first entry should be the first number, and the second entry should be the second number.
'[1, 2]'
'[3]'
'[10, 3]'
'[1, 12]'
'[11, 12]'
Thank you very much!
I have tried a few different methods that did not work out. I think that I need to use regex and am having difficulty finding the proper expression.

You can use str2num to convert well formatted chars (which you appear to have) to the correct arrays/scalars. Then simply pad from the end+1 element to the 2nd element (note this is nothing in the case there's already two elements) with the value -1.
This is most clearly done in a small loop, see the comments for details:
% Set up the input
c = { ...
'[1, 2]'
'[3]'
'[10, 3]'
'[1, 12]'
'[11, 12]'
};
n = cell(size(c)); % Initialise output
for ii = 1:numel(n) % Loop over chars in 'c'
n{ii} = str2num(c{ii}); % convert char to numeric array
n{ii}(end+1:2) = -1; % Extend (if needed) to 2 elements = -1
end
% (Optional) Convert from a cell to an Nx2 array
n = cell2mat(n);
If you really wanted to use regex, you could replace the loop part with something similar:
n = regexp( c, '\d{1,2}', 'match' ); % Match between one and two digits
for ii = 1:numel(n)
n{ii} = str2double(n{ii}); % Convert cellstr of chars to arrays
n{ii}(end+1:2) = -1; % Pad to be at least 2 elements
end
But there are lots of ways to do this without touching regex, for example you could erase the square brackets, split on a comma, and pad with -1 according to whether or not there's a comma in each row. Wrap it all in a much harder to read (vs a loop) cellfun and ta-dah you get a one-liner:
n = cellfun( #(x) [str2double( strsplit( erase(x,{'[',']'}), ',' ) ), -1*ones(1,1-nnz(x==','))], c, 'uni', 0 );
I'd recommend one of the loops for ease of reading and debugging.

Related

How to sort non-numeric strings by converting them to integers? Is there a way to convert strings to unique integers while being ordered?

I am trying to convert strings to integers and sort them based on the integer value. These values should be unique to the string, no other string should be able to produce the same value. And if a string1 is bigger than string2, its integer value should be greater. Ex: since "orange" > "apple", "orange" should have a greater integer value. How can I do this?
I know there are an infinite number of possibilities between just 'a' and 'b' but I am not trying to fit every single possibility into a number. I am just trying to possibly sort, let say 1 million values, not an infinite amount.
I was able to get the values to be unique using the following:
long int order = 0;
for (auto letter : word)
order = order * 26 + letter - 'a' + 1;
return order;
but this obviously does not work since the value for "apple" will be greater than the value for "z".
This is not a homework assignment or a puzzle, this is something I thought of myself. Your help is appreciated, thank you!

You are almost there ... just a minor tweaks are needed:
you are multiplying by 26
however you have letters (a..z) and empty space so you should multiply by 27 instead !!!
Add zeropading
in order to make starting letter the most significant digit you should zeropad/align the strings to common length... if you are using 32bit integers then max size of string is:
floor(log27(2^32)) = 6
floor(32/log2(27)) = 6
Here small example:
int lexhash(char *s)
{
int i,h;
for (h=0,i=0;i<6;i++) // process string
{
if (s[i]==0) break;
h*=27;
h+=s[i]-'a'+1;
}
for (;i<6;i++) h*=27; // zeropad missing letters
return h;
}
returning these:
14348907 a
28697814 b
43046721 c
373071582 z
15470838 abc
358171551 xyz
23175774 apple
224829626 orange
ordered by hash:
14348907 a
15470838 abc
23175774 apple
28697814 b
43046721 c
224829626 orange
358171551 xyz
373071582 z
This will handle all lowercase a..z strings up to 6 characters length which is:
26^6 + 26^5 +26^4 + 26^3 + 26^2 + 26^1 = 321272406 possibilities
For more just use bigger bitwidth for the hash. Do not forget to use unsigned type if you use the highest bit of it too (not the case for 32bit)

You can use position of char:
std::string s("apple");
int result = 0;
for (size_t i = 0; i < s.size(); ++i)
result += (s[i] - 'a') * static_cast<int>(i + 1);
return result;
By the way, you are trying to get something very similar to hash function.

Need help implementing a certain logic that will fill a text to a certain width.

The task is to justify text within a certain width.
user inputs: Hello my name is Harrry. This is a sample text input that nobody
will enter.
output: What text width do you want?
user inputs: 15
output: |Hello my name|
|is Harrry. This|
|is a sample|
|text that|
|nobody will|
|enter. |
Basically, the line has to be 15 spaces wide including blank spaces. Also, if the next word in the line cant fit into 15, it will skip entirely. If there are multiple words in a line, it will try to distribute the spaces evenly between each word. See the line that says "Is a sample" for example.
I created a vector using getline(...) and all that and the entire text is saved in a vector. However, I'm kind of stuck on moving forward. I tried using multiple for loops, but I just cant seem to skip lines or even out the spacing at all.
Again, not looking or expecting anyone to solve this, but I'd appreciate it if you could guide me into the right direction in terms of logic/algorithm i should think about.

You should consider this Dynamic programming solution.
Split text into “good” lines
Since we don't know where we need to break the line for good justification, we start guessing where the break to be done to the paragraph. (That is we guess to determine whether we should break between two words and make the second word as start of the next line).
You notice something? We brutefore!
And note that if we can't find a word small enought to fit in the remaining space in the current line, we insert spaces inbetween the words in the current line. So, the space in the current line depends on the words that might go into the next or previous line. That's Dependency!
You are bruteforcing and you have dependency,there comes the DP!
Now lets define a state to identify the position on our path to solve this problem.
State: [i : j] ,which denotes line of words from ith word to jth word in the original sequence of words given as input.
Now, that you have state for the problem let us try to define how these states are related.
Since all our sub-problem states are just a pile of words, we can't just compare the words in each state and determine which one is better. Here better delineates to the use of line's width to hold maximum character and minimum spaces between the words in the particular line. So, we define a parameter, that would measure the goodness of the list of words from ith to jth words to make a line. (recall our definition of subproblem state). This is basically evaluating each of our subproblem state.
A simple comparison factor would be :
Define badness(i, j) for line of words[i : j].
For example,
Infinity if total length > page width,
else (page width − total length of words in current line)3
To make things even simple consider only suffix of the given text and apply this algorithm. This would reduce the DP table size from N*N to N.
So, For finishing lets make it clear what we want in DP terms,
subproblem = min. badness for suffix words[i :]
=⇒ No.of subproblems = Θ(n) where n = no of words
guessing = where to end first line, say i : j
=⇒ no. of choices for j = n − i = O(n)
recurrence relation between the subproblem:
• DP[i] = min(badness (i, j) + DP[j] for j in range (i + 1, n + 1))
• DP[n] = 0
=⇒ time per subproblem = Θ(n)
so, total time = Θ(n^2).
Also, I'll leave it to you how insert spaces between words after determining the words in each line.

Logic would be:
1) Put words in array
2) Loop though array of words
3) Count the number of chars in each word, and check until they are the text width or less (skip if more than textwidth). Remember the number of words that make up the total before going over 15 (example remember it took 3 words to get 9 characters, leaving space for 6 spaces)
4) Divide the number of spaces required by (number of words - 1)
5) Write those words, writing the same number of spaces each time.
Should give the desired effect I hope.

You obviously have some idea how to solve this, as you have already produced the sample output.
Perhaps re-solve your original problem writing down in words what you do in each step....
e.g.
Print text asking for sentence.
Take input
Split input into words.
Print text asking for width.
...
If you are stuck at any level, then expand the details into sub-steps.
I would look to separate the problem of working out a sequence of words which will fit onto a line.
Then how many spaces to add between each of the words.

Below is an example for printing one line after you find how many words to print and what is the starting word of the line.
std::cout << "|";
numOfSpaces = lineWidth - numOfCharsUsedByWords;
/*
* If we have three words |word1 word2 word3| in a line
* ideally the spaces to print between then are 1 less than the words
*/
int spaceChunks = numOfWordsInLine - 1;
/*
* Print the words from starting point to num of words
* you can print in a line
*/
for (j = 0; j < numOfWordsInLine; ++j) {
/*
* Calculation for the number of spaces to print
* after every word
*/
int spacesToPrint = 0;
if (spaceChunks <= 1) {
/*
* if one/two words then one
* chunk of spaces between so fill then up
*/
spacesToPrint = numOfSpaces;
} else {
/*
* Here its just segmenting a number into chunks
* example: segment 7 into 3 parts, will become 3 + 2 + 2
* 7 to 3 = (7%3) + (7/3) = 1 + 2 = 3
* 4 to 2 = (4%2) + (4/2) = 0 + 2 = 2
* 2 to 1 = (2%1) + (2/1) = 0 + 2 = 2
*/
spacesToPrint = (numOfSpaces % spaceChunks) + (numOfSpaces / spaceChunks);
}
numOfSpaces -= spacesToPrint;
spaceChunks--;
cout << words[j + lineStartIdx];
for (int space = 0; space < spacesToPrint; space++) {
std::cout << " ";
}
}
std::cout << "|" << std::endl;
Hope this code helps. Also you need to consider what happens if you set width less then the max word size.

Extract Digits From Matlab String

In Matlab, let's say that I have the following string:
mystring = 'sdfkdsgoeskjgk elkr jtk34s ;3k54352642 643l j3kf p35j535';
And I want to extract all the digits in it to a vector such that each digit is standing by its own, so the output should be like:
output = [3 4 3 5 4 3 5 2 6 4 2....]
I tried to do it using this code and regex:
mystring = 'sdfkdsgoeskjgk elkr jtk34s ;3k54352642 643l j3kf p35j535';
digits = regexp(mystring, '[0-9]');
disp(digits);
But it gives me some weird 4-combined digits instead of what I need.

By default, the output of regexp is in the index of the first character in each match which is why the numbers aren't the same as the digits in your string. You'll want to use the output of regexp to then index into the initial string to get the digits themselves
digits = mystring(regexp(mystring, '[0-9]'));
You will still need to convert these from characters to numbers so you can subtract off '0' to do this conversion
digits = mystring(regexp(mystring, '[0-9]')) - '0';
Alternately, you could specify the 'match' input to regexp to return the actual matching string itself. This will return a cell array which we can then convert to an array of numbers using str2double
digits = str2double(regexp(mystring, '[0-9]', 'match'))

I use transposing instead of any other existing function to convert a string into an array.
mystring = 'sdfkdsgoeskjgk elkr jtk34s ;3k54352642 643l j3kf p35j535';
digits = regexp(mystring, '[0-9]');
array = double(mystring(digits)')'-48; % array of doubles
disp(array);

Creating a histogram with C++ (Homework)

In my c++ class, we got assigned pairs. Normally I can come up with an effective algorithm quite easily, this time I cannot figure out how to do this to save my life.
What I am looking for is someone to explain an algorithm (or just give me tips on what would work) in order to get this done. I'm still at the planning stage and want to get this code done on my own in order to learn. I just need a little help to get there.
We have to create histograms based on a 4 or 5 integer input. It is supposed to look something like this:
Calling histo(5, 4, 6, 2) should produce output that appears like:
*
* *
* * *
* * *
* * * *
* * * *
-------
A B C D
The formatting to this is just killing me. What makes it worse is that we cannot use any type of arrays or "advanced" sorting systems using other libraries.
At first I thought I could arrange the values from highest to lowest order. But then I realized I did not know how to do this without using the sort function and I was not sure how to go on from there.
Kudos for anyone who could help me get started on this assignment. :)

Try something along the lines of this:
Determine the largest number in the histogram
Using a loop like this to construct the histogram:
for(int i = largest; i >= 1; i--)
Inside the body of the loop, do steps 3 to 5 inclusive
If i <= value_of_column_a then print a *, otherwise print a space
Repeat step 3 for each column (or write a loop...)
Print a newline character
Print the horizontal line using -
Print the column labels

Maybe i'm mistaken on your q, but if you know how many items are in each column, it should be pretty easy to print them like your example:
Step 1: Find the Max of the numbers, store in variable, assign to column.
Step 2: Print spaces until you get to column with the max. Print star. Print remaining stars / spaces. Add a \n character.
Step 3: Find next max. Print stars in columns where the max is >= the max, otherwise print a space. Add newline. at end.
Step 4: Repeat step 3 (until stop condition below)
when you've printed the # of stars equal to the largest max, you've printed all of them.
Step 5: add the -------- line, and a \n
Step 6: add row headers and a \n

If I understood the problem correctly I think the problem can be solved like this:
a= <array of the numbers entered>
T=<number of numbers entered> = length(a) //This variable is used to
//determine if we have finished
//and it will change its value
Alph={A,B,C,D,E,F,G,..., Z} //A constant array containing the alphabet
//We will use it to print the bottom row
for (i=1 to T) {print Alph[i]+" "}; //Prints the letters (plus space),
//one for each number entered
for (i=1 to T) {print "--"}; //Prints the two dashes per letter above
//the letters, one for each
while (T!=0) do {
for (i=1 to N) do {
if (a[i]>0) {print "*"; a[i]--;} else {print " "; T--;};
};
if (T!=0) {T=N};
}
What this does is, for each non-zero entered number, it will print a * and then decrease the number entered. When one of the numbers becomes zero it stops putting *s for its column. When all numbers have become zero (notice that this will occur when the value of T comes out of the for as zero. This is what the variable T is for) then it stops.
I think the problem wasn't really about histograms. Notice it also doesn't require sorting or even knowing the

Finding all possible common substrings from a file consisting of strings using c++

I am trying to find all possible common strings from a file consisting of strings of various lengths. Can anybody help me out?
E.g input file is sorted:
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAC
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAG
AAAAAAAATTAGGCTGGG
AAAAAAAATTGAAACATCTATAGGTC
AAAAAAACTCTACCTCTCT
AAAAAAACTCTACCTCTCTATACTAATCTCCCTACA
and my desired output is:
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAC
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAG
AAAAAAAATTAGGCTGGG
AAAAAAAATTGAAACATCTATAGGTC
AAAAAAACTCTACCTCTCTATACTAATCTCCCTACA
[EDIT] Each line which is a substring of any other line should be removed.

Basically for each line, compare it with the next line to see if the next line is shorter or if the next line's substring is not equal to the current line. If this is true, the line is unique. This can be done with a single linear pass because the list is sorted: any entry which contains a substring of the entry will follow that entry.
A non-algorithmic optimization (micro-optimization) is to avoid the use of substr which creates a new string. We can simply compare the other string as though it was truncated without actually creating a truncated string.
vector<string> unique_lines;
for (unsigned int j=0; j < lines.size() - 2; ++j)
{
const string& line = lines[j];
const string& next_line = lines[j + 1];
// If the line is not a substring of the next line,
// add it to the list of unique lines.
if (line.size() >= next_line.size() ||
line != next_line.substr(0, line .size()))
unique_lines.push_back(line);
}
// The last line is guaranteed to not be a substring of any
// previous line as the lines are sorted.
unique_lines.push_back(lines.back());
// The desired output will be contained in 'unique_lines'.

What I understand is you want to find substring and wanted to remove such string which is substring of any string.
For that you can use strstr method to find if a string is a substring of another string.
Hope this will help..

Well, that's probably not the fastest solution to solve your problem, but seems easy to implement. You just keep a histogram of chars that will represent a signature of a string. For each string that you read (separated for spaces), you count the numbers of each char and just stores it on your answer if there isn't any other string with the same numbers of each char. Let me illustrate it:
aaa bbb aabb ab aaa
Here we have just two possible input letters, so, we just need an histogram of size 2.
aaa - hist[0] = 3, hist[1] = 0 : New one - add to the answer
bbb - hist[0] = 0, hist[1] = 3 : New one - add to the answer
aabb - hist[0] = 2, hist[1] = 2 : New one - add to the answer
ab - hist[0] = 1, hist[1] = 1 : New one - add to the answer
aaa - hist[0] = 3, hist[1] = 0 : Already exists! Don't add to the answer.
The bottleneck of your implementation will be the histogram comparisons, and there are a lot of possible implementations for it.
The simplest one would be a simple linear search, iterating through all your previous answer and comparing with the current histogram, wich would be O(1) to store and O(n) to search. If you have a big file, it would take hours to finish.
A faster one, but a lot more troublesome to implement, would use a hash table to store your answer, and use the histogram signature to generate the hash code. Would be to troublesome to explain this approach here.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Extracting numbers using Regex in Matlab - regex

Related

How to sort non-numeric strings by converting them to integers? Is there a way to convert strings to unique integers while being ordered?

Need help implementing a certain logic that will fill a text to a certain width.

Extract Digits From Matlab String

Creating a histogram with C++ (Homework)

Finding all possible common substrings from a file consisting of strings using c++

Categories

Resources