Generate a list of all possible combinations of a regular expression based string - regex

This is not a straightforward "all possible combinations" question.
EDIT: The regex is just a fixed length string with different combinations of alpha and non alphanumeric for each index...
Given a regular expression of fixed length, what would be the fastest way of computing and storing all combinations in a database, with the time spent saving to the database included? In other words, how do I get as quickly as possible from the regular expression to a database containing every combination?
What I did, successfully but ridiculously slowly, was create an array the length of the fixed-length regular expression, where each element contained every possible character at that position; I generated this with a script. I then did a loopception over the array, with an SQL Server connection open from start to finish, inserting 10 possibilities at a time. It was extremely slow: we're talking about a string of 7 or 8 characters with a maximum of 36 possibilities at any given position. It took a few days.
So, my question is given this problem what would be the best combination of technologies, languages and algorithm to accomplish this the quickest?

Number of possible strings with length 8 and composed of 36 possible characters:
36^8 = 2821109907456 ≈ 2.8 trillion
Generating that many strings in any way will take "considerable" time. Let's look at how long it will take to insert them into the DB. Assuming really good DB performance, we can take 20000 inserts/sec. In that case the total insertion time is expected to be:
2.8 * 10^12 / 20000 = 140 million seconds
140 * 10^6 / (60*60*24) ≈ 1620 days
So, this answers your question I guess: 1620 DAYS!
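The estimate above can be reproduced with a short Python sketch. The function names (`count_combinations`, `generate`, `estimated_days`) and the 36-character set are made up for illustration and are not the asker's script; the 20000 inserts/sec figure is the assumption taken from the answer. Without rounding to 2.8 trillion, the estimate comes out slightly higher, at roughly 1630 days.

```python
from itertools import islice, product

SECONDS_PER_DAY = 60 * 60 * 24

def count_combinations(alphabets):
    """Number of strings: the product of the alphabet sizes per position."""
    total = 1
    for chars in alphabets:
        total *= len(chars)
    return total

def generate(alphabets):
    """Lazily yield every combination without holding them all in memory."""
    for combo in product(*alphabets):
        yield "".join(combo)

def estimated_days(total, inserts_per_second):
    """Insertion time in days at a constant (assumed) insert rate."""
    return total / inserts_per_second / SECONDS_PER_DAY

# Assumed worst case: 8 positions, 36 possible characters at each.
charset = "abcdefghijklmnopqrstuvwxyz0123456789"
alphabets = [charset] * 8

total = count_combinations(alphabets)
print(total)
print(estimated_days(total, 20000))
print(list(islice(generate(alphabets), 3)))  # first few strings only
```

Since the strings are produced lazily, generation itself needs almost no memory; the bottleneck is the number of rows, which no choice of language changes.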


Google Sheets - perform mathematical operations after extracting numeric data from strings with specific text

I need a formula that performs a specific mathematical operation, but only with the number that meets specific conditions. In this case – with numbers extracted from strings with specific text in them.
In the first column we have some raw data: a string with different numbers and text divided by the underscore. I need to split this data into several different rows and use the following formula for this: =TRANSPOSE(SPLIT(A3,"_"))
The next column should only contain numbers, but the problem is that one of these numbers (which contains "tb" in this specific example) should be divided or multiplied by the specific number (multiplied by 1000 in this case).
I've tried the following formula which only works as long as there is no "tb" or if it's in the very beginning of the string: =IF(REGEXMATCH(A6,"tb"),REGEXEXTRACT(A6,"(\d+)tb")*1000,REGEXEXTRACT(B6,"(\d+)"))
If it's somewhere in the middle or at the end of the string only the first number still undergoes the math operation instead. I wonder if there's a way to achieve the result I want without resorting to complex formulas (I'm very new to this and would ideally like to use formulas that I can understand and easily modify for other similar tasks). A sample table for better visualisation can be seen below. Thanks in advance!
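| Raw data        | Split data | Extracted numbers (what I get) | Desired outcome |
|-----------------|------------|--------------------------------|-----------------|
| 5tb_200gb_300mb | 5tb        | 5000                           | 5000            |
|                 | 200gb      | 200                            | 200             |
|                 | 300mb      | 300                            | 300             |
| 2tb_500gb_50mb  | 2tb        | 2000                           | 2000            |
|                 | 500gb      | 500                            | 500             |
|                 | 50mb       | 50                             | 50              |
| 500gb_50mb_2tb  | 500gb      | 2000                           | 500             |
|                 | 50mb       | 50                             | 50              |
|                 | 2tb        | 2                              | 2000            |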
Try not putting tb in the first REGEXEXTRACT:
=IF(REGEXMATCH(A6,"tb"),REGEXEXTRACT(A6,"(\d+)")*1000,REGEXEXTRACT(B6,"(\d+)"))
EDIT
Option 2: to extract the numbers adjacent before "tb"
=IF(REGEXMATCH(A6,"tb"),REGEXEXTRACT(A6,"(\d+)tb")*1000,REGEXEXTRACT(B6,"(\d+)"))
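For comparison, here is the same logic outside of Sheets, as a minimal Python sketch. It assumes that the "tb" check should be applied to each split piece rather than to the whole raw string, which is what the "Desired outcome" column in the question implies. The function name `extract_numbers` and the `multipliers` table are made up for illustration.

```python
import re

def extract_numbers(raw, multipliers=None):
    """Split on "_" and scale each number by its own unit's multiplier."""
    if multipliers is None:
        multipliers = {"tb": 1000}  # assumption from the question: tb -> x1000
    numbers = []
    for piece in raw.split("_"):
        match = re.fullmatch(r"(\d+)([a-z]+)", piece)
        if match is None:
            continue  # skip pieces that are not "<digits><unit>"
        value = int(match.group(1))
        unit = match.group(2)
        numbers.append(value * multipliers.get(unit, 1))
    return numbers

print(extract_numbers("500gb_50mb_2tb"))
```

Because each piece is tested on its own, the position of "tb" in the raw string no longer matters.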

regex to match max of input numeric string

I have incoming input entries, like these:
750
1500
1
100
25
55
And there is a lookup table like the one given below:
25
7
5
75
When I receive my first entry, in this case 750, I look it up in the lookup table and try to find the string with the longest match from left to right.
So for 750, the longest match would be 75.
I was wondering whether it is possible to write a regex for this kind of scenario, because if I use Java's startsWith function it can give me 7 as well.
The input entries come from a text file one by one, and all the lookup entries are in a different text file.
I'm using Java.
How can I write a regex for this?
This doesn't seem like a regex problem at first, but you could actually solve it with a regex, and the result would be pretty efficient.
A regex for your example lookup table would be:
/^(75?|5|25)/
This will do what you want, and it will avoid the repeated searches of a naive "check every one" approach.
The regex would get complicated, though, as your lookup table grew. Adding a couple of terms to your lookup table:
25
7
5
75
750
72
We now have:
/^(7(50?|2)?|5|25)/
This is obviously going to get complicated quickly. The trick would be programmatically constructing the appropriate regex for arbitrary data--not a trivial problem, but not insurmountable either.
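One possible way to construct such a regex programmatically is to build a trie from the lookup entries and emit a nested group per node. The following is a minimal Python sketch of that idea (the question uses Java, but the approach carries over). The helper names `build_trie`, `trie_to_regex`, `compile_lookup` and `longest_match` are made up for illustration, and the generated pattern is equivalent to, not textually identical to, the hand-written ones above.

```python
import re

END = ""  # marker key: a lookup entry ends at this node

def build_trie(entries):
    """Nested dicts, one level per character."""
    root = {}
    for entry in entries:
        node = root
        for ch in entry:
            node = node.setdefault(ch, {})
        node[END] = True
    return root

def trie_to_regex(node):
    """Emit one alternative per child; make the group optional
    when an entry already ends at this node."""
    branches = [
        re.escape(ch) + trie_to_regex(child)
        for ch, child in sorted(node.items())
        if ch != END
    ]
    if not branches:
        return ""
    group = "(?:" + "|".join(branches) + ")"
    return group + "?" if END in node else group

def compile_lookup(entries):
    return re.compile(trie_to_regex(build_trie(entries)))

def longest_match(pattern, text):
    # re.match anchors at the start of the text, like the leading ^ above.
    found = pattern.match(text)
    return found.group(0) if found else None

pattern = compile_lookup(["25", "7", "5", "75"])
print(longest_match(pattern, "750"))
```

The optional groups are greedy, so the engine tries the longer entry first and only falls back to the shorter one when the longer one fails.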
That said, this would be an..umm...unusual thing to implement in production code.
I would be hesitant to do so.
In most cases, I would simply do this:
Find all the strings that match.
Find the longest one.
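Those two steps can be sketched in a few lines of Python (the Java equivalent would use startsWith in a loop and keep the longest hit). The function name `longest_prefix_match` is made up for illustration.

```python
def longest_prefix_match(entry, lookup):
    """Return the longest lookup string that entry starts with, or None."""
    matches = [key for key in lookup if entry.startswith(key)]  # step 1
    return max(matches, key=len, default=None)                  # step 2

lookup = ["25", "7", "5", "75"]
for entry in ["750", "1500", "1", "100", "25", "55"]:
    print(entry, longest_prefix_match(entry, lookup))
```

This checks every lookup entry for every input, which is fine for small tables and much easier to maintain than a generated regex.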
(?: 25 | 5 | 75? )
There is free software that automatically makes full blown regex trie for you.
Just put the output regex into a text file and load it instead.
If your values don't change very much, this is a very fast way to do a lookup.
If it does change, generate another one.
What's good about a full blown trie here is that it never takes more than 8 steps to match.
The one I just did http://imgur.com/a/zwMhL
Even a 175,000 Word Dictionary takes no more than 8 steps.
Internally the app initially makes a ternary tree from the input
then converts it into a full blown regex trie.

Extract Float from Specific String Using Regular Expression

What regular expression do I use to extract, for example, 1.09487 from the following text contained in a .txt file? Also, how would I modify the regular expression to account for the case where the float is negative (for example, -1.948)?
I tried several suggestions on Google as well as a regular expression generator, but none seem to work. It seems I want to use an anchor (such as ^) to start searching for digits at the word "serial" and then stop at "(", but this doesn't seem to work.
Output in .txt file:
Entropy = 7.980627 bits per character.
Optimum compression would reduce the size
of this 51768 character file by 0 percent.
Chi square distribution for 51768 samples is 1542.26, and randomly
would exceed this value less than 0.01 percent of the times.
Arithmetic mean value of data bytes is 125.93 (127.5 = random).
Monte Carlo value for Pi is 3.169834647 (error 0.90 percent).
Serial correlation coefficient is 1.09487 (totally uncorrelated = 0.0).
Thanks for any help.
This should be sufficient:
(?<=Serial correlation coefficient is )[-\d.]+
Unless you're expecting garbage, this will work fine.
try this:
(-?\d+\.\d+)(?=\s\(totally)
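As a usage example, here is a minimal Python sketch that applies the same idea with a capture group instead of a lookaround and handles the negative case via an optional minus sign. The question does not say which language is being used, so Python and the function name `serial_correlation` are assumptions.

```python
import re

# Anchor on the label text rather than on ^, then capture the signed float.
PATTERN = re.compile(r"Serial correlation coefficient is (-?\d+(?:\.\d+)?)")

def serial_correlation(text):
    """Return the coefficient as a float, or None if the line is absent."""
    found = PATTERN.search(text)
    return float(found.group(1)) if found else None

sample = (
    "Monte Carlo value for Pi is 3.169834647 (error 0.90 percent).\n"
    "Serial correlation coefficient is 1.09487 (totally uncorrelated = 0.0).\n"
)
print(serial_correlation(sample))
```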

Complex regex to check for two words and a quantity

Ok what I'm trying to do is to check for the presence of
"TestItem-1"
a number greater than 1
one of the possible words in the list "KG, Kg, kg, Kilo(s) or Kilogram(s)"
Where any of the items could be in any order and within a 6 word limit of each other.
Has to be done in regex as there is no access to the underlying scripting engine
This is what I've got. As there is no way of checking "greater than" in a regex, I decided to use a range of 1-999 for the number check.
\b(?:[T|t]estItem-1\W+(?:\w+\W+){1,6}(^[0-9]|[1-9][0-9]|[1-9][0-9][0-9])$)\W+(?:\w+\W+){1,6}[K|k]il[o|os]|[K|k][[G|GS]|[g|gs]]|[|K|k]ilogra[m|ms]\b
Examples of what I need to find would be like -
"TestItem-1 is unstable in quanties above 12 Kilograms"
"1 Kilogram of TestItem-1"
While I wouldn't want to find -
"15 units of TestItem-1"
I know that what I've got isn't working: each section appears to work independently of the others, but not together.
I pass this over to far greater minds than mine :)
You can try something like this:
\b(?:[2-9]|\d\d+)\b\s\b(?:KG|Kg|kg|Kilos?|Kilograms?)\b\s(?:\S+\s){0,6}\bTestItem-1\b|\bTestItem-1\b\s(?:\S+\s){0,6}\b(?:[2-9]|\d\d+)\b\s\b(?:KG|Kg|kg|Kilos?|Kilograms?)\b
Not ideal with the duplication but without lookarounds that's the best I could think of. I'll try and improve it in a bit.
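To check a pattern of this shape against the examples from the question, a small Python harness can help; the regex itself stays usable without any scripting engine. This is a sketch under two assumptions: the number sits directly before the unit, and any number from 1 upwards is accepted, because the question lists "1 Kilogram of TestItem-1" as a wanted match. To require a number strictly greater than 1, swap `NUMBER` for `(?:[2-9]|\d\d+)`. The names `NUMBER`, `UNIT`, `ITEM`, `GAP` and `is_match` are made up for illustration.

```python
import re

NUMBER = r"[1-9]\d*"                      # assumption: 1 or more counts
UNIT = r"(?:KG|Kg|kg|Kilos?|Kilograms?)"
ITEM = r"TestItem-1"
GAP = r"\s+(?:\S+\s+){0,6}?"              # up to 6 words in between

# Two alternatives, one per order: quantity first, or item first.
PATTERN = re.compile(
    rf"\b{NUMBER}\s+{UNIT}\b{GAP}{ITEM}\b"
    rf"|\b{ITEM}\b{GAP}{NUMBER}\s+{UNIT}\b"
)

def is_match(text):
    return PATTERN.search(text) is not None

for text in [
    "TestItem-1 is unstable in quanties above 12 Kilograms",
    "1 Kilogram of TestItem-1",
    "15 units of TestItem-1",
]:
    print(is_match(text), text)
```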

SQL Hash table for words

I'm trying to solve the "find all possible words for a set of letters" problems. There are some good answers out there, but I still can't figure it out.
In my first test, I put the whole dictionary in an array and then looped through each letter. This is super fast, but it takes forever to load the dictionary in the array, and requires huge amount of memory.
So I need to store the dictionary (750,000 words) in a SQL database.
I guess there are two solutions to find all the possible words:
Make an advanced query that returns all the possible words
Make a simple query that returns a fraction of the database with words that might be possible, and then quickly loop through that array and validate the words.
The problem?:
It must be super fast. An iPhone 4 need to be able to get all possible words in under 5-6 seconds so it doesn't hinder the game.
Here's a similar question:
IOS: Sqlite. Find record fast
Sulthan's answer seems like a good idea. Create a hash table, and then:
Bitmask for ASCII letters (ignoring any non-ASCII alphabets). Bit at
position 0 means the word contains "a", at position 1 contains "b"
etc. If we create the same bitmask for our letters, we can select
words such as (wordMask & ~lettersMask) == 0
How do you make the bitmask, hash table, and how do you construct the sql query?
Thanks
SQL is probably not the best option. The traditional data structure for storing a collection of words is called a Trie. I'm sure there are implementations out there you can find. Someone else will have an answer to that.
The algorithm I envision is to permute the letters you are given, and check each permutation to see if it is in the Trie.
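For reference, the bitmask idea quoted in the question can be sketched as follows. This is a minimal Python illustration using the standard `sqlite3` module with an in-memory database; the table layout (`words(word, mask)`) and the function names are assumptions, not taken from the linked answer. The mask only records which letters a word contains, not how many of each, so the rows returned by the query are validated afterwards with a letter count.

```python
import sqlite3
from collections import Counter

def letter_mask(text):
    """Bit 0 set if text contains 'a', bit 1 for 'b', and so on."""
    mask = 0
    for ch in text.lower():
        if "a" <= ch <= "z":
            mask |= 1 << (ord(ch) - ord("a"))
    return mask

def build_db(words):
    """Store each word together with its precomputed mask."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE words (word TEXT PRIMARY KEY, mask INTEGER NOT NULL)")
    conn.executemany(
        "INSERT INTO words (word, mask) VALUES (?, ?)",
        [(w, letter_mask(w)) for w in words],
    )
    return conn

def possible_words(conn, letters):
    # SQL pre-filter: keep words that use no letter outside the given set.
    rows = conn.execute(
        "SELECT word FROM words WHERE (mask & ~?) = 0 ORDER BY word",
        (letter_mask(letters),),
    )
    # Exact check: the word must not need more copies of a letter than we have.
    available = Counter(letters.lower())
    return [word for (word,) in rows if not (Counter(word) - available)]

conn = build_db(["act", "at", "cat", "dog", "tact"])
print(possible_words(conn, "tca"))
```

The bitwise condition cannot use an ordinary index, so the query most likely scans the whole table; whether that is fast enough on the target device would need measuring.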