How to do a fast regex search on an HDF5 database - regex

I have an HDF5 database with 100 million+ rows of text, each storing a simple three-column set of values:
ID WORD HEADWORD
1 the the
2 cats cat
3 sat sit
4 on on
5 the the
6 mats mat
...
I want to search the "WORD" column to find all hits for 'at' (i.e., 'cats', 'sat', 'mats').
In some other database (e.g. PostgreSQL) I might do this with a simple regex search '?at?'. If I could search the HDF5 index using regex, that would be fine, but I don't think this is possible. Any suggestions for how to do this kind of 'wildcard' (regex) search quickly?

Try the following regex:
[^\s]+[\s]+([a-zA-Z]*at[a-zA-Z]*)[\s]+[^\s]+
Group 1 in the above regex will give you the desired result.
Debuggex Demo
Regex Demo
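A minimal Python sketch of applying this pattern row by row. It assumes each row is already available as one whitespace-separated line of text (reading the rows out of the HDF5 file is not shown), and the names ROW_PATTERN and find_hits are made up for this illustration.

```python
import re

# The pattern from the answer: ID, whitespace, a WORD containing "at",
# whitespace, HEADWORD. Group 1 captures the WORD column.
ROW_PATTERN = re.compile(r"[^\s]+[\s]+([a-zA-Z]*at[a-zA-Z]*)[\s]+[^\s]+")


def find_hits(rows):
    """Return the WORD value (group 1) of every row whose WORD contains 'at'."""
    hits = []
    for row in rows:
        match = ROW_PATTERN.match(row)
        if match:
            hits.append(match.group(1))
    return hits


# Sample rows copied from the question.
rows = ["1 the the", "2 cats cat", "3 sat sit", "4 on on", "5 the the", "6 mats mat"]
print(find_hits(rows))
```

Note that this tests every row in turn, so it is a linear scan rather than an indexed lookup.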

Related

Match seven columns for each record using regex

I have tried to simulate a regex pattern on the following link
https://regex101.com/r/yusSo4/1
It works partially, but I need to get 7 columns for each record.
Group 3 should be split into two groups.
This is my attempt, but it is not entirely correct:
^( *\d{6,12} *\n)(.*\n(?:.*\n)?)( *\d{14} *\n)( *\d{14} *\n)(.*\n(?:.*\n)?)( *\d{1,2} *\n)(.*\n(?:.*\n)?)((?:\n(?! *\d{6,10} *$)[^\d\n]+)*)
I can work around it with this pattern to get the desired results:
^( *\d{6,12} *\n)(.*\n(?:.*\n)?)( *\d{14} *\n)( *\d{14} *\n)(.*\n(?:.*\n)?)( *\d{1,2} *\n)(.*\n?)((?:\n(?! *\d{6,10} *$)[^\d\n]+)*)
But I welcome any other suggestions

Regex for values that are in between spaces

I am new to regex and am having difficulty obtaining values that sit between spaces.
I am trying to get the values "field 1" and "abc/def try" from the sample data below using just regex.
Currently I'm using (^.{18}\s+) to skip the first 18 characters, but I am at a loss as to how to grab values that contain spaces.
A1234567890 field 1 abc/def try
02021051812 12 test test 12 pass
3333G132021 no test test cancel
Any help or pointers would be appreciated.
If this text has fixed-width columns, you can match and trim the column values by knowing the number of characters between the start of the string and the column text.
For example, this regex will work for the text you posted:
^(.*?)\s*(?<=.{19})(.*?)\s*(?<=^.{34})(.*?)\s*(?<=^.{46})
See the regex demo.
So, Column 2 starts at Position 19, Column 3 starts at Position 34 and Column 4 (end of string here) is at Position 46.
However, this regex is not that efficient, and it would be really great if the data format is fixed on the provider's side.
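If the columns really are fixed-width, plain slicing may be a simpler alternative to the lookbehind regex. The sketch below is not the answer's method: it assumes 0-based column starts of 0, 19 and 34 (taken from the lookbehinds above), and the sample line is made up and padded to fit those offsets, since the original spacing is not preserved in the question.

```python
# Hypothetical 0-based column offsets, taken from the lookbehinds in the regex above.
COLUMN_STARTS = (0, 19, 34)


def split_fixed_width(line, starts=COLUMN_STARTS):
    """Slice a fixed-width line at the given offsets and trim the padding."""
    ends = starts[1:] + (None,)  # the last column runs to the end of the line
    return [line[start:end].strip() for start, end in zip(starts, ends)]


# Made-up sample line, padded so that each column begins at the assumed offset.
sample = "A1234567890".ljust(19) + "field 1".ljust(15) + "abc/def try"
fields = split_fixed_width(sample)
print(fields)
```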
Since it is not known whether the data is always the same length, I created the following, which provides a group for each column you might want to use:
^((\s{0,1}\S{1,})*)(\s{2,})((\s{0,1}\S{1,})*)(\s{2,})((\s{0,1}\S{1,})*)
Regex demo
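A short Python sketch of how the groups of this second pattern could be read. It assumes the columns are separated by two or more spaces, as the pattern requires; the sample line and the names COLUMN_PATTERN and split_columns are made up for the illustration.

```python
import re

# The pattern from the answer. The columns are captured by groups 1, 4 and 7;
# groups 3 and 6 capture the runs of two or more spaces between them.
COLUMN_PATTERN = re.compile(
    r"^((\s{0,1}\S{1,})*)(\s{2,})((\s{0,1}\S{1,})*)(\s{2,})((\s{0,1}\S{1,})*)"
)


def split_columns(line):
    """Return the three column values, or None if the line does not match."""
    match = COLUMN_PATTERN.match(line)
    if match is None:
        return None
    return [match.group(1), match.group(4), match.group(7)]


# Made-up sample line with at least two spaces between the columns.
sample = "A1234567890   field 1      abc/def try"
columns = split_columns(sample)
print(columns)
```

On lines that do not contain two such gaps, the nested quantifiers can backtrack heavily, so it may be worth checking the input format before running this over a large file.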

Joining two lines based on specific characters Notepad++

I'm trying to join lines of data in Notepad++. Currently, the data looks like this:
It has the above format for about 100,000 rows. I want to combine row 1 with row 2, but sometimes row 2 and row 3 combine and look something like this:
I want the output to look like this (all on one line):
I tried using this formula:
SEARCH: (.+)\R(.+)
REPLACE: \1 \2
If you want to match specific characters in regex, you can simply type that character. For example, apple will only match apple. If you want to match a digit, you can use \d. This will match 8, but not d.
If you want to match only lines that end in four digits separated by a dot (two on each side), try this one: \n(.*?\d\d\.\d\d)\n
An explanation for each part can be found here.
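To see what the SEARCH/REPLACE pair from the question does, here is a rough Python approximation. Python's re module has no \R, so the line break is spelled out explicitly; the input text is made up, because the question's actual data is not shown, and the names PAIR_PATTERN and join_line_pairs are hypothetical.

```python
import re

# Approximation of the Notepad++ search (.+)\R(.+): the line break is written
# out as \r\n, \r or \n, and [^\r\n]+ stands in for .+ so that a carriage
# return is never captured as part of a line.
PAIR_PATTERN = re.compile(r"([^\r\n]+)(?:\r\n|\r|\n)([^\r\n]+)")


def join_line_pairs(text):
    """Join line 1 with line 2, line 3 with line 4, and so on, with a space."""
    return PAIR_PATTERN.sub(r"\1 \2", text)


# Made-up input; the question's real data is not available.
text = "row 1\nrow 2\nrow 3\nrow 4\n"
joined = join_line_pairs(text)
print(joined)
```

Because the pattern consumes lines strictly in pairs, a record that spans a different number of lines shifts the pairing for everything after it.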

How can I achieve this price REGEX with REGEXMATCH in Google Spreadsheet?

Here is the deal,
I want to allow users to enter these kinds of entries in my price column:
1 or 1234 or 1234,1 or 1234,1234 ...
So I've used this regex, which works fine on the regex101 website:
^\d+(,\d+)?$
https://regex101.com/r/D5dAXx/1
The only problem is that it doesn't work well with Google Spreadsheet's REGEXMATCH function:
=REGEXMATCH(TO_TEXT(C2), "^\d+(,\d+)?$")
For example, these entries do not match:
1
12
1,123
whereas these entries match correctly:
1,1
1,12
Why is that and what could be the correct REGEX?
My problem was a bad format on the column.
When I entered:
12,1234
the format turned it into
12.1234
which was not matching my REGEXMATCH.
This means that the data validation criterion is applied after the formatting in Google Spreadsheets.
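The pattern itself can be checked outside the spreadsheet. The sketch below uses Python's re module rather than the engine behind REGEXMATCH, which is an assumption of this example; the pattern uses no engine-specific syntax, so the results are expected to agree. The names PRICE_PATTERN and is_valid_price are made up.

```python
import re

# The pattern from the question: digits, optionally followed by a comma and more digits.
PRICE_PATTERN = re.compile(r"^\d+(,\d+)?$")


def is_valid_price(text):
    """Return True if the text is digits, optionally followed by ',digits'."""
    return PRICE_PATTERN.match(text) is not None


# Entries as typed, followed by the value the column format turned "12,1234" into.
for entry in ["1", "12", "1,123", "1,1", "12,1234", "12.1234"]:
    print(entry, is_valid_price(entry))
```

Only the reformatted value 12.1234 is rejected here, which fits the explanation that the column format, not the regex, caused the mismatch.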

MongoDB count and regex search count not matching

I have a huge MongoDB database containing documents, on which I am using a name as the index.
So basically, I had a text file containing 48 000 016 entries. (I used wc -l to obtain that count.)
To give more context, the database contains a lot of names that were extracted from OCR (so a lot of junk) and also names in other languages (Japanese, Russian, etc.).
My MongoDB table statistics tell me I have 48 000 016 entries, which is fine.
The problem happens because I want to query the items on their names (which are standard strings) using this regex:
/^([A-Z]|\W|\s|\d|_)/i
So my checklist :
any letter - check
case insensitive - check
any number - check
underscore - check
\W for anything that is not a number, letter or underscore.
So from what I understand, this regex should get me everything, since I am querying the database on string values with this regex. But the problem is that I am missing 5 items.
When I run the count on the result of the query, I have 48 000 011 items.
Any idea where these 5 could be? Because of the nature of my problem I can simply go through all my items using a simple cursor; I know it could be done that way, but I need a regex that can retrieve all my values.
I ran this query on the Database as indicated by the comments.
db.name.aggregate({$group:{_id:"uniqueDocs", count:{$sum:1}}})
Result is :
{ "result" : [ ], "ok" : 1 }
Thanks a lot !
I see you are using the anchor ^ to match the beginning of a line. It is possible that a line starts with a newline \n or carriage return \r character.
Try including \n and \r in your regex:
/^([A-Z]|\W|\s|\d|\r|\n|_)/i
Also try removing the anchor:
/([A-Z]|\W|\s|\d|\r|\n|_)/i
As a last option, invert your regex to see which records are not included. These regex expressions should also match empty strings.
/^(?![.*])/i
I want to thank @Paul Wasilewski for giving me some great solutions. I found my problem, which was not related to the regex.
My 5 entries were simply not indexed: they were more than 1024 bytes long, so MongoDB could not index them.
So that's the reason why they could not be queried by regex.
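The checklist from the question can also be sanity-checked outside MongoDB. The sketch below uses Python's re module with re.ASCII, which is only an approximation of MongoDB's regex engine and an assumption of this example; the sample names are made up.

```python
import re

# The regex from the question. re.ASCII makes \w, \W and \d ASCII-only, which is
# an assumption made here to approximate MongoDB's regex engine.
NAME_PATTERN = re.compile(r"^([A-Z]|\W|\s|\d|_)", re.IGNORECASE | re.ASCII)


def is_matched(name):
    """Return True if the name is matched by the question's regex."""
    return NAME_PATTERN.search(name) is not None


# Made-up names covering each item of the checklist, plus the empty string.
samples = ["Smith", "smith", "42nd", "_tmp", " padded", "\nnewline", "日本語", "Иванов", ""]
unmatched = [name for name in samples if not is_matched(name)]
print(unmatched)
```

In this sketch only the empty string goes unmatched, which is consistent with the finding that the 5 missing entries were not a regex problem.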