RegEx to get unique values from large file with duplicates - regex

I have a large XML-file that I want to extract unique values from. The values I'm looking for are placed in the XML-tag: ns3:order_id
To make it more complex, the file contains duplicates of order_id, and I'm only interested in geeting the unique order_id values.
I've been using RegEx to extract the values, this is the expression:
(?sm)(\<ns3:order_id>\d+\b)(?!.*\1\b)
The expression gives me what I need, BUT only if the file is way smaller. When I try this expression on the "big" file I receive: "Catastrophic backtracking has been detected and the execution of your expression has been halted." I guess it has with *, and I have tried different ways replacing it without success.
Is there any way to correct my expression so that I can collect the values?
As seen in the text above, I've tried several diffrent RegEx ways. The expression above works, but not in bigger files.

Related

Regular Expression help to find letter in XML number field

I am getting an error importing an XML file into a custom program. Other files import correctly. However, one file produces an error from a float field. I am using Notepad++ search function with Regular Expression to try and find the issue in the XML file.
When I use <milepost>([a-zA-Z0-9.]+)</milepost> I get around 30,000 results which is the correct number of records but the field is supposed to be DOUBLE. When I use <milepost>([0-9.]+)</milepost> I only get 29,994 records. This tells me that the import is most likely failing because there are letters in my number fields.
I have tried a number of variations like:
<milepost>([\S\D\d]+)</milepost>
<milepost>(.*?)</milepost>
<milepost>([\Sa-zA-Z]+)</milepost>
<milepost>([0-9.\w]+)</milepost>
etc.
Each of these returns the expected 30,000 records.
When I try to search for letters using :
<milepost>([a-zA-Z.]*)</milepost>
<milepost>([a-zA-Z]+)</milepost>
<milepost>(^[a-zA-Z]+$)</milepost>
<milepost>([a-zA-Z.a-zA-Z]+)</milepost>
I get 0 results (most likely because it excludes numbers)
I did manage to find one of the records I am looking for using this method:
<milepost>173.811818181818a</milepost>
But I do not feel like scrolling through 30,000+ lines to look for 5 more records with a letter in them.
Is there a regular expression that will return to me ONLY the values that have a letter/letters in them while allowing numbers? (Fields with only numbers and a period should be excluded)
The 6 problem records presumably contain a mixture of letters and numbers, but your searches for records containing letters will only match records consisting exclusively of letters.
Try
<milepost>.*[a-zA-Z].*</milepost>
which matches any record containing an ASCII letter in its value, as well as allowing other characters such as digits.
What you want is a negative look-ahead. Something like
<milepost>(?![0-9.]+</milepost>)
should be very close.
In plain English <milepost> not followed by exclusively digits and dots and a closing </milepost>

Regular expression to extract either integer or string from JSON

I am working in an environment without a JSON parser, so I am using regular expressions to parse some JSON. The value I'm looking to isolate may be either a string or an integer.
For instance
Entry1
{"Product_ID":455233, "Product_Name":"Entry One"}
Entry2
{"Product_ID":"455233-5", "Product_Name":"Entry One"}
I have been attempting to create a single regex pattern to extract the Product_ID whether it is a string or an integer.
I can successfully extract both results with separate patterns using look around with either (?<=Product_ID":")(.*?)(?=") or (?<=Product_ID":)(.*?)(?=,)
however since I don't know which one I will need ahead of time I would like a one size fits all.
I have tried to use [^"] in the pattern however I just cant seem to piece it together
I expect to receive 455233-5 and 455233 but currently I receive "455233-5"
(?<="Product_ID"\s*:\s*"?)[^"]+(?="?\s*,)
, try it here.

Possible combination (variations) of words in a string variable in stata

I have a string variable containing school names and I need to find all the possible combination of each word in this string variable in stata:
For example variation of a word "Academy" would be:
Academy,
Academy,
acdamey,
aacdemy,
dmcaamy,
aacedmy,
and so on.
I need this to standardize the raw data of school names, which has many typos of each word due to data entry issues, like the ones given above for "academy".
Depending whether your data is already in the Excel sheets or a file, you can either use regex trying to match all possible combinations (and probably fix them when found) or parse the strings first before bringing them into Excel. In either case you could make a file (or Excel list/table/area/etc.) that includes all the common typos and pick each typo as regex match to use when comparing to your actual input.
Making regexp that would actually find all possible cases is next to impossible, especially if there are cases where very similar (but correct) names for schools exist. In any case direct regexps would be very messy and complex, so I would advice you to parse the data by finding first the correct form, excluding it and then using (greedy) search/regex to find the typoed versions. You can then save the typos to use them as a filter/match/pattern.
To get some sort of starting ideas, check this links:
Regex: Search for verb roots
Read text file and extract string into Excel sheet using regex
P.s You should keep the count of all strings/school names and finally get a list of all names that did not match correct form or any of your regexp filters, so you can manually insert/correct them.

regular expression to extract insert sql statement from a text file and to check for hardcoded parameters

I have a bunch of sql statements updated by my team developers.
I intend to run a check before these statements are run against a db.
for example, check if a certain column is hardcoded instead of being fetched from the respective table (foreign key)
for example:
INSERT INTO [Term1] ([CreatedBy]
,[CreateUser]) values(1,'asdadad')
where 1 is hardcoded value.
Is there a regular expression that can extract all insert statements from the file so that they can be parse?
I tried with this expression http://regexlib.com/REDetails.aspx?regexp_id=1750 but it didnot work
You may need to run a multi-level regex on this. First parse the entire parameter string from the whole query, then parse each individual field from the paramter string that you previously got to get each one specifically ignoring all the other characters that may come up.

Automatically finding numbering patterns in filenames

Intro
I work in a facility where we have microscopes. These guys can be asked to generate 4D movies of a sample: they take e.g. 10 pictures at different Z position, then wait a certain amount of time (next timepoint) and take 10 slices again.
They can be asked to save a file for each slice, and they use an explicit naming pattern, something like 2009-11-03-experiment1-Z07-T42.tif. The file names are numbered to reflect the Z position and the time point
Question
Once you have all these file names, you can use a regex pattern to extract the Z and T value, if you know the backbone pattern of the file name. This I know how to do.
The question I have is: do you know a way to automatically generate regex pattern from the file name list? For instance, there is an awesome tool on the net that does similar thing: txt2re.
What algorithm would you use to parse all the file name list and generate a most likely regex pattern?
There is a Perl module called String::Diff which has the ability to generate a regular expression for two different strings. The example it gives is
my $diff = String::Diff::diff_regexp('this is Perl', 'this is Ruby');
print "$diff\n";
outputs:
this\ is\ (?:Perl|Ruby)
Maybe you could feed pairs of filenames into this kind of thing to get an initial regex. However, this wouldn't give you capturing of numbers etc. so it wouldn't be completely automatic. After getting the diff you would have to hand-edit or do some kind of substitution to get a working final regex.
First of all, you are trying to do this the hard way. I suspect that this may not be impossible but you would have to apply some artificial intelligence techniques and it would be far more complicated than it is worth. Either neural networks or a genetic algorithm system could be trained to recognize the Z numbers and T numbers, assuming that the format of Z[0-9]+ and T[0-9]+ is always used somewhere in the regex.
What I would do with this problem is to write a Python script to process all of the filenames. In this script, I would match twice against the filename, one time looking for Z[0-9]+ and one time looking for T[0-9]+. Each time I would count the matches for Z-numbers and T-numbers.
I would keep four other counters with running totals, two for Z-numbers and two for T-numbers. Each pair would represent the count of filenames with 1 match, and the ones with multiple matches. And I would count the total number of filenames processed.
At the end, I would report as follows:
nnnnnnnnnn filenames processed
Z-numbers matched only once in nnnnnnnnnn filenames.
Z-numbers matched multiple times in nnnnnn filenames.
T-numbers matched only once in nnnnnnnnnn filenames.
T-numbers matched multiple times in nnnnnn filenames.
If you are lucky, there will be no multiple matches at all, and you could use the regexes above to extract your numbers. However, if there are any significant number of multiple matches, you can run the script again with some print statements to show you example filenames that provoke a multiple match. This would tell you whether or not a simple adjustment to the regex might work.
For instance, if you have 23,768 multiple matches on T-numbers, then make the script print every 500th filename with multiple matches, which would give you 47 samples to examine.
Probably something like [ -/.=]T[0-9]+[ -/.=] would be enough to get the multiple matches down to zero, while also giving a one-time match for every filename. Or at worst, [0-9][ -/.=]T[0-9]+[ -/.=]
For Python, see this question about TemplateMaker.