In VS2012 I want to find and replace a "near-repetitive" string within a few large generated .SQL files:
The format of the search string is:
print 'Processed {d} total records'
where {d} is a number such as 100, 200, 300 etc all the way up to 70,000
I will be replacing this with nothing (i.e. deleting it)
Can anybody provide me with a simple regex for the FIND using the new VS2012 syntax?
Regex is like witchcraft and is beyond me
Any questions feel free to ask
Cheers
Kyle
This should do the job:
print 'Processed [0-9]+ total records'
If your number contains thousands separators (like in '70,000') you might want to use
print 'Processed [0-9,]+ total records'
instead.
[] is a bracket expression which matches any of the characters inside it, so [0-9] matches every digit (since 0-9 is a character range). The + in the end means 'any number of matches but at least one'.
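For anyone who wants to try the pattern outside Visual Studio, here is a minimal sketch in Python. Python's re module is used purely for illustration, and the helper name strip_progress_prints and the sample script are made up. The pattern is anchored to the start of a line and also swallows the line break, so no blank lines are left behind:

```python
import re

# Same pattern as above, anchored to the start of a line. The trailing \n?
# removes the line break too, so deleting the statement leaves no blank line.
PROGRESS_LINE = re.compile(
    r"^print 'Processed [0-9,]+ total records'\n?", re.MULTILINE
)

def strip_progress_prints(sql_text):
    """Delete every "print 'Processed N total records'" line (hypothetical helper)."""
    return PROGRESS_LINE.sub("", sql_text)

# Made-up sample of a generated script.
sample = (
    "INSERT INTO t VALUES (1);\n"
    "print 'Processed 100 total records'\n"
    "INSERT INTO t VALUES (2);\n"
    "print 'Processed 70,000 total records'\n"
)
print(strip_progress_prints(sample))
```

Because + requires at least one digit or comma, a line such as print 'Processed total records' would be left alone.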
Now I am working on a file-rename-applescript-project. Here is an example: The.Fantasy.1997.DVDRip.XviD-ETRG.avi.
Now I want to check whether the filename contains a four-digit year. In this case, it's 1997. The year MUST begin with 19 or 20 and MUST contain four digits.
If the result is true I will do something, if false I will do something else.
I tried to use regex but can't find the solution. It's beyond me. Now I'm looking for help here. Thanks a million.
If you want to avoid regex completely, do something like below, using text item delimiters:
(*
This first bit breaks the string up into a list of words by cutting the string
at the period delimiter.
*)
set tid to my text item delimiters
set my text item delimiters to "."
set bits_list to text items of file_name_string
set my text item delimiters to tid
(*
This repeat loop goes through the list of words and tests them (first) to see
if it can be converted to an integer, and (second) whether the number is between
1900 and 2100. If so, it chooses it as the year.
*)
repeat with this_item in bits_list
try
set possibleYear to this_item as integer
if possibleYear ≥ 1900 and possibleYear < 2100 then
-- do what you want with the year value here
exit repeat
end if
end try
end repeat
Of course, this will not work properly if there's a number in the name (e.g., "2001.A.Space.Odyssey.1968.avi") or if a file name has different delimiters (e.g., a space or a dash). But you'd run into those problems using regex as well, so...
Since you only wish to check whether the filename contains a four-digit year within the range 1900-2099, you can do this very simply by defining a handler like so:
on hasYearInTitle(filmTitle as text)
repeat with yyyy from 1900 to 2099
if yyyy is in the filmTitle then return true
end repeat
return false
end hasYearInTitle
Then you can call this handler and pass it a film title, like so:
hasYearInTitle("The.Fantasy.1997.DVDRip.XviD-ETRG.avi") --> true
hasYearInTitle("The.Fantasy.197.DVDRip.XviD-ETRG.avi") --> false
hasYearInTitle("2001.A.Space.Odyssey.1968.avi") --> true
hasYearInTitle("2001.A.Space.Odyssey.avi") --> true (hm...)
As a side-note, films indexed by newznab servers follow a strict file-naming protocol that allows a media server (on your machine) to parse the name easily and extract information quickly, namely (as seen in your example file name): the film's title, the film's release date, the source material, the encoding quality, the encoding format (codec), the release group, and the containing file format.
Although some filenames contain more information and some less, the fields should always appear in a set order. This makes them very simple to parse yourself should you need to, but if you're looking to create an organised media library, you would be best off using a media server, of which there are excellent, freeware, long-standing options available for macOS and pretty much any other operating system.
The regex .+\.(?:19|20)\d{2}\..+ should do it
The breakdown:
.+ 1 or more any characters
\. An actual dot
(?:19|20) The string "19" or "20" (non-capturing group)
\d{2} Exactly two digits
\. An actual dot
.+ 1 or more any characters
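To see the breakdown in action, here is a small sketch in Python. Python is used only to exercise the pattern (the question itself is about AppleScript), and the function name has_year is made up. It uses the alternation (?:19|20) described in the breakdown:

```python
import re

# Pattern from the breakdown: something, a dot, "19" or "20", two digits,
# a dot, something.
YEAR_IN_NAME = re.compile(r".+\.(?:19|20)\d{2}\..+")

def has_year(file_name):
    """Return True if the name holds a dot-delimited year 1900-2099 (hypothetical helper)."""
    return YEAR_IN_NAME.fullmatch(file_name) is not None

for name in (
    "The.Fantasy.1997.DVDRip.XviD-ETRG.avi",
    "The.Fantasy.197.DVDRip.XviD-ETRG.avi",
    "2001.A.Space.Odyssey.1968.avi",
    "2001.A.Space.Odyssey.avi",
):
    print(name, has_year(name))
```

Note that because the pattern requires a dot before the year, a number at the very start of the name (the "2001" in the last example) is not treated as a year.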
I'm trying to write a regex to parse a bank sort code from a database.
The reason I need a regex is that the sort code might be contained in a sentence.
But also, it might not be a sort code at all because the people entering data into the database sometimes put bank account numbers and phone numbers into the sort code column.
I can use
^[^0-9]*[0-9]{6}[^\d]*$
which works on
"blah123456blah"
but not on
"Emloyee 12's srt code : 123456"
Anything else I've tried gives me a match for 6 or more digits within a string (which is then most likely a bank account number).
Any help is greatly appreciated.
You say you are using
[0-9]{2}\s*-?\s*[0-9]{2}\s*-?\s*[0-9]{2}
To add the boundaries like you need, add (^|[^0-9]) (either the string start position (^) or (|) a non-digit ([^0-9])) in front and ([^0-9]|$) (matching a non-digit or the end of string position ($)) at the end:
(^|[^0-9])[0-9]{2}\s*-?\s*[0-9]{2}\s*-?\s*[0-9]{2}([^0-9]|$)
See the regex demo.
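A quick way to check the bounded pattern against the strings from the question is a sketch like the one below, written in Python for illustration. The extra capture group around the digits and the helper name find_sort_code are my additions, not part of the pattern above:

```python
import re

# The bounded pattern from above, with one extra group (group 2) around the
# sort code itself so it can be pulled out without the boundary characters.
SORT_CODE = re.compile(
    r"(^|[^0-9])([0-9]{2}\s*-?\s*[0-9]{2}\s*-?\s*[0-9]{2})([^0-9]|$)"
)

def find_sort_code(text):
    """Return the six digits of the first sort code found, or None (hypothetical helper)."""
    match = SORT_CODE.search(text)
    if match is None:
        return None
    # Drop any spaces or hyphens between the digit pairs.
    return re.sub(r"[^0-9]", "", match.group(2))

for value in (
    "blah123456blah",
    "Emloyee 12's srt code : 123456",
    "12-34-56",
    "acct 12345678",
):
    print(repr(value), "->", find_sort_code(value))
```

The last value is an eight-digit run, so it is rejected rather than reported as a sort code.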
I have a huge mongoDB containing documents on which I am using a name as index.
So basically, I had a text file containing 48 000 016 entries. (I use wc -l to obtain that count)
To give more context, the database contains a lot of names that were extracted from OCR (so a lot of junk) and also names in other languages (Japanese, Russian, etc...).
My MongoDB table statistics tell me I have 48 000 016 which is fine.
The problem happens because I want to query the items on their names (which is a standard string) using this regex :
/^([A-Z]|\W|\s|\d|_)/i
So my checklist :
any letter - check
case insensitive - check
any number - check
underscore - check
\W for anything that is not a number, letter or underscore.
So from what I understand, this regex should get me everything, since I am querying database on string values with this regex. But the problem is that I am missing 5 items.
When I run the count on the result of the query, I have 48 000 011 items.
Any idea where these 5 could be? I know I could simply go through all my items using a cursor, but because of the nature of my problem I need a regex that can retrieve all my values.
I ran this query on the Database as indicated by the comments.
db.name.aggregate({$group:{_id:"uniqueDocs", count:{$sum:1}}})
Result is :
{ "result" : [ ], "ok" : 1 }
Thanks a lot !
I see you are using the anchor ^ to match the beginning of a line. It is possible that the line starts with a newline \n or carriage return \r character.
Try including \n and \r in your regex
/^([A-Z]|\W|\s|\d|\r|\n|_)/i
Also try removing the anchor.
/([A-Z]|\W|\s|\d|\r|\n|_)/i
As a last option, invert your regex to see which records are not included. This regex should also match empty strings.
/^(?![.*])/i
I want to thank @Paul Wasilewski for giving me some great solutions. I found my problem, which was not related to the regex.
My 5 entries were simply not indexed: they were more than 1024 bytes long, so MongoDB could not index them.
So that's the reason why they could not be queried by regex.
This is my first post so I hope it is clear enough.
I am having a problem regarding cleaning my CSV files before I can read them into R and have spent the entire day trying to find a solution.
My data is supposed to be in the form of two columns. The first column is a timestamp consisting of 10 digits and the second an ID consisting of 11 or 12 Letters and numbers (the first 6 are always numbers).
For example:
logger10 |
0821164100 | 010300033ADD
0821164523 | 010300033ADD
0821164531 | 010700EDDA0F0831102744
010700EDDA0F|
would become:
0821164100 | 010300033ADD
0821164523 | 010300033ADD
0821164531 | 010700EDDA0F
0831102744 | 010700EDDA0F
(please excuse the lines in the middle, that was my attempt at separating the columns...).
The csv file seems to occasionally be missing a comma which means that sometimes one row will end up like this:
0923120531,010300033ADD0925075301,010700EDD00A
My hardware also adds the word logger10 (or whichever number logger this is) whenever it restarts which gives a similar problem e.g. logger10logger100831102744.
I think I have managed to solve the logger text problem (see code) but I am sure this could be improved. Also, I really don't want to delete any of the data.
My real trouble is making sure there is a line break in the right place after the ID and, if not, I would like to add one. I thought I could use regex for this but I'm having difficulty understanding it.
Any help would be greatly appreciated!
Here is my attempt:
temp <- list.files(pattern="*.CSV") #list of each csv/logger file
for(i in temp){
#clean each csv
tmp<-readLines(i) #check each line in file
tmp<-gsub("logger([0-9]{2})","",tmp) #remove logger text
pattern <- ("[0-9]{10}\\,[0-9]{6}[A-Z,0-9]{5,6}") #regex pattern ??
if (tmp!= pattern){
#I have no idea where to start here...
}
}
here is some raw data:
logger01
0729131218,020700EE1961
0729131226,020700EE1961
0831103159,0203000316DB
0831103207,0203000316DB0831103253,010700EDE28C
0831103301,010700EDE28C
0831103522,010300029815
0831103636,010300029815
0831103657,020300029815
If you want to do this in a single pass:
(?:logger\d\d )?([\dA-F]{10}),?([\dA-F]{12}) ?
can be replaced with
\1\t\2\n
What this does is look for any of those rogue logger01 entries (including the space after it) optionally: That trailing ? after the group means that it can match 0 or 1 time: if it does match, it will. If it's not there, the match just keeps going anyway.
Following that, you look for (and capture) exactly 10 hex values (either digits or A-F). The ,? means that if a comma exists, it will match, but it can match 0 or 1 time as well (making it optional).
Following that, look for (and capture) exactly 12 hex values. Finally, to get rid of any strange trailing spaces, the ? (a space character followed by ?) will optionally match the trailing space.
Your replacement will replace the first captured group (the 10 hex digits), add in a tab, replace the second captured group (the 12 hex digits), and then a newline.
You can see this in use on regex101 to see the results. You can use code generator on the left side of that page to get some preformatted PHP/Javascript/Python that you can just drop into a script.
If you're doing this from the command line, perl could be used:
perl -pe 's/(?:logger\d\d )?([\dA-F]{10}),?([\dA-F]{12}) ?/\1\t\2\n/g'
If another language, you may need to adapt it slightly to fit your needs.
EDIT
Re-reading the OP and comments, a slightly more rigid regex could be
(?:logger\d\d\ )?([\dA-F]{10}),?(\d{6}[\dA-F]{5,6})\ ?
I updated the regex101 link with the changes.
This still looks for the first 10 hex values, but now looks for exactly 6 digits, followed by 5-6 hex values, so the total number of characters matched is 11 or 12.
The replacement would be the same.
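If you would rather work line by line, as the R attempt in the question does, the same capture groups can be used to pull out every timestamp/ID pair instead of doing a substitution. The following is only a sketch in Python under that assumption: the name clean_lines is made up, and the optional logger group is dropped because logger text never matches the pair pattern anyway:

```python
import re

# Timestamp (10 chars), optional comma, then an ID made of 6 digits
# followed by 5-6 hex characters, as in the stricter regex above.
PAIR = re.compile(r"([\dA-F]{10}),?(\d{6}[\dA-F]{5,6})")

def clean_lines(lines):
    """Return (timestamp, id) tuples, splitting rows that ran together (hypothetical helper)."""
    rows = []
    for line in lines:
        # A line such as "logger01" contains no pair and contributes nothing.
        rows.extend(PAIR.findall(line))
    return rows

# A few lines taken from the raw data in the question.
raw = [
    "logger01",
    "0831103159,0203000316DB",
    "0831103207,0203000316DB0831103253,010700EDE28C",
    "0831103301,010700EDE28C",
]
for timestamp, logger_id in clean_lines(raw):
    print(f"{timestamp}\t{logger_id}")
```

The ID part is greedy, so an 11-character ID followed directly by a timestamp would still pick up the first digit of that timestamp; the sample data above only contains 12-character IDs.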
Paste your regex here https://regex101.com/ to see whether it catches all cases. The 5 or 6 letters or digits could pose an issue, as the pattern may catch the first digit of the next timestamp when the logger misses out a comma. Appending a '\n' to the end of the tmp string should work, provided the regex catches all cases.
I need a RegEx pattern that will return the first N words using a custom word boundary that is the normal RegEx white space (\s) plus punctuation like .,;:!?-*_
EDIT #1: Thanks for all your comments.
To be clear:
I'd like to set the characters that would be the word delimiters
Lets call this the "Delimiter Set", or strDelimiters
strDelimiters = ".,;:!?-*_"
nNumWordsToFind = 5
A word is defined as any contiguous text that does NOT contain any character in strDelimiters
The RegEx word boundary is any contiguous text that contains one or more of the characters in strDelimiters
I'd like to build the RegEx pattern to get/return the first nNumWordsToFind using the strDelimiters.
EDIT #2: Sat, Aug 8, 2015 at 12:49 AM US CT
@maraca definitely answered my question as originally stated.
But what I actually need is to return the number of words ≤ nNumWordsToFind.
So if the source text has only 3 words, but my RegEx asks for 4 words, I need it to return the 3 words. The answer provided by maraca fails if nNumWordsToFind > number of actual words in the source text.
For example:
one,two;three-four_five.six:seven eight nine! ten
It would see this as 10 words.
If I want the first 5 words, it would return:
one,two;three-four_five.
I have this pattern using the normal \s whitespace, which works, but NOT exactly what I need:
([\w]+\s+){<NumWordsOut>}
where <NumWordsOut> is the number of words to return.
I have also found this word boundary pattern, but I don't know how to use it:
a "real word boundary" that detects the edge between an ASCII letter
and a non-letter.
(?i)(?<=^|[^a-z])(?=[a-z])|(?<=[a-z])(?=$|[^a-z])
However, I would want my words to allow numbers as well.
IAC, I have not been able to work out how to use the above custom word boundary pattern to return the first N words of my text.
BTW, I will be using this in a Keyboard Maestro macro.
Can anyone help?
TIA.
All you have to do is adapt your pattern ([\w]+\s+){<NumWordsOut>} to the following, which also handles some special cases:
^[\s.,;:!?*_-]*([^\s.,;:!?*_-]+([\s.,;:!?*_-]+|$)){<NumWordsOut>}
 1.             2.              3.             4. 5.
1. Match any amount of delimiters before the first word
2. Match a word (= at least one non-delimiter)
3. The word has to be followed by at least one delimiter
4. Or it can be at the end of the string (in case no delimiter follows at the end)
5. Repeat 2. to 4. <NumWordsOut> times
Note how I changed the order of the -, it has to be at the start or end, otherwise it needs to be escaped: \-.
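Here is a rough sketch of the pattern in Python, just to test it against the example from the question. Keyboard Maestro's regex flavour may differ in details, and the function name first_words is made up. The at_most option, which swaps {N} for {1,N} so that shorter texts still match, is my own assumption for handling the "fewer than N words" case and is not part of the pattern above:

```python
import re

# Delimiter set: whitespace plus punctuation, hyphen last so it needs no escape.
DELIMS = r"\s.,;:!?*_-"

def first_words(text, n, at_most=False):
    """Return the leading part of text that holds the first n words, or None.

    With at_most=True the quantifier becomes {1,n} (an assumed variation),
    so a text with fewer than n words still matches.
    """
    count = f"{{1,{n}}}" if at_most else f"{{{n}}}"
    pattern = rf"^[{DELIMS}]*([^{DELIMS}]+([{DELIMS}]+|$)){count}"
    match = re.match(pattern, text)
    return match.group(0) if match else None

source = "one,two;three-four_five.six:seven eight nine! ten"
print(first_words(source, 5))
print(first_words("one two three", 4))
print(first_words("one two three", 4, at_most=True))
```

The match includes the delimiter(s) after the last word, which is consistent with the expected result given in the question.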
Thanks to @maraca for providing the complete answer to my question.
I just wanted to post the Keyboard Maestro macro that I have built using @maraca's RegEx pattern for anyone interested in the complete solution.
See KM Forum Macro: Get a Max of N Words in String Using RegEx