Notepad++ Regex - patterns from Autonomy IDOL queries / Logs - regex

I have a document which looks something like:
sort=SIZE:NumberDecreasing
FieldText=(((EQUAL{226742}:LocationId)) AND ())
FieldText=(((EQUAL{226742}:LocationId)) AND ((EQUAL{1}:LOD AND NOTEQUAL{1}:SCR AND EMPTY{}:RPDCITYID AND NOTEQUAL{1}:Industrial)))
FieldText=( NOT EQUAL{1}:ISSCHEME AND EQUAL{215629}:LocationId)
sort=DEALDATE:decreasing
From this I would like to extract the word before each colon (or before the {} braces, if there are any), then the colon, then the word after the colon. Ideally these should be the only things left in the file, each on its own line.
Output would then look like:
SIZE:NumberDecreasing
EQUAL:LocationId
EQUAL:LocationId
EQUAL:LOD
NOTEQUAL:SCR
EMPTY:RPDCITYID
NOTEQUAL:Industrial
EQUAL:ISSCHEME
EQUAL:LocationId
DEALDATE:decreasing
The closest I have come so far is:
Find:
^.?+ {[0-9]}:([a-zA-Z]+)
Replace with:
...\1:\2...
with the intent to run it several times, and later replace ... with \n
I can then remove multiple newlines.
Context: this is for a log analysis I am performing. I have already removed datestamps and reduced the elements of the query down to the sort and FieldText parameters.
I do not have regular UNIX tools; I am working in a Windows environment.
The original log looks like:
03/11/2011 16:25:44 [9] ACTION=Query&summary=Context&print=none&printFields=DISPLAYNAME%2CRECORDTYPE%2CSTREET%2CTOWN%2CCOUNTY%2CPOSTCODE%2CLATITUDE%2CLONGITUDE&DatabaseMatch=Autocomplete&sort=RECORDTYPE%3Areversealphabetical%2BDRETITLE%3Aincreasing&maxresults=200&FieldText=%28WILD%7Bbournemou%2A%7D%3ADisplayName%20NOT%20MATCH%7BScheme%7D%3ARecordType%29 (10.55.81.151)
03/11/2011 16:25:45 [9] Returning 23 matches
03/11/2011 16:25:45 [9] Query complete
03/11/2011 16:25:46 [8] ACTION=GetQueryTagValues&documentCount=True&databaseMatch=Deal&minScore=70&weighfieldtext=false&FieldName=TotalSizeSizeInSquareMetres%2CAnnualRental%2CDealType%2CYield&start=1&FieldText=%28MATCH%7BBournemouth%7D%3ATown%29 (10.55.81.151)
03/11/2011 16:25:46 [12] ACTION=Query&databaseMatch=Deal&maxResults=50&minScore=70&sort=DEALDATE%3Adecreasing&weighfieldtext=false&totalResults=true&PrintFields=LocationId%2CLatitude%2CLongitude%2CDealId%2CFloorOrUnitNumber%2CAddressAlias%2A%2CEGAddressAliasID%2COriginalBuildingName%2CSubBuilding%2CBuildingName%2CBuildingNumber%2CDependentStreet%2CStreet%2CDependentLocality%2CLocality%2CTown%2CCounty%2CPostcode%2CSchemeName%2CBuildingId%2CFullAddress%2CDealType%2CDealDate%2CSalesPrice%2CYield%2CRent%2CTotalSizeSizeInSquareMetres%2CMappingPropertyUsetype&start=1&FieldText=%28MATCH%7BBournemouth%7D%3ATown%29 (10.55.81.151)
03/11/2011 16:25:46 [8] GetQueryTagValues complete
03/11/2011 16:25:47 [12] Returning 50 matches
03/11/2011 16:25:47 [12] Query complete
03/11/2011 16:25:51 [13] ACTION=Query&print=all&databaseMatch=locationidsearch&sort=RELEVANCE%2BPOSTCODE%3Aincreasing&maxResults=10&start=1&totalResults=true&minscore=70&weighfieldtext=false&FieldText=%28%20NOT%20LESS%7B50%7D%3AOFFICE%5FPERCENT%20AND%20EXISTS%7B%7D%3AOFFICE%5FPERCENT%20NOT%20EQUAL%7B1%7D%3AISSCHEME%29&Text=%28Brazennose%3AFullAddress%2BAND%2BHouse%3AFullAddress%29&synonym=True (10.55.81.151)
03/11/2011 16:25:51 [13] Returning 3 matches
03/11/2011 16:25:51 [13] Query complete
The purpose of the whole exercise is to find out which fields are being queried and sorted upon (and how we are querying/sorting upon them) - to this end, the output could also usefully be distinct - although that is not essential.

The Perl program below is complete, and includes your sample data in the source. It produces exactly the output you describe, including reporting NOT EQUAL{1}:ISSCHEME as EQUAL:ISSCHEME because of the intermediate space.
use strict;
use warnings;
while (<DATA>) {
print "$1:$2\n" while /(\w+) (?: \{\d*\} )? : (\w+)/xg;
}
__DATA__
sort=SIZE:NumberDecreasing
FieldText=(((EQUAL{226742}:LocationId)) AND ())
FieldText=(((EQUAL{226742}:LocationId)) AND ((EQUAL{1}:LOD AND NOTEQUAL{1}:SCR AND EMPTY{}:RPDCITYID AND NOTEQUAL{1}:Industrial)))
FieldText=( NOT EQUAL{1}:ISSCHEME AND EQUAL{215629}:LocationId)
sort=DEALDATE:decreasing
OUTPUT
SIZE:NumberDecreasing
EQUAL:LocationId
EQUAL:LocationId
EQUAL:LOD
NOTEQUAL:SCR
EMPTY:RPDCITYID
NOTEQUAL:Industrial
EQUAL:ISSCHEME
EQUAL:LocationId
DEALDATE:decreasing
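For readers without Perl on Windows, the same idea can be sketched in Python. This is only a sketch under assumptions: the names extract_pairs and distinct_pairs are made up for illustration, and the input is assumed to be the already reduced sort/FieldText lines shown in the question, not the raw URL-encoded log. The distinct variant covers the "output could also usefully be distinct" wish.

```python
import re

# Same pattern as the Perl one-liner: a word, an optional {digits} block,
# a colon, and the word after the colon.
PAIR = re.compile(r"(\w+)(?:\{\d*\})?:(\w+)")

def extract_pairs(lines):
    """Return every OPERATOR:Field pair found, in order of appearance."""
    return [f"{op}:{field}" for line in lines for op, field in PAIR.findall(line)]

def distinct_pairs(lines):
    """Like extract_pairs, but each pair is kept only once (first-seen order)."""
    return list(dict.fromkeys(extract_pairs(lines)))

# Sample data copied from the question.
sample = [
    "sort=SIZE:NumberDecreasing",
    "FieldText=(((EQUAL{226742}:LocationId)) AND ())",
    "FieldText=(((EQUAL{226742}:LocationId)) AND ((EQUAL{1}:LOD AND NOTEQUAL{1}:SCR "
    "AND EMPTY{}:RPDCITYID AND NOTEQUAL{1}:Industrial)))",
    "FieldText=( NOT EQUAL{1}:ISSCHEME AND EQUAL{215629}:LocationId)",
    "sort=DEALDATE:decreasing",
]

if __name__ == "__main__":
    for pair in distinct_pairs(sample):
        print(pair)
```

As with the Perl version, NOT EQUAL{1}:ISSCHEME is reported as EQUAL:ISSCHEME because of the space after NOT.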

Related

Regex with one open and close bracket within an number

I have been fighting with a regular expression for a few days without any success.
My first expression should allow:
brackets at most once, no matter where
optional text or numbers before and after the brackets
only numbers within the brackets
Example what is allowed:
[32] text1
text1 [5]
text1 [103] text2
text1
[123]
[some value [33]] (maybe too complicated; this one is not so important)
My second expression is similar, but with just numbers before and after the brackets instead of text:
[32] 11
11 [5]
11 [103] 22
11
[123]
no match:
[12] xxx [5] (brackets are more than one time)
[aa] xxx (no number within brackets)
This is what I tried, but it does not work because I don't know how to enforce the one-time brackets:
^.*\{?[0-9]*\}.*$
In another answer I also found the following, which looks good, but I need it to require numbers:
^[^\{\}]*\{[^\{\}]*\}[^\{\}]*$
In case it matters: later I want to take the number in the brackets and replace it with other values.
Hope someone can help me. Thanks in advance!
This is what you want:
^([^\]\n]*\[\d+\])?[^[\n]*$
Update: For just numbers:
^[\d ]*(\[\d+\])?[\d ]*$
Explanation:
^ Start of line
[^...] Negative character set --> [^\]] Any character except ]
* Zero or more repetitions of the preceding class/character set
\d 0-9
+ One or more repetitions of the preceding class/character set
(...)? 0 or 1 of the group
$ End of line
Note: These RegExs can return empty matches.
Thanks to #MMMahdy-PAPION! He improved the answer.
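Since the question mentions reusing the bracketed number later, here is a hedged Python sketch of both patterns. It is not the answerer's code: the inner capture group around \d+ and the helper name bracket_number are additions for illustration, and the [ inside the negated class is escaped, which does not change what it matches.

```python
import re

# Text variant: at most one [digits] block, anything else around it.
ANY_TEXT = re.compile(r"^(?:[^\]\n]*\[(\d+)\])?[^\[\n]*$")

# Numbers-only variant: only digits and spaces around the optional block.
DIGITS_ONLY = re.compile(r"^[\d ]*(?:\[(\d+)\])?[\d ]*$")

def bracket_number(pattern, line):
    """Return (matched, number); number is the bracketed digits or None."""
    m = pattern.match(line)
    if m is None:
        return (False, None)
    return (True, m.group(1))

if __name__ == "__main__":
    for line in ["[32] text1", "text1 [103] text2", "text1",
                 "[12] xxx [5]", "[aa] xxx"]:
        print(repr(line), bracket_number(ANY_TEXT, line))
```

A line with no brackets still matches, with None as the number, which mirrors the note above about empty matches.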

regex match 2 characters / and -

I have a regex that looks like this:
/(((\+|00)32[ ]?(?:\(0\)[ ]?)?)|0){1}(4(60|[789]\d)\/?(\s?\d{2}\.?){2}(\s?\d{2})|(\d\/?\s?\d{3}|\d{2}\/?\s?\d{2})(\.?\s?\d{2}){2})/g
this matches: +32 16/894477 but +32 16-894477 doesn't
this 20150211-0001731015-1 also matches but this shouldn't match
I am trying to fix my regex here:
https://regex101.com/r/LmaIPA/1
(((\+|00)32[ ]?(?:\(0\)[ ]?)?)|0){1}(4(60|[789]\d)\/?(\s?\d{2}\.?){2}(\s?\d{2})|(\d\/?\s?\d{3}|\d{2}(\/?|\-)\s?\d{2})(\.?\s?\d{2}){2})
I guess I fixed part of it by adding this, but let me know if there is something else that doesn't work properly :)
There are a lot of capture groups, and some can also be omitted if you don't need them for after processing.
The issue is that for +32 16-894477 you are not matching the hyphen, and you match the larger string as there are no boundaries set so you get a partial match.
Some notes:
You don't have to escape the / when using a different delimiter
You can omit {1} from the pattern
\s can also match a newline, you can use \h if you want to match a horizontal whitespace char
A single space [ ] does not have to be in a character class
You can extend the pattern by adding the hyphen and forward slash to a character class using [/-]?, wrap the whole pattern in a non-capturing group, and assert a whitespace boundary to the right: (?:whole pattern here)(?!\S)
A version without the capture groups for a match only:
(?:(?:(?:\+|00)32\h?(?:\(0\)\h?)?|0)(?:4(?:60|[789]\d)/?(?:\h?\d{2}\.?){2}\h?\d{2}|(?:\d/?\h?\d{3}|\d{2}[/-]?\h?\d{2})(?:\.?\h?\d{2}){2}))(?!\S)
Php example
$re = '~(?:(?:(?:\+|00)32\h?(?:\(0\)\h?)?|0)(?:4(?:60|[789]\d)/?(?:\h?\d{2}\.?){2}\h?\d{2}|(?:\d/?\h?\d{3}|\d{2}[/-]?\h?\d{2})(?:\.?\h?\d{2}){2}))(?!\S)~';
$str = 'OK 01/07 - 31/07
OK 0487207339
OK +32487207339
OK 01.07.2016
OK +32 (0)16 89 44 77
OK 016894477
OK 003216894477
OK +3216894477
OK 016/89.44.77
OK +32 16894477
OK 0032 16894477
OK +32 16/894477
NOK +32 16-894477 (this should match)
OK 0479/878810
NOK 20150211-0001731015-1 (this shouldn\'t match)';
preg_match_all($re, $str, $matches);
print_r($matches[0]);
Output
Array
(
[0] => 0487207339
[1] => +32487207339
[2] => +32 (0)16 89 44 77
[3] => 016894477
[4] => 003216894477
[5] => +3216894477
[6] => 016/89.44.77
[7] => +32 16894477
[8] => 0032 16894477
[9] => +32 16/894477
[10] => +32 16-894477
[11] => 0479/878810
)
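The same pattern can be tried outside PHP. The sketch below is a Python port under one stated assumption: Python's re module has no \h, so [ \t] is substituted for horizontal whitespace. The name find_numbers is hypothetical.

```python
import re

# Port of the non-capturing pattern above; [ \t] stands in for PCRE's \h.
PHONE = re.compile(
    r"(?:(?:(?:\+|00)32[ \t]?(?:\(0\)[ \t]?)?|0)"
    r"(?:4(?:60|[789]\d)/?(?:[ \t]?\d{2}\.?){2}[ \t]?\d{2}"
    r"|(?:\d/?[ \t]?\d{3}|\d{2}[/-]?[ \t]?\d{2})(?:\.?[ \t]?\d{2}){2}))"
    r"(?!\S)"  # whitespace boundary: the match must not run into more text
)

def find_numbers(text):
    """Return all phone-number-like matches in text."""
    return PHONE.findall(text)

if __name__ == "__main__":
    print(find_numbers("+32 16-894477"))          # hyphen is now accepted
    print(find_numbers("20150211-0001731015-1"))  # rejected by the boundary
```

The trailing (?!\S) is what rejects 20150211-0001731015-1: every candidate match inside it is followed by a digit or a hyphen rather than whitespace or the end of the string.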

Detecting sequencing using regexes

Imagine I have multiple character strings in a list like this:
[[1]]
[1] "1-FA-1-I2-1-I2-1-I2-1-EX-1-I2-1-I3-1-FA-1-"
[2] "-1-I2-1-TR-1-"
[3] "-1-I2-1-FA-1-I3-1-"
[4] "-1-FA-1-FA-1-NR-1-I3-1-I2-1-TR-1-"
[5] "-1-I2-1-"
[6] "-1-I2-1-FA-1-I2-1-"
[7] "-1-I3-1-FA-1-QU-1-"
[8] "-1-I2-1-I2-1-I2-1-NR-1-I2-1-I2-1-NR-1-"
[9] "-1-I2-1-"
[10] "-1-NR-1-I3-1-QU-1-I2-1-I3-1-QU-1-NR-1-I2-1-"
[11] "-1-NR-1-QU-1-QU-1-I2-1-"
I want to use a regex to detect the particular strings where a certain substring precedes another substring, but not necessarily directly preceding the other substring.
For example, let's say that we are looking for FA preceding EX. This would need to match 1 in the list. Even though FA has -1-I2-1-I2-1-I2-1- between itself and EX, the FA still occurs before the EX, hence a match is expected.
How can a generic regex be defined that identifies strings where certain substrings appear before another substring in this manner?
You may use grep.
x <- c("1-FA-1-I2-1-I2-1-I2-1-EX-1-I2-1-I3-1-FA-1-" ,"-1-I2-1-TR-1-")
grepl("FA.*EX", x)
#[1] TRUE FALSE
grep("FA.*EX", x)
#[1] 1
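For the "generic regex" part of the question, the FA.*EX idea can be wrapped in a small helper. This is a Python sketch rather than R, and the function name precedes is an assumption for illustration; re.escape is used so that the substrings are treated literally.

```python
import re

def precedes(first, second, s):
    """True if `first` occurs somewhere before `second` in `s`."""
    pattern = re.escape(first) + ".*" + re.escape(second)
    return re.search(pattern, s) is not None

strings = [
    "1-FA-1-I2-1-I2-1-I2-1-EX-1-I2-1-I3-1-FA-1-",
    "-1-I2-1-TR-1-",
    "-1-I2-1-FA-1-I3-1-",
]

if __name__ == "__main__":
    # Indices of the strings where FA comes before EX (0-based here, 1-based in R).
    print([i for i, s in enumerate(strings) if precedes("FA", "EX", s)])
```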

split on last occurrence of digit, take 2nd part

If I have a string and want to split on the last digit and keep the last part of the split, how can I do that?
x <- c("ID", paste0("X", 1:10, state.name[1:10]))
I'd like
[1] NA "Alabama" "Alaska" "Arizona" "Arkansas"
[6] "California" "Colorado" "Connecticut" "Delaware" "Florida"
[11] "Georgia"
But would settle for:
[1] "ID" "Alabama" "Alaska" "Arizona" "Arkansas"
[6] "California" "Colorado" "Connecticut" "Delaware" "Florida"
[11] "Georgia"
I can get the first part by:
unlist(strsplit(x, "[^0-9]*$"))
But want the second part.
Thank you in advance.
You can do this in one easy step with a regular expression:
gsub("(^.*\\d+)(\\w*)", "\\2", x)
Results in:
[1] "ID" "Alabama" "Alaska" "Arizona" "Arkansas" "California" "Colorado" "Connecticut"
[9] "Delaware" "Florida" "Georgia"
What the regex does:
"(^.*\\d+)(\\w*)": Look for two groups of characters.
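The first group (^.*\\d+) matches any characters from the start of the string followed by at least one digit. Because .* is greedy, this group ends at the last digit in the string.
The second group (\\w*) matches zero or more word characters (letters, digits or underscore) after that digit.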
The "\\2" as the second argument to gsub() means to replace the original string with the second group that the regex found.
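The same substitution can be sketched outside R. The Python version below is an illustration only (the helper name after_last_digit is made up); it keeps a single capture group for the trailing word characters, so the replacement is \1 rather than \2.

```python
import re

def after_last_digit(s):
    """Replace everything up to and including the last digit with the
    word characters that follow it; strings without digits are unchanged."""
    return re.sub(r"^.*\d+(\w*)", r"\1", s)

if __name__ == "__main__":
    for s in ["ID", "X1Alabama", "X10Georgia"]:
        print(after_last_digit(s))
```

Strings with no digit, such as "ID", do not match at all and come back untouched, which is why this approach gives the "would settle for" output rather than NA.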
library(stringr)
unlist(lapply(str_split(x, "[0-9]"), tail,n=1))
gives
[1] "ID" "Alabama" "Alaska" "Arizona" "Arkansas" "California" "Colorado" "Connecticut" "Delaware"
[10] "Florida" "Georgia"
I would look at the stringr documentation for (quite possibly) an even better approach.
This seems a bit clunky, but it works:
state.pt2 <- unlist(strsplit(x,"^.[0-9]+"))
state.pt2[state.pt2!=""]
It would be nice to remove the ""'s generated by the match at the start of the string but I can't figure that out.
Here's another method using substr and gregexpr too that avoids having to subset the results:
substr(x,unlist(lapply(gregexpr("[0-9]",x),max))+1,nchar(x))
gsubfn
Try this gsubfn solution:
> library(gsubfn)
> strapply(x, ".*\\d(\\w*)|$", ~ if (nchar(z)) z else NA, simplify = TRUE)
[1] NA "Alabama" "Alaska" "Arizona" "Arkansas"
[6] "California" "Colorado" "Connecticut" "Delaware" "Florida"
[11] "Georgia"
It matches the last digit followed by word characters and returns the word characters or if that fails it matches the end of line (to ensure that it matches something). If the first match succeeded then return it; otherwise, the back reference will be empty so return NA.
Note that the formula is a shorthand way of writing the function function(z) if (nchar(z)) z else NA, and that function could alternatively replace the formula at the expense of slightly more keystrokes.
gsub
A similar strategy could also work using just straight gsub but requires two lines and a marginally more complex regular expression. Here we use the second alternative to slurp up non-matches from the first alternative:
> s <- gsub(".*\\d(\\w*)|.*", "\\1", x)
> ifelse(nchar(s), s, NA)
[1] NA "Alabama" "Alaska" "Arizona" "Arkansas"
[6] "California" "Colorado" "Connecticut" "Delaware" "Florida"
[11] "Georgia"
EDIT: minor improvements
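The NA-returning behaviour can also be sketched in Python, with None standing in for R's NA. This is an assumption-laden illustration, not a translation of gsubfn: the helper name last_part is hypothetical.

```python
import re

def last_part(s):
    """Return the word characters after the last digit, or None when the
    string has no digit (or nothing follows the last digit)."""
    m = re.search(r".*\d(\w*)", s)
    if m is None:
        return None
    return m.group(1) or None

if __name__ == "__main__":
    print([last_part(s) for s in ["ID", "X1Alabama", "X10Georgia"]])
```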

price regex help

How can I make the regex below also detect single-digit prices such as £7, and not only amounts greater than 9?
/\d[\d\,\.]+/is
thanks
to match a single digit, you can change it to
/\d[\d,.]*/
the + means require one or more, so that's why the whole thing won't match just a 7. The * is 0 or more, so an extra digit or , or . becomes optional.
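The difference between + and * here is easy to check. The snippet below is a small Python sketch (the names PLUS, STAR and find_prices are made up for illustration) that runs both patterns over the same text.

```python
import re

PLUS = re.compile(r"\d[\d,.]+")  # original: a digit plus at least one more character
STAR = re.compile(r"\d[\d,.]*")  # fixed: the extra characters are optional

def find_prices(pattern, text):
    """Return every substring of text that the pattern matches."""
    return pattern.findall(text)

if __name__ == "__main__":
    text = "£7 and £12.50"
    print(find_prices(PLUS, text))  # the lone 7 is skipped
    print(find_prices(STAR, text))  # the lone 7 is found
```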
The longer answer might be more complicated. For example, in the book Regular Expression Cookbook, there is an excerpt: (remove the ^ and $ if you want it to match the 2 in apple $2 each) but note that when the number is 1000 or more, the , is needed. For example, the first regex won't match 1000.33
(unsourced image from a book removed)
Your expression would allow 123...3456... I think you might want something like (£|\$|€)\d+([,.]\d{2})?
This requires the source to have a currency symbol and, when a separator is present, exactly two digits for cents after it.
You might look at a regex more like the following.
/(?:\d+[,.]?\d*)|(?:[,.]\d+)/
Test Set:
5.00
$7.00
6123.58
$1
.75
Result Set:
[0] => 5.00
[1] => 7.00
[2] => 6123.58
[3] => 1
[4] => .75
EDIT: Additional Case added
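The test set above can be reproduced with a short Python sketch. The pattern is the one from this answer; the names AMOUNT and find_amounts are assumptions for illustration.

```python
import re

# Either digits with an optional separator and more digits,
# or a leading separator followed by digits (e.g. .75).
AMOUNT = re.compile(r"(?:\d+[,.]?\d*)|(?:[,.]\d+)")

def find_amounts(text):
    """Return all amount-like substrings found in text."""
    return AMOUNT.findall(text)

if __name__ == "__main__":
    test_set = "5.00\n$7.00\n6123.58\n$1\n.75"
    print(find_amounts(test_set))
```

Note that this pattern ignores the currency symbol entirely, so it also matches bare numbers that are not prices.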