String simplification for regex (BASH) - regex

I'm looking for a way to simplify multiple strings for the purpose of regular expression searching. Here's an example:
I have a list of several thousand strings, similar to the ones below (text.#######):
area.202264
area.202265
area.202266
area.202267
area.202268
area.202269
area.202270
area.204517
area.204518
area.204519
area.207171
area.207338
area.208842
I've been trying to figure out an automated way to simplify it into something like this:
area.20226(4|5|6|7|8|9)|area.202270|area.20451(7|8|9)|area.207171|area.207338|area.208842
The purpose of this would be to reduce string length when searching these areas. I have absolutely no idea how to approach something like this in a simple, re-usable way.
Thanks in advance! Any solutions or tips on where to start would be appreciated :)

echo "area.202264 area.202265 area.202266 area.202267 area.202268 area.202269 area.202270 area.204517 area.204518 area.204519 area.207171 area.207338 area.208842" | tr ' ' '\n' > list.txt
cat list.txt | grep -v "^$" | sed -e "s/[0-9] *$//g" | sort -u | while read p; do l=`grep $p list.txt | sed -e "s/.*\([0-9]\)$/\1/g" | xargs | tr ' ' '|'` ;echo "$p($l)" ; done | sed -e "s/(\(.\))/\1/g"| xargs| tr ' ' '|'
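The same grouping idea (strip the last character, collect the endings per prefix) can be sketched in Python. This is a minimal sketch, not the pipeline above translated literally: the function name `group_by_last_char` is made up for illustration, it assumes non-empty strings, it escapes the literal dot, and it emits a character class such as `[456789]` instead of `(4|5|6|7|8|9)`.

```python
import re

def group_by_last_char(strings):
    """Collapse strings that differ only in their final character."""
    groups = {}  # prefix -> list of final characters, in sorted order
    for s in sorted(set(strings)):
        groups.setdefault(s[:-1], []).append(s[-1])
    parts = []
    for prefix, endings in groups.items():
        if len(endings) == 1:
            # Only one string has this prefix: keep it whole
            parts.append(re.escape(prefix + endings[0]))
        else:
            parts.append(re.escape(prefix) + "[" + re.escape("".join(endings)) + "]")
    return "|".join(parts)

areas = (
    "area.202264 area.202265 area.202266 area.202267 area.202268 "
    "area.202269 area.202270 area.204517 area.204518 area.204519 "
    "area.207171 area.207338 area.208842"
).split()

simplified = group_by_last_char(areas)
matcher = re.compile("^(?:" + simplified + ")$")
print(simplified)
```

Only the final character is grouped, so longer shared prefixes (such as `area.20`) are still repeated in every alternative.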

Put the search strings in a file named "filter", one per line:
area.202264
area.202265
area.202266
area.202267
Then you can search quickly with:
fgrep -f filter file-to-search-in
I see no easy way to produce a regexp from samples, and I'm not sure the regexp approach would be faster.

Here are a couple of things you should know:
Nearly all regex engines build a state machine from their patterns. You can probably just put the various names between vertical bars and get good performance. (It won't look nice, but it will work.)
That is, something like:
(area.202264|area.202265|area.202266|...|area.207338|area.208842)
Even with 4k items, the right engine will just compile it down. (I don't think bash will handle it, because of the length. But perl, grep, fgrep as mentioned elsewhere can do it.)
You say "BASH", so it's worth pointing out there is a difference between regex and file globbing. If the things you are working with are text, then regex (^area.\d+$) is the way to go. If the things you are working with are filenames, then globbing (*.c) has different rules.
You can simplify greatly if you don't care at all about the numbers, only the format. For regexes:
area\.\d+ # area, dot, one or more digits (0-9)
area\.\d{1,6} # area, dot, no less than 1, no more than 6 digits
area\.\d{6} # area, dot, exactly 6 digits
area\.20[234]\d{3} # area, dot, 20 {2,3,4} then 3 more digits
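To see how these four patterns differ in practice, here is a small Python check. The pattern labels and the helper `matching_patterns` are made up for illustration, and the inputs are just sample values; `re.fullmatch` is used so the whole string has to fit the pattern.

```python
import re

# The four format-only patterns from above, keyed by a made-up label
patterns = {
    "one_or_more": r"area\.\d+",
    "one_to_six": r"area\.\d{1,6}",
    "exactly_six": r"area\.\d{6}",
    "prefix_20x": r"area\.20[234]\d{3}",
}

def matching_patterns(text):
    """Return the labels of all patterns that match the whole string."""
    return sorted(name for name, pat in patterns.items() if re.fullmatch(pat, text))

for sample in ("area.202264", "area.207171", "area.1234567", "areaX202264"):
    print(sample, matching_patterns(sample))
```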

If you can use Perl and the Regexp::Assemble module, it can convert multiple patterns into a single, optimized, regular expression. For instance, using it on the list of strings in the question yields:
(?-xism:area\.20(?:22(?:6[456789]|70)|7(?:171|338)|451[789]|8842))
That only works if the database plugin can accept Perl regular expressions.
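If Perl is not available, the general idea of merging common prefixes can be sketched by hand with a trie. The following Python is a minimal sketch under that assumption; it is not Regexp::Assemble's algorithm, the function names `build_trie` and `trie_to_regex` are made up, and it only merges prefixes (not suffixes).

```python
import re

def build_trie(words):
    """Build a nested-dict trie; the empty-string key marks the end of a word."""
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = {}
    return trie

def trie_to_regex(node):
    """Turn a trie node into a regex fragment that matches its suffixes."""
    if list(node) == [""]:
        return ""  # leaf: nothing left to match
    optional = "" in node  # a word ends here, but longer words continue
    alternatives = [
        re.escape(ch) + trie_to_regex(node[ch])
        for ch in sorted(k for k in node if k)
    ]
    if len(alternatives) == 1 and not optional:
        return alternatives[0]
    body = "(?:" + "|".join(alternatives) + ")"
    return body + "?" if optional else body

words = [
    "area.202264", "area.202265", "area.202266", "area.202267",
    "area.202268", "area.202269", "area.202270", "area.204517",
    "area.204518", "area.204519", "area.207171", "area.207338",
    "area.208842",
]
assembled = trie_to_regex(build_trie(words))
print(assembled)
```

The result has the same nested shape as the Regexp::Assemble output, except that single-character alternatives stay as `(?:7|8|9)` rather than being folded into `[789]`.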

Related

Conditional matching in Regex [duplicate]

I'm setting up some goals in Google Analytics and could use a little regex help.
Let's say I have 4 URLs:
http://www.anydotcom.com/test/search.cfm?metric=blah&selector=size&value=1
http://www.anydotcom.com/test/search.cfm?metric=blah2&selector=style&value=1
http://www.anydotcom.com/test/search.cfm?metric=blah3&selector=size&value=1
http://www.anydotcom.com/test/details.cfm?metric=blah&selector=size&value=1
I want to create an expression that will identify any URL that contains the string selector=size but does NOT contain details.cfm
I know that to find a string that does NOT contain another string I can use this expression:
(^((?!details.cfm).)*$)
But, I'm not sure how to add in the selector=size portion.
Any help would be greatly appreciated!
This should do it:
^(?!.*details\.cfm).*selector=size.*$
^.*selector=size.*$ should be clear enough. The first bit, (?!.*details.cfm) is a negative look-ahead: before matching the string it checks the string does not contain "details.cfm" (with any number of characters before it).
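A quick way to check the look-ahead pattern is to run it against the four URLs from the question. This sketch uses Python's `re` module, which supports look-ahead; whether the target tool's regex engine does is a separate question.

```python
import re

urls = [
    "http://www.anydotcom.com/test/search.cfm?metric=blah&selector=size&value=1",
    "http://www.anydotcom.com/test/search.cfm?metric=blah2&selector=style&value=1",
    "http://www.anydotcom.com/test/search.cfm?metric=blah3&selector=size&value=1",
    "http://www.anydotcom.com/test/details.cfm?metric=blah&selector=size&value=1",
]

# Negative look-ahead rejects any URL containing details.cfm,
# then the rest of the pattern requires selector=size somewhere
pattern = re.compile(r"^(?!.*details\.cfm).*selector=size.*$")

matches = [u for u in urls if pattern.search(u)]
for u in matches:
    print(u)
```

Only the first and third URLs pass: the second lacks selector=size and the fourth contains details.cfm.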
^(?=.*selector=size)(?:(?!details\.cfm).)+$
If your regex engine supports possessive quantifiers (though I suspect Google Analytics does not), then I guess this will perform better for large input sets:
^[^?]*+(?<!details\.cfm).*?selector=size.*$
regex could be (perl syntax):
`/^[(^(?!.*details\.cfm).*selector=size.*)|(selector=size.*^(?!.*details\.cfm).*)]$/`
There is a problem with the regex in the accepted answer. It also matches abcselector=size, selector=sizeabc etc.
A correct regex can be ^(?!.*\bdetails\.cfm\b).*\bselector=size\b.*$
I was looking for a way to avoid --line-buffered on a tail in a similar situation as the OP and Kobi's solution works great for me. In my case excluding lines with either "bot" or "spider" while including ' / ' (for my root document).
My original command:
tail -f mylogfile | grep --line-buffered -v 'bot\|spider' | grep ' / '
Now becomes (with -P perl switch):
tail -f mylogfile | grep -P '^(?!.*(bot|spider)).*\s\/\s.*$'

How to use Regular Expression In Find and Replacement

I have a CSV file with almost 50k records and want to remove the unnecessary rows from it. Can anyone tell me how to do this with a regex in Find and Replace?
The data looks like this:
Item Code,,Qty
CMAC-389109,,6
,Serial No.,
,954zg5,
,ffnaw8,
,gh8731,
,gxj419,
,hc6y9q,
,y65vh8,
CMAC-394140,,1
,Serial No.,
,4cu3z7,
and I want to convert this data to the format below:
ItemCode,Serial Number,Qty
CMAC-389109,"954zg5, ffnaw8, gh8731, gxj419, hc6y9q, y65vh8",6
CMAC-394140,"4cu3z7",1
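Regrouping rows like this is awkward to express as a single find-and-replace, so here is a sketch that uses a short script instead of a regex. It is a minimal Python sketch that assumes the layout shown above (an item row, a "Serial No." marker row, then one serial per row); the function name `collapse_serials` is made up for illustration.

```python
import csv
import io

def collapse_serials(text):
    """Fold the serial-number rows into the item row they belong to."""
    reader = csv.reader(io.StringIO(text))
    next(reader, None)  # skip the original "Item Code,,Qty" header
    items = []  # each entry: [item_code, [serials], qty]
    for row in reader:
        if len(row) < 3:
            continue  # ignore blank or malformed lines
        code, middle, qty = (field.strip() for field in row[:3])
        if code:
            items.append([code, [], qty])
        elif middle and middle != "Serial No." and items:
            items[-1][1].append(middle)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["ItemCode", "Serial Number", "Qty"])
    for code, serials, qty in items:
        writer.writerow([code, ", ".join(serials), qty])
    return out.getvalue()

sample = """Item Code,,Qty
CMAC-389109,,6
,Serial No.,
,954zg5,
,ffnaw8,
,gh8731,
,gxj419,
,hc6y9q,
,y65vh8,
CMAC-394140,,1
,Serial No.,
,4cu3z7,
"""
print(collapse_serials(sample))
```

The csv module only quotes fields that need it, so a single serial such as 4cu3z7 is written without quotes; that is equivalent CSV.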
Here's a regex which captures two groups (Item Code and Shelf):
^([^,]*?)(?:,(?:[^,]+)?){5},([^,]+),.*$
I don't know what syntax DW uses to reference groups. But usually it's either $n or \n, so in your case, you can put $1, $2 in the "replacement" field of the search/replace box. Or \1, \2.
If you have access to a Linux environment (OS-X and Cygwin should work too), you can use the command-line tools cut and grep to accomplish this quite easily:
cat <filename> | cut -d ',' -f 1,7 | grep -v "^,$" > <output_file>
The parameters I used on cut are:
-d
Delimiter (by which character the fields are separated)
-f
Fields (which fields to include in the output).
... and grep:
-v
Invert pattern: Only include lines in output not matching the regex.
Given your data in your question, the above command will yield this result:
Item Code,Shelf
CMAC-386607,M5-2
CMAC-389109, F2-3
This should also be quite efficient, as cut works on a stream, and only loads as much data into memory as necessary. So you don't need to load the whole file before executing the task. It being a large file, this might be handy.

Simplify points in KML using regex

I am trying to cut down the file size of a kml file I have.
The coordinates for the polygons are this accurate:
-113.52106535153605,53.912817815321503,0.
I am not very good with regex, but I think it would be possible to write one that selects the eight characters before the commas. I'd run a search and replace so the result would be
-113.521065,53.9128178,0.
Any regex experts out there think this is possible?
Try this
\d{8}(?=,)
and replace with an empty string
See it here on Regexr
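The same substitution can be tried outside an editor. A minimal Python sketch, with the made-up function name `trim_coordinates`, applied to the coordinate from the question:

```python
import re

def trim_coordinates(text):
    """Drop the eight digits that sit directly before each comma."""
    return re.sub(r"\d{8}(?=,)", "", text)

original = "-113.52106535153605,53.912817815321503,0."
trimmed = trim_coordinates(original)
print(trimmed)
```

Note that the pattern removes any eight digits before a comma, so a value with few decimals could lose part of its integer portion; it is worth checking the data for such cases first.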
Here is something that might work. It replaces 8 characters plus the comma with just a comma: s/(.{8}),/,/g;
echo "-113.52106535153605,53.912817815321503,0." | sed 's/.\{8\},/,/g'
So you can cat the file you have to a sed command like this:
cat file.kml | sed 's/.\{8\},/,/g' > newfile.kml
I just had to do the same thing. This is perl instead of sed, but it will look for a string of eight uninterrupted digits and then replace any number of uninterrupted digits after that with nothing. It worked great.
cat originalfile.kml | perl -pe 's/(?<=\d{8})\d*//g' > shortenedfile.kml

Awk/etc.: Extract Matches from File

I have an HTML file and would like to extract the text between <li> and </li> tags. There are of course a million ways to do this, but I figured it would be useful to get more into the habit of doing this in simple shell commands:
awk '/<li[^>]+><a[^>]+>([^>]+)<\/a>/m' cities.html
The problem is, this prints everything whereas I simply want to print the match in parentheses -- ([^>]+) -- either awk doesn't support this, or I'm incompetent. The latter seems more likely. If you wanted to apply the supplied regex to a file and extract only the specified matches, how would you do it? I already know a half dozen other ways, but I don't feel like letting awk win this round ;)
Edit: The data is not well-structured, so using positional matches ($1, $2, etc.) is a no-go.
If you want to do this in the general case, where your list tags can contain any legal HTML markup, then awk is the wrong tool. The right tool for the job would be an HTML parser, which you can trust to get correct all of the little details of HTML parsing, including variants of HTML and malformed HTML.
If you are doing this for a special case, where you can control the HTML formatting, then you may be able to make awk work for you. For example, let's assume you can guarantee that each list element never occupies more than one line, is always terminated with </li> on the same line, never contains any markup (such as a list that contains a list), then you can use awk to do this, but you need to write a whole awk program that first finds lines that contain list elements, then uses other awk commands to find just the substring you are interested in.
But in general, awk is the wrong tool for this job.
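As an illustration of the parser route, here is a minimal Python sketch built on the standard library's `html.parser`. The class name `ListItemText` and the helper `li_texts` are made up for illustration, and the sketch assumes every `<li>` is explicitly closed; nested lists are flattened into the text of the outer item.

```python
from html.parser import HTMLParser

class ListItemText(HTMLParser):
    """Collect the text content of each top-level <li> element."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # how many <li> elements we are currently inside
        self.items = []   # finished list-item texts
        self._buf = []    # text pieces of the item being read

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            if self.depth == 0:
                self._buf = []
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "li" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.items.append("".join(self._buf).strip())

    def handle_data(self, data):
        if self.depth > 0:
            self._buf.append(data)

def li_texts(html):
    parser = ListItemText()
    parser.feed(html)
    parser.close()
    return parser.items

page = (
    '<ul>\n'
    '<li class="x"><a href="/a">Amsterdam</a></li>\n'
    '<li><a href="/b">Berlin</a> (DE)</li>\n'
    '</ul>'
)
print(li_texts(page))
```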
gawk -F'<li>' -v RS='</li>' 'RT{print $NF}' file
Worked pretty well for me.
If the pattern in your script finds what you want (which means the <li> and <a> tags are on one line), you can use:
$ cat test.html | awk 'sub(/<li[^>]*><a[^>]*>/,"")&&sub(/<\/a>.*/,"")'
or
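$ cat test.html | gawk '/<li[^>]*><a[^>]*>([^<]*)<\/a>.*/&&$0=gensub(/<li[^>]*><a[^>]*>([^<]*)<\/a>.*/,"\\1", 1)'
The first one works in any awk; the second one needs GNU awk (for gensub). Awk regexes have no non-greedy .*?, so the capture uses [^<]* to stop at the closing tag.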
There are several issues that I see:
The pattern has a trailing 'm' which is significant for multi-line matches in Perl, but Awk does not use Perl-compatible regular expressions. (At least, standard (non-GNU) awk does not.)
Ignoring that, the pattern seems to search for a 'start list item' followed by an anchor '<a>' to '</a>', not the end list item.
You search for anything that is not a '>' as the body of the anchor; that's not automatically wrong, but it might be more usual to search for anything that is not '<', or anything that is neither.
Awk does not do multi-line searches.
In Awk, '$1' denotes the first field, where the fields are separated by the field separator characters, which default to white space.
Classic nawk (as documented in the 'sed & awk' book, vintage 1991) does not have a mechanism for pulling sub-fields out of matches, etc.
It is not clear that Awk is the right tool for this job. Indeed, it is not entirely clear that regular expressions are the right tool for this job.
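For comparison, in a regex engine with capture-group extraction, printing only the parenthesized part is short. This is a Python sketch for the restricted case only (each `<li><a>...</a>` on one line, no nested markup inside the anchor); the pattern is a loosened variant of the one in the question, using `[^>]*` and `[^<]+`, and the sample markup is made up.

```python
import re

# Capture only the anchor text that directly follows an opening <li>
anchor_in_li = re.compile(r"<li[^>]*><a[^>]*>([^<]+)</a>")

sample_html = (
    '<ul>\n'
    '<li class="x"><a href="/a">Amsterdam</a></li>\n'
    '<li><a href="/b">Berlin</a> (DE)</li>\n'
    '<li>No link here</li>\n'
    '</ul>'
)

# findall returns just the captured group when the pattern has exactly one
cities = anchor_in_li.findall(sample_html)
print(cities)
```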
Don't really know awk, how about Perl instead?
tr -d '\012' < the.html | perl \
-e '$text = <>;' -e 'while ( $text =~ /<li>(.*?)<\/li>/g )' \
-e '{ print "$1\n" }'
1) remove newlines from file (tr reads stdin, hence the redirect), pipe through perl
2) initialize a variable with the complete text, then loop over every match (with /g each pass resumes after the previous match)
3) do a "non greedy" match for stuff bounded by list-item tags and print the captured target
Make sense? (warning, did not try this code myself, need to go home soon...)
P.S. - "perl -n" is Awk (nawk?) mode. Perl is largely a superset of Awk, so I never bothered to learn Awk.