Grep a list of SQL table names within the file contents of a directory matching a table name format

I'm trying to write a bash script that would be able to grep table names from across files (within a directory) that partially match a string.
For my case, I'd like to return all table references following a certain convention (case insensitive):
tblpl
tbljoin
tbldim
This would ideally return a list like this:
product.dbo.tblplColors
product..tblplMonograms
solr.dbo.tbljoinSkuCategory
Matching one table name format at a time would also be alright if that helped reduce some of the complexity. To clarify, this should return just the table names, not the file names or the full file contents. It's safe to say the end of a table name is delimited by a space, since it's SQL.
Where I've started:
grep -rio 'tblpl*[^ ]' d:/sqldirectoryhere > c:/Users/foo/Desktop/tables.txt
Any help/pointers are appreciated here- thanks!
Edit: Both of these answers nailed my use case. I ended up adding the extended regex (so huge thanks for that recommendation) but I have to give credit to the person who wrote the bulk of it. Thanks all!
My extended use case ended up being a way to return this list of tables and then script it to a query-friendly format so I could throw these into a WHERE IN clause. In case anyone ever needs it:
grep -rioE --no-filename '[a-zA-Z_.]+\.tbl(pl|join|dim)[a-zA-Z_]+' {DIRECTORY_HERE} | sed -n 's/.*/\x27&\x27/; $! s/$/,/; 1 h; 1 ! H; $ { x; s/\n/ /g; p; }'
Returns formatted as: 'db.tblplColorSwatches', 'db.tbljoinCustomerSegment'...
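For reference, a rough breakdown of the sed stage (it receives one table name per line from grep):
s/.*/\x27&\x27/          # wrap each line in single quotes (\x27 is ' in GNU sed)
$! s/$/,/                # append a comma to every line except the last
1 h; 1 ! H               # copy line 1 into the hold space, append later lines to it
$ { x; s/\n/ /g; p; }    # on the last line: swap in the hold space, turn newlines into spaces, print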

Try this regular expression:
[a-zA-Z.]+\.tbl(pl|join|dim)[a-zA-Z]+
It finds any mixed sequence of letters and periods, followed by .tblpl or .tbljoin or .tbldim, followed by one or more letters.
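Applied with grep and de-duplicated (directory path assumed; -h drops the file names so only the table names remain):
grep -rhioE '[a-zA-Z.]+\.tbl(pl|join|dim)[a-zA-Z]+' /path/to/sqldir | sort -u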

I would use the -E flag to use extended regular expression:
grep -rioE '[a-z]*\.[a-z]*\.tbl(pl|join|dim)[a-z]*' d:/sqldirectoryhere
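Note that with -r and -o together, grep prefixes each match with the file it came from; add -h (--no-filename) for bare table names. A small illustration, with a hypothetical sample file:
$ cat sample.sql
SELECT * FROM product.dbo.tblplColors JOIN solr.dbo.tbljoinSkuCategory ON ...
$ grep -rioE '[a-z]*\.[a-z]*\.tbl(pl|join|dim)[a-z]*' .
./sample.sql:product.dbo.tblplColors
./sample.sql:solr.dbo.tbljoinSkuCategory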

Extract Google Drive folder id from URLs

I am just trying to extract the Google Drive folder id from a bunch of different Google Drive URLs:
cat links.txt
https://drive.google.com/drive/mobile/folders/1mzr8lgf50p9z6p-7RyHn4XjnyKSvyyuE?usp=sharing
https://drive.google.com/open?id=1_7vwy0-y0BqvPOtG2Or4pvoChnZHrHAx
https://drive.google.com/folderview?id=1rOLhig0g3DdgB9YfvW8HiqRA6o6LxAFF
https://drive.google.com/file/d/1o2J_NwHS3l1-fM71HaDN-xxres1jHkb_/view?usp=drivesdk
https://drive.google.com/drive/folders/0AKzaqn_X7nxiUk9PVA
https://drive.google.com/drive/mobile/folders/0AKzaqn_X7nxiUk9PVA
https://drive.google.com/drive/mobile/folders/0AKzaqn_X7nxiUk9PVA/1re_-YAGfTuyE1Gt848vzTu4ZDC6j23sG/1Ye90fM5qYMYkXp4QMAcQftsJCFVHswWj/149W7xNROO33zaPvIYTNwvtVGAXFxCg_b?sort=13&direction=a
https://drive.google.com/drive/mobile/folders/1nY48t6MATb0XM-iEdeWzEs70qXW2N4Y9?sort=13&direction=a
https://drive.google.com/drive/folders/1M3Xp3xz44NS8QJO5XJT5DK55MohwN6tF?sort=13&direction=a
Expected Output
1mzr8lgf50p9z6p-7RyHn4XjnyKSvyyuE
1_7vwy0-y0BqvPOtG2Or4pvoChnZHrHAx
1rOLhig0g3DdgB9YfvW8HiqRA6o6LxAFF
1o2J_NwHS3l1-fM71HaDN-xxres1jHkb_
0AKzaqn_X7nxiUk9PVA
0AKzaqn_X7nxiUk9PVA
149W7xNROO33zaPvIYTNwvtVGAXFxCg_b
1nY48t6MATb0XM-iEdeWzEs70qXW2N4Y9
1M3Xp3xz44NS8QJO5XJT5DK55MohwN6tF
After an hour of trial and error, I came up with this regex: ([01A-Z])(?=[\w-]*[A-Za-z])[\w-]+
It seems to work almost well, except that it can't process the third-to-last link properly. If there are multiple nested folder ids in a URL, I need the innermost one in the output. Can someone please help me fix this, and possibly improve the regex if it can be done more efficiently than mine?
You may try this sed:
sed -E 's~.*[/=]([01A-Z][-_[:alnum:]]+)([?/].*|$)~\1~' links.txt
1mzr8lgf50p9z6p-7RyHn4XjnyKSvyyuE
1_7vwy0-y0BqvPOtG2Or4pvoChnZHrHAx
1rOLhig0g3DdgB9YfvW8HiqRA6o6LxAFF
1o2J_NwHS3l1-fM71HaDN-xxres1jHkb_
0AKzaqn_X7nxiUk9PVA
0AKzaqn_X7nxiUk9PVA
149W7xNROO33zaPvIYTNwvtVGAXFxCg_b
1nY48t6MATb0XM-iEdeWzEs70qXW2N4Y9
1M3Xp3xz44NS8QJO5XJT5DK55MohwN6tF
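For reference, roughly how that pattern works:
.*[/=]                   # greedy, so it skips to the LAST / or = that still leaves an ID; this is what picks the innermost nested folder
([01A-Z][-_[:alnum:]]+)  # capture an ID starting with 0, 1, or an uppercase letter
([?/].*|$)               # swallow a trailing query string or path, or match end of line
The ~ delimiters just avoid having to escape the slashes in the URLs.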
With GNU awk:
awk '{print $NF}' FPAT='[a-zA-Z0-9_-]{19,34}' file
$NF: contains the last field of the record
FPAT: A regular expression describing the contents of the fields in a record. When set, gawk parses the input into fields, where the fields match the regular expression, instead of using the value of FS as the field separator.
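To see why $NF gives the innermost folder, you can print every field gawk finds on the nested-folders line (it is line 7 of links.txt above); each ID-like run of 19-34 characters becomes its own field, so the last one is the deepest folder id:
gawk 'NR==7 { for (i=1; i<=NF; i++) print i, $i }' FPAT='[a-zA-Z0-9_-]{19,34}' links.txt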
Output:
1mzr8lgf50p9z6p-7RyHn4XjnyKSvyyuE
1_7vwy0-y0BqvPOtG2Or4pvoChnZHrHAx
1rOLhig0g3DdgB9YfvW8HiqRA6o6LxAFF
1o2J_NwHS3l1-fM71HaDN-xxres1jHkb_
0AKzaqn_X7nxiUk9PVA
0AKzaqn_X7nxiUk9PVA
149W7xNROO33zaPvIYTNwvtVGAXFxCg_b
1nY48t6MATb0XM-iEdeWzEs70qXW2N4Y9
1M3Xp3xz44NS8QJO5XJT5DK55MohwN6tF

Extract a string from a VCF file

I need to extract the RS=368138379 string from lines like the following in a VCF file of a few billion lines. I am wondering how we can use grep -o and a regular expression to quickly extract that?
AF_ESP=0.0001;ALLELEID=359042;CLNDISDB=MedGen:C0678202,OMIM:266600;CLNDN=Inflammatory_bowel_disease_1;CLNHGVS=NC_000006.11:g.31779521C>T;CLNREVSTAT=no_assertion_criteria_provided;CLNSIG=association;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;GENEINFO=HSPA1L:3305;MC=SO:0001583|missense_variant;ORIGIN=4;RS=368138379
Thanks very much indeed.
Something along the lines of RS=\d+ should do the trick for the expression you're looking for.
Let's say test.log contains your log; you can use:
grep -oE "RS=[0-9]+" test.log
If you want to print also the line numbers:
grep -noE "RS=[0-9]+" test.log
Best to avoid using grep to parse VCF/BCF files. Use bcftools query instead:
bcftools query -f '%INFO/RS\n' -e 'INFO/RS="."' clinvar.vcf.gz
A simple zgrep -oE "RS=[0-9]+" clinvar.vcf.gz will miss RS values for records that contain more than one ID, which can be pipe-delimited:
##INFO=<ID=RS,Number=.,Type=String,Description="dbSNP ID (i.e. rs number)">
Number is . when the number of possible values varies, is unknown, or is unbounded. Please see: https://samtools.github.io/hts-specs/VCFv4.2.pdf
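To illustrate with a hypothetical record whose RS field carries two pipe-delimited IDs (values invented for the example):
echo "RS=368138379|121908529" | grep -oE "RS=[0-9]+"   # prints only RS=368138379
The grep stops at the pipe and silently drops the second ID, whereas bcftools query emits the full INFO/RS value.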

Regex a string with an unknown number of parameters

Let's say I have millions of strings in a text file in this format:
st=expand&c=22&t=button&k=fun HTTP
This is a string we can look at as a hash with keys st, c, t and k. Some of the strings in the text file might not have a given &KEY=VALUE present and might thus look like this:
st=expand&k=fun HTTP
How would one use sed to change the string to the following
expand,,,fun
that is, even though a key=value isn't present, we still add a comma. We can assume that we have a fixed key set [st,c,t,k].
What I've tried is something like (just an idea!!)
sed 's/\(st=\|c=\|t=\|k=\)\([\(^\&\|HTTP\)])\(\&\|HTTP\)/\3,/g' big_file
but obviously, if c isn't there, it doesn't add a comma since it finds no match. Any ideas on how to approach this? Using awk would also be acceptable (or any other fast text-processing utility).
Thanks!
Input data example
st=expand&c=22&t=button&k=fun HTTP
c=22&t=button&k=fun HTTP
st=expand&c=22&t=party&k=fun HTTP
st=expand&c=22&k=fun HTTP
st=expand HTTP
HTTP
Output data
expand,22,button,fun
,22,button,fun
expand,22,party,fun
expand,22,,fun
expand,,,
,,,
You can use this sed:
sed -E 's/(st=([^& ]*)|)(.*c=([^& ]*)|)(.*t=([^& ]*)|)(.*k=([^& ]*)|) HTTP/\2,\4,\6,\8/' file
expand,22,button,fun
,22,button,fun
expand,22,party,fun
expand,22,,fun
expand,,,
,,,
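Roughly how it works: each outer group matches either its key=value chunk or an empty alternative, so a missing key still contributes an (empty) backreference:
(st=([^& ]*)|)     # group 2 = the st value, or empty if st is absent
(.*c=([^& ]*)|)    # group 4 = the c value; the leading .* skips over intervening text
(.*t=([^& ]*)|)    # group 6 = the t value
(.*k=([^& ]*)|)    # group 8 = the k value
 HTTP              # anchors the end of the record; the replacement \2,\4,\6,\8 prints the values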
Whenever you have name=value pairs in your input data, it's simplest, clearest, and usually most efficient to create a name->value array and then print the values by name in whatever order you want, e.g.:
$ cat tst.awk
BEGIN { FS="[&= ]"; OFS="," }       # split fields on &, =, and space; emit CSV
{
    delete n                        # reset the name->value map for each line
    for (i=1; i<NF; i+=2) {         # fields alternate: name, value, name, value, ...
        n[$i] = $(i+1)
    }
    print n["st"], n["c"], n["t"], n["k"]
}
$ awk -f tst.awk file
expand,22,button,fun
,22,button,fun
expand,22,party,fun
expand,22,,fun
expand,,,
,,,
Another pattern for sed to try:
sed -r "s/(st=(\w+))?(&?c=(\w+))?(&t=(\w+))?(&k=(\w+))?( HTTP)/\2,\4,\6,\8/g" big_file
expand,22,button,fun
,22,button,fun
expand,22,party,fun
expand,22,,fun
expand,,,
How about something like this? It's not perfectly strict, but as long as your data follows the format you described on every line, it will work.
Regex:
^(?:st=([^&\n]*))?&?(?:c=([^&\n]*))?&?(?:t=([^&\n]*))?&?(?:k=([^&\n]*))? HTTP$ (must be run once per line or with multi-line and global options enabled)
Substitution:
\1,\2,\3,\4
Try it here: https://regex101.com/r/nE1oP7/2
EDIT: If you are using sed, you will need to change the non-capturing groups to regular ones ((?:) to ()) and update the backreferences accordingly (\2,\4,\6,\8). Demo: http://ideone.com/GNRNGp
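For example, a rough GNU sed conversion might look like this (the [^&\n] classes become [^& ] since sed works line by line; only checked against the sample rows above):
sed -E 's/^(st=([^& ]*))?&?(c=([^& ]*))?&?(t=([^& ]*))?&?(k=([^& ]*))? HTTP$/\2,\4,\6,\8/' file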

How to use Regular Expressions in Find and Replace

I have a CSV file with almost 50k records. I want to remove the unnecessary records from this file. Can anyone tell me how I can achieve this with a regex using Find and Replace?
The data looks like this:
Item Code,,Qty
CMAC-389109,,6
,Serial No.,
,954zg5,
,ffnaw8,
,gh8731,
,gxj419,
,hc6y9q,
,y65vh8,
CMAC-394140,,1
,Serial No.,
,4cu3z7,
and I want to convert this data to below format:
ItemCode,Serial Number,Qty
CMAC-389109,"954zg5, ffnaw8, gh8731, gxj419, hc6y9q, y65vh8",6
CMAC-394140,"4cu3z7",1
Here's a regex which captures two groups (Item Code and Shelf):
^([^,]*?)(?:,(?:[^,]+)?){5},([^,]+),.*$
I don't know what syntax DW uses to reference groups. But usually it's either $n or \n, so in your case, you can put $1, $2 in the "replacement" field of the search/replace box. Or \1, \2.
If you have access to a Linux environment (OS X and Cygwin should work too), you can use the command-line tools cut and grep to accomplish this quite easily:
cat <filename> | cut -d ',' -f 1,7 | grep -v "^,$" > <output_file>
The parameters I used on cut are:
-d: the delimiter (the character by which the fields are separated)
-f: the fields to include in the output
... and on grep:
-v: invert match; only include lines in the output that do not match the regex
Given your data in your question, the above command will yield this result:
Item Code,Shelf
CMAC-386607,M5-2
CMAC-389109, F2-3
This should also be quite efficient, as cut works on a stream and only loads as much data into memory as necessary, so you don't need to load the whole file before executing the task. With a file this large, that might be handy.
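If instead you want the grouped format shown in the question, a small awk sketch could build it directly (a sketch, assuming exactly the layout above: an item row CODE,,QTY, a ,Serial No., sub-header, then one serial per row):
awk -F, '
    # sketch: assumes the CODE,,QTY / ,Serial No., / ,SERIAL, layout shown in the question
    NR == 1            { print "ItemCode,Serial Number,Qty"; next }   # new header
    $1 != ""           { flush(); code = $1; qty = $3; next }         # item row
    $2 == "Serial No." { next }                                       # skip the sub-header
    $2 != ""           { serials = serials ? (serials ", " $2) : $2 } # collect serials
    function flush() {
        if (code) printf "%s,\"%s\",%s\n", code, serials, qty
        serials = ""
    }
    END                { flush() }                                    # emit the last item
' file.csv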

Regular Expression to parse Common Name from Distinguished Name

I am attempting to parse (with sed) just First Last from the following DN(s) returned by the DSCL command in an OS X terminal bash environment...
CN=First Last,OU=PCS,OU=guests,DC=domain,DC=edu
I have tried multiple regexes from this site and others with questions very close to what I wanted... mainly this question... I have tried following the advice to the best of my ability (I don't necessarily consider myself a newbie... but I'm definitely a newbie to regex).
DSCL returns a list of DNs, and I would like to only have First Last printed to a text file. I have attempted using sed, but I can't seem to get the correct function. I am open to other commands to parse the output. Every line begins with CN= and then there is a comma between Last and OU=.
Thank you very much for your help!
I think all of the regular expression answers provided so far are buggy, insofar as they do not properly handle quoted ',' characters in the common name. For example, consider a distinguishedName like:
CN=Doe\, John,CN=Users,DC=example,DC=local
Better to use a real library able to parse the components of a distinguishedName. If you're looking for something quick on the command line, try piping your DN to a command like this:
echo "CN=Doe\, John,CN=Users,DC=activedir,DC=local" | python -c 'import ldap; import sys; print ldap.dn.explode_dn(sys.stdin.read().strip(), notypes=1)[0]'
(depends on having the python-ldap library installed). You could cook up something similar with PHP's built-in ldap_explode_dn() function.
Two cut commands are probably the simplest approach (although not necessarily the best):
DSCL | cut -d, -f1 | cut -d= -f2
First, split the output from DSCL on commas and print the first field ("CN=First Last"); then split that on equal signs and print the second field.
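For example, fed the DN from the question:
$ echo 'CN=First Last,OU=PCS,OU=guests,DC=domain,DC=edu' | cut -d, -f1 | cut -d= -f2
First Last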
Using sed:
sed 's/^CN=\([^,]*\).*/\1/' input_file
^ matches the start of the line
CN= matches the literal string CN=
\([^,]*\) captures everything up to the next comma
.* matches the rest of the line
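For example:
$ echo 'CN=First Last,OU=PCS,OU=guests,DC=domain,DC=edu' | sed 's/^CN=\([^,]*\).*/\1/'
First Last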
http://www.gnu.org/software/gawk/manual/gawk.html#Field-Separators
awk -v RS='[,\n]' -v FS='=' '$1=="CN"{print $2}' foo.txt
(splitting records on commas and newlines catches the CN on every line; with RS=',' alone, each line's trailing DC=edu component would swallow the next line's CN. A regex RS requires gawk.)
I like awk too, so I print the substring starting at the fourth character. Note that FS has to be set before a line is read (e.g. with -F), not inside the action, or the first record is still split on whitespace:
DSCL | awk -F, '{print substr($1,4)}' > filterednames.txt
This regex will parse a distinguished name, with name and val capture groups for each match.
When DN strings contain commas, they are meant to be quoted; this regex correctly handles both quoted and unquoted strings, and also handles escaped quotes in quoted strings:
(?:^|,\s?)(?:(?<name>[A-Z]+)=(?<val>"(?:[^"]|"")+"|[^,]+))+
Here it is nicely formatted:
(?:^|,\s?)
(?:
(?<name>[A-Z]+)=
(?<val>"(?:[^"]|"")+"|[^,]+)
)+
Here's a link so you can see it in action:
https://regex101.com/r/zfZX3f/2
If you want a regex to get only the CN, then this adapted version will do it:
(?:^|,\s?)(?:CN=(?<val>"(?:[^"]|"")+"|[^,]+))
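To try the CN-only version from a shell, GNU grep's -P (PCRE) mode can apply it; a rough sketch with an invented quoted DN (the named group is dropped and \K added so only the value prints):
$ echo 'CN="Doe, John",OU=Users,DC=example,DC=local' | grep -oP '(?:^|,\s?)CN=\K("(?:[^"]|"")+"|[^,]+)'
"Doe, John"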