How to use regular expressions in Find and Replace - regex

I have a CSV file with almost 50k records. I want to remove the unnecessary records from this file. Can anyone tell me how I can achieve this with a regex using Find and Replace?
The data looks like this:
Item Code,,Qty
CMAC-389109,,6
,Serial No.,
,954zg5,
,ffnaw8,
,gh8731,
,gxj419,
,hc6y9q,
,y65vh8,
CMAC-394140,,1
,Serial No.,
,4cu3z7,
and I want to convert this data to the format below:
ItemCode,Serial Number,Qty
CMAC-389109,"954zg5, ffnaw8, gh8731, gxj419, hc6y9q, y65vh8",6
CMAC-394140,"4cu3z7",1

Here's a regex which captures two groups (Item Code and Shelf):
^([^,]*?)(?:,(?:[^,]+)?){5},([^,]+),.*$
I don't know what syntax DW uses to reference groups, but it's usually either $n or \n, so you can put $1, $2 (or \1, \2) in the "replacement" field of the search/replace box.
If you have access to a Linux environment (OS X and Cygwin should work too), you can use the command-line tools cut and grep to accomplish this quite easily:
cat <filename> | cut -d ',' -f 1,7 | grep -v "^,$" > <output_file>
The parameters I used on cut are:
-d : the delimiter (which character the fields are separated by)
-f : the fields to include in the output
... and on grep:
-v : invert match; only include lines in the output that do not match the regex.
Given the data in your question, the above command will yield this result:
Item Code,Shelf
CMAC-386607,M5-2
CMAC-389109, F2-3
This should also be quite efficient: cut works on a stream and only loads as much data into memory as necessary, so you don't need to load the whole file before executing the task. With a file this large, that can be handy.

Related

Extract Google Drive folder ID from URLs

I am just trying to extract the Google Drive folder ID from a bunch of different Google Drive URLs:
cat links.txt
https://drive.google.com/drive/mobile/folders/1mzr8lgf50p9z6p-7RyHn4XjnyKSvyyuE?usp=sharing
https://drive.google.com/open?id=1_7vwy0-y0BqvPOtG2Or4pvoChnZHrHAx
https://drive.google.com/folderview?id=1rOLhig0g3DdgB9YfvW8HiqRA6o6LxAFF
https://drive.google.com/file/d/1o2J_NwHS3l1-fM71HaDN-xxres1jHkb_/view?usp=drivesdk
https://drive.google.com/drive/folders/0AKzaqn_X7nxiUk9PVA
https://drive.google.com/drive/mobile/folders/0AKzaqn_X7nxiUk9PVA
https://drive.google.com/drive/mobile/folders/0AKzaqn_X7nxiUk9PVA/1re_-YAGfTuyE1Gt848vzTu4ZDC6j23sG/1Ye90fM5qYMYkXp4QMAcQftsJCFVHswWj/149W7xNROO33zaPvIYTNwvtVGAXFxCg_b?sort=13&direction=a
https://drive.google.com/drive/mobile/folders/1nY48t6MATb0XM-iEdeWzEs70qXW2N4Y9?sort=13&direction=a
https://drive.google.com/drive/folders/1M3Xp3xz44NS8QJO5XJT5DK55MohwN6tF?sort=13&direction=a
Expected Output
1mzr8lgf50p9z6p-7RyHn4XjnyKSvyyuE
1_7vwy0-y0BqvPOtG2Or4pvoChnZHrHAx
1rOLhig0g3DdgB9YfvW8HiqRA6o6LxAFF
1o2J_NwHS3l1-fM71HaDN-xxres1jHkb_
0AKzaqn_X7nxiUk9PVA
0AKzaqn_X7nxiUk9PVA
149W7xNROO33zaPvIYTNwvtVGAXFxCg_b
1nY48t6MATb0XM-iEdeWzEs70qXW2N4Y9
1M3Xp3xz44NS8QJO5XJT5DK55MohwN6tF
After an hour of trial and error, I came up with this regex - ([01A-Z])(?=[\w-]*[A-Za-z])[\w-]+
It seems to work almost everywhere, except that it can't process the third-to-last link properly. If there are multiple nested folder IDs in a URL, I need the innermost one in the output. Can someone please help me fix this, and possibly improve the regex if it can be done more efficiently than mine?
You may try this sed, which captures the last /- or =-delimited token that looks like an ID and strips any trailing query string or deeper path:
sed -E 's~.*[/=]([01A-Z][-_[:alnum:]]+)([?/].*|$)~\1~' links.txt
1mzr8lgf50p9z6p-7RyHn4XjnyKSvyyuE
1_7vwy0-y0BqvPOtG2Or4pvoChnZHrHAx
1rOLhig0g3DdgB9YfvW8HiqRA6o6LxAFF
1o2J_NwHS3l1-fM71HaDN-xxres1jHkb_
0AKzaqn_X7nxiUk9PVA
0AKzaqn_X7nxiUk9PVA
149W7xNROO33zaPvIYTNwvtVGAXFxCg_b
1nY48t6MATb0XM-iEdeWzEs70qXW2N4Y9
1M3Xp3xz44NS8QJO5XJT5DK55MohwN6tF
With GNU awk:
awk '{print $NF}' FPAT='[a-zA-Z0-9_-]{19,34}' file
$NF: the last field of the record
FPAT: A regular expression describing the contents of the fields in a record. When set, gawk parses the input into fields, where the fields match the regular expression, instead of using the value of FS as the field separator.
Output:
1mzr8lgf50p9z6p-7RyHn4XjnyKSvyyuE
1_7vwy0-y0BqvPOtG2Or4pvoChnZHrHAx
1rOLhig0g3DdgB9YfvW8HiqRA6o6LxAFF
1o2J_NwHS3l1-fM71HaDN-xxres1jHkb_
0AKzaqn_X7nxiUk9PVA
0AKzaqn_X7nxiUk9PVA
149W7xNROO33zaPvIYTNwvtVGAXFxCg_b
1nY48t6MATb0XM-iEdeWzEs70qXW2N4Y9
1M3Xp3xz44NS8QJO5XJT5DK55MohwN6tF

Insert newline before/after match for TSV

I'm going grey trying to figure out how to accomplish some regex matching to insert new lines. Example input/output below...
Example TSV Data:
Name Monitoring Tags
i-RBwPyvq8wPbUhn495 enabled "some:tags:with:colons=some:value:with:colons-and-dashes/and/slashes/yay606-values-001 some:other:tag:with-colons-and-hyphens=MACHINE NAME Name=NAMETAG backup=true"
i-sMEwh2MXj3q47yWWP enabled "description=RANDOM BUSINESS INT01 backup=true Name=SOMENAME"
Desired Output:
Name Monitoring Tags
i-RBwPyvq8wPbUhn495 enabled "some:tags:with:colons=some:value:with:colons-and-dashes/and/slashes/yay606-values-001
some:other:tag:with-colons-and-hyphens=MACHINE NAME
Name=NAMETAG
backup=true"
i-sMEwh2MXj3q47yWWP enabled "description=RANDOM BUSINESS INT01
backup=true
Name=SOMENAME"
I can guarantee that each key=value pair within those quotes is separated by hard/literal tabs; although it may not appear that way with how the Stack Overflow code block is displayed in HTML, they did carry over into the code block editor. The data under the Tags column is quoted so that, even though the pairs are tab-separated, they stay within the Tags column. For whatever reason I'm not able to get the desired results.
In my measly attempts I've basically been capturing everything between the quotes as if the tabs weren't there, because my use of wildcards like [TAB].*=.*[TAB] is obviously not working: I lose everything between the first and last occurrence on each line. I've also attempted storing the pairs in capture groups, without any success.
I'm looking for a unix toolset solution (sed, awk, perl and the like). Any/All help is appreciated!
This will work using any awk in any shell on any UNIX box:
$ awk 'match($0,/".*"/){str=substr($0,RSTART,RLENGTH); gsub(/\t/,"\n",str); $0=substr($0,1,RSTART-1) str substr($0,RSTART+RLENGTH)} 1' file
Name Monitoring Tags
i-RBwPyvq8wPbUhn495 enabled "some:tags:with:colons=some:value:with:colons-and-dashes/and/slashes/yay606-values-001
some:other:tag:with-colons-and-hyphens=MACHINE NAME
Name=NAMETAG
backup=true"
i-sMEwh2MXj3q47yWWP enabled "description=RANDOM BUSINESS INT01
backup=true
Name=SOMENAME"
It just extracts a string between "s from the current record, replaces all tabs with newlines within that string, then puts the record back together before it's printed.
You can try this sed (GNU sed 4.4):
sed -E ':A;s/(".*)\t(.*")/\1\n\2/;tA' TSV_Data_File
With OS X sed, you can try this one (I think the \t is OK):
sed -E '
:A
s/(".*)\t(.*")/\1\
\2/
tA
' TSV_Data_File
A brief explanation:
Catch the text inside the quotes.
Substitute the last \t with \n.
If a substitution occurred, jump back to A; otherwise continue.
With awk:
awk -v RS='"' 'NR%2==0{gsub("\t","\n")}1' ORS='"' TSV_Data_File
This is basically ctac_'s awk answer converted to perl:
perl -pe'1 while s/(".*)\t(.*")/$1\n$2/s' file.tsv
Where the \t might be replaced by \t\s* if you want just one newline out of each tab-and-then-some.
This might work for you (GNU sed):
sed 's/\S\+=\S\+/\n&/2g' file
Insert a newline before the second and subsequent non-whitespace strings containing an = (the 2g flag makes the substitution apply from the second match onward).

Regex command line change format of each line

I have a file that contains lines in a format similar to this...
/data/file.geojson?10,20,30,40
/data/file.geojson?bbox=-5.20751953125,49.05227025601607,3.0322265625,56.46249048388979
/data/file.geojson?bbox=-21.46728515625,45.99696161820381,19.2919921875,58.88194208135912
/data/file.geojson?bbox=-2.8482055664062496,54.38935426009769,-0.300750732421875,55.158473983815306
/data/file.geojson?bbox=-21.46728515625,45.99696161820381,19.2919921875,58.88194208135912
/data/file.geojson?bbox=-21.46728515625,45.99696161820381,19.2919921875,58.88194208135912
I've tried a combination of grep, sed, gawk, and |(pipes) to try and pattern match and then change the format to be more like this...
[10,40],[30,40],[30,20],[10,20],
[-5.20751953125,56.46249048388979],[3.0322265625,56.46249048388979].....
Hopefully you get the idea from the first line so I don't have to type out all the examples manually!
I've got the hang of the regex to match the coordinates. In fact the input file is the result of extracting from Apache access logs. It might be easier to read/understand answers if they just match positive integers; I will then be able to slot in a more complicated pattern to match the right range.
To arrange the results the way you want, it is important to be able to access the last four values per line.
No pattern matching is required if you use awk. You can split the input strings by a set of delimiters and reassemble the resulting fields. 40 can be accessed as $(NF), 30 as $(NF-1) and so on.
awk -F'[?,=]' '
{printf "[%s,%s],[%s,%s],[%s,%s],[%s,%s]\n",
$(NF-3),$(NF),$(NF-1),$(NF),
$(NF-1),$(NF-2),$(NF-3),$(NF-2)
}' file
I'm using ?, , or = as the field delimiters. This makes it simple to access the columns of interest.
Output:
[10,40],[30,40],[30,20],[10,20]
[-5.20751953125,56.46249048388979],[3.0322265625,56.46249048388979],[3.0322265625,49.05227025601607],[-5.20751953125,49.05227025601607]
[-21.46728515625,58.88194208135912],[19.2919921875,58.88194208135912],[19.2919921875,45.99696161820381],[-21.46728515625,45.99696161820381]
[-2.8482055664062496,55.158473983815306],[-0.300750732421875,55.158473983815306],[-0.300750732421875,54.38935426009769],[-2.8482055664062496,54.38935426009769]
[-21.46728515625,58.88194208135912],[19.2919921875,58.88194208135912],[19.2919921875,45.99696161820381],[-21.46728515625,45.99696161820381]
[-21.46728515625,58.88194208135912],[19.2919921875,58.88194208135912],[19.2919921875,45.99696161820381],[-21.46728515625,45.99696161820381]
By the way, sed can also be used here:
sed -r 's/.*[?=]([^,]+),([^,]+),([^,]+),(.*)/[\1,\4],[\3,\4],[\3,\2],[\1,\2]/' file
The command captures each of the numbers at the end in a separate capturing group and reassembles them in the replacement part.
Not all versions of sed support the + quantifier. The most compatible version would look like this :)
sed 's/.*[?=]\([^,]\{1,\}\),\([^,]\{1,\}\),\([^,]\{1,\}\),\(.*\)/[\1,\4],[\3,\4],[\3,\2],[\1,\2]/' file
sed strips off everything prior to the numbers, then awk splits on commas and outputs the fields in a different order. Assuming the data is in a file called "td.txt":
sed 's/^[^0-9-]*//' td.txt|awk -F, '{print "["$1","$4"],["$3","$4"],["$3","$2"],["$1","$2"],"}'
This might work for you (GNU sed):
sed -r 's/^.*\?[^-0-9]*([^,]*),([^,]*),([^,]*),([^,]*)/[\1,\4],[\3,\4],[\3,\2],[\1,\2]/' file
Or with more toothpicks:
sed 's/^.*\?[^-0-9]*\([^,]*\),\([^,]*\),\([^,]*\),\([^,]*\)/[\1,\4],[\3,\4],[\3,\2],[\1,\2]/' file
You can use the following to match:
(\/data\/file\.geojson\?(?:bbox=)?)([0-9.-]+),([0-9.-]+),([0-9.-]+),([0-9.-]+)
And replace with the following:
$1[$2,$3],[$4,$5]

Regular Expression to parse Common Name from Distinguished Name

I am attempting to parse (with sed) just First Last from the following DN(s) returned by the DSCL command in an OS X terminal bash environment...
CN=First Last,OU=PCS,OU=guests,DC=domain,DC=edu
I have tried multiple regexes from this site and others with questions very close to what I wanted... mainly this question... I have tried following the advice to the best of my ability (I don't necessarily consider myself a newbie... but definitely a newbie to regex).
DSCL returns a list of DNs, and I would like to only have First Last printed to a text file. I have attempted using sed, but I can't seem to get the correct function. I am open to other commands to parse the output. Every line begins with CN= and then there is a comma between Last and OU=.
Thank you very much for your help!
I think all of the regular expression answers provided so far are buggy, insofar as they do not properly handle quoted ',' characters in the common name. For example, consider a distinguishedName like:
CN=Doe\, John,CN=Users,DC=example,DC=local
Better to use a real library able to parse the components of a distinguishedName. If you're looking for something quick on the command line, try piping your DN to a command like this:
echo "CN=Doe\, John,CN=Users,DC=activedir,DC=local" | python -c 'import ldap; import sys; print ldap.dn.explode_dn(sys.stdin.read().strip(), notypes=1)[0]'
(depends on having the python-ldap library installed). You could cook up something similar with PHP's built-in ldap_explode_dn() function.
Two cut commands is probably the simplest (although not necessarily the best):
DSCL | cut -d, -f1 | cut -d= -f2
First, split the output from DSCL on commas and print the first field ("CN=First Last"); then split that on equal signs and print the second field.
Using sed:
sed 's/^CN=\([^,]*\).*/\1/' input_file
^ matches start of line
CN= literal string match
\([^,]*\) everything until a comma
.* matches the rest of the line
http://www.gnu.org/software/gawk/manual/gawk.html#Field-Separators
awk -v RS=',' -v FS='=' '$1=="CN"{print $2}' foo.txt
I like awk too, so I print the substring from the fourth character (note that FS has to be set in a BEGIN block; assigning it inside the main rule only takes effect from the next record):
DSCL | awk 'BEGIN{FS=","} {print substr($1,4)}' > filterednames.txt
This regex will parse a distinguished name, giving name and val as capture groups for each match.
When DN strings contain commas, they are meant to be quoted - this regex correctly handles both quoted and unquoted strings, and also handles escaped quotes in quoted strings:
(?:^|,\s?)(?:(?<name>[A-Z]+)=(?<val>"(?:[^"]|"")+"|[^,]+))+
Here it is, nicely formatted:
(?:^|,\s?)
(?:
(?<name>[A-Z]+)=
(?<val>"(?:[^"]|"")+"|[^,]+)
)+
Here's a link so you can see it in action:
https://regex101.com/r/zfZX3f/2
If you want a regex to get only the CN, then this adapted version will do it:
(?:^|,\s?)(?:CN=(?<val>"(?:[^"]|"")+"|[^,]+))

Awk/etc.: Extract Matches from File

I have an HTML file and would like to extract the text between <li> and </li> tags. There are of course a million ways to do this, but I figured it would be useful to get more into the habit of doing this in simple shell commands:
awk '/<li[^>]+><a[^>]+>([^>]+)<\/a>/m' cities.html
The problem is, this prints everything, whereas I simply want to print the match in parentheses -- ([^>]+) -- either awk doesn't support this, or I'm incompetent. The latter seems more likely. If you wanted to apply the supplied regex to a file and extract only the specified matches, how would you do it? I already know a half dozen other ways, but I don't feel like letting awk win this round ;)
Edit: The data is not well-structured, so using positional matches ($1, $2, etc.) is a no-go.
If you want to do this in the general case, where your list tags can contain any legal HTML markup, then awk is the wrong tool. The right tool for the job would be an HTML parser, which you can trust to get correct all of the little details of HTML parsing, including variants of HTML and malformed HTML.
If you are doing this for a special case, where you can control the HTML formatting, then you may be able to make awk work for you. For example, let's assume you can guarantee that each list element never occupies more than one line, is always terminated with </li> on the same line, never contains any markup (such as a list that contains a list), then you can use awk to do this, but you need to write a whole awk program that first finds lines that contain list elements, then uses other awk commands to find just the substring you are interested in.
But in general, awk is the wrong tool for this job.
gawk -F'<li>' -v RS='</li>' 'RT{print $NF}' file
Worked pretty well for me.
Based on your script, this may get what you want (it assumes the <li> and <a> tags are on one line):
$ cat test.html | awk 'sub(/<li[^>]*><a[^>]*>/,"")&&sub(/<\/a>.*/,"")'
or
$ cat test.html | gawk '/<li[^>]*><a[^>]*>(.*?)<\/a>.*/&&$0=gensub(/<li[^>]*><a[^>]*>(.*?)<\/a>.*/,"\\1", 1)'
The first one works in every awk; the second one is for GNU awk.
There are several issues that I see:
The pattern has a trailing 'm' which is significant for multi-line matches in Perl, but Awk does not use Perl-compatible regular expressions. (At least, standard (non-GNU) awk does not.)
Ignoring that, the pattern seems to search for a 'start list item' followed by an anchor '<a>' to '</a>', not the ending list item tag.
You search for anything that is not a '>' as the body of the anchor; that's not automatically wrong, but it might be more usual to search for anything that is not '<', or anything that is neither.
Awk does not do multi-line searches.
In Awk, '$1' denotes the first field, where the fields are separated by the field separator characters, which default to white space.
Classic nawk (as documented in the 'sed & awk' book, vintage 1991) does not have a mechanism for pulling sub-fields out of matches, etc.
It is not clear that Awk is the right tool for this job. Indeed, it is not entirely clear that regular expressions are the right tool for this job.
Don't really know awk, how about Perl instead?
tr -d '\012' < the.html | perl \
-e '$text = <>;' -e 'while ( $text =~ /<li>(.*?)<\/li>(.*)/ )' \
-e '{ $target = $1; $text = $2; print "$target\n" }'
1) remove newlines from the file, pipe through perl
2) initialize a variable with the complete text, loop as long as a "non-greedy" match for stuff bounded by list-item tags succeeds
3) save and print the target, keep the remainder of the text for the next pass
Make sense? (warning, did not try this code myself, need to go home soon...)
P.S. - "perl -n" is Awk (nawk?) mode. Perl is largely a superset of Awk, so I never bothered to learn Awk.