Regex to match and split every third occurrence of a string

In a Korn shell script I have a large amount of data in a string variable contents that matches the following syntax:
account_id_0:group_id_0:name_0
account_id_1:group_id_1:name_1
...
account_id_N:group_id_N:name_N
I want to split the string on the : character at every third instance so I can generate three other strings (accounts, groups, and names)
that have the format:
accounts = account_id_0,account_id_1,...,account_id_N
groups = group_id_0,group_id_1,...,group_id_N
names = name_0,name_1,...,name_N
The reason I would like to store these in a string rather than an array is for portability across environments.
Am I able to achieve this using something like the sed, cut, or awk command?
The current regex I'm using to capture the accounts is:
[a-zA-Z][0-9]+(?:([a-zA-z]*[0-9]*)*)(?:([a-zA-Z]*[0-9]*)*)
But I feel there is a more efficient alternative.
I have attempted to achieve the desired output using a combination of this solution and this solution; however, the first one lacks the repetition I require, and the latter is for file manipulation, not strings.

I would use arrays, and process the contents variable like reading lines from a file:
contents='account_id_0:group_id_0:name_0
account_id_1:group_id_1:name_1
...:...:...
account_id_N:group_id_N:name_N'
as=()
gs=()
ns=()
while IFS=: read -r a g n; do
  as+=("$a")
  gs+=("$g")
  ns+=("$n")
done <<< "$contents"
accounts=$(IFS=,; echo "${as[*]}")
groups=$(IFS=,; echo "${gs[*]}")
names=$(IFS=,; echo "${ns[*]}")
printf "%s\n" "$accounts" "$groups" "$names"
account_id_0,account_id_1,...,account_id_N
group_id_0,group_id_1,...,group_id_N
name_0,name_1,...,name_N
If you're getting the contents value from a file, you can skip the step of storing it in a variable and just read the file directly.
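Since the question asks specifically about sed, cut, or awk: if you'd rather avoid arrays entirely, a cut/paste pipeline builds the same three comma-joined strings. This is a sketch, assuming POSIX cut and paste (paste -sd, - joins all input lines with commas; option grouping may differ on very old systems):
accounts=$(printf '%s\n' "$contents" | cut -d: -f1 | paste -sd, -)
groups=$(printf '%s\n' "$contents" | cut -d: -f2 | paste -sd, -)
names=$(printf '%s\n' "$contents" | cut -d: -f3 | paste -sd, -)
Each pipeline extracts one colon-separated column and joins the resulting lines with commas, so no regex is needed at all.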

Regex: keep same pattern found multiple times in same line and replace line by appending single pattern in front

Is it possible with Notepad++ (or maybe from a Linux bash shell) to create multiple lines from a pattern that is found, as many times as the pattern is found, and also prepend the single found pattern to each newly created line?
The multi pattern is val=[0-9]+
The single pattern is id=[a-zA-Z0-9]+
Example:
Input lines:
id=af2477,val=333,val=777
id=af3456,val=222,val=444,val=678
id=af3327,val=3234,val=123,val=701
Output lines:
id=af2477,val=333
id=af2477,val=777
id=af3456,val=222
id=af3456,val=444
id=af3456,val=678
id=af3327,val=3234
id=af3327,val=123
id=af3327,val=701
I have tried with 2 subgroups but it won't work. It will only replace the second group once:
find what: (id=[a-zA-Z0-9]+,)(val=[0-9]+,)*
replace: \n\1,\2
UPDATE: Both answers from Toto and Wiktor Stribiżew seem to do the job. Haven't tested them yet. I would still like to see how this can work with the use of Notepad++ (even if multiple steps are needed)
Since you also consider using Linux tools for this, an awk solution looks much more viable:
awk 'BEGIN{FS=OFS=","} /^id=[a-zA-Z0-9]+(,val=[0-9]+)*$/{
  for(i=2; i<=NF; i++) {
    print $1,$i
  }; next;
}{print $0}' file > outfile
See the online demo.
Here, any line that matches ^id=[a-zA-Z0-9]+(,val=[0-9]+)*$ (i.e. matches the format of the lines you need to expand) is split the way you need with for(i=2; i<=NF; i++) {print $1,$i}; next;. Otherwise, the line is written as is (print $0).
The BEGIN{FS=OFS=","} part sets the input and output field separator to a comma.
This perl one-liner does the job (output on STDOUT):
perl -anE '($id,$vals)=/(id=\w+),(.+)$/;say "$id,$_" for split/,/,$vals' file
id=af2477,val=333
id=af2477,val=777
id=af3456,val=222
id=af3456,val=444
id=af3456,val=678
id=af3327,val=3234
id=af3327,val=123
id=af3327,val=701
Explanation:
($id,$vals)=/(id=\w+),(.+)$/; # explode id and values for each line in input file
say "$id,$_" for split/,/,$vals # print id and each value
You can redirect the output to another file:
perl -anE '($id,$vals)=/(id=\w+),(.+)$/;say "$id,$_" for split/,/,$vals' file > outputfile
Or do the change in-place:
perl -i -anE '($id,$vals)=/(id=\w+),(.+)$/;say "$id,$_" for split/,/,$vals' file
It is possible, yet very complex, to do that with one regular expression; you would have to use (?R) and conditional statements.
With multiple steps it would be pretty simple. You can, for instance, do a series of find-and-replace passes sized to the maximum number of val fields in your longest lines. Imagine 4 is that maximum; then we'd have four occurrences of (,val=[^\r\n,]*) in our initial expression:
^(id=[^\r\n,]*)(,val=[^\r\n,]*)(,val=[^\r\n,]*)(,val=[^\r\n,]*)(,val=[^\r\n,]*)$
and replace that with four lines,
$1$2\n$1$3\n$1$4\n$1$5
---- ---- ---- ----
Demo for Step 1
For any additional step, we can simply remove one val and one line from the end of initial expression and replacement. For example, our expression would look like
^(id=[^\r\n,]*)(,val=[^\r\n,]*)(,val=[^\r\n,]*)(,val=[^\r\n,]*)$
in the second step, for which we'd replace it with:
$1$2\n$1$3\n$1$4
---- ---- ----
Demo for Step 2
In the third and final step, our expression has two vals,
^(id=[^\r\n,]*)(,val=[^\r\n,]*)(,val=[^\r\n,]*)$
and our replacement will have two lines:
$1$2\n$1$3
---- ----
Demo for Step 3
For the case shown in the question, only two steps are required, and the second and third expressions would likely work just fine.
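If you do end up doing this from the Linux shell instead, a single GNU sed loop can emulate all of the steps above at once. A sketch, assuming GNU sed (the M flag and \n inside bracket expressions are GNU extensions):
sed -E ':a
s/^(id=[^,\n]+)(,val=[^,\n]+)(,val=[^\n]+)$/\1\2\n\1\3/M
ta' file
Each pass splits the first val off a multi-val line onto its own line with the id copied in front, and the ta branch repeats until no line has more than one val left.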

Edit CSV rows in two different ways

I have a bash script that outputs two CSV columns. I need to prepend "f. " to the three-digit numbers in those rows of the second column that contain them, and keep the rest of the rows intact. I have tried different ways so far, but each has failed in one way or another.
What I've tried mainly has been to use regular expressions with either the first or second column to separate the desired rows from the rest, but I can't separate and prepend at the same time without cancelling out or messing up the process somehow. Some of the commands I've used so far have been sed and cut, as well as (nested) for loops, read-while loops, if/else and if/else/elif statements, etc. What follows is one such (failed) solution:
for var1 in "^.*_[^f]_.*"
do
sed -i "" "s:$MSname::" $pathToCSV"_final.csv"
for var2 in "^.*_f_.*"
do
sed -i "" "s:$MSname:f.:" $pathToCSV"_final.csv"
done
done
And these are some sample rows:
abc_deg0014_0001_a_1.tif,British Library 1 Front Board Outside
abc_deg0014_0002_b_000.tif,British Library 1 Front Board Inside
abc_deg0014_0003_f_001r.tif,British Library 1 001r
abc_deg0014_0004_f_001v.tif,British Library 1 001v
…
abc_deg0014_0267_f_132r.tif,British Library 1 132r
abc_deg0014_0268_f_132v.tif,British Library 1 132v
abc_deg0014_0269_y_999.tif,British Library 1 Back Board Inside
abc_deg0014_0270_z_1.tif,British Library 1 Back Board Outside
Here $MSname = British Library 1 (with different CSVs the "British Library 1" part can change to other words that I need to remove/replace, which is why I use parameter expansion).
The desired result:
abc_deg0014_0002_b_000.tif,Front Board Inside
abc_deg0014_0003_f_001r.tif,f. 001r
…
abc_deg0014_0268_f_132v.tif,f. 132v
abc_deg0014_0269_y_999.tif,Back Board Inside
If you look closely, you'll notice these rows are also differentiated from the rest by "f" in their first column (the rows that shouldn't get the "f. " in front of their second column are differentiated by "a", "b", "y", and "z", respectively, in the first column).
You are not using var1 or var2 for anything, and even if you did, looping over variables and repeatedly running sed -i on the same output file is extremely wasteful. Ideally, you would like to write all the modifications into a single sed script, and process the file only once.
Without being able to guess what other strings than "British Library 1" you have and whether those require different kinds of actions, I would suggest something along the lines of
sed -i '/^[^,]*_f_[^,_]*,/s/,British Library 1 /,f. /
s/,British Library 1 /,/' "${pathToCSV}_final.csv"
Notice how the sed script in single quotes can be wrapped over multiple physical lines. The first line finds any lines where the character between the last pair of underscores in the first comma-separated column is f, and replaces ",British Library 1 " with ",f. ". (I made some adjustments to the spacing here; I hope they make sense for you.) On the following line, we simply replace any (remaining) occurrences of ",British Library 1 " with just a comma; the idea is that only the lines which didn't match the regex on the previous line will still contain this string, and so we don't have to do another regex match.
This can easily be extended to cover more patterns in the same sed script, rather than repeatedly looping over the file and rewriting one pattern at a time. For example, if your next task is to replace Windsor Palace A with either a. or nothing depending on whether the penultimate underscore-separated subfield in the first field contains a, that should be obvious enough:
sed -i '/^[^,]*_f_[^,_]*,/s/,British Library 1 /,f. /
s/,British Library 1 /,/
/^[^,]*_a_[^,_]*,/s/,Windsor Palace A /,a. /
s/,Windsor Palace A /,/' "${pathToCSV}_final.csv"
In some more detail, the regex says
^ beginning of line
[^,]* any sequence of characters which are not a comma
_f_ literal characters underscore, f, underscore
[^,_]* any sequence of characters which are not a comma or an underscore
, literal comma
You should be able to see that this targets the last pair of underscores in the first column. It's important never to skip across the first comma, and, near the end, not to allow any underscores after the ones we specifically target before we finally allow the comma column delimiter.
Finally, also notice how we always use double quotes around variables which contain file names. There are scenarios where you can avoid this but you have to know what you are doing; the easy and straightforward rule of thumb is to always put double quotes around variables. For the full scoop, see When to wrap quotes around a shell variable?
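To preview the effect of the first script before committing to -i, you can run it on a copy of the sample rows; here sample.csv is a hypothetical file holding the rows from the question (output abridged):
$ sed '/^[^,]*_f_[^,_]*,/s/,British Library 1 /,f. /
s/,British Library 1 /,/' sample.csv
abc_deg0014_0001_a_1.tif,Front Board Outside
abc_deg0014_0002_b_000.tif,Front Board Inside
abc_deg0014_0003_f_001r.tif,f. 001r
abc_deg0014_0004_f_001v.tif,f. 001v
...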
With awk, you can look at the fifth field to see whether it matches "3 digits + 1 letter"; print it with f. in that case, and just drop fields 2, 3, and 4 in the other case. For example:
awk -F'[, ]' '{
  if($5 ~ /.?[[:digit:]]{3}[a-z]$/) {
    printf("%s,f. %s\n",$1,$5)
  } else {
    printf("%s,%s %s %s\n",$1,$5,$6,$7)
  }
}' test.txt
On the example you provide, it gives:
abc_deg0014_0001_a_1.tif,Front Board Outside
abc_deg0014_0002_b_000.tif,Front Board Inside
abc_deg0014_0003_f_001r.tif,f. 001r
abc_deg0014_0004_f_001v.tif,f. 001v
abc_deg0014_0267_f_132r.tif,f. 132r
abc_deg0014_0268_f_132v.tif,f. 132v
abc_deg0014_0269_y_999.tif,Back Board Inside
abc_deg0014_0270_z_1.tif,Back Board Outside

Including regex on variable before matching string

I'm trying to find and extract the occurrences of words, read from one text file, in another text file. So far I can only find a word when it is written correctly, not when it is munged (a changed to # or i changed to 1). Is it possible to add a regex to my strings for matching, or something similar? This is my code so far:
sub getOccurrenceOfStringInFileCaseInsensitive
{
    my $fileName = $_[0];
    my $stringToCount = $_[1];
    my $numberOfOccurrences = 0;
    my @wordArray = wordsInFileToArray ($fileName);
    foreach (@wordArray)
    {
        my $numberOfNewOccurrences = () = (m/$stringToCount/gi);
        $numberOfOccurrences += $numberOfNewOccurrences;
    }
    return $numberOfOccurrences;
}
The routine receives the name of a file and the string to search. The routine wordsInFileToArray () just gets every word from the file and returns an array with them.
Ideally I would like to perform this search directly reading from the file in one go instead of moving everything to an array and iterating through it. But the main question is how to hard code something into the function that allows me to capture munged words.
Example: I would like to extract both lines from the file.
example.txt:
russ1#anh#ck3r
russianhacker
# this variable also will be read from a blacklist file
$searchString = "russianhacker";
getOccurrenceOfStringInFileCaseInsensitive ("example.txt", $searchString);
Thanks in advance for any responses.
Edit:
The possible substitutions will be defined by a user and the regex must be set to fit. A user could say that a common substitution is to change the letter "a" to "#" or even "1". The possible changes are completely arbitrary.
When searching for a specific word ("russian" for example) this could be done with something like:
(m/russian/i); # would just match the word as it is
(m/russi[a#1]n/i); # would match the munged word
But I'm not sure how to do that if I have the string to match stored in a variable, such as:
$stringToSearch = "russian";
This is sort of a full-text search problem, so one method is to normalize the document strings before matching against them.
use strict;
use warnings;
use Data::Munge 'list2re';
...
my %norms = (
    '#' => 'a',
    '1' => 'i',
    ...
);
my $re = list2re keys %norms;
s/($re)/$norms{$1}/ge for @wordArray;
This approach only works if there's only a single possible "normalized form" for any given word, and may be less efficient anyway than just trying every possible variation of the search string if your document is large enough and you recompute this every time you search it.
As a note, your regex m/$randomString/gi should be m/\Q$randomString/gi, as you don't want any regex metacharacters in $randomString to be interpreted that way. See the docs for quotemeta.
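For a self-contained flavor of the normalize-then-match idea without the Data::Munge dependency, here is a sketch; join/quotemeta build the same alternation that list2re would, and the word and the %norms entries are invented for illustration:
$ perl -E '
my %norms = ("#" => "a", "3" => "e", "1" => "i");  # hypothetical substitution map
my $re = join "|", map quotemeta, keys %norms;     # what list2re builds for us
(my $norm = "h#ck3r") =~ s/($re)/$norms{$1}/g;     # normalize a document word
say $norm;'
hacker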
There are parts of the problem which aren't specified precisely enough (yet).
Some of the roll-your-own approaches, which depend on the details, are:
If user-defined substitutions are global (replace every occurrence of a character in every string), the user can submit a mapping, as a hash say, and you can fix them all. The process will identify all candidates for the words (along with the actual, unmangled words, if found). There may be false positives, so also plan on some post-processing.
If the user can supply a list of substitutions along with the words they apply to (either the mangled or the corresponding unmangled ones), then we can have a more targeted run.
Before this is clarified, here is another way: use a module for approximate ("fuzzy") matching.
The String::Approx module seems to fit quite a few of your requirements.
The match of the target with a given string relies on the notion of the Levenshtein edit distance: how many insertions, deletions, and replacements ("edits") it takes to make the given string into the sought target. The maximum accepted number of edits can be set.
A simple-minded example:
use warnings;
use strict;
use feature 'say';
use String::Approx qw(amatch);
my $target = qq(russianhacker);
my @text = qw(that h#cker was a russ1#anh#ck3r);
my @matches = amatch($target, ["25%"], @text);
say for @matches; # ==> russ1#anh#ck3r
See the documentation for what the module offers, but at least two comments are in order.
First, note that the second argument in amatch specifies the acceptable percentage deviation from the target string. For this particular example we need to allow every fourth character to be "edited." So much room for tweaking can result in accidental matches which then need to be filtered out, so there will be some post-processing to do.
Second -- we didn't catch the easier one, h#cker. The module takes a fixed "pattern" (target), not a regex, and can search for only one at a time. So, in principle, you need a pass for each target string. This can be improved a lot, but there'll be more work to do.
Please study the documentation; the module offers a whole lot more than this simple example.
I ended up solving the problem by including the regex directly in the variable that I'll use to match against the lines of my file. It looks something like this:
sub getOccurrenceOfMungedStringInFile
{
    my $fileName = $_[0];
    my $mungedWordToCount = $_[1];
    my $numberOfOccurrences = 0;
    open (my $inputFile, "<", $fileName) or die "Can't open file: $!";
    $mungedWordToCount =~ s/a/\[a\#4\]/gi;
    while (my $currentLine = <$inputFile>)
    {
        chomp ($currentLine);
        $numberOfOccurrences += () = ($currentLine =~ m/$mungedWordToCount/gi);
    }
    close ($inputFile) or die "Can't close file: $!";
    return $numberOfOccurrences;
}
Where the line:
$mungedWordToCount =~ s/a/\[a\#4\]/gi;
is just one of the substitutions that are needed; others can be added similarly.
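If more substitutions accumulate, it may be safer to drive them from a single map and make one pass, so a replacement never gets re-munged by a later s///. A sketch, with a hypothetical user-defined map:
$ perl -E '
my %subs = ("a" => "[a#4]", "e" => "[e3]");  # hypothetical user-defined substitutions
my $re = join "|", map quotemeta, keys %subs;
(my $pat = "hacker") =~ s/($re)/$subs{$1}/g; # one pass over the whole word
say $pat;'
h[a#4]ck[e3]r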
I didn't know that Perl would just interpret the regex inside the variable, since I had tried that before and could only get the wanted results by defining the variables inside the function using single quotes. I must've done something wrong the first time.
Thanks for the suggestions, people.

Regex pattern matching over two lines - search and replace

I have a text document that I require help with. Below is an extract of a tab-delimited text doc, in which the first field of the first line of the 3-line pattern will always be a number. The doc will always be in this format, with the same tabbed layout on each of the three lines.
nnnn **variable** V -------
* FROM CLIP NAME - **variable**
* LOC: variable variable **variable**
I want to replace the second field on the first line with the fourth field on the third line, and then replace the field after the colon on the second line with the original second field from the first line. Is this possible with regex? I am used to single-line search-and-replace, but not multiline patterns.
000003 A009C001_151210_R6XO V C 11:21:12:17 11:21:57:14 01:00:18:22 01:01:03:19
*FROM CLIP NAME: 5-1A
*LOC: 01:00:42:15 WHITE 005_NST_010_E02
000004 B008C001_151210_R55E V C 11:21:18:09 11:21:53:07 01:01:03:19 01:01:38:17
*FROM CLIP NAME: 5-1B
*LOC: 01:01:20:14 WHITE 005_NST_010_E03
The Result would look like :
000003 005_NST_010_E02 V C 11:21:12:17 11:21:57:14 01:00:18:22 01:01:03:19
*FROM CLIP NAME: A009C001_151210_R6XO
*LOC: 01:00:42:15 WHITE 005_NST_010_E02
000004 005_NST_010_E03 V C 11:21:18:09 11:21:53:07 01:01:03:19 01:01:38:17
*FROM CLIP NAME: B008C001_151210_R55E
*LOC: 01:01:20:14 WHITE 005_NST_010_E03
Many Thanks in advance.
A regular expression defines a regular language. Alone, this only expresses a structure of some input. Performing operations on this input requires some kind of processing tool. You didn't specify which tool you were using, so I get to pick.
Multiline sed
You wrote that you are "used to single line search replace function but not multiline patterns." Perhaps you are referring to substitution with sed. See How can I use sed to replace a multi-line string?. It is more complicated than with a single line, but it is possible.
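For the record, here is a sketch of that multi-line sed approach, assuming GNU sed, decimal record numbers, and the same tab positions the AWK script below assumes (N appends the next two lines to the pattern space so a single substitution can swap fields across all three):
sed -E '/^[0-9]+\t/ { N; N
s/^([0-9]+\t)([^\t]+)(\t.*\n\*FROM CLIP NAME:\t)[^\n]+(\n\*LOC:\t[^\t]+\t[^\t]+\t)([^\t\n]+)$/\1\5\3\2\4\5/
}' infile.txt
Group 2 captures the original second field of the first line and group 5 the fourth field of the third line; the replacement writes \5 into the first line and \2 after the colon on the second.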
An AWK script
AWK is known for its powerful one-liners, but you can also write scripts. Here is a script that identifies the beginning of a new record/pattern using a regular expression to match the first number. (I hesitate to call it a "record" because this has a specific meaning in AWK.) It stores the fields of the first two lines until it encounters the third line. At the third line, it has all the information needed to make the desired replacements. It then prints the modified first two lines and continues. The third line is printed unchanged (you specified no replacements for the third line). If there are additional lines before the start of the next record/pattern, they will also be printed unchanged.
It's unclear exactly where the tab characters are in your sample input because the submission system has replaced them with spaces. I am assuming there is a tab between FROM CLIP NAME: and the following field and that the "variables" on the first and third line are also tab-separated. If the first number of each record/pattern is hexadecimal instead of decimal, replace the [[:digit:]] with [[:xdigit:]].
fixit.awk
#!/usr/bin/awk -f
BEGIN { FS="\t"; n=0 }
{n++}                      # line counter within the current record
/^[[:digit:]]+\t/ { n=1 }  # a line starting with a number begins a new record
# Split and save first two lines
n==1 { line1_NF = split($0, line1, FS); next }
n==2 { line2_NF = split($0, line2, FS); next }
n==3 {
    # At the third line, make replacements
    line1_2 = line1[2]
    line1[2] = $4
    line2[2] = line1_2
    # Print modified first two lines
    printf "%s", line1[1]
    for ( i=2; i<=line1_NF; ++i )
        printf "\t%s", line1[i]
    print ""
    printf "%s", line2[1]
    for ( i=2; i<=line2_NF; ++i )
        printf "\t%s", line2[i]
    print ""
}
1 # Print lines after the second unchanged
You can use it like
$ awk -f fixit.awk infile.txt
or to pipe it in
$ cat infile.txt | awk -f fixit.awk
This is not the most regular expression inspired solution, but it should make the replacements that you want. For a more complex structure of input, an ideal solution would be to write a scanner and parser that correctly interprets the full input language. Using tools like string substitution might work for simple specific cases, but there could be nuances and assumptions you've made that don't apply in general. A parser can also be more powerful and implement grammars that can express languages which can't be recognized with regular expressions.

How can I find the count of semicolon separated values?

I have a list of all email ids which I have copied from the 'To' field of an email I received in MS Outlook. These values (email ids) are separated by a semicolon. I have copied this big list of email ids into Excel. Now I want to find the number of email ids in this list, basically by counting the number of semicolons.
One way I can do this is by writing C code: store the big list in a string buffer, and keep comparing each char to ';' in a loop, counting the matches.
But I want to do it quickly.
Is there any quick way to find that out using either:
1.) Regular expression (I use powergrep for processing the regexps)
2.) In excel itself (any excel macro/plugin for that?)
3.) DOS script method
4.) Any other quick way of getting it done?
I believe the following should work in Excel:
= Len(A1) - Len(Substitute(A1, ";", "")) + 1
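For example, with a;b;c in A1: Len(A1) is 5, Len of the semicolon-stripped copy is 3, so the formula gives 5 - 3 + 1 = 3 values.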
/EDIT: if you've pasted the email addresses over several cells, you can count the cells with the following function:
= CountA(A1:BY1)
CountA counts non-empty cells in a given range. You can specify the range by typing =CountA( into a cell and then selecting your cell range with the mouse cursor.
Bash/Cygwin One-Liner
$ echo "user@domain.tld;user@domain.tld;user@domain.tld" | sed -e 's/;/\n/g' | wc -l
3
If you already have Cygwin installed, it's effectively instant. If not, Cygwin is worth installing IMHO. It basically provides a Linux bash prompt overlaid on your Windows system.
As an aside, stuff like this is why I prefer *nix over Windows for work; I can't live on a Windows box without Cygwin, since bash scripts are so much more powerful than batch scripts.
If counting the number of semicolons is good enough for you, you can do it in Perl using this solution: Perl FAQ 4.24: How can I count the number of occurrences of a substring within a string
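For instance, the counting idiom from that FAQ as a one-liner, where emails.txt is a hypothetical file holding the pasted To-field on a single line (tr/;// returns the number of semicolons, and the +1 turns a separator count into an address count):
$ perl -nE 'say 1 + tr/;//' emails.txt
3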
PowerShell:
> $a = 'blah;blah;blah'
> $a.Split(';').Count
3
3) If you have neither Cygwin nor PowerShell installed, try this .cmd:
@echo off
set /a i = 0
for %%i in (name1@mail.com;name2@mail.com;name3@mail.com) do set /a i = i + 1
@echo %i%
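Saved as, say, count.cmd (the name is arbitrary) and run from a command prompt, it prints 3 for the example list, since cmd's for splits the parenthesized list on semicolons.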
If you are using Excel you can use this code and expose it.
Public Function CountSubString(ByVal RHS As String, ByVal Delimiter As String) As Integer
    Dim V As Variant
    V = Split(RHS, Delimiter)
    CountSubString = UBound(V) + 1
End Function
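Once the function is in a VBA module, a worksheet formula such as =CountSubString(A1, ";") (with A1 being wherever you pasted the list) returns the address count directly; despite its name, the function counts the items produced by Split rather than the separators.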
If you have .NET you can make a little command line utility
Module CountSubString
    Public Sub Main(ByVal Args() As String)
        If Args.Length <> 2 Then
            Console.WriteLine("wrong arguments passed->")
        Else
            Dim Items() As String = Split(Args(0), Args(1))
            Console.WriteLine("There are " & CStr(UBound(Items) + 1) & " items.")
        End If
    End Sub
End Module
Load the list in your favorite (not Notepad!) editor, replace ; by \n, see in the status bar how many lines you have, remove the last line if needed.
C# 3.0 with LINQ would make this easy, if it is an option for you over C:
myString.ToCharArray().Count(c => c == ';')
If awk and echo are available (and they are, even on Windows):
echo "addr1;addr2;addr3...." | awk -F ";" "{print NF}"
Looping over it with a while loop and counting the ';' is probably going to be the fastest and most readable approach.
Consider Konrad's suggestion: that too will loop through the string and check every char to see if it is a semicolon, but it then builds a modified string (strings may or may not be mutable, I don't know about Excel) and counts via the length difference between it and the original string.