How can I find the count of semicolon separated values? - regex

I have a list of email IDs that I copied from the 'To' field of an email I received in MS Outlook. The email IDs are separated by semicolons. I have pasted this big list into Excel. Now I want to find the number of email IDs in the list, basically by counting the semicolons.
One way I can do this is by writing C code. i.e. store the big list as string buffer, and keep comparing the chars to ";" in a while(char == ';') loop.
But I want to do it quickly.
Is there any quick way to find that out using either:
1.) Regular expression (I use powergrep for processing the regexps)
2.) In excel itself (any excel macro/plugin for that?)
3.) DOS script method
4.) Any other quick way of getting it done?

I believe the following should work in Excel:
= Len(A1) - Len(Substitute(A1, ";", "")) + 1
/EDIT: if you've pasted the email addresses over several cells, you can count the cells with the following function:
= CountA(A1:BY1)
CountA counts non-empty cells in a given range. You can specify the range by typing =CountA( into a cell and then selecting your cell range with the mouse cursor.
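For readers who want to check the arithmetic outside Excel, here is a minimal Python sketch of the same length-difference idea. The function name count_values is made up for illustration, and the sketch assumes the list has no trailing semicolon.

```python
def count_values(cell: str, sep: str = ";") -> int:
    """Count separated values with the same trick as the Excel formula."""
    # Removing every separator shortens the string by exactly
    # the number of separators it contained.
    separators = len(cell) - len(cell.replace(sep, ""))
    # N separators delimit N + 1 values (assumes no trailing separator).
    return separators + 1


print(count_values("a@example.com;b@example.com;c@example.com"))
```

If the pasted list ends with a semicolon, the result is one too high, just as with the formula.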

Bash/Cygwin One-Liner
$ echo "user@domain.tld;user@domain.tld;user@domain.tld" | sed -e 's/;/\n/g' | wc -l
3
If you already have Cygwin installed it's effectively instant. If not, cygwin is worth installing IMHO. It basically provides a Linux bash prompt overlaid over your Windows system.
As an aside, stuff like this is why I prefer *nix over Windows for work stuff, I can't live on a windows box without Cygwin since bash scripts are so much more powerful than batch scripts.

If counting the number of semicolons is good enough for you, you can do it in Perl using this solution: Perl FAQ 4.24: How can I count the number of occurrences of a substring within a string

PowerShell:
> $a = 'blah;blah;blah'
> $a.Split(';').Count
3

3) If you have neither Cygwin nor PowerShell installed, try this .cmd script:
@echo off
set /a i = 0
for %%i in (name1@mail.com;name2@mail.com;name3@mail.com) do set /a i = i + 1
@echo %i%

If you are using Excel you can use this code and expose it.
Public Function CountSubString(ByVal RHS As String, ByVal Delimiter As String) As Integer
Dim V As Variant
V = Split(RHS, Delimiter)
CountSubString = UBound(V) + 1
End Function
If you have .NET you can make a little command line utility
Module CountSubString
Public Sub Main(ByVal Args() As String)
If Args.Length <> 2 Then
Console.WriteLine("wrong arguments passed->")
Else
Dim I As Integer = 0
Dim Items() = Split(Args(0), Args(1))
Console.WriteLine("There are " & CStr(UBound(Items) + 1) & " items")
End If
End Sub
End Module

Load the list in your favorite (not Notepad!) editor, replace ; by \n, see in the status bar how many lines you have, remove the last line if needed.

C# 3.0 with LINQ would make this easy if it is an option for you over C
myString.Count(c => c == ';')

If awk and echo are available (and they are, even on Windows):
echo "addr1;addr2;addr3...." | awk -F ";" "{print NF}"

Looping over the string with a while loop and counting the ';' characters is probably going to be the fastest and the most readable.
Consider Konrad's suggestion: that too will loop through the string and check every character to see whether it is a semicolon, then it builds a modified string (in place or as a copy, depending on whether Excel strings are mutable; I don't know), and then it takes the difference in length between that and the original string.


Highlight line for specific commit in git log graph

I am trying to highlight the whole line for a specific commit in my git log graph. I have since before created a git log alias to format the output of my logs. I have attempted to highlight a specific line containing the commit-id, using my alias.
Alias in ~/.gitconfig
# Base command for log formatting
lg-base = "log --graph --decorate=short --decorate-refs-exclude='refs/tags/*' --color=always"
# Version 1 log format
lg1 = !"git lg-base --format=format:'%C(#f0890c)%h%C(reset) - %C(bold green)(%ar)%C(reset) %C(white)%s%C(reset) %C(dim white)- %an%C(reset)%C(#d10000)%d%C(reset)'"
Doing a test with searching for 6 months just because it should behave the same and might showcase my issue a bit better.
git lg1 | grep --color=always -E '(6 months).*|$'
Matches the correct lines. But it doesn't highlight the whole line to the right and when trying to highlight the left part of the line as well, it doesn't work as expected. Probably because of my lack in skills of using regex.
git lg1 | grep --color=always -E '.*(6 months).*|$'
Instead it marks the * in the beginning.
If you have a total other approach, that is fine with me as long as I can use my formatted git log alias.
Thomas' comment is the key to the issue here: although grep is adding its own color (or colour) changing escape sequences to highlight the line, Git has already put in color changing directives. Each such directive, for one of the named colors, looks like this:
ESC[numberm
where the number part is 30 through 37 for a foreground color and 40 through 47 for a background color (plus some extra codes for bold or dim, which I won't include here). (%C(reset) sends ESC [ m and your orange selector uses a 24-bit color directive, which is less widely supported than the eight base colors, which go back to the 1990s). Hence the original output reads:
* <sp> <orange> <hash-ID> <reset> <sp> - <sp> <bold green> (n months ago) <reset> ...
The grep adds red, which is ESC [ 31 m, and a reset, around the matched expression—but the existing escapes within the expression remain.
The easiest way by far to avoid all this is to stop using color escape sequences at all, so that grep's added ones stick out like a sore red thumb. Of course that defeats your goal, which is to keep the color-changing escapes in lines that aren't highlighted. But you haven't explained what you'd like done with the color-changing escapes in lines, or parts of lines, that are highlighted. Answering that will determine what to do next.
There are any number of ways you could handle this. For instance, instead of %C(color)%<directive>%C(reset) you could use %x1b(name-of-color)%<directive>%x1b(reset) to insert the literal sequences ESC ( name of color or reset ), or assume that the terminal in question will use ANSI style escapes that end with the lowercase m character, and try to write something up in sed or awk (I'd use awk for something this complex, just because it's less like writing line noise) that does the match—awk supports regex matching—and if found, strips out the color sequences from the matched part and adds its own. Post-process this with something that inserts the appropriate terminal-dependent color-change sequences, or keep the original ESC [ ... m sequences on the assumption that you're in a window that uses that form, and you'll have the output you want (which you can now pipe through less -R if desired).
A skeleton awk program that does what you want is:
/<desired regex>/ { handle matched line; next; }
{ print }
The hard part is the "handle matched line". awk's match() function sets RSTART and RLENGTH, which help out a lot; see, e.g., this answer. The substring of the line from the beginning to RSTART-1 wasn't matched (this may be empty), and the substring from RSTART+RLENGTH to the end of the line (which may also be empty) also was not matched; the substring of $0 at RSTART for length RLENGTH was matched and here's where you would strip out any color-changing sequences, if you want your basic red (or whatever) applied throughout.
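The before/matched/after split described above can be sketched in Python, where the match object's start() and end() play the role of RSTART and RLENGTH. This is only an illustration under the assumption that the terminal uses ANSI "ESC [ ... m" sequences; the function name highlight and the choice of plain red are assumptions, not part of the original answer.

```python
import re

# ANSI colour-changing sequences of the form ESC [ ... m
ANSI_SGR = re.compile(r"\x1b\[[0-9;]*m")
RED = "\x1b[31m"   # the same red that grep adds
RESET = "\x1b[m"


def highlight(line: str, pattern: str) -> str:
    """Colour the matched part red, dropping colour codes inside it."""
    m = re.search(pattern, line)
    if m is None:
        return line                     # unmatched lines pass through untouched
    before = line[:m.start()]           # awk: substr($0, 1, RSTART - 1)
    matched = line[m.start():m.end()]   # awk: substr($0, RSTART, RLENGTH)
    after = line[m.end():]              # awk: substr($0, RSTART + RLENGTH)
    return before + RED + ANSI_SGR.sub("", matched) + RESET + after


sample = "* \x1b[33mabc1234\x1b[m - \x1b[32m(6 months ago)\x1b[m message"
print(highlight(sample, r"\(6 months.*"))
```

Note that the pattern is matched against the raw line, so colour codes sitting in the middle of the text you are searching for can still prevent a match.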
Sample script (by Robin Hellmers)
Creating a script and placing it where you please, e.g.
~/.local/bin/highlight-commit.awk
with the contents
#!/usr/bin/nawk -f
BEGIN {
n = split(commits,arrayCommits," ");
background="145;0;0"
foreground="255;255;255"
}
{
# Compare with every given input e.g. commit id
for (i=1; i <= n; i++) {
if(match($0,arrayCommits[i])) {
# Remove any ANSI color escape sequence for matching row
gsub("\x1b\\[[0-9;]*m","",$0)
# Create ANSI color escape sequence for whole row
$0 = sprintf("\x1b[48;2;%sm\x1b[38;2;%sm%s\x1b[0m\x1b[0m",
background,
foreground,
$0);
break;
}
}
printf("%s\n", $0);
}
In ~/.gitconfig, add the following alias:
[alias]
highlight-commit = "!f() { git lg1 | awk -v commits=\"$*\" -f ~/.local/bin/highlight-commit.awk | less -XR; }; f"
By calling with e.g. two commits:
git highlight-commit 82451f8 310fca4

Regex to match and split every third occurrence of a string

In a Korn Shell script I have a large amount of data in a string variable, contents, that matches the following syntax:
account_id_0:group_id_0:name_0
account_id_1:group_id_1:name_1
...
account_id_N:group_id_N:name_N
I want to split the string on the : character every third instance so I can generate three other strings, accounts, groups, and names, that have the format:
accounts = account_id_0,account_id_1,...,account_id_N
groups = group_id_0,group_id_1,...,group_id_N
names = name_0,name_1,...,name_N
The reason I would like to store these in a string rather than an array is for portability across environments.
Am I able to achieve this using something like the sed, cut, or awk command?
the current regex I'm using to capture the accounts is:
[a-zA-Z][0-9]+(?:([a-zA-z]*[0-9]*)*)(?:([a-zA-Z]*[0-9]*)*)
But I feel there is a more efficient alternative.
I have attempted to achieve the desired output using a combination of this solution and this solution however the first one lacks the repetition I require, and the latter is for file manipulation not strings.
I would use arrays, and process the contents variable like reading lines from a file:
contents='account_id_0:group_id_0:name_0
account_id_1:group_id_1:name_1
...:...:...
account_id_N:group_id_N:name_N'
as=()
gs=()
ns=()
while IFS=: read -r a g n; do
as+=("$a")
gs+=("$g")
ns+=("$n")
done <<< "$contents"
accounts=$(IFS=,; echo "${as[*]}")
groups=$(IFS=,; echo "${gs[*]}")
names=$(IFS=,; echo "${ns[*]}")
printf "%s\n" "$accounts" "$groups" "$names"
account_id_0,account_id_1,...,account_id_N
group_id_0,group_id_1,...,group_id_N
name_0,name_1,...,name_N
If you're getting the contents value from a file, you can skip the step of storing it in a variable and just read the file directly.
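For comparison, the same column-wise regrouping can be sketched in Python. This is a minimal illustration, not a replacement for the shell loop; the function name split_columns is hypothetical, and it assumes every non-empty line has exactly three colon-separated fields.

```python
def split_columns(text: str, sep: str = ":") -> tuple:
    """Turn 'a:g:n' lines into three comma-joined strings."""
    # maxsplit=2 mirrors 'read -r a g n': the last field keeps any extra separators.
    rows = [line.split(sep, 2) for line in text.splitlines() if line]
    # zip(*rows) transposes the rows into three columns.
    accounts, groups, names = zip(*rows)
    return ",".join(accounts), ",".join(groups), ",".join(names)


contents = """account_id_0:group_id_0:name_0
account_id_1:group_id_1:name_1
account_id_2:group_id_2:name_2"""

for joined in split_columns(contents):
    print(joined)
```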

Edit CSV rows in two different ways

I have a bash script that outputs two CSV columns. I need to prepend the three-digit number of those rows of the second column that contain them with "f. " and keep the rest of the rows intact. I have tried different ways so far but each has failed in one way or another.
What I've tried mainly has been to use regular expressions with either the first or second column to separate the desired rows from the rest, but I can't separate and prepend at the same time without cancelling out or messing up the process somehow. Some of the commands I've used so far have been sed and cut, as well as (nested) for loops, while-read loops, if/else and if/elif/else statements, etc. What follows is one such (failed) solution:
for var1 in "^.*_[^f]_.*"
do
sed -i "" "s:$MSname::" $pathToCSV"_final.csv"
for var2 in "^.*_f_.*"
do
sed -i "" "s:$MSname:f.:" $pathToCSV"_final.csv"
done
done
And these are some sample rows:
abc_deg0014_0001_a_1.tif,British Library 1 Front Board Outside
abc_deg0014_0002_b_000.tif,British Library 1 Front Board Inside
abc_deg0014_0003_f_001r.tif,British Library 1 001r
abc_deg0014_0004_f_001v.tif,British Library 1 001v
…
abc_deg0014_0267_f_132r.tif,British Library 1 132r
abc_deg0014_0268_f_132v.tif,British Library 1 132v
abc_deg0014_0269_y_999.tif,British Library 1 Back Board Inside
abc_deg0014_0270_z_1.tif,British Library 1 Back Board Outside
Here $MSname = British Library 1 (since with different CSVs the "British Library 1" part can change to other words that I need to remove/replace and that's why I use parameter expansion).
The desired result:
abc_deg0014_0002_b_000.tif,Front Board Inside
abc_deg0014_0003_f_001r.tif,f. 001r
…
abc_deg0014_0268_f_132v.tif,f. 132v
abc_deg0014_0269_y_999.tif,Back Board Inside
If you look closely, you'll notice these rows are also differentiated from the rest by "f" in their first column (the rows that shouldn't get the "f. " in front of their second column are differentiated by "a", "b", "y", and "z", respectively, in the first column).
You are not using var1 or var2 for anything, and even if you did, looping over variables and repeatedly running sed -i on the same output file is extremely wasteful. Ideally, you would like to write all the modifications into a single sed script, and process the file only once.
Without being able to guess what other strings than "British Library 1" you have and whether those require different kinds of actions, I would suggest something along the lines of
sed -i '/^[^,]*_f_[^,_]*,/s/,British Library 1 /,f. /
s/,British Library 1 /,/' "${pathToCSV}_final.csv"
Notice how the sed script in single quotes can be wrapped over multiple physical lines. The first line finds any lines where the last characters between underscores in the first comma-separated column is f, and replaces ",British Library 1 " with ",f. ". (I made some adjustments to the spacing here -- I hope they make sense for you.) On the following line, we simply replace any (remaining) occurrences of ",British Library 1 " with just a comma; the idea is that only the lines which didn't match the regex on the previous line will still contain this string, and so we don't have to do another regex match.
This can easily be extended to cover more patterns in the same sed script, rather than repeatedly looping over the file and rewriting one pattern at a time. For example, if your next task is to replace Windsor Palace A with either a. or nothing depending on whether the penultimate underscore-separated subfield in the first field contains a, that should be obvious enough:
sed -i '/^[^,]*_f_[^,_]*,/s/,British Library 1 /,f. /
s/,British Library 1 /,/
/^[^,]*_a_[^,_]*,/s/,Windsor Palace A /,a. /
s/,Windsor Palace A /,/' "${pathToCSV}_final.csv"
In some more detail, the regex says
^ beginning of line
[^,]* any sequence of characters which are not a comma
_f_ literal characters underscore, f, underscore
[^,_]* any sequence of characters which are not a comma or an underscore
, literal comma
You should be able to see that this will target the last pair of underscores in the first column. It's important to never skip across the first comma, and near the end, not allow any underscores after the ones we specifically target before we finally allow the comma column delimiter.
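To see the two-step logic on the sample rows, here is a small Python sketch that applies the same regex and the same pair of substitutions. The function name rewrite and the ms_name parameter are assumptions made for illustration; the sed script above remains the actual suggestion.

```python
import re

# Same pattern as in the sed script: the last _x_ pair in the first column is _f_.
F_ROW = re.compile(r"^[^,]*_f_[^,_]*,")


def rewrite(line: str, ms_name: str = "British Library 1") -> str:
    """Replace ',<ms_name> ' with ',f. ' on f-rows and with ',' elsewhere."""
    prefix = "," + ms_name + " "
    if F_ROW.search(line):
        return line.replace(prefix, ",f. ", 1)
    # Lines that did not match still carry the prefix, so simply drop it.
    return line.replace(prefix, ",", 1)


rows = [
    "abc_deg0014_0002_b_000.tif,British Library 1 Front Board Inside",
    "abc_deg0014_0003_f_001r.tif,British Library 1 001r",
]
for row in rows:
    print(rewrite(row))
```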
Finally, also notice how we always use double quotes around variables which contain file names. There are scenarios where you can avoid this but you have to know what you are doing; the easy and straightforward rule of thumb is to always put double quotes around variables. For the full scoop, see When to wrap quotes around a shell variable?
With awk, you can look at the fifth field to see whether it matches "3 digits + 1 letter", then print with f. in this case and just remove fields 2, 3 and 4 in the other case. For example:
awk -F'[, ]' '{
if($5 ~ /.?[[:digit:]]{3}[a-z]$/) {
printf("%s,f. %s\n",$1,$5)}
else {
printf("%s,%s %s %s\n",$1,$5,$6,$7)
}
}' test.txt
On the example you provide, it gives:
abc_deg0014_0001_a_1.tif,Front Board Outside
abc_deg0014_0002_b_000.tif,Front Board Inside
abc_deg0014_0003_f_001r.tif,f. 001r
abc_deg0014_0004_f_001v.tif,f. 001v
abc_deg0014_0267_f_132r.tif,f. 132r
abc_deg0014_0268_f_132v.tif,f. 132v
abc_deg0014_0269_y_999.tif,Back Board Inside
abc_deg0014_0270_z_1.tif,Back Board Outside

BASH: Split strings without any delimiter and keep only first sub-string

I have a CSV file containing 7 columns and I am interested in modifying only the first column. In fact, in some of the rows a row name appears n times in a concatenated way without any space. I need a script that can identify where the duplication starts and remove all duplications.
Example of a row name among others:
Row name = EXAMPLE1.ABC_DEF.panel4EXAMPLE1.ABC_DEF.panel4EXAMPLE1.ABC_DEF.panel4
Replace by: EXAMPLE1.ABC_DEF.panel4
In the different rows:
n can vary
The length of the row name can vary
The structure of the row name can vary (eg. amount of _ and .), but it is always collated without any space
What I have tried:
:%s/(.+)\1+/\1/
Step-by-step:
%s: substitute in the whole file
(.+)\1+: First capturing group. .+ matches any character (except for line terminators), + is the quantifier — matches between one and unlimited times, as many times as possible, giving back as needed.
\1+: matches the same text as most recently matched by the 1st capturing group
Substitute by \1
However, I get the following errors:
E65: Illegal back reference
E476: Invalid command
From what I understand, you need each line to contain EXAMPLE1.ABC_DEF.panel4 only once. In that case you can do the following:
First remove duplicates in one line:
sed -i "s/EXAMPLE1.ABC_DEF.panel4.*/EXAMPLE1.ABC_DEF.panel4/g"
Then remove duplicated lines:
awk '!a[$0]++'
If all your rows are of the format you gave in the question (like EXAMPLExyzEXAMPLExyz) then this should work-
awk -F"EXAMPLE" '{print FS $2}' file
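This takes "EXAMPLE" as the field separator and prints only the second field, which is the text between the first and second occurrences of "EXAMPLE" (the first field, before the leading "EXAMPLE", is empty). It prepends "EXAMPLE" to that field by using the built-in awk variable FS. Thanks, @andlrc.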
Not an ideal solution but may be good enough for this purpose.
This script takes the string to test as its first argument and returns the longest repeated substring (e.g. "totototo" gives "toto", not "to"):
#!/usr/bin/env bash
row_name="$1"
# Try splitting the string into i equal parts, starting with the fewest parts (i.e. the longest candidate substring)
for (( i=2; i<=${#row_name}; i++ ))
do
match="True"
#continue test only if it's mathematically possible
if (( ${#row_name} % i )); then
continue
fi
#length of the potential duplicate substring
len_sub=$(( ${#row_name} / i ))
#test if the first substring is equal to each others
for (( s=1; s<i; s++ ))
do
if ! [ "${row_name:0:${len_sub}}" = "${row_name:$((len_sub * s)):${len_sub}}" ]; then
match="False"
break
fi
done
#each substring are equal, so return string without duplicate
if [ $match = "True" ]; then
row_name="${row_name:0:${len_sub}}"
break
fi
done
echo "$row_name"
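The same idea can be written compactly in Python, which may be easier to test against a few row names before running anything on the CSV. This is a sketch only; the function name dedupe_repeats is made up for illustration, and it assumes the whole string consists of nothing but repetitions of one unit.

```python
def dedupe_repeats(name: str) -> str:
    """Return the longest unit whose repetition makes up the whole string."""
    n = len(name)
    # Fewest parts first, so the first hit is the longest repeated unit.
    for parts in range(2, n + 1):
        if n % parts:
            continue                # the length is not divisible into 'parts' pieces
        unit = name[: n // parts]
        if unit * parts == name:
            return unit
    return name                     # no repetition found, keep the name as-is


print(dedupe_repeats("EXAMPLE1.ABC_DEF.panel4" * 3))
print(dedupe_repeats("totototo"))
```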

Format all IP-Addresses to 3 digits

I'd like to use the search & replace dialogue in UltraEdit (Perl Compatible Regular Expressions) to format a list of IPs into a standard Format.
The list contains:
192.168.1.1
123.231.123.2
23.44.193.21
It should be formatted like this:
192.168.001.001
123.231.123.002
023.044.193.021
The RegEx from http://www.regextester.com/regular+expression+examples.html for IPv4 in the PCRE-Format is not working properly:
^(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5]){3}$
I'm stuck. Does anybody have a proper solution which works in UltraEdit?
Thanks in advance!
Set the regular expression engine to Perl (on the advanced section) and replace this:
(?<!\d)(\d\d?)(?!\d)
with this:
0$1
twice. That should do it.
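To see why two passes are needed, the same search and replace can be tried in Python, whose re module supports the same lookarounds. This is a sketch for checking the pattern, not UltraEdit syntax; the function name pad_octets is hypothetical.

```python
import re

# One or two digits that are not part of a longer run of digits.
SHORT_OCTET = re.compile(r"(?<!\d)(\d\d?)(?!\d)")


def pad_octets(text: str) -> str:
    """Prefix short octets with '0', twice, so 1 -> 01 -> 001."""
    for _ in range(2):
        text = SHORT_OCTET.sub(r"0\g<1>", text)
    return text


for ip in ["192.168.1.1", "123.231.123.2", "23.44.193.21"]:
    print(pad_octets(ip))
```

The first pass turns a single digit into two digits and two digits into three; the second pass only finds the octets that are still two digits long.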
If your input is a single IP address (per line) and nothing else (no other text), this approach will work:
I used "Replace All" with Perl style regular expressions:
Replace (?<!\d)(?=\d\d?(?=[.\s]|$))
with 0
Just replace as often as it matches. If there is other text, things will get more complicated. Maybe the "Search in Column" option is helpful here, in case you are dealing with CSV.
If this is just a one-off data cleaning job, I often just use Excel or OpenOffice Calc for this type of thing:
Open your textfile and make sure only one IP address per line.
Open Excel or whatever and go to "Data|Import External Data" and import your textfile using "." as the separator.
You should now have 4 columns in excel:
192 | 168 | 1 | 1
Right click and format each column as a number with 3 digits and leading zeroes.
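In column 5, concatenate the previous columns with a "." in between. Concatenation uses the underlying values rather than the display format, so wrap each cell in TEXT to keep the leading zeroes:
=TEXT(A1,"000") & "." & TEXT(B1,"000") & "." & TEXT(C1,"000") & "." & TEXT(D1,"000")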
This obviously is a cheap and dirty fix and is not a programmatic way of dealing with this, but I find this sort of technique useful for cleaning up data every now and then.
I'm not sure how you can use Regular Expression in Replace With box in UltraEdit.
You can use this regular expression to find your string:
^(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])$