BASH: Split strings without any delimiter and keep only first sub-string - regex

I have a CSV file containing 7 columns and I am interested in modifying only the first column. In fact, in some of the rows a row name appears n times in a concatenated way without any space. I need a script that can identify where the duplication starts and remove all duplications.
Example of a row name among others:
Row name = EXAMPLE1.ABC_DEF.panel4EXAMPLE1.ABC_DEF.panel4EXAMPLE1.ABC_DEF.panel4
Replace by: EXAMPLE1.ABC_DEF.panel4
In the different rows:
n can vary
The length of the row name can vary
The structure of the row name can vary (e.g. the number of _ and .), but it is always concatenated without any space
What I have tried:
:%s/(.+)\1+/\1/
Step-by-step:
%s: substitute in the whole file
(.+)\1+: First capturing group. .+ matches any character (except for line terminators), + is the quantifier — matches between one and unlimited times, as many times as possible, giving back as needed.
\1+: matches the same text as most recently matched by the 1st capturing group
Substitute by \1
However, I get the following errors:
E65: Illegal back reference
E476: Invalid command
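(For reference: the E65/E476 errors come from Vim's default "magic" mode, in which grouping and the + quantifier must be escaped, so the equivalent Vim command would be :%s/\(.\+\)\1\+/\1/. The same idea can be tried out with GNU sed, which supports back-references in -E mode as a GNU extension; a sketch on the sample row name:)

```shell
# GNU sed equivalent of the intended Vim substitution: collapse a string
# made of one unit repeated n times down to a single unit.
# (Back-references in ERE are a GNU extension, not POSIX.)
printf '%s\n' 'EXAMPLE1.ABC_DEF.panel4EXAMPLE1.ABC_DEF.panel4EXAMPLE1.ABC_DEF.panel4' |
    sed -E 's/(.+)\1+/\1/'
# -> EXAMPLE1.ABC_DEF.panel4
```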

From what I understand, you need only one line containing EXAMPLE1.ABC_DEF.panel4. In that case you can do the following:
First remove duplicates in one line:
sed -i "s/EXAMPLE1\.ABC_DEF\.panel4.*/EXAMPLE1.ABC_DEF.panel4/g" file
Then remove duplicated lines:
awk '!a[$0]++'
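Put together, the two steps can be sketched on inlined sample data (using a pipe instead of -i, so no real file is needed):

```shell
# Sketch: collapse the duplicated row name, then drop duplicated lines.
printf '%s\n' \
    'EXAMPLE1.ABC_DEF.panel4EXAMPLE1.ABC_DEF.panel4EXAMPLE1.ABC_DEF.panel4' \
    'EXAMPLE1.ABC_DEF.panel4' |
    sed 's/EXAMPLE1\.ABC_DEF\.panel4.*/EXAMPLE1.ABC_DEF.panel4/' |
    awk '!a[$0]++'
# -> a single line: EXAMPLE1.ABC_DEF.panel4
```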

If all your rows are of the format you gave in the question (like EXAMPLExyzEXAMPLExyz), then this should work:
awk -F"EXAMPLE" '{print FS $2}' file
This takes "EXAMPLE" as the field delimiter and prints only the first 'column', prepending "EXAMPLE" to it via the built-in awk variable FS. Thanks, @andlrc.
Not an ideal solution but may be good enough for this purpose.

This script, which takes the string to test as its first argument, retrieves the longest duplicated substring (e.g. for "totototo" it yields "toto", not "to"):
#!/usr/bin/env bash
row_name="$1"
# Test for duplicates from the longest candidate down to the smallest:
# into how many equal parts could the string be split?
for (( i=2; i<${#row_name}; i++ ))
do
    match="True"
    # Continue only if the split is mathematically possible
    if (( ${#row_name} % i )); then
        continue
    fi
    # Length of the potential duplicated substring
    len_sub=$(( ${#row_name} / i ))
    # Test whether the first substring is equal to each of the others
    for (( s=1; s<i; s++ ))
    do
        if ! [ "${row_name:0:${len_sub}}" = "${row_name:$((len_sub * s)):${len_sub}}" ]; then
            match="False"
            break
        fi
    done
    # Every substring is equal, so keep the string without the duplicates
    if [ "$match" = "True" ]; then
        row_name="${row_name:0:${len_sub}}"
        break
    fi
done
echo "$row_name"

Related

Edit CSV rows in two different ways

I have a bash script that outputs two CSV columns. I need to prepend "f. " to the three-digit number in the second column of the rows that contain one, and keep the rest of the rows intact. I have tried different ways so far, but each has failed in one way or another.
What I've tried mainly has been to use regular expressions with either the first or second column to separate the desired rows from the rest, but I can't separate and prepend at the same time without cancelling out or messing up the process somehow. Some of the commands I've used so far have been: $ sed $ cut as well as (nested) for loops, read-while loops, if/else and if/else/elif statements, etc. What follows is one such (failed) solution:
for var1 in "^.*_[^f]_.*"
do
    sed -i "" "s:$MSname::" $pathToCSV"_final.csv"
    for var2 in "^.*_f_.*"
    do
        sed -i "" "s:$MSname:f.:" $pathToCSV"_final.csv"
    done
done
And these are some sample rows:
abc_deg0014_0001_a_1.tif,British Library 1 Front Board Outside
abc_deg0014_0002_b_000.tif,British Library 1 Front Board Inside
abc_deg0014_0003_f_001r.tif,British Library 1 001r
abc_deg0014_0004_f_001v.tif,British Library 1 001v
…
abc_deg0014_0267_f_132r.tif,British Library 1 132r
abc_deg0014_0268_f_132v.tif,British Library 1 132v
abc_deg0014_0269_y_999.tif,British Library 1 Back Board Inside
abc_deg0014_0270_z_1.tif,British Library 1 Back Board Outside
Here $MSname = British Library 1 (since with different CSVs the "British Library 1" part can change to other words that I need to remove/replace and that's why I use parameter expansion).
The desired result:
abc_deg0014_0002_b_000.tif,Front Board Inside
abc_deg0014_0003_f_001r.tif,f. 001r
…
abc_deg0014_0268_f_132v.tif,f. 132v
abc_deg0014_0269_y_999.tif,Back Board Inside
If you look closely, you'll notice these rows are also differentiated from the rest by "f" in their first column (the rows that shouldn't get the "f. " in front of their second column are differentiated by "a", "b", "y", and "z", respectively, in the first column).
You are not using var1 or var2 for anything, and even if you did, looping over variables and repeatedly running sed -i on the same output file is extremely wasteful. Ideally, you would like to write all the modifications into a single sed script, and process the file only once.
Without being able to guess what other strings than "British Library 1" you have and whether those require different kinds of actions, I would suggest something along the lines of
sed -i '/^[^,]*_f_[^,_]*,/s/,British Library 1 /,f. /
s/,British Library 1 /,/' "${pathToCSV}_final.csv"
Notice how the sed script in single quotes can be wrapped over multiple physical lines. The first line finds any lines where the character between the last pair of underscores in the first comma-separated column is f, and replaces ",British Library 1 " with ",f. ". (I made some adjustments to the spacing here -- I hope they make sense for you.) On the following line, we simply replace any remaining occurrences of ",British Library 1 " with just a comma; the idea is that only the lines which didn't match the regex on the previous line will still contain this string, so we don't have to do another regex match.
This can easily be extended to cover more patterns in the same sed script, rather than repeatedly looping over the file and rewriting one pattern at a time. For example, if your next task is to replace Windsor Palace A with either a. or nothing depending on whether the penultimate underscore-separated subfield in the first field contains a, that should be obvious enough:
sed -i '/^[^,]*_f_[^,_]*,/s/,British Library 1 /,f. /
s/,British Library 1 /,/
/^[^,]*_a_[^,_]*,/s/,Windsor Palace A /,a. /
s/,Windsor Palace A /,/' "${pathToCSV}_final.csv"
In some more detail, the regex says
^ beginning of line
[^,]* any sequence of characters which are not a comma
_f_ literal characters underscore, f, underscore
[^,_]* any sequence of characters which are not a comma or an underscore
, literal comma
You should be able to see that this will target the last pair of underscores in the first column. It's important to never skip across the first comma, and near the end, not allow any underscores after the ones we specifically target before we finally allow the comma column delimiter.
Finally, also notice how we always use double quotes around variables which contain file names. There are scenarios where you can avoid this but you have to know what you are doing; the easy and straightforward rule of thumb is to always put double quotes around variables. For the full scoop, see When to wrap quotes around a shell variable?
With awk, you can look at the fifth field to see whether it matches "three digits + one letter"; print it with f. in that case, and just drop fields 2, 3 and 4 otherwise. For example:
awk -F'[, ]' '{
    if ($5 ~ /.?[[:digit:]]{3}[a-z]$/) {
        printf("%s,f. %s\n", $1, $5)
    } else {
        printf("%s,%s %s %s\n", $1, $5, $6, $7)
    }
}' test.txt
On the example you provide, it gives:
abc_deg0014_0001_a_1.tif,Front Board Outside
abc_deg0014_0002_b_000.tif,Front Board Inside
abc_deg0014_0003_f_001r.tif,f. 001r
abc_deg0014_0004_f_001v.tif,f. 001v
abc_deg0014_0267_f_132r.tif,f. 132r
abc_deg0014_0268_f_132v.tif,f. 132v
abc_deg0014_0269_y_999.tif,Back Board Inside
abc_deg0014_0270_z_1.tif,Back Board Outside

awk : set start and end of match

I have a LaTeX-like table like this (columns are delimited by &) :
foobar99 & 68
foobar4 & 43
foobar2 & 73
I want to get the index of the numbers at column 2 by using match.
In Vim, we can use \zs and \ze to set start and end of matching.
Thus, to accurately match the number at column 2, we can use ^.*&\s*\zs[[:digit:]]\+\ze\s*$.
How about awk? Is there an equivalent?
EDIT:
Matching for the first line:
foobar99 & 68
^^
123456789012345678
Expected output : 18.
EDIT2:
I am writing an awk script to deal with block delimited by line break (Hence, FS="\n" and RS=""). The MWE above is just one of these blocks.
A possible way to get the index of the number at column 2 is to do something like this:
split(line, cases, "&");
index = match(cases[2], /[[:digit:]]\+/);
but I am looking for a beautiful way to do this.
Apologies for the XY problem. But I'm still interested in start/end matching.
Too little context, so a simple guess: have you tried splitting the table into columns? With something like awk -F '\\s*&\\s*' you have your second column in $2.
In fact, you can use split() to retrieve the exact column of a string:
split(s, a[, fs ])
Split the string s into array elements a[1], a[2], ..., a[n], and
return n. All elements of the array shall be deleted before the split is
performed. The separation shall be done with the ERE fs or with the field
separator FS if fs is not given. Each array element shall have a
string value when created and, if appropriate, the array element
shall be considered a numeric string (see Expressions in awk). The
effect of a null string as the value of fs is unspecified.
So your second column is something like
split(s, a, /\s*&\s*/)
secondColumn = a[2]
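Note that \s is a GNU awk extension; with the portable [[:space:]] class, the same split looks like this (one-line sketch, sample input inlined via printf):

```shell
# Portable variant: [[:space:]] instead of the gawk-only \s.
printf 'foobar99 & 68\n' |
    awk '{ n = split($0, a, /[[:space:]]*&[[:space:]]*/); print a[2] }'
# -> 68
```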
By default, awk sees three columns in your data, and column 2 contains & only (and column 3 contains numbers). If you change the field delimiter to &, then you have two columns with trailing spaces in column 1 and leading spaces in column 2 (and some trailing spaces, as it happens; try copying the data from the question).
In awk, you could convert column 2 with leading spaces into a number by adding 0: $2 + 0 would force it to be treated as a number. If you use $2 in a numeric context, it'll be treated as a number. Conversely, you can force awk to treat a field as a string by concatenating with the empty string: $2 "" will be a string.
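Both coercions can be seen in one line (sample input inlined; the field after & is " 68" with a leading space):

```shell
# $2 + 0 forces numeric treatment; $2 "" forces string treatment.
printf 'foobar99 & 68\n' |
    awk -F'&' '{ print $2 + 0, length($2 "") }'
# -> 68 3   (the string form keeps the leading space, hence length 3)
```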
So there's no need for the complexity of regexes to get at the number — if the data is as simple as shown.
You say you want to use match; it is not clear what you need that for.
awk -F'&' '{ printf "F1 [%s], F2 [%10s] = [%d] = [%-6d] = [%06d]\n", $1, $2, $2, $2, $2 }' data
For your data, which has a single blank at the end of the first two lines and a double blank at the end of the third, the output is:
F1 [foobar99 ], F2 [ 68 ] = [68] = [68 ] = [000068]
F1 [foobar4 ], F2 [ 43 ] = [43] = [43 ] = [000043]
F1 [foobar2 ], F2 [ 73 ] = [73] = [73 ] = [000073]
Note that I didn't need to explicitly convert $2 to a number. The printf formats treated it as a string or a number depending on whether I used %s or %d.
If you need to, you can strip trailing blanks of $1 (or, indeed, $2), but without knowing what else you need to do, it's hard to demonstrate alternatives usefully.
So, I think awk does what you need without needing you to jump through much in the way of hoops. For a better explanation, you'd need to provide a better question, describing or showing what you want to do.
You can try this way
awk '{print index($0,$3)}' infile
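Note that the result of index depends on the exact whitespace in the line: with single spaces around &, as in the inlined sample below, it gives 12; the question's expected 18 assumes the original padding.

```shell
# index($0, $3) reports the 1-based position of the third field's text.
printf 'foobar99 & 68\n' |
    awk '{ print index($0, $3) }'
# -> 12
```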

regex Match a capture group's items only once

So I'm trying to split a string on several options, but each option is allowed to occur only once. I've figured out how to make it match all options, but when an option occurs twice or more, it matches every occurrence.
Example string: --split1 testsplit 1 --split2 test split 2 --split3 t e s t split 3 --split1 split1 again
Regex: /-{1,2}(split1|split2|split3) [\w|\s]+/g
Right now it is matching all cases and I want it to match --split1, --split2 and --split3 only once (so --split1 split1 again will not be matched).
I'm probably missing something really straight forward, but anyone care to help out? :)
Edit:
Decided to handle the extra occurrences in a script rather than through regex, for easier error handling. Thanks for the help!
EDIT: Somehow I ended up here from the PHP section, hence the PHP code. The same principles apply to any other language, however.
I realise that OP has said they have found a solution, but I am putting this here for future visitors.
function splitter(string $str, int $splits, $split = "--split")
{
    $a = array();
    for ($i = $splits; $i > 0; $i--) {
        if (strpos($str, "$split{$i} ") !== false) {
            $a[] = substr($str, strpos($str, "$split{$i} ") + strlen("$split{$i} "));
            $str = substr($str, 0, strpos($str, "$split{$i} "));
        }
    }
    return array_reverse($a);
}
This function will take the string to be split, as well as how many segments there will be. Use it like so:
$array = splitter($str, 3);
It will successfully explode the array around the $split parameter.
The parameters are used as follows:
$str
The string that you want to split. In your instance it is: --split1 testsplit 1 --split2 test split 2 --split3 t e s t split 3 --split1 split1 again.
$splits
This is how many elements of the array you wish to create. In your instance, there are 3 distinct splits.
If a split is not found, then it will be skipped. For instance, if you were to have --split1 and --split3 but no --split2 then the array will only be split twice.
$split
This is the string that will be the delimiter of the array. Note that it must be as specified in the question. This means that if you want to split using --myNewSplit, then a number from 1 to $splits will be appended to that string.
All elements end with a space since the function looks for $split and you have a space before each split. If you don't want to have the trailing whitespace then you can change the code to this:
$a[] = trim(substr($str, strpos($str, "$split{$i} ") + strlen("$split{$i} ")));
Also, notice that strpos looks for a space after the delimiter. Again, if you don't want the space then remove it from the string.
The reason I have used a function is that it will make it flexible for you in the future if you decide that you want to have four splits or change the delimiter.
Obviously, if you no longer want a numerically changing delimiter then the explode function exists for this purpose.
-{1,2}((split1)|(split2)|(split3)) [\w|\s]+
Something like this? In this case it creates three capture groups, each of which will collect the matches for that option. Hope this helps

Bash command in PHP script: get some lines of a file according to values of a specific column

Via Linux bash command(s) (grep + regex + other commands?) run from a PHP script, I want to get the line(s) of a file matching some conditions; see below:
An example of a file:
"id_line1","value_line1_column2","foo blablabla","value_line1_column4"
"id_line2","value_line2_column2","blablabla foo","value_line2_column4"
"id_line3","value_line3_column2","blabla foo blabla","value_line3_column4"
"id_line4","value_line4_column2","blablabla","value_line4_column4"
"id_line5","value_line5_column2","fooblabla bla","value_line5_column4"
"id_line6","value_line6_column2","blabla blafoo","value_line6_column4"
"id_line7","value_line7_column2","blabla foobla bla","value_line7_column4"
I want to search only column number X in the file (the third column in this example).
The regular expression
In the third column of all lines of my file, I want to find string(s) containing a searched word (via grep + regex?) that can appear:
At the beginning of the string of the specific column (here in the example, the third column)
OR at the end of the string of the specific column (here in the example, the third column)
OR somewhere in the string of the specific column (here in the example, the third column)
And match only the standalone word, not one concatenated with other words. For example, with the example file above, if I search for the word "foo":
"id_line1","value_line1_column2","foo blablabla","value_line1_column4" // the regex must return true
"id_line2","value_line2_column2","blablabla foo","value_line2_column4" // the regex must return true
"id_line3","value_line3_column2","blabla foo blabla","value_line3_column4" // the regex must return true
"id_line4","value_line4_column2","blablabla","value_line4_column4" // the regex must return false
"id_line5","value_line5_column2","fooblabla bla","value_line5_column4" // the regex must return false
"id_line6","value_line6_column2","blabla blafoo","value_line6_column4" // the regex must return false
"id_line7","value_line7_column2","blabla foobla bla","value_line7_column4" // the regex must return false
The result
Command(s) must return lines :
"id_line1","value_line1_column2","foo blablabla","value_line1_column4"
"id_line2","value_line2_column2","blablabla foo","value_line2_column4"
"id_line3","value_line3_column2","blabla foo blabla","value_line3_column4"
How can I do this ?
If I can only get the id ("id_line1", "id_line2", "id_line3") it would be perfect :)
Awk will do the job:
awk -F, '$3 ~ /"foo / || $3 ~ / foo"/ || $3 ~ /[[:blank:]]foo[[:blank:]]/ { print $0 }' filename
Here we check the third comma-delimited piece of each line for "foo (the word at the start of the quoted string), or (signified by ||) foo" (the word at the end), or foo surrounded by blanks. If any of these occur, the line is printed.
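A self-contained check on a few of the sample rows (data inlined via printf; only the true cases should come back):

```shell
# Rows 1 and 3 should match; row 5 ("fooblabla") should not.
printf '%s\n' \
    '"id_line1","value_line1_column2","foo blablabla","value_line1_column4"' \
    '"id_line3","value_line3_column2","blabla foo blabla","value_line3_column4"' \
    '"id_line5","value_line5_column2","fooblabla bla","value_line5_column4"' |
    awk -F, '$3 ~ /"foo / || $3 ~ / foo"/ || $3 ~ /[[:blank:]]foo[[:blank:]]/ { print $0 }'
```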

regex Pattern Matching over two lines - search and replace

I have a text document that I require help with. Below is an extract of a tab-delimited text doc in which the first line of the 3-line pattern will always be a number. The doc will always be in this format, with the same tab layout on each of the three lines.
nnnn **variable** V -------
* FROM CLIP NAME - **variable**
* LOC: variable variable **variable**
I want to replace the second field on the first line with the fourth field on the third line, and then replace the field after the colon on the second line with the original second field from the first line. Is this possible with regex? I am used to single-line search and replace, but not multiline patterns.
000003 A009C001_151210_R6XO V C 11:21:12:17 11:21:57:14 01:00:18:22 01:01:03:19
*FROM CLIP NAME: 5-1A
*LOC: 01:00:42:15 WHITE 005_NST_010_E02
000004 B008C001_151210_R55E V C 11:21:18:09 11:21:53:07 01:01:03:19 01:01:38:17
*FROM CLIP NAME: 5-1B
*LOC: 01:01:20:14 WHITE 005_NST_010_E03
The Result would look like :
000003 005_NST_010_E02 V C 11:21:12:17 11:21:57:14 01:00:18:22 01:01:03:19
*FROM CLIP NAME: A009C001_151210_R6XO
*LOC: 01:00:42:15 WHITE 005_NST_010_E02
000004 005_NST_010_E03 V C 11:21:18:09 11:21:53:07 01:01:03:19 01:01:38:17
*FROM CLIP NAME: B008C001_151210_R55E
*LOC: 01:01:20:14 WHITE 005_NST_010_E03
Many Thanks in advance.
A regular expression defines a regular language. Alone, this only expresses a structure of some input. Performing operations on this input requires some kind of processing tool. You didn't specify which tool you were using, so I get to pick.
Multiline sed
You wrote that you are "used to single line search replace function but not multiline patterns." Perhaps you are referring to substitution with sed. See How can I use sed to replace a multi-line string?. It is more complicated than with a single line, but it is possible.
An AWK script
AWK is known for its powerful one-liners, but you can also write scripts. Here is a script that identifies the beginning of a new record/pattern using a regular expression to match the first number. (I hesitate to call it a "record" because this has a specific meaning in AWK.) It stores the fields of the first two lines until it encounters the third line. At the third line, it has all the information needed to make the desired replacements. It then prints the modified first two lines and continues. The third line is printed unchanged (you specified no replacements for the third line). If there are additional lines before the start of the next record/pattern, they will also be printed unchanged.
It's unclear exactly where the tab characters are in your sample input because the submission system has replaced them with spaces. I am assuming there is a tab between FROM CLIP NAME: and the following field and that the "variables" on the first and third line are also tab-separated. If the first number of each record/pattern is hexadecimal instead of decimal, replace the [[:digit:]] with [[:xdigit:]].
fixit.awk
#!/usr/bin/awk -f
BEGIN { FS="\t"; n=0 }
{ n++ }
/^[[:digit:]]+\t/ { n=1 }
# Split and save first two lines
n==1 { line1_NF = split($0, line1, FS); next }
n==2 { line2_NF = split($0, line2, FS); next }
n==3 {
    # At the third line, make replacements
    line1_2 = line1[2]
    line1[2] = $4
    line2[2] = line1_2
    # Print modified first two lines
    printf "%s", line1[1]
    for ( i=2; i<=line1_NF; ++i )
        printf "\t%s", line1[i]
    print ""
    printf "%s", line2[1]
    for ( i=2; i<=line2_NF; ++i )
        printf "\t%s", line2[i]
    print ""
}
1 # Print lines after the second unchanged
You can use it like
$ awk -f fixit.awk infile.txt
or to pipe it in
$ cat infile.txt | awk -f fixit.awk
This is not the most regular expression inspired solution, but it should make the replacements that you want. For a more complex structure of input, an ideal solution would be to write a scanner and parser that correctly interprets the full input language. Using tools like string substitution might work for simple specific cases, but there could be nuances and assumptions you've made that don't apply in general. A parser can also be more powerful and implement grammars that can express languages which can't be recognized with regular expressions.
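The swap can be verified end-to-end on a single record, with tabs between all fields per the assumption stated above (the awk body here is the same logic as fixit.awk, inlined with shortened variable names):

```shell
# One record: header line, FROM CLIP NAME line, LOC line (tab-separated).
printf '000003\tA009C001_151210_R6XO\tV\tC\t11:21:12:17\n*FROM CLIP NAME:\t5-1A\n*LOC:\t01:00:42:15\tWHITE\t005_NST_010_E02\n' |
    awk 'BEGIN { FS = "\t" }
         { n++ }
         /^[[:digit:]]+\t/ { n = 1 }
         n == 1 { nf1 = split($0, l1, FS); next }
         n == 2 { nf2 = split($0, l2, FS); next }
         n == 3 {
             tmp = l1[2]; l1[2] = $4; l2[2] = tmp
             out = l1[1]; for (i = 2; i <= nf1; i++) out = out FS l1[i]; print out
             out = l2[1]; for (i = 2; i <= nf2; i++) out = out FS l2[i]; print out
         }
         1'
```

The first output line carries 005_NST_010_E02 in its second field, and the FROM CLIP NAME line carries the original A009C001_151210_R6XO, matching the desired result in the question.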