what does if(index(i,$12)==1 indicate - if-statement

I just came across an awk script:
awk 'BEGIN {OFS=FS} NR==FNR {a[$1]=($2" "$3);next} {for (i in a) if(index(i,$12)==1) print $0,a[$12]}'
in this script what does
if(index(i,$12)==1
mean? Is it a true/false condition, or just a numeric comparison with 1?

Without sample input it is difficult to understand the complete requirements, so I am going by your code alone.
BEGIN: this section executes before any Input_file is read.
OFS=FS: sets the output field separator to the input field separator. It has no effect here, since both variables default to a single space.
NR==FNR: this condition is true only while the first Input_file is being read.
a[$1]: creates an array named a whose index is $1 of the current line and whose value is the 2nd and 3rd columns of that line joined by a space.
next: skips all further statements for the current line, so the rest of the script never runs for the 1st Input_file.
for(i in a): a for loop that traverses all indexes of array a.
index(i,$12)==1: index(i,$12) returns the position at which $12 first occurs inside the key i (a 1st-column value from the 1st Input_file), or 0 if it does not occur. Comparing the result with 1 therefore tests whether the key starts with the 12th column of the current line. This is a prefix test, not an exact match, so a key that is longer than $12 also passes.
If the above condition is TRUE, it prints the current line followed by a[$12]. Note that a[$12] is looked up by $12, not by i, so it is empty unless $12 itself is also a key.

index() is a function. It gets the position of a string within another string. From man awk:
index(s, t) Return the index of the string t in the string s, or 0 if t is not present. (This implies that character indices start at
one.) It is a fatal error to use a regexp constant for t.
In your example you iterate over the keys of the array a and check whether the key starts with column 12.
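To make the argument order concrete, here is a minimal Python sketch of the same test. The helper awk_index and the sample keys and $12 value are made up for illustration; only the 1-based, 0-if-absent convention comes from the man page excerpt above.

```python
def awk_index(s: str, t: str) -> int:
    """Mimic awk's index(s, t): 1-based position of t in s, or 0 if absent.

    The empty-string case for t is not modelled here.
    """
    return s.find(t) + 1  # str.find returns -1 when t is absent


keys = ["ABC123", "XABC", "DEF"]  # hypothetical keys of array a
field12 = "ABC"                   # hypothetical value of $12

for key in keys:
    pos = awk_index(key, field12)
    # pos == 1 is the condition from the script: the key starts with $12
    print(key, pos, pos == 1)
```

Only "ABC123" passes the ==1 test: "XABC" contains the value but not at position 1, and "DEF" does not contain it at all.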

Related

BASH: Split strings without any delimiter and keep only first sub-string

I have a CSV file containing 7 columns and I am interested in modifying only the first column. In fact, in some of the rows a row name appears n times in a concatenated way without any space. I need a script that can identify where the duplication starts and remove all duplications.
Example of a row name among others:
Row name = EXAMPLE1.ABC_DEF.panel4EXAMPLE1.ABC_DEF.panel4EXAMPLE1.ABC_DEF.panel4
Replace by: EXAMPLE1.ABC_DEF.panel4
In the different rows:
n can vary
The length of the row name can vary
The structure of the row name can vary (eg. amount of _ and .), but it is always collated without any space
What I have tried:
:%s/(.+)\1+/\1/
Step-by-step:
%s: substitute in the whole file
(.+)\1+: First capturing group. .+ matches any character (except for line terminators), + is the quantifier — matches between one and unlimited times, as many times as possible, giving back as needed.
\1+: matches the same text as most recently matched by the 1st capturing group
Substitute by \1
However, I get the following errors:
E65: Illegal back reference
E476: Invalid command
From what I understand, you need only one line containing EXAMPLE1.ABC_DEF.panel4. In that case you can do the following:
First remove duplicates in one line:
sed -i 's/EXAMPLE1\.ABC_DEF\.panel4.*/EXAMPLE1.ABC_DEF.panel4/' file
Then remove duplicated lines:
awk '!a[$0]++' file
If all your rows are of the format you gave in the question (like EXAMPLExyzEXAMPLExyz) then this should work:
awk -F"EXAMPLE" '{print FS $2}' file
This takes "EXAMPLE" as the field delimiter and prints only the second field (the first field is the empty string before the leading "EXAMPLE"). It prepends "EXAMPLE" to that field by printing the built-in awk variable FS. Thanks, #andlrc.
Not an ideal solution but may be good enough for this purpose.
This script takes the string to test as its first argument and strips the repetition, keeping the longest repeated unit (e.g. "totototo" gives "toto", not "to").
#!/usr/bin/env bash
row_name="$1"
# Try splitting the string into i equal parts, starting with the fewest parts (longest unit)
for (( i=2; i<${#row_name}; i++ ))
do
    match="True"
    # Only continue if the length is divisible by i
    if (( ${#row_name} % i )); then
        continue
    fi
    # Length of the potential repeated substring
    len_sub=$(( ${#row_name} / i ))
    # Test whether the first substring is equal to each of the others
    for (( s=1; s<i; s++ ))
    do
        if [ "${row_name:0:${len_sub}}" != "${row_name:$((len_sub * s)):${len_sub}}" ]; then
            match="False"
            break
        fi
    done
    # All substrings are equal, so keep a single copy
    if [ "$match" = "True" ]; then
        row_name="${row_name:0:${len_sub}}"
        break
    fi
done
echo "$row_name"
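For comparison, the backreference idea from the question can be sketched in Python, where the group and \1 need no extra escaping. This is only a sketch: the name collapse_repeats is made up, and the pattern is anchored and lazy, so unlike the bash script it keeps the shortest repeating unit.

```python
import re

# ^(.+?)\1+$ : the whole string is one unit (captured lazily, so the
# shortest candidate wins) followed by one or more copies of that unit.
REPEATED = re.compile(r"^(.+?)\1+$")


def collapse_repeats(name: str) -> str:
    """Return a single copy of the unit if name is that unit repeated."""
    return REPEATED.sub(r"\1", name)


print(collapse_repeats("EXAMPLE1.ABC_DEF.panel4" * 3))
print(collapse_repeats("no_repeat_here"))
```

A name that is legitimately periodic on its own (for example "abab") would also be collapsed, so this assumes real row names never look like that. To touch only the first CSV column, apply it to that field and leave the other six as they are.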

Pull a positioned grouped and repeating string using regex?

Let's say I have data output such as the following:
0; root.; 0; MLG.; 247; root.; 249; MLG.; 2390; toasty.; ... someNumber; username.;
I am trying to pull the username while ignoring the number and semicolon previous to it from when the numbers are not 0, and to do this for an unknown number of times. In regards to this data the output should appear favorably as:
root. MLG. toasty.
The syntax MUST be perl format. The application using this takes no exceptions. While I do have complete control over how this data is being presented (such as I can remove the semicolons next to the numbers and attach the unique number to the username with a period and separate with a semicolon) I would like to know the method to do this regardless.
Some of the many regexes I've tried are as follows...
(The (?<field> is to specify the regex following the end > a field name to be specified and shown by the application in case anyone is wondering)
Pulls all data to the last semicolon starting from where the first number shown is not 0.
"(?<users_online>[1-9].*;)"
Pulls all data after the first instance of non-zero number and a semicolon occurs.
"[1-9];(?<users_online>.*?)"
Pulls all data after the first instance of a non-zero number and a semicolon and outputs any alphanumeric values up until the next "word" boundary.
"[1-9];((?<users_online> \w+\b))"
Any and all help is appreciated.
You can do it like this:
my $s = "0; root.; 0; MLG.; 247; root.; 249; MLG.; 2390; toasty.; ... someNumber; username.;";
# ^--- added
print join " ", $s=~/[1-9]0*; ([^;]+)/g;
You only need to require at least one non-zero digit, followed by any trailing zeros, and then use a character class that excludes the semicolon to capture the username.
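If you want to try the pattern outside the application, here is a rough Python equivalent. The pattern is copied from the Perl line above, and the function name users_online is borrowed from the field name in the question purely for illustration.

```python
import re

s = ("0; root.; 0; MLG.; 247; root.; 249; MLG.; 2390; toasty.; "
     "... someNumber; username.;")

# A non-zero digit, optional trailing zeros, "; ", then capture up to the next ";"
PATTERN = re.compile(r"[1-9]0*; ([^;]+)")


def users_online(data: str) -> list:
    """Return every username that follows a non-zero number."""
    return PATTERN.findall(data)


print(" ".join(users_online(s)))
```

Entries whose number is exactly 0 are skipped because [1-9] cannot match the 0, and "someNumber" is skipped because it contains no digit at all.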

Command line to merge lines with matching first field, 50 GB input

A while back, I asked a question about merging lines which have a common first field. Here's the original: Command line to match lines with matching first field (sed, awk, etc.)
Sample input:
a|lorem
b|ipsum
b|dolor
c|sit
d|amet
d|consectetur
e|adipisicing
e|elit
Desired output:
b|ipsum|dolor
d|amet|consectetur
e|adipisicing|elit
The idea is that if the first field matches, then the lines are merged. The input is sorted. The actual content is more complex, but uses the pipe as the sole delimiter.
The methods provided in the prior question worked well on my 0.5GB file, processing in ~16 seconds. However, my new file is approx 100x larger, and I prefer a method that streams. In theory, this will be able to run in ~30 minutes. The prior method failed to complete after running 24 hours.
Running on MacOS (i.e., BSD-type unix).
Ideas? [Note, the prior answer to the prior question was NOT a one-liner.]
You can append your results to a file on the fly so that you don't need to build a 50GB array (which I assume you don't have the memory for!). This command concatenates the joined fields for each of the different indices into a string, which is written to a file named after the respective index with some suffix.
EDIT: on the basis of OP's comment that the content may contain spaces, I would suggest using -F"|" instead of sub. The following answer is also designed to write to standard output.
(New) Code:
# split the file on the pipe using -F
# if index "i" is still $1 (and i exists) concatenate the string
# if index "i" is not $1 or doesn't exist yet, print current a
# (will be a single blank line for first line)
# afterwards, this will print the concatenated data for the last index
# reset a for the new index and take the first data set
# set i to $1 each time
# END statement to print the single last string "a"
awk -F"|" '$1==i{a=a"|"$2}$1!=i{print a; a=$2}{i=$1}END{print a}'
This builds a string of "data" while in a given index and then prints it out when index changes and starts building the next string on the new index until that one ends... repeat...
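The same streaming idea can be sketched in Python with itertools.groupby, which relies on the input already being sorted by the first field, as stated in the question. This is a sketch under assumptions, not a tuned solution: the name merge_sorted is made up, and it keeps everything after the first pipe rather than only $2. It holds one group in memory at a time, emits the key, and skips keys that occur only once, as in the desired output.

```python
from itertools import groupby


def merge_sorted(lines, sep="|"):
    """Yield one merged line per key that appears on more than one line.

    Assumes lines are sorted by the first field; only the current
    group's values are held in memory.
    """
    records = (line.rstrip("\n").split(sep, 1) for line in lines)
    for key, group in groupby(records, key=lambda rec: rec[0]):
        rest = [rec[1] for rec in group if len(rec) > 1]
        if len(rest) > 1:
            yield sep.join([key] + rest)


sample = ["a|lorem\n", "b|ipsum\n", "b|dolor\n", "c|sit\n",
          "d|amet\n", "d|consectetur\n", "e|adipisicing\n", "e|elit\n"]
for merged in merge_sorted(sample):
    print(merged)
```

For a real file, pass the open file object (or sys.stdin) instead of sample; iterating over it reads line by line, so the whole 50 GB is never loaded at once.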
sed '# label anchor for a jump
:loop
# load a new line in working buffer (so always 2 lines loaded after)
N
# verify if the 2 lines have same starting pattern and join if the case
/^\(\([^|]\)*\(|.*\)\)\n\2/ s//\1/
# if end of file quit (and print result)
$ b
# if lines are joined, cycle and re make with next line (jump to :loop)
t loop
# (No joined lines here)
# if more than 2 element on first line, print first line
/.*|.*|.*\n/ P
# remove first line (using last search pattern)
s///
# (if any modification was made) cycle (jump to :loop)
t loop
# exit and print working buffer
' YourFile
POSIX version (maybe --posix on Mac)
Self-commented
Assumes sorted input, no empty lines, and no pipe (or escaped pipe) inside the data
Use unbuffered mode (-u) for stream processing, if available

Python re find index position of first search match

I have a series of strings, most of which contain 4 digits in a row. I want to slice the string at the end of that fourth digit, using Python. Sometimes the string contains more than one such pattern. What I want is the index position of the FIRST match of my regular expression. What I have been able to get is the LAST match.
myString = 'Today is June 14, 2019. I sometimes like to think back when I was a child in 1730.'
theYear = re.compile("\d{4}")
[(m.start(0), m.end(0)) for m in re.finditer(theYear, myString)]
print m.span(0)
The result is (77, 81), which is the index position for the second date, not the first one. I know the problem is my loop, which will iterate through all of the matches, leaving me with the last one. But I haven't been able to figure out how to access those index positions without looping.
Thanks for any help.
Use search, which stops at the first match:
print(theYear.search(myString).span())
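A slightly fuller sketch that also does the slicing asked for in the question and guards against strings with no 4-digit run (search returns None in that case). The function name slice_after_first_year is made up for illustration.

```python
import re

THE_YEAR = re.compile(r"\d{4}")


def slice_after_first_year(text: str) -> str:
    """Return text up to the end of the first 4-digit run, or text unchanged."""
    match = THE_YEAR.search(text)  # first match only, no loop needed
    if match is None:
        return text
    return text[:match.end()]


my_string = ("Today is June 14, 2019. I sometimes like to think back "
             "when I was a child in 1730.")
print(THE_YEAR.search(my_string).span())
print(slice_after_first_year(my_string))
```

For the sample string the span is (18, 22), so the slice ends right after "2019".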

decipher the regular expression

please help me decipher the regular expression-
'!_[$0]++'
It is being used to get a MSISDN (one at a time from a file containing list of MSISDN starting with zero )by the following usage:
awk '!_[$0]++' file.txt
It's not a regular expression, it's an arithmetic and boolean expression.
$0 = The current input line
_[$0] = An associative array element whose key is the input line
_[$0]++ = increment that array element each time we encounter a repeat of the line, but evaluates to the original value
!_[$0]++ = boolean inverse, so it returns true if the value was originally 0 or the empty string, false otherwise
So this expression is true the first time a line is encountered, false every other time. Since there's no action block after the expression, the default is to print the line if the expression is true, skip it when false.
So this prints the input file with duplicates omitted.
'true': the line will be printed.
'_[$0]++': the associative array element is incremented every time $0 is seen, so it holds the number of times each line has occurred.
'!_[$0]++': this is true only the first time a line is inserted into the associative array; every later time it evaluates to false, so the line is not printed.
So all the duplicate lines will not be printed.
This is not a regular expression. This particular command prints unique lines the first time they are found.
_ is being used as an array here and $0 refers to the entire line. Given that the default numeric value for an array element is 0 (it's technically an empty string, but in numeric contexts it's treated as 0), the first time you see a line, you print the line (since _[$0] is falsy, !_[$0] will be true). The post-increment bumps the counter every time the line is seen, but the test uses the value from before the increment (awk's default action is to print), so the next time you see the line _[$0] will be 1 and the line will not be printed.
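If it helps to see the same logic spelled out step by step, here is a rough Python sketch of what the expression does for each line. The names unique_lines and seen are made up; seen plays the role of the _ array.

```python
def unique_lines(lines):
    """Yield each line the first time it is seen, like awk '!_[$0]++'."""
    seen = {}
    for line in lines:
        count = seen.get(line, 0)  # value before the increment (0 if new)
        seen[line] = count + 1     # the ++ part
        if not count:              # the ! part: true only on first sight
            yield line


for line in unique_lines(["0123", "0456", "0123", "0789", "0456"]):
    print(line)
```

The order of first occurrences is preserved, and memory grows with the number of distinct lines, just as it does for the awk array.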