AWK if length statement append

I am trying to fix a problem of missing data in a CSV file. Where a column should be null, the line is missing the extra field separator needed to keep the CSV structure intact. The example below illustrates what I mean:
1#Sunshine#2/M#L#JRVel#215#WAW
2#Pass#2/J#L1#JAvar#218#JKDes
3#Solo#2/K#JRosa#218#WAW
4#Bomber#2/D#L1#JLOrt#218#GCCon
5#SmokingCandy#2/Y#L1#MFco#218#SMAs
6#BigBound#2/H#L1#JCast#218#SMAs
7#ShootBunies#2/H#L#DLPar#218#DKo
As you can see, in the third row, fourth column, there is no "L" or "L1"; instead it's "JRosa". This causes incorrect formatting of a MySQL database down the road. What I would like to do is use an AWK if statement to append another delimiter in front of "JRosa". So far I have half of the command and no idea where to go. I was thinking it would be something that starts like:
awk -F# '{if(length($4) > 4) print "#"$4}'
Which just prints:
#JRosa
I'm hoping to find a solution that results in:
1#Sunshine#2/M#L#JRVel#215#WAW
2#Pass#2/J#L1#JAvar#218#JKDes
3#Solo#2/K##JRosa#218#WAW
4#Bomber#2/D#L1#JLOrt#218#GCCon
5#SmokingCandy#2/Y#L1#MFco#218#SMAs
6#BigBound#2/H#L1#JCast#218#SMAs
7#ShootBunies#2/H#L#DLPar#218#DKo
Does anybody know the correct syntax after the if statement that can be used to append the delimiter to the input data?

Just add an extra "#" at the end of the 3rd field if there are fewer than 7 fields on the line:
$ awk 'BEGIN{FS=OFS="#"} NF<7{$3=$3FS} 1' file
1#Sunshine#2/M#L#JRVel#215#WAW
2#Pass#2/J#L1#JAvar#218#JKDes
3#Solo#2/K##JRosa#218#WAW
4#Bomber#2/D#L1#JLOrt#218#GCCon
5#SmokingCandy#2/Y#L1#MFco#218#SMAs
6#BigBound#2/H#L1#JCast#218#SMAs
7#ShootBunies#2/H#L#DLPar#218#DKo
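If you prefer to keep the explicit if form from your attempt, an equivalent way to write it is the sketch below (same idea as the one-liner above, just spelled out; it tests NF rather than the length of $4):
awk -F'#' -v OFS='#' '{ if (NF < 7) $3 = $3 FS; print }' file
Assigning to $3 makes awk rebuild the whole record with OFS, so the corrected line is printed in full rather than just the fourth field.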

Related

Removing the last specific character from the results of my formula

I'm using some VLOOKUPs to pull in text from another tab on my spreadsheet, using the formula below:
={"Product Category Test";ARRAYFORMULA(IF(ISBLANK(A2:A),"",
VLOOKUP(A2:A,'Import Template'!A:DB,MATCH("Product Category",'Import
Template'!A1:DB1,0),false)&"|"&IF(VLOOKUP(A2:A,'Import Template'!A:DB,MATCH("Automatic
Categories",'Import Template'!A1:DB1,0),false)<>"",VLOOKUP(A2:A,'Import
Template'!A:DB,MATCH("Automatic Categories",'Import Template'!A1:DB1,0),false),"")))}
Example of results: Books|Coming Soon Images|
All of my results will be delimited by a "|", which will also be the final character. I need to remove the final "|" from the results, ideally without using a helper column. Is there a way to wrap another function around my formula to achieve this? I've played around with RIGHT and LEN but can't figure it out.
Thanks,
use regex:
=ARRAYFORMULA({"Product Category Test"; REGEXREPLACE(""&IF(ISBLANK(A2:A),,
VLOOKUP(A2:A,'Import Template'!A:DB,MATCH("Product Category",'Import
Template'!A1:DB1,0),)&"|"&IF(VLOOKUP(A2:A,'Import Template'!A:DB,MATCH("Automatic
Categories",'Import Template'!A1:DB1,0), )<>"",VLOOKUP(A2:A,'Import
Template'!A:DB,MATCH("Automatic Categories",'Import Template'!A1:DB1,0),),)), "\|$", )})
If this doesn't work, make sure there are no empty spaces after the last |.

Regex With Colons in Data

I have a text file from which I'm looking to remove some data. The data is separated using a colon ':' as the delimiter. There are approximately 9 separators. The data after the 7th column is most often null and thus useless, but the additional colons are still there.
An example of the file would look like this:
column1:column2:column3:column4:column5:column6:column7:column8:column9:column10
I hope to remove the info from after column8. So the data to be removed would be:
:column9:column10
Could someone advise me how to do so in Regex?
I've been reading, and nowhere have I found a way to isolate a colon and the text following it after x number of colons.
Any help you could offer would be much appreciated.
$_ = join ":", ( split /:/, $_, -1 )[0..7];
or
s/(?::[^:]*){2}\z//;
The following regex will keep the first 8 columns and discard all others.
s/^[^:]*(?::[^:]*){7}\K.*//;
Assumes simple single line records.
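If you are fixing the file from the shell rather than inside a larger Perl program, the same substitution can be applied as a one-liner. A sketch, assuming your input is file.txt (both file names here are placeholders):
perl -pe 's/^[^:]*(?::[^:]*){7}\K.*//' file.txt > trimmed.txt
or, since keeping the first 8 colon-separated fields is all that's needed, cut does the same job:
cut -d: -f1-8 file.txt > trimmed.txt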

Finding/replacing values for a specific column in Notepad++

I think I need RegEx for this, but it is new to me...
What I have in a text file are 200 rows of data, 100 INSERT INTO rows and 100 corresponding VALUE rows.
So it looks like this:
INSERT INTO DB1.Tbl1 (Col1, Col2, Col3........Col20)
VALUES(123, 'ABC', '201450204 15:37:48'........'DEF')
What I want to do is replace every Date/Timestamp value in Col3 with this: CURRENT_TIMESTAMP. The Date/Timestamps are NOT the same for every row. They differ, but they are all in Column 3.
There are 100 records in this table; some other tables have more, which is why I am looking for a shortcut to do this.
Try this:
Search with (INSERT[^,]+,[^,]+,)([^,]+,)([^']+'[^']+'[^']+)('[^']+',) and replace with $1$3, with the Regular expression option checked in Notepad++.
With
"VALUES" being right at the beginning of the line,
"Col1" values being all numeric, and
no single quotes inside the values for "Col2"
you can search for
^(VALUES\(\d+, '[^']+', )'(\d{9} \d{2}:\d{2}:\d{2})'
and replace with
\1CURRENT_TIMESTAMP
as demonstrated on RegEx101. (Remember, Notepad++ uses the backslash in the replacement string…)
Personally, I'd consider going straight to the database and fixing the timestamp there - especially if you have more data to handle. (See my above comment for the general idea.)
Please comment if further detail or adjustment is required.
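If the same clean-up ever needs to run outside Notepad++, a rough sed equivalent of that search/replace is sketched below (it assumes the VALUES lines look exactly like the sample above; inserts.sql and fixed.sql are placeholder file names):
sed -E "s/^(VALUES\([0-9]+, '[^']+', )'[0-9]+ [0-9]{2}:[0-9]{2}:[0-9]{2}'/\1CURRENT_TIMESTAMP/" inserts.sql > fixed.sql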

Ordering output files

I have a file containing a large number of protein sequences. Each sequence is headed by an initial "protein ID number" (a GI number, for those who know). I am using an awk command that allows me to print between two regular expressions. Using this, I can enter a list of GI numbers into the first regex field, with each GI number separated by a "|". The second regex matches a marker (ABC123) that I added in after every protein, which is what allows the awk range to work.
Therefore the code I am using is as follows:
awk '/GI1|GI2|GI3|GI4|GIX.../,/ABC123/' database.txt > output.txt
As you can see from the above code, I am searching within database.txt and writing a new file. The problem is that when I open output.txt, the list of GIs is in the wrong order. In output.txt I need them to occur in the same order as they occur in the first regex field, i.e.
GI1
GI2
GI3...
Instead, they occur in the order in which they are found in database.txt, so in output.txt they look all jumbled, i.e.
GI3
GI4
GI1
GI2
GI5
Does anyone know how I can get the list of GIs in the output file to match the same order as the list of GIs I input in the 1st regex field?
Try this command:
awk '/GI1|GI2|GI3|GI4|GIX.../,/ABC123/' database.txt | sort -k1.3,1.3 > output.txt
Now your output.txt contains the sorted list.
The specification 1.3,1.3 says that the sort key starts at field 1, character position 3, and ends at the same place.
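If the requirement is specifically that output.txt follow the order of the GI list you typed (rather than the order produced by sorting on a single character), another option is to loop over the list and extract each record in turn. A sketch, assuming each GI number matches exactly one record and that ABC123 closes every record:
for gi in GI1 GI2 GI3 GI4; do
  awk -v id="$gi" '$0 ~ id, /ABC123/' database.txt
done > output.txt
Each pass through the loop pulls out one record, so the records land in output.txt in exactly the order the GIs are listed.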

How would I parse in a bash script date_value _space_ time_value

I am trying to import a TSV file into a MySQL DB, but I am having trouble since the file has no unique delimiters to identify where a new row starts. The only unique identifier is a date, followed by a space, followed by a time. Example: 6/19/2010 16:04:43
Could someone please point me in the right direction, or help me make a bash script that puts a semicolon ";" in front of that string, so the end result will be ;6/19/2010 16:04:43
The tricky part is that in this file there will be other date fields and other time fields but this is the only string that will have a space in between the two.
sed 's#[0-9]\{1,2\}/[0-9]\{1,2\}/[0-9]\{4\} #;&#g' file > resultfile
Test before using.
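If it turns out there are other date fields that are also followed by a space, a stricter variant that anchors on the time portion as well (an assumption based on the 6/19/2010 16:04:43 layout shown above) would be:
sed 's#[0-9]\{1,2\}/[0-9]\{1,2\}/[0-9]\{4\} [0-9]\{1,2\}:[0-9]\{2\}:[0-9]\{2\}#;&#g' file > resultfile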