Fixing excess spaces in a text file - replace

I have output from a program called KRAKEN that looks like this (I apologize for the link, but I couldn't figure out how to put tabs within a line into Markdown, because they just get converted to spaces).
The problem with this is pretty obvious: if I want to do any type of text editing in the terminal, these spaces read as tabs, which read as new columns. What I have been trying to do is delete all of these spaces and essentially justify the sixth column.
Currently I have tried using the column command, which almost worked, but my output looks like this.
So now I have more columns than I need. Another potential fix could be to combine all the columns after column 6, but I do not know how to do this either.
The goal is to get the output to look like this
So, a quick TL;DR:
Is there any way to remove excess spaces and justify only one column in a text file?
Or is there a way to combine columns after a certain column, yet keep the rows separate?

I found a fix for this, in case anyone else has an issue with this type of output.
Starting from the original file, called test.txt:
column -t test.txt > temp1.txt
awk '{print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6" "$7" "$8" "$9}' temp1.txt > final.txt
For reference, test.txt looks like this, temp1.txt looks like this, and final.txt looks like this.
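A more general version of that awk step avoids hard-coding fields 7 through 9, so lines with any number of trailing words are handled. A minimal sketch, assuming the same test.txt input as above and that every line has at least six columns:
column -t test.txt | awk '{
    # print the first five fields tab-separated...
    printf "%s\t%s\t%s\t%s\t%s\t", $1, $2, $3, $4, $5
    # ...then join everything from field 6 onward with single spaces
    for (i = 6; i <= NF; i++)
        printf "%s%s", $i, (i < NF ? " " : "\n")
}' > final.txt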

Related

Removing duplicate rows from Notepad++

I am looking for a way to remove duplicate rows from my Notepad++ file. The rows are not exact duplicates, per se. Here's the situation. I have a large file of capitalized company names with probability values as well (each separated by a tab). So the format would be like this:
ATT .7213
SAMSUNG .01294
SAMSUNG .90222
So, I need to remove one of these rows because there is a match in the first column. I don't really have a preference for which one I remove, just as long as I end up with one row at the end. I have tried unique sorting with TextFX, but it looks for whole-row duplicates rather than just the first column. If anyone could offer up a handy solution to fix this, I would greatly appreciate it. Bash script answers using awk, sed, or cut are also acceptable, as are regular expressions.
Thank you!
Using awk, you could say:
awk '!a[$1]++' filename
This would keep only the first line encountered for each distinct value of the first field.
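For example, with the sample data above saved in companies.txt (an assumed filename): a[$1]++ evaluates to zero, hence false, the first time a company name is seen, so the negation keeps exactly one row per name.
$ awk '!a[$1]++' companies.txt
ATT .7213
SAMSUNG .01294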
Use sort:
sort -k1,1 -u companies.txt
The output will consist of full lines, but only the sort key (the first field) is considered when identifying duplicates.
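For example (companies.txt again being an assumed filename; note that which of the duplicate SAMSUNG rows survives is left to the sort implementation, which fits the "no preference" requirement above):
$ sort -k1,1 -u companies.txt
ATT .7213
SAMSUNG .01294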

Can Notepad++ save out search results to a text file?

I need to do quite a few regular expression search/replaces throughout hundreds and hundreds of static files. I'm looking to build an audit trail so I at least know what files were touched by what searches/replaces.
I can do my regular expression searches in Notepad++ and it gives me file names/paths and number of hits in each file. It also gives me the line #s which I don't really care that much about.
What I really want is a separate text file of the file names/paths. The # of hits in each file would be a nice addition, but really it's just a list of file names/paths that I'm after.
In Notepad++'s search results pane, I can do a right click and copy, but that includes all the line #s and code which is just too much noise, especially when you're getting hundreds of matches.
Anyone know how I can get these results to just the file name/paths? I'm after something like:
/about/foo.html
/about/bar.html
/faq/2012/awesome.html
/faq/2013/awesomer.html
/foo/bar/baz/wee.html
etc.
Then I can name that file regex_whatever_search.txt and at the top of it include the regex used for the search and replace. Below that, I've got my list of files it touched.
UPDATE: What looks like the easiest thing to do (at least that I've found) is to just copy all the search results into a new text file and run the following regex:
^\tLine.+$
And replace that with an empty string. That'll give you just the file path and hit counts with a lot of empty space between each entry. Then run the following regex:
\s+\n
And replace with:
\n
That'll strip out all the unwanted empty space and you'll be left with a nice list.
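If you would rather do that cleanup outside Notepad++, here is a minimal shell sketch of the same two passes, assuming GNU sed and that the copied search results were saved as results.txt (an assumed filename):
sed -e '/^\tLine/d' -e '/^[[:space:]]*$/d' results.txt > regex_whatever_search.txt
The first expression drops the per-hit "Line N: ..." rows and the second drops the blank lines left behind, leaving just the file paths and hit counts.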
Maybe you need the power of Unix tools.
Assume you have GnuWin32 installed in c:\tools\gnuwin32.
Then, if you have a replace.bat file with this content:
@echo off
set BIN=c:\tools\gnuwin32\bin
set WHAT=%1
set TOWHAT=%2
set MASK=%3
rem Removing quotes
SET WHAT=###%WHAT%###
SET WHAT=%WHAT:"###=%
SET WHAT=%WHAT:###"=%
SET WHAT=%WHAT:###=%
SET TOWHAT=###%TOWHAT%###
SET TOWHAT=%TOWHAT:"###=%
SET TOWHAT=%TOWHAT:###"=%
SET TOWHAT=%TOWHAT:###=%
SET MASK=###%MASK%###
SET MASK=%MASK:"###=%
SET MASK=%MASK:###"=%
SET MASK=%MASK:###=%
echo %WHAT% replaces to %TOWHAT%
rem printing matching files
%BIN%\grep -r -c "%WHAT%" %MASK%
rem actual replace
%BIN%\find %MASK% -type f -exec %BIN%\sed -i "s/%WHAT%/%TOWHAT%/g" {} +
you can do a regex replace in the masked files recursively, with the output you required:
replace "using System.Windows" "using Nothing" *.cs
The regular expression I use for this kind of problem is
^\tLine.[0-9]*:.
and it works for me.
This works well if you have Excel available and want to avoid using regular expressions:
Ctrl+A to select all the results
drag & drop the selected results to Excel
Create a Filter on the 1st row
Filter out the lines that have "(Blank)" on the 1st column
Select the remaining lines (i.e. the lines with the filenames) and copy/paste them to another sheet or any wanted destination
You could also Ctrl+A, Ctrl+C the search results, then use the Paste Option "Use Text Import Wizard" in Excel, say that the data is "Fixed width", place a single column break after the 2nd character (to remove the two leading spaces in the filename during import), and use a filter to filter out the unwanted rows.

How to prevent OpenOffice/LibreOffice Calc from changing what you input (data, numbers,...)

Basically, I want LibreOffice Calc to do what I tell it, not what it wants.
For example:
when I input 1.1.12, I want to have 1.1.12 in that cell, not 01.01.2012 or whatever.
when I input 001, I want to have 001 in that cell, not 1
and so on and so forth
I want it to never ever touch my data until I explicitly tell it to. Is that possible at all?
I know I can set format of a cell to text. It doesn't help at all. Example:
Input 1.1.12; it gets displayed as 01.01.12. Format the cell as text and it becomes "40909"; the original input is lost.
Format empty cells as text. Paste "000 001 002 ..." separated by line breaks. Displays "0 1 2 ..."
I know I can write ' in front of anything for it to be forced text. Again it doesn't help, because when I paste in text, I cannot have ' auto-appended to it.
I hope this is possible. I tried googling for different problems and never found a good answer.
If you want your input to be interpreted as text, and to prevent Calc from doing fancy (and annoying) things with it, you have to change the format before entering any value.
Select the cells/columns/rows.
Right-click 'Format Cells...'
Select the tab 'Numbers'
In the list 'Category', select 'Text' (the last option)
Select the format '#' (it is the only one in this category)
Click on 'Ok'
You may need to tweak the 'AutoCorrect' options as well. Go to 'Tools > AutoCorrect Options...'. Here is a link that may help: https://help.libreoffice.org/Calc/Deactivating_Automatic_Changes
I understand your problem with pasting pure unformatted text. This may be more work than you like (we can try to automate that later) but when I paste data from Notepad, I am prompted with an import screen as you can see below. Select the column header(s) and then select Column type: Text. This should solve your paste/import problem. An alternative is to handle this with an AutoHotKey script.
Oh, by the way: the # is the format code for text, just like you have HH for 24-hour time or ddd for weekdays...
When you are importing, you're given a bunch of options. Select "Quoted field as text", so any text inside quotes is treated as text, which LibreOffice regards as sacred and does not modify the way it modifies something it identifies as a number.
When you have your data in the clipboard, click Edit -> Paste as... in the main menu. In the next window, choose "Paste as text". All your data will be pasted as is.
I initially arrived at this page with a very similar (but not identical) problem. I am posting the solution here for the benefit of those who might be visiting with the same issue.
Every time I would save, close, and then re-open my .XLSX spreadsheet in OpenOffice, it would delete the spaces I had entered between text. For example:
"Did not attend" would become "Didnotattend".
"John DOE" would become "JohnDOE", etc.
Specifying "text" (#) as the format (as recommended above) did not help me, unfortunately.
What ultimately did solve it was saving the file as an .ODS instead of .XLSX.
Simply put the character ' before the text, e.g. '0.1.16, and Calc will interpret it as text data.
My issue was with currency: properly formatted entries would change to a much larger number if the digits entered could represent a date, such as 4.22 becoming $42,482. I discovered that adding a trailing zero solved the problem.
I had pasted numbers from another site and they kept coming up as dates. I just messed around and hit the arrow on the paste board, which gave me the option of unformatted text or HTML format. I selected unformatted, a window opened to show me the text I wanted, and I pressed OK.

Find, move, replace and parse strings simultaneously while building an .xml playlist file

I get many videos and I need to compile functioning .xml playlist files where they are all listed, including snapshot jpg's. Videos and snapshot images are named automatically. So I end up with lots of files like this:
hxxp://site.com/video/_5712.480p.flv
hxxp://site.com/video/_5712.480p.jpg
hxxp://site.com/video/_5713.480p.flv
hxxp://site.com/video/_5713.480p.jpg
So with these files I need to produce an .xml file looking something like this:
....
<track>
<title>5712.480p</title>
<creator>Whatever_5712.480p</creator>
<info>hxxp://site.com/video/_5712.480p.jpg</info>
<annotation>Playlist marked_480p</annotation>
<location>hxxp://site.com/video/_5712.480p.flv</location>
<image>hxxp://site.com/video/_5712.480p.jpg</image>
</track>
<track>
<title>5713.480p</title>
<creator>Whatever_5713.480p</creator>
<info>hxxp://site.com/video/_5713.480p.jpg</info>
<annotation>Playlist marked_480p</annotation>
<location>hxxp://site.com/video/_5713.480p.flv</location>
<image>hxxp://site.com/video/_5713.480p.jpg</image>
</track>
So I guess I might be looking at some advanced sed/awk procedure to copy, move and place the right strings inside the correct brackets, and to compile one whole file? I really appreciate all the help I can get on this one. Thx
With that input, you can do something like:
awk 'NR%2==1 && /\.flv$/ {FLVFILE=$0}
NR%2==0 { print "whateverXMLtags" FLVFILE "whatanotherXMLtags" $0 "someotherXMLtags" }' INPUTFILELIST
This assumes that the .flv files are on odd-numbered lines (as in the listing above); it saves each .flv name and, on every even line (the matching .jpg), prints the desired output. Note that the SPACE between e.g. FLVFILE and "whatanotherXMLtags" concatenates the strings.
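Filling in the actual tags from the example output above, a fuller sketch might look like this. It assumes the URL list is saved as playlist.txt (an assumed filename), that each .flv line is immediately followed by its matching .jpg line, and it derives the title from the last path component:
awk -F/ '
/\.flv$/ { flv = $0 }                                # remember the .flv URL
/\.jpg$/ {                                           # its .jpg follows next
    jpg = $0; name = $NF                             # e.g. _5712.480p.jpg
    sub(/^_/, "", name); sub(/\.jpg$/, "", name)     # -> 5712.480p
    print "<track>"
    print "  <title>" name "</title>"
    print "  <creator>Whatever_" name "</creator>"
    print "  <info>" jpg "</info>"
    print "  <annotation>Playlist marked_480p</annotation>"
    print "  <location>" flv "</location>"
    print "  <image>" jpg "</image>"
    print "</track>"
}' playlist.txt > playlist.xml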

Find Lines with N occurrences of a char

I have a txt file that I'm trying to import as a flat file into SQL2008, and it looks like this:
"123456","some text"
"543210","some more text"
"111223","other text"
etc.
The file has more than 300,000 rows and the text fields are large (usually 200-500 chars), so scanning the file by hand is very time-consuming and prone to error. Other similar (and even more complex) files were successfully imported.
The problem with this one is that some lines contain quotes inside the text (this came from an export from an old SuperBase DB that didn't let you specify a text qualifier; there's nothing I can do with the file other than clean it up and try to import it).
So the "offending" lines look like this:
"123456","this text "contains" a quote"
"543210","And the "above" text is bad"
etc.
You can see the problem here.
Now, 300,000 lines is not too much; if I could find the offending lines using a text editor that supports regex, I'd manually remove the quotes from each one. The problem is not the number of offending lines, but the impossibility of finding them with a simple search. I'm sure there are fewer than 500, but spread those through a 300,000-line txt file and you know what I mean.
Based upon that, what would be the best regex I could use to identify these lines?
My first thought is: tell me which lines contain more than 4 quotes (").
But I couldn’t come up with anything (I’m not good at Regex beyond the basics).
This pattern, ^("[^"]+){4,}, will match lines containing more than 4 quotes.
You can experiment with replacing 4 with 5 or more, depending on your data.
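On the command line, the same check can be done with grep's extended-regex mode (file.txt is an assumed name for the exported file):
grep -nE '^("[^"]+){4,}' file.txt
The -n flag prints line numbers, so you can jump straight to each offending line.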
I think that you can be more direct with a Regex than you're planning to be. Depending on your dialect of Regex, something like this should do it:
^"\d+",".*".*"
You could also use a regex to remove the outside quotes and use a better delimiter instead. For example, search for ^"([0-9]+)","(.*)"$ and replace it with \1+++++DELIM+++++\2.
Of course, this doesn't directly answer your question, but it might solve the problem.
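A minimal sed sketch of that replacement, assuming the export is in file.txt and writing the result to fixed.txt (both assumed filenames); because .* is greedy, any inner quotes stay inside the second capture group:
sed -E 's/^"([0-9]+)","(.*)"$/\1+++++DELIM+++++\2/' file.txt > fixed.txt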