Python 2.7 / NLTK: remove part of a string before a certain character

I am reading a CSV file containing 371 lines of text. Each line looks like this:
0þ"Text including numbers and quotes"þ4.6
I am trying to extract the text between the þ" and "þ. How can I do this?

awk -F'þ"|"þ' '{print $2}' data.csv
The above command prints the second column of each row in data.csv,
where columns are separated by either þ" or "þ.

Thanks all!
Both your responses helped me find the solution:
test = sent[sent.index('þ"') + len('þ"'):sent.index('"þ')]  # skip past the opening delimiter itself
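Since the question is tagged python-2.7, here is a minimal sketch of the same idea applied to the whole file; the filename data.csv comes from the question, and it assumes the script's source encoding matches the file's so that 'þ"' matches the text as read:

with open('data.csv') as f:
    for sent in f:
        start = sent.index('þ"') + len('þ"')  # first character after the opening delimiter
        end = sent.index('"þ')                # position of the closing delimiter
        print(sent[start:end])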

Related

Adding commas when necessary to a csv file using regex

I have a csv file like the following:
entity_name,data_field_name,type
Unit,id
Track,id,LONG
The second row is missing a comma. I wonder if there is some regex- or awk-like tool that could append commas to the end of a line wherever they are missing in these rows?
Update
I know the requirements are a little vague. There might be several alternative ways to narrow down the requirements such as:
The header row should define the number of columns (and commas) that is valid for the whole file. The script should read the header row first and find out the correct number of columns.
The number of columns might be passed as an argument to the script.
The number of columns can be hardcoded into the script.
I didn't narrow down the requirements at first because I was ok with any of them. Of course, the first alternative is the best but I wasn't sure if this was easy to implement or not.
Thanks for all the great answers and comments. Next time, I will state acceptable alternative requirements explicitly.
You can use this awk command to pad every row after the header with empty cells, based on the number of columns in the header row, so the column count is not hard-coded. Assigning $nc=$nc forces awk to extend any short record to nc fields, rebuilding it with OFS:
awk 'BEGIN{FS=OFS=","} NR==1{nc=NF} NF{$nc=$nc} 1' file
entity_name,data_field_name,type
Unit,id,
Track,id,LONG
Earlier solution:
awk 'BEGIN{FS=OFS=","} NR==1{nc=NF} {printf "%s", $0;
for (i=NF+1; i<=nc; i++) printf "%s", OFS; print ""}' file
I would use sed,
sed 's/^[^,]*,[^,]*$/&,/' file
Example:
$ echo 'Unit,id' | sed 's/^[^,]*,[^,]*$/&,/'
Unit,id,
$ echo 'Unit,id,bar' | sed 's/^[^,]*,[^,]*$/&,/'
Unit,id,bar
Try this:
$ awk -F , 'NF==2{$2=$2","}1' file
Output:
entity_name,data_field_name,type
Unit,id,
Track,id,LONG
With another awk:
awk -F, 'NF==2{$3=""}1' OFS=, yourfile.csv
To give some balance to all the awk solutions, the following is a vim-only solution:
:v/,.*,/norm A,
Rationale:
/,.*,/ searches for two commas in a line
:v applies a global command to each line NOT matching the search
norm A, enters normal mode and appends a , to the end of the line
This MIGHT be all you need, depending on the info you haven't shared with us in your question:
$ awk -F, '{print $0 (NF<3?FS:"")}' file
entity_name,data_field_name,type
Unit,id,
Track,id,LONG

AWK If pattern not found in one out of two similar lines, print line with pattern found

Good day,
Problem
There are two lines, and the field separator is a comma. Therefore, each line has 6 fields.
abc,def,ghi,jkl,mno,pqr
abc,def,ghi,jkl,,pqr
Target
If field five is empty, don't print that line.
Expected Output
abc,def,ghi,jkl,mno,pqr
So far I have done
awk '{print ($5=="")?:$5}' file
Thanks so much in advance for giving me any clue.
awk -F',' '$5!=""' file
It can be much simpler:
awk -F, '$5' file
i.e. print any line which has non-empty $5.
Two more awks, which won't break if the field just contains 0:
awk -F, '$5~/./' file
awk -F, 'x!=$5' file
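If this is part of a larger Python script, the same filter is only a few lines; it keeps a line exactly when the fifth comma-separated field is non-empty, so a field containing just 0 is also kept:

import sys

with open('file') as f:
    for line in f:
        if line.rstrip('\n').split(',')[4] != '':  # field five (index 4)
            sys.stdout.write(line)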

Reorder columns in a CSV file by number

I am trying to replace the second column with the second-to-last column and also remove the last three columns. For example, I have this sample.csv:
1,2,3,4,5,6
a,b,c,d,e,f
g,h,i,j,k,l
I want to output:
1,5,3
a,e,c
g,k,i
I am using this command:
awk 'BEGIN{FS=OFS=","} {$2=$(NF-1); NF=NF-3} 1' sample.csv
which works perfectly when I view the CSV file in Excel. However, when I look at the .csv file in Notepad, I notice that the last item on one row runs into the first item on the next row, so I am getting
1,5,3a,e,cg,k,i
Can anyone give me any advice on how to fix the problem, so that the .csv file has a line break after each row, like the desired output? Thanks.
Adding a carriage return (\r) to the end of each line should help, since Notepad only treats CRLF (\r\n) as a line break:
awk 'BEGIN{FS=OFS=","} {$2=$(NF-1); NF=NF-3; sub(/$/,"\r")} 1' sample.csv
Code for GNU sed (each \w matches a single word character, so this relies on the fields being one character wide, as in the sample):
sed -r 's/(\w,)\w,(\w),\w,(\w,)\w/\1\3\2/' file
$ cat file
1,2,3,4,5,6
a,b,c,d,e,f
g,h,i,j,k,l
$ sed -r 's/(\w,)\w,(\w),\w,(\w,)\w/\1\3\2/' file
1,5,3
a,e,c
g,k,i
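A Python sketch of the same transformation using the csv module, which also sidesteps the Notepad problem because csv.writer emits \r\n line endings by default (the file names are placeholders; on Python 3, open the output with newline='' so those endings are written unchanged):

import csv

with open('sample.csv') as src, open('out.csv', 'w') as dst:
    writer = csv.writer(dst)       # default lineterminator is '\r\n'
    for row in csv.reader(src):
        row[1] = row[-2]           # replace column 2 with the second-to-last column
        writer.writerow(row[:-3])  # drop the last three columns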

How can I remove all characters in each line after the first space in a text file?

I have a large log file from which I need to extract file names.
The file looks like this:
/path/to/loremIpsumDolor.sit /more/text/here/notAlways/theSame/here
/path/to/anotherFile.ext /more/text/here/differentText/here
.... about 10 million times
I need to extract the file names like this:
loremIpsumDolor.sit
anotherFile.ext
I figure my first strategy is to find/replace all occurrences of /path/to/ with ''. But I'm stuck on how to remove all characters after the space.
Can you help?
sed 's/ .*//' file
It doesn't take any more than that. The transformed output appears on standard output, of course.
In theory, you could also use awk to grab the filename from each line as:
awk '{ print $1 }' input_file.log
That, of course, assumes that there are no spaces in any of the filenames. awk defaults to looking for whitespace as the field delimiters, so the above snippet would take the first "field" from your log file (your filename) for each line, and output it.
Pass it to cut:
cut '-d ' -f1 yourfile
a bash-only solution:
while read path otherstuff; do
    echo "${path##*/}"
done < filename
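A Python equivalent may be handy for a 10-million-line log if you want to do more per line; os.path.basename() also strips the leading directories, giving just the file name:

import os

with open('input_file.log') as f:
    for line in f:
        path = line.split(' ', 1)[0]   # everything before the first space
        print(os.path.basename(path))  # e.g. loremIpsumDolor.sit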

How to join first n lines in a file

I am trying to clean up some data, and I would eventually like to put it in CSV form.
I have used some regular expressions to clean it up, but I'm stuck on one step.
I would like to replace all but every third newline (\n) with a comma.
The data looks like this:
field1
field2
field3
field1
field2
field3
etc..
I need it in
field1,field2,field3
field1,field2,field3
Anyone have a simple way to do this using sed or awk? I could write a program and use a loop with a mod counter to erase every 1st and 2nd newline char, but I'd rather do it from the command line if possible.
With awk:
awk '{n2=n1;n1=n;n=$0;if(NR%3==0){printf"%s,%s,%s\n",n2,n1,n}}' yourData.txt
This script saves the last three lines and prints them at every third line. Unfortunately, it works only with files whose line count is a multiple of 3.
A more general script is:
awk '{l=l$0;if(NR%3==0){print l;l=""}else{l=l","}}END{if(l!=""){print substr(l,1,length(l)-1)}}' yourData.txt
In this case, lines are concatenated into a single string, with a comma appended whenever the line number is not a multiple of 3. At the end of the file, if the string is not empty, it is printed with its trailing comma removed.
Awk version:
awk '{if (NR%3==0){print $0;}else{printf "%s,", $0;}}'
A Perl solution that's a little shorter and that handles files that don't have a multiple of 3 lines:
perl -pe 's/\n/,/ if(++$i%3&&! eof)' yourData.txt
cat file | perl -ne 'chomp(); print $_, !(++$i%3) ? "\n" : ",";'
Use nawk or /usr/xpg4/bin/awk on Solaris:
awk 'ORS=NR%3?OFS:RS' OFS=, infile
This might work for you:
paste -sd',,\n' file
or this:
sed '$!N;$!N;y/\n/,/' file
vim version:
:1,$s/\n\(.*\)\n\(.*\)\n/,\1,\2\r/g
awk '{ORS=NR%3?",":"\n";print}' urdata.txt
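And if you'd rather do it from a script after all, a minimal Python sketch of the same grouping; like the more general awk script above, it still prints a final short group when the line count isn't a multiple of 3:

buf = []
with open('yourData.txt') as f:
    for line in f:
        buf.append(line.rstrip('\n'))
        if len(buf) == 3:
            print(','.join(buf))
            buf = []
if buf:  # leftover lines when the count isn't a multiple of 3
    print(','.join(buf))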