Ruby - separating excel data contained in one column into individual columns [duplicate] - regex

This question already has answers here:
Split string by multiple delimiters
(6 answers)
Closed 8 months ago.
I'm trying to use Ruby to manipulate some Excel data, but the .csv files I'm given have all of the data in one column.
The data has the headers and values separated by commas, but they are contained within the first column. Also, some of the values within the first column have text surrounded by quotes, with commas inside the quotes.
Is there a way to separate the data within the first column into separate columns with Ruby?
I know you can do this in Excel, but I'd like to be able to do it in Ruby so I don't have to correct every .csv file manually.
I've included an example of the .csv file below.
The desired output would be:
{:header 1 => integer,
 :header 2 => text,
 :header 3 => "this text, has a comma within the quote",
 :header 4 => integer}
I appreciate the help.

Here's one crude way to do it:
require 'csv'

result = []
csv = CSV.read('./file.csv')
headers = csv.shift
csv.each do |l|
  hash = {}
  hash[headers[0]] = l[0]
  hash[headers[1]] = l[1]
  hash[headers[2]] = l[2]
  hash[headers[3]] = l[3]
  result << hash
end
p result
[{"header 1"=>"integer",
"header 2"=>"text",
"header 3"=>"this text, has a comma within the quote",
"header 4"=>"integer"},
{"header 1"=>"integer",
"header 2"=>"text",
"header 3"=>"this text, has a comma within the quote",
"header 4"=>"integer"}]
This of course assumes that every row has 4 values.
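If your files don't always have exactly 4 values per row, CSV's built-in header support avoids the hard-coded indexes entirely. A minimal sketch of that alternative, assuming the same ./file.csv path:

require 'csv'

# headers: true makes each row a CSV::Row keyed by header name, so quoted
# fields with embedded commas and any number of columns are handled for us.
result = CSV.read('./file.csv', headers: true).map(&:to_h)
p result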
Edit: Here is an example of actually writing the result to a file:
CSV.open('./output.csv', 'wb') do |csv|
  result.each do |hash|
    temp = []
    hash.each do |key, value|
      temp << "#{key} => #{value}"
    end
    csv << temp
  end
end
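Note that this writes literal "key => value" strings into each cell. If you'd rather produce a conventional CSV with a single header row, a sketch (assuming every hash in result shares the same keys):

CSV.open('./output.csv', 'wb') do |csv|
  csv << result.first.keys # header row
  result.each { |hash| csv << hash.values }
end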

Related

Skip translating words with %% and [] in Google Sheets

I am using GOOGLETRANSLATE() with Sheets to translate some contents into different languages. In those contents, we have multiple hooks [ ] and % % in one string that should not be translated. Example:
[name] [surname] looked at your profile %number% !
I do not need to translate hooks like [username] and %number%.
I'm looking for:
[name] [surname] a regardé ton profil %number% ! (in french for example)
A solution is already provided here for one of the symbols using REGEXREPLACE and REGEXEXTRACT, but I need both [xxx] and %xxx% handled in one formula. Thank you.
Alternatively, instead of using GOOGLETRANSLATE with multiple nested functions, you can create a script bound to your spreadsheet file and copy/paste the simple custom script below, which contains a translate() function, for more convenient use on your sheet:
CUSTOM SCRIPT
function translate(range) {
  var container = [];
  // KEEP ALL %***% AND [***] IN A CONTAINER
  var regex = /(\[.*?])|(\%.*?%)/gm,
      stringTest = range,
      matched;
  while (matched = regex.exec(stringTest)) {
    container.push(matched[0]);
  }
  // TRANSLATE TEXT TO FRENCH FROM ENGLISH WITHOUT %***% AND [***]
  var replacedData = stringTest.replace(regex, '#');
  var toTranslate = LanguageApp.translate(replacedData, 'en', 'fr');
  var res = "";
  // REARRANGE THE TRANSLATED TEXT WITH %***% AND [***] FROM THE CONTAINER
  var parts = toTranslate.split("#");
  for (var x = 0; x < parts.length; x++) {
    res = res + parts[x] + " " + container[x];
  }
  // RETURN FINAL TRANSLATED TEXT WITH UNMODIFIED %***% AND [***]
  return res.trim().replace("undefined", "");
}
SAMPLE RESULT
After saving the script, simply put =translate(A1) in a sheet cell (e.g. if the text you want to translate is in cell A1). The script will skip any words inside [***] and %***% and translate only the rest of the text to French.
Try this:
=arrayformula(if(A1<>"",join("",if(isnumber(flatten(split(GOOGLETRANSLATE(join(" ",iferror(regexreplace(to_text(flatten(split(A1," "))),"(\[.*\])|(\%.*\%)","["&row(A$1:A)&"]"),)),"en","fr"),"[]"))),vlookup(flatten(split(GOOGLETRANSLATE(join(" ",iferror(regexreplace(to_text(flatten(split(A1," "))),"(\[.*\])|(\%.*\%)","["&row(A$1:A)&"]"),)),"en","fr"),"[]")),{sequence(len(regexreplace(A1,"[^\ ]",))+1,1),flatten(split(A1," "))},2,false),flatten(split(GOOGLETRANSLATE(join(" ",iferror(regexreplace(to_text(flatten(split(A1," "))),"(\[.*\])|(\%.*\%)","["&row(A$1:A)&"]"),)),"en","fr"),"[]")))),))
GOOGLETRANSLATE does not work with ARRAYFORMULA, but you can drag this formula down from cell B1 if you want to apply it to multiple rows in column A.
Individual steps taken:
Split text by space character, then flatten into one column.
Cell D1: =flatten(split(A1," "))
Replace [***] and %***% with [row#].
Cell E1: =arrayformula(iferror(regexreplace(to_text(flatten(split(A1," "))),"(\[.*\])|(\%.*\%)","["&row(A$1:A)&"]"),))
Join the rows into one cell.
Cell F1: =join(" ",E:E)
Apply Google Translate.
Cell G1: =GOOGLETRANSLATE(F1,"en","fr")
Split by [].
Cell H1: =flatten(split(G1,"[]"))
Where rows contain numbers, look up the corresponding original word from step 1.
Cell I1: =arrayformula(if(isnumber(H1:H),vlookup(H1:H,{row(A$1:A),D:D},2,false),H1:H))
Join the rows into one cell.
Cell J1: =join(" ",I:I)

Google Sheets: How can I extract partial text from a string based on a column of different options?

Goal: I have a bunch of keywords I'd like to categorise automatically based on topic parameters I set. Categories that match must be in the same column so the keyword data can be filtered.
e.g. If I have "Puppies" as a first topic, it shouldn't appear as a secondary or third topic otherwise the data cannot be filtered as needed.
Example Data: https://docs.google.com/spreadsheets/d/1TWYepApOtWDlwoTP8zkaflD7AoxD_LZ4PxssSpFlrWQ/edit?usp=sharing
Video: https://drive.google.com/file/d/11T5hhyestKRY4GpuwC7RF6tx-xQudNok/view?usp=sharing
Parameters Tab: I will add words in columns D-F that change based on the keyword data set and there will often be hundreds, if not thousands, of options for larger data sets.
Categories Tab: I'd like to have a formula or script that goes down the columns D-F in Parameters and fills in a corresponding value (in Categories! columns D-F respectively) based on partial match with column B or C (makes no difference to me if there's a delimiter like a space or not. Final data sheet should only have one of these columns though).
Things I've Tried:
I've tried a bunch of things. A nested IF formula with REGEXMATCH works but seems clunky.
e.g. this formula in Categories! column D
=IF(REGEXMATCH($B2,LOWER(Parameters!$D$3)),Parameters!$D$3,IF(REGEXMATCH($B2,LOWER(Parameters!$D$4)),Parameters!$D$4,""))
I nested more statements, changing out to the next cell in the Parameters!D column (as in, manually adding $D$5, $D$6, etc.), but this seems inefficient for a list thousands of words long; e.g. the third topic will get very long once all dog breed types are added.
Any tips?
Functionality I haven't worked out:
If a string in Categories B or C contains more than one topic from the parameters I set out, is there a way I can have the first two show instead of just the first one?
e.g. for cell A14 in Categories, how can I get a formula/automation to add both "Akita" and "German Shepherd" into the third topic? Concatenation with a CHAR(10) to add a new line is the ideal format here. There will be other keywords that won't have both, in which case these values will just show up individually.
Since this data set has a bunch of mixed breeds and all breeds are added as a third topic, it would be great to differentiate interest in mixes vs pure breeds without confusion.
Any ideas will be greatly appreciated! Also, I'm open to variations in layout and functionality of the spreadsheet in case you have a more creative solution. I just care about efficiently automating a tedious task!!
Try using a custom function:
To create the custom function:
1. Create or open a spreadsheet in Google Sheets.
2. Select the menu item Tools > Script editor.
3. Delete any code in the script editor and copy and paste the code below into the script editor.
4. At the top, click Save.
To use the custom function:
1. Click the cell where you want to use the function.
2. Type an equals sign (=) followed by the function name and any input value (for example, =DOUBLE(A1)) and press Enter.
3. The cell will momentarily display Loading..., then return the result.
Code:
function matchTopic(p, str) {
  var params = p.flat(); // Convert 2d array into 1d
  var buildRegex = params.map(i => '(' + i + ')').join('|'); // Convert array into a series of capturing groups, e.g. (Dog)|(Puppies)
  var regex = new RegExp(buildRegex, "gi");
  var results = str.match(regex);
  if (results) {
    // The loops below convert the first character of each word to uppercase
    for (var i = 0; i < results.length; i++) {
      var words = results[i].split(" ");
      for (let j = 0; j < words.length; j++) {
        words[j] = words[j][0].toUpperCase() + words[j].substr(1);
      }
      results[i] = words.join(" ");
    }
    return results.join(","); // Return with comma separator
  } else {
    return ""; // Return blank if result is null
  }
}
Example usage (screenshots of the Parameters sheet and the resulting First/Second/Third Topic columns omitted).
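A hypothetical call, assuming the first-topic words live in Parameters!D3:D6 and the keyword text is in B2 (adjust both ranges to your sheet):
=matchTopic(Parameters!$D$3:$D$6, $B2)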
Reference:
Custom Functions
I've added a new sheet ("Erik Help") with separate formulas (highlighted in green currently) for each of your keyword columns. They are each essentially the same except for specific column references, so I'll include only the "First Topic" formula here:
=ArrayFormula({"First Topic";IF(A2:A="",,IFERROR(REGEXEXTRACT(LOWER(B2:B&C2:C),JOIN("|",LOWER(FILTER(Parameters!D3:D,Parameters!D3:D<>""))))) & IFERROR(CHAR(10)&REGEXEXTRACT(REGEXREPLACE(LOWER(B2:B&C2:C),IFERROR(REGEXEXTRACT(LOWER(B2:B&C2:C),JOIN("|",LOWER(FILTER(Parameters!D3:D,Parameters!D3:D<>""))))),""),JOIN("|",LOWER(FILTER(Parameters!D3:D,Parameters!D3:D<>""))))))})
This formula first creates the header (which can be changed within the formula itself as you like).
The opening IF condition leaves any row in the results column blank if the corresponding cell in Column A of that row is also blank.
JOIN is used to form a concatenated string of all keywords separated by the pipe symbol, which REGEXEXTRACT interprets as OR.
IFERROR(REGEXEXTRACT(LOWER(B2:B&C2:C),JOIN("|",LOWER(FILTER(Parameters!D3:D,Parameters!D3:D<>""))))) will attempt to extract any of the keywords from each concatenated string in Columns B and C. If none is found, IFERROR will return null.
Then a second-round attempt is made:
& IFERROR(CHAR(10)&REGEXEXTRACT(REGEXREPLACE(LOWER(B2:B&C2:C),IFERROR(REGEXEXTRACT(LOWER(B2:B&C2:C),JOIN("|",LOWER(FILTER(Parameters!D3:D,Parameters!D3:D<>""))))),""),JOIN("|",LOWER(FILTER(Parameters!D3:D,Parameters!D3:D<>"")))))
Only this time, REGEXREPLACE is used to replace the results of the first round with null, thus eliminating them from being found in round two. This will cause any second listing from the JOIN clause to be found, if one exists. Otherwise, IFERROR again returns null for round two.
CHAR(10) is the new-line character.
I've written each of the three formulas to return up to two results for each keyword column. If that is not your intention for "First Topic" and "Second Topic" (i.e., if you only wanted a maximum of one result for each of those columns), just select and delete the entire round-two portion of the formula shown above from the formula in each of those columns.

How to manipulate multiple csv files with raw data starting from different row for each file?

I would like to format multiple .csv files; some of them have summaries before the raw data. The raw data can start at any row, but if "colname" is found at some row then the raw data starts there. I am using the standard library csv module to read the files, check whether "colname" exists, and extract the data from there. With the code below, print(data) always gives me data from the first row of the file, but I want to pull the data starting from where "colname" is found. If "colname" is not found I don't want to read the data.
import csv
import os

root_dir = r"folder1"
for fname in os.listdir(root_dir):
    file_path = os.path.join(root_dir, fname)
    if fname.endswith('.csv'):
        n = 0
        with open(file_path, 'rU') as fp:
            csv_reader = csv.reader(fp)
            while True:
                for line in csv_reader:
                    if line == " colname": continue
                    n = n + 1
                    data = line
                    print(data)
Your code's logic only skips lines that are exactly " colname", which has two problems:
1. You want to skip lines until AFTER you have seen "colname"; you could use a boolean variable to distinguish between these two situations, as in the sketch below.
2. It's not clear that your test for "colname" is correct; for example, if there isn't exactly one leading space, or the line has a trailing end-of-line character, the comparison would fail. (Note also that csv.reader yields each line as a list of fields, not a string, so comparing it to a string will never be true.)
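A minimal sketch of the boolean-flag approach, assuming Python 3 and that the header row's first cell contains "colname" (the folder name and the exact test are assumptions; adjust them to your real files):

import csv
import os

root_dir = r"folder1"  # assumed folder name, as in the question
for fname in os.listdir(root_dir):
    if not fname.endswith('.csv'):
        continue
    file_path = os.path.join(root_dir, fname)
    with open(file_path, newline='') as fp:
        seen_colname = False  # flips to True once the header row is found
        for row in csv.reader(fp):
            if not seen_colname:
                # csv.reader yields lists of fields; strip whitespace so
                # " colname" and "colname\n" variants still match
                if row and row[0].strip() == "colname":
                    seen_colname = True  # data starts on the next row
                continue
            print(row)  # raw data rows only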

Merge CSV row with a string match from a 2nd CSV file

I'm working with two large files, approximately 100K+ rows each. I want to search CSV file #1 for a string contained in CSV file #2, then join another string from file #1 to the row in file #2 based on the match criteria. Here's an example of the data I'm working with and my expected output:
File#1: the string to be matched in file#2 is the 2nd element; the 1st element (an integer) is to be appended to each matched row in file#2.
row 1:
3604430123,mta0000cadd503c.mta.net
row 2:
3604434567,mta0000CADD5638.MTA.NET
row 3:
3606304758,mta00069234e9a51.DT.COM
File#2:
row 1:
4246,211-015617,mta0000cadd503c.mta.net,old,NW MG2,BBand2 ESA,Active
row 2:
7251,ACCOUNT,mta0000CADD5638.MTA.NET,FQDN ,NW MG2,BBand2 ESA,Active
row 3:
536887946,874-22558501,mta00069234e9a51.DT.COM,"P",NW MG2,BBand2 ESA,Active
Desired output, joining the integer string from file#1 to the entire row in file#2 based on the string match between file#1 and file#2:
row 1:
4246,211-015617,mta0000cadd503c.mta.net,old,NW MG2,BBand2 ESA,Active,3604430123
row 2:
7251,ACCOUNT,mta0000CADD5638.MTA.NET,FQDN ,NW MG2,BBand2 ESA,Active,3604434567
row 3:
536887946,874-22558501,mta00069234e9a51.DT.COM,"P",NW MG2,BBand2 ESA,Active,3606304758
There are many instances where the case of the match string in file#1 doesn't match the case in file#2, though the characters match, so case can be ignored for the match criteria. The character case does need to be preserved in file#2 after it is appended with the integer string from file#1.
I'm a Python newb and I've been at this for a while and have scoured posts on SE, but can't seem to come up with working code that gets me to the point where I can just print out a line from file#2 that has been matched on the string in file#1. I've tried a few other methods, such as writing to a dictionary, using DictReader, etc., but haven't been able to clear what appear to be simple errors in those methods. So I tried to strip this down to simple lists, get to the point where I can use a list comprehension to combine the data, and then write that back to a file named output, which will eventually be written back to a csv file. Any help or suggestions would be greatly appreciated.
import csv

sg = []
fqdn = []
output = []

with open(r'file2.csv', 'rb') as src:
    read = csv.reader(src, delimiter=',')
    for row in read:
        sg.append(row)

with open(r'file1.csv', 'rb') as src1:
    read1 = csv.reader(src1, delimiter=',')
    for row in read1:
        fqdn.append(row)

output = output.append([s[0] for s in sg if fqdn[1] in sg])
print output
Result after running this is:
None
Process finished with exit code 0
You should use a dictionary for file#1 rather than just a list, as matching is easier. Turn fqdn into a dict and, in your loop reading file#1, set key-value pairs on the dict. I would call .lower() on the match key. This lower-cases the key, so later you only have to check whether the lower-cased version of the field in file#2 is a key in the dictionary:
import csv

sg = []
fqdn = {}
output = []

with open(r'file2.csv', 'rb') as src:
    read = csv.reader(src, delimiter=',')
    for dataset in read:
        sg.append(dataset)

with open(r'file1.csv', 'rb') as src1:
    read1 = csv.reader(src1, delimiter=',')
    for to_append, to_match in read1:
        fqdn[to_match.lower()] = to_append

for dataset in sg:
    to_append = fqdn.get(dataset[2].lower())  # If the key matched, to_append now contains the string to append, else it becomes None
    if to_append:
        dataset.append(to_append)  # Append the field
    output.append(dataset)  # Append the row to the result list

print(output)
You can then use csv.writer to create a csv file from the result.
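For example, a minimal sketch (assuming Python 3, hence 'w' with newline='' rather than the Python 2-style 'rb'/'wb' modes used above):

import csv

# 'output' is the list of rows built above
with open('output.csv', 'w', newline='') as dst:
    writer = csv.writer(dst)
    writer.writerows(output)  # one CSV line per row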
Here's a brute-force solution to this problem. For every line of the first file, you search through every line of the second file until you find a match. The matched lines are written out to the output.csv file in the format you specified, using the csv writer.
import csv

with open('file1.csv', 'r') as file1:
    with open('file2.csv', 'r') as file2:
        with open('output.csv', 'w') as outfile:
            writer = csv.writer(outfile)
            reader1 = csv.reader(file1)
            reader2 = csv.reader(file2)
            for row in reader1:
                if not row:
                    continue
                for other_row in reader2:
                    if not other_row:
                        continue
                    # if we found a match, let's write it to the csv file with the id appended
                    if row[1].lower() == other_row[2].lower():
                        new_row = other_row
                        new_row.append(row[0])
                        writer.writerow(new_row)
                        continue
                # reset file pointer to beginning of file
                file2.seek(0)
You might be tempted to store the information in a data structure before writing it out to a file. In my experience, you always end up getting larger files in the future and may run into memory issues. I like to write things out to file as I find the matches in order to avoid this problem.
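The two answers combine naturally if file#1 comfortably fits in memory: load it into a dict, then stream file#2 and write each matched row as soon as it is found. A sketch, assuming Python 3 and the column positions from the question:

import csv

# Build a lookup from file#1: lower-cased match string -> integer string
with open('file1.csv', newline='') as f1:
    lookup = {row[1].lower(): row[0] for row in csv.reader(f1) if row}

# Stream file#2, appending the matched integer and writing immediately
with open('file2.csv', newline='') as f2, \
        open('output.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    for row in csv.reader(f2):
        if not row:
            continue
        to_append = lookup.get(row[2].lower())
        if to_append is not None:
            writer.writerow(row + [to_append])

This keeps the dictionary's constant-time matching while preserving the write-as-you-go behavior.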

Remove a line from list and all successive lines to N?

I have a list in R which is a set of lines from a relatively unstructured document that I am scraping for data. At the top of each page is a page number, preceded by the string "Page", and several lines of header information which I would like to drop.
Each document has a different number of header lines. My solution so far:
RawFeed.1 <- grep("Page", RawFeed)
RawFeed.1a <- length(RawFeed.1)
RawFeed.1 <- RawFeed.1[-1]
Note the first instance is dropped here because the first page always has more header lines than the rest of the pages, and it's dropped later anyway.
y <- RawFeed.1[1]
ya <- c(y:length(RawFeed))
NSearch <- RawFeed[ya]
NSearch.1 <- grep("Start", NSearch)
y1 <- NSearch.1[1]
y1 <- y1 - 1
y2 <- c(0:y1)
As "Start" is always found on the line before the data begins, this consistently gives me the document-specific number of header lines.
Next I attempt to remove them by:
PageBreak <- function(y) {
  RawFeed <- RawFeed[-x-y]
}
RawFeedTemp<-lapply(RawFeed.1,PageBreak,y=y2)
Which does work, sort of - I am left with an array such that RawFeedTemp[[n]] has the header information removed only for that page.
So how can I perform a similar operation where I am left with a list where each page's header information has been removed? Or is there a way to combine the elements in the array such that it contains only one set of lines, excluding those I am trying to remove?
Edit: An example of the data
[306] N 46 10/08/12 10/08/12 Stuff :30 NM 0 $0.00"
[307] Week: 10/08/12 10/14/12 Other Stuff $6,500.00 0.00
[308] " Contract Agreement Between: Print Date 10/05/12 Page 5 of 6"
[309] ""
[310] ""
[311] " Contract / Revision Alt Order #"
[312] " Person
[313] " Address 1
[314] " Address 2
[315] " Address 3
[316] " Address 4
[317] ""
[318] " Original Date / Revision"
[319] ""
[320] "08/10/12 / 10/04/12"
[321] ""
[322] ""
[323] ""
[324] "* Line Ch Start Date End Date Description Start
[325] MORE DATA
Another file might have a different number of these headers. Also note that records occupy more than one line; most files finish a record before starting a new page, but a few insist on pushing the second line of the record to a new page, which is why I need to remove them all.
Thanks for your help!
Since you don't give a clear example of your data, I am not sure of the given solution.
If I understand correctly, you have a document with parts (headers) between 'Page' and 'start' that you want to remove. Here is a sample of your data with 2 headers:
str <- 'Page ...... ### header1
alalalala
lalalalalal
aalalala
lslslsls start ksksksks
keep me 1
keep me 2
Page ...... ### header 2
aalalala
lslslsls start ksksksks
keep me 3
keep me 4'
Here I am using readLines to read the document, finding the header lines using grep, and removing the joined line-index ranges from the list of lines:
ll <- readLines(textConnection(str))
# pair each "Page" line index with the following "start" line index
ids <- matrix(grep('Page|start', ll), ncol = 2, byrow = TRUE)
# expand each (Page, start) pair into a full sequence of line indices and drop them
ll[-unlist(apply(ids, 1, function(x) seq(x[1], x[2])))]
[1] "keep me 1" "keep me 2" "keep me 3" "keep me 4"