Grouping values with regular expressions

I have a list of names in an Excel sheet (also available as CSV), and I grouped the names by their origin.
This is what the groups I made look like.
Now I want to add a new column with the group name behind each name.
This is what I want to obtain.
How do I get this? Do I have to use regular expressions for this?

You don't need regex here. For instance, you can use Python's built-in csv module.
old.csv
groups,,,
Dutch,Lore,Kilian,Daan
German,Marte,,
USA,Eva,Judith,
Python script using the csv module:
import csv

rows = []
with open('old.csv', 'r', newline='') as old_csv:
    old = csv.reader(old_csv, delimiter=',')
    next(old)  # skip the header row
    for row in old:
        for name in row[1:]:
            if name:
                rows.append({'name': name, 'group': row[0]})

with open('new.csv', 'w', newline='') as new_csv:
    fieldnames = ['name', 'group']
    new = csv.DictWriter(new_csv, fieldnames=fieldnames)
    new.writeheader()
    new.writerows(rows)
new.csv
name,group
Lore,Dutch
Kilian,Dutch
Daan,Dutch
Marte,German
Eva,USA
Judith,USA
You can also use the xlrd and xlwt modules to work with the Excel file directly, but you would have to install them because they aren't part of the standard library.

django split data and apply search istartswith = query

I have a project where, when searching a query, I need to split the data (not the search query) into words and apply the search to each word.
For example:
my query is 'bot' (the user is typing 'bottle'),
but if I use meta_keywords__icontains = query, the filter will also return rows containing 'robot'.
Here meta_keywords are keywords that can be used for searching.
With meta_keywords__istartswith I won't be able to match the data when it is 'water bottle'. Is there any way I can handle this case?
What I need is to search every word of the data with just istartswith.
I could simply create a model for 'meta_keywords' and use the current data to assign values by splitting and saving them as separate records. I know that might be the best way, but I'd like to know other ways to achieve it.
You can filter the name field for entries where any word starts with the query, by anchoring the regex at the start of a word:
import re
from django.db.models import Q

instances = Model.objects.filter(Q(name__iregex=r'[[:<:]]' + re.escape(query)))
E.g. 'Hello world' can be found with either query 'hello' or 'world'. Unlike icontains, it doesn't match in the middle of a word.
Note: [[:<:]] is a POSIX-style start-of-word marker, so this depends on your database's regex engine (PostgreSQL and older MySQL versions support it; MySQL 8+ uses \b instead).
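To see the intended semantics without a database, here is a plain-Python illustration (not Django; the function name is made up) of matching a query against the start of any word, which is what the word-boundary lookup emulates at the database level:

```python
def word_istartswith(text, query):
    """Return True if any whitespace-separated word in `text`
    starts with `query`, case-insensitively."""
    return any(word.lower().startswith(query.lower()) for word in text.split())

print(word_istartswith("water bottle", "bot"))   # True: 'bottle' starts with 'bot'
print(word_istartswith("robot parts", "bot"))    # False: 'bot' only appears mid-word
```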

How to extract and route only specified columns from a CSV files and drop all other columns [duplicate]

This question already has an answer here:
How to extract a subset from a CSV file using NiFi
(1 answer)
Closed 4 years ago.
I want to extract a few fields along with their values from a CSV file and drop/delete all other fields in the file. Please help. I think we can use the RouteText processor. Please tell me how to write the regular expression for routing only the specified fields and dropping everything else. Thanks
Example: from the snapshot attached I only want to route the 'Firstname', 'Lastname' and 'Siblings' fields along with their values (each record/row), and delete the remaining columns like 'State', 'Age', 'Apt no', 'Country', 'Gender'.
Please tell me what the correct processor for this is and what configuration properties to use in order to achieve this. Thanks
Attaching snapshot for reference.
You can use ConvertRecord for this. Provide the full schema to the CSVReader, and provide a schema with only the fields you want to the CSVRecordSetWriter. If you don't know the input schema (but you know it includes at least the fields you want to send along), you can have the reader Use String Fields From Header; that will create an input schema from the header line and assume all fields are strings. The output schema, however, would have the selected fields along with their types, and ConvertRecord will handle the "deletion" of the other fields, as well as any conversion from String to the desired data type for each of the selected fields.
I think a regular expression is not the best solution here. This is how I would do it in PHP:
First open the csv:
$handle = fopen("test.csv", "r");
Read through the data (each fgetcsv call returns one row):
$data = fgetcsv($handle, 1000, ",");
Create a new header and rows from the existing $data with only the wanted fields.
Put the new data into a new csv:
$fp = fopen('file.csv', 'w');
foreach ($data as $fields) {
    fputcsv($fp, $fields);
}
fclose($fp);
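The same column-filtering idea can be sketched in Python with the standard csv module. The column names below follow the example in the question (Firstname, Lastname, Siblings) and the sample rows are made up; adjust both to your actual file:

```python
import csv
import io

KEEP = ['Firstname', 'Lastname', 'Siblings']

# In-memory stand-ins for the input and output files.
source = io.StringIO(
    "Firstname,Lastname,State,Age,Siblings\n"
    "John,Doe,NY,40,2\n"
    "Jane,Roe,CA,35,1\n"
)
out = io.StringIO()

reader = csv.DictReader(source)
writer = csv.DictWriter(out, fieldnames=KEEP, extrasaction='ignore')
writer.writeheader()
for row in reader:
    writer.writerow(row)  # columns not in KEEP are dropped by extrasaction='ignore'

print(out.getvalue())
```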

from Django forms to pandas DataFrame

I am very new to Django, but facing quite a daunting task already.
I need to create multiple forms like this on the webpage where the user would provide input (only floating-point numbers allowed) and then convert these inputs to a pandas DataFrame for data analysis. I would highly appreciate your advice on how I should go about doing this.
Form needed:
This is a very broad question, and I am assuming you are familiar with pandas and Python. There might be a more efficient way, but this is how I would do it. Have the user submit the form, then import pandas in your view and create an initial DataFrame. You can get the form data using something like this:
if form.is_valid():
field1 = form.cleaned_data['field1']
field2 = form.cleaned_data['field2']
field3 = form.cleaned_data['field3']
field4 = form.cleaned_data['field4']
You can then create a new DataFrame like so:
df2 = pd.DataFrame([[field1, field2], [field3, field4]], columns=list('AB'))
then append the second DataFrame to the first. Note that append returns a new DataFrame rather than modifying df in place, so reassign the result:
df = df.append(df2)
Keep iterating over the data in this fashion until you have added all of it. After it has all been appended you can do your analysis and whatever else you like. Note that you can append more than a 2-by-2 frame; that's just an example.
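A runnable sketch of this flow, with made-up form values. Note that DataFrame.append was deprecated and later removed from pandas, so newer versions use pd.concat for the same step:

```python
import pandas as pd

df = pd.DataFrame(columns=list('AB'))  # initial, empty frame

# Pretend these came from form.cleaned_data.
field1, field2, field3, field4 = 1.5, 2.0, 3.25, 4.75

df2 = pd.DataFrame([[field1, field2], [field3, field4]], columns=list('AB'))
df = pd.concat([df, df2], ignore_index=True)  # modern replacement for df.append(df2)

print(df)
```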
Pandas append docs:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.append.html
Django forms docs:
https://docs.djangoproject.com/en/2.0/topics/forms/
The docs are your friend.

REGEX in Pentaho to clean a join column in my data

I have been struggling with a certain column in my data: the source data is dirty and I can't find joins because of this.
What I am trying to do is:
Select the column [website_reference_number] among others
Use a REGEX to review [website_reference_number] according to certain specs
Then trim that data so that there are no inconsistencies left and my joins will be clean
For example:
if [website_reference_number] = "CC-DE-109" >>> Leave it like that
if [website_reference_number] = "CC-DE-109-Duplicate" >>> change to CC-DE-109
if [website_reference_number] = "CC-DE-109 Duplicate" >>> change to CC-DE-109
if [website_reference_number] = "CC-DE-109-Duplicate-Duplic" >>> change to CC-DE-109
So the rules are in human terms {Any 2 Letters}-{Any 2 Letters}-{AnyAmountOfNumbers}
Use this pattern:
/([A-Z]{2})-([A-Z]{2})-([0-9]+).*/
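To illustrate the cleanup outside Pentaho, here is the same pattern applied in Python with re.sub, replacing each value with its captured groups (in a regex-capable Pentaho step the replacement would use $1-$2-$3 instead of \1-\2-\3). The sample values come from the question:

```python
import re

# Two letters, dash, two letters, dash, digits; everything after is dropped.
pattern = re.compile(r'^([A-Za-z]{2})-([A-Za-z]{2})-([0-9]+).*$')

samples = [
    "CC-DE-109",
    "CC-DE-109-Duplicate",
    "CC-DE-109 Duplicate",
    "CC-DE-109-Duplicate-Duplic",
]

cleaned = [pattern.sub(r'\1-\2-\3', s) for s in samples]
print(cleaned)  # all four become 'CC-DE-109'
```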

using two xpathselectors on the same page

I have a spider where the scraped items are 3: brand, model and price from the same page.
Brands and models are using the same sel.xpath, later extracted and differentiated by .re in loop. However, price item is using different xpath. How can I use or combine two XPathSelectors in the spider?
Examples:
for brand and model:
titles = sel.xpath('//table[@border="0"]//td[@class="compact"]')
for prices:
prices = sel.xpath('//table[@border="0"]//td[@class="cl-price-cont"]//span[4]')
Each XPath was tested and exports correctly on its own. My problem is combining these two to construct the proper loop.
Any suggestions?
Thanks!
Provided you can differentiate all 3 kinds of items (brand, model, price) later, you can try using an XPath union (|) to bundle both XPath queries into one selector:
//table[@border="0"]//td[@class="compact"]
|
//table[@border="0"]//td[@class="cl-price-cont"]//span[4]
UPDATE:
Responding to your comment: the above is meant to be a single XPath string. I'm not using Python, but I think it should be about like this:
sel.xpath('//table[@border="0"]//td[@class="compact"] | //table[@border="0"]//td[@class="cl-price-cont"]//span[4]')
I believe you are having trouble associating the price with the make/model because both xpaths give you a list of all numbers, correct? Instead, what you want to do is build an xpath that will get you each row of the table. Then, in your loop, you can do further xpath queries to pull out the make/model/price.
import re

rows = sel.xpath('//table[@border="0"]/tr')  # Get all the rows
for row in rows:
    # Note the leading './/' so the query stays relative to this row.
    make_model = row.xpath('.//td[@class="compact"]/text()').extract()[0]
    # Set make and model here using your regex, something like:
    (make, model) = re.match(r"^(.+?)\s(.+?)$", make_model).groups()
    price = row.xpath('.//td[@class="cl-price-cont"]//span[4]/text()').extract()[0]
    # Do something with the make/model/price.
This way, you know that in each iteration of the loop, the make/model/price you're getting all go together.
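The row-then-relative-query idea isn't specific to Scrapy; here is the same pattern with the standard library's ElementTree on a made-up table, selecting each row first and then running relative queries inside it so brand/model and price stay associated:

```python
import xml.etree.ElementTree as ET

html = """
<table border="0">
  <tr><td class="compact">Ford Focus</td><td class="cl-price-cont">12000</td></tr>
  <tr><td class="compact">Opel Astra</td><td class="cl-price-cont">11000</td></tr>
</table>
"""

root = ET.fromstring(html)
results = []
for row in root.findall('.//tr'):  # each table row
    # Relative queries: only look inside the current row.
    make_model = row.find('.//td[@class="compact"]').text
    price = row.find('.//td[@class="cl-price-cont"]').text
    make, model = make_model.split(' ', 1)  # crude split, standing in for the regex
    results.append((make, model, price))

print(results)
```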