Removing invalid characters from Amazon CloudSearch SDF

While trying to post data extracted from a PDF file to an Amazon CloudSearch domain for indexing, the indexing failed because of invalid characters in the data.
How can I remove these invalid characters before posting to the search endpoint?
I tried escaping and replacing the characters, but that didn't work.

I was getting an error like this when uploading a document to CloudSearch (using the AWS SDK with JSON):
Error with source for field content_stemmed: Validation error for field 'content_stemmed': Invalid codepoint B
The solution for me, as documented by AWS (reference below), was to remove invalid characters from the document before uploading.
For example, this is what I did in JavaScript:
const cleaned = someFieldValue.replace(
  /[^\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD]/g,
  ''
)
ref:
Both JSON and XML batches can only contain UTF-8 characters that are valid in XML. Valid characters are the control characters tab (0009), carriage return (000D), and line feed (000A), and the legal characters of Unicode and ISO/IEC 10646. FFFE, FFFF, and the surrogate blocks D800–DBFF and DC00–DFFF are invalid and will cause errors.
You can use the following regular expression to match invalid characters so you can remove them: /[^\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD]/

I have fixed the problem using the solution available here
import re  # Python 2 only: relies on unichr() and u'' literals

RE_XML_ILLEGAL = u'([\u0000-\u0008\u000b-\u000c\u000e-\u001f\ufffe-\uffff])' + \
                 u'|' + \
                 u'([%s-%s][^%s-%s])|([^%s-%s][%s-%s])|([%s-%s]$)|(^[%s-%s])' % \
                 (unichr(0xd800), unichr(0xdbff), unichr(0xdc00), unichr(0xdfff),
                  unichr(0xd800), unichr(0xdbff), unichr(0xdc00), unichr(0xdfff),
                  unichr(0xd800), unichr(0xdbff), unichr(0xdc00), unichr(0xdfff))
# Replace each illegal character (and any unpaired surrogate) with "?"
x = u"<foo>text\u001a</foo>"
x = re.sub(RE_XML_ILLEGAL, "?", x)
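For Python 3, where strings are sequences of code points and unichr no longer exists, a minimal sketch could reuse the character class from the AWS documentation quoted above. The names INVALID_CHARS and strip_invalid are made up for this example.

```python
import re

# Same character class as the regular expression quoted from the AWS documentation.
INVALID_CHARS = re.compile(r"[^\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD]")


def strip_invalid(text: str, replacement: str = "") -> str:
    """Remove (or replace) every character the quoted pattern treats as invalid."""
    return INVALID_CHARS.sub(replacement, text)


print(repr(strip_invalid("text\u001a")))         # 'text'
print(repr(strip_invalid("line1\u000bline2")))   # 'line1line2'
```

Because the quoted class stops at U+FFFD, this sketch also removes characters above that point (emoji, for instance). Whether that is acceptable depends on the data being indexed.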

Related

RegEx for filtering in Azure using Terraform

The Terraform azurerm_image data source lets you use a RegEx to identify a machine image whose ID matches the regular expression.
What RegEx should be used to retrieve an image that includes the string MyImageName and that takes the complete form /subscriptions/abc-123-def-456-ghi-789-jkl/resourceGroups/MyResourceGroupName/providers/Microsoft.Compute/images/MyImageName1618954096 ?
The following version of the RegEx is throwing an error because it will not accept two * characters. However, when we only used the trailing *, the image was not retrieved.
data "azurerm_image" "search" {
  name_regex          = "*MyImageName*"
  resource_group_name = var.resourceGroupName
}
Note that the results only return a single image so you do not need to worry about multiple images being returned. There is a flag that can be set to specify either ascending or descending sorting to retrieve the oldest or the newest match.
The precise error we are getting is:
Error: "name_regex": error parsing regexp: missing argument to repetition operator: `*`
Nick's Suggestion
Per @Nick's suggestion, we tried:
data "azurerm_image" "search" {
  name_regex          = "/MyImageName[^/]+$"
  resource_group_name = var.resourceGroupName
}
But the result is:
Error: No Images were found for Resource Group "MyResourceGroupName"
We checked in the Azure Portal and there is an image that includes MyImageName in its name within the resource group named MyResourceGroupName. We also confirmed that Terraform is running as the subscription owner, so we imagine that the subscription owner has sufficient authorization to filter image names.
What else can we try?
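From my testing, name_regex works when the pattern contains only a trailing *. A leading * produces that error message, because the value is parsed as a regular expression rather than a glob, and * is a repetition operator that needs something before it to repeat.
For example, I have an image named rrr-image-20210421150018 in my resource group.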
The following works:
r*
-*
8*
rrr*
image*
2021*
The following does not work:
*r
*-
*8
*image*
*rrr*
Also, verify if you have the latest azurerm provider.
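The difference between a glob and a regular expression can be checked outside Terraform. The sketch below uses Python's re module, which is not the engine Terraform uses but behaves the same way for these simple patterns. It also assumes that name_regex is matched against the image name and not the full resource ID, which would explain why a pattern starting with "/" found nothing. The compiles helper is made up for this example.

```python
import re

image_name = "MyImageName1618954096"  # image name taken from the question


def compiles(pattern: str) -> bool:
    """Return True if the pattern is a syntactically valid regular expression."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# A leading * has nothing to repeat, so the pattern is rejected outright.
print(compiles("*MyImageName*"))                              # False
# A plain substring is already a valid pattern and matches anywhere in the name.
print(bool(re.search("MyImageName", image_name)))             # True
# The glob *MyImageName* corresponds to this regular expression.
print(bool(re.search(".*MyImageName.*", image_name)))         # True
# A pattern that requires a "/" cannot match a bare image name.
print(bool(re.search("/MyImageName[^/]+$", image_name)))      # False
```

Under those assumptions, name_regex = "MyImageName" would be the simplest pattern to try.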

Encoding automatically in Postman

I have a URI that ends in something like this:
...headfields=id,id^name
I was using EncodeURIComponent (right-click on the URI) to replace the "^" with "%5E", and that works fine.
But my question is: can this be done automatically in Postman?
URL encoding is done automatically; you don't have to do it explicitly.
Note for query parameters: if you type a character with a special meaning directly into the URL, Postman will not encode it; if you enter it in the Params tab, it will.
Use case 1: typing special characters directly in the URL
Use case 2: entering them in Params
You can also encode in a pre-request script:
pm.request.url=encodeURI(pm.variables.replaceIn(pm.request.url))
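To see what the encoding step produces, independent of Postman, here is a small sketch using Python's standard library. Keeping the comma unencoded through safe="," is a choice made for this example so that the output matches the URI shown in the question.

```python
from urllib.parse import quote, urlencode

value = "id,id^name"  # query value from the question

# Percent-encode the value; the comma is kept literal via safe=",".
print(quote(value, safe=","))                       # id,id%5Ename

# Build the whole query string from a mapping instead of typing it by hand.
print(urlencode({"headfields": value}, safe=","))   # headfields=id,id%5Ename
```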

How can I use regex to construct an API call in my Jekyll plugin?

I'm trying to write my own Jekyll plugin to construct an api query from a custom tag. I've gotten as far as creating the basic plugin and tag, but I've run into the limits of my programming skills so looking to you for help.
Here's my custom tag for reference:
{% card "Arbor Elf | M13" %}
Here's the progress on my plugin:
module Jekyll
  class Scryfall < Liquid::Tag
    def initialize(tag_name, text, tokens)
      super
      @text = text
    end

    def render(context)
      # Store the name of the card, ie "Arbor Elf"
      @card_name =
      # Store the name of the set, ie "M13"
      @card_set =
      # Build the query
      @query = "https://api.scryfall.com/cards/named?exact=#{@card_name}&set=#{@card_set}"
      # Store a specific JSON property
      @card_art =
      # Finally we render out the result
      "<img src='#{@card_art}' title='#{@card_name}' />"
    end
  end
end
Liquid::Template.register_tag('cards', Jekyll::Scryfall)
For reference, here's an example query using the above details (paste it into your browser to see the response you get back)
https://api.scryfall.com/cards/named?exact=arbor+elf&set=m13
My initial attempt, after Googling around, was to use a regex to split @text at the |, like so:
@card_name = "#{@text}".split(/| */)
This didn't quite work; instead it output this:
["A", "r", "b", "o", "r", " ", "E", "l", "f", " ", "|", " ", "M", "1", "3", " "]
I'm also not sure how to access and store specific properties within the JSON response. Ideally, I could do something like this:
@card_art = JSONRESPONSE.image_uri.large
I'm well aware I'm asking a lot here, but I'd love to try and get this working and learn from it.
Thanks for reading.
Actually, your split should work: you just need to give it the correct regex (and you can call it on @text directly). You also need to escape the pipe character in the regex, because pipes can have special meaning. You can use rubular.com to experiment with regexes.
parts = @text.split(/\|/)
# => ["Arbor Elf ", " M13"]
Note that they also contain some extra whitespace, which you can remove with strip.
@card_name = parts.first.strip
@card_set = parts.last.strip
This might also be a good time to answer questions like: what happens if the user inserts multiple pipes? What if they insert none? Will your code give them a helpful error message for this?
You'll also need to escape these values in your URL. What if one of your users adds a card containing a & character? Your URL will break:
https://api.scryfall.com/cards/named?exact=Sword of Dungeons & Dragons&set=und
That looks like a URL with three parameters, exact, set and Dragons. You need to encode the user input to be included in a URL:
require 'cgi'
query = "https://api.scryfall.com/cards/named?exact=#{CGI.escape(@card_name)}&set=#{CGI.escape(@card_set)}"
# => "https://api.scryfall.com/cards/named?exact=Sword+of+Dungeons+%26+Dragons&set=und"
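The splitting, validation, and escaping steps above can be sketched together. The example is written in Python to keep it independent of the plugin itself; the same logic carries over to Ruby. The function name build_query and the error message are made up for this example.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.scryfall.com/cards/named"


def build_query(tag_text: str) -> str:
    """Turn tag text such as '"Arbor Elf | M13"' into an escaped query URL."""
    # Drop surrounding whitespace and quotes, then split on the pipe.
    parts = [part.strip() for part in tag_text.strip().strip('"').split("|")]
    # Reject input with no pipe, several pipes, or an empty half.
    if len(parts) != 2 or not all(parts):
        raise ValueError('expected "Card Name | SET", got: ' + repr(tag_text))
    name, set_code = parts
    # urlencode escapes characters such as & and space in each value.
    return BASE_URL + "?" + urlencode({"exact": name, "set": set_code})


print(build_query('"Arbor Elf | M13"'))
# https://api.scryfall.com/cards/named?exact=Arbor+Elf&set=M13
print(build_query("Sword of Dungeons & Dragons | und"))
# https://api.scryfall.com/cards/named?exact=Sword+of+Dungeons+%26+Dragons&set=und
```

Raising an error on malformed input is one possible answer to the "multiple pipes or none" question; rendering a visible placeholder instead would be another.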
What comes after that is a little less clear, because you haven't written the code yet. Try making the call with the Net::HTTP module and then parsing the response with the JSON module. If you have trouble, come back here and ask a new question.
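As a rough illustration of the parsing step only, the sketch below reads one property from a JSON body, again in Python. The sample body is hypothetical and heavily trimmed, and the key names image_uris and large are an assumption about the response layout that should be checked against a real response. No network call is made here.

```python
import json

# Hypothetical, heavily trimmed response body; a real response has many more fields.
sample_body = (
    '{"name": "Arbor Elf", '
    '"image_uris": {"large": "https://example.invalid/arbor-elf-large.jpg"}}'
)


def large_image_url(body: str):
    """Return the large image URL from a card response, or None if it is absent."""
    card = json.loads(body)
    # .get() avoids a KeyError when the card has no image block at all.
    return card.get("image_uris", {}).get("large")


print(large_image_url(sample_body))
```

Returning None for a missing image leaves it to the caller to decide whether to render a fallback or report an error.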

Regex capture group in Varnish VCL

I have a URL in the form of:
http://some-site.com/api/v2/portal-name/some/webservice/call
The data I want to fetch needs
http://portal-name.com/webservices/v2/some/webservice/call
(Yes I can rewrite the application so it uses other URL's but we are testing varnish at the moment so for now it cannot be intrusive.)
But I'm having trouble building the URL correctly in Varnish VCL. Replacing the api part with an empty string is no problem, but extracting the portal-name is.
Things I've tried:
if (req.url ~ ".*/(.*)/") {
    set req.http.portalhostname = re.group.0;
    set req.http.portalhostname = $1;
}
These come from https://docs.fastly.com/guides/vcl/vcl-regular-expression-cheat-sheet and from the question "Extracting capturing group contents in Varnish regex".
And yes, std is imported.
But this gives me either a
Syntax error at
('/etc/varnish/default.vcl' Line 36 Pos 35)
set req.http.portalhostname = $1;
or a
Symbol not found: 're.group.0' (expected type STRING_LIST):
So: how can I do this? When I have extracted the portalhostname I should be able to simply do a regsub to replace that value with an empty string and then prepend "webservices" and my URL is complete.
The Varnish version I'm using: varnish-4.1.8 revision d266ac5c6
Sadly re.group seems to have been removed at some version. Similar functionality appears to be accessible via one of several vmods. See https://varnish-cache.org/vmods/
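The rewrite described in the question comes down to one capture group for the portal name and one for the rest of the path. The sketch below shows that logic in Python, not VCL. In Varnish itself the same idea is usually expressed with regsub, whose replacement string accepts backreferences such as \1. The pattern and the rewrite function are made up for this example and assume the URL layout shown in the question.

```python
import re

# Mirrors the URL layout from the question: /api/v2/<portal-name>/<rest of path>
PATTERN = re.compile(r"^/api/v2/([^/]+)/(.*)$")


def rewrite(path: str):
    """Return (backend_host, backend_path), or None if the path does not match."""
    match = PATTERN.match(path)
    if match is None:
        return None
    portal, rest = match.group(1), match.group(2)
    return portal + ".com", "/webservices/v2/" + rest


print(rewrite("/api/v2/portal-name/some/webservice/call"))
# ('portal-name.com', '/webservices/v2/some/webservice/call')
```

Anchoring the pattern with ^ and using [^/]+ in place of .* keeps the first group from swallowing more than one path segment, which the pattern ".*/(.*)/" in the question does not guarantee.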

Psych::SyntaxError: did not find expected key while parsing a block mapping

When I start the server, I get this error:
I18n::InvalidLocaleData
can not load translations from /Users/Apple/myapp-website-freelance/config/locales/fr.yml: #<Psych::SyntaxError: (/Users/Apple/myapp-website-freelance/config/locales/fr.yml): did not find expected key while parsing a block mapping at line 2 column 3>
However, the YAML file looks normal at line 2, column 3:
fr:
  Electronics_Circuits_Simulator_Realistic_Interface: "Simulateur de circuits electroniques. Interface reelle."
Any idea?
This error is often misleading and points to the wrong line.
Look for extra spaces throughout the YAML file, and try replacing all enclosing single quotes with double quotes; that should fix it.
One possible reason is:
default: &default
  a: 1
production: *default
  b: 2
in place of:
default: &default
  a: 1
production:
  <<: *default
  b: 2
If you work with many projects, make sure the hyphens (-) in the sites section are aligned with each other.
Check for any extra whitespace in your YAML file. Also, if you are like me, you might have forgotten to remove the {} in a YAML template.
# Read about fixtures at http://api.rubyonrails.org/classes/ActiveRecord/FixtureSet.html
# This model initially had no columns defined. If you add columns to the
# model remove the '{}' from the fixture names and add the columns immediately
# below each fixture, per the syntax in the comments below
#
one: {}
# column: value
#
two: {}
# column: value
I had a comma at the end of a line from a bad copy/paste elsewhere. Removing the comma restored the YAML file back to normal use (my application.yml in this case). The error is apparently covering a wide variety of YAML syntax issues. I suspect it means that the YAML parser could not go to the next line in the YAML file because of a breaking syntax error on the current line.
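Since the reported line number is often wrong, a quick scan of the whole file for the causes mentioned above can save time. The sketch below is a rough heuristic, not a YAML parser: it flags tabs in indentation, indentation that is not a multiple of an assumed step of two spaces, trailing whitespace, and trailing commas. The function name find_suspect_lines is made up for this example, and a flagged line is only a place to look, not proof of an error.

```python
def find_suspect_lines(yaml_text: str, indent: int = 2):
    """Return (line_number, reason) pairs for lines worth a closer look."""
    findings = []
    for number, line in enumerate(yaml_text.splitlines(), start=1):
        stripped = line.strip()
        # Skip blank lines and full-line comments.
        if not stripped or stripped.startswith("#"):
            continue
        leading = line[: len(line) - len(line.lstrip())]
        if "\t" in leading:
            findings.append((number, "tab in indentation"))
        elif len(leading) % indent != 0:
            findings.append((number, "indentation is not a multiple of %d" % indent))
        if line != line.rstrip():
            findings.append((number, "trailing whitespace"))
        if stripped.endswith(","):
            findings.append((number, "trailing comma"))
    return findings


sample = 'fr:\n   key: "value"\nother: 1,\n'
for number, reason in find_suspect_lines(sample):
    print(number, reason)
# 2 indentation is not a multiple of 2
# 3 trailing comma
```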
In my case it was missing quotes around password fields containing symbols like "!@#". Check your YAML file with a YAML validator.
I had a similar issue, but in my case it was not blank spaces as shown in @Dijiflex's solution; it was an apostrophe (single quote character), as in the example below:
pages:
  title: "I don't have time this evening"
Note the word don't. What I did was escape it as below:
pages:
  title: "I don\\'t have time this evening"
I hope you have the idea now?
Enjoy
If you are parsing the YAML through ERB first, check your final YAML output (e.g. embedding of environment variables) to see if they contain invalid YAML, e.g. incomplete strings, etc.