I have a real time web analytics problem to address, and I'm wondering if some of the WSO2 products might be an appropriate solution.
An ecommerce web site shows pages of products to a browser user, and the web site vendor wants to collect details of what products were viewed in a list, what products were selected from the list for more info, what products were put into the basket, and what products were actually purchased - all in real time. I can use web page tagging to generate logging events for the four states (i.e. in list, view detail, in basket, purchased). The web site vendor wants to see results summarized by product and by rolling time band (e.g. last hour, last 6 hours, last 24 hours, last 72 hours) for the four product states.
As a complete WSO2 newbie I'm hoping somebody can help with some pointers on how to address this. I've been reading about the BAM module to capture events. Is that a good place to start? Also, can anybody suggest a good in-memory data store to hold the event data aggregated by event type and rolling time period?
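To make the requirement concrete, here is roughly the kind of rolling-time-band summary I'm after - plain Python purely for illustration, not tied to any WSO2 API:

```python
from collections import deque
from time import time

WINDOWS_HOURS = [1, 6, 24, 72]
STATES = ["in_list", "view_detail", "in_basket", "purchased"]

class RollingCounts:
    """Keeps raw (timestamp, product_id, state) events in memory and summarises
    them by product, state and rolling time band on demand."""

    def __init__(self):
        self.events = deque()  # (timestamp, product_id, state)

    def record(self, product_id, state, ts=None):
        self.events.append((ts or time(), product_id, state))

    def summary(self, hours):
        # Drop events older than the largest window we care about.
        oldest_needed = time() - max(WINDOWS_HOURS) * 3600
        while self.events and self.events[0][0] < oldest_needed:
            self.events.popleft()
        cutoff = time() - hours * 3600
        counts = {}
        for ts, product_id, state in self.events:
            if ts >= cutoff:
                counts.setdefault(product_id, dict.fromkeys(STATES, 0))[state] += 1
        return counts

tracker = RollingCounts()
tracker.record("SKU-123", "in_list")
tracker.record("SKU-123", "view_detail")
print(tracker.summary(hours=6))  # e.g. {'SKU-123': {'in_list': 1, 'view_detail': 1, ...}}
```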
TIA
Yes. BAM is more of a batch-processing, monitoring and complex-event engine: using it you can capture data, process it and then present it. From an architectural point of view, the product-state changes made by the browser user would be captured by the web server and published to the BAM server.
A good place to start is learning about data publishing. Once you define the data to be published (in BAM this is known as a stream definition), you can write a Hive script to process it and present it. You can pump all the data into BAM, use a Hive script to process and store it in the form you want, and later retrieve and present it.
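As a rough illustration only - the field names below are assumptions for this example, and the exact stream-definition and payload format is described in the BAM data-publisher documentation - each page-tagging event you publish could carry something like:

```python
# Hypothetical shape of one page-tagging event sent to BAM;
# the actual stream definition and payload fields are up to you.
product_event = {
    "streamName": "product.state.events",   # placeholder stream name
    "version": "1.0.0",
    "payload": {
        "productId": "SKU-123",
        "state": "in_basket",        # in_list | view_detail | in_basket | purchased
        "sessionId": "abc-123",
        "timestamp": 1468324800000,  # epoch milliseconds
    },
}
```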
I'm trying to help an animal shelter deliver faster updates when a new pet is added to their website. This is likely to happen between 0-20 times a day.
The website is a simple data dump: animals are listed in tables with clear row delineation (easy to parse) and have unique IDs. When a new pet is added, ideally this would trigger a mobile notification to subscribed users (could also be an email message). The faster updates are sent, the better, but checking every 30 mins or so would be fine. Because this is for a charity, I want to spend as little as possible on resources (I also want to be able to scale this up for other shelters that might want to use it).
For the mobile notifications, Twitter seems to be a good candidate; it looks like my needs won't run into fees/restrictions.
The part that I'm stuck on is how best to ping the site for updates and publish those updates to twitter. The two options I've come up with are:
Build my own system. Use a web crawler like Scrapy to periodically crawl the site and check for new pet IDs. Using AWS, I think I could get by with a nano instance (~$57 a year), and using DynamoDB to cache existing pet IDs seems like a small additional cost. Use the Twitter API to post updates (a rough sketch of this approach is below).
Use an RSS feed generator like Feedity. These seem to be pretty expensive: Feedity is $180/year for hourly updates and $390/year for 15-minute updates. It has an API integrated with Twitter.
I'd like to know if there are any better/simpler/cheaper/more obvious options I may be overlooking. Thanks!
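For reference, this is roughly what I have in mind for option 1 (using requests/BeautifulSoup instead of Scrapy for brevity; the shelter URL, table layout and Tweepy calls are illustrative assumptions):

```python
import requests
from bs4 import BeautifulSoup
import tweepy  # pip install tweepy

SHELTER_URL = "https://example-shelter.org/adoptable"  # placeholder URL

def fetch_pet_ids():
    """Scrape the shelter page and return the set of pet IDs currently listed."""
    soup = BeautifulSoup(requests.get(SHELTER_URL, timeout=30).text, "html.parser")
    # Assumes each animal is a table row whose first cell holds the unique ID.
    return {row.find("td").text.strip()
            for row in soup.select("table tr") if row.find("td")}

def notify_new_pets(known_ids, api):
    """Tweet about any pet ID we have not seen before; return the current set."""
    current = fetch_pet_ids()
    for pet_id in current - known_ids:
        api.update_status(f"A new pet ({pet_id}) just arrived! {SHELTER_URL}")
    return current

# Example wiring (credentials and the DynamoDB helper are placeholders):
# auth = tweepy.OAuth1UserHandler(KEY, SECRET, TOKEN, TOKEN_SECRET)
# known = notify_new_pets(known_ids=load_ids_from_dynamodb(), api=tweepy.API(auth))
```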
I operate a number of content websites that have several million user sessions and need a reliable way to monitor some real-time metrics on particular pieces of content (key metrics being: pageviews/unique pageviews over time, unique users, referrers).
The use case here is for the stats to be visible to authors/staff on the site, as well as to act as source data for real-time content popularity algorithms.
We already use Google Analytics, but this does not update quickly enough (4-24 hours depending on traffic volume). Google Analytics does offer a real-time reporting API, but this is currently in closed beta (I have requested access several times, but no joy yet).
New Relic appears to offer a few analytics products, but they are quite expensive ($149/500k pageviews - we have several times this).
Other answers I found on StackOverflow suggest building your own, but this was 3-5 years ago. Any ideas?
I've heard some good things about Woopra, and they offer 1.2m page views for the same price as New Relic.
https://www.woopra.com/pricing/
If that's too expensive, the alternative is live-loading your logs and using an Elasticsearch service to read them to get the data you want, but you will need access to your logs whilst they are being written.
A service like Loggly might suit you, which would enable you to "live tail" your logs (view them while they are being written), but again there is a cost to that.
Failing that, you could do something yourself, or get someone on Freelancer to knock something up for you that enables logs to be read and displayed in a format you recognise.
https://www.portent.com/blog/analytics/how-to-read-a-web-site-log-file.htm
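If you do end up rolling your own, even a small script gets you the basic numbers. A rough sketch, assuming a standard combined-format access log:

```python
import re
from collections import Counter

# Typical "combined" log line:
# ip - - [date] "GET /path HTTP/1.1" 200 123 "referrer" "user-agent"
LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST) (\S+)[^"]*" \d+ \S+ "([^"]*)"')

pageviews, uniques, referrers = Counter(), set(), Counter()
with open("access.log") as log:
    for line in log:
        m = LINE.match(line)
        if not m:
            continue
        ip, path, referrer = m.groups()
        pageviews[path] += 1
        uniques.add(ip)
        if referrer not in ("", "-"):
            referrers[referrer] += 1

print("top pages:", pageviews.most_common(5))
print("unique IPs:", len(uniques))
print("top referrers:", referrers.most_common(5))
```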
If the metrics you need to track are limited to the ones you have listed (page views, unique users, referrers), you may consider collecting your web servers' logs and using a log analyzer.
There are several free tools available on the Internet to get real-time statistics out of those logs.
Take a look at www.elastic.co, for example.
Hope this helps!
Google Analytics offers real time data viewing now, if that's what you want?
https://support.google.com/analytics/answer/1638635?hl=en
I believe their API has now been released, as we are looking at incorporating this!
If you have access to the web server logs, then you can set up Elasticsearch as the search engine, along with Logstash as the log parser and Kibana as the front-end tool for analyzing the data.
For more information, please go through the Elasticsearch website: https://www.elastic.co
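As a rough illustration of the kind of query you would run once the logs are indexed - the index and field names here are placeholders for whatever your Logstash configuration produces:

```python
import requests

# Pageviews, unique users and top referrers for one article over the last hour.
# "weblogs-*", "page", "client_ip" and "referrer" are placeholder names.
query = {
    "size": 0,
    "query": {
        "bool": {
            "filter": [
                {"term": {"page": "/articles/my-article"}},
                {"range": {"@timestamp": {"gte": "now-1h"}}},
            ]
        }
    },
    "aggs": {
        "pageviews": {"value_count": {"field": "page"}},
        "unique_users": {"cardinality": {"field": "client_ip"}},
        "top_referrers": {"terms": {"field": "referrer", "size": 10}},
    },
}

resp = requests.post("http://localhost:9200/weblogs-*/_search", json=query, timeout=10)
print(resp.json()["aggregations"])
```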
Consider the following micro services for an online store project:
Users Service keeps account data about the store's users (including first name, last name, email address, etc.)
Purchase Service keeps track of details about users' purchases.
Each service provides a UI for viewing and managing its relevant entities.
The Purchase Service index page lists purchases. Each purchase item should have the following fields:
id, full name of purchasing user, purchased item title and price.
Furthermore, as part of the index page, I'd like to have a search box to let the store manager search purchases by purchasing user name.
It is not clear to me how to get back data which the Purchase Service does not hold - for example: a user's full name.
The problem gets worse when trying to do more complicated things like search purchases by purchasing user name.
I figured that I can obviously solve this by syncing users between the two services by broadcasting some sort of event on user creation (and saving only the relevant user properties on the Purchase Service end). That's far from ideal from my perspective. How do you deal with this when you have millions of users? Would you create millions of records in each service that consumes user data?
Another obvious option is exposing an API at the Users Service end which brings back user details based on given IDs. That means that on every page load in the Purchase Service, I'll have to make a call to the Users Service in order to get the right user names. Not ideal, but I can live with it.
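To make that option concrete, the Purchase Service side would look roughly like this (the /users endpoint and its response shape are hypothetical):

```python
import requests

USERS_SERVICE = "http://users-service.internal"  # hypothetical internal URL

def enrich_purchases(purchases):
    """Attach full user names to purchase rows via one batched call to the Users Service."""
    user_ids = {p["user_id"] for p in purchases}
    # Hypothetical endpoint: GET /users?ids=1,2,3 -> [{"id": 1, "full_name": "..."}, ...]
    resp = requests.get(f"{USERS_SERVICE}/users",
                        params={"ids": ",".join(map(str, user_ids))}, timeout=5)
    names = {u["id"]: u["full_name"] for u in resp.json()}
    return [{**p, "full_name": names.get(p["user_id"], "<unknown>")} for p in purchases]
```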
What about implementing a purchase search based on user name? Well, I can always expose another API endpoint at the Users Service end which receives the query term, performs a text search over user names in the Users Service, and then returns all user details which match the criteria. The Purchase Service would then map the relevant IDs back to the right names and show them on the page. This approach is not ideal either.
Am I missing something? Is there another approach for implementing the above? Maybe the fact that I'm facing this issue is sort of a code smell? would love to hear other solutions.
This seems to be a very common and central question when moving into microservices. I wish there was a good answer for that :-)
About the suggested pattern already mentioned here, I would use the term Data Denormalization rather than Polyglot Persistence, as it doesn't necessarily need to involve different persistence technologies. The point is that each service handles its own data. And yes, you have data duplication, and you usually need some kind of event bus to share data across services.
There's another option, which is sort of a take on the first: making the search itself a separate service.
So in your example, you have the Users service for managing users. The Purchases service manages purchases. Each handles its own data and only the data it needs (so, for instance, the Purchases service doesn't really need the user name, only the ID). And you have a third service - the Search Service - that consumes data produced by the other services, and creates a search "view" from the combined data.
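A very rough sketch of that third service, assuming some event bus delivers "user created" and "purchase created" events - the event shapes and the in-memory "index" are just stand-ins for whatever you actually use:

```python
# Stand-ins for a real consumer (Kafka, RabbitMQ, ...) and a real search index.
search_index = {}   # purchase_id -> denormalized search document
user_names = {}     # user_id -> full name, maintained from user events

def on_user_event(event):
    user_names[event["user_id"]] = f'{event["first_name"]} {event["last_name"]}'

def on_purchase_event(event):
    search_index[event["purchase_id"]] = {
        "purchase_id": event["purchase_id"],
        "item_title": event["item_title"],
        "price": event["price"],
        # Denormalized copy of the user's name so purchases can be searched by it.
        "purchaser_name": user_names.get(event["user_id"], ""),
    }

def search_by_purchaser(name):
    name = name.lower()
    return [doc for doc in search_index.values() if name in doc["purchaser_name"].lower()]
```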
It's totally fine to keep the appropriate data in different databases; it's called Polyglot Persistence. Yes, you would want to keep user data and data about purchases separately and use a message queue for sync. Millions of users seems fine to me; it's a scalability concern, not a design issue ;-)
In the case of search - you probably want to search by more than just username, right? So, if you use a message queue to update data between services, you can also easily route this data to ElasticSearch, for example. And from ElasticSearch's perspective it doesn't really matter which field to index - username or product title.
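To illustrate, the consumer on the ElasticSearch side needs to do little more than this (index name and document shape are placeholders; this uses the plain REST API):

```python
import requests

def index_purchase(purchase):
    """Index a denormalized purchase document so it can be searched
    by username or product title alike."""
    doc = {
        "purchase_id": purchase["id"],
        "username": purchase["username"],          # copied from the user event
        "product_title": purchase["product_title"],
        "price": purchase["price"],
    }
    requests.put(f"http://localhost:9200/purchases/_doc/{purchase['id']}",
                 json=doc, timeout=10)
```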
I usually use both approaches. Sometimes I have another service which sits on top of several other services and combines their data. I don't really like this approach because it causes dependencies and coupling between services. So, in general, in my last projects we tried to stick to polyglot persistence.
Also consider that if you need several sub-HTTP-requests to combine data in some kind of middleware service, it will lead to higher latency. We always try to cut down the number of requests for one task and handle everything that can be handled through asynchronous queues (especially data sync).
If you conceptualize modules as the owners and controllers of the data they work on, then your model must also communicate that data out of each module to others. In contrast, the modules in a manufacturing process have access to change data without possessing and controlling it.
Microservices is an architecture for distributed processing, like most code, where modules pass the data around to work on it. From classic articles by Harvard Business Review and McKinsey on the subject of owning members of a supply chain, I identified complexities arising from this model and wrote an article teaching programmers what you need to know: http://www.powersemantics.com/p.html
Manufacturing is an architecture for integrated processing, where modules work on the data without passing it around from point to point. This can be accomplished by having modules configured to access the same memory, files or database tables. My architecture shows how to accomplish this on memory via reference properties.
When you consider "exposing an API at the Users Service end which brings back user details based on given ids", you need to be aware that this creates what HBR calls "irreversible" complexity, which I've dubbed centralization complexity. Don't build A->B (distributed) systems, because you can't decentralize them later after failing to separate requirements. Requirements in production processes represent user instructions, and centralized modules only enable you to change the wrong users' processes. In other words, centralized modules don't document user groups or distinguish them from derived-product-users.
Say I want to develop a web application that will have registered users and will be registered as a twitter app (allowing users to give it permissions to view their timeline and post on their behalf). The sole function of the application will be to re-tweet tweets from users' timeline according to users' settings and desires.
I understand that the website for this app will use the common technologies like HTML, CSS and JS on the client side. The server side (where the user defines what kind of tweets the application should retweet) will have to be coded in PHP/Python/Perl/..., backed by a database such as MySQL/PostgreSQL/...
What I don't understand, and would really appreciate your help with, is where the real "business logic" will be coded. For example, what technology should I use to code the function that will sit on my server: contacting the Twitter server every 5 minutes, reading the timeline of every user I have, checking whether there are tweets worth retweeting (according to what the user has defined), and sending Twitter the necessary commands to retweet the chosen tweets on behalf of my users.
All that will happen off-line for the user, and will be an on-going and cyclic process - but what technology should I use to code it?
Thanks!
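Just to illustrate the cyclic process I'm describing, something along these lines - written with Tweepy and the standard v1.1 API as an assumption, with the "worth retweeting" rule left as whatever the user configured:

```python
import time
import tweepy  # assumes the standard Twitter v1.1 API via Tweepy

def run_retweeter(accounts, should_retweet, interval=300):
    """accounts: list of (tweepy.API client authed as that user, that user's settings).
    should_retweet(tweet, settings) is whatever rule the user configured in the web UI."""
    last_seen = {}  # account index -> newest tweet id already processed
    while True:
        for key, (api, settings) in enumerate(accounts):
            kwargs = {"count": 50}
            if key in last_seen:
                kwargs["since_id"] = last_seen[key]
            tweets = api.home_timeline(**kwargs)
            for tweet in tweets:
                if should_retweet(tweet, settings):
                    api.retweet(tweet.id)
            if tweets:
                last_seen[key] = tweets[0].id  # timelines are returned newest first
        time.sleep(interval)  # re-poll Twitter every 5 minutes
```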
I have heard about this API for PHP. It is actually the only one that I have heard of for PHP, though. I know that there are some good Python libraries out there, but I don't know about Perl.
I am actually working on a new API for C# (won't be a good fit for you, as you're clearly not using Windows Servers), and started building it while working on an enterprise web application that prompted several questions similar to your own.
Here is what you are going to have to do:
Before you start, you are going to have to get in touch with one of Twitter's data partners (I believe that you may contact Twitter for the reference)
The reason for this is that you are going to be requiring many more requests than you think
The time interval Twitter uses for its recorded rate cap is 900 seconds (15 minutes)
With the general rate limit, if you are querying a user's timeline only once every rate-limit window, you are limiting the number of visitors on your site to 300 at a time
Here's where it gets tricky - if every user makes one Tweet (meaning you send the Tweet - not rate limited - and then refresh the timeline - rate limited - so that they can see the updated tweet) you have now dropped your maximum number of active users at any given time to 150
Factor in the company's own timeline (-1 visitor), plus the number of visitors who leave their browsers open (now you need more logic, and you have to either kick them off or simply keep track of whose timeline you won't be refreshing), the number of users who make more than one tweet (-1 visitor for each Tweet), etc.
Moral of the story: contact one of their data partners and get yourself either unlimited requests, or at least a significant enough amount to accommodate your number of visitors/users (plus a bit of padding)
If you adhere to this advice, skip steps 2 and 3; otherwise, skip step 4
(Note: Steps 2 and 3 are only for rate-capped implementations) Using your desired language, make a service that runs on the server and makes the queries to Twitter (a rough sketch of such a service appears after these steps)
Based on the information that you gave, I suggest that you use Python to make this service
The service will run at all times and be on its own clock, which the 5-minute intervals between requests are based on
You will have to use a caching or a database system for storing the data
(Note: Steps 2 and 3 are only for rate-capped implementations) Add the necessary code to make a request to the service that you created for the data and perform these requests every 5 minutes
I suggest that the clock used for making these requests to the service be running a little bit behind the clock used for the service to account for instances of slow data transfer, etc.
You will also have to call some methods on the service for adding/removing users from the queue
(Note: Step 4 is only for unlimited request implementations) Forget about the service and simply include the request code directly in the page that the user is on.
The user's timeline will be updated based on when they visited the site or when their timeline was last refreshed (if a Tweet was made)
The only caveat to this implementation is that you will have to pay for the unlimited/larger data rate limit
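To sketch what the rate-capped service from steps 2 and 3 might look like - just an outline, with an in-memory dict standing in for your cache or database:

```python
import time

REFRESH_INTERVAL = 300  # the service's own 5-minute clock

class TimelineCache:
    """Background service: refreshes each queued user's timeline at most once per
    interval and serves the cached copy to the web front end."""

    def __init__(self, api):
        self.api = api            # e.g. an authenticated Tweepy client
        self.users = {}           # user_id -> {"fetched_at": float, "tweets": list}

    def add_user(self, user_id):
        self.users.setdefault(user_id, {"fetched_at": 0.0, "tweets": []})

    def remove_user(self, user_id):
        self.users.pop(user_id, None)

    def refresh_due(self):
        """Called on the service's own clock; only spends rate-limited calls on stale copies."""
        now = time.time()
        for user_id, entry in self.users.items():
            if now - entry["fetched_at"] >= REFRESH_INTERVAL:
                entry["tweets"] = self.api.user_timeline(user_id=user_id, count=50)
                entry["fetched_at"] = now

    def get_timeline(self, user_id):
        """What the page requests every 5 minutes: cached data only, never a live Twitter call."""
        return self.users.get(user_id, {}).get("tweets", [])
```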
I was looking to find out whether it is possible to insert/update a large quantity of rows from an AS/400 system.
I have a website hosted on another server online, and that website must be updated with the new stock levels for each article. But this data only exists in the AS/400 system.
For security reasons, I would like the AS/400 system to connect to the web server, rather than the web server connecting to the AS/400.
A better system would be to update/insert every time a change is made on the AS/400, but if this is not possible it could be an update every 3 hours in order to maintain consistency between the two servers.
Thanks
Yes it can be done.
You can attach a database trigger to your stock table that pushes the key fields to a data queue anytime an insert, update or delete is performed. You can then process the data queue to send the updates to the web site using an HTTP POST or other means.
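On the web-site side, the receiving end of that HTTP POST can be very small. A rough sketch using Flask, with made-up field names and a shared-secret header standing in for real authentication:

```python
from flask import Flask, request, abort

app = Flask(__name__)
SHARED_SECRET = "change-me"  # placeholder; use proper authentication and TLS in practice

@app.route("/stock-update", methods=["POST"])
def stock_update():
    if request.headers.get("X-Api-Key") != SHARED_SECRET:
        abort(401)
    payload = request.get_json()  # e.g. {"article_id": "A123", "stock": 42}
    update_article_stock(payload["article_id"], payload["stock"])
    return {"status": "ok"}

def update_article_stock(article_id, stock):
    ...  # write the new stock level to the website's own database
```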
IBM Redbook: Stored Procedures, Triggers, and User-Defined Functions on DB2 Universal Database for iSeries
IBM i 7.1 Information Center: Data Queue APIs
Scott Klement's RPG IV Sockets Tutorial