
Build Status Coverage Status Erlang/OTP Release

miniclip

This code is the result of an interview challenge from Miniclip, using AWS SQS, AWS DynamoDB and the Apple receipt validation website.

Getting started

You need to clone the repository and download rebar3 (if it is not already available in your path):

git clone https://github.com/thiagoesteves/miniclip.git
cd miniclip
wget https://s3.amazonaws.com/rebar3/rebar3
chmod a+x rebar3

To compile and run the miniclip server:

make

PS: you may need to increase the number of file descriptors in your shell; the ideal value is 4096 (ulimit -n 4096).

To run the unit tests and see the coverage:

make test

Security Credentials

You can provide them via the erlcloud application environment variables, defined in the miniclip.hrl file:

%% AWS Credential definitions
-define(AWS_ACCESS_KEY_ID,     "XXXXXXXX").
-define(AWS_SECRET_ACCESS_KEY, "XXXXXXXX").
-define(AWS_REGION,            "XXXXXXXX").
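
As a minimal sketch, the macros above could be pushed into erlcloud's application environment at startup, which is where erlcloud reads aws_access_key_id and aws_secret_access_key by default (the aws_region key name is an assumption and may differ between erlcloud versions):

%% Sketch: push the miniclip.hrl credentials into erlcloud's application
%% environment before the first AWS request is made.
set_aws_credentials() ->
    application:set_env(erlcloud, aws_access_key_id,     ?AWS_ACCESS_KEY_ID),
    application:set_env(erlcloud, aws_secret_access_key, ?AWS_SECRET_ACCESS_KEY),
    %% The region key below is an assumption; some erlcloud versions expect
    %% per-service host configuration instead.
    application:set_env(erlcloud, aws_region,            ?AWS_REGION).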

CloudFormation file

You can find the CloudFormation template for the AWS SQS queue and the AWS DynamoDB table inside the aws folder.

Timesheet

You can find the timesheet with all major subtasks and their durations inside the doc folder.

Discussions

What strategies would you employ should the number of validations increase a hundredfold?

The image below shows how our solution was designed. There is only one consumer for AWS SQS messages and for each message, a gen_server will be created to handle the whole validation process.

Current Server

The unit tests show how powerful Erlang is at processing thousands of messages, which leads us to look at the external services (AWS SQS, Apple verification and DynamoDB) and the hardware when searching for constraints and unexpected behaviors.
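
A rough sketch of this design follows; module and function names are illustrative, and the shape of the erlcloud_sqs:receive_message/1 return value is assumed from its documentation:

%% Sketch: a single consumer polls SQS and spawns one gen_server per
%% message to run the whole validation pipeline.
consume(QueueName) ->
    %% Assumed return shape: a proplist containing {messages, Messages}.
    Response = erlcloud_sqs:receive_message(QueueName),
    Messages = proplists:get_value(messages, Response, []),
    [receipt_validator:start_link(Msg) || Msg <- Messages],  % hypothetical worker module
    consume(QueueName).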

AWS SQS:

In the current solution, the first constraint is reading messages from AWS SQS, where the maximum is 10 messages per request. Besides, after a message is processed, the AWS SQS API (as used here) supports only one delete per request. These numbers are very low if the system is going to handle a very large amount of messages.

AWS SQS supplies methods to improve reading, sending and deleting messages using horizontal scaling and batch actions. Using these techniques, we could change our Erlang application, from the SQS point of view, to:

  • Increase the number of message consumers (right now we have only one), and for every consumed message create one gen_server for the Apple validation;
  • Once the message is validated (OK or INVALID), the process could send the result to another internal server that consumes results and sends them to the post queue using batch functions (BatchSQS);
  • The results from the item above would be sent to another server that could use the batch functions to delete them (a sketch of such a batch server follows this list).
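
A sketch of an intermediate batch-delete server is shown below; the module name and batching details are illustrative, and only erlcloud_sqs:delete_message/2 is used, with a note on where a real batch call would go:

%% Sketch: a gen_server that buffers SQS receipt handles and deletes them
%% in groups, so the consumer does not issue one delete call per message.
-module(sqs_batch_deleter_sketch).
-behaviour(gen_server).
-export([start_link/1, delete/1]).
-export([init/1, handle_call/3, handle_cast/2]).

-define(BATCH_SIZE, 10).

start_link(QueueName) ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, QueueName, []).

delete(ReceiptHandle) ->
    gen_server:cast(?MODULE, {delete, ReceiptHandle}).

init(QueueName) ->
    {ok, #{queue => QueueName, pending => []}}.

handle_cast({delete, Handle}, #{queue := Queue, pending := Pending} = State) ->
    case [Handle | Pending] of
        Batch when length(Batch) >= ?BATCH_SIZE ->
            flush(Queue, Batch),
            {noreply, State#{pending := []}};
        Batch ->
            {noreply, State#{pending := Batch}}
    end.

handle_call(_Request, _From, State) ->
    {reply, ok, State}.

flush(Queue, Handles) ->
    %% One delete per handle using the known erlcloud_sqs:delete_message/2;
    %% if the erlcloud version in use exposes the SQS DeleteMessageBatch
    %% action, this loop could be replaced by a single batch call.
    [erlcloud_sqs:delete_message(Queue, H) || H <- Handles],
    ok.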

DynamoDB:

DynamoDB has some constraints related to reads and writes. According to the documentation, the read/write capacity can be managed and defined when the table is created, and we also have a maximum of 50 parallel requests to the table (this is the limit set in the current solution). For improvements, I would suggest:

  • Increase the read/write capacity to 50 if you are using the maximum number of parallel requests;
  • Consider creating an intermediate server to execute batch operations (BatchDynamoDB);
  • Since we are providing a transaction_id to the DynamoDB table, we could create a temporary ets table and save the transaction_id values locally when an invalid receipt is received. In this scenario, if a storm of already validated transactions occurs, the server would reject the receipt before accessing DynamoDB (which reduces the AWS cost). This internal table would have to be deleted or cleaned from time to time to avoid storing too much data and inconsistent information (a sketch of this local cache is shown after this list).
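
A minimal sketch of that local cache (names, ownership and cleanup interval are illustrative choices) could use a named ets table consulted before any DynamoDB access:

%% Sketch: local cache of transaction_id values already seen, checked
%% before touching DynamoDB.
-module(txn_cache_sketch).
-export([new/0, seen/1, remember/1, clear/0]).

-define(TABLE, miniclip_txn_cache).

new() ->
    ets:new(?TABLE, [named_table, set, public]).

seen(TransactionId) ->
    ets:member(?TABLE, TransactionId).

remember(TransactionId) ->
    ets:insert(?TABLE, {TransactionId, erlang:system_time(second)}).

%% Called periodically (e.g. from a timer) to avoid unbounded growth.
clear() ->
    ets:delete_all_objects(?TABLE).

Before calling DynamoDB, the worker would check seen(TransactionId) and skip the remote call when it returns true.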

Apple Website:

For the Apple website, I didn't find any information online about its constraints, but during my tests I realized that a storm of invalid receipts causes the server to reject my requests, so our application must handle this corner case. Once the application is deployed, because we are handling only valid production receipts, we can skip the verification when the receipt was made for the sandbox (my application currently accepts them because I don't have production receipts to test with). This helps if an attack on the server uses valid receipts generated by the sandbox.
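
A minimal sketch of classifying a decoded verifyReceipt response is shown below; the map shape with a binary <<"status">> key is an assumption, while 21007 is the status Apple documents for a sandbox receipt sent to the production endpoint:

%% Sketch: classify a decoded verifyReceipt response. In production we can
%% reject sandbox receipts (status 21007) instead of retrying against the
%% sandbox endpoint.
classify_receipt(#{<<"status">> := 0})     -> valid;
classify_receipt(#{<<"status">> := 21007}) -> rejected_sandbox_receipt;
classify_receipt(#{<<"status">> := _Else}) -> invalid.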

Hardware:

It is always a good idea to map how the machine resources are being used by the server application, for example CPU and memory. Sometimes the application runs close to the maximum CPU capacity, which can cause the Erlang application to run more slowly than expected and affect the server results. It is also always a good idea to keep a good network connection between the Erlang application and the external servers (AWS servers + Apple validation).
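
A small sketch of collecting such a snapshot with standard Erlang/OTP tools (cpu_sup requires the os_mon application to be started):

%% Sketch: a quick snapshot of VM and host resource usage.
resource_snapshot() ->
    #{memory_total_bytes => erlang:memory(total),
      run_queue          => erlang:statistics(run_queue),
      process_count      => erlang:system_info(process_count),
      cpu_utilization    => cpu_sup:util()}.   % needs os_mon started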

Based on the solutions proposed, we could redesign the server application as the picture below:

Proposed Server

What mechanisms would you put in place in order to improve fault tolerance of the service?

In order to keep the service alive all the time, we must take all precautions from the inside (server application) and the outside (redundancy).

Erlang server application

For the Erlang application design, all gen_servers must be under a supervisor process. This allows the application to be fault tolerant when one or more gen_servers crash because of a request to the external servers (SQS, DynamoDB, etc.) or even because of internal failures. I found many corner cases in the servers' replies, and I would prefer to handle them instead of letting the gen_server crash (be careful with the "let it crash" philosophy!).
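
A minimal sketch of such a supervisor, spawning one temporary worker per message under a simple_one_for_one strategy (module names are illustrative, not the actual ones in this repository):

%% Sketch: a supervisor that starts one temporary worker per SQS message.
-module(validator_sup_sketch).
-behaviour(supervisor).
-export([start_link/0, start_worker/1, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

%% Called by the SQS consumer for every received message.
start_worker(Message) ->
    supervisor:start_child(?MODULE, [Message]).

init([]) ->
    SupFlags = #{strategy => simple_one_for_one,
                 intensity => 5,
                 period => 10},
    Child = #{id => receipt_validator,                 % hypothetical worker module
              start => {receipt_validator, start_link, []},
              restart => temporary,
              shutdown => 5000,
              type => worker},
    {ok, {SupFlags, [Child]}}.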

Clustering many servers

The design of this server allows us to run the same server on one or more machines. We could make some changes so the servers run in a cluster, where they exchange messages with each other in two modes of redundancy:

  • Active/Standby: once connected, after an election, the active server processes messages and the other one only takes over in case of failure of the active server;
  • Load balance: once connected, the servers could split the maximum number of parallel requests to DynamoDB, for example 25 each. If for some reason one of them loses the connection (because of a crash, for example), the other can try to take all the resources for itself (a small election sketch follows this list).
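
A very small sketch of the Active/Standby idea, using global name registration as a crude election (assuming the nodes are already connected into one Erlang cluster; the registered name is illustrative):

%% Sketch: whichever node manages to register the global name becomes the
%% active consumer; the other stays on standby and retries if the active
%% one dies. This is a crude election, not a full leader-election protocol.
try_become_active(ConsumerPid) ->
    case global:register_name(miniclip_active_consumer, ConsumerPid) of
        yes -> active;
        no  -> standby
    end.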

As part of the redundancy, we can configure the application server to send a message to the cloud (AWS SQS) and trigger a lambda function that will send an email to someone who should know that the server is down or restarting.

It is always good practice to run redundant servers on different physical machines, to avoid the service becoming unavailable because of a power outage at one site.

Live failures/bugs will always occur, what measures would you take in order to decrease the time it takes to find issues?

The server application has the external server access (AWS SQS, DynamoDB, etc.) as its most likely points of failure. The unit tests can check and simulate some known behaviors of these servers, but they are live machines and can change without any notice. Our application can work perfectly fine today, and tomorrow their behavior changes and it crashes.

I believe that logging failures and statistics will help us understand the causes of a possible crash. In this challenge, for example, we don't have any constraint on processing the messages very quickly, which means we have some room for logging with minimal interference in processing time. For this case I would consider the following statistics:

  • The time to execute the Apple validation;
  • The time to check in DynamoDB whether the receipt was already validated;
  • The time to send and delete messages;
  • The total time for a message to be processed;
  • The number of messages per second being processed;
  • The CPU usage, to verify how much processing capacity is available.

These statistics will give you a good idea of how the server is behaving; an increase in these numbers may tell us that we are doing something incorrectly, or that the server is taking more time to process and we have to deal with it. We can redirect these metrics to the cloud (AWS SQS), where a lambda function can analyze the data and send an e-mail indicating a possible problem.
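
A sketch of how one of these timings could be collected with standard tools (timer:tc and logger); validate_with_apple/1 is a hypothetical function name:

%% Sketch: measure how long the Apple validation takes and log it as a
%% structured report without changing the validation code itself.
timed_apple_validation(Receipt) ->
    {MicroSeconds, Result} = timer:tc(fun() -> validate_with_apple(Receipt) end),
    logger:info(#{event => apple_validation_time,
                  duration_ms => MicroSeconds / 1000,
                  result => Result}),
    Result.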

For error messages from the servers, it would be good to log any unexpected return value instead of crashing. Again, we can send these messages to the cloud (AWS SQS), which can trigger a lambda function to send an email informing that a new unexpected error has happened and that the server may be affected. I would suggest logger as the logging tool; it seems to be very light and simple. More info about logger.
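
A minimal sketch of this idea (the function and the shape of the expected reply are illustrative):

%% Sketch: handle an unexpected reply from an external server by logging it
%% instead of letting the gen_server crash on a failed pattern match.
handle_external_reply({ok, Response}) ->
    {ok, Response};
handle_external_reply(Unexpected) ->
    logger:error(#{event => unexpected_external_reply, reply => Unexpected}),
    {error, unexpected_reply}.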

Do not forget to handle an Erlang crash (via redundancy or external daemons) by sending a message to the cloud or anywhere else that tells you the server is down, before your clients start calling at midnight telling you they bought some coins that are not appearing in their games :-).

If you want to enable tracing on the fly, when the server is already running, you can use redbug, which is pretty useful for tracing specific events (when a function is called, for example). You can even control how many messages are captured and how long the trace stays enabled.
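
For example (the module and function being traced are illustrative), a call along these lines limits the trace to 60 seconds or 100 trace messages, whichever comes first:

%% Sketch: trace calls and return values of a (hypothetical) function.
redbug:start("receipt_validator:handle_call -> return",
             [{time, 60000}, {msgs, 100}]).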
