Unlimited Scaling on AWS

By Tim Hansen,
Consultant, Novataris
Nyhavn 43, 1051 Copenhagen K
Mobile: +45 31 43 58 61 | Office: +45 70 27 80 00

In this blog post, I will talk about my experiences building a Java application on Amazon Web Services (AWS), and present a scheme we use to create an infinitely (well, maybe just extremely) scalable and fully managed application for processing and storing a continuous stream of data. The set-up is based on just three AWS services. The first is Elastic Beanstalk, a service that makes deploying a web application super easy and handles auto scaling like a breeze. The second is DynamoDB, AWS's take on a NoSQL document data store, built for extreme performance over functionality. And the last is Amazon's take on serverless computing, AWS Lambda. Before presenting the pattern for the extremely scaling application, I will first describe these services to highlight their advantages and point out the drawbacks you should be aware of, should you choose to create an application along similar lines.

SDK

In writing a post about how easy it is to use AWS to build a dynamic and managed application, the AWS software development kits (SDKs) deserve special mention. These SDKs are powerful and take care of integration with all the 72 services (at the time of writing) available on AWS. This makes it possible to access data, create tables, post notifications, send emails, invoke lambda functions or even launch new applications as part of an application, to name but a few of the possibilities.
These SDKs receive a remarkable amount of attention from Amazon. Half a year ago there were only a handful of SDKs, including ones for Java, Node.js and Python; at the time of writing there are more than ten, including SDKs for .NET, C++, Ruby, and Go. It is not just new languages that are added to the SDK suite, but also new features and improved APIs. The Java SDK, for example, receives 10-15 updates every month to ensure it can interact with all the new features and services constantly added to AWS.
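As a small taste of what the SDKs feel like in practice, here is a minimal sketch that uses the Java SDK to build an SNS client and post a notification; the topic ARN is just a placeholder.

import com.amazonaws.services.sns.AmazonSNS;
import com.amazonaws.services.sns.AmazonSNSClientBuilder;

public class SdkExample {

    public static void main(String[] args) {
        // The builder picks up credentials and region from the environment
        // (environment variables, ~/.aws/credentials or an instance profile).
        AmazonSNS sns = AmazonSNSClientBuilder.defaultClient();

        // Publish a notification to a topic; the ARN is a placeholder.
        sns.publish("arn:aws:sns:eu-west-1:123456789012:my-topic",
                    "Hello from the AWS SDK for Java");
    }
}

A couple of lines of boilerplate, and the SDK takes care of signing, serialization and the HTTP call for you.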

 

Elastic Beanstalk

Elastic Beanstalk combines several distinct services into one, all aimed at the specific goal of deploying and managing an application in the cloud.
Elastic Beanstalk automatically sets up an auto-scaling group of virtual instances to run the application. These instances come with an OS and everything else necessary to run the application, so you just upload a war file (in the case of Java; many other languages are supported), and Elastic Beanstalk takes care of the rest. The auto-scaling group constantly monitors the traffic and workload of each instance in the cluster, and this information is used to automatically start new instances when needed and to shut them down when they are no longer needed. This auto-scaling is the heart of the elasticity in Elastic Beanstalk, and it can be set to trigger on CPU utilization, network in or out, latency or number of requests, among others, to scale in accordance with the specific needs of an application.
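To give an idea of how the scaling trigger can be tailored to an application, here is a sketch that uses the Java SDK to switch an existing environment to trigger on CPU utilization; the environment name and thresholds are placeholders, and the same settings can just as well be applied through the console or an .ebextensions file.

import com.amazonaws.services.elasticbeanstalk.AWSElasticBeanstalk;
import com.amazonaws.services.elasticbeanstalk.AWSElasticBeanstalkClientBuilder;
import com.amazonaws.services.elasticbeanstalk.model.ConfigurationOptionSetting;
import com.amazonaws.services.elasticbeanstalk.model.UpdateEnvironmentRequest;

public class ScalingTriggerExample {

    public static void main(String[] args) {
        AWSElasticBeanstalk eb = AWSElasticBeanstalkClientBuilder.defaultClient();

        // Switch the auto-scaling trigger of an existing environment
        // ("my-env" is a placeholder) to CPU utilization, scaling out
        // above 70 % and back in below 30 %.
        eb.updateEnvironment(new UpdateEnvironmentRequest()
            .withEnvironmentName("my-env")
            .withOptionSettings(
                new ConfigurationOptionSetting("aws:autoscaling:trigger",
                    "MeasureName", "CPUUtilization"),
                new ConfigurationOptionSetting("aws:autoscaling:trigger",
                    "Unit", "Percent"),
                new ConfigurationOptionSetting("aws:autoscaling:trigger",
                    "UpperThreshold", "70"),
                new ConfigurationOptionSetting("aws:autoscaling:trigger",
                    "LowerThreshold", "30")));
    }
}

The same option namespace accepts the other measures mentioned above, such as network in/out or latency, so the trigger can follow whatever metric best reflects the load on your application.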
While Elastic Beanstalk might not be the sexiest of services, it does what it is supposed to do, so that you can forget about it. Between the ease of deploying new code with rolling deployment and the seamless scaling, Elastic Beanstalk has the potential to become the unsung hero of almost any internet application.

DynamoDB

DynamoDB is Amazon’s take on a NoSQL database. Just as with Elastic Beanstalk, it is super easy to set up. You don’t have to worry about servers or software – all of this is handled for you. You just give the table a name, a primary key (more on that later) and set its provisioned capacity, and you are all set. Should your needs change so that you require more or less capacity, you simply adjust the provisioned capacity, and AWS scales the servers that host your table behind the scenes without any downtime or loss of data.
Here the term provisioned capacity might seem out of place. This is because DynamoDB is a service where you pay for how much capacity you think you are going to use, not your actual usage. The idea is that if you foresee a peak in activity on one of your tables, you can provision enough capacity in advance so that the servers hosting it are ready for the load. But considering how much effort AWS has put into making many of their services auto-scaling, it is somewhat disappointing that Dynamo is very much manual-scaling.
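As a sketch of how little it takes, the following creates a table with a name, a simple hash-key primary key and a provisioned capacity, and later raises that capacity; the table and attribute names are placeholders.

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.AttributeDefinition;
import com.amazonaws.services.dynamodbv2.model.CreateTableRequest;
import com.amazonaws.services.dynamodbv2.model.KeySchemaElement;
import com.amazonaws.services.dynamodbv2.model.KeyType;
import com.amazonaws.services.dynamodbv2.model.ProvisionedThroughput;
import com.amazonaws.services.dynamodbv2.model.ScalarAttributeType;
import com.amazonaws.services.dynamodbv2.model.UpdateTableRequest;
import com.amazonaws.services.dynamodbv2.util.TableUtils;

public class TableAdminExample {

    public static void main(String[] args) throws InterruptedException {
        AmazonDynamoDB dynamo = AmazonDynamoDBClientBuilder.defaultClient();

        // Create a table with a simple hash-key primary key and a modest
        // provisioned capacity. "Measurements" and "measurementId" are
        // placeholder names.
        dynamo.createTable(new CreateTableRequest()
            .withTableName("Measurements")
            .withKeySchema(new KeySchemaElement("measurementId", KeyType.HASH))
            .withAttributeDefinitions(
                new AttributeDefinition("measurementId", ScalarAttributeType.S))
            .withProvisionedThroughput(new ProvisionedThroughput(5L, 5L)));

        // Wait for the table to become ACTIVE before modifying it.
        TableUtils.waitUntilActive(dynamo, "Measurements");

        // Should the load grow, the provisioned capacity can be raised
        // without downtime or loss of data.
        dynamo.updateTable(new UpdateTableRequest()
            .withTableName("Measurements")
            .withProvisionedThroughput(new ProvisionedThroughput(50L, 25L)));
    }
}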

 

 

Primary keys, indices and table operations

Returning to what a primary key is in relation to a Dynamo table: it is, simply put, the unique identifier for each entry in the table. The interesting thing about a primary key, though, is that it may consist of two values: a hash key and, optionally, a range key. This may seem like a superfluous alternative to just having auto-generated ids for each data entry in the table. But the primary key doesn't just set the parameter space of the data entries that can coexist in the table; it also defines which operations can be used to read data from the table.
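A composite primary key is declared simply by adding a second key schema element. The sketch below defines a hypothetical table of sensor readings with a hash key and a range key; all names are placeholders and are reused in the examples that follow.

import com.amazonaws.services.dynamodbv2.model.AttributeDefinition;
import com.amazonaws.services.dynamodbv2.model.CreateTableRequest;
import com.amazonaws.services.dynamodbv2.model.KeySchemaElement;
import com.amazonaws.services.dynamodbv2.model.KeyType;
import com.amazonaws.services.dynamodbv2.model.ProvisionedThroughput;
import com.amazonaws.services.dynamodbv2.model.ScalarAttributeType;

public class CompositeKeyExample {

    /**
     * Builds a request for a table whose primary key is the combination of
     * a hash key ("deviceId") and a range key ("timestamp") -- placeholder
     * names for a table of sensor readings.
     */
    public static CreateTableRequest sensorReadingsTable() {
        return new CreateTableRequest()
            .withTableName("SensorReadings")
            .withKeySchema(
                new KeySchemaElement("deviceId", KeyType.HASH),
                new KeySchemaElement("timestamp", KeyType.RANGE))
            .withAttributeDefinitions(
                new AttributeDefinition("deviceId", ScalarAttributeType.S),
                new AttributeDefinition("timestamp", ScalarAttributeType.N))
            .withProvisionedThroughput(new ProvisionedThroughput(10L, 10L));
    }
}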

DynamoDB is a document store made for performance rather than functionality. This means that while Dynamo is super fast, the number of operations available to retrieve the contents of a table is quite limited, and any inter-table relationships must be managed by the code reading from the tables, outside of Dynamo. There are only four operations available to read from a table: scan, query, get and batch get, ordered here by how much of the primary key must be known to perform the operation. Scan is the operation that requires the least knowledge, in that it requires none. It does, however, return the entire table, which may be a very expensive operation if your table contains a lot of data. The four types of operations are named differently in the SDKs, so it is easy to spot when you are scanning a table, forcing you to consider whether that is really what you want. Query requires that you provide the hash key of the primary key and returns the elements that match it. And finally get, and by extension batch get, requires that you provide the entire primary key and returns the single matching element.
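The following sketch shows the three flavours side by side against the hypothetical sensor-readings table from above: a scan with no key knowledge, a query given only the hash key, and a get given the full primary key. All values are placeholders.

import java.util.HashMap;
import java.util.Map;

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.GetItemResult;
import com.amazonaws.services.dynamodbv2.model.QueryRequest;
import com.amazonaws.services.dynamodbv2.model.QueryResult;
import com.amazonaws.services.dynamodbv2.model.ScanRequest;
import com.amazonaws.services.dynamodbv2.model.ScanResult;

public class ReadOperationsExample {

    public static void main(String[] args) {
        AmazonDynamoDB dynamo = AmazonDynamoDBClientBuilder.defaultClient();

        // Scan: no knowledge of the key required -- returns the whole table.
        ScanResult everything = dynamo.scan(
            new ScanRequest().withTableName("SensorReadings"));

        // Query: the hash key is known -- returns every element with that key.
        Map<String, AttributeValue> values = new HashMap<>();
        values.put(":device", new AttributeValue("device-42"));
        QueryResult oneDevice = dynamo.query(new QueryRequest()
            .withTableName("SensorReadings")
            .withKeyConditionExpression("deviceId = :device")
            .withExpressionAttributeValues(values));

        // Get: the full primary key (hash + range) is known -- returns one element.
        Map<String, AttributeValue> key = new HashMap<>();
        key.put("deviceId", new AttributeValue("device-42"));
        key.put("timestamp", new AttributeValue().withN("1508236800"));
        GetItemResult single = dynamo.getItem("SensorReadings", key);

        System.out.println(everything.getCount() + " / "
            + oneDevice.getCount() + " / " + single.getItem());
    }
}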
The cost of an operation is calculated from the size of the list or object returned: every 4 KB returned consumes one read capacity unit, rounded up to a whole integer. While the API allows you to provide additional filters to reduce the amount of data returned and to return only a subset of the fields on the retrieved objects, this is entirely for saving network traffic. These filters are applied after the data has been retrieved from the database and so do not reduce the cost of the operation. It is worth noting here that calls to Dynamo are HTTP requests behind the scenes. This is the rationale behind batch get: instead of making multiple HTTP requests, the get requests are batched into a single call. But when calculating the cost of this operation, it is still treated as a series of individual get operations, meaning each item is rounded up separately. Fetching 100 items of 500 bytes each, for example, costs 13 read capacity units with a query (roughly 49 KB, rounded up once), but 100 units with a batch get (each item rounded up individually). In some cases it may even be cheaper to scan the entire table, which really illustrates the amount of planning and thought that needs to be applied when working with Dynamo. To aggravate matters, while you can always add or remove indices from a table, you cannot change its primary key. If, as your application grows, you find out you need another set of parameters for your primary key, your only option is to create a new table.

Taking these considerations into account, it becomes evident that the query operation is the most desirable operation for retrieving a set of elements, as its cost is calculated from the combined size of the list, with only a single round-up, and it is done in a single call. But since it requires that the primary key of the table have a range key component (otherwise a query is no different from a get), and since it can only be used to retrieve objects sharing the same hash key, it may seem like there is only a very limited number of cases where it can be used. This is where indices come into the picture. Indices can be used to set up mirror tables, where the primary key definition is different from that of the original table.
In this way you can query against any property on the objects in the table by creating an index with that property as the hash key. A word of caution: even though the query operation is cheaper than batch get and scan, setting up an index requires you to provision and pay for the capacity of both the main table and the index.
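Querying an index looks just like querying the table itself, except that the index name is supplied as well. A sketch, assuming a hypothetical global secondary index on a customerId attribute:

import java.util.HashMap;
import java.util.Map;

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.QueryRequest;
import com.amazonaws.services.dynamodbv2.model.QueryResult;

public class IndexQueryExample {

    public static void main(String[] args) {
        AmazonDynamoDB dynamo = AmazonDynamoDBClientBuilder.defaultClient();

        // Query a global secondary index whose hash key is "customerId".
        // Both the index name and the attribute are placeholders.
        Map<String, AttributeValue> values = new HashMap<>();
        values.put(":customer", new AttributeValue("customer-7"));

        QueryResult byCustomer = dynamo.query(new QueryRequest()
            .withTableName("SensorReadings")
            .withIndexName("customerId-index")
            .withKeyConditionExpression("customerId = :customer")
            .withExpressionAttributeValues(values));

        System.out.println(byCustomer.getCount() + " readings for the customer");
    }
}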

 

Streams and TTL

Besides being a highly available, super-fast document store, Dynamo has a couple of cool features. The first to point out is streams. A stream exposes every change made to any element in the table, and can be set up to contain both the old image, before the change, and the new image. An application listening on the stream can use this to trigger specific actions based on the type of event, and can even trigger events when certain fields have changed by comparing the old and the new image, without having to store the state of every element of the table in a mirror table. But the really cool thing about streams is that they can be used to trigger lambda functions. We will return to this in the next section.
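Enabling a stream is a one-line change to the table configuration. A sketch, again against the hypothetical sensor-readings table, requesting both the old and the new image:

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.StreamSpecification;
import com.amazonaws.services.dynamodbv2.model.StreamViewType;
import com.amazonaws.services.dynamodbv2.model.UpdateTableRequest;

public class EnableStreamExample {

    public static void main(String[] args) {
        AmazonDynamoDB dynamo = AmazonDynamoDBClientBuilder.defaultClient();

        // Turn on a stream that carries both the old and the new image of
        // every changed element.
        dynamo.updateTable(new UpdateTableRequest()
            .withTableName("SensorReadings")
            .withStreamSpecification(new StreamSpecification()
                .withStreamEnabled(true)
                .withStreamViewType(StreamViewType.NEW_AND_OLD_IMAGES)));
    }
}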
The other nice feature, time to live (TTL), has only recently been added to Dynamo. It automatically cleans old data from your table when it is no longer needed. Elements are deleted on an element-by-element basis, based on the attribute identified as the TTL attribute. This could, for example, be used to set a buffer period after a delete action has been made in your application: the element is given a value for the TTL attribute, meaning it will be deleted after a given time, but the user can still cancel the operation within that window.
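TTL is switched on in much the same way, by pointing Dynamo at the attribute that holds the expiry time as epoch seconds. A sketch, where the attribute name expiresAt is a placeholder:

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.TimeToLiveSpecification;
import com.amazonaws.services.dynamodbv2.model.UpdateTimeToLiveRequest;

public class EnableTtlExample {

    public static void main(String[] args) {
        AmazonDynamoDB dynamo = AmazonDynamoDBClientBuilder.defaultClient();

        // Tell Dynamo which attribute holds the expiry time (epoch seconds).
        // "expiresAt" is a placeholder attribute name.
        dynamo.updateTimeToLive(new UpdateTimeToLiveRequest()
            .withTableName("SensorReadings")
            .withTimeToLiveSpecification(new TimeToLiveSpecification()
                .withEnabled(true)
                .withAttributeName("expiresAt")));
    }
}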

 

AWS Lambda

Amazon was one of the first to make a serverless application service available. This is a very interesting concept where you provide a small application (ideally so small that it may be viewed as just a method or function) and hardly have to think about the hardware that is going to run it. You only need to configure the amount of memory it will have available and the maximum duration of an invocation. When the function is called, the code is automatically deployed to some environment and then executed. When the invocation has run its course, the function remains available for a short while, so that its next invocation does not have to wait for it to be set up on a server. If no such invocation arrives within that short window, the application is shut down until it is called again.
This is at the very forefront of computing on demand, where an application only exists when it is called and is removed when it is not. Should it be called multiple times within a short interval, multiple instances are automatically set up to handle the load, and some or all of them are removed again should the request rate drop. What makes AWS Lambda truly useful is that it can be set up to automatically trigger on a number of events from other AWS services, such as a Dynamo stream, as mentioned earlier.
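To make the concept concrete, this is roughly what such a function can look like in Java: a single class with a single method, and nothing else. The class name and the greeting are just for illustration.

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;

/**
 * A minimal Lambda function: a single method that is deployed and run
 * only when it is invoked.
 */
public class HelloLambda implements RequestHandler<String, String> {

    @Override
    public String handleRequest(String name, Context context) {
        context.getLogger().log("Invoked with " + name);
        return "Hello, " + name;
    }
}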

 

Infinitely scaling managed applications

In this section I will present a set-up that makes use of the three services presented above and combines them into an extremely scaling, fully managed application that processes a stream of high-intensity data.
The application is exposed to the outside world via an application uploaded to Elastic Beanstalk. Again, this isn't super sexy, but it provides a solid foundation for the application to scale as network activity increases or decreases. When the streaming data is sent to the application, it is this REST application that receives the data and saves it to a DynamoDB table for storage.
DynamoDB is chosen for its high performance, and because it is possible to design queries that allow us to reprocess the data for any given batch, should we so desire. To avoid being flooded as data keeps pouring into the table, TTL is used to automatically remove data once it has reached a certain age. We just set the TTL value in the REST application every time data is posted to it.
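A sketch of what that write could look like in the REST application, assuming the hypothetical sensor-readings table from earlier and a seven-day retention period:

import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;

public class IngestExample {

    private final AmazonDynamoDB dynamo = AmazonDynamoDBClientBuilder.defaultClient();

    /** Stores one reading and stamps it with an expiry time seven days out. */
    public void store(String deviceId, long timestamp, String payload) {
        long expiresAt = Instant.now().plus(Duration.ofDays(7)).getEpochSecond();

        Map<String, AttributeValue> item = new HashMap<>();
        item.put("deviceId", new AttributeValue(deviceId));
        item.put("timestamp", new AttributeValue().withN(Long.toString(timestamp)));
        item.put("payload", new AttributeValue(payload));
        // The TTL attribute must hold the expiry as epoch seconds.
        item.put("expiresAt", new AttributeValue().withN(Long.toString(expiresAt)));

        dynamo.putItem("SensorReadings", item);
    }
}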
The next step is to process the data. This is done by creating an AWS Lambda function and setting it up to trigger automatically on a stream from the Dynamo table. This ensures that the lambda function is executed every time an element is created in the table. When setting up automatic triggers on a Lambda function, it is worth pointing out that the type of invocation this results in cannot be configured; it is determined entirely by the type of trigger.
For our purpose, a trigger from a Dynamo stream results in a blocking, synchronous invocation. This means that if data is written to the table faster than a single lambda instance can process it, the lambda will lag further and further behind and data will eventually be lost. To compensate for this problem, we make use of a distributor-worker set-up.
In such a setup, it is the distributor function that triggers on the Dynamo stream, while a worker function is created to do the processing. The distributor function should be super slim and do nothing but read a batch from the stream and pass it on asynchronously to the worker lambda for processing. The distributor is then free to read yet another batch and repeat the process while the worker lambda is still processing the data. This spawns a new instance of the worker lambda, and data can now be processed in parallel. Once a worker is done, the result is written to another Dynamo table and is exposed via the REST application on Elastic Beanstalk.

Code Example
The key component of this set-up is the distributor lambda. But it is a very simple component to write. I will provide a code example of how it can be done in Java, along with a few notes.

import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

import org.apache.log4j.Logger;

import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.lambda.AWSLambda;
import com.amazonaws.services.lambda.AWSLambdaClientBuilder;
import com.amazonaws.services.lambda.invoke.LambdaInvokerFactory;
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.events.DynamodbEvent;

/**
 * AWS Lambda function that is triggered by events from our table.
 */
public class DistributorLambda {

    private final WorkerInterface service;
    private final Logger logger = Logger.getLogger(this.getClass());

    /**
     * Parameter-less constructor required for AWS to instantiate.
     */
    public DistributorLambda() {
        AWSLambda awsLambdaClient = AWSLambdaClientBuilder.standard().build();
        service = LambdaInvokerFactory.builder()
                    .lambdaClient(awsLambdaClient)
                    .build(WorkerInterface.class);
    }

    /**
     * Handle the stream event from the table.
     *
     * @param ddbEvent The event in the database as reported by the stream.
     * @param context  The context object allows you to access useful
     *                 information available within the Lambda execution
     *                 environment.
     */
    public void distribute(DynamodbEvent ddbEvent, Context context) {
        try {
            Instant start = Instant.now();
            logger.info("Distributor started with " +
                ddbEvent.getRecords().size() + " records.");
            // Get the records
            List<Map<String, AttributeValue>> attributeMap =
                ddbEvent.getRecords().stream()
                    // keep only insertion events
                    .filter(dynamodbStreamRecord ->
                        dynamodbStreamRecord.getEventName().equals("INSERT"))
                    // extract the keys to minimize network and serialization
                    .map(dynamodbStreamRecord ->
                        dynamodbStreamRecord.getDynamodb().getKeys())
                    // finally collect the keys in a list.
                    .collect(Collectors.toList());
            service.delegate(attributeMap);
            Instant finish = Instant.now();
            logger.info("Sent " + attributeMap.size() + " records in " +
                Duration.between(start, finish).toMillis() + " ms.");
        } catch (Exception e) {
            logger.error("Error in handleRequest " + e.getMessage());
        }
    }
}

First, I will consider the function that is triggered on invocation by the stream. To be triggered by such a stream, it takes a DynamodbEvent as its first input parameter. The second parameter, the Context, is optional, but can be used for useful interaction, for example with the logging framework for lambda functions.
The first thing the distributor does is get the records from the event and stream them (a glorified way of saying "loop over them"). Then the records that are not related to creation of new elements in the table are removed. Next, the keys (the hash and range keys) are extracted from each record and collected into a list. Finally, these keys are sent to the worker lambda with service.delegate. The worker can then use the keys to get the relevant entries. We only send the keys to minimize both the time spent serializing the payload and the network traffic, so that the distributor finishes as quickly as possible and can start on the next batch of Dynamo events.
The integration with the worker lambda is handled by the AWS SDK, and it does so quite elegantly. We only have to create an interface with a single method. Most of the integration then happens via the @LambdaFunction annotation. In this annotation we give the name of the worker lambda function as it appears in the AWS console (the online GUI), and we set the invocation type. Setting the invocation type to Event is paramount: this is what tells the SDK that the worker should be invoked asynchronously, and it is what makes this entire setup feasible.

import java.util.List;
import java.util.Map;

import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.lambda.invoke.LambdaFunction;
import com.amazonaws.services.lambda.model.InvocationType;

/**
 * This interface allows the distributor to call its worker.
 */
public interface WorkerInterface {

    @LambdaFunction(functionName = "Worker",
                    invocationType = InvocationType.Event)
    void delegate(List<Map<String, AttributeValue>> attributeMap);
}

Summary

AWS is a platform with a great number of services that make it easy to set up an auto-scaling application, and most of them are managed almost to the point where you can fire and forget. The SDKs in particular make it very easy to integrate with the many services provided on the platform, which sees rapid development and the constant addition of new services and features.
Using just three of the components available on AWS, we have seen a setup that provides an almost infinitely scaling application for processing streaming data. The first of the three components is Elastic Beanstalk. It only plays a minor role in this set-up, exposing the API that adds data to a table, and in general it is only about as sexy as the foundation of a building, but also just as dependable. The second component is AWS's document store, DynamoDB. This is a remarkably fast database service that may require some planning in advance, but it has some very useful features, such as TTL and streams, both of which are used in this setup. The last component is the workhorse of the setup, AWS Lambda functions. They are a quite interesting feature that allows you to separate specific functionality into small packages that are only deployed and run when needed.
The key component that makes the setup work is the distributor-worker lambda pattern. In this setup, it is the distributor that is triggered by the Dynamo stream, and it then uses minimal processing to quickly delegate the work asynchronously to a worker lambda. It is fairly straightforward to achieve this setup, mostly due to the completeness of the AWS SDK, and a working code example of the distributor lambda has been included to start you off on your own extremely scalable adventure.
