Wednesday, March 12, 2014

AzureBench: Benchmarking the Storage Services of the Azure Cloud Platform and an Application Framework for Scientific Applications on Windows Azure Cloud Platform

Experiment reproduction and evaluation, from the works of Dinesh Agarwal & Sushil Prasad. The current scientific computing scenario faces the need to deal with enormous sets of data-intensive computational tasks. These activities require massive storage resources as well as immense computational power. While the largest and best-funded research projects are able to afford expensive computing infrastructure, most other projects are forced to opt for cheaper resources such as commodity clusters, or simply to limit the scope of their research (Lu, Jackson & Barga, 2010).

However, a less well-known characteristic of cloud computing environments, and a major concern for the consumer segment, is performance. This topic is a permanent subject of discussion in various scientific forums. Current providers are unable to make solid commitments about performance guarantees; common practice addresses availability, portability, and compatibility instead.

Since 2010, Microsoft has offered a Platform-as-a-Service model with an associated pay-as-you-go billing model. Under the name Microsoft Azure platform, this cloud computing platform offers a set of cloud computing services and may provide an option for the computational and storage needs of such scientific computing activities.

Windows Azure allows users to lease Windows virtual machine instances according to a platform-as-a-service model and offers the .NET runtime as the platform through two programmable roles called Worker Roles and Web Roles. Azure also supports VM roles, enabling users to deploy virtual machine instances and thereby supporting an infrastructure-as-a-service model as well.

Azure cloud computing seems to promise a viable solution for the computational demands of the scientific community. However, many questions and concerns arise regarding the performance guarantees and capabilities of such a system. The consulted literature suggests a generalized sense in the community that a deeper understanding of the performance variables of cloud systems is needed. Recent work in the area of cloud computing has focused on this concern.

Azure's network and storage services are optimized for scalability and cost efficiency. However, the primary performance factors in an HPC system are communication latency and bisection bandwidth. The Windows Azure cloud platform lacks the traditional HPC software environment, including MPI and MapReduce. On the other hand, while it might not be the best fit for some scientific applications that essentially require MPI-like functionality, it can provide a very simple model for a wide variety of scientific applications (Agarwal & Prasad, 2012).

There are striking differences between scientific application workloads and the workloads for which Azure and other cloud platforms were originally designed, specifically long-lived web services with modest intra-cluster communication. Scientific application workloads, on the other hand, span a wide spectrum of system requirements. There is a very large and important class of "parameter sweep" or "ensemble computation" workloads that only require a large number of fast processors and have little inter-process communication. At the other extreme are parallel computations, such as fluid flow simulations, where the messaging requirements are so extensive that execution time is dominated by communication (Lu, Jackson & Barga, 2010).

The importance of the Azure platform has been recognized by industry as well as academia, as is evident from the rare partnership between the National Science Foundation (NSF) and Microsoft in funding scientific research on the Azure cloud (Agarwal & Prasad, 2012).

In 2012, an open-source system called AzureBench was presented to the community as an aid to HPC developers and as a contribution to the scientific community in need of HPC resources, which faces the task of choosing the most suitable provider for its needs. The system provides a benchmark suite for performance analysis of the Azure cloud platform and its various storage services. The authors also provide a generic open-source application framework that can serve as a starting point for application development on Azure.

AzureBench provides a comprehensive scalability assessment of the Azure platform's storage services using up to 100 processors. It provides updated, realistic performance measurements, as it utilizes the latest APIs released after the significant changes made to the Azure cloud platform since 2010. AzureBench extends the preliminary work of Hill et al. in their 2010 publication "Early observations on the performance of Windows Azure".


Prior Research

As mentioned before, a significant concern about cloud systems is performance. To date, few if any of the major providers can offer a solid guarantee on application performance. Due to this concern, Hill et al. initiated their research in 2010 under the title "Early Observations on the Performance of Windows Azure". The term "early" derives from the first days of the Azure system, which launched that same year.

In their research they initially tried to address the issue developers face as more cloud providers and technologies enter the market: deciding which vendor to choose for deploying their applications. Their work was therefore developed under the premise that one critical step in evaluating cloud offerings is determining the performance of the services offered and how well that performance matches the requirements of the consumer application.

Regardless of the potential advantages of the cloud in comparison to enterprise-deployed applications, cloud infrastructures may ultimately fail if deployed applications cannot predictably meet behavioral requirements (Hill et al., 2010).

Hill et al. presented results from experiments comprising an exhaustive performance evaluation of each of the integral parts of the Azure platform: virtual machines, Table, Blob, Queue, and SQL services. Based on these experiments, their work provided a list of performance-related recommendations for users of the Windows Azure platform.

One interesting finding of the analysis concerned Blob storage performance. It was observed that, depending on the number of concurrent clients and the size of the data objects stored, Blob was between 35% and 3 times faster than Table. However, due to the black-box nature of the cloud services, the researchers were unable to explain this behavior.

Regarding queue behavior, they found that multiple queues should be used to support many concurrent readers and writers, since performance degraded as concurrent readers and/or writers were added. They also found that message retrieval was more affected by concurrency than message put operations, so users cannot assume similar scalability at each end of the queue.

In general, they concluded that performance over time was consistent, although there were rare occasions (less than 1.6% occurrence of a 50% slowdown or worse, and 0.3% of a 2x slowdown or worse) where performance degraded significantly. They therefore saw no reason to provision assuming a worst-case behavior in Azure that is significantly worse than average-case behavior (Hill et al., 2010).

Methodology

The Windows Azure Platform is composed of three services: Windows Azure, SQL Azure, and AppFabric. Agarwal & Prasad's research focuses on Windows Azure, which encompasses both compute resources and scalable storage services.

In order to evaluate the Windows Azure storage mechanisms, Agarwal & Prasad deployed 100 Small VMs. A typical Small VM on Windows Azure has 1 CPU core, 1.75 GB of RAM, and a 225 GB disk. These virtual machines read from and write to Azure storage concurrently.

The AzureBench application workflow starts with a web interface where users can specify the parameters for background processing. The VM configuration for the web role depends on the intensity of the tasks it must handle. For applications where the web role performs computationally intensive operations, a fat VM configuration should be chosen. Similarly, if the web role needs to access large data items from cloud storage, a fat VM can upload/download data to/from storage using multiple threads. To communicate tasks to the worker roles, the web role puts a message on a task assignment queue, as sketched below.
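A minimal sketch of this hand-off, using the SDK 1.x StorageClient library that matches our Azure SDK 1.7 environment, could look as follows. The queue name, connection string placeholder, and parameter format are illustrative assumptions, not AzureBench's actual identifiers:

using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

class TaskDispatch
{
    // Called by the web role after the user submits benchmark parameters.
    static void EnqueueTask(string benchmarkParameters)
    {
        // Placeholder connection string; in a real role this would come
        // from the service configuration.
        CloudStorageAccount account = CloudStorageAccount.Parse(
            "DefaultEndpointsProtocol=https;AccountName=...;AccountKey=...");

        CloudQueue taskQueue = account.CreateCloudQueueClient()
                                      .GetQueueReference("taskassignment");
        taskQueue.CreateIfNotExist();   // idempotent; SDK 1.x method name

        // Worker roles poll this queue and parse the parameters from the body.
        taskQueue.AddMessage(new CloudQueueMessage(benchmarkParameters));
    }
}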


They also selected several metrics, structured according to the execution phases of scientific applications. The first step consists of deploying the customized environment and fetching the initial data. In the second phase the application is executed, so they were also interested in computation performance and the efficiency of network transfers. The following sections summarize the procedures executed during the evaluation process.




Each experiment is repeated five times, with varying entity sizes of 4 KB, 8 KB, 16 KB, 32 KB, and 64 KB; a rough sketch of this sweep follows.
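As a sketch of the sweep driver, RunExperiment below is a hypothetical placeholder for the per-size benchmark body, not an AzureBench function:

using System;

class SizeSweep
{
    static void Main()
    {
        int[] entitySizesKb = { 4, 8, 16, 32, 64 };
        foreach (int sizeKb in entitySizesKb)
        {
            // Deterministic filler payload of the requested size.
            byte[] payload = new byte[sizeKb * 1024];
            new Random(42).NextBytes(payload);
            RunExperiment(payload);   // times the storage operations for this size
        }
    }

    static void RunExperiment(byte[] payload)
    {
        // Hypothetical hook: one full benchmark pass for this payload size.
    }
}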


Blob
Each VM uploads one 100 MB blob to the cloud in 100 chunks of 1 MB each. Workers are synchronized to wait until all VMs are done uploading blobs before starting downloads. Azure provides no API for process synchronization, so this is worked around by using a queue as a shared-memory resource, as sketched below.
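A sketch of this phase under the SDK 1.x API could look as follows; the container and queue names are illustrative, and the barrier logic is our reading of the queue-as-shared-memory workaround rather than AzureBench's exact code:

using System;
using System.Collections.Generic;
using System.IO;
using System.Threading;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

class BlobPhase
{
    static void UploadAndWait(CloudStorageAccount account, int workerId, int workerCount)
    {
        CloudBlobClient blobClient = account.CreateCloudBlobClient();
        CloudBlobContainer container = blobClient.GetContainerReference("benchblobs");
        container.CreateIfNotExist();

        // Upload 100 MB as 100 blocks of 1 MB each, then commit the block list.
        CloudBlockBlob blob = container.GetBlockBlobReference("worker-" + workerId);
        List<string> blockIds = new List<string>();
        byte[] chunk = new byte[1024 * 1024];
        for (int i = 0; i < 100; i++)
        {
            // Block IDs must be base64 strings of equal length.
            string blockId = Convert.ToBase64String(BitConverter.GetBytes(i));
            blob.PutBlock(blockId, new MemoryStream(chunk), null);
            blockIds.Add(blockId);
        }
        blob.PutBlockList(blockIds);

        // Queue-based barrier: each worker posts one "done" message, then
        // polls the approximate message count until every worker has reported.
        CloudQueue sync = account.CreateCloudQueueClient().GetQueueReference("syncqueue");
        sync.CreateIfNotExist();
        sync.AddMessage(new CloudQueueMessage("done-" + workerId));
        while (sync.RetrieveApproximateMessageCount() < workerCount)
            Thread.Sleep(1000);

        // All workers have finished uploading; the download phase can begin.
    }
}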



Queue
Three operations were tested: inserting a message using the PutMessage API, reading a message using the GetMessage API, and reading a message using the PeekMessage API. Evaluation was done under two scenarios: each worker working with its own dedicated queue, and all workers accessing the same queue. For both experiments, a total of 20K messages were first inserted in the queue, then read using both APIs, and finally deleted from the queue, as sketched below.
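In the managed StorageClient library these REST operations surface as AddMessage, GetMessage, and PeekMessage. A simplified single-worker sketch of the measured loop (the queue name is illustrative) follows:

using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

class QueuePhase
{
    static void Run(CloudStorageAccount account, int messageCount, byte[] payload)
    {
        CloudQueue queue = account.CreateCloudQueueClient()
                                  .GetQueueReference("benchqueue");
        queue.CreateIfNotExist();

        // PutMessage phase: insert the payload messageCount times.
        for (int i = 0; i < messageCount; i++)
            queue.AddMessage(new CloudQueueMessage(payload));

        // PeekMessage phase: read without changing message visibility.
        for (int i = 0; i < messageCount; i++)
            queue.PeekMessage();

        // GetMessage phase: read (which hides the message from other workers
        // for the visibility timeout) and then delete it.
        for (int i = 0; i < messageCount; i++)
        {
            CloudQueueMessage msg = queue.GetMessage();
            if (msg != null)
                queue.DeleteMessage(msg);
        }
    }
}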



Table
Each worker role instance inserts 500 entities into the table, all stored in a separate partition of the same table. Once insertion completes, the worker role queries the same entities 500 times. After the querying phase ends, the worker role updates all 500 entities with newer data. Finally, all of these entities are deleted. A sketch of this cycle follows.
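A sketch of this insert/query/update/delete cycle using the SDK 1.x TableServiceContext pattern might look as follows; BenchEntity and the table name are illustrative stand-ins for AzureBench's own types:

using System.Linq;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

class BenchEntity : TableServiceEntity
{
    public string Payload { get; set; }
    public BenchEntity() { }   // parameterless ctor required for materialization
    public BenchEntity(string pk, string rk) : base(pk, rk) { }
}

class TablePhase
{
    static void Run(CloudStorageAccount account, int workerId, string payload)
    {
        CloudTableClient tableClient = account.CreateCloudTableClient();
        tableClient.CreateTableIfNotExist("benchtable");
        TableServiceContext ctx = tableClient.GetDataServiceContext();
        string partition = "worker-" + workerId;   // one partition per worker

        // Insert phase: 500 entities in this worker's partition.
        for (int i = 0; i < 500; i++)
            ctx.AddObject("benchtable",
                new BenchEntity(partition, i.ToString()) { Payload = payload });
        ctx.SaveChangesWithRetries();

        // Query phase: fetch each entity back by partition and row key.
        for (int i = 0; i < 500; i++)
        {
            string row = i.ToString();
            BenchEntity e = ctx.CreateQuery<BenchEntity>("benchtable")
                .Where(x => x.PartitionKey == partition && x.RowKey == row)
                .AsTableServiceQuery().Execute().First();
        }

        // Update phase: rewrite every entity with newer data.
        var entities = ctx.CreateQuery<BenchEntity>("benchtable")
            .Where(x => x.PartitionKey == partition)
            .AsTableServiceQuery().Execute().ToList();
        foreach (BenchEntity e in entities)
        {
            e.Payload = payload + "-updated";
            ctx.UpdateObject(e);
        }
        ctx.SaveChangesWithRetries();

        // Delete phase: remove all 500 entities.
        foreach (BenchEntity e in entities)
            ctx.DeleteObject(e);
        ctx.SaveChangesWithRetries();
    }
}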


The methodology implemented by Agarwal & Prasad allows for a deep understanding and analysis of the Azure platform's middleware. They exhaustively exercise the Queue storage mechanism, the query-like interface for storing data through Table storage, and persistent random access to hierarchically stored data through Blob storage. The utility-based framework also facilitates experimenting with large amounts of compute power, obviating the need to own a parallel or distributed system.

The experimentation focuses on the Azure cloud storage services, which are its primary artifacts for inter-processor coordination and communication. The iterative algorithm provides a full cycle of tests across different data sizes and numbers of parallel VM workers. This approach allowed the researchers to point out various bottlenecks in parallel access to the storage services.

Contribution

The major contribution of this work, aside from the results and the guidance provided for HPC developers, is that AzureBench is an open-source benchmark suite hosted in a CodePlex repository under GPLv2. The open-source nature of the project should motivate further research in this direction.

Another important factor is that Agarwal & Prasad's assessments of the Microsoft Azure cloud platform provide updated and realistic performance measurements. This was verified not only through the presented results but also because the source code demonstrates that all utilized APIs were released after the significant changes made to the Azure cloud platform since 2010.

They also provided the community with a template for a generic application framework for the Azure cloud. This will help interested developers shorten the learning curve by offering a clear and solid starting point for designing their own applications.

Finally, Agarwal & Prasad provide a summary of their findings and make recommendations for developers to efficiently leverage the maximum throughput of the storage services.

Personal Experiment Discussion

In the personal experiment we intended to reproduce the execution of the algorithms used for the blob, table, and queue performance analyses. To do so, we downloaded and installed the open-source (alpha) version of AzureBench from the CodePlex repository. The available source code is based on C#/ASP, with its last revision dated January 2012.

Resources used during the experiment reproduction are the following:

Microsoft Azure Account: Trial
Cluster Affinity Group: East USA
VM Size: Small (1 core + 1.75 GB RAM)
Maximum VM Cores: 20
Development Platform: Visual Studio 2010 + Azure SDK 1.7
VM WebRoles: 2
VM WorkerRoles: 2, 4, 8, 16

The experiment was divided into 10 steps:
1. Source Code Deployment
2. Azure Account Setup
3. Source Code analysis and framework interpretation
4. Source tweaking and modification for Local Emulation
5. Source tweaking and modification for Source Deployment
6. Source Code compilation and binary built
7. Azure Application Package Generation
8. Cloud Application Deployment
9. VM Workers Provisioning 
10. Application Execution and Benchmark Execution
a. Behavior Observation
b. Analysis of Results

Our testing environment was the same as in the original paper, except for the total number of VM workers available for testing. The original work performed tests with 1 to 96 VM workers (cores). Our scenario was limited to 20 VM workers due to budget restrictions. Our execution included iterations over message data sizes from 4 KB to 64 KB.

It was observed that the behavior during the table and queue tests, using 2, 4, 8, and 16 VM workers, followed the same pattern (though not the same values) as that presented in the original paper. Extrapolation suggests this pattern would hold for tests with more than 16 cores.

We were able to reproduce 2 out of 3 experiments presented in the original paper. The available source for the Blob performance analysis has bugs and falls into an infinite loop when executed in the cloud. From analyzing the source, we estimate this condition is due to an error in the index boundaries used to keep track of the pages during the synchronization phase. We intend to correct this issue in later versions of this work.

Figure 2 shows our results when running the Azure Blob experimental bench. The application crashed; however, the VM workers kept trying to recover (heal) themselves from the failure. Between 50% and 90% of the VM workers went offline. The application ran for 14 hours without showing any progress in the recovery action.


Figure 3 shows that updating a table is the most time-consuming process. As in the original paper, it was evident that for entity sizes of 32 KB and 64 KB, the time taken by all four operations increases drastically with an increasing number of worker role instances.




The Windows Azure platform maintains three replicas of each storage object with strong consistency. Figure 4 shows the time to put a message on the queue. For the Put Message operation, the queue must be synchronized among replicated copies across different servers. We were able to reproduce the behavior of the Peek Message operation, noticing that it is the fastest of the three; according to Agarwal & Prasad, this is because no synchronization is needed on the server end. The Get Message operation, on the other hand, is the most time-consuming: in addition to synchronization, the message becomes invisible to all other worker role instances, so this new state must be maintained across all copies.

Figure 5 shows a result that is also consistent with the original paper. During the bench for the Azure shared queue, we observed the same behavior as in Agarwal and Prasad's experiment on Queue storage when multiple workers access a queue in parallel: concurrent access increases contention at the queue, and consequently each operation takes longer than when each worker accesses its own queue.

The learning curve for the Azure platform, its operational architecture, and the Visual Studio development framework is steep unless the developer has solid knowledge of Windows WCF. Once the architecture is understood, the deployment process is simple and almost seamless. However, one drawback learned during the experimentation phase concerns application deployment time and role instantiation. These activities are time-consuming and their duration is unpredictable: they can take minutes or even hours depending on the size of the project (number of VM workers) and the current state of the cloud. Microsoft charges deployment time as computing time, so there is an implied cost.

The results show that Azure offers good performance for the benchmarks tested. However, as stated by Microsoft Research, Windows Azure is not designed to replace the traditional HPC supercomputer. In its current data center configuration it lacks the high-bandwidth, low-latency communication model appropriate for tightly coupled jobs. Nevertheless, Windows Azure can host large parallel computations that do not require MPI messaging.

Further Research

As future work, additional services provided by the Windows Azure platform, such as local drives, caches, and SQL Azure databases, should be studied in terms of performance. Another aspect that is out of scope for the current research, but could be an expansion of Agarwal & Prasad's work, is the study of resource provisioning times, application deployment timings, and Azure AppFabric.


Instance acquisition and release times are critical metrics when evaluating the performance of dynamic scalability for applications. Other subjects of study derived from this work could be the evaluation of Azure Drives and their NTFS storage abstraction.

Another aspect suitable for deeper study is the evaluation of direct instance-to-instance TCP performance, as this mechanism provides a lower-latency alternative to the storage services for communication between instances.

Finally, benchmarks suited to cloud offerings from other vendors could be incorporated. However, comparison with other cloud platforms was not studied here, primarily due to differences in architectures.

References
Agarwal, D., & Prasad, S. K. (2012). AzureBench: Benchmarking the Storage Services of the Azure Cloud Platform. 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum (Vol. 0, pp. 1048–1057). Los Alamitos, CA, USA: IEEE Computer Society. 

Gunarathne, T., Zhang, B., Wu, T.-L., & Qiu, J. (2011). Portable Parallel Programming on Cloud and HPC: Scientific Applications of Twister4Azure. Utility and Cloud Computing, IEEE International Conference on (Vol. 0, pp. 97–104). Los Alamitos, CA, USA: IEEE Computer Society.

Hill, Z., Li, J., Mao, M., Ruiz-Alvarez, A., & Humphrey, M. (2010). Early observations on the performance of Windows Azure. Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, HPDC '10 (pp. 367–376). New York, NY, USA: ACM.

Iosup, A., Ostermann, S., Yigitbasi, M. N., Prodan, R., Fahringer, T., & Epema, D. H. J. (2011). Performance Analysis of Cloud Computing Services for Many-Tasks Scientific Computing. IEEE Transactions on Parallel and Distributed Systems, 22(6), 931–945. 

Lu, W., Jackson, J., & Barga, R. (2010). AzureBlast: a case study of developing science applications on the cloud. Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, HPDC '10 (pp. 413–420). New York, NY, USA: ACM.

Roloff, E., Birck, F., Diener, M., Carissimi, A., & Navaux, P. O. A. (2012). Evaluating High Performance Computing on the Windows Azure Platform. 2012 IEEE Fifth International Conference on Cloud Computing (pp. 803–810). Los Alamitos, CA, USA: IEEE Computer Society. 




