Windows Azure details

I recently attended PDC 2008, where Microsoft announced Windows Azure. I spent the week absorbing Microsoft’s messaging, diving deep into the technical details, and thinking about the broader implications for cloud computing as a whole. Here are some of the technical details.

Also see my overall impressions and presentation slides. (Full disclosure: I work on Google App Engine.)

Contents:

Fabric
- Services
- Implementation
Monitoring and alerting
Storage
Developer Experience
SQL Services

Fabric

The Azure fabric is an umbrella term for the logical cluster of machines that Azure services run on, along with the processes that manage the cluster, provision and deploy services, monitor system status, and recover from failures.

The fabric consists of nodes, which may be physical machines, VMs, or (eventually) partial VMs.

Services

An Azure service is defined by a service definition, aka a model, similar to a RightScale deployment. It’s an XML definition of the different components of the app, the resources they need, and how they interact with each other and the outside world.

In the CTP, only a few canned service definition templates may be used. In the commercial release, the full XML definition language will be available and arbitrary service definitions will be supported.

Components

A role is an individual runnable component of an app. A role typically specifies a binary or piece of code to run as an OS process. A role may be instantiated many times. Example roles include web frontend, backend, garbage collector, daemon, transaction processor, and worker.

Roles run on fabric nodes. In the CTP, each role runs in its own Hyper-V-based VM, and may only run managed code. In the commercial release, roles may include native code and even run on bare hardware. In the other direction, they may also run in a VM shared between many roles or may run

Load balancers may be included in service definitions. They’re connected to roles or other load balancers via channels. Based on a conversation in person, their policies will be fairly flexible, so you can use them to do experiments on live traffic.

A few kinds of networking components are available to service definitions. A channel is a connection between two roles. An endpoint is a generic network endpoint, either internal or external, for serving requests. An interface is an externally accessible endpoint. Supported protocols include HTTP(S), SMTP, ATOM, and SOAP. Notably, while HTTPS is supported, you can’t automatically require it, ie there’s no equivalent of App Engine’s secure: always in app.yaml.

Provisioning

A role’s (or other component’s) definition includes the resources it needs, including CPU, memory, disk, and network bandwidth (I think). Notably, when the role is deployed, it’s limited to these resources. The role definition may also include specific OS features that it needs.

The role definition also includes the number of instances that the fabric should deploy, either a specific number or a lower and upper bound. Finally, constraints may be specified, including how many role instances may run in the same node, whether instances of different roles should be co-located, and how to allocate instances across update domains and fault domains.

Update domains and fault domains are optional but useful features. Update domains are used to partition the service during upgrades. A role may specify the number of update domains its instances should be deployed in.

During rolling OS upgrades and service updates, only one update domain will be upgraded at a time. While an update domain is being upgraded, the fabric won’t route live traffic to any roles or load balancers inside it. After an update domain has been fully upgraded and all services in it report that they’re healthy, the fabric returns them to service and starts on the next update domain.

Fault domains are similar to update domains. They’re basically disjoint failure zones within a single datacenter. They’re determined according to the datacenter topology, based on things like rack and switch configurations, which make certain classes of failures likely to affect well-defined groups of nodes. As with update domains, a role may specify the number of fault domains its instances should be deployed in.
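To make the domain idea concrete, here’s a minimal Python sketch of spreading role instances across update and fault domains. The round-robin placement and the assign_domains name are my own assumptions for illustration; the fabric’s actual placement algorithm wasn’t described.

```python
def assign_domains(num_instances, update_domains, fault_domains):
    """Spread role instances across update and fault domains round-robin.

    Hypothetical helper: the real fabric's placement logic was not described.
    Returns a list of (instance_id, update_domain, fault_domain) tuples.
    """
    placement = []
    for instance_id in range(num_instances):
        placement.append((
            instance_id,
            instance_id % update_domains,
            instance_id % fault_domains,
        ))
    return placement


placement = assign_domains(num_instances=6, update_domains=3, fault_domains=2)
for instance_id, ud, fd in placement:
    print(f"instance {instance_id}: update domain {ud}, fault domain {fd}")
```

With this kind of spread, taking any single update domain or fault domain offline still leaves instances running in the others.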

Here’s an example service definition:

<ServiceDefinition name="MyService" xmlns="...">
  <WebRole name="WebRole">
    <ConfigurationSettings>
      <Setting name="GreetingString" />
    </ConfigurationSettings>
    <InputEndpoints>
      <InputEndpoint name="HttpIn"
                     protocol="http"
                     port="80" />
    </InputEndpoints>
  </WebRole>
</ServiceDefinition>

Hosting

Role code and binaries run on VMs with a limited choice of Windows OS (which versions?). It has a standard MS stack available: IIS7 (with XML config), .NET CLR, ASP.NET, .NET services, Live services, SSL. Some .NET CLR managed code languages are available, and more are coming, along with native code. Role code is subject to a custom Azure CAS (code access security) policy for managed code.

There are APIs for interacting with the fabric, including introspecting the service definition and the current node, accessing configuration settings, and reporting health and other status info. Role instances may also use the local disk on their nodes as temporary scratch space, within a quota limited, chrooted directory. Finally, as mentioned in Developer Experience, there’s a distributed logging system that collects logs and makes them available for browsing, searching, and downloading.

Asymmetric keys are provided for authenticating to the development portal from local machines, authenticating to the storage systems from live role instances, and signing and encrypting service images before publishing them.

Configuration and APIs

Services include configuration settings, which are similar to the registry in Windows and /etc, dotfiles, and flags in *nix. Some are system-defined, such as the instance id, update domain id, and fault domain id. The rest are declared by the developer in the service definition, along with default values. Those values may be changed at runtime, I think via CSRun.exe (see Developer Experience) or the developer portal.

Roles can define callbacks to be called when settings change. This allows developers/deployers to change them dynamically, at runtime, without requiring a restart. Of course, most (all?) of the system-defined settings will only change on a restart.
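The callback idea is roughly the observer pattern. Here’s a toy Python sketch of it; the Settings class and its method names are hypothetical and are not the actual Azure API, which is .NET-based and which I haven’t reproduced here.

```python
class Settings:
    """Toy configuration store with change callbacks (hypothetical API)."""

    def __init__(self, defaults):
        self._values = dict(defaults)
        self._callbacks = []

    def get(self, name):
        return self._values[name]

    def on_change(self, callback):
        # callback(name, old_value, new_value) runs after each change.
        self._callbacks.append(callback)

    def set(self, name, value):
        old = self._values.get(name)
        if old == value:
            return  # nothing changed, so no callbacks fire
        self._values[name] = value
        for callback in self._callbacks:
            callback(name, old, value)


changes = []
settings = Settings({"GreetingString": "Hello world!"})
settings.on_change(lambda name, old, new: changes.append((name, old, new)))
settings.set("GreetingString", "Hi there!")
```

The role keeps running the whole time; only the registered callbacks see the new value.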

Here’s an example service configuration:

<ServiceConfiguration name="MyService" xmlns="...">
  <Role name="WebRole">
    <Instances min="3" max="10" />
    <UpdateDomains count="3" />
    <FaultDomains count="3" />
    <ConfigurationSettings>
      <Setting name="GreetingString"
               value="Hello world!" />
    </ConfigurationSettings>
  </Role>
</ServiceConfiguration>

Warning: full service definition and configuration isn’t available in CTP, so this includes minor speculation, specifically the Instances tag’s min and max attributes and the UpdateDomains and FaultDomains tags.

Geodistribution and Georeplication

In the CTP, Azure runs in “a single datacenter on the US west coast.” In the commercial release, MS will offer some control over geodistribution and georeplication, ie where and how your apps and data live. Details on this were scarce.

Implementation

The Azure fabric implementation consists of two main components, the fabric controller and the fabric controller agent.

Fabric Controller

The fabric controller (FC) manages the life cycle of Azure services. It allocates resources for them, provisions and deploys them, monitors them, and maintains their goal state. It also manages and monitors both VMs and physical machines in the fabric.

State machine

The FC has an internal state machine that it uses for managing the life cycle of a service. It maintains the current (last known) state of each virtual and physical node, which it updates by talking with FC agents on each machine. It also knows the goal state of each service that’s been published.

The state machine maintains internal data structures for logical services, logical roles, logical role instances, logical nodes, and physical nodes.

The FC’s main loop is described as a heartbeat. During each heartbeat, it compares its services’ goal states with the current state of the nodes they’re deployed on, if any. When it finds a role with a goal state that doesn’t match its nodes’ current state, it tries to move the node’s state toward the role’s goal state. If it can’t, it might move the role instance to another node.

The FC also monitors for certain events that move nodes out of goal state, e.g. failures.
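The heartbeat is essentially a reconcile loop: compare goal state to current state and act on the difference. Here’s a minimal Python sketch under my own assumptions. The state names, the heartbeat function, and the idea that every corrective action succeeds immediately are all made up for illustration.

```python
def heartbeat(goal_states, current_states):
    """One pass of a toy reconcile loop (hypothetical, not the FC's real logic).

    Both arguments map a role instance name to a state string. Returns the
    list of (instance, current, goal) actions needed to converge.
    """
    actions = []
    for instance, goal in sorted(goal_states.items()):
        current = current_states.get(instance, "absent")
        if current != goal:
            actions.append((instance, current, goal))
            current_states[instance] = goal  # pretend the action succeeded
    return actions


goal = {"web-0": "running", "web-1": "running", "worker-0": "running"}
current = {"web-0": "running", "web-1": "failed"}

first = heartbeat(goal, current)   # repairs web-1, starts worker-0
second = heartbeat(goal, current)  # nothing left to do
```

A failure simply shows up as a new mismatch on the next heartbeat, so recovery and initial deployment share the same code path.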

Service life cycle

Inside the FC, there are four specific stages in the life cycle of a service: resource allocation, provisioning, deployment and upgrading, and maintenance.

Resource allocation is largely a constraint satisfaction problem. Constraints include update and fault domains, resources and OS features/versions, network channels, etc. They also described it as a graph construction problem, where fabric nodes are graph nodes and network channels between roles and load balancers are edges.

Resource allocation is transactional, at either the service or role/node level (not sure which, I think the former). Either all of the necessary resources are allocated or none.
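Here’s a small Python sketch of the all-or-nothing behavior. It’s only an illustration: the first-fit strategy, the single CPU constraint, and the allocate function are my assumptions, and the real solver handles many more constraints.

```python
def allocate(role_demands, free_cpus):
    """All-or-nothing allocation of CPU cores to role instances.

    role_demands: list of (instance_name, cpus_needed).
    free_cpus: dict of node name -> free cores; only mutated on success.
    Returns {instance_name: node}, or None if any instance cannot be placed.
    Hypothetical first-fit strategy, not the FC's real constraint solver.
    """
    trial = dict(free_cpus)  # work on a copy so failure leaves no trace
    placement = {}
    for instance, needed in role_demands:
        node = next((n for n in sorted(trial) if trial[n] >= needed), None)
        if node is None:
            return None  # roll back by discarding the trial copy
        trial[node] -= needed
        placement[instance] = node
    free_cpus.update(trial)  # commit
    return placement


nodes = {"node-a": 4, "node-b": 2}

# The second instance can't fit, so the first one isn't allocated either.
failed = allocate([("big-0", 4), ("big-1", 4)], nodes)
after_failure = dict(nodes)

placed = allocate([("web-0", 2), ("web-1", 2), ("worker-0", 2)], nodes)
```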

Other provisioning and standard datacenter/NOC operations are handled either automatically, by the FC, or offline, e.g. burn-in, diagnostics, replacing parts, capacity planning, TOR/L2 switches (?), load balancers, SNMP, and access routers.

Deployments and upgrades are done mostly automatically by the FC, one update domain at a time. Within each update domain, new OS or service images are copied to all nodes. Then, all nodes are marked as out of service (which moves them out of the goal state), so live traffic is no longer sent to them. Then, they’re all upgraded. Finally, once the OS and services are back up and healthy, the nodes are marked in service and moved back into the goal states, as appropriate.

Upgrades may be done either fully automatically or manually. In the full auto case, each update domain starts immediately after the previous one finishes. In the manual case, humans start each new update domain. Service deployments are usually automatic; OS upgrades are either automatic or manual.
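The fully automatic case can be sketched in a few lines of Python. This is a toy model: rolling_upgrade and its callable arguments are hypothetical, and stopping when a domain comes back unhealthy is my assumption about what the fabric would do.

```python
def rolling_upgrade(nodes_by_domain, upgrade, is_healthy):
    """Upgrade one update domain at a time, stopping if a domain is unhealthy.

    nodes_by_domain: dict of update domain id -> list of node names.
    upgrade, is_healthy: caller-supplied callables taking a node name.
    Returns the list of domain ids that were fully upgraded.
    """
    done = []
    for domain in sorted(nodes_by_domain):
        nodes = nodes_by_domain[domain]
        # Nodes in this domain are out of service while they are upgraded.
        for node in nodes:
            upgrade(node)
        if not all(is_healthy(node) for node in nodes):
            break  # leave later domains untouched on the old version
        done.append(domain)
    return done


versions = {"n0": 1, "n1": 1, "n2": 1, "n3": 1, "n4": 1}
domains = {0: ["n0", "n1"], 1: ["n2", "n3"], 2: ["n4"]}


def upgrade(node):
    versions[node] = 2


# Pretend n2 fails its health check after the upgrade.
completed = rolling_upgrade(domains, upgrade, is_healthy=lambda node: node != "n2")
```

In this run, domain 0 finishes, domain 1 stalls on n2, and domain 2 keeps serving on the old version.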

Maintenance consists of the standard monitoring for failures and unhealthiness, as reported by both the FC agents and roles themselves. When certain events occur, the FC moves a service/role out of its goal state, and may mark parts out of service. It then falls back to the deployment process of moving the service back toward the goal state via its heartbeats.

Networking

VIPs and DIPs (dedicated IPs) for services are allocated and programmed into load balancers automatically. The FC marks DIPs as in service or out of service as the role instances behind them (those with interfaces) move in and out of goal state. Load balancers only send traffic to role instances in goal state.

Implementation details

The FC itself is a cluster of 5-7 machines. At any given time, one is the primary, and the rest are secondary workers. They vaguely alluded to using a form of master election for the primary, almost certainly Paxos-based.

Internally, there are four main components, from high to low:

The core runs the heartbeat, the state machine, the resource allocation constraint solver, etc.
The object model includes the business logic for roles, services, etc.
The state system stores, retrieves, and validates state checkpoints. It can roll back and forward in the case of failures.
The data replication system is similar to GFS, but dedicated to the FC. It’s distributed across all of the FC machines.

Notably, the FC cluster actually runs on a smaller, limited, dedicated version of Azure itself (!).

Virtualization

Node VMs use a hypervisor architecture based on Hyper-V, which takes advantage of new hardware features designed to assist virtualization. The hypervisor runs on the hardware directly, and multiple VMs run on top of it.

One VM is the host. It includes device drivers, which talk directly to the machine’s hardware. The hypervisor routes system calls and other direct hardware accesses from the other (guest) VMs to the host VM, which performs them on behalf of the guest VMs.

The OSes also include a feature called VMBus, which is an optimized, high-bandwidth direct data bus between VMs. (This is probably used for disk reads and writes from guest VMs.)

Image-based deployment

Deployments and upgrades are based on virtual hard disk images, which provide a number of benefits. They’re homogeneous, more stable and predictable, faster to copy and install, cacheable, easier to repair via replacement, and may be constructed and serviced offline, which is automated.

There was a vague reference to an image manager, akin to package management systems for Linux, which handles dependencies, image distribution via multicast, replication, and caching.

Monitoring and alerting

This is one of the areas where we have the least information so far. There’s little to no support for developer-configurable monitoring in the CTP. The commercial release will support limited monitoring, as well as health and status reporting from roles, but it’s unclear how powerful it will be.

At the minimum, developers will be able to set alerts to fire when role instances become unhealthy, crash, or otherwise fail, when their nodes fail, or when a role logs a message at the CRITICAL level (or whatever the level is called). Alerts may be delivered by email, IM, and/or SMS.

However, it’s still unclear whether or how custom alerts may be defined, and what fine-grained metrics (e.g. qps, latency, error rate) will be available to those alerts.

There’s also a CSMonitor.exe file in the SDK. I haven’t seen much information about it, but I’m guessing it’s for querying the status and monitoring info of a local role instance, or maybe even a live one running in Azure. See Developer Experience.

Storage

The three Azure storage services in the CTP are tables, queues, and blobs. In the commercial release, they’ll be joined by lock and cache services. They’re all intended to be simple and massively scalable. Notably, they all run on top of the Azure fabric itself.

Each storage service has a programmatic .NET API and an HTTP REST API. The REST endpoints are under <account>.[storage,blob,queue].core.windows.net. With the right keys, the REST API can be used to access data across accounts.

There are also command-line utilities for browsing and modifying data in each service (except for tables, currently), in both production and staging instances.

Tables

Features

Tables is Azure’s scalable, schemaless, structured datastore, similar to the App Engine datastore. It’s a distributed system that’s very similar in design to Bigtable. It’s designed to scale to billions of entities and terabytes of data.

Entity model

Each entity has a table name and a schemaless, key/value property bag. Entities are limited to 1MB and 255 properties each. The supported property types are:

string
binary
int
long
bool
double
guid

Three properties are special: the partition key, the row key, and the version.

The partition key identifies a partition, ie a group of entities that should be stored together. The row key identifies an entity within a partition. Together, an entity’s partition key and row key comprise its primary key. The partition and row key are both strings, up to 64KB.

Notably, the developer must choose both the partition key and the row key. I didn’t ask if partitions are limited in size, but I believe they are.

Queries

Beyond the standard CRUD operations, tables supports a limited form of querying.

In the CTP, queries support only equals filters, on one or more properties. Results are returned in partition + row key order. In the commercial release, developer-defined secondary indexes for sorting and inequality filters will be supported, but they haven’t decided how they’ll work, or even whether they’ll be global or partition-local.
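As a rough mental model of CTP-style queries, here’s a Python sketch that filters on equality only and returns results in partition + row key order. The query function and the in-memory list are my own stand-ins, and I’m assuming the key properties are named PartitionKey and RowKey.

```python
def query(entities, **filters):
    """Equality-only query; results sorted by (partition key, row key).

    Toy in-memory stand-in, not the real tables API.
    """
    matches = [
        e for e in entities
        if all(e.get(name) == value for name, value in filters.items())
    ]
    return sorted(matches, key=lambda e: (e["PartitionKey"], e["RowKey"]))


entities = [
    {"PartitionKey": "users", "RowKey": "bob", "city": "SF"},
    {"PartitionKey": "users", "RowKey": "ann", "city": "SF"},
    {"PartitionKey": "orders", "RowKey": "001", "city": "SF"},
    {"PartitionKey": "users", "RowKey": "cat", "city": "NYC"},
]

results = query(entities, city="SF")
keys = [(e["PartitionKey"], e["RowKey"]) for e in results]
```

Anything beyond this, like sorting by city or filtering on a range, would need the secondary indexes that are planned for the commercial release.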

The programmatic query API uses LINQ, a simplified query language similar to Google App Engine’s GQL. The HTTP REST API uses the ADO.NET Data Services (aka Astoria) URL-based interface.

Along with the result entities, each query returns a continuation marker, which may be passed back in following query calls to retrieve the next page of results.
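Paging with a continuation marker looks roughly like this from the client’s side. It’s a Python sketch under stated assumptions: query_page is hypothetical, and I’m modeling the marker as a plain index, since the real marker format wasn’t described.

```python
def query_page(entities, continuation=None, page_size=2):
    """Return one page of results plus a continuation marker (or None).

    Toy model: the marker is just the index of the next entity.
    """
    start = continuation or 0
    page = entities[start:start + page_size]
    next_start = start + page_size
    marker = next_start if next_start < len(entities) else None
    return page, marker


entities = ["a", "b", "c", "d", "e"]
pages = []
marker = None
while True:
    # Pass the previous marker back in to get the next page.
    page, marker = query_page(entities, marker)
    pages.append(page)
    if marker is None:
        break
```

The server keeps no per-client cursor in this model; everything it needs to resume is in the marker.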

Consistency and Transactions

Azure tables is strongly consistent. Single entity writes are synchronous, and there are no dirty reads. ACID transactions are supported on single entities.

Consistency and transactions are implemented via optimistic concurrency, similar to the App Engine datastore, using a system-defined version property on each entity that’s updated on each write.

Notably, the HTTP REST interface supports the same optimistic concurrency behavior, using the standard Etag and If-Match HTTP headers.
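Here’s a minimal Python sketch of version-based optimistic concurrency, in the spirit of If-Match. The EntityStore class and its integer versions are hypothetical; over HTTP, a stale If-Match value would normally come back as 412 Precondition Failed.

```python
class VersionConflict(Exception):
    """Raised when the caller's version is stale (like HTTP 412)."""


class EntityStore:
    """Toy store that bumps a per-entity version on every write."""

    def __init__(self):
        self._rows = {}  # key -> (version, value)

    def get(self, key):
        return self._rows[key]

    def put(self, key, value, if_match=None):
        version, _ = self._rows.get(key, (0, None))
        # Reject the write if someone else has written since we last read.
        if if_match is not None and if_match != version:
            raise VersionConflict(key)
        self._rows[key] = (version + 1, value)
        return version + 1


store = EntityStore()
store.put("user:1", {"name": "Ann"})
version, _ = store.get("user:1")
store.put("user:1", {"name": "Ann B."}, if_match=version)

try:
    # Reusing the old version simulates a second, stale writer.
    store.put("user:1", {"name": "stale"}, if_match=version)
    conflict = False
except VersionConflict:
    conflict = True
```

The losing writer is expected to re-read the entity, reapply its change, and try again.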

In the commercial release, transactional batch writes will be supported, ie all writes in the batch will be committed or none. They’re not sure whether transactions will support reads across entities as well as writes, ie consistency and isolation.

One interesting feature is that table-scan-based queries within partitions provide snapshot isolation, since they can use a single base timestamp. It’s unclear what kind of isolation secondary index queries will provide, since they don’t know if they’ll be global or local.

Schemas

As mentioned before, Azure tables is schemaless, like the App Engine datastore. There’s a built-in Tables table that stores the list of tables currently used by an account, but not the properties or their types. It’s queryable the same way as the normal tables.

For schema changes, they strongly recommend the backward compatibility model, ie pushing code that understands both the old and new schemas. After that, you can either update the data in the background, then drop the old schema code path, or just leave both code paths in (ugh!).

Implementation

From discussions with Brad Calder et al, it sounds like tables is designed similarly to Bigtable and the App Engine datastore.

Notably, partitions are similar to Bigtable tablets, with a few of the added characteristics of App Engine’s entity groups. All entities in a partition are stored on the same table server. I have to assume partitions are unavailable if/when they’re moved between table servers, like Bigtable tablets, but that’s just speculation on my part.

Transactions are implemented a la the App Engine datastore, using the built-in per-entity version property for optimistic concurrency. The notable difference there is that the versioning is at the entity level, not the entity group level, which makes cross-entity transactions much more difficult.

Queries are currently implemented via full partition scans and filtering in memory. Secondary indices are coming in the full commercial release.

Queues

Features

Queues is similar to Amazon SQS. It allows messages to be enqueued and processed later, asynchronously, in a loosely coupled fashion. A service specifies a set of queues in its definition. At runtime, it may enqueue, dequeue, and delete messages from those queues. Messages have a typed payload, up to 8KB.

The available queue operations are create queue, delete queue, enqueue message, dequeue message, delete message, clear queue, and get queue length.

The dequeue operation acquires a lease on a message. While that lease is held, the message is invisible to other queue clients. When the owner of the lease is finished with the message, it may permanently delete it. If it doesn’t delete it within the lease period, the lease expires and the message becomes available to dequeue again. (The lease period is configurable, and defaults to 30s.)
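The lease mechanics are easy to get wrong, so here’s a toy Python model. LeaseQueue and its methods are hypothetical, and time is passed in explicitly to keep the example deterministic.

```python
import itertools


class LeaseQueue:
    """Toy queue with dequeue leases; time is passed in explicitly."""

    def __init__(self, lease_seconds=30):
        self.lease_seconds = lease_seconds
        self._messages = {}  # id -> [payload, visible_at]
        self._ids = itertools.count(1)

    def enqueue(self, payload):
        msg_id = next(self._ids)
        self._messages[msg_id] = [payload, 0.0]
        return msg_id

    def dequeue(self, now):
        for msg_id, entry in self._messages.items():
            if entry[1] <= now:
                entry[1] = now + self.lease_seconds  # hide while leased
                return msg_id, entry[0]
        return None

    def delete(self, msg_id):
        self._messages.pop(msg_id, None)


queue = LeaseQueue(lease_seconds=30)
queue.enqueue("resize image 42")

first = queue.dequeue(now=0)     # takes the lease
hidden = queue.dequeue(now=10)   # lease still held, nothing visible
again = queue.dequeue(now=31)    # lease expired, message reappears
queue.delete(again[0])           # worker finished, remove it for good
empty = queue.dequeue(now=100)
```

The upshot is at-least-once delivery: a worker that crashes or runs past its lease causes the message to be processed again, so handlers should be idempotent.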

They’ve considered providing higher-level abstractions on top of this, e.g. simple map, MapReduce, Dryad, and Pig, but they haven’t made any decisions.

Note that SQL Server 2005 has a service broker that offers queueing functionality too. This is similar, but different, and doesn’t share any code.

Implementation

The order that tasks are dequeued is roughly FIFO, but not exactly. The system makes no guarantees about starvation or fairness within a queue.

Blobs

The blob service stores opaque blobs, similar to Amazon S3. Blobs may be created and retrieved programmatically. Blobs are identified by unique, human-readable URL paths under <account>.blob.core.windows.net, chosen by the developer. Blobs may be up to 50GB, and may have additional metadata consisting of key/value pairs.

Blobs up to 64MB may be uploaded directly. Blobs above 64MB must be divided into blocks. Each block is uploaded separately. Blocks may be uploaded out of order, and individual blocks may be repeated. At the end, a Commit call is made with the block ids in order.
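The block upload flow can be modeled like this in Python. It’s a sketch only: BlockBlob, put_block, and commit are names I made up, and I’m assuming that re-uploading a block id replaces the earlier data.

```python
class BlockBlob:
    """Toy block blob: stage blocks in any order, then commit an ordered list."""

    def __init__(self):
        self._staged = {}   # block id -> bytes
        self.content = b""  # committed content

    def put_block(self, block_id, data):
        self._staged[block_id] = data  # re-uploading an id overwrites it

    def commit(self, block_ids):
        missing = [b for b in block_ids if b not in self._staged]
        if missing:
            raise KeyError(f"blocks never uploaded: {missing}")
        # The commit list, not the upload order, decides the final layout.
        self.content = b"".join(self._staged[b] for b in block_ids)


blob = BlockBlob()
blob.put_block("b2", b"world")
blob.put_block("b1", b"hello ")
blob.put_block("b2", b"azure")  # a retry with new data replaces the block
blob.commit(["b1", "b2"])
```

Because nothing is visible until the commit, failed or partial uploads can be retried block by block without corrupting the blob.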

The blob namespace is the hierarchical URL path namespace under <account>.blob.core.windows.net. Blobs may be enumerated at any level in the URL space, non-recursively. Paging is supported with continuation markers, similar to tables.
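Non-recursive enumeration behaves a lot like listing a single directory. Here’s a Python sketch; list_level is hypothetical, and the trailing slash I use to mark deeper levels is just a convention for this example.

```python
def list_level(paths, prefix=""):
    """List the immediate children of prefix, without recursing.

    Names ending in "/" stand for deeper levels that were not expanded.
    """
    children = set()
    for path in paths:
        if not path.startswith(prefix):
            continue
        rest = path[len(prefix):]
        if not rest:
            continue
        head, sep, _ = rest.partition("/")
        children.add(head + sep)
    return sorted(children)


paths = [
    "photos/2008/pdc.jpg",
    "photos/2008/la.jpg",
    "photos/me.jpg",
    "notes.txt",
]

top = list_level(paths)
photos = list_level(paths, "photos/")
```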

Blobs are not immutable! They may be modified, appended, and cloned. For block-based blobs, modifications are made at the block level.

Developer Experience

One of the loudest messages at PDC was that your existing tools, libraries, skills, and knowledge transfer to the cloud. Specifically, you use Visual Studio, .NET, LINQ, etc. to write Azure services, just like you use them to write on-premises services.

One of MS’s core messages is that the service development lifecycle consists of four steps:

Write the code and service model. This is done by developer(s).
Determine the necessary resources and configure the service. This is done by developer(s) or deployer(s).
Allocate resources, provision, and deploy the service. This is done by the fabric.
Maintain the service in the “goal state,” ie monitor the system health and handle failures. This is done by the fabric.

The Azure SDK includes a full, stand-alone, stubbed-out implementation of the Azure fabric, APIs, and storage services: runme.exe. It’s multi-process, so you can run multiple roles, and multiple instances of each role, on the local machine.

Visual Studio is not required, but it’s naturally by far the most supported environment. When using the SDK, all features are available, most notably the debugger. Debuggers may not be attached to apps running in the cloud on Azure, of course.

To publish a service to the cloud, first build each role binary with Visual Studio, or any compatible compiler. Then, use CSPack.exe to package a service image that includes the role binaries. Next, use CSRun.exe to upload the image and publish the service to a staging instance, where the service is available for testing before it serves live traffic. Finally, to start serving live traffic, flip the switch to swap the staging and prod instances, via either CSRun.exe or the developer portal.

The SDK, CSPack, and CSRun are all command line, with Visual Studio plugins built on top of them. This allows for third party tools and plugins. For example, they have an Eclipse plugin internally. As mentioned, asymmetric keys are provided for authenticating to the development portal from local machines, authenticating to the storage systems from live role instances, and signing and encrypting service images before publishing them.

There’s also a CSMonitor.exe in the SDK. I haven’t found much information about it, but I’m guessing it’s for querying the status and monitoring info of a local role instance, or maybe even a live one running in Azure.

The developer portal is much like the App Engine admin console. It shows you the status of your deployed service.

Apps get a subdomain of cloudapp.net to serve off of. Notably, dedicated top-level domains are not available in the CTP, but may be available in the commercial release.

Finally, the core Azure API includes a logging facility that logs to a reliable, distributed store, similar to App Engine. The logs are then available for browsing, searching, and download.

See this MSDN article for more.

SQL Services

SQL Services is a schemaless structured database that lives in the cloud. It’s based on sharded SQL Server. The entity model is similar to Tables.

SQL Services is intended to complement Tables. It doesn’t scale quite as automatically or easily, it’s not intended to serve a high load of end user traffic, and it’s expected to be much more expensive. However, it supports the more demanding traditional OLAP/OLTP usage patterns, and includes many of the existing SQL Server tools and extra features, including data mining, reporting, etc.

Developers must manually shard entities into containers. The container to SQL Server mapping under the hood is many-to-one, similar to the Azure tables partition to table server mapping. Containers are limited in size, which seems dangerous if you get your sharding parameters wrong.
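One common way to shard manually is to hash each entity’s key into a fixed number of containers. The Python sketch below is my own illustration, not something MS recommended; container_for and the choice of CRC32 are assumptions.

```python
import zlib


def container_for(entity_key, num_containers):
    """Map an entity key to a container index with a stable hash.

    Hypothetical helper; any deterministic hash would do.
    """
    return zlib.crc32(entity_key.encode("utf-8")) % num_containers


keys = [f"customer-{i}" for i in range(1000)]
before = {key: container_for(key, 4) for key in keys}
after = {key: container_for(key, 5) for key in keys}

# Count how many entities would have to move if we added a fifth container.
moved = sum(1 for key in keys if before[key] != after[key])
```

With plain modulo hashing, changing the container count moves most entities, which is exactly why getting the sharding parameters wrong up front is so painful.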

Currently, queries, transactions, etc. cannot cross containers. I’m not sure if joins are supported yet, but they eventually will be, along with queries and joins across containers. Transactions across containers will probably never be supported without something like two phase commit, though.

snarfed.org

Ryan Barrett's blog
