1 Key Design

Snowflake is an enterprise-ready data warehousing solution for the cloud.

The cloud promises increased economies of scale, extreme scalability and availability, and a pay-as-you-go cost model, but these benefits can only be captured if the software itself is able to scale elastically over the pool of commodity resources in the cloud.

Meanwhile, SaaS brings enterprise-class systems to users who previously could not afford them. Snowflake's key features include: a relational model, semi-structured data support, elastic compute and storage, high availability, durability, cost efficiency, and security.

Compute vs Storage

Snowflake proposes a multi-cluster, shared-data architecture, in contrast to the traditional shared-nothing architecture where compute and storage are tightly coupled. This means it can cater to heterogeneous workloads, supporting both I/O-intensive and compute-intensive workloads with different hardware configurations. Membership changes due to node failures or resizing also do not require reshuffling data.

2 Architecture

A service-oriented architecture composed of highly fault-tolerant and independently scalable services.

2.1 Virtual Warehouses

A VW consists of worker nodes, which are EC2 instances, and comes in different “t-shirt” sizes.

2.1.1 Elasticity and Isolation

VW can be created, destroyed or resized at any point, on demand. Worker nodes are not shared across VWs, resulting in strong performance isolation for queries. Each individual query runs on exactly one VW.

Private compute resources avoid interference between different workloads and organizational units. A worker process by itself, even if part of an update statement, never causes externally visible effects, because table files are immutable.

2.1.2 Execution Engine

Columnar: SIMD instructions and CPU caches, and light-weight compression

Vectorized: data is processed in batches, similar to VectorWise (MonetDB/X100)

Push-based: improves cache efficiency, because it removes control flow logic from tight loops (also discussed in Query Engines: Push vs Pull)

Queries are executed against a fixed set of immutable files. There is no buffer pool for table data because most queries scan large amounts of data, but operators can spill to disk.
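As a rough illustration of the push-based, vectorized model (not Snowflake's actual engine code; the operator classes and batch layout below are invented for this sketch), operators push column batches to their consumers instead of having results pulled from them:

```python
# Illustrative sketch only: a toy push-based, vectorized operator pipeline.
# Operator names and the batch layout are invented for this example.

class Operator:
    def __init__(self, consumer=None):
        self.consumer = consumer

    def push(self, batch):
        raise NotImplementedError

class Filter(Operator):
    """Keeps rows whose 'amount' column exceeds a threshold."""
    def __init__(self, threshold, consumer):
        super().__init__(consumer)
        self.threshold = threshold

    def push(self, batch):
        # Vectorized: operate on a whole column batch at once.
        keep = [i for i, v in enumerate(batch["amount"]) if v > self.threshold]
        filtered = {col: [vals[i] for i in keep] for col, vals in batch.items()}
        self.consumer.push(filtered)

class SumAggregate(Operator):
    """Accumulates a running sum over all pushed batches."""
    def __init__(self):
        super().__init__()
        self.total = 0

    def push(self, batch):
        self.total += sum(batch["amount"])

# The scan pushes fixed-size batches downstream; there are no pull-based next() calls.
agg = SumAggregate()
pipeline = Filter(threshold=100, consumer=agg)
for batch in [{"amount": [50, 150, 200]}, {"amount": [90, 300]}]:
    pipeline.push(batch)
print(agg.total)  # 650
```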

2.1.3 Local Caching and File Stealing

Each worker node maintains a cache of table data on local disk. The cache holds file headers and individual columns of files, and is maintained using a Least-Recently-Used (LRU) policy.

The query optimizer assigns input file sets to worker nodes using consistent hashing over table file names, so subsequent or concurrent queries accessing the same table file will do so on the same worker node. File stealing: whenever a worker process completes scanning its set of input files, it requests additional files from its peers.
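A minimal sketch of that assignment, assuming a toy consistent-hash ring over made-up worker and file names (Snowflake's real scheduler, cache, and file-stealing protocol are more involved):

```python
# Illustrative sketch only: assigning table files to worker nodes by consistent
# hashing over file names. Node and file names are invented for this example.
import hashlib
from bisect import bisect_right

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, nodes, replicas=64):
        # Place several virtual points per node on the ring for balance.
        self.ring = sorted((_hash(f"{n}#{i}"), n) for n in nodes for i in range(replicas))
        self.keys = [h for h, _ in self.ring]

    def node_for(self, file_name: str) -> str:
        idx = bisect_right(self.keys, _hash(file_name)) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["worker-1", "worker-2", "worker-3"])
files = [f"table_a/part-{i:04d}" for i in range(8)]

# The same file always maps to the same worker, so repeated scans of a table
# hit that worker's local cache; an idle worker could then "steal" unscanned
# files from a busy peer (not shown here).
assignment = {}
for f in files:
    assignment.setdefault(ring.node_for(f), []).append(f)
print(assignment)
```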

2.2 Data Storage

While S3's performance can vary, its usability, high availability, and strong durability guarantees were hard to beat.

Tables are horizontally partitioned into large, immutable files, which are equivalent to blocks or pages in a traditional database system. Within each file, the values of each attribute (column) are grouped together and heavily compressed (a PAX-style layout); each table file also has a header containing metadata such as the offset of each column within the file.
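A minimal sketch of such a layout, under the simplifying assumptions that "compression" is plain JSON encoding and that the header is a small JSON map of per-column offsets (the real file format is of course different):

```python
# Illustrative sketch only: a simplified PAX-like file layout where a header
# records the byte offset and length of each column, so a reader can fetch a
# single column without downloading the whole file.
import json, struct

def write_table_file(columns: dict) -> bytes:
    body, offsets = b"", {}
    for name, values in columns.items():
        encoded = json.dumps(values).encode()        # stand-in for real compression
        offsets[name] = (len(body), len(encoded))    # (offset, length) within the body
        body += encoded
    header = json.dumps(offsets).encode()
    # 4-byte header length, then the header, then the column data.
    return struct.pack("<I", len(header)) + header + body

def read_column(file_bytes: bytes, name: str):
    (header_len,) = struct.unpack_from("<I", file_bytes, 0)
    header = json.loads(file_bytes[4:4 + header_len])
    off, length = header[name]
    start = 4 + header_len + off
    return json.loads(file_bytes[start:start + length])

blob = write_table_file({"id": [1, 2, 3], "amount": [10.0, 20.5, 7.25]})
print(read_column(blob, "amount"))  # [10.0, 20.5, 7.25]
```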

Snowflake also uses S3 to store temp data generated by query operators (e.g., massive joins) once local disk space is exhausted, as well as for large query results.

2.3 Cloud Services

Multi-tenant: each service in this layer — access control, query optimizer, transaction manager and others — is long-lived and shared across many users:

  • improves utilization and reduces administrative overhead
  • replicated for high availability and scalability

2.3.1 Query Management and Optimization

A Cascades-style approach with top-down cost-based optimization; the resulting execution plan is distributed to all worker nodes that are part of the query.

2.3.2 Pruning

Min-max based pruning: the system maintains data distribution information (min/max values) for a given chunk of data and can determine that a given chunk might not be needed for a given query.
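A minimal sketch of this idea, assuming invented per-file min/max metadata on an order_date column:

```python
# Illustrative sketch only: min-max (zone-map style) pruning. Each table file
# keeps per-column min/max metadata; a scan skips files whose value range
# cannot satisfy the predicate. The metadata shape here is invented.
files = [
    {"name": "part-0001", "order_date": {"min": "2024-01-01", "max": "2024-03-31"}},
    {"name": "part-0002", "order_date": {"min": "2024-04-01", "max": "2024-06-30"}},
    {"name": "part-0003", "order_date": {"min": "2024-07-01", "max": "2024-09-30"}},
]

def prune(files, column, lo, hi):
    """Keep only files whose [min, max] range overlaps the query range [lo, hi]."""
    return [f["name"] for f in files
            if not (f[column]["max"] < lo or f[column]["min"] > hi)]

# WHERE order_date BETWEEN '2024-05-01' AND '2024-05-31'
print(prune(files, "order_date", "2024-05-01", "2024-05-31"))  # ['part-0002']
```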

Dynamic pruning: at runtime, the system collects statistics on the distribution of join keys in the build-side records and uses them to prune probe-side data (Bloom joins).
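A minimal sketch of the Bloom-join idea, with an invented toy Bloom filter built over the build-side join keys and used to drop probe-side rows that cannot match:

```python
# Illustrative sketch only: run-time (dynamic) pruning with a Bloom filter
# built from the join keys on the build side of a hash join. The filter is
# then used to skip probe-side rows (or whole files) that cannot match.
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size, self.k, self.bits = size_bits, num_hashes, 0

    def _positions(self, key):
        for i in range(self.k):
            h = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(h, 16) % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        return all((self.bits >> p) & 1 for p in self._positions(key))

build_keys = [101, 205, 307]          # join keys from the (small) build side
bf = BloomFilter()
for k in build_keys:
    bf.add(k)

probe_rows = [(100, "a"), (205, "b"), (999, "c")]
# Rows whose key definitely isn't on the build side are dropped before the join.
print([row for row in probe_rows if bf.might_contain(row[0])])
```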

2.3.3 Concurrency Control

Snapshot Isolation (SI): all reads by a transaction see a consistent snapshot of the database as of the time the transaction started, implemented on top of MVCC. Write operations on a table produce a newer version of the table by adding and removing whole files relative to the prior table version; this allows the set of files belonging to a specific table version to be computed very efficiently.
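A minimal sketch of versions-as-file-sets, with invented file names; a snapshot reader keeps using the file set of the version that was current when its transaction started:

```python
# Illustrative sketch only: table versions as immutable sets of file names.
# A write produces a new version by adding/removing whole files; a snapshot
# reader keeps the file set of the version it started with.
class Table:
    def __init__(self, files):
        self.versions = [frozenset(files)]   # version 0

    def commit(self, added=(), removed=()):
        new = (self.versions[-1] - frozenset(removed)) | frozenset(added)
        self.versions.append(new)
        return len(self.versions) - 1        # new version id

    def snapshot(self, version=None):
        return self.versions[-1 if version is None else version]

t = Table(["f1", "f2"])
reader_snapshot = t.snapshot()               # transaction starts, sees {f1, f2}
t.commit(added=["f3"], removed=["f1"])       # concurrent write creates a new version
print(sorted(reader_snapshot))               # ['f1', 'f2'] (unchanged snapshot)
print(sorted(t.snapshot()))                  # ['f2', 'f3'] (latest version)
```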

3 Feature Highlights

3.1 Semi-structured and schema-less data

compact binary serialization which supports fast key-value lookup

3.1.1 Post-relational Operations

extraction operations in both functional SQL notation and JavaScript-like path syntax

flattening: SQL lateral views to represent flattening operation

3.1.2 Columnar Storage and Processing

data is stored in a hybrid columnar format: the system automatically performs statistical analysis of the collection of documents within a single table file

3.1.3 Optimistic Conversion

preserving both the result of the conversion and the original JSON string

3.2 Time Travel and Cloning

When files are removed by a new table version, they are retained for a configurable duration (currently up to 90 days). One can even access different versions of the same table in a single query.

Cloning creates a new table with the same definition and contents quickly and without making physical copies of table files. Right after cloning, both tables refer to the same set of files, but both tables can be modified independently thereafter.
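A minimal sketch of cloning as a pure metadata operation, with invented table and file names:

```python
# Illustrative sketch only: zero-copy cloning as a metadata operation. The
# clone starts with the same file set as the source table; later writes to
# either table add/remove files independently, so no table files are copied.
source = {"name": "orders", "files": {"f1", "f2", "f3"}}

def clone(table, new_name):
    # Copy only the metadata (the set of file references), not the files.
    return {"name": new_name, "files": set(table["files"])}

dev = clone(source, "orders_dev")
dev["files"].add("f4")            # writes to the clone do not affect the source
source["files"].discard("f1")     # and vice versa
print(source["files"], dev["files"])
```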

3.3 SaaS User Experience

standard database interfaces: ODBC and JDBC

third-party tools and services

a web UI makes the system easy to access from any location and environment, reducing the complexity of bootstrapping and using the system

3.4 Continuous Availability

continuous availability became an important requirement for any data warehouse

in the past, downtimes, both planned and unplanned, did not have a large impact on operations; that is no longer acceptable

3.4.1 Fault Resilience

data is replicated across multiple data centers (availability zones, AZs)

3.4.2 Online Upgrade

first deploys the new version of the service alongside the previous version; user accounts are then progressively switched to the new version, at which point all new queries issued by the respective user are directed to the new version, while queries already executing against the previous version are allowed to run to completion

VWs of different versions are able to share the same worker nodes and their respective caches, so there is no need to repopulate the caches after an upgrade
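A minimal sketch of the account-switching idea, with hypothetical account names and version labels:

```python
# Illustrative sketch only: progressive account switching during an online
# upgrade. Both service versions run side by side; each new query is routed
# according to the account's currently assigned version, while in-flight
# queries on the old version simply run to completion.
account_version = {"acme": "v2", "globex": "v1"}   # hypothetical accounts

def route(account, query):
    version = account_version.get(account, "v1")
    return f"dispatch '{query}' to cloud-services {version}"

print(route("acme", "SELECT 1"))    # goes to the new version
print(route("globex", "SELECT 1"))  # still on the previous version

account_version["globex"] = "v2"    # flip the account after validation
print(route("globex", "SELECT 1"))  # subsequent queries use the new version
```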

4 Related Work

Cloud-based Parallel Database Systems: Amazon Redshift (ParAccel), Google BigQuery (Dremel), Microsoft SQL Data Warehouse

Document Stores and Big Data: MongoDB, Couchbase, Cassandra

Snowflake only has one tuning parameter: how much performance the user wants (and is willing to pay for).