1 Key Design
Snowflake is an enterprise-ready data warehousing solution for the cloud.
The cloud promises increased economies of scale, extreme scalability and availability, and a pay-as-you-go cost model, but these benefits can only be captured if the software itself is able to scale elastically over the pool of commodity resources in the cloud.
Meanwhile, SaaS brings enterprise-class systems to users who previously could not afford them. Snowflake's key features include: a relational model, semi-structured data support, elastic compute and storage, high availability, durability, cost efficiency and security.
Compute vs Storage
Snowflake proposes a multi-cluster, shared-data architecture, compared to the traditional shared-nothing architecture where compute and storage are tightly coupled. This means it can cater to heterogeneous workloads, supporting both I/O-intensive and compute-intensive workloads with different hardware configurations. Membership changes due to node failures or resizing also don't require reshuffling data.
2 Architecture
Service-oriented architecture composed of highly fault tolerant and independently scalable services.
2.1 Virtual Warehouses
A VW consists of worker nodes, which are EC2 instances; VWs come in different “t-shirt” sizes.
2.1.1 Elasticity and Isolation
VW can be created, destroyed or resized at any point, on demand. Worker nodes are not shared across VWs, resulting in strong performance isolation for queries. Each individual query runs on exactly one VW.
Private compute resources avoid interference between different workloads and organizational units. A worker process by itself - even if part of an update statement - never causes externally visible effects, because table files are immutable.
2.1.2 Execution Engine
Columnar: SIMD instructions and CPU caches, and light-weight compression
Vectorized: data is processed in batches, similar to VectorWise (MonetDB/X100)
Push-based: improves cache efficiency, because it removes control flow logic from tight loops (also discussed in Query Engines: Push vs Pull)
Queries are executed against a fixed set of immutable files. No buffer pool for table because most queries scan large amounts of data, but operations can spill to disk.
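The push-based, vectorized style above can be illustrated with a minimal sketch (not Snowflake's actual engine; the operator names and tiny batch size are invented for illustration): producers push fixed-size column batches downstream instead of parents pulling rows one at a time.

```python
# Hypothetical sketch of push-based, vectorized execution:
# each operator processes a batch of values and pushes the
# result downstream, rather than the parent pulling rows.

BATCH_SIZE = 4  # tiny for illustration; real engines use ~1000+

class Filter:
    def __init__(self, predicate, downstream):
        self.predicate = predicate
        self.downstream = downstream

    def push(self, batch):
        # keep only values passing the predicate, then push onward
        self.downstream.push([v for v in batch if self.predicate(v)])

class Sink:
    def __init__(self):
        self.rows = []

    def push(self, batch):
        self.rows.extend(batch)

def scan(column, downstream):
    # the producer drives the pipeline by pushing batches
    for i in range(0, len(column), BATCH_SIZE):
        downstream.push(column[i:i + BATCH_SIZE])

sink = Sink()
scan(list(range(10)), Filter(lambda v: v % 2 == 0, sink))
print(sink.rows)  # [0, 2, 4, 6, 8]
```

Because the tight loop lives inside each operator's `push`, there is no per-row control-flow handshake between operators, which is the cache-efficiency point the notes make.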
2.1.3 Local Caching and File Stealing
Each worker node maintains a cache of table data on local disk. The cache holds file headers and individual columns of files, maintained using a Least-Recently-Used (LRU) replacement policy.
The query optimizer assigns input file sets to worker nodes using consistent hashing over table file names, so subsequent or concurrent queries accessing the same table file will do so on the same worker node. File stealing: whenever a worker process completes scanning its set of input files, it requests additional files from its peers.
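A minimal sketch of this scheme (node and file names are made up, and a rendezvous-style hash stands in for whatever ring structure Snowflake actually uses): each file is deterministically assigned to a node, so repeated queries hit the same node's cache, and an idle node can steal unscanned files from a peer.

```python
import hashlib

# Hypothetical sketch: assign table files to worker nodes by
# hashing over file names (rendezvous-style variant of
# consistent hashing), as the notes describe.

def assign_node(file_name, nodes):
    # pick the node with the smallest hash of (node, file); the
    # choice is stable when unrelated nodes join or leave
    return min(nodes, key=lambda n: hashlib.md5((n + file_name).encode()).hexdigest())

nodes = ["worker-1", "worker-2", "worker-3"]
files = [f"part-{i}" for i in range(8)]
plan = {f: assign_node(f, nodes) for f in files}

def steal(plan, idle_worker):
    # "file stealing": an idle worker takes one unscanned file
    # that was assigned to a peer
    for f, owner in plan.items():
        if owner != idle_worker:
            plan[f] = idle_worker
            return f
    return None
```

The key property: if a node drops out, only the files it owned move to other nodes; every other file keeps its cached location.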
2.2 Data Storage
While S3 performance can vary, its usability, high availability and strong durability guarantees were hard to beat.
Tables are horizontally partitioned into large, immutable files, which are the equivalent of blocks or pages in a traditional database system; attribute (column) data is heavily compressed and stored in PAX layout, and each table file also has a header containing metadata such as the offset of each column within the file.
S3 is also used to store temp data generated by query operators (e.g. massive joins) once local disk space is exhausted, as well as large query results.
2.3 Cloud Services
Multi-tenant: each service in this layer — access control, query optimizer, transaction manager and others — is long-lived and shared across many users:
- improves utilization and reduces administrative overhead
- replicated for high availability and scalability
2.3.1 Query Management and Optimization
Cascades-style approach, top-down cost-based optimization: the resulting execution plan is distributed to all the worker nodes that are part of the query.
2.3.2 Pruning
Min-max based pruning: the system maintains data distribution information (min and max values) for a given chunk of data, and can thereby determine that a given chunk might not be needed for a given query.
Dynamic pruning: collects statistics on the distribution of join keys in the build-side records, e.g. bloom joins.
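Both pruning styles can be sketched minimally (the per-file stats, bitset size and keys are invented for illustration, not Snowflake's actual structures): min-max pruning skips files whose value range cannot match a predicate, while dynamic pruning builds a Bloom filter over build-side join keys and consults it during the probe-side scan.

```python
import hashlib

# Hypothetical sketch of min-max (zone-map) pruning: per-file
# min/max metadata lets the system skip files up front.
files = [
    {"name": "part-1", "min": 0,   "max": 99},
    {"name": "part-2", "min": 100, "max": 199},
    {"name": "part-3", "min": 200, "max": 299},
]

def prune(files, lo, hi):
    # keep only files whose [min, max] range overlaps [lo, hi]
    return [f["name"] for f in files if f["max"] >= lo and f["min"] <= hi]

print(prune(files, 150, 250))  # ['part-2', 'part-3']

# Hypothetical sketch of dynamic pruning with a Bloom filter
# over build-side join keys.
M = 64  # bitset size; real filters are much larger

def positions(key, k=3):
    h = int(hashlib.md5(str(key).encode()).hexdigest(), 16)
    return [(h >> (8 * i)) % M for i in range(k)]

def build_filter(keys):
    bits = 0
    for key in keys:
        for p in positions(key):
            bits |= 1 << p
    return bits

def might_contain(bits, key):
    return all(bits >> p & 1 for p in positions(key))

bloom = build_filter([10, 20, 30])  # keys seen on the build side
probe = [k for k in range(40) if might_contain(bloom, k)]
# probe retains 10, 20, 30 plus possible false positives,
# but never misses a true match
```

Both are conservative: they may keep data that turns out to be irrelevant, but they never discard data a query actually needs.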
2.3.3 Concurrency Control
Snapshot Isolation (SI): all reads by a transaction see a consistent snapshot of the database as of the time the transaction started, implemented via MVCC. Write operations on a table produce a newer version of the table by adding and removing whole files relative to the prior table version; this allows the set of files belonging to a specific table version to be computed very efficiently.
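A minimal sketch of this file-set versioning (file names and the commit API are invented for illustration): each table version is an immutable set of file names, a write commits a new version by adding/removing whole files, and a reader simply keeps the version it started with.

```python
# Hypothetical sketch of MVCC via immutable file sets: a write
# produces a new table version by adding/removing whole files;
# a reader's snapshot is just the version it pinned at start.

versions = [{"f1", "f2"}]            # version 0

def commit(add=(), remove=()):
    # derive the next version from the latest one; nothing is
    # ever mutated in place
    new = (versions[-1] - set(remove)) | set(add)
    versions.append(new)
    return len(versions) - 1

reader_snapshot = versions[-1]       # reader pins version 0
commit(add=["f3"], remove=["f1"])    # writer creates version 1

print(sorted(reader_snapshot))       # ['f1', 'f2']  (snapshot unchanged)
print(sorted(versions[-1]))          # ['f2', 'f3']
```

Since files are immutable, the reader's snapshot stays consistent with no locking, and computing any version's file set is a cheap set operation on metadata.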
3 Feature Highlights
3.1 Semi-structured and schema-less data
compact binary serialization which supports fast key-value lookup
3.1.1 Post-relational Operations
extraction operations in both functional SQL notation and JavaScript-like path syntax
flattening: SQL lateral views to represent flattening operation
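The flattening operation can be sketched in miniature (the document and field names are invented; this is the relational effect of a lateral view, not Snowflake's syntax): a nested array is pivoted into multiple rows, one per element.

```python
# Hypothetical sketch of "flattening" a semi-structured document:
# each element of a nested array becomes its own row, paired with
# the enclosing record's scalar fields.

doc = {"id": 1, "tags": ["a", "b", "c"]}

rows = [{"id": doc["id"], "tag": t} for t in doc["tags"]]
print(rows)
# [{'id': 1, 'tag': 'a'}, {'id': 1, 'tag': 'b'}, {'id': 1, 'tag': 'c'}]
```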
3.1.2 Columnar Storage and Processing
data is stored in a hybrid columnar format: the system automatically performs statistical analysis of the collection of documents within a single table file
3.1.3 Optimistic Conversion
preserving both the result of the conversion and the original string from JSON
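A minimal sketch of the idea (the function and the set of parsers tried are invented for illustration): try to convert a JSON string field to a typed value, and keep the original string alongside the conversion so no information is lost if the guess was wrong.

```python
from datetime import date

# Hypothetical sketch of optimistic conversion: attempt typed
# parses of a string field, preserving the original string
# regardless of whether a conversion succeeds.

def optimistic_convert(s):
    # try integer first, then ISO date; keep the original either way
    for parse in (int, date.fromisoformat):
        try:
            return {"converted": parse(s), "original": s}
        except ValueError:
            continue
    return {"converted": None, "original": s}

optimistic_convert("42")          # converted to int 42
optimistic_convert("2016-06-26")  # converted to a date
optimistic_convert("hello")       # stays a plain string
```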
3.2 Time Travel and Cloning
When files are removed by a new table version, they are retained for a configurable duration (currently up to 90 days). One can even access different versions of the same table in a single query.
Cloning creates a new table with the same definition and contents quickly and without making physical copies of table files. Right after cloning, both tables refer to the same set of files, but each can be modified independently thereafter.
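Zero-copy cloning falls out of the immutable-file design and can be sketched minimally (table and file names are invented for illustration): a clone copies only the metadata pointing at the source's files, and the two file sets diverge on later writes.

```python
# Hypothetical sketch of zero-copy cloning: a clone starts out
# referencing the same immutable files as the source table, so
# no file contents are copied; only metadata is.

tables = {"orders": {"f1", "f2"}}

def clone(src, dst):
    tables[dst] = set(tables[src])   # copy the file set, not the files

clone("orders", "orders_dev")
tables["orders_dev"].add("f3")       # a write to the clone only

print(sorted(tables["orders"]))      # ['f1', 'f2']  original unchanged
print(sorted(tables["orders_dev"]))  # ['f1', 'f2', 'f3']
```

Because table files are never modified in place, sharing them between the original and the clone is safe; each table's subsequent writes simply produce new files in its own set.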
3.3 SaaS User Experience
standard database interfaces: ODBC and JDBC
third-party tools and services
a web UI makes it very easy to access from any location and environment, reducing the complexity of bootstrapping and using the system
3.4 Continuous Availability
continuous availability has become an important requirement for any data warehouse
in the past, downtimes — both planned and unplanned — did not have a large impact on operations
3.4.1 Fault Resilience
replicated across multiple data centers (availability zones, AZs)
3.4.2 Online Upgrade
first deploys the new version of the service alongside the previous version; user accounts are then progressively switched to the new version, at which point all new queries issued by the respective user are directed to the new version; queries already executing against the previous version are allowed to run to completion
VWs of different versions are able to share the same worker nodes and their respective caches; there's no need to repopulate the caches after an upgrade
4 Related Work
Cloud-based Parallel Database Systems: Amazon Redshift (ParAccel), Google Big Query (Dremel), Microsoft SQL Data Warehouse
Document Stores and Big Data: MongoDB, Couchbase, Cassandra
Snowflake only has one tuning parameter: how much performance the user wants (and is willing to pay for).