Amazon Redshift is a fully managed data warehouse service that can store petabytes of data.
Redshift is an OLAP data warehouse solution based on PostgreSQL.
Redshift automatically assists in setting up, operating, and scaling a data warehouse. This includes provisioning the infrastructure capacity.
Patches and backs up the data warehouse, storing backups for a user-defined retention period.
Monitors the nodes and drives to aid recovery from failures.
This not only lowers the cost of a warehouse but also makes it possible to quickly analyze large amounts of data.
Provides fast querying capabilities for structured and semi-structured data using familiar SQL-based clients. Business intelligence (BI) tools can also connect using standard ODBC or JDBC connections.
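As an illustration of the standard connectivity above, a Redshift JDBC URL follows the `jdbc:redshift://endpoint:port/database` form. The sketch below assembles one in Python; the cluster endpoint shown is hypothetical, and 5439 is Redshift's default port:

```python
def redshift_jdbc_url(endpoint: str, port: int = 5439, database: str = "dev") -> str:
    """Build a JDBC connection URL for a Redshift cluster.

    The endpoint is the cluster's DNS name as shown in the console;
    5439 is Redshift's default port.
    """
    return f"jdbc:redshift://{endpoint}:{port}/{database}"

# Hypothetical cluster endpoint, used for illustration only.
url = redshift_jdbc_url("examplecluster.abc123xyz.us-east-1.redshift.amazonaws.com")
print(url)
```

The same endpoint, port, and database name would be supplied to an ODBC DSN or a SQL client instead of the JDBC URL.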
Uses replication and continuous backups to increase availability and data durability. It can also automatically recover from component and node failures.
Scales up or down in a few clicks using the AWS Management Console, or with a single API call.
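A resize really is a single API call. The sketch below builds the parameters for the `ResizeCluster` API; the cluster identifier, node type, and node count are hypothetical values. With credentials configured, the dictionary could be passed to `boto3.client("redshift").resize_cluster(**params)`:

```python
def resize_params(cluster_id: str, node_type: str, number_of_nodes: int) -> dict:
    """Assemble parameters for the Redshift ResizeCluster API call."""
    return {
        "ClusterIdentifier": cluster_id,
        # Redshift distinguishes single-node from multi-node clusters.
        "ClusterType": "multi-node" if number_of_nodes > 1 else "single-node",
        "NodeType": node_type,
        "NumberOfNodes": number_of_nodes,
    }

# Hypothetical cluster scaled out to 4 ra3.xlplus nodes.
params = resize_params("examplecluster", "ra3.xlplus", 4)
print(params)
```

Keeping the parameter construction separate from the API call makes the resize request easy to review or unit-test before it is submitted.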
Distributes and parallelizes queries across multiple physical resources.
Supports VPC, SSL, AES-256 encryption, and Hardware Security Modules (HSMs) to protect data at rest and in transit.
Redshift supports single-AZ deployments only; all nodes of a cluster reside within the same AZ.
Redshift offers monitoring via CloudWatch. Metrics for compute utilization, storage utilization, and read/write traffic to the cluster are available. Custom metrics can also be added.
Redshift offers AWS CloudTrail integration and Audit Logging
Redshift can be easily enabled to a second region for disaster recovery.
Redshift Architecture
Clusters
Core infrastructure component of a Redshift data warehouse.
A cluster is made up of one or more compute nodes.
A leader node coordinates the cluster’s compute nodes and handles communication with external parties.
Client applications can only interact with the leader node.
Compute nodes are transparent to external applications.
Leader node
The leader node manages communication with client programs and all communication with compute nodes.
It parses queries and develops execution plans to carry out database operations.
Based on the execution plan, the leader node compiles code, distributes the compiled code among the compute nodes, and assigns a portion of the data to each compute node.
The leader node distributes SQL statements to the compute nodes only when a query references tables stored on the compute nodes. All other queries run exclusively on the leader node.
Compute nodes
The leader node compiles code and assigns it to individual compute nodes.
Compute nodes execute the compiled code and send intermediate results back to the leader node for final aggregation.
Each compute node has its own dedicated CPU, memory, and attached storage, which are determined by the node type.
As the workload grows, the cluster's compute and storage capacities can be increased by adding nodes, upgrading the node type, or both.
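The leader/compute division of labor described above can be sketched as a toy simulation. This is pure Python, not the actual Redshift engine: the leader assigns a portion of the data to each compute node, each node produces an intermediate result, and the leader performs the final aggregation:

```python
def assign_portions(rows, num_nodes):
    """Leader step: deal rows round-robin to compute nodes."""
    portions = [[] for _ in range(num_nodes)]
    for i, row in enumerate(rows):
        portions[i % num_nodes].append(row)
    return portions

def compute_node_sum(portion):
    """Compute-node step: produce an intermediate result for its portion."""
    return sum(portion)

def leader_aggregate(intermediates):
    """Leader step: final aggregation of the intermediate results."""
    return sum(intermediates)

rows = [3, 1, 4, 1, 5, 9, 2, 6]
portions = assign_portions(rows, num_nodes=2)
intermediates = [compute_node_sum(p) for p in portions]
print(leader_aggregate(intermediates))  # → 31
```

Adding nodes shrinks each node's portion, which is why scaling out speeds up queries that can be parallelized this way.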
Node slices
A compute node can be divided into slices.
Each slice is given a portion of the node's disk space and memory, and it processes a part of the assigned workload.
The leader node distributes data to the slices and allocates the workload for queries or other database operations to slices. The slices then work together to complete the operation.
The node size determines the number of slices per node.
Optionally, one column can be designated as the distribution key when a table is created. When the table is loaded with data, the rows are distributed to node slices according to the distribution key.
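Distribution-key routing can be sketched minimally as hashing the key column to pick a slice. The hashing scheme below is illustrative (`zlib.crc32` stands in for Redshift's internal hash function), and the cluster layout is a hypothetical two nodes with two slices each:

```python
import zlib

def slice_for_row(dist_key_value: str, total_slices: int) -> int:
    """Route a row to a slice by hashing its distribution key.

    zlib.crc32 is a stand-in for Redshift's internal hash function.
    """
    return zlib.crc32(dist_key_value.encode()) % total_slices

# Hypothetical layout: two nodes with two slices each -> four slices total.
rows = [("user-1", 10), ("user-2", 20), ("user-1", 30)]
for key, value in rows:
    print(key, "-> slice", slice_for_row(key, total_slices=4))
```

The key property is that rows sharing a distribution key value always hash to the same slice, which lets joins on that key run without moving data between nodes.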