AWS Redshift Best Practices
Distribution Style Selection
Distribute the fact table and one dimension table on their common columns. A fact table can have only one distribution key, so any table that joins to it on a different key cannot be collocated with it.
Choose the one dimension to collocate based on how frequently it is joined and how large the joining rows are.
Designate both the primary key of the dimension table and the corresponding foreign key of the fact table as the DISTKEY.
Choose the largest dimension based on the size of the filtered data set, not the size of the whole table, because only the rows that take part in the join need to be distributed.
Choose a column that has high cardinality in the filtered result set.
If you commonly use a range-restricted filter to select a narrow period of time and the table is distributed on that time column, most of the filtered rows fall on a small set of slices and the query workload is skewed.
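The skew described above can be illustrated with a small simulation. This is only a sketch: the eight-slice cluster, the table layout, and the use of `zlib.crc32` as a stand-in for Redshift's internal hash function are all assumptions made for illustration.

```python
import zlib
from collections import Counter
from datetime import date, timedelta

NUM_SLICES = 8  # hypothetical cluster size


def slice_for(value) -> int:
    """Map a distribution-key value to a slice with a stable hash.

    crc32 is only a stand-in; Redshift's real hash function differs.
    """
    return zlib.crc32(str(value).encode()) % NUM_SLICES


# Hypothetical fact rows: (order_id, sale_date), 100 orders per day for a year.
start = date(2023, 1, 1)
rows = [
    (day * 100 + n, start + timedelta(days=day))
    for day in range(365)
    for n in range(100)
]

# A typical query filters on a narrow window: here, a single day.
target = date(2023, 6, 15)
filtered = [r for r in rows if r[1] == target]

# Count how many filtered rows each slice would have to process.
by_date = Counter(slice_for(r[1]) for r in filtered)   # DISTKEY(sale_date)
by_order = Counter(slice_for(r[0]) for r in filtered)  # DISTKEY(order_id)

print("slices doing work with DISTKEY(sale_date):", len(by_date))
print("slices doing work with DISTKEY(order_id): ", len(by_order))
```

With the date as the distribution key, every filtered row hashes to the same slice, so one slice does all the work. With the high-cardinality `order_id`, the same rows are spread over several slices.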
Change dimension tables that cannot be collocated with the fact table to use ALL distribution.
Using ALL distribution increases storage space requirements and load times, because a full copy of the table is stored on every node.
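As a rough sketch of how these choices look in DDL, the helper below renders `CREATE TABLE` statements that use either KEY or ALL distribution. The helper function and all table and column names are hypothetical; only the `DISTSTYLE` and `DISTKEY` table attributes are Redshift syntax.

```python
from typing import Dict, Optional


def create_table_sql(name: str, columns: Dict[str, str],
                     distkey: Optional[str] = None) -> str:
    """Render CREATE TABLE with KEY distribution if distkey is given, else ALL."""
    if distkey is not None and distkey not in columns:
        raise ValueError(f"{distkey!r} is not a column of {name}")
    cols = ",\n".join(f"    {col} {typ}" for col, typ in columns.items())
    if distkey is not None:
        dist = f"DISTSTYLE KEY\nDISTKEY ({distkey})"
    else:
        dist = "DISTSTYLE ALL"
    return f"CREATE TABLE {name} (\n{cols}\n)\n{dist};"


# Fact table and its most important dimension share the same distribution key.
sales_fact = create_table_sql(
    "sales_fact",
    {"sale_id": "BIGINT", "customer_id": "BIGINT",
     "product_id": "INTEGER", "amount": "DECIMAL(12,2)"},
    distkey="customer_id",
)
customer_dim = create_table_sql(
    "customer_dim",
    {"customer_id": "BIGINT", "customer_name": "VARCHAR(100)"},
    distkey="customer_id",
)
# A second dimension cannot share the fact table's key, so it is copied to every node.
product_dim = create_table_sql(
    "product_dim",
    {"product_id": "INTEGER", "product_name": "VARCHAR(100)"},
)

for ddl in (sales_fact, customer_dim, product_dim):
    print(ddl, end="\n\n")
```

Here `sales_fact` and `customer_dim` are collocated on `customer_id`, while `product_dim` pays the extra storage and load cost of ALL distribution so that its joins also avoid redistribution.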
Redshift stores data on disk in sorted order according to the sort key. The query optimizer uses this sort order when it determines the best query plan.
If recent data is queried most often, specify the timestamp column as the leading column of the sort key. Queries can then skip blocks that fall outside the time range.
If you frequently use range filters or equality filters on one column, specify that column as the sort key so that Redshift can skip entire blocks of data.
Redshift tracks the minimum and maximum column values stored in each block and skips blocks that cannot match the predicate range.
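The effect of sorting on block skipping can be sketched with a toy model. The `Block` class, the block size of 100 values, and the integer "timestamps" are assumptions for illustration and do not reflect Redshift's actual block layout.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    values: List[int]

    @property
    def lo(self) -> int:
        return min(self.values)  # minimum value tracked for the block

    @property
    def hi(self) -> int:
        return max(self.values)  # maximum value tracked for the block


def make_blocks(values: List[int], block_size: int) -> List[Block]:
    """Cut a column into fixed-size blocks in storage order."""
    return [Block(values[i:i + block_size])
            for i in range(0, len(values), block_size)]


def blocks_to_scan(blocks: List[Block], lo: int, hi: int) -> List[Block]:
    """Keep only blocks whose [min, max] range overlaps the predicate range."""
    return [b for b in blocks if b.hi >= lo and b.lo <= hi]


timestamps = list(range(1000))

# Stored in sort-key order: each block covers a narrow, disjoint range.
sorted_blocks = make_blocks(timestamps, 100)

# Stored unsorted (interleaved): every block spans almost the full range.
interleaved = [v for start in range(10) for v in range(start, 1000, 10)]
unsorted_blocks = make_blocks(interleaved, 100)

# Query for the most recent values only.
scan_sorted = blocks_to_scan(sorted_blocks, 950, 999)
scan_unsorted = blocks_to_scan(unsorted_blocks, 950, 999)

print(len(scan_sorted), "of", len(sorted_blocks), "blocks read when sorted")
print(len(scan_unsorted), "of", len(unsorted_blocks), "blocks read when unsorted")
```

In this model the sorted layout reads 1 of 10 blocks, while the unsorted layout must read all 10 because every block's minimum and maximum straddle the requested range.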
If you join tables often, specify the join column as both the sort key and the distribution key. This allows the query optimizer to choose a sort merge join instead of a slower hash join.
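To see why pre-sorted inputs help, here is a minimal sketch of a merge join over two inputs that are already ordered on the join key. It is not Redshift's implementation; the function name and the sample rows are hypothetical, and the dimension side is assumed to have unique keys.

```python
from typing import List, Tuple


def merge_join(fact: List[Tuple[int, str]],
               dim: List[Tuple[int, str]]) -> List[Tuple[int, str, str]]:
    """Inner-join fact rows to dim rows in one pass.

    Both inputs must be sorted by key; dim keys are assumed unique.
    No hash table is built and neither input is re-sorted.
    """
    out = []
    j = 0
    for key, payload in fact:
        # Advance the dimension cursor until it reaches or passes the fact key.
        while j < len(dim) and dim[j][0] < key:
            j += 1
        if j < len(dim) and dim[j][0] == key:
            out.append((key, payload, dim[j][1]))
    return out


fact_rows = [(1, "a"), (1, "b"), (3, "c"), (4, "d")]
dim_rows = [(1, "x"), (2, "y"), (4, "z")]

joined = merge_join(fact_rows, dim_rows)
print(joined)
```

Because both cursors only move forward, the join is a single sequential pass over each table, which is the property the sort key provides.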
Automatic compression produces the best results.
The COPY command analyzes the data and applies compression encodings automatically to an empty table as part of the load operation.
Whenever possible, define primary key and foreign key constraints between tables. These constraints are informational only and are not enforced, but the query optimizer uses them to create more efficient query plans.
The following methods can be used to load data into tables:
Use a multi-row INSERT
Use the COPY command
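When COPY is not an option, a multi-row insert batches several rows into one statement. The sketch below builds such a statement with placeholders; the helper name, the table, and the `%s` placeholder style (used by common PostgreSQL drivers) are assumptions for illustration.

```python
from typing import Any, List, Sequence, Tuple


def build_multirow_insert(table: str, columns: Sequence[str],
                          rows: Sequence[Sequence[Any]]) -> Tuple[str, List[Any]]:
    """Return one INSERT statement covering all rows, plus its flat parameter list."""
    if not rows:
        raise ValueError("at least one row is required")
    for row in rows:
        if len(row) != len(columns):
            raise ValueError("row length does not match column count")
    # One "(%s, %s, ...)" group per row, all in a single VALUES clause.
    group = "(" + ", ".join(["%s"] * len(columns)) + ")"
    sql = (
        f"INSERT INTO {table} ({', '.join(columns)}) VALUES "
        + ", ".join([group] * len(rows))
    )
    params = [value for row in rows for value in row]
    return sql, params


sql, params = build_multirow_insert(
    "category",
    ["catid", "catname"],
    [(1, "Sports"), (2, "Shows")],
)
print(sql)
print(params)
```

The statement and parameters would then be passed together to a database cursor's `execute` call, so the values are bound by the driver rather than concatenated into the SQL text.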
Copy Command
The COPY command loads data in parallel from S3, EMR, DynamoDB, or multiple data sources on remote hosts.
COPY loads large data volumes much faster than INSERT statements and stores the data more efficiently.
To load multiple files, use a single COPY command
Do not use multiple concurrent COPY commands to load one table from multiple files. Doing so forces Redshift into a serialized load, which is much slower.
Split the Load Data into Multiple Files
Divide the data into multiple files of equal size (between 1 MB and 1 GB).
The number of files should be a multiple of the number of slices in the cluster.
This helps to distribute the workload evenly within the cluster.
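The file-count rule can be worked through with a short sketch. Both helper functions are hypothetical, and splitting by row count only yields equal file sizes if rows are of similar width. The 1 MB lower bound is not enforced here.

```python
import math
from typing import Any, List, Sequence


def choose_file_count(total_bytes: int, num_slices: int,
                      max_file_bytes: int = 1024 ** 3) -> int:
    """Smallest multiple of num_slices that keeps each file at or under max_file_bytes."""
    min_files = max(1, math.ceil(total_bytes / max_file_bytes))
    return math.ceil(min_files / num_slices) * num_slices


def split_round_robin(rows: Sequence[Any], num_files: int) -> List[List[Any]]:
    """Deal rows into num_files parts so part sizes differ by at most one row."""
    parts: List[List[Any]] = [[] for _ in range(num_files)]
    for i, row in enumerate(rows):
        parts[i % num_files].append(row)
    return parts


# 10 GB of load data on a hypothetical cluster with 4 slices.
file_count = choose_file_count(10 * 1024 ** 3, num_slices=4)
print("files to write:", file_count)

parts = split_round_robin(list(range(10)), 4)
print("rows per file:", [len(p) for p in parts])
```

For 10 GB and 4 slices, at least 10 files are needed to stay under 1 GB each, and rounding up to a multiple of 4 gives 12 files, so each slice loads three files of roughly equal size.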
Use a Manifest File
S3 provides only eventual consistency for certain operations, so new data may not be available immediately after the upload. This could lead to an incomplete data load. A manifest file that explicitly lists the files to load guards against this.
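A manifest is a JSON document that lists every file COPY should load. The sketch below generates one and shows a matching COPY statement; the bucket, object keys, table name, and IAM role ARN are placeholders, not values from this document.

```python
import json
from typing import List


def build_manifest(urls: List[str]) -> str:
    """Render a COPY manifest; mandatory=True makes the load fail if a file is missing."""
    entries = [{"url": url, "mandatory": True} for url in urls]
    return json.dumps({"entries": entries}, indent=2)


# Placeholder object keys for four split, gzip-compressed load files.
urls = [f"s3://example-bucket/load/sales.part{i:02d}.gz" for i in range(4)]
manifest = build_manifest(urls)
print(manifest)

# After uploading the manifest itself to S3, a single COPY loads exactly the listed files.
# The role ARN below is a placeholder.
copy_sql = (
    "COPY sales "
    "FROM 's3://example-bucket/load/sales.manifest' "
    "IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleRedshiftRole' "
    "GZIP MANIFEST;"
)
print(copy_sql)
```

Marking each entry as mandatory means a missing file raises an error instead of silently producing a partial load.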