Physical design – translate the logical description of data into technical specifications for storing and retrieving data, so as to provide adequate performance and ensure database integrity, security, and recoverability

Input

  1. Normalized relations
  2. Attribute definition – data type, integrity control, error handling
  3. Data statistics: data-volume, frequency of use
  4. Usage map (Fig. 6-1)
  5. Requirements: response time, data security, backup, recovery, retention, and integrity
  6. IT specifications: hardware, DBMS, …

Data statistics – relative vs precise numbers

  1. Data volume – size of business with growth adjustment
  2. Access frequency – timing of events, transaction volumes, reporting activities, querying activities

Composite usage map

  1. Data volumes – lower left hand corner of an entity
  2. Frequency of use – dashed arrow

Decisions

  1. Data type of each attribute – minimize storage space and maximize data integrity
  2. Grouping of attributes into physical records – denormalization, horizontal and vertical partitioning
  3. File organizations – sequential, indexed, hashed
  4. Database architectures
  5. Query-handling strategies

 

Data decisions

  1. Data type – storage space, domain, validation, manipulation
  2. Code table, compression & encryption
  3. Data integrity – data type, default, range, null (missing or unknown), referential
  4. Index -- a table to determine the location of rows in a file that satisfy some condition
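A code table (item 2 above) can be sketched as a simple lookup: a short code is stored in the main relation to save space, and the code table supplies the user-facing description. The codes and data below are illustrative, not from the text:

```python
# Hedged sketch of a code table: short codes in the stored relation,
# expanded to full descriptions for presentation (illustrative data).

code_table = {"A": "Accepted", "R": "Rejected", "P": "Pending"}

orders = [
    {"order_id": 101, "status": "A"},
    {"order_id": 102, "status": "P"},
]

# Expand each code to its description when presenting data to users.
report = [{"order_id": o["order_id"], "status": code_table[o["status"]]}
          for o in orders]
print(report[0]["status"])  # Accepted
```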

Index -- for information retrieval

  1. Unique index -- primary key
  2. Secondary index -- any attributes
  3. Bitmap index -- a table of bits in which each row represents a distinct value of a key and each bit position corresponds to a record; a bit is set when that record has the associated field value (Figure 7-7)
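The bitmap index idea can be sketched with illustrative data: one bit vector per distinct key value, one bit position per record:

```python
# Hedged sketch of a bitmap index (cf. Figure 7-7): for each distinct
# value of the indexed field, keep one bit per row; bit i is 1 when
# row i has that value. Data below is illustrative.

regions = ["East", "West", "East", "North", "West"]  # column values, in row order

# Build: distinct value -> list of 0/1 bits, one per row.
bitmap = {}
for value in set(regions):
    bitmap[value] = [1 if r == value else 0 for r in regions]

# Answer "region = East" by reading one bit vector instead of scanning rows.
east_rows = [i for i, bit in enumerate(bitmap["East"]) if bit]
print(east_rows)  # rows 0 and 2
```

Bit vectors are compact and can be combined with AND/OR to answer multi-condition queries, which is why bitmap indexes suit low-cardinality fields.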

Record decisions – choosing the sequencing of fields into adjacent storage locations to achieve efficient use of storage and data processing speed

  1. Blocking factor – the number of physical records per page
  2. Denormalization – the process of transforming normalized relations into unnormalized physical record specifications
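The blocking factor is simple arithmetic; a quick sketch assuming a 4 KB page and a 120-byte fixed-length record (both sizes are assumptions, not from the text):

```python
# Blocking factor = number of physical records per page.
# Assumed sizes: 4 KB page, 120-byte fixed-length record.

page_size = 4096      # bytes per page (assumption)
record_size = 120     # bytes per physical record (assumption)

blocking_factor = page_size // record_size      # whole records per page
pages_needed = -(-10_000 // blocking_factor)    # ceiling division for 10,000 records
print(blocking_factor, pages_needed)            # 34 295
```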

Denormalization

  1. Combining tables to avoid doing joins (tools: function-relation matrix)
  2. Horizontal partitioning – placing different rows of a table into separate files (~ supertype/subtype relationship)
  3. Vertical partitioning – placing different columns of a table into separate files, repeating the primary key in each of the files
  4. Record partitioning – a combination of both horizontal and vertical partitioning as in distributed database processing
  5. Data replication – same data are stored in multiple places in the database
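Horizontal and vertical partitioning (items 2 and 3 above) can be sketched on a hypothetical EMPLOYEE table; note how the primary key is repeated in each vertical file. All names and data are illustrative:

```python
# Hedged sketch of partitioning a hypothetical EMPLOYEE table.

employees = [
    {"emp_id": 1, "name": "Ann", "dept": "Sales", "salary": 50000},
    {"emp_id": 2, "name": "Bob", "dept": "IT",    "salary": 60000},
    {"emp_id": 3, "name": "Cay", "dept": "Sales", "salary": 55000},
]

# Horizontal partitioning: different rows into separate files, here by dept.
horizontal = {}
for row in employees:
    horizontal.setdefault(row["dept"], []).append(row)

# Vertical partitioning: different columns into separate files,
# repeating the primary key (emp_id) in each file.
hr_file  = [{"emp_id": r["emp_id"], "name": r["name"]}     for r in employees]
pay_file = [{"emp_id": r["emp_id"], "salary": r["salary"]} for r in employees]
```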

 

Candidates for denormalization

Candidate: 1:1 relationship
  Why: Attributes have been mistakenly identified as entities; data from both relations are accessed together and are created and deleted at the same time
  How: Combine the two relations into one
  Benefits: Reduces the number of relations and the number of joins

Candidate: m:m relationship
  Why: One of the relations has little data apart from its primary key
  How: Merge data from one of the primary relations into the intersection relation
  Benefits: Eliminates one relation out of three

Candidate: Reference data
  Why: Multiple instances of reference data in one relation; the code description is the natural user view
  How: Duplicate the code description in the relation
  Benefits: No reduction in the number of relations, but removes the need to join target entities with the code table

Candidate: Detailed data
  Why: Multiple rows of data fan out from some least significant attribute, e.g., the monthly balance for a budget item of a department in an organization
  How: Discard the least significant attribute and use one column per possible value of that attribute, e.g., Month vs. Balance_1, …, Balance_12
  Benefits: Reduces the number of rows in the relation containing the most detailed data

Candidate: Derived attributes
  Why: Frequent use of data calculated from its component data
  How: Store the calculated value together with the component data in a relation
  Benefits: Reduces time-consuming calculations at run time
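Storing a derived attribute with its component data can be sketched as follows; the attribute names are illustrative:

```python
# Hedged sketch of a derived attribute: the calculated value is stored
# once at write time, so later reads need no run-time calculation.

def make_employee(hours, rate):
    # Store the derived value (total_pay) alongside its components.
    return {"hours": hours, "rate": rate, "total_pay": hours * rate}

emp = make_employee(40, 25)
print(emp["total_pay"])  # 1000, read directly rather than recomputed
```

The trade-off: every update to a component must also update the stored derived value, or the data becomes inconsistent.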

Joining two tables with many rows on every access can be as costly as joining several tables with few rows

Denormalization is a practical necessity if performance is important

Design tables that closely match the user's view of the data

 

File decisions

  1. File organization – physical arrangement of records of a file on secondary storage devices (Table 6-3)
  2. File clustering -- related records from several tables (e.g., a CUSTOMER's ORDER) are placed in adjacent secondary memory space (a cluster)
  3. File controls -- measures that protect the file from destruction or contamination, or reconstruct the file if it is damaged
  4. Parallel processing – RAID (Redundant Arrays of Inexpensive Disks) to optimize I/O performance
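The striping idea behind RAID can be sketched as a mapping from logical strip numbers to drives. This is a RAID-0-style sketch with an assumed four-drive array, ignoring parity and redundancy:

```python
# Hedged sketch of RAID striping (assumed 4-drive array, no parity):
# logically sequential strips are placed round-robin across the drives,
# so consecutive strips can be read or written in parallel.

num_drives = 4  # assumption

def strip_location(strip_number):
    """Map a logical strip number to (drive index, offset on that drive)."""
    return (strip_number % num_drives, strip_number // num_drives)

# Four consecutive strips land on four different drives -> one parallel read.
locations = [strip_location(n) for n in range(4)]
print(locations)  # [(0, 0), (1, 0), (2, 0), (3, 0)]
```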

RAID (Figure 6-10)

Hardware -- a set/array of disk drives with segments of data (strips) cutting across all the drives

Software -- parallel read & write, concatenate results into one logical record, multithreading

Data -- logically sequential pages of data, e.g., multiple pages of one record (with audio & video data), or multiple related records (a cluster)

Database decisions – choosing the appropriate type of DBMS

  1. Hierarchical
  2. Network
  3. Relational
  4. Object-oriented
  5. Multidimensional

Query optimization

  1. Parallel processing -- the same query is run against each portion of the database (horizontal partition) in parallel on separate processors. The intermediate results from each processor are combined to create the final query result.
  2. Query plan -- statistics are kept about the database structure, number of distinct values for fields, and the best strategy for accessing each index and table

e.g.,

S(S#, Sname) has 1,000 tuples

SP(S#,P#) has 10,000 tuples; 50 of which have P#=P2

Get the names of suppliers who supply part P2

Plan1:

Join S with SP over S#

Restrict where P#="P2"

Project over Sname

Plan2:

Restrict SP where P#="P2"

Join with S over S#

Project over Sname
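Using the counts given above, the two plans can be compared by the size of their intermediate results; a quick sketch of the arithmetic:

```python
# Comparing the plans by intermediate result size, using the stated
# counts: 1,000 S tuples; 10,000 SP tuples, 50 of which have P# = 'P2'.

s_rows, sp_rows, p2_rows = 1_000, 10_000, 50

# Plan 1: join first -> one intermediate row per SP tuple, restricted later.
plan1_intermediate = sp_rows   # 10,000 joined rows before the restrict

# Plan 2: restrict first -> only the P2 tuples reach the join.
plan2_intermediate = p2_rows   # 50 rows joined against S

print(plan1_intermediate // plan2_intermediate)  # Plan 2 handles 200x fewer rows
```

This is why a query optimizer generally pushes restrictions down before joins: the statistics kept about the database (item 2 above) let it estimate these intermediate sizes in advance.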