Advanced DB Tech — 1-Page Cheat Sheet

Relational Model & DBMS

Relational Model: tables (relations) with rows (tuples) & columns (attributes). Vendors: Oracle, Informix, Sybase; runner-ups: DB2, MySQL; challengers: OO, ORDB.
DBMS core goals: manage very-large data sets; flexible/efficient access; support concurrent, secure, atomic access.
Example Relation: Account# | Name | Balance (e.g., 123, Joe, 451.00).
Three aspects: design, queries/transactions, implementation.

“Megatron 747” (Teaching DBMS)

Idea Data as Unix files; schema in separate file. Example tuple encodings like Smith#123#CS.

Query SELECT * FROM Students; → list tuples.

Projection+Filter→File SELECT * FROM R WHERE Cond | T; reads schema, scans R, writes qualifying tuples to T, and registers new file T.

Naïve Query Execution (Scan-Join)

SELECT * FROM R WHERE Cond; → read schema; scan R; test Cond; output matches.
Simple two-table join plan: nested loops over Students and Depts; check WHERE; output attribute(s).
What’s wrong with this? inflexible tuple layout; expensive scans; brute-force joins; no buffer manager; no CC; weak security.

disk capacity=tracks*blocks*sectors*bytes

2^30=1tb

circle pi*diameters, bytes/(90%*diameters*pi), inner and outer are different, everage inner+outer/2

maxseek=d1+maxtrack/trackpersec,averageseek=d1+1/3(maxtrack/trackpersec)

T_fullrotation=time/rpm,T_averagerot=T_fullrotation/2

two pass:RX2+SX2+R+S,$\frac{block}{M}$ < M-1

$\frac{k}{k+1}$

Serverbreak_p:percent_break x recover_hour/8760 hour

raid0: 0 fault tolerance,raid1:mirror,raid3:1diskfail byte level xor,raid4,large read slow writer block level xor, raid5:1 disk can fail distributed parity balance performance,raid6:2disk can fail

raid4:XOR:00:0,01:1,10:1,11:0

number of track to move: left+right, left/total*left_avg+right/total*right_avg

seek=seek_overhead+track_avg/trackpersec

Record_per_block=$\frac{kb*1024}{record_bytes}$

Require_blocks=record_num/Record_per_block

Tscan=Tseek+Trot_avg+B*Ttransfer

Trand=Tseek+Trot_avg+Trans

Tscan < Trand*Tuple_num

dense Index Block:total_block=$\frac{index entries}{Record_per_block}+\frac{index entries}{Index_per_block}$

Sparse Index Block:total_block=$\frac{index entries}{Record_per_block}+\frac{index entries/Record_per_block}{Index_per_block}$

$RPM=\frac{600000}{avg_dalay*2}

read whole tracks=Tseek+avg_dalay*2*track_number

seek time=overhead+(time per cylinder)*N/3,random accesses distance = N/3.

The first-level index (on data) is typically dense, not sparse. The second-level index (on the first-level index) is sparse, not dense.

Clustering Index Built on a non-key attribute of an ordered data file where records with the same value are stored together (clustered).

Secondary Index on Key Attribute A secondary index built on an attribute that is also a key

Sparse Primary Index is is the usual case for a primary index. order file One entry per data block (not every record).

Dense Secondary Index One entry for every record in the data file. Built on a non-ordering attribute

Memory, Storage & Disks

Memory hierarchy: volatile main memory vs. non-volatile disks/tapes; disk I/O dominates cost; block I/O in ms; e.g., 4KB block.
Typical disk facts: cylinders (tracks at same radius) enable fast sequential access; sector sizes ~0.5–50KB; capacities from floppies to hundreds of GB (historical ranges).

Access-time & Throughput (Rules of Thumb)

Disk Access Time (approx.): $T_{access}=T_{seek}+T_{rot}+T_{xfer}$ (seek ms + rotational delay ms + transfer). (Generic formula; aligns with hierarchy notion that “Disk I/O dominates computation”.)
Block Transfer Time: $T_{xfer}=\dfrac{\text{BlockSize}}{\text{TransferRate}}$.
Scan I/O Count (unindexed): $\approx$ #blocks in file.

Files, Blocks & Records

Use blocks: file header holds (e.g.) next-block link, block ID, record offsets, free-space pointer, timestamp.
Typical block layout: header + records; offset table can “slide” records for insertions.
Offset addressing: record address = physical block address + offset entry. Move records by changing only offset entries.
Tombstones: leave a marker in offset entry (or map table) after deletion to avoid dangling-pointer surprises.:contentReference[oaicite:21]{index=21}

Phase1 runs=$\frac{diskblock}{Memorysize}$

Mergesort $\frac{diskblock}{M} <= M-1, \sqrt{diskblock}+1$

Mergesort (M-1)*M=Block size

Sortest distance first, elevator: move up then down

Primary Index Built on a sorted (ordered) data file. The index key is the primary key (ordering key) of the data file. There is one index entry per data block (not per record). Points to the first record in each block.

Secondary Index Built on a non-ordering field (not the primary key). The data file itself may be unordered or ordered on another field. There can be multiple secondary indexes on the same file. Typically has one index entry per record, because data are not ordered on this field.

Secondary Level Index (or Multi-Level Index) This is not a different kind of index (primary or secondary) — it’s about the structure of the index. When an index (primary or secondary) is too large to fit in memory, we build another index on top of it — a second-level index. This creates a hierarchy (multi-level index) that speeds up lookups.

Search Key A field or combination of fields used to look up records in an index. It is a physical access concept, used in file organization and indexing. Does not have to be unique — multiple records can have the same search key value.

Variable-Length & Variable-Format Records

Support varying sizes (e.g., address), repeating groups (e.g., movie refs), and SSD-style flexible schemas; store enough info to locate any field.
Layouts: variable-length strings with field pointers/lengths; repeating lists; “schema-with-record” tagging for SSD.:contentReference[oaicite:25]{index=25}
SQL example table: MovieStar(name CHAR(30) PK, address VARCHAR(255), gender CHAR(1), birthdate DATE).

Insertions & Deletions (Ordered Files)

If no free space: redistribute records across adjacent blocks or use overflow blocks.
Deletions: if not pinned → reclaim; if pinned → use tombstones (record or map-table).

Pointer Swizzling & Addresses

DB address vs. memory address: DB uses physical/logical addresses (e.g., 8B); app uses 4B memory addresses via a translation table.
Swizzling: convert DB address → memory pointer on load; unswizzle on write-back; options: none (always translate), automatic, on-demand.:contentReference[oaicite:34]{index=34}
Pinned records: some swizzled pointer points to it → must unswizzle before returning to disk; track refs (e.g., linked list).
Map table & flexibility: indirection enables moving records without changing client code — update the map.
Scale thought experiment: vast object counts motivate logical IDs (e.g., up to $2^{77}$ objects over many machines/time).

Heap file page count: $P=\left\lceil \dfrac{N \cdot R}{B} \right\rceil$ (N tuples × tuple size / block size)
Scan time (rough): $T_{scan}\approx P \cdot (T_{seek}+T_{rot}+T_{xfer})$
Avg. join (NLJ, no index): $N_R \cdot N_S$ comparisons (tuple-wise), or page-wise $P_R\cdot P_S$.
Block reorg overhead: offset-table update $\ll$ tuple rewrite (keeps inserts cheap).
Swizzle invariant: all in-mem objects must have valid reverse entries to unswizzle safely before eviction.

Primary Index Ordering (primary) key Yes Block Sparse 1 entry/block
Sparse Primary Index Ordering key Yes Block Sparse Common primary index
Dense Primary Index Primary key No Record Dense Used if file unordered
Secondary Index Non-ordering field No Record Dense One per record
Sparse Secondary Index Non-ordering field No – ❌ Not practical
Secondary Index on Key Attribute Non-ordering unique field No Record Dense Each record unique
Clustering Index Non-key attribute (with duplicates) Yes Distinct value Spar