Data Mining (CST466) - Complete Master Bank

All Questions Discussed: 1 Point = 1 Mark

Module 2: Data Pre-processing

  1. Step 1 - Sorting: First, arrange the data in ascending order: [12, 15, 20, 20, 23, 24, 29, 31, 32, 35, 36, 40].
  2. Step 2 - Partitioning: Divide the 12 sorted values into 3 equal bins (4 values per bin). Bin 1: [12, 15, 20, 20], Bin 2: [23, 24, 29, 31], Bin 3: [32, 35, 36, 40].
  3. Step 3 - Smoothing: Calculate the mean for each bin and replace all values with it. Bin 1 (mean 16.75) becomes [16.75, 16.75, 16.75, 16.75]. Bin 2 becomes [26.75, 26.75, 26.75, 26.75]. Bin 3 becomes [35.75, 35.75, 35.75, 35.75].
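
The three steps above can be checked with a minimal Python sketch. The function name smooth_by_bin_means is illustrative, not from the syllabus, and the sketch assumes the number of values divides evenly by the number of bins.

```python
from statistics import mean

def smooth_by_bin_means(values, n_bins):
    """Equal-frequency binning followed by smoothing by bin means."""
    data = sorted(values)                      # Step 1: sort
    size = len(data) // n_bins                 # assumes an even split
    bins = [data[i:i + size] for i in range(0, len(data), size)]  # Step 2
    return [[mean(b)] * len(b) for b in bins]  # Step 3: replace by bin mean

values = [12, 15, 20, 20, 23, 24, 29, 31, 32, 35, 36, 40]
smoothed = smooth_by_bin_means(values, 3)
for i, b in enumerate(smoothed, start=1):
    print(f"Bin {i}: {b}")
```

Running it prints the same three smoothed bins as the worked answer (16.75, 26.75, 35.75).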

  1. Handling Imperfect Data: Real-world data is often incomplete (missing values), noisy (errors/outliers), and inconsistent, which pre-processing must fix.
  2. Improving Data Quality: Pre-processing ensures the data is accurate and reliable, which is fundamentally essential for successful and trustworthy mining outcomes.
  3. Enhancing Efficiency: By reducing the volume and complexity of data, pre-processing makes the core mining process significantly faster and less resource-heavy.

  1. Handling Missing Values - Ignore Tuple: Used when the class label is missing; simple but ineffective if many tuples have missing values.
  2. Handling Missing Values - Fill Manually: A developer manually enters the missing data; highly accurate but extremely time-consuming and impractical for large datasets.
  3. Handling Missing Values - Global Constant: Replace all missing values with a standard label like "Unknown" to keep the data complete; simple, but the algorithm may mistake "Unknown" for a meaningful category.
  4. Handling Missing Values - Attribute Mean: Fill the gap using the calculated average value of that specific attribute across all known records.
  5. Smoothing Noisy Data - Binning: Data is sorted into bins and smoothed by bin means, medians, or boundaries to remove random errors.
  6. Smoothing Noisy Data - Clustering: Similar data values are grouped together; points falling far outside these groups are identified as outliers and removed.
  7. Data Transformation - Normalization: Converts data values into a specific small range to ensure one attribute does not dominate distance-based calculations.
  8. Consistency Resolution - Data Integration: Resolves conflicts between different data sources where the same entity might have different names, codes, or units.
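
As a rough illustration of the global-constant and attribute-mean strategies from the list above, here is a small Python sketch. The column values and function names are hypothetical, and missing entries are assumed to be stored as None.

```python
from statistics import mean

def fill_with_constant(column, label="Unknown"):
    """Replace every missing entry (None) with one global constant."""
    return [label if v is None else v for v in column]

def fill_with_mean(column):
    """Replace every missing entry (None) with the mean of the known values."""
    known = [v for v in column if v is not None]
    avg = mean(known)
    return [avg if v is None else v for v in column]

# Hypothetical attribute columns with gaps
incomes = [30, None, 50, 40, None]
cities = ["Kochi", None, "Chennai"]

filled_incomes = fill_with_mean(incomes)
filled_cities = fill_with_constant(cities)
print(filled_incomes)
print(filled_cities)
```

The mean is computed only over known records, which matches the attribute-mean description in the list.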

  1. Significance - Data Reduction: Discretization reduces the number of values for a continuous attribute by dividing the range into defined intervals.
  2. Significance - Pattern Discovery: Replaces continuous numerical values with category labels, making it easier for mining algorithms to find meaningful patterns.
  3. Strategy 1 - Binning: A top-down splitting technique that partitions attribute values into a specified number of equal-width or equal-frequency bins, then replaces each value with its bin mean or median.
  4. Strategy 2 - Histogram Analysis: Values are partitioned into disjoint ranges called "buckets" to represent the overall distribution of the data.
  5. Strategy 3 - Cluster Analysis: A clustering algorithm is applied to partition the continuous attribute into distinct groups or clusters of similar values.
  6. Strategy 4 - Concept Hierarchy: Hierarchies are created by moving from lower-level detailed concepts to higher-level broader ones (e.g., a specific numeric age to "Adult", or for categorical data, city to country).
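
A short Python sketch of two of these strategies: equal-width binning and a one-level concept hierarchy for age. The ages and the cut-offs (13, 20, 60) are assumptions chosen only for illustration.

```python
def equal_width_bins(values, n_bins):
    """Map each value to the index of its equal-width interval."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it to the last bin
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def age_to_concept(age):
    """Climb one level of a concept hierarchy (hypothetical cut-offs)."""
    if age < 13:
        return "Child"
    if age < 20:
        return "Teen"
    if age < 60:
        return "Adult"
    return "Senior"

ages = [5, 15, 25, 35, 45, 65]  # hypothetical values
bin_ids = equal_width_bins(ages, 3)
concepts = [age_to_concept(a) for a in ages]
print(bin_ids)
print(concepts)
```

Both outputs replace six distinct numbers with a handful of labels, which is the data-reduction effect described in point 1.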

  1. Regression Models: Approximate the data by finding a mathematical function that fits the dataset, allowing you to store a formula instead of millions of records.
  2. Log-Linear Models: These are used to estimate the probability of discrete attribute value combinations in multi-dimensional space.
  3. Histograms: Data is partitioned into disjoint buckets where only the total count for each bucket is stored to save space.
  4. Clustering: Data objects are grouped into clusters; instead of storing every point, only the cluster representations (like the center) are stored.
  5. Sampling: A large dataset is reduced by selecting a smaller, representative subset (sample) for analysis, heavily reducing total volume.
  6. Data Cube Aggregation: Multiple levels of data are summarized (e.g., daily sales aggregated into yearly sales) to drastically reduce record counts.
  7. Lossless vs Lossy: Some methods keep all original information, while others (like sampling) are "lossy," trading some detail for large savings in storage space.
  8. Application Context: These techniques are absolutely vital when the original dataset is simply too large to fit into a computer's memory for processing.
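
Two of the techniques above, histograms and data cube aggregation, can be sketched in a few lines of Python. The prices, sales records, and bucket width are hypothetical; dates are assumed to be ISO strings beginning with the year.

```python
from collections import Counter, defaultdict

def histogram(values, width):
    """Keep only a count per equal-width bucket instead of every value."""
    counts = Counter((v // width) * width for v in values)
    return dict(sorted(counts.items()))  # key = lower edge of the bucket

def aggregate_by_year(daily_sales):
    """Summarize (date, amount) records into one total per year."""
    totals = defaultdict(int)
    for date, amount in daily_sales:
        totals[date[:4]] += amount
    return dict(totals)

prices = [1, 1, 5, 5, 5, 8, 10, 12, 14, 15, 18, 20, 21, 25]
sales = [("2023-01-05", 100), ("2023-06-10", 150),
         ("2024-02-01", 200), ("2024-02-02", 50)]

buckets = histogram(prices, 10)
yearly = aggregate_by_year(sales)
print(buckets)
print(yearly)
```

Fourteen prices shrink to three bucket counts and four daily records shrink to two yearly totals; both reductions are lossy, since the individual values cannot be recovered.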

  1. Simple Random Sample Without Replacement (SRSWOR): Once an item is chosen from the dataset, it is not put back, meaning it cannot be selected again in the same sample.
  2. Simple Random Sample With Replacement (SRSWR): Each time an item is selected, it is recorded and then placed back, so the same item could potentially be picked more than once.
  3. Purpose: Both methods aim to create a much smaller, yet representative version of a large dataset to speed up analysis while maintaining accuracy.

  1. SRSWOR Definition: Once a value is chosen from the dataset, it is removed from the list and cannot be picked again in the same sample.
  2. SRSWOR Application: For our 15 customers, we would pick a subset where each customer is unique and no value is repeated in the final sample set.
  3. SRSWR Definition: A selected value is recorded and then put back into the original set, allowing it to be chosen multiple times.
  4. SRSWR Application: In this scenario, a customer's specific spending amount (e.g., "40") could potentially appear twice or more in our final sample set.
  5. Cluster Sampling: This involves dividing the total population into groups or "clusters" and then randomly selecting entire clusters for analysis.
  6. Cluster Application: We could group the 15 customers by their entry order into clusters of five, and then pick one whole cluster to represent the entire population.
  7. Stratified Sampling: The data is first divided into meaningful groups called "strata" based on a specific attribute before sampling takes place.
  8. Stratified Application: We first categorize the customers into "Low", "Medium", and "High" spenders, then take a random sample from *each* group to ensure fair representation.
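
The four sampling schemes can be compared with Python's standard random module. The 15 spending amounts and the Low/Medium/High cut-offs below are invented for illustration, since the original question data is not reproduced here; the seed only makes the run repeatable.

```python
import random

def srswor(data, n, rng):
    """Simple random sample without replacement: no item repeats."""
    return rng.sample(data, n)

def srswr(data, n, rng):
    """Simple random sample with replacement: items may repeat."""
    return rng.choices(data, k=n)

def cluster_sample(data, cluster_size, rng):
    """Split into consecutive clusters, then pick one whole cluster."""
    clusters = [data[i:i + cluster_size]
                for i in range(0, len(data), cluster_size)]
    return rng.choice(clusters)

def stratified_sample(data, stratum_of, per_stratum, rng):
    """Group items into strata, then sample from every stratum."""
    strata = {}
    for item in data:
        strata.setdefault(stratum_of(item), []).append(item)
    return {name: rng.sample(members, per_stratum)
            for name, members in strata.items()}

def spend_level(customer):
    """Hypothetical cut-offs for Low / Medium / High spenders."""
    _, spend = customer
    if spend < 40:
        return "Low"
    if spend < 90:
        return "Medium"
    return "High"

# Hypothetical (customer_id, spend) pairs in entry order
spends = [10, 15, 20, 25, 30, 40, 45, 50, 60, 70, 80, 90, 100, 120, 150]
customers = list(enumerate(spends, start=1))
rng = random.Random(42)

wor = srswor(customers, 5, rng)
wr = srswr(customers, 5, rng)
one_cluster = cluster_sample(customers, 5, rng)
by_level = stratified_sample(customers, spend_level, 2, rng)
print(wor, wr, one_cluster, by_level, sep="\n")
```

Only the stratified sample guarantees that Low, Medium, and High spenders all appear; the cluster sample returns five customers who entered consecutively.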

  1. Equal-Frequency Binning: This strategy divides the sorted data into bins that each contain the exact same number of data records.
  2. Bin Partitioning: Since we have 12 values and need 3 bins, each bin contains 4 values. Bin 1: [5, 7, 8, 12], Bin 2: [15, 18, 20, 22], Bin 3: [25, 30, 35, 40].
  3. Smoothing by Bin Means (Concept): Replaces every individual value in a bin with the calculated mathematical average of that specific bin.
  4. Bin Mean Calculation: For Bin 1, the mean is (5+7+8+12)/4 = 8. We replace all values, resulting in the smoothed bin [8, 8, 8, 8].
  5. Smoothing by Bin Boundaries (Concept): Values are replaced by the closest "boundary" value, which is strictly either the minimum or the maximum value of that bin.
  6. Bin Boundary Calculation: In Bin 1 (boundaries 5 and 12), 7 is closer to the min (5) and becomes 5. 8 is also closer to the min (|8 - 5| = 3 versus |12 - 8| = 4) and becomes 5. Bin 1 becomes [5, 5, 5, 12].
  7. Purpose of Smoothing: These techniques are vital to remove random "noise" and anomalies from the data to clarify the underlying true trends.
  8. Data Preparation: Smoothing helps the analyst simplify the dataset so that subsequent mining algorithms can work more effectively on core patterns rather than getting distracted by noise.
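
Smoothing by bin boundaries is easy to get wrong by hand, so a small Python check helps. This is a sketch under two assumptions: the data splits evenly into bins, and a value exactly halfway between the boundaries goes to the lower one (the notes do not specify a tie rule).

```python
def smooth_by_bin_boundaries(values, n_bins):
    """Equal-frequency binning, then snap each value to the nearer boundary."""
    data = sorted(values)
    size = len(data) // n_bins  # assumes an even split
    bins = [data[i:i + size] for i in range(0, len(data), size)]
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]  # bin boundaries = min and max of the bin
        # Ties go to the lower boundary (assumption)
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

values = [5, 7, 8, 12, 15, 18, 20, 22, 25, 30, 35, 40]
by_boundaries = smooth_by_bin_boundaries(values, 3)
for i, b in enumerate(by_boundaries, start=1):
    print(f"Bin {i}: {b}")
```

In Bin 1 the value 8 is 3 away from 5 and 4 away from 12, so it snaps to 5 and the bin becomes [5, 5, 5, 12].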

  1. Process Identification: The process is Data Pre-processing, essential for turning raw, unstructured "dirty" data into a high-quality format suitable for algorithms.
  2. Significance: It is fundamentally necessary because real-world data is plagued with missing values, noise, errors, and formatting inconsistencies.
  3. Goal: The main objective is to improve the overall accuracy, reliability, and computational efficiency of the mining process.
  4. Task 1 - Data Cleaning: Involves filling in missing values, smoothing out noise using binning, and identifying/removing extreme outliers to correct data errors.
  5. Task 2 - Data Integration: The process of systematically combining data from multiple diverse sources (like different department databases) into a single, cohesive, consistent data store.
  6. Task 3 - Data Transformation & Reduction: Includes scaling data to a common range (Normalization) and reducing the number of variables/volume (Dimensionality/Numerosity Reduction) to simplify the dataset.

  1. Min-Max Normalization (Concept): This method linearly transforms each value into a fixed new range, typically [0, 1] or [-1, 1], preserving original data relationships.
  2. Min-Max Formula: v' = ((v - min) / (max - min)) * (new_max - new_min) + new_min.
  3. Min-Max Calculation: For 300, with target range [0, 1]: v' = ((300 - 100) / (900 - 100)) * (1 - 0) + 0 = 200 / 800 = 0.25.
  4. Z-Score Normalization (Concept): Also called zero-mean normalization, it standardizes data based on the mean and standard deviation of the attribute.
  5. Z-Score Advantage: It is useful when the actual minimum and maximum values are unknown, or when outliers would otherwise dominate min-max normalization.
  6. Z-Score Formula: v' = (v - mean) / standard_deviation.
  7. Z-Score Components Calculation: Mean (μ) = (100+200+300+500+900) / 5 = 400. Std Dev (σ) = √(((-300)² + (-200)² + (-100)² + 100² + 500²) / 5) = 282.84.
  8. Z-Score Final Calculation: For 300: v' = (300 - 400) / 282.84 = -100 / 282.84 ≈ -0.354.
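
Both calculations can be reproduced with Python's statistics module. The sketch uses pstdev (population standard deviation, dividing by n) because that matches the division by 5 in the worked answer; the function names are illustrative.

```python
from statistics import mean, pstdev

def min_max(v, lo, hi, new_lo=0.0, new_hi=1.0):
    """Linearly rescale v from [lo, hi] to [new_lo, new_hi]."""
    return (v - lo) / (hi - lo) * (new_hi - new_lo) + new_lo

def z_score(v, mu, sigma):
    """Number of standard deviations that v lies from the mean."""
    return (v - mu) / sigma

values = [100, 200, 300, 500, 900]
mu = mean(values)
sigma = pstdev(values)  # population standard deviation (divides by n)

mm = min_max(300, min(values), max(values))
z = z_score(300, mu, sigma)
print(f"min-max: {mm:.2f}, mean: {mu}, std: {sigma:.2f}, z: {z:.3f}")
```

Switching to the sample standard deviation (stdev, dividing by n - 1) would give a slightly different z-score, so it is worth stating which convention an answer uses.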

  1. Purpose: Reduces the number of random variables or attributes under consideration to simplify the data, making machine learning algorithms train faster.
  2. Principal Component Analysis (PCA): A core technique that finds a new set of orthogonal variables (Principal Components) that are linear combinations of original variables.
  3. Attribute Subset Selection: Another technique that removes redundant, irrelevant, or highly correlated attributes that do not contribute to the overall mining goal.
  4. Step 1 - Data Normalization: The first step in PCA where attributes are scaled to a common range so that no single variable dominates the reduction process.
  5. Step 2 - Covariance Matrix: A matrix is created to mathematically evaluate how different variables in the dataset relate to one another.
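  6. Step 3 - Component Selection: The algorithm computes the eigenvalues and eigenvectors of the covariance matrix, keeps the eigenvectors with the largest eigenvalues (the directions of highest variance) as principal components, and discards the rest.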
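
The three PCA steps can be sketched with NumPy (assumed to be installed). The 6 x 3 data matrix is made up for illustration, and the function returns both the projected data and the share of variance carried by each component.

```python
import numpy as np

def pca(X, k):
    """Project X onto its top-k principal components."""
    # Step 1: normalize each attribute to zero mean and unit variance
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # Step 2: covariance matrix of the normalized attributes
    cov = np.cov(Z, rowvar=False)
    # Step 3: eigen-decomposition (eigh suits symmetric matrices),
    # then sort components by eigenvalue, largest first
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = eigvals / eigvals.sum()
    return Z @ eigvecs[:, :k], explained

# Hypothetical dataset: 6 records, 3 attributes
X = np.array([[2.5, 2.4, 0.5],
              [0.5, 0.7, 1.9],
              [2.2, 2.9, 0.8],
              [1.9, 2.2, 1.1],
              [3.1, 3.0, 0.3],
              [2.3, 2.7, 0.9]])

projected, explained = pca(X, 2)
print(projected.shape)
print(explained)
```

Note that the new columns are linear combinations of all original attributes, so PCA constructs new variables rather than picking a subset of the existing ones.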

Module 1: Data Warehousing (IE1 Questions)

  1. Main Purpose: A Database is designed for day-to-day operational transactions (OLTP), while a Data Warehouse is built for complex analysis and reporting (OLAP).
  2. Type of Data: Databases store current, highly detailed, rapidly changing data, whereas Data Warehouses store both current and historical data in a static, summarized format.
  3. Query Complexity: Queries in a database are usually simple and execute instantly, but Data Warehouse queries are highly complex, aggregating large volumes of data for trend analysis.

  1. Task 1 (Department to Student): Moving from a summarized view (department averages) down to a highly detailed view (individual student performance) is a Drill-down operation.
  2. Task 2 (Filtering by CSE and 2024): Selecting a sub-cube by defining constraints on two or more dimensions (like year=2024 and dept=CSE) is a Dice operation; a Slice fixes a value on only one dimension.
  3. Task 3 (Time-based to Location-based): Rearranging or rotating the data cube axes to view the exact same data from a completely different perspective is a Pivot (Rotate) operation.
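
The three operations can be imitated on a tiny in-memory "cube" in plain Python. The records, dimension names, and helper functions below are hypothetical; a real OLAP server would do this over pre-aggregated cuboids rather than raw rows.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical fact records: one row per student result
records = [
    {"dept": "CSE", "year": 2024, "student": "A", "marks": 80},
    {"dept": "CSE", "year": 2024, "student": "B", "marks": 70},
    {"dept": "CSE", "year": 2023, "student": "C", "marks": 60},
    {"dept": "ECE", "year": 2024, "student": "D", "marks": 90},
]

def average_by(rows, dim):
    """Average marks at the granularity of one dimension."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[dim]].append(r["marks"])
    return {key: mean(vals) for key, vals in groups.items()}

def dice(rows, **constraints):
    """Keep rows matching a value on each constrained dimension."""
    return [r for r in rows
            if all(r[d] == v for d, v in constraints.items())]

def cross_tab(rows, row_dim, col_dim):
    """Total marks laid out with row_dim as rows and col_dim as columns."""
    table = defaultdict(dict)
    for r in rows:
        cell = table[r[row_dim]]
        cell[r[col_dim]] = cell.get(r[col_dim], 0) + r["marks"]
    return {key: dict(val) for key, val in table.items()}

summary = average_by(records, "dept")      # summarized view
detail = average_by(records, "student")    # drill-down to students
sub_cube = dice(records, dept="CSE", year=2024)
view = cross_tab(records, "dept", "year")
pivoted = cross_tab(records, "year", "dept")  # pivot: swap the axes
print(summary, detail, sub_cube, view, pivoted, sep="\n")
```

The pivoted table holds exactly the same totals as the original view; only the axes are exchanged.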

  1. Star Schema Structure: Consists of a large central Fact Table connected directly to multiple denormalized Dimension Tables, forming a distinct star shape.
  2. Star Schema Simplicity: It is the simplest architecture where each dimension is represented by a single table, making database queries very fast and easy to write.
  3. Snowflake Schema Structure: An extension of the star schema where some dimension tables are further normalized and broken down into additional related tables.
  4. Snowflake Appearance: Because dimensions are branched out (e.g., Product -> Category -> Subcategory), the entity-relationship diagram resembles a snowflake.
  5. Normalization vs Space: The Snowflake schema strictly uses normalization to reduce data redundancy, which saves significant storage space compared to the Star schema.
  6. Query Performance: While saving space, Snowflake schemas suffer from slower query performance because fetching data requires more complex "joins" across multiple tables.
  7. Maintenance: Star schemas are much easier to maintain for simpler business models, while Snowflake is better suited for managing highly complex hierarchical dimensions.
  8. Standard Usage: Star schemas are typically deployed in smaller Data Marts, whereas Snowflake is the preferred choice for massive, large-scale enterprise Data Warehouses.
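
The join trade-off between the two schemas can be seen in a small sketch using Python's built-in sqlite3 module. All table names, columns, and rows are hypothetical; the same fact table is queried once through a star-style dimension and once through a snowflaked one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Star: one denormalized product dimension (category repeated per row)
CREATE TABLE dim_product_star (
    product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);

-- Snowflake: category normalized out into its own table
CREATE TABLE dim_category (
    category_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE dim_product_snow (
    product_id INTEGER PRIMARY KEY, name TEXT,
    category_id INTEGER REFERENCES dim_category(category_id));

CREATE TABLE fact_sales (product_id INTEGER, amount INTEGER);

INSERT INTO dim_product_star VALUES
    (1, 'Pen', 'Stationery'), (2, 'Notebook', 'Stationery'),
    (3, 'Mouse', 'Electronics');
INSERT INTO dim_category VALUES (10, 'Stationery'), (20, 'Electronics');
INSERT INTO dim_product_snow VALUES
    (1, 'Pen', 10), (2, 'Notebook', 10), (3, 'Mouse', 20);
INSERT INTO fact_sales VALUES (1, 5), (2, 20), (3, 100), (1, 10);
""")

# Star schema: a single join reaches the category
star_query = """
SELECT p.category, SUM(f.amount)
FROM fact_sales f
JOIN dim_product_star p ON f.product_id = p.product_id
GROUP BY p.category ORDER BY p.category
"""

# Snowflake schema: one extra join to reach the same attribute
snow_query = """
SELECT c.category, SUM(f.amount)
FROM fact_sales f
JOIN dim_product_snow p ON f.product_id = p.product_id
JOIN dim_category c ON p.category_id = c.category_id
GROUP BY c.category ORDER BY c.category
"""

star = conn.execute(star_query).fetchall()
snow = conn.execute(snow_query).fetchall()
print(star)
print(snow)
```

Both queries return the same totals; the snowflake version stores each category name once but needs the additional join described in point 6.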