Knowledge Discovery in Databases

Unit V: Dimensional Database Design

Database Systems

MTech First Semester

Dr. Mohsin Dar

Assistant Professor

Cloud & Software Operations Cluster

UPES

What is Knowledge Discovery in Databases (KDD)?

Knowledge Discovery in Databases (KDD) is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.

Key Characteristics

  • Valid: the discovered patterns hold on new data with some degree of certainty
  • Novel: the patterns were not previously known to the user
  • Potentially useful: the patterns can lead to actionable decisions
  • Ultimately understandable: the patterns can be interpreted by humans
  • Non-trivial: discovery requires search and inference, not a simple computation

KDD vs Data Mining: Understanding the Difference

KDD (Process)

  • Complete end-to-end process
  • Includes data preparation, selection, cleaning
  • Encompasses interpretation and evaluation
  • Iterative and interactive
  • Broader scope

Data Mining (Step)

  • One step within KDD
  • Focuses on pattern discovery
  • Applies specific algorithms
  • Core analytical phase
  • Narrower focus
Key Insight: Data Mining is the heart of KDD, but KDD includes all the preparatory and post-processing steps necessary for successful knowledge discovery.

The KDD Process: A Step-by-Step Journey

1. Data Selection: identify and retrieve relevant data from various sources
2. Data Preprocessing: clean the data, handle missing values, and remove noise
3. Data Transformation: normalize, aggregate, and reduce dimensions
4. Data Mining: apply algorithms to discover patterns
5. Interpretation: evaluate and interpret the discovered patterns
6. Knowledge: deploy actionable insights and decisions
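
The sketch below walks through these six steps end to end on a single toy dataset using pandas and scikit-learn. The file name transactions.csv, the column names, and the choice of clustering as the mining step are illustrative assumptions, not part of the lecture.

```python
# Minimal sketch of the six KDD steps on a hypothetical "transactions.csv" file.
# The file name, column names, and the segmentation task are assumptions.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans

# 1. Data Selection: retrieve only the relevant attributes
data = pd.read_csv("transactions.csv", usecols=["customer_id", "age", "amount"])

# 2. Data Preprocessing: fill missing values and drop obvious noise
data["age"] = data["age"].fillna(data["age"].median())
data["amount"] = data["amount"].fillna(data["amount"].median())
data = data[data["amount"] >= 0]          # negative amounts treated as noise

# 3. Data Transformation: scale numeric attributes to the [0, 1] range
features = MinMaxScaler().fit_transform(data[["age", "amount"]])

# 4. Data Mining: apply an algorithm (clustering here) to discover patterns
data["segment"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features)

# 5. Interpretation: inspect and evaluate the discovered groups
print(data.groupby("segment")[["age", "amount"]].mean())

# 6. Knowledge: the validated segments feed targeted business decisions
```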

Data Preprocessing: Foundation of Quality KDD

Why Preprocessing Matters

Real-world data is often incomplete, noisy, and inconsistent. Preprocessing can consume 60-80% of the KDD effort but dramatically improves results.

Key Preprocessing Tasks

Data Cleaning

  • Fill missing values
  • Smooth noisy data
  • Identify/remove outliers
  • Resolve inconsistencies

Data Integration

  • Combine multiple sources
  • Resolve schema conflicts
  • Handle redundancy
  • Entity resolution

Data Reduction

  • Dimensionality reduction
  • Numerosity reduction
  • Data compression
  • Feature selection

Data Transformation

  • Normalization
  • Aggregation
  • Generalization
  • Attribute construction
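
As a concrete illustration of a few of these tasks, here is a minimal pandas/scikit-learn sketch covering cleaning (missing values, inconsistent categories, outliers) and transformation (normalization). The toy DataFrame, column names, and thresholds are illustrative assumptions.

```python
# Minimal preprocessing sketch; the data, column names, and thresholds are assumptions.
import pandas as pd
from sklearn.preprocessing import StandardScaler

raw = pd.DataFrame({
    "age":    [25, 32, None, 47, 51, 29],
    "income": [40_000, 52_000, 61_000, None, 1_000_000, 45_000],
    "city":   ["Delhi", "delhi", "Mumbai", "Mumbai", "Delhi", "Pune"],
})

# Data Cleaning: fill missing values and resolve inconsistent categories
clean = raw.copy()
clean["age"] = clean["age"].fillna(clean["age"].median())
clean["income"] = clean["income"].fillna(clean["income"].median())
clean["city"] = clean["city"].str.title()          # "delhi" -> "Delhi"

# Data Cleaning: flag outliers with the 1.5 * IQR rule
q1, q3 = clean["income"].quantile([0.25, 0.75])
iqr = q3 - q1
outlier = (clean["income"] < q1 - 1.5 * iqr) | (clean["income"] > q3 + 1.5 * iqr)

# Data Transformation: z-score normalization of numeric attributes
clean[["age", "income"]] = StandardScaler().fit_transform(clean[["age", "income"]])

print(clean[~outlier])                              # keep only non-outlier rows
```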

Core Data Mining Techniques

Classification
Predict categorical class labels. Examples: Decision Trees, Neural Networks, SVM, Naive Bayes
Clustering
Group similar objects together. Examples: K-Means, Hierarchical, DBSCAN
Association Rules
Discover interesting relationships. Examples: Apriori, FP-Growth algorithms
Regression
Predict continuous values; a short sketch follows this list. Examples: Linear and Polynomial Regression (Logistic Regression, despite its name, predicts class membership)
Anomaly Detection
Identify unusual patterns. Used in fraud detection, network security
Sequential Patterns
Find patterns in sequence data. Examples: Time series analysis, web log mining
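
Classification, clustering, and association rules are developed on the slides that follow. As a quick illustration of the regression entry above, the sketch below fits a line with scikit-learn; the toy hours-studied/exam-score data is an illustrative assumption.

```python
# Minimal regression sketch: fit a line to toy (hours studied, exam score) pairs.
# The data and variable names are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression

hours = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])    # predictor (2-D for sklearn)
score = np.array([52.0, 57.0, 63.0, 70.0, 74.0])          # continuous target

model = LinearRegression().fit(hours, score)
print("Predicted score for 6 hours:", model.predict([[6.0]])[0])
```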

Classification: Supervised Learning

Definition: Classification is the process of finding a model that describes and distinguishes data classes for predicting the class of new objects.

Classification Process

  • Training Phase: build the model using training data, then validate its accuracy
  • Testing Phase: apply the validated model to new, unseen data

Popular Algorithms

  • Decision Trees
  • Naive Bayes
  • Support Vector Machines (SVM)
  • Neural Networks
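
A minimal sketch of the training/validation/testing flow described above, using a decision tree from scikit-learn on its built-in Iris dataset; the 70/30 split and the tree depth are illustrative choices, not from the lecture.

```python
# Classification sketch: train, validate, and apply a decision tree on Iris.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Training phase: build the model on training data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

# Validation: check accuracy on data the model has not seen
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Testing phase: predict the class label of a new object
print("Predicted class:", model.predict([[5.1, 3.5, 1.4, 0.2]]))
```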

Clustering: Unsupervised Learning

Definition: Clustering groups data objects based on similarity without predefined class labels. Objects in the same cluster are similar; objects in different clusters are dissimilar.

Clustering Methods

  • Partitioning: K-Means, K-Medoids
  • Hierarchical: Agglomerative, Divisive
  • Density-based: DBSCAN, OPTICS
  • Grid-based: STING, CLIQUE

Applications

  • Customer segmentation
  • Image segmentation
  • Document organization
  • Anomaly detection
  • Gene analysis
K-Means Algorithm: Most popular partitioning method. Assigns each object to the cluster with nearest centroid. Iteratively refines clusters until convergence.
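
To make the K-Means loop concrete, here is a compact NumPy sketch of the assign-to-nearest-centroid and recompute-centroids iteration; the toy 2-D points, k = 2, and the iteration cap are illustrative assumptions.

```python
# K-Means sketch: assign points to nearest centroid, recompute centroids, repeat.
import numpy as np

points = np.array([[1.0, 1.0], [1.5, 2.0], [1.2, 0.8],   # one natural group
                   [8.0, 8.0], [8.5, 9.0], [7.8, 8.2]])  # another natural group
k = 2

rng = np.random.default_rng(seed=0)
centroids = points[rng.choice(len(points), size=k, replace=False)]  # random initial centroids

for _ in range(100):
    # Assignment step: each point goes to its nearest centroid
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)

    # Update step: each centroid moves to the mean of the points assigned to it
    new_centroids = np.array([
        points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
        for j in range(k)
    ])

    # Convergence: stop when the centroids no longer move
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print("Cluster labels:", labels)
print("Centroids:\n", centroids)
```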

Association Rule Mining

Definition: Finding frequent patterns, associations, or causal structures among sets of items in transaction databases.

Key Concepts

Support

Frequency of occurrence of an itemset in the database

Support(A→B) = P(A ∪ B), i.e., the fraction of transactions that contain both A and B

Confidence

Probability that B occurs when A occurs

Confidence(A→B) = P(B|A)

Classic Example: Market Basket Analysis

Rule: {Milk, Bread} → {Butter}
Interpretation: Customers who buy milk and bread also tend to buy butter
Support: 30% of all transactions contain milk, bread, and butter | Confidence: 75% of the transactions containing milk and bread also contain butter
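
A small sanity check of those figures: with ten hypothetical transactions in which four contain {Milk, Bread} and three of those also contain Butter, the definitions above reproduce 30% support and 75% confidence.

```python
# Toy check of the support and confidence figures above; the transactions are hypothetical.
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread", "butter"},
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"milk", "eggs"},
    {"bread", "jam"},
    {"milk"},
    {"bread"},
    {"eggs", "jam"},
    {"butter"},
]

antecedent = {"milk", "bread"}
consequent = {"butter"}

both = sum(1 for t in transactions if (antecedent | consequent) <= t)
ante = sum(1 for t in transactions if antecedent <= t)

support = both / len(transactions)   # P(A ∪ B): fraction containing milk, bread, and butter
confidence = both / ante             # P(B | A): of those with milk and bread, how many have butter

print(f"Support = {support:.0%}, Confidence = {confidence:.0%}")  # Support = 30%, Confidence = 75%
```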

Apriori Algorithm

Most influential algorithm for mining frequent itemsets. Uses a level-wise search strategy with candidate generation.
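
A compact pure-Python sketch of the level-wise idea follows; it joins frequent itemsets into larger candidates and prunes by minimum support, but omits Apriori's subset-based candidate pruning. The toy transactions and the 60% threshold are illustrative assumptions.

```python
# Level-wise frequent-itemset search in the spirit of Apriori: start from frequent
# single items, join them into larger candidates, prune those below min_support.
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"milk", "butter"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
]
min_support = 0.6  # an itemset must appear in at least 60% of transactions

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

# Level 1: frequent single items
items = {item for t in transactions for item in t}
frequent = [frozenset([i]) for i in sorted(items) if support({i}) >= min_support]
all_frequent = list(frequent)

k = 2
while frequent:
    # Candidate generation: join frequent (k-1)-itemsets into k-itemsets
    candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
    # Pruning: keep only candidates that meet the minimum support threshold
    frequent = [c for c in candidates if support(c) >= min_support]
    all_frequent.extend(frequent)
    k += 1

for itemset in all_frequent:
    print(sorted(itemset), f"support = {support(itemset):.0%}")
```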

Real-World Applications of KDD

🏢 Business Intelligence
Customer profiling, market basket analysis, churn prediction, sales forecasting
💳 Finance
Credit scoring, fraud detection, risk assessment, algorithmic trading
🏥 Healthcare
Disease diagnosis, treatment effectiveness, patient monitoring, drug discovery
🛒 E-commerce
Recommendation systems, dynamic pricing, inventory optimization, customer segmentation
📱 Social Media
Sentiment analysis, trend detection, influence analysis, content recommendation
🔒 Security
Intrusion detection, malware classification, network monitoring, threat intelligence

Challenges and Considerations in KDD

Technical Challenges

Methodological Challenges

Ethical and Legal Considerations

Summary and Future Directions

Key Takeaways

  • KDD is a complete process; data mining is one crucial step
  • Data preprocessing is critical and time-consuming (60-80% effort)
  • Multiple techniques: Classification, Clustering, Association Rules, etc.
  • Wide range of real-world applications across industries
  • Significant technical and ethical challenges remain

Emerging Trends

Deep Learning

Advanced neural networks for complex pattern recognition

AutoML

Automated machine learning pipeline optimization

Graph Databases

Efficient handling of connected data relationships

Edge Computing

Data processing closer to the source

Blockchain

Secure and transparent data transactions

AI Ethics

Responsible AI and data governance