From e030913d9f41da7de8cbff9d7e160a04b4b1cbec Mon Sep 17 00:00:00 2001
From: chrislu <chris.lu@gmail.com>
Date: Thu, 4 Sep 2025 08:37:01 -0700
Subject: [PATCH] clean up

---
 SQL_FEATURE_PLAN.md | 90 ++++++++++++++++++++-------------------------
 1 file changed, 39 insertions(+), 51 deletions(-)

diff --git a/SQL_FEATURE_PLAN.md b/SQL_FEATURE_PLAN.md
index bb30526ce..e1d58611b 100644
--- a/SQL_FEATURE_PLAN.md
+++ b/SQL_FEATURE_PLAN.md
@@ -1,26 +1,24 @@
 # SQL Query Engine Feature, Dev, and Test Plan
 
-This document outlines the plan for adding comprehensive SQL support to SeaweedFS, focusing on schematized Message Queue (MQ) topics with full DDL and DML capabilities, plus S3 objects querying.
+This document outlines the plan for adding SQL querying support to SeaweedFS, focusing on reading and analyzing data from Message Queue (MQ) topics and S3 objects.
 
 ## Feature Plan
 
 **1. Goal**
 
-To provide a full-featured SQL interface for SeaweedFS, treating schematized MQ topics as database tables with complete DDL/DML support. This enables:
-- Database-like operations on MQ topics (CREATE TABLE, ALTER TABLE, DROP TABLE)
-- Advanced querying with SELECT, WHERE, JOIN, aggregations
-- Schema management and metadata operations (SHOW DATABASES, SHOW TABLES)
+To provide a SQL querying interface for SeaweedFS, enabling analytics on existing MQ topics and S3 objects. This enables:
+- Advanced querying with SELECT, WHERE, JOIN, aggregations on MQ topics
+- Schema discovery and metadata operations (SHOW DATABASES, SHOW TABLES, DESCRIBE)
 - In-place analytics on Parquet-stored messages without data movement
+- Direct querying of S3 objects in various formats
 
 **2. Key Features**
 
-*   **Schematized Topic Management (Priority 1):**
+*   **Schema Discovery and Metadata (Priority 1):**
     *   `SHOW DATABASES` - List all MQ namespaces
-    *   `SHOW TABLES` - List all topics in a namespace
-    *   `CREATE TABLE topic_name (field1 INT, field2 STRING, ...)` - Create new MQ topic with schema
-    *   `ALTER TABLE topic_name ADD COLUMN field3 BOOL` - Modify topic schema (with versioning)
-    *   `DROP TABLE topic_name` - Delete MQ topic
+    *   `SHOW TABLES` - List all topics in a namespace  
     *   `DESCRIBE table_name` - Show topic schema details
+    *   Automatic schema detection from existing Parquet data
 *   **Advanced Query Engine (Priority 1):**
     *   Full `SELECT` support with `WHERE`, `ORDER BY`, `LIMIT`, `OFFSET`
     *   Aggregation functions: `COUNT()`, `SUM()`, `AVG()`, `MIN()`, `MAX()`, `GROUP BY`
@@ -67,8 +65,8 @@ To provide a full-featured SQL interface for SeaweedFS, treating schematized MQ
 *   **Schema Catalog:**
     *   Leverage existing `weed/mq/schema/` infrastructure
     *   Map MQ namespaces to "databases" and topics to "tables"
-    *   Store schema metadata with version history
-    *   Handle schema evolution and migration
+    *   Discover schema metadata from existing Parquet files
+    *   Handle schema evolution in read operations
 *   **Query Planner:**
     *   Parse SQL statements using custom PostgreSQL parser
     *   Create optimized execution plans leveraging Parquet columnar format
@@ -90,8 +88,8 @@ To provide a full-featured SQL interface for SeaweedFS, treating schematized MQ
     *   Handle partition-based parallelism for scalable queries
 *   **Schema Registry Integration:**
     *   Extend `weed/mq/schema/schema.go` for SQL metadata operations
-    *   Implement DDL operations that modify underlying MQ topic schemas
-    *   Version control for schema changes with migration support
+    *   Read existing topic schemas for query planning
+    *   Handle schema evolution during query execution
 *   **S3 Connector (Secondary):**
     *   Reading data from S3 objects with CSV, JSON, and Parquet parsers
     *   Efficient streaming for large files with columnar optimizations
@@ -113,7 +111,7 @@ To provide a full-featured SQL interface for SeaweedFS, treating schematized MQ
 
 ## Example Usage Scenarios
 
-**Scenario 1: Topic Management**
+**Scenario 1: Schema Discovery and Metadata**
 ```sql
 -- List all namespaces (databases)
 SHOW DATABASES;
@@ -122,18 +120,7 @@ SHOW DATABASES;
 USE my_namespace;
 SHOW TABLES;
 
--- Create a new topic with schema
-CREATE TABLE user_events (
-    user_id INT,
-    event_type STRING,
-    timestamp BIGINT,
-    metadata STRING
-);
-
--- Modify topic schema
-ALTER TABLE user_events ADD COLUMN session_id STRING;
-
--- View topic structure
+-- View topic structure and discovered schema
 DESCRIBE user_events;
 ```
 
@@ -215,10 +202,10 @@ SQL Query Flow:
 *   Schema Fields ↔ Table Columns
 
 **2. Schema Evolution Handling:**
-*   Maintain schema version history in topic metadata
+*   Read schema version history from existing topic metadata
 *   Support backward-compatible queries across schema versions
-*   Automatic type coercion where possible
-*   Clear error messages for incompatible changes
+*   Automatic type coercion where possible during reads
+*   Clear error messages for incompatible data
 
 **3. Query Optimization:**
 *   Leverage Parquet columnar format for projection pushdown
@@ -237,10 +224,10 @@ SQL Query Flow:
 *   **Error Handling:** Typed error system with proper PostgreSQL error code mapping
 *   **Parser Features:** Lightweight, focused on SeaweedFS SQL subset, easily extensible
 
-**5. Transaction Semantics:**
-*   DDL operations (CREATE/ALTER/DROP) are atomic per topic
-*   SELECT queries provide read-consistent snapshots
-*   No cross-topic transactions initially (future enhancement)
+**5. Query Semantics:**
+*   SELECT queries provide read-consistent snapshots of topic data
+*   Queries operate on immutable Parquet files for consistency
+*   No transactional guarantees across multiple topics
 
 **6. Performance Considerations:**
 *   Prioritize read performance over write consistency
@@ -256,41 +243,42 @@ SQL Query Flow:
 3. Implemented metadata catalog mapping MQ topics to SQL tables
 4. Added `SHOW DATABASES`, `SHOW TABLES`, `DESCRIBE` commands with full PostgreSQL compatibility
 
-**Phase 2: DDL Operations (Weeks 4-5)**
-1. `CREATE TABLE` → Create MQ topic with schema
-2. `ALTER TABLE` → Modify topic schema with versioning
-3. `DROP TABLE` → Delete MQ topic
-4. Schema validation and migration handling
-
-**Phase 3: Query Engine (Weeks 6-8)**
+**Phase 2: Query Engine (Weeks 4-6)**
 1. `SELECT` with `WHERE`, `ORDER BY`, `LIMIT`, `OFFSET`
 2. Aggregation functions and `GROUP BY`
 3. Basic joins between topics
 4. Predicate pushdown to Parquet layer
+5. Schema discovery from existing Parquet files
 
-**Phase 4: API & CLI Integration (Weeks 9-10)**
+**Phase 3: API & CLI Integration (Weeks 7-8)**
 1. HTTP `/sql` endpoint implementation
 2. `weed sql` CLI command with interactive mode
 3. Result streaming and pagination
 4. Error handling and query optimization
 
+**Phase 4: Advanced Features (Weeks 9-10)**
+1. Window functions and advanced analytics
+2. S3 object querying capabilities
+3. Performance optimizations
+4. Connection pooling and query caching
+
 ## Test Plan
 
 **1. Unit Tests**
 
-*   **SQL Parser Tests:** Validate parsing of all supported DDL/DML statements
-*   **Schema Mapping Tests:** Test topic-to-table conversion and metadata operations
+*   **SQL Parser Tests:** Validate parsing of all supported SELECT statements and metadata operations
+*   **Schema Mapping Tests:** Test topic-to-table conversion and schema discovery
 *   **Query Planning Tests:** Verify optimization and predicate pushdown logic
 *   **Execution Engine Tests:** Test query execution with various data patterns
-*   **Edge Cases:** Malformed queries, schema evolution, concurrent operations
+*   **Edge Cases:** Malformed queries, schema evolution in existing data, concurrent reads
 
 **2. Integration Tests**
 
-*   **End-to-End Workflow:** Complete SQL operations against live SeaweedFS cluster
-*   **Schema Evolution:** Test backward compatibility during schema changes
+*   **End-to-End Workflow:** Complete SQL querying operations against live SeaweedFS cluster
+*   **Schema Discovery:** Test automatic schema detection from existing Parquet data
 *   **Multi-Topic Joins:** Validate cross-topic query performance and correctness
 *   **Large Dataset Tests:** Performance validation with GB-scale Parquet data
-*   **Concurrent Access:** Multiple SQL sessions operating simultaneously
+*   **Concurrent Access:** Multiple SQL query sessions operating simultaneously
 
 **3. Performance & Security Testing**
 
@@ -302,11 +290,11 @@ SQL Query Flow:
 
 ## Success Metrics
 
-*   **Feature Completeness:** Support for all specified DDL/DML operations
+*   **Feature Completeness:** Support for all specified SELECT operations and metadata commands
 *   **Performance:** 
     *   **Simple SELECT queries**: < 100ms latency for single-table queries with up to 3 WHERE predicates on ≤ 100K records
     *   **Complex queries**: < 1s latency for queries involving aggregations (COUNT, SUM, MAX, MIN) on ≤ 1M records
     *   **Time-range queries**: < 500ms for timestamp-based filtering on ≤ 500K records within 24-hour windows
 *   **Scalability:** Handle topics with millions of messages efficiently
-*   **Reliability:** 99.9% success rate for valid SQL operations
-*   **Usability:** Intuitive SQL interface matching standard database expectations
+*   **Reliability:** 99.9% success rate for valid SQL queries
+*   **Usability:** Intuitive SQL querying interface matching standard database expectations