Dataset Workflow - Database Changes

Database Schema & Changes

---
config:
  theme: base
  layout: dagre
  flowchart:
    curve: linear
    htmlLabels: true
  themeVariables:
    edgeLabelBackground: "transparent"
---
flowchart TB
    %%ds_analyzer.
    subgraph GBConsoleDB ["gb_console DB (MySQL)"]
        direction TB
        WTG["wishlist_to_groups<br/>📝 name, group_id<br/>📊 status: 1 | Active"]
        
        subgraph SummaryTables ["Summary Wishlist Tables"]
            direction LR
            SWP["summary_wishlist_products<br/>📝 input, input_type<br/>📊 crawl_status: 0→1→2"]
            SWPR["summary_wishlist_product_reviews<br/>📝 summary_wishlist_product_id<br/>📊 crawl_status: 0→1→2"]
            SWC["summary_wishlist_categories<br/>📝 category_id, mall_id<br/>📊 crawl_status: 0→1→2"]
            SWSQ["summary_wishlist_search_queries<br/>📝 keyword, mall_id<br/>📊 crawl_status: 0→1→2"]
        end
        
        WDH["wishlist_dataset_histories<br/>📝 dataset_id, config<br/>📊 status: 1→2→3<br/>📊 spvp_status: 1→2→3"]
    end

    subgraph AnalyzerV2DB ["analyzer_v2 DB (MySQL)"]
        direction LR
        CP["products<br/>📝 name, price, description<br/>📊 Raw marketplace data"]
        CR["reviews<br/>📝 content, rating<br/>📊 User reviews"]
        CRS["review_sentences<br/>📝 sentence, embedding<br/>📊 Processed sentences"]
    end

    subgraph DSAnalyzerDB ["ds_analyzer DB (MySQL)"]
        direction TB
        DS["datasets<br/>📝 name, settings<br/>📊 status: 1→2→3<br/>📊 progress: 0→100"]
        
        subgraph AnalyzerBatchTables ["analyzer_batch creates"]
            direction LR
            ADP["products<br/>📝 processed_name<br/>📊 Dataset products"]
            ADPD["product_details<br/>📝 specifications<br/>📊 Product metadata"]
            PS["product_similarities<br/>📝 similarity_score<br/>📊 Similarity matrix"]
            AV["ai_viewpoints<br/>📝 name, description<br/>📊 AI-generated labels"]
            RSAIVP["review_sentence_aivp<br/>📝 sentence-ai_viewpoint mappings<br/>📊 Clustering results"]
            ADR["reviews<br/>📝 sentiment_overall<br/>📊 Dataset reviews"]
            ARS["review_sentences<br/>📝 cluster_id, sentiment<br/>📊 Clustered sentences"]
        end
        
        subgraph SPVPBatchTables ["spvp_batch manages"]
            direction LR
            SV["specific_viewpoints<br/>📝 last_object_id<br/>📊 SPVP progress tracking"]
            VC["viewpoint_categories<br/>📝 category_name<br/>📊 SPVP categories"]
            RSSPVP["review_sentence_spvp<br/>📝 sentence-viewpoint mappings<br/>📊 Qwen results"]
        end
    end

    %% Phase 1: Wishlist Creation
    WTG --- Step1[
        <div style='text-align: center'>
            <span style='display: inline-block; background-color: #6699cc !important; color:white; width: 28px; height: 28px; line-height: 28px; border-radius: 50%; font-weight: bold'>1</span>
            <p style='margin-top: 8px'>Create Wishlist Items</p>
        </div>
    ]
    Step1 --> SummaryTables

    %% Phase 2: Data Crawling
    SummaryTables --- Step2[
        <div style='text-align: center'>
            <span style='display: inline-block; background-color: #99cc66 !important; color:white; width: 28px; height: 28px; line-height: 28px; border-radius: 50%; font-weight: bold'>2</span>
            <p style='margin-top: 8px'>Crawl Marketplace</p>
        </div>
    ]
    Step2 --> CP
    CP --> CR
    CR --> CRS

    %% Phase 3: Dataset Creation
    SummaryTables --- Step3[
        <div style='text-align: center'>
            <span style='display: inline-block; background-color: #cc6699 !important; color:white; width: 28px; height: 28px; line-height: 28px; border-radius: 50%; font-weight: bold'>3</span>
            <p style='margin-top: 8px'>Create Dataset Metadata</p>
        </div>
    ]
    Step3 --> DS
    DS --> WDH

    %% Phase 4: ML Analysis
    CRS --- Step4[
        <div style='text-align: center'>
            <span style='display: inline-block; background-color: #ff9900 !important; color:white; width: 28px; height: 28px; line-height: 28px; border-radius: 50%; font-weight: bold'>4</span>
            <p style='margin-top: 8px'>Analyze & Cluster</p>
        </div>
    ]
    Step4 --> AnalyzerBatchTables

    %% Phase 5: SPVP Processing
    ARS --- Step5[
        <div style='text-align: center'>
            <span style='display: inline-block; background-color: #cc3366 !important; color:white; width: 28px; height: 28px; line-height: 28px; border-radius: 50%; font-weight: bold'>5</span>
            <p style='margin-top: 8px'>SPVP Processing</p>
        </div>
    ]
    Step5 --> SPVPBatchTables
    SV -.-> WDH

    %% Styling per api-guide.md
    style Step1 fill:transparent,stroke:transparent,stroke-width:1px
    style Step2 fill:transparent,stroke:transparent,stroke-width:1px
    style Step3 fill:transparent,stroke:transparent,stroke-width:1px
    style Step4 fill:transparent,stroke:transparent,stroke-width:1px
    style Step5 fill:transparent,stroke:transparent,stroke-width:1px
    
    style GBConsoleDB fill:#fcf3d2
    style SummaryTables fill:#faf4de
    style AnalyzerV2DB fill:#d2e3fc
    style DSAnalyzerDB fill:#d9f2d9
    style AnalyzerBatchTables fill:#e6ffe6
    style SPVPBatchTables fill:#ffe6f2
    
    style WTG fill:#ffe6cc,stroke:#ff9900,stroke-width:2px
    style WDH fill:#d9f2d9,stroke:#339933,stroke-width:2px
    style SWP fill:#fffae8,stroke:#ebdba2,stroke-width:2px
    style SWPR fill:#fffae8,stroke:#ebdba2,stroke-width:2px
    style SWC fill:#fffae8,stroke:#ebdba2,stroke-width:2px
    style SWSQ fill:#fffae8,stroke:#ebdba2,stroke-width:2px
    style CP fill:#e6f0ff,stroke:#6699cc,stroke-width:1px
    style CR fill:#e6f0ff,stroke:#6699cc,stroke-width:1px
    style CRS fill:#e6f0ff,stroke:#6699cc,stroke-width:1px
    style DS fill:#ffe6cc,stroke:#ff9900,stroke-width:2px
    
    %% analyzer_batch tables
    style ADP fill:#d9f2d9,stroke:#339933,stroke-width:2px
    style ADPD fill:#d9f2d9,stroke:#339933,stroke-width:2px
    style PS fill:#d9f2d9,stroke:#339933,stroke-width:2px
    style AV fill:#d9f2d9,stroke:#339933,stroke-width:2px
    style RSAIVP fill:#d9f2d9,stroke:#339933,stroke-width:2px
    style ADR fill:#d9f2d9,stroke:#339933,stroke-width:2px
    style ARS fill:#d9f2d9,stroke:#339933,stroke-width:2px
    
    %% spvp_batch tables
    style SV fill:#ffccdd,stroke:#cc3366,stroke-width:2px
    style VC fill:#ffccdd,stroke:#cc3366,stroke-width:2px
    style RSSPVP fill:#ffccdd,stroke:#cc3366,stroke-width:2px

Key Database Tables

gb_console DB - Business Logic

Table Key Fields Status Values
wishlist_to_groups name, group_id, subscription_id status: 1 | Active
wishlist_dataset_histories dataset_id, config status: 1→2→3, spvp_status: 1→2→3
summary_wishlist_products input, input_type, mall_id crawl_status: 0→1→2
summary_wishlist_product_reviews summary_wishlist_product_id, schedule_id crawl_status: 0→1→2
summary_wishlist_categories category_id, mall_id crawl_status: 0→1→2
summary_wishlist_search_queries keyword, mall_id crawl_status: 0→1→2

analyzer_v2 DB - Raw Crawled Data

Table Key Fields Service Responsible
products name, description, price, mall_id Crawler Service
reviews product_id, content, rating Crawler Service
review_sentences review_id, sentence, embedding Crawler Service + PLG API

ds_analyzer DB - Analysis Results

Table Key Fields Service Responsible
datasets name, settings, status, progress, error_code trend-viewer-backend

analyzer_batch creates 7 tables:

Table Key Fields Description
products dataset_id, crawler_product_id, processed_name Dataset products with processed data
product_details product_id, specifications, metadata Product detailed specifications
product_similarities dataset_id, mall_product_id_a, product_id_b, similarity_score Product similarity matrix
ai_viewpoints dataset_id, cluster_id, name, description AI-generated cluster labels
review_sentence_aivp sentence_id, dataset_id, ai_viewpoint_id, mall_product_id AI Viewpoint ↔ Sentence mappings
reviews dataset_id, product_id, sentiment_overall Dataset reviews with sentiment
review_sentences dataset_id, ai_viewpoint_id, cluster_id, is_noise Clustered review sentences

spvp_batch manages 3 tables:

Table Key Fields Description
specific_viewpoints viewpoint_category_id, last_object_id SPVP progress tracking
viewpoint_categories category_name, description SPVP category definitions
review_sentence_spvp sentence_id, viewpoint_ordinal_number, mall_product_id, viewpoint_category_id Specific Viewpoint ↔ Sentence mappings

Detailed Data Changes

Phase 1: Wishlist Creation (trend-viewer-api → gb_console)

-- Create main wishlist
INSERT INTO gb_console.wishlist_to_groups (
    name, group_id, subscription_id, 
    status = 1 -- 1 | Active
);

-- Store crawl items - Products
INSERT INTO gb_console.summary_wishlist_products (
    wishlist_to_group_id, input, input_type, mall_id,
    crawl_status = 0 -- 0 | New
);

-- Store crawl items - Product Reviews
INSERT INTO gb_console.summary_wishlist_product_reviews (
    summary_wishlist_product_id, schedule_id,
    crawl_status = 0 -- 0 | New
);

-- Store crawl items - Categories
INSERT INTO gb_console.summary_wishlist_categories (
    category_id, mall_id, wishlist_to_group_id,
    crawl_status = 0 -- 0 | New
);

-- Store crawl items - Search Queries
INSERT INTO gb_console.summary_wishlist_search_queries (
    keyword, mall_id, wishlist_to_group_id,
    crawl_status = 0 -- 0 | New
);

Phase 2: Data Crawling (trend-viewer-backend → Crawler Service → analyzer_v2)

-- Update crawl status for all summary tables
UPDATE gb_console.summary_wishlist_products 
SET crawl_status = 1 -- 1 | InProgress
WHERE crawl_status = 0;

UPDATE gb_console.summary_wishlist_product_reviews 
SET crawl_status = 1 -- 1 | InProgress
WHERE crawl_status = 0;

UPDATE gb_console.summary_wishlist_categories 
SET crawl_status = 1 -- 1 | InProgress
WHERE crawl_status = 0;

UPDATE gb_console.summary_wishlist_search_queries 
SET crawl_status = 1 -- 1 | InProgress
WHERE crawl_status = 0;

-- Raw data from Crawler Service - Products & Reviews
INSERT INTO analyzer_v2.products (name, description, price, mall_id);
INSERT INTO analyzer_v2.reviews (product_id, content, rating);
INSERT INTO analyzer_v2.review_sentences (review_id, sentence, embedding);

-- Complete crawl for all summary tables
UPDATE gb_console.summary_wishlist_*
SET crawl_status = 2 -- 2 | Success
WHERE crawl_completed;

Phase 3: Dataset Creation (trend-viewer-backend → ds_analyzer)

-- Create dataset metadata
INSERT INTO ds_analyzer.datasets (
    name, group_id, viewpoint_category_id,
    status = 1, -- 1 | Pending
    progress = 0,
    settings = JSON_OBJECT('product_ids', [...])
);

-- Create history record
INSERT INTO gb_console.wishlist_dataset_histories (
    wishlist_to_group_id, dataset_id,
    status = 1, -- 1 | Pending
    spvp_status = 1 -- 1 | Pending
);

Phase 4: ML Analysis (analyzer_batch → ds_analyzer)

-- Update progress
UPDATE ds_analyzer.datasets 
SET status = 2, progress = 50 -- 2 | Processing
WHERE id = dataset_id;

-- analyzer_batch creates 7 complete tables
-- 1. Products & Details
INSERT INTO ds_analyzer.products (dataset_id, crawler_product_id, processed_name);
INSERT INTO ds_analyzer.product_details (product_id, specifications, metadata);

-- 2. Product Similarities  
INSERT INTO ds_analyzer.product_similarities (dataset_id, mall_product_id_a, mall_product_id_b, similarity_score);

-- 3. AI Viewpoints & Clustering
INSERT INTO ds_analyzer.ai_viewpoints (dataset_id, cluster_id, name, description);
INSERT INTO ds_analyzer.review_sentence_spvp (
    sentence_id, 
    viewpoint_ordinal_number, 
    mall_product_id,
    viewpoint_category_id
);

-- 4. Reviews & Sentences with sentiment analysis
INSERT INTO ds_analyzer.reviews (dataset_id, product_id, sentiment_overall);
INSERT INTO ds_analyzer.review_sentences (dataset_id, ai_viewpoint_id, cluster_id, is_noise);

-- Complete analysis
UPDATE ds_analyzer.datasets 
SET status = 3, progress = 100 -- 3 | Completed
WHERE id = dataset_id;

-- Sync status
UPDATE gb_console.wishlist_dataset_histories 
SET status = 3, spvp_status = 2 -- 2 | Analyzing (SPVP)
WHERE dataset_id = dataset_id;

Phase 5: SPVP Processing (spvp_batch → ds_analyzer)

-- Load data for SPVP processing
SELECT * FROM ds_analyzer.review_sentences WHERE dataset_id = ?;
SELECT * FROM ds_analyzer.specific_viewpoints WHERE viewpoint_category_id = ?;
SELECT * FROM ds_analyzer.viewpoint_categories WHERE id = ?;

-- Qwen mapping process: specific_viewpoints ↔ review_sentences
-- Store mapping results
INSERT INTO ds_analyzer.review_sentence_spvp (
    sentence_id, 
    viewpoint_ordinal_number, 
    mall_product_id,
    viewpoint_category_id
);

-- Update progress tracking
UPDATE ds_analyzer.specific_viewpoints 
SET last_object_id = max_processed_sentence_id
WHERE viewpoint_category_id = dataset_viewpoint_category;

-- SPVP completion check
UPDATE gb_console.wishlist_dataset_histories 
SET spvp_status = 3 -- 3 | Completed
WHERE dataset_id = dataset_id 
  AND all_specific_viewpoints_processed();

Service Responsibilities

trend-viewer-api

  • Creates: gb_console.wishlist_to_groups, gb_console.summary_wishlist_*
  • Updates: Status fields for user-facing operations

trend-viewer-backend

  • Creates: ds_analyzer.datasets, gb_console.wishlist_dataset_histories
  • Updates: Crawl status sync, dataset status monitoring
  • Triggers: analyzer_batch, spvp_batch
  • Scheduled crawl management
  • Dataset creation logic
  • Status synchronization

Crawler Service

  • Creates: analyzer_v2.products, analyzer_v2.reviews, analyzer_v2.review_sentences
  • Coordinates with: PLG API for embedding/prediction

analyzer_batch

  • Reads from: analyzer_v2.* tables (products, reviews, review_sentences)
  • Creates: 7 tables in ds_analyzer:
    • products - Dataset products with processed data
    • product_details - Product specifications & metadata
    • product_similarities - Product similarity matrix
    • ai_viewpoints - AI-generated cluster labels
    • review_sentence_aivp - AI Viewpoint ↔ Sentence mappings
    • reviews - Dataset reviews with sentiment analysis
    • review_sentences - Clustered review sentences
  • Updates: ds_analyzer.datasets (status, progress)
  • Processing: ML clustering (K-means), AI labeling (OpenAI GPT-4), Product similarity calculation, Sentiment analysis
  • Architecture: Data writing with distributed transactions

spvp_batch

  • Reads from: ds_analyzer.review_sentences, ds_analyzer.specific_viewpoints, ds_analyzer.viewpoint_categories
  • Creates: ds_analyzer.review_sentence_spvp (sentence-viewpoint mappings)
  • Updates: ds_analyzer.specific_viewpoints (last_object_id), gb_console.wishlist_dataset_histories (spvp_status)
  • Qwen-based sentence to specific viewpoint mapping
  • Progress tracking through last_object_id
  • Final completion status updates

PLG API

  • Embedding status checks
  • Prediction readiness verification
  • Crawler service coordination

Critical Status Fields

Dataset Lifecycle Status

-- trend-viewer-backend creates dataset
ds_analyzer.datasets.status = 1 | Pending

-- analyzer_batch processes  
ds_analyzer.datasets.status = 2 | Processing
ds_analyzer.datasets.progress = 0 → 100

-- analyzer_batch completes
ds_analyzer.datasets.status = 3 | Completed

-- Status sync to gb_console
gb_console.wishlist_dataset_histories.status = 3 | Complete
gb_console.wishlist_dataset_histories.spvp_status = 2 | Analyzing

-- spvp_batch completes
gb_console.wishlist_dataset_histories.spvp_status = 3 | Completed

Error Tracking

-- analyzer_batch errors
ds_analyzer.datasets.status = 9 | Failed
ds_analyzer.datasets.error_code = 1001-3003
ds_analyzer.datasets.progress = -1

Status Values Reference

Dataset Status (ds_analyzer.datasets.status)

  • 1 | Pending: trend-viewer-backend creates dataset, waiting for analyzer_batch
  • 2 | Processing: analyzer_batch is processing
  • 3 | Completed: analyzer_batch completed
  • 9 | Failed: analyzer_batch failed

SPVP Status (gb_console.wishlist_dataset_histories.spvp_status)

  • 1 | Pending: Waiting for spvp_batch
  • 2 | Analyzing: spvp_batch is processing
  • 3 | Completed: spvp_batch completed
  • 9 | Failed: spvp_batch failed

Crawl Status (gb_console.summary_wishlist_*.crawl_status)

  • 0 | New: Not crawled yet, waiting for trend-viewer-backend schedule
  • 1 | InProgress: Crawler Service is crawling
  • 2 | Success: Crawler Service crawl successful
  • 3 | Error: Crawler Service crawl failed
  • 4 | Canceled: Crawl canceled

Error Recovery by Service

trend-viewer-backend

  • Crawl sync failures: Automatic retry with scheduled jobs
  • Dataset creation failures: Manual trigger available
  • Status sync issues: Reconciliation jobs

analyzer_batch

  • Processing failures: Full rollback, error_code tracking
  • Resource issues: Container restart, scaling
  • OpenAI API failures: Retry with backoff

spvp_batch

  • Processing stuck: Manual reset last_object_id
  • Completion detection failures: Manual status update

Related Documentation

Core Workflow Components

Related Systems Documentation

Backend Services Documentation

  • trend-viewer-backend Documentation - Complete backend documentation
    • Dataset Commands: link - Console commands for dataset management
    • Crawler Integration: link - PLG API integration details

Processing Services (Documentation in Development)

  • analyzer_batch: Python ML/AI processing service
    • Functions: ML clustering, AI labeling, Product similarity analysis
  • spvp_batch: Python Qwen-based specific viewpoint processing
    • Functions: SPVP mapping, Qwen model integration, Progress tracking