Dataset Workflow - Overview

Introduction

Dataset Workflow is the end-to-end automation process for creating and analyzing data from wishlist_to_group. This document summarizes the main workflow for dataset creation and processing.

Related Workflows

This main workflow builds upon several prerequisite processes that are documented separately:

  • Temp Wishlist to Group → Wishlist to Group conversion workflow: Wishlist System Overview
  • Detailed Wishlist Management: wishlist_products, wishlist_categories, wishlist_search_queries, wishlist_product_reviews, summary_wishlist_* tables
  • Backend Console Commands: Detailed documentation available in trend-viewer-backend docs
    • Dataset Commands: link
    • Crawler Integration: link
  • Processing Services: Documentation in development
    • analyzer_batch: Python ML/AI processing service (docs in development)
    • spvp_batch: Python Qwen-based specific viewpoint processing (docs in development)

Note: This overview focuses on the core dataset creation flow from wishlist_to_group to completed analysis results.

PLG API - Interface with Crawler

PLG API (Playground API) is an API gateway provided by the Crawler team for interaction with TV (Trend Viewer):

  • Crawl Management: TV sends crawl requests through PLG API
  • Data Validation: TV checks embedding/prediction status via PLG API
  • Error Tracking: PLG API records crawl logs and failures in the crawler_v2 DB
  • Success Storage: PLG API stores successfully crawled data in the analyzer_v2 DB

PLG API is managed by the Crawler team and operates directly on the crawler_v2 and analyzer_v2 databases.
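
The PLG API contract itself is owned by the Crawler team and is not documented here. Purely as an illustration of the two interactions above (sending a crawl request and validating embedding/prediction status), a minimal Python sketch could look like the following; the base URL, endpoint paths, and payload fields are assumptions, not the real API.

# Minimal sketch of TV-side calls to PLG API.
# NOTE: base URL, endpoint paths, and payload fields are hypothetical;
# the real contract is defined by the Crawler team.
import requests

PLG_BASE_URL = "https://plg-api.example.internal"  # placeholder

def send_crawl_request(wishlist_group_id: int, products: list[dict]) -> dict:
    """Ask PLG API to enqueue a crawl for the given wishlist group."""
    resp = requests.post(
        f"{PLG_BASE_URL}/crawl-requests",  # hypothetical endpoint
        json={"wishlist_group_id": wishlist_group_id, "products": products},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()  # crawl requests end up in crawler_v2

def check_embedding_status(product_ids: list[int]) -> dict:
    """Validate embedding/prediction completeness before dataset creation."""
    resp = requests.get(
        f"{PLG_BASE_URL}/products/embedding-status",  # hypothetical endpoint
        params={"ids": ",".join(map(str, product_ids))},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()  # reflects analyzer_v2.products embedding state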

Overview Diagram

---
config:
  theme: base
  layout: dagre
  flowchart:
    curve: linear
    htmlLabels: true
  themeVariables:
    edgeLabelBackground: "transparent"
---
flowchart LR
    %% User Input
    User[User Input]
    
    %% Core Systems
    TrendApp[trend-viewer-app]
    TrendAPI[trend-viewer-api]
    TrendBackend[trend-viewer-backend]
    AnalyzerBatch[analyzer_batch]
    SPVPBatch[spvp_batch]
    PLGApi[PLG API]
    
    %% Databases
    GBConsole[(gb_console DB)]
    DSAnalyzer[(ds_analyzer DB)]
    AnalyzerV2[(analyzer_v2 DB)]
    CrawlerV2[(crawler_v2 DB)]

    %% Flow with numbered steps - Steps 1 and 2 vertical
    User --- Step1[
        <div style='text-align: center'>
            <span style='display: inline-block; background-color: #6699cc !important; color:white; width: 28px; height: 28px; line-height: 28px; border-radius: 50%; font-weight: bold'>1</span>
            <p style='margin-top: 8px'>Create Wishlist</p>
        </div>
    ]
    Step1 --> TrendApp
    TrendApp --> TrendAPI
    TrendAPI --> GBConsole
    
    GBConsole --- Step2[
        <div style='text-align: center'>
            <span style='display: inline-block; background-color: #99cc66 !important; color:white; width: 28px; height: 28px; line-height: 28px; border-radius: 50%; font-weight: bold'>2</span>
            <p style='margin-top: 8px'>Crawl Data</p>
        </div>
    ]
    Step2 --> TrendBackend
    TrendBackend --> PLGApi
    PLGApi --> AnalyzerV2
    PLGApi --> CrawlerV2
    
    %% Steps 3, 4, 5 horizontal flow (left to right)
    TrendBackend --- Step3[
        <div style='text-align: center'>
            <span style='display: inline-block; background-color: #cc6699 !important; color:white; width: 28px; height: 28px; line-height: 28px; border-radius: 50%; font-weight: bold'>3</span>
            <p style='margin-top: 8px'>Create Dataset</p>
        </div>
    ]
    Step3 --> DSAnalyzer
    
    TrendBackend --> Step4[
        <div style='text-align: center'>
            <span style='display: inline-block; background-color: #ff9900 !important; color:white; width: 28px; height: 28px; line-height: 28px; border-radius: 50%; font-weight: bold'>4</span>
            <p style='margin-top: 8px'>Analyze</p>
        </div>
    ]
    Step4 --> AnalyzerBatch
    AnalyzerV2 --> AnalyzerBatch
    AnalyzerBatch --> DSAnalyzer
    DSAnalyzer --> SPVPBatch
    
    AnalyzerBatch --> Step5[
        <div style='text-align: center'>
            <span style='display: inline-block; background-color: #cc3366 !important; color:white; width: 28px; height: 28px; line-height: 28px; border-radius: 50%; font-weight: bold'>5</span>
            <p style='margin-top: 8px'>SPVP Process</p>
        </div>
    ]
    Step5 --> SPVPBatch
    SPVPBatch --> DSAnalyzer

    %% Style step boxes as transparent
    style Step1 fill:transparent,stroke:transparent,stroke-width:1px
    style Step2 fill:transparent,stroke:transparent,stroke-width:1px
    style Step3 fill:transparent,stroke:transparent,stroke-width:1px
    style Step4 fill:transparent,stroke:transparent,stroke-width:1px
    style Step5 fill:transparent,stroke:transparent,stroke-width:1px

Sequence Diagram

sequenceDiagram
    participant User
    participant TrendApp as trend-viewer-app
    participant TrendAPI as trend-viewer-api
    participant TrendBackend as trend-viewer-backend
    participant PLG as PLG API
    participant Analyzer as analyzer_batch
    participant SPVP as spvp_batch
    participant TVDB as TV DB
    participant CrawlerDB as Crawler DB
    participant Logger
    participant Slack
    
    Note over User,Slack: Dataset Workflow Complete Flow
    
    rect rgb(200, 255, 200)
    Note right of User: Happy Case - Phase 1: Wishlist Creation
    
    User->>TrendApp: 1. Create wishlist_to_group
    TrendApp->>TrendAPI: POST /api/v1/general/wishlist-to-group
    TrendAPI->>TrendAPI: Validate and store data
    TrendAPI->>TVDB: Store wishlist data
    Note over TVDB: gb_console.wishlist_to_groups (status: 1 | Active)<br/>gb_console.summary_wishlist_* (crawl_status: 0 | New)
    
    rect rgb(230, 200, 255)
    Note right of TrendAPI: Success Monitoring
    TrendAPI->>Logger: Log wishlist creation
    TrendAPI->>Slack: Send creation notification
    end
    end
    
    rect rgb(200, 230, 255)
    Note right of TrendBackend: Happy Case - Phase 2: Data Crawling
    
    TrendBackend->>TVDB: Query pending items
    TVDB-->>TrendBackend: Return crawl queue
    TrendBackend->>PLG: Send crawl config
    PLG->>CrawlerDB: Store crawl requests
    Note over CrawlerDB: crawler_v2.configs
    
    rect rgb(255, 255, 200)
    Note right of PLG: Optional - Multiple Products
    PLG->>PLG: Execute crawling process
    PLG->>CrawlerDB: Store successfully crawled data
    Note over CrawlerDB: analyzer_v2.products<br/>analyzer_v2.reviews<br/>analyzer_v2.review_sentences
    PLG->>CrawlerDB: Update crawl status & errors
    Note over CrawlerDB: crawler_v2.configs
    end
    
    PLG-->>TrendBackend: Return crawl status
    TrendBackend->>TVDB: Update crawl_status: 0→1→2 (Success)
    Note over TVDB: gb_console.summary_wishlist_*
    end
    
    rect rgb(255, 230, 200)
    Note right of TrendBackend: Critical - Phase 3: Dataset Creation
    
    TrendBackend->>TVDB: Check wishlist readiness
    TVDB-->>TrendBackend: Return ready wishlists
    TrendBackend->>PLG: Validate embedding & prediction
    PLG->>CrawlerDB: Check data completeness
    Note over CrawlerDB: analyzer_v2.products (embedding status)
    CrawlerDB-->>PLG: Return validation status
    PLG-->>TrendBackend: Validation completed
    
    TrendBackend->>TVDB: Create dataset metadata
    Note over TVDB: ds_analyzer.datasets (status: 1 | Pending, progress: 0)
    TrendBackend->>TVDB: Create history record
    Note over TVDB: gb_console.wishlist_dataset_histories (status: 1, spvp_status: 1)
    
    TrendBackend->>Analyzer: Trigger analyzer_batch
    Note over Analyzer: Google Cloud Run job started
    end
    
    rect rgb(200, 255, 200)
    Note right of Analyzer: Happy Case - Phase 4: ML Analysis
    
    Analyzer->>TVDB: Update status to Processing
    Note over TVDB: ds_analyzer.datasets (status: 1→2 | Processing, progress: 0→25)
    
    Analyzer->>CrawlerDB: Load source data directly
    Note over CrawlerDB: analyzer_v2.products<br/>analyzer_v2.reviews<br/>analyzer_v2.review_sentences
    CrawlerDB-->>Analyzer: Return products, reviews, sentences
    
    Analyzer->>Analyzer: ML Processing
    Note over Analyzer: - K-means Clustering (progress: 25→50)<br/>- OpenAI GPT-4 Labeling (progress: 50→75)<br/>- Product Similarity Calc (progress: 75→85)
    
    Analyzer->>TVDB: Write analysis results (7 tables)
    Note over TVDB: ds_analyzer.products<br/>ds_analyzer.product_details<br/>ds_analyzer.product_similarities<br/>ds_analyzer.ai_viewpoints<br/>ds_analyzer.review_sentence_aivp<br/>ds_analyzer.reviews<br/>ds_analyzer.review_sentences (progress: 85→100)
    
    Analyzer->>TVDB: Complete analysis
    Note over TVDB: ds_analyzer.datasets (status: 2→3 | Completed, progress: 100)
    
    Analyzer-->>TrendBackend: Analysis completed notification
    TrendBackend->>TVDB: Sync dataset status
    Note over TVDB: gb_console.wishlist_dataset_histories<br/>(status: 3, spvp_status: 2 | Analyzing)
    end
    
    rect rgb(200, 230, 255)
    Note right of TrendBackend: Happy Case - Phase 5: SPVP Processing
    
    TrendBackend->>SPVP: Trigger spvp_batch
    SPVP->>TVDB: Load review sentences
    Note over TVDB: ds_analyzer.review_sentences
    SPVP->>TVDB: Load specific viewpoints & categories
    Note over TVDB: ds_analyzer.specific_viewpoints<br/>ds_analyzer.viewpoint_categories
    TVDB-->>SPVP: Return sentences, viewpoints & categories
    
    SPVP->>SPVP: Qwen mapping process
    Note over SPVP: Qwen model maps<br/>specific_viewpoints ↔ review_sentences
    
    SPVP->>TVDB: Store mapping results
    Note over TVDB: ds_analyzer.review_sentence_spvp<br/>(sentence-viewpoint mappings)
    
    SPVP->>TVDB: Update viewpoint progress
    Note over TVDB: ds_analyzer.specific_viewpoints<br/>(last_object_id updated)
    
    SPVP-->>TrendBackend: SPVP completed
    TrendBackend->>TVDB: Final status update
    Note over TVDB: gb_console.wishlist_dataset_histories<br/>(spvp_status: 2→3 | Completed)
    
    rect rgb(230, 200, 255)
    Note right of TrendBackend: Success Monitoring
    TrendBackend->>Logger: Log workflow completion
    TrendBackend->>Slack: Send success notification
    TrendBackend-->>TrendAPI: Dataset ready notification
    TrendAPI-->>TrendApp: WebSocket status update
    TrendApp-->>User: Dataset available
    end
    end
    
    rect rgb(255, 200, 200)
    Note right of TrendBackend: Error Handling
    rect rgb(255, 230, 230)
    alt Crawl Failure
        PLG->>CrawlerDB: Log crawl errors & failures
        Note over CrawlerDB: crawler_v2.error_logs
        PLG->>TVDB: Set crawl_status = 3 (Error)
        Note over TVDB: gb_console.summary_wishlist_*
        PLG->>Logger: Log crawl error
        PLG->>Slack: Send crawl failure notification
    else Analysis Failure  
        Analyzer->>TVDB: Set status = 9 (Failed), error_code
        Note over TVDB: ds_analyzer.datasets
        Analyzer->>TVDB: Update history status = 9
        Note over TVDB: gb_console.wishlist_dataset_histories
        Analyzer->>Logger: Log analysis error
        Analyzer->>Slack: Send analysis failure notification
    else SPVP Failure
        SPVP->>TVDB: Set spvp mapping error
        Note over TVDB: ds_analyzer.review_sentence_spvp<br/>ds_analyzer.specific_viewpoints
        SPVP->>TVDB: Set spvp_status = 9 (Failed)
        Note over TVDB: gb_console.wishlist_dataset_histories
        SPVP->>Logger: Log SPVP error
        SPVP->>Slack: Send SPVP failure notification
    end
    end
    end
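
The status and progress milestones shown in Phase 4 (ds_analyzer.datasets moving through status 1→2→3 and progress 0→25→50→75→85→100) could be reported by analyzer_batch roughly as below. This is a sketch only: the table and column names follow the sequence diagram, while the SQL, connection handling, and helper names are assumptions.

# Sketch of how analyzer_batch might report the Phase 4 milestones to
# ds_analyzer.datasets. status/progress columns follow the diagram above;
# everything else here is an assumption.
import pymysql

PENDING, PROCESSING, COMPLETED, FAILED = 1, 2, 3, 9

# conn = pymysql.connect(host="...", user="...", password="...", database="ds_analyzer")

def update_dataset(conn, dataset_id: int, status: int, progress: int) -> None:
    with conn.cursor() as cur:
        cur.execute(
            "UPDATE datasets SET status = %s, progress = %s WHERE id = %s",
            (status, progress, dataset_id),
        )
    conn.commit()

def run_analysis(conn, dataset_id: int) -> None:
    try:
        update_dataset(conn, dataset_id, PROCESSING, 25)   # job started
        # ... K-means clustering of review sentences (progress 25 -> 50) ...
        update_dataset(conn, dataset_id, PROCESSING, 50)
        # ... OpenAI GPT-4 labeling of clusters (progress 50 -> 75) ...
        update_dataset(conn, dataset_id, PROCESSING, 75)
        # ... product similarity calculation (progress 75 -> 85) ...
        update_dataset(conn, dataset_id, PROCESSING, 85)
        # ... write results to the 7 ds_analyzer tables (progress 85 -> 100) ...
        update_dataset(conn, dataset_id, COMPLETED, 100)
    except Exception:
        with conn.cursor() as cur:
            # the real service also records an error_code (see Error Handling above)
            cur.execute("UPDATE datasets SET status = %s WHERE id = %s",
                        (FAILED, dataset_id))
        conn.commit()
        raise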

System Components

Frontend

  • trend-viewer-app: Vue.js frontend for user interface

API Layer

  • trend-viewer-api: Laravel REST API layer

Backend Services

  • trend-viewer-backend: Laravel backend with scheduled commands
  • PLG API: Playground API provided by Crawler for TV interaction

Processing Services

  • analyzer_batch: Python ML/AI processing service
  • spvp_batch: Python Qwen-based specific viewpoint processing
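
A rough Python sketch of the spvp_batch mapping loop described in Phase 5 follows. The table names (review_sentences, specific_viewpoints, review_sentence_spvp) and the last_object_id progress column come from this document; the column names, batching strategy, and the Qwen invocation itself are assumptions.

# Sketch of the spvp_batch mapping step (Phase 5). Only the table names and
# last_object_id come from this document; column names, helpers, and the
# Qwen call are assumptions.
def qwen_map(sentence: str, viewpoints: list[str]) -> list[int]:
    """Return indices of the specific viewpoints the sentence expresses (Qwen call stubbed)."""
    raise NotImplementedError("invoke the Qwen model here")

def run_spvp(conn, dataset_id: int) -> None:
    with conn.cursor() as cur:
        cur.execute("SELECT id, text FROM review_sentences WHERE dataset_id = %s", (dataset_id,))
        sentences = cur.fetchall()
        cur.execute("SELECT id, name FROM specific_viewpoints WHERE dataset_id = %s", (dataset_id,))
        viewpoints = cur.fetchall()

    names = [name for _, name in viewpoints]
    for sentence_id, text in sentences:
        matches = qwen_map(text, names)
        with conn.cursor() as cur:
            for idx in matches:
                cur.execute(
                    "INSERT INTO review_sentence_spvp (review_sentence_id, specific_viewpoint_id) "
                    "VALUES (%s, %s)",
                    (sentence_id, viewpoints[idx][0]),
                )
            # record resume position so a rerun can continue where it stopped
            cur.execute(
                "UPDATE specific_viewpoints SET last_object_id = %s WHERE dataset_id = %s",
                (sentence_id, dataset_id),
            )
        conn.commit()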

Databases

  • gb_console DB (MySQL): TV Backend - User data, wishlist lifecycle, status tracking
  • crawler_v2 DB (MySQL): Crawler managed - Crawl requests, logs, and error tracking
  • analyzer_v2 DB (MySQL): Crawler managed - Successful crawled data from marketplace
  • ds_analyzer DB (MySQL): TV Analyzer - Processed analysis results from analyzer_batch

Data Flow Summary

  1. User Input → gb_console DB (wishlist_to_groups, summary_wishlist_*)
  2. Crawl Requests → TV calls PLG API → Crawler operates crawler_v2 DB
  3. Successful Crawling → PLG API stores to analyzer_v2 DB (products, reviews, review_sentences)
  4. Dataset Creation → ds_analyzer DB (datasets metadata)
  5. Analysis → analyzer_batch reads directly from analyzer_v2 DB → stores to ds_analyzer DB (7 tables)
  6. SPVP Processing → ds_analyzer DB (review_sentence_spvp mappings, specific_viewpoints updates)
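
The same six steps restated as a small data structure for quick reference; the component and table names are taken from the list above, the layout itself is only illustrative.

# The six data-flow steps as a simple structure (illustrative only).
DATA_FLOW = [
    ("1. user input",       "trend-viewer-api",     "gb_console.wishlist_to_groups, summary_wishlist_*"),
    ("2. crawl requests",   "PLG API",              "crawler_v2 (crawl configs, logs)"),
    ("3. crawled data",     "PLG API",              "analyzer_v2.products, reviews, review_sentences"),
    ("4. dataset creation", "trend-viewer-backend", "ds_analyzer.datasets"),
    ("5. analysis",         "analyzer_batch",       "ds_analyzer (7 result tables)"),
    ("6. SPVP processing",  "spvp_batch",           "ds_analyzer.review_sentence_spvp, specific_viewpoints"),
]

for step, component, writes in DATA_FLOW:
    print(f"{step:<20} {component:<22} -> {writes}")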

Module List

Name | Link | Description
Database Schema Changes | Database Schema | Detailed database schema changes for dataset workflow
Detailed Sequence Diagrams | Sequence Diagrams | Detailed sequence diagrams for each workflow phase

Dataset Status

Dataset Status (ds_analyzer.datasets.status)

  • 1 | Pending: Dataset created, waiting for analyzer_batch
  • 2 | Processing: analyzer_batch is processing
  • 3 | Completed: analyzer_batch completed
  • 9 | Failed: analyzer_batch failed

SPVP Status (gb_console.wishlist_dataset_histories.spvp_status)

  • 1 | Pending: Waiting for spvp_batch
  • 2 | Analyzing: spvp_batch is processing
  • 3 | Completed: spvp_batch completed
  • 9 | Failed: spvp_batch failed

Crawl Status (gb_console.summary_wishlist_*.crawl_status)

  • 0 | New: Not crawled yet
  • 1 | InProgress: Crawler Service is crawling
  • 2 | Success: Crawl successful
  • 3 | Error: Crawl failed
  • 4 | Canceled: Crawl canceled

Successful Dataset: status = 3 AND spvp_status = 3
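
These status codes can be mirrored directly in code. A minimal Python sketch, including the success condition above, might look like this; the enum and helper names are ours, while the numeric values come from this page.

# Status codes as documented above; enum and helper names are illustrative.
from enum import IntEnum

class DatasetStatus(IntEnum):        # ds_analyzer.datasets.status
    PENDING = 1
    PROCESSING = 2
    COMPLETED = 3
    FAILED = 9

class SpvpStatus(IntEnum):           # gb_console.wishlist_dataset_histories.spvp_status
    PENDING = 1
    ANALYZING = 2
    COMPLETED = 3
    FAILED = 9

class CrawlStatus(IntEnum):          # gb_console.summary_wishlist_*.crawl_status
    NEW = 0
    IN_PROGRESS = 1
    SUCCESS = 2
    ERROR = 3
    CANCELED = 4

def is_successful_dataset(status: int, spvp_status: int) -> bool:
    """A dataset is successful when both pipelines completed (status = 3 AND spvp_status = 3)."""
    return (status == DatasetStatus.COMPLETED
            and spvp_status == SpvpStatus.COMPLETED)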