Dataset Workflow - Overview

Introduction

The Dataset Workflow is the fully automated process for creating and analyzing data from wishlist_to_group. This document summarizes the main flow for dataset creation and processing.

Related Flows

This main flow builds on several prerequisite processes that are documented separately:

  • Temp Wishlist to Group → Wishlist to Group conversion process: Wishlist System Overview
  • Detailed Wishlist Management: wishlist_products, wishlist_categories, wishlist_search_queries, wishlist_product_reviews, and summary_wishlist_* tables
  • Backend Console Commands: Detailed documentation is available in the trend-viewer-backend docs
    • Dataset Commands: link
    • Crawler Integration: link
  • Processing Services: Documentation in development
    • analyzer_batch: Python ML/AI processing service (docs in development)
    • spvp_batch: Python Qwen-based specific viewpoint processing (docs in development)

Note: This overview focuses on the core dataset creation flow, from wishlist_to_group through to the complete analysis results.

PLG API - Interface with the Crawler

PLG API (Playground API) is the API gateway provided by the Crawler team for TV to interact with:

  • Crawl Management: TV sends crawl requests via the PLG API
  • Data Validation: TV checks embedding/prediction status via the PLG API
  • Error Tracking: The PLG API records logs and failures in the crawler_v2 DB
  • Success Storage: The PLG API stores successfully crawled data in the analyzer_v2 DB

The PLG API is owned by the Crawler team and operates directly on the crawler_v2 and analyzer_v2 databases.
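
To make these integration points concrete, below is a minimal Python sketch of the two TV-initiated calls (crawl submission and validation). The base URL, endpoint paths, payload fields, and response shapes are all assumptions, not the real PLG contract, and the actual caller is the Laravel trend-viewer-backend rather than a Python script.

```python
# Illustrative only: endpoint paths, payloads, and responses are assumed,
# not taken from the real PLG API contract (owned by the Crawler team).
import requests

PLG_BASE_URL = "https://plg.example.internal"  # placeholder base URL


def send_crawl_request(group_id: int, items: list[str]) -> dict:
    """Crawl Management: submit a crawl config for a wishlist group."""
    resp = requests.post(
        f"{PLG_BASE_URL}/api/crawl-requests",  # hypothetical endpoint
        json={"group_id": group_id, "items": items},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()


def check_validation_status(group_id: int) -> dict:
    """Data Validation: check embedding/prediction readiness."""
    resp = requests.get(
        f"{PLG_BASE_URL}/api/validations/{group_id}",  # hypothetical endpoint
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()
```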

Overview Diagram

---
config:
  theme: base
  layout: dagre
  flowchart:
    curve: linear
    htmlLabels: true
  themeVariables:
    edgeLabelBackground: "transparent"
---
flowchart LR
    %% User Input
    User[User Input]
    
    %% Core Systems
    TrendApp[trend-viewer-app]
    TrendAPI[trend-viewer-api]
    TrendBackend[trend-viewer-backend]
    AnalyzerBatch[analyzer_batch]
    SPVPBatch[spvp_batch]
    PLGApi[PLG API]
    
    %% Databases
    GBConsole[(gb_console DB)]
    DSAnalyzer[(ds_analyzer DB)]
    AnalyzerV2[(analyzer_v2 DB)]
    CrawlerV2[(crawler_v2 DB)]

    %% Flow with numbered steps - Steps 1 and 2 vertical
    User --- Step1[
        <div style='text-align: center'>
            <span style='display: inline-block; background-color: #6699cc !important; color:white; width: 28px; height: 28px; line-height: 28px; border-radius: 50%; font-weight: bold'>1</span>
            <p style='margin-top: 8px'>Create Wishlist</p>
        </div>
    ]
    Step1 --> TrendApp
    TrendApp --> TrendAPI
    TrendAPI --> GBConsole
    
    GBConsole --- Step2[
        <div style='text-align: center'>
            <span style='display: inline-block; background-color: #99cc66 !important; color:white; width: 28px; height: 28px; line-height: 28px; border-radius: 50%; font-weight: bold'>2</span>
            <p style='margin-top: 8px'>Crawl Data</p>
        </div>
    ]
    Step2 --> TrendBackend
    TrendBackend --> PLGApi
    PLGApi --> AnalyzerV2
    PLGApi --> CrawlerV2
    
    %% Steps 3, 4, 5 horizontal flow (left to right)
    TrendBackend --- Step3[
        <div style='text-align: center'>
            <span style='display: inline-block; background-color: #cc6699 !important; color:white; width: 28px; height: 28px; line-height: 28px; border-radius: 50%; font-weight: bold'>3</span>
            <p style='margin-top: 8px'>Create Dataset</p>
        </div>
    ]
    Step3 --> DSAnalyzer
    
    TrendBackend --> Step4[
        <div style='text-align: center'>
            <span style='display: inline-block; background-color: #ff9900 !important; color:white; width: 28px; height: 28px; line-height: 28px; border-radius: 50%; font-weight: bold'>4</span>
            <p style='margin-top: 8px'>Analyze</p>
        </div>
    ]
    Step4 --> AnalyzerBatch
    AnalyzerV2 --> AnalyzerBatch
    AnalyzerBatch --> DSAnalyzer
    DSAnalyzer --> SPVPBatch
    
    AnalyzerBatch --> Step5[
        <div style='text-align: center'>
            <span style='display: inline-block; background-color: #cc3366 !important; color:white; width: 28px; height: 28px; line-height: 28px; border-radius: 50%; font-weight: bold'>5</span>
            <p style='margin-top: 8px'>SPVP Process</p>
        </div>
    ]
    Step5 --> SPVPBatch
    SPVPBatch --> DSAnalyzer

    %% Style step boxes as transparent
    style Step1 fill:transparent,stroke:transparent,stroke-width:1px
    style Step2 fill:transparent,stroke:transparent,stroke-width:1px
    style Step3 fill:transparent,stroke:transparent,stroke-width:1px
    style Step4 fill:transparent,stroke:transparent,stroke-width:1px
    style Step5 fill:transparent,stroke:transparent,stroke-width:1px

Sequence Diagram

sequenceDiagram
    participant User
    participant TrendApp as trend-viewer-app
    participant TrendAPI as trend-viewer-api
    participant TrendBackend as trend-viewer-backend
    participant PLG as PLG API
    participant Analyzer as analyzer_batch
    participant SPVP as spvp_batch
    participant TVDB as TV DB
    participant CrawlerDB as Crawler DB
    participant Logger
    participant Slack
    
    Note over User,Slack: Dataset Workflow Complete Flow
    
    rect rgb(200, 255, 200)
    Note right of User: Happy Case - Phase 1: Wishlist Creation
    
    User->>TrendApp: 1. Create wishlist_to_group
    TrendApp->>TrendAPI: POST /api/v1/general/wishlist-to-group
    TrendAPI->>TrendAPI: Validate and store data
    TrendAPI->>TVDB: Store wishlist data
    Note over TVDB: gb_console.wishlist_to_groups (status: 1 | Active)<br/>gb_console.summary_wishlist_* (crawl_status: 0 | New)
    
    rect rgb(230, 200, 255)
    Note right of TrendAPI: Success Monitoring
    TrendAPI->>Logger: Log wishlist creation
    TrendAPI->>Slack: Send creation notification
    end
    end
    
    rect rgb(200, 230, 255)
    Note right of TrendBackend: Happy Case - Phase 2: Data Crawling
    
    TrendBackend->>TVDB: Query pending items
    TVDB-->>TrendBackend: Return crawl queue
    TrendBackend->>PLG: Send crawl config
    PLG->>CrawlerDB: Store crawl requests
    Note over CrawlerDB: crawler_v2.configs
    
    rect rgb(255, 255, 200)
    Note right of PLG: Optional - Multiple Products
    PLG->>PLG: Execute crawling process
    PLG->>CrawlerDB: Store successful crawled data
    Note over CrawlerDB: analyzer_v2.products<br/>analyzer_v2.reviews<br/>analyzer_v2.review_sentences
    PLG->>CrawlerDB: Update crawl status & errors
    Note over CrawlerDB: crawler_v2.configs
    end
    
    PLG-->>TrendBackend: Return crawl status
    TrendBackend->>TVDB: Update crawl_status: 0→1→2 (Success)
    Note over TVDB: gb_console.summary_wishlist_*
    end
    
    rect rgb(255, 230, 200)
    Note right of TrendBackend: Critical - Phase 3: Dataset Creation
    
    TrendBackend->>TVDB: Check wishlist readiness
    TVDB-->>TrendBackend: Return ready wishlists
    TrendBackend->>PLG: Validate embedding & prediction
    PLG->>CrawlerDB: Check data completeness
    Note over CrawlerDB: analyzer_v2.products (embedding status)
    CrawlerDB-->>PLG: Return validation status
    PLG-->>TrendBackend: Validation completed
    
    TrendBackend->>TVDB: Create dataset metadata
    Note over TVDB: ds_analyzer.datasets (status: 1 | Pending, progress: 0)
    TrendBackend->>TVDB: Create history record
    Note over TVDB: gb_console.wishlist_dataset_histories (status: 1, spvp_status: 1)
    
    TrendBackend->>Analyzer: Trigger analyzer_batch
    Note over Analyzer: Google Cloud Run job started
    end
    
    rect rgb(200, 255, 200)
    Note right of Analyzer: Happy Case - Phase 4: ML Analysis
    
    Analyzer->>TVDB: Update status to Processing
    Note over TVDB: ds_analyzer.datasets (status: 1→2 | Processing, progress: 0→25)
    
    Analyzer->>CrawlerDB: Load source data directly
    Note over CrawlerDB: analyzer_v2.products<br/>analyzer_v2.reviews<br/>analyzer_v2.review_sentences
    CrawlerDB-->>Analyzer: Return products, reviews, sentences
    
    Analyzer->>Analyzer: ML Processing
    Note over Analyzer: - K-means Clustering (progress: 25→50)<br/>- OpenAI GPT-4 Labeling (progress: 50→75)<br/>- Product Similarity Calc (progress: 75→85)
    
    Analyzer->>TVDB: Write analysis results (7 tables)
    Note over TVDB: ds_analyzer.products<br/>ds_analyzer.product_details<br/>ds_analyzer.product_similarities<br/>ds_analyzer.ai_viewpoints<br/>ds_analyzer.review_sentence_aivp<br/>ds_analyzer.reviews<br/>ds_analyzer.review_sentences (progress: 85→100)
    
    Analyzer->>TVDB: Complete analysis
    Note over TVDB: ds_analyzer.datasets (status: 2→3 | Completed, progress: 100)
    
    Analyzer-->>TrendBackend: Analysis completed notification
    TrendBackend->>TVDB: Sync dataset status
    Note over TVDB: gb_console.wishlist_dataset_histories<br/>(status: 3, spvp_status: 2 | Analyzing)
    end
    
    rect rgb(200, 230, 255)
    Note right of TrendBackend: Happy Case - Phase 5: SPVP Processing
    
    TrendBackend->>SPVP: Trigger spvp_batch
    SPVP->>TVDB: Load review sentences
    Note over TVDB: ds_analyzer.review_sentences
    SPVP->>TVDB: Load specific viewpoints & categories
    Note over TVDB: ds_analyzer.specific_viewpoints<br/>ds_analyzer.viewpoint_categories
    TVDB-->>SPVP: Return sentences, viewpoints & categories
    
    SPVP->>SPVP: Qwen mapping process
    Note over SPVP: Qwen model maps<br/>specific_viewpoints ↔ review_sentences
    
    SPVP->>TVDB: Store mapping results
    Note over TVDB: ds_analyzer.review_sentence_spvp<br/>(sentence-viewpoint mappings)
    
    SPVP->>TVDB: Update viewpoint progress
    Note over TVDB: ds_analyzer.specific_viewpoints<br/>(last_object_id updated)
    
    SPVP-->>TrendBackend: SPVP completed
    TrendBackend->>TVDB: Final status update
    Note over TVDB: gb_console.wishlist_dataset_histories<br/>(spvp_status: 2→3 | Completed)
    
    rect rgb(230, 200, 255)
    Note right of TrendBackend: Success Monitoring
    TrendBackend->>Logger: Log workflow completion
    TrendBackend->>Slack: Send success notification
    TrendBackend-->>TrendAPI: Dataset ready notification
    TrendAPI-->>TrendApp: WebSocket status update
    TrendApp-->>User: Dataset available
    end
    end
    
    rect rgb(255, 200, 200)
    Note right of TrendBackend: Error Handling
    rect rgb(255, 230, 230)
    alt Crawl Failure
        PLG->>CrawlerDB: Log crawl errors & failures
        Note over CrawlerDB: crawler_v2.error_logs
        PLG->>TVDB: Set crawl_status = 3 (Error)
        Note over TVDB: gb_console.summary_wishlist_*
        PLG->>Logger: Log crawl error
        PLG->>Slack: Send crawl failure notification
    else Analysis Failure  
        Analyzer->>TVDB: Set status = 9 (Failed), error_code
        Note over TVDB: ds_analyzer.datasets
        Analyzer->>TVDB: Update history status = 9
        Note over TVDB: gb_console.wishlist_dataset_histories
        Analyzer->>Logger: Log analysis error
        Analyzer->>Slack: Send analysis failure notification
    else SPVP Failure
        SPVP->>TVDB: Set spvp mapping error
        Note over TVDB: ds_analyzer.review_sentence_spvp<br/>ds_analyzer.specific_viewpoints
        SPVP->>TVDB: Set spvp_status = 9 (Failed)
        Note over TVDB: gb_console.wishlist_dataset_histories
        SPVP->>Logger: Log SPVP error
        SPVP->>Slack: Send SPVP failure notification
    end
    end
    end
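
As a concrete illustration of Phase 3 above, the sketch below creates the dataset metadata row (status: 1 | Pending, progress: 0) and the matching history record (status: 1, spvp_status: 1). The status values and table names come from the sequence diagram; the column names, the wishlist_to_group_id key, and the credentials are assumptions, and the real logic lives in the Laravel trend-viewer-backend.

```python
# Sketch of Phase 3 (Dataset Creation). Status values follow the sequence
# diagram; column names and credentials are placeholder assumptions.
import pymysql

gb_conn = pymysql.connect(host="tv-db.internal", user="tv",
                          password="...", database="gb_console")
ds_conn = pymysql.connect(host="tv-db.internal", user="tv",
                          password="...", database="ds_analyzer")


def create_dataset(wishlist_to_group_id: int) -> int:
    """Create dataset metadata and history record, then return dataset id."""
    with ds_conn.cursor() as cur:
        cur.execute(
            "INSERT INTO datasets (wishlist_to_group_id, status, progress)"
            " VALUES (%s, 1, 0)",  # status 1 = Pending, progress 0
            (wishlist_to_group_id,),
        )
        dataset_id = cur.lastrowid
    ds_conn.commit()

    with gb_conn.cursor() as cur:
        cur.execute(
            "INSERT INTO wishlist_dataset_histories"
            " (wishlist_to_group_id, dataset_id, status, spvp_status)"
            " VALUES (%s, %s, 1, 1)",  # status 1, spvp_status 1 = Pending
            (wishlist_to_group_id, dataset_id),
        )
    gb_conn.commit()

    # Next, trend-viewer-backend triggers the analyzer_batch Cloud Run job.
    return dataset_id
```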

System Components

Frontend

  • trend-viewer-app: Vue.js frontend for the user interface

API Layer

  • trend-viewer-api: Laravel REST API layer

Backend Services

  • trend-viewer-backend: Laravel backend with scheduled commands
  • PLG API: Playground API provided by the Crawler team for TV interaction

Processing Services

  • analyzer_batch: Python ML/AI processing service
  • spvp_batch: Python Qwen-based specific viewpoint processing

Databases

  • gb_console DB (MySQL): TV Backend - User data, wishlist lifecycle, status tracking
  • crawler_v2 DB (MySQL): Crawler-managed - Crawl requests, logs, and error tracking
  • analyzer_v2 DB (MySQL): Crawler-managed - Successfully crawled data from the marketplace
  • ds_analyzer DB (MySQL): TV Analyzer - Processed analysis results from analyzer_batch
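
For quick orientation, here is an illustrative map of the four databases and their owners as listed above; host names are placeholders, not real infrastructure names.

```python
# Placeholder hosts; ownership follows the list above.
DATABASES = {
    "gb_console":  {"owner": "TV Backend",  "host": "tv-backend-db.internal"},
    "crawler_v2":  {"owner": "Crawler",     "host": "crawler-db.internal"},
    "analyzer_v2": {"owner": "Crawler",     "host": "crawler-db.internal"},
    "ds_analyzer": {"owner": "TV Analyzer", "host": "tv-analyzer-db.internal"},
}
```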

Data Flow Summary

  1. User Input → gb_console DB (wishlist_to_groups, summary_wishlist_*)
  2. Crawl Requests → TV calls the PLG API → Crawler operates on the crawler_v2 DB
  3. Successful Crawling → PLG API stores into the analyzer_v2 DB (products, reviews, review_sentences)
  4. Dataset Creation → ds_analyzer DB (datasets metadata)
  5. Analysis → analyzer_batch reads the analyzer_v2 DB directly → stores results into the ds_analyzer DB (7 tables)
  6. SPVP Processing → ds_analyzer DB (specific_viewpoints updates)
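
This summary can be verified per group with read-only checks against each database in order, as in the sketch below, which assumes pymysql connections keyed by database name. The table names follow the summary above; the concrete summary_wishlist_products table and the wishlist_to_group_id key column are assumptions.

```python
# Read-only trace of one wishlist group through the flow; table names follow
# the summary above, key/column names are assumptions.
TRACE_STEPS = [
    ("gb_console",  "SELECT crawl_status FROM summary_wishlist_products"
                    " WHERE wishlist_to_group_id = %s"),   # steps 1-3
    ("ds_analyzer", "SELECT status, progress FROM datasets"
                    " WHERE wishlist_to_group_id = %s"),   # steps 4-5
    ("gb_console",  "SELECT status, spvp_status FROM wishlist_dataset_histories"
                    " WHERE wishlist_to_group_id = %s"),   # step 6
]


def trace_group(connections: dict, group_id: int) -> None:
    """Print each stage's status for a given wishlist group."""
    for db_name, sql in TRACE_STEPS:
        with connections[db_name].cursor() as cur:
            cur.execute(sql, (group_id,))
            print(db_name, cur.fetchall())
```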

Module List

Name                        Link               Description
Database Schema Changes     Database Schema    Detailed database schema changes for the dataset workflow
Detailed Sequence Diagrams  Sequence Diagrams  Detailed sequence diagrams for each workflow phase

Dataset Statuses

Dataset Status (ds_analyzer.datasets.status)

  • 1 | Pending: Dataset created, waiting for analyzer_batch
  • 2 | Processing: analyzer_batch is processing
  • 3 | Completed: analyzer_batch finished
  • 9 | Failed: analyzer_batch failed

SPVP Status (gb_console.wishlist_dataset_histories.spvp_status)

  • 1 | Pending: Waiting for spvp_batch
  • 2 | Analyzing: spvp_batch is processing
  • 3 | Completed: spvp_batch finished
  • 9 | Failed: spvp_batch failed

Crawl Status (gb_console.summary_wishlist_*.crawl_status)

  • 0 | New: Not yet crawled
  • 1 | InProgress: Crawler Service is crawling
  • 2 | Success: Crawl succeeded
  • 3 | Error: Crawl failed
  • 4 | Canceled: Crawl was canceled

A dataset is successful when: status = 3 AND spvp_status = 3
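
For reference, the same status codes can be expressed as Python enums together with the success condition; the enum member names are illustrative, while the numeric values are taken directly from this section.

```python
# Numeric values taken from the status lists above; names are illustrative.
from enum import IntEnum


class DatasetStatus(IntEnum):   # ds_analyzer.datasets.status
    PENDING = 1
    PROCESSING = 2
    COMPLETED = 3
    FAILED = 9


class SpvpStatus(IntEnum):      # gb_console.wishlist_dataset_histories.spvp_status
    PENDING = 1
    ANALYZING = 2
    COMPLETED = 3
    FAILED = 9


class CrawlStatus(IntEnum):     # gb_console.summary_wishlist_*.crawl_status
    NEW = 0
    IN_PROGRESS = 1
    SUCCESS = 2
    ERROR = 3
    CANCELED = 4


def is_dataset_successful(status: int, spvp_status: int) -> bool:
    """Success condition: status = 3 AND spvp_status = 3."""
    return (status == DatasetStatus.COMPLETED
            and spvp_status == SpvpStatus.COMPLETED)
```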