BigQuery Missed Data Sync

Command Signatures

php artisan gcp:sync-products --missed [--items-per-page=N]
php artisan gcp:sync-reviews --missed [--items-per-page=N]
php artisan gcp:sync-review-sentences --missed [--items-per-page=N]

Purpose

These commands act as a safety net for the regular synchronization: any record the regular sync missed is eventually copied from BigQuery to the local database. They select every record whose status is null in the BigQuery tables, regardless of creation time, which keeps the local data complete and consistent.

Sequence Diagrams

Products Missed Data Sync

sequenceDiagram
    participant System
    participant MissedProducts as gcp:sync-products --missed
    participant BigQuery
    participant ProductsTable as products table
    participant ProductDetailsTable as product_details table
    participant Redis
    
    Note over System,Redis: Products Missed Data Sync Flow
    
    rect rgb(255, 200, 200)
    Note right of System: Daily
    System->>MissedProducts: Execute
    MissedProducts->>BigQuery: Query Products WHERE status IS NULL
    BigQuery-->>MissedProducts: Return Missed Products Data
    MissedProducts->>ProductsTable: Insert/Update Products Records
    MissedProducts->>ProductDetailsTable: Insert/Update Product Details Records
    MissedProducts->>Redis: Store Product IDs for Status Update
    end

Reviews Missed Data Sync

sequenceDiagram
    participant System
    participant MissedReviews as gcp:sync-reviews --missed
    participant BigQuery
    participant ReviewsTable as reviews table
    participant Redis
    
    Note over System,Redis: Reviews Missed Data Sync Flow
    
    rect rgb(255, 200, 200)
    Note right of System: Daily
    System->>MissedReviews: Execute
    MissedReviews->>BigQuery: Query Reviews WHERE status IS NULL
    BigQuery-->>MissedReviews: Return Missed Reviews Data
    MissedReviews->>ReviewsTable: Insert/Update Reviews Records
    MissedReviews->>Redis: Store Review IDs for Status Update
    end

Review Sentences Missed Data Sync

sequenceDiagram
    participant System
    participant MissedSentences as gcp:sync-review-sentences --missed
    participant BigQuery
    participant ReviewSentencesTable as review_sentences table
    participant Redis
    
    Note over System,Redis: Review Sentences Missed Data Sync Flow
    
    rect rgb(255, 200, 200)
    Note right of System: Daily
    System->>MissedSentences: Execute
    MissedSentences->>BigQuery: Query Review Sentences WHERE status IS NULL
    BigQuery-->>MissedSentences: Return Missed Sentences Data
    MissedSentences->>ReviewSentencesTable: Insert/Update Review Sentences Records
    MissedSentences->>Redis: Store Sentence IDs for Status Update
    end

Implementation Details

Parameters

  • --missed: Flag that modifies the query to select all records with null status regardless of creation time
  • --items-per-page=N: Optional parameter to control batch size (default: 500)
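
To illustrate how the two options might shape the BigQuery query, here is a minimal sketch in plain PHP. The function name buildSyncQuery, the table name, the id and created_at columns, the 24-hour window of the regular sync, and the LIMIT/OFFSET paging are all assumptions made for illustration; the actual commands may build their queries differently.

```php
<?php

/**
 * Hypothetical helper: builds the SQL a sync command could send to BigQuery.
 * With $missed = true the time constraint is dropped, so every record with a
 * null status is selected regardless of creation time.
 */
function buildSyncQuery(string $table, bool $missed, int $itemsPerPage = 500, int $page = 0): string
{
    if ($itemsPerPage < 1 || $page < 0) {
        throw new InvalidArgumentException('itemsPerPage must be >= 1 and page must be >= 0');
    }

    $conditions = ['status IS NULL'];

    if (!$missed) {
        // Assumed window for the regular sync; the real constraint may differ.
        $conditions[] = 'created_at >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)';
    }

    // Assumption: --items-per-page maps to one page of results per query.
    $offset = $page * $itemsPerPage;

    return sprintf(
        'SELECT * FROM `%s` WHERE %s ORDER BY id LIMIT %d OFFSET %d',
        $table,
        implode(' AND ', $conditions),
        $itemsPerPage,
        $offset
    );
}

// Example: first page of the missed-data query for a hypothetical products table.
echo buildSyncQuery('my_dataset.products', true), PHP_EOL;
```

Calling buildSyncQuery('my_dataset.products', false) instead adds the created_at condition, which is the only difference between the two modes in this sketch.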

Frequency

Daily - scheduled to run during low-traffic periods
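
A daily schedule like this could be registered with Laravel's task scheduler. The fragment below is a sketch only: it assumes a Laravel version that registers schedules in app/Console/Kernel.php, and the run times (02:00, 02:30, 03:00) are placeholders for whatever counts as a low-traffic period in a given deployment. It is meant to live inside a Laravel application and is not runnable on its own.

```php
<?php

namespace App\Console;

use Illuminate\Console\Scheduling\Schedule;
use Illuminate\Foundation\Console\Kernel as ConsoleKernel;

class Kernel extends ConsoleKernel
{
    /**
     * Register the daily missed-data syncs.
     * The times are assumptions; pick hours that are quiet for your system.
     */
    protected function schedule(Schedule $schedule): void
    {
        $schedule->command('gcp:sync-products --missed')
            ->dailyAt('02:00')
            ->withoutOverlapping(); // skip a run if the previous one is still going

        $schedule->command('gcp:sync-reviews --missed')
            ->dailyAt('02:30')
            ->withoutOverlapping();

        $schedule->command('gcp:sync-review-sentences --missed')
            ->dailyAt('03:00')
            ->withoutOverlapping();
    }
}
```

The three commands are offset by half an hour here so that they do not compete for queue workers at the same moment; the gap is arbitrary.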

Dependencies

  • Google Cloud Platform access credentials
  • BigQuery project and dataset configuration
  • Redis for tracking processed IDs
  • Queue workers for processing jobs
  • Update status command for marking records as processed

Processing Flow

  1. The command queries BigQuery for records with null status, without the time constraint that the regular sync applies
  2. Data is chunked into batches and processed via queue jobs
  3. The same transformation and storage logic as the regular sync is applied
  4. Processed IDs are stored in Redis
  5. Status update command later marks these records as processed
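
The flow above can be sketched in plain PHP, without Laravel, BigQuery, or Redis. Everything in this example is a stand-in: processMissedRows is a hypothetical name, the callable represents the queued job that inserts or updates local records, and the returned array represents the IDs that would be stored in Redis. Skipping the IDs of a failed batch is also an assumption about how the real jobs behave.

```php
<?php

/**
 * Hypothetical stand-in for the processing flow: chunk the rows, process each
 * batch, and collect the IDs that would be stored in Redis.
 *
 * @param array    $rows         rows returned from BigQuery, each with an 'id' key
 * @param int      $itemsPerPage batch size (the --items-per-page value)
 * @param callable $storeBatch   stand-in for the queued job that writes a batch
 * @return int[]   IDs to hand to the status update command
 */
function processMissedRows(array $rows, int $itemsPerPage, callable $storeBatch): array
{
    if ($itemsPerPage < 1) {
        throw new InvalidArgumentException('itemsPerPage must be >= 1');
    }

    $processedIds = [];

    foreach (array_chunk($rows, $itemsPerPage) as $batch) {
        try {
            $storeBatch($batch);
        } catch (Throwable $e) {
            // Assumption: IDs of a failed batch are not recorded, so those rows
            // keep their null status and are selected again on the next run.
            error_log('Batch failed: ' . $e->getMessage());
            continue;
        }

        foreach ($batch as $row) {
            $processedIds[] = $row['id'];
        }
    }

    return $processedIds;
}

// Usage: three rows in batches of two, with a no-op in place of the real job.
$rows = [['id' => 1], ['id' => 2], ['id' => 3]];
$ids = processMissedRows($rows, 2, function (array $batch): void {
    // Insert or update the local records here.
});
print_r($ids);
```

In the real commands each batch would be dispatched to a queue worker rather than processed inline, so failures would be retried by the queue before a batch is given up on.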

Error Handling

Job Structure

  • Uses the same error handling as the regular sync
  • Retries failed jobs based on Laravel queue configuration
  • Records detailed errors for troubleshooting

Logging

  • Detailed logs with job counts and record counts
  • Error stack traces for debugging
  • Cross-references to BigQuery tables and conditions

Notifications

  • Slack alerts for critical failures
  • Success notifications with processing statistics

Troubleshooting

Common Issues

  1. Large Data Volumes: The missed data sync may have to process a large backlog if the regular sync has been failing
  2. Queue Overload: Monitor queue size and worker capacity during missed data sync
  3. Resource Contention: Schedule missed sync during off-peak hours to avoid performance impact

Performance Optimization

  1. Adjust batch sizes via --items-per-page parameter
  2. Increase queue worker count before running missed data sync
  3. Monitor system resource usage during execution
  4. Consider staggering the missed sync commands rather than running them simultaneously
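
When choosing a batch size, it can help to estimate how many queue jobs a backlog will produce. The helper below is a small sketch; estimateJobCount is a hypothetical name, and it assumes one queue job per batch of --items-per-page records, which the actual commands may not follow exactly.

```php
<?php

/**
 * Hypothetical helper: number of queue jobs needed for a backlog, assuming
 * one job per batch of $itemsPerPage records (rounded up).
 */
function estimateJobCount(int $records, int $itemsPerPage = 500): int
{
    if ($records < 0 || $itemsPerPage < 1) {
        throw new InvalidArgumentException('records must be >= 0 and itemsPerPage must be >= 1');
    }

    // Integer ceiling division: a partial last batch still needs its own job.
    return intdiv($records + $itemsPerPage - 1, $itemsPerPage);
}

// Example backlog of 12,000 records at two different batch sizes.
printf("500 per page:  %d jobs\n", estimateJobCount(12000, 500));
printf("2000 per page: %d jobs\n", estimateJobCount(12000, 2000));
```

Larger batches mean fewer jobs but more memory and longer runtime per job, so the right value depends on worker capacity.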

Verification Steps

  1. Compare the number of BigQuery records with null status before and after the sync
  2. Check Redis for successfully processed record IDs
  3. Verify local database record counts and content
  4. Review status update command logs for successful batch processing
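
Part of these checks can be automated by comparing ID lists. The sketch below assumes the three lists have already been fetched (the null-status IDs from BigQuery before the sync, the processed IDs from Redis, and the IDs present in the local table); findUnsyncedIds and the result keys are hypothetical names, not part of the existing commands.

```php
<?php

/**
 * Hypothetical check: report which missed IDs did not make it into Redis or
 * the local database.
 *
 * @param int[] $missedIds IDs that had a null status in BigQuery before the sync
 * @param int[] $redisIds  IDs recorded in Redis as processed
 * @param int[] $localIds  IDs found in the local table after the sync
 * @return array{not_in_redis: int[], not_in_local_db: int[]}
 */
function findUnsyncedIds(array $missedIds, array $redisIds, array $localIds): array
{
    return [
        // array_values() reindexes, because array_diff() preserves the original keys.
        'not_in_redis'    => array_values(array_diff($missedIds, $redisIds)),
        'not_in_local_db' => array_values(array_diff($missedIds, $localIds)),
    ];
}

// Usage with made-up IDs: record 3 reached the local table but not Redis.
$report = findUnsyncedIds([1, 2, 3], [1, 2], [1, 2, 3]);
print_r($report);
```

Two empty lists suggest that the run covered everything it selected; any ID that appears in either list is a candidate for closer inspection in the logs.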