Overview

This guide covers best practices for preparing chatbot training data. It walks through each supported data format (Q&A pairs, document uploads, and CSV files) and gives specific guidelines for each.

Q&A Pairs

Individual Entry Guidelines

  • Keep questions clear and specific
  • Provide comprehensive answers
  • Use consistent formatting
  • Include relevant context
  • Avoid duplicate questions
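
One way to catch duplicate questions before upload is to compare them after normalization. The sketch below is a heuristic, not part of any upload tooling; the function names are illustrative, and it only catches questions that differ in case, punctuation, or spacing.

```python
import re
from collections import defaultdict


def normalize_question(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def find_duplicate_questions(questions: list[str]) -> dict[str, list[int]]:
    """Map each normalized question that occurs more than once to its indexes."""
    seen = defaultdict(list)
    for index, question in enumerate(questions):
        seen[normalize_question(question)].append(index)
    return {key: idxs for key, idxs in seen.items() if len(idxs) > 1}


questions = [
    "What are your business hours?",
    "what are your business hours",
    "How do I return a product?",
]
print(find_duplicate_questions(questions))
# {'what are your business hours': [0, 1]}
```

Rephrased questions with the same meaning are not detected here and still need a manual review.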

Batch Upload (JSON)

[
  {
    "question": "What are your business hours?",
    "answer": "We are open Monday through Friday, 9 AM to 5 PM EST."
  }
]

JSON Format Requirements

  • Use an array of objects
  • Each object must have “question” and “answer” fields
  • Use UTF-8 encoding
  • Maximum file size: 10 MB
  • Validate JSON structure before upload
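
The requirements above can be checked locally before upload. The following is a minimal sketch using only the Python standard library; the function name is illustrative, and it assumes the 10 MB limit means 10 * 1024 * 1024 bytes, which the guide does not specify.

```python
import json

# Assumption: 1 MB is counted as 1024 * 1024 bytes.
MAX_BYTES = 10 * 1024 * 1024


def validate_qa_payload(raw: bytes) -> list[str]:
    """Return a list of problems; an empty list means these checks passed."""
    if len(raw) > MAX_BYTES:
        return [f"file is {len(raw)} bytes, over the 10 MB limit"]
    try:
        data = json.loads(raw.decode("utf-8"))
    except UnicodeDecodeError:
        return ["file is not valid UTF-8"]
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(data, list):
        return ["top-level value must be an array of objects"]

    problems = []
    for i, entry in enumerate(data):
        if not isinstance(entry, dict):
            problems.append(f"entry {i}: not an object")
            continue
        for field in ("question", "answer"):
            value = entry.get(field)
            # Each field must be present and contain non-blank text.
            if not isinstance(value, str) or not value.strip():
                problems.append(f"entry {i}: missing or empty '{field}'")
    return problems


sample = b'[{"question": "What are your business hours?"}]'
print(validate_qa_payload(sample))
# ["entry 0: missing or empty 'answer'"]
```

To check a file on disk, read it as bytes first (for example with `pathlib.Path(...).read_bytes()`) and pass the result to the function.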

Best Practices

  1. Question Formatting

    • Use natural language
    • Include variations of similar questions
    • Keep questions concise but descriptive
  2. Answer Quality

    • Provide complete, accurate information
    • Use consistent tone and style
    • Include necessary context
    • Break down complex answers into digestible parts

Document Uploads

Supported File Types

  • PDF (.pdf)
  • Markdown (.md, .mdx)
  • Microsoft Word (.docx, .doc)
  • CSV (.csv)
  • JSON (.json)

General File Guidelines

  1. File Naming

    • Use descriptive names
    • Follow kebab-case convention
    • Include version numbers if applicable
    • Examples:
      • ✅ product-return-policy-2024.pdf
      • ✅ customer-faqs-v2.md
      • ❌ doc1.pdf
      • ❌ New Document (1).docx
  2. File Organization

    • Group related documents
    • Use consistent naming patterns
    • Keep file sizes manageable
    • Remove unnecessary metadata
  3. Content Structure

    • Use clear headings and sections
    • Include a table of contents for longer documents
    • Maintain consistent formatting
    • Remove irrelevant content
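
The file naming guidelines above can be checked automatically. This is a rough sketch: the kebab-case pattern follows the convention described above, while the list of "generic" names is an assumption added for illustration.

```python
import re
from pathlib import PurePath

# Lowercase letters and digits joined by single hyphens, e.g. "customer-faqs-v2".
KEBAB_CASE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")
# Assumed examples of non-descriptive names; adjust to your own conventions.
GENERIC_STEM = re.compile(r"(?:doc|document|file|untitled|new)-?\d*")


def is_good_file_name(file_name: str) -> bool:
    """True if the name (without extension) is kebab-case and not generic."""
    stem = PurePath(file_name).stem
    return bool(KEBAB_CASE.fullmatch(stem)) and not GENERIC_STEM.fullmatch(stem)


for name in [
    "product-return-policy-2024.pdf",
    "customer-faqs-v2.md",
    "doc1.pdf",
    "New Document (1).docx",
]:
    print(name, is_good_file_name(name))
```

With these patterns, the first two names pass and the last two fail. A pattern check cannot judge whether a name is truly descriptive, so treat it as a first filter.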

CSV Best Practices

File Structure

  1. Headers

    • Use clear, descriptive column names
    • Avoid spaces (use underscores or hyphens)
    • Include a description row if needed
    • Example:
      title,description,category
      Product Returns,How to process product returns,Support
      
  2. Data Cleaning

    • Remove empty rows
    • Fill or remove blank cells
    • Standardize data formats
    • Check for and remove duplicate entries
  3. Formatting Guidelines

    • Use UTF-8 encoding
    • Use one consistent date format (YYYY-MM-DD)
    • Escape special characters properly (quote fields that contain commas, quotes, or line breaks)
    • Use consistent number formatting
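
Several of the cleaning steps above can be scripted. The sketch below uses Python's standard `csv` module; the function name is illustrative, and it covers only header spaces, stray whitespace, empty rows, and exact duplicate rows.

```python
import csv
import io


def clean_csv(text: str) -> str:
    """Trim cells, fix header spaces, and drop empty and duplicate rows."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ""
    # Headers: trim and replace inner spaces with underscores.
    header = [name.strip().replace(" ", "_") for name in rows[0]]
    cleaned, seen = [header], set()
    for row in rows[1:]:
        cells = tuple(cell.strip() for cell in row)
        if not any(cells):  # every cell is empty
            continue
        if cells in seen:  # exact duplicate of an earlier row
            continue
        seen.add(cells)
        cleaned.append(list(cells))
    out = io.StringIO()
    # csv.writer quotes cells that contain commas, quotes, or line breaks.
    csv.writer(out, lineterminator="\n").writerows(cleaned)
    return out.getvalue()


raw = (
    "title,short description,category\n"
    "Product Returns ,How to process product returns,Support\n"
    ",,\n"
    "Product Returns,How to process product returns,Support\n"
)
print(clean_csv(raw))
```

Blank cells inside otherwise filled rows are left untouched, since whether to fill or remove them depends on the data.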

CSV Content Best Practices

  1. Data Organization

    • One concept per row
    • Consistent data types per column
    • Logical column ordering
    • Appropriate data granularity
  2. Quality Control

    • Validate data accuracy
    • Check for formatting consistency
    • Remove trailing spaces
    • Verify character encoding
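
For quality control, a short audit script can report problems without changing the file. This is a sketch under stated assumptions: the function names and the `date_columns` parameter are illustrative, and record numbers count the header as record 1.

```python
import csv
import io
import re
from datetime import datetime

ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def is_iso_date(value: str) -> bool:
    """True for a real calendar date written as YYYY-MM-DD."""
    if not ISO_DATE.fullmatch(value):
        return False
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    return True


def audit_csv(text: str, date_columns: tuple[str, ...] = ()) -> list[str]:
    """Report ragged rows, blank cells, and dates that are not YYYY-MM-DD."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader, None)
    if header is None:
        return ["file is empty"]
    problems = []
    for record_no, row in enumerate(reader, start=2):
        if len(row) != len(header):
            problems.append(
                f"record {record_no}: expected {len(header)} cells, found {len(row)}"
            )
            continue
        for name, cell in zip(header, row):
            if not cell.strip():
                problems.append(f"record {record_no}: blank '{name}'")
            elif name in date_columns and not is_iso_date(cell.strip()):
                problems.append(
                    f"record {record_no}: '{name}' is not YYYY-MM-DD: {cell!r}"
                )
    return problems


sample = "title,updated\nReturns,2024-01-15\nShipping,01/15/2024\n"
print(audit_csv(sample, date_columns=("updated",)))
# ["record 3: 'updated' is not YYYY-MM-DD: '01/15/2024'"]
```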

General Tips

  1. Data Quality

    • Regular content updates
    • Version control
    • Quality assurance reviews
    • Consistent terminology
  2. Performance Optimization

    • Compress large files
    • Split very large datasets
    • Remove redundant information
    • Optimize images and media
  3. Maintenance

    • Schedule regular reviews
    • Document changes
    • Archive outdated versions
    • Monitor chatbot performance
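
As one way to split a very large dataset, the sketch below groups Q&A entries into batches whose serialized JSON stays under a byte budget. The function name and the `max_bytes` parameter are illustrative; the size estimate assumes each batch is written with `json.dumps` and its default separators.

```python
import json


def split_by_size(entries: list[dict], max_bytes: int) -> list[list[dict]]:
    """Group entries so each batch serializes to at most max_bytes of UTF-8 JSON.

    A single entry larger than max_bytes still gets its own batch.
    """
    batches, current, current_size = [], [], 2  # 2 bytes for "[" and "]"
    for entry in entries:
        entry_size = len(json.dumps(entry, ensure_ascii=False).encode("utf-8"))
        extra = entry_size + (2 if current else 0)  # ", " between items
        if current and current_size + extra > max_bytes:
            batches.append(current)
            current, current_size = [], 2
            extra = entry_size
        current.append(entry)
        current_size += extra
    if current:
        batches.append(current)
    return batches


entries = [{"question": f"Question {i}?", "answer": "Answer."} for i in range(5)]
for batch in split_by_size(entries, max_bytes=120):
    print(len(batch), len(json.dumps(batch, ensure_ascii=False).encode("utf-8")))
```

Each batch can then be written to its own file with `json.dumps(batch, ensure_ascii=False)`.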

Common Pitfalls to Avoid

  1. Content Issues

    • Inconsistent formatting
    • Duplicate information
    • Outdated content
    • Ambiguous answers
  2. Technical Issues

    • Wrong file encodings
    • Missing data validation
    • Improper file organization
    • Oversized files
  3. Process Issues

    • Lack of version control
    • Poor documentation
    • Inconsistent naming conventions
    • Missing backup procedures

Testing and Validation

  1. Before Upload

    • Validate file formats
    • Check for errors
    • Review content quality
    • Test with sample queries
  2. After Upload

    • Verify data integration
    • Test chatbot responses
    • Monitor performance
    • Gather user feedback
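
Testing with sample queries can be turned into a repeatable check. The sketch below is hypothetical: `ask` stands for whatever function sends a query to your chatbot and returns its reply, and `fake_ask` is a stand-in used only to make the example runnable.

```python
from typing import Callable


def run_sample_queries(
    ask: Callable[[str], str], cases: dict[str, list[str]]
) -> list[str]:
    """Return a message for each expected phrase missing from a response."""
    failures = []
    for query, expected_phrases in cases.items():
        response = ask(query)
        for phrase in expected_phrases:
            # Case-insensitive substring match.
            if phrase.lower() not in response.lower():
                failures.append(f"{query!r}: response is missing {phrase!r}")
    return failures


def fake_ask(query: str) -> str:
    """Stand-in for a real chatbot client."""
    return "We are open Monday through Friday, 9 AM to 5 PM EST."


cases = {
    "What are your business hours?": ["Monday through Friday", "9 AM to 5 PM"],
}
print(run_sample_queries(fake_ask, cases))
# []
```

Substring matching is deliberately simple; it confirms that key facts appear in a response but says nothing about tone or extra, incorrect claims.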

Regular maintenance of your training data helps keep chatbot responses accurate. Set up a review schedule so that outdated content is caught and corrected.

Always back up your training data before making major changes or updates to avoid data loss.
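
A backup can be as simple as a timestamped copy. The following sketch uses Python's standard library; the function name and the directory layout are assumptions, and the demo runs in a temporary directory so it does not touch real data.

```python
import shutil
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def backup_file(source: Path, backup_dir: Path) -> Path:
    """Copy source into backup_dir under a UTC-timestamped name."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    target = backup_dir / f"{source.stem}-{stamp}{source.suffix}"
    shutil.copy2(source, target)  # copy2 also tries to preserve file metadata
    return target


# Demo with a throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "customer-faqs-v2.json"
    source.write_text("[]", encoding="utf-8")
    copy = backup_file(source, Path(tmp) / "backups")
    print(copy.name)
```

Two backups of the same file taken within the same second share a name, so the later copy overwrites the earlier one.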

Content Quality Guidelines

Preventing Hallucinations

  1. Explicit References

    • Replace pronouns with specific nouns
    • Examples:
      • ❌ “It provides fast shipping”
      • ✅ “Amazon Prime provides fast shipping”
      • ❌ “They can help with returns”
      • ✅ “Customer service representatives can help with returns”
  2. Specific Quantities and Metrics

    • Use exact numbers when possible
    • Include units of measurement
    • Examples:
      • ❌ “Delivery takes a few days”
      • ✅ “Delivery takes 3-5 business days”
      • ❌ “The product is quite large”
      • ✅ “The product dimensions are 10" x 12" x 15"”
  3. Clear Entity References

    • Avoid ambiguous references
    • Name specific products, services, or departments
    • Examples:
      • ❌ “Contact support for assistance”
      • ✅ “Contact technical support at [email protected] for assistance”
      • ❌ “Use the tool to analyze data”
      • ✅ “Use the Google Analytics dashboard to analyze website traffic data”
  4. Temporal Clarity

    • Specify exact dates or time periods
    • Use absolute references instead of relative ones
    • Examples:
      • ❌ “The feature was recently added”
      • ✅ “The feature was added in January 2024”
      • ❌ “Updates coming soon”
      • ✅ “Updates scheduled for Q2 2024”
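
Some of the vague wording described above can be flagged automatically. The sketch below is a heuristic; the word lists are assumed starting points rather than a complete set, and it will produce false positives that need a human decision.

```python
import re

# Assumed starter patterns; each has one capture group for the flagged word.
VAGUE_PATTERNS = {
    "ambiguous pronoun at sentence start": r"(?:^|[.!?]\s+)(it|they|this|that)\b",
    "vague quantity": r"\b(a few|several|quite|very)\b",
    "relative time": r"\b(recently|soon|lately)\b",
}


def find_vague_language(text: str) -> list[tuple[str, str]]:
    """Return (issue label, matched text) pairs for each pattern hit."""
    findings = []
    for label, pattern in VAGUE_PATTERNS.items():
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            findings.append((label, match.group(1)))
    return findings


print(find_vague_language("Delivery takes a few days"))
# [('vague quantity', 'a few')]
print(find_vague_language("Delivery takes 3-5 business days"))
# []
```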

Writing Style Guidelines

  1. Factual Language

    • Use concrete, verifiable statements
    • Avoid subjective descriptions
    • Examples:
      • ❌ “The best solution for your needs”
      • ✅ “A solution that offers features A, B, and C”
      • ❌ “Amazing performance improvements”
      • ✅ “30% faster response time compared to version 1.0”
  2. Consistent Terminology

    • Maintain a glossary of approved terms
    • Use the same term for the same concept
    • Examples:
      • ❌ “login, sign-in, log in” (mixing terms)
      • ✅ “login” (consistent usage)
      • ❌ “customer, user, client” (mixing terms)
      • ✅ “user” (consistent usage)
  3. Structured Information

    • Break down complex concepts
    • Use lists and tables for clarity
    • Include step-by-step instructions
    • Example:

      To reset your password:
      
      1. Navigate to login.company.com
      2. Click "Forgot Password"
      3. Enter your email address
      4. Follow the instructions in the reset email
      
  4. Context Completeness

    • Include all necessary context in each answer
    • Avoid assuming prior knowledge
    • Examples:
      • ❌ “Enable the feature in settings”
      • ✅ “Enable the dark mode feature in User Settings > Display > Theme”
      • ❌ “Use the API key to authenticate”
      • ✅ “Use the API key from your dashboard at api.company.com/keys to authenticate requests”
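
A glossary of approved terms can also drive a simple consistency check. In the sketch below, the glossary contents mirror the examples above and are purely illustrative; the function name is an assumption.

```python
import re

# Hypothetical glossary: approved term -> variants that should be replaced.
GLOSSARY = {
    "login": ["sign-in", "sign in", "log in", "log-in"],
    "user": ["customer", "client"],
}


def find_unapproved_terms(
    text: str, glossary: dict[str, list[str]] = GLOSSARY
) -> list[tuple[str, str]]:
    """Return (variant found, approved term) pairs, in glossary order."""
    findings = []
    for approved, variants in glossary.items():
        for variant in variants:
            # Whole-word, case-insensitive match.
            if re.search(rf"\b{re.escape(variant)}\b", text, flags=re.IGNORECASE):
                findings.append((variant, approved))
    return findings


print(find_unapproved_terms("Sign in to your client account"))
# [('sign in', 'login'), ('client', 'user')]
```

A plain word match cannot tell when a variant is legitimate (for example, "customer service" as a department name), so review each finding before replacing it.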

Data Validation Checklist

  1. Accuracy Verification

    • Cross-reference with official documentation
    • Verify URLs and email addresses
    • Confirm product specifications
    • Check policy statements
  2. Completeness Check

    • All required fields are filled
    • No missing steps in procedures
    • Contact information is complete
    • Reference links are provided
  3. Consistency Review

    • Terminology alignment
    • Format standardization
    • Unit consistency
    • Date and time formats
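
Part of the accuracy check, verifying URLs and email addresses, can start with a format check. The sketch below only tests the shape of each value and does not confirm that the address or page exists; the function names are illustrative, and `support@example.com` is a placeholder address.

```python
import re
from urllib.parse import urlparse

# Loose shape check (something@domain.tld), not full RFC 5322 validation.
EMAIL = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def is_plausible_email(value: str) -> bool:
    """True if the value has the basic shape of an email address."""
    return bool(EMAIL.fullmatch(value))


def is_plausible_url(value: str) -> bool:
    """True for http(s) URLs whose host contains a dot."""
    try:
        parts = urlparse(value)
    except ValueError:  # raised for some malformed hosts
        return False
    return parts.scheme in ("http", "https") and "." in parts.netloc


print(is_plausible_email("support@example.com"))       # True
print(is_plausible_url("https://api.company.com/keys"))  # True
print(is_plausible_url("api.company.com/keys"))          # False (no scheme)
```

A value that passes can still be wrong, so cross-check it against official documentation as the checklist describes.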

When preparing training data, remember that explicit, specific information helps prevent AI hallucinations and improves response accuracy.