Overview
This guide covers best practices for preparing training data to ensure optimal performance of your chatbot. We’ll cover different data formats and provide specific guidelines for each type.
Q&A Pairs
Individual Entry Guidelines
- Keep questions clear and specific
- Provide comprehensive answers
- Use consistent formatting
- Include relevant context
- Avoid duplicate questions
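Duplicate questions are easy to miss when they differ only in capitalization or punctuation. The following is a minimal sketch, assuming that "duplicate" means identical after lowercasing and stripping punctuation; the function names `normalize` and `find_duplicate_questions` are illustrative and not part of any platform API.

```python
import re
from collections import defaultdict


def normalize(question: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    words = re.findall(r"\w+", question.lower())
    return " ".join(words)


def find_duplicate_questions(questions: list[str]) -> list[list[str]]:
    """Group questions that are identical after normalization."""
    groups = defaultdict(list)
    for question in questions:
        groups[normalize(question)].append(question)
    # Keep only groups with more than one member, i.e. actual duplicates.
    return [group for group in groups.values() if len(group) > 1]
```

Genuinely different phrasings of the same question are not flagged by this check, so intentional variations are left alone.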
Batch Upload (JSON)
JSON Format Requirements
- Use an array of objects
- Each object must have “question” and “answer” fields
- Use UTF-8 encoding
- Maximum file size: 10MB
- Validate JSON structure before upload
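The requirements above can be checked locally before upload. This is a minimal sketch rather than the platform's own validator: the function name `validate_qa_json` and the error messages are illustrative, and the 10MB limit is assumed here to mean 10 * 1024 * 1024 bytes, which the guide does not specify.

```python
import json

MAX_BYTES = 10 * 1024 * 1024  # assumption: "10MB" read as 10 * 1024 * 1024 bytes


def validate_qa_json(raw: bytes) -> list[str]:
    """Return a list of problems found in a Q&A batch file (empty if none)."""
    if len(raw) > MAX_BYTES:
        return [f"file is {len(raw)} bytes, above the {MAX_BYTES}-byte limit"]
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        return [f"not valid UTF-8: {exc}"]
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(data, list):
        return ["top-level value must be an array of objects"]

    problems = []
    for index, entry in enumerate(data):
        if not isinstance(entry, dict):
            problems.append(f"entry {index}: expected an object")
            continue
        # Both fields must be present and contain non-blank text.
        for field in ("question", "answer"):
            value = entry.get(field)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"entry {index}: missing or empty '{field}'")
    return problems
```

Reading the file in binary mode (`open(path, "rb").read()`) and passing the bytes in lets the same function check size, encoding, and structure in one pass.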
Best Practices
Question Formatting
- Use natural language
- Include variations of similar questions
- Keep questions concise but descriptive
Answer Quality
- Provide complete, accurate information
- Use consistent tone and style
- Include necessary context
- Break down complex answers into digestible parts
Document Uploads
Supported File Types
- PDF (.pdf)
- Markdown (.md, .mdx)
- Microsoft Word (.docx, .doc)
- CSV (.csv)
- JSON (.json)
General File Guidelines
File Naming
- Use descriptive names
- Follow kebab-case convention
- Include version numbers if applicable
- Examples:
- ✅ product-return-policy-2024.pdf
- ✅ customer-faqs-v2.md
- ❌ doc1.pdf
- ❌ New Document (1).docx
File Organization
- Group related documents
- Use consistent naming patterns
- Keep file sizes manageable
- Remove unnecessary metadata
Content Structure
- Use clear headings and sections
- Include a table of contents for longer documents
- Maintain consistent formatting
- Remove irrelevant content
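The kebab-case naming convention and the supported file types listed earlier lend themselves to an automated check. The sketch below assumes that "kebab-case" means lowercase letters and digits joined by single hyphens; the name `is_well_named` is illustrative. It cannot judge whether a name is descriptive, so a file such as doc1.pdf passes this check even though the guide discourages it.

```python
import re
from pathlib import PurePath

# Assumed reading of "kebab-case": lowercase letters and digits,
# with single hyphens between groups (e.g. "customer-faqs-v2").
KEBAB_STEM = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")

# Extensions taken from the "Supported File Types" list.
SUPPORTED_SUFFIXES = {".pdf", ".md", ".mdx", ".docx", ".doc", ".csv", ".json"}


def is_well_named(filename: str) -> bool:
    """Check that a file name has a kebab-case stem and a supported extension."""
    path = PurePath(filename)
    return (
        path.suffix.lower() in SUPPORTED_SUFFIXES
        and KEBAB_STEM.fullmatch(path.stem) is not None
    )
```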
CSV Best Practices
File Structure
Headers
- Use clear, descriptive column names
- Avoid spaces (use underscores or hyphens)
- Include a description row if needed
- Example:
Data Cleaning
- Remove empty rows
- Fill or remove blank cells
- Standardize data formats
- Check for and remove duplicate entries
Formatting Guidelines
- Use UTF-8 encoding
- Use a consistent date format (YYYY-MM-DD)
- Escape special characters properly
- Use consistent number formatting
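The header and data-cleaning guidelines above can be applied in a single pass with Python's standard `csv` module. This is a sketch under stated assumptions: `clean_csv` is an illustrative name, only exact duplicate rows are treated as duplicates, and deciding how to fill individual blank cells is left to the data owner.

```python
import csv
import io


def clean_csv(text: str) -> tuple[list[str], list[list[str]]]:
    """Return (header, rows) with cells trimmed and empty or duplicate rows dropped."""
    reader = csv.reader(io.StringIO(text, newline=""))
    header_row = next(reader, None)
    if header_row is None:
        raise ValueError("file is empty")
    header = [name.strip() for name in header_row]
    for name in header:
        if not name or " " in name:
            raise ValueError(f"column name {name!r} is empty or contains spaces")

    rows, seen = [], set()
    for raw_row in reader:
        row = [cell.strip() for cell in raw_row]  # removes trailing spaces
        if not any(row):   # skip rows where every cell is blank
            continue
        key = tuple(row)
        if key in seen:    # skip exact duplicates of an earlier row
            continue
        seen.add(key)
        rows.append(row)
    return header, rows
```

Reading the file with `open(path, encoding="utf-8", newline="")` before calling the function also surfaces encoding problems early, since a file that is not valid UTF-8 raises `UnicodeDecodeError`.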
CSV Content Best Practices
Data Organization
- One concept per row
- Consistent data types per column
- Logical column ordering
- Appropriate data granularity
Quality Control
- Validate data accuracy
- Check for formatting consistency
- Remove trailing spaces
- Verify character encoding
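Formatting consistency within a column can be checked mechanically. As one example, the sketch below flags values in a date column that do not follow the YYYY-MM-DD format recommended under Formatting Guidelines; `invalid_dates` is an illustrative name, and the same pattern could be adapted to numeric columns.

```python
from datetime import date


def invalid_dates(values: list[str]) -> list[int]:
    """Return the positions of values that are not valid YYYY-MM-DD dates."""
    bad = []
    for index, value in enumerate(values):
        try:
            # date.fromisoformat accepts additional ISO 8601 forms on
            # Python 3.11+, so the shape check restricts it to YYYY-MM-DD.
            if len(value) != 10 or value[4] != "-" or value[7] != "-":
                raise ValueError(value)
            date.fromisoformat(value)  # also rejects impossible dates
        except ValueError:
            bad.append(index)
    return bad
```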
General Tips
Data Quality
- Regular content updates
- Version control
- Quality assurance reviews
- Consistent terminology
Performance Optimization
- Compress large files
- Split very large datasets
- Remove redundant information
- Optimize images and media
Maintenance
- Schedule regular reviews
- Document changes
- Archive outdated versions
- Monitor chatbot performance
Common Pitfalls to Avoid
Content Issues
- Inconsistent formatting
- Duplicate information
- Outdated content
- Ambiguous answers
Technical Issues
- Wrong file encodings
- Missing data validation
- Improper file organization
- Oversized files
Process Issues
- Lack of version control
- Poor documentation
- Inconsistent naming conventions
- Missing backup procedures
Testing and Validation
Before Upload
- Validate file formats
- Check for errors
- Review content quality
- Test with sample queries
After Upload
- Verify data integration
- Test chatbot responses
- Monitor performance
- Gather user feedback
Regular maintenance and updates of your training data will ensure optimal
chatbot performance. Set up a review schedule to keep your content fresh and
accurate.
Always back up your training data before making major changes or updates to
avoid data loss.
Content Quality Guidelines
Preventing Hallucinations
Explicit References
- Replace pronouns with specific nouns
- Examples:
- ❌ “It provides fast shipping”
- ✅ “Amazon Prime provides fast shipping”
- ❌ “They can help with returns”
- ✅ “Customer service representatives can help with returns”
Specific Quantities and Metrics
- Use exact numbers when possible
- Include units of measurement
- Examples:
- ❌ “Delivery takes a few days”
- ✅ “Delivery takes 3-5 business days”
- ❌ “The product is quite large”
- ✅ “The product dimensions are 10" x 12" x 15"”
Clear Entity References
- Avoid ambiguous references
- Name specific products, services, or departments
- Examples:
- ❌ “Contact support for assistance”
- ✅ “Contact technical support at [email protected] for assistance”
- ❌ “Use the tool to analyze data”
- ✅ “Use the Google Analytics dashboard to analyze website traffic data”
Temporal Clarity
- Specify exact dates or time periods
- Use absolute references instead of relative ones
- Examples:
- ❌ “The feature was recently added”
- ✅ “The feature was added in January 2024”
- ❌ “Updates coming soon”
- ✅ “Updates scheduled for Q2 2024”
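Some of the ❌ patterns above can be caught with a simple heuristic scan before a human review. The following is a rough sketch, not a complete linter: the pattern names and word lists are assumptions drawn from the examples in this section, and the check will produce both false positives and misses.

```python
import re

# Illustrative patterns only; the word lists are assumptions based on the
# examples in this section and are far from exhaustive.
VAGUE_PATTERNS = {
    "leading pronoun": re.compile(r"^(it|they|this|that)\b", re.IGNORECASE),
    "vague quantity": re.compile(r"\b(a few|several|quite|some)\b", re.IGNORECASE),
    "relative time": re.compile(r"\b(recently|soon|currently|lately)\b", re.IGNORECASE),
}


def flag_vague_language(sentence: str) -> list[str]:
    """Return the names of the vague-language patterns found in a sentence."""
    text = sentence.strip()
    return [name for name, pattern in VAGUE_PATTERNS.items() if pattern.search(text)]
```

A sentence that returns an empty list is not guaranteed to be specific; the scan only highlights candidates worth rewriting.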
Writing Style Guidelines
Factual Language
- Use concrete, verifiable statements
- Avoid subjective descriptions
- Examples:
- ❌ “The best solution for your needs”
- ✅ “A solution that offers features A, B, and C”
- ❌ “Amazing performance improvements”
- ✅ “30% faster response time compared to version 1.0”
Consistent Terminology
- Maintain a glossary of approved terms
- Use the same term for the same concept
- Examples:
- ❌ “login, sign-in, log in” (mixing terms)
- ✅ “login” (consistent usage)
- ❌ “customer, user, client” (mixing terms)
- ✅ “user” (consistent usage)
Structured Information
- Break down complex concepts
- Use lists and tables for clarity
- Include step-by-step instructions
- Example:
Context Completeness
- Include all necessary context in each answer
- Avoid assuming prior knowledge
- Examples:
- ❌ “Enable the feature in settings”
- ✅ “Enable the dark mode feature in User Settings > Display > Theme”
- ❌ “Use the API key to authenticate”
- ✅ “Use the API key from your dashboard at api.company.com/keys to authenticate requests”
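A glossary of approved terms, as suggested under Consistent Terminology, can double as input to an automated check. The sketch below assumes a hypothetical glossary format (approved term mapped to a list of variants to avoid) and an illustrative function name, `find_unapproved_terms`; it matches whole words only, so plural forms would need their own entries.

```python
import re


def find_unapproved_terms(text: str, glossary: dict[str, list[str]]) -> dict[str, str]:
    """Map each unapproved variant found in the text to its approved term."""
    found = {}
    for approved, variants in glossary.items():
        for variant in variants:
            # \b keeps "log in" from matching inside longer words.
            if re.search(rf"\b{re.escape(variant)}\b", text, re.IGNORECASE):
                found[variant] = approved
    return found


# Hypothetical glossary built from the examples in this section.
GLOSSARY = {
    "login": ["sign-in", "log in"],
    "user": ["customer", "client"],
}
```

Calling `find_unapproved_terms(answer_text, GLOSSARY)` on each answer returns an empty dictionary when only approved terms are used.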
Data Validation Checklist
Accuracy Verification
- Cross-reference with official documentation
- Verify URLs and email addresses
- Confirm product specifications
- Check policy statements
Completeness Check
- All required fields are filled
- No missing steps in procedures
- Contact information is complete
- Reference links are provided
Consistency Review
- Terminology alignment
- Format standardization
- Unit consistency
- Date and time formats
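Part of the accuracy check, verifying URLs and email addresses, can start with a syntax pass before anyone confirms that the addresses actually work. The sketch below is deliberately loose and makes no network requests; the function names are illustrative, and it assumes URLs in training data should include an explicit http or https scheme.

```python
from urllib.parse import urlsplit


def looks_like_url(value: str) -> bool:
    """Syntax check only: an http(s) scheme plus a host. Does not fetch the URL."""
    try:
        parts = urlsplit(value.strip())
        host = parts.hostname
    except ValueError:
        return False
    return parts.scheme in ("http", "https") and bool(host)


def looks_like_email(value: str) -> bool:
    """Very loose syntax check: one '@', a local part, and a dotted domain."""
    candidate = value.strip()
    if " " in candidate:
        return False
    local, sep, domain = candidate.partition("@")
    return bool(local) and sep == "@" and "@" not in domain and "." in domain.strip(".")
```

Values that pass still need a manual or automated reachability check, since a well-formed address can be outdated.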
When preparing training data, remember that explicit, specific information
helps prevent AI hallucinations and improves response accuracy.