Building a Discord Chatbot with Voice Interaction

Exploring different techniques in programming and bot development can be both educational and fun. One project that caught my interest is a Discord chatbot with voice interaction. It uses Python, discord.py, and several text-to-speech (TTS) and chatbot APIs to make conversations on Discord more engaging.

In this blog post, I'll discuss how to build a Discord bot that joins voice channels, responds to text queries using AI chatbots, and converts those responses to speech. By the end of this post, you'll understand the basics of setting up a Discord bot with voice interaction and how I've implemented it using Python.


Understanding the Basics

This project combines multiple APIs and libraries to create an interactive chatbot experience on Discord. The bot can join voice channels, interact with users through text and voice, and utilize different TTS services to convert text responses into speech.

The bot uses the OpenAI GPT-3.5 model for generating responses to user queries. The responses are then converted to speech using Google's TTS service. The bot connects to voice channels, allowing it to speak the responses directly to users in the channel.


Key Learnings

  • discord.py: Gained experience in setting up and managing a Discord bot, handling events, and interacting with Discord's API.
  • Text-to-Speech Integration: Learned how to integrate various TTS services to convert text responses into speech.
  • Chatbot APIs: Explored different chatbot APIs and how to integrate them into the bot for generating conversational responses.
  • Voice Channel Interaction: Understood the complexities of managing voice connections and playing audio in Discord voice channels.

Implementing the Discord Chatbot

Let's walk through the code to see how the Discord chatbot with voice interaction is implemented in Python. The code consists of several components:

Libraries & Helper Functions
import discord
from discord.ext import commands
from dotenv import load_dotenv
import os
from discord import FFmpegOpusAudio
from googleTTS import googleTTS # This uses Google's TTS
from openAIchatbot import simplechat # This uses OpenAI's chatbot

def open_file(filepath):
    with open(filepath, 'r', encoding='utf-8') as infile:
        return infile.read()

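We start by importing the necessary libraries and functions. The discord.py library handles the bot setup and interactions, python-dotenv loads environment variables, and FFmpegOpusAudio plays audio files through FFmpeg, which must be installed on the machine running the bot. The two local modules, googleTTS and openAIchatbot, provide text-to-speech and conversational responses.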

Bot Setup
# Load variables from the .env file
load_dotenv()
discordAPI = os.getenv("DISCORD")

intents = discord.Intents.default()
intents.voice_states = True
intents.message_content = True

bot = commands.Bot(command_prefix='!', intents=intents)

@bot.event
async def on_ready():
    print(f'We have logged in as {bot.user}')

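Here, we load the Discord API token from a .env file and set up the bot with the intents it needs: voice_states for tracking voice channels and message_content for reading command text. The message_content intent is privileged, so it must also be enabled for the bot in the Discord Developer Portal. The on_ready event confirms the bot has logged in successfully.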
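
The setup code stops short of starting the bot, which in discord.py is done by calling bot.run with the token. Below is a minimal sketch of a fail-fast token lookup to go with it. The helper name require_env is hypothetical and not part of the original project, and the bot.run call is left as a comment so the snippet runs without a live connection.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error.

    Hypothetical helper (not in the original project): os.getenv returns
    None for a missing variable, which would otherwise only surface later
    as a confusing login failure.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is not set; check your .env file"
        )
    return value


# Assumed usage at the bottom of the bot script, after all commands are defined:
# bot.run(require_env("DISCORD"))
```

Placing the run call last matters because it blocks until the bot shuts down, so any command defined after it would never be registered.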

Joining and Leaving Voice Channels
@bot.command()
async def join(ctx):
    # ctx.author.voice is None when the caller is not in a voice channel
    if ctx.author.voice:
        channel = ctx.author.voice.channel
        await channel.connect()
    else:
        await ctx.send("You are not in a voice channel")

@bot.command()
async def leave(ctx):
    # ctx.voice_client is None when the bot has no voice connection in this server
    if ctx.voice_client:
        await ctx.voice_client.disconnect()
    else:
        await ctx.send("I am currently not in a voice channel")

These commands allow the bot to join and leave voice channels. The join command checks whether the user is in a voice channel and, if so, connects the bot to that channel. The leave command disconnects the bot from its current voice channel, or tells the user that it is not in one.
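
One case the join command does not cover is the bot already being connected to a different voice channel, where a second channel.connect() call raises an error in discord.py. Here is a minimal sketch of a helper that moves the existing connection instead. The name connect_or_move is hypothetical; voice_client is assumed to be whatever ctx.voice_client returns (None or a voice client), and channel is assumed to be the caller's voice channel.

```python
async def connect_or_move(voice_client, channel):
    """Connect to `channel`, or move an existing voice connection there.

    Hypothetical helper: `voice_client` is expected to be ctx.voice_client
    (None when the bot has no voice connection) and `channel` the
    author's current voice channel.
    """
    if voice_client is not None:
        # VoiceClient.move_to switches channels without reconnecting.
        await voice_client.move_to(channel)
        return voice_client
    # No existing connection, so open a new one.
    return await channel.connect()
```

Inside the join command this would take the place of the bare connect call, for example: await connect_or_move(ctx.voice_client, ctx.author.voice.channel).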

Handling User Queries
# Conversation history shared by every user of the bot
conversation = []

@bot.command()
async def ask(ctx, *, prompt: str):
    # The keyword-only argument captures everything typed after the command name
    global conversation
    file_path_audio = 'output.mp3'  # file written by googleTTS
    output_text, conversation = simplechat(message=prompt, conversation=conversation)
    await ctx.send(output_text)
    voice = ctx.voice_client
    if voice and ctx.author.voice:
        if voice.is_playing():
            voice.stop()  # play() raises if audio is still playing
        googleTTS(output_text)
        voice.play(FFmpegOpusAudio(file_path_audio))

The ask command handles user queries. It sends the query to the chatbot, receives a response, and posts the response as a text message. It then checks whether both the bot and the user are in a voice channel. If they are, it converts the response to speech and plays it in the voice channel.
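
Both simplechat and googleTTS are ordinary blocking functions, so calling them directly inside an async command pauses the bot's event loop until the network requests finish. The sketch below shows one way to move them into worker threads with asyncio.to_thread (Python 3.9+). The helper name reply_off_loop is hypothetical, and chat_fn and tts_fn are stand-ins for the post's simplechat and googleTTS.

```python
import asyncio


async def reply_off_loop(chat_fn, tts_fn, prompt, conversation, speak):
    """Run the blocking chatbot and TTS calls in worker threads.

    Hypothetical helper: `chat_fn` and `tts_fn` stand in for simplechat
    and googleTTS; `speak` says whether an audio file is wanted.
    """
    # asyncio.to_thread keeps the event loop free to handle other
    # commands and voice heartbeats while the request is in flight.
    reply, conversation = await asyncio.to_thread(
        chat_fn, message=prompt, conversation=conversation
    )
    if speak:
        await asyncio.to_thread(tts_fn, reply)
    return reply, conversation
```

In the ask command, the two direct calls could then become a single awaited call, with speak set to whether the bot and the author are both in a voice channel.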


Chatbot and TTS Functions

The chatbot and TTS functions are implemented separately to handle conversational responses and text-to-speech conversion. The chatbot function can use OpenAI's GPT-3.5 model or MistralAI to generate a response, while the TTS function uses Google's TTS service to convert text to speech.

OpenAI Chatbot
import openai
import os
import dotenv

dotenv.load_dotenv()
openai.api_key = os.getenv("OPENAI")

def open_file(filepath):
    with open(filepath, 'r', encoding='utf-8') as infile:
        return infile.read()

def simplechat(message, conversation):
    # The system prompt lives in prompt.txt so it can be changed without editing code
    prompt = open_file('prompt.txt')
    limit = 2  # number of past messages kept as context
    if len(conversation) > limit:
        conversation = conversation[-limit:]
    conversation.append({"role": "user", "content": message})
    # Prepend the system prompt for this request without storing it in the history
    messagesinput = conversation.copy()
    messagesinput.insert(0, {"role": "system", "content": prompt})
    # openai.ChatCompletion is the interface of openai package versions before 1.0
    chat = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=messagesinput)
    reply = chat.choices[0].message.content
    conversation.append({"role": "assistant", "content": reply})
    return reply, conversation

This function handles the interaction with OpenAI's GPT-3.5 model. It keeps a short rolling conversation history, prepends the system prompt read from prompt.txt, and returns the reply together with the updated history.
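
The history trimming inside simplechat can also be pulled out into a small pure function, which makes the behavior easier to test. This is a minimal sketch that assumes the intent is to keep only the most recent messages; the names trim_history and max_messages are hypothetical.

```python
def trim_history(conversation, max_messages=2):
    """Return a new list holding only the most recent `max_messages` entries.

    Hypothetical helper: the input list is left untouched, so the caller
    decides whether to replace its stored history with the result.
    """
    if max_messages <= 0:
        return []
    return list(conversation[-max_messages:])


history = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
    {"role": "user", "content": "What can you do?"},
]
recent = trim_history(history, max_messages=2)
# recent now holds the assistant greeting and the latest user message
```

A larger max_messages gives the model more context at the cost of more tokens per request.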

Google TTS
from google.cloud import texttospeech

def googleTTS(text):
    # The client reads Google Cloud credentials from the environment
    client = texttospeech.TextToSpeechClient()
    synthesis_input = texttospeech.SynthesisInput(text=text)
    voice = texttospeech.VoiceSelectionParams(
        language_code="en-US",
        ssml_gender=texttospeech.SsmlVoiceGender.NEUTRAL,
        name="en-US-Standard-F"
    )
    audio_config = texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.MP3,
        speaking_rate=1.10,
        pitch=0.8
    )
    response = client.synthesize_speech(
        input=synthesis_input, voice=voice, audio_config=audio_config
    )
    with open("output.mp3", "wb") as out:
        out.write(response.audio_content)
    print('Audio content written to file "output.mp3"')

This function converts text input into speech using Google's TTS service and saves the audio output to a file.
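
Because the function always writes to output.mp3, every response overwrites the previous one, which becomes a problem if a clip may still be in use when the next one is generated. Below is a minimal sketch that writes each clip to a uniquely named temporary file instead. The helper name save_audio is hypothetical, and audio_content stands in for the response.audio_content bytes returned by the TTS client.

```python
import os
import tempfile


def save_audio(audio_content: bytes, directory=None) -> str:
    """Write MP3 bytes to a uniquely named file and return its path.

    Hypothetical helper: `audio_content` is assumed to be the raw bytes
    from the TTS response; `directory` defaults to the system temp dir.
    """
    # mkstemp creates the file and returns an already-open descriptor,
    # so two calls can never end up with the same path.
    fd, path = tempfile.mkstemp(suffix=".mp3", dir=directory)
    with os.fdopen(fd, "wb") as out:
        out.write(audio_content)
    return path
```

The caller would pass the returned path to FFmpegOpusAudio and remove the file once playback has finished, for instance with os.remove.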


Experimentation and Exploration

Building a Discord chatbot with voice interaction leaves plenty of room for experimentation. You can swap in different chatbot APIs, TTS services, and AI models to extend the bot's capabilities. A few ideas to start with:

  • Custom Chatbot Prompts: Modify the prompt.txt file to change the behavior and responses of the chatbot.
  • Different TTS Voices: Experiment with different voice settings in the Google TTS function to change the voice and tone of the responses.
  • Enhanced Interactivity: Add more commands to the bot to handle different types of interactions and responses.
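
When experimenting with custom prompts, it helps if a missing or empty prompt.txt does not crash the bot. Here is a minimal sketch of a prompt loader with a fallback. The names load_prompt and DEFAULT_PROMPT, as well as the fallback text itself, are assumptions for illustration and not part of the original code.

```python
from pathlib import Path

# Assumed fallback text; replace it with whatever default persona you prefer.
DEFAULT_PROMPT = "You are a helpful assistant."


def load_prompt(path="prompt.txt", default=DEFAULT_PROMPT):
    """Return the system prompt from `path`, or `default` if unavailable.

    Hypothetical helper: falls back when the file is missing or contains
    only whitespace.
    """
    prompt_file = Path(path)
    if not prompt_file.is_file():
        return default
    text = prompt_file.read_text(encoding="utf-8").strip()
    return text or default
```

With a loader like this, trying a new persona is a matter of editing prompt.txt, since simplechat reads the file on every call.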

Conclusion

Creating a Discord chatbot with voice interaction is a fascinating project that combines multiple technologies to enhance user engagement. This project provided valuable insights into bot development, text-to-speech integration, and using AI chatbots. I hope this post has inspired you to explore bot development and perhaps experiment with your own implementations.

Thank you for reading, and happy coding! The entire code for this project can be found here.


Posted by: Aidan Vidal

Posted on: June 3, 2024