[ OpenAI / WebsiteQnA tutorial ] Generating Context and Answers with Embeddings (4)

OnnJE 2023. 2. 28. 20:42

4. Generating Context and Answers with Embeddings

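import openai
from openai.embeddings_utils import distances_from_embeddings  # legacy helper (openai < 1.0)

def create_context(
    question, df, max_len=1800, size="ada"
):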
    """
    Create a context for a question by finding the most similar context from the dataframe
    """

    # Get the embeddings for the question
    q_embeddings = openai.Embedding.create(input=question, engine='text-embedding-ada-002')['data'][0]['embedding']

    # Get the distances from the embeddings
    df['distances'] = distances_from_embeddings(q_embeddings, df['embeddings'].values, distance_metric='cosine')


    returns = []
    cur_len = 0

    # Sort by distance and add the text to the context until the context is too long
    for i, row in df.sort_values('distances', ascending=True).iterrows():
        
        # Add the length of the text to the current length
        cur_len += row['n_tokens'] + 4
        
        # If the context is too long, break
        if cur_len > max_len:
            break
        
        # Else add it to the text that is being returned
        returns.append(row["text"])

    # Return the context
    return "\n\n###\n\n".join(returns)

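create_context(question, df, max_len=1800, size='ada') takes question, df, max_len, and size as inputs and finds the data whose context is most similar to the question. (size is accepted but is not used inside the function.) It computes the cosine distance between q_embeddings (the variable holding the embedding of the question) and each embedding in df, stores the results in the distances column, sorts by that column in ascending order, and iterates from the smallest distance upward, appending each row's text to returns. On each step the row's token count (n_tokens) is added to cur_len, and the loop stops once cur_len exceeds max_len, which bounds the size of the returned value. The extra 4 added to cur_len on each step is a small per-row allowance, most likely for the separator tokens inserted between texts. The function returns the elements of returns joined with '\n\n###\n\n'.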
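
The ranking and token-budget logic can be tried without calling the API. The following is a minimal sketch, not the tutorial's code: select_context and cosine_distance are hypothetical names, the rows are plain dictionaries instead of a DataFrame, and the 2-dimensional vectors are made-up stand-ins for real embeddings.

```python
import math

def cosine_distance(a, b):
    """Cosine distance = 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def select_context(q_embedding, rows, max_len=1800, overhead=4):
    """Greedy selection: nearest rows first, stop once the token budget is exceeded."""
    ranked = sorted(rows, key=lambda r: cosine_distance(q_embedding, r["embedding"]))
    picked, cur_len = [], 0
    for row in ranked:
        # Same bookkeeping as create_context: token count plus a small allowance
        cur_len += row["n_tokens"] + overhead
        if cur_len > max_len:
            break
        picked.append(row["text"])
    return "\n\n###\n\n".join(picked)

# Toy data (assumed values): each row costs 10 + 4 = 14 tokens of budget
rows = [
    {"text": "about pricing", "n_tokens": 10, "embedding": [0.0, 1.0]},
    {"text": "about embeddings", "n_tokens": 10, "embedding": [1.0, 0.0]},
    {"text": "about fine-tuning", "n_tokens": 10, "embedding": [0.7, 0.7]},
]
context = select_context([1.0, 0.1], rows, max_len=30)
print(context)
```

With a budget of 30, the two nearest rows fit (14 + 14 = 28) and the third is dropped, so the farthest text never reaches the prompt.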

def answer_question(
    df,
    model="text-davinci-003",
    question="Am I allowed to publish model outputs to Twitter, without a human review?",
    max_len=1800,
    size="ada",
    debug=False,
    max_tokens=150,
    stop_sequence=None
):
    """
    Answer a question based on the most similar context from the dataframe texts
    """
    context = create_context(
        question,
        df,
        max_len=max_len,
        size=size,
    )
    # If debug, print the context that will be sent to the model
    if debug:
        print("Context:\n" + context)
        print("\n\n")

    try:
        # Create a completions using the question and context
        response = openai.Completion.create(
            prompt=f"Answer the question based on the context below, and if the question can't be answered based on the context, say \"I don't know\"\n\nContext: {context}\n\n---\n\nQuestion: {question}\nAnswer:",
            temperature=0,
            max_tokens=max_tokens,
            top_p=1,
            frequency_penalty=0,
            presence_penalty=0,
            stop=stop_sequence,
            model=model,
        )
        return response["choices"][0]["text"].strip()
    except Exception as e:
        print(e)
        return ""

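answer_question(df, model, question, max_len, size, debug, max_tokens, stop_sequence) uses the create_context function defined above to extract the data closest to the given question. (When debug is True, it prints this context.) The extracted context is inserted into the prompt and passed to openai.Completion.create(); the code above uses the text-davinci-003 model to generate the text. If the API call raises an exception, the function prints the error and returns an empty string. Each argument of Completion.create is described below.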
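
To see exactly what the model receives, the prompt can be assembled on its own. This is a small sketch: build_prompt is a hypothetical helper that mirrors the f-string used in answer_question, and the context and question strings are made up.

```python
def build_prompt(context, question):
    """Hypothetical helper: mirrors the f-string passed as `prompt` in answer_question."""
    return (
        "Answer the question based on the context below, and if the question "
        "can't be answered based on the context, say \"I don't know\"\n\n"
        f"Context: {context}\n\n---\n\nQuestion: {question}\nAnswer:"
    )

# The context is what create_context would return: chunks joined by the ### separator
prompt = build_prompt("chunk A\n\n###\n\nchunk B", "What is chunk A?")
print(prompt)
```

Ending the prompt with "Answer:" leaves the completion model to continue from that point, which is why the function only needs to strip whitespace from the returned text.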

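prompt : Provides the context for what the model should generate. It takes a string describing what to generate.
temperature : Controls randomness in text generation. Values range from 0 to 2; higher values produce more varied, creative output, while 0 makes the output nearly deterministic.
max_tokens : Sets the upper limit on how much the model generates, measured in tokens. For English text, one token corresponds to roughly 0.75 words.
top_p : Like temperature, controls the model's randomness. Sampling is restricted to the smallest set of tokens whose cumulative probability reaches top_p (nucleus sampling); a value of 1 applies no restriction.
frequency_penalty : Reduces the model's tendency to repeat itself. It lowers the probability of a token in proportion to how often that token has already been generated.
presence_penalty : Encourages the model to produce new content. It lowers the probability of any token that has already appeared in the generated text. Unlike frequency_penalty, it does not depend on how many times the token appeared.
stop : Specifies the sequence (or sequences) at which the model stops generating text. It is used to cut the output off at a known boundary, and the returned text does not include the stop sequence.
model / engine : Specifies which language model is used to generate the text.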
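
The effect of temperature and the two penalties can be illustrated on a toy set of next-token scores. This is a sketch under assumptions, not the API's implementation: the function names and logit values are hypothetical, and the penalty rule follows the formula given in OpenAI's parameter documentation as I understand it (subtract count × frequency_penalty, plus presence_penalty once for any token already seen).

```python
import math

def apply_penalties(logits, counts, frequency_penalty=0.0, presence_penalty=0.0):
    """Lower the logit of tokens that were already generated.

    frequency_penalty scales with how often the token appeared;
    presence_penalty is a flat deduction for any token seen at least once.
    """
    return {
        tok: logit
        - counts.get(tok, 0) * frequency_penalty
        - (1.0 if counts.get(tok, 0) > 0 else 0.0) * presence_penalty
        for tok, logit in logits.items()
    }

def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; temperature must be > 0 in this sketch."""
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - peak) for tok, v in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

logits = {"cat": 2.0, "dog": 1.0, "bird": 0.0}  # assumed next-token scores
counts = {"cat": 2, "dog": 1}                   # how often each token was generated so far

penalized = apply_penalties(logits, counts, frequency_penalty=0.5, presence_penalty=0.5)
sharp = softmax_with_temperature(logits, temperature=0.2)
flat = softmax_with_temperature(logits, temperature=2.0)
print(penalized)
print(sharp["cat"], flat["cat"])
```

A low temperature concentrates probability on the top-scoring token, while a high one spreads it out. The tutorial's temperature=0 corresponds to the limiting case of always picking the highest-scoring token, which this sketch does not handle because it would divide by zero.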

For details, see the OpenAI API documentation: platform.openai.com

answer_question(df, question="What day is it?", debug=False)
# "I don't know."

answer_question(df, question="What is our newest embeddings model?")
# 'The newest embeddings model is text-embedding-ada-002.'

Example runs are shown above.