[ OpenAI / WebsiteQnA tutorial ] Data Processing - Data Processing with the tiktoken Library (2)

Openai 2023. 2. 28. 20:37

2. Data Processing - Data Processing with the tiktoken Library

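def remove_newlines(serie):
    # Replace real newlines and literal "\n" sequences with a space.
    # regex=False keeps '\\n' from being interpreted as a regex newline.
    serie = serie.str.replace('\n', ' ', regex=False)
    serie = serie.str.replace('\\n', ' ', regex=False)
    # Collapse double spaces (two passes)
    serie = serie.str.replace('  ', ' ', regex=False)
    serie = serie.str.replace('  ', ' ', regex=False)
    return serie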

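remove_newlines(serie) takes a pandas Series as its argument. A Series is a data structure similar to a one-dimensional array; if no index is assigned when the Series is created, the index starts at 0 (see the link below for details). The function replaces newlines and literal \n sequences with spaces and collapses repeated spaces.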

2) Series basics

Earlier we said that pandas has two kinds of data structures, Series and DataFrame. A pandas Series is a data structure similar to a one-dimensional array. Python lists and…

wikidocs.net
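To see what the replacement chain does, here is a minimal sketch that applies the same four replacements to a plain Python string instead of a Series. The helper name remove_newlines_str and the sample strings are made up for illustration and are not part of the tutorial.

```python
def remove_newlines_str(text: str) -> str:
    # Hypothetical plain-string counterpart of remove_newlines(); illustration only.
    text = text.replace('\n', ' ')    # real newline characters
    text = text.replace('\\n', ' ')   # literal backslash followed by "n"
    text = text.replace('  ', ' ')    # first pass over double spaces
    text = text.replace('  ', ' ')    # second pass over double spaces
    return text


sample = "first line\nsecond\\nthird  and   fourth"
cleaned = remove_newlines_str(sample)
print(cleaned)

# A run of five spaces is not fully collapsed by two passes.
five_spaces = remove_newlines_str("a" + " " * 5 + "b")
print(repr(five_spaces))
```

Two passes are enough for runs of up to four spaces. A run of five spaces is reduced to two, so heavily padded text may still contain double spaces after cleaning.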

import os
import pandas as pd
# NOTE: `domain` is assumed to be defined in the earlier crawling step.

# Create a list to store the text files
texts=[]

# Get all the text files in the text directory
for file in os.listdir("text/" + domain + "/"):

    # Open the file and read the text
    with open("text/" + domain + "/" + file, "r") as f:
        text = f.read()

        # Drop the first 11 characters and the last 4 characters of the file name,
        # then replace - and _ with spaces and remove #update.
        texts.append((file[11:-4].replace('-',' ').replace('_', ' ').replace('#update',''), text))

# Create a dataframe from the list of texts
df = pd.DataFrame(texts, columns = ['fname', 'text'])

# Set the text column to be the raw text with the newlines removed
df['text'] = df.fname + ". " + remove_newlines(df.text)
df.to_csv('processed/scraped.csv')
df.head()

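This part runs right after crawl() has finished. It iterates over the .txt files saved by the crawler using os.listdir() and a for loop. In each iteration the file name is stored in the variable file, and the with open ... as statement opens that file and reads its contents into text. Every file in the text/domain/ directory is stored in a list as a (file name, text) tuple, and a pandas DataFrame is built from that list. The text column is then set to the file name followed by the original contents with remove_newlines() applied, and the DataFrame is saved as a CSV file in the processed directory. Finally, .head() displays the first five rows.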

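[ python module / os ] Basic functions of the os module

os is a Python module that provides access to operating-system-dependent functionality such as reading files, writing files, process management, and working with environment variables. Commonly used os functions are as…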

ojhallae.tistory.com
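The file-reading loop and the file[11:-4] slice can be tried in isolation. The sketch below is dependency-free: it writes two sample files into a temporary directory and builds the same (title, text) tuples. The function name load_texts, the prefix_len parameter, and the sample file names are hypothetical; the 11-character prefix is assumed to be "openai.com_", which holds only if the crawled domain is openai.com.

```python
import os
import tempfile


def load_texts(directory, prefix_len=11):
    """Collect (title, text) tuples from every file in `directory`."""
    texts = []
    # sorted() makes the order deterministic; os.listdir() alone does not guarantee one.
    for file in sorted(os.listdir(directory)):
        with open(os.path.join(directory, file), "r", encoding="utf-8") as f:
            text = f.read()
        # Strip the domain prefix and the ".txt" suffix, then tidy the separators.
        title = (file[prefix_len:-4]
                 .replace('-', ' ')
                 .replace('_', ' ')
                 .replace('#update', ''))
        texts.append((title, text))
    return texts


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file names in the style the crawler is assumed to produce.
    samples = {
        "openai.com_blog_new-model.txt": "Hello\nworld",
        "openai.com_research_gpt-4.txt": "Second page",
    }
    for name, body in samples.items():
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as f:
            f.write(body)
    rows = load_texts(tmp)

print(rows)
```

A list of tuples in this shape is what pd.DataFrame(texts, columns=['fname', 'text']) receives in the tutorial code. For a different domain the prefix length would need to change accordingly.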

import tiktoken

# Load the cl100k_base tokenizer which is designed to work with the ada-002 model
tokenizer = tiktoken.get_encoding("cl100k_base")

df = pd.read_csv('processed/scraped.csv', index_col=0)
df.columns = ['title', 'text']

# Tokenize the text and save the number of tokens to a new column
df['n_tokens'] = df.text.apply(lambda x: len(tokenizer.encode(x)))

# Visualize the distribution of the number of tokens per row using a histogram
df.n_tokens.hist()

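The tiktoken module is a BPE tokenizer built for use with OpenAI models. Load a tokenizer with tiktoken.get_encoding("encoding name"). After that, .encode() converts text into tokens and .decode() converts tokens back into text. The table below shows the encoding name that corresponds to each model.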

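The code above loads the tokenizer, applies it to the text column of the CSV saved earlier, and stores the number of tokens in each text in a new n_tokens column. It then plots a histogram of n_tokens with .hist(). A brief summary of tiktoken is available at the link below.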

GitHub - openai/openai-cookbook: Examples and guides for using the OpenAI API

Examples and guides for using the OpenAI API. Contribute to openai/openai-cookbook development by creating an account on GitHub.

github.com

max_tokens = 500

# Function to split the text into chunks of a maximum number of tokens
def split_into_many(text, max_tokens = max_tokens):

    # Split the text into sentences
    sentences = text.split('. ')

    # Get the number of tokens for each sentence
    n_tokens = [len(tokenizer.encode(" " + sentence)) for sentence in sentences]
    
    chunks = []
    tokens_so_far = 0
    chunk = []

    # Loop through the sentences and tokens joined together in a tuple
    for sentence, token in zip(sentences, n_tokens):

        # If the number of tokens so far plus the number of tokens in the current sentence is greater 
        # than the max number of tokens, then add the chunk to the list of chunks and reset
        # the chunk and tokens so far
        if tokens_so_far + token > max_tokens:
            chunks.append(". ".join(chunk) + ".")
            chunk = []
            tokens_so_far = 0

        # If the number of tokens in the current sentence is greater than the max number of 
        # tokens, go to the next sentence
        if token > max_tokens:
            continue

        # Otherwise, add the sentence to the chunk and add the number of tokens to the total
        chunk.append(sentence)
        tokens_so_far += token + 1

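    # Add the last chunk, which the loop above never flushes
    if chunk:
        chunks.append(". ".join(chunk) + ".")

    return chunks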

split_into_many(text, max_tokens = max_tokens) splits a text so that no piece exceeds max_tokens tokens. It splits the input into sentences with split() and stores the token count of each sentence in the n_tokens list. A chunk is a group of related text; split_into_many() groups sentences into chunks bounded by max_tokens and collects them in chunks. tokens_so_far and chunk are working variables used to build each entry of chunks.

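In the for loop, each iteration receives a sentence and its token count, token. If the running total tokens_so_far plus token exceeds max_tokens, the current chunk is joined into a single string with join() and appended to chunks, and the working variables (chunk, tokens_so_far) are reset. If the current sentence alone has more than max_tokens tokens, it is skipped. Otherwise the sentence is added to chunk and its token count is added to tokens_so_far. This is repeated for every sentence extracted from text, and the function returns chunks, the list of sentence groups split by max_tokens.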
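The chunking logic can be traced by hand on a tiny input. The sketch below is a variant that runs without tiktoken: count_tokens is a hypothetical stand-in that counts whitespace-separated words, so the numbers will differ from real cl100k_base token counts. It also makes two small choices of its own: it skips the flush when chunk is still empty, and it appends whatever remains in chunk after the loop so the final sentences are kept.

```python
def count_tokens(text):
    # Stand-in for len(tokenizer.encode(text)); counts words, not BPE tokens.
    return len(text.split())


def split_into_many(text, max_tokens=500, count=count_tokens):
    sentences = text.split('. ')
    n_tokens = [count(" " + sentence) for sentence in sentences]

    chunks, chunk, tokens_so_far = [], [], 0
    for sentence, token in zip(sentences, n_tokens):
        # Flush the current chunk when adding this sentence would exceed the limit.
        if chunk and tokens_so_far + token > max_tokens:
            chunks.append(". ".join(chunk) + ".")
            chunk, tokens_so_far = [], 0

        # A single sentence longer than the limit is dropped entirely.
        if token > max_tokens:
            continue

        chunk.append(sentence)
        tokens_so_far += token + 1

    # Keep the sentences still waiting in the last, unfinished chunk.
    if chunk:
        chunks.append(". ".join(chunk) + ".")
    return chunks


text = "one two three. four five. six seven eight nine. ten"
chunks = split_into_many(text, max_tokens=6)
print(chunks)
```

With max_tokens=6 the first two sentences fit in one chunk, the third triggers a flush, and the last two end up together in the final chunk. Note that oversized sentences are silently discarded, which may matter for pages with very long unpunctuated text.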

shortened = []

# Loop through the dataframe
for row in df.iterrows():

    # If the text is None, go to the next row
    if row[1]['text'] is None:
        continue

    # If the number of tokens is greater than the max number of tokens, split the text into chunks
    if row[1]['n_tokens'] > max_tokens:
        shortened += split_into_many(row[1]['text'])
    
    # Otherwise, add the text to the list of shortened texts
    else:
        shortened.append( row[1]['text'] )
        
df = pd.DataFrame(shortened, columns = ['text'])
df['n_tokens'] = df.text.apply(lambda x: len(tokenizer.encode(x)))
df.n_tokens.hist()

df.iterrows() yields each row of the DataFrame as an (index, row) tuple. The code above processes the text stored in df against max_tokens and collects the results in shortened. When a text has more than max_tokens tokens, split_into_many() is applied and the resulting pieces are added to shortened with the += operator; otherwise the text is appended directly with .append(). After the loop finishes, df is rebuilt from shortened, and each text is tokenized and plotted as a histogram as before.
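The difference between += and .append() in this loop is easy to miss, so here is a small pandas-free sketch of the same control flow. Rows are plain dicts standing in for DataFrame rows, and split_fn is a hypothetical stand-in for split_into_many(); both are assumptions made only to keep the example self-contained.

```python
def shorten(rows, max_tokens, split_fn):
    shortened = []
    for row in rows:
        # Skip rows without text.
        if row["text"] is None:
            continue
        if row["n_tokens"] > max_tokens:
            # += extends the list with every piece returned by the splitter.
            shortened += split_fn(row["text"])
        else:
            # .append() adds the text as one single element.
            shortened.append(row["text"])
    return shortened


rows = [
    {"text": "Short text.", "n_tokens": 2},
    {"text": None, "n_tokens": 0},
    {"text": "First long sentence. Second long sentence", "n_tokens": 6},
]

# Hypothetical splitter: one piece per sentence.
result = shorten(rows, max_tokens=4, split_fn=lambda t: t.split(". "))
print(result)
```

Using .append() on the splitter's return value would instead nest a list inside shortened, and the later pd.DataFrame(shortened, columns=['text']) call would no longer receive one string per row.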
