[Data Analysis] Trying Out Data Preprocessing

임승택 2025. 3. 29. 02:50

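Text Preprocessing Overview

What is text preprocessing?

The work of cleaning and structuring data written in natural language so that it is in a form suitable for machine learning and analysis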

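Natural language processing (NLP) vs. text mining

Natural language processing (NLP): technology for understanding and processing human language. It analyzes the structure and meaning of language so that a computer can work with text or speech data.
Text mining: technology for extracting useful information or patterns from text data. It focuses on discovering meaningful insights or knowledge, mostly in large volumes of text.

The main goal of NLP is to "understand" text or speech by analyzing the structural and semantic properties of human language; for example, it relies on syntactic analysis or semantic interpretation to understand and process a text. The main goal of text mining is to find and analyze hidden information or patterns in a given body of text; for example, tracking how sentiment changes over time or analyzing the topics discussed around a particular subject.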

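Text preprocessing methods and why they are needed

Keeping data consistent
- Goal: reduce differences in format and wording between texts collected from different sources, so the data is consistent.
- Method: unify letter case, remove special characters, clean up unnecessary whitespace, and so on, so the model learns from consistent input.
Removing noise
- Goal: remove unnecessary information that gets in the way of analysis, improving data quality and model performance.
- Method: remove or correct HTML tags, emoji, ad copy, repeated characters, typos, and the like.
Emphasizing important information
- Goal: extract the key information and make analysis more efficient.
- Method: use stopword removal, stemming, and lemmatization to bring out the important words and their meaning.
Reducing dimensionality
- Goal: lower the dimensionality of the data to cut computation cost and improve training speed and model performance.
- Method: shrink the vocabulary by removing unnecessary words or merging variants of the same word.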
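
To make the consistency and dimensionality points concrete, here is a minimal sketch that counts distinct tokens before and after a simple normalization. The three sentences and the helper names `vocabulary` and `normalize` are made up for illustration and are not from the original post.

```python
import re

# Hypothetical mini-corpus; any list of strings would do.
docs = [
    "The Movie was GREAT!!!",
    "the movie was great",
    "Great movie, the   best.",
]

def vocabulary(texts):
    """Return the set of distinct whitespace-separated tokens."""
    return {token for text in texts for token in text.split()}

def normalize(text):
    text = text.lower()                        # unify letter case
    text = re.sub(r"[^\w\s]", "", text)        # drop punctuation
    return re.sub(r"\s+", " ", text).strip()   # collapse whitespace

raw_vocab = vocabulary(docs)
clean_vocab = vocabulary(normalize(d) for d in docs)

print(len(raw_vocab), sorted(raw_vocab))
print(len(clean_vocab), sorted(clean_vocab))
```

On this toy input the vocabulary shrinks from 10 distinct tokens to 5, because variants such as `GREAT!!!`, `Great`, and `great` collapse into one entry.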

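Basic Preprocessing Functions

The `re.sub()` function

* `re.sub()`, provided by the `re` module, replaces every substring that matches a regular expression pattern with another string
* In text preprocessing it is used all the time: removing special characters, filtering digits or letters, tidying whitespace, and so on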

Reference: [Python] How to use the re module (brownbears.tistory.com), https://brownbears.tistory.com/506

re.sub(pattern, replacement, string)

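pattern: the regular expression pattern used to find the text to replace.
replacement: the string that replaces each match. It is often set to the empty string ('') to delete the matched text.
string: the original string the regular expression is applied to.

Let's remove special characters

`[^\w\s]`: matches any character that is neither a "word character" nor a "whitespace character", so replacing matches with '' removes everything else (punctuation and symbols)

\w: a "word character": letters, digits, and the underscore (_). In Python 3 this also covers non-ASCII letters such as Hangul unless the re.ASCII flag is set.
\s: a whitespace character, including space, tab, and newline.
[^...]: negation (not); matches any character that is not listed inside the brackets.
\[at\]: a pattern that matches the literal string `[at]`. Square brackets (`[]`) are metacharacters in regular expressions, so they have to be escaped with a backslash (`\`).
\s+: one or more whitespace characters (space, tab, newline, and so on).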

text = "Hello, World! This is an example... #Python @2023"

import re

clean_text = re.sub(r'[^\w\s]', '', text)
print(clean_text)

=> Hello World This is an example Python 2023

text = "This   is  an    example   with  multiple   spaces."

single_spaced_text = re.sub(r'\s+', ' ', text)
print(single_spaced_text)

=> This is an example with multiple spaces.

text = "Please contact us at: support[at]example.com."

corrected_text = re.sub(r'\[at\]', '@', text)
print(corrected_text)

=> Please contact us at: support@example.com.
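
One detail worth checking when the input is not purely English: in Python 3, `\w` matches Unicode letters by default, so `[^\w\s]` keeps Hangul. The sketch below contrasts the default behaviour with the standard `re.ASCII` flag; the sample sentence is made up for illustration.

```python
import re

text = "가격은 3,000원! Good :)"

# Default (Unicode) matching: Hangul counts as word characters and is kept.
unicode_clean = re.sub(r"[^\w\s]", "", text)

# With re.ASCII, \w is limited to [a-zA-Z0-9_], so Hangul is removed as well.
ascii_clean = re.sub(r"[^\w\s]", "", text, flags=re.ASCII)

print(unicode_clean.split())   # ['가격은', '3000원', 'Good']
print(ascii_clean.split())     # ['3000', 'Good']
```

Which behaviour is right depends on the corpus: for English-only reviews the ASCII variant is a quick filter, while for Korean or mixed text the default is usually the safer choice.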

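The split() function

Splits a string on a given delimiter and returns the pieces as a list

string.split([separator])

If `separator` is omitted, the string is split on runs of any whitespace (spaces, tabs, newlines) and leading/trailing whitespace is ignored. This is not the same as passing `' '` explicitly.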

text = "This is an example text"

words = text.split()  # split the string on whitespace
print(words)

=> ['This', 'is', 'an', 'example', 'text']
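
The difference between calling `split()` with no argument and with an explicit `' '` is easy to miss, so here is a small sketch. The input string is made up; `maxsplit` is a standard parameter of `str.split`.

```python
text = "  This  is\tan example\n"

no_arg = text.split()         # runs of any whitespace act as one separator
with_space = text.split(" ")  # every single space is a separator

print(no_arg)       # ['This', 'is', 'an', 'example']
print(with_space)   # ['', '', 'This', '', 'is\tan', 'example\n']

# An explicit separator keeps empty fields, which is what you want for CSV-like data.
fields = "a,b,,c".split(",")          # ['a', 'b', '', 'c']

# maxsplit limits the number of splits performed.
head_rest = "a b c d".split(maxsplit=1)   # ['a', 'b c d']
```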

The join() function

separator.join(list)

* `separator` is the string placed between the elements (words) of the list

* Examples: a space `' '`, a comma `','`, a newline `'\n'`

words = ['This', 'is', 'joined', 'text']
joined_text = ' '.join(words)
print(joined_text)

=> This is joined text
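
Combining `split()` and `join()` gives a regex-free way to collapse whitespace, and it is also worth knowing that `join()` only accepts strings. A minimal sketch with made-up inputs:

```python
text = "  too   many \t spaces\n here "

# split() drops all whitespace runs; join() puts single spaces back.
normalized = " ".join(text.split())
print(normalized)   # too many spaces here

# join() raises TypeError on non-string items, so convert them first.
numbers = [1, 2, 3]
csv_line = ",".join(str(n) for n in numbers)
print(csv_line)     # 1,2,3
```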

string.replace(old, new)

Replaces occurrences of one substring with another and returns a new string

* `old`: the substring to find and replace

* `new`: the string that replaces `old`

text = "banana is not an apple, but banana is a fruit."
print(text.replace("banana", "mango"))
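
For reference, a short sketch of what the call returns, plus the optional third argument of `str.replace` that limits how many occurrences are replaced. Strings are immutable, so the original stays unchanged.

```python
text = "banana is not an apple, but banana is a fruit."

all_replaced = text.replace("banana", "mango")
first_only = text.replace("banana", "mango", 1)   # replace only the first occurrence

print(all_replaced)   # mango is not an apple, but mango is a fruit.
print(first_only)     # mango is not an apple, but banana is a fruit.
print(text)           # the original string is untouched
```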

`apply()` and `lambda`

# assumes df is a pandas DataFrame with a 'reviewText' column
# handle missing values
df.dropna(subset=['reviewText'], inplace=True)

# count the words in the 'reviewText' column
df['word_count'] = df['reviewText'].apply(lambda x: len(x.split()))

from tqdm import tqdm
tqdm.pandas()

# same computation, with a tqdm progress bar
df['word_count'] = df['reviewText'].progress_apply(lambda x: len(x.split()))
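
The snippet above needs an existing `df`, so here is a self-contained sketch on a tiny made-up DataFrame. Only the column name `reviewText` comes from the post; the rows are hypothetical. It also shows a vectorized alternative with the `.str` accessor, which gives the same counts here.

```python
import pandas as pd

# Hypothetical stand-in for the review data.
df = pd.DataFrame(
    {"reviewText": ["Great product", None, "Works as   expected, would buy again"]}
)

# Drop rows with a missing review; otherwise x.split() would fail on None.
df = df.dropna(subset=["reviewText"]).copy()

# Element-wise: the lambda runs once per row.
df["word_count"] = df["reviewText"].apply(lambda x: len(x.split()))

# Vectorized alternative: split each string, then take the length of each list.
df["word_count_vec"] = df["reviewText"].str.split().str.len()

print(df)
```

`apply` with a lambda is the more flexible of the two, since any Python function can go inside it; the `.str` version is shorter when a built-in string method already does the job.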

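Text Preprocessing

Text preprocessing is the process of turning text data into a form suitable for analysis or machine learning modeling, and it is generally carried out in three stages

Text Cleaning Practice

1. Removing HTML tags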

`re.sub()` vs `BeautifulSoup`

text = """
    <html>
        <head><title>My Title</title></head> 😍😍
        <body><p>Hello, World! &#128512;</p><p>Python is fun. <a href="http://example.com">Example</a></p></body>
    </html>
    """  # example data

# Option 1: re.sub
text_no_html = re.sub(r'<.*?>', ' ', text)
print(text_no_html)


from bs4 import BeautifulSoup

# Option 2: BeautifulSoup
text_no_html = BeautifulSoup(text, "html.parser").get_text()
print(text_no_html)

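[Output screenshots omitted: first the `re.sub()` result, then the `BeautifulSoup` result.] The regex only strips the tags and leaves the character reference `&#128512;` as literal text, whereas `BeautifulSoup` parses the markup and decodes the reference into the actual character.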
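
If you want to stay with `re.sub()` but still decode character references, the standard library's `html.unescape` can be applied after the tags are stripped. This is a sketch on a shortened, made-up snippet, not the approach used in the post.

```python
import html
import re

snippet = "<p>Hello, World! &#128512;</p><p>Tom &amp; Jerry</p>"

no_tags = re.sub(r"<.*?>", " ", snippet)      # strips tags, entities stay as-is
decoded = html.unescape(no_tags)               # decodes &#128512; and &amp;
tidy = re.sub(r"\s+", " ", decoded).strip()    # collapse the leftover spaces

print(no_tags)
print(tidy)
```

For real-world pages with scripts, comments, or malformed markup, a parser such as `BeautifulSoup` is generally the more robust choice; the regex route is fine for simple, well-formed fragments.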

2. Removing emoji

# removes every non-ASCII character (emoji, but also Hangul, accented letters, etc.)
cleaned = re.sub(r'[^\x00-\x7F]+', '', text)
print(cleaned)
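
The `[^\x00-\x7F]` pattern is really "remove everything non-ASCII", which is fine for English-only data but wipes out Korean text too. The sketch below contrasts it with a pattern that targets a few common emoji code point ranges. The ranges in `EMOJI_PATTERN` are an assumption chosen for illustration and are not a complete definition of emoji.

```python
import re

text = "배송이 빨라요 😍😍 great"

# Everything outside ASCII disappears, including the Korean words.
ascii_only = re.sub(r"[^\x00-\x7F]+", "", text)

# Assumed, incomplete set of emoji-related ranges: regional indicators,
# pictographs/emoticons, misc symbols and dingbats, variation selector, ZWJ.
EMOJI_PATTERN = re.compile(
    "[\U0001F1E6-\U0001F1FF\U0001F300-\U0001FAFF\u2600-\u27BF\uFE0F\u200D]+"
)
emoji_removed = EMOJI_PATTERN.sub("", text)

print(repr(ascii_only))
print(repr(emoji_removed))
```

With the targeted pattern the Korean words survive and only the emoji are dropped; for production use, a maintained emoji library or the Unicode emoji data files would be a safer source of ranges.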

3. Removing special characters

cleaned = re.sub(r'[^a-zA-Z0-9\s]', '', text)
print(cleaned)

4. Removing digits

cleaned = re.sub(r'\d+', '', text)
print(cleaned)

5. Converting to lowercase

lowered = text.lower()  # convert everything to lowercase
print(lowered)

6. Handling whitespace

See the previous post: 2025.03.23 - [Data Analysis/Basic] - [Data Analysis] Trying Out Data Cleaning and Analysis (c0mputermaster.tistory.com)

7. Text normalization

# example text  ex) U.S.A. => USA
text = "I love U.S.A. and U.K.!"

# dictionary holding the normalization rules
normalization_dict = {
    "U.S.A.": "USA",
    "U.K.": "UK",
    # add more rules as needed
}

# function definition
def normalize_text(text, normalization_dict):
    for old, new in normalization_dict.items():
        text = text.replace(old, new)
    return text

# run the normalization
normalized_text = normalize_text(text, normalization_dict)

print(normalized_text)
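
One thing to watch with chained `replace()` calls is that the result depends on rule order once one key is a prefix of another. The sketch below shows the problem and one possible workaround: a single regex pass that tries longer keys first. The extra `"U.S."` rule and the function names are hypothetical, added only to trigger the issue.

```python
import re

# 'U.S.' is a prefix of 'U.S.A.', so the order of chained replace() calls matters.
rules = {"U.S.": "US", "U.S.A.": "USA", "U.K.": "UK"}

def normalize_chained(text, rules):
    for old, new in rules.items():
        text = text.replace(old, new)
    return text

def normalize_once(text, rules):
    # Longest keys first, escaped so that '.' is matched literally.
    keys = sorted(rules, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: rules[m.group(0)], text)

text = "I love U.S.A. and U.K.!"
chained = normalize_chained(text, rules)
single_pass = normalize_once(text, rules)

print(chained)       # I love USA. and UK!
print(single_pass)   # I love USA and UK!
```

With only the two rules from the post the simple loop works fine; the single-pass version starts to pay off as the rule dictionary grows.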

8. Expanding contractions

# pip install contractions

import contractions   # e.g. "don't" -> "do not", "I'm" -> "I am"

# text containing contractions
text = "I'm here but I don't know what to do."

# expand the contractions
expanded_text = contractions.fix(text)

print(expanded_text) # I am here but I do not know what to do.
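
To get a feel for what contraction expansion involves, here is a minimal dictionary-based sketch using only the standard library. It is not how the `contractions` package is implemented, and the four-entry mapping is a made-up sample; the real package handles far more forms.

```python
import re

# Tiny hypothetical mapping, for illustration only.
CONTRACTIONS = {"i'm": "I am", "don't": "do not", "it's": "it is", "let's": "let us"}

_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in CONTRACTIONS) + r")\b",
    re.IGNORECASE,
)

def expand(text):
    def repl(match):
        word = match.group(0)
        expanded = CONTRACTIONS[word.lower()]
        # Keep a leading capital, e.g. "Don't" -> "Do not".
        if word[0].isupper():
            expanded = expanded[0].upper() + expanded[1:]
        return expanded
    return _PATTERN.sub(repl, text)

print(expand("I'm here but I don't know what to do."))
# I am here but I do not know what to do.
```

Note that the sketch only matches the straight apostrophe (`'`); text copied from the web often uses the curly one (`’`), which would need to be normalized first.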

Defining a text cleaning function

def clean_text(text):
    text = re.sub(r'<.*?>', '', text)                 # remove HTML tags
    text = re.sub(r'[^\x00-\x7F]+', '', text)         # remove emoji (all non-ASCII characters)
    text = contractions.fix(text)                     # expand contractions
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)        # remove special characters
    text = re.sub(r'\d+', '', text)                   # remove digits
    text = text.lower()                               # convert to lowercase
    text = re.sub(r'\s+', ' ', text).strip()          # collapse whitespace
    return text

# example text
text = """<p>Welcome to the digital age! The world is now <em>more connected</em> than ever before. From <strong>social media</strong> to <a href="https://example.com">e-commerce</a>, digital platforms are transforming our lives.</p>
<p>However, it's not all sunshine and rainbows. We must consider the <span style="color: red;">cybersecurity risks</span> associated with our online activities. Don't share personal info on <u>unsecured sites</u>.</p>
<p>Lastly, remember the importance of 'digital detox'. Spending time away from screens is crucial for our mental health. Let's embrace the digital world responsibly!</p>"""

# print the result
print(clean_text(text))

# welcome to the digital age the world is now more connected than ever before from social media to ecommerce digital platforms are transforming our lives however it is not all sunshine and rainbows we must consider the cybersecurity risks associated with our online activities do not share personal info on unsecured sites lastly remember the importance of digital detox spending time away from screens is crucial for our mental health let us embrace the digital world responsibly
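
The example works because the paragraphs are separated by newlines. When block tags sit directly next to each other, replacing them with an empty string glues the neighbouring words together. The sketch below isolates that one step; `strip_tags` is a hypothetical helper, not part of the post's `clean_text`.

```python
import re

def strip_tags(text, replacement):
    """Remove tags using the given replacement, then collapse whitespace."""
    without_tags = re.sub(r"<.*?>", replacement, text)
    return re.sub(r"\s+", " ", without_tags).strip()

html_text = "<p>First paragraph.</p><p>Second paragraph.</p>"

glued = strip_tags(html_text, "")    # tags replaced with nothing
spaced = strip_tags(html_text, " ")  # tags replaced with a space

print(glued)    # First paragraph.Second paragraph.
print(spaced)   # First paragraph. Second paragraph.
```

Replacing with a space has its own trade-off: a word with inline markup inside it, such as `un<b>believ</b>able`, would be split into pieces, so the right choice depends on the markup in your data.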

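Data Transformation

Tokenization

Splitting a sentence into meaningful units such as words, morphemes, or subwords

1. NLTK (Natural Language Toolkit)

NLTK is one of the best-known open-source Python libraries for natural language processing

# !pip install nltk
# nltk.download('punkt')  # tokenizer data (download once); recent NLTK releases may ask for 'punkt_tab' instead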

import nltk
from nltk.tokenize import word_tokenize

text = "Natural Language Processing is fun!"
tokens = word_tokenize(text)
print(tokens) # ['Natural', 'Language', 'Processing', 'is', 'fun', '!']
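
To see what a tokenizer adds over a plain `split()`, here is a rough regex-based sketch that separates punctuation from words. It is not NLTK's algorithm; `simple_tokenize` is a hypothetical helper for comparison only.

```python
import re

def simple_tokenize(text):
    # \w+ grabs runs of word characters; [^\w\s] grabs each punctuation mark on its own.
    return re.findall(r"\w+|[^\w\s]", text)

sentence = "Natural Language Processing is fun!"

regex_tokens = simple_tokenize(sentence)
whitespace_tokens = sentence.split()

print(regex_tokens)        # ['Natural', 'Language', 'Processing', 'is', 'fun', '!']
print(whitespace_tokens)   # ['Natural', 'Language', 'Processing', 'is', 'fun!']
```

On this sentence the regex output matches the `word_tokenize` result shown above, but the two diverge on contractions: the regex splits "don't" into three pieces, while NLTK's tokenizer has dedicated rules for such cases.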

2. spaCy

* A high-performance, Python-based open-source NLP library that provides models for many languages, including English and Korean

* Optimized for speed and for use in real production settings; it efficiently supports large-scale text processing, part-of-speech tagging, NER, tokenization, and more

* Install Microsoft C++ Build Tools: https://visualstudio.microsoft.com/visual-cpp-build-tools/

* During installation, make sure C++ build tools, Windows 10 SDK, and MSVC v142 are selected

# conda install -c conda-forge spacy
# !python -m spacy download en_core_web_sm  # download the model (needed only once)

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Natural Language Processing is fun!")

tokens = [token.text for token in doc]
print(tokens)

Stopword Removal

Ex) remove words such as `is`, `the`, `a`, `and`, `in`, `of`, `to`, `this`, `that`

1. Using NLTK

from nltk.corpus import stopwords
nltk.download('stopwords')   # stopword list (download once)

text = "This is a sample sentence with some stopwords."
tokens = word_tokenize(text.lower())

# remove stopwords
stop_words = set(stopwords.words('english'))
filtered = [word for word in tokens if word not in stop_words]

print(filtered) #  ['sample', 'sentence', 'stopwords', '.']

2. Removing stopwords with spaCy

nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a sample sentence with some stopwords.")

filtered = [token.text for token in doc if not token.is_stop]
print(filtered)

Depending on the domain, you may need to add to or remove from the default stopwords (customizing the stopword list)

# customize, starting from the default stopword list
custom_stopwords = set(stopwords.words('english'))
custom_stopwords.remove('not')                     # remove: 'not' can matter for the analysis
custom_stopwords.update(['sample', 'sentence'])    # add

# print a few stopwords as a sanity check
print(list(custom_stopwords)[:10])

# apply
filtered = [word for word in tokens if word not in custom_stopwords]
print(filtered)
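
Why keeping 'not' can matter is easiest to see on a sentiment-style example. The sketch below uses a tiny made-up stopword set instead of NLTK's list, so it runs without any downloads; the names `mini_stopwords` and `review_tokens` are hypothetical.

```python
# Hypothetical mini stopword list; NLTK's English list is much longer.
mini_stopwords = {"this", "is", "a", "the", "not", "and"}

review_tokens = ["this", "movie", "is", "not", "good"]

# With the default list, the negation disappears and the meaning flips.
default_filtered = [t for t in review_tokens if t not in mini_stopwords]

# Customized list: keep 'not' so the negation survives.
keep_negation = mini_stopwords - {"not"}
custom_filtered = [t for t in review_tokens if t not in keep_negation]

print(default_filtered)   # ['movie', 'good']
print(custom_filtered)    # ['movie', 'not', 'good']
```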