Understanding scikit-learn Methods for Machine Learning

gggg21 2025. 1. 2. 12:28

1. sklearn.model_selection: Model evaluation and selection

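model_selection provides tools for evaluating and selecting models.
In other words, it is used to split data appropriately (train-test split), evaluate a model (cross-validation), and choose the best hyperparameters.

Main features:

Data splitting:
- train_test_split: the most basic function; splits data into a training set and a test set.
- KFold / StratifiedKFold: data splitters for cross-validation. StratifiedKFold keeps the class proportions roughly equal in every fold.
Model selection and evaluation:
- cross_val_score: evaluates model performance through cross-validation.
- GridSearchCV / RandomizedSearchCV: hyperparameter tuning, by exhaustive grid search or by random sampling of candidates.
- validation_curve / learning_curve: check for overfitting by plotting scores against a hyperparameter value (validation_curve) or against the training set size (learning_curve).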
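To make the difference between the two splitters in the list above concrete, here is a minimal sketch. The toy data (12 samples, 9 of class 0 and 3 of class 1) and the helper name positives_per_fold are made up for illustration and do not come from the original post.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical toy data: 12 samples with imbalanced labels (9 of class 0, 3 of class 1)
X = np.arange(24).reshape(12, 2)
y = np.array([0] * 9 + [1] * 3)


def positives_per_fold(splitter):
    """Count the class-1 samples that land in each test fold."""
    return [int(y[test_idx].sum()) for _, test_idx in splitter.split(X, y)]


# Without shuffling, KFold cuts the data into consecutive blocks
kfold_counts = positives_per_fold(KFold(n_splits=3))
# StratifiedKFold spreads each class evenly across the folds
stratified_counts = positives_per_fold(StratifiedKFold(n_splits=3))

print("KFold class-1 per fold:          ", kfold_counts)
print("StratifiedKFold class-1 per fold:", stratified_counts)
```

With plain KFold all class-1 samples end up in the last fold, so two of the three folds test on a single class. For classification on imbalanced data, StratifiedKFold is usually the safer choice.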

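from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score

# Example data (assumption: the original did not define X and y; the iris dataset stands in)
X, y = load_iris(return_X_y=True)

# Split the data: 80% for training, 20% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 5-fold cross-validation on the training set
model = RandomForestClassifier(random_state=42)
scores = cross_val_score(model, X_train, y_train, cv=5)
print("Cross-validation scores:", scores)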
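The hyperparameter tuning tools listed earlier can be sketched in the same way. The example below is only an illustration: the iris dataset, the parameter grid values, and the use of a shuffled StratifiedKFold as the cv argument are all assumptions chosen to keep it small, not settings recommended by the original post.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Stand-in data (assumption): the iris dataset
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Hypothetical grid: 2 x 2 = 4 candidate settings
param_grid = {"n_estimators": [10, 50], "max_depth": [2, None]}

# Each candidate is scored with 5-fold stratified cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=cv)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Best CV accuracy:", search.best_score_)
# The held-out test set is used only once, after the search is finished
print("Test accuracy:", search.score(X_test, y_test))
```

RandomizedSearchCV takes a similar form but samples a fixed number of candidates (n_iter) instead of trying every combination, which helps when the grid gets large.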

2. sklearn.preprocessing: Data preprocessing

preprocessing provides tools that transform and standardize data so that machine learning algorithms can learn from it more effectively.

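Main features:

Scaling and normalization:
- StandardScaler: standardizes each feature to mean 0 and standard deviation 1.
- MinMaxScaler: rescales each feature to a fixed range, 0 to 1 by default.
- RobustScaler: scales using the median and interquartile range, so it is less sensitive to outliers.
Categorical data conversion:
- OneHotEncoder: one-hot encodes categorical features.
- LabelEncoder: converts class labels to integers. It is meant for the target y; OrdinalEncoder is the counterpart for input features.
Polynomial transformation:
- PolynomialFeatures: adds polynomial and interaction features.
Missing value handling: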
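- SimpleImputer: fills in missing values. It lives in sklearn.impute and replaces the old preprocessing.Imputer, which was removed in scikit-learn 0.22.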

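import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Example data (assumption: X was not defined in the original; a small array stands in)
X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])

# Scaling: each column ends up with mean 0 and standard deviation 1
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Categorical encoding: OneHotEncoder expects a 2D array, hence reshape(-1, 1)
encoder = OneHotEncoder()
categorical_encoded = encoder.fit_transform(np.array(["red", "blue", "green"]).reshape(-1, 1))
# fit_transform returns a sparse matrix by default; toarray() makes it dense for printing
print(categorical_encoded.toarray())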
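The remaining tools from the list (missing value handling, the other scalers, and polynomial features) can be tried on a tiny column. This is a rough sketch under assumptions: the five-value column with one NaN and one outlier is invented for illustration, and the median strategy is just one of the options SimpleImputer offers.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures, RobustScaler

# Hypothetical single-feature column with one missing value and one outlier (100.0)
X = np.array([[1.0], [2.0], [np.nan], [3.0], [100.0]])

# Fill the missing entry with the median of the observed values
X_filled = SimpleImputer(strategy="median").fit_transform(X)

# MinMaxScaler maps the column to [0, 1]; the outlier squeezes the other values near 0
X_minmax = MinMaxScaler().fit_transform(X_filled)

# RobustScaler centers on the median and divides by the interquartile range
X_robust = RobustScaler().fit_transform(X_filled)

# Degree-2 polynomial features without the constant column: [x, x^2]
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X_filled)

print("Filled: ", X_filled.ravel())
print("MinMax: ", X_minmax.ravel().round(3))
print("Robust: ", X_robust.ravel())
print("Poly shape:", X_poly.shape)
```

Comparing the MinMax and Robust rows shows why the choice of scaler matters when outliers are present: with MinMaxScaler the ordinary values are packed into a very narrow band, while RobustScaler keeps them spread out.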

Module	Main role	Example functions
model_selection	Splitting data and evaluating models during training	train_test_split, GridSearchCV
preprocessing	Data preprocessing and transformation	StandardScaler, OneHotEncoder
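The two modules in the table are typically used together. A minimal sketch, assuming the iris dataset as stand-in data: split first with train_test_split, then fit StandardScaler on the training part only and reuse its statistics for the test part. Fitting on the training data alone is a common practice to keep test-set information out of training, not something stated in the original post.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): the iris dataset
X, y = load_iris(return_X_y=True)

# model_selection: split before any preprocessing is fitted
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# preprocessing: learn the mean and standard deviation from the training data only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# Apply the same training statistics to the test data (transform, not fit_transform)
X_test_scaled = scaler.transform(X_test)

print("Train means after scaling:", np.round(X_train_scaled.mean(axis=0), 6))
print("Test means after scaling: ", np.round(X_test_scaled.mean(axis=0), 3))
```

The training columns come out with mean 0, while the test columns are only close to 0, which is expected because the test set was scaled with statistics it did not contribute to.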

Conclusion

**model_selection** focuses on evaluating and selecting models.
**preprocessing** focuses on transforming data so that machine learning models can learn from it more effectively.