Overview

statsmodels is a Python library specialized for statistical modeling, statistical inference, and time series analysis.

https://www.kaggle.com/code/agileteam/t3-regression-py
https://www.kaggle.com/code/agileteam/t3-two-way-anova-py
https://www.kaggle.com/code/agileteam/t3-confidence-interval-py
https://www.kaggle.com/code/agileteam/t3-probability-py
https://www.kaggle.com/code/agileteam/t3-chi2-contingency-py
https://www.kaggle.com/code/agileteam/t3-pmf-py

Statsmodels

Statsmodels is a statistical analysis module that provides functions for estimating a wide range of statistical models, running statistical tests, and exploring data.

import statsmodels.api as sm
import statsmodels.formula.api as smf

Regression Analysis

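Regression analysis is a statistical technique that analyzes the relationship between one or more independent variables and a dependent variable, estimates an equation relating them, and uses that equation for prediction.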

Ordinary Least Squares

Ordinary Least Squares (OLS) estimates the coefficients that minimize the sum of squared differences between the observed and predicted values (the residual sum of squares).
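To make the definition concrete, here is a minimal sketch of what OLS computes, using NumPy only. The data are synthetic and every name in the block (x, y, beta, and so on) is a hypothetical choice for illustration, not part of the original notes. It fits a line by least squares, checks the result against the normal equations, and computes the residual sum of squares.

```python
import numpy as np

# Synthetic data (hypothetical, for illustration only): y = 1 + 2*x + noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=200)

# Design matrix with a leading column of ones for the intercept
X = np.column_stack([np.ones_like(x), x])

# Least-squares solution: the beta that minimizes ||y - X @ beta||^2
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# The same estimate from the normal equations (X'X) beta = X'y
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Residual sum of squares at the fitted coefficients
residuals = y - X @ beta
rss = float(residuals @ residuals)

# Any other coefficient vector gives a larger residual sum of squares
shifted = beta + np.array([0.1, -0.05])
rss_shifted = float(np.sum((y - X @ shifted) ** 2))

print("intercept, slope:", beta)
print("RSS at the OLS estimate:", rss)
print("RSS at a shifted estimate:", rss_shifted)
```

The estimated intercept and slope should land close to the values used to generate the data, and moving the coefficients away from the estimate can only increase the residual sum of squares.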

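# Array-style API: build the design matrix by hand.
# Every remaining column of df must be numeric for sm.OLS to accept it.
X = df.drop(columns=["mpg"])
y = df["mpg"]
X = sm.add_constant(X)  # sm.OLS does not add an intercept automatically

model = sm.OLS(y, X)
result = model.fit()
print(result.summary())

# Formula API: the same kind of model written as an R-style formula.
# The intercept is included automatically and is labelled "Intercept".
model = smf.ols("mpg ~ cylinders + horsepower + weight + acceleration + model_year + origin", data=df)
result = model.fit()
print(result.summary())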
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    mpg   R-squared:                       0.818
Model:                            OLS   Adj. R-squared:                  0.815
Method:                 Least Squares   F-statistic:                     288.8
Date:                Sat, 24 Jan 2026   Prob (F-statistic):          3.67e-139
Time:                        08:51:56   Log-Likelihood:                -1027.0
No. Observations:                 392   AIC:                             2068.
Df Residuals:                     385   BIC:                             2096.
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept      -18.0915      4.669     -3.875      0.000     -27.271      -8.912
cylinders        0.0746      0.244      0.306      0.760      -0.405       0.554
horsepower      -0.0062      0.013     -0.470      0.638      -0.032       0.020
weight          -0.0058      0.001     -9.580      0.000      -0.007      -0.005
acceleration     0.0538      0.099      0.543      0.587      -0.141       0.249
model_year       0.7418      0.051     14.472      0.000       0.641       0.843
origin           1.1927      0.266      4.487      0.000       0.670       1.715
==============================================================================
Omnibus:                       35.452   Durbin-Watson:                   1.271
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               61.984
Skew:                           0.562   Prob(JB):                     3.47e-14
Kurtosis:                       4.591   Cond. No.                     8.54e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 8.54e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
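
The values printed by summary() can also be read directly from the results object, and the multicollinearity note can be followed up with variance inflation factors (VIF). The sketch below uses synthetic data rather than the mpg data from these notes. The column names x1, x2, x3 are hypothetical, and x2 is deliberately built to be almost collinear with x1.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Synthetic data (hypothetical): x2 is built to be nearly collinear with x1
rng = np.random.default_rng(0)
n = 300
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)  # strongly correlated with x1
x3 = rng.normal(size=n)                  # independent predictor
y = 1.0 + 2.0 * x1 + 0.5 * x3 + rng.normal(size=n)
df = pd.DataFrame({"y": y, "x1": x1, "x2": x2, "x3": x3})

result = smf.ols("y ~ x1 + x2 + x3", data=df).fit()

# Quantities shown in summary() are available as attributes or methods
print(result.params)            # "coef" column
print(result.pvalues)           # "P>|t|" column
print(result.conf_int())        # "[0.025  0.975]" columns
print(result.rsquared)          # "R-squared"
print(result.condition_number)  # "Cond. No."

# VIF for each column of the design matrix, skipping the intercept
exog = result.model.exog          # NumPy array, intercept included
names = result.model.exog_names   # column labels in the same order
vif = {
    name: variance_inflation_factor(exog, i)
    for i, name in enumerate(names)
    if name != "Intercept"
}
print(vif)
```

In this synthetic setup x1 and x2 should show very large VIF values while x3 stays near 1. A common rule of thumb treats values above roughly 10 as a sign of problematic collinearity, although a large condition number can also come from predictors measured on very different scales.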

Logistic Regression

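Logistic regression is a statistical model used for classification when the dependent variable is categorical. It models the probability that an observation belongs to a given category.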
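
As a minimal sketch of the binary case, the block below fits a logistic regression with the formula API (smf.logit) on synthetic data. The data-generating coefficients, the column names x and y, and the 0.5 classification threshold are assumptions made for illustration and do not come from the original notes.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic binary outcome (hypothetical): log-odds = -1 + 2*x
rng = np.random.default_rng(0)
n = 1000
x = rng.normal(size=n)
p = 1.0 / (1.0 + np.exp(-(-1.0 + 2.0 * x)))
y = rng.binomial(1, p)
df = pd.DataFrame({"y": y, "x": x})

# Fit by maximum likelihood; disp=0 silences the optimizer messages
result = smf.logit("y ~ x", data=df).fit(disp=0)

print(result.params)          # coefficients on the log-odds scale
print(np.exp(result.params))  # the same coefficients as odds ratios

# Predicted probabilities for new observations, then a 0.5 cutoff
new_data = pd.DataFrame({"x": [0.0, 1.0]})
prob = result.predict(new_data)
labels = (prob >= 0.5).astype(int)
print(prob)
print(labels)
```

The fitted coefficients should be close to the values used to generate the data. predict() returns probabilities, so turning them into class labels requires choosing a threshold, and 0.5 is only one possible choice.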