ロジスティック回帰の小ネタ
小ネタ①:ロジスティック回帰は集計値を用いても同じ結果となる
tmp <- epitools::expand.table(Titanic) library(tidyverse) dat_table <- tmp %>% group_by(Class, Sex, Age) %>% summarise("Yes" = sum(Survived == "Yes"), "No" = sum(Survived == "No"))
集計値
> summary(glm(cbind(Yes, No) ~ ., dat_table, family = binomial("logit")))$coef Estimate Std. Error z value Pr(>|z|) (Intercept) 0.6853195 0.2729943 2.510380 1.206013e-02 Class2nd -1.0180950 0.1959976 -5.194427 2.053519e-07 Class3rd -1.7777622 0.1715666 -10.361935 3.694112e-25 ClassCrew -0.8576762 0.1573389 -5.451138 5.004844e-08 SexFemale 2.4200603 0.1404101 17.235655 1.434207e-66 AgeAdult -1.0615424 0.2440257 -4.350125 1.360598e-05
生データ
> summary(glm(Survived ~ ., tmp, family = binomial("logit")))$coef Estimate Std. Error z value Pr(>|z|) (Intercept) 0.6853195 0.2729934 2.510388 1.205986e-02 Class2nd -1.0180950 0.1959969 -5.194443 2.053331e-07 Class3rd -1.7777622 0.1715657 -10.361993 3.691891e-25 ClassCrew -0.8576762 0.1573387 -5.451147 5.004592e-08 SexFemale 2.4200603 0.1404093 17.235750 1.431830e-66 AgeAdult -1.0615424 0.2440247 -4.350143 1.360490e-05
標準誤差がわずかに異なるものの、回帰係数は一致。
小ネタ②:ロジスティック回帰とロジットに対する線形回帰は異なる
set.seed(123) n <- 10000 x1 <- sample(c("A", "B"), n, replace = T) x1_A <- ifelse(x1 == "A", 1, 0) x1_B <- ifelse(x1 == "B", 1, 0) x2 <- sample(c("X", "Y", "Z"), n, replace = T) x2_X <- ifelse(x2 == "X", 1, 0) x2_Y <- ifelse(x2 == "Y", 1, 0) x2_Z <- ifelse(x2 == "Z", 1, 0) b0 <- 0 b1_B <- 1.0 b2_Y <- 1.5 b2_Z <- 3.0 l <- b0 + x1_B * b1_B + x2_Y * b2_Y + x2_Z * b2_Z p <- exp(l) / (1 + exp(l)) dat <- data.frame( y = rbinom(n, 1, p), x1 = x1, x2 = x2 ) tmp <- dat %>% group_by(x1, x2) %>% summarise("Yes" = sum(y == 1), "No" = sum(y == 0)) %>% mutate(p = Yes / (Yes + No)) %>% mutate(logit = log(p / (1-p)))
> summary(glm(y ~ x1 + x2, dat, family = binomial("logit")))$coef Estimate Std. Error z value Pr(>|z|) (Intercept) 0.003870327 0.04524904 0.0855339 9.318369e-01 x1B 1.005336861 0.05966550 16.8495494 1.057171e-63 x2Y 1.548350891 0.06500045 23.8206168 2.042443e-125 x2Z 3.016559809 0.10570629 28.5371830 4.051563e-179
> summary(lm(logit ~ x1 + x2, tmp))$coef Estimate Std. Error t value Pr(>|t|) (Intercept) -0.03945677 0.09246599 -0.4267166 0.711129289 x1B 1.08845178 0.09246599 11.7713743 0.007139621 x2Y 1.55366958 0.11324725 13.7192702 0.005271007 x2Z 3.08631608 0.11324725 27.2529016 0.001343688
回帰係数は概ね一致するものの標準誤差に大きな違いがあり、結果的にP値は大きく異なる。