Logistic Regression

 

The logit model is a modification of linear regression that ensures the output is a probability between 0 and 1 (classification with two classes) by applying the sigmoid function, which, when graphed, looks like the characteristic S-shaped curve.
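For a single predictor x, the model is:

p = P(Y = 1) = 1 / (1 + e^(-(β0 + β1x)))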

Here we’ve isolated p, the probability that Y = 1, on the left side of the equation. If we want to solve for a nice clean β0 + β1x + ϵ on the right side, so we can straightforwardly interpret the beta coefficients we’re going to learn, we instead end up with the log-odds ratio, or logit, on the left side; hence the name “logit model”:
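log(p / (1 - p)) = β0 + β1x + ϵ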

 

 


 

[Figure: logistic regression in 3D with two independent variables]

 

 

 


The log-odds, or logit, increases by β1 for every one-unit increase in x1; equivalently, the odds are multiplied by e^β1 for every one-unit increase in x1.
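For example, if β1 = 0.7, each one-unit increase in x1 multiplies the odds by e^0.7 ≈ 2. To read this off a fitted model in R (assuming a hypothetical fitted glm object mod):

exp(coef(mod))     # odds ratios: multiplicative change in the odds per one-unit increase
exp(confint(mod))  # confidence intervals on the odds-ratio scale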

 

In logistic regression, the cost function is essentially a measure of how often you predicted 1 when the true answer was 0, or vice versa. Below is a regularized cost function, just like the one we went over for linear regression.
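With h(x) = 1 / (1 + e^(-(β0 + β1x))) the model’s predicted probability, m the number of training examples, and λ the regularization strength, one standard form is:

J(β) = -(1/m) Σ_i [ y_i log h(x_i) + (1 - y_i) log(1 - h(x_i)) ] + (λ/2m) Σ_j β_j²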


The first chunk is the data loss, i.e. how much discrepancy there is between the model’s predictions and reality. The second chunk is the regularization loss, i.e. how much we penalize the model for having large parameters that heavily weight certain features (remember, this prevents overfitting). We’ll minimize this cost function with gradient descent, as above, and voilà! we’ve built a logistic regression model to make class predictions as accurately as possible.
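As a sketch of what that minimization looks like in R (simulated data; the learning rate, λ, and iteration count here are illustrative assumptions, not tuned values):

sigmoid = function(z) 1 / (1 + exp(-z))

set.seed(1)
n = 200
x = rnorm(n)
y = rbinom(n, 1, sigmoid(-1 + 2 * x))  # simulate data with true β0 = -1, β1 = 2
X = cbind(1, x)                        # add an intercept column

beta = c(0, 0)    # initial parameters
alpha = 0.1       # learning rate
lambda = 0.01     # regularization strength

for (i in 1:5000) {
  p = sigmoid(X %*% beta)                         # current predicted probabilities
  grad = t(X) %*% (p - y) / n                     # gradient of the data loss
  grad[-1] = grad[-1] + (lambda / n) * beta[-1]   # regularize, but not the intercept
  beta = beta - alpha * grad
}
beta  # should be close to coef(glm(y ~ x, family = binomial))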

 

 


  • True Positive = TP = values predicted positive that are actually positive
  • True Negative = TN = values predicted negative that are actually negative
  • False Positives = FP = Type I error = values predicted positive that are actually negative
  • False Negatives = FN = Type II error = values predicted negative that are actually positive

 

  • Simple baseline = (TN + FP) / all observations = the accuracy of always predicting the most common outcome (here, negative)
  • Accuracy = (TP + TN) / all observations
  • Sensitivity = True Positive rate = TP / (TP + FN) = the proportion of positive cases we classified correctly
  • Specificity = True Negative rate = TN / (TN + FP) = the proportion of negative cases we classified correctly
  • Fall-out = False Positive rate = FP / (FP + TN)
A model with a higher threshold has lower sensitivity (TP rate) and higher specificity (TN rate).
A model with a lower threshold has higher sensitivity (TP rate) and lower specificity (TN rate).
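Plugging hypothetical counts into the formulas above (TP = 30, TN = 50, FP = 10, FN = 10):

TP = 30; TN = 50; FP = 10; FN = 10
(TN + FP) / (TP + TN + FP + FN)  # simple baseline = 0.60
(TP + TN) / (TP + TN + FP + FN)  # accuracy        = 0.80
TP / (TP + FN)                   # sensitivity     = 0.75
TN / (TN + FP)                   # specificity     ≈ 0.83
FP / (FP + TN)                   # fall-out        ≈ 0.17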

 

AUC: the Area Under the ROC Curve measures model quality, where 1 is the perfect score.
It is the probability that the model ranks a randomly chosen positive case above a randomly chosen negative case; a coin flip would give AUC = 0.5.

 


R:

library(caTools)
set.seed(88)                                                      # fixes one specific, reproducible split
split = sample.split(dataset$targetVariable, SplitRatio = 0.75)   # splits the data 75% / 25%
train = subset(dataset, split == TRUE)                            # the 75% becomes train
test = subset(dataset, split == FALSE)
glm(targetVariable ~ x + y, data = train, family = binomial)
glm(targetVariable ~ ., data = train, family = binomial)
glm(targetVariable ~ . - oneVariable, data = train, family = binomial)  # drops one independent variable
  • Null deviance: the deviance when only using the intercept
  • Residual deviance: the deviance when including all the variables
  • AIC: like adjusted R², but it should preferably be small; it can only be compared between models fit on the same data set
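These all appear in the model summary (using the placeholder names from above):

model = glm(targetVariable ~ ., data = train, family = binomial)
summary(model)  # prints the null deviance, residual deviance, and AIC at the bottom
AIC(model)      # extracts the AIC directly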
predictTrain = predict(logRegModel, type="response")
predictTest = predict(QualityLog, type="response", newdata=qualityTest)
# type="response" gives probabilities, i.e. values between 0 and 1
predictTest = predict(QualityLog, type="response", newdata=qualityTest) > 0.5
# to get FALSE or TRUE
geom_smooth(method = "glm", se = FALSE, method.args = list(family = "binomial"))
# Confusion matrix with threshold of 0.5
table(test$TenYearCHD, predictTest > 0.5)
# Confusion matrix with threshold of 0.7
table(qualityTrain$PoorCare, predictTrain > 0.7)
confusionMatrix(table(pred, actual))  # confusionMatrix() comes from the caret package
  • library(broom)
  • model %>% augment(type.predict = "response")
  • augment(mod, type.predict = "response") %>% mutate(pred = round(.fitted))
  • augment(mod, newdata = new_data, type.predict = "response")
    • data_space <- ggplot(data = MedGPA_binned, aes(x = mean_GPA, y = acceptance_rate)) +
      geom_point() + geom_line()
    • MedGPA_plus <- mod %>% augment(type.predict = "response")
    • data_space + geom_line(data = MedGPA_plus, aes(x = GPA, y = .fitted), color = "red")
  • mutate(log_odds = log(acceptance_rate / (1 - acceptance_rate)))
  • mutate(log_odds_hat = log(.fitted / (1 - .fitted)))
library(ROCR)
ROCRpred = prediction(predictTrain, qualityTrain$PoorCare)
ROCRperf = performance(ROCRpred, "tpr", "fpr")
plot(ROCRperf, colorize=TRUE, print.cutoffs.at=seq(0,1,by=0.1), text.adj=c(-0.2,1.7))
as.numeric(performance(ROCRpred, "auc")@y.values)
# General pattern: prediction(predictedProbs, trainData$targetVariable), then
# performance(predictionObject, "tpr", "fpr")
# AUC = the probability that the model can distinguish between randomly chosen 0s and 1s
# The same on the test set:
predictTest = predict(QualityLog, type = "response", newdata = qualityTest)
ROCRpredTest = prediction(predictTest, qualityTest$PoorCare)
ROCRperfTest = performance(ROCRpredTest, "tpr", "fpr")
plot(ROCRperfTest, colorize=TRUE, print.cutoffs.at=seq(0,1,by=0.06), text.adj=c(-0.2,1.7))
auc = as.numeric(performance(ROCRpredTest, "auc")@y.values)
auc
# Plotting variants, from bare to annotated:
plot(ROCRperf)
plot(ROCRperf, colorize = TRUE)
plot(ROCRperf, colorize = TRUE, print.cutoffs.at = seq(0, 1, 0.1), text.adj = c(-0.2, 1.7))

# draw 3D scatterplot
p <- plot_ly(data = nyc, z = ~Price, x = ~Food, y = ~Service, opacity = 0.6) %>% add_markers()

# draw a plane
p %>% add_surface(x = ~x, y = ~y, z = ~plane, showscale = FALSE)


P:

# Import library
from sklearn.linear_model import LogisticRegression
# Assumes you have X (predictors) and y (target) for the training data set
# and x_test (predictors) for the test data set
# Create the logistic regression object
model = LogisticRegression()
# Train the model using the training set and check the score
model.fit(X, y)
model.score(X, y)
# Equation coefficients and intercept
print('Coefficient: \n', model.coef_)
print('Intercept: \n', model.intercept_)
# Predict output
predicted = model.predict(x_test)