Logistic Regression

 

The logit model is a modification of linear regression that ensures the output is a probability between 0 and 1 (classification with two classes) by applying the sigmoid function, which, when graphed, looks like the characteristic S-shaped curve.
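For a single predictor x, the model is:

p = P(Y = 1) = 1 / (1 + e^(-(β0 + β1x)))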

Here we’ve isolated p, the probability that Y = 1, on the left side of the equation. If we want to solve for a nice clean β0 + β1x + ϵ on the right side, so we can straightforwardly interpret the beta coefficients we’re going to learn, we instead end up with the log-odds ratio, or logit, on the left side; hence the name “logit model”:
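log(p / (1 - p)) = β0 + β1x + ϵ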

 

 


 

[Figure: logistic regression in 3D with two independent variables]

 

 

 


The log-odds, or logit, increases by β1 for every one-unit increase in x1; equivalently, the odds are multiplied by e^β1 for every one-unit increase in x1.
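For example, if β1 = 0.7, each one-unit increase in x1 multiplies the odds by e^0.7 ≈ 2. To read this off a fitted model in R (assuming a hypothetical fitted glm object mod):

exp(coef(mod))     # odds ratios: multiplicative change in the odds per one-unit increase
exp(confint(mod))  # confidence intervals on the odds-ratio scale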

 

In logistic regression, the cost function is essentially a measure of how often you predicted 1 when the true answer was 0, or vice versa. Below is a regularized cost function, just like the one we went over for linear regression.
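With h(x) = 1 / (1 + e^(-(β0 + β1x))) the model’s predicted probability, m the number of training examples, and λ the regularization strength, one standard form is:

J(β) = -(1/m) Σ_i [ y_i log h(x_i) + (1 - y_i) log(1 - h(x_i)) ] + (λ/2m) Σ_j β_j²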


The first chunk is the data loss, i.e. how much discrepancy there is between the model’s predictions and reality. The second chunk is the regularization loss, i.e. how much we penalize the model for having large parameters that heavily weight certain features (remember, this prevents overfitting). We’ll minimize this cost function with gradient descent, as above, and voilà! we’ve built a logistic regression model to make class predictions as accurately as possible.
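As a sketch of what that minimization looks like in R (simulated data; the learning rate, λ, and iteration count here are illustrative assumptions, not tuned values):

sigmoid = function(z) 1 / (1 + exp(-z))

set.seed(1)
n = 200
x = rnorm(n)
y = rbinom(n, 1, sigmoid(-1 + 2 * x))  # simulate data with true β0 = -1, β1 = 2
X = cbind(1, x)                        # add an intercept column

beta = c(0, 0)    # initial parameters
alpha = 0.1       # learning rate
lambda = 0.01     # regularization strength

for (i in 1:5000) {
  p = sigmoid(X %*% beta)                         # current predicted probabilities
  grad = t(X) %*% (p - y) / n                     # gradient of the data loss
  grad[-1] = grad[-1] + (lambda / n) * beta[-1]   # regularize, but not the intercept
  beta = beta - alpha * grad
}
beta  # should be close to coef(glm(y ~ x, family = binomial))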

 

 


  • True Positive = TP = values predicted positive that are actually positive
  • True Negative = TN = values predicted negative that are actually negative
  • False Positives = FP = Type I error = values predicted positive that are actually negative
  • False Negatives = FN = Type II error = values predicted negative that are actually positive

 

  • Simple baseline = (TN + FP) / all observations = the accuracy of always predicting the most common outcome (here, negative)
  • Accuracy = (TP + TN) / all observations
  • Sensitivity = True Positive rate = TP / (TP + FN) = the proportion of positive cases we classified correctly
  • Specificity = True Negative rate = TN / (TN + FP) = the proportion of negative cases we classified correctly
  • Fall-out = False Positive rate = FP / (FP + TN)
A model with a higher threshold has lower sensitivity (TP rate) and higher specificity (TN rate).
A model with a lower threshold has higher sensitivity (TP rate) and lower specificity (TN rate).
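Plugging hypothetical counts into the formulas above (TP = 30, TN = 50, FP = 10, FN = 10):

TP = 30; TN = 50; FP = 10; FN = 10
(TN + FP) / (TP + TN + FP + FN)  # simple baseline = 0.60
(TP + TN) / (TP + TN + FP + FN)  # accuracy        = 0.80
TP / (TP + FN)                   # sensitivity     = 0.75
TN / (TN + FP)                   # specificity     ≈ 0.83
FP / (FP + TN)                   # fall-out        ≈ 0.17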

 

AUC: the Area Under the ROC Curve measures model quality, where 1 is the perfect score.
It is the probability that the model ranks a randomly chosen positive case above a randomly chosen negative case; a coin flip would give AUC = 0.5.

 


R:

library(caTools)
set.seed(88)                                                      # fixes one specific, reproducible split
split = sample.split(dataset$targetVariable, SplitRatio = 0.75)   # splits the data 75% / 25%
train = subset(dataset, split == TRUE)                            # the 75% becomes train
test = subset(dataset, split == FALSE)
glm(targetVariable ~ x + y, data = train, family = binomial)
glm(targetVariable ~ ., data = train, family = binomial)
glm(targetVariable ~ . - oneVariable, data = train, family = binomial)  # drops one independent variable
  • Null deviance: the deviance when only using the intercept
  • Residual deviance: the deviance when including all the variables
  • AIC: like adjusted R², but it should preferably be small; it can only be compared between models fit on the same data set
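These all appear in the model summary (using the placeholder names from above):

model = glm(targetVariable ~ ., data = train, family = binomial)
summary(model)  # prints the null deviance, residual deviance, and AIC at the bottom
AIC(model)      # extracts the AIC directly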
predictTrain = predict(logRegModel, type="response")
predictTest = predict(QualityLog, type="response", newdata=qualityTest)
# type="response" gives probabilities, i.e. values between 0 and 1
predictTest = predict(QualityLog, type="response", newdata=qualityTest) > 0.5
# to get FALSE or TRUE
geom_smooth(method = "glm", se = FALSE, method.args = list(family = "binomial"))
# Confusion matrix with threshold of 0.5
table(test$TenYearCHD, predictTest > 0.5)
# Confusion matrix with threshold of 0.7
table(qualityTrain$PoorCare, predictTrain > 0.7)
confusionMatrix(table(pred, actual))  # confusionMatrix() comes from the caret package
  • library(broom)
  • model %>% augment(type.predict = "response")
  • augment(mod, type.predict = "response") %>% mutate(pred = round(.fitted))
  • augment(mod, newdata = new_data, type.predict = "response")
    • data_space <- ggplot(data = MedGPA_binned, aes(x = mean_GPA, y = acceptance_rate)) +
      geom_point() + geom_line()
    • MedGPA_plus <- mod %>% augment(type.predict = "response")
    • data_space + geom_line(data = MedGPA_plus, aes(x = GPA, y = .fitted), color = "red")
  • mutate(log_odds = log(acceptance_rate / (1 - acceptance_rate)))
  • mutate(log_odds_hat = log(.fitted / (1 - .fitted)))
library(ROCR)
ROCRpred = prediction(predictTrain, qualityTrain$PoorCare)
ROCRperf = performance(ROCRpred, "tpr", "fpr")
plot(ROCRperf, colorize=TRUE, print.cutoffs.at=seq(0,1,by=0.1), text.adj=c(-0.2,1.7))
as.numeric(performance(ROCRpred, "auc")@y.values)
# General pattern: prediction(predictedProbs, trainData$targetVariable), then
# performance(predictionObject, "tpr", "fpr")
# AUC = the probability that the model can distinguish between randomly chosen 0s and 1s
# The same on the test set:
predictTest = predict(QualityLog, type = "response", newdata = qualityTest)
ROCRpredTest = prediction(predictTest, qualityTest$PoorCare)
ROCRperfTest = performance(ROCRpredTest, "tpr", "fpr")
plot(ROCRperfTest, colorize=TRUE, print.cutoffs.at=seq(0,1,by=0.06), text.adj=c(-0.2,1.7))
auc = as.numeric(performance(ROCRpredTest, "auc")@y.values)
auc
# Plotting variants, from bare to annotated:
plot(ROCRperf)
plot(ROCRperf, colorize = TRUE)
plot(ROCRperf, colorize = TRUE, print.cutoffs.at = seq(0, 1, 0.1), text.adj = c(-0.2, 1.7))

# draw 3D scatterplot
p <- plot_ly(data = nyc, z = ~Price, x = ~Food, y = ~Service, opacity = 0.6) %>% add_markers()

# draw a plane
p %>% add_surface(x = ~x, y = ~y, z = ~plane, showscale = FALSE)


P:

# Import library
from sklearn.linear_model import LogisticRegression
# Assumes you have X (predictors) and y (target) for the training data set
# and x_test (predictors) for the test data set
# Create the logistic regression object
model = LogisticRegression()
# Train the model using the training set and check the score
model.fit(X, y)
model.score(X, y)
# Equation coefficients and intercept
print('Coefficient: \n', model.coef_)
print('Intercept: \n', model.intercept_)
# Predict output
predicted = model.predict(x_test)