Table of Contents
- Introduction
- Basic Level
- Intermediate Data Science Interview Questions
- Advanced Level Data Science Interview Questions
Data Science is one of the leading and most popular technologies in the world today. Major organizations are recruiting professionals in this field. With the high demand and low availability of these professionals, data scientists are among the highest-paid IT professionals. This data science interview preparation blog covers the most frequently asked questions in data science job interviews.
In this article, we will cover Data Science interview questions grouped by level of knowledge. The questions are organized into three levels:
- Basic Level Data Science Interview Questions
- Intermediate Level Data Science Interview Questions
- Advanced Level Data Science Interview Questions
Basic Level
1. What is Data Science?
Data Science is a field of computer science that is explicitly concerned with transforming data into information and extracting meaningful insights from it. The reason why data science is so popular is that the kind of insights it allows us to glean from available data has led to some major innovations in many products and companies. Using these insights, we are able to determine the taste of a particular customer, the likelihood of a product being successful in a particular market, etc.
2. Differentiate Between Data Analytics and Data Science
| Data Analytics | Data Science |
| --- | --- |
| Data Analytics is a subset of Data Science. | Data Science is a broad field that includes various subsets such as Data Analytics, Data Mining, and Data Visualization. |
| The goal of data analytics is to illustrate the precise details of retrieved insights. | The goal of data science is to discover meaningful insights from massive datasets and derive the best possible solutions to resolve business issues. |
| Requires knowledge of only basic programming languages. | Requires knowledge of advanced programming languages. |
| It focuses on just finding the solutions. | Data Science not only finds solutions but also predicts future outcomes from past patterns and insights. |
| A data analyst’s job is to analyze data in order to support decision-making. | A data scientist’s job is to provide insightful, easily understandable visualizations and predictions from raw data. |
3. What do you understand by linear regression?
Linear regression is a supervised learning algorithm that models the linear relationship between a dependent and an independent variable. One variable is the predictor or independent variable, and the other is the response or dependent variable.
In linear regression, we try to understand how the dependent variable changes with respect to the independent variable. If there is only one independent variable, it is called simple linear regression, and if there is more than one independent variable it is known as multiple linear regression.
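To make this concrete, below is a minimal sketch of simple linear regression in Python using scikit-learn; the experience-versus-salary numbers are made up purely for illustration:

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: years of experience (independent) vs. salary in thousands (dependent)
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([30, 35, 42, 48, 55])

model = LinearRegression()
model.fit(X, y)

print(model.coef_, model.intercept_)  # slope and intercept of the fitted line
print(model.predict([[6]]))           # prediction for a new value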
4. What do you understand by logistic regression?
Logistic regression is a classification algorithm that can be used when the dependent variable is binary. Let’s take an example: here we are trying to determine whether it will rain or not on the basis of temperature and humidity.
rain = f(temperature, humidity)
Temperature and humidity are the independent variables, and rain is the dependent variable. The logistic regression algorithm produces an S-shaped (sigmoid) curve that maps these inputs to a probability between 0 and 1.
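For illustration, here is a minimal sketch in Python with scikit-learn; the temperature and humidity values below are assumptions, made up only to demonstrate the idea:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: [temperature, humidity] -> rain (1) or no rain (0)
X = np.array([[30, 40], [25, 80], [28, 85], [35, 30], [22, 90], [33, 45]])
y = np.array([0, 1, 1, 0, 1, 0])

clf = LogisticRegression()
clf.fit(X, y)

# Predicted probability of (no rain, rain) for a new day; the sigmoid keeps it between 0 and 1
print(clf.predict_proba([[26, 75]]))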
5. What is a Confusion Matrix?
A confusion matrix is a table used to evaluate the performance of a classification model. For binary classification, it tabulates the actual values against the predicted values in a 2×2 matrix.
- True Positive (TP): records where the actual value is true and the predicted value is also true.
- False Negative (FN): records where the actual value is true, but the predicted value is false.
- False Positive (FP): records where the actual value is false, but the predicted value is true.
- True Negative (TN): records where the actual value is false and the predicted value is also false.
The correct predictions are represented by the true positives and true negatives, while the false positives and false negatives represent the model’s errors. This is how the confusion matrix works.
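As a quick sketch (the labels below are made up for demonstration), scikit-learn provides a confusion_matrix function that builds this table directly:

from sklearn.metrics import confusion_matrix

# Hypothetical actual and predicted labels for a binary classifier
actual    = [1, 0, 1, 1, 0, 1, 0, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows correspond to actual classes and columns to predicted classes
print(confusion_matrix(actual, predicted))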
6. What do you understand by True-positive rate and False-positive rate?
True Positive Rate: In machine learning, the true-positive rate, also called sensitivity or recall, measures the proportion of actual positives that are correctly identified. Formula: True Positive Rate = True Positives / (True Positives + False Negatives)
False Positive Rate: The false-positive rate is the probability of falsely rejecting the null hypothesis for a particular test. It is calculated as the ratio of the number of negative events incorrectly classified as positive (false positives) to the total number of actual negative events. Formula: False Positive Rate = False Positives / (False Positives + True Negatives)
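A minimal sketch of these two formulas in Python; the counts used here are illustrative assumptions:

def true_positive_rate(tp, fn):
    # TPR (sensitivity/recall) = TP / (TP + FN)
    return tp / (tp + fn)

def false_positive_rate(fp, tn):
    # FPR = FP / (FP + TN)
    return fp / (fp + tn)

print(true_positive_rate(tp=40, fn=10))   # 0.8
print(false_positive_rate(fp=5, tn=45))   # 0.1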
7. How is data science different from traditional application programming?
Data Science takes a fundamentally different approach in building systems that provide value compared to traditional application development.
In traditional programming paradigms, we would analyze the input, find the expected output, and write code that contained the necessary rules and statements to convert the given input into the expected output. As we can imagine, these rules were not easy to write, especially for data that even a computer would have difficulty understanding, e.g., images, videos, etc.
Data Science changes this process a bit. In this, we need to access a large amount of data that contains the required inputs and their mapping to the expected outputs. Then, we use data science algorithms, which use mathematical analysis to create rules to map given inputs to outputs.
This process of rule-making is called training. After training, we use some data that was set aside before the training phase to test and check the accuracy of the system. The generated rules are a kind of black box, and we can’t understand how the inputs are being transformed into outputs.
However, if the accuracy is good enough, we can use this system (also known as a model).
As described above, in traditional programming, we had to write rules to map inputs to outputs, but in data science, rules are generated automatically or learned from the given data. This helped solve some of the toughest challenges facing many companies.
8. What is bias in data science?
Bias is a type of error that occurs in a data science model due to using an algorithm that is not robust enough to capture the underlying patterns or trends present in the data. In other words, this error occurs when the data is too complex for the algorithm to understand, so it builds a model that makes simple assumptions. This leads to low accuracy due to underfitting. Algorithms that can lead to high bias are linear regression, logistic regression, etc.
9. What is Dimensionality Reduction?
Dimensionality reduction is the process of converting a dataset with a large number of dimensions (fields) into a dataset with fewer dimensions. This is done by removing certain fields or columns from the dataset. However, this is not done randomly: dimensions or fields are removed only after making sure that the information retained is still sufficient to describe the dataset adequately.
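One common technique for this is principal component analysis (PCA); it is not named in the answer above, so treat the following Python lines as an added, minimal sketch:

import numpy as np
from sklearn.decomposition import PCA

# Hypothetical dataset with 100 rows and 5 features (dimensions)
X = np.random.rand(100, 5)

# Reduce from 5 dimensions down to 2 principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (100, 2)
print(pca.explained_variance_ratio_)  # how much information each component retains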
10. Why is Python used for data cleaning in DS?
Data scientists have to clean and transform huge datasets into a form they can work with. For better results, it is important to deal with unwanted data by removing outliers, malformed records, missing values, inconsistent formatting, etc.
Python libraries such as Matplotlib, Pandas, Numpy, Keras, and SciPy are widely used for data cleaning and analysis. These libraries are used to load and clean the data and perform effective analysis. For example, a CSV file named “student” contains information about the students of an institution like their name, standard, address, phone number, grade, marks, etc.
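Continuing that example, the following pandas sketch shows typical cleaning steps; the file name and column names (‘student.csv’, ‘name’, ‘marks’) are assumptions based on the description above, not part of an actual dataset:

import pandas as pd

# Load the hypothetical "student" dataset
df = pd.read_csv("student.csv")

# Drop duplicate records and rows where every value is missing
df = df.drop_duplicates()
df = df.dropna(how="all")

# Fill missing marks with the column mean and standardize name formatting
df["marks"] = df["marks"].fillna(df["marks"].mean())
df["name"] = df["name"].str.strip().str.title()

print(df.head())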
Intermediate Data Science Interview Questions
11. Two candidates, Aman and Mohan appear for a Data Science Job interview. The probability of Aman cracking the interview is 1/8 and that of Mohan is 5/12. What is the probability that at least one of them will crack the interview?
The probability of Aman getting selected for the interview is 1/8
P(A) = 1/8
The probability of Mohan getting selected for the interview is 5/12
P(B)=5/12
Now, the probability of at least one of them getting selected can be denoted as the union of A and B, which means
P(A U B) =P(A)+ P(B) – (P(A ∩ B)) ………………………(1)
Where P(A ∩ B) stands for the probability of both Aman and Mohan getting selected for the job.
To calculate the final answer, we first have to find out the value of P(A ∩ B)
Since Aman’s and Mohan’s selections are independent events, the probability of both getting selected is the product of their individual probabilities:
P(A ∩ B) = P(A) * P(B) = 1/8 * 5/12 = 5/96
Now, put the value of P(A ∩ B) into equation (1)
P(A U B) = P(A) + P(B) – P(A ∩ B) = 1/8 + 5/12 – 5/96 = 12/96 + 40/96 – 5/96 = 47/96
So, the answer is 47/96.
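A quick way to verify this arithmetic is with Python’s exact fractions:

from fractions import Fraction

p_a = Fraction(1, 8)
p_b = Fraction(5, 12)

# P(A U B) = P(A) + P(B) - P(A)*P(B), assuming the two selections are independent
p_union = p_a + p_b - p_a * p_b
print(p_union)  # 47/96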
12. How is data modeling different from database design?
Data Modeling: This can be considered the first step towards the design of a database. Data modeling builds a conceptual model based on the relationships between different data entities. The process moves from the conceptual stage to the logical model to the physical schema, applying data modeling techniques in a systematic way.
Database Design: It is the process of designing the database. Database design creates an output which is a detailed data model of the database. Strictly speaking, database design includes a detailed logical model of the database, but can also include physical design choices and storage parameters.
13. What is precision?
Precision: When we are applying algorithms for the classification of data or the retrieval of information, precision is the proportion of predicted positive values that are actually positive. Basically, it measures how accurate the model’s positive predictions are. The formula to calculate precision is given below:
Precision = True Positives / (True Positives + False Positives)
14. What is a recall?
Recall: Recall is the proportion of actual positive examples that the model correctly identifies as positive. Recall helps us to see how many positive cases the model misses (i.e., misclassifies as negative). We use the formula below to calculate recall:
Recall = True Positives / (True Positives + False Negatives)
15. What is an F1 score and how to calculate it?
The F1 score is the harmonic mean of precision and recall, and it gives a single measure of a test’s accuracy. If F1 = 1, both precision and recall are perfect. The closer F1 is to 0, the poorer the precision or recall (or both). See below for the formula to calculate the F1 score:
F1 = 2 * (Precision * Recall) / (Precision + Recall)
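A minimal sketch of these three metrics in Python; the counts passed in are illustrative assumptions:

def precision_recall_f1(tp, fp, fn):
    # Precision = TP / (TP + FP), Recall = TP / (TP + FN)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

print(precision_recall_f1(tp=40, fp=10, fn=20))  # (0.8, 0.666..., 0.727...)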
16. What is the p-value?
The p-value is a measure of the statistical significance of an observation. It is the probability of obtaining a result at least as extreme as the one observed, assuming that the null hypothesis is true. We calculate the p-value from the test statistic of a model, and it helps us decide whether to reject the null hypothesis: the smaller the p-value, the stronger the evidence against it.
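As an added illustration beyond the answer above, a one-sample t-test in Python with SciPy returns a p-value directly; the sample values below are made up:

from scipy import stats

# Hypothetical sample; null hypothesis: the population mean is 50
sample = [51.2, 49.8, 52.5, 50.9, 48.7, 53.1, 50.4, 51.8]

t_statistic, p_value = stats.ttest_1samp(sample, popmean=50)
print(p_value)  # a small p-value (e.g., < 0.05) is evidence against the null hypothesis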
17. What is the difference between error and residual error?
An error is the difference between the observed values and the true values of the underlying population, whereas a residual error is the difference between the observed values and the values predicted by the model.
The reason we use residual error to evaluate algorithm performance is that the actual values are never known. Therefore, we use the observed values to measure the error using the residuals. This helps us to estimate the error accurately.
18. Why do we use the summary function?
The summary function in R gives us statistics on the algorithm implemented on a particular dataset. This includes various objects, variables, data attributes, etc. It provides summary statistics for individual items when fed into the function. We use the summary function when we want information about the values present in the dataset. This gives us summary statistics in the following form:
Here, it returns the minimum and maximum values from a specific column of the dataset. Also, it provides median, mean, first quartile, and third quartile values which help us to understand the values better.
19. What is the advantage of dimensionality reduction?
Dimensionality reduction reduces the number of dimensions and the size of the dataset. It discards redundant features while keeping the overall information in the data intact. High-dimensional data is difficult to work with and increases the time needed to process the data and train a model on it. Reducing the dimensions speeds up this process, removes noise, and can also lead to better model accuracy.
Advanced Level Data Science Interview Questions
20. Draw a scatter plot between ‘price’ and ‘carat’ using ggplot. The ‘price’ should be on the y-axis, the ‘carat’ should be on the x-axis, and the ‘color’ of the points should be determined by the ‘cut’.
We will implement the scatter plot using ggplot.
So, we will start with the data layer, then stack the aesthetics layer on top of the data layer. Lastly, we will stack the geometry layer on top of the aesthetics layer.
Code:
library(ggplot2)
ggplot(data = diamonds, aes(x = carat, y = price, col = cut)) + geom_point()
Next, we will use the dplyr package:
library(dplyr) # dplyr is based on the grammar of data manipulation
To extract the records where the price is greater than 1000 and the cut is ‘Ideal’, use the command below:
diamonds %>% filter(price > 1000 & cut == "Ideal") -> diamonds_1000_ideal
To introduce missing values, we will be using the missForest package:
library(missForest)
Using the prodNA function, we will be introducing 25 percent of missing values:
iris.mis<-prodNA(iris,noNA=0.25)
For imputing the ‘Sepal.Length’ column with ‘mean’ and the ‘Petal.Length’ column with ‘median,’ we will be using the Hmisc package and the impute function:
library(Hmisc)
iris.mis$Sepal.Length<-with(iris.mis, impute(Sepal.Length,mean))
iris.mis$Petal.Length<-with(iris.mis, impute(Petal.Length,median))
21. Create a confusion matrix for the model where the threshold value for the probability of predicted values is 0.6, and also find the accuracy of the model.
Accuracy is calculated as:
Accuracy = (True positives + true negatives)/(True positives+ true negatives + false positives + false negatives)
To build a confusion matrix in R, we will use the table function:
table(test$target,pred_heart>0.6)
Here, we are setting the probability threshold at 0.6. Wherever the predicted probability (pred_heart) is greater than 0.6, the record is classified as 1 (positive), and wherever it is less than or equal to 0.6, it is classified as 0 (negative).
Then, we calculate the accuracy using the accuracy formula given above.
22. Build a random forest model on top of this ‘CTG’ dataset, where ‘NSP’ is the dependent variable and all other columns are the independent variables.
We will load the CTG dataset by using read.csv:
data<-read.csv("C:/Users/intellipaat/Downloads/CTG.csv",header=TRUE)
str(data)
Converting the integer type to a factor
data$NSP<-as.factor(data$NSP)
table(data$NSP)
#data partition
library(caret) # createDataPartition() comes from the caret package
set.seed(123)
split_tag<-createDataPartition(data$NSP, p=0.65, list=F)
data[split_tag,]->train
data[-split_tag,]->test
#random forest -1
library(randomForest)
set.seed(222)
rf<-randomForest(NSP~.,data=train)
rf
#prediction
predict(rf,test)->p1
Building confusion matrix and calculating accuracy:
table(test$NSP,p1)
23. Write a function to calculate the Euclidean distance between two points.
The formula for calculating the Euclidean distance between two points (x1, y1) and (x2, y2) is as follows:
√((x1 − x2)² + (y1 − y2)²)
Code for calculating the Euclidean distance is as given below:
def euclidean_distance(P1, P2):
    return (((P1[0] - P2[0]) ** 2) + ((P1[1] - P2[1]) ** 2)) ** 0.5
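For example, calling the function on the points (1, 2) and (4, 6) returns 5.0:

print(euclidean_distance((1, 2), (4, 6)))  # 5.0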
24. Given a list of actual and predicted values, write the code to calculate the root mean square error (RMSE).
To calculate the root mean square error (RMSE), we need to:
- Calculate the errors, i.e., the differences between the actual and predicted values
- Square each of these errors
- Calculate the mean of the squared errors
- Return the square root of the mean
The code in Python for calculating RMSE is given below:
def rmse(actual, predicted):
    errors = [abs(actual[i] - predicted[i]) for i in range(len(actual))]
    squared_errors = [x ** 2 for x in errors]
    mean = sum(squared_errors) / len(squared_errors)
    return mean ** 0.5
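For example, on a small made-up pair of lists:

print(rmse([3, 5, 7], [2, 6, 8]))  # 1.0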
25. Mention various kernel functions that can be used in SVM.
There are four commonly used types of kernel functions in SVM:
- Linear kernel
- Polynomial kernel
- Radial basis function (RBF) kernel
- Sigmoid kernel
26. How to detect if time series data is stationary?
Time series data is considered stationary when its statistical properties, such as the mean and the variance, are constant over time. If the mean and variance of the dataset do not change over a period of time, we can conclude that, for that period, the data is stationary.
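Beyond inspecting the mean and variance, a common statistical check (added here as a sketch, not part of the original answer) is the Augmented Dickey-Fuller test from the statsmodels library:

import numpy as np
from statsmodels.tsa.stattools import adfuller

# Hypothetical series: white noise, which is stationary by construction
series = np.random.normal(loc=0, scale=1, size=200)

adf_statistic, p_value, *_ = adfuller(series)
print(adf_statistic, p_value)  # a small p-value (e.g., < 0.05) suggests the series is stationary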
27. Write code to calculate the accuracy of a binary classification algorithm using your confusion matrix.
We can use the code given below to calculate the accuracy of a binary classification algorithm:
def accuracy_score(matrix):
    true_positives = matrix[0][0]
    true_negatives = matrix[1][1]
    total_observations = sum(matrix[0]) + sum(matrix[1])
    return (true_positives + true_negatives) / total_observations
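For example, with a hypothetical 2×2 confusion matrix laid out as [[TP, FN], [FP, TN]]:

print(accuracy_score([[40, 10], [5, 45]]))  # (40 + 45) / 100 = 0.85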