Editor’s note: In their book, “Data Science: Concepts and Practice,” authors Vijay Kotu and Bala Deshpande explain the core principles and applications of modern data science. Kotu is vice president of analytics at ServiceNow; Deshpande is a data scientist and consultant. This article, which focuses on primary applications of data science, is adapted with permission.
Data science problems can be broadly categorized into supervised or unsupervised learning models.
Supervised or directed data science tries to infer a function or relationship based on known (labeled) training data and uses this function to map new unknown (unlabeled) data—for example, predicting if a current customer will not return based on the behaviors of all customers who have left before. Supervised techniques predict the value of the output variables based on a set of input variables.
To do this, a data science model is developed from a training dataset in which the values of both input and output variables are already known. The model generalizes the relationship between the input and output variables and uses it to make predictions for a dataset where only the input variables are known. The output variable being predicted is also called a class label or target variable. Supervised data science needs a sufficient number of labeled records to learn the model from the data.
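As a minimal sketch of this idea, the Python below learns from a handful of labeled records and predicts a label for a record whose output is unknown. It uses a 1-nearest-neighbor rule, chosen here only because it is short; the text does not prescribe a particular algorithm. The feature names (`monthly_visits`, `support_tickets`), the labels, and all values are hypothetical.

```python
import math

# Hypothetical training data: (monthly_visits, support_tickets) -> outcome.
# Feature names and values are invented for illustration only.
TRAINING = [
    ((12.0, 0.0), "stayed"),
    ((10.0, 1.0), "stayed"),
    ((2.0, 4.0), "left"),
    ((1.0, 5.0), "left"),
]


def predict(features, training=TRAINING):
    """Return the label of the closest training record (1-nearest neighbor)."""
    nearest = min(training, key=lambda rec: math.dist(rec[0], features))
    return nearest[1]


if __name__ == "__main__":
    # New, unlabeled records: only the input variables are known.
    print(predict((11.0, 0.0)))
    print(predict((1.5, 4.0)))
```

With only four labeled records this is a toy; as noted above, a real model needs a sufficient number of labeled records to generalize.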
Unsupervised or undirected data science uncovers hidden patterns in unlabeled data. In unsupervised data science, there are no output variables to predict. The objective of this class of techniques is to find patterns in data based on the relationships between the data points themselves. For example, a group of documents can be clustered based on the similarity of their content. An application can employ both supervised and unsupervised learners.
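The document example can be sketched in a few lines of Python. This is an illustrative greedy grouping based on Jaccard word overlap, not a standard named clustering algorithm; the `threshold` value and the sample documents are assumptions made for the sketch. Note that no labels are supplied: the groups emerge only from how similar the documents are to each other.

```python
def jaccard(a, b):
    """Similarity of two word sets: shared words divided by all words."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_documents(docs, threshold=0.3):
    """Greedy grouping: place each document in the first group whose
    first member is similar enough; otherwise start a new group."""
    word_sets = [set(d.lower().split()) for d in docs]
    groups = []  # each group is a list of document indexes
    for i, words in enumerate(word_sets):
        for group in groups:
            if jaccard(words, word_sets[group[0]]) >= threshold:
                group.append(i)
                break
        else:
            groups.append([i])
    return groups


# Hypothetical documents, invented for illustration.
DOCS = [
    "interest rates rise on home loans",
    "home loans see interest rates fall",
    "team wins the football final",
    "football team loses the final",
]

if __name__ == "__main__":
    print(group_documents(DOCS))
```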
Data science problems can also be classified into tasks such as classification, regression, association analysis, clustering, anomaly detection, recommendation engines, feature selection, time series forecasting, deep learning, and text mining.
- Classification and regression techniques predict a target variable based on input variables. The prediction is based on a generalized model built from a previously known dataset. In regression tasks, the output variable is numeric (e.g., the mortgage interest rate on a loan). Classification tasks predict output variables that are categorical, with two or more discrete values (e.g., the yes or no decision to approve a loan).
- Deep learning is a machine learning subfield that uses artificial neural networks with many layers. It is increasingly used for classification and regression problems.
- Clustering is the process of identifying the natural groupings in a dataset. For example, clustering is helpful in finding natural clusters in customer datasets, which can be used for market segmentation. Since this is an unsupervised technique, it is up to data scientists to investigate why these clusters form in the data and to characterize what makes each cluster distinct.
- Association analysis identifies items that frequently occur together. In retail analytics, it is common to identify pairs of items that are purchased together, so those specific items can be bundled or placed next to each other. This task, also called market basket analysis, is commonly used in cross-selling.
- Recommendation engines are systems that recommend items to users based on individual user preferences.
- Anomaly or outlier detection identifies the data points that are significantly different from other data points in a dataset. One common application is detecting credit card transaction fraud.
- Time series forecasting is the process of predicting the future value of a variable (e.g., temperature) based on historical values that may exhibit a trend and seasonality.
- Text mining is a data science application where the input data is text in the form of documents, messages, emails, or web pages. To apply data science techniques to text data, the text files are first converted into document vectors in which each unique word is an attribute. Once the text is converted to document vectors, standard data science tasks such as classification and clustering can be applied.
- Feature selection is a process in which the attributes in a dataset are reduced to the few that matter most for the task.
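To make one of the tasks in the list above concrete, the association analysis item can be sketched as simple pair counting: for every pair of items, compute the fraction of baskets that contain both (often called support) and keep the pairs above a cutoff. The baskets and the `min_support` value below are hypothetical, and practical tools use more scalable algorithms than this brute-force count.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets, invented for illustration.
BASKETS = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "cereal"},
    {"bread", "butter", "cereal"},
]


def pair_support(baskets):
    """Return {(item_a, item_b): fraction of baskets containing both}."""
    counts = Counter()
    for basket in baskets:
        # sorted() gives every pair a single canonical order.
        counts.update(combinations(sorted(basket), 2))
    return {pair: n / len(baskets) for pair, n in counts.items()}


def frequent_pairs(baskets, min_support=0.5):
    """Pairs bought together in at least min_support of the baskets."""
    support = pair_support(baskets)
    return sorted(pair for pair, s in support.items() if s >= min_support)


if __name__ == "__main__":
    print(frequent_pairs(BASKETS))
```

In this toy data, bread and butter appear together in three of four baskets, so they are the kind of pair a retailer might bundle or shelve side by side.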
A complete data science application can contain elements of both supervised and unsupervised techniques. Unsupervised techniques provide an increased understanding of the dataset and hence are sometimes called descriptive data science.
As an example of how both unsupervised and supervised data science can be combined in an application, consider the following scenario.
In marketing analytics, clustering can be used to find the natural clusters in customer records. Each customer is assigned a cluster label at the end of the clustering process. The resulting labeled customer dataset can then be used with a supervised classification technique to develop a model that assigns a cluster label to any new customer record.
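A rough sketch of this two-step scenario in Python follows. It assumes k-means for the clustering step and a 1-nearest-neighbor rule for the classification step; neither choice comes from the text, and any clustering or classification technique could be substituted. The customer attributes, their values, and the starting centroids are hypothetical.

```python
import math

# Hypothetical customer records: (annual_spend_in_thousands, visits_per_month).
CUSTOMERS = [(1.0, 1.0), (1.5, 2.0), (1.2, 1.5), (8.0, 9.0), (9.0, 8.5), (8.5, 9.5)]


def mean(points):
    """Coordinate-wise average of a non-empty list of points."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def assign(points, centroids):
    """Index of the nearest centroid for each point."""
    return [
        min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
        for p in points
    ]


def kmeans(points, centroids, rounds=10):
    """Plain k-means with caller-supplied starting centroids."""
    for _ in range(rounds):
        labels = assign(points, centroids)
        # An empty cluster keeps its previous centroid.
        centroids = [
            mean([p for p, lab in zip(points, labels) if lab == k] or [centroids[k]])
            for k in range(len(centroids))
        ]
    return assign(points, centroids), centroids


def classify(new_customer, labeled):
    """Supervised step: return the label of the nearest labeled customer."""
    _, label = min(labeled, key=lambda rec: math.dist(rec[0], new_customer))
    return label


# Step 1 (unsupervised): cluster the customers to obtain a label for each one.
LABELS, _ = kmeans(CUSTOMERS, [CUSTOMERS[0], CUSTOMERS[-1]])
# Step 2 (supervised): the labeled records become the training data.
LABELED = list(zip(CUSTOMERS, LABELS))

if __name__ == "__main__":
    print(LABELS)
    print(classify((2.0, 2.0), LABELED), classify((7.5, 8.0), LABELED))
```

The cluster numbers themselves carry no meaning; interpreting each segment is still up to the analyst.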