
DATA SCIENCE METHODOLOGY


BUSINESS UNDERSTANDING

Before devising solutions for a project, a data scientist must develop an understanding of the business in its entirety. Through comprehensive discussions with the clients, they must be able to locate the actual problem, define its aspects, and derive well-founded requirements for the solution.


ANALYTIC APPROACH

Once the problem has been defined, the data scientist can identify which analytic approach may solve it. Identification involves expressing the problem in terms of statistical and machine learning techniques: the type of outcome needed determines the best analytical approach to use.

 

If the problem requires data trends, counts, and summaries, then a statistical model may do the job. However, if it requires probabilities, a predictive model may fare better. A descriptive model works best if the problem concerns the relationships between elements and their environments.
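
As a rough Python illustration of how the outcome type steers this choice, the sketch below answers a summary question with descriptive statistics and a probability question with a predictive model; the DataFrame and column names are hypothetical.

    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    # Hypothetical data: transaction amounts and whether the customer churned.
    df = pd.DataFrame({
        "amount":  [120.0, 80.5, 230.1, 45.9, 310.7],
        "churned": [0, 0, 1, 0, 1],
    })

    # Statistical model territory: trends, counts, and summaries.
    print(df["amount"].describe())

    # Predictive model territory: probabilities for an outcome.
    model = LogisticRegression().fit(df[["amount"]], df["churned"])
    print(model.predict_proba(df[["amount"]])[:, 1])  # P(churned) per row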

DATA REQUIREMENTS

Once the data necessary for problem-solving has been determined, the data scientist must identify the contents, formats, and sources of that data for collection. The data must be capable of answering fundamental questions about the business problem.

DATA COLLECTION

After establishing the requirements, the data scientist must identify the data sources and collect the relevant data. Data, whether structured, unstructured, or semi-structured, may be acquired free of charge from several websites; some offer premade datasets, which often come as CSV or Excel files. However, if the required data is not free of charge, the data scientist must compromise. They must also check for inconsistencies within the collected data and revise the requirements accordingly. Since more data generally leads to better outcomes, extensive data collection is also recommended.
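
A minimal sketch of collecting a premade CSV dataset and checking it for inconsistencies; the URL is hypothetical, not a real data source.

    import pandas as pd

    url = "https://example.com/datasets/customers.csv"  # hypothetical source
    df = pd.read_csv(url)

    # Check the collected data for inconsistencies.
    print(df.shape)               # rows and columns actually received
    print(df.isnull().sum())      # missing values per column
    print(df.duplicated().sum())  # duplicate records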

DATA UNDERSTANDING

Now that the data collection stage is complete, data scientists use descriptive statistics and visualization techniques to better understand the data. They examine the dataset to determine its content, decide whether additional data is needed to fill any gaps, and verify the data's quality.

 

During the Data Understanding stage, data scientists attempt to learn more about the previously collected data. They must verify the data types and learn more about the attributes and their names.
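
A minimal sketch of these checks in Python, assuming the collected data sits in a hypothetical file named collected_data.csv:

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("collected_data.csv")  # hypothetical file from data collection

    print(df.columns.tolist())         # attribute names
    print(df.dtypes)                   # verify each attribute's data type
    print(df.describe(include="all"))  # descriptive statistics
    print(df.isnull().sum())           # gaps that may call for additional data

    df.hist(figsize=(10, 8))           # quick visual check of distributions
    plt.show()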


CRISP-DM's data understanding phase entails taking a closer look at the data available for mining. This step is critical to avoiding unexpected problems during the next phase, data preparation, which is typically the most time-consuming part of a project.


Data understanding entails accessing and exploring data using tables and graphics organized in IBM® SPSS® Modeler with the CRISP-DM project tool. This allows you to assess the data's quality and describe the results of these steps in the project documentation.

DATA PREPARATION

The act of obtaining, integrating, formatting, and organizing data so that it may be used in business intelligence (BI), analytics, and data visualization applications is known as data preparation. Data preparation includes data pre-processing, profiling, cleaning, validation, and transformation; it also frequently entails bringing together data from many internal and external systems.

Information technology (IT), business intelligence (BI), and data management teams prepare data for loading into a data warehouse, NoSQL database, or data lake repository, or when new analytics applications are built. Furthermore, data scientists, other data analysts, and business users can collect and prepare data on their own using self-service data preparation tools.

Informally, data preparation is referred to as data prep. It is also known as data wrangling, although some practitioners use that term narrowly, referring to the cleaning, organizing, and manipulating of data within the broader preparation process and distinguishing it from the data pre-processing stage.
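
As a rough illustration of these steps, the following sketch integrates, cleans, validates, and transforms two hypothetical internal sources; the file and column names are assumptions.

    import pandas as pd

    orders = pd.read_csv("orders.csv")        # hypothetical internal system A
    customers = pd.read_csv("customers.csv")  # hypothetical internal system B

    # Integrate data from multiple systems.
    df = orders.merge(customers, on="customer_id", how="left")

    # Clean and validate.
    df = df.drop_duplicates()
    df = df.dropna(subset=["order_date"])            # required field must be present
    df["order_date"] = pd.to_datetime(df["order_date"])

    # Transform: derive a field useful for downstream analytics.
    df["order_month"] = df["order_date"].dt.to_period("M")

    df.to_csv("prepared_orders.csv", index=False)    # ready for BI or modeling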

MODELING

During the Modeling stage, the data scientist can determine whether their work is ready to go or needs to be reviewed. Modeling is concerned with developing models that are either descriptive or predictive, based on a statistical or machine learning analytic approach. Descriptive modeling is a mathematical method that depicts real-world occurrences and the links between the components that cause them; for example, a descriptive model might show that a person who did one thing is likely to favor another. Predictive modeling is a method that forecasts outcomes using data mining and probability; for example, a predictive model might be used to detect whether an email is spam or not. For predictive modeling, data scientists employ a training set, which is a set of historical data with known outcomes. This process can be repeated as many times as necessary until the model understands the query and its response.
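
For the spam example above, a predictive model could be trained on a tiny hypothetical training set as in this sketch; scikit-learn's CountVectorizer and naive Bayes stand in for whatever technique a real project would choose.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Training set: historical data with known outcomes (hypothetical).
    emails = ["win a free prize now", "meeting at 10 tomorrow",
              "free money claim now", "lunch with the team"]
    labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit(emails, labels)

    print(model.predict(["claim your free prize"]))  # -> [1], flagged as spam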

EVALUATION

Data scientists can evaluate the model in two ways during the Model Evaluation stage: hold-out and cross-validation. In the hold-out method, the dataset is divided into three subsets: a training set, as mentioned in the Modeling stage; a validation set, a subset used to assess the performance of the model built in the training phase; and a test set, a subset used to estimate the model's likely future performance.
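
A minimal sketch of both evaluation methods, using scikit-learn's bundled iris data as a stand-in dataset; the 60/20/20 split ratio is an assumption, not a rule.

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split, cross_val_score

    X, y = load_iris(return_X_y=True)

    # Hold-out: split into training (60%), validation (20%), and test (20%).
    X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
    X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print("validation accuracy:", model.score(X_val, y_val))  # tune against this
    print("test accuracy:", model.score(X_test, y_test))      # estimate of future performance

    # Cross-validation: the same check averaged over 5 folds.
    print("cv accuracy:", cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean())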

 

At this point, you've completed the majority of your data mining project. In the Modeling phase, you've also determined that the models created are technically correct and effective with respect to the data mining success criteria that you defined earlier.

 

However, before proceeding, you should evaluate the results of your efforts using the business success criteria established at the start of the project. This is critical to ensuring that the results you've obtained can be used by your organization. Data mining generates two types of outcomes:

 

  1. The final models chosen in the previous CRISP-DM phase.

  2. Any conclusions or inferences derived from the models, as well as from the data mining process itself; these are referred to as discoveries.

DEPLOYMENT

In data science, the term "deployment" refers to the use of a model to make predictions on new data. It is the process of integrating a machine learning model into an existing production environment so that it can inform practical business decisions. Deployment is considered one of the final phases of the machine learning life cycle and is at times also the most time-consuming. Often, an organization's IT systems are incompatible with traditional model-building languages, requiring data scientists and programmers to rewrite the models, wasting significant time and brainpower.


A model must be effectively deployed into production before it can be used for real decision-making. If you cannot consistently obtain practical insights from your model, the model's impact is severely constrained.


Model deployment is one of the most difficult aspects of reaping the benefits of machine learning. To guarantee that the model functions reliably in the organization's production environment, data scientists, IT teams, software developers, and business professionals must collaborate. This is a significant difficulty since there is frequently a mismatch between the programming language in which a machine learning model is built and the languages that your production system understands, and re-coding the model can add weeks or months to the project timeframe.

 

Accordingly, there are four common ways to deploy a model in data science (a minimal serving sketch follows the list):

  1. Data science tools (or cloud)

  2. Programming language (Java, C, VB, …)

  3. Database and SQL script (TSQL, PL-SQL, …)

  4. PMML (Predictive Model Markup Language)
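
As a minimal sketch of the first option, the snippet below loads a previously saved model and serves predictions over HTTP; joblib, Flask, and the file name model.joblib are assumptions here, not a prescribed stack. A production deployment would sit behind a proper WSGI server rather than Flask's built-in one.

    import joblib
    from flask import Flask, request, jsonify

    app = Flask(__name__)
    model = joblib.load("model.joblib")  # hypothetical model saved after training

    @app.route("/predict", methods=["POST"])
    def predict():
        # Expects JSON like {"features": [[5.1, 3.5, 1.4, 0.2]]}
        features = request.get_json()["features"]
        return jsonify(prediction=model.predict(features).tolist())

    if __name__ == "__main__":
        app.run(port=5000)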

FEEDBACK

Typically, the customer contributes the most at the Feedback stage. Following the deployment stage, customers can decide whether or not the model is appropriate for their purposes. Because the modeling-to-feedback process is highly iterative, data scientists use this input to determine whether or not to adjust the model.
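
One way to picture this iteration in code: an incremental learner is updated with newly labeled feedback records and re-checked before redeployment. All data below is hypothetical.

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    model = SGDClassifier(random_state=0)

    # Initial training data (assumed to come from the Modeling stage).
    X_old = np.array([[0.1], [0.9], [0.2], [0.8]])
    y_old = np.array([0, 1, 0, 1])
    model.partial_fit(X_old, y_old, classes=np.array([0, 1]))

    # New records labeled through customer feedback after deployment.
    X_new = np.array([[0.4], [0.6]])
    y_new = np.array([0, 1])
    model.partial_fit(X_new, y_new)  # adjust the model with the feedback

    print(model.score(X_new, y_new))  # re-evaluate before redeploying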

 

Feedback is a process in which the output of a system is fed back into the system as input, as part of a chain of cause and effect. This changes variables in the system, resulting in different output and, in turn, different feedback, which can be either beneficial or harmful. For a system that requires knowledge of its output in order to improve or deliver a given outcome, feedback is both necessary and beneficial. However, in a system where feedback is unwanted, such as an audio system, it is frequently harmful: when the sound from the speakers (output) is picked up by the microphone (input), it is re-amplified and fed back through the speakers, producing a howling feedback loop.


GROUP 1

"Data is a precious thing and will last longer than the system themselves." 

Tim Berners-Lee

© Group 1. All rights reserved. 2021.
