Part 1 - Introduction

DAT565
Date: January 16, 2024
Last modified: July 10, 2025

Introduction

In this series we’ll cover the very large and mystical field of data science and AI. We’ll spend the first few parts understanding what data science is.

Data science

What is data science? It helps to contrast it with computer science. (Theoretical) computer science is method-driven and often abstract. In data science, by contrast, a question is usually answered with insights derived from the data.

Data scientists aim for results that describe the world, or a particular business sector, for those who need them, whereas in computer science the main concern is usually the correctness of the output.

So data scientists worry about robustness in the face of noisy and incorrect data.

Thus, data science can be seen as a scientific method.

The basic problems of data science

The most basic problems we’ll encounter on our data science journey are:

  • Classification
    • Aims to give discrete labels to items in a dataset.
  • Regression
    • Aims to predict numerical quantities at some point outside the dataset, based on the observations in it.

These are computational problems, not research questions.
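
To make the difference concrete, here is a minimal sketch of both problems in plain Python. The toy data, the function names `classify_1nn` and `fit_line`, and the choice of a nearest-neighbour rule and a least-squares line are all assumptions made for illustration; they are just two simple ways to produce a discrete label and a numerical prediction, not the methods this series commits to.

```python
# Hypothetical toy data, invented for illustration:
# (height_cm, weight_kg) paired with a discrete label.
labeled = [
    ((150.0, 50.0), "small"),
    ((155.0, 55.0), "small"),
    ((180.0, 80.0), "large"),
    ((185.0, 90.0), "large"),
]


def classify_1nn(point, examples):
    """Classification: return the label of the closest known example."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    _, label = min(examples, key=lambda ex: sq_dist(ex[0], point))
    return label


def fit_line(xs, ys):
    """Regression: least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Classification gives a discrete label to a new item.
predicted_label = classify_1nn((152.0, 52.0), labeled)
print(predicted_label)

# Regression predicts a number at a point outside the observed data (x = 5).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
slope, intercept = fit_line(xs, ys)
predicted_value = slope * 5.0 + intercept
print(predicted_value)
```

Note that the output of the first function is always one of the labels already seen in the data, while the second can return any number, including values that never occurred in the observations.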

Classical examples from the data science world are:

Properties of data

So we understand that data is clearly important, but can we classify data and describe its properties? Of course.

  • Structured data
    • As the name suggests, data in a well-defined, standardized format that is easy to manipulate algorithmically.
  • Unstructured data
    • Data that is poorly defined and not organized in a pre-defined manner, e.g. raw text documents, images, videos etc.
  • Quantitative data
    • Consists of numerical values, e.g. height and weight.
  • Categorical data
    • Consists of discrete labels, e.g. gender, hair color or occupation.
  • Big data
    • A buzzword, but it usually refers to data, often unstructured, that is too big for convenient processing because of its sheer volume, multiple sources etc.
  • Little data
    • Even a small dataset can answer the right question, provided that the data is correct.
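
The properties above can be illustrated with a small Python sketch. The records, the free-text note, and the helper `column_kind` are hypothetical and invented for this example. The rule "a column is quantitative if all of its values are numbers" is a simplifying assumption: numerically coded labels such as postal codes would be wrongly treated as quantitative.

```python
from collections import Counter

# Structured data: every record follows the same well-defined format.
# (Hypothetical values, invented for illustration.)
structured = [
    {"height_cm": 172.0, "weight_kg": 68.5, "hair_color": "brown"},
    {"height_cm": 181.0, "weight_kg": 80.2, "hair_color": "blond"},
    {"height_cm": 165.5, "weight_kg": 59.0, "hair_color": "brown"},
]

# Unstructured data: raw text with no pre-defined organization.
unstructured = "Met a tall person with brown hair today, maybe 180 cm?"


def column_kind(records, column):
    """Return 'quantitative' if every value is a number, else 'categorical'.

    Simplifying assumption: the Python type decides the kind of data.
    """
    values = [record[column] for record in records]
    all_numbers = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    return "quantitative" if all_numbers else "categorical"


kinds = {column: column_kind(structured, column) for column in structured[0]}
print(kinds)

# Quantitative values support arithmetic, e.g. a mean.
mean_height = sum(r["height_cm"] for r in structured) / len(structured)
print(mean_height)

# Categorical values are discrete labels, so we count them instead.
hair_counts = Counter(r["hair_color"] for r in structured)
print(hair_counts)
```

The structured records can be processed column by column right away, whereas the unstructured string would first need some form of parsing before height or hair color could be extracted from it.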