Eduzan

Data Collection and Preprocessing

Computer Science / AI & Machine Learning tutorial chapter - Published 2025-12-17

1. Data Types:

  • Structured Data: Organized in a clear, easily searchable format, typically in tables with rows and columns (e.g., databases, spreadsheets).
  • Unstructured Data: Lacks a predefined structure, often text-heavy, such as emails, social media posts, images, or videos.
  • Semi-Structured Data: Has no fixed table schema but carries tags or keys that organize its contents, as in JSON, XML, or log files.
  • Time-Series Data: Data points recorded at successive points in time, often at regular intervals, such as stock prices or sensor readings.
  • Geospatial Data: Information tied to locations on Earth, usually as coordinates, often used in maps and GPS systems.
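
To make the categories above concrete, here is a minimal sketch of how each data type might look once loaded into Python. All values, field names, and the helper same_columns are made up for illustration; they are not taken from any real dataset.

```python
import json
from datetime import datetime, timedelta

# Structured: every record has the same fields, like rows in a table.
structured_rows = [
    {"id": 1, "name": "Ada", "age": 36},
    {"id": 2, "name": "Linus", "age": 54},
]

# Semi-structured: JSON with nested and optional fields (no fixed schema).
semi_structured = json.loads(
    '{"user": "ada", "tags": ["ml", "data"], "profile": {"city": "London"}}'
)

# Unstructured: free text with no predefined fields.
unstructured_text = "Loved the new phone, but the battery drains too fast."

# Time-series: values paired with evenly spaced timestamps (hourly here).
start = datetime(2025, 1, 1, 0, 0)
time_series = [(start + timedelta(hours=h), 20.0 + 0.5 * h) for h in range(3)]

# Geospatial: a record tied to a (latitude, longitude) coordinate pair.
geospatial_point = {"name": "Eiffel Tower", "lat": 48.8584, "lon": 2.2945}


def same_columns(rows):
    """Hypothetical helper: True if every row has exactly the same keys."""
    return len({frozenset(row) for row in rows}) == 1
```

A check like same_columns is one rough way to tell structured records (identical fields in every row) from semi-structured ones, where fields may be missing or nested.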

2. Data Sources:

  • Databases: Relational (e.g., MySQL, PostgreSQL) and non-relational (e.g., MongoDB) databases.
  • APIs: Interfaces provided by services to access their data programmatically (e.g., Twitter API, Google Maps API).
  • Web Scraping: Extracting data from websites using tools like BeautifulSoup or Scrapy.
  • Sensors: IoT devices, wearables, and other hardware that collect real-time data.
  • Public Datasets: Open data repositories like Kaggle, UCI Machine Learning Repository, or government databases.
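
As a rough sketch of what collecting from three of these sources can look like, the example below uses only the Python standard library. It makes several assumptions: an in-memory SQLite database stands in for a server such as MySQL or PostgreSQL, a hard-coded JSON string stands in for a real API response, and the built-in html.parser stands in for a scraping library such as BeautifulSoup. The table, fields, and URLs are invented for illustration.

```python
import json
import sqlite3
from html.parser import HTMLParser

# Database: query an in-memory SQLite table (stand-in for a database server).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("temp", 21.5), ("temp", 22.0), ("humidity", 40.0)],
)
db_rows = conn.execute(
    "SELECT sensor, AVG(value) FROM readings GROUP BY sensor ORDER BY sensor"
).fetchall()
conn.close()

# API: parse a JSON payload (a hard-coded string instead of a network call).
api_response = '{"results": [{"city": "Paris", "temp_c": 18}]}'
api_records = json.loads(api_response)["results"]


# Web scraping: collect link targets from an HTML snippet.
class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href")


collector = LinkCollector()
collector.feed('<p><a href="/data.csv">CSV</a> <a href="/about">About</a></p>')
scraped_links = collector.links
```

In practice the API payload would come from an HTTP request and the HTML from a downloaded page; both are inlined here so the sketch runs without network access.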

Tensors:

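  • Definition: A tensor is a generalization of vectors and matrices to higher dimensions: a scalar is an order-0 tensor, a vector is order-1, and a matrix is order-2. Tensors are used in deep learning (for example, a batch of color images stored as a 4-dimensional array) and in physics.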
  • Notation: Tensors are often denoted by uppercase letters (e.g., T) with indices representing different dimensions, such as T_{ijk}.
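  • Operations: Tensor operations generalize matrix operations to higher dimensions. They include element-wise addition, multiplication, and contraction, which sums over a shared index; ordinary matrix multiplication is a contraction over one index.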
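
The index notation and operations above can be sketched in plain Python with nested lists. This is an illustrative sketch only: the helper names shape, add, contract_last, and matmul are hypothetical, and real projects would normally use an array library such as NumPy instead of nested lists.

```python
# An order-3 tensor T with shape (2, 2, 3), stored so that T[i][j][k]
# corresponds to the component T_{ijk}.
T = [
    [[1, 2, 3], [4, 5, 6]],
    [[7, 8, 9], [10, 11, 12]],
]


def shape(t):
    """Return the shape of a non-empty rectangular nested list as a tuple."""
    dims = []
    while isinstance(t, list):
        dims.append(len(t))
        t = t[0]
    return tuple(dims)


def add(a, b):
    """Element-wise addition of two tensors with the same shape."""
    if isinstance(a, list):
        return [add(x, y) for x, y in zip(a, b)]
    return a + b


def contract_last(t, v):
    """Contract the last index of an order-3 tensor with a vector:
    M[i][j] = sum over k of t[i][j][k] * v[k]."""
    return [
        [sum(t_ijk * v_k for t_ijk, v_k in zip(row, v)) for row in plane]
        for plane in t
    ]


def matmul(a, b):
    """Matrix product written as a contraction over the shared index j:
    C[i][k] = sum over j of a[i][j] * b[j][k]."""
    return [
        [sum(a[i][j] * b[j][k] for j in range(len(b))) for k in range(len(b[0]))]
        for i in range(len(a))
    ]
```

For instance, contract_last(T, [1, 0, 1]) sums the first and third entries along the last index and returns the 2 x 2 matrix [[4, 10], [16, 22]], reducing the order from 3 to 2.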