Data Science Glossary: 52 Terms You Need to Know
The world of Data Science is full of concepts, some widely known and others less so.
No matter how much experience and specialization one has in this field, it develops so rapidly that new definitions constantly emerge, making it difficult to always stay up to date.
Here at BigBlue Data Academy, we see this every day through our contact with our students and instructors, as well as with the network of companies we collaborate with.
For this reason, staying true to our vision of cultivating a culture of innovation and critical thinking and promoting real solutions to complex data challenges, we have created the following Data Science glossary.
Our aim is to make the world of Data Science as accessible as possible. In short, we will briefly cover 52 definitions and tools in alphabetical order.
Let's get started!
An algorithm is a logical procedure for solving a problem or performing a task. Algorithms are used in various fields of computer science, including data science, for automating and simplifying complex processes.
Apache Spark is a unified analytics engine for processing large datasets. It can process both batch and streaming data and provides a variety of APIs for data analysis, including machine learning.
An API (Application Programming Interface) is an interface that lets one piece of software communicate with another. For example, the data you enter into an application or platform is sent through an API to a server, which returns the responses and results you requested.
Artificial Intelligence (AI)
Artificial Intelligence (AI) is a technology that essentially mimics human intelligence to create automated systems that perform various tasks. It is used in a variety of applications, including natural language processing, machine learning, and robotics.
Artificial Neural Networks (ANN)
Artificial Neural Networks (ANN) are a type of machine learning algorithm inspired by the structure and function of the human brain. ANNs can learn from data and make predictions without being explicitly programmed.
Business Analytics is the process in which a company uses methods to process data and draw conclusions in order to make better business decisions. These methods can vary and may include data mining, predictive analytics, machine learning, and more.
Big Data refers to datasets that are too large or complex to be processed using traditional data processing methods. They can be collected from various sources, including social media and financial transactions.
Business Intelligence (BI)
Business Intelligence refers to the process in which a business combines data from various sources to analyze it and make decisions. The ultimate goal is to present complex data in a simple manner and make the appropriate decisions.
Classification is a machine learning process that involves categorizing data into predefined classes or categories based on their characteristics. It is used for tasks such as spam detection, sentiment analysis, and image recognition.
Clustering, or cluster analysis, is a machine learning technique used for data processing. It is essentially an unsupervised learning algorithm that identifies elements with common characteristics and organizes them into groups or clusters.
Computer science is the study of algorithms, data structures, and principles that govern computer systems. It provides the theoretical and practical foundation for many data science techniques and tools.
Computer vision is a field of artificial intelligence that trains and enables computers to extract meaningful information from visual sources, such as digital images and videos.
A dashboard is a visual representation of important data and key performance indicators (KPIs) that allows users to monitor and analyze information at a glance. Data scientists often create dashboards to present information to colleagues or clients.
A database is an organized structure in which data is collected and categorized for better utilization.
Data analytics involves the process of analyzing various data to extract meaningful information and support decision-making. It encompasses a range of techniques, from basic statistical analysis to advanced machine learning.
A Data Architect is a professional who deals with the design, creation, development, and management of an organization's data architecture. Their primary responsibility is to create a design that aligns with the company's data needs.
Data cleaning is the process of correcting or removing data that is incorrect, in the wrong format, altered, duplicated, or incomplete. The purpose of data cleaning is to ensure that algorithms run correctly and produce the desired results.
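As a rough illustration of the steps described above, the sketch below drops an incomplete record, fixes inconsistent formatting, and removes a duplicate. The records and field names are hypothetical, invented only for this example.

```python
# Hypothetical raw records, invented for illustration.
records = [
    {"name": "Alice", "age": "34"},
    {"name": "alice ", "age": "34"},  # duplicate with stray whitespace and case
    {"name": "Bob", "age": None},     # incomplete: age is missing
    {"name": "Carol", "age": "29"},
]

cleaned = []
seen = set()
for record in records:
    # Remove incomplete records.
    if record["age"] is None:
        continue
    # Correct the format: trim whitespace and normalize capitalization.
    name = record["name"].strip().title()
    key = (name, record["age"])
    # Remove duplicates.
    if key in seen:
        continue
    seen.add(key)
    cleaned.append({"name": name, "age": int(record["age"])})

print(cleaned)
```

In practice the rules for what counts as "incorrect" or "duplicate" depend on the dataset, so this is only one possible set of choices.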
A Data Engineer is a professional who specializes in managing data, specifically large volumes of data. Their responsibilities revolve around creating systems for collecting, storing, and analyzing data.
A Data Lake is a large repository of raw data, whether structured, semi-structured, or unstructured, that can come from various sources and is stored in formats such as JSON, CSV, and Parquet.
Data mining is the process of extracting information through the analysis of a large volume of data, with the goal of identifying trends and patterns.
Data science is a multidisciplinary field that combines elements of statistics, computer science, and domain knowledge to extract conclusions from data, often using techniques such as machine learning.
A data warehouse is a central repository of structured data that has been processed, transformed, and modeled to meet specific business needs.
Data visualization is the representation of data and information in a visual form to facilitate the understanding and interpretation of complex and intricate data.
Deep learning is a subset of machine learning methods that are based on artificial neural networks. The adjective "deep" refers to the use of multiple layers in the network. The methods can be supervised, semi-supervised, or unsupervised.
ELT is a process where data is first extracted from source systems, loaded into a data storage space, and then transformed for analysis. It differs from ETL, where transformation takes place before loading.
ETL is a process that involves extracting data from source systems, transforming it to fit a target schema, and loading the data into a destination database.
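To make the three ETL steps concrete, here is a minimal sketch using only Python's standard library. The CSV text, the table name payments, and its columns are assumptions made up for this example; a real pipeline would read from actual source systems.

```python
import csv
import io
import sqlite3

# Hypothetical source data, standing in for a real source system.
raw_csv = "name,amount\nalice, 10.5 \nbob,3\n"

# Extract: read the raw rows from the source.
rows = list(csv.DictReader(io.StringIO(raw_csv)))

# Transform: tidy the names and convert amounts to numbers.
cleaned = [(row["name"].strip().title(), float(row["amount"])) for row in rows]

# Load: write the transformed rows into a destination database.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE payments (name TEXT, amount REAL)")
connection.executemany("INSERT INTO payments VALUES (?, ?)", cleaned)

total = connection.execute("SELECT SUM(amount) FROM payments").fetchone()[0]
print(total)
```

In an ELT variant, the raw rows would be loaded first and the cleanup would run inside the destination afterwards.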
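The F-score (also known as F1 score or F-measure) is a measure used to evaluate the performance of a Machine Learning model. It combines precision and recall into a single number between 0 and 1, where higher values indicate better performance.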
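The F1 score is commonly computed as the harmonic mean of precision and recall. The sketch below assumes a hypothetical spam filter; the function name and the counts are invented for illustration, and the function assumes none of the denominators is zero.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Return the harmonic mean of precision and recall.

    Assumes the counts are such that no denominator is zero.
    """
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


# Hypothetical spam filter: 8 spam emails caught, 2 false alarms, 2 spam emails missed.
score = f1_score(true_positives=8, false_positives=2, false_negatives=2)
print(round(score, 2))
```

Here precision and recall are both 0.8, so the F1 score is also 0.8.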
Gradient Descent is an optimization algorithm used to find a local minimum of a differentiable function. In machine learning, it's used to find parameter values (coefficients) that minimize a cost function.
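A minimal sketch of the idea, assuming a simple one-variable function f(x) = (x - 3)^2 whose minimum is at x = 3. The function name, learning rate, and step count are illustrative choices, not fixed parts of the algorithm.

```python
def gradient_descent(gradient, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient to move toward a minimum."""
    x = start
    for _ in range(steps):
        x -= learning_rate * gradient(x)
    return x


# f(x) = (x - 3)^2 has gradient 2 * (x - 3) and its minimum at x = 3.
minimum = gradient_descent(lambda x: 2 * (x - 3), start=0.0)
print(round(minimum, 4))
```

A learning rate that is too large can make the steps overshoot and diverge, while one that is too small makes convergence slow.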
Apache Hadoop is an open-source software framework that enables the management of large datasets (known as Big Data), allowing a network of computers to solve complex data problems.
A histogram is a graph that shows the frequency of numerical data using bars. Each bar covers a range of values (a bin), and its height on the vertical axis shows how many observations fall within that range.
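As a small text-based sketch, the counts behind a histogram can be produced by grouping values into equal-width bins. The ages and the bin width of 10 are hypothetical values chosen for the example.

```python
from collections import Counter

# Hypothetical ages, invented for illustration.
ages = [21, 22, 25, 31, 34, 35, 38, 41, 47, 52]
bin_width = 10

# Map each age to the lower edge of its bin (e.g. 34 -> 30) and count per bin.
counts = Counter((age // bin_width) * bin_width for age in ages)

# Draw one bar per bin; the bar length is the number of values in that bin.
for start in sorted(counts):
    print(f"{start}-{start + bin_width - 1}: {'#' * counts[start]}")
```

Plotting libraries draw the same information graphically, but the binning step is the core of the idea.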
K-Means clustering is a vector quantization method, initially from signal processing, that aims to divide n observations into k clusters, where each observation belongs to the cluster with the nearest mean.
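The sketch below shows the usual two alternating steps (assign each observation to the nearest mean, then recompute the means) on one-dimensional data. It is a simplified illustration: the points and starting centroids are hypothetical, and the starting centroids are fixed by hand so the result is repeatable, whereas practical implementations typically choose them randomly.

```python
def k_means(points, centroids, iterations=10):
    """Simplified 1-D k-means with hand-picked starting centroids."""
    for _ in range(iterations):
        # Assignment step: put each point in the cluster with the nearest mean.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(point - centroids[i]))
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            sum(cluster) / len(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, clusters = k_means(points, centroids=[1.0, 12.0])
print(centroids)
print(clusters)
```

With k = 2, the six points split into a low group and a high group, with means 2.0 and 11.0.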
Linear regression is used to predict the value of one variable based on the value of another variable. The variable you want to predict is called the dependent variable, and the variable used to make the prediction is called the independent variable.
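One common way to fit such a line is ordinary least squares. The sketch below assumes a single independent variable and uses hypothetical study-hours and exam-score data that happen to lie exactly on the line y = 2x + 50.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


hours = [1, 2, 3, 4, 5]         # independent variable (hypothetical)
scores = [52, 54, 56, 58, 60]   # dependent variable (hypothetical)
slope, intercept = fit_line(hours, scores)

# Predict the score for 6 hours of study.
prediction = slope * 6 + intercept
print(slope, intercept, prediction)
```

Real data rarely falls exactly on a line, in which case the fitted line is the one that minimizes the squared prediction errors.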
Machine Learning is a branch of artificial intelligence based on the idea that computers/machines can learn from data they collect to recognize patterns and make their own decisions.
The median is the middle value of a dataset, meaning that 50% of the data has a value less than or equal to the median, and 50% of the data has a value greater than or equal to the median.
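A short example with Python's standard statistics module shows why the median is often preferred when a dataset contains outliers. The salary figures are hypothetical.

```python
import statistics

# Hypothetical salaries (in thousands); the last value is an outlier.
salaries = [30, 32, 35, 38, 400]

median_salary = statistics.median(salaries)
mean_salary = statistics.mean(salaries)

print(median_salary, mean_salary)
```

The median stays at 35, while the single outlier pulls the mean up to 107.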
Natural Language Processing (NLP)
NLP is a field of computer science, specifically artificial intelligence, that deals with how machines perceive and understand human words, much like humans do.
Normalization is the process of taking a measurement and dividing it by something else to make a number more comparable or to place it in a more understandable context.
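For instance, raw sales totals from cities of different sizes become comparable once they are divided by population. The city names and figures below are hypothetical, invented only for this sketch.

```python
# Hypothetical figures, invented for illustration.
cities = {
    "Small Town": {"sales": 500, "population": 10_000},
    "Big City": {"sales": 2_000, "population": 100_000},
}

# Normalize: express sales per 1,000 inhabitants instead of raw totals.
sales_per_1000 = {
    name: figures["sales"] * 1000 / figures["population"]
    for name, figures in cities.items()
}

print(sales_per_1000)
```

Big City has four times the raw sales, yet Small Town sells more per inhabitant (50 versus 20 per 1,000).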
NoSQL databases are non-relational databases, meaning they allow for different data structures than a SQL database. In other words, they are not limited to fixed tables of rows and columns and can store data as documents, key-value pairs, or graphs.
Open-source software is software with source code that anyone can inspect, modify, and enhance.
Predictive analytics is the use of past and present data to create forecasts about what is likely to happen in the future.
Python is a high-level, object-oriented programming language with integrated data structures and dynamic semantics. It is suitable for automation, as well as for use in artificial intelligence and machine learning systems.
R is an open-source programming language that provides the user with the ability to perform computational statistics and create graphs.
Regression is a statistical method used in finance, investments, and other fields to determine the strength of the relationship between a dependent variable and a series of other variables (known as independent variables).
Sentiment analysis is a method in the field of NLP (Natural Language Processing) for determining the emotional tone through the analysis of digital text. Sentiment analysis can identify whether the emotional tone of a specific message is positive, negative, or even neutral.
Structured data refers to data that is standardized for efficient access by both software and humans. It is typically in the form of tables with rows and columns that clearly define data characteristics.
SQL (Structured Query Language) is the language used by data analysts and data scientists to retrieve and organize data from relational databases for further use.
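As a brief illustration, the sketch below runs a SQL query from Python using the built-in sqlite3 module. The orders table and its contents are hypothetical.

```python
import sqlite3

# In-memory database with a hypothetical "orders" table.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
connection.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 20.0), ("bob", 5.0), ("alice", 15.0)],
)

# Retrieve and organize the data: total amount spent per customer.
totals = connection.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()

print(totals)
```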
Standard deviation is a measure of how spread out data is in relation to the mean. A low or small standard deviation indicates that data is closely clustered around the mean, while a high or large standard deviation indicates that data is more dispersed.
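The sketch below compares two hypothetical datasets with the same mean using statistics.pstdev, which computes the population standard deviation (statistics.stdev would give the sample version instead).

```python
import statistics

# Two hypothetical datasets with the same mean (5) but different spread.
clustered = [4, 5, 5, 5, 5, 6]
dispersed = [2, 4, 4, 4, 5, 5, 7, 9]

clustered_sd = statistics.pstdev(clustered)
dispersed_sd = statistics.pstdev(dispersed)

print(round(clustered_sd, 2), round(dispersed_sd, 2))
```

The first dataset sits close to its mean and gets a small standard deviation, while the second is more spread out and gets a larger one.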
Supervised learning uses a training set to "teach" models to produce the desired output. This training dataset includes inputs and correct outputs, allowing the model to learn over time.
Synthetic data refers to data generated artificially using algorithms, models, or statistical techniques to replicate various properties, distributions, and relationships found in real data.
Unstructured data is information that is not organized according to a predefined data model and cannot easily be stored in a traditional relational database. Examples include free text, images, audio, and video.
Unsupervised learning uses machine learning algorithms to analyze and cluster data sets without labels.
Web scraping is a method used to collect data in an unstructured format, such as HTML, and transfer it into structured formats like an Excel spreadsheet.
The Z-score is a statistical measure of the relationship of a score to the mean in a set of scores, expressed as the number of standard deviations the score lies above or below the mean. A Z-score can indicate whether a value is typical for a specific dataset or atypical.
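A minimal sketch of the calculation: subtract the mean from the value and divide by the standard deviation. The dataset is hypothetical, and the helper name z_score is an illustrative choice; the population standard deviation is assumed here.

```python
import statistics


def z_score(value, data):
    """Return how many population standard deviations value is from the mean."""
    mean = statistics.mean(data)
    standard_deviation = statistics.pstdev(data)
    return (value - mean) / standard_deviation


# Hypothetical scores with mean 5 and population standard deviation 2.
scores = [2, 4, 4, 4, 5, 5, 7, 9]

high = z_score(9, scores)
typical = z_score(5, scores)
print(high, typical)
```

A score of 9 lies two standard deviations above the mean, while a score of 5 sits exactly on the mean and has a Z-score of 0.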
As mentioned, there are many terms in the world of Data Science and the fields that surround it (e.g. statistics).
Although it's not easy to be familiar with and understand all of them, we believe that the 52 terms above can prove useful for anyone interested in delving deeper into Data Science!