AI & Data Glossary for the Social Sector
The Definitive Glossary for Understanding Artificial Intelligence (AI) for the Social Sector.
# | Term | Definition |
---|---|---|
1 | Algorithm | A finite set of well-defined instructions to solve a problem or perform a computation. |
2 | Analytics | The discovery, interpretation, and communication of meaningful patterns in data. |
3 | Artificial Intelligence (AI) | The simulation of human intelligence processes by machines, especially computer systems. |
4 | Artificial Neural Network (ANN) | A computing system inspired by biological neural networks, consisting of interconnected layers of nodes (“neurons”) that process data via weighted connections. |
5 | API (Application Programming Interface) | A set of routines, protocols, and tools for building software and applications, enabling different systems to communicate. |
6 | A/B Testing | A method of comparing two versions of something (A and B) to determine which performs better under controlled conditions. |
7 | AutoML | Automated Machine Learning: techniques and tools that automate the end-to-end process of applying machine learning to real-world problems. |
8 | Backpropagation | A training algorithm for neural networks that calculates the gradient of the loss function with respect to each weight, working backward layer by layer, so that the weights can be updated via gradient descent. |
9 | Batch Processing | Executing a series of jobs on a computer without manual intervention, processing data in large groups (“batches”). |
10 | Big Data | Extremely large and complex data sets that traditional data processing applications cannot handle efficiently. |
11 | BI (Business Intelligence) | Technologies, applications, and practices for collection, integration, analysis, and presentation of business information. |
12 | Black-Box Model | A model whose internal workings are not visible or interpretable by the user. |
13 | Blockchain | A distributed, decentralized ledger that records transactions across many computers in a way that makes past records very difficult to alter. |
14 | Boosting | An ensemble technique that combines weak learners sequentially to create a strong learner, by focusing on errors of prior models. |
15 | Chatbot | A software application that simulates human conversation through text or voice interactions, often using NLP. |
16 | Classification | A supervised learning task of predicting a discrete label for input data. |
17 | Clustering | An unsupervised learning technique that groups data points based on similarity. |
18 | CNN (Convolutional Neural Network) | A class of deep neural networks, most commonly applied to analyzing visual imagery, using convolutional layers to detect patterns. |
19 | Cohort Analysis | A form of behavioral analytics that, rather than treating all users in a dataset as one unit, breaks them into related groups (cohorts) for analysis. |
20 | Computer Vision | A field of AI that enables computers to interpret and process visual data from the world (images, video). |
21 | Confusion Matrix | A table used to evaluate classification models, showing true vs. predicted labels (TP, TN, FP, FN). |
22 | Cross-Validation | A model validation technique for assessing how results of a statistical analysis will generalize, by partitioning data into complementary subsets. |
23 | Data Cleaning | The process of detecting and correcting (or removing) corrupt or inaccurate records from a dataset. |
24 | Data Engineering | The practice of designing and building systems for collecting, storing, and analyzing data at scale. |
25 | Data Governance | The overall management of the availability, usability, integrity, and security of data within an organization. |
26 | Data Lake | A centralized repository that allows you to store all structured and unstructured data at any scale. |
27 | Data Mart | A subset of a data warehouse, usually oriented to a specific business line or team. |
28 | Data Mining | The practice of examining large pre-existing databases to generate new information. |
29 | Data Pipeline | A set of processes that move data from source to destination, applying transformations along the way. |
30 | Data Privacy | The proper handling, processing, storage, and usage of personal data. |
31 | Data Provenance | The record of the origin and transformations applied to data, ensuring traceability. |
32 | Data Quality | The measure of data’s condition, including accuracy, completeness, reliability, and relevance. |
33 | Data Science | An interdisciplinary field that uses scientific methods, algorithms, and systems to extract knowledge and insights from data. |
34 | Data Warehouse | A central repository of integrated data from multiple sources, structured for query and analysis. |
35 | Data Wrangling | The process of cleaning, structuring, and enriching raw data into a desired format for better decision making. |
36 | Decision Tree | A flowchart-like structure used for classification and regression, where each internal node represents a test on a feature. |
37 | Deep Learning | A subset of machine learning involving neural networks with many layers that can learn representations of data with multiple levels of abstraction. |
38 | Dimensionality Reduction | Techniques (e.g., PCA, t-SNE) to reduce the number of variables under consideration by obtaining a set of principal variables. |
39 | Dropout | A regularization technique for neural networks that randomly “drops out” units during training to prevent overfitting. |
40 | EDA (Exploratory Data Analysis) | An approach to analyzing data sets to summarize their main characteristics, often using visual methods. |
41 | Embedding | A learned representation for categorical variables or items (e.g., words) as vectors in a continuous vector space. |
42 | Ensemble Learning | Methods that combine multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent algorithms alone. |
43 | Epoch | One complete pass through the entire training dataset during model training. |
44 | ETL (Extract, Transform, Load) | The process of extracting data from sources, transforming it for analysis, and loading it into a target database or warehouse. |
45 | Explainable AI (XAI) | Techniques that make the outputs of AI models understandable to humans. |
46 | Feature Engineering | The process of using domain knowledge to create features that make machine learning algorithms work better. |
47 | Feature Importance | Metrics that assign a score to input features based on how useful they are at predicting a target variable. |
48 | Feature Selection | The process of selecting a subset of relevant features for model construction. |
49 | Federated Learning | A distributed approach to machine learning where the model is trained across multiple decentralized devices holding local data samples. |
50 | Fine-Tuning | The process of taking a pre-trained model and adapting it to a new, related task by continuing training on new data. |
51 | Forecasting | The process of making predictions about the future based on past and present data. |
52 | GAN (Generative Adversarial Network) | A class of neural networks where two networks (generator and discriminator) compete, enabling generation of realistic data. |
53 | GPU (Graphics Processing Unit) | A specialized processor optimized for parallel computations, widely used for training deep learning models. |
54 | Graph Database | A database that uses graph structures with nodes, edges, and properties to represent and store data. |
55 | GUI (Graphical User Interface) | A user interface that allows users to interact with electronic devices through graphical icons. |
56 | Hadoop | An open-source framework for distributed storage and processing of large data sets using the MapReduce programming model. |
57 | Hyperparameter | A configuration setting external to the model (e.g., learning rate, number of trees) that is not learned from the data and must be set prior to training. |
58 | Hyperparameter Tuning | The process of searching for the optimal hyperparameter values for a learning algorithm. |
59 | Imbalanced Data | A dataset where the classes are not represented equally. |
60 | Inference | The process of using a trained model to make predictions on new data. |
61 | Input Layer | The first layer in a neural network that receives the input data. |
62 | Instance | A single data point or record in a dataset. |
63 | Integration Testing | Testing combined parts of an application to determine if they function together correctly. |
64 | Internet of Things (IoT) | The network of physical objects embedded with sensors, software, and other technologies to connect and exchange data with other devices and systems. |
65 | Jupyter Notebook | An open-source web application for creating and sharing documents that contain live code, equations, visualizations, and narrative text. |
66 | K-Nearest Neighbors (KNN) | A simple algorithm that stores all available cases and classifies a new case by a majority vote of its k most similar stored cases, according to a similarity measure. |
67 | KPI (Key Performance Indicator) | A measurable value that demonstrates how effectively an organization is achieving key objectives. |
68 | L1/L2 Regularization | Techniques that add a penalty to the loss function based on the absolute values (L1) or squared values (L2) of model coefficients to prevent overfitting. |
69 | Label Encoding | Converting categorical text data into model-understandable numeric labels. |
70 | Large Language Model (LLM) | A deep learning model, often transformer-based, trained on massive text corpora to understand and generate human language. |
71 | Latent Variable | A variable that is not directly observed but is inferred from other variables. |
72 | Latency | The time delay between submitting an input or request and receiving the corresponding output. |
73 | Layer | A collection of neurons in a neural network; includes input, hidden, and output layers. |
74 | Learning Curve | A plot of model learning performance over experience or time (e.g., training iterations). |
75 | Learning Rate | A hyperparameter that controls how much to change the model in response to the estimated error each time the model weights are updated. |
76 | Linear Regression | A statistical method for modeling the relationship between a scalar response and one or more explanatory variables. |
77 | Logistic Regression | A statistical method for binary classification that models the probability of a binary outcome. |
78 | Log Loss (Cross-Entropy Loss) | A performance metric for classification models measuring the distance between predicted probabilities and actual labels. |
79 | Looker | A data exploration and business intelligence platform. |
80 | Loss Function | A function that maps values of one or more variables onto a real number representing some “cost” associated with those values. |
81 | LSTM (Long Short-Term Memory) | A type of recurrent neural network capable of learning long-term dependencies. |
82 | Machine Learning (ML) | A subset of AI that gives computers the ability to learn from data without being explicitly programmed. |
83 | Macro-Average | Computing a performance metric independently for each class and then taking the unweighted mean of the results, so that all classes are treated equally. |
84 | Magnitude | The size or length of a vector; often used in context of embeddings. |
85 | MapReduce | A programming model for processing large data sets with a parallel, distributed algorithm. |
86 | Markov Chain | A stochastic process where the next state depends only on the current state, not on the sequence of events that preceded it. |
87 | Masked Language Model | A language model trained by hiding (“masking”) some tokens and predicting them from context. |
88 | Mean Absolute Error (MAE) | The average of absolute differences between predicted and actual values. |
89 | Mean Squared Error (MSE) | The average of squared differences between predicted and actual values. |
90 | MediaPipe | A cross-platform framework for building multimodal (e.g., video, audio) ML pipelines. |
91 | Metadata | Data that provides information about other data (e.g., creation date, author, format). |
92 | Microservice | An architectural style that structures an application as a collection of loosely coupled services. |
93 | Model Compression | Techniques to reduce the size of a trained model for deployment on resource-constrained devices. |
94 | Model Explainability | The degree to which a human can understand the cause of a decision made by a model. |
95 | Model Persistence | Saving a trained model to disk for later reuse. |
96 | Model Serving | Deploying a trained model so that it can respond to inference requests. |
97 | Monte Carlo Simulation | A computational algorithm that uses random sampling to obtain numerical results for probabilistic systems. |
98 | Multicollinearity | A situation in which two or more predictor variables in a multiple regression model are highly correlated. |
99 | Multilabel Classification | A classification task where each instance may be assigned multiple labels. |
100 | Multivariate Analysis | The examination of more than two variables to determine relationships and patterns. |
101 | Naive Bayes | A family of probabilistic classifiers based on applying Bayes’ theorem with strong independence assumptions between features. |
102 | Natural Language Processing (NLP) | The field of AI focused on the interaction between computers and human (natural) languages. |
103 | Neural Network | A network of interconnected nodes organized in layers that can learn complex patterns in data. |
104 | NLG (Natural Language Generation) | The process of automatically generating coherent text from data. |
105 | NLP Pipeline | A sequence of processing steps (tokenization, parsing, tagging) to analyze textual data. |
106 | Node | A basic unit of a data structure, such as a linked list or tree, or a point in a graph. |
107 | Normalization | Scaling numeric data to a common range, often [0,1], to improve model performance. |
108 | Object Detection | A computer vision task of identifying and localizing objects within an image. |
109 | OCR (Optical Character Recognition) | Technology to convert different types of documents, such as scanned paper documents, into editable and searchable data. |
110 | One-Hot Encoding | Representing categorical variables as binary vectors in which a single 1 marks the category and all other positions are 0. |
111 | Online Learning | A model training paradigm where the model is updated incrementally as new data arrives. |
112 | Outlier | An observation point that is distant from other observations, possibly indicating variability in measurement or experimental error. |
113 | Overfitting | When a model learns training data too well, capturing noise and failing to generalize to new data. |
114 | Parameter | An internal configuration variable of a model learned from data (e.g., weights in a neural network). |
115 | PCA (Principal Component Analysis) | A technique to reduce dimensionality by transforming to a new set of variables (principal components) ordered by variance. |
116 | Precision | The ratio of true positive predictions to the total predicted positives. |
117 | Predictive Modeling | The process of using data and statistical algorithms to predict outcomes. |
118 | Precision-Recall Curve | A plot of precision vs. recall for different thresholds, useful for imbalanced datasets. |
119 | Principal Component | A new variable constructed as a linear combination of original variables in PCA. |
120 | Probability Distribution | A function that describes the likelihood of each outcome in an experiment. |
121 | Productionalization | The process of deploying and integrating a model into a live production environment. |
122 | Programmatic Access | Interacting with software or data via code rather than manually. |
123 | Propensity Score | The probability of assignment to a particular treatment given covariates, used in causal inference. |
124 | PR Curve | Abbreviation for Precision-Recall Curve (see entry 118). |
125 | Python | A high-level, interpreted programming language widely used in data science and AI. |
126 | R-Squared | A statistical measure representing the proportion of the variance in a dependent variable that is explained by the independent variables in a regression model. |
127 | Random Forest | An ensemble learning method using multiple decision trees to improve predictive accuracy and control overfitting. |
128 | RapidMiner | A data science platform for building predictive models. |
129 | Recommender System | An information filtering system that predicts user preferences for items. |
130 | Recall | The ratio of true positive predictions to the total actual positives. |
131 | Regression | A statistical method for modeling relationships between variables. |
132 | Reinforcement Learning | A type of machine learning where agents learn to make decisions by performing actions and receiving rewards. |
133 | Repeatability | The degree to which an experiment or measurement yields the same results under unchanged conditions. |
134 | Reproducibility | The ability to duplicate the results of a study using the same data and methods. |
135 | ResNet | A deep neural network architecture with “skip connections” that mitigate the vanishing gradient problem. |
136 | REST (Representational State Transfer) | An architectural style for designing networked applications using stateless communication. |
137 | Return on Investment (ROI) | A performance measure used to evaluate the efficiency of an investment, calculated as (gain−cost)/cost. |
138 | ROC Curve | Receiver Operating Characteristic curve: plot of true positive rate vs. false positive rate at various thresholds. |
139 | Root Mean Squared Error (RMSE) | The square root of the average of squared differences between predicted and actual values. |
140 | Sampling | The process of selecting a subset of data from a population for analysis. |
141 | Scalability | The capability of a system to handle growing amounts of work by adding resources. |
142 | Schema | The structure that defines the organization of data in a database. |
143 | Score Function | A function that assigns a numerical score to potential outputs of a model. |
144 | Script | A file containing a sequence of instructions executed by a program. |
145 | Search Algorithm | An algorithm for retrieving information stored within some data structure or calculated in the search space of a problem domain. |
146 | Semantic Segmentation | A computer vision task that assigns a class label to each pixel in an image. |
147 | Sensitivity Analysis | The study of how uncertainty in model output can be apportioned to different sources of uncertainty in model input. |
148 | Sentiment Analysis | The use of NLP to identify and extract subjective information from text. |
149 | Sequence Model | A model (e.g., RNN, LSTM) designed to handle sequential data such as text, audio, or time series. |
150 | Sigmoid Function | An S-shaped activation function used in neural networks, mapping inputs to (0,1). |
151 | Similarity Measure | A metric that quantifies the similarity between two data objects. |
152 | Simplex Algorithm | A popular algorithm for numerical solution of linear programming problems. |
153 | Skewness | A measure of the asymmetry of the probability distribution of a real-valued variable. |
154 | SLAM (Simultaneous Localization and Mapping) | A technique used in robotics and autonomous vehicles to build a map of an unknown environment while simultaneously keeping track of an agent’s location. |
155 | Softmax Function | An activation function that converts a vector of values into a probability distribution. |
156 | Software as a Service (SaaS) | A software licensing model where access is provided on a subscription basis and hosted centrally. |
157 | Source Code | The human-readable instructions that define what a program does. |
158 | Spectral Clustering | A clustering technique that uses the eigenvalues and eigenvectors of a similarity matrix to reduce dimensions before clustering. |
159 | SQL (Structured Query Language) | A domain-specific language for querying and managing data in relational databases. |
160 | Stack | A data structure that follows Last In First Out (LIFO) principle. |
161 | Standardization | Scaling data to have zero mean and unit variance. |
162 | Statistical Significance | A determination that an observed result or relationship would be unlikely to arise from random chance alone, typically judged by comparing a p-value to a preset threshold (e.g., 0.05). |
163 | Stepwise Regression | A method of fitting regression models by adding or removing predictors based on statistical criteria. |
164 | Stop Word | A commonly used word (e.g., “the,” “is”) that is often removed in NLP preprocessing. |
165 | Stratified Sampling | A sampling method that divides the population into strata and samples each stratum. |
166 | Stream Processing | Real-time processing of data in motion rather than at rest. |
167 | Structured Data | Data that adheres to a pre-defined data model and is easily searchable. |
168 | Supervised Learning | ML tasks where models are trained on labeled data. |
169 | Support Vector Machine (SVM) | A supervised learning model that finds the hyperplane that best separates classes in feature space. |
170 | Survey Analysis | The practice of examining survey data to extract insights. |
171 | Swarm Intelligence | The collective behavior of decentralized systems, natural or artificial. |
172 | TF-IDF (Term Frequency-Inverse Document Frequency) | A numerical statistic intended to reflect how important a word is to a document in a collection or corpus. |
173 | TCO (Total Cost of Ownership) | The total cost of purchasing, operating, and maintaining a system over its life cycle. |
174 | Tensor | A multi-dimensional array used by deep learning frameworks. |
175 | TensorFlow | An open-source library for dataflow and differentiable programming, commonly used for deep learning. |
176 | Testing Set | A subset of data used to assess the performance of a fully trained model. |
177 | Time Series Analysis | Techniques for analyzing time-ordered data points to extract meaningful statistics and trends. |
178 | Tokenization | The process of breaking text into individual words or subwords (tokens). |
179 | TP/FP/TN/FN | True Positive, False Positive, True Negative, False Negative—components of the confusion matrix. |
180 | Transfer Learning | Reusing a pre-trained model on a new, related problem, often requiring less data. |
181 | Tree Ensemble | An ensemble of decision trees (e.g., random forest, gradient boosting). |
182 | Training Set | The portion of data used to fit model parameters. |
183 | Tuning | The process of optimizing model hyperparameters. |
184 | Underfitting | When a model is too simple to capture underlying patterns in data, resulting in poor performance on both training data and new data. |
185 | Unstructured Data | Data that does not adhere to a pre-defined model (e.g., text, images). |
186 | Unsupervised Learning | ML tasks where models find patterns in unlabeled data. |
187 | Validation Set | A subset of data held out from training and used to tune hyperparameters and detect overfitting. |
188 | Variance | How much a model's predictions change when it is trained on different samples of training data; high variance is associated with overfitting. |
189 | Vector | A one-dimensional array of numbers representing features or embeddings. |
190 | Video Analytics | The process of applying computer vision to extract meaningful information from video. |
191 | Virtual Assistant | A software agent that can perform tasks or services for an individual based on commands or questions. |
192 | Visualization | The graphical representation of data or model outputs. |
193 | Voice Recognition | The ability of a system to identify and process human speech. |
194 | Web Scraping | Automated extraction of information from websites. |
195 | Weight Decay | A regularization technique that shrinks weights toward zero during training, commonly implemented by adding a penalty proportional to the squared magnitude of the weights to the loss function. |
196 | Word Embedding | A representation of words as continuous vectors capturing semantic meaning. |
197 | Word2Vec | A family of shallow, two-layer neural network models that produce word embeddings by predicting a word from its surrounding context or the context from a word. |
198 | XGBoost | An optimized distributed gradient boosting library designed to be highly efficient, flexible, and portable. |
199 | XML (eXtensible Markup Language) | A markup language that defines a set of rules for encoding documents in a format readable by both humans and machines. |
200 | Zero-Shot Learning | The ability of a model to correctly make predictions on classes it has not seen during training. |
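To make the Confusion Matrix, Precision, Recall, and TP/FP/TN/FN entries above more concrete, here is a minimal Python sketch. The labels are made-up illustration data, and the function names (`confusion_counts`, `precision`, `recall`) are hypothetical helpers rather than part of any particular library.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, FP, TN, FN for a binary classifier."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1  # predicted positive, actually positive
        elif p == positive and a != positive:
            fp += 1  # predicted positive, actually negative
        elif p != positive and a != positive:
            tn += 1  # predicted negative, actually negative
        else:
            fn += 1  # predicted negative, actually positive
    return tp, fp, tn, fn


def precision(tp, fp):
    """True positives divided by all predicted positives."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """True positives divided by all actual positives."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Made-up example: 1 = "needs follow-up", 0 = "does not"
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0]

tp, fp, tn, fn = confusion_counts(actual, predicted)
print("TP, FP, TN, FN:", tp, fp, tn, fn)
print("Precision:", precision(tp, fp))
print("Recall:", recall(tp, fn))
```

In this toy data the model makes one false alarm and misses one real case, which is why precision and recall both fall below 1.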
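The three regression error metrics defined above (MAE, MSE, RMSE) differ only in how they combine the prediction errors. The sketch below, using made-up numbers and hypothetical function names, shows one way to compute them with the Python standard library.

```python
import math


def mae(actual, predicted):
    """Mean Absolute Error: average of absolute differences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mse(actual, predicted):
    """Mean Squared Error: average of squared differences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root Mean Squared Error: square root of the MSE."""
    return math.sqrt(mse(actual, predicted))


# Made-up example values (e.g., monthly beneficiaries served, in hundreds)
actual = [10.0, 12.0, 15.0, 20.0]
predicted = [12.0, 12.0, 13.0, 24.0]

print("MAE:", mae(actual, predicted))
print("MSE:", mse(actual, predicted))
print("RMSE:", rmse(actual, predicted))
```

Because MSE and RMSE square each error, the single large miss (20 vs. 24) weighs more heavily on them than on MAE.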
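Normalization and Standardization, both defined above, are easy to confuse. As a rough illustration, the following sketch applies each to the same made-up list of numbers; the helper names are hypothetical, and the standardization here assumes the population standard deviation.

```python
import statistics


def normalize(values):
    """Min-max scaling: map values onto the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # all values identical
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Rescale values to zero mean and unit variance."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population standard deviation (an assumption)
    if sd == 0:
        return [0.0 for _ in values]
    return [(v - mean) / sd for v in values]


values = [2.0, 4.0, 6.0, 8.0]  # made-up data

print("Normalized:", normalize(values))
print("Standardized:", standardize(values))
```

Normalization fixes the smallest value at 0 and the largest at 1, while standardization centers the data on 0 and measures each value in standard deviations from the mean.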
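The Sigmoid Function and Softmax Function entries above can also be illustrated in a few lines. This is a minimal sketch with hypothetical function names and made-up scores, not the implementation used by any specific framework.

```python
import math


def sigmoid(x):
    """S-shaped function mapping any real number into (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # Equivalent form that avoids overflow for large negative x
    e = math.exp(x)
    return e / (1.0 + e)


def softmax(scores):
    """Convert a list of scores into probabilities that sum to 1."""
    m = max(scores)  # subtracting the max improves numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


scores = [1.0, 2.0, 3.0]  # made-up model outputs for three classes

print("sigmoid(0):", sigmoid(0.0))
print("softmax:", softmax(scores))
```

Sigmoid handles one score at a time (useful for yes/no outputs), whereas softmax looks at all scores together and returns a probability distribution over the classes.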