Statistical applications in the pharmaceutical and chemical fields
by Riccardo Bonfichi
Bootstrap using R: a useful approach for handling chunky data
The term 'chunky data' was coined by Dr. Wheeler in the 1990s to describe data that have been measured "in increments too large for the task at hand" or that result from "rounding or truncating experimental measurements". This type of data often occurs when experimental values must be reported in compliance with pre-established specifications that require no decimal digits or, at most, only one. In time series whose values are naturally similar to one another (e.g., Annual Product Quality Reviews), and in the absence of decimal digits to differentiate them, it is common to find values that are repeated identically many times.
This type of data, which can be clearly visualized using a probability plot or an individual value plot, leads to a substantial reduction in the apparent variability of the dataset. As a result, a dataset may not follow a normal distribution, even though there is no scientific reason for this deviation. This non-normality can be an obstacle to the application of those statistical tests that require normally distributed data.
A simple way to eliminate the problem caused by chunky data is to repeat the measurements with more suitable instruments or to report the measurements including the decimal places that were dropped during rounding. Unfortunately, this is often not feasible, for example when comparing measurements from two laboratories that used different data-reporting criteria. In these cases, the absence of normality prevents the correct application of the statistical tests commonly used to compare the means and dispersions of two data series (e.g., the Two-Sample t-Test or the F-Test for Equal Variances).
Bootstrapping, a nonparametric resampling technique, is an effective and easy-to-implement alternative to classical nonparametric tests (e.g., Mann-Whitney) for handling such data. Bootstrapping creates many simulated samples from a single dataset, without making assumptions about the data's distribution. This technique can help estimate the distribution of a population and can be used to make inferences about the differences in mean and variance between two datasets, even when one or both are not normally distributed. This post demonstrates how to use a simple R script to implement a specific bootstrapping method, providing a quick and reliable solution.
Clearly, the approach presented here can be extended to compare two non-normally distributed datasets for reasons beyond the presence of chunky data.
Typical examples include analytical parameters (e.g., related substances content) or critical process parameters that are naturally "limited" (the impurity content
can never be less than zero) or arbitrarily constrained and are not normally distributed.
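As a concrete illustration, here is a minimal sketch of the idea in base R; the two datasets and the number of resamples are hypothetical placeholders, not the data discussed in the post.

# Minimal sketch of a bootstrap comparison of two chunky datasets (base R)
# 'a' and 'b' are hypothetical rounded results from two laboratories
set.seed(123)
a <- c(99, 99, 100, 100, 100, 100, 101, 101)
b <- c(99, 100, 100, 101, 101, 101, 102, 102)

B <- 10000   # number of bootstrap resamples
diff_means <- replicate(B, mean(sample(a, replace = TRUE)) -
                           mean(sample(b, replace = TRUE)))

# Percentile 95% confidence interval for the difference of the means:
# if it does not contain 0, the two means differ at roughly the 5% level
quantile(diff_means, c(0.025, 0.975))

# The same scheme applies to dispersions: use var() or sd() inside
# replicate() instead of mean()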
Applied Statistics for QA & QC in a GMP environment
In the 2011 FDA guideline on Process Validation, the term "statistical" already appeared 13 times, and the message to pharmaceutical manufacturers was clear: use quantitative statistical methods whenever possible to keep processes under control, so as to ensure their stability over time and their consistency with the initial validation.
The concept of "Continued Process Verification", introduced by the FDA Guidance on Process Validation, was subsequently also taken up by EudraLex's Annex 15, "Qualification and Validation", published in 2015, which recommended that "statistical tools should be used, where appropriate, to support any conclusions with regard to the variability and capability of a given process and ensure a state of control".
Other important regulatory documents published later (ICH Q10 and ICH Q12) have further reaffirmed the importance of statistical tools, not only to better define the process control strategy, but also to design processes adequately (Design of Experiments, Design Space, etc.), all with a view to reducing post-approval changes.
All this makes evident not only the many uses of statistical tools, but also their strong practical impact.
The slides attached here, used for a two-day webinar held in June 2022, present, with a structured approach, numerous quantitative statistical tools applied
to pharmaceutical manufacturing and control. Given the vastness of the subject, this material obviously cannot cover all topics. However, it provides an overview that
should encourage the adoption of these tools, if only for the advantages, including economic ones, that they offer.
Elements of Acceptance Sampling by Attributes
The need to verify whether a material supplied by a producer to a consumer, or by one department of a company to another, meets pre-established requirements calls for a set of statistical techniques known as acceptance control. In general, acceptance control can be carried out “by attributes” or “by variables” and is mainly used to establish whether the lots subjected to control should be accepted or rejected, not to determine their quality level. This post focuses on acceptance control by attributes, where the quality of the lot is measured by its percentage of defectives. The three main sampling-plan schemes (i.e., hypergeometric, binomial and Poisson) are discussed and practical application examples are presented. “Control by attributes” is then considered
from the process standpoint using the appropriate control charts (i.e., p or np-charts and c or u-charts). The analysis of the topic is
completed by a discussion of the ISO 2859-1 standard with some practical application examples. The ISO 2859-1 standard specifies an acceptance
sampling system for inspection by attributes indexed in terms of Acceptance Quality Limit (AQL).
The ultimate purpose of this post is to draw attention to the fact that, although sampling plans are challenging to design and implement, they can perform a much higher function than mere "police control". The information they return is indeed invaluable, and it is a real waste of resources when, as often happens, it is simply filed and ignored.
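As a flavor of the calculations involved, here is a small R sketch of an Operating Characteristic (OC) curve; the plan (n = 80, c = 2) is a hypothetical example, not one taken from ISO 2859-1.

# OC curve of a hypothetical single-sampling plan by attributes
n   <- 80      # items inspected per lot
acc <- 2       # acceptance number: max defectives allowed
p   <- seq(0, 0.10, by = 0.001)   # true lot fraction defective

Pa <- pbinom(acc, size = n, prob = p)   # binomial scheme: P(accept lot)

plot(p, Pa, type = "l",
     xlab = "Lot fraction defective",
     ylab = "Probability of acceptance",
     main = "OC curve, n = 80, c = 2 (binomial)")

# For small lots, the hypergeometric scheme applies instead, e.g., N = 500:
# phyper(acc, m = round(p * 500), n = 500 - round(p * 500), k = 80)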
How to extend the shelf life of an API?
Look at its Stability Data from a Multivariate standpoint!
Stability studies are mandatory activities that, in general, are routinely conducted and equally routinely monitored as per official guidelines.
The traditional approach to stability studies is limited exclusively to recording the occurrence of a degradation process with the sole purpose of estimating a possible shelf life for the product.
The objective is achieved by following the trend over time of a quantitative attribute, usually the assay value.
This approach, due to its univariate nature, is however unable to say anything about the possible causes of the degradation phenomenon, and therefore to suggest a way to improve things.
Since at each stability time point other quality attributes are also determined besides assay (e.g., pH, water content, etc.), adopting a new perspective, i.e., a multivariate approach, makes it possible to identify which of the measured parameters most influence the degradation process. This allows us to hypothesize improvement actions on the process aimed at reducing, if not minimizing, degradation and therefore, ultimately, at extending the shelf life of the product itself.
In this post, stability data obtained under "accelerated conditions" were chosen as a case study precisely because, being available before the others (i.e., long term), they allow the
degradation process to be investigated immediately.
Experimentally, it was also observed that even with only the third-month data it was possible to obtain a model similar to the one obtained with the sixth-month data. It is therefore reasonable to assume that the use of additional accelerated-aging techniques (e.g., 40°C ≤ T ≤ 80°C and 10% ≤ RH ≤ 75%) will make the data available for analysis in an even shorter time frame.
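To make the multivariate idea tangible, here is a minimal R sketch on invented stability data; the attribute values, and the choice of a plain linear model, are illustrative assumptions only.

# Hypothetical accelerated-stability data: assay plus co-measured attributes
stab <- data.frame(
  month = c(0, 1, 2, 3, 6),
  assay = c(99.8, 99.1, 98.6, 98.0, 96.9),   # % w/w
  water = c(0.20, 0.35, 0.48, 0.60, 0.95),   # % w/w
  pH    = c(6.9, 6.8, 6.6, 6.5, 6.2)
)

# Correlations between assay and the other attributes point to candidate
# drivers of degradation (here, by construction, water uptake and pH drift)
cor(stab[, -1])

# A simple multivariate model of assay as a function of the other attributes;
# real data would need more batches and time points
fit <- lm(assay ~ water + pH, data = stab)
summary(fit)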
ASEPTIC FILLING OF STERILE POWDERS: SOME ELEMENTS OF STATISTICAL PROCESS CONTROL AND PREVENTIVE MAINTENANCE
Precise and accurate dosing of sterile powders into vials under aseptic conditions still represents a challenge in the pharmaceutical field, and this is even more true when it comes to small quantities of high-potency active substances.
To conduct this important operation effectively and efficiently, microdosing machines are available that can fill more than 20,000 vials per hour.
Among the various filling methods available, the one using a vacuum/pressure system is very popular.
The discs of the microdosing machine, and the chambers contained therein, are subjected to a continuous operational stress which leads to an inevitable deterioration of their performance.
To what extent is this deterioration acceptable?
When should preventive actions be taken to limit it?
These questions are answered by descriptive statistics: a simple summary index, the coefficient of variation, makes it possible to compare the variability of each dosing chamber over time, build a case history, set acceptability limits and thus indicate when it is time to intervene preventively.
Furthermore, statistical methods allow us to go into even more detail on the filling process, modeling it and verifying its consistency between the different dosing chambers and over time.
It is worth noting that the approach and methods presented here are applicable to similar processes, at least in some respects, such as compression to produce tablets, etc.
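The coefficient of variation mentioned above is straightforward to compute; here is a small R sketch on simulated fill weights for four hypothetical dosing chambers.

# CV% per dosing chamber from hypothetical fill-weight data (mg)
set.seed(42)
fills <- data.frame(
  chamber = rep(paste0("C", 1:4), each = 30),
  weight  = rnorm(120, mean = 50,
                  sd = rep(c(0.4, 0.5, 0.45, 0.9), each = 30))
)

cv <- function(x) 100 * sd(x) / mean(x)
round(tapply(fills$weight, fills$chamber, cv), 2)

# Tracking these CV% values over time (e.g., on a control chart) builds the
# case history against which acceptability limits can be set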
PRINCIPAL COMPONENT ANALYSIS AND CLUSTER ANALYSIS AS STATISTICAL TOOLS FOR A MULTIVARIATE CHARACTERIZATION OF PHARMACEUTICAL RAW MATERIALS
Numerous factors contribute to the variability of pharmaceutical industry processes, and among these, raw materials play a primary role, as they often come from different sources that use different production processes.
Raw materials characterization therefore plays a fundamental role in terms of Quality, which, by its nature, is "the enemy of variability".
Multivariate Data Analysis (MVDA), quite apart from its complex mathematics, is presented here as a powerful and practical tool for the study and classification of raw materials.
Thanks to the use of multivariate techniques such as Principal Component Analysis (PCA) or Cluster Analysis (CA), it is possible to graphically represent each lot, defined by the values of
the different analytical parameters that characterize it, as a point in a Cartesian diagram whose coordinates are the principal components. Since these components are built to intercept the variability in the data, such graphs reveal characteristics that would escape other kinds of examination and therefore make it possible to catalog the lots according to the degree of intrinsic homogeneity that defines them and to identify any anomalous behavior. This approach can therefore be used both initially, to characterize the incoming raw materials, and subsequently, in the case of anomalies, to see how the raw material lots under investigation are located compared to those that caused no problems.
The techniques that have been detailed here can also be extended to other typical situations in the pharmaceutical industry such as, for instance:
• comparative evaluation of finished product lots, for example for the purposes of Annual Product Quality Review (APQR).
• comparative evaluation of series of measurements performed by different operators, etc.
Once again, statistical methods show how it is possible to "simplify complexity" and extract practical and "ready-to-use" knowledge from complex datasets by capturing their information content.
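For readers who want to try the approach, here is a minimal PCA sketch in R; the raw-material lots and their four analytical parameters are simulated stand-ins for a real dataset.

# PCA of hypothetical raw-material lots (rows) vs. analytical parameters
set.seed(7)
lots <- data.frame(
  assay    = rnorm(20, 99.5, 0.4),
  water    = rnorm(20, 0.30, 0.05),
  sulf_ash = rnorm(20, 0.05, 0.01),
  psd_d50  = rnorm(20, 45, 5)
)

pca <- prcomp(lots, center = TRUE, scale. = TRUE)  # scale: mixed units
summary(pca)    # variability intercepted by each principal component

# Scores plot: each point is a lot; nearby points are similar lots,
# isolated points are candidate anomalies
plot(pca$x[, 1], pca$x[, 2], xlab = "PC1", ylab = "PC2")
text(pca$x[, 1], pca$x[, 2], labels = seq_len(nrow(lots)), pos = 3, cex = 0.7)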
MULTIPLE LINEAR REGRESSION: A POWERFUL STATISTICAL TOOL TO UNDERSTAND AND IMPROVE APIs MANUFACTURING PROCESSES
It is known that, over time, all production processes tend to deviate from their initial conditions, and this happens for many different reasons, such as changes in materials, personnel, environment, etc.
This variability in the processes, which often goes unnoticed, is instead well intercepted by the data that Quality Control systematically collects for batch release purposes.
If these data are analyzed using Multiple Linear Regression (MLR), they reveal a lot regarding the manufacturing processes that generated them.
This product knowledge is of great practical use to the Company, as it makes it possible to:
• understand which parameters most affect product quality and how they interact with each other,
• establish whether the parameters currently controlled are really the ones needed or, instead, which others would be better to consider,
• define / improve a product control strategy based on experimental data and quantitative models rather than speculation,
• define and graphically represent the design space (ICH Q8) inherent to the production process considered,
• identify possible ways to improve process performance and scientifically pilot this improvement,
• mitigate the Regulatory impact in case of changes.
This post details, step by step, how this ready-to-use process knowledge can be obtained from easily available experimental data.
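As a minimal illustration of the technique, here is an R sketch with simulated batch data; the two process parameters, their effects and the noise level are all invented for the example.

# MLR on hypothetical batch-release data: yield vs. two process parameters
set.seed(1)
temp  <- runif(30, 60, 80)     # reaction temperature, degrees C
time  <- runif(30, 2, 6)       # reaction time, h
yield <- 70 + 0.3 * temp - 1.5 * time +
         0.05 * temp * time + rnorm(30, 0, 1)
batches <- data.frame(temp, time, yield)

# temp * time fits both main effects and their interaction, i.e., whether
# the effect of temperature depends on the level of time
fit <- lm(yield ~ temp * time, data = batches)
summary(fit)

# Coefficients and p-values indicate which parameters most affect quality;
# the fitted surface can then be used to map a design space (ICH Q8)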
QUALITY METRICS AND DATA CONSISTENCY – Part 2
This second part continues and completes the previous one. The points dealt with in this second post are:
- CASE STUDY 3:
Capability Analysis: metrics for stable/mature processes (Cp and Cpk) and metrics for new processes (Pp and Ppk); a minimal sketch follows this list
- CASE STUDIES 4/5:
Probabilistic methods for a quick evaluation of the manufacturing process (Standardized Normal distribution, Poisson and Binomial distributions)
- CASE STUDY 6:
- Non-normally distributed processes: impurities content, microbial counts, Particle Size Distributions (PSD), black particles (or black specks)
- Normalization of non-normal data using mathematical transformations (logarithm, square root, inverse or reciprocal)
- Johnson Transformations
- CASE STUDY 7:
Multivariate methods: a different way to look at Quality Control data!
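Here is the sketch announced in Case Study 3: a minimal R computation of Cp and Cpk on simulated data, with a hypothetical assay specification of 98.0-102.0%.

# Capability indices from 100 simulated, in-control assay results
set.seed(11)
x   <- rnorm(100, mean = 100.2, sd = 0.5)
LSL <- 98.0
USL <- 102.0

Cp  <- (USL - LSL) / (6 * sd(x))                         # potential capability
Cpk <- min(USL - mean(x), mean(x) - LSL) / (3 * sd(x))   # actual capability
round(c(Cp = Cp, Cpk = Cpk), 2)

# Cp ignores centering; Cpk penalizes a mean drifting toward a limit.
# Strictly, Cp/Cpk use a within-subgroup sigma estimate, while Pp/Ppk
# use the overall standard deviation, as done here for simplicity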
- Quality Metrics are easy-to-use quantitative indicators that make it possible to intercept the variability of products/processes, quantify it and therefore ensure Quality.
- Quality Metrics provide a “quantitative knowledge” of the process that makes it possible to manage anomalous events (OOT, OOS, deviations, etc.), preventing or justifying them, and to communicate awareness of what is done and confidence in the processes used.
All this is summed up in two words: ECONOMIC ADVANTAGE!
QUALITY METRICS AND DATA CONSISTENCY – Part 1
In 2002, FDA launched the “Pharmaceutical cGMPs for the 21st Century” initiative with the aim of promoting a modern, risk- and science-based production approach.
In 2015, in the same context, FDA asked the industry for input to define an “FDA Quality Metrics program”, and in December 2019 it announced that the implementation of a “Quality Metrics Program” had become a priority.
Taking its cue from these FDA stimuli, this post and the next deal with the use of quantitative tools (or Quality Metrics) for understanding, monitoring and possibly improving pharmaceutical manufacturing processes.
Real case studies that show the practical application of Quality Metrics to typical QA / QC topics are discussed and their statistical analysis detailed step by step.
In practice, it is shown how, from data normally available in the company, it is possible to easily extract useful information on the state of the processes and, above all, to predict their possible outcomes.
It is exactly this combination of two aspects, one descriptive and the other predictive, that makes it possible to really know a given process, control it and possibly even improve it. This knowledge is also useful for managing issues like OOS, OOT, deviations, etc. In fact, poor knowledge of the process and of its quality indicators can lead to considering anomalous what is not.
Given the number of Quality Metrics considered and the breadth of the case studies discussed, the topic was split into two parts. In this first post the points dealt with are:
What are Quality Metrics?
Understanding variability as «Quality is inversely proportional to variability»
State of the art
Create knowledge from available data and therefore:
- manage possible anomalous or risky situations (OOS, OOT, deviations, etc.)
- communicate awareness in what is done and in the reliability of the processes used
- CASE STUDY 1:
Use of graphical methods
Normal distribution, Normality test and Hypothesis test (α, P-value)
Anscombe’s quartet: Graphics reveal data!
- CASE STUDY 2:
Control Charts (I-MR Chart, Run Chart, Xbar Chart)
Structures in data: clustering, trending, etc.
ANOVA, t-test, 2-Variances test and comparison between two series of data
Example: evaluation of Supplier data
Bland-Altman (or Tukey Mean-Difference) plot
Central Limit Theorem: regardless of the shape of the parent population, the distribution of means quickly approaches the normal distribution
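The Central Limit Theorem mentioned in Case Study 2 is easy to verify by simulation; here is a minimal R sketch using a deliberately skewed (exponential) parent population.

# Means of samples from a non-normal population approach normality
set.seed(99)
pop_draw <- function(n) rexp(n, rate = 1)   # skewed parent population

means_n5  <- replicate(5000, mean(pop_draw(5)))
means_n30 <- replicate(5000, mean(pop_draw(30)))

op <- par(mfrow = c(1, 3))
hist(pop_draw(5000), main = "Parent population", xlab = "x")
hist(means_n5,       main = "Means, n = 5",      xlab = "sample mean")
hist(means_n30,      main = "Means, n = 30",     xlab = "sample mean")
par(op)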
Basics of Statistical Risk Analysis
Risk is an essential part of daily life, and even society as a whole needs to take risks in order to keep growing and developing.
Risk management is the process of identifying, analyzing and responding to risk factors.
According to ICH Q9, Risk Assessment consists of the identification of hazards and the analysis and evaluation of risks associated with exposure
to those hazards. Apart from a few exceptions (e.g., quantitative FTA), most of the risk analysis tools commonly used in the pharmaceutical field
(e.g., FMEA, etc.) are basically subjective. However, in some cases, statistical techniques allow us to assess the extent of the risk associated with certain decisions. A typical example is the decision regarding the conformity, or not, of a lot based on the analysis of a sample of it.
In such a decision two figures must be considered, the PRODUCER and the CUSTOMER (or CONSUMER), who run two different types of risk. The PRODUCER runs the risk
of rejecting a “good lot” while the CUSTOMER (or CONSUMER) that of accepting a “not compliant” or a “poor quality” product. This post briefly addresses this topic.
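These two risks can be quantified directly; here is a minimal R sketch for a hypothetical single-sampling plan, with illustrative quality levels for the "good" and the "poor quality" lot.

# Producer's and consumer's risks for a hypothetical plan (n = 80, c = 2)
n   <- 80
acc <- 2
AQL <- 0.005   # fraction defective of a "good" lot
LQ  <- 0.050   # fraction defective of a "poor quality" lot

producer_risk <- 1 - pbinom(acc, n, AQL)   # alpha: rejecting a good lot
consumer_risk <- pbinom(acc, n, LQ)        # beta: accepting a poor lot
round(c(alpha = producer_risk, beta = consumer_risk), 3)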
Regulatory Technical Writing - Labor Ergo Scribo!
Those who work must necessarily write! The aims are many: to communicate the results of one's studies, to give operating instructions, to respond to requests, etc.
In all cases, however, if the message contained in the writing does not reach the recipient, the entire communication process is frustrated, and the consequences can be significant. To appreciate this, it is enough to consider that at least a third of an executive's time is spent writing documents, and that the quality of a given piece of work, and the decision to continue it, interrupt it, finance it, etc., are often determined solely by the document that describes it!
The focus of this presentation is therefore to analyze the structure of a technical document and provide practical suggestions for its preparation.
Writing, however, is still much more than this and therefore the presentation considers, more generally, the "what it means to write and how to do it".
Solvents Classification using a Multivariate Approach: Cluster Analysis.
This post continues and completes the analysis, begun in the previous post, of a database consisting of 64 solvents, each described by eight physico-chemical descriptors.
The subject matter of this study is the application of Cluster Analysis with the intention of finding groups in the data, i.e., identifying which observations are alike and categorizing them into groups, or clusters. As clustering is a broad set of techniques, this study focuses only on the so-called hard clustering methods, i.e., those assigning observations with similar properties to the same group and dissimilar data points to different groups. Two types of algorithms have been considered: hierarchical and partitional.
Quite apart from the chosen technique, the experimental evidence indicates the presence, in the dataset, of:
• three main groups, each consisting of individuals categorized as similar to one another;
• a few isolated individuals dissimilar from the others.
A similar finding was also obtained in the previous post using 2d-contour plots.
A closer examination of these three main groups of solvents shows a finer structure consisting of smaller groups of highly similar individuals, e.g., members of a given chemical family (alcohols, chlorinated hydrocarbons) or chemical entities sharing common characteristics (aprotic dipolar solvents).
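For the curious reader, here is a minimal R sketch of the two families of algorithms; the 64 x 8 data matrix is random, standing in for the real physico-chemical descriptors.

# Hard clustering of hypothetical solvent descriptors (standardized)
set.seed(3)
solvents <- scale(matrix(rnorm(64 * 8), nrow = 64,
                         dimnames = list(paste0("S", 1:64),
                                         paste0("descr", 1:8))))

# Hierarchical: Euclidean distances, Ward linkage
hc <- hclust(dist(solvents), method = "ward.D2")
plot(hc, cex = 0.5)              # dendrogram; cut it to get the main groups
groups_hc <- cutree(hc, k = 3)

# Partitional: k-means with k = 3
km <- kmeans(solvents, centers = 3, nstart = 25)

# Cross-tabulate the two solutions to check their agreement
table(hierarchical = groups_hc, kmeans = km$cluster)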
Solvents Classification using a Multivariate Approach: Correlation and Principal Component Data Analysis.
The identification of data-driven criteria for making a conscious choice of solvents for practical applications is a rather old issue in the chemical field. Solvents, in fact, are mainly selected based on the Chemist's experience and intuition, guided by parameters such as polarity, basicity and acidity. As early as 1985, at least two research groups approached the issue of solvent selection using multivariate statistical methods. These scientists, using different databases, each based on different types of physicochemical descriptors, obtained different classification patterns. In this post, one of those databases has been chosen and the data analysis process has been repeated and documented systematically. This post deals with the first part of the process: the intercorrelation among the physicochemical descriptors used to characterize the solvents under study, and Principal Component Analysis. The correlation found makes it possible to capture 70% of the initial data variability using just two principal components, the first of which is related to the “polarity/polarizability” and “lipophilicity” of the molecules, and the second to the “strength of intermolecular forces”.
The use of these two principal components suggests the possibility of grouping solvents into aggregates (or clusters) of similar individuals and this aspect will be covered in the following post.
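The corresponding calculation takes a few lines of R; the matrix below is a random stand-in, and with the real descriptors the first two components capture about 70% of the variability, as stated above.

# Intercorrelation and PCA of hypothetical solvent descriptors
set.seed(3)
solvents <- scale(matrix(rnorm(64 * 8), nrow = 64,
                         dimnames = list(paste0("S", 1:64),
                                         paste0("descr", 1:8))))

round(cor(solvents), 2)   # intercorrelation among the eight descriptors

pca <- prcomp(solvents)   # data already centered and scaled
summary(pca)              # cumulative proportion of variance per component

# Loadings: which descriptors each component is related to (e.g.,
# polarity/polarizability vs. strength of intermolecular forces)
round(pca$rotation[, 1:2], 2)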
A different way to look at pharmaceutical Quality Control data: multivariate instead of univariate.
In the pharmaceutical industry, Quality Control (QC) data are typically arranged in data tables, each row of which refers to a specific production lot and contains the results of different types of measurements (chemical and microbiological).
Since for each active chemical entity, or dosage form, there is a specific data table, and since all lots listed therein are manufactured using the same approved process, the data table contains the “analytical fingerprint” of that specific manufacturing process.
In spite of their tabular form, QC data are usually reviewed, evaluated and trended in a univariate mode, i.e., each type of data is analyzed individually using statistical tools such as control charts, box plots, etc. The dataset is therefore studied “by columns”.
This post proposes a different way to analyze QC data, i.e., a multivariate approach that improves upon separate univariate analyses of each variable by using information about the relationships between the variables. Moreover, combining multivariate methods with the power of the programming language R and its unsurpassed graphic tools allows data to be analyzed mainly through graphics; as stated by Chambers et al., “there is no statistical tool that is as powerful as a well-chosen graph”.
This post shows how, using R for combined multivariate data analysis and visualization, the information contained in a QC chemical dataset can be easily extracted and converted into “knowledge ready to use”.
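As one simple example of such a multivariate look (not necessarily the exact method detailed in the post), here is an R sketch using the Mahalanobis distance on a hypothetical QC table.

# One row per lot, one column per test; the distance accounts for the
# correlations between variables that univariate charts ignore
set.seed(5)
qc <- data.frame(
  assay = rnorm(40, 99.8, 0.4),
  water = rnorm(40, 0.25, 0.04),
  imp_A = rnorm(40, 0.10, 0.02),
  pH    = rnorm(40, 6.8, 0.1)
)

d2 <- mahalanobis(qc, center = colMeans(qc), cov = cov(qc))

# Under approximate multivariate normality, d2 ~ chi-square with 4 df;
# lots above the 99th percentile deserve a closer, univariate look
which(d2 > qchisq(0.99, df = ncol(qc)))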
Hi and welcome to my website
I am a Chemist and I have worked in the pharmaceutical industry since 1982, gaining experience in Analytical R&D, Quality Control and Quality Assurance.
In the last six to seven years, I have developed a deep personal interest in statistical data analysis. After starting with Minitab and the univariate approach, I later discovered R/RStudio and Multivariate Analysis.
These two discoveries, which impressed and fascinated me, are among the main reasons for creating this website. I hope, in fact, that it will allow me to get in touch with Scientists involved in the field of Multivariate Analysis and Clustering, to learn from them and to cooperate with them.
Therefore, please, get in touch to talk about statistical methods in the pharmaceutical / chemical industries and, in particular, Multivariate Analysis and Data Clustering.
The content of this website and the opinions therein have nothing to do with my current position or with my previous or current employers.
1986 Master in Analytical and Chemical Methods of Fine Organic Chemistry
Polytechnic University of Milan, Italy
1981 Graduated in Chemistry
University of Milan, Italy
• Statistical Process Control for the FDA regulated Industry, Pragmata, Teramo, May 3rd-4th, 2016
• Statistics for Data Science with R, Quantide, Legnano, October 19th-20th, 2018
• Data Mining with R, Quantide, Legnano, February 15th-16th, 2018
• Intermediate R Course, DataCamp, February 27th, 2018
• Data Visualization and Dashboard with R, Quantide, Legnano, June 25th-26th, 2018
• Member of the Italian Statistical Society (since May 2018)
My mother tongue is Italian. From 1989 to 1992, I worked and lived in Basel (Switzerland), where I learned German. Besides this, I also speak English and a bit of French.