Statistical applications in the pharmaceutical and chemical fields

by Riccardo Bonfichi

Solvents Classification using a Multivariate Approach: Cluster Analysis.



This post continues and completes the analysis, begun in the previous post, of a database of 64 solvents, each described by eight physico-chemical descriptors. The subject of this study is the application of Cluster Analysis to find groups in the data, i.e., to identify which observations are alike and assign them to groups, or clusters. Because clustering covers a broad set of techniques, this study focuses only on the so-called hard clustering methods, i.e., those that assign observations with similar properties to the same group and dissimilar data points to different groups. Two types of algorithms have been considered: hierarchical and partitional. Regardless of the technique chosen, the results indicate that the database contains (i) three main groups, each consisting of individuals that are similar to one another, and (ii) a few isolated individuals that are dissimilar from the others. A similar finding was obtained in the previous post using 2D contour plots. A closer examination of the three main groups of solvents reveals a finer structure made up of smaller groups of highly similar individuals, e.g., members of a given chemical family (such as alcohols or chlorinated hydrocarbons) or chemical entities sharing common characteristics (such as dipolar aprotic solvents).
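To make the idea of hard, hierarchical clustering concrete, here is a minimal sketch in plain Python. It is only an illustration under stated assumptions: the points are made up (they are not the solvent data), the descriptors are assumed to be already scaled, and average linkage with Euclidean distance is just one possible choice, not necessarily the one used in the post. The function names (`agglomerate`, `average_linkage`) are hypothetical.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_linkage(points, c1, c2):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(euclidean(points[i], points[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))

def agglomerate(points, n_clusters):
    """Merge the two closest clusters until n_clusters (>= 1) remain.

    Returns a list of clusters, each a list of row indices into points.
    Every observation ends up in exactly one cluster (hard clustering).
    """
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None  # (distance, index_a, index_b) of the closest pair so far
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(points, clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Hypothetical, already-scaled observations (NOT the solvent database).
toy_points = [
    (0.0, 0.1), (0.2, 0.0), (0.1, 0.3),   # a first tight group
    (5.0, 5.1), (5.2, 4.9), (5.1, 5.3),   # a second tight group
    (9.0, 0.0),                           # an isolated observation
]
groups = agglomerate(toy_points, n_clusters=3)
print(sorted(sorted(g) for g in groups))  # [[0, 1, 2], [3, 4, 5], [6]]
```

On this toy input the two tight groups are recovered and the distant point stays on its own, which mirrors, on a very small scale, the "main groups plus isolated individuals" pattern described above.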


Solvents Classification using a Multivariate Approach: Correlation and Principal Component Data Analysis.



The identification of data-driven criteria for making an informed choice of solvents for practical applications is a long-standing issue in chemistry. Solvents, in fact, are mainly selected on the basis of the chemist's experience and intuition, guided by parameters such as polarity, basicity and acidity. As early as 1985, at least two research groups approached the issue of solvent selection using multivariate statistical methods. These scientists, using different databases, each based on different types of physicochemical descriptors, obtained different classification patterns. In this post, one of those databases has been chosen and the data analysis repeated, detailing each step systematically. This post deals with the first part of the process: the intercorrelation among the physicochemical descriptors used to characterize the solvents under study, and Principal Component Analysis. The correlations found make it possible to capture 70% of the initial data variability using just two principal components, the first of which is related to the “polarity/polarizability” and “lipophilicity” of the molecules and the second to the “strength of intermolecular forces”. The use of these two principal components suggests the possibility of grouping solvents into aggregates (or clusters) of similar individuals; this aspect will be covered in the following post.
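A statement such as "two principal components capture 70% of the variability" comes from the eigenvalues of the correlation (or covariance) matrix of the descriptors. The sketch below shows the arithmetic in plain Python, assuming PCA on standardized variables, so that each eigenvalue divided by the number of variables is the share of variance carried by that component. The three columns are made up for illustration (they are not the solvent descriptors), and power iteration with deflation is used only to keep the example free of external libraries; all function names are hypothetical.

```python
import math

def standardize(column):
    """Center a column and scale it to unit sample standard deviation."""
    n = len(column)
    mean = sum(column) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in column) / (n - 1))
    return [(x - mean) / sd for x in column]

def correlation_matrix(columns):
    """Pearson correlation matrix of a list of equally long columns."""
    z = [standardize(c) for c in columns]
    n = len(columns[0])
    return [[sum(a * b for a, b in zip(zi, zj)) / (n - 1) for zj in z] for zi in z]

def leading_eigenpair(matrix, iterations=500):
    """Largest eigenvalue and its unit eigenvector, by power iteration."""
    size = len(matrix)
    v = [1.0 / math.sqrt(size)] * size
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(size)) for i in range(size)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    mv = [sum(matrix[i][j] * v[j] for j in range(size)) for i in range(size)]
    return sum(a * b for a, b in zip(v, mv)), v

def explained_variance(columns, n_components):
    """Share of total variance carried by the first n_components components."""
    r = correlation_matrix(columns)
    size = len(r)
    total = sum(r[i][i] for i in range(size))  # trace = number of variables
    shares = []
    for _ in range(n_components):
        lam, v = leading_eigenpair(r)
        shares.append(lam / total)
        # Deflation: remove the component just found before looking for the next.
        r = [[r[i][j] - lam * v[i] * v[j] for j in range(size)] for i in range(size)]
    return shares

# Made-up descriptors: the first two are strongly correlated, the third is not.
a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
b = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]
c = [5.0, 1.0, 4.0, 2.0, 6.0, 3.0]

shares = explained_variance([a, b, c], n_components=2)
print([round(s, 3) for s in shares], round(sum(shares), 3))
```

Because two of the three toy variables carry almost the same information, two components account for nearly all of the variance here. This is the same mechanism, in miniature, by which intercorrelated descriptors allow a small number of components to summarize a larger table.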


A different way to look at pharmaceutical Quality Control data: multivariate instead of univariate.



In the pharmaceutical industry, Quality Control (QC) data are typically arranged in data tables, each row of which refers to a specific production lot and contains the results of different types of measurements (chemical and microbiological). Since there is a specific data table for each active chemical entity, or dosage form, and since all lots listed therein are manufactured using the same approved process, the data table contains the “analytical fingerprint” of that specific manufacturing process. In spite of their tabular form, QC data are usually reviewed, evaluated and trended in a univariate mode, i.e., each type of data is analyzed individually using statistical tools such as control charts, box plots, etc. The dataset is therefore studied “by columns”. This post proposes a different way to analyze QC data: a multivariate approach that improves upon separate univariate analyses of each variable by using information about the relationships between the variables. Moreover, the combination of multivariate methods with the power of the programming language R and its unsurpassed graphic tools makes it possible to analyze data relying mainly on graphics and, as stated by Chambers et al., “there is no statistical tool that is as powerful as a well-chosen graph”. This post shows how, by using R for combined multivariate data analysis and visualization, the information contained in a QC chemical dataset can be easily extracted and converted into “knowledge ready to use”.
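The following sketch illustrates, in plain Python, why looking at the relationships between variables can reveal what column-by-column review misses. It is an illustration only: the lots, the variable names (`assay`, `impurity`) and their values are invented, and the Mahalanobis distance is used here as one simple multivariate measure, not necessarily the technique applied in the post. A lot can sit comfortably within the usual range of each variable taken alone and still be very unusual once the correlation between the two is taken into account.

```python
import math

def mean(values):
    return sum(values) / len(values)

def covariance(xs, ys):
    """Sample covariance (n - 1 denominator); covariance(xs, xs) is the variance."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def z_score(value, reference):
    """Univariate view: distance from the mean in standard deviations."""
    return (value - mean(reference)) / math.sqrt(covariance(reference, reference))

def mahalanobis_2d(point, xs, ys):
    """Multivariate view: distance that accounts for the x-y correlation."""
    sxx, syy, sxy = covariance(xs, xs), covariance(ys, ys), covariance(xs, ys)
    det = sxx * syy - sxy ** 2
    dx, dy = point[0] - mean(xs), point[1] - mean(ys)
    # Quadratic form d^T S^-1 d written out for the 2x2 covariance matrix S.
    d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(d2)

# Invented historical lots: the two results rise and fall together.
assay = [99.0, 99.4, 99.8, 100.0, 100.2, 100.6, 101.0, 100.4]
impurity = [0.11, 0.13, 0.19, 0.20, 0.23, 0.25, 0.30, 0.22]

typical_lot = (100.8, 0.28)   # follows the usual pattern
unusual_lot = (100.8, 0.13)   # each value is ordinary, the combination is not

z_assay = z_score(unusual_lot[0], assay)
z_impurity = z_score(unusual_lot[1], impurity)
dist_typical = mahalanobis_2d(typical_lot, assay, impurity)
dist_unusual = mahalanobis_2d(unusual_lot, assay, impurity)

print("univariate z-scores of the unusual lot:", round(z_assay, 2), round(z_impurity, 2))
print("Mahalanobis distances:", round(dist_typical, 2), round(dist_unusual, 2))
```

In this toy case both z-scores of the second lot stay below 2 in absolute value, so neither column would raise a flag, while its Mahalanobis distance is roughly ten times that of the lot that follows the usual pattern.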


Riccardo Bonfichi

Hi, and welcome to my website!

I am a chemist and I have worked in the pharmaceutical industry since 1982, gaining experience in Analytical R&D, Quality Control and Quality Assurance. Over the last six to seven years, I have developed a deep, personal interest in statistical data analysis. After starting out with Minitab and the univariate approach, I later discovered R/RStudio and Multivariate Analysis. These two discoveries, which impressed and fascinated me, are among the main reasons for creating this website. I hope it will allow me to get in touch with scientists working in the field of Multivariate Analysis and Clustering, to learn from them and to cooperate with them. So please get in touch to talk about statistical methods in the pharmaceutical and chemical industries and, in particular, Multivariate Analysis and Data Clustering.
The content of this website and the opinions therein have nothing to do with my current position or with my previous or current employers.


1986 Master in Analytical and Chemical Methods of Fine Organic Chemistry
Polytechnic University of Milan, Italy

1981 Graduated in Chemistry
University of Milan, Italy

Training courses
• Statistical Process Control for the FDA regulated Industry, Pragmata, Teramo, May 3rd - 4th, 2016
• Statistics for Data Science with R, Quantide, Legnano, October 19th - 20th, 2018
• Data Mining with R, Quantide, Legnano, February 15th - 16th, 2018
• Intermediate R Course, DataCamp, February 27th, 2018
• Data Visualization and Dashboard with R, Quantide, Legnano, June 25th - 26th, 2018

• Member of the Italian Council of Chartered Chemists (since July 1996)
• Member of the Italian Statistical Society (since May 2018)

My mother tongue is Italian. From 1989 to 1992, I lived and worked in Basel (Switzerland), where I learned German. Besides this, I also speak English and a little French.

Read Italian legislation on data protection and privacy.

Privacy policy


Law D.Lgs. n. 196/03

This site doesn't use any type of cookies (technical cookies or profiling cookies).
Pursuant to Section 122 of the “Italian Privacy Act” and Authority Provision of 8 May 2014, no consent is required from site visitors.
Garante della privacy (en-it)
This website doesn't collect or store any kind of personal data.

This site does not use profiling cookies, for which the visitor's consent is required, as detailed on the pages of the Garante della privacy (en-it).
This site does not request, collect or process personal data of any kind.