Statistical applications in pharmaceutical and chemical field
by Riccardo Bonfichi
Solvents Classification using a Multivariate Approach: Cluster Analysis.
Abstract: This post continues and completes the analysis, begun in the previous post, of a database of 64 solvents, each described by eight physico-chemical descriptors. The subject of this study is the application of Cluster Analysis to find groups in the data, i.e., to identify which observations are alike and to sort them into groups, or clusters. Since clustering covers a broad set of techniques, this study focuses only on the so-called hard clustering methods, i.e., those that assign observations with similar properties to the same group and dissimilar observations to different groups. Two types of algorithms have been considered: hierarchical and partitional. Regardless of the technique chosen, the experimental evidence indicates that the database contains three main groups, each made up of individuals that are similar to one another, and a few isolated individuals that are dissimilar from all the others. A similar finding was obtained in the previous post using 2D contour plots. A closer examination of the three main groups of solvents reveals a finer structure of smaller groups whose members are highly similar to one another: either members of a given chemical family (e.g., alcohols, chlorinated hydrocarbons) or chemical entities sharing common characteristics (e.g., aprotic dipolar solvents).
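As a rough illustration of the two families of hard clustering mentioned in this abstract, the sketch below runs a small average-linkage hierarchical procedure and a basic k-means loop on made-up data. Everything in it is an assumption for illustration only: the seven points, the two descriptors, the choice of average linkage, the starting centers and the function names are hypothetical, and the code is written in plain Python rather than in the tools used in the post.

```python
import math

# Hypothetical toy data: 7 "solvents" described by 2 made-up descriptors
# (the database in the post has 64 solvents and 8 descriptors).
points = {
    "A1": (1.0, 1.1), "A2": (1.2, 0.9), "A3": (0.9, 1.0),
    "B1": (5.0, 5.2), "B2": (5.1, 4.9),
    "C1": (9.0, 1.0), "C2": (9.2, 1.1),
}

def dist(p, q):
    """Euclidean distance between two points (requires Python 3.8+)."""
    return math.dist(p, q)

def agglomerative(points, n_clusters):
    """Hierarchical (average-linkage) clustering, stopped at n_clusters."""
    clusters = [[name] for name in points]  # start: one cluster per point
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average of all pairwise distances between the two clusters
                total = sum(dist(points[a], points[b])
                            for a in clusters[i] for b in clusters[j])
                d = total / (len(clusters[i]) * len(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return sorted(sorted(c) for c in clusters)

def kmeans(points, centers, n_iter=20):
    """Partitional clustering: basic k-means from given starting centers."""
    n_dims = len(centers[0])
    groups = []
    for _ in range(n_iter):
        groups = [[] for _ in centers]
        for name, p in points.items():
            # Assign each point to its nearest center
            k = min(range(len(centers)), key=lambda c: dist(p, centers[c]))
            groups[k].append(name)
        # Move each center to the mean of its group (keep it if group is empty)
        centers = [
            tuple(sum(points[n][d] for n in g) / len(g) for d in range(n_dims))
            if g else centers[k]
            for k, g in enumerate(groups)
        ]
    return sorted(sorted(g) for g in groups)

hier = agglomerative(points, 3)
part = kmeans(points, [points["A1"], points["B1"], points["C1"]])
print("hierarchical:", hier)
print("partitional: ", part)
```

On data this well separated, both procedures recover the same three groups, which mirrors the abstract's remark that the main structure does not depend on the technique chosen. On real data the linkage rule, the number of clusters and the starting centers all influence the result.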
Solvents Classification using a Multivariate Approach: Correlation and Principal Component Data Analysis.
Abstract: Identifying data-driven criteria for making an informed choice of solvents for practical applications is a long-standing issue in the chemical field. Solvents, in fact, are mainly selected on the basis of the chemist's experience and intuition, guided by parameters such as polarity, basicity and acidity. As early as 1985, at least two research groups approached the issue of solvent selection using multivariate statistical methods. These scientists, using different databases, each based on different types of physicochemical descriptors, obtained different classification patterns. In this post, one of those databases has been chosen and the data analysis process has been repeated and described systematically. This post deals with the first part of the process: it covers the intercorrelation among the physicochemical descriptors used to characterize the solvents under study, and Principal Component Analysis. The correlation found makes it possible to capture 70% of the initial data variability using just two principal components, the first of which is related to the “polarity/polarizability” and “lipophilicity” of the molecules and the second to the “strength of intermolecular forces”. The use of these two principal components suggests the possibility of grouping solvents into aggregates (or clusters) of similar individuals; this aspect will be covered in the following post.
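To make the idea of "variability captured by two principal components" concrete, here is a minimal sketch in plain Python. It builds the correlation matrix of a small made-up table and estimates its two largest eigenvalues by power iteration with deflation; each eigenvalue divided by the number of descriptors is the fraction of variance explained by that component. The data, the three descriptors, the function names and the use of power iteration are all assumptions for illustration; they are not the database or the procedure used in the post.

```python
import math

# Hypothetical toy table: rows = solvents, columns = 3 made-up descriptors.
data = [
    [1.0, 2.1, 0.5],
    [2.0, 3.9, 0.7],
    [3.0, 6.2, 0.2],
    [4.0, 7.8, 0.9],
    [5.0, 10.1, 0.4],
]

def standardize(rows):
    """Center each column and scale it to unit sample standard deviation."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    sds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
           for j in range(p)]
    return [[(r[j] - means[j]) / sds[j] for j in range(p)] for r in rows]

def correlation_matrix(rows):
    """Pearson correlation matrix of the columns."""
    z = standardize(rows)
    n, p = len(z), len(z[0])
    return [[sum(r[i] * r[j] for r in z) / (n - 1) for j in range(p)]
            for i in range(p)]

def top_eigenpair(m, n_iter=500):
    """Largest eigenvalue and its eigenvector, by power iteration."""
    p = len(m)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(n_iter):
        w = [sum(m[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient v' M v (v has unit length)
    value = sum(v[i] * sum(m[i][j] * v[j] for j in range(p)) for i in range(p))
    return value, v

def deflate(m, value, v):
    """Remove the component just found so the next one can be extracted."""
    p = len(m)
    return [[m[i][j] - value * v[i] * v[j] for j in range(p)] for i in range(p)]

corr = correlation_matrix(data)
ev1, pc1 = top_eigenpair(corr)
ev2, pc2 = top_eigenpair(deflate(corr, ev1, pc1))

# For a correlation matrix the eigenvalues sum to the number of variables,
# so eigenvalue / p is the fraction of total variance explained.
explained = [ev1 / len(corr), ev2 / len(corr)]
print("explained variance ratios:", [round(x, 3) for x in explained])
print("loadings of PC1:", [round(x, 3) for x in pc1])
```

The loadings of each component (the entries of `pc1` and `pc2`) show which descriptors dominate it, which is the kind of reading that leads to labels such as “polarity/polarizability” for a component.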
A different way to look at pharmaceutical Quality Control data: multivariate instead of univariate.
Abstract: In the pharmaceutical industry, Quality Control (QC) data are typically arranged in data tables in which each row refers to a specific production lot and contains the results of different types of measurements (chemical and microbiological). Since there is a specific data table for each active chemical entity, or dosage form, and since all the lots listed in it are manufactured using the same approved process, the data table contains the “analytical fingerprint” of that specific manufacturing process. Despite their tabular form, QC data are usually reviewed, evaluated and trended in a univariate mode, i.e., each type of data is analyzed individually using statistical tools such as control charts, box plots, etc. The dataset is therefore studied “by columns”. This post proposes a different way to analyze QC data: a multivariate approach that improves upon separate univariate analyses of each variable by using information about the relationships between the variables. Moreover, combining multivariate methods with the power of the programming language R and its unsurpassed graphic tools makes it possible to analyze data relying mainly on graphics and, as stated by Chambers et al., “there is no statistical tool that is as powerful as a well-chosen graph”. This post shows how, by using R for combined multivariate data analysis and visualization, the information contained in a QC chemical dataset can be easily extracted and converted into “knowledge ready to use”.
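The following sketch illustrates, under stated assumptions, why looking at QC variables together can reveal what column-by-column checks miss. It uses the Mahalanobis distance, which is chosen here purely as an example of a multivariate measure and is not necessarily the method used in the post. The lots, the two variables (assay and impurity) and the new lot are invented, and the code is plain Python rather than R.

```python
import math

# Hypothetical reference lots: (assay %, impurity %). In this made-up
# process, higher assay goes together with lower impurity.
reference_lots = [
    (99.0, 0.50), (99.2, 0.42), (99.4, 0.31), (99.6, 0.19), (99.8, 0.10),
    (99.1, 0.46), (99.5, 0.24), (99.7, 0.16), (99.3, 0.35), (99.4, 0.29),
]
new_lot = (99.7, 0.40)  # hypothetical lot: high assay AND high impurity

def mean(xs):
    return sum(xs) / len(xs)

def covariance(xs, ys):
    """Sample covariance (n - 1 in the denominator)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

assay = [a for a, _ in reference_lots]
impurity = [i for _, i in reference_lots]

s11 = covariance(assay, assay)        # variance of assay
s22 = covariance(impurity, impurity)  # variance of impurity
s12 = covariance(assay, impurity)     # covariance between the two

dx = new_lot[0] - mean(assay)
dy = new_lot[1] - mean(impurity)

# Univariate view: each variable judged on its own ("by columns")
z_scores = (dx / math.sqrt(s11), dy / math.sqrt(s22))

# Multivariate view: Mahalanobis distance, using the explicit inverse
# of the 2x2 covariance matrix [[s11, s12], [s12, s22]]
det = s11 * s22 - s12 ** 2
d_squared = (s22 * dx * dx - 2 * s12 * dx * dy + s11 * dy * dy) / det
mahalanobis = math.sqrt(d_squared)

print("univariate z-scores:", [round(z, 2) for z in z_scores])
print("Mahalanobis distance:", round(mahalanobis, 2))
```

In this invented example each z-score stays within the usual three-sigma band, so separate control charts would raise no alarm, while the Mahalanobis distance is large because the new lot breaks the relationship between the two variables seen in the reference lots.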