XDIP
The XDIP provides a comprehensive foundation for subsequent data-driven analysis and modeling efforts.
Iron plays an essential role in numerous biological processes and has complex coordination chemistry.
Utilized existing protein (local) structure databases and manually annotated data to construct the database.
Integrated detailed XAS spectra with protein structural data, enabling direct comparison of XAS features with structural motifs.
First, we constructed search keywords to identify relevant literature. The search used combinations of keywords from two sets, {XAS, XANES, XAFS, EXAFS} and {Metalloprotein, Protein, Enzyme, Fe-iron}, such as "XAS Protein" or "EXAFS Metalloprotein". These queries were used to retrieve literature from various publishers, either through the publishers' search APIs or through manual searches under copyright permission. After excluding duplicate results, 20,915 articles remained. We then manually selected papers using a simple criterion: the paper must mention the protein and include at least one Fe absorption spectrum. This yielded 573 articles. Human experts then annotated the text and XAS plot information, converting it into digitized data samples. The extracted dataset was refined by removing low-quality samples, poorly presented XAS spectra and structures, and outdated annotations. In summary, the dataset was constructed through a combination of automatic searching and manual data extraction and cleaning.
Each data sample in our dataset consists of three main parts: (1) the entire or local protein structure (the first coordination sphere of the element of interest), (2) the protein's corresponding Fe K-edge XAS spectra, and (3) basic information about the papers from which the protein structure and XAS spectrum were derived.
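The keyword pairing and duplicate removal described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names and the assumption that each search hit is a dictionary with a "doi" key are hypothetical, and the actual publisher API calls are omitted because the text does not specify them.

```python
from itertools import product

# Keyword sets as listed in the text.
TECHNIQUE_TERMS = ["XAS", "XANES", "XAFS", "EXAFS"]
SUBJECT_TERMS = ["Metalloprotein", "Protein", "Enzyme", "Fe-iron"]


def build_queries(techniques, subjects):
    """Return every 'technique subject' pair as a search string."""
    return [f"{t} {s}" for t, s in product(techniques, subjects)]


def deduplicate(records):
    """Keep the first record seen for each DOI (case-insensitive).

    `records` is a hypothetical list of dicts with a "doi" key; the
    record format returned by each publisher is an assumption here.
    """
    seen = set()
    unique = []
    for record in records:
        key = record["doi"].strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


queries = build_queries(TECHNIQUE_TERMS, SUBJECT_TERMS)
```

Pairing every term of the first set with every term of the second gives 16 queries, including the two examples quoted in the text. Because the same article can be returned by several queries or publishers, deduplication has to happen after the results are merged.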
The figure below shows the structure of the data records. A single paper may contain several extractable items, so we stacked the spectra and their corresponding structures. (a) The literature metadata, which includes each paper's DOI, title, and absorbing element. In this study, we focus exclusively on Fe. (b) The extracted spectrum data. We grouped the absorption spectra into two categories: near-edge absorption spectra and extended absorption spectra. (c) The structure extracted from the paper. Sometimes the structure of the target material is available online by searching its ID or name, or it can be obtained directly from its SMILES representation. Otherwise, we constructed the protein structure by labeling its adjacency matrix, atom list, atom coordinates, bond lengths, bond angles, and optional notes.
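One way to picture the three-part record is as a set of nested data classes. The sketch below is an assumption about how such a record could be laid out in code; every class and field name is hypothetical and does not reflect the database's actual schema. Bond lengths and bond angles are left out for brevity.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class PaperMetadata:
    """Part (a): literature metadata."""
    doi: str
    title: str
    absorbing_element: str = "Fe"  # this study covers Fe only


@dataclass
class Spectrum:
    """Part (b): one digitized absorption spectrum."""
    kind: str       # "XANES" (near-edge) or "EXAFS" (extended)
    x: List[float]  # digitized horizontal-axis values
    y: List[float]  # digitized vertical-axis values

    def __post_init__(self):
        if self.kind not in ("XANES", "EXAFS"):
            raise ValueError(f"unknown spectrum kind: {self.kind!r}")
        if len(self.x) != len(self.y):
            raise ValueError("x and y must have the same length")


@dataclass
class LocalStructure:
    """Part (c): local structure around the absorbing atom."""
    atoms: List[str]
    adjacency: List[List[int]]  # adjacency[i][j] == 1 if atoms i and j are bonded
    coordinates: Optional[List[Tuple[float, float, float]]] = None
    external_id: Optional[str] = None  # ID in a public database, if any
    smiles: Optional[str] = None
    notes: str = ""

    def __post_init__(self):
        n = len(self.atoms)
        if len(self.adjacency) != n or any(len(row) != n for row in self.adjacency):
            raise ValueError("adjacency must be an n x n matrix for n atoms")


@dataclass
class DataRecord:
    """One paper with its stacked spectra and structures."""
    paper: PaperMetadata
    spectra: List[Spectrum] = field(default_factory=list)
    structures: List[LocalStructure] = field(default_factory=list)


# Placeholder values for illustration only; not taken from the database.
example = DataRecord(
    paper=PaperMetadata(doi="10.0000/placeholder", title="Placeholder title"),
    spectra=[Spectrum(kind="XANES", x=[0.0, 1.0], y=[0.1, 0.9])],
    structures=[LocalStructure(atoms=["Fe", "N", "N"],
                               adjacency=[[0, 1, 1], [1, 0, 0], [1, 0, 0]])],
)
```

Keeping the spectra and structures as lists on the record mirrors the stacking described above, where one paper contributes several spectra and structures.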
In this work, we discuss the creation of a comprehensive X-ray Absorption Spectroscopy (XAS) database focused on iron-containing proteins. Iron is a crucial element for numerous biological processes, and the database aims to fill the gap in available high-quality annotated XAS spectral data for such proteins.
- The database contains 437 protein structures and their corresponding 1652 XAS spectra, collected from 573 research papers published between 2007 and 2023.
- XAS data, including both XANES (X-ray Absorption Near-Edge Structure) and EXAFS (Extended X-ray Absorption Fine Structure) spectra, were extracted. These spectra provide insights into the local chemical environments of iron atoms within proteins.
- The dataset is tailored to be useful for machine learning applications in predicting and analyzing protein structures and functions based on spectral data.
- A systematic search was conducted using keyword combinations such as "XAS," "XANES," "Fe-iron," and "protein." From over 20,000 articles, only those featuring iron absorption spectra related to proteins were selected.
- XAS spectra were digitized from plots in the research papers using the WebPlotDigitizer tool.
- Manual review and extraction of structural data focused on iron's local chemical environment in proteins. This data was cross-referenced with public databases (e.g., PDB, CCDC).
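Spectra digitized from published plots usually come out as irregularly spaced points, whereas machine learning models generally expect inputs of fixed length. The sketch below shows one possible preprocessing step, linear resampling onto a uniform grid with NumPy. It is an assumption about how the data might be prepared, not a step the source describes, and the function name and default grid size are hypothetical.

```python
import numpy as np


def resample_spectrum(energy, absorption, num_points=200):
    """Linearly resample digitized (energy, absorption) points.

    Returns a uniform energy grid spanning the digitized range and the
    interpolated absorption values on that grid.
    """
    energy = np.asarray(energy, dtype=float)
    absorption = np.asarray(absorption, dtype=float)
    if energy.shape != absorption.shape or energy.size < 2:
        raise ValueError("need two equal-length arrays with at least 2 points")

    # np.unique sorts the energies and drops repeated values, keeping
    # the first absorption value recorded for each energy.
    energy, first_idx = np.unique(energy, return_index=True)
    absorption = absorption[first_idx]

    grid = np.linspace(energy[0], energy[-1], num_points)
    return grid, np.interp(grid, energy, absorption)
```

Sorting and removing duplicate energies first matters because `np.interp` assumes increasing x values, and hand-clicked digitized points are not guaranteed to arrive in order.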
The accuracy of annotated XAS spectra was tested by comparing them to raw data from the Materials Project. Inter-expert reliability was measured using the Intra-class Correlation Coefficient (ICC), showing high consistency in data annotation.
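For readers unfamiliar with the ICC, the following sketch computes one common variant, ICC(2,1) (two-way random effects, absolute agreement, single rater), from a table of scores. The source does not state which ICC form was used, so the choice of variant is an assumption, and the function name is hypothetical.

```python
def icc_2_1(ratings):
    """ICC(2,1) for a table with one row per item and one column per rater."""
    n = len(ratings)     # number of rated items
    k = len(ratings[0])  # number of raters
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need a rectangular table with >= 2 rows and columns")

    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Mean squares from a two-way ANOVA without replication.
    ms_rows = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_cols = n * sum((m - grand) ** 2 for m in col_means) / (k - 1)
    ss_err = sum(
        (ratings[i][j] - row_means[i] - col_means[j] + grand) ** 2
        for i in range(n)
        for j in range(k)
    )
    ms_err = ss_err / ((n - 1) * (k - 1))

    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )
```

Values close to 1 indicate that annotators produce nearly identical values for the same item, while values near 0 indicate that disagreement between annotators is as large as the variation between items.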
The database is expected to be useful for catalysis research, particularly in the study of metalloproteins. It integrates structural and spectral data, facilitating applications such as deep learning for predicting catalytic properties and guiding the design of new materials. This work fills a critical gap in the study of iron-containing proteins by providing a well-structured dataset, essential for biological chemistry, catalysis, and related fields.