1. The first step is to retrieve the abstracts from PubMed. You can save either the text version or the XML version. Retrieving the text version is reasonably quick and is sufficient for most exercises, although both versions can be used for text mining with pubmed.mineR. The XML version is better suited to clean operations. A screenshot of the Advanced Query builder is shown below.
Input terms in PubMed Query builder
Figure 1. Screenshot of the Advanced Query builder in PubMed. Users can set suitable criteria in order to retrieve the most relevant articles.
2. The second step is to save the Abstracts file in a suitably named working directory.
Figure 2. PubMed can save the data in various formats. We select the Abstract (text) format for further processing. While saving, users may be asked for a filename; they can type in a name of their choice or accept the default offered by PubMed.
3. The third step is to read the Abstracts file into an R object, more specifically an S4 object of class Abstracts. The function used is readabsnew().
Figure 3. Screenshot of reading the Abstracts file into an S4 R object with the function readabsnew(). A warning appears, which can be ignored. Running the function printabs() on the newly created R object displays the number of Abstracts read, along with small portions of the first and last Abstracts. This step can be used to verify that the number of Abstracts read matches the number displayed at the top of the PubMed page.
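For readers who prefer text to a screenshot, step 3 might look roughly like the sketch below. The file name pubmed_result.txt and the object name abs_corpus are placeholders of mine, not names from the post; the check around the calls is only there so that the snippet does not fail when the package or the file is missing.

```r
# Hypothetical file name: replace it with the name you used when saving from PubMed.
abstracts_file <- "pubmed_result.txt"

if (requireNamespace("pubmed.mineR", quietly = TRUE) && file.exists(abstracts_file)) {
  library(pubmed.mineR)
  # Read the saved Abstract (text) file into an S4 object of class Abstracts
  abs_corpus <- readabsnew(abstracts_file)
  # Show the number of abstracts read and the start and end of the corpus
  printabs(abs_corpus)
} else {
  message("Install pubmed.mineR and save the Abstracts file in the working directory first.")
}
```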
4. Once we have the abstracts corpus as an S4 object, many questions can be pursued using the variety of functions of pubmed.mineR. A preliminary step could be to extract the top frequency terms or words from the corpus.
Figure 4. Screenshot of viewing the top 50 words and their frequencies of occurrence in the corpus. In this case the terms resistance, antibiotic, bacteria and antimicrobial dominate, which characterizes the corpus as being about antibiotic or antimicrobial resistance.
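To make the idea of "top frequency words" concrete, here is a minimal base R sketch on three made-up abstracts. It is only an illustration of counting words; it is not how pubmed.mineR does it internally, and the toy sentences are invented for the example.

```r
# Toy abstracts standing in for a real corpus (invented for illustration)
toy_abstracts <- c(
  "Antibiotic resistance in bacteria is rising.",
  "Antimicrobial resistance limits antibiotic therapy.",
  "Bacteria acquire resistance genes."
)

# Lower-case the text, split on anything that is not a letter, drop empty tokens
words <- unlist(strsplit(tolower(toy_abstracts), "[^a-z]+"))
words <- words[nzchar(words)]

# Tabulate the words and sort by decreasing frequency
word_freq <- sort(table(words), decreasing = TRUE)
head(word_freq, 5)
```

On a real corpus one would normally also remove very common English words before reading much into the ranking.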
5. A natural next question: how many abstracts in the corpus mention each of a set of given terms?
Figure 5. Screenshot of the number of abstracts corresponding to a given term within a corpus. Among the three terms Respiratory, Gastroenterology and Dermatology, the highest number of Abstracts corresponds to Respiratory. Recalling that the corpus is about antimicrobial resistance, this suggests that much of the published work on antimicrobial resistance in this corpus relates to respiratory infections.
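The counting behind this step can be sketched in base R as follows. The four toy abstracts are invented, and the sketch simply counts how many of them contain each term, ignoring case; the package function shown in the screenshot may behave differently in its details.

```r
# Toy abstracts (invented for illustration)
toy_abstracts <- c(
  "Resistance in respiratory pathogens is increasing.",
  "Respiratory infections and antimicrobial resistance.",
  "Dermatology clinics report resistant skin infections.",
  "Antibiotic stewardship in hospitals."
)

terms <- c("Respiratory", "Gastroenterology", "Dermatology")

# For each term, count the abstracts that contain it (case-insensitive, literal match)
abstracts_per_term <- sapply(terms, function(term) {
  sum(grepl(term, toy_abstracts, ignore.case = TRUE))
})
abstracts_per_term
```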
6. What are the functions available in pubmed.mineR?
Figure 6. Screenshot of the functions of pubmed.mineR. These functions offer a wide variety of utilities for text mining.
7. The functions continued
8. What are the slots of the S4 object of class Abstracts?
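If S4 slots are unfamiliar, the toy class below may help. It is a stand-in that I define here, called ToyAbstracts, and I am assuming that it mirrors the slot names Journal, Abstract and PMID of the package's Abstracts class; please check with slotNames("Abstracts") after loading pubmed.mineR. The journal names and abstract texts are placeholders.

```r
library(methods)

# A toy S4 class; slot names are assumed to mirror the Abstracts class
setClass("ToyAbstracts",
         representation(Journal = "character",
                        Abstract = "character",
                        PMID = "numeric"))

toy <- new("ToyAbstracts",
           Journal  = c("Journal A", "Journal B"),
           Abstract = c("First toy abstract.", "Second toy abstract."),
           PMID     = c(28785560, 25695404))

# List the slots of the class and access one of them with the @ operator
slotNames("ToyAbstracts")
toy@PMID
```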
9. Annotating Biological Entities, the PubTator facility.
Figure 9. Screenshot of the PubTator site.
10. Extracting annotated data from PubTator.
Please note that pubtator_function() runs well on Linux systems such as Ubuntu. On Windows, some problems have been noticed recently. In that case please use pubtator_function_JSON() instead.
Please load the RJSONIO package first with library(RJSONIO).
The code is here:
pubtator_function_JSON <- function(x) {
  # Build the PubTator export URL for the given PMID
  url <- paste("https://www.ncbi.nlm.nih.gov/research/pubtator-api/publications/export/biocjson?pmids=", x, sep = "")
  # Stop early if the service does not return valid JSON
  if (!isValidJSON(url)) return(" No Data ")
  test <- fromJSON(url)
  # Use the annotations of the second passage; fall back to the first if there are none
  temp <- test$passages[[2]]$annotations
  if (length(temp) == 0) temp <- test$passages[[1]]$annotations
  if (length(temp) == 0) return(" No Data ")
  gene <- NULL
  disease <- NULL
  mutation <- NULL
  chemical <- NULL
  species <- NULL
  # Sort each annotation into its entity type as "infons[1]>text"
  for (i in seq_along(temp)) {
    type <- temp[[i]]$infons[2]
    entry <- paste(temp[[i]]$infons[1], temp[[i]]$text, sep = ">")
    if (type == "Disease") disease <- c(disease, entry)
    else if (type == "Species") species <- c(species, entry)
    else if (type == "Mutation") mutation <- c(mutation, entry)
    else if (type == "Chemical") chemical <- c(chemical, entry)
    else if (type == "Gene") gene <- c(gene, entry)
  }
  # Drop duplicates and return one vector per entity type, plus the PMID
  return(list(Genes = unique(gene), Diseases = unique(disease), Mutations = unique(mutation), Chemicals = unique(chemical), Species = unique(species), PMID = x))
}
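The grouping done by the function above can be tried offline on a mock annotation list. The sketch below assumes, as the function's indexing suggests, that infons[1] holds an identifier and infons[2] the entity type; the identifiers ID1 and ID2 and the entity texts are placeholders, not real PubTator output.

```r
# Mock annotations shaped like the structure the function indexes.
# All identifiers and texts are placeholders, not real PubTator output.
mock_annotations <- list(
  list(infons = c(identifier = "ID1", type = "Disease"),  text = "pneumonia"),
  list(infons = c(identifier = "ID2", type = "Chemical"), text = "ciprofloxacin"),
  list(infons = c(identifier = "ID1", type = "Disease"),  text = "pneumonia")
)

# Extract the entity type and build the "identifier>text" entry for each annotation
types   <- vapply(mock_annotations, function(a) a$infons[[2]], character(1))
entries <- vapply(mock_annotations,
                  function(a) paste(a$infons[[1]], a$text, sep = ">"),
                  character(1))

# Group the entries by entity type and drop duplicates within each group
by_type <- lapply(split(entries, types), unique)
by_type$Disease
```

Using split() here is just a compact alternative to the chain of if/else tests; the result per entity type is the same kind of de-duplicated character vector.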
12. Getting deeper: the sentence context
Figure 12. Screenshot of extracting the sentences in the Abstracts that contain a given term, 'resistance' in this example. There are two ways of doing this. At the top, the Give_sentences() function is used together with the subabs() function, which gets the first abstract from the main corpus. At the bottom, the first PMID from the pmids vector is passed to the function pmids_to_abstracts() to obtain that abstract from the main corpus.
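What "sentences containing a term" means can be shown with a small base R sketch. This is my own simplified illustration on an invented abstract, not the package's implementation; in particular the sentence splitting is naive and will stumble over abbreviations.

```r
# A toy abstract (invented for illustration)
toy_abstract <- paste(
  "Antibiotic resistance is a global problem.",
  "New drugs are needed.",
  "Resistance genes spread between bacteria."
)

# Naive sentence split: put a line break after ., ! or ? followed by whitespace
marked    <- gsub("([.!?])\\s+", "\\1\n", toy_abstract)
sentences <- unlist(strsplit(marked, "\n", fixed = TRUE))

# Keep the sentences that mention the term, ignoring case
hits <- sentences[grepl("resistance", sentences, ignore.case = TRUE)]
hits
```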
13. The XML option.
1. Please download the PMIDs of the Abstracts by selecting the option PMID under the Save button.
2. After that, please include the PMIDs in this URL
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=28785560,25695404&retmode=xml
and paste it into the address bar of a web browser, for example Chrome, and press Enter to retrieve the Abstracts in XML format. In this example, I have included two PMIDs, 28785560 and 25695404, separated by a comma. You can include up to 50 comma-separated PMIDs in one such retrieval; 50 is the limit specified by the NLM. Please don't change anything else in the URL.
3. Save the web page in your working directory under a filename of your choice. The web browser automatically saves the retrieved data as an XML file.
4. After that you can read the saved file with new_xmlreadabs() from pubmed.mineR and continue as before. Enjoy.
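When there are many PMIDs, building the URLs by hand gets tedious. The steps above can be sketched in base R as below; the variable names are mine, the batch size of 50 follows the limit mentioned above, and the sketch only builds the URLs without downloading anything.

```r
# PMIDs kept as character strings so they are never printed in scientific notation
pmids <- c("28785560", "25695404")

# Split the PMIDs into batches of at most 50, the limit mentioned above
batch_size <- 50
batches <- split(pmids, ceiling(seq_along(pmids) / batch_size))

# Build one efetch URL per batch, with the PMIDs separated by commas
efetch_urls <- vapply(batches, function(ids) {
  paste0("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=",
         paste(ids, collapse = ","),
         "&retmode=xml")
}, character(1))

efetch_urls
```

Each URL can then be opened in the browser and saved as described in step 3.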