Saturday, July 7, 2018

Breaking the ice - Beginner's tutorial


1. The first step is to retrieve the abstracts from PubMed. You can save either the text version or the XML version. The text version is reasonably quick to retrieve and is sufficient for most exercises, although both versions can be used for text mining with pubmed.mineR. The XML version is better suited for clean operations. The screenshot below shows the advanced query builder.

    Input terms in PubMed Query builder

  



Figure 1. Screenshot of the Advanced Query builder in PubMed. Users can set suitable criteria in order to retrieve the most relevant articles.

2. The second step is to save the Abstracts file in a suitably named working directory.

  

Figure 2. PubMed offers several formats for saving the data. We select the Abstract (text) format for further processing. While saving, users may be asked for a filename; they can type a name of their choice or accept the default filename offered by PubMed.

3. The third step is to read the Abstracts file into an R object, more specifically an S4 object of class Abstracts. The function used is readabsnew().




Figure 3. Screenshot of reading the Abstracts file into an S4 R object with the function readabsnew(). A warning appears, which can be ignored. Running printabs() on the newly created object displays the number of Abstracts read and short portions of the first and last Abstracts. Use this step to verify that the number of Abstracts read matches the number displayed at the top of the PubMed page.
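Since the screenshot is not reproduced as text, here is a minimal sketch of what this step might look like. The filename pubmed_result.txt and the object name myabs are assumptions made for illustration; use the name you chose in step 2. The guard only keeps the sketch from failing when the package or the file is missing.

```r
# Assumption: "pubmed_result.txt" is a placeholder for the file saved in step 2.
abs_file <- "pubmed_result.txt"

if (requireNamespace("pubmed.mineR", quietly = TRUE) && file.exists(abs_file)) {
  library(pubmed.mineR)
  myabs <- readabsnew(abs_file)    # S4 object of class Abstracts
  printabs(myabs)                  # number of abstracts plus the first and last ones
  n_read <- length(myabs@PMID)     # compare with the count shown on the PubMed page
  print(n_read)
}
```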

4. Once we have the abstracts corpus as an S4 object, many questions can be pursued using the variety of functions of pubmed.mineR. A preliminary step could be to extract the top frequency terms or words from the corpus.

 

Figure 4. Screenshot of viewing the top 50 words in the corpus and their frequencies of occurrence. In this case the terms resistance, antibiotic, bacteria and antimicrobial dominate, indicating that the corpus is about antibiotic or antimicrobial resistance.
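To make the idea of a word-frequency table concrete, here is a minimal base R sketch run on three invented sentences rather than on the real corpus. In pubmed.mineR, word_atomizations() is the function intended for this kind of table on an Abstracts object; the toy text and variable names below are assumptions made for illustration only.

```r
# Toy stand-in for the abstract text; a real corpus would come from an Abstracts object.
toy_abstracts <- c(
  "Antibiotic resistance in bacteria is rising.",
  "Antimicrobial resistance limits antibiotic therapy.",
  "Resistance genes spread between bacteria."
)

# Lower-case the text, split on anything that is not a letter, drop empty strings.
words <- unlist(strsplit(tolower(toy_abstracts), "[^a-z]+"))
words <- words[nchar(words) > 0]

# Count each word and sort by decreasing frequency.
word_freq <- sort(table(words), decreasing = TRUE)
top_words <- head(word_freq, 5)
print(top_words)
```

Unlike a real text-mining function, this sketch does not remove common words such as "in" or "is".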


5. An interesting next question is how the abstracts are distributed across terms of interest: how many abstracts mention each term?



Figure 5. Screenshot of the number of abstracts corresponding to a given term within a corpus. Among the three terms Respiratory, Gastroenterology and Dermatology, the highest number of Abstracts corresponds to Respiratory. Recalling that the corpus is about antimicrobial resistance, this suggests that, of these three areas, most of the published work on antimicrobial resistance concerns respiratory infections.
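The counting itself can be illustrated with a short base R sketch on invented text. This is not the call shown in the screenshot; pubmed.mineR has its own search functions (getabs(), for example, subsets a corpus by term). The toy abstracts below are assumptions made for illustration.

```r
# Invented abstracts standing in for the text of a real corpus.
toy_abstracts <- c(
  "Respiratory infections caused by resistant bacteria.",
  "Dermatology clinics report resistant skin infections.",
  "Respiratory pathogens and antimicrobial resistance.",
  "Gastroenterology and respiratory cases were reviewed."
)
terms <- c("Respiratory", "Gastroenterology", "Dermatology")

# For each term, count the abstracts that mention it (case-insensitive, literal match).
abstracts_per_term <- sapply(terms, function(term) {
  sum(grepl(tolower(term), tolower(toy_abstracts), fixed = TRUE))
})
print(abstracts_per_term)
```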

6. What are the functions available in pubmed.mineR? 



Figure 6. Screenshot of the functions of pubmed.mineR. These functions offer a wide variety of utilities for text mining.



7. The functions continued

 
 

Figure 7. Screenshot of the functions of pubmed.mineR (continued).
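One way to get such a listing at the R prompt is to ask for the names a package exports. The helper list_exports() below is a hypothetical convenience function written for this sketch, not part of pubmed.mineR; it is tried on the stats package, which ships with R, and on pubmed.mineR only if that package is installed.

```r
# Hypothetical helper: sorted names of everything a package exports.
list_exports <- function(pkg) sort(getNamespaceExports(pkg))

# Works for any installed package, for example the built-in stats package.
print(head(list_exports("stats")))

# The same call for pubmed.mineR, guarded in case the package is not installed.
if (requireNamespace("pubmed.mineR", quietly = TRUE)) {
  print(list_exports("pubmed.mineR"))
}
```

Running help(package = "pubmed.mineR") additionally opens the package index with a one-line description of each function.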


8. What are the slots of the S4 object of class Abstracts?

 

Figure 8. Screenshot of the slots of the S4 object of class Abstracts. An S4 object of class Abstracts contains the full text of the PubMed abstracts. Note the slot operator @, marked by an arrow.
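For readers new to S4, the sketch below defines a toy class to show how slots and the @ operator behave. It assumes that the real Abstracts class has the three slots Journal, Abstract and PMID; the class name ToyAbstracts and its contents are invented for illustration, and the real class is defined by pubmed.mineR.

```r
library(methods)

# Toy class for illustration only; slot names are assumed to mirror class Abstracts.
setClass("ToyAbstracts",
         representation(Journal = "character",
                        Abstract = "character",
                        PMID = "numeric"))

toy <- new("ToyAbstracts",
           Journal  = c("Journal A", "Journal B"),
           Abstract = c("First abstract text.", "Second abstract text."),
           PMID     = c(28785560, 25695404))

print(slotNames(toy))     # names of the slots
print(toy@PMID)           # the slot operator @ extracts one slot
print(toy@Abstract[1])    # slots are ordinary vectors and can be indexed
```

On a real corpus, slotNames() can be called on the Abstracts object in the same way.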
 
9. Annotating Biological Entities, the PubTator facility.

Figure 9. Screenshot of the PubTator site.

10. Extracting annotated data from PubTator.

 

Figure 10. Screenshot of the use of pubtator_function() to extract annotated data from PubTator. The underlying principle is entity recognition. The input to PubTator is the PMID of an Abstract. All the PMIDs of a corpus are extracted with the slot operator, @PMID (blue arrow), which gives a vector of PMID numbers. As shown, one can input the first PMID, that is the first element of the vector (green arrow), get the results from PubTator and display them (red arrow).
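A minimal sketch of these three actions follows. The pmids vector here is a stand-in for the result of the @PMID slot on a real corpus, filled with the two PMIDs used in the XML example at the end of this post. The call to PubTator needs network access and the pubmed.mineR package, so it is guarded and only runs in an interactive session.

```r
# Stand-in for the PMID slot of a real corpus (for example myabs@PMID).
pmids <- c(28785560, 25695404)

# The first element of the vector is the PMID that is sent to PubTator.
first_pmid <- pmids[1]

# Network call, so it is only attempted interactively and if the package is installed.
if (interactive() && requireNamespace("pubmed.mineR", quietly = TRUE)) {
  result <- pubmed.mineR::pubtator_function(first_pmid)
  print(result)
}
```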

Please note that pubtator_function() runs well on Linux distributions such as Ubuntu. On Windows, some problems have been noticed recently; in that case please use pubtator_function_JSON() instead.

Please load the RJSONIO package first, as in the first line of the code below.
The code is here:
library(RJSONIO)

pubtator_function_JSON <- function(x) {
  url <- paste("https://www.ncbi.nlm.nih.gov/research/pubtator-api/publications/export/biocjson?pmids=", x, sep = "")
  # Stop early if the server does not return valid JSON for this PMID.
  if (!isValidJSON(url)) return(" No Data ")
  test <- fromJSON(url)

  # Use the annotations of the second passage; fall back to the first passage if it has none.
  temp <- test$passages[[2]]$annotations
  if (length(temp) == 0) temp <- test$passages[[1]]$annotations
  if (length(temp) == 0) return(" No Data ")

  # Collect the unique "identifier>text" strings for one entity type.
  collect <- function(type) {
    out <- NULL
    for (a in temp) {
      if (a$infons[2] == type) out <- c(out, paste(a$infons[1], a$text, sep = ">"))
    }
    unique(out)
  }

  list(Genes = collect("Gene"), Diseases = collect("Disease"),
       Mutations = collect("Mutation"), Chemicals = collect("Chemical"),
       Species = collect("Species"), PMID = x)
}








11.  Collate results of PubTator from multiple Abstracts
 
 

Figure 11. Screenshot of the use of pubtator_function() to extract annotated data from PubTator for multiple Abstracts. Compared with the previous example, 50 PMIDs are input here (please restrict yourself to 50 PMIDs at a time, as per the NCBI server limit) and the results are collected into a table. The table is written as a tab-delimited file, which can be opened with Google Sheets, Microsoft Excel or OpenOffice. An example of the table is shown below the R script. Each row holds the data for one PMID, with the PMID itself in the 6th column. Note that the first PMID, 27092975, has multiple diseases in the 2nd column of the table.
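The collation step can be sketched in base R as follows. This is not the script from the screenshot. It assumes each per-PMID result is a list with the elements Genes, Diseases, Mutations, Chemicals, Species and PMID, as returned by pubtator_function_JSON() above, or the string " No Data ". The entity strings and PMIDs in toy_results are invented placeholders, not real PubTator output, and to_row() is a hypothetical helper.

```r
# Toy results shaped like the output of pubtator_function_JSON(); values are invented.
toy_results <- list(
  list(Genes = NULL, Diseases = c("D1>disease one", "D2>disease two"),
       Mutations = NULL, Chemicals = "C1>chemical one", Species = NULL,
       PMID = 11111111),
  " No Data ",
  list(Genes = "G1>gene one", Diseases = NULL, Mutations = NULL,
       Chemicals = NULL, Species = "S1>species one", PMID = 22222222)
)

# Keep only the PMIDs that returned annotations.
has_data <- Filter(is.list, toy_results)

# Collapse each entity vector into one cell so that every PMID fills one row.
to_row <- function(res) {
  sapply(res, function(v) {
    if (is.null(v)) return(NA_character_)
    if (is.numeric(v)) v <- format(v, scientific = FALSE, trim = TRUE)
    paste(v, collapse = ",")
  })
}
result_table <- as.data.frame(do.call(rbind, lapply(has_data, to_row)),
                              stringsAsFactors = FALSE)
print(result_table)

# Write a tab-delimited file that a spreadsheet program can open.
out_file <- file.path(tempdir(), "pubtator_results.txt")
write.table(result_table, out_file, sep = "\t", row.names = FALSE, quote = FALSE)
```

With a real corpus, the list of results could be produced by applying the PubTator function to at most 50 PMIDs with lapply().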

12. Getting deeper: the sentence context

  
Figure 12. Screenshot of extracting the sentences that contain a given term, 'resistance' in this example, from the Abstracts. There are two ways of doing this. At the top, the Give_Sentences() function is used together with subabs(), which gets the first abstract from the main corpus. At the bottom, the first PMID from the pmids vector is passed to pmids_to_abstracts() to retrieve that abstract from the main corpus.
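What sentence extraction does can be illustrated with a small base R sketch on an invented abstract. It is a simplified stand-in, not the pubmed.mineR implementation: sentences are split naively after a full stop, question mark or exclamation mark, which real abstracts with abbreviations would not always survive.

```r
# Invented abstract text for illustration.
toy_abstract <- paste(
  "Antibiotic resistance is a global concern.",
  "New drugs are needed.",
  "Resistance genes spread quickly."
)
term <- "resistance"

# Split at whitespace that follows sentence-ending punctuation.
sentences <- unlist(strsplit(toy_abstract, "(?<=[.!?])\\s+", perl = TRUE))

# Keep the sentences that mention the term, ignoring case.
hits <- sentences[grepl(term, sentences, ignore.case = TRUE)]
print(hits)
```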


 13. The XML option.


1. Please download the PMIDs of the Abstracts by selecting the PMID option under the Save button.

2. After that, please include the PMIDs in this URL:

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=28785560,25695404&retmode=xml 

and paste it into the address bar of your web browser (Chrome, for example), then press Enter to retrieve the Abstracts in XML format. In this example, I have included two PMIDs, 28785560 and 25695404, separated by a comma. You can include up to 50 comma-separated PMIDs in one such retrieval; 50 is the limit specified by the NLM. Please don't change anything else in the URL.

3. Save the web page in your working directory under a filename of your choice. The web browser automatically saves the retrieved data as an XML file.

4. After that, you can read the saved file with new_xmlreadabs() from pubmed.mineR and continue as before.
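Steps 2 to 4 can also be scripted. The sketch below builds the efetch URL from a vector of PMIDs in batches of at most 50, then downloads and reads the first batch. The filename abstracts.xml and the object names are assumptions made for illustration, and new_xmlreadabs() is assumed to take the path of the saved XML file. The download is guarded because it needs network access and the pubmed.mineR package.

```r
pmids <- c(28785560, 25695404)   # the two PMIDs used in the example URL above
base_url <- "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

# Split the PMIDs into batches of at most 50, the limit mentioned in step 2.
batches <- split(pmids, ceiling(seq_along(pmids) / 50))

# Build one efetch URL per batch; format() keeps large PMIDs out of scientific notation.
urls <- vapply(batches, function(ids) {
  id_string <- paste(format(ids, scientific = FALSE, trim = TRUE), collapse = ",")
  paste0(base_url, "?db=pubmed&id=", id_string, "&retmode=xml")
}, character(1))
print(unname(urls))

# Download and read the first batch; "abstracts.xml" is a hypothetical filename.
if (interactive() && requireNamespace("pubmed.mineR", quietly = TRUE)) {
  xml_file <- "abstracts.xml"
  download.file(urls[1], xml_file, quiet = TRUE)
  xmlabs <- pubmed.mineR::new_xmlreadabs(xml_file)
}
```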