KMAP

KMAP is a knowledge database generated semi-automatically from the scientific literature, in this case millions of MEDLINE abstracts. A schematic overview of its construction is shown below. We built a large database of biological keywords, which we refer to as concepts, organized in vocabularies describing genes, pathways, diseases, organisms, etc. We used these concept vocabularies to search the entire MEDLINE database of ~35 million abstracts. Based on the results, we constructed relations between concepts that are statistically associated with each other in the literature.

image0

Concept

The keywords are referred to as biological concepts and are organized in vocabularies. Each concept always has the following properties:

  • An identifier that uniquely identifies the concept. An identifier is constructed with a prefix followed by a number. From the prefix, the vocabulary can be deduced.

  • A name that is the most common name for the biological concept.

In addition, a concept can have synonyms and additional identifiers (xrefs) that link the concept to other databases.

image1

Relation

A relation is defined between two concepts when they occur together in an abstract or in a sentence. A single abstract in which the two concepts occur is sufficient to define a relation. Of course, when the concepts occur together in more abstracts, the relation is probably stronger. A relation always has a Subject (the first concept), an Object (the second concept) and a type of relation (Predicate). When two concepts occur in the same abstract, the Predicate is “Co-occurrence” and the relationship is reciprocal. If we analyze a relationship at sentence level, we may be able to infer the nature and the direction of the relation more specifically.

image2

To characterize the strength of a relation, we can look at three numbers.

  • Number of abstracts. This is the number of abstracts in which both concepts co-occur.

  • The enrichment score (ES). This score is based on the Mutual Information (MI) score and corrects for publication bias in concepts. For example, the term “cancer” has a lot of hits in PubMed, so nearly every concept would automatically have a strong hit with cancer if we scored on the number of abstracts alone. This would not necessarily imply a strong, specific connection (see also the figure below).

  • The local MI. In practice the ES tends to be very high for relations in which both concepts have only a few hits in MEDLINE, whereas the number of abstracts is biased towards well-known concepts with a lot of abstracts. The local MI is the product of the ES and the number of abstracts and is in practice a good trade-off. For further theory on scoring word associations, see the thesis of Stefan Evert.
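The trade-off between the three numbers can be illustrated with a small sketch. The relations and their values below are made up for illustration; only the rule that the local MI is the product of the ES and the number of abstracts comes from the text above.

```python
# Made-up relations for one subject concept (illustrative numbers only).
relations = [
    {"object": "well-known concept", "abstracts": 5000, "es": 0.2},
    {"object": "rare concept", "abstracts": 2, "es": 9.0},
    {"object": "specific concept", "abstracts": 300, "es": 4.0},
]

# Local MI = ES x number of abstracts, as described above.
for relation in relations:
    relation["local_mi"] = relation["es"] * relation["abstracts"]


def top(relations, key):
    """Return the object of the relation that scores highest on `key`."""
    return max(relations, key=lambda relation: relation[key])["object"]


for key in ("abstracts", "es", "local_mi"):
    print(key, "->", top(relations, key))
```

Ranking on the number of abstracts favours the well-known concept, ranking on the ES favours the rare one, and the local MI picks the concept that is both frequent and specific.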

image3

The formula that is used to calculate the Enrichment Score (ES) is

image4

Here FAB is the number of abstracts in which both concepts co-occur, and FA and FB are the numbers of abstracts in which each concept occurs individually.

Concept Sets

A concept set can be regarded as a set of concepts that have one or more mutual relations with each other. A natural representation of such a concept set is a network in which all pairwise relations are shown. Typical concept sets are:

  • a set of regulated genes derived from a gene expression experiment.

  • a set of diseases that is tied to a specific SNP or region of the chromosome.

  • a set of molecular pathways with their corresponding genes.

  • a set of micro-organisms obtained from a metagenomics experiment.

image5

image6

These concept sets can be used to find interesting network patterns, but can also be used to find additional terms that are related to the set as a whole. For example, a new concept that has a connection with all of the members of the concept set would be interesting (such as Pathway 2 in the figure).

Reference sets

All of the information is deduced from text. Currently KMAP works mostly on abstracts, although full-text analysis is also possible. Abstract sets can be used to deduce relations between concepts, but can also be used to quickly gather relevant literature around a subject. So, apart from searching with concepts, the API can also be approached with a starting set of PMIDs obtained from another source or application.

Access to the API

Input

The input required for the API is a URL indicating the function that is required and a list of one or more parameters. Most of the time these parameters are identifiers for abstracts, concepts or vocabularies, plus parameters that modify the results, e.g. by setting a limit on the number of results that are returned, ordering the results or setting the exact format of the returned results.

| Parameter type | Format | Description |
| --- | --- | --- |
| single identifier input | A string representing a single identifier | An identifier is used to uniquely pinpoint a single vocabulary, concept or abstract. |
| identifier_list input | A comma separated list of single identifiers | Lists are used to indicate multiple identifiers, e.g. when working with sets of concepts and sets of abstracts. |
| format of the output | A number or string controlling the output. | These parameters are used to limit the results to a maximum, order the results or format the results in a specific way. |
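As a minimal sketch of how these parameter types end up in a URL, the helper below joins identifier lists with commas and appends all parameters as a query string. The base URL and the helper name build_url are assumptions made for illustration; they are not part of the API.

```python
from urllib.parse import urlencode

# Hypothetical base URL; replace it with the address of your KMAP server.
BASE_URL = "https://kmap.example.org/api/"


def build_url(method, **params):
    """Build a GET URL for an API method such as 'concept/search/'."""
    query = {}
    for name, value in params.items():
        if value is None:
            continue  # optional parameter omitted, the server default applies
        if isinstance(value, (list, tuple)):
            # identifier_list input: a comma separated list of identifiers
            value = ",".join(str(item) for item in value)
        query[name] = value
    # Keep commas readable instead of percent-encoding them.
    return BASE_URL + method + "?" + urlencode(query, safe=",")


print(build_url("concept/search/", terms="CXCR3", apikey="MY_VALID_APIKEY"))
```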

Calls & Results

Each call to the server produces a result in JSON format. The standard response is a dictionary with 2 entries. For example, the call concept/search/?terms=CXCR3&apikey=MY_VALID_APIKEY returns the following JSON output:

image7

The status field indicates the outcome of the call. Currently it has the following values:

  • Success. The execution of the call was OK

  • Input Error. Some required arguments are missing or not correctly formatted.

  • Internal server Error. Something went wrong on the server side. If this problem persists in multiple calls, please contact us.

The result field holds the payload of the response. If the call was OK, it contains the actual results. In the case of an Input Error, it contains a relevant help message. In the case of an Internal server Error, it contains the message “The server could not process your request for an unknown reason.”
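A client can use the status field to decide whether the result field is usable. The sketch below assumes that the two entries are keyed "status" and "result" and that a successful call reports the literal string "Success"; check both against a real response. The class name KmapError and the sample responses are hypothetical.

```python
import json


class KmapError(Exception):
    """Raised when the status field reports anything other than success."""


def unpack(response_text):
    """Return the result field of a response, or raise on a failed call."""
    response = json.loads(response_text)
    status = response["status"]
    if status != "Success":
        # For an Input Error the result field holds a help message.
        raise KmapError(f"{status}: {response['result']}")
    return response["result"]


# Hypothetical response bodies, for illustration only.
ok = '{"status": "Success", "result": ["some", "payload"]}'
bad = '{"status": "Input Error", "result": "Please supply an APIKEY."}'

print(unpack(ok))
try:
    unpack(bad)
except KmapError as error:
    print("call failed:", error)
```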

Access keys

You need an APIKEY to get access to the API methods. The APIKEY is a code that needs to be included in your scripts and sent to the server each time an API request is made. Please contact us at *kmine@tenwise.nl* to get a trial API key. If you do not supply an APIKEY in your request, the server will give an error and ask you to supply one.

Without supplying an apikey, the server returns an Input Error:

**concept/search/?terms=CXCR3**

image8

GET vs. POST requests

All methods work both as a GET and as a POST method. In a GET request, all parameters and their values are supplied in a URL that can be executed directly in the browser. In a POST request, the parameters and their values are supplied in the body of a form. You can use GET requests for simple calls to the server, but if you have large concept sets or abstract sets to send to the server, you should use POST, because URLs have a maximum length, which may depend on your browser. When running a POST request, you are required to include an additional cookie in your request as a security protection. Please see the R and Python examples on how to do this.
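One way to apply this rule in a script is to switch to POST whenever the URL would become too long. The sketch below only prepares the request object and does not send it. The base URL and the limit of 2000 characters are assumptions, and the security cookie required for POST requests is deliberately left out because its details are covered by the R and Python examples.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "https://kmap.example.org/api/"  # hypothetical base URL
MAX_URL_LENGTH = 2000  # assumed conservative limit; browsers and servers differ


def make_request(method, params):
    """Prepare a GET request, or a POST request if the URL would be too long."""
    query = urlencode(params)
    url = BASE_URL + method + "?" + query
    if len(url) <= MAX_URL_LENGTH:
        return Request(url, method="GET")
    # Too long for a URL: send the parameters in the body of a form instead.
    # The security cookie that POST calls require is NOT added here.
    return Request(BASE_URL + method, data=query.encode("ascii"), method="POST")


small = make_request("concept/search/", {"apikey": "KEY", "terms": "CXCR3"})
many_ids = ",".join(f"ID{i}" for i in range(1000))  # made-up identifiers
large = make_request("conceptset/annotation/", {"apikey": "KEY", "concept_ids": many_ids})
print(small.get_method(), large.get_method())
```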

Methods

The following tables give an overview of the methods that are available, with a short description. In general, each call needs to be executed with a number of parameters. The exact parameters that need to be used in each call are detailed in the documentation for each call. Parameters that are required are in bold. Parameters that are optional can be omitted and the default value is shown between [].

Concept methods

These methods are preceded by the /concept/ statement in the URL and return data and relations for individual concepts.

| Method name | Function | Input | Output |
| --- | --- | --- | --- |
| search | Searches the database for concepts that match your keywords | A set of keywords or external identifiers. | The concept identifiers that match to your keywords. |
| hits | Returns the MEDLINE abstracts in which the concept occurs. | A concept identifier. | A list of PMIDs. |
| relations | Returns related literature concepts. | A concept identifier. | A set of related concepts and the statistical scores. |

concept/hits

This method returns the references in which a concept is found. A hit is characterized by the PMID of the reference, the number of hits in the reference and the string_hit, i.e. the string to which the concept matches in the abstract. This can be informative if you want to see whether the hit was with the preferred name of the concept or with one of its synonyms. In the case of multiple hits in the reference, only the string_hit for the first encountered term is given.

| Parameter | Value [default] | Description |
| --- | --- | --- |
| apikey | | Your API key |
| concept_id | | The identifier of the concept for which the results should be retrieved. |
| retmax | [100] | The maximum number of hits to retrieve. |
| ordering | [pmid], hitnr | By default the ordering is done on PMID, effectively returning the most recent references. If ordering is set to hitnr, the references with the most hits for the concept are returned. |

Returns

| Value | Description |
| --- | --- |
| hits | A list of dictionaries with as key-value pairs the PMID, the hitnr and the string_hit. |
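To check whether a concept was matched by its preferred name or by a synonym, the string_hit values can be tallied. The sample below is made up, and the key names "pmid", "hitnr" and "string_hit" are assumed from the description above; verify them against a real response.

```python
from collections import Counter

# Made-up sample of the hits list (illustrative values only).
hits = [
    {"pmid": 31000001, "hitnr": 3, "string_hit": "CXCR3"},
    {"pmid": 31000002, "hitnr": 1, "string_hit": "CD183"},
    {"pmid": 31000003, "hitnr": 2, "string_hit": "CXCR3"},
]


def matched_strings(hits):
    """Count how often each matching string occurs across the references."""
    return Counter(hit["string_hit"] for hit in hits)


print(matched_strings(hits).most_common())
```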

concept/relations

This method returns the relations for a given concept. A relation is defined as a co-occurrence of two concepts in a reference (see example below), identified by a subject and object, connected by a predicate. See here for a more extended description of what we mean by a relation.

image9

The relations can be ranked on the basis of:

  • overlap, the number of abstracts in which both concepts co-occur

  • the enrichment score (escore)

  • local_mi, which is the product of the overlap and the enrichment score and as such balances the influence of both parameters.

By default, all relations for a concept are returned. By setting the vocab_ids parameter, the relations can be restricted to certain vocabularies only.

| Parameter | Value [default] | Description |
| --- | --- | --- |
| apikey | | Your API key |
| concept_id | | The identifier of the concept for which the results should be retrieved. |
| vocab_ids | string | A comma separated list of vocabulary_ids for which the relations need to be retrieved. If not supplied, all relations are returned. |
| ordering | [local_mi], overlap, escore | The way the ordering should be done. |
| retmax | [50] | The maximum number of relations to return after ordering |

Returns (Example)

| Value | Description |
| --- | --- |
| hits | A list of dictionaries with as key-value pairs the abs_id(=PMID), the hitnr and the string_hit. |

Relation methods

These methods are preceded by the /relation/ statement in the URL and return details on a relation between concepts. Currently there is only one relation method, but this will be extended in the future.

| Method name | Function | Input | Output |
| --- | --- | --- | --- |
| evidence/ | Returns the references for a relation. | Two concept identifiers. | A set of PMIDs in which the concepts co-occur. |

relation/evidence

Returns all the PMIDs for a relation, along with the number of hits for both concepts in each PMID.

| Parameter | Value [default] | Description |
| --- | --- | --- |
| apikey | | Your API key |
| subject | | The concept_id for the subject of the relation. |
| object | string | The concept_id for the object of the relation. If not supplied, defaults to the subject, so effectively retrieving the references in which the subject occurs with itself. This is normally not what you want, but the server does not throw an error if you do. |
| ordering | [total_hits], pmid | The default is to order on the number of total hits, i.e. the references in which the concepts have a lot of hits. If ordering is done on PMID, it returns in practice the most recent references. |
| retmax | [100] | The maximum number of relations to return after ordering. |

Returns (Example)

| Value | Description |
| --- | --- |
| evidence | A list of dictionaries with the subject_id, the object_id, the pmid and the hits for each of the concepts. |

Reference Set methods

These methods work on a set of MEDLINE abstracts, which are referred to by their identifiers, the PMIDs. PMIDs are numbers, ranging from 1 to ~35,000,000, that uniquely identify a PubMed reference.

| Method name | Function | Input | Output |
| --- | --- | --- | --- |
| citations/ | Returns citation information. | A list of PMIDs | Citation information for each PMID. |
| concepts/ | Returns the concepts that are found in the references. | A list of PMIDs | For each reference the concepts and associated hits. |
| markup/ | Highlights the concepts in the actual references and individual sentences | A list of PMIDs and a list of concepts to highlight. | Strings that can be used for display as highlighted HTML. |

refset/citations

This method returns basic annotation information for a set of PMIDs. This citation information includes, among other things, the title, abstract, journal and publication date. This should be sufficient to reconstruct the most important citation information. If you are interested in more details, e.g. author affiliation or the ISSN of the journal, you can retrieve the data directly from PubMed using their Entrez system. The fields that are returned by this function are as follows:

| Field | Description |
| --- | --- |
| pmid | The PMID, a number of 2 or more digits |
| title | The reference title as text |
| abstract | The abstract as text |
| journal_title | The abbreviated journal title |
| volume | The volume |
| issue | The issue |
| page | The page(s) |
| authors | “;” separated full names of the authors e.g. ‘Marie Krogh Nielsen; Torben Lykke Sørensen’ |
| meshterms | “;” separated MESH terms e.g. ‘Aged;Case-Control Studies;Chemokine CCL5;’ Can be an empty string. |
| chemicalterms | “;” separated chemical terms. Can be an empty string. |
| pubmed_date | Date in the format YYYY-MM-DD, e.g. ‘2020-04-24’ |

NB. Currently we have placed a cap of 1000 on the number of references that can be retrieved in a single query. If you need to retrieve more than 1000 references, please use this call in a loop, with no more than 1 call per ~2 seconds.

| Parameter | Value [default] | Description |
| --- | --- | --- |
| apikey | | Your API key |
| pmids | | A string with a number of “,” separated PMIDs. |

Returns (Example)

| Value | Description |
| --- | --- |
| citations | A list of dictionaries with citation information for the abstracts, as described above. |
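The cap of 1000 references and the pause of about 2 seconds between calls can be handled with a small batching loop. This is a sketch only: the function fetch is a placeholder that you supply yourself and that performs one refset/citations call for a list of PMIDs and returns the list of citation dictionaries.

```python
import time

MAX_PMIDS_PER_CALL = 1000        # cap stated in the documentation
MIN_SECONDS_BETWEEN_CALLS = 2.0  # no more than 1 call per ~2 seconds


def chunks(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_all_citations(pmids, fetch, sleep=time.sleep):
    """Collect citations for any number of PMIDs, one capped batch at a time.

    `fetch` is a caller-supplied function (hypothetical here) that performs a
    single refset/citations call and returns the list of citations.
    """
    citations = []
    for index, batch in enumerate(chunks(list(pmids), MAX_PMIDS_PER_CALL)):
        if index > 0:
            sleep(MIN_SECONDS_BETWEEN_CALLS)  # throttle between calls
        citations.extend(fetch(batch))
    return citations
```

Passing sleep as an argument makes it easy to test the loop without actually waiting.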

refset/concepts

This method returns the concepts for a set of PMIDs. It can thus be regarded as the reverse of the function that gets the PMIDs for a set of concepts. The use case is that you have somehow retrieved a set of PMIDs, e.g. from a PubMed query, from an EndNote library or from the reference list of a paper, and you want to look up the concepts that are referred to in these abstracts. It can also be used to prioritize a large list of PMIDs, keeping only those that contain a lot of your concepts of interest.

| Parameter | Value [default] | Description |
| --- | --- | --- |
| apikey | | Your API key |
| pmids | | A string with a number of “,” separated PMIDs. |
| vocab_ids | string | A string with an optional parameter specifying the vocabularies from which the concepts should be returned. This can be used to limit the results of the call. If this parameter is not specified, all concepts are returned. |

Returns (Example)

| Value | Description |
| --- | --- |
| pmid2concept | A dictionary with PMIDs as keys and a list of concept_ids as values |
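The prioritization use case can be sketched as a ranking of the PMIDs by the number of concepts of interest they contain. The pmid2concept sample and the concept identifiers below are made up; only the shape (PMIDs as keys, lists of concept_ids as values) follows the description above.

```python
def prioritize(pmid2concept, concepts_of_interest):
    """Rank PMIDs by how many concepts of interest they mention."""
    wanted = set(concepts_of_interest)
    counts = {
        pmid: len(wanted.intersection(concept_ids))
        for pmid, concept_ids in pmid2concept.items()
    }
    # Most concepts first; ties are broken on the PMID for a stable order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(pmid, count) for pmid, count in ranked if count > 0]


# Made-up response content (illustrative identifiers only).
pmid2concept = {
    "101": ["A1", "B2", "C3"],
    "102": ["A1"],
    "103": ["C3"],
}
print(prioritize(pmid2concept, ["A1", "B2"]))
```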

refset/markup

This function takes a set of PMIDs and concept_ids and returns highlighted versions of the abstracts or of the sentences in the abstracts. Only the abstracts and sentences in which at least 1 concept is found are returned. When you set the highlight parameter to False, the non-highlighted version of the text is returned. This can make sense if you want to do your own analysis on the sentences in which a co-occurrence is found. The combination of markup_mode='abstract' and highlight=False does not make a lot of sense, since you would basically be returning the plain abstracts, which can be obtained more easily in another way. Currently the highlighting can be done in two ways

| Parameter | Value [default] | Description |
| --- | --- | --- |
| apikey | A string | Your API key |
| pmids | string | A string with a number of “,” separated PMIDs. |
| concept_ids | string | A string with concept_ids. |
| retmode | [highlight], plain_text | Whether to return highlighted text. |

Returns (Example)

| Value | Description |
| --- | --- |
| markup | For each PMID a dictionary with the title, abstract and sentences, in which the concepts are enclosed in <span style="background-color:#ADD8E6;"> </span> tags if retmode is “highlight”. |

Concept Set methods

These methods are preceded by the /conceptset/ statement in the URL and return data and relations for sets of concepts. When running this on a lot of concepts, you may need to use the POST request in order to avoid overloading the URL.

| Method name | Function | Input | Output |
| --- | --- | --- | --- |
| annotation/ | Returns details for concepts, such as synonyms, xrefs, etc. | A set of one or more concept identifiers. | A list of annotated concepts. |
| relations/ | Retrieves related concepts for each concept in the set. | A set of concept identifiers. | A set of new, related concepts. |
| enrichment/ | Retrieves related concepts for the set as a whole using an enrichment calculation. | A set of concept identifiers. | An enrichment table with related concepts and the p-value. |
| evidence/ | Retrieves most relevant references for a set of concepts. | A set of concept identifiers. | A set of PMIDs along with a relevance score. |

conceptset/annotation

With this function, detailed information on the concepts can be obtained. This includes the name and synonyms, but also references to external databases. See the example below:

image10

Not all concepts have the same level of detail. For genes, much more information is available than for certain metabolites. The only annotation tags that are always available are the ‘name’ and ‘concept_id’ tags.

| Parameter | Value [default] | Description |
| --- | --- | --- |
| apikey | | Your API key |
| concept_ids | | A string with a number of “,” separated values. |

Returns (Example)

| Value | Description |
| --- | --- |
| annotation | A list of dictionaries (one for each concept identifier) with tag-value pairs for the annotation fields. |

conceptset/relations

This function connects the concepts with a new set of concepts. The parameters in the call can be used to control how the relations should be created. By setting these parameters you can build the following networks.

Extended network

This is built by connecting all concept ids to all target concepts from the databases that match to one or more of these concept ids. Optionally, the target concepts can be limited to only a certain vocabulary.

image11

Restricted network

This is similar to the extended network, but now only relations are shown between user specified target concepts. This is an interesting method if you have a set of genes and you are interested in the links to certain prespecified diseases or pathways of interest.

image12

Restricted intra-network

A special form of the restricted network is the intra-network, which can be obtained by specifying the objects with the same identifiers as the subject identifiers. You then only get links between the original concepts in the data set.

image13

| Parameter | Value [default] | Description |
| --- | --- | --- |
| apikey | | Your API key |
| concept_ids_subject | | A string with a number of “,” separated values. |
| concept_ids_target | string | A string with a number of “,” separated values that will be used to limit the output to only show relations with the specified targets. |
| vocab_ids | string | A string with a number of “,” separated values specifying the vocabularies for which the targets should be retrieved. |

Returns (Example)

| Value | Description |
| --- | --- |
| connectivity | For each node in the network a specification of how many other nodes it is connected to. |
| relations | A list of dictionaries, one for each relation. The keys of the dictionaries describe the relation, and are explained here. |
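The three network types differ only in which parameters are set. The sketch below shows one possible way to assemble the parameters; the helper name and the concept identifiers in the usage lines are hypothetical, while the parameter names come from the table above.

```python
def network_params(apikey, subjects, targets=None, vocab_ids=None, intra=False):
    """Assemble conceptset/relations parameters for the three network types."""
    params = {"apikey": apikey, "concept_ids_subject": ",".join(subjects)}
    if intra:
        # Restricted intra-network: the targets are the subjects themselves.
        targets = subjects
    if targets:
        # Restricted network: only relations with the specified targets.
        params["concept_ids_target"] = ",".join(targets)
    if vocab_ids:
        # Optional limit on the vocabularies of the target concepts.
        params["vocab_ids"] = ",".join(vocab_ids)
    return params


genes = ["GENE1", "GENE2", "GENE3"]  # made-up identifiers
extended = network_params("KEY", genes, vocab_ids=["ONT0006"])
restricted = network_params("KEY", genes, targets=["DIS1", "DIS2"])
intra = network_params("KEY", genes, intra=True)
```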

conceptset/enrichment

This function returns the enriched concepts for the set of concept_ids. This function is related to the relations function, but with the difference that a statistical analysis on the results is performed. This statistical analysis ensures that only concepts are retained that have a stronger connection to your input set compared to a random background set. This analysis has been described as Gene Set Enrichment Analysis (GSEA), but the principle can be performed on any set of concepts. The principle is demonstrated in the following figure:

image14

The left network shows the relations with your concepts of interest (foreground set). The right panel shows the relations in a random set of the same composition, i.e. the same number of concepts from each vocabulary type (background set). This shows that Pathway 3 and Organism 1 are specifically connected to the foreground set whereas Pathway 2, Pathway 3 and Organism 1 are connected to both sets and thus less specific for the foreground set. The specificity / relatedness for the foreground set can be calculated by a so-called contingency table on which a Fisher Exact test can be applied. The p-value is an indication of how specific the new concepts are for your concept set (i.e. the foreground set).

The outcome of this method is very much dependent on what exactly you choose as a background set. This is often dependent on your research question and sometimes difficult to create. If you do not specify a background set in the method, we will match your foreground set with respect to the number of concepts and distribution across vocabularies. However, since this is a random process, you may expect changes in the results when rerunning the method, especially with smaller sets.

| Parameter | Value [default] | Description |
| --- | --- | --- |
| apikey | | Your API key |
| concept_ids | | A string with a number of “,” separated values. |
| background_ids | string | An optional string with background_ids |

Returns (Example)

| Value | Description |
| --- | --- |
| enrichment | A dictionary with the enriched objects. Each element has the slots as explained below. |
| background_set | A set of concept ids that formed the background set. This is either equal to the one supplied by the user, or the automatically generated one. Especially in the latter case, saving this background set and using it in future reruns of your call will get you reproducible results. |

The results in the enrichment dictionary are as follows.

  • id: The concept id of the associated concept.

  • pval: The p-value for the association, as determined by the Fisher exact test.

  • hitsfg: The number of hits with the foreground set.

  • nohits: The number of no-hits with the foreground set.

  • hitsbg: The number of hits with the background set.

  • nohitsbg: The number of no-hits with the background set.

  • enrichment: The fraction of hits with the foreground set compared to the hits in the background set.

  • ratio: The natural log of the enrichment.

  • targets: The concept ids in the foreground set for which a relation was found.
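The four counts form the contingency table on which the Fisher exact test is applied. The sketch below computes a one-sided p-value from such a table using only the standard library. Whether the server uses a one-sided or a two-sided test is not stated here, so treat the sidedness as an assumption and do not expect the numbers to match the pval field exactly.

```python
from math import comb


def fisher_exact_greater(hits_fg, nohits_fg, hits_bg, nohits_bg):
    """One-sided Fisher exact p-value for enrichment in the foreground set.

    Returns the probability of observing at least `hits_fg` foreground hits
    when the row and column totals of the 2x2 table are kept fixed.
    """
    n_fg = hits_fg + nohits_fg        # size of the foreground set
    n_hits = hits_fg + hits_bg        # hits in both sets together
    total = n_fg + hits_bg + nohits_bg
    tail = sum(
        comb(n_hits, k) * comb(total - n_hits, n_fg - k)
        for k in range(hits_fg, min(n_fg, n_hits) + 1)
    )
    return tail / comb(total, n_fg)


# Made-up counts: 3 of 4 foreground concepts hit, 1 of 4 background concepts hit.
print(fisher_exact_greater(3, 1, 1, 3))
```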

conceptset/evidence

This function returns the PMIDs that have the highest coverage of concepts in your set. For example, if you have a set of 5 concepts, and in a single abstract, 4 of those 5 concepts occur, this abstract gets a score of 0.8. Because this method potentially can return a lot of results there is a cap of 1000 PMIDs returned per function call. By default, the ordering is done on the score, meaning that the best covered references are returned. Ordering on the PMID yields the most recent references.

| Parameter | Value [default] | Description |
| --- | --- | --- |
| apikey | | Your API key |
| concept_ids | | A string with a number of “,” separated values. |
| hitlimit | [2] | An integer specifying the number of concepts that should at least be present in the abstracts. Setting this to 1 also returns abstracts in which only one concept occurs, which may be interesting for sparsely connected concept sets. |
| ordering | [score], pmid | The ordering of the results, either on most relevant (score) or most recent (pmid). |
| retmax | [50] | The maximum number of PMIDs to return after applying the hitlimit and ordering. |

Returns (Example)

| Value | Description |
| --- | --- |
| evidence | A list of dictionaries with the PMID, the score and the number of hits. |
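The coverage score and the effect of hitlimit, ordering and retmax can be reproduced client-side as a sketch. This illustrates the scoring rule described above and is not the server implementation; the input data and the output key names are made up.

```python
def rank_evidence(pmid_to_concepts, concept_ids, hitlimit=2, ordering="score", retmax=50):
    """Score each PMID by the fraction of the concept set found in it."""
    concept_set = set(concept_ids)
    rows = []
    for pmid, found in pmid_to_concepts.items():
        present = concept_set.intersection(found)
        if len(present) >= hitlimit:
            rows.append({
                "pmid": pmid,
                "score": len(present) / len(concept_set),
                "hits": len(present),
            })
    # Highest score first, or highest (most recent) PMID first.
    sort_key = "score" if ordering == "score" else "pmid"
    rows.sort(key=lambda row: row[sort_key], reverse=True)
    return rows[:retmax]


concept_ids = ["C1", "C2", "C3", "C4", "C5"]  # made-up identifiers
pmid_to_concepts = {
    101: ["C1", "C2", "C3", "C4"],
    102: ["C1"],
    103: ["C2", "C5"],
}
print(rank_evidence(pmid_to_concepts, concept_ids))
```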

conceptset/hits

This function returns all the hits for a set of concepts. This means that for each concept, all the PMIDs in which the concept has a hit are returned. This is analogous to the concept/hits method, but now repeated for all concepts in the set. Because this method potentially returns a lot of data, the number of concepts in the set that can be used for this call is limited to 500.

| Parameter | Value [default] | Description |
| --- | --- | --- |
| apikey | | Your API key |
| concept_ids | | A string with a number of “,” separated values. |

Returns (Example)

| Value | Description |
| --- | --- |
| hits | A dictionary with the PMIDs as keys and a list of concepts from the set as values. |

Vocabularies

The following vocabularies are currently in use.

| ID | Name (prefix of the IDs) | Description | Number of concepts |
| --- | --- | --- | --- |
| ONT0001 | HGNC Genes (HGNC) | A curated online repository of HGNC-approved gene nomenclature. | 42357 |
| ONT0003 | Human Disease (TWDIS) | A vocabulary for human disease representation. | 17615 |
| ONT1004 | Tool Compounds (TOOLC) | A set of pharmaceutical tool compounds. | 16061 |
| ONT0005 | Organisms (TAX) | A set of terms covering a large subset of the NCBI taxonomy database. | 519716 |
| ONT0006 | Pathways (PATH) | A set of terms covering a large set of biological pathways. | 2798 |
| ONT0008 | Cell types (TWCELL) | A set of terms covering a large set of cell types. | 117 |
| ONT1005 | Bacterial Genes (BACG) | A set of terms describing bacterial gene symbols. | 10857 |
| ONT1006 | Metabolites (TWMET) | A set of terms describing metabolites. | 1214 |
| ONT5003 | GenesComp (TWHGNC) | A set of additional terms describing gene complexes. | 4 |
| ONT1007 | Workflow terms (TWRWO) | A set of terms describing workflow terms. | 86 |
| ONT1008 | Food terms (TWFOOD) | A set of terms describing food items. | 2875 |
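Because the vocabulary can be deduced from the prefix of a concept identifier, a lookup like the one below can route identifiers to the right vocabulary ID. The prefixes and vocabulary IDs are taken from the table above. The exact identifier layout (letters followed directly by digits, or with a separator in between) is an assumption, and the identifiers in the usage lines are made up.

```python
import re

# Prefix of the concept identifier -> vocabulary ID (from the table above).
PREFIX_TO_VOCABULARY = {
    "HGNC": "ONT0001", "TWDIS": "ONT0003", "TOOLC": "ONT1004",
    "TAX": "ONT0005", "PATH": "ONT0006", "TWCELL": "ONT0008",
    "BACG": "ONT1005", "TWMET": "ONT1006", "TWHGNC": "ONT5003",
    "TWRWO": "ONT1007", "TWFOOD": "ONT1008",
}


def vocabulary_of(concept_id):
    """Deduce the vocabulary ID from the prefix of a concept identifier."""
    # Assumed layout: letters, an optional separator, then the number.
    match = re.fullmatch(r"([A-Za-z]+)[^A-Za-z0-9]*(\d+)", concept_id)
    if match is None:
        raise ValueError(f"unexpected identifier format: {concept_id!r}")
    prefix = match.group(1).upper()
    if prefix not in PREFIX_TO_VOCABULARY:
        raise ValueError(f"unknown prefix: {prefix!r}")
    return PREFIX_TO_VOCABULARY[prefix]


print(vocabulary_of("TWDIS123"), vocabulary_of("HGNC:456"))
```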

Workflows

This section details a number of workflows that can be used to address common problems. A workflow is defined here as a series of API calls, where the results of one API call are used by the subsequent API calls. The shuttling of data between the API calls can be done within Python, R or on the command line with shell scripts.

Gene set enrichment analysis

This workflow is designed to start with a list of genes and to perform a Gene Set Enrichment analysis.

image15

An explanation of the steps:

1: A keyword search is carried out to obtain the correct identifiers for the gene names.

2: An enrichment call is done to get the concepts that are preferentially connected to the gene set of interest. Optionally a background set can be applied.

3: The most enriched concepts are retrieved from the results and a network visualization is created. Alternatively, the output can be exported to Excel.
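Step 3 can be sketched as a filter on the enrichment results. The example assumes that the enrichment dictionary holds one element per related concept with at least the id and pval slots; the keys of the dictionary, the sample values and the p-value threshold of 0.05 are assumptions made for illustration.

```python
def most_enriched(enrichment, max_pval=0.05, top=10):
    """Return the ids of the most enriched concepts, smallest p-value first."""
    rows = [row for row in enrichment.values() if row["pval"] <= max_pval]
    rows.sort(key=lambda row: row["pval"])
    return [row["id"] for row in rows[:top]]


# Made-up enrichment results (illustrative identifiers and p-values).
enrichment = {
    "a": {"id": "PATH1", "pval": 0.0004},
    "b": {"id": "PATH2", "pval": 0.2},
    "c": {"id": "TWDIS7", "pval": 0.01},
}
print(most_enriched(enrichment))
```

The returned identifiers can then be passed on to a network visualization or written to a spreadsheet.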

Drug repurposing

This workflow is designed to discover new relations between concepts based on a set of shared intermediates. This is also called the ABC approach. An example of the application of an ABC approach using this API interface can be found here.

image16

An explanation of the steps:

1: A keyword search is carried out to obtain the correct identifier for the disease term.

2: All known relations with drugs are obtained by executing the relations call, limiting the results to the drug vocabulary. This yields known A-C relations.

3: All known relations with genes are obtained via the relations call, limiting the results to only genes. This establishes the A-B relation in the ABC principle.

4: The retrieved genes are treated as a concept set, and the entire concept set is matched to new drugs. This establishes all B-C relations in the ABC principle.

The A-B and B-C data sets can now be connected via overlapping B-nodes. This yields all potential A-B-C relations. To find truly new relations, subtract known A-C relations from the found A-B-C relations.
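The final step, connecting A-B and B-C relations and subtracting the known A-C relations, is plain set logic. The sketch below assumes the relations have already been retrieved with the calls above and reduced to identifiers; all identifiers are made up.

```python
def novel_abc(a_b, b_c, known_a_c):
    """Find C concepts linked to A only through shared B intermediates.

    a_b:       set of B ids related to the starting concept A (e.g. genes)
    b_c:       dict mapping each B id to the set of C ids related to it
    known_a_c: set of C ids that already have a direct relation with A
    Returns a dict mapping each new C id to the B ids that connect it to A.
    """
    candidates = {}
    for b in a_b:
        for c in b_c.get(b, ()):
            if c not in known_a_c:  # subtract the known A-C relations
                candidates.setdefault(c, set()).add(b)
    return candidates


a_b = {"G1", "G2", "G3"}                                     # disease -> genes
b_c = {"G1": {"D1", "D2"}, "G2": {"D2"}, "G4": {"D3"}}       # genes -> drugs
known_a_c = {"D1"}                                           # disease -> drugs
print(novel_abc(a_b, b_c, known_a_c))
```

Candidates that are supported by several B-nodes, such as D2 in this example, are usually the more interesting ones to follow up.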