conclusion.tex

\part{Discussion and conclusion}
\label{pa:conclusion}
\chapter{Discussion}
\section{Network handling}
In this study I have demonstrated the development of a Cytoscape app and its
application to prioritize network biomarkers in prostate cancer. A possible
significant improvement which can be made, in the use of prioritizing network
biomarkers, is to use directed networks. The ranking algorithms used can all
come to useful results with undirected edges, but directed edges take better
advantage of how \gls{pr}, \gls{prwp}, \gls{hits} are meant to be used.

\subsection{Clustering}
The clusters could to a greater degree have been filtered more strictly. The
average cluster size was around 8.8 nodes (Table \ref{tab:mcl-inflation}), the
biggest cluster consisted if over 900 nodes, and the smallest ones were 2.0 
nodes in size. Removing the smaller clusters with only 2 genes has been done by
other researchers because of the low likeliness of a protein complex with only
2 genes to have a significant impact on disease status. The biggest cluster
might also be removed due to its large size when compared to the average. Such
a large cluster has a good chance of not being realistically compartmentalized
into a protein complex, and can skew the results in either the direction of
false positives or false negatives.

\subsection{Ranking}
After clustering the network, the user could end up with two types of networks.
One which is the original network with all of its edges, and the only new piece
of information after the clustering is that the node table contains information
about which cluster each node belongs to. The second type is a network that
represents the clusters without edges connecting to other clusters, which
creates disconnected graph for each cluster. Ranklust has been used exclusively
with the first alternative throughout this assignment. Though, it does not
necessarily indicate that the disconnected graph-version is a poor choice to
prioritize network biomarkers on.

Adjusting the alpha parameter of \gls{prwp} to higher values as 0.8 instead of
0.3 would also give interesting information. Had the cross-validation with
\gls{prwp} been done with 0.8, and resulted in a linear regression fit (Figure
\ref{fig:irefweb-prwp}) that would have had a less descending trend of candidate
cancer genes throughout the cluster ranks, it would be significant proof of the
network structure having a bigger effect on the rankings than the priors scores.

\subsection{General improvements}
I have used an undirected network, which is not the preferred type to use with
PageRank, but it works. An idea could be to use KEGG pathways\cite{kegg}, which
has the option of downloading directed networks of several types.
STRING\cite{str} also has some directed network information, together with
weights. Because of STRINGs size, utilizing a Neo4J database to perform the
ranking algorithms could be a better option than directly having Cytoscape do
all of the heavy computations.

In the future, maybe a database with complete protein complexes could exclude
the need for clustering and provide different ways of ranking them based on
query criteria.

Other ranking algorithms than PageRank could also have been used. There are
numerous variations based off of this well known algorithm, for example NetRank
and GeneRank\cite{netrank,generank}. Not all of these PageRank variants are
intended to be used with PPI networks, but they can in many cases be modified to
fit this specific purpose.

In Ranklust, the average score in the nodes becomes the cluster score. PageRank
was used to combine prior knowledge of cancer genes and network structure to
prioritize network biomarkers in prostate cancer. Taking the structure of the
network in higher consideration and maybe consider the distance between the
nodes to have an effect on the result could be a contribution to the PageRank
algorithm. Also, going the other way and increase the significance of the prior
scores added to the nodes in the network could help to get a better segregated
result of which clusters are related to cancer and not. For example, PRWP used
an alpha value of 0.3 because of earlier experiments in ranking biological
networks with PageRank had good results with it. What if the alpha value was set
to 0.8 and give the prior scores in the node a higher degree of bias?

\section{Cross-validation between PRWP and MAA}
There is clearly a trend in both \gls{prwp} and \gls{maa} (figures:
\ref{fig:irefweb-prwp} and \ref{fig:irefweb-maa}). The difference between them
being mainly the number of clusters ranked and the coupling of values around the
linear regression fit. It is important to point out that this is just a result
over the distribution of candidate cancer genes at certain ranks. Where the
clusters are positioned in the rankings may be very different between the
\gls{prwp} and \gls{maa}.

The values in \gls{prwp} have a tighter coupling, in other words, the distance
the coordinates have in the scatter plot deviate less from the fit than the ones
in the \gls{maa} plot. The difference is not huge, as the coefficient of
determination indicates 0.336 in \gls{prwp} against 0.332 in \gls{maa}. Both
ranking algorithms achieves a descending distribution of cross-validated genes
from the topmost to the lowest ranked cluster. Though, when comparing the
cross-validation results between \gls{prwp} and \gls{maa}, \gls{prwp} comes out
ahead by a margin in \gls{rsquared} value.

\section{Benchmarking through text mined genes from the DISEASE database}
This demonstrates the main difference between \gls{prwp} and \gls{maa} (figures:
\ref{fig:txt-iref-prwp} and \ref{fig:txt-iref-maa}).  \gls{prwp} takes network
structure into comparison as well as the prior scores, so it will rank genes
without priors higher than in \gls{maa}. \gls{maa} focuses purely on the prior
scores, and so the network structure of protein complexes are being completely
ignored.

\section{Benchmarking through manually knowledge curated genes from the DISEASE database}
Benchmarking \gls{maa} with manually knowledge curated gene test data has
uncovered its weakness of ranking prostate candidate cancer genes correctly
(figures: \ref{fig:know-iref-prwp} and \ref{fig:know-iref-maa}).
\gls{prwp} had the same trend, but the fit was not very accurate and the trend
was negligible. Though, with such a small sample size of manually knowledge
curated genes when compared to text mined genes, \gls{maa} is not necessarily
excluded as a viable ranking algorithm for prioritizing network biomarkers in
prostate cancer.

\section{Benchmarking through genes from experiments from the DISEASE database}
As with the manually knowledge curated genes, the experimental genes have
a small sampling size (figures: \ref{fig:exp-iref-prwp} and
\ref{fig:exp-iref-maa}). Also, the trends that would discredit Ranklust's
ability to prioritize network biomarkers is not backed up by very high values of
confidence in terms of \gls{rsquared} values in the linear regression fits. In
this specific case of benchmarking Ranklust, it is of a high probability that
the cause of the trend is a single outlier in the plot.

\section{Testing the ranks of PRWP and MAA with prostate cancer relevant genes}
Both \gls{prwp} and \gls{maa} have demonstrated both the ability of prioritizing
network biomarkers for prostate cancer according to text mined, manually
knowledge curated and experimental data. Also, when tested against data from the
movember project, the COSMIC database and proven lethal prostate cancer genes,
they managed to prioritize network biomarker genes for prostate cancer (tables:
\ref{tab:prwp-movember}, \ref{tab:maa-movember}, \ref{tab:prwp-cosmic},
\ref{tab:maa-cosmic}, \ref{tab:prwp-lethal} and \ref{tab:maa-lethal}).

\gls{prwp} has demonstrated throughout the benchmarks and tests to score
marginally to significantly better than \gls{maa} in all of the tests and
benchmarks. \gls{prwp} also managed to rank a greater amount of clusters
compared to \gls{maa} (see captions in tables
\ref{tab:top10-prwp},\ref{tab:top10-maa}).

A perfect prioritization of these clusters should have contained every gene in
the test data in the highest ranked clusters. However, this was not likely to be
the case because the amount of genes from the test data could have had genes
that did not occur in the network at all. The prioritizations \gls{prwp} and
\gls{maa} created are the result of a specific PPI network, converted to genes
and complimented with prior scores related to prostate cancer. In order to
perfectly assess the prioritization ability of these ranking algorithms against
each specific set of test data, with the goal of achieving the 100\% percentage
of genes prioritized above all other irrelevant genes, the network, the
interactions and the prior scores added would have had to be changed. Future
work in the area this assignment has focused on could be more specific and
specialized in what kind of networks, interactions and prior scores that is
being used, in order to achieve a better prioritization result.

\chapter{Conclusion}
\section{Future work}
\subsection{Minor features to complement Ranklust/clusterMaker2}
A compiled list of all considered features and tweaks in the Ranklust
contribution to clusterMaker2

\begin{itemize}
    \item User can specify what direction the ranking algorithm should consider
        for the graph
    \item User can specify a certain color scheme to better adjust for
        color blindness
\end{itemize}

\subsection{Ranking through Neo4J}
ClusterMaker2 does every calculation within Cytoscape. There exists another way
of doing large and complex calculations, especially when it comes to algorithms
that focus on edge information in the network. CyNeo4j\cite{cyneo4j} is
a Cytoscape App that can realise this idea. It supports import/export of data
with a Neo4J\cite{neo4j} database. The original CyNeo4J Github-repository does
not support user authentication at the moment, but during the discussion about
what ranking algorithm should be used, Neo4J was an alternative. It resulted in
a git fork\cite{git-fork} of the CyNeo4J repository and simple user/password
authentication was added.

\subsubsection{Data communication}
Data communication between the Neo4J database server and the Cytoscape instance
is done through the CyNeo4J app to Cytoscape \cite{cyneo4j}. CyNeo4J is a
Cytoscape app that was developed during the Google Summer Code 2014 arrangement.
It supports connecting to a Neo4J instance, as well as syncing data up and down
from and to the database server. One thing it did not support was authentication
on Neo4J servers. Not having authentication is a serious problem, so we
implemented a simple way of getting access to the database server by providing a
possibility to insert username and password at the same time the user has to
provide a URL to the Neo4J database server instance. Implementation-wise, this
only required an extra header to be included in each http request going to the
password protected Neo4J database server instance. Every request used the static
\textbf{Request} class to Get/Post/Put HTTP requests to the Neo4J database
server instance. Except for creating the Auth64 encoded information, the
refactor looked something along these lines in all of the files.

Before:
\begin{lstlisting}[frame=single,language=Java]
Request req = Request.Post(url)
        .bodyString(call.getPayload(), ContentType.APPLICATION_JSON);
\end{lstlisting}

After:
\begin{lstlisting}[frame=single,language=Java]
Request req = Request.Post(url)
        .addHeader("Authorization", auth64EncodedInfo)
        .bodyString(call.getPayload(), ContentType.APPLICATION_JSON);
\end{lstlisting}

Synchronization time between Neo4J and Cytoscape through the CyNeo4J app is a
huge timesink. As of now, the time it takes to populate an empty Cytoscape
network with the gene information from STRING is about 2 hours, though on a slow
laptop. This could be shortened by exporting the Neo4J data with GraphML and
into Cytoscape. Because after the initial data is inside Cytoscape, updates to
the Neo4J instance goes much faster.

The CyNeo4J also uses a legacy HTTP library to get information from the Neo4J
database \cite{legacy-neo4j}. It is possible that the performance increases with
the new library \cite{transactional-neo4j}. The new library supports creating
transactions, which implicit gives support for rollbacks in case something goes
wrong with the query.

A future improvement to the CyNeo4J app could be to change the communication
between Neo4J and Cytoscape to be done in GraphML and not Cypher. No convertion
needs to be done in order to have Cytoscape understand this format since it
already supports importing networks in this format. Through testing CyNeo4j as
an alternative to use together with Ranklust, it was also discovered that
GraphML is not only easy to use with Cytoscape, but it required substantially
less memory and time while importing GraphML data into a Neo4j database
instance. Though, to this day (2016-08-14), direct queries to a running Neo4J
instance does not support GraphML.

\section{Final results from prioritizing network biomarkers in prostate cancer}
Comparing \gls{prwp} to \gls{maa} shows that \gls{prwp} is better at
prioritizing network biomarkers for prostate cancer (table \ref{tab:final}).
This conclusion is based on which of the two ranking algorithms that had the
best results from the benchmarks and the highest percentage of genes in the
cluster ranks that was found in the test gene sets. The main difference between
\gls{prwp} and \gls{maa} is, as mentioned earlier, that \gls{prwp} takes network
structure into consideration when ranking nodes in the network, and as a result,
manages to prioritize a greater number of clusters than \gls{maa}.

In this assignment I have described the development of an app extension to
Cytoscape which ranks clusters of networks based on their biomarker relevance.
This is important in the field of biomarker discovery as single agent biomarkers
operate in a collective of complex molecular networks. 

This development have been benchmarked and cross-validated against a golden
standard developed from a prostate cancer database and a candidate prostate
cancer database, to reveal a trend demonstrating the ability of Ranklust to
prioritize prostate cancer relevant gene clusters.

Finally, it has been demonstrated in this study the ability of Ranklust to
elaborate on the discovered network biomarkers in prostate cancer, through the
identifaction of network clusters in a human PPI network and filtered with
manually curated prostate cancer genes, genes from experiments and expanding on
a gene signature of lethal prostate cancer.