Nicholas Spyrison, Monash University, Nicholas.spyrison@monash.edu
PRIMARY
Miji Kim, Monash University, mkim0002@student.monash.edu
Ha Nam Anh Pham, Monash University, hnpha5@student.monash.edu
Student Team: YES
- R (via RStudio), especially the packages: dplyr, tidyr, ggplot2, gganimate, ggraph, Rtsne, d3heatmap.
Approximately how many hours were spent working on this submission in total?
200 hours between 3 people
May we post your submission in the Visual Analytics Benchmark Repository after VAST Challenge 2020 is complete? YES
Video: https://www.youtube.com/watch?v=--lrw6UICbc&t=28s
GitHub link: https://github.com/nspyrison/VAST_Challenge_2020
Center for Global Cyber
Strategy (CGCS) researchers have used the data donated by the white hat groups
to create anonymized profiles of the groups. One such profile has been
identified by CGCS sociopsychologists as most likely to resemble the structure
of the group who accidentally caused this internet outage. You have been asked
to examine CGCS records and identify those groups who most closely resemble the
identified profile.
1. How the visual analytics software helped our analysis
2. Design decisions
a. Heatmap: `ggplot2` was used to
quickly grasp the distributions across the Edge Types within each Data Source.
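A minimal sketch of such a heatmap, assuming an `edges` data frame with `data_source` and `edge_type` columns (the column names are our illustration, not the actual cleaning output):

```r
library(dplyr)
library(ggplot2)

# Count observations for each Edge Type within each Data Source,
# then map the counts to the tile fill colour.
edges %>%
  count(data_source, edge_type) %>%
  ggplot(aes(x = data_source, y = edge_type, fill = n)) +
  geom_tile()
```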
b. Network layouts: The R packages `ggraph` and `igraph` were used to format the data into a graph object and apply layout locations according to various algorithms (especially `igraph`'s Large Graph Layout), facilitating rapid iterations in R.
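A sketch of this step, assuming an `edges` data frame with `from` and `to` columns (names are hypothetical):

```r
library(igraph)

# Build a directed graph object from an edge list, then place the nodes
# with igraph's Large Graph Layout.
g   <- graph_from_data_frame(edges[, c("from", "to")], directed = TRUE)
lay <- layout_with_lgl(g)
plot(g, layout = lay, vertex.size = 2, vertex.label = NA, edge.arrow.size = 0.2)
```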
c. tSNE: t-distributed stochastic neighbour embedding (“tSNE”; van der Maaten & Hinton, 2008) is a form of non-linear dimension reduction. Within each Data Source, we apply tSNE to embed 5 attribute-dimensions (3 factors: Node ID, Node Direction, Edge Type; and 2 quantitative: Time [seconds] and Weight) of all cleaned rows into their own, potentially highly non-linear, 2-dimensional projection spaces. We use the same hyperparameters throughout, one of which is a function of sample size: perplexity = (1/3) × √(number of rows in this dataset). Viewing these spaces side by side, we tried to identify features of the projection spaces to better compare and contrast the networks.
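This embedding can be sketched with `Rtsne` as below; the column names and the one-hot encoding of the factors are our assumptions:

```r
library(Rtsne)

# One-hot encode the 3 factors alongside the 2 quantitative columns,
# then embed into 2D with the sample-size-dependent perplexity.
X    <- model.matrix(~ node_id + direction + edge_type + time + weight, data = edges)
perp <- sqrt(nrow(X)) / 3
fit  <- Rtsne(X, dims = 2, perplexity = perp, check_duplicates = FALSE)
plot(fit$Y, pch = 16, cex = 0.3)
```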
d. Visuals of weight animated across time: the `gganimate` package was used to generate animated plots. The animated bar chart presents bars racing to the top based on their ranks within each frame; it was developed to present the flow of procurement transactions over time. The animated scatter plot with a timeline element was then developed to identify and visualize similarities between each suspect and the template over time.
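A sketch of the racing bar chart, assuming a `txns` data frame with `frame`, `node_id`, and `weight` columns (names are hypothetical):

```r
library(dplyr)
library(ggplot2)
library(gganimate)

# Rank nodes within each time frame (randomized tie-breaking, as assumed
# in the Assumptions section), then animate the bars across frames.
ranked <- txns %>%
  group_by(frame) %>%
  mutate(rank = rank(-weight, ties.method = "random")) %>%
  ungroup()

ggplot(ranked, aes(x = rank, y = weight, fill = node_id)) +
  geom_col() +
  coord_flip() +
  transition_time(frame)
```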
3. Visualization and interactions
a. Heatmap: a fast, lightweight view of the distribution of observations across 2 discrete variables.
b. Network layouts and tSNE: To identify and contrast particular features in the
different networks.
c. Visuals of weight animated across time: to compare how transaction weights evolve over time between each suspect and the template.
4. Filters and transformation
a. With visual data exploration, Suspects 4 and 5 were removed as they presented fewer similarities to the template compared to Suspects 1-3. We then narrowed our analysis to procurement transactions, as meaningful findings were identified in the financial category during the exploration stage; the dataset was therefore filtered by Edge Type, selecting sell and purchase data.
b. A discrete transformation was applied across time: we created a frame variable by slicing time to aggregate and animate. We are currently revisiting this transformation to see if we can adopt an agnostic approach instead of subjectively selecting durations based on integer grains of time (i.e., year and month) and the respective count of observations. The top candidates include uniform slices of time and slices of time containing a uniform number of observations.
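The two candidate slicings can be sketched with `ggplot2`'s binning helpers; the `txns$time` column is our assumed name for the time variable:

```r
library(ggplot2)

# Candidate 1: slices spanning equal durations of time.
txns$frame_uniform_time <- cut_interval(txns$time, n = 50)

# Candidate 2: slices containing an equal number of observations.
txns$frame_uniform_n <- cut_number(txns$time, n = 50)
```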
5. Anomalies shown in a visualization
a. Through the previous visualizations, we chose to rule out Suspects 4 & 5 as candidates and proceeded to animations across time with the subset of the Template and Suspects 1-3.
6. Definition of terms
a. In this write-up we tried to articulate nuanced terms, including:
i. tSNE projection space: tSNE is non-linear and stochastic in nature. By this we mean that the precise transformations used to embed the 5D data space into the 2D projection space are obscured; in particular, the projections are not a global solution but rather local extrema that are hard to reproduce. Despite these shortcomings, we find meaningful interpretations in them that corroborate our other findings. It is also worth noting that a single suspect was not clearly identified.
ii. Selection of the time duration for each “Frame”-slice of time: the animations in the video were based subjectively on the distribution of observations in all Data Sources across time, using nice, whole units of time. We plan to revisit this, as described above.
7. Assumptions
a. The accuracy and precision of
the data
b. The suspect networks include
the most suspect behavior
c. Data cleaning:
d. The animated bar chart:
i.
Randomized
tie-breaking within each rank of a given frame
e. The animated scatter plot:
i. Direction was disregarded, i.e., whether a node was the originator or the recipient of the transaction
Questions
1. Using visual analytics, compare the template subgraph with the potential
matches provided. Show where the two graphs agree and disagree. Use your tool
to answer the following questions:
The heat map shows the number of transactions for each Edge Type within each suspect and the template. From the heat map, Suspects 1-3 have values similar to the template, although the template does not include ‘co-authorship’.
To further identify the suspect subgraphs that match the template, we used tSNE on the edges. tSNE is a technique for visualizing high-dimensional data; it enabled us to generate better visualizations by decreasing the tendency, which linear projections suffer from, to crowd points together in the center of the map. From the visualization, the template graph has more rounded, but unconnected, splotches. Suspects 4 and 5 contain relatively shorter strings compared to the other suspects and the template.
The splotches of the template data are quite distinctive, while the short, choppy strings in Suspects 4 and 5 corroborate the findings of the earlier visualizations. We continued our search within Suspects 1, 2, and 3.
2. CGCS has a set of IDs that may be members of other potential networks that could have been involved. Take a look at the very large graph. Can you determine if those IDs lead to other networks that match the template? Describe your process and findings in no more than ten images and 500 words.
This visualization was created via the `igraph` package. Looking at the 5 suspects and the template, and considering the direction, clustering, and node types used, we found that the template for identifying malicious attacks contains more travel than the 5 suspects. The template also has dense arrows pointing inward to a few nodes in a tight group. Among the suspects, Suspects 4 and 5 exhibit these properties from the template graph.
3. Optional: Take a look at the very large graph. Can you find other subgraphs that match
the template provided? Describe your process and your findings in no more than
ten images and 500 words.
4. Based on your answers to the questions above, identify the group of people that you think is responsible for the
above, identify the group of people that you think is responsible for the
outage. What is your rationale? Please limit your response to 5 images and 300
words.
Based on our analysis and the given constraints, we believe the full-network behaviour of Suspects 4 & 5 is quite unlike that of the template network. Among suspect networks 1, 2, and 3, we have not been able to positively identify one or more networks that look most like the template. Furthermore, the remaining candidates seem to have more in common with one another than with the template network.
We advise an immediate meeting
with CGCS socio-psychologists to discuss exactly how
closely we expect the network behaviour to adhere to the template. The search
may need to broaden to include other networks outside of the suspects, or
perhaps further explore precise behavioural differences with domain experts.
5. What was the greatest challenge you
had when working with the large graph data? How did you overcome that
difficulty? What could make it easier to work with this kind of data?
b. The number of levels when all discrete variables are taken into account, and the constant need to validate the sentiment: “am I within the correct dataset for the correct Node Type and Edge Type?”