Entry Name: MON-Spyrison-MC1

VAST Challenge 2020
Mini-Challenge 1

 

 

Team Members:

Nicholas Spyrison, Monash University, Nicholas.spyrison@monash.edu PRIMARY

Miji Kim, Monash University, mkim0002@student.monash.edu

Ha Nam Anh Pham, Monash University, hnpha5@student.monash.edu

Student Team: YES

 

Tools Used:

-          R (via RStudio)

o   Especially the packages: dplyr, tidyr, ggplot2, gganimate, ggraph, Rtsne, D3 heatmap.

 

Approximately how many hours were spent working on this submission in total?

200 hours between 3 people

 

May we post your submission in the Visual Analytics Benchmark Repository after VAST Challenge 2020 is complete? YES

 

Video

https://www.youtube.com/watch?v=--lrw6UICbc&t=28s

 

GitHub link

https://github.com/nspyrison/VAST_Challenge_2020

 

 

Center for Global Cyber Strategy (CGCS) researchers have used the data donated by the white hat groups to create anonymized profiles of the groups. One such profile has been identified by CGCS sociopsychologists as most likely to resemble the structure of the group who accidentally caused this internet outage. You have been asked to examine CGCS records and identify those groups who most closely resemble the identified profile.

 

How the visual analytics software helped our analysis

 

  1. Steps to approach the challenge
    1. We took seven major steps, starting with ingesting the data. Once we had a better understanding of the metadata and of the situation, we denormalized and cleaned the data. After exploring the data and generating higher-level visuals, we used network layouts and tSNE to identify which suspects to pursue. Our analysis then narrowed to Suspect1, Suspect2, Suspect3, and the Template.
    2. Selection of methods was a compromise between: 
      1. Methods that we were familiar with
      2. Applicability to the network data in MC1
      3. Learning new methods given the time constraints
    3. Selection of the scope of application was driven primarily by iterative analysis as the work progressed.

 

2.                   Design decisions

a.       Heatmap: `ggplot2` was used to quickly grasp the distributions across the Edge Types within each Data Source.

b.       Network layouts: The R packages `ggraph` and `igraph` were used to format the data into a graph object and to assign layout positions with various algorithms (especially `igraph`'s Large Graph Layout), which facilitated rapid iteration in R.

c.       tSNE: t-distributed stochastic neighbour embedding (“tSNE”; van der Maaten & Hinton, 2008) is a form of non-linear dimension reduction. Within each Data Source, we applied tSNE to embed 5 attribute dimensions (3 factors: Node ID, Node Direction, and Edge Type; 2 quantitative: Time [seconds] and Weight) of all cleaned rows into its own, potentially highly non-linear, 2-dimensional projection space. We used the same hyperparameters throughout, one of which is a function of sample size: perplexity = ⅓ × sqrt(number of rows in the dataset). Viewing these spaces side by side, we tried to identify features of the projection spaces that would help compare and contrast the networks.

d.       Visuals of weight animated across time: the `gganimate` package was used to generate animated plots. The animated bar chart shows bars racing to the top according to their rank within each frame; it was developed to present the flow of procurement transactions over time. The animated scatter plot, with a timeline element, was then developed to identify and visualize similarities between each suspect and the template over time.
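
The perplexity rule described for tSNE above can be sketched in R as follows. This is a minimal illustration under stated assumptions, not the team's actual script: the helper name `perplexity_for` and the toy numeric matrix are hypothetical, and the `Rtsne` call is skipped when the package is not installed.

```r
# Hypothetical helper: perplexity = 1/3 * sqrt(number of rows), as described above.
perplexity_for <- function(n_rows) {
  sqrt(n_rows) / 3
}

set.seed(2020)
# Toy stand-in for one cleaned Data Source (assumed numeric encoding of 5 attributes).
toy <- matrix(rnorm(225 * 5), nrow = 225, ncol = 5)
perp <- perplexity_for(nrow(toy))  # sqrt(225) / 3

# Rtsne documents the constraint 3 * perplexity < nrow(X) - 1; check it up front.
stopifnot(3 * perp < nrow(toy) - 1)

if (requireNamespace("Rtsne", quietly = TRUE)) {
  # Embed the 5 columns into 2 dimensions with the size-dependent perplexity.
  emb <- Rtsne::Rtsne(toy, dims = 2, perplexity = perp, check_duplicates = FALSE)
  head(emb$Y)  # one 2-D coordinate pair per input row
}
```

Tying perplexity to the row count keeps the setting comparable across Data Sources of different sizes, which matters when the resulting projection spaces are viewed side by side.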

 

3.                   Visualization and interactions 

a.       Heatmap: a fast, lightweight view of the distribution of observations across 2 discrete variables (Data Source and Edge Type).

b.       Network layouts and tSNE: To identify and contrast particular features in the different networks.

c.       Visuals of weight animated across time: 

      1. The animated bar chart facilitates an understanding of how trends change over time. Drawn as a diverging stacked bar, it conveys a large amount of information at each point in time. 
      2. Splitting a single plot into several related plots with `facet_grid()` makes it easier to compare the trend found in each suspect with that in the template. 
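
The counts behind a heatmap of two discrete variables amount to a cross-tabulation. The sketch below shows one way to build them in base R; the column names `data_source` and `edge_type` and all values are hypothetical placeholders, not the challenge data, and the `ggplot2` call is skipped when the package is not installed.

```r
# Toy edge list; column names and values are hypothetical placeholders.
edges <- data.frame(
  data_source = c("Template", "Template", "Suspect1", "Suspect1", "Suspect1", "Suspect2"),
  edge_type   = c("Email", "Phone", "Email", "Email", "Purchase", "Phone")
)

# Count observations for every Data Source x Edge Type combination.
counts <- as.data.frame(table(data_source = edges$data_source,
                              edge_type   = edges$edge_type))
# 'counts' has one row per cell, with the count in column Freq (zeros included).

if (requireNamespace("ggplot2", quietly = TRUE)) {
  # One tile per cell, filled by the number of observations.
  p <- ggplot2::ggplot(counts,
                       ggplot2::aes(x = edge_type, y = data_source, fill = Freq)) +
    ggplot2::geom_tile()
}
```

Keeping the zero cells in `counts` makes missing combinations visible as tiles instead of gaps.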

 

4.                   Filters and transformation

a.       Through visual data exploration, Suspects 4 & 5 were removed because they showed fewer similarities to the template than Suspects 1-3. We then narrowed our analysis to procurement transactions, as meaningful findings had been identified in the financial category during the exploration stage. The dataset was therefore filtered by edge type, keeping only sell and purchase data.

b.       A discrete transformation was applied across time: we created a frame variable by slicing time in order to aggregate and animate. We are currently revisiting this transformation to see whether we can adopt an agnostic approach instead of subjectively selecting durations in integer grains of time (i.e., year and month) according to the respective count of observations. The top candidates are uniform slices of time and slices of time containing a uniform number of observations.
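
The two candidate slicing schemes can be sketched in base R as follows. This is an illustrative sketch only: the timestamps are simulated, and the names `time_sec`, `n_frames`, `frame_uniform_time`, and `frame_uniform_count` are hypothetical.

```r
set.seed(1)
# Toy timestamps in seconds (simulated and deliberately skewed).
time_sec <- sort(round(rexp(200, rate = 1 / 1e6)))
n_frames <- 10

# Option 1: uniform slices of time (equal-width bins; counts per frame vary).
frame_uniform_time <- cut(time_sec, breaks = n_frames, labels = FALSE)

# Option 2: slices holding a uniform number of observations (rank-based bins;
# the time span covered by each frame varies).
frame_uniform_count <- ceiling(rank(time_sec, ties.method = "first") /
                                 (length(time_sec) / n_frames))

table(frame_uniform_time)   # uneven counts when activity is bursty
table(frame_uniform_count)  # the same count in every frame
```

With skewed activity, equal-width slices leave some frames nearly empty, while equal-count slices keep every frame equally populated at the cost of frames that span unequal durations.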

 

5.                   Anomalies shown in a visualization

a.       Based on the previous visualizations, we chose to rule out Suspects 4 & 5 as candidates and proceeded to the animations across time with the subset of the Template and Suspects 1-3.

 

6.                   Definition of terms 

a.       In this write-up we tried to clarify nuanced terms, including:

                                                         i.            tSNE projection space: tSNE is non-linear and stochastic in nature. By this we mean that the precise transformations used to embed the 5D data space into the 2D projection space are obscured and that, in particular, the projections are not a global solution but rather local extrema that are hard to reproduce. Despite these shortcomings, we found meaningful interpretations in them that corroborate our other findings. It is also worth noting that a single suspect was not clearly identified.

                                                       ii.            Selection of the time duration for each “Frame” slice of time: the durations used for the animations in the video were selected subjectively, based on the distribution of observations in all Data Sources across time, and fall on nice, whole units of time. We are going to revisit this, as described above.

 

7.                   Assumptions 

a.       The accuracy and precision of the data

b.       The suspect networks include the most suspect behavior

c.       Data cleaning: 

      1. NAs found in the Weight column were replaced with 0 for calculation.
      2. Negative values in Weight were replaced with their absolute values.

d.       The animated bar chart: 

                                                         i.            Randomized tie-breaking within each rank of a given frame

e.       The animated scatter plot:

                                                         i.            Direction was disregarded, i.e., whether a node was the originator or the recipient of the transaction
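
The Weight cleaning rules and the randomized tie-breaking listed above can be sketched in base R as follows. The values are toy numbers and the variable names `weight`, `weight_clean`, and `rank_in_frame` are hypothetical; this is an illustration of the stated assumptions, not the team's actual code.

```r
# Toy Weight column (hypothetical values) with a missing and a negative entry.
weight <- c(120, NA, -35, 50, 50)

# Cleaning as described: NA -> 0, then negative values -> absolute values.
weight_clean <- abs(ifelse(is.na(weight), 0, weight))

# Rank within one frame, largest weight first, breaking ties at random.
set.seed(42)
rank_in_frame <- rank(-weight_clean, ties.method = "random")
```

Random tie-breaking gives every bar a distinct rank in each frame, so two entries with equal weight never occupy the same position in the animated bar chart.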

Questions

1.       Using visual analytics, compare the template subgraph with the potential matches provided. Show where the two graphs agree and disagree. Use your tool to answer the following questions:

  1. Compare the five candidate subgraphs to the provided template. Show where the two graphs agree and disagree. Which subgraph matches the template the best? Please limit your answer to seven images and 500 words.

The heat map shows the number of transactions for each suspect and the template, broken down by edge type. In the heat map, Suspects 1-3 have values similar to those of the template, although the template does not include ‘co-authorship’.

To further identify the suspect subgraphs that match the template, we used tSNE on edges. tSNE is a technique for visualizing high-dimensional data. It enabled us to generate better visualizations by reducing the tendency, which linear projections suffer from, to crowd points together in the center of the map. In the visualization, the template graph has more rounded but unconnected splotches. Suspects 4 and 5 contain relatively shorter strings than the other suspects and the template.

  2. Which key parts of the best match help discriminate it from the other potential matches? Please limit your answer to five images and 300 words.

The splotches of the template data are quite distinctive, while the short, choppy strings in Suspects 4 and 5 corroborate the findings of the earlier visualizations. We therefore continued our search within Suspects 1, 2, and 3.

 

2.       CGCS has a set of IDs that may be members of other potential networks that could have been involved. Take a look at the very large graph. Can you determine if those IDs lead to other networks that match the template? Describe your process and findings in no more than ten images and 500 words.

This visualization was created with the `igraph` package. Comparing the direction, clustering, and node types of the 5 suspects and the template, we found that the template for identifying malicious attacks contains more travel than the 5 suspects. The template also has dense arrows pointing inward to a few nodes in a tight group. Among the suspects, Suspects 4 and 5 exhibit these properties of the template graph.

3.       Optional: Take a look at the very large graph. Can you find other subgraphs that match the template provided? Describe your process and your findings in no more than ten images and 500 words.

 

4.       Based on your answers to the question above, identify the group of people that you think is responsible for the outage. What is your rationale? Please limit your response to 5 images and 300 words.

Based on our analysis and the given constraints, we believe the full-network behaviour of Suspects 4 & 5 is quite unlike that of the template network. Among suspect networks 1, 2, and 3, we have not been able to positively identify one or more networks that look most like the template. Furthermore, the remaining candidates seem to have more in common with one another than with the template network.

 

We advise an immediate meeting with CGCS socio-psychologists to discuss exactly how closely we expect the network behaviour to adhere to the template. The search may need to broaden to include other networks outside of the suspects, or perhaps further explore precise behavioural differences with domain experts.

 

5.       What was the greatest challenge you had when working with the large graph data? How did you overcome that difficulty? What could make it easier to work with this kind of data?

a.       Complex time aggregation: it quickly becomes confusing and calls for another set of eyes to validate the results.

b.       The number of levels when all discrete variables are taken into account. There was a constant need to check: “Am I in the correct dataset for the correct Node Type and Edge Type?”