Last modified: Wednesday, 06-Mar-2019 05:21:46 UTC. Maintained by: Elisa E. Beshero-Bondar (eeb4 at psu.edu). Powered by firebellies.

Network Analysis Exercise, Part 1 (XQuery to Network)

Figure: Jane Austen as she is networked with other historical people and fictional characters in Mary Russell Mitford’s web of writings. For details on how to read this network graph, please see Visualizing the Worlds of Mary Mitford.

The task, and introducing the TSV

With this pair of assignments you will first learn (in Part 1) how to extract data from your XML into a tabular plain text format called a TSV file, which you will then import into the network analysis software Cytoscape. In Part 2, you will learn how to analyze and organize your data as a network graph in Cytoscape. What you are learning here will prepare you for other kinds of data analysis and visualization work, because this simple, handy data format can be read by spreadsheets and web mapping applications, too. TSV stands for Tab Separated Values. It relies on the tab control character, written in XML and XQuery with the character reference &#x9;, which signals a movement to the next tab stop, the location the cursor jumps to when you hit the tab key. Basically a TSV presents a table layout in plain text. In fact, any plain text file can represent a tabular column format just by using a regularly repeating separator character, such as a white space or a comma (comma-separated output is known as a CSV file). Save these files with a .tsv or a .csv extension, depending on whether you use a tab or a comma separator. Here is some sample TSV output from the Decameron project:

Stratilia	frame	Pampinea 
Stratilia	frame	Fiammetta 
Stratilia	frame	Filomena 
Stratilia	frame	Emilia 
Stratilia	frame	Lauretta 
Stratilia	frame	Neifile 
Stratilia	frame	Filostrato 
Stratilia	frame	Dioneo 
Stratilia	frame	Parmeno 
Stratilia	frame	Sirisco 
Stratilia	frame	Tindaro 
Stratilia	frame	Misia 
Stratilia	frame	Licisca 
Stratilia	frame	Chimera 
Bergamino	floatingFrame	Filostrato 
Bergamino	floatingFrame	Lauretta 
Martellino	floatingFrame	Filostrato 
Martellino	floatingFrame	Neifile 
Marchese	novella	Martellino 
Marchese	novella	Agolanti 
Agolanti	novella	Martellino 
Agolanti	novella	Marchese 
Agolanti	novella	Pampinea 
Agolanti	novella	Filostrato 
Agolanti	novella	Lamberti 
Lamberti	novella	Pampinea 
Lamberti	novella	Filostrato

This is a portion of a much larger TSV file that represents co-occurrence network data: it shows individual characters from The Decameron who are connected with each other by being present in the same portion of the text, whether in the introductory or concluding frame portions of each day of storytelling, in the floatingFrame sections in which the frame narrators provide commentary inside the story sections, or in the stories themselves at the novella level. These characters appear together in the same locations in the text, and this is a typical co-occurrence relationship for network analysis, which connects nodes (the characters, here) by edges (what they share or what location hosts them both, whether that is inside a <floatingText> or a <div type="novella"> here). For more on networks of co-occurrence see our Introduction to Network Analysis and Cytoscape for XML Coders. That is the kind of network you will be plotting from XML in this exercise.

You may work with any of our student project files loaded into our eXist database to plot a network of co-occurrence of any kind that interests you, but keep in mind the advice in our tutorial: keep it simple with just one kind of node (say, individual names, place names, reading witnesses, etc.) and some unit of co-occurrence drawn from the structure of your XML files.

Planning the XQuery to produce network data for Cytoscape

To explore the code and look for things to try plotting, we recommend you pull from your project GitHub repos to open the files in <oXygen/> to study, using XPath and the Outline view (Window → Show View → Outline).

Study the XML code from a sample project file (ideally in <oXygen/>) to identify a co-occurrence relationship that interests you. For example, which places are mentioned together in the frame narration vs. the stories in The Decameron? Your network does not have to be about people and places, but could be based on something else you have marked, such as the different publications that represented particular poems in the Emily Dickinson collection.

Note: You may wish to update your project files in our eXist database. (As you review and plot your output data you will almost certainly see evidence of tagging errors, like extra spaces in elements that might yield two separate nodes with the same name.) To update a file, go to File → Manage, browse for your project directory and locate the file you want to change, and delete it from the database by selecting it and clicking on the trashcan symbol. Upload your new file by clicking on the upload button to the left of the trashcan.

Figure: Location of the trashcan and upload buttons in the eXist DB Manager.

If the project XML was encoded in the TEI namespace, you will need to declare the TEI as the default element namespace, following the examples in our XQuery tutorial. At the top of the XQuery script, you will need the following line:

declare default element namespace "http://www.tei-c.org/ns/1.0";
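As a minimal sketch of what that declaration does, the query below counts names in a small inline sample. The sample element content is hypothetical and only stands in for a real TEI project file; the point is that unprefixed element names in the path are read as TEI-namespace names once the declaration is in place.

```xquery
xquery version "3.0";
declare default element namespace "http://www.tei-c.org/ns/1.0";

(: Hypothetical stand-in for a TEI project file :)
let $sample :=
  <TEI xmlns="http://www.tei-c.org/ns/1.0">
    <text><body><p><persName>Pampinea</persName> and <persName>Dioneo</persName></p></body></text>
  </TEI>
(: Unprefixed names in the path now mean TEI-namespace elements :)
return count($sample//persName)
```

This should return 2. Against a TEI file loaded from the database, the same path without the declaration would typically match nothing, because persName with no namespace is a different name from the TEI persName.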

Networks of data are created from nodes connected by edges. For Cytoscape to read and plot network data, it requires a CSV or TSV import in the form of:

Source-Node	Edge	Target-Node

We typically use a TSV because our node data sometimes contains commas and simple white spaces, but it never contains a tab character, so we know we can safely use the tab as a separator. Since our output will be strings of text, we will need to use the concat() function to concatenate (or combine together) the pieces we need for each line, including the tab characters, &#x9;. Read about concat() and its cousin string-join() in the Michael Kay book on p. 545 or search for concat on the w3schools XSLT, XPath, and XQuery Functions page. We will actually want to use these two functions together when we return our text output, because we want to produce the following format for our TSV:

Source-Node [tab] Edge-Interaction [tab] Target-Node [return]

This effectively expresses something like a simple sentence:

Thing-1 [tab] is-in-a-special-shared-place-with [tab] Thing-2 [return]
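The line format above can be sketched in XQuery with hard-coded strings before any XML is involved. This is only an illustration: the two rows are typed in by hand (borrowed from the sample output), and "&#10;" is the character reference for the line-feed that ends each line.

```xquery
xquery version "3.0";

(: Hypothetical hard-coded rows standing in for data pulled from XML :)
let $rows := (
  concat("Bergamino", "&#x9;", "floatingFrame", "&#x9;", "Lauretta"),
  concat("Marchese", "&#x9;", "novella", "&#x9;", "Martellino")
)
(: string-join() glues the rows together with a line-feed between them :)
return string-join($rows, "&#10;")
```

concat() builds one tab-separated line at a time, and string-join() turns the whole sequence of lines into a single string.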

We are almost certainly going to need to clean up and de-dupe (or remove duplicates from) the input data! Almost every project will feature some level of mess to clean up, and one very simple clean-up you can apply while doing the XQuery is to remove any extra white spaces in your input nodes. For this we use the XPath function normalize-space(), which removes leading and trailing white space and collapses any internal run of white space into a single space. It makes sure that <city> Greensburg</city> turns out to be the same single distinct value as <city>Greensburg </city> and <city>Greensburg</city>. To use normalize-space(), we typically walk the tree to the nodes we want to process, and place normalize-space(.) at the end of the XPath, like so:

         let $input1 := $yourVariableStartingPoint//walk//the//tree//to//here/normalize-space(.)
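Here is a small sketch of the effect, using the three Greensburg variants from the paragraph above as inline sample data. The wrapper element <places> and the variable names are hypothetical.

```xquery
xquery version "3.0";

(: Hypothetical sample data: three spellings of the same city :)
let $sample :=
  <places>
    <city> Greensburg</city>
    <city>Greensburg </city>
    <city>Greensburg</city>
  </places>
(: Without clean-up, stray spaces make three different values :)
let $raw := distinct-values($sample//city)
(: With normalize-space(.) at the end of the path, they collapse to one :)
let $clean := distinct-values($sample//city/normalize-space(.))
return concat(count($raw), " raw values, ", count($clean), " cleaned value")
```

This should report 3 raw values and 1 cleaned value, which is the difference between three separate nodes and one node in the eventual network.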

In our return, we are going to use concat() to hold the Source-Node, [tab], Edge-Interaction, [tab], Target-Node, and then we will bundle that concat function inside a string-join() with the special unicode character of a line-feed or hard-return, &#10;, as the separator of each line in the output text. Typically we don’t express the whole verb phrase as the Edge, but we output a word or phrase that identifies what the shared space or shared interaction consists of, as in this example:

Bergamino	floatingFrame	Lauretta

Here, the character Bergamino shares with the character Lauretta a position in one of the <floatingText> sections of our TEI XML for The Decameron, and this relationship constitutes one base unit of a larger network of connections. Generating the TSV file that holds a collection of information like this effectively stores all the network data, and when we import it in Cytoscape we can run the software to calculate, plot, and study its network statistics: which nodes are the most connected to other nodes? Which nodes are necessary to hold the network together? Which parts of the network are broken off from the others? Which nodes only appear to have one edge type (say only in sharing <floatingText>) and which ones share multiple edge types? We can output our network plot in many different ways to consider these questions, and that will be our focus in Part 2, but for now, we need to generate the network data to identify the nodes and edges in the first place.

Writing the XQuery to return Source, Edge, and Target Nodes

This is an exercise in nesting a pair of for loops. Let’s think about why. You need to output each Node-1 or Source-Node, so you want an outer for loop to generate this (together with any information you want to share about that node, called a node attribute), and to hold its edge information too: anything you need that is in a one-to-one relationship with the Source Node. But in order to retrieve the Target-Nodes, you need to realize that for each single Source Node, there may be several other nodes that co-occur with it in the same space. That means that you need to define a variable that will catch the whole series of target nodes, and then walk through them one at a time in an inner for loop, so that you produce a separate line of text for each Target node. Each Source Node will therefore be output several times, once for each of its Targets. Return everything in a concat() using the tab characters we described above, and bundle that in a string-join() with the line-feed return character, also described above.
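The nesting described above can be sketched on a tiny inline document. Everything about the sample is an assumption made for illustration: the element names <play>, <scene>, and <person> and the type attribute are hypothetical, and your own project will need its own XPath.

```xquery
xquery version "3.0";

(: Hypothetical miniature document: each <scene> is the unit of co-occurrence :)
let $doc :=
  <play>
    <scene type="frame"><person>Pampinea</person><person>Dioneo</person><person>Filostrato</person></scene>
    <scene type="novella"><person>Marchese</person><person>Martellino</person></scene>
  </play>
return string-join(
  (: outer loop: one pass per source node :)
  for $source in $doc//scene/person
  (: the edge is in a one-to-one relationship with the source :)
  let $edge := $source/parent::scene/@type/string()
  (: all the other names in the same scene are the targets :)
  let $targets := $source/parent::scene/person[. != $source]
  (: inner loop: one output line per target :)
  for $target in $targets
  return concat($source, "&#x9;", $edge, "&#x9;", $target),
  "&#10;")
```

Each source is repeated on as many lines as it has targets, so the three people in the first scene produce six lines and the two in the second scene produce two.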

Using `distinct-values()` for Source and Target Nodes?

Think about whether you want to network every single time your node appears with every other node in your document. You would produce many duplicate lines of data and your resulting graph would contain many edge lines: Bergamino may appear in the same place with Lauretta over and over and over again. Is that data relevant to your network? You could simplify by taking distinct-values(), and then you would only be noting whether or not two characters appear together at all in a given location, not how many times they appear together. Then again, you might actually want to know that information! To make this really efficient, you can reduce the size of your output by taking distinct-values, and you could also create a separate variable that just goes and checks the count() of the number of times the target node appears in the same context with the source node. If you simply collect that as a number, you could use that number as an edge attribute in Cytoscape when you graph your edge lines: Perhaps you could plot the thickness of an edge line based on how many times the target node shows up in the presence of the source node. Varying the thickness of the edge-lines in a network graph is known as weighting the edges.
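One possible way to combine distinct-values() with a count() that serves as an edge weight is sketched below. The sample document, the element names <section> and <person>, and the fixed edge label "section" are all hypothetical; here the weight is the number of sections a pair shares, which is only one of several reasonable ways to count.

```xquery
xquery version "3.0";

(: Hypothetical sample: Bergamino and Lauretta co-occur in two sections :)
let $doc :=
  <text>
    <section><person>Bergamino</person><person>Lauretta</person></section>
    <section><person>Bergamino</person><person>Lauretta</person><person>Filostrato</person></section>
  </text>
let $people := $doc//person/normalize-space(.)
return string-join(
  (: each distinct name is a source node exactly once :)
  for $source in distinct-values($people)
  (: the sections in which the source appears :)
  let $sections := $doc//section[person/normalize-space(.) = $source]
  (: each distinct co-occurring name, other than the source, is a target :)
  for $target in distinct-values($sections/person/normalize-space(.))[. != $source]
  (: weight: the number of sections the pair shares :)
  let $weight := count($sections[person/normalize-space(.) = $target])
  return concat($source, "&#x9;", "section", "&#x9;", $weight, "&#x9;", $target),
  "&#10;")
```

Each pair now appears on one line per direction instead of once per co-occurrence, and the number in the third column can be imported into Cytoscape as an edge attribute.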

Making choices in XQuery using `if (...) then ... else ...`

Depending on what you are plotting in your network, you may want to distinguish among different kinds of nodes or different kinds of edge locations. In our example from The Decameron we output three different words to indicate whether an interaction occurred in floatingText, in the outer frame around the stories, or inside the stories themselves. We also needed to determine the peers of each distinct character who are mentioned in the same layer of text, and that meant looking only inside the appropriate ancestor::div[1] or ancestor::floatingText element that contains the characters in question, all the persName elements that are not equal to the Source Node. To output different kinds of information based on the distinct locations of these elements will require a conditional series of if () then () and else statements to determine the output of a variable. Here is how to work with iffy conditionals. These sit inside a variable definition to control how it may be defined based on the conditions you set:

let $variable:=
                     if (XPath condition 1) 
                              then some-value-to-store--either XPath or "text"
                     else if (XPath condition 2) 
                               then some-alternative-value-to-store--either XPath or "text"
                     else some-other-value-for-all-other-cases--either XPath or "text"

So, in something more like the variables we prepared for network analysis:

let $edge:=
               if ($treeWalker[. = $distinctValue]/ancestor::whatEver) 
                              then "whatEver"
         else if ($treeWalker[. = $distinctValue]/ancestor::somethingElse) 
                              then "somethingElse"
         else "remainingOption"

In this example, we involved a $treeWalker variable that we set earlier in the XQuery to walk the tree of the XML file(s) before we took distinct-values(). In order to check for the peers and to look up the edge data for our network, we need to check each name element in the XML to see if it corresponds to the current entry in our list of distinct-values, and when it does, check to see which conditions it meets. It should output a different value depending on its placement, and if we do this right, we will identify every condition that matters to us. XQuery requires the final else, but it could return the empty sequence, (), or it could be given a value to output to account for any other case that we did not define in the preceding conditionals. We can have as many else if statements as we like and make a long running list of conditionals, but for the purpose of our network we decided to keep this simple: the words we output in this variable will signal three different states that ultimately we will be able to color-code or plot distinctly in our network graph.
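Here is a small sketch of such a conditional running against inline sample data. The sample structure is hypothetical and is not taken from the real Decameron file (it also leaves out the TEI namespace to stay short); it is arranged so that one name sits inside both a <floatingText> and a novella <div>, which shows why the order of the tests matters.

```xquery
xquery version "3.0";

(: Hypothetical sample imitating the three layers described above :)
let $doc :=
  <body>
    <div type="frame"><persName>Pampinea</persName></div>
    <div type="novella">
      <persName>Marchese</persName>
      <floatingText><persName>Bergamino</persName></floatingText>
    </div>
  </body>
for $p in $doc//persName
let $edge :=
  (: test the most specific container first :)
  if ($p/ancestor::floatingText) then "floatingFrame"
  else if ($p/ancestor::div[1]/@type = "novella") then "novella"
  (: everything else falls through to the frame :)
  else "frame"
return concat($p, " is in: ", $edge)
```

Bergamino has both a <floatingText> ancestor and a novella <div> ancestor in this sample, so swapping the first two conditions would label him "novella" instead of "floatingFrame".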

Putting it all together in a TSV file

When you are retrieving good output, you need to pack it up into a TSV file that we can import into Cytoscape. Outputting a plain text file is a little tricky, because XQuery treats every line of returned text as a separate item, and eXist will throw an error when you try to save your output as a single file. To bind all the lines together so they can be read as one united piece of text, you need to position a string-join() around the whole FLWOR and return, so that the FLWOR, ending in its concat() return, is the first argument of the string-join(), and the second argument is a line-feed (or hard-return) character. And one more thing! To make sure that the output is understood to be plain text and not the default XML format that eXist expects to be producing, just add "text/plain" as the final argument of the xmldb:store() function. Here's a sort of abstract view of how that should look, with a little summary of what we have discussed so far. We decided to output lines of text containing four values: a source node, an edge, an edge attribute, and a target node.

xquery version "3.0";
declare default element namespace "http://www.tei-c.org/ns/1.0"; 
declare variable $ThisFileContent:=
string-join(
   let $engdecameron := doc('/db/decameron/engDecameronTEI.xml')/*
   let $engpeople := [stuff]
   let $engdistinctPs := [stuff]
   for $edp in $engdistinctPs

      let $edgeType :=
         if (condition 1--the floating frames)
            then "floatingFrame"
         else if (condition 2--the novellas)
            then "novella"
         else "frame"

      let $edgeWeight :=
         if (condition 1--the floating frames)
            then count(XPath-to-list-of-peers-in-floatingText)
         else if (condition 2--the novellas)
            then count(XPath-to-list-of-peers-in-novellas)
         else count(XPath-to-all-the-other-peers-not-covered-in-the-other-conditions)

      let $peers :=
         if (condition 1--the floating frames)
            then distinct-values(XPath-to-list-of-peers-in-floatingText)
         else if (condition 2--the novellas)
            then distinct-values(XPath-to-list-of-peers-in-novellas)
         else distinct-values(XPath-to-all-the-other-peers-not-covered-in-the-other-conditions)

      for $peer in $peers
      return
         concat($edp(:source node:), "&#x9;"(:tab character:), $edgeType(:shared interaction or edge:), "&#x9;", $edgeWeight(:edge attribute:), "&#x9;", $peer(:target node:)),
   "&#10;"(:line-feed separator:));

let $filename := "MyNetworkData.tsv"
let $doc-db-uri := xmldb:store("/db/myOutput", $filename, $ThisFileContent, "text/plain")
return $doc-db-uri
(: output at: http://newtfire.org:8338/exist/rest/db/myOutput/MyNetworkData.tsv :)

View your data in the browser and you should be able to download it from there and save it locally (when prompted, save as all files instead of plain text, so your computer preserves the .tsv at the end and doesn’t add .txt to the file extension.) Or navigate your way to it in your output directory in eXist, and use the File menu there to download it.

Test your TSV: Import into Cytoscape

To make sure that your data is good and readable, we conclude this assignment by having you import your TSV file into Cytoscape. Follow the instructions for import in the Cytoscape Tutorial. If Cytoscape gives you a preliminary plot and a network table, you have successfully prepared a good TSV file to work with! If not, you may need to repair something in your XQuery. In the next assignment, we will work on processing your data in Cytoscape to calculate its network statistics and prepare meaningful and legible network visualizations.

What to submit

Upload your XQuery script (in a text file) and your output TSV file to the Courseweb upload point for this assignment.