Data lineage documentation in Pajek

A more technical post.

I learned about Pajek years ago, when a friend of mine took a summer class on complexity at MIT. He used it to illustrate the difference in complexity between folk, pop and jazz music by visualizing a typical chord progression of each genre. I was impressed by the simplicity and power of the tool, and I have now found a practical application for it in my own field.

Pajek is a Windows program for the analysis and visualization of large networks with thousands or even millions of vertices. In Slovenian, the word pajek means spider. It is designed for networks such as collaboration networks, organic molecules in chemistry, protein-receptor interaction networks, genealogies, Internet networks, citation networks, diffusion networks (AIDS, news, innovations) and data-mining (2-mode) networks.

Summary

  1. Create the relations in Excel
  2. Save the Excel sheet as .txt
  3. Convert the .txt to .net
  4. Paste the .net in Excel and enrich it
  5. Paste the enriched code back into the .net

Steps

In an Excel sheet, list for every data field where it comes from, where it flows through and where it goes to, using two columns: From and To.

  • Save this Excel sheet as a .txt-file.

  • Convert the .txt-file to a .net-file using the tool txt2pajek.exe.

  • If you load this .net-file into Pajek you get a default network layout. You can now drag and drop the nodes into more informative positions, but you can also add more information to the .net-file to control the layout.

  • Open the .net-file with Notepad++ and paste it into an Excel file. The yellow part in the print screen below is pasted, the other columns are added.
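The conversion from the two-column text file to a .net-file can also be sketched in a few lines of Python. This is only an illustration of the file layout (a `*Vertices` block followed by an `*Arcs` block), not how txt2pajek.exe works internally. The function names, the sample field names and the assumption that the first row holds the From/To headers are mine.

```python
import csv
import io


def read_edges(text):
    """Parse tab-separated From/To text; the first row is assumed to be the header."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    rows = [(r[0].strip(), r[1].strip()) for r in reader if len(r) >= 2 and r[0].strip()]
    return rows[1:]  # drop the From/To header row


def edges_to_net(edges):
    """Return Pajek .net text for a list of (source, target) name pairs."""
    ids = {}
    for pair in edges:
        for name in pair:
            # Number each distinct name in order of first appearance, starting at 1.
            ids.setdefault(name, len(ids) + 1)
    lines = [f"*Vertices {len(ids)}"]
    lines += [f'{number} "{name}"' for name, number in ids.items()]
    lines.append("*Arcs")
    # Every arc gets a default value (weight) of 1.
    lines += [f"{ids[src]} {ids[dst]} 1" for src, dst in edges]
    return "\n".join(lines) + "\n"


# Hypothetical sample data standing in for the exported Excel sheet.
sample = "From\tTo\nsource.qvd\ttransform.qvd\ntransform.qvd\tdatamodel\n"
print(edges_to_net(read_edges(sample)))
```

For a real export you would read the text from the saved .txt-file and write the result to a file with the .net extension.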

Vertices - from left to right

X: x-position of the node
Y: y-position of the node
Shape (code): the code for the shape of the vertex, sh
Shape (name): the name of the shape: box, ellipse (default), diamond, triangle
Internal color (code): the code for the color of the vertex, ic
Internal color (name): case-sensitive! White, Green, Yellow, Blue, Red, …

You can also add the legend as vertices, but if you do, don't forget to update the number of vertices in cell B2, otherwise an error will occur when loading the .net-file.

Arcs - from left to right

Value: weight of the arc

Concatenate

Concatenate the fields from left to right with a space in between. A plain copy-paste of the columns uses a tab as separator, which does not work in the .net-file, so we concatenate the fields into one column and copy-paste that column into the .net-file.
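The same concatenation can be sketched outside Excel. The snippet below is a minimal illustration, assuming each vertex is a list of cells (label first, then the extra columns) and each arc is a (from, to, value) triple. The function name and the sample values are hypothetical. It joins the cells with single spaces and derives the vertex count from the data, so that the count never has to be updated by hand.

```python
def build_net(vertices, arcs):
    """Join enriched vertex and arc fields with spaces into Pajek .net text."""
    # The vertex count in the header is taken from the list itself.
    lines = [f"*Vertices {len(vertices)}"]
    for number, fields in enumerate(vertices, start=1):
        label, *extra = fields
        cells = [str(number), f'"{label}"'] + [str(cell) for cell in extra]
        lines.append(" ".join(cells))  # a space, never a tab
    lines.append("*Arcs")
    for src, dst, value in arcs:
        lines.append(f"{src} {dst} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical rows: label, x, y, then the color code and color name.
vertices = [
    ["source.qvd", 0.1, 0.5, "ic", "Yellow"],
    ["datamodel", 0.9, 0.5, "ic", "Green"],
]
arcs = [(1, 2, 1)]
print(build_net(vertices, arcs))
```

Further cells, such as the shape columns, can be appended to each vertex list in the same left-to-right order as in the spreadsheet.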

Now if you load the .net-file the dataflow looks like this:

[Image: network]

Some more tweaking can be done manually in Pajek, but these changes are lost every time you load the .net-file again.

Energy

You can also let Pajek find its own visualization of the network (for example by choosing layout/energy/kamada-kawai/free). You can now see that Pajek puts the qvd's that are created but not used in the data model together in a separate group (right-hand side).
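If you want the list of unused qvd's as text rather than as a picture, it can be derived from the same From/To list. The sketch below does not reproduce Pajek's layout algorithm. It simply reports every name that is written to but never read from, excluding the names you consider legitimate end points. The function name, the `endpoints` parameter and the sample names are assumptions of mine.

```python
def unused_outputs(edges, endpoints):
    """Return names that appear as a target but never as a source."""
    sources = {src for src, _ in edges}
    targets = {dst for _, dst in edges}
    # End points such as the data model are expected to have no outgoing arcs.
    return sorted(targets - sources - set(endpoints))


# Hypothetical lineage: b.qvd is created but nothing reads it.
edges = [
    ("database", "a.qvd"),
    ("a.qvd", "datamodel"),
    ("database", "b.qvd"),
]
print(unused_outputs(edges, endpoints=["datamodel"]))
```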

You can also animate a 3D rendering of the network.

PS: vertices is the term used in mathematics, nodes the term used in informatics. They mean the same thing.

The downloads and manuals can be found online