Damian Trilling, February 2022
In this document, I try to give a brief overview of techniques we often use at CCS.Amsterdam. This is not a definitive list of all the cool stuff that is out there, but it can help give you an idea of skills you may want to acquire. Of course, nobody knows everything equally well, and especially if you are a beginning student joining our lab, nobody expects you to master all of this. But chances are high that at some point, you will come into contact with all of them. So getting at least a basic understanding is highly recommended. In particular, the following techniques come to mind:
- Python and R
- Markdown and LaTeX
- Common text editors
- Version control with git
- Databases (SQL, Elastic)
- Linux basics & working with remote servers
There are some personal preferences involved here: Some people prefer R, others Python; some write everything in LaTeX, others don’t; some are Linux wizards, others not. But it’s safe to say that at least a passive knowledge of all of these techniques is really recommended. A good general resource is The Missing Semester of Your CS Education course by MIT. It is aimed at computer science students, but many things that are discussed in the online materials (such as working on the command line or version control with git) are very much relevant to us. If you do not know where to begin, but do like getting into the basics, then this is a good place to start. Below, I discuss some themes in more detail.
Python and R
Some people have strong feelings about which of these two is the “better” language for data analysis, but that is, as far as I am concerned, the wrong way to go about it. The fact is that both languages are very prominent in the computational communication science domain, and that both have their own strengths and weaknesses. If you read the book Computational Analysis of Communication: A practical introduction to the analysis of texts, networks, and images with code examples in Python and R by Wouter van Atteveldt, Damian Trilling, and Carlos Arcila Calderon, then you will quickly realize that most things you may care about can be done in either language.
But that does not mean that you only have to know one of them. You really do not want to build an advanced web crawler in R or apply the latest fancy language models there, simply because the machine learning and computational linguistics communities mainly consist of Python people. Conversely, graphics and running very advanced statistical models (almost) out of the box, both strengths of R, are not Python’s strongest points. Probably the best advice is to become an expert in one of the two languages and develop at least a basic (preferably advanced) knowledge of the other one. After all, if your team members decide to do a project in one of the two languages, it’s good if you can contribute!
- Recommended starting point: Computational Analysis of Communication: A practical introduction to the analysis of texts, networks, and images with code examples in Python and R by Wouter van Atteveldt, Damian Trilling, and Carlos Arcila Calderon
- But there are countless other resources as well…
What about other programming languages?
While Python and R are clearly languages you cannot do without, there are other languages that you may come across. Julia is a relatively new player in the same field as Python and R for data analysis, and is known especially for its speed. However, it is definitely not something you have to know unless you are personally interested in it. Sometimes, code that is really time-critical is written in C, but that, too, is not something you need to bother with.
Markdown and LaTeX
You may be used to writing assignments and papers in a word processor like Microsoft Word. Preferences differ, and some of us like to point out that there is some research suggesting that Word users are more productive than LaTeX users. Having said that, large parts of the community write their papers in LaTeX. There are several reasons for that:
- Because it can be written in a plain text editor, it integrates well with git (see below)
- The result looks very professional.
- There are templates (“document classes”) to directly format the manuscript according to not only style guides like APA, but also for direct publishing in many conference proceedings and journals.
- Python and R can directly output LaTeX code for tables – no need for error-prone reformatting or copy-pasting!
- You can link multiple files (e.g., different files for different parts of the manuscript, a separate folder for figures), and everything compiles into one PDF!
- All reference managers support BibTeX (for automatic citation formatting), and you can even type BibTeX entries by hand!
Especially since Overleaf (think Google Docs for LaTeX) became popular, it is also easy to collaborate on LaTeX files. All PhD candidates and an increasing number of Master’s students I supervise use Overleaf for their papers (I do not force them to do this!).
A much simpler language than LaTeX is Markdown. It is possible to write articles in Markdown and directly include R code in there, such that when you compile the document, all results are calculated in R and directly inserted. But even if you do not do that, Markdown is omnipresent. For instance, it is used to format README files and other texts on GitHub (see below), and in Jupyter Notebooks, you use it to explain your code. So, you will need some Markdown one way or another…
- For Markdown: Basic writing and formatting syntax, a guide to Markdown provided by GitHub (it does not apply only to GitHub though!)
- For LaTeX: One of the easiest things may be to just start based on a colleague’s LaTeX file, so that you don’t have to start from scratch.
- For LaTeX: I use TeXstudio for making my course slides (e.g., here) and I sometimes use emacs (the text editor, see below) to write papers in LaTeX (especially when I want to focus and/or do not have a reliable internet connection). But most LaTeX writing in our group is done on Overleaf, an online LaTeX editor that also allows you to share documents and collaborate on them, a bit like Google Docs. For beginners, this may be a good starting point, also because Overleaf offers not only easy templates to start from, but also an online introduction to LaTeX.
Common text editors
No computational communication scientist can do without a good text editor. Of course, there are specialized programs for writing, say, Python, R, LaTeX, or Markdown, but it is always good to have a general-purpose editor for your daily work (or for configuring your system) as well. Especially if you work on other systems (see the section on servers below), it may be good to have at least a basic knowledge of some editors that are often used. Popular contemporary editors are Atom, Sublime, and Notepad++, but I would suggest also having a look at the “classics” emacs and vi. On the Linux command line, you will very often need one of these, or the more limited but more user-friendly nano.
If you wonder why you should engage in all of this weird plain-text stuff rather than simply using visually appealing WYSIWYG word processors or point-and-click programs for data collection and analysis, consider having a look at Kieran Healy’s The Plain Person’s Guide to Plain Text Social Science.
- Below, we discuss some text-based editors, and it is good to know at least one of them. But in addition, get familiar with an editor that you like and that makes your life easy. Sublime, Atom, Notepad++ come to mind, but that is something for you to figure out.
- Also, you could consider using an IDE that offers a lot of extra functionality for programming, such as PyCharm or Visual Studio Code.
- Unless you really want to, there is nowadays little need to learn the details of vi, an editor that works very differently from all other editors you probably know. However, it is usually present on even the most basic Linux and MacOS configurations and may be configured as the default editor for tasks such as writing commit messages in git. The most important command to remember, though, is probably :q!, which means: quit immediately, without saving ;-) A very brief introduction to vi is provided here.
- Nowadays, most systems also offer nano, a much more basic but also user-friendly text editor. Quite nicely, all commands are displayed at the bottom of the display. Get a short intro here. This one you should really know, because you will then be able to edit text files on almost any system without having to know vi.
- emacs is a much more powerful editor that is also very common on Linux systems, although on small installations, it may be missing. It is available with a text-based and a graphical interface, and it has been ported to all major operating systems. Compared to nano, for instance, it offers really nice things like syntax highlighting for many languages; opening multiple files at once; using regular expressions for search-and-replace; and much more. See it as a Swiss pocket knife that you can use for anything (for instance, I wrote this very article in HTML in emacs, but I also sometimes use it to write Python code, or to write an article in LaTeX). I often use an Emacs Cheat Sheet if I forget the available commands. To read the cheat sheet, you should know that C stands for the Ctrl key and M stands for Meta, which is usually mapped to the Alt key; pressing Esc first works as well, which is handy in terminals that do not pass Alt through.
- For some entertainment about the weird, almost religious, fights some people are involved in, have a look at the Editor War between Emacs and Vi.
Version control with git
This is maybe one of the most important skills to have in addition to knowing R and Python. The terminology can sometimes be confusing at the beginning (pulling, committing, pushing are probably not words you are used to when synchronizing your local data with a cloud service). But we really do not mail scripts around or put them into Dropbox; we use version control for this. You should do this for your own projects as well, even if you do not intend to share your code in the first place. It will allow you to go back to any earlier version of the code later on, and save you a lot of confusion in the long run.
- Recommended starting point: CCS.Amsterdam GitHub workshop
- Note that git is also integrated in some IDEs (e.g., in RStudio) and that graphical user interfaces to git exist, but I would advise learning how to use it on the command line. First, because it is easier to get help with that (the command line approach is what will show up when you google for help); second, because you can then use it anywhere, also to sync your files on a remote server, for instance (see below).
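To give an impression of what committing and going back to an earlier version look like, here is a small sketch. It calls the ordinary git command line program from Python, purely so that all examples in this document can stay in one language; on your own machine you would simply type the same commands (git init, git add, git commit, git log, git show) in a terminal. The file name, commit messages, and user details are made up, and pulling and pushing are left out because they require a remote repository.

```python
import subprocess
from pathlib import Path


def git(*args, cwd):
    """Run one git command inside `cwd` and return what it printed."""
    cmd = [
        "git",
        # Illustrative identity, passed per command so no global config is needed.
        "-c", "user.name=Example User",
        "-c", "user.email=user@example.org",
        "-c", "commit.gpgsign=false",
        *args,
    ]
    result = subprocess.run(cmd, cwd=cwd, check=True, capture_output=True, text=True)
    return result.stdout


def demo(workdir):
    """Create a repository, commit two versions of a script, look back at the first."""
    script = Path(workdir) / "analysis.py"

    git("init", cwd=workdir)                      # start tracking this folder
    script.write_text("print('version 1')\n")
    git("add", "analysis.py", cwd=workdir)        # stage the new file
    git("commit", "-m", "Add first version of analysis", cwd=workdir)

    script.write_text("print('version 2, with a fix')\n")
    git("commit", "-a", "-m", "Update analysis", cwd=workdir)  # -a stages tracked files

    log = git("log", "--oneline", cwd=workdir).splitlines()
    # HEAD~1 is the commit before the most recent one.
    first_version = git("show", "HEAD~1:analysis.py", cwd=workdir)
    return log, first_version


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as folder:
        history, old_code = demo(folder)
        print("\n".join(history))
        print(old_code)
```

The last step is the one that pays off in the long run: the first version of the script is still retrievable even though the file on disk has been overwritten.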
Databases (SQL, Elastic)
Communication scientists coming from a traditional background tend to think of their data as something that can be stored in one file, such as a CSV table. But once your data grow bigger, or are dynamically changing, this may not be a good approach any more. Rather than re-inventing the wheel and trying to figure out the best way to read and write the data, you may want to outsource this work to a database – a specialized service running either on your own machine or somewhere else. Luckily, R and Python offer very good support for many databases – for instance, you can directly send a query to the database and either loop over the results or put them into a dataframe. For our purpose, we can distinguish between two major types: SQL databases that work with (linked) tables, and NoSQL databases that are especially useful for non-tabular data.
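As a minimal sketch of what “sending a query and looping over the results” means, the following example uses SQLite, a small SQL database that ships with Python in the `sqlite3` module and needs no server. The table names, column names, and rows are invented for illustration; the two tables are linked via `outlet_id`, which is the typical SQL way of organizing data.

```python
import sqlite3


def build_example_db():
    """Create an in-memory database with two linked tables (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE outlets (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute(
        "CREATE TABLE articles ("
        "id INTEGER PRIMARY KEY, "
        "outlet_id INTEGER REFERENCES outlets(id), "
        "title TEXT)"
    )
    conn.executemany(
        "INSERT INTO outlets (id, name) VALUES (?, ?)",
        [(1, "Outlet A"), (2, "Outlet B")],
    )
    conn.executemany(
        "INSERT INTO articles (outlet_id, title) VALUES (?, ?)",
        [(1, "Election results"), (1, "Budget debate"), (2, "Storm warning")],
    )
    conn.commit()
    return conn


def articles_per_outlet(conn):
    """Send one query that joins both tables, then loop over the result rows."""
    query = """
        SELECT outlets.name, COUNT(articles.id)
        FROM outlets
        JOIN articles ON articles.outlet_id = outlets.id
        GROUP BY outlets.name
        ORDER BY outlets.name
    """
    counts = {}
    for name, n_articles in conn.execute(query):
        counts[name] = n_articles
    return counts


if __name__ == "__main__":
    connection = build_example_db()
    print(articles_per_outlet(connection))
    connection.close()
```

If you prefer a dataframe over a loop, pandas offers `read_sql_query`, which takes the same query string and connection.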
- There is a (very brief) overview and example in our book. I also co-authored a book chapter on the considerations for when to use which database.
- There is a lot written about SQL – after all, it has existed for decades. So much that it actually is quite hard to give a clear recommendation. There is a Medium post on Basic SQL for data science and a post on Towards Data Science, but again, there may be better ones out there.
- To get the idea behind the strengths of SQL databases (namely, linking multiple tables), you may want to read up on database normalization (link to the Dutch Wikipedia, which is shorter and thus more to the point on this).
- Elasticsearch is a much more recent database and an example of a so-called NoSQL database. The main difference from relational databases (like MySQL) is that data are not organized as interlinked tables. Rather, you can essentially just dump a nested dictionary (or a JSON object) into it, and it is completely fine if not all records have the same keys. In general, if your data are nested, relatively messy, and/or if you need to have efficient full-text search, you may consider Elasticsearch. While it is quite easy to dump your first document into an ES database, the learning curve for queries etc. (all of which are essentially nested dictionaries in themselves) can be quite steep. One example of a tutorial that introduces Elasticsearch in combination with Python is here.
- Another popular NoSQL database is MongoDB. It is also a good choice; it just happens that most of our projects use Elasticsearch, mainly because (at least in the past, I do not know for sure now) it offers superior full-text search, which is important for all the content analysis work we do.
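To illustrate what “dumping a nested dictionary” and “queries that are nested dictionaries themselves” look like, here is a sketch that only builds the data structures and does not talk to a server. The documents and field names are invented; the query follows the general shape of the Elasticsearch query language (a `bool` query combining two `match` clauses), but sending it to a running instance via a client library is left out here.

```python
import json

# Two hypothetical documents. Note that they do not share the same keys,
# and that one of them contains a nested dictionary.
documents = [
    {
        "title": "Parliament debates climate bill",
        "text": "The climate bill was discussed at length ...",
        "author": {"name": "A. Example", "desk": "politics"},
    },
    {
        "title": "Heat wave expected",
        "text": "Forecasters expect record temperatures ...",
        "tags": ["weather", "climate"],
    },
]


def field_paths(doc, prefix=""):
    """List the fields of a nested document in dotted notation (e.g. author.name)."""
    paths = set()
    for key, value in doc.items():
        path = prefix + key
        if isinstance(value, dict):
            paths |= field_paths(value, path + ".")
        else:
            paths.add(path)
    return paths


# A query is a nested dictionary as well: full-text search for "climate" in the
# text field, restricted to documents whose author works at the politics desk.
query = {
    "query": {
        "bool": {
            "must": [
                {"match": {"text": "climate"}},
                {"match": {"author.desk": "politics"}},
            ]
        }
    }
}

if __name__ == "__main__":
    for doc in documents:
        print(sorted(field_paths(doc)))
    # This JSON string is what would travel to the database.
    print(json.dumps(query, indent=2))
```

Nested fields are addressed with a dot, as in `author.desk`, which is why the helper above prints the fields in that notation.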
Linux basics & working with remote servers
You do not have to uninstall Windows immediately and run Linux on your laptop (even though quite a few people in our group do run Linux on their laptops), but it may help to realize that by far most software and examples in our field are based on Linux. In particular, virtually all (research) servers run on Linux. Historically, MacOS and Linux share the same roots, so quite a few things that work on Linux can be adapted for MacOS relatively easily; in particular, the command line (shell) is almost identical. Hence, if you familiarize yourself with Linux, you gain some knowledge you can also use on MacOS. Nowadays, Windows has jumped on the bandwagon with the Windows Subsystem for Linux (even though I do not have any experience with that). To experiment, you could consider installing a virtual machine (like VirtualBox) with Ubuntu in it (for instance, download a lightweight image from Lubuntu and install it in VirtualBox).
- A very good starting point is the aforementioned The Missing Semester of Your CS Education course by MIT.
- If you look for bash (the name of a very common Linux shell) tutorials on YouTube, you will also find a lot. Notice that you can also write bash scripts (see, e.g., here), just like you can write Python scripts. This can be very useful if you want to automate some operation (like renaming multiple files) or if you want to make sure that your data processing or analysis is reproducible.
- Additionally, you should familiarize yourself with the idea behind remote logins via ssh, copying files back and forth via scp (and/or rsync), shebangs and user permissions to make scripts executable, tools like nohup and/or screen to run scripts even after disconnecting from the server (see, e.g., here), and cron to automatically run scripts at specific time intervals (see, e.g., here).
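As a small illustration of two ideas from the list above, automating an operation such as renaming multiple files, and making a script executable, here is a sketch in Python rather than bash. The function names, the prefix, and the file pattern are made up. The first line is the shebang: it tells a Linux or MacOS shell which program should run the file once the file has execute permission.

```python
#!/usr/bin/env python3
import os
import stat
from pathlib import Path


def rename_with_prefix(folder, prefix, pattern="*.txt"):
    """Prepend `prefix` to the name of every file in `folder` that matches `pattern`."""
    renamed = []
    # sorted() fixes the list of files before any renaming starts, so files
    # that were already renamed are not picked up a second time.
    for path in sorted(Path(folder).glob(pattern)):
        target = path.with_name(prefix + path.name)
        path.rename(target)
        renamed.append(target.name)
    return renamed


def make_executable(path):
    """Give the owner execute permission, like `chmod u+x` on the command line."""
    current_mode = os.stat(path).st_mode
    os.chmod(path, current_mode | stat.S_IXUSR)


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as folder:
        for name in ("a.txt", "b.txt", "notes.csv"):
            (Path(folder) / name).write_text("example\n")
        print(rename_with_prefix(folder, "2022_"))

        script = Path(folder) / "hello.py"
        script.write_text("#!/usr/bin/env python3\nprint('hello')\n")
        make_executable(script)
        print(os.access(script, os.X_OK))
```

Once a script like this has a shebang and execute permission, it can be started directly (for instance as ./hello.py), which is also what tools like nohup and cron expect.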
I hope this overview provided you with a good starting point. Feel free to reach out if you have any suggestions for improvement. But above all, be curious and try things out!