Research Questions

Data Origin

I downloaded 2018 migration data for all OECD countries from the OECD website. I then retrieved a data set linking countries with geographical subregions from GitHub.

library(readr)
library(here)

wf_mig = read_csv(here("data-raw/workforce-migration.csv")) #migration data
sub_reg = read_csv(here("data-raw/subregions.csv")) #subregions data

The workforce migration data set includes the number of foreign-trained doctors who, in 2018, were registered or in the process of gaining registration to practise in a country other than the one in which they had obtained their medical qualifications. This includes medical interns and residents.

head(wf_mig)
## # A tibble: 6 x 11
##   COU   Country   VAR   Variable        CO2   `Country of ori~   YEA  Year Value
##   <chr> <chr>     <chr> <chr>           <chr> <chr>            <dbl> <dbl> <dbl>
## 1 CAN   Canada    FTDS  Foreign-traine~ AFG   Afghanistan       2018  2018     5
## 2 FRA   France    FTDS  Foreign-traine~ AFG   Afghanistan       2018  2018    12
## 3 DEU   Germany   FTDS  Foreign-traine~ AFG   Afghanistan       2018  2018   153
## 4 NZL   New Zeal~ FTDS  Foreign-traine~ AFG   Afghanistan       2018  2018     0
## 5 NOR   Norway    FTDS  Foreign-traine~ AFG   Afghanistan       2018  2018    22
## 6 CHE   Switzerl~ FTDS  Foreign-traine~ AFG   Afghanistan       2018  2018     6
## # ... with 2 more variables: Flag Codes <lgl>, Flags <lgl>

The subregions data set links countries with their respective geographical regions and subregions. I focused on subregions in my analysis because their number is more manageable than that of countries (far too many!) or regions (far too few!), which will enhance the readability of the plot.

head(sub_reg)
## # A tibble: 6 x 11
##   name      `alpha-2` `alpha-3` `country-code` `iso_3166-2` region `sub-region` 
##   <chr>     <chr>     <chr>     <chr>          <chr>        <chr>  <chr>        
## 1 Afghanis~ AF        AFG       004            ISO 3166-2:~ Asia   Southern Asia
## 2 Åland Is~ AX        ALA       248            ISO 3166-2:~ Europe Northern Eur~
## 3 Albania   AL        ALB       008            ISO 3166-2:~ Europe Southern Eur~
## 4 Algeria   DZ        DZA       012            ISO 3166-2:~ Africa Northern Afr~
## 5 American~ AS        ASM       016            ISO 3166-2:~ Ocean~ Polynesia    
## 6 Andorra   AD        AND       020            ISO 3166-2:~ Europe Southern Eur~
## # ... with 4 more variables: intermediate-region <chr>, region-code <chr>,
## #   sub-region-code <chr>, intermediate-region-code <chr>

Data Processing

You can see the full script here.

I joined wf_mig and sub_reg by ISO Alpha-3 codes. These are standardised codes for countries, and thus unique identifiers that function as joining keys. This allowed me to retrieve subregion-level information about the number of migrants.
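
Before joining, it can help to confirm that the key really is unique and that every code in the migration data has a match. The sketch below is only an illustration: `mig_toy` and `reg_toy` are hypothetical stand-ins for wf_mig and sub_reg with made-up rows, the column name `code` is assumed, and it uses base R only.

```r
#toy stand-ins for wf_mig and sub_reg (hypothetical values, not the real data)
mig_toy = data.frame(COU = c("CAN", "FRA", "DEU"),
                     CO2 = c("AFG", "AFG", "XXX"), #"XXX" is a deliberately invalid code
                     Value = c(5, 12, 153))
reg_toy = data.frame(code = c("AFG", "CAN", "DEU", "FRA"),
                     subregion = c("Southern Asia", "Northern America",
                                   "Western Europe", "Western Europe"))

#a joining key must identify each row of the lookup table exactly once
key_is_unique = !anyDuplicated(reg_toy$code)

#codes in the migration data that would get no subregion after a left join
unmatched = setdiff(c(mig_toy$COU, mig_toy$CO2), reg_toy$code)

key_is_unique
unmatched
```

Any code listed in `unmatched` would end up with a missing subregion after the join, so an empty result is the outcome to hope for.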

There were several steps:

Removing unnecessary data

I removed:

  • flows between identical countries (not interested in domestic migration)
  • unnecessary columns, e.g., the year of migration (since data is solely from 2018) or information about regions (since I was only interested in subregions).

I also gave the columns shorter, more descriptive names.

library(dplyr)

#wf_mig - keep relevant columns and rename them
wf_mig = wf_mig[, c("COU", "Country", "CO2", "Country of origin", "Value")]
wf_mig = wf_mig %>%
  rename(code_to = "COU",
         country_to = "Country",
         code_from = "CO2",
         country_from = "Country of origin",
         number = "Value" #number of migrants
)

#wf_mig - remove domestic migration
wf_mig = wf_mig %>%
  filter(country_to != country_from)

#sub_reg - keep relevant columns and rename them
sub_reg = sub_reg[, c("name", "alpha-3", "sub-region")]
sub_reg = sub_reg %>%
  rename(country = name,
         code = "alpha-3",
         subregion = "sub-region")

Joining data frames

I assigned a subregion to each destination and origin country by joining wf_mig and sub_reg on country code. I also renamed the joined columns to distinguish between origin and destination subregions.

#join datasets based on country code
data = left_join(wf_mig, sub_reg, 
                 by = c("code_to" = "code")
) #add subregions for destination countries

data = rename(data, 
              subregion_to =  subregion) #destination subregions

data = left_join(data, 
                 sub_reg, 
                 by = c("code_from" = "code")
) #add subregions to origin countries

data = rename(data, 
              subregion_from = subregion) #subregions of origin

Evaluating the data set

The new data set included country and subregion information for both origin and destination, with no missing values in any column:

sapply(data, 
       function(x) sum(is.na(x))
)
##        code_to     country_to      code_from   country_from         number 
##              0              0              0              0              0 
##   subregion_to subregion_from 
##              0              0
head(data)
## # A tibble: 6 x 7
##   code_to country_to  code_from country_from number subregion_to  subregion_from
##   <chr>   <chr>       <chr>     <chr>         <dbl> <chr>         <chr>         
## 1 CAN     Canada      AFG       Afghanistan       5 Northern Ame~ Southern Asia 
## 2 FRA     France      AFG       Afghanistan      12 Western Euro~ Southern Asia 
## 3 DEU     Germany     AFG       Afghanistan     153 Western Euro~ Southern Asia 
## 4 NZL     New Zealand AFG       Afghanistan       0 Australia an~ Southern Asia 
## 5 NOR     Norway      AFG       Afghanistan      22 Northern Eur~ Southern Asia 
## 6 CHE     Switzerland AFG       Afghanistan       6 Western Euro~ Southern Asia

Data Transformations

The code for this work is here.

The visualization I planned followed the procedure of Sander et al. (2014) and required two objects:

The flow matrix

I created the subregions data frame, containing the total number of migrants per subregion of origin and destination, irrespective of country:

library(dplyr)
library(reshape2)

#get number of immigrants/emigrants at subregion level
subregions = data %>%
  group_by(subregion_from, subregion_to) %>%
  summarize(subregion_number = sum(number))

#convert subregions data frame into wide format
subregions = dcast(subregions,
                   subregion_from ~ subregion_to, #origin subregions as rows
                   value.var = "subregion_number" #number of migrants per subregion
)

#give rows subregion names to facilitate indexing
rownames(subregions) = subregions$subregion_from 

head(subregions)
##                                                  subregion_from
## Australia and New Zealand             Australia and New Zealand
## Central Asia                                       Central Asia
## Eastern Asia                                       Eastern Asia
## Eastern Europe                                   Eastern Europe
## Latin America and the Caribbean Latin America and the Caribbean
## Melanesia                                             Melanesia
##                                 Australia and New Zealand Eastern Europe
## Australia and New Zealand                            2689              1
## Central Asia                                            3             95
## Eastern Asia                                          756             32
## Eastern Europe                                        182           7343
## Latin America and the Caribbean                        69             16
## Melanesia                                              67             NA
##                                 Latin America and the Caribbean
## Australia and New Zealand                                     2
## Central Asia                                                  2
## Eastern Asia                                                  1
## Eastern Europe                                              113
## Latin America and the Caribbean                           10674
## Melanesia                                                    NA
##                                 Northern America Northern Europe
## Australia and New Zealand                    697             941
## Central Asia                                  26              95
## Eastern Asia                                 527             336
## Eastern Europe                              1699           11632
## Latin America and the Caribbean             2064             900
## Melanesia                                      4               5
##                                 Southern Europe Western Asia Western Europe
## Australia and New Zealand                     1           85             44
## Central Asia                                  2           69            442
## Eastern Asia                                 NA           NA            539
## Eastern Europe                               92        10892          24244
## Latin America and the Caribbean               6          652           2159
## Melanesia                                    NA           NA              1

I initialized flow_matrix with one row and one column per subregion, filled with 0s, treating rows as origin subregions and columns as destination subregions.
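
The initialization step is not shown in this write-up, so here is a minimal sketch of how it could look. It assumes a plain numeric matrix with named rows and columns; `subreg_toy` and `flow_toy` are hypothetical names, and only three subregions are used to keep the output small.

```r
#hypothetical subset of subregion names; in the analysis these would come from the data
subreg_toy = c("Eastern Europe", "Northern Europe", "Western Europe")

#square matrix of 0s: rows = origin subregions, columns = destination subregions
flow_toy = matrix(0,
                  nrow = length(subreg_toy),
                  ncol = length(subreg_toy),
                  dimnames = list(subreg_toy, subreg_toy))

#cells can then be addressed by name, origin first and destination second
flow_toy["Eastern Europe", "Western Europe"] = 24244
flow_toy
```

Naming both dimensions is what makes the later `[i, j]` lookups by subregion name possible.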

I then updated the values in flow_matrix with the ones in subregions. This approach ensured that all possible combinations of subregions were present in flow_matrix, even if they were absent from the subregions data set. An absent combination indicates that no migration was recorded between those two subregions.

#find all subregions in the dataset
unique_subreg = unique(c(data$subregion_to, data$subregion_from))

#update flow_matrix with values from subregions
for(i in unique_subreg) { #take each unique subregion
  for(j in unique_subreg) { #combine it with all subregions
    #only look up combinations that exist in subregions and are not missing
    #(indexing a data frame by a column name it lacks would raise an error)
    if(i %in% rownames(subregions) && j %in% colnames(subregions) &&
       !is.na(subregions[i, j])) {
      flow_matrix[i, j] = subregions[i, j] #replace 0 in flow_matrix with subregions value
    } #otherwise keep 0 in flow_matrix
  }
}

At the end, flow_matrix looked like this:

head(flow_matrix)
##                    Eastern Asia South-Eastern Asia Sub-Saharan Africa
## Eastern Asia                  0                  0                  0
## South-Eastern Asia            0                  0                  0
## Sub-Saharan Africa            0                  0                  0
## Northern Africa               0                  0                  0
## Southern Europe               0                  0                  0
## Northern America              0                  0                  0
##                    Northern Africa Southern Europe Northern America
## Eastern Asia                     0               0              527
## South-Eastern Asia               0               0              488
## Sub-Saharan Africa               0               0             3849
## Northern Africa                  0               0             1895
## Southern Europe                  0             956              459
## Northern America                 0               1             1218
##                    Latin America and the Caribbean Western Asia
## Eastern Asia                                     1            0
## South-Eastern Asia                               1            2
## Sub-Saharan Africa                               1          125
## Northern Africa                                  0          228
## Southern Europe                                128         1570
## Northern America                                 8          534
##                    Australia and New Zealand Southern Asia Eastern Europe
## Eastern Asia                             756             0             32
## South-Eastern Asia                      1086             0              4
## Sub-Saharan Africa                      2828             0              2
## Northern Africa                           67             0             12
## Southern Europe                           94             0            110
## Northern America                         784             0              7
##                    Northern Europe Western Europe
## Eastern Asia                   336            539
## South-Eastern Asia             819            571
## Sub-Saharan Africa            5931           2112
## Northern Africa               5622           8771
## Southern Europe               4800          14367
## Northern America               172            463

The subregion details data frame

This data frame included colours for circle sectors, colours for circle links, and the total number of immigrants and emigrants in each subregion. Code for this section has been largely adapted from Sander et al. (2014).

I started by adding the number of emigrants, immigrants, and total migrants for each subregion to the newly created subregion_details data frame.

#Compute number of emigrants per subregion 
df_from = data %>%
  group_by(subregion_from) %>%
  summarize(emig = sum(number))

#Compute number of immigrants per subregion
df_to = data %>%
  group_by(subregion_to) %>%
  summarize(immig = sum(number))

#create subregion_details data frame with info about total migration flow
subregion_details = left_join(df_from, 
                              df_to, 
                              by = c("subregion_from" = "subregion_to") 
)

##I am aware I could have done this by summing the rows and columns of `flow_matrix`
##but I wanted to do things this way so that I could compare the two outputs 
##and hopefully find they are identical as a way to check my work (they were!). 
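
The total column and the consistency check mentioned in the comment above can be sketched as follows. This is a minimal base R illustration on made-up flows, not the project script: `flows_toy`, `details_toy` and `flow_toy` are hypothetical names, and it assumes that a subregion with no recorded emigrants or immigrants should count as 0 rather than NA.

```r
#toy long-format flows (hypothetical values)
flows_toy = data.frame(
  subregion_from = c("A", "A", "B", "C"),
  subregion_to   = c("B", "C", "B", "B"),
  number         = c(10, 5, 7, 3)
)

#use the same set of subregions for origins and destinations
all_subreg = sort(union(flows_toy$subregion_from, flows_toy$subregion_to))
from = factor(flows_toy$subregion_from, levels = all_subreg)
to   = factor(flows_toy$subregion_to,   levels = all_subreg)

#emigrants and immigrants per subregion (NA where a subregion has no flows)
emig  = as.vector(tapply(flows_toy$number, from, sum))
immig = as.vector(tapply(flows_toy$number, to, sum))
emig[is.na(emig)]   = 0
immig[is.na(immig)] = 0

details_toy = data.frame(subregion = all_subreg, emig = emig, immig = immig)
details_toy$total = details_toy$emig + details_toy$immig

#independent computation: row and column sums of the flow matrix
flow_toy = tapply(flows_toy$number, list(from, to), sum)
flow_toy[is.na(flow_toy)] = 0

#both routes should agree
all(rowSums(flow_toy) == details_toy$emig)
all(colSums(flow_toy) == details_toy$immig)
```

Row sums give emigrants because rows are origins, and column sums give immigrants because columns are destinations.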

Because circular plots offer limited space, I wanted to eliminate subregions with few migrants from the data set, but also give the user the choice to include as many subregions as they want.

In my case, I excluded subregions whose total number of migrants fell below the 20th percentile.

#eliminate subregions where the total number of migrants is below the given quantile
(tiny_subreg = subset(subregion_details, 
                      total < quantile(total, 0.2) #user-defined quantile
))

#remove tiny subregions from subregion_details
subregion_details = subregion_details[!(subregion_details$subregion %in% tiny_subreg$subregion), ]

I then sorted this data set in ascending order by total (number of migrants = emigrants + immigrants), to plot subregions in the order of their total amounts of migrants later on.

I assigned a colour to each remaining subregion. This works automatically however many of the 17 subregions the user keeps, because the colours are drawn from a pool of 17:

#add rgb codes to each subregion
rgb_pool =  c("255,0,0", #red
              "0,255,0", #lime
              "128,128,0", #olive   
              "148,0,211", #dark violet
              "0,206,209", #dark turquoise
              "255,0,255", #magenta
              "128,0,0", #maroon
              "255,99,71", #tomato
              "0,128,0", #green
              "0,0,255", #blue
              "128,0,128", #purple
              "0,128,128", #teal
              "0,0,128", #navy
              "250,128,144", #salmon
              "100,149,237", #cornflower blue
              "153,50,204", #dark orchid
              "60,179,113" #medium sea green
) #googled 17 rgb codes that enhance contrast; 17 = length(unique_subreg)

#select as many colours as needed depending on the amount of subregions included
subregion_details$rgb = rgb_pool[1:nrow(subregion_details)]

I then stored two versions of these colours in HEX format (one of which had increased transparency) for different elements of the graph.

#split rgb codes into 3 variables
n = nrow(subregion_details)
subregion_details = cbind(subregion_details, #split codes and treat them as numbers
                          matrix(as.numeric(unlist(strsplit(subregion_details$rgb, split = ","))), 
                                 nrow = n, byrow = TRUE 
                                 ) #arrange them in a matrix
)

subregion_details = subregion_details %>%
  rename( #rename columns according to the colour index
    r = '1',
    g = '2',
    b = '3',
  )

#increase transparency and transform rgb into HEX codes
subregion_details$rcol = rgb(subregion_details$r, 
                             subregion_details$g, 
                             subregion_details$b, 
                             max = 255
)

subregion_details$lcol = rgb(subregion_details$r, 
                             subregion_details$g, 
                             subregion_details$b, 
                             alpha = 200, #transparency index
                             max = 255
)

I also ordered rows in subregion_details in ascending order by the total number of migrants to facilitate indexing later on.

Finally, I added the xmin (always 0) and xmax (the total number of migrants) columns, which set the axis limits of each subregion's sector in the plot.
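
The sorting and axis-limit steps described above are not shown as code, so the following is a minimal sketch of one way to do them. `details_toy` and its totals are hypothetical, and only the columns needed for the illustration are included.

```r
#toy subregion_details with hypothetical totals
details_toy = data.frame(subregion = c("A", "B", "C"),
                         total = c(27, 8, 15))

#sort in ascending order by total number of migrants
details_toy = details_toy[order(details_toy$total), ]
rownames(details_toy) = NULL #reset row names after reordering

#each sector's axis runs from 0 to its total number of migrants
details_toy$xmin = 0
details_toy$xmax = details_toy$total

details_toy
```

Resetting the row names keeps numeric row indices aligned with the sorted order, which matters when rows are later addressed by position.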

At the end, the subregion_details data frame looked like this:

head(subregion_details)
##            subregion  emig immig total order       rgb   r   g   b    rcol
## 1       Eastern Asia  2191     0  2191     5   255,0,0 255   0   0 #FF0000
## 2 South-Eastern Asia  2971     0  2971     6   0,255,0   0 255   0 #00FF00
## 3 Sub-Saharan Africa 14848     0 14848     7 128,128,0 128 128   0 #808000
## 4    Northern Africa 16595     0 16595     8 148,0,211 148   0 211 #9400D3
## 5    Southern Europe 22484  1085 23569     9 0,206,209   0 206 209 #00CED1
## 6   Northern America  3187 24303 27490    10 255,0,255 255   0 255 #FF00FF
##        lcol xmin  xmax
## 1 #FF0000C8    0  2191
## 2 #00FF00C8    0  2971
## 3 #808000C8    0 14848
## 4 #9400D3C8    0 16595
## 5 #00CED1C8    0 23569
## 6 #FF00FFC8    0 27490

Data Visualization

The visualization I selected is a circular plot. You can find the code here.

Background information

Essentially, this plot draws tracks on a circle and splits them into sectors to reflect differences in migrant numbers between subregions, while separating immigrants and emigrants. It also plots links to illustrate the flow of migrants between subregions.

Readers who are interested in the inner workings of the code should consult Sander et al. (2014) and Gu (2020). I used both of these resources, but I prefer the latter because it offers a reasonably in-depth explanation of the circlize package.

Preliminary circular plots

I started by setting some plotting parameters related to the size of the circle, padding, sector gaps, etc.

suppressPackageStartupMessages(library(circlize))

circos.clear() #reset circular layout parameters

par(mar = c(0, 0, 0, 0)) #margin around chart
circos.par(cell.padding = c(0, 0, 0, 0), 
           track.margin = c(0, 0.1), 
           start.degree = 45, #start plotting at 45 degrees (between 1 and 2 o'clock)
           gap.degree = 2, #gap between circle sectors
           points.overflow.warning = FALSE, 
           canvas.xlim = c(-1, 1), #size of circle
           canvas.ylim = c(-1, 1)  #size of circle
)

I then initialized the layout to allocate subregions into sectors whose sizes are bounded by xmin and xmax. This approach ensures the relative size of sectors is in keeping with the relative number of migrants for each subregion.

circos.initialize(factors = subregion_details$subregion, #allocate sectors on circle to subregions
                  xlim = cbind(subregion_details$xmin, 
                               subregion_details$xmax) #set limits of the x axis for each sector between 0 and total migrant numbers = xmax
)
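
To make the proportionality concrete, the arithmetic can be sketched in base R. This is my reading of the default layout rather than something taken from the circlize source: it assumes one gap of gap.degree per sector and sector widths proportional to each sector's x-axis range. `total_toy` holds hypothetical totals.

```r
#hypothetical totals for three subregions
total_toy = c(A = 8, B = 15, C = 27)
gap_degree = 2 #same value as gap.degree above

#degrees left for the sectors once one gap per sector is reserved
available = 360 - gap_degree * length(total_toy)

#each sector receives a share proportional to its x-axis range (xmax - xmin)
sector_degrees = available * total_toy / sum(total_toy)
round(sector_degrees, 1)
```

Under these assumptions, a subregion with roughly three times as many migrants as another occupies roughly three times as much of the circle.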

The next step involved creating a plotting region to which I added graphics. This first track is thus split into sectors reflecting the total amount of migrants per subregion.

circos.initialize(factors = subregion_details$subregion, #allocate sectors on circle to subregions
                  xlim = cbind(subregion_details$xmin, 
                               subregion_details$xmax) #set limits of the x axis for each sector between 0 and total = xmax
)

circos.trackPlotRegion(ylim = c(0, 1), #y-axis limits for each sector
                       factors = subregion_details$subregion, 
                       track.height = 0.1, 
                       panel.fun = function(x, y) { #for each new cell (i.e., intersection between sector and track)
                         name = get.cell.meta.data("sector.index") #retrieve cell meta data
                         i = get.cell.meta.data("sector.numeric.index")
                         xlim = get.cell.meta.data("xlim")
                         ylim = get.cell.meta.data("ylim")
                         
                         #plot a sector for each subregion
                         circos.rect(xleft = xlim[1], 
                                     ybottom = ylim[1], 
                                     xright = xlim[2], 
                                     ytop = ylim[2], 
                                     col = subregion_details$rcol[i], #use less transparent colours
                                     border = subregion_details$rcol[i]
                         )
        
                       }
)

I next added another track where:

  • the coloured circle arcs represent relative numbers of emigrants per subregion
  • the white circle arcs represent relative numbers of immigrants per subregion.

This was achieved by including the following code inside the panel.fun() function in the previous code chunk.

#distinguish between immigrants and emigrants in each subregion
circos.rect(xleft = xlim[1], 
            ybottom = ylim[1], 
            xright = xlim[2] - rowSums(flow_matrix)[i], #i.e., total - emigrants
            ytop = ylim[1] + 0.3,
            col = "white", 
            border = "white"
) 
                         
#add a white contour to separate the previous two rectangles
circos.rect(xleft = xlim[1], 
            ybottom = 0.3, 
            xright = xlim[2], 
            ytop = 0.32, 
            col = "white", 
            border = "white"
)

Final plot

Next, I included links between origin subregions and destination subregions, in order to show migration patterns, and an axis to give an indication of actual migrant numbers. I also fixed the positioning of the text in relation to the plot and the wider page.

This required some further data processing to:

  • transform flow_matrix into its long format

  • add parameters to guide the position of links, sums1 and sums2

  • keep only the largest migration flows to improve readability.
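
The three steps above can be sketched as follows. This is an illustration under stated assumptions, not the project script: `flow_toy`, `flow_long_toy`, `sum1_toy` and `sum2_toy` are hypothetical names, the 75th percentile cut-off is arbitrary, and the starting positions are one choice consistent with the layout described earlier (immigrant arc first, emigrant arc after it).

```r
#toy flow matrix (hypothetical values): rows = origin, columns = destination
flow_toy = matrix(c( 0, 40, 2,
                     5,  0, 1,
                    30,  3, 0),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(c("A", "B", "C"), c("A", "B", "C")))

#long format: one row per origin-destination pair
flow_long_toy = as.data.frame(as.table(flow_toy), stringsAsFactors = FALSE)
names(flow_long_toy) = c("subregion_from", "subregion_to", "number")

#keep only the largest flows (here: at or above the 75th percentile)
cutoff = quantile(flow_long_toy$number, 0.75)
flow_long_toy = flow_long_toy[flow_long_toy$number >= cutoff, ]

#starting positions of links within each sector
sum1_toy = colSums(flow_toy)       #outgoing links start where the immigrant arc ends
sum2_toy = rep(0, nrow(flow_toy))  #incoming links start at the beginning of the sector

flow_long_toy
```

Filtering on the long table rather than on the matrix keeps the matrix intact for computing sector sizes while reducing the number of links drawn.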

The links were plotted with the following code:

#plot links for each combination of subregions
for(k in seq_len(nrow(flow_matrix_long))){ #for each row in the long-format flow matrix
  i = match(flow_matrix_long$subregion_from[k],
            subregion_details$subregion) #get plotting details for subregion of origin
  j = match(flow_matrix_long$subregion_to[k],
            subregion_details$subregion) #get plotting details for destination subregion
  
  circos.link(sector.index1 = subregion_details$subregion[i], #sector of origin, looked up via index i
              point1 = c(subregion_details$sum1[i], 
                         subregion_details$sum1[i] + abs(flow_matrix[i, j])), #starting point of link
              
              sector.index2 = subregion_details$subregion[j], #destination sector, looked up via index j
              point2 = c(subregion_details$sum2[j], 
                       subregion_details$sum2[j] + abs(flow_matrix[i, j])), #endpoint of link
              
              border = subregion_details$lcol[i],
              col = subregion_details$lcol[i] #use the more transparent colour to increase visibility
  )
  
  #update sum1 and sum2 to move along the circle into the next sector
  subregion_details$sum1[i] = subregion_details$sum1[i] + abs(flow_matrix[i, j]) 
  subregion_details$sum2[j] = subregion_details$sum2[j] + abs(flow_matrix[i, j])
}

The result is this plot, which is linked here as a PNG file because R output in this package looks horrendous.

Notice that:

  • the colour of a link indicates its origin (e.g., an orange link ending in Western Europe signifies immigration from Western Asia)

  • each flow is drawn as a link that starts in the emigrants section of a subregion (the second coloured arc inwards) and ends in the immigrants section of another, or the same, subregion (the white arc continuing from the emigrant arc)

    • e.g., notice the dark blue link starting in the emigrant section of Western Europe and ending in the immigrant section of Western Europe (i.e., most emigrants move to countries in the same subregion)

  • the width of a link gives the relative number of migrants moving from one subregion to another

    • e.g., many more people move from Eastern Europe to Western Europe than to Northern Europe

Summary

Some insights revealed by the plot include:

Reflection

If I had more time to spend on this project:

Resources used

To complete this project, I used several resources:

You can also find my repository here.