To which geographical regions do foreign-trained doctors migrate?
From which geographical regions do foreign-trained doctors emigrate?
Are there any distinguishable patterns? For example:
I retrieved the data from the OECD website, where I downloaded 2018 migration information for all OECD countries. I then retrieved a data set linking countries with geographical subregions from GitHub.
library(readr)
library(here)
wf_mig = read_csv(here("data-raw/workforce-migration.csv")) #migration data
sub_reg = read_csv(here("data-raw/subregions.csv")) #subregions data
The workforce migration data set includes the number of foreign-trained doctors who, in 2018, were registered or in the process of gaining registration to practise in a country other than the one in which they had obtained their medical qualifications. This includes medical interns and residents.
head(wf_mig)
## # A tibble: 6 x 11
## COU Country VAR Variable CO2 `Country of ori~ YEA Year Value
## <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 CAN Canada FTDS Foreign-traine~ AFG Afghanistan 2018 2018 5
## 2 FRA France FTDS Foreign-traine~ AFG Afghanistan 2018 2018 12
## 3 DEU Germany FTDS Foreign-traine~ AFG Afghanistan 2018 2018 153
## 4 NZL New Zeal~ FTDS Foreign-traine~ AFG Afghanistan 2018 2018 0
## 5 NOR Norway FTDS Foreign-traine~ AFG Afghanistan 2018 2018 22
## 6 CHE Switzerl~ FTDS Foreign-traine~ AFG Afghanistan 2018 2018 6
## # ... with 2 more variables: Flag Codes <lgl>, Flags <lgl>
The subregions data set links countries with their respective geographical regions and subregions. I focused on subregions in my analysis because their number is more manageable than that of countries (far too many!) or regions (far too few!), which will enhance the readability of the plot.
head(sub_reg)
## # A tibble: 6 x 11
## name `alpha-2` `alpha-3` `country-code` `iso_3166-2` region `sub-region`
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Afghanis~ AF AFG 004 ISO 3166-2:~ Asia Southern Asia
## 2 Åland Is~ AX ALA 248 ISO 3166-2:~ Europe Northern Eur~
## 3 Albania AL ALB 008 ISO 3166-2:~ Europe Southern Eur~
## 4 Algeria DZ DZA 012 ISO 3166-2:~ Africa Northern Afr~
## 5 American~ AS ASM 016 ISO 3166-2:~ Ocean~ Polynesia
## 6 Andorra AD AND 020 ISO 3166-2:~ Europe Southern Eur~
## # ... with 4 more variables: intermediate-region <chr>, region-code <chr>,
## # sub-region-code <chr>, intermediate-region-code <chr>
You can see the full script here.
I joined wf_mig and sub_reg by ISO Alpha-3 codes. These are standardised country codes, and thus unique identifiers that function as joining keys. This allowed me to retrieve subregion-level information about the number of migrants.
There were several steps:
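I removed the columns that were not needed for the analysis and the rows that recorded domestic migration (same country of origin and destination).
I also gave the columns shorter, more descriptive names.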
library(dplyr)
#wf_mig - keep relevant columns and rename them
wf_mig = wf_mig[, c("COU", "Country", "CO2", "Country of origin", "Value")]
wf_mig = wf_mig %>%
  rename(code_to = "COU",
         country_to = "Country",
         code_from = "CO2",
         country_from = "Country of origin",
         number = "Value" #number of migrants
  )
#wf_mig - remove domestic migration
wf_mig = wf_mig %>%
  filter(country_to != country_from)
#sub_reg - keep relevant columns and rename them
sub_reg = sub_reg[, c("name", "alpha-3", "sub-region")]
sub_reg = sub_reg %>%
  rename(country = name,
         code = "alpha-3",
         subregion = "sub-region")
I allocated subregions to each country by joining wf_mig and sub_reg by country code. I also provided helpful names to distinguish between origin and destination subregions.
#join datasets based on country code
data = left_join(wf_mig, sub_reg,
                 by = c("code_to" = "code") #add subregions for destination countries
)
data = rename(data,
              subregion_to = subregion) #destination subregions
data = left_join(data,
                 sub_reg,
                 by = c("code_from" = "code") #add subregions to origin countries
)
data = rename(data,
              subregion_from = subregion) #subregions of origin
The new data set included country and subregion information.
sapply(data,
function(x) sum(is.na(x))
)
## code_to country_to code_from country_from number
## 0 0 0 0 0
## subregion_to subregion_from
## 0 0
head(data)
## # A tibble: 6 x 7
## code_to country_to code_from country_from number subregion_to subregion_from
## <chr> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 CAN Canada AFG Afghanistan 5 Northern Ame~ Southern Asia
## 2 FRA France AFG Afghanistan 12 Western Euro~ Southern Asia
## 3 DEU Germany AFG Afghanistan 153 Western Euro~ Southern Asia
## 4 NZL New Zealand AFG Afghanistan 0 Australia an~ Southern Asia
## 5 NOR Norway AFG Afghanistan 22 Northern Eur~ Southern Asia
## 6 CHE Switzerland AFG Afghanistan 6 Western Euro~ Southern Asia
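A quick way to confirm that every ISO Alpha-3 code will find a match is to compare the two sets of codes before joining. The sketch below uses base R and two small made-up data frames; the names wf_toy and sub_toy and the code "XXX" are illustrative and not part of the OECD data.

```r
# Toy stand-ins for wf_mig and sub_reg (illustrative values only)
wf_toy = data.frame(code_to   = c("CAN", "FRA", "DEU"),
                    code_from = c("AFG", "AFG", "XXX"), # "XXX" is a deliberately invalid code
                    number    = c(5, 12, 153),
                    stringsAsFactors = FALSE)
sub_toy = data.frame(code      = c("AFG", "CAN", "FRA", "DEU"),
                     subregion = c("Southern Asia", "Northern America",
                                   "Western Europe", "Western Europe"),
                     stringsAsFactors = FALSE)

# Codes present in the migration data but absent from the lookup table
unmatched_from = setdiff(wf_toy$code_from, sub_toy$code)
unmatched_to   = setdiff(wf_toy$code_to, sub_toy$code)

unmatched_from # origin codes that would receive an NA subregion after a left join
unmatched_to   # destination codes that would receive an NA subregion
```

An empty result for both vectors means the left joins cannot introduce missing subregions.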
The code for this work is here.
The visualization I planned followed the procedure of Sander et al. (2014) and required two objects:
flow_matrix - a matrix containing the number of migrants between all combinations of subregions (origin subregions as rows, destination subregions as columns)
subregion_details - a data frame showing plotting parameters (e.g., colour codes) for each subregion.
I created the subregions data frame, containing the total number of migrants per subregion of origin and destination, irrespective of country:
library(dplyr)
library(reshape2)
#get number of immigrants/emigrants at subregion level
subregions = data %>%
  group_by(subregion_from, subregion_to) %>%
  summarize(subregion_number = sum(number))
#convert subregions data frame into wide format
subregions = dcast(subregions,
                   subregion_from ~ subregion_to, #origin subregions as rows
                   value.var = "subregion_number" #number of migrants per subregion
)
#give rows subregion names to facilitate indexing
rownames(subregions) = subregions$subregion_from
head(subregions)
## subregion_from
## Australia and New Zealand Australia and New Zealand
## Central Asia Central Asia
## Eastern Asia Eastern Asia
## Eastern Europe Eastern Europe
## Latin America and the Caribbean Latin America and the Caribbean
## Melanesia Melanesia
## Australia and New Zealand Eastern Europe
## Australia and New Zealand 2689 1
## Central Asia 3 95
## Eastern Asia 756 32
## Eastern Europe 182 7343
## Latin America and the Caribbean 69 16
## Melanesia 67 NA
## Latin America and the Caribbean
## Australia and New Zealand 2
## Central Asia 2
## Eastern Asia 1
## Eastern Europe 113
## Latin America and the Caribbean 10674
## Melanesia NA
## Northern America Northern Europe
## Australia and New Zealand 697 941
## Central Asia 26 95
## Eastern Asia 527 336
## Eastern Europe 1699 11632
## Latin America and the Caribbean 2064 900
## Melanesia 4 5
## Southern Europe Western Asia Western Europe
## Australia and New Zealand 1 85 44
## Central Asia 2 69 442
## Eastern Asia NA NA 539
## Eastern Europe 92 10892 24244
## Latin America and the Caribbean 6 652 2159
## Melanesia NA NA 1
I initialized the flow matrix with all subregions as rows and columns which contained only 0s, treating rows as origin subregions and columns as destination subregions.
I updated the values in flow_matrix with the ones in subregions. This approach ensured that all possible combinations of subregions were present in flow_matrix, even if they were absent from the subregions data set. A missing combination indicates that no migration was recorded between those two subregions.
#find all subregions in the dataset
unique_subreg = unique(c(unique(data$subregion_to), unique(data$subregion_from)))
#update flow_matrix with values from subregions
for(i in unique_subreg) { #take each unique subregion
  for(j in unique_subreg) { #combine it with all subregions
    flow_matrix[i, j] = ifelse( #for each combination
      flow_matrix[i, j] != subregions[i, j] && #if subregions value is different from flow_matrix value
        !(is.na(subregions[i, j])), #providing subregions value is not missing
      subregions[i, j], #replace value in flow_matrix with subregions value
      flow_matrix[i, j] #otherwise keep 0 in flow_matrix
    )
  }
}
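The initialisation of the zero matrix is not shown above. As a minimal sketch, assuming the flows are available in long format, the matrix can be created with named rows and columns and then filled in one step by indexing with origin-destination name pairs. The objects flows_toy, unique_toy, and flow_toy are hypothetical stand-ins with made-up dimensions, not the objects from the full script.

```r
# Toy long-format flows (a small illustrative subset)
flows_toy = data.frame(
  subregion_from   = c("Eastern Asia", "Eastern Asia", "Southern Europe"),
  subregion_to     = c("Northern America", "Western Europe", "Western Europe"),
  subregion_number = c(527, 539, 14367),
  stringsAsFactors = FALSE
)

# Every subregion that appears as an origin or a destination
unique_toy = unique(c(flows_toy$subregion_to, flows_toy$subregion_from))

# Square matrix of zeros: origins as rows, destinations as columns
flow_toy = matrix(0,
                  nrow = length(unique_toy),
                  ncol = length(unique_toy),
                  dimnames = list(unique_toy, unique_toy))

# Overwrite only the observed origin-destination pairs; all other cells stay 0
idx = cbind(flows_toy$subregion_from, flows_toy$subregion_to)
flow_toy[idx] = flows_toy$subregion_number

flow_toy
```

Because only observed pairs are written, this variant never looks up a combination that does not exist in the source table.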
At the end, flow_matrix looked like this:
head(flow_matrix)
## Eastern Asia South-Eastern Asia Sub-Saharan Africa
## Eastern Asia 0 0 0
## South-Eastern Asia 0 0 0
## Sub-Saharan Africa 0 0 0
## Northern Africa 0 0 0
## Southern Europe 0 0 0
## Northern America 0 0 0
## Northern Africa Southern Europe Northern America
## Eastern Asia 0 0 527
## South-Eastern Asia 0 0 488
## Sub-Saharan Africa 0 0 3849
## Northern Africa 0 0 1895
## Southern Europe 0 956 459
## Northern America 0 1 1218
## Latin America and the Caribbean Western Asia
## Eastern Asia 1 0
## South-Eastern Asia 1 2
## Sub-Saharan Africa 1 125
## Northern Africa 0 228
## Southern Europe 128 1570
## Northern America 8 534
## Australia and New Zealand Southern Asia Eastern Europe
## Eastern Asia 756 0 32
## South-Eastern Asia 1086 0 4
## Sub-Saharan Africa 2828 0 2
## Northern Africa 67 0 12
## Southern Europe 94 0 110
## Northern America 784 0 7
## Northern Europe Western Europe
## Eastern Asia 336 539
## South-Eastern Asia 819 571
## Sub-Saharan Africa 5931 2112
## Northern Africa 5622 8771
## Southern Europe 4800 14367
## Northern America 172 463
This data frame included colours for circle sectors, colours for circle links, and the total number of immigrants and emigrants in each subregion. Code for this section has been largely adapted from Sander et al. (2014).
I started by adding the number of emigrants, immigrants, and total migrants for each subregion to the newly created subregion_details data frame.
#Compute number of emigrants per subregion
df_from = data %>%
  group_by(subregion_from) %>%
  summarize(emig = sum(number))
#Compute number of immigrants per subregion
df_to = data %>%
  group_by(subregion_to) %>%
  summarize(immig = sum(number))
#create subregion_details data frame with info about total migration flow
subregion_details = left_join(df_from,
                              df_to,
                              by = c("subregion_from" = "subregion_to")
)
##I am aware I could have done this by summing the rows and columns of `flow_matrix`
##but I wanted to do things this way so that I could compare the two outputs
##and hopefully find they are identical as a way to check my work (they were!).
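The step that computes the total is not shown here. A left join of this kind leaves a missing immigrant count for any subregion that sends doctors but receives none, so the missing values need to become zeros before summing. The sketch below is one way to do that in base R; details_toy is a hypothetical stand-in whose numbers are borrowed from the subregion_details output printed later in this post.

```r
# Toy result of the join; NA marks a subregion with no recorded immigrants
details_toy = data.frame(subregion = c("Eastern Asia", "Southern Europe", "Northern America"),
                         emig  = c(2191, 22484, 3187),
                         immig = c(NA, 1085, 24303),
                         stringsAsFactors = FALSE)

# Treat missing counts as zero before summing
details_toy$immig[is.na(details_toy$immig)] = 0
details_toy$total = details_toy$emig + details_toy$immig

details_toy
```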
Because circular plots offer limited space, I wanted to eliminate subregions with few migrants from the data set, while still letting the user choose how many subregions to include.
In my case, I excluded subregions whose total number of migrants fell below the 20th percentile.
#eliminate subregions where the total number of migrants is below the given quantile
tiny_subreg = subset(subregion_details,
                     (total < quantile(total, 0.2) #user-defined quantile
                     ))
#remove tiny subregions from subregion_details
subregion_details = subregion_details[!(subregion_details$subregion %in% tiny_subreg$subregion), ]
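To make the cut-off a genuine user choice, the filter could be wrapped in a small function with the quantile as an argument. This is only a sketch: drop_small_subregions is a hypothetical helper, and details_toy holds made-up totals.

```r
# Hypothetical helper: drop subregions whose total falls below the q-th quantile
drop_small_subregions = function(details, q = 0.2) {
  cutoff = quantile(details$total, q)
  details[details$total >= cutoff, , drop = FALSE]
}

details_toy = data.frame(subregion = c("A", "B", "C", "D", "E"),
                         total = c(10, 200, 3000, 40000, 500000),
                         stringsAsFactors = FALSE)

kept = drop_small_subregions(details_toy, q = 0.2) # removes the smallest subregion
kept$subregion

all_kept = drop_small_subregions(details_toy, q = 0) # q = 0 keeps every subregion
nrow(all_kept)
```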
I then sorted this data set in ascending order by total (number of migrants = emigrants + immigrants), to plot subregions in the order of their total number of migrants later on.
I assigned a colour to each remaining subregion. This works automatically however many of the 17 subregions the user keeps, because the colours are drawn from a pool of 17:
#add rgb codes to each subregion
rgb_pool = c("255,0,0", #red
"0,255,0", #lime
"128,128,0", #olive
"148,0,211", #dark violet
"0,206,209", #dark turquoise
"255,0,255", #magenta
"128,0,0", #maroon
"255,99,71", #tomato
"0,128,0", #green
"0,0,255", #blue
"128,0,128", #purple
"0,128,128", #teal
"0,0,128", #navy
"250,128,144", #salmon
"100,149,237", #corn flower blue
"153,50,204", #dark orchid
"60,179,113" #medium sea green
#googled 17 rgb codes that enhance contrast; 17 = length(unique_subreg)
)
#select as many colours as needed depending on the amount of subregions included
subregion_details$rgb = rgb_pool[1:nrow(subregion_details)]
I then stored two versions of these colours in HEX format (one of which had increased transparency) for different elements of the graph.
#split rgb codes into 3 variables
n = nrow(subregion_details)
subregion_details = cbind(subregion_details,
                          #split codes and treat them as numbers
                          matrix(as.numeric(unlist(strsplit(subregion_details$rgb, split = ","))),
                                 nrow = n, byrow = TRUE #arrange them in a matrix
                          )
)
subregion_details = subregion_details %>%
  rename( #rename columns according to the colour index
    r = '1',
    g = '2',
    b = '3',
  )
#increase transparency and transform rgb into HEX codes
subregion_details$rcol = rgb(subregion_details$r,
                             subregion_details$g,
                             subregion_details$b,
                             maxColorValue = 255
)
subregion_details$lcol = rgb(subregion_details$r,
                             subregion_details$g,
                             subregion_details$b,
                             alpha = 200, #transparency index
                             maxColorValue = 255
)
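For a single colour, the conversion can be checked by hand. The short sketch below takes the first entry of the pool (red) and produces the two HEX codes; the variable names parts, rcol, and lcol are only used for this illustration.

```r
# Split one "r,g,b" string into numeric components
parts = as.numeric(strsplit("255,0,0", split = ",")[[1]])

# Opaque colour (used for the sectors)
rcol = rgb(parts[1], parts[2], parts[3], maxColorValue = 255)

# Same colour with alpha = 200 out of 255 (used for the links)
lcol = rgb(parts[1], parts[2], parts[3], alpha = 200, maxColorValue = 255)

rcol # "#FF0000"
lcol # "#FF0000C8"
```

The trailing C8 is the alpha value 200 written in hexadecimal.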
I also ordered rows in subregion_details in ascending order by the total number of migrants to facilitate indexing later on.
Finally, I added the xmin = 0 and xmax columns, which will demarcate axis limits in the plot (from 0 to total number of migrants) for each subregion.
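As the code for these two steps is not shown, here is a minimal base R sketch of what they could look like. details_toy is a hypothetical stand-in that reuses three totals from the output below, deliberately given in the wrong order.

```r
details_toy = data.frame(subregion = c("Northern America", "Eastern Asia", "Southern Europe"),
                         total = c(27490, 2191, 23569),
                         stringsAsFactors = FALSE)

# Sort rows by total, smallest first
details_toy = details_toy[order(details_toy$total), ]
rownames(details_toy) = NULL # reset row names after reordering

# Each sector's axis runs from 0 to that subregion's total
details_toy$xmin = 0
details_toy$xmax = details_toy$total

details_toy
```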
At the end, the subregion_details data frame looked like this:
head(subregion_details)
## subregion emig immig total order rgb r g b rcol
## 1 Eastern Asia 2191 0 2191 5 255,0,0 255 0 0 #FF0000
## 2 South-Eastern Asia 2971 0 2971 6 0,255,0 0 255 0 #00FF00
## 3 Sub-Saharan Africa 14848 0 14848 7 128,128,0 128 128 0 #808000
## 4 Northern Africa 16595 0 16595 8 148,0,211 148 0 211 #9400D3
## 5 Southern Europe 22484 1085 23569 9 0,206,209 0 206 209 #00CED1
## 6 Northern America 3187 24303 27490 10 255,0,255 255 0 255 #FF00FF
## lcol xmin xmax
## 1 #FF0000C8 0 2191
## 2 #00FF00C8 0 2971
## 3 #808000C8 0 14848
## 4 #9400D3C8 0 16595
## 5 #00CED1C8 0 23569
## 6 #FF00FFC8 0 27490
The data visualization I have selected is a circular plot diagram. You can find the code here.
Essentially, this plot draws tracks on a circle and splits them into sectors to reflect differences between subregions in migrant numbers, while separating immigrants and emigrants. It also plots links to illustrate the flow of migrants between subregions.
Readers who are interested in the inner workings of the code should consult Sander et al. (2014) and Gu (2020). I used both of these resources, but I prefer the latter because it offers a reasonably in-depth explanation of the circlize package.
I started by setting some plotting parameters related to the size of the circle, padding, sector gaps, etc.
suppressPackageStartupMessages(library(circlize))
circos.clear() #reset circular layout parameters
par(mar = c(0, 0, 0, 0)) #margin around chart
circos.par(cell.padding = c(0, 0, 0, 0),
track.margin = c(0, 0.1),
start.degree = 45, #start plotting at 2 o'clock
gap.degree = 2, #gap between circle sectors
points.overflow.warning = FALSE,
canvas.xlim = c(-1, 1), #size of circle
canvas.ylim = c(-1, 1) #size of circle
)
I then initialized the layout to allocate subregions into sectors whose sizes are bounded by xmin and xmax. This approach ensures the relative size of sectors is in keeping with the relative number of migrants for each subregion.
circos.initialize(factors = subregion_details$subregion, #allocate sectors on circle to subregions
                  xlim = cbind(subregion_details$xmin,
                               subregion_details$xmax) #set limits of the x axis for each sector between 0 and total migrant numbers = xmax
)
The next step involved creating a plotting region to which I added graphics. This first track is thus split into sectors reflecting the total amount of migrants per subregion.
circos.initialize(factors = subregion_details$subregion, #allocate sectors on circle to subregions
                  xlim = cbind(subregion_details$xmin,
                               subregion_details$xmax) #set limits of the x axis for each sector between 0 and total = xmax
)
circos.trackPlotRegion(ylim = c(0, 1), #y-axis limits for each sector
                       factors = subregion_details$subregion,
                       track.height = 0.1,
                       panel.fun = function(x, y) { #for each new cell (i.e., intersection between sector and track)
                         name = get.cell.meta.data("sector.index") #retrieve cell meta data
                         i = get.cell.meta.data("sector.numeric.index")
                         xlim = get.cell.meta.data("xlim")
                         ylim = get.cell.meta.data("ylim")
                         #plot a sector for each subregion
                         circos.rect(xleft = xlim[1],
                                     ybottom = ylim[1],
                                     xright = xlim[2],
                                     ytop = ylim[2],
                                     col = subregion_details$rcol[i], #use less transparent colours
                                     border = subregion_details$rcol[i]
                         )
                       } )
I next added another track where:
This was achieved by including the following code inside the panel.fun() function in the previous code chunk.
#distinguish between immigrants and emigrants in each subregion
circos.rect(xleft = xlim[1],
ybottom = ylim[1],
xright = xlim[2] - rowSums(flow_matrix)[i], #i.e., total - emigrants
ytop = ylim[1] + 0.3,
col = "white",
border = "white"
)
#add a white contour to separate the previous two rectangles
circos.rect(xleft = xlim[1],
ybottom = 0.3,
xright = xlim[2],
ytop = 0.32,
col = "white",
border = "white"
)
Next, I included links between origin subregions and destination subregions, in order to show migration patterns, and an axis to give an indication of actual migrant numbers. I also fixed the positioning of the text in relation to the plot and the wider page.
This required some further data processing to:
transform flow_matrix into its long format
add parameters to guide the position of links, sum1 and sum2
keep only the largest migration flows to improve readability.
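The code for these three steps is not shown, so the following is only a sketch of how they might be done in base R on a made-up 3 x 3 matrix. The names flow_toy and flow_long_toy, the 60th-percentile cut-off, and the starting values of sum1 and sum2 are all assumptions; the starting values rely on the layout described in this post, where the immigrant section sits at the start of each sector.

```r
# Toy flow matrix: origins as rows, destinations as columns (illustrative values)
regions = c("A", "B", "C")
flow_toy = matrix(c( 0, 50,  5,
                    20,  0, 10,
                     1, 80,  0),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(subregion_from = regions, subregion_to = regions))

# Long format: one row per origin-destination pair
flow_long_toy = as.data.frame(as.table(flow_toy),
                              responseName = "number",
                              stringsAsFactors = FALSE)

# Keep only the largest flows (above the 60th percentile, an arbitrary cut-off)
flow_long_toy = flow_long_toy[flow_long_toy$number > quantile(flow_long_toy$number, 0.6), ]

# Put the biggest flows first so they are drawn first
flow_long_toy = flow_long_toy[order(flow_long_toy$number, decreasing = TRUE), ]

# Running positions for the links within each sector
sum1 = colSums(flow_toy)        # outgoing links start where the immigrant section ends
sum2 = numeric(length(regions)) # incoming links start at 0

flow_long_toy
```

Each time a link is drawn, sum1 for the origin and sum2 for the destination are advanced by the size of the flow, so consecutive links sit side by side instead of overlapping.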
The links were plotted with the following code:
#plot links for each combination of regions
for(k in 1:nrow(flow_matrix_long)){ #for each row in the flow matrix
  i = match(flow_matrix_long$subregion_from[k],
            subregion_details$subregion) #get plotting details for subregion of origin
  j = match(flow_matrix_long$subregion_to[k],
            subregion_details$subregion) #get plotting details for destination subregion
  circos.link(sector.index1 = subregion_details$subregion[i], #need to identify indices to identify
              point1 = c(subregion_details$sum1[i],
                         subregion_details$sum1[i] + abs(flow_matrix[i, j])), #starting point of link
              sector.index2 = subregion_details$subregion[j],
              point2 = c(subregion_details$sum2[j],
                         subregion_details$sum2[j] + abs(flow_matrix[i, j])), #endpoint of link
              border = subregion_details$lcol[i],
              col = subregion_details$lcol[i] #use the more transparent colour to increase visibility
  )
  #update sum1 and sum2 to move along the circle into the next sector
  subregion_details$sum1[i] = subregion_details$sum1[i] + abs(flow_matrix[i, j])
  subregion_details$sum2[j] = subregion_details$sum2[j] + abs(flow_matrix[i, j])
}
As a result, I produced this plot - which is linked here as a png file because R output in this package looks horrendous.
Notice that:
the colour of a link suggests its origin (e.g., orange link ending in Western Europe signifies immigration from Western Asia)
immigration is denoted by links starting in the emigrants section of a subregion (the second coloured arc inwards), and ending up in the immigrants section of another (or the same) subregion (the white arc continuing from the emigrant arc)
the size of the link gives the relative amount of migrants moving from one subregion to another
Some insights revealed by the plot include:
Western Europe and Northern Europe have the largest number of immigrants.
Western Europe has the most migrants, closely followed by Northern Europe and Eastern Europe.
Whilst Western and Northern Europe have relatively low proportions of emigrants and high proportions of immigrants, this is completely the opposite in Eastern Europe.
Most doctors in Latin America and the Caribbean emigrate to countries within the same region.
The migration flow in Southern Asia consists exclusively of emigrants, most of whom tend to go to Northern Europe, with far fewer going to Australia and New Zealand or Northern America.
Immigrants from Southern Asia make up a sizeable chunk of the total amount of immigrants in Northern Europe.
If I had more time to spend on this project:
I would fix the issue that causes some links to go ever so slightly beyond the borders for emigrants - yes, Northern Africa, I’m looking at you! I am not sure what the reason for this is, but the circlize package still receives updates, so I will keep an eye on it.
I would figure out how to automatically adjust the distance between text and plot depending on the length of the string whilst avoiding hardcoding string length values.
I would extend this project beyond this plot to understand why healthcare workers migrate to specific subregions - Is it better pay? Is it the chance to work in specific areas of medicine?
I would allow the user to choose what migration flows to exclude - the plot becomes difficult to interpret when too many links are present, so having the chance to plot all links corresponding to one subregion or combination of subregions incrementally and then remove them as needed could enhance interpretability and give a more nuanced insight into migration patterns. This could be done in Shiny, but the circlize package can be beautifully integrated with JavaScript as well to make this possible.
To complete this project, I used several resources:
Sander et al. (2014) - the minds behind adapting circular plots to explore migration flows
Gu (2020) - reasonably in-depth resource for the circlize package
stackoverflow.com - I am grateful to random skilled strangers for helping me learn new ways of doing things
Xie et al. (2021) - fantastic resource for tackling RMarkdown issues
You can also find my repository here.