PhD in Data Science 2022 – Quantum Leap Africa

About PhD in Data Science 2022

Program Description

The field of data science is emerging as a critical discipline with high relevance to economic growth and development. This doctoral training program established by AIMS will provide emerging African scientists with the opportunity to conduct research at the forefront of data science and to work towards a PhD degree within a high-quality training program in Africa, in cooperation with institutions internationally.

The program will focus on theoretical foundations of data science as well as applications of data science to improve the daily lives of Africans. It is built on the understanding that modern approaches in data science require a combination of expertise spanning the areas of mathematics, statistics, computer science, and the applied sciences.

AIMS will be offering up to seven fully-funded PhD positions in this prestigious new doctoral program. The recruited students will be based in Rwanda at AIMS Rwanda, or any of the other AIMS centers, in partnership with universities and research institutions across Africa and globally. The program aims to train future change-makers, who will have an impact across academia, industry, education, and government.

Candidates can choose from a list of proposed research topics, and AIMS will assist in building a supervision team around these topics. Alternatively, candidates can suggest their own research topics, together with a proposed supervision team. Selected students will start in October 2022.

Eligibility criteria

Master’s degree (completed by Sept 2022) in mathematics, statistics, computer science, engineering, physics or other relevant fields;
Sufficient theoretical foundations evidenced by prior work (courses/thesis/other training);
Qualification for pursuing research on the chosen topic, including relevant programming expertise;
Research potential evidenced by academic performance and involvement in relevant academic activities;
Motivation for pursuing a PhD by research in the suggested topic;
Being an African national.

Summary

Length of program: 4 years
Fully funded (stipend, equipment, health insurance, relocation costs, conference attendance, direct cost to graduating institution such as tuition fees and registration fees)
International supervision teams from well-known research institutions
Research topics that push the boundaries of data science (focused on AI/ML and/or health)
Program start: Oct 2022

Application Information

Selection Process

The selection process is competitive, and conducted in two phases:

Phase 1:

Deadline to apply (Phase 1): 13 June 2022

You will be asked to submit the following documents:

Application Form
CV (you can use your own format, but please make sure to cover at least all items mentioned in this template that apply to your case)
Transcripts (please submit both your undergraduate and your master's-level transcripts; additional transcripts can be submitted if relevant)

As part of the application form, you will be able to

Select or suggest a research topic
Name two academics (ideally senior researchers) who are willing to write a letter of support
Indicate whether you are already admitted to a university or have other plans that you would like to combine with this program

Additionally, the application form will allow you to tell us more about your research interests, motivation for pursuing a PhD and future plans. Your answers to the following questions will be central to the selection process. We advise that you prepare your answers offline with care.

What is your motivation to pursue a PhD? Here, you can also mention plans for your future career. (1500 characters)
Which research directions are you most interested in and why? Justify why you are qualified to pursue research in this area. Here, you can also comment on your reason for choosing the research topics selected above (2500 characters).

As part of this application process, it is our goal to support your individual situation and work together to design a PhD program that fits your circumstances.

Remarks on submitting your application

You will be asked to select one main research topic (or suggest your own); there is the possibility to indicate a second topic choice (or suggest your own).
If you propose your own topic, make sure you prepare the topic description with care (max 2000 characters). Have a look at the proposed TOPIC DESCRIPTION as guidance for style and detail. Most importantly, the topic description should explain what exactly you will do as part of your PhD, including the methods used, the technical skills you need, and, if data is needed, which data sets you are planning to use and how you will get access to them. From your description, it should be possible for us to assess if the proposed research topic is
1. novel and interesting to the international research community,
2. realistically achievable in the context of a PhD,
3. matching with your skill background and with the skill background of your proposed supervisors,
4. equipping you with the expertise to proceed into a successful future career,
5. fitting to the scope of our PhD program.
Make sure that you satisfy the required background and skills when selecting or proposing a research topic.
Arrangements are flexible, and we can work with selected candidates to adapt to individual circumstances.
No reference letters are required in Phase 1 of the application process. Please notify your referees that you are submitting their names as part of this application, as they may be contacted for letters or additional information as part of the selection process.
Make sure that your CV covers as a minimum all items provided in this template (if they apply to your situation).

Apply Online Now

Phase 2:

A small number of shortlisted candidates will be invited to Phase 2 of the selection process. In Phase 2, supervision teams are formed and applicants discuss their research plans in more detail with potential supervisors. As part of Phase 2, applicants are asked to submit a detailed research proposal.

Women are encouraged to apply.

Questions should be directed to applications@quantumleapafrica.org

The new Doctoral Training Program in Data Science (DTP-DS) is established by Quantum Leap Africa (QLA) at the African Institute for Mathematical Sciences (AIMS) Rwanda in collaboration with top researchers from across the globe. Here, you can find more information on the following:

Ph.D. streams
Enrollment & Graduation
Supervision
Research Topics
Training Components

Ph.D. streams
The second cohort of the DTP-DS (start 2022) will offer two streams:

General Data Science,
Data Science for Health in Rwanda.

General Data Science: We have several Ph.D. positions available for topics in mathematics, statistics, computer science, and the applied sciences that broadly fit under the umbrella of data science. These QLA positions are funded in partnership with the African Institute for Mathematical Sciences, The Government of Rwanda, The Carnegie Corporation of New York, and DeepMind Technologies Limited. Two (2) fully funded Ph.D. positions offered in partnership with DeepMind Technologies Limited place a particular focus on data science topics in the fields of artificial intelligence and machine learning.

Data Science for Health in Rwanda: Four (4) fully funded PhD positions are offered in partnership with University of Rwanda (UR, Rwanda), AIMS Rwanda (Rwanda) and Washington University in St. Louis (State of Missouri, United States of America), as part of the grant Harnessing Data Science for Health Discovery and Innovation in Africa Research Training Program. This program focuses on the following three major scientific areas:

computer science & informatics
statistics & mathematics
biomedical sciences & public health

A high prevalence of communicable diseases, coupled with a rapidly expanding epidemic of non-communicable diseases (NCDs), forecasts a perfect public health storm in Rwanda, providing impetus and rationale to leverage data science to address health care gaps. A structured program design will help develop trainee research careers in data science with particular focus on health care topics relevant to Rwanda, including communicable diseases (e.g., HIV, malaria, COVID-19) and chronic NCDs (e.g., hypertension, heart disease, diabetes).

Enrollment & Graduation

General Data Science: Candidates who are accepted into the program will be enrolled in two institutions:

One of the five AIMS Centers of Excellence (Rwanda, Ghana, Cameroon, Senegal, South Africa)
A higher education institution (generally in Africa) partnering with AIMS

Candidates will need to be ordinarily resident in an African country and satisfy the degree requirements for a PhD in Research of their graduating institution, as well as the program requirements of the AIMS DTP-DS. The PhD degree will be conferred by the partnering institution upon successful completion; QLA facilitates the creation of international co-supervision partnerships, provides funding through global partnerships, and offers additional training in research skills and transferable skills to the PhD candidates.

Data Science for Health in Rwanda: This stream is restricted to Rwandan nationals. Successful candidates will be enrolled at University of Rwanda and work on topics relevant to Rwanda.

Supervision

General Data Science: Candidates are mentored by a supervision team of 2-4 supervisors, forming a partnership between AIMS and higher education institutions in Africa and internationally. Each supervision team should consist of at least one supervisor affiliated with AIMS, and one supervisor affiliated with the graduating institution. These rules are flexible and details can be discussed and adjusted on a case-by-case basis.

Data Science for Health in Rwanda: Candidates are mentored by a supervision team where one supervisor is affiliated with the University of Rwanda. Additional supervisors can be affiliated with AIMS, Washington University in St. Louis or other institutions in Africa and internationally as suitable.

The supervision team will be formed during Phase 2 of the application process in communication with shortlisted candidates, the DTP-DS management board, and potential supervisors. Candidates have the possibility to suggest their own supervision team.

Research topics

Applicants can select from a list of research topics suggested by leading researchers in their field. Alternatively, applicants are welcome to suggest their own research topics. Shortlisted candidates will be put in touch with the supervision teams that proposed their selected topics for discussions on more concrete research ideas in Phase 2 of the application process. For more details on the selection of research topics, see Research topics.

Training Components

All candidates are invited to participate in an intensive training school in the first year of the program, organized by QLA. Here, candidates will acquire skills relevant to their research and broaden their subject knowledge in data science through a small number of intensive core courses taught by top international researchers.

The program plans to provide continuous training opportunities virtually and/or in person. Additional training components may include (but are not limited to):

Guided seminars and reading groups
Participation in transferable skills courses (academic writing, presentation skills, research methodology)
Group projects / mini dissertations
Designing and delivering a mini-course (senior PhD students)
3 Minute Thesis Competition (senior PhD students)
Tutoring in the AIMS structured master's program (senior PhD students)

PhD candidates are encouraged to pursue internships in industry or external institutions towards the end of their PhD in a field related to their research topic, depending on sufficient progress on their dissertation.

2022 RESEARCH TOPICS

To become a PhD student as part of this doctoral training program, you have two options:

Work on one of the research topics suggested by the program,
Suggest your own research topic.

Working on a proposed topic

For the first option, please consult the research topics posted below. Each research topic already comes with a supervision team. The top five applicants for each topic will be shortlisted and then put in touch with the supervision team directly. In the next step, supervision teams will select the candidates with whom they would like to move into Phase 2. In Phase 2, the supervision team and the selected shortlisted candidates prepare a detailed research proposal together. This will also give you the opportunity to discuss the precise focus of your chosen research topic with your supervision team, discuss whether you and your supervision team want to add further supervisors, and decide at which university you plan to register and graduate for the PhD.

When shortlisting candidates for the proposed topics, we will evaluate how well your academic background and skills fit the topic you are applying to. Please read the description and required background of the topic you are applying to in detail, and make sure your application matches the research direction.

Suggesting your own research topic

To suggest your own research topic, simply select ‘Other’ in the application form. A box will open where you can specify a title and a topic description as well as your supervisors. You will also have to indicate which stream your topic best fits.

When suggesting your own research topic, make sure that you have put detailed thought into the research direction you are suggesting. In the end, this is the problem you may spend 3-4 years working on full-time. Your research plans should be advanced enough that it is clear that your research can start immediately and has a high chance of success. Prepare your topic description with care; note that the application form is restricted to max 2000 characters for the topic description. Most importantly, the topic description should explain what exactly you will do as part of your PhD, including the methods used, the technical skills you need, what exactly the novel research contribution is, and, if data is needed, which data sets you are planning to use and how you will get access to them. Be precise and to the point; do not spend too many characters on general introduction and motivation. From your description, it should be possible for us to assess if the proposed research topic is

novel and interesting to the international research community,
realistically achievable in the context of a PhD,
matching with your skill background and with the skill background of your proposed supervisors,
equipping you with the expertise to proceed into a successful future career,
fitting to the scope of our PhD program.

When proposing your own topic, you will also have to build your own supervision team. Make sure your supervisors are aware you are planning to work with them and agree to supervise you before mentioning their names. If you are shortlisted, we may also be able to help you find additional supervisors if needed. At the end of Phase 2, your application will be evaluated as a package, including how well your supervisors and your background are suited to advance the research you propose overall.

Streams

For the 2022 cohort, we will have two streams:

General Data Science (G)
Data Science for Health in Rwanda (H)

Some topics fit in both streams. Proposed topics are listed below; the streams are indicated in brackets after each title.

Topics:

Accelerating multitask reinforcement learning with attention mechanisms (G)
Quantum Topological Data Analysis (G)
Physics Informed Learning Machines with application in climate science (G)
Federated learning handling EEG data spread in Africa and Europe (G,H)
How the human embryo develops: combining mathematical modelling and data science (G,H)
Data-Driven Modelling for Risk Evaluation and Early Detection of vector-borne disease outbreaks in Rwanda (H)
Mathematical model for predicting and analyzing the impact of air pollution on the cardiovascular-respiratory system in Rwanda (H)
Identification of new cervical cancer gene signatures and predictors of clinical outcome using machine learning techniques (G,H)

Topic description and required background:

Accelerating multitask reinforcement learning with attention mechanisms (G)

Reinforcement learning has recently been successful in behaviour learning in a variety of high-profile, complex tasks. Unfortunately, it is generally very sample inefficient, which limits how widely it can be used in real-world problems. Attention mechanisms such as transformers have recently provided significant benefits in other classes of temporal domains. We propose to leverage recent advances in attention to investigate whether a learning agent can learn to focus only on relevant features of a problem, thus greatly accelerating learning. In addition, we will explore the opportunities this provides for learning invariant properties and objects in an environment: knowledge which can be exploited to solve new problem instances.

Required Background

Multivariate calculus, linear algebra, optimization, basic familiarity with concepts in machine learning, strong Python programming experience, familiarity with reinforcement learning or attention mechanisms.

Quantum Topological Data Analysis (G)

Boundary operators are a key connectivity concept in mathematics, physics and data science. Calculations involving the operator on arbitrary data points generate high-dimensional interpretable summaries of the full data set related to its local and global “shape”. However, classical computational costs are exorbitant due to the underlying combinatorics. Recent work has shown that the restricted boundary operator may be implemented on a quantum computer in linear depth, exponentially faster than on classical computers. This PhD topic proposes to explore various algorithms and applications based on the boundary operator in a search for exponentially accelerated quantum advantage on real-world problems.

Required Background

MSc in a mathematical topic. Preferred: quantum physics/computing, topology, Bayesian inference, strong programming skills.

Physics Informed Learning Machines with application in climate science (G)

Physics Informed Learning Machines (PILM) are increasingly used to solve problems resulting from natural or engineering processes formulated as mathematical models, e.g. partial/ordinary differential equations. Due to the recent development of low-cost sensors that produce high spatial and temporal resolution data, source term estimation in differential equations has attracted more attention, especially within the air pollution modelling community. In this work, we will propose a new PILM approach that models the source function as a Gaussian process or a neural network, and approximates the solution of the associated differential equations with neural networks. Our approach will be applied to data from a network of low-cost sensors deployed in Rwanda.
For related references, see https://www.sciencedirect.com/science/article/pii/S0021999118307125 and https://arxiv.org/abs/2202.04589

Required Background

A Master’s degree in mathematical sciences; strong background in numerical analysis of PDEs, optimization, deep learning and inverse problems; strong programming skills in Python, especially TensorFlow.

Federated learning handling EEG data spread in Africa and Europe (G,H)

Our group is focused on neuroimaging, ranging from microscopy to MRI and inexpensive solutions such as EEG. The analysis of these data has always been conducted with machine learning or other data science tools (https://bam.sano.science/). In 2014, before the deep learning boom, the main supervisor conducted several projects with novel technology in rural areas of Ghana while teaching at AIMS Ghana. The most popular project is the prenatal care project called http://www.docmeup.org/. Now, we want to investigate brain disease on a large scale both in Europe and Africa, taking advantage of inexpensive EEG devices and cloud computing, with the goal of achieving a Global South and Global North federated learning approach. Federated learning (also known as collaborative learning) is a machine learning technique that trains an algorithm across multiple decentralized edge devices or servers holding local data samples, without exchanging them. The awareness and study of brain disorders have not been fully embraced in Rwanda, Uganda, and Africa at large due to factors such as the stigmatization associated with these disorders, a lack of adequate biomedical data on these disorders, and the small number of research experts, among others. It is reported that there is an increasing prevalence of neurological disorders in sub-Saharan Africa. This project will allow a researcher applying data and network science tools to have an impact in the medical domain, particularly on brain disorders. The ripple effect of this will be the passing on of knowledge and expertise to the next generation, thus strengthening a network of researchers in this interdisciplinary domain. The data are by nature “big”: we expect gigabytes of streamed data every day. Moreover, they are represented as time series. Therefore, a blend of machine/deep learning with Fourier, wavelet, and various spectrum analyses is expected. Functional temporal connectivity will lead to graph representations, so complex network approaches can also be considered.
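To make the federated learning idea above concrete, here is a minimal sketch of federated averaging, in which each site trains on its own data and only model parameters are shared. It is an illustration under stated assumptions, not the project's method: the synthetic linear-regression data stands in for EEG recordings, and all function names (`local_update`, `federated_average`, `make_client_data`, `run_fedavg`) are hypothetical.

```python
"""Federated averaging sketch on synthetic data; all names are illustrative."""
import random


def local_update(weights, data, lr=0.1, epochs=5):
    """Run a few epochs of gradient descent for y ~ w*x + b on one client's data."""
    w, b = weights
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return (w, b)


def federated_average(client_weights, client_sizes):
    """Average client models, weighting each by its number of local samples."""
    total = sum(client_sizes)
    w = sum(cw[0] * n for cw, n in zip(client_weights, client_sizes)) / total
    b = sum(cw[1] * n for cw, n in zip(client_weights, client_sizes)) / total
    return (w, b)


def make_client_data(rng, n, true_w=2.0, true_b=-1.0, noise=0.05):
    """Synthetic stand-in for one site's private data (assumption, not EEG)."""
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        data.append((x, true_w * x + true_b + rng.gauss(0.0, noise)))
    return data


def run_fedavg(rounds=50, seed=0):
    """Simulate a server coordinating three clients; raw data never leaves a client."""
    rng = random.Random(seed)
    clients = [make_client_data(rng, n) for n in (40, 60, 100)]
    global_weights = (0.0, 0.0)
    for _ in range(rounds):
        # Each client starts from the current global model and trains locally.
        updates = [local_update(global_weights, data) for data in clients]
        # Only the resulting parameters are sent back and combined.
        global_weights = federated_average(updates, [len(d) for d in clients])
    return global_weights


if __name__ == "__main__":
    w, b = run_fedavg()
    print(f"global model: w={w:.3f}, b={b:.3f}")
```

In this sketch the server only ever sees the pairs `(w, b)` returned by each client, which is the property that makes the approach attractive when medical data cannot be pooled across countries.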

Required Background

Strong Python skills, good machine learning background (with practical experience, particularly with recurrent neural networks), knowledge of signal theory (frequency domain analysis, independent component analysis…), enthusiasm for hands-on, real-world circuit work and data acquisition with devices such as EEG.

How the human embryo develops: combining mathematical modelling and data science (G,H)

How the early human embryo develops from a single fertilised egg to a multicellular structure is still poorly understood. Studying this has traditionally been difficult due to ethical and logistical issues working with human embryos. However, in 2021, a cellular model system was developed at the University of Exeter that makes investigating embryogenesis much easier. Based on data from this model system, this project will study mathematical models of embryo development. This will involve a combination of computer simulation, model design, mathematical analysis, parameter fitting and image analysis. Excitingly, this project has potential applications to improving IVF treatment, by suggesting improved ways that the best embryos can be selected for implantation.

Required Background

This PhD requires strong mathematical and computational skills, including mathematical modelling, data analysis, simulation (in a language like MATLAB, Python or C++) and analysis. No prior knowledge of biology or image analysis is required; training in these will be provided during the project.

Data-Driven Modelling for Risk Evaluation and Early Detection of vector-borne disease outbreaks in Rwanda (H)

Traditional public health surveillance relies heavily on statistical techniques. Recent years have seen tremendous growth of AI-enabled methods, including but not limited to deep learning-based models, complementing statistical approaches. Many disease-causing organisms are strongly influenced by environmental factors such as temperature, rainfall, and humidity, which are in turn influenced by the prevailing climate. This project will integrate multiple types of data, such as environmental data, epidemiological data, news reports, and search data, and develop novel mathematical, statistical, and big data techniques to a) evaluate the risk that the most common vector-borne diseases, such as Rift Valley Fever and Foot-and-mouth disease, pose to Rwanda; b) detect and give early warning of domestic spread from cases imported from neighboring countries; c) evaluate the risk of cases spreading from these to other districts in Rwanda; d) identify mitigating strategies; e) address the numerous remaining gaps in data for such systems; and f) develop an AI-based application for identifying mosquitoes associated with vector-borne diseases.
We will use data from Meteo Rwanda Institute, Rwanda Biomedical Center (RBC), Rwanda Agriculture and Animal Resources Development Board (RAB), and data from our ongoing cohort studies. Our main partners will be AIMS, University of Rwanda and Dalhousie University.

Required Background

Statistical analysis and computing, Machine Learning, Deep Learning, processing large data sets, Data Visualization, Data Wrangling, Mathematics and Programming skills.

Mathematical model for predicting and analysing the impact of air pollution on the cardiovascular-respiratory system in Rwanda (H)

The topic focuses mainly on developing a mathematical model for predicting and analyzing the impact of air pollution on the cardiovascular-respiratory system in Rwanda. It involves different datasets: air pollution data, epidemiological data, and data related to non-communicable diseases (NCDs). The Rwanda Environment Management Authority (REMA) and Rwanda Biomedical Center (RBC) are expected to provide these data, so that model parameters can be estimated using optimization methods. With the developed model, different mathematical methods, including statistical, deterministic, or stochastic approaches, can be used to analyze the impact of air pollution on the cardiovascular-respiratory system. Bayesian approaches such as Markov chain Monte Carlo algorithms, as well as deterministic approaches focusing on the basic reproduction number, can then be used to predict NCDs.

Required Background

The student should have a background in Mathematics (Applied Mathematics or Statistics).

Identification of new cervical cancer gene signatures and predictors of clinical outcome using machine learning techniques (G,H)

Cervical cancer is a highly preventable disease; therefore, early screening represents the most effective strategy to minimize the global burden of this disease. Genomic profile analysis has been successfully applied to deconvoluting the molecular profile of various cancers and in gene biomarker discovery. Genomic profiling is also highly compatible with machine learning methods such as Support Vector Machine, Random Forest and Convolutional Neural Network. Further, gene datasets of most cancers are publicly accessible via the US National Center for Biotechnology Information Gene Expression Omnibus database (https://www.ncbi.nlm.nih.gov/geo/) and the European Bioinformatics Institute ArrayExpress portal (https://www.ebi.ac.uk/arrayexpress/). We hypothesize that the integrative use of novel machine learning methods on cervical cancer gene datasets could lead to new gene signatures with high prognostic accuracy. Thus, our aims are:
To define new gene signatures of cervical cancer using published datasets accessible from the NCBI GEO and ArrayExpress, by applying machine learning approaches.
To test the identified gene signatures in cell lines and tumor samples of cervical cancer using molecular methods such as quantitative PCR, western blotting, proteomics.
To develop a predictive model that will suggest the possibility of cervical cancer in patients.

Required Background

Cervical Cancer, Machine Learning, Deep Learning