edsa-data-collection

Git repo for the storage of code and documentation of the EDSA data collection projects.

Introduction

The world around us is changing rapidly. For South Africans to participate in the country's digital future, AI models must adapt to our indigenous languages.

Of the eleven official languages in South Africa, the main focus for this project will be South African English, as it differs from United States and United Kingdom English.

Objectives

  1. Collect over 10 000 hours of South African English audio (MP3 & MP4).
  2. Collect over 20 million sentences of South African English text (plain text files with CSV metadata).

Getting Started

Development branch: dev

  1. Clone the repository to your local machine:
git clone https://gitlab.com/strategicinsights/telkom-ai/edsa-data-collection.git
  1. Check out the development branch and pull the latest changes so that it is up to date. Branch names are case-sensitive, so use dev exactly as written.
git checkout dev
git pull
  1. Create your own working branch and switch to it. NB: you may name the branch after yourself; branch names will be changed later, at the merge stage.
git checkout -b Task_Name
  1. After making changes (even small ones), commit them:
git status
git add .
git commit -m "comment on changes"
git pull
  1. To push your changes to GitLab, you need to create a new password and generate and add an SSH key.

After creating a new password:

     1. Log in to GitLab.
     1. Click on the profile icon.
     1. Click on Settings.
     1. Click on SSH Keys.
     1. If you do not already have a key, click on the "generate one" hyperlink and follow the instructions.

Once the key has been added successfully, you will be able to push changes to GitLab.

  1. To push your changes to the online repository, use the command below. NB: replace Task_Name with the name of the branch you created.
git push --set-upstream origin Task_Name
  1. When the task is complete and you want to merge your work into the master branch, create a merge request (GitLab's equivalent of a pull request) for review in the GitLab web interface.
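
The SSH key mentioned in the push step above can also be generated from the command line instead of through the web instructions. The following is a minimal sketch, assuming OpenSSH's ssh-keygen is installed. The temporary directory, the key file name, and the e-mail comment are placeholders chosen for illustration, not project conventions.

```shell
set -eu

# Assumption: a scratch directory is used so the sketch never overwrites
# an existing key. A real key would normally live in ~/.ssh.
key_dir=$(mktemp -d)
key_file="$key_dir/id_ed25519_gitlab"   # hypothetical file name

# Generate an Ed25519 key pair.
#   -t  key type
#   -C  comment stored in the public key (placeholder address)
#   -f  output file for the private key (the public key gets a .pub suffix)
#   -N  passphrase (empty here for illustration only; prefer a real one)
#   -q  suppress progress output
ssh-keygen -q -t ed25519 -C "your.name@example.com" -f "$key_file" -N ""

# Print the public key. This is the text to paste into GitLab under SSH Keys.
cat "$key_file.pub"
```

Only the contents of the .pub file are shared; the private key stays on your machine. Note that Git uses the key only when the remote is addressed by an SSH URL rather than an HTTPS one.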

If you encounter difficulties, create an issue and one of the team members will assist.
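
The branch workflow described in Getting Started can be rehearsed offline before touching the real repository. The sketch below is an illustration under stated assumptions: a throwaway bare repository stands in for the GitLab remote, and the sandbox directory, the file names, and the commit identity are all hypothetical. Only the branch names dev and Task_Name come from the steps above.

```shell
set -eu

# Placeholder identity so commits work on a machine without git config.
export GIT_AUTHOR_NAME="Example User" GIT_AUTHOR_EMAIL="user@example.com"
export GIT_COMMITTER_NAME="Example User" GIT_COMMITTER_EMAIL="user@example.com"

sandbox=$(mktemp -d)   # hypothetical scratch directory
cd "$sandbox"

# Build a stand-in remote whose default branch is "dev".
git init --quiet seed
git -C seed symbolic-ref HEAD refs/heads/dev
echo "seed" > seed/README.txt
git -C seed add README.txt
git -C seed commit --quiet -m "Initial commit"
git clone --quiet --bare seed origin.git

# Clone, then check out and update the development branch.
git clone --quiet origin.git work
cd work
git checkout --quiet dev
git pull --quiet

# Create a working branch and switch to it.
git checkout --quiet -b Task_Name

# Make a change and commit it.
echo "sample change" > notes.txt
git status --short
git add .
git commit --quiet -m "comment on changes"

# Push the branch and record its upstream.
git push --quiet --set-upstream origin Task_Name

# List the branches the stand-in remote now knows about.
git ls-remote --heads origin
```

Against the real repository, the same commands apply from the clone step onwards, with the GitLab URL in place of origin.git.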