
ar-quran-hadith14books-MSA


# Dataset Card for the Quran and Hadith Dataset

## Dataset Summary

- A specialized Arabic dataset for training and evaluating speech recognition on the Quran and hadith, so that AI does not alter our sacred scriptures.
- Combines the full Quran with the [Magma'a el Zawa'ed book](https://app.turath.io/book/61) by [Nour eldin elhaithamy](https://tarajm.com/people/18414), which compiles 14 books of hadith (approximately 10,000 hadith without repetitions), plus existing datasets such as Common Voice, FLEURS, and MediaSpeech.
- The first dataset to include the full Quran together with Bukhari and Muslim hadith validation sets read by professional reciters, making it easy to evaluate speech recognition models on Quran and hadith.
- To avoid biasing toward the training split, the Quran and hadith reciters differ between training and validation; training also uses a combination of approximately 10,000 hadith from a different book than the one used for validation, to ensure no overfitting.
- The dataset is also suitable for everyday Arabic (MSA) transcription, not just Islamic content: to avoid overfitting or biasing to Quran/hadith only, it includes cleaned, validated splits of Common Voice 17, FLEURS, and 100 MediaSpeech audios, all following the dataset's normalization pattern of removing diacritics and numeric digits so that the text reflects how the Quran is written.
- The dataset is compatible with Whisper: the maximum duration of all audios is 30 seconds, so no audio is trimmed in the middle. While compatible with Whisper, it can also be used with any other existing or future model.
- This dataset should not be used for mathematical transcriptions, as numbers were deliberately written out as Arabic words in all splits to match how the holy Quran is written.
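The normalization pattern above (removing diacritics and numeric digits) can be sketched in Python. This is an illustrative approximation, not the exact script used for the dataset, and the function name is hypothetical:

```python
import unicodedata

def normalize_arabic(text: str) -> str:
    """Illustrative sketch of the card's normalization pattern:
    strip Arabic diacritics (harakat) and drop digit characters."""
    # Arabic diacritics are Unicode combining marks (category 'Mn').
    no_diacritics = "".join(c for c in text if unicodedata.category(c) != "Mn")
    # Drop digits, both Western (0-9) and Arabic-Indic.
    return "".join(c for c in no_diacritics if not c.isdigit())

print(normalize_arabic("بِسْمِ اللَّهِ"))  # -> بسم الله
```

Note that converting digits into spelled-out Arabic words (as the dataset does) is a separate, more involved step not shown here.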
- **Training hours:** 119 hours
- **Validation hours:** 44 hours, 47 minutes
- The validated Common Voice audios were concatenated to reduce disk usage, since Whisper pads every audio to 30 seconds. Concatenation also improved WER compared with the same audios left unconcatenated; a possible reason is discussed [here](https://github.com/openai/whisper/discussions/1118#discussioncomment-5586748).
- The audios were normalized and had silence removed using the ffmpeg tool.
- Each audio chunk was then padded with 0.1 seconds of silence at the start and end; combined with noise removal and ffmpeg normalization, this technique gave better results and reduced Whisper hallucinations.

## Dataset Details

### The dataset is composed of 8 splits

Training hours -> 119 hours:

- **train:** Quran and hadith -> 45 hours, 18 minutes.
  - **Full Quran:** reciter Sidi El-Menshawy -> 28 hours, 46 minutes.
  - **Hadith (14 books):** Magma'a el Zawa'ed by its author Sidi Nour eldin elhaithamy -> 16 hours, 32 minutes.
- **cm_voice17:** Mozilla's validated split, cleaned, concatenated, and normalized to follow the dataset pattern -> 63 hours, 13 minutes.
- **fleurs:** cleaned and normalized -> 7 hours, 28 minutes.
- **improve asr in quran:** 1 hour, 36 minutes. Created by checking the words whisper-large-v3 got wrong in the Quran and training on those words.
- **improve asr in hadith:** 1 hour, 21 minutes. Created by checking the words whisper-large-v3 got wrong in hadith and training on those words.
- **mediaspeech:** first 100 rows, cleaned and normalized -> 22 minutes.

Validation hours -> 44 hours, 47 minutes:

- **Validation:** full Quran, reciter Sidi Abdulbasit Abdulsamad -> 30 hours, 20 minutes.
- **Hadith validation:** Sahih Bukhari and Sahih Muslim from the book Game'a el Sahihen by its author Dr. Walid el Hemdan -> 10 hours, 27 minutes. [Audios of the hadith validation split](https://soundcloud.com/user-670914352/sets/qeopyed6polf).
- **mgb2 validation:** concatenated, filtered, and validated MGB-2 samples -> almost 4 hours of media data. The original data was taken from Belal Elhossany's non-overlapping MGB-2 [dataset](https://huggingface.co/datasets/BelalElhossany/mgb2_audios_transcriptions_non_overlap).

### Tools/websites used

* [Quran every ayah](https://everyayah.com/), which made it much easier to get every verse and its audio; verses longer than 30 seconds still had to be parsed into separate chunks, but the site helped a lot.
* ChatGPT for AI scripts, plus a little Claude 3.5 Sonnet.
* Kaggle's free 30 weekly GPU hours helped in validating and correcting the dataset by checking insertions/substitutions/deletions; Colab also offers almost 1.5 hours per day.
* Python for scripting.
* ffmpeg for reducing noise and normalizing ([check this link](https://github.com/openai/whisper/discussions/679#discussioncomment-4664510)).
* pydub for padding the audios and applying gain to low-amplitude audios.
* Silero VAD was tried, but it kept treating Quran recitations as below the audio threshold, which caused bad results, so it was removed.
* Audacity.
* A voice recorder.
* VLC media player.
* Not included in the dataset itself, but using the audiomentations library for augmentation during training definitely helps reduce overfitting.

### How to use

- [Loading audio datasets](https://huggingface.co/blog/audio-datasets)
- [Fine-tuning Whisper](https://huggingface.co/blog/fine-tune-whisper)
- The dataset can be used with any model and any training approach; it is not specific to Whisper.

### Credits

- All credit goes to Allah, the One who created everything, even our deeds.
- Prophet Mohamed, peace be upon him, who enlightened our lives; may Allah send all His blessings and salutations upon him, his family, and his companions.
- Abu Bakr, Omar, Othman, Ali, and all the companions who helped keep our sacred scriptures (Quran and hadith) safe from alteration, avoiding what happened to other holy books.
- Bukhari, Muslim, and all our Islamic scholars and imams who made a huge effort throughout history, up to the present day, to preserve the hadith.
- [Dr. Ali Gomaa](https://www.draligomaa.com/), my teacher and imam, for his teachings over the years about God and the Prophet, and for his efforts to prevent any changes AI might introduce to our sacred scriptures, the Quran and hadith.

### Dataset Description

- **Curated by:** [More Information Needed]
- **Funded by [optional]:** [More Information Needed]
- **Shared by [optional]:** [More Information Needed]
- **Language(s) (NLP):** [More Information Needed]
- **License:** [More Information Needed]

### Dataset Sources [optional]

- **Repository:** [More Information Needed]
- **Paper [optional]:** [More Information Needed]
- **Demo [optional]:** [More Information Needed]

## Uses

### Direct Use

[More Information Needed]

### Out-of-Scope Use

[More Information Needed]

## Dataset Structure

[More Information Needed]

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Data Collection and Processing

[More Information Needed]

#### Who are the source data producers?

[More Information Needed]

### Annotations [optional]

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

#### Personal and Sensitive Information

[More Information Needed]

## Bias, Risks, and Limitations

[More Information Needed]

### Recommendations

Users should be made aware of the risks, biases, and limitations of the dataset. More information is needed for further recommendations.
## Citation [optional]

**BibTeX:**

[More Information Needed]

**APA:**

[More Information Needed]

## Glossary [optional]

[More Information Needed]

## More Information [optional]

[More Information Needed]

## Dataset Card Authors [optional]

[More Information Needed]

## Dataset Card Contact

[More Information Needed]
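As a closing reference: the Dataset Summary describes padding each audio chunk with 0.1 seconds of silence at the start and end. The dataset itself used pydub for this, but the same idea can be sketched for PCM WAV files with Python's standard-library `wave` module; file paths and the function name here are illustrative, not from the dataset's actual scripts:

```python
import wave

def pad_wav(in_path: str, out_path: str, pad_seconds: float = 0.1) -> None:
    """Prepend and append `pad_seconds` of silence to a PCM WAV file
    (sketch of the padding step; the dataset used pydub instead)."""
    with wave.open(in_path, "rb") as src:
        params = src.getparams()
        frames = src.readframes(src.getnframes())
    # Silence is zero bytes: frames * bytes-per-sample * channels.
    n_pad_frames = int(params.framerate * pad_seconds)
    silence = b"\x00" * (n_pad_frames * params.sampwidth * params.nchannels)
    with wave.open(out_path, "wb") as dst:
        dst.setparams(params)  # nframes is corrected automatically on close
        dst.writeframes(silence + frames + silence)
```

For 16 kHz audio this adds 1,600 silent frames on each side of the clip.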
