View the h-hausa Discussion Logs by month
The following project may be of interest. Note the questions Kevin has below concerning the form of the texts he finds for Hausa. I've only now had the time to follow up on this correspondence and have no ready answers, although the potential is interesting.

Don Osborn
Bisharat.net

----- Forwarded message from Kevin Patrick Scannell <scannell@slu.edu> -----
Date: Thu, 10 Jun 2004 11:13:59 -0500
From: Kevin Patrick Scannell <scannell@slu.edu>
Reply-To: "scannell@slu.edu" <scannell@slu.edu>
Subject: Hausa spell checking
To: "dzo@bisharat.net" <dzo@bisharat.net>

Dear Don,

I've been enjoying looking over your pages at bisharat.net. I'm hoping you might be interested in some web crawling software I've written that is able to target given languages -- I've been using it for the past few months to gather text corpora for about 150 languages, most of them "minority" languages or languages with limited resources from the point of view of technology. See

http://borel.slu.edu/crubadan/
http://borel.slu.edu/crubadan/stadas.html

I've been using statistical means (frequency counts, 3-grams, co-occurrence) to generate clean word lists from these corpora, hopefully to be used for spell checking and the like. I've had good luck with languages like Setswana, Tagalog, Kinyarwanda, etc. that are written essentially in ASCII.

For languages like Hausa, the software works with utf8 but of course much of the text it finds is transliterated into plain Latin. Is it hopeless to try to "normalize" these texts into utf-8 in some (semi-)automatic way so that I can create an accurate word list? Or should I be working with the small subset encoded "correctly"?

I'd appreciate any input you have on this! Unfortunately I have no particular expertise with African languages; I came to this through my work developing resources for the Celtic languages, mostly Irish.

Thanks
Kevin

----- End forwarded message -----

_____________________________________
Please send your question or comment about the USA in plain ASCII text to H-USA on-line editor John Philips: <philips@mail.h-net.msu.edu>
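
One way to picture the (semi-)automatic normalization Kevin asks about is sketched below. It is not his method or part of his crawler, only an illustration under stated assumptions: the "correctly" encoded subset is treated as a trusted token list, the hooked letters ɓ, ɗ, ƙ, ƴ are assumed to be written as plain b, d, k, y in the transliterated text (the digraph 'y is ignored for simplicity), and the function names and the 0.9 threshold are hypothetical. A plain-Latin word is restored only when the trusted text overwhelmingly agrees on a single spelling; ambiguous or unseen words are left untouched for manual review.

```python
from collections import Counter, defaultdict

# Hooked Hausa letters mapped to the plain Latin letters assumed to replace
# them in transliterated text (a simplifying assumption for this sketch).
FOLD = str.maketrans({
    "ɓ": "b", "ɗ": "d", "ƙ": "k", "ƴ": "y",
    "Ɓ": "B", "Ɗ": "D", "Ƙ": "K", "Ƴ": "Y",
})


def fold(word):
    """Reduce a word to its plain-Latin form."""
    return word.translate(FOLD)


def build_index(trusted_tokens):
    """Map each folded form to counts of the spellings seen in trusted text."""
    index = defaultdict(Counter)
    for tok in trusted_tokens:
        index[fold(tok).lower()][tok.lower()] += 1
    return index


def restore(word, index, min_share=0.9):
    """Return the dominant trusted spelling, or the word unchanged if the
    folded form is unseen or no spelling reaches min_share of the counts."""
    candidates = index.get(word.lower())
    if not candidates:
        return word
    best, count = candidates.most_common(1)[0]
    if count / sum(candidates.values()) < min_share:
        return word
    return best


# Toy trusted corpus: "ɗaya" is unambiguous, while "ƙasa" and "kasa"
# are distinct words that collide once folded.
trusted = ["ɗaya", "ɗaya", "ƙasa", "kasa", "baba"]
index = build_index(trusted)
restored = [restore(w, index) for w in ["daya", "kasa", "gida"]]
```

In this toy run "daya" is restored to "ɗaya", while "kasa" (ambiguous) and "gida" (unseen) pass through unchanged. Context, for example the 3-gram and co-occurrence statistics mentioned in the message, would be needed to decide the ambiguous cases.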