A data driven word-breaker incorporating the way text is composed on the web (web-ngram.research.microsoft.com)
1 point by shriphani 15 years ago