digitalmars.D - Natural language parsing (NLP) with D

Eliatto <arietto86 gmail.com> writes:
Hello! I am rather new to the D ecosystem (I am a C++ developer). I 
know that there are code-dlang and awesome-D collections of 
libraries. But I have not found any NLP libraries in D 
(https://github.com/jogojapan/drulex is not worth mentioning), 
though there are Go and Rust NLP libraries on github (they are 
new languages too). Why is this field unpopular among 
(D)evelopers?
What can be used for base POS tagging and NP chunking of English 
texts instead? I mean wrapping some C/C++ library without 
porting. Which one will cause minimal headache during glueing 
with D?
P.S. I think it would be nice to see a histogram of libraries built 
from the "awesome-D" list. For example, one bar would show the 
percentage of 3D engines (the number of such libs divided by the 
total number of awesome-D libs, multiplied by 100), another the 
percentage of logger libs...
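
The percentage described in the P.S. can be sketched in a few lines of D. This is only an illustration: `categoryShares` is a hypothetical helper name, and the category counts in `main` are made-up numbers, not data taken from the awesome-D list.

```d
import std.stdio : writefln;

// Share of each category in percent: count / total * 100.
// Hypothetical helper, not part of any existing tool.
double[string] categoryShares(int[string] counts)
{
    int total = 0;
    foreach (n; counts.byValue)
        total += n;

    double[string] shares;
    if (total == 0)
        return shares; // nothing to plot

    foreach (category, n; counts)
        shares[category] = 100.0 * n / total;
    return shares;
}

void main()
{
    // Made-up counts, for illustration only.
    auto shares = categoryShares(["3D engines": 5, "loggers": 15]);
    foreach (category, percent; shares)
        writefln("%s: %.1f%%", category, percent);
}
```

Each value is the height of one bar in the histogram.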
Oct 20 2015
Rikki Cattermole <alphaglosined gmail.com> writes:
On 21/10/15 1:01 AM, Eliatto wrote:
 What can be used for base POS tagging and NP chunking of English texts
 instead? I mean wrapping some C/C++ library without porting. Which one
 will cause minimal headache during glueing with D?
The only thing I could find on code.dlang.org was https://github.com/Herringway/natcmp
Not really what you want, I think.

In terms of binding C/C++ to D, you should be able to do it almost wholesale. You will need to create shims on the C++ side for certain features such as operator overloads, templates and, of course, object creation. At least as far as I know. There were some serious C++ interop improvements fairly recently (for DDMD), so somebody else will need to confirm this.

As for which C/C++ library you should base it on? No idea; what would you like to use?

Also, this would be better suited for D.learn.
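
A minimal sketch of the C route, for illustration only. D can call a C function once it has an `extern (C)` declaration for it, so a binding to a C library is mostly a list of such declarations plus thin wrappers. Here `strlen` from the C standard library stands in for a function of an NLP library, and `cLength` is a hypothetical wrapper name. A C++ class would first need a C-style shim as described above, which is not shown.

```d
import std.stdio : writeln;
import std.string : toStringz;

// Hand-written declaration of a function from the C standard library.
// A binding to a C tagger or chunker would consist of declarations like this.
extern (C) size_t strlen(const(char)* s) nothrow @nogc;

// Thin D-friendly wrapper (hypothetical name): converts a D string
// to a NUL-terminated C string before calling into C.
size_t cLength(string s)
{
    return strlen(s.toStringz);
}

void main()
{
    writeln(cLength("part-of-speech"));
}
```

The wrapper keeps C conventions (raw pointers, NUL termination) out of the rest of the D code.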
Oct 20 2015
ponce <contact gam3sfrommars.fr> writes:
On Tuesday, 20 October 2015 at 12:01:44 UTC, Eliatto wrote:
 Why is this field unpopular among (D)evelopers?
We aren't numerous, so nobody has tackled NLP problems so far (the same goes for many other domains). There is plenty of space to start domain-specific libraries. You could do it :)
Oct 20 2015
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 10/20/2015 08:01 AM, Eliatto wrote:
 What can be used for base POS tagging and NP chunking of English texts
 instead? I mean wrapping some C/C++ library without porting. Which one
 will cause minimal headache during glueing with D?
In my NLP days I remember the common procedure was to run taggers/chunkers/etc. as processes driven by scripts. That said, a library offers more options and it would be interesting to see one on code.dlang.org. -- Andrei
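
The process-driven workflow described above could look roughly like this from D, using `std.process` from the standard library. This is only a sketch: `runFilter` is a hypothetical helper, and the POSIX `tr` command stands in for a real tagger or chunker that reads text on stdin and writes annotated text on stdout. It writes all input before reading any output, so it suits small inputs only.

```d
import std.process : pipeProcess, Redirect, wait;
import std.stdio : writeln;

// Feed lines to an external program and collect what it prints.
// Hypothetical helper; a real tagger would replace the stand-in command.
string[] runFilter(string[] command, string[] inputLines)
{
    auto pipes = pipeProcess(command, Redirect.stdin | Redirect.stdout);

    foreach (line; inputLines)
        pipes.stdin.writeln(line);
    pipes.stdin.close(); // signal end of input to the child process

    string[] output;
    foreach (line; pipes.stdout.byLine)
        output ~= line.idup; // byLine reuses its buffer, so copy each line

    wait(pipes.pid); // reap the child process
    return output;
}

void main()
{
    // `tr` is only a placeholder for an external tagger.
    writeln(runFilter(["tr", "a-z", "A-Z"], ["the cat sat on the mat"]));
}
```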
Oct 20 2015
Chris <wendlec tcd.ie> writes:
On Tuesday, 20 October 2015 at 12:01:44 UTC, Eliatto wrote:
 Why is this field unpopular among (D)evelopers?
I work with NLP almost all the time and D is very well suited for it. It's mainly text-to-speech stuff, but I have a tiny POS tagger (or rather POS identifier) as well. D would be well suited for creating higher-level, simpler rule languages that linguists who have no clue about programming could easily use.

I've been thinking about this for a while now, and I wish I had the time to come up with something and implement it. I'm thinking of a suite that would cater for the various aspects of NLP, e.g. phonemic transcriptions, POS tagging, morphological and grammatical analysis, collocation etc. A one-stop shop for linguists. But, alas, time is scarce. If you have any ideas, please share.
Oct 20 2015
bachmeier <no spam.com> writes:
On Tuesday, 20 October 2015 at 12:01:44 UTC, Eliatto wrote:
 What can be used for base POS tagging and NP chunking of 
 English texts instead? I mean wrapping some C/C++ library 
 without porting. Which one will cause minimal headache during 
 glueing with D?
It's not my area, but are you thinking of something like Freeling?

http://nlp.lsi.upc.edu/freeling/

Asking for a friend. I think a C++ expert could get it to work with D with little difficulty, at least by creating C bindings, but I'm not a C++ expert and I failed.
Oct 20 2015
Chris <wendlec tcd.ie> writes:
On Tuesday, 20 October 2015 at 15:49:18 UTC, bachmeier wrote:
 It's not my area, but are you thinking of something like 
 Freeling?

 http://nlp.lsi.upc.edu/freeling/

 Asking for a friend. I think a C++ expert could get it to work 
 with D with little difficulty, at least by creating C bindings, 
 but I'm not a C++ expert and I failed.
Interesting, I heard of it a while ago. In D I have the following:

- Text tokenization: Yes.
- Sentence splitting: Yes.
- Morphological analysis: Yes.
- Suffix treatment [, retokenization of clitic pronouns]: Yes.
- Flexible multiword recognition: Yes.
- Contraction splitting: Depends on what they mean. But I can handle contractions like "l'ami".
- Probabilistic prediction of unknown word categories: No.
- Phonetic encoding: Transcription? If so, yes.
- SED-based search for similar words in dictionary: No.
- Named entity detection: No.
- Recognition of dates, numbers, ratios, currency, and physical magnitudes (speed, weight, temperature, density, etc.): Partially implemented.
- PoS tagging: Started.
- Chart-based shallow parsing: No.
- Named entity classification: No.
- WordNet-based sense annotation and disambiguation: No.
- Rule-based dependency parsing: No.
- Nominal coreference resolution: No.

If anyone is interested in starting something like FreeLing in D, please share your thoughts.
Oct 20 2015
Laeeth Isharc <laeethnospam nospam.laeeth.com> writes:
On Tuesday, 20 October 2015 at 16:01:41 UTC, Chris wrote:
 If anyone is interested in starting something like FreeLing in 
 D, please share your thoughts.
Hi.

I am very interested in this topic (especially sentiment analysis), and slowly I am getting a bit more firepower. I started porting the Python version of the Stanford NLP API (the underlying code is Java) to D. It's not very complicated, but I have too much on my plate, so it goes slowly.

I would be interested in working together on this with others, and I don't mind open sourcing the building blocks (which is really the time-consuming bit). I hope to have some others from the D world helping me, so it should go a bit faster, although the NLP stuff might not be the first project we work on.

Feel free to drop me an email: Laeeth At kaleidicassociates.com

Thanks.

Laeeth
Oct 20 2015
Chris <wendlec tcd.ie> writes:
On Tuesday, 20 October 2015 at 18:43:54 UTC, Laeeth Isharc wrote:
 I am very interested in this topic (especially sentiment 
 analysis), and slowly I am getting a bit more firepower.
What exactly is sentiment analysis and how do you go about it?
Oct 21 2015
Henry Gouk <henry.gouk gmail.com> writes:
On Wednesday, 21 October 2015 at 09:09:27 UTC, Chris wrote:

 What exactly is sentiment analysis and how do you go about it?
Determining whether the sentiment of a piece of text is positive, neutral, or negative. Currently Twitter is a pretty popular source of data in academia, as emoticons can be used as sufficiently accurate proxies for labels. Using pseudo-labelled tweets, one can then come up with a feature representation (e.g. bag of words, tf-idf) and use some sort of classifier (e.g. linear SVM or softmax regression) to determine the sentiment of novel tweets. This is a pretty simple approach, and probably not hard to improve on.
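
A toy sketch of that pipeline in D, under stated assumptions: emoticons give the pseudo-labels, the features are bag-of-words counts, and the classifier is a plain perceptron, swapped in here as the simplest linear classifier (it is not a linear SVM or softmax regression). All names (`pseudoLabel`, `bagOfWords`, `Perceptron`) and the four example tweets are made up for illustration, and only positive and negative labels are handled.

```d
import std.algorithm : canFind;
import std.array : split;
import std.stdio : writeln;
import std.uni : toLower;

// Pseudo-label from emoticons: +1 for ":)", -1 for ":(", 0 if neither.
int pseudoLabel(string tweet)
{
    if (tweet.canFind(":)")) return 1;
    if (tweet.canFind(":(")) return -1;
    return 0;
}

// Bag of words: lower-cased whitespace tokens with counts. Emoticons are
// dropped so the classifier cannot just memorise the label source.
int[string] bagOfWords(string text)
{
    int[string] bag;
    foreach (token; text.toLower.split)
        if (token != ":)" && token != ":(")
            bag[token] = bag.get(token, 0) + 1;
    return bag;
}

// Perceptron over bag-of-words features (stand-in for SVM/softmax).
struct Perceptron
{
    double[string] weights;
    double bias = 0;

    double score(int[string] bag)
    {
        double s = bias;
        foreach (word, count; bag)
            s += weights.get(word, 0.0) * count;
        return s;
    }

    int predict(string text)
    {
        return score(bagOfWords(text)) >= 0 ? 1 : -1;
    }

    void train(string[] tweets, int epochs = 10)
    {
        foreach (_; 0 .. epochs)
            foreach (tweet; tweets)
            {
                immutable label = pseudoLabel(tweet);
                if (label == 0) continue; // no emoticon, no pseudo-label
                auto bag = bagOfWords(tweet);
                if (label * score(bag) <= 0) // misclassified: update
                {
                    foreach (word, count; bag)
                        weights[word] = weights.get(word, 0.0) + label * count;
                    bias += label;
                }
            }
    }
}

void main()
{
    Perceptron model;
    model.train(["love this phone :)", "great day :)",
                 "hate this weather :(", "awful service :("]);
    writeln(model.predict("what a great phone"));  // 1 means positive
    writeln(model.predict("awful weather today")); // -1 means negative
}
```

Real tweets would need proper tokenization, far more data, and a held-out test set before any accuracy claim could be made.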
Oct 21 2015
Laeeth Isharc <laeethnospam nospamlaeeth.com> writes:
On Tuesday, 20 October 2015 at 16:01:41 UTC, Chris wrote:

 If anyone is interested in starting something like FreeLing in 
 D, please share your thoughts.
Chris - please drop me a line. I am sure there are some things we could work together on over time.

auto domain = "laeeth.com";
auto user = "laeeth";
writefln(user ~ " " ~ domain);
Oct 23 2015
Eliatto <arietto86 gmail.com> writes:
On Tuesday, 20 October 2015 at 15:49:18 UTC, bachmeier wrote:
 It's not my area, but are you thinking of something like 
 Freeling?

 http://nlp.lsi.upc.edu/freeling/
I think that in order to make a new wrapper more popular, it should be created with an LGPL license (not GPL). Freeling is GPL.

Is YamCha worth revival in D?
http://chasen.org/~taku/software/yamcha/
Oct 20 2015
bachmeier <no spam.net> writes:
On Wednesday, 21 October 2015 at 06:34:44 UTC, Eliatto wrote:
 I think that in order to make a new wrapper more popular, it should be 
 created with an LGPL license (not GPL). Freeling is GPL.
The internet doesn't need another discussion about licensing. I'll just say that it depends. However, the most important factors when you currently have nothing to offer are:

- How complete is the library?
- How many people are using it?
- How easy is it to create the bindings?

From my conversations, Freeling does quite well on all counts. Maybe that won't work for you personally because you want to use it in a proprietary project. That's not a compelling reason to ignore it, though, as others might want to use it, and they may be willing to comply with the GPL.
Oct 21 2015