www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - Unnecessary Large Memory Usage of Levenshtein Distance in Phobos

reply =?UTF-8?B?Ik5vcmRsw7Z3Ig==?= <per.nordlow gmail.com> writes:
I justd discovered that the std.algorithm implementation of 
Levenshtein Distance requires O(m*n) memory usage.

This is not neccessary. I have a C++-implementation of 
Damerau-Levenshtein that requires only O(3*min(m,n)). Is anybody 
interested in discussion modifying std.algorithm to make use of 
this?

Here's C++ implementation:


template<class T> inline void perm3_231(T& a, T& b, T& c) {
     T _t=a; a=b; b=c; c=_t;
}
template<class T> inline pure bool is_min(const T& a) {
     return a == pnw::minof<T>();
}
template<class T> inline pure bool is_max(const T& a) {
     return a == pnw::maxof<T>();
}

template<class T, class D = size_t>
inline pure
D damerau_levenshtein(const T& s_, const T& t_,
                       D max_distance = 
std::numeric_limits<D>::max(),
                       D insert_weight = static_cast<D>(10),
                       D delete_weight = static_cast<D>(7),
                       D replace_weight = static_cast<D>(5),
                       D transposition_weight = static_cast<D>(3))
{
     // reorder s and t to minimize memory usage
     bool ook = s_.size() >= t_.size(); // argument ordering ok 
flag
     const T& s = ook ? s_ : t_; // assure \c s becomes the \em 
longest
     const T& t = ook ? t_ : s_; // assure \c t becomes the \em 
shortest

     const D m = s.size();
     const D n = t.size();

     if (m == 0) { return n; }
     if (n == 0) { return m; }

     // Adapt the algorithm to use less space, O(3*min(n,m)) 
instead of O(mn),
     // since it only requires that the previous row/column and 
current row/column be stored at
     // any one time.
#ifdef HAVE_C99_VARIABLE_LENGTH_ARRAYS
     D cc_[n+1], pc_[n+1], sc_[n+1]; // current, previous and 
second-previous column on stack
#elif HAVE_CXX11_UNIQUE_PTR
     std::unique_ptr<D[]> cc_(new D[n+1]);      // current column
     std::unique_ptr<D[]> pc_(new D[n+1]);      // previous column
     std::unique_ptr<D[]> sc_(new D[n+1]);      // second previous 
column
#else
     auto cc_ = new D[n+1];      // current column
     auto pc_ = new D[n+1];      // previous column
     auto sc_ = new D[n+1];      // second previous column
     //std::vector<D> cc_(n+1), pc_(n+1), sc_(n+1); // current, 
previous and second previous column
#endif
     D * cc = &cc_[0], * pc = &pc_[0], * sc = &sc_[0]; // pointers 
for efficient swapping

     // initialize previous column
     for (D i = 0; i < n+1; ++i) { pc[i] = i * insert_weight; }

     // second previous column \c sc will be defined in second \c 
i iteration in outer-loop

     const auto D_max = std::numeric_limits<D>::max();

     // Computing the Levenshtein distance is based on the 
observation that if we
     // reserve a matrix to hold the Levenshtein distances between 
all prefixes
     // of the first string and all prefixes of the second, then 
we can compute
     // the values in the matrix by flood filling the matrix, and 
thus find the
     // distance between the two full strings as the last value 
computed.
     // This algorithm, an example of bottom-up dynamic 
programming, is
     // discussed, with variants, in the 1974 article The 
String-to-string
     // correction problem by Robert A. Wagner and Michael 
J.Fischer.
     for (D i = 0; i < m; ++i) {
         cc[0] = i+insert_weight;
         auto tmin = D_max; // row/column total min
         for (D j = 0; j < n; ++j) {
             // TODO Use sub_dist
             //auto sub_dist = damerau_levenshtein(s[i], t[j]); // 
recurse if for example T is an std::vector<std::string>
             cc[j+1] = pnw::min(pc[j+1] + insert_weight,         
// insertion
                                cc[j] + delete_weight,           
// deletion
                                pc[j] + (s[i] == t[j] ? 0 : 
replace_weight)); // substitution

             // transposition
             if (not is_max(transposition_weight)) { // if 
transposition should be allowed
                 if (i > 0 and j > 0 and // we need at least two 
characters
                     s[i-1] == t[j] and  // and first must be 
equal second
                     s[i]   == t[j-1]    // and vice versa
                     ) {
                     cc[j+1] = std::min(cc[j+1],
                                        sc[j-1] + 
transposition_weight);
                 }
             }

             if (not is_max(max_distance)) {
                 tmin = std::min(tmin, cc[j+1]);
             }
         }

         if ((not is_max(max_distance)) and
             tmin >= max_distance) {
             // if no element is smaller than \p max_distance
             return max_distance;
         }

         if (transposition_weight) {
             perm3_231(pc, cc, sc); // rotate pointers
         } else {
             std::swap(cc, pc);
         }
     }

#if !(defined(HAVE_C99_VARIABLE_LENGTH_ARRAYS) || 
defined(HAVE_CXX11_UNIQUE_PTR))
     delete [] cc_;
     delete [] pc_;
     delete [] sc_;
#endif
     return pc[n];
}


/*! Get \em Levenshtein (Edit) Distance (LD) metric between the 
\em sequences \p s and \p t.
  * Computing LD is also called Optimal String Alignment (OSA).
  */
template<class T, class D = size_t>
inline pure
D levenshtein(const T& s, const T& t,
               D max_distance = std::numeric_limits<D>::max(),
               D insert_weight = static_cast<D>(10),
               D delete_weight = static_cast<D>(7),
               D replace_weight = static_cast<D>(5))
{
     return damerau_levenshtein(s, t, max_distance, insert_weight, 
delete_weight, replace_weight,
                                std::numeric_limits<D>::max());
}

/*! Get \em Levenshtein (Edit) Distance (LD) metric between the 
\em arrays \p s and \p t.
  * Computing LD is also called Optimal String Alignment (OSA).
  */
template<class D = size_t>
inline pure
D levenshtein(const char * s, const char * t,
               D max_distance = std::numeric_limits<D>::max(),
               D insert_weight = static_cast<D>(10),
               D delete_weight = static_cast<D>(7),
               D replace_weight = static_cast<D>(5))
{
     return levenshtein(csc(s),
                        csc(t),
                        max_distance, insert_weight, 
delete_weight, replace_weight);
}

/* ---------------------------- Group Separator 
---------------------------- */

template<class T, class D = size_t>
inline pure
D test_levenshtein_symmetry(const T& s, const T& t,
                             D max_distance = 
std::numeric_limits<D>::max())
{
     D st = levenshtein(s, t, max_distance, 
static_cast<D>(1),static_cast<D>(1),static_cast<D>(1));
     D ts = levenshtein(t, s, max_distance, 
static_cast<D>(1),static_cast<D>(1),static_cast<D>(1));
     bool sym = (st == ts); // symmetry
     return sym ? st : std::numeric_limits<D>::max();
}
May 22 2014
next sibling parent reply Max Klyga <max.klyga gmail.com> writes:
You should make a pull request with this implementation adapted to 
std.algorithm interface

On 2014-05-22 09:49:09 +0000, Nordl÷w said:

 I justd discovered that the std.algorithm implementation of Levenshtein 
 Distance requires O(m*n) memory usage.
 
 This is not neccessary. I have a C++-implementation of 
 Damerau-Levenshtein that requires only O(3*min(m,n)). Is anybody 
 interested in discussion modifying std.algorithm to make use of this?
 
 Here's C++ implementation:
 
 
 template<class T> inline void perm3_231(T& a, T& b, T& c) {
      T _t=a; a=b; b=c; c=_t;
 }
 template<class T> inline pure bool is_min(const T& a) {
      return a == pnw::minof<T>();
 }
 template<class T> inline pure bool is_max(const T& a) {
      return a == pnw::maxof<T>();
 }
 
 template<class T, class D = size_t>
 inline pure
 D damerau_levenshtein(const T& s_, const T& t_,
                        D max_distance = std::numeric_limits<D>::max(),
                        D insert_weight = static_cast<D>(10),
                        D delete_weight = static_cast<D>(7),
                        D replace_weight = static_cast<D>(5),
                        D transposition_weight = static_cast<D>(3))
 {
      // reorder s and t to minimize memory usage
      bool ook = s_.size() >= t_.size(); // argument ordering ok flag
      const T& s = ook ? s_ : t_; // assure \c s becomes the \em longest
      const T& t = ook ? t_ : s_; // assure \c t becomes the \em shortest
 
      const D m = s.size();
      const D n = t.size();
 
      if (m == 0) { return n; }
      if (n == 0) { return m; }
 
      // Adapt the algorithm to use less space, O(3*min(n,m)) instead of O(mn),
      // since it only requires that the previous row/column and current 
 row/column be stored at
      // any one time.
 #ifdef HAVE_C99_VARIABLE_LENGTH_ARRAYS
      D cc_[n+1], pc_[n+1], sc_[n+1]; // current, previous and 
 second-previous column on stack
 #elif HAVE_CXX11_UNIQUE_PTR
      std::unique_ptr<D[]> cc_(new D[n+1]);      // current column
      std::unique_ptr<D[]> pc_(new D[n+1]);      // previous column
      std::unique_ptr<D[]> sc_(new D[n+1]);      // second previous column
 #else
      auto cc_ = new D[n+1];      // current column
      auto pc_ = new D[n+1];      // previous column
      auto sc_ = new D[n+1];      // second previous column
      //std::vector<D> cc_(n+1), pc_(n+1), sc_(n+1); // current, 
 previous and second previous column
 #endif
      D * cc = &cc_[0], * pc = &pc_[0], * sc = &sc_[0]; // pointers for 
 efficient swapping
 
      // initialize previous column
      for (D i = 0; i < n+1; ++i) { pc[i] = i * insert_weight; }
 
      // second previous column \c sc will be defined in second \c i 
 iteration in outer-loop
 
      const auto D_max = std::numeric_limits<D>::max();
 
      // Computing the Levenshtein distance is based on the observation 
 that if we
      // reserve a matrix to hold the Levenshtein distances between all prefixes
      // of the first string and all prefixes of the second, then we can compute
      // the values in the matrix by flood filling the matrix, and thus find the
      // distance between the two full strings as the last value computed.
      // This algorithm, an example of bottom-up dynamic programming, is
      // discussed, with variants, in the 1974 article The String-to-string
      // correction problem by Robert A. Wagner and Michael J.Fischer.
      for (D i = 0; i < m; ++i) {
          cc[0] = i+insert_weight;
          auto tmin = D_max; // row/column total min
          for (D j = 0; j < n; ++j) {
              // TODO Use sub_dist
              //auto sub_dist = damerau_levenshtein(s[i], t[j]); // 
 recurse if for example T is an std::vector<std::string>
              cc[j+1] = pnw::min(pc[j+1] + insert_weight,         // insertion
                                 cc[j] + delete_weight,           // deletion
                                 pc[j] + (s[i] == t[j] ? 0 : 
 replace_weight)); // substitution
 
              // transposition
              if (not is_max(transposition_weight)) { // if 
 transposition should be allowed
                  if (i > 0 and j > 0 and // we need at least two characters
                      s[i-1] == t[j] and  // and first must be equal second
                      s[i]   == t[j-1]    // and vice versa
                      ) {
                      cc[j+1] = std::min(cc[j+1],
                                         sc[j-1] + transposition_weight);
                  }
              }
 
              if (not is_max(max_distance)) {
                  tmin = std::min(tmin, cc[j+1]);
              }
          }
 
          if ((not is_max(max_distance)) and
              tmin >= max_distance) {
              // if no element is smaller than \p max_distance
              return max_distance;
          }
 
          if (transposition_weight) {
              perm3_231(pc, cc, sc); // rotate pointers
          } else {
              std::swap(cc, pc);
          }
      }
 
 #if !(defined(HAVE_C99_VARIABLE_LENGTH_ARRAYS) || 
 defined(HAVE_CXX11_UNIQUE_PTR))
      delete [] cc_;
      delete [] pc_;
      delete [] sc_;
 #endif
      return pc[n];
 }
 
 
 /*! Get \em Levenshtein (Edit) Distance (LD) metric between the \em 
 sequences \p s and \p t.
   * Computing LD is also called Optimal String Alignment (OSA).
   */
 template<class T, class D = size_t>
 inline pure
 D levenshtein(const T& s, const T& t,
                D max_distance = std::numeric_limits<D>::max(),
                D insert_weight = static_cast<D>(10),
                D delete_weight = static_cast<D>(7),
                D replace_weight = static_cast<D>(5))
 {
      return damerau_levenshtein(s, t, max_distance, insert_weight, 
 delete_weight, replace_weight,
                                 std::numeric_limits<D>::max());
 }
 
 /*! Get \em Levenshtein (Edit) Distance (LD) metric between the \em 
 arrays \p s and \p t.
   * Computing LD is also called Optimal String Alignment (OSA).
   */
 template<class D = size_t>
 inline pure
 D levenshtein(const char * s, const char * t,
                D max_distance = std::numeric_limits<D>::max(),
                D insert_weight = static_cast<D>(10),
                D delete_weight = static_cast<D>(7),
                D replace_weight = static_cast<D>(5))
 {
      return levenshtein(csc(s),
                         csc(t),
                         max_distance, insert_weight, delete_weight, 
 replace_weight);
 }
 
 /* ---------------------------- Group Separator ---------------------------- */
 
 template<class T, class D = size_t>
 inline pure
 D test_levenshtein_symmetry(const T& s, const T& t,
                              D max_distance = std::numeric_limits<D>::max())
 {
      D st = levenshtein(s, t, max_distance, 
 static_cast<D>(1),static_cast<D>(1),static_cast<D>(1));
      D ts = levenshtein(t, s, max_distance, 
 static_cast<D>(1),static_cast<D>(1),static_cast<D>(1));
      bool sym = (st == ts); // symmetry
      return sym ? st : std::numeric_limits<D>::max();
 }

May 22 2014
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/22/14, 4:26 AM, Max Klyga wrote:
 You should make a pull request with this implementation adapted to
 std.algorithm interface

Yes please. I recall I wanted to add that optimization later but forgot about it. Andrei
May 22 2014
prev sibling next sibling parent James Carr via Digitalmars-d <digitalmars-d puremagic.com> writes:
--001a11c17334c3562804f9ff9047
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

I've begun work on this and my implementation in D passes all the
std.algorithm unit tests, but because it now uses a single array instead of
a matrix, path() no longer provides the correct answer.

I'm working on trying to amend it so that there is consistency.


On Thu, May 22, 2014 at 12:26 PM, Max Klyga via Digitalmars-d <
digitalmars-d puremagic.com> wrote:

 You should make a pull request with this implementation adapted to
 std.algorithm interface


 On 2014-05-22 09:49:09 +0000, Nordl=C3=B6w said:

  I justd discovered that the std.algorithm implementation of Levenshtein
 Distance requires O(m*n) memory usage.

 This is not neccessary. I have a C++-implementation of
 Damerau-Levenshtein that requires only O(3*min(m,n)). Is anybody interes=


 in discussion modifying std.algorithm to make use of this?

 Here's C++ implementation:


 template<class T> inline void perm3_231(T& a, T& b, T& c) {
      T _t=3Da; a=3Db; b=3Dc; c=3D_t;
 }
 template<class T> inline pure bool is_min(const T& a) {
      return a =3D=3D pnw::minof<T>();
 }
 template<class T> inline pure bool is_max(const T& a) {
      return a =3D=3D pnw::maxof<T>();
 }

 template<class T, class D =3D size_t>
 inline pure
 D damerau_levenshtein(const T& s_, const T& t_,
                        D max_distance =3D std::numeric_limits<D>::max(),
                        D insert_weight =3D static_cast<D>(10),
                        D delete_weight =3D static_cast<D>(7),
                        D replace_weight =3D static_cast<D>(5),
                        D transposition_weight =3D static_cast<D>(3))
 {
      // reorder s and t to minimize memory usage
      bool ook =3D s_.size() >=3D t_.size(); // argument ordering ok flag
      const T& s =3D ook ? s_ : t_; // assure \c s becomes the \em longes=


      const T& t =3D ook ? t_ : s_; // assure \c t becomes the \em shorte=


      const D m =3D s.size();
      const D n =3D t.size();

      if (m =3D=3D 0) { return n; }
      if (n =3D=3D 0) { return m; }

      // Adapt the algorithm to use less space, O(3*min(n,m)) instead of
 O(mn),
      // since it only requires that the previous row/column and current
 row/column be stored at
      // any one time.
 #ifdef HAVE_C99_VARIABLE_LENGTH_ARRAYS
      D cc_[n+1], pc_[n+1], sc_[n+1]; // current, previous and
 second-previous column on stack
 #elif HAVE_CXX11_UNIQUE_PTR
      std::unique_ptr<D[]> cc_(new D[n+1]);      // current column
      std::unique_ptr<D[]> pc_(new D[n+1]);      // previous column
      std::unique_ptr<D[]> sc_(new D[n+1]);      // second previous colum=


 #else
      auto cc_ =3D new D[n+1];      // current column
      auto pc_ =3D new D[n+1];      // previous column
      auto sc_ =3D new D[n+1];      // second previous column
      //std::vector<D> cc_(n+1), pc_(n+1), sc_(n+1); // current, previous
 and second previous column
 #endif
      D * cc =3D &cc_[0], * pc =3D &pc_[0], * sc =3D &sc_[0]; // pointers=


 efficient swapping

      // initialize previous column
      for (D i =3D 0; i < n+1; ++i) { pc[i] =3D i * insert_weight; }

      // second previous column \c sc will be defined in second \c i
 iteration in outer-loop

      const auto D_max =3D std::numeric_limits<D>::max();

      // Computing the Levenshtein distance is based on the observation
 that if we
      // reserve a matrix to hold the Levenshtein distances between all
 prefixes
      // of the first string and all prefixes of the second, then we can
 compute
      // the values in the matrix by flood filling the matrix, and thus
 find the
      // distance between the two full strings as the last value computed=


      // This algorithm, an example of bottom-up dynamic programming, is
      // discussed, with variants, in the 1974 article The String-to-stri=


      // correction problem by Robert A. Wagner and Michael J.Fischer.
      for (D i =3D 0; i < m; ++i) {
          cc[0] =3D i+insert_weight;
          auto tmin =3D D_max; // row/column total min
          for (D j =3D 0; j < n; ++j) {
              // TODO Use sub_dist
              //auto sub_dist =3D damerau_levenshtein(s[i], t[j]); //
 recurse if for example T is an std::vector<std::string>
              cc[j+1] =3D pnw::min(pc[j+1] + insert_weight,         //
 insertion
                                 cc[j] + delete_weight,           //
 deletion
                                 pc[j] + (s[i] =3D=3D t[j] ? 0 :
 replace_weight)); // substitution

              // transposition
              if (not is_max(transposition_weight)) { // if transposition
 should be allowed
                  if (i > 0 and j > 0 and // we need at least two
 characters
                      s[i-1] =3D=3D t[j] and  // and first must be equal =


                      s[i]   =3D=3D t[j-1]    // and vice versa
                      ) {
                      cc[j+1] =3D std::min(cc[j+1],
                                         sc[j-1] + transposition_weight);
                  }
              }

              if (not is_max(max_distance)) {
                  tmin =3D std::min(tmin, cc[j+1]);
              }
          }

          if ((not is_max(max_distance)) and
              tmin >=3D max_distance) {
              // if no element is smaller than \p max_distance
              return max_distance;
          }

          if (transposition_weight) {
              perm3_231(pc, cc, sc); // rotate pointers
          } else {
              std::swap(cc, pc);
          }
      }

 #if !(defined(HAVE_C99_VARIABLE_LENGTH_ARRAYS) ||
 defined(HAVE_CXX11_UNIQUE_PTR))
      delete [] cc_;
      delete [] pc_;
      delete [] sc_;
 #endif
      return pc[n];
 }


 /*! Get \em Levenshtein (Edit) Distance (LD) metric between the \em
 sequences \p s and \p t.
   * Computing LD is also called Optimal String Alignment (OSA).
   */
 template<class T, class D =3D size_t>
 inline pure
 D levenshtein(const T& s, const T& t,
                D max_distance =3D std::numeric_limits<D>::max(),
                D insert_weight =3D static_cast<D>(10),
                D delete_weight =3D static_cast<D>(7),
                D replace_weight =3D static_cast<D>(5))
 {
      return damerau_levenshtein(s, t, max_distance, insert_weight,
 delete_weight, replace_weight,
                                 std::numeric_limits<D>::max());
 }

 /*! Get \em Levenshtein (Edit) Distance (LD) metric between the \em
 arrays \p s and \p t.
   * Computing LD is also called Optimal String Alignment (OSA).
   */
 template<class D =3D size_t>
 inline pure
 D levenshtein(const char * s, const char * t,
                D max_distance =3D std::numeric_limits<D>::max(),
                D insert_weight =3D static_cast<D>(10),
                D delete_weight =3D static_cast<D>(7),
                D replace_weight =3D static_cast<D>(5))
 {
      return levenshtein(csc(s),
                         csc(t),
                         max_distance, insert_weight, delete_weight,
 replace_weight);
 }

 /* ---------------------------- Group Separator
 ---------------------------- */

 template<class T, class D =3D size_t>
 inline pure
 D test_levenshtein_symmetry(const T& s, const T& t,
                              D max_distance =3D
 std::numeric_limits<D>::max())
 {
      D st =3D levenshtein(s, t, max_distance, static_cast<D>(1),static_c=


 D>(1),static_cast<D>(1));
      D ts =3D levenshtein(t, s, max_distance, static_cast<D>(1),static_c=


 D>(1),static_cast<D>(1));
      bool sym =3D (st =3D=3D ts); // symmetry
      return sym ? st : std::numeric_limits<D>::max();
 }


--001a11c17334c3562804f9ff9047 Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable <div dir=3D"ltr">I&#39;ve begun work on this and my implementation in D pas= ses all the std.algorithm unit tests, but because it now uses a single arra= y instead of a matrix, path() no longer provides the correct answer.<br><br=

iv class=3D"gmail_extra"><br><br><div class=3D"gmail_quote">On Thu, May 22,= 2014 at 12:26 PM, Max Klyga via Digitalmars-d <span dir=3D"ltr">&lt;<a hre= f=3D"mailto:digitalmars-d puremagic.com" target=3D"_blank">digitalmars-d pu= remagic.com</a>&gt;</span> wrote:<br> <blockquote class=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-left:1p= x #ccc solid;padding-left:1ex">You should make a pull request with this imp= lementation adapted to std.algorithm interface<div class=3D"HOEnZb"><div cl= ass=3D"h5"> <br> <br> On 2014-05-22 09:49:09 +0000, Nordl=C3=B6w said:<br> <br> <blockquote class=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-left:1p= x #ccc solid;padding-left:1ex"> I justd discovered that the std.algorithm implementation of Levenshtein Dis= tance requires O(m*n) memory usage.<br> <br> This is not neccessary. I have a C++-implementation of Damerau-Levenshtein = that requires only O(3*min(m,n)). Is anybody interested in discussion modif= ying std.algorithm to make use of this?<br> <br> Here&#39;s C++ implementation:<br> <br> <br> template&lt;class T&gt; inline void perm3_231(T&amp; a, T&amp; b, T&amp; c)= {<br> =C2=A0 =C2=A0 =C2=A0T _t=3Da; a=3Db; b=3Dc; c=3D_t;<br> }<br> template&lt;class T&gt; inline pure bool is_min(const T&amp; a) {<br> =C2=A0 =C2=A0 =C2=A0return a =3D=3D pnw::minof&lt;T&gt;();<br> }<br> template&lt;class T&gt; inline pure bool is_max(const T&amp; a) {<br> =C2=A0 =C2=A0 =C2=A0return a =3D=3D pnw::maxof&lt;T&gt;();<br> }<br> <br> template&lt;class T, class D =3D size_t&gt;<br> inline pure<br> D damerau_levenshtein(const T&amp; s_, const T&amp; t_,<br> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0D max_distance =3D std::numeric_limits&lt;D&gt;::max(),<br> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0D insert_weight =3D static_cast&lt;D&gt;(10),<br> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0D delete_weight =3D static_cast&lt;D&gt;(7),<br> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0D replace_weight =3D static_cast&lt;D&gt;(5),<br> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0D transposition_weight =3D static_cast&lt;D&gt;(3))<br> {<br> =C2=A0 =C2=A0 =C2=A0// reorder s and t to minimize memory usage<br> =C2=A0 =C2=A0 =C2=A0bool ook =3D s_.size() &gt;=3D t_.size(); // argument o= rdering ok flag<br> =C2=A0 =C2=A0 =C2=A0const T&amp; s =3D ook ? s_ : t_; // assure \c s become= s the \em longest<br> =C2=A0 =C2=A0 =C2=A0const T&amp; t =3D ook ? t_ : s_; // assure \c t become= s the \em shortest<br> <br> =C2=A0 =C2=A0 =C2=A0const D m =3D s.size();<br> =C2=A0 =C2=A0 =C2=A0const D n =3D t.size();<br> <br> =C2=A0 =C2=A0 =C2=A0if (m =3D=3D 0) { return n; }<br> =C2=A0 =C2=A0 =C2=A0if (n =3D=3D 0) { return m; }<br> <br> =C2=A0 =C2=A0 =C2=A0// Adapt the algorithm to use less space, O(3*min(n,m))= instead of O(mn),<br> =C2=A0 =C2=A0 =C2=A0// since it only requires that the previous row/column = and current row/column be stored at<br> =C2=A0 =C2=A0 =C2=A0// any one time.<br> #ifdef HAVE_C99_VARIABLE_LENGTH_<u></u>ARRAYS<br> =C2=A0 =C2=A0 =C2=A0D cc_[n+1], pc_[n+1], sc_[n+1]; // current, previous an= d second-previous column on stack<br> #elif HAVE_CXX11_UNIQUE_PTR<br> =C2=A0 =C2=A0 =C2=A0std::unique_ptr&lt;D[]&gt; cc_(new D[n+1]); =C2=A0 =C2= =A0 =C2=A0// current column<br> =C2=A0 =C2=A0 =C2=A0std::unique_ptr&lt;D[]&gt; pc_(new D[n+1]); =C2=A0 =C2= =A0 =C2=A0// previous column<br> =C2=A0 =C2=A0 =C2=A0std::unique_ptr&lt;D[]&gt; sc_(new D[n+1]); =C2=A0 =C2= =A0 =C2=A0// second previous column<br> #else<br> =C2=A0 =C2=A0 =C2=A0auto cc_ =3D new D[n+1]; =C2=A0 =C2=A0 =C2=A0// current= column<br> =C2=A0 =C2=A0 =C2=A0auto pc_ =3D new D[n+1]; =C2=A0 =C2=A0 =C2=A0// previou= s column<br> =C2=A0 =C2=A0 =C2=A0auto sc_ =3D new D[n+1]; =C2=A0 =C2=A0 =C2=A0// second = previous column<br> =C2=A0 =C2=A0 =C2=A0//std::vector&lt;D&gt; cc_(n+1), pc_(n+1), sc_(n+1); //= current, previous and second previous column<br> #endif<br> =C2=A0 =C2=A0 =C2=A0D * cc =3D &amp;cc_[0], * pc =3D &amp;pc_[0], * sc =3D = &amp;sc_[0]; // pointers for efficient swapping<br> <br> =C2=A0 =C2=A0 =C2=A0// initialize previous column<br> =C2=A0 =C2=A0 =C2=A0for (D i =3D 0; i &lt; n+1; ++i) { pc[i] =3D i * insert= _weight; }<br> <br> =C2=A0 =C2=A0 =C2=A0// second previous column \c sc will be defined in seco= nd \c i iteration in outer-loop<br> <br> =C2=A0 =C2=A0 =C2=A0const auto D_max =3D std::numeric_limits&lt;D&gt;::max(= );<br> <br> =C2=A0 =C2=A0 =C2=A0// Computing the Levenshtein distance is based on the o= bservation that if we<br> =C2=A0 =C2=A0 =C2=A0// reserve a matrix to hold the Levenshtein distances b= etween all prefixes<br> =C2=A0 =C2=A0 =C2=A0// of the first string and all prefixes of the second, = then we can compute<br> =C2=A0 =C2=A0 =C2=A0// the values in the matrix by flood filling the matrix= , and thus find the<br> =C2=A0 =C2=A0 =C2=A0// distance between the two full strings as the last va= lue computed.<br> =C2=A0 =C2=A0 =C2=A0// This algorithm, an example of bottom-up dynamic prog= ramming, is<br> =C2=A0 =C2=A0 =C2=A0// discussed, with variants, in the 1974 article The St= ring-to-string<br> =C2=A0 =C2=A0 =C2=A0// correction problem by Robert A. Wagner and Michael J= .Fischer.<br> =C2=A0 =C2=A0 =C2=A0for (D i =3D 0; i &lt; m; ++i) {<br> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0cc[0] =3D i+insert_weight;<br> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0auto tmin =3D D_max; // row/column total = min<br> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0for (D j =3D 0; j &lt; n; ++j) {<br> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0// TODO Use sub_dist<br> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0//auto sub_dist =3D damerau= _levenshtein(s[i], t[j]); // recurse if for example T is an std::vector&lt;= std::string&gt;<br> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0cc[j+1] =3D pnw::min(pc[j+1= ] + insert_weight, =C2=A0 =C2=A0 =C2=A0 =C2=A0 // insertion<br> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 cc[j] + delete_weight, =C2=A0 =C2=A0= =C2=A0 =C2=A0 =C2=A0 // deletion<br> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 pc[j] + (s[i] =3D=3D t[j] ? 0 : repl= ace_weight)); // substitution<br> <br> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0// transposition<br> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0if (not is_max(transpositio= n_weight)) { // if transposition should be allowed<br> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0if (i &gt; 0 = and j &gt; 0 and // we need at least two characters<br> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0s[i-1] =3D=3D t[j] and =C2=A0// and first must be equal second<br> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0s[i] =C2=A0 =3D=3D t[j-1] =C2=A0 =C2=A0// and vice versa<br> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0) {<br> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0cc[j+1] =3D std::min(cc[j+1],<br> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 sc[j-1] = + transposition_weight);<br> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0}<br> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0}<br> <br> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0if (not is_max(max_distance= )) {<br> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0tmin =3D std:= :min(tmin, cc[j+1]);<br> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0}<br> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0}<br> <br> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0if ((not is_max(max_distance)) and<br> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0tmin &gt;=3D max_distance) = {<br> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0// if no element is smaller= than \p max_distance<br> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0return max_distance;<br> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0}<br> <br> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0if (transposition_weight) {<br> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0perm3_231(pc, cc, sc); // r= otate pointers<br> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0} else {<br> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0std::swap(cc, pc);<br> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0}<br> =C2=A0 =C2=A0 =C2=A0}<br> <br> #if !(defined(HAVE_C99_VARIABLE_<u></u>LENGTH_ARRAYS) || defined(HAVE_CXX11= _UNIQUE_PTR)<u></u>)<br> =C2=A0 =C2=A0 =C2=A0delete [] cc_;<br> =C2=A0 =C2=A0 =C2=A0delete [] pc_;<br> =C2=A0 =C2=A0 =C2=A0delete [] sc_;<br> #endif<br> =C2=A0 =C2=A0 =C2=A0return pc[n];<br> }<br> <br> <br> /*! Get \em Levenshtein (Edit) Distance (LD) metric between the \em sequenc= es \p s and \p t.<br> =C2=A0 * Computing LD is also called Optimal String Alignment (OSA).<br> =C2=A0 */<br> template&lt;class T, class D =3D size_t&gt;<br> inline pure<br> D levenshtein(const T&amp; s, const T&amp; t,<br> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0D max_distance =3D s= td::numeric_limits&lt;D&gt;::max(),<br> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0D insert_weight =3D = static_cast&lt;D&gt;(10),<br> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0D delete_weight =3D = static_cast&lt;D&gt;(7),<br> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0D replace_weight =3D= static_cast&lt;D&gt;(5))<br> {<br> =C2=A0 =C2=A0 =C2=A0return damerau_levenshtein(s, t, max_distance, insert_w= eight, delete_weight, replace_weight,<br> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 std::numeric_limits&lt;D&gt;::max())= <u></u>;<br> }<br> <br> /*! Get \em Levenshtein (Edit) Distance (LD) metric between the \em arrays = \p s and \p t.<br> =C2=A0 * Computing LD is also called Optimal String Alignment (OSA).<br> =C2=A0 */<br> template&lt;class D =3D size_t&gt;<br> inline pure<br> D levenshtein(const char * s, const char * t,<br> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0D max_distance =3D s= td::numeric_limits&lt;D&gt;::max(),<br> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0D insert_weight =3D = static_cast&lt;D&gt;(10),<br> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0D delete_weight =3D = static_cast&lt;D&gt;(7),<br> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0D replace_weight =3D= static_cast&lt;D&gt;(5))<br> {<br> =C2=A0 =C2=A0 =C2=A0return levenshtein(csc(s),<br> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 csc(t),<br> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 max_distance, insert_weight, delete_weight, replace_weight);<br> }<br> <br> /* ---------------------------- Group Separator ---------------------------= - */<br> <br> template&lt;class T, class D =3D size_t&gt;<br> inline pure<br> D test_levenshtein_symmetry(<u></u>const T&amp; s, const T&amp; t,<br> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0D max_distance =3D std::numeric_limits&lt;D&= gt;::max())<br> {<br> =C2=A0 =C2=A0 =C2=A0D st =3D levenshtein(s, t, max_distance, static_cast&lt= ;D&gt;(1),static_cast&lt;<u></u>D&gt;(1),static_cast&lt;D&gt;(1));<br> =C2=A0 =C2=A0 =C2=A0D ts =3D levenshtein(t, s, max_distance, static_cast&lt= ;D&gt;(1),static_cast&lt;<u></u>D&gt;(1),static_cast&lt;D&gt;(1));<br> =C2=A0 =C2=A0 =C2=A0bool sym =3D (st =3D=3D ts); // symmetry<br> =C2=A0 =C2=A0 =C2=A0return sym ? st : std::numeric_limits&lt;D&gt;::max();<= br> }<br> </blockquote> <br> <br> </div></div></blockquote></div><br></div> --001a11c17334c3562804f9ff9047--
May 22 2014
prev sibling next sibling parent =?UTF-8?B?Ik5vcmRsw7Z3Ig==?= <per.nordlow gmail.com> writes:
 I've begun work on this and my implementation in D passes all

I uploaded it here https://github.com/nordlow/justcxx/blob/master/levenshtein_distance.hpp I don't have time right now to look into adapting it to Phobos. But I hope it can give some clues for your work ;) /Per
May 22 2014
prev sibling next sibling parent James Carr via Digitalmars-d <digitalmars-d puremagic.com> writes:
--001a11c2c0d49214b104fa024f21
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

I've submitted a pull request.

https://github.com/D-Programming-Language/phobos/pull/2195


On Thu, May 22, 2014 at 8:05 PM, "Nordl=C3=B6w" <digitalmars-d puremagic.co=
m>wrote:

 I've begun work on this and my implementation in D passes all

I uploaded it here https://github.com/nordlow/justcxx/blob/master/levenshtein_distance.hpp I don't have time right now to look into adapting it to Phobos. But I hope it can give some clues for your work ;) /Per

--001a11c2c0d49214b104fa024f21 Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable <div dir=3D"ltr">I&#39;ve submitted a pull request.<br><br><a href=3D"https= ://github.com/D-Programming-Language/phobos/pull/2195">https://github.com/D= -Programming-Language/phobos/pull/2195</a><br></div><div class=3D"gmail_ext= ra"> <br><br><div class=3D"gmail_quote">On Thu, May 22, 2014 at 8:05 PM, &quot;N= ordl=C3=B6w&quot; <span dir=3D"ltr">&lt;<a href=3D"mailto:digitalmars-d pur= emagic.com" target=3D"_blank">digitalmars-d puremagic.com</a>&gt;</span> wr= ote:<br><blockquote class=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border= -left:1px #ccc solid;padding-left:1ex"> <div class=3D""><blockquote class=3D"gmail_quote" style=3D"margin:0 0 0 .8e= x;border-left:1px #ccc solid;padding-left:1ex"> I&#39;ve begun work on this and my implementation in D passes all<br> </blockquote> <br></div> I uploaded it here<br> <br> <a href=3D"https://github.com/nordlow/justcxx/blob/master/levenshtein_dista= nce.hpp" target=3D"_blank">https://github.com/nordlow/<u></u>justcxx/blob/m= aster/<u></u>levenshtein_distance.hpp</a><br> <br> I don&#39;t have time right now to look into adapting it to Phobos.<br> But I hope it can give some clues for your work ;)<br> <br> /Per<br> </blockquote></div><br></div> --001a11c2c0d49214b104fa024f21--
May 22 2014
prev sibling next sibling parent James Carr via Digitalmars-d <digitalmars-d puremagic.com> writes:
--001a11c27e385e7d6f04fa40e534
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

Hi again,

I've implemented the memory efficient version
https://github.com/D-Programming-Language/phobos/pull/2195

I was wondering if I could get some feedback please.

Thanks.


On Thu, May 22, 2014 at 8:42 PM, James Carr <jdcarrman gmail.com> wrote:

 I've submitted a pull request.

 https://github.com/D-Programming-Language/phobos/pull/2195


 On Thu, May 22, 2014 at 8:05 PM, "Nordl=C3=B6w" <digitalmars-d puremagic.=

  I've begun work on this and my implementation in D passes all

I uploaded it here https://github.com/nordlow/justcxx/blob/master/levenshtein_distance.hpp I don't have time right now to look into adapting it to Phobos. But I hope it can give some clues for your work ;) /Per


--001a11c27e385e7d6f04fa40e534 Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable <div dir=3D"ltr">Hi again,<div><br></div><div>I&#39;ve implemented the memo= ry efficient version <a href=3D"https://github.com/D-Programming-Language/p= hobos/pull/2195">https://github.com/D-Programming-Language/phobos/pull/2195= </a></div> <div><br></div><div>I was wondering if I could get some feedback please.</d= iv><div><br></div><div>Thanks.</div></div><div class=3D"gmail_extra"><br><b= r><div class=3D"gmail_quote">On Thu, May 22, 2014 at 8:42 PM, James Carr <s= pan dir=3D"ltr">&lt;<a href=3D"mailto:jdcarrman gmail.com" target=3D"_blank= ">jdcarrman gmail.com</a>&gt;</span> wrote:<br> <blockquote class=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-left:1p= x #ccc solid;padding-left:1ex"><div dir=3D"ltr">I&#39;ve submitted a pull r= equest.<br><br><a href=3D"https://github.com/D-Programming-Language/phobos/= pull/2195" target=3D"_blank">https://github.com/D-Programming-Language/phob= os/pull/2195</a><br> </div><div class=3D"HOEnZb"><div class=3D"h5"><div class=3D"gmail_extra"> <br><br><div class=3D"gmail_quote">On Thu, May 22, 2014 at 8:05 PM, &quot;N= ordl=C3=B6w&quot; <span dir=3D"ltr">&lt;<a href=3D"mailto:digitalmars-d pur= emagic.com" target=3D"_blank">digitalmars-d puremagic.com</a>&gt;</span> wr= ote:<br> <blockquote class=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-left:1p= x #ccc solid;padding-left:1ex"> <div><blockquote class=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-le= ft:1px #ccc solid;padding-left:1ex"> I&#39;ve begun work on this and my implementation in D passes all<br> </blockquote> <br></div> I uploaded it here<br> <br> <a href=3D"https://github.com/nordlow/justcxx/blob/master/levenshtein_dista= nce.hpp" target=3D"_blank">https://github.com/nordlow/<u></u>justcxx/blob/m= aster/<u></u>levenshtein_distance.hpp</a><br> <br> I don&#39;t have time right now to look into adapting it to Phobos.<br> But I hope it can give some clues for your work ;)<br> <br> /Per<br> </blockquote></div><br></div> </div></div></blockquote></div><br></div> --001a11c27e385e7d6f04fa40e534--
May 25 2014
prev sibling parent "Brian Schott" <briancschott gmail.com> writes:
On Sunday, 25 May 2014 at 22:22:28 UTC, James Carr via 
Digitalmars-d wrote:
 Hi again,

 I've implemented the memory efficient version
 https://github.com/D-Programming-Language/phobos/pull/2195

 I was wondering if I could get some feedback please.

 Thanks.

This PR has been sitting around for over a month without anybody doing anything about it. Can we get a Phobos dev to take a look?
Jul 11 2014