digitalmars.D - GitHub Copilot is starting to produce D code that can almost pass code

Petar Kirov [ZombineDev] <petar.p.kirov gmail.com> writes:
IIRC, a while ago, there was a post on the newsgroup by someone 
asking if GH Copilot works with D. Here's my limited experience 
so far.

I've been using the Copilot VSCode extension for a few months at 
work, mostly with TypeScript and Go, and lately it has been 
producing surprisingly (*) good results. Back when I started 
using it, I tried it out with D, but quickly disabled it as I 
found its suggestions way off and too distracting.

Recently I decided to give it another try for a small D program 
and I got surprisingly good results. Here's one example:

Input:

```d
SemVer[] getGitHubRepoTags(string repo, bool includePre|
```

Output (everything after the cursor, denoted by `|` above, was 
suggested by Copilot):

```d
SemVer[] getGitHubRepoTags(string repo, bool includePrereleases = false)
{
     auto url = "https://api.github.com/repos/" ~ repo ~ "/tags";
     auto tags = httpGet(url).readText.parseJSON.array;

     return tags
         .map!(tag => SemVer(tag.object.byPair.get("name")))
         .filter!(x => x.isValid && (includePrereleases || !x.isPrerelease))
         .array;
}
```

Things to note:

* On a purely syntactical level, the code is grammatically 
correct - parentheses are properly balanced, all statements end 
with a semicolon, and it even decided to group the 
`includePrereleases || !x.isPrerelease` subexpression with 
parentheses

* Apparently, there are enough examples of how to get the tags 
for a GH repo that it got it right on the first try. I had to 
look up the docs to verify that the REST API path format was 
indeed correct.

* For some reason it insists on suggesting `httpGet`, instead of 
simply `get` (from `std.net.curl`). I guess `get` is too generic 
for its taste :D

* I still haven't seen suggestions containing function local 
imports. My guess is that's because D is relatively unique 
compared to most other languages, and is not well-represented in 
the dataset Copilot is being trained on.

* At the beginning, its suggestions mostly resembled snippets 
from JavaScript or Python code; for example, it used to suggest 
`+` (instead of `~`) for string concatenation. After a while it 
started to use `~` more consistently.

* Same for `map` and `filter` - in earlier parts of the program 
Copilot used to suggest passing the lambda as a runtime parameter 
(as in JavaScript), but after it saw a few examples in my code, 
it finally started to consistently use the D template args syntax

* After a while it started suggesting `.array` here and there in 
range pipelines

* For now, the suggestions I get involving slicing mostly use the 
`.substr` function (most likely borrowed from a JS program), so 
apparently, it hasn't seen enough `[start .. end]` expressions in 
my code.

* Amusingly enough, even though DCD ought to be in a much more 
advantaged position (it has an actual D parser, knows about my 
import paths, etc.), it gets beaten by Copilot pretty easily 
in terms of speed, usefulness and likelihood of even 
attempting to produce any suggestion (e.g. DCD gives up at the 
first sight of a UFCS pipeline, while those are bread and butter 
for Copilot).
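
For readers newer to D, here is a minimal sketch of the idioms mentioned in the list above: `~` for concatenation, lambdas passed as template arguments, `.array` at the end of a range pipeline, `[start .. end]` slicing, and a function-local import. It is not Copilot output and is not taken from the program discussed above; `tagsUrl` and `stableVersions` are hypothetical helpers, and treating any name that contains `-` as a prerelease is a simplifying assumption.

```d
import std.algorithm : canFind, filter, map;
import std.array : array;

/// Builds the tags endpoint URL; `~` is D's concatenation operator.
string tagsUrl(string repo)
{
    return "https://api.github.com/repos/" ~ repo ~ "/tags";
}

/// Hypothetical helper: keeps names starting with 'v', drops that prefix,
/// and filters out anything containing '-' (assumed to mark a prerelease).
string[] stableVersions(string[] tagNames)
{
    return tagNames
        .filter!(name => name.length > 1 && name[0] == 'v') // lambda is a template argument
        .map!(name => name[1 .. $])                          // slice, not .substr
        .filter!(ver => !ver.canFind('-'))
        .array;                                              // materialize the lazy range
}

void main()
{
    import std.stdio : writeln; // function-local import

    writeln(tagsUrl("dlang/dmd"));
    writeln(stableVersions(["v2.099.0", "v2.100.0-beta.1", "nightly"]));
}
```

The pipeline stays lazy until `.array`, which is why that call tends to show up at the end of such chains.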

---

All in all, don't expect any wonders. Almost all suggestions of 
non-trivial size will contain mistakes. But from time to time it 
can pleasantly surprise you. It's funny when after a long day 
Copilot starts to put `!` before `(some => lambda)` more 
consistently than you :)

---

P.S. I'm leaving the semantic mistakes in the suggestion for the 
fellow humans here to find :P

(*) That is to say, surprising for me, considering I was expecting 
it to produce pure gibberish, as it has no semantic understanding 
of either the programming language or the problem domain.
Apr 03 2022
Walter Bright <newshound2 digitalmars.com> writes:
This is extraordinary!

Can Copilot be used standalone, i.e. not as a plugin to VS?
Apr 03 2022
Petar Kirov [ZombineDev] <petar.p.kirov gmail.com> writes:
On Sunday, 3 April 2022 at 23:25:45 UTC, Walter Bright wrote:
 This is extraordinary!

 Can Copilot be used standalone, i.e. not as a plugin to VS?
There are two parts to this feature:

1. The editor extension (plugin)
2. The GH Copilot API

Currently, there are editor extensions for Visual Studio (the big 
Windows-only IDE), Visual Studio Code (the cross-platform, Electron 
and Monaco-powered open-source editor written in TypeScript / 
JavaScript), JetBrains (they mention only IntelliJ and PyCharm) and 
NeoVim 0.6+ (which is a popular fork of Vim).

Note that VS Code for Web (the one that is opened when you press dot 
`.` while browsing a GH repo, or when you simply replace `github.com` 
with `github.dev` in the URL, e.g. https://github.dev/dlang/dmd) 
doesn't yet support Copilot. I don't know if that's a technical 
limitation (e.g. because VS Code for Web can't run extensions 
containing binary components, while the normal VS Code can), or 
whether it was a business decision (for now).

For 1 to work, you need to have access to 2, which is currently in a 
free closed beta. Anyone can request access with their GH account on 
the official page: https://copilot.github.com. I don't know how 
exactly they decide who's eligible and how they prioritize access to 
the beta, but my guess is they ought to heavily favor older accounts 
with a big stream of contributions (such as yours) over blank, 
recently created accounts. For example, I requested access to GH 
Codespaces Beta almost immediately after it was announced, but didn't 
get access to it for more than a year. On the other hand, I think GH 
Copilot was announced at the end of June last year and I got access 
sometime in August, so your mileage may vary.

For now at least, the API is not documented. My guess is that they 
will document it eventually, since the secret sauce is in the 
implementation, not the API schema.

All of these extensions are freely available (no signup required), 
but are closed source (for now?), provided under the [gh-beta-tos]. 
That said, the NeoVim extension is actually distributed via this 
public repo: https://github.com/github/copilot.vim.

As you can see, it consists of two parts:

* Integration with the editor implemented in vimscript (the 
`autoload` and `plugin` folders)
* Copilot Agent consisting of:
  * A minified JavaScript file (about 55k LoC after formatting with 
  Prettier), which contains the core of the frontend implementation
  * Tree-sitter parsers (compiled to WebAssembly) for Go, JavaScript, 
  Python, Ruby and TypeScript. I think Vladimir's work could be 
  really valuable here: https://github.com/CyberShadow/tree-sitter-d
  * Tokenizer data (Byte Pair Encoding, see [bpe])

---

In summary, even in the current state of affairs, I think it's 
technically possible to create an extension for other editors, 
including your MicroEmacs editor, by porting the vimscript part of 
the NeoVim extension and using Node.js to run the JavaScript & 
WebAssembly client-side core. None of this is documented, and all of 
it is subject to change, but as far as I can tell, they haven't tried 
to obfuscate the inner workings; it seems they simply didn't want to 
commit to a fully open-source version yet.

[gh-beta-tos]: https://docs.github.com/en/site-policy/github-terms/github-terms-of-service#j-beta-previews
[bpe]: https://towardsdatascience.com/byte-pair-encoding-subword-based-tokenization-algorithm-77828a70bee0
Apr 04 2022
Walter Bright <newshound2 digitalmars.com> writes:
Thank you for your extensive and thorough reply.

You're right, I was thinking of integrating it with MicroEmacs. I'd like it to 
be as simple as:

     hinttext = copilot(sourcetext);

which would make it great fun to play around with. Alas, I have little hope it 
would be that simple :-) since most programmers can't bear to make things
simple.
Apr 05 2022
Max Samukha <maxsamukha gmail.com> writes:
On Tuesday, 5 April 2022 at 07:23:53 UTC, Walter Bright wrote:
 Thank you for your extensive and thorough reply.

 You're right, I was thinking of integrating it with MicroEmacs. 
 I'd like it to be as simple as:

     hinttext = copilot(sourcetext);

 which would make it great fun to play around with. Alas, I have 
 little hope it would be that simple :-) since most programmers 
 can't bear to make things simple.
It works the other way, too. Many engineers are obsessed with simplicity. They go to great lengths to come up with the simplest solution where a complex one would suffice.))
Apr 05 2022
Abdulhaq <alynch4047 gmail.com> writes:
On Tuesday, 5 April 2022 at 12:46:17 UTC, Max Samukha wrote:
 On Tuesday, 5 April 2022 at 07:23:53 UTC, Walter Bright wrote:
 Thank you for your extensive and thorough reply.

 You're right, I was thinking of integrating it with 
 MicroEmacs. I'd like it to be as simple as:

     hinttext = copilot(sourcetext);

 which would make it great fun to play around with. Alas, I 
 have little hope it would be that simple :-) since most 
 programmers can't bear to make things simple.
 It works the other way, too. Many engineers are obsessed with 
 simplicity. They go to great lengths to come up with the 
 simplest solution where a complex one would suffice.))
Making things simple is the hard part.
Apr 05 2022
max haughton <maxhaton gmail.com> writes:
On Tuesday, 5 April 2022 at 16:13:08 UTC, Abdulhaq wrote:
 On Tuesday, 5 April 2022 at 12:46:17 UTC, Max Samukha wrote:
 On Tuesday, 5 April 2022 at 07:23:53 UTC, Walter Bright wrote:
 Thank you for your extensive and thorough reply.

 You're right, I was thinking of integrating it with 
 MicroEmacs. I'd like it to be as simple as:

     hinttext = copilot(sourcetext);

 which would make it great fun to play around with. Alas, I 
 have little hope it would be that simple :-) since most 
 programmers can't bear to make things simple.
 It works the other way, too. Many engineers are obsessed with 
 simplicity. They go to great lengths to come up with the 
 simplest solution where a complex one would suffice.))
 Making things simple is the hard part.
Making things is the hard part, I'd say.
Apr 05 2022
Walter Bright <newshound2 digitalmars.com> writes:
On 4/5/2022 5:46 AM, Max Samukha wrote:
 It works the other way, too. Many engineers are obsessed with simplicity. They 
 go to great lengths to come up with the simplest solution where a complex one 
 would suffice.))
Anybody can come up with a complex solution. A simple one takes genius. You know 
it's genius when others say: "phui, anyone could have done that!" Except that 
nobody did.
Apr 05 2022
"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Tue, Apr 05, 2022 at 01:09:56PM -0700, Walter Bright via Digitalmars-d wrote:
 On 4/5/2022 5:46 AM, Max Samukha wrote:
 It works the other way, too. Many engineers are obsessed with
 simplicity. They go to great lengths to come up with the simplest
 solution where a complex one would suffice.))
 Anybody can come up with a complex solution. A simple one takes
 genius. You know it's genius when others say: "phui, anyone could
 have done that!" Except that nobody did.
+100. As the old saying goes, you can always add another level of
abstraction (when faced with some programming problem). It takes genius
to *reduce* the level of abstraction *and* solve the problem at the same
time.

T

-- 
Those who don't understand Unix are condemned to reinvent it, poorly.
Apr 05 2022
Vladimir Panteleev <thecybershadow.lists gmail.com> writes:
On Tuesday, 5 April 2022 at 07:23:53 UTC, Walter Bright wrote:
 Thank you for your extensive and thorough reply.

 You're right, I was thinking of integrating it with MicroEmacs. 
 I'd like it to be as simple as:

     hinttext = copilot(sourcetext);

 which would make it great fun to play around with. Alas, I have 
 little hope it would be that simple :-) since most programmers 
 can't bear to make things simple.
I don't have experience with Copilot specifically but it's 
probably not that simple because of reasons such as:

1. Authentication. It is a paid service so access has to be 
authenticated.
2. Context is probably inferred not just from the text before the 
cursor, but also text after the cursor or maybe even other files 
from within the same project.
3. Session information, so that multiple queries can be 
associated together for additional context.
4. Multiple answers (or retrying for a different answer).
5. Feedback (the service probably wants to know which answer, if 
any, the user chose, so that they can improve future predictions).
6. Obligatory telemetry :)

A local (offline) though non-free Copilot alternative is TabNine. 
It completes on a much smaller scope (rest of line or so) and can 
understand D to a limited extent. TabNine is written in Rust.

There are some undertakings for a libre implementation, but none 
that I know of are at a point where they can be useful in 
practice.
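
To make the numbered points above concrete, here is a rough sketch of what a client-side interface might have to carry beyond a single string. Every name in it (`CompletionRequest`, `CompletionService`, `CannedService`, `reportChoice`) is hypothetical: the real Copilot API is undocumented, so this only illustrates the points, with a canned offline stub standing in for the service.

```d
import std.algorithm : min;

/// Hypothetical request shape; field comments map to the points above.
struct CompletionRequest
{
    string authToken;      // 1. authentication
    string prefix;         // 2. text before the cursor
    string suffix;         // 2. text after the cursor
    string sessionId;      // 3. ties related queries together
    size_t maxChoices = 1; // 4. how many alternative answers to return
}

struct CompletionResponse
{
    string[] choices;
}

/// Hypothetical service interface, not the actual Copilot API.
interface CompletionService
{
    CompletionResponse complete(CompletionRequest request);
    void reportChoice(string sessionId, size_t chosenIndex); // 5. feedback
}

/// Offline stub that returns canned text so the sketch can run.
final class CannedService : CompletionService
{
    size_t feedbackCount;

    CompletionResponse complete(CompletionRequest request)
    {
        auto canned = ["// TODO", "return 0;"];
        return CompletionResponse(canned[0 .. min(request.maxChoices, canned.length)]);
    }

    void reportChoice(string sessionId, size_t chosenIndex)
    {
        ++feedbackCount;
    }
}

/// The one-liner from the quoted post, layered on top of the richer interface.
string copilot(CompletionService service, string sourcetext)
{
    auto request = CompletionRequest("token", sourcetext, "", "session-1", 1);
    auto response = service.complete(request);
    if (response.choices.length == 0)
        return "";
    service.reportChoice(request.sessionId, 0);
    return response.choices[0];
}

void main()
{
    import std.stdio : writeln;

    auto service = new CannedService;
    writeln(copilot(service, "int main() {"));
}
```

The simple `copilot(sourcetext)` call is still possible as a wrapper; the extra state just has to live somewhere behind it.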
Apr 05 2022
Paolo Invernizzi <paolo.invernizzi gmail.com> writes:
On Monday, 4 April 2022 at 07:25:02 UTC, Petar Kirov [ZombineDev] 
wrote:
 On Sunday, 3 April 2022 at 23:25:45 UTC, Walter Bright wrote:
 This is extraordinary!

 Can Copilot be used standalone, i.e. not as a plugin to VS?
<snip>
 For 1 to work, you need to have access to 2, which is currently 
 in a free closed beta. Anyone can request access with their GH 
 account on the official page: https://copilot.github.com. I 
 don't know how exactly they decide who's eligible and how they 
 prioritize access to the beta, but my guess is they ought to 
 heavily favor older accounts with a big stream of contributions 
 (such as yours), than blank recently created accounts. For 
 example, I requested access to GH Codespaces Beta almost 
 immediately after it was announced, but didn't get access to it 
 for more than an year. On the other hand, I think GH Copilot 
 was announced at the end of June last year and I got access 
 sometime in August, so your mileage may vary.
As feedback: I requested access yesterday, after reading Petar's post (thank you), and I was admitted today, yay! I'm really curious about it!
Apr 05 2022
max haughton <maxhaton gmail.com> writes:
On Sunday, 3 April 2022 at 22:19:43 UTC, Petar Kirov [ZombineDev] 
wrote:
 (*) That is to say surprising for me, considering I was 
 expecting it to produce pure gibberish, as it has no semantic 
 understanding of neither the programming language, nor the 
 problem domain.
Expecting it to produce gibberish off the bat is slightly naive 
given recent machine learning advances (e.g. you can impress with 
a fairly simple neural network; when you have almost 200 
gigabytes' worth of parameters you can do wonders).

In my own testing it can do some clever things, but it also really 
gives away its lack of understanding of structure if you ask too 
much of it. That, and its urge to write C++ is too strong 
sometimes. Maybe you've been lucky or I haven't coddled it enough, 
e.g.

```
//struct Foo containing an int
struct Foo
{
    int x;
};
//It's C++
```

When it has had a nudge in the right direction it can be pretty 
interesting, e.g. it can do abductive reasoning to work out what 
you might want, as long as it can be inferred from the rest of the 
code.

Before it sees a pattern to copy:

```
int runGCC(string[] p)
{
    import std.gc;
    gc.collect();
    return 0;
}
```

After I've defined the runGCC thing properly:

```
//Run python interpreter with command and collect exit code
int runPython(string[] p)
{
    import std.process;
    const save = execute(["python"] ~ p);
    return save.status;
}
```

So better, but still fumbling around in the dark. I wouldn't be 
surprised if things are much better fairly shortly though.
Apr 04 2022