digitalmars.D - code databases for ai

monkyyy <crazymonkyyy gmail.com> writes:
I started the process of extracting all the code from the forums
3 weeks ago:

https://github.com/crazymonkyyy/dlangforums (I know of some flaws
here; it's AI slop, but semi-functional)

adr's style of code, giant files with example programs in
comments, needs some amount of processing (qwen doesn't read adr's
files without being explicitly told to, and I don't know if any of
them have the "attention" to handle "simple display")

Extracting links from dub webpages likely isn't that hard.

My own code is a horrible mess. I never got around to actually
cleaning up my repos when I planned on doing that last year, or
the year before, or the year before that. To say nothing of my
unnamed gists.

etc.

---

It's a big project to try to collect as much trusted code as
possible into one organization system. "RAG" is a bit of a meme,
but seeding a code base with known-good code (compared to AI
hallucinations, anyway), for a degree of taste and something that
actually compiles, is a real technique.

(Don't any of y'all tell me "I told you so" about dub; it will
still require processing.)

If anyone else is working on pieces, I'd like to know about it. I
have some theories about how to metaprogram to detect if a struct
is a container, if a function is a range algorithm, if a file is
a program, etc.
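
Before any real metaprogramming, a crude text-level pre-filter can already sort extracted snippets into rough buckets. The sketch below is Python and uses plain regexes, which is a different and much weaker technique than compile-time introspection in D. The function names and the three buckets are made up for illustration, and the only D facts it relies on are that programs define `main` and that input ranges expose `empty`, `front` and `popFront`.

```python
import re

# Heuristic only: regexes over raw source text, not a D parser. A `main`
# inside a comment or string literal will fool it.
MAIN_RE = re.compile(r"^\s*(?:void|int)\s+main\s*\(", re.MULTILINE)

# The three input-range primitives in D.
RANGE_MEMBERS = ("empty", "front", "popFront")


def looks_like_program(source: str) -> bool:
    """True if the snippet appears to define a D `main` function."""
    return MAIN_RE.search(source) is not None


def mentions_range_primitives(source: str) -> bool:
    """True if the snippet mentions all three input-range primitives."""
    return all(re.search(rf"\b{name}\b", source) for name in RANGE_MEMBERS)


def classify(source: str) -> str:
    """Sort a snippet into a hypothetical bucket: program, range or fragment."""
    if looks_like_program(source):
        return "program"
    if mentions_range_primitives(source):
        return "range"
    return "fragment"
```

Anything this marks as "program" could then be handed to the compiler, which is the only check that actually counts.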

Has anyone done anything on this subject? Is anyone interested in
it? It may need a real hosting solution: GitHub has file size
caps that I ran into with just the forums, so if I start
extracting from dub and then try to host that, GitHub may get
quite upset.
Dec 13
jmh530 <john.michael.hall gmail.com> writes:
On Saturday, 13 December 2025 at 23:27:40 UTC, monkyyy wrote:
 [snip]

 Has anyone done anything on this subject? Is anyone interested 
 in it? It may need a real hosting solution, github has file 
 size caps that I ran into with just the forums if I start 
 extracting from dub and then try to host that github may get 
 quite upset.
I talked a little about what I had tried here: https://forum.dlang.org/post/sfuxoiwthnqacwmfwxxs@forum.dlang.org
Dec 16
monkyyy <crazymonkyyy gmail.com> writes:
On Tuesday, 16 December 2025 at 11:59:49 UTC, jmh530 wrote:
 I talked a little about what I had tried here:
 https://forum.dlang.org/post/sfuxoiwthnqacwmfwxxs@forum.dlang.org
I'm not entirely sure what that markup format is that snar found/made, but I wonder if it's trivially convertible.
Dec 16
jmh530 <john.michael.hall gmail.com> writes:
On Tuesday, 16 December 2025 at 16:07:27 UTC, monkyyy wrote:
 I'm not entirely sure what that markup format is that snar
 found/made, but I wonder if it's trivially convertible.
Honestly, my first attempt at it, I basically just asked ChatGPT what I should do to make a Dlang RAG and it recommended basically what I describe in that post. So I put together the single combined file without thinking of Har. It basically looks like

 [File A contents]
 [File B contents]
 etc.

Har [1] is used for run.dlang.io as a format for handling multiple files. The examples on that page are very similar to what I have above, but it's nothing special per se.

As it stands, that repo just handles reading Har, not writing them, so it doesn't do much good for this application (writing is the issue). I just figured that rather than writing my own thing, I would contribute to that. But then I got distracted by other things and haven't come back to it.

[1] https://github.com/marler8997/har
Dec 16
jmh530 <john.michael.hall gmail.com> writes:
On Tuesday, 16 December 2025 at 18:59:54 UTC, jmh530 wrote:
 [snip]
Didn't display correctly due to markdown being checked. Should look like this


 [File A contents]



 [File B contents]

 etc.
Dec 16
monkyyy <crazymonkyyy gmail.com> writes:
On Tuesday, 16 December 2025 at 18:59:54 UTC, jmh530 wrote:
 I basically just asked ChatGPT what I should do to make a Dlang
 RAG and it recommended basically what I describe in that post.
I would strongly suggest the first 10 lines of every project should be a human-written rant. Taste is extremely important and possibly can be lost.
Dec 16
jmh530 <john.michael.hall gmail.com> writes:
On Wednesday, 17 December 2025 at 01:17:25 UTC, monkyyy wrote:
 I would strongly suggest the first 10 lines of every project
 should be a human-written rant. Taste is extremely important
 and possibly can be lost.
I still don't trust the code generated by the latest versions of ChatGPT without reviewing it.
Dec 17
monkyyy <crazymonkyyy gmail.com> writes:
On Wednesday, 17 December 2025 at 18:01:50 UTC, jmh530 wrote:
 I still don't trust the code generated by the latest versions
 of ChatGPT without reviewing it.
Worst of both worlds. It's the first move of a sudoku puzzle that's the hardest, and it requires the entire structure to be understood. It's the title that decides the conclusion of an essay. They just fill in details. A compiler may prompt an edit, but if there's a logic error in the way the butterfly flaps its wing, nothing the compiler does will detect it.

https://youtu.be/dcolM6W5Odc?si=EVFgJzFH9jmape7A&t=510
 children with 10 AIs, or 1 AI writing plans for essays, show
 lower brain activity
 children writing a plan first, before being given the AI, show
 more brain activity than AI-less children
Dec 17
Basile B. <b2.temp gmx.com> writes:
On Saturday, 13 December 2025 at 23:27:40 UTC, monkyyy wrote:
 I started the process of extracting all the code from the 
 forums 3 weeks ago:

 [...]
That reminds me of an old idea... make/update a symbol database when you compile a project. Then your _next-gen_ completion daemon can use it.
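
A minimal sketch of that idea in Python with the standard `sqlite3` module. The schema, the function names and the `(name, kind, line)` tuples are all assumptions for illustration; getting those tuples out of a compiler run is the hard part and is not shown.

```python
import sqlite3


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the symbol database with a hypothetical schema."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS symbols ("
        " name TEXT NOT NULL, kind TEXT NOT NULL,"
        " file TEXT NOT NULL, line INTEGER NOT NULL,"
        " PRIMARY KEY (name, file, line))"
    )
    return db


def update_file(db: sqlite3.Connection, file: str, symbols) -> None:
    """Replace all rows for one source file, e.g. after each compile."""
    with db:  # one transaction: stale rows out, fresh rows in
        db.execute("DELETE FROM symbols WHERE file = ?", (file,))
        db.executemany(
            "INSERT INTO symbols (name, kind, file, line) VALUES (?, ?, ?, ?)",
            [(name, kind, file, line) for name, kind, line in symbols],
        )


def complete(db: sqlite3.Connection, prefix: str, limit: int = 10) -> list[str]:
    """Case-sensitive prefix lookup, the query a completion daemon would run."""
    rows = db.execute(
        "SELECT DISTINCT name FROM symbols"
        " WHERE substr(name, 1, ?) = ? ORDER BY name LIMIT ?",
        (len(prefix), prefix, limit),
    )
    return [row[0] for row in rows]
```

Keying the update on the file means a recompile of one module only touches that module's rows, so the database stays cheap to keep current.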
Dec 16
monkyyy <crazymonkyyy gmail.com> writes:
On Tuesday, 16 December 2025 at 18:19:46 UTC, Basile B. wrote:
 That reminds me of an old idea... make/update a symbol database
 when you compile a project. Then your _next-gen_ completion
 daemon can use it.
I have been against these tools for myself and know nothing about them. What's the easiest way to grab the data?
Dec 16