[BUG]: Scraping my Confluence and I get a weird error about invalid data #1854

Open
JayCroghan opened this issue Jul 11, 2024 · 6 comments
Labels
investigating: Core team or maintainer will or is currently looking into this issue
needs info / can't replicate: Issues that require additional information and/or cannot currently be replicated, but possible bug

@JayCroghan

How are you running AnythingLLM?

Docker (local)

What happened?

When I try to import my Confluence space I get the following error popup:

image

But then it continues in the background, I can see it is still pulling pages because it shows

[collector] info: [Confluence Loader]: Saving: ... to ...

in the docker logs for a long time after the initial error while it fetches the rest of the pages before it finally says

[backend] info: [CollectorApi] fetch failed

Which then leaves my vector DB empty. Is there anything I can do to make it ignore that one specific error coming from one specific page? My space is massive, 1.3 GB.
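
To illustrate what "ignore that one specific error" could look like, here is a minimal sketch of per-page error isolation. It is not AnythingLLM's collector code: `collect_pages`, `fetch_page`, `fake_fetch`, and the page IDs are hypothetical names used only for illustration.

```python
from typing import Callable, Iterable


def collect_pages(
    page_ids: Iterable[str],
    fetch_page: Callable[[str], str],
) -> tuple[dict[str, str], dict[str, str]]:
    """Fetch every page, recording failures instead of aborting the whole run."""
    saved: dict[str, str] = {}
    failed: dict[str, str] = {}
    for page_id in page_ids:
        try:
            saved[page_id] = fetch_page(page_id)
        except Exception as exc:  # assumption: any per-page failure is skippable
            failed[page_id] = str(exc)
    return saved, failed


def fake_fetch(page_id: str) -> str:
    # Stand-in for a real Confluence request; page "2" simulates the bad page.
    if page_id == "2":
        raise ValueError("invalid data")
    return f"content of page {page_id}"


saved, failed = collect_pages(["1", "2", "3"], fake_fetch)
print(sorted(saved))
print(failed)
```

With this shape, one bad page ends up in `failed` while the remaining pages are still saved, so the run can report which page caused the problem.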

Are there known steps to reproduce?

Just run the Confluence collector on my local Confluence or the one hosted on Atlassian with the same space imported.

@JayCroghan JayCroghan added the possible bug Bug was reported but is not confirmed or is unable to be replicated. label Jul 11, 2024
@shatfield4
Collaborator

What content do you have in the Confluence space? Are you able to make another space with just some text in it and test to see if that works?

@shatfield4 shatfield4 self-assigned this Jul 11, 2024
@shatfield4 shatfield4 added needs info / can't replicate Issues that require additional information and/or cannot currently be replicated, but possible bug and removed possible bug Bug was reported but is not confirmed or is unable to be replicated. labels Jul 11, 2024
@timothycarambat
Member

Collecting the space does not automatically embed it. First we fetch, and then you pick and choose which files you wish to embed. As @shatfield4 suggested, is there a small space you can test with, just to first ensure that it's not some kind of install/config issue?

Also are there API key limits in place here? I imagine not for local installs, but likely yes for cloud

@timothycarambat timothycarambat added the investigating Core team or maintainer will or is currently looking into this issue label Jul 11, 2024
@JayCroghan
Author

JayCroghan commented Jul 12, 2024

Ah ok, my mistake. I tried this against two Confluence instances with the same space imported. With the hosted Atlassian instance, it does actually load the list of all the pages for me to import. I have just clicked on that and it seems to have worked.

The locally hosted one did not have the same success. I had to put it behind a reverse proxy on https/443, because the regex you use to validate the URL would not accept a non-standard port, and once I enter that URL it throws the following error. I have checked hundreds of times: this space is active at this address, adding a page after the address does not help, I have regenerated the access key, the email address is correct, and Confluence is set up to use the nginx reverse proxy (the conf needs changing when accessed this way). It still always gives the same error.

[collector] error: undefined
[backend] info: [CollectorApi] Response could not be completed
[backend] info: [TELEMETRY SENT]

@timothycarambat
Member

What does the url look like schema-wise? Does it have a port number or something to that effect? The regex for the URL pattern is pretty strict and I can imagine a locally hosted one will not pass
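
As a rough illustration of how a strict URL pattern can reject a port number, here is a small sketch. Both patterns are hypothetical and written only for this example; they are not the regex AnythingLLM uses, and port `8090` is just a sample value.

```python
import re

# Hypothetical strict pattern: scheme, dotted hostname, optional path, no port.
STRICT = re.compile(r"^https?://[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+(/\S*)?$")

# Hypothetical relaxed pattern: the same, plus an optional ":port" after the host.
WITH_PORT = re.compile(
    r"^https?://[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+(:\d{1,5})?(/\S*)?$"
)

urls = [
    "https://domain.net/display/XXX",
    "https://domain.net:8090/display/XXX",
]
results = {
    url: (bool(STRICT.match(url)), bool(WITH_PORT.match(url))) for url in urls
}
for url, (strict_ok, relaxed_ok) in results.items():
    print(f"{url}: strict={strict_ok} relaxed={relaxed_ok}")
```

Under these assumed patterns, the URL with a port fails the strict check and passes only once an optional port group is allowed.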

@JayCroghan
Author

> What does the url look like schema-wise? Does it have a port number or something to that effect? The regex for the URL pattern is pretty strict and I can imagine a locally hosted one will not pass

https://domain.net/display/XXX

The nginx reverse proxy on 443 forwards to the non-standard port, so the https URL for the domain can use the standard port. It has a valid cert for domain.net too.

If I browse the URL I get exactly what I would expect.

image

@JayCroghan
Author

What’s the best way for me to debug this and see what’s throwing the error? I didn’t see any options to add additional logging to the docker logs? I’m a software engineer so I’m handy enough with stuff. I was thinking of changing the regex on the URL and compiling from source before remembering I could use an nginx reverse proxy 😂 Any way to add verbose debug logging? Am I just stupid and missed that in the docs?
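
One way to narrow this down outside the app might be to send a single request to the Confluence REST API by hand and look at the raw response. The sketch below rests on assumptions: the `/rest/api/content` path is the Server/Data Center form (Confluence Cloud prefixes it with `/wiki`), whether Basic auth with an email and token is accepted depends on the Confluence edition, and `build_request`, `probe`, the email, and the token are placeholders, not anything taken from AnythingLLM.

```python
import base64
import urllib.error
import urllib.request


def build_request(
    base_url: str, space_key: str, user: str, token: str
) -> urllib.request.Request:
    """Build a GET for one page of the space via the REST content endpoint."""
    # Assumption: Server/Data Center path; Cloud would use /wiki/rest/api/content.
    url = f"{base_url.rstrip('/')}/rest/api/content?spaceKey={space_key}&limit=1"
    credentials = base64.b64encode(f"{user}:{token}".encode()).decode()
    return urllib.request.Request(
        url, headers={"Authorization": f"Basic {credentials}"}
    )


def probe(request: urllib.request.Request) -> str:
    """Send the request and return the status or the error actually raised."""
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return f"HTTP {response.status}"
    except urllib.error.HTTPError as exc:
        return f"HTTP {exc.code}: {exc.reason}"
    except urllib.error.URLError as exc:
        return f"connection failed: {exc.reason}"


# Placeholder values; nothing is sent until probe(request) is called.
request = build_request(
    "https://domain.net", "XXX", "user@example.com", "placeholder-token"
)
print(request.full_url)
```

Calling `probe(request)` from inside the container would show whether the failure is a TLS or connection problem, an auth rejection, or something else, which is more than the `undefined` in the collector log reveals.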

@timothycarambat timothycarambat self-assigned this Jul 13, 2024
3 participants