
115 Illegal instruction Container exit - Solution in thread #1331

Closed
lishaojun616 opened this issue May 10, 2024 · 30 comments
Assignees: timothycarambat
Labels: bug, core-team-only, Docker

Comments

@lishaojun616

AnythingLLM is installed on an Ubuntu server.
In the system LLM settings, the system can connect to the Ollama server and get the models.
But when chatting in a workspace, the Docker container exits.
1. The info shown in the browser:
[screenshot]
2. The docker logs:
"/usr/local/bin/docker-entrypoint.sh: line 7: 115 Illegal instruction (core dumped) node /app/server/index.js"

What's the problem?

@timothycarambat
Member

It appears to be the Docker engine - see #1290 (comment).

That issue is specifically occurring on Mac, but it is the same on Linux/Ubuntu as well.

@SyuanYo

SyuanYo commented May 14, 2024

Hi @lishaojun616 , have you managed to resolve the issue?
I am experiencing the same situation as you described in #1323.

@joachimt-git

Hello everyone,

I'm facing the same problem.

Same setup: Ubuntu 22.04 LTS, using Ollama as the LLM.

I have installed the newest Docker engine and built anything-llm with docker-compose-v2.

...
[TELEMETRY SENT] {
  event: 'workspace_created',
  distinctId: 'c060354a-b171-4702-b83a-9da2ef0612e4',
  properties: {
    multiUserMode: false,
    LLMSelection: 'ollama',
    Embedder: 'native',
    VectorDbSelection: 'lancedb',
    runtime: 'docker'
  }
}
[Event Logged] - workspace_created
[TELEMETRY SENT] {
  event: 'onboarding_complete',
  distinctId: 'c060354a-b171-4702-b83a-9da2ef0612e4',
  properties: { runtime: 'docker' }
}
[NativeEmbedder] Initialized
/usr/local/bin/docker-entrypoint.sh: line 7:   119 Illegal instruction     (core dumped) node /app/server/index.js
gitlab-runner@gradio:~$ docker --version
Docker version 26.1.3, build b72abbb

This issue is marked as closed. Is there a solution available?

Best regards

Joachim

@xsn-cloud

xsn-cloud commented May 20, 2024

Encountering the same issue. Using Ubuntu Server 22.04 with Docker, Yarn, and Node installed as recommended in HOW_TO_USE_DOCKER.md#how-to-use-dockerized-anything-llm. Ollama is on another machine, serving at 0.0.0.0 (other remote apps function correctly with this setup, even in Docker).

EDIT: Forgot to mention:

Docker version 26.1.3, build b72abbb

Ubuntu is running in a VM

Experiencing the identical error as posted by @joachimt-git:

[Event Logged] - update_llm_provider
[Event Logged] - update_embedding_engine
[Event Logged] - update_vector_db
[TELEMETRY SENT] {
  event: 'enabled_multi_user_mode',
  distinctId: '20b022d1-14cc-490c-ab86-4f941a32f7bc',
  properties: { multiUserMode: true, runtime: 'docker' }
}
[Event Logged] - multi_user_mode_enabled
[TELEMETRY SENT] {
  event: 'login_event',
  distinctId: '20b022d1-14cc-490c-ab86-4f941a32f7bc::1',
  properties: { multiUserMode: false, runtime: 'docker' }
}
[Event Logged] - login_event
[TELEMETRY SENT] {
  event: 'workspace_created',
  distinctId: '20b022d1-14cc-490c-ab86-4f941a32f7bc::1',
  properties: {
    multiUserMode: true,
    LLMSelection: 'ollama',
    Embedder: 'native',
    VectorDbSelection: 'lancedb',
    runtime: 'docker'
  }
}
[Event Logged] - workspace_created
[TELEMETRY SENT] {
  event: 'onboarding_complete',
  distinctId: '20b022d1-14cc-490c-ab86-4f941a32f7bc',
  properties: { runtime: 'docker' }
}
[NativeEmbedder] Initialized
/usr/local/bin/docker-entrypoint.sh: line 7:   117 Illegal instruction     (core dumped) node /app/server/index.js

Any suggestions for resolving this?

How can I assist?

Thanks

@timothycarambat
Member

This is certainly a configuration issue. Considering all is well until the native embedder is called, this might be architecture related - but we support both ARM and x86. Regardless, here are my exact steps, which fail to repro:

  1. Obtain Ubuntu 22.04 LTS AWS instance - used t3.small - x86
  2. curl -fsSL https://get.docker.com -o get-docker.sh
  3. sudo sh get-docker.sh
  4. sudo usermod -aG docker $USER
  5. docker -v

Docker version 26.1.3, build b72abbb

  6. docker pull mintplexlabs/anythingllm

Run:

export STORAGE_LOCATION=$HOME/anythingllm && \
mkdir -p $STORAGE_LOCATION && \
touch "$STORAGE_LOCATION/.env" && \
docker run -d -p 3001:3001 \
--cap-add SYS_ADMIN \
-v ${STORAGE_LOCATION}:/app/server/storage \
-v ${STORAGE_LOCATION}/.env:/app/server/.env \
-e STORAGE_DIR="/app/server/storage" \
mintplexlabs/anythingllm

Access via instance IP on port 3001 - I get the interface, onboard, create workspace, and upload documents.

[Event Logged] - workspace_created
[TELEMETRY SENT] {
  event: 'onboarding_complete',
  distinctId: 'fe73dd18-d52e-4a4c-bb62-a6adba7491d1',
  properties: { runtime: 'docker' }
}
-- Working readme.pdf --
-- Parsing content from pg 1 --
-- Parsing content from pg 2 --
-- Parsing content from pg 3 --
-- Parsing content from pg 4 --
-- Parsing content from pg 5 --
[SUCCESS]: readme.pdf converted & ready for embedding.

[CollectorApi] Document readme.pdf uploaded processed and successfully. It is now available in documents.
[TELEMETRY SENT] {
  event: 'document_uploaded',
  distinctId: 'fe73dd18-d52e-4a4c-bb62-a6adba7491d1',
  properties: { runtime: 'docker' }
}
[Event Logged] - document_uploaded
Adding new vectorized document into namespace sample
[NativeEmbedder] Initialized
[RecursiveSplitter] Will split with { chunkSize: 1000, chunkOverlap: 20 }
Chunks created from document: 14
[NativeEmbedder] The native embedding model has never been run and will be downloaded right now. Subsequent runs will be faster. (~23MB)
[NativeEmbedder] Downloading Xenova/all-MiniLM-L6-v2 from https://huggingface.co/
....truncated
[NativeEmbedder - Downloading model] onnx/model_quantized.onnx 100%
[NativeEmbedder] Embedded Chunk 1 of 1
Inserting vectorized chunks into LanceDB collection.
Caching vectorized results of custom-documents/readme.pdf-d717ca8c-6ac0-4514-8d6d-94ac48760afe.json to prevent duplicated embedding.
[TELEMETRY SENT] {
  event: 'documents_embedded_in_workspace',
  distinctId: 'fe73dd18-d52e-4a4c-bb62-a6adba7491d1',
  properties: {
    LLMSelection: 'openai',
    Embedder: 'native',
    VectorDbSelection: 'lancedb',
    runtime: 'docker'
  }
}
[Event Logged] - workspace_documents_added

Considering all of this occurs right after [NativeEmbedder] Initialized, to me this would indicate a lack of resources to run the local embedder; if that is the case, you should allocate more resources to the container or use another embedder. That is the only way I could imagine a full core dump or illegal instruction occurring. Either that, or the underlying chip architecture is not found/supported by Xenova's transformers.js.
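
For reference, a minimal sketch of giving the container more resources, reusing the run command from the repro steps above (the --memory and --cpus values are illustrative, not from this thread):

export STORAGE_LOCATION=$HOME/anythingllm && \
mkdir -p $STORAGE_LOCATION && \
touch "$STORAGE_LOCATION/.env" && \
docker run -d -p 3001:3001 \
--cap-add SYS_ADMIN \
--memory 4g --cpus 2 \
-v ${STORAGE_LOCATION}:/app/server/storage \
-v ${STORAGE_LOCATION}/.env:/app/server/.env \
-e STORAGE_DIR="/app/server/storage" \
mintplexlabs/anythingllm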

@joachimt-git

Hi Timothy,

I think what @xsn-cloud and I have in common is that we both use Ollama.

Might that be the cause of the failure?

Joachim

@timothycarambat
Member

It would not, since the exception is in the AnythingLLM container; if there were an illegal instruction in the Ollama program, it would throw in that container/program. All AnythingLLM does is execute a fetch request to the Ollama instance, which would be permitted in any container.
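
For anyone who wants to rule out the Ollama side anyway, a quick reachability test from the Docker host is a sketch like this (the host and port are placeholders; /api/tags is Ollama's standard model-listing endpoint):

curl http://<ollama-host>:11434/api/tags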

@xsn-cloud

xsn-cloud commented May 21, 2024

@timothycarambat Thanks for addressing this issue. Please let me know if there's anything I can assist you with.

I've conducted the following experiment, also considering that it might be an issue with Docker running on VMs, and to rule out resource issues:

UPDATE: Also tested it in Windows 10 (WSL, Docker for Windows, Docker version 26.1.1, build 4cf5afa): Same issue

  • Clean install of Debian 12 on baremetal - Dual Xeon E5-2650 v2 @ 2.60GHz with 96GB of RAM
  • Docker version 26.1.3, build b72abbb
  • Followed the procedure exactly as you did in your previous comment (the same one you used on an Ubuntu 22.04 LTS AWS)
  • No documents loaded
  • During the onboarding, anything-llm successfully communicates with ollama, accurately retrieves the installed models, and allows the selection of the model without issues.
  • Model selected: llama3 with 4K context window. Other models tested with same results.

This is the outcome after the onboarding when attempting to send a "hello" in a new chat. (docker logs -f [containerid]).

Please note that the container was restarted from scratch to ensure the clarity of the logs.

Collector hot directory and tmp storage wiped!
Document processor app listening on port 8888
Environment variables loaded from .env
Prisma schema loaded from prisma/schema.prisma

✔ Generated Prisma Client (v5.3.1) to ./node_modules/@prisma/client in 338ms

Start using Prisma Client in Node.js (See: https://pris.ly/d/client)

'''
import { PrismaClient } from '@prisma/client'
const prisma = new PrismaClient()
'''


or start using Prisma Client at the edge (See: https://pris.ly/d/accelerate)
'''
import { PrismaClient } from '@prisma/client/edge'
const prisma = new PrismaClient()
'''

See other ways of importing Prisma Client: http://pris.ly/d/importing-client

Environment variables loaded from .env
Prisma schema loaded from prisma/schema.prisma
Datasource "db": SQLite database "anythingllm.db" at "file:../storage/anythingllm.db"

20 migrations found in prisma/migrations


No pending migrations to apply.
┌─────────────────────────────────────────────────────────┐
│  Update available 5.3.1 -> 5.14.0                       │
│  Run the following to update                            │
│    npm i --save-dev prisma@latest                       │
│    npm i @prisma/client@latest                          │
└─────────────────────────────────────────────────────────┘
[TELEMETRY ENABLED] Anonymous Telemetry enabled. Telemetry helps Mintplex Labs Inc improve AnythingLLM.
prisma:info Starting a sqlite pool with 33 connections.
fatal: not a git repository (or any of the parent directories): .git
getGitVersion Command failed: git rev-parse HEAD
fatal: not a git repository (or any of the parent directories): .git

[TELEMETRY SENT] {
  event: 'server_boot',
  distinctId: '4f39e3fb-ac8c-4043-9586-c21ef46b0c47',
  properties: { commit: '--', runtime: 'docker' }
}
[CommunicationKey] RSA key pair generated for signed payloads within AnythingLLM services.
Primary server in HTTP mode listening on port 3001
[NativeEmbedder] Initialized
/usr/local/bin/docker-entrypoint.sh: line 7:   163 Illegal instruction     (core dumped) node /app/server/index.js

One more clarification: The error occurs after sending a message in the chatbox. Until then, the last message displayed is [NativeEmbedder] Initialized, and it remains unchanged until the message is sent.

Thanks a lot for your time.

@timothycarambat
Member

If you were not using the native embedder, this problem would not surface. The only commonality across all of this is varying CPUs. Transformers.js, which runs the native embedder, uses the ONNX runtime, and at this point the root cause has to be coming from there, as this only occurs when using the native embedder and those are the supporting libraries that enable that functionality.

@jorgen-k

I had the same issue running Docker in Ubuntu 24.04 VM on a Proxmox host. I switched the CPU in the guest to "host," and it fixed the problem. Just wanted to share in case anyone else is having the same struggle I did. Hope this helps!
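
For reference, a sketch of that change from the Proxmox CLI (<vmid> is a placeholder; the same setting is available in the web UI under the VM's Hardware > Processors, and the guest needs a full stop/start for it to take effect):

qm set <vmid> --cpu host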

@timothycarambat
Member

Does the CPU you swapped to support AVX2?

@computersrmyfriends

No, my CPU does not support AVX2; however, it supports AVX.

@jorgen-k

Does the CPU you swapped to support AVX2?

I am sorry, I do not know how to check that. I just changed to "host", which is an Intel Core i9-9900K CPU.

@timothycarambat
Member

At this time, the working hypothesis is that since Transformers.js uses the ONNX runtime, it will fail to execute any model (including the built-in embedder) if AVX2 is not supported.
https://github.com/microsoft/onnxruntime

@jorgen-k https://www.intel.com/content/www/us/en/products/sku/186605/intel-core-i99900k-processor-16m-cache-up-to-5-00-ghz/specifications.html

Instruction Set Extensions Intel® SSE4.1, Intel® SSE4.2, Intel® AVX2

Supports AVX2
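
For anyone unsure how to check for the flag (as asked above), a quick sketch on Linux - this prints avx2 once if the (virtual) CPU exposes it and nothing otherwise:

grep -o -m1 avx2 /proc/cpuinfo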

@timothycarambat timothycarambat pinned this issue May 24, 2024
@joachimt-git

I am using KVM as the hypervisor and the virtual CPU doesn't support AVX2. When I configure a passthrough of the CPU (which does support AVX2), as @jorgen-k suggested, it works for me as well.

@computersrmyfriends

I ran out of luck with the AVX CPU. It's a Xeon 2660 but only supports AVX. Had to find another machine.

@Smocvin

Smocvin commented May 24, 2024

I had the same issue with "/usr/local/bin/docker-entrypoint.sh: line 7: 115 Illegal instruction (core dumped) node /app/server/index.js". It seems the new Docker images of AnythingLLM have some issues, possibly on older systems. To fix the issue, I tried using older Docker versions and previous AnythingLLM images. While the older Docker versions did not resolve the issue, the older AnythingLLM images worked great. The newest working version for me was "sha256:1d994f027b5519d4bc5e1299892e7d0be1405308f10d0350ecefc8e717d3154f". You can find it here: https://github.com/Mintplex-Labs/anything-llm/pkgs/container/anything-llm/209298508

Running on Centos7 Linux with (CWP7), 2X Intel(R) Xeon(R) CPU E5-2680 v2, 2X Nvidia 2080TI GPUs
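
For reference, pulling that specific version by digest would look roughly like this (a sketch; it assumes the package is publicly pullable from GHCR under the path in the link above):

docker pull ghcr.io/mintplex-labs/anything-llm@sha256:1d994f027b5519d4bc5e1299892e7d0be1405308f10d0350ecefc8e717d3154f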

@timothycarambat
Member

timothycarambat commented May 25, 2024

@Smocvin, excellent work. Okay, then that pretty much nails down commit ca63012 as the issue commit. In that commit we moved from lancedb 0.1.19 to 0.4.11 (which is what we use on the desktop version).

However, given how this issue seems to only be a problem with certain CPUs, we have two choices:
Bump to 0.5.0 and see if that fixes it, or roll back to 0.1.19. Given that we do not leverage or dive deep into LanceDB's API much, the code change is quite minimal or none.
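
For context, the rollback amounts to pinning the LanceDB client in the server package back to the pre-bump version, roughly as in this sketch (the "vectordb" package name is taken from a later comment in this thread; the path is an assumption):

cd server && yarn add vectordb@0.1.19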

What I will need though is some help from the community as I do not have a single machine, VM, or instance that I can replicate this bug with. So my ask is:

  • If you are getting this bug and it is on a cloud/container service: what service and instance specs are you using, so we can provision a test instance for replication and debugging?

or

If anyone is willing to help debug the hard way, I am going to create two new tags on Docker, :lancedb_bump and :lancedb_revert, and I would need someone suffering from this issue to pull both and see which works.

Obviously if we can bump up, that would be ideal, but I would rather not field this issue for the rest of time since lancedb should just work.

Links to images

lancedb_bump: docker pull mintplexlabs/anythingllm:lancedb_bump
https://hub.docker.com/layers/mintplexlabs/anythingllm/lancedb_bump/images/sha256-40b0b28728d1bb481f01e510e96351a1970ac3fafafe4b2641cb264f0e7f8a93?context=repo

lancedb_revert: docker pull mintplexlabs/anythingllm:lancedb_revert
https://hub.docker.com/layers/mintplexlabs/anythingllm/lancedb_revert/images/sha256-f6a8d37a305756255302a8883e445056e1ab2f9ecf301f7c542685689436685d?context=repo

@timothycarambat timothycarambat self-assigned this May 25, 2024
@timothycarambat timothycarambat added the Docker and investigating labels May 25, 2024
@acote88

acote88 commented May 25, 2024

Can repro with a basic cloud instance on Vultr with the following specs: Cloud Compute - Shared CPU, Ubuntu 22.04 LTS x64, Intel High Performance, 25 GB NVMe, 1 vCPU, 1 GB Ram.

Then I basically just:

export STORAGE_LOCATION=$HOME/anything-llm
docker run -d -p 3001:3001 --cap-add SYS_ADMIN -v ${STORAGE_LOCATION}:/app/server/storage -v ${STORAGE_LOCATION}/.env:/app/server/.env -e STORAGE_DIR="/app/server/storage" mintplexlabs/anythingllm

Configured with OpenAI / lancedb. At that point, I just tried any chat, e.g. typed 'hello', and then it hangs for a bit, comes up with the error message shown above, and I can see the Docker container died with the log:

/usr/local/bin/docker-entrypoint.sh: line 7:   102 Illegal instruction     (core dumped) node /app/server/index.js

@Dozer316

I'm happy to help debug here locally with the newly created image tags when available. I have two machines I can test on here with AVX (Debian docker) and AVX2 (Windows docker desktop). I get the core dump on the AVX machine with :latest but the AVX2 machine runs the container fine so I can provide output from both of them if needed.

@timothycarambat
Member

@Dozer316 @acote88 @computersrmyfriends can any of you who have this issue on the master/latest image check and see if lancedb_bump or lancedb_revert works on the impacted machine?

Hopefully the _bump image works, otherwise we are in for some pain, but at least we can debug from there. I am friends with the LanceDB team so I can escalate to them if the issue persists.
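
For anyone testing, running either tag is the same command as before with the image reference swapped (sketch):

docker pull mintplexlabs/anythingllm:lancedb_bump
docker run -d -p 3001:3001 --cap-add SYS_ADMIN \
-v ${STORAGE_LOCATION}:/app/server/storage \
-v ${STORAGE_LOCATION}/.env:/app/server/.env \
-e STORAGE_DIR="/app/server/storage" \
mintplexlabs/anythingllm:lancedb_bump

Then repeat with :lancedb_revert to compare.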

@Dozer316

Hey there - revert has solved the problem on the impacted machine, bump still core dumps unfortunately.

Thanks for taking a look at this for us.

@acote88

acote88 commented May 29, 2024

Same here. _revert works, _bump crashes. Cheers.

@xsn-cloud

xsn-cloud commented May 29, 2024

Results of the test:

lancedb_bump: Crashes
lancedb_revert: Works

Notes:

CPU: AVX only
Testing: Tested with local documents; works perfectly.

(edited: several typos, sorry)

@timothycarambat
Member

Thank you @Dozer316 @acote88 @xsn-cloud for all taking the time to test both, which is very tedious. I'll contact the LanceDB team, as well as see if we can roll back the Docker vectordb package in the interim.

@cyberlink1
Contributor

cyberlink1 commented Jun 6, 2024

I just closed out my report #1618 because it was caused by the same thing. AVX was not a flag on the virtual CPU.

I set the virtual CPU to pass through and it solved the issues.

Thank you @xsn-cloud

@timothycarambat timothycarambat added the bug and core-team-only labels and removed the investigating label Jun 6, 2024
@timothycarambat
Member

timothycarambat commented Jun 6, 2024

Okay, so the reason this issue occurs is that LanceDB has a minimum build target of Haswell as of version ~0.20. This is because performance with AVX2 is just much better.

So right now there are two options to work around this:

  • Upgrade or migrate use to a CPU that supports AVX2
  • We can maintain an image that works with the older vectordb package. Truthfully, I really don't want to do that, and we have to draw the line somewhere. The reason I don't is that while the code change is minor, this will likely become increasingly burdensome to maintain as we continue to bump lance to later versions. Knowing where the issue lies, though, is very useful.

Either way, the root cause is the requirement of the underlying CPU to have AVX2. Closing currently as wontfix but discussion is still open for any more commentary.

@timothycarambat timothycarambat closed this as not planned Jun 6, 2024
@timothycarambat timothycarambat changed the title from "115 Illegal instruction" to "115 Illegal instruction Container exit - Solution in thread" Jun 6, 2024
@acote88

acote88 commented Jun 7, 2024

Thanks for following up on this, Timothy. In case this can help others, I compared two types of instances on Vultr, one called "High Performance" and the other called "High Frequency". The "High Frequency" one does support AVX2, while the other doesn't. You can check by running:

cat /proc/cpuinfo | grep -i avx

@Nododot

Nododot commented Jun 14, 2024

You have no idea how long I've had to search everywhere and how many reconfigurations and reinstalls I did before I found this thread. Could you MAYBE write SOMEWHERE that currently AnythingLLM requires an AVX2 CPU to work properly?

@akrotor

akrotor commented Aug 1, 2024

Hello @timothycarambat

Thank you for publishing the Lancedb_revert image at all in the first place.

Googling the error message took me to this thread, which in turn links to this one.

To resolve the issue, all I had to do was update the docker run command with the lancedb_revert tag, and otherwise "off it went".

My pc is old, but it's what I've got and sadly upgrading just isn't on the cards any time soon - I'm grateful to have a way to try it out at all.

I appreciate it's unreasonable to put in ongoing effort for a small subset of users running into incompatibility problems because they insist on using a relic from the before times - especially since it's going to start increasingly cropping up elsewhere as well.

Having an image at all in the first place is great, but it'd be nice if there were some way to "run out the clock" on updates until breaking changes inevitably come along.

Would it be possible to have an unsupported update that pins the version of lancedb in place, dumps latest and/or dev over the top of it and "When it breaks, that's the end of the ride.... May the odds be ever in your favour"?

When it does, ideally the docker image gets a 2nd "unsupported final build" release based on that point version and that's the end of that.
