Complete installation, configuration, health monitoring & troubleshooting
| Component | This Guide's Setup | Notes |
|---|---|---|
| GPU | 2x NVIDIA H200 SXM | Confirmed working. All expected values in this guide are for H200. |
| VRAM per GPU | 143,771 MiB (~141GB) | After model load: ~132,964 MiB GPU0 / ~131,399 MiB GPU1 |
| CUDA Version | 12.8 | Minimum: 12.4. Tested on 12.8. |
| Driver | 570.211.01 | Any 520+ should work |
| GPU Count | Exactly 2 | Guide uses --data-parallel-size 2 |
| System RAM | 64GB+ | Needed for build + model loading |
| Disk | 300GB+ persistent | Model ~140GB + builds ~50GB + chain ~5GB |
| OS | Ubuntu 22.04 | RunPod default image |
| Provider | RunPod | See Section 00a for other providers |
| GPU power at full mining | ~690W each (near 700W TDP) | This is the health indicator — if power is 120W, mining is not happening |
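The power-draw indicator in the last row can be checked from a script. The following is a minimal sketch, assuming the CSV layout printed by `nvidia-smi --query-gpu=index,power.draw --format=csv,noheader` (lines such as `0, 689.12 W`). The 600 W threshold and the helper names are illustrative choices based on the healthy range quoted in this guide, not part of Pearl.

```python
import subprocess

# Assumed threshold: this guide treats roughly 600-690 W as healthy on H200.
MIN_MINING_WATTS = 600.0


def parse_power_csv(text):
    """Parse 'index, power W' lines into {gpu_index: watts}."""
    readings = {}
    for line in text.strip().splitlines():
        index, power = [field.strip() for field in line.split(",")]
        readings[int(index)] = float(power.split()[0])
    return readings


def idle_gpus(readings, min_watts=MIN_MINING_WATTS):
    """Return the GPU indices drawing less than min_watts."""
    return sorted(i for i, watts in readings.items() if watts < min_watts)


def read_power():
    """Query nvidia-smi for the current power draw of every GPU."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,power.draw", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_power_csv(out)
```

Calling `idle_gpus(read_power())` on the miner should return an empty list while both GPUs are working; any index in the list points at a GPU that is probably idle.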
| GPU | Community Status | Notes |
|---|---|---|
| H200 SXM ×2 | ✅ This guide — confirmed | Reference setup for this guide |
| H100 SXM ×2 | ✅ Community confirmed | Works. 80GB VRAM each. Adjust VRAM expectations in health checks. |
| H100 NVL ×1 + H200 ×1 | ⚠️ Community reported | Mixed setup. Some users got blocks. |
| Single H200 | ⚠️ Possible | Use --data-parallel-size 1, 64 requests. Lower hashrate. |
| A100 ×2 | ❌ Not recommended | Ampere architecture — Pearl kernel targets Hopper. May not compile. |
| RTX 4090 ×2 | ❌ Insufficient VRAM | 24GB each = 48GB total. Not enough for 70B model. |
Run these immediately after SSH-ing in. If any check fails, reprovision before continuing.
nvidia-smi
df -h | sort -rh | head -8
free -h
lsb_release -a 2>/dev/null || cat /etc/os-release | head -5
curl -s --max-time 5 https://github.com > /dev/null && echo "GitHub OK" && curl -s --max-time 5 https://huggingface.co > /dev/null && echo "HuggingFace OK"
| Scenario | Peers | Impact |
|---|---|---|
| Port 44108 NOT exposed (RunPod default) | ~16 outbound only | Works fine. Block propagation slightly slower. |
| Port 44108 exposed | Up to 200 inbound+outbound | Better connectivity, faster block propagation. |
| Discord reports of 200+ peers | 200+ | These users have inbound port exposed AND are on providers with open firewall. |
echo "=== GPU ==="; nvidia-smi --query-gpu=name,memory.total,driver_version --format=csv,noheader; echo "=== DISK ==="; df -h | sort -rh | head -5; echo "=== RAM ==="; free -h | grep Mem; echo "=== OS ==="; lsb_release -d 2>/dev/null; echo "=== NETWORK ==="; curl -s --max-time 5 https://github.com > /dev/null && echo "GitHub OK" || echo "GitHub BLOCKED"
This guide was built and tested on RunPod. The core setup is identical across providers — only storage paths and a few installation details differ. Use this table to adapt the guide for your provider.
| Provider | HF_HOME path | UV Cache path | Persistent storage | Notes |
|---|---|---|---|---|
| RunPod ✅ Tested | /workspace/.hf | /workspace/.uv-cache | /workspace | Deadsnakes PPA blocked; use UV for Python 3.12. Ubuntu 22.04. |
| Vast.ai | /root/.cache/huggingface or /workspace/.hf | /root/.cache/uv | /workspace (if attached) | Use Custom Template. Ubuntu 22.04 works. apt python3.12 may work via deadsnakes. |
| Lambda Labs | /home/ubuntu/.cache/huggingface | /home/ubuntu/.cache/uv | /home/ubuntu | Ubuntu 22.04. Python 3.12 via deadsnakes should work. Run as ubuntu, not root. |
| CoreWeave | /mnt/data/.hf | /mnt/data/.uv-cache | /mnt/data | Kubernetes-based. Persistent volume must be mounted manually. |
| Paperspace | /notebooks/.hf | /notebooks/.uv-cache | /notebooks | Ubuntu 20.04/22.04. Python 3.12 via deadsnakes. |
| Any provider | Any path with 200GB+ free space | Any writable path | Check df -h for largest partition | Sort the df -h output by size to find the largest partition. |
Replace every occurrence of /workspace/.hf with your provider's persistent storage path, and /workspace/.uv-cache with the UV cache path. The two places these appear are:
export HF_HOME=/YOUR_PROVIDER_PATH/.hf
cd /root/pearl && export UV_CACHE_DIR=/YOUR_PROVIDER_PATH/.uv-cache && export HF_HOME=/YOUR_PROVIDER_PATH/.hf && task build:miner
| Provider | Python 3.12 Method | Command |
|---|---|---|
| RunPod | apt blocked — use UV | uv python install 3.12 |
| Vast.ai | Try apt first, fallback to UV | apt-get install -y python3.12 || uv python install 3.12 |
| Lambda / Paperspace | apt via deadsnakes PPA | add-apt-repository ppa:deadsnakes/ppa && apt-get install -y python3.12 |
| Any provider (universal) | UV always works | uv python install 3.12 |
nvidia-smi && echo "CUDA OK" || echo "NO GPU DETECTED"
df -h | sort -rh | head -5
Pick the partition with 300GB+ free space for HF_HOME. The 70B model needs ~140GB.
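The same choice can be made programmatically. This is a minimal sketch, assuming you pass in the candidate mount points from the provider table; `pick_storage` and `free_gb` are hypothetical helper names, and the 300 GB figure is the one used in this guide.

```python
import shutil

REQUIRED_FREE_GB = 300  # figure used in this guide for model + builds + chain


def free_gb(path):
    """Free space on the filesystem holding `path`, in GB (10**9 bytes)."""
    return shutil.disk_usage(path).free / 1e9


def pick_storage(candidates, required_gb=REQUIRED_FREE_GB, probe=free_gb):
    """Return the first candidate with enough free space, or None."""
    for path in candidates:
        try:
            if probe(path) >= required_gb:
                return path
        except OSError:
            continue  # path does not exist on this provider
    return None
```

For example, `pick_storage(["/workspace", "/mnt/data", "/notebooks", "/home/ubuntu"])` returns the first of those paths that exists and has enough room; use the result as the parent directory for HF_HOME.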
| Setting | Value | Why |
|---|---|---|
| Parallelism | --data-parallel-size 2 | NOT tensor parallel — TP reduces m dimension |
| Prefix Caching | --no-enable-prefix-caching | MUST disable — caching = no GEMM = no mining |
| Chunked Prefill | --no-enable-chunked-prefill | Must disable for correct mining behavior |
| GPU Memory | --gpu-memory-utilization 0.9 | Leave 10% headroom |
| Model Length | --max-model-len 8192 | Fits in 80GB VRAM |
| Execution | --enforce-eager | Required for Pearl kernel |
| ZK Speed | export RAYON_NUM_THREADS=96 | Faster proof generation |
| Deep GEMM | export VLLM_USE_DEEP_GEMM=0 | Disable — conflicts with Pearl GEMM |
| Requests | 128 concurrent long-prompt requests | Long prompts (~150+ tokens) needed for m≥5000 |
| Loop pattern | sleep 1 (NOT wait) | wait causes GPU to idle between batches → 0% util |
| Request port | port 8000 ONLY | DP=2 exposes single port — port 8001 drops silently |
| Socket Count | 4 ESTAB connections | 2 per DP engine = 4 total when healthy |
| n value in NOISY_GEMM | 57344 | Confirms DP mode (TP gives 28672) |
| Node RPC | port 44107 (pearld) | pearl daemon |
| Wallet RPC | port 44207 (oyster) | wallet daemon |
wget -q https://go.dev/dl/go1.24.2.linux-amd64.tar.gz && tar -C /usr/local -xzf go1.24.2.linux-amd64.tar.gz && export PATH=$PATH:/usr/local/go/bin && echo 'export PATH=$PATH:/usr/local/go/bin' >> ~/.bashrc
go version
Expected output: go version go1.24.2 linux/amd64
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y && source ~/.cargo/env
rustc --version
Expected output: rustc 1.xx.x (xxxxxxx YYYY-MM-DD)
curl -LsSf https://astral.sh/uv/install.sh | sh && source $HOME/.local/bin/env
Expected output: uv 0.x.x
sh -c "$(curl --location https://taskfile.dev/install.sh)" -- -d -b /usr/local/bin
Expected output: Task version: vx.x.x
apt-get update -qq && apt-get install -y tmux
apt-get install python3.12 will fail with "Unable to locate package". Use UV to install Python 3.12 instead (UV is already installed above).
uv python install 3.12
ln -sf $(uv python find 3.12) /usr/local/bin/python3.12 && update-alternatives --install /usr/bin/python3 python3 /usr/local/bin/python3.12 1 && python3 --version
Expected output: Python 3.12.x
cd /root && git clone https://github.com/pearl-research-labs/pearl.git && cd pearl
pwd → should show /root/pearl
cd /root/pearl && task build:blockchain
ls -la /root/pearl/bin/pearld /root/pearl/bin/oyster /root/pearl/bin/prlctl
cd /root/pearl && export UV_CACHE_DIR=/workspace/.uv-cache && export HF_HOME=/workspace/.hf && task build:miner
ls /root/pearl/.venv/bin/vllm && ls /root/pearl/.venv/bin/pearl-gateway
Add all required env vars to ~/.bashrc now (you will update PEARLD_MINING_ADDRESS after Step 3):
cat >> ~/.bashrc << 'EOF'
export PEARLD_RPC_URL=http://localhost:44107
export PEARLD_RPC_USER=rpcuser
export PEARLD_RPC_PASSWORD=rpcpass
export PEARLD_MINING_ADDRESS=PLACEHOLDER
export HF_HOME=/workspace/.hf
export VLLM_USE_DEEP_GEMM=0
export RAYON_NUM_THREADS=96
EOF
source ~/.bashrc && echo $VLLM_USE_DEEP_GEMM
Run source ~/.bashrc before starting the miner in Step 4.
cd /root/pearl && ./bin/oyster --create
When prompted, answer as follows:
| Prompt | Answer |
|---|---|
| Do you want to add a passphrase? | No (just press Enter) — or set one you'll remember |
| Do you have an existing seed phrase? | No |
| Seed phrase shown | ⚠️ WRITE IT DOWN NOW — all 12 words in order |
| Type OK to confirm | OK |
tmux new-session -d -s node && tmux new-session -d -s miner && tmux new-session -d -s loop
tmux send-keys -t node "cd /root/pearl && ./bin/pearld --rpcuser=rpcuser --rpcpass=rpcpass --rpclisten=0.0.0.0:44107 --txindex --notls --maxpeers=200" Enter
Wait 30 seconds then verify:
cd /root/pearl && ./bin/prlctl -u rpcuser -P rpcpass -s localhost:44107 --notls getblockcount
/root/pearl/bin/oyster -u rpcuser -P pearl123 --noclienttls --noservertls --pearldusername=rpcuser --pearldpassword=rpcpass > /tmp/oyster.log 2>&1 & sleep 15 && /root/pearl/bin/prlctl -u rpcuser -P pearl123 -s localhost:44207 --wallet --notls getnewaddress
sed -i 's/PEARLD_MINING_ADDRESS=PLACEHOLDER/PEARLD_MINING_ADDRESS=YOUR_ACTUAL_ADDRESS/' ~/.bashrc && source ~/.bashrc && echo $PEARLD_MINING_ADDRESS
Confirm it prints your address before proceeding.
/root/pearl/bin/prlctl -u rpcuser -P pearl123 -s localhost:44207 --wallet --notls validateaddress YOUR_ADDRESS
rm -f /tmp/pearlgw.sock
rm -f /tmp/pearlgw.sock && tmux kill-session -t miner 2>/dev/null; tmux new-session -d -s miner && tmux send-keys -t miner "cd /root/pearl && source ~/.bashrc && /root/pearl/.venv/bin/pearl-gateway start > /tmp/gateway.log 2>&1 & sleep 10 && /root/pearl/.venv/bin/vllm serve pearl-ai/Llama-3.3-70B-Instruct-pearl --host 0.0.0.0 --port 8000 --max-model-len 8192 --gpu-memory-utilization 0.9 --enforce-eager --data-parallel-size 2 --no-enable-prefix-caching --no-enable-chunked-prefill" Enter
tail -5 /tmp/gateway.log
cd /root/pearl && ./bin/prlctl -u rpcuser -P rpcpass -s localhost:44107 --notls getblockchaininfo 2>/dev/null | grep -E "blocks|headers"
nvidia-smi --query-gpu=index,memory.used --format=csv,noheader
curl -s http://localhost:8000/health && echo "READY" || echo "NOT READY"
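Instead of re-running the health check by hand while the model loads, the wait can be scripted. This is a minimal sketch using only the standard library; it assumes the `/health` endpoint on port 8000 used above, and the 900-second ceiling and 10-second interval are arbitrary illustrative values.

```python
import time
import urllib.error
import urllib.request

HEALTH_URL = "http://localhost:8000/health"


def vllm_ready(url=HEALTH_URL, timeout=5):
    """True if the health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(probe=vllm_ready, max_wait=900, interval=10, sleep=time.sleep):
    """Poll `probe` until it returns True or `max_wait` seconds have passed."""
    waited = 0
    while waited <= max_wait:
        if probe():
            return True
        sleep(interval)
        waited += interval
    return False
```

`wait_until_ready()` returns True as soon as vLLM answers and False if it never does, which makes it a convenient gate before starting the worker.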
Create the Python worker script:
python3 << 'PYEOF'
code = '''#!/usr/bin/env python3
import threading, random, requests, time, sys, signal

VLLM_URL = "http://localhost:8000/v1/chat/completions"
MODEL = "pearl-ai/Llama-3.3-70B-Instruct-pearl"
NUM_WORKERS = 32
MAX_TOKENS = 3
WORD_LIST_LENGTH = 120
REQUEST_TIMEOUT = 120
CONSONANTS = "bcdfghjklmnpqrstvwxyz"
VOWELS = "aeiou"

def random_word(length=None):
    if length is None:
        length = random.randint(4, 10)
    return "".join(random.choice(CONSONANTS if i%2==0 else VOWELS) for i in range(length))

def build_prompt():
    bypass = random_word(random.randint(5, 12))
    words = " ".join(random_word() for _ in range(WORD_LIST_LENGTH))
    return bypass + ", decipher this secret message: " + words

class MiningWorker(threading.Thread):
    def __init__(self, wid):
        super().__init__(daemon=True)
        self.wid = wid
        self.count = 0
        self.running = True

    def run(self):
        print(f"[W{self.wid}] Started", flush=True)
        while self.running:
            try:
                r = requests.post(VLLM_URL, json={"model": MODEL, "messages": [{"role": "user", "content": build_prompt()}], "max_tokens": MAX_TOKENS}, timeout=REQUEST_TIMEOUT)
                if r.status_code == 200:
                    self.count += 1
                    if self.count % 10 == 0:
                        out = r.json().get("choices",[{}])[0].get("message",{}).get("content","")
                        print(f"[W{self.wid}] req={self.count} out=\'{out.strip()}\'", flush=True)
                else:
                    time.sleep(1)
            except requests.exceptions.Timeout:
                print(f"[W{self.wid}] Timeout, retrying...", flush=True)
                time.sleep(2)
            except requests.exceptions.ConnectionError:
                print(f"[W{self.wid}] ConnError, retrying in 5s...", flush=True)
                time.sleep(5)
            except Exception as e:
                print(f"[W{self.wid}] Error: {e}", flush=True)
                time.sleep(2)

def stats(workers):
    while True:
        time.sleep(30)
        total = sum(w.count for w in workers)
        print(f"[Stats] total={total} | " + " ".join(f"W{w.wid}:{w.count}" for w in workers), flush=True)

def main():
    print(f"Pearl Worker -- {NUM_WORKERS} workers, max_tokens={MAX_TOKENS}", flush=True)
    workers = [MiningWorker(i) for i in range(NUM_WORKERS)]
    def shutdown(s,f):
        for w in workers:
            w.running = False
        sys.exit(0)
    signal.signal(signal.SIGINT, shutdown)
    signal.signal(signal.SIGTERM, shutdown)
    for w in workers:
        w.start()
    threading.Thread(target=stats, args=(workers,), daemon=True).start()
    while True:
        time.sleep(1)

if __name__ == "__main__":
    main()
'''
with open("/root/pearl/pearl_worker.py", "w") as f:
    f.write(code)
print("Written OK")
PYEOF
python3 -c "import ast; ast.parse(open('/root/pearl/pearl_worker.py').read()); print('Syntax OK')"
Start the worker in the worker tmux session:
tmux new-session -d -s worker && tmux send-keys -t worker "cd /root/pearl && /root/pearl/.venv/bin/python pearl_worker.py" Enter && echo "✓ Worker started"
Verify after 30 seconds:
sleep 30 && nvidia-smi --query-gpu=index,utilization.gpu,power.draw --format=csv,noheader && tmux capture-pane -t worker -p -S -5 | tail -5
Prompt format: {random_word}, decipher this secret message: {120 random words}. The first random word bypasses prefix caching, and the long word list fills the prefill matrix for maximum GEMM size.
tmux send-keys -t loop "COUNT=0; while true; do COUNT=\$((COUNT+1)); for i in \$(seq 1 128); do curl -s http://localhost:8000/v1/chat/completions -H 'Content-Type: application/json' -d '{\"model\": \"pearl-ai/Llama-3.3-70B-Instruct-pearl\", \"messages\": [{\"role\": \"user\", \"content\": \"Write a detailed comprehensive academic essay about topic '\$COUNT' variant '\$i' covering the following aspects in depth: historical background and origins dating back centuries, mathematical foundations and theoretical frameworks, scientific principles and empirical evidence, technological applications and modern implementations, economic implications and market dynamics, social and cultural impacts on society, philosophical interpretations and ethical considerations, future prospects and emerging research directions, comparative analysis with related fields, and practical case studies with real world examples.\"}], \"max_tokens\": 1}' > /dev/null & done; sleep 1; done" Enter
nvidia-smi --query-gpu=index,utilization.gpu --format=csv,noheader
tmux capture-pane -t miner -p -S -5000 | grep "NOISY_GEMM" | tail -3
Use -S -5000 (not -S -50) to look back far enough; the buffer fills with other logs quickly.
The vLLM metrics endpoint is the most reliable way to confirm everything is working correctly:
curl -s http://localhost:8000/metrics | grep -E "num_requests_running|cache_hit" | grep -v "^#\|reason\|external\|mm_cache"
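The grep above can also be done with a small parser. This is a minimal sketch, assuming the endpoint returns the Prometheus text format with one `num_requests_running` sample per engine; the exact metric prefix and label set vary between vLLM versions, so the code matches on the name suffix only.

```python
import re

# Matches Prometheus sample lines such as: name{label="x"} 12.0
SAMPLE_RE = re.compile(
    r'^(?P<name>[^\s{#]+)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)$'
)


def running_requests(metrics_text):
    """Return the num_requests_running value of every matching sample line."""
    values = []
    for line in metrics_text.splitlines():
        match = SAMPLE_RE.match(line.strip())
        if match and match.group("name").endswith("num_requests_running"):
            values.append(float(match.group("value")))
    return values


def all_engines_busy(values):
    """True when at least one engine reported and none of them sits at zero."""
    return bool(values) and all(v > 0 for v in values)
```

Feed it the body of `curl -s http://localhost:8000/metrics`; with two data-parallel engines you would expect two values, both above zero.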
This patch adds a single print statement inside the Pearl mining kernel that fires every time a valid GEMM operation is detected. It is the only way to visually confirm your miner is actually performing work and to measure your batch size (m value).
python3 -c "
import re
path = '/root/pearl/miner/vllm-miner/src/vllm_miner/config.py'
with open(path, 'r') as f:
    content = f.read()
# Capture the indentation of the return line so the patch keeps it intact.
pattern = re.compile(r'^([ \t]*)return \(m >= min_m\) and \(n >= min_n\) and \(k >= min_k\)', re.M)
def patch(match):
    ind = match.group(1)
    return (ind + 'result = (m >= min_m) and (n >= min_n) and (k >= min_k)\n'
            + ind + 'if result:\n'
            + ind + '    print(f\"NOISY_GEMM_CALLED: m={m} n={n} k={k}\", flush=True)\n'
            + ind + 'return result')
content, count = pattern.subn(patch, content)
with open(path, 'w') as f:
    f.write(content)
print('Patched!' if count else 'Pattern not found, nothing changed')
"
pkill -9 -f "pearl-gateway"; pkill -9 -f "vllm"; pkill -9 -f "EngineCore"; sleep 3 && rm -f /tmp/pearlgw.sock && tmux kill-session -t miner && tmux new-session -d -s miner && tmux send-keys -t miner "cd /root/pearl && source ~/.bashrc && /root/pearl/.venv/bin/pearl-gateway start > /tmp/gateway.log 2>&1 & sleep 10 && /root/pearl/.venv/bin/vllm serve pearl-ai/Llama-3.3-70B-Instruct-pearl --host 0.0.0.0 --port 8000 --max-model-len 8192 --gpu-memory-utilization 0.9 --enforce-eager --data-parallel-size 2 --no-enable-prefix-caching --no-enable-chunked-prefill" Enter && echo "✓ Restarting with patch active"
sleep 120 && curl -s http://localhost:8000/health && echo "READY" && tmux capture-pane -t miner -p -S -5000 | grep "NOISY_GEMM" | grep "n=57344" | tail -3
| What you see | What it means | Action |
|---|---|---|
| m=5000-8174, n=57344 on both workers | Mining at full capacity ✅ | Proceed to Step 5 |
| m=1024, n=57344 | Batch too small: too few concurrent requests | Ensure 32 Python workers are running |
| n=28672 instead of 57344 | TP mode instead of DP mode | Add --data-parallel-size 2 to vLLM command |
| No NOISY_GEMM output at all | Patch not active or worker not started | Verify patch was applied, vLLM restarted, worker running |
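The table above can be turned into a quick classifier for captured log lines. This is a minimal sketch, assuming the exact `NOISY_GEMM_CALLED: m=... n=... k=...` format printed by the patch; the `classify` helper and its return strings are illustrative.

```python
import re

GEMM_RE = re.compile(r"NOISY_GEMM_CALLED: m=(\d+) n=(\d+) k=(\d+)")


def classify(line):
    """Map one patched log line to the diagnosis in the table above."""
    match = GEMM_RE.search(line)
    if match is None:
        return "no NOISY_GEMM output"
    m, n, _k = (int(group) for group in match.groups())
    if n == 28672:
        return "TP mode: add --data-parallel-size 2"
    if n == 57344 and m >= 5000:
        return "full capacity"
    if n == 57344:
        return "batch too small: check the workers"
    return "unexpected n"
```

Pipe the output of `tmux capture-pane -t miner -p -S -5000` through it line by line to see which row of the table you are in.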
echo "=== TMUX ==="; tmux ls; echo "=== SOCKETS ==="; ss -x | grep pearlgw | wc -l; echo "=== VLLM ==="; pgrep -f "vllm serve" | wc -l; echo "=== GPU ==="; nvidia-smi --query-gpu=index,utilization.gpu,memory.used,power.draw --format=csv,noheader; echo "=== MINING ADDRESS ==="; cat /proc/$(pgrep -f "pearl-gateway" | head -1)/environ 2>/dev/null | tr '\0' '\n' | grep "MINING_ADDRESS"; echo "=== NOISY_GEMM ==="; tmux capture-pane -t miner -p -S -5000 | grep "NOISY_GEMM" | grep "n=57344" | tail -3; echo "=== LOOP ==="; tmux capture-pane -t loop -p -S -3 | tail -2; echo "=== CURL JOBS ==="; pgrep -f "curl.*localhost:8000" | wc -l; echo "=== REQUESTS RUNNING ==="; curl -s http://localhost:8000/metrics | grep "num_requests_running" | grep -v "^#\|reason" | awk '{print $2}' | tr '\n' ' '; echo ""; echo "=== CACHE HITS ==="; curl -s http://localhost:8000/metrics | grep "cache_hit" | grep -v "^#\|external\|mm_cache" | awk '{print $2}' | tr '\n' ' '; echo ""; echo "=== PEERS ==="; /root/pearl/bin/prlctl -u rpcuser -P rpcpass -s localhost:44107 --notls getpeerinfo 2>/dev/null | grep "addr" | wc -l; echo "=== BLOCK COUNT ==="; /root/pearl/bin/prlctl -u rpcuser -P rpcpass -s localhost:44107 --notls getblockcount 2>/dev/null; echo "=== WATCHDOG ==="; cat /tmp/loop_watchdog.log 2>/dev/null || echo "No restarts yet"; echo "=== BLOCKS ==="; tmux capture-pane -t miner -p -S -5000 | grep -i "block accepted\|Block found\|proof"
| Check | Healthy Value | Action if Wrong |
|---|---|---|
| TMUX SESSIONS | miner, loop, node, worker, watchdog | Recreate missing sessions |
| SOCKETS | 4 | Restart miner — gateway/vLLM disconnected |
| PEERS | 8-16 on RunPod (normal) | 16 = normal without exposed port 44108. Not a problem. See Step 0b. |
| VLLM | 1 | Restart miner tmux session |
| GPU utilization | 90-98% both GPUs | Restart Python worker — check worker session for errors |
| GPU power draw | 600-690W each (near 700W TDP) | Low power = GPU idle = worker stalled or vLLM degraded |
| GPU memory | ~132GB each | vLLM crashed — restart miner |
| NOISY_GEMM m value | 5000-8000+ | Use longer prompts in loop. Low m = less mining throughput. |
| NOISY_GEMM n value | 57344 | Must be 57344 — confirms DP mode working |
| NOISY_GEMM workers | Both Worker PIDs firing | Only one firing = one GPU idle — restart loop |
| CURL JOBS | 0 (Python worker) or 500+ (bash loop) | Python worker uses threads not curl — 0 curl jobs is correct |
| REQUESTS RUNNING | 30-70 per engine (balanced) | 0 on both engines = worker stopped or vLLM degraded |
| WORKER | req counts climbing, outputs visible | Timeouts = vLLM not ready. Restart worker after vLLM is READY. |
| REQUESTS RUNNING | 30-50 per engine (balanced) | 0 on one engine = unbalanced — restart loop |
| CACHE HITS | 0.0 | Prompts too similar — randomize more |
| LOOP | Many PIDs visible, large count number | Restart loop or check watchdog log |
| MINING ADDRESS | Your prl1p... address | Kill gateway and restart with correct address in ~/.bashrc |
| WATCHDOG | No restarts yet / shows timestamps | Not running → set up loop watchdog (Section 08b) |
This is NOT a sampling artifact if RunPod dashboard also shows 0%. Root cause is almost always the request loop — either using wait instead of sleep 1, or short prompts that produce m values below the 1024 threshold.
curl -s http://localhost:8000/metrics | grep "num_requests_running" | grep -v "^#\|reason" | awk '{print $2}' | tr '\n' ' '
Fix: Kill loop, restart with sleep 1 (not wait) and long prompts (~150+ tokens). See Step 4 loop command.
If the watchdog log shows restarts every 60 seconds with 0 curl jobs each time, and vLLM responds healthy but GPU stays at 0% with ~120W power draw — vLLM is in a degraded state. This happens after processing hundreds of millions of tokens continuously (typically after 1-2 days of running). vLLM responds to health checks and completes requests instantly, but stops actually using the GPU.
Signs in watchdog log:
Sun May 3 10:54:38 UTC 2026 - Loop stalled (0 curl jobs), restarting...
Sun May 3 10:54:42 UTC 2026 - Loop restarted
Sun May 3 10:55:42 UTC 2026 - Loop stalled (0 curl jobs), restarting...
Sun May 3 10:55:46 UTC 2026 - Loop restarted
# Repeating every 60 seconds = vLLM degraded, not loop issue
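The pattern in the log excerpt above can be detected automatically. This is a minimal sketch, assuming the watchdog writes timestamps in the `date` format shown (UTC, English locale); `looks_degraded` and its thresholds are hypothetical and should be tuned to your watchdog interval.

```python
from datetime import datetime

STAMP_FORMAT = "%a %b %d %H:%M:%S UTC %Y"  # matches the log lines shown above


def stall_times(log_text):
    """Timestamps of every 'Loop stalled' entry in the watchdog log."""
    times = []
    for line in log_text.splitlines():
        if "Loop stalled" not in line:
            continue
        stamp = line.split(" - ", 1)[0]
        times.append(datetime.strptime(stamp, STAMP_FORMAT))
    return times


def looks_degraded(log_text, max_gap_seconds=90, min_repeats=2):
    """True if the last `min_repeats` stalls are each at most max_gap_seconds apart."""
    times = stall_times(log_text)[-min_repeats:]
    if len(times) < min_repeats:
        return False
    gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
    return all(gap <= max_gap_seconds for gap in gaps)
```

A single isolated stall returns False; back-to-back stalls roughly one watchdog interval apart return True, which is the signal to restart vLLM rather than the loop.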
Fix: Full restart of vLLM and gateway. Model is cached so takes ~3 minutes:
pkill -9 -f "pearl-gateway"; pkill -9 -f "vllm"; pkill -9 -f "EngineCore"; pkill -9 -f "Worker"; sleep 3 && rm -f /tmp/pearlgw.sock && tmux kill-session -t miner && tmux new-session -d -s miner && tmux send-keys -t miner "cd /root/pearl && source ~/.bashrc && /root/pearl/.venv/bin/pearl-gateway start > /tmp/gateway.log 2>&1 & sleep 10 && /root/pearl/.venv/bin/vllm serve pearl-ai/Llama-3.3-70B-Instruct-pearl --host 0.0.0.0 --port 8000 --max-model-len 8192 --gpu-memory-utilization 0.9 --enforce-eager --data-parallel-size 2 --no-enable-prefix-caching --no-enable-chunked-prefill" Enter
rm -f /tmp/loop_watchdog.log && echo "cleared"
DeepGEMM is trying to JIT-compile CUDA kernels and failing. Root cause: VLLM_USE_DEEP_GEMM env var is not set or not reaching the vLLM process.
echo $VLLM_USE_DEEP_GEMM
The blockchain node is still syncing. vLLM starts but immediately crashes because there is no block to mine.
cd /root/pearl && ./bin/prlctl -u rpcuser -P rpcpass -s localhost:44107 --notls getblockchaininfo 2>/dev/null | grep -E "blocks|headers"
Wait until blocks == headers before starting vLLM. Can take 5-15 minutes on first launch.
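The blocks == headers comparison is easy to automate. This is a minimal sketch, assuming `prlctl getblockchaininfo` prints a JSON object with integer `blocks` and `headers` fields (which the grep above suggests, but the guide does not show the full output).

```python
import json


def is_synced(chain_info_json):
    """True when the node has validated every header it knows about."""
    info = json.loads(chain_info_json)
    return info["blocks"] == info["headers"]


def sync_progress(chain_info_json):
    """Fraction of known headers that are fully validated (0.0 to 1.0)."""
    info = json.loads(chain_info_json)
    if info["headers"] == 0:
        return 0.0
    return info["blocks"] / info["headers"]
```

Pass it the raw command output and start vLLM only once `is_synced` returns True.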
The PEARLD_MINING_ADDRESS env var is not reaching the gateway process. This happens when env vars are only exported inline rather than in ~/.bashrc, or when the miner tmux session was created before the vars were set.
echo $PEARLD_MINING_ADDRESS
If empty: add to ~/.bashrc, source it, then kill and recreate the miner tmux session before restarting.
source .venv/bin/activate often fails silently inside tmux send-keys, so vllm is not in PATH.
Fix: Always use FULL PATHS: /root/pearl/.venv/bin/vllm and /root/pearl/.venv/bin/pearl-gateway instead of relying on venv activation.
Stale socket file from previous run. Gateway creates /tmp/pearlgw.sock and won't overwrite it.
rm -f /tmp/pearlgw.sock && echo "cleared"
Always delete the socket before restarting. Add to all restart procedures.
pkill -9 -f "pearl-gateway"; pkill -9 -f "vllm"; pkill -9 -f "EngineCore"; sleep 5
Then restart miner with full command from Step 4.
Gateway and vLLM are not connected. Happens when they start in separate sessions.
pkill -9 -f "pearl-gateway"; pkill -9 -f "vllm"; pkill -9 -f "EngineCore"; sleep 5
Restart BOTH gateway and vLLM together in the SAME miner session.
TP mode is active instead of DP. Restart with --data-parallel-size 2 flag.
pgrep -f "vllm serve" | xargs -I{} cat /proc/{}/cmdline | tr '\0' ' ' | grep "data-parallel"
cat /proc/$(pgrep -f "pearl-gateway" | head -1)/environ | tr '\0' '\n' | grep "MINING_ADDRESS"
If wrong: pkill -f "pearl-gateway" then restart miner with correct PEARLD_MINING_ADDRESS.
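The /proc check above reads the environment the gateway was actually started with, which can differ from your current shell. A minimal sketch of the same check follows, assuming a Linux /proc filesystem; the helper names are illustrative.

```python
def parse_environ(raw):
    """Split the NUL-separated bytes of /proc/<pid>/environ into a dict."""
    env = {}
    for entry in raw.split(b"\0"):
        if b"=" in entry:
            key, value = entry.split(b"=", 1)
            env[key.decode()] = value.decode(errors="replace")
    return env


def mining_address_ok(env, expected):
    """True if the process carries the expected mining address."""
    actual = env.get("PEARLD_MINING_ADDRESS", "")
    return actual == expected and actual != "PLACEHOLDER"


def read_process_env(pid):
    """Read the environment a running process was started with (Linux only)."""
    with open(f"/proc/{pid}/environ", "rb") as handle:
        return parse_environ(handle.read())
```

Use the PID printed by `pgrep -f "pearl-gateway" | head -1` with `read_process_env`, then compare against the address you validated in the wallet.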
If vLLM crashes repeatedly, add a watchdog that monitors and restarts it automatically. Note: this is separate from the loop watchdog below.
cat > /root/pearl/watchdog.sh << 'EOF'
#!/bin/bash
while true; do
  VLLM=$(pgrep -f "vllm serve" | wc -l)
  SOCK=$(ss -x | grep pearlgw | wc -l)
  if [ "$VLLM" -eq 0 ] || [ "$SOCK" -lt 2 ]; then
    echo "$(date) - Restarting miner..." >> /tmp/watchdog.log
    pkill -9 -f "pearl-gateway"; pkill -9 -f "vllm"; pkill -9 -f "EngineCore"
    sleep 5
    rm -f /tmp/pearlgw.sock
    cd /root/pearl && source ~/.bashrc && \
    /root/pearl/.venv/bin/pearl-gateway start > /tmp/gateway.log 2>&1 & sleep 10 && \
    /root/pearl/.venv/bin/vllm serve pearl-ai/Llama-3.3-70B-Instruct-pearl \
      --host 0.0.0.0 --port 8000 --max-model-len 8192 \
      --gpu-memory-utilization 0.9 --enforce-eager \
      --data-parallel-size 2 --no-enable-prefix-caching \
      --no-enable-chunked-prefill &
    sleep 900
  fi
  sleep 60
done
EOF
chmod +x /root/pearl/watchdog.sh && tmux new-session -d -s miner-watchdog && tmux send-keys -t miner-watchdog "/root/pearl/watchdog.sh" Enter && echo "✓ Miner watchdog running"
Monitors the Python worker and restarts it if it stops. Checks every 60 seconds.
cat > /root/loop_watchdog.sh << 'EOF'
#!/bin/bash
while true; do
  WORKER_COUNT=$(pgrep -f "pearl_worker.py" | wc -l)
  if [ "$WORKER_COUNT" -lt 1 ]; then
    echo "$(date) - Worker stopped, restarting..." >> /tmp/loop_watchdog.log
    tmux send-keys -t worker C-c 2>/dev/null
    sleep 2
    tmux send-keys -t worker "cd /root/pearl && /root/pearl/.venv/bin/python pearl_worker.py" Enter
    echo "$(date) - Worker restarted" >> /tmp/loop_watchdog.log
  fi
  sleep 60
done
EOF
chmod +x /root/loop_watchdog.sh && tmux new-session -d -s watchdog && tmux send-keys -t watchdog "/root/loop_watchdog.sh" Enter && echo "✓ Watchdog running" && tmux ls | grep watchdog
cat /tmp/loop_watchdog.log 2>/dev/null || echo "No restarts yet"
cat > /root/loop_watchdog.sh << 'EOF'
#!/bin/bash
while true; do
  CURL_COUNT=$(pgrep -f "curl.*localhost:8000" | wc -l)
  if [ "$CURL_COUNT" -lt 10 ]; then
    echo "$(date) - Loop stalled (${CURL_COUNT} curl jobs), restarting..." >> /tmp/loop_watchdog.log
    tmux send-keys -t loop C-c 2>/dev/null
    sleep 2
    pkill -f "curl.*localhost:8000" 2>/dev/null
    sleep 2
    tmux send-keys -t loop "COUNT=0; while true; do COUNT=\$((COUNT+1)); for i in \$(seq 1 128); do curl -s http://localhost:8000/v1/chat/completions -H 'Content-Type: application/json' -d '{\"model\": \"pearl-ai/Llama-3.3-70B-Instruct-pearl\", \"messages\": [{\"role\": \"user\", \"content\": \"Write a detailed comprehensive academic essay about topic '\$COUNT' variant '\$i' covering the following aspects in depth: historical background and origins dating back centuries, mathematical foundations and theoretical frameworks, scientific principles and empirical evidence, technological applications and modern implementations, economic implications and market dynamics, social and cultural impacts on society, philosophical interpretations and ethical considerations, future prospects and emerging research directions, comparative analysis with related fields, and practical case studies with real world examples.\"}], \"max_tokens\": 1}' > /dev/null & done; sleep 1; done" Enter
    echo "$(date) - Loop restarted" >> /tmp/loop_watchdog.log
  fi
  sleep 60
done
EOF
chmod +x /root/loop_watchdog.sh && tmux new-session -d -s watchdog && tmux send-keys -t watchdog "/root/loop_watchdog.sh" Enter && echo "✓ Loop watchdog running" && tmux ls | grep watchdog
cat /tmp/loop_watchdog.log 2>/dev/null || echo "No restarts yet"
tmux capture-pane -t miner -p -S -50000 | grep -i "block accepted\|Block found\|proof\|submit"
https://explorer.pearlresearch.ai/address/YOUR_MINING_ADDRESS
pgrep -f "vllm serve" | wc -l && ss -x | grep pearlgw | wc -l && nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader
pkill -9 -f "pearl-gateway"; pkill -9 -f "vllm"; pkill -9 -f "EngineCore"; pkill -9 -f "Worker"; sleep 3 && rm -f /tmp/pearlgw.sock && tmux kill-session -t miner 2>/dev/null; tmux new-session -d -s miner && tmux send-keys -t miner "cd /root/pearl && source ~/.bashrc && /root/pearl/.venv/bin/pearl-gateway start > /tmp/gateway.log 2>&1 & sleep 10 && /root/pearl/.venv/bin/vllm serve pearl-ai/Llama-3.3-70B-Instruct-pearl --host 0.0.0.0 --port 8000 --max-model-len 8192 --gpu-memory-utilization 0.9 --enforce-eager --data-parallel-size 2 --no-enable-prefix-caching --no-enable-chunked-prefill" Enter
tmux send-keys -t loop C-c
Then send the full loop command from Step 4.
/root/pearl/bin/oyster -u rpcuser -P pearl123 --noclienttls --noservertls --pearldusername=rpcuser --pearldpassword=rpcpass > /tmp/oyster.log 2>&1 & sleep 15 && /root/pearl/bin/prlctl -u rpcuser -P pearl123 -s localhost:44207 --wallet --notls getbalance
tmux capture-pane -t miner -p -S -20 | grep -i "download\|Downloading\|fetching"
python3 -c "
import re
path = '/root/pearl/miner/vllm-miner/src/vllm_miner/config.py'
with open(path, 'r') as f:
    content = f.read()
# Capture the indentation of the return line so the patch keeps it intact.
pattern = re.compile(r'^([ \t]*)return \(m >= min_m\) and \(n >= min_n\) and \(k >= min_k\)', re.M)
def patch(match):
    ind = match.group(1)
    return (ind + 'result = (m >= min_m) and (n >= min_n) and (k >= min_k)\n'
            + ind + 'if result:\n'
            + ind + '    print(f\"NOISY_GEMM_CALLED: m={m} n={n} k={k}\", flush=True)\n'
            + ind + 'return result')
content, count = pattern.subn(patch, content)
with open(path, 'w') as f:
    f.write(content)
print('Patched!' if count else 'Pattern not found, nothing changed')
"
After reapplying, restart vLLM and check: tmux capture-pane -t miner -p -S -5000 | grep "NOISY_GEMM" | grep "n=57344" | tail -3
| Difficulty | Expected Block Time (2x H200) | Status |
|---|---|---|
| ~29,000 | ~1 block/hour | Early network (April 27, 2026) |
| ~68,000 | ~2 hours/block | Day 3 |
| ~115,000 | ~4 hours/block | Day 4 |
| >150,000 | 6-8+ hours/block | Highly competitive |
cd /root/pearl && ./bin/prlctl -u rpcuser -P rpcpass -s localhost:44107 --notls getblockchaininfo 2>/dev/null | grep -E "blocks|difficulty"
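The difficulty value returned by that command can be turned into a rough block-time estimate. This is a minimal sketch under a simplifying assumption that the guide does not state: at constant hashrate, expected time per block scales linearly with difficulty. The reference point is the first row of the table above.

```python
# Reference point from the table above: ~1 block/hour at difficulty ~29,000.
REF_DIFFICULTY = 29_000
REF_HOURS_PER_BLOCK = 1.0


def expected_hours_per_block(difficulty):
    """Linear estimate; assumes your hashrate stays at the reference level."""
    return REF_HOURS_PER_BLOCK * difficulty / REF_DIFFICULTY


for d in (68_000, 115_000, 150_000):
    print(d, round(expected_hours_per_block(d), 1))
```

The estimates come out at about 2.3, 4.0 and 5.2 hours, close to the lower rows of the table. Actual block times vary widely around the expectation, so treat this as an order-of-magnitude guide.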
tmux kill-session -t vllm 2>/dev/null; echo "done"
Add --debug flag to gateway for more verbose logs including block submissions:
pearl-gateway start
pearl-gateway --debug start
tmux send-keys -t loop C-c; sleep 2; pkill -f "curl.*localhost:8000"; sleep 2; tmux send-keys -t loop "COUNT=0; while true; do COUNT=\$((COUNT+1)); for i in \$(seq 1 128); do curl -s http://localhost:8000/v1/chat/completions -H 'Content-Type: application/json' -d '{\"model\": \"pearl-ai/Llama-3.3-70B-Instruct-pearl\", \"messages\": [{\"role\": \"user\", \"content\": \"Write a detailed comprehensive academic essay about topic '\$COUNT' variant '\$i' covering the following aspects in depth: historical background and origins dating back centuries, mathematical foundations and theoretical frameworks, scientific principles and empirical evidence, technological applications and modern implementations, economic implications and market dynamics, social and cultural impacts on society, philosophical interpretations and ethical considerations, future prospects and emerging research directions, comparative analysis with related fields, and practical case studies with real world examples.\"}], \"max_tokens\": 1}' > /dev/null & done; sleep 1; done" Enter
cd /root/pearl && ./bin/prlctl -u rpcuser -P rpcpass -s localhost:44107 --notls getpeerinfo 2>/dev/null | grep "addr" | wc -l
| Critical Mistake | Consequence | Fix |
|---|---|---|
| Using `wait` in request loop | GPU goes to 0% between batches — burst/idle pattern, very inefficient | Use `sleep 1` instead — keeps requests continuously overlapping |
| Sending requests to port 8001 | DP=2 only exposes port 8000 — port 8001 requests are dropped | Always send all requests to port 8000 only |
| Using --tensor-parallel-size 2 | Reduces n to 28672, less mining efficiency | Use --data-parallel-size 2 |
| Prefix caching enabled | Same prompts cached — NO GEMM = NO MINING | Always use --no-enable-prefix-caching |
| Gateway in separate session from vLLM | Socket not connected, env vars not inherited | Start both in same tmux miner session |
| Sending same prompt repeatedly | KV cache kicks in, GEMM skipped entirely | Randomize with COUNT and i variables |
| config.yaml thresholds at 1 | Overhead without benefit for our matrix sizes | Keep at 1024 (default) |
| Not verifying mining address | Blocks could go to wrong wallet | Always validateaddress + check /proc environ |
| MINER_DEBUG env vars | Don't reach EngineCore subprocess | Use PEARL_LOG_LEVEL=DEBUG instead |
| Service | Username | Password | Port |
|---|---|---|---|
| pearld node (prlctl) | rpcuser | rpcpass | 44107 |
| oyster wallet (prlctl --wallet) | rpcuser | pearl123 | 44207 |
Error creating a default config file: open /root/.oyster/oyster.conf: no such file or directory
Error creating a default config file: open /root/.pearld/pearld.conf: no such file or directory
Warning: Running on mainnet with --noclienttls is not recommended
Warning: Running on mainnet with --noservertls is not recommended
The startup command uses pearl-gateway start & sleep 10 && vllm serve ...
The & runs gateway in background, sleep 10 gives it time to create the socket, then vLLM starts and connects to it. If vLLM starts before the socket exists, they won't connect.
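A fixed sleep 10 is a guess about how long the gateway needs. A minimal sketch of polling for the socket instead, assuming the gateway creates a Unix socket at /tmp/pearlgw.sock (the path named in the stale-socket note in this guide). The function name wait_for_socket and its timeout values are hypothetical and not part of pearl-gateway:

```python
import os
import stat
import time


def wait_for_socket(path="/tmp/pearlgw.sock", timeout_s=30.0, poll_s=0.25):
    """Return True once `path` exists and is a Unix socket, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # os.stat raises FileNotFoundError until the gateway creates the file.
            if stat.S_ISSOCK(os.stat(path).st_mode):
                return True
        except FileNotFoundError:
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

If this returns False, the gateway most likely failed to start, so launching vLLM would not help. A leftover socket file from a previous run also passes this check, so remove /tmp/pearlgw.sock before starting the gateway.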
The pearl-ai model downloaded fine without an HF token in our setup. If you get auth errors:
export HF_TOKEN=your_token_here
Do not use pgrep -f "api_server" to detect vLLM. It finds 0 matching processes even when vLLM IS running! Always use pgrep -f "vllm serve" instead.
By default tmux only stores a limited scroll buffer. Block activity messages from hours ago may not appear in tmux capture-pane. The explorer is more reliable for historical block confirmation.
Running getnewaddress multiple times generates different addresses — all from the same seed phrase, all recoverable. But only one address is set as the mining address at a time. The second address generated (prl1p8jt0...) is a valid backup address from the same wallet.
The number of Python workers directly controls vLLM's batch size and therefore the m value:
| Workers | m Value | GPU Util | Result |
|---|---|---|---|
| 3 (GPUs+1) | ~1024 (minimum) | 0% | Barely mining — too few concurrent requests |
| 16 | ~2000-4000 | ~60% | Partial mining |
| 32 | 5000-8174 | 90-98% | ✅ Optimal — recommended |
| 128+ | N/A | 0% | vLLM crashes from overload |
Starting the Python worker before vLLM is fully loaded causes a degraded state — vLLM accepts connections but stops processing requests. Always verify curl -s http://localhost:8000/health returns READY and metrics show 0.0 0.0 before starting the worker.
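The readiness check above can be sketched in Python. This is an illustration under assumptions, not the guide's own script: it treats an HTTP 200 from /health as ready (the exact response body may differ in your build), and it reads the running-request gauges with the same filter as the guide's grep/awk one-liner. The function names are hypothetical:

```python
import urllib.error
import urllib.request


def health_ok(base_url="http://localhost:8000", timeout_s=5.0):
    """True if GET /health answers with HTTP 200 (assumed to mean ready)."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def running_requests(metrics_text):
    """Extract num_requests_running values, one per engine, from /metrics text."""
    values = []
    for line in metrics_text.splitlines():
        # Same filter as: grep num_requests_running | grep -v "^#\|reason"
        if "num_requests_running" not in line:
            continue
        if line.startswith("#") or "reason" in line:
            continue
        parts = line.split()
        if len(parts) >= 2:
            values.append(float(parts[1]))  # second field, like awk '{print $2}'
    return values


def is_idle(metrics_text):
    """True when at least one engine is reported and all show 0.0 running."""
    values = running_requests(metrics_text)
    return bool(values) and all(v == 0.0 for v in values)
```

Start the worker only when health_ok() is True and is_idle() is True for the text fetched from http://localhost:8000/metrics.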
The bash loop with sleep 1 fires 128 new curl jobs every second regardless of whether previous jobs finished. Over hours this accumulates to 10,000-15,000 concurrent processes, causing fork errors and degrading m values from 8174 down to 1000-3000. The Python worker avoids this entirely by using threads instead of processes.
After processing hundreds of millions of tokens, vLLM can enter a degraded state where it responds to health checks and completes requests instantly but stops using the GPU. Requests complete in milliseconds with 0% GPU utilization and ~120W power draw. The watchdog loop restarts every 60 seconds but immediately stalls again — this is the telltale sign.
Fix: Full restart of vLLM and gateway. Model is cached so back online in ~3 minutes. Plan for a periodic restart every 24-48 hours as preventive maintenance.
The request loop can stall without any error message. Curl jobs drop to 0, GPU goes to 0%, but vLLM stays running and appears healthy. This happens because bash accumulates too many background jobs over time.
Signs: GPU 0% on RunPod dashboard, power draw drops to ~120W, NOISY_GEMM stops firing in tmux buffer, curl job count is 0.
Fix: Kill loop, restart it. Always set up the loop watchdog (Section 08b) to handle this automatically — it checks every 60 seconds and restarts if curl jobs drop below 10.
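A rough sketch of the watchdog decision, assuming the rule stated above (restart the loop when fewer than 10 curl jobs are alive). The escalation to a full vLLM and gateway restart after 3 consecutive stalled checks is an added assumption, meant to catch the case where the loop restarts but immediately stalls again. StallTracker and count_curl_jobs are hypothetical names and this is not the Section 08b script:

```python
import subprocess


def count_curl_jobs(pattern="curl.*localhost:8000"):
    """Count matching processes with pgrep -f -c (prints 0 when none match)."""
    out = subprocess.run(["pgrep", "-f", "-c", pattern],
                         capture_output=True, text=True)
    return int(out.stdout.strip() or 0)


class StallTracker:
    """Turns a series of curl job counts into a watchdog action."""

    def __init__(self, min_jobs=10, max_consecutive=3):
        self.min_jobs = min_jobs                # threshold from the guide
        self.max_consecutive = max_consecutive  # assumed escalation point
        self.consecutive = 0

    def observe(self, curl_jobs):
        if curl_jobs >= self.min_jobs:
            self.consecutive = 0
            return "ok"
        self.consecutive += 1
        if self.consecutive >= self.max_consecutive:
            return "full_restart"   # loop restarts are not helping
        return "restart_loop"
```

Call tracker.observe(count_curl_jobs()) once every 60 seconds and act on the returned string.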
Sometimes requests distribute unevenly between the two DP engines — one engine gets 35 requests, the other gets 0-7. This shows as low m values on one Worker and lower GPU utilization. Root cause: loop stalled and restarted unevenly.
Fix: Restart the loop cleanly. Kill all curl jobs first, verify 0 remaining, then restart. The engines rebalance within the next batch.
curl -s http://localhost:8000/metrics | grep "num_requests_running" | grep -v "^#\|reason" | awk '{print $2}' | tr '\n' ' '
Both engines should show similar numbers.
On RunPod (and most cloud providers), inbound connections are blocked by default. Your node can connect OUT to other peers but other nodes cannot connect IN to you. This limits you to ~8-16 outbound peers regardless of your --maxpeers setting.
The fix is exposing port 44108 before deploying your pod (see Step 0b). If already deployed, wait until next natural restart.
Both approaches send requests to vLLM to keep it busy mining. Here's how they compare after extensive real-world testing:
| Aspect | Bash Loop (sleep 1) | Python Worker (32 threads) |
|---|---|---|
| Concurrent requests | 128 new jobs per second, uncapped | Exactly 32 at all times |
| Job accumulation | Grows to 10,000-15,000+ over hours | Always exactly 32 — never accumulates |
| m value (fresh start) | 8174 (peak) | 5000-8174 (consistent) |
| m value after 6+ hours | Degrades to 1000-3000 | Stays at 5000-8174 |
| Fork errors | Yes — after thousands of jobs accumulate | Never |
| GPU utilization | 90-98% initially, degrades over time | 90-98% stable indefinitely |
| Stability | Requires loop restarts every few hours | Runs indefinitely without restarts |
| Prompt format | Long essay prompts (~150 tokens) | Decipher format — dev team recommended |
| Output tokens | max_tokens=1 | max_tokens=3 — can eyeball outputs |
| Complexity | Simple bash — easy to understand | Requires Python file on server |
| Watchdog needed | Yes — loop stalls frequently | Rarely — threads auto-retry |
| vLLM crash risk | High — floods vLLM with thousands of requests | Low — controlled concurrency |
The bash loop fires 128 new curl requests every second as background processes. On a fresh start, this floods vLLM with enough concurrent requests to fill the batch (m=8174). The GPU runs at near 100%. However, because it never waits for jobs to finish before firing new ones, jobs accumulate indefinitely. After 6-12 hours you have 10,000+ zombie curl processes, fork errors appear, and vLLM's batch scheduler starts behaving erratically — m values drop and GPU utilization degrades.
Why wait doesn't work instead of sleep 1
Using wait in the bash loop makes it fire 128 jobs, then wait for ALL of them to complete before firing the next batch. On H200s with fast prefill, batches complete in ~1 second, but there's still a gap between batches where the GPU sits idle at 0%. This burst/idle pattern is inefficient. sleep 1 overlaps batches continuously but causes accumulation. Neither is ideal, which is why the Python worker is better.
Why sleep 1 works short-term but fails long-term
sleep 1 was the discovered middle ground between wait and uncapped firing. Instead of waiting for all 128 jobs to finish, it fires a new batch every second regardless, keeping requests continuously overlapping so the GPU never idles. This produces m=8174 and 90-98% GPU utilization on a fresh start.
The problem: every second, 128 new background processes are created whether or not the previous ones have finished. vLLM takes ~1-3 seconds per request when healthy, but once completions fall behind launches the backlog only grows. After 2 hours: ~15,000 unfinished curl processes. The OS hits its process limit (fork errors), vLLM's batch scheduler degrades under the queue pressure, and m values fall from 8174 to 1000-3000. A loop restart temporarily fixes it but the cycle repeats.
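A back-of-the-envelope check on these numbers, using Little's law (average requests in flight = launch rate × average time each request spends in the system). This is a sketch that assumes a roughly steady launch rate of 128 per second. The function names are illustrative:

```python
def steady_state_in_flight(launch_rate_per_s, latency_s):
    """Little's law: average number of requests in flight."""
    return launch_rate_per_s * latency_s


def implied_latency_s(in_flight, launch_rate_per_s):
    """Average time in system implied by an observed in-flight count."""
    return in_flight / launch_rate_per_s


print(steady_state_in_flight(128, 1))        # 128
print(steady_state_in_flight(128, 3))        # 384
print(round(implied_latency_s(15000, 128)))  # 117
```

At a healthy 1-3 seconds per request only a few hundred curl jobs should be alive at once. Seeing ~15,000 means the average request is spending about two minutes in the system, which suggests requests are queueing rather than being served.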
The Python worker solves this by design — threads block on the HTTP response, so there are always exactly NUM_WORKERS requests in flight. No accumulation, no degradation.
Each Python worker sends one request, waits for the response, then immediately sends the next. With 32 workers running simultaneously, there are always exactly 32 requests in flight. On 2x H200 with DP=2, this keeps both engines busy enough to produce m=5000-8174. Fewer workers (3, as suggested by the dev team) produce m=1024, the minimum threshold, because H200s process requests so fast that the batch is empty most of the time. More workers (128+) overwhelm vLLM's queue and cause it to crash.
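The pattern described here can be sketched as a fixed pool of threads that each block on one HTTP request at a time. This is a minimal illustration, not the guide's actual worker file. The prompt text is a placeholder (the guide's worker uses a "Decipher" prompt format that is not reproduced here), and the function names are hypothetical. URL, model name, 32 workers and max_tokens=3 are taken from this guide:

```python
import json
import threading
import urllib.request

URL = "http://localhost:8000/v1/chat/completions"
MODEL = "pearl-ai/Llama-3.3-70B-Instruct-pearl"
NUM_WORKERS = 32


def build_payload(worker_id, count):
    # Placeholder long prompt; the counters keep every prompt unique.
    prompt = (f"Request {count} from worker {worker_id}: "
              + "describe this topic in depth. " * 40)
    body = {"model": MODEL,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 3}
    return json.dumps(body).encode("utf-8")


def send_once(worker_id, count, timeout_s=120.0):
    req = urllib.request.Request(
        URL, data=build_payload(worker_id, count),
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        resp.read()  # block until vLLM has answered


def worker_loop(worker_id, send, stop):
    count = 0
    while not stop.is_set():
        count += 1
        try:
            send(worker_id, count)   # one request in flight per thread
        except Exception:
            stop.wait(1.0)           # brief back-off, then retry


def start_workers(send=send_once, num_workers=NUM_WORKERS):
    stop = threading.Event()
    threads = [threading.Thread(target=worker_loop, args=(i, send, stop),
                                daemon=True) for i in range(num_workers)]
    for t in threads:
        t.start()
    return stop, threads
```

Call start_workers() once vLLM is ready, keep the main thread alive, and call stop.set() to shut down. Because each thread waits for its response before sending again, the number of requests in flight can never exceed num_workers.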
Exporting vars inline in the tmux send-keys command is unreliable — the vars often don't reach subprocesses. Always add them to ~/.bashrc and use source ~/.bashrc in the miner startup. The gateway will fail with "mining_address: Field required" if PEARLD_MINING_ADDRESS is not in the environment.
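To confirm the variable actually reached a running process, you can read that process's environment from /proc on Linux. A small sketch: PEARLD_MINING_ADDRESS is the variable named above, while the helper names are hypothetical and the PID is something you look up yourself (for example with pgrep -f "vllm serve"):

```python
def parse_environ(raw):
    """Parse the NUL-separated KEY=VALUE bytes found in /proc/<pid>/environ."""
    env = {}
    for entry in raw.split(b"\x00"):
        if b"=" in entry:
            key, _, value = entry.partition(b"=")
            env[key.decode(errors="replace")] = value.decode(errors="replace")
    return env


def process_env(pid):
    """Read the environment a process was started with (Linux only)."""
    with open(f"/proc/{pid}/environ", "rb") as f:
        return parse_environ(f.read())


def has_mining_address(env, name="PEARLD_MINING_ADDRESS"):
    """True if the variable is present and non-empty."""
    return bool(env.get(name))
```

For example, has_mining_address(process_env(12345)) returning False for the gateway's PID would match the "mining_address: Field required" failure.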
Using source .venv/bin/activate inside tmux send-keys frequently fails silently, leaving vllm not in PATH and producing "vllm: command not found". Always use /root/pearl/.venv/bin/vllm and /root/pearl/.venv/bin/pearl-gateway explicitly.
The gateway socket at /tmp/pearlgw.sock persists after the gateway dies. On restart, if the old socket file exists, the new gateway may fail or vLLM may connect to a dead socket. Always run rm -f /tmp/pearlgw.sock before restarting.
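A sketch of telling a dead socket file from a live one before a restart. It assumes the gateway listens on a Unix stream socket, which this guide does not confirm. If the socket is a different type, the connect call raises an OSError instead of deleting anything. The function names are hypothetical:

```python
import os
import socket


def socket_is_live(path):
    """True if something accepts connections on the Unix socket at `path`."""
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        s.settimeout(1.0)
        s.connect(path)
        return True
    except (FileNotFoundError, ConnectionRefusedError, socket.timeout):
        return False
    finally:
        s.close()


def remove_if_stale(path="/tmp/pearlgw.sock"):
    """Delete the socket file only when nothing is listening. True if removed."""
    if os.path.exists(path) and not socket_is_live(path):
        os.unlink(path)
        return True
    return False
```

Run this only while the gateway is supposed to be stopped. When in doubt, the plain rm -f /tmp/pearlgw.sock from the text above is the simpler option.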
The should_use_noisy_gemm() function requires m ≥ 1024 (default threshold in config.yaml). Short prompts produce small batch sizes (m < 1024) and mining is skipped entirely. Always use long prefill-heavy prompts (~150+ tokens input, max_tokens=1). Target m=5000-8000+. Power draw is the quickest sanity check: 690W = mining, 120W = not mining.
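The power-draw sanity check can be automated with nvidia-smi's query mode. In this sketch the 690W and 120W reference points come from this guide, while the 500W and 200W cut-offs between them are assumptions chosen for illustration. The function names are hypothetical:

```python
import subprocess


def parse_power(text):
    """Parse one power reading (watts) per line, skipping unreadable lines."""
    readings = []
    for line in text.splitlines():
        try:
            readings.append(float(line.strip()))
        except ValueError:
            continue  # e.g. blank lines or "[N/A]"
    return readings


def read_power_watts():
    """Query the current power draw of every GPU via nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=power.draw",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True).stdout
    return parse_power(out)


def mining_status(watts, mining_min_w=500.0, idle_max_w=200.0):
    """Classify a reading; both thresholds are assumed, not measured."""
    if watts >= mining_min_w:
        return "mining"
    if watts <= idle_max_w:
        return "not mining"
    return "partial"
```

For example, [mining_status(w) for w in read_power_watts()] should give "mining" for both GPUs on a healthy 2x H200 setup.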