Multi-Chain Scraper for EVM, TON, and SOL

— TL;DR

One scraper, three chains (EVM + TON + SOL), with a separate detection module per chain and a separate state file for each. The key to keeping it quiet is a silent baseline that diffs NEW only: on first run, record the current state without sending any alerts, then alert only on what appears after that. Plus one expensive lesson: an uncached username-resolve fallback can fire ~165,000 API calls for a single name.

Problem: Three Chains, Three Worlds

EVM, TON, and SOL aren't just different names. How you read transactions differs, address formats differ, and how you decide that something is "new" also differs. EVM uses block numbers. TON and SOL have their own cursors. Force a single logic onto all three and the result is fragile.

There is a single goal: monitor sources across three chains, and send notifications only when there's new activity. Easy to say, hard to get right, because of two traps: duplicate alerts (the same notification sent multiple times) and alert floods (on first run, all old data is treated as "new" and sent at once).

Architecture: Separate Modules, Separate State

The solution: one detection module per chain, each storing its own state. No chain logic leaks into another chain.

# Structure
detection/
  evm.py      # EVM logic, uses block number
  ton.py      # TON logic, uses its own cursor
  sol.py      # SOL logic, uses its own cursor

state_evm.json   # last_ids, cooldowns, paused
state_sol.json
state_ton.json

Each state file stores last_ids (marker of the last processed data), cooldowns (so the same source doesn't spam), and a paused list (sources or users that are muted). Because they're separate, one chain can be paused or reset without affecting the others.
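As a rough sketch of what one state file per chain could look like in Python: the helper names (`load_state`, `save_state`, `reset_chain`), the `base_dir` parameter, and the temp-file-then-rename write are assumptions made for illustration, not the scraper's actual code. Only the file names and the three keys come from the description above.

```python
import json
import os
import tempfile
from pathlib import Path

# Chains named in the structure above; anything else is rejected.
CHAINS = ("evm", "ton", "sol")


def empty_state() -> dict:
    """Shape assumed from the description: last_ids, cooldowns, paused."""
    return {"last_ids": {}, "cooldowns": {}, "paused": []}


def state_path(chain: str, base_dir: Path) -> Path:
    if chain not in CHAINS:
        raise ValueError(f"unknown chain: {chain}")
    return base_dir / f"state_{chain}.json"


def load_state(chain: str, base_dir: Path) -> dict:
    path = state_path(chain, base_dir)
    if not path.exists():
        return empty_state()
    with path.open(encoding="utf-8") as fh:
        loaded = json.load(fh)
    # Fill in any missing keys so callers can rely on the shape.
    return {**empty_state(), **loaded}


def save_state(chain: str, state: dict, base_dir: Path) -> None:
    path = state_path(chain, base_dir)
    # Write to a temp file, then rename, so a crash mid-write
    # cannot leave a half-written state file behind.
    fd, tmp_name = tempfile.mkstemp(dir=base_dir, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(state, fh, indent=2)
    os.replace(tmp_name, path)


def reset_chain(chain: str, base_dir: Path) -> None:
    """Reset one chain; the other chains' files are never touched."""
    save_state(chain, empty_state(), base_dir)
```

Because every function takes the chain name, resetting EVM is `reset_chain("evm", base_dir)` and the TON and SOL files stay exactly as they were.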

The bot itself is purely event-driven: there is no background polling loop. Everything runs from command and callback handlers. This is intentional, so no rogue threads burn API calls in the background.

Silent Baseline, Diff NEW Only

This is the pattern that keeps the scraper quiet. On first run (or when a new source is added), the system doesn't immediately send everything it sees. It records the current state as a baseline, silently. Only after that is anything newer than the baseline treated as "new" and alerted on.

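# Pseudo-logic for each check
# (fetch_latest and state are assumed to exist; ids must be ordered,
#  e.g. a block number or a chain cursor)
def check(source):
    items = fetch_latest(source)
    last = state["last_ids"].get(source)

    if last is None:
        # first baseline: record, DON'T alert
        if items:  # an empty source has nothing to record yet
            state["last_ids"][source] = max(i.id for i in items)
        return []

    # alert only what's newer than baseline
    new = [i for i in items if i.id > last]
    if new:
        # take the highest id, so the result doesn't depend on item order
        state["last_ids"][source] = max(i.id for i in new)
    return new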

The difference is huge. Without a baseline, adding one source with thousands of historical transactions would send thousands of notifications at once, immediately hitting Telegram's rate limit and spamming the user. With a baseline, the new source enters quietly, then only reports what's actually new.
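To make the behavior concrete, here is a small self-contained run of the baseline idea. The `Item` class, the `wallet_a` source name, and passing `items` and `state` in as arguments are assumptions for the demo; ids are assumed to grow over time, like a block number.

```python
from dataclasses import dataclass


@dataclass
class Item:
    id: int  # assumed: ids grow over time (e.g. a block number)


def check(source, items, state):
    """Return only items newer than the stored marker for this source."""
    last = state["last_ids"].get(source)
    newest = max((i.id for i in items), default=None)

    if last is None:
        # First sighting: record the baseline silently, alert nothing.
        if newest is not None:
            state["last_ids"][source] = newest
        return []

    new = [i for i in items if i.id > last]
    if new:
        state["last_ids"][source] = newest
    return new


state = {"last_ids": {}}
history = [Item(103), Item(102), Item(101)]

first = check("wallet_a", history, state)                 # baseline run
second = check("wallet_a", [Item(104)] + history, state)  # one new item
third = check("wallet_a", [Item(104)] + history, state)   # nothing new

print(first, second, third)  # [] [Item(id=104)] []
```

The three historical items never produce an alert, the one item that arrives afterwards is reported exactly once, and a repeat check with the same data stays silent.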

Expensive Bug: 165K API Calls for One Name

This is the part that taught me a hard lesson. To display readable source names (not just raw IDs), there's a username-resolve function. The normal path is one API call: request the entity, get the name. The problem is in the fallback that runs when the normal path fails.

# Dangerous pattern (simplified)
async def resolve_username(uid):
    try:
        return await client.get_entity(uid)   # 1 call
    except Exception:
        # FALLBACK: scan all participants + all messages
        for scraper in scrapers.values():
            for src in scraper.sources:
                async for p in iter_participants(src, limit=5000):
                    ...
                async for m in iter_messages(src, limit=500):
                    ...

Rough calculation: 3 scrapers, each with around 10 sources, gives 30 sources. Each source is scanned for up to 5,000 participants plus 500 messages, so ~5,500 items per source. Counting each scanned item as one call, 30 × 5,500 = roughly 165,000 API calls in the worst case. Just to resolve one username that failed the normal path.

Worse: the failure wasn't cached. So every time a user opened details for a source whose username couldn't be resolved, the entire 165K scan ran again from zero. One button click = API call storm.
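The arithmetic, written out so the multiplier from repeated clicks is visible. The scraper and source counts are the rough figures from the text; `clicks` is a made-up number for illustration.

```python
scrapers = 3
sources_per_scraper = 10      # "around 10" in the text
participants_limit = 5_000    # limit passed to the participants scan
messages_limit = 500          # limit passed to the messages scan

sources = scrapers * sources_per_scraper
items_per_source = participants_limit + messages_limit
worst_case = sources * items_per_source
print(sources, items_per_source, worst_case)  # 30 5500 165000

# With no cache for failures, each repeat click pays the full cost again.
clicks = 5  # hypothetical
print(clicks * worst_case)  # 825000
```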

— FIX

Two things: (1) cache resolve results, including failures, so they're not repeated; (2) remove the brute-force fallback and display the raw ID when the resolve fails. A pretty name isn't worth 165K API calls. The general lesson: any fallback that loops over "all sources × all items" is a time bomb. Always set limits and cache the results.