Research Protocol

Affiliate.Careers Methodology

Open and reproducible protocol: how data is ingested, cleaned, scored, and published in the catalog and analytics.

Protocol version: v1.1 Updated: 2026-03-06 08:56:39 UTC

1. Research goal

Collect affiliate / iGaming / crypto jobs market data from Telegram sources, remove noise, merge duplicates, score card quality, and publish open datasets.

Important: this is a Telegram-channels market slice, not a full labor-market model.

2. Sources and data layers

3. Processing pipeline

  1. Ingest: import Telegram messages into RAW layer.
  2. Classify: assign message type (vacancy/promo/resume/other).
  3. Split: split multi-vacancy messages into individual items.
  4. Extract: extract fields (title, company, salary, contacts, tags).
  5. Dedup: merge duplicates into canonical job entity.
  6. AI moderation: auto-review ambiguous cases and apply high-confidence decisions.
  7. Publish: ship to catalog/feeds and refresh analytics rollups.

Scheduled update cycle: hourly.

4. Filtering rules

5. Quality Score formula (0..100)

Final quality score is calculated as a weighted component sum:

Q = T + C + S + Sn + L + W + A + D + R + V

Where:
T  (title quality)                = 0..7
C  (company present)              = 0..18
S  (salary raw present)           = 0..18
Sn (salary numeric parsed)        = 0..8
L  (location present)             = 0..10
W  (work_format != unknown)       = 0..10
A  (apply contacts present)       = 0..14
D  (description richness)         = 0..8
R  (multi-source confidence)      = 0..3
V  (direct/verified bonus)        = 0..7

Current formula version: qscore-v1.0. Scores and flags are stored in jobs.quality_score and jobs.quality_flags.

6. Public indices and formulas

Salary Transparency Index (%)  = jobs_seen_with_salary_raw / jobs_seen_count * 100
Salary Numeric Share (%)       = jobs_seen_with_salary_numeric / jobs_seen_count * 100
Employer Transparency Index (%)= jobs_seen_with_company / jobs_seen_count * 100
Contacts Coverage (%)          = jobs_seen_with_contacts / jobs_seen_count * 100
Remote Share (%)               = jobs_seen_remote / jobs_seen_count * 100
Channel Noise Rate (%)         = repost_jobs_total / unique_jobs_total * 100

Indices are published on /market-pulse/ and /channels/.

7. Reproducibility: how to verify calculations

Below is the minimal checklist to independently verify outputs.

# 1) Миграции / Migrations
bash scripts/db/migrate.sh

# 2) Полный hourly-проход / Full hourly pass
bash scripts/pipeline/run_hourly_pipeline.sh

# 3) Пересчет quality + rollup за 90 дней / Rebuild quality + rollups
php scripts/analytics/build_rollups.php --days=90

For historical backfills use: --from=YYYY-MM-DD, --quality_only=1, --rollup_only=1.

SQL validation (raw data vs rollups)
-- Raw -> published sanity
SELECT count(*) AS raw_messages FROM tg_messages;
SELECT count(*) AS active_jobs FROM jobs WHERE status='active';

-- Index quality coverage
SELECT
  count(*) FILTER (WHERE salary_raw IS NOT NULL AND btrim(salary_raw) <> '') AS jobs_with_salary_raw,
  count(*) FILTER (WHERE company_raw IS NOT NULL AND btrim(company_raw) <> '') AS jobs_with_company,
  count(*) FILTER (WHERE work_format <> 'unknown') AS jobs_with_format
FROM jobs WHERE status='active';

-- Check latest day in analytics rollups
SELECT day, jobs_seen_count, jobs_seen_with_salary_raw, jobs_seen_with_company
FROM analytics_daily_jobs
ORDER BY day DESC
LIMIT 7;
Validation via public API/feed

8. Limitations and risks

9. Publication integrity checks

10. Public aggregated data and monitoring

Alerting sends Telegram notifications when hourly pipeline is stale or unknown format share spikes unknown_work_format.

11. Skill tags extraction

Skills are extracted deterministically (regex v2) using RU/EN aliases dictionary from config/skills_aliases.php. This reduces noise and keeps trends reproducible without a black-box model.