Segmentation rule to deal with quote marks (OmegaT support)


Thread poster: Thijs Vissia

Thijs Vissia
Netherlands

Jan 8, 2020

I'm translating from English to Dutch, and I notice that when a sentence ends with a quote mark (curly or straight), with the period sitting right before the closing quote (i.e. "This is a sentence." or “This is a sentence.”), the segmentation does not break after the closing quote and keeps the text that follows in the same segment. (This is different from a quote that contains multiple sentences, which I might want to keep in a single segment.)

I was wondering if anyone could help me to figure out the break or exception rule for this, so that a segment separates after the quote, instead of after a full stop.

I've tried it with \.\" (in the Before field) and \s (in the After field), as well as the same with curly quotes, but both don't seem to have the desired effect.

If there's anywhere that offers more guidance about how to construct the rules, I'd also be interested in that, the manual is a bit sparse I thought.

I'm also not sure how to distinguish between a break rule and an exception (no break) rule, there's a checkmark in the Segmentation dialog but I'm not sure what the checkmark means - if it doesn't mean "select this rule for further operations". How does the dialog allow me to set a break or a no break rule?

Thanks for any help!

Samuel Murray

Netherlands
Member (2006)
English to Afrikaans

@Thijs

Jan 8, 2020

I suggest you re-ask your question here:
https://sourceforge.net/projects/omegat/lists/omegat-users
...since segmentation rules are perhaps more geeky than most other issues.

esperantisto

Member (2006)
English to Russian

SITE LOCALIZER

Show it!

Jan 8, 2020

Thijs Vissia wrote:

I've tried it with \.\" (in the Before field) and \s (in the After field), as well as the same with curly quotes, but both don't seem to have the desired effect.

For starters, ", “ and ” are three different symbols, thus \.\" won’t work for ”.
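
To see the difference concretely, here is a minimal Java sketch (segmentation rules are Java regular expressions). The class name and the sample sentences are made up for illustration; only the two patterns come from this thread.

```java
import java.util.regex.Pattern;

class QuoteMatchDemo {
    // Pattern from the question: a full stop followed by a straight double quote.
    static final Pattern STRAIGHT_ONLY = Pattern.compile("\\.\\\"");
    // Character class listing both the straight quote and the curly closing quote (U+201D).
    static final Pattern STRAIGHT_OR_CURLY = Pattern.compile("\\.[\"\u201D]");

    // Returns true if the pattern occurs anywhere in the text.
    static boolean found(Pattern p, String text) {
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        String straight = "\"This is a sentence.\" Next one.";
        String curly = "\u201CThis is a sentence.\u201D Next one.";

        System.out.println(found(STRAIGHT_ONLY, straight));     // true
        System.out.println(found(STRAIGHT_ONLY, curly));        // false
        System.out.println(found(STRAIGHT_OR_CURLY, straight)); // true
        System.out.println(found(STRAIGHT_OR_CURLY, curly));    // true
    }
}
```

The straight-quote pattern never sees the curly closing quote, which is why a rule built on it leaves curly-quoted sentences untouched.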

In order to understand why your attempts failed, share:

a short sample file/sample project;

your segmentation rules (i.e. your segmentation.conf).

Thijs Vissia wrote:
there's a checkmark in the Segmentation dialog but I'm not sure what the checkmark means - if it doesn't mean "select this rule for further operations". How does the dialog allow me to set a break or a no break rule?

Every rule set in the dialog is applied. The checkmark when ticked means that the rule makes a break. If not ticked, the rule is a joiner.

[Edited at 2020-01-08 14:37 GMT]

tcordonniery
France

Break rules and exceptions

Jan 8, 2020

Thijs Vissia wrote:
I've tried it with \.\" (in the Before field) and \s (in the After field), as well as the same with curly quotes, but both don't seem to have the desired effect.

That is correct, except that, as others said, \" does not cover the character ”.
Try \.[\"”] instead.
Also make sure that this rule appears before the rule with before = \. and after = \s. Rule order is important, because the segmenter applies the rules in the order they appear, and once a rule has affected a location in your text, no later rule can affect that location.
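
As a rough illustration of what such a break rule does, here is a small Java sketch. It assumes a simplified model in which a break is placed wherever the "before" pattern ends and the "after" pattern starts; the class, the method and the sample text are hypothetical, and this is not OmegaT's actual code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BreakRuleDemo {
    // Splits text at every position where `before` has just matched
    // and `after` matches immediately afterwards (simplified model).
    static List<String> split(String text, String before, String after) {
        Pattern rule = Pattern.compile("(?:" + before + ")(?=" + after + ")");
        Matcher m = rule.matcher(text);
        List<String> segments = new ArrayList<>();
        int start = 0;
        while (m.find()) {
            segments.add(text.substring(start, m.end()).trim());
            start = m.end();
        }
        if (start < text.length()) {
            segments.add(text.substring(start).trim());
        }
        return segments;
    }

    public static void main(String[] args) {
        String text = "\u201CThis is a sentence.\u201D He left. Bye.";

        // Plain rule (before = \. , after = \s): the closing quote sits
        // between the dot and the space, so no break follows the quote.
        System.out.println(split(text, "\\.", "\\s"));

        // Quote-aware rule (before = \.["”] , after = \s): breaks after the quote.
        System.out.println(split(text, "\\.[\"\u201D]", "\\s"));
    }
}
```

In this model the plain rule yields the segments “This is a sentence.” He left. / Bye., while the quote-aware rule yields “This is a sentence.” / He left. Bye., so in practice both rules are needed side by side.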

Another option I would test: before = \. and after = [\"”], but that one is an exception, not a break rule. See below for why.

Thijs Vissia wrote:
I'm also not sure how to distinguish between a break rule and an exception (no break) rule,

An exception means that we do not want to cut, even if the rules which follow the given one say that we should.
Example:
1. Normally we want to cut after a dot. This is a break rule, which is usually at the end of the rule set.
2. However, after "Mr." you should not cut, because this is an abbreviation.
(Example: Mr. Smith said that... ==> if we only apply rule 1, the segmenter will cut after the dot; the exception prevents that, but only if the exception is declared before the break rule.)
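
The effect of rule order can be sketched in Java. This is a toy model under the assumption described above (the first rule that matches at a location decides it); the Rule class, its field names and the sample sentence are hypothetical and do not reflect OmegaT's internals.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RuleOrderDemo {
    // Hypothetical rule holder: isBreak = true is a break rule, false is an exception.
    static final class Rule {
        final boolean isBreak;
        final String before;
        final String after;

        Rule(boolean isBreak, String before, String after) {
            this.isBreak = isBreak;
            this.before = before;
            this.after = after;
        }
    }

    static List<String> segment(String text, List<Rule> rules) {
        // Position -> decision; putIfAbsent keeps the decision of the earliest rule.
        Map<Integer, Boolean> decisions = new TreeMap<>();
        for (Rule rule : rules) {
            Matcher m = Pattern
                    .compile("(?:" + rule.before + ")(?=" + rule.after + ")")
                    .matcher(text);
            while (m.find()) {
                decisions.putIfAbsent(m.end(), rule.isBreak);
            }
        }
        List<String> segments = new ArrayList<>();
        int start = 0;
        for (Map.Entry<Integer, Boolean> e : decisions.entrySet()) {
            if (e.getValue()) { // only break rules cut the text
                segments.add(text.substring(start, e.getKey()).trim());
                start = e.getKey();
            }
        }
        if (start < text.length()) {
            segments.add(text.substring(start).trim());
        }
        return segments;
    }

    public static void main(String[] args) {
        Rule exception = new Rule(false, "Mr\\.", "\\s");
        Rule breakAfterDot = new Rule(true, "\\.", "\\s");
        String text = "Mr. Smith said hello. He left.";

        // Exception first: [Mr. Smith said hello., He left.]
        System.out.println(segment(text, List.of(exception, breakAfterDot)));
        // Break rule first: [Mr., Smith said hello., He left.]
        System.out.println(segment(text, List.of(breakAfterDot, exception)));
    }
}
```

With the exception listed first, the location after "Mr." is already decided when the general break rule is tried; with the order reversed, the exception comes too late.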

esperantisto wrote:
Every rule set in the dialog is applied. The checkmark when ticked means that the rule makes a break. If not ticked, the rule is a joiner.

Really? The opposite of a break rule is an exception (a location where you do not want to break). I don't see what a joiner is, since segmentation rules are only used to cut segments (and when segments are joined, the joining always seems to use spaces between them).

Thijs Vissia
Netherlands

TOPIC STARTER

Thank you

Jan 8, 2020

Thanks for your responses.

I later gathered that the checkmark in front of each rule would mean an exception, or a "no-break rule". But I'll look into that further and look at the manual again, I'm not as focussed right now.

tcordonniery,
Thanks for your rewritten rule, that does appear to do the job. In my current (in fact, the default) English language segmentation rules, there doesn't seem to be a rule that this one should appear before, as you write ("the rule with before = \. and after = \s" isn't to be found, though this seems a little odd.)

In any case, the segment breaks now appear where I want them, hoping it doesn't cause any adverse results elsewhere.

Many thanks for the help, all!

Oh, another related question I just thought of:

I suppose that these segmentation rules are set for OmegaT as a whole, and not per project. How do I prevent such changes from messing up other translation projects, since these would presumably be re-segmented upon opening them? Does that mean I lose the target segments (when I open the project, or when I save it)? Or not?

esperantisto

Member (2006)
English to Russian

SITE LOCALIZER

Project-specific rules

Jan 9, 2020

Thijs Vissia wrote:

("the rule with before = \. and after = \s" isn't to be found, though this seems a little odd.)

It is a rule for very many languages. Thus, it is not language-specific and can be found under the Default rule set.

Thijs Vissia wrote:

How do I prevent such changes from messing up other translation projects, since these would presumably be re-segmented upon opening them?

Two options basically:

1. (Simple) Create a project-specific rule set: Project properties → Segmentation → Make segmentation rules project-specific.
2. (Not so simple) Create a separate user profile and start OmegaT with it when you want to work on specific projects; see "7. Starting OmegaT from the command line".

Thijs Vissia
Netherlands

TOPIC STARTER

Cheers

Jan 9, 2020

esperantisto wrote:

Thijs Vissia wrote:

("the rule with before = \. and after = \s" isn't to be found, though this seems a little odd.)

It is a rule for very many languages. Thus, it is not language-specific and can be found under the Default rule set.

Thijs Vissia wrote:

How do I prevent such changes from messing up other translation projects, since these would presumably be re-segmented upon opening them?

Great, thank you very much.

(The first part of your answer I figured out after posting, it's there as before:"\.\?\!" and after:"\s" in the Default rules, if I'm not mistaken.)

Andrey Raugas
Georgia
English to Russian

Using Java and Unicode in segmentation rules

Feb 1, 2020

Hi. I don't know if this question is still relevant, nevertheless here are my findings on segmentation rules. The rules use Java regular expressions. Here is a tutorial:
https://docs.oracle.com/javase/tutorial/essential/regex/index.html

A great feature of Java regular expressions is their support for Unicode character categories. Here is a list of them:
https://en.wikipedia.org/wiki/Unicode_character_property#General_Category
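
As a quick check of which category each quote character falls into, here is a small Java sketch using the standard Character.getType method. The class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

class QuoteCategoryDemo {
    // Maps a code point to the short Unicode category name for the
    // punctuation categories that matter for quotes and brackets.
    static String category(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.INITIAL_QUOTE_PUNCTUATION: return "Pi";
            case Character.FINAL_QUOTE_PUNCTUATION:   return "Pf";
            case Character.START_PUNCTUATION:         return "Ps";
            case Character.END_PUNCTUATION:           return "Pe";
            case Character.OTHER_PUNCTUATION:         return "Po";
            default:                                  return "other";
        }
    }

    public static void main(String[] args) {
        System.out.println(category('"'));    // Po: straight double quote
        System.out.println(category(0x201C)); // Pi: left curly quote
        System.out.println(category(0x201D)); // Pf: right curly quote
        System.out.println(category('('));    // Ps
        System.out.println(category(')'));    // Pe

        // The same categories are available in patterns as \p{...}.
        System.out.println(Pattern.matches("\\p{Pf}", "\u201D")); // true
        System.out.println(Pattern.matches("\\p{Pf}", "\""));     // false
    }
}
```

Note that the straight double quote is "other punctuation" (Po), so neither \p{Pi} nor \p{Pf} matches it; a rule that should also cover straight quotes has to list that character explicitly.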

Using that information, I constructed the following segmentation rule to deal with quotation marks, brackets and tags at the beginning and at the end of English sentences.

Add this as a checked (break) line at the bottom of the "Default" rules section:
before: [\.\?\!]+(?:[\p{Pf}\p{Pe}]|<[\w/]+>)*
after: \s+(?:[\p{Ps}\p{Pi}]|<[\w/]+>)*\p{Lu}

Some explanations:
before:

[\.\?\!]+ means that any of these punctuation marks (or a combination of them) occurs one or more times
\p{Pf} is a closing (final) quotation mark
\p{Pe} is a closing bracket
<[\w/]+> is a tag, i.e. some combination of slashes (/) and word characters (\w: letters, digits and underscores) in angle brackets
(?:[\p{Pf}\p{Pe}]|<[\w/]+>)* means that any combination of closing quotation marks, closing brackets and tags occurs zero or more times; the tag alternative sits in a group outside the square brackets, because inside a character class parentheses and + are literal characters

and so on.
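
To try the idea outside OmegaT, here is a minimal Java sketch. It assumes a simplified model in which a break is placed wherever the "before" pattern ends and the "after" pattern starts, and it writes the tag alternative as a non-capturing group outside the character class, since inside [...] parentheses are literal characters. The class name, the split helper and the sample text are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeRuleDemo {
    // Sentence-final punctuation, then any closing quotes, brackets or tags.
    static final String BEFORE = "[\\.\\?\\!]+(?:[\\p{Pf}\\p{Pe}]|<[\\w/]+>)*";
    // Whitespace, then any opening brackets, quotes or tags, then an uppercase letter.
    static final String AFTER = "\\s+(?:[\\p{Ps}\\p{Pi}]|<[\\w/]+>)*\\p{Lu}";

    // Breaks the text where BEFORE has just matched and AFTER follows (simplified model).
    static List<String> split(String text) {
        Matcher m = Pattern.compile("(?:" + BEFORE + ")(?=" + AFTER + ")").matcher(text);
        List<String> segments = new ArrayList<>();
        int start = 0;
        while (m.find()) {
            segments.add(text.substring(start, m.end()).trim());
            start = m.end();
        }
        if (start < text.length()) {
            segments.add(text.substring(start).trim());
        }
        return segments;
    }

    public static void main(String[] args) {
        String text = "He asked: \u201CWhy?\u201D (She shrugged.) "
                + "<b>Nobody knew.</b> \u201CFine.\u201D";
        for (String segment : split(text)) {
            System.out.println(segment);
        }
    }
}
```

In this model the sample text comes out as four segments, one per quoted, bracketed or tagged sentence. Sentences wrapped in straight double quotes are not split, because that character belongs to neither \p{Pi} nor \p{Pf}.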

[Edited at 2020-02-01 09:22 GMT]

Thijs Vissia
Netherlands

TOPIC STARTER

many thanks

Feb 5, 2020

Andrey Raugas wrote:

Hi. I don't know if this question is still relevant, nevertheless here are my findings on segmentation rules. The rules use Java regular expressions. Here is a tutorial:
https://docs.oracle.com/javase/tutorial/essential/regex/index.html

A great feature of Java regular expressions is their support for Unicode character categories. Here is a list of them:
https://en.wikipedia.org/wiki/Unicode_character_property#General_Category

hi Andrey,

Pardon my late reply, but thank you for pointing this out, very good to know. And thanks also for taking the trouble of constructing an expression to deal with this, I will give it a try.

best,
Thijs
