Segmentation rule to deal with quote marks
Thread poster: Thijs Vissia
I'm translating from English to Dutch, and I notice that when a sentence ends on a quote mark (either curly quotes or straight), with the period sitting between quotes right before the end quote (i.e. "This is a sentence." or “This is a sentence.”), the segmentation script considers it a single segment, even though the quote ends. (Instead of a situation where multiple sentences are quoted, where I might want the quote to remain in a single segment.)
I was wondering if anyone could help me to figure out the break or exception rule for this, so that a segment separates after the quote, instead of after a full stop.
I've tried it with \.\" (in the Before field) and \s (in the After field), as well as the same with curly quotes, but both don't seem to have the desired effect.
If there's anywhere that offers more guidance about how to construct the rules, I'd also be interested in that, the manual is a bit sparse I thought.
I'm also not sure how to distinguish between a break rule and an exception (no break) rule, there's a checkmark in the Segmentation dialog but I'm not sure what the checkmark means - if it doesn't mean "select this rule for further operations". How does the dialog allow me to set a break or a no break rule?
Thanks for any help!

Samuel Murray (Netherlands; member since 2006; English to Afrikaans)

esperantisto (member since 2006; English to Russian)
Thijs Vissia wrote:
I've tried it with \.\" (in the Before field) and \s (in the After field), as well as the same with curly quotes, but both don't seem to have the desired effect.
For starters, ", “ and ” are three different symbols, thus \.\" won’t work for ”.
In order to understand why your attempts failed, share:
- a short sample file/sample project;
- your segmentation rules (i.e. your segmentation.conf).
Thijs Vissia wrote:
there's a checkmark in the Segmentation dialog but I'm not sure what the checkmark means - if it doesn't mean "select this rule for further operations". How does the dialog allow me to set a break or a no break rule?
Every rule set in the dialog is applied. The checkmark when ticked means that the rule makes a break. If not ticked, the rule is a joiner.
[Edited at 2020-01-08 14:37 GMT]

Break rules and exceptions | Jan 8, 2020
Thijs Vissia wrote:
I've tried it with \.\" (in the Before field) and \s (in the After field), as well as the same with curly quotes, but both don't seem to have the desired effect.
That is correct, except that, as others said, \" does not cover the character ”.
Try \.[\"”] instead.
Also make sure that this rule appears before the rule with before = \. and after = \s. Rule order is important because the segmenter applies the rules in the order they appear, and once a rule affects a location in your text, no later rule can affect that location.
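To see why the character class matters, here is a minimal Java sketch (the rules are Java regular expressions). It only approximates how a before/after pair picks a break position and is not OmegaT's actual segmenter; the class name QuoteBreakDemo and the method segment are made up for illustration, and trimming the whitespace between segments is a simplification.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: QuoteBreakDemo and segment() are hypothetical names,
// and this is a simplification, not OmegaT's real segmentation code.
class QuoteBreakDemo {

    // Breaks the text at every position where `before` matches the text
    // ending there and `after` matches the text starting there.
    static List<String> segment(String text, String before, String after) {
        Matcher b = Pattern.compile(before).matcher(text);
        Pattern a = Pattern.compile(after);
        List<String> segments = new ArrayList<>();
        int start = 0;
        while (b.find()) {
            int pos = b.end();
            // lookingAt() anchors the "after" pattern at the candidate position.
            if (a.matcher(text).region(pos, text.length()).lookingAt()) {
                segments.add(text.substring(start, pos).trim());
                start = pos;
            }
        }
        String tail = text.substring(start).trim();
        if (!tail.isEmpty()) {
            segments.add(tail);
        }
        return segments;
    }

    public static void main(String[] args) {
        String after = "\\s";                   // After field: \s
        String straightOnly = "\\.\\\"";        // Before field: \.\"
        String bothQuotes = "\\.[\\\"\u201D]";  // Before field: \.[\" plus U+201D]

        String straight = "He said \"This is a sentence.\" Then he left.";
        String curly = "He said \u201CThis is a sentence.\u201D Then he left.";

        System.out.println(segment(straight, straightOnly, after).size()); // 2
        System.out.println(segment(curly, straightOnly, after).size());    // 1
        System.out.println(segment(curly, bothQuotes, after).size());      // 2
    }
}
```

The rule with only the straight quote leaves the curly-quoted text as one segment, while the version with both characters in square brackets breaks after either closing quote.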
Another option I would test: before = \. and after = [\"”]. In that case, however, it is an exception, not a break rule; see below for why.
Thijs Vissia wrote:
I'm also not sure how to distinguish between a break rule and an exception (no break) rule,
An exception means that we do not want to cut, even if the rules which follow the given one say that we should.
Example:
1. Normally we want to cut after a dot. This is a break rule, which usually sits at the end of the rule set.
2. However, after "Mr." you should not cut, because this is an abbreviation.
(Example: Mr. Smith said that... ==> if we only apply rule 1, the segmenter will cut after the dot; the exception prevents that, but only if the exception is declared before the break rule.)
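The effect of rule order can be sketched in Java as follows. This is a rough model under the assumption described above (the first rule that applies at a location decides it), not OmegaT's real implementation; the names OrderedRulesDemo, Rule and segment are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative model only; all names here are made up for this example.
class OrderedRulesDemo {

    static final class Rule {
        final boolean isBreak;   // true = break rule, false = exception
        final Pattern before;
        final Pattern after;

        Rule(boolean isBreak, String before, String after) {
            this.isBreak = isBreak;
            this.before = Pattern.compile(before);
            this.after = Pattern.compile(after);
        }
    }

    static List<String> segment(String text, List<Rule> rules) {
        // Position -> decision. The first rule to claim a position wins.
        Map<Integer, Boolean> decided = new TreeMap<>();
        for (Rule rule : rules) {
            Matcher b = rule.before.matcher(text);
            while (b.find()) {
                int pos = b.end();
                if (rule.after.matcher(text).region(pos, text.length()).lookingAt()) {
                    decided.putIfAbsent(pos, rule.isBreak);
                }
            }
        }
        List<String> segments = new ArrayList<>();
        int start = 0;
        for (Map.Entry<Integer, Boolean> e : decided.entrySet()) {
            if (e.getValue()) {  // cut only where a break rule decided
                segments.add(text.substring(start, e.getKey()).trim());
                start = e.getKey();
            }
        }
        String tail = text.substring(start).trim();
        if (!tail.isEmpty()) {
            segments.add(tail);
        }
        return segments;
    }

    public static void main(String[] args) {
        Rule mrException = new Rule(false, "Mr\\.", "\\s");
        Rule dotBreak = new Rule(true, "\\.", "\\s");
        String text = "Mr. Smith said hello. Then he left.";

        // Exception declared first: no cut after "Mr."
        System.out.println(segment(text, Arrays.asList(mrException, dotBreak)));
        // prints [Mr. Smith said hello., Then he left.]

        // Break rule declared first: the exception comes too late.
        System.out.println(segment(text, Arrays.asList(dotBreak, mrException)));
        // prints [Mr., Smith said hello., Then he left.]
    }
}
```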
esperantisto wrote:
Every rule set in the dialog is applied. The checkmark when ticked means that the rule makes a break. If not ticked, the rule is a joiner.
Really? The opposite of a break rule is an exception (a location where you do not want to break). I don't see what a joiner is, since segmentation rules are only used to cut segments (when you want to join them, the rule always seems to be to put spaces between the joined segments).
Thanks for your responses.
I later gathered that the checkmark in front of each rule would mean an exception, or a "no-break rule". But I'll look into that further and look at the manual again, I'm not as focussed right now.
tcordonniery,
Thanks for your rewritten rule, that does appear to do the job. In my current (in fact, the default) English language segmentation rules, there doesn't seem to be a rule that this one should appear before, as you write ("the rule with before = \. and after = \s" isn't to be found, though this seems a little odd.)
In any case, the segment breaks now appear where I want them, hoping it doesn't cause any adverse results elsewhere.
Many thanks for the help, all!
Oh, another related question I just thought of:
I supposed that these segmentation rules are set for OmegaT as a whole, and not per project. How do I prevent such changes from messing up other translation projects, since these would presumably be re-segmented upon opening them? Does that mean I lose the target segments (when I open the project, or when I save it)? Or not?

esperantisto (member since 2006; English to Russian)
Project-specific rules | Jan 9, 2020
Thijs Vissia wrote:
("the rule with before = \. and after = \s" isn't to be found, though this seems a little odd.)
It is a rule for very many languages. Thus, it is not language-specific and can be found under the Default rule set.
Thijs Vissia wrote:
How do I prevent such changes from messing up other translation projects, since these would presumably be re-segmented upon opening them?
Two options basically:
1. (Simple) Create a project-specific rule set: Project properties → Segmentation → Make segmentation rules project-specific.
2. (Not so simple) Create a separate user profile and start OmegaT with it when you want to work on specific projects; see "7. Starting OmegaT from the command line".
esperantisto wrote:
Thijs Vissia wrote:
("the rule with before = \. and after = \s" isn't to be found, though this seems a little odd.)
It is a rule for very many languages. Thus, it is not language-specific and can be found under the Default rule set.
Thijs Vissia wrote:
How do I prevent such changes from messing up other translation projects, since these would presumably be re-segmented upon opening them?
Two options basically:
1. (Simple) Create a project-specific rule set: Project properties → Segmentation → Make segmentation rules project-specific.
2. (Not so simple) Create a separate user profile and start OmegaT with it when you want to work on specific projects; see "7. Starting OmegaT from the command line".
Great, thank you very much.
(The first part of your answer I figured out after posting, it's there as before:"\.\?\!" and after:"\s" in the Default rules, if I'm not mistaken.)

Using Java and Unicode in segmentation rules | Feb 1, 2020
Hi. I don't know if this question is still relevant, nevertheless here are my findings on segmentation rules. The rules use Java regular expressions. Here is a tutorial:
https://docs.oracle.com/javase/tutorial/essential/regex/index.html
A great feature of Java regular expressions is their support for Unicode character categories. Here is a list of them:
https://en.wikipedia.org/wiki/Unicode_character_property#General_Category
Using that information I constructed the following segmentation rule to deal with quotation marks, brackets and tags at the beginning and at the end of English sentences.
Add this as a checked line (break) at the bottom of the "Default" rules section:
before: [\.\?\!]+[\p{Pf}\p{Pe}(<[\w/]+>)]*
after: \s+[\p{Ps}\p{Pi}(<[\w/]+>)]*\p{Lu}
Some explanations:
before:
- [\.\?\!]+ means that any of these punctuation marks (or a combination of them) occurs one or more times
- \p{Pf} is a closing quotation mark
- \p{Pe} is a closing bracket
- <[\w/]+> is meant to cover a tag, i.e. some combination of slash (/), letters and digits (\w) in angle brackets
- [\p{Pf}\p{Pe}(<[\w/]+>)]* means that any combination of closing quotation marks, closing brackets and tag characters occurs zero or more times (inside the square brackets the parentheses are literal characters, so a tag is matched character by character, not as a unit)
and so on.
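For anyone who wants to try the rule outside OmegaT, here is a hedged Java sketch. It checks which quote characters fall into which Unicode category (the straight quote " is in category Po, so \p{Pf} does not match it) and runs the rule through a simplified before/after matcher. The matcher only approximates the segmenter, the class and method names are invented for this example, and the "grouped" variant is my own assumption about one way to match tags as whole units; it is not part of the rule posted above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: names are hypothetical, and segment() is a simplified
// stand-in for the real segmenter.
class UnicodeRuleDemo {
    // The rule as posted, written as Java string literals.
    static final String BEFORE = "[\\.\\?\\!]+[\\p{Pf}\\p{Pe}(<[\\w/]+>)]*";
    static final String AFTER = "\\s+[\\p{Ps}\\p{Pi}(<[\\w/]+>)]*\\p{Lu}";

    // Assumed variant (not from the thread): a group with alternation, so
    // that a tag is only accepted as a complete <...> unit.
    static final String BEFORE_GROUPED = "[\\.\\?\\!]+(?:[\\p{Pf}\\p{Pe}]|<[\\w/]+>)*";
    static final String AFTER_GROUPED = "\\s+(?:[\\p{Ps}\\p{Pi}]|<[\\w/]+>)*\\p{Lu}";

    static List<String> segment(String text, String before, String after) {
        Matcher b = Pattern.compile(before).matcher(text);
        Pattern a = Pattern.compile(after);
        List<String> segments = new ArrayList<>();
        int start = 0;
        while (b.find()) {
            int pos = b.end();
            if (a.matcher(text).region(pos, text.length()).lookingAt()) {
                segments.add(text.substring(start, pos).trim());
                start = pos;
            }
        }
        String tail = text.substring(start).trim();
        if (!tail.isEmpty()) {
            segments.add(tail);
        }
        return segments;
    }

    public static void main(String[] args) {
        System.out.println("\u201D".matches("\\p{Pf}")); // true: curly closing quote
        System.out.println("\u201C".matches("\\p{Pi}")); // true: curly opening quote
        System.out.println("\"".matches("\\p{Pf}"));     // false: straight quote is Po

        String quoted = "\u201CIs it late?\u201D (She checked.) Then she left.";
        System.out.println(segment(quoted, BEFORE, AFTER).size()); // 3

        // Because \w sits inside the character class, the posted rule also
        // swallows the digit after the dot and breaks after "2.0".
        String version = "Version 2.0 Released today.";
        System.out.println(segment(version, BEFORE, AFTER).size());                 // 2
        System.out.println(segment(version, BEFORE_GROUPED, AFTER_GROUPED).size()); // 1
    }
}
```

If straight quotes also need to be handled, they would have to be added to the rule explicitly, since the Unicode categories Pi and Pf only cover the curly and angle quote characters.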
[Edited at 2020-02-01 09:22 GMT]
hi Andrey,
Pardon my late reply, but thank you for pointing this out, very good to know. And thanks also for taking the trouble of constructing an expression to deal with this, I will give it a try.
best,
Thijs